scaling law seminar
fall 2021 classes, transformers
Seminar on scaling laws
Scaling Laws for Acoustic Models
- papers:
- how do acoustic models scale with model size? (see the power-law fit sketch after this list)
- maybe ask Haoran about this?
- https://assets.amazon.science/ef/a6/60f65ed543c9af2a519caba269bd/scaling-laws-for-acoustic-models.pdf
- what proposals are there? what does the current field look like for scaling laws?
- most models use ASR + LMs now; how does the combined system scale overall?
- does wav2vec scale? what are the WER profiles as the speech datasets get longer?
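A minimal sketch (not from the acoustic-models paper) of the kind of fit these questions point at: fit a saturating power law WER(N) = a * N^(-b) + c to (parameter count, WER) points. The data points, initial guesses, and extrapolation target below are made up for illustration.

```python
# Minimal sketch: fit a saturating power law WER(N) = a * N**(-b) + c
# to hypothetical (parameter count, WER) points. All numbers are made
# up for illustration; c plays the role of an irreducible error floor.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])    # model sizes (hypothetical)
wer = np.array([18.0, 14.5, 11.8, 10.1, 9.0])   # WER in %, hypothetical

(a, b, c), _ = curve_fit(power_law, params, wer,
                         p0=[100.0, 0.3, 5.0], bounds=(0, np.inf))
print(f"fit: WER(N) = {a:.1f} * N^(-{b:.2f}) + {c:.1f}")
print(f"extrapolated WER at 10B params: {power_law(1e10, a, b, c):.1f}%")
```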
Scaling laws for solution compressibility - aka Information Bottleneck
- are solutions in bits smaller as the model size increases?
- how would you measure the solution bit size?
- intuitively the solution size should actually just increase as the transformer gets bigger
- read
- information probing on description length of solutions
- do softmaxes become hardmaxes?
- people
- three approaches
- circuit complexity - establishing upper bounds on what the trained network can compute
- renormalization groups - is there something there? information bottleneck?
- probing - estimating the description length of the learned solution? (see the MDL sketch after this list)
- motivate the paper by saying, "we should want to predict what types of capabilities can emerge, and it's important to take lessons from statistical physics"
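A minimal sketch of one way to put numbers on "description length of the solution": prequential (online) coding in the spirit of MDL probing. Train a probe on growing prefixes of the data and charge, in bits, the log-loss of each next block under the probe fitted so far; a smaller total codelength means the representation makes the labels easier to describe. The toy data and the sklearn logistic-regression probe are assumptions, not anything taken from the seminar papers.

```python
# Minimal sketch of prequential (online) coding for description-length
# probing: codelength = cost of the first block under a uniform code
# plus -log2 p(y | x) of each next block under a probe trained on all
# earlier blocks. Toy data; the probe choice is an assumption.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))             # stand-in for layer representations
y = (X[:, :4].sum(axis=1) > 0).astype(int)  # stand-in labels
n_classes = 2

block_ends = [64, 128, 256, 512, 1024, 2000]
codelength = block_ends[0] * np.log2(n_classes)  # first block: uniform code

for start, end in zip(block_ends[:-1], block_ends[1:]):
    probe = LogisticRegression(max_iter=1000).fit(X[:start], y[:start])
    p = probe.predict_proba(X[start:end])
    # bits needed to transmit the labels of the next block
    codelength += -np.log2(p[np.arange(end - start), y[start:end]] + 1e-12).sum()

print(f"prequential description length: {codelength:.1f} bits "
      f"({codelength / len(y):.3f} bits/label)")
```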
On the Information Bottleneck Theory of Deep Learning
- Cites
- Tishby and Zaslavsky 2015; Shwartz-Ziv & Tishby 2017
- Tishby, Pereira & Bialek 1999, the original information bottleneck paper
- deep NNs have complex non-linear learning trajectories (Saxe et al. 2014)
- Baldi and Hornik (1989): the optimization problem remains non-convex
- Deep learning as representation learners
- the information bottleneck view of deep learning makes 3 main claims
- deep NNs undergo two distinct phases: an initial fitting phase, where mutual information between the representation and the input/labels increases, and a compression phase, where mutual information with the input decreases (see the binning sketch after this list)
- this is contested: tanh networks exhibit a compression phase but ReLU networks do not
- networks that do not compress can still generalize, so compression is not necessary for generalization
- SIDEBAR: does this mean that mesa optimizers require compression?
- learning trajectories are not easily predictable, mesa optimizers may arise
- SGD - two phases from Shwartz-Ziv and Tishby 2017: a drift phase, where the mean of the gradients over training samples dominates, and a diffusion phase, where the mean becomes smaller than the standard deviation of the gradients (see the gradient-SNR sketch after this list)
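A minimal sketch of the binning estimator behind the information-plane / fitting-vs-compression plots: discretize a hidden layer's activations T, then estimate I(X;T) and I(T;Y) from empirical counts, treating each input pattern as its own discrete symbol. The toy one-layer network, bin count, and data are assumptions for illustration; tracking these two quantities over training steps is what produces the two-phase picture.

```python
# Minimal sketch of the binning estimator used for information-plane
# plots: discretize hidden activations T, then estimate I(X;T) and
# I(T;Y) from empirical counts. Each input pattern is treated as its
# own discrete symbol. Toy activations and bin count are assumptions.
import numpy as np

def discrete_mi(a, b):
    """I(A;B) in bits from two sequences of hashable symbols."""
    a = np.asarray(a); b = np.asarray(b)
    n = len(a)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)
    joint /= n
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(4096, 12))      # binary inputs
y = (X.sum(axis=1) > 6).astype(int)          # labels
W = rng.normal(size=(12, 8))
T = np.tanh(X @ W)                           # one hidden layer's activations

bins = np.linspace(-1, 1, 30)
T_binned = np.digitize(T, bins)              # discretize activations
# hash each row so a whole layer pattern is one symbol
t_sym = [hash(row.tobytes()) for row in T_binned]
x_sym = [hash(row.tobytes()) for row in X]

print("I(X;T) =", discrete_mi(x_sym, t_sym), "bits")
print("I(T;Y) =", discrete_mi(t_sym, y), "bits")
```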
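And a minimal sketch of the drift/diffusion diagnostic itself: at each step, compare the norm of the mean per-sample gradient against the typical spread of per-sample gradients around that mean. The toy logistic model below only shows how the ratio is computed; a small convex model like this need not reproduce the two-phase pattern reported for deep nets.

```python
# Minimal sketch of the drift/diffusion diagnostic: compare the norm of
# the mean per-sample gradient (drift term) with the mean spread of
# per-sample gradients around it (diffusion term). Toy logistic model
# with made-up data; it only illustrates how the ratio is computed.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 20))
y = (X @ rng.normal(size=20) > 0).astype(float)
w = np.zeros(20)

def per_sample_grads(w, X, y):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid predictions
    return (p - y)[:, None] * X          # logistic-loss gradient, one row per sample

for step in range(2001):
    g = per_sample_grads(w, X, y)
    mean_g = g.mean(axis=0)
    spread = np.linalg.norm(g - mean_g, axis=1).mean()
    if step % 400 == 0:
        snr = np.linalg.norm(mean_g) / (spread + 1e-12)
        print(f"step {step:4d}  ||mean grad|| / spread = {snr:.3f}")
    w -= 0.1 * mean_g                    # full-batch gradient step for simplicity
```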
Paper questions
- does information probing, circuit complexity, or information theory help us get anywhere?
Effects of Parameter Norm Growth During Transformer Training
- suggests that what transformers learn is driven by inductive bias (i.e., the set of assumptions the training procedure imposes beyond the data)
- studies the growth in the l2 norm during training
- norm growth occurs, and the network approaches a discretized network with saturated activations
- saturation - neuron predominantly outputs values close to the asymptotic ends of the bounded activation function
- softmaxes become hardmaxes (see the saturation sketch after this list)
- measuring saturation in neural networks - https://ieeexplore.ieee.org/document/7376778
- the L2 norm of the parameters grows during training, aka "norm growth"
- previous work on feedforward networks
- Li and Arora 2019, Ji and Telgarsky 2020
- as the norm diverges during training, the network approaches a saturated network
- saturation allows the network to be approximated and analyzed with circuit-complexity tools
- transformers can implement a counting mechanism (Bhattamishra et al. 2020)
- Bhattamishra et al. also find that trained networks learn to recognize counter languages that rely on computing means
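A minimal sketch (not the paper's code) of why norm growth leads to saturation: scaling a fixed logit vector by a growing factor c, a stand-in for the parameter norm, drives softmax(c * z) toward a one-hot hardmax, i.e. entropy toward 0 and the max probability toward 1. The logit values below are made up.

```python
# Minimal sketch: scaling a fixed logit vector by a growing factor c
# (a stand-in for parameter norm growth) drives softmax(c * z) toward
# a hardmax, i.e. a saturated attention head.
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

z = np.array([1.2, 0.9, 0.1, -0.5])   # hypothetical attention logits

for c in [1, 2, 4, 8, 16, 32]:
    p = softmax(c * z)
    print(f"scale {c:3d}: max prob = {p.max():.4f}, entropy = {entropy_bits(p):.4f} bits")
```

In the limit each attention head puts all of its weight on its argmax position(s), which is the sense in which the saturated network looks discretized.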
Discussions
- https://old.reddit.com/r/MachineLearning/comments/em3ynp/d_trying_to_wrap_my_head_around_the_information/
- https://old.reddit.com/r/MachineLearning/comments/elmgsz/r_on_the_information_bottleneck_theory_of_deep/
Other papers
- survey - https://arxiv.org/pdf/1904.03743.pdf
- similar paper that tries to measure model complexity with curve activation functions https://arxiv.org/abs/2006.08962
- https://arxiv.org/pdf/1909.11396.pdf
- investigating dropout regularization and model complexity in NN: https://arxiv.org/abs/2108.06628
Videos
The Information Bottleneck Problem and Its Applications in Machine Learning
- survey - https://arxiv.org/pdf/2004.14941.pdf
Deep variational information bottleneck
Saxe et al Paper
- https://openreview.net/forum?id=ry_WPG-A
- https://old.reddit.com/r/MachineLearning/comments/79efus/r_on_the_information_bottleneck_theory_of_deep/
Code
- https://github.com/ravidziv/IDNNs - code for the Shwartz-Ziv & Tishby information-plane experiments
Scalable Mutual Information Estimation Using Dependence Graphs
- https://arxiv.org/abs/1801.09125
- claims results that go against the Saxe et al. paper
- https://github.com/mrtnoshad/EDGE/blob/master/information_plane/main.py - code for the 2018 EDGE paper, a scalable (linear-complexity) mutual information estimator