
Note: This was originally posted as a Twitter thread. I’ve reformatted it here for better readability.
In this post, I try to answer three related questions:
- Why do steepest descent in non-Euclidean spaces?
- Why does adaptive preconditioning work so well in practice? And
- Why normalize everything, à la nGPT?
Ideally, when training a neural network, we want to bound the features, the weights, and their respective updates so that:
- [lower bound] the model actually “learns” stuff; and
- [upper bound] training stays stable.
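
Roughly, and in the spirit of the spectral condition of Yang, Simon & Bernstein (see References), for a layer of width $n_\ell$ we want the features and the feature updates to scale like

$$
\lVert h_\ell \rVert_2 = \Theta\!\left(\sqrt{n_\ell}\right), \qquad \lVert \Delta h_\ell \rVert_2 = \Theta\!\left(\sqrt{n_\ell}\right),
$$

where the lower bound is what guarantees the features actually move (learning), and the upper bound is what keeps them from blowing up (stability).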
These bounds then depend on the norms, but which norms?

The fun part is that the norms of the input and output features already induce the norm of the weights between them. We can also let the feature and feature updates have the same norm (likewise for the weights). And so, we only have to choose the norms for the features!
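
Concretely, if the inputs live in a space with norm $\lVert \cdot \rVert_\alpha$ and the outputs in a space with norm $\lVert \cdot \rVert_\beta$, the natural norm on the weight matrix between them is the induced operator norm:

$$
\lVert W \rVert_{\alpha \to \beta} = \max_{x \neq 0} \frac{\lVert W x \rVert_\beta}{\lVert x \rVert_\alpha}.
$$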

Now, our datasets are usually Euclidean or locally Euclidean (see the Manifold Hypothesis).
And what’s the norm on the weights induced by Euclidean input and output feature spaces? The Spectral Norm!
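
That is, with the Euclidean ($\ell_2$) norm on both the inputs and the outputs, the induced norm on the weights is exactly the largest singular value:

$$
\lVert W \rVert_{\ell_2 \to \ell_2} = \max_{x \neq 0} \frac{\lVert W x \rVert_2}{\lVert x \rVert_2} = \sigma_{\max}(W).
$$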

So even if we don’t want to do anything fancy, we’d still have to do steepest descent in non-Euclidean space because:
- The induced norm for the weights (w/ Euclidean features) is non-Euclidean; and
- We’re optimizing the weights, not the features.
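
For intuition, here’s a minimal NumPy sketch of a single steepest-descent step under the spectral norm, in the spirit of Bernstein & Newhouse’s anthology (step-size scaling, momentum, and all efficiency tricks omitted, so treat it as an illustration rather than a practical optimizer):

```python
import numpy as np

def spectral_steepest_descent_step(W, G, lr=0.01):
    """One steepest-descent step under the spectral norm (sketch).

    Among all updates T with spectral norm at most 1, the one most aligned
    with the gradient G is U @ Vt, where G = U @ diag(S) @ Vt is the
    reduced SVD of G. So we step along that direction instead of along -G.
    """
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)
```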
The model inputs and outputs being Euclidean sounds reasonable, but why do the “hidden” features have to be Euclidean too?
If we vary the norms of these features, we also vary the induced norms of the weights and vice versa.
Adaptive preconditioning then “searches” for the proper norms.
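
As a cartoon of what this “searching” looks like: steepest descent under a quadratic norm $\lVert x \rVert_P = \sqrt{x^\top P x}$ moves along $-P^{-1} g$, and Adagrad/Adam-style diagonal preconditioning just keeps re-estimating $P$ from the gradients it has seen so far. A minimal sketch (squared-gradient accumulator only; no momentum, bias correction, or EMAs):

```python
import numpy as np

def adagrad_like_step(W, G, v, lr=0.01, eps=1e-8):
    """Diagonal preconditioning read as an adaptive (quadratic) norm.

    v accumulates squared gradients; dividing by sqrt(v) preconditions the
    step by P^{-1} with P = diag(sqrt(v)), i.e. the optimizer keeps
    adjusting the norm it is doing steepest descent in.
    """
    v = v + G**2
    W = W - lr * G / (np.sqrt(v) + eps)
    return W, v
```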

This also answers @mattecapu’s question (https://x.com/mattecapu/status/1847218617567301804):

> really cool to also optimize the p in the norm. do you have a conceptual idea of what that’s tuning? I guess intuitively as p->oo each dimension is getting ‘further away’ from each other..

Shampoo & SOAP start from Euclidean features and spectral-norm weights, then tune the norms over time. SOAP does this tuning with momentum, so it’s theoretically faster.
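
To make “tunes the norms over time” a bit more concrete, here’s a very rough NumPy/SciPy sketch of a Shampoo-style step for a single weight matrix (no momentum, grafting, inverse-root caching, or SOAP’s rotation trick; just the textbook L^{-1/4} G R^{-1/4} update):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def shampoo_like_step(W, G, L, R, lr=0.01, eps=1e-6):
    """One Shampoo-like step for a weight matrix W with gradient G (sketch).

    L and R accumulate second-moment statistics of the gradient. The
    preconditioned update L^{-1/4} @ G @ R^{-1/4} can be read as steepest
    descent under a norm that the optimizer keeps re-estimating from data.
    """
    L = L + G @ G.T  # left statistics  (d_out x d_out)
    R = R + G.T @ G  # right statistics (d_in  x d_in)
    L_root = fractional_matrix_power(L + eps * np.eye(L.shape[0]), -0.25)
    R_root = fractional_matrix_power(R + eps * np.eye(R.shape[0]), -0.25)
    return W - lr * (L_root @ G @ R_root), L, R
```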
A more cynical answer, from one mathematician to another, is that almost nobody in this field is actually doing proper linear algebra. Adaptive preconditioning lets us start from really crappy configurations/parametrizations and get away scot-free.
But a more pro-ML answer would be that humans suck at predicting which inductive biases will work best when baked into the models. E.g., why should the “hidden” features live in Euclidean space? Why not let the model learn the proper space(s) to work with?
Finally, why is it a good idea to normalize everything everywhere?
Because it gives us sane bounds and the same norms on the features across layers, which means we can use the same optimizer for all the layers with minimal tuning!
https://arxiv.org/abs/2410.01131
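
And here’s a minimal sketch of what “normalize everything” looks like mechanically. nGPT does quite a bit more than this (it also normalizes the weights and learns per-dimension scales), so take this as the gist only:

```python
import numpy as np

def to_hypersphere(x, eps=1e-8):
    """Project each feature vector onto the unit hypersphere (sketch).

    With every layer's features pinned to the same norm, the bounds (and
    the induced norms on the weights) look the same everywhere, which is
    what lets one optimizer setting work across all layers.
    """
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
```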
How to Cite
@misc{cesista2024firstordernormedopt,
  author = {Franz Louis Cesista},
  title = {Deep Learning Optimizers as Steepest Descent in Normed Spaces},
  year = {2024},
  url = {http://leloykun.github.io/ponder/steepest-descent-opt/},
}
References
- Loshchilov, I., Hsieh, C., Sun, S., Ginsburg, B. (2024). nGPT: Normalized Transformer with Representation Learning on the Hypersphere. URL https://arxiv.org/abs/2410.01131
- Yang, G., Simon, J., Bernstein, J. (2024). A Spectral Condition for Feature Learning. URL https://arxiv.org/abs/2310.17813
- Bernstein, J., Newhouse, L. (2024). Old Optimizer, New Norm: An Anthology. URL https://arxiv.org/abs/2409.20325