Optimizers

Steepest Descent on Finsler-Structured (Matrix) Geometries via Dual Ascent

To guarantee fast and robust model training, we can recast the optimization problem as steepest descent on Finsler-structured geometries. Here we show how to compute the optimal updates via dual ascent.

Rethinking Maximal Update Parametrization: Steepest Descent on the Spectral Ball

Novel optimizers for maximally updating both the weights and activations of neural networks while keeping weight norms under control. To get there, we needed to invent an efficient, GPU/TPU-friendly method for eigenvalue clipping and solve the Steepest Descent problem on the Positive Semidefinite Cone, Convex Spectrahedron, and finally on the Spectral Ball.

Steepest Descent on Finsler-Structured (Matrix) Manifolds

Fast and robust model training.

Heuristic Solutions for Steepest Descent on the Stiefel Manifold

What would Muon look like if we constrained the weights to be semi-orthogonal?

Training Transformers with Enforced Lipschitz Bounds

Neural networks are often highly sensitive to input and weight perturbations. This sensitivity has been linked to pathologies such as vulnerability to adversarial examples, divergent training, and overfitting. To combat these problems, past research has looked at building neural networks entirely from Lipschitz components. However, these techniques have not matured to the point where researchers have trained a modern architecture such as a transformer with a Lipschitz certificate enforced beyond initialization. To explore this gap, we begin by developing and benchmarking novel, computationally-efficient tools for maintaining norm-constrained weight matrices. Applying these tools, we are able to train transformer models with Lipschitz bounds enforced throughout training. We find that optimizer dynamics matter: switching from AdamW to Muon improves standard methods – weight decay and spectral normalization – allowing models to reach equal performance with a lower Lipschitz bound. Inspired by Muon’s update having a fixed spectral norm, we co-design a weight constraint method that improves the Lipschitz vs. performance tradeoff on MLPs and 2M parameter transformers. Our 2-Lipschitz transformer on Shakespeare text reaches validation accuracy 60%. Scaling to 145M parameters, our 10-Lipschitz transformer reaches 21% accuracy on internet text. However, to match the NanoGPT baseline validation accuracy of 39.4%, our Lipschitz upper bound increases to 10^264. Nonetheless, our Lipschitz transformers train without stability measures such as layer norm, QK norm, and logit tanh softcapping.

Sensitivity and Sharpness of n-Simplicial Attention

Towards a maximal update parameterization of n-simplicial attention

Adam with Aggressive Gradient Clipping ≈ Smoothed SignSGD/NormSGD

Why does Adam with aggressive gradient value/norm clipping have sparse updates and do well with higher learning rates? Here we show that it is essentially equivalent to a smoothed version of SignSGD/NormSGD.

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration

A small step towards hardware-architecture-optimizer codesign in deep learning.

Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds

Muon from first principles, what makes it different from other optimizers, and why it works so well.

CASPR Without Accumulation is Muon

The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.