Critical Batch Size for Steepest Descent Under Arbitrary Norms
First-order optimization under arbitrary norms with Nesterov momentum (and weight decay) yields a universal critical batch size formula.
To guarantee fast and robust model training, we can recast the optimization problem as steepest descent on Finsler-structured geometries. Here we show how to compute the optimal updates via dual ascent.
Novel optimizers for maximally updating both the weights and activations of neural networks while keeping weight norms under control. To get there, we needed to invent an efficient, GPU/TPU-friendly method for eigenvalue clipping and solve the Steepest Descent problem on the Positive Semidefinite Cone, Convex Spectrahedron, and finally on the Spectral Ball.
Fast and robust model training.
What would Muon look like if we constrained the weights to be semi-orthogonal?
Towards a maximal update parameterization of n-simplicial attention
Why does Adam with aggressive gradient value/norm clipping have sparse updates and do well with higher learning rates? Here we show that it is essentially equivalent to a smoothed version of SignSGD/NormSGD.
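A minimal NumPy sketch of the intuition behind that equivalence (my own illustration, not the post's code; the clipping threshold and hyperparameters are assumptions): once aggressive value clipping saturates, the clipped gradient is roughly clip * sign(g), the second moment flattens to roughly clip squared, and the Adam update collapses to a running average of sign(g), i.e. smoothed SignSGD.

```python
import numpy as np

def adam_with_value_clipping(grads, clip=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with aggressive per-coordinate gradient value clipping (illustrative)."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        gc = np.clip(g, -clip, clip)          # saturated clipping: gc ~ clip * sign(g)
        m = beta1 * m + (1 - beta1) * gc
        v = beta2 * v + (1 - beta2) * gc**2   # ~ clip**2 once clipping saturates
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        update = m_hat / (np.sqrt(v_hat) + eps)
    # m_hat / clip is a bias-corrected running average of sign(g): smoothed SignSGD.
    return update, m_hat / clip

rng = np.random.default_rng(0)
grads = [rng.normal(size=1000) for _ in range(200)]   # stand-in stochastic gradients
update, smoothed_sign = adam_with_value_clipping(grads)
print(np.corrcoef(update, smoothed_sign)[0, 1])       # ~1: the update tracks smoothed sign(g)
```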
A small step towards hardware-architecture-optimizer codesign in deep learning.
Muon from first principles, what makes it different from other optimizers, and why it works so well.
A possible reason why Muon converges faster & does better at higher learning rates than Adam.
The blocked matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.
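As a rough illustration of the chunk-wise idea (my own sketch under simplifying assumptions, not the post's code, and without normalization): a running state carries the inter-chunk contributions while a causally masked matmul handles each chunk internally.

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, chunk=64):
    """Causal linear attention o_t = sum_{s<=t} (q_t . k_s) v_s, computed chunk by chunk.

    Inter-chunk terms come from the running state S = sum_s k_s v_s^T carried
    across chunks; intra-chunk terms come from a causally masked (Q K^T) V matmul.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    O = np.zeros((T, d_v))
    mask = np.tril(np.ones((chunk, chunk)))        # causal mask within a chunk
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        c = q.shape[0]
        O[start:start+c] = q @ S + (mask[:c, :c] * (q @ k.T)) @ v
        S = S + k.T @ v                            # fold this chunk into the state
    return O

# Sanity check against the naive O(T^2) formulation.
rng = np.random.default_rng(0)
T, d = 256, 16
Q, K, V = rng.normal(size=(3, T, d))
naive = np.tril(Q @ K.T) @ V
assert np.allclose(chunkwise_linear_attention(Q, K, V), naive)
```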
Why Muon still works despite not perfectly semi-orthogonalizing the gradients.
Simply switching to Muon can already get you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.
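For reference, a sketch of the quintic Newton-Schulz iteration Muon uses to approximately semi-orthogonalize updates; the (a, b, c) coefficients below are the tuned values commonly quoted for the Muon reference implementation and should be treated as an assumption here.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximately map G to the nearest semi-orthogonal matrix via a quintic
    Newton-Schulz iteration (coefficients assumed from the Muon reference code)."""
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius-normalize so singular values <= 1
    if X.shape[0] > X.shape[1]:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.shape[0] > G.shape[1]:
        X = X.T
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(128, 256))
O = newton_schulz_orthogonalize(G)
# Singular values of the output are pushed toward 1 rather than matched exactly.
print(np.linalg.svd(O, compute_uv=False)[:5])
```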
The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.
GRPO may not be the best choice for training reasoning models. Here’s why.
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.
Instead of asking, ‘Which optimizer should I use?’ ask, ‘Which space do my features live in?’
Could ChatGPT’s shorter responses be an indication of something more bizarre going on?
Years of experience in building artificial minds led me to believe that these AIs may end up seeming more ‘human’ than we currently imagine them to be.
A thought dump on mRNA vaccines and the future of computational biology