Numerically Stable Spectral Clipping Via Newton-Schulz Iteration
A small step towards hardware-architecture-optimizer codesign in deep learning.
Muon from first principles, what makes it different from other optimizers, and why it works so well.
A possible reason why Muon converges faster & does better at higher learning rates than Adam.
The blocked matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.
Why Muon still works despite not perfectly semi-orthogonalizing the gradients.
Simply switching to Muon can already get you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.
The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.
GRPO may not be the best choice for training reasoning models. Here’s why.
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.
Instead of asking, ‘Which optimizer should I use?’ ask, ‘In which space do my features live?’