
Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds
Muon from first principles, what makes it different from other optimizers, and why it works so well.
A possible reason why Muon converges faster and does better at higher learning rates than Adam.
The blocked matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.
Why Muon still works despite not perfectly semi-orthogonalizing the gradients.
Simply switching to Muon can already get you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.
The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.
GRPO may not be the best choice for training reasoning models. Here’s why.
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.
Instead of asking, ‘Which optimizer should I use?’ ask, ‘In which space do my features live?’
Could ChatGPT’s shorter responses be an indication of something more bizarre going on?