Steepest Descent on Affine-Conic Representable Manifolds with Boundary via Dual Ascent
Novel optimizers for maximally descending on the loss landscape while satisfying strict weight constraints.
We derive an optimizer that performs steepest descent on the Birkhoff polytope equipped with the spectral norm via dual ascent. We show that it yields larger effective weight updates than naive LMO-based optimizers.
We derive sensitivity and sharpness bounds for Gated DeltaNet and Mamba-2, showing that they can be made 1-Lipschitz with appropriate parameter constraints.
First-order optimization under arbitrary norms with Nesterov momentum (and decoupled weight decay) yields a universal convergence bound. Our results generalize to norms not induced by inner products and also account for batch size.
First-order optimization under arbitrary norms with Nesterov momentum (and decoupled weight decay) yields a universal critical batch size formula. The square root learning rate scaling rule with batch size also holds universally across all norms.
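Written out, the scaling rule referenced above (my notation, not necessarily the paper's): for batch sizes below the critical batch size, the optimal learning rate grows as

$$\eta(B) \;=\; \eta_{\text{base}} \sqrt{\frac{B}{B_{\text{base}}}}, \qquad B \le B_{\text{crit}},$$

independent of which norm the steepest-descent update is measured in.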
To guarantee fast and robust model training, we can recast the optimization problem as steepest descent on Finsler-structured geometries. Here we show how to compute the optimal updates via dual ascent.
Novel optimizers for maximally updating both the weights and activations of neural networks while keeping weight norms under control. To get there, we needed to invent an efficient, GPU/TPU-friendly method for eigenvalue clipping and solve the steepest descent problem on the positive semidefinite cone, the convex spectrahedron, and finally the spectral ball.
Fast and robust model training.
What would Muon look like if we constrained the weights to be semi-orthogonal?
Towards a maximal update parameterization of n-simplicial attention
Why does Adam with aggressive gradient value/norm clipping have sparse updates and do well with higher learning rates? Here we show that it is essentially equivalent to a smoothed version of SignSGD/NormSGD.
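To see the equivalence concretely, here is a minimal NumPy sketch (function and variable names are illustrative, not taken from the post). When the clipping threshold is small relative to typical gradient magnitudes, the clipped gradient saturates at ±clip, so m/√v collapses toward sign(g) — a smoothed SignSGD.

```python
import numpy as np

def clipped_adam_update(g, m, v, beta1=0.9, beta2=0.999, eps=1e-8, clip=1e-3):
    """One Adam step with aggressive element-wise value clipping (illustrative)."""
    g = np.clip(g, -clip, clip)            # aggressive value clipping
    m = beta1 * m + (1 - beta1) * g        # first moment: EMA of clipped grads
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment
    return m / (np.sqrt(v) + eps), m, v    # saturated coords give roughly +/-1, i.e. sign(g)
```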
A small step towards hardware-architecture-optimizer codesign in deep learning.
Muon from first principles, what makes it different from other optimizers, and why it works so well.
A possible reason why Muon converges faster & does better at higher learning rates than Adam.
The block matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.
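For context, a minimal NumPy sketch of the chunk-wise computation (identity feature map; all names are mine): intra-chunk interactions are handled with a causal mask, while everything from earlier chunks enters through a running state S = Σₜ kₜvₜᵀ that is updated once per chunk.

```python
import numpy as np

def chunked_linear_attention(Q, K, V, chunk=64):
    """Causal linear attention o_t = q_t @ sum_{s<=t} k_s v_s^T, chunk by chunk."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))                      # running inter-chunk state
    O = np.zeros_like(V)
    for s in range(0, T, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        mask = np.tril(np.ones((len(q), len(q))))      # causal mask within the chunk
        O[s:s+chunk] = ((q @ k.T) * mask) @ v + q @ S  # intra-chunk + inter-chunk
        S += k.T @ v                                   # carry state forward
    return O
```

Each chunk's intra-chunk term is an ordinary masked-attention matmul, so the loop parallelizes well on accelerators while the state S keeps memory constant in sequence length.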
Why Muon still works despite not perfectly semi-orthogonalizing the gradients.
Simply switching to Muon can already get you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.
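For reference, the iteration being tuned is the odd Newton-Schulz polynomial Muon applies to the (momentum-averaged) gradient; below is a NumPy sketch using the tuned coefficients from the public Muon implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximately semi-orthogonalize G by pushing its singular values toward 1."""
    a, b, c = coeffs
    X = G / (np.linalg.norm(G) + 1e-7)  # Frobenius-normalize so all singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X  # odd polynomial acting on singular values
    return X.T if transposed else X
```

Tuning (a, b, c) trades off how fast small singular values are inflated against how tightly the large ones are pinned near 1, which is the kind of knob behind that extra 1-2%.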
The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.
GRPO may not be the best choice for training reasoning models. Here’s why.
A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.
Instead of asking, ‘Which optimizer should I use?’ ask, ‘In which space do my features live?’
Could ChatGPT’s shorter responses be an indication of something more bizarre going on?
Years of experience in building artificial minds led me to believe that these AIs may end up seeming more ‘human’ than we currently imagine them to be.
A thought dump on mRNA vaccines and the future of computational biology