This is still a Work in Progress (WIP). I’ve decided to publish this earlier than planned to get feedback and iterate quickly. If you spot any mistakes, please don’t hesitate to let me know! Email me at franzlouiscesista@gmail.com or tag me on X (@leloykun).

A new optimizer called Muon (Jordan et al., 2024a) has recently been shown to outperform Adam (Kingma et al., 2014) in both small-scale language model training (Jordan et al., 2024b) and larger-scale language model training (Moonshot AI Team, 2025), by up to 2x in terms of FLOPs efficiency. For non-matrix-valued parameters in a neural network, Muon falls back to Adam. But for matrix-valued parameters, Muon first semi-orthogonalizes the gradient before subtracting it from the parameter. This update can also be viewed as steepest descent under the Spectral norm (Bernstein et al., 2024).

0. Introduction

In deep learning, the goal is to find a function that maps input data to output data such that a certain optimization objective $\mathcal{L}$ (called the “loss function”) is minimized. We parametrize this function with a set of weights $\{W^l\}_{l=1}^L$ that are typically matrix-valued. However, previous work on deep learning optimization algorithms typically ignores the matrix structure and functional nature of these weights. A common strategy is to flatten the weights into a vector and treat them as such, while others flatten then unflatten the weights in some intermediate step (Kingma et al., 2014; Li, 2015; Gupta et al., 2018; Surya et al., 2024; Pooladzandi et al., 2024). Worse, this is also prevalent in related fields such as evolutionary algorithms research (Salimans et al., 2017; Braun et al., 2024), among others.

And as demonstrated by Jordan et al. (2024a) on small-scale language model training, and by Moonshot AI Team (2025) on larger-scale language model training, a simple change of perspective from weights-as-vectors to weights-as-matrices can lead to significant improvements in training efficiency. Thus, following the work of Large et al. (2024) and Bernstein et al. (2024), we argue, and demonstrate, that the underlying geometry of the weights is crucial to training performance and that we can reason, from first principles, about the properties of the geometry in which our weights (should) “live”.

This work is a selective survey of the latest advances in deep learning optimization research, with a focus on new developments in late 2024 and early 2025. However, we also make several novel contributions:

  1. In Sections 1 and 3, we formalize Steepest Descent in Riemannian and non-Riemannian manifolds, and how different choices of norms lead to different classes of deep learning optimization algorithms.
  2. We also formalize the connection between preconditioners in optimizers and the metric tensor in Riemannian steepest descent, and how we can use this to develop more robust intuitions on optimizer design such as when to update the preconditioner.
  3. We also discuss the connection between preconditioners and dualizers in optimizers, and when to use one over the other.
  4. We also show that the optimizer CASPR (Surya et al., 2024) reduces to Muon when accumulation on the (left and right) preconditioners is disabled.
  5. In Sections 2 and 4, we motivate the Muon optimizer from first principles, and show how it can be viewed as a steepest descent under the spectral norm. We also discuss many possible reasons why it works so well in practice, despite not fitting in with more established intuitions in the field.
  6. In Section 6, we discuss how to further improve Muon by optimizing the coefficients of the Newton-Schulz iteration. We also discuss how to use Muon to improve itself. And finally,
  7. In Section 7, we discuss convergence guarantees for Muon.

1. Preliminaries

1.1. Differential Matrix Manifolds


We follow the work of Absil et al. (2008) for the following definitions for differentiable matrix manifolds and their properties. Note that these definitions are actually isomorphic to standard definitions in differential geometry (since $\mathbb{R}^{m \times n} \cong \mathbb{R}^{mn}$), except we highlight the fact that we are working with matrix-valued points and matrix-valued update directions.

Definition 1 (Differentiable Matrix Manifold). A differentiable matrix manifold of dimension $m \times n$ is a tuple $(\mathcal{W}, \mathcal{A})$ where $\mathcal{W}$ is a second-countable, Hausdorff topological space of matrix-valued points, and $\mathcal{A} = \{(U_i, \varphi_i)\}_{i \in I}$ is a collection (atlas) of charts such that:

  1. Each $U_i \subseteq \mathcal{W}$ is an open set, and $\varphi_i: U_i \to \mathbb{R}^{m \times n}$ is a homeomorphism (i.e., a continuous bijection with continuous inverse).
  2. The open sets $U_i$ cover $\mathcal{W}$, i.e., $\mathcal{W} \subseteq \bigcup_{i \in I} U_i$.
  3. The transition maps $\varphi_{ij} = \varphi_j \circ \varphi_i^{-1}: \varphi_i(U_i \cap U_j) \to \varphi_j(U_i \cap U_j)$ are smooth (i.e., infinitely differentiable) for all $i, j \in I$ such that $U_i \cap U_j \neq \emptyset$.

Note that $\mathcal{W}$ being second-countable and Hausdorff will not be relevant for this work, but is included for completeness.

Definition 2 (Differentiable Functions). A function $f: \mathcal{W} \to \mathbb{R}$ is said to be differentiable at a point $p \in \mathcal{W}$ if and only if, $$f \circ \varphi_i^{-1}: \mathbb{R}^{m \times n} \to \mathbb{R}$$ is differentiable at $\varphi_i(p)$ for some chart $(U_i, \varphi_i) \in \mathcal{A}$ such that $p \in U_i$.

Note that since the transition maps $\varphi_{ji} = \varphi_i \circ \varphi_j^{-1}$ of a differentiable manifold are smooth, then by the Chain Rule this is equivalent to $f \circ \varphi_j^{-1}$ being differentiable at $\varphi_j(p)$ for all charts $(U_j, \varphi_j) \in \mathcal{A}$ such that $p \in U_j$. I.e., $$f \circ \varphi_j^{-1} = \underbrace{f \circ \varphi_i^{-1}}_{\text{differentiable}} \circ \underbrace{\varphi_i \circ \varphi_j^{-1}}_{\text{differentiable}}$$ Thus the differentiability of $f$ at $p \in \mathcal{W}$ is independent of the choice of chart $(U_i, \varphi_i) \in \mathcal{A}$ such that $p \in U_i$.

Definition 3 (Curves). Let $p \in \mathcal{W}$. A curve $\gamma$ at $p$ is a function $\gamma: \mathbb{R} \to \mathcal{W}$ with $\gamma(0) = p$ which is differentiable in the sense that its composition with any chart $\varphi_i$ is differentiable, $$\varphi_i \circ \gamma: \mathbb{R} \to \mathbb{R}^{m \times n}.$$

Again, since the transition maps $\varphi_{ij} = \varphi_j \circ \varphi_i^{-1}$ are smooth, then by the Chain Rule this is equivalent to $\varphi_j \circ \gamma$ being differentiable at $p$ for all charts $(U_j, \varphi_j) \in \mathcal{A}$ such that $p \in U_j$. I.e., $$\varphi_j \circ \gamma = \underbrace{\varphi_j \circ \varphi_i^{-1}}_{\text{differentiable}} \circ \underbrace{\varphi_i \circ \gamma}_{\text{differentiable}}$$

Definition 4 (Directional Derivative). Given a differentiable function $f$, the directional derivative of $f$ at $p \in \mathcal{W}$ in the direction of a curve $\gamma$ is defined as, $$\begin{align*} Df(p)[\gamma] &= \frac{d}{dt} (\underbrace{f \circ \gamma}_{\mathbb{R} \to \mathbb{R}})(t)\bigg|_{t=0}\newline Df(p)[\gamma] &= \frac{d}{dt} (\underbrace{f \circ \varphi_i^{-1}}_{\mathbb{R}^{m \times n} \to \mathbb{R}} \circ \underbrace{\varphi_i \circ \gamma}_{\mathbb{R} \to \mathbb{R}^{m \times n}})(t) \bigg|_{t=0} \end{align*}$$ for any chart $(U_i, \varphi_i) \in \mathcal{A}$ such that $p \in U_i$. Note that this derivative is well-defined because it is simply a derivative of a composition of differentiable functions defined in the usual Euclidean space. Additionally, the choice of chart does not matter because the differentiability of $f$ and $\gamma$ at $p$ are both chart-independent.

Definition 5 (Tangent Vectors and Tangent Spaces). A tangent vector at $p \in \mathcal{W}$ is an equivalence class of curves $\gamma$ with $\gamma(0) = p$ where two curves $\gamma_1$ and $\gamma_2$ are said to be equivalent if and only if, $$\frac{d}{dt}(\underbrace{\varphi_i \circ \gamma_1}_{\mathbb{R} \to \mathbb{R}^{m \times n}})(t) \bigg|_{t=0} = \frac{d}{dt}(\underbrace{\varphi_i \circ \gamma_2}_{\mathbb{R} \to \mathbb{R}^{m \times n}})(t)\bigg|_{t=0}$$ for any chart $(U_i, \varphi_i) \in \mathcal{A}$ such that $p \in U_i$. Again, the choice of chart does not matter.

The tangent space at $p \in \mathcal{W}$ is the set of all tangent vectors at $p$ and is denoted by $T_p\mathcal{W}$. This space forms an $m \times n$-dimensional vector space over $\mathbb{R}$.

Definition 6 (Differentials and Cotangent Spaces). Let $X \in T_p\mathcal{W}$ be a tangent vector at $p \in \mathcal{W}$ and $f$ be a differentiable function near $p$, then differentiating f along any curve in the equivalence class defining $X$ gives a well-defined directional derivative along $X$: $$Xf(p) := \frac{d}{dt}(\underbrace{f \circ \gamma}_{\mathbb{R} \to \mathbb{R}})(t)\bigg|_{t = 0}$$ which is independent of the choice of curve $\gamma$ in the equivalence class defining $X$. We then define the differential of $f$ at $p$ as the linear functional $Df(p)[\cdot]: T_p\mathcal{W} \to \mathbb{R}$ given by, $$Df(p)[X] = Xf(p)$$

The cotangent space at $p \in \mathcal{W}$ then is defined as the set of all differentials at $p$ and is denoted by $T^*_p\mathcal{W}$. Note that this space also forms an $m \times n$-dimensional vector space over $\mathbb{R}$.

Definition 7 (Canonical Pairing between Tangent and Cotangent Spaces). Let $\{e_{ij}\}_{1\leq i\leq m, 1\leq j\leq n}$ be the standard basis of $\mathbb{R}^{m \times n}$ where $e_{ij} \in \mathbb{R}^{m \times n}$ is the matrix with $1$ at the $(i, j)$-th entry and $0$ elsewhere. And at every point $p \in \mathcal{W}$, define, $$\begin{align*} \frac{\partial f}{\partial x^{ij}} &:= Df(p)[e_{ij}]\newline \partial f(p) &:= \begin{bmatrix} \frac{\partial f}{\partial x^{11}} & \ldots & \frac{\partial f}{\partial x^{1n}} \newline \vdots & \ddots & \vdots \newline \frac{\partial f}{\partial x^{m1}} & \ldots & \frac{\partial f}{\partial x^{mn}} \end{bmatrix} \end{align*}$$ where $\partial f(p) \in T^*_p\mathcal{W}$ is referred to as the coordinate representation of the differential $Df(p)[\cdot]: T_p\mathcal{W} \to \mathbb{R}$. Then for any tangent vector $X \in T_p\mathcal{W} \subseteq \mathbb{R}^{m \times n}$, we have, $$\begin{align*} Df(p)[X] &= \sum_{i=1}^m \sum_{j=1}^n \frac{\partial f}{\partial x^{ij}} X^{ij}\newline Df(p)[X] &= \langle \partial f(p), X \rangle_F \end{align*}$$ where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product. This pairing of a tangent vector $X \in T_p\mathcal{W}$ and a differential $\partial f(p) \in T^*_p\mathcal{W}$ that results in a real number is called the canonical pairing between the tangent and cotangent spaces at $p \in \mathcal{W}$.

Since the tangent spaces $T_p\mathcal{W}$ are vector spaces, we can define a notion of “length” on them. This notion of length is formalized by the concept of a norm. Some, but not all, norms can also be induced by an inner product which formalizes the notion of “angle” between two directions.

Definition 8 (Norm). A norm $||\cdot||$ on a (real matrix-valued) vector space $V$ is a function $||\cdot||: V \to \mathbb{R}$ that satisfies the following properties:

  1. Non-negativity: $||x|| \geq 0$ for all $x \in V$, and $||x|| = 0$ if and only if $x = 0$.
  2. Homogeneity: $||ax|| = |a| \cdot ||x||$ for all $x \in V$ and $a \in \mathbb{R}$.
  3. Triangle Inequality: $||x + y|| \leq ||x|| + ||y||$ for all $x, y \in V$.

If the norm depends on the point $p \in \mathcal{W}$, we denote it by $||\cdot||_p$.

Definition 9 (Inner Product). An inner product on a (real) vector space $V$ is a bilinear form $\langle \cdot, \cdot \rangle: V \times V \rightarrow \mathbb{R}$ that satisfies the following properties:

  1. Bilinearity: $\langle ax + by, z \rangle = a\langle x, z \rangle + b\langle y, z \rangle$ for all $x, y, z \in V$ and $a, b \in \mathbb{R}$.
  2. Symmetry: $\langle x, y \rangle = \langle y, x \rangle$ for all $x, y \in V$.
  3. Positive Definiteness: $\langle x, x \rangle > 0$ for all $x \in V$ such that $x \neq 0$.
  4. Non-degeneracy: $\langle x, y \rangle = 0$ for all $y \in V$ implies $x = 0$.

Every inner product has a corresponding matrix form $G_*$ such that $\langle x, y \rangle = x^T G_{*} y$ for all $x, y \in V$. The inner product also induces a norm $||\cdot||$ on $V$ defined by $||x|| = \sqrt{\langle x, x \rangle}$ for all $x \in V$.

However, not all norms are induced by an inner product. For example, the spectral norm $||\cdot||_{2 \to 2}$ is not induced by an inner product. The parallelogram law is a necessary and sufficient condition for the existence of an inner product that induces the norm (Jordan et al., 1935).

Definition 10 (Parallelogram Law). A norm $||\cdot||$ on a (real matrix-valued) vector space $V$ is said to satisfy the parallelogram law if, for all $x, y \in V$, we have, $$||x + y||^2 + ||x - y||^2 = 2||x||^2 + 2||y||^2.$$

Note that we can vary the norm or inner product on the tangent spaces $T_p\mathcal{W}$ at each point $p \in \mathcal{W}$,

Definition 11 (Riemannian Metric). Suppose that we equip the tangent spaces $T_p\mathcal{W}$ at each point $p \in \mathcal{W}$ with an inner product $\langle \cdot, \cdot \rangle_p$. Let $g$ be a function such that, $$g(p, X, Y) = g_p(X, Y) = \langle X, Y \rangle_{G_p} \qquad \forall p\in\mathcal{W}; X,Y\in T_p\mathcal{W}$$ We say that $g$ is a Riemannian metric on $\mathcal{W}$ if it varies smoothly along $p$.

Definition 12 (Riemannian Gradient). Let $Df(p)[\cdot]: T_p\mathcal{W} \to \mathbb{R}$ be the differential of a differentiable function $f$ at $p \in \mathcal{W}$. The Riemannian gradient of $f$ at $p$ is defined as the unique tangent vector $\nabla f(p) \in T_p\mathcal{W}$ such that, $$Df(p)[X] = g_p(\nabla f(p), X) \qquad \forall X \in T_p\mathcal{W}$$

Finally, we can now formally define Riemannian and non-Riemannian differentiable matrix manifolds,

Definition 13 (Riemannian and Non-Riemannian Differentiable Matrix Manifolds). Given a differentiable matrix manifold $(\mathcal{W}, \mathcal{A})$ whose tangent spaces $T_p\mathcal{W}$ are equipped with a norm $||\cdot||_p$, we say that it is a Riemannian manifold if the norm $||\cdot||_p$ is induced by an inner product $\langle \cdot, \cdot \rangle_p$ (i.e., the Parallelogram Law holds) and the inner product varies smoothly along $p$. Otherwise, it is a non-Riemannian manifold.

Note that if the norm $||\cdot||_p$ smoothly varies along $p \in \mathcal{W}$, then it is also common to refer to the manifold as a Finsler Manifold (Flynn, 2017). However, we will not use this terminology in this work.

1.2. Deep Learning


Definition 14 (Neural Networks). A neural network is a parametrized function $f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$ defined by composing linear and non-linear transformations according to a directed acyclic graph (DAG) structure. Here, we focus on neural networks parametrized by matrix-valued weights $\{W^l\}_{l=1}^L$ where $W^l \in \mathbb{R}^{d_{l} \times d_{l-1}}$ is the weight matrix at the $l$-th layer and $d_l$ is the dimension of the hidden (vector) representations at that layer. The input to the network is a vector $x_0 \in \mathbb{R}^{d_0}$, and the output is a vector $x_L \in \mathbb{R}^{d_L}$.

The weights of a neural network are often initialized randomly, and then updated iteratively using an optimizer to minimize an objective function $\mathcal{L}$. In practice, we often define $\mathcal{L}$ to be the expected “loss” of a neural network $f$ according to some differentiable “loss function” $loss: \mathbb{R}^{d_L} \times \mathbb{R}^{d_L} \to \mathbb{R}$ that measures the distance between the “predicted” output $x_L \in \mathbb{R}^{d_L}$ and the “expected” output $y \in \mathbb{R}^{d_L}$ over some dataset $\mathcal{D} \subseteq \mathbb{R}^{d_0} \times \mathbb{R}^{d_L}$ of input-output pairs. I.e., we have, $$\mathcal{L}(\{W^l\}_{l=1}^L) = \mathbb{E}_{(x^i, y^i) \sim \mathcal{D}}[loss(f(x^i | \{W^l\}_{l=1}^L), y^i)]$$ To simplify matters, we also often optimize each weight independently of each other. And so we have, $$\mathcal{L}^l: \mathcal{W} \to \mathbb{R},$$ for each weight $W^l$. We drop the superscript $l$ when it is clear from the context.

Definition 15 (Backpropagation). Backpropagation is an application of the chain rule to compute the differentials of $\mathcal{L}$ at each weight $W^l$ in a neural network, given only access to the differential(s) at later layers. For example, suppose we have a linear layer $y = Wx$ where $x \in \mathbb{R}^{n}, y \in \mathbb{R}^{m}, W \in \mathbb{R}^{m \times n}$ and that we initially have access to, $$\frac{\partial\mathcal{L}}{\partial y} = \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial y_1}\newline \vdots\newline \frac{\partial\mathcal{L}}{\partial y_m} \end{bmatrix}$$ By the chain rule, we can compute the differential of $\mathcal{L}$ at $W$ as follows, $$\begin{align*} \frac{\partial\mathcal{L}}{\partial W} &= \frac{\partial\mathcal{L}}{\partial y} \cdot \frac{\partial y}{\partial W} = \frac{\partial\mathcal{L}}{\partial y} \cdot x^T\newline \frac{\partial\mathcal{L}}{\partial W} &= \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial W_{11}} & \ldots & \frac{\partial\mathcal{L}}{\partial W_{1n}}\newline \vdots & \ddots & \vdots\newline \frac{\partial\mathcal{L}}{\partial W_{m1}} & \ldots & \frac{\partial\mathcal{L}}{\partial W_{mn}} \end{bmatrix} = \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial y_1} x_1 & \ldots & \frac{\partial\mathcal{L}}{\partial y_1} x_n\newline \vdots & \ddots & \vdots\newline \frac{\partial\mathcal{L}}{\partial y_m} x_1 & \ldots & \frac{\partial\mathcal{L}}{\partial y_m} x_n \end{bmatrix} \end{align*}$$ Likewise, we can compute the differential of $\mathcal{L}$ at $x$ as follows, $$\begin{align*} \frac{\partial\mathcal{L}}{\partial x} &= \frac{\partial y}{\partial x} \cdot \frac{\partial\mathcal{L}}{\partial y} = W^T \cdot \frac{\partial\mathcal{L}}{\partial y}\newline \frac{\partial\mathcal{L}}{\partial x} &= \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial x_1}\newline \vdots\newline \frac{\partial\mathcal{L}}{\partial x_n} \end{bmatrix} = W^T \cdot \begin{bmatrix} \frac{\partial\mathcal{L}}{\partial y_1}\newline \vdots\newline \frac{\partial\mathcal{L}}{\partial y_m} \end{bmatrix} \end{align*}$$ We can then use $\frac{\partial\mathcal{L}}{\partial x}$ to calculate the differentials at the previous layer, and so on, until we reach the input layer. This is the essence of backpropagation.

An important matter we want to highlight is that, despite the terminology used in deep learning literature, backpropagation only gives us the differentials, and not the gradients of the loss function $\mathcal{L}$ at each weight $W^l$.
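To make Definition 15 concrete, here is a minimal sketch (my own, not from the original text) that checks the closed-form differentials above against PyTorch's autograd for a single linear layer:

import torch

# Check dL/dW = (dL/dy) x^T and dL/dx = W^T (dL/dy) for y = W x.
m, n = 3, 5
W = torch.randn(m, n, requires_grad=True)
x = torch.randn(n, requires_grad=True)
y = W @ x

# Stand-in for the differential dL/dy arriving from later layers.
dL_dy = torch.randn(m)
y.backward(dL_dy)

assert torch.allclose(W.grad, torch.outer(dL_dy, x.detach()))
assert torch.allclose(x.grad, W.detach().T @ dL_dy)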

Definition 16 (Optimizers). A (deep learning) optimizer is an algorithm that iteratively adjusts a model’s parameters $W_t \in \mathcal{W}$ in order to minimize an objective function $\mathcal{L}: \mathcal{W} \to \mathbb{R}$. Optimizers are often stateful, meaning that they maintain a set of statistics $S_t \in \mathcal{S}$ on the previous parameters and differentials (or gradients) they have seen so far. Optimizers that do not maintain any state (or equivalently, maintain an empty set $S_t = \emptyset$ for all $t$) are called stateless optimizers.

At every training step $t$, the optimizer receives a stochastic estimate of the differential $\partial\mathcal{L}(W_t; \xi) \in T^*_{W_t}\mathcal{W}$ at the current point $W_t \in \mathcal{W}$, and uses it to update the model parameters $W_t$ to $W_{t+1} \in \mathcal{W}$ and optimizer state $S_t$ to $S_{t+1} \in \mathcal{S}$ using a function $opt$ (called the “update rule”) defined a priori,

$$opt: \mathcal{W} \times \mathcal{S} \times T^*\mathcal{W}\rightarrow \mathcal{W} \times \mathcal{S}.$$ Optimizers are characterized by their update rules. I.e., two optimizers are said to be equivalent if they have the same update function $opt$.

Definition 17 (Preconditioning). In an optimizer, a preconditioner $\mathcal{P}(\cdot; W_t): T^*_{W_t}\mathcal{W} \rightarrow T_{W_t}\mathcal{W}$ is a (possibly point-dependent) linear transform that maps the coordinate representation of the differential we have access to $\partial\mathcal{L}(W_t; \xi) \in T^*_{W_t}\mathcal{W}$ to a descent direction in the tangent space $\widehat{U}_t \in T_{W_t}\mathcal{W}$. That is, at any $W_t \in \mathcal{W}$, we have a matrix $P_{W_t}$ such that, $$\widehat{U}_t = \mathcal{P}(\partial\mathcal{L}(W_t; \xi); W_t) = -P_{W_t} \partial\mathcal{L}(W_t; \xi)$$ $$W_{t+1} = W_t - \lambda P_{W_t} \partial\mathcal{L}(W_t; \xi).$$

It is also common to assume that we can decompose $P_{W_t}$ into a Kronecker product $P_{W_t} = L_{W_t} \otimes R_{W_t}$ (Li, 2015; Gupta et al., 2018; Surya et al., 2024), such that our update rule becomes, $$ \begin{align*} \widehat{U}_t &= -P_{W_t} \partial\mathcal{L}(W_t; \xi)\newline \widehat{U}_t &= -\left( L_{W_t} \otimes R_{W_t} \right) \partial\mathcal{L}(W_t; \xi)\newline \widehat{U}_t &= -L_{W_t} \partial\mathcal{L}(W_t; \xi) R_{W_t} \end{align*} $$ $$W_{t+1} = W_t - \lambda L_{W_t} \partial\mathcal{L}(W_t; \xi) R_{W_t}.$$ We call $L_W$ and $R_W$ the left and right preconditioners, respectively.
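As a sanity check on this factorization, here is a minimal sketch (my own, not from the cited works) verifying that a Kronecker-factored preconditioner acting on the flattened differential agrees with the two-sided matrix product above. It assumes symmetric positive-definite factors and PyTorch's row-major flatten as the vectorization:

import torch

m, n = 4, 3
G = torch.randn(m, n)                      # stand-in for the differential
A, B = torch.randn(m, m), torch.randn(n, n)
L = A @ A.T + torch.eye(m)                 # symmetric positive-definite left factor
R = B @ B.T + torch.eye(n)                 # symmetric positive-definite right factor

lhs = (torch.kron(L, R) @ G.reshape(-1)).reshape(m, n)   # (L ⊗ R) vec(G)
rhs = L @ G @ R                                          # two-sided product
assert torch.allclose(lhs, rhs, atol=1e-5)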

1.3. Optimization Objective

We consider the following optimization problem, $$\begin{equation} \arg\min_{W \in \bm{\mathcal{W}}} \mathcal{L}(W), \end{equation}$$ where $\mathcal{L}(\cdot): \bm{\mathcal{W}} \rightarrow \mathbb{R}$ is a bounded-below and differentiable objective function, and $\bm{\mathcal{W}}$ is a real-valued, finite-dimensional, differentiable matrix manifold whose tangent spaces are equipped with a norm $||\cdot||$ chosen a priori.

If the norm is induced by a smoothly-varying inner product (i.e., the parallelogram law holds), then $\bm{\mathcal{W}}$ is a Riemannian manifold. Otherwise, it is a non-Riemannian manifold. And as we will show in the succeeding sections, not only does the choice of norm naturally lead to different optimization algorithms, but also to two flavors of deep learning optimizers, preconditioners and dualizers, depending on whether $\bm{\mathcal{W}}$ is Riemannian or non-Riemannian.

In practice, $\mathcal{L}$ often does not have a simple, closed-form solution, so we resort to iterative methods of the form, $$W_{t+1} = W_{t} + \lambda \widehat{U}_t,$$ where $W_t, W_{t+1} \in \mathcal{W}$ are the “current” and “updated” points at steps $t$ and $t+1$, respectively, $\widehat{U}_t \in T_{W_t}\mathcal{W}$ is the “update direction” at step $t$, and $\lambda > 0$ is a positive learning rate parameter that controls how much we move in the direction of $\widehat{U}_t$.

There are multiple ways to estimate the optimal update direction $\widehat{U}_t$ at each step $t$. And in deep learning literature, it is common to use (approximate) second-order methods involving the Hessian of $\mathcal{L}$ at $W_t$. However, computing this Hessian is often intractable and/or computationally expensive. Thus, researchers often resort to approximating it via assumptions on its structure. And different assumptions lead to different deep learning optimizers. E.g., a diagonal structure assumption leads to Adam (Kingma et al., 2014), while a Kronecker product structure leads to PSGD, Shampoo, and KFAC (Li, 2015; Gupta et al., 2018; Martens et al., 2015), and so on. We argue that these assumptions are hard to justify from a first-principles basis, and that a more natural approach is to follow the direction of Steepest Descent on the manifold $\bm{\mathcal{W}}$ instead.

Definition 18 (Steepest Descent). Let $D\mathcal{L}(W_t)[\cdot]: T_{W_t}\mathcal{W} \rightarrow \mathbb{R}$ be the differential of $\mathcal{L}$ at $W_t$. Given that we equip the tangent spaces of $\mathcal{W}$ with the norm $||\cdot||$, the direction of Steepest Descent at $W_t$ is given by, $$\begin{equation} \widehat{U}_t = \arg\min_{\substack{\Delta W \in T_{W_t}\mathcal{W}\newline ||\Delta W|| = 1}} D\mathcal{L}(W_t)[\Delta W], \end{equation}$$ or equivalently, $$\begin{equation} \widehat{U}_t = - \arg\max_{\substack{\Delta W \in T_{W_t}\mathcal{W}\newline ||\Delta W|| = 1}} D\mathcal{L}(W_t)[\Delta W], \end{equation}$$ where $D\mathcal{L}(W_t)[\Delta W] \in \mathbb{R}$ gives the directional derivative of $\mathcal{L}$ at $W_t$ in the direction of $\Delta W \in T_{W_t}\mathcal{W}$. Intuitively, this can be understood as the update direction of unit length that leads to the largest decrease in the objective function $\mathcal{L}$.

Again, in practice, we often do not have access to the exact differential $D\mathcal{L}(W_t)[\cdot]$. However, either through, e.g., backpropagation if downstream operations are differentiable (Rumelhart et al., 1986) or evolutionary algorithms otherwise (Salimans et al., 2017), we often do have access to a stochastic estimator of the differential in coordinate form,

Assumption 1: Let $\partial\mathcal{L}(W) \in T^*_W\mathcal{W}$ be the coordinate representation of the differential $D\mathcal{L}(W)[\cdot]: T_W\mathcal{W} \rightarrow \mathbb{R}$ at $W \in \bm{\mathcal{W}}$ such that, $$D\mathcal{L}(W)[\cdot] = \langle \partial\mathcal{L}(W), \cdot \rangle_F,$$ where $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product. We assume that we have access to a stochastic estimator $\partial\mathcal{L}(W; \xi) \in T^*_W\mathcal{W}$ of the differential in coordinate form that is unbiased and has bounded variance. That is,

$$ \begin{align*} &\mathbb{E}_{\xi \sim D}[\partial\mathcal{L}(W; \xi)] = \partial\mathcal{L}(W) && \forall W \in \bm{\mathcal{W}}\newline &\mathbb{E}_{\xi \sim D}[||\partial\mathcal{L}(W; \xi) - \partial\mathcal{L}(W) ||_F^2] \leq \sigma^2 && \forall W \in \bm{\mathcal{W}} \end{align*} $$ where $\xi$ is a random variable sampled from a distribution $D$, $\sigma > 0$ is a positive variance parameter, and $||\cdot||_F = \sqrt{\langle \cdot, \cdot \rangle_F}$ is the corresponding Frobenius norm. Note that in the Riemannian case, the above is equivalent to, $$ \begin{align*} &\mathbb{E}_{\xi \sim D}[\nabla\mathcal{L}(W; \xi)] = \nabla\mathcal{L}(W) && \forall W \in \bm{\mathcal{W}}\newline &\mathbb{E}_{\xi \sim D}[||\nabla\mathcal{L}(W; \xi) - \nabla\mathcal{L}(W) ||_F^2] \leq \hat{\sigma}^2 && \forall W \in \bm{\mathcal{W}} \end{align*} $$ where $\nabla\mathcal{L}(W)$ is the “true” gradient of $\mathcal{L}$ at $W$, $\nabla\mathcal{L}(W; \xi) = G_W^{-1}\partial\mathcal{L}(W; \xi)$ is the equivalent stochastic estimator of the gradient, $\hat{\sigma} = \sigma(\max_{W \in \mathcal{W}}||G_W^{-1}||_F) > 0$ is the adjusted variance parameter, and $G_W$ is the metric tensor at $W \in \bm{\mathcal{W}}$.

We also make the following standard continuity assumption on the differential $\partial\mathcal{L}(\cdot)$ (Mokhtari et al., 2018; Kovalev, 2025),

Assumption 2: The differential $\partial\mathcal{L}(\cdot)$ is Lipschitz continuous with respect to the norm $||\cdot||$ with Lipschitz constant $L > 0$. That is, for all $W \in \bm{\mathcal{W}}$, $$ \begin{equation} ||\partial\mathcal{L}(W + \Delta W) - \partial\mathcal{L}(W)||^\dagger \leq L||\Delta W|| \quad \forall \Delta W \in T_W\bm{\mathcal{W}} \end{equation} $$ where $||\cdot||^\dagger$ is the dual norm of $||\cdot||$. Note that in the Riemannian case, this is equivalent to, $$ \begin{equation} ||\nabla\mathcal{L}(W + \Delta W) - \nabla\mathcal{L}(W)||^\dagger \leq \hat{L}||\Delta W|| \quad \forall \Delta W \in T_W\bm{\mathcal{W}} \end{equation} $$ where $\nabla\mathcal{L}(W)$ is the gradient of $\mathcal{L}$ at $W$, $\hat{L} = L(\max_{W \in \mathcal{W}}||G_W^{-1}||^\dagger) > 0$, and $G_W$ is the metric tensor at $W \in \bm{\mathcal{W}}$.

2. Why do Steepest Descent Under the Spectral Norm?

The geometry of $\mathcal{W}$ and the optimizer we will need both depend on the choice of norm $||\cdot||$. Our core argument is that it is most natural to do steepest descent under the spectral norm $||\cdot||_{2 \to 2}$ in the context of training the linear weights $W$ of a neural network. The spectral norm induces $\mathcal{W}$ to be non-Riemannian, and therefore, intuitions on optimization we have developed in Riemannian manifolds may not apply.

2.1. Majorization-Minimization Perspective

We can upper bound our objective function $\mathcal{L}$ by the following approximation at an arbitrary point $W \in \bm{\mathcal{W}}$, $$ \begin{equation} \mathcal{U}(\Delta W; W) = \mathcal{L}(W) + \langle \partial\mathcal{L}(W), \Delta W \rangle_F + \frac{\lambda}{2}||\Delta W||^2 \end{equation} $$ for some norm $||\cdot||$ chosen a priori. Note that in the Riemannian case, the above is equivalent to the more standard presentation, $$ \begin{equation} \mathcal{U}(\Delta W; W) = \mathcal{L}(W) + \langle \nabla\mathcal{L}(W), \Delta W \rangle + \frac{\lambda}{2}||\Delta W||^2 \end{equation} $$ where $\nabla\mathcal{L}(W)$ is the gradient of $\mathcal{L}$ at $W$, and $\langle \cdot, \cdot \rangle$ is the inner product that induces the norm $||\cdot||$. Using standard arguments, we can show that, $$\mathcal{L}(W + \Delta W) \leq \mathcal{U}(\Delta W; W) \quad\quad \forall \Delta W \in T_W\bm{\mathcal{W}}$$ as long as $\lambda \geq L$ (Hunter et al., 2004), where $L$ is the Lipschitz constant from Assumption 2.

A natural strategy to (iteratively) minimize $\mathcal{L}$ from point $W \in \mathcal{W}$ then is to (iteratively) minimize the majorant $\mathcal{U}(\cdot; W)$. And as discussed by Carlson et al. (2015), the spectral norm gives us a very tight upper bound and is thus a good choice. In fact, the spectral norm gives the tightest bound among all the Schatten-$p$ norms (the Frobenius norm included). And just as importantly, Equation (5) above has a simple, closed-form solution for the spectral norm as we will discuss in Section 4.

2.2. Feature Learning Perspective

This section can be summarized as,

If we want the Euclidean norm of our features and feature updates to ‘grow’ with the model size, then the Spectral norm of our weights and weight updates must also ‘grow’ with the model size.

Suppose that we have a linear transform $x_{l+1} = W_{l} x_{l}$ at the $l$-th layer of a neural network where $x_l \in \mathbb{R}^{d_l}$ and $x_{l+1} \in \mathbb{R}^{d_{l+1}}$ are the input and output hidden representations (or “features”), respectively, and $W_l \in \mathbb{R}^{d_{l+1} \times d_l}$ is the weight matrix. Additionally, let $\Delta x_l \in \mathbb{R}^{d_l}$, $\Delta x_{l+1} \in \mathbb{R}^{d_{l+1}}$, and $\Delta W_l \in \mathbb{R}^{d_{l+1} \times d_l}$ be their updates after a backward pass.

Ideally, we want the sizes of both the hidden representations $x_l$ and their updates $\Delta x_l$ to scale with the model width $d_l$. Otherwise, if the hidden representations are ‘too small’, we are wasting capacity, in a sense (Elhage et al., 2022); and if they are ‘too large’, we are pushing the model towards the edge of numerical stability and preventing grokking (Prieto et al., 2025). Likewise, if the updates are ‘too small’, they vanish at larger scales, slowing down convergence; and if they are ‘too large’, they cause training instability. Yang et al. (2024) summarize this as follows,

Desideratum 1 (Feature Learning). We desire that our features $x_l$ and feature updates $\Delta x_l$ be of size, $$ \begin{equation} ||x_l||_2 = \Theta(\sqrt{d_l})\quad\text{and}\quad ||\Delta x_l||_2 = \Theta(\sqrt{d_l})\quad\text{for all layers } l = 1, 2, \ldots, L-1 \end{equation} $$ where $f(d) = \Theta(g(d))$ means that $f(d)$ “scales like” or “grows in the order of” $g(d)$. More formally, there exist positive real constants $c, C > 0$ and a positive integer $D$ such that, for all $d > D$, we have $c \cdot g(d) \leq f(d) \leq C \cdot g(d)$.

We ensure this by imposing constraints on the size of the weights $W_l$ and their updates $\Delta W_l$:

  1. From the definition of the spectral norm, we have, $$ \begin{align*} x_{l+1} &= W_l x_l\newline ||x_{l+1}||_2 &\leq ||W_l||_{2\to 2} \cdot ||x_l||_2 \end{align*} $$ Combining this with Desideratum 1, we have, $$ \begin{align*} \underbrace{||x_{l+1}||_2}_{\Theta(\sqrt{d_{l+1}})} &\leq ||W_l||_{2\to 2} \cdot \underbrace{||x_l||_2}_{\Theta(\sqrt{d_l})} \end{align*} $$ Thus the size of the weights $W_l$ must be, $$ \begin{equation} ||W_l||_{2 \to 2} = \Theta\left(\sqrt{\frac{d_{l+1}}{d_l}}\right) \end{equation} $$

  2. Now let’s consider the feature updates $\Delta x_l$, $$ \begin{align*} x_{l+1} + \Delta x_{l+1} &= (W_l + \Delta W_l)(x_l + \Delta x_l)\newline \Delta x_{l+1} &= W_l \Delta x_l + \Delta W_l x_l + \Delta W_l \Delta x_l\newline ||\Delta x_{l+1}||_2 &\leq ||W_l||_{2\to 2} \cdot ||\Delta x_l||_2 + ||\Delta W_l||_{2\to 2} \cdot ||x_l||_2 + ||\Delta W_l||_{2\to 2} \cdot ||\Delta x_l||_2 \end{align*} $$ Combining this with Desideratum 1 and our result above, we have, $$ \begin{align*} \underbrace{||\Delta x_{l+1}||_2}_{\Theta(\sqrt{d_{l+1}})} &\leq \underbrace{||W_l||_{2\to 2}}_{\Theta\left(\sqrt{\frac{d_{l+1}}{d_l}}\right)} \cdot \underbrace{||\Delta x_l||_2}_{\Theta\left(\sqrt{d_l}\right)} + ||\Delta W_l||_{2\to 2} \cdot \underbrace{||x_l||_2}_{\Theta\left(\sqrt{d_l}\right)} + ||\Delta W_l||_{2\to 2} \cdot \underbrace{||\Delta x_l||_2}_{\Theta\left(\sqrt{d_l}\right)} \end{align*} $$ Thus the size of the weight updates $\Delta W_l$ must be, $$||\Delta W_l||_{2 \to 2} = \Theta\left(\sqrt{\frac{d_{l+1}}{d_l}}\right)$$

These become our Spectral Scaling Conditions (Yang et al., 2024),

Condition 1 (Spectral Scaling). The spectral norms of our weights $W_l$ and weight updates $\Delta W_l$ must be, $$ \begin{equation} ||W_l||_{2\to 2} = \Theta\left(\sqrt{\frac{d_{l+1}}{d_l}}\right)\quad\text{and}\quad||\Delta W_l||_{2\to 2} = \Theta\left(\sqrt{\frac{d_{l+1}}{d_l}}\right)\quad\text{at layers } l = 1, \ldots, L-1 \end{equation} $$
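As a rough numerical illustration (a sketch of my own, not part of the original analysis), one can rescale a random square weight matrix to satisfy the spectral scaling condition, feed it an input of norm $\Theta(\sqrt{d_l})$, and observe that the ratio $||W_l x_l||_2 / \sqrt{d_{l+1}}$ stays roughly constant as the width grows:

import torch

torch.manual_seed(0)
for d in [256, 512, 1024]:
    d_in, d_out = d, d
    W = torch.randn(d_out, d_in)
    W *= (d_out / d_in) ** 0.5 / torch.linalg.matrix_norm(W, ord=2)  # set ||W||_{2->2}
    x = torch.randn(d_in)                                            # ||x||_2 ~ sqrt(d_in)
    print(d, (W @ x).norm().item() / d_out ** 0.5)                   # roughly constant in d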

2.3. Input-Tensor Alignment Phenomenon [Under Construction]

3. Steepest Descent in Riemannian and Non-Riemannian Manifolds

Let us consider the different cases of the geometry of $\bm{\mathcal{W}}$ induced by the choice of norm $||\cdot||$.

3.1. $\bm{\mathcal{W}}$ is Euclidean

That is, we pick the Frobenius norm $||\cdot||_F$ as our norm. In this case, our points, differentials, and gradients are all already in standard Euclidean coordinates. Thus,

$$ \begin{align*} \widehat{U} &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} D\mathcal{L}(W)[\Delta W]\newline &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} \langle \partial\mathcal{L}(W), \Delta W \rangle_F\newline &= -\frac{\partial\mathcal{L}(W)}{||\partial\mathcal{L}(W)||_F}\newline \widehat{U} &\approx -\frac{\partial\mathcal{L}(W; \xi)}{||\partial\mathcal{L}(W; \xi)||_F}\newline \end{align*} $$

Thus our update rule becomes, $$ \begin{align*} W_{t+1} &= W_t + \lambda \widehat{U}\newline W_{t+1} &= W_t - \hat{\lambda} \partial\mathcal{L}(W; \xi) \end{align*} $$ where $\hat{\lambda} = \frac{\lambda}{||\partial\mathcal{L}(W; \xi)||_F}$. This is simply Stochastic Gradient Descent (SGD) with an adaptive learning rate.
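In code, this update is just a normalized stochastic gradient step; a minimal sketch (the function name and the small epsilon guard are my own, not from the original text):

import torch

def normalized_sgd_step(W: torch.Tensor, grad: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    # Steepest descent under the Frobenius norm: step along -grad with an
    # adaptive learning rate lr / ||grad||_F.
    return W - lr * grad / (grad.norm() + 1e-12)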

3.2. $\bm{\mathcal{W}}$ is a Riemannian Manifold

That is, our choice of norm $||\cdot||$ is induced by a smoothly-varying metric $g_W(\cdot, \cdot): T_W \bm{\mathcal{W}} \times T_W \bm{\mathcal{W}} \rightarrow \mathbb{R}$ for each $W \in \bm{\mathcal{W}}$ such that the parallelogram law holds and that, $$ \begin{align*} ||X|| &= \sqrt{g_W(X, X)}&&\forall X \in T_W\mathcal{W}\newline \text{and}\quad g_W(X, Y) &= \langle X, Y \rangle_{G_W} = \langle G_W X, Y \rangle_F = \text{tr}(X^T G_W Y) &&\forall X,Y \in T_W\mathcal{W} \end{align*} $$ for some (symmetric) positive-definite matrix $G_W$ that may depend on the point $W$.

Special Cases.

  1. Euclidean Manifold: $G_W = I$ for all $W \in \bm{\mathcal{W}}$. In this case we have, $$||X|| = \sqrt{g_W(X, X)} = \sqrt{\langle I X, X \rangle_F} = \sqrt{\langle X, X \rangle_F} = ||X||_F,\quad\forall X \in T_W\mathcal{W}$$ which is simply the Euclidean case above.
  2. Euclidean Manifold in Disguise: $G_W$ is a constant matrix, i.e. it does not depend on $W$, but may not be the identity matrix. Since the metric matrix $G_W$ is guaranteed to be (symmetric) positive-definite, we can always factor it as $G_W = C^T C$ for some invertible matrix $C$. Thus, $$||X|| = \sqrt{g_W(X, X)} = \sqrt{\text{tr}(X^T C^T C X)} = \sqrt{\langle \overline{X}, \overline{X} \rangle_F} = ||\overline{X}||_F\quad\forall X \in T_W\mathcal{W}$$ where $\overline{X} = CX\in CT_W\mathcal{W}$. This means that, up to a simple, linear change of coordinates, this case is equivalent to the Euclidean case above.

Our proofs below still hold in these special cases. But note that, in general, the metric matrix $G_W$ may depend on the point $W \in \mathcal{W}$ and thus potentially induce a non-zero curvature somewhere on the manifold, making it non-Euclidean.

An interesting property of Riemannian manifolds is that we have a canonical bijection between differentials $D\mathcal{L}(W)[\cdot] \in T_W^* \bm{\mathcal{W}}$ and gradients $\nabla \mathcal{L}(W) \in T_W \bm{\mathcal{W}}$ such that, $$ \begin{align*} D\mathcal{L}(W)[\cdot] &= \langle \nabla \mathcal{L}(W), \cdot \rangle\newline D\mathcal{L}(W)[\cdot] &= \langle \underbrace{G_W\nabla \mathcal{L}(W)}_{\partial\mathcal{L}(W)}, \cdot \rangle_F \end{align*} $$ Thus, $$ \begin{align*} G_W\nabla \mathcal{L}(W) &= \partial\mathcal{L}(W)\newline \nabla \mathcal{L}(W) &= G_W^{-1} \partial\mathcal{L}(W)\newline \end{align*} $$ Thus, $$ \begin{align} \widehat{U} &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} D\mathcal{L}(W)[\Delta W]\nonumber\newline &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} \langle \nabla \mathcal{L}(W), \Delta W \rangle\nonumber\newline &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} \langle G_W^{-1} \partial\mathcal{L}(W), \Delta W \rangle\newline &= -\frac{G_W^{-1} \partial\mathcal{L}(W)}{||G_W^{-1}\partial\mathcal{L}(W)||}\nonumber\newline \widehat{U} &\approx -\frac{G_W^{-1}\partial\mathcal{L}(W; \xi)}{||G_W^{-1}\partial\mathcal{L}(W; \xi)||}\nonumber\newline \end{align} $$ where the maximum above can be achieved by aligning $\Delta W$ with $G_W^{-1}\partial\mathcal{L}(W)$ and scaling such that $||\Delta W|| = 1$ (Absil, 2008). Thus our update rule becomes, $$ \begin{align*} W_{t+1} &= W_t + \lambda \widehat{U}\newline W_{t+1} &= W_t - \hat{\lambda} G_{W_t}^{-1}\partial\mathcal{L}(W_t) \end{align*} $$ where $\hat{\lambda} = \frac{\lambda}{||G_{W_t}^{-1}\partial\mathcal{L}(W_t)||}$. This is Riemannian Stochastic Gradient Descent (RSGD) with an adaptive learning rate. And if we let $P_W = G_W^{-1}$ be the preconditioner at point $W$, we can relate this to Preconditioned Stochastic Gradient Descent (PSGD) algorithms (Li, 2015; Pooladzandi et al., 2024).

Important Takeaway: In a Riemannian manifold, the preconditioner $P_W$ is unique and well-defined at each point $W \in \mathcal{W}$, but may not be constant across the manifold. Thus, as we move across the manifold, we may need to recompute or update our running estimate of the preconditioner. However, if our updates are “small-enough” by some definition, then we may not need to update it at every step; near convergence, or even earlier, it would suffice to update only every $K > 1$ steps for some positive integer $K$ chosen a priori.
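A minimal sketch of this takeaway (the interface is hypothetical: `metric_fn` is assumed to return the metric matrix $G_W$ for the flattened parameter and `grad_fn` a stochastic differential), where the preconditioner $P_W = G_W^{-1}$ is only recomputed every $K$ steps:

import torch

def riemannian_sgd(W, grad_fn, metric_fn, lr=1e-2, steps=100, refresh_every=10):
    P = None
    for t in range(steps):
        if P is None or t % refresh_every == 0:
            P = torch.linalg.inv(metric_fn(W))      # lazily refresh the preconditioner
        g = grad_fn(W).reshape(-1)
        u = (P @ g).reshape(W.shape)                # gradient = G_W^{-1} (differential)
        W = W - lr * u / (u.norm() + 1e-12)         # normalized step (Frobenius norm, for simplicity)
    return W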

3.3. $\bm{\mathcal{W}}$ is a Non-Riemannian Manifold

In this case, our choice of norm $||\cdot||$ does not admit a well-behaved metric $g_W(\cdot, \cdot)$ and consequently also does not admit a well-behaved inner product $\langle \cdot, \cdot \rangle$ such that $||\cdot|| = \sqrt{\langle \cdot, \cdot \rangle}$ for all $W \in \mathcal{W}$. Our differentials $D\mathcal{L}(W)[\cdot]$ are still well-defined, but we no longer have the bijective relationship between differentials and gradients. And so, we do not always have a unique $V \in T_W\mathcal{W}$ such that $D\mathcal{L}(W)[\cdot] = \langle V, \cdot \rangle$ if this inner product even exists.

While we still have access to the stochastic estimator of the differential in standard Euclidean coordinates $\partial\mathcal{L}(W)$ from Assumption 1, it no longer has geometric meaning by itself. More precisely, we no longer have information on how to transform $\partial\mathcal{L}(W)$ to the direction of steepest descent by a (possibly point-dependent) change of coordinates on the tangent space $T_W\mathcal{W}$.

We can, however, still use Assumption 1 to define a dualizer for our norm, $\text{dualizer}_{||\cdot||}(\cdots; W) : T^*_W\mathcal{W} \rightarrow T_W\mathcal{W}$ for $W \in \mathcal{W}$, that maps the differential we get empirically to a direction of steepest descent, $$ \begin{align*} \widehat{U} &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} D\mathcal{L}(W)[\Delta W]\newline &= -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} \langle \partial\mathcal{L}(W), \Delta W \rangle_F\newline &\approx -\arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} \langle \partial\mathcal{L}(W; \xi), \Delta W \rangle_F\newline \widehat{U} &= -\text{dualizer}_{||\cdot||}(\partial\mathcal{L}(W; \xi); W) \end{align*} $$ where, $$ \begin{equation} \text{dualizer}_{||\cdot||}(\cdots; W) = \arg\max_{\substack{\Delta W \in T_W\mathcal{W}\newline ||\Delta W|| = 1}} \langle \cdots, \Delta W \rangle_F \end{equation} $$

And to simplify our notation, we use $\text{dualizer}_{||\cdot||}(\cdots)$ if this map is independent of $W$.

Important Takeaways:

  1. In the Riemannian case, the dualization is equivalent to preconditioning. But in the non-Riemannian case, the dualizer may not be linear and may even have multiple solutions! However, as we will discuss in the next section, it may suffice to approximate the dualizer with a (linear) preconditioner in practice.
  2. The intuitions we have developed for the Riemannian case may not apply in the non-Riemannian case. Dualizers in the non-Riemannian case may behave counterintuitively, and may not even have approximate (linear) preconditioners that would work well in practice. And so, we must be vigilant about the geometry of the manifold we are working with.
  3. Instantaneous (i.e. preconditioner-free) dualization like what Muon does may be more appropriate in cases where the dualizer has a closed form solution and is cheap-enough to compute, according to one’s compute budget.

4. Muon as Steepest Descent in a Non-Riemannian Manifold

4.1. The Muon Optimizer

Algorithm 1 (Muon) by Jordan et al. (2024a). The weights are treated independently.

Inputs: Initial weight $W_0 \in \mathcal{W}$, and momentum term $M_0 \in \mathcal{W}$.

Parameters: Learning rate $\lambda > 0$, momentum decay $\beta \in [0, 1)$, and number of iterations $T \in \{1, 2, \ldots\}$

$\textbf{for } t = 0, 1, \ldots, T-1 \textbf{ do}\newline \text{… Compute }G_t = \partial\mathcal{L}(W_t; \xi)\newline \text{… Compute }W_{t+1}\text{ and }M_{t+1}\text{ as follows:}\newline \text{……. }M_{t+1} = \beta M_t + (1 - \beta) G_t\newline \text{……. }O_{t+1} = \text{approx-orth}(M_{t+1})\newline \text{……. }W_{t+1} = W_t - \lambda O_{t+1} $

Output: $W_T \in \mathcal{W}$.

Algorithm 2 (Approximate Orthogonalization through Newton-Schulz Iteration)

import torch
from torch import Tensor

def zeropower_via_newtonschulz(G: Tensor, steps: int = 5) -> Tensor:
    # Approximately semi-orthogonalize G via a quintic Newton-Schulz iteration.
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)  # coefficients tuned for fast convergence
    X = G.bfloat16()
    X /= (X.norm() + 1e-7)   # normalize so the spectral norm is at most ~1
    if G.size(-2) > G.size(-1):
        X = X.mT             # work in the wide orientation so X @ X.mT is the smaller Gram matrix
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X    # X <- a*X + b*(X X^T)X + c*(X X^T)^2 X
    if G.size(-2) > G.size(-1):
        X = X.mT             # undo the transpose
    return X
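As a quick usage check (a sketch, not part of the original post), we can verify that the output of the function above is approximately semi-orthogonal: all of its singular values end up of order $1$, no matter how spread out the singular values of the input are:

import torch

G = torch.randn(256, 1024)
X = zeropower_via_newtonschulz(G, steps=5).float()
S = torch.linalg.svdvals(X)
print(S.min().item(), S.max().item())   # both of order 1, i.e. X is approximately semi-orthogonal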

Muon (Algorithm 1) is an optimizer for matrix-valued parameters in neural networks (Jordan et al., 2024a). For each weight $W \in \mathcal{W}$, it first accumulates the momentum term, then approximately semi-orthogonalizes the result using the Newton-Schulz iteration (Algorithm 2), before applying it as an update to the weights.

We can fold the momentum term into $\partial\mathcal{L}(W; \xi)$ as it can be seen as a way to smooth out outlier empirical gradients. In fact, Mokhtari et al. (2018) and more recently Kovalev (2025) have shown that, under Muon’s update rule, the momentum term does become a tighter approximation of the true gradient $\partial\mathcal{L}(W)$ as the number of iterations $T$ increases.

And while Muon only approximately (semi-)orthogonalizes the gradient, we have found that it still empirically performs just as well as exact orthogonalization. We will discuss this in more detail in the next sections. Muon is also not the first optimizer that does approximate orthogonalization. For example, the randomized Sketching algorithm of Carlson et al. (2015) does this explicitly, and Shampoo (Gupta et al., 2018), CASPR (Surya et al., 2024), and PSGD (Li, 2015) do so implicitly through their preconditioners. However, Muon is the first non-randomized, preconditioner-free optimizer that explicitly aims to orthogonalize the gradient.

An interesting fact from prior work (Carlson et al., 2015; Flynn, 2017; Mokhtari et al., 2018, Bernstein et al., 2024) is that the dualizer for steepest descent under the spectral norm $||\cdot||_{2 \to 2}$ is exactly this orthogonalization process, $$ \begin{equation} \text{dualizer}_{||\cdot||_{2\to 2}}(\partial\mathcal{L}(W; \xi)) = UV^T \end{equation} $$ where $U\Sigma V^T$ is the singular value decomposition (SVD) of $\partial\mathcal{L}(W; \xi)$. The spectral norm does not admit a well-behaved inner product. And so, Muon, and related optimizers, can be thought of as steepest descent in a non-Riemannian manifold.
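For comparison, the exact dualizer $UV^T$ can be computed directly from an SVD. The following sketch (assuming the `zeropower_via_newtonschulz` function from Algorithm 2; the helper name is my own) checks that the Newton-Schulz approximation points in nearly the same direction:

import torch

def spectral_dualizer(G: torch.Tensor) -> torch.Tensor:
    # Exact dualizer under the spectral norm: U V^T from the (reduced) SVD.
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

G = torch.randn(256, 1024)
exact = spectral_dualizer(G)
approx = zeropower_via_newtonschulz(G, steps=5).float()
cos = (exact * approx).sum() / (exact.norm() * approx.norm())
print(cos.item())   # close to 1: the approximate and exact dualized gradients nearly align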

In the next section, we discuss why Muon can be viewed as an instantaneous version of already existing optimizers such as Shampoo, CASPR, and PSGD. We will also discuss an alternate perspective on how such preconditioning optimizers can be viewed as approximators to the dualizer of the spectral norm.

4.2. Approximating Dualization with Preconditioners

As we have shown in Section 3.2, the dualization process in Riemannian steepest descent always has an equivalent preconditioning process by letting $P_W = G_W^{-1}$ at every point $W \in \bm{\mathcal{W}}$. And likewise, if we have a preconditioning process where every $P_W$ is invertible, then it can be thought of as Riemannian steepest descent under the metric $G_W = P_W^{-1}$.

However, we may not always have an equivalent preconditioning process in non-Riemannian manifolds. And if there is one, it may not be unique. Here, let’s examine, in turn, multiple preconditioning processes that approximate the dualizer under the spectral norm.

4.2.1. The matrix sign function.

Definition (Matrix Sign Function). Let $X \in \mathbb{R}^{m \times n}$. The matrix sign function is defined as, $$ \begin{align*} \text{msign}(X) &= (XX^T)^{-1/2} X = X (X^T X)^{-1/2}\newline \text{msign}(X) &= U V^T \end{align*} $$ where $U\Sigma V^T$ is the singular value decomposition (SVD) of $X$.

Thus, one can interpret Muon’s update rule as simply the matrix sign function applied to $\partial\mathcal{L}(W_t)$. And if we let $G_t = \partial\mathcal{L}(W_t)$ and $P_{W_t} = (G_t G_t^T)^{-1/2}$, then we arrive at a preconditioner form of Muon’s update rule, $$ \begin{align*} W_{t+1} &= W_t - \lambda P_{W_t} G_t\newline &= W_t - \lambda (G_t G_t^T)^{-1/2} G_t\newline W_{t+1} &= W_t - \lambda UV^T \end{align*} $$

4.2.2. Shampoo. Let $G_t = \partial\mathcal{L}(W_t)$. Then Shampoo (Gupta et al., 2018; Anil et al., 2020) has the following update rule,

$$L_t := L_{t-1} + G_t G_t^T, \quad\quad R_t := R_{t-1} + G_t^T G_t$$ $$\Delta W_t = L^{-1/4}_t G_t R^{-1/4}_t$$

As first noted by Bernstein (2024a), Shampoo reduces to Muon if we disable the updates on the left and right preconditioners. That is, let $G_t = U\Sigma V^T$ be the singular value decomposition (SVD) of $G_t$ and let $$ L_t := G_t G_t^T\quad\quad R_t := G_t^T G_t $$ Then, $$\begin{aligned} \Delta W_t &= L^{-1/4}_t G_t R^{-1/4}_t\newline &= (G_t G_t^T)^{-1/4} G_t (G_t^T G_t)^{-1/4}\newline &= (U\Sigma V^T V \Sigma U^T)^{-1/4} U\Sigma V^T (V \Sigma U^T U \Sigma V^T)^{-1/4}\newline &= U\left(\frac{\Sigma}{\sqrt{\Sigma^2}} \right)V^T\newline \Delta W_t &= UV^T \end{aligned}$$ which is Muon’s update rule. $\blacksquare$
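A small numerical sketch (mine, not from the cited works) of this reduction: a single Shampoo step with the accumulators frozen at one gradient reproduces $UV^T$ up to numerical error.

import torch

torch.manual_seed(0)
G = torch.randn(32, 32).double()                 # full rank almost surely
U, S, Vh = torch.linalg.svd(G)

def inv_root(M, r):
    # Inverse r-th root of a symmetric positive-definite matrix via its eigendecomposition.
    e, Q = torch.linalg.eigh(M)
    return Q @ torch.diag(e ** (-1.0 / r)) @ Q.T

shampoo = inv_root(G @ G.T, 4) @ G @ inv_root(G.T @ G, 4)
print((shampoo - U @ Vh).abs().max())            # ~0 up to numerical error: matches Muon's U V^T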

4.2.3. CASPR. Let $G_t = \partial\mathcal{L}(W_t)$. Then CASPR (Surya et al., 2024) has the following update rule,

$$L_t := L_{t-1} + G_t G_t^T, \quad\quad R_t := R_{t-1} + G_t^T G_t$$ $$\tilde{L}_t := L_t + \epsilon I_m, \quad\quad \tilde{R}_t := R_t + \epsilon I_n$$ $$\Delta W_t = (\tilde{L}^{-1/2}_t G_t + 2 \tilde{L}^{-1/4}_t G_t \tilde{R}^{-1/4}_t + G_t \tilde{R}^{-1/2}_t)/4.$$

As previously noted by Cesista (2025), CASPR reduces to Muon if we disable the updates on the left and right preconditioners. That is, let $G_t = U\Sigma V^T$ be the singular value decomposition (SVD) of $G_t$ and let $$ L_t := G_t G_t^T\quad\quad R_t := G_t^T G_t $$ Then, $$\begin{aligned} \Delta W_t &= (\tilde{L}^{-1/2}_t G_t + 2 \tilde{L}^{-1/4}_t G_t \tilde{R}^{-1/4}_t + G_t \tilde{R}^{-1/2}_t)/4\newline &= (1/4) \cdot [(G_t G_t^T + \epsilon I_m)^{-1/2} G_t\newline &\quad\quad\quad+ 2 (G_t G_t^T + \epsilon I_m)^{-1/4} G_t (G_t^T G_t + \epsilon I_n)^{-1/4}\newline &\quad\quad\quad+ G_t (G_t^T G_t + \epsilon I_n)^{-1/2}]\newline &= (1/4) \cdot [[(U \Sigma V^T) (U \Sigma V^T)^T + \epsilon U I U^T]^{-1/2}(U \Sigma V^T)\newline &\quad\quad\quad + 2[(U \Sigma V^T) (U \Sigma V^T)^T + \epsilon U I U^T]^{-1/4}(U \Sigma V^T) [(U \Sigma V^T)^T (U \Sigma V^T) + \epsilon V I V^T]^{-1/4}\newline &\quad\quad\quad + (U \Sigma V^T) [(U \Sigma V^T)^T (U \Sigma V^T) + \epsilon V I V^T]^{-1/2}]\newline &= (1/4) \cdot [U(\Sigma(\Sigma^2 + \epsilon I)^{-1/2})V^T + U(2\Sigma(\Sigma^2 + \epsilon I)^{-1/2})V^T + U(\Sigma(\Sigma^2 + \epsilon I)^{-1/2})V^T]\newline &= (1/4) \cdot [U(1+2+1)(\Sigma(\Sigma^2 + \epsilon I)^{-1/2})V^T]\newline &= U\left(\frac{\Sigma}{\sqrt{\Sigma^2 + \epsilon I}} \right)V^T \newline \Delta W_t &\approx UV^T \end{aligned}$$ which is Muon’s update rule. $\blacksquare$
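And an analogous numerical sketch (again mine, not from the original post) for the CASPR reduction, including the damping term $\epsilon$:

import torch

torch.manual_seed(0)
G = torch.randn(32, 32).double()
U, S, Vh = torch.linalg.svd(G)
eps = 1e-8

def inv_root(M, r):
    # Inverse r-th root of a symmetric positive-definite matrix via its eigendecomposition.
    e, Q = torch.linalg.eigh(M)
    return Q @ torch.diag(e ** (-1.0 / r)) @ Q.T

Lt = G @ G.T + eps * torch.eye(32, dtype=torch.float64)
Rt = G.T @ G + eps * torch.eye(32, dtype=torch.float64)
caspr = (inv_root(Lt, 2) @ G + 2 * inv_root(Lt, 4) @ G @ inv_root(Rt, 4) + G @ inv_root(Rt, 2)) / 4
print((caspr - U @ Vh).abs().max())   # small, and vanishes as eps -> 0: matches Muon's update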

4.2.4. PSGD Family. [Under Review] This family of optimizers (Li, 2015 & 2018; Pooladzandi, 2024) explicitly tries to learn the preconditioner $\mathcal{P}(\cdot; W)$ according to some criterion that ensures training stability and, potentially, faster convergence. This criterion involves the noise suppression gain, defined as, $$ \text{noise\_suppression\_gain}_{||\cdot||_F}(P) = \frac{\mathbb{E}[||H_0^{-1}\epsilon'||_F^2]}{\mathbb{E}[||P\epsilon'||_F^2]} = \frac{\mathbb{E}[(\epsilon')^T H_0^{-2} \epsilon']}{\mathbb{E}[(\epsilon')^T P^2 \epsilon']}, $$ where $\epsilon'$ is some (matrix-valued) noise term on the Hessian $H$. Maximizing this gain aims to reduce the noise of the preconditioned gradients.

We get different update rules depending on which Lie group we restrict the preconditioner $P$ to. However, the criterion above typically leads to update rules that whiten, i.e. decorrelate, the entries of the gradient $\partial\mathcal{L}(W; \xi)$. And so while Muon can be seen as an instantaneous version of PSGD, an important difference is that Muon merely projects the gradient to its nearest (semi-)orthogonal matrix and does not necessarily decorrelate its entries.

For future work, it would also be interesting to see what kind of update rules we get if we measure the noise suppression gain with respect to the spectral norm instead of the Frobenius norm. That is, $$ \text{noise\_suppression\_gain}_{||\cdot||_{2\to 2}}(P) = \frac{\mathbb{E}[||H_0^{-1}\epsilon'||_{2\to 2}]}{\mathbb{E}[||P\epsilon'||_{2\to 2}]} $$

5. Steepest Descent under Elementwise $p$-Norms and Schatten-$p$ Norms

Definition 2 (Vector $p$-Norms). Given $p \in [1, \infty]$, the vector $p$-norm of a finite-dimensional, real-valued vector $\bm{x} \in \mathbb{R}^n$ is defined as, $$ ||\bm{x}||_p = \begin{cases} \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p} & \text{if } 1 \leq p < \infty\newline \max_{i} |x_i| & \text{if } p = \infty \end{cases} $$

Examples:

  1. $p = 1$: The Manhattan/Taxicab norm, $||\bm{x}||_1 = \sum_{i=1}^{n} |x_i|$
  2. $p = 2$: The Euclidean norm, $||\bm{x}||_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2} = ||\bm{x}||_F$
  3. $p = \infty$: The Max norm, $||\bm{x}||_\infty = \max_{i} |x_i|$

Definition 3 (Matrix Elementwise $p$-Norms). Given $p \in [1, \infty]$, the elementwise $p$-norm of a finite-dimensional, real-valued matrix $X \in \mathbb{R}^{m \times n}$ is defined as, $$ ||X||_{e,p} = \begin{cases} \left(\sum_{i=1}^{m} \sum_{j=1}^{n} |X_{ij}|^p\right)^{1/p} & \text{if } 1 \leq p < \infty\newline \max_{i,j} |X_{ij}| & \text{if } p = \infty \end{cases} $$ or equivalently, $$||X||_{e,p} = ||\text{vec}(X)||_p,$$ where $\text{vec}(\cdot)$ is the vectorization operator that stacks the columns of $X$ into a single vector.

Examples:

  1. $p = 1$: The Sum-of-Sums norm, $||X||_{e,1} = \sum_{i=1}^{m} \sum_{j=1}^{n} |X_{ij}|$
  2. $p = 2$: The Frobenius norm, $||X||_{e,2} = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |X_{ij}|^2} = ||X||_F$
  3. $p = \infty$: The Max-of-Max norm, $||X||_{e,\infty} = \max_{i,j} |X_{ij}|$

Definition 4 (Schatten-$p$ Norms). Given $p \in [1, \infty]$, the Schatten-$p$ norm of a finite-dimensional, real-valued matrix $X \in \mathbb{R}^{m \times n}$ is defined as, $$ ||X||_{S_p} = \begin{cases} \left(\sum_{i=1}^{\min(m, n)} |\sigma_i(X)|^p\right)^{1/p} & \text{if } 1 \leq p < \infty\newline \max_{i} \sigma_i(X) & \text{if } p = \infty \end{cases} $$ where $\sigma(X) = (\sigma_1(X), \ldots, \sigma_{\min(m,n)}(X))$ are the singular values of $X$. Or equivalently, $$||X||_{S_p} = ||\sigma(X)||_p$$

Examples:

  1. $p = 1$: The Nuclear norm, $||A||_{S_1} = \sum_{i=1}^{\min(m,n)} |\sigma_i(A)| = ||A||_{\text{nuc}}$
  2. $p = 2$: The Frobenius norm, $||A||_{S_2} = \left(\sum_{i=1}^{\min(m,n)} |\sigma_i(A)|^2\right)^{\frac{1}{2}} = ||A||_F$
  3. $p = \infty$: The Spectral norm, $||A||_{S_{\infty}} = \max_{i} \sigma_i(A) = ||A||_{2 \to 2}$

A special case is when $p = 2$, and so we have the Frobenius norm. Equipping $\mathcal{W}$ with this norm gives us the standard Euclidean manifold, which is Riemannian. However, Propositions (6) and (7) below still apply.

And to find the dualizers for the Schatten-$p$ norms, we will use the following inequality,

Theorem 5 (von Neumann’s Trace Inequality). Let $A, B \in \mathbb{R}^{m \times n}$. Then the following inequality holds, $$\text{tr}(A^TB) \leq \sum_{i=1}^{\min(m,n)} \sigma_i(A) \sigma_i(B),$$ where $\sigma(A) = (\sigma_1(A), \ldots, \sigma_{\min(m,n)}(A))$ and $\sigma(B) = (\sigma_1(B), \ldots, \sigma_{\min(m,n)}(B))$ are the singular values of $A$ and $B$, respectively. And equality holds if and only if $A$ and $B$ share singular vectors. If so, then, $$ \begin{align*} \text{tr}(A^TB) &= \sum_{i=1}^{\min(m,n)} \sigma_i(A) \sigma_i(B)\newline \langle A, B\rangle_F &= \langle \sigma(A), \sigma(B) \rangle_F \end{align*} $$

5.1. Dualizers for Elementwise $p$-Norms and Schatten-$p$ Norms

Proposition 7. Given $p \in [1, \infty]$, the dualizer for the Schatten-$p$ norm is: $$ \text{dualizer}_{||\cdot||_{S_p}}(X) = \begin{cases} U \frac{\text{diag}\left(\sigma_1(X)^{q-1}, \ldots, \sigma_{\min(m,n)}(X)^{q-1}\right)}{||X||_{S_q}^{q-1}} V^T & \text{if } 1 \leq p < \infty\newline UV^T & \text{if } p = \infty \end{cases} $$ where $\frac{1}{p} + \frac{1}{q} = 1$, and $X = U\Sigma V^T$ is the singular value decomposition of $X \in \mathbb{R}^{m \times n}$.

Proof: For a given $X \in T_W\mathcal{W}$ at $W \in \mathcal{W}$, let $T^* \in T_W\mathcal{W}$ be, $$ \begin{align*} T^* &= \text{dualizer}_{||\cdot||_{S_p}}(X; W)\newline T^* &= \arg\max_{\substack{T \in T_W\mathcal{W}\newline ||T||_{S_p} = 1}} \langle X, T \rangle_F\newline T^* &= \arg\max_{\substack{T \in T_W\mathcal{W}\newline ||T||_{S_p} = 1}} \text{tr}(X^T T) \end{align*} $$ Then from von Neumann’s Trace Inequality, we know that $T^*$ must share singular vectors with $X$ and that, $$ \begin{align*} T^* &= \arg\max_{\substack{T \in T_W\mathcal{W}\newline ||T||_{S_p} = 1}} \sum_{i=1}^{\min(m,n)} \sigma_i(X) \sigma_i(T)\newline T^* &= \arg\max_{\substack{T \in T_W\mathcal{W}\newline ||\sigma(T)||_{p} = 1}} \langle \sigma(X), \sigma(T) \rangle_F \end{align*} $$ Thus, our optimization problem reduces to, $$\arg\max_{\sigma(T)} \sum_{i=1}^{\min(m,n)} \sigma_i(X) \sigma_i(T) \quad\text{s.t.}\quad \sum_{i=1}^{\min(m,n)} \sigma_{i}(T)^p = 1$$ And solving via Lagrange multipliers, we have, $$\sigma_i(T) = \frac{\sigma_i(X)^{q-1}}{||X||_{S_q}^{q-1}}$$ where $\frac{1}{p} + \frac{1}{q} = 1$. Note that this is independent of $W$. Hence, $$T^* = \text{dualizer}_{||\cdot||_{S_p}}(X) = U \frac{\text{diag}\left(\sigma_1(X)^{q-1}, \ldots, \sigma_{\min(m,n)}(X)^{q-1}\right)}{||X||_{S_q}^{q-1}} V^T\quad\blacksquare$$
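Proposition 7 can be implemented directly from an SVD. Below is a minimal sketch (my own; it assumes $1 < p \leq \infty$, since for $p = 1$ the conjugate exponent degenerates and the formula needs separate treatment):

import torch

def schatten_dualizer(X: torch.Tensor, p: float) -> torch.Tensor:
    # Dualizer under the Schatten-p norm (Proposition 7), assuming 1 < p <= inf.
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)
    if p == float("inf"):
        return U @ Vh                                   # spectral-norm case: U V^T
    q = p / (p - 1.0)                                   # Hölder conjugate: 1/p + 1/q = 1
    t = S ** (q - 1.0) / torch.linalg.vector_norm(S, ord=q) ** (q - 1.0)
    return U @ torch.diag(t) @ Vh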

5.2. Stochastic Gradient Descent and Muon as Special Cases of Steepest Descent under Schatten-$p$ Norms

5.2.1. Recovering SGD. Let $p = 2$, and so we have $q = 2$ and $||\cdot||_{S_2} = ||\cdot||_F$. Thus for $\partial\mathcal{L}(W; \xi) \in T^*_W\mathcal{W}$ at $W \in \mathcal{W}$, we have, $$ \begin{align*} \Delta W &= \text{dualizer}_{||\cdot||_{S_2}}(\partial\mathcal{L}(W; \xi); W)\newline &= U \frac{\text{diag}\left(\sigma_1(\partial\mathcal{L}(W; \xi))^{2-1}, \ldots, \sigma_{\min(m,n)}(\partial\mathcal{L}(W; \xi))^{2-1}\right)}{||\partial\mathcal{L}(W; \xi)||_{S_2}^{2-1}} V^T\newline \Delta W &= \frac{\partial\mathcal{L}(W; \xi)}{||\partial\mathcal{L}(W; \xi)||_F} \end{align*} $$ which matches the update direction of SGD (normalized here to unit Frobenius norm).

5.2.2. Recovering Muon. Let $p = \infty$, and so we have $q = 1$ and $||\cdot||_{S_\infty} = ||\cdot||_{2\to 2}$. Thus for $\partial\mathcal{L}(W; \xi) \in T^*_W\mathcal{W}$ at $W \in \mathcal{W}$, we have, $$ \begin{align*} \Delta W &= \text{dualizer}_{||\cdot||_{2 \to 2}}(\partial\mathcal{L}(W; \xi); W)\newline &= U \frac{\text{diag}\left(\sigma_1(\partial\mathcal{L}(W; \xi))^{1-1}, \ldots, \sigma_{\min(m,n)}(\partial\mathcal{L}(W; \xi))^{1-1}\right)}{||\partial\mathcal{L}(W; \xi)||_{S_1}^{1-1}} V^T\newline \Delta W &= UV^T \end{align*} $$ which matches the update rule we expect from Muon.
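Both special cases can be checked directly from the SVD. A short, self-contained sketch (the gradient here is a random Gaussian stand-in, an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 32))              # stands in for a stochastic gradient
U, s, Vt = np.linalg.svd(G, full_matrices=False)

# p = 2 (q = 2): the dualized gradient is the Frobenius-normalized gradient,
# i.e., the direction of a plain (normalized) SGD step.
sgd_step = U @ np.diag(s / np.linalg.norm(G, "fro")) @ Vt
assert np.allclose(sgd_step, G / np.linalg.norm(G, "fro"))

# p = inf (q = 1): all dualized singular values equal 1, giving U V^T --
# Muon's idealized update (Muon approximates U V^T with a Newton-Schulz iteration).
muon_step = U @ Vt
assert np.allclose(muon_step.T @ muon_step, np.eye(32))   # semi-orthogonal
```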

5.3. Why Muon Still Works Well in Practice Despite the Approximate Orthogonalization

We observe that, qualitatively, steepest descent under the Schatten-$p$ norm very quickly converges to steepest descent under the spectral norm as $p$ approaches $\infty$. This is probably why optimizers like Sketching and Muon work well in practice despite not perfectly orthogonalizing the gradients.

To support this, we show that (1) the variance of the singular values post-dualization, (2) the relative size, and (3) the stable rank of the dualized gradients under the Schatten-$p$ norm converge quadratically to those of the Spectral norm as $p$ approaches $\infty$. In fact, at $p = 32$, the results are already very close to those of the Spectral norm.

5.3.1. On the variance of singular values post-dualization

Proposition 8. The variance of the singular values post-dualization under the Schatten-$p$ Norm converges quadratically to $0$ as $p$ approaches $\infty$.

Proof: Let $X \in \mathbb{R}^{m \times n}$ and let $t_i$ be the $i$-th singular value post-dualization. From Proposition 7 earlier, we have $$ \begin{align*} t_i &= \left(\frac{\sigma_i(X)}{||X||_{S_q}}\right)^{q-1}\newline t_i &= \exp\left((q-1)\ln\frac{\sigma_i(X)}{||X||_{S_q}}\right)\newline t_i &\approx 1 + (q-1)\ln\frac{\sigma_i(X)}{||X||_{S_q}} \end{align*} $$ where the last line follows from the first-order Taylor approximation $e^x \approx 1 + x$, which becomes accurate as $q \to 1$ (i.e., as $p \to \infty$). Thus, the mean and variance are: $$ \begin{align*} \mathbb{E}[t_i] &\approx 1 + (q-1)\mathbb{E}\left[\ln\frac{\sigma_i(X)}{||X||_{S_q}}\right]\newline \mathbb{E}[t_i] &\approx 1 + (q-1)\ln\frac{\mathbb{E}[\sigma_i(X)]}{||X||_{S_q}}\newline t_i - \mathbb{E}[t_i] &\approx (q-1)\ln\frac{\sigma_i(X)}{\mathbb{E}[\sigma_i(X)]}\newline Var[t_i] &\approx (q-1)^2\mathbb{E}\left[\ln^2\frac{\sigma_i(X)}{\mathbb{E}[\sigma_i(X)]}\right]\newline Var[t_i] &\approx \frac{1}{(p-1)^2}\mathbb{E}\left[\ln^2\frac{\sigma_i(X)}{\mathbb{E}[\sigma_i(X)]}\right] \end{align*} $$ where the last line uses $q - 1 = \frac{1}{p-1}$. Hence, the variance of the singular values post-dualization converges quadratically to $0$ as $p$ approaches $\infty$. $\blacksquare$

Empirically, we can see this in the following plot. At $p = 32$, the variance of the resulting singular values is already close to $0$.
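A rough sketch of this kind of experiment (using a random Gaussian matrix as a stand-in for a gradient; the shape and the values of $p$ are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 128))
s = np.linalg.svd(G, compute_uv=False)

for p in [2, 4, 8, 16, 32]:
    q = p / (p - 1)
    t = (s / (s ** q).sum() ** (1 / q)) ** (q - 1)   # dualized singular values
    # Variance shrinks roughly on the 1/(p-1)^2 scale predicted by Proposition 8.
    print(f"p = {p:>2}:  Var[t_i] = {t.var():.3e}   (1/(p-1)^2 = {1 / (p - 1) ** 2:.3e})")
```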

5.3.2. On the relative size and stable rank of gradients

Definition 6 (Relative Size of a Gradient). Given a norm $||\cdot||$ chosen a priori, the relative size of a gradient-update $\Delta W$ relative to the parameter matrix $W$ is defined as: $$\text{relsize}(\Delta W) = \frac{||\Delta W||}{||W||}$$

Definition 7 (Stable Rank). The stable rank of a matrix $A$ is defined as $$\text{srank}(A) = \frac{||A||_F^2}{||A||_{2 \to 2}^2}$$

As we can see in the following plot, the raw gradients have very low stable rank, but the stable rank of the gradients post-dualization under the Schatten-$p$ norm converges very quickly to that of the Spectral norm as $p$ approaches $\infty$.
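A minimal sketch of this effect, using a synthetic "gradient" with a rapidly decaying spectrum as a stand-in for the low-stable-rank raw gradients (the spectrum, shapes, and values of $p$ below are our illustrative assumptions):

```python
import numpy as np

def stable_rank(A: np.ndarray) -> float:
    """srank(A) = ||A||_F^2 / ||A||_{2->2}^2 (Definition 7)."""
    return float(np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2)

rng = np.random.default_rng(0)
U, _, Vt = np.linalg.svd(rng.standard_normal((256, 128)), full_matrices=False)
s = np.exp(-np.arange(128) / 4.0)            # rapidly decaying singular values
G = U @ np.diag(s) @ Vt

print(f"raw gradient:         srank = {stable_rank(G):6.1f}")
for p in [2.0, 8.0, 32.0, np.inf]:
    if np.isinf(p):
        t = np.ones_like(s)                              # spectral-norm dualizer: U V^T
    else:
        q = p / (p - 1.0)
        t = (s / (s ** q).sum() ** (1.0 / q)) ** (q - 1.0)
    print(f"dualized (p = {p:>4}): srank = {stable_rank(U @ np.diag(t) @ Vt):6.1f}")
```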

One can interpret this as follows: for a large enough $p$, the dualized gradient is already very close to “maximal” in both relative size and stable rank, so increasing $p$ further offers only rapidly diminishing returns.

A side-effect of this is that it allows the model parameters to “escape” the small region around the initialization and explore the parameter space more effectively, in contrast to the behavior of SGD and Adam reported in prior work (Lee et al., 2020; Jesus et al., 2021). This phenomenon is known as weight erasure (Bernstein, 2024b).

6. Optimizing Muon’s Newton-Schulz Coefficients [Under Construction]

7. Convergence Guarantees [Under Construction]

Acknowledgements

Many thanks to Jeremy Bernstein, Omead Pooladzandi, Simo Ryu, and Antonio Silveti-Falls for their feedback on this work. I have been (and still am) trying to incorporate their feedback into this work. Also many thanks to Keller Jordan for developing Muon in the first place and for conversations on optimization and machine learning in general.

How to Cite

@misc{cesista2025sdnr,
  author = {Franz Louis Cesista},
  title = {Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds},
  year = {2025},
  url = {http://leloykun.github.io/ponder/steepest-descent-non-riemannian/},
}

References

  1. Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein (2024). Muon: An optimizer for hidden layers in neural networks. Available at: https://kellerjordan.github.io/posts/muon/
  2. Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and @fernbear.bsky.social and Boza Vlado and You Jiacheng and Franz Cesista and Braden Koszarsky and @Grad62304977 (2024). modded-nanogpt: Speedrunning the NanoGPT baseline. URL https://github.com/KellerJordan/modded-nanogpt/
  3. Diederik P. Kingma, Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. URL https://arxiv.org/abs/1412.6980
  4. Moonshot AI Team (2025). Muon is Scalable for LLM Training. URL https://arxiv.org/abs/2502.16982
  5. Jeremy Bernstein and Laker Newhouse. “Old optimizer, new norm: An anthology.” arXiv preprint arXiv:2409.20325 (2024).
  6. Jeremy Bernstein, Laker Newhouse (2024). Modular Duality in Deep Learning. URL https://arxiv.org/abs/2410.21265
  7. Jeremy Bernstein (2024). Spectral perspective on Shampoo. X post. URL https://x.com/jxbz/status/1819846348130418706
  8. Jeremy Bernstein (2024). “Weight erasure.” Available at: https://docs.modula.systems/examples/weight-erasure/
  9. Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, Jeremy Bernstein (2024). Scalable Optimization in the Modular Norm. URL https://arxiv.org/abs/2405.14813
  10. Greg Yang, James B. Simon, Jeremy Bernstein (2024). A Spectral Condition for Feature Learning. URL https://arxiv.org/abs/2310.17813
  11. Xi-Lin Li (2015). Preconditioned Stochastic Gradient Descent. URL https://arxiv.org/abs/1512.04202
  12. Xi-Lin Li (2018). Preconditioner on Matrix Lie Group for SGD. URL https://arxiv.org/abs/1809.10232
  13. Omead Pooladzandi, Xi-Lin Li (2024). Curvature-Informed SGD via General Purpose Lie-Group Preconditioners. URL https://arxiv.org/abs/2402.04553
  14. P.-A. Absil, R. Mahony, and Rodolphe Sepulchre (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press.
  15. David E Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, Volkan Cevher. Preconditioned Spectral Descent for Deep Learning. In Advances in Neural Information Processing Systems 28 (NIPS 2015), 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/hash/f50a6c02a3fc5a3a5d4d9391f05f3efc-Abstract.html
  16. Thomas Flynn. The duality structure gradient descent algorithm: Analysis and applications to neural networks. arXiv:1708.00523, 2017. URL https://arxiv.org/abs/1708.00523
  17. Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. The American Statistician, 58(1):30–37.
  18. Elhage, et al., “Toy Models of Superposition”, Transformer Circuits Thread, 2022.
  19. Lucas Prieto, Melih Barsbey, Pedro A.M. Mediano, Tolga Birdal (2025). Grokking at the Edge of Numerical Stability. URL https://arxiv.org/abs/2501.04697
  20. Vineet Gupta, Tomer Koren, Yoram Singer (2018). Shampoo: Preconditioned Stochastic Tensor Optimization. URL https://arxiv.org/abs/1802.09568
  21. Rohan Anil et al. “Scalable second order optimization for deep learning.” arXiv preprint arXiv:2002.09018 (2020).
  22. Surya, S., Duvvuri, Devvrit, F., Anil, R., Hsieh, C., & Dhillon, I.S. (2024). Combining Axes Preconditioners through Kronecker Approximation for Deep Learning. International Conference on Learning Representations.
  23. Franz Louis Cesista (2025). CASPR Without Accumulation is Muon. URL https://leloykun.github.io/ponder/caspr-wo-accum-is-muon/
  24. Rohan Anil. “Just some fun linear algebra.” X post, 6 Oct. 2024, Available at: https://x.com/_arohan_/status/1843050297985466565.
  25. Dmitry Kovalev (2025). Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization. Available at: https://arxiv.org/abs/2503.12645
  26. Lee, Jaehoon, et al. “Wide Neural Networks of Any Depth Evolve as Linear Models under Gradient Descent.” Journal of Statistical Mechanics: Theory and Experiment, vol. 2020, no. 12, Dec. 2020, p. 124002. Crossref, https://doi.org/10.1088/1742-5468/abc62b.
  27. Jesus, Ricardo J., et al. “Effect of Initial Configuration of Weights on Training and Function of Artificial Neural Networks.” Mathematics, vol. 9, no. 18, Sept. 2021, p. 2246. Crossref, https://doi.org/10.3390/math9182246.
  28. Aryan Mokhtari, Hamed Hassani, Amin Karbasi (2018). Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization. URL https://arxiv.org/abs/1804.09554
  29. David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams (1986). Learning representations by back-propagating errors. URL https://www.nature.com/articles/323533a0
  30. Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. https://arxiv.org/abs/1703.03864
  31. Cornelius V. Braun, Robert T. Lange, Marc Toussaint (2024). Stein Variational Evolution Strategies. URL https://arxiv.org/abs/2410.10390
  32. James Martens, Roger Grosse (2020). Optimizing Neural Networks with Kronecker-factored Approximate Curvature. URL https://arxiv.org/abs/1503.05671
  33. Jordan, P. and von Neumann, J. (1935). On Inner Products in Linear, Metric Spaces.