1. Introduction

In this blog post, we shall consider the problem of steepest descent on Finsler-structured (matrix) geometries. This problem arises naturally in deep learning optimization because we want model training to be both fast and robust. That is, we want our updates to maximally change the weights and activations (or “features”) so that we can train larger models more quickly, while also keeping activation, weight, and gradient norms within reasonable bounds so that our expensive training runs do not blow up halfway through.

$\blacksquare$ Let us start with robustness: as we discussed in Training Transformers with Enforced Lipschitz Constants and Rethinking Maximal Update Parametrization: Steepest Descent on the Spectral Ball, we can achieve our robustness goals by enforcing both layer-wise and global Lipschitz constraints on our models. Intuitively, Lipschitzness is the knob that controls how fast our features “grow” in the forward pass and how fast our gradients “grow” in the backward pass, so lower Lipschitz constants yield more stable training dynamics and flatter loss landscapes. But why do we want layer-wise Lipschitz constraints on top of a global one? Because a global constraint alone is too weak: we can convert any possibly-highly-unstable $L$-Lipschitz model into a globally 1-Lipschitz model simply by scaling its final logits by $1/L$, and this does nothing to prevent intermediate activations and gradients from blowing up.

To enforce layer-wise Lipschitz constraints, we have to consider parameter-free and parametrized layers separately. For parameter-free layers, in Sensitivity and Sharpness of n-Simplicial Attention we previously derived a parametrization of $n$-Simplicial Attention (Roy et al., 2025) that is 1-Lipschitz by construction, generalizing prior work by Large et al. (2024). And for parametrized layers, we can enforce Lipschitz constraints by bounding the induced operator norms of the weights (Newhouse et al., 2025). In this blog post, we shall focus on the latter.

A neat consequence of controlling the weight and update norms is that, by the equivalence of norms in finite dimensions, we also bound the size of the individual weight and update matrix entries. A low-enough bound allows us to shave bits off our floating-point representations without overflowing or underflowing, enabling more aggressive quantization-aware training basically “for free”, as we demonstrated in Training Transformers with Enforced Lipschitz Constants.
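
For instance (a standard one-line check, added here for concreteness rather than taken from the original post): writing $e_i$ and $e_j$ for standard basis vectors, every entry of a matrix $A$ is bounded by its spectral norm, $$|A_{ij}| = |e_i^\top A\, e_j| \leq \|e_i\|_2 \, \|A\|_{2\to2} \, \|e_j\|_2 = \|A\|_{2\to2},$$ so a cap on the operator norms of the weights and updates immediately caps how large any individual entry can get.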

$\blacksquare$ Now, let us consider our other goal: maximal updates for faster training. For this, we need to consider (1) how we measure the “size” of our updates and (2) how to make sure that our updates do not interfere with our weight control mechanisms. Combining these with our need to constrain the weights to some bounded set naturally leads us to the problem of optimization on some normed geometry. But which geometry and what norm?

In this blog post, we shall consider the general case where we constrain the weights to some manifold $\mathcal{M}$ whose points $W$ have tangent spaces $T_{W}\mathcal{M}$ equipped with a possibly point-dependent, non-Euclidean, or even non-Riemannian norm $\|\cdot\|_{W}$. Our method also works for cone geometries, such as the positive semidefinite cone and the spectral band, whose boundary points have tangent cones rather than tangent spaces: $V \in T_{W}\mathcal{M}$ no longer implies $-V \in T_{W}\mathcal{M}$.

Another neat consequence of doing steepest descent under an induced operator norm is that it enables learning rate transfer across model widths, provided that the feature norms from which we induce the operator norm also scale appropriately with model width (Pethick et al., 2025; Bernstein et al., 2024; Filatov et al., 2025). We will discuss this in more detail in Section 2.2.

This work expands on and generalizes prior work by Bernstein (2025), Keigwin et al. (2025), Cesista (2025), and Su (2025) on ‘Manifold Muon’.

2. Preliminaries

2.1. First-order optimization on Finsler geometries

Let $\mathcal{M}$ be our Finsler geometry of interest. That is, a constraint set $\mathcal{M} \subseteq \mathbb{R}^{m \times n}$ equipped, at each point $W \in \mathcal{M}$, with a closed, convex tangent set $T_{W}\mathcal{M}$ and a possibly point-dependent norm $\|\cdot\|_{W}$ on that tangent set. First-order optimization on such geometries goes as follows (a code sketch of this loop follows the list):

  1. Let $W_t \in \mathcal{M}$ be the ‘weight’ parameter at time step $t$. Compute the “raw gradient” $G_t = \nabla f(W_t)$ via e.g. backpropagation.
  2. Compute an ‘optimal’ descent direction $A^* \in T_{W_t} \mathcal{M}$ under the norm on the tangent set at $W_t$, $$\begin{equation} A^* = \arg\min_{A \in \mathbb{R}^{m \times n}} \langle G_t, A \rangle \quad \text{ s.t. } \quad \| A \|_{W_t} \leq \eta,\quad A \in T_{W_t}\mathcal{M}, \end{equation}$$ where $\eta > 0$ is the learning rate hyperparameter.
  3. Update the weight in the direction of $A^*$ and retract the result back to the manifold via a retraction map, $\texttt{retract}_{\mathcal{M}}: \mathbb{R}^{m \times n} \to \mathcal{M}$, $$W_{t+1} \leftarrow \texttt{retract}_{\mathcal{M}}(W_t + A^*).$$
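
For concreteness, here is a minimal sketch of this loop in Python. The function and argument names (`grad_f`, `solve_descent_direction`, `retract`) are placeholders of my own; Section 3 develops the actual solver for step 2.

```python
def steepest_descent(W0, grad_f, solve_descent_direction, retract, eta=0.1, steps=100):
    """Generic first-order loop on a constrained geometry (a sketch, not a fixed API).

    grad_f(W)                          -> raw gradient G_t, e.g. from backprop
    solve_descent_direction(G, W, eta) -> A* solving Equation (1) at W
    retract(W_plus_A)                  -> maps the updated point back onto the manifold M
    """
    W = W0
    for _ in range(steps):
        G = grad_f(W)                            # step 1: raw gradient
        A = solve_descent_direction(G, W, eta)   # step 2: constrained steepest-descent direction
        W = retract(W + A)                       # step 3: update, then retract back onto M
    return W
```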

Note that both constraints on $A$ in Equation (1) are membership constraints to closed convex sets, and so Equation (1) is simply a convex optimization problem. A special case is when $T_{W_t}\mathcal{M} = \mathbb{R}^{m \times n}$, in which case we can solve the problem with a single call to a Linear Minimization Oracle (LMO) (Pethick et al., 2025) for the chosen norm, $$\begin{align} A^* &= \arg\min_{\| A \|_{W_t} \leq \eta} \langle G_t, A \rangle \nonumber \\ &= -\eta \cdot \texttt{LMO}_{\|\cdot\|_{W_t}}(G_t). \nonumber \end{align}$$
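
As an illustrative example of this special case (my own, not tied to a particular geometry in this post), here is the LMO for the plain spectral-norm ball; it is exactly the $\texttt{msign}$ / polar-factor map that reappears in Section 3.2:

```python
import numpy as np

def lmo_spectral(G):
    """LMO for the spectral-norm unit ball: a maximizer of <G, A> over
    ||A||_{2->2} <= 1 is U V^T, where G = U S V^T is a (reduced) SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

# Unconstrained-tangent-space special case: A* = -eta * LMO(G).
eta, G = 0.1, np.random.randn(64, 32)
A_star = -eta * lmo_spectral(G)
```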

Unfortunately, as we discussed in Heuristic Solutions for Steepest Descent on the Stiefel Manifold, LMOs typically do not preserve tangency, so we need more involved methods to solve Equation (1). We discuss one such method, based on dual ascent, in Section 3.

2.2. “Natural” feature and weight norms

As we mentioned in the introduction, we can enable learning rate transfer across model widths by choosing a feature norm that scales appropriately with model width (Pethick et al., 2025; Bernstein et al., 2024; Filatov et al., 2025). We argue that the “natural” feature norm is one with the following two properties:

  1. It has to scale with the size of the entries. That is, if the entries are all $\pm 1$, then the norm should be $1$; likewise, if the entries are all $\pm r$, then the norm should be $r$. And,
  2. It has to be width-invariant, in a sense. Informally, if we widen a feature vector by duplicating it width-wise, its norm should remain unchanged. That is, for any positive integers $n$ and $k$, $$\| \underbrace{\begin{bmatrix} 1 & \ldots & 1 \end{bmatrix}}_{\text{width } n} \| = \| \underbrace{\begin{bmatrix} 1 & \ldots & 1 \end{bmatrix}}_{\text{width } k \cdot n} \|$$

The $\texttt{RMS}$ norm, $\| \cdot \|_{\texttt{RMS}} = \frac{1}{\sqrt{n}} \| \cdot \|_F$, satisfies both criteria, and so it is a good candidate for the “natural” feature norm. This then induces the $\texttt{RMS}\to\texttt{RMS}$ norm, $$\| A \|_{\texttt{RMS}\to\texttt{RMS}} = \sup_{X \neq 0} \frac{\| AX \|_{\texttt{RMS}}}{\| X \|_{\texttt{RMS}}} = \sqrt{\frac{n}{m}} \| A \|_{2 \to 2},$$ as a good candidate for the “natural” weight norm. Yang et al. (2024) make a similar argument.
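
As a quick numerical sanity check (my own illustration, treating features as 1-D vectors), both the width-invariance property and the $\sqrt{n/m}$ factor of the induced norm can be verified directly:

```python
import numpy as np

def rms_norm(x):
    """RMS norm of a width-n feature vector: ||x||_2 / sqrt(n)."""
    return np.linalg.norm(x) / np.sqrt(x.shape[-1])

def rms_to_rms_norm(A):
    """Induced RMS->RMS operator norm of an (m x n) matrix: sqrt(n/m) * ||A||_{2->2}."""
    m, n = A.shape
    return np.sqrt(n / m) * np.linalg.norm(A, ord=2)

x = np.ones(128)
print(rms_norm(x), rms_norm(np.tile(x, 4)))   # both 1.0: width-invariance under duplication

A = np.random.randn(256, 128)
_, _, Vt = np.linalg.svd(A)
x_top = Vt[0]                                 # the sup is attained at the top right singular vector
print(rms_norm(A @ x_top) / rms_norm(x_top), rms_to_rms_norm(A))   # these two agree
```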

3. Steepest descent on Finsler geometries via dual ascent

3.1. General strategy

Our goal is to solve Equation (1) for any choice of norm $\|\cdot\|_{W}$ and tangent set $T_{W}\mathcal{M}$. Let the latter be represented as, $$\begin{equation} T_{W}\mathcal{M} = \{ A \in \mathbb{R}^{m \times n} \mid L(A) \in -K \} \end{equation}$$ for some linear map $L: \mathbb{R}^{m \times n} \to \mathcal{Y}$ and a closed, convex cone $K \subseteq \mathcal{Y}$. Equality constraints can be represented by setting $K = \{0\}$. For example, for the Stiefel manifold, we have $L(A) = W^\top A + A^\top W$ and $K = \{0\}$.
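
To make the $L$ / $L^*$ bookkeeping concrete, here is a small NumPy sketch for the Stiefel example above. The adjoint formula $L^*(Y) = W(Y + Y^\top)$ is a quick computation that the post does not spell out, so treat it (and the helper names) as my own illustration; the snippet checks the adjoint identity numerically.

```python
import numpy as np

def L_stiefel(W, A):
    """Tangent-space constraint map of the Stiefel manifold: L(A) = W^T A + A^T W."""
    return W.T @ A + A.T @ W

def L_stiefel_adj(W, Y):
    """Adjoint of L: <L(A), Y> = <A, L*(Y)> with L*(Y) = W (Y + Y^T) (= 2 W Y for symmetric Y)."""
    return W @ (Y + Y.T)

# Numerical check of the adjoint identity at a random Stiefel point.
m, n = 8, 3
W, _ = np.linalg.qr(np.random.randn(m, n))      # a point on the Stiefel manifold
A, Y = np.random.randn(m, n), np.random.randn(n, n)
print(np.allclose(np.sum(L_stiefel(W, A) * Y), np.sum(A * L_stiefel_adj(W, Y))))   # True
```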

Let $\mathcal{Y}^*$ be the dual space of $\mathcal{Y}$. The adjoint of $L$, $L^*: \mathcal{Y}^* \to \mathbb{R}^{m \times n}$, is the unique linear map satisfying, $$\langle L(A), Y^* \rangle = \langle A, L^*(Y^*) \rangle, \quad \forall A \in \mathbb{R}^{m \times n}, Y^* \in \mathcal{Y}^*.$$ Restricting $Y^*$ to the dual cone $K^* \subseteq \mathcal{Y}^*$ then yields the Lagrangian of Equation (1), $$\begin{align} \mathcal{L}(A, Y^*) &= \langle G_t, A \rangle + \mathcal{i}_{\| \cdot \|_{W_t} \leq \eta}(A) + \langle Y^*, L(A) \rangle \nonumber \\ &= \mathcal{i}_{\| \cdot \|_{W_t} \leq \eta}(A) + \langle G_t + L^*(Y^*), A \rangle, \end{align}$$ where $\mathcal{i}_\mathcal{S}$ is the indicator function of the set $\mathcal{S}$. One can then check that, $$A^* = \arg\min_{A \in \mathbb{R}^{m \times n}} \left[ \max_{Y^* \in K^*} \mathcal{L}(A, Y^*) \right],$$ and, by Sion’s minimax theorem, we may swap the order of minimization and maximization, $$ \min_{\| A \|_{W_t} \leq \eta} \left[ \max_{Y^* \in K^*} \mathcal{L}(A, Y^*) \right] = \max_{Y^* \in K^*} \left[ \underbrace{\min_{\| A \|_{W_t} \leq \eta} \mathcal{L}(A, Y^*)}_{A(Y^*)} \right]$$

First, let us consider the primal minimizer, $$\begin{align} A(Y^*) &= \arg\min_{A \in \mathbb{R}^{m \times n}} \mathcal{L}(A, Y^*) \nonumber \\ &= \arg\min_{A \in \mathbb{R}^{m \times n}} \mathcal{i}_{\| \cdot \|_{W_t} \leq \eta}(A) + \langle G_t + L^*(Y^*), A \rangle \nonumber \\ &= \arg\min_{\| A \|_{W_t} \leq \eta} \langle G_t + L^*(Y^*), A \rangle \nonumber \\ &= -\eta\cdot\texttt{LMO}_{\| \cdot \|_{W_t}}(G_t + L^*(Y^*)) \nonumber \end{align}$$

This then yields the dual problem, $$\begin{equation} \max_{Y^* \in K^*} -\eta \| G_t + L^*(Y^*) \|_{W_t}^* \end{equation}$$ where $\| \cdot \|_{W_t}^*$ is the dual norm of $\| \cdot \|_{W_t}$. And by the chain rule, the above has supergradient, $$\begin{align} \nabla_{Y^*} (-\eta\| G_t + L^*(Y^*) \|_{W_t}^*) &= -\eta\cdot L(\texttt{LMO}_{\| \cdot \|_{W_t}}(G_t + L^*(Y^*))) \nonumber \end{align}$$ which we can use to run (projected) supergradient ascent on the dual variable $Y^*$.

Putting everything together, we have the following update rule, $$\begin{align} A_t &= -\eta\cdot\texttt{LMO}_{\| \cdot \|_{W_t}}(G_t + L^*(Y^*_t)) \\ Y^*_{t+1} &= \texttt{proj}_{K^*} \left(Y^*_{t} + \sigma L( A_t )\right) \end{align}$$ where $\sigma > 0$ is the dual ascent learning rate hyperparameter and $\texttt{proj}_{K^*}$ is the orthogonal projection onto the dual cone $K^*$. At convergence of this iteration, we have $A_t \to A^*$.
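
In code, the inner solver is a short loop (a hedged sketch; the function and argument names are mine, and the step size $\sigma$, iteration count, and lack of a stopping rule are illustrative). The callables it takes correspond exactly to the three components listed next:

```python
def solve_descent_direction(G, lmo, L_map, L_adj, proj_dual, Y0, eta=0.1, sigma=0.1, iters=100):
    """Dual ascent for Equation (1) at a fixed weight W_t (a sketch, not a fixed API).

    lmo(X)       -> argmax_{||A||_{W_t} <= 1} <X, A>   (Linear Minimization Oracle)
    L_map(A)     -> L(A), the tangent/cone constraint map
    L_adj(Y)     -> L*(Y), its adjoint
    proj_dual(Y) -> orthogonal projection onto the dual cone K*
    Y0           -> initial dual variable (e.g. zeros); assumed to be a plain array here,
                    so for product cones handle the components inside L_map/L_adj/proj_dual
    """
    Y = Y0
    for _ in range(iters):
        A = -eta * lmo(G + L_adj(Y))             # primal step: LMO of the shifted gradient
        Y = proj_dual(Y + sigma * L_map(A))      # dual ascent step, projected back onto K*
    return A
```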

In all, we only need three components to implement the above algorithm:

  1. The Linear Minimization Oracle (LMO) for the chosen norm $\| \cdot \|_{W}$;
  2. The linear map $L$ and its adjoint $L^*$ for the tangent/cone constraints; and
  3. The orthogonal projection onto the dual cone $K^*$.

3.2. Steepest descent on the Spectral Band under the $\texttt{RMS}\to\texttt{RMS}$ norm

Suppose that, during training, we want to constrain the singular values of our weights to lie within some band $[\alpha, \beta]$, to prevent features from either exploding or vanishing completely. Additionally, we pick the “natural” weight norm, the $\texttt{RMS}\to\texttt{RMS}$ norm, so that updates maximally change our features as measured in the RMS norm and so that learning rates transfer across model widths. Hence, we want to do steepest descent on the spectral band $\mathcal{S}_{[\alpha, \beta]}$ under the $\texttt{RMS}\to\texttt{RMS}$ norm.

We discussed several ways to do this in Appendix A1 of Rethinking Maximal Update Parametrization: Steepest Descent on the Spectral Ball. Here, we show that the dual ascent approach from that blog post is a special case of the general framework from the previous section. To see this, consider the tangent cone at a point $W_t$ on the spectral band, $$\begin{equation} T_{W_t} \mathcal{S}_{[\alpha, \beta]} = \{ A \in \mathbb{R}^{m \times n} : \underbrace{\texttt{sym}(U_{\alpha}^T A V_{\alpha}) \succeq 0}_{\text{don’t go below } \alpha}, \underbrace{\texttt{sym}(U_{\beta}^T A V_{\beta}) \preceq 0}_{\text{don’t go above } \beta} \} \end{equation}$$ where $\texttt{sym}(M) := \tfrac{1}{2}(M + M^T)$, and $U_\alpha$ and $V_\alpha$ collect the left and right singular vectors of $W_t$ corresponding to the singular value $\alpha$ (and likewise for $U_\beta$ and $V_\beta$). We can represent this as in Equation (2) by setting, $$\begin{align} K &= \mathbb{S}^{r_{\alpha}}_{-} \times \mathbb{S}^{r_{\beta}}_{+} \nonumber \\ L(A) &= (\texttt{sym}(U_{\alpha}^T A V_{\alpha}), \texttt{sym}(U_{\beta}^T A V_{\beta})), \nonumber \end{align}$$ where $r_\alpha$ and $r_\beta$ are the multiplicities of $\alpha$ and $\beta$ as singular values of $W_t$.
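
In code, the constraint map is only a couple of lines (a sketch; `Ua, Va, Ub, Vb` are assumed to hold the singular-vector blocks of $W_t$ at the $\alpha$ and $\beta$ faces, e.g. sliced out of an SVD of $W_t$):

```python
import numpy as np

def sym(M):
    """Symmetric part of a square matrix: (M + M^T) / 2."""
    return 0.5 * (M + M.T)

def L_band(A, Ua, Va, Ub, Vb):
    """Constraint map of the spectral band: L(A) = (sym(Ua^T A Va), sym(Ub^T A Vb))."""
    return sym(Ua.T @ A @ Va), sym(Ub.T @ A @ Vb)
```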

The positive and negative semidefinite cones are self-dual, and so, $$K^* = \mathbb{S}^{r_{\alpha}}_{-} \times \mathbb{S}^{r_{\beta}}_{+}.$$ The adjoint of $L$ and the projection onto the dual cone, $\texttt{proj}_{K^*}$, are given by, $$\begin{align} L^*(Y^*_{\alpha}, Y^*_{\beta}) &= U_{\alpha} Y^*_{\alpha} V_{\alpha}^T + U_{\beta} Y^*_{\beta} V_{\beta}^T \nonumber \\ \texttt{proj}_{K^*}(Y_{\alpha}, Y_{\beta}) &= (\texttt{proj\_nsd}(Y_{\alpha}), \texttt{proj\_psd}(Y_{\beta})). \nonumber \end{align}$$
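
The adjoint and the dual-cone projection are just as short (again a sketch of mine; the projections act on the small $r_\alpha \times r_\alpha$ and $r_\beta \times r_\beta$ dual blocks via an eigendecomposition):

```python
import numpy as np

def L_band_adj(Ya, Yb, Ua, Va, Ub, Vb):
    """Adjoint of L_band: L*(Ya, Yb) = Ua Ya Va^T + Ub Yb Vb^T."""
    return Ua @ Ya @ Va.T + Ub @ Yb @ Vb.T

def proj_psd(S):
    """Orthogonal projection onto the PSD cone: clip negative eigenvalues at zero."""
    w, Q = np.linalg.eigh(0.5 * (S + S.T))
    return (Q * np.maximum(w, 0.0)) @ Q.T

def proj_nsd(S):
    """Orthogonal projection onto the NSD cone."""
    return -proj_psd(-S)

def proj_dual_band(Ya, Yb):
    """Projection onto the dual cone K* = S_-^{r_alpha} x S_+^{r_beta}."""
    return proj_nsd(Ya), proj_psd(Yb)
```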

And finally, the LMO for the $\texttt{RMS}\to\texttt{RMS}$ norm is given by, $$\texttt{LMO}_{\texttt{RMS}\to\texttt{RMS}}(G) = \sqrt{\frac{m}{n}}\, \texttt{msign}(G),$$ where $\texttt{msign}(G) := UV^T$ is the orthogonal polar factor of the reduced SVD $G = U \Sigma V^T$.
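
A reference implementation of this LMO, using a full SVD for clarity (in practice one would approximate $\texttt{msign}$ with a Newton-Schulz-style iteration, as in Muon):

```python
import numpy as np

def msign(G):
    """Orthogonal polar factor of G: if G = U S V^T (reduced SVD), then msign(G) = U V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def lmo_rms_rms(G):
    """LMO for the RMS->RMS norm ball of an (m x n) gradient: sqrt(m/n) * msign(G)."""
    m, n = G.shape
    return np.sqrt(m / n) * msign(G)
```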

Thus, our update rule becomes, $$\begin{align} A_t &= -\eta \sqrt{\frac{m}{n}} \cdot \texttt{msign}\left(G_t + U_{\alpha} Y^*_{\alpha, t} V_{\alpha}^T + U_{\beta} Y^*_{\beta, t} V_{\beta}^T\right) \nonumber \\ Y^*_{\alpha, t+1} &= \texttt{proj\_nsd}\left(Y^*_{\alpha, t} + \sigma \cdot \texttt{sym}(U_{\alpha}^T A_t V_{\alpha})\right) \nonumber \\ Y^*_{\beta, t+1} &= \texttt{proj\_psd}\left(Y^*_{\beta, t} + \sigma \cdot \texttt{sym}(U_{\beta}^T A_t V_{\beta})\right) \nonumber \end{align}$$ which exactly matches the update rule we derived in Appendix A1 of Rethinking Maximal Update Parametrization: Steepest Descent on the Spectral Ball.
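
And here is the whole inner solver stitched together, restating the helpers from the snippets above so that the block stands alone. The SVD-based $\texttt{msign}$, the default step sizes, the iteration count, and all function names are illustrative choices of mine, not the post's prescribed settings:

```python
import numpy as np

def sym(M):
    return 0.5 * (M + M.T)

def msign(G):
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def proj_psd(S):
    w, Q = np.linalg.eigh(sym(S))
    return (Q * np.maximum(w, 0.0)) @ Q.T

def proj_nsd(S):
    return -proj_psd(-S)

def spectral_band_direction(G, Ua, Va, Ub, Vb, eta=0.1, sigma=0.1, iters=100):
    """Dual ascent for steepest descent on the spectral band under the RMS->RMS norm.

    G      : raw gradient at W_t, shape (m, n)
    Ua, Va : singular-vector blocks of W_t whose singular values sit at the alpha face
    Ub, Vb : singular-vector blocks of W_t whose singular values sit at the beta face
    Returns an (approximately) optimal update direction A*.
    """
    m, n = G.shape
    Ya = np.zeros((Ua.shape[1], Ua.shape[1]))   # dual variable for the alpha face (NSD block)
    Yb = np.zeros((Ub.shape[1], Ub.shape[1]))   # dual variable for the beta face (PSD block)
    A = -eta * np.sqrt(m / n) * msign(G)        # unconstrained LMO step as initialization
    for _ in range(iters):
        # Primal step: LMO of the shifted gradient under the RMS->RMS norm.
        A = -eta * np.sqrt(m / n) * msign(G + Ua @ Ya @ Va.T + Ub @ Yb @ Vb.T)
        # Dual ascent steps, projected back onto K* = S_- x S_+.
        Ya = proj_nsd(Ya + sigma * sym(Ua.T @ A @ Va))
        Yb = proj_psd(Yb + sigma * sym(Ub.T @ A @ Vb))
    return A
```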

4. Experiments [Under Construction]

Acknowledgements

Big thanks to Jeremy Bernstein, Cédric Simal, and Antonio Silveti-Falls for productive discussions on the topic! All remaining mistakes are mine.

How to cite

@misc{cesista2025steepestdescentfinslerdualascent,
  author = {Franz Louis Cesista},
  title = {{S}teepest {D}escent on {F}insler-Structured (Matrix) Geometries via Dual Ascent},
  year = {2025},
  month = {October},
  day = {29},
  url = {https://leloykun.github.io/ponder/steepest-descent-finsler-dual-ascent/},
}

If you find this post useful, please consider supporting my work by sponsoring me on GitHub.

References

  1. Franz Cesista (2025). Steepest Descent on Finsler-Structured (Matrix) Manifolds. URL https://leloykun.github.io/ponder/steepest-descent-finsler/
  2. Franz Cesista (2025). Rethinking Maximal Update Parametrization: Steepest Descent on the Spectral Ball. URL https://leloykun.github.io/ponder/rethinking-mup-spectral-ball/
  3. Greg Yang, James B. Simon, Jeremy Bernstein (2024). A Spectral Condition for Feature Learning. URL https://arxiv.org/abs/2310.17813
  4. Laker Newhouse*, R. Preston Hess*, Franz Cesista*, Andrii Zahorodnii, Jeremy Bernstein, Phillip Isola (2025). Training Transformers with Enforced Lipschitz Constants. URL https://arxiv.org/abs/2507.13338
  5. Franz Cesista (2025). Sensitivity and Sharpness of n-Simplicial Attention. URL https://leloykun.github.io/ponder/lipschitz-n-simplical-transformer/
  6. Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil (2025). Fast and Simplex: 2-Simplicial Attention in Triton. URL https://arxiv.org/abs/2507.02754v1
  7. Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, Jeremy Bernstein (2024). Scalable Optimization in the Modular Norm. URL https://arxiv.org/abs/2405.14813
  8. Jeremy Bernstein (2025). Stiefel manifold. URL https://docs.modula.systems/algorithms/manifold/stiefel/
  9. Ben Keigwin, Dhruv Pai, Nathan Chen (2025). Gram-Space Manifold Muon. URL https://www.tilderesearch.com/vignettes/gram-space
  10. Jianlin Su (2025). Muon + Stiefel. URL https://kexue.fm/archives/11221
  11. Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher (2025). Training Deep Learning Models with Norm-Constrained LMOs. URL https://arxiv.org/abs/2502.07529
  12. Jeremy Bernstein, Laker Newhouse (2024). Old Optimizer, New Norm: An Anthology. URL https://arxiv.org/abs/2409.20325
  13. Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim (2025). Optimal Scaling Needs Optimal Norm. URL https://arxiv.org/abs/2510.03871