1. FreqMuon: Muon in the frequency domain
As discussed in Ponder: Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds (Cesista, 2025), the Muon optimizer only makes sense when applied to linear operator matrices, e.g. MLP weights (Jordan et al., 2024). But the convolution kernels in Convolutional Neural Networks (CNNs) are not themselves operator matrices; they are merely representations of the transform in pixel space. As such, it does not make sense to apply Muon directly to these kernels. To obtain a frequency-domain surrogate of the operator matrices, we perform a ‘coordinate change’ to a fixed FFT grid, where circular convolution becomes (blockwise) matrix multiplication. For the cropped finite-support operator implemented below, this should be read as a useful surrogate rather than an exact operator identity. It is in that surrogate geometry that we apply Muon’s orthogonalization logic.
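For intuition, the ‘coordinate change’ can be checked numerically in 1D with NumPy (a toy check, not part of the implementation): after an FFT, multichannel circular convolution reduces to an independent Cout × Cin matrix multiplication at each frequency bin.

```python
import numpy as np

rng = np.random.default_rng(0)
Cin, Cout, N = 3, 4, 8
K = rng.standard_normal((Cout, Cin, N))  # circular conv kernels, one per (out, in) channel pair
x = rng.standard_normal((Cin, N))        # Cin input channels of length N

# Pixel space: multichannel circular convolution
y = np.zeros((Cout, N))
for tau in range(N):
    # y[o, t] += sum_i K[o, i, tau] * x[i, (t - tau) mod N]
    y += np.einsum("oi,it->ot", K[:, :, tau], np.roll(x, tau, axis=-1))

# Frequency domain: one Cout x Cin matmul per frequency bin
Khat = np.fft.fft(K, axis=-1)  # [Cout, Cin, N]
xhat = np.fft.fft(x, axis=-1)  # [Cin, N]
yhat = np.einsum("oif,if->of", Khat, xhat)

assert np.allclose(np.fft.fft(y, axis=-1), yhat)  # same transform, two coordinate systems
```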
This builds on top of Ji-Ha Kim’s recent work on FreqMuon.
The core algorithm goes as follows:
- Compute $G := \nabla f(W_{\text{CNN}})$ in pixel space via backpropagation.
- FFT to the frequency domain: $\widehat{G} := \texttt{FFT}(G)$.
- Apply Muon’s orthogonalization to $\widehat{G}$, treating each frequency bin as a separate linear operator: $\widehat{U} := \texttt{msign}(\widehat{G})$.
- Inverse FFT back to the spatial domain: $U := \texttt{FFT}^{-1}(\widehat{U})$.
- Update CNN weights with $U$: $W_{\text{CNN}} \leftarrow W_{\text{CNN}} - \eta U$.
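The steps above can be sketched end-to-end in NumPy, using an exact SVD-based `msign` per frequency bin in place of the Newton–Schulz approximation used in the implementation below (shapes and FFT size here are arbitrary, chosen for illustration):

```python
import numpy as np

def msign(G):
    """Exact msign / polar factor via SVD: for G = U S V^H, return U V^H."""
    U, _, Vh = np.linalg.svd(G, full_matrices=False)
    return U @ Vh

rng = np.random.default_rng(0)
Cout, Cin, kH, kW, M = 4, 3, 3, 3, 8
G = rng.standard_normal((Cout, Cin, kH, kW))  # stand-in for the pixel-space gradient

# FFT to the frequency domain on an MxM logical grid (zero-padded)
Ghat = np.fft.fft2(G, s=(M, M), norm="ortho")  # [Cout, Cin, M, M], complex

# Orthogonalize each Cout x Cin matrix, one frequency bin at a time
Uhat = np.empty_like(Ghat)
for u in range(M):
    for v in range(M):
        Uhat[:, :, u, v] = msign(Ghat[:, :, u, v])

# Inverse FFT back to pixel space, then crop to the kernel's support
U = np.fft.ifft2(Uhat, s=(M, M), norm="ortho").real[:, :, :kH, :kW]
```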
1.1. Sample implementation
```python
from dataclasses import dataclass

import torch


@dataclass
class FreqMuonCfg:
    fft_size: int = 8
    ns_steps: int = 2
    eps: float = 1e-7


def _zeropower_ns5_complex(
    X: torch.Tensor,  # [B, m, n], complex64
    *,
    steps: int,
    eps: float,
) -> torch.Tensor:
    """
    Muon-style NS5 polynomial on native complex tensors.
    Returns approx polar factor / orthogonalized update.
    """
    assert X.ndim == 3 and X.is_complex()
    _, m, n = X.shape
    a, b, c = (3.4445, -4.7750, 2.0315)
    norm = torch.linalg.matrix_norm(X, ord="fro", dim=(-2, -1), keepdim=True).clamp_min(eps)
    X = X / norm
    if transpose := m > n:
        X = X.transpose(-2, -1)
    for _ in range(steps):
        A = X @ X.mH  # [B, r, r]
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.transpose(-2, -1)
    return X


def _freq_muon_conv_update_batched(g32: torch.Tensor, cfg: FreqMuonCfg) -> torch.Tensor:
    """
    g32: [P, Cout, Cin, kH, kW] float32
    returns: [P, Cout, Cin, kH, kW] float32
    """
    assert g32.ndim == 5 and g32.dtype == torch.float32
    P, Cout, Cin, kH, kW = g32.shape
    M = cfg.fft_size
    if kH > M or kW > M:
        raise ValueError(f"Kernel {kH}x{kW} > fft_size {M}. Increase --fft_size.")
    # Full FFT on an MxM logical grid; PyTorch pads/trims automatically.
    Khat = torch.fft.fft2(g32, s=(M, M), dim=(-2, -1), norm="ortho")  # [P, Cout, Cin, M, M]
    # Apply Muon orthogonalization to each Cout x Cin matrix in the frequency domain
    Kflat = Khat.permute(0, 3, 4, 1, 2).reshape(P * M * M, Cout, Cin)
    Kflat = _zeropower_ns5_complex(Kflat, steps=cfg.ns_steps, eps=cfg.eps)
    Khat2 = Kflat.reshape(P, M, M, Cout, Cin).permute(0, 3, 4, 1, 2)
    # Back to spatial domain, then crop to original kernel support
    upd_pad = torch.fft.ifft2(Khat2, s=(M, M), dim=(-2, -1), norm="ortho")
    upd = upd_pad.real[:, :, :, :kH, :kW]
    return upd.to(dtype=g32.dtype)


# _zeropower_ns5_complex = torch.compile(_zeropower_ns5_complex, mode="max-autotune")
_freq_muon_conv_update_batched = torch.compile(_freq_muon_conv_update_batched, mode="max-autotune")
```
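As a quick sanity check of the Newton–Schulz step, here is a standalone NumPy sketch mirroring `_zeropower_ns5_complex` (not a call into the code above): after a few iterations the singular values of a normalized complex matrix are driven toward 1, i.e. the result is approximately semi-unitary. The sketch uses 5 steps for a tighter result than the `ns_steps=2` default.

```python
import numpy as np

# NS5 quintic coefficients used by Muon; the iteration approximates msign(X) = U V^H.
A_, B_, C_ = 3.4445, -4.7750, 2.0315

def ns5_polar(X, steps=5, eps=1e-7):
    """Newton-Schulz-style iteration toward the polar factor of a complex matrix."""
    X = X / max(np.linalg.norm(X), eps)  # Frobenius normalization bounds singular values by 1
    if transpose := X.shape[0] > X.shape[1]:
        X = X.T  # iterate on the wide orientation; plain transpose commutes with msign
    for _ in range(steps):
        A = X @ X.conj().T
        X = A_ * X + (B_ * A + C_ * (A @ A)) @ X
    return X.T if transpose else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
U = ns5_polar(G)
s = np.linalg.svd(U, compute_uv=False)  # singular values cluster near 1
```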
2. Results
| Method | Validation Accuracy (%) |
|---|---|
| Baseline | 91.44 |
| FreqMuon (fft_size=8, ns_steps=2) | 93.51 |
We evaluate FreqMuon on the CIFAR-10 Airbench benchmark, training a highly optimized CNN from scratch for 7 epochs. We find that FreqMuon beats the SOTA baseline optimizer by a large margin, achieving 93.51% validation accuracy vs. the baseline’s 91.44%.
Note: these results are preliminary and we have not yet performed a hyperparameter sweep for FreqMuon.
How to cite
```bibtex
@misc{cesista2026freqmuon,
  author = {Franz Louis Cesista},
  title  = {Frequency Domain Muon for Convolutional Neural Networks: Simplified},
  year   = {2026},
  url    = {https://github.com/leloykun/freqmuon/},
}
```
References
- Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein (2024). Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon/
- Ji-Ha Kim (2026). Frequency-Domain Muon for Conv Filters - Orthogonalizing the Operator. URL https://jiha-kim.github.io/posts/frequency-domain-muon-for-conv-filters/
- Keller Jordan (2024). cifar10-airbench. URL https://github.com/KellerJordan/cifar10-airbench
- Franz Louis Cesista (2025). Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds. URL https://leloykun.github.io/ponder/steepest-descent-non-riemannian/