1. FreqMuon: Muon in the frequency domain
As discussed in Ponder: Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds (Cesista, 2025), the Muon optimizer only makes sense when applied to linear operator matrices, e.g. MLP weights (Jordan et al., 2024). But the convolution kernels in Convolutional Neural Networks (CNNs) are not themselves operator matrices; they are merely representations of the transform in pixel space. As such, it does not make sense to apply Muon directly to these kernels. To obtain a frequency-domain surrogate of the operator matrices, we perform a ‘coordinate change’ to a fixed FFT grid, where circular convolution becomes (blockwise) matrix multiplication. For the cropped finite-support operator implemented below, this should be read as a useful surrogate rather than an exact operator identity. It is in that surrogate geometry that we apply Muon’s orthogonalization logic.
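For intuition, the ‘coordinate change’ can be checked numerically in 1D with NumPy (a toy check, not part of the implementation): after an FFT, multichannel circular convolution reduces to an independent Cout × Cin matrix multiplication at each frequency bin.

```python
import numpy as np

rng = np.random.default_rng(0)
Cin, Cout, N = 3, 4, 8
K = rng.standard_normal((Cout, Cin, N))  # circular conv kernels, one per (out, in) channel pair
x = rng.standard_normal((Cin, N))        # Cin input channels of length N

# Pixel space: multichannel circular convolution
y = np.zeros((Cout, N))
for tau in range(N):
    # y[o, t] += sum_i K[o, i, tau] * x[i, (t - tau) mod N]
    y += np.einsum("oi,it->ot", K[:, :, tau], np.roll(x, tau, axis=-1))

# Frequency domain: one Cout x Cin matmul per frequency bin
Khat = np.fft.fft(K, axis=-1)  # [Cout, Cin, N]
xhat = np.fft.fft(x, axis=-1)  # [Cin, N]
yhat = np.einsum("oif,if->of", Khat, xhat)

assert np.allclose(np.fft.fft(y, axis=-1), yhat)  # same transform, two coordinate systems
```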
This builds on top of Ji-Ha Kim’s recent work on FreqMuon.
The core algorithm goes as follows:
- Compute $G := \nabla f(W_{\text{CNN}})$ in pixel space via backpropagation.
- FFT to the frequency domain: $\widehat{G} := \texttt{FFT}(G)$.
- Apply Muon’s orthogonalization to $\widehat{G}$, treating each frequency bin as a separate linear operator: $\widehat{U} := \texttt{msign}(\widehat{G})$.
- Inverse FFT back to the spatial domain: $U := \texttt{FFT}^{-1}(\widehat{U})$.
- Update CNN weights with $U$: $W_{\text{CNN}} \leftarrow W_{\text{CNN}} - \eta U$.
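The steps above can be sketched end-to-end in NumPy, using an exact SVD-based `msign` per frequency bin in place of the Newton–Schulz approximation used in the implementation below (shapes and FFT size here are arbitrary, chosen for illustration):

```python
import numpy as np

def msign(G):
    """Exact msign / polar factor via SVD: for G = U S V^H, return U V^H."""
    U, _, Vh = np.linalg.svd(G, full_matrices=False)
    return U @ Vh

rng = np.random.default_rng(0)
Cout, Cin, kH, kW, M = 4, 3, 3, 3, 8
G = rng.standard_normal((Cout, Cin, kH, kW))  # stand-in for the pixel-space gradient

# FFT to the frequency domain on an MxM logical grid (zero-padded)
Ghat = np.fft.fft2(G, s=(M, M), norm="ortho")  # [Cout, Cin, M, M], complex

# Orthogonalize each Cout x Cin matrix, one frequency bin at a time
Uhat = np.empty_like(Ghat)
for u in range(M):
    for v in range(M):
        Uhat[:, :, u, v] = msign(Ghat[:, :, u, v])

# Inverse FFT back to pixel space, then crop to the kernel's support
U = np.fft.ifft2(Uhat, s=(M, M), norm="ortho").real[:, :, :kH, :kW]
```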
1.1. Sample implementation
```python
from dataclasses import dataclass

import torch


@dataclass
class FreqMuonCfg:
    fft_size: int = 8
    ns_steps: int = 2
    eps: float = 1e-7


def _zeropower_ns5_complex(
    X: torch.Tensor,  # [B, m, n], complex64
    *,
    steps: int,
    eps: float,
) -> torch.Tensor:
    """
    Muon-style NS5 polynomial on native complex tensors.
    Returns approx polar factor / orthogonalized update.
    """
    assert X.ndim == 3 and X.is_complex()
    _, m, n = X.shape
    a, b, c = (3.4445, -4.7750, 2.0315)
    norm = torch.linalg.matrix_norm(X, ord="fro", dim=(-2, -1), keepdim=True).clamp_min(eps)
    X = X / norm
    if transpose := m > n:
        X = X.transpose(-2, -1)
    for _ in range(steps):
        A = X @ X.mH  # [B, r, r]
        X = a * X + (b * A + c * A @ A) @ X
    if transpose:
        X = X.transpose(-2, -1)
    return X


def _freq_muon_conv_update_batched(g32: torch.Tensor, cfg: FreqMuonCfg) -> torch.Tensor:
    """
    g32: [P, Cout, Cin, kH, kW] float32
    returns: [P, Cout, Cin, kH, kW] float32
    """
    assert g32.ndim == 5 and g32.dtype == torch.float32
    P, Cout, Cin, kH, kW = g32.shape
    M = cfg.fft_size
    if kH > M or kW > M:
        raise ValueError(f"Kernel {kH}x{kW} > fft_size {M}. Increase --fft_size.")
    # Full FFT on an MxM logical grid; PyTorch pads/trims automatically.
    Khat = torch.fft.fft2(g32, s=(M, M), dim=(-2, -1), norm="ortho")  # [P, Cout, Cin, M, M]
    # Apply Muon orthogonalization to each Cout x Cin matrix in the frequency domain
    Kflat = Khat.permute(0, 3, 4, 1, 2).reshape(P * M * M, Cout, Cin)
    Kflat = _zeropower_ns5_complex(Kflat, steps=cfg.ns_steps, eps=cfg.eps)
    Khat2 = Kflat.reshape(P, M, M, Cout, Cin).permute(0, 3, 4, 1, 2)
    # Back to spatial domain, then crop to original kernel support
    upd_pad = torch.fft.ifft2(Khat2, s=(M, M), dim=(-2, -1), norm="ortho")
    upd = upd_pad.real[:, :, :, :kH, :kW]
    return upd.to(dtype=g32.dtype)


# _zeropower_ns5_complex = torch.compile(_zeropower_ns5_complex, mode="max-autotune")
_freq_muon_conv_update_batched = torch.compile(_freq_muon_conv_update_batched, mode="max-autotune")
```
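As a quick sanity check of the Newton–Schulz step, here is a standalone NumPy sketch mirroring `_zeropower_ns5_complex` (not a call into the code above): after a few iterations the singular values of a normalized complex matrix are driven toward 1, i.e. the result is approximately semi-unitary. The sketch uses 5 steps for a tighter result than the `ns_steps=2` default.

```python
import numpy as np

# NS5 quintic coefficients used by Muon; the iteration approximates msign(X) = U V^H.
A_, B_, C_ = 3.4445, -4.7750, 2.0315

def ns5_polar(X, steps=5, eps=1e-7):
    """Newton-Schulz-style iteration toward the polar factor of a complex matrix."""
    X = X / max(np.linalg.norm(X), eps)  # Frobenius normalization bounds singular values by 1
    if transpose := X.shape[0] > X.shape[1]:
        X = X.T  # iterate on the wide orientation; plain transpose commutes with msign
    for _ in range(steps):
        A = X @ X.conj().T
        X = A_ * X + (B_ * A + C_ * (A @ A)) @ X
    return X.T if transpose else X

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
U = ns5_polar(G)
s = np.linalg.svd(U, compute_uv=False)  # singular values cluster near 1
```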
2. Results
| Method | Validation Accuracy (%) |
|---|---|
| Baseline | 91.44 |
| FreqMuon (fft_size=8, ns_steps=2) | 93.51 |
We evaluate FreqMuon on the CIFAR-10 Airbench benchmark, training a highly optimized CNN from scratch for 7 epochs. We find that FreqMuon beats the SOTA baseline optimizer by a large margin, achieving 93.51% validation accuracy vs. the baseline’s 91.44%.
Note: these results are preliminary and we have not yet performed a hyperparameter sweep for FreqMuon.
How to cite
```bibtex
@misc{cesista2026freqmuon,
  author = {Franz Louis Cesista},
  title  = {Frequency Domain Muon for Convolutional Neural Networks: Simplified},
  year   = {2026},
  url    = {https://github.com/leloykun/freqmuon/},
}
```
References
- Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein (2024). Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon/
- Ji-Ha Kim (2026). Frequency-Domain Muon for Conv Filters - Orthogonalizing the Operator. URL https://jiha-kim.github.io/posts/frequency-domain-muon-for-conv-filters/
- Keller Jordan (2024). cifar10-airbench. URL https://github.com/KellerJordan/cifar10-airbench
- Franz Louis Cesista (2025). Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds. URL https://leloykun.github.io/ponder/steepest-descent-non-riemannian/