Sensitivity and Sharpness of n-Simplicial Attention

Towards a maximal update parameterization of n-simplicial attention

July 6, 2025 · Franz Louis Cesista
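
To make the object concrete: n-simplicial attention replaces the bilinear query-key dot product with a higher-order form. Below is a minimal NumPy sketch of the n = 2 (trilinear) case, assuming the common formulation with two key/value streams and an elementwise value combination; the μP scaling rules the post studies are not reproduced here, and all names are illustrative.

```python
import numpy as np

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Illustrative 2-simplicial attention for one head.

    Standard attention scores token pairs via q_i . k_j; 2-simplicial
    attention scores triples via a trilinear form:
        logits[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d].
    All inputs have shape (seq_len, head_dim).
    """
    d = q.shape[-1]
    # Trilinear logits over (query, key1, key2) triples. Scaled here like
    # standard attention; the right scaling for trilinear forms is exactly
    # the kind of question a muP-style analysis addresses.
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / np.sqrt(d)
    # Softmax jointly over the (j, k) pairs for each query i.
    flat = logits.reshape(len(q), -1)
    probs = np.exp(flat - flat.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    probs = probs.reshape(logits.shape)
    # Value for a (j, k) pair: elementwise product of the two value vectors.
    return np.einsum("ijk,jd,kd->id", probs, v1, v2)

rng = np.random.default_rng(0)
q, k1, k2, v1, v2 = (rng.standard_normal((8, 16)) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)  # (8, 16)
```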

Adam with Aggressive Gradient Clipping ≈ Smoothed SignSGD/NormSGD

Why does Adam with aggressive gradient value/norm clipping produce sparse updates and tolerate higher learning rates? Here we show that it is essentially a smoothed version of SignSGD/NormSGD.

July 3, 2025 · Franz Louis Cesista
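
The core of the claim can be checked numerically in a few lines. The sketch below (illustrative, not taken from the post) runs Adam's moment updates on value-clipped gradients with a threshold c far below the typical gradient magnitude, so every coordinate saturates to ±c; the second moment then collapses to the constant c², the constant cancels, and the update reduces to a bias-corrected EMA of sign(g), i.e. smoothed SignSGD.

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, eps, c = 0.9, 0.999, 1e-8, 1e-3
m = v = s = np.zeros(1000)

for t in range(1, 201):
    # Noisy gradients bounded away from zero, so |g| >> c and every
    # coordinate of the value-clipped gradient saturates to +/- c.
    g = rng.choice([-1.0, 1.0], 1000) * rng.uniform(0.5, 1.5, 1000)
    gc = np.clip(g, -c, c)                     # aggressive value clipping
    m = beta1 * m + (1 - beta1) * gc           # Adam first moment
    v = beta2 * v + (1 - beta2) * gc**2        # Adam second moment
    s = beta1 * s + (1 - beta1) * np.sign(g)   # EMA of the gradient sign

# With gc = c * sign(g), the bias-corrected v is exactly c^2, so c cancels
# and the Adam step equals the sign-EMA up to eps: smoothed SignSGD.
adam_update = (m / (1 - beta1**t)) / (np.sqrt(v / (1 - beta2**t)) + eps)
smoothed_sign = s / (1 - beta1**t)
print(np.max(np.abs(adam_update - smoothed_sign)))  # ~1e-5
```

The same argument with aggressive norm clipping, g · min(1, c/‖g‖) → c · g/‖g‖, recovers a smoothed NormSGD direction instead.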

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration

A small step towards hardware-architecture-optimizer codesign in deep learning.

June 23, 2025 · Franz Louis Cesista
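
For a sense of the mechanics: the key primitive is the Newton-Schulz iteration, a matmul-only polynomial iteration that converges to the semi-orthogonal polar factor (the "matrix sign" of a rectangular matrix), which makes it fast on accelerators and auto-differentiable, unlike an explicit SVD. The NumPy sketch below is illustrative and not necessarily the post's exact construction; the standard (1.5, −0.5) coefficients, the identity min(σ, β) = (σ + β − |σ − β|)/2, and the particular way of realizing the matrix absolute value are assumptions of this sketch.

```python
import numpy as np

def msign(A, steps=25):
    """Polar factor via Newton-Schulz: X -> 1.5 X - 0.5 X X^T X.

    For A = U S V^T (tall or square, here 64x32), the iteration converges
    to U V^T. Normalizing by the Frobenius norm puts all singular values
    in (0, 1], inside the iteration's basin of convergence.
    """
    X = A / (np.linalg.norm(A) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def spectral_clip(W, beta):
    """Cap every singular value of W at beta, using only matmuls.

    Applies min(s, beta) = (s + beta - |s - beta|) / 2 to the spectrum:
    with O = msign(W) and A = W - beta * O, the product O @ msign(A).T @ A
    equals U |S - beta I| V^T in W's singular basis.
    """
    O = msign(W)
    A = W - beta * O
    return 0.5 * (W + beta * O - O @ msign(A).T @ A)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
Wc = spectral_clip(W, beta=5.0)
sv, sv_c = (np.linalg.svd(M, compute_uv=False) for M in (W, Wc))
print(sv.max(), sv_c.max())  # ~13.7 -> ~5.0: large values capped at beta
print(sv.min(), sv_c.min())  # values below beta are left (nearly) unchanged
```

Because the whole pipeline is a composition of matrix products, gradients flow through it directly, which is what makes the operation usable inside a training step rather than only as a post-hoc projection.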