Sensitivity and Sharpness of n-Simplicial Attention
Towards a maximal update parameterization of n-simplicial attention
Why does Adam with aggressive gradient value/norm clipping produce sparse updates and tolerate higher learning rates? Here we show that it is essentially equivalent to a smoothed version of SignSGD/NormSGD.
A small step towards hardware-architecture-optimizer codesign in deep learning.
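A minimal numerical sketch of the claimed equivalence (the helper, hyperparameters, and toy gradient stream below are illustrative assumptions, not the post's actual code): once elementwise value clipping saturates, the clipped gradient is `clip * sign(g)`, Adam's second-moment estimate collapses to the constant `clip**2`, and the Adam step reduces to a bias-corrected EMA of `sign(g)`, i.e. SignSGD with momentum. The analogous argument with global norm clipping yields NormSGD.

```python
import numpy as np

def adam_clip_updates(grads, lr=1.0, b1=0.9, b2=0.999, eps=1e-12, clip=1e-3):
    """Run Adam on value-clipped gradients and, in parallel, a smoothed SignSGD.

    Returns the two update sequences so they can be compared directly.
    (Illustrative sketch; names and defaults are assumptions, not the post's code.)
    """
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    m_sign = np.zeros_like(grads[0])      # EMA of sign(g): the "smoothed SignSGD" direction
    adam_updates, sign_updates = [], []
    for t, g in enumerate(grads, start=1):
        gc = np.clip(g, -clip, clip)      # aggressive value clipping: gc ~= clip * sign(g)
        m = b1 * m + (1 - b1) * gc        # first moment of the clipped gradient
        v = b2 * v + (1 - b2) * gc**2     # second moment collapses to ~clip**2 once saturated
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        adam_updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))

        m_sign = b1 * m_sign + (1 - b1) * np.sign(g)     # SignSGD with momentum
        sign_updates.append(-lr * m_sign / (1 - b1**t))
    return np.array(adam_updates), np.array(sign_updates)

# Toy gradient stream whose entries are far larger than the clip threshold,
# so the clip saturates on essentially every coordinate at every step.
rng = np.random.default_rng(0)
grads = list(rng.normal(scale=10.0, size=(200, 512)))
adam_u, sign_u = adam_clip_updates(grads)
print("mean |Adam-with-clip - smoothed SignSGD|:", np.abs(adam_u - sign_u).mean())
```

With `scale` far above `clip`, essentially every coordinate saturates the clip, so the printed gap between the two update sequences should be close to zero; the rare unsaturated coordinates account for whatever small discrepancy remains.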