# Flash Attention Minimal

A minimal implementation of Flash Attention 1 & 2 in just ~350 lines of CUDA code.