A Simple Heuristic Solution for Steepest Descent on Stiefel Manifold

A fast, numerically stable, and differentiable solution for steepest descent on the Stiefel manifold.

July 18, 2025 · Franz Louis Cesista

Sensitivity and Sharpness of n-Simplicial Attention

Towards a maximal update parameterization of n-simplicial attention.

July 6, 2025 · Franz Louis Cesista

Adam with Aggressive Gradient Clipping ≈ Smoothed SignSGD/NormSGD

Why does Adam with aggressive gradient value/norm clipping have sparse updates and do well with higher learning rates? Here we show that it is essentially equivalent to a smoothed version of SignSGD/NormSGD.

July 3, 2025 · Franz Louis Cesista

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration

A small step towards hardware-architecture-optimizer codesign in deep learning.

June 23, 2025 · Franz Louis Cesista

Muon and a Selective Survey on Steepest Descent in Riemannian and Non-Riemannian Manifolds

Muon from first principles, what makes it different from other optimizers, and why it works so well.

April 3, 2025 · Franz Louis Cesista

Napkin Math on Non-Euclidean Trust Region Optimization

A possible reason why Muon converges faster & does better at higher learning rates than Adam.

March 24, 2025 · Franz Louis Cesista

Blocked Matrix Formulation of Linear Attention Mechanisms

The blocked matrix formulation of linear attention mechanisms, multi-step online gradient descent at inference time, and chunk-wise parallelism.

March 16, 2025 · Franz Louis Cesista

Steepest Descent Under Schatten-p Norms

Why Muon still works despite not perfectly semi-orthogonalizing the gradients.

February 27, 2025 · Franz Louis Cesista

Squeezing 1-2% Efficiency Gains Out of Muon by Optimizing the Newton-Schulz Coefficients

Simply switching to Muon can already get you 2x efficiency gains. But you can squeeze out an extra 1-2% by optimizing the Newton-Schulz coefficients.

February 21, 2025 · Franz Louis Cesista

CASPR Without Accumulation is Muon

The CASPR optimizer, a variant of Shampoo, reduces to Muon when we remove the accumulation on the preconditioners.

February 13, 2025 · Franz Louis Cesista

GRPO's Main Flaw

GRPO may not be the best choice for training reasoning models. Here’s why.

February 11, 2025 · Franz Louis Cesista

(Linear) Attention as Test-Time Regression

A unifying framework for linear attention mechanisms as test-time regression and how to parallelize training and inference.

January 27, 2025 · Franz Louis Cesista

Deep Learning Optimizers as Steepest Descent in Normed Spaces

Instead of asking, ‘Which optimizer should I use?’ ask, ‘In which space do my features live?’

October 20, 2024 · Franz Louis Cesista

Multimodal Structured Generation

Generate interleaved text and image content in a structured format you can directly pass to downstream APIs.

July 14, 2024 · Franz Louis Cesista

Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

[Technical Report for CVPR’s 2nd MMFM Challenge] This report presents Multimodal Structured Generation, a general framework that constrains the output logits of frozen Multimodal Foundation Models to force them to reason before responding with structured outputs that downstream APIs can parse and use. This approach achieved the second-highest score on the hidden test set for Phase 2 and the third-highest score overall in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference.

June 17, 2024 · Franz Louis Cesista

Flash Hyperbolic Attention Minimal [WIP]

A minimal implementation of Flash Attention 1 & 2 in just ~350 lines of CUDA code. This is still a work in progress, but the ultimate goal is to implement the different variants of Hyperbolic Attention in CUDA.

April 16, 2024 · Franz Louis Cesista

Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use

[IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR) 2024] This paper presents Retrieval Augmented Structured Generation (RASG), a novel general framework for Business Document Information Extraction that achieves state-of-the-art (SOTA) results on both Key-Information Extraction (KIE) and Line Items Recognition (LIR).

April 15, 2024 · Franz Louis Cesista, Rui Aguiar, Jason Kim, Paolo Acilo

The Human Mind May Be Universal

Years of experience in building artificial minds led me to believe that these AIs may end up seeming more ‘human’ than we currently imagine them to be.

December 10, 2023 · Franz Louis Cesista

Llama.cpp

A C++ implementation of Meta’s Llama2 generative large language model. I also optimized Karpathy’s original C implementation by parallelizing the multi-head attention layer.

July 25, 2023 · Franz Louis Cesista

Expedock Assistant: ChatGPT Applied to Logistics Data

Expedock Assistant is a chatbot that lets you ask questions about your shipments and get answers in real time. It’s like having a personal assistant that knows everything about your business, shipments, and industry.

January 31, 2023 · Franz Louis Cesista

Expedock AutoML

Expedock’s AutoML Library – fit a model, run batch inference, and get explanations in one line of code each.

July 25, 2022 · Franz Louis Cesista

Vaccine Search as a Computational Problem

A thought dump on mRNA vaccines and the future of computational biology.

February 6, 2021 · Franz Louis Cesista

Booking Demand Prediction for Grab SEA

Booking demand prediction for Grab’s Southeast Asia operations. The project involves spatio-temporal forecasting, anomaly detection, and econometric modeling.

June 16, 2019 · Franz Louis Cesista