I was tired of OOM errors while fine-tuning, so I built my own optimizer
Hey Product Hunt! 👋
If you’ve ever tried to fine-tune an LLM locally, you know the "CUDA out of memory" heartbreak.
I wanted the convergence speed of second-order optimizers (like Shampoo), but those methods usually overwhelm consumer GPUs because they have to store large preconditioner matrices and compute their inverse roots.
The "Aha!" Moment:
I spent months figuring out a way to capture the network's curvature without the memory overhead. I came up with SCAO (Sparse Curvature-Aware Optimizer). It uses a "Diagonal Fallback" and INT8 quantization to keep things light.
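To make the two ideas concrete, here is a minimal pure-Python sketch. It is not taken from scao.py. It assumes "Diagonal Fallback" means using a per-parameter (diagonal) curvature estimate in place of a full preconditioner matrix, and it uses plain symmetric INT8 quantization to store that estimate. All function names are hypothetical.

```python
import math

def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: returns (integer codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Each value is stored as an integer in [-127, 127].
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from the INT8 codes."""
    return [c * scale for c in codes]

def diagonal_precondition(grad, curvature, eps=1e-8):
    """Assumed diagonal fallback: divide each gradient entry by
    the square root of its own curvature estimate."""
    return [g / (math.sqrt(max(h, 0.0)) + eps) for g, h in zip(grad, curvature)]

# Toy curvature estimate, e.g. a running average of squared gradients.
curvature = [0.04, 1.0, 0.25, 4.0]
codes, scale = quantize_int8(curvature)   # stored at 1 byte per entry
restored = dequantize_int8(codes, scale)  # decoded just before the update

grad = [0.2, -1.0, 0.5, 2.0]
update = diagonal_precondition(grad, restored)
print(codes, update)
```

Storing the state as one byte per entry instead of four is where the memory saving in a scheme like this would come from; the cost is a rounding error of at most half the scale per entry.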
The Rejection:
I actually tried to contribute this to the Hugging Face transformers library. The maintainers were cool, but they rejected it, saying it was "too new" and needed community proof before they'd merge it.
So, I made it standalone.
Instead of waiting for a big library to approve it, I turned SCAO into a single scao.py file. No recompiling, no complex setup. It’s a one-line drop-in for the Hugging Face Trainer.
What I saw on my own rig:
VRAM Savings: A 36.7% reduction in memory usage. LoRA fine-tuning of GPT-2 or Llama-3 actually runs comfortably on 8GB cards.
Raw Speed: In full fine-tuning, it hit ~627 tokens/second.
Results: After 50 steps, perplexity was 25.8% lower than AdamW's.
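If you want to run the same kind of comparison on your own setup, a perplexity gap can be computed from the mean evaluation loss of each run. The sketch below assumes perplexity is the exponential of the mean per-token cross-entropy loss (the usual definition). The loss values are made-up placeholders, not the measurements above.

```python
import math

def perplexity(mean_loss):
    """Perplexity from mean per-token cross-entropy loss (in nats)."""
    return math.exp(mean_loss)

def relative_reduction(baseline, candidate):
    """Fraction by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline

# Placeholder eval losses after the same number of steps (hypothetical values).
adamw_loss = 3.0
scao_loss = 2.8

reduction = relative_reduction(perplexity(adamw_loss), perplexity(scao_loss))
print(f"perplexity reduction: {reduction:.1%}")
```

Because perplexity is exponential in the loss, a modest gap in loss shows up as a larger percentage gap in perplexity, so it helps to report both.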
I'm launching this as a "Developer Tool" for everyone who wants to train smarter, not just harder. It’s open-source and ready for you to break.
Check out the repo and let me know if it helps your local training sessions!
