FlashMLA

Faster LLM Inference on Hopper GPUs

FlashMLA, from DeepSeek, is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences. It achieves up to 3000 GB/s of memory bandwidth and 580 TFLOPS of compute.

Zac Zuo

Hi everyone!

Sharing FlashMLA, a new open-source project from DeepSeek. This one's highly technical, but potentially a big deal for anyone working on large language model (LLM) inference, especially with NVIDIA Hopper GPUs (like the H800).

In a nutshell, FlashMLA is an optimized "kernel" (a low-level GPU code component) for a critical part of LLM decoding called MLA (Multi-head Latent Attention), the attention mechanism used in DeepSeek's models. It's designed for speed and efficiency, particularly with the variable-length sequences common in real-world serving.
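For those curious what this looks like in practice, here's a rough sketch of decode-time usage modeled on the project's README. The flash_mla calls (get_mla_metadata, flash_mla_with_kvcache) follow the published example, but the shapes and constants below are my own illustrative assumptions, not exact requirements.

```python
# Minimal sketch of FlashMLA decode-time usage (illustrative values, Hopper GPU required).
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

batch, s_q = 4, 1                 # decoding: one new query token per request
h_q, h_kv = 128, 1                # MLA attends through a single shared latent KV head
d, dv = 576, 512                  # key dim (latent + RoPE part) and value dim
block_size, max_blocks_per_seq = 64, 32
num_blocks, num_layers = 1024, 61

# Variable-length KV caches: one current length per request in the batch.
cache_seqlens = torch.tensor([511, 1024, 37, 2048], dtype=torch.int32, device="cuda")

# Tile-scheduler metadata is computed once per decode step and reused for every layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

# Illustrative query, paged KV cache, and block table (maps requests to cache blocks).
q = torch.randn(batch, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(num_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(batch * max_blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(batch, max_blocks_per_seq)

for _ in range(num_layers):
    # o: attention output, lse: log-sum-exp values from the softmax.
    o, lse = flash_mla_with_kvcache(
        q, kvcache, block_table, cache_seqlens, dv,
        tile_scheduler_metadata, num_splits, causal=True,
    )
```

In a real server, each layer would of course use its own queries and KV cache rather than the shared placeholders above.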

This is the first release from DeepSeek's Open Source Week. Optimizations like this may be part of how DeepSeek achieves such strong results despite limited access to the latest GPUs, compared with companies that have far larger compute budgets. And DeepSeek is open-sourcing it completely.