TurboQuant-MoE:8.5x KV-Cache Compression
8.5x KV-cache compression for LLM inference
1 follower
8.5x KV-cache compression for LLM inference
1 follower
Production KV-cache compression for Mixture-of-Experts language models. LLM inference costs explode because: ⢠KV-cache grows with sequence length (16k tokens = 256MB per token) ⢠MoE models waste GPU storing inactive experts ⢠Memory becomes the bottleneck, not compute š REAL BENCHMARKS (Mixtral 8x7B) ⢠KV Memory: 256MB ā 30MB (8.53x smaller) ⢠Quality: 100% preserved (zero degradation) ⢠Speed: 8.48x faster in production ⢠Expert Cache Hit: 96.75% ⢠GPU Memory Saved: 6.42 GB per layer


Hey, this is Denis.
PROBLEM: LLM inference costs $10k/month because KV cache eats up memory.
SOLUTION: I squeezed it 8.5 times. The quality is the same.
PROOF:
256MB ā 30MB
8.48x faster
$10k ā $1.2k per month
HOW: Orthogonal conversion from Google DeepMind is used.
RESULT: Works with Mixtral, DeepSeek, Qwen. MIT license. Free.
github.com/RemizovDenis/turboquant
Questions? I'm here.
TurboQuant-MoE v0.3.0 released! ⢠Up to 15.4à KV-cache compression ⢠Cross-layer delta + 3-bit PolarQuant Serious VRAM killer for MoE models (Mixtral, DeepSeek, Qwen etc.)