Denis · 1mo ago

TurboQuant-MoE: 8.5x KV-cache compression for LLM inference

Production KV-cache compression for Mixture-of-Experts language models.

LLM inference costs explode because:

• The KV-cache grows linearly with sequence length (at 16k tokens it reaches roughly 256 MB per sequence; see the quantization sketch below)
• MoE models waste GPU memory storing inactive experts (see the expert-cache sketch below)
• Memory, not compute, becomes the bottleneck

📊 Real benchmarks (Mixtral 8x7B):

• KV memory: 256 MB → 30 MB (8.53x smaller)
• Quality: 100% preserved (zero measured degradation)
• Speed: 8.48x faster in production
• Expert cache hit rate: 96.75%
• GPU memory saved: 6.42 GB per layer
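The name "TurboQuant" points at quantization, but the post doesn't show the scheme. Below is a minimal sketch of the general idea: per-group low-bit quantization of KV tensors. Everything here is illustrative rather than the project's code; the function names, the int4/fp16 choices, and the Mixtral-like tensor shape are assumptions. Note this simple scheme tops out around 4x over fp16, so the reported 8.53x presumably layers on further techniques.

```python
import torch

def quantize_kv_int4(kv: torch.Tensor, group_size: int = 64):
    """Symmetric per-group int4 quantization of a KV tensor (hypothetical sketch).

    Each group of `group_size` values shares one fp16 scale, so the
    cache shrinks roughly 4x vs fp16 (before scale overhead).
    """
    orig_shape = kv.shape
    flat = kv.float().reshape(-1, group_size)
    # Map each group's max magnitude to 7 (int4 range is [-8, 7]).
    scale = (flat.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(flat / scale), -8, 7).to(torch.int8)
    return q, scale.half(), orig_shape

def dequantize_kv_int4(q, scale, orig_shape):
    return (q.float() * scale.float()).reshape(orig_shape).half()

# One layer's K cache with a Mixtral-like shape (assumed):
# (batch, kv_heads, seq_len, head_dim)
k = torch.randn(1, 8, 16384, 128, dtype=torch.float16)
q, scale, shape = quantize_kv_int4(k)
k_hat = dequantize_kv_int4(q, scale, shape)
print("max abs error:", (k - k_hat).abs().max().item())
```

(In practice the int4 values would be packed two per byte; they are kept in int8 here for readability.)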
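The 96.75% expert cache hit rate suggests only a subset of expert weights stays resident on GPU, with the rest fetched on demand. The post doesn't say how eviction works; an LRU policy is one plausible reading. A minimal sketch, with `ExpertCache`, `load_fn`, and `capacity` as hypothetical names:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache for MoE expert weights (hypothetical sketch).

    Keeps at most `capacity` experts resident; on a miss, the least
    recently used expert is evicted and the requested one is loaded
    (e.g., copied from CPU to GPU) via `load_fn`.
    """
    def __init__(self, capacity: int, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn
        self.cache = OrderedDict()  # expert_id -> weights
        self.hits = self.misses = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(expert_id)   # mark as most recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the LRU expert
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

When routing is locally repetitive (consecutive tokens often reuse the same experts), even a small resident set can hit well above 90%, which would be consistent with the number the post reports.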