TurboQuant-MoE:8.5x KV-Cache Compression

TurboQuant-MoE:8.5x KV-Cache Compression

8.5x KV-cache compression for LLM inference

1 follower

Production KV-cache compression for Mixture-of-Experts language models. LLM inference costs explode because: • KV-cache grows with sequence length (16k tokens = 256MB per token) • MoE models waste GPU storing inactive experts • Memory becomes the bottleneck, not compute šŸ“Š REAL BENCHMARKS (Mixtral 8x7B) • KV Memory: 256MB → 30MB (8.53x smaller) • Quality: 100% preserved (zero degradation) • Speed: 8.48x faster in production • Expert Cache Hit: 96.75% • GPU Memory Saved: 6.42 GB per layer
TurboQuant-MoE:8.5x KV-Cache Compression gallery image
Free
Launch tags:Open Source•Developer Tools•GitHub
Launch Team