TurboQuant-MoE:8.5x KV-Cache Compression - 8.5x KV-cache compression for LLM inference
byā¢
Production KV-cache compression for Mixture-of-Experts language models.
LLM inference costs explode because:
⢠KV-cache grows with sequence length (16k tokens = 256MB per token)
⢠MoE models waste GPU storing inactive experts
⢠Memory becomes the bottleneck, not compute
š REAL BENCHMARKS (Mixtral 8x7B)
⢠KV Memory: 256MB ā 30MB (8.53x smaller)
⢠Quality: 100% preserved (zero degradation)
⢠Speed: 8.48x faster in production
⢠Expert Cache Hit: 96.75%
⢠GPU Memory Saved: 6.42 GB per layer
Replies
Hey, this is Denis.
PROBLEM: LLM inference costs $10k/month because KV cache eats up memory.
SOLUTION: I squeezed it 8.5 times. The quality is the same.
PROOF:
256MB ā 30MB
8.48x faster
$10k ā $1.2k per month
HOW: Orthogonal conversion from Google DeepMind is used.
RESULT: Works with Mixtral, DeepSeek, Qwen. MIT license. Free.
github.com/RemizovDenis/turboquant
Questions? I'm here.
TurboQuant-MoE v0.3.0 released! ⢠Up to 15.4à KV-cache compression ⢠Cross-layer delta + 3-bit PolarQuant Serious VRAM killer for MoE models (Mixtral, DeepSeek, Qwen etc.)