Denis

TurboQuant-MoE:8.5x KV-Cache Compression - 8.5x KV-cache compression for LLM inference

by•
Production KV-cache compression for Mixture-of-Experts language models. LLM inference costs explode because: • KV-cache grows with sequence length (16k tokens = 256MB per token) • MoE models waste GPU storing inactive experts • Memory becomes the bottleneck, not compute šŸ“Š REAL BENCHMARKS (Mixtral 8x7B) • KV Memory: 256MB → 30MB (8.53x smaller) • Quality: 100% preserved (zero degradation) • Speed: 8.48x faster in production • Expert Cache Hit: 96.75% • GPU Memory Saved: 6.42 GB per layer

Add a comment

Replies

Best
Denis
Maker
šŸ“Œ
✨ SUPPORTED MODELS āœ… Mixtral 8x7B, 8x22B (Production Ready) šŸ”„ DeepSeek, Qwen 1.5-MoE (Experimental) šŸŽÆ WHAT'S INCLUDED • KV quantization engine • Dynamic expert cache • Speculative prefetch • Transformers/vLLM integration • Full benchmark suite • MIT open-source license Built in 3 hours using AI to architect, then properly engineered.
Denis
Maker

Hey, this is Denis.

PROBLEM: LLM inference costs $10k/month because KV cache eats up memory.

SOLUTION: I squeezed it 8.5 times. The quality is the same.

PROOF:

256MB → 30MB

8.48x faster

$10k → $1.2k per month

HOW: Orthogonal conversion from Google DeepMind is used.

RESULT: Works with Mixtral, DeepSeek, Qwen. MIT license. Free.

github.com/RemizovDenis/turboquant

Questions? I'm here.

Denis
Maker
šŸš€

TurboQuant-MoE v0.3.0 released! • Up to 15.4Ɨ KV-cache compression • Cross-layer delta + 3-bit PolarQuant Serious VRAM killer for MoE models (Mixtral, DeepSeek, Qwen etc.)