OpenCut-AI now runs TurboQuant on your GPU — 7.3× KV cache compression
🚀 OpenCut-AI just shipped real GPU support for TurboQuant KV cache compression.
OpenCut-AI is an open-source, local-first AI video editor. Everything runs on your machine — transcription, voice cloning, image generation, LLM commands. No cloud, no API keys.
The catch was always memory. Running a 7B LLM + Whisper + TTS + Stable Diffusion locally means fighting for every gigabyte of RAM and VRAM. TurboQuant tackles this by compressing the KV cache, which is often the largest memory consumer during long-context LLM inference, by up to 7.3×.
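To put that number in perspective, here is a rough back-of-the-envelope sketch of KV cache size. The model shape is an assumption for illustration (32 layers, 32 KV heads, head dimension 128, fp16, 4096-token context, which matches a Llama-2-7B-style model); the post does not say which 7B model OpenCut-AI runs, and 7.3× is the "up to" figure, not a guaranteed ratio.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Size of an uncompressed KV cache: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch

GIB = 1024 ** 3

# Assumed Llama-2-7B-style shape; not taken from the OpenCut-AI codebase.
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
compressed = full / 7.3  # best-case ratio quoted in the post

print(f"fp16 KV cache:   {full / GIB:.2f} GiB")        # 2.00 GiB
print(f"at 7.3x (best):  {compressed / GIB:.2f} GiB")  # 0.27 GiB
```

Under these assumptions a 4096-token context drops from about 2 GiB of cache to roughly 0.27 GiB, which is memory the other local models can use.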
What's new in this release:
→ User-selectable Compute Mode in Settings → AI Optimization. Pick Auto, CPU, or GPU (CUDA).
→ Real integration with the turboquant-gpu library. The GPU backend runs cuTile fused kernels for the full 2-bit / 3-bit KV compression path. The CPU backend uses a PyTorch fallback with physical-core thread pinning and MKL-DNN (oneDNN) acceleration.
→ Live-measured compression ratios in the UI. No more static lookup tables — you see the actual compression your backend produced on the last request.
→ Graceful fallback everywhere. Missing CUDA? Falls back to CPU. Missing cuTile kernels? Falls back to PyTorch. The service always comes up.
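The Compute Mode setting and the fallback chain above can be sketched roughly like this. This is an illustrative sketch, not OpenCut-AI's actual code: `resolve_backend`, the backend labels, and the probe helper are hypothetical names, and it assumes that when the fused kernels are missing the PyTorch fallback stays on the GPU (the post does not say which device it uses).

```python
import importlib.util

def cuda_available() -> bool:
    """True if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

def resolve_backend(mode: str, has_cuda: bool, has_fused_kernels: bool) -> str:
    """Map the user's Compute Mode to a backend that can actually run.

    mode is "auto", "cpu", or "gpu" (hypothetical setting values).
    The function always returns a usable backend, so the service comes up.
    """
    mode = mode.lower()
    if mode not in ("auto", "cpu", "gpu"):
        raise ValueError(f"unknown compute mode: {mode!r}")
    # CPU was requested, or CUDA is missing: fall back to the CPU path.
    if mode == "cpu" or not has_cuda:
        return "cpu-pytorch"
    # CUDA is present: prefer fused kernels, else plain PyTorch (assumed on GPU).
    return "gpu-fused" if has_fused_kernels else "gpu-pytorch"
```

For example, `resolve_backend("gpu", has_cuda=False, has_fused_kernels=False)` yields `"cpu-pytorch"`, so a machine without CUDA still starts. Passing the probe results in as arguments keeps the decision logic testable on machines with no GPU at all.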
Huge thanks to Anirudh Bharadwaj Vangara for the turboquant-gpu library that made the real GPU path possible.
OpenCut-AI: https://github.com/Ekaanth/OpenCut-AI
turboquant-gpu: https://github.com/DevTechJr/turboquant-gpu

