Abhishek Sira Chandrashekar

OpenCut-AI now runs TurboQuant on your GPU — 7.3× KV cache compression

🚀 OpenCut-AI just shipped real GPU support for TurboQuant KV cache compression.

OpenCut-AI is an open-source, local-first AI video editor. Everything runs on your machine — transcription, voice cloning, image generation, LLM commands. No cloud, no API keys.

The catch was always memory. Running a 7B LLM + Whisper + TTS + Stable Diffusion locally means fighting for every gigabyte of RAM. TurboQuant solves this by compressing the KV cache (the biggest memory hog during inference) by up to 7.3×.
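
To see why the KV cache dominates, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 values) is an assumption on my part, chosen to match a typical Llama-2-7B-style model; it is not taken from OpenCut-AI's configuration. The function name `kv_cache_bytes` is purely illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int) -> int:
    """Size of an uncompressed KV cache for one sequence.

    The leading factor of 2 covers the separate key and value tensors
    stored for every layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape: a Llama-2-7B-style model with fp16 cache values.
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                      seq_len=4096, bytes_per_value=2)
compressed = full / 7.3  # the "up to 7.3x" figure quoted in this post

GIB = 1024 ** 3
print(f"uncompressed: {full / GIB:.2f} GiB")
print(f"at 7.3x:      {compressed / GIB:.2f} GiB")
```

Under those assumptions a single 4096-token context costs 2 GiB uncompressed and roughly 0.27 GiB at the best-case ratio, which is the kind of headroom that lets the other models stay loaded.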

What's new in this release:

→ User-selectable Compute Mode in Settings → AI Optimization. Pick Auto, CPU, or GPU (CUDA).

→ Real integration with the turboquant-gpu library. The GPU backend runs cuTile fused kernels for the full 2-bit / 3-bit KV compression path. The CPU backend uses a PyTorch fallback with threads pinned to physical cores and MKL-DNN (oneDNN) acceleration.

→ Live-measured compression ratios in the UI. No more static lookup tables — you see the actual compression your backend produced on the last request.

→ Graceful fallback everywhere. Missing CUDA? Falls back to CPU. Missing cuTile kernels? Falls back to PyTorch. The service always comes up.
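
For the curious, the Compute Mode and fallback behaviour described above can be sketched roughly like this. This is not the actual OpenCut-AI code: `Mode`, `resolve_backend`, and the backend labels are hypothetical names for illustration, and I am assuming the "falls back to PyTorch" step stays on the GPU when CUDA is present. The only real library call is `torch.cuda.is_available()`.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical mirror of the Auto / CPU / GPU (CUDA) setting."""
    AUTO = "auto"
    CPU = "cpu"
    GPU = "gpu"


def cuda_available() -> bool:
    """Probe for CUDA without crashing if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_backend(mode: Mode, has_cuda: bool, has_fused_kernels: bool) -> str:
    """Pick a backend label; every path returns something usable.

    The probes are passed in as booleans so the decision logic can be
    tested without a GPU. The returned labels are illustrative only.
    """
    if mode is Mode.CPU or not has_cuda:
        # Explicit CPU choice, or a GPU/Auto request with no CUDA present.
        return "cpu-pytorch"
    if has_fused_kernels:
        return "gpu-fused-kernels"
    # CUDA is present but the fused kernels are missing (assumed behaviour).
    return "gpu-pytorch"


# Example: the user asked for GPU on a machine with no CUDA device.
print(resolve_backend(Mode.GPU, has_cuda=False, has_fused_kernels=False))
```

Keeping the probes separate from the decision means the service can log which step of the chain it landed on and still start on any machine.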

Huge thanks to Anirudh Bharadwaj Vangara for the turboquant-gpu library that made the real GPU path possible.

OpenCut-AI: https://github.com/Ekaanth/OpenCut-AI
turboquant-gpu: https://github.com/DevTechJr/turboquant-gpu
