The novel part: Helical Shift. When the KV cache fills, a GPU compute shader slides the cached keys and values backward in the sequence dimension. Because keys and values are stored in raw pre-RoPE form (no position encoding baked in), the slide is a pure data copy, with no trigonometric recomputation needed. Two independent 5,200-token runs crossing multiple compaction boundaries produce SHA-identical output. That's not an optimization; it's a provable mathematical invariant.

Why this matters: Every other local inference tool (llama.cpp, candle, whisper.cpp) has a C or C++ core that Rust wrappers call through FFI. Airframe is the first production-ready GGUF inference engine that is Rust all the way down, including the GPU shaders.

Tech stack:
- 13,586 lines Rust + 855 lines WGSL
- wgpu (WebGPU), bytemuck, tokio, axum
- Targets: Windows (D3D12), macOS (Metal), Linux (Vulkan)

What you can do right now:
- Run TinyLlama, Phi, Llama 3.2, DeepSeek Coder, and others from GGUF files
- Connect AnythingLLM, SillyTavern, Zed, Cursor, Open WebUI via Ollama or OpenAI API
- Generate beyond your context limit without crashes or garbage output

Privacy first! Own your process from implementation to production! Down with our evil corporate AI overlords.

https://github.com/Michael-A-Kuy...
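To make the Helical Shift idea concrete, here is a minimal CPU-side sketch of why storing raw, pre-RoPE rows turns compaction into a plain memory move. It is not the project's code: the `KvBuffer` type, its fields, and its methods are hypothetical names invented for illustration, and the real engine is described as doing this in a WGSL compute shader on the GPU. The assumption is that position encoding is applied later, at attention time, from each row's current slot index, so nothing stored in a row needs to change when the row moves.

```rust
/// Hypothetical flat cache: `capacity` token rows of `row_width` f32 values,
/// stored in raw (pre-RoPE) form. Illustration only, not the project's type.
pub struct KvBuffer {
    data: Vec<f32>,
    row_width: usize,
    capacity: usize,
    len: usize,
}

impl KvBuffer {
    pub fn new(capacity: usize, row_width: usize) -> Self {
        Self {
            data: vec![0.0; capacity * row_width],
            row_width,
            capacity,
            len: 0,
        }
    }

    /// Number of token rows currently stored.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Append one token row. Returns false when the cache is full.
    pub fn push(&mut self, row: &[f32]) -> bool {
        assert_eq!(row.len(), self.row_width);
        if self.len == self.capacity {
            return false;
        }
        let start = self.len * self.row_width;
        self.data[start..start + self.row_width].copy_from_slice(row);
        self.len += 1;
        true
    }

    /// Drop the oldest `discard` rows by sliding the remaining rows toward
    /// slot 0. Rows hold no position information, so this is a pure copy.
    pub fn shift(&mut self, discard: usize) {
        let discard = discard.min(self.len);
        let src = discard * self.row_width..self.len * self.row_width;
        self.data.copy_within(src, 0);
        self.len -= discard;
    }

    /// Borrow the raw row stored at `slot`.
    pub fn row(&self, slot: usize) -> &[f32] {
        assert!(slot < self.len);
        let start = slot * self.row_width;
        &self.data[start..start + self.row_width]
    }
}
```

In this sketch, filling a four-slot buffer and calling `shift(2)` leaves the two newest rows in slots 0 and 1, byte-for-byte unchanged, and frees two slots for new tokens. Had the rows been stored with rotary angles already applied, each surviving row would need to be re-rotated for its new position instead of simply copied.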

Shimmy v2.0

The first pure-Rust GGUF inference engine. No C. No Python.
