Mike Kuykendall

Shimmy v2.0 - The first pure-Rust GGUF inference engine. No C. No Python.

Two 5,200-token runs. Same model. SHA-identical byte output. That's a proof, not a benchmark. Shimmy v2.0 ships Airframe: pure-Rust GPU inference with hand-written WGSL compute shaders. No llama.cpp. No C. No Python. No CUDA. It is the first production GGUF engine that is Rust all the way down, including the GPU shaders. Run TinyLlama, Llama 3.2, Phi, and DeepSeek from GGUF files. Drop-in for AnythingLLM, Open WebUI, Cursor, and Zed via the OpenAI or Ollama API. Runs on Windows, macOS, and Linux. Install with: cargo install shimmy

Replies

Mike Kuykendall
The novel part is Helical Shift. When the KV cache fills, a GPU compute shader slides the cached keys and values backward in the sequence dimension. Because keys and values are stored in raw pre-RoPE form (no position encoding baked in), the slide is a pure data copy; no trigonometric recomputation is needed. Two independent 5,200-token runs crossing multiple compaction boundaries produce SHA-identical output. That's not an optimization; it's a provable mathematical invariant.

Why this matters: Every other local inference tool (llama.cpp, candle, whisper.cpp) has a C or C++ core that Rust wrappers call through FFI. Airframe is the first production-ready GGUF inference engine that is Rust all the way down, including the GPU shaders.

Tech stack:
- 13,586 lines of Rust + 855 lines of WGSL
- wgpu (WebGPU), bytemuck, tokio, axum
- Targets: Windows (D3D12), macOS (Metal), Linux (Vulkan)

What you can do right now:
- Run TinyLlama, Phi, Llama 3.2, DeepSeek Coder, and others from GGUF files
- Connect AnythingLLM, SillyTavern, Zed, Cursor, and Open WebUI via the Ollama or OpenAI API
- Generate beyond your context limit without crashes or garbage output

Privacy first! Own your process from implementation to production! Down with our evil corporate AI overlords.

https://github.com/Michael-A-Kuy...
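
For readers who want to see why storing pre-RoPE keys makes the Helical Shift slide a plain copy, here is a minimal CPU-side sketch in Rust. It is not Airframe's code: the real slide runs in a WGSL compute shader, and every name below (KvCache, shift, push, rope) is hypothetical. It also assumes that the rotary encoding is applied when a key is read, using the key's current cache slot as its position; the description above does not spell that detail out.

```rust
/// Toy single-head KV cache holding keys and values in raw (pre-RoPE) form.
/// Hypothetical illustration only; not Shimmy's or Airframe's API.
struct KvCache {
    head_dim: usize,
    capacity: usize, // maximum number of cached positions
    len: usize,      // positions currently in use
    keys: Vec<f32>,  // capacity * head_dim floats, row-major
    values: Vec<f32>,
}

impl KvCache {
    fn new(capacity: usize, head_dim: usize) -> Self {
        Self {
            head_dim,
            capacity,
            len: 0,
            keys: vec![0.0; capacity * head_dim],
            values: vec![0.0; capacity * head_dim],
        }
    }

    /// Drop the oldest `n` positions by sliding the rest toward slot 0.
    /// Nothing position-dependent is stored, so this is a pure data copy.
    fn shift(&mut self, n: usize) {
        let n = n.min(self.len);
        let d = self.head_dim;
        self.keys.copy_within(n * d..self.len * d, 0);
        self.values.copy_within(n * d..self.len * d, 0);
        self.len -= n;
    }

    /// Append one position, compacting by `evict` slots first if the cache is full.
    fn push(&mut self, k: &[f32], v: &[f32], evict: usize) {
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        if self.len == self.capacity {
            self.shift(evict.max(1));
        }
        let start = self.len * self.head_dim;
        self.keys[start..start + self.head_dim].copy_from_slice(k);
        self.values[start..start + self.head_dim].copy_from_slice(v);
        self.len += 1;
    }

    /// Raw (unrotated) key stored in `slot`.
    fn key(&self, slot: usize) -> &[f32] {
        let start = slot * self.head_dim;
        &self.keys[start..start + self.head_dim]
    }
}

/// Rotary position embedding applied at read time (assumption: the position
/// is the key's current cache slot). Rotates each adjacent pair of elements.
fn rope(x: &[f32], pos: usize) -> Vec<f32> {
    let d = x.len();
    let mut out = x.to_vec();
    for i in 0..d / 2 {
        let theta = pos as f32 / 10000.0_f32.powf(2.0 * i as f32 / d as f32);
        let (s, c) = theta.sin_cos();
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s;
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c;
    }
    out
}
```

In this sketch, a key that moves from slot 3 to slot 1 is rotated for position 1 the next time it is read, so the stored bytes never need to be re-encoded. If keys were cached post-RoPE instead, each slide would have to undo and reapply the rotation for every cached key.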