GitHub - Transparent semantic cache for LLM API calls on Redis VS

Khazad is a transparent semantic cache for LLM API calls. It intercepts LLM HTTP traffic at the httpx transport layer and serves semantically-equivalent requests from a Redis 8 vector cache with zero code changes. Works with OpenAI, Anthropic, Gemini, Azure OpenAI, and Mistral. Model-aware and conversation-aware caching, full streaming support, TTL, and tunable similarity thresholds. Stop paying for the same prompt twice in dev, CI, demos, or production. Open source (MIT).

Hi everyone, I'm the maker of Khazad. I kept running into the same problem: I was paying for the exact same LLM prompt over and over, and even in production a lot of user traffic is near-identical questions (FAQ bots, RAG front-ends). Traditional caching doesn't help because no two prompts are byte-for-byte the same. So I built a semantic cache, but I wanted it to be truly transparent. Most caching tools make you wrap their SDK or route traffic through a proxy. Khazad instead intercepts outgoing LLM HTTP requests at the httpx transport layer, so it works with the OpenAI, Anthropic, Gemini, Azure OpenAI, and Mistral SDKs with zero changes to your application code. You call init() once and it's active. Under the hood it uses Redis Vector Sets: each (provider, model) pair gets its own vector set, the whole conversation is embedded (not just the last message), and a similarity search decides whether to replay a cached response or let the call go upstream. The part that evolved most was getting streaming right, cache hits replay as real SSE streams, and streamed misses are captured chunk-by-chunk and reassembled into JSON, so a streamed answer can later serve a non-streamed request and vice versa. It's open source (MIT) and I'd genuinely love feedback, especially on the transport-layer approach and how people would want to handle false-positive control. GitHub: https://github.com/GuglielmoCerr...

GitHub - Transparent semantic cache for LLM API calls on Redis VS

Replies