Chameleon

Run any LLM on demand — zero idle VRAM.


Chameleon is a stateless AI runtime that becomes any LLM on demand. Instead of keeping models loaded, it routes each request to the best model, loads it just-in-time, executes, and fully unloads — resulting in zero idle VRAM usage. Run multiple models efficiently with one runtime, without wasting memory or restarting systems.
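The route → load → execute → unload lifecycle described above can be sketched in a few lines of Python. This is a hypothetical illustration, not Chameleon's actual API: the routing table, the `just_in_time` context manager, and the placeholder weights are all stand-ins for the real runtime.

```python
from contextlib import contextmanager

# Stand-in routing table mapping request types to models
# (hypothetical names, not Chameleon's real registry).
MODEL_REGISTRY = {
    "summarize": "small-summarizer",
    "code": "code-model",
}

LOADED = {}  # tracks which models are resident ("in VRAM") right now


def route(task: str) -> str:
    """Pick the best model for a request via the routing table."""
    return MODEL_REGISTRY.get(task, "general-model")


@contextmanager
def just_in_time(model_name: str):
    """Load a model only for the duration of one request, then unload."""
    LOADED[model_name] = object()  # placeholder for real model weights
    try:
        yield LOADED[model_name]
    finally:
        del LOADED[model_name]  # full unload: nothing stays resident


def handle(task: str, prompt: str) -> str:
    """One stateless request: route, load just-in-time, run, unload."""
    model_name = route(task)
    with just_in_time(model_name):
        return f"[{model_name}] {prompt}"


print(handle("summarize", "hello"))
print("models resident after the request:", len(LOADED))
```

The key property is that `LOADED` is empty between requests, which is what "zero idle VRAM" means in practice: memory is held only while a request is executing.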
Free