Hey, I'm Sacha, co-founder at @Edgee
Over the last few months, we've been working on a problem we kept seeing in production AI systems:
LLM costs don't scale linearly with usage; they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge and token spend quickly becomes the main bottleneck.
So we built a token compression layer designed to run before inference.
Edgee
Hey Product Hunt 👋
Sacha here, co-founder of Edgee.
Story time. A few weeks ago I was working with Claude Code on a refactor with Opus. The model knew exactly what to do, but I sat there watching a 500-line file crawl out one token at a time. Two minutes for one file. Multiply that by every step of the agent loop and you realize: speed is the silent tax on every coding session.
Around the same time I started testing open-source models like GLM and Kimi K2.7. The quality on coding tasks was honestly impressive.
But on standard endpoints they were even slower than the closed models. And the setup was painful: API keys, code changes, a CLAUDE.md to rewrite, MCP servers to reconfigure.
That's the problem we built Edgee Turbo Models to solve.
What it does:
→ Run frontier open-source models (GLM 5.1, Kimi K2.7 Code, Kimi K2.6, MiniMax 2.7) directly in Claude Code.
→ At up to 4x the speed of standard endpoints (~200 tok/s vs ~50).
→ Flat $29/month. No metered token bill that climbs as your agents work harder.
→ Setup in 2 minutes. Your CLAUDE.md, MCP servers, and the rest of your setup stay exactly where they are.
One important point I want to get out in front, because it'll come up:
Turbo is NOT a smaller or quantized version of these models. They are the full open-weight checkpoints. Turbo only changes how they are served, on dedicated high-throughput inference infrastructure built for raw speed, not a shared best-effort endpoint. Same outputs, just faster.
How this fits with our previous launches:
- Compression: use fewer tokens per request
- Teams: see who uses what, per repo, per PR
- Fallback Models: keep working when Claude or Copilot hit limits
- Turbo Models: run open-source models at premium speed, for flat pricing
Together that is the Route + Compress + Observe stack of our Agent Gateway. Today we're shipping the speed layer.
Why now: the Economist published a piece this week confirming that "token-maxxing is over" and that companies are routing to cheaper models. Open-source models are clearly part of the answer. Turbo makes them actually usable.
A few questions I'd love your feedback on:
→ Which open-source coding model are you most curious to try?
→ Is flat $29/month the right price point, or would you prefer usage-based?
→ What other models should we add to the Turbo lineup?
Will be in comments all day. Thanks for checking it out 🙏
Tabstack by Mozilla
@sachamorard Product Hunt is about consistency. S/O for this new launch! Keep up the great work, and keep launching 👏👏
Being able to run different models through Claude Code is really cool. Can you switch between models mid-session, or is it set per project?
Edgee
@doganakbulut You can switch whenever you want. Just be aware that switching causes a cache miss at the provider level, so we recommend waiting until your next session to switch!
Proxying Claude Code's API calls through a gateway to route to Kimi K2.7 or MiniMax without code changes is clean architecture. We've hit throughput ceilings in agentic workflows where task latency compounds fast, so the 4x speed claim is interesting. Does Edgee handle automatic fallback if a model hits rate limits mid-session?
Edgee
@anand_thakkar1 Yes, exactly! Edgee is able to fall back to the model/provider you choose.
Even without the turbo models, you can use it with Claude and fall back to Kimi when your usage limit is reached (or when Anthropic has an incident), for example.
Flat $29/month instead of usage-based is the right call for anyone running agents that loop unpredictably. The worst part of token-based pricing is never knowing what the bill will be until it's too late. Also, being able to swap in open-source models without changing any code or rewriting configs removes the biggest barrier to actually trying them. Most people stick with what they know because switching is painful, not because the alternatives aren't good enough.
Edgee
@tina_chhabra Both points are exactly what we kept hearing in customer conversations before we built this.
On flat pricing: agentic workflows are genuinely unpredictable. One day you have a clean refactor, the next your agent is in a 30-minute edit-run-fix loop and you've burned $40. Knowing your monthly ceiling upfront is a different kind of freedom. It changes how teams use the tool because the meter isn't running in their heads.
On switching cost: this is the part most people underestimate. The quality gap between closed and open frontier models on coding tasks is far smaller than the gap in setup friction. We benchmarked GLM 5.1 and Kimi K2.6 Code against Sonnet on real coding sessions for weeks before this launch, and the outputs are genuinely close. But "close on output" doesn't matter if "setup takes a Saturday." So that's the actual battle: zero-config switching.
The whole point of routing through a gateway is that "trying a new model" becomes a 30-second toggle, not a project.
Thanks for the thoughtful comment 🙏
Love the flat rate approach for unpredictable agent loops, excited to test Kimi K2.7 Code with this kind of speed. Huge congrats on shipping this, @sachamorard
Edgee
@priya_kushwaha1 You'll see, Kimi K2.7 Code (Turbo version) is really impressive. Looking forward to your feedback!
Tabstack by Mozilla
I can relate. Kimi's models are solid options for coding tasks. See this comparison with Opus 4.7. TL;DR: great for prototyping or exploring a design, but Opus remains ahead for work that requires correctness and accuracy.
Also, fun fact: Composer 2.5, Cursor's most recent coding model, is actually based on Kimi's.