Hey, I'm Sacha, co-founder at @Edgee
Over the last few months, we've been working on a problem we kept seeing in production AI systems:
LLM costs don't scale linearly with usage, they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge and token spend quickly becomes the main bottleneck.
So we built a token compression layer designed to run before inference.
Edgee
Hey Product Hunt 👋
Sacha here, co founder of Edgee.
Story time. A few weeks ago I was working with Claude Code on a refactor with Opus. The model knew exactly what to do, but I sat there watching a 500-line file crawl out one token at a time. Two minutes for one file. Multiply that by every step of the agent loop and you realize: speed is the silent tax on every coding session.
Around the same time I started testing open-source models like GLM and Kimi K2.7. The quality on coding tasks was honestly impressive.
But the speed on standard endpoints was even slower than the closed models. And the setup was painful: API keys, code changes, CLAUDE md to rewrite, MCP servers to reconfigure.
That's the problem we built Edgee Turbo Models to solve.
What it does:
→ Run frontier open-source models (GLM 5.1, Kimi K2.7 Code, Kimi K2.6, MiniMax 2.7) directly in Claude Code.
→ At up to 4x the speed of standard endpoints (~200 tok/s vs ~50).
→ Flat $29/month. No metered token bill that climbs as your agents work harder.
→ Setup in 2 minutes. Your CLAUDE md, MCP servers, and entire setup stay exactly where they are.
Important point I want to get out front because it'll come up:
Turbo is NOT a smaller or quantized version of these models. They are the full open-weight checkpoints. Turbo only changes how they are served, on dedicated high-throughput inference infrastructure built for raw speed, not a shared best-effort endpoint. Same outputs, just faster.
How this fits with our previous launches:
- Compression: use fewer tokens per request
- Teams: see who uses what, per repo, per PR
- Fallback Models: keep working when Claude or Copilot hit limits
- Turbo Models: run open-source models at premium speed, for flat pricing
Together that is the Route + Compress + Observe stack of our Agent Gateway. Today we're shipping the speed layer.
Why now: The Economist published a piece this week confirming that "token-maxxing is over" and that companies are routing to cheaper models. Open-source models are clearly part of the answer. Turbo
makes them actually usable.
A few questions I'd love your feedback on:
→ Which open-source coding model are you most curious to try?
→ Is flat $29/month the right price point, or would you prefer usage-based?
→ What other models should we add to the Turbo lineup?
Will be in comments all day. Thanks for checking it out 🙏
Tabstack by Mozilla
@sachamorard @Product Hunt is about consistency, S/O for this new launch! keep up the great, and keep launching 👏👏
@sachamorard fight the good fight to bring inference costs down! Can you guarantee that your compression is lossless?
The edge angle I always want clarity on: cold-start and state. For these turbo models running at the edge, are you keeping them warm across regions or is there a first-hit penalty when a PoP hasn't served the model recently? And for anything stateful, how do you reconcile across PoPs without round-tripping to a central region — or is the model purely stateless inference?
Edgee
Hey @mikebrandswarm
Important clarification first: the models themselves don't run at the edge. What runs at the edge is the Edgee gateway. The models run on dedicated high-throughput inference infrastructure (ours, or partners like Together AI, Fireworks, and a few others depending on the model), and our gateway routes each request to the fastest path between your machine and the inference backend.
This split matters for your cold-start question. Here's how we handle it:
The gateway itself has no cold start for a given PoP. It runs continuously across regions on infrastructure built for low-latency request handling.
For the inference backend, we maintain persistent keep-alive connections to each provider so the first hit from a PoP doesn't pay a TCP/TLS handshake tax. We use HTTP/2 with multiplexing, TCP_NODELAY, and pre-warmed connection pools.
Combined with provider keep-warm on their side, the practical result is sub-100ms TTFB from most PoPs to most Turbo Models, even on a request that hasn't been served from that PoP in hours.
On the stateful question:
LLM inference is fundamentally stateless per request. Every prompt carries its own context, and the model itself doesn't retain anything between calls. So there's nothing to reconcile across PoPs at our layer.
The interesting state lives on the provider side, specifically KV-cache and prompt cache. Each provider has their own strategy for how they keep that warm and how they route subsequent requests to the same warm replica. That's their domain and they're better at it than we'd ever be from the gateway layer. Our job is to pick the best provider for the model you asked for, route you to them on the fastest path, and stay out of the way.
Edgee owns the routing and connection optimization layer. Providers own the inference stateful layer. The specialization works because each side is doing what they're best at.
Happy to go deeper if you want, this is a fun part of the system.
Being able to run different models through Claude Code is really cool. Can you switch between models mid-session, or is it set per project?
Edgee
@doganakbulut You can switch whenever you want, please be aware that this will result in a cache miss at providers level therefore we recommend waiting for the next session if you want to switch !
Edgee
@doganakbulut Yep, we can switch mid-session whenever you want. Or we can switch automatically when you reach your Claude usage limit, or when the Anthropic API is down (quite often, isn't it? 😆)
Proxying Claude Code's API calls through a gateway to route to Kimi K2.7 or MiniMax without code changes is clean architecture. We've hit throughput ceilings in agentic workflows where task latency compounds fast, so the 4x speed claim is interesting. Does Edgee handle automatic fallback if a model hits rate limits mid-session?
Edgee
@anand_thakkar1 Yes, exactly! Edgee is able to fall back to the model/provider you choose.
Even without the turbo models, you can use it with Claude and fall back to Kimi when your usage limit is reached (or when Anthropic has an incident), for example.
Edgee
Are you planning to add more models to the Turbo lineup? Curious specifically about DeepSeek V4 and Qwen 3.5 Coder. Also, will Turbo ever work with Codex or just Claude Code?
Edgee
Hey @gilles_raymond, you again ;)
Yes on both.
Models: DeepSeek V4 and Qwen 3.5 Coder are both on the shortlist.
We're benchmarking them right now and they'll likely join the lineup in the next few weeks. The criteria we use: agentic capability, tool calling reliability, and inference infrastructure availability for genuine high-throughput serving.
Codex: yes. Turbo already works with Codex today through the same gateway.
And it works on Cursor, Copilot, OpenCode...
We focused the launch messaging on Claude Code because that's where the speed pain point is most acute in our user base, but the same flat $29/month gets you Turbo Models in Codex too.
❤️
How do GLM 5.1 and Kimi K2.7 Code actually compare to Sonnet 4.6 on real coding tasks? I've tried open-source coding models a few times over the past year and they always felt one tier below the closed ones. Has that gap really closed?
Edgee
@gilles_raymond2 Honest answer: the gap is much smaller than it was 6 months ago, and on a lot of tasks it's gone.
We benchmarked GLM 5.1, Kimi K2.6, and MiniMax 2.7 against Sonnet 4.6 on real coding sessions for several weeks before this launch. Three patterns emerged:
On focused tasks (write a function, fix a bug, refactor a file), outputs are genuinely comparable. A blind test wouldn't reliably pick the closed model.
On long agentic loops with heavy tool calling, GLM 5.1 holds up best. It was trained explicitly for agentic workflows and it shows.
On extreme long-context reasoning (whole-repo Q&A across hundreds of files), Sonnet still has a small edge for now.
The bigger insight from those weeks: the choice of model matters less than people think. The choice of how the model is served matters way more. A "weaker" model at 200 tok/s with no metered bill often produces a better dev experience than a "stronger" one crawling out tokens at premium pricing.
token reduction at the gateway is the kind of infra that quietly changes the economics of agent workflows. curious about the tradeoff curve: at what compression ratio do you start seeing measurable accuracy loss on downstream tasks? and is that loss uniform across model families or does it hit some harder than others?
Edgee
@thenameisarian Our defaults target ~50% reduction with measurable accuracy loss
in the noise on coding benchmarks. Past ~70% you start seeing degradation on long-horizon tasks (multi-step refactors, anything that needs to recall earlier tool results). Single-step tasks
tolerate more.
It's less "how much can I compress" and more "what can I drop without breaking the loop".
We have implemented three compression techniques:
- Tool result trimming by RTK
- Tool surface reduction
- Output brevity
We are trying to make each of these techniques as lossless as possible, and I can already say that the effects on the model's efficiency are almost negligible. We will be releasing our benchmark work within the next few days.