Hey, I'm Sacha, co-founder at @Edgee
Over the last few months, we've been working on a problem we kept seeing in production AI systems:
LLM costs don't scale linearly with usage; they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge and token spend quickly becomes the main bottleneck.
So we built a token compression layer designed to run before inference.
Edgee
Hey friends
Sacha here, founder of @Edgee.
Two weeks ago Anthropic announced that starting June 15, your programmatic Claude usage gets capped at a $20-$200 monthly credit pool. For heavy Claude Code users, that's roughly a 25 to 40x cut in effective inference.
Same with Copilot, which is moving to usage-based pricing on June 1st.
A lot of people are angry about it. I get it. But we're builders, and the right answer to a market change is to ship better tools, not to complain.
We started building Fallback Models the week before Anthropic's announcement, after one too many Anthropic outages. The timing is now coincidentally perfect.
Here's what our Fallback Models feature does:
→ Anthropic down? Route to Kimi K2.6, GLM, Qwen, Gemma, or others.
→ Plan limit hit? Same thing, automatically.
→ Want to route always? Pick your model.
You can also fall back to your own Bedrock, Vertex, or Azure account in one click. Same Claude Code on top, your cloud underneath, zero code changes.
And it works the same with Copilot, Codex...
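For anyone curious what a fallback chain looks like in principle, here is a minimal Python sketch of the general pattern. It is not Edgee's actual code: the `Model` type, the two exception classes, and the stub functions are all hypothetical names made up for illustration, and the stubs stand in for real provider calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error: the provider is down or unreachable."""


class PlanLimitReached(Exception):
    """Hypothetical error: the plan's usage or rate limit has been hit."""


@dataclass
class Model:
    name: str
    call: Callable[[str], str]  # takes a prompt, returns a completion


def complete_with_fallback(prompt: str, chain: Sequence[Model]) -> Tuple[str, str]:
    """Try each model in order; return (model_name, completion) from the first success."""
    errors = []
    for model in chain:
        try:
            return model.name, model.call(prompt)
        except (ProviderUnavailable, PlanLimitReached) as exc:
            # Outage or plan limit: record why and move on to the next model.
            errors.append(f"{model.name}: {type(exc).__name__}")
    raise RuntimeError("all models in the chain failed: " + "; ".join(errors))


# Stub providers standing in for real API calls (assumptions, not real clients).
def claude_down(prompt: str) -> str:
    raise ProviderUnavailable()


def kimi_limited(prompt: str) -> str:
    raise PlanLimitReached()


def qwen_ok(prompt: str) -> str:
    return f"qwen answered: {prompt}"


chain = [
    Model("claude", claude_down),
    Model("kimi", kimi_limited),
    Model("qwen", qwen_ok),
]
used, text = complete_with_fallback("hello", chain)
print(used, text)
```

The same prompt is passed to every model in the chain, so the caller does not change anything when a switch happens. Putting a single model in the chain corresponds to the "route always" case.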
How it fits with our other features:
- Compression: use fewer tokens
- Teams: see who uses tokens and on what
- Fallback Models: keep working when your primary model can't
Fallback Models ships with our Team plan. The compression engine that powers all of it is free to try, no credit card.
Two questions for you:
- Which fallback models would you actually want to use?
- What other failure modes should your coding assistant handle?
Will be in comments all day 🙏
edgee.ai/fallback-models
Kilo Code
@sachamorard bravo for this new launch - keep up the great work, keep launching
RiteKit Company Logo API
Can we set the sequence of fallbacks? See, I'd love to give you a sequence of the LLMs I don't pay for and then, as a last resort, OpenAI and Grok can squeeze the last of my lifeblood out of me. Thanks.
Edgee
@osakasaul Fallbacks are indeed chainable, yes! Do you have a quick idea of which LLMs you might be talking about?
RiteKit Company Logo API
@nicolasgirdt Probably 3-4 of the ones you list on the site, then Grok, Gemini, and ChatGPT at the end, since, like Claude, I pay through the nose for them. Right now I'm optimizing with Claude Code first; we'll see how it goes.
Edgee
@sachamorard Upvoted! The rate limit issue in Claude Code is real, and the automatic fallback with context compression is exactly what was needed.
Looking forward to testing this on a large project to see if it holds up.
AISA AI Skills Test
smart approach to a real pain point. the rate limiting on Claude Code during peak hours has killed my flow more times than I'd like to admit. curious how the token compression affects output quality though: does it handle long context windows well or is there a tradeoff with the 50% savings?
Edgee
The transparent proxy approach here is clever. Intercepting at the API layer means zero client changes, and that matters. We've burned time at RetainSure debugging failures partway through a session when Claude's rate limits kicked in at the worst moments. How do you normalize tool_use schemas across models? Claude's format doesn't map cleanly to Qwen or Gemma, and that mismatch can quietly degrade agent output.
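To make the schema mismatch concrete, here is a rough Python sketch of one piece of it. It assumes the fallback model is served behind an OpenAI-compatible function-calling endpoint, which is an assumption and not something stated in this thread, and it says nothing about how Edgee handles this. The function names and the `read_file` tool are hypothetical.

```python
import json
from typing import Any, Dict


def anthropic_tool_to_openai(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an Anthropic-style tool definition to the OpenAI function-tool shape."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool["input_schema"],  # both sides use JSON Schema here
        },
    }


def openai_tool_call_to_anthropic(call: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an OpenAI-style tool call back into an Anthropic-style tool_use block."""
    return {
        "type": "tool_use",
        "id": call["id"],
        "name": call["function"]["name"],
        # OpenAI-style calls carry arguments as a JSON string; tool_use carries an object.
        "input": json.loads(call["function"]["arguments"]),
    }


# Hypothetical tool, for illustration only.
read_file_tool = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

converted = anthropic_tool_to_openai(read_file_tool)

model_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "read_file", "arguments": "{\"path\": \"README.md\"}"},
}
block = openai_tool_call_to_anthropic(model_call)
print(converted["function"]["name"], block["input"])
```

The field mapping is the easy part. The quieter failure mode is a fallback model emitting arguments that are not valid JSON or that do not match the schema, which would make `json.loads` raise here and would need validation in a real proxy.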
Edgee
Interesting - the pain point is real: coding agents are now operational dependencies, so provider limits and outages become workflow risk. The part I'm interested in is not whether the session keeps running, but whether the fallback path preserves intent, tool-use behaviour, and 'reviewability'. A model switch that silently changes judgement would be worse than a hard stop unless teams have good evals and clear evidence around what changed. I need to give this a proper try!
Fallback models for Claude Code is exactly what's needed: hitting a rate limit mid-task and losing context is painful. Does it maintain the full context when switching models, or does the fallback start fresh?
@imad_elkhafi We do maintain the context! Otherwise the feature would be less useful!