Edgee

One gateway for cheaper, faster, unstoppable coding agents

5.0•2 reviews•

1.3K followers

One gateway for cheaper, faster, unstoppable coding agents

5.0•2 reviews•

1.3K followers

Visit website

AI Infrastructure Tools

•

AI Metrics and Evaluation

•

LLM Developer Tools

Edgee is the gateway for your coding agents. It compresses tokens before they reach Anthropic, OpenAI, or any other LLM rrpovider (up to 50% lower cost), routes to faster or cheaper models when you want, and falls back automatically when a provider goes down or your plan hits its limit. Plus team attribution per repo and per PR. Same Claude Code, same Codex, lower bills, no downtime.

This is the 7th launch from Edgee. View more

Edgee Claude Code Compressor V2

Launched this week

Fewer tokens, same context, 50% cost reduction

Compression V2 cuts coding-agent token bills with three techniques across two layers: sharper tool result trimming, new task-aware tool surface reduction, and output brevity. Drop-in for Claude Code, Codex, OpenCode, and Cursor. Semantically lossless.

Edgee Claude Code Compressor V2 gallery image

Free

Launch tags:API•Developer Tools•Artificial Intelligence

Launch Team / Built With

Wispr Flow: Dictation That Works EverywhereStop typing. Start speaking. 4x faster.

Promoted

Edgee

Maker

📌

Hey friends

Sacha here, founder of @Edgee.

Back in March we shipped our first compressor. One technique, tool result trimming, inspired by the RTK project. It delivered around 10% cost savings on real coding sessions. Safe, simple, limited.

Today we're launching Compression V2, and it is a different beast.

Three orthogonal techniques, each attacking a different layer of an agent request, each toggleable independently:

1. Brevity (output layer). The model still does every tool call and produces the same final patches, it just stops narrating its plan. Output is the most expensive token class, and this is where the big

win lives.

2. Tool surface reduction a.k.a TSR (input layer). Agents send the model the union of every MCP tool on every request, even when 95% are irrelevant. V2 runs a fast classifier that scores each tool against the task and strips the rest before the request hits the model. Your IDE still exposes everything, the model just sees a curated subset.

3. Tool result trimming (history layer, refined from V1). Cleans up the verbose tool outputs that pile up over a long session without dropping what the model needs.

Because the three touch different layers, they compose cleanly. Combined, that lands around 50% cost reduction on a typical session.

The part I am most proud of is not the number, it is how we measured it. Our research engineer @0kham ran this on SWE-bench Lite in agent mode, with paired sign tests, bootstrap confidence intervals, per-replicate cache nonces so no run gets an unfair cache advantage, and token counts read straight from the raw usage fields. We published the full methodology, including where each technique is strong and where it is modest. Brevity is ~30% median on coding, TSR is a huge token-volume win on tool-heavy MCP work, trimming compounds over long sessions. No single inflated headline number, just the decomposition and the stats behind each claim.

Full technical write-up here if you want the tests and the CIs: https://www.edgee.ai/blog/posts/introducing-compressor-v2-three-compression-layers-measured-end-to-end-for-a-50-cost-reduction

It is semantically lossless on code tasks, drop-in for Claude Code, Codex, OpenCode, Github Copilot and Cursor, with under 12ms P50 gateway overhead. Your CLAUDEmd and MCP servers stay exactly where they are.

A few questions I would love your take on:

Which of the three techniques sounds most useful for your workflow?
Anyone running heavy MCP setups who wants to try tool surface reduction and tell us what you see?

Will be in the comments all day. Thanks for checking it out

Report

8d ago

Coding agents are amazing until you realize how much token waste is hidden in the background. not just the actual code changes, but repeated tool context, long outputs, noisy history, and models explaining things I didn't really need explained. The TSR idea is the most interesting part for me. with more MCP tools connected, the "tool surface" can get huge very fast, and sending irrelevant tools to the model on every request feels like exactly the kind of invisible cost that compounds.

I also really respect that you published the methodology instead of just saying "up to 50% cheaper". paired tests, cache nonces, and decomposition by technique makes the claim much easier to trust. Curious how you think about the risk side of TSR. if the classifier removes a tool that would've been useful later in the task, can the agent recover, or does it need to restart with a wider tool set?

Report

5d ago

Edgee

Maker

@andrasczeizel TSR is challenging for sure, we wanted to make sure we reduce the MCP surface with still 100% of the hit on MCPs use.

The classifier was the hard part. We needed to make sure there was no loss of information for the model to understand the different capabilities so went added some info to the model with the following prompt :

"Here are the available integration if needed let us the resolver Tool if you need an external integration" and then adding a list of the MCPs we detected.

The gateway then sends the right MCP request to the client which execute the query. This shouldn't need a restart of the agent or anything.

So far the model always uses the right MCP when needed, hopefully the upcoming usage from Edgee user will help us understand if we're on point every time.

Report

5d ago

Rare to see a cost claim backed by paired sign tests and per-replicate cache nonces instead of one headline number — that got me to read the full write-up. The one thing I couldn't work out: how compression interacts with prompt-cache prefix stability. On Anthropic, a cache read costs ~1/10 of uncached input, and a long Claude Code session gets most of its economics from the prefix staying byte-stable. History-layer trimming that touches an already-sent tool result mutates the prefix and invalidates everything after it — and TSR looks like it has the same tension, since the tool block sits at the very top of the prefix, so a per-task curated subset would change the first bytes of the prompt as the task evolves. Two questions: (1) Is trimming append-only — freeze what's already been sent, compress only new turns — or re-optimized per request? (2) When you say ~50%, is that net of the lost cache discounts, or measured on uncached token counts? The nonces read like caching was deliberately neutralized — the right call for isolating each technique, but it leaves the production net effect open. Asking as squarely the target user: solo dev, long-running sessions, a bill that's mostly cache reads.

Report

5d ago

Edgee

Maker

@kyo_shino Hi Kyo,

So we thought a lot about the possible issues with caching and I agree that we need to be really careful on not invalidating cache.

Therefore the real MCP never get sent so no cache invalidation and no recomputing of cache which would end up being more expensive.

The way that it works is that the virtual MCP decides which user MCP to call and sends it to the client so that the client can call it. Once we have the result of the call we send it back to the model without invalidating the cache

The 50% we have is cost reduction, not token, but real $$$ therefore it accounts all the cache handling and everything !

Report

5d ago

@nicolasgirdt Thanks Nicolas — the virtual-MCP indirection is an elegant way out: the model sees one stable resolver instead of a shifting tool block, so the prefix survives task switches, and the round-trip through the client keeps the results in-band. And "50% in dollars, cache handling included" is exactly the number that matters — noted. The one detail I'll go dig out of the write-up is whether history-layer trimming freezes what's already been sent or re-optimizes per request. Congrats on the launch — the measurement rigor alone earned my upvote.

Report

5d ago

Edgee

Maker

Thanks @kyo_shino , we will try to continue on this path, and bring even more scientific rigor to the improvement of our compression technologies.

Report

4d ago

I enjoy products that improve existing tools of replacing then. how does Compression V2 perform on very long debugging sessions with repeated tool calls? Publishing those numbers could answer another important question for engineering teams.

Report

5d ago

Edgee

Maker

@mikkel_banner Hey Mikkel,

We ran a bunch of benchmarks to make sure our results are as close as possible to real use cases.

Feel free to check @0kham article which explains the different results we have !

Report

5d ago

Edgee

Maker

@mikkel_banner Compressor V2 performs very well during long coding sessions, which is something the SWE Benchmark evaluates quite effectively. Feel free to test it and let us know what you think - we love that kind of feedback

Report

4d ago

Curious how Edgee handles the “gateway” part in practice for coding agents. Is the main idea routing requests across different agent providers for cost and speed, or is it more about giving engineering teams one place to manage access and workflows? The “cheaper, faster, unstoppable” promise is clear, just wondering where the biggest control point is.

Report

5d ago

Edgee

Maker

@crystalmei it’s complicated to choose between these two focuses. Haha, I would answer both are top of our priorities. - routing is so powerful when it comes to optimise token cost - and of course, compression and team observability are really important too

Report

5d ago

Me noticed the emphasis on keeping results semantically lossless. could users review before and after token reports for every session? That level of transparency might encourage wider adoption across larger engineering organizations.

Report

5d ago

Edgee

Maker

@stacey_connolly2 Absolutely! By default, Edgee does not store prompts. Our gateway processes and optimizes the context, then forwards it to the LLM API without storing the content. However, if you enable debug mode, we do log the content before and after compression, giving you a complete view of how we handle it.

Report

5d ago

Brevity being the biggest win makes sense to me — I notice Claude Code narrating plans I never asked for. Does suppressing that narration change how easy it is for a human to follow along mid-session, or is it only trimmed on the wire?

Report

5d ago

Edgee

Maker

@kojimajunya At first, the model's response is a little surprising, but it's still clear and understandable. After using it for a while, I don't even notice anymore that the model's responses are brief… you quickly get used to it.

Report

4d ago

1 2 3

Previous Edgee Launches

Edgee Turbo ModelsUse Claude Code with Kimi K2.7 Code, MiniMax M2.7, and more

Launched on June 16th, 2026

Edgee Fallback ModelsClaude Code that never stops

Launched on May 24th, 2026

Edgee TeamStrava for your coding assistants

Launched on April 26th, 2026

Edgee Codex CompressorUse Codex at 35.6% lower costs

Launched on April 12th, 2026

View all Edgee launches

Forum Threads

p/edgee

•

5mo ago

Token Compression for LLMs: How to reduce context size without losing accuracy

Hey, I'm Sacha, co-founder at @Edgee

Over the last few months, we've been working on a problem we kept seeing in production AI systems:

LLM costs don't scale linearly with usage, they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge and token spend quickly becomes the main bottleneck.

So we built a token compression layer designed to run before inference.

View all

Hey friends

Sacha here, founder of @Edgee.

Back in March we shipped our first compressor. One technique, tool result trimming, inspired by the RTK project. It delivered around 10% cost savings on real coding sessions. Safe, simple, limited.

Today we're launching Compression V2, and it is a different beast.

Three orthogonal techniques, each attacking a different layer of an agent request, each toggleable independently:

win lives.

3. Tool result trimming (history layer, refined from V1). Cleans up the verbose tool outputs that pile up over a long session without dropping what the model needs.

Because the three touch different layers, they compose cleanly. Combined, that lands around 50% cost reduction on a typical session.

Full technical write-up here if you want the tests and the CIs: https://www.edgee.ai/blog/posts/introducing-compressor-v2-three-compression-layers-measured-end-to-end-for-a-50-cost-reduction

A few questions I would love your take on:

Which of the three techniques sounds most useful for your workflow?
Anyone running heavy MCP setups who wants to try tool surface reduction and tell us what you see?

Will be in the comments all day. Thanks for checking it out

Edgee

One gateway for cheaper, faster, unstoppable coding agents

One gateway for cheaper, faster, unstoppable coding agents

Edgee Claude Code Compressor V2

Previous Edgee Launches

Forum Threads

Token Compression for LLMs: How to reduce context size without losing accuracy

Previous Edgee Launches

Forum Threads

Token Compression for LLMs: How to reduce context size without losing accuracy

What's great

What needs improvement

What's great

What needs improvement

vs Alternatives

What's great

What needs improvement

What's great

What needs improvement

vs Alternatives