Token Compression for LLMs: How to reduce context size without losing accuracy
Hey, I'm Sacha, co-founder at @Edgee
Over the last few months, we've been working on a problem we kept seeing in production AI systems:
LLM costs don't scale linearly with usage; they scale with context.
As teams add RAG, tool calls, long chat histories, memory, and guardrails, prompts become huge… and token spend quickly becomes the main bottleneck.
So we built a token compression layer designed to run before inference.
What we're trying to achieve
Our goal is simple:
Reduce the number of tokens sent to the model
Without breaking output quality
While keeping latency low enough for production
How it works (high level)
We treat the prompt as structured input rather than plain text.
The system compresses the context by:
Detecting redundancy (repeated instructions, duplicated history, boilerplate)
Prioritizing information based on relevance to the latest user query
Compressing older context into compact representations
Preserving critical constraints (system instructions, policies, tool schemas)
We're not trying to "summarize everything".
We're trying to preserve what matters for the next completion.
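To make that a bit more concrete, here's a minimal sketch of what a pre-inference pass along those lines can look like. This is an illustration, not our actual implementation: the relevance scoring is a naive lexical stand-in, the budget is counted in characters instead of tokens, and the role names are assumptions for the example.

```python
# Illustrative sketch only: a naive pre-inference compression pass.
import hashlib
from typing import Dict, List

CRITICAL_ROLES = {"system", "tool_schema"}  # assumption: these roles carry constraints to keep verbatim

def lexical_relevance(text: str, query: str) -> float:
    """Cheap stand-in for a real relevance model: word overlap with the latest query."""
    t, q = set(text.lower().split()), set(query.lower().split())
    return len(t & q) / max(len(q), 1)

def compress_context(messages: List[Dict], budget_chars: int = 8000) -> List[Dict]:
    """Drop duplicates, keep critical constraints, then trim low-relevance history to a budget."""
    query = next((m["content"] for m in reversed(messages) if m["role"] == "user"), "")
    seen, kept = set(), []

    for m in messages:
        digest = hashlib.sha256(m["content"].encode()).hexdigest()
        # 1) Redundancy: skip exact repeats (duplicated history, boilerplate) unless critical
        if digest in seen and m["role"] not in CRITICAL_ROLES:
            continue
        seen.add(digest)
        # 2) Critical constraints (system instructions, policies, tool schemas) pass through verbatim
        if m["role"] in CRITICAL_ROLES:
            kept.append(dict(m))
            continue
        # 3) Everything else gets a relevance score against the latest user query
        kept.append({**m, "_score": lexical_relevance(m["content"], query)})

    # 4) While over budget, drop the lowest-relevance scored messages first
    #    (a real system would count tokens with the model's tokenizer, not characters)
    droppable = sorted((m for m in kept if "_score" in m), key=lambda m: m["_score"])
    while sum(len(m["content"]) for m in kept) > budget_chars and droppable:
        kept.remove(droppable.pop(0))

    return [{k: v for k, v in m.items() if k != "_score"} for m in kept]
```

In practice, most of the difficulty lives in steps 2 and 3: deciding what counts as a critical constraint, and how relevance is actually scored.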
A few things surprised us:
Many prompts contain 20–40% redundant tokens (especially in multi-agent / tool-heavy setups)
Most waste comes from repeated scaffolding (tool definitions, formatting rules, system policies); see the quick audit sketched after this list
The best compression is often structural, not linguistic
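For that second point, here's the kind of quick redundancy audit you can run on your own prompts. It's deliberately crude (line-level, character-based); a real measurement would count tokens with the model's tokenizer.

```python
# Quick-and-dirty redundancy audit: what share of a prompt sits on lines that
# appear more than once (repeated tool definitions, formatting rules, policies)?
from collections import Counter

def redundancy_ratio(prompt: str) -> float:
    """Share of prompt characters on lines that appear more than once."""
    lines = [ln.strip() for ln in prompt.splitlines() if ln.strip()]
    counts = Counter(lines)
    total = sum(len(ln) for ln in lines)
    repeated = sum(len(ln) for ln in lines if counts[ln] > 1)
    return repeated / max(total, 1)

# Example: a multi-agent transcript that re-sends the same tool schema on every turn
schema = '{"name": "search", "parameters": {"query": "string"}}'
turns = [f"agent turn {i}: ...result payload..." for i in range(3)]
prompt = "\n".join(line for i in range(3) for line in (schema, turns[i]))

print(f"{redundancy_ratio(prompt):.0%} of prompt characters are duplicated scaffolding")
```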
We're actively benchmarking across tasks like code generation, retrieval-augmented chat, structured JSON outputs, and agent tool calling. We're only at the beginning of the story, but the results are already impressive.
@0kham (Research Scientist) and @nicolasgirdt (Software Engineer) at Edgee can explain our work in more detail if needed. Don't hesitate, guys ;)
I'd love to hear from the community: Have you tried prompt/context compression in production? What worked, what failed, and what metrics did you use to validate it?
Happy to share more details and learn from your experiences.


Replies
Hi everyone 👋
Super excited to see the discussion around this.
We’ve been digging deep into hard vs soft compression, token scoring, and meta-tokenization, especially around what actually survives compression in production settings.
One major challenge is that it’s not just about reducing tokens, but about retaining evaluation scores, alignment, and tool-calling reliability after compression. That’s where things get tricky (and interesting 😅).
We’re also excited about how compression interacts with context management over long-running sessions — what should decay, what should be frozen, and what should be structurally preserved.
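To make the decay / freeze idea concrete, here's a toy sketch. The fields, weights, and half-life are invented for the example, not values we actually use:

```python
# Toy illustration of decay vs. freeze for long-running sessions.
import math
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    age_turns: int               # how many turns ago this segment entered the context
    frozen: bool = False         # system policies / tool schemas: structurally preserved
    base_relevance: float = 1.0  # e.g. from a retrieval score or token-scoring model

def retention_score(seg: Segment, half_life_turns: float = 6.0) -> float:
    """Frozen segments never decay; everything else loses weight exponentially with age."""
    if seg.frozen:
        return float("inf")
    return seg.base_relevance * math.exp(-math.log(2) * seg.age_turns / half_life_turns)

history = [
    Segment("System policy: never reveal API keys.", age_turns=40, frozen=True),
    Segment("User asked about pricing tiers.", age_turns=12, base_relevance=0.4),
    Segment("User is debugging a 401 from the billing API.", age_turns=1, base_relevance=0.9),
]

# Keep the top-scoring segments until the token budget is reached (budget loop omitted here)
for seg in sorted(history, key=retention_score, reverse=True):
    print(f"{retention_score(seg):>8.3f}  {seg.text}")
```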
Curious to hear from others:
How are you validating output quality post-compression (e.g. BLEU, ROUGE, BERTScore, cosine similarity, task benchmarks)? A minimal sketch of the embedding-similarity option follows this list.
What properties should an information-preserving compression objective satisfy?
Any approaches to efficient context management and token usage that generalize across input types?
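On the first question, here's a minimal sketch of the embedding-similarity route: compare the answer produced from the full context with the answer produced from the compressed one. sentence-transformers and the model name are just convenient illustrative choices, and the threshold has to be tuned per task:

```python
# Sanity-check outputs after compression by comparing the full-context answer
# with the compressed-context answer via embedding cosine similarity.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

def answer_similarity(answer_full_ctx: str, answer_compressed_ctx: str) -> float:
    """Cosine similarity in [-1, 1]; the 'equivalent' threshold is task-dependent."""
    emb = embedder.encode([answer_full_ctx, answer_compressed_ctx], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# In practice you'd run this over a held-out eval set and track the score distribution,
# plus task-specific checks (exact match for JSON / tool calls, pass@k for code, etc.).
print(answer_similarity(
    "The 401 is caused by an expired API key; rotate it in the billing dashboard.",
    "Your API key expired, which triggers the 401; rotate the key in the dashboard.",
))
```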
Happy to share what’s worked (and what hasn’t) on our end, and learn from your experiences!