Launching today

Sipcode
Keep Claude Code's context clean for sharper answers
97 followers
Context hygiene for Claude Code. Caps verbose tool output and dedupes same-session re-reads so the model sees signal, not noise. Anthropic measures 29% quality lift from cleaner context. Proof: 62.6% median tool-output savings on a locked 20-task benchmark. MIT.

Sipcode
Foyer
The context window management problem in Claude Code is real. Long sessions accumulate dead weight fast: old tool outputs, abandoned approaches, redundant file reads. Once the context gets bloated, the model starts hedging more and the answers get muddier. Curious whether Sipcode is doing something principled to decide what to prune (like deprioritizing failed attempts or stale file state) or whether it's more of a manual curation layer where you're telling it what to keep. Also wondering if there's any handling for cases where something that looked like a dead end earlier in the session turns out to be relevant again.
Sipcode
@fberrez1 Florent, sharp question. The distinction you are drawing is real.
Honest answer: Sipcode operates at the mechanical layer, not the semantic one. It does NOT currently decide "this approach was abandoned" or "this file is stale." That kind of semantic curation needs an LLM in the loop (kills the privacy story) or a structured intent trace (research territory).
What Sipcode does today:
1. Reads: dedup by file path + content hash. If Claude already read it and disk has not changed, the re-Read short-circuits. Original content stays in context.
2. Verbose tool output (git log, npm install, grep, find): cap volume via parameter injection. Static rules, not semantic.
On your dead-end-becomes-relevant-again case: Sipcode does not remove what is already in context. It catches DUPLICATIVE reads only. If something seemed irrelevant earlier and matters now, Claude still has the original bytes and can re-engage.
The real edge case: if Sipcode caps a verbose output (grep at 100 results) and result #500 was the one you needed. That is a failure mode. Every rewriter declares an integrity score on each fire so over-stripping is visible in sipcode why.
Semantic curation (deprioritize failed attempts, drop stale state) is the right next layer. Honest pre-commitment: it requires an architecture I have not figured out yet, or a privacy compromise I am not willing to make. Thinking on it.
Context bloat is my #1 frustration with Claude Code in long sessions. You watch it re-read the same files and re-print npm install walls of text and by the end of a complex session the answers are noticeably worse. The 40% agent error reduction stat is the one that got my attention - quality lift is nice but errors are the thing that actually breaks workflows. The PreToolUse hook approach is smart because it intercepts before the context gets polluted rather than trying to clean up after. Installing this today. Does it handle situations where Claude Code genuinely needs to re-read a file because it changed, or does it dedupe those too?
Sipcode
@galdayan Thanks Gal, that 40% number is exactly why I lean on it over the quality lift in the copy.
To your question: no, changed files are never deduped. On every potential dedup hit, the proxy compares cached bytes against current disk bytes after LF and BOM canonicalization. If they differ by even one byte, the read goes through untouched. The cost is one stat + hash per re-read, the benefit is I never feed Claude stale content. Designed it that way because a wrong dedup is worse than no dedup at all.
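A minimal sketch of the re-read check described above, under stated assumptions: the class name `ReadCache`, the method `check_and_record`, and the choice of SHA-256 are hypothetical and not taken from Sipcode's source. It only illustrates the rule that a re-read is treated as a duplicate when the canonicalized bytes on disk match what was seen earlier in the session, and that anything uncertain passes through.

```python
import hashlib
from pathlib import Path


def _canonical_digest(raw: bytes) -> str:
    """Hash file bytes after stripping one leading UTF-8 BOM and normalizing newlines."""
    if raw.startswith(b"\xef\xbb\xbf"):  # assumption: files are UTF-8 encoded
        raw = raw[3:]
    raw = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha256(raw).hexdigest()


class ReadCache:
    """Hypothetical per-session cache: resolved path -> digest of the last bytes seen."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def check_and_record(self, path: str) -> bool:
        """Return True only if this path was read before and disk content is unchanged."""
        try:
            digest = _canonical_digest(Path(path).read_bytes())
        except OSError:
            return False  # unreadable or missing: let the read go through untouched
        key = str(Path(path).resolve())
        previous = self._seen.get(key)
        self._seen[key] = digest  # always remember the latest content
        return previous == digest
```

The first read of a path returns False and is recorded; a later read returns True only while the digest still matches, so any edit on disk forces a pass-through.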
The dogfooding story sells this more than any benchmark: discovering that your own drift tool showed 624,940 tokens wasted while --stats credited 7,553 saved, then root-causing it to uncached mid-session installs and shipping Warm-Fill in 24h. Most launches would've quietly buried that. And the 38% duplicate-Read finding finally names that "why does this session feel sluggish" sensation I could never explain.
One question on the dedup: you canonicalize LF and BOM before the byte comparison. For files where whitespace carries meaning — Python, Makefiles, YAML — can that normalization ever flatten a real change into a false no-op, or is it strictly newline/BOM and never touches interior whitespace?
Sipcode
@david_vilalta Great question, and it is the exact thing I was paranoid about when I wrote it.
It is strictly line-ending plus a leading BOM, and it never touches interior whitespace. The whole canonicalizer is two operations: strip one leading U+FEFF, then replace CRLF and lone CR with LF. That is it. No tab/space folding, no indentation collapsing, no trailing-whitespace trimming. So a real change in a Python block's indentation, a tab-vs-space edit in a Makefile, or a re-nest in YAML all survive as genuine byte differences and the read passes through. They are never flattened to a false no-op.
The only thing it does flatten is a pure line-ending change or a BOM toggle with no other edit. For Python, Make, and YAML that is a semantic no-op anyway, so deduping it is the correct call rather than a risk. The one theoretical exception is a file whose meaning literally depends on CRLF bytes, like a fixture testing newline handling, but that is not Claude re-reading source for understanding, and even then the cost is one redundant read, never a wrong one.
Design rule I held to: when unsure, let the read through. A missed dedup costs tokens. A wrong dedup costs trust.
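The two-operation canonicalizer described in this reply can be sketched as follows. This is an illustration and not Sipcode's code: the function name `canonicalize` is hypothetical, and it assumes UTF-8 input so that U+FEFF appears as the bytes EF BB BF.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8 (assumption: UTF-8 files)


def canonicalize(data: bytes) -> bytes:
    """Strip one leading BOM, then turn CRLF and lone CR into LF. Nothing else."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    # CRLF must be replaced first so it does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# A pure line-ending or BOM change compares equal after canonicalization.
assert canonicalize(b"\xef\xbb\xbfx = 1\r\n") == canonicalize(b"x = 1\n")

# A tab-vs-space edit in a Makefile recipe stays a genuine byte difference.
assert canonicalize(b"all:\n\techo hi\n") != canonicalize(b"all:\n    echo hi\n")

# A Python indentation change also survives.
assert canonicalize(b"if x:\n    y()\n") != canonicalize(b"if x:\n  y()\n")
```

Because interior bytes are never rewritten, the only inputs that collapse to the same output are those that differ in line endings or a leading BOM alone.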
The re-read dedup looks like a clear win! QQ - when you inject a head_limit on a grep Claude ran without one, does the model see a "truncated, N more matches exist" marker? Does it read the capped list as the full set? Overall, very well done!
Sipcode
@artstavenka1 Good question, and it gets at the exact risk I worried about with this rewriter.
Key thing: I inject the native head_limit parameter rather than truncating the output myself. So whatever Claude Code normally surfaces when a grep is capped is preserved untouched. I'm setting the same param a user could set by hand, not post-processing the result and stripping a marker. I never hide matches behind the model's back.
The honest residual risk is the one you're pointing at: if a real query genuinely needed more than the cap, the model works from the capped set. That's exactly why v1.6.16 raised the cap from 50 to 100. My dogfood data showed native-grep was the highest-volume and lowest-integrity rewriter, and 50 was clipping real symbol lookups across larger codebases. 100 covers the vast majority of real Claude Code greps while still bounding pathological ones. The rewriter declares a 0.78 integrity score precisely to keep that residual honest in the stats.
Two guardrails: it never reorders (it keeps ripgrep's native ordering for the first N), and if Claude sets its own head_limit I leave it alone and don't override. So the model can always opt out by being explicit.
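A rough sketch of the "inject unless explicit" rule from this reply, assuming the Grep tool input arrives as a plain dictionary. The function name `inject_head_limit` and the input shape are hypothetical; only the head_limit parameter name and the cap of 100 come from the thread.

```python
DEFAULT_CAP = 100  # the cap mentioned in the thread (raised from 50 in v1.6.16)


def inject_head_limit(tool_input: dict, cap: int = DEFAULT_CAP) -> dict:
    """Add head_limit to a Grep call only when the caller did not set one."""
    if tool_input.get("head_limit") is not None:
        return tool_input  # explicit value present: step aside, never override
    capped = dict(tool_input)  # copy so the caller's dict is not mutated
    capped["head_limit"] = cap
    return capped


# No limit supplied: the cap is injected.
print(inject_head_limit({"pattern": "TODO"}))

# Explicit limit supplied: the call is left exactly as written.
print(inject_head_limit({"pattern": "TODO", "head_limit": 500}))
```

Since the rewrite happens on the call parameters and not on the output, ordering and any truncation signal remain whatever the tool itself produces.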
Appreciate you reading down to this level.
Tendem by Toloka
Hey, congrats!
A couple of questions.
Have you measured the quality impact somehow? I mean, the speed/quality on certain tasks.
Also, can Claude itself "disable" it if needed, when it thinks the hook over-stripped the content?
Thanks!
Sipcode
@perrymason Hey Viacheslav, thanks for the early look and the real questions.
On quality measurement: no controlled A/B on real user tasks yet. What I measure directly is per-rewriter signal kept (every rewriter declares an integrity score on each fire), tool-output savings on a locked 20-task benchmark (62.6% median, range 37.4% to 80.6%, reproducible via sipcode benchmark from the repo), and per-session proxy stats.
The 29% quality lift number is Anthropic's published research, not mine. I am careful not to claim Sipcode users specifically see 29%. The gap between "context got cleaner" (measurable) and "answers got better by X%" (requires controlled experiments) is real and I would rather flag it than oversell.
On configurability, three layers:
Per-tool-call: if Claude passes an explicit parameter (head_limit on Grep, count output mode, explicit offset on Read), the relevant rewriter detects the user-supplied value and steps aside. Claude can effectively opt out of compression for a specific call by being explicit. Rewriters skip rather than fight.
Per-rewriter selective disable via env var or config: not shipped yet. Honest gap. Today a user who hits over-stripping either passes an explicit param on that call or removes the proxy entirely via sipcode proxy --uninstall.
Per-session bypass triggered from inside the agent: also not shipped. Your specific scenario, where Claude itself decides "this hook over-stripped, back off for now", is a really good design idea I have not built. The per-fire integrity scores are there, so the data exists. Wiring it to an agent-side self-modulation primitive is something I want to think about for v1.7.
Tendem by Toloka
@axlerodd Thanks, glad you're thinking about improving the product :)
Regarding quality, I think the goal might not even be boosting it, but rather sustaining a similar quality level (a drop of 5-10% can be acceptable) if token usage falls by 50/60/70%. However, this requires actual thorough benchmarking, as performance may differ across tasks. And that's something many compression products lack... Glad that you don't hide that :)
Sipcode
@perrymason Viacheslav, that "sustaining" framing is sharper than where I was. Honest base case: same quality, lower tokens. Anything else is gravy.
Real benchmark needs a corpus someone else picks (not mine, since curated by me = optimized by me), agent-eval style with verifiable outcomes. Toggle sipcode on/off, measure agreement rate or judged quality.
I have not built it yet. Consider this a public commitment that it is on my list. If you have seen any agent-eval frameworks that handle seed/temperature non-determinism well, I would love a pointer.
Easier to flag the gap than ship a slogan I cannot back up.
Congrats on the launch! Keeping Claude Code context clean is a very real pain point for anyone building with AI coding tools. I like the focus on sharper answers instead of just longer context. How are you deciding what should stay in context versus what should be summarized or dropped?
Sipcode
@rahulbhavsar Thanks Rahul. The rule is intentionally boring: I never summarize and I never drop anything model-facing. I only rewrite where I can prove the rewrite preserves every fact Claude could realistically need next.
So for Bash output I cap volume (head_limit on grep, truncating npm install walls). For Read I dedup byte-identical re-reads inside the same session, with a hash check against current disk so changed files always pass through. Each rewriter declares a 0-1 integrity score so the savings number is never decoupled from how lossy the rewrite is.
Semantic summarization and importance ranking are higher-leverage and I have research on both, but neither clears the bar I've set for shipping into someone else's session. Lossless first, lossy never.
Are you hitting a case where you wish it dropped more aggressively, or kept more?