Spotlight by Backplanes - Session reports for Claude Code & Codex to improve your code
Keep up with your agents. Spotlight reads your Claude Code and Codex sessions and shows you what your agents actually did, and how to get recursively better every session: what to fix now, what to ship better next time, what's worth sharing. One harness or seven, solo or across your team. Free.


Replies
The scary part of vibe coding fast isn't the bug you catch, it's the secret you committed three sessions ago and never noticed. I spent years in risk and security before I ever touched Claude Code, so "what my agent actually did" is exactly the report I always wished I had. Does Spotlight call out the security stuff specifically, leaked keys, missing checks, or is it more about code quality and patterns?
@luca_capone, "the secret you committed three sessions ago" is almost word-for-word how Spotlight actually started for us: I asked Claude to fix one file, and an API key ended up in a tracked .env that we only caught by accident. So yes, security is very much called out specifically, and it leads the report rather than riding along. Findings arrive severity-ordered in their own stream: secrets landing in files git tracks, prod-touching commands that skipped a dry run, an agent quietly reaching a service you've never used, each with the evidence behind it and a concrete fix.
Code quality and patterns that can help make you more effective with your harness are in there too, so the report always gives you value, even when there are no security-related findings. And since those transcripts are already sitting on your machine, the first report can start with the sessions you've already run. Your "three sessions ago" is still catchable. :)
The "what your agents actually did" angle is great, that read-47-files scare is too real. When you're running several harnesses at once, does Spotlight give you one combined report or one per session?
Thanks, @ianhxu. "Too real" is exactly how it felt on the inside too. :) The answer is both, at different layers. Each session gets its own report, with its own findings and evidence, so you know exactly which session did what, and exactly what to do about it.
Running several harnesses at once just means several reports, and your Claude Code and Codex reports live side by side in Spotlight. But your highest-leverage opportunities are often in the trends and patterns across sessions, so Spotlight also gives you a report of what matters across all of them. And finally, connect your whole team and the view widens even further: patterns and trends across every engineer in an org.
One individual or a team, one harness or seven, Spotlight gives you both the detail and the big picture.
The OS-level instrumentation approach is smart. It captures what agents actually do rather than what they report back. We've run into exactly this problem: an agent silently inlining an API key when it couldn't find the env var, and that key landing in git history. How do you distinguish intentional credential usage in test fixtures from actual leakage?
@anand_thakkar1, that war story could be one of ours: an agent improvising around a missing env var by quietly inlining the key is the kind of move nobody catches in review. One small correction: it's not OS-level instrumentation. Spotlight reads only the session transcripts the harnesses themselves write, nothing else on your machine. But your key insight holds: it's what the tools actually did, not what the agent says it did.
On fixtures vs leakage, we treat those as two different jobs. Redaction is deliberately paranoid: anything secret-shaped gets masked on your machine before upload, fixtures included, because that step shouldn't be guessing intent. The judgment lives in the analysis: where the credential landed, whether it looks live, and what the session was doing at the time, with severity reflecting that context. A dummy key in a test fixture and a live-looking key written into a tracked file are very different findings. Every finding carries its evidence, so on close calls you're the judge, with the receipts in front of you.
We'd rather flag a fixture at low severity than miss a live key. That asymmetry is on purpose.
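To make the two jobs concrete, here is a minimal sketch of how they could be split. It is an illustration under stated assumptions, not Spotlight's actual implementation: the patterns, the `CredentialFinding` fields, the `looks_live` flag, and the severity labels are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical "secret-shaped" patterns; a real redactor would carry many more.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),     # shape of an AWS access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # shape of a GitHub personal access token
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # generic "sk-" style API key
]


def redact(text: str) -> str:
    """Job 1: mask anything secret-shaped, without guessing intent."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


@dataclass
class CredentialFinding:
    path: str          # where the credential landed
    git_tracked: bool  # whether git tracks that file
    looks_live: bool   # assumed to come from some upstream check (not shown)


def severity(finding: CredentialFinding) -> str:
    """Job 2: judge context. Fixtures are still flagged, just lower."""
    in_fixture = "fixture" in finding.path or "/tests/" in f"/{finding.path}"
    if finding.git_tracked and finding.looks_live and not in_fixture:
        return "high"
    if finding.looks_live:
        return "medium"
    return "low"
```

In this sketch a dummy key under `tests/fixtures/` comes out `low`, while a live-looking key in a tracked `.env` comes out `high`, which mirrors the asymmetry described above: nothing is dropped, only ranked.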
$300 for 50 min of coding. What kind of models are you running? 😅 How does it get recursively better each session? I don't get it. Reminds me of entire.io.
@conduit_design Ha, right?! The wild part: that's the agents' own tab, we just hand you the receipt. It's crazy how quickly token usage accelerates when you're running multiple subagents on an intensive job, and Fable pricing is going to make this even more fun for all of us soon. 😅
On "recursively better," the idea is that it's a loop with you in it. The model never changes; your setup does. Each report turns what happened into concrete and actionable advice: a fix to apply, a CLAUDE.md line to include, a Skill to draft. Your agent loads that richer setup next session and starts more informed than the last one.
looks really cool! Gonna take it for a spin
@louislecat lfg! here you go: backplanes.com
looking forward to your thoughts
This is a useful direction. For coding agents, the hard part is usually not generating more code, it is making the session reviewable afterward.
The report I’d want is pretty boring: changed files, risky assumptions, tests/checks run, failed attempts, and a short “what a human should look at first” section.
@kevinzrzgg, you seem to have written our report spec almost exactly. :)
Changed files: all files read and written are in there. Tests and checks: flagged with their outcomes when the session shows them. Failed attempts: called out, including the distinction between deliberate re-verification and flailing retries. "What a human should look at first": that's the top of the report, a one-line verdict with the main outcome, then findings ordered by severity and guidance on what to do for each.
The one we can only claim half credit on is risky assumptions: concrete risky choices surface as findings, and a blind-spots section names what the report couldn't verify, but a dedicated assumptions section is a great idea.
And we're with you on boring: the standing rule inside the report is no invented findings and no padded advice. An empty section beats a manufactured one.
The Neil story is the real pitch here: not productivity, but security. Most devs assume they're reviewing what the agent does, but at scale (multiple sessions, multiple team members), drift is invisible. The framing as "session reports" makes it feel like a dev tool, but this is really an audit trail. Smart. Curious whether you'll add diff-level visibility (which files were read, not just that 47 were).
The interpretation layer on top of raw transcripts is the real product here. Distinguishing a retry storm from deliberate re-verification, or flagging a credential class without holding the value: that's genuine signal extraction. We've wrestled with agent filesystem boundary decisions. How do you handle cross-session pattern detection when the same agent operates across different repos or machines?
@retain_dev Gaurav, "filesystem boundary decisions" tells me we've fought some of the same battles. :)
Short answer: the anchor is identity, not inference. The CLI is signed in as you, so every session carries the same account identity no matter which repo or machine it ran on, with the repo, harness, and model riding along as context. Patterns aggregate across the account, so the same retry habit surfaces whether it happened in your API repo on a desktop or a scratch project on a laptop. That's literally how Spotlight started for us: stitching our own sessions together across machines, and being floored by how much we'd missed and how few of our good patterns traveled.
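As a rough sketch of what "aggregate across the account" could look like: group findings by the explicit account identity and treat repo and machine purely as context. The `SessionFinding` record and the pattern names are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionFinding:
    account: str  # explicit signed-in identity: the only grouping key
    repo: str     # context only
    machine: str  # context only
    pattern: str  # hypothetical label, e.g. "retry_storm"


def patterns_by_account(findings):
    """Count each pattern per account, regardless of repo or machine."""
    counts = defaultdict(Counter)
    for finding in findings:
        counts[finding.account][finding.pattern] += 1
    return counts


findings = [
    SessionFinding("dev@example.com", "api", "desktop", "retry_storm"),
    SessionFinding("dev@example.com", "scratch", "laptop", "retry_storm"),
    SessionFinding("other@example.com", "api", "desktop", "retry_storm"),
]
totals = patterns_by_account(findings)
```

Here the same habit on two machines rolls up to one count of 2 for `dev@example.com`, and nothing is inferred across accounts.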
One thing we deliberately don't do is behavioral fingerprinting to guess "same agent" across accounts: identity stays explicit and predictable. And since you mentioned boundaries: every report flags file access outside the project, per session, so drift is visible long before it needs to be policy.
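For the boundary check, a minimal sketch of flagging file access outside the project might look like the following. It assumes the transcript yields a list of POSIX-style paths per session; the function name and inputs are hypothetical, and the check is purely lexical (it does not resolve symlinks or expand `~`).

```python
import posixpath
from pathlib import PurePosixPath


def outside_project(project_root: str, accessed: list[str]) -> list[str]:
    """Return the accessed paths that fall outside project_root."""
    root = PurePosixPath(posixpath.normpath(project_root))
    flagged = []
    for raw in accessed:
        # Relative paths are taken relative to the project root;
        # absolute paths replace the root entirely in posixpath.join.
        joined = posixpath.join(str(root), raw)
        # Collapse ".." segments so "../other" cannot hide inside the root.
        path = PurePosixPath(posixpath.normpath(joined))
        if not path.is_relative_to(root):  # requires Python 3.9+
            flagged.append(raw)
    return flagged


flagged = outside_project(
    "/home/me/app",
    [
        "src/main.py",
        "/home/me/app/README.md",
        "/home/me/.aws/credentials",
        "../other/.env",
    ],
)
```

Only the last two paths are flagged here: one is an absolute path outside the root, the other escapes it through `..`.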