
Bench for Claude Code
Store, review, and share your Claude Code sessions
716 followers
Claude Code just opened a PR. But do you really know what it did? By using Bench you can automatically store every session and easily find out what happened. Spot issues at a glance, dig into every tool call and file change, and share the full context with others through a single link: no further context needed. When things go right, embed the history in your PRs. When things go wrong, send the link to a colleague to ask for help. Free, no limits. One prompt to set up on Mac and Linux.
Clipboard Canvas v2.0
I've tackled similar challenges with code reviews and context sharing, and I love how Bench automates session storage. How do you handle sensitive data in stored sessions to ensure developers aren’t accidentally sharing proprietary code?
Bench for Claude Code
@trydoff Hi there! :)
That really is a tough topic, and one we will surely iterate on in the future. For now, we've moved in these directions:
- You own your trace: you are free to delete any tracking code, along with all its related sessions, anytime you deem fit. You can even set expiration dates.
- We intentionally do not record any tool use OUTPUT, just the inputs, precisely because we want to do this right. When we do implement output recording, it will be an opt-in feature for sure.
- You can define separate tracking codes for different uses: they are configured through simple envfiles, so it's quite trivial to keep data separated, e.g. by using a disposable tracking code for activity logs you may eventually want to delete, or by disabling Bench altogether for specific projects if you need to.
- Of course, the sharing functionality is opt-in and completely under your control, so you can share only the sessions you deem right and stop sharing them whenever you want.
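For illustration, a per-project envfile with a disposable tracking code might look something like this (these variable names are hypothetical, not Bench's actual configuration keys, so check the docs for the real ones):

```shell
# .env for a throwaway experiment (hypothetical variable names)
# A disposable tracking code keeps these sessions separate,
# so they can all be deleted later in one go
BENCH_TRACKING_CODE=disposable-experiments

# In a project where sessions should never be stored,
# a hypothetical kill switch might look like:
# BENCH_ENABLED=false
```

Keeping one envfile per project makes the data boundary explicit: whatever is logged under a tracking code can be wiped by deleting that code.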
It is also worth mentioning that our company, Silverstream, is part of a larger AI Alliance collaboration with CUBE (https://arxiv.org/abs/2603.15798), and we are in the process of releasing an open-source version, which should fully address this concern if it worries you.
I also encourage you to contact us at manuel@silverstream.ai to get further details about the whole process :)
Bench for Claude Code
Claude Code is so capable that we end up trusting it a little too much. But that's exactly when things get interesting:
I've had it silently migrate my local DB to an incompatible version while fixing a bug.
Another time, Claude decided the only way to fix an issue with a particularly inefficient for loop was to turn off my audio drivers.
The real problem isn't that it made mistakes. It's that I had no way to go back and understand what it did, when, and why, to learn from it and fine-tune my prompts. Sure, I could just scroll the Claude logs, but what if the "failures" weren't apparent until much later? Or what if the issue was at step 315 of an hour-long agent run of 500 steps?
That's why Bench is a big deal. It's not just a logger but an audit trail that makes agent actions legible: every tool call, file change, conversation, and subagent detail is there for as long as you need it, searchable and shareable. It's a great way to share your context with colleagues, and exactly what I needed to learn from my mistakes and write better prompts!
Now add observability + failure handling, otherwise it’s just scheduled guessing.
Bench for Claude Code
@ion_simion_bajinaru That's exactly what we are here for :) Providing observability for your sessions, both scheduled and in real time!
heyy does this work only with claude? or i can use it with gemini, codex too?
Bench for Claude Code
@janhavidadhania Hey Janhavi! We currently support only Claude Code. However, support for other agents is a high priority on our feature list, so stay tuned! You can subscribe at bench.silverstream.ai to get notified when support for other agents becomes available!
Bench for Claude Code
Hey folks! I’m Simone, Co-founder and CTO of Silverstream AI.
Really happy to be launching this today. I’m excited to share it, and very curious to hear your feedback!
One habit we’ve introduced across the team is linking Bench sessions in PRs whenever Claude Code was involved in creating or debugging a change. It gives reviewers a lot more context on how a bug was found and fixed, instead of just showing the final diff.
That’s been one of the most useful workflows for us, and I’d recommend it to other teams using Claude Code too.
I’m also using Bench in a research setting, where session data helps generate detailed methodology reports showing how results were obtained. I’m already finding it useful, and I think there’s a lot more to unlock there!
Looking forward to your thoughts. I want to make Bench as useful for other devs as it's been so far for us, and your input really matters!
OpenOwl
The session sharing feature is what makes this stand out. I've lost count of how many times I've wanted to show a teammate "hey look what Claude did here" and had to resort to copy-pasting terminal output into Slack.
Being able to just send a link to a full session with context would save so much back and forth. Especially useful for code reviews where you want to show the AI's reasoning, not just the final diff.
Bench for Claude Code
@mihir_kanzariya Exactly, that's one of the main reasons we created Bench! I hope we've effectively solved your issue. When you get the chance to try it, please let me know how the experience was and how we can improve the product!
This is interesting.
Feels like we’re starting to treat AI outputs more like artifacts that need to be tracked and reviewed, not just ephemeral responses.
Curious how you think about evaluation over time.
For example, do you see this becoming something like a feedback loop where past sessions actually improve future workflows?
Bench for Claude Code
@bigcat_aroido Exactly, and that's very much the direction we're heading!
Storing sessions is just the foundation. The real value is what you can build on top of that history. We're already working on prompt improvement suggestions based on past failures, so Bench can tell you not just what went wrong, but nudge you toward patterns that work better.
Longer term, we envision Bench as a context optimizer: it knows your agent's full history, recognizes recurring failure patterns, and can proactively suggest solutions, like automatically recommending a skill when it detects an issue that one would reliably fix.
So yes, the feedback loop you're describing is exactly what we're building toward!
@younik That direction makes a lot of sense.
Feels like once you have enough history, the real value shifts from storage to actually guiding behavior.
Curious how far you want to take that.
Do you see it more as suggesting improvements, or eventually something that can actively adjust workflows?
Bench for Claude Code
@bigcat_aroido Definitely something that can adjust the workflow! Once we have reliably tested the suggestions in production, gathered positive feedback, and assured their quality, users who opt in will be able to have the improvements applied directly to their memory, skills, or prompts :)
I am happy you like the direction!