When you replay or fork a run in Retrace, the steps before the fork come from the recording, but everything after runs live against the model. So two runs of the same input rarely match exactly, even when nothing actually broke.
That makes the useful question harder than it sounds: when a replay diverges, is it a real regression from your change, or just provider non-determinism? Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged, but I would like to hear how others handle it. What tolerance do you use in practice, and would you rather see a strict step-by-step diff or a semantic comparison of each step?
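To make the tolerance question concrete, here is a minimal sketch of a first-divergence check with a tunable threshold. It is not how Retrace computes its diff; `similarity`, `first_divergence`, and the `tolerance` parameter are hypothetical names, and character-level similarity from `difflib` is only a crude stand-in for a real semantic comparison.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    # Character-level ratio in [0, 1]; a crude proxy for "semantically the same".
    return SequenceMatcher(None, a, b).ratio()


def first_divergence(original: List[str], replay: List[str],
                     tolerance: float = 0.9) -> Optional[int]:
    """Return the index of the first step that differs beyond tolerance, or None."""
    for i, (a, b) in enumerate(zip(original, replay)):
        if similarity(a, b) < tolerance:
            return i
    # One run being longer than the other counts as a divergence at the first extra step.
    if len(original) != len(replay):
        return min(len(original), len(replay))
    return None
```

With `tolerance=1.0` this behaves like a strict step-by-step diff; lowering it lets small wording changes pass, which is the trade-off the question is about.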
Retrace
Forking a run like a git branch is exactly how agent debugging should work. Replay alone rarely helps when the failure came from one weird tool response ten steps in.
Also went through your forum thread on separating real regressions from provider noise — nice to see nondeterminism treated as a first-class problem (first-divergence diff + verdict) instead of being waved away.
One thing I couldn't find though: when everything downstream of the fork runs live, do the agent's tool calls actually execute?
I work on agents with real side effects (checkout, payments, emails), and mocking those from the recording would be the difference between "safe to fork production runs" and not.
Retrace
@akbar_b Tool calls aren't re-executed on a fork; they replay from the recorded tape, so checkout, payments, and emails never fire again, and only the LLM calls are re-issued live (which is exactly where the divergence you care about shows up). You can also override a specific tool's output before replaying if you want to force a different branch.
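A rough sketch of that behavior, for anyone who wants to reason about it in code. This is an illustration under assumptions, not Retrace's implementation: `Step`, `replay_fork`, and the `call_llm` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    kind: str    # "llm" or "tool"
    name: str
    output: str


def replay_fork(tape: List[Step], fork_at: int,
                call_llm: Callable[[List[Step]], str]) -> List[Step]:
    """Replay a recorded run, re-issuing only the LLM steps at or after fork_at."""
    forked: List[Step] = []
    for i, step in enumerate(tape):
        if i < fork_at or step.kind == "tool":
            # Steps before the fork and all tool steps come straight from the tape,
            # so side effects such as payments or emails are never triggered again.
            forked.append(step)
        else:
            # LLM steps after the fork run live against the context built so far.
            forked.append(Step("llm", step.name, call_llm(forked)))
    return forked
```

The tool function itself is never referenced here, which is the point: a fork only reads its recorded output.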
the git branching analogy for forking a run is the right mental model. most "replay" tools stop at showing you what happened instead of letting you actually change the input at the broken step and re-run from there. i've lost hours re-running an entire agent chain from scratch just to test one fix at step 8. does forking work if the tool call at that step had side effects, like a real API write, or only for pure LLM steps?
Retrace
@omri_ben_shoham1 re-running the whole chain just to test step 8 is exactly the pain that made me build this. It works for side-effecting steps too: the tool call at that step isn't re-fired against the real API, its recorded output is replayed (or you can override it), so no duplicate writes. Only the downstream LLM steps re-run live, which is where the fix actually shows up.
recorded output replay for the side-effecting steps is the piece I was missing, that's a much cleaner solve than I expected. does the override option let you edit the recorded response inline before the downstream steps re-run, or do you have to swap it out some other way?
Retrace
@omri_ben_shoham1 you edit it, not blind-swap it. the recorded output is your starting point, you change it to whatever you want that step to have returned, and that becomes the step's output for the replay, keyed to that exact span and tagged as "mocked" so the diff makes it obvious which step you overrode vs which ones actually re-ran. from there only the downstream llm steps re-execute against your edited value, the real tool call never re-fires, so for your step 8 you just pin the tool result to the value you're testing and 9 onward runs live against it, then you get original vs forked side by side. and if you'd rather script it than click through, the same thing is just a per-step map (span id → the output you want), so you can even sweep a few different "what if the API had returned X" values at once.
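The per-step map could look something like the sketch below. The field names (`span_id`, `output`, `mocked`) and the helpers `apply_overrides` and `sweep` are assumptions made for illustration, not Retrace's actual schema or API.

```python
from typing import Dict, List


def apply_overrides(tape: List[dict], overrides: Dict[str, str]) -> List[dict]:
    """Return a copy of the tape with overridden outputs pinned and tagged as mocked."""
    forked = []
    for step in tape:
        if step["span_id"] in overrides:
            # Pin this step to the edited value and mark it so a diff can show it.
            forked.append({**step, "output": overrides[step["span_id"]], "mocked": True})
        else:
            forked.append({**step, "mocked": False})
    return forked


def sweep(tape: List[dict], span_id: str, values: List[str]) -> List[List[dict]]:
    """Build one forked tape per candidate value for a single step."""
    return [apply_overrides(tape, {span_id: value}) for value in values]
```

For example, `sweep(tape, "span-8", ['{"status": 200}', '{"status": 429}'])` would produce two forks that differ only in what step 8 returned, leaving the recorded tape itself untouched.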
that per-step map for scripting is exactly what I was hoping to hear - being able to sweep a few "what if X" values without touching the real API makes this way more useful for actual regression testing, not just one-off debugging. does the override map get versioned/saved anywhere or is it scoped to a single replay session?
The replay-from-tape answer makes sense for stopping side effects re-firing, but there's a subtle failure once you fork and swap the model. The new branch might call the same tool with different arguments than the recorded run did, so the taped response is now the answer to a question the new path never asked. Do you match a replay on the tool name only, or on the actual call arguments, and what happens when a forked run makes a tool call that has no matching entry on the tape?
Retrace
@dipankar_sarkar Honestly, you've spotted a real limit. Replay matches the recorded step positionally, not by tool name or arguments, so a forked model that calls the same tool with different args gets the taped (now-stale) answer, and a brand-new tool call has no tape entry at all. Since your tools run in your app we don't re-execute them server-side, so those runs are flagged best-effort and not authoritative for tool-calling agents, with a per-step override so you can drop in the correct output.
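One way to picture the gap: a check that compares the forked call against what the tape recorded at the same position. This is a hypothetical sketch, not something Retrace is stated to do; it assumes the tape stores each tool call's `name` and `args`, and `classify_tool_call` is an invented helper.

```python
from typing import Any, Dict, List


def classify_tool_call(tape: List[Dict[str, Any]], position: int,
                       name: str, args: Dict[str, Any]) -> str:
    """Classify a forked tool call against the recorded call at the same position."""
    if position >= len(tape):
        # The forked run made a call the recording never saw.
        return "missing"
    recorded = tape[position]
    if recorded["name"] != name or recorded["args"] != args:
        # Positional replay would return an answer to a different question.
        return "stale"
    return "match"
```

Anything classified as "stale" or "missing" is a step where the per-step override is the only way to supply a trustworthy output.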
For Retrace, when you say users can replay and fork runs, does the fork preserve the full context of the original AI agent run, or is it more about starting from a selected point in the trace? I can imagine both being useful for debugging, especially when a bad tool call or prompt change happens midway through a run.
Retrace
@mia_qiao Both, and that's really the point. Everything before the fork point is preserved exactly from the original recording, so the run keeps its full context up to the step you pick, and from that step forward it re-executes with your change and cascades the new context to the downstream steps. So for a bad tool call or a prompt tweak midway, you fork right at that step and only the affected part re-runs, on the corrected context instead of from scratch.
Finally a way to actually see what my agents are doing under the hood. The replay feature saved me a ton of time figuring out why one tool call was looping.
Retrace
@suna288943 Love hearing this, catching a looping tool call is exactly what replay is built for.
this solves a problem every team building agents eventually runs into.
how well does it scale when an agent has dozens of tool calls, nested workflows, and multiple sub-agents? would love to know how you've approached visualizing complex traces.
congrats on the launch!
Retrace
@sonali_nayak2 Thanks, really appreciate it! For big traces the spans render as a nested, scrubbable timeline (parent/child, so dozens of tool calls and nested workflows stay grouped instead of a flat wall), and for multi-agent runs every span carries an agent id/role with an agent-topology graph that shows how the sub-agents hand off, plus inter-agent detectors that flag things like reasoning/action mismatches. Very large traces are the area I'm still actively hardening, so honest feedback there is genuinely welcome.
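For readers wondering what the parent/child grouping buys you, here is a small sketch that turns a flat span list into an indented outline. The span fields (`id`, `parent_id`, `name`, `agent`) and `render_outline` are assumed for illustration and are not Retrace's data model.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def render_outline(spans: List[dict]) -> List[str]:
    """Group spans under their parents and return one indented line per span."""
    children: Dict[Optional[str], List[dict]] = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    def walk(parent_id: Optional[str], depth: int) -> List[str]:
        lines = []
        for span in children[parent_id]:
            # Indent by depth so nested workflows stay visually grouped.
            lines.append("  " * depth + f'{span["name"]} [{span["agent"]}]')
            lines.extend(walk(span["id"], depth + 1))
        return lines

    return walk(None, 0)
```

Root spans are those whose `parent_id` is `None`; the agent label on each line is what would let sub-agent handoffs show up in the same view.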