When you replay or fork a run in Retrace, the steps before the fork come from the recording, but everything after runs live against the model. So two runs of the same input rarely match exactly, even when nothing actually broke.
That makes the useful question harder than it sounds: when a replay diverges, is it a real regression from your change, or just provider non-determinism? Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged, but I would like to hear how others handle it. What tolerance do you use in practice, and would you rather see a strict step-by-step diff or a semantic comparison of each step?
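To make the tolerance question concrete, here is a minimal sketch of a first-divergence check with an adjustable threshold. It is not Retrace's implementation: the step shape (`tool`, `output`), the function name `first_divergence`, and the use of `difflib` character similarity as a rough stand-in for a semantic comparison are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def first_divergence(recorded, replayed, tolerance=1.0):
    """Return the index of the first diverging step, or None if the runs match.

    Each step is a dict with hypothetical keys "tool" and "output".
    tolerance=1.0 is a strict diff; lower values accept small wording changes.
    """
    for i, (old, new) in enumerate(zip(recorded, replayed)):
        # A different tool is always treated as a divergence.
        if old["tool"] != new["tool"]:
            return i
        # Character-level similarity in [0, 1]; a crude proxy for "same meaning".
        similarity = SequenceMatcher(None, old["output"], new["output"]).ratio()
        if similarity < tolerance:
            return i
    # One run has extra steps: the divergence is where the shorter one ends.
    if len(recorded) != len(replayed):
        return min(len(recorded), len(replayed))
    return None


recorded = [
    {"tool": "search", "output": "Found 3 results for shoes"},
    {"tool": "checkout", "output": "Order placed"},
]
replayed = [
    {"tool": "search", "output": "Found 3 results for shoes."},
    {"tool": "refund", "output": "Order placed"},
]

strict = first_divergence(recorded, replayed)                  # trailing period counts
tolerant = first_divergence(recorded, replayed, tolerance=0.9)  # trailing period ignored
```

With the strict setting the trailing period in step 0 already counts as a divergence, while a tolerance of 0.9 skips it and flags the changed tool in step 1. A real semantic comparison would likely swap the `difflib` ratio for embeddings or a model-based judge, but the trade-off between the two modes is the same.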
Forking a run like a git branch is exactly how agent debugging should work. Replay alone rarely helps when the failure came from one weird tool response ten steps in.
Also went through your forum thread on separating real regressions from provider noise. Nice to see nondeterminism treated as a first-class problem (first-divergence diff + verdict) instead of being waved away.
One thing I couldn't find though: when everything downstream of the fork runs live, do the agent's tool calls actually execute?
I work on agents with real side effects (checkout, payments, emails), and mocking those from the recording would be the difference between "safe to fork production runs" and not.
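To illustrate what I mean by mocking from the recording, here is a rough sketch. It is purely hypothetical and not based on how Retrace behaves: the `ForkToolRunner` class, the tool names in `SIDE_EFFECT_TOOLS`, and the shape of a recorded call are all my own assumptions.

```python
import json

# Hypothetical names for tools that must never run live during a fork.
SIDE_EFFECT_TOOLS = {"checkout", "charge_card", "send_email"}


def _key(tool, args):
    # Stable lookup key: tool name plus canonically serialized arguments.
    return (tool, json.dumps(args, sort_keys=True))


class ForkToolRunner:
    """Serve side-effecting tools from the recording; run everything else live."""

    def __init__(self, recorded_calls, live_tools):
        # recorded_calls: list of {"tool": str, "args": dict, "result": object}
        self._recorded = {_key(c["tool"], c["args"]): c["result"] for c in recorded_calls}
        # live_tools: mapping of tool name to a callable taking keyword arguments
        self._live = live_tools

    def call(self, tool, args):
        if tool in SIDE_EFFECT_TOOLS:
            key = _key(tool, args)
            if key not in self._recorded:
                # Fail closed: an unrecorded side effect is never executed.
                raise LookupError(f"no recorded result for {tool} with {args}")
            return self._recorded[key]
        return self._live[tool](**args)


runner = ForkToolRunner(
    recorded_calls=[{"tool": "charge_card", "args": {"amount": 10}, "result": {"status": "ok"}}],
    live_tools={"search": lambda query: f"live:{query}"},
)
mocked = runner.call("charge_card", {"amount": 10})  # served from the recording
live = runner.call("search", {"query": "shoes"})     # executed live
```

The important part for me is the fail-closed branch: if the forked run tries a side-effecting call that was never recorded, it stops instead of hitting the real payment or email system.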
For Retrace, when you say users can replay and fork runs, does the fork preserve the full context of the original AI agent run, or is it more about starting from a selected point in the trace? I can imagine both being useful for debugging, especially when a bad tool call or prompt change happens midway through a run.
Replay + fork is exactly how agent debugging should work. Today my 'debugging' is reading transcripts of production calls and guessing which turn derailed it - being able to fork from the exact step and test a fix against the same context would save hours. Does it work with voice agents / live conversation logs, or is it aimed at tool-calling agents? Congrats on the launch.
finally something that lets me actually see why my agent broke instead of digging through logs. the replay view caught a tool call loop in seconds, super useful.