When you replay or fork a run in Retrace, the steps before the fork come from the recording, but everything after runs live against the model. So two runs of the same input rarely match exactly, even when nothing actually broke.
That makes the useful question harder than it sounds: when a replay diverges, is it a real regression from your change, or just provider non-determinism? Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged, but I would like to hear how others handle it. What tolerance do you use in practice, and would you rather see a strict step-by-step diff or a semantic comparison of each step?
Record, replay, fork & share AI agent executions. See every LLM call, tool invocation, and error your agent makes, then debug and iterate in seconds. Free for 1,000 traces/mo.