When you replay or fork a run in Retrace, the steps before the fork come from the recording, but everything after runs live against the model. So two runs of the same input rarely match exactly, even when nothing actually broke.
That makes the useful question harder than it sounds: when a replay diverges, is it a real regression from your change, or just provider non-determinism? Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged, but I would like to hear how others handle it. What tolerance do you use in practice, and would you rather see a strict step-by-step diff or a semantic comparison of each step?
Finally something that makes debugging AI agents less painful. The forking feature let me branch off a stuck trace and rerun it with a different prompt in like a minute. Super practical for anyone shipping agents right now.
The replay feature is genuinely useful, I reran a flaky agent run and could pinpoint exactly where it stalled without digging through logs. Free tier is enough to actually evaluate it before committing.
Love how clean the replay view is, being able to scrub through each LLM call and tool invocation without losing context makes debugging agents feel way less like guesswork.
Spent a few minutes replaying a flaky agent run and being able to fork the exact trace to try a different prompt without rerunning the whole thing was honestly a nice surprise. The tool call breakdown finally makes it obvious where my agent was looping.