Launching today

Retrace

Debug AI agents by replaying and forking runs

18 followers

AI Chatbots • AI Metrics and Evaluation • AI Engineer

Record, replay, fork & share AI agent executions. See every LLM call, tool invocation, and error your agent makes, then debug and iterate in seconds. Free for 1,000 traces/mo.

Free

Launch tags: Productivity • Developer Tools • Artificial Intelligence

Launch Team

Retrace

Maker

📌

Retrace records every LLM call, tool call, and error in a run as a span inside a trace. You can replay a past run step by step, like scrubbing through a video. When you find the step that broke, you fork it: change the input or model at that point, and the agent re-executes from there, so you can compare the original and the new path side by side. The part I care most about is the forking, which is closer to git branching than to re-running a prompt. Pre-fork steps replay from the recording; everything downstream runs live. It's early, and I'd really like your feedback, especially on the replay and fork flow and on what would make it fit your stack. Which frameworks or providers are you using? Happy to answer anything here.
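To make the fork idea concrete, here is a rough sketch of "pre-fork steps replay from the recording; everything downstream runs live". This is not Retrace's API or implementation. `Span`, `fork_run`, and the `execute` callback are hypothetical names, and the sketch assumes the simplest possible agent, where each step's output is the next step's input and the forked path has the same step kinds as the original.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    """One recorded step (hypothetical shape, not Retrace's schema)."""
    kind: str    # e.g. "llm" or "tool"
    input: str
    output: str


def fork_run(
    recorded: List[Span],
    fork_at: int,
    new_input: str,
    execute: Callable[[str, str], str],
) -> List[Span]:
    """Reuse spans before fork_at from the recording, then run live."""
    if not 0 <= fork_at < len(recorded):
        raise IndexError("fork_at must point at a recorded span")

    # Pre-fork steps come straight from the recording; nothing is re-executed.
    forked = list(recorded[:fork_at])

    # From the fork point on, every step is executed live.
    # Assumption: each step's output feeds the next step's input.
    current_input = new_input
    for original in recorded[fork_at:]:
        output = execute(original.kind, current_input)
        forked.append(Span(original.kind, current_input, output))
        current_input = output
    return forked
```

A real agent can take a different number of steps after the fork, so a production version would drive the agent loop itself rather than walk the old step list. The point here is only the split: recorded spans before the fork, live execution after it.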

16h ago

finally something that lets me actually see why my agent broke instead of digging through logs. the replay view caught a tool call loop in seconds, super useful.

6m ago

Forum Threads

p/retrace-2 • 12h ago

How do you tell a real regression from model noise when replaying a run?

When you replay or fork a run in Retrace, the steps before the fork come from the recording, but everything after runs live against the model. So two runs of the same input rarely match exactly, even when nothing actually broke.

That makes the useful question harder than it sounds: when a replay diverges, is it a real regression from your change, or just provider non-determinism? Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged, but I would like to hear how others handle it. What tolerance do you use in practice, and would you rather see a strict step-by-step diff or a semantic comparison of each step?
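As a starting point for the tolerance question, here is a minimal sketch of a first-divergence check with an adjustable threshold. It is not how Retrace computes its diff. `first_divergence` and its `tolerance` parameter are hypothetical, and character-level similarity from Python's `difflib` stands in for a real semantic comparison, which would more likely use embeddings or a judge model.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def first_divergence(
    original: List[str],
    replay: List[str],
    tolerance: float = 1.0,
) -> Optional[int]:
    """Return the index of the first step that diverges, or None.

    tolerance=1.0 is a strict diff: any textual difference counts.
    Lower values accept steps whose similarity ratio stays at or above
    the threshold (a crude stand-in for a semantic comparison).
    """
    for i, (a, b) in enumerate(zip(original, replay)):
        if SequenceMatcher(None, a, b).ratio() < tolerance:
            return i
    # One run has extra steps: it diverges where the shorter run ends.
    if len(original) != len(replay):
        return min(len(original), len(replay))
    return None


original = ["search docs", "The answer is 42.", "done"]
replay = ["search docs", "The answer is 42!", "done"]

strict = first_divergence(original, replay)                  # strict diff
lenient = first_divergence(original, replay, tolerance=0.9)  # tolerant diff
```

With these inputs the strict check flags step 1, while the 0.9 threshold treats the punctuation change as noise and reports no divergence. Running the same replay several times and comparing how often it diverges is another way to separate provider non-determinism from a real regression.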
