When you replay or fork a run in Retrace, the steps before the fork come from the recording, but everything after runs live against the model. So two runs of the same input rarely match exactly, even when nothing actually broke.
That makes the useful question harder than it sounds: when a replay diverges, is it a real regression from your change, or just provider non-determinism? Retrace currently shows a first-divergence diff and a verdict of improved, regressed, or unchanged, but I would like to hear how others handle it. What tolerance do you use in practice, and would you rather see a strict step-by-step diff or a semantic comparison of each step?
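To make the tolerance question concrete, here is a minimal sketch of a per-step comparison with a configurable threshold. It is not how Retrace computes its verdict: `classify_step`, `first_real_divergence`, and the `0.9` default are hypothetical, and character-level similarity from Python's `difflib` is only a cheap stand-in for a real semantic comparison.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def classify_step(recorded: str, replayed: str, tolerance: float = 0.9) -> str:
    """Label one step as identical, within_tolerance, or diverged."""
    if recorded == replayed:
        return "identical"
    # ratio() is 1.0 for equal strings and approaches 0.0 for unrelated ones.
    similarity = SequenceMatcher(None, recorded, replayed, autojunk=False).ratio()
    return "within_tolerance" if similarity >= tolerance else "diverged"


def first_real_divergence(
    recorded: List[str], replayed: List[str], tolerance: float = 0.9
) -> Optional[int]:
    """Index of the first step that exceeds the tolerance, or None."""
    for index, (old, new) in enumerate(zip(recorded, replayed)):
        if classify_step(old, new, tolerance) == "diverged":
            return index
    # One run having extra steps also counts as a divergence.
    if len(recorded) != len(replayed):
        return min(len(recorded), len(replayed))
    return None
```

Setting `tolerance=1.0` turns this back into a strict step-by-step diff, so the same code path covers both options in the question.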
Congrats on the launch. Replaying tool calls from the tape instead of re-executing them is the right default; that alone puts it ahead of most homegrown replay scripts I've seen. The question I'd need answered before pointing this at a client project: my tool args and prompts regularly carry things like bearer tokens and customer emails, so when a trace is stored or shared with a teammate, does Retrace redact any of that, or does the span keep everything as-is? Redaction at the SDK level would be a real selling point for the paid tier.
Retrace
@vollos Good news, it's handled by default: every span is run through PII redaction before anything is written to storage, so bearer tokens, auth headers and customer emails come out as [TOKEN_REDACTED] / [EMAIL_REDACTED] in both stored and shared traces, on every plan (not just paid). It runs server-side across all ingestion paths, so nothing raw is ever persisted even if the client slips one through. The best way to gut-check it is to sign up (free tier), point one real trace at it, and watch it get redacted in the span view.
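For readers who want a picture of what span-level redaction of this kind can look like, here is a rough sketch. It is not Retrace's implementation: `redact_text`, `redact_span`, and both regexes are assumptions, and only the `[TOKEN_REDACTED]` / `[EMAIL_REDACTED]` placeholders come from the reply above. Keeping the `Bearer` scheme word in the output is also a guess.

```python
import re
from typing import Any

# Hypothetical patterns; a production redactor would cover many more cases.
BEARER_RE = re.compile(r"\b(Bearer)\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_text(text: str) -> str:
    """Replace bearer tokens and email addresses with placeholders."""
    text = BEARER_RE.sub(r"\1 [TOKEN_REDACTED]", text)
    return EMAIL_RE.sub("[EMAIL_REDACTED]", text)


def redact_span(value: Any) -> Any:
    """Walk a span payload (dicts, lists, strings) and redact every string."""
    if isinstance(value, str):
        return redact_text(value)
    if isinstance(value, dict):
        return {key: redact_span(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_span(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Running something like this before the write to storage, rather than in the client, is what makes the "nothing raw is ever persisted" property possible.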
@yash1511_bogam Default-on for every plan is the right call — plenty of tools gate redaction behind enterprise and everyone below that tier just leaks. Glad to hear it runs server-side too. Good luck with the rest of launch week.
Retrace
@vollos Appreciate it, and totally agree that gating redaction behind enterprise just means everyone below leaks by default.
The first-divergence approach surfaces an input-side problem I've hit in my own agent harness: volatile tokens the harness itself embeds in prompts — timestamps, run ids, sampled examples — make every replay look like it diverges at step 1, before any real regression. My fix was blunt: ban wall-clock and randomness inside the orchestration layer entirely (time gets injected as an argument), so replays are byte-stable by construction. Curious where Retrace draws this line: do you normalize/mask known-volatile spans when computing first divergence, so a timestamp delta doesn't count as a fork point — or is the recommendation to make the harness deterministic upstream, like I did? And if it's masking, is the mask list configurable per project? Feels like the difference between a diff you trust and a diff you learn to ignore.
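The masking option described above can be sketched as follows. This is a minimal illustration, not Retrace's behavior: `VOLATILE_PATTERNS`, `mask_volatile`, and `first_divergence` are hypothetical names, and the pattern list (UUID-style run ids and ISO 8601 timestamps) is exactly the part a per-project config would replace.

```python
import re
from typing import List, Optional, Pattern, Sequence, Tuple

# Hypothetical, project-configurable list of (pattern, placeholder) pairs.
VOLATILE_PATTERNS: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"), "<UUID>"),
    (re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
                r"(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"), "<TIMESTAMP>"),
]


def mask_volatile(
    text: str, patterns: Sequence[Tuple[Pattern[str], str]] = VOLATILE_PATTERNS
) -> str:
    """Replace known-volatile tokens with stable placeholders."""
    for pattern, placeholder in patterns:
        text = pattern.sub(placeholder, text)
    return text


def first_divergence(
    recorded: List[str],
    replayed: List[str],
    patterns: Sequence[Tuple[Pattern[str], str]] = VOLATILE_PATTERNS,
) -> Optional[int]:
    """Index of the first step that differs after masking, or None."""
    for index, (old, new) in enumerate(zip(recorded, replayed)):
        if mask_volatile(old, patterns) != mask_volatile(new, patterns):
            return index
    if len(recorded) != len(replayed):
        return min(len(recorded), len(replayed))
    return None
```

Passing an empty pattern list gives the unmasked diff, which is a quick way to see how many reported forks are only timestamp or run-id noise.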
The fork-as-git-branch model is the right call for agent debugging — re-running a whole prompt throws away the exact upstream state that caused the break. The thing I'd need pinned before wiring this into a real stack is side effects: when a forked run re-executes downstream live, does a tool call that writes to a DB or hits a payment/email API actually fire again, or can you stub specific tools so a fork doesn't repeat real-world writes? Being able to mark tools as replay-only vs live seems like the difference between using this on prod agents or only read-only ones.
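A minimal sketch of the replay-only vs live split asked about here, under stated assumptions: `ToolMode`, `ForkToolRouter`, and `tape_key` are hypothetical and say nothing about how Retrace routes tool calls. Tools default to replay-only, so a fork repeats a real-world write only when a tool is explicitly marked live.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Dict, Tuple


class ToolMode(Enum):
    LIVE = "live"                # re-execute the real tool in the fork
    REPLAY_ONLY = "replay_only"  # serve the recorded result, never execute


def tape_key(name: str, kwargs: Dict[str, Any]) -> Tuple[str, str]:
    """Stable lookup key for a recorded call: tool name plus sorted args."""
    return (name, repr(sorted(kwargs.items())))


@dataclass
class ForkToolRouter:
    tape: Dict[Tuple[str, str], Any]
    modes: Dict[str, ToolMode] = field(default_factory=dict)

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        # Unlisted tools are treated as replay-only, the safe default.
        mode = self.modes.get(name, ToolMode.REPLAY_ONLY)
        if mode is ToolMode.LIVE:
            return fn(**kwargs)
        key = tape_key(name, kwargs)
        if key not in self.tape:
            # Fail loudly instead of silently firing a side effect.
            raise LookupError(f"no recorded result for {name} with {kwargs}")
        return self.tape[key]
```

The open design question is what a replay-only tool should do when the fork calls it with arguments the tape has never seen. This sketch raises an error; returning a stubbed value is the other obvious choice.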
finally something that lets me actually see why my agent broke instead of digging through logs. the replay view caught a tool call loop in seconds, super useful.
Retrace
@muhammetbelgin Appreciate the kind words! If you want to try it on your own agent, you can sign up free and have your first trace replaying in a couple of minutes: retraceai.tech
Replay + fork is exactly how agent debugging should work. Today my 'debugging' is reading transcripts of production calls and guessing which turn derailed it - being able to fork from the exact step and test a fix against the same context would save hours. Does it work with voice agents / live conversation logs, or is it aimed at tool-calling agents? Congrats on the launch.
How does the free tier handle traces that get close to the limit mid-session — does it cut off or let you finish and just throttle new ones?
The failures that actually bite me only show up on a real user's weird input in prod, never when I'm testing locally, so recording a run and replaying it after the fact is the dream. Two Qs: can I ingest traces from a deployed backend (not just a local dev harness), and since those recordings carry real user messages, is there any redaction/PII control before a trace gets stored or shared?