been building agents for a while and kept hitting the same problem. fix a failure, change the prompt or model, same failure comes back quietly. nobody catches it until a user does.
built replayd to solve this. captures failed agent runs as regression tests and replays them before you deploy. if the same failure returns after a prompt, model, or tool change, it catches it.
the grading part was the interesting problem. can't use exact output matching because LLMs are non-deterministic. so instead of checking the text, it checks whether the specific failure came back. wrong tool called gets a hard assertion. policy violation gets an LLM judge.
v0.1.2, early but works end to end. zero runtime dependencies in the core.