built an open source SDK for catching AI agent regressions before you ship

been building agents for a while and kept hitting the same problem. you fix a failure, later change the prompt or model, and the same failure quietly comes back. nobody catches it until a user does.

built replayd to solve this. it captures failed agent runs as regression tests and replays them before you deploy. if the same failure returns after a prompt, model, or tool change, replayd flags it before it reaches users.
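rough sketch of the capture-and-replay idea in plain python. this is not replayd's actual api: `FailureCase`, `dump_case`, `load_case`, and `replay` are made-up names for illustration, and the "agent" is assumed to be any callable that takes the user input and returns the list of tool names it called.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass
class FailureCase:
    # hypothetical record of one failed run, not replayd's real schema
    case_id: str
    user_input: str
    bad_tool: str  # the tool the agent wrongly called in the original failure


def dump_case(case: FailureCase) -> str:
    """serialize a captured failure so it can be stored next to the code."""
    return json.dumps(asdict(case))


def load_case(raw: str) -> FailureCase:
    """load a stored failure back into a regression case."""
    return FailureCase(**json.loads(raw))


def replay(cases: list[FailureCase], agent: Callable[[str], list[str]]) -> list[str]:
    """run each captured input through the current agent.

    returns the ids of cases where the original failure came back.
    """
    regressed = []
    for case in cases:
        tools_called = agent(case.user_input)
        if case.bad_tool in tools_called:
            regressed.append(case.case_id)
    return regressed
```

in ci you would fail the build whenever `replay(...)` returns a non-empty list.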

the grading part was the interesting problem. exact output matching doesn't work because LLMs are non-deterministic. so instead of comparing the text, it checks whether the specific failure came back. a wrong tool call is checked with a hard assertion. a policy violation is checked by an LLM judge.
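here is roughly what those two kinds of grader could look like. again a sketch under assumptions, not replayd's real code: `RunResult`, `wrong_tool_returned`, `policy_violation_returned`, and `stub_judge` are hypothetical names, and the judge is a plain callable so a real LLM call can be plugged in where the stub is.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    # hypothetical shape of one replayed run
    tools_called: list[str]
    output: str


def wrong_tool_returned(run: RunResult, forbidden_tool: str) -> bool:
    """hard assertion: deterministic check on the tool trace, no LLM involved."""
    return forbidden_tool in run.tools_called


def policy_violation_returned(
    run: RunResult,
    policy: str,
    judge: Callable[[str, str], bool],
) -> bool:
    """soft check: ask a judge whether the output violates the policy.

    `judge(policy, output)` is assumed to return True on a violation.
    in practice it would wrap an LLM call.
    """
    return judge(policy, run.output)


def stub_judge(policy: str, output: str) -> bool:
    # stand-in for an LLM judge so the sketch runs offline
    return "ssn" in output.lower()
```

the split matters because the tool check is cheap and exact, while the judge is only needed for failures that can't be expressed as a simple assertion.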

v0.1.2, early but works end to end. zero runtime dependencies in the core.

pip install replayd

github.com/TaimoorKhan10/replayd

star it if you want to follow progress. feedback welcome, especially from anyone running agents in production.
