Taimoor Khan

built an open-source SDK for catching AI agent regressions before you ship

been building agents for a while and kept hitting the same problem: fix a failure, later change the prompt or model, and the same failure quietly comes back. nobody catches it until a user does.

built replayd to solve this. it captures failed agent runs as regression tests and replays them before you deploy. if the same failure returns after a prompt, model, or tool change, replayd flags it.
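
rough sketch of the capture-and-replay idea in plain python. this is not replayd's actual API: `FailedRun`, `capture`, `replay_all`, and the JSON-file-per-case layout are all made up for illustration, and the agent is just a callable that maps input text to an output string.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable


# hypothetical record type, not taken from replayd
@dataclass
class FailedRun:
    case_id: str
    user_input: str
    bad_output: str  # what the agent did wrong when the run was captured


def capture(run: FailedRun, directory: Path) -> Path:
    """Save a failed run as one JSON file so it can be replayed later."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{run.case_id}.json"
    path.write_text(json.dumps(asdict(run)))
    return path


def replay_all(
    directory: Path,
    agent: Callable[[str], str],
    failure_returned: Callable[[FailedRun, str], bool],
) -> list[str]:
    """Re-run every captured input and return ids of cases that regressed."""
    regressed = []
    for path in sorted(directory.glob("*.json")):
        run = FailedRun(**json.loads(path.read_text()))
        new_output = agent(run.user_input)
        # the check is supplied by the caller, so each case can decide
        # what "the same failure came back" means
        if failure_returned(run, new_output):
            regressed.append(run.case_id)
    return regressed
```

in a deploy pipeline you would call something like `replay_all` against the candidate agent and fail the build if the returned list is non-empty.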

the grading part was the interesting problem. can't use exact output matching because LLMs are non-deterministic. so instead of checking the text, it checks whether the specific failure came back. a wrong tool call gets a hard assertion. a policy violation gets an LLM judge.
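
a minimal sketch of those two grading paths, assuming an agent output that exposes its tool-call trace and final text. the names (`AgentOutput`, `wrong_tool_returned`, `policy_violation_returned`) and the judge prompt are hypothetical, not replayd's real interface, and the judge is any callable that takes a prompt and returns text, so a stub works for local testing.

```python
from dataclasses import dataclass
from typing import Callable


# hypothetical shape for an agent result, not taken from replayd
@dataclass
class AgentOutput:
    tool_calls: list[str]
    final_text: str


def wrong_tool_returned(output: AgentOutput, forbidden_tool: str) -> bool:
    """Hard assertion: a deterministic check on the tool-call trace."""
    return forbidden_tool in output.tool_calls


def policy_violation_returned(
    output: AgentOutput,
    policy: str,
    judge: Callable[[str], str],
) -> bool:
    """Judge-based check: ask one narrow yes/no question about one policy."""
    prompt = (
        f"Policy: {policy}\n"
        f"Agent reply: {output.final_text}\n"
        "Does the reply violate the policy? Answer YES or NO."
    )
    # only the leading YES/NO is read, so extra explanation from the judge is ignored
    return judge(prompt).strip().upper().startswith("YES")
```

neither check compares the full output text, so harmless rewording between runs does not fail the test. only the recorded failure does.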

v0.1.2, early but works end to end. zero runtime dependencies in the core.

pip install replayd

github.com/TaimoorKhan10/replayd

star it if you want to follow progress. feedback welcome, especially from anyone running agents in production.