How do you verify code shipped by autonomous AI agents (especially when running several in parallel)?

I've tried agent orchestrators like Conductor, but I've never stuck with them long term because I have no way of verifying the huge amount of work they produce with basically no validation. I always end up going back to my one-agent Claude Code workflow, and I always feel like I'm missing out.

What kept happening to me:

- 3-5 agents fan out, each opens a PR
- CIs pass, diffs look reasonable
- I can't possibly click through every preview by morning
- I merge based on the diff and CI signal
- Something breaks in prod the next day because the agent did what I asked literally, but the feature doesn't REALLY work

CI passing doesn't tell you whether clicking the button does anything. The more agents I run in parallel, the more PRs I'm merging without verifying.

How are people handling this?

- Click through every preview manually? (took 2-3 hours of my mornings)

- Some kind of QA agent that drives the preview deploy?

- Pre-merge integration tests covering every UI flow? (lol)

- Just merge and roll back when prod breaks?

I've been building something in this space: a second AI agent that takes each PR's preview deploy, opens it in a real browser via Browserbase, clicks through the feature, and fails the PR if it doesn't work. The verification runs on its own, so I don't have to be the QA step.

If it fails, the build agent gets the report back and iterates up to 3x.
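For anyone wondering what that loop looks like in code, here is a minimal Python sketch of the verify-and-retry control flow only. It is not the actual implementation: `QAReport`, `run_qa`, and `request_fix` are hypothetical names, the browser-driving QA step and the build agent are passed in as plain callables, and "up to 3x" is assumed to mean up to three fix rounds after the first QA run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAReport:
    """Hypothetical result of one QA run against a preview deploy."""
    passed: bool
    notes: str


def verify_with_retries(
    preview_url: str,
    run_qa: Callable[[str], QAReport],       # hypothetical: drives the preview in a browser
    request_fix: Callable[[QAReport], str],  # hypothetical: build agent fixes, returns new preview URL
    max_fix_rounds: int = 3,                 # assumption: "up to 3x" means three fix rounds
) -> List[QAReport]:
    """Run QA, and on failure hand the report back to the build agent.

    Returns every report in order; the last one decides whether the PR
    is mergeable.
    """
    report = run_qa(preview_url)
    reports = [report]
    rounds = 0
    while not report.passed and rounds < max_fix_rounds:
        # Feed the failing report back to the build agent, then re-verify
        # whatever preview it produces.
        preview_url = request_fix(report)
        rounds += 1
        report = run_qa(preview_url)
        reports.append(report)
    return reports
```

The PR would be marked as failed when `reports[-1].passed` is false after the last round. Keeping every report, not just the final one, is what makes it possible to review only the QA reports instead of the previews.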

For me this was the missing piece. Without it I'd run 5 agents and be too scared to merge any of their PRs. With it I just review the QA reports and merge the ones that passed.

Anyone else solved this, or are we all merging without checking?
