Your coding agent ships code that works but does the wrong thing. How are you catching it?

We kept running into this while building DataGrout Invariant.

The agent makes a change. Tests pass. No errors. But the behaviour doesn't match what you actually asked for. No crash to debug, no obvious failure, just silent goal drift that shows up later.

We called this the "plausible but wrong" problem.

The frustrating part is that existing tools don't catch it. Linters check syntax, test suites check output, but nothing checks whether the change matched the original intent.
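To make that concrete, here's a small made-up example of the gap. Everything in it is hypothetical (the `User` type, the `newest_first` function, the "newest signups first" request), and it is not how DataGrout Invariant works. It is only a sketch of a change that passes the existing test while missing the intent, plus the kind of check that would have caught it if the intent had been written down as a property.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    signup_day: int  # days since launch; larger means more recent


def newest_first(users: list[User]) -> list[User]:
    # Hypothetical agent change. The request was "newest signups first",
    # but this sorts oldest first. It runs cleanly and looks plausible.
    return sorted(users, key=lambda u: u.signup_day)


def existing_test_passes(users: list[User]) -> bool:
    # The kind of test that already existed: it checks that no user is
    # lost or duplicated, not that the order is right.
    result = newest_first(users)
    same_length = len(result) == len(users)
    same_names = {u.name for u in result} == {u.name for u in users}
    return same_length and same_names


def matches_intent(users: list[User]) -> bool:
    # The original intent written as a checkable property: every user
    # signed up at least as recently as the user listed after them.
    result = newest_first(users)
    return all(a.signup_day >= b.signup_day for a, b in zip(result, result[1:]))


users = [User("ana", 3), User("bo", 10), User("cy", 7)]
print(existing_test_passes(users))  # True
print(matches_intent(users))        # False
```

The suite is green and nothing crashes, yet the list comes back in the opposite order from what was asked. The catch is that someone has to write `matches_intent` before the change lands, and that's usually the part that's missing.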

Curious how others are handling this. Are you catching it in review? In prod? Or just living with it?

We're launching DataGrout Invariant tomorrow, built specifically to solve this. Happy to dig into specifics here.
