The gap between "reviewed" and "rubber-stamped" for AI-generated code — is it measurable?

Something I keep running into while building LineageLens: teams know they should review AI-generated code before it merges, but "reviewed" in practice means "someone clicked Approve." The diff could have been open for 3 seconds or 30 minutes. The record looks the same.

I shipped a feature this week that tries to make this distinction real. Three behavioral signals — time-per-line on the diff, comment count, lines reviewed — get scored into a classification: `shallow`, `adequate`, or `deep`. Anything below 1 second per line gets hard-classified as `shallow` regardless of score, because no one is actually reading at that speed.
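To make the scoring concrete, here is a minimal sketch of how three signals and a hard floor could combine. It is not LineageLens's actual formula: the weights, the normalization caps, the `adequate`/`deep` thresholds, and the names `ReviewSignals` and `classify` are all assumptions for illustration. Only the 1-second-per-line floor and the three labels come from the post.

```python
from dataclasses import dataclass

MIN_SECONDS_PER_LINE = 1.0  # hard floor described in the post


@dataclass(frozen=True)
class ReviewSignals:
    seconds_on_diff: float  # total time the diff was open
    comment_count: int      # review comments left
    lines_reviewed: int     # lines the reviewer actually scrolled through
    lines_changed: int      # total lines in the diff


def classify(s: ReviewSignals) -> str:
    if s.lines_changed <= 0:
        raise ValueError("lines_changed must be positive")

    seconds_per_line = s.seconds_on_diff / s.lines_changed
    # Hard floor: below 1 s/line is shallow regardless of the other signals.
    if seconds_per_line < MIN_SECONDS_PER_LINE:
        return "shallow"

    # Normalize each signal to [0, 1]. The caps (10 s/line, 3 comments) are assumed.
    time_score = min(seconds_per_line / 10.0, 1.0)
    comment_score = min(s.comment_count / 3.0, 1.0)
    coverage_score = min(s.lines_reviewed / s.lines_changed, 1.0)

    # Assumed weights; comments weigh least because they are the weakest signal.
    score = 0.5 * time_score + 0.2 * comment_score + 0.3 * coverage_score

    if score >= 0.7:  # assumed threshold
        return "deep"
    if score >= 0.4:  # assumed threshold
        return "adequate"
    return "shallow"
```

With these assumed numbers, a 200-line diff that was open for 3 seconds is `shallow` no matter how many comments it has, while the same diff read for 30 minutes with full coverage and three comments lands in `deep`.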

The classification gets signed with Ed25519 and stored as an attestation. A CI gate endpoint then blocks merges when the review depth is below a configurable minimum.
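A rough sketch of the gate side, under stated assumptions: the field names in the attestation record, the function names, and the default minimum are hypothetical. The Ed25519 signing and verification step is deliberately left out because the Python standard library has no Ed25519 primitive; `signature_valid` stands in for the result of whatever verifier is used.

```python
import json

# Ordering of the three depth labels from the post.
DEPTH_ORDER = {"shallow": 0, "adequate": 1, "deep": 2}


def attestation_payload(pr_id: str, reviewer: str, depth: str) -> bytes:
    """Canonical bytes that a signer would sign.

    Sorted keys and fixed separators keep the encoding deterministic,
    so the same record always produces the same bytes to verify against.
    """
    record = {"pr_id": pr_id, "reviewer": reviewer, "depth": depth}
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def gate_allows_merge(depth: str, minimum_depth: str, signature_valid: bool) -> bool:
    """Block the merge if the attestation is unverified or the depth is too low."""
    if not signature_valid:
        return False
    return DEPTH_ORDER[depth] >= DEPTH_ORDER[minimum_depth]
```

Treating an unverified attestation as a block, rather than as a warning, is a design choice in this sketch: a gate that falls back to "allow" when verification fails gives no more assurance than the plain Approve button.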

I have real doubts about parts of the formula. The comment signal is the weakest — three "looks good" comments score the same as three substantive engagement comments. The 1-second-per-line floor is a judgment call that will occasionally misclassify a genuinely fast reviewer. And any behavioral proxy like this can be gamed if someone is motivated.

But "approved" as the only review record for AI-generated auth code is not a governance posture — it is an illusion of one. The question is whether a behavioral floor, even an imperfect one, is better than no floor at all.

Curious if anyone here has built or worked with review quality measurement systems — and what signals you found actually correlated with genuine engagement vs. nominal approval.

Project: https://www.lineagelens.dev/

Interesting framing; the 1-sec-per-line floor reads as sensible. The more durable signal you might already be sitting on is content-of-comment: did the reviewer reference something not in the diff itself (a test name, a related ticket, a prior bug)? Bots and rubber-stamps can match line-time, but they don't reach for off-diff context, so the false-positive rate on "deep" drops sharply if you require one such reference per non-trivial PR. Google's code-review tone-of-comment study (https://research.google/pubs/pub47853/) breaks down the comment-quality dimension nicely; pairing it with a blast-radius weight (auth or migrations get a higher floor than CSS) would also let teams tune by risk rather than uniformly.

Lineage Lens

@fabriziowexare I really like the idea of weighting review depth by blast radius rather than treating every PR the same. A superficial review on a CSS tweak and a superficial review on authentication logic clearly do not carry the same risk.
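One way the blast-radius idea could look in practice is a per-path floor, where the strictest floor among the changed files wins. This is only a sketch: the glob patterns, the floor assigned to each, and the `adequate` default are made-up examples, not anything LineageLens ships.

```python
from fnmatch import fnmatchcase

DEPTH_ORDER = {"shallow": 0, "adequate": 1, "deep": 2}

# Hypothetical risk map: (glob pattern, minimum review depth). First match wins.
RISK_FLOORS = [
    ("*auth*", "deep"),
    ("*migrations/*", "deep"),
    ("*.css", "shallow"),
]
DEFAULT_FLOOR = "adequate"  # assumed default for paths with no matching pattern


def floor_for_file(path: str) -> str:
    for pattern, floor in RISK_FLOORS:
        if fnmatchcase(path, pattern):
            return floor
    return DEFAULT_FLOOR


def required_depth(changed_paths: list[str]) -> str:
    """The strictest floor across all changed files applies to the whole PR."""
    floors = [floor_for_file(p) for p in changed_paths] or [DEFAULT_FLOOR]
    return max(floors, key=DEPTH_ORDER.__getitem__)
```

So a PR touching only `static/site.css` would need `shallow`, but the same PR plus `src/auth/login.py` would need `deep`.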

The off-diff context signal is interesting too. Referencing related tickets, tests, incidents, or architectural constraints feels much harder to fake than raw review time or comment count. It might be a stronger indicator that the reviewer was engaging with the change as part of a broader system rather than just scanning the diff.
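A crude first pass at detecting off-diff context might be: pull reference-looking tokens out of each comment and keep the ones that never appear in the diff. The token patterns below (ticket keys, issue numbers, snake_case identifiers) and the function names are assumptions, and a plain substring check like this would miss plenty of real references and could be gamed too.

```python
import re

# Tokens that look like references to something specific (assumed patterns).
REFERENCE_RE = re.compile(
    r"[A-Z][A-Z0-9]+-\d+"                # ticket keys such as AUTH-412
    r"|#\d+"                             # issue or PR numbers such as #88
    r"|[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+"   # snake_case names such as test_token_expiry
)


def off_diff_references(comment: str, diff_text: str) -> set[str]:
    """Reference-like tokens in the comment that do not occur anywhere in the diff."""
    return {tok for tok in REFERENCE_RE.findall(comment) if tok not in diff_text}


def has_off_diff_context(comments: list[str], diff_text: str) -> bool:
    """True if at least one comment points at something outside the diff."""
    return any(off_diff_references(c, diff_text) for c in comments)
```

Three "looks good" comments produce no references at all, while a comment like "this skips test_token_expiry, see AUTH-412" on a diff that mentions neither would count.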