What building an AI incident agent actually taught us

by•1d ago

We've spent the last six months pointing an agent at real production errors. Going in, we were sure the hard problem was "can an AI write the fix?" Almost everything that mattered turned out to be somewhere else. Here are three things that surprised us:

1. Most errors were never worth a page

Across real traffic, ~70% of errors triage out as noise: they are not actionable and no human is needed. This rate holds surprisingly consistently across tenants and services. I still remember integrating with our first design partner. The first Slack notification comes in, the team is high-fiving, pure excitement. This thing actually works! Next thing we know, our Slack inbox blows up as a burst of 12 notifications arrives. So we built a relevance gate to make sure a notification only fires when something is truly broken. We set out to fix bugs and discovered the bigger win was the time we give back on the ~70% of errors nobody needs to look at.

2. Smart hashing beat our fancy AI pipeline

We built an agentic grouping pipeline with fine-tuned embeddings, similarity search and a classifier head to merge duplicate errors. Then we measured the dumb baseline: a plain deterministic fingerprint. It already handled 90%+ of dedup correctly. Around 6,000 errors collapsed to 130 incidents before a model touched anything. The clever layer turned out to be most valuable in the long tail, not the main corpus. Once again, it paid off to measure the boring baseline before pushing the smart thing to prod. The baseline also gives you great eval data for understanding the edge cases that keep failing. A nice side effect: with most of the volume shifted to the fingerprint, costs dropped and the pipeline became significantly easier to maintain.
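For illustration, a plain deterministic fingerprint of this kind could look roughly like the sketch below. The field names (type, message, frames), the normalization rules, the number of stack frames and the choice of SHA-256 are all assumptions made for the example, not the pipeline described in this post.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical normalization rules: replace volatile tokens (ids, addresses,
# counters) so two occurrences of the same bug yield the same string.
# Order matters: UUIDs and hex addresses are handled before bare digits.
_VOLATILE = [
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<uuid>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<hex>"),
    (re.compile(r"\d+"), "<n>"),
]


def normalize(message: str) -> str:
    """Strip volatile parts of an error message."""
    for pattern, placeholder in _VOLATILE:
        message = pattern.sub(placeholder, message)
    return message.strip().lower()


def fingerprint(exc_type: str, message: str, frames: list, top_n: int = 3) -> str:
    """Hash the exception type, normalized message and top stack frames."""
    key = "|".join([exc_type, normalize(message), *frames[:top_n]])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def group(errors: list) -> dict:
    """Collapse raw errors into incidents keyed by fingerprint."""
    incidents = defaultdict(list)
    for err in errors:
        incidents[fingerprint(err["type"], err["message"], err["frames"])].append(err)
    return dict(incidents)
```

Because the function is deterministic and needs no model call, it can run on every error first, leaving only the ungrouped remainder for a more expensive similarity layer.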

3. We optimize for recall and eat the false positives

We hand-audit incidents to maintain quality. As part of this, we zoom in on the false negative rate, because we hold ourselves to the standard of catching every relevant error. Building out this eval set gives us a guiding principle we can use to automate parts of the evaluation pipeline with agentic loops that are aligned with human annotation. Our latest evaluation run showed 97% triage agreement with human assessment, and all of the disagreements were false positives. So every mistake the agent made was in the safe direction: flagging something that turned out not to need action. That's a deliberate choice on our end. A false positive costs ten minutes to scan. A false negative is a potential outage where nobody got paged.
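As a rough sketch of how such an audit can be scored, the snippet below compares agent triage decisions against human labels and separates the two kinds of mistake. The names (TriageStats, score_triage) and the boolean "needs action" labels are assumptions for the example, not the evaluation pipeline described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriageStats:
    agreement: float         # share of items where agent and human agree
    false_positives: int     # agent flagged, human said no action needed
    false_negatives: int     # agent stayed silent, human said action needed
    recall: Optional[float]  # None when the humans marked nothing actionable


def score_triage(human: list, agent: list) -> TriageStats:
    """Compare agent decisions with human labels (True = needs action)."""
    if not human or len(human) != len(agent):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(human, agent))
    tp = sum(1 for h, a in pairs if h and a)
    fp = sum(1 for h, a in pairs if a and not h)
    fn = sum(1 for h, a in pairs if h and not a)
    recall = tp / (tp + fn) if (tp + fn) else None
    agreement = (len(pairs) - fp - fn) / len(pairs)
    return TriageStats(agreement, fp, fn, recall)
```

Tracking false_negatives separately from overall agreement is the point: a run can have high agreement and still miss the one error that mattered.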

Every one of these pointed the same way: let agents do the grunt work. Let them triage, correlate and draft fixes, and keep a human on the trigger. We don't see this as a limitation but as a guiding principle for reliable and trustworthy systems.

Replies

Thanks for sharing! It seems you folks found great learnings from this experience.

Low SNR is one of the most common productivity killers: engineers constantly have to context-switch from working on product to scuba diving into red herrings.

Trying to get 100% signal is even more dangerous: omitting the right alert can cost millions.

What I love about this post is the focus on the solution to a problem and not on the tool itself, which is really hard to find with all the AI overhype.

The error de-duplication pipeline seems key to giving LLMs clean data to do what they do best: analyze large text inputs. It seems like a solid design to me!

So the question is not how many lines of code I can ship with AI, it's how productive it can make me. It seems Ourbase found a great way of doing it: helping with triage, omitting noise and failing open (preferring false positives to false negatives).

Extra dope!

13h ago

@fernando_crespo_gravalos Amen to all of the smart reflections you share. I completely agree that more code being generated doesn't equate to more value being shipped. I recently read that a "741% increase in code written only translates to a 65% bump in pull requests (PRs), and just a 20% increase in actual software releases", which seems to hammer that point home. Source: https://leaddev.com/ai/ai-isnt-making-developers-more-productive-its-making-them-busier

13h ago