Tell me your AI-agent-in-prod horror story. Here's mine to start.

In the early LangGraph and GPT-4o days, we deployed a simple LangGraph agent for an internal tool. It worked well for about a week. Then a few colleagues told me some questions were taking several minutes to answer, and the agent was failing most of the time.

I checked the trace logs and found the problem. The agent was stuck in a loop, calling the same function again and again until it filled up the context window. After 15 to 30 minutes of that, the request would just fail. Nothing was in place to notice that it had called the same function many times without making progress, and to stop it. The same kind of failure shows up elsewhere:

  • Two agents once waited on each other in a loop for 11 days and ran up a $47K bill with no output.

  • McDonald's drive-thru added 260 McNuggets to a single order.

  • Air Canada's chatbot made up a refund policy and a tribunal made them honor it.

In most of these cases the model did its job. What was missing were basic controls around it.
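For what it's worth, the check we were missing is small. Here is a minimal sketch of a repeat-call guard, assuming you have a place in your agent loop to intercept each tool call before it runs. The names `RepeatCallGuard`, `LoopDetected`, and `max_repeats` are made up for illustration and are not part of LangGraph or any other library.

```python
import json
from collections import Counter


class LoopDetected(RuntimeError):
    """Raised when the same tool call repeats too many times in one run."""


class RepeatCallGuard:
    """Counts identical (tool, arguments) calls within a single agent run."""

    def __init__(self, max_repeats: int = 3):
        # Hypothetical threshold; tune it per tool and per workload.
        self.max_repeats = max_repeats
        self.counts = Counter()

    def check(self, tool_name: str, args: dict) -> None:
        # Serialize with sorted keys so argument order does not matter.
        key = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self.counts[key] += 1
        if self.counts[key] > self.max_repeats:
            raise LoopDetected(
                f"{tool_name} called {self.counts[key]} times with identical arguments"
            )
```

You would create one guard per request and call `guard.check(name, args)` right before executing each tool call. It only catches exact repeats, so an agent that varies its arguments slightly on every pass would still get through; treat it as a floor, not a full solution.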

Years later, this is still the gap between an agent that demos well and one you can deploy with confidence. (For context, it's part of why we're building Clyro, runtime governance for agents. We're still pre-launch. But genuinely, I just want the stories.)

So what's yours? Drop it below, bonus points for the root cause.

Replies

One of the biggest failure modes I’ve seen is not a dramatic hallucination but silent overconfidence.

The agent looked like it was progressing, kept producing well-formed steps, and even called the right tools, but it was optimizing against the wrong objective the whole time. So for 20–30 minutes it kept doing “reasonable” work that was completely useless. No hard failure, just expensive motion with zero outcome.

That’s what makes agent failures tricky in prod. The problem is often not intelligence but missing runtime controls: loop detection, step limits, cost guardrails, deadlock checks, and a way to verify that each action is actually moving closer to the goal.
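Two of those controls, step limits and cost guardrails, fit in a few lines. This is a rough sketch, assuming your loop can report an estimated cost per step. `RunBudget`, `BudgetExceeded`, and the default limits are hypothetical, not taken from any framework.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run goes over its step, cost, or time limit."""


@dataclass
class RunBudget:
    # Hypothetical defaults; pick limits that match your own workload.
    max_steps: int = 25
    max_cost_usd: float = 2.00
    max_seconds: float = 120.0
    steps: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, step_cost_usd: float) -> None:
        """Record one agent step and stop the run if any limit is exceeded."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit hit after {self.steps} steps")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit hit at ${self.cost_usd:.2f}")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("wall-clock limit hit")
```

Call `budget.charge(cost)` once per model or tool call. This bounds how much a run can waste, but it does not tell you whether the steps were useful, so the "wrong objective" case above still needs a separate progress check.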

Demo-grade agents fail loudly. Production-grade agents fail expensively.