In the early days of LangGraph and GPT-4o, we deployed a simple LangGraph agent for an internal tool. It worked well for about a week. Then a few colleagues told me that some questions were taking several minutes to answer and that the agent was failing most of the time.
I checked the trace logs and found the problem. The agent was stuck in a loop, calling the same function again and again until it filled up the context window. After 15 to 30 minutes of that, the request would just fail. Nothing in the system was watching for an agent that had called the same function many times with no progress, so nothing stopped it. This failure mode shows up everywhere.
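To make the missing piece concrete, here is a minimal sketch of the kind of guard that would have caught this. It is not what we ran and it is not a LangGraph API: `RepeatCallGuard`, `RepeatedCallError`, and the `search_docs` tool name are all hypothetical names invented for illustration. It also assumes that "no progress" can be approximated by "same function, same arguments", which is a rough proxy at best.

```python
import json
from collections import Counter


class RepeatedCallError(RuntimeError):
    """Raised when an identical tool call has been made too many times."""


class RepeatCallGuard:
    """Counts identical tool calls within one request (hypothetical helper)."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def check(self, tool_name: str, args: dict) -> None:
        # Fingerprint the call; sort_keys makes argument order irrelevant.
        key = (tool_name, json.dumps(args, sort_keys=True, default=str))
        self._seen[key] += 1
        if self._seen[key] > self.max_repeats:
            raise RepeatedCallError(
                f"{tool_name} called {self._seen[key]} times with identical arguments"
            )


# Usage: call check() before executing each tool call the agent requests.
guard = RepeatCallGuard(max_repeats=3)
try:
    for _ in range(10):  # simulate an agent stuck on the same call
        guard.check("search_docs", {"query": "quarterly report"})
except RepeatedCallError as err:
    print(f"stopped: {err}")
```

With a guard like this, the stuck request would stop on the fourth identical call instead of running until the context window fills up. The threshold of 3 is arbitrary, and a real agent may legitimately repeat a call (polling, for example), so the limit and the fingerprint would need tuning per tool.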