Garry Tan

Atla - Automatically detect errors in your AI agents

Atla is the only eval tool that helps you automatically discover the underlying issues in your AI agents. Understand step-level errors, prioritize recurring failure patterns, and fix issues fast, before your users ever notice.

Sashank Pisupati

Massively proud of the whole @Atla team for getting us here - it's been a labor of love, and we're finally out there ❤️

We spend all our time thinking about how to diagnose agent failures better, faster & smarter - and we've found the most reliable route to be focusing on recurring failure patterns (to cut through the noise), while keeping an eye out for new ones (to stay on-policy).

I think we've built something pretty cool that attempts to do that, but more importantly we're eager to learn continuously from feedback and make our eval tools better - so that people can make their agents better. Give us a try & let us know what you think!

Young Sun Park

@thelemonbot Pattern king

Konrad Urban

Big congrats to the Atla team on launch!!

Debugging AI agents has always felt like chasing shadows. Not anymore.

What I love most:

  • Step-level visibility

  • Pattern clustering

  • Actionable fixes + integrations with tools like Claude Code make it feel like an engineer is already drafting the PR for you.

  • And the ability to chat with traces is a total game changer. Finally, a way to ask “what’s really happening here?” and get a real answer, backed by data.

Super excited to see where the roadmap takes it. Congrats again, Roman, Jackson, and team! This is going to be a must-have for anyone building at the frontier of AI.

Young Sun Park

@kkonrad Thanks for your support Konrad 🥜! Happy to see you highlight the chat with traces feature, which the team made a big push to ship for this launch! We want agent builders to not only see critical failures quickly, but also dig deeper into issues that matter most for their own users.

When you chat with traces, you get an answer and a list of relevant traces where that issue is occurring. Excited for people to use this and more in Atla.

Armin Schöpf

Nice! Really enjoyed the demo. It seems like it can easily surface the cause of errors that took us a long time to debug previously.

Also liked the compare feature, as it seems to uncover the different failure modes of models and show the improvements / degradations between experiments.

Excited to implement it and see whether just handing the quick fix to Claude Code will then solve the errors. That would be fantastic.

Roman

Exactly — the core value is in automatically surfacing failure patterns and highlighting what matters, so you don’t drown in noisy logs.

Early tests show Claude Code can already implement fixes quite well. We’re working on making it more reliable by detecting precise failure patterns, which lets coding agents apply targeted fixes and avoid regressions. That way they can iterate quickly through errors.

Matt Falconer

Congrats! Atla is a much needed product - and it's awesome to see this launch.

Mathias Leys

Thanks Matt, appreciate the kind words!

Kyle

Excited to launch Atla 🚀

We built it so agent teams can ship faster, more reliably. Huge shoutout to the team for the grind that got us here. Can’t wait to help make your agents better—curious how you’re debugging today and where we can support!

Jack Miller

Very exciting! Have known the Atla team for a while now and they are excellent engineers and researchers :)

Roman

Thanks Jack for the nice words!

Yehan Xiao

Really smart concept. Using AI to debug AI just makes sense, especially when you're dealing with complex agent behaviors. Way better than trying to manually catch all these edge cases.

Roman

Completely agree! Our approach preserves back-traceability from failure patterns down to the individual spans where they occurred. This also lets you organically build up an evaluation dataset from failure patterns.
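The back-traceability idea above can be sketched roughly like this. This is a hypothetical illustration, not Atla's actual data model: the `Span`, `FailurePattern`, and `build_eval_dataset` names are made up for the example. The point is that keeping a link from each failure pattern back to the spans where it occurred lets you derive an eval dataset directly from recurring failures.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    span_id: str
    agent_step: str   # e.g. the LLM output or tool call at this step
    trace_id: str

@dataclass
class FailurePattern:
    name: str
    description: str
    spans: list = field(default_factory=list)  # back-links to offending spans

    def add_occurrence(self, span: Span) -> None:
        self.spans.append(span)

def build_eval_dataset(patterns):
    """Turn each (pattern, span) pair into an eval case."""
    return [
        {"input": span.agent_step, "expected_failure": p.name, "trace": span.trace_id}
        for p in patterns
        for span in p.spans
    ]

pattern = FailurePattern("wrong_tool_args", "Agent passes malformed arguments")
pattern.add_occurrence(Span("s1", "search(query=None)", "t42"))
dataset = build_eval_dataset([pattern])
print(len(dataset))  # 1
```

Because every eval case carries its `trace` back-link, a fix can be verified against exactly the traces where the pattern originally showed up.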

Cruise Chen

Finally someone has crafted a tool that evals agents... too many agents nowadays, and I believe Atla could be a stress-testing tool for them... How does it cater to different scenarios and business logic?

Sashank Pisupati

Thank you @cruise_chen! Super important to stress test agents before sending them into the wild.

We've benchmarked our granular LLMJ annotator on many scenarios (customer support, coding agents, browsing, etc.), but the real adaptiveness comes from aggregating these into failure patterns tailored to each individual agent - rather than generic eval criteria, you see the specific ways in which your agent is misbehaving.

We're already working on the next steps of customizability: letting users dynamically shape patterns over time to make them their own, and understanding how different patterns influence specific business metrics of interest!
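The aggregation idea described above can be sketched in a few lines. This is a hypothetical simplification (Atla's actual clustering is surely more sophisticated than frequency counting): collect step-level error annotations across many traces, then surface the ones that recur, so builders see their agent's specific failure modes rather than generic scores. The annotation records and `recurring_patterns` helper are invented for the example.

```python
from collections import Counter

# Toy step-level error annotations gathered across several traces.
annotations = [
    {"trace": "t1", "step": 3, "error": "ignored_refund_policy"},
    {"trace": "t2", "step": 1, "error": "ignored_refund_policy"},
    {"trace": "t3", "step": 2, "error": "hallucinated_order_id"},
    {"trace": "t4", "step": 5, "error": "ignored_refund_policy"},
]

def recurring_patterns(annotations, min_count=2):
    """Aggregate annotations and keep only errors seen at least min_count times."""
    counts = Counter(a["error"] for a in annotations)
    return [(err, n) for err, n in counts.most_common() if n >= min_count]

print(recurring_patterns(annotations))
# [('ignored_refund_policy', 3)]
```

One-off errors like `hallucinated_order_id` fall below the threshold, which is how recurring patterns cut through the noise while new patterns can still be watched for separately.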

Sneh Shah

Congratulations on the launch! 🚀 I’m building AI agents for business workflows, and error detection is always tough. Does Atla only look at LLM outputs, or can it diagnose issues across the whole agent process—including code and APIs? How customizable is the error tracking for unique workflows? Would love to hear if teams use Atla for improving non-LLM agents too.

Henry Broomfield

@sneh_shah this is a great question! We currently focus on LLM outputs, which include the LLM tool calls (i.e. the tool-call arguments) as well as handoffs to other agents. We therefore assume the tool outputs themselves are correct and leave the intricacies of tool implementation and tool-error handling to the developer, though we do pick up on how the agent reacts to tool outputs. Systematic issues across agent processes are then highlighted as common failure patterns.

The error tracking is automatically customised to your system message and tool information - we measure how well the agent follows the policy and completes the task you have specified, rather than having you repeat this information in the evaluation. For further customisability, individual metrics can be tracked in our custom metrics suite.
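To make the "LLM outputs, not tool outputs" scope concrete, here is a hypothetical step-level check (not Atla's implementation) that validates the arguments of a tool call emitted by the LLM against the tool's declared parameters. Evaluating at this level flags the agent's mistake without executing the tool itself; the `check_tool_call` function and schema shape are assumptions for the example.

```python
import json

def check_tool_call(tool_call: dict, tool_schema: dict) -> list:
    """Return a list of issues found in one LLM tool call."""
    issues = []
    try:
        args = json.loads(tool_call["arguments"])
    except (json.JSONDecodeError, KeyError):
        return ["arguments are not valid JSON"]
    params = tool_schema["parameters"]
    # The LLM must supply every required argument...
    for name in tool_schema.get("required", []):
        if name not in args:
            issues.append(f"missing required argument: {name}")
    # ...and must not invent arguments the tool does not accept.
    for name in args:
        if name not in params:
            issues.append(f"unexpected argument: {name}")
    return issues

schema = {"parameters": {"query": "string"}, "required": ["query"]}
print(check_tool_call({"arguments": '{"q": "weather"}'}, schema))
# ['missing required argument: query', 'unexpected argument: q']
```

A check like this inspects only what the LLM produced, so responsibility for the tool's own behaviour stays with the developer, matching the division of labour described above.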

lerato kgopa

Atla looks super helpful in discovering root causes of errors, not just raising alerts.