Atla is the only eval tool that helps you automatically discover the underlying issues in your AI agents. Understand step-level errors, prioritize recurring failure patterns, and fix issues fast, before your users ever notice.
Replies
Atla
Massively proud of the whole @Atla team for getting us here - it's been a labor of love, and we're finally out there ❤️
We spend all our time thinking about how to diagnose agent failures better, faster & smarter - and we've found the most reliable route to be focussing on recurring failure patterns (to cut through the noise), while keeping an eye out for new ones (to stay on-policy).
I think we've built something pretty cool that attempts to do that, but more importantly we're eager to learn continuously from feedback and make our eval tools better - so that people can make their agents better. Give us a try & let us know what you think!
Atla
@thelemonbot Pattern king
Knit – Your Virtual Meeting Place
Big congrats to the Atla team on launch!!
Debugging AI agents has always felt like chasing shadows. Not anymore.
What I love most:
Step-level visibility
Pattern clustering
Actionable fixes + integrations with tools like Claude Code make it feel like an engineer is already drafting the PR for you.
And the ability to chat with traces is a total game-changer. Finally, a way to ask “what’s really happening here?” and get a real answer, backed by data.
Super excited to see where the roadmap takes it. Congrats again, Roman, Jackson, and team! This is going to be a must-have for anyone building at the frontier of AI.
Atla
@kkonrad Thanks for your support Konrad 🥜! Happy to see you highlight the chat with traces feature, which the team made a big push to ship for this launch! We want agent builders to not only see critical failures quickly, but also dig deeper into issues that matter most for their own users.
When you chat with traces, you get an answer and a list of relevant traces where that issue is occurring. Excited for people to use this and more in Atla.
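Conceptually, the shape of a chat-with-traces response might look like the sketch below. All names and data structures here are hypothetical illustrations, not Atla's actual API: the key idea is that every answer comes paired with the trace IDs where the issue actually occurs.

```python
# Hypothetical trace records (illustrative only, not Atla's data model).
traces = [
    {"id": "t1", "steps": ["plan", "call_tool", "retry_tool"], "error": "timeout"},
    {"id": "t2", "steps": ["plan", "call_tool"], "error": None},
    {"id": "t3", "steps": ["plan", "call_tool", "retry_tool"], "error": "timeout"},
]

def answer_question(issue: str, traces: list[dict]) -> dict:
    """Return a summary answer plus the traces where the issue occurs."""
    matching = [t["id"] for t in traces if t["error"] == issue]
    return {
        "answer": f"{len(matching)} traces hit '{issue}'",
        "traces": matching,  # back-pointers so the answer stays data-backed
    }
```

The point of the sketch is the return shape: an answer you can read, plus the evidence you can click into.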
First Words - Multilingua
Nice! Really enjoyed the demo. It seems like it can easily surface the cause of errors that took us a long time to debug previously.
Also liked the compare feature, as it uncovers the different failure modes of models and shows improvements or degradations between experiments.
Excited to implement it and see whether just handing the quick fix to Claude Code will solve the errors. That would be fantastic.
Atla
Exactly — the core value is in automatically surfacing failure patterns and highlighting what matters, so you don’t drown in noisy logs.
Early tests show Claude Code can already implement fixes quite well. We’re working on making it more reliable by detecting precise failure patterns, which lets coding agents apply targeted fixes and avoid regressions. That way they can iterate quickly through errors.
Instruct
Congrats! Atla is a much needed product - and it's awesome to see this launch.
Atla
Thanks Matt, appreciate the kind words!
Atla
Excited to launch Atla 🚀
We built it so agent teams can ship faster, more reliably. Huge shoutout to the team for the grind that got us here. Can’t wait to help make your agents better—curious how you’re debugging today and where we can support!
Very exciting! Have known the Atla team for a while now and they are excellent engineers and researchers :)
Atla
Thanks Jack for the nice words!
Really smart concept. Using AI to debug AI just makes sense, especially when you're dealing with complex agent behaviors. Way better than trying to manually catch all these edge cases.
Atla
Completely agree! Our approach preserves back-traceability from failure patterns down to the individual spans where they occurred. This also lets you organically build up an evaluation dataset from failure patterns.
Agnes AI
Finally someone has crafted a tool that evals Agents... too many agents nowadays and I believe Atla could be a stress testing tool for them... How does it cater to different scenarios and biz logics?
Atla
Thank you @cruise_chen! Super important to stress test agents before sending them into the wild.
We've benchmarked our granular LLMJ annotator on many scenarios (customer support, coding agents, browsing, etc.), but the real adaptiveness comes from aggregating these into failure patterns tailored to each individual agent - rather than generic eval criteria, you see the specific ways in which your agent is misbehaving.
We're already working on the next steps of customizability - which is letting users dynamically shape patterns over time to make them their own, and understanding how different patterns influence specific business metrics of interest!
Congratulations on the launch! 🚀 I’m building AI agents for business workflows, and error detection is always tough. Does Atla only look at LLM outputs, or can it diagnose issues across the whole agent process—including code and APIs? How customizable is the error tracking for unique workflows? Would love to hear if teams use Atla for improving non-LLM agents too.
Atla
@sneh_shah this is a great question! We currently focus on LLM outputs, which include LLM tool calls (i.e. the tool call arguments) and handoffs to other agents. We therefore assume that tool outputs are correct and leave the intricacies of tool and tool-error handling to the developer, though we do pick up on how the agent reacts to tool outputs. Systematic issues across agent processes are then highlighted as common failure patterns.
The error tracking is automatically customised to your system message and tool information: we measure how well the agent follows the policy and completes the task you have specified, rather than having you repeat this information in the evaluation. For further customisability, individual metrics can be tracked in our custom metrics suite.
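To make the tool-call focus concrete, here is a hypothetical custom metric of the kind described above (the function, schema, and field names are illustrative assumptions, not Atla's API): it flags tool calls whose arguments are missing fields the tool schema marks as required.

```python
# Hypothetical custom metric: detect LLM tool calls with missing required
# arguments, using a JSON-Schema-style "required" list (illustrative only).
def missing_required_args(tool_call: dict, schema: dict) -> list[str]:
    """Return the required argument names absent from a tool call."""
    required = schema.get("required", [])
    provided = tool_call.get("arguments", {})
    return [field for field in required if field not in provided]

schema = {"name": "get_order", "required": ["order_id"]}
call = {"name": "get_order", "arguments": {"customer": "acme"}}
# missing_required_args(call, schema) returns ["order_id"]
```

A metric like this evaluates the LLM's side of a tool call (the arguments it emitted) without executing the tool itself, matching the scope described above.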
Atla looks super helpful in discovering root causes of errors, not just raising alerts.