Ben Lang

Agent Mode on Arena - Get real-world tasks done with autonomous AI agents

Most AI benchmarks test models in controlled environments. Agent Mode tests them on complex tasks to get more work done. Run autonomous agents that browse, research, code, use files, and complete multi-step workflows from a single prompt. Then watch each workflow unfold step by step. Every run contributes to the Agent Arena Leaderboard, ranking frontier models by real-world agentic performance.

MD Amirul Islam

One thing I appreciate about Arena is that it shifts the conversation from "which model is trending" to "which model actually performs best for my use case." With the pace of AI innovation today, having a reliable way to evaluate and compare models is incredibly valuable. This feels like a product that can help builders make smarter decisions instead of relying on assumptions or marketing.

Congratulations on the launch, excited to see how the platform evolves and serves the AI community! 🚀

Elliott Gluck

@1mirul Yes! It's all about how the models perform for actual real-world use cases. Appreciate the well wishes, excited to get this out to our community.

Dhiraj Patel

Really interesting. How does Arena prevent agents from doing something destructive during a benchmark run?

Ted Moran

@dhiraj_patel5 great question – the agent's actions are limited to a sandbox environment at the moment.

Elliott Gluck

👋 Hey Product Hunt! We're excited to launch Agent Mode on Arena.


AI chat experiences are often limited to rigid, single-modality interactions that require switching tools or additional prompting. Agent Mode changes that. You can now prompt once and the agent will plan, browse, research, and code in a sandbox testing environment to complete real-world, multi-step tasks for you.


Every Agent Mode session also powers our new Agent Leaderboard, built entirely from behavioral signals (such as confirmed success, bash recovery, steerability, and more) collected from real users running real-world workflows. We're excited to have our community contributing to the leaderboard, and to provide a new standard for measuring AI advancement.


We'd love your feedback: What agentic tasks did you throw at it? What tools should we add next? Thanks for checking it out 🙏

Ted Moran

@elliott_gluck let's gooooooooo

Monir

Arena feels like a much-needed reality check for the AI space. Instead of guessing or trusting scattered benchmarks, it brings everything into one place where models can be evaluated side by side in a practical way. For anyone building with AI, this kind of clarity is extremely valuable. Excited to see how it grows and how the community contributes to making AI evaluation more transparent and useful over time.

Elliott Gluck

@monir_ Really appreciate the thoughtful comment! We're very excited to bring Agent Mode and the Agent Arena leaderboard to the public to help better measure agentic AI!

Odeth N

Really impressed by how Agent Mode cuts through the usual friction: prompt once and it actually carries the task end-to-end. The leaderboard angle is clever too, feels like a transparent way to measure real progress. Curious to see what new tools you'll plug in next.

Ted Moran

@odeth_negapatan1 next up is: GitHub integration, full stack coding, slide creation, PDF creation, image edit, video gen, and wayyy more. Strap in 🤘

Nathan Tran

the UI alone is pretty awesome. you guys really have that "taste", Elliott

I just tested it, and it's mind-blowing

Elliott Gluck

@nathan_tran2 Thank you Nathan, super exciting to hear you're already getting value and loving using the product!!

Ted Moran

@nathan_tran2 very kind of you, thank you!

Zaid Mallik

One thing we've noticed with agent workflows is that execution is becoming less of a bottleneck than decision-making.

Are you seeing users struggle more with planning and prioritization, or with agents actually completing the tasks once they're started?

Ted Moran

@zaid_mallik1 we see a couple things
1) Most users start messages by handing over a whole job rather than asking for advice: the delegation posture skews heavily toward "build this deliverable" and "operate autonomously." However, after seeing the first response, they tighten the reins, pulling control back far more often than they hand over more.
2) We also find that when the opening ask bundles several explicit parts, agents usually cover all of them; the typical shortfall is leaving one incomplete. A rarer but more consequential shortfall is covert: the agent could have surfaced the incomplete work, but instead presents the result as complete. We call this "Bluffing".

More info here in our blog about the Agent Leaderboard if you're curious: https://arena.ai/blog/agent-arena-methodology/
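
For readers who want the distinction above in concrete terms, here is a minimal sketch. It is an illustration only, not Arena's grading code: the `RunReport` fields, the `Outcome` labels, and the `classify` function are all hypothetical names, and it assumes you already know which requested parts were completed and which ones the agent itself flagged as unfinished.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"                        # every requested part was done
    DISCLOSED_SHORTFALL = "disclosed_shortfall"  # something missing, agent said so
    BLUFF = "bluff"                              # something missing, presented as done


@dataclass
class RunReport:
    # Hypothetical structure: explicit parts bundled into the opening ask.
    requested_parts: set[str]
    # Parts that were actually finished (assumed to be verified elsewhere).
    completed_parts: set[str]
    # Parts the agent itself surfaced as unfinished in its final answer.
    disclosed_incomplete: set[str]


def classify(report: RunReport) -> Outcome:
    """Label a run by whether missing work was surfaced or hidden."""
    missing = report.requested_parts - report.completed_parts
    if not missing:
        return Outcome.COMPLETE
    # Every missing part was surfaced by the agent: an honest shortfall.
    if missing <= report.disclosed_incomplete:
        return Outcome.DISCLOSED_SHORTFALL
    # At least one missing part was never mentioned: a covert shortfall.
    return Outcome.BLUFF
```

The hard part in practice is filling in `completed_parts` reliably; the sketch only shows that "bluffing" is about what the agent reports, separate from what it managed to finish.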

Zaid Mallik

@teozorro The "bluffing" distinction is interesting because it feels less like a capability problem and more like an uncertainty-reporting problem.

Have you found bluffing correlates more with task complexity, or with agents lacking a reliable mechanism to know when they've actually reached the limits of what they can verify?

Ted Moran

@zaid_mallik1 it really is a capability problem. We're depending on agents to check their work and hold themselves accountable, because if they don't, users have to practice constant vigilance and babysit their tasks. In the world of autonomous agents, this capability is absolutely essential.

And yes, to your point, the more complex the task, the more likely the model is to misreport completion... until you say "please double check your work" 😆

Huisong Li

@elliott_gluck Congratulations. And happy product launch.

Elliott Gluck

@huisong_li Thank you @huisong_li, appreciate your support!!

Karim Ben

Do you publish the exact tasks and grading so people can reproduce runs and compare fairly?

Natalia Iankovych

Cool! What about industry-specific modes? For example, we built a travel AI and want to compare the quality of our AI with other models, but specifically in the travel domain.