Benchmark and compare the best AI models

Agent Mode on Arena - Get real-world tasks done with autonomous AI agents

Cursor

•4d ago

Most AI benchmarks test models in controlled environments. Agent Mode tests them on complex tasks to get more work done. Run autonomous agents that browse, research, code, use files, and complete multi-step workflows from a single prompt. Then watch each workflow unfold step by step. Every run contributes to the Agent Arena Leaderboard, ranking frontier models by real-world agentic performance.

Replies

Earth.fm

One thing I appreciate about Arena is that it shifts the conversation from "which model is trending" to "which model actually performs best for my use case." With the pace of AI innovation today, having a reliable way to evaluate and compare models is incredibly valuable. This feels like a product that can help builders make smarter decisions instead of relying on assumptions or marketing.

Congratulations on the launch — excited to see how the platform evolves and serves the AI community! 🚀

4d ago

Arena

Maker

@1mirul Yes! It's all about how the models perform for actual real-world use cases. Appreciate the well wishes, excited to get this out to our community.

4d ago

Really interesting. How does Arena prevent agents from doing something destructive during a benchmark run?

4d ago

Arena

Maker

@dhiraj_patel5 great question – the agent's actions are limited to a sandbox environment at the moment.

4d ago

Arena

Maker

👋 Hey Product Hunt! We're excited to launch Agent Mode on Arena.

AI chat experiences are often limited to rigid, single-modality interactions that require switching tools or additional prompting. Agent Mode changes that. You can now prompt once and the agent will plan, browse, research, and code in a sandbox testing environment to complete real-world, multi-step tasks for you.

Every Agent Mode session also powers our new Agent Leaderboard, built entirely from behavioral signals (such as confirmed success, bash recovery, steerability, and more) collected from real users running real-world workflows. We’re excited to have our community contribute to the leaderboard and to provide a new standard for measuring AI advancement.

We'd love your feedback: What agentic tasks did you throw at it? What tools should we add next? Thanks for checking it out 🙏

5d ago

Arena

Maker

@elliott_gluck let's gooooooooo

4d ago

CheckYa

Arena feels like a much-needed reality check for the AI space. Instead of guessing or trusting scattered benchmarks, it brings everything into one place where models can be evaluated side by side in a practical way. For anyone building with AI, this kind of clarity is extremely valuable. Excited to see how it grows and how the community contributes to making AI evaluation more transparent and useful over time.

4d ago

Arena

Maker

@monir_ Really appreciate the thoughtful comment! We're very excited to bring Agent Mode and the Agent Arena leaderboard to the public to help better measure agentic AI!

4d ago

Lancepilot

Really impressed by how Agent Mode cuts through the usual friction—prompt once and it actually carries the task end-to-end. The leaderboard angle is clever too, feels like a transparent way to measure real progress. Curious to see what new tools you’ll plug in next.

4d ago

Arena

Maker

@odeth_negapatan1 next up: GitHub integration, full-stack coding, slide creation, PDF creation, image editing, video gen, and wayyy more. Strap in 🤘

4d ago

Uselink

the UI alone is pretty awesome. you guys really have that "taste", Elliott

I just tested it, and it's mind-blowing

4d ago

Arena

Maker

@nathan_tran2 Thank you Nathan, super exciting to hear you're already getting value and loving using the product!!

4d ago

Arena

Maker

@nathan_tran2 very kind of you, thank you!

4d ago

One thing we've noticed with agent workflows is that execution is becoming less of a bottleneck than decision-making.

Are you seeing users struggle more with planning and prioritization, or with agents actually completing the tasks once they're started?

4d ago

Arena

Maker

@zaid_mallik1 we see a couple of things:
1) Most users start messages by handing over a whole job rather than asking for advice: the delegation posture skews heavily toward "build this deliverable" and "operate autonomously." However, after seeing the first response, they tighten the reins — pulling control back far more often than they hand over more.
2) We also find that when the opening ask bundles several explicit parts, agents usually cover all of them; the typical shortfall is leaving one incomplete. A rarer but more consequential shortfall is covert: the agent could have surfaced the incomplete work, but instead presents the result as complete. We call this "Bluffing".

More info here in our blog about the Agent Leaderboard if you're curious: https://arena.ai/blog/agent-arena-methodology/

4d ago

@teozorro The "bluffing" distinction is interesting because it feels less like a capability problem and more like an uncertainty-reporting problem.

Have you found bluffing correlates more with task complexity, or with agents lacking a reliable mechanism to know when they've actually reached the limits of what they can verify?

4d ago

Arena

Maker

@zaid_mallik1 it really is a capability problem. We're depending on agents to check their work and hold themselves accountable, because if they don't, users have to practice constant vigilance and babysit their tasks. In the world of autonomous agents, this capability is absolutely essential.

And yes, to your point, the more complex the task, the more likely the model is to misreport completion... until you say please double check your work 😆

4d ago

@elliott_gluck Congratulations. And happy product launch.

4d ago

Arena

Maker

@huisong_li Thank you, appreciate your support!!

4d ago

Mailwarm

Do you publish the exact tasks and grading so people can reproduce runs and compare fairly?

16h ago

Cool! What about industry-specific modes? For example, we built a travel AI and want to compare the quality of our AI with other models, but specifically in the travel domain.

16h ago