Web Bench

A 10x better benchmark for AI browser agents

Compare and benchmark different AI web browsing agents. Web Bench provides comprehensive performance metrics for AI agents navigating the web.


Suchintan Singh

TL;DR: Web Bench is a new dataset for evaluating web browsing agents, consisting of 5,750 tasks across 452 different websites, with 2,454 tasks open sourced. It builds on the foundations of WebVoyager, which didn't represent the internet well because it only spanned 15 websites. Anthropic Sonnet 3.7 CUA is the current SOTA, with Skyvern being the best agent for WRITE-heavy tasks. The detailed results are linked here.

I bet you've seen a bunch of flashy demos of web browsing agents, looked at the crazy high scores on the benchmarks, and excitedly tried them out... only to realize they don't work as well as advertised.

This is because the previous benchmark (WebVoyager) only spanned 643 tasks across 15 websites. While it was a great starting point, it didn't capture the internet's adversarial nature towards browser automation, or the difficulty of tasks that involve mutating data on a website.

As a result, the Skyvern and Halluminate teams created a new benchmark to better quantify these failures. Our goal was to create a consistent measurement system for AI web agents by expanding the foundations laid by WebVoyager:

  1. Expanding the number of websites from 15 → 452, and the number of tasks from 643 → 5,750, to test agent performance on a wider variety of websites

  2. Introducing the concept of READ vs. WRITE tasks (see the sketch after this list):

    1. READ tasks involve navigating websites and fetching data

    2. WRITE tasks involve entering data, downloading files, logging in, solving 2FA, etc., and were not well represented in the WebVoyager dataset

  3. Measuring the impact of browser infrastructure (e.g., the ability to access websites, solve captchas, and not crash)
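
For concreteness, here's a minimal sketch of how READ vs. WRITE tasks could be represented; the field names (`task_id`, `url`, `kind`, `instruction`) and the example tasks are hypothetical illustrations, not the actual Web Bench schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class BenchTask:
    """One benchmark task (illustrative fields, not the real Web Bench schema)."""
    task_id: str
    url: str                        # website the agent must operate on
    kind: Literal["READ", "WRITE"]  # READ: fetch data; WRITE: mutate site state
    instruction: str                # natural-language goal given to the agent

tasks = [
    BenchTask("r-001", "https://example.com/store", "READ",
              "Find the price of the cheapest laptop in the catalog."),
    BenchTask("w-001", "https://example.com/account", "WRITE",
              "Log in, enable 2FA, and download the recovery codes."),
]
```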

We ran the benchmark and open-sourced 2,454 of the tasks to help the industry move towards a new standard, and the results surprised us:

  1. The best-performing model is Anthropic's Sonnet 3.7 CUA

  2. All models did very poorly on WRITE-heavy tasks

  3. Browser infrastructure played a bigger role in agents' ability to take actions than previously expected
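
To make the READ/WRITE split concrete, here's a hedged sketch of how per-category success rates can be tallied. `agent_run` is a hypothetical stand-in for your agent harness plus whatever verifier (human or automated) judges pass/fail, and it reuses the illustrative `BenchTask` shape from the sketch above:

```python
from collections import Counter

def evaluate(agent_run, tasks):
    """Tally success rates split by task kind (READ vs. WRITE).

    `agent_run` is a hypothetical callable: given a task, it returns True
    if a verifier judged the agent's attempt successful.
    """
    passed, total = Counter(), Counter()
    for task in tasks:
        total[task.kind] += 1
        if agent_run(task):
            passed[task.kind] += 1
    return {kind: passed[kind] / total[kind] for kind in total}

# A result like {"READ": 0.7, "WRITE": 0.3} reflects the gap described above:
# WRITE tasks (logins, 2FA, form submission) fail far more often than READ tasks.
```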

If you're interested, read the full report here.

Have any cool use-cases for browser agents? Reply and let me know below 👇

Yogita Suyal

@suchintan_singh Huge leap forward! Finally a benchmark that tests agents on real-world WRITE tasks, not just simple data scraping. Excited to see how this pushes the next gen of web agents. 👏

Chris W

Awesome! Literally exactly what I needed. Have been working on an agentic product and, until now, have just been testing it using whatever wild task I dream up on any given day.

Having something as comprehensive as this means I can be objective about the quality/usefulness of what I’m building.

GL with the launch

Wyatt Marshall

@cwbuilds1 Thanks Chris!

Asim Shrestha

Congrats on the launch folks! Huge eval hole with web agents so this work is really appreciated

Shahriar Hasan

Impressed by how Web Bench simplifies load testing for modern web apps without needing complex setups. It's a huge win for teams wanting quick performance insights. Does it also provide recommendations or benchmarks to interpret test results better?
Wyatt Marshall

@shahriardgm currently working on a Web Bench Lite with automated verification and error analysis!

Supa Liu

Web Bench fills a much-needed gap by offering clear, actionable benchmarks for AI browsing agents. It’s a valuable tool for anyone building or evaluating autonomous web systems.

Manu Goel

Very interesting! Didn't think of something like this for load testing!

Erliza. P

Web Bench redefining AI agent benchmarking? ⚙️🤖 The "10x better" claim suggests:

- Real-world task simulations (form filling, CAPTCHAs)

- Multimodal evaluation (text+image understanding)

- Latency/accuracy tradeoff metrics

Potential to become the new standard if it includes cross-browser testing (Chromium/WebKit).
