
Notte
Build and deploy reliable browser agents at scale
1.7K followers
One platform to build and deploy enterprise-grade browser agents. Managed infrastructure (sessions, proxies, identities, vaults) via a single API. Hybrid architecture combines deterministic scripts with AI reasoning for production reliability.
This is the 4th launch from Notte.
Browser Arena
Launching today
Browser Arena is an open-source benchmark that tests 7 cloud browser providers on speed, reliability, and cost. Same tests, same EC2 instances, 1,000+ runs each. All results and code are public - deploy on Railway and reproduce every number yourself.
Free
Notte
Hey PH! I'm Sam from Notte. 👋
We built Browser Arena because we were tired of seeing benchmarks in the AI agent/browser infra space that couldn't be reproduced. Companies claiming SOTA performance based on cherry-picked runs, undisclosed infrastructure, and small sample sizes.
So we built an answer: an open-source benchmark suite that tests every major cloud browser provider under identical conditions.
What makes it different❓
1,000 runs per provider
Same AWS infrastructure, same test, same Playwright version
Full VM metadata published (region, instance type, RTT)
Median, P90, P95 — not just "best case"
Error rates and failure breakdowns included
Cost per session calculated from real pricing
MIT licensed. Clone it and run it yourself.
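To make the "median, P90, P95" reporting concrete, here is a minimal sketch of percentile computation over run timings using the nearest-rank method. This is an illustration under stated assumptions, not the Arena's actual reporting code:

```typescript
// Compute a percentile from a list of run timings (milliseconds).
// Nearest-rank method: real benchmark harnesses may interpolate instead.
function percentile(timings: number[], p: number): number {
  if (timings.length === 0) throw new Error("no samples");
  const sorted = [...timings].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-indexed rank
  return sorted[Math.min(rank, sorted.length) - 1];
}

// Summarize the way the leaderboard does: median, P90, P95 rather than best case.
function summarize(timings: number[]) {
  return {
    median: percentile(timings, 50),
    p90: percentile(timings, 90),
    p95: percentile(timings, 95),
  };
}

// Example: 10 synthetic session timings with two slow outliers.
const runs = [120, 135, 128, 410, 122, 119, 131, 125, 980, 127];
console.log(summarize(runs)); // → { median: 127, p90: 410, p95: 980 }
```

Note how the tail percentiles surface the two outlier runs that a "best case" number would hide, which is exactly why the leaderboard reports them.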
The benchmark ⚡
Minimal session lifecycle (create → connect → navigate → release). Tests both sequential and concurrent execution (up to 16 parallel sessions).
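That lifecycle can be sketched roughly as below. `ProviderClient` and its mock are hypothetical stand-ins for each vendor's API and for Playwright's CDP connection; the real harness lives in the open-source repo:

```typescript
// One benchmark run: create → connect → navigate → release, each phase timed.
// ProviderClient is a hypothetical interface standing in for each vendor's API.
interface ProviderClient {
  createSession(): Promise<{ id: string; cdpUrl: string }>;
  connect(cdpUrl: string): Promise<void>; // stands in for chromium.connectOverCDP
  navigate(url: string): Promise<void>;   // stands in for page.goto
  release(id: string): Promise<void>;
}

// Time a phase and record its duration, even if the phase throws.
async function timed<T>(
  phases: Record<string, number>, name: string, fn: () => Promise<T>
): Promise<T> {
  const t0 = performance.now();
  try { return await fn(); } finally { phases[name] = performance.now() - t0; }
}

async function runOnce(client: ProviderClient): Promise<Record<string, number>> {
  const phases: Record<string, number> = {};
  const session = await timed(phases, "create", () => client.createSession());
  try {
    await timed(phases, "connect", () => client.connect(session.cdpUrl));
    await timed(phases, "navigate", () => client.navigate("https://example.com"));
  } finally {
    // Release is guaranteed even if connect or navigate fails.
    await timed(phases, "release", () => client.release(session.id));
  }
  return phases;
}

// Mock provider so the sketch runs without credentials.
const mock: ProviderClient = {
  createSession: async () => ({ id: "s1", cdpUrl: "ws://mock" }),
  connect: async () => {},
  navigate: async () => {},
  release: async () => {},
};

runOnce(mock).then((p) => console.log(Object.keys(p)));
```

Each phase timing feeds the per-provider percentile stats, so provider overhead is attributed to the exact lifecycle step that caused it.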
Notte performs well in the results. But we didn't build this to win (we don't in every case); we built it so the results mean something. If another provider is faster, the leaderboard shows it. That's the point.
Run it on your own infra and tell us if your numbers differ.
Would love your feedback on what benchmarks you'd want to see next! 🌸
@samatnotte Congrats on the launch, Sam! This is such a helpful tool for developers trying to pick the right browser infra without the marketing noise. Making it open-source and reproducible is a huge win for the community.
I have a quick question about the testing environment: I noticed you're using AWS t3.micro for the runs. Do you find that the CPU performance stays consistent over 1,000 runs, or does the 'burst credit' system on those smaller instances affect the final results at all?
Also, I'd love to see how these providers handle more complex, 'heavy' websites in the future. Thanks for sharing the MIT code—definitely going to check the repo!
Notte
@dora_lin2 Thanks Dora! The t3.micro burst credit concern is valid. A few things keep the results reliable:
1. The runner's workload is I/O-bound, not CPU-bound. The t3.micro is just sending API calls and relaying CDP messages. Actual browser rendering and JS execution happens on each provider's own infrastructure. Even if burst credits deplete, the baseline CPU is more than enough for what the runner does, so timings aren't affected.
2. Ten warm-up runs happen before measurement begins and are discarded from the output. This isn't about burst credits specifically; it's to get past cold-start effects (JIT warm-up, connection pools, DNS caches) before any numbers are recorded.
3. The instance type is logged in every result set via AWS IMDS, so runs are always self-documenting and comparable across different hardware.
On heavier sites, totally agree, that's on the roadmap. Real-world workloads with heavy JS, auth flows, and dynamic content would surface a different set of provider tradeoffs. Happy to hear what sites you'd find most useful :)
Love the focus on reproducibility. Btw do you think open benchmarks will become a norm in AI infra or will most companies still optimize for perception over truth?
Notte
@lak7 I think open benchmarks are inevitable but there will be a transition.
Right now most AI infra vendors publish numbers they control. A similar pattern played out with database vendors and TPC benchmarks: independent, reproducible tests eventually became the baseline expectation because buyers got burned too many times by vendor-controlled numbers.
I imagine AI infra is heading the same direction, just faster: when your infrastructure choice directly affects model output quality and cost at scale, clarity really matters. Devs (including us) are already skeptical and want to run the benchmark themselves on their own workloads before committing (and we often found we couldn't reproduce published numbers).
What happens when an agent session goes rogue mid-task? Curious if there are circuit breakers built into the session management layer or if the human has to manually kill it.
Notte
Great question. A few layers handle this automatically:
1. Session cleanup is guaranteed, every task runs inside a try/finally block, so the browser session is always released back to the provider even if the agent crashes or hangs mid-task.
2. All I/O has hard timeouts. Network calls (page fetches, recording downloads, browser creation) use AbortSignal timeouts ranging from 30s to 120s, so nothing can block forever at the infrastructure level.
3. The provider is responsible for session TTLs. Browser Arena releases sessions via the provider's API, but doesn't control what happens mid-task inside the provider. Each provider (Notte, Steel, Hyperbrowser) enforces its own session limits server-side.
So in practice: you don't need to manually kill anything. The session gets cleaned up regardless, and runaway execution gets caught either by the provider's own limits or by the I/O timeouts in the harness.
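Those first two layers (guaranteed try/finally cleanup plus AbortSignal timeouts) can be sketched like this. `withTimeout`, `releaseSession`, and `runTask` are illustrative names, not the harness's actual identifiers:

```typescript
// Race a task against an AbortSignal so it can't hang forever.
// AbortSignal.timeout() is built into Node 17.3+.
function withTimeout<T>(
  fn: (signal: AbortSignal) => Promise<T>, ms: number
): Promise<T> {
  const signal = AbortSignal.timeout(ms);
  return new Promise<T>((resolve, reject) => {
    signal.addEventListener("abort", () =>
      reject(new Error(`timed out after ${ms}ms`)));
    fn(signal).then(resolve, reject);
  });
}

const released: string[] = [];
// Illustrative cleanup; the real harness calls the provider's release API.
async function releaseSession(id: string) { released.push(id); }

async function runTask(
  id: string, task: (signal: AbortSignal) => Promise<void>
) {
  try {
    await withTimeout(task, 30_000); // hard upper bound on the task's I/O
  } finally {
    await releaseSession(id); // runs even if the task threw or timed out
  }
}

// Even a crashing task still releases its session:
runTask("s1", async () => { throw new Error("agent went rogue"); })
  .catch(() => console.log("task failed, released:", released.includes("s1")));
```

The finally block is what makes "you don't need to manually kill anything" true at the harness level: the release path runs on every exit, crash or not, while the AbortSignal bounds how long any single I/O step can stall.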
Simple Utm
Really interesting approach to benchmarking cloud browser providers. I have been evaluating a few of these services for running automated workflows and the lack of standardized benchmarks has made it difficult to compare them objectively. The fact that all results and code are public is a huge plus. Do you plan to add latency benchmarks for dynamic page interactions, or is the focus mainly on static page loads and rendering?
Notte
@najmuzzaman Thanks! Yes, dynamic interaction benchmarks are definitely on the roadmap, that's actually the next thing we're working on :)
The current Hello Browser benchmark is intentionally minimal: session create → CDP connect → page.goto → release. It measures the session lifecycle and the control/data plane overhead of each provider, which is the baseline every automation workflow pays on every run. We wanted to nail that first so the numbers are clean and directly comparable.
The next benchmark focuses on realistic multi-step workflows, data extraction with scrolling and lazy content, multi-page crawls, and form fills with submission/verification. That's where provider differences in CDP throughput, network stability, and headful vs. headless rendering really show up, and it's closer to what most people actually use cloud browsers for.
If there are specific interaction patterns you'd find most useful (auth flows, file uploads, shadow DOM, etc.), let us know!
Wait, so you built a benchmark where your own product doesn't win 100% of the time? This level of honesty is illegal in Silicon Valley)
Notte
@kostfast black swan of Silicon Valley haha - in all seriousness, we felt browser infrastructure claims were being thrown around and found a lot of benchmarks extremely hard to reproduce. So the idea of Browser Arena is to make it easier for people to compare cloud browser solutions using fair, reproducible, open-source metrics.
Notte
Hey PH! It’s Lucas, CTO of Notte 👋
We built Browser Arena to make it easier for people to compare cloud browser solutions using fair, reproducible metrics.
Check it out and we’d love to hear what you think!
Notte
@giordano_lucas if anyone wants to run it themselves or has input on the methodology, we're interested :)
Notte
Let's go!
Notte
@ogandreakiro ☁️🚀