cto.new

Completely free AI code agent

5.0
1 review

511 followers

Code with the latest frontier models from Anthropic, OpenAI and more. No credit card or API keys required. Get started for free at https://cto.new/product-hunt
This is the 2nd launch from cto.new.
cto bench

The ground-truth code agent benchmark
Most AI benchmarks are built backwards. Someone sits down, dreams up hard problems, and then measures how well agents solve them. The results are interesting, sure. But they don't always tell you what matters: how agents perform on the actual work that's sitting in your queue. That's why we built cto bench. Instead of hypothetical tasks, we're building our benchmark from real work. Every data point on cto bench comes directly from how cto.new users are actually using our platform.
Free

Michael Ludden
I'm excited to share that cto bench is live. It's a benchmarking tool that tests the latest and greatest frontier models against real-world usage by cto.new users. Many benchmarking tools run LLMs through custom suites to assess viability, but cto bench uses actual usage patterns and PR merge rates to verify how well models perform on real tasks. We hope this adds valuable, practical data points to the LLM benchmarking space as it evolves.
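For readers curious what a merge-rate style metric looks like in practice, here is a minimal sketch of how per-model PR merge rates could be aggregated. The field names (model, pr_merged) and the sample records are illustrative assumptions, not cto bench's actual schema or pipeline.

```python
# Hypothetical sketch: aggregate the fraction of agent-authored PRs that were
# merged, grouped by model. Field names and data are illustrative only.
from collections import defaultdict

def merge_rate_by_model(task_records):
    """Return {model_name: fraction of PRs that were merged} per model."""
    merged = defaultdict(int)
    total = defaultdict(int)
    for record in task_records:
        total[record["model"]] += 1
        if record["pr_merged"]:
            merged[record["model"]] += 1
    return {model: merged[model] / total[model] for model in total}

# Example usage with made-up records:
records = [
    {"model": "model-a", "pr_merged": True},
    {"model": "model-a", "pr_merged": False},
    {"model": "model-b", "pr_merged": True},
]
print(merge_rate_by_model(records))  # {'model-a': 0.5, 'model-b': 1.0}
```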
Maklyen May

Finally, a benchmark that measures usefulness instead of academic cleverness. This feels much closer to how teams actually decide whether an agent is worth adopting.

Michael Ludden

@maklyen_may thanks! Interesting that OSS models are so high up the list for practical use, eh?

Anton Loss

Wow, this is amazing! All the best models for free! 🚀

How can this be sustainable for you?

Michael Ludden

@avloss great question! We're still working on that. What would you recommend?

Anton Loss

@michael_ludden 

Some ideas:

  • Provide additional services for a fee, like domains, hosting, monitoring, promotion/ads, and databases.

  • Charge for organisational use and/or for dedicated deployment.

  • Charge for additional features, like a human reviewing and solving a problem when the LLM gets stuck.

  • Use collected data to train proprietary models, then sell those.

Michael Ludden

@avloss love it! 🙏

ElevenApril

This is a really refreshing take on benchmarks 👀

Grounding it in real work instead of synthetic tasks feels way more honest — as a builder, that’s the kind of signal I actually trust. Love the “built from usage” philosophy. Congrats on the launch! 🚀

Curious how you’re thinking about bias over time — do you plan to balance workloads or surface context around where the data comes from?

Michael Ludden

@elevenapril can you expand on the question a bit more? Not sure what you're asking.

Mykyta Semenov 🇺🇦🇳🇱

Awesome! Very useful!