cto bench - The ground-truth code agent benchmark
Most AI benchmarks are built backwards. Someone sits down, dreams up hard problems, and then measures how well agents solve them. The results are interesting, sure. But they don't always tell you what matters: how agents perform on the actual work that's sitting in your queue.
That's why we built cto.bench.
Instead of hypothetical tasks, we're building our benchmark from real work. Every data point on cto bench comes directly from how cto.new users are actually using our platform.



Replies
cto.new
TrackerJam
Finally, a benchmark that measures usefulness instead of academic cleverness. This feels much closer to how teams actually decide whether an agent is worth adopting.
cto.new
@maklyen_may thanks! Interesting that OSS models are so high up the list for practical use, eh?
DeepTagger
Wow, this is amazing! All the best models for free!
How can this be sustainable for you?
cto.new
@avloss great question! We're still working on that. What would you recommend?
DeepTagger
@michael_ludden
Some ideas:
Provide additional services for a fee, like domains, hosting, monitoring, promotion/ads, databases.
Charge for organisational use and/or for dedicated deployment.
Charge for additional features, like a human reviewing and solving a problem when the LLM is stuck.
Use collected data to train proprietary models, then sell those.
cto.new
@avloss love it!
This is a really refreshing take on benchmarks.
Grounding it in real work instead of synthetic tasks feels way more honest; as a builder, that's the kind of signal I actually trust. Love the "built from usage" philosophy. Congrats on the launch!
Curious how you're thinking about bias over time: do you plan to balance workloads or surface context around where the data comes from?
cto.new
@elevenapril can you expand on the question a bit more? Not sure what you're asking.
Awesome! Very useful!