Abhishek Saikia

APIEval-20 - An open benchmark for AI agents that test APIs

APIEval-20 is a black-box benchmark for API testing agents. Each agent gets only a JSON schema and one sample payload, then generates a test suite. We run those tests against live reference APIs with planted bugs and score bug detection, API coverage, and efficiency. Unlike LLM-as-judge evals, scoring is fully objective: a bug is either caught or it isn’t. Tasks span auth, errors, pagination, schemas, and multi-step flows. Open on Hugging Face.
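
To make the setup concrete, here is a minimal sketch of what a task input might look like; the field names and structure are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical sketch of an APIEval-20 task input: the agent sees only a
# JSON schema and one sample payload. All names here are assumptions,
# not the benchmark's real format.
task = {
    "endpoint": "POST /v1/orders",
    "schema": {
        "type": "object",
        "required": ["item_id", "quantity"],
        "properties": {
            "item_id": {"type": "string"},
            "quantity": {"type": "integer", "minimum": 1},
            "coupon": {"type": "string"},
        },
    },
    "sample_payload": {"item_id": "sku-123", "quantity": 2},
}
# From this alone, the agent must generate a test suite; the tests are then
# executed against a live reference API seeded with planted bugs.
```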

Abhishek Saikia
Hey Product Hunt, I’m Abhishek, CEO of KushoAI. We built APIEval-20 because API testing is now a common claim across AI agents, but there was no reliable way to verify it.

The evaluations we found usually had one of three gaps: they assumed source code access, depended on detailed documentation, or checked whether the output looked valid instead of measuring actual bugs found. That felt far from how most teams test APIs in practice.

So we built a black-box benchmark. Schema and payload in. Nothing else. The agent generates a test suite. We run those tests against live reference APIs with planted bugs. The score comes from what the agent actually catches: bug detection, API coverage, and efficiency. No LLM judges. No subjective calls. A bug is either caught or missed.

The part I’m most proud of is the complexity taxonomy. Sending nulls to every field is easy. The real test is whether an agent can reason about field relationships, auth behavior, pagination, error handling, schema constraints, and multi-step flows. That is where stronger agents start to separate from weaker ones.

APIEval-20 is open on Hugging Face. We are also putting together a leaderboard comparing major AI agents in a separate research report. If you run your agent on the benchmark before then, we would love to include your results.

Two questions for the community:

1. What domains or API patterns should we add next?

2. If you are building a testing tool or agent, would you want your results included in the leaderboard?

I’ll be here all day. Drop a comment or reach us at hello@kusho.ai
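
To make the scoring concrete, here is a rough sketch of the headline metrics. The set-based definitions below are a simplification for illustration, not the exact published formulas.

```python
# Rough sketch of the three headline metrics; these set-based definitions
# are an illustrative simplification, not the exact published scoring.

def score_run(planted_bugs: set, caught_bugs: set,
              endpoints: set, tested_endpoints: set, num_tests: int) -> dict:
    bug_detection = len(caught_bugs & planted_bugs) / len(planted_bugs)
    coverage = len(tested_endpoints & endpoints) / len(endpoints)
    # One possible efficiency definition: bugs caught per test issued.
    efficiency = len(caught_bugs & planted_bugs) / max(num_tests, 1)
    return {"bug_detection": bug_detection,
            "coverage": coverage,
            "efficiency": efficiency}

# Example: 2 of 3 planted bugs caught, full endpoint coverage, 20 tests.
print(score_run({"b1", "b2", "b3"}, {"b1", "b3"},
                {"/orders", "/users"}, {"/orders", "/users"}, 20))
```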
Hiurich G.

@abhishek_saikia Great product! Since you focus on bug detection and API coverage without source code access, how does KushoAI handle complex, state-dependent edge cases that require a specific sequence of API calls to trigger?

Lakshminath Reddy Dondeti
Nice. I thought LLM-as-judge is what we need in some cases. Do you have a classifier to pick one versus the other?
Abhishek Saikia

@lakshminath_dondeti Lakshminath, I agree. LLM-as-judge is useful when the output needs semantic evaluation, like judging reasoning quality, intent coverage, or whether a generated explanation is useful.

For API testing, we tried to keep the core scoring executable wherever possible. If the generated test catches the planted bug, it scores. If it does not, it does not. That removes a lot of ambiguity.
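
In harness terms, it looks roughly like this; the test format and runner below are assumptions for illustration, not our actual harness API:

```python
import requests  # standard HTTP client; the real harness isn't published

def run_test(test: dict, base_url: str) -> bool:
    """Execute one generated test case; True means it passed.
    The test format (method/path/payload/expected_status) is an assumption."""
    resp = requests.request(test["method"], base_url + test["path"],
                            json=test.get("payload"))
    return resp.status_code == test["expected_status"]

def bug_caught(generated_tests: list, buggy_api_url: str) -> bool:
    # The planted bug counts as caught iff at least one generated test
    # fails against the reference API that contains the bug.
    return any(not run_test(t, buggy_api_url) for t in generated_tests)
```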

We don’t have a classifier for choosing eval type yet, but the rough rule we use is:

  • If the outcome can be executed or verified deterministically, avoid LLM-as-judge.

  • If the outcome needs human-like interpretation, use LLM-as-judge carefully with rubrics and calibration.

Lakshminath Reddy Dondeti
@abhishek_saikia makes sense
David Solsona

Really like the black-box setup. Feels much closer to how teams actually test APIs than benchmarks that assume source code access. Curious how you’re thinking about the planted bugs: do auth, pagination, schema issues, multi-step flows, etc. all count the same, or are you planning to weight them by severity/commonness?

Abhishek Saikia

@davidsolsonap David, great question.

For v1, we are keeping the core score simple and objective: did the agent catch the planted bug or not. That makes the benchmark easier to reproduce and harder to game.

But we don’t think all failures are equal in practice. An auth bypass, a broken multi-step flow, and a minor schema edge case should not carry the same business impact.

So the plan for the leaderboard/report is to show both:

  • An unweighted objective score for comparability

  • A breakdown by bug class, and potentially severity/commonness as a second lens

I think the breakdown matters as much as the overall score. Two agents can look close on aggregate but be very different in where they fail.
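
In sketch form, the two lenses might look like this; the bug classes and severity weights are placeholders, not decided values:

```python
# Placeholder bug classes and severity weights; nothing here is decided.
results = {  # bug_id -> (bug_class, caught?)
    "b1": ("auth", True),
    "b2": ("pagination", False),
    "b3": ("schema", True),
}
severity = {"auth": 3.0, "pagination": 1.5, "schema": 1.0}  # hypothetical

# Lens 1: unweighted objective score, for comparability.
unweighted = sum(caught for _, caught in results.values()) / len(results)

# Lens 2: per-class breakdown, optionally severity-weighted as a second view.
by_class: dict = {}
for cls, caught in results.values():
    hit, total = by_class.get(cls, (0, 0))
    by_class[cls] = (hit + caught, total + 1)

weighted = (sum(severity[c] * h for c, (h, _) in by_class.items())
            / sum(severity[c] * t for c, (_, t) in by_class.items()))

print(unweighted, by_class, weighted)
```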

Karim Ben

Do you publish per-bug breakdowns so people can see exactly what types of failures each agent misses?

Abhishek Saikia

@karimbenkeroum Karim, yes, that is part of the plan for the leaderboard.

We want the breakdown to go beyond one aggregate score and show which types of failures each agent catches or misses, across auth, schema constraints, pagination, error handling, field relationships, and multi-step flows.

That is where the benchmark becomes more useful, because two agents can have similar overall scores but fail in very different ways.

Vinamra Yadav

The black-box scoring is the right call; I've been skeptical of LLM-as-judge for anything that has an objective answer. Curious about the multi-step flows though: if a bug only shows up at step 3, does the agent get credit for catching it, or does it need to find it proactively from the schema alone?

Natalia Iankovych

Can you compare Figma and the finished frontend page by page? Testing code with AI is fairly straightforward, but for testing design I still haven’t seen a high-quality solution.