Abhishek Saikia

APIEval-20 - An open benchmark for AI agents that test APIs

APIEval-20 is a black-box benchmark for API testing agents. Each agent gets only a JSON schema and one sample payload, then generates a test suite. We run those tests against live reference APIs with planted bugs and score bug detection, API coverage, and efficiency. Unlike LLM-as-judge evals, scoring is fully objective: a bug is either caught or it isn’t. Tasks span auth, errors, pagination, schemas, and multi-step flows. Open on Hugging Face.
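
To make the setup concrete, here is a minimal sketch of what a task input might look like; the field names and structure are illustrative assumptions, not the benchmark's actual format.

```python
# Hypothetical sketch of an APIEval-20 task input: the agent sees only a
# JSON schema and one sample payload. All names here are assumptions,
# not the benchmark's real format.
task = {
    "endpoint": "POST /v1/orders",
    "schema": {
        "type": "object",
        "required": ["item_id", "quantity"],
        "properties": {
            "item_id": {"type": "string"},
            "quantity": {"type": "integer", "minimum": 1},
            "coupon": {"type": "string"},
        },
    },
    "sample_payload": {"item_id": "sku-123", "quantity": 2},
}
# From this alone, the agent must generate a test suite; the tests are then
# executed against a live reference API seeded with planted bugs.
```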

Abhishek Saikia
Hey Product Hunt, I’m Abhishek, CEO of KushoAI. We built APIEval-20 because API testing is now a common claim across AI agents, but there was no reliable way to verify it.

The evaluations we found usually had one of three gaps: they assumed source code access, depended on detailed documentation, or checked whether the output looked valid instead of measuring actual bugs found. That felt far from how most teams test APIs in practice.

So we built a black-box benchmark. Schema and payload in. Nothing else. The agent generates a test suite. We run those tests against live reference APIs with planted bugs. The score comes from what the agent actually catches: bug detection, API coverage, and efficiency. No LLM judges. No subjective calls. A bug is either caught or missed.

The part I’m most proud of is the complexity taxonomy. Sending nulls to every field is easy. The real test is whether an agent can reason about field relationships, auth behavior, pagination, error handling, schema constraints, and multi-step flows. That is where stronger agents start to separate from weaker ones.

APIEval-20 is open on Hugging Face. We are also putting together a leaderboard comparing major AI agents in a separate research report. If you run your agent on the benchmark before then, we would love to include your results.

Two questions for the community:

1. What domains or API patterns should we add next?

2. If you are building a testing tool or agent, would you want your results included in the leaderboard?

I’ll be here all day. Drop a comment or reach us at hello@kusho.ai
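
To make the scoring concrete, here is a rough sketch of the headline metrics. The set-based definitions below are a simplification for illustration, not the exact published formulas.

```python
# Rough sketch of the three headline metrics; these set-based definitions
# are an illustrative simplification, not the exact published scoring.

def score_run(planted_bugs: set, caught_bugs: set,
              endpoints: set, tested_endpoints: set, num_tests: int) -> dict:
    bug_detection = len(caught_bugs & planted_bugs) / len(planted_bugs)
    coverage = len(tested_endpoints & endpoints) / len(endpoints)
    # One possible efficiency definition: bugs caught per test issued.
    efficiency = len(caught_bugs & planted_bugs) / max(num_tests, 1)
    return {"bug_detection": bug_detection,
            "coverage": coverage,
            "efficiency": efficiency}

# Example: 2 of 3 planted bugs caught, full endpoint coverage, 20 tests.
print(score_run({"b1", "b2", "b3"}, {"b1", "b3"},
                {"/orders", "/users"}, {"/orders", "/users"}, 20))
```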
Hiurich G.

@abhishek_saikia Great product! Since you focus on bug detection and API coverage without source code access, how does KushoAI handle complex, state-dependent edge cases that require a specific sequence of API calls to trigger?

Lakshminath Reddy Dondeti
Nice. I thought LLM-as-judge is what we need in some cases. Do you have a classifier to pick one versus the other?
Abhishek Saikia

@lakshminath_dondeti Lakshminath, I agree. LLM-as-judge is useful when the output needs semantic evaluation, like judging reasoning quality, intent coverage, or whether a generated explanation is useful.

For API testing, we tried to keep the core scoring executable wherever possible. If the generated test catches the planted bug, it scores. If it does not, it does not. That removes a lot of ambiguity.
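
In harness terms, it looks roughly like this; the test format and runner below are assumptions for illustration, not our actual harness API:

```python
import requests  # standard HTTP client; the real harness isn't published

def run_test(test: dict, base_url: str) -> bool:
    """Execute one generated test case; True means it passed.
    The test format (method/path/payload/expected_status) is an assumption."""
    resp = requests.request(test["method"], base_url + test["path"],
                            json=test.get("payload"))
    return resp.status_code == test["expected_status"]

def bug_caught(generated_tests: list, buggy_api_url: str) -> bool:
    # The planted bug counts as caught iff at least one generated test
    # fails against the reference API that contains the bug.
    return any(not run_test(t, buggy_api_url) for t in generated_tests)
```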

We don’t have a classifier for choosing eval type yet, but the rough rule we use is:

  • If the outcome can be executed or verified deterministically, avoid LLM-as-judge.

  • If the outcome needs human-like interpretation, use LLM-as-judge carefully with rubrics and calibration.

Lakshminath Reddy Dondeti
@abhishek_saikia makes sense
David Solsona

Really like the black-box setup. Feels much closer to how teams actually test APIs than benchmarks that assume source code access. Curious how you’re thinking about the planted bugs: do auth, pagination, schema issues, multi-step flows, etc. all count the same, or are you planning to weight them by severity/commonness?

Abhishek Saikia

@davidsolsonap David, great question.

For v1, we are keeping the core score simple and objective: did the agent catch the planted bug or not. That makes the benchmark easier to reproduce and harder to game.

But we don’t think all failures are equal in practice. An auth bypass, a broken multi-step flow, and a minor schema edge case should not carry the same business impact.

So the plan for the leaderboard/report is to show both:

  • An unweighted objective score for comparability

  • A breakdown by bug class, and potentially severity/commonness as a second lens

I think the breakdown matters as much as the overall score. Two agents can look close on aggregate but be very different in where they fail.
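
In sketch form, the two lenses might look like this; the bug classes and severity weights are placeholders, not decided values:

```python
# Placeholder bug classes and severity weights; nothing here is decided.
results = {  # bug_id -> (bug_class, caught?)
    "b1": ("auth", True),
    "b2": ("pagination", False),
    "b3": ("schema", True),
}
severity = {"auth": 3.0, "pagination": 1.5, "schema": 1.0}  # hypothetical

# Lens 1: unweighted objective score, for comparability.
unweighted = sum(caught for _, caught in results.values()) / len(results)

# Lens 2: per-class breakdown, optionally severity-weighted as a second view.
by_class: dict = {}
for cls, caught in results.values():
    hit, total = by_class.get(cls, (0, 0))
    by_class[cls] = (hit + caught, total + 1)

weighted = (sum(severity[c] * h for c, (h, _) in by_class.items())
            / sum(severity[c] * t for c, (_, t) in by_class.items()))

print(unweighted, by_class, weighted)
```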

Karim Ben

Do you publish per-bug breakdowns so people can see exactly what types of failures each agent misses?

Abhishek Saikia

@karimbenkeroum Karim, yes, that is part of the plan for the leaderboard.

We want the breakdown to go beyond one aggregate score and show which types of failures each agent catches or misses, across auth, schema constraints, pagination, error handling, field relationships, and multi-step flows.

That is where the benchmark becomes more useful, because two agents can have similar overall scores but fail in very different ways.

Vinamra Yadav

The black-box scoring is the right call; I've been skeptical of LLM-as-judge for anything that has an objective answer. Curious about the multi-step flows though: if a bug only shows up at step 3, does the agent get credit for catching it, or does it need to find it proactively from the schema alone?

Natalia Iankovych

Can you compare Figma and the finished frontend page by page? Testing code with AI is fairly straightforward, but for testing design I still haven’t seen a high-quality solution.