Badges

3

Top 5 Launch
Tastemaker
Gone streaking
Gone streaking 5

Maker History

Forums

Abhishek Saikia

13d ago

APIEval-20 - An open benchmark for AI agents that test APIs

APIEval-20 is a black-box benchmark for API testing agents. Each agent gets only a JSON schema and one sample payload, then generates a test suite. We run those tests against live reference APIs with planted bugs and score bug detection, API coverage, and efficiency. Unlike LLM-as-judge evals, scoring is fully objective: a bug is either caught or it isn’t. Tasks span auth, errors, pagination, schemas, and multi-step flows. Open on Hugging Face.

Abhishek Saikia

21d ago

KushoAI for Playwright - Open-source Terminal UI, just record & get exhaustive tests

Open-source TUI for Playwright testing. Record your flow in the browser, then everything happens in your terminal. No tab-switching to ChatGPT/Claude, no copy-pasting, no manual context juggling. Bring your own API key (OpenAI, Claude, Gemini). Runs entirely local. Our LLM orchestration expands one recording into comprehensive test coverage - edge cases, error handling, boundary conditions - more efficiently than calling LLMs directly. Record, generate, run: all terminal-native. MIT licensed.

Abhishek Saikia

2mo ago

We couldn't find an open benchmark for AI-generated API tests, so we built one

Every API testing eval we found either required source code access, relied on rich documentation, or measured output format rather than whether a test would catch a real failure.

So we built APIEval-20. Twenty scenarios across e-commerce, payments, auth, scheduling, and user management. Each scenario gives a model exactly two things: a JSON schema and a sample payload. No implementation details, no docs, no further context. The model has to generate a test suite from that alone.
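To make the "schema plus one sample payload" input concrete, here is a minimal hand-written sketch of how a test generator could derive negative cases from those two inputs alone. It is not how APIEval-20 or any particular agent works; the function name `generate_negative_payloads`, the example schema, and the two mutation strategies (drop a required field, swap in a wrong-typed value) are all assumptions chosen for illustration.

```python
import copy

# Hypothetical wrong-typed replacement values, keyed by JSON Schema type name.
WRONG_TYPE_VALUES = {
    "string": 12345,
    "integer": "not-a-number",
    "number": "not-a-number",
    "boolean": "maybe",
}


def generate_negative_payloads(schema: dict, sample: dict) -> list:
    """Derive invalid payloads from a JSON schema and one valid sample payload."""
    cases = []

    # Strategy 1: remove each required field in turn.
    for field in schema.get("required", []):
        mutated = copy.deepcopy(sample)
        mutated.pop(field, None)
        cases.append({"name": f"missing_{field}", "payload": mutated})

    # Strategy 2: replace each typed field with a value of the wrong type.
    for field, spec in schema.get("properties", {}).items():
        field_type = spec.get("type")
        if field_type in WRONG_TYPE_VALUES and field in sample:
            mutated = copy.deepcopy(sample)
            mutated[field] = WRONG_TYPE_VALUES[field_type]
            cases.append({"name": f"wrong_type_{field}", "payload": mutated})

    return cases


# Illustrative inputs (not taken from the benchmark).
example_schema = {
    "type": "object",
    "properties": {
        "email": {"type": "string"},
        "quantity": {"type": "integer"},
    },
    "required": ["email", "quantity"],
}
example_sample = {"email": "a@example.com", "quantity": 2}

example_cases = generate_negative_payloads(example_schema, example_sample)
```

An agent under evaluation would presumably go well beyond these two mutations (boundary values, auth, multi-step flows), but the shape is the same: nothing but the schema and the sample goes in, and a list of test cases comes out.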

The bugs are planted in live reference implementations. A bug counts as caught only if a generated test, when run against the implementation, produces a response that deviates from correct behavior. Submit through the hosted eval harness and get a score back.
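The "caught or not" rule can be sketched with in-memory handlers standing in for the live APIs. The post does not say how the harness determines correct behavior, so this sketch assumes a bug-free handler is available for comparison; the handlers `correct_create_order` and `buggy_create_order`, the planted bug, and the `bug_caught` helper are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, Dict]          # (HTTP-style status code, JSON body)
Handler = Callable[[Dict], Response]


def correct_create_order(payload: Dict) -> Response:
    """Reference behavior: reject non-positive quantities."""
    if payload.get("quantity", 0) <= 0:
        return 422, {"error": "quantity must be positive"}
    return 201, {"quantity": payload["quantity"]}


def buggy_create_order(payload: Dict) -> Response:
    """Same endpoint with a planted bug: the quantity check is missing."""
    return 201, {"quantity": payload.get("quantity")}


def bug_caught(test_payloads: List[Dict], correct: Handler, buggy: Handler) -> bool:
    """A bug is caught if at least one test payload makes the two responses differ."""
    return any(correct(p) != buggy(p) for p in test_payloads)


# A happy-path-only suite never triggers the planted bug.
happy_path_suite = [{"quantity": 2}]
# Adding a boundary value exposes it.
boundary_suite = [{"quantity": 2}, {"quantity": 0}]

happy_result = bug_caught(happy_path_suite, correct_create_order, buggy_create_order)
boundary_result = bug_caught(boundary_suite, correct_create_order, buggy_create_order)
```

Under these assumptions the happy-path suite scores no detection while the boundary suite does, which is the binary outcome the benchmark relies on instead of a judged rating.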

Scoring weights bug detection at 70%, API surface coverage at 20%, and test efficiency at 10%.
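As a worked example of the weighting, the sketch below combines three component scores using the 70/20/10 split stated above. The post does not describe how each component is normalized or whether the combination is strictly linear, so this assumes each component is already a fraction in [0, 1] and that the final score is a plain weighted sum; `composite_score` is a hypothetical name.

```python
# Weights taken from the post; everything else here is an assumption.
WEIGHTS = {"bug_detection": 0.70, "coverage": 0.20, "efficiency": 0.10}


def composite_score(bug_detection: float, coverage: float, efficiency: float) -> float:
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""
    components = {
        "bug_detection": bug_detection,
        "coverage": coverage,
        "efficiency": efficiency,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(WEIGHTS[name] * value for name, value in components.items())


# Example: half the bugs caught, full coverage, zero efficiency credit.
example_score = composite_score(bug_detection=0.5, coverage=1.0, efficiency=0.0)
```

With these inputs the weighted sum is 0.70 × 0.5 + 0.20 × 1.0 + 0.10 × 0.0 = 0.55, which shows how heavily bug detection dominates: full coverage with no bugs caught would top out at 0.30 even with perfect efficiency.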
