agentrial - Run your AI agent 20x. Get confidence intervals, not vibes.
Your AI agent passed the test. But would it pass again? LLMs are non-deterministic: the same task can fail 30% of the time on the next run.
agentrial runs each test case N times and gives you confidence intervals instead of a single pass/fail: Wilson intervals on pass rates, failure attribution via Fisher's exact test, real API cost tracking, and CI/CD regression detection.
Works with LangGraph, CrewAI, AutoGen, the OpenAI Agents SDK, or any Python callable. YAML config, MIT license.
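To make the statistics concrete, here's a minimal, self-contained sketch of the core idea: run any Python callable N times and summarize the pass rate with a Wilson score interval. This is the textbook formula, not agentrial's actual code; my_agent and check are hypothetical stand-ins.

    import math

    def wilson_ci(passes, runs, z=1.96):
        # 95% Wilson score interval for a pass rate of passes/runs.
        if runs == 0:
            return (0.0, 1.0)
        p = passes / runs
        denom = 1 + z**2 / runs
        center = (p + z**2 / (2 * runs)) / denom
        half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
        return (max(0.0, center - half), min(1.0, center + half))

    # Hypothetical harness: my_agent is any Python callable,
    # check scores its output as True/False.
    def run_n_times(my_agent, check, task, n=20):
        passes = sum(check(my_agent(task)) for _ in range(n))
        return passes, wilson_ci(passes, n)

    # 14/20 passes looks like "70%", but the 95% CI is roughly (0.48, 0.85).
    print(wilson_ci(14, 20))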


Replies
Hey everyone! I built agentrial because I was going crazy with my LangGraph agents randomly failing on the same prompts.
The core insight: treat agent evaluation as a statistical problem, not a deterministic one. Every test runs N times, so you get confidence intervals, not "it passed once." When something fails, a Fisher exact test flags which step is the bottleneck.
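To illustrate the attribution idea (conceptually; this isn't agentrial's internals): for each agent step, tally a 2x2 table of step failure vs. overall run failure across the N runs, then test for association. A sketch using scipy, with made-up counts:

    from scipy.stats import fisher_exact

    # Made-up counts from 20 runs of one test case:
    #                        run failed   run passed
    # "search" step failed        6            1
    # "search" step ok            1           12
    table = [[6, 1], [1, 12]]
    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio={odds_ratio:.1f}, p={p_value:.4f}")
    # A small p-value marks the "search" step as strongly
    # associated with run failure.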
v0.2.0 just shipped with:
- 438 tests, 15 CLI commands
- 6 framework adapters + OpenTelemetry
- Agent Reliability Score (0-100 composite metric)
- VS Code extension (live on Marketplace)
- MCP security scanner for 6 vulnerability classes
- Production drift detection (CUSUM, Page-Hinkley, KS test; sketch after this list)
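For the drift detection, here's a bare-bones one-sided CUSUM in its textbook form (agentrial's actual detector and parameters may differ): accumulate how far each batch's pass rate falls below a target, and alarm when the sum crosses a threshold.

    def cusum_alarm(rates, target, k=0.05, h=0.3):
        # One-sided CUSUM: alarm when cumulative shortfall below
        # target exceeds h. k is the slack per observation.
        s = 0.0
        for i, x in enumerate(rates):
            s = max(0.0, s + (target - x) - k)
            if s > h:
                return i  # index of the batch that triggered the alarm
        return None

    # Pass rates per batch: stable near 0.9, then degrading.
    rates = [0.90, 0.92, 0.88, 0.91, 0.75, 0.70, 0.72, 0.68]
    print(cusum_alarm(rates, target=0.90))  # alarms at index 6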
It's free and local-first: your prompts and data never leave your machine.
Would love your feedback, especially on what metrics matter most for your agent workflows.