Launching today
agentrial

Run your AI agent 20x. Get confidence intervals, not vibes.

Your AI agent passed the test. But would it pass again? LLMs are non-deterministic: a task that passed once can fail 30% of the time on reruns. agentrial runs each test case N times and gives you confidence intervals instead of a single pass/fail. Wilson CI on pass rates, failure attribution via Fisher exact test, real API cost tracking, CI/CD regression detection. Works with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, or any Python callable. YAML config, MIT license.
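For a sense of what "confidence intervals, not vibes" means in practice, here is a minimal, self-contained sketch of the Wilson score interval on a pass rate. It is plain Python only; the function name and the numbers are illustrative and not part of agentrial's API.

```python
# Minimal sketch: the Wilson score interval behind "confidence intervals on pass rates".
# Plain Python, no agentrial imports; names here are illustrative, not the tool's own.
from math import sqrt

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate of `passes` out of `runs`."""
    if runs == 0:
        return (0.0, 1.0)
    p_hat = passes / runs
    denom = 1 + z**2 / runs
    center = (p_hat + z**2 / (2 * runs)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# 17 passes out of 20 runs gives roughly (0.64, 0.95): an "85% pass rate"
# measured once is compatible with anything from solid to flaky.
print(wilson_interval(17, 20))
```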
Free

Alessandro Potenza
Maker

Hey everyone! I built agentrial because I was going crazy with my LangGraph agents randomly failing on the same prompts.

The core insight: treating agent evaluation as a statistical problem, not a deterministic one. Every test runs N times. You get confidence intervals, not "it passed once." When something fails, a Fisher exact test tells you which step is to blame.
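For readers wondering what failure attribution with a Fisher exact test can look like, here is one plausible sketch using SciPy: for each agent step, cross-tabulate whether the step errored against whether the run failed, and test for association. The counts and table layout are hypothetical; agentrial's internal implementation may differ.

```python
# One plausible reading of "failure attribution via Fisher exact test"
# (illustrative only, not agentrial's internal code): per step, check whether
# step errors are disproportionately concentrated in failed runs.
from scipy.stats import fisher_exact

# Hypothetical counts from 20 runs of the same test case.
step_error_in_failed_runs = 6   # runs that failed AND this step errored
step_ok_in_failed_runs = 1      # runs that failed but this step was fine
step_error_in_passed_runs = 1   # runs that passed despite a step error
step_ok_in_passed_runs = 12     # runs that passed with the step fine

table = [
    [step_error_in_failed_runs, step_ok_in_failed_runs],
    [step_error_in_passed_runs, step_ok_in_passed_runs],
]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio={odds_ratio:.1f}, p={p_value:.3f}")
# A small p-value flags this step as strongly associated with run failures.
```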

v0.2.0 just shipped with:

→ 438 tests, 15 CLI commands

→ 6 framework adapters + OpenTelemetry

→ Agent Reliability Score (0-100 composite metric)

→ VS Code extension (live on Marketplace)

→ MCP security scanner for 6 vulnerability classes

→ Production drift detection (CUSUM, Page-Hinkley, KS test; illustrated below)
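As a rough illustration of the drift-detection bullet above, here is a textbook one-sided CUSUM detector running over a stream of per-run quality scores. The parameters and data are made up for the example; agentrial's own detectors and thresholds may look quite different.

```python
# Textbook one-sided CUSUM, included only to show the kind of drift check
# meant above; not agentrial's implementation. We watch per-run scores and
# alarm once they drift downward from a baseline mean.
def cusum_downward(scores, baseline_mean, slack=0.05, threshold=0.5):
    """Yield (index, alarm) pairs; alarm=True once cumulative drift exceeds threshold."""
    s = 0.0
    for i, x in enumerate(scores):
        # Accumulate only drops below the baseline beyond the slack allowance.
        s = max(0.0, s + (baseline_mean - x) - slack)
        yield i, s > threshold

stream = [0.9, 0.85, 0.9, 0.8, 0.7, 0.65, 0.6, 0.6]   # hypothetical run scores
for i, alarm in cusum_downward(stream, baseline_mean=0.88):
    if alarm:
        print(f"drift alarm at run {i}")
        break
```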

It's free and local-first: your prompts and data never leave your machine.

Would love your feedback, especially on what metrics matter most for your agent workflows.