Launching today

agentrial
Run your AI agent 20x. Get confidence intervals, not vibes.
Your AI agent passed the test. But would it pass again? LLMs are non-deterministic: the same task can fail 30% of the time on the next run. agentrial runs each test case N times and gives you confidence intervals instead of pass/fail. Wilson CI on pass rates, failure attribution via Fisher exact test, real API cost tracking, CI/CD regression detection. Works with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, any Python callable. YAML config, MIT license.
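To make "confidence intervals instead of pass/fail" concrete, here is a minimal Python sketch of the Wilson score interval on a pass rate. This is illustrative only, not agentrial's API; the function name and run counts are hypothetical.

```python
# Minimal sketch of a Wilson score interval for a pass rate (illustrative,
# not agentrial's API): the statistic behind "confidence intervals, not vibes".
from math import sqrt

def wilson_ci(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for the true pass probability."""
    if runs == 0:
        return (0.0, 1.0)
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# e.g. 14 passes out of 20 runs: point estimate 0.70, interval roughly (0.48, 0.85)
print(wilson_ci(14, 20))
```

The Wilson interval stays sensible at small N, which matters when every extra run costs real API calls.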

Hey everyone! I built agentrial because I was going crazy with my LangGraph agents randomly failing on the same prompts.
The core insight: treat agent evaluation as a statistical problem, not a deterministic one. Every test runs N times, so you get confidence intervals instead of "it passed once." When something fails, a Fisher exact test tells you which step is the bottleneck.
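As a rough illustration of step-level attribution with a Fisher exact test (not agentrial's internals; the step names and counts below are hypothetical), each step's failures can be compared against the rest of the pipeline in a 2x2 contingency table:

```python
# Sketch of failure attribution via Fisher exact test: for each step, compare
# its (failures, successes) against all other steps combined.
from scipy.stats import fisher_exact

# Hypothetical counts from 20 runs: (failures, successes) per step.
step_counts = {
    "retrieve": (2, 18),
    "plan": (9, 11),
    "execute": (1, 19),
}

total_fail = sum(f for f, _ in step_counts.values())
total_ok = sum(s for _, s in step_counts.values())

for name, (fail, ok) in step_counts.items():
    # 2x2 table: this step vs. the rest, failures vs. successes.
    table = [[fail, ok], [total_fail - fail, total_ok - ok]]
    _, p_value = fisher_exact(table)
    print(f"{name}: p = {p_value:.3f}")
```

A small p-value for a single step suggests its failures are overrepresented relative to the other steps.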
v0.2.0 just shipped with:
- 438 tests, 15 CLI commands
- 6 framework adapters + OpenTelemetry
- Agent Reliability Score (0-100 composite metric)
- VS Code extension (live on Marketplace)
- MCP security scanner for 6 vulnerability classes
- Production drift detection (CUSUM, Page-Hinkley, KS test; see the sketch after this list)
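As a sketch of the drift-detection idea (illustrative only, not agentrial's implementation; the scores below are hypothetical), a two-sample Kolmogorov-Smirnov test can compare a baseline window of a scored metric against a recent window:

```python
# Illustrative drift check: two-sample KS test on a per-run metric,
# e.g. judge scores or pass indicators aggregated per window.
from scipy.stats import ks_2samp

baseline_scores = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.94]  # hypothetical
recent_scores = [0.71, 0.69, 0.75, 0.80, 0.68, 0.73, 0.77, 0.70]    # hypothetical

stat, p_value = ks_2samp(baseline_scores, recent_scores)
if p_value < 0.05:
    print(f"Drift detected (KS statistic {stat:.2f}, p = {p_value:.3f})")
```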
It's free and local-first: your prompts and data never leave your machine.
Would love your feedback, especially on what metrics matter most for your agent workflows.