agentrial - Run your AI agent 20x. Get confidence intervals, not vibes.
Your AI agent passed the test. But would it pass again? LLMs are non-deterministic: the same task can fail 30% of the time on the next run.
agentrial runs each test case N times and gives you confidence intervals instead of a single pass/fail: Wilson intervals on pass rates, failure attribution via Fisher's exact test, real API cost tracking, and CI/CD regression detection.
Works with LangGraph, CrewAI, AutoGen, the OpenAI Agents SDK, or any Python callable. YAML config, MIT license.
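To make the statistics concrete, here's a minimal, self-contained sketch of the core idea: run any Python callable N times and summarize the pass rate with a Wilson score interval. This is the textbook formula, not agentrial's actual code; my_agent and check are hypothetical stand-ins.

    import math

    def wilson_ci(passes, runs, z=1.96):
        # 95% Wilson score interval for a pass rate of passes/runs.
        if runs == 0:
            return (0.0, 1.0)
        p = passes / runs
        denom = 1 + z**2 / runs
        center = (p + z**2 / (2 * runs)) / denom
        half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
        return (max(0.0, center - half), min(1.0, center + half))

    # Hypothetical harness: my_agent is any Python callable,
    # check scores its output as True/False.
    def run_n_times(my_agent, check, task, n=20):
        passes = sum(check(my_agent(task)) for _ in range(n))
        return passes, wilson_ci(passes, n)

    # 14/20 passes looks like "70%", but the 95% CI is roughly (0.48, 0.85).
    print(wilson_ci(14, 20))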


Replies
Hey everyone! I built agentrial because I was going crazy with my LangGraph agents randomly failing on the same prompts.
The core insight: treat agent evaluation as a statistical problem, not a deterministic one. Every test runs N times, so you get confidence intervals, not "it passed once." When something fails, a Fisher exact test flags which step is the bottleneck.
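To illustrate the attribution idea (conceptually; this isn't agentrial's internals): for each agent step, tally a 2x2 table of step failure vs. overall run failure across the N runs, then test for association. A sketch using scipy, with made-up counts:

    from scipy.stats import fisher_exact

    # Made-up counts from 20 runs of one test case:
    #                        run failed   run passed
    # "search" step failed        6            1
    # "search" step ok            1           12
    table = [[6, 1], [1, 12]]
    odds_ratio, p_value = fisher_exact(table)
    print(f"odds ratio={odds_ratio:.1f}, p={p_value:.4f}")
    # A small p-value marks the "search" step as strongly
    # associated with run failure.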
v0.2.0 just shipped with:
- 438 tests, 15 CLI commands
- 6 framework adapters + OpenTelemetry
- Agent Reliability Score (0-100 composite metric)
- VS Code extension (live on Marketplace)
- MCP security scanner for 6 vulnerability classes
- Production drift detection (CUSUM, Page-Hinkley, KS test; sketch after this list)
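For the drift detection, here's a bare-bones one-sided CUSUM in its textbook form (agentrial's actual detector and parameters may differ): accumulate how far each batch's pass rate falls below a target, and alarm when the sum crosses a threshold.

    def cusum_alarm(rates, target, k=0.05, h=0.3):
        # One-sided CUSUM: alarm when cumulative shortfall below
        # target exceeds h. k is the slack per observation.
        s = 0.0
        for i, x in enumerate(rates):
            s = max(0.0, s + (target - x) - k)
            if s > h:
                return i  # index of the batch that triggered the alarm
        return None

    # Pass rates per batch: stable near 0.9, then degrading.
    rates = [0.90, 0.92, 0.88, 0.91, 0.75, 0.70, 0.72, 0.68]
    print(cusum_alarm(rates, target=0.90))  # alarms at index 6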
It's free and local-first: your prompts and data never leave your machine.
Would love your feedback, especially on what metrics matter most for your agent workflows.