Alessandro Potenza left a comment
Hey everyone! I built agentrial because I was going crazy with my LangGraph agents randomly failing on the same prompts. The core insight: treat agent evaluation as a statistical problem, not a deterministic one. Every test runs N times. You get confidence intervals, not "it passed once." When something fails, a Fisher exact test points to the step most strongly associated with the failures, so you can see where the bottleneck is. v0.2.0 just...
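One plausible reading of the failure attribution described in the comment, sketched with the standard library only: for each step, build a 2x2 table of "step errored / step OK" against "run failed / run passed" and compute a one-sided Fisher exact p-value. The step names, the counts, and the function `fisher_exact_greater` are hypothetical illustrations, not agentrial's API or its actual method.

```python
from math import comb

def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: step errored / step OK. Columns: run failed / run passed.
    Returns P(X >= a) under the hypergeometric null of no association.
    """
    row1 = a + b          # runs where the step errored
    col1 = a + c          # runs that failed overall
    n = a + b + c + d     # total runs
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)

# Hypothetical counts from 20 runs of one test case (illustrative only).
step_tables = {
    "tool_call": (5, 1, 1, 13),     # errored in 6 runs, 5 of which failed
    "parse_output": (1, 2, 5, 12),  # errored in 3 runs, 1 of which failed
}

p_values = {step: fisher_exact_greater(*t) for step, t in step_tables.items()}
suspect = min(p_values, key=p_values.get)  # step most associated with failure

for step, p in p_values.items():
    print(f"{step}: p = {p:.4f}")
print("most suspicious step:", suspect)
```

A small p-value says only that errors in that step co-occur with failed runs more often than chance would suggest. With several steps tested at once, a multiple-comparison correction would be a sensible addition.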

agentrial: Run your AI agent 20x. Get confidence intervals, not vibes.
Your AI agent passed the test. But would it pass again? LLMs are non-deterministic: a task that passed once can still fail on, say, 30% of later runs.
agentrial runs each test case N times and gives you confidence intervals instead of a single pass/fail. It reports Wilson confidence intervals on pass rates, attributes failures with a Fisher exact test, tracks real API costs, and detects regressions in CI/CD pipelines.
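To make the Wilson interval concrete, here is a minimal sketch of the standard Wilson score formula applied to a pass rate. The function name `wilson_interval` and the 14-of-20 example are assumptions for illustration and are not taken from agentrial's code.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passes <= runs:
        raise ValueError("passes must be between 0 and runs")
    p = passes / runs
    z2 = z * z
    denom = 1 + z2 / runs
    center = (p + z2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z2 / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to guard against floating-point drift at the edges.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical result: the agent passed 14 of 20 runs.
low, high = wilson_interval(14, 20)
print(f"pass rate 70%, 95% CI: {low:.1%} to {high:.1%}")
```

With 20 runs the interval is wide (about 48% to 85% here), which is the point of the tagline: a single green run says little about the true pass rate. Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at 0 or N passes.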
Works with LangGraph, CrewAI, AutoGen, the OpenAI Agents SDK, and any Python callable. YAML config, MIT license.

