This is awesome! Multi-agent systems are becoming increasingly critical as companies move from simple AI chatbots to real AI workflows.
What stands out about AgentX is its focus on evaluation, observability, and reliability before deployment. Partners across leading banks and system integrators have shared strong positive feedback after adopting AgentX solutions.
This is exactly what enterprises need to adopt AI agents with greater confidence. Excited to see AgentX help teams build, test, and scale AI agents in production. Highly recommended!
AgentX - Multi-agent and eval framework
Premast
@robin_xw Looks good! Congrats
@robin_xw this is honestly one of the most needed pieces in the whole AI agent stack right now!!!
building agents is getting easier but trusting them in production is still kinda scary because failures are usually invisible until a user hits them
one thing I’m curious about: in your experience, what is the hardest thing to evaluate properly today: multi-step reasoning, tool usage accuracy, long-running state consistency, or just reproducing real-world scenarios reliably?
also love the CI/CD analogy for agents that feels exactly right
excited to see where this goes 🔥
@robin_xw One-click fixes for pinpointed issues is solid. What was the hardest part of building the detection layer? Did you train on specific agent failure patterns or go more general?
The hardest part is turning an eval failure into an action boundary, not just a score.
For agent workflows, I’d want each failed case to show which tool call or write would have happened, what state it touched, and what receipt or approval would block it next time. Are you modeling external side effects in eval cases, or mostly message/tool correctness for now?
AgentX - Multi-agent and eval framework
@blah_mad
That’s a great point!
For us, eval failures should point to the action boundary, not just a score: which tool call, write, state change, or approval step caused the risk.
That’s where agent evals become useful for real workflows, not just message quality.
That’s the right shape. The useful next step is making that failure artifact portable: eval case, predicted tool or write, state diff, approval or block reason, and the fix that changed the score.
Do you expose that as an exportable run record, or mainly inside the AgentX UI for now?
AgentX - Multi-agent and eval framework
@blah_mad
That’s exactly the direction we think this should go.
A failed eval should become a reusable artifact, not just something trapped in a dashboard: scenario, trace, expected vs actual behavior, affected tool/write, state diff, block reason, and fix history.
Right now the main workflow is inside AgentX, but making those run records portable for teams and CI/CD is an important part of the roadmap.
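To make the "reusable artifact" idea concrete, here is a minimal sketch of what a portable failure record could look like as a plain JSON document. The `FailureRecord` class and every field name are hypothetical, chosen to mirror the list above (scenario, trace, expected vs actual, affected tool/write, state diff, block reason, fix history). This is not AgentX's actual export format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FailureRecord:
    """Hypothetical portable record for one failed eval case."""
    scenario: str
    trace: list[str]                 # ordered step names from the run
    expected: str
    actual: str
    affected_tool: str               # tool call or write that would have run
    state_diff: dict[str, list]      # key -> [before, after]
    block_reason: str
    fix_history: list[str] = field(default_factory=list)


def to_json(record: FailureRecord) -> str:
    # sort_keys gives a stable key order, so the file diffs cleanly in review.
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_json(text: str) -> FailureRecord:
    return FailureRecord(**json.loads(text))


record = FailureRecord(
    scenario="refund request over policy limit",
    trace=["plan", "lookup_order", "issue_refund"],
    expected="escalate to human approval",
    actual="called issue_refund directly",
    affected_tool="issue_refund",
    state_diff={"order.status": ["paid", "refunded"]},
    block_reason="write requires approval above limit",
)
print(to_json(record))
```

A record like this can be attached to a ticket, stored next to the code, or replayed by a CI job, because it carries the action boundary and not just the score.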
The eval suite plus multi-provider simulate-run (basically CI/CD for agents) is the part I'd wire in first — pre-prod agent debugging is exactly where I lose the most time. Where do the eval suites and traces actually live: stored per-project in AgentX's hosted backend, or can I export/version them in my own repo so they run in my CI? And when you simulate across LLM providers, do I bring my own keys per provider or does AgentX proxy those calls?
AgentX - Multi-agent and eval framework
@noctis06
Today, the main eval suites and traces live per project in AgentX, so teams can inspect runs, compare results, and debug failures in one place.
Where we want this to go is exactly what you described: portable/versionable eval artifacts that can run as part of CI, not just live in a UI.
For provider simulation, we’re designing around flexibility: teams should be able to test across providers without being locked into one setup, whether that means BYO keys or a managed/proxied flow, depending on their workflow.
That flexibility is exactly the right call — BYO keys would be my default so cost and rate limits stay on my own provider accounts. When the portable eval artifacts land, are you picturing them as plain diffable files (JSON/YAML committed in-repo) so a changed eval shows up in a normal PR review, or more of an export/import bundle? The in-repo route is what would actually get this into my CI.
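As a rough illustration of the in-repo route, the sketch below shows why plain JSON with a stable key order reviews well: a changed threshold surfaces as a one-line diff, the same way it would in a PR. The eval case fields (`id`, `runs`, `threshold`) and the file paths are made up for the example and do not describe any AgentX format.

```python
import difflib
import json


def render(case: dict) -> list[str]:
    # Stable key order and one field per line keep review diffs small.
    return json.dumps(case, indent=2, sort_keys=True).splitlines()


# Same case before and after an edit; key order in memory does not matter.
old_case = {"id": "refund-over-limit", "threshold": 7.0, "runs": 5}
new_case = {"runs": 5, "id": "refund-over-limit", "threshold": 8.0}

diff = list(
    difflib.unified_diff(
        render(old_case),
        render(new_case),
        fromfile="a/evals/refund.json",
        tofile="b/evals/refund.json",
        lineterm="",
    )
)
print("\n".join(diff))
```

Only the `threshold` line shows up as changed, even though the two dictionaries were built with different key orders.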
The "CI/CD for agents" framing resonates — the hard part has always been defining what "passing" even means for a non-deterministic agent. How does AgentX handle the eval oracle: are test suites assertion-based, LLM-judged, or a mix, and how do you keep those judgments stable across runs? The multi-LLM cost/latency comparison is a genuinely useful addition — picking a model on vibes is still way too common. I'd just want the AI-suggested fix to show its reasoning before I trust it anywhere near production.
AgentX - Multi-agent and eval framework
@codeamesh_consultancy Great observation. We offer many metrics for a variety of scenarios.
For example, a cosine score for semantic closeness and a Jaccard score for text overlap. The most useful is the overall score for the whole eval, produced by multiple LLM-as-a-judge graders.
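For readers unfamiliar with the two text metrics, here is a small self-contained sketch. It is an assumption-heavy stand-in: Jaccard is computed over word sets, and cosine is computed over word-count vectors, whereas a semantic cosine score would normally compare embedding vectors from a model. The function names are illustrative only.

```python
import math
from collections import Counter


def tokens(text: str) -> list[str]:
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Shared words divided by all distinct words across both texts."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between word-count vectors (not embeddings)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


expected = "refund issued to customer"
actual = "refund issued to the customer"
print(jaccard(expected, actual))  # 4 shared words out of 5 distinct
print(cosine(expected, actual))
```

Both scores are deterministic, which makes them useful as stable signals next to LLM-judged scores that can vary between runs.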
GNGM
I like the "CI/CD for AI agents" framing.
What does a failed deployment look like in AgentX? Can teams set quality thresholds that block releases?
AgentX - Multi-agent and eval framework
@polman_trudo Exactly. Teams can define evaluation criteria and quality thresholds. If a change causes performance regressions, the evaluation can fail before deployment, similar to how software teams use automated tests to prevent bad releases.
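A minimal sketch of what such a release gate could look like, assuming per-metric minimum thresholds plus a check against the previous release's scores. The `gate` function, the metric names, and the `max_drop` parameter are hypothetical and are not AgentX's API.

```python
from typing import Optional


def gate(
    scores: dict[str, float],
    thresholds: dict[str, float],
    baseline: Optional[dict[str, float]] = None,
    max_drop: float = 0.5,
) -> list[str]:
    """Return reasons to block the release; an empty list means it may ship."""
    reasons = []
    # Absolute quality bar: every configured metric must meet its minimum.
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            reasons.append(f"{metric}: no score reported")
        elif value < minimum:
            reasons.append(f"{metric}: {value} below threshold {minimum}")
    # Regression check: block if a metric dropped too far from the baseline.
    for metric, previous in (baseline or {}).items():
        value = scores.get(metric)
        if value is not None and previous - value > max_drop:
            reasons.append(f"{metric}: regressed from {previous} to {value}")
    return reasons


thresholds = {"overall": 7.5, "tool_accuracy": 0.9}
baseline = {"overall": 8.6, "tool_accuracy": 0.95}
candidate = {"overall": 7.8, "tool_accuracy": 0.96}

reasons = gate(candidate, thresholds, baseline)
print(reasons or "release may proceed")
```

In a CI job, a non-empty list would translate into a non-zero exit code, which is what stops the deploy step from running. Here the candidate clears both thresholds but is still blocked because `overall` fell by more than `max_drop`.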
Triforce Todos
Running the same agent across multiple LLM providers to compare cost/latency is such an underrated feature.
How many providers do you support right now?
AgentX - Multi-agent and eval framework
@abod_rehman Thank you, Abdul. We currently support all major LLM vendors out of the box (Claude, GPT, Gemini, Llama, Grok). You can also use the custom LLM option and provide your own base URL that points to any other LLM not listed here.
Solid work! IMO the CI/CD framing only holds if the evals are deterministic, and agents almost never are. Are you guys gating deploys on a pass rate (like 9/10 runs)? Thanks.
AgentX - Multi-agent and eval framework
@artstavenka1 Great challenge! And you're right that naive CI/CD breaks down if you treat each agent run as a binary test. We don't.
The gate sits on the aggregate, not the individual run. Each case runs multiple times, gets graded 0-10 by a panel of LLM judges, and the threshold is set on the averaged score across runs. So a single outlier run nudges the average instead of flipping the gate.
You can also track consistency explicitly - it's one of the core metrics in the report. An agent that scores 8.0 average with low variance ships very differently from one that swings between 5 and 10. Both might have the same average but only one is actually reliable.
So yes, closer to "9/10 runs must score above X" than a hard pass/fail, but applied to a distribution rather than a count.
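The aggregate gate described above can be sketched roughly as follows. Each inner list holds one run's 0-10 scores from the judge panel. The averaging follows the description in this reply; the extra `max_spread` limit on the standard deviation is an assumption added here to show how consistency could also feed into the decision, and all names are illustrative.

```python
from statistics import mean, pstdev


def summarize(runs: list[list[float]]) -> tuple[float, float]:
    """Average the judge panel per run, then report mean and spread across runs."""
    per_run = [mean(judge_scores) for judge_scores in runs]
    return mean(per_run), pstdev(per_run)


def passes(runs: list[list[float]], min_avg: float, max_spread: float) -> bool:
    avg, spread = summarize(runs)
    return avg >= min_avg and spread <= max_spread


# Two agents with the same 8.0 average but very different consistency.
steady = [[8.0, 8.0, 8.0], [8.0, 7.0, 9.0], [8.0, 8.0, 8.0], [8.0, 9.0, 7.0]]
swingy = [[5.0] * 3, [10.0] * 3, [6.0] * 3, [10.0] * 3, [9.0] * 3]

for name, runs in (("steady", steady), ("swingy", swingy)):
    avg, spread = summarize(runs)
    print(name, avg, round(spread, 2), passes(runs, min_avg=7.5, max_spread=1.0))
```

A gate on the average alone would ship both agents. Adding a spread limit is one way to separate them.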
AgentX - Multi-agent and eval framework
@artstavenka1
Great question, and yes - determinism is tricky with agents.
We don’t think about gating only as a single pass/fail run. For agent workflows, it usually needs repeated runs and thresholds: pass rate, consistency, tool-call accuracy, and severity of failures.
So a team might gate on something like “9/10 runs pass for critical scenarios,” but also block deploys immediately for certain high-severity failures, like wrong tool usage, missing required fields, or unsafe outputs.
The goal is not to pretend agents are deterministic - it’s to measure the variance before production.
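A rough sketch of that two-part rule, assuming each run reports whether it passed and, if not, a failure kind. The `RunResult` type, the failure-kind labels, and the 0.9 default are hypothetical and only mirror the examples given in this reply.

```python
from dataclasses import dataclass
from typing import Optional

# Failure kinds that block a deploy immediately, regardless of pass rate.
HIGH_SEVERITY = {"wrong_tool", "missing_required_field", "unsafe_output"}


@dataclass
class RunResult:
    passed: bool
    failure_kind: Optional[str] = None


def should_deploy(runs: list[RunResult], min_pass_rate: float = 0.9) -> tuple[bool, str]:
    if not runs:
        return False, "blocked: no runs recorded"
    # A single high-severity failure blocks the deploy outright.
    for run in runs:
        if run.failure_kind in HIGH_SEVERITY:
            return False, f"blocked: high-severity failure ({run.failure_kind})"
    # Otherwise gate on the pass rate across repeated runs.
    rate = sum(run.passed for run in runs) / len(runs)
    if rate < min_pass_rate:
        return False, f"blocked: pass rate {rate:.0%} below {min_pass_rate:.0%}"
    return True, f"ok: pass rate {rate:.0%}"


nine_of_ten = [RunResult(True)] * 9 + [RunResult(False, "low_quality_answer")]
unsafe_once = [RunResult(True)] * 9 + [RunResult(False, "unsafe_output")]

print(should_deploy(nine_of_ten))
print(should_deploy(unsafe_once))
```

Both batches have the same 9/10 pass rate, but only the first one deploys, because the second contains a failure kind on the immediate-block list.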