AgentX - Multi-agent and eval framework

Build and evaluate multi AI agents with any LLM

5.0 • 6 reviews • 2.2K followers

Marketing automation platforms • Lead generation software • AI Chatbots

AgentX - AI Workforce is a multi-agent system that scales your operations by organizing AI agents into collaborative, hierarchical teams. Automate complex tasks, streamline workflows, and unlock new levels of productivity with intelligent agent coordination. It now comes with an evaluation framework, so you can ship your AI agents with confidence.

This is the 3rd launch from AgentX - Multi-agent and eval framework.

AgentX

Launched this week

Evaluate AI agents, pinpoint issues, and fix them with one click.

Evaluate AI agents before they fail. Create test suites, run evaluations, and pinpoint issues before they reach production. AgentX provides full observability and traceability for your AI agents. AI analysis not only identifies problems but also suggests fixes, like an AI doctor for your agents. Run your agents in simulation across multiple LLM providers to compare performance, cost, and latency, so you can make a better-informed choice of LLM. Run evals before you deploy, like CI/CD for AI agents.

Free Options

Launch tags: Analytics • Developer Tools • Artificial Intelligence

AgentX - Multi-agent and eval framework

Maker

📌

Hey Product Hunt! 👋 AI agents are getting more capable, but evaluating and debugging them is still painful. We built the AgentX evaluation framework to help teams test, evaluate, and monitor AI agents before failures reach production.

Think CI/CD + observability for AI agents:
• Create eval suites
• Compare models across providers
• Trace failures end-to-end
• Get AI-powered root cause analysis and suggested fixes

It also runs on multiple agent platforms. Our goal is simple: help teams ship reliable AI agents with confidence. Would love to hear: what's been your biggest challenge with AI agent evaluation or debugging?

3d ago

Premast

@robin_xw Looks good! Congrats

2d ago

@robin_xw this is honestly one of the most needed pieces in the whole AI agent stack right now!!!

building agents is getting easier but trusting them in production is still kinda scary because failures are usually invisible until a user hits them

one thing I’m curious about: in your experience, what is the hardest thing to evaluate properly today: multi-step reasoning, tool usage accuracy, long-running state consistency, or just reproducing real-world scenarios reliably?

also love the CI/CD analogy for agents that feels exactly right

excited to see where this goes 🔥

2d ago

@robin_xw One-click fixes for pinpointed issues is solid. What was the hardest part of building the detection layer? Did you train on specific agent failure patterns or go more general?

1d ago

The hardest part is turning an eval failure into an action boundary, not just a score.

For agent workflows, I’d want each failed case to show which tool call or write would have happened, what state it touched, and what receipt or approval would block it next time. Are you modeling external side effects in eval cases, or mostly message/tool correctness for now?

2d ago

AgentX - Multi-agent and eval framework

Maker

@blah_mad
That’s a great point!

For us, eval failures should point to the action boundary, not just a score: which tool call, write, state change, or approval step caused the risk.

That’s where agent evals become useful for real workflows, not just message quality.

1d ago

That’s the right shape. The useful next step is making that failure artifact portable: eval case, predicted tool or write, state diff, approval or block reason, and the fix that changed the score.

Do you expose that as an exportable run record, or mainly inside the AgentX UI for now?

1d ago

AgentX - Multi-agent and eval framework

Maker

@blah_mad
That’s exactly the direction we think this should go.

A failed eval should become a reusable artifact, not just something trapped in a dashboard: scenario, trace, expected vs actual behavior, affected tool/write, state diff, block reason, and fix history.

Right now the main workflow is inside AgentX, but making those run records portable for teams and CI/CD is an important part of the roadmap.
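To make the idea of a reusable failure artifact concrete, here is a minimal sketch of what such a record could look like as a plain, diffable JSON file. It is not AgentX's actual schema or export format: the class name `FailedEvalRecord`, every field name, and the sample refund scenario are assumptions made up for illustration, loosely following the fields listed in the reply above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FailedEvalRecord:
    """Hypothetical portable record for one failed eval case (not AgentX's schema)."""

    scenario: str
    trace: list[str]              # ordered step descriptions from the run
    expected: str
    actual: str
    affected_tool: str            # tool call or write that caused the risk
    state_diff: dict[str, list]   # field name -> [before, after]
    block_reason: str
    fix_history: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Sorted keys and fixed indentation keep the file stable,
        # so a changed eval shows up as a small diff in review.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "FailedEvalRecord":
        return cls(**json.loads(text))


# Made-up example case, for illustration only.
record = FailedEvalRecord(
    scenario="refund request over the approval limit",
    trace=["classify intent", "call issue_refund(amount=900)"],
    expected="escalate to a human approver",
    actual="issued the refund directly",
    affected_tool="issue_refund",
    state_diff={"order.status": ["paid", "refunded"]},
    block_reason="amount exceeds auto-approval limit",
)
print(record.to_json())
```

A record like this can be committed next to the agent's code and loaded again with `FailedEvalRecord.from_json`, which is one way the "portable for teams and CI/CD" goal could be approached.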

1d ago

The eval suite plus multi-provider simulate-run (basically CI/CD for agents) is the part I'd wire in first — pre-prod agent debugging is exactly where I lose the most time. Where do the eval suites and traces actually live: stored per-project in AgentX's hosted backend, or can I export/version them in my own repo so they run in my CI? And when you simulate across LLM providers, do I bring my own keys per provider or does AgentX proxy those calls?

1d ago

AgentX - Multi-agent and eval framework

Maker

@noctis06
Today, the main eval suites and traces live per project in AgentX, so teams can inspect runs, compare results, and debug failures in one place.

Where we want this to go is exactly what you described: portable/versionable eval artifacts that can run as part of CI, not just live in a UI.

For provider simulation, we’re designing around flexibility: teams should be able to test across providers without being locked into one setup, whether that means BYO keys or a managed/proxied flow, depending on their workflow.

1d ago

That flexibility is exactly the right call — BYO keys would be my default so cost and rate limits stay on my own provider accounts. When the portable eval artifacts land, are you picturing them as plain diffable files (JSON/YAML committed in-repo) so a changed eval shows up in a normal PR review, or more of an export/import bundle? The in-repo route is what would actually get this into my CI.

1d ago

The "CI/CD for agents" framing resonates — the hard part has always been defining what "passing" even means for a non-deterministic agent. How does AgentX handle the eval oracle: are test suites assertion-based, LLM-judged, or a mix, and how do you keep those judgments stable across runs? The multi-LLM cost/latency comparison is a genuinely useful addition — picking a model on vibes is still way too common. I'd just want the AI-suggested fix to show its reasoning before I trust it anywhere near production.

1d ago

AgentX - Multi-agent and eval framework

Maker

@codeamesh_consultancy Great observation. We offer many metrics for a variety of scenarios.
For example, a cosine score for semantic closeness and a Jaccard score for text overlap. The most useful is the overall score for the whole eval, produced by multiple LLM-as-a-judge graders.
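For readers unfamiliar with the two text metrics named above, here is a minimal sketch of how they are commonly computed. This is not AgentX's implementation. As a simplifying assumption, the cosine score here is taken over word-count vectors; a semantic-closeness score would normally use embedding vectors from a model instead. The function names and sample strings are made up for illustration.

```python
import math
from collections import Counter


def tokens(text: str) -> list[str]:
    # Naive whitespace tokenizer; real systems usually normalize more.
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Shared unique words divided by all unique words in either text."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between word-count vectors (stand-in for embeddings)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


expected = "refund the order"
actual = "refund the payment"
print(jaccard(expected, actual))  # 2 shared words out of 4 unique -> 0.5
print(cosine(expected, actual))
```

Both scores range from 0 to 1 for these inputs, with 1 meaning the texts use the same words.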

1d ago

GNGM

💎 Pixel perfection

I like the "CI/CD for AI agents" framing.

What does a failed deployment look like in AgentX? Can teams set quality thresholds that block releases?

2d ago

AgentX - Multi-agent and eval framework

Maker

@polman_trudo Exactly. Teams can define evaluation criteria and quality thresholds. If a change causes performance regressions, the evaluation can fail before deployment, similar to how software teams use automated tests to prevent bad releases.

2d ago

Triforce Todos

💎 Pixel perfection

Running the same agent across multiple LLM providers to compare cost/latency is such an underrated feature.

How many providers do you support right now?

2d ago

AgentX - Multi-agent and eval framework

Maker

@abod_rehman Thank you Abdul. We currently support all major LLM vendors out of the box (Claude, GPT, Gemini, Llama, Grok). You can also add a custom LLM by providing your own base URL that points to any other LLM not listed here.

2d ago

Solid work! IMO the CI/CD framing only holds if the evals are deterministic and an issue could be that agents almost never are. Are you guys gating deploys on a pass rate (like 9/10 runs)? Thanks.

2d ago

AgentX - Multi-agent and eval framework

Maker

@artstavenka1 Great challenge! And you're right that naive CI/CD breaks down if you treat each agent run as a binary test. We don't.
The gate sits on the aggregate, not the individual run. Each case runs multiple times, gets graded 0-10 by a panel of LLM judges, and the threshold is set on the averaged score across runs. So a single outlier run nudges the average instead of flipping the gate.

You can also track consistency explicitly - it's one of the core metrics in the report. An agent that scores 8.0 average with low variance ships very differently from one that swings between 5 and 10. Both might have the same average but only one is actually reliable.

So yes, closer to "9/10 runs must score above X" than a hard pass/fail, but applied to a distribution rather than a count.
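The aggregate gate described above can be sketched as follows. This is a minimal illustration, not AgentX's code: the function name `gate`, the `min_mean` and `max_stdev` parameters, and the sample scores are all assumptions. In particular, the reply only says that consistency is tracked as a metric; gating on a maximum standard deviation is an added assumption to show why two agents with the same average can ship differently.

```python
from statistics import mean, pstdev


def gate(runs: list[list[float]], min_mean: float, max_stdev: float) -> dict:
    """Decide pass/fail from repeated runs of one eval case.

    runs[i] holds the 0-10 scores the judge panel gave run i.
    """
    per_run = [mean(judge_scores) for judge_scores in runs]
    avg = mean(per_run)        # threshold applies to the average across runs
    spread = pstdev(per_run)   # consistency: low spread means a steadier agent
    return {
        "mean": avg,
        "stdev": spread,
        "passed": avg >= min_mean and spread <= max_stdev,
    }


# Made-up scores: both agents average 8.0, but only one is consistent.
steady = [[8.0, 8.0, 8.0], [8.0, 7.0, 9.0], [8.0, 8.0, 8.0]]
swingy = [[5.0, 5.0, 5.0], [10.0, 10.0, 10.0], [9.0, 9.0, 9.0]]

print(gate(steady, min_mean=7.5, max_stdev=1.0))
print(gate(swingy, min_mean=7.5, max_stdev=1.0))
```

With these thresholds the steady agent passes and the swingy one is held back, even though their averages match.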

2d ago

AgentX - Multi-agent and eval framework

Maker

@artstavenka1
Great question, and yes - determinism is tricky with agents.

We don’t think about gating only as a single pass/fail run. For agent workflows, it usually needs repeated runs and thresholds: pass rate, consistency, tool-call accuracy, and severity of failures.

So a team might gate on something like “9/10 runs pass for critical scenarios,” but also block deploys immediately for certain high-severity failures, like wrong tool usage, missing required fields, or unsafe outputs.

The goal is not to pretend agents are deterministic - it’s to measure the variance before production.
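A gate that combines a pass-rate threshold with immediate blocks for high-severity failures could look roughly like this. It is a sketch under stated assumptions, not AgentX's API: `RunResult`, `should_deploy`, the `BLOCKING` labels, and the 0.9 default are hypothetical names and values chosen to mirror the examples in the reply above.

```python
from dataclasses import dataclass

# Hypothetical labels for failures that block a deploy on first sight.
BLOCKING = {"wrong_tool", "missing_required_field", "unsafe_output"}


@dataclass
class RunResult:
    passed: bool
    failures: tuple[str, ...] = ()  # failure labels observed in this run


def should_deploy(results: list[RunResult], min_pass_rate: float = 0.9) -> tuple[bool, str]:
    """Return (decision, reason) for one critical scenario."""
    if not results:
        return False, "no runs recorded"
    # High-severity failures block immediately, regardless of pass rate.
    for result in results:
        hit = BLOCKING.intersection(result.failures)
        if hit:
            return False, f"blocking failure: {sorted(hit)[0]}"
    rate = sum(r.passed for r in results) / len(results)
    if rate < min_pass_rate:
        return False, f"pass rate {rate:.0%} below {min_pass_rate:.0%}"
    return True, f"pass rate {rate:.0%}"


# 9 of 10 runs pass; the one failure is low severity, so the gate opens.
mild = [RunResult(True)] * 9 + [RunResult(False, ("tone_mismatch",))]
# Same pass rate, but the failure is a wrong tool call, so the gate closes.
severe = [RunResult(True)] * 9 + [RunResult(False, ("wrong_tool",))]

print(should_deploy(mild))
print(should_deploy(severe))
```

Checking blocking failures before the pass rate keeps a single unsafe run from being averaged away by nine clean ones.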

2d ago

Previous AgentX - Multi-agent and eval framework Launches

AgentX 2.0: Build your own cross-vendor multi-agent AI team

Launched on June 16th, 2025

AgentX: A reliable AI Agent to generate leads for your business

Launched on February 29th, 2024
