Launching today
Vibe training for AI agent reliability. Describe what your agent should and should not do — Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).

Plurai
Hey Product Hunt, Ilan from Plurai here.
We spent the last year on a research problem: can you train a production-grade eval or guardrail from just a task description, no labeled data, no annotation pipeline?
Turns out you can. We call it vibe-training.
Most teams today rely on LLM as a judge. It never fully converges, breaks on edge cases, and at 100ms per call it collapses economically at scale. So teams sample instead of evaluating everything. Failures happen between the samples, invisibly.
Plurai lets you describe what your agent should and should not do. The platform generates training data, validates it through a multi-agent debate process, and deploys a custom small language model in minutes.
Results against GPT-5 LLM-as-judge: over 43% fewer failures, 8x lower cost, sub-100ms latency.
Good enough to run on every interaction, not just a sample.
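To make "run on every interaction" concrete, here's a rough sketch of what that looks like from the agent side. The names and the toy policy below are purely illustrative, not our actual SDK:

```python
# Illustrative only: the point is that a sub-100ms check is cheap enough
# to sit inline on every agent reply instead of a sampled fraction.
# None of these names are the real Plurai SDK.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    reason: str = ""

def check_guardrail(user_msg: str, agent_reply: str) -> Verdict:
    """Stand-in for a deployed small-model guardrail endpoint."""
    banned = ["wire the refund immediately"]  # toy policy for the example
    if any(phrase in agent_reply.lower() for phrase in banned):
        return Verdict(False, "no refund promises without human approval")
    return Verdict(True)

def guarded(agent: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an agent so every reply is checked, not just a sample."""
    def run(user_msg: str) -> str:
        reply = agent(user_msg)
        verdict = check_guardrail(user_msg, reply)
        return reply if verdict.passed else f"[blocked: {verdict.reason}]"
    return run

if __name__ == "__main__":
    agent = lambda msg: "Sure, I'll wire the refund immediately."
    print(guarded(agent)("I want my money back"))  # -> blocked
```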
The research behind it is public.
Try it free at https://app.plurai.ai. I'd love to hear what eval problem you're working on.
@ilankad23 neat product - keep up the great work
Plurai
@fmerian Thank you, we're just getting started! Excited to see what people build with it.
Plurai
@ilankad23 An exciting day!
Plurai
@tammy_wolfson2 Exciting day! So fortunate to have you with us on the Plurai journey, we’re just getting started!
Plurai
@ilankad23 This is such a great step forward - really inspiring to see this come together. Looking forward to seeing the impact it creates!
Plurai
@arnonmz Thank you Arnon! This is just the beginning, excited to see what people build with it.
Plurai
@ilankad23 It's the closest thing to giving birth 😅 Go Plurai!
Plurai
@reut_v_plurai I wouldn't argue with that, grateful to have you with us on the Plurai journey — this is only the beginning!
Plurai
@ilankad23 Congrats on the launch team! Been heads down on this one for a while, feels great to finally share it.
Plurai
@ben_wisbih Thank you, we’re so fortunate to have you with us and for the incredible part you play in our journey. We’re just getting started.
The "always on, not sampled" part is what makes this interesting. When I was running engineering at scale, sampling-based quality checks gave us a false sense of security - the failures always happened in the gaps between samples. The LLM-as-judge approach has the same problem but worse: it's expensive enough that teams only run it on a fraction of requests, and the edge cases it misses are exactly the ones that blow up in production. Sub 100ms with small models changes the economics enough to actually evaluate everything. Curious about the cold start experience - when someone describes a new guardrail in plain language, how much iteration does it typically take before the generated eval catches the subtle violations versus just the obvious ones?
Plurai
@avrisimon
Avri, the sampling point is exactly right - the failures that matter live in the tail.
On cold start: typically 1–2 iterations. When someone describes a guardrail in plain language, we decompose it into behavioral dimensions and generate synthetic boundary cases specifically designed to probe the subtle violations, not just the obvious ones. There's an adversarial validation step that challenges generated cases before they go into the training set - that's what calibrates for edge cases from the start rather than after manual iteration.
One round of enrichment usually gets you there. The system tends to surface sub-categories the user hadn't articulated - they just confirm or redirect.
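Roughly, the consensus step works like this in spirit. This is a simplified sketch with placeholder judges, not the actual debate agents from the paper:

```python
# Simplified sketch of the consensus filter: only candidate boundary cases
# whose label survives agreement across independent judges become training
# data; the rest get refined or dropped. The judges here are placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str            # synthetic boundary case probing the guardrail
    proposed_label: str   # "violation" or "ok", from the generator

Judge = Callable[[Candidate], str]  # returns "violation" or "ok"

def debate_filter(candidates: List[Candidate],
                  judges: List[Judge],
                  min_agreement: float = 1.0) -> List[Candidate]:
    """Keep only candidates whose proposed label reaches judge consensus."""
    kept = []
    for cand in candidates:
        votes = [judge(cand) for judge in judges]
        agreement = votes.count(cand.proposed_label) / len(votes)
        if agreement >= min_agreement:
            kept.append(cand)   # label-clean: goes into the training set
        # otherwise: send back for refinement or discard (not shown)
    return kept
```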
Plurai
@avrisimon We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@avrisimon Exactly! the failures live in the gaps.
On cold start, the first version usually catches the obvious stuff. The subtle violations come after 1–2 quick iterations, where we:
- generate boundary cases (not random samples)
- filter them via multi-agent debate (to avoid noisy labels)
- retrain and tighten the guardrail
From there, real traffic just keeps improving it.
Curious — what kind of “slipped to prod” failures hit you the hardest?
sampling-only eval has a real blind spot: anything that doesn't repeat doesn't get caught. ran into the same thing building eval flows for an AI form filler we work on — by the time a flaky failure shows up twice, you've already shipped it.
the part i can't quite picture is how the multi-agent debate establishes ground truth without existing failure modes — adversarial generation against the task spec is one read, test-time disagreement is the other. one of those would explain how the BARRED setup actually converges.
Plurai
@webappski Great point, that “doesn’t repeat → doesn’t get caught” blind spot is exactly what bites in production. On the ground truth side: it’s a mix of both. We start with adversarial generation against the task spec to surface candidate failures, then use multi-agent debate to stress and validate them until there’s strong agreement — that loop is what makes it converge without needing pre-labeled failure modes.
Plurai
@webappski We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@webappski on BARRED: ground truth isn't inferred from observed failures, it's constructed from the task spec. we decompose the policy into semantic dimensions and generate boundary cases at the intersections, the places where a generic judge is most likely to disagree with itself. multi-agent debate filters out cases that don't converge, so only label-clean boundaries make it into training. that's why a 3B model beats LLM-as-judge on edge cases, not smarter, just calibrated where it matters.
happy to share the paper if useful.
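in the meantime, a toy illustration of the "dimensions and intersections" idea (the dimension names and prompt template are invented for the example, not taken from the paper):

```python
# toy illustration: decompose a policy into named dimensions, then seed
# boundary-case generation at the pairwise intersections, where a generic
# judge is most likely to waver. dimensions and template are made up here.
from itertools import combinations

policy_dimensions = {
    "refunds": "never promise a refund without a human approval step",
    "tone": "stay professional even when the user is hostile",
    "disclosure": "never reveal internal pricing rules",
}

def intersection_prompts(dims):
    """Each pair of dimensions seeds one boundary-case generation prompt."""
    prompts = []
    for (a, rule_a), (b, rule_b) in combinations(dims.items(), 2):
        prompts.append(
            f"Write an agent reply that satisfies '{rule_a}' ({a}) "
            f"while almost, but not quite, violating '{rule_b}' ({b})."
        )
    return prompts

for prompt in intersection_prompts(policy_dimensions):
    print(prompt)
```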
@omri_sela2 the "decompose policy into semantic dimensions, generate boundary cases at intersections" framing is the cleanest articulation of the problem i've seen. would love the paper — that detail matters for whether the approach is reproducible outside your stack or specific to BARRED itself.
@ilankad23 makes sense — adversarial generation seeding candidates, debate filtering for convergence. that's the piece i was missing.
Plurai
@webappski Cool! Would love to get your feedback on the product.
OpenPlugin
the 'LLM as judge breaks at 100ms per call' pain is exactly where most eval pipelines silently rot. you end up with a sampling regime nobody actually trusts. the part i'm curious about is calibration in the wild: when the small model and the original llm-judge disagree on a real production trace, who do you trust, and how do you surface that disagreement to the team? that's usually where these systems either become real or quietly shelfware.
Plurai
@sebastian_sosa1 We're here if you have any more questions! Let us know what you think once you try it out!
OpenPlugin
the 'disagreement as training signal, not production decision' shift is the part most teams haven't internalized yet. multi-agent debate as the consensus mechanism for ground truth side-steps the human-in-the-loop bottleneck without giving up calibration, which is the move. one thing i'd love to know: how do you keep the debate from collapsing into 'all judges share priors because they're trained on the same base model'? does the advocate role pull from a meaningfully different distribution, or is it more about role prompting? @tammy_wolfson2
Plurai
@sebastian_sosa1 Great question — this is exactly where systems either become real infra or shelfware.
Our approach (from the paper) is: don’t resolve disagreement in production, resolve it in training. Here’s the paper with more details on the approach: https://huggingface.co/papers/2604.25203
We establish ground truth via multi-agent debate (judges + advocate). Only samples that reach consensus make it into training; the rest are refined or discarded.
This gives us a high-fidelity dataset, so the SLM is trained on much cleaner signals than raw LLM judgments.
In production, disagreements (SLM vs LLM/spec) are treated as high-value edge cases, fed back into the same loop (debate → refine → retrain).
So instead of asking “who do you trust?”, the system continuously earns trust by learning from disagreement.
That shift is what makes it actually work.
Plurai
@sebastian_sosa1 Exactly, that’s where most systems break.
Our take: you don’t “pick a winner” in production.
Ground truth is established offline via multi-agent debate, so the SLM is trained on high-fidelity labels (not raw LLM judgments).
In production, disagreements are surfaced explicitly as high-signal events (not hidden in sampling).
Those cases get fed back into the loop (debate → refine → retrain), so the system keeps calibrating on real traffic.
So the SLM becomes the trusted real-time guardrail, while the LLM judge is used more for auditing / drift detection.
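Operationally, the feedback part looks roughly like this (names and the audit rate are placeholders, not our actual pipeline):

```python
# Placeholder sketch of the feedback loop: the SLM guardrail runs on every
# trace, an LLM judge audits a small sample, and any disagreement is queued
# as a high-signal case for the offline debate -> refine -> retrain loop
# rather than being resolved live in production.
import random

def audit_traffic(traces, slm_check, llm_judge, audit_rate=0.02):
    disagreements = []
    for trace in traces:
        slm_verdict = slm_check(trace)        # always on, sub-100ms path
        if random.random() < audit_rate:      # sampled audit only
            if llm_judge(trace) != slm_verdict:
                disagreements.append(trace)   # feed back into retraining
    return disagreements
```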
More details in the paper: https://huggingface.co/papers/2604.25203
Curious — how are you handling these disagreements today?
This team just coined the concept of vibe-training: real-time, tailored evals for your AI agents, with high accuracy, at a fraction of the cost.
Brilliant.
Plurai
@fmerian <3
Plurai
@fmerian exactly what we did! what a crazy era we live in where it's so hard to coin new concepts!
Plurai
@fmerian Thanks for the support, it means a lot. This is just the beginning of making AI agents reliable in the real world.
TabAI
The multi-turn simulation piece is interesting.
Single prompt evals are easy, but most real failures happen across a sequence of interactions.
If this actually captures that well, that’s a meaningful step up from most eval tooling I’ve seen.
Plurai
@igor_martinyuk Exactly. That's one of the challenges we've been facing, and a main differentiator.
Plurai
@igor_martinyuk Exactly! Most real failures aren’t single turns, they’re stateful across interactions.
That’s why we simulate multi-turn flows and generate edge cases across the whole sequence rather than isolated prompts, including those “looks fine at each step, breaks at the end” scenarios.
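A toy example of that failure shape (the discount caps below are invented for illustration):

```python
# Toy example: every turn passes a per-turn check, but the conversation-level
# constraint (total discount) is violated by the final turn. Only a stateful,
# whole-trajectory eval catches it. The caps here are made up.
turns = [
    {"agent": "I can offer you a 10% discount.", "discount": 0.10},
    {"agent": "Let me also waive the shipping fee.", "discount": 0.00},
    {"agent": "And here's another 10% for the trouble.", "discount": 0.10},
]

PER_TURN_CAP = 0.10     # each turn individually looks compliant
TRAJECTORY_CAP = 0.15   # but total discount across the conversation is capped

per_turn_ok = all(t["discount"] <= PER_TURN_CAP for t in turns)
trajectory_ok = sum(t["discount"] for t in turns) <= TRAJECTORY_CAP

print(per_turn_ok, trajectory_ok)  # True False: passes per turn, fails overall
```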
Curious to hear: what kinds of multi-turn failures have you seen most often?
Plurai
@igor_martinyuk We're here if you have any more questions! Let us know what you think once you try it out!
You've mentioned 43% fewer failures. Was that averaged across different task types, or does the industry have specific benchmarks for that?
Plurai
@michael_vavilov Great question!
The 43% fewer failures comes from our research benchmarks across multiple tasks (conversational policies, agent workflows, compliance), not a single narrow use case. In the paper, we evaluate across different domains and datasets, and consistently see that task-specific models trained with our method outperform LLM-as-a-judge baselines and generic guardrails.
If you want the full breakdown (datasets, tasks, and comparisons), we shared it here:
https://huggingface.co/papers/2604.25203
Curious what kind of failures you’re measuring today?
Plurai
@michael_vavilov We're here if you have any more questions! Let us know what you think once you try it out!