Plurai is launching on Product Hunt this week, introducing the first vibe-training platform to build real-time, tailored evals for your AI agents, with high accuracy, at a fraction of the cost of LLM-based judges.
I had the opportunity to collaborate with their team on this first launch after months in stealth mode - no pressure - and wanted to share some insights on how we prepped it.
We were running a customer-facing agent in production for about three months before we started using Plurai. Everything looked fine on the surface. Then we ran it through their evaluation pipeline and found a bunch of edge cases we never would have caught manually — responses that were technically correct but violated our policies in ways we hadn't fully defined yet.
That's what actually sold me. Not the benchmarks, though those are real. It was the realization that our previous "testing" was basically vibes. Plurai turned that into something measurable.
The thing I use most is the guardrail endpoint. Sub-100ms, fits into our existing stack without replacing anything. I was skeptical that a small custom model could outperform GPT-4-based judges, but the accuracy on our specific use case is noticeably better — and cheaper by a lot.
Setup was surprisingly fast. I described what the agent should and shouldn't do in plain language, it generated boundary cases I hadn't thought of, and I had an endpoint to test against within the same day.
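To give a sense of the flow, here's a minimal sketch of that loop. The endpoint paths, payload fields, and response shape below are placeholders for illustration, not Plurai's documented API:

```python
import requests

# Hypothetical endpoint paths and payload shapes, for illustration only --
# not Plurai's documented API.
BASE = "https://app.plurai.ai/api"  # assumed base URL

# 1. Describe what the agent should and shouldn't do in plain language.
spec = {
    "name": "support-agent-guardrail",
    "description": (
        "The agent may discuss order status, returns, and shipping. "
        "It must never quote refund amounts, promise delivery dates, "
        "or give legal advice."
    ),
}
guardrail = requests.post(f"{BASE}/guardrails", json=spec, timeout=30).json()

# 2. Test a candidate agent response against the trained guardrail.
result = requests.post(
    f"{BASE}/guardrails/{guardrail['id']}/check",
    json={
        "input": "Where is my package?",
        "output": "It will definitely arrive Tuesday.",
    },
    timeout=5,
).json()
print(result)  # assumed shape: {"pass": false, "violation": "promised a delivery date"}
```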
The UI for reviewing synthetic test cases could be faster — scrolling through 50+ examples to find the interesting ones takes more clicks than it should. Some kind of filtering by confidence score or edge-case type would help. Documentation is also improving but still a bit thin in places, especially for the LLM optimization path vs. the SLM path and when to choose which.
Looked at LangSmith and Arize for observability — both solid, but they tell you what happened, not whether it was okay. That's a different problem. Also evaluated LlamaGuard and a homegrown prompt-based judge. LlamaGuard's taxonomy was too rigid for our use case. The prompt judge was unpredictable — good on simple cases, unreliable on anything nuanced. Building something custom would have taken months and a dataset we didn't have.
Plurai handled the dataset problem. That was the blocker everywhere else.
Hey Product Hunt, Ilan from Plurai here.
We spent the last year on a research problem: can you train a production-grade eval or guardrail from just a task description, no labeled data, no annotation pipeline?
Turns out you can. We call it vibe-training.
Most teams today rely on LLM as a judge. It never fully converges, breaks on edge cases, and at 100ms per call it collapses economically at scale. So teams sample instead of evaluating everything. Failures happen between the samples, invisibly.
Plurai lets you describe what your agent should and should not do. The platform generates training data, validates it through a multi-agent debate process, and deploys a custom small language model in minutes.
Results against GPT-5 LLM-as-judge: over 43% fewer failures, 8x lower cost, sub-100ms latency.
Good enough to run on every interaction, not just a sample.
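To illustrate the shift in concrete terms - stub functions and an assumed 5% sampling rate, purely illustrative:

```python
import random

def llm_judge(interaction) -> bool:
    """Expensive, slow LLM-as-a-judge call (stub)."""
    ...

def slm_guardrail(interaction) -> bool:
    """Cheap, sub-100ms small-model guardrail call (stub)."""
    ...

# Sampling regime: 95% of traffic is never evaluated, and the
# failures that matter tend to live in that unobserved gap.
def evaluate_sampled(interaction):
    if random.random() < 0.05:
        return llm_judge(interaction)
    return None  # unevaluated

# Always-on regime: cheap enough to run inline on every interaction.
def evaluate_always_on(interaction):
    return slm_guardrail(interaction)
```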
The research behind it is public.
Try it free at https://app.plurai.ai. I'd love to hear what eval problem you're working on.
@tammy_wolfson2 Exciting day! So fortunate to have you with us on the Plurai journey, we’re just getting started!
@ilankad23 This is such a great step forward - really inspiring to see this come together. Looking forward to seeing the impact it creates!
@reut_v_plurai I wouldn't argue with that, grateful to have you with us on the Plurai journey — this is only the beginning!
@ilankad23 Congrats on the launch, team! Been heads down on this one for a while; feels great to finally share it.
@ben_wisbih Thank you, we’re so fortunate to have you with us and for the incredible part you play in our journey. We’re just getting started.
Love it. The product looks great and super professional!
I'm just wondering: can it help with any type of model, or only textual models for now?
If I'm working with VLMs, or with LLMs in a pipeline processing audio, still images, or video, can it help with any model as long as it's dealing with language and semantics?
@jodoron @tammy_wolfson2 Indeed! We currently support LLMs, but we're already cooking up more modalities, including vision.
@ilankad23 Really looking forward to it! Launch, and keep launching. Product Hunt pays off in the long term!
the 'LLM as judge breaks at 100ms per call' pain is exactly where most eval pipelines silently rot. you end up with a sampling regime nobody actually trusts. the part i'm curious about is calibration in the wild: when the small model and the original llm-judge disagree on a real production trace, who do you trust, and how do you surface that disagreement to the team? that's usually where these systems either become real or quietly shelfware.
@sebastian_sosa1 We're here if you have any more questions! Let us know what you think once you try it out!
the 'disagreement as training signal, not production decision' shift is the part most teams haven't internalized yet. multi-agent debate as the consensus mechanism for ground truth side-steps the human-in-the-loop bottleneck without giving up calibration, which is the move. one thing i'd love to know: how do you keep the debate from collapsing into 'all judges share priors because they're trained on the same base model'? does the advocate role pull from a meaningfully different distribution, or is it more about role prompting? @tammy_wolfson2
@sebastian_sosa1 Great question — this is exactly where systems either become real infra or shelfware.
Our approach (from the paper): don't resolve disagreement in production; resolve it in training. More details here: https://huggingface.co/papers/2604.25203
We establish ground truth via multi-agent debate (judges + advocate). Only samples that reach consensus make it into training; the rest are refined or discarded.
This gives us a high-fidelity dataset, so the SLM is trained on much cleaner signals than raw LLM judgments.
In production, disagreements (SLM vs LLM/spec) are treated as high-value edge cases and fed back into the same loop (debate → refine → retrain).
So instead of asking “who do you trust?”, the system continuously earns trust by learning from disagreement.
That shift is what makes it actually work.
@sebastian_sosa1 Exactly, that’s where most systems break.
Our take: you don’t “pick a winner” in production.
Ground truth is established offline via multi-agent debate, so the SLM is trained on high-fidelity labels (not raw LLM judgments).
In production, disagreements are surfaced explicitly as high-signal events (not hidden in sampling).
Those cases get fed back into the loop (debate → refine → retrain), so the system keeps calibrating on real traffic.
So the SLM becomes the trusted real-time guardrail, while the LLM judge is used more for auditing / drift detection.
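Roughly, in pseudocode - a sketch of the pattern with hypothetical names, not our implementation:

```python
from collections import deque

retrain_queue = deque()  # edge cases routed back to debate -> refine -> retrain

def slm_guardrail(trace) -> bool: ...  # real-time guardrail (stub)
def llm_judge(trace) -> bool: ...      # slower auditing judge (stub)

def handle(trace) -> bool:
    verdict = slm_guardrail(trace)  # the SLM's call is what production acts on
    audit = llm_judge(trace)        # in practice: async and/or sampled
    if audit != verdict:
        # Disagreement isn't resolved here; it's surfaced as a
        # high-signal edge case for offline relabeling and retraining.
        retrain_queue.append(trace)
    return verdict
```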
More details in the paper: https://huggingface.co/papers/2604.25203
Curious — how are you handling these disagreements today?
@ilankad23 different shape on noemica side, we run synthetic personas through real user flows so the eval is closer to UX research than judge-correctness. but yeah, disagreement-as-signal carries over. two personas failing differently on the same flow ends up being more useful than them both passing. the messy case is when each is right from its lens and what's broken is positioning, not the build. those still need a human to resolve.
The "always on, not sampled" part is what makes this interesting. When I was running engineering at scale, sampling-based quality checks gave us a false sense of security - the failures always happened in the gaps between samples. The LLM-as-judge approach has the same problem but worse: it's expensive enough that teams only run it on a fraction of requests, and the edge cases it misses are exactly the ones that blow up in production. Sub 100ms with small models changes the economics enough to actually evaluate everything. Curious about the cold start experience - when someone describes a new guardrail in plain language, how much iteration does it typically take before the generated eval catches the subtle violations versus just the obvious ones?
@avrisimon
Avri, the sampling point is exactly right - the failures that matter live in the tail.
On cold start: typically 1–2 iterations. When someone describes a guardrail in plain language, we decompose it into behavioral dimensions and generate synthetic boundary cases specifically designed to probe the subtle violations, not just the obvious ones. There's an adversarial validation step that challenges generated cases before they go into the training set - that's what calibrates for edge cases from the start rather than after manual iteration.
One round of enrichment usually gets you there. The system tends to surface sub-categories the user hadn't articulated - they just confirm or redirect.
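To make the decomposition concrete, a toy sketch - the dimensions below are hand-written stand-ins for what the system infers, and every name here is hypothetical:

```python
import itertools

# Hand-written behavioral dimensions; in practice these are inferred
# from the plain-language description (values here are invented).
dimensions = {
    "topic": ["refund amounts", "delivery promises", "order status"],
    "framing": ["direct request", "hypothetical", "roleplay"],
}

def generate_boundary_case(topic: str, framing: str) -> str:
    # Stand-in for an LLM call that writes a test probing the
    # intersection of two dimensions, where subtle violations live.
    return f"User probes '{topic}' via a '{framing}'."

def adversarial_filter(case: str) -> bool:
    # Stand-in for the validation step that challenges each generated
    # case before it can enter the training set.
    return True

cases = []
for topic, framing in itertools.product(*dimensions.values()):
    case = generate_boundary_case(topic, framing)
    if adversarial_filter(case):
        cases.append(case)
```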
@avrisimon We're here if you have any more questions! Let us know what you think once you try it out!
@avrisimon Exactly! The failures live in the gaps.
On cold start, the first version usually catches the obvious stuff. The subtle violations come after 1–2 quick iterations, where we:
generate boundary cases (not random samples)
filter them via multi-agent debate (to avoid noisy labels)
retrain and tighten the guardrail
From there, real traffic just keeps improving it.
Curious — what kind of “slipped to prod” failures hit you the hardest?
sampling-only eval has a real blind spot: anything that doesn't repeat doesn't get caught. ran into the same building eval flows for an AI form filler we work on — by the time a flaky failure shows up twice, you've already shipped it.
the part i can't quite picture is how the multi-agent debate establishes ground truth without existing failure modes — adversarial generation against the task spec is one read, test-time disagreement is the other. one of those would explain how the BARRED setup actually converges.
@webappski Great point, that “doesn’t repeat → doesn’t get caught” blind spot is exactly what bites in production. On the ground truth side: it’s a mix of both. We start with adversarial generation against the task spec to surface candidate failures, then use multi-agent debate to stress and validate them until there’s strong agreement — that loop is what makes it converge without needing pre-labeled failure modes.
@webappski We're here if you have any more questions! Let us know what you think once you try it out!
@webappski on BARRED: ground truth isn't inferred from observed failures, it's constructed from the task spec. we decompose the policy into semantic dimensions and generate boundary cases at the intersections, the places where a generic judge is most likely to disagree with itself. multi-agent debate filters out cases that don't converge, so only label-clean boundaries make it into training. that's why a 3B model beats LLM-as-judge on edge cases, not smarter, just calibrated where it matters.
happy to share the paper if useful.
@omri_sela2 the "decompose policy into semantic dimensions, generate boundary cases at intersections" framing is the cleanest articulation of the problem i've seen. would love the paper — that detail matters for whether the approach is reproducible outside your stack or specific to BARRED itself.
@ilankad23 makes sense — adversarial generation seeding candidates, debate filtering for convergence. that's the piece i was missing.
The multi-agent debate validation is the part I want to understand better. How do you keep the debate from converging on the same model's biases? Different model families per agent, or the same base with different role prompts? Asking because validation-by-consensus often inherits failure modes from the underlying judge, and avoiding that is the actual hard problem.
@fredcallagan Great question! And I agree: naive consensus can just amplify the same judge's bias.
In BARRED, the debate is intentionally asymmetric, not just “ask 3 judges and average.” We use a rigid Advocate that must defend the target label, and independent Judges that challenge it over multiple rounds. If the Advocate cannot convince the Judges, the sample is rejected or refined using the dissenting feedback. This is designed to stress-test label faithfulness, especially on boundary cases.
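A toy version of that loop (names, round count, and the refine policy are simplified for illustration, not the exact BARRED procedure):

```python
def advocate_defend(sample, label, objections):
    """Stub: argue FOR the target label, addressing prior objections."""
    return f"defense of {label}"

def judge_challenge(sample, label, defense):
    """Stub judge: return None when convinced, else an objection string."""
    return None

def debate(sample, target_label, judges, max_rounds=3):
    objections = []
    for _ in range(max_rounds):
        defense = advocate_defend(sample, target_label, objections)
        objections = [o for o in (j(sample, target_label, defense) for j in judges) if o]
        if not objections:  # consensus: only label-clean samples enter training
            return "accept", sample
    # The Advocate failed to convince the Judges: refine the sample
    # using the dissenting feedback, or discard it.
    return "refine_or_discard", objections

status, payload = debate(
    {"text": "example boundary case"}, "violation", judges=[judge_challenge] * 3
)
```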
Thank you!