Plurai is launching on Product Hunt this week, introducing the first vibe-training platform to build real-time, tailored evals for your AI agents, with high accuracy, at a fraction of the cost of LLM-based judges.
I had the opportunity to collaborate with their team on this first launch after months in stealth mode - no pressure - and wanted to share some insights on how we prepped it.
We were running a customer-facing agent in production for about three months before we started using Plurai. Everything looked fine on the surface. Then we ran it through their evaluation pipeline and found a bunch of edge cases we never would have caught manually — responses that were technically correct but violated our policies in ways we hadn't fully defined yet.
That's what actually sold me. Not the benchmarks, though those are real. It was the realization that our previous "testing" was basically vibes. Plurai turned that into something measurable.
The thing I use most is the guardrail endpoint. Sub-100ms, fits into our existing stack without replacing anything. I was skeptical that a small custom model could outperform GPT-4-based judges, but the accuracy on our specific use case is noticeably better — and cheaper by a lot.
Setup was surprisingly fast. I described what the agent should and shouldn't do in plain language, it generated boundary cases I hadn't thought of, and I had an endpoint to test against within the same day.
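To give a sense of the flow, here's a minimal sketch of that loop. The endpoint paths, payload fields, and response shape below are placeholders for illustration, not Plurai's documented API:

```python
import requests

# Hypothetical endpoint paths and payload shapes, for illustration only --
# not Plurai's documented API.
BASE = "https://app.plurai.ai/api"  # assumed base URL

# 1. Describe what the agent should and shouldn't do in plain language.
spec = {
    "name": "support-agent-guardrail",
    "description": (
        "The agent may discuss order status, returns, and shipping. "
        "It must never quote refund amounts, promise delivery dates, "
        "or give legal advice."
    ),
}
guardrail = requests.post(f"{BASE}/guardrails", json=spec, timeout=30).json()

# 2. Test a candidate agent response against the trained guardrail.
result = requests.post(
    f"{BASE}/guardrails/{guardrail['id']}/check",
    json={
        "input": "Where is my package?",
        "output": "It will definitely arrive Tuesday.",
    },
    timeout=5,
).json()
print(result)  # assumed shape: {"pass": false, "violation": "promised a delivery date"}
```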
The UI for reviewing synthetic test cases could be faster — scrolling through 50+ examples to find the interesting ones takes more clicks than it should. Some kind of filtering by confidence score or edge-case type would help. Documentation is also improving but still a bit thin in places, especially for the LLM optimization path vs. the SLM path and when to choose which.
Looked at LangSmith and Arize for observability — both solid, but they tell you what happened, not whether it was okay. That's a different problem. Also evaluated LlamaGuard and a homegrown prompt-based judge. LlamaGuard's taxonomy was too rigid for our use case. The prompt judge was unpredictable — good on simple cases, unreliable on anything nuanced. Building something custom would have taken months and a dataset we didn't have.
Plurai handled the dataset problem. That was the blocker everywhere else.
Hey Product Hunt, Ilan from Plurai here.
We spent the last year on a research problem: can you train a production-grade eval or guardrail from just a task description, no labeled data, no annotation pipeline?
Turns out you can. We call it vibe-training.
Most teams today rely on LLM as a judge. It never fully converges, breaks on edge cases, and at 100ms per call it collapses economically at scale. So teams sample instead of evaluating everything. Failures happen between the samples, invisibly.
Plurai lets you describe what your agent should and should not do. The platform generates training data, validates it through a multi-agent debate process, and deploys a custom small language model in minutes.
Results against GPT-5 LLM-as-judge: over 43% fewer failures, 8x lower cost, sub-100ms latency.
Good enough to run on every interaction, not just a sample.
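To illustrate the shift in concrete terms - stub functions and an assumed 5% sampling rate, purely illustrative:

```python
import random

def llm_judge(interaction) -> bool:
    """Expensive, slow LLM-as-a-judge call (stub)."""
    ...

def slm_guardrail(interaction) -> bool:
    """Cheap, sub-100ms small-model guardrail call (stub)."""
    ...

# Sampling regime: 95% of traffic is never evaluated, and the
# failures that matter tend to live in that unobserved gap.
def evaluate_sampled(interaction):
    if random.random() < 0.05:
        return llm_judge(interaction)
    return None  # unevaluated

# Always-on regime: cheap enough to run inline on every interaction.
def evaluate_always_on(interaction):
    return slm_guardrail(interaction)
```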
The research behind it is public.
Try it free at https://app.plurai.ai. I'd love to hear what eval problem you're working on.
@tammy_wolfson2 Exciting day! So fortunate to have you with us on the Plurai journey, we’re just getting started!
@ilankad23 This is such a great step forward - really inspiring to see this come together. Looking forward to seeing the impact it creates!
@reut_v_plurai I wouldn't argue with that, grateful to have you with us on the Plurai journey — this is only the beginning!
@ilankad23 Congrats on the launch, team! Been heads down on this one for a while; feels great to finally share it.
@ben_wisbih Thank you, we’re so fortunate to have you with us and for the incredible part you play in our journey. We’re just getting started.
Love it. The product looks great and super professional!
I'm just wondering: can it help with any type of model, or only textual models for now?
If I'm working with VLMs, or with LLMs in a pipeline processing audio, still images, or video, can it help with any model as long as it's dealing with language and semantics?
@jodoron @tammy_wolfson2 Indeed! We currently support LLMs, but we're already cooking up more modalities, including vision.
@ilankad23 Really looking forward to it! Launch, and keep launching. Product Hunt pays off in the long term!
the 'LLM as judge breaks at 100ms per call' pain is exactly where most eval pipelines silently rot. you end up with a sampling regime nobody actually trusts. the part i'm curious about is calibration in the wild: when the small model and the original llm-judge disagree on a real production trace, who do you trust, and how do you surface that disagreement to the team? that's usually where these systems either become real or quietly shelfware.
@sebastian_sosa1 We're here if you have any more questions! Let us know what you think once you try it out!
the 'disagreement as training signal, not production decision' shift is the part most teams haven't internalized yet. multi-agent debate as the consensus mechanism for ground truth side-steps the human-in-the-loop bottleneck without giving up calibration, which is the move. one thing i'd love to know: how do you keep the debate from collapsing into 'all judges share priors because they're trained on the same base model'? does the advocate role pull from a meaningfully different distribution, or is it more about role prompting? @tammy_wolfson2
@sebastian_sosa1 Great question — this is exactly where systems either become real infra or shelfware.
Our approach (from the paper): don't resolve disagreement in production; resolve it in training. More details here: https://huggingface.co/papers/2604.25203
We establish ground truth via multi-agent debate (judges + advocate). Only samples that reach consensus make it into training; the rest are refined or discarded.
This gives us a high-fidelity dataset, so the SLM is trained on much cleaner signals than raw LLM judgments.
In production, disagreements (SLM vs LLM/spec) are treated as high-value edge cases and fed back into the same loop (debate → refine → retrain).
So instead of asking “who do you trust?”, the system continuously earns trust by learning from disagreement.
That shift is what makes it actually work.
@sebastian_sosa1 Exactly, that’s where most systems break.
Our take: you don’t “pick a winner” in production.
Ground truth is established offline via multi-agent debate, so the SLM is trained on high-fidelity labels (not raw LLM judgments).
In production, disagreements are surfaced explicitly as high-signal events (not hidden in sampling).
Those cases get fed back into the loop (debate → refine → retrain), so the system keeps calibrating on real traffic.
So the SLM becomes the trusted real-time guardrail, while the LLM judge is used more for auditing / drift detection.
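Roughly, in pseudocode - a sketch of the pattern with hypothetical names, not our implementation:

```python
from collections import deque

retrain_queue = deque()  # edge cases routed back to debate -> refine -> retrain

def slm_guardrail(trace) -> bool: ...  # real-time guardrail (stub)
def llm_judge(trace) -> bool: ...      # slower auditing judge (stub)

def handle(trace) -> bool:
    verdict = slm_guardrail(trace)  # the SLM's call is what production acts on
    audit = llm_judge(trace)        # in practice: async and/or sampled
    if audit != verdict:
        # Disagreement isn't resolved here; it's surfaced as a
        # high-signal edge case for offline relabeling and retraining.
        retrain_queue.append(trace)
    return verdict
```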
More details in the paper: https://huggingface.co/papers/2604.25203
Curious — how are you handling these disagreements today?
@ilankad23 different shape on noemica side, we run synthetic personas through real user flows so the eval is closer to UX research than judge-correctness. but yeah, disagreement-as-signal carries over. two personas failing differently on the same flow ends up being more useful than them both passing. the messy case is when each is right from its lens and what's broken is positioning, not the build. those still need a human to resolve.
The "always on, not sampled" part is what makes this interesting. When I was running engineering at scale, sampling-based quality checks gave us a false sense of security - the failures always happened in the gaps between samples. The LLM-as-judge approach has the same problem but worse: it's expensive enough that teams only run it on a fraction of requests, and the edge cases it misses are exactly the ones that blow up in production. Sub 100ms with small models changes the economics enough to actually evaluate everything. Curious about the cold start experience - when someone describes a new guardrail in plain language, how much iteration does it typically take before the generated eval catches the subtle violations versus just the obvious ones?
@avrisimon
Avri, the sampling point is exactly right - the failures that matter live in the tail.
On cold start: typically 1–2 iterations. When someone describes a guardrail in plain language, we decompose it into behavioral dimensions and generate synthetic boundary cases specifically designed to probe the subtle violations, not just the obvious ones. There's an adversarial validation step that challenges generated cases before they go into the training set - that's what calibrates for edge cases from the start rather than after manual iteration.
One round of enrichment usually gets you there. The system tends to surface sub-categories the user hadn't articulated - they just confirm or redirect.
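To make the decomposition concrete, a toy sketch - the dimensions below are hand-written stand-ins for what the system infers, and every name here is hypothetical:

```python
import itertools

# Hand-written behavioral dimensions; in practice these are inferred
# from the plain-language description (values here are invented).
dimensions = {
    "topic": ["refund amounts", "delivery promises", "order status"],
    "framing": ["direct request", "hypothetical", "roleplay"],
}

def generate_boundary_case(topic: str, framing: str) -> str:
    # Stand-in for an LLM call that writes a test probing the
    # intersection of two dimensions, where subtle violations live.
    return f"User probes '{topic}' via a '{framing}'."

def adversarial_filter(case: str) -> bool:
    # Stand-in for the validation step that challenges each generated
    # case before it can enter the training set.
    return True

cases = []
for topic, framing in itertools.product(*dimensions.values()):
    case = generate_boundary_case(topic, framing)
    if adversarial_filter(case):
        cases.append(case)
```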
@avrisimon We're here if you have any more questions! Let us know what you think once you try it out!
@avrisimon Exactly! The failures live in the gaps.
On cold start, the first version usually catches the obvious stuff. The subtle violations come after 1–2 quick iterations, where we:
generate boundary cases (not random samples)
filter them via multi-agent debate (to avoid noisy labels)
retrain and tighten the guardrail
From there, real traffic just keeps improving it.
Curious — what kind of “slipped to prod” failures hit you the hardest?
sampling-only eval has a real blind spot: anything that doesn't repeat doesn't get caught. ran into the same building eval flows for an AI form filler we work on — by the time a flaky failure shows up twice, you've already shipped it.
the part i can't quite picture is how the multi-agent debate establishes ground truth without existing failure modes — adversarial generation against the task spec is one read, test-time disagreement is the other. one of those would explain how the BARRED setup actually converges.
@webappski Great point, that “doesn’t repeat → doesn’t get caught” blind spot is exactly what bites in production. On the ground truth side: it’s a mix of both. We start with adversarial generation against the task spec to surface candidate failures, then use multi-agent debate to stress and validate them until there’s strong agreement — that loop is what makes it converge without needing pre-labeled failure modes.
@webappski We're here if you have any more questions! Let us know what you think once you try it out!
@webappski on BARRED: ground truth isn't inferred from observed failures, it's constructed from the task spec. we decompose the policy into semantic dimensions and generate boundary cases at the intersections, the places where a generic judge is most likely to disagree with itself. multi-agent debate filters out cases that don't converge, so only label-clean boundaries make it into training. that's why a 3B model beats LLM-as-judge on edge cases, not smarter, just calibrated where it matters.
happy to share the paper if useful.
@omri_sela2 the "decompose policy into semantic dimensions, generate boundary cases at intersections" framing is the cleanest articulation of the problem i've seen. would love the paper — that detail matters for whether the approach is reproducible outside your stack or specific to BARRED itself.
@ilankad23 makes sense — adversarial generation seeding candidates, debate filtering for convergence. that's the piece i was missing.
The multi-agent debate validation is the part I want to understand better. How do you keep the debate from converging on the same model's biases? Different model families per agent, or the same base with different role prompts? Asking because validation-by-consensus often inherits failure modes from the underlying judge, and avoiding that is the actual hard problem.
@fredcallagan Great question! And I agree: naive consensus can just amplify the same judge's bias.
In BARRED, the debate is intentionally asymmetric, not just “ask 3 judges and average.” We use a rigid Advocate that must defend the target label, and independent Judges that challenge it over multiple rounds. If the Advocate cannot convince the Judges, the sample is rejected or refined using the dissenting feedback. This is designed to stress-test label faithfulness, especially on boundary cases.
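A toy version of that loop (names, round count, and the refine policy are simplified for illustration, not the exact BARRED procedure):

```python
def advocate_defend(sample, label, objections):
    """Stub: argue FOR the target label, addressing prior objections."""
    return f"defense of {label}"

def judge_challenge(sample, label, defense):
    """Stub judge: return None when convinced, else an objection string."""
    return None

def debate(sample, target_label, judges, max_rounds=3):
    objections = []
    for _ in range(max_rounds):
        defense = advocate_defend(sample, target_label, objections)
        objections = [o for o in (j(sample, target_label, defense) for j in judges) if o]
        if not objections:  # consensus: only label-clean samples enter training
            return "accept", sample
    # The Advocate failed to convince the Judges: refine the sample
    # using the dissenting feedback, or discard it.
    return "refine_or_discard", objections

status, payload = debate(
    {"text": "example boundary case"}, "violation", judges=[judge_challenge] * 3
)
```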
Thank you!