Launching today

Plurai
Vibe-train evals and guardrails tailored to your use case
93 followers
Vibe training for AI agent reliability. Describe what your agent should and should not do, and Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).

Plurai
Hey Product Hunt, Ilan from Plurai here.
We spent the last year on a research problem: can you train a production-grade eval or guardrail from just a task description, no labeled data, no annotation pipeline?
Turns out you can. We call it vibe-training.
Most teams today rely on LLM-as-judge. It never fully converges, breaks on edge cases, and at 100ms per call it collapses economically at scale. So teams sample instead of evaluating everything. Failures happen between the samples, invisibly.
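To make the economics concrete, here's a rough back-of-envelope sketch. The volume and per-call prices are illustrative assumptions, not our benchmarks; only the 8x ratio comes from the results below.

```python
# Back-of-envelope: why per-call judge cost pushes teams to sample.
# Volume and per-call prices are illustrative assumptions; only the
# 8x cost ratio is taken from the results quoted below.
requests_per_day = 1_000_000
judge_cost_per_call = 0.002            # USD, assumed LLM-judge price
slm_cost_per_call = judge_cost_per_call / 8

judge_everything = requests_per_day * judge_cost_per_call   # $2,000/day
judge_1pct_sample = judge_everything * 0.01                 # $20/day
slm_everything = requests_per_day * slm_cost_per_call       # $250/day

print(f"LLM judge, every request: ${judge_everything:,.0f}/day")
print(f"LLM judge, 1% sample:     ${judge_1pct_sample:,.0f}/day")
print(f"SLM, every request:       ${slm_everything:,.0f}/day")
```

Sampling keeps the judge affordable, but only by leaving most traffic unevaluated.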
Plurai lets you describe what your agent should and should not do. The platform generates training data, validates it through a multi-agent debate process, and deploys a custom small language model in minutes.
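In code terms, the flow looks roughly like this. This is a hypothetical sketch: `plurai.Client`, `train_guardrail`, and `evaluate` are illustrative names, not our actual SDK.

```python
# Hypothetical sketch of the vibe-training flow described above.
# `plurai`, `train_guardrail`, and `evaluate` are illustrative names,
# not the actual SDK surface.
import plurai

client = plurai.Client(api_key="...")

# One plain-language description; no labeled data, no annotation pipeline.
guardrail = client.train_guardrail(
    description="The agent must never promise a refund amount "
                "before the order ID has been verified.",
)
# Behind this call: synthetic data generation, multi-agent debate
# validation, and deployment of a custom small language model.

trace = {"user": "Where's my refund?", "agent": "You'll get $50 back."}
verdict = guardrail.evaluate(trace)    # fast enough to run on every call
if not verdict.passed:
    print("blocked:", verdict.explanation)
```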
Results against GPT-5 LLM-as-judge: over 43% fewer failures, 8x lower cost, sub-100ms latency.
Good enough to run on every interaction, not just a sample.
The research behind it is public.
Try it free at https://app.plurai.ai. I'd love to hear what eval problem you're working on.
@ilankad23 neat product - keep up the great work
Plurai
@fmerian thank you. we're just getting started! excited to see what people build with it.
Plurai
@ilankad23 An exciting day!
Plurai
@tammy_wolfson2 Exciting day! So fortunate to have you with us on the Plurai journey. We’re just getting started!
Plurai
@ilankad23 This is such a great step forward - really inspiring to see this come together. Looking forward to seeing the impact it creates!
Plurai
@arnonmz Thank you Arnon! This is just the beginning, excited to see what people build with it.
Plurai
@ilankad23 It's the closest thing to giving birth 😅 Go Plurai!
Plurai
@reut_v_plurai I wouldn't argue with that, grateful to have you with us on the Plurai journey — this is only the beginning!
Plurai
@ilankad23 Congrats on the launch, team! Been heads down on this one for a while, feels great to finally share it.
Plurai
@ben_wisbih Thank you, we’re so fortunate to have you with us and for the incredible part you play in our journey. We’re just getting started.
This team just coined the concept of vibe-training: real-time, tailored evals for your AI agents, with high accuracy, at a fraction of the cost.
Brilliant.
Plurai
@fmerian <3
Plurai
@fmerian exactly what we did! what a crazy era we live in where it's so hard to coin new concepts!
Plurai
@fmerian Thanks for the support, it means a lot. This is just the beginning of making AI agents reliable in the real world.
The "always on, not sampled" part is what makes this interesting. When I was running engineering at scale, sampling-based quality checks gave us a false sense of security - the failures always happened in the gaps between samples. The LLM-as-judge approach has the same problem but worse: it's expensive enough that teams only run it on a fraction of requests, and the edge cases it misses are exactly the ones that blow up in production. Sub 100ms with small models changes the economics enough to actually evaluate everything. Curious about the cold start experience - when someone describes a new guardrail in plain language, how much iteration does it typically take before the generated eval catches the subtle violations versus just the obvious ones?
Plurai
@avrisimon Avri, the sampling point is exactly right - the failures that matter live in the tail.
On cold start: typically 1–2 iterations. When someone describes a guardrail in plain language, we decompose it into behavioral dimensions and generate synthetic boundary cases specifically designed to probe the subtle violations, not just the obvious ones. There's an adversarial validation step that challenges generated cases before they go into the training set - that's what calibrates for edge cases from the start rather than after manual iteration.
One round of enrichment usually gets you there. The system tends to surface sub-categories the user hadn't articulated - they just confirm or redirect.
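If it helps to see the shape of it, here's a heavily simplified sketch. Every function body is an illustrative stand-in (the real stages are LLM-driven); the point is where adversarial validation sits in the flow.

```python
from dataclasses import dataclass

# Heavily simplified sketch of the cold-start pipeline described above.
# Every function body is an illustrative stand-in; the real stages are
# LLM-driven. It shows where adversarial validation sits in the flow.

@dataclass
class Case:
    prompt: str
    label: str        # "pass" or "violation"
    boundary: bool    # subtle near-miss vs. obvious violation

def decompose(description: str) -> list[str]:
    # Stand-in: split a plain-language guardrail into behavioral dimensions.
    return [d.strip() for d in description.split(";") if d.strip()]

def generate_boundary_cases(dimension: str) -> list[Case]:
    # Stand-in: synthetic generation aimed at the decision boundary.
    return [
        Case(f"near-miss probing: {dimension}", "violation", boundary=True),
        Case(f"clear violation of: {dimension}", "violation", boundary=False),
    ]

def adversarial_accepts(case: Case) -> bool:
    # Stand-in: an adversarial validator challenges each candidate
    # before it is admitted to the training set.
    return bool(case.prompt)

def build_training_set(description: str) -> list[Case]:
    cases: list[Case] = []
    for dim in decompose(description):
        cases += [c for c in generate_boundary_cases(dim)
                  if adversarial_accepts(c)]
    return cases

print(len(build_training_set("no refund promises; no PII in replies")))
```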
Plurai
@avrisimon We're here if you have any more questions! Let us know what you think once you try it out!
sampling-only eval has a real blind spot: anything that doesn't repeat doesn't get caught. ran into the same thing building eval flows for an AI form filler we work on. by the time a flaky failure shows up twice, you've already shipped it.
the part i can't quite picture is how the multi-agent debate establishes ground truth without existing failure modes — adversarial generation against the task spec is one read, test-time disagreement is the other. one of those would explain how the BARRED setup actually converges.
Plurai
@webappski Great point, that “doesn’t repeat → doesn’t get caught” blind spot is exactly what bites in production. On the ground truth side: it’s a mix of both. We start with adversarial generation against the task spec to surface candidate failures, then use multi-agent debate to stress and validate them until there’s strong agreement — that loop is what makes it converge without needing pre-labeled failure modes.
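A toy version of that loop, just to make the shape concrete. `judge_vote` is a trivial stand-in and the threshold is an assumption; in the real system each judge is an LLM agent debating the candidate against the task spec.

```python
from collections import Counter

# Toy sketch of the generate-then-debate validation loop described above.
# `judge_vote` is a trivial stand-in; in the real system each judge is an
# LLM agent debating the candidate against the task spec.

AGREEMENT = 0.8   # assumed consensus threshold
N_JUDGES = 5

def judge_vote(judge_id: int, candidate: str) -> str:
    # Stand-in for one debate agent's verdict on a candidate failure case.
    return "violation" if "refund" in candidate else "pass"

def debate_label(candidate: str) -> str | None:
    votes = Counter(judge_vote(j, candidate) for j in range(N_JUDGES))
    label, count = votes.most_common(1)[0]
    if count / N_JUDGES >= AGREEMENT:
        return label      # strong agreement: admit to the training set
    return None           # no consensus: discard or send to another round

# Candidates come from adversarial generation against the task spec.
for cand in ["promises a refund before verifying", "greets the user"]:
    print(cand, "->", debate_label(cand))
```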
Plurai
@webappski We're here if you have any more questions! Let us know what you think once you try it out!
OpenPlugin
the 'LLM as judge breaks at 100ms per call' pain is exactly where most eval pipelines silently rot. you end up with a sampling regime nobody actually trusts. the part i'm curious about is calibration in the wild: when the small model and the original llm-judge disagree on a real production trace, who do you trust, and how do you surface that disagreement to the team? that's usually where these systems either become real or quietly shelfware.
Plurai
@sebastian_sosa1 We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
We talked to hundreds of AI teams before building this.
The same thing kept coming up: evals are on the roadmap, always. They just never get done. Too slow, too expensive, someone needs to label data, someone needs to set up a pipeline, and suddenly it's a Q3 project that rolls into Q4.
That's the problem we actually solve.
Describe what your agent should and shouldn't do, and you have a custom model running in minutes. Not a prototype. In prod.
Launching today and genuinely excited about it.
Go try it free: app.plurai.ai. Come back and tell me what eval problem you're working on.
Plurai
@omri_sela2 🚀
Plurai
@omri_sela2 can you believe it's finally out??
Plurai
@reut_v_plurai our baby 👶
Plurai
Hello world, I'm the product behind the product :)
Vibe training is here to make model training accessible, and to help your agents and LLM apps actually work in production.
Also, we obsessed over both the tech and the UX, so we can't wait to hear your feedback!
Plurai
@reut_v_plurai Great to work with you
Plurai
@reut_v_plurai so glad to work with you on this
Plurai
@reut_v_plurai The product behind the product and what a product it is.
Amazing work, couldn’t have done it without you, so lucky to have you with us ❤️