Launching today

Plurai
Vibe-train evals and guardrails tailored to your use case
93 followers
Vibe training for AI agent reliability. Describe what your agent should and should not do, and Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).

Plurai
Hey Product Hunt, Ilan from Plurai here.
We spent the last year on a research problem: can you train a production-grade eval or guardrail from just a task description, no labeled data, no annotation pipeline?
Turns out you can. We call it vibe-training.
Most teams today rely on LLM-as-judge. It never fully converges, breaks on edge cases, and at 100ms per call it collapses economically at scale. So teams sample instead of evaluating everything. Failures happen between the samples, invisibly.
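To make the economics concrete, here's a rough back-of-envelope sketch. The volume and per-call prices are illustrative assumptions, not our benchmarks; only the 8x ratio comes from the results below.

```python
# Back-of-envelope: why per-call judge cost pushes teams to sample.
# Volume and per-call prices are illustrative assumptions; only the
# 8x cost ratio is taken from the results quoted below.
requests_per_day = 1_000_000
judge_cost_per_call = 0.002            # USD, assumed LLM-judge price
slm_cost_per_call = judge_cost_per_call / 8

judge_everything = requests_per_day * judge_cost_per_call   # $2,000/day
judge_1pct_sample = judge_everything * 0.01                 # $20/day
slm_everything = requests_per_day * slm_cost_per_call       # $250/day

print(f"LLM judge, every request: ${judge_everything:,.0f}/day")
print(f"LLM judge, 1% sample:     ${judge_1pct_sample:,.0f}/day")
print(f"SLM, every request:       ${slm_everything:,.0f}/day")
```

Sampling keeps the judge affordable, but only by leaving most traffic unevaluated.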
Plurai lets you describe what your agent should and should not do. The platform generates training data, validates it through a multi-agent debate process, and deploys a custom small language model in minutes.
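In code terms, the flow looks roughly like this. This is a hypothetical sketch: `plurai.Client`, `train_guardrail`, and `evaluate` are illustrative names, not our actual SDK.

```python
# Hypothetical sketch of the vibe-training flow described above.
# `plurai`, `train_guardrail`, and `evaluate` are illustrative names,
# not the actual SDK surface.
import plurai

client = plurai.Client(api_key="...")

# One plain-language description; no labeled data, no annotation pipeline.
guardrail = client.train_guardrail(
    description="The agent must never promise a refund amount "
                "before the order ID has been verified.",
)
# Behind this call: synthetic data generation, multi-agent debate
# validation, and deployment of a custom small language model.

trace = {"user": "Where's my refund?", "agent": "You'll get $50 back."}
verdict = guardrail.evaluate(trace)    # fast enough to run on every call
if not verdict.passed:
    print("blocked:", verdict.explanation)
```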
Results against GPT-5 LLM-as-judge: over 43% fewer failures, 8x lower cost, sub-100ms latency.
Good enough to run on every interaction, not just a sample.
The research behind it is public.
Try it free at https://app.plurai.ai. I'd love to hear what eval problem you're working on.
@ilankad23 neat product - keep up the great work
Plurai
@fmerian thank you. we're just getting started! excited to see what people build with it.
Plurai
@ilankad23 An exciting day!
Plurai
@tammy_wolfson2 Exciting day! So fortunate to have you with us on the Plurai journey. We’re just getting started!
Plurai
@ilankad23 This is such a great step forward - really inspiring to see this come together. Looking forward to seeing the impact it creates!
Plurai
@arnonmz Thank you Arnon! This is just the beginning, excited to see what people build with it.
Plurai
@ilankad23 It's the closest thing to giving birth 😅 Go Plurai!
Plurai
@reut_v_plurai I wouldn't argue with that, grateful to have you with us on the Plurai journey — this is only the beginning!
Plurai
@ilankad23 Congrats on the launch, team! Been heads down on this one for a while, feels great to finally share it.
Plurai
@ben_wisbih Thank you, we’re so fortunate to have you with us and for the incredible part you play in our journey. We’re just getting started.
This team just coined the concept of vibe-training: real-time, tailored evals for your AI agents, with high accuracy, at a fraction of the cost.
Brilliant.
Plurai
@fmerian <3
Plurai
@fmerian exactly what we did! what a crazy era we live in where it's so hard to coin new concepts!
Plurai
@fmerian Thanks for the support, it means a lot. This is just the beginning of making AI agents reliable in the real world.
The "always on, not sampled" part is what makes this interesting. When I was running engineering at scale, sampling-based quality checks gave us a false sense of security - the failures always happened in the gaps between samples. The LLM-as-judge approach has the same problem but worse: it's expensive enough that teams only run it on a fraction of requests, and the edge cases it misses are exactly the ones that blow up in production. Sub 100ms with small models changes the economics enough to actually evaluate everything. Curious about the cold start experience - when someone describes a new guardrail in plain language, how much iteration does it typically take before the generated eval catches the subtle violations versus just the obvious ones?
Plurai
@avrisimon Avri, the sampling point is exactly right - the failures that matter live in the tail.
On cold start: typically 1–2 iterations. When someone describes a guardrail in plain language, we decompose it into behavioral dimensions and generate synthetic boundary cases specifically designed to probe the subtle violations, not just the obvious ones. There's an adversarial validation step that challenges generated cases before they go into the training set - that's what calibrates for edge cases from the start rather than after manual iteration.
One round of enrichment usually gets you there. The system tends to surface sub-categories the user hadn't articulated - they just confirm or redirect.
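If it helps to see the shape of it, here's a heavily simplified sketch. Every function body is an illustrative stand-in (the real stages are LLM-driven); the point is where adversarial validation sits in the flow.

```python
from dataclasses import dataclass

# Heavily simplified sketch of the cold-start pipeline described above.
# Every function body is an illustrative stand-in; the real stages are
# LLM-driven. It shows where adversarial validation sits in the flow.

@dataclass
class Case:
    prompt: str
    label: str        # "pass" or "violation"
    boundary: bool    # subtle near-miss vs. obvious violation

def decompose(description: str) -> list[str]:
    # Stand-in: split a plain-language guardrail into behavioral dimensions.
    return [d.strip() for d in description.split(";") if d.strip()]

def generate_boundary_cases(dimension: str) -> list[Case]:
    # Stand-in: synthetic generation aimed at the decision boundary.
    return [
        Case(f"near-miss probing: {dimension}", "violation", boundary=True),
        Case(f"clear violation of: {dimension}", "violation", boundary=False),
    ]

def adversarial_accepts(case: Case) -> bool:
    # Stand-in: an adversarial validator challenges each candidate
    # before it is admitted to the training set.
    return bool(case.prompt)

def build_training_set(description: str) -> list[Case]:
    cases: list[Case] = []
    for dim in decompose(description):
        cases += [c for c in generate_boundary_cases(dim)
                  if adversarial_accepts(c)]
    return cases

print(len(build_training_set("no refund promises; no PII in replies")))
```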
Plurai
@avrisimon We're here if you have any more questions! Let us know what you think once you try it out!
sampling-only eval has a real blind spot: anything that doesn't repeat doesn't get caught. ran into the same thing building eval flows for an AI form filler we work on. by the time a flaky failure shows up twice, you've already shipped it.
the part i can't quite picture is how the multi-agent debate establishes ground truth without existing failure modes — adversarial generation against the task spec is one read, test-time disagreement is the other. one of those would explain how the BARRED setup actually converges.
Plurai
@webappski Great point, that “doesn’t repeat → doesn’t get caught” blind spot is exactly what bites in production. On the ground truth side: it’s a mix of both. We start with adversarial generation against the task spec to surface candidate failures, then use multi-agent debate to stress and validate them until there’s strong agreement — that loop is what makes it converge without needing pre-labeled failure modes.
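A toy version of that loop, just to make the shape concrete. `judge_vote` is a trivial stand-in and the threshold is an assumption; in the real system each judge is an LLM agent debating the candidate against the task spec.

```python
from collections import Counter

# Toy sketch of the generate-then-debate validation loop described above.
# `judge_vote` is a trivial stand-in; in the real system each judge is an
# LLM agent debating the candidate against the task spec.

AGREEMENT = 0.8   # assumed consensus threshold
N_JUDGES = 5

def judge_vote(judge_id: int, candidate: str) -> str:
    # Stand-in for one debate agent's verdict on a candidate failure case.
    return "violation" if "refund" in candidate else "pass"

def debate_label(candidate: str) -> str | None:
    votes = Counter(judge_vote(j, candidate) for j in range(N_JUDGES))
    label, count = votes.most_common(1)[0]
    if count / N_JUDGES >= AGREEMENT:
        return label      # strong agreement: admit to the training set
    return None           # no consensus: discard or send to another round

# Candidates come from adversarial generation against the task spec.
for cand in ["promises a refund before verifying", "greets the user"]:
    print(cand, "->", debate_label(cand))
```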
Plurai
@webappski We're here if you have any more questions! Let us know what you think once you try it out!
OpenPlugin
the 'LLM as judge breaks at 100ms per call' pain is exactly where most eval pipelines silently rot. you end up with a sampling regime nobody actually trusts. the part i'm curious about is calibration in the wild: when the small model and the original llm-judge disagree on a real production trace, who do you trust, and how do you surface that disagreement to the team? that's usually where these systems either become real or quietly shelfware.
Plurai
@sebastian_sosa1 We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
We talked to hundreds of AI teams before building this.
The same thing kept coming up: evals are on the roadmap, always. They just never get done. Too slow, too expensive, someone needs to label data, someone needs to set up a pipeline, and suddenly it's a Q3 project that rolls into Q4.
That's the problem we actually solve.
Describe what your agent should and shouldn't do, and you have a custom model running in minutes. Not a prototype. In prod.
Launching today and genuinely excited about it.
Go try it free: app.plurai.ai. Come back and tell me what eval problem you're working on.
Plurai
@omri_sela2 🚀
Plurai
@omri_sela2 can you believe it's finally out??
Plurai
@reut_v_plurai our baby 👶
Plurai
Hello world, I'm the product behind the product :)
Vibe training is here to make model training accessible, and to help your agents and LLM apps actually work in production.
Also, we obsessed over both the tech and the UX, so we can't wait to hear your feedback!
Plurai
@reut_v_plurai Great to work with you
Plurai
@reut_v_plurai so glad to work with you on this
Plurai
@reut_v_plurai The product behind the product and what a product it is.
Amazing work, couldn’t have done it without you, so lucky to have you with us ❤️