Vibe training for AI agent reliability. Describe what your agent should and should not do, and Plurai generates training data, validates it, and deploys a custom model in minutes. It feels like vibe coding, but for evaluation and guardrails. No labeled data. No annotation pipeline. No prompt engineering. Under the hood, small language models deliver sub-100ms latency, 8x lower cost than GPT-as-judge, and over 43% fewer failures. Always on, not sampled. Built on published research (BARRED).

Asa.team
The part that stands out to me is the economics argument. LLM-as-judge at 100ms per call means you're forced to sample, and failures happen in the gaps between samples. That's a real problem we've run into.
Curious about the drift question though: once the agent's prompt or tool surface changes, how much of the vibe-training do you have to redo? Is there a way to do incremental updates or does a significant prompt change basically mean starting fresh?
Also interested in whether the small model you deploy is hosted by Plurai or exportable. For anything touching sensitive data the deployment model matters a lot.
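To make the sampling gap from this comment concrete, here is a back-of-the-envelope sketch in Python. The traffic volume, failure rate, and sample rate are made-up numbers for illustration, not Plurai figures. The only claim is arithmetic: with uniform random sampling, the expected share of failures that is never evaluated equals one minus the sample rate.

```python
def expected_missed_failures(total_calls: int, failure_rate: float, sample_rate: float) -> float:
    """Expected number of failing interactions that are never evaluated,
    assuming each interaction is sampled independently with probability sample_rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    expected_failures = total_calls * failure_rate
    return expected_failures * (1.0 - sample_rate)


# Hypothetical traffic: 100,000 agent interactions, 2% of which fail.
missed_sampled = expected_missed_failures(100_000, 0.02, sample_rate=0.05)
missed_always_on = expected_missed_failures(100_000, 0.02, sample_rate=1.0)

print(f"5% sampling: about {missed_sampled:.0f} failures never evaluated")
print(f"always on:   about {missed_always_on:.0f} failures never evaluated")
```

Under these assumed numbers, evaluating 5% of traffic leaves roughly 95% of failures unseen, which is the gap an always-on evaluator is meant to close.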
Plurai
@ng_junsheng Exactly, that sampling gap is one of the core reasons we built it.
When the agent prompt or tool surface changes, you don’t need to start from scratch. With Vibe-training you can start from the existing dataset, refine or expand the relevant edge cases, and fine-tune the SLM with the updated dataset.
For major product or policy changes, you may need to recalibrate the task definition, but the previous dataset remains a useful starting point.
On deployment, Plurai can host it, but for sensitive use cases we also support private VPC / on-prem deployment, so the evaluator runs close to the customer’s data and production stack.
This is a really clever approach to the eval problem. As someone who's spent way too many hours trying to wrangle GPT-4 into being a consistent judge for my agent outputs, the "vibe training" framing actually makes a lot of sense — describing behavior in natural language rather than crafting elaborate rubrics.
The sub-100ms latency is what catches my attention most. For agents that need real-time guardrails (not just batch evaluation), that's the difference between usable and not usable in production.
Curious how this handles edge cases that emerge after deployment — is there a feedback loop to refine the model when it misses something in the wild?
Plurai
@robin_heinsohn - everything we do is exactly aligned with your analysis here - consistency and latency were both fundamental metrics in our solution design.
Regarding the feedback loop - spot on again - yes, we go the extra mile and provide a feedback loop along with fix suggestions - it's part of the enterprise offering. If you're interested, drop us a note at reutv@plurai.ai
Are your evaluation algorithms backed by science? Do you have any peer-reviewed papers?
There is a lot of noise in this space.
Plurai
@sha_maayan Of course! Our approach is backed by the research and benchmarks we’ve published — you can check it out here: https://huggingface.co/papers/2604.25203
Would love to hear your thoughts on both the method and the product.
Plurai
@sha_maayan We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@sha_maayan looking forward to your feedback!
TabAI
The multi-turn simulation piece is interesting.
Single prompt evals are easy, but most real failures happen across a sequence of interactions.
If this actually captures that well, that’s a meaningful step up from most eval tooling I’ve seen.
Plurai
@igor_martinyuk exactly. that's one of the challenges we have been facing and a main differentiator
Plurai
@igor_martinyuk Exactly! Most real failures aren’t single turns; they’re stateful across interactions.
That’s why we simulate multi-turn flows and generate edge cases across the sequence, not just isolated prompts, including those “looks fine at each step, breaks at the end” scenarios.
Curious to hear: what kind of multi-turn failures have you seen most often?
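A toy illustration of the "looks fine at each step, breaks at the end" pattern, for readers who want something concrete. The refund policy, the limits, and every name below are hypothetical and are not Plurai's API or method. Each turn passes a per-turn check, while a check over the whole conversation catches the cumulative violation.

```python
from dataclasses import dataclass

PER_TURN_LIMIT = 100     # hypothetical policy: max refund granted in a single turn
PER_SESSION_LIMIT = 150  # hypothetical policy: max total refund across one conversation


@dataclass
class Turn:
    user_message: str
    refund_granted: int  # amount the agent granted in this turn


def turn_ok(turn: Turn) -> bool:
    """Single-turn check: looks only at the current turn."""
    return turn.refund_granted <= PER_TURN_LIMIT


def conversation_ok(turns: list[Turn]) -> bool:
    """Stateful check: tracks the running total across all turns."""
    total = 0
    for turn in turns:
        total += turn.refund_granted
        if not turn_ok(turn) or total > PER_SESSION_LIMIT:
            return False
    return True


conversation = [
    Turn("My order arrived damaged.", 80),
    Turn("The replacement was late too.", 60),
    Turn("And I was charged shipping twice.", 40),
]

per_turn_results = [turn_ok(t) for t in conversation]
session_result = conversation_ok(conversation)
print(per_turn_results, session_result)
```

Every turn stays under the per-turn limit, yet the running total exceeds the session limit by the third turn, so only the stateful check flags the conversation.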
Plurai
@igor_martinyuk We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@igor_martinyuk glad you love it!
Vibe training is such a good framing, finally something that matches how teams actually think about agent behavior. cheers team 🙌
BTW, what happens when two guardrails conflict with each other at runtime?
Plurai
@boyuan_deng1 thank you :) we're also obsessed with the framing 🤩
each guardrail returns its classification and reasoning, and your "state machine" can figure out how to mediate between the two with the full context
Plurai
@reut_v_plurai great answer, and "state machine" is exactly the right mental model here 🎯
Plurai
@boyuan_deng1 means a lot, we're obsessed with that framing too 🙌 @reut_v_plurai nailed the answer below — each guardrail returns its classification + reasoning, so your logic layer has full context to resolve conflicts. Not just verdicts, actual signal.
Plurai
@boyuan_deng1 did you get a chance to try the product? Curious what you think
Plurai
@boyuan_deng1 each guardrail returns its classification, confidence and reasoning so you have the full context
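The replies above describe each guardrail returning a classification, a confidence, and reasoning, with the caller's own logic resolving conflicts. A minimal sketch of such a logic layer might look like the following. The `GuardrailResult` shape, the guardrail names, the priority table, the confidence threshold, and the "highest-priority confident block wins" rule are all assumptions made for illustration, not Plurai's actual API or a recommended policy.

```python
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    name: str            # hypothetical guardrail identifier
    classification: str  # "allow" or "block" in this sketch
    confidence: float    # 0.0 to 1.0
    reasoning: str


# Hypothetical policy: compliance outranks tone when verdicts disagree.
PRIORITY = {"compliance": 2, "tone": 1}
MIN_CONFIDENCE = 0.7  # assumed threshold below which a block is ignored


def resolve(results: list[GuardrailResult]) -> tuple[str, str]:
    """Return (decision, explanation) from possibly conflicting verdicts."""
    confident_blocks = [
        r for r in results
        if r.classification == "block" and r.confidence >= MIN_CONFIDENCE
    ]
    if not confident_blocks:
        return "allow", "no confident blocking verdict"
    # Pick the blocking guardrail with the highest priority, then confidence.
    winner = max(confident_blocks, key=lambda r: (PRIORITY.get(r.name, 0), r.confidence))
    return "block", f"{winner.name}: {winner.reasoning}"


results = [
    GuardrailResult("tone", "allow", 0.90, "reply is polite"),
    GuardrailResult("compliance", "block", 0.85, "reply promises a guaranteed return"),
]
decision = resolve(results)
print(decision)
```

Because each verdict carries its reasoning, the resolved decision can explain which guardrail won and why, rather than returning a bare pass or fail.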
Oh, this looks really cool, esp the idea of running evals on every interaction (not just samples). Just curious, how does it perform on more subjective tasks though))) And congrats on the launch, btw :)
Plurai
@natalie_ermishina Great question, Natalie! We use an 'intent calibration' process that fine-tunes evals and guardrails to match your subjective expectations. We generate a custom training set to demonstrate the classification, then let you iterate via an agentic experience until the results are exactly where you want them
@reut_v_plurai Thanks )) It makes sense) The iteration part and 'intent calibration' sound esp valuable for subjective cases! ))
Plurai
@natalie_ermishina Thanks a lot, really appreciate it!
Great question on subjective tasks; that’s actually where this approach becomes even more interesting. Instead of relying on a generic judge, we define subjectivity explicitly (via the spec / examples), and then generate diverse boundary cases around that intent. The key is that labels aren’t coming from a single model; they’re validated through multi-agent debate, which helps reduce ambiguity and noise in more nuanced cases.
In practice, we’ve seen that once the SLM is trained on this kind of task-specific, high-fidelity data, it handles subjective criteria (tone, style, compliance, etc.) much more consistently than LLM-as-a-judge setups.
We go deeper into this (and share benchmarks) in the paper:
https://huggingface.co/papers/2604.25203
Would love to hear what kind of subjective evals you’re thinking about, that’s exactly where things get interesting 🙂
Plurai
@natalie_ermishina We're here if you have any more questions! Let us know what you think once you try it out!
Plurai
@natalie_ermishina Thanks a lot — really appreciate it!
On subjective tasks, we make the criteria explicit (spec + examples), generate boundary cases, and validate them with multi-agent debate — that’s what makes it consistent in practice
We shared more details here: https://huggingface.co/papers/2604.25203
Curious — what kind of subjective evals are you dealing with today?
@ilankad23 Thank you for the reply, gonna have a look at the links first :)
Plurai
@natalie_ermishina glad you liked it, thank you!
You mentioned 43% fewer failures. Was that averaged across all task types, or does the industry have specific benchmarks for that?
Plurai
@michael_vavilov Great question!
The 43% fewer failures figure comes from our research benchmarks across multiple tasks (conversational policies, agent workflows, compliance), not a single narrow use case. In the paper, we evaluate across different domains and datasets, and consistently see that task-specific models trained with our method outperform LLM-as-a-judge baselines and generic guardrails.
If you want the full breakdown (datasets, tasks, and comparisons), we shared it here:
https://huggingface.co/papers/2604.25203
Curious what kind of failures you’re measuring today?
Plurai
@michael_vavilov We're here if you have any more questions! Let us know what you think once you try it out!