We taught AI agents when to shut up (and when to talk)
Hey PH!
Deep voice AI technology is the DNA of everything we build at Krisp. Today we're launching VIVA 2.0 SDK, bringing that expertise to voice AI agent developers.
It makes AI agents sound more human. They know when to talk and when to stop. A skill some humans could use too.
The problem is real: voice agents work great in demos, then fall apart in production. Audio is messy, and bots can't read conversations the way humans do. They talk over you. They stop every time you say "uh-huh." They can't tell a real interruption from someone just agreeing.
What's in VIVA 2.0:
Voice Isolation v3: cleans real-world audio and directly reduces word error rate
Turn Prediction v3: predicts end of turn from the rhythm of speech, not silence timers. 47% faster than v2. 12+ languages
Interruption Prediction v1: first audio-only model that tells "mhm, keep going" from "wait, stop." Under a second, under 6% false positives
Signal Detectors: real-time TTS detection, gender, accent classification
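For anyone wondering how an agent might act on the "mhm, keep going" vs. "wait, stop" distinction, here is a rough sketch. It is not the VIVA SDK API: `UserSound`, `Prediction`, `should_stop_speaking`, and the 0.7 threshold are all hypothetical names and values made up for illustration. The only idea taken from the post is that a backchannel should not stop the agent, while a real interruption should.

```python
from dataclasses import dataclass
from enum import Enum


class UserSound(Enum):
    """Hypothetical labels for what the user just did while the agent talks."""
    BACKCHANNEL = "backchannel"    # "mhm", "uh-huh", "right"
    INTERRUPTION = "interruption"  # "wait, stop"


@dataclass
class Prediction:
    """Hypothetical classifier output; not a real SDK type."""
    label: UserSound
    confidence: float  # assumed to be in [0.0, 1.0]


def should_stop_speaking(
    pred: Prediction,
    agent_is_speaking: bool,
    min_confidence: float = 0.7,  # assumed threshold; tune per deployment
) -> bool:
    """Decide whether the agent should cut off its own speech."""
    if not agent_is_speaking:
        # Nothing to stop; normal turn-taking handles this case.
        return False
    # Only a confident interruption stops playback. Backchannels and
    # low-confidence interruptions let the agent keep talking.
    return (
        pred.label is UserSound.INTERRUPTION
        and pred.confidence >= min_confidence
    )
```

Raising `min_confidence` trades fewer false stops for slower reactions to real interruptions, which is the same trade-off a false-positive rate describes.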
Already running inside Daily, Vapi, LiveKit, and some of the largest AI labs. 3.5x better turn-taking, 50% fewer dropped calls.
All new models bundled at existing pricing.
Would love your thoughts. Happy to go deep on any of the technical details in the comments.
Replies
Krisp
Some context on why we built this. We've been doing voice AI for 8 years, processing over a trillion minutes of real-world audio. If you use Discord, our tech is what powers their noise cancellation. We recently won two Webby Awards for Technical Achievement. VIVA 2.0 takes that same voice engine and opens it up for developers building voice agents.
Krisp
Diving deeper on Turn Taking v3. Most voice agents today just wait for silence. User stops talking, bot counts a few seconds of quiet, then responds. That's why every voice agent feels slow and robotic. Our model works differently. It listens to the rhythm and intonation of speech and predicts the end of turn in milliseconds. We tested it against SmartTurn, LiveKit, and Deepgram Flux. Full benchmark table and our public test dataset are in the technical blog.
Technical blog link: /blog/turn-taking-v3
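To make the "wait for silence" baseline concrete, here is a minimal sketch of a silence-timer end-of-turn detector. It illustrates the approach described above as the thing being replaced, not our model. The function name, the 20 ms frame size, and the 800 ms default timeout are assumptions for illustration, and the per-frame speech flags are assumed to come from some voice activity detector.

```python
from typing import Iterable, Optional


def silence_timer_end_of_turn(
    speech_flags: Iterable[bool],
    silence_timeout_ms: int = 800,  # assumed timeout; real systems vary
    frame_ms: int = 20,             # assumed audio frame length
) -> Optional[int]:
    """Return the time in ms at which a silence timer declares end of turn.

    speech_flags holds one boolean per audio frame (True means speech).
    Returns None if the timer never fires.
    """
    heard_speech = False
    silent_ms = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_ms = 0  # any speech resets the timer
        elif heard_speech:
            # Count silence only after the user has started talking.
            silent_ms += frame_ms
            if silent_ms >= silence_timeout_ms:
                return (i + 1) * frame_ms
    return None
```

Take a user who speaks for 500 ms, pauses mid-sentence for 400 ms, then speaks for another 500 ms. An 800 ms timer survives the pause but always responds a full 800 ms after the user actually finishes. A 300 ms timer responds faster but fires inside the pause and talks over the user. No single timeout fixes both problems, which is the gap that predicting end of turn from rhythm and intonation is meant to close.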
I've built support agents in the past, and the pain is real. Excited to explore this.
This is a very real production problem for voice agents. The jump from silence-based turn detection to rhythm/context-aware turn prediction feels especially important, because most awkward AI calls break when the agent interrupts too early or waits too long. The "mhm" vs. "wait, stop" distinction is a small detail that can make the whole experience feel much more human. Congrats on the launch!
Krisp
@alpertayfurr thanks. Yeah, this is definitely one of the things that breaks in production.