Krisp Voice Translation API - Real-time speech-to-speech translation API
Most voice translation APIs work great in demos. Then real users show up with background noise, accents, and verification codes that get garbled. We built our technology on a million live contact center calls, where accuracy is non-negotiable: 96% accuracy on real calls, zero patient safety incidents, and 61+ languages with any-to-any pairs.
The Translation API is now available self-serve, with 60 minutes of free credit when you sign up for the dev dashboard.

Replies
Krisp
Tobira.ai
@asti_pili C++ SDK timeline? Any plans for Go, Rust, or mobile SDKs?
Krisp
@olia_nemirovski The C++ SDK is coming soon, but there's no specific date yet.
RiteKit Company Logo API
@asti_pili This is impressive work—the fact that you've handled over a million minutes of production translation in regulated industries with zero incidents is a strong signal of real reliability. The 96% accuracy on live calls with real-world noise is the kind of number that matters way more than lab benchmarks.
Oh my, I've been looking for something like this for months. Do you have plans to integrate this into your mobile app as native functionality?
Krisp
@volodymyr_demchenko It will definitely come to our consumer product.
@asti_pili Amazing, will be waiting for this!
Refocus
The "works great in demos, then real users show up with background noise and accents" line is exactly the wall we hit building voice AI for older adults. Phone-quality audio and unfamiliar accents break most pipelines that benchmark beautifully. Training on a million real contact-center calls is a smart moat for that reason. One question on the speech-to-speech path: how much added latency does translation introduce over plain transcription, and is it low enough to keep a live call feeling like a natural back-and-forth?
Krisp
@igorgurovich Good questions.
We measure latency in Krisp as the time to first translated audio after a person speaks.
Total time-to-first-translated-audio is approximately 1.5–3 seconds, driven by three factors: context window size, source language structural complexity, and amount of speech. AI inference latency is around 700-800ms here.
Language structural complexity is the primary variable. Languages with word order parallel to the target language (e.g., Spanish) can be translated incrementally as words arrive, resulting in latency toward the lower end.
Languages with high reordering distance, such as Japanese, Korean, or Turkish, are verb-final or agglutinative, so the model has to buffer more context before it can produce a grammatically correct translation, resulting in latency toward the higher end.
The AI latency difference between transcription and translation is ~100ms.
Shark.Health
Impressive to see accuracy claims based on real contact center traffic instead of lab conditions. How does the API handle industry-specific terminology, like healthcare or financial services vocabulary, where a single mistranslation can create major issues?
Krisp
Looking forward to seeing what the Product Hunt community builds with it. We'd love your feedback!
Anchor
Great to see this finally launched. Super useful update to a great tool!
Mailwarm
Congratulations! I will definitely try it.
Krisp
So proud of this launch!
Real-time speech-to-speech at API level means you've solved the three-stage pipeline problem: ASR accuracy, translation context, and TTS naturalness all simultaneously. We've built on streaming audio APIs and the hardest part is always mid-utterance interruptions breaking the translation context. What's your P99 latency for a 10-second utterance, and how do you handle speaker turn overlap?
Krisp
@retain_dev Great questions — you're clearly speaking from experience with the same failure modes we've worked through.
On latency: We don't benchmark against utterance length, and that's deliberate. A 10-second utterance never waits for completion — it's streamed and translated as a series of segments, so utterance-level P99 would mostly measure how long the speaker talked, not how fast the system is. The segment-level numbers are what matter: typical segments run 1.5–2 seconds of speech, with translation delay averaging under a second and staying under ~1 second even at P95. In practice, listeners hear translated audio continuously throughout a long utterance rather than waiting for it to end.
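For anyone who wants to track the same segment-level numbers on the client side, here is a minimal sketch. It assumes you log, per segment, when the speech for that segment ended and when the first translated audio arrived; the `Segment` fields, the nearest-rank percentile, and the sample timestamps are illustrative assumptions, not part of the Krisp API or its measurement method.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    # Hypothetical client-side log entry, timestamps in milliseconds.
    speech_end_ms: int     # when the speaker finished this segment
    first_audio_ms: int    # when translated audio for it first arrived


def segment_delays(segments: List[Segment]) -> List[int]:
    """Per-segment translation delay, measured from end of speech (an assumption)."""
    return [s.first_audio_ms - s.speech_end_ms for s in segments]


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample data, only to show the calculation.
segments = [
    Segment(2000, 2600),
    Segment(3800, 4500),
    Segment(5500, 6400),
    Segment(7200, 7700),
]
delays = segment_delays(segments)
mean_delay_ms = sum(delays) / len(delays)
p95_delay_ms = percentile(delays, 95)
```

Reporting per segment rather than per utterance keeps the metric independent of how long the speaker talks, which is the point made above.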
On mid-utterance interruptions: Our segmentation is context-aware rather than purely acoustic. When the model judges that the audio received so far is insufficient for a reliable translation, it waits — within a configurable threshold — for more input before committing. If nothing more arrives, it emits the best translation from accumulated context. This bounds the quality/responsiveness trade-off and avoids the context fragmentation you're describing, where an interruption forces a translation of a half-formed thought.
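The wait-or-commit behavior described above can be sketched roughly as follows. This is an illustration of the idea only, not Krisp's implementation: `PendingSegment`, `max_wait_s`, and the `is_sufficient` callback are hypothetical names, and the callback stands in for whatever judgment the real model makes about whether it has enough context.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PendingSegment:
    max_wait_s: float                      # hypothetical "configurable threshold"
    words: List[str] = field(default_factory=list)
    last_input_s: Optional[float] = None   # time the most recent input arrived

    def add(self, word: str, now_s: float) -> None:
        """Accumulate new input and restart the wait timer."""
        self.words.append(word)
        self.last_input_s = now_s

    def poll(self, now_s: float,
             is_sufficient: Callable[[List[str]], bool]) -> Optional[str]:
        """Return text to commit, or None to keep waiting for more input."""
        if not self.words:
            return None
        timed_out = now_s - self.last_input_s >= self.max_wait_s
        if is_sufficient(self.words) or timed_out:
            text = " ".join(self.words)
            self.words = []
            self.last_input_s = None
            return text
        return None


# Toy sufficiency rule (an assumption): three or more words is "enough".
def enough(words: List[str]) -> bool:
    return len(words) >= 3


seg = PendingSegment(max_wait_s=0.5)
seg.add("the", 0.0)
early = seg.poll(0.1, enough)        # too little context, still within threshold
fallback = seg.poll(0.6, enough)     # nothing more arrived, commit what we have
for t, w in [(1.0, "I"), (1.1, "need"), (1.2, "help")]:
    seg.add(w, t)
full = seg.poll(1.2, enough)         # sufficient context, commit immediately
```

The threshold bounds the worst case: a half-formed fragment is only committed after the wait expires, never the moment an interruption happens.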
On speaker turn overlap: The API is designed as a single semi-synchronous translation stream, so overlap handling stays with the client — you decide whether overlapping speakers get separate streams or how to arbitrate the floor. We're adding an explicit interrupt command so clients can control what happens to in-flight translation and synthesis when a turn is cut off, rather than us guessing at a policy that won't fit every application.
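As one example of client-side floor arbitration, here is a minimal sketch of a "first speaker holds the floor until they go quiet" policy that decides which speaker's audio is forwarded to a single translation stream. `FloorArbiter`, `release_after_s`, and `route` are hypothetical names for illustration; the sketch does not call any Krisp API and does not model the planned interrupt command.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FloorArbiter:
    release_after_s: float          # silence needed before the floor is released
    holder: Optional[str] = None    # speaker currently holding the floor
    last_active_s: float = 0.0      # last time the holder produced audio

    def route(self, speaker: str, now_s: float) -> bool:
        """Return True if this speaker's audio frame should be forwarded."""
        # Release the floor if the current holder has been silent long enough.
        if (self.holder is not None
                and now_s - self.last_active_s >= self.release_after_s):
            self.holder = None
        # A free floor goes to whoever speaks first.
        if self.holder is None:
            self.holder = speaker
        if speaker == self.holder:
            self.last_active_s = now_s
            return True
        return False                # overlapping speaker: drop or send elsewhere


arb = FloorArbiter(release_after_s=0.5)
decisions = [
    arb.route("agent", 0.0),    # agent takes the floor
    arb.route("caller", 0.2),   # overlap: caller is not forwarded
    arb.route("agent", 0.3),    # agent keeps the floor
    arb.route("caller", 0.9),   # agent silent for 0.6 s: caller takes over
    arb.route("agent", 1.0),    # now the agent is the overlapping speaker
]
```

An application that must not lose overlapping speech would open a separate stream for the second speaker instead of returning `False`.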
Stripo.email
Congrats on the launch! 🚀 Real-time voice translation is impressive on its own, but production experience in healthcare and finance makes it even more compelling. Best of luck today!