Voice AI for Meetings

Krisp Voice Translation API - Real-time speech-to-speech translation API

Anchor

•2mo ago

Most voice translation APIs work great in demos. Then real users show up with background noise, accents, and verification codes that get garbled. We built our technology on a million live contact center calls where accuracy is non-negotiable. 96% accuracy on real calls, zero patient safety incidents, 61+ languages with any-to-any pairs. The Translation API is now available self-serve, with 60 minutes of free credit when you sign up for the dev dashboard.

Replies

Krisp

Maker

📌

Hey Product Hunt! We've been running real-time voice translation in enterprise contact centers. Healthcare, insurance, finance. Calls where a wrong word means a patient safety incident or a compliance violation.

That pressure built an engine most benchmarks can't replicate. 96% accuracy on live calls with real accents and noise. Zero patient safety incidents across 8+ languages. Over a million minutes of production translation.

Today we're opening that engine up as a self-serve API. Same model, same accuracy, same 61 languages. Python and JS SDKs, playground with 60 free minutes, custom vocabulary and translation dictionaries from day one. No sales call.

If you're building anything where voice crosses a language barrier and accuracy matters, hear it yourself: https://lab.krisp.ai/products/vo... Our team will be here all day. Ask us anything.

2mo ago

Tobira.ai

@asti_pili C++ SDK timeline? Any plans for Go, Rust, or mobile SDKs?

2mo ago

Krisp

Maker

@olia_nemirovski The C++ SDK is coming soon, but there's no specific date yet.

2mo ago

Oh my, I've been looking for something like this for months. Do you have plans to integrate this into your mobile app as native functionality?

2mo ago

Krisp

Maker

@volodymyr_demchenko It will definitely come to our consumer product.

2mo ago

@asti_pili Amazing, will be waiting for this!

2mo ago

@asti_pili @volodymyr_demchenko that's so nice

15d ago

Refocus

The "works great in demos, then real users show up with background noise and accents" line is exactly the wall we hit building voice AI for older adults. Phone-quality audio and unfamiliar accents break most pipelines that benchmark beautifully. Training on a million real contact-center calls is a smart moat for that reason. One question on the speech-to-speech path: how much added latency does translation introduce over plain transcription, and is it low enough to keep a live call feeling like a natural back-and-forth?

2mo ago

Krisp

Maker

@igorgurovich good questions.

We measure latency in Krisp as the time to first translated audio after a person speaks.

Total time-to-first-translated-audio is approximately 1.5–3 seconds, driven by three factors: context window size, structural complexity of the source language, and amount of speech. AI inference accounts for around 700–800 ms of that.
Language structural complexity is the primary variable. Languages whose word order parallels the target language (e.g., Spanish) can be translated incrementally as words arrive, so latency lands toward the lower end.
Languages with high reordering distance, such as Japanese, Korean, or Turkish, are verb-final or agglutinative. The model has to buffer more context before producing a grammatically correct translation, so latency lands toward the higher end.
The AI latency difference between transcription and translation is ~100 ms.
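
For anyone who wants to check this metric on their own calls, here is a minimal sketch of how time-to-first-translated-audio could be computed from client-side timestamps. The event names (`speech_start`, `translated_audio`) and the `Event` type are hypothetical and not part of any SDK, and the sketch assumes the clock starts at speech onset, which is one possible reading of "after a person speaks".

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # hypothetical labels: "speech_start" or "translated_audio"
    t_ms: float  # timestamp in milliseconds on a single shared clock


def time_to_first_translated_audio(events):
    """Return ms from the first speech_start to the first translated_audio
    that follows it, or None if either event is missing."""
    speech_start = None
    for ev in sorted(events, key=lambda e: e.t_ms):
        if ev.kind == "speech_start" and speech_start is None:
            speech_start = ev.t_ms
        elif ev.kind == "translated_audio" and speech_start is not None:
            return ev.t_ms - speech_start
    return None
```

Logging both events on the same clock (for example, the client's monotonic clock) keeps network clock skew out of the measurement.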

2mo ago

Shark.Health

Impressive to see accuracy claims based on real contact center traffic instead of lab conditions. How does the API handle industry-specific terminology, like healthcare or financial services vocabulary, where a single mistranslation can create major issues?

2mo ago

Krisp

Maker

Looking forward to seeing what the Product Hunt community builds with it. We'd love your feedback!

2mo ago

Curious how deep the localization goes. Spanish in Mexico and Spanish in Spain can feel quite different in everyday use. Do you support regional variations like that?

Congrats on the launch!

2mo ago

Krisp

Maker

@jared_salois great question.

Yes, we support locale-specific variants (US Spanish, French Canadian, Egyptian Arabic, etc.) and regional languages (Catalan, Galician, Basque).

2mo ago

Anchor

Hunter

Great to see this finally launched. Super useful update to a great tool!

2mo ago

Mailwarm

Congratulations! I will definitely try it.

2mo ago

Krisp

Maker

So proud of this launch!

2mo ago

Real-time speech-to-speech at API level means you've solved the three-stage pipeline problem: ASR accuracy, translation context, and TTS naturalness all simultaneously. We've built on streaming audio APIs and the hardest part is always mid-utterance interruptions breaking the translation context. What's your P99 latency for a 10-second utterance, and how do you handle speaker turn overlap?

2mo ago

Krisp

Maker

@retain_dev Great questions — you're clearly speaking from experience with the same failure modes we've worked through.

On latency: We don't benchmark against utterance length, and that's deliberate. A 10-second utterance never waits for completion — it's streamed and translated as a series of segments, so utterance-level P99 would mostly measure how long the speaker talked, not how fast the system is. The segment-level numbers are what matter: typical segments run 1.5–2 seconds of speech, with translation delay averaging under a second and staying under ~1 second even at P95. In practice, listeners hear translated audio continuously throughout a long utterance rather than waiting for it to end.
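
To make the segment-level view concrete, here is a rough sketch of how per-segment delay statistics could be computed from your own logs. It assumes "translation delay" means the gap between the end of a segment's speech and the start of its translated audio; that definition, the function names, and the input shape are all illustrative assumptions. The percentile uses the nearest-rank method.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least
    p percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def segment_delay_stats(segments):
    """segments: list of (speech_end_ms, translated_audio_start_ms) pairs,
    one pair per streamed segment (hypothetical log format)."""
    delays = [audio_start - speech_end for speech_end, audio_start in segments]
    return {
        "mean_ms": sum(delays) / len(delays),
        "p95_ms": percentile(delays, 95),
    }
```

Because each segment is measured on its own, a long utterance contributes many short samples instead of one sample dominated by how long the speaker talked.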

On mid-utterance interruptions: Our segmentation is context-aware rather than purely acoustic. When the model judges that the audio received so far is insufficient for a reliable translation, it waits — within a configurable threshold — for more input before committing. If nothing more arrives, it emits the best translation from accumulated context. This bounds the quality/responsiveness trade-off and avoids the context fragmentation you're describing, where an interruption forces a translation of a half-formed thought.
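
As an illustration only, the wait-then-commit behavior described above could be sketched like this. This is not our implementation: the `SegmentBuffer` class, the `confidence` score, the `min_confidence` cutoff, and the choice to measure the wait from the most recent input are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SegmentBuffer:
    max_wait_ms: float           # hypothetical configurable wait threshold
    min_confidence: float = 0.8  # hypothetical "enough context" cutoff
    chunks: list = field(default_factory=list)
    last_input_ms: Optional[float] = None

    def on_audio(self, chunk, confidence, now_ms):
        """Buffer a chunk. Return the accumulated chunks if the context is
        judged sufficient, otherwise None (keep waiting for more input)."""
        self.chunks.append(chunk)
        if confidence >= self.min_confidence:
            return self._commit()
        self.last_input_ms = now_ms  # restart the wait from the latest input
        return None

    def on_tick(self, now_ms):
        """Call periodically. If nothing new arrived within the threshold,
        commit a best-effort segment from whatever has accumulated."""
        if self.last_input_ms is None:
            return None
        if now_ms - self.last_input_ms >= self.max_wait_ms:
            return self._commit()
        return None

    def _commit(self):
        out, self.chunks, self.last_input_ms = self.chunks, [], None
        return out
```

A smaller `max_wait_ms` favors responsiveness, a larger one favors translation quality; the threshold is what bounds that trade-off.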

On speaker turn overlap: The API is designed as a single semi-synchronous translation stream, so overlap handling stays with the client — you decide whether overlapping speakers get separate streams or how to arbitrate the floor. We're adding an explicit interrupt command so clients can control what happens to in-flight translation and synthesis when a turn is cut off, rather than us guessing at a policy that won't fit every application.
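
Since arbitration lives on the client side, here is one minimal sketch of a floor policy a client could apply before forwarding audio to a single stream. The `FloorArbiter` class and its `release_after_ms` parameter are hypothetical and make no API calls; the alternative of opening one stream per speaker would not need this at all.

```python
class FloorArbiter:
    """Lets one speaker at a time feed a single translation stream.
    The floor is released after the holder has been silent long enough."""

    def __init__(self, release_after_ms):
        self.release_after_ms = release_after_ms  # hypothetical silence gap
        self.holder = None
        self.last_audio_ms = None

    def admit(self, speaker, now_ms):
        """Return True if this speaker's chunk should be forwarded."""
        floor_free = (
            self.holder is None
            or now_ms - self.last_audio_ms >= self.release_after_ms
        )
        if floor_free or speaker == self.holder:
            self.holder = speaker
            self.last_audio_ms = now_ms
            return True
        return False  # overlapping speech from a non-holder is dropped
```

Dropping overlapped audio is only one possible policy; a client could just as well queue it or route it to a second stream.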

2mo ago