Krisp Voice Translation API - Real-time speech-to-speech translation API
Most voice translation APIs work great in demos. Then real users show up with background noise, accents, and verification codes that get garbled. We built our technology on a million live contact center calls where accuracy is non-negotiable: 96% accuracy on real calls, zero patient safety incidents, and 61+ languages with any-to-any pairing.
The Translation API is now available self-serve, with 60 minutes of free credit when you sign up for the dev dashboard.

Replies
Stripo.email
Congrats on the launch! 🚀 Real-time voice translation is impressive on its own, but production experience in healthcare and finance makes it even more compelling. Best of luck today!
How well does Custom Vocabulary / Dictionary work? How many terms can I add, and does it slow things down?
Krisp
@marija_pojasnikova The API supports custom vocabulary, so you can pass your specific terminology. We also support a custom translation_dictionary, where you can provide an exact word and its translation for each language — that word will always keep your translation.
Here's the documentation on how to pass these parameters: https://sdk-docs.krisp.ai/docs/voice-translation-api#initial-client-message
Krisp
@marija_pojasnikova Both the vocabulary and the dictionary can contain up to 200 entries each.
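For anyone wiring this up, here is a minimal sketch of how the two lists might be checked against the 200-entry limit before they are sent. Only the name `translation_dictionary` and the limit come from the replies above; the `vocabulary` field name, the shape of each dictionary entry (term mapped to per-language translations), and the `build_initial_message` helper are assumptions for illustration, so check the linked documentation for the real schema.

```python
import json

MAX_ENTRIES = 200  # limit stated in the reply above


def build_initial_message(vocabulary, translation_dictionary):
    """Build a hypothetical initial client message as a JSON string.

    Assumptions (not confirmed by the source):
      - the vocabulary field is called "vocabulary"
      - translation_dictionary maps a term to {language_code: translation}
    """
    if len(vocabulary) > MAX_ENTRIES:
        raise ValueError(
            f"vocabulary has {len(vocabulary)} entries, limit is {MAX_ENTRIES}"
        )
    if len(translation_dictionary) > MAX_ENTRIES:
        raise ValueError(
            f"translation_dictionary has {len(translation_dictionary)} entries, "
            f"limit is {MAX_ENTRIES}"
        )
    return json.dumps(
        {
            "vocabulary": list(vocabulary),
            "translation_dictionary": translation_dictionary,
        }
    )


# Example: keep a brand name unchanged in Spanish and German.
message = build_initial_message(
    vocabulary=["Krisp", "deductible"],
    translation_dictionary={"Krisp": {"es": "Krisp", "de": "Krisp"}},
)
```

Validating on the client side just surfaces an oversized list early; how the service itself reacts to more than 200 entries is not stated in this thread.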
Product Hunt
Krisp launches go wayyyy back. Congrats on the latest. :)
Krisp
@rrhoover Yeah, way back to the working-from-home days during COVID, when we launched Noise Cancellation.
Interesting breakdown of the latency tradeoffs. The language reordering problem is something many demos conveniently avoid discussing. I was wondering how you're handling workload spikes when multiple streams require larger context windows simultaneously. Do you dynamically allocate translation capacity per stream, or is there some form of queueing and prioritization to prevent latency from cascading across tenants?
Krisp
@apexbackene6x You're right that this is where a lot of real-time translation systems fall over — the failure mode is usually context accumulation, where long sessions inflate per-stream memory and a burst of concurrent streams turns into cascading queue delay.
We sidestep most of that architecturally. Translation context windows are deliberately narrow — scoped to what's needed for translation continuity across the current segment boundary, not extended conversation history. Unlike serving a general-purpose LLM, per-session footprint stays small and roughly constant over the life of a stream, so "many streams simultaneously needing larger context" isn't a spike vector in the first place. Sessions are also fully isolated: one tenant's workload spike can't bleed into another tenant's latency.
On capacity itself, scaling is horizontal and automatic against concurrent session load — there's no manual provisioning step, and no shared queue where one tenant's burst pushes everyone else back. For customers with strict latency requirements, we additionally offer reserved capacity with guaranteed compute allocation, so your streams are served from dedicated headroom rather than competing in the general pool.
So short answer: per-stream allocation, kept cheap by design, with isolation rather than prioritization doing the cascade-prevention work — and reservation available where best-effort isn't acceptable.
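To make the "small and roughly constant per-session footprint" idea concrete, here is a toy sketch of a bounded, per-session context window. This is not Krisp's implementation: the `SessionContext` class, the `add_segment` helper, and the window size of 3 segments are all hypothetical, chosen only to show how a fixed-size buffer keeps memory flat however long a stream runs, and how separate buffers keep sessions from affecting each other.

```python
from collections import deque


class SessionContext:
    """Hypothetical per-session context holding only the last few segments."""

    def __init__(self, max_segments=3):
        # deque with maxlen silently drops the oldest item when full,
        # so memory per session does not grow with session length.
        self.segments = deque(maxlen=max_segments)

    def add(self, segment):
        self.segments.append(segment)

    def window(self):
        return list(self.segments)


# One isolated context per session id (illustrative in-memory registry).
sessions = {}


def add_segment(session_id, segment, max_segments=3):
    """Append a segment to its own session's window and return that window."""
    ctx = sessions.setdefault(session_id, SessionContext(max_segments))
    ctx.add(segment)
    return ctx.window()


# A long-running stream never holds more than max_segments segments.
for i in range(10):
    add_segment("tenant-a", f"segment-{i}")
```

A real system would bound the window by tokens or audio duration rather than a segment count, but the constant-footprint property is the same.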
The custom translation_dictionary that locks an exact term-to-term mapping is the detail I'd actually reach for — brand and product names are usually the first thing these pipelines mangle. On the free 60-min tier, is the dictionary per-request or stored server-side once you set it up?
Krisp
@lennoxbeflying You pass it as parameters. Here is the documentation: https://sdk-docs.krisp.ai/docs/voice-translation-api#initial-client-message
jared.so
Following Krisp: Voice AI for Meetings with interest. What is next on the roadmap after launch day?
Krisp
@borrellbr Voice (sky) is the limit :D
Seriously, though, translation will probably be the next thing to land in the voice AI app.
Real-time speech-to-speech is the hard part - what's the round-trip latency at acceptable quality? Most translation APIs I've tested hit 800ms+ which is fine for async but breaks conversational flow completely.
Krisp
@christian_knaut We measure latency in Krisp as the time to first translated audio after a person speaks.
Total time-to-first-translated-audio is approximately 1.5–3 seconds, driven by three factors: context window size, source language structural complexity, and amount of speech. AI inference latency is around 700-800ms here.
Language structural complexity is the primary variable. Languages with word order parallel to the target language (e.g., Spanish) can be translated incrementally as words arrive, resulting in latency toward the lower end.
Languages with high reordering distance, such as Japanese, Korean, or Turkish, are verb-final or agglutinative, so the model must buffer more context before producing a grammatically correct translation, which pushes latency toward the higher end.
The AI latency difference between transcription and translation is ~100ms.
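If you want to check this metric in your own integration, a rough sketch of the measurement follows. It assumes you already log two kinds of timestamps (in seconds, from the same clock): when the speaker starts talking and when each translated audio chunk arrives. The function name and the way timestamps are collected are hypothetical and not part of the API.

```python
def time_to_first_translated_audio(speech_start, audio_chunk_times):
    """Seconds from speech onset to the first translated audio chunk after it.

    speech_start: timestamp when the person started speaking.
    audio_chunk_times: arrival timestamps of translated audio chunks.
    Returns None if no chunk arrived at or after speech_start.
    """
    later = [t for t in audio_chunk_times if t >= speech_start]
    if not later:
        return None
    return min(later) - speech_start


# Chunks at 9.5 s belong to an earlier utterance and are ignored.
latency = time_to_first_translated_audio(10.0, [9.5, 12.0, 12.4])
```

Measuring per utterance and looking at the spread across language pairs is likely more useful than a single average, given how much the source language affects the result.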
Spotlight by Backplanes
@asti_pili tangential question-- how does this work with speaker attribution when multiple people are in a room together? This is my biggest pet peeve with most transcription agents today. Granola's amazing if I use my phone, and has nothing when using my laptop. My kingdom for good attribution regardless of setting! What does Krisp do here?
Krisp
@antifreeze This is a translation API launch from our developer product; you are referring to our voice AI app for meetings. To answer your question, speaker attribution is a tough challenge. Currently we use AI to suggest speakers based on the transcription.
Huge fan of Krisp's noise cancellation, and seeing you guys cross the language line with a Dev API is massive! For developers looking to integrate this translation API into real-time voice apps (like live customer support), what is the average latency (in milliseconds) we should expect for the live translation stream?
Krisp
@dropa We measure latency in Krisp as the time to first translated audio after a person speaks.
Total time-to-first-translated-audio is approximately 1.5–3 seconds, driven by three factors: context window size, source language structural complexity, and amount of speech. AI inference latency is around 700-800ms here.
Language structural complexity is the primary variable. Languages with word order parallel to the target language (e.g., Spanish) can be translated incrementally as words arrive, resulting in latency toward the lower end.
Languages with high reordering distance, such as Japanese, Korean, or Turkish, are verb-final or agglutinative, so the model must buffer more context before producing a grammatically correct translation, which pushes latency toward the higher end.
The AI latency difference between transcription and translation is ~100ms.
jared.so
Congrats on the launch! How do you plan to handle all the traffic during these days?