Reviewers mostly praise Krisp for two things: strong background-noise removal and useful meeting capture. Users say it works across Zoom, Meet, Teams, Discord, and even in noisy homes, cars, conferences, or outdoor settings, often without adding a bot to the call. Many also like the transcripts, summaries, and action items, calling them accurate, clear, and easy to fit into workflows. A few drawbacks come up repeatedly: occasional lag, random meeting restarts or over-triggering, sometimes unnatural voice quality, limited integrations or API access, and uneven customer support.
Krisp
Tobira.ai
@asti_pili C++ SDK timeline? Any plans for Go, Rust, or mobile SDKs?
Krisp
@olia_nemirovski The C++ SDK is coming soon, but there's no specific date yet.
RiteKit Company Logo API
@asti_pili This is impressive work—the fact that you've handled over a million minutes of production translation in regulated industries with zero incidents is a strong signal of real reliability. The 96% accuracy on live calls with real-world noise is the kind of number that matters way more than lab benchmarks.
Refocus
The "works great in demos, then real users show up with background noise and accents" line is exactly the wall we hit building voice AI for older adults. Phone-quality audio and unfamiliar accents break most pipelines that benchmark beautifully. Training on a million real contact-center calls is a smart moat for that reason. One question on the speech-to-speech path: how much added latency does translation introduce over plain transcription, and is it low enough to keep a live call feeling like a natural back-and-forth?
Krisp
@igorgurovich good questions.
We measure latency in Krisp as the time to first translated audio after a person speaks.
Total time-to-first-translated-audio is approximately 1.5–3 seconds, driven by three factors: context window size, source language structural complexity, and amount of speech. AI inference latency is around 700-800ms here.
Language structural complexity is the primary variable. Languages with word order parallel to the target language (e.g., Spanish) can be translated incrementally as words arrive, resulting in latency toward the lower end.
Languages with high reordering distance (such as Japanese, Korean, or Turkish) are verb-final or agglutinative, so the model has to buffer more context before it can produce a grammatically correct translation, which pushes latency toward the higher end.
The AI latency difference between transcription and translation is ~100ms.
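For anyone who wants to reproduce this metric on the client side, here is a minimal Python sketch of a time-to-first-translated-audio timer. It is not Krisp code: the class and method names are hypothetical, and it assumes the clock starts when speech begins, which the reply above does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FirstAudioTimer:
    """Tracks time-to-first-translated-audio for one speaker turn (hypothetical helper)."""
    speech_start: Optional[float] = None
    first_audio: Optional[float] = None

    def on_speech_start(self, now: float) -> None:
        # Only the first speech event of the turn starts the clock.
        if self.speech_start is None:
            self.speech_start = now

    def on_translated_audio(self, now: float) -> None:
        # Only the first translated chunk after speech began stops the clock.
        if self.speech_start is not None and self.first_audio is None:
            self.first_audio = now

    def latency_seconds(self) -> Optional[float]:
        # None until both events have been observed.
        if self.speech_start is None or self.first_audio is None:
            return None
        return self.first_audio - self.speech_start
```

Feed it timestamps from a monotonic clock (for example `time.monotonic()`) at the two events, then collect `latency_seconds()` per turn to compare against the 1.5–3 second range quoted above.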
Real-time speech-to-speech at API level means you've solved the three-stage pipeline problem: ASR accuracy, translation context, and TTS naturalness all simultaneously. We've built on streaming audio APIs and the hardest part is always mid-utterance interruptions breaking the translation context. What's your P99 latency for a 10-second utterance, and how do you handle speaker turn overlap?
Krisp
@retain_dev Great questions — you're clearly speaking from experience with the same failure modes we've worked through.
On latency: We don't benchmark against utterance length, and that's deliberate. A 10-second utterance never waits for completion — it's streamed and translated as a series of segments, so utterance-level P99 would mostly measure how long the speaker talked, not how fast the system is. The segment-level numbers are what matter: typical segments run 1.5–2 seconds of speech, with translation delay averaging under a second and staying under ~1 second even at P95. In practice, listeners hear translated audio continuously throughout a long utterance rather than waiting for it to end.
On mid-utterance interruptions: Our segmentation is context-aware rather than purely acoustic. When the model judges that the audio received so far is insufficient for a reliable translation, it waits — within a configurable threshold — for more input before committing. If nothing more arrives, it emits the best translation from accumulated context. This bounds the quality/responsiveness trade-off and avoids the context fragmentation you're describing, where an interruption forces a translation of a half-formed thought.
On speaker turn overlap: The API is designed as a single semi-synchronous translation stream, so overlap handling stays with the client — you decide whether overlapping speakers get separate streams or how to arbitrate the floor. We're adding an explicit interrupt command so clients can control what happens to in-flight translation and synthesis when a turn is cut off, rather than us guessing at a policy that won't fit every application.
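The wait-then-commit behavior described for mid-utterance interruptions can be sketched roughly as follows. This is an illustration only, not Krisp's implementation: `SegmentBuffer`, `is_sufficient`, and `max_wait_s` are hypothetical names, the sufficiency check stands in for the model's own judgment, and the wait is assumed to be measured from the most recent input.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SegmentBuffer:
    """Hypothetical wait-or-commit policy for streamed translation input."""
    is_sufficient: Callable[[str], bool]  # stand-in for the model's judgment
    max_wait_s: float                     # the configurable threshold
    words: List[str] = field(default_factory=list)
    last_input_at: Optional[float] = None

    def push(self, word: str, now: float) -> Optional[str]:
        """Add input; return a segment to translate once context is sufficient."""
        self.words.append(word)
        self.last_input_at = now
        if self.is_sufficient(" ".join(self.words)):
            return self._commit()
        return None

    def tick(self, now: float) -> Optional[str]:
        """Call periodically; flush accumulated context once the wait expires."""
        if self.words and now - self.last_input_at >= self.max_wait_s:
            return self._commit()
        return None

    def _commit(self) -> str:
        # Hand over everything buffered so far and start a fresh segment.
        segment = " ".join(self.words)
        self.words.clear()
        self.last_input_at = None
        return segment
```

A caller would invoke `push` for each recognized word and `tick` on a timer. A lower `max_wait_s` favors responsiveness and a higher one favors translation quality, which is the trade-off the reply says the threshold bounds.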
Interesting breakdown of the latency tradeoffs. The language reordering problem is something many demos conveniently avoid discussing. I was wondering how you're handling workload spikes when multiple streams require larger context windows simultaneously. Do you dynamically allocate translation capacity per stream, or is there some form of queueing and prioritization to prevent latency from cascading across tenants?
Krisp
@apexbackene6x You're right that this is where a lot of real-time translation systems fall over — the failure mode is usually context accumulation, where long sessions inflate per-stream memory and a burst of concurrent streams turns into cascading queue delay.
We sidestep most of that architecturally. Translation context windows are deliberately narrow — scoped to what's needed for translation continuity across the current segment boundary, not extended conversation history. Unlike serving a general-purpose LLM, per-session footprint stays small and roughly constant over the life of a stream, so "many streams simultaneously needing larger context" isn't a spike vector in the first place. Sessions are also fully isolated: one tenant's workload spike can't bleed into another tenant's latency.
On capacity itself, scaling is horizontal and automatic against concurrent session load — there's no manual provisioning step, and no shared queue where one tenant's burst pushes everyone else back. For customers with strict latency requirements, we additionally offer reserved capacity with guaranteed compute allocation, so your streams are served from dedicated headroom rather than competing in the general pool.
So short answer: per-stream allocation, kept cheap by design, with isolation rather than prioritization doing the cascade-prevention work — and reservation available where best-effort isn't acceptable.
How well does Custom Vocabulary / Dictionary work? How many terms can I add, and does it slow things down?
Krisp
@marija_pojasnikova The API supports custom vocabulary, so you can pass your specific terminology. We also support a custom translation_dictionary, where you can provide an exact word and its translation for each language — that word will always keep your translation.
Here's the documentation on how to pass these parameters: https://sdk-docs.krisp.ai/docs/voice-translation-api#initial-client-message
Krisp
@marija_pojasnikova Both the vocabulary and the dictionary can contain up to 200 entries each.
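A small client-side check can catch an oversized list before it is sent. The Python sketch below is a guess at the shape, not the documented schema: only the `translation_dictionary` name and the 200-entry limit come from the replies above, while the `vocabulary` key, the word-to-language-to-translation layout, and the rule that one source word counts as one entry are assumptions. Check the linked documentation for the real field names.

```python
from typing import Dict, List

MAX_ENTRIES = 200  # per-list limit stated in the thread


def build_terms_payload(
    vocabulary: List[str],
    translation_dictionary: Dict[str, Dict[str, str]],
) -> dict:
    """Validate entry counts and return a payload fragment (hypothetical shape)."""
    if len(vocabulary) > MAX_ENTRIES:
        raise ValueError(
            f"vocabulary has {len(vocabulary)} entries; limit is {MAX_ENTRIES}"
        )
    # Assumption: each source word counts as one dictionary entry.
    if len(translation_dictionary) > MAX_ENTRIES:
        raise ValueError(
            f"translation_dictionary has {len(translation_dictionary)} entries; "
            f"limit is {MAX_ENTRIES}"
        )
    return {
        "vocabulary": list(vocabulary),  # hypothetical key name
        "translation_dictionary": dict(translation_dictionary),
    }
```

For example, `build_terms_payload(["Krisp"], {"deductible": {"es": "deducible"}})` returns a fragment with both lists, and a 201-term vocabulary raises `ValueError` before any request is made.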
Curious how deep the localization goes. Spanish in Mexico and Spanish in Spain can feel quite different in everyday use. Do you support regional variations like that?
Congrats on the launch!
Krisp
@jared_salois great question.
Yes, we support locale-specific variants (US Spanish, French Canadian, Egyptian Arabic, etc.) and regional languages (Catalan, Galician, Basque).
Real-time speech-to-speech is the hard part - what's the round-trip latency at acceptable quality? Most translation APIs I've tested hit 800ms+ which is fine for async but breaks conversational flow completely.
Krisp
@christian_knaut We measure latency in Krisp as the time to first translated audio after a person speaks.
Total time-to-first-translated-audio is approximately 1.5–3 seconds, driven by three factors: context window size, source language structural complexity, and amount of speech. AI inference latency is around 700-800ms here.
Language structural complexity is the primary variable. Languages with word order parallel to the target language (e.g., Spanish) can be translated incrementally as words arrive, resulting in latency toward the lower end.
Languages with high reordering distance (such as Japanese, Korean, or Turkish) are verb-final or agglutinative, so the model has to buffer more context before it can produce a grammatically correct translation, which pushes latency toward the higher end.
The AI latency difference between transcription and translation is ~100ms.