Why Whisper breaks on Russian tech interviews — and what we did about it

In January I bought subscriptions to Cluely, Final Round AI, and Sensei. All three claim to support Russian, and I wanted to see how they handle Russian tech interviews. I connected them one at a time to a test call in Telemost (I doubt the platform mattered) and played the same recording through each: a 45-minute interview for a Senior Python backend developer role, with a FastAPI + PostgreSQL + Kafka + Kubernetes stack. The speaker is an ordinary Russian speaker from Moscow, if that matters, with no speech issues and a normal microphone.
All three produced a transcript. All three failed. What a surprise.
"Kafka" became "как-то" ("somehow") or "кофта" ("sweater") about half the time. "Kubernetes" turned into "губер нет тест" ("no governor test"). "Subscriber pattern" became "саб скрайп патерн." "Middleware for CSRF" became "мидл-вер для си эс эр эф" — that one's actually fine.
The problem isn't that the person was speaking Russian, and it isn't that Whisper doesn't know Russian (it knows it adequately). Whisper handles Russian at around 9.8% WER on Common Voice. The problem is something else: a Russian-speaking developer doesn't speak pure Russian or pure English. They speak a hybrid: Russian grammar, English terms, a distinct way of pronouncing those terms, and sometimes their own slang like "гошечка" and "крудошлёп."
No popular STT system handles this hybrid, because it's almost entirely absent from training data.
Below I break down how this problem is structured, what competitors do about it (almost nothing), and what we did at JobPath.
How the speech layer works in interview assistants
Architecturally, every product in the space does the same thing:
1. Audio capture. Intercept audio from the system output (to hear the interviewer through Zoom/Meet/Telemost), plus optionally from the candidate's microphone.
2. Speech-to-text in real time. Audio becomes transcript.
3. LLM. Transcript plus context (resume, stack, previous answers) goes into the model, answer comes back.
4. Display. Output is shown in a way that doesn't appear in the interviewer's screen share.
The interesting part from a speech recognition standpoint happens at step two. If the STT hears something wrong, no amount of prompt engineering can fix it downstream: the LLM gets "sweater" and either generates an answer about clothing or, when a system prompt tries to patch the problem, hallucinates a guess at what was meant.
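The dependency between steps two and three can be sketched as a chain of functions. This is a structural sketch only: every name here (`Context`, `run_pipeline`, the stub stages) is hypothetical and not taken from any product's code. It only illustrates that the LLM stage receives text, never audio, so it cannot recover a word the STT stage lost.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Context:
    """Candidate-side context sent along with every transcript (hypothetical shape)."""
    resume: str
    stack: list[str]


def run_pipeline(
    audio_chunk: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str, Context], str],
    ctx: Context,
) -> str:
    """Steps 2 and 3: audio -> transcript -> answer. The LLM only sees the transcript."""
    transcript = stt(audio_chunk)
    return llm(transcript, ctx)


# Stub stages for illustration only; real STT and LLM calls would replace them.
def broken_stt(_audio: bytes) -> str:
    return "Расскажите про кофту"  # "Kafka" misheard as "sweater"


def good_stt(_audio: bytes) -> str:
    return "Расскажите про Kafka"


def echo_llm(transcript: str, ctx: Context) -> str:
    # Shows what the model would receive instead of calling a real model.
    return f"[stack={', '.join(ctx.stack)}] question: {transcript}"


ctx = Context(resume="Senior Python backend", stack=["FastAPI", "Kafka"])
print(run_pipeline(b"", broken_stt, echo_llm, ctx))
print(run_pipeline(b"", good_stt, echo_llm, ctx))
```

With the broken STT stub, the question that reaches the model is about a sweater even though the stack context mentions Kafka.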
Under the hood, competitors use three main options for STT:
Whisper (OpenAI, open-source) or its forks. Judging by open-source Cluely clones and the error patterns I see in its transcripts, Cluely most likely runs Whisper large-v3. Sensei produces similar error patterns. For Parakeet, the name suggests NVIDIA NeMo Parakeet, but I couldn't confirm this from open-source clones.
OpenAI Realtime API (gpt-4o-realtime). Uses a proprietary STT. For Russian it performs at Whisper's level or slightly worse, and costs significantly more.
Commercial APIs: Deepgram, AssemblyAI, Google Cloud Speech. For Russian these are all worse than Whisper, except Google, which is roughly comparable.
All of these options hit the same wall on Russian-language IT interviews. More on that wall below.
Why Whisper breaks specifically on tech speech
Whisper large-v3 was trained on one million hours of weakly labeled data plus four million hours of pseudo-labeled data. OpenAI doesn't disclose where that data came from, but based on which acoustic domains the model handles well, you can get a rough sense of the composition:
YouTube with auto-generated subtitles
Podcasts and audiobooks
Movie subtitles
Russian is in that corpus, but the breakdown is such that IT content makes up tenths of a percent. The word "кофта" appears thousands of times in the training set; the word "Kafka" in a meaningful Russian-language context appears almost never. When audio is ambiguous, the model picks the more probable word. That's how "кафка" becomes "кофта."
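A toy calculation shows the mechanism. The numbers below are invented for illustration and are not measured from Whisper. Whisper also has no separate language model (its decoder learns acoustics and language jointly, over subword tokens), so splitting the score into "acoustic" and "prior" parts is a simplification. The point is only that when the audio evidence is nearly tied, word frequency decides.

```python
def pick_word(acoustic: dict[str, float], prior: dict[str, float]) -> str:
    """Return the candidate with the highest (acoustic score x prior) product."""
    return max(acoustic, key=lambda word: acoustic[word] * prior[word])


# Invented numbers: the audio slightly favours "кафка"...
acoustic = {"кафка": 0.55, "кофта": 0.45}
# ...but a general-domain prior has seen "кофта" far more often.
general_prior = {"кафка": 0.001, "кофта": 0.02}
# A prior shifted towards IT speech, which is what fine-tuning or prompting aims for.
it_prior = {"кафка": 0.02, "кофта": 0.002}

print(pick_word(acoustic, general_prior))  # кофта
print(pick_word(acoustic, it_prior))       # кафка
```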
The hardest problem comes next: what linguists call code-switching. A Russian-speaking developer switches effortlessly between languages within a single sentence:
"Я там сделал subscriber pattern на RabbitMQ, обернул в try-except, и на пятисотой ошибке делаю retry с exponential backoff"
In that sentence, every English word puts Whisper in front of the same question: is this Russian text with an English insertion, or did we switch to English? The model resolves this through a soft language switch that works on long English passages but fails on individual words. "Subscriber" might come out as "сабскрайбер" (in Cyrillic, which the LLM may then fail to interpret), or "собеседник" ("interlocutor"). "Retry" becomes "ритри" or "три" (the number three). "Try-except" becomes "трой эксепт" or "трое детей" ("three children"). And so on.
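To get a feel for how dense the switching is, one can count script changes in a reference transcript. The sketch below is a rough diagnostic under one assumption: Latin-script tokens stand in for English insertions, which only holds for manually written ground truth, not for STT output. The function names are made up for this example.

```python
import re

LATIN = re.compile(r"[A-Za-z]")
CYRILLIC = re.compile(r"[А-Яа-яЁё]")


def script_of(token: str) -> str:
    """Classify a token by the alphabet(s) it contains."""
    has_latin = bool(LATIN.search(token))
    has_cyrillic = bool(CYRILLIC.search(token))
    if has_latin and has_cyrillic:
        return "mixed"
    if has_latin:
        return "latin"
    if has_cyrillic:
        return "cyrillic"
    return "other"


def switch_stats(sentence: str) -> tuple[int, int, int]:
    """Return (tokens counted, Latin-script tokens, script switches between neighbours)."""
    scripts = [s for s in map(script_of, sentence.split()) if s in ("latin", "cyrillic")]
    switches = sum(a != b for a, b in zip(scripts, scripts[1:]))
    return len(scripts), scripts.count("latin"), switches


sentence = (
    "Я там сделал subscriber pattern на RabbitMQ, обернул в try-except, "
    "и на пятисотой ошибке делаю retry с exponential backoff"
)
print(switch_stats(sentence))  # (19, 7, 9)
```

For the example sentence above, that is 7 Latin-script tokens out of 19 and 9 script switches, so the model faces the "which language is this?" decision roughly every second word.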
Another major issue is the Russian pronunciation of English terms (transcriptions here are from Wikipedia). "Kubernetes" in English is [ku:bər'netis]; in Russian it's [куберне́тис] with stress in the wrong place. "Nginx" in English is [ˈendʒɪnɛks]; in Russian it's usually [энджи́никс] or even [нгинкс]. "Cache" is often pronounced [кэш] with a hard к, without the characteristic English softening. Whisper was trained on native English speakers pronouncing these terms correctly, so the Russian variant sounds to the model like a different word entirely.
And finally, there's the IT slang that's purely Russian. Not anglicisms — genuinely Russian words from IT subculture. "Гошечка" (Go), "плюсы" (C++), "жаба" (Java), "шарпы" (C#), "падать в кафку" (writing to Apache Kafka), "катать роллауты" (doing a rollout to production), "переливать в прод," "крудошлёп" (CRUD developer). None of these words appear in Whisper's training data in an IT context. "Гошечка" the model transcribes as "хорошо" ("good"), "горшочек" ("little pot"), or nothing recognizable at all.
Put it all together: the model must simultaneously understand Russian speech, correctly transcribe English terms with Russian pronunciation, not get confused by code-switching, and know IT slang. Standard Whisper doesn't do any of these four at the level a production assistant requires.
Real numbers
To be concrete rather than anecdotal, I assembled a test corpus. One criterion: content the assistant actually needs to handle in the wild.
What's in the corpus:
200 minutes of audio from real IT interviews, collected with participants' consent
Five speakers, different regions (Moscow, St. Petersburg, Minsk, Yekaterinburg, Novosibirsk), no strong accents
Different stacks: Python, Go, Java, JavaScript, some C++
Manual transcription of everything as ground truth
I ran three systems through it and got the following:
| Model | WER on general speech | WER on technical terms | WER on dictated commands/code |
|---|---|---|---|
| Whisper large-v3 (vanilla) | ~8% | ~34% | ~52% |
| whisper-large-v3-russian (Antony66, fine-tuned on Common Voice RU) | ~6% | ~29% | ~48% |
| JobPath (our fine-tune) | ~4% | ~7% | ~11% |
General speech works decently for all three. But on technical terms the gap is enormous: roughly one in three terms wrong for the off-the-shelf models versus about one in fourteen for us.
What counts as "technical terms" in this benchmark: names of frameworks, libraries, tools, protocols. Precisely the things an interviewer asks about every 30 seconds. One in three wrong means a broken product.
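For readers who want to reproduce this kind of measurement, here is a minimal sketch. WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The `term_error_rate` function is my simplification for this example, not necessarily how the benchmark above was scored: it counts reference term occurrences that are missing from the hypothesis, ignoring word order.

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def term_error_rate(reference: str, hypothesis: str, terms: set[str]) -> float:
    """Share of reference term occurrences that do not appear in the hypothesis."""
    ref_terms = Counter(w for w in reference.lower().split() if w in terms)
    hyp_counts = Counter(hypothesis.lower().split())
    missed = sum(max(0, n - hyp_counts[w]) for w, n in ref_terms.items())
    total = sum(ref_terms.values())
    return missed / total if total else 0.0


ref = "как её избегают в sqlalchemy и в django"
hyp = "как её избегают в эс-кью-эль-алкемия и в данже"
print(wer(ref, hyp))                                        # 0.25
print(term_error_rate(ref, hyp, {"sqlalchemy", "django"}))  # 1.0
```

The example shows why the two columns diverge: overall WER for the phrase is 25%, yet every technical term in it is lost.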
What the errors look like in practice
The same 30-second audio clip through three different systems. The interviewer is asking about working with databases.
Original (manual transcription):
"Расскажите, как бы вы решали проблему лишних запросов к базе при использовании ORM. Что такое N+1 проблема, как её избегают в SQLAlchemy и в Django?"
Cluely:
"Расскажите, как бы вы решали проблему лишних запросов к базе при использовании о-эр-эм. Что такое эн плюс один проблема, как её избегают в эс-кью-эль-алкемия и в данже?"
Sensei Copilot:
"Расскажите, как бы вы решали проблему лишних запросов к базе при использовании oRm. Что такое н плюс 1 проблема, как её избегают в sqlachemi и в Django?"
JobPath:
"Расскажите как бы вы решали проблему лишних запросов к базе при использовании ORM. Что такое N+1 проблема. Как её избегают в SQLAlchemy и в Django?"
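The difference looks cosmetic. It's not. When this transcript goes to the LLM, the output is completely different, because the model has to guess that "эс-кью-эль-алкемия" means SQLAlchemy and "данже" means Django, and when it guesses wrong it answers without the context the question depends on.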

What we actually built at JobPath
Our STT is a fine-tuned Whisper large-v3 on an internal corpus of Russian-language IT interviews. We spent about a month and a half assembling the corpus. Some of it came from explicit user consent on the platform (an "improve speech recognition" option, with a transparent description of what happens to the audio). Some from recorded interviews with colleagues and acquaintances, with their permission. Some from publicly available IT interview recordings on YouTube (channels like "Айти Борода," "Хабр," meetup recordings) with manual annotation.
Total: around 30 GB of cleaned audio with verified transcriptions.
Fine-tuning was done with a LoRA adapter, because full fine-tuning of large-v3 requires 8+ H100s for several days and we don't have those resources. The configuration:
Target modules: q_proj, k_proj, v_proj, out_proj in the decoder's attention layers
Rank: 32
Alpha: 64
Dropout: 0.05
Learning rate: 1e-4, cosine schedule
Batch size 8 with gradient accumulation 4 (effective 32)
3 epochs (more led to overfitting)
Hardware: 2x A100 80GB, about 18 hours of training
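A back-of-the-envelope check of what this configuration trains. The sketch assumes the public Whisper large-v3 dimensions (hidden size 1280, 32 decoder layers) and assumes that "the decoder's attention layers" covers both self-attention and cross-attention in each layer; if only one of the two is adapted, halve the result. `LoraSetup` is a plain container written for this example, not a class from any training library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSetup:
    """Hyperparameters from the list above, in a plain container."""
    rank: int = 32
    alpha: int = 64
    dropout: float = 0.05
    learning_rate: float = 1e-4
    batch_size: int = 8
    grad_accum: int = 4
    epochs: int = 3
    target_modules: tuple[str, ...] = ("q_proj", "k_proj", "v_proj", "out_proj")

    @property
    def effective_batch(self) -> int:
        return self.batch_size * self.grad_accum

    @property
    def scaling(self) -> float:
        # LoRA multiplies the low-rank update by alpha / rank.
        return self.alpha / self.rank


def lora_trainable_params(
    cfg: LoraSetup,
    d_model: int = 1280,             # Whisper large-v3 hidden size
    decoder_layers: int = 32,        # Whisper large-v3 decoder depth
    attn_blocks_per_layer: int = 2,  # assumption: self-attention + cross-attention
) -> int:
    # Each adapted d_model x d_model projection gets A (rank x d_model) and B (d_model x rank).
    per_module = cfg.rank * (d_model + d_model)
    modules = decoder_layers * attn_blocks_per_layer * len(cfg.target_modules)
    return per_module * modules


cfg = LoraSetup()
print(cfg.effective_batch)         # 32
print(cfg.scaling)                 # 2.0
print(lora_trainable_params(cfg))  # 20971520
```

Under these assumptions the adapter has about 21 million trainable parameters, a little over 1% of the roughly 1.55 billion in the base model, which is why it fits on two A100s.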

On top of the fine-tune there's a three-layer postprocessing pipeline:
1. A technical term dictionary with Russian pronunciation variants. About 4,000 entries. If Whisper outputs a word phonetically close to a known term (compared via Metaphone and Double Metaphone), we replace it with the canonical form. "Куберне́тис" becomes "Kubernetes," "сабскрайбер" becomes "subscriber." Works deterministically, adds no latency.
2. Session context. When the user specifies their stack in settings (e.g., Python + Django + PostgreSQL), that context is automatically injected into initial_prompt when Whisper starts. Whisper uses this when choosing between similar words and begins preferring terms from the specified stack.
3. LLM postprocessing for low-confidence segments. If a transcript segment's confidence score is below the threshold (rare, roughly 3–5% of segments), it goes through a lightweight LLM with a prompt of "correct only technical transcription errors, don't change the meaning." Adds 200–400 ms of latency, so it's only applied when needed.
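The three layers can be sketched roughly as follows. This is an illustration, not our production code. To stay within the standard library it swaps Metaphone for plain string similarity from `difflib`, the dictionary holds four made-up entries instead of about 4,000, and the 0.6 confidence threshold and the prompt wording are placeholders. The string built by `build_initial_prompt` is meant to be passed as the `initial_prompt` argument of Whisper's `transcribe`.

```python
from difflib import get_close_matches

# Hypothetical excerpt of the term dictionary: spoken variant -> canonical form.
TERM_VARIANTS = {
    "кубернетис": "Kubernetes",
    "сабскрайбер": "subscriber",
    "энджиникс": "Nginx",
    "кафка": "Kafka",
}


def canonicalize(token: str, cutoff: float = 0.85) -> str:
    """Layer 1: replace a token with the canonical term if it is close to a known variant."""
    core = token.rstrip(".,!?;:")
    tail = token[len(core):]  # keep trailing punctuation
    match = get_close_matches(core.lower(), list(TERM_VARIANTS), n=1, cutoff=cutoff)
    return TERM_VARIANTS[match[0]] + tail if match else token


def postprocess(text: str) -> str:
    return " ".join(canonicalize(t) for t in text.split())


def build_initial_prompt(stack: list[str]) -> str:
    """Layer 2: context string to hand to Whisper as initial_prompt (placeholder wording)."""
    return "Техническое собеседование. Стек: " + ", ".join(stack) + "."


def needs_llm_review(confidence: float, threshold: float = 0.6) -> bool:
    """Layer 3: only low-confidence segments pay the extra LLM latency."""
    return confidence < threshold


print(postprocess("деплоим в кубернетис, сабскрайбер упал"))
print(build_initial_prompt(["Python", "Django", "PostgreSQL"]))
print(needs_llm_review(0.4), needs_llm_review(0.9))
```

With a strict cutoff, "кофта" is not close enough to "кафка" to be rewritten, so the dictionary layer cannot repair a word that was misheard as a valid Russian word. That case is what the session context and the fine-tune are for.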
Why foreign competitors don't do this
For Cluely, Sensei, Final Round AI, and Parakeet, Russia and the CIS represent optimistically 2–3% of TAM. They have enormous English-language markets: the US, India, Europe. Investing in a fine-tune for Russian IT jargon means hundreds of engineering hours plus training infrastructure plus ongoing model maintenance with updates. It pays off only if they capture meaningful share of the Russian market — which won't happen without localizing marketing, support, the site, payment, and everything else they have no plans to do.
The second problem is more serious than the engineering one. To assemble a Russian-language IT interview corpus, you need a user base in Russia that's willing to hand over audio of their interviews. Cluely doesn't have that base. JobPath assembles it through the platform itself, since our users are Russian-speaking by definition. We collect audio only with each user's personal consent and never from anyone without asking.
Bottom line: no foreign interview assistant will work properly with Russian IT speech in any foreseeable future. This isn't an engineering problem, it's an economic one. It's not that they can't build it — it's that building it makes no sense for them.
One more fact about Cluely specifically, not directly about STT but relevant for Russian users. In 2025 Cluely had a user data breach that included interview recordings. Details are in a writeup on Medium. For a Russian user this means interview audio is stored in the US, and if there's a breach it becomes public along with the transcripts. JobPath stores audio locally on the user's device. Only transcripts go to the server (and we're planning to encrypt those).
Where we still fall short
I don't want to leave the impression that fine-tuning is a silver bullet. It gives a huge boost on what it was trained for and no boost on what it wasn't.
Strong regional accents in Russian speech: Kazakh, Uzbek, Georgian, Armenian. Our corpus is optimized for the standard Moscow/St. Petersburg variant, and on accented speech WER is still 15–20% on technical terms. We're planning to add regional adapters, but that's still on the roadmap.
Bad microphones. A built-in mic on an old laptop, Bluetooth headphones from 2014, or a speaker sitting far from the mic. You can't fight physics: WER shoots past 20% for all systems in those conditions, including ours.
Dictated code. An interviewer reading a long class or method name aloud: AbstractSingletonProxyFactoryBean, buildUserPreferencesRepositoryImpl. Everything falls apart for everyone, including us. Fortunately this type of question is rare, but when it comes up the only thing that actually works is the user typing the code into the interface manually.
Numbers for the metrics-minded
Other characteristics of the stack if you're interested:
WER on technical Russian: 6–7% for us, 29–34% for competitors
STT latency: ~400 ms from end of phrase to ready transcript (Whisper large-v3 with LoRA on RTX 3060)
Memory: 3.2 GB GPU VRAM with LoRA, 6+ GB without
Total time from end of interviewer's phrase to first response token in UI: ~1.5 seconds

What's next
Three directions we're working on:
Regional adapters for CIS country accents. Speakers with Kazakh and Uzbek accents in Russian are a significant part of our audience, and right now we serve them worse than we should.
Ukrainian and other CIS languages as first-class support, not routed through Whisper's English mode.
Latency reduction — experimenting with distil-whisper and our LoRA adapter, targeting 150–200 ms instead of the current 400.
If anyone is working on similar problems, especially around code-switching and assembling technical speech corpora in non-English languages, I'd genuinely like to compare notes. Leave a comment!
