Word Error Rate Is Broken: Why Dictation Apps Are Testing the Wrong Metric
The standard speech-to-text metric is Word Error Rate, or WER.
But users don’t feel percentages.
A transcript can be 97% accurate and still feel like garbage if the missing 3% is a client’s name, a critical date, a URL, or broken formatting.
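To make the point concrete, here is a minimal sketch of how WER is usually computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. The function name and the sample sentences are made up for illustration, and the lowercase-and-split normalization is a simplifying assumption; real evaluation setups normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of nine, and it is the date.
reference = "Send the contract to Dana Whitfield by March 14"
hypothesis = "Send the contract to Dana Whitfield by March 40"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

WER weights every word equally, so the wrong date above costs exactly as much as a dropped "the" would, even though only one of those forces you back to the keyboard.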
For voice input, the real metric is much more binary:
Did I have to touch my keyboard after speaking?
We’re aiming for a zero-edit commit rate, not just raw transcription accuracy.
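One way to pin that down, as a rough sketch rather than a description of how Juno measures it: treat each dictation as a commit, and count the share of commits where the text the user kept is identical to the text that was inserted. The `Commit` record and its field names are hypothetical, and exact string equality is an assumption; a real tracker might ignore trailing whitespace or only watch edits within a short window.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    inserted_text: str  # what the dictation tool inserted (hypothetical field)
    kept_text: str      # what remained after any manual edits (hypothetical field)


def zero_edit_rate(commits: list[Commit]) -> float:
    """Fraction of commits the user did not have to touch afterwards."""
    if not commits:
        raise ValueError("need at least one commit")
    untouched = sum(1 for c in commits if c.inserted_text == c.kept_text)
    return untouched / len(commits)


# Hypothetical log: two clean commits, one where a name had to be fixed.
log = [
    Commit("Meeting moved to Friday.", "Meeting moved to Friday."),
    Commit("Thanks, talk soon.", "Thanks, talk soon."),
    Commit("Loop in Dana Whitefield.", "Loop in Dana Whitfield."),
]
print(f"zero-edit rate: {zero_edit_rate(log):.2f}")
```

Unlike WER, this metric is all-or-nothing per commit: a single corrected character counts the whole dictation as a miss.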
That shift changed how we built Juno. We stopped treating the live transcript and the final text as the same thing.
The Live HUD is there for real-time confidence while you talk.
The Final Pass is where Juno actually commits the clean text.
From there, the local writing engine takes over based on context, stripping disfluencies, formatting messy notes, drafting a structured email, or outputting clean bullets.
The product loop is not: speech → transcript
It is: messy speech → visual preview → final transcript → writing transform → safe insertion
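The last three stages of that loop can be sketched roughly as follows. This is an illustration under stated assumptions, not Juno's implementation: every name here (`DictationSession`, `strip_disfluencies`, `commit`, the window IDs) is hypothetical, the filler-word regex is deliberately naive, and the live preview stage is left out because it only displays text.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Naive filler pattern (assumption): "um"/"uh" with an optional trailing comma.
FILLERS = re.compile(r"\b(?:um+|uh+)\b,?\s*", re.IGNORECASE)


def strip_disfluencies(text: str) -> str:
    """Writing transform: drop filler words and collapse leftover spaces."""
    cleaned = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


@dataclass
class DictationSession:
    target_window_id: str  # window that had focus when dictation started


def commit(
    session: DictationSession,
    final_transcript: str,
    transform: Callable[[str], str],
    focused_window_id: str,
    insert: Callable[[str], None],
) -> bool:
    """Final transcript -> writing transform -> safe insertion.

    Returns False without inserting if focus moved to another window.
    """
    text = transform(final_transcript)
    if focused_window_id != session.target_window_id:
        return False  # never dump text into the wrong window
    insert(text)
    return True


# Hypothetical usage: collect inserted text in a list instead of a real app.
inserted: list[str] = []
session = DictationSession(target_window_id="mail-compose")
ok = commit(
    session,
    "um, send the uh report to Dana",
    strip_disfluencies,
    focused_window_id="mail-compose",
    insert=inserted.append,
)
print(ok, inserted)
```

Keeping the transform and the focus check as separate steps means the cleaned text is only committed when the destination is still the one the user was speaking into.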
So if you use dictation or voice-writing tools, what is the actual make-or-break metric for you?
Zero-edit rate: Does it just work, with no cleanup required?
Final insertion latency: Does it paste instantly, or are you waiting around?
Live transcript: Can you see your words as you speak?
Ubiquity: Does it work natively in every app?
Context awareness: Does it catch your specific jargon and contact names?
Voice correction: Can you fix a mistake without grabbing the mouse?
Reliability: Does it ever dump text into the wrong window?
Privacy: Is your voice processed locally, or is it sent to the cloud?

