Jas

Word Error Rate is Broken: Why Dictation Apps are Testing the Wrong Metric

The standard speech-to-text metric is Word Error Rate, or WER.

But users don’t feel percentages.

A transcript can be 97% accurate and still feel like garbage if the 3% it gets wrong is a client’s name, a critical date, or a URL, or if the formatting comes out broken.
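To make that concrete, here is a minimal sketch of how WER is usually computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference. The sentence, the name, and the date below are invented example data, and the lowercase-and-split normalization is a simplifying assumption; real evaluations typically normalize punctuation and numerals more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: 20 reference words, one substituted word ("14" -> "40").
reference = ("please email dana okafor the signed contract by march 14 "
             "and copy the legal team on the final pricing sheet")
hypothesis = ("please email dana okafor the signed contract by march 40 "
              "and copy the legal team on the final pricing sheet")

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}, word accuracy: {1 - wer:.0%}")
```

The score looks excellent, yet the one wrong word is the deadline, so the user still has to go back and fix it by hand. WER weights every word equally; the user does not.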

For voice input, the real metric is much more binary:

  • Did I have to touch my keyboard after speaking?

We’re aiming for a zero-edit commit rate, meaning the share of dictations that get committed and never need a manual fix, not just raw transcription accuracy.
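One way to pin that metric down, as a sketch under assumptions: log what the tool inserted and what the user ended up keeping, then count the dictations where the two are identical. The `Dictation` record and its field names are hypothetical, and how you would actually capture the kept text is left out; this is not a description of how Juno measures it.

```python
from dataclasses import dataclass


@dataclass
class Dictation:
    committed_text: str  # what the tool inserted (hypothetical field)
    kept_text: str       # what the user kept after any manual cleanup (hypothetical field)


def zero_edit_rate(dictations: list[Dictation]) -> float:
    """Fraction of dictations the user kept exactly as committed."""
    if not dictations:
        raise ValueError("need at least one dictation")
    untouched = sum(1 for d in dictations if d.committed_text == d.kept_text)
    return untouched / len(dictations)


# Invented example log: three clean commits, one that needed a date fixed.
log = [
    Dictation("Send the invoice today.", "Send the invoice today."),
    Dictation("Meet at 3 pm.", "Meet at 3 pm."),
    Dictation("Due March 40.", "Due March 14."),
    Dictation("Thanks, talk soon.", "Thanks, talk soon."),
]
rate = zero_edit_rate(log)
print(f"zero-edit rate: {rate:.0%}")
```

Unlike WER, this is pass/fail per dictation: a single wrong character counts the same as a mangled sentence, which is closer to how cleanup feels to the person doing it.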

That shift changed how we built Juno. We stopped treating the live transcript and the final text as the same thing.

  • The Live HUD is there for real-time confidence while you talk.

  • The Final Pass is where Juno actually commits the clean text.

From there, the local writing engine takes over and, depending on context, strips disfluencies, formats messy notes, drafts a structured email, or outputs clean bullets.

The product loop is not: speech → transcript

It is: messy speech → visual preview → final transcript → writing transform → safe insertion
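The loop above can be sketched roughly as follows. Everything here is a hypothetical stand-in rather than Juno's implementation: `Target`, `run_dictation`, and `strip_disfluencies` are invented names, the speech recognizer is replaced by plain strings, the writing transform only removes a few filler words, and "safe insertion" is reduced to checking that the focused window has not changed since dictation started.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

# Tiny filler list for illustration; a real transform would do far more.
FILLERS = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def strip_disfluencies(text: str) -> str:
    """Writing transform: drop filler words and collapse leftover spaces."""
    cleaned = FILLERS.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


@dataclass
class Target:
    """Stand-in for an app window that can receive text."""
    window_id: str
    buffer: list[str] = field(default_factory=list)


def run_dictation(
    partials: list[str],
    final_transcript: str,
    target_at_start: Target,
    focused_now: Target,
    show_preview: Callable[[str], None] = print,
) -> bool:
    # Visual preview: partial results go to the HUD only, never into the app.
    for partial in partials:
        show_preview(partial)

    # Final transcript -> writing transform.
    text = strip_disfluencies(final_transcript)

    # Safe insertion: refuse to commit if focus moved to another window.
    if focused_now.window_id != target_at_start.window_id:
        return False
    target_at_start.buffer.append(text)
    return True


editor = Target("editor")
committed = run_dictation(
    partials=["um send", "um send the uh report"],
    final_transcript="um, send the uh report to Dana",
    target_at_start=editor,
    focused_now=editor,
)
print(committed, editor.buffer)
```

The point of the split is that the preview can be fast and wrong without consequences, because only the final, transformed text is ever written into the app.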

So if you use dictation or voice-writing tools, what is the actual make-or-break metric for you?

  • Zero-edit rate: Does it just work, with no cleanup required?

  • Final insertion latency: Does it paste instantly, or are you waiting around?

  • Live transcript: Can you see what you're saying as you speak?

  • Ubiquity: Does it work natively in every app?

  • Context awareness: Does it catch your specific jargon and contact names?

  • Voice correction: Can you fix a mistake without grabbing the mouse?

  • Reliability: Does it ever dump text into the wrong window?

  • Privacy: Is your voice processed locally, or is it sent to the cloud?
