MiMo-V2.5 Voice

Bilingual ASR for dialects, code-switching, and songs

109 followers

Transcription • AI Voice Agent Infrastructure

MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi that transcribes Mandarin, English, eight Chinese dialects, code-switched speech, and song lyrics. Built for ML engineers, researchers, and developers building real-world voice applications.

Free

Launch tags: API • Open Source • Artificial Intelligence

Launch Team

Hunter

📌

Whisper changed what people expected from open-source ASR. Three years later, the leaderboard looks very different.

What it is: MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi MiMo, MIT-licensed and available on HuggingFace, built for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.

The problem: most ASR models are benchmarked on clean studio data and deployed into the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products quietly fail.

The solution: staged training combining large-scale mid-training, supervised fine-tuning, and a reinforcement learning algorithm specifically targeting the scenarios where conventional models break down. Native punctuation from prosody means transcripts arrive ready to use.

What makes it different: on the Open ASR Leaderboard, MiMo-V2.5-ASR posts 5.73% average WER on English, below Whisper large-v3 at 7.44%. On Wu dialect it scores 19.55% vs FunASR-1.5 at 29.08%. On lyrics, 3.95% on m4singer vs Gemini 2.5 Pro at 4.25%. These are not cherry-picked scenarios — they are the hard ones.
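For anyone less familiar with the metric: WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. Below is a minimal sketch of that definition, plus a helper that converts the figures quoted above into relative error reductions. This is not the leaderboard's scoring code (which also normalizes text before scoring), and the function names `wer` and `relative_reduction` are illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction of the baseline's errors that the candidate removes."""
    return (baseline - candidate) / baseline


# Error rates quoted in this post, in percent: (baseline, MiMo-V2.5-ASR)
quoted = {
    "English vs Whisper large-v3": (7.44, 5.73),
    "Wu dialect vs FunASR-1.5": (29.08, 19.55),
    "m4singer lyrics vs Gemini 2.5 Pro": (4.25, 3.95),
}
for name, (baseline, mimo) in quoted.items():
    print(f"{name}: {relative_reduction(baseline, mimo):.1%} relative reduction")
```

Read that way, the quoted numbers work out to roughly 23%, 33%, and 7% fewer errors than the respective baselines.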

Key features:

Eight Chinese dialects natively supported, including Wu, Cantonese, Hokkien, Sichuanese
Chinese-English code-switching with no language tags
Lyrics transcription under accompaniment and pitch variation
Multi-speaker and noisy environment robustness
Native punctuation, no post-processing needed
MIT license, Python API, Gradio demo, self-hostable
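
If you want to evaluate the code-switching claim on your own audio, the scoring side needs a tokenizer that handles both scripts, since Chinese has no spaces between words. A common convention is to count each Chinese character as one token and each English word as one token. The sketch below assumes that convention (it is not confirmed as what the MiMo team uses), and `mixed_tokens` and `is_code_switched` are hypothetical helper names.

```python
import re

# One token per CJK Unified Ideograph, or one token per run of Latin
# letters, digits, and apostrophes. Punctuation and whitespace are dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")
_CJK = re.compile(r"[\u4e00-\u9fff]")
_LATIN = re.compile(r"[A-Za-z]")


def mixed_tokens(text: str) -> list[str]:
    """Tokenize mixed Chinese-English text for error-rate scoring."""
    return [t.lower() for t in _TOKEN.findall(text)]


def is_code_switched(text: str) -> bool:
    """True when the text contains both Chinese characters and Latin letters."""
    return bool(_CJK.search(text)) and bool(_LATIN.search(text))


sample = "我想去 Times Square 看看。"  # made-up example sentence
print(mixed_tokens(sample))
print(is_code_switched(sample))
```

`is_code_switched` is handy for slicing a test set so that mixed-language utterances are scored separately from monolingual ones.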

Benefits:

Production-grade accuracy on the audio conditions that actually exist in the field
One model replaces multiple regional or domain-specific ASR solutions
Self-hosting eliminates per-call API costs and keeps data on your infra
Ready-to-use punctuated output cuts one step from every downstream pipeline
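
To illustrate the last point: when a transcript already carries punctuation, sentence segmentation downstream can be a single regex split instead of a separate punctuation-restoration model. A minimal sketch, assuming the output mixes full-width Chinese and ASCII sentence-final marks. The sample transcript is made up and `split_sentences` is a hypothetical helper, not part of the MiMo API.

```python
import re

# Split after full-width sentence enders (。！？), or after ASCII enders (.!?)
# when they are followed by whitespace, so decimals like "3.95" stay intact.
_SENTENCE_END = re.compile(r"(?<=[。！？])\s*|(?<=[.!?])\s+")


def split_sentences(transcript: str) -> list[str]:
    """Split an already-punctuated transcript into sentences."""
    return [s for s in _SENTENCE_END.split(transcript) if s]


sample = "今天天气很好。Let's go to the park! 好的。"  # made-up transcript
for sentence in split_sentences(sample):
    print(sentence)
```

This is deliberately simple: abbreviations such as "Dr. Smith" would still be split, so a production pipeline may want a proper sentence segmenter.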

Who it's for: ML engineers and voice product teams building bilingual or Chinese-language transcription pipelines who need accuracy that holds up outside the lab.

Open-source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point that the gap is now very small, and in some scenarios gone.

2mo ago

Dialect and code-switching support is the piece that usually gets skipped in ASR research because it's hard, but it's exactly where real-world audio breaks down. Anyone building a voice product for users in multilingual environments (SEA, MENA, parts of Africa) runs into this immediately.

One application that jumped to mind reading this: location-based audio guides. I built a travel app called StoryRoute (https://storyroute.netlify.app/) that lets people explore cities through interactive, story-driven walks. Accurate multilingual ASR would open up a lot for that use case — imagine a guide that understands a question asked in Mandarin mixed with English street names, or local dialect terms for landmarks.

The code-switching capability in particular seems underexplored for tourism and cultural content. Is the model trained on domain-specific vocabulary or more general conversational speech?

2mo ago

Code switching and lyrics are exactly where ASR demos usually fall apart. Hitting both, plus Chinese dialect coverage, makes this feel grounded in real audio instead of benchmark theater. How much latency does that add in live pipelines?

2mo ago
