Habib Ferdous

Microsoft MAI-Voice-2 - Expressive TTS with voice cloning in 15 languages

Microsoft's most expressive TTS model yet: voice cloning from short samples, fine-grained emotional control, and consistent voice identity across 15 languages. Now live in Azure AI Foundry at $22 per million characters, with integrations rolling out in VS Code, Dynamics 365 Contact Center, and Teams. For builders shipping voice agents who need production-grade prosody without the OpenAI Realtime API price tag.

Habib Ferdous
Hunter
I build voice agents for service businesses — mostly healthcare and home services — and the #1 unsolved problem in this space is prosody. The "is this a robot?" moment usually happens in the first 8 seconds of a call. MAI-Voice-2 is the first TTS I've A/B tested where my pilot users couldn't tell. The $22/M chars pricing lands below ElevenLabs and matches gpt-realtime's TTS layer. If you're shipping voice and wedded to OpenAI Realtime, worth running the side-by-side. Curious if Microsoft is planning sub-200ms first-token latency via WebRTC streaming next.
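
For anyone budgeting against the $22 per million characters figure, here is a rough sketch of the arithmetic. Only the price comes from the listing; the call volume and characters per call are made-up placeholders, so swap in your own numbers.

```python
# Price quoted in the listing: $22 per million characters.
PRICE_PER_MILLION_CHARS_USD = 22.0


def tts_cost_usd(num_chars: int,
                 price_per_million: float = PRICE_PER_MILLION_CHARS_USD) -> float:
    """Return the TTS cost in USD for synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars * price_per_million / 1_000_000


# Hypothetical workload (assumption, not from the listing):
# 1,000 calls, each with about 1,500 characters of agent speech.
calls = 1_000
chars_per_call = 1_500
total = tts_cost_usd(calls * chars_per_call)
print(f"${total:.2f}")  # prints "$33.00"
```

This covers synthesized characters only. LLM, telephony, and speech-to-text costs are separate line items.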
Scott Davidson Jr.

Incredible that these voice models are becoming indistinguishable from real human voices. Are there any benchmarks or detailed tests of how complex a question these models can answer? That gap has been a major obstacle to my adopting AI voice agents that handle customer support without assistance, and I'm curious how this is evolving.

Habib Ferdous

@scott_davidson_jr Scott, the trick here is separating the voice model from the reasoning engine. Tools like MAI-Voice-2 or OpenAI Realtime only handle how the agent sounds (prosody and latency).

How well it answers complex questions depends entirely on the underlying LLM, your RAG setup, and your internal knowledge base.

The voice tech is finally ready—the real hurdle right now is fine-tuning the logic so it doesn't hallucinate or break script on tough support calls.
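
To make the separation concrete, here is a minimal sketch of one agent turn with the reasoning layer and the voice layer kept apart. Everything in it is hypothetical: `retrieve`, `answer`, `speak`, and the stub functions are illustrative names, not the MAI-Voice-2 or OpenAI API, and the "retrieval" is a toy word-overlap match standing in for a real RAG setup.

```python
from typing import Callable, List


def retrieve(question: str, knowledge_base: List[str]) -> List[str]:
    """Toy retrieval: return entries sharing at least one word with the question."""
    words = set(question.lower().split())
    return [doc for doc in knowledge_base if words & set(doc.lower().split())]


def answer(question: str, knowledge_base: List[str],
           llm: Callable[[str, List[str]], str]) -> str:
    """Reasoning layer: answer quality depends on the LLM and the retrieved context."""
    return llm(question, retrieve(question, knowledge_base))


def speak(text: str, tts: Callable[[str], bytes]) -> bytes:
    """Voice layer: only turns finished text into audio; it does no reasoning."""
    return tts(text)


def handle_turn(question: str, knowledge_base: List[str],
                llm: Callable[[str, List[str]], str],
                tts: Callable[[str], bytes]) -> bytes:
    """One conversational turn: reason first, then voice the result."""
    return speak(answer(question, knowledge_base, llm), tts)


# Stand-ins for real services (assumptions for illustration only).
def stub_llm(question: str, context: List[str]) -> str:
    # Answer from retrieved context, or hand off instead of guessing.
    return context[0] if context else "Let me transfer you to a person."


def stub_tts(text: str) -> bytes:
    # A real TTS call would return audio; here we just encode the text.
    return text.encode("utf-8")


kb = ["We are open 9 to 5 on weekdays."]
audio = handle_turn("When are you open?", kb, stub_llm, stub_tts)
```

Swapping `stub_tts` for a different voice model changes how the reply sounds, not what it says. Hallucination and off-script answers get fixed in `answer`, not in `speak`.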

Scott Davidson Jr.

@habibferdous I see, thank you for the additional explanation! Very exciting times we’re in.

Igor Gurovich

The consistent voice identity across 15 languages is what stands out to me here. I work on a voice companion that calls aging parents every day, and a lot of our families are immigrants whose parents are most at ease in their first language. A warm, familiar voice that holds up in Tagalog or Mandarin is often the difference between a call someone looks forward to and one they let ring out. Question for the team: how stable is the cloned identity and emotional control over a full 10-minute conversation, or does the prosody drift toward neutral as the session runs longer?

Habib Ferdous

@igorgurovich I love that application for aging parents, Igor. In my experience building out these workflows, the longer the call, the harder it is to avoid that 'robot moment.' Most of my pilot testing so far has been focused on shorter, highly qualified meeting booking flows where the emotional control performs beautifully. I actually haven't pushed a single session to a continuous 10 minutes yet, so I'm incredibly curious to hear the makers' answer on how the prosody holds up at that length. Tagging the team to chime in!