Microsoft MAI-Voice-2 - Expressive TTS with voice cloning in 15 languages
Microsoft's most expressive TTS model yet — voice cloning from short samples, fine-grained emotional control, and consistent voice identity across 15 languages. Now live in Azure AI Foundry at $22 per million characters, with integrations rolling out in VSCode, Dynamics 365 Contact Center, and Teams. For builders shipping voice agents who need production-grade prosody without the OpenAI Realtime API price tag.
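To put the $22 per million characters figure in context, here is a rough back-of-the-envelope sketch. The function name and the example workload (1,000 calls at about 1,500 spoken characters each) are assumptions for illustration only; actual billing rules may differ from this simple per-character estimate.

```python
# List price quoted in the launch post above (USD per one million characters).
PRICE_PER_MILLION_CHARS_USD = 22.0


def tts_cost_usd(num_chars: int,
                 price_per_million: float = PRICE_PER_MILLION_CHARS_USD) -> float:
    """Estimate synthesis cost in USD for a given number of characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars * price_per_million / 1_000_000


# Hypothetical workload: 1,000 calls, ~1,500 synthesized characters per call.
total_chars = 1_000 * 1_500
print(f"${tts_cost_usd(total_chars):.2f}")  # $33.00
```

Swap in your own call volume and average characters per call to get a comparable estimate.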

Replies
Honestly
Incredible that these voice models are becoming indistinguishable from real human voices. Are there any benchmarks or detailed tests on the complexity of questions the models can answer? That gap has been a major obstacle to my adopting AI voice agents that handle customer support without assistance, so I'm curious how this is evolving.
@scott_davidson_jr Scott, the trick here is separating the voice model from the reasoning engine. Tools like MAI-Voice-2 or OpenAI Realtime only handle how the agent sounds (prosody and latency).
How well it answers complex questions depends entirely on the underlying LLM, your RAG setup, and your internal knowledge base.
The voice tech is finally ready—the real hurdle right now is fine-tuning the logic so it doesn't hallucinate or break script on tough support calls.
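To make that separation concrete, here is a minimal sketch of how the pieces could be wired together. Everything in it is hypothetical: `VoiceAgent`, `retrieve`, `reason`, and `speak` are placeholder names, and the stub functions stand in for a real knowledge-base lookup, LLM call, and TTS call. It is not the MAI-Voice-2 or OpenAI API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoiceAgent:
    """Hypothetical pipeline: answer quality lives in retrieve/reason,
    while speak only controls how the answer sounds."""
    retrieve: Callable[[str], List[str]]       # knowledge-base lookup (RAG)
    reason: Callable[[str, List[str]], str]    # LLM: question + context -> text
    speak: Callable[[str], bytes]              # TTS: text -> audio bytes

    def answer(self, question: str) -> bytes:
        context = self.retrieve(question)
        text = self.reason(question, context)
        return self.speak(text)


# Stub components (assumptions) so the sketch runs without any external service.
def fake_retrieve(question: str) -> List[str]:
    return ["Refunds are processed within 5 business days."]


def fake_reason(question: str, context: List[str]) -> str:
    # A real LLM call would go here; fall back to a handoff when context is empty.
    return context[0] if context else "Let me transfer you to a human agent."


def fake_speak(text: str) -> bytes:
    # A real TTS call would return audio; here we just encode the text.
    return text.encode("utf-8")


agent = VoiceAgent(fake_retrieve, fake_reason, fake_speak)
audio = agent.answer("How long do refunds take?")
```

Because `speak` is the only part that touches the voice model, you can swap TTS providers without changing how well the agent answers, and improve the retrieval or LLM side without touching the voice.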
Honestly
@habibferdous I see, thank you for the additional explanation! Very exciting times we’re in.
Refocus
The consistent voice identity across 15 languages is what stands out to me here. I work on a voice companion that calls aging parents every day, and a lot of our families are immigrants whose parents are most at ease in their first language. A warm, familiar voice that holds up in Tagalog or Mandarin is often the difference between a call someone looks forward to and one they let ring out. Question for the team: how stable is the cloned identity and emotional control over a full 10-minute conversation, or does the prosody drift toward neutral as the session runs longer?
@igorgurovich I love that application for aging parents, Igor. In my experience building out these workflows, the longer the call, the harder it is to avoid that 'robot moment.' Most of my pilot testing so far has been focused on shorter, highly qualified meeting booking flows where the emotional control performs beautifully. I actually haven't pushed a single session to a continuous 10 minutes yet, so I'm incredibly curious to hear the makers' answer on how the prosody holds up at that length. Tagging the team to chime in!