Microsoft's most expressive TTS model yet: voice cloning from short samples, fine-grained emotional control, and consistent voice identity across 15 languages. Now live in Azure AI Foundry at $22 per million characters, with integrations rolling out in VS Code, Dynamics 365 Contact Center, and Teams. For builders shipping voice agents who need production-grade prosody without the OpenAI Realtime API price tag.
At Microsoft AI, our vision is humanist superintelligence. That means building world-class models that are as safe as they are capable, made for the demands of real work, and designed not to outpace human potential, but to amplify it.
MAI-Transcribe-1 is Microsoft’s new multilingual speech-to-text model built for real-world audio. It delivers best-in-class accuracy across 25 languages, strong robustness in noisy environments, faster batch transcription, and pricing aimed at production speech workflows.
MAI-Image-2 is Microsoft's new text-to-image model built with photographers, designers, and visual storytellers in mind. It pushes hard on photoreal lighting, reliable in-image text, and rich cinematic scenes for actual creative work.