Launching today

Google Gemini 3.1 Flash TTS
Text-to-speech API with natural language voice direction
Google's TTS API with inline audio tags, multi-speaker dialogue, and 70+ language support. For developers building voice agents, dubbing tools, or AI content products via the Gemini API and Vertex AI.
Gemini 3.1 Flash TTS is Google's new text-to-speech model, now available in preview via the Gemini API, Google AI Studio, and Vertex AI.
The problem:
TTS APIs have always treated voice as a static output.
You pick a voice, set a speed, and the model delivers a flat read.
Getting expressiveness meant engineering workarounds or accepting robotic delivery.
The solution:
Gemini 3.1 Flash TTS introduces audio tags: natural-language commands embedded directly in the text input to control tone, pacing, accent, and expression mid-sentence.
You can define scene context, cast multiple speakers with unique voice profiles, and export the full configuration as API code for consistent reuse across projects.
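To make that concrete, here is a minimal sketch of what a call might look like through the google-genai Python SDK. The model id and the bracketed tag syntax are assumptions based on this launch post, not confirmed details; the config shapes follow the SDK's existing TTS types.

```python
from google import genai
from google.genai import types
import wave

client = genai.Client()  # expects GEMINI_API_KEY in the environment

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",  # hypothetical preview model id
    # [bracketed] inline audio tags are an assumed syntax from the launch post
    contents="[calm] Welcome back. [whispering] Your download is ready.",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# Gemini TTS returns raw 16-bit PCM at 24 kHz; wrap it in a WAV container
pcm = response.candidates[0].content.parts[0].inline_data.data
with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(24000)
    f.writeframes(pcm)
```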
What stands out:
🎙 Inline audio tags mean you can shift tone, pacing, and delivery mid-sentence without re-prompting
🗣 Native multi-speaker dialogue means you can cast and direct multiple characters in a single API call (see the sketch after this list)
🌍 70+ language support with per-locale accent control means you can localise expressive speech without a separate pipeline
📤 Exportable voice config means your characters and delivery style stay consistent across every project
🔒 SynthID watermarking means every output is attributable as AI-generated out of the box
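For the multi-speaker point above, a hedged sketch of casting two characters in one call, assuming the google-genai SDK's existing multi-speaker config types carry over to this model (the model id and tag syntax remain assumptions):

```python
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-flash-tts-preview",  # hypothetical preview model id
    contents="""TTS the following conversation between Joe and Jane:
        Joe: [excited] You have to hear this.
        Jane: [skeptical] Go on, I'm listening.""",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            # map each named speaker in the script to its own prebuilt voice
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Joe",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Kore"
                            )
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Jane",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                                voice_name="Puck"
                            )
                        ),
                    ),
                ]
            )
        ),
    ),
)
```

Because the speaker-to-voice mapping lives in the config rather than the text, the same casting can be exported and reused verbatim across projects, which is what the exportable-config bullet is getting at.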
Who it's for:
Developers and product teams building voice agents, AI dubbing tools, interactive storytelling apps, and multilingual content platforms that need expressive, controllable speech at scale.
the inline audio tags unlock something specific for interactive web apps: not just narration, but contextual feedback. when building with voice input, you always want the confirmation to sound different from the question, which used to mean separate prompts or post-processing hacks. being able to embed that context inline changes the design space for conversational interfaces.
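rough sketch of what i mean, assuming the bracketed tag syntax from the launch post; the helper is hypothetical, and only the text input changes between interaction states:

```python
# hypothetical helper: same voice config, different inline direction per state
def tts_input(kind: str, text: str) -> str:
    tags = {
        "question": "[curious, rising intonation]",
        "confirmation": "[warm, settled]",
    }
    return f"{tags[kind]} {text}"

tts_input("question", "Did you mean the 3pm meeting?")
tts_input("confirmation", "Done. Your meeting is moved to 3pm.")
```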
I ran the tests myself, and oh my god, the results turned out amazing.