Text-to-Speech Voice AI Model Guide 2025
Anyone building voice AI agents knows how hard it is to stay up-to-date with the latest text-to-speech voice models.
We've spent time testing and experimenting with the available paid and open-source text-to-speech voice AI models, and have consolidated our notes and hands-on experience into a single guide for developers evaluating their options.
Read on for a quick primer on the current state of text-to-speech voice AI models. (If you’re just looking for the model comparison, skip the next section).
Text-to-speech voice models are improving rapidly
A year ago the only reliable way to add a fluid, natural-sounding voice to an LLM-powered AI agent in production was to call a model provider’s API and accept the cost, latency and vendor lock-in associated with choosing a cloud service.
Today, things look quite different. Proprietary models have improved tremendously in the quality of speech they can generate, and open-source models like Coqui XTTS v2.0.3, Canopy Labs' Orpheus, and Hexgrad's Kokoro 82M have kept pace: in blind tests, most listeners can't reliably distinguish them from the incumbents.
Broadly, today's models fall into two distinct categories that serve fundamentally different purposes:
Real-time models like Cartesia Sonic, ElevenLabs Flash, and Hexgrad Kokoro prioritize streaming audio generation, producing speech as text arrives rather than waiting for complete sentences. They excel in conversational AI, where low latency makes the difference between natural dialogue and awkward pauses. Architected for immediate response, they may sacrifice some prosodic quality for speed.
High-fidelity models like Dia 1.6B and Coqui XTTS take the opposite approach, processing entire text passages to optimize for naturalness, emotion, and overall speech quality. They're ideal for content creation, audiobook narration, or any application where the extra processing time translates to noticeably better output.
This architectural difference explains why you'll see such variation in the latency figures across our comparison table — it's not just about optimization, but the models’ fundamental design and intended purpose.
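To make the streaming-versus-batch distinction concrete, here's a toy simulation (no real TTS engine involved; `synthesize_chunk` is a stand-in that just sleeps in proportion to text length) showing why a streaming model produces its first audio so much sooner:

```python
import time
from typing import Iterator

def synthesize_chunk(text: str) -> bytes:
    """Stand-in for a real TTS engine: pretend compute time scales with text length."""
    time.sleep(0.005 * len(text))
    return b"\x00" * 480  # placeholder PCM audio frames

def streaming_tts(text_tokens: Iterator[str]) -> Iterator[bytes]:
    """Real-time style: emit audio as soon as each text fragment arrives."""
    for token in text_tokens:
        yield synthesize_chunk(token)

def batch_tts(text_tokens: Iterator[str]) -> bytes:
    """High-fidelity style: wait for the full passage, then synthesize once."""
    return synthesize_chunk(" ".join(text_tokens))
```

Calling `next()` on `streaming_tts(...)` returns audio after only the first token is synthesized, while `batch_tts` blocks until the whole passage is processed — the same trade-off the real models above make at much larger scale.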
Making sense of voice AI latency metrics
When evaluating model speed, you'll often encounter "TTFB" (Time To First Byte): This measures how long it takes from sending your text request to receiving the first chunk of audio data back (essentially, how quickly you hear the voice start speaking). This metric is crucial for real-time applications because it directly impacts the responsiveness users experience.
For context, human conversation typically has response delays under 200ms, so TTFB figures above this threshold start to feel unnatural in conversational AI. However, TTFB only tells part of the story: total processing time for longer passages and the consistency of streaming quality matter just as much for overall user experience. The nuances of conversational latency are a separate, deep (and fascinating) topic. More on that soon!
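If you want to benchmark TTFB yourself, a minimal sketch is to time how long a chunk stream takes to yield its first piece of audio. The helper below works on any iterator of bytes (for example, the chunked body of a streaming HTTP response from a TTS API); `fake_stream` is a simulated stream used here so the example is self-contained:

```python
import time
from typing import Iterable, Iterator, Tuple

def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume an audio chunk stream; return (ttfb_ms, total_ms, n_chunks)."""
    start = time.monotonic()
    ttfb_ms = None
    n = 0
    for _chunk in chunks:
        if ttfb_ms is None:
            # First audio arrived: this is the delay the user actually hears.
            ttfb_ms = (time.monotonic() - start) * 1000
        n += 1
    total_ms = (time.monotonic() - start) * 1000
    return (ttfb_ms if ttfb_ms is not None else total_ms), total_ms, n

def fake_stream(delay_s: float, n_chunks: int) -> Iterator[bytes]:
    """Simulated TTS stream: one audio chunk every `delay_s` seconds."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 960
```

Measuring both TTFB and total time in one pass is deliberate: as noted above, a model with a great TTFB can still stall mid-utterance, and only the total/consistency numbers reveal that.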
Why great voice models don't equal great voice AI products
In short, latency and cost (the twin hurdles that kept real-time speech out of most production roadmaps) have been dramatically reduced in the last 12 months.
But cheap, fast, high-quality voices alone don't automatically translate into great real-time conversational products. A production-grade agent still needs to capture microphone audio, gate and normalize it, transcribe it in real time, pass clean text to an LLM or custom backend, stream the response to the chosen TTS, and return the synthesized audio with no audible gaps, all while handling disconnects, turn-taking, silence detection, regional scaling, and usage accounting.
That complexity is exactly where many developers start to get a headache when looking at building production-ready voice AI, even if they have a great application in mind.
With open models now offering incredible speech quality and rich emotion, matching the leaders on quality (and often beating them on speed), the main competitive frontier is infrastructure: who can deliver those voices, at scale, with the lowest latency and the least friction?
We’re building Layercode to eliminate this complexity from the equation: Handling all of the plumbing required to power production-ready low-latency voice agents (read more about how here).
Layercode is currently in beta, and we’re working to integrate as many model providers as we can. If you are working on one of the voice models we’ve tested for this post — or one we haven’t — we’d love to explore an integration.

Comparing today’s leading text-to-speech voice models
Beyond the real-time vs. non-real-time distinction, there's significant nuance to consider when evaluating text-to-speech voice models for your specific use case.
Customer service bots and phone-based agents benefit from the ultra-low latency of real-time models like Cartesia Sonic or ElevenLabs Flash, where every millisecond of delay affects conversation flow. Content creation workflows—podcast generation, audiobook narration, or video voiceovers—can leverage the superior quality of non-real-time models like Dia 1.6B and Eleven Multilingual v2, where processing time matters less than the final output.
In our experience, new models' marketing claims don't always hold up when testing and using the models in production scenarios.
The comparison table below shows how these models stack up across key technical dimensions, followed by our hands-on experience with each model. Note how the real-time models cluster around the 40-200ms TTFB range, while non-real-time models prioritize quality over speed.
To keep this resource focused on the most practical concern for developers, we've ranked the subsequent list of models by overall voice quality, as experienced by end users.
If you're interested to go deeper on any of these models, check out the full guide on our website →