Text-to-Speech Voice AI Model Guide 2025
Anyone building voice AI agents knows how hard it is to stay up-to-date with the latest text-to-speech voice models.
We spend a lot of time testing and experimenting with the available paid and open-source text-to-speech voice AI models, and we've consolidated our notes and hands-on experience into a single guide for developers evaluating multiple models.
Read on for a quick primer on the current state of text-to-speech voice AI models. (If you’re just looking for the model comparison, skip the next section).
Text-to-speech voice models are improving rapidly
A year ago the only reliable way to add a fluid, natural-sounding voice to an LLM-powered AI agent in production was to call a model provider’s API and accept the cost, latency and vendor lock-in associated with choosing a cloud service.
Today, things look quite different. The quality of speech these models can generate has improved tremendously, and open-source models like Coqui XTTS v2.0.3, Canopy Labs’ Orpheus and Hexgrad’s Kokoro-82M have developed in lockstep with the proprietary ones: in blind tests, most listeners can’t reliably tell them apart from the incumbents.
Broadly, today's models fall into two distinct categories that serve fundamentally different purposes:
Real-time models like Cartesia Sonic, ElevenLabs Flash, and Hexgrad Kokoro prioritize streaming audio generation, producing speech as text arrives rather than waiting for complete sentences. They excel in conversational AI, where low latency makes the difference between natural dialogue and awkward pauses. Because they're architected for immediate response, they may sacrifice some prosodic quality for speed.
High-fidelity models like Dia 1.6B and Coqui XTTS take the opposite approach, processing entire text passages to optimize for naturalness, emotion, and overall speech quality. They're ideal for content creation, audiobook narration, or any application where the extra processing time translates to noticeably better output quality.
This architectural difference explains why you'll see such variation in the latency figures across our comparison table — it's not just about optimization, but the models’ fundamental design and intended purpose.
Making sense of voice AI latency metrics
When evaluating model speed, you'll often encounter "TTFB" (Time To First Byte): This measures how long it takes from sending your text request to receiving the first chunk of audio data back (essentially, how quickly you hear the voice start speaking). This metric is crucial for real-time applications because it directly impacts the responsiveness users experience.
For context, human conversation typically has response delays under 200ms, so TTFB figures above this threshold start to feel unnatural in conversational AI. However, TTFB only tells part of the story: total processing time for longer passages and the consistency of streaming quality matter just as much for overall user experience. The nuances of conversational latency are a separate, deep (and fascinating) topic. More on that soon!
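If you want to sanity-check a provider's TTFB claims yourself, the measurement is straightforward: stream the response and time the gap between sending the request and receiving the first audio bytes. Here's a minimal Python sketch; the endpoint URL, payload shape, and auth header are hypothetical placeholders, so substitute your provider's actual streaming TTS API:

```python
# Minimal TTFB measurement sketch. The endpoint, payload shape, and auth
# header below are placeholders, not any real provider's API.
import time
import requests

def measure_ttfb(url: str, text: str, api_key: str) -> float:
    """Seconds from sending the request to receiving the first audio chunk."""
    start = time.perf_counter()
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"text": text},
        stream=True,   # don't let requests buffer the whole response
        timeout=30,
    )
    resp.raise_for_status()
    # iter_content yields as soon as the first bytes arrive on the wire
    for _first_chunk in resp.iter_content(chunk_size=1024):
        return time.perf_counter() - start
    raise RuntimeError("stream ended before any audio arrived")

ttfb = measure_ttfb("https://api.example-tts.com/v1/stream", "Hello!", "YOUR_KEY")
print(f"TTFB: {ttfb * 1000:.0f} ms")
```

Run it a few dozen times and look at the p50 and p95 rather than a single reading, since network jitter and model cold starts can easily dwarf the model's real synthesis latency.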
Why great voice models don’t = great voice AI products
In short, latency and cost (the twin hurdles that kept real-time speech out of most production roadmaps) have been dramatically reduced in the last 12 months.
But cheap, fast, high-quality voices alone don’t automatically translate into great real-time conversational products. A production-grade agent still needs to capture microphone audio, gate and normalize it, transcribe it in real time, pass clean text to an LLM or custom backend, stream the response to the chosen TTS, then return the synthesized audio with no audible gaps — all while handling disconnects, turn-taking, silence detection, regional scaling and usage accounting.
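To make that concrete, here's a deliberately minimal Python sketch of just one link in that chain: streaming LLM output into a TTS sentence by sentence, so audio starts playing before the full reply exists. The llm_tokens and synthesize functions are invented stand-ins for real provider SDK calls, and the sentence splitting is naive on purpose:

```python
# Sketch of the core streaming hand-off between an LLM and a TTS.
# llm_tokens() and synthesize() are placeholders, not real SDK calls.
import re

def llm_tokens():
    # Stand-in for a streaming LLM response, yielded token by token
    yield from "Sure! Your order shipped yesterday. It should arrive Friday.".split(" ")

def synthesize(sentence: str) -> bytes:
    # Stand-in for a streaming TTS call that returns audio bytes
    return f"<audio for: {sentence}>".encode()

def stream_reply_to_tts():
    buffer = ""
    for token in llm_tokens():
        buffer += token + " "
        # Flush each complete sentence so the TTS can start speaking
        # while the LLM is still generating the rest of the reply
        while match := re.search(r"(.+?[.!?])\s", buffer):
            yield synthesize(match.group(1))
            buffer = buffer[match.end():]
    if buffer.strip():
        yield synthesize(buffer.strip())

for chunk in stream_reply_to_tts():
    print(chunk)  # in production, write to the audio output instead
```

Everything around this loop (barge-in handling, buffering, reconnects, playback timing) is where the real engineering time goes.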
That complexity is exactly where many developers start to get a headache when they look at building production-ready voice AI, even if they have a great application in mind.
With open models now matching the leaders on speech quality and emotional range (and often beating them on speed), the main competitive frontier is infrastructure: who can deliver those voices, at scale, with the lowest latency and the least friction?
We’re building Layercode to eliminate this complexity from the equation: handling all of the plumbing required to power production-ready, low-latency voice agents (read more about how here).
Layercode is currently in beta, and we’re working to integrate as many model providers as we can. If you are working on one of the voice models we’ve tested for this post — or one we haven’t — we’d love to explore an integration.

Comparing today’s leading text-to-speech voice models
Beyond the real-time vs. non-real-time distinction, there's significant nuance to consider when evaluating text-to-speech voice models for your specific use case.
Customer service bots and phone-based agents benefit from the ultra-low latency of real-time models like Cartesia Sonic or ElevenLabs Flash, where every millisecond of delay affects conversation flow. Content creation workflows — podcast generation, audiobook narration, or video voiceovers — can leverage the superior quality of non-real-time models like Dia 1.6B and Eleven Multilingual v2, where processing time matters less than the final output.
In our experience, a new model's marketing claims don't always match what we find when testing and using the model in production scenarios.
The comparison table below shows how these models stack up across key technical dimensions, followed by our hands-on experience with each model. Note how the real-time models cluster around the 40-200ms TTFB range, while non-real-time models prioritize quality over speed.
To keep this resource focused on the most practical concern for developers, we've ranked the subsequent list of models by overall voice quality, as experienced by end users.
If you're interested in going deeper on any of these models, check out the full guide on our website →

Replies
What I prioritize in a text-to-speech system is emotional depth. I hope the voice feels less robotic and conveys real emotions.
Layercode
@charlene_zh1 yeah, this is absolutely crucial for a lot of real-world use cases. We're building in this space at an exciting time — the rate of model development is crazy, and today we have lots of great options for models that can actually deliver emotional depth. Even 6 months ago there were far fewer to choose from!
Finally, an objective comparison of today’s leading text-to-speech voice models. It's about time! Super helpful.
@aidanhornsby any learnings that struck you when you did your research?
Layercode
@fmerian glad you found it valuable!
The main thing that struck me as we were putting this together (which built on an earlier version of the comparison table that we created) was simply how wild the speed of progress is. The entire voice AI industry that builds with LLMs is only a few years old (at most), but the rate of improvement in model quality and variety seems to have been increasing exponentially in the last ~6 months especially.
Developers building voice agents today are spoilt for choice, and we want to make sure that @Layercode makes it easy for builders to take advantage of the latest and greatest models.