Inworld builds the infrastructure for production voice AI. One platform with speech-to-text, an LLM router, and the top-ranked text-to-speech, all connected on a single API so context flows between every layer. Used by developers building voice agents, AI companions, and conversational apps.
This is the 5th launch from Inworld.

Realtime TTS-2
Launching today
Realtime TTS 1.5 is #1 on Artificial Analysis, voted best in blind tests by thousands of real users. TTS-2 builds on that with major upgrades: natural-language voice direction for tone, emotion, speed, and pitch; text-based voice design, where you describe a voice in words and generate it; cross-lingual synthesis across 100+ languages that preserves speaker identity; IPA phonetic control for brand names and rare words; and improved alphanumeric pronunciation. Try it free at inworld.ai/tts.








Inworld
Hi Product Hunt! We're back! I'm Kylan, CEO and co-founder of @Inworld.
Some of you might remember when we launched Inworld TTS here. It went on to become the #1 ranked voice AI on Artificial Analysis, voted best in blind listening tests by thousands of real users. That meant a lot to us, so we went back and rebuilt the model from the ground up.
Today we're launching Realtime TTS 2.0. Try the live speech-to-speech experience at realtime.ai.
Here's the thing we kept hearing from builders: voice AI was built for audiobooks and voiceovers. It sounds good, but it sounds like a human reading from a script. If you've ever talked to a voice agent and thought "something feels off," that's why. Realtime conversation is a completely different problem, and we decided to solve it.
What can you build with it?
Companion apps that adapt to your user's mood and tone in real time through natural language voice direction
Language tutors that switch languages mid-session with the same voice, no re-recording
Characters that sound exactly how you describe them with text-based voice design
Support agents that get every code, name, and number right with improved alphanumeric handling and International Phonetic Alphabet (IPA) support
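As one illustration of the IPA control mentioned above, here is a minimal sketch of how a phoneme hint might be embedded in input text. The `<phoneme>` tag syntax here is borrowed from SSML for the sake of the example; Inworld's actual markup may differ, so treat the tag format as an assumption and check the docs at inworld.ai/tts.

```python
def with_ipa(word: str, ipa: str) -> str:
    """Wrap a word in a hypothetical phoneme tag so the engine reads the
    supplied IPA transcription instead of guessing from spelling."""
    return f'<phoneme ipa="{ipa}">{word}</phoneme>'

# A brand or family name the default pronunciation model would likely mangle:
line = f"Your order from {with_ipa('Nguyen', 'ŋwiən')} Electronics ships today."
print(line)
```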
So what actually changed?
Natural conversationality. We trained the model on conversational speech instead of narration. You get natural rhythm, breath, micro-pauses, the cadence humans actually use when they talk to each other. Every voice you build on TTS 2.0 sounds like a person in conversation, not a narrator.
Conversational awareness. TTS 2.0 is informed by the full audio context of the multi-turn exchange. Not just the current sentence, the whole conversation. How it speaks adapts to how it was spoken to. A line delivered after a joke lands differently than the same line after bad news. The model knows the difference because it heard what came before.
Full voice direction. You steer the model with natural language the way you'd direct a voice actor. Not preset emotion tags, full descriptions: "act like you just got home from a long day, tired but warm." Combined with inline controls for specific moments ([whispering], [sigh], [excited]), the voice is as controllable as it is expressive.
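To make the two levels of control concrete, here's a sketch of what a synthesis request combining a free-form acting note with inline markers might look like. The field names (`voiceId`, `voiceDirection`) and the voice id `Myles` are assumptions for illustration, not the documented Inworld API surface.

```python
import json

def build_tts_request(text: str, voice_id: str, direction: str) -> dict:
    """Assemble a hypothetical TTS-2 synthesis payload.

    `voiceDirection` carries the scene-level acting note, while inline
    markers like [whispering] or [sigh] stay embedded in the text for
    moment-level control.
    """
    return {
        "voiceId": voice_id,
        "text": text,
        "voiceDirection": direction,  # free-form, like a note to a voice actor
    }

payload = build_tts_request(
    text="[sigh] Honestly? [whispering] I just want to sit down for a minute.",
    voice_id="Myles",
    direction="act like you just got home from a long day, tired but warm",
)
print(json.dumps(payload, indent=2))
```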
Text-based voice design. Describe a voice in plain text, generate it. "A posh British man, aged 30-40, speaking deliberately." Iterate on the prompt until it fits, save it, deploy it. No casting calls, no recording booth.
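The describe-iterate-deploy loop above could be sketched like this. The request shape (`voiceDescription`, `previewText`) is invented for illustration; the real voice-design endpoint lives behind inworld.ai/tts and may use different names.

```python
def design_voice(description: str, preview_text: str) -> dict:
    """Assemble a hypothetical voice-design request: the description is the
    entire spec, and the preview text is what you listen to while iterating."""
    return {
        "voiceDescription": description,
        "previewText": preview_text,
    }

# Tweak the description until the preview sounds right, then save the
# resulting voice and deploy it like any other.
request = design_voice(
    description="A posh British man, aged 30-40, speaking deliberately",
    preview_text="Good evening. Shall we begin?",
)
print(request)
```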
Crosslingual fluency. One voice across 100+ languages with on-the-fly switching inside a single generation. Your voice identity is preserved across every language. No re-recording, no managing separate voices per locale.
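The practical difference is in how much voice inventory you manage. A sketch of the contrast, with invented ids and field names (the per-locale mapping is the pattern being replaced, not part of any Inworld API):

```python
# The old way: source and maintain a separate voice per locale, then route
# each piece of text to whichever voice matches its language.
voices_per_locale = {
    "en-US": "myles_en",
    "fr-FR": "claire_fr",
    "hi-IN": "arjun_hi",
}

# With cross-lingual synthesis as described above: one voice identity and
# mixed-language text, with switching handled inside a single generation.
payload = {
    "voiceId": "Myles",  # hypothetical voice id
    "text": "Welcome back! Comment s'est passée ta journée ? सब ठीक है?",
}
print(payload)
```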
Realtime TTS 1.5 is still #1 on the leaderboard. TTS 2.0 takes that quality and adds everything that was missing to uplevel realtime conversation.
Learn more at inworld.ai/tts. Happy to answer any questions in the comments.
– Kylan
DiffSense
It sounds too much like audiobook narration. I guess it was trained on that input? Same thing that plagues every single ElevenLabs voice. The only voice that sounds human out there is the Alloy voice from OpenAI, and that's an old AI voice. It's so strange. This field should be wide open. Competitive. What's going on? What am I missing?
Inworld
@conduit_design Did you try Myles on realtime.ai? Curious what feels off there for you.
DiffSense
@kylan_gibbs I can't speak back to it, I'm at the library 😂 but the opening sounded audiobook-ish. It's just that humans do not speak like audiobooks. We speak much less theatrically, in a way. Do you know what I mean? I guess AI voice will be unsolved for a while still, stuck in the audiobook period, until we escape that. I feel the Alloy voice released like 2-3 years ago escapes this, but it takes like 10-12 recordings to get it with the right flow. I wish it was possible to just say: end it with less punchiness, not so fast, take it easy on the first part, etc. Instead of editing and doing many iterations. Or have an AI just solve that editing. Then just write a text and say: make this sound like a product demo, something The Verge would make, etc., and an AI would take care of the iterating and editing.
Inworld
@conduit_design Give it a try when you can speak to it; the naturalness comes from how it interprets you and your context as well. Let me know once you give it a shot!
I'm most excited about the improvements made in cross-lingual. It's so seamless to have an engaging conversation and switch between multiple languages like English, Hindi, then French and it's the same voice.
Inworld
Hey everyone, Andreas from the Inworld team! I've been pumped about this launch for weeks and I'm so excited that we finally get to share TTS-2 with you all. If you want to hear what it can do, jump into the playground at inworld.ai/tts and try voice design or steering for yourself, or play with our realtime demo at realtime.ai. Would love to hear your reactions!
Inworld
Realtime TTS 2 is our best model yet.
It's designed to be the frontend of voice-interfaced applications of any kind and scale.
Beyond the naturalness and multilingual quality improvements, this iteration can't really be called just another TTS: much like speech-to-speech models, Realtime TTS 2.0 was trained to be explicitly steered toward the most appropriate response, given the conversation context and the agent's goal.
Check it out!
So I tried it, speech to speech. It confuses itself and hallucinates very quickly with just basic questions and conversation. I asked both bots how they are, what they're doing today, and what they're doing for dinner, and they gave me answers from completely different ends of the spectrum. They gave a lot of filler responses like hey, hmm, huh, which I can understand why those are there. But Jason started telling me how to increase the gain of my television set, and Sarah thought I was going to a party. Also the vocal fidelity leaves a lot to be desired in speech to speech. Just my honest feedback so far. Keep at it.
training on conversation instead of narration is the right call. every voice agent i've tried sounds like an audiobook reading my support ticket back.
congrats team !!