
IndexTTS2
Precise duration & emotional zero-shot TTS
Production-ready text-to-speech for dubbing, games, podcasts, and education — precise duration control, emotion–speaker decoupling, zero-shot cloning


I’m Hyde, the solo dev behind IndexTTS-2 Online (indextts-2.com).
This project started from a very selfish need: I love playing with new TTS models, but I hate dealing with GPUs, CUDA errors, giant checkpoints and random “works on my machine” repos. I wanted something where I could just open a browser, paste some text, and get back a voice that actually sounds like a human — with emotion and proper timing — without touching a terminal.
So I wrapped the open-source IndexTTS-2 model in a small SaaS-style tool.
🎙 What it is
IndexTTS-2 Online is a browser-based studio for emotionally expressive, duration-controlled, zero-shot text-to-speech.
You can:
Type text and get natural, expressive speech (not the flat “robot podcast” sound).
Upload or record a short voice reference and have the model speak in that voice.
Control duration so the audio fits your video cuts, subtitles, or lip-sync window.
Use it for Chinese, English, Japanese and some cross-lingual cases.
Use cases I had in mind: YouTube / TikTok dubbing, quick voice tracks for indie games, early drafts for audiobooks & podcasts, or multilingual versions of the same script.
⚙️ Under the hood
On the backend, I’m running IndexTTS-2, an autoregressive model, with:
A reference encoder for timbre & style (the “voice cloning” part).
Duration / alignment control so you can aim for specific lengths.
A simple API layer that the web app calls from the browser.
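To make the API layer a bit more concrete, here is a minimal TypeScript sketch of how the browser side might shape a synthesis request before sending it. This is not the actual IndexTTS-2 Online API: the `SynthesisRequest` type, the `buildSynthesisRequest` helper, and the field names `referenceAudioId` and `targetDurationSec` are all hypothetical, chosen only to mirror the pieces listed above (text, voice reference, target length).

```typescript
// Hypothetical request shape. The real IndexTTS-2 Online API is not
// documented in this post, so every field name here is an assumption.
interface SynthesisRequest {
  text: string;
  referenceAudioId?: string;  // assumed handle for an uploaded/recorded voice sample
  targetDurationSec?: number; // assumed knob for duration control; omit for natural length
}

interface SynthesisOptions {
  referenceAudioId?: string;
  targetDurationSec?: number;
}

// Validate user input and build the JSON payload the web app would send.
function buildSynthesisRequest(
  text: string,
  opts: SynthesisOptions = {}
): SynthesisRequest {
  const trimmed = text.trim();
  if (trimmed.length === 0) {
    throw new Error("text must not be empty");
  }
  const duration = opts.targetDurationSec;
  if (duration !== undefined && !(Number.isFinite(duration) && duration > 0)) {
    throw new Error("targetDurationSec must be a positive number");
  }

  // Only include optional fields that were actually provided.
  const request: SynthesisRequest = { text: trimmed };
  if (opts.referenceAudioId !== undefined) {
    request.referenceAudioId = opts.referenceAudioId;
  }
  if (duration !== undefined) {
    request.targetDurationSec = duration;
  }
  return request;
}
```

In a sketch like this, the result of `buildSynthesisRequest("Hello there", { targetDurationSec: 2.5 })` would be serialized with `JSON.stringify` and posted to whatever endpoint the backend exposes. Validating in the browser first keeps obviously bad jobs away from the GPU.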
The frontend is a boring-but-solid stack (Next.js + Tailwind etc.), with a small queue system so the model doesn’t fall over when multiple people generate at once.
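For readers curious what a small queue of this kind can look like, below is a minimal in-memory sketch in TypeScript. It is an illustration under assumptions, not the code running on the site: the `JobQueue` class and its `run` method are hypothetical names, and a production setup would more likely keep jobs in an external store so they survive restarts. The idea is simply that at most `concurrency` generations run at once and the rest wait in first-in, first-out order.

```typescript
// Illustrative concurrency-limited queue (hypothetical, in-memory only).
class JobQueue {
  private running = 0;
  private readonly concurrency: number;
  private readonly waiting: Array<() => void> = [];

  constructor(concurrency: number) {
    if (!Number.isInteger(concurrency) || concurrency < 1) {
      throw new Error("concurrency must be a positive integer");
    }
    this.concurrency = concurrency;
  }

  // Run `job` as soon as a slot is free; resolves with the job's result.
  async run<T>(job: () => Promise<T>): Promise<T> {
    if (this.running >= this.concurrency) {
      // All slots busy: park until a finishing job hands its slot over.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.running++;
    }
    try {
      return await job();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        // Transfer the slot directly to the oldest waiter,
        // so `running` stays unchanged.
        next();
      } else {
        this.running--;
      }
    }
  }
}
```

With something like `const queue = new JobQueue(2)`, each incoming request would call `queue.run(() => synthesize(request))`, where `synthesize` stands in for whatever function actually invokes the model. Handing the slot straight to the next waiter, rather than decrementing and re-checking, means a newly arriving request cannot jump ahead of jobs that are already waiting.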
There’s a free tier to play with, and a Pro plan with higher limits plus custom voice reference upload and recording.
🙏 I’d love your feedback
I’m especially curious about:
Voice quality vs. latency – is it good enough for your use case?
Duration control – does it help in your real editing workflow?
Product direction – what would make this a “must-have” tool for you (API, batch jobs, plugins, etc.)?
If you try it and break it, or get funny/uncanny outputs, please share them — those are super helpful for improving the product.
Thanks for checking it out and supporting indie builders! 💛