Hi everyone!

There's a new open-source text-to-speech model out called Muyan-TTS, from the MYZY-AI team, and it's specifically designed with podcast applications in mind.

Notably, Muyan-TTS was pre-trained on over 100,000 hours of podcast audio. That lets it do high-quality zero-shot voice cloning: given a short audio sample, it can generate speech in that voice without any additional training. For more customized voices, the fine-tuned version (Muyan-TTS-SFT) can adapt to a specific speaker with just a few dozen minutes of their audio. The team has also been transparent about development costs, mentioning that it was built on a budget of roughly $50k.

Both models (the base zero-shot model and the SFT version for speaker adaptation) and the training code have been released.

Muyan-TTS

Open-source, high-quality TTS for podcasts & voice cloning
