Hi everyone!

The new ThinkSound model from Alibaba Tongyi Labs brings a new idea to audio generation: letting the AI "think" before it creates sound.

It's the first model to apply Chain-of-Thought reasoning to audio generation. Instead of just matching sounds, it analyzes a video's events step by step to create high-fidelity, synchronized audio that really fits the scene. The results are pretty stunning!

The code is open-source under Apache 2.0. Just a heads up on the license for the model itself: it's available for research and educational use, but you'll need to contact the team for commercial licensing. Still, I have a feeling this new approach to audio models will inspire commercial ones and speed up their arrival.

You can try the demo here.

ThinkSound

Your AI sound designer with a Chain-of-Thought
