Tedi P

DramaBox by Resemble AI - AI turns scene descriptions into vocal performances

A TTS model should give you two things: an Oscar-worthy performance and a verifiable signature to prove it's yours. DramaBox is the first to do both. Describe a scene the way you would to an actor, like 'a talk show host gasps in mock shock, bursts into laughter,' and the model interprets it as performance. Every output is watermarked with Resemble Watermarker. It's open source and English-only for now; find it in your Resemble account or on Hugging Face.

Tedi P
Hey Product Hunt 👋 Tedi here, CTO at Resemble AI. Today we're launching DramaBox, a watermarked and open-source TTS model you direct the way you'd direct an actor.

What you put in:
- A natural language scene description
- Dialogue in quotes (e.g., "HAHAHA! I can't believe it!")
- Optional: a reference voice that the model will clone from, or let the model pick one that fits the scene
- Example: 'A talk show host gasps with shock, "No! You did NOT just say that!" He bursts into uncontrollable laughter, "Hahaha! Oh my god, oh my god!" He wheezes, "I cannot, I literally cannot breathe right now!"'

What you get out:
- An expressive vocal performance with laughter, gasps, pacing, and emphasis driven by the scene description
- 48kHz stereo, broadcast-quality output
- A single audio file ready to ship
- PerTh watermarking embedded at generation for provenance

All without style tags, emotion tags or post-processing required!

DramaBox is open source, and available on GitHub, Hugging Face or in your Resemble account.

A note on why we included watermarking by default: gen AI is being used in increasingly malicious ways, so voice AI with embedded security gives everyone who builds with DramaBox a verification advantage. If you're building an agent, feel free to check out our detection API as well.

Curious to see and hear what you build, please share feedback or any interesting edge cases with us here.

Tedi
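To make the input convention in the post above concrete (free-form scene direction with the spoken lines in double quotes), here is a minimal sketch that splits a prompt into direction and dialogue parts. It is only an illustration for sanity-checking your own prompts before sending them; `split_scene` is a hypothetical helper, not part of DramaBox, and the model interprets the full prompt itself rather than relying on any parsing like this.

```python
import re

def split_scene(prompt: str) -> list[tuple[str, str]]:
    """Split a scene prompt into ("direction", text) and ("dialogue", text) parts.

    Hypothetical helper for illustration only; assumes dialogue is wrapped
    in straight double quotes and that quotes are not nested.
    """
    parts = []
    # re.split with a capturing group keeps the quoted spans:
    # even indexes are direction text, odd indexes are quoted dialogue.
    for i, chunk in enumerate(re.split(r'"([^"]*)"', prompt)):
        text = chunk.strip(" ,")  # trim surrounding spaces and joining commas
        if text:
            parts.append(("dialogue" if i % 2 else "direction", text))
    return parts

# The example prompt from the launch post.
scene = (
    'A talk show host gasps with shock, "No! You did NOT just say that!" '
    'He bursts into uncontrollable laughter, "Hahaha! Oh my god, oh my god!" '
    'He wheezes, "I cannot, I literally cannot breathe right now!"'
)

parts = split_scene(scene)
for kind, text in parts:
    print(f"{kind:9} | {text}")
```

A quick check like this can catch an unbalanced quote, which would otherwise blur the line between what is direction and what is meant to be spoken.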
Thami Benjelloun

When using DramaBox, how consistent is the performance if you generate the same prompt twice? Do you get similar delivery or totally different takes?

Tedi P

@thamibenjelloun you get different takes unless you fix the seed.
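For anyone unfamiliar with what fixing the seed means, here is a toy sketch of the general idea using Python's standard `random` module. It is not DramaBox's API: `sample_take` and the delivery labels are made up for illustration, and how DramaBox actually exposes a seed is not covered in this thread.

```python
import random

# Made-up delivery labels standing in for the model's sampled choices.
DELIVERIES = ["breathless", "deadpan", "giddy", "wheezing"]

def sample_take(seed=None):
    """Return a toy 'take': five randomly chosen delivery labels.

    With seed=None the generator is seeded from system entropy, so takes
    can vary between calls. With a fixed seed the sequence is reproducible.
    """
    rng = random.Random(seed)
    return [rng.choice(DELIVERIES) for _ in range(5)]

# Same seed, same take.
take_a = sample_take(seed=42)
take_b = sample_take(seed=42)
print(take_a == take_b)

# No seed: each call is free to land on a different take.
print(sample_take())
print(sample_take())
```

The same principle applies to any sampling-based generator: the prompt sets the distribution of possible performances, and the seed picks one draw from it.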

Rivra

The shift from text to vocal performance is exciting for storytellers. Is there support for multiple characters with distinct accents within a single scene description?

Tedi P

@rivra_dev we haven't thoroughly tested multiple characters in a scene, but theoretically, yes, it should be supported. Would be curious to see your findings.

Harshal Chaudhary

Directing TTS like an actor with scene context rather than emotion tags is the right abstraction. Tags always felt like a workaround for not having enough context. Curious how it handles ambiguity in the scene description: if two people would direct the same line differently, does the model pick one interpretation, or is there variance across runs?