Realistic audio with expanded emotional range

by•4mo ago

I'm trying to create realistic audio to support training scenarios for frontline staff in homeless shelters and housing programs who work with clients. The challenge is finding realistic voices with a wide range of emotional affect. We are hoping to find a generative approach to developing multiple voices rather than creating voices with actors or ourselves. We've tried v3 Voice Design, which improves on monotone generated voices, but not by much. We want voices that go from soft whispers to screaming and everything in between. Perhaps I'm not very good at prompting, but I've made various attempts. Again, we're trying to do this without recording every voice, which is not sustainable for our approach. Any recommendations? Thanks!

Replies

Cal ID

Great question! Emotional range is still the most challenging part with AI voices. ElevenLabs is probably your best bet right now, but blending real recordings for extreme emotions with generated speech helps cover those gaps. Also, some creators use emotion tags and vary prompt styles, but the results are hit-or-miss. Curious if anyone’s cracked truly lifelike, expressive AI voices yet.

4mo ago

@sanskarix Emotion tags help a little, but the range still seems restricted/muted.

4mo ago

That’s such a meaningful use of technology, Jim. Realistic audio with emotional depth could really make training more impactful for staff in those challenging environments.

4mo ago

Triforce Todos

How scalable is it? Could it handle dozens of different voice profiles for different roles?

4mo ago

@abod_rehman We'll have dozens of personas representative of the diversity of individuals experiencing homelessness or living in supportive housing.

4mo ago

Hey Jim, your use case is really interesting and carries nice values. I may be able to introduce you to a few people who can help.

4mo ago

@thibaulttbot I'd appreciate any assistance. Thanks.

4mo ago

Hybrid Workflow (Recommended)

For your use case (training for frontline shelter staff), the best balance is a hybrid generative workflow:

1. Choose 3–4 base voices (e.g., two from ElevenLabs, two from Azure).

2. Script scenes with emotion tags, e.g. [calmly], [nervously], [yelling], [crying softly].

3. Generate speech variants and blend/sequence them to simulate real escalation (using crossfades or gain adjustments in an audio editor like Audacity, Descript, or Reaper).

4. Optionally, use whisper/scream sound design overlays (non-verbal breaths, gasps, sighs) from sound libraries like Boom Library or BBC Sound Effects for realism.

This hybrid approach gives you emotional realism without recording actors.
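For anyone who would rather script the tagging and blending steps than do them by hand in an editor, here is a minimal Python sketch. It makes several assumptions: the `Line` class, the voice label, and the helper names are all hypothetical. It also assumes the generated clips have already been decoded into lists of float samples at the same sample rate, and the bracketed tag syntax your TTS service actually accepts may differ. It does not call any TTS API. It only shows the tagging, gain, and crossfade logic.

```python
from dataclasses import dataclass


@dataclass
class Line:
    voice: str    # hypothetical label for one of the chosen base voices
    emotion: str  # delivery tag, e.g. "calmly" or "yelling"
    text: str


def tag_line(line: Line) -> str:
    """Prefix the spoken text with a bracketed emotion tag."""
    return f"[{line.emotion}] {line.text}"


def apply_gain(samples, gain_db):
    """Scale samples by a gain given in decibels (+6 dB is roughly 2x amplitude)."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]


def crossfade(a, b, overlap):
    """Join clips a and b, linearly fading a out and b in over `overlap` samples."""
    overlap = max(0, min(overlap, len(a), len(b)))
    head = list(a[:len(a) - overlap])
    tail = list(b[overlap:])
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of clip b rises from near 0 to near 1
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + mixed + tail


# Example: an escalation scripted as two tagged lines for the same persona.
script = [
    Line("client_a", "nervously", "I was told there would be a bed tonight."),
    Line("client_a", "yelling", "You said there would be a bed!"),
]
prompts = [tag_line(line) for line in script]

# Placeholder sample data standing in for two generated clips.
calm_clip = [0.1] * 8
loud_clip = apply_gain([0.1] * 8, 6.0)
scene = crossfade(calm_clip, loud_clip, overlap=4)
```

Each string in `prompts` would be sent to whichever TTS service you choose, and the returned audio would replace the placeholder clips. A linear fade is the simplest option; an equal-power curve usually sounds smoother on speech but takes a little more code.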

4mo ago

@top9trends_ai Thanks. We're hoping to avoid overlays and extensive post-production work so that we can develop a workflow (likely hybrid) to quickly generate the voices. Future goal is to use these voices in a dynamic environment and we need to minimize lag. ElevenLabs and Azure seem to have a good selection of base voices. Any other sites you'd recommend?

4mo ago

That’s such an important and compassionate use case, Jim — love that you’re focusing on realism and emotional nuance for training frontline workers. It’s true that most AI voice models still struggle with dynamic emotional range.

You might explore multi-agent setups or fine-tuned emotional conditioning. At Growstack, we’ve been experimenting with AI agents that adapt tone and emotion contextually, and the results are surprisingly human-like. Would be happy to exchange notes if you’re exploring similar generative pipelines.

4mo ago