Launching Vois on Thursday 5th March: a desktop voice AI studio
Hey PH community,
I'm launching Vois on Thursday. It's a desktop voice AI studio I've been building as a solo maker for the past year.
Some of you may have seen my earlier threads here about voice production costs for game devs, podcast workflows, audiobook production, and accessibility. Those conversations directly shaped what I built.
Text-to-audio for accessibility: where are the gaps?
I'm partially dyslexic. Long text has always been difficult for me: not impossible, just slow enough that by the time I reach the bottom of a page, the top has faded. Since high school, I've been converting articles, papers, and reports to audio so I could actually absorb them.
Over the years I've tried everything: screen readers (functional but robotic), browser extensions (limited), cloud TTS services (good quality but expensive for heavy use), and various read-aloud apps.
None of them were quite right. Most are designed for occasional use: read this one article, listen to this one page. They're not built for someone who processes a significant chunk of their reading through audio every single day.
The gaps I've personally experienced:
Local-first AI vs cloud AI: which is winning for voice generation?
Most voice AI services (ElevenLabs, PlayHT, Murf) run in the cloud. You upload your text, they generate audio, you download it. Per-character pricing.
But there's a clear shift toward local-first AI happening across the board. Apple's MLX framework, Ollama for LLMs, Whisper.cpp for transcription. Models are getting small enough and hardware is getting fast enough that "run it on your own machine" is a real option.
For voice generation specifically, the tradeoffs are interesting:
Cloud advantages:
How are L&D teams handling voice for e-learning content?
Enterprise learning and development teams produce a staggering amount of audio content: onboarding modules, compliance training, product walkthroughs, internal communications. And most of it needs to be updated quarterly or annually.
The traditional workflow is painful:
Script changes require re-recording (book the studio, schedule the narrator, wait for delivery)
Multi-language versions multiply the cost and timeline
Compliance updates on tight deadlines mean rushing voice talent
Brand voice consistency across hundreds of modules is nearly impossible with different narrators over time
Cloud TTS services solve some of this but introduce new problems for enterprise:
Has anyone self-produced an audiobook with AI voices?
The audiobook market is growing fast (something like 25% year-over-year), but production costs are still a major barrier for independent authors.
Professional narration typically runs $200-400 per finished hour. A 10-hour audiobook? That's $2,000-4,000 before editing and mastering. For self-published authors who might sell 100-500 copies, the math is brutal.
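To put the per-copy math in concrete terms, here's a quick sketch in Python using only the figures above ($200-400 per finished hour, a 10-hour book, 100-500 copies sold). The function names are just illustrative, and it ignores editing, mastering, and royalty splits.

```python
def narration_cost(finished_hours, rate_low=200, rate_high=400):
    """Return the (low, high) narration cost in dollars for a book."""
    return finished_hours * rate_low, finished_hours * rate_high


def cost_per_copy(total_cost, copies_sold):
    """Narration cost that each sold copy has to cover."""
    return total_cost / copies_sold


low, high = narration_cost(10)
for copies in (100, 500):
    print(
        f"{copies} copies: ${cost_per_copy(low, copies):.2f} "
        f"to ${cost_per_copy(high, copies):.2f} of narration cost per copy"
    )
```

At 100 copies, each sale has to carry $20 to $40 of narration cost before the author earns anything; at 500 copies it's still $4 to $8.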
AI narration is the obvious alternative, and platforms like Google Play Books and some ACX distributors now accept AI-narrated audiobooks (with disclosure). But the workflow is surprisingly clunky:
Cloud TTS services charge per character. A full-length book (80,000 words) burns through a lot of credits, especially when you need to regenerate chapters after editing
Most TTS tools aren't designed for long-form content. They handle single paragraphs well but struggle with maintaining consistent voice quality over hours of audio
Mastering to ACX standards (RMS levels, noise floor, peak levels) requires separate tools
Multi-voice books (dialogue between characters) need manual stitching in most tools
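On the mastering point, here's a rough sketch of what a levels check could look like in plain Python. The thresholds are my understanding of ACX's published requirements (RMS between -23 dB and -18 dB, peaks no higher than -3 dB, noise floor no higher than -60 dB), so verify them against ACX's current submission guidelines. The function names are hypothetical, and it assumes you already have decoded samples as floats between -1.0 and 1.0, plus a separate stretch of room tone for the noise floor.

```python
import math

# Assumed thresholds; check ACX's current requirements before relying on them.
RMS_RANGE_DB = (-23.0, -18.0)
PEAK_MAX_DB = -3.0
NOISE_FLOOR_MAX_DB = -60.0


def to_db(amplitude):
    """Convert a linear amplitude (1.0 = full scale) to dB relative to full scale."""
    return 20 * math.log10(amplitude) if amplitude > 0 else float("-inf")


def rms_db(samples):
    """Root-mean-square level of the samples, in dB."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return to_db(math.sqrt(mean_square))


def peak_db(samples):
    """Highest absolute sample value, in dB."""
    return to_db(max(abs(s) for s in samples))


def check_chapter(speech, room_tone):
    """Report which of the three assumed level checks a chapter passes."""
    return {
        "rms_ok": RMS_RANGE_DB[0] <= rms_db(speech) <= RMS_RANGE_DB[1],
        "peak_ok": peak_db(speech) <= PEAK_MAX_DB,
        "noise_floor_ok": rms_db(room_tone) <= NOISE_FLOOR_MAX_DB,
    }


# Synthetic stand-ins for real audio: a 440 Hz tone and very quiet room tone.
speech = [0.15 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(48000)]
room_tone = [0.0005] * 48000
print(check_chapter(speech, room_tone))
```

This is a simplification: a real check would read the actual audio file and measure the noise floor on genuine silence between sentences, which is part of why separate tools end up in the workflow.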
The faceless YouTube channel trend: what voice solution are creators actually using?
Faceless YouTube channels are everywhere now. Finance explainers, tech reviews, history deep dives, true crime, Reddit compilations: millions of views, no face on camera.
The voice is the entire brand for these channels. And from what I can see, creators are split between a few approaches:
Recording their own voice: works, but takes time, needs decent equipment, and not everyone likes their voice
Hiring voiceover talent: Fiverr ranges from $20-100 per video depending on length and quality. Gets expensive at 3-5 videos per week
Cloud TTS: ElevenLabs, PlayHT, etc. Quality has gotten impressive, but per-character pricing at high volume (daily or multi-weekly uploads) adds up
Free TTS tools: still sound robotic enough to get comments about it
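For a sense of how the hired-talent option scales, here's a back-of-the-envelope sketch in Python using the ranges above ($20-100 per video, 3-5 videos per week). The function name is illustrative, and the 52-week year assumes no breaks in the upload schedule.

```python
def voiceover_cost_range(videos_per_week, price_per_video, weeks=1):
    """Return (low, high) spend given (min, max) ranges for volume and price."""
    low = videos_per_week[0] * price_per_video[0] * weeks
    high = videos_per_week[1] * price_per_video[1] * weeks
    return low, high


weekly = voiceover_cost_range((3, 5), (20, 100))
yearly = voiceover_cost_range((3, 5), (20, 100), weeks=52)
print(f"Per week: ${weekly[0]} to ${weekly[1]}")
print(f"Per year: ${yearly[0]} to ${yearly[1]}")
```

That works out to $60 to $500 a week, or roughly $3,120 to $26,000 a year.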
The interesting tension: YouTube's algorithm rewards consistency and volume. The more you upload, the more the algorithm favors you. But voice production is often the bottleneck, whether it's recording time, talent costs, or TTS credits.
Are podcasters actually using AI voices? What's working?
I keep seeing "AI-powered podcast" tools pop up, and I'm curious what's actually working for people in practice.
The pitch is obvious: skip the recording, editing, and scheduling; just write a script and generate audio. But the reality seems more nuanced.
What I've been hearing:
Solo podcasters who hate the sound of their own voice are interested, but worried about authenticity. "Will my audience know?"
Show producers want AI for filler segments (intros, transitions, recaps) but keep human hosts for interviews and personality
Some people run multiple shows and physically can't record enough; for them, AI voices are a capacity multiplier, not a replacement
Non-English creators want to produce English-language versions of their shows without hiring voice talent

