Launched this week

speech-swift
The whole speech stack, on your laptop.
20 followers
One Swift package for every speech capability you'd normally rent from a cloud API — transcription, expressive TTS, voice cloning, speaker-aware diarization, denoising, full-duplex speech-to-speech — running on-device on Apple Silicon. brew install soniqo/tap/speech ships a CLI, a local HTTP server, and the Swift API. Apache 2.0. No cloud, no keys, no per-minute bills.




Five API keys to build a voice AI app: ElevenLabs, Deepgram, Pyannote, OpenAI, and whoever's hosting your VAD. Five bills, five round-trips, five companies hearing your users.
Whisper.cpp killed cloud ASR. speech-swift does the same for the rest — expressive TTS, voice cloning, diarization, full-duplex speech-to-speech. One Swift package, one brew install, runs on Apple Silicon.
brew install soniqo/tap/speech
speech-server --port 8080
Your Mac is now a local speech API. Point your app at localhost:8080. Apache 2.0, no keys, no per-minute billing, no internet required.
Genuinely curious: which of the five cloud APIs would you replace first? And which one do you think is still worth paying for? I have opinions but I'd rather hear yours before I post them.
→ https://github.com/soniqo/speech-swift · https://soniqo.audio
There is also a Discord server invitation.
mailX by mailwarm
How’s the quality compared to the big APIs for voice cloning?
@thamibenjelloun In practice, indistinguishable for the everyday case, after some inference tuning. The audio demo above this comment is my own voice, cloned with VoxCPM2 from a short reference, and several people couldn't tell it apart from me actually talking. A careful listener might catch slightly off pause lengths between sentences, or sometimes within a sentence — but that's the level you have to listen at. That used to be the exclusive territory of the top API tier.
I think a really big gap in the market with voice models is that they are very dumb. There are literally meme accounts making fun of OpenAI's reasoning, and they are the best available. That is why reasoning from OpenAI or a similar frontier model is the one cloud API I think is still worth paying for. The issue is latency: it definitely has to be the turbo/flash version of the frontier models.
@ivo_kolev You're right that end-to-end S2S models (GPT-4o-realtime, Moshi, the open ones) reason worse than text models — training through audio compresses cognition. But STT → text LLM → TTS keeps reasoning in text space, so the LLM ceiling is whatever you point at (GPT-5, Claude, Gemini, local).
The latency question is where on-device speech matters. Conversational threshold is ~1.2s end-to-end. Cloud speech APIs eat 400–600ms before the LLM even starts — that's what forces flash/turbo. Move ASR + TTS on-device (~100ms) and you've reclaimed half a second of budget you can spend on a smarter cloud model.
The trade-off isn't dumb local vs smart cloud. It's where the latency goes. On-device speech buys you room for a smarter brain in the cloud.
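The budget arithmetic in that reply can be sketched roughly as follows. This is only an illustration using the figures quoted above (about 1.2 s threshold, 400-600 ms for cloud speech, about 100 ms on-device). It assumes the quoted numbers are the total speech overhead per turn, and the helper name `llm_budget_ms` is made up for this example.

```python
# Figures quoted in the reply above; all values in milliseconds.
CONVERSATIONAL_THRESHOLD_MS = 1200   # rough end-to-end limit for a natural turn
CLOUD_SPEECH_MS = (400, 600)         # quoted low/high overhead of cloud speech APIs
ON_DEVICE_SPEECH_MS = 100            # quoted approximate on-device ASR + TTS cost


def llm_budget_ms(speech_overhead_ms, threshold_ms=CONVERSATIONAL_THRESHOLD_MS):
    """Time left for the LLM once speech overhead is subtracted (hypothetical helper)."""
    return threshold_ms - speech_overhead_ms


# Budget left for the LLM at the low and high end of the cloud speech range.
cloud_budget = tuple(llm_budget_ms(ms) for ms in CLOUD_SPEECH_MS)

# Budget left for the LLM when speech runs on-device.
local_budget = llm_budget_ms(ON_DEVICE_SPEECH_MS)

# Extra time handed to the LLM by moving speech on-device.
reclaimed = tuple(local_budget - budget for budget in cloud_budget)

print("LLM budget with cloud speech (ms):", cloud_budget)
print("LLM budget with on-device speech (ms):", local_budget)
print("Reclaimed for the LLM (ms):", reclaimed)
```

Under these assumptions the reclaimed time is 300 to 500 ms, which is where the "half a second" figure comes from at the upper end of the cloud range.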