
Hush
Open-source noise suppression for voice AI agents
339 followers
Hush removes competing voices, background noise, and audio interference from real-time calls so your voice AI agents always hear what matters.




Refocus
The CPU-only, sub-1ms-per-frame number is what jumped out at me. Most enhancement I've tried adds enough latency to break the natural turn-taking on a live call. We build voice AI that phones elderly parents at home, where the hard part is exactly what you describe: a TV going in the background, a spouse talking across the room, sometimes a hearing aid whistling. My question: when the primary speaker is quiet, slurred, or unsteady (pretty common with older users), does isolating them ever clip that softer speech? Planning to test Hush on some of our real call audio.
Hush
@igorgurovich Thanks! That's a great use case. To answer your question: the model applies a gain mask and deep filtering per frame, it doesn't gate or hard-clip. So quieter speech gets enhanced, not cut. That said, we optimized primarily for telephony scenarios with a clearly dominant primary speaker. Slurred or very low-energy speech at low SNR is a harder edge case and I'd honestly want to see how it performs on your specific audio before making promises. Please do test it on your real call data and share what you find. Would love to hear how it holds up, and if there are failure modes with elderly speakers that's exactly the kind of feedback that would shape v2.
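To illustrate the difference between gating and a gain mask described in the reply above, here is a minimal sketch. It is not Hush's implementation: the real model applies its mask and deep filtering in its own feature domain, while this example scales raw time-domain samples. The function names, the 0.05 threshold, and the 0.9 gain are all hypothetical values chosen for illustration.

```python
import math

def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def hard_gate(frame, threshold=0.05):
    """Zero the whole frame when its level falls below the threshold."""
    return list(frame) if rms(frame) >= threshold else [0.0] * len(frame)

def apply_gain(frame, gain):
    """Scale the frame by a gain in [0, 1]; nothing is cut outright."""
    return [s * gain for s in frame]

# A quiet 10 ms frame at 16 kHz (160 samples) of a 200 Hz tone.
quiet = [0.02 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(160)]

gated = hard_gate(quiet)         # level is below the threshold, so the frame is zeroed
masked = apply_gain(quiet, 0.9)  # hypothetical gain a model might predict

print(rms(gated))       # 0.0: the soft speech is gone
print(rms(masked) > 0)  # True: the soft speech survives, slightly attenuated
```

The gate makes an all-or-nothing decision per frame, which is where soft speech can get clipped. A gain mask only rescales, so a quiet frame is still present in the output.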
Sub-1ms on CPU is the claim that matters most here and also the one I'd want stress-tested. What's the degradation curve? Does it hold at 1ms with a single stream, and what happens at 10 or 50 concurrent calls on commodity hardware? That's the production reality for anyone running voice agents at scale.
The open-source angle is smart for adoption but the real question is where the commercial model sits. Apache 2.0 gets you into production stacks fast. What's the wedge that converts users to paying customers?
Hush
@sergio_jivan Good questions. On concurrency: the Rust runtime shares the compiled ONNX model across all sessions via a single Arc<TypedSimplePlan>. Each additional session allocates only its own frame buffers (a few KB), not a copy of the model. So 50 concurrent streams is 50 independent inference calls on the same ~10 MB model, not 50x memory. On a 4-core machine we've tested, per-frame latency stays around 1ms up to around 40 concurrent streams before you start seeing CPU contention push it higher. It scales linearly with cores.
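The memory argument above can be sketched as follows. This is a Python stand-in, not the Rust runtime: a plain object reference plays the role of the shared Arc, and SharedModel, Session, and the 4 KB per-session figure are hypothetical (the reply only says "a few KB").

```python
FRAME_SAMPLES = 160       # 10 ms at 16 kHz
MODEL_MB = 10             # approximate model size from the reply
SESSION_KB = 4            # assumed stand-in for "a few KB" per session
N_SESSIONS = 50

class SharedModel:
    """Stand-in for the compiled model, loaded once per process."""
    def __init__(self, size_bytes):
        self.weights = bytes(size_bytes)

class Session:
    """One call: a reference to the shared model plus private frame buffers."""
    def __init__(self, model):
        self.model = model                    # a reference, not a copy
        self.in_buf = [0.0] * FRAME_SAMPLES
        self.out_buf = [0.0] * FRAME_SAMPLES

model = SharedModel(MODEL_MB * 1024 * 1024)
sessions = [Session(model) for _ in range(N_SESSIONS)]

# Every session points at the same model object.
all_share = all(s.model is model for s in sessions)

# Rough memory comparison: one shared model vs. one copy per session.
shared_mb = MODEL_MB + N_SESSIONS * SESSION_KB / 1024
copied_mb = N_SESSIONS * (MODEL_MB + SESSION_KB / 1024)

print(all_share)
print(round(shared_mb, 1), round(copied_mb, 1))
```

Under these assumed numbers, sharing keeps the footprint close to the size of a single model, while per-session copies would grow it roughly 50x.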
On the commercial question: Hush is genuinely open source, no "open core" catch. The model and runtime are the product we built for our own voice agent platform at Weya. Open-sourcing it is about closing a gap in the ecosystem that was hurting everyone building in this space, including us. Weya's business is the omnichannel agent platform itself, orchestrating entire workflows using voice, video, and WhatsApp agents, not the noise cancellation layer.
Foyer
Most noise suppression libraries are built for human listeners, where "good enough" means the person on the other end doesn't notice. For voice AI agents the bar is different because the model is doing ASR first, and artifacts that a human brain filters out can wreck transcription accuracy pretty badly. Curious whether Hush is tuned specifically for that ASR pipeline use case or whether it's general-purpose suppression you're applying upstream. Also wondering how it handles near-field keyboard noise and fan hum during long agent sessions, since those tend to be the consistent offenders in real deployments.
Hush
@fberrez1 Great framing, you've nailed exactly why we built this the way we did.
You're right that the bar for voice AI is fundamentally different from human-listener suppression. A human brain is remarkably forgiving of artifacts. An ASR model isn't: a subtle spectral smear or over-aggressive suppression can flip a phoneme, and suddenly your agent is acting on the wrong intent. That's a real business failure, not just a quality issue.
Hush is explicitly tuned for the upstream-of-ASR use case. Our eval loop during training measured downstream transcription accuracy (WER), not just perceptual scores like PESQ or DNSMOS. If suppression was introducing artifacts that hurt WER, we treated that as a model failure regardless of how "clean" it sounded to a human ear.
On keyboard noise and fan hum: stationary and near-stationary noise is actually the easier problem — the model handles those well since it was trained heavily on DNS Challenge data, which includes exactly those profiles. Long agent sessions with consistent fan hum are arguably the cleanest scenario Hush faces. Where it earns its keep is when a second human voice enters the frame mid-session, which is what breaks every other model we tested.
Happy to share some WER comparison numbers across noisy conditions if that's useful for your evaluation.
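For readers who want to run the same kind of check, word error rate (the metric named above) is the word-level edit distance between a reference transcript and the ASR output, divided by the reference length. The sketch below is a generic implementation, not Hush's eval code, and the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: ref word missing from hyp
                curr[j - 1] + 1,     # insertion: extra word in hyp
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)

reference = "cancel my appointment for tuesday"
print(wer(reference, reference))                             # 0.0
print(wer(reference, "cancel my appointment for thursday"))  # 0.2
print(wer(reference, "cancel appointment for tuesday"))      # 0.2
```

Running the same ASR model on raw and enhanced audio and comparing the two WER values is one way to see whether suppression helps or hurts transcription on a given test set.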
Real-time noise suppression always involves tradeoffs - curious what the actual pipeline latency looks like end-to-end, not just model inference. WebRTC jitter buffers, chunking, and resampling all add overhead on top of the model itself, and for voice AI phone agents that budget is already tight with STT + LLM + TTS in the chain. Also wondering how it handles overlapping speakers mid-sentence vs. steady-state noise - that's usually where suppression models fall apart. How does it compare to what Deepgram or Twilio already offer natively in their voice pipelines?
Hush
@galdayan All fair and sharp questions: these are exactly the tradeoffs we live with daily, building Weya's voice agent pipeline.
On end-to-end latency: The <1ms model inference is just one slice. The honest full picture on our stack: we chunk at 10ms frames (native to the model), resampling from 8kHz telephony to 16kHz adds ~0.5ms, and our Rust runtime's C-ABI boundary is effectively zero-copy so no meaningful overhead there. Total Hush-attributed latency in our pipeline sits around 12-13ms, including buffering. That's the number that actually matters for your STT→LLM→TTS budget, not the raw inference figure.
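The figures in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers quoted above; treating the <1ms inference figure as a 1.0 ms upper bound, and attributing the remainder to buffering, are assumptions made for illustration.

```python
FRAME_MS = 10.0                     # frame size quoted above
RESAMPLE_MS = 0.5                   # 8 kHz -> 16 kHz resampling, quoted above
INFERENCE_MS = 1.0                  # assumed upper bound for the "<1ms" figure
REPORTED_TOTAL_MS = (12.0, 13.0)    # reported total, including buffering

def samples_per_frame(rate_hz, frame_ms=FRAME_MS):
    """Number of samples in one frame at the given sample rate."""
    return int(rate_hz * frame_ms / 1000)

# Sum of the components that are stated explicitly.
known_ms = FRAME_MS + RESAMPLE_MS + INFERENCE_MS

# What is left over for buffering, inferred from the reported total.
residual_ms = tuple(total - known_ms for total in REPORTED_TOTAL_MS)

print(samples_per_frame(8000))    # 80 samples per frame on the telephony side
print(samples_per_frame(16000))   # 160 samples per frame on the model side
print(known_ms)                   # 11.5
print(residual_ms)                # (0.5, 1.5)
```

Most of the budget is the 10 ms frame itself: the pipeline has to collect a full frame before it can process it, so the frame size sets the floor regardless of how fast inference is.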
On overlapping mid-sentence speech: This is honestly the hardest problem in the space, and I won't oversell it. Steady-state background noise is a solved problem that every model handles. Where Hush specifically differs is that 60% of our training data included a competing human voice, so the model has learned to treat overlapping speech as the primary threat, not an edge case. Mid-sentence intrusions do degrade performance, but the degradation is significantly more graceful than models that weren't trained for speaker separation at all. We're targeting this directly in v2.
On Deepgram/Twilio native suppression: Their built-in noise handling is solid for the human-listener use case: stationary noise, light background hum. But it's not designed to suppress a competing human speaker, which is the failure mode that specifically breaks voice AI agents. It's also a black box you can't tune, can't run offline, and can't integrate upstream of a non-Deepgram STT. Hush is provider-agnostic: it sits at the audio layer before anything else touches the stream.
This is exactly the kind of voice-agent infra where the test set matters more than the demo clip. I would love to see three numbers side by side: added latency per frame, word deletion rate for quiet primary speakers, and false retention when a second speaker is louder than the caller. The open-source angle is especially useful if teams can run the same stress clips before deploying it into live calls.
Hush
@tang_weigang You're speaking our language. We're infrastructure people too, and we share exactly that instinct. Demo clips are marketing. Reproducible numbers on hard audio are what actually matter before you put something in a production call path.
Here's where we stand on your three asks:
Added latency per frame: 10ms frame size, ~12-13ms total Hush-attributed latency including resampling and buffering. The <1ms inference figure is real, but as I've said to others here, that's one slice, not the full picture.
Word deletion rate on quiet primary speakers: This is the number we're most careful about overstating. The model applies a gain mask rather than hard gating, so quieter speech gets enhanced rather than cut. But at very low SNR with a louder competing speaker (your exact third scenario), the deletion risk does go up. We have internal WER benchmarks, but I'd rather you run your own stress clips on your own audio than take our word for it. Which brings me to your actual point.
False retention when the second speaker is louder than the caller: Genuinely the hardest case. We trained specifically for this (60% of the training samples had a competing human voice), but "louder than the primary caller" is the stress condition where we'd want your numbers, not just ours.
The open-source release is precisely for this reason. The weights, runtime, and training config are all on GitHub. Run your worst clips. Break it. That feedback is worth more to us than any benchmark we self-report.
Drop your findings here or open an issue. We're actively watching both.
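For anyone building their own stress clips, the "second speaker louder than the caller" condition corresponds to a negative signal-to-interferer ratio. The sketch below shows one common way to mix two clips at a target SNR. It is a generic recipe, not part of the Hush repository; mix_at_snr and the synthetic tones standing in for speech are hypothetical.

```python
import math

def power(samples):
    """Mean power of a clip."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(primary, interferer, snr_db):
    """Scale the interferer so primary/interferer power equals snr_db, then add.

    Returns (mixture, scaled_interferer). A negative snr_db makes the
    interferer louder than the primary speaker.
    """
    if len(primary) != len(interferer):
        raise ValueError("clips must have the same length")
    p_pow, i_pow = power(primary), power(interferer)
    if p_pow == 0 or i_pow == 0:
        raise ValueError("clips must not be silent")
    target_i_pow = p_pow / (10 ** (snr_db / 10))
    scale = math.sqrt(target_i_pow / i_pow)
    scaled = [s * scale for s in interferer]
    return [a + b for a, b in zip(primary, scaled)], scaled

# Synthetic stand-ins for two speakers: one second of audio at 16 kHz.
rate = 16000
caller = [0.05 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
other = [0.30 * math.sin(2 * math.pi * 300 * n / rate) for n in range(rate)]

mixture, scaled_other = mix_at_snr(caller, other, snr_db=-5.0)
measured_db = 10 * math.log10(power(caller) / power(scaled_other))
print(round(measured_db, 6))
```

Sweeping snr_db over a range (for example from +10 down to -10) and measuring WER at each step gives a degradation curve rather than a single number.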
Hush
Hey Product Hunt! I'm @lordhasanali, CEO of Weya AI.
We watched great voice AI fail in production, over and over, not because of the model, but because of the audio. Noisy environments, competing voices, background hum. Nobody was solving this properly, so we did.
Introducing Hush, our first in-house open-source speech enhancement model, which:
• Isolates the primary speaker and removes everything else in real time
• Runs entirely on CPU, under 1ms per frame - no GPU needed
• Language-agnostic - works across all spoken languages out of the box
• Apache 2.0 - free to use in production today
We launched at #5 on HuggingFace's Audio-to-Audio leaderboard, and this is just the start.
We'll be here all day answering questions. Try it, break it, and let us know what you think!
Sub-ms matters because voice UX breaks when the audio path gets clever but slow. The edge case I would watch is the handoff between suppression and downstream turn detection; a clean stream is useful only if it preserves the timing signals.
Hush
@krekeltronics Exactly right, and this is an underappreciated failure mode in most voice AI stacks.
Hush processes in 10ms frames and preserves frame boundaries cleanly through the pipeline, so the timing signals VAD and turn detection rely on stay intact. We specifically avoided any lookahead buffering that would smear those boundaries, because we saw firsthand how that breaks turn-taking on live calls.
The gain mask approach also helps here: since we're not hard-gating or dropping frames, silence and trailing speech edges are preserved naturally rather than getting clipped in ways that confuse downstream turn detection.
It's something we'll be stress-testing explicitly in v2. Keen to hear if you've seen specific failure patterns worth designing against.
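One way to sanity-check the timing point above is to run a simple onset detector before and after enhancement and confirm the onset lands in the same frame. The sketch below assumes a frame-by-frame processor with no lookahead. The constant 0.8 gain, the energy threshold, and all function names are hypothetical, and a real turn detector is far more involved than this energy check.

```python
import math

FRAME = 160  # 10 ms at 16 kHz

def frames(samples, size=FRAME):
    """Yield consecutive, non-overlapping frames."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]

def enhance_stream(samples, gain_for_frame):
    """Process one frame at a time; each gain depends only on the current frame."""
    out = []
    for frame in frames(samples):
        gain = gain_for_frame(frame)  # no lookahead into later frames
        out.extend(s * gain for s in frame)
    return out

def first_active_frame(samples, threshold=1e-4):
    """Index of the first frame whose mean energy exceeds the threshold."""
    for idx, frame in enumerate(frames(samples)):
        if sum(s * s for s in frame) / len(frame) > threshold:
            return idx
    return None

# Three silent frames followed by two frames of a 200 Hz tone.
silence = [0.0] * (3 * FRAME)
speech = [0.1 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(2 * FRAME)]
stream = silence + speech

enhanced = enhance_stream(stream, gain_for_frame=lambda frame: 0.8)

print(len(enhanced) == len(stream))
print(first_active_frame(stream), first_active_frame(enhanced))
```

Because each output frame is produced from the matching input frame alone, the output has the same length and the onset stays on the same frame index.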