Hush - Open-source noise suppression for voice AI agents

Hush removes competing voices, background noise, and audio interference from real-time calls so your voice AI agents always hear what matters.


This is exactly the kind of voice-agent infra where the test set matters more than the demo clip. I would love to see three numbers side by side: added latency per frame, word deletion rate for quiet primary speakers, and false retention when a second speaker is louder than the caller. The open-source angle is especially useful if teams can run the same stress clips before deploying it into live calls.

You're speaking our language. We're infrastructure people too, and we share exactly that instinct. Demo clips are marketing. Reproducible numbers on hard audio are what actually matter before you put something in a production call path.

Here's where we stand on your three asks:

Added latency per frame: 10ms frame size, ~12-13ms total Hush-attributed latency including resampling and buffering. The <1ms inference figure is real, but as I've said to others here, that's one slice, not the full picture.


Word deletion rate on quiet primary speakers: This is the number we're most careful about overstating. The model applies a gain mask rather than hard gating, so quieter speech gets enhanced rather than cut. But at very low SNR with a louder competing speaker (your exact third scenario), the deletion risk does go up. We have internal WER benchmarks, but I'd rather you run your own stress clips on your own audio than take our word for it. Which brings me to your actual point.


False retention when the second speaker is louder than the caller: Genuinely the hardest case. We trained specifically for this: 60% of the training samples had a competing human voice. But "louder than the primary caller" is the stress condition where we'd want your numbers, not just ours.

The open-source release is precisely for this reason. The weights, runtime, and training config are all on GitHub. Run your worst clips. Break it. That feedback is worth more to us than any benchmark we self-report.

Drop your findings here or open an issue; we're actively watching both.
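For anyone who wants to score their own stress clips, here is a minimal sketch of a word deletion rate, assuming you already have a reference transcript and the STT output for the same clip. The function name and the alignment rule (minimum edit distance, preferring the alignment with the fewest deletions on ties) are illustrative choices, not anything shipped with Hush.

```python
def deletion_rate(reference: str, hypothesis: str) -> float:
    """Fraction of reference words deleted in a minimum-edit-distance alignment.

    Hypothetical helper for illustration; ties between equal-cost
    alignments are broken in favour of fewer deletions.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0

    # dp[i][j] = (edit cost, deletions) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, i)          # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0)          # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            match = (dp[i - 1][j - 1][0] + sub_cost, dp[i - 1][j - 1][1])
            delete = (dp[i - 1][j][0] + 1, dp[i - 1][j][1] + 1)
            insert = (dp[i][j - 1][0] + 1, dp[i][j - 1][1])
            dp[i][j] = min(match, delete, insert)

    return dp[len(ref)][len(hyp)][1] / len(ref)


# Two of five reference words ("my", "today") are missing from the output.
rate = deletion_rate("please cancel my order today", "please cancel order")
```

Running the same clip with and without suppression in front of the STT, and comparing the two rates, isolates what the suppressor itself is removing.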

This seems pretty useful. We would love to give it a try!

🙌🏻

Looks good! Congrats

🙌🏻

Real-time noise suppression always involves tradeoffs - curious what the actual pipeline latency looks like end-to-end, not just model inference. WebRTC jitter buffers, chunking, and resampling all add overhead on top of the model itself, and for voice AI phone agents that budget is already tight with STT + LLM + TTS in the chain. Also wondering how it handles overlapping speakers mid-sentence vs. steady-state noise - that's usually where suppression models fall apart. How does it compare to what Deepgram or Twilio already offer natively in their voice pipelines?

All fair and sharp questions. These are exactly the tradeoffs we live with daily while building Weya's voice agent pipeline.


On end-to-end latency: The <1ms model inference is just one slice. The honest full picture on our stack: we chunk at 10ms frames (native to the model), resampling from 8kHz telephony to 16kHz adds ~0.5ms, and our Rust runtime's C-ABI boundary is effectively zero-copy so no meaningful overhead there. Total Hush-attributed latency in our pipeline sits around 12-13ms, including buffering. That's the number that actually matters for your STT→LLM→TTS budget, not the raw inference figure.
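As a rough sanity check on those figures, the arithmetic can be sketched as below. This is one reading of the numbers, assuming the 10ms frame itself counts toward latency (a frame-based processor has to wait for a full frame) and treating the "<1ms" inference figure as a 1ms upper bound. The residual attributed to buffering is derived here, not a number stated above.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: float) -> int:
    """Number of audio samples in one frame at the given sample rate."""
    return int(sample_rate_hz * frame_ms / 1000)


# Figures quoted in the reply above.
FRAME_MS = 10.0                    # model's native frame size
RESAMPLE_MS = 0.5                  # 8kHz -> 16kHz resampling
INFERENCE_MS_UPPER = 1.0           # "<1ms" treated as an upper bound (assumption)
REPORTED_TOTAL_MS = (12.0, 13.0)   # "around 12-13ms, including buffering"

frame_8k = samples_per_frame(8_000, FRAME_MS)     # telephony input
frame_16k = samples_per_frame(16_000, FRAME_MS)   # after resampling

# Sum of the itemised parts, and what is left over for buffering.
known_ms = FRAME_MS + RESAMPLE_MS + INFERENCE_MS_UPPER
residual_ms = tuple(total - known_ms for total in REPORTED_TOTAL_MS)

print(f"{frame_8k} samples/frame at 8kHz, {frame_16k} at 16kHz")
print(f"itemised: {known_ms}ms, implied buffering: {residual_ms[0]}-{residual_ms[1]}ms")
```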


On overlapping mid-sentence speech: This is honestly the hardest problem in the space, and I won't oversell it. Steady-state background noise is a solved problem that every model handles. Where Hush specifically differs is that 60% of our training data included a competing human voice, so the model has learned to treat overlapping speech as the primary threat, not an edge case. Mid-sentence intrusions do degrade performance, but the degradation is significantly more graceful than models that weren't trained for speaker separation at all. We're targeting this directly in v2.


On Deepgram/Twilio native suppression: Their built-in noise handling is solid for the human-listener use case: stationary noise, light background hum. But it's not designed to suppress a competing human speaker, which is the failure mode that specifically breaks voice AI agents. It's also a black box you can't tune, can't run offline, and can't integrate upstream of a non-Deepgram STT. Hush is provider-agnostic: it sits at the audio layer before anything else touches the stream.

sub-1ms-per-frame on cpu is the easy number to benchmark — on real-time agent pipelines the harder thing is frame jitter compounding across the STT→LLM→TTS hop, where any lookahead the suppressor needs eats the latency budget you saved. gain-mask + deep filtering also has the target-speaker vs blind-separation question lurking when two voices overlap mid-utterance.
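To make the gain-mask vs. hard-gating distinction from this thread concrete, here is a toy sketch on per-bin magnitudes. The mask values, the threshold, and both function names are made up for illustration; this is not Hush's model or runtime, only the general idea that a soft mask scales each bin while a gate zeroes anything below a cutoff.

```python
def apply_gain_mask(magnitudes, mask):
    """Soft suppression: scale each bin by its mask value in [0, 1]."""
    return [m * g for m, g in zip(magnitudes, mask)]


def apply_hard_gate(magnitudes, mask, threshold=0.5):
    """Hard gating: keep a bin untouched or drop it entirely."""
    return [m if g >= threshold else 0.0 for m, g in zip(magnitudes, mask)]


# Hypothetical frame: a loud speech bin, a quiet speech bin, a noise bin.
magnitudes = [0.8, 0.2, 0.6]
mask = [0.9, 0.4, 0.1]   # made-up per-bin confidence that the bin is target speech

soft = apply_gain_mask(magnitudes, mask)
hard = apply_hard_gate(magnitudes, mask)

# The quiet speech bin survives (attenuated) under the soft mask,
# but is removed outright by the gate.
print(soft[1] > 0.0, hard[1] == 0.0)
```

The quiet-speaker deletion risk discussed above maps onto the middle bin: with a gate, an uncertain mask value turns into a dropped word, while a soft mask leaves something for the STT to work with.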