SAM Audio - Segment any sound with text, visual, or time prompts
SAM Audio is a unified model that separates any sound from any source. Use text ("dog barking"), visual clicks on video, or time spans to isolate specific audio. It unifies speech, music, and sound effect separation into one promptable model.



Replies
Flowtica Scribe
Hi everyone!
To be honest, when I first saw this, I didn't think much of it. But after looking closer... SAM Audio is absolutely mind-blowing.
Attention to all makers building audio-related products: Do not ignore this model.
Just like the original SAM changed image segmentation forever, SAM Audio breaks the "fragmented" world of audio processing.
Old Way: You needed separate tools for noise reduction, vocal isolation, and speaker diarization. It was a mess of "signal processing."
The SAM Way: It understands Semantic Intent. You don't filter frequencies, you just tell it what you want.
-> "Isolate the guitar" (Text Prompt)
-> Click on the car in the video (Visual Prompt)
-> Select this specific timestamp (Span Prompt)
It basically shifts audio editing from "engineering" to "describing." And since the inference is pretty fast, the engineering potential here is massive.
P.S. Checked the license—commercial use allowed!✌️
This really nails it.
It makes audio work feel far more approachable. As an engineer and creator, I'll find it quite helpful.
Excited to see how people actually use this day to day.
DeepTagger
That's pretty impressive! 🚀
I think there'll be a lot of new SaaS products built around controlling this new model, as well as integrations into existing video editing software. Looking forward to that!
Wow, Meta sounds amazing! The SAM Audio feature is seriously impressive - being able to isolate specific sounds like that is wild. Curious, how well does it handle overlapping sound events when separating audio in real-time?