
Meta has launched SAM Audio, a unified model for isolating sounds from complex audio using natural language, visual, or time-span prompts. Released on December 16, 2025, the tool signals Meta's push into audio separation technology and aims to make source separation accessible to both creators and developers.
SAM Audio combines multiple interaction methods for extracting specific sounds such as speech and instruments. Built on flow-matching diffusion transformers, it reportedly outperforms existing separation models. Its audiovisual understanding is powered by the Perception Encoder Audiovisual (PE-AV), which underpins the model's separation quality. SAM Audio can be tried on Meta's Segment Anything Playground.
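Flow matching, the training technique behind the model's diffusion-transformer backbone, can be illustrated with a toy regression target. The sketch below is plain NumPy and is not Meta's code; it shows the linear interpolation path and constant velocity target that a flow-matching model is trained to regress.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: x1 is a "clean" target latent, x0 is Gaussian noise.
x1 = rng.standard_normal(16)
x0 = rng.standard_normal(16)

# Linear path used by conditional/rectified flow matching:
# x_t = (1 - t) * x0 + t * x1, with constant target velocity x1 - x0.
t = 0.3
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

# A network v_theta(x_t, t) regresses this velocity; a dummy zero
# predictor stands in here so the MSE loss is concrete.
v_pred = np.zeros_like(v_target)
loss = np.mean((v_pred - v_target) ** 2)
print(loss.shape)  # scalar loss
```

At inference, the learned velocity field is integrated from noise toward the separated signal, which is what lets a single model handle many conditioning types.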
SAM Audio lets users isolate sounds through text prompts such as “dog barking,” visual selection of objects in videos, and span prompts that mark a time range to extract. Designed for practical tasks like noise removal, it currently outputs mono audio, runs at roughly 0.7x real-time on A100 GPUs, and can struggle to distinguish similar-sounding events.
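The three prompt types can be sketched as plain data, and the throughput figure translates directly into a latency estimate. The field names and helper below are illustrative assumptions, not SAM Audio's actual API; only the ~0.7x real-time figure comes from the article.

```python
# Hypothetical prompt payloads -- these field names are illustrative,
# not SAM Audio's real interface.
text_prompt = {"type": "text", "query": "dog barking"}
visual_prompt = {"type": "visual", "frame": 120, "box": (40, 60, 200, 180)}
span_prompt = {"type": "span", "start_s": 3.5, "end_s": 7.0}

def estimated_latency(clip_seconds: float, rt_factor: float = 0.7) -> float:
    """Rough wall-clock estimate: at ~0.7x real-time, d seconds of
    audio take about d / 0.7 seconds to process."""
    return clip_seconds / rt_factor

print(round(estimated_latency(60.0), 1))  # -> 85.7
```

So a one-minute clip takes on the order of 86 seconds on an A100, which matters for anyone planning batch or near-live workflows.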
SAM Audio and PE-AV are currently available as free downloads. Commercial licensing terms and API access have not yet been announced, positioning the tool primarily for research use for now.
The model has also drawn scrutiny. The claim that it is the first unified separation model has been contested, and the potential for misuse remains a concern. LALAL.AI points to audio artifacts and the lack of stereo output as current shortcomings. Meta plans collaborations, such as one with Starkey, to extend the technology, with likely implications for audio editing and creative media tools. Full details are in Meta's official announcement.
