Hi guys,
Here is a simple web application for generating immersive sound effects based on visuals!
In a nutshell, the system works in three steps. First, it builds a detailed understanding of the visual scene. Then, it asks a language model (like ChatGPT) to brainstorm plausible sound descriptions for that scene. Finally, it generates an audio file from each sound description.
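For anyone who wants to picture the flow, here is a rough Python sketch of those three steps. It is only an illustration, not the app's actual code: every name in it (`generate_sound_effects`, `describe_scene`, `brainstorm_sounds`, `synthesize_audio`, `SoundEffect`) is hypothetical, and the three stages are stand-in stubs where the real system would call a vision model, a language model, and an audio generator.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SoundEffect:
    description: str  # text description of the sound
    audio: bytes      # generated audio data (placeholder bytes in this sketch)


def generate_sound_effects(
    image_path: str,
    describe_scene: Callable[[str], str],
    brainstorm_sounds: Callable[[str], List[str]],
    synthesize_audio: Callable[[str], bytes],
) -> List[SoundEffect]:
    """Chain the three stages: scene understanding, brainstorming, audio generation."""
    scene = describe_scene(image_path)        # step 1: understand the visual scene
    descriptions = brainstorm_sounds(scene)   # step 2: brainstorm sound descriptions
    # step 3: generate one audio clip per description
    return [SoundEffect(d, synthesize_audio(d)) for d in descriptions]


# Stand-in stages (assumptions for illustration only; no real models are called).
def fake_describe_scene(image_path: str) -> str:
    return f"a scene from {image_path}"


def fake_brainstorm_sounds(scene: str) -> List[str]:
    return [f"ambient sound for {scene}", f"foreground sound for {scene}"]


def fake_synthesize_audio(description: str) -> bytes:
    return description.encode("utf-8")


effects = generate_sound_effects(
    "beach.jpg", fake_describe_scene, fake_brainstorm_sounds, fake_synthesize_audio
)
for effect in effects:
    print(effect.description, len(effect.audio))
```

Passing the stages in as functions keeps the sketch independent of any particular model or service, so each one could be swapped out without touching the rest.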
If you're interested, feel free to read our short research paper: https://arxiv.org/pdf/2311.05609.... This work builds on an earlier full research paper: https://arxiv.org/pdf/2112.09726....
Thanks and enjoy!