Hi everyone!

Sharing R1-AQA, a new open-source audio question answering (AQA) model from Xiaomi. It takes a really interesting approach, inspired by DeepSeek-R1!

What's cool:

🎧 Audio Question Answering: It goes beyond simple transcription, allowing you to ask questions and get answers based on the audio's content.
🧠 Reinforcement Learning (GRPO): They trained the model with Group Relative Policy Optimization (GRPO), a reinforcement learning technique.
🏆 State-of-the-Art: Achieves top results on the MMAU Test-mini benchmark, beating models like GPT-4o and Gemini Pro.
🌱 Small Data: They did this with only 38,000 training samples, starting from Qwen2-Audio-7B-Instruct as the base model.
🔓 Open Source: Both the code and the model weights are available.

The use of reinforcement learning is particularly interesting. It seems like a very effective way to train these kinds of models, even with limited data.
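For anyone wondering what the "group relative" part of GRPO means, here's a rough sketch of the core idea as I understand it: sample several answers to the same question, score each one, and judge every answer relative to the rest of its group. The function names, the exact-match reward, and the toy answers below are my own assumptions for illustration, not code from the R1-AQA repo.

```python
from statistics import mean, pstdev

def accuracy_reward(predicted: str, correct: str) -> float:
    # Hypothetical rule-based reward: 1.0 for an exact match, else 0.0.
    return 1.0 if predicted.strip().lower() == correct.strip().lower() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    # Score each sample against its own group: subtract the group mean
    # and divide by the group standard deviation (eps avoids dividing by 0).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group: four sampled answers to one audio question (made-up data).
sampled_answers = ["dog barking", "car horn", "Dog barking", "rain"]
rewards = [accuracy_reward(a, "dog barking") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, adv in zip(sampled_answers, advantages):
    print(f"{answer!r}: advantage {adv:+.2f}")
```

Answers that beat the group average get a positive advantage and are reinforced, while the rest are pushed down. Because the baseline comes from the group itself, there's no separate value model to train.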

To try it yourself, upload an audio or video file here.

R1-AQA

Xiaomi's DeepSeek-R1 Inspired Audio AI