0.8B-9B native multimodal w/ more intelligence, less compute
Qwen just released the Qwen3.5 Small Model Series — 0.8B, 2B, 4B and 9B. Native multimodal with improved architecture and scaled RL. 0.8B and 2B are tiny and fast for edge devices, 4B makes a strong lightweight agent base, and 9B is already closing the gap with much larger models. Base versions released too.
I’ve been using Qwen for building a simple code and website generator, and it works really well for fast iterations. Great for prototyping and lightweight generation.
What needs improvement
I'd like more on the history pages: a section where we can re-edit the input, process, and output with a simpler UX. Basically, better handling of edge cases without extra prompting.
vs Alternatives
I choose Qwen because it’s fast, lightweight, and great for turning ideas into simple, working code or websites. It was also the first web-based tool I explored for code generation, which made it easy to start prototyping right away.
How accurate is Qwen3 on the real coding tasks you tried?
Quite good, but it still needs some touch-ups, especially on the logic.
Does Qwen3-Coder reduce PR review time or defects?
I’ve been trying Qwen alongside GPT-4o, and honestly it feels great — it’s noticeably faster and cheaper, yet most of the time the answer quality is hard to tell apart. For quick everyday tasks, I barely notice any trade-offs, which makes it a super practical choice.
I chose the Qwen model as the default starting in version 1.2 because it delivers an ideal balance of speed, accuracy, and lightweight performance. It runs efficiently on-device, uses very little storage, and responds quickly even on less powerful hardware. This makes it a perfect fit for an offline AI assistant where reliability, low resource usage, and a smooth user experience are essential.
Flowtica Scribe
Hi everyone!
The Qwen team just dropped the Qwen3.5 Small Model Series: 0.8B, 2B, 4B, and 9B, along with their Base versions.
This release fills the missing piece for on-device deployment and completes the full Qwen3.5 matrix from 0.8B all the way to 397B. Now you have clear choices:
0.8B/2B for embedded/IoT/Mobile
4B for lightweight multimodal agents
9B for edge servers
Plus the bigger MoE models for heavier workloads.
The 9B is the real shocker: matching or beating GPT-OSS-120B on several key benchmarks while being 13x smaller.
Even Elon chimed in:
Edge AI is heating up fast. This opens up exciting new opportunities for AI hardware and local innovation.
Play with these models on @Ollama!
Fluent
@zaczuo Impressive release! Already played with the whole small series, both locally (MLX) and in the cloud. Now that's something that can be reliably and consistently used in agentic workflows!
Native multimodal at 0.8B is genuinely impressive: most teams trade size for capability, but a 262K context window plus text/image/video support in under 1B parameters changes the edge deployment math.
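The "edge deployment math" here is mostly KV-cache memory, which grows linearly with context length. A quick back-of-the-envelope sketch (the layer/head numbers below are illustrative guesses for a sub-1B model, not Qwen3.5's published architecture):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV-cache size: keys + values, across all layers, fp16 by default."""
    return n_layers * 2 * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical sub-1B config: 24 layers, 4 KV heads (GQA), head dim 128.
full_window = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=128, seq_len=262_144)
print(f"{full_window / 2**30:.1f} GiB")  # cache size at the full 262K window
```

Even with aggressive grouped-query attention, filling the full window costs several GiB on top of the weights, which is exactly why cache quantization and sliding-window tricks matter on phones and IoT boards.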
The 9B beating GPT-OSS-20B on GPQA Diamond is interesting. Curious about structured output reliability at 0.8B though - small models tend to drop JSON schema adherence under complex instructions. Is there a differentiated training approach for structured data tasks?
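One cheap way to measure the schema-adherence concern raised above is to run a batch of prompts and score outputs against the expected keys and types. A minimal harness sketch (the schema and sample strings are made up for illustration):

```python
import json

def check_json_output(raw: str, required: dict) -> bool:
    """Return True if raw parses as JSON and every required key has the expected type."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(isinstance(obj.get(key), typ) for key, typ in required.items())

schema = {"name": str, "score": (int, float), "tags": list}
good = '{"name": "qwen", "score": 0.91, "tags": ["edge"]}'
bad = '{"name": "qwen", "score": "high"}'  # wrong type for score, missing tags
print(check_json_output(good, schema), check_json_output(bad, schema))
```

Running this over a few hundred generations per model size gives a concrete adherence rate to compare the 0.8B against the 4B and 9B, rather than relying on anecdotes.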
MTP for faster inference on constrained hardware is a smart addition. Real-world throughput numbers on consumer GPU vs Apple Silicon would help developers size their deployment targets.
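For readers unfamiliar with why MTP (multi-token prediction) speeds things up: the extra prediction heads act like a built-in draft model, and the main model only verifies the drafted tokens. A deliberately simplified greedy sketch of the accept-or-correct step (real implementations use probabilistic acceptance, not exact matching):

```python
def accept_prefix(draft_tokens, target_tokens):
    """Greedy speculative decoding step: keep draft tokens until the first
    mismatch with the target model's own predictions, then take the
    target's token and end the speculation round."""
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)  # target's correction replaces the bad guess
            break
    return accepted

# Draft guesses 4 tokens ahead; the target agrees with the first two.
print(accept_prefix([5, 9, 3, 7], [5, 9, 1, 7]))  # -> [5, 9, 1]
```

The win is that one forward pass of the big model can commit several tokens at once, which matters most on exactly the memory-bandwidth-starved consumer hardware this comment asks about.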
The 9B punching above its weight class is the real story here. Running capable models locally without needing a data center changes what's possible for privacy-conscious apps and edge deployments. Been waiting for small open-source models to close this gap.
@binyuan_hui @chen_cheng1 @junyang_lin Hi guys. Can non-technical developers use these models easily? What tooling or platforms do they support?
Flowtica Scribe
@kimberly_ross Try them on Locally AI :)
What kinds of tasks can be performed with these models?
A 9B model matching models many times its size is wild; small models getting this good make edge and local AI far more practical than most people realize.