The 397B native multimodal agent with 17B active params
An open-weight, native vision-language model built for long-horizon agentic tasks. Its hybrid architecture (linear attention + MoE) delivers the capabilities of a 397B giant with the inference speed of a 17B model.
Hi everyone! Qwen3.5 is here: a native vision-language model with a massive 397B parameter count.
Built on the Qwen3-Next architecture (Linear Attention + MoE), only 17B parameters are active per forward pass. This hits a specific sweet spot: you get the reasoning depth of a giant model with the inference latency of a much smaller one.
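For anyone new to MoE, here is a rough sketch of why only a slice of the weights runs per token. The expert counts and dimensions below are made up for illustration and are not Qwen3.5's actual config:

import torch
import torch.nn.functional as F

# Toy MoE layer: many experts exist, but each token is routed to only top_k of them,
# so the parameters touched per forward pass are a small fraction of the total.
num_experts, top_k, d_model = 64, 4, 1024        # illustrative numbers only
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
router = torch.nn.Linear(d_model, num_experts)

def moe_forward(x):                               # x: (num_tokens, d_model)
    weights, idx = router(x).topk(top_k, dim=-1)  # score all experts, keep the top_k per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])        # only the selected experts ever run
    return out

# With 4 of 64 experts active, roughly 94% of the expert parameters sit idle for any given token.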
In practice, this efficiency is what matters for agents.
It is natively multimodal, with no glued-on vision adapters, and it posts outstanding results on agentic tasks. That means it can handle long, complex workflows without burning through tokens.
Apache 2.0 and ready for vLLM/SGLang out of the box!
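If you want to kick the tires locally, a minimal vLLM sketch looks roughly like this. The model ID is a placeholder; check the model card for the real checkpoint name and hardware requirements:

from vllm import LLM, SamplingParams

# Placeholder model ID: substitute the actual Qwen3.5 checkpoint from the model card.
llm = LLM(model="Qwen/Qwen3.5-PLACEHOLDER", tensor_parallel_size=8, trust_remote_code=True)

prompts = ["You are an agent. Given the tool results below, decide the next action: ..."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
print(outputs[0].outputs[0].text)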
Congrats @zaczuo! Excited to test it against agentic workflows. I'm a fan of Qwen3 – always a rock-solid choice as a local model.
Serving a 397B MoE native multimodal model for long-horizon agents will bottleneck on KV-cache growth and multimodal prefill latency, and expert-routing variance can reduce batching efficiency at high throughput. Best practice: run it under vLLM or SGLang with continuous batching plus paged KV cache, add aggressive prompt and image-embedding caching, and lean on FP8 where supported to keep cost predictable. Question: what max context length are you targeting for Qwen3.5 in production, and how stable is expert routing under long tool-using trajectories when served via vLLM or SGLang?
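For reference, this is the kind of vLLM setup I mean. The kwargs are standard vLLM engine arguments, but availability depends on your vLLM version and on what the model and hardware actually support; the model ID is a placeholder:

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.5-PLACEHOLDER",   # placeholder, use the real checkpoint name
    tensor_parallel_size=8,
    max_model_len=131072,               # whatever context length you actually target
    enable_prefix_caching=True,         # reuse KV for repeated system prompts / tool schemas
    kv_cache_dtype="fp8",               # shrink the KV cache where hardware and model allow it
    gpu_memory_utilization=0.90,
)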
@ryan_thill How does Qwen3.5's 3:1 ratio of linear attention to full attention layers hold up when tool calls return wildly different payload sizes? 397B params with only 17B active keeps inference fast, but uneven chunk lengths from mixed tool outputs could still spike memory on those full attention layers even if the linear ones stay flat.
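To make that concern concrete, a quick back-of-envelope calculation (all numbers made up for illustration, not Qwen3.5's actual layer counts or head dims):

# Full-attention layers store per-token K and V, so their memory scales with total context,
# while linear-attention layers keep a fixed-size state regardless of payload length.
def kv_bytes(tokens, layers, kv_heads=8, head_dim=128, bytes_per_val=2):
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_val  # K and V

tool_payloads = [500, 12_000, 800, 30_000]   # wildly uneven tool outputs, in tokens
context = sum(tool_payloads) + 4_000         # plus the trajectory so far
full_attn_layers = 12                        # the minority of layers in a 3:1 hybrid (illustrative)

print(f"full-attention KV at {context} tokens: {kv_bytes(context, full_attn_layers) / 1e9:.2f} GB")
# The memory spike from one huge tool payload lands entirely on the full-attention layers;
# the linear-attention layers hold a constant-size recurrent state and stay flat.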
Linear attention keeping latency flat across long tool-call chains is the part that actually matters for agents. Standard transformers get brutal once you're 50+ steps into a workflow with accumulated context. 17B active params on a 397B base with vLLM support out of the box makes self-hosting realistic too.
397B with only 17B active params is impressive efficiency. The hybrid linear attention + MoE approach seems like the right direction for long-horizon agentic tasks. As someone building a vision AI app for pet health, I'm always watching open-weight multimodal models closely — excited to benchmark this against our current pipeline. Congrats on the release!
I’ve been using Qwen for building a simple code and website generator, and it works really well for fast iterations. Great for prototyping and lightweight generation.
What needs improvement
I'd like more on the history pages: a section where we can re-edit the input/process/output with an easy UX. Basically, better handling of edge cases without extra prompting.
vs Alternatives
I choose Qwen because it’s fast, lightweight, and great for turning ideas into simple, working code or websites. It was also the first web-based tool I explored for code generation, which made it easy to start prototyping right away.
How accurate is Qwen3 on real coding tasks you tried?
Quite good, but it still needs some touch-ups, especially on the logic.
Does Qwen3-Coder reduce PR review time or defects?
I’ve been trying Qwen alongside GPT-4o, and honestly it feels great — it’s noticeably faster and cheaper, yet most of the time the answer quality is hard to tell apart. For quick everyday tasks, I barely notice any trade-offs, which makes it a super practical choice.
I chose the Qwen model as the default starting in version 1.2 because it delivers an ideal balance of speed, accuracy, and lightweight performance. It runs efficiently on-device, uses very little storage, and responds quickly even on less powerful hardware. This makes it a perfect fit for an offline AI assistant where reliability, low resource usage, and a smooth user experience are essential.
Flowtica Scribe