The 397B native multimodal agent with 17B active params
An open-weight, native vision-language model built for long-horizon agentic tasks. Its hybrid architecture (linear attention + MoE) delivers the capabilities of a 397B giant with the inference speed of a 17B model.
Hi everyone! Qwen3.5 is here: a native vision-language model with a massive 397B parameter count.
Built on the Qwen3-Next architecture (Linear Attention + MoE), only 17B parameters are active per forward pass. This hits a specific sweet spot: you get the reasoning depth of a giant model with the inference latency of a much smaller one.
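For anyone new to MoE, here is a rough sketch of why only a slice of the weights runs per token. The expert counts and dimensions below are made up for illustration and are not Qwen3.5's actual config:

import torch
import torch.nn.functional as F

# Toy MoE layer: many experts exist, but each token is routed to only top_k of them,
# so the parameters touched per forward pass are a small fraction of the total.
num_experts, top_k, d_model = 64, 4, 1024        # illustrative numbers only
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
router = torch.nn.Linear(d_model, num_experts)

def moe_forward(x):                               # x: (num_tokens, d_model)
    weights, idx = router(x).topk(top_k, dim=-1)  # score all experts, keep the top_k per token
    weights = F.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for t in range(x.size(0)):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[e](x[t])        # only the selected experts ever run
    return out

# With 4 of 64 experts active, roughly 94% of the expert parameters sit idle for any given token.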
In practice, this efficiency is what matters for agents.
It is natively multimodal, with no glued-on vision adapters, and it posts outstanding results on agentic tasks. That means it can handle long, complex workflows without burning through tokens.
Apache 2.0 and ready for vLLM/SGLang out of the box!
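If you want to kick the tires locally, a minimal vLLM sketch looks roughly like this. The model ID is a placeholder; check the model card for the real checkpoint name and hardware requirements:

from vllm import LLM, SamplingParams

# Placeholder model ID: substitute the actual Qwen3.5 checkpoint from the model card.
llm = LLM(model="Qwen/Qwen3.5-PLACEHOLDER", tensor_parallel_size=8, trust_remote_code=True)

prompts = ["You are an agent. Given the tool results below, decide the next action: ..."]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
print(outputs[0].outputs[0].text)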
Congrats @zaczuo! Excited to test it against agentic workflows. I'm a fan of Qwen3 – always a rock-solid choice as a local model.
Serving a 397B MoE native multimodal model for long-horizon agents will bottleneck on KV-cache growth and multimodal prefill latency, and expert-routing variance can reduce batching efficiency at high throughput. Best practice: run it under vLLM or SGLang with continuous batching plus paged KV cache, add aggressive prompt and image-embedding caching, and lean on FP8 where supported to keep cost predictable. Question: what max context length are you targeting for Qwen3.5 in production, and how stable is expert routing under long tool-using trajectories when served via vLLM or SGLang?
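For reference, this is the kind of vLLM setup I mean. The kwargs are standard vLLM engine arguments, but availability depends on your vLLM version and on what the model and hardware actually support; the model ID is a placeholder:

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.5-PLACEHOLDER",   # placeholder, use the real checkpoint name
    tensor_parallel_size=8,
    max_model_len=131072,               # whatever context length you actually target
    enable_prefix_caching=True,         # reuse KV for repeated system prompts / tool schemas
    kv_cache_dtype="fp8",               # shrink the KV cache where hardware and model allow it
    gpu_memory_utilization=0.90,
)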
@ryan_thill How does Qwen3.5's 3:1 ratio of linear attention to full attention layers hold up when tool calls return wildly different payload sizes? 397B params with only 17B active keeps inference fast, but uneven chunk lengths from mixed tool outputs could still spike memory on those full attention layers even if the linear ones stay flat.
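To make that concern concrete, a quick back-of-envelope calculation (all numbers made up for illustration, not Qwen3.5's actual layer counts or head dims):

# Full-attention layers store per-token K and V, so their memory scales with total context,
# while linear-attention layers keep a fixed-size state regardless of payload length.
def kv_bytes(tokens, layers, kv_heads=8, head_dim=128, bytes_per_val=2):
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_val  # K and V

tool_payloads = [500, 12_000, 800, 30_000]   # wildly uneven tool outputs, in tokens
context = sum(tool_payloads) + 4_000         # plus the trajectory so far
full_attn_layers = 12                        # the minority of layers in a 3:1 hybrid (illustrative)

print(f"full-attention KV at {context} tokens: {kv_bytes(context, full_attn_layers) / 1e9:.2f} GB")
# The memory spike from one huge tool payload lands entirely on the full-attention layers;
# the linear-attention layers hold a constant-size recurrent state and stay flat.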
Linear attention keeping latency flat across long tool-call chains is the part that actually matters for agents. Standard transformers get brutal once you're 50+ steps into a workflow with accumulated context. 17B active params on a 397B base with vLLM support out of the box makes self-hosting realistic too.
397B with only 17B active params is impressive efficiency. The hybrid linear attention + MoE approach seems like the right direction for long-horizon agentic tasks. As someone building a vision AI app for pet health, I'm always watching open-weight multimodal models closely — excited to benchmark this against our current pipeline. Congrats on the release!
I’ve been using Qwen for building a simple code and website generator, and it works really well for fast iterations. Great for prototyping and lightweight generation.
What needs improvement
I'd like more on the history pages: a section where we can re-edit the input/process/output with an easy UX. Basically, better handling of edge cases without extra prompting.
vs Alternatives
I choose Qwen because it’s fast, lightweight, and great for turning ideas into simple, working code or websites. It was also the first web-based tool I explored for code generation, which made it easy to start prototyping right away.
How accurate is Qwen3 on real coding tasks you tried?
Quite good, but it still needs some touch-ups, especially on the logic.
Does Qwen3-Coder reduce PR review time or defects?
I’ve been trying Qwen alongside GPT-4o, and honestly it feels great — it’s noticeably faster and cheaper, yet most of the time the answer quality is hard to tell apart. For quick everyday tasks, I barely notice any trade-offs, which makes it a super practical choice.
I chose the Qwen model as the default starting in version 1.2 because it delivers an ideal balance of speed, accuracy, and lightweight performance. It runs efficiently on-device, uses very little storage, and responds quickly even on less powerful hardware. This makes it a perfect fit for an offline AI assistant where reliability, low resource usage, and a smooth user experience are essential.
Flowtica Scribe