Mellum by JetBrains - Fast LLMs for low-latency and high-performance workflows

Meet Mellum, a family of fast language models, including a next-generation model for ultra-low-latency and high-performance inference.

What percentage of real-world developer tasks do you believe can eventually be handled by specialized models like Mellum without needing a frontier model at all?
I would say 80% within the next 3-5 years.
We need to see more of this to remove the dependency on cloud-based models.

Latency-first models are underrated. I run AI voice agents and on a live phone call latency isn't a nice-to-have, it's the whole UX — a 2-second pause feels broken to a caller in a way it never does inside an IDE. For narrow, well-scoped tasks I'll take fast-and-good-enough over slow-and-brilliant every time. Is Mellum something you'd consider for real-time / voice use cases, or is it squarely aimed at the coding loop?

Yeah! Workflow performance has become key, and this brings a clear advantage there. Doing what Flo does! The real hunting goat!

Shipping a focused, smaller coding model as open weights is the interesting bet here — the frontier-model-for-everything approach is expensive and overkill for completion. What I'd want to know: what context window does Mellum practically use for repo-level completion, and is it trained for fill-in-the-middle specifically, or general next-token? FIM quality is usually what separates a good in-IDE model from a chat model bolted into an editor.
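For anyone unfamiliar with the FIM vs. next-token distinction, here is a minimal sketch of how the two prompts differ. The sentinel strings are placeholders modeled on a convention some open code models use. They are an assumption for illustration, not Mellum's confirmed prompt format, and the function names are hypothetical too.

```python
# Hypothetical sentinel strings; each real model defines its own special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(source: str, cursor: int) -> str:
    """Split the file at the cursor and lay it out prefix, suffix, middle (PSM).

    The model is expected to generate the missing middle after FIM_MIDDLE,
    so it can condition on code both before and after the cursor.
    """
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie within the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def build_next_token_prompt(source: str, cursor: int) -> str:
    """Plain left-to-right prompt: the model never sees code after the cursor."""
    return source[:cursor]


if __name__ == "__main__":
    source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
    cursor = source.index("return ") + len("return ")
    print(repr(build_fim_prompt(source, cursor)))
    print(repr(build_next_token_prompt(source, cursor)))
```

The practical difference is that the FIM prompt carries the `print(add(1, 2))` call site that follows the cursor, while the next-token prompt drops it, which is why in-editor completion tends to depend on FIM training.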

How does it compare with the Qwen 3.6 and Gemma 4 models? It's disappointing to only see the old models. It seems misleading.

Fast, focused models make a lot of sense for coding. I’d rather get a useful completion instantly than wait for a huge model to overthink it.

Really like the latency-first direction here. Not every AI workflow needs a frontier model — especially for completion, routing, classification, and small sub-agent tasks where speed and cost matter a lot.

Would love to know how Mellum performs in local/self-hosted setups compared to cloud inference.