Zac Zuo

Kimi K2.6 - Open-source SOTA for long-horizon coding and agent swarms

Kimi K2.6 is Moonshot’s latest open-source model, built to push coding, long-horizon execution, and agent swarms forward at the same time. It brings stronger end-to-end coding, 300-agent swarm orchestration, and improved reliability for always-on agent frameworks like OpenClaw and Hermes.

Crystal J

Hey PH 👋

Kimi K2.6 is our latest open-source model, built for long-horizon coding and agents - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).

Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), CharXiv w/ python (86.7), Math Vision w/ python (93.2).

Live at kimi.com, the app, API, and Kimi Code. Would love your feedback :)

DAYAL PUNJABI

@crystal_j For a non-coder like me scripting PH launch trackers, how does Kimi K2.6 handle multi-step tool chains with error recovery? Like if an API flakes or a prompt needs a human tweak mid-flow?

Zac Zuo

I’ve been on K2.6-code-preview for a while, and now it’s officially K2.6. It has been kind of wild!

The model really shines on long-horizon coding: thousands of tool calls across hours of continuous execution, strong generalization across languages and tasks, plus the ability to generate rich, animated frontends with real motion and 3D elements. The agent swarm upgrades (300 parallel sub-agents) and proactive 24/7 agent support also feel like a meaningful step up.

As always, Kimi keeps delivering frontier-level models as open source. Respect🫡🫡

André J

@zaczuo What's long-horizon coding? What do you use it for? 1000s of calls? I do max 100 calls on a PR. How well does it compare to Opus 4.7? I heard the previous Kimi was almost as good as Opus 4.6.

Zac Zuo

@conduit_design Me & team mainly use it for heavy debugging in our recent Android sprint. The level of deep bugs it surfaced was not weaker than 5.4.

André J

@zaczuo Ahh, that's really smart. Just use it as a smart UI tester / unit tester.

fmerian

K2.6 offers SOTA-level performance at a fraction of the cost.

It's open-weights, it's fast, and optimized for long-context tasks across the codebase, as well as the day-to-day work needed to support an always-on agent like @OpenClaw and @KiloClaw.

Impressive.

Ivan Braun

How strict is Kimi with sensitive topics? How would you rate it against the big three US models on filter sensitivity toward information security, copyright, interpersonal boundaries, etc.?

I'm not talking about explicitly dangerous activity, but about legitimate tasks that occasionally trigger the filters. An example is Claude Code refusing to configure the Microsoft Entra dashboard because it looks like a hacker attack to it.

Tijo Gaucher

300-agent swarm orchestration is wild — curious how reliable the long-horizon execution actually is in practice. Anyone tried it on multi-hour coding sessions yet?

Amedeo Viscido

How do Kimi Code Allegretto and Moderato compare to Claude or Gemini on quota? I have both Pro subscriptions and I get through the week consuming both quotas.

Andy Maciejewski

Solid open-weights drop. How does K2.6 compare to Claude Sonnet on multi-file refactors where you need to hold the call graph across 30+ files? SWE-bench score looks great but curious about real-world agent loops where context drift kills smaller models.


Ethan Frost

The 300 parallel sub-agents thing is wild. Most coding agents I've used top out at like 5-10 concurrent tool calls before they start stepping on each other. If Kimi K2.6 can actually coordinate 300 without losing coherence, that's a genuine architectural advantage, not just a benchmark flex. How does it handle conflicting edits when multiple agents touch the same file?

Martí Carmona Serrat

300-agent swarm orchestration as a default capability is the bet I want to see real numbers on. Curious about the failure mode at scale: when one of the 300 sub-agents goes off-track or hallucinates a tool call, does K2.6 surface that to the orchestrator early, or does it propagate quietly through the swarm? The recovery semantics matter more than peak SWE-bench at this fan-out.