Kimi K2.6 - Open-source SOTA for long-horizon coding and agent swarms
Kimi K2.6 is Moonshot’s latest open-source model, built to push coding, long-horizon execution, and agent swarms forward at the same time. It brings stronger end-to-end coding, 300-agent swarm orchestration, and improved reliability for always-on agent frameworks like OpenClaw and Hermes.


Replies
Kimi AI - Now with K2.6
Hey PH 👋
Kimi K2.6 is our latest open-source model, built for long-horizon coding and agents - 4,000+ tool calls, over 12 hours of continuous execution, with generalization across languages (Rust, Go, Python) and tasks (frontend, devops, perf optimization).
Open-source SOTA on HLE w/ tools (54.0), SWE-Bench Pro (58.6), SWE-bench Multilingual (76.7), BrowseComp (83.2), Toolathlon (50.0), Charxiv w/ python (86.7), Math Vision w/ python (93.2)
Live at kimi.com, the app, API, and Kimi Code. Would love your feedback :)
@crystal_j For a non-coder like me scripting PH launch trackers, how does Kimi K2.6 handle multi-step tool chains with error recovery? Like if an API flakes or a prompt needs a human tweak mid-flow?
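To make the error-recovery question concrete, here is a minimal sketch of one common pattern: retry a flaky step with exponential backoff, then stop and hand off to a person once retries run out. It says nothing about how K2.6 works internally. `run_step`, `run_chain`, and `NeedsHuman` are hypothetical names invented for this illustration, and treating `ConnectionError` as the "API flaked" signal is an assumption.

```python
import time


class NeedsHuman(Exception):
    """Hypothetical signal: automated retries are exhausted, a person should step in."""


def run_step(step, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call step(); on a transient ConnectionError, wait and retry with backoff."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return step()
        except ConnectionError as err:  # assumed stand-in for "the API flaked"
            last_error = err
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ... by default
    raise NeedsHuman(f"step failed after {max_retries} attempts: {last_error}")


def run_chain(steps, **retry_options):
    """Run (name, callable) steps in order; stop at the first one needing a human."""
    results = []
    for name, step in steps:
        try:
            results.append((name, run_step(step, **retry_options)))
        except NeedsHuman as err:
            # Record the failure and halt so later steps do not run on bad state.
            results.append((name, err))
            break
    return results
```

In this sketch the chain halts at the failed step rather than skipping it, so whoever picks it up can see which step broke and what had already completed.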
Flowtica Scribe
I’ve been on K2.6-code-preview for a while, and now it’s officially K2.6. It has been kind of wild!
The model really shines on long-horizon coding: thousands of tool calls across hours of continuous execution, strong generalization across languages and tasks, plus the ability to generate rich, animated frontends with real motion and 3D elements. The agent swarm upgrades (300 parallel sub-agents) and proactive 24/7 agent support also feel like a meaningful step up.
As always, Kimi keeps delivering frontier-level models as open source. Respect🫡🫡
DiffSense
@zaczuo What's long-horizon coding? What do you use it for? Thousands of calls? I do max 100 calls on a PR. How well does it compare to Opus 4.7? I heard the previous Kimi was almost as good as Opus 4.6.
Flowtica Scribe
@conduit_design Me & team mainly use it for heavy debugging in our recent Android sprint. The level of deep bugs it surfaced was not weaker than 5.4.
DiffSense
@zaczuo Ahh, that's really smart. Just use it as a smart UI tester / unit tester.
Kilo Code
K2.6 offers SOTA-level performance at a fraction of the cost.
It's open-weights, it's fast, and it's optimized for long-context tasks across the codebase, as well as the day-to-day work needed to support an always-on agent like @OpenClaw and @KiloClaw.
Impressive.
Brila
How strict is Kimi with sensitive topics? How would you rate it against the big three US models on filter sensitivity toward information security, copyright, interpersonal boundaries, etc.?
I'm not talking about explicitly dangerous activity, but about legitimate tasks that occasionally trigger the filters. An example is Claude Code refusing to configure the Microsoft Entra dashboard because it looks like a hacker attack to it.
300-agent swarm orchestration is wild — curious how reliable the long-horizon execution actually is in practice. Anyone tried it on multi-hour coding sessions yet?
Solid open-weights drop. How does K2.6 compare to Claude Sonnet on multi-file refactors where you need to hold the call graph across 30+ files? SWE-bench score looks great but curious about real-world agent loops where context drift kills smaller models.
The 300 parallel sub-agents thing is wild. Most coding agents I've used top out at like 5-10 concurrent tool calls before they start stepping on each other. If Kimi K2.6 can actually coordinate 300 without losing coherence, that's a genuine architectural advantage, not just a benchmark flex. How does it handle conflicting edits when multiple agents touch the same file?
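For readers wondering what "handling conflicting edits" could look like, one generic approach is optimistic concurrency: each agent records the file version it read, and a write is rejected if the file has changed since. The sketch below illustrates only that idea. Nothing in the thread says K2.6 does this, and `Workspace` with its `read`/`write` methods is a hypothetical in-memory stand-in for a real file store.

```python
import threading


class Workspace:
    """Hypothetical in-memory file store with per-file version numbers."""

    def __init__(self):
        self._files = {}  # path -> (version, content)
        self._lock = threading.Lock()

    def read(self, path):
        """Return (version, content); a file never written is version 0, empty."""
        with self._lock:
            return self._files.get(path, (0, ""))

    def write(self, path, base_version, content):
        """Commit only if nobody else wrote since base_version was read."""
        with self._lock:
            current_version, _ = self._files.get(path, (0, ""))
            if current_version != base_version:
                return False  # stale edit: caller must re-read and redo its change
            self._files[path] = (current_version + 1, content)
            return True
```

With this scheme the second of two agents that read the same version gets `False` back and has to re-read before retrying, so a conflicting edit is never silently overwritten. The cost is wasted work when many agents contend for the same file.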
jared.so
300-agent swarm orchestration as a default capability is the bet I want to see real numbers on. Curious about the failure mode at scale: when one of the 300 sub-agents goes off-track or hallucinates a tool call, does K2.6 surface that to the orchestrator early, or does it propagate quietly through the swarm? The recovery semantics matter more than peak SWE-bench at this fan-out.
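The "surface it early" behaviour asked about here can be sketched generically: the orchestrator checks each sub-agent result as soon as it completes and quarantines anything that crashed or named a tool outside an allow-list, so it cannot feed into later steps. This is an illustration under stated assumptions, not a description of K2.6's orchestrator. `ALLOWED_TOOLS`, `validate`, `orchestrate`, and the `{"tool": ...}` result shape are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical allow-list; a real system would take this from its tool registry.
ALLOWED_TOOLS = {"read_file", "run_tests"}


def validate(result):
    """Accept a result only if it names a tool the orchestrator knows about."""
    return result.get("tool") in ALLOWED_TOOLS


def orchestrate(tasks, worker, max_workers=8):
    """Fan tasks out to worker(task) and sort results into accepted / rejected."""
    accepted, rejected = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, task): task for task in tasks}
        for future in as_completed(futures):  # inspect each result as it lands
            task = futures[future]
            try:
                result = future.result()
            except Exception as err:
                rejected[task] = f"crashed: {err}"
                continue
            if validate(result):
                accepted[task] = result
            else:
                rejected[task] = f"unknown tool: {result.get('tool')!r}"
    return accepted, rejected
```

Because rejected results are recorded against their task instead of being dropped, the orchestrator can retry just those tasks or report them, which is the visible-failure behaviour the question is after.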