MiniMax M3 vs. Claude Opus 4.8

Kilo Code

2mo ago

@Claude by Anthropic announced Opus 4.8 in late May, and @MiniMax launched M3 four days later.

The two models sit at radically different price points (Opus 4.8 is 8-10 times more expensive). But what about performance? The @Kilo Code team ran both models on a code audit job and tracked tokens, cost, time, and how many known issues each model identified. Here's what they found.

Key takeaways

@MiniMax M3 surfaced 13 of 17 known issues for $0.07. Opus 4.8 at the medium and high reasoning levels also surfaced 13 of 17.
@Claude by Anthropic Opus 4.8 at xhigh and max found 2 more known issues. However, every run cost 10+ times more than MiniMax M3.
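
To put those two takeaways on one scale, here is a rough cost-per-issue calculation. It is only a sketch: the post gives no exact Opus 4.8 run cost, so the code assumes exactly 10 times the MiniMax M3 cost, which is the lower bound of "10+ times". It also reads "2 more known issues" as 15 of 17.

```python
# Figures from the post: M3 found 13 of 17 known issues for $0.07.
# Assumption: Opus 4.8 (xhigh/max) costs exactly 10x the M3 run, the lower
# bound of "10+ times". The post does not state the actual Opus figure.
M3_COST, M3_FOUND = 0.07, 13
OPUS_COST_MIN = 10 * M3_COST   # assumed lower bound, about $0.70
OPUS_FOUND = M3_FOUND + 2      # "2 more known issues", read as 15 of 17


def cost_per_issue(cost: float, found: int) -> float:
    """Average spend per known issue surfaced in one run."""
    return cost / found


m3_per_issue = cost_per_issue(M3_COST, M3_FOUND)
opus_per_issue = cost_per_issue(OPUS_COST_MIN, OPUS_FOUND)
# Marginal cost of each extra issue Opus found beyond M3's 13.
marginal = (OPUS_COST_MIN - M3_COST) / (OPUS_FOUND - M3_FOUND)

print(f"M3:   ${m3_per_issue:.4f} per issue")
print(f"Opus: at least ${opus_per_issue:.4f} per issue")
print(f"Each extra issue: at least ${marginal:.3f}")
```

Under that assumption, M3 comes out at roughly half a cent per issue, Opus at 4.7 cents or more, and each of the two extra issues at about 31 cents or more. Whether that is worth it depends on what a missed issue costs you.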

So, MiniMax M3 or Claude Opus 4.8? To quote the report:

The choice is less about which model is better and more about matching the run to the job.

For low-cost or high-volume audits, MiniMax M3 is the value pick
For a fast pass, Claude Opus 4.8 (medium) is the cheapest Opus setting
For a more precise review without the longest waits, Opus 4.8 (high) works
For the most thorough single pass, Opus 4.8 (xhigh) produced the best report

Open-weight models are closing the gap fast. Test a few on your own work and pick based on budget and coverage.

Open-weight or frontier models - any preferences for coding tasks?

See full report here →

Replies

Quick disclaimer ... my multi-provider routing is in a productivity app, not a coding tool, so I can't tell you how either of these models handles code audits specifically.

But the pattern Kilo describes lines up with what I see at the product level: I'm running Qwen 3.5 4B locally (llama.cpp sidecar) and Groq's openai/gpt-oss-120b on the cloud side ... both open-weight. Once we stop trying to pick "the best model" and just treat them as different tools wired into a router, the "open-weight vs frontier" framing kind of goes away.

For my own coding work I'm a Claude Code daily-driver, so my personal stack is totally different from what my product ships. But the Kilo takeaway — "matching the run to the job" — tracks both ways. And the gap they're measuring shrinks even more once you add "what data am I actually willing to send to someone else's server" as another routing dimension.

2mo ago

Most teams will probably end up with routing rules, not a favorite model. For coding work I would split by failure cost. Cheap model for broad sweeps, frontier model where a missed issue touches security, data integrity, or production behavior, then a second reviewer before anything gets changed. Are people already routing by repo area and risk, or still picking a model manually per task?
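
A minimal sketch of what "routing rules, not a favorite model" could look like, assuming tasks are tagged with the areas they touch. The area labels, the `Route` type, and the "cheap"/"frontier" tier names are all hypothetical; nothing in the thread or the report defines them.

```python
from dataclasses import dataclass

# Hypothetical labels for areas where a missed issue is expensive.
HIGH_RISK_AREAS = {"security", "data_integrity", "production"}


@dataclass(frozen=True)
class Route:
    model: str                 # "cheap" or "frontier" (placeholder tier names)
    needs_second_review: bool  # gate before any change is applied


def route_task(touched_areas: set[str], will_change_code: bool) -> Route:
    """Pick a model tier by failure cost rather than by a favorite model."""
    high_risk = bool(touched_areas & HIGH_RISK_AREAS)
    model = "frontier" if high_risk else "cheap"
    # Per the comment above: a second reviewer before anything gets changed.
    return Route(model=model, needs_second_review=will_change_code)


print(route_task({"docs"}, will_change_code=False))
print(route_task({"security", "api"}, will_change_code=True))
```

The point of keeping it this small is that the rule lives in one place: a team could swap the area set for repo paths or a risk score without touching the callers.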

2mo ago

Kilo Code

"Most teams will probably end up with routing rules, not a favorite model."

exactly. see this poll after @GitHub Copilot moved to a usage-based billing model. most teams (61%) are already considering a hybrid approach: cap the heavy stuff and route simpler tasks to more cost-effective models.

2mo ago

memi

My current question is not "which model is smartest?" It is "which model can I trust for this exact step?" Planning, refactoring, UI interpretation, and copy all seem to reward different model behavior.

2mo ago

Kilo Code

@sarveshsea good point! a few months ago, picking a model was mostly about raw capability [1]. with the gap between open-weight and frontier models shrinking with every release, we can now add more parameters: pricing, latency, and yes, trust as well.

[1]: What's the best AI model for coding?

2mo ago

FetchSandbox

for agentic coding tasks the cost question gets way sharper because you're not running one pass, you're running 10-20 to verify the output actually works. we hit this building FetchSandbox - frontier models for the hard reasoning step, cheaper open-weight for the verification loops. the $0.07 MiniMax number is really interesting at that scale. curious if anyone's tested MiniMax on async/webhook behavior specifically, that's where most agent pipelines silently break and i'd want to know if the cheaper model catches those edge cases or just the obvious ones.
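
A back-of-the-envelope sketch of why the cost question sharpens at 10-20 passes. The per-pass prices are assumptions for illustration: $0.07 is the MiniMax M3 audit-run figure from the post, and $0.70 stands in for a frontier pass at 10 times that. Real verification passes will be priced differently, and `pipeline_cost` is a hypothetical helper.

```python
# Assumed per-pass costs, for illustration only.
CHEAP_PASS = 0.07      # MiniMax M3 audit-run figure from the post
FRONTIER_PASS = 0.70   # stand-in: 10x the cheap figure


def pipeline_cost(verify_passes: int, verify_cost: float,
                  reasoning_cost: float = FRONTIER_PASS) -> float:
    """One frontier hard-reasoning step plus N verification passes."""
    return reasoning_cost + verify_passes * verify_cost


for n in (10, 20):
    mixed = pipeline_cost(n, CHEAP_PASS)
    all_frontier = pipeline_cost(n, FRONTIER_PASS)
    print(f"{n} passes: mixed ${mixed:.2f} vs all-frontier ${all_frontier:.2f}")
```

With these assumed prices the mixed setup stays in the low single dollars per task while the all-frontier loop grows roughly ten times faster per added pass. It says nothing about whether the cheaper model actually catches the async/webhook edge cases, which is the open question.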

2mo ago