MiniMax M3 vs. Claude Opus 4.8

@Claude by Anthropic announced Opus 4.8 in late May, and @MiniMax launched M3 four days later.
They are two models at radically different price points (Opus 4.8 costs 8-10 times more). And from a performance perspective? The @Kilo Code team ran both models on a code audit job, tracking tokens, cost, time, and how many known issues each model identified. Here's what they found.
Key takeaways
@MiniMax M3 surfaced 13 of 17 known issues for $0.07. So did Opus 4.8 at the medium and high reasoning levels.
@Claude by Anthropic Opus 4.8 at xhigh and max found 2 more known issues. However, every run cost 10+ times more than MiniMax M3.
So, MiniMax M3 or Claude Opus 4.8? To quote the report:
The choice is less about which model is better and more about matching the run to the job.
For low-cost or high-volume audits, MiniMax M3 is the value pick
For a fast pass, Claude Opus 4.8 (medium) is the cheapest setting
For a more precise review without the longest waits, Opus 4.8 (high) works
For the most thorough single pass, Opus 4.8 (xhigh) produced the best report
Open-weight models are closing the gap fast. Test a few on your own work and pick on budget and coverage.
Open-weight or frontier models - any preferences for coding tasks?


Replies
Quick disclaimer ... my multi-provider routing is in a productivity app, not a coding tool, so I can't tell you how either of these models handles code audits specifically.
But the pattern Kilo describes lines up with what I see at the product level: I'm running Qwen 3.5 4B locally (llama.cpp sidecar) and Groq's openai/gpt-oss-120b on the cloud side ... both open-weight. Once we stop trying to pick "the best model" and just treat them as different tools wired into a router, the "open-weight vs frontier" framing kind of goes away.
For my own coding work I'm a Claude Code daily-driver, so my personal stack is totally different from what my product ships. But the Kilo takeaway — "matching the run to the job" — tracks both ways. And the gap they're measuring shrinks even more once you add "what data am I actually willing to send to someone else's server" as another routing dimension.
Most teams will probably end up with routing rules, not a favorite model. For coding work I would split by failure cost. Cheap model for broad sweeps, frontier model where a missed issue touches security, data integrity, or production behavior, then a second reviewer before anything gets changed. Are people already routing by repo area and risk, or still picking a model manually per task?
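A minimal sketch of what "split by failure cost" could look like as a routing rule. Everything here is an assumption for illustration: the tier names, the `HIGH_RISK_AREAS` set, and the `route_task` helper are hypothetical, not anything from the Kilo report or a real router API.

```python
from dataclasses import dataclass

# Hypothetical tier names; swap in whatever models your router exposes.
CHEAP_MODEL = "cheap-open-weight"
FRONTIER_MODEL = "frontier"

# Areas where a missed issue is expensive (assumed labels for repo areas).
HIGH_RISK_AREAS = {"security", "data_integrity", "production"}


@dataclass(frozen=True)
class Route:
    model: str
    needs_second_review: bool


def route_task(area: str, changes_code: bool) -> Route:
    """Pick a model tier by failure cost, not by which model is 'best'."""
    # Frontier model only where a miss is costly; cheap model for broad sweeps.
    model = FRONTIER_MODEL if area in HIGH_RISK_AREAS else CHEAP_MODEL
    # Require a second reviewer before anything actually gets changed.
    return Route(model=model, needs_second_review=changes_code)
```

Calling `route_task("security", changes_code=True)` would pick the frontier tier and flag the change for a second review, while a read-only sweep over docs stays on the cheap tier.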
Tabstack by Mozilla
exactly. see this poll after @Github Copilot moved to a usage-based billing model. most teams (61%) are already considering a hybrid approach: cap the heavy stuff and route simpler tasks to more cost-effective models.
My current question is not "which model is smartest?" It is "which model can I trust for this exact step?" Planning, refactoring, UI interpretation, and copy all seem to reward different model behavior.
Tabstack by Mozilla
@sarveshsea good point! a few months ago, picking a model was mostly about raw capability [1]. with the gap between open-weight and frontier models shrinking with every release, we can now weigh more parameters: pricing, latency, and yes, trust as well.
[1]: What's the best AI model for coding?
for agentic coding tasks the cost question gets way sharper because you're not running one pass, you're running 10-20 to verify the output actually works. we hit this building FetchSandbox - frontier models for the hard reasoning step, cheaper open-weight for the verification loops. the $0.07 MiniMax number is really interesting at that scale. curious if anyone's tested MiniMax on async/webhook behavior specifically, that's where most agent pipelines silently break and i'd want to know if the cheaper model catches those edge cases or just the obvious ones.