Launched this week

TokenHunger
Find the cheapest model that still passes your task
Benchmark every model on your own task, rank by cost per correct answer, and pay only for the runs you make. Run the benchmark. Feed the lean one.

I Benchmarked Claude Fable 5 Against 24 Models. It Finished 23rd.
My feed has been wall-to-wall Claude Fable 5 this week — new frontier model, “the new default.” So I ran it on a real task against 24 other models and measured the only number that shows up on your invoice: cost per correct answer.
Fable 5 finished 23rd of 25. The winner is a model almost nobody’s talking about.
The task: judgment, not trivia. Release governance is the call an engineering org makes thousands of times a week. A build deployed 45 minutes ago, errors are at 8.2% against a 1% SLO, and one mitigation has already failed: ship or roll back? A new image-proxy path forwards user URLs to the metadata endpoint with no validation, and security has reproduced an SSRF: ship or hold? Twelve cases, five possible verdicts, real consequences. This is the kind of high-volume decision where price per call is the whole budget.
The results, ranked by cost per success:
🥇 qwen3.5-flash — 100% — $0.000023 (and fastest, 0.65s)
claude-sonnet-4-6 — 100% — $0.000786
openai/gpt-5 — 100% — $0.003960
🐢 claude-fable-5 — 100% — $0.005191 (#23)
Eleven models scored a perfect 12/12. Fable 5 was the most expensive of the eleven: roughly a 226x premium over qwen3.5-flash ($0.005191 vs. $0.000023 per correct answer) for the identical result.
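If you want to check the premium yourself, here is a minimal sketch. The dollar figures are copied from the results above; the function name `premium_over` is just an illustrative helper, not anything TokenHunger exposes.

```python
# Cost per correct answer in USD, copied from the results listed above.
COST_PER_SUCCESS = {
    "qwen3.5-flash": 0.000023,
    "claude-sonnet-4-6": 0.000786,
    "openai/gpt-5": 0.003960,
    "claude-fable-5": 0.005191,
}


def premium_over(baseline: str, costs: dict[str, float]) -> dict[str, float]:
    """Return each model's cost as a multiple of the baseline model's cost."""
    base = costs[baseline]
    return {model: cost / base for model, cost in costs.items()}


premiums = premium_over("qwen3.5-flash", COST_PER_SUCCESS)

# Print cheapest first, so the baseline (1.0x) is at the top.
for model, ratio in sorted(premiums.items(), key=lambda kv: kv[1]):
    print(f"{model:<20} {ratio:6.1f}x")
```

For claude-fable-5 the ratio comes out at about 225.7, which rounds to the 226x quoted above.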
Three twists the hype is hiding:
Newer isn’t better. The newer qwen3.6-flash dropped to 92%. On the frontier, Opus 4-8 scored just 75% while the older, cheaper Opus 4-7 went a clean 100%. On this task, the newer flagship scored lower and charged more.
“Cheap per token” is a trap. Claude Haiku 4-5 has one of the lowest token prices on the board. Its cost per correct answer? Infinite: it scored 0/12. Cheap tokens that return wrong answers aren’t cheap; they’re waste with a low sticker price.
The leaderboard flips. Rank by capability and the frontier names float up. Rank by cost per success — what you actually pay — and the board inverts. The winner is small, fast, and unglamorous.
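The metric behind these twists is simple enough to sketch. This assumes cost per correct answer means total spend on the run divided by the number of correct answers; the post does not spell out TokenHunger's exact formula, so treat that as my reading. The three models and their spend figures below are invented for illustration and are not benchmark results.

```python
import math
from dataclasses import dataclass


@dataclass
class RunSummary:
    model: str
    total_cost_usd: float  # spend across all cases in the run
    correct: int
    cases: int


def cost_per_correct(run: RunSummary) -> float:
    """Total spend divided by correct answers; infinite when nothing is correct."""
    if run.correct == 0:
        return math.inf
    return run.total_cost_usd / run.correct


# Hypothetical runs (made-up numbers, NOT from the benchmark).
runs = [
    RunSummary("big-model", 0.0600, 12, 12),
    RunSummary("small-model", 0.0003, 12, 12),
    RunSummary("cheap-but-wrong", 0.0001, 0, 12),
]

# Ranking by cost per correct answer puts the zero-score model last,
# even though it has the lowest total spend.
by_cost = sorted(runs, key=cost_per_correct)
by_spend = sorted(runs, key=lambda r: r.total_cost_usd)

for run in by_cost:
    print(f"{run.model:<16} {cost_per_correct(run):.6f}")
```

Sorting by raw spend would put the zero-score model first; sorting by cost per correct answer pushes it to the bottom, which is the inversion described above.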
What this means. Fable 5 isn’t a bad model; on genuine frontier reasoning it may be worth every cent. But for everyday, high-volume work, the frontier model is almost never the right default. You pay a premium for a ceiling you never touch, on a task a model 226x cheaper already solves perfectly.
Stop asking “which model is smartest.” Ask “which hits the accuracy I need at the lowest cost per correct answer?” Sometimes that’s the flagship. Here it was qwen3.5-flash, by two orders of magnitude.
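That question translates into a two-step selection: filter by the accuracy you need, then take the cheapest survivor. Here is a rough sketch under my own assumptions. The first four rows reuse the figures from the results list; the last row is an invented entry, added only to show the accuracy filter doing something. `pick_default` is a hypothetical helper name.

```python
from typing import NamedTuple, Optional


class Result(NamedTuple):
    model: str
    accuracy: float          # fraction of cases answered correctly
    cost_per_success: float  # USD per correct answer


def pick_default(results: list[Result], min_accuracy: float) -> Optional[Result]:
    """Cheapest model per correct answer among those meeting the accuracy bar."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    return min(eligible, key=lambda r: r.cost_per_success, default=None)


results = [
    Result("qwen3.5-flash", 1.00, 0.000023),
    Result("claude-sonnet-4-6", 1.00, 0.000786),
    Result("openai/gpt-5", 1.00, 0.003960),
    Result("claude-fable-5", 1.00, 0.005191),
    # Invented entry for illustration only: cheaper, but misses the bar.
    Result("hypothetical-bargain", 0.75, 0.000010),
]

choice = pick_default(results, min_accuracy=1.0)
print(choice.model if choice else "no model meets the bar")
```

Lower the bar to 0.7 and the invented entry wins instead, which is the point: the accuracy requirement comes first, and price only decides among the models that clear it.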
Every figure is live on TokenHunger — same 12 cases, all 25 models, ranked by what a correct answer actually costs. Run your own task before you make the trending model your default.
What’s the most expensive model in your stack doing a job a model 100x cheaper does just as well?
#LLM #AIEngineering #MLOps #AICost #tokenhunger