The Multivac - Which LLM thinks best? Blind peer-judged leaderboard.

Most LLM leaderboards are static, gameable, or judged by a single model. The Multivac runs a 10×10 blind peer matrix: every frontier model answers, then judges every other model's answer without knowing whose it is. What you get is a ranking of reasoning quality, not memorized benchmarks. Features: Ask Multivac (live multi-model answers + share pages), Model Pulse heatmap, head-to-head Compare, full data export, and an open-source evaluation engine (MIT).

Hey PH 👋 I'm Yash. I solo-built The Multivac out of rural Manitoba over the last several months. The thing that nagged me about every existing LLM leaderboard: either a single judge model decides who wins, or the benchmark gets memorized into oblivion within a release cycle. So I built the opposite: a blind peer matrix where every model evaluates every other model's answer without knowing whose it is, scored across 4 dimensions. The engine is open source (MIT). The methodology is public. You can run "Ask Multivac" on your own hard question and see all the frontier models reason through it side-by-side, then share the result. Genuinely curious what you'd want to see next: head-to-heads against new releases? Domain-specific leaderboards? Roast the methodology, I want it to hold up.
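
For anyone who wants the blind peer matrix idea in concrete terms, here is a rough Python sketch. It is not the actual engine code. The function names (`blind_labels`, `rank_models`), the score layout, and the plain-mean aggregation are all assumptions made for illustration; the real scoring dimensions and weighting live in the open-source repo.

```python
import random
from statistics import mean

N_DIMENSIONS = 4  # the post says 4 dimensions; their names are not assumed here


def blind_labels(models, rng):
    """Map each model to an opaque label so a judge cannot see authorship.

    Hypothetical helper: the real engine may anonymize differently.
    """
    shuffled = list(models)
    rng.shuffle(shuffled)
    return {model: f"answer_{i}" for i, model in enumerate(shuffled)}


def rank_models(models, scores):
    """Rank models from a peer matrix of scores.

    scores[(judge, author)] is a list of N_DIMENSIONS numbers.
    Self-judgments (judge == author) are skipped. Each author's result is
    the mean, over all other judges, of that judge's mean across dimensions.
    Plain averaging is an assumption, not the documented method.
    """
    totals = {}
    for author in models:
        peer_means = [
            mean(scores[(judge, author)])
            for judge in models
            if judge != author
        ]
        totals[author] = mean(peer_means)
    # Highest average peer score first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

With 10 models this gives 90 peer judgments per question (the diagonal of the 10×10 matrix is left out), so no single judge decides the outcome.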

