Cli Modelarium

Compare LLMs with real statistics, right from your terminal

CLI tool for comparing AI language models with statistical rigor. Supports 8 cloud providers (OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, OpenRouter) plus local models. Features include bootstrap confidence intervals, paired significance tests, hallucination detection, LLM-as-judge panels, and cost tracking with hard caps. One pip install, no infrastructure. Available on Linux, macOS, and Windows. Requires Python 3.11+. Licensed under Apache 2.0. Install with: pip install cli-modelarium
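For readers unfamiliar with the term, a bootstrap confidence interval on paired scores can be sketched in a few lines of Python. This is a generic illustration of the technique, not Cli Modelarium's own code: the function name `paired_bootstrap_ci`, its parameters, and the percentile method are all assumptions made for the example.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-prompt score difference (A - B).

    Hypothetical helper for illustration; scores are assumed to be one
    number per prompt, with both models scored on the same prompts.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    # Resample the paired differences with replacement and record each mean.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Read the interval bounds off the sorted resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper
```

If the resulting interval excludes zero, the data suggest a consistent difference between the two models on those prompts; if it straddles zero, the observed gap may be noise.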

Free

Launch tags: Open Source • Developer Tools • Artificial Intelligence

Maker

I built Cli Modelarium because every time I wanted to compare two LLMs, I had to pick between eyeballing outputs in a chat window or spinning up an entire evaluation platform. The CLI does it from the terminal. You give it a prompt, pick your models, and get a side-by-side with cost tracking, latency, and actual statistical tests. Bootstrap confidence intervals, McNemar's test, hallucination detection, the works. 8 cloud providers, local model support, and a --max-cost flag so you don't burn through API credits by accident. 917 tests across 9 OS/Python combinations. Apache 2.0. Would love to hear what you think.
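To give a sense of what McNemar's test does with paired model results, here is a minimal sketch of the exact (binomial) form in plain Python. It is a generic illustration rather than the tool's implementation; the function name `mcnemar_exact` and the pass/fail input format are assumptions for the example.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Hypothetical helper: each list holds one boolean per prompt, saying
    whether that model answered the prompt correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same prompts")

    # Only the discordant pairs carry information about a difference.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the models never disagreed

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)
```

Prompts that both models get right, or both get wrong, do not affect the p-value. Only the prompts where exactly one model succeeds count, which is why the test suits side-by-side comparisons on a shared prompt set.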

2mo ago

Maker

Quick update for anyone following along. Cli Modelarium 0.1.4 just shipped, and the headline is two new providers.

That brings the total to 10 cloud providers, up from 8. You can put Alibaba's Qwen models (via DashScope) and Z.AI's GLM models head to head against the usual lineup (OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Groq, OpenRouter), plus your local models. If you have wanted to benchmark the open-weight models against the frontier ones on your own prompts, that is now a single command.

Also in this release:

- Refreshed all pricing to current provider rates
- Wired Qwen and GLM into the model groups (all-flagship, all-budget, all-fast, all-cheap, all-reasoning), so you can pull them in by group instead of one at a time
- Added Python 3.14 support
- Updated a few model IDs to track provider renames

Everything else works the same: side-by-side output with cost and latency, real statistical tests (bootstrap confidence intervals, McNemar's, paired significance), hallucination detection, LLM-as-judge, and the --max-cost flag so you do not burn through API credits by accident. Still one pip install, no infrastructure, Apache 2.0.
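As a rough picture of how a hard spending cap can work in general, the sketch below refuses a request whenever its estimated cost would push the running total past the limit. It does not reflect how the --max-cost flag is implemented internally; `CostTracker`, `BudgetExceeded`, and the per-million-token prices in the usage comment are hypothetical names and placeholder values.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a request would push spending past the cap."""


class CostTracker:
    """Hypothetical running-total tracker with a hard cap in US dollars."""

    def __init__(self, max_cost_usd):
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens, output_tokens,
               usd_per_million_in, usd_per_million_out):
        # Price the request from its token counts and per-million-token rates.
        cost = (input_tokens * usd_per_million_in
                + output_tokens * usd_per_million_out) / 1_000_000
        # Check before recording, so the cap is never exceeded.
        if self.spent_usd + cost > self.max_cost_usd:
            raise BudgetExceeded(
                f"request costing ${cost:.4f} would exceed the "
                f"${self.max_cost_usd:.2f} cap"
            )
        self.spent_usd += cost
        return cost


# Usage with placeholder prices ($1 in / $2 out per million tokens):
# tracker = CostTracker(max_cost_usd=0.01)
# tracker.charge(1000, 1000, 1.0, 2.0)
```

Checking before the total is updated means a run stops at the last request that fits the budget instead of overshooting it.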

Upgrade with: pip install --upgrade cli-modelarium

Would love to hear how the new providers work for your use case.

23d ago
