Maker here. Solo dev, no team. Six weeks ago I was debugging a multi-LLM consensus tool I'd built for myself. Three models gave the same answer in three different phrasings. My agreement scorer — Jaccard + Sørensen-Dice on tokens — returned 0.31. Low confidence. I almost shipped a "models disagree" warning to the user. That was the moment. Lexical similarity is the wrong primitive for LLM ensembles. Two answers that say the same thing with disjoint vocabulary look like disagreement. Two that say *opposite* things with overlapping vocab ("the patient should NOT take..." vs "the patient should take...") look like agreement. I'd been measuring the wrong thing for months. So I rebuilt the agreement layer on cosine similarity over sentence embeddings (nomic-embed-text via Ollama by default, swappable to Gemini or OpenAI). Score on the same three answers went 0.31 → 0.89. False-disagree rate on my eval set dropped ~40%. That fix is the spine of this kit. **What's in it** - 11 providers wired: OpenAI, Anthropic, Gemini, xAI Grok, Mistral, Cohere, NVIDIA, DeepSeek, Moonshot/Kimi, Zhipu/GLM, Ollama (Llama local). One adapter interface, per-call cost ledger. - Semantic consensus scorer — pairwise cosine matrix over the panel, largest connected component above threshold becomes canonical, outliers logged. - Opt-in HSP fail-closed gate (Patent Pending PCT/US26/11908). If consensus drops below threshold, the call errors instead of returning a "best guess." Default off. - EU AI Act Article 12 + 13 audit log + SHA-256 hash-chained PDF cert generator. Not a substitute for a real audit — it's the artifact your auditor will ask for. **What I learned** - Cost was the hard part, not the math. Naive 11-way fanout = $0.40+ per answer. The kit ships a tiered MoE router (cheap models vote first, escalate on low-confidence). Cut my average cost ~6x. - Streaming + ensembles is a UX trap. You can't stream a consensus answer. I stream the leading candidate and reconcile at the end. - Embedder is the trust anchor. If it dies, fail-closed. Took 3 prod incidents to learn this. **What's still bad** - Embedder choice changes which answers cluster, even when LLMs stay the same. No clean per-domain fine-tuning path yet. - No good answer for tool-call branches inside the ensemble. Currently excluded from scoring. - First-request latency is rough — every provider called in parallel, no early exit. **Feedback I actually want** 1. If you've shipped LLM ensembles in prod — what's your fail-closed policy? Hard error, fallback, or human handoff? 2. Anyone measuring consensus quality on structured outputs (JSON, function calls)? Cosine over prose embeddings doesn't translate cleanly. 3. The $497 / $997 / $2497 split is single-project / 5-project / unlimited+white-label. Right cut, or should the audit cert generator be its own SKU? Repo: https://github.com/jaquelinejaqu... Hosted: https://quorum-ai.dev Code is Apache 2.0 + HSP commercial restriction. Happy to answer anything — especially the hostile questions.

Quorum — Multi-LLM Starter Kit - Ship AI that doesn't hallucinate. 11 LLMs vote.

Replies