
Cenvero Orion
AI support agent — handle, escalate & auto-ticket
4 followers
Cenvero Orion is an AI support agent for your website. It answers questions from your knowledge base, captures leads, and handles live chat. When the AI can't resolve an issue or the user asks for a human, it escalates. No response or a frustrated user? It auto-creates a ticket, so nothing gets dropped.

Built on the Cenvero Orion Engine v8: an 8-layer architecture with hallucination suppression and multi-model routing.

orion.cenvero.com/how-it-works
Access → orion@cenvero.com





The auto-escalation to tickets is a nice touch; that's the part most AI chat tools skip. I'm building in the same space, and the biggest challenge has been keeping responses grounded in the KB without hallucinating. How are you handling that with the 8-layer architecture? Is it mostly prompt-level, or do you have filters too?
@cuygun Great question — it's not prompt-level at all, we found that relying on prompts alone for grounding is too fragile at scale.
Layer 03 in the Orion Engine v8 is what we call the Neural Sieve — raw LLM outputs never reach the user directly. Every generated token is cross-referenced against the source knowledge base and assigned a per-token confidence score. Anything below the dynamic threshold gets regenerated through a constrained process that keeps the output strictly grounded in the KB content.
So it's a post-inference filter running on every response, not a prompt instruction that the model can drift from. Combined with how we handle retrieval in Layer 02 — multi-hop reasoning through the KB rather than a simple vector lookup — the grounding stays tight even on edge case queries.
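For anyone curious what a post-inference grounding filter looks like in miniature, here's a rough sketch. Everything in it is illustrative, not Orion's actual API: the "confidence score" here is just a toy vocabulary check, where a real system would use model logits or an entailment model, and names like `score_token` and `THRESHOLD` are made up for the example.

```python
THRESHOLD = 0.5  # stand-in for the "dynamic threshold" described above


def score_token(token: str, kb_vocab: set) -> float:
    """Toy confidence score: 1.0 if the token appears in the KB, else 0.0.
    A real filter would score against logits or an entailment check."""
    return 1.0 if token.lower().strip(".,") in kb_vocab else 0.0


def filter_response(tokens: list, kb_text: str) -> list:
    """Return the tokens that fall below threshold; in a system like the
    one described above, these spans would trigger constrained
    regeneration instead of reaching the user."""
    kb_vocab = {w.lower() for w in kb_text.split()}
    return [t for t in tokens if score_token(t, kb_vocab) < THRESHOLD]


kb = "Refunds are processed within 5 business days"
answer = "Refunds are processed within 2 hours".split()
print(filter_response(answer, kb))  # flags the ungrounded "2 hours"
```

The point of the sketch is the shape, not the scoring: the model's raw output is checked token-by-token against the source material after inference, rather than trusting a prompt instruction to keep it grounded.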
In our testing we've gotten hallucination rates down to near zero; even with large contexts we rarely see drift. That said, it's a computationally expensive process, which is why we launched in beta: we want to validate it against real-world usage patterns and keep refining the Orion Engine from there.
Worth mentioning — the Orion Engine v8 is fully written in Rust. That gives us complete control over the inference pipeline and the speed to run post-inference validation at this level without it becoming a bottleneck.
We also built three response modes directly into Orion — Strict (answers only from KB, says "I don't know" if info isn't there), Balanced (KB-first but can handle general topics), and Creative (uses KB flexibly, adds commentary, chats naturally). Creative mode combined with a custom system prompt is where it gets really interesting — you can turn Orion into a full lead generation machine, not just a support agent.
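The three modes described above can be caricatured in a few lines. This is purely an illustration of the behavioral contract (the mode names come from the comment; the grounding check is a toy subset test, not how a real system decides coverage):

```python
def respond(query_terms: set, kb_terms: set, mode: str) -> str:
    """Toy illustration of Strict / Balanced / Creative response modes."""
    grounded = query_terms <= kb_terms  # toy stand-in for "KB covers the query"
    if mode == "strict":
        # answers only from the KB; admits ignorance otherwise
        return "KB answer" if grounded else "I don't know"
    if mode == "balanced":
        # KB-first, but falls back to general knowledge
        return "KB answer" if grounded else "general answer"
    # creative: uses the KB flexibly and chats naturally
    return "KB-flavored answer with commentary"


print(respond({"refund"}, {"refund", "shipping"}, "strict"))    # KB answer
print(respond({"weather"}, {"refund", "shipping"}, "strict"))   # I don't know
print(respond({"weather"}, {"refund", "shipping"}, "balanced")) # general answer
```

The interesting design question is the Strict fallback: an explicit "I don't know" is what makes KB-only mode trustworthy, since the alternative is the model improvising.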
And because Orion is fully API-first it's not locked to a website widget at all. People are already thinking about dropping it into WhatsApp, phone calls, internal tools — anywhere you can hit an API endpoint, Orion works.
Curious what approach you're taking on your end — are you doing any post-inference validation or purely retrieval-side grounding?
@shubhdeep_singh2 Nice, thanks for the detailed breakdown. We take a different approach: ours is retrieval-side grounding rather than post-inference filtering. For small knowledge bases (under ~12K tokens) we skip vector search entirely and send the full KB to the model, so nothing gets lost in retrieval. For larger KBs we use vector similarity search but always include boundary chunks (the first and last sections of each source document), because that's where contact info and key details usually live. On the prompt side we enforce strict grounding rules: the model paraphrases from KB content rather than copying verbatim, and if the info isn't there it says so. No post-inference token-level validation; we found that getting retrieval right means the model rarely needs correction. It keeps the stack simple and the latency low. How does the per-token confidence scoring affect your response times? That sounds computationally heavy, even in Rust.
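The retrieval strategy described here is easy to sketch: inject the whole KB when it fits a token budget, otherwise take the top similarity hits plus each document's boundary chunks. All names are illustrative, and the whitespace token count is a crude proxy for a real tokenizer:

```python
TOKEN_BUDGET = 12_000  # the ~12K-token cutoff mentioned in the comment


def rough_tokens(text: str) -> int:
    # crude whitespace proxy; a real tokenizer counts differently
    return len(text.split())


def select_context(docs: dict, ranked: list, budget: int = TOKEN_BUDGET) -> list:
    """docs maps source name -> ordered chunks; ranked is chunks sorted
    by vector similarity (assumed computed elsewhere)."""
    all_chunks = [c for chunks in docs.values() for c in chunks]
    if sum(rough_tokens(c) for c in all_chunks) <= budget:
        return all_chunks  # small KB: send everything, skip vector search
    # large KB: top similarity hits plus each document's first/last chunk,
    # deduplicated while preserving order
    boundary = [chunks[0] for chunks in docs.values()]
    boundary += [chunks[-1] for chunks in docs.values() if len(chunks) > 1]
    return list(dict.fromkeys(ranked[:3] + boundary))
```

The boundary-chunk trick is a cheap hedge against a known failure mode of similarity search: headers and footers (contact info, disclaimers, key dates) rarely resemble the query embedding, so they get dropped exactly when they're needed.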
@cuygun Solid approach — full KB injection under 12K tokens is clean; it avoids retrieval gaps entirely.
On latency — the per-token scoring runs in parallel with the response stream rather than blocking it, and Rust keeps the overhead tight enough that it's not user-observable in practice. Adds some compute cost on the backend but response time stays clean on the client side.
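The non-blocking arrangement described here, where validation consumes the same token stream on a worker thread so the client-facing stream is never held up, can be sketched with a simple producer-consumer setup. This is a generic illustration of the pattern, not Orion's implementation (which is Rust), and the score is again a toy:

```python
import queue
import threading


def stream_with_validation(tokens: list) -> tuple:
    """Stream tokens to the client while a worker thread scores them
    in parallel; returns (streamed tokens, confidence scores)."""
    q = queue.Queue()
    scores = []

    def validator():
        # consumes tokens until the None sentinel arrives
        while (tok := q.get()) is not None:
            scores.append(1.0 if tok.isalpha() else 0.0)  # toy score

    worker = threading.Thread(target=validator)
    worker.start()
    streamed = []
    for tok in tokens:
        streamed.append(tok)  # token goes to the client immediately...
        q.put(tok)            # ...and to the validator in parallel
    q.put(None)  # sentinel: no more tokens
    worker.join()
    return streamed, scores


out, conf = stream_with_validation(["hello", "42", "world"])
```

The trade-off is the one the comment names: the client sees no added latency, but the backend pays for the scoring pass, and any token flagged after it has already streamed needs a correction strategy (retraction, follow-up message, or buffering a short window before emit).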