marius ndini

Feature Update: Rolling Conversation Summaries — Cut Chat Costs Without Losing Context

We built a feature to solve a problem most AI apps eventually run into:

The longer the conversation, the more you keep paying to resend the entire chat history — over and over.
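
Here's the pattern in code (an illustrative sketch of a plain OpenAI SDK chat loop, before any summarization):

import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
const messages: { role: "user" | "assistant"; content: string }[] = [];

async function send(userTurn: string) {
  messages.push({ role: "user", content: userTurn });
  // The entire `messages` array goes over the wire on every call,
  // so turn N pays for all N-1 earlier turns again.
  const res = await client.chat.completions.create({ model: "gpt-4o-mini", messages });
  messages.push({ role: "assistant", content: res.choices[0].message.content ?? "" });
}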

Blog here (https://www.mnexium.com/blogs/chat-summarization)

Docs here (https://www.mnexium.com/docs#summarize)

That “token tax” adds up fast.
In the blog, we walked through a realistic scenario:

  • 40 messages per conversation

  • ~200 tokens per message

  • 1,000 daily active users

  • 3 conversations per user, per day

Without summarization, the same history gets re-sent repeatedly — totaling:

➡️ 492M tokens per day
➡️ ~14.7B tokens per month
➡️ ≈ $36,900/month (at $2.50 / 1M tokens)

We shipped Conversation Summaries.

Older segments of a chat get automatically compressed into concise summaries, while the most recent turns stay fully intact — preserving accuracy, tone, and state.
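
Conceptually, the context sent with each request ends up shaped like this (an illustrative sketch; buildContext and keepRecent are names we made up here, not the actual internals):

type Msg = { role: "system" | "user" | "assistant"; content: string };

// Older turns collapse into one compact summary message; the newest turns stay verbatim
function buildContext(history: Msg[], summary: string, keepRecent = 6): Msg[] {
  if (history.length <= keepRecent) return history;
  return [
    { role: "system", content: `Conversation so far: ${summary}` },
    ...history.slice(-keepRecent),
  ];
}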

With summarization turned on, that same scenario drops to:

➡️ 36M tokens per day
➡️ ≈ $2,700/month
➡️ ~93% cost reduction
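
The arithmetic is easy to check. One modeling assumption here: the summarized case treats aggressive mode as sending only a ~100-token rolling summary plus the new ~200-token message per request, which reproduces the figures above:

const msgs = 40, tokensPerMsg = 200, convosPerDay = 1_000 * 3;

// Baseline: request i re-sends all i messages so far
let perConvo = 0;
for (let i = 1; i <= msgs; i++) perConvo += i * tokensPerMsg; // 164,000 tokens

const baselineDaily = perConvo * convosPerDay;            // 492,000,000
const baselineMonthly = (baselineDaily * 30) / 1e6 * 2.5; // $36,900

// Summarized: ~100-token summary + one new message per request (assumed)
const summarizedDaily = (100 + tokensPerMsg) * msgs * convosPerDay; // 36,000,000
const summarizedMonthly = (summarizedDaily * 30) / 1e6 * 2.5;       // $2,700

console.log(1 - summarizedDaily / baselineDaily); // ≈ 0.93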

It all runs in the background, meaning:

✔️ conversations stay long
✔️ latency doesn’t change
✔️ users don’t see “summary artifacts”
✔️ you stop paying for repeated context

We also included configurable modes (sketched below):

  • Light — minimal compression

  • Balanced — smart middle ground

  • Aggressive — maximize savings

  • Custom — tune thresholds & token limits yourself
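
Here's roughly what picking a mode might look like. The parameter names (mode, after_messages, max_summary_tokens) are illustrative only; check the docs link above for the real options:

// Hypothetical config shape; see https://www.mnexium.com/docs#summarize for actual parameters
const summarize = {
  mode: "custom",          // "light" | "balanced" | "aggressive" | "custom"
  after_messages: 12,      // start summarizing once history passes this many turns
  max_summary_tokens: 150, // cap the size of the rolling summary
};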

Summaries keep long conversations affordable, while memories preserve important facts across sessions (preferences, goals, profile info). You get continuity inside the chat and persistence beyond it.

If you're building AI chat and want it to scale, the post breaks down how summarization works and when to use each mode.

Replies

Malani Willa

Curious if this integrates smoothly with existing AI pipelines, or if some retraining/adjustment is needed for the summarization modes.

marius ndini

@malani_willa The entire premise behind @Mnexium AI was to integrate as smoothly as possible into existing AI infra and code. For example, in our get-started guide we walk through cloning ChatGPT using Mnexium.

If we look at the example below, you can keep using your existing OpenAI/Anthropic libraries (here we use the OpenAI SDK with ChatGPT).

Purposefully, very little changes in how you use the existing OpenAI library; you just extend your calls with a few additional parameters to use Mnexium.

import OpenAI from "openai";

// MNX_KEY and OPENAI_KEY are your Mnexium and OpenAI API keys
const client = new OpenAI({
  apiKey: MNX_KEY,
  baseURL: "https://www.mnexium.com/api/v1",
  defaultHeaders: { "x-openai-key": OPENAI_KEY },
});

// Make a request - works exactly like the OpenAI SDK
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What is the capital of France?" }],
  
  // Mnexium-specific options via extra body
  mnx: {
    subject_id: SUBJECT_ID,
    chat_id: CHAT_ID,
    log: true, // History flag
    learn: true, // Memorize flag
    // ...other flags
  },
});

We think what you gain by using Mnexium is substantial for very little effort. For example, with the above call you automatically get history, memory, auditing, agent state, and so on: all the features discussed in our docs. If you were to implement all of this yourself, you'd need databases, orchestrators, and other tools and infra.

Hope the above helps -