Théophile Louvart

Tokenwise - A smart LLM proxy that shows where you're overpaying

Tokenwise is a one-line LLM proxy (OpenAI-compatible baseURL) for makers and small teams. It learns from your real requests and shows exactly where you're overpaying, proven with quality checks on your own traffic rather than public benchmarks. It then lets you apply the fix in one click while it verifies the savings in real dollars.

Théophile Louvart
Hey everyone, Theo here. I build a few small SaaS on the side of a full-time data engineering job, and at some point every one of them started leaning on LLMs. My API bills crept up every month and honestly I could never tell you why. Which feature, which prompt I'd changed last week, which model I picked without really thinking about it. I'd just top up credits and move on.

The part that really got to me was the spend I couldn't even see. Claude Code running all day while I work, plus Cursor and Codex. None of that shows up anywhere until the invoice lands, and it turned out to be the money I understood the least.

I tried the tools that already existed. One felt like it was in maintenance mode, one needed a whole observability setup just to get started, and one only worked if your stack was built around a specific framework. None of them were made for someone like me who just wanted to know where the money went and what to do about it.

So I built Tokenwise. You add one line of code, or point your coding agents at it with no production changes, and you see every call: cost, latency, tokens, and what's being wasted. Then it tells you what to cut. A cheaper model here, a cache there, a bloated prompt to trim. Every fix gets checked against your own quality bar first, so you're never trading cost for worse output.

The idea shifted a lot while I was building it. I started out thinking it was a dashboard. Then I realised nobody wants another dashboard, they want the answer: here's the $842 a month you're burning, and here's the one click to fix it. The real value was proving the savings on your own traffic, live.

It's early and I'd genuinely love your honest feedback. Tell me what's missing, what's confusing, what you'd never use. That's more useful to me right now than anything. Thanks for taking a look.
Zolani Matebese

@tofil congrats on the launch Theo, this is very useful (I can never match the advertised input/output costs to my work either). What's the overhead for this, and how deep does it go reporting-wise?

Théophile Louvart

Thanks @zolani_matebese, really appreciate it.

On overhead: the proxy runs on Cloudflare Workers at the edge, so we add ~30-50ms p50 in most regions (the actual provider call dominates latency anyway).

On reporting depth, here's what you get per request:

  • Exact cost (we re-tokenize and apply current pricing tables, so the "I can never match the bill" problem you mentioned goes away)

  • Input/output token counts, latency (TTFT + total), status, error type if any

  • Full prompt/response payload if you opt-in per project (off by default for privacy)

  • Model + provider + project + custom tags you set

And on top of that, aggregations by prompt template (we cluster semantically), recommendations with quality proof on your own data, and a "saved this month" counter that tracks the impact of applied recos in real $.
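To make the "exact cost" bullet concrete, here is a minimal sketch of how a per-request cost can be derived from token counts and a pricing table. It is not Tokenwise's code: the model names, the prices, and the `request_cost` helper are all hypothetical placeholders, and real prices come from each provider's current pricing page.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens (placeholders, not real provider prices).
PRICING = {
    "big-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
    "small-model": {"input": Decimal("0.25"), "output": Decimal("1.25")},
}

TOKENS_PER_PRICE_UNIT = Decimal(1_000_000)


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of one request from its token counts."""
    price = PRICING[model]
    total = price["input"] * input_tokens + price["output"] * output_tokens
    return total / TOKENS_PER_PRICE_UNIT


if __name__ == "__main__":
    # 1,000 input tokens and 500 output tokens on the hypothetical big model.
    print(request_cost("big-model", 1_000, 500))
```

`Decimal` is used instead of floats so that many small per-request amounts add up without rounding drift.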

Anna Ludwinowski

@tofil Congratulations on the build! I imagine the more one builds, optimizing those token$ will become more important. Do you see this product as something primarily for heavy users/developers or do you see it benefiting those of us just getting started? I'm building my first app - would I likely see a savings?

Théophile Louvart

@anna_ludwinowski Honest answer: the dollar savings scale with how much you spend. At $1k/mo, cutting 30% is real money, so heavy users feel it most. But I think it's genuinely more useful when you're just starting out, for a different reason.

When you're building your first app, the problem usually isn't "I'm overpaying by 30%." It's the stuff you can't see yet. A retry loop firing 5 times, reaching for the biggest model when a cheaper one would've been fine, no caching on prompts you send over and over. You don't notice any of it until the bill shows up. Tokenwise plugs in with one line and just shows you where the money goes from your first request, so you catch that stuff early instead of learning the hard way.

It's free to start, no card needed, so there's no real downside to plugging your first app in and seeing what happens. Worst case you learn where your tokens actually go. Would love to hear what you're building!

Anna Ludwinowski

@tofil That's very helpful, especially for us just starting out. Does it advise/warn you of any odd activity? I got robbed by a bot today so warnings would be helpful! Luckily I just happened to notice today but didn't notice the one from 2 days ago - ugh!

Bengeekly

Does Tokenwise break down coding agent spend by session or only by model?

Théophile Louvart

Hey @bengeekly not by session yet, but the workaround works well today.

You can pass a tag (or session ID) on each request via the X-Tokenwise-Tags header, and Tokenwise clusters all requests sharing that tag, so for a coding agent you'd set X-Tokenwise-Tags: session-{conversationId} and see the full cost breakdown per session: total spend, which model, which prompt template (we cluster those semantically too), outliers, etc.
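A minimal sketch of the tagging workaround described above. Only the `X-Tokenwise-Tags` header name and the `session-{conversationId}` value format come from the reply; the helper name and the empty-ID check are assumptions added for illustration.

```python
def session_tag_headers(conversation_id: str) -> dict[str, str]:
    """Build the extra HTTP header that groups one agent session's requests."""
    conversation_id = conversation_id.strip()
    if not conversation_id:
        # Assumption: an empty session ID would lump unrelated requests together.
        raise ValueError("conversation_id must not be empty")
    return {"X-Tokenwise-Tags": f"session-{conversation_id}"}


if __name__ == "__main__":
    print(session_tag_headers("abc123"))
```

With the OpenAI Python SDK, a dict like this can be passed per request through the `extra_headers` argument, so every call in the same conversation carries the same tag.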

First-class sessions view (auto-grouped, no header needed) is on the short roadmap.

Felix Li

Observe-only is probably where I’d start, especially for Claude Code spend. The scary part is the “apply” step.

Before swapping a model, does Tokenwise show exactly which traffic it will touch, and is there an easy rollback?

Théophile Louvart

@novamaker01 

Here's how it works:

Before apply, you see exactly:

  • The prompt template(s) affected (with a sample of recent requests)

  • The estimated traffic % (e.g. "this rule will route ~12% of your project's requests")

  • Optional scoping: limit to a tag, a project, exclude certain endpoints

The "apply" doesn't blindly cut over. By default it runs as an A/B split: say 10% of matching traffic goes to the new model, and you watch the quality scores, latency, and cost for 24h before deciding to ramp to 100%. You can also choose immediate cutover if you prefer.
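One common way to implement a percentage split like this is deterministic hash bucketing, sketched below. This is an assumption about the general technique, not a description of how Tokenwise assigns traffic; the function name and the use of a request ID as the bucketing key are hypothetical.

```python
import hashlib


def routes_to_candidate(request_id: str, percent: int = 10) -> bool:
    """Decide whether a matching request goes to the candidate (cheaper) model.

    The same request_id always lands in the same bucket, so the split is
    stable across retries and can be ramped by raising `percent`.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    share = sum(routes_to_candidate(i) for i in ids) / len(ids)
    print(f"share routed to candidate: {share:.1%}")
```

Hashing instead of calling a random number generator keeps the decision reproducible, which makes it easier to compare the two arms afterwards.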

xiaosong

The "quality check on your own traffic" angle is genuinely the right frame. I've tried a few LLM cost tools and most just show you aggregate spend with generic benchmark comparisons — which is basically useless when your prompts are domain-specific.

One thing I'd love to see: for coding agents specifically, the spend is often bursty and session-based. A 2-hour Claude Code session can easily hit $5-10, but you don't know *which* part of the session burned the tokens. Was it the initial codebase ingestion? A long loop of test-fix retries? Breaking that down by sub-task within a session would be way more actionable than just "session-xyz cost $8.40."

Natalia Iankovych

Are you talking about query optimization/compression? Something like what Google recently did with an algorithm that compresses prompts by 7x without losing quality?

Théophile Louvart

@natalia_iankovych A bit of both, but not only compression. There's a compress option you can switch on per route, and it runs a quality check first so it doesn't quietly wreck your outputs. The bigger wins usually come from routing the cheap calls to a smaller model and caching repeats though. Nothing as aggressive as that Google 7x thing, I keep it safe and measurable so the savings are actually trustworthy.

Fabrizio Pfannl

The "quality check on your own traffic, not public benchmarks" is the right frame, that's exactly the gap most LLM-cost tools wave at. Question for Théophile: when you replay a request on the cheaper model to verify quality, how do you score "same answer" without a human in the loop? Embedding similarity tends to be permissive and exact-match too strict.

Théophile Louvart

@fabriziowexare Yeah, you put your finger on exactly why we don't use either. Embedding similarity waves through answers that are subtly wrong, and exact match fails anything reworded. So we run an LLM judge against a rubric that's scoped to that one prompt template, and you tell it what "good" means for your case (correctness, format, whatever actually matters to you). It returns per-criterion scores, not one fuzzy number. We score the cheaper model on your own recent traffic, only switch if it clears your bar, then keep judging live so it rolls back automatically if quality slips. A judge isn't magic either, but scoring against your criteria on your traffic beats a public benchmark by a mile.
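The gating step described above ("only switch if it clears your bar") can be sketched as follows. The LLM judge itself is not shown; this only covers what happens once per-criterion scores exist. The 0-to-1 score scale, the criterion names, the `min_pass_rate` parameter, and both function names are assumptions for illustration, not Tokenwise's actual rubric format.

```python
def clears_bar(scores: dict[str, float], bar: dict[str, float]) -> bool:
    """True if every criterion in the bar meets its minimum score.

    A criterion missing from `scores` counts as 0.0, so it fails.
    """
    return all(scores.get(name, 0.0) >= minimum for name, minimum in bar.items())


def should_switch(
    samples: list[dict[str, float]],
    bar: dict[str, float],
    min_pass_rate: float = 0.95,
) -> bool:
    """True if enough judged samples from the cheaper model clear the bar."""
    if not samples:
        return False  # no evidence, no switch
    passed = sum(clears_bar(sample, bar) for sample in samples)
    return passed / len(samples) >= min_pass_rate


if __name__ == "__main__":
    bar = {"correctness": 0.9, "format": 0.8}
    judged = [{"correctness": 0.95, "format": 0.9}] * 20
    print(should_switch(judged, bar))
```

Keeping the scores per criterion means a model that formats well but gets facts wrong fails on `correctness` instead of averaging its way past a single blended number.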

Justin Winter
Site is down?
Théophile Louvart

@justin_winter Just checked and it's loading fine on my end, and fast too. Might've been a quick blip, mind trying again or a hard refresh? If it's still down for you let me know what you're seeing and roughly where you are and I'll dig into it.

Will Smith

Great idea, I've had the same problem and spent many hours debugging transcripts to look for token savings.

Théophile Louvart

@willsmithte Ha yeah, that's literally why this exists. Digging through transcripts by hand to find the one expensive prompt is the worst. Tokenwise just shows you cost per prompt so you skip the manual hunting. Appreciate it.

Jonathan Vital

This looks awesome. I use a load balancer, but I can probably use that OpenAI key and output to Tokenwise. Would you ever be interested in developing a built-in load balancer for multi-account setups? I imagine there's far more savings to be had if you're also load balancing time-based limits, but I can also see where it might be out of scope for this project and better chained.

Théophile Louvart

@nohj Yeah I love this. You can already chain it today, just point your load balancer's output at Tokenwise (or put Tokenwise in front) and it still tracks cost and savings on top. A built-in multi-account balancer that rotates keys to dodge per-account rate limits is something I keep coming back to, and you're right that the time-based limits are where a lot of the extra savings hide. Would love to see how your setup looks, it'd actually help me decide whether to build it in or keep it chained.

Vamshi Reddy

We've had seven AI agents running in production since last year, and token costs were a complete black box until the invoice arrived. We built a basic per-model logger ourselves -- took more engineering time than it should have. The edge case I'd push on: can you attribute spend to a specific workflow or user journey, not just a model? When you're debugging why a particular sequence of agent calls got expensive, model-level rollups aren't granular enough. That's where the real cost surprises live.

Théophile Louvart

@thekrew With you completely, model rollups are where the surprises hide, not where they get explained. We attribute three ways: by prompt template (grouped automatically on the system prompt, so each distinct call type is its own line), by tags you put on a workflow or journey, and per API key. So you can tag a whole agent sequence, watch what it costs end to end, then drill into which step blew up.
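A minimal sketch of the "tag a sequence, then drill into the step" idea, assuming request logs are available as plain records. The record fields (`workflow`, `step`, `cost_usd`) and the `spend_by` helper are hypothetical and do not reflect Tokenwise's export format.

```python
from collections import defaultdict


def spend_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum cost_usd across records, grouped by the given field."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical log of one tagged agent workflow.
    records = [
        {"workflow": "onboarding", "step": "plan", "cost_usd": 0.25},
        {"workflow": "onboarding", "step": "retrieve", "cost_usd": 0.5},
        {"workflow": "onboarding", "step": "retrieve", "cost_usd": 1.5},
        {"workflow": "billing", "step": "plan", "cost_usd": 0.25},
    ]
    print(spend_by(records, "workflow"))  # end-to-end cost per workflow
    onboarding = [r for r in records if r["workflow"] == "onboarding"]
    print(spend_by(onboarding, "step"))   # which step blew up
```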
