Respan Gateway - One AI gateway with built-in observability and evals

Respan AI Gateway connects your app to 1,000+ AI models through one endpoint. But routing is the easy part. Respan keeps production AI reliable and under control with fallbacks, retries, caching, spend limits, alerts, and full traces for every call. Gateway, observability, evals, prompt management, monitors, and cost controls all run on one platform, so you do not need to stitch together five tools to debug production.

Hi Product Hunt,

We built Respan AI Gateway because routing to more models is only the first step.

Once your AI product is in production, the harder questions show up fast:

What happens when a provider fails?

Which customer is driving cost?

Which model version caused the latency spike?

Did the fallback work?

How do we trace, evaluate, and control everything without stitching together five tools?

Respan Gateway gives teams one OpenAI- and Anthropic-compatible endpoint for 1,000+ models, with fallbacks, retries, caching, spend limits, alerts, traces, evals, prompt management, and monitors on the same platform.

The goal is simple: make production AI easier to ship, debug, and control.

Would love your feedback, questions, and support today!

 Congrats on the launch! For teams that already have a multi-provider setup, what’s the simplest, lowest-risk way to try Respan Gateway in production? And what safeguards do you provide to ensure no customer-facing downtime or cost surprises during the transition?

 Thanks Swati!

For teams with an existing multi-provider setup, the lowest-risk path is to start with one provider or one route first, then gradually expand traffic once everything looks good.

For customer-facing downtime, Respan Gateway can automatically fall back to another live provider when the primary provider fails, so requests do not just break at the first failure.
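
To make the fallback idea concrete, here is a minimal sketch of trying providers in order. It is an illustration only, not Respan's implementation: `call_with_fallback`, `AllProvidersFailed`, and the provider callables are hypothetical names.

```python
from typing import Callable, List, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def call_with_fallback(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order and return (name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

The returned provider name lets a trace record which provider actually served the request, which is how you would later answer "did the fallback work?".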

For cost surprises, teams can set per-API-key spend limits, caps, or blocking rules. Cost and usage changes can also be reported through Slack or email alerts, so teams know quickly before spend gets out of control.
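
A per-key spend limit can be pictured roughly as below. This is a simplified sketch under assumed names (`SpendLimiter`, `allow`, `record`) and an in-memory counter; it does not describe how the gateway stores or enforces limits.

```python
from typing import Dict


class SpendLimiter:
    """Tracks spend per API key and blocks requests that would exceed a cap."""

    def __init__(self, limits_usd: Dict[str, float]) -> None:
        self.limits_usd = dict(limits_usd)
        self.spent_usd: Dict[str, float] = {}

    def allow(self, api_key: str, estimated_cost_usd: float) -> bool:
        """Return True if the request fits under the key's limit."""
        limit = self.limits_usd.get(api_key)
        if limit is None:
            return True  # no limit configured for this key
        return self.spent_usd.get(api_key, 0.0) + estimated_cost_usd <= limit

    def record(self, api_key: str, cost_usd: float) -> None:
        """Add the actual cost of a completed request to the key's total."""
        self.spent_usd[api_key] = self.spent_usd.get(api_key, 0.0) + cost_usd
```

An alert would hook in at `record`, for example when a key crosses some fraction of its limit.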

 the evals piece is what makes this interesting to me. most teams have routing and fallbacks figured out, but almost nobody has a real answer for "how do I know this model is actually performing well in production" beyond eyeballing logs. curious how you handle eval drift over time as user inputs shift.

 Absolutely, that’s exactly why we built the evals feature. It’s easy to set up routing and fallbacks, but knowing how a model is actually performing in production is a different story. With Keywords AI, we continuously monitor outputs against intent and quality benchmarks so you can catch drift early. The system adapts as inputs shift, helping teams spot issues before they show up in logs. It’s still early, but so far it’s been a huge help for keeping performance consistent.

"2 lines of code, complete DevOps platform" always sounds a bit too good 😄
What breaks first when you try to use it on a real production agent with tool calls and long traces?

 Totally fair. The 2 lines are for getting traffic into the gateway and traces showing up, not pretending production agents are easy.

In real agents, the first things that break are rate limits, long tool-call chains, cost spikes, and not knowing which model/tool/prompt version caused the issue.

This is what many dev teams are missing. I’ve seen so many projects stall because they couldn’t effectively trace which model version caused a latency spike.

How does Respan handle 'evals' for non-deterministic outputs? Is it easy to set up automated regression tests for prompt changes?

 Hi,

For non-deterministic outputs, we don’t rely only on exact-match evals. Teams can evaluate outputs with a mix of LLM judges, rubric-based scoring, semantic checks, structured/schema checks, and custom pass/fail criteria depending on the task.
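
As a small illustration of two of these check types, the sketch below combines a structured/schema check with a custom pass/fail rule. The function names and the example criteria are hypothetical; LLM judges and semantic checks are left out because they need a model call.

```python
import json
from typing import Callable, Dict, Set


def schema_check(output: str, required_keys: Set[str]) -> bool:
    """Pass if the output is a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def evaluate(output: str, checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Run every named check against one output and collect the results."""
    return {name: check(output) for name, check in checks.items()}


checks = {
    "has_fields": lambda out: schema_check(out, {"answer", "sources"}),
    "not_too_long": lambda out: len(out) <= 500,  # custom pass/fail rule
}
```

Because none of these checks compare against an exact expected string, two differently worded but valid outputs can both pass.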

For prompt changes, yes, the goal is to make regression testing easy. You can keep a dataset of representative inputs, run a new prompt or model version against the same cases, compare scores against the previous version, and catch quality, latency, or cost regressions before rolling it out!
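
The comparison step can be sketched like this, assuming each version has already been scored per test case (higher is better). `find_regressions` and the `tolerance` parameter are hypothetical names for illustration, not part of the product's API.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return IDs of cases where the candidate scores worse than the baseline.

    A case missing from the candidate run also counts as a regression.
    """
    regressions = []
    for case_id, base_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None or new_score < base_score - tolerance:
            regressions.append(case_id)
    return sorted(regressions)
```

The same shape works for latency or cost if you flip the comparison so that lower is better.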

🔥🔥

🎉🎉🎉

The "2 lines of code" promise immediately caught my attention. Anything that helps teams focus on shipping AI experiences instead of rebuilding infrastructure deserves a closer look. Well done!

 Thank you! That’s exactly the goal.

Teams should be spending their time building better AI experiences, not wiring together gateway logic, traces, evals, monitors, and cost controls from scratch.

Honestly, AI reliability is still a huge challenge. Glad to see tools tackling this problem.

 Totally agree. AI reliability is still one of the hardest parts of putting these products into production.

the DevOps platform framing alongside 2 lines of code is an interesting positioning tension. DevOps platforms usually require significant setup and ongoing configuration to be useful. 2 lines of code implies you get value immediately. curious which one is more accurate for a new user and at what point the simple integration becomes a platform with enough configuration to actually catch the problems that matter in production

 Totally get what you’re pointing out. The 2 lines of code are meant to get new users immediate value without the heavy setup, but once you start feeding live traffic and want deeper insights, Keywords AI naturally scales into more of a platform. That’s when configuration—like custom routing, spend limits, and eval tracking—starts to matter, helping catch the production issues that simple integration alone can’t surface.

The underrated part here is having traces, evals, fallbacks, and cost controls in one place. Production AI gets messy fast, so fewer moving parts is a real win.

 Really appreciate that!

Connecting to models is rarely the hard part anymore. Figuring out why something failed three days later is usually where the pain starts. Interesting to see more tools focusing on that side.

 Totally!

On Respan, we log every LLM call, including the model, provider, latency, cost, prompt/version, errors, and traces, so teams can go back and actually analyze the cause instead of guessing from scattered logs.
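
To show why per-call fields help with after-the-fact debugging, here is a rough sketch of a log record and one query over it. The `CallLog` schema and field names are assumptions made for illustration; they are not Respan's actual log format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class CallLog:
    trace_id: str
    model: str
    provider: str
    latency_ms: float
    cost_usd: float
    prompt_version: str
    error: Optional[str] = None


def slowest_by_model(logs: Iterable[CallLog]) -> Dict[str, CallLog]:
    """Return the highest-latency call for each model."""
    worst: Dict[str, CallLog] = {}
    for log in logs:
        current = worst.get(log.model)
        if current is None or log.latency_ms > current.latency_ms:
            worst[log.model] = log
    return worst
```

With the prompt version and trace ID on the same record, the slowest call for a model points straight to the trace worth opening.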
