Sina Tayebati

PandaProbe - open-source agent engineering platform

PandaProbe is an open-source agent engineering platform that gives you deep observability into AI agent applications. Use it to trace, evaluate, monitor and debug your AI agents in development and production.

Sina Tayebati

👋 Hey Product Hunt!

I’m Sina, founder of PandaProbe.

Building AI agents is getting easier, but understanding and trusting them in production is still hard.

Once agents start calling LLMs, tools, APIs, MCPs, and sub-agents, logs aren’t enough anymore. You need to see what happened, why it failed, whether quality regressed, and how reliable the system is across full sessions.

PandaProbe is my attempt to solve this: an open-source agent engineering platform for tracing, evaluation, monitoring, and debugging AI agent applications.

The goal is simple: help developers move from “it works on my laptop” to “I understand production behavior, can measure quality, and continuously improve it.”

What PandaProbe provides

🔎 Trace — capture full agent executions as sessions, traces, and spans across LLMs, tools, agents, and custom logic.
📊 Evaluate — score traces and sessions using mission-critical, agent-specific metrics.
⏱️ Monitor — schedule recurring evaluations to automatically validate new traces and sessions in production.
📈 Analytics — track performance, cost, latency, errors, and quality trends over time.
🛠️ Open source + cloud — use the open-source core on GitHub or run PandaProbe in the cloud.
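
To make the tracing model concrete, here’s a minimal sketch of the session → span idea. The client and method names below are illustrative stand-ins, not our exact SDK surface (that’s in the docs):

```python
# Illustrative sketch of the session -> span model, not the real SDK.
# A session groups one full interaction; each LLM/tool/agent step is a span.
from contextlib import contextmanager
import time
import uuid

class ToyTracer:
    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, kind, session_id, **attrs):
        start = time.time()
        try:
            yield
        finally:
            self.spans.append({
                "session": session_id,
                "name": name,
                "kind": kind,  # e.g. "llm", "tool", "agent"
                "duration_s": round(time.time() - start, 3),
                **attrs,
            })

tracer = ToyTracer()
session_id = uuid.uuid4().hex  # one session per user interaction

with tracer.span("plan", kind="llm", session_id=session_id, model="gpt-4o"):
    pass  # your LLM call goes here
with tracer.span("web_search", kind="tool", session_id=session_id):
    pass  # your tool call goes here

print(tracer.spans)
```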

Who it’s for

🧑‍💻 AI engineers — debug agent behavior across LLMs, tools, and workflows.
🏗️ Platform teams — monitor quality, regressions, and reliability in production.
🔬 Builders experimenting with agents — understand failures and iterate faster.
🚀 Startups — add observability and evaluation before things become unmanageable.

Quick links

GitHub: https://github.com/chirpz-ai/pandaprobe

Docs: https://docs.pandaprobe.com

Cloud: https://www.pandaprobe.com/

I’ll be here all day answering questions and collecting feedback.

If you’re building agents today, what’s the hardest part to debug or evaluate?

Thanks for checking it out 🙏
— Sina

Saul Fleischman

@sina_tayebati This addresses a real pain point—once agents get complex with multiple tool calls and sub-agents, traditional logging becomes almost useless for understanding what actually happened. The focus on evaluation and continuous monitoring in production sounds like exactly what teams need to move beyond one-off testing.

Sina Tayebati
@osakasaul Really appreciate that — and that’s exactly the problem we kept running into. Once agents become multi-step and start using tools and sub-agents, logs stop being enough to understand behavior. You need visibility across the full execution and a way to evaluate it over time. That’s why PandaProbe focuses on combining tracing with continuous evaluation and monitoring in production — so teams can move from one-off debugging to actually understanding and improving agent behavior over time.
Takahito Yoneda

Handling state and debugging for long-running autonomous agents is usually a nightmare, so having an open-source platform to standardize that workflow is huge. I can definitely see myself using PandaProbe to self-host my agent evaluation pipeline to keep sensitive client data entirely local. I am really curious to hear if you currently support custom tracing for raw API calls instead of just the standard frameworks.

Sina Tayebati

@y_taka Really appreciate that — and yes, democratizing observability and enabling teams to keep sensitive workflows fully self-hosted were big motivations behind making PandaProbe open source from day one.

And absolutely: we support custom tracing beyond standard frameworks. Alongside native integrations, PandaProbe also provides manual instrumentation APIs and decorators, so you can trace raw API calls, internal services, custom orchestration layers, or essentially any part of your agent workflow you want visibility into.

A lot of teams end up with hybrid architectures, so supporting low-level custom instrumentation was important for us early on.
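
For a flavor of what that looks like in practice, here’s a rough sketch of decorator-based instrumentation around a raw HTTP call. The decorator and exporter names are illustrative, not our exact API:

```python
# Sketch: tracing a raw API call with a decorator, no framework required.
# "trace_span" and "emit_span" are illustrative stand-ins for the SDK.
import functools
import time
import requests

def emit_span(span: dict):
    print("span:", span)  # the real SDK would export this to the collector

def trace_span(name: str, kind: str = "tool"):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                emit_span({
                    "name": name,
                    "kind": kind,
                    "status": status,
                    "duration_s": round(time.time() - start, 3),
                })
        return inner
    return wrap

@trace_span("weather_api")
def get_weather(city: str) -> dict:
    resp = requests.get(
        "https://api.example.com/weather",  # hypothetical endpoint
        params={"q": city},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```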

Vincent F

Evaluation is the hardest part of this whole space and most platforms hand-wave it. The failure mode that actually bites in production isn't crashes or schema errors. It's slow drift in subjective quality (voice, classification accuracy, output style) that only shows up when a human reads 50 outputs in a row. How does PandaProbe handle that in practice? LLM-as-judge with custom rubrics, human-in-loop on a held-out set, embedding-distance from a golden corpus, or something else? And how do you stop eval cost from outpacing inference cost when you're re-judging every trace?

Sina Tayebati

@vincentf This is actually one of the core motivations behind PandaProbe.

A lot of evaluation systems today focus on isolated outputs, but in production we kept seeing failures emerge as trajectory-level drift: looping, degraded tool grounding, coordination breakdowns, subtle quality regression, output-style drift, etc. The model can still sound confident locally while the overall session quality quietly collapses.

That led directly to a research paper I recently published called TRACER (Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning). The core idea is evaluating uncertainty and failure at the trajectory/session level rather than the individual response level.

That research became a major foundation for PandaProbe’s evaluation system and heavily shaped how we think about observability and longitudinal agent evaluation.

On the cost side, we’re also very conscious of evals becoming more expensive than inference. PandaProbe supports async and sampled evaluations, composable metrics, and lightweight structural trajectory signals, so teams don’t have to run expensive judge models on every trace.
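
As a rough illustration of the sampling idea (thresholds and field names here are made up), cheap structural checks run on every trace, and the expensive judge only runs on a sample or on suspicious trajectories:

```python
# Sketch: sampled evaluation. Cheap structural signals on every trace,
# expensive LLM-as-judge only on a sample or on flagged trajectories.
# Thresholds and field names are illustrative.
import random

JUDGE_SAMPLE_RATE = 0.05  # judge roughly 5% of healthy traces

def structural_flags(trace: dict) -> list[str]:
    """Cheap trajectory-level signals that need no model call."""
    flags = []
    if trace["num_steps"] > 30:
        flags.append("possible_loop")
    if trace["tool_error_rate"] > 0.2:
        flags.append("degraded_tool_grounding")
    return flags

def should_run_judge(trace: dict) -> bool:
    # Always judge suspicious traces; sample the rest.
    return bool(structural_flags(trace)) or random.random() < JUDGE_SAMPLE_RATE

trace = {"num_steps": 12, "tool_error_rate": 0.0}
if should_run_judge(trace):
    pass  # enqueue an async LLM-as-judge evaluation here
```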

Vincent F

@sina_tayebati Thanks Sina, that's a sharper framing than most of what's out there. Got a link to the TRACER paper?

Sina Tayebati
@vincentf hey Vincent, yes here’s the link to the paper: https://arxiv.org/abs/2602.11409
Mykyta Semenov 🇺🇦🇳🇱

We use LangGraph for these purposes. How is PandaProbe better and why should we switch to it?

Sina Tayebati
@mykyta_semenov_ That’s a great point — and to clarify, PandaProbe isn’t a replacement for LangGraph. LangGraph is for building and orchestrating agents. PandaProbe is for observing and evaluating them once they run, especially in production.

If you’re already using LangGraph, PandaProbe actually plugs in on top of it. It captures traces from your workflows and then helps you:

- evaluate behavior across full trajectories (not just outputs)
- monitor quality and detect regressions over time
- debug failures and understand what went wrong

So it’s less “switch from LangGraph” and more “add a layer to understand and improve what you’ve built with it.”
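
To make “plugs in on top” concrete: observability layers typically hook into LangGraph through LangChain’s callback interface. Rough sketch below; the handler class is hypothetical, though the callback hooks are standard langchain_core API:

```python
# Sketch: forwarding LangGraph execution events to an observability backend
# via LangChain's callback interface. PandaProbeHandler is hypothetical;
# the hook signatures are standard langchain_core API.
from langchain_core.callbacks import BaseCallbackHandler

class PandaProbeHandler(BaseCallbackHandler):
    def on_llm_start(self, serialized, prompts, **kwargs):
        print("llm start:", prompts[0][:80])  # real handler would export a span

    def on_tool_start(self, serialized, input_str, **kwargs):
        print("tool start:", serialized.get("name"), input_str[:80])

    def on_tool_end(self, output, **kwargs):
        print("tool end")

# graph = your existing compiled LangGraph workflow
# result = graph.invoke(
#     {"messages": [("user", "book me a flight")]},
#     config={"callbacks": [PandaProbeHandler()]},
# )
```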
Abdul Rehman

Congratulations on the launch @sina_tayebati
BTW, how well does PandaProbe handle tracking regressions across different agent versions over time?

Sina Tayebati
@abod_rehman Thank you! Great question — this is something we designed PandaProbe to handle explicitly.

You can tag traces with things like agent version, prompt version, model, or environment (prod, staging, etc.). From there, monitoring and analytics let you compare behavior across versions side by side. That makes it much easier to spot regressions or subtle drifts when you ship updates — whether it’s quality degradation, tool misuse, latency changes, or cost shifts.

In practice, a lot of teams use this to validate new versions against production traffic before fully rolling them out.
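
As a small illustration (tag names are just examples), tagging happens at instrumentation time, and a regression check is then a grouped comparison over those tags:

```python
# Sketch: attach deployment metadata to every trace so analytics can
# slice by version/environment later. Tag names are illustrative.
DEPLOY_TAGS = {
    "agent_version": "1.4.2",
    "prompt_version": "checkout-v7",
    "model": "gpt-4o-mini",
    "environment": "staging",
}

def record_trace(trace: dict, tags: dict = DEPLOY_TAGS) -> dict:
    """Merge deployment tags into the trace before exporting it."""
    return {**trace, "tags": dict(tags)}

# A regression check then becomes a grouped comparison, conceptually:
#   avg(quality_score) GROUP BY tags.agent_version
#   p95(latency_ms)    GROUP BY tags.agent_version, tags.environment
```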
Kate Ramakaieva
Congrats on another great product going live! Does it support MCP tool tracing natively, or do you have to instrument those calls manually?
Sina Tayebati

@kate_ramakaieva Thanks for the support, Kate! Great question.

If you’re using one of our supported integrations for frameworks like LangGraph, CrewAI, and others, MCP tool calls are automatically captured and traced out of the box.

For custom agent architectures or internal tooling, we also provide lightweight manual instrumentation via decorators, so you can trace virtually any function, tool call, or workflow step in your agent logic.

Luigi Pederzani

Congrats on the launch and thanks for using mcp-use :)

Sina Tayebati

@pederzh thanks for the support Luigi. I'm a big fan of mcp-use :)

Igor Sorokin

Really nice work. The gap between "it ran" and "I understand what happened" is enormous for agents and nobody's solved it cleanly yet. Rooting for you!

Sina Tayebati

@igorsorokinua Really appreciate that and I completely agree. That gap becomes painfully obvious once agents start interacting with tools, memory, APIs, and other agents in production.

A big part of PandaProbe’s vision is making agent behavior actually inspectable (like traditional software engineering) and understandable instead of feeling like a black box.

Ardalan Mirshani

Great pain to tackle, Sina. Good luck.

Sina Tayebati
@ardalan2 Thanks for your support, Ardalan. Happy to hear your feedback if you adopt the platform.
Sounak Bhattacharya

Where does PandaProbe sit relative to LangSmith, Langfuse, and Helicone? They all claim "agent observability" but mean very different things underneath — some are basically prompt loggers, others actually trace tool-call DAGs. Curious which problem you decided was the real one.

Sina Tayebati
@sounak_bhattacharya Great question — and I agree, these tools mean very different things by “observability.” Most of them do a solid job at logging and tracing. PandaProbe is more focused on what comes after that: evaluation.

The core problem we’re solving is: once you have traces, how do you measure quality, detect drift, and catch regressions over time? So beyond tracing, PandaProbe is built around:

• trajectory-level evaluation (not just single outputs)
• monitoring quality across versions/environments
• catching subtle degradation in production

In short: collecting traces is our base, but our focus is understanding and evaluating agent behavior over time.