PandaProbe

open source agent engineering platform

895 followers

Observability tools

PandaProbe is an open-source agent engineering platform that gives you deep observability into AI agent applications. Use it to trace, evaluate, monitor and debug your AI agents in development and production.

Free

Launch tags: Open Source • Developer Tools • Artificial Intelligence

PandaProbe

Maker

📌

👋 Hey Product Hunt!

I’m Sina, founder of PandaProbe.

Building AI agents is getting easier, but understanding and trusting them in production is still hard.

Once agents start calling LLMs, tools, APIs, MCPs, and sub-agents, logs aren’t enough anymore. You need to see what happened, why it failed, whether quality regressed, and how reliable the system is across full sessions.

PandaProbe is my attempt to solve this: an open-source agent engineering platform for tracing, evaluation, monitoring, and debugging AI agent applications.

The goal is simple: help developers move from “it works on my laptop” to “I understand production behavior, can measure quality, and continuously improve it.”

What PandaProbe provides

🔎 Trace — capture full agent executions as sessions, traces, and spans across LLMs, tools, agents, and custom logic.
📊 Evaluate — score traces and sessions using mission-critical, agent-specific metrics.
⏱️ Monitor — schedule recurring evaluations to automatically validate new traces and sessions in production.
📈 Analytics — track performance, cost, latency, errors, and quality trends over time.
🛠️ Open source + cloud — use the open-source core on GitHub or run PandaProbe in the cloud.
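
To make the sessions, traces, and spans hierarchy concrete, here is a minimal sketch of how such a data model could look. The class and field names (`Span`, `Trace`, `Session`, `kind`, `duration_ms`) are assumptions for illustration and are not taken from PandaProbe's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work: an LLM call, a tool call, or custom logic (hypothetical shape)."""
    name: str
    kind: str  # e.g. "llm", "tool", "agent", "custom"
    duration_ms: float
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


@dataclass
class Trace:
    """One end-to-end agent execution, rooted at a single span."""
    trace_id: str
    root: Span


@dataclass
class Session:
    """A group of related traces, e.g. one multi-turn conversation."""
    session_id: str
    traces: List[Trace] = field(default_factory=list)


def count_spans(span: Span) -> int:
    """Count a span and all of its descendants."""
    return 1 + sum(count_spans(child) for child in span.children)


# Example: an agent span that made one LLM call and one tool call.
llm = Span(name="plan", kind="llm", duration_ms=850.0)
tool = Span(name="search_docs", kind="tool", duration_ms=120.0)
root = Span(name="support_agent", kind="agent", duration_ms=1000.0, children=[llm, tool])
session = Session(session_id="s-1", traces=[Trace(trace_id="t-1", root=root)])
```

Nesting spans as a tree is what lets a viewer or evaluator walk from a session down to the individual call that misbehaved.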

Who it’s for

🧑‍💻 AI engineers — debug agent behavior across LLMs, tools, and workflows.
🏗️ Platform teams — monitor quality, regressions, and reliability in production.
🔬 Builders experimenting with agents — understand failures and iterate faster.
🚀 Startups — add observability and evaluation before things become unmanageable.

Quick links

GitHub: https://github.com/chirpz-ai/pandaprobe

Docs: https://docs.pandaprobe.com

Cloud: https://www.pandaprobe.com/

I’ll be here all day answering questions and collecting feedback.

If you’re building agents today, what’s the hardest part to debug or evaluate?

Thanks for checking it out 🙏
— Sina

2mo ago

RiteKit Company Logo API

@sina_tayebati This addresses a real pain point—once agents get complex with multiple tool calls and sub-agents, traditional logging becomes almost useless for understanding what actually happened. The focus on evaluation and continuous monitoring in production sounds like exactly what teams need to move beyond one-off testing.

1mo ago

PandaProbe

Maker

@osakasaul Really appreciate that — and that’s exactly the problem we kept running into. Once agents become multi-step and start using tools and sub-agents, logs stop being enough to understand behavior. You need visibility across the full execution and a way to evaluate it over time. That’s why PandaProbe focuses on combining tracing with continuous evaluation and monitoring in production — so teams can move from one-off debugging to actually understanding and improving agent behavior over time.

1mo ago

Okan

Handling state and debugging for long-running autonomous agents is usually a nightmare, so having an open-source platform to standardize that workflow is huge. I can definitely see myself using PandaProbe to self-host my agent evaluation pipeline to keep sensitive client data entirely local. I am really curious to hear if you currently support custom tracing for raw API calls instead of just the standard frameworks.

2mo ago

PandaProbe

Maker

@y_taka Really appreciate that — and yes, democratizing observability and enabling teams to keep sensitive workflows fully self-hosted were big motivations behind making PandaProbe open source from day one.

And absolutely: we support custom tracing beyond standard frameworks. Alongside native integrations, PandaProbe also provides manual instrumentation APIs and decorators, so you can trace raw API calls, internal services, custom orchestration layers, or essentially any part of your agent workflow you want visibility into.

A lot of teams end up with hybrid architectures, so supporting low-level custom instrumentation was important for us early on.
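
For readers unfamiliar with decorator-based instrumentation, here is a rough sketch of the general technique. This is not PandaProbe's API: the `traced` decorator, the `RECORDED_SPANS` list, and `fetch_weather` are all hypothetical names, and a real SDK would send spans to a backend instead of an in-memory list.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory sink standing in for a real tracing backend (hypothetical).
RECORDED_SPANS: List[Dict[str, Any]] = []


def traced(name: str) -> Callable:
    """Record name, duration, and error status for each call of the wrapped function."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = repr(exc)
                raise  # Never swallow the caller's exception.
            finally:
                RECORDED_SPANS.append({
                    "name": name,
                    "duration_ms": (time.perf_counter() - start) * 1000.0,
                    "error": error,
                })
        return wrapper
    return decorator


@traced("fetch_weather")
def fetch_weather(city: str) -> str:
    # Stand-in for a raw API call; a real one would hit the network.
    return f"sunny in {city}"


result = fetch_weather("Oslo")
```

The span is recorded in `finally`, so failed calls are captured with their error while the exception still propagates to the caller.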

2mo ago

Evaluation is the hardest part of this whole space and most platforms hand-wave it. The failure mode that actually bites in production isn't crashes or schema errors. It's slow drift in subjective quality (voice, classification accuracy, output style) that only shows up when a human reads 50 outputs in a row. How does PandaProbe handle that in practice? LLM-as-judge with custom rubrics, human-in-loop on a held-out set, embedding-distance from a golden corpus, or something else? And how do you stop eval cost from outpacing inference cost when you're re-judging every trace?

1mo ago

PandaProbe

Maker

@vincentf This is actually one of the core motivations behind PandaProbe.

A lot of evaluation systems today focus on isolated outputs, but in production we kept seeing failures emerge as trajectory-level drift: looping, degraded tool grounding, coordination breakdowns, subtle quality regression, output-style drift, etc. The model can still sound confident locally while the overall session quality quietly collapses.

That led directly to a research paper I recently published called TRACER (Trajectory Risk Aggregation for Critical Episodes in Agentic Reasoning). The core idea is evaluating uncertainty and failure at the trajectory/session level rather than the individual response level.

That research became a major foundation for PandaProbe’s evaluation system and heavily shaped how we think about observability and longitudinal agent evaluation.

On the cost side, we’re also very conscious about evals becoming more expensive than inference. PandaProbe supports async and sampled evaluations, composable metrics, and lightweight structural trajectory signals so teams don’t have to run expensive judge models on every trace.
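
As an illustration of sampled evaluation in general, the sketch below picks a stable fraction of traces for an expensive judge by hashing the trace ID. The function name `should_evaluate` and the 10% rate are assumptions for the example, not a description of how PandaProbe implements sampling.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for the expensive judge."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Treat the first 8 bytes of the hash as an integer in [0, 2**64)
    # and keep the trace if it falls below the sampled share of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Example: sample about 10% of 10,000 hypothetical trace IDs.
trace_ids = [f"trace-{i}" for i in range(10_000)]
selected = [t for t in trace_ids if should_evaluate(t, 0.1)]
```

Hashing the ID instead of calling a random number generator means the same trace always gets the same decision, so re-running an evaluation job does not re-judge a different subset.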

1mo ago

@sina_tayebati Thanks Sina, that's a sharper framing than most of what's out there. Got a link to the TRACER paper?

1mo ago

PandaProbe

Maker

@vincentf hey Vincent, yes here’s the link to the paper: https://arxiv.org/abs/2602.11409

1mo ago

We've been running Langfuse for our agent stack for about six months and the trace UI is decent, but session-level evals across multi-agent runs are still where things get messy. Curious how PandaProbe handles that. If a sub-agent fails three turns deep, do you surface root cause at the session level or do I still have to walk the span tree manually? Also, what's the storage model look like for self-hosted? Postgres only, or something columnar for the trace volume? One more thing: any plans for OpenTelemetry-native ingestion so I don't have to swap out my existing tracing SDK across services?

1mo ago

PandaProbe

Maker

@brainystudy Great questions. You’re hitting exactly the pain points we’ve been focusing on.

On evaluation: this is the primary focus of PandaProbe. Instead of just surfacing spans, we evaluate at the session level using trajectory-based metrics designed for multi-step, multi-agent workflows. So if a sub-agent fails a few steps deep, you don’t have to manually walk the tree: the system surfaces degradation and helps point you to where things started going wrong.

On storage: the current self-hosted setup is Postgres + Redis.

On OpenTelemetry: our schema is largely OTEL-compatible. We apply some normalization on top, and if your schema differs, we surface warnings with guidance. In most cases (~90%) it works without needing to swap out your existing tracing setup.
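
To show what "not walking the span tree manually" can mean in its simplest form, here is a toy sketch that follows errored spans down to the deepest one. It assumes errors propagate upward from child to parent; the `Span` class and `root_cause_path` function are hypothetical and much cruder than trajectory-based metrics.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Minimal span with an optional error message (hypothetical shape)."""
    name: str
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)


def root_cause_path(span: Span) -> List[str]:
    """Return the names from `span` down to the deepest errored descendant.

    Assumes errors propagate upward, so the deepest errored span is the
    most likely origin. Returns [] if `span` itself has no error.
    """
    if span.error is None:
        return []
    for child in span.children:
        path = root_cause_path(child)
        if path:
            return [span.name] + path
    return [span.name]


# Example: a web search times out three levels deep and fails its parents.
web_search = Span(name="web_search", error="timeout")
summarize = Span(name="summarize")
researcher = Span(name="researcher", error="child failed", children=[web_search, summarize])
planner = Span(name="planner")
session_root = Span(name="orchestrator", error="child failed", children=[planner, researcher])

path = root_cause_path(session_root)
```

Here `path` is `["orchestrator", "researcher", "web_search"]`, which points straight at the failing tool call.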

1mo ago

FuseBase

Congrats on another great product going live! Does it support MCP tool tracing natively, or do you have to instrument those calls manually?

2mo ago

PandaProbe

Maker

@kate_ramakaieva Thanks for the support, Kate! Great question.

If you’re using one of our supported integrations for frameworks like LangGraph, CrewAI, and others, MCP tool calls are automatically captured and traced out of the box.

For custom agent architectures or internal tooling, we also provide lightweight manual instrumentation via decorators, so you can trace virtually any function, tool call, or workflow step in your agent logic.

2mo ago

Where does PandaProbe sit relative to LangSmith, Langfuse, and Helicone? They all claim "agent observability" but mean very different things underneath — some are basically prompt loggers, others actually trace tool-call DAGs. Curious which problem you decided was the real one.

1mo ago

PandaProbe

Maker

@sounak_bhattacharya Great question, and I agree: these tools mean very different things by “observability.” Most of them do a solid job at logging and tracing. PandaProbe is more focused on what comes after that: evaluation.

The core problem we’re solving is: once you have traces, how do you measure quality, detect drift, and catch regressions over time? So beyond tracing, PandaProbe is built around:

• trajectory-level evaluation (not just single outputs)
• monitoring quality across versions/environments
• catching subtle degradation in production

In short: collecting traces is our base, but our focus is on understanding and evaluating agent behavior over time.

1mo ago

Quick q, how does PandaProbe’s tracing handle multi-step agent loops where the failure is caused by an earlier decision that only becomes obvious later?

1mo ago

PandaProbe

Maker

@boyuan_deng1 Great question — that’s exactly the kind of failure mode we care about. PandaProbe traces the full execution as a structured trajectory (sessions → traces → spans), so you can follow multi-step loops end-to-end, not just isolated steps. More importantly, we don’t just log steps — we evaluate across the trajectory. That means when a failure shows up later, you can trace it back to earlier decisions and see where things started to drift (e.g., looping, bad tool use, misalignment). So instead of “something broke at step 20,” you can actually pinpoint “the breakdown started at step 5.”
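
One simple way to picture "the breakdown started at step 5" is to score every step and look for where the final run of low scores begins. The sketch below assumes per-step quality scores already exist; the `drift_onset` function, the scores, and the 0.7 threshold are all made up for illustration and are not PandaProbe's method.

```python
from typing import List, Optional


def drift_onset(step_scores: List[float], threshold: float) -> Optional[int]:
    """Return the 1-based step where the final unbroken run of low scores begins.

    Returns None when the last step is still at or above `threshold`.
    """
    onset = None
    # Walk backwards from the last step until a healthy step is found.
    for index in range(len(step_scores) - 1, -1, -1):
        if step_scores[index] >= threshold:
            break
        onset = index + 1
    return onset


# Example: quality dips at step 5 and never recovers through step 8.
scores = [0.9, 0.9, 0.8, 0.9, 0.5, 0.4, 0.4, 0.3]
onset = drift_onset(scores, threshold=0.7)
```

With these scores `onset` is 5, even though the visible failure only shows up at the last step.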

1mo ago

Forum Threads

p/pandaprobe • 2mo ago

why are ai agents still so hard to debug in production?

feels like the industry figured out how to build ai agents faster than how to understand them.

everyone demos agents.
very few teams can confidently answer:

why an agent failed
what changed between runs
whether quality is improving or regressing
or if the agent is actually reliable over time

curious how people here are handling this today.
