PandaProbe is an open-source agent engineering platform that gives you deep observability into AI agent applications. Use it to trace, evaluate, monitor and debug your AI agents in development and production.
Quick q, how does PandaProbe’s tracing handle multi-step agent loops where the failure is caused by an earlier decision that only becomes obvious later?
@boyuan_deng1 Great question — that’s exactly the kind of failure mode we care about.
PandaProbe traces the full execution as a structured trajectory (sessions → traces → spans), so you can follow multi-step loops end-to-end, not just isolated steps.
More importantly, we don’t just log steps — we evaluate across the trajectory. That means when a failure shows up later, you can trace it back to earlier decisions and see where things started to drift (e.g., looping, bad tool use, misalignment).
So instead of “something broke at step 20,” you can actually pinpoint “the breakdown started at step 5.”
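To make the "step 20 vs. step 5" idea concrete, here is a minimal Python sketch. It is not PandaProbe's API or its actual metrics: the `Step` record, the per-step `score`, and the `threshold` are all hypothetical, and the rule used (find where the final unbroken run of low-scoring steps begins) is just one simple way to locate where a trajectory started to drift.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int    # position in the trajectory (1-based)
    action: str   # what the agent did at this step
    score: float  # hypothetical per-step quality score in [0, 1]


def first_drift_step(steps: List[Step], threshold: float = 0.5) -> Optional[int]:
    """Return the index where the final unbroken run of below-threshold
    scores begins, or None if the trajectory ends in a healthy state."""
    drift_start = None
    for step in steps:
        if step.score < threshold:
            if drift_start is None:
                drift_start = step.index  # a new low-scoring run starts here
        else:
            drift_start = None  # the agent recovered, so reset
    return drift_start


# Hypothetical run: four healthy steps, then the agent loops on the same tool.
trajectory = [Step(i, "tool_call", 0.9) for i in range(1, 5)]
trajectory += [Step(i, "retry_same_tool", 0.3) for i in range(5, 21)]
print(first_drift_step(trajectory))  # 5
```

The visible failure is at step 20, but the function reports step 5, which is where the low-scoring run began.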
@olia_nemirovski Thank you!
Yes — they’re all captured within the same session.
If a supervisor agent (A) calls a sub-agent (B), it’s treated as part of the same execution thread. The sub-agent call appears as a span within the parent trace, and that span can expand into its own nested chain of steps.
So you get a unified, hierarchical view of the full interaction — making it easy to see how parent and sub-agents relate and where issues emerge.
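As an illustration of that hierarchy, the sketch below models a sub-agent call as a span nested inside the parent trace. The `Span` class and the span names are hypothetical and chosen only for this example; they are not PandaProbe's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    children: List["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> List[str]:
    """Flatten a span tree into indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Supervisor agent A calls sub-agent B; B's steps nest under B's span.
sub_agent = Span("agent_b", [Span("llm_call"), Span("tool:search")])
trace = Span("agent_a", [Span("plan"), sub_agent, Span("respond")])
print("\n".join(render(trace)))
```

Because the sub-agent is just another node in the tree, the same traversal works no matter how deeply agents call each other.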
We've been running Langfuse for our agent stack for about six months and the trace UI is decent, but session-level evals across multi-agent runs are still where things get messy. Curious how PandaProbe handles that. If a sub-agent fails three turns deep, do you surface root cause at the session level or do I still have to walk the span tree manually? Also, what's the storage model look like for self-hosted? Postgres only, or something columnar for the trace volume? One more thing: any plans for OpenTelemetry-native ingestion so I don't have to swap out my existing tracing SDK across services?
@brainystudy Great questions — you’re hitting exactly the pain points we’ve been focusing on.
On evaluation: this is actually the primary focus of PandaProbe. Instead of just surfacing spans, we evaluate at the session level using trajectory-based metrics designed for multi-step, multi-agent workflows. So if a sub-agent fails a few steps deep, you don’t have to manually walk the tree — the system surfaces degradation and helps point you to where things started going wrong.
On storage: current self-hosted setup is Postgres + Redis.
On OpenTelemetry: our schema is largely OTEL-compatible. We apply some normalization on top, and if your schema differs, we surface warnings with guidance — but in most cases (~90%) it works without needing to swap out your existing tracing setup.
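The thread does not show PandaProbe's actual schema or normalization rules, so the following is only a rough sketch of what "normalize, then warn on differences" could look like. The `REQUIRED` list and the `normalize` helper are assumptions made for illustration; the field names follow the OTLP span fields (`trace_id`, `span_id`, `name`, `start_time_unix_nano`, `end_time_unix_nano`), with camelCase keys converted to snake_case.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical set of fields an ingestion layer might require.
REQUIRED = ("trace_id", "span_id", "name", "start_time_unix_nano", "end_time_unix_nano")


def to_snake(key: str) -> str:
    """Convert a camelCase key such as 'traceId' to 'trace_id'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()


def normalize(span: Dict[str, object]) -> Tuple[Dict[str, object], List[str]]:
    """Return the span with snake_case keys plus warnings for missing fields."""
    normalized = {to_snake(k): v for k, v in span.items()}
    warnings = [f"missing field: {name}" for name in REQUIRED if name not in normalized]
    return normalized, warnings


incoming = {"traceId": "abc", "spanId": "def", "name": "agent_a", "startTimeUnixNano": 1}
normalized, warnings = normalize(incoming)
print(warnings)  # ['missing field: end_time_unix_nano']
```

The span is still accepted in normalized form, and the caller gets a warning rather than a hard failure.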
Honestly the open source + self hostable combo is what makes this worth a proper look. most observability tools want you locked into their cloud and charging per seat by the time you actually need it. been burned by that before with Datadog at a startup. one instrument() call to trace the whole run is a nice dx too, gonna try this on a side project this week
@shlokmestry Really appreciate that — and yeah, that exact lock-in/pricing pain is something we wanted to avoid from day one.
That’s why we made it open source + self-hostable, so teams can keep full control as they scale instead of getting boxed into per-seat or per-trace pricing later.
And glad you called out the DX — we’ve been trying to make instrumentation as lightweight as possible.
Would love to hear how it works for your side project 🙌
This looks neat and seems great especially for complex AI workflows, but just thinking about it, how'd the revenue side work?
What does the integration ecosystem look like for native orchestration frameworks like LangGraph or CrewAI?
Congrats on launching! How does PandaProbe handle sub-agent calls? Like if agent A spins up agent B, do both get traced under the same session tree