ConvoProbe
Automated scenario testing for Dify chatbots
3 followers
ConvoProbe lets you design multi-turn conversation scenarios and run them against your Dify chatbot automatically to measure response quality. Existing eval tools (LangSmith, Langfuse, Opik) work great for tracing and single-turn evaluation — but they don't support designing and executing multi-turn conversation scenarios end-to-end. ConvoProbe fills that gap.
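To give a concrete picture, here's a minimal sketch of how a multi-turn scenario might be written, in Python. The structure and field names ("turns", "user_message", "expected_response") are illustrative guesses based on the description in the thread below, not ConvoProbe's actual schema:

```python
# Hypothetical scenario definition. Field names are illustrative,
# not ConvoProbe's real schema.
scenario = {
    "name": "refund-policy-flow",
    "turns": [
        {
            "user_message": "Hi, I want to return a product I bought last week.",
            "expected_response": "Acknowledges the request and asks for the order number.",
        },
        {
            "user_message": "My order number is 12345.",
            "expected_response": "Explains the refund policy and the next steps for the return.",
        },
    ],
}
```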

Multi-turn evaluation is a huge gap in the current tooling. How do you measure "response quality" across turns? Is it rule-based checks or LLM-as-judge? Awesome idea!
@mateuszjacni
Thanks! Great question — it's LLM-as-Judge, not rule-based.
Each turn is scored on 4 criteria: semantic alignment, completeness, accuracy, and relevance. You write an "expected response" for each turn in the scenario, and the judge LLM compares the bot's actual response against it.
I considered rule-based checks early on, but they're too brittle for natural language — the same correct answer can be phrased in many ways. LLM-as-Judge handles that well, especially when you give it clear evaluation criteria rather than exact string matching.
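To make that concrete, here's a minimal sketch of what a per-turn judge could look like, assuming an OpenAI model as the judge. The prompt wording, model choice, and function name are all illustrative, not ConvoProbe's actual implementation:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt covering the four criteria mentioned above.
JUDGE_PROMPT = """You are evaluating a chatbot's reply against an expected response.
Score each criterion from 1 to 5 and return JSON with the keys:
semantic_alignment, completeness, accuracy, relevance.

Expected response: {expected}
Actual response: {actual}
"""

def judge_turn(expected: str, actual: str) -> dict:
    """Score one conversation turn on the four criteria via an LLM judge."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(expected=expected, actual=actual)}],
        response_format={"type": "json_object"},  # force parseable JSON scores
    )
    return json.loads(completion.choices[0].message.content)
```

JSON mode keeps the scores machine-parseable, so a run can aggregate per-criterion averages across turns instead of eyeballing free-text verdicts.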
The branching logic also uses LLM evaluation — for example, "if the bot mentioned a specific product, ask a follow-up about pricing; otherwise, ask it to clarify." The LLM decides which branch to take at runtime based on what the bot actually said.
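The branch decision can use the same pattern: reduce the condition to a yes/no question for the judge model. Again a sketch with hypothetical names (choose_branch, the condition strings), not ConvoProbe's real API:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def choose_branch(bot_reply: str, condition: str) -> bool:
    """Ask the judge LLM whether the bot's reply satisfies a branch condition."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (
                f"Bot reply: {bot_reply}\n"
                f"Condition: {condition}\n"
                "Answer with exactly YES or NO: does the reply satisfy the condition?"
            ),
        }],
    )
    return completion.choices[0].message.content.strip().upper().startswith("YES")

# The pricing example from above, with a made-up bot reply:
bot_reply = "Our ProWidget 3000 would be a great fit for that."
if choose_branch(bot_reply, "The bot mentioned a specific product"):
    next_message = "How much does it cost?"
else:
    next_message = "Could you clarify which product you mean?"
```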