Shuntaro Okuma left a comment
Hi everyone! I built ConvoProbe to solve a problem I kept running into with my own Dify chatbots. Single-turn testing always looked fine. But once real users started having 3-4 turn conversations, quality would drop: context loss, mixed-up information, regressions after workflow updates. The worst part: these failures are silent. No errors in the logs, just plausible-sounding wrong answers. I...
ConvoProbe: Automated scenario testing for Dify chatbots
ConvoProbe lets you design multi-turn conversation scenarios and run them against your Dify chatbot automatically to measure response quality.
Existing eval tools (LangSmith, Langfuse, Opik) work great for tracing and single-turn evaluation, but they don't support designing and executing multi-turn conversation scenarios end-to-end. ConvoProbe fills that gap.
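To make the idea concrete, here is a minimal sketch of what multi-turn scenario testing looks like. This is an illustrative simplification, not ConvoProbe's actual format or API: a scenario is a list of user turns plus a check on the final reply, and `stub_bot` stands in for a real Dify API call.

```python
# Hypothetical multi-turn scenario runner (not ConvoProbe's real API):
# feed user turns in order, carry the conversation history, and check
# the last reply for expected content — this is what catches context loss.

def run_scenario(chatbot, turns, expect_substring):
    """Run each user turn against the chatbot, then check the final reply."""
    history = []
    reply = ""
    for user_msg in turns:
        reply = chatbot(history, user_msg)
        history.append((user_msg, reply))
    return expect_substring.lower() in reply.lower()

def stub_bot(history, msg):
    """Toy bot that recalls a name from earlier turns — replace with a real Dify call."""
    if "my name" in msg.lower() and "?" in msg:
        for past_msg, _ in history:
            if "name is" in past_msg:
                return "Your name is " + past_msg.split("name is ")[1].strip(".")
        return "I don't know your name."
    return "Got it."

ok = run_scenario(
    stub_bot,
    ["My name is Ana.", "Tell me a fact.", "What is my name?"],
    expect_substring="Ana",
)
print(ok)  # True — the bot kept context across three turns
```

A single-turn test would never exercise the history lookup at all, which is exactly the failure mode described above.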
AdaptGauge detects when adding few-shot examples degrades LLM performance instead of improving it.
Testing 8 models across 4 tasks revealed three failure patterns:
• Peak regression: 64% at 4-shot, crashed to 33% at 8-shot
• Ranking reversal: best zero-shot model dropped to third with examples
• Selection collapse: TF-IDF examples broke a model from 50%+ to 35%
Tracks learning curves, auto-detects collapse, classifies patterns, and compares example selection methods.
Demo results included.
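The "peak regression" pattern above can be checked mechanically from a learning curve. This is my own simplified sketch, not AdaptGauge's actual algorithm: record accuracy per shot count, find the peak, and flag the curve if accuracy at the largest shot count has fallen more than a tolerance below it.

```python
# Illustrative peak-regression check (a simplification, not AdaptGauge's
# real implementation): flag curves where accuracy peaks and then drops
# as more few-shot examples are added.

def detect_peak_regression(curve, tolerance=0.05):
    """curve: {n_shots: accuracy}. Returns (regressed, peak_shots, drop)."""
    shots = sorted(curve)
    peak_n = max(shots, key=lambda n: curve[n])  # shot count with best accuracy
    drop = curve[peak_n] - curve[shots[-1]]      # fall-off at the largest shot count
    return drop > tolerance, peak_n, drop

# Example using the demo numbers above: 64% at 4-shot, 33% at 8-shot.
curve = {0: 0.41, 2: 0.55, 4: 0.64, 8: 0.33}
regressed, peak_n, drop = detect_peak_regression(curve)
print(regressed, peak_n, round(drop, 2))  # True 4 0.31
```

The zero- and two-shot values here are made up for illustration; the point is that the check fires whenever "more examples" stops helping and starts hurting.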

AdaptGauge: Detect when few-shot examples make your LLM worse
Shuntaro Okuma left a comment
Hi Product Hunt! I'm Shuntaro, and I built AdaptGauge after discovering something counterintuitive: giving LLMs more few-shot examples can make them worse. I call this "few-shot collapse", and it's backed by multiple independent research papers from 2025. But until now, there was no tool to detect it automatically before it hits production. AdaptGauge is open source (MIT) and includes...

