AdaptGauge detects when adding few-shot examples degrades LLM performance instead of improving it. Testing 8 models across 4 tasks revealed three failure patterns:
• Peak regression: 64% at 4-shot, crashing to 33% at 8-shot
• Ranking reversal: the best zero-shot model dropped to third place once examples were added
• Selection collapse: TF-IDF-selected examples dropped a model from 50%+ to 35%
AdaptGauge tracks learning curves, auto-detects collapse, classifies failure patterns, and compares example selection methods. Demo results are included.
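
The three failure patterns can be expressed as simple rules over per-shot accuracy. The following is a minimal sketch, not AdaptGauge's actual implementation: the function names, the 10-percentage-point drop threshold, and the use of a "random" selection method as the baseline for selection collapse are all assumptions made for illustration.

```python
from __future__ import annotations

# Hypothetical threshold (an assumption, not AdaptGauge's setting):
# a drop of at least 10 percentage points counts as a collapse.
DEFAULT_DROP = 0.10


def detect_peak_regression(curve: dict[int, float], drop: float = DEFAULT_DROP) -> bool:
    """Peak regression: accuracy at the largest shot count sits at least
    `drop` below the best accuracy seen anywhere on the learning curve."""
    final = curve[max(curve)]       # accuracy at the largest k (dict keys are shot counts)
    peak = max(curve.values())      # best accuracy on the curve
    return peak - final >= drop


def rank_models(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


def zero_shot_leader_rank(zero_shot: dict[str, float], few_shot: dict[str, float]) -> int:
    """Ranking reversal: 1-based rank, under few-shot scores, of the model
    that was best at zero-shot. Any value above 1 means the leader lost its spot."""
    leader = rank_models(zero_shot)[0]
    return rank_models(few_shot).index(leader) + 1


def collapsed_selectors(method_scores: dict[str, float], baseline: str = "random",
                        drop: float = DEFAULT_DROP) -> list[str]:
    """Selection collapse: example-selection methods scoring at least `drop`
    below the (assumed) baseline method, for the same model and shot count."""
    base = method_scores[baseline]
    return [m for m, s in method_scores.items() if m != baseline and base - s >= drop]


if __name__ == "__main__":
    # Only the 4-shot and 8-shot points from the peak-regression example.
    print(detect_peak_regression({4: 0.64, 8: 0.33}))  # True
```

A fixed absolute threshold keeps the rules easy to read; a production tool would likely also need repeated runs or significance checks before flagging a collapse.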

Hi Product Hunt! 👋 I'm Shuntaro, and I built AdaptGauge after discovering something counterintuitive: giving LLMs more few-shot examples can make them worse.

I call this "few-shot collapse", and it is backed by multiple independent research papers from 2025. Until now, though, there was no tool to detect it automatically before it hits production.

AdaptGauge is open source (MIT) and includes pre-computed demo results, so you can explore the patterns right away.

I'd love to hear:
• Have you encountered cases where adding examples hurt LLM performance?
• Which tasks or models are you most interested in testing?

Repo: https://github.com/ShuntaroOkuma...
Article: https://shuntaro-okuma.medium.co...

AdaptGauge

Detect when few-shot examples make your LLM worse
