AdaptGauge detects when adding few-shot examples degrades LLM performance instead of improving it.
Testing 8 models across 4 tasks revealed three failure patterns:
• Peak regression: a model reached 64% at 4-shot, then crashed to 33% at 8-shot
• Ranking reversal: the best zero-shot model dropped to third place once examples were added
• Selection collapse: TF-IDF-selected examples dropped a model from above 50% to 35%
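
As a rough illustration of the ranking-reversal check, the sketch below compares where the best zero-shot model lands once examples are added. The function names and the scores are hypothetical and invented for this example; they are not AdaptGauge's API or its measured results.

```python
def rank_models(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def ranking_reversal(zero_shot, few_shot):
    """Return the best zero-shot model and its 1-based rank with examples."""
    best = rank_models(zero_shot)[0]
    new_rank = rank_models(few_shot).index(best) + 1
    return best, new_rank


# Hypothetical scores, made up for illustration only.
zero_shot = {"model_a": 0.58, "model_b": 0.52, "model_c": 0.49}
few_shot = {"model_a": 0.55, "model_b": 0.61, "model_c": 0.63}

best, new_rank = ranking_reversal(zero_shot, few_shot)
print(best, new_rank)  # model_a 3
```

A rank greater than 1 means the zero-shot leader was overtaken after examples were added.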
It tracks learning curves, automatically detects collapse, classifies failure patterns, and compares example selection methods.
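
One way collapse detection on a learning curve could work is sketched below, assuming a curve is a mapping from shot count to score and that a drop of at least 10 points below the earlier peak counts as a regression. The function name `detect_peak_regression` and the `min_drop` threshold are assumptions made for this example, not the tool's actual interface. The two data points are the 4-shot and 8-shot figures quoted above.

```python
def detect_peak_regression(curve, min_drop=0.10):
    """Flag a curve whose final score falls well below an earlier peak.

    curve: dict mapping shot count -> score in [0, 1].
    min_drop: assumed threshold for calling the drop a regression.
    Returns (peak_shots, final_shots, drop), or None if no regression.
    """
    shots = sorted(curve)
    peak_shots = max(shots, key=curve.get)
    final_shots = shots[-1]
    drop = curve[peak_shots] - curve[final_shots]
    if peak_shots != final_shots and drop >= min_drop:
        return peak_shots, final_shots, drop
    return None


# Only the two points reported in the text; other shot counts are omitted.
curve = {4: 0.64, 8: 0.33}
result = detect_peak_regression(curve)
if result is not None:
    peak_shots, final_shots, drop = result
    print(f"peak at {peak_shots}-shot, down {drop:.0%} by {final_shots}-shot")
```

A curve that keeps improving up to its last shot count returns `None`, so only curves that peak early and then fall are flagged.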
Demo results included.