Steven Willmott

What kind of agent validation are you doing today?

Everything started with model evals and benchmarks (which model is better?), then moved on to prompt management, and from there to analyzing traces. What are people doing today, and how are they sourcing test datasets?

Replies

Aarav

Traditional evals feel insufficient once agents become stateful and multi-step. I’ve noticed the hardest failures usually come from context degradation, recovery after tool failure, or subtle planning loops rather than raw model quality itself. Curious whether people are generating synthetic trajectory datasets internally or relying mostly on production traces now.
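
To make two of those failure modes concrete, here is a rough sketch of the kind of trajectory-level check that could run over either synthetic or production traces. Everything in it is an assumption for illustration: the `Step` schema, the function names, and the "same call three times in a row" threshold are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One step of an agent trajectory (hypothetical schema)."""
    tool: str        # name of the tool the agent called
    args: str        # serialized tool arguments
    ok: bool = True  # False if the tool call failed


def find_planning_loop(steps: list[Step], max_repeats: int = 3) -> Optional[int]:
    """Return the index where an identical (tool, args) call has been made
    max_repeats times in a row, or None if no such run exists.
    Assumes max_repeats >= 2."""
    run = 1
    for i in range(1, len(steps)):
        prev, cur = steps[i - 1], steps[i]
        same = (cur.tool, cur.args) == (prev.tool, prev.args)
        run = run + 1 if same else 1
        if run >= max_repeats:
            return i
    return None


def recovered_after_failure(steps: list[Step]) -> bool:
    """True if the trajectory has no failed tool call, or if at least one
    successful step follows the last failed call."""
    failed = [i for i, s in enumerate(steps) if not s.ok]
    if not failed:
        return True
    return any(s.ok for s in steps[failed[-1] + 1:])


# Example trajectory: the agent repeats the same search three times.
looping = [
    Step("search", "q=a"),
    Step("search", "q=a"),
    Step("search", "q=a"),
    Step("answer", ""),
]
loop_index = find_planning_loop(looping)
```

Checks like these only catch the crude cases (exact repeats, a run that ends on a failure). Context degradation in particular probably needs a judged or reference-based comparison rather than a structural rule.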