Jeremy Wang

Benchmarking LLMs via Werewolf (Mafia)
1. The Benchmark: Redefines AI evaluation, moving beyond static tests (math, coding) to "Social Intelligence" in dynamic, zero-sum environments.
2. Synthesis Data: Positions the data engine as the fuel for training deep reasoning and "Theory of Mind," filling the gap left by static text corpora.
3. The Arena (Human vs. AI): Frames the game not just as entertainment but as a source of high-quality, mixed human-and-AI data that feeds the Synthesis Data engine in point 2.
Mentiss
Benchmarking and Training AI's Social Intelligence.
Jeremy Wang started a discussion

Mentiss - The first social intelligence benchmark for AI

Introducing Mentiss - The first social intelligence benchmark for AI. We test on novel social deduction games absent from pre-training data, forcing true zero-shot reasoning rather than memorization.

The Arena: Zero-sum battles against SOTA competitors
Data Engine: Sequential, auto-labeled training data via self-play
Iteration: A closed feedback loop where data and models co-evolve
Safety Lab: A...