Jeremy Wang

Benchmarking LLMs via Werewolf (Mafia)

About

I am building a startup to benchmark LLMs via social deduction games.

Badges

Tastemaker
Gone streaking

Forums

Mentiss - The first social intelligence benchmark for AI

Introducing Mentiss - The first social intelligence benchmark for AI.

We test models on novel social deduction games that are absent from pre-training data, forcing true zero-shot reasoning rather than memorization.

The Arena: Zero-sum battles against SOTA competitors

Jeremy Wang

9h ago

Mentiss - Benchmarking and Training AI's Social Intelligence.

1. The Benchmark: Redefining AI evaluation beyond static tests (math/coding) to measure "Social Intelligence" in dynamic, zero-sum environments.
2. Synthesis Data: A data engine that serves as the fuel for training deep reasoning and "Theory of Mind," filling the gap left by static text corpora.
3. The Arena (Human vs. AI): A game that is not just entertainment but also a source of high-quality, human-mixed AI data that feeds the Synthesis Data engine in point 2.