Reading the benchmark: win% vs pts%

Pixel Wars · methodology

The benchmark table shows two headline numbers per model: win% and pts%. They answer different questions, and the gap between them is where the interesting signal lives.

win% — did you actually win?

Decisive win% is the simplest measure: the share of games the model won outright (captured the HQ or wiped the opponent) before the turn cap. It's binary and unforgiving. Against the full-strength Commander, most models sit at 0% here — winning outright is hard.
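As a rough illustration of how binary this number is, here is a minimal sketch of the calculation. The `GameResult` record and its field names are assumptions made for this example, not the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass
class GameResult:
    # Hypothetical record; the field names are illustrative only.
    won_outright: bool   # captured the HQ or wiped the opponent
    hit_turn_cap: bool   # the game ran to the turn cap


def decisive_win_pct(games):
    """Share of games won outright before the turn cap, as a percentage."""
    if not games:
        return 0.0
    wins = sum(1 for g in games if g.won_outright and not g.hit_turn_cap)
    return 100.0 * wins / len(games)


# Four made-up games: one outright win, two timeouts, one outright loss.
sample_games = [
    GameResult(won_outright=True, hit_turn_cap=False),
    GameResult(won_outright=False, hit_turn_cap=True),
    GameResult(won_outright=False, hit_turn_cap=True),
    GameResult(won_outright=False, hit_turn_cap=False),
]
sample_win_pct = decisive_win_pct(sample_games)
```

A timed-out game counts for nothing here, however far ahead the model was when the cap hit.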

pts% — how the game ended

If a game reaches the turn cap without a decisive result, calling it a flat "draw" throws away real information: a model that spent the whole game attacking and ended well ahead on the board is not the same as one that hid in a corner. So we score every game on one of five outcomes:

pts% rolls those into a single margin-weighted score. The effect: a model that keeps pressing the attack and ends a timed-out game ahead is rewarded, while one that turtles for a "safe" draw is not. Turtling stops being a strategy for gaming the number.
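To make the idea concrete, here is a sketch of one way a margin-weighted score could be computed. The five outcome names and their point values below are placeholders invented for this example; the benchmark's real outcome labels and weights are not reproduced here and may differ.

```python
# PLACEHOLDER outcomes and weights, assumed for illustration only.
PLACEHOLDER_POINTS = {
    "decisive_win": 1.0,
    "ahead_at_cap": 0.75,
    "even_at_cap": 0.5,
    "behind_at_cap": 0.25,
    "decisive_loss": 0.0,
}


def pts_pct(outcomes, points=PLACEHOLDER_POINTS):
    """Average points earned per game, as a percentage of the maximum."""
    if not outcomes:
        return 0.0
    earned = sum(points[o] for o in outcomes)
    best_possible = len(outcomes) * max(points.values())
    return 100.0 * earned / best_possible


# Two made-up models, neither of which ever wins outright (win% = 0 for both).
attacker = ["ahead_at_cap"] * 4   # presses the attack, times out ahead
turtle = ["even_at_cap"] * 4      # sits back for a "safe" draw

attacker_pts = pts_pct(attacker)
turtle_pts = pts_pct(turtle)
```

Under these assumed weights the two models tie on win% but separate on pts%, which is the behaviour the paragraph above describes.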

win% asks "did you win?"; pts% asks "were you winning?" Against a hard anchor opponent such as the full-strength Commander, where most models never win outright, the second question is what actually separates models.

Why per-battlefield matters

Map type changes the game. Water-heavy maps reward different play than open plains or dense mountains, and some are simply harder for a model to reason about. A single number from one map type, such as a land-only result, can flatter or punish a model depending on the map. That's why the benchmark reports per-battlefield results alongside an aggregate, and why the in-browser tool lets you run all of them.
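A per-battlefield breakdown can be sketched as a simple group-and-average. The map labels, the scores, and the choice to pool all games for the aggregate (rather than averaging the per-map averages) are assumptions for this example, not a description of the benchmark's own pipeline.

```python
from collections import defaultdict


def per_battlefield(scores):
    """Group (battlefield, pts) pairs and average them.

    `pts` is a per-game score in [0, 1]. Returns a dict of per-map
    percentages and a pooled aggregate percentage over all games.
    """
    by_map = defaultdict(list)
    for battlefield, pts in scores:
        by_map[battlefield].append(pts)

    table = {name: 100.0 * sum(v) / len(v) for name, v in by_map.items()}

    all_pts = [p for v in by_map.values() for p in v]
    aggregate = 100.0 * sum(all_pts) / len(all_pts) if all_pts else 0.0
    return table, aggregate


# Made-up scores: weak on water, strong on plains.
sample_scores = [
    ("water", 0.0),
    ("water", 0.5),
    ("plains", 1.0),
    ("plains", 0.5),
]
sample_table, sample_aggregate = per_battlefield(sample_scores)
```

In this made-up sample the aggregate sits in the middle and hides a wide gap between the two maps, which is the reason to read the per-battlefield rows as well.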