A BENCHMARK THAT STILL WORKS
Why a game you play beats a quiz you can memorize
Saturated quizzes tell you what a model knows. Pixel Wars tells you whether it can think — plan dozens of moves ahead, track hidden state under fog, and beat an opponent that punishes every mistake — on a board it has never seen, with a score nobody can fake or memorize.
Same models. One test tells them apart.
On a saturated knowledge quiz, today's models cluster within a few points — the test has run out of room to separate them. Run the same models against our Commander and the field fans out from 2% to 46%, with a clear line between the models that can out-plan the anchor and the ones that can't.
Left is illustrative — frontier models cluster near the ceiling on saturated quizzes. Right is measured: pts% vs the Commander, best-of-25 on the baseline map, fog on. pts% credits decisive play (turtling to a draw isn't safe); 50% is parity with the Commander. Numbers are vs ultimate-2026.06; a stronger v3 anchor is in calibration and results will be re-tagged to it.
We don't just rank models — we show how they think
A score tells you who won. The replay tells you how. Step through a real game move by move — each decision graded against the engine's best line — and a behavioural fingerprint emerges: aggression vs tidiness, when a model overcommits, what it's willing to sacrifice. Same Commander, four recognizably different playstyles:
DeepSeek V4 Flash
Leads the field. Aggressive: 10 brilliant moves and 26 blunders, still wins on points.
73% accuracy
Watch the game →
GPT-5.4 mini
Clean, but overrun. Tidy play with zero blunders, yet still loses this one.
85% accuracy
Watch the game →
Haiku 4.5
Collapses. Five blunders in thirty moves; overwhelmed early.
68% accuracy
Watch the game →
Commander
The baseline. The anchor playing itself, with near-optimal moves throughout.
98% accuracy
Watch the game →
Accuracy = the model's own moves graded against the engine's best line. Note the contrast: DeepSeek's high-variance aggression wins on points, while GPT-5.4 mini plays cleaner but still loses. Clean ≠ winning.
Every percentage on these cards — and the wins-on-points result above — is measured against Commander ultimate-2026.06, the current anchor. We treat that as an early, soft baseline: these standings are provisional (pre-v3), and Commander v3 is built to raise the bar, so expect the numbers to tighten.
Coming
Run the same model across versions and the fingerprint becomes a drift signal: a checkpoint that turns more reckless, or more passive, than the one before it. Today you read each fingerprint by hand from the replay; tracking that shift automatically across checkpoints is a view we're building, not a shipped dashboard.
Built to dodge the three ways benchmarks rot
Saturation kills discrimination, contamination kills validity, gaming kills meaning. A static quiz can't escape all three at once. A task with a seed, an opponent, and an objective outcome can.
Saturation → it discriminates
Procedural difficulty plus a tunable Commander. When the field catches up, we raise the anchor and the ceiling moves — the test never runs out of headroom.
Contamination → fresh every game
Maps are procedurally generated and mirror-symmetric for provable fairness. There's no fixed position set to leak into training data, so a score can't decay into recall.
Gaming → nothing to game
Deterministic win/loss plus a margin score — no LLM judge, no rubric to optimize. Every game is a seed + action log the server re-runs to confirm the outcome.
Recall → real reasoning
Fog of war, an economy, terrain, ~40–100 sequential moves against an adversary. You can't pattern-match your way through — you have to plan over a long horizon.
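To make "procedurally generated and mirror-symmetric" concrete, here is a minimal sketch of one way a seed could produce a fair board. It is an illustration only, not the Pixel Wars generator: the function name `generate_mirror_map`, the grid size, the terrain symbols, and the choice of a left-right mirror axis are all assumptions.

```python
import random


def generate_mirror_map(seed: int, width: int = 8, height: int = 6,
                        terrain: str = ".#~") -> list[str]:
    """Build a left-right mirror-symmetric grid from a seed.

    Hypothetical layout: '.' open ground, '#' rock, '~' water.
    """
    rng = random.Random(seed)      # local RNG, so the same seed gives the same map
    half = (width + 1) // 2        # columns generated directly (includes the centre)
    rows = []
    for _ in range(height):
        left = [rng.choice(terrain) for _ in range(half)]
        # Mirror the generated half; the centre column is not duplicated
        # when the width is odd.
        right = left[: width - half][::-1]
        rows.append("".join(left + right))
    return rows


if __name__ == "__main__":
    for row in generate_mirror_map(seed=2026):
        print(row)
```

Because only one half is generated and the other half is its reflection, both sides start from identical terrain, and a new seed yields a board with no fixed position set behind it.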
Does winning here mean anything?
Our thesis: long-horizon strategic planning under uncertainty is one of the most under-measured capabilities in AI — and it's exactly what Pixel Wars puts a model through. We won't pretend the link to real-world agent work is settled science; we validate transfer in the open and publish the per-capability breakdown below, so you can judge it instead of taking our word for it.
Long-horizon planning
A game runs ~40–100 sequential moves where early choices decide late outcomes — a one-shot quiz answer never spans that horizon.
Hidden-state tracking
Fog of war hides the enemy; you have to maintain a belief about a board you can't fully see — a quiz hands you the whole question up front.
Adapting to an adversary
The Commander punishes every mistake and reacts to your plan, so the right move depends on the opponent — a static prompt pushes back on nothing.
Resource allocation
You manage an economy and spend income under pressure across a whole match — there's no budget to balance in a single multiple-choice item.
We report the per-capability breakdown — long-horizon planning, hidden-state tracking, adversarial adaptation, economy — not just a single headline score. The confidence is earned the hard way: it's a real task with no answer key.
Don't trust the number — reproduce it
Every ranked result is a seed plus an action log. Re-run it and you get the same outcome, or it's rejected. Bring your own key and benchmark any model yourself in minutes.
Run it yourself
The in-browser benchmark plays your model vs the Commander, best-of-25, for a few dollars in API calls. Your key talks straight to your vendor — never our servers.
Replay-verified
Submit a run and the server replays every game move-by-move before it counts. Fabricated or illegal logs are rejected, so a community number means what it says.
Rises with the frontier
When a model beats the Commander, we mine those games and ship a stronger anchor. Beating it isn't a finish line — it just raises the bar for the next model.
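The replay check described above (a seed plus an action log, re-run before a result counts) can be sketched with a toy deterministic game. Everything here is hypothetical and stands in for the real rules: the `Submission` fields, the three-option move set, and the scoring are assumptions chosen only to show the verification pattern.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    seed: int
    actions: tuple[int, ...]   # hypothetical action log: one integer per move
    claimed_score: int


class IllegalMove(Exception):
    """Raised when a logged action is not legal in the toy game."""


def replay(seed: int, actions: tuple[int, ...]) -> int:
    """Re-run the toy game: a move scores a point when it matches the seeded target."""
    rng = random.Random(seed)      # all randomness comes from the seed
    score = 0
    for move in actions:
        if move not in (0, 1, 2):
            raise IllegalMove(move)
        target = rng.randrange(3)
        if move == target:
            score += 1
    return score


def verify(sub: Submission) -> bool:
    """Accept a submission only if the replayed outcome equals the claimed one."""
    try:
        return replay(sub.seed, sub.actions) == sub.claimed_score
    except IllegalMove:
        return False


if __name__ == "__main__":
    log = (0, 1, 2, 1)
    honest = Submission(seed=7, actions=log, claimed_score=replay(7, log))
    inflated = Submission(seed=7, actions=log, claimed_score=honest.claimed_score + 1)
    print(verify(honest), verify(inflated))
```

The design point is that the server never trusts the reported score. It trusts only the seed and the moves, and recomputes everything else, so an inflated score or an illegal move fails the check.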
Bring a model. Take on the Commander.
Two models beat it so far — DeepSeek V4 Flash and GPT-5.4 mini set the bar, vs Commander ultimate-2026.06. Provisional, pre-v3: a stronger anchor is in calibration and will lift the bar. Free in your browser; see if yours can clear it.