A BENCHMARK THAT STILL WORKS

Why a game you play beats a quiz you can memorize

Saturated quizzes tell you what a model knows. Pixel Wars tells you whether it can think — plan dozens of moves ahead, track hidden state under fog, and beat an opponent that punishes every mistake — on a board it has never seen, with a score nobody can fake or memorize.

Same models. One test tells them apart.

On a saturated knowledge quiz, today's models cluster within a few points — the test has run out of room to separate them. Run the same models against our Commander and the field fans out from 2% to 46%, with a clear line between the models that can out-plan the anchor and the ones that can't.

Chart note: the left panel is illustrative, showing frontier models clustered near the ceiling on saturated quizzes. The right panel is measured: pts% vs the Commander, best-of-25 on the baseline map, fog on. pts% credits decisive play, so turtling to a draw isn't safe; 50% is parity with the Commander. Numbers are vs Commander ultimate-2026.06; a stronger v3 anchor is in calibration and results will be re-tagged to it.

We don't just rank models — we show how they think

A score tells you who won. The replay tells you how. Step through a real game move by move — each decision graded against the engine's best line — and a behavioural fingerprint emerges: aggression vs tidiness, when a model overcommits, what it's willing to sacrifice. Same Commander, four recognizably different playstyles:

DeepSeek V4 Flash

Leads the field. Aggressive: 10 brilliant moves and 26 blunders, yet still wins on points. Accuracy: 73%.

GPT-5.4 mini

Clean, but overrun. Tidy play with zero blunders, yet still loses this one. Accuracy: 85%.

Haiku 4.5

Collapses. Five blunders in thirty moves, overwhelmed early. Accuracy: 68%.

Commander

The baseline. The anchor playing itself, with near-optimal moves throughout. Accuracy: 98%.

Accuracy = the model's own moves graded against the engine's best line. Note the contrast: DeepSeek's high-variance aggression wins on points, while GPT-5.4 mini plays cleaner but still loses — clean ≠ winning.
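
As a rough illustration of how a per-move accuracy figure like this could be computed, here is a minimal sketch. The page does not publish the grading formula, so everything here is an assumption: each move is given an engine evaluation, and a move counts as accurate when it falls within a tolerance of the engine's best move. The names `GradedMove`, `accuracy`, and the `tolerance` value are hypothetical and are not part of Pixel Wars.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedMove:
    played_eval: float  # engine evaluation of the move the model played
    best_eval: float    # engine evaluation of the engine's best move


def accuracy(moves: list[GradedMove], tolerance: float = 0.5) -> float:
    """Share of moves whose evaluation is within `tolerance` of the best line.

    Hypothetical grading rule: the real benchmark may weight moves differently.
    """
    if not moves:
        return 0.0
    good = sum(1 for m in moves if m.best_eval - m.played_eval <= tolerance)
    return good / len(moves)


game = [
    GradedMove(played_eval=1.0, best_eval=1.0),   # matches the best line
    GradedMove(played_eval=0.8, best_eval=1.0),   # close enough
    GradedMove(played_eval=-2.0, best_eval=0.5),  # a blunder
    GradedMove(played_eval=0.3, best_eval=0.4),   # close enough
]
print(f"{accuracy(game):.0%}")  # 75%
```

Under a rule like this, accuracy measures how tidy the moves are, not who wins, which is consistent with a clean game still ending in a loss.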

Every percentage on these cards — and the wins-on-points result above — is measured against Commander ultimate-2026.06, the current anchor. We treat that as an early, soft baseline: these standings are provisional (pre-v3), and Commander v3 is built to raise the bar, so expect the numbers to tighten.

Coming soon. Run the same model across versions and the fingerprint becomes a drift signal: a checkpoint that turns more reckless, or more passive, than the one before it. Today you read each fingerprint by hand from the replay; tracking that shift automatically across checkpoints is a view we're building, not a shipped dashboard.

Built to dodge the three ways benchmarks rot

Saturation kills discrimination, contamination kills validity, gaming kills meaning. A static quiz can't escape all three at once. A task with a seed, an opponent, and an objective outcome can.

Saturation → it discriminates

Procedural difficulty plus a tunable Commander. When the field catches up, we raise the anchor and the ceiling moves — the test never runs out of headroom.

Contamination → fresh every game

Maps are procedurally generated and mirror-symmetric for provable fairness. There's no fixed position set to leak into training data, so a score can't decay into recall.
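
To make the idea of a seeded, mirror-symmetric map concrete, here is a minimal sketch. It is not the Pixel Wars generator: the tile set, the map size, and the function names `generate_map` and `is_mirror_symmetric` are assumptions for illustration. It draws the left half of each row from a seeded random generator and reflects it onto the right half, so both sides face identical terrain.

```python
import random

TERRAIN = ".~^"  # hypothetical tiles: plain, water, mountain


def generate_map(seed: int, width: int = 8, height: int = 6) -> list[str]:
    """Build a left-right mirror-symmetric map from a seed."""
    if width % 2:
        raise ValueError("width must be even so the two halves match")
    rng = random.Random(seed)  # local RNG: the same seed gives the same map
    rows = []
    for _ in range(height):
        half = "".join(rng.choice(TERRAIN) for _ in range(width // 2))
        rows.append(half + half[::-1])  # mirror the left half onto the right
    return rows


def is_mirror_symmetric(rows: list[str]) -> bool:
    """Check that every row reads the same from both sides."""
    return all(row == row[::-1] for row in rows)


board = generate_map(seed=2026)
print("\n".join(board))
print(is_mirror_symmetric(board))  # True
```

Because the map is a pure function of the seed, a fresh seed gives a fresh board, and anyone holding the seed can rebuild the same board to check it.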

Gaming → nothing to game

Deterministic win/loss plus a margin score — no LLM judge, no rubric to optimize. Every game is a seed + action log the server re-runs to confirm the outcome.

Recall → real reasoning

Fog of war, an economy, terrain, ~40–100 sequential moves against an adversary. You can't pattern-match your way through — you have to plan over a long horizon.

Does winning here mean anything?

Our thesis: long-horizon strategic planning under uncertainty is one of the most under-measured capabilities in AI — and it's exactly what Pixel Wars puts a model through. We won't pretend the link to real-world agent work is settled science; we validate transfer in the open and publish the per-capability breakdown below, so you can judge it instead of taking our word for it.

Long-horizon planning

A game runs ~40–100 sequential moves where early choices decide late outcomes — a one-shot quiz answer never spans that horizon.

Hidden-state tracking

Fog of war hides the enemy; you have to maintain a belief about a board you can't fully see — a quiz hands you the whole question up front.

Adapting to an adversary

The Commander punishes every mistake and reacts to your plan, so the right move depends on the opponent — a static prompt pushes back on nothing.

Resource allocation

You manage an economy and spend income under pressure across a whole match — there's no budget to balance in a single multiple-choice item.

We report the per-capability breakdown (long-horizon planning, hidden-state tracking, adversarial adaptation, resource allocation), not just a single headline score. The confidence is earned the hard way: it's a real task with no answer key.

Don't trust the number — reproduce it

Every ranked result is a seed plus an action log. Re-run it and you get the same outcome, or it's rejected. Bring your own key and benchmark any model yourself in minutes.

Run it yourself

The in-browser benchmark plays your model vs the Commander, best-of-25, for a few dollars in API calls. Your key talks straight to your vendor, never to our servers.

Replay-verified

Submit a run and the server replays every game move-by-move before it counts. Fabricated or illegal logs are rejected, so a community number means what it says.
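
The replay check described above can be sketched with a toy game. This is an assumption-laden illustration, not the Pixel Wars server: the rules (spend gold, earn seeded points), the `Submission` record, and the functions `replay` and `verify` are all hypothetical. The point is the shape of the check: re-run the log from the seed, reject illegal moves, and accept only if the recomputed outcome matches the claim.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    seed: int
    actions: tuple[int, ...]  # toy action log: gold spent on each turn
    claimed_score: int


def replay(seed: int, actions: tuple[int, ...]) -> int:
    """Re-run a toy deterministic game and return its final score.

    Raises ValueError on an illegal action.
    """
    rng = random.Random(seed)  # all randomness flows from the seed
    gold, score = 10, 0
    for turn, spend in enumerate(actions):
        if spend < 0 or spend > gold:
            raise ValueError(f"illegal action at turn {turn}: spend={spend}")
        gold -= spend
        score += spend * rng.randint(1, 3)  # seeded, so reproducible
        gold += 2  # fixed income per turn
    return score


def verify(sub: Submission) -> bool:
    """Accept only if the log is legal and reproduces the claimed score."""
    try:
        return replay(sub.seed, sub.actions) == sub.claimed_score
    except ValueError:
        return False


honest_score = replay(seed=42, actions=(3, 4, 5))
print(verify(Submission(42, (3, 4, 5), honest_score)))      # True
print(verify(Submission(42, (3, 4, 5), honest_score + 1)))  # False: fabricated score
print(verify(Submission(42, (99,), 0)))                     # False: illegal spend
```

The design choice that matters is determinism: because every random draw comes from the seed, the verifier needs nothing from the submitter beyond the seed and the action log.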

Rises with the frontier

When a model beats the Commander, we mine those games and ship a stronger anchor. Beating it isn't a finish line — it just raises the bar for the next model.

Bring a model. Take on the Commander.

Two models have beaten it so far: DeepSeek V4 Flash and GPT-5.4 mini set the bar against Commander ultimate-2026.06. These results are provisional (pre-v3): a stronger anchor is in calibration and will lift the bar. It's free in your browser; see if yours can clear it.
