A BENCHMARK THAT STILL WORKS

Why a game you play beats a quiz you can memorize

Saturated quizzes tell you what a model knows. Pixel Wars tells you whether it can think — plan dozens of moves ahead, track hidden state under fog, and beat an opponent that punishes every mistake — on a board it has never seen, with a score nobody can fake or memorize.

Same models. One test tells them apart.

On a saturated knowledge quiz, today's models cluster within a few points — the test has run out of room to separate them. Run the same models against our Commander and the field fans out from 2% to 46%, with a clear line between the models that can out-plan the anchor and the ones that can't.

Figure: Clustered on a saturated quiz, spread out on Pixel Wars.

Left panel, typical knowledge quiz (illustrative): all six models cluster near 90%, within ~6 points of each other.

Right panel, Pixel Wars pts% vs the Commander (measured), axis 0 to 100%, where 50 = even vs the Commander:

1. DeepSeek V4 Flash: 46% pts, 40% win
2. GPT-5.4 mini: 45% pts, 36% win
3. Grok 4.20 non-reasoning: 12% pts, 0% win
4. GPT-5.4 nano: 11% pts, 4% win
5. Haiku 4.5: 9% pts, 0% win
6. Mistral Small: 2% pts, 0% win

Color key: teal = wins 36–40% of games vs the Commander; gray = 0–4%.

Left is illustrative — frontier models cluster near the ceiling on saturated quizzes. Right is measured: pts% vs the Commander, best-of-25 on the baseline map, fog on. pts% credits decisive play (turtling to a draw isn't safe); 50% is parity with the Commander. Numbers are vs ultimate-2026.06; a stronger v3 anchor is in calibration and results will be re-tagged to it.
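As a quick sanity check on the win rates reported above, the sketch below converts each percentage back into a whole number of wins. It assumes "best-of-25" means all 25 games are played and that win% is simply wins divided by 25; the page does not spell this out. It does not attempt to reproduce pts%, whose formula is not given here.

```python
# Assumption: all 25 games of a best-of-25 run are played, and win% = wins / 25.
GAMES = 25

# Win percentages vs the Commander, as reported on this page.
reported_win_pct = {
    "DeepSeek V4 Flash": 40,
    "GPT-5.4 mini": 36,
    "Grok 4.20 non-reasoning": 0,
    "GPT-5.4 nano": 4,
    "Haiku 4.5": 0,
    "Mistral Small": 0,
}


def wins_from_pct(pct: int, games: int = GAMES) -> int:
    """Convert a reported win percentage into a whole number of wins."""
    wins, remainder = divmod(pct * games, 100)
    if remainder:
        raise ValueError(f"{pct}% is not a whole number of wins out of {games}")
    return wins


wins = {model: wins_from_pct(pct) for model, pct in reported_win_pct.items()}

# With 25 games, a single game moves the win rate by this many points.
step = 100 / GAMES
```

Under that assumption every reported win rate maps to a whole number of games, and one game is worth four percentage points, so small gaps in win% between models are within a game or two of each other.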

We don't just rank models — we show how they think

A score tells you who won. The replay tells you how. Step through a real game move by move — each decision graded against the engine's best line — and a behavioural fingerprint emerges: aggression vs tidiness, when a model overcommits, what it's willing to sacrifice. Same Commander, four recognizably different playstyles:

Accuracy = the model's own moves graded against the engine's best line. Note the contrast: DeepSeek's high-variance aggression wins on points, while GPT-5.4 mini plays cleaner but still loses — clean ≠ winning.

Every percentage on these cards — and the wins-on-points result above — is measured against Commander ultimate-2026.06, the current anchor. We treat that as an early, soft baseline: these standings are provisional (pre-v3), and Commander v3 is built to raise the bar, so expect the numbers to tighten.

Coming: Run the same model across versions and the fingerprint becomes a drift signal: a checkpoint that turns more reckless, or more passive, than the one before it. Today you read each fingerprint by hand from the replay; tracking that shift automatically across checkpoints is a view we're building, not a shipped dashboard.

Built to dodge the three ways benchmarks rot

Saturation kills discrimination, contamination kills validity, gaming kills meaning. A static quiz can't escape all three at once. A task with a seed, an opponent, and an objective outcome can.

Saturation → it discriminates

Procedural difficulty plus a tunable Commander. When the field catches up, we raise the anchor and the ceiling moves — the test never runs out of headroom.

Contamination → fresh every game

Maps are procedurally generated and mirror-symmetric for provable fairness. There's no fixed position set to leak into training data, so a score can't decay into recall.
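To make "procedurally generated and mirror-symmetric" concrete, here is a minimal sketch of a seeded generator. It is not the Pixel Wars map generator: the function names, the tile set, and the grid size are all hypothetical, chosen only to show how a seed can reproduce a map and how symmetry can be checked mechanically.

```python
import random


def generate_map(seed: int, width: int = 8, height: int = 6) -> list[list[str]]:
    """Build a left-right mirror-symmetric grid from a seed (toy tile set)."""
    if width % 2 != 0:
        raise ValueError("width must be even so the two halves mirror exactly")
    rng = random.Random(seed)  # local RNG: the same seed always yields the same map
    tiles = [".", ".", ".", "#", "~"]  # hypothetical tiles: plain (weighted), wall, water
    grid = []
    for _ in range(height):
        left = [rng.choice(tiles) for _ in range(width // 2)]
        grid.append(left + left[::-1])  # right half is the mirror image of the left
    return grid


def is_mirror_symmetric(grid: list[list[str]]) -> bool:
    """Check that every row reads the same from both sides."""
    return all(row == row[::-1] for row in grid)
```

Because only the left half is sampled and the right half is derived from it, both sides face identical terrain by construction, and anyone holding the seed can regenerate the map and re-check the symmetry.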

Gaming → nothing to game

Deterministic win/loss plus a margin score — no LLM judge, no rubric to optimize. Every game is a seed + action log the server re-runs to confirm the outcome.

Recall → real reasoning

Fog of war, an economy, terrain, ~40–100 sequential moves against an adversary. You can't pattern-match your way through — you have to plan over a long horizon.

Does winning here mean anything?

Our thesis: long-horizon strategic planning under uncertainty is one of the most under-measured capabilities in AI — and it's exactly what Pixel Wars puts a model through. We won't pretend the link to real-world agent work is settled science; we validate transfer in the open and publish the per-capability breakdown below, so you can judge it instead of taking our word for it.

Long-horizon planning

A game runs ~40–100 sequential moves where early choices decide late outcomes — a one-shot quiz answer never spans that horizon.

Hidden-state tracking

Fog of war hides the enemy; you have to maintain a belief about a board you can't fully see — a quiz hands you the whole question up front.

Adapting to an adversary

The Commander punishes every mistake and reacts to your plan, so the right move depends on the opponent — a static prompt pushes back on nothing.

Resource allocation

You manage an economy and spend income under pressure across a whole match — there's no budget to balance in a single multiple-choice item.

We report the per-capability breakdown (long-horizon planning, hidden-state tracking, adversarial adaptation, resource allocation), not just a single headline score. The confidence is earned the hard way: it's a real task with no answer key.

Don't trust the number — reproduce it

Every ranked result is a seed plus an action log. Re-run it and you get the same outcome, or it's rejected. Bring your own key and benchmark any model yourself in minutes.

Run it yourself

The in-browser benchmark plays your model vs the Commander, best-of-25, for a few dollars in API calls. Your key talks straight to your vendor — never our servers.

Replay-verified

Submit a run and the server replays every game move-by-move before it counts. Fabricated or illegal logs are rejected, so a community number means what it says.
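The replay check can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: `Submission`, `replay`, `verify`, and the toy one-dimensional game are all hypothetical stand-ins. The point is the shape of the check: rebuild the game from the seed, apply the logged actions one by one, reject anything illegal, and accept only if the recomputed outcome matches the claimed one.

```python
import random
from dataclasses import dataclass

BOARD_SIZE = 10  # toy stand-in for the real board
MOVES = {"left": -1, "right": 1, "stay": 0}  # hypothetical action set


@dataclass(frozen=True)
class Submission:
    seed: int
    actions: tuple[str, ...]
    claimed_outcome: str  # "win" or "loss" in this toy game


def replay(seed: int, actions: tuple[str, ...]) -> str:
    """Re-run the toy game deterministically; raise ValueError on an illegal action."""
    rng = random.Random(seed)  # the seed fixes the start position and the goal
    position = rng.randrange(BOARD_SIZE)
    goal = rng.randrange(BOARD_SIZE)
    for turn, action in enumerate(actions):
        if action not in MOVES:
            raise ValueError(f"turn {turn}: unknown action {action!r}")
        position += MOVES[action]
        if not 0 <= position < BOARD_SIZE:
            raise ValueError(f"turn {turn}: move leaves the board")
    return "win" if position == goal else "loss"


def verify(submission: Submission) -> bool:
    """Accept a run only if replaying it reproduces the claimed outcome."""
    try:
        return replay(submission.seed, submission.actions) == submission.claimed_outcome
    except ValueError:
        return False  # illegal or malformed logs are rejected outright
```

Since the outcome is recomputed rather than trusted, a submitter cannot claim a win the log does not produce, and a log containing an illegal move fails before any outcome is compared.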

Rises with the frontier

When a model beats the Commander, we mine those games and ship a stronger anchor. Beating it isn't a finish line — it just raises the bar for the next model.

Bring a model. Take on the Commander.

Two models take games off it so far: DeepSeek V4 Flash and GPT-5.4 mini set the bar at 40% and 36% wins, vs Commander ultimate-2026.06, though neither has reached the 50% pts parity mark yet. Provisional, pre-v3: a stronger anchor is in calibration and will lift the bar. Free in your browser; see if yours can clear it.