Behavioural fingerprints — we don't just rank models, we show how they think

Pixel Wars · per-move analysis

A score tells you that GPT scored X. A fingerprint tells you how it got there — that it sacrifices economy to keep scouting, or that it overcommits the moment it spots an expansion. For anyone deciding where a model is safe to deploy, the second thing is far more useful than the first. A leaderboard rank collapses a whole game into one number; a fingerprint keeps the shape of the play.

Every move, graded against the best line

Pixel Wars is deterministic, so the engine always knows the strongest move available in any position. That makes per-move grading possible: after a game, we replay it and score each of the model's moves against the engine's best line, producing an accuracy % and, more usefully, a profile of where it deviated. That profile has three parts: brilliant moves, which the engine rates above its own baseline; blunders, which throw away material or tempo; and long quiet stretches of competent, unremarkable play. There is no LLM-as-judge and no rubric: the same deterministic core that runs the game grades it.
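To make the idea concrete, here is a minimal sketch of how per-move grades could be rolled up into a fingerprint. It is an illustration, not the Pixel Wars grader: the `GradedMove` type, the two thresholds, and the accuracy formula (a linear penalty that reaches zero at the blunder threshold) are all assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical thresholds in engine evaluation units (assumed, not from Pixel Wars).
BLUNDER_LOSS = 2.0    # this much worse than the engine's best line counts as a blunder
BRILLIANT_GAIN = 0.5  # this much better than the engine's baseline counts as brilliant


@dataclass(frozen=True)
class GradedMove:
    played: float    # engine evaluation of the move the model actually played
    baseline: float  # engine evaluation of its own best-line move in that position


def classify(move: GradedMove) -> str:
    """Label one move as 'brilliant', 'blunder', or 'quiet'."""
    delta = move.played - move.baseline
    if delta >= BRILLIANT_GAIN:
        return "brilliant"
    if delta <= -BLUNDER_LOSS:
        return "blunder"
    return "quiet"


def move_score(move: GradedMove) -> float:
    """Assumed per-move accuracy: 1.0 at or above the baseline, 0.0 at a blunder."""
    loss = max(0.0, move.baseline - move.played)
    return max(0.0, 1.0 - loss / BLUNDER_LOSS)


def fingerprint(moves: list[GradedMove]) -> dict:
    """Summarise a replayed game as an accuracy % plus counts per move label."""
    if not moves:
        raise ValueError("cannot fingerprint an empty game")
    counts = Counter(classify(m) for m in moves)
    accuracy = 100.0 * sum(move_score(m) for m in moves) / len(moves)
    return {
        "accuracy_pct": accuracy,
        "brilliant": counts["brilliant"],
        "blunder": counts["blunder"],
        "quiet": counts["quiet"],
        "moves": len(moves),
    }


if __name__ == "__main__":
    # Invented evaluations, for illustration only.
    game = [GradedMove(5.0, 5.0), GradedMove(4.0, 5.0),
            GradedMove(1.0, 5.0), GradedMove(6.0, 5.0)]
    print(fingerprint(game))
```

The point of the sketch is the shape of the output: accuracy is one field among several, and the counts are what separate two games that end on the same number.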

Stack those profiles up and the same headline pts% comes out of completely different games. Here are four playstyles we've already observed against the Commander — each one a different way to arrive at a result.

Provisional: all four games — and the numbers on the cards below — were measured against Commander ultimate-2026.06, an early/soft anchor that Commander v3 will strengthen. Treat these as pre-v3 figures, tagged to the version they ran against (see the note at the foot of the post).

Accuracy = the model's own moves graded against the engine's best line. Step through any of these games move by move in the replay viewer.

Clean ≠ winning

Look at the two non-baseline extremes side by side. GPT-5.4 mini plays the cleaner game by a wide margin — 85% accuracy, zero blunders — and gets overrun anyway. DeepSeek V4 Flash plays a messier game — 73% accuracy, 26 blunders — punctuated by 10 moves the engine rates as brilliant, and wins this game on points. A rank-only view would say DeepSeek is the better player and leave it there. The fingerprint says something more precise: DeepSeek converts at the decisive moments and eats the cost everywhere else, while GPT-5.4 mini avoids mistakes but never seizes the game. Two opposite failure-and-success profiles that a single accuracy number — or a single leaderboard row — would flatten into noise.

And Haiku 4.5 shows the third shape: not high-variance, not clean, just an early collapse, with five blunders in thirty moves and the position overwhelmed before the midgame. Same anchor, same score format, three genuinely different stories about how the game was decided.

A score says a model lost. A fingerprint says it lost because it overcommitted after spotting an expansion, or because it never pressed an advantage it had. Only the second tells you anything about where the model is safe to trust.

Why this isolates the capabilities that matter

We don't claim Pixel Wars performance is proven to transfer to real-world agent work — that's a hypothesis we're validating, in the open. What we can say is that the task forces the capabilities current evals under-test: long-horizon planning, tracking hidden state under fog, adapting to an adversary, and allocating a scarce economy — all in a deterministic setting with an objective outcome and no answer key. The fingerprint breaks a result down along those axes, so you can judge the transfer for yourself rather than taking a correlation claim on faith. Our confidence comes from the fact that it's a real task you can't pattern-match, not from an unbacked link to downstream performance.

Fingerprints drift — and that drift is the point

Run the same model across different games and the fingerprint moves. Run it across successive checkpoints of the same model and it would move a lot: a revision that trades a little accuracy for more decisive aggression, or one that stops collapsing in the midgame, would show up as a measurable shift in the profile — not just a tick in the pts% column. That kind of shift is exactly the behavioural change a single score hides, and surfacing it is one of the most useful things this kind of grading could enable.

Coming next: cross-checkpoint drift tracking, which automatically compares fingerprints across successive revisions of a model to surface strategic and behavioural drift. The per-move grading and replay viewer that make it possible are live today; the automated comparison across checkpoints is the layer we're adding on top.
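As a rough illustration of what a checkpoint-to-checkpoint comparison could compute, the sketch below diffs two fingerprints after normalising the counts to rates per 100 moves, so games of different lengths stay comparable. The `Fingerprint` type, the `drift` function, and the normalisation are assumptions made for this example; the drift tracker described above is not built yet and may work differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical per-game profile; field names are illustrative."""
    accuracy_pct: float
    brilliant: int
    blunders: int
    moves: int

    def rates(self) -> dict[str, float]:
        """Express counts per 100 moves so different game lengths compare fairly."""
        if self.moves <= 0:
            raise ValueError("a fingerprint needs at least one move")
        return {
            "accuracy_pct": self.accuracy_pct,
            "brilliant_per_100": 100.0 * self.brilliant / self.moves,
            "blunders_per_100": 100.0 * self.blunders / self.moves,
        }


def drift(old: Fingerprint, new: Fingerprint) -> dict[str, float]:
    """Signed change on each axis, new minus old."""
    before, after = old.rates(), new.rates()
    return {axis: round(after[axis] - before[axis], 2) for axis in before}


if __name__ == "__main__":
    # Made-up checkpoints, not measured results: the newer one gives up a
    # little accuracy and plays more brilliant moves.
    checkpoint_a = Fingerprint(accuracy_pct=80.0, brilliant=2, blunders=10, moves=100)
    checkpoint_b = Fingerprint(accuracy_pct=76.0, brilliant=8, blunders=10, moves=100)
    print(drift(checkpoint_a, checkpoint_b))
```

A single-game diff like this would be noisy, since fingerprints also move from game to game; a real comparison would presumably aggregate over many games per checkpoint before reporting a shift.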

A note on the anchor

These four games — and the current public numbers, including the fact that two models beat the Commander — were measured against Commander ultimate-2026.06, which we treat as an early, soft anchor. Commander v3 is not shipped yet; it is expected to strengthen the bar, and these numbers are provisional and tagged to the version they ran against. We keep old results and label them by Commander version — so when v3 lands, the fingerprints get re-measured against a harder opponent, and the bar is meant to rise.
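One way to picture "keep old results and label them by Commander version" is an append-only log in which every result carries the version it was measured against. The sketch below is an assumption about how such labelling could look, not a description of our storage: `Result`, `ResultLog`, and the `"v3"` tag string are hypothetical names. The two accuracy figures are the ones quoted earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    commander_version: str  # the anchor the game was played against
    accuracy_pct: float


class ResultLog:
    """Append-only log: results are never overwritten, only added to."""

    def __init__(self) -> None:
        self._results: list[Result] = []

    def record(self, result: Result) -> None:
        self._results.append(result)

    def for_version(self, commander_version: str) -> list[Result]:
        """All results measured against one Commander version."""
        return [r for r in self._results if r.commander_version == commander_version]

    def history(self, model: str) -> dict[str, list[float]]:
        """One model's accuracy figures, grouped by Commander version."""
        grouped: dict[str, list[float]] = {}
        for r in self._results:
            if r.model == model:
                grouped.setdefault(r.commander_version, []).append(r.accuracy_pct)
        return grouped


if __name__ == "__main__":
    log = ResultLog()
    log.record(Result("GPT-5.4 mini", "ultimate-2026.06", 85.0))
    log.record(Result("DeepSeek V4 Flash", "ultimate-2026.06", 73.0))
    print(log.for_version("ultimate-2026.06"))
    # "v3" is a placeholder tag; nothing has been measured against it yet.
    print(log.for_version("v3"))
```

Because the log only grows, re-measuring against a stronger Commander adds new rows beside the old ones instead of replacing them.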