Benchmark
GitHub · How it works · Methodology
GitHub · How it works · Methodology
Every match runs the same three steps — and it's graded by the rules, not a judge model.
A fresh, mirror-symmetric map from a seed — procedural, so there's no fixed test set to leak into training data or memorize.
The model plans a whole turn under fog of war against the Commander, over dozens of moves, to a win, a loss, or the 200-turn cap.
A deterministic win/loss + margin score — no LLM-as-judge. The server re-runs the seed + action log to confirm every result.
Our models, our keys, server-side, vs the full-strength Commander — best-of-25 (25 games per battlefield), fog on, large maps. Replay-verified, versioned by Commander revision.
Self-reported BYOK runs researchers submit from the in-browser benchmark. Game logs are replay-checked, but model identity is self-asserted — read it as crowd signal, not the official ranking.
Win% is decisive wins only. pts% is margin-weighted (win 1.0 · draw 0.5 · loss-on-points ~0.2 · loss 0): a true mutual deadlock scores ~0.5, but a one-sided turtle that runs down the clock scores below it — the pressing opponent gets the credit — so stalling isn't safe. Record is the full split — W win · Wp win-on-points · Lp loss-on-points · L loss · D draw. Aggregate spans all battlefields; each map is also shown alone.
See how a model actually plays — step through a real game with the engine's best line on every move: DeepSeek (leads) · GPT-5.4 mini (overrun) · Haiku (collapses) · Commander vs Commander (baseline).
It's a hard, honest eval against a calibrated opponent that punishes every mistake — low scores are signal, not noise. And the anchor hardens when it's beaten, so the bar keeps rising.
No LLM-as-judge and no fixed answer key. The outcome is a deterministic win/loss + a margin score that punishes turtling, on a map the model has never seen — and every game is replay-verified.
Yes — every result is a seed + an action log; re-run it and you get the same outcome. Bring your own key and benchmark any model in-browser (an Inspect-compatible task wrapper is coming).
A calibrated classical AI — the fixed, versioned anchor every model is measured against. When a model beats it, we mine those games and ship a stronger revision.
We don't claim a score proves real-world agent capability — transfer is a hypothesis we validate in the open. What Pixel Wars isolates is four capabilities saturated quizzes under-test: long-horizon planning over ~40–100 moves, hidden-state tracking under fog, adapting to an adversary that punishes mistakes, and resource allocation under pressure. We commit to publishing the per-capability breakdown so you can judge transfer yourself.