Reproducible, adversarial evaluation for long-horizon agents

A reproducible agentic eval for frontier models

Models fight a calibrated, self-improving opponent: long-horizon, adversarial, every game replay-verified.

Pixel Wars is an agentic evaluation wrapped as a war game: an LLM plans over dozens of moves under fog of war against the Commander, a calibrated AI anchor that hardens every time it's beaten. Every match has a fresh, procedurally generated map (nothing to memorize), an objective win or loss, and a replay-verified result. Watch the models fight, or run it in your own harness.

Benchmarks are losing their usefulness

Headline evals are pinned near the ceiling, leak into training data, and get gamed. A score stops measuring reasoning the moment it becomes a target. Pixel Wars is the opposite kind of test.

A task, not a quiz

Long-horizon spatial tactics in a deterministic, fair game — fog of war, economy, terrain, combat. There's no answer key to memorize.

Fresh every game

Maps are procedurally generated and mirror-symmetric for provable fairness. Every match is a new position, so a model can't overfit to the test set.
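To make the fairness claim concrete, here is a minimal sketch of seeded, mirror-symmetric map generation. It is not the Pixel Wars generator: the tile set, grid size, and the function names `generate_map` and `is_mirror_symmetric` are assumptions made up for illustration. The idea is only that one half is drawn from a seeded random source and the other half is its reflection, so both sides start from identical terrain.

```python
import random

# Hypothetical tile set; the real game's terrain types are not specified here.
TERRAIN = ["plain", "forest", "hill", "water"]


def generate_map(seed: int, width: int = 8, height: int = 6) -> list[list[str]]:
    """Build a left-right mirror-symmetric grid from a seed (illustrative only)."""
    if width % 2:
        raise ValueError("width must be even so the two halves match exactly")
    rng = random.Random(seed)  # local RNG: the same seed always yields the same map
    half = [[rng.choice(TERRAIN) for _ in range(width // 2)] for _ in range(height)]
    # Reflect each row so the right half mirrors the left half.
    return [row + row[::-1] for row in half]


def is_mirror_symmetric(grid: list[list[str]]) -> bool:
    """Check that every row reads the same from both sides."""
    return all(row == row[::-1] for row in grid)
```

Because the grid is a pure function of the seed, publishing the seed is enough for anyone to rebuild the same position and check its symmetry.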

Replay-verified

A game is a seed plus an action log; the server re-runs it to confirm the result. No trust-me scores — every ranked number is reproducible.
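The seed-plus-action-log idea can be sketched with a toy engine. Everything below is hypothetical (the two-action rule set, `simulate`, `result_digest`, and `verify` are not the Pixel Wars server API); it only shows why a deterministic engine lets a verifier recompute a result instead of trusting a submitted score.

```python
import hashlib
import json
import random


def simulate(seed: int, actions: list[str]) -> dict:
    """Toy deterministic engine: players alternate, damage rolls come from the seed."""
    rng = random.Random(seed)
    hp = [10, 10]
    for turn, action in enumerate(actions):
        player = turn % 2
        if action == "attack":
            hp[1 - player] -= rng.randint(1, 3)
        elif action == "heal":
            hp[player] = min(10, hp[player] + 1)
        else:
            raise ValueError(f"illegal action: {action!r}")
        if hp[1 - player] <= 0:
            return {"winner": player, "turns": turn + 1}
    return {"winner": None, "turns": len(actions)}  # log ended with no winner


def result_digest(result: dict) -> str:
    """Stable fingerprint of a result, so claims can be compared cheaply."""
    return hashlib.sha256(json.dumps(result, sort_keys=True).encode()).hexdigest()


def verify(seed: int, actions: list[str], claimed_digest: str) -> bool:
    """Re-run the log from the seed and accept only if the outcome matches the claim."""
    return result_digest(simulate(seed, actions)) == claimed_digest
```

A client would submit the seed, the action log, and the claimed digest; a forged result fails because the re-run produces a different digest.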

Four ways to run it

Free in your browser. Model seats are bring-your-own-key: your key goes straight to your vendor and never touches our servers or logs.

Your LLM vs the Commander

The benchmark matchup — can your model out-plan the calibrated anchor?

LLM vs LLM

Two models head-to-head — the arena, ranked on a ladder.

You vs the Commander

Play the eval yourself against the baseline — no key needed.

You vs your LLM

Spar with a model you key in — or coach it.

Wire it into your eval loop

A score you run once is a snapshot. Teams shipping agents need a repeatable signal they can watch move over time, and that's the product we're building. You can run a full game today; the cards below show how Pixel Wars is becoming part of the loop you already run.

Benchmark any model now

In-browser BYOK: your model vs the Commander, best-of-25, fog on, large maps. Expect a few dollars of API usage, with your key going straight to your vendor. Every game is a seed plus an action log that the server re-runs move by move, so any result is reproducible.

Current public numbers are measured against Commander ultimate-2026.06; v3 is in calibration and hardens the anchor, so read today's figures as provisional, pre-v3.

Run your own (BYOK)

Drop it into your harness (coming)

An Inspect-compatible task wrapper so Pixel Wars drops into the eval suite you already run — same grading, same replay-verified result, no bespoke glue.

Track it across checkpoints (coming)

Point it at successive checkpoints or nightly builds and watch strategic skill, economy, and scouting move version-over-version — long-horizon regression testing, not a one-shot number.

The bigger arc: Pixel Wars is environment one. The same deterministic, replay-verified, self-improving engine is how we intend to measure long-horizon agents in the arenas that come next — logistics, negotiation, adversarial planning.

Request early access →

The Commander is the anchor

One calibrated classical AI is the fixed yardstick — think of it as the Stockfish of Pixel Wars. We score timed-out games like boxing (on who was pressing to win, not a flat draw), so turtling isn't safe and the metric actually discriminates between models.

Two models have beaten the Commander on large, fog-on maps so far: DeepSeek V4 Flash and GPT-5.4 mini. These results are provisional: they were measured against Commander ultimate-2026.06, the current anchor, and v3 is in calibration and raises the ceiling.

5-outcome scoring: win / loss / win-by-points / loss-by-points / true draw.
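One way to picture the five outcomes is as a small classifier. This is a sketch under stated assumptions: the page does not say how "pressing to win" is measured, so the integer points inputs and the `classify` function are hypothetical stand-ins, not the game's actual scoring code.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    WIN = "win"
    WIN_BY_POINTS = "win-by-points"
    DRAW = "true draw"
    LOSS_BY_POINTS = "loss-by-points"
    LOSS = "loss"


def classify(decisive_winner: Optional[str], my_points: int,
             opp_points: int, me: str) -> Outcome:
    """Map a finished game to one of five outcomes from `me`'s point of view.

    decisive_winner is the side that won outright, or None if the game timed out.
    The points values are an assumed measure of who was pressing to win.
    """
    if decisive_winner is not None:
        return Outcome.WIN if decisive_winner == me else Outcome.LOSS
    # Timed-out game: decide on points instead of calling a flat draw.
    if my_points > opp_points:
        return Outcome.WIN_BY_POINTS
    if my_points < opp_points:
        return Outcome.LOSS_BY_POINTS
    return Outcome.DRAW
```

Under this scheme a side that stalls until the clock runs out still loses on points if the opponent was the one pressing, which is why turtling isn't safe.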

Glicko-2: one unified rating for humans and AI on the same ladder.
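For readers unfamiliar with Glicko-2, the sketch below follows the update steps published in Mark Glickman's description of the system. It is not the Pixel Wars ladder code: the page does not say which system constant is used or how the five outcomes map to scores, so `tau=0.5` and the plain 0-to-1 score inputs are assumptions.

```python
import math

SCALE = 173.7178  # conversion factor between the Glicko and Glicko-2 scales


def _g(phi: float) -> float:
    return 1.0 / math.sqrt(1.0 + 3.0 * phi * phi / math.pi ** 2)


def _expected(mu: float, mu_j: float, phi_j: float) -> float:
    return 1.0 / (1.0 + math.exp(-_g(phi_j) * (mu - mu_j)))


def glicko2_update(rating, rd, vol, games, tau=0.5, eps=1e-6):
    """Return (rating, rd, vol) after one rating period.

    games: list of (opp_rating, opp_rd, score), with score in [0, 1].
    """
    mu, phi = (rating - 1500.0) / SCALE, rd / SCALE
    if not games:  # no games played: only the rating deviation grows
        return rating, SCALE * math.sqrt(phi * phi + vol * vol), vol

    v_inv, delta_sum = 0.0, 0.0
    for opp_rating, opp_rd, score in games:
        mu_j, phi_j = (opp_rating - 1500.0) / SCALE, opp_rd / SCALE
        g, e = _g(phi_j), _expected(mu, mu_j, phi_j)
        v_inv += g * g * e * (1.0 - e)
        delta_sum += g * (score - e)
    v = 1.0 / v_inv
    delta = v * delta_sum

    # Solve for the new volatility with the iterative procedure from the paper.
    a = math.log(vol * vol)

    def f(x: float) -> float:
        ex = math.exp(x)
        return (ex * (delta ** 2 - phi ** 2 - v - ex)
                / (2.0 * (phi ** 2 + v + ex) ** 2)) - (x - a) / tau ** 2

    A = a
    if delta ** 2 > phi ** 2 + v:
        B = math.log(delta ** 2 - phi ** 2 - v)
    else:
        k = 1
        while f(a - k * tau) < 0:
            k += 1
        B = a - k * tau
    fA, fB = f(A), f(B)
    while abs(B - A) > eps:
        C = A + (A - B) * fA / (fB - fA)
        fC = f(C)
        if fC * fB <= 0:
            A, fA = B, fB
        else:
            fA /= 2.0
        B, fB = C, fC
    new_vol = math.exp(A / 2.0)

    phi_star = math.sqrt(phi ** 2 + new_vol ** 2)
    new_phi = 1.0 / math.sqrt(1.0 / phi_star ** 2 + 1.0 / v)
    new_mu = mu + new_phi ** 2 * delta_sum
    return SCALE * new_mu + 1500.0, SCALE * new_phi, new_vol
```

Feeding in the worked example from Glickman's paper (a 1500-rated player with RD 200 and volatility 0.06 who beats a 1400/RD 30 opponent and loses to 1550/RD 100 and 1700/RD 300 opponents) should give roughly 1464.06, 151.52, and 0.05999. Because the update only needs ratings and scores, the same function applies whether the opponent is a human or a model.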

It rises with the frontier

When a model beats the Commander, we mine those games and harden the anchor: tune its evaluation, add the missed counter, deepen its search. We then re-run the benchmark against the stronger version, keeping the old numbers and tagging them with the version they were run against. Beating the benchmark can't permanently solve or memorize it; it just raises the bar. That's the durable, un-gameable angle.
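Keeping old numbers tagged by anchor version amounts to making the version part of every result's key. The sketch below is hypothetical: the `Result` record, the `win_rates_by_anchor` helper, the model name, and the counts in the usage lines are all made up for illustration and are not published figures.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    anchor_version: str  # which Commander version the series was played against
    wins: int
    games: int


def win_rates_by_anchor(results: list[Result]) -> dict[str, dict[str, float]]:
    """Group win rates by anchor version so numbers are never compared across versions."""
    table: dict[str, dict[str, float]] = defaultdict(dict)
    for r in results:
        table[r.anchor_version][r.model] = r.wins / r.games
    return dict(table)


# Illustrative, invented numbers: the same model measured against two anchors.
history = [
    Result("model-a", "ultimate-2026.06", wins=10, games=25),
    Result("model-a", "v3", wins=5, games=25),
]
table = win_rates_by_anchor(history)
```

With this layout a drop in win rate after a hardening pass reads as a stronger anchor, not a regression in the model, because each figure carries the version it was measured against.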

Point your model at the Commander.

Run the benchmark free in your browser (bring your own key), or watch the frontier models fight on the public ladder.

It's also a game. The full version is coming soon to Steam (and will be Steam Deck Verified). Wishlist it.