Run your own benchmark — bring your own key
You don't have to take our numbers on faith, and you don't have to wait for us to add your model. The free web client has a Benchmark mode: pick a model, paste your API key, and score it against the Commander yourself — entirely in your browser.
How it works
- Bring your own key. The key lives in your browser session and talks straight to your vendor — it never touches our servers or logs.
- Pick a scope. Best-of-5 for a quick read, or best-of-25 (the official depth) for a stable number. Run the baseline map, or all the battlefields for the full picture.
- It runs headless. Games play out against the full-strength Commander under the same methodology as our official run (fog on, large maps), and a live table updates as each match completes.
- Share the result. You get a per-battlefield and aggregate pts% / win% card, with our published baseline alongside for comparison, and a copy button to drop it into a doc or a chat.
Same Commander, same methodology, your key. The number you get is the number you can quote.
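To make the result card concrete, here is a minimal sketch of how per-battlefield and aggregate win% / pts% could be tallied from a list of finished games. It is not the tool's actual code. The `GameResult` fields, the battlefield names, and the choice to pool all games for the aggregate row are assumptions made for illustration, and the per-game `pts` value is taken as an input because the margin weighting itself is not specified here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GameResult:
    battlefield: str  # hypothetical map name
    won: bool         # True if the model beat the Commander
    pts: float        # assumed per-game points share in [0, 1]


def summarize(results):
    """Return {battlefield: (win_pct, pts_pct)} plus a pooled 'aggregate' row."""
    if not results:
        raise ValueError("no games to summarize")

    # Group games by the battlefield they were played on.
    by_map = defaultdict(list)
    for r in results:
        by_map[r.battlefield].append(r)

    def pct(games):
        n = len(games)
        win_pct = 100.0 * sum(g.won for g in games) / n
        pts_pct = 100.0 * sum(g.pts for g in games) / n
        return round(win_pct, 1), round(pts_pct, 1)

    card = {name: pct(games) for name, games in by_map.items()}
    # Assumption: the aggregate pools every game rather than averaging maps.
    card["aggregate"] = pct(results)
    return card


games = [
    GameResult("baseline", True, 0.75),
    GameResult("baseline", False, 0.25),
    GameResult("ridge", True, 1.0),
    GameResult("ridge", True, 0.5),
]
card = summarize(games)
print(card)
```

Because the tally is plain arithmetic over game outcomes, the same games always produce the same card.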
A note on cost and time
Every game is a full match with dozens of model calls, so a run spends your tokens and takes real time. A best-of-5 on one map is quick; a best-of-25 across all battlefields is hundreds of games and can run for a while. The tool shows the game count up front, and you can cancel anytime. Keep the tab open while it runs: for now, results live only in the page.
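For a rough feel of the budget before you start, the arithmetic can be sketched as below. Every number in the example is a placeholder, not a measurement: the battlefield count, calls per game, and tokens per call are all hypothetical, and the sketch assumes a best-of-N series plays all N games. Treat the game count the tool shows up front as the authoritative figure.

```python
def estimate_run(series_length, battlefields, calls_per_game, tokens_per_call):
    """Back-of-envelope size of a benchmark run.

    All four inputs are assumptions supplied by the caller; none of them
    are published figures.
    """
    games = series_length * battlefields   # assumes every game in a series is played
    model_calls = games * calls_per_game
    tokens = model_calls * tokens_per_call
    return {"games": games, "model_calls": model_calls, "tokens": tokens}


# Placeholder inputs: 40 calls per game and 2,000 tokens per call are guesses.
quick = estimate_run(series_length=5, battlefields=1,
                     calls_per_game=40, tokens_per_call=2_000)
# Placeholder: 8 battlefields is a made-up count used only for illustration.
full = estimate_run(series_length=25, battlefields=8,
                    calls_per_game=40, tokens_per_call=2_000)
print(quick)
print(full)
```

Multiply the token estimate by your vendor's per-token price to get a cost ceiling for the run.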
From one run to a regression test
A single run tells you where a model stands today. The reason to reach for the in-browser tool is that the run is repeatable. Scoring is deterministic (win/loss plus a margin-weighted pts%), with no LLM-as-judge in the loop, and the games are procedural and fresh, so there's nothing to memorize between runs. Point it at your next checkpoint with the same scope and the number is directly comparable to the last one. The difference is signal, not noise.
Today that's a manual run: you point it at a checkpoint and read the number. The automated, Inspect-compatible loop that turns it into a regression test you track from one checkpoint to the next is what we're building next. We walk through that workflow, alongside the per-move accuracy and replay viewer that are live now, in "Pixel Wars as a regression test."
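Until the automated loop exists, the manual comparison can be sketched in a few lines. This assumes you copy the aggregate pts% and win% from two result cards by hand. The `RunScore` record, its field names, and the checkpoint labels are hypothetical and not part of the tool. The one rule the sketch enforces comes from the text above: two numbers are only comparable when the scope matches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunScore:
    checkpoint: str       # hypothetical label for the model version
    series_length: int    # 5 or 25, per the scope you picked
    battlefields: tuple   # names of the maps included in the run
    pts_pct: float        # aggregate pts% copied from the result card
    win_pct: float        # aggregate win% copied from the result card


def compare(previous, current):
    """Return the score deltas, refusing to compare mismatched scopes."""
    prev_scope = (previous.series_length, previous.battlefields)
    curr_scope = (current.series_length, current.battlefields)
    if prev_scope != curr_scope:
        raise ValueError("scopes differ; scores are not directly comparable")
    return {
        "pts_delta": round(current.pts_pct - previous.pts_pct, 1),
        "win_delta": round(current.win_pct - previous.win_pct, 1),
    }


# Made-up scores for two made-up checkpoints, same scope.
before = RunScore("ckpt-1", 25, ("baseline",), pts_pct=58.0, win_pct=60.0)
after = RunScore("ckpt-2", 25, ("baseline",), pts_pct=61.5, win_pct=64.0)
delta = compare(before, after)
print(delta)
```

A negative `pts_delta` on a matching scope is the regression you would want to investigate, for example by opening the replays for the battlefields that dropped.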