About the Project
mage-bench is a benchmark and observability stack for large language models playing real games of Magic: The Gathering inside a full XMage rules engine.
Why Magic?
It's hard to benchmark LLM reasoning. Chat benchmarks reward sycophancy over intelligence. Coding benchmarks are heavily gamed and largely saturated - the abilities that are hardest to measure are exactly the ones we care about most. I propose that deep, competitive strategy games are the answer. Because they're competitive, the benchmark is harder to saturate - what matters is being smarter than your opponent, up to the skill ceiling of the game.
I chose Magic: The Gathering. It's incredibly complex, with 30+ years of cards printed, some of them extremely complicated. It's well-known and widely beloved. And I own a bunch of cards that I want LLMs to bid up the prices of.
As of today, even frontier models play pretty bad Magic. This is okay. The goal of mage-bench is not to produce the best possible Magic bot. It's to provide a level playing field where LLMs can show off how much smarter they are than each other.
My current philosophy is that I do NOT provide strategy advice to the LLMs in their prompts - they need to figure out for themselves that playing lands is wise, or that they should block that 20/20.
What mage-bench is
The project is a fork of XMage plus a harness that lets LLM agents pilot decks through structured tools. The models see the actual game state, choose legal actions, and the engine resolves the consequences under the same rules a human game would use.
This is not a toy simulator or a simplified card battler. The point is to evaluate models against the full messiness of Magic: hidden information, stack interaction, combat math, priority, side effects, and long multi-turn planning.
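To make the loop concrete, here is a minimal sketch of what "see the state, choose a legal action, let the engine resolve it" could look like. Every name here (`ToyEngine`, `visible_state`, `legal_actions`, `play_priority`) is illustrative, not mage-bench's actual API:

```python
class ToyEngine:
    """Stand-in for the rules engine, for illustration only."""

    def __init__(self):
        self.log = []

    def visible_state(self, player_id):
        # Only the information this player is allowed to see.
        return {"turn": 1, "player": player_id}

    def legal_actions(self, player_id):
        # The engine, not the model, decides what is legal.
        return ["play_land", "pass_priority"]

    def apply(self, action):
        # The engine resolves consequences under the real rules.
        self.log.append(action)


def play_priority(engine, agent_choose, player_id):
    """Ask the agent for one action whenever it holds priority."""
    state = engine.visible_state(player_id)
    actions = engine.legal_actions(player_id)
    choice = agent_choose(state, actions)
    if choice not in actions:   # guard against illegal picks
        choice = actions[-1]    # fall back to passing priority
    engine.apply(choice)
    return choice
```

The key design point this sketch captures: the model chooses from an engine-enumerated list of legal actions, so it can never cheat the rules, only play them badly.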
What the project measures
mage-bench tracks match results across multiple formats, computes per-format and combined Elo ratings for 1v1 play, and publishes replays, logs, and derived stats to the website.
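The exact rating method isn't spelled out here, but the standard Elo update for a single 1v1 result looks like this (the K-factor of 32 is a common default, not necessarily what mage-bench uses):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update for one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    Returns the updated (r_a, r_b) pair; the update is zero-sum.
    """
    # Expected score for A given the current rating gap.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For example, two 1500-rated models are each expected to score 0.5, so a win moves the winner up by 16 points and the loser down by 16.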
It also runs a separate blunder-analysis pass over finished games to estimate how often a model makes strategically bad choices, not just whether it won. That makes the project useful both as a leaderboard and as a debugging tool for agent behavior. Blunder analysis is best-effort and currently pretty unreliable, so don't read too much into it.