Docs

About the Project

mage-bench is a benchmark and observability stack for large language models playing real games of Magic: The Gathering inside a full XMage rules engine.

Why Magic?

It's hard to benchmark LLM reasoning. Chat benchmarks reward sycophancy over intelligence. Coding benchmarks are heavily gamed and largely saturated - the things that are hard are the things that we don't have good ways to measure. I propose that deep, competitive strategy games are the answer. They're competitive, so it's harder for the benchmark to be saturated - what's important is that you're smarter than your opponent, up to the skill ceiling of the game.

I chose Magic: The Gathering. It's incredibly complex, with 30+ years of cards printed, some of them extremely complicated (Questing Beast, Dead Ringers, Urza's Saga, Indicate, Chains of Mephistopheles). It's well-known and widely-beloved. And I own a bunch of cards that I want LLMs to bid up the prices of.

As of today, even frontier models play pretty bad Magic. This is okay. The goal of mage-bench is not to produce the best possible Magic bot. It's to provide a level playing field where LLMs can show off how much smarter they are than each other.

My current philosophy is that I do NOT provide strategy advice to the LLMs in their prompts - they need to figure out for themselves that playing lands is wise, or that they should block that 20/20.

What mage-bench is

The project is a fork of XMage plus a harness that lets LLM agents pilot decks through structured tools. The models see the actual game state, choose legal actions, and the engine resolves the consequences under the same rules a human game would use.
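To make the "structured tools" idea concrete, here is a minimal sketch of the kind of loop such a harness runs. Everything here is illustrative, not mage-bench's actual API: the engine enumerates the legal actions, the model picks one by index, and the harness guards against malformed replies.

```python
from dataclasses import dataclass

@dataclass
class Action:
    index: int
    description: str  # e.g. "Cast Lightning Bolt targeting opponent"

def pick_action(legal_actions, llm_choice_fn):
    """Show the model the legal actions and ask for one index.

    llm_choice_fn is a stand-in for a real model call; it receives the
    numbered action list as text and returns the chosen index as a string.
    Falls back to the first legal action on an invalid reply, so the
    engine is never handed an illegal move.
    """
    prompt = "\n".join(f"{a.index}: {a.description}" for a in legal_actions)
    try:
        idx = int(llm_choice_fn(prompt))
    except (TypeError, ValueError):
        return legal_actions[0]
    for a in legal_actions:
        if a.index == idx:
            return a
    return legal_actions[0]

actions = [Action(0, "Play Mountain"), Action(1, "Pass priority")]
chosen = pick_action(actions, lambda prompt: "0")
```

The key design point this illustrates: the model never mutates game state directly. It only selects among engine-generated legal actions, so the rules engine stays the single source of truth.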

This is not a toy simulator or a simplified card battler. The point is to evaluate models against the full messiness of Magic: hidden information, stack interaction, combat math, priority, side effects, and long multi-turn planning.

What the project measures

mage-bench tracks match results across multiple formats, computes per-format and combined Elo ratings for 1v1 play, and publishes replays, logs, and derived stats to the website.
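For reference, the standard Elo update for a single 1v1 result looks like this. The K-factor of 32 is an assumption for illustration; mage-bench may use different constants or a different rating scheme entirely.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update after a 1v1 match.

    score_a is 1.0 if player A won, 0.5 for a draw, 0.0 for a loss.
    Returns the new (rating_a, rating_b) pair; the update is zero-sum.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Equal ratings, A wins: A gains k/2 = 16 points, B loses 16.
new_a, new_b = elo_update(1500, 1500, 1.0)
```

Per-format ratings just mean running this bookkeeping separately for each format's match results, plus once over all matches for the combined rating.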

It also runs a separate blunder-analysis pass over finished games to estimate how often a model makes strategically bad choices, not just whether it won. That makes the project useful both as a leaderboard and as a debugging tool for agent behavior. Blunder analysis is best-effort and currently pretty unreliable, so don't read too much into it.