
Playing Chess in 57 Megabytes
A transformer that sees only notation, compressed via polar-angle quantization into a file smaller than a photograph.
April 2026
Kabir Murjani · April 2026
There is something philosophically uncomfortable about treating chess as a language modeling problem. Chess has geometry. It has material balance, pawn structure, king safety, tactical patterns that span a dozen moves. A language model has none of that built in. It operates on a flat sequence of tokens with no explicit notion of a board, a square, or a piece. And yet, if you train a transformer on the move sequences of strong players, it has no choice but to learn something about chess in order to predict those sequences well. The only interesting questions are how much, and whether that something is enough to make legal moves reliably.
That is the premise of ChessNano: a 51.9-million-parameter transformer trained on Lichess games between players above 2500 Elo, compressed to under 60 megabytes with a two-stage quantization scheme I adapted from recent work in KV-cache compression. The model is never shown a board. No piece positions, no evaluation, no rules. Just notation, and the pressure to predict what a strong player does next.
Chess as a Token Stream
Standard Algebraic Notation encodes every move as a short string: e4, Nf3, Bxc6+, O-O. A game becomes a sequence of these tokens between a start and end marker, and the vocabulary covers every SAN string that shows up in the training corpus, padded to 2048 entries for fast embedding lookup. The objective is the most boring one in the book. Given all previous moves, predict the next. Cross-entropy loss. Nothing clever.
The data is the Lichess database filtered to games where both players are at least 2500 Elo and the time control gives them at least ten minutes. That filter is doing real work. Moves from weaker players are noisier, full of patterns a strong player would never reproduce, and they smear the distribution you are trying to learn. At 2500 and up the distribution tightens: strong players in similar positions tend to make similar moves, so the model sees the same structural motifs from hundreds of thousands of angles and can actually converge on the shape of the game rather than the noise around it.
The metric that matters here is not perplexity, it is the legal move rate: the fraction of the model's predictions that are actually legal in the current position. A random policy scores essentially zero. A model that has genuinely internalized chess structure, with no board ever handed to it, should clear 90 percent. That number is the whole question of the project compressed into one statistic.
Architecture
The backbone is a standard causal decoder stack, with three departures from the vanilla GPT-2 design, all chosen for the same reason: they buy capacity or efficiency without spending parameters I cannot afford.
Grouped Query Attention. The model runs 12 query heads but only 4 key-value heads, with each KV head shared across three query heads. The head dimension is 64. This cuts the KV cache footprint by a factor of three at inference, which matters as sequences stretch toward the 512-token context limit, and it costs nothing in attention parameters per layer because GQA simply shrinks the KV projection from 768 to 256 dimensions rather than removing anything the query side needs.
Rotary position embedding. Instead of a learned position table, position is rotated directly into the query and key vectors before attention, using precomputed frequencies:
θ_i = 1 / 10000^(2i/d), f_t = t · θ, q' = q·cos f + q⊥·sin fThis generalizes more gracefully to positions near the edge of the context window, and it contributes exactly zero learnable parameters, the kind of free lunch you take gratefully when your whole budget is 52 million weights.
SwiGLU feedforward. The feedforward block uses a gated activation rather than a plain nonlinearity:
FFN(x) = W2 ( SiLU(Wg·x) ⊙ Wu·x ), Wg, Wu ∈ ℝ^{768×2048}, W2 ∈ ℝ^{2048×768}SwiGLU reliably shaves perplexity relative to ReLU or GELU feedforward at the same parameter count, which is the only currency that matters here.
All of that comes to 51.9 million parameters. In bfloat16 that is 104 megabytes. The target is under 60, which means almost every projection and feedforward weight has to be compressed hard while the embedding table stays in float16. This is where the actual research lives.
The Problem with Uniform Quantization
The naive approach is post-training INT8 with absmax scaling: take the trained float32 weights, divide by the maximum absolute value, multiply by 127, round. Fast, simple, and a perfectly reasonable baseline.
The trouble is that the error does not stay local. Each weight W gets replaced by Ŵ = round(W/s)·s/127, and while the elementwise error is small, it compounds through every matrix multiply. By the eighth layer the output distributions have drifted far enough to visibly degrade predictions. Quantization-aware training helps, by simulating the rounding in the forward pass and treating it as an identity for gradients (the straight-through estimator), the optimizer learns weight configurations that survive being rounded. That is genuinely useful, and it is still not enough, because it cannot fix the deeper mismatch.
Uniform quantization spaces its levels evenly across the range. But trained transformer weights are roughly Gaussian: dense near zero, sparse in the tails. So uniform quantization spends precious levels resolving rare tail values and starves the crowded center where almost all of the weight mass actually sits. You are allocating your bits to the part of the distribution that barely happens. The optimizer can work around a fixed error budget, but it cannot reshape where that budget is being wasted.
TurboQuant-QAT
The fix is to change the domain you quantize in. Instead of quantizing magnitudes on a Gaussian, you rotate the weights so their angular distribution becomes roughly uniform, and then you quantize angles, where uniform spacing is finally the right tool.
Stage one: PolarQuant with a Hadamard rotation. Take a weight matrix and partition its input dimension into blocks of 64. Multiply each block by a shared 64×64 normalized Hadamard matrix (Sylvester construction, one matrix reused across every layer, satisfying H·Hᵀ = I). After this rotation the elements are more isotropic, and the angles of coordinate pairs spread out nearly uniformly around the circle, which is exactly the condition polar quantization wants.
Now pair consecutive rotated elements. Each pair (x, y) has a polar angle θ = atan2(y, x) and a norm γ = √(x² + y²). Quantize the angle onto a uniform circular grid:
Δ = 2π / 2⁷ = 2π / 128, θq = round(θ / Δ) · ΔThe detail that is easy to get wrong: use 128 intervals, not 127. Angles live on a circle, where −π and +π are the same point. If you naively use 2π / (2ⁿ − 1), you place two quantization points almost on top of each other at the wrap-around and waste a level. Circular quantization always wants 2ⁿ intervals. Reconstruct from the quantized angle and the original norm, then rotate back:
W̃ = Hᵀ ( γ · [cos θq, sin θq] )The norm is stored in float16 per pair; the angle index is 7 bits.
Stage two: a 1-bit QJL residual correction. Stage one leaves a residual in the rotated domain, ε = W_rot − W̃_rot. Rather than store it, encode its direction with sign bits through a fixed random Johnson-Lindenstrauss projection Φ ∈ {−1/√32, +1/√32}^{64×32}, generated once from a fixed seed and shared across all layers:
s = sign(ε · Φ) ∈ {−1, +1}³², ε_corr = s · Φᵀ, W_TQ = Hᵀ(W̃_rot + ε_corr)The sign encoding is an unbiased estimator of the residual's direction, and the JL lemma guarantees that projecting from 64 dimensions down to 32 preserves pairwise distances with bounded distortion. That is 32 correction bits per 64-element block, about one extra bit per weight on average. The two stages together land at eight bits per weight, and both are wrapped in a single straight-through estimator, so the optimizer sees gradients through the full quantized forward pass while the weights stay in float32 and are discretized only inside the forward operation.
What the Experiment Tests
The two training runs are identical in data, architecture, and optimizer. The only thing that changes is the quantization in the forward pass: uniform INT8 against the two-stage TurboQuant scheme. Both land between roughly 52 and 57 megabytes.
The hypothesis is that TurboQuant-QAT produces a higher legal move rate and lower validation perplexity at the same bit budget, and the reason is specific rather than hand-wavy. Uniform INT8 QAT can absorb quantization error as a perturbation and let the optimizer route around it, but it cannot reshape the error distribution itself. TurboQuant-QAT quantizes in a space the weight statistics are actually suited to: near-Gaussian weights, a Hadamard rotation that isotropizes the angles, and a one-bit residual stage that corrects the leftover direction. The angular discretization error is smaller in expectation than the linear discretization error of uniform INT8 on a Gaussian, for the same number of bits.
Whether that gap is large enough to surface in something as concrete as chess move legality is the actual experiment. I like the metric precisely because it is hard to fool. Perplexity can look fine while the model still confidently proposes illegal moves; legal move rate catches that failure directly. It is binary, it is interpretable, and it ties straight back to the only question that started this: with nothing but notation, how much chess does the model really know?
- TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate (Google Research, ICLR 2026). The two-stage PolarQuant-plus-QJL scheme this borrows from. Worth noting the adaptation: the original is a training-free method for the KV cache and uses a random orthogonal rotation, whereas ChessNano applies it to weights under QAT and substitutes a Hadamard rotation, which is cheaper and structured.
- PolarQuant: Quantizing KV Caches with Polar Transformation (Han, Kacham, Karbasi, Mirrokni & Zandieh, AISTATS 2026). The angle-quantization idea: rotate, then quantize angles instead of magnitudes, and skip the per-block normalization that ordinary quantization has to store in full precision.
- QJL, the 1-bit Quantized Johnson-Lindenstrauss transform (Zandieh et al.). The residual stage. A random sign projection that encodes the direction of the leftover error in one bit per projected dimension.
- The architectural pieces are standard but worth naming: Grouped Query Attention (Ainslie et al., 2023), RoPE (Su et al., 2021), and SwiGLU (Shazeer, 2020).