A transformer that sees only notation, compressed via polar-angle quantization into a file smaller than a photograph.
There is something philosophically uncomfortable about treating chess as a language modeling problem. Chess has geometry, material balance, pawn structure, king safety, and tactical patterns that span a dozen moves. A language model has none of these built in. It operates on a flat token sequence with no explicit spatial representation. And yet, if you train a transformer on the move sequences of strong players, it has to learn something about chess to predict those sequences accurately. The question is how much, and whether that something is enough to make reliably legal moves.
This is the premise of ChessNano: a 51.9-million-parameter transformer trained on Lichess game records from players above 2500 Elo, compressed into under 60 megabytes using a two-stage quantization scheme I adapted from recent work in KV-cache compression. No board representation is given to the model. No piece positions. No evaluation. Just notation.
Standard Algebraic Notation, SAN, encodes every chess move as a short string: e4, Nf3, Bxc6+, O-O. A full game becomes a sequence of these tokens framed by start and end markers. The vocabulary covers all SAN strings that appear in the training corpus, padded to 2048 entries for fast embedding lookup. The training objective is straightforward: given all previous moves in a game, predict the next one. Cross-entropy loss. Nothing else.
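Concretely, the tokenization scheme can be sketched in a few lines; the special-token names and the padding-to-2048 detail below are illustrative assumptions, not the project's actual code.

```python
# Tokenization sketch: one token per SAN move, plus start/end markers.
# The special-token names and the 2048-entry padding are illustrative.
def tokenize_game(movetext: str) -> list[str]:
    return ["<start>"] + movetext.split() + ["<end>"]

def build_vocab(games: list[str], size: int = 2048) -> dict[str, int]:
    tokens = {"<start>", "<end>", "<pad>"}
    for g in games:
        tokens.update(g.split())
    vocab = {tok: i for i, tok in enumerate(sorted(tokens))}
    for i in range(len(vocab), size):   # pad to a fixed embedding-table size
        vocab[f"<unused{i}>"] = i
    return vocab

tokens = tokenize_game("e4 e5 Nf3 Nc6 Bb5")
vocab = build_vocab(["e4 e5 Nf3 Nc6 Bb5"])
```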
The training data comes from the Lichess database filtered to games where both players hold at least 2500 Elo and the time control provides at least 10 minutes. This selection is deliberate. Move sequences from weaker players are noisier, contain patterns that a strong player would never reproduce, and pollute the distribution. At 2500+, the distribution is tighter. Strong players in similar positions play similar moves. The model sees the same structural patterns across hundreds of thousands of games from different angles and learns the distribution.
The evaluation metric that matters is the legal move rate: the fraction of model predictions that are actually legal in the board position. A random policy achieves essentially zero. A model that has learned real chess structure should exceed 90 percent.
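The metric itself is a one-liner once each position's legal move set is available. In practice those sets would come from a chess library such as python-chess; the hand-written sets here are toy illustrations.

```python
# Legal move rate: fraction of predicted SAN moves that are legal in their
# position. The legal sets below are abbreviated toy examples.
def legal_move_rate(predictions, legal_sets):
    hits = sum(p in legal for p, legal in zip(predictions, legal_sets))
    return hits / len(predictions)

preds = ["e4", "Nf3", "Qh5", "O-O"]
legal = [{"e4", "d4", "Nf3"},   # subset of legal moves in each position
         {"Nf3", "Nc3"},
         {"Bc4", "Nf3"},        # "Qh5" is not in this toy set
         {"O-O", "d3"}]
rate = legal_move_rate(preds, legal)   # 3 of 4 predictions are legal
```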
The transformer is a standard causal decoder stack with three modifications from the vanilla GPT-2 design.
Grouped Query Attention. The model has 12 query heads and 4 key-value heads; each KV head is shared by three query heads via repeat_interleave. This cuts the KV cache footprint to a third during inference, which matters as sequences grow toward the 512-token context limit. The head dimension is 64, so GQA shrinks the key and value projections from 768 to 256 dimensions, which also trims the attention parameter count per layer.
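A minimal numpy sketch of the KV-head sharing, using the stated shapes (12 query heads, 4 KV heads, head dimension 64); the toy sequence length is an illustrative choice.

```python
import numpy as np

# GQA sketch: 12 query heads share 4 KV heads, 3 queries per KV head.
n_q_heads, n_kv_heads, head_dim, seq = 12, 4, 64, 8
group = n_q_heads // n_kv_heads   # 3

rng = np.random.default_rng(0)
q = rng.standard_normal((n_q_heads, seq, head_dim))
k = rng.standard_normal((n_kv_heads, seq, head_dim))  # only 4 heads cached
v = rng.standard_normal((n_kv_heads, seq, head_dim))

# repeat_interleave along the head axis expands 4 KV heads to 12
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)

# the cache stores k/v, not k_full/v_full: a 3x footprint reduction
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
```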
Rotary Position Embedding (RoPE). Position is encoded directly into query and key vectors before the attention computation, not through a learned table. The frequencies follow the standard RoPE schedule, θᵢ = 10000^(−2i/d) for i = 0, …, d/2 − 1 with head dimension d = 64, and each consecutive pair of query/key coordinates is rotated by the position times θᵢ.
This generalizes better to sequence positions near the context boundary and contributes zero learnable parameters.
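A sketch of the precomputation and the pairwise rotation, assuming the standard RoPE base of 10000 (the base is an assumption; the post does not state it).

```python
import numpy as np

# RoPE sketch: each pair of channels rotates at its own frequency,
# by an angle proportional to the token position. Base 10000 is assumed.
head_dim, max_pos = 64, 512
inv_freq = 10000.0 ** (-np.arange(0, head_dim, 2) / head_dim)  # (32,)
angles = np.outer(np.arange(max_pos), inv_freq)                # (512, 32)

def rope(x, angles):
    # x: (seq, head_dim); rotate channel pairs (x[2i], x[2i+1])
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles[: len(x)]), np.sin(angles[: len(x)])
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).standard_normal((16, head_dim))
q_rot = rope(q, angles)
```

The table is derived, not trained, which is where the zero-parameter claim comes from; position 0 is a zero-angle rotation, so the first token's vectors pass through unchanged.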
SwiGLU feedforward. The feedforward network uses a gated activation, FFN(x) = W₂(SiLU(W_g x) ⊙ W₁ x), where ⊙ is elementwise multiplication and SiLU(z) = z·σ(z).
SwiGLU consistently reduces perplexity compared to standard ReLU or GELU feedforward at the same parameter count.
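A minimal sketch of the gated feedforward; the hidden width and initialization scale here are arbitrary illustrative choices.

```python
import numpy as np

# SwiGLU sketch: FFN(x) = W2 (SiLU(Wg x) * (W1 x)).
# Hidden width 2048 and the 0.02 init scale are illustrative, not the model's.
def silu(z):
    return z / (1.0 + np.exp(-z))

d_model, d_hidden = 768, 2048
rng = np.random.default_rng(2)
Wg = rng.standard_normal((d_model, d_hidden)) * 0.02   # gate projection
W1 = rng.standard_normal((d_model, d_hidden)) * 0.02   # value projection
W2 = rng.standard_normal((d_hidden, d_model)) * 0.02   # output projection

def swiglu_ffn(x):
    return (silu(x @ Wg) * (x @ W1)) @ W2

y = swiglu_ffn(rng.standard_normal((4, d_model)))
```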
The full model has 51.9 million parameters. In bfloat16 that is 104 MB. The target is under 60 MB, which requires heavy compression of every projection and feedforward weight while keeping the embedding table in float16.
Naive post-training quantization takes a trained float32 model and rounds each weight to the nearest INT8 value using absmax scaling: divide by the maximum absolute value, multiply by 127, round. Fast, simple, a known baseline.
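The baseline fits in a few lines; this sketch uses a single per-tensor scale.

```python
import numpy as np

# Absmax INT8 quantization: scale by the max |w|, round, store int8 + scale.
def absmax_quantize(w):
    s = np.abs(w).max()                             # per-tensor scale
    q = np.round(w * 127.0 / s).astype(np.int8)
    return q, s

def dequantize(q, s):
    return q.astype(np.float32) * s / 127.0

w = np.random.default_rng(3).standard_normal(1024).astype(np.float32)
q, s = absmax_quantize(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()   # bounded by half a step, s / 254
```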
The issue is accumulated approximation error. At each layer the true weight W is replaced by Ŵ = round(127·W/s)·s/127, where s is the per-tensor or per-channel maximum absolute value. The elementwise error |W − Ŵ| is small, but through a matrix multiply it compounds. By the eighth layer, output distributions have shifted enough to degrade prediction quality.
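The compounding effect can be demonstrated on a toy stack of random linear layers; this illustrates the mechanism, not the model's actual layers.

```python
import numpy as np

# Push the same input through a stack of random linear layers twice:
# once with exact weights, once with absmax-INT8-rounded weights.
# The relative drift between the two activations grows with depth.
def quantize(W):
    s = np.abs(W).max()
    return np.round(W * 127.0 / s) * s / 127.0

rng = np.random.default_rng(4)
dim, depth = 64, 8
layers = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]

h_true = h_quant = rng.standard_normal(dim)
drift = []
for W in layers:
    h_true = W @ h_true
    h_quant = quantize(W) @ h_quant
    drift.append(np.linalg.norm(h_true - h_quant) / np.linalg.norm(h_true))
```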
Quantization-Aware Training (QAT) addresses this by simulating quantization during the forward pass and letting the optimizer explicitly compensate. The rounding is treated as an identity for gradient computation: a straight-through estimator. The model learns weight configurations that survive rounding gracefully.
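A toy numeric illustration of the straight-through estimator, with a hypothetical fixed step size:

```python
import numpy as np

# STE sketch. Forward: the loss sees quantized weights. Backward: the
# gradient is taken as if quantization were the identity, dL/dw := dL/dw_hat.
def fake_quant(w, step=0.1):   # step size is an illustrative choice
    return np.round(w / step) * step

w = np.array([0.31, -0.47, 0.08])
w_hat = fake_quant(w)          # used in the forward pass

# Toy loss L = sum(w_hat**2). Its true gradient wrt w is zero almost
# everywhere (round() is flat), so the gradient wrt w_hat is passed
# straight through to w instead:
grad_ste = 2.0 * w_hat
```

In PyTorch this is typically expressed as `w + (fake_quant(w) - w).detach()`, which makes the forward pass quantized while the backward pass flows through `w` unchanged.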
Standard uniform INT8 QAT is a solid approach, but it has a structural mismatch. Uniform quantization assigns equal spacing to all levels across the representational range. Weight matrices in trained transformers have approximately Gaussian distributions: dense near zero, sparse in the tails. Uniform quantization wastes levels on tail values that appear rarely and resolves the central region, where most of the weight mass sits, too coarsely.
TurboQuant changes the quantization domain.
The central idea is to quantize angles rather than magnitudes after a linear rotation that makes the angular distribution of weight vectors approximately uniform.
For a weight matrix W ∈ ℝ^(d_out × d_in), partition each row along the input dimension into contiguous blocks of 64 elements.
Apply a shared normalized Hadamard matrix H ∈ ℝ^(64×64) to each block, with H = H₆₄/√64 so that HᵀH = I.
Post-rotation, the distribution of each element is more isotropic; the angles of pairs of coordinates are distributed more uniformly around the circle. This is the property that makes polar quantization efficient.
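The rotation can be sketched with Sylvester's recursive construction of the Hadamard matrix; whether the project builds H this way is an assumption, but the construction is standard and gives the required orthonormality.

```python
import numpy as np

# Normalized 64x64 Hadamard matrix via Sylvester's construction:
# H_2n = [[H_n, H_n], [H_n, -H_n]], starting from H_1 = [1].
def hadamard(n):
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

H = hadamard(64) / np.sqrt(64.0)   # orthonormal: H @ H.T == I

# Rotate one 64-element block of a weight row: same norm, more
# isotropic coordinate distribution.
block = np.random.default_rng(5).standard_normal(64)
rotated = H @ block
```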
Pair consecutive rotated elements. Each pair (x, y) has a polar angle θ = atan2(y, x) and a norm γ = √(x² + y²). Quantize the angle to a uniform circular grid: θ̂ = Δ·round(θ/Δ) with Δ = 2π/2⁷.
The grid uses 2⁷ = 128 equal intervals rather than 2⁷ − 1 = 127. Angles live on a circle: −π and +π are the same point. The linear formula Δ = 2π/(2ⁿ − 1) places two quantization points nearly coincident at the wrap-around boundary, wasting a level and introducing a discontinuity. Circular quantization always uses 2ⁿ intervals.
Reconstruct each pair from the quantized angle and the original norm, x̂ = γ·cos θ̂ and ŷ = γ·sin θ̂, then un-rotate the block with Hᵀ.
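Putting the stage-1 steps together for one rotated block (a sketch; storage of the quantized angle indices and the norms is omitted):

```python
import numpy as np

# Stage-1 polar quantization of one rotated 64-element block:
# pair elements, quantize each pair's angle to 2**7 = 128 circular levels,
# reconstruct from the quantized angle and the original norm.
LEVELS = 128
DELTA = 2.0 * np.pi / LEVELS

block = np.random.default_rng(6).standard_normal(64)  # already rotated
x, y = block[0::2], block[1::2]                       # 32 pairs

theta = np.arctan2(y, x)
gamma = np.hypot(x, y)
theta_hat = DELTA * np.round(theta / DELTA)   # index taken mod 128 in storage

x_hat = gamma * np.cos(theta_hat)
y_hat = gamma * np.sin(theta_hat)
recon = np.empty_like(block)
recon[0::2], recon[1::2] = x_hat, y_hat
```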
Stage 1 does not eliminate the quantization error. A residual r = x_rot − x̂_rot remains in the rotated domain: the difference between the rotated block and its polar-quantized reconstruction.
A fixed random Johnson–Lindenstrauss projection Φ ∈ {−1/√32, +1/√32}^(64×32), generated once from a fixed seed and shared across all layers, encodes this residual as sign bits: b = sign(Φᵀr) ∈ {−1, +1}³².
The sign encoding is an unbiased estimator of the direction of the residual. The Johnson–Lindenstrauss lemma guarantees that a random projection from 64 dimensions down to 32 preserves pairwise distances with bounded distortion. One bit per projected dimension gives 32 bits of correction per 64-element block, at a cost of half an additional bit per weight on average. The two stages together use eight bits total per weight.
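A sketch of the stage-2 encode, plus one plausible decode; the decode's rescaling by the residual norm is an illustrative assumption, not a stated part of the scheme.

```python
import numpy as np

# Stage-2 sketch: encode the stage-1 residual's direction as 32 sign bits
# through a fixed random {-1,+1}/sqrt(32) projection shared across layers.
rng = np.random.default_rng(7)   # fixed seed, as in the scheme
Phi = rng.choice([-1.0, 1.0], size=(64, 32)) / np.sqrt(32.0)

r = rng.standard_normal(64) * 0.05       # stand-in stage-1 residual
bits = np.sign(Phi.T @ r)                # 32 sign bits per 64-element block

# Illustrative decode: project the bits back for a direction estimate,
# then rescale (how the norm is recovered is an assumption here).
r_hat = Phi @ bits
r_hat *= np.linalg.norm(r) / np.linalg.norm(r_hat)
cosine = r @ r_hat / (np.linalg.norm(r) * np.linalg.norm(r_hat))
```

The decode always has positive alignment with the true residual, since ⟨r, Φb⟩ = Σⱼ |φⱼᵀr| when bⱼ = sign(φⱼᵀr).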
Both stages are wrapped in a single straight-through estimator. The optimizer receives gradients computed through the quantized forward pass as if the quantization were an identity transform. The weights remain in float32 during training and are discretized only within the forward operation.
The two training conditions are identical in data, architecture, and optimizer. The only difference is the quantization scheme applied during the forward pass: uniform INT8 versus two-stage TurboQuant. Both deploy to roughly 52 to 57 megabytes.
The hypothesis is that TurboQuant-QAT produces a higher legal move rate and lower validation perplexity at the same bit budget, for a specific reason. Uniform INT8 QAT absorbs quantization error as a perturbation that the optimizer works around, but it cannot reshape the error distribution. TurboQuant-QAT quantizes in a space where the weight statistics after Hadamard rotation are better matched to the quantization structure: weights are near-Gaussian, Hadamard rotation isotropizes the angle distribution, and the JL stage corrects the residual direction with one additional bit per weight.
The angular discretization error is smaller in expectation than the linear discretization error of uniform INT8 on a Gaussian distribution, for the same bit count.
Whether that difference is large enough to show up in chess move legality is the experiment. Legal move rate is a binary and interpretable metric that connects directly to whether the model has learned anything structurally useful. Perplexity matters too, but a model could achieve low perplexity while still predicting illegal moves with high confidence; the legal rate catches that failure mode directly.