Chess Move Compression
A routine move representation of 16 or 32 bits makes sense for normal chess computation. For archive or transfer compression makes sense. It is possible to compress chess moves to much less than 8 bits (on average), although information theory dictates that sometimes 8 bits are required. In this article/blog post I describe a kind of hybrid system that does not seek to compress aggressively, instead using 8 bits for every move. The benefits of this system are; a useful amount of compression, extreme convenience (what could be simpler than representing games as strings, once move per 8 bit ‘character’?), and (very importantly) negligible consumption of CPU cycles compared to more aggressive schemes. This representation method has worked out well for implementing the new database features of the Tarrasch V3 program.
Please note that I pursue my hobby of chess programming more or less in isolation. I don’t use (or even look at) other people’s chess code. I haven’t read any academic research. I play around and come to my own conclusions. I don’t pretend this is a strength, it’s clearly a weakness. But for now at least it suits what I am trying to achieve – basically personal satisfaction. So my conclusions are (far from) peer reviewed best practice. It’s just a random dude’s thoughts on the internet, buyer beware.
I absolutely don’t expect any of what follows to be original, I expect it to be essentially a rediscovery and/or restatement of existing ideas.
One Move in 40 Bits
Unsurprisingly normal chess move notation is not particularly space efficient. Two common conventions are SAN or standard algebraic notation and a more terse form used to communicate with UCI (Universal Chess Interface) engines. A single move can take up to eight characters in SAN. for example e7xf8=Q+. UCI uses either four or five characters, five characters are required only for promotions, our example in UCI move format is e7f8q.
Already we can see an issue that will affect our quest for efficient representation. How much information are we trying to store? In the SAN example, we can tell that the move is a capture, a promotion and gives check. In the UCI example we can see promotion directly and infer capture (since only pawns promote and a pawn can only go from e7 to f8 by capturing). In our quest we are going for lean and mean – we seek to store only the information required to uniquely identify any legal move in any legal position. Very importantly (more about this later) we seek a scheme that will always produce exactly the same encoding for a given move in a given position, irrespective of how that position arose.
One Move in 12 Bits
My first naive dabbling led me to the conclusion that to store a chess move requires somewhat more than 12 bits (so somewhat more than 1.5 bytes, for our purposes a byte is 8 bits, and is also essentially equivalent to a character in the examples above). A chessboard of course has 64 squares and since 2^6 = 64, six bits neatly accommodate a single square. So a move requires two squares (source and destination) or twelve bits. Unfortunately just as UCI sometimes requires five characters, 12 bits seemed to me to be sufficient usually, but insufficient universally because of the need to accommodate underpromotion. Four possible promotions are always possible, so since 2^2 = 4, 2 more bits are required, for a total of 14. Right? Well no, even I quickly realised it was possible to avoid a 2 bit promotion tax. Imagine a pawn promoting from e7. It can go only to d8, e8 or f8. That’s only three possibilities, yet we are allocating 6 whole bits to encode this destination. Wasteful! Our first piece of cleverness today comes into play. We will adjust the destination square to indicate underpromotion, so e7-f8 indicates a rightward capture and Queen promotion, e7-f7 indicates a Rook, e7-f6 indicates a Bishop, e7-f5 indicates a Knight (say). Of course we apply the same idea to promoting leftward captures and non capturing promotions.
One Move in 5.5 Bits
Now if I had been paying better attention, it should have occurred to me that this was really too easy , it’s hardly a case of squeezing blood from a stone. This is contra-indicative of an optimum strategy. We can use information theory to probe the limits of compression, and to inform us whether we are really getting close to an optimum solution.
Information theory tells us that the number of bits needed to encode one of N equally likely possibilities is log2(N). We have already seen log2(64) = 6 bits and log2(4) = 2 bits in the discussion above. As an aside, all these even powers of two, 2 colours, 4 possible promotions, 8 pawns per side, 8 pieces per side, 64 squares make the chess programmer’s life easier than it would otherwise be. So the key question becomes, how many moves are possible? That will dictate the theoretical minimum number of bits needed to encode one of the possible moves.
For the sake of argument, let’s assume there are 32 legal moves available in a position. Log2(32)=5 or equivalently 2^5 = 32. So 5 bits can theoretically encode a move. How? All that is required is a list of the legal moves. This isn’t a problem since generating a list of legal moves is a fundamental building block of computer chess. The order of the moves in the list is all important, so we need a convention to ensure the move ordering is deterministic, perhaps we always sort the list according to some agreed criteria. Then we simply find the move we wish to encode in the list – if it’s a legal move it will be in there somewhere. The offset (or index, or position) of the move in the list is a 5 bit number (or code) in the range 0-31 and uniquely identifies our move. Voila.
Of course there aren’t always 32 legal moves available in any position, but a straightforward generalisation of this procedure already gets us to a reasonably practical and close to optimal scheme. The generalisation is to vary the number of bits per move according to the length of the list of legal moves in the position. The number of bits used in any position is obtained from our old friend, the formula log2(N) where N is the length of the list of legal moves. In practice most moves will need 5 or 6 bits, 6 bits is required if there are more than 2^5=32 possibilities, but that is almost always enough. Practical positions with more than 2^6 = 64 legal moves are rare. In “The Joys of Chess” by Christian Hesse the author addresses the upper limits in the chapter “How Many Moves?” and concludes that the current record holders are 75 for a real game (Podhorzer-Palda Vienna 1933, White’s 25th move) and 218 for a constructed positition (Petrovic 1964, FEN notation “R6R/3Q4/1Q4Q1/4Q3/2Q4Q/Q4Q2/pp1Q4/kBNN1KB1 w – – 0 1” note that WordPress is converting normal hyphens into something else in this FEN string – sorry about that – retype the two hyphens to get Tarrasch or some other chess program to read the string). Naturally, in the Petrovic position White has promoted all 8 pawns to Queens, so now there are 9 Queens all strategically arranged a knight’s move apart in order to minimise interference. The most important thing we can take from this is that 8 bits will absolutely, positively, be sufficient in any legal position. Even a casual inspection of the Petrovic position reveals it must be reasonably close to the upper limit. The empirical evidence is that even in such an esoteric field as this, no record would stand for fifty years if it could be absolutely decimated, and it would require absolute decimation to get from 218 to beyond 2^8 = 256.
Although we are ostensibly discussing compressing individual moves, a variable bit scheme like this is probably best thought about in the context of a whole game. A game as a series of moves is represented as a stream of bits. Each move is represented as somewhere between 0 and 8 bits according to the number of possible moves in the position. Most moves would need 5 or 6 bits, it would be quite common to use less (for example endings with much reduced material – and replies to checks) and very rare to use more. An amusing point is that some moves really would require zero bits – this happens when there is only one legal move in the position, there’s no need to store anything at all in that case. Our fundamental formula expresses this concept in the language of mathematics, log2(1)=0 [or equivalently 2^0=1].
One Move in 4 Bits
It would be a mistake to think no further compression is possible. For a start, there are usually unused codes in the scheme, since usually the number of legal moves in a position is not normally an exact power of two. Consider the initial position, there are twenty possible moves, so 5 bits are required, and consequently 12 unused codes (2^5-20=12). These codes could represent common opening sequences. Later in the game such gap codes could be used to represent common multi-move sequences when they are uniquely identifiable (eg O-O then O-O, PxP then PxP, RxR then RxR, h6 then BxNf6 then recapture [or the same pattern starting with a6,h3,a3]).
We can get more ambitious. In our magic log2(N) formula there are N equally likely possibilities. But all chess moves are not created equal, they are not equally likely. This is an example of the principle that non random information is compressible. Consider the order of the moves in the legal moves list. Instead of using some simplistic sort technique to establish a deterministic order, imagine instead using a chess engine to rank the moves in order of strength. Now vary our basic scheme so that moves early in the list use less bits than moves later in the list. Designing the details of such a scheme can be a lot of fun. One idea that suggests itself is a 3 bit code sometimes followed by extended bits. 3 bits allows 2^3=8 codes. 6 of these codes represent the first 6 moves in the list. The other 2 codes indicate one of two possible formats for extended bits. One format is 2 extended bits coding for 2^2=4 further moves from the list (moves 7 through 10). The other format is a fallback scheme, a variable number of extended bits, as many as required to represent moves 11 and above in the list. In this scheme well played games are more compressed than poorly played games. At a guess, a reasonable game would comprise perhaps 70% 3 bit moves, 20% 3+2=5 bit moves, 8% 3+5=8 bit moves bits, 2% 3+6=9 bit moves. This works out at an average of 3.9 bits per move. So we have improved on a naive 12 bit scheme by at a factor of about three.
In all of this we can see one of the basic tenets of all computation playing out- there is no such thing as a free lunch. In this case space costs us time (it’s a bit like chess!). What I mean by this is that reducing the amount of space (memory) required costs a lot of time (CPU cycles). Move list generation takes significant CPU time. Ranking moves by strength (to get more compression) takes (comparatively) vast amounts of time.
One Move in 2 Bits, Maybe?
A couple of weeks after I wrote this post it got some attention on Hacker News and Reddit. That experience convinced me to add the introductory paragraph that now starts this post, which hopefully helps make this work as an article not just a blog post. I also learned some new things from the commentary there and I’ve added this paragraph to reflect that. Specific thanks are due to Reddit commentator Dag Ågren. It turns out that there’s a way to turbo charge the chess engine based compression scheme. Consider the 3.9 bits per move scheme above. It works best when there are 6 or less good moves available, reasonably when there are 10 or less, poorly when there are more than 10. It’s a compromise scheme, it attempts to be at least adequate in every position. But what if we could tailor a scheme for each position we encounter? We could potentially do a lot better in many, maybe most, positions. Especially positions in which there aren’t many good moves. For example, if there is only one good move in a position, we’d design a scheme that used only one bit if that move was played. If there were two good moves, we’d use a scheme that used only two bits if either of those moves were played. And so on. Surprisingly (to me anyway) it turns out it is possible to do this in a practical way. The trick is to dynamically select a scheme well suited to each position in the sequence of positions that constitute a game. There is no need waste extra bits indicating which scheme is being used, because the decoder is synchronised to the encoder. It follows along through the same sequence of positions. As long as it uses exactly the same algorithm as the encoder to select the scheme best suited to any position, it will always pick the right scheme for decoding. It turns out that a technique called “arithmetic coding” can automatically generate optimal schemes given a list of symbols (moves in our context) and symbol probabilities (which can reasonably be inferred from the engine’s move score). The title of this section is just speculation on my part, if there are an average of 4 good moves in a position, and if a strong player (almost) always plays a good move, then 2 bits per move is theoretically the best we could do, for high quality games. Actually it has just occurred to me that when playing through a master game with kibitzing in Tarrasch gives a feel for this, as the “best” four lines are shown. The percentage of moves played that come from the engine’s top four list often does feel very high. But remember the downside, even if we can get reasonable output from a chess engine in 1 millisecond (a stretch), a big database has of the order of 1 billion moves, so compressing or uncompressing a complete database is going to take of the order of a million seconds (more than a week).
The Sweet Spot – One Move in 8 Bits
I’ve given hints in the language I’ve used that I actually haven’t implemented any of the complicated schemes above for Tarrasch 3. Basically they are far too slow for my purposes. The scheme I adopted instead trades off compression for time. It turns out it is possible to compress moves into 8 bits with negligible CPU cycles. For the pragmatic programmer this represents great value. A fixed one move per byte system is very convenient for the programmer, much more convenient than 12 bits. In particular, 8 bit moves allow move sequences to be represented as strings, the length of the string is the number of moves, and simple indexing into the string locates individual moves in a natural way. Although further compression with a variable bit scheme has some attractions, it would not only cost a lot of time, it would also be less simple and convenient to use.
Our 8 bit scheme works as follows. We split each 8 bit code into two 4 bit fields. Each field has 2^4=16 codes. There are a maximum of 16 pieces (let’s use the word ‘piece’ as shorthand for ‘pawn or piece’) on each side of the board, so one of our two fields (let’s call it the piece code) selects one of the 16 pieces in the army of the side to move. The other field (let’s call that one the vector code) selects one of 16 possible moves for the selected piece. Too easy! As they say in Australia.
Of course if it were really that easy I probably wouldn’t be spending hours writing this up. There are some difficulties still to overcome. We’ll deal with the most obvious one first. A queen can make up to 27 moves, it’s the one piece on the board that can exceed the 16 possibilities our vector code allows. We solve this problem by conceptually splitting the Queen into two parts, a bishop and a rook. The bishop and rook each get their own piece code. This seems to need a 17th piece code we don’t have, but we balance splitting a queen into two by combining both knights into one. The 16 vector codes of this dual knight are sufficient for both knights since they can only make up to 8 moves each.
Another fundamental problem we need to overcome is extra pieces appearing on the board due to promotion. Extra queens are particularly problematic since they need two piece codes, so it’s not possible to simply recycle the code formerly used by the pawn that promoted. Our strategy is to have a fall back mode we can call upon in some (quite rare in practice) circumstances. The fall back mode is the slow system we are already familiar with; we generate a list of all legal moves, sort them, and then index into the list to select our move. There’s no need to mess about with variable bit fields, we’ll just use 8 bits, which we know is always sufficient, every time. More about this later.
Our 16 available piece codes comprise eight pawn codes, one king code, one dual knight code, three bishop codes and three rook codes. Remember that we are treating the queen as a bishop plus a rook. A vector coding scheme is required for each type of piece code. A king makes the least demands on the 16 codes available, we simply define 8 direction vectors, N,S,E,W,NE,NW,SE,SW. Castling can be accommodated by using two more codes, or we can be a little tricky and use the fact that a king cannot move in all 8 directions on its initial square. So we can represent White O-O (for example) as the White King moving South East from e1.
For the dual knight, we use 3 of the 4 available vector bits to define 8 direction vectors NNE, NNW, NWW, NEE, SSE, SSW, SEE, SWW corresponding to the eight possible knight moves. The other bit selects one of the two knights.
Rook and bishop vector codes also split the 4 available bits into 1 and 3. The rook code uses the 1 bit field to indicate whether the move is changing the rank or the file, the 3 bit field then indicates one of the 8 ranks or files. A bishop code uses the one bit field to indicate whether the move is along a diagonal that rises from left to right, or falls from left to right. The 3 bit field indicates a destination file.
Due to the need to accommodate the possibility of underpromotion, the humble pawn definitely needs all 4 vector code bits. The most obvious system is to use 2 bits to indicate one of the 4 possible types of pawn move (advance, double advance, left capture, right capture) and the other 2 bits to indicate one of the four possible promotions, if it is a promoting move. But we can exploit the fact that a double advance never promotes to improve on this scheme; We use 2 bits to select one of 4 possible types of pawn move – non promoting, promoting advance, promoting left capture or promoting right capture. If the move is one of the three promoting types, then the other 2 bits select one of the four possible promotions. If the move is non-promoting, then the other 2 bits indicate the type of move (advance, double advance, left capture, right capture). This is a better scheme because it is more efficient, it encodes extra information (is the move a promotion or not?) in the same space.
To calculate the 8 bit move code for any legal move in any legal position we first scan through all pieces of the side to move. Opposing pieces are essentially irrelevant. Up to four pieces are assigned fixed piece codes; the king and (if present), queen, light squared bishop and dark squared bishop. Pawns, rooks and knights are assigned dynamic piece codes according to their square location using an arbitrary square ordering scheme. So up to two knights are identified as knight 0 and knight 1, they share the dual knight piece code. Up to two rooks are identified as rook 0 and rook 1 which are individual piece codes. Up to eight pawns are identified as pawn 0, pawn 1… pawn 7, which are individual piece codes. If there is more than one queen, or more than one light squared bishop, or more than one dark squared bishop, or more than two knights or more than two rooks, then (and only then) we use the slow fall back scheme to calculate the move code. Otherwise the 8 bit move code is a combination of the 4 bit piece code of the piece to move with the appropriate 4 bit vector code.
At first glance this might seem to be unnecessarily computationally expensive. Why scan over the whole board every time? Why not assign fixed piece codes in the starting position and then track them? In fact that’s exactly how I reasoned when I first implemented compressed moves and for a long time I did use fixed codes. It was a huge mistake. Huge! The dynamic coding scheme buys us a very important property. Once we have settled on all the various arbitrary conventions we need, our algorithm always produces the same move code for a given move in a given position. Unfortunately using fixed codes means the history leading to a position influences the move code generated.
Consider the following opening transposition; A) 1.e4 e6 2.c4 d5 3.cxd5 exd5 4. exd5 c6 5.dxc6. B) 1.e4 e6 2.c4 d5 3.exd5 exd5 4. cxd5 c6 5.dxc6. The move 5.dxc6 is the same move in the same position in each case, but because the capturing pawn started its life as an ‘e’ pawn in line A) and as a ‘c’ pawn in line B), a fixed scheme will generate two different 8 bit move codes for the move. This has the effect of potentially messing up basically any chess database feature at least occasionally. Pawns and Knights quite often mischievously swap places like this, rooks less so, but I am sure there are some real life instances. At least theoretically it is possible for a promoted piece to feature in a position normally reached without promotion. I agonised over these things and killed off huge numbers of brain cells trying to think up elaborate “normalisation” tricks to try to repair an inherently faulty underlying scheme. Futility exemplified.
One day I (finally) returned to thinking about using dynamic coding instead. All those pawns that constantly needed to be reordered, surely it was impractically slow. I think all programmers are familiar with the delicious sensation of a momentary flash of insight suddenly dissolving a problem. It turns out that we can use dynamic codes with a negligible amount of extra computation. The way to do it is to track the position of each piece and only change the dynamic codes as required. Of course that much is obvious, and with rooks and knights it is no big burden. Just keep track of both knights and as long as there are two knights on the board, check whether they need to swap places after each knight move. Make adjustments as required if a knight is captured or created by promotion. All of this applies in exactly the same way for rooks.
It’s the pawns that are a problem and the solution starts by using the right type of square ordering convention. My normal convention is; a8=0, b8=1, c8=2 … h8=7,a7=8 … h1=63. It’s normally a good convention but for this particular task it’s horrible. Instead we need a convention where we increment along files not ranks. For example, a1=0,a2=1,a3=2…a8=7,b1=8…h8=63. Now any pawn advance doesn’t affect pawn ordering at all. This is a direct consequence of the laws of chess, a pawn cannot leap over another pawn on the same file. Capturing pawn moves can require significant reordering, but a simple algorithm handles the task, and in the vast majority of cases it’s computationally cheap.
To understand the simple pawn reordering algorithm, consider that although a capturing pawn may need to be reordered relative to its neighbours, no neighbour needs to be reordered relative to any other neighbour. For a right capture, check if our pawn needs to swap places with its adjacent neighbour to the right. If it does, swap the pawns and repeat with the next neighbour to the right. The first time no swap is needed, we know the pawns have been reordered successfully.
An example might help. White has three pawns on a2 (pawn 0),b2 (pawn 1) and c2 (pawn 2). After right capture a2xb3, we initially have b3 (pawn 0), b2 (pawn 1) and c2 (pawn 2). Does our relocated pawn on b3 need to swap with b2, its adjacent neighbour to the right? Yes. So b2 (pawn 0) b3 (pawn 1) c2 (pawn 2). Does the b3 pawn need to swap with c2, its adjacent neighbour to the right? No. We are done with one simple swap. Zero or one swaps are by far the most common cases.
Let’s do another example with three swaps, the most I observed when compressing a database of 4.5 million games;
b5 (pawn 0) c2 (pawn 1) c3 (pawn 2) c5 (pawn 3) f5 (pawn 4), White plays b5xc6
c6 (pawn 0) c2 (pawn 1) c3 (pawn 2) c5 (pawn 3) f5 (pawn 4), Swap c6 and c2? Yes
c2 (pawn 0) c6 (pawn 1) c3 (pawn 2) c5 (pawn 3) f5 (pawn 4), Swap c6 and c3? Yes
c2 (pawn 0) c3 (pawn 1) c6 (pawn 2) c5 (pawn 3) f5 (pawn 4), Swap c6 and c5? Yes
c2 (pawn 0) c3 (pawn 1) c5 (pawn 2) c6 (pawn 3) f5 (pawn 4), Swap c6 and f5? No
Left captures work the same way (of course). An example of a left capture;
a4 (pawn 0) b4 (pawn 1) c2 (pawn 2), White plays c2xb3
a4 (pawn 0) b4 (pawn 1) b3 (pawn 2), Swap b3 and b4? Yes
a4 (pawn 0) b3 (pawn 1) b4 (pawn 2), Swap b3 and a4? No
The conceptual breakthrough that allows computationally inexpensive dynamic piece coding prompted me to completely rework my move compression algorithm. I introduced the simplifying concept that only the side with the move needs to be considered when calculating the compressed move code. Previously I would switch into fallback mode globally (i.e. for both sides) immediately an inconvenient promotion occurred. But more often than not an inconvenient new promoted piece (typically a second queen) is immediately captured before the promoting side gets to move again. So my improved approach is not only much simpler, it often avoids a completely unnecessary switch to fallback mode.
Statistically speaking, it turns out about 0.015% of compressed moves are calculated with fallback mode, and about one in 250 games requires some such moves. A reasonably simple change would improve these statistics by perhaps a factor of 100 or so (I will come back and edit the blog when I find out more precisely [see edit below]). I am sorely tempted to implement the change, which would basically confine fallback mode to freakish games. The idea is to allow a second queen, in much the same way as two knights and two rooks are allowed. When two queens (queen 0 and queen 1) are on the board, the piece codes for pawn 6 and pawn 7 are recycled and used for queen 1. Two queens would not be compatible with seven pawns, that really would necessitate fallback mode. In other words, the side with two queens would need to have lost at least one pawn to capture in addition to the pawn ‘lost’ in the act of promotion.
Successful program design is all about effectively managing trade-offs. I am not a hundred percent convinced that the extra ugly complexity is justified in this case. Time will tell. In the meantime you can check my current implementation at;
Edit: I decided to go ahead and implement the “second queen” enhancement I envisaged above. It felt like the right thing to do. Before making the change scanning a 4.5M game dataset produced 17971 games where fallback mode was required due to two (or more) queens, and 99 games where fallback mode was required for other reasons. Implementing the “second queen” option reduces the 17971 games to 211, an improvement by a factor of 94. The number of moves requiring fallback mode improved by a factor of 40, from 0.015% to 0.00037%. Interestingly, the change improved the performance fairly dramatically as well, the time to compress then uncompress 350M moves went from 257 seconds to 110 seconds. So I can compress and decompress 3 million moves per second on my commodity laptop. At this point I have made absolutely no effort to optimise the code itself, (as opposed to optimising the underlying algorithm). The code is in the first phase of the “first get it working, then get it fast” mantra, and is still full of diagnostic instrumentation. I think with some effort I could improve performance by a factor of 10 or so, which I’ll need every bit of if I ever get to implement my idea of abandoning database index position searches in favour of scanning the moves of all games [Edit 7Dec2016: By the time Tarrasch V3 was released in November 2016 I had optimised the code and gone ahead with replacing the index position searches with scanning the moves of all games, it turned out to be a much more practical approach as I describe in a follow up blog post .
One final point of detail, for the record. I actually made a second performance tweak as I implemented “second queen”. Fallback mode requires a move list sort, and I have previously used the (extravagantly slow with my chess API) approach of converting each move to SAN text equivalent (eg “exf8=Q+”) and then sorting the strings. Writing this post jogged the realisation that it would be hugely faster and just as documentable (to coin an adjective) to use UCI text instead (eg “e7f8q”). This was an almost trivial change to make. I felt the need to measure the impact of each tweak, and as often happens when you measure performance (as opposed to just speculating and assuming) something surprising emerged. Implementing either tweak alone gets the full 257 -> 110 second improvement! I find that sufficiently surprising that I’d like to repeat the whole experiment when I get a chance, but assuming I didn’t make a dumb experimental method error, I can rationalise this result as follows;
The super slow SAN conversions caused fallback mode to be a performance issue, even at a rate of 0.015%. Improving the rate to 0.00037% or using fast UCI conversions is each sufficient to effectively eliminate fallback mode as a performance issue.