Tarrasch Database Implementation
This is by way of a small digression. It won’t be of interest to most readers, but might be of interest to programmers. It can be considered a follow-up to my earlier Chess Move Compression post, in which I outlined the compact and efficient move representation used by the Tarrasch V3 database code.
In the comments to an earlier post, reader Umij asked this question: “I would really love to hear the technical reasons you abandoned the SQLite approach for an in-memory database. Was it just the required database space? Or did performance break down when dealing with millions of games? Was some chess functionality (e.g. specific searches, opening trees etc.) impossible to implement in a fast manner using SQL?”
I (eventually) answered in-place, but it seemed worth breaking the answer out as an independent blog post, since answering Umij’s question basically serves to explain how Tarrasch’s database is implemented. The rest of this post is the text of my answer.
I think I can explain this best by starting with the current in-memory solution, then introducing the old SQLite solution. The in-memory solution is essentially a big vector of games. Each game is represented as a C++ string. I am really misusing the concept of a string: for a start, the string is binary data and not textual in any way. If this bothers you, think of a game as a vector of bytes rather than a string. Also, in software engineering, encoding a bunch of fields in a single string is actually a classic anti-pattern (i.e. mistake). But it does have advantages (e.g. one allocation per game), and as Reti said, rules of thumb are glasses for the short-sighted. To mix metaphors, sometimes you have to break a few eggs to make an omelette. The first part of the string efficiently encodes the game metadata (players, event, site, date, result, Elo information etc.). The remainder of the string is a one-byte-per-move representation of the game moves, using the performant yet compact encoding system I described earlier in a big blog post.
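To make the single-string idea concrete, here is a minimal sketch of packing metadata fields plus the compressed moves into one string. The function name, field set and NUL-separator convention are invented for illustration; Tarrasch's actual field layout differs.

```cpp
#include <string>

// Hypothetical sketch: one allocation per game, header fields first,
// then the binary one-byte-per-move data. Not Tarrasch's real layout.
std::string pack_game(const std::string &white, const std::string &black,
                      const std::string &result,
                      const std::string &compressed_moves)
{
    std::string game;
    game += white;  game += '\0';   // NUL-terminated header fields
    game += black;  game += '\0';
    game += result; game += '\0';
    game += compressed_moves;       // binary move bytes, one per move
    return game;
}
```

The payoff is that a whole game lives in one contiguous allocation, and a std::string already carries its own length, so the moves section can safely contain any byte values including zero.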
An important refinement is that for storage and copying I actually exclusively manipulate pointers to games rather than games themselves. I use C++11 shared pointers to do this efficiently with automatic memory management. This means that I can efficiently assemble multiple subsets of the entire database (for the clipboard for example, or the results of a database search) without wasting additional memory for multiple copies of games.
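A minimal sketch of that refinement, assuming a game is just a std::string as above. The function name and predicate signature are mine, invented for illustration:

```cpp
#include <memory>
#include <string>
#include <vector>

// The database owns shared pointers to games; any subset (clipboard,
// search results) copies pointers, never the game data itself.
using GamePtr = std::shared_ptr<std::string>;

std::vector<GamePtr> filter_games(const std::vector<GamePtr> &db,
                                  bool (*predicate)(const std::string &))
{
    std::vector<GamePtr> subset;
    for (const GamePtr &g : db)
        if (predicate(*g))
            subset.push_back(g);    // copies the pointer only
    return subset;
}
```

Because the pointers are reference counted, a game's memory is released automatically once the last subset referring to it is discarded, with no manual bookkeeping.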
I implement all the database features by brute-force search through all games, move by move. I put a lot of effort into the design of my move encoding and the implementation and optimisation of the move-by-move search. But it’s fair to say I was amazed at how quickly I can scan through a million (or two or three million) games this way. It’s a tribute to the brilliant engineers who put together the extraordinary CPUs we have today. A key trick is that I can end a game search early when I find that a home-square pawn (e.g. a White pawn on a2) in the target position moves or is captured in the game I am scanning. In other words, if there is a White pawn on e2 in the position I am searching for, and in the game I am scanning White plays e2-e4, then obviously I am not going to reach the target position in this game and I can stop scanning. Similarly, I can end early when the piece type counts and pawn counts I track indicate the game I am scanning can never reach the position I am targeting. A refinement is that I handle games with no promotions (95% of games, statistically) with a faster version of the scan algorithm. In the faster version, if say there are two white rooks in the target position, and during the game scan the white rook count goes from 2 to 1 (due to a capture of a white rook), I can end early, because with no promotions there is no possibility the white rook count will increase later in the game.
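The no-promotion count cutoff can be sketched like this. The piece enumeration and the capture-event stream are invented for illustration; the real code tracks all the counts while decoding the actual move bytes of a game:

```cpp
#include <array>
#include <vector>

// Hypothetical piece indices for illustration only.
enum Piece { WP, WN, WB, WR, WQ, BP, BN, BB, BR, BQ, NPIECE };

// With no promotions, piece counts can only ever go down. So the moment
// any count in the scanned game falls below the count the target position
// requires, the target is unreachable and the scan can abort.
bool scan_counts(const std::array<int, NPIECE> &target,
                 std::array<int, NPIECE> counts,   // counts at game start
                 const std::vector<int> &captures) // captured piece per capture
{
    for (int piece : captures) {
        if (--counts[piece] < target[piece])
            return false;          // early abort: target unreachable
    }
    return true;                   // scan ran to the end (position may match)
}
```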
When I used SQL the equivalent to the big vector of games was a “games” table. The rows of the table were games; the columns were game_id (the primary key – basically just an incrementing row index) plus White, Black, Event etc., and finally Moves, which was a string using my one-byte-per-move scheme.
With this table SQL would let me find, sort and present games reasonably effectively. But of course any kind of position searching required something extra. The solution I used was a second table, a “positions” table. Each row of the positions table was a position and a game_id. Actually I used a hash of the position (instead of the entire position) – for space efficiency. Positions which occurred in multiple games generated multiple rows (same position, but different game_id). So the total number of rows was much greater than the total number of positions that occurred in the entire database. Obviously that’s a lot of rows! As I got this working I found that building this table became intolerably slow with larger databases. So I refined the idea: I used 4096 individual position tables. Each position would automatically be associated with one of the 4096 tables according to its hash. I think database programmers call this “sharding”. The position hash is basically a number generated by XORing together all squares of the board, using pseudo-random constants for each square – each square has 13 possible pseudo-random constants (64×13 constants altogether). For each square you select the pseudo-random constant corresponding to the occupant of the square (13 possibilities: 6 white pieces, 6 black pieces, or empty). It’s faster than you’d think, because there’s a trick that lets you incrementally recalculate the hash move by move without a total recalculation.
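The scheme described here is essentially Zobrist hashing. A minimal sketch, with the constants generated from a fixed seed rather than Tarrasch's actual tables, and with the incremental move-by-move trick shown as an XOR-out/XOR-in update:

```cpp
#include <cstdint>
#include <random>

// 64 squares x 13 occupants (empty + 6 white + 6 black pieces).
struct Zobrist {
    uint64_t table[64][13];
    Zobrist() {
        std::mt19937_64 rng(2024);          // fixed seed, illustration only
        for (auto &sq : table)
            for (auto &v : sq)
                v = rng();
    }
    // Full recalculation from a board (0 = empty, 1..12 = pieces).
    uint64_t full(const int board[64]) const {
        uint64_t h = 0;
        for (int sq = 0; sq < 64; ++sq)
            h ^= table[sq][board[sq]];
        return h;
    }
    // Incremental update for one square: XOR out the old occupant,
    // XOR in the new one. A move touches two (or three) squares.
    uint64_t update(uint64_t h, int sq, int before, int after) const {
        return h ^ table[sq][before] ^ table[sq][after];
    }
};
```

The incremental form is what makes hashing every position of every game cheap: a normal move costs four XORs instead of a 64-square recalculation.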
The scheme I settled on was to use a 44-bit hash. The underlying hash calculation was 64-bit and the top 20 bits were discarded (a shame, but the way these things work you basically do an 8-, 16-, 32-, 64- or 128-bit calculation, then discard excess bits if you have them – it’s impossible, or at least offers no benefit, to do the calculation at any intermediate size). Of the 44 bits, 12 selected one of the 4096 tables, and the position hash value in each row of the table was the remaining 32 bits. It was necessary to account for hash collisions – two completely different positions will occasionally generate an identical 44-bit hash. To find the games where any target position occurred: calculate the hash for the position, select one of the 4096 position tables (directly from 12 bits of the hash), and then ask SQL to find all the rows in that table with the remaining 32-bit position hash code. Each row has a game_id which lets you access a complete game from the games table. It was then necessary to scan through the game to make sure the position really occurred in it and was not the result of a hash collision.
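The bit split comes down to a couple of shifts and masks (the struct and function names here are mine, not Tarrasch's):

```cpp
#include <cstdint>

struct HashParts {
    uint32_t table_index;   // 12 bits: which of the 4096 shard tables
    uint32_t row_hash;      // 32 bits: value stored in the table row
};

// Discard the top 20 bits of the 64-bit hash, keep the low 44,
// then split those into 12 + 32.
HashParts split_hash(uint64_t h64)
{
    uint64_t h44 = h64 & ((1ULL << 44) - 1);            // keep low 44 bits
    HashParts p;
    p.table_index = static_cast<uint32_t>(h44 >> 32);   // top 12 of the 44
    p.row_hash = static_cast<uint32_t>(h44 & 0xFFFFFFFFULL); // low 32
    return p;
}
```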
If this is all clear to you, you are doing very well! From now on I’ll dial down the details somewhat and things should be easier to follow.
The biggest advantage of this SQL approach was that it was very good at quickly finding unique or at least rarely encountered positions – the kind of positions that occur well after opening theory ends. Soon after getting all this working I remember getting a postcard from Iceland with a stamp bearing an interesting looking chess position. I thought: the Icelanders know their chess, this wouldn’t be a random or meaningless position. Sure enough, it turned out to be the final position of the game Olafsson-Fischer, Portoroz Interzonal 1958. A win for Iceland’s favourite son against the most famous chess player in history (later also a naturalised Icelander, of course). The cool thing about this was that the search delivered the result instantly. Press the search button – boom – there’s Olafsson-Fischer. I distinctly remember the burst of adrenaline I got from performing this experiment and getting a perfect result immediately. Unfortunately this is more of an edge case than the kind of thing you normally want your database to do. A much more normal situation is to put in a standard position from opening theory, to find out how good players handle the position. Often in that case there might be thousands of games to deal with. If the position is (say) the position after 1.d4 d5 2.c4 e6 you might have 100,000 games (admittedly an extreme case). To do the things you want to do, like tabulate stats for all the options (3.Nc3, 3.Nf3 etc.), you need to read all 100,000 games. This tended to be unbearably slow.
By way of contrast, the in-memory approach might take a few seconds to locate Olafsson-Fischer, Portoroz 1958. But it finds all the 1.d4 d5 2.c4 e6 games in a trice, because there are so many pieces on the board (all 32 of them, so any capture means you can immediately abort a game scan), and especially since there are 12 pawns still on their home squares (so any of those moving, or being captured, immediately aborts a game scan). And tabulating the stats is child’s play too, since all these games are in memory. So in effect the in-memory approach is biased towards good performance in things that matter and worse performance for things that don’t. The SQL approach is the other way around.
As if all this wasn’t enough, there were other compelling reasons to switch to the in-memory scheme. To make the SQL position searches fast I had to add “indexes”. This bulked out the already large database files even more, and dramatically slowed building the databases. The 4.5 million game database I was using for development worked reasonably well for most things. But it was an overnight job creating it. And it was 12 gigabytes! It was way larger than the equivalent .pgn (you really want that ratio turned on its head), basically an unruly monster too big to ever be distributed. By contrast the 2.5 million game millionbase file on my website is 238 megabytes (so 55% of the games in 2% of the space). And it can be created in a few minutes. But wait, there’s even more! Once I had transitioned over to the in-memory approach I started thinking about other types of searches, not just simple position searches. It was a straightforward process to extend the in-memory searches to find various types of patterns and material balances. I am very proud of these features, and no SQL constructs I can think of would be remotely relevant or useful to such searches.
In summary, switching from my original SQLite database model to a simple binary format with in-memory searching was an absolute no-brainer, at least with the benefit of 20/20 hindsight. However I absolutely don’t discount the possibility that a more experienced and capable SQL developer could think of dramatic improvements to the model I was using. I concede I don’t know how Chessbase searches work – they have a very practical solution with compact databases, reasonable creation times, reasonable search speed and (I think) a disk-based rather than memory-based model. All power to them, I don’t know how to do that.
I think my in-memory approach has a decent future. Computers aren’t coming out with less RAM, and with my compact 100 bytes per game (approx.) scheme I can, at least in theory, accommodate 10 million games in a gig of memory. As I wrote somewhere else, Chrome can eat a gig of memory doing a little web browsing. I’ve noticed that Windows 10 is smart enough to transparently (to user and programmer) swap the memory used by the database out to disk if it’s not accessed for a while. I haven’t exhausted the tricks I can use to speed up the search. Most simply, I can throw extra threads at the problem – at the moment I just use one core. I can also anticipate that the user is likely to drill down further after a search – so I could (again using multi-threading) search just-in-case in the background and have the drill-down results ready instantaneously. At the moment the only multi-threading I use is to do the initial database load into memory in the background.