Fast Chess Move Generation With Magic Bitboards

One of my larger personal projects is a chess engine called Shallow Blue that's been under on and off development since I started it a few years ago. Almost since I started writing Shallow Blue, magic bitboards have been sitting in the feature backlog. That changed the past few weeks as I had some time off school for the holidays. Impressed by the cleverness of the trick employed by magic bitboards, I figured it would make a good subject for a blog post.

Overview

Most chess engines are based on the Minimax algorithm (or some modification of it). Minimax is based on searching future game positions for a position that gives the engine the highest advantage, while at the same time being a feasible position to end up in (ie. it doesn't require any stupid moves from our opponent).

Generally speaking, the further ahead we look in the game tree, the better our engine will play. As the game tree grows very rapidly in chess with each additional ply we search, any reasonable chess engine must be able to generate chess moves very quickly.

The way almost all engines go about generating moves is with a variety of bit twiddling hacks involving bitboards. A bitboard is simply a 64 bit value, with each bit representing a square on the chess board (usually bit 0 = a1 and bit 63 = h8). Since most modern CPU architectures are capable of manipulating 64 bit values very quickly, a variety of bit hacks can be used to efficiently generate new bitboards containing all possible moves for a given piece.

Sliding pieces are the chess pieces that can move an indefinite number of squares along a column or diagonal until hitting another piece, or the edge of the board. If you have any knowledge of the rules of chess, you'll know this means the bishop, rook and queen.

Sliding pieces present particular challenges for fast move generation as their attack sets (the bitboard containing all squares they can possibly move to or capture to, given the square they sit on and a bitboard containing pieces blocking their movement), are much more dependent on the positions of other pieces on the board than non-sliding pieces. For example, if a king on a1 is blocked by a pawn on b2, it cannot move to one square (b2) that it otherwise could have. However if a bishop on a1 is blocked by a pawn on b2, it cannot move to six other squares that it otherwise could have. We could theoretically just loop over all those squares and mask them out one by one, however this is incredibly slow in practice when millions of moves need to be generated in a very short period of time.

The Classical Approach

Shallow Blue previously used an approach known as the classical approach to generate siding piece moves. In the classical approach, a two dimensional table is generated once when the engine starts. It contains rays in the eight cardinal and intercardinal directions on each square and is arranged to allow easy lookup. For example:

RAYS[NORTH_EAST][25]  RAYS[NORTH][10]       RAYS[EAST][4]
. . . . . 1 . .       . . 1 . . . . .       . . . . . . . .
. . . . 1 . . .       . . 1 . . . . .       . . . . . . . .
. . . 1 . . . .       . . 1 . . . . .       . . . . . . . .
. . 1 . . . . .       . . 1 . . . . .       . . . . . . . .
. . . . . . . .       . . 1 . . . . .       . . . . . . . .
. . . . . . . .       . . 1 . . . . .       . . . . . . . .
. . . . . . . .       . . . . . . . .       . . . . . . . .
. . . . . . . .       . . . . . . . .       . . . . . 1 1 1

In the case of rooks and bishops, the classical approach calculates sliding piece moves ray by ray for each of the four directions the piece can move. Queen moves are simply calculated by taking the bitwise OR of rook and bishop moves.

As an example, say we want to calculate all moves for a bishop on c3, with the bitboard containing blocking pieces looking like this: (the bishop square is marked by a B)

blockers
. 1 . 1 . . . 1
. . . . . . . .
1 . . . 1 1 1 .
. . . . . . . .
. . . . . . . .
. . B . . . . .
1 . . 1 . . . .
. . . . . . . .

We start by considering the northwest direction. We take the northwest ray from the bishop's position, and bitwise AND it with the blocker bitboard to get a maskedBlockers bitboard.

blockers                RAYS[NORTH_WEST][c3]    maskedBlockers
. 1 . 1 . . . 1         . . . . . . . 1         . . . . . . . 1
. . . . . . . .         . . . . . . 1 .         . . . . . . . .
1 . . . 1 1 1 .         . . . . . 1 . .         . . . . . 1 . .
. . . . . . . .         . . . . 1 . . .         . . . . . . . .
. . . . . . . .    &    . . . 1 . . . .    =    . . . . . . . .
. . B . . . . .         . . B . . . . .         . . . . . . . .
1 . . 1 . . . .         . . . . . . . .         . . . . . . . .
. . . . . . . .         . . . . . . . .         . . . . . . . .

Now we need to identify the position of the lowest nonzero bit in maskedBlockers. This can be done quickly on x86 using the bsf (bitscan forward) instruction.

bitscanForward(maskedBlockers) = f6

We now get the northwest ray from square f6 and use it to mask out all squares the bishop cannot move to in the northwest direction.

RAYS[NORTH_WEST][c3]  ~RAYS[NORTH_WEST][f6] result
. . . . . . . 1       1 1 1 1 1 1 1 .       . . . . . . . .
. . . . . . 1 .       1 1 1 1 1 1 . 1       . . . . . . . .
. . . . . 1 . .       1 1 1 1 1 1 1 1       . . . . . 1 . .
. . . . 1 . . .       1 1 1 1 1 1 1 1       . . . . 1 . . .
. . . 1 . . . .   &   1 1 1 1 1 1 1 1   =   . . . 1 . . . .
. . B . . . . .       1 1 B 1 1 1 1 1       . . B . . . . .
. . . . . . . .       1 1 1 1 1 1 1 1       . . . . . . . .
. . . . . . . .       1 1 1 1 1 1 1 1       . . . . . . . .

This process is repeated for the northeast, southwest and southeast directions. When each result is bitwise ORed together, we get the following bitboard, which contains all possible bishop attacks.

. . . . . . . .
. . . . . . . .
. . . . . 1 . .
1 . . . 1 . . .
. 1 . 1 . . . .
. . B . . . . .
. 1 . 1 . . . .
1 . . . . . . .

The method for rooks is analogous, simply substitute northwest, southwest, northeast and southeast rays for north, south, east and west.

Note that for the southwest and southeast directions, we need to find the highest nonzero bit instead of the lowest when masking out squares we cannot move to. This can be done on x86 using the bsr (bitscan reverse) instruction.

This method is reasonably fast, requiring only a small number of instructions in each movement direction. An implementation for bishops in C++ looks like this:

U64 getBishopMovesClassical(int square, U64 blockers) {
  U64 attacks = 0ULL;

  // North West
  attacks |= RAYS[NORTH_WEST][square];
  if (RAYS[NORTH_WEST][square] & blockers) {
    int blockerIndex = bitscanForward(RAYS[NORTH_WEST][square] & blockers);
    attacks &= ~RAYS[NORTH_WEST][blockerIndex];
  }

  // North East
  attacks |= RAYS[NORTH_EAST][square];
  if (RAYS[NORTH_EAST][square] & blockers) {
    int blockerIndex = bitscanForward(RAYS[NORTH_EAST][square] & blockers);
    attacks &= ~RAYS[NORTH_EAST][blockerIndex];
  }

  // South East
  attacks |= RAYS[SOUTH_EAST][square];
  if (RAYS[SOUTH_EAST][square] & blockers) {
    int blockerIndex = bitscanReverse(RAYS[SOUTH_EAST][square] & blockers);
    attacks &= ~RAYS[SOUTH_EAST][blockerIndex];
  }

  // South West
  attacks |= RAYS[SOUTH_WEST][square];
  if (RAYS[SOUTH_WEST][square] & blockers) {
    int blockerIndex = bitscanReverse(RAYS[SOUTH_WEST][square] & blockers);
    attacks &= ~RAYS[SOUTH_WEST][blockerIndex];
  }

  return attacks;
}

On my laptop, GCC Generates the following x86 machine code when compiling with optimization flags:

00000000000125d0 <getBishopAttacksClassical>:
  125d0:       48 8d 15 09 a7 48 00    lea    0x48a709(%rip),%rdx
  125d7:       48 63 ff                movslq %edi,%rdi
  125da:       48 89 f1                mov    %rsi,%rcx
  125dd:       48 8b 84 fa 00 0a 00    mov    0xa00(%rdx,%rdi,8),%rax
  125e4:       00
  125e5:       48 21 c1                and    %rax,%rcx
  125e8:       0f 85 92 00 00 00       jne    12680
  125ee:       48 8b 8c fa 00 08 00    mov    0x800(%rdx,%rdi,8),%rcx
  125f5:       00
  125f6:       48 09 c8                or     %rcx,%rax
  125f9:       48 21 f1                and    %rsi,%rcx
  125fc:       75 62                   jne    12660
  125fe:       48 8b 8c fa 00 0c 00    mov    0xc00(%rdx,%rdi,8),%rcx
  12605:       00
  12606:       48 09 c8                or     %rcx,%rax
  12609:       48 21 f1                and    %rsi,%rcx
  1260c:       75 32                   jne    12640
  1260e:       48 8b 8c fa 00 0e 00    mov    0xe00(%rdx,%rdi,8),%rcx
  12615:       00
  12616:       48 09 c8                or     %rcx,%rax
  12619:       48 21 ce                and    %rcx,%rsi
  1261c:       75 02                   jne    12620
  1261e:       c3                      retq
  1261f:       90                      nop
  12620:       b9 3f 00 00 00          mov    $0x3f,%ecx
  12625:       f3 48 0f bd f6          lzcnt  %rsi,%rsi
  1262a:       29 f1                   sub    %esi,%ecx
  1262c:       48 63 c9                movslq %ecx,%rcx
  1262f:       48 8b 94 ca 00 0e 00    mov    0xe00(%rdx,%rcx,8),%rdx
  12636:       00
  12637:       c4 e2 e8 f2 c0          andn   %rax,%rdx,%rax
  1263c:       c3                      retq
  1263d:       0f 1f 00                nopl   (%rax)
  12640:       41 b8 3f 00 00 00       mov    $0x3f,%r8d
  12646:       f3 48 0f bd c9          lzcnt  %rcx,%rcx
  1264b:       41 29 c8                sub    %ecx,%r8d
  1264e:       49 63 c8                movslq %r8d,%rcx
  12651:       48 8b 8c ca 00 0c 00    mov    0xc00(%rdx,%rcx,8),%rcx
  12658:       00
  12659:       c4 e2 f0 f2 c0          andn   %rax,%rcx,%rax
  1265e:       eb ae                   jmp    1260e
  12660:       f3 48 0f bc c9          tzcnt  %rcx,%rcx
  12665:       48 8b 8c ca 00 08 00    mov    0x800(%rdx,%rcx,8),%rcx
  1266c:       00
  1266d:       c4 e2 f0 f2 c0          andn   %rax,%rcx,%rax
  12672:       eb 8a                   jmp    125fe
  12674:       66 90                   xchg   %ax,%ax
  12676:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  1267d:       00 00 00
  12680:       f3 48 0f bc c9          tzcnt  %rcx,%rcx
  12685:       48 8b 8c ca 00 0a 00    mov    0xa00(%rdx,%rcx,8),%rcx
  1268c:       00
  1268d:       c4 e2 f0 f2 c0          andn   %rax,%rcx,%rax
  12692:       e9 57 ff ff ff          jmpq   125ee
  12697:       90                      nop
  12698:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  1269f:       00

This might seem like a fairly reasonable approach, but it turns out we can do better than 47 instructions.

Enter Magic Bitboards

The term "magic bitboard" is a bit of a misnomer. "Magic bitboards" are not a special type of bitboard, rather the term refers to a technique for quickly generating attacks for sliding pieces.

The basic idea behind magic bitboards is quite simple. Instead of manually calculating sliding piece attack sets as needed, we precalculate all possible attack sets on all squares, for all possible sets of blockers (on the appropriate rank/file or diagonals). We then store the attack sets in a hash table and look them up as needed.

The hash table we need to do this maps blocking pieces on rows/columns (for rooks) or diagonals (for bishops), to attack sets. These blockers can easily be obtained by a simple bitwise AND with all the squares a rook/bishop could possibly move to with all blocking pieces. For example, if we want to find possible blocking pieces for a rook on f6, we could do the following:

rookMask                blockers                maskedBlockers
. . . . . . . .         1 1 . . . 1 . .         . . . . . . . .
. . . . . 1 . .         . 1 1 1 1 . . 1         . . . . . . . .
. 1 1 1 1 . 1 .         . 1 . 1 . . . 1         . 1 . 1 . . . .
. . . . . 1 . .         . . . . . . . .         . . . . . . . .
. . . . . 1 . .    &    . 1 . . . . 1 .    =    . . . . . . . .
. . . . . 1 . .         . . . 1 . . . .         . . . . . . . .
. . . . . 1 . .         . . . . . 1 . .         . . . . . 1 . .
. . . . . . . .         1 . . . . . . .         . . . . . . . .

Note that rookMask excludes edge squares as they don't affect the attack set. In the example above for instance, the rook can attack f8 regardless of whether or not there is a piece on that square.

The maskedBlockers bitboard above is what our hash table will map to an attack set. In the example above, our hash table would map maskedBlockers to the following attack set bitboard:

. . . . . . . .
. . . . . 1 . .
. . . . 1 . 1 .
. . . . . 1 . .
. . . . . 1 . .
. . . . . 1 . .
. . . . . 1 . .
. . . . . . . .

The question now is how to construct a hash table to quickly obtain an attack set from a maskedBlockers bitboard. The first thing that might come to mind is not to use a hash function at all, simply using maskedBlockers directly as a key into an array of attack set bitboards. Theoretically this works, but we're faced with the problem of having to store an array of size 8*264 bytes (147.57 exabytes) entirely in memory.

Because of this, we need to hash our 64 bit maskedBlockers bitboard into a smaller key, which we can then use to index our preinitialized attack set table. Ideally we want the hash function to be perfect (ie. no collisions), and very quick to execute (on the order of a few instructions).

The process for doing this is shown below:

maskedBlockers        magicNum                multResult
. . . . . . . .                               . . 1 . 1 . 1 .
. . . . . . . .                               1 . . . . . . .
. 1 . 1 . . . .                               . 1 . 1 . 1 . .
. . . . . . . .                               . . . . . . . .
. . . . . . . .   *   0x12000810020004   =    . . . . . . 1 .
. . . . . . . .                               . . . . . . . .
. . . . . 1 . .                               . . . . . . . 1
. . . . . . . .                               . . . . . . . .

multResult                      key
. . 1 . 1 . 1 .                 . . . . . . . .
1 . . . . . . .                 . . . . . . . .
. 1 . 1 . 1 . .                 . . . . . . . .
. . . . . . . .                 . . . . . . . .
. . . . . . 1 .   >>   54   =   . . . . . . . .
. . . . . . . .                 . . . . . . . .
. . . . . . . 1                 1 . . . . . . .
. . . . . . . .                 . . . . 1 . 1 .
```

This may seem like a strange convoluted process. You're probably wondering where 0x12000810020004 came from. There's a method to the madness though. Let's start by looking at the multiplication, which I'm going to rewrite slightly. I've replaced the nonzero bits in the maskedBlockers bitboard with the letters A - C.

maskedBlockers        magicNum                 multResult
. . . . . . . .                                . . A . B . C .
. . . . . . . .                                1 . . . . . . .
. B . C . . . .                                . 1 . 1 . 1 . .
. . . . . . . .                                . . . . . . . .
. . . . . . . .   *   0x12000810020004     =   . . . . . . 1 .
. . . . . . . .                                . . . . . . . .
. . . . . A . .                                . . . . . . . 1

Now take a look at a multiplication with a different blockers bitboard. Again, the nonzero bits have been replaced with the letters A - C.

maskedBlockers        magicNum                 multResult
. . . . . . . .                                . A . . . B C .
. . . . . . . .                                1 . . . . . . .
. . . . C . . .                                . 1 . . . . . 1
. . . . . . . .                                . . . . . . . .
. . . . . B . .   *   0x12000810020004     =   . . . . . . 1 1
. . . . . . . .                                . . . . . . . .
. . . . . A . .                                . . . . . . . 1
. . . . . . . .                                . . . . . . . .

Notice how each multiplication "pushed" the nonzero bits in the maskedBlockers bitboard up. As it turns out, the value 0x12000810020004, when multiplied with a maskedBlockers bitboard for a rook on f6, always "pushes" the bits in maskedBlockers up in such a way that the result has a unique top 10 bits for every possible maskedBlockers bitboard. This means that the value we obtain by right shifting multResult by 64 - 10 = 54 (ie. dropping all but the top 10 bits) can be used as a unique index into our attack set table. 0x12000810020004 is known as a "magic number", and is where the term "magic bitboard" comes from.

Using this hashing method, the preinitialized attack tables are about 800 KiB for rooks and 38 KiB for bishops, well within the limits of a modern computer's memory.

So how do we generate these magic numbers? As it turns out, it's fairly easy to do via simple brute force just by trying random numbers with low numbers of set bits. My laptop (Core i5 5300U) grinds out magic numbers for rooks and bishops on all squares in ~5 seconds using the rather slow unoptimized code that can be found on this page.

The magics used by Shallow Blue can be found here.

A C++ implementation of magic bitboard hashing and attack set lookup for bishops looks like this:

U64 getBishopAttacksMagic(int square, U64 blockers) {
  // Mask blockers to only include bits on diagonals
  blockers &= BISHOP_MASKS[square];

  // Generate the key using a multiplication and right shift
  U64 key = (blockers * BISHOP_MAGICS[square]) >> (64 - BISHOP_INDEX_BITS[square]);

  // Return the preinitialized attack set bitboard from the table
  return BISHOP_TABLE[square][key];
}

GCC Generates the following x86 assembly on my laptop for the above code when compiling with optimization flags:

0000000000012eb0 <getBishopAttacksMagic>:
  12eb0:       48 63 ff                movslq %edi,%rdi
  12eb3:       48 8d 05 26 3c 49 00    lea    0x493c26(%rip),%rax
  12eba:       48 8d 15 df 32 00 00    lea    0x32df(%rip),%rdx
  12ec1:       48 23 34 f8             and    (%rax,%rdi,8),%rsi
  12ec5:       48 8d 05 d4 33 00 00    lea    0x33d4(%rip),%rax
  12ecc:       48 0f af 34 f8          imul   (%rax,%rdi,8),%rsi
  12ed1:       48 89 f0                mov    %rsi,%rax
  12ed4:       be 40 00 00 00          mov    $0x40,%esi
  12ed9:       2b 34 ba                sub    (%rdx,%rdi,4),%esi
  12edc:       48 c1 e7 0c             shl    $0xc,%rdi
  12ee0:       c4 e2 cb f7 f0          shrx   %rsi,%rax,%rsi
  12ee5:       48 8d 05 f4 3b 29 00    lea    0x293bf4(%rip),%rax
  12eec:       48 01 f7                add    %rsi,%rdi
  12eef:       48 8b 04 f8             mov    (%rax,%rdi,8),%rax
  12ef3:       c3                      retq

Notice the huge reduction in instructions over the classical approach. We've gone from 47 to 15 instructions to accomplish exactly the same task. Looking beyond simple instruction counts, speedups also come from the huge caches found on modern processors, which can store large swaths of the attack set tables. The superiority of the this approach can be clearly seen when using Shallow Blue to calculate perft(5) from the starting position. Magic bitboards show a 30% speedup over the classical approach.

perft 5
...
==========================
Total time (ms) : 319
Nodes searched  : 4865609
Nodes / second  : 15249341

perft 5
...
==========================
Total time (ms) : 246
Nodes searched  : 4865609
Nodes / second  : 19731045

Final Thoughts

Setting out to write a chess engine is a very dangerous thing to do. It will never feel done and the fact that I spent a good chunk of my winter break implementing magic bitboard move generation is a testament to that.

All the code for magic bitboard movegen in Shallow Blue can be viewed on GitHub for the interested.

© Rhys Rustad-Elliott. Built using Pelican. Best viewed with NCSA Mosaic on a 250MHz Pentium 2 machine running Windows 3.1 or later.