One of my larger personal projects is a chess engine called Shallow Blue that's been under on and off development since I started it a few years ago. Almost since I started writing Shallow Blue, magic bitboards have been sitting in the feature backlog. That changed the past few weeks as I had some time off school for the holidays. Impressed by the cleverness of the trick employed by magic bitboards, I figured it would make a good subject for a blog post.
Overview
Most chess engines are based on the Minimax algorithm (or some modification of it). Minimax is based on searching future game positions for a position that gives the engine the highest advantage, while at the same time being a feasible position to end up in (ie. it doesn't require any stupid moves from our opponent).
Generally speaking, the further ahead we look in the game tree, the better our engine will play. As the game tree grows very rapidly in chess with each additional ply we search, any reasonable chess engine must be able to generate chess moves very quickly.
The way almost all engines go about generating moves is with a variety of bit twiddling hacks involving bitboards. A bitboard is simply a 64 bit value, with each bit representing a square on the chess board (usually bit 0 = a1 and bit 63 = h8). Since most modern CPU architectures are capable of manipulating 64 bit values very quickly, a variety of bit hacks can be used to efficiently generate new bitboards containing all possible moves for a given piece.
Sliding pieces are the chess pieces that can move an indefinite number of squares along a column or diagonal until hitting another piece, or the edge of the board. If you have any knowledge of the rules of chess, you'll know this means the bishop, rook and queen.
Sliding pieces present particular challenges for fast move generation as their attack sets (the bitboard containing all squares they can possibly move to or capture to, given the square they sit on and a bitboard containing pieces blocking their movement), are much more dependent on the positions of other pieces on the board than non-sliding pieces. For example, if a king on a1 is blocked by a pawn on b2, it cannot move to one square (b2) that it otherwise could have. However if a bishop on a1 is blocked by a pawn on b2, it cannot move to six other squares that it otherwise could have. We could theoretically just loop over all those squares and mask them out one by one, however this is incredibly slow in practice when millions of moves need to be generated in a very short period of time.
The Classical Approach
Shallow Blue previously used an approach known as the classical approach to generate siding piece moves. In the classical approach, a two dimensional table is generated once when the engine starts. It contains rays in the eight cardinal and intercardinal directions on each square and is arranged to allow easy lookup. For example:
RAYS[NORTH_EAST][25] RAYS[NORTH][10] RAYS[EAST][4]
. . . . . 1 . . . . 1 . . . . . . . . . . . . .
. . . . 1 . . . . . 1 . . . . . . . . . . . . .
. . . 1 . . . . . . 1 . . . . . . . . . . . . .
. . 1 . . . . . . . 1 . . . . . . . . . . . . .
. . . . . . . . . . 1 . . . . . . . . . . . . .
. . . . . . . . . . 1 . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . 1 1 1
In the case of rooks and bishops, the classical approach calculates sliding piece moves ray by ray for each of the four directions the piece can move. Queen moves are simply calculated by taking the bitwise OR of rook and bishop moves.
As an example, say we want to calculate all moves for a bishop on c3, with the bitboard containing blocking pieces looking like this: (the bishop square is marked by a B)
blockers
. 1 . 1 . . . 1
. . . . . . . .
1 . . . 1 1 1 .
. . . . . . . .
. . . . . . . .
. . B . . . . .
1 . . 1 . . . .
. . . . . . . .
We start by considering the northwest direction. We take the northwest ray
from the bishop's position, and bitwise AND it with the blocker bitboard
to get a maskedBlockers
bitboard.
blockers RAYS[NORTH_WEST][c3] maskedBlockers
. 1 . 1 . . . 1 . . . . . . . 1 . . . . . . . 1
. . . . . . . . . . . . . . 1 . . . . . . . . .
1 . . . 1 1 1 . . . . . . 1 . . . . . . . 1 . .
. . . . . . . . . . . . 1 . . . . . . . . . . .
. . . . . . . . & . . . 1 . . . . = . . . . . . . .
. . B . . . . . . . B . . . . . . . . . . . . .
1 . . 1 . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . .
Now we need to identify the position of the lowest nonzero bit in
maskedBlockers. This can be done quickly on x86 using the bsf
(bitscan
forward) instruction.
bitscanForward(maskedBlockers) = f6
We now get the northwest ray from square f6 and use it to mask out all squares the bishop cannot move to in the northwest direction.
RAYS[NORTH_WEST][c3] ~RAYS[NORTH_WEST][f6] result
. . . . . . . 1 1 1 1 1 1 1 1 . . . . . . . . .
. . . . . . 1 . 1 1 1 1 1 1 . 1 . . . . . . . .
. . . . . 1 . . 1 1 1 1 1 1 1 1 . . . . . 1 . .
. . . . 1 . . . 1 1 1 1 1 1 1 1 . . . . 1 . . .
. . . 1 . . . . & 1 1 1 1 1 1 1 1 = . . . 1 . . . .
. . B . . . . . 1 1 B 1 1 1 1 1 . . B . . . . .
. . . . . . . . 1 1 1 1 1 1 1 1 . . . . . . . .
. . . . . . . . 1 1 1 1 1 1 1 1 . . . . . . . .
This process is repeated for the northeast, southwest and southeast directions. When each result is bitwise ORed together, we get the following bitboard, which contains all possible bishop attacks.
. . . . . . . .
. . . . . . . .
. . . . . 1 . .
1 . . . 1 . . .
. 1 . 1 . . . .
. . B . . . . .
. 1 . 1 . . . .
1 . . . . . . .
The method for rooks is analogous, simply substitute northwest, southwest, northeast and southeast rays for north, south, east and west.
Note that for the southwest and southeast directions, we need to find the
highest nonzero bit instead of the lowest when masking out squares we cannot
move to. This can be done on x86 using the bsr
(bitscan reverse) instruction.
This method is reasonably fast, requiring only a small number of instructions in each movement direction. An implementation for bishops in C++ looks like this:
U64 getBishopMovesClassical(int square, U64 blockers) {
U64 attacks = 0ULL;
// North West
attacks |= RAYS[NORTH_WEST][square];
if (RAYS[NORTH_WEST][square] & blockers) {
int blockerIndex = bitscanForward(RAYS[NORTH_WEST][square] & blockers);
attacks &= ~RAYS[NORTH_WEST][blockerIndex];
}
// North East
attacks |= RAYS[NORTH_EAST][square];
if (RAYS[NORTH_EAST][square] & blockers) {
int blockerIndex = bitscanForward(RAYS[NORTH_EAST][square] & blockers);
attacks &= ~RAYS[NORTH_EAST][blockerIndex];
}
// South East
attacks |= RAYS[SOUTH_EAST][square];
if (RAYS[SOUTH_EAST][square] & blockers) {
int blockerIndex = bitscanReverse(RAYS[SOUTH_EAST][square] & blockers);
attacks &= ~RAYS[SOUTH_EAST][blockerIndex];
}
// South West
attacks |= RAYS[SOUTH_WEST][square];
if (RAYS[SOUTH_WEST][square] & blockers) {
int blockerIndex = bitscanReverse(RAYS[SOUTH_WEST][square] & blockers);
attacks &= ~RAYS[SOUTH_WEST][blockerIndex];
}
return attacks;
}
On my laptop, GCC Generates the following x86 machine code when compiling with optimization flags:
00000000000125d0 <getBishopAttacksClassical>:
125d0: 48 8d 15 09 a7 48 00 lea 0x48a709(%rip),%rdx
125d7: 48 63 ff movslq %edi,%rdi
125da: 48 89 f1 mov %rsi,%rcx
125dd: 48 8b 84 fa 00 0a 00 mov 0xa00(%rdx,%rdi,8),%rax
125e4: 00
125e5: 48 21 c1 and %rax,%rcx
125e8: 0f 85 92 00 00 00 jne 12680
125ee: 48 8b 8c fa 00 08 00 mov 0x800(%rdx,%rdi,8),%rcx
125f5: 00
125f6: 48 09 c8 or %rcx,%rax
125f9: 48 21 f1 and %rsi,%rcx
125fc: 75 62 jne 12660
125fe: 48 8b 8c fa 00 0c 00 mov 0xc00(%rdx,%rdi,8),%rcx
12605: 00
12606: 48 09 c8 or %rcx,%rax
12609: 48 21 f1 and %rsi,%rcx
1260c: 75 32 jne 12640
1260e: 48 8b 8c fa 00 0e 00 mov 0xe00(%rdx,%rdi,8),%rcx
12615: 00
12616: 48 09 c8 or %rcx,%rax
12619: 48 21 ce and %rcx,%rsi
1261c: 75 02 jne 12620
1261e: c3 retq
1261f: 90 nop
12620: b9 3f 00 00 00 mov $0x3f,%ecx
12625: f3 48 0f bd f6 lzcnt %rsi,%rsi
1262a: 29 f1 sub %esi,%ecx
1262c: 48 63 c9 movslq %ecx,%rcx
1262f: 48 8b 94 ca 00 0e 00 mov 0xe00(%rdx,%rcx,8),%rdx
12636: 00
12637: c4 e2 e8 f2 c0 andn %rax,%rdx,%rax
1263c: c3 retq
1263d: 0f 1f 00 nopl (%rax)
12640: 41 b8 3f 00 00 00 mov $0x3f,%r8d
12646: f3 48 0f bd c9 lzcnt %rcx,%rcx
1264b: 41 29 c8 sub %ecx,%r8d
1264e: 49 63 c8 movslq %r8d,%rcx
12651: 48 8b 8c ca 00 0c 00 mov 0xc00(%rdx,%rcx,8),%rcx
12658: 00
12659: c4 e2 f0 f2 c0 andn %rax,%rcx,%rax
1265e: eb ae jmp 1260e
12660: f3 48 0f bc c9 tzcnt %rcx,%rcx
12665: 48 8b 8c ca 00 08 00 mov 0x800(%rdx,%rcx,8),%rcx
1266c: 00
1266d: c4 e2 f0 f2 c0 andn %rax,%rcx,%rax
12672: eb 8a jmp 125fe
12674: 66 90 xchg %ax,%ax
12676: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
1267d: 00 00 00
12680: f3 48 0f bc c9 tzcnt %rcx,%rcx
12685: 48 8b 8c ca 00 0a 00 mov 0xa00(%rdx,%rcx,8),%rcx
1268c: 00
1268d: c4 e2 f0 f2 c0 andn %rax,%rcx,%rax
12692: e9 57 ff ff ff jmpq 125ee
12697: 90 nop
12698: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
1269f: 00
This might seem like a fairly reasonable approach, but it turns out we can do better than 47 instructions.
Enter Magic Bitboards
The term "magic bitboard" is a bit of a misnomer. "Magic bitboards" are not a special type of bitboard, rather the term refers to a technique for quickly generating attacks for sliding pieces.
The basic idea behind magic bitboards is quite simple. Instead of manually calculating sliding piece attack sets as needed, we precalculate all possible attack sets on all squares, for all possible sets of blockers (on the appropriate rank/file or diagonals). We then store the attack sets in a hash table and look them up as needed.
The hash table we need to do this maps blocking pieces on rows/columns (for rooks) or diagonals (for bishops), to attack sets. These blockers can easily be obtained by a simple bitwise AND with all the squares a rook/bishop could possibly move to with all blocking pieces. For example, if we want to find possible blocking pieces for a rook on f6, we could do the following:
rookMask blockers maskedBlockers
. . . . . . . . 1 1 . . . 1 . . . . . . . . . .
. . . . . 1 . . . 1 1 1 1 . . 1 . . . . . . . .
. 1 1 1 1 . 1 . . 1 . 1 . . . 1 . 1 . 1 . . . .
. . . . . 1 . . . . . . . . . . . . . . . . . .
. . . . . 1 . . & . 1 . . . . 1 . = . . . . . . . .
. . . . . 1 . . . . . 1 . . . . . . . . . . . .
. . . . . 1 . . . . . . . 1 . . . . . . . 1 . .
. . . . . . . . 1 . . . . . . . . . . . . . . .
Note that rookMask
excludes edge squares as they don't affect the attack set.
In the example above for instance, the rook can attack f8 regardless of whether
or not there is a piece on that square.
The maskedBlockers
bitboard above is what our hash table will map to an attack
set. In the example above, our hash table would map maskedBlockers
to the
following attack set bitboard:
. . . . . . . .
. . . . . 1 . .
. . . . 1 . 1 .
. . . . . 1 . .
. . . . . 1 . .
. . . . . 1 . .
. . . . . 1 . .
. . . . . . . .
The question now is how to construct a hash table to quickly obtain an attack
set from a maskedBlockers
bitboard. The first thing that might come to mind is
not to use a hash function at all, simply using maskedBlockers
directly as a
key into an array of attack set bitboards. Theoretically this works, but we're
faced with the problem of having to store an array of size 8*264
bytes (147.57 exabytes) entirely in memory.
Because of this, we need to hash our 64 bit maskedBlockers
bitboard into a
smaller key, which we can then use to index our preinitialized attack set table.
Ideally we want the hash function to be
perfect (ie. no
collisions), and very quick to execute (on the order of a few instructions).
The process for doing this is shown below:
maskedBlockers magicNum multResult
. . . . . . . . . . 1 . 1 . 1 .
. . . . . . . . 1 . . . . . . .
. 1 . 1 . . . . . 1 . 1 . 1 . .
. . . . . . . . . . . . . . . .
. . . . . . . . * 0x12000810020004 = . . . . . . 1 .
. . . . . . . . . . . . . . . .
. . . . . 1 . . . . . . . . . 1
. . . . . . . . . . . . . . . .
multResult key
. . 1 . 1 . 1 . . . . . . . . .
1 . . . . . . . . . . . . . . .
. 1 . 1 . 1 . . . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . 1 . >> 54 = . . . . . . . .
. . . . . . . . . . . . . . . .
. . . . . . . 1 1 . . . . . . .
. . . . . . . . . . . . 1 . 1 .
```
This may seem like a strange convoluted process. You're probably wondering where
0x12000810020004
came from. There's a method to the madness though. Let's
start by looking at the multiplication, which I'm going to rewrite slightly.
I've replaced the nonzero bits in the maskedBlockers
bitboard with the letters
A - C.
maskedBlockers magicNum multResult
. . . . . . . . . . A . B . C .
. . . . . . . . 1 . . . . . . .
. B . C . . . . . 1 . 1 . 1 . .
. . . . . . . . . . . . . . . .
. . . . . . . . * 0x12000810020004 = . . . . . . 1 .
. . . . . . . . . . . . . . . .
. . . . . A . . . . . . . . . 1
Now take a look at a multiplication with a different blockers bitboard. Again, the nonzero bits have been replaced with the letters A - C.
maskedBlockers magicNum multResult
. . . . . . . . . A . . . B C .
. . . . . . . . 1 . . . . . . .
. . . . C . . . . 1 . . . . . 1
. . . . . . . . . . . . . . . .
. . . . . B . . * 0x12000810020004 = . . . . . . 1 1
. . . . . . . . . . . . . . . .
. . . . . A . . . . . . . . . 1
. . . . . . . . . . . . . . . .
Notice how each multiplication "pushed" the nonzero bits in the maskedBlockers
bitboard up. As it turns out, the value 0x12000810020004
, when multiplied with
a maskedBlockers
bitboard for a rook on f6, always "pushes" the bits in
maskedBlockers
up in such a way that the result has a unique top 10 bits for
every possible maskedBlockers
bitboard. This means that the value we obtain by
right shifting multResult
by 64 - 10 = 54 (ie. dropping all but the top 10
bits) can be used as a unique index into our attack set table.
0x12000810020004
is known as a "magic number", and is where the term "magic
bitboard" comes from.
Using this hashing method, the preinitialized attack tables are about 800 KiB for rooks and 38 KiB for bishops, well within the limits of a modern computer's memory.
So how do we generate these magic numbers? As it turns out, it's fairly easy to do via simple brute force just by trying random numbers with low numbers of set bits. My laptop (Core i5 5300U) grinds out magic numbers for rooks and bishops on all squares in ~5 seconds using the rather slow unoptimized code that can be found on this page.
The magics used by Shallow Blue can be found here.
A C++ implementation of magic bitboard hashing and attack set lookup for bishops looks like this:
U64 getBishopAttacksMagic(int square, U64 blockers) {
// Mask blockers to only include bits on diagonals
blockers &= BISHOP_MASKS[square];
// Generate the key using a multiplication and right shift
U64 key = (blockers * BISHOP_MAGICS[square]) >> (64 - BISHOP_INDEX_BITS[square]);
// Return the preinitialized attack set bitboard from the table
return BISHOP_TABLE[square][key];
}
GCC Generates the following x86 assembly on my laptop for the above code when compiling with optimization flags:
0000000000012eb0 <getBishopAttacksMagic>:
12eb0: 48 63 ff movslq %edi,%rdi
12eb3: 48 8d 05 26 3c 49 00 lea 0x493c26(%rip),%rax
12eba: 48 8d 15 df 32 00 00 lea 0x32df(%rip),%rdx
12ec1: 48 23 34 f8 and (%rax,%rdi,8),%rsi
12ec5: 48 8d 05 d4 33 00 00 lea 0x33d4(%rip),%rax
12ecc: 48 0f af 34 f8 imul (%rax,%rdi,8),%rsi
12ed1: 48 89 f0 mov %rsi,%rax
12ed4: be 40 00 00 00 mov $0x40,%esi
12ed9: 2b 34 ba sub (%rdx,%rdi,4),%esi
12edc: 48 c1 e7 0c shl $0xc,%rdi
12ee0: c4 e2 cb f7 f0 shrx %rsi,%rax,%rsi
12ee5: 48 8d 05 f4 3b 29 00 lea 0x293bf4(%rip),%rax
12eec: 48 01 f7 add %rsi,%rdi
12eef: 48 8b 04 f8 mov (%rax,%rdi,8),%rax
12ef3: c3 retq
Notice the huge reduction in instructions over the classical approach. We've gone from 47 to 15 instructions to accomplish exactly the same task. Looking beyond simple instruction counts, speedups also come from the huge caches found on modern processors, which can store large swaths of the attack set tables. The superiority of the this approach can be clearly seen when using Shallow Blue to calculate perft(5) from the starting position. Magic bitboards show a 30% speedup over the classical approach.
perft 5
...
==========================
Total time (ms) : 319
Nodes searched : 4865609
Nodes / second : 15249341
perft 5
...
==========================
Total time (ms) : 246
Nodes searched : 4865609
Nodes / second : 19731045
Final Thoughts
Setting out to write a chess engine is a very dangerous thing to do. It will never feel done and the fact that I spent a good chunk of my winter break implementing magic bitboard move generation is a testament to that.
All the code for magic bitboard movegen in Shallow Blue can be viewed on GitHub for the interested.