While doing some work on a personal project today, I came across one of the nastiest heisenbugs I've ever had the displeasure of dealing with. In hindsight, the root cause is obvious, but it certainly wasn't a few hours ago when I was on the verge of drop-kicking my laptop across my apartment.
I had just encountered a segfault in my binary packer, Kiteshield. I fired up GDB and re-ran the packer with a breakpoint set, and Kiteshield segfaulted almost immediately, well before the offending instruction. When I repeated the process with no breakpoint, the original instruction of interest started segfaulting again. After a few more tries, it was clear that whenever a breakpoint was set, Kiteshield would exhibit this early segfault, and whenever it wasn't, it wouldn't: a classic heisenbug.
Of course, the inability to perform the most canonical of debugging tasks, setting breakpoints, did not make this easy to debug. Much grumbling and several debug print statements later, it became clear that an ELF header Kiteshield was trying to parse at 0xa03354 was corrupted:
(gdb) x/4c 0xa03354
0xa03354: -89 '\247' -76 '\264' -22 '\352' -80 '\260'
Evidently not the ELF magic bytes of 0x7F 'E' 'L' 'F'.
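For context, this is the kind of sanity check that garbage like the above fails: a valid ELF file must begin with those four magic bytes. A minimal sketch of such a check (my own illustration, not Kiteshield's actual code):

#include <stdint.h>
#include <string.h>

/* Return nonzero if the buffer starts with the ELF magic: 0x7f 'E' 'L' 'F'. */
static int has_elf_magic(const uint8_t *hdr)
{
    static const uint8_t magic[4] = { 0x7f, 'E', 'L', 'F' };
    return memcmp(hdr, magic, sizeof(magic)) == 0;
}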
Diffing Kiteshield's debug output with and without a breakpoint set in GDB revealed the cause of this corruption. When no breakpoints were active in GDB, the following was logged:
[kiteshield] RC4 decrypting binary with key 5ae8d877b6f3a8203f9c7d9cb1a7b690
[kiteshield] decrypted 12336 bytes
Placing a breakpoint and restarting, there was the slightest of changes:
[kiteshield] RC4 decrypting binary with key 5ae8d877b6f3a8203f9c7d9cb16bb690
[kiteshield] decrypted 12336 bytes
Catch it? No? Don't worry: neither did I for a few hours while I contemplated changing careers.
In case you missed it, the 14th byte (and only the 14th byte) of the RC4 key used to decrypt the packed binary changed from 0xa7 to 0x6b. This obviously explains the corrupted data: we're using the wrong key to decrypt our packed ELF. The question now is why setting a breakpoint causes this behaviour.
In order to avoid storing the naked RC4 key used to encrypt the binary on disk, Kiteshield obfuscates it. Specifically, it does this by XORing every byte of the loader code (that is, the code I'd been debugging) into successive bytes of the key. This serves two purposes: obfuscating the key (as mentioned) and protecting against code patching. If a reverse engineer attempts to patch even a single byte of the loader code, the key will deobfuscate incorrectly and the binary will be decrypted with the wrong key, leading to an inevitable segfault as the loader code tries to parse garbage data.
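As a rough illustration of that kind of XOR-folding, here is my own sketch (with an assumed 16-byte key to match the log output above, not Kiteshield's exact scheme):

#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 16  /* assumed key size, matching the 16-byte key in the logs */

/* XOR every loader byte into successive key bytes, wrapping around the key.
 * Because XOR is its own inverse, running this once obfuscates the key and
 * running it again over identical loader bytes recovers it. */
static void xor_fold_key(uint8_t key[KEY_LEN], const uint8_t *loader, size_t len)
{
    for (size_t i = 0; i < len; i++)
        key[i % KEY_LEN] ^= loader[i];
}

With a fold like this, a single changed loader byte perturbs exactly one key byte, which lines up with the single-byte difference in the logs above.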
I liked this route more than a simple checksum verification as it causes crashes at locations far away from the anti-patching code, drawing a reverse engineer's eyes elsewhere and hopefully slowing him or her down.
What I hadn't considered during this ordeal was that code patching is exactly how GDB breakpoints are implemented. To implement a breakpoint on x86, GDB replaces the byte at the requested address with an int3 instruction (0xcc), which, when hit, traps into GDB and pauses the program. This of course modifies a single byte of the code, which is enough to mess up key deobfuscation, exactly as Kiteshield is designed to do.
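A minimal sketch of that mechanism, using ptrace the way a Linux debugger on x86 typically would (the function name and the omission of error handling are mine, and GDB itself does far more bookkeeping):

#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Plant a software breakpoint in an already-traced, stopped child:
 * read the word at the target address, splice int3 (0xcc) into its low
 * byte, and write the word back. */
static uint8_t set_breakpoint(pid_t pid, void *addr)
{
    long word = ptrace(PTRACE_PEEKTEXT, pid, addr, NULL);
    uint8_t orig = (uint8_t)(word & 0xff);       /* the byte 0xcc replaces */
    long patched = (word & ~0xffL) | 0xcc;       /* low byte becomes int3 */
    ptrace(PTRACE_POKETEXT, pid, addr, (void *)patched);
    return orig;  /* kept so the original byte can be restored later */
}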
The byte at the address of the breakpoint I was setting happened to be 0x00.
With that in mind, confirming this was easy:
>>> hex(0xa7 ^ 0x6b)
'0xcc' # int3 instruction
Understanding why this is the case is left as an exercise to the reader. ;)
tl;dr I had unknowingly been prevented from debugging by functionality I wrote myself, designed to frustrate debugging. On the one hand, this was indescribably infuriating to deal with. On the other, I'm legitimately somewhat pleased the anti-patching functionality had its intended effect, even if I didn't originally foresee GDB breakpoints triggering it (or intend for myself to be its target... aargh).