Debugging a Slow VM on a 2007-Era Xeon

Posted on: Mon 17 August 2020

Recently I've been involved in a project that requires a lot of kernel-space work on Linux. This of course means frequent rebuilds of the Linux kernel. A build from scratch took about fifteen minutes on my laptop, which provided too many opportunities to justify slacking off, so I decided faster builds were in order. It was then that I realized I had an eight-core Dell workstation sitting in my basement that had been collecting dust for years. While it wasn't the newest machine (circa 2007), its dual-CPU motherboard with two quad-core Xeon E5335s did have a total of eight physical cores to compete with my laptop's two (Core i5-5300U). I figured this would undoubtedly lead to a faster build and spun it up.

To avoid nuking my development machine's kernel, I did all my kernel work (including building) inside a QEMU/KVM virtual machine. Virtualization does add some overhead, but KVM with VT-x kept it minimal, and the difference between guest and host build times on my laptop was negligible. I thought the story would be the same on the workstation. Its CPUs were older, but they did have virtualization extensions, so I expected a substantial decrease in build times from the extra cores.

Firing up a build from scratch, I was surprised to see that a clean build of Linux inside the VM on the workstation took fifteen minutes, almost exactly the same time as on my dual-core laptop. While single-threaded performance on Intel chips has certainly improved since 2007, two cores matching eight in such a highly parallelizable task seemed fishy. Using the same .config, I re-ran the build outside a VM on the workstation, where it finished in seven minutes. I was intrigued by the massive performance hit.

When I asked around, someone told me they had seen similar issues before that boiled down to performance hits from Meltdown mitigations, specifically Kernel Page Table Isolation (KPTI). As a quick test, I ran a single-threaded build of Shallow Blue to completion inside the guest with KPTI enabled and disabled. KPTI did appear to be the root cause (all times cited are averaged over 50 runs on an otherwise idle system):

| KPTI Status in Guest | Time needed for `make -j1 all` |
|----------------------|--------------------------------|
| Enabled              | 47.972s                        |
| Disabled             | 25.088s                        |

Curious about the massive performance hit KPTI caused in the guest, I ran the same test on the host with KPTI enabled and disabled for comparison:

| KPTI Status in Host | Time needed for `make -j1 all` |
|---------------------|--------------------------------|
| Enabled             | 18.184s                        |
| Disabled            | 17.583s                        |

So while KPTI still results in a performance hit on the host (roughly 3%), it's nowhere near the scale of the performance hit in the guest. The host performance hit can likely be explained by the increased TLB misses resulting from KPTI (especially seeing as this processor is old enough that it doesn't have PCID), but I wouldn't expect that to be solely responsible for the roughly 91% slowdown seen in the guest.
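For reference, the relative slowdowns implied by the two tables can be worked out with a few lines of C. This is only a sketch of the arithmetic; the function name `slowdown_percent` is made up for illustration, and the inputs are the averaged timings quoted above.

```c
#include <assert.h>

/*
 * Percentage by which the slow run exceeds the fast run.
 * Both arguments are wall-clock times in seconds.
 * (Hypothetical helper, not part of any tool used in this post.)
 */
static double slowdown_percent(double fast, double slow)
{
    assert(fast > 0.0); /* a zero baseline would make the ratio meaningless */
    return (slow - fast) / fast * 100.0;
}
```

Calling `slowdown_percent(25.088, 47.972)` for the guest gives about 91.2, while `slowdown_percent(17.583, 18.184)` for the host gives about 3.4.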

The root cause of these types of performance issues can often be found by generating a few flame graphs, which can provide a massive amount of insight into performance bottlenecks. Attaching perf to the parent QEMU process, I generated the following flame graphs while Shallow Blue was building:

[Flame graph: KPTI enabled in guest] [Flame graph: KPTI disabled in guest]

Open both of the above SVGs in another tab and you should be able to search and zoom in on particular sections.

So from the above two flame graphs, it's very clear that the kernel function kvm_mmu_sync_roots in arch/x86/kvm/mmu.c becomes a huge bottleneck when we enable KPTI in the guest. Correlating the stack trace in the graph with kernel sources, one can see that it's being called from vcpu_enter_guest, just before KVM hands off control to the guest OS.

if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
        kvm_mmu_sync_roots(vcpu);

Since the CPUs on this workstation are from 2007, they don't support second-level address translation (EPT, in Intel's terminology), so KVM has to fall back to shadow page tables. The call to kvm_check_request checks whether a resync of the shadow page tables against the guest's own page tables has been requested. If it returns true, kvm_mmu_sync_roots is called, which walks all page table entries reachable from the guest's CR3 register (the page table base address) to perform the required sync. Needless to say, walking the entire page table hierarchy is expensive.

This synchronization must be performed in a few circumstances, notably for us, upon a mov to the CR3 register (i.e. a change of page tables). Since KPTI vastly increases the number of CR3 switches (every transition between user space and kernel space requires a change of page tables), it seems we've located the source of the problem.
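To make the effect concrete, here is a toy model of the request-and-sync pattern described above. It is a sketch under loud assumptions, not KVM's code: every `toy_*` name is invented for illustration, each guest CR3 write is assumed to raise exactly one sync request, and a KPTI-enabled syscall is modelled as two CR3 writes (one on kernel entry, one on return to user space). Interrupts, context switches, and the actual cost of each page table walk are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request number; only loosely mirrors KVM_REQ_MMU_SYNC. */
enum { TOY_REQ_MMU_SYNC = 0 };

struct toy_vcpu {
    uint32_t requests;   /* bitmask of pending requests */
    unsigned long syncs; /* number of full shadow page table syncs done */
};

/* Mark a request as pending for this vCPU. */
static void toy_make_request(struct toy_vcpu *vcpu, int req)
{
    assert(req >= 0 && req < 32);
    vcpu->requests |= UINT32_C(1) << req;
}

/* Test-and-clear: returns true once per pending request. */
static bool toy_check_request(struct toy_vcpu *vcpu, int req)
{
    uint32_t bit = UINT32_C(1) << req;

    if (!(vcpu->requests & bit))
        return false;
    vcpu->requests &= ~bit;
    return true;
}

/* Models the check made just before control returns to the guest. */
static void toy_enter_guest(struct toy_vcpu *vcpu)
{
    if (toy_check_request(vcpu, TOY_REQ_MMU_SYNC))
        vcpu->syncs++; /* stands in for the expensive page table walk */
}

/* Models a trapped guest CR3 write followed by re-entering the guest. */
static void toy_guest_write_cr3(struct toy_vcpu *vcpu)
{
    toy_make_request(vcpu, TOY_REQ_MMU_SYNC);
    toy_enter_guest(vcpu);
}

/* Count syncs caused by a number of guest syscalls, with or without KPTI. */
static unsigned long toy_count_syncs(unsigned long syscalls, bool kpti)
{
    struct toy_vcpu vcpu = { 0, 0 };

    for (unsigned long i = 0; i < syscalls; i++) {
        if (kpti) {
            toy_guest_write_cr3(&vcpu); /* user -> kernel page tables */
            toy_guest_write_cr3(&vcpu); /* kernel -> user page tables */
        }
        /* Without KPTI a syscall does not touch CR3 in this model. */
    }
    return vcpu.syncs;
}
```

In this model `toy_count_syncs(1000, true)` reports 2000 syncs and `toy_count_syncs(1000, false)` reports none, which is the shape of the problem: a syscall-heavy workload such as a build turns every kernel entry and exit into a page table walk on the host.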

So what can be done about this? Well, my host was running a 4.9 kernel, and interestingly, this seems to have been more or less fixed in 4.19, with a number of workaround patches that cut down on the number of shadow page table resyncs needed. If upgrading to 4.19+ isn't an option, however, perhaps invest in a newer CPU that has EPT and hope nobody finds any more phantom trolleys in Intel CPUs.
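If you want to check whether a given host is likely to hit this, one rough approach is to look for the relevant feature names in the `flags` line of /proc/cpuinfo. The helper below is a minimal sketch: `has_cpu_flag` is a hypothetical name, and it assumes the running kernel reports features such as `vmx`, `ept`, and `pcid` as space-separated tokens on that line, which may differ between kernel versions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if `name` appears as a whole whitespace-delimited token in
 * `flags` (for example, one "flags" line read from /proc/cpuinfo).
 * Whole-token matching avoids false positives such as "ept" matching
 * inside "ept_ad".
 */
static bool has_cpu_flag(const char *flags, const char *name)
{
    assert(flags != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return false;

    for (const char *p = strstr(flags, name); p != NULL;
         p = strstr(p + 1, name)) {
        char before = (p == flags) ? ' ' : p[-1];
        char after = p[len];
        bool starts = (before == ' ' || before == '\t');
        bool ends = (after == '\0' || after == ' ' ||
                     after == '\t' || after == '\n');

        if (starts && ends)
            return true;
    }
    return false;
}
```

A caller would read /proc/cpuinfo line by line (for example with `fgets`), pick a line beginning with `flags`, and pass it to `has_cpu_flag` with `"ept"` or `"pcid"`. A host that reports `vmx` but not `ept` is in the same position as the workstation here.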