Here's a fun debugging story involving a quick-to-find proximate cause and a dreadful-to-find root cause.
Here at Elastic, we use eBPF (often known less officially as simply BPF) to gather security-related events on Linux for our cloud security solution. To accomplish this, we hook into a variety of points in the Linux kernel with BPF programs to gather event-specific data at the source.
One such event we monitor is writes to terminal devices. What a process writes to the terminal is often interesting from a security perspective (e.g., an insider threat exfiltrates a private key by cat'ing it and copying it from the terminal).
To accomplish this, we hook into the tty_write function with a BPF program and read the data written to the terminal out of the function arguments. Linux's tracing infrastructure makes this pretty easy — you can hook almost all kernel functions by attaching a kprobe to them that invokes a BPF program.
Linux commit 9bb48c8, however, complicated this strategy. It changed the signature of tty_write, requiring us to determine the signature in use on the running kernel in order to correctly parse the function arguments and read the data written to the terminal.
That change went into Linux 5.10.11, which would lead one to believe we could just check the kernel version to determine the function signature. Unfortunately, things aren't quite that easy.
Since being merged into the mainline sources, many distribution-specific kernels have backported that patch to an older base kernel. This means they will report a version pre-5.10.11 while still having the post-5.10.11 function signature. Relying on version/distro checks thus requires maintaining a huge hardcoded map from version/distro to tty_write signature that, while possible, is a recipe for headaches and will always be liable to break when run on a distribution not yet added to the list.
Thankfully though, there is a better way.
BTF information is available at /sys/kernel/btf/vmlinux on 5.5+ kernels compiled with CONFIG_DEBUG_INFO_BTF. This BTF information describes most types in the C source the kernel was compiled from (structs, typedefs, function signatures, etc.), allowing source-code-level type information to be accessed on a running kernel.
From the system BTF information, we can determine the signature of tty_write without having to rely on a long and difficult-to-maintain list of distribution and version number checks. libbpf provides helpers to access and parse the system BTF, making it easy to detect the signature of tty_write with just a small bit of C.
Seems pretty straightforward. Turns out, however, that this was the start of a long rabbit hole...
The above strategy was implemented and worked well on most kernels. Test failures, however, were observed on a small number of ARM64 kernels — x86 kernels consistently passed.
Digging into the failures, it was noticed that BTF info for tty_write was absent on the offending ARM64 kernels, as shown here with bpftool:
# bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep tty_write
[58503] FUNC 'redirected_tty_write' type_id=19506 linkage=static
[58505] FUNC 'tty_write_message' type_id=58504 linkage=static
[58566] FUNC 'tty_write_room' type_id=58565 linkage=static
Naturally, this would cause the tests to fail: if BTF info for tty_write doesn't exist, we can't detect its signature.
On a working kernel, the same command produced the following, showing that BTF information for tty_write existed:
# bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep tty_write
[58437] FUNC 'redirected_tty_write' type_id=26281 linkage=static
[58438] FUNC 'tty_write' type_id=17027 linkage=static
[58440] FUNC 'tty_write_message' type_id=58439 linkage=static
[58458] FUNC 'hung_up_tty_write' type_id=17027 linkage=static
[58483] FUNC 'n_tty_write' type_id=58482 linkage=static
[58519] FUNC 'n_tty_write_wakeup' type_id=21329 linkage=static
[58550] FUNC 'tty_write_room' type_id=58430 linkage=static
My first thought was that tty_write had been inlined by the ARM64 compiler, meaning it wouldn't be seen at BTF-generation time. A quick cat /proc/kallsyms | grep tty_write on a failing kernel, however, confirmed this was not the case:
# cat /proc/kallsyms | grep tty_write
...snip
ffffc3856827c954 t tty_write
...snip
Interestingly, as can be seen in the output of bpftool above, tty_write was absent from the failing kernel's BTF information, but some related functions were still present (e.g., redirected_tty_write). For some reason, tty_write and a handful of other functions with tty_write in their name seem to not be present in the system BTF on the failing kernels. We've found our proximate cause.
pahole is the tool used to generate BTF at kernel build time. Peeking into the v1.20 sources (the version being used to build all the kernels), the relevant code could be found in btf_encoder.c:
[btf_encoder.c listing, lines 1-46, not reproduced here]
Stepping through the code with gdb, when the cu__for_each_function loop got to processing redirected_tty_write (which was always present in the BTF), control would flow into the conditional block on line 31 (i.e., functions_cnt != 0), then onwards to the subsequent code past line 46 that generates the BTF.
When the loop got to processing tty_write, however, functions_cnt would be 0, causing control to flow to the else on line 43, where fn->external was false. This resulted in the continue being hit and no BTF being generated for tty_write.
Seems tty_write is ultimately being skipped during BTF generation because it doesn't have external linkage (the check on line 44), but to hit that logic at all, it has to flow through the if (functions_cnt) check, which seems to correspond to the "are not in ftrace list" bullet point in the comment.
This led to a lot of digging into how ftrace works. Particularly helpful was this post on Brendan Gregg's blog, which links to a recording of a talk at Kernel Recipes 2019 by kernel maintainer Steven Rostedt. Go watch that talk, seriously, it's great.
Ftrace uses a compiler option to insert a few nops at the start of nearly every kernel function and, when a user enables tracing on a kernel function, dynamically patches the nops in that function with a call to a trampoline that invokes tracing callbacks. This allows functions to be traced with essentially no performance hit when ftrace is not in use (the performance penalty of the nops is so tiny as to be essentially zero).
Putting this together with the pahole code above, it seems tty_write isn't getting BTF generated because it isn't in the list of functions ftrace knows about (this is what the if (functions_cnt) check seems to correspond to).
Lending further credence to this theory, it was noticed that when an ARM64 kernel build was performed with pahole >= 1.22, BTF for tty_write would be present. It was only before version 1.22 that pahole was not generating BTF for tty_write.
One git bisect later, it was discovered that pahole commit 58a98f7 was what fixed the problem. It did this by removing the constraint that BTF is only generated for functions known to ftrace. This restriction was removed as it was determined that the space savings of only generating BTF for functions known to ftrace (the motivation for the restriction in the first place) were outweighed by the usefulness of the extra BTF. We're narrowing in.
So it seems tty_write is unknown to ftrace, meaning that in pahole <1.22, BTF won't be generated describing its signature. There's a file under /sys/kernel/tracing that lists all functions known to ftrace, allowing us to check this on a failing ARM64 kernel:
# cat /sys/kernel/tracing/available_filter_functions | grep tty_write
hung_up_tty_write
file_tty_write.constprop.0
tty_write
redirected_tty_write
tty_write_message
n_tty_write_wakeup
n_tty_write
tty_write_room
rfcomm_tty_write_room [rfcomm]
rfcomm_tty_write [rfcomm]
But, it's there...
Both the working x86 and failing ARM64 kernels list tty_write in /sys/kernel/tracing/available_filter_functions, and in manual tests, tty_write was traceable with ftrace on both kernels. So what gives?
This involved digging into how exactly ftrace stores the list of functions it's capable of tracing, something the above linked talk helped immensely with.
The long and short of it is that there's logic in the kernel source tree that, at build time, generates a list of the addresses of all functions that have been instrumented with nops by ftrace and stores those addresses between two linker-defined symbols in the produced vmlinux binary: __start_mcount_loc and __stop_mcount_loc (the name mcount is a bit of historical baggage related to GNU gprof; you can just mentally substitute ftrace_functions for mcount in those symbol names).
These symbols are located in the .init.data section. The whole list of addresses is read by the kernel at boot, used to initialize ftrace, and then discarded (along with the rest of the data in .init.data, all of which is only needed at boot).
Looking at the pahole source code before commit 58a98f7, to obtain this list of addresses, pahole was simply mapping the kernel vmlinux image into memory, seeking to the __start_mcount_loc symbol, and reading the list of function addresses. This is the "ftrace list" referenced by the comment in pahole's source.
Since tty_write is traceable with ftrace on the breaking kernel, it would make sense that its address is found in the ftrace list, but it seems pahole isn't finding it for whatever reason.
Time to pull out gdb again. The ftrace list is read into the addrs variable, which can be read at btf_encoder.c:171 in pahole 1.20.
$ gdb -q pahole
Reading symbols from pahole...
(gdb) b btf_encoder.c:171
No source file named btf_encoder.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btf_encoder.c:171) pending.
(gdb) run -J ~/vmlinux-aarch64
Starting program: /home/vagrant/dwarves/build/pahole -J ~/vmlinux-aarch64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, get_vmlinux_addrs (btfe=0x555555594110, fl=0x7fffffefddf0, paddrs=0x7fffffefdd90, pcount=0x7fffffefdd88)
at /home/vagrant/dwarves/btf_encoder.c:171
171 *paddrs = addrs;
(gdb) x/32gx addrs
0x7fffda05d010: 0x0000000000000000 0x0000000000000000
0x7fffda05d020: 0x0000000000000000 0x0000000000000000
0x7fffda05d030: 0x0000000000000000 0x0000000000000000
0x7fffda05d040: 0x0000000000000000 0x0000000000000000
0x7fffda05d050: 0x0000000000000000 0x0000000000000000
0x7fffda05d060: 0x0000000000000000 0x0000000000000000
0x7fffda05d070: 0x0000000000000000 0x0000000000000000
0x7fffda05d080: 0x0000000000000000 0x0000000000000000
0x7fffda05d090: 0x0000000000000000 0x0000000000000000
0x7fffda05d0a0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0b0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0c0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0d0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0e0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0f0: 0x0000000000000000 0x0000000000000000
0x7fffda05d100: 0x0000000000000000 0x0000000000000000
Weird... it's all nulls. Let's try a working x86 kernel:
$ gdb -q pahole
Reading symbols from pahole...
(gdb) b btf_encoder.c:171
No source file named btf_encoder.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btf_encoder.c:171) pending.
(gdb) run -J ~/vmlinux-x86_64
Starting program: /home/vagrant/dwarves/build/pahole -J ~/vmlinux-x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, get_vmlinux_addrs (btfe=0x555555595f20, fl=0x7fffffefddf0, paddrs=0x7fffffefdd90, pcount=0x7fffffefdd88)
at /home/vagrant/dwarves/btf_encoder.c:171
171 *paddrs = addrs;
(gdb) x/32gx addrs
0x7fffc2105010: 0xffffffff81000660 0xffffffff810006a0
0x7fffc2105020: 0xffffffff810006e0 0xffffffff81000720
0x7fffc2105030: 0xffffffff81000c80 0xffffffff81000dd0
0x7fffc2105040: 0xffffffff81000fa0 0xffffffff81000fd0
0x7fffc2105050: 0xffffffff81001000 0xffffffff81001030
0x7fffc2105060: 0xffffffff81001050 0xffffffff810014c0
0x7fffc2105070: 0xffffffff810014f0 0xffffffff81001500
0x7fffc2105080: 0xffffffff81001510 0xffffffff81001ac0
0x7fffc2105090: 0xffffffff81001b40 0xffffffff81001c00
0x7fffc21050a0: 0xffffffff81001d70 0xffffffff81001f30
0x7fffc21050b0: 0xffffffff81001f40 0xffffffff81001ff0
0x7fffc21050c0: 0xffffffff810020e0 0xffffffff81002190
0x7fffc21050d0: 0xffffffff810021b0 0xffffffff81002210
0x7fffc21050e0: 0xffffffff810022d0 0xffffffff81002310
0x7fffc21050f0: 0xffffffff810024c0 0xffffffff81002500
0x7fffc2105100: 0xffffffff81002940 0xffffffff81002970
Those look like valid kernelspace addresses. Why the discrepancy? Loading the ARM64 kernel in radare2 confirms that there's nothing but zeros between the __start_mcount_loc and __stop_mcount_loc symbols:
$ r2 vmlinux-aarch64
[0xffff800008000000]> s loc.__start_mcount_loc
[0xffff80000a2f3e68]> px
- offset - 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
0xffff80000a2f3e68 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3e78 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3e88 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3e98 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ea8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3eb8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ec8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ed8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ee8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ef8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f08 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f18 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f28 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f38 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f48 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f58 0000 0000 0000 0000 0000 0000 0000 0000 ................
On an x86 kernel, however, the same addresses seen in gdb are present:
$ r2 vmlinux-x86_64
[0xffffffff81e00000]> s loc.__start_mcount_loc
[0xffffffff836f4a10]> px
- offset - 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
0xffffffff836f4a10 6006 0081 ffff ffff a006 0081 ffff ffff `...............
0xffffffff836f4a20 e006 0081 ffff ffff 2007 0081 ffff ffff ........ .......
0xffffffff836f4a30 800c 0081 ffff ffff d00d 0081 ffff ffff ................
0xffffffff836f4a40 a00f 0081 ffff ffff d00f 0081 ffff ffff ................
0xffffffff836f4a50 0010 0081 ffff ffff 3010 0081 ffff ffff ........0.......
0xffffffff836f4a60 5010 0081 ffff ffff c014 0081 ffff ffff P...............
0xffffffff836f4a70 f014 0081 ffff ffff 0015 0081 ffff ffff ................
0xffffffff836f4a80 1015 0081 ffff ffff c01a 0081 ffff ffff ................
0xffffffff836f4a90 401b 0081 ffff ffff 001c 0081 ffff ffff @...............
0xffffffff836f4aa0 701d 0081 ffff ffff 301f 0081 ffff ffff p.......0.......
0xffffffff836f4ab0 401f 0081 ffff ffff f01f 0081 ffff ffff @...............
0xffffffff836f4ac0 e020 0081 ffff ffff 9021 0081 ffff ffff . .......!......
0xffffffff836f4ad0 b021 0081 ffff ffff 1022 0081 ffff ffff .!......."......
0xffffffff836f4ae0 d022 0081 ffff ffff 1023 0081 ffff ffff .".......#......
0xffffffff836f4af0 c024 0081 ffff ffff 0025 0081 ffff ffff .$.......%......
0xffffffff836f4b00 4029 0081 ffff ffff 7029 0081 ffff ffff @)......p)......
This naturally raises the question: where are the ARM64 ftrace addresses? We know the kernel is getting the list somehow: it lists tty_write in /sys/kernel/tracing/available_filter_functions and is able to trace the function. The list isn't there, however, when we map the kernel into memory.
A few hours of research and binary grepping later, I found the answer:
$ nm -C vmlinux-aarch64 | grep tty_write
...snip
ffff800008914af0 t tty_write
...snip
$ llvm-readelf -r vmlinux-aarch64 | grep ffff800008914af0
ffff8000094fff40 0000000000000403 R_AARCH64_RELATIVE ffff800008914af0
In the kernel ELF binary, the data between __start_mcount_loc and __stop_mcount_loc was entirely zeros, but relocations had been emitted such that the kernel will patch up all the zeros with actual addresses at boot, when it relocates itself.
So, when the kernel actually boots, the section of the address space in .init.data between __start_mcount_loc and __stop_mcount_loc will contain the correct function addresses. If the binary is simply mapped into memory during the kernel build process by pahole, however, it will only read zeros from the corresponding location, causing the ftrace list to appear empty. The pahole commit that removed the restriction on only generating BTF for functions known to ftrace inadvertently fixed this, as it stopped considering the ftrace list entirely when generating BTF.
Rebuilding the ARM64 kernel with CONFIG_RELOCATABLE=n and generating BTF with pahole 1.20 fixed the problem. There were no relocations — the data was simply there and pahole generated BTF as expected. But why was that data zeroed only on ARM64? The x86 kernels were all built with CONFIG_RELOCATABLE=y, so shouldn't they be emitting relocations as well? After all, a relocatable kernel could be loaded anywhere, so the addresses we see in the list should all have relocations so they can change if the kernel is loaded somewhere nonstandard:
$ nm -C vmlinux-x86 | grep tty_write
...snip
ffffffff815dfdd0 t tty_write
...snip
$ llvm-readelf -r vmlinux-x86 | grep 5dfdd0
...snip
ffffffff8372d7b8 0000000100000001 R_X86_64_64 ffffffff81000000 .text + 5dfdd0
$ llvm-readelf -s vmlinux-x86 | grep mcount
118204: ffffffff837612c0 0 NOTYPE GLOBAL DEFAULT 38 __stop_mcount_loc
132185: ffffffff836fdf30 0 NOTYPE GLOBAL DEFAULT 38 __start_mcount_loc
It's a touch more complex this time as, on x86, the relocation for tty_write's address in the ftrace list is given relative to the .text symbol (i.e., it's given as 0xffffffff81000000 + 5dfdd0 instead of ffffffff815dfdd0), but as can be seen, an equivalent relocation is there.
Ok, so relocations are emitted for both x86 and ARM64, but the actual locations in the produced binary corresponding to those relocations are only filled in (assuming a normal 1M kernel load address) on x86.
The reason for this discrepancy was found only after much grepping. As of Linux 5.19, it can be found at arch/arm64/Makefile:15, which was added in commit fd9dde6:
commit fd9dde6abcb9bfe6c6bee48834e157999f113971
Author: Nick Desaulniers <[email protected]>
Date: Fri Oct 27 09:33:41 2017 -0700
arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27
Upon upgrading to binutils 2.27, we found that our lz4 and gzip
compressed kernel images were significantly larger, resulting is 10ms
boot time regressions.
As noted by Rahul:
"aarch64 binaries uses RELA relocations, where each relocation entry
includes an addend value. This is similar to x86_64. On x86_64, the
addend values are also stored at the relocation offset for relative
relocations. This is an optimization: in the case where code does not
need to be relocated, the loader can simply skip processing relative
relocations. In binutils-2.25, both bfd and gold linkers did this for
x86_64, but only the gold linker did this for aarch64. The kernel build
here is using the bfd linker, which stored zeroes at the relocation
offsets for relative relocations. Since a set of zeroes compresses
better than a set of non-zero addend values, this behavior was resulting
in much better lz4 compression.
The bfd linker in binutils-2.27 is now storing the actual addend values
at the relocation offsets. The behavior is now consistent with what it
does for x86_64 and what gold linker does for both architectures. The
change happened in this upstream commit:
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613
Since a bunch of zeroes got replaced by non-zero addend values, we see
the side effect of lz4 compressed image being a bit bigger.
To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
flag can be used:
$ LDFLAGS="--no-apply-dynamic-relocs" make
With this flag, the compressed image size is back to what it was with
binutils-2.25.
...snip
So, on x86, the target address of the relocations is patched up in the actual binary image with the relocation value (again, assuming a normal 1M kernel load address), whereas on ARM64, the target location of the relocation value is always set to 0 as a result of --no-apply-dynamic-relocs, which was done to make the kernel image more compressible. This has been the case since 4.15, and it broke pahole's assumption that the region between __start_mcount_loc and __stop_mcount_loc in the vmlinux binary would always contain valid kernel addresses.
Whew... there's our root cause.
So what's the fix for this? We can't just fail for kernels built with pahole <1.22.
Well, conveniently enough, kernel commit 9bb48c8 also changed the signature of redirected_tty_write in exactly the same way as tty_write. redirected_tty_write has external linkage, so pahole <1.22 will generate BTF for it even if it can't find it in the (entirely null) ftrace list. We updated our prototype-detection logic to use this function as an analog for the signature of tty_write instead and ended up with a working BPF program.
And that's all I've got for today. Hopefully this rabbit hole made for an enjoyable read. If you're interested in the security-related BPF work we're doing at Elastic, all of our BPF code and associated tools (including the fix described in this blog post) are publicly available.