Here's a fun debugging story involving a quick-to-find proximate cause and a dreadful-to-find root cause.
Here at Elastic, we use eBPF (often known less officially as simply BPF) to gather security-related events on Linux for our cloud security solution. To accomplish this, we hook into a variety of points in the Linux kernel with BPF programs to gather event-specific data at the source.
One such event we monitor is writes to terminal devices. What a process writes to the terminal is often interesting from a security perspective (e.g., an insider threat exfiltrates a private key by cat'ing it and copying it from the terminal). To accomplish this, we hook into the kernel's tty_write function with a BPF program and read the data written to the terminal out of the function arguments. Linux's tracing infrastructure makes this pretty easy — you can hook almost all kernel functions by attaching a kprobe to them that invokes a BPF program.
A kernel patch, however, complicated this strategy. It changed the signature of tty_write, requiring us to determine the signature in use on the running kernel in order to correctly parse the function arguments and read the data written to the terminal. That change went into Linux 5.10.11, which would lead one to believe we could just check the kernel version to determine the function signature. Unfortunately, things aren't quite that easy.
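For context, the patch converted tty_write from a classic .write file operation to a .write_iter one, so the data now arrives via an iov_iter instead of a raw user buffer. The before/after prototypes look roughly like this (reproduced from the mainline sources around that release; treat as illustrative):

```c
/* Pre-5.10.11: a classic file_operations .write handler. */
static ssize_t tty_write(struct file *file, const char __user *buf,
                         size_t count, loff_t *ppos);

/* Post-5.10.11: a .write_iter handler; the data to be written now
 * arrives wrapped in an iov_iter. */
static ssize_t tty_write(struct kiocb *iocb, struct iov_iter *from);
```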
Since being merged into the mainline sources, many distribution-specific
kernels have backported that patch to an older base kernel. This means they
will report a version pre-5.10.11, while still having the post-5.10.11 function
signature. Relying on version/distro checks thus requires maintaining a huge hardcoded map of version/distro combinations to tty_write signatures. While possible, that is a recipe for headaches and will always break when run on a distribution not yet added to the list.
Thankfully though, there is a better way.
BTF information is exposed at /sys/kernel/btf/vmlinux on 5.5+ kernels compiled with CONFIG_DEBUG_INFO_BTF. This BTF information describes most types in the C source the kernel was compiled from (structs, typedefs, function signatures, etc.). This allows source-code-level type information to be accessed on a running system.
From the system BTF information, we can determine the signature of tty_write without having to rely on a long and difficult-to-maintain list of distribution and version number checks. libbpf provides helpers to access and parse the system BTF, making it easy to detect the signature of tty_write with just a small bit of C.
Seems pretty straightforward. Turns out, however, that this was the start of a long rabbit hole...
The above strategy was implemented and worked well on most kernels. Test failures, however, were observed on a small number of ARM64 kernels — x86 kernels consistently passed.
Digging into the failures, it was noticed that BTF info for tty_write was absent on the offending ARM64 kernels, as shown here with bpftool:

```
# bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep tty_write
FUNC 'redirected_tty_write' type_id=19506 linkage=static
FUNC 'tty_write_message' type_id=58504 linkage=static
FUNC 'tty_write_room' type_id=58565 linkage=static
```
Naturally, this would cause the tests to fail: if BTF info for tty_write doesn't exist, we can't detect its signature.
On a working kernel, the same command produced the following, showing that BTF for tty_write was present:

```
# bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep tty_write
FUNC 'redirected_tty_write' type_id=26281 linkage=static
FUNC 'tty_write' type_id=17027 linkage=static
FUNC 'tty_write_message' type_id=58439 linkage=static
FUNC 'hung_up_tty_write' type_id=17027 linkage=static
FUNC 'n_tty_write' type_id=58482 linkage=static
FUNC 'n_tty_write_wakeup' type_id=21329 linkage=static
FUNC 'tty_write_room' type_id=58430 linkage=static
```
My first thought was that tty_write had been inlined by the ARM64 compiler, meaning it wouldn't be seen at BTF-generation time. A quick cat /proc/kallsyms | grep tty_write on a failing kernel, however, confirmed this was not the case:

```
# cat /proc/kallsyms | grep tty_write
...snip
ffffc3856827c954 t tty_write
...snip
```
Interestingly, as can be seen in the bpftool output above, tty_write was absent in the failing kernel's BTF information, but some related functions were still present (e.g., redirected_tty_write). For some reason, tty_write and a handful of other functions with tty_write in their name were not present in the system BTF on the failing kernels. We've found our proximate cause.
pahole is the tool used to generate BTF at kernel build time. Peeking into the v1.20 sources (the version being used to build all the kernels), the relevant code could be found in btf_encoder.c:

[46-line listing of the cu__for_each_function loop in pahole v1.20's btf_encoder.c; the line numbers referenced below are relative to this listing]
Stepping through the code with gdb, when the cu__for_each_function loop got to redirected_tty_write (which was always present in the BTF), control would flow into the conditional block on line 31 (i.e., functions_cnt != 0), then onwards to the subsequent code past line 46 that generates the function's BTF.
When the loop got to processing tty_write, functions_cnt would be 0, causing control to flow to the else on line 43, where the external-linkage check on line 44 evaluated to false. This resulted in the continue being hit and no BTF being generated for the function.
tty_write is ultimately being skipped during BTF generation because it doesn't have external linkage (the check on line 44), but to hit that logic at all, control has to flow through the if (functions_cnt) check, which seems to correspond to the "are not in ftrace list" bullet point in the comment.
This led to a lot of digging into how ftrace works. Particularly helpful was this post on Brendan Gregg's blog, which links to a recording of a talk at Kernel Recipes 2019 by kernel maintainer Steven Rostedt. Go watch that talk, seriously, it's great.
Ftrace uses a compiler option to insert a few nops at the start of nearly every kernel function and, when a user enables tracing on a kernel function, dynamically patches the nops in that function with a call to a trampoline that invokes tracing callbacks. This allows functions to be traced with essentially no performance hit when ftrace is not in use (the performance penalty of the nops is so tiny as to be essentially zero).
Putting this together with the pahole code above, it seems tty_write is being skipped because it isn't in the list of functions ftrace knows about (this is what the if (functions_cnt) check seems to correspond to).
Lending further credence to this theory, it was noticed that when an ARM64 kernel build was performed with pahole >= 1.22, BTF for tty_write would be present. It was only before version 1.22 that pahole was not generating BTF for it. A git bisect later, and it was discovered that pahole commit 58a98f7 was what fixed the problem. It did this by removing the constraint that BTF is only generated for functions known to ftrace. This restriction was removed as it was determined that the space savings of only generating BTF for functions known to ftrace (the motivation for the restriction in the first place) were outweighed by the usefulness of the extra BTF. We're narrowing in.
So it seems
tty_write is unknown to ftrace, meaning that in pahole <1.22, BTF
won't be generated describing its signature. There's a file under
/sys/kernel/tracing that lists all functions known to ftrace, allowing us to
check this on a failing ARM64 kernel:
```
# cat /sys/kernel/tracing/available_filter_functions | grep tty_write
hung_up_tty_write
file_tty_write.constprop.0
tty_write
redirected_tty_write
tty_write_message
n_tty_write_wakeup
n_tty_write
tty_write_room
rfcomm_tty_write_room [rfcomm]
rfcomm_tty_write [rfcomm]
```
But, it's there...
Both the working x86 and the failing ARM64 kernel list tty_write in /sys/kernel/tracing/available_filter_functions, and in manual tests, tty_write was traceable with ftrace on both kernels.
So what gives?
This involved digging into how exactly ftrace stores the list of functions it's capable of tracing, something the above linked talk helped immensely with.
The long and short of it is that there's logic in the kernel source tree that, at build time, generates a list of the addresses of all functions that have been instrumented with nops by ftrace, and stores those addresses between two linker-defined symbols, __start_mcount_loc and __stop_mcount_loc, in the produced vmlinux binary (the name mcount is a bit of historical baggage related to GNU gprof; you can mentally read mcount as "ftrace" in those symbol names).
These symbols are located in the
.init.text section. The whole list of
addresses is read by the kernel at boot, used to initialize ftrace, and then
discarded (along with the rest of the data in
.init.text, all of which is
only needed at boot).
Looking at the pahole source code before commit 58a98f7: to obtain this list of addresses, pahole was simply mapping the kernel vmlinux image into memory, seeking to the __start_mcount_loc symbol, and reading the list of function addresses. This is the "ftrace list" referenced by the comment in pahole's source. Since tty_write is traceable with ftrace on the breaking kernel, it would make sense that its address is found in the ftrace list, but it seems pahole isn't finding it for whatever reason.
Time to pull out gdb again. The ftrace list is read into the addrs buffer, which can be inspected at btf_encoder.c:171 in pahole v1.20:

```
$ gdb -q pahole
Reading symbols from pahole...
(gdb) b btf_encoder.c:171
No source file named btf_encoder.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btf_encoder.c:171) pending.
(gdb) run -J ~/vmlinux-aarch64
Starting program: /home/vagrant/dwarves/build/pahole -J ~/vmlinux-aarch64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, get_vmlinux_addrs (btfe=0x555555594110, fl=0x7fffffefddf0,
    paddrs=0x7fffffefdd90, pcount=0x7fffffefdd88)
    at /home/vagrant/dwarves/btf_encoder.c:171
171             *paddrs = addrs;
(gdb) x/32gx addrs
0x7fffda05d010: 0x0000000000000000      0x0000000000000000
0x7fffda05d020: 0x0000000000000000      0x0000000000000000
0x7fffda05d030: 0x0000000000000000      0x0000000000000000
0x7fffda05d040: 0x0000000000000000      0x0000000000000000
0x7fffda05d050: 0x0000000000000000      0x0000000000000000
0x7fffda05d060: 0x0000000000000000      0x0000000000000000
0x7fffda05d070: 0x0000000000000000      0x0000000000000000
0x7fffda05d080: 0x0000000000000000      0x0000000000000000
0x7fffda05d090: 0x0000000000000000      0x0000000000000000
0x7fffda05d0a0: 0x0000000000000000      0x0000000000000000
0x7fffda05d0b0: 0x0000000000000000      0x0000000000000000
0x7fffda05d0c0: 0x0000000000000000      0x0000000000000000
0x7fffda05d0d0: 0x0000000000000000      0x0000000000000000
0x7fffda05d0e0: 0x0000000000000000      0x0000000000000000
0x7fffda05d0f0: 0x0000000000000000      0x0000000000000000
0x7fffda05d100: 0x0000000000000000      0x0000000000000000
```
Weird... it's all nulls. Let's try a working x86 kernel:
```
$ gdb -q pahole
Reading symbols from pahole...
(gdb) b btf_encoder.c:171
No source file named btf_encoder.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btf_encoder.c:171) pending.
(gdb) run -J ~/vmlinux-x86_64
Starting program: /home/vagrant/dwarves/build/pahole -J ~/vmlinux-x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Breakpoint 1, get_vmlinux_addrs (btfe=0x555555595f20, fl=0x7fffffefddf0,
    paddrs=0x7fffffefdd90, pcount=0x7fffffefdd88)
    at /home/vagrant/dwarves/btf_encoder.c:171
171             *paddrs = addrs;
(gdb) x/32gx addrs
0x7fffc2105010: 0xffffffff81000660      0xffffffff810006a0
0x7fffc2105020: 0xffffffff810006e0      0xffffffff81000720
0x7fffc2105030: 0xffffffff81000c80      0xffffffff81000dd0
0x7fffc2105040: 0xffffffff81000fa0      0xffffffff81000fd0
0x7fffc2105050: 0xffffffff81001000      0xffffffff81001030
0x7fffc2105060: 0xffffffff81001050      0xffffffff810014c0
0x7fffc2105070: 0xffffffff810014f0      0xffffffff81001500
0x7fffc2105080: 0xffffffff81001510      0xffffffff81001ac0
0x7fffc2105090: 0xffffffff81001b40      0xffffffff81001c00
0x7fffc21050a0: 0xffffffff81001d70      0xffffffff81001f30
0x7fffc21050b0: 0xffffffff81001f40      0xffffffff81001ff0
0x7fffc21050c0: 0xffffffff810020e0      0xffffffff81002190
0x7fffc21050d0: 0xffffffff810021b0      0xffffffff81002210
0x7fffc21050e0: 0xffffffff810022d0      0xffffffff81002310
0x7fffc21050f0: 0xffffffff810024c0      0xffffffff81002500
0x7fffc2105100: 0xffffffff81002940      0xffffffff81002970
```
Those look like valid kernelspace addresses. Why the discrepancy?
Loading the ARM64 kernel in radare2 confirms that there is nothing but zeros between the __start_mcount_loc and __stop_mcount_loc symbols:

```
$ r2 vmlinux-aarch64
[0xffff800008000000]> s loc.__start_mcount_loc
[0xffff80000a2f3e68]> px
- offset -       0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF
0xffff80000a2f3e68  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3e78  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3e88  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3e98  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3ea8  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3eb8  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3ec8  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3ed8  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3ee8  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3ef8  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3f08  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3f18  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3f28  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3f38  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3f48  0000 0000 0000 0000 0000 0000 0000 0000  ................
0xffff80000a2f3f58  0000 0000 0000 0000 0000 0000 0000 0000  ................
```
On an x86 kernel, however, the same addresses seen in
gdb are present:
```
$ r2 vmlinux-x86_64
[0xffffffff81e00000]> s loc.__start_mcount_loc
[0xffffffff836f4a10]> px
- offset -       0 1  2 3  4 5  6 7  8 9  A B  C D  E F  0123456789ABCDEF
0xffffffff836f4a10  6006 0081 ffff ffff a006 0081 ffff ffff  `...............
0xffffffff836f4a20  e006 0081 ffff ffff 2007 0081 ffff ffff  ........ .......
0xffffffff836f4a30  800c 0081 ffff ffff d00d 0081 ffff ffff  ................
0xffffffff836f4a40  a00f 0081 ffff ffff d00f 0081 ffff ffff  ................
0xffffffff836f4a50  0010 0081 ffff ffff 3010 0081 ffff ffff  ........0.......
0xffffffff836f4a60  5010 0081 ffff ffff c014 0081 ffff ffff  P...............
0xffffffff836f4a70  f014 0081 ffff ffff 0015 0081 ffff ffff  ................
0xffffffff836f4a80  1015 0081 ffff ffff c01a 0081 ffff ffff  ................
0xffffffff836f4a90  401b 0081 ffff ffff 001c 0081 ffff ffff  @...............
0xffffffff836f4aa0  701d 0081 ffff ffff 301f 0081 ffff ffff  p.......0.......
0xffffffff836f4ab0  401f 0081 ffff ffff f01f 0081 ffff ffff  @...............
0xffffffff836f4ac0  e020 0081 ffff ffff 9021 0081 ffff ffff  . .......!......
0xffffffff836f4ad0  b021 0081 ffff ffff 1022 0081 ffff ffff  .!......."......
0xffffffff836f4ae0  d022 0081 ffff ffff 1023 0081 ffff ffff  .".......#......
0xffffffff836f4af0  c024 0081 ffff ffff 0025 0081 ffff ffff  .$.......%......
0xffffffff836f4b00  4029 0081 ffff ffff 7029 0081 ffff ffff  @)......p)......
```
This naturally raises the question: where are the ARM64 ftrace addresses? We know the kernel is getting the list somehow: it's listing tty_write in /sys/kernel/tracing/available_filter_functions and is able to trace the function. The addresses are not there, however, when we map the kernel image into memory.
A few hours of research and binary grepping later, I found the answer:
```
$ nm -C vmlinux-aarch64 | grep tty_write
...snip
ffff800008914af0 t tty_write
...snip
$ llvm-readelf -r vmlinux-aarch64 | grep ffff800008914af0
ffff8000094fff40  0000000000000403 R_AARCH64_RELATIVE                ffff800008914af0
```
In the kernel ELF binary, the data between __start_mcount_loc and __stop_mcount_loc was entirely zeros, but relocations had been emitted such that the kernel will patch up all the zeros with actual addresses at boot, when it relocates itself.
So, when the kernel actually boots, the region of the address space between __start_mcount_loc and __stop_mcount_loc will contain the correct function addresses. If the binary is simply mapped into memory during the kernel build process by pahole, however, it will only read zeros from the corresponding location, causing the ftrace list to appear empty. The pahole commit that removed the restriction on only generating BTF for functions known to ftrace inadvertently fixed this, as it stopped considering the ftrace list entirely when generating BTF.
Rebuilding the ARM64 kernel with CONFIG_RELOCATABLE=n and generating BTF with pahole 1.20 fixed the problem: there were no relocations, the data was simply there, and pahole generated BTF as expected. But why was that data zeroed only on ARM64? The x86 kernels were all built with CONFIG_RELOCATABLE=y as well.
But wait, if the x86 kernels were also built with CONFIG_RELOCATABLE=y, shouldn't they be emitting relocations as well? After all, a relocatable kernel could be loaded anywhere, so the addresses we see in the list should all have relocations so they can change if the kernel is loaded somewhere nonstandard:

```
$ nm -C vmlinux-x86 | grep tty_write
...snip
ffffffff815dfdd0 t tty_write
...snip
$ llvm-readelf -r vmlinux-x86 | grep 5dfdd0
...snip
ffffffff8372d7b8  0000000100000001 R_X86_64_64            ffffffff81000000 .text + 5dfdd0
$ llvm-readelf -s vmlinux-x86 | grep mcount
118204: ffffffff837612c0     0 NOTYPE  GLOBAL DEFAULT    38 __stop_mcount_loc
132185: ffffffff836fdf30     0 NOTYPE  GLOBAL DEFAULT    38 __start_mcount_loc
```
It's a touch more complex this time as, on x86, the relocation for tty_write's address in the ftrace list is given relative to the .text section's base (i.e., it's given as 0xffffffff81000000 + 5dfdd0 instead of ffffffff815dfdd0), but as can be seen, an equivalent relocation is there.
Ok, so relocations are emitted for both x86 and ARM64, but the actual locations in the produced binary corresponding to those relocations are only filled in (assuming the default 0x1000000 kernel load address) on x86. Why? The answer lies in this ARM64 kernel commit:
```
commit fd9dde6abcb9bfe6c6bee48834e157999f113971
Author: Nick Desaulniers <[email protected]>
Date:   Fri Oct 27 09:33:41 2017 -0700

    arm64: prevent regressions in compressed kernel image size when
    upgrading to binutils 2.27

    Upon upgrading to binutils 2.27, we found that our lz4 and gzip
    compressed kernel images were significantly larger, resulting in 10ms
    boot time regressions.

    As noted by Rahul:
    "aarch64 binaries uses RELA relocations, where each relocation entry
    includes an addend value. This is similar to x86_64.  On x86_64, the
    addend values are also stored at the relocation offset for relative
    relocations. This is an optimization: in the case where code does not
    need to be relocated, the loader can simply skip processing relative
    relocations.  In binutils-2.25, both bfd and gold linkers did this for
    x86_64, but only the gold linker did this for aarch64.  The kernel build
    here is using the bfd linker, which stored zeroes at the relocation
    offsets for relative relocations.  Since a set of zeroes compresses
    better than a set of non-zero addend values, this behavior was resulting
    in much better lz4 compression.

    The bfd linker in binutils-2.27 is now storing the actual addend values
    at the relocation offsets. The behavior is now consistent with what it
    does for x86_64 and what gold linker does for both architectures.  The
    change happened in this upstream commit:
    https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613

    Since a bunch of zeroes got replaced by non-zero addend values, we see
    the side effect of lz4 compressed image being a bit bigger.

    To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
    flag can be used:
    $ LDFLAGS="--no-apply-dynamic-relocs" make

    With this flag, the compressed image size is back to what it was with
    binutils-2.25.

    ...snip
```
So, on x86, the target address of each relocation is patched up in the actual binary image with the relocation value (again, assuming the default 0x1000000 kernel load address), whereas on ARM64, the target location of the relocation value is always set to 0 as a result of --no-apply-dynamic-relocs, which was done to make the kernel image more compressible. This has been the case since 4.15, and pahole's assumption that the region between __start_mcount_loc and __stop_mcount_loc in the vmlinux binary would always contain valid kernel addresses was thus broken. Whew... there's our root cause.
So what's the fix for this? We can't just fail for kernels built with pahole <1.22.
Well, conveniently enough, the same kernel patch also changed the signature of redirected_tty_write in exactly the same way as tty_write's. Unlike tty_write, redirected_tty_write has external linkage, thus pahole <1.22 will generate BTF for it even if it can't find it in the (entirely null) ftrace list. We updated our prototype-detection logic to use this function as an analog for the signature of tty_write instead and ended up with a working solution.
And, that's all I got for today. Hopefully, you found this rabbit hole an enjoyable read. If you're interested in the security-related BPF work we're doing at Elastic, all of our BPF code and associated tools (including the fix described in this blog post) are publicly available.