Here's a fun debugging story involving a quick-to-find proximate cause and a dreadful-to-find root cause.
Here at Elastic, we use eBPF (often known less officially as simply BPF) to gather security-related events on Linux for our cloud security solution. To accomplish this, we hook into a variety of points in the Linux kernel with BPF programs to gather event-specific data at the source.
One such event we monitor is writes to terminal devices. What a process writes to the terminal is often interesting from a security perspective (e.g., an insider threat exfiltrates a private key by cat'ing it and copying it from the terminal).
To accomplish this, we hook into the tty_write function with a BPF program and read the data written to the terminal out of the function arguments. Linux's tracing infrastructure makes this pretty easy — you can hook almost all kernel functions by attaching a kprobe to them that invokes a BPF program.
Linux commit 9bb48c8, however, complicated this strategy. It changed the signature of tty_write, requiring us to determine the signature in use on the running kernel in order to correctly parse the function arguments and read the data written to the terminal.
That change went into Linux 5.10.11, which would lead one to believe we could just check the kernel version to determine the function signature. Unfortunately, things aren't quite that easy.
Since being merged into the mainline sources, many distribution-specific kernels have backported that patch to an older base kernel. This means they will report a version pre-5.10.11 while still having the post-5.10.11 function signature. Relying on version/distro checks thus requires maintaining a huge hardcoded map from version/distro to tty_write signature that, while possible, is a recipe for headaches and will always be liable to break when run on a distribution not yet added to the list.
Thankfully though, there is a better way.
BTF information is available at /sys/kernel/btf/vmlinux on 5.5+ kernels compiled with CONFIG_DEBUG_INFO_BTF. This BTF information describes most types in the C source the kernel was compiled from (structs, typedefs, function signatures, etc.), allowing source-code-level type information to be accessed on a running kernel.
From the system BTF information, we can determine the signature of tty_write without having to rely on a long and difficult-to-maintain list of distribution and version number checks. libbpf provides helpers to access and parse the system BTF, making it easy to detect the signature of tty_write with just a small bit of C.
Seems pretty straightforward. Turns out, however, that this was the start of a long rabbit hole...
The above strategy was implemented and worked well on most kernels. Test failures, however, were observed on a small number of ARM64 kernels — x86 kernels consistently passed.
Digging into the failures, it was noticed that BTF info for tty_write was absent on the offending ARM64 kernels, as shown here with bpftool:
# bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep tty_write
[58503] FUNC 'redirected_tty_write' type_id=19506 linkage=static
[58505] FUNC 'tty_write_message' type_id=58504 linkage=static
[58566] FUNC 'tty_write_room' type_id=58565 linkage=static
Naturally, this would cause the tests to fail: if BTF info for tty_write doesn't exist, we can't detect its signature.
On a working kernel, the same command produced the following, showing that BTF information for tty_write existed:
# bpftool btf dump file /sys/kernel/btf/vmlinux format raw | grep tty_write
[58437] FUNC 'redirected_tty_write' type_id=26281 linkage=static
[58438] FUNC 'tty_write' type_id=17027 linkage=static
[58440] FUNC 'tty_write_message' type_id=58439 linkage=static
[58458] FUNC 'hung_up_tty_write' type_id=17027 linkage=static
[58483] FUNC 'n_tty_write' type_id=58482 linkage=static
[58519] FUNC 'n_tty_write_wakeup' type_id=21329 linkage=static
[58550] FUNC 'tty_write_room' type_id=58430 linkage=static
My first thought was that tty_write had been inlined by the ARM64 compiler, meaning it wouldn't be seen at BTF-generation time. A quick cat /proc/kallsyms | grep tty_write on a failing kernel, however, confirmed this was not the case:
# cat /proc/kallsyms | grep tty_write
...snip
ffffc3856827c954 t tty_write
...snip
Interestingly, as can be seen in the output of bpftool above, tty_write was absent from the failing kernel's BTF information, but some related functions were still present (e.g., redirected_tty_write). For some reason, tty_write and a handful of other functions with tty_write in their name seem to not be present in the system BTF on the failing kernels. We've found our proximate cause.
pahole is the tool used to generate BTF at kernel build time. Peeking into the v1.20 sources (the version being used to build all the kernels), the relevant code could be found in btf_encoder.c:
[btf_encoder.c listing, lines 1-46, not reproduced here]
Stepping through the code with gdb, when the cu__for_each_function loop got to processing redirected_tty_write (which was always present in the BTF), control would flow into the conditional block on line 31 (i.e., functions_cnt != 0), then onwards to the subsequent code past line 46 that generates the BTF.
When the loop got to processing tty_write, however, functions_cnt would be 0, causing control to flow to the else on line 43, where fn->external was false. This resulted in the continue being hit and no BTF being generated for tty_write.
Seems tty_write is ultimately being skipped during BTF generation because it doesn't have external linkage (the check on line 44), but to hit that logic at all, it has to flow through the if (functions_cnt) check, which seems to correspond to the "are not in ftrace list" bullet point in the comment.
This led to a lot of digging into how ftrace works. Particularly helpful was this post on Brendan Gregg's blog, which links to a recording of a talk at Kernel Recipes 2019 by kernel maintainer Steven Rostedt. Go watch that talk, seriously, it's great.
Ftrace uses a compiler option to insert a few nops at the start of nearly every kernel function and, when a user enables tracing on a kernel function, dynamically patches the nops in that function with a call to a trampoline that invokes tracing callbacks. This allows functions to be traced with essentially no performance hit when ftrace is not in use (the performance penalty of the nops is so tiny as to be essentially zero).
Putting this together with the pahole code above, it seems tty_write isn't getting BTF generated because it isn't in the list of functions ftrace knows about (this is what the if (functions_cnt) check seems to correspond to).
Lending further credence to this theory, it was noticed that when an ARM64 kernel build was performed with pahole >= 1.22, BTF for tty_write would be present. It was only before version 1.22 that pahole was not generating BTF for tty_write.
One git bisect later, it was discovered that pahole commit 58a98f7 was what fixed the problem. It did this by removing the constraint that BTF is only generated for functions known to ftrace. This restriction was removed as it was determined that the space savings of only generating BTF for functions known to ftrace (the motivation for the restriction in the first place) were outweighed by the usefulness of the extra BTF. We're narrowing in.
So it seems tty_write is unknown to ftrace, meaning that in pahole <1.22, BTF won't be generated describing its signature. There's a file under /sys/kernel/tracing that lists all functions known to ftrace, allowing us to check this on a failing ARM64 kernel:
# cat /sys/kernel/tracing/available_filter_functions | grep tty_write
hung_up_tty_write
file_tty_write.constprop.0
tty_write
redirected_tty_write
tty_write_message
n_tty_write_wakeup
n_tty_write
tty_write_room
rfcomm_tty_write_room [rfcomm]
rfcomm_tty_write [rfcomm]
But, it's there...
Both the working x86 and failing ARM64 kernels list tty_write in /sys/kernel/tracing/available_filter_functions, and in manual tests, tty_write was traceable with ftrace on both kernels. So what gives?
This involved digging into how exactly ftrace stores the list of functions it's capable of tracing, something the above linked talk helped immensely with.
The long and short of it is that there's logic in the kernel source tree that, at build time, generates a list of the addresses of all functions that have been instrumented with nops by ftrace and stores those addresses between two linker-defined symbols in the produced vmlinux binary: __start_mcount_loc and __stop_mcount_loc (the name mcount is a bit of historical baggage related to GNU gprof; you can just mentally substitute ftrace_functions for mcount in those symbol names).
These symbols are located in the .init.data section. The whole list of addresses is read by the kernel at boot, used to initialize ftrace, and then discarded (along with the rest of the data in .init.data, all of which is only needed at boot).
Looking at the pahole source code before commit 58a98f7, to obtain this list of addresses, pahole was simply mapping the kernel vmlinux image into memory, seeking to the __start_mcount_loc symbol, and reading the list of function addresses. This is the "ftrace list" referenced by the comment in pahole's source.
Since tty_write is traceable with ftrace on the breaking kernel, it would make sense that its address is found in the ftrace list, but it seems pahole isn't finding it for whatever reason.
Time to pull out gdb again. The ftrace list is read into the addrs variable, which can be read at btf_encoder.c:171 in pahole 1.20.
$ gdb -q pahole
Reading symbols from pahole...
(gdb) b btf_encoder.c:171
No source file named btf_encoder.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btf_encoder.c:171) pending.
(gdb) run -J ~/vmlinux-aarch64
Starting program: /home/vagrant/dwarves/build/pahole -J ~/vmlinux-aarch64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, get_vmlinux_addrs (btfe=0x555555594110, fl=0x7fffffefddf0, paddrs=0x7fffffefdd90, pcount=0x7fffffefdd88)
at /home/vagrant/dwarves/btf_encoder.c:171
171 *paddrs = addrs;
(gdb) x/32gx addrs
0x7fffda05d010: 0x0000000000000000 0x0000000000000000
0x7fffda05d020: 0x0000000000000000 0x0000000000000000
0x7fffda05d030: 0x0000000000000000 0x0000000000000000
0x7fffda05d040: 0x0000000000000000 0x0000000000000000
0x7fffda05d050: 0x0000000000000000 0x0000000000000000
0x7fffda05d060: 0x0000000000000000 0x0000000000000000
0x7fffda05d070: 0x0000000000000000 0x0000000000000000
0x7fffda05d080: 0x0000000000000000 0x0000000000000000
0x7fffda05d090: 0x0000000000000000 0x0000000000000000
0x7fffda05d0a0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0b0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0c0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0d0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0e0: 0x0000000000000000 0x0000000000000000
0x7fffda05d0f0: 0x0000000000000000 0x0000000000000000
0x7fffda05d100: 0x0000000000000000 0x0000000000000000
Weird... it's all nulls. Let's try a working x86 kernel:
$ gdb -q pahole
Reading symbols from pahole...
(gdb) b btf_encoder.c:171
No source file named btf_encoder.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (btf_encoder.c:171) pending.
(gdb) run -J ~/vmlinux-x86_64
Starting program: /home/vagrant/dwarves/build/pahole -J ~/vmlinux-x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Breakpoint 1, get_vmlinux_addrs (btfe=0x555555595f20, fl=0x7fffffefddf0, paddrs=0x7fffffefdd90, pcount=0x7fffffefdd88)
at /home/vagrant/dwarves/btf_encoder.c:171
171 *paddrs = addrs;
(gdb) x/32gx addrs
0x7fffc2105010: 0xffffffff81000660 0xffffffff810006a0
0x7fffc2105020: 0xffffffff810006e0 0xffffffff81000720
0x7fffc2105030: 0xffffffff81000c80 0xffffffff81000dd0
0x7fffc2105040: 0xffffffff81000fa0 0xffffffff81000fd0
0x7fffc2105050: 0xffffffff81001000 0xffffffff81001030
0x7fffc2105060: 0xffffffff81001050 0xffffffff810014c0
0x7fffc2105070: 0xffffffff810014f0 0xffffffff81001500
0x7fffc2105080: 0xffffffff81001510 0xffffffff81001ac0
0x7fffc2105090: 0xffffffff81001b40 0xffffffff81001c00
0x7fffc21050a0: 0xffffffff81001d70 0xffffffff81001f30
0x7fffc21050b0: 0xffffffff81001f40 0xffffffff81001ff0
0x7fffc21050c0: 0xffffffff810020e0 0xffffffff81002190
0x7fffc21050d0: 0xffffffff810021b0 0xffffffff81002210
0x7fffc21050e0: 0xffffffff810022d0 0xffffffff81002310
0x7fffc21050f0: 0xffffffff810024c0 0xffffffff81002500
0x7fffc2105100: 0xffffffff81002940 0xffffffff81002970
Those look like valid kernelspace addresses. Why the discrepancy? Loading the ARM64 kernel in radare2 confirms that there's nothing but zeros between the __start_mcount_loc and __stop_mcount_loc symbols:
$ r2 vmlinux-aarch64
[0xffff800008000000]> s loc.__start_mcount_loc
[0xffff80000a2f3e68]> px
- offset - 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
0xffff80000a2f3e68 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3e78 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3e88 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3e98 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ea8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3eb8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ec8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ed8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ee8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3ef8 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f08 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f18 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f28 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f38 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f48 0000 0000 0000 0000 0000 0000 0000 0000 ................
0xffff80000a2f3f58 0000 0000 0000 0000 0000 0000 0000 0000 ................
On an x86 kernel, however, the same addresses seen in gdb are present:
$ r2 vmlinux-x86_64
[0xffffffff81e00000]> s loc.__start_mcount_loc
[0xffffffff836f4a10]> px
- offset - 0 1 2 3 4 5 6 7 8 9 A B C D E F 0123456789ABCDEF
0xffffffff836f4a10 6006 0081 ffff ffff a006 0081 ffff ffff `...............
0xffffffff836f4a20 e006 0081 ffff ffff 2007 0081 ffff ffff ........ .......
0xffffffff836f4a30 800c 0081 ffff ffff d00d 0081 ffff ffff ................
0xffffffff836f4a40 a00f 0081 ffff ffff d00f 0081 ffff ffff ................
0xffffffff836f4a50 0010 0081 ffff ffff 3010 0081 ffff ffff ........0.......
0xffffffff836f4a60 5010 0081 ffff ffff c014 0081 ffff ffff P...............
0xffffffff836f4a70 f014 0081 ffff ffff 0015 0081 ffff ffff ................
0xffffffff836f4a80 1015 0081 ffff ffff c01a 0081 ffff ffff ................
0xffffffff836f4a90 401b 0081 ffff ffff 001c 0081 ffff ffff @...............
0xffffffff836f4aa0 701d 0081 ffff ffff 301f 0081 ffff ffff p.......0.......
0xffffffff836f4ab0 401f 0081 ffff ffff f01f 0081 ffff ffff @...............
0xffffffff836f4ac0 e020 0081 ffff ffff 9021 0081 ffff ffff . .......!......
0xffffffff836f4ad0 b021 0081 ffff ffff 1022 0081 ffff ffff .!......."......
0xffffffff836f4ae0 d022 0081 ffff ffff 1023 0081 ffff ffff .".......#......
0xffffffff836f4af0 c024 0081 ffff ffff 0025 0081 ffff ffff .$.......%......
0xffffffff836f4b00 4029 0081 ffff ffff 7029 0081 ffff ffff @)......p)......
This naturally raises the question: where are the ARM64 ftrace addresses? We know the kernel is getting the list somehow: it lists tty_write in /sys/kernel/tracing/available_filter_functions and is able to trace the function. The list isn't there, however, when we map the kernel into memory.
A few hours of research and binary grepping later, I found the answer:
$ nm -C vmlinux-aarch64 | grep tty_write
...snip
ffff800008914af0 t tty_write
...snip
$ llvm-readelf -r vmlinux-aarch64 | grep ffff800008914af0
ffff8000094fff40 0000000000000403 R_AARCH64_RELATIVE ffff800008914af0
In the kernel ELF binary, the data between __start_mcount_loc and __stop_mcount_loc was entirely zeros, but relocations had been emitted such that the kernel will patch up all the zeros with actual addresses at boot, when it relocates itself.
So, when the kernel actually boots, the section of the address space in .init.data between __start_mcount_loc and __stop_mcount_loc will contain the correct function addresses. If the binary is simply mapped into memory during the kernel build process by pahole, however, it will only read zeros from the corresponding location, causing the ftrace list to appear empty. The pahole commit that removed the restriction on only generating BTF for functions known to ftrace inadvertently fixed this, as it stopped considering the ftrace list entirely when generating BTF.
Rebuilding the ARM64 kernel with CONFIG_RELOCATABLE=n and generating BTF with pahole 1.20 fixed the problem. There were no relocations — the data was simply there and pahole generated BTF as expected. But why was that data zeroed only on ARM64? The x86 kernels were all built with CONFIG_RELOCATABLE=y, so shouldn't they be emitting relocations as well? After all, a relocatable kernel could be loaded anywhere, so the addresses we see in the list should all have relocations so they can change if the kernel is loaded somewhere nonstandard:
$ nm -C vmlinux-x86 | grep tty_write
...snip
ffffffff815dfdd0 t tty_write
...snip
$ llvm-readelf -r vmlinux-x86 | grep 5dfdd0
...snip
ffffffff8372d7b8 0000000100000001 R_X86_64_64 ffffffff81000000 .text + 5dfdd0
$ llvm-readelf -s vmlinux-x86 | grep mcount
118204: ffffffff837612c0 0 NOTYPE GLOBAL DEFAULT 38 __stop_mcount_loc
132185: ffffffff836fdf30 0 NOTYPE GLOBAL DEFAULT 38 __start_mcount_loc
It's a touch more complex this time as, on x86, the relocation for tty_write's address in the ftrace list is given relative to the .text symbol (i.e., it's given as 0xffffffff81000000 + 5dfdd0 instead of ffffffff815dfdd0), but as can be seen, an equivalent relocation is there.
Ok, so relocations are emitted for both x86 and ARM64, but the actual locations in the produced binary corresponding to those relocations are only filled in (assuming a normal 1M kernel load address) on x86.
The reason for this discrepancy was found only after much grepping. As of Linux 5.19, it can be found at arch/arm64/Makefile:15, which was added in commit fd9dde6:
commit fd9dde6abcb9bfe6c6bee48834e157999f113971
Author: Nick Desaulniers <[email protected]>
Date: Fri Oct 27 09:33:41 2017 -0700
arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27
Upon upgrading to binutils 2.27, we found that our lz4 and gzip
compressed kernel images were significantly larger, resulting is 10ms
boot time regressions.
As noted by Rahul:
"aarch64 binaries uses RELA relocations, where each relocation entry
includes an addend value. This is similar to x86_64. On x86_64, the
addend values are also stored at the relocation offset for relative
relocations. This is an optimization: in the case where code does not
need to be relocated, the loader can simply skip processing relative
relocations. In binutils-2.25, both bfd and gold linkers did this for
x86_64, but only the gold linker did this for aarch64. The kernel build
here is using the bfd linker, which stored zeroes at the relocation
offsets for relative relocations. Since a set of zeroes compresses
better than a set of non-zero addend values, this behavior was resulting
in much better lz4 compression.
The bfd linker in binutils-2.27 is now storing the actual addend values
at the relocation offsets. The behavior is now consistent with what it
does for x86_64 and what gold linker does for both architectures. The
change happened in this upstream commit:
https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613
Since a bunch of zeroes got replaced by non-zero addend values, we see
the side effect of lz4 compressed image being a bit bigger.
To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
flag can be used:
$ LDFLAGS="--no-apply-dynamic-relocs" make
With this flag, the compressed image size is back to what it was with
binutils-2.25.
...snip
So, on x86, the target address of the relocations is patched up in the actual binary image with the relocation value (again, assuming a normal 1M kernel load address), whereas on ARM64, the target location of the relocation value is always set to 0 as a result of --no-apply-dynamic-relocs, which was done to make the kernel image more compressible. This has been the case since 4.15, and it broke pahole's assumption that the region between __start_mcount_loc and __stop_mcount_loc in the vmlinux binary would always contain valid kernel addresses.
Whew... there's our root cause.
So what's the fix for this? We can't just fail for kernels built with pahole <1.22.
Well, conveniently enough, kernel commit 9bb48c8 also changed the signature of redirected_tty_write in exactly the same way as tty_write. redirected_tty_write has external linkage, so pahole <1.22 will generate BTF for it even if it can't find it in the (entirely null) ftrace list. We updated our prototype-detection logic to use this function as an analog for the signature of tty_write instead and ended up with a working BPF program.
And that's all I've got for today. Hopefully this rabbit hole made for an enjoyable read. If you're interested in the security-related BPF work we're doing at Elastic, all of our BPF code and associated tools (including the fix described in this blog post) are publicly available.