This article is a tutorial on how to create an eBPF program using a tracepoint. If you’re not familiar with eBPF, you can refer to our introduction to eBPF, or to the official documentation.
What is a tracepoint? How do I attach an eBPF program to one? What are the steps involved?
Note: This article is also available in French 🇫🇷.
Aim
In this article, we’ll take a look at how eBPF works, using a practical example: hiding a process identifier (PID) from a user on a Linux system. Along the way, we introduce the concepts behind tracepoints in an eBPF program and the getdents64 system call (syscall).
This syscall is the one we will tamper with to mask the existence of a PID from the user, e.g. when they run the ps -ef command.
This topic is divided into 2 parts: this first article covers the general concepts surrounding the syscall and how it works, and provides a theoretical solution in the form of pseudo-code. The second article focuses exclusively on the concrete implementation in C, which you’ll find on the blog next week.
Definition
An eBPF tracepoint is a static hook point in the Linux kernel to which eBPF programs can be attached to monitor specific system events. It is a predefined attachment point for observing kernel behavior; in our case, the event of interest is a system call.
Theoretical approach
On Linux, as on other Unix-like systems that provide procfs, a process is represented as a directory in /proc whose name is its own PID. So, if a process starts with PID 1234, a /proc/1234 directory appears, through which the process data is exposed. If we manage to hide the existence of this directory, the process will not be visible to the user.
In other words, if the directory does not show up in a listing of /proc, tools that rely on that listing, such as ps, will behave as if the process did not exist.
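To make the link between /proc and the process list concrete, here is a minimal user-space sketch of how a tool might enumerate processes by listing a directory and keeping only the entries whose name is a number. The helper names is_pid_name and count_pid_dirs are hypothetical and were introduced for this illustration; this is not how ps is actually implemented.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Returns 1 if name is made only of decimal digits (e.g. "1234"), else 0. */
int is_pid_name(const char *name)
{
    if (name == NULL || *name == '\0')
        return 0;
    for (const char *p = name; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}

/* Counts the entries of root whose name looks like a PID.
 * Returns -1 if root cannot be opened. */
int count_pid_dirs(const char *root)
{
    DIR *dir = opendir(root);
    if (dir == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (is_pid_name(entry->d_name))
            count++;
    }
    closedir(dir);
    return count;
}
```

Calling count_pid_dirs("/proc") on a Linux machine should give roughly the number of running processes. An entry that never reaches readdir is simply never counted, which is the effect we are after.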
How do you hide a directory from the user, but leave it accessible to the kernel so as not to interfere with its operation?
Identifying the syscall
The syscall used to list a directory is called getdents64 (its predecessor getdents is rarely used on modern systems). You can check that ps uses this syscall with the strace tool.
strace ps
will display the getdents64 system call, but beware: the output is very long and difficult to read.
To hook into this syscall, we need to identify the eBPF hook point. This can be a tracepoint (i.e. a predefined event built directly into the kernel), or a kprobe (a dynamic probe placed on a specific kernel function).
The /sys directory exposes a wealth of information about the various hook types and the functions that can be hooked.
Tracepoints are not necessarily available on every kernel, so it is important to check that the one we want exists before hooking onto it.
The following command shows whether the tracepoints linked to getdents64 are available (they usually are by default).
cat /sys/kernel/debug/tracing/available_events | grep getdents64
Using this command, we can see that we have 2 possible hooks:
- sys_enter_getdents64, which is executed before the syscall and the reading of the directory
- sys_exit_getdents64, which is executed after the syscall, once the buffer contains the complete directory data.
A sys_enter hook is executed before a system call, while sys_exit is executed after.
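An eBPF program is declared with the SEC macro, which places the function that follows in a named ELF section. The section name tells the loader which hook the program targets, so the macro must come right before the function definition.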
SEC("tp/syscalls/sys_enter_getdents64")
An eBPF program is simply a function annotated with the SEC macro.
In the case of a tracepoint, the function takes a context parameter (ctx in the following code) which carries information relating to the hook, such as the arguments passed to the original syscall.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
SEC("tp/syscalls/sys_enter_getdents64")
int acceis_ebpf_prog(void *ctx) {
    /*
       Do something here ...
    */
    return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
The
void *
type is used here generically, but the data has a specific type which depends on the tracepoint in question.
Before trying to identify what ctx contains, we need to understand how the syscall works, so that we know what to look for.
Understanding syscall
The documentation for this syscall can be found directly in the manual.
man getdents64
As we go through this manual, there are 2 important things we’ll learn:
- It details the structure of a linux_dirent ("dirent" for "directory entry")
# The system call getdents() reads several linux_dirent structures from the directory referred
# to by the open file descriptor fd into the buffer pointed to by dirp.
struct linux_dirent {
    ...
    unsigned short d_reclen; /* Length of this linux_dirent */
    char           d_name[]; /* Filename (null-terminated) */
    ...
};
linux_dirent has 2 fields of interest here. d_name is the name of the file as a null-terminated character string; comparing it with the PID to be masked lets us identify the right entry. d_reclen is the size, in bytes, of this entry in the buffer. getdents64 fills the buffer with the 64-bit variant, linux_dirent64, which has the same two fields.
- The documentation also includes a piece of code demonstrating the use of the syscall.
char buf[BUF_SIZE];     // Buffer containing all linux_dirent entries
long nread;             // Total length of the entries (linux_dirent) in the buffer
struct linux_dirent *d; // Current entry (contains the entry type, inode ...)

nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
// nread is -1 on error and 0 at the end of the directory: the loop is then skipped

// bpos can be read as "buffer position"
for (long bpos = 0; bpos < nread;) {
    // The current entry is located at the buffer start + the total size of the previous entries
    d = (struct linux_dirent *) (buf + bpos);
    /*
       ... Here, we can print information about the current entry ...
    */
    // Move past the current entry
    bpos += d->d_reclen;
}
The user defines a buffer of size BUF_SIZE. When this buffer is passed to the syscall, the kernel fills it with as many linux_dirent entries as the buffer size allows. The syscall then returns nread, which is the total size in bytes of all linux_dirent entries written to the buffer.
To locate a given entry in this buffer, you need to know where the previous one ends. The for loop starts at offset 0 and advances by the length of the current entry (d_reclen) to reach the next one.
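The traversal can be reproduced in user space on a hand-built buffer, without calling the kernel at all. The sketch below is only an illustration: struct dirent64_rec is a local mirror of the layout we assume for the kernel's linux_dirent64, and append_record and count_records are hypothetical helper names introduced for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the kernel's struct linux_dirent64 (assumed layout). */
struct dirent64_rec {
    uint64_t       d_ino;    /* inode number */
    int64_t        d_off;    /* offset to the next entry */
    unsigned short d_reclen; /* length of this record, in bytes */
    unsigned char  d_type;   /* file type */
    char           d_name[]; /* null-terminated name */
};

/* Writes one record named name at offset pos of buf (buf must be 8-byte
 * aligned and large enough). Returns the offset just after the record.
 * The record length is rounded up to a multiple of 8 bytes. */
size_t append_record(char *buf, size_t pos, const char *name)
{
    size_t len = offsetof(struct dirent64_rec, d_name) + strlen(name) + 1;
    len = (len + 7) & ~(size_t)7;

    struct dirent64_rec *d = (struct dirent64_rec *)(buf + pos);
    memset(d, 0, len);
    d->d_reclen = (unsigned short)len;
    strcpy(d->d_name, name);
    return pos + len;
}

/* Walks nread bytes of records and returns how many entries were seen. */
int count_records(const char *buf, long nread)
{
    int count = 0;
    for (long bpos = 0; bpos < nread;) {
        const struct dirent64_rec *d =
            (const struct dirent64_rec *)(buf + bpos);
        if (d->d_reclen == 0)
            break; /* malformed record: avoid looping forever */
        count++;
        bpos += d->d_reclen; /* jump to the next record */
    }
    return count;
}
```

The only thing that links one record to the next is d_reclen, which is why this field matters so much in the rest of the article.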
The linux_dirent structure and its d_name and d_reclen fields are exactly what we need for our use case: d_name identifies the entry corresponding to the PID to be masked, and d_reclen lets us step from one entry to the next in the buffer.
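As an illustration of the comparison on d_name, here is a small user-space sketch. The function name name_matches_pid is hypothetical, and libc functions such as snprintf and strcmp are not available inside an eBPF program, so this only shows the logic of the check, not how it would be written in the kernel-side code.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if d_name is exactly the decimal form of pid, else 0. */
int name_matches_pid(const char *d_name, int pid)
{
    char wanted[16]; /* enough for any 32-bit PID plus the terminator */
    int n = snprintf(wanted, sizeof(wanted), "%d", pid);
    if (n < 0 || (size_t)n >= sizeof(wanted))
        return 0;
    /* Compare the whole string so that "12345" does not match PID 1234. */
    return strcmp(d_name, wanted) == 0;
}
```

Comparing the full string rather than a prefix matters here: /proc also contains PIDs that merely start with the same digits.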
We still need to retrieve the address of the buffer containing all the entries, which is available from the sys_enter_getdents64 tracepoint.
Knowing a tracepoint’s context type (ctx)
There are 2 ways to find out the type of a tracepoint parameter:
- The first method involves retrieving the type from the directory /sys/kernel/debug/tracing/events/syscalls/<OUR_TRACEPOINT>.
- The second method uses a type described in the Linux kernel header file, vmlinux.h.
Retrieving context from the /sys directory
The type can be retrieved directly from the /sys
directory with the following command:
cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_getdents64/format
# name: sys_enter_getdents64
# ID: 818
# format:
# ...
# field:unsigned int fd; offset:16; size:8; signed:0;
# field:struct linux_dirent64 * dirent; offset:24; size:8; signed:0;
# field:unsigned int count; offset:32; size:8; signed:0;
fd, dirent and count are the arguments passed by the user to the syscall.
The buffer that will receive all the linux_dirent64 entries is visible here as the 2nd syscall parameter, dirent. This is the one that will be useful for what follows.
To use the values described in this file as context parameters, you must define a struct that reproduces each of these fields at the listed offsets, including the leading ones that are elided by the ... above.
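Such a struct could look like the sketch below. The name getdents64_enter_ctx is hypothetical, and since the leading fields are elided in the listing, they are represented here by an opaque 16-byte header rather than guessed. The offsets of the three visible fields are checked at compile time against the values shown in the format file.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context struct for sys_enter_getdents64, built from the
 * offsets and sizes printed in the format file. */
struct getdents64_enter_ctx {
    unsigned char header[16]; /* fields elided by "..." in the listing */
    uint64_t      fd;         /* offset 16, size 8 */
    uint64_t      dirent;     /* offset 24, size 8: user-space address */
    uint64_t      count;      /* offset 32, size 8 */
};

/* Fail the build if the layout drifts from the format file. */
_Static_assert(offsetof(struct getdents64_enter_ctx, fd) == 16,
               "fd must be at offset 16");
_Static_assert(offsetof(struct getdents64_enter_ctx, dirent) == 24,
               "dirent must be at offset 24");
_Static_assert(offsetof(struct getdents64_enter_ctx, count) == 32,
               "count must be at offset 32");
```

The offsets and the ID may differ from one kernel to another, so the format file of the target machine remains the reference.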
Context retrieval with vmlinux.h
Another method is to use the Linux kernel header file vmlinux.h. This file is generated from the running kernel’s type information (BTF) and contains all the types used in that kernel.
You can create the vmlinux.h file using bpftool, a standard utility for working with eBPF.
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
Our hook is of type sys_enter, so a quick search (ctrl+f) for sys_enter in the header file leads us to the type linked to this tracepoint.
struct trace_event_raw_sys_enter {
struct trace_entry ent;
long int id;
long unsigned int args[6];
char __data[0];
};
All that remains is to extract the data we’re interested in. The parameters passed by the user to the syscall are stored in args. From the /sys/kernel/debug/tracing/events/syscalls/sys_enter_getdents64/format file, we know there are 3 of them: fd, dirent and count. They are placed in args in order, so that args[0] = fd, args[1] = dirent and args[2] = count. All you have to do is cast each value to obtain the typed data.
This is the method used in Part 2.
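The cast from the generic args array to typed values can be sketched in plain C as follows. This is a user-space illustration only: struct getdents64_args and decode_getdents64_args are hypothetical names, and the array simply stands in for the args field of trace_event_raw_sys_enter.

```c
#include <stdint.h>

/* Typed view of the three getdents64 arguments (hypothetical struct). */
struct getdents64_args {
    unsigned int fd;     /* file descriptor of the directory */
    void        *dirent; /* user-space buffer receiving the entries */
    unsigned int count;  /* size of that buffer, in bytes */
};

/* Converts the raw syscall arguments into their typed form. */
struct getdents64_args decode_getdents64_args(const unsigned long args[6])
{
    struct getdents64_args out;
    out.fd     = (unsigned int)args[0];
    out.dirent = (void *)(uintptr_t)args[1];
    out.count  = (unsigned int)args[2];
    return out;
}
```

Only the first three slots are meaningful for getdents64; the remaining ones exist because the same structure is shared by every sys_enter tracepoint.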
Type retrieval for sys_exit_getdents64
To find out the type of the parameter for the sys_exit_getdents64 hook, we proceed in the same way as above.
struct trace_event_raw_sys_exit {
struct trace_entry ent;
long int id;
long int ret;
char __data[0];
};
This time, we directly get the ret field, which contains the syscall return value. To find out what it means, take a look at the getdents64 syscall manual.
man getdents64
# RETURN VALUE
# On success, the number of bytes read is returned. On end of directory, 0 is returned. On error, -1 is returned, and errno is set to indicate the error.
So our ret field tells us how many bytes of the buffer were filled, i.e. where to stop when walking through the entries.
Here, ret corresponds to nread in the code shown above.
Planning, hiding a PID
In the context of the sys_enter_getdents64 tracepoint, the parameters supplied by the caller (i.e. from user space) are accessible. The address of the buffer that will hold all the dirent entries must therefore be saved at this point, for use after the syscall, i.e. once the buffer has been filled.
To transmit data from one eBPF program to another, you’ll need to use maps, which are data structures whose role is to store information and share it between eBPF programs or user space. Please refer to the article introduction to eBPF for more details.
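The hand-off between the two hooks can be pictured with a plain C stand-in for a map. This is not the BPF map API: the table, the SLOTS capacity and the names remember_buffer and take_buffer are all hypothetical, and keying the entries by a task identifier is an assumption made for the example, so that two processes calling the syscall at the same time do not overwrite each other's value.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64 /* hypothetical capacity of the stand-in map */

struct slot {
    int      used;
    uint64_t key;         /* task identifier */
    uint64_t dirent_addr; /* buffer address seen at sys_enter */
};

static struct slot table[SLOTS];

/* What the sys_enter side would do: remember the buffer address.
 * Returns 0 on success, -1 if the table is full. */
int remember_buffer(uint64_t task_id, uint64_t dirent_addr)
{
    struct slot *free_slot = NULL;
    for (size_t i = 0; i < SLOTS; i++) {
        if (table[i].used && table[i].key == task_id) {
            table[i].dirent_addr = dirent_addr; /* replace a stale value */
            return 0;
        }
        if (!table[i].used && free_slot == NULL)
            free_slot = &table[i];
    }
    if (free_slot == NULL)
        return -1;
    free_slot->used = 1;
    free_slot->key = task_id;
    free_slot->dirent_addr = dirent_addr;
    return 0;
}

/* What the sys_exit side would do: fetch the address and drop the entry.
 * Returns 0 if nothing was stored for this task. */
uint64_t take_buffer(uint64_t task_id)
{
    for (size_t i = 0; i < SLOTS; i++) {
        if (table[i].used && table[i].key == task_id) {
            table[i].used = 0;
            return table[i].dirent_addr;
        }
    }
    return 0;
}
```

Removing the entry once it has been read keeps the table from filling up with addresses that are no longer valid.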
In the context of the sys_exit_getdents64 tracepoint, only the long int ret value is provided. It tells us how many bytes of the buffer contain valid entries, and therefore where to stop reading.
We can sketch the solution with pseudo-code:
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
/*
- Definition of the map able to store syscall information
*/
SEC("tp/syscalls/sys_enter_getdents64")
int acceis_sys_enter_getents64(void *ctx) {
    /*
       - Retrieve the buffer containing the linux_dirent64 entries
       - Save information in the map
    */
    return 0;
}
SEC("tp/syscalls/sys_exit_getdents64")
int acceis_sys_exit_getents64(void *ctx) {
    /*
       - Retrieve the buffer
       - Retrieve the PID to mask
       - Identify the entry corresponding to the PID (using d_name)
       - Delete the entry from the buffer
    */
    return 0;
}
char LICENSE[] SEC("license") = "Dual BSD/GPL";
One way of deleting the entry from the buffer is to patch the d_reclen field of the entry that precedes the one to hide: set it to its own length plus the length of the entry to hide. A reader that walks the buffer then jumps straight over the hidden entry.
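The d_reclen trick can be tried in user space on an in-memory buffer. The sketch below is an illustration under stated assumptions, not the eBPF implementation: struct dirent64_rec is a local mirror of the layout we assume for linux_dirent64, hide_record is a hypothetical name, and the case where the entry to hide is the first one in the buffer (no previous entry to extend) is deliberately left unhandled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the kernel's struct linux_dirent64 (assumed layout). */
struct dirent64_rec {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen; /* length of this record, in bytes */
    unsigned char  d_type;
    char           d_name[]; /* null-terminated name */
};

/* Hides the first record called name by growing the previous record
 * over it. Returns 1 if a record was hidden, 0 if name was not found
 * or is the first record of the buffer. */
int hide_record(char *buf, long nread, const char *name)
{
    struct dirent64_rec *prev = NULL;

    for (long bpos = 0; bpos < nread;) {
        struct dirent64_rec *d = (struct dirent64_rec *)(buf + bpos);
        if (d->d_reclen == 0)
            return 0; /* malformed buffer: stop instead of looping */

        if (strcmp(d->d_name, name) == 0) {
            if (prev == NULL)
                return 0; /* first record: not handled in this sketch */
            /* The previous record now spans itself plus the target. */
            prev->d_reclen = (unsigned short)(prev->d_reclen + d->d_reclen);
            return 1;
        }
        prev = d;
        bpos += d->d_reclen;
    }
    return 0;
}
```

After the call, the bytes of the hidden record are still present in the buffer; they are simply skipped by any reader that follows d_reclen from one entry to the next.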
Conclusion
In this article, we have seen how to identify and obtain information about an eBPF tracepoint. We have also seen how to understand a syscall and exploit it.
To hide a PID, you need to remove the corresponding entry from the getdents64 syscall buffer. To do this, retrieve the address of the buffer defined in user space before the syscall runs, and save it in a map for use in the sys_exit hook. Then loop over all the entries (linux_dirent64) in the buffer, comparing d_name with the PID to find the right one. If an entry matches, overwrite the d_reclen value of the previous dirent with its own size plus that of the dirent to be masked.
In a few weeks, you’ll be able to read part 2, which will deal with the purely practical side of creating a program capable of hiding a PID.
Some useful resources
- The history of eBPF
- Excellent book for learning
- 2 fun learning labs
- eBPF tutorials
- A great blog for further reading
- List of major projects using eBPF
About the author
Article written by Tristan d’Audibert aka Sathi, cybersecurity engineer apprentice at ACCEIS.