Tracing Policy
TracingPolicy is a user-configurable Kubernetes custom resource (CR) that
allows users to trace arbitrary events in the kernel and optionally define
actions to take on a match, for example for enforcement. Currently, two types
of events are supported: kprobes and tracepoints. Others (e.g., uprobes) may
be added in the future by following a similar approach.
Note that TracingPolicy resources can be considered low-level since they
require knowledge about the Linux kernel and containers to be written
correctly. In the future, we are considering adding a high-level
RuntimeSecurityPolicy which would take this complexity away.
For the complete custom resource definition (CRD) refer to the YAML file
cilium.io_tracingpolicies.yaml.
One practical way to explore the CRD is to use kubectl explain against a
Kubernetes API server on which it is installed. For example,
kubectl explain tracingpolicy.spec.kprobes provides field-specific
documentation and details on the kprobe spec.
Tracing policies can be loaded with a command-line flag (--tracing-policy) or
via the tetra CLI, which loads the policies via gRPC.
Getting started with Tracing Policy
To keep the Cilium policy definitions unified, TracingPolicy follows the
same CR logic and semantics that you might be familiar with from other Cilium
concepts, like CiliumNetworkPolicy.
To discover TracingPolicy, let’s walk through an example that is explained
throughout this document:
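A minimal sketch of such a policy, assembled from the pieces discussed in the following sections. The apiVersion value (cilium.io/v1alpha1) and the object name are assumptions; check them against the CRD installed on your cluster.

```yaml
apiVersion: cilium.io/v1alpha1   # assumed API group/version
kind: TracingPolicy
metadata:
  name: "fd-install-example"     # hypothetical name
spec:
  kprobes:
  - call: "fd_install"
    syscall: false
    args:
    - index: 0
      type: "int"
    - index: 1
      type: "file"
    selectors:
    - matchArgs:
      - index: 1
        operator: "Equal"
        values:
        - "/etc/passwd"
```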
Required fields
The first part follows a common pattern among all Cilium policies or, more widely, Kubernetes objects. It first declares the Kubernetes API used, then the kind of Kubernetes object it is in this API, and an arbitrary name for the object, which has to comply with Kubernetes naming conventions.
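As an illustration, the required header could look like the fragment below. The API version and the name are assumptions made for this sketch.

```yaml
apiVersion: cilium.io/v1alpha1   # assumed API group/version
kind: TracingPolicy
metadata:
  name: "fd-install-example"     # hypothetical; must be a valid Kubernetes name
```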
Specification
The spec or specification can be composed of kprobes or tracepoints at
the moment. It describes the arbitrary event in the kernel that will be traced.
Kprobes
Kprobes enable you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. Kprobes are tightly tied to your kernel version and might not be portable since the kernel symbols depend on your build.
Conveniently, you can list all kernel symbols by reading the /proc/kallsyms
file. For example, to search for the write syscall kernel function, you can
execute sudo grep sys_write /proc/kallsyms. The output should be similar to
this, minus the architecture-specific prefixes.
ffffdeb14ea712e0 T __arm64_sys_writev
ffffdeb14ea73010 T ksys_write
ffffdeb14ea73140 T __arm64_sys_write
ffffdeb14eb5a460 t proc_sys_write
ffffdeb15092a700 d _eil_addr___arm64_sys_writev
ffffdeb15092a740 d _eil_addr___arm64_sys_write
You can see that the exact name of the symbol for the write syscall on our
kernel version is __arm64_sys_write. Note that on x86_64, the prefix should
be __x64_ instead of __arm64_.
Kernel symbols contain an architecture-specific prefix when they refer to syscall symbols. To write portable tracing policies, i.e. policies that can run on multiple architectures, just use the symbol name without the prefix.
For example, instead of writing call: "__arm64_sys_write" or call: "__x64_sys_write", just write call: "sys_write". Tetragon will adapt and add
the correct prefix based on the architecture of the underlying machine. Note
that the event generated as output currently includes the prefix.
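A short sketch of the portable form for the write syscall. Setting syscall: true here is an assumption based on the syscall field of the kprobe spec; args and selectors are left out.

```yaml
spec:
  kprobes:
  # Portable name: no __x64_ or __arm64_ prefix is written here.
  - call: "sys_write"
    syscall: true
```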
In our example, we will explore a kprobe hooking into the fd_install kernel
function. The fd_install kernel function is called each time a file
descriptor is installed into the file descriptor table of a process, typically
referenced within system calls like open or openat. Hooking fd_install
has its benefits and limitations, which are out of the scope of this document.
Note the syscall field, specific to a kprobe spec, with default value
false, which indicates whether Tetragon will hook a syscall or just a regular
kernel function. Tetragon needs this information because syscalls and kernel
functions use different ABIs.
As usual, kprobe calls can be defined independently in different policies, or together in the same policy. For example, we can trace multiple kprobes under the same tracing policy:
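A sketch of one policy carrying two kprobes. The choice of sys_read and sys_write is only illustrative, and the args and selectors of each call are omitted.

```yaml
spec:
  kprobes:
  - call: "sys_read"
    syscall: true
    # args and selectors for sys_read go here
  - call: "sys_write"
    syscall: true
    # args and selectors for sys_write go here
```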
Tracepoints
A tracepoint placed in the Linux kernel code provides a hook to call a function that you can provide at runtime using Tetragon. Tracepoints have the advantage of being stable across kernel versions and thus more portable than kprobes.
To see the list of tracepoints available on your kernel, you can list them
using sudo ls /sys/kernel/debug/tracing/events. The output should be similar
to this.
alarmtimer ext4 iommu page_pool sock
avc fib ipi pagemap spi
block fib6 irq percpu swiotlb
bpf_test_run filelock jbd2 power sync_trace
bpf_trace filemap kmem printk syscalls
bridge fs_dax kvm pwm task
btrfs ftrace libata qdisc tcp
cfg80211 gpio lock ras tegra_apb_dma
cgroup hda mctp raw_syscalls thermal
clk hda_controller mdio rcu thermal_power_allocator
cma hda_intel migrate regmap thermal_pressure
compaction header_event mmap regulator thp
cpuhp header_page mmap_lock rpm timer
cros_ec huge_memory mmc rpmh tlb
dev hwmon module rseq tls
devfreq i2c mptcp rtc udp
devlink i2c_slave napi sched vmscan
dma_fence initcall neigh scmi wbt
drm interconnect net scsi workqueue
emulation io_uring netlink signal writeback
enable iocost oom skb xdp
error_report iomap page_isolation smbus xhci-hcd
You can then choose the subsystem that you want to trace, and look up the
tracepoint you want to use and its format. For example, if we choose the
netif_receive_skb tracepoint from the net subsystem, we can read its
format with sudo cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format.
The output should be similar to the following.
name: netif_receive_skb
ID: 1398
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:unsigned int len; offset:16; size:4; signed:0;
field:__data_loc char[] name; offset:20; size:4; signed:0;
print fmt: "dev=%s skbaddr=%px len=%u", __get_str(name), REC->skbaddr, REC->len
Similarly to kprobes, tracepoints can also hook into system calls. For more
details, see the raw_syscalls and syscalls subsystems.
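A hedged sketch of a tracepoint entry for the net/netif_receive_skb tracepoint shown above. The subsystem and event field names are assumptions taken from the CRD rather than from this text; verify them with kubectl explain tracingpolicy.spec.tracepoints.

```yaml
spec:
  tracepoints:
  - subsystem: "net"             # directory under tracing/events
    event: "netif_receive_skb"   # tracepoint name within the subsystem
```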
Arguments
A call contains an args field, which is a list of function arguments to
include in the trace output. The BPF code that runs on the hook point requires
information about the types of arguments of the function being traced to
properly read, print and filter on its arguments. Currently, this information
needs to be provided by the user under the args section. For the available
types, refer directly to the TracingPolicy CRD.
Following our example, here is the part that defines the arguments:
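Reconstructed from the description that follows, the arguments section for fd_install would look roughly like this:

```yaml
args:
- index: 0       # unsigned int fd
  type: "int"
- index: 1       # struct file *file
  type: "file"
```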
To properly read and hook onto the fd_install(unsigned int fd, struct file *file)
function, the YAML snippet above tells the BPF code that the first
argument is an int and the second argument is a file, which is the
struct file of the kernel. In this way, the BPF code and its printer can
properly collect and print the arguments.
These types are sorted by the index field, which specifies the position of the
argument in the function signature. Indexing starts at 0: index: 0 refers to
the first argument of the function, index: 1 to the second argument, and so
on.
Note that for some args types, char_buf and char_iovec, there are
additional fields named returnCopy and sizeArgIndex available:
- returnCopy indicates that the corresponding argument should be read later (when the kretprobe for the symbol is triggered) because it might not be populated when the kprobe is triggered at the entrance of the function. For example, a buffer supplied to read(2) won’t have content until the kretprobe is triggered.
- sizeArgIndex indicates the (1-based, see warning below) index of the argument that represents the size of the char_buf or iovec. For example, for write(2), the third argument, size_t count, is the number of char elements that we can read from the const void *buf pointer from the second argument. Similarly, if we would like to capture the __x64_sys_writev(long, iovec *, vlen) syscall, then iovec has a size of vlen, which is going to be the third argument.
Warning: sizeArgIndex is inconsistent with index at the moment. It does not
take the 0-based index but the 1-based position of the argument (index + 1).
So if the size is the third argument, index 2, the value should be 3.
These flags can be combined, see the example below.
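A sketch that only illustrates the syntax of combining both flags on a char_buf argument of sys_write. The size_t type name is an assumption taken from the CRD.

```yaml
- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "char_buf"
    returnCopy: true   # read the buffer when the kretprobe fires
    sizeArgIndex: 3    # 1-based: the size is the third argument (index 2)
  - index: 2
    type: "size_t"
```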
Note that you can specify which arguments you would like to print from a
specific syscall. For example if you don’t care about the file descriptor,
which is the first int argument with index: 0 and just want the char_buf,
what is written, then you can leave this section out and just define:
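For instance, a sketch for sys_write that drops the file descriptor and keeps only the buffer and its size (the size_t type name is assumed from the CRD):

```yaml
args:
- index: 1
  type: "char_buf"
  sizeArgIndex: 3   # 1-based position of the size argument
- index: 2
  type: "size_t"
```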
This tells the printer to skip printing the int arg because it’s not useful.
For the char_buf type, up to 4096 bytes are stored. Larger data is cut and
returned as truncated bytes.
You can specify the maxData flag for the char_buf type to read the maximum
possible data (currently 327360 bytes), like:
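A possible shape for such an argument, sketched for the buffer of sys_write. The surrounding indexes and the size_t type name are illustrative assumptions.

```yaml
args:
- index: 1
  type: "char_buf"
  maxData: true     # fetch more than the default 4096 bytes
  sizeArgIndex: 3
- index: 2
  type: "size_t"
```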
This field is only used for char_buf data. When this value is false (default),
the BPF program will fetch at most 4096 bytes. On later kernels (>=5.4),
Tetragon supports fetching up to 327360 bytes if this flag is turned on.
The maxData flag does not work with the returnCopy flag at the moment, so it’s
usable only for syscalls/functions that do not require a return probe to read
the data.
Selectors
A TracingPolicy can contain from 0 to 5 selectors. A selector is composed of
1 or more filters. The available filters are the following:
- matchArgs
- matchReturnArgs
- matchPIDs
- matchBinaries
- matchNamespaces
- matchCapabilities
- matchNamespaceChanges
- matchCapabilityChanges
- matchActions
Arguments filter
Arguments filters can be specified under the matchArgs field and provide
filtering based on the value of the function’s argument.
In the next example, a selector is defined with a matchArgs filter that tells
the BPF code to process only the function call for which the second argument,
index equal to 1, concerns the file under the path /etc/passwd or
/etc/shadow. It’s using the operator Equal to match against the value of
the argument.
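A sketch of the selector described above, assuming the argument at index 1 was declared with type file:

```yaml
selectors:
- matchArgs:
  - index: 1            # second argument of the function
    operator: "Equal"
    values:
    - "/etc/passwd"
    - "/etc/shadow"
```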
Note that conveniently, we can match against a path directly when the argument
is of type file.
The available operators for matchArgs are:
- Equal
- NotEqual
- Prefix
- Postfix
- Mask
Further examples
In the previous example, we used the operator Equal, but we can also use the
Prefix operator and match against all files under /etc with:
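For illustration, a Prefix selector on the same file argument could be written as:

```yaml
selectors:
- matchArgs:
  - index: 1
    operator: "Prefix"
    values:
    - "/etc"
```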
In this situation, an event will be created every time a process tries to
access a file under /etc.
Although it makes less sense, you can also match on the first argument to only detect events that use file descriptor 3, which is usually the first one allocated after stdin, stdout and stderr (0, 1 and 2) in a process, and combine that with the previous example.
Return args filter
Return args filters can be specified under the matchReturnArgs field and
provide filtering based on the return value of the function. This allows
you to filter on success, error or any specific value returned by a
kernel call.
The available operators for matchReturnArgs are:
- Equal
- NotEqual
- Prefix
- Postfix
A use case for this would be to detect failed access to certain files, like
/etc/shadow. Running cat /etc/shadow uses an openat syscall that
returns -1 for a failed attempt by an unprivileged user.
PIDs filter
PIDs filters can be specified under the matchPIDs field and provide filtering
based on the host PID of the process. For example, the following
matchPIDs filter tells the BPF code to observe only hooks for which the
host PID is equal to either pid1 or pid2 or pid3:
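A sketch of such a filter. The numeric values stand in for pid1, pid2 and pid3 and are purely hypothetical.

```yaml
selectors:
- matchPIDs:
  - operator: In
    values:
    - 1111   # pid1 (hypothetical)
    - 2222   # pid2 (hypothetical)
    - 3333   # pid3 (hypothetical)
```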
The available operators for matchPIDs are:
- In
- NotIn
Further examples
Another example can be to collect all processes not associated with a
container’s init PID, which is equal to 1. In this way, we are able to detect
if there was a kubectl exec performed inside a container because processes
created by kubectl exec are not children of PID 1.
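One way this might be expressed is sketched below. The isNamespacePID field, which asks for the PID as seen inside the container PID namespace, is an assumption taken from the CRD and is not described in this text.

```yaml
selectors:
- matchPIDs:
  - operator: NotIn
    followForks: false
    isNamespacePID: true   # assumed field: compare the in-container PID
    values:
    - 1
```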
Binaries filter
Binary filters can be specified under the matchBinaries field and provide
filtering based on the value of a certain binary name. For example, the
following matchBinaries selector tells the BPF code to process only system
calls and kernel functions that are coming from cat or tail.
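A sketch of this selector. The full paths are an assumption, since the location of cat and tail depends on the distribution.

```yaml
selectors:
- matchBinaries:
  - operator: "In"
    values:
    - "/usr/bin/cat"    # assumed path
    - "/usr/bin/tail"   # assumed path
```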
Currently, only the In operator type is supported and the values field has
to be a list of strings. The default behaviour is followForks: true, so all
the child processes are followed. The current limitation is 4 values.
Further examples
One example can be to monitor all the sys_write system calls which are
coming from the /usr/sbin/sshd binary and its child processes and writing to
stdin/stdout/stderr.
This is how we can monitor what was written to the console by different users
during different ssh sessions. The matchBinaries selector in this case is the
following:
while the whole kprobe call is the following:
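A sketch of both pieces together: the matchBinaries selector for sshd and the surrounding sys_write call. The size_t type name and the use of string values for the file descriptors are assumptions.

```yaml
- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
  selectors:
  # Only writes issued by sshd (and, by default, its children)
  - matchBinaries:
    - operator: "In"
      values:
      - "/usr/sbin/sshd"
    # Only writes to stdin/stdout/stderr
    matchArgs:
    - index: 0
      operator: "Equal"
      values:
      - "0"
      - "1"
      - "2"
```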
Namespaces filter
Namespaces filters can be specified under the matchNamespaces field and
provide filtering of calls based on Linux namespace. You can specify the
namespace inode or use the special host_ns keyword, see the example and
description for more information.
An example syntax is:
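A sketch consistent with the match condition stated next (the inode numbers are example values):

```yaml
selectors:
- matchNamespaces:
  - namespace: Pid
    operator: In
    values:
    - "4026531836"
    - "4026531835"
```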
This will match if: [Pid namespace is 4026531836] OR [Pid namespace is
4026531835]
- namespace can be: Uts, Ipc, Mnt, Pid, PidForChildren, Net, Cgroup, or User. Time and TimeForChildren are also available in Linux >= 5.6.
- operator can be In or NotIn.
- values can be raw numeric values (i.e. obtained from lsns) or "host_ns" which will automatically be translated to the appropriate value.
Limitations
- We can have up to 4 values. These can be both numeric and host_ns inside a single namespace.
- We can have up to 4 namespace values under matchNamespaces in Linux kernel < 5.3. In Linux >= 5.3 we can have up to 10 values (i.e. the maximum number of namespaces that modern kernels provide).
Further examples
We can have multiple namespace filters:
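A sketch with two namespace filters in one selector, using the example inode numbers from the condition stated next:

```yaml
selectors:
- matchNamespaces:
  - namespace: Pid
    operator: In
    values:
    - "4026531836"
    - "4026531835"
  - namespace: Mnt
    operator: In
    values:
    - "4026531833"
    - "4026531834"
```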
This will match if: ([Pid namespace is 4026531836] OR [Pid namespace is
4026531835]) AND ([Mnt namespace is 4026531833] OR [Mnt namespace
is 4026531834])
Use case examples
Generate a kprobe event if /etc/shadow was opened by /bin/cat which either had host Net or Mnt namespace access
This example has 2 selectors. Note that each selector starts with -.
Selector 1:
Selector 2:
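The two selectors could be sketched as follows, assuming a kprobe such as fd_install whose argument at index 1 is of type file:

```yaml
selectors:
# Selector 1: /bin/cat opening /etc/shadow from the host Mnt namespace
- matchBinaries:
  - operator: "In"
    values:
    - "/bin/cat"
  matchArgs:
  - index: 1
    operator: "Equal"
    values:
    - "/etc/shadow"
  matchNamespaces:
  - namespace: Mnt
    operator: In
    values:
    - "host_ns"
# Selector 2: same binary and file, but from the host Net namespace
- matchBinaries:
  - operator: "In"
    values:
    - "/bin/cat"
  matchArgs:
  - index: 1
    operator: "Equal"
    values:
    - "/etc/shadow"
  matchNamespaces:
  - namespace: Net
    operator: In
    values:
    - "host_ns"
```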
We have [Selector1 OR Selector2]. Inside each selector we have filters.
Both selectors have 3 filters (i.e. matchBinaries, matchArgs, and
matchNamespaces) with different arguments. Adding a - in the beginning of a
filter will result in a new selector.
So the previous CRD will match if:
[binary == /bin/cat AND arg1 == /etc/shadow AND MntNs == host] OR
[binary == /bin/cat AND arg1 == /etc/shadow AND NetNs is host]
We can modify the previous example as follows:
Generate a kprobe event if /etc/shadow was opened by /bin/cat which has host Net and Mnt namespace access
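A sketch of this variant with a single selector, again assuming the file argument sits at index 1:

```yaml
selectors:
- matchBinaries:
  - operator: "In"
    values:
    - "/bin/cat"
  matchArgs:
  - index: 1
    operator: "Equal"
    values:
    - "/etc/shadow"
  matchNamespaces:
  # Both namespace filters belong to the same selector, so both must match.
  - namespace: Mnt
    operator: In
    values:
    - "host_ns"
  - namespace: Net
    operator: In
    values:
    - "host_ns"
```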
Here we have a single selector. This CRD will match if:
[binary == /bin/cat AND arg1 == /etc/shadow AND (MntNs == host AND
NetNs == host) ]
Capabilities filter
Capabilities filters can be specified under the matchCapabilities field and
provide filtering of calls based on Linux capabilities in the specific sets.
An example syntax is:
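A sketch consistent with the match condition stated next:

```yaml
selectors:
- matchCapabilities:
  - type: Effective
    operator: In
    values:
    - "CAP_CHOWN"
    - "CAP_NET_RAW"
```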
This will match if: [Effective capabilities contain CAP_CHOWN] OR
[Effective capabilities contain CAP_NET_RAW]
- type can be: Effective, Inheritable, or Permitted.
- operator can be In or NotIn.
- values can be any supported capability. A list of all supported capabilities can be found in /usr/include/linux/capability.h.
Limitations
- There is no limit in the number of capabilities listed under values.
- Only one type field can be specified under matchCapabilities.
Namespace changes filter
Namespace changes filters can be specified under the matchNamespaceChanges
field and provide filtering based on calls that are changing Linux namespaces.
This filter can be useful to track execution of code in a new namespace or even
container escapes that change their namespaces.
For instance, if an unprivileged process creates a new user namespace, it gains full privileges within that namespace. This grants the process the ability to perform some privileged operations within the context of this new namespace that would otherwise only be available to a privileged root user. As a result, such a filter is useful to track namespace creation, which can be abused by untrusted processes.
To keep track of the changes, when a process_exec happens, the namespaces of
the process are recorded and these are compared with the current namespaces on
the event with a matchNamespaceChanges filter.
The unshare command, or executing in the host namespace using nsenter, can
be used to test this feature. See a demonstration example of this feature.
Capability changes filter
Capability changes filters can be specified under the matchCapabilityChanges
field and provide filtering based on calls that are changing Linux capabilities.
To keep track of the changes, when a process_exec happens, the capabilities
of the process are recorded and these are compared with the current
capabilities on the event with a matchCapabilityChanges filter.
See a demonstration example of this feature.
Actions filter
Actions filters are a list of actions that execute when an appropriate selector
matches. They are defined under matchActions and currently, the following
action types are supported:
- Sigkill action
- Signal action
- Override action
- FollowFD action
- UnfollowFD action
- CopyFD action
- GetUrl action
- DnsLookup action
- Post action
- NoPost action
Sigkill, Override, FollowFD, UnfollowFD, CopyFD and Post are
executed directly in the kernel BPF code while GetUrl and DnsLookup are
happening in userspace after the reception of events.
Sigkill action
The Sigkill action synchronously terminates, from the kernel, the process that
made the call matching the appropriate selectors. In the example below, every
process with a PID not equal to 1 or 0 that attempts to write to /etc/passwd
through the sys_write system call is terminated. Indeed, when using kubectl
exec, a new process is spawned in the container PID namespace and is not a
child of PID 1.
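A sketch of such a policy fragment. The fd argument type, which lets the selector match on a file path, and the isNamespacePID field are assumptions taken from the CRD; the buffer arguments of sys_write are omitted for brevity.

```yaml
- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "fd"               # assumed type: file descriptor resolved to a path
  selectors:
  - matchPIDs:
    - operator: NotIn
      followForks: true
      isNamespacePID: true   # assumed field: compare the in-container PID
      values:
      - 0
      - 1
    matchArgs:
    - index: 0
      operator: "Prefix"
      values:
      - "/etc/passwd"
    matchActions:
    - action: Sigkill
```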
Signal action
The Signal action sends the specified signal to the current process. The
signal number is specified with the argSig value.
The following example is equivalent to the Sigkill action example above.
The difference is that it uses the Signal action with the SIGKILL (9) signal.
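Only the matchActions part needs to differ. A sketch of that part, to be placed in the same selector as the matchPIDs and matchArgs filters:

```yaml
matchActions:
- action: Signal
  argSig: 9   # SIGKILL
```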
Override action
The Override action allows you to modify the return value of a call. While
Sigkill terminates the entire process responsible for making the call,
Override replaces the value that was supposed to be returned with the value
given in the argError field. It’s then up to the process to handle the
return value of the function and stop or continue the execution.
For example, you can create a TracingPolicy that intercepts sys_symlinkat
and will make it return -1 every time the first argument is equal to the
string /etc/passwd:
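A sketch of such a policy fragment. The string argument type is an assumption taken from the CRD, and the exact string matching details may differ between versions.

```yaml
- call: "sys_symlinkat"
  syscall: true
  args:
  - index: 0
    type: "string"   # assumed type for the first argument
  selectors:
  - matchArgs:
    - index: 0
      operator: "Equal"
      values:
      - "/etc/passwd"
    matchActions:
    - action: Override
      argError: -1
```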
Override can override the return value of any call, but doing so in kernel
functions can create unexpected code path execution, whereas syscalls are a
stable user interface that should handle errors gracefully.
FollowFD action
The FollowFD action creates a mapping, using a BPF map, between file
descriptor numbers and filenames. However, this state needs to be maintained
correctly; see the related UnfollowFD and CopyFD actions.
The fd_install kernel function is called each time a file descriptor must be
installed into the file descriptor table of a process, typically referenced
within system calls like open or openat. It is a good place for tracking
file descriptor and filename matching.
Let’s take a look at the following example:
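A sketch of a FollowFD block on fd_install. The Prefix filter on /etc is only an illustrative way of choosing which files to follow.

```yaml
- call: "fd_install"
  syscall: false
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "file"
  selectors:
  - matchArgs:
    - index: 1
      operator: "Prefix"
      values:
      - "/etc"
    matchActions:
    - action: FollowFD
      argFd: 0     # index of the file descriptor argument
      argName: 1   # index of the name (file) argument
```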
This action uses the dedicated argFd and argName fields to get respectively
the index of the file descriptor argument and the index of the name argument in
the call.
UnfollowFD action
The UnfollowFD action takes a file descriptor from a system call and deletes
the corresponding entry from the BPF map, where it was put under the FollowFD
action.
Let’s take a look at the following example:
Similar to the FollowFD action, the index of the file descriptor is described
under argFd:
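A sketch of a matching UnfollowFD block on sys_close, showing only the fields described in this section:

```yaml
- call: "sys_close"
  syscall: true
  args:
  - index: 0
    type: "int"
  selectors:
  - matchActions:
    - action: UnfollowFD
      argFd: 0   # index of the file descriptor argument
```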
In this example, argFd is 0. So, the argument from the sys_close system
call at index: 0 will be deleted from the BPF map whenever a sys_close is
executed.
For every FollowFD block, there should be a matching UnfollowFD block,
otherwise the BPF map will be broken.
CopyFD action
The CopyFD action is specific to file descriptor duplication use cases.
Similarly to FollowFD, it takes argFd and argName arguments. It can
typically be used to track the dup, dup2 or dup3 syscalls.
See the following example for illustration:
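A sketch for sys_dup2. The argument types and the index values given to argFd and argName are assumptions; check them against the CRD before relying on them.

```yaml
- call: "sys_dup2"
  syscall: true
  args:
  - index: 0
    type: "fd"    # assumed type for the existing descriptor
  - index: 1
    type: "int"   # the new descriptor number
  selectors:
  - matchActions:
    - action: CopyFD
      argFd: 0
      argName: 1
```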
GetUrl action
The GetUrl action can be used to perform a remote interaction such as
triggering Thinkst canaries or any system that can be triggered via a URL
request. It uses the argUrl field to specify the URL to request using the GET
method.
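A sketch of the action block, with a hypothetical URL:

```yaml
matchActions:
- action: GetUrl
  argUrl: "http://canary.example.com/trigger"   # hypothetical URL
```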
DnsLookup action
The DnsLookup action can be used to perform a remote interaction such as
triggering Thinkst canaries or any system that can be triggered via a DNS
entry request. It uses the argFqdn field to specify the domain to look up.
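A sketch of the action block, with a hypothetical domain name:

```yaml
matchActions:
- action: DnsLookup
  argFqdn: "canary.example.com"   # hypothetical domain
```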
Post action
The Post action is intended to create an event, but at the moment it should be
considered deprecated since every TracingPolicy generates an event by
default.
NoPost action
The NoPost action can be used to suppress the generation of the event while
still performing all the other defined actions.
It’s useful when you are not interested in the event itself, just in the action being performed.
The following example overrides the openat syscall for the “/etc/passwd” file but does not generate any event about it.
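A sketch of such a fragment. The string argument type and the error value are assumptions made for illustration.

```yaml
- call: "sys_openat"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "string"   # assumed type for the path argument
  selectors:
  - matchArgs:
    - index: 1
      operator: "Equal"
      values:
      - "/etc/passwd"
    matchActions:
    - action: Override
      argError: -2   # illustrative error value
    - action: NoPost # perform the override, emit no event
```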
Selector Semantics
The selector semantics of the TracingPolicy follow the standard
Kubernetes semantics and the principles that are used by Cilium to create a
unified policy definition.
To explain the structure and the logic behind it in more depth, let’s first consider the following example:
In the YAML above, matchPIDs and matchArgs are logically ANDed together,
giving the expression:
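A sketch of a selector with both filters, with the resulting expression written as a comment. The PID values are hypothetical.

```yaml
selectors:
- matchPIDs:
  - operator: In
    values:
    - 1111   # hypothetical PIDs
    - 2222
  matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - "1"
# Evaluated as:
#   (pid in {1111, 2222}) AND (arg0 == 1)
```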
Multiple values
When multiple values are given, we apply the OR operation between them. In
case of having multiple values under the matchPIDs selector, if any value
matches with the given pid from pid1, pid2 or pid3 then we accept the
event:
As an example, we can filter for sys_read() syscalls that were not part of
the container initialization and the main pod process and tried to read from
the /etc/passwd file by using:
Similarly, we can use multiple values under the matchArgs selector:
If any value matches with fdstring1 or fdstring2, specifically
(string==fdstring1 OR string==fdstring2) then we accept the event.
For example, we can monitor sys_read() syscalls accessing either the
/etc/passwd or the /etc/shadow file:
Multiple operators
When multiple operators are given under matchPIDs or matchArgs, they
are logically ANDed together. If we have multiple operators under
matchPIDs:
then we would build the following expression on the BPF side:
In case of having multiple matchArgs:
Then we would build the following expression on the BPF side
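A sketch covering both cases, with the resulting expression written as a comment. PIDs and argument values are hypothetical.

```yaml
selectors:
- matchPIDs:
  - operator: In
    values: [1111, 2222]
  - operator: NotIn
    values: [3333]
  matchArgs:
  - index: 0
    operator: "Equal"
    values: ["1"]
  - index: 2
    operator: "Equal"
    values: ["100"]
# Evaluated as:
#   (pid in {1111, 2222} AND pid not in {3333})
#   AND (arg0 == 1 AND arg2 == 100)
```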
Operator types
There are different types supported for each operator. In case of matchArgs:
- Equal
- NotEqual
- Prefix
- Postfix
- Mask
The operator types Equal and NotEqual are used to test whether a certain
argument of a system call is equal to the defined value in the CR.
For example, the following YAML snippet matches if the argument at index 0 is
equal to /etc/passwd:
Both Equal and NotEqual are set operations. This means if multiple values
are specified, they are ORd together in case of Equal, and ANDd together
in case of NotEqual.
For example, in case of Equal the following YAML snippet matches if the
argument at index 0 is in the set of {arg0, arg1, arg2}.
The above would be executed in the kernel as
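A sketch of the Equal case with placeholder values, followed by the equivalent expression as a comment:

```yaml
matchArgs:
- index: 0
  operator: "Equal"
  values:
  - "arg0"
  - "arg1"
  - "arg2"
# Evaluated as:
#   arg == "arg0" OR arg == "arg1" OR arg == "arg2"
```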
In case of NotEqual the following YAML snippet matches if the argument at
index 0 is not in the set of {arg0, arg1}.
The above would be executed in the kernel as
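A sketch of the NotEqual case with placeholder values, followed by the equivalent expression as a comment:

```yaml
matchArgs:
- index: 0
  operator: "NotEqual"
  values:
  - "arg0"
  - "arg1"
# Evaluated as:
#   arg != "arg0" AND arg != "arg1"
```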
The operator type Mask performs a bitwise AND operation on the argument value
and the defined values. The argument type needs to be one of the value types.
For example, in the following YAML snippet we match the second argument for bit 1 and bit 9 (the 0x200 value). A single value of 0x201 could be used as well.
The above would be executed in the kernel as
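A sketch of a Mask filter on the second argument, assuming the operator matches when any of the listed bits is set:

```yaml
matchArgs:
- index: 1            # second argument
  operator: "Mask"
  values:
  - "0x1"
  - "0x200"
# Evaluated (under the assumption above) as:
#   (arg1 & 0x1) OR (arg1 & 0x200)
```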
The value can be specified as hexadecimal (with 0x prefix), octal (with 0 prefix) or decimal value (no prefix).
The type Prefix checks whether a certain argument starts with the defined
value, while the type Postfix checks whether the argument ends with the
defined value.
In case of matchPIDs:
- In
- NotIn
The operator types In and NotIn are used to test whether the pid of a
system call is found in the provided values list in the CR. Both In and
NotIn are set operations, which means if multiple values are specified they
are ORd together in case of In and ANDd together in case of NotIn.
For example, in case of In the following YAML snippet matches if the pid of a
certain system call is part of the list of {0, 1}:
The above would be executed in the kernel as
In case of NotIn the following YAML snippet matches if the pid of a certain
system call is not part of the list of {0, 1}:
The above would be executed in the kernel as
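A sketch of the NotIn case with the equivalent expression as a comment. The In case has the same shape with operator: In and evaluates as pid == 0 OR pid == 1.

```yaml
matchPIDs:
- operator: NotIn
  values:
  - 0
  - 1
# Evaluated as:
#   pid != 0 AND pid != 1
```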
In case of matchBinaries:
- In
The In operator type is used to test whether the binary name of a system call
is found in the provided values list. For example, the following YAML snippet
matches if the binary name of a certain system call is part of the list
of {binary0, binary1, binary2}:
Multiple selectors
When multiple selectors are configured they are logically ORd together.
The above would be executed in kernel as:
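A sketch of two selectors in one call, with the resulting expression written as a comment. The PID and argument values are hypothetical.

```yaml
selectors:
# Selector 1
- matchPIDs:
  - operator: In
    values: [1111]
# Selector 2
- matchArgs:
  - index: 0
    operator: "Equal"
    values: ["1"]
# Evaluated as:
#   (pid in {1111}) OR (arg0 == 1)
```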
Limitations
Because BPF programs must be bounded, we have to place limits on how many selectors and values can exist.
- Max Selectors 8.
- Max PID values per selector 4
- Max MatchArgs per selector 5 (one per index)
- Max MatchArg Values per MatchArgs 1 (a limit of the initial implementation, which can be bumped to 16 or so)