Kernel tracepoints simply represent static hooks to events of interest, which can be used to build a "Big Picture" of what is going on within the system.
- Purpose of Tracepoints
- Where Are Tracepoints
- Which Function to Trace
- Enabling Event Tracing
- Writing a Tracepoint
- Using a Tracepoint
- Tracepoint in a Kernel Module
- Connecting a Function to a Tracepoint
A tracepoint placed in code provides a hook to call a function (probe) that you can provide at runtime. A tracepoint can be on (a probe is connected to it) or off (no probe is attached). When a tracepoint is off it has no effect, except for adding a tiny time penalty (checking a condition for a branch) and space penalty (adding a few bytes for the function call at the end of the instrumented function and adds a data structure in a separate section). When a tracepoint is on, the function you provided is called each time the tracepoint is executed, in the execution context of the caller. When the function provided ends its execution, it returns to the caller (continuing from the tracepoint site).
- Currently defined kernel tracepoints can be found in your system directory /sys/kernel/tracing/events/*
- Define new kernel tracepoints in kernel source directory include/trace/events/
/sys/kernel/tracing/events can only be accessed as the root user. Each of its subdirectories contains tracepoints that correspond to events of many different types such as system calls and network processing.
Example: Net Subsystem defines multiple events
- The enable file is used to switch on/off a probe or a set of probes.
- The filter file is used to filter events by specifying filter expressions which operate on trace data fields.
See here.
To see the tracepoints appear on the tracing buffer, we first need to make sure tracing feature is enabled:
echo 1 > /sys/kernel/tracing/tracing_onThe events which are available for tracing can be found in the file /sys/kernel/tracing/available_events.
The events are organized into subsystems, such as irq, sched, etc., and a full event name looks like this: <subsystem>:<event>. The subsystem name is optional, but it is displayed in the available_events file. All of the events in a subsystem can be specified via the syntax <subsystem>:*.
To enable a particular event, such as sched_wakeup, simply echo it to /sys/kernel/tracing/set_event. For example:
(note that >> is necessary, otherwise it will firstly disable all the events)
echo sched_wakeup >> /sys/kernel/tracing/set_eventTo disable an event, echo the event name to the set_event file prefixed with an exclamation point:
echo '!sched_wakeup' >> /sys/kernel/tracing/set_eventTo disable all events, echo an empty line to the set_event file:
echo > /sys/kernel/tracing/set_eventTo enable all irq events, you can use the command:
echo 'irq:*' > /sys/kernel/tracing/set_eventThe events available are also listed in /sys/kernel/tracing/events/ hierarchy of directories. When reading one of the enable files in those directories, there are four possible results:
- 0 - all events this file affects are disabled
- 1 - all events this file affects are enabled
- X - there is a mixture of events enabled and disabled
- ? - this file does not affect any event
To enable event sched_wakeup:
echo 1 > /sys/kernel/tracing/events/sched/sched_wakeup/enableTo disable it:
echo 0 > /sys/kernel/tracing/events/sched/sched_wakeup/enableTo enable all events in sched subsystem:
echo 1 > /sys/kernel/tracing/events/sched/enableYou can refer to sample code for examples and detailed explanation.
-
To insert a new tracepoint into the Linux kernel, define a new header file with a special format.
-
By default, tracepoint kernel files are located in include/trace/events/ directory, but the kernel has functionality so that the header files can be located in a different path. This is useful when defining a tracepoint in a kernel module.
-
A tracepoint is declared using the
TRACE_EVENTmacro. ButTRACE_EVENTdoes not yet instantiate it in the code. The tracepoint has arguments which are passed to at its actual instantiation. Tracepoints added with theTRACE_EVENTmacro allow some high-level tracer system to use them. -
To use the tracepoint, the header file must be included in any file that inserts the tracepoint, and a single
Cfile must defineCREATE_TRACE_POINTS.
TRACE_EVENT(name, prototype, args, structure, assign, print)All parameters except the first one are encapsulated with another macro. These macros give more control in processing and also allow commas to be used within the TRACE_EVENT() macro.
-
name - the name of the tracepoint to be created (
foo_barin this case). The actual tracepoint function that is used hastrace_prefixed to the name (i.e.,trace_foo_bar).TRACE_EVENT(foo_bar, prototype, args, structure, assign, print)
-
prototype - the prototype for the tracepoint callbacks (wrapped by
TP_PROTOmacro)TRACE_EVENT(foo_bar, TP_PROTO(const char* foo, int bar, const int* lst, const struct cpu_mask* mask, const char* fmt, va_list* va), ... )
The prototype is written as if you were to declare the tracepoint function directly:
trace_foo_bar(const char* foo, int bar, const int* lst, const struct cpu_mask* mask, const char* fmt, va_list* va)
It is used as the prototype for both the tracepoint added to the kernel code and for the callback function. Remember, a tracepoint calls the callback functions as if the callback functions were being called at the location of the tracepoint.
-
args - the arguments that match the prototype
TRACE_EVENT(foo_bar, TP_PROTO(const char* foo, int bar, const int* lst, const struct cpu_mask* mask, const char* fmt, va_list* va), TP_ARGS(foo, bar, lst, mask, fmt, va), ... )
-
structure - the structure layout of the data that will be stored in the tracer's ring buffer. Each element of the structure is defined by another macro. These macros are used to automate the creation of a structure and are not function-like. Notice that the macros are not separated by any delimiter (no comma nor semicolon).
TRACE_EVENT(foo_bar, TP_PROTO(const char* foo, int bar, const int* lst, const struct cpu_mask* mask, const char* fmt, va_list* va), TP_ARGS(foo, bar, lst, mask, fmt, va), TP_STRUCT__entry( __field(const char*, foo) __field(int, bar) ), ... )
__field(type, name)- this defines a normal structure element, likeint var; where type isintand name isvar -
assign - defines the way the data from the parameters is saved to the ring buffer
-
print - defines how a
printk()can be used to print out the fields from theTP_STRUCT__entrystructure.
The header file that contains the TRACE_EVENT macro must follow a certain format so that it will work with other tracer systems.
If we are writing a tracepoint for my_trace_system events, the structure should be:
#undef TRACE_SYSTEM
#define TRACE_SYSTEM my_trace_system
#if !defined(_TRACE_MY_EXAMPLE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MY_EXAMPLE_H
#include <linux/tracepoint.h>
TRACE_EVENT(foo_bar,
...
)
#endif /* _TRACE_MY_EXAMPLE_H */
/* This part must be outside protection */
#include <trace/define_trace.h>The TRACE_HEADER_MULTI_READ test allows this file to be included more than once; this is important for the processing of the TRACE_EVENT macro. The TRACE_SYSTEM must also be defined for the file and must be outside the guard of the #if. The TRACE_SYSTEM defines what group the TRACE_EVENT macros in this file belong to (you can define multiple events in this file). This is also the directory name that the events will be grouped under the /kernel/tracing/events/ directory. This grouping is important for other tracer systems as it allows the user to enable or disable events by group.
All trace headers should include tracepoint.h.
The define_trace.h is where all the magic lies in creating the tracepoints. It must be included at the bottom of the trace header file outside the protection of the #endif.
If this header file is not located in the include/trace/events/ directory, we need to add additional instructions to specify the include path.
Each TRACE_EVENT macro creates several helper functions to produce the code to add the tracepoint, create the files in the trace directory, hook it to other tracer systems, assign the values and print out the raw data from the ring buffer. To prevent too much bloat, if there are more than one tracepoint that uses the same format for the prototype, args, structure, assign and print, and only the name is different, it is highly recommended to use the DECLARE_EVENT_CLASS.
DECLARE_EVENT_CLASS macro creates most of the functions for the tracepoint. Then DEFINE_EVENT is use to hook a tracepoint to those functions.
Defining the tracepoint is meaningless if it is not used anywhere. To use the tracepoint, the trace header must be included, but one C file (and only one) must also define CREATE_TRACE_POINTS before including the trace. This will cause the define_trace.h to create the necessary functions needed to produce the tracing events. For example:
#define CREATE_TRACE_POINTS
#include <trace/events/my_example.h>If another file needs to use tracepoints that were defined in the trace header file, then it only needs to include the trace file, and does not need to define CREATE_TRACE_POINTS. Defining it more than once for the same header file will cause linking errors when building.
#include <trace/events/my_example.h>Finally, the tracepoint is used in the code just as it was defined in the TRACE_EVENT macro:
#define CREATE_TRACE_POINTS
#include <trace/events/my_example.h>
void do_something(int num)
{
/* Do something */
...
/* Insert the static tracepoint */
trace_foo_bar(foo, bar, lst, mask, fmt, va)
/* Do something else */
...
}After building the kernel, you should be able to see a new my_trace_system directory inside /sys/kernel/tracing/events/, which contains a foo_bar event tracepoint. You can then enable the function tracer on the do_something() function. When a tracepoint is hit and data is logged, you can view the output in the tracing buffer by reading the trace file located at /sys/kernel/tracing/trace.
If this header file is not located in the include/trace/events/ directory, e.g., when it is part of an external module, we need to specify TRACE_INCLUDE_PATH.
Assume our header file is named my_example.h:
...
#endif
/* This part must be outside protection */
#undef TRACE_INCLUDE_PATH
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE my_example
#include <trace/define_trace.h>Assuming the module is named my_module.c, its Makefile would look as follows:
obj-m := my_module.o
CFLAGS_my_module.o += -I$(src)
LINUX_KERNEL := $(shell uname -r)
LINUX_KERNEL_PATH := /usr/src/linux-headers-$(LINUX_KERNEL)
all:
$(MAKE) -C $(LINUX_KERNEL_PATH) M=$(PWD) modules
install:
$(MAKE) -C $(LINUX_KERNEL_PATH) M=$(PWD) modules_install
clean:
$(MAKE) -C $(LINUX_KERNEL_PATH) M=$(PWD) cleanNote the CFLAGS_my_module.o is used to include in the search path the folder containing the trace header file.
Connecting a function (probe) to a tracepoint is done through register_trace_subsys_eventname(). Removing a probe is done through unregister_trace_subsys_eventname(); it will remove the probe. Note that you should replace subsys and eventname with your actual subsystem name and event name. tracepoint_synchronize_unregister() must be called before the end of the module exit function to make sure there is no caller left using the probe. This, and the fact that preemption is disabled around the probe call, make sure that probe removal and module unload are safe.
However, it's recommended to use other tracer system, such as eBPF, to register callbacks for tracepoints instead of doing all this manually.
