When you need to understand what the Linux kernel is actually doing, you do not always need a debugger or extra packages. The kernel ships with a tracer called ftrace that can record kernel function calls as they happen. This tutorial is a practical, reproducible introduction to function tracing with ftrace. You will mount the trace filesystem, list the available tracers, run the plain function tracer, narrow the output to the functions you care about, read a timed call graph, and — most importantly — interpret the output and find the few lines that answer a specific question, ending with three worked examples. Every command and the output you should expect is shown below.
What you need
You need a Linux system whose kernel was built with the tracing options enabled. Most desktop distributions and many embedded configurations already enable them. The relevant kernel config symbols are:
CONFIG_FUNCTION_TRACER— the function tracer.CONFIG_FUNCTION_GRAPH_TRACER— the call-graph tracer with timing.CONFIG_DYNAMIC_FTRACE— lets you filter on individual functions cheaply.
You also need root access, because the trace controls live in a root-only filesystem. The commands below use sudo with absolute paths so you can copy them directly. If your build has a config file available, you can confirm the options:
raghu@techveda.org:~$ zcat /proc/config.gz | grep -E 'FUNCTION_TRACER|FUNCTION_GRAPH|DYNAMIC_FTRACE'
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
Mounting and exploring the trace filesystem
ftrace is controlled entirely through files in a special filesystem called tracefs, normally mounted at /sys/kernel/tracing. On older systems it appears under /sys/kernel/debug/tracing. If the directory is empty, mount it by hand:
raghu@techveda.org:~$ sudo mount -t tracefs nodev /sys/kernel/tracing
It helps to hold a simple mental model: you write to control files to say what to trace, the kernel records matching calls into a ring buffer, and you read the result back from two output files.
you (root) kernel
---------- ------
echo function --> current_tracer ----+
echo 'vfs_*' --> set_ftrace_filter |
echo vfs_read --> set_graph_function +--> record matching
echo 1 --> tracing_on | calls into a
+--> ring buffer
|
cat trace <-- snapshot (re-readable) <------+
cat trace_pipe <-- live stream (consumed) <-----+
The first file to look at lists the tracers your kernel supports. The exact list varies by configuration:
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup function nop
The two you will use here are function and function_graph. The special tracer nop means “trace nothing” and is how you turn tracing off. The set of functions you are allowed to trace is listed in available_filter_functions; it is large, so count it rather than print it:
raghu@techveda.org:~$ sudo wc -l /sys/kernel/tracing/available_filter_functions
58213 /sys/kernel/tracing/available_filter_functions
Function tracing with ftrace: your first trace
Select the function tracer by writing its name into current_tracer. Because the target file needs root, send the value with tee:
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
Tracing is now running and the kernel is writing every traced function call into a ring buffer. Read a snapshot of that buffer from the trace file:
raghu@techveda.org:~$ sudo head -n 20 /sys/kernel/tracing/trace
# tracer: function
#
# entries-in-buffer/entries-written: 140080/250281 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
bash-1442 [000] .... 1758.872404: mutex_lock <-tracing_set_tracer
bash-1442 [000] .... 1758.872405: __might_sleep <-mutex_lock
The plain function tracer is comprehensive but very noisy; without filtering it records the whole kernel. The global switch tracing_on pauses and resumes writing to the buffer without changing your setup. Writing 0 pauses, 1 resumes. Make sure it is back at 1 before the later steps, otherwise the buffer stays empty:
raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/tracing_on
0
raghu@techveda.org:~$ echo 1 | sudo tee /sys/kernel/tracing/tracing_on
1
How to read a trace line
Each data line carries more information than it first appears. The diagram below labels every field of one line:
bash-1442 [000] .... 1758.872404: mutex_lock <- tracing_set_tracer
| | | | | |
| | | | | +-- caller
| | | | +-- traced function
| | | +-- timestamp (seconds since boot)
| | +-- context flags: irqs / resched / irq / preempt
| +-- CPU number
+-- command-PID
Read field by field, that means:
bash-1442— the command name and PID of the task that was running when the function was called.[000]— the CPU the function ran on. The same call path can interleave across CPUs, so this column matters when you read a busy trace.....— four context flags (explained below).1758.872404— a timestamp in seconds since boot. The difference between two timestamps is how you measure how long something took with the plain tracer.mutex_lock <-tracing_set_tracer— the traced function, and after the<-marker, its immediate caller. This caller field is the single most useful part of the plain tracer: it answers “who called this?”.
The four flag characters, read left to right, tell you the context the function executed in:
- irqs-off:
dmeans hardware interrupts were disabled at that point;.means they were enabled. - need-resched:
Nmeans the scheduler has flagged this task to be rescheduled;.means it has not. - hardirq/softirq:
hmeans the function ran in hardware interrupt context,sin software interrupt (softirq) context,Ha hardirq that occurred inside a softirq, and.normal process context. - preempt-depth: a digit giving how many nested
preempt_disable()levels are active;.means zero.
These flags are not decoration. If you see a function that can sleep being called while interrupts are off (d) or with a non-zero preempt-depth, that is a real bug, and the trace line is where you catch it.
Narrowing the output with set_ftrace_filter
Tracing every function is rarely useful. The set_ftrace_filter file restricts the tracer to the functions you name, and it accepts shell-style globs. Only names that appear in available_filter_functions are valid. For example, to trace only the VFS read path, filter on vfs_* before selecting the tracer:
raghu@techveda.org:~$ echo 'vfs_*' | sudo tee /sys/kernel/tracing/set_ftrace_filter
vfs_*
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/set_ftrace_filter | head
vfs_read
vfs_write
vfs_open
vfs_fstat
vfs_getattr
Now the trace file only contains the functions whose names start with vfs_. To watch events live instead of taking a snapshot, read trace_pipe. Unlike trace, reading trace_pipe consumes the data and blocks until new events arrive, so stop it with Ctrl-C:
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_pipe
cat-2391 [001] .... 2014.553210: vfs_read <-ksys_read
cat-2391 [001] .... 2014.553233: vfs_write <-ksys_write
Reading a timed call graph with function_graph
The function_graph tracer is often more useful than the plain tracer because it shows the nesting of calls and how long each one took. Pick the entry function with set_graph_function so the graph starts where you want, then switch tracers:
raghu@techveda.org:~$ echo vfs_read | sudo tee /sys/kernel/tracing/set_graph_function
vfs_read
raghu@techveda.org:~$ echo function_graph | sudo tee /sys/kernel/tracing/current_tracer
function_graph
raghu@techveda.org:~$ sudo head -n 14 /sys/kernel/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) | vfs_read() {
1) 0.234 us | rw_verify_area();
1) 2.104 us | ext4_file_read_iter();
1) 3.998 us | }
Read the braces as a call tree, and the durations as the cost of each branch:
vfs_read() { vfs_read .................. 3.998 us (total)
rw_verify_area(); ====> |-- rw_verify_area ...... 0.234 us
ext4_file_read_iter(); +-- ext4_file_read_iter . 2.104 us
} (closing brace carries the total)
The DURATION column gives the time spent in each function. A duration printed on the closing brace line is the total for that call, including everything nested inside it. The exact inner functions depend on your kernel version and the filesystem in use; the read path above is for an ext4 file. The kernel marks unusually long durations with a + (over 10 microseconds), a ! (over 100 microseconds), or a # (over 1000 microseconds), which makes slow paths easy to find.
Interpreting the output and finding what you need
Turning tracing on is the easy part. The skill that most guides skip is reading the result and locating the few lines that answer your question. The most reliable approach is to decide the question first, then choose the method that answers it:
- Is this code path taken at all? Put the function in
set_ftrace_filter, perform the action, and check whether it appears intrace. Its absence is itself an answer. - What called this function? Use the plain
functiontracer and read the<-field, which names the immediate caller. - What does this function call, and where does the time go? Use
function_graphwithset_graph_functionset to that function, and read the durations. - Why is this operation slow? Use
function_graph, then scan for the largestDURATIONvalues and the+,!, and#markers.
Isolate the exact window. The buffer fills constantly, so the single most effective habit is to capture only the activity around the action you care about. The pattern is a short timeline:
|--pause--|--clear buffer--|--record ON--|--run action--|--record OFF--|--read--|
t0 t1 t2 t3 t4 t5
on=0 echo > trace on=1 your command on=0 cat trace
In commands, that is:
raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/tracing_on
0
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
raghu@techveda.org:~$ echo 1 | sudo tee /sys/kernel/tracing/tracing_on
1
raghu@techveda.org:~$ cat /etc/hostname > /dev/null
raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/tracing_on
0
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace | head
Now the buffer holds mostly the activity from that one action, instead of thousands of unrelated lines.
Mark your own events in the timeline. To line up the trace with your steps, write a label into trace_marker. It appears inline in the trace as a tracing_mark_write entry, so you can see exactly where your action began:
raghu@techveda.org:~$ echo "start: my read test" | sudo tee /sys/kernel/tracing/trace_marker
raghu@techveda.org:~$ sudo grep tracing_mark_write /sys/kernel/tracing/trace
bash-1442 [000] .... 2014.553100: tracing_mark_write: start: my read test
Search the buffer with ordinary tools. The trace file is plain text, so grep works. Find every place a function appears, or filter to a single PID:
raghu@techveda.org:~$ sudo grep ext4_file_read_iter /sys/kernel/tracing/trace
raghu@techveda.org:~$ sudo grep -- '-2391 ' /sys/kernel/tracing/trace | head
Find the hotspot in a call graph. In a function_graph capture, the closing-brace line of each function holds its total time including its children. Read the outermost braces first to see which high-level call dominates, then descend into that call to find the specific child that is expensive. A quick visual scan for the ! and # markers points you straight at the functions worth investigating, instead of reading every line.
Three worked examples
The methods above are easier to trust once you see them applied. Each example below isolates the window, runs one ordinary action, and reads the result. Assume the trace filesystem is mounted and tracing is on.
Example 1 — a char driver’s own routines: are my file operations being called?
This is the question most newcomers to driver writing actually have: I loaded my module, is the kernel calling my routines, and in what order? Suppose you have a simple character driver whose file operations are registered like this:
static const struct file_operations mychardev_fops = {
.owner = THIS_MODULE,
.open = mychardev_open,
.read = mychardev_read,
.write = mychardev_write,
.release = mychardev_release,
};
When a program uses the device, the call passes down through the kernel layers and into your routines. The path you are about to see in the trace looks like this:
cat /dev/mychardev (user space)
| | |
open() read() close()
v v v
chrdev_open vfs_read __fput (kernel layers)
| | |
v v v
mychardev_open mychardev_read mychardev_release (your module)
Once the module is loaded, its functions become traceable and appear in available_filter_functions. Load the module and confirm:
raghu@techveda.org:~$ sudo insmod ./mychardev.ko
raghu@techveda.org:~$ sudo grep mychardev /sys/kernel/tracing/available_filter_functions
mychardev_open
mychardev_read
mychardev_write
mychardev_release
Rather than name each function, use the module filter, which selects every function belonging to one module. Set it, choose the plain tracer, clear the buffer, then read from the device:
raghu@techveda.org:~$ echo ':mod:mychardev' | sudo tee /sys/kernel/tracing/set_ftrace_filter
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
raghu@techveda.org:~$ cat /dev/mychardev > /dev/null
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace
cat-2620 [001] .... 5123.001012: mychardev_open <-chrdev_open
cat-2620 [001] .... 5123.001120: mychardev_read <-vfs_read
cat-2620 [001] .... 5123.001195: mychardev_read <-vfs_read
cat-2620 [001] .... 5123.001240: mychardev_release <-__fput
How to read it: your routines ran in the order open, read, read, release — exactly the lifecycle in the diagram above. The caller after <- tells you where in the kernel each one was invoked from: mychardev_open from chrdev_open (the character-device layer), mychardev_read from vfs_read (the VFS read path), and mychardev_release from __fput (the final close of the file). The second read returning is normal: it reads zero bytes, which is how cat detects end of file. If a routine you expected is missing — say mychardev_read never appears — that points at your file_operations table or your device registration, not at the program doing the reading. The same method works for any code you deploy: a kernel thread’s main function, a workqueue handler, or a sysfs show or store callback all become traceable the moment the module is loaded.
Example 2 — vfs_read: where does a read spend its time?
Question: a file read feels slow; which part costs the time? Use function_graph anchored at vfs_read, clear the buffer, then read a file:
raghu@techveda.org:~$ echo vfs_read | sudo tee /sys/kernel/tracing/set_graph_function
vfs_read
raghu@techveda.org:~$ echo function_graph | sudo tee /sys/kernel/tracing/current_tracer
function_graph
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
raghu@techveda.org:~$ dd if=/var/log/syslog of=/dev/null bs=1M count=1 2>/dev/null
raghu@techveda.org:~$ sudo head -n 12 /sys/kernel/tracing/trace
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
1) | vfs_read() {
1) 0.212 us | rw_verify_area();
1) + 18.540 us | ext4_file_read_iter();
1) + 19.860 us | }
How to read it: the closing brace shows the whole vfs_read took 19.860 us. Inside it, rw_verify_area is 0.212 us, which is negligible, while ext4_file_read_iter is 18.540 us and carries a + marker (over 10 us). The time is in the filesystem read, not the VFS wrapper. The next step is clear: set set_graph_function to ext4_file_read_iter and repeat, to see which part of the ext4 read dominates.
Example 3 — __schedule: was the task switched out voluntarily or preempted?
Question: a task keeps losing the CPU; is it giving it up on its own (for example, waiting for I/O) or being preempted? Trace __schedule, the core context-switch function, and read the caller:
raghu@techveda.org:~$ echo __schedule | sudo tee /sys/kernel/tracing/set_ftrace_filter
__schedule
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_pipe
bash-2510 [000] d... 4011.220104: __schedule <-schedule
bash-2510 [002] dN.1 4011.221455: __schedule <-preempt_schedule_common
How to read it: the caller after <- classifies the switch. <-schedule is a voluntary switch — the task called schedule() itself, usually because it is waiting for something. <-preempt_schedule_common is an involuntary switch — the kernel preempted the task. The flags on the second line reinforce this: N means need-resched was set (a higher-priority task was waiting) and the trailing 1 is a preempt-depth of one. The timestamps let you measure the gap as well: 4011.221455 − 4011.220104 = 1.351 ms between the two switches. Reading the caller and the flags together tells you why the task left the CPU, which a bare list of function names never would.
Limiting a trace to one process and getting call counts
To trace only a specific process, write its PID into set_ftrace_pid. A common pattern is to start a shell, capture its PID, and trace it:
raghu@techveda.org:~$ echo $$ | sudo tee /sys/kernel/tracing/set_ftrace_pid
3187
If you want aggregate numbers rather than a stream of events, the function profiler counts how often each function ran and the time spent in it. Enable it, wait, then read the per-CPU statistics:
raghu@techveda.org:~$ echo 1 | sudo tee /sys/kernel/tracing/function_profile_enabled
1
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_stat/function0 | head
Function Hit Time Avg s^2
-------- --- ---- --- ---
schedule 1842 9382.234 us 5.093 us 12.004 us
mutex_lock 5210 1204.882 us 0.231 us 0.003 us
The profiler answers a different question from the trace stream: instead of “what happened in order”, it tells you “which functions cost the most over the whole run”, which is often the fastest way to find a hotspot before you drill in with function_graph.
Cleaning up
ftrace settings persist until you reset them, so always return the tracer to nop and clear your filters and buffer when you are done:
raghu@techveda.org:~$ echo nop | sudo tee /sys/kernel/tracing/current_tracer
nop
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_ftrace_filter
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_graph_function
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_ftrace_pid
raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/function_profile_enabled
0
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
With the tracer back at nop, the filters cleared, the PID filter removed, and tracing_on left at 1, the system is returned to its normal state.
A faster front-end: trace-cmd
Writing to tracefs by hand is good for understanding the mechanism, but for daily use the trace-cmd tool wraps the same interface. It records to a file you can analyse later:
raghu@techveda.org:~$ sudo trace-cmd record -p function_graph -g vfs_read
raghu@techveda.org:~$ trace-cmd report | head
The recorded file can also be opened in the KernelShark graphical viewer. Under the hood, trace-cmd sets current_tracer, set_graph_function, and the other files exactly as you did above. Function tracing and driver debugging like this are covered in depth in our Linux Kernel Infrastructure and Linux Device Drivers tracks.
Key takeaways
- ftrace is built into the kernel and controlled through files in tracefs at
/sys/kernel/tracing. - Select a tracer by writing to
current_tracer;functionlists calls,function_graphshows nesting and timing, andnopturns tracing off. - Your own module’s routines become traceable once it is loaded; the
:mod:filter selects all functions in a module. - Read every trace line in full: the PID, CPU, the context flags (
d,N,h/s, preempt-depth), the timestamp, and the<-caller. - Interpret with a question in mind: isolate the window with
tracing_on, annotate withtrace_marker, search withgrep, and scanfunction_graphdurations and markers for the hotspot. - Always reset the tracer to
nopand clear filters, the PID filter, and the buffer when you finish.


