Function Tracing with ftrace: Hands-On Kernel Tutorial

When you need to understand what the Linux kernel is actually doing, you do not always need a debugger or extra packages. The kernel ships with a tracer called ftrace that can record kernel function calls as they happen. This tutorial is a practical, reproducible introduction to function tracing with ftrace. You will mount the trace filesystem, list the available tracers, run the plain function tracer, narrow the output to the functions you care about, read a timed call graph, and — most importantly — interpret the output and find the few lines that answer a specific question, ending with three worked examples. Every command and the output you should expect is shown below.

What you need

You need a Linux system whose kernel was built with the tracing options enabled. Most desktop distributions and many embedded configurations already enable them. The relevant kernel config symbols are:

CONFIG_FUNCTION_TRACER — the function tracer.
CONFIG_FUNCTION_GRAPH_TRACER — the call-graph tracer with timing.
CONFIG_DYNAMIC_FTRACE — lets you filter on individual functions cheaply.

You also need root access, because the trace controls live in a root-only filesystem. The commands below use sudo with absolute paths so you can copy them directly. If your build has a config file available, you can confirm the options:

raghu@techveda.org:~$ zcat /proc/config.gz | grep -E 'FUNCTION_TRACER|FUNCTION_GRAPH|DYNAMIC_FTRACE'
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y

Mounting and exploring the trace filesystem

ftrace is controlled entirely through files in a special filesystem called tracefs, normally mounted at /sys/kernel/tracing. On older systems it appears under /sys/kernel/debug/tracing. If the directory is empty, mount it by hand:

raghu@techveda.org:~$ sudo mount -t tracefs nodev /sys/kernel/tracing

It helps to hold a simple mental model: you write to control files to say what to trace, the kernel records matching calls into a ring buffer, and you read the result back from two output files.

        you (root)                          kernel
        ----------                          ------
  echo function    --> current_tracer    ----+
  echo 'vfs_*'     --> set_ftrace_filter     |
  echo vfs_read    --> set_graph_function    +--> record matching
  echo 1           --> tracing_on            |     calls into a
                                             +--> ring buffer
                                                      |
  cat trace        <-- snapshot (re-readable)  <------+
  cat trace_pipe   <-- live stream (consumed)   <-----+

The first file to look at lists the tracers your kernel supports. The exact list varies by configuration:

raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup function nop

The two you will use here are function and function_graph. The special tracer nop means “trace nothing” and is how you turn tracing off. The set of functions you are allowed to trace is listed in available_filter_functions; it is large, so count it rather than print it:

raghu@techveda.org:~$ sudo wc -l /sys/kernel/tracing/available_filter_functions
58213 /sys/kernel/tracing/available_filter_functions

Function tracing with ftrace: your first trace

Select the function tracer by writing its name into current_tracer. Because the target file needs root, send the value with tee:

raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function

Tracing is now running and the kernel is writing every traced function call into a ring buffer. Read a snapshot of that buffer from the trace file:

raghu@techveda.org:~$ sudo head -n 20 /sys/kernel/tracing/trace
# tracer: function
#
# entries-in-buffer/entries-written: 140080/250281   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            bash-1442  [000] ....  1758.872404: mutex_lock <-tracing_set_tracer
            bash-1442  [000] ....  1758.872405: __might_sleep <-mutex_lock

The plain function tracer is comprehensive but very noisy; without filtering it records the whole kernel. The global switch tracing_on pauses and resumes writing to the buffer without changing your setup. Writing 0 pauses, 1 resumes. Make sure it is back at 1 before the later steps, otherwise the buffer stays empty:

raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/tracing_on
0
raghu@techveda.org:~$ echo 1 | sudo tee /sys/kernel/tracing/tracing_on
1

How to read a trace line

Each data line carries more information than it first appears. The diagram below labels every field of one line:

  bash-1442  [000]  ....  1758.872404:  mutex_lock <- tracing_set_tracer
   |          |      |         |             |             |
   |          |      |         |             |             +-- caller
   |          |      |         |             +-- traced function
   |          |      |         +-- timestamp (seconds since boot)
   |          |      +-- context flags: irqs / resched / irq / preempt
   |          +-- CPU number
   +-- command-PID

Read field by field, that means:

bash-1442 — the command name and PID of the task that was running when the function was called.
[000] — the CPU the function ran on. The same call path can interleave across CPUs, so this column matters when you read a busy trace.
.... — four context flags (explained below).
1758.872404 — a timestamp in seconds since boot. The difference between two timestamps is how you measure how long something took with the plain tracer.
mutex_lock <-tracing_set_tracer — the traced function, and after the <- marker, its immediate caller. This caller field is the single most useful part of the plain tracer: it answers “who called this?”.

The four flag characters, read left to right, tell you the context the function executed in:

irqs-off: d means hardware interrupts were disabled at that point; . means they were enabled.
need-resched: N means the scheduler has flagged this task to be rescheduled; . means it has not.
hardirq/softirq: h means the function ran in hardware interrupt context, s in software interrupt (softirq) context, H a hardirq that occurred inside a softirq, and . normal process context.
preempt-depth: a digit giving how many nested preempt_disable() levels are active; . means zero.

These flags are not decoration. If you see a function that can sleep being called while interrupts are off (d) or with a non-zero preempt-depth, that is a real bug, and the trace line is where you catch it.

Narrowing the output with set_ftrace_filter

Tracing every function is rarely useful. The set_ftrace_filter file restricts the tracer to the functions you name, and it accepts shell-style globs. Only names that appear in available_filter_functions are valid. For example, to trace only the VFS read path, filter on vfs_* before selecting the tracer:

raghu@techveda.org:~$ echo 'vfs_*' | sudo tee /sys/kernel/tracing/set_ftrace_filter
vfs_*
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/set_ftrace_filter | head
vfs_read
vfs_write
vfs_open
vfs_fstat
vfs_getattr

Now the trace file only contains the functions whose names start with vfs_. To watch events live instead of taking a snapshot, read trace_pipe. Unlike trace, reading trace_pipe consumes the data and blocks until new events arrive, so stop it with Ctrl-C:

raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_pipe
            cat-2391  [001] ....  2014.553210: vfs_read <-ksys_read
            cat-2391  [001] ....  2014.553233: vfs_write <-ksys_write

Reading a timed call graph with function_graph

The function_graph tracer is often more useful than the plain tracer because it shows the nesting of calls and how long each one took. Pick the entry function with set_graph_function so the graph starts where you want, then switch tracers:

raghu@techveda.org:~$ echo vfs_read | sudo tee /sys/kernel/tracing/set_graph_function
vfs_read
raghu@techveda.org:~$ echo function_graph | sudo tee /sys/kernel/tracing/current_tracer
function_graph
raghu@techveda.org:~$ sudo head -n 14 /sys/kernel/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)               |  vfs_read() {
 1)   0.234 us   |    rw_verify_area();
 1)   2.104 us   |    ext4_file_read_iter();
 1)   3.998 us   |  }

Read the braces as a call tree, and the durations as the cost of each branch:

  vfs_read() {                         vfs_read .................. 3.998 us (total)
    rw_verify_area();        ====>       |-- rw_verify_area ...... 0.234 us
    ext4_file_read_iter();               +-- ext4_file_read_iter . 2.104 us
  }                                    (closing brace carries the total)

The DURATION column gives the time spent in each function. A duration printed on the closing brace line is the total for that call, including everything nested inside it. The exact inner functions depend on your kernel version and the filesystem in use; the read path above is for an ext4 file. The kernel marks unusually long durations with a + (over 10 microseconds), a ! (over 100 microseconds), or a # (over 1000 microseconds), which makes slow paths easy to find.

Interpreting the output and finding what you need

Turning tracing on is the easy part. The skill that most guides skip is reading the result and locating the few lines that answer your question. The most reliable approach is to decide the question first, then choose the method that answers it:

Is this code path taken at all? Put the function in set_ftrace_filter, perform the action, and check whether it appears in trace. Its absence is itself an answer.
What called this function? Use the plain function tracer and read the <- field, which names the immediate caller.
What does this function call, and where does the time go? Use function_graph with set_graph_function set to that function, and read the durations.
Why is this operation slow? Use function_graph, then scan for the largest DURATION values and the +, !, and # markers.

Isolate the exact window. The buffer fills constantly, so the single most effective habit is to capture only the activity around the action you care about. The pattern is a short timeline:

  |--pause--|--clear buffer--|--record ON--|--run action--|--record OFF--|--read--|
  t0        t1               t2            t3             t4             t5
  on=0      echo > trace      on=1          your command   on=0           cat trace

In commands, that is:

raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/tracing_on
0
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
raghu@techveda.org:~$ echo 1 | sudo tee /sys/kernel/tracing/tracing_on
1
raghu@techveda.org:~$ cat /etc/hostname > /dev/null
raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/tracing_on
0
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace | head

Now the buffer holds mostly the activity from that one action, instead of thousands of unrelated lines.

Mark your own events in the timeline. To line up the trace with your steps, write a label into trace_marker. It appears inline in the trace as a tracing_mark_write entry, so you can see exactly where your action began:

raghu@techveda.org:~$ echo "start: my read test" | sudo tee /sys/kernel/tracing/trace_marker
raghu@techveda.org:~$ sudo grep tracing_mark_write /sys/kernel/tracing/trace
            bash-1442  [000] ....  2014.553100: tracing_mark_write: start: my read test

Search the buffer with ordinary tools. The trace file is plain text, so grep works. Find every place a function appears, or filter to a single PID:

raghu@techveda.org:~$ sudo grep ext4_file_read_iter /sys/kernel/tracing/trace
raghu@techveda.org:~$ sudo grep -- '-2391 ' /sys/kernel/tracing/trace | head

Find the hotspot in a call graph. In a function_graph capture, the closing-brace line of each function holds its total time including its children. Read the outermost braces first to see which high-level call dominates, then descend into that call to find the specific child that is expensive. A quick visual scan for the ! and # markers points you straight at the functions worth investigating, instead of reading every line.

Three worked examples

The methods above are easier to trust once you see them applied. Each example below isolates the window, runs one ordinary action, and reads the result. Assume the trace filesystem is mounted and tracing is on.

Example 1 — a char driver’s own routines: are my file operations being called?

This is the question most newcomers to driver writing actually have: I loaded my module, is the kernel calling my routines, and in what order? Suppose you have a simple character driver whose file operations are registered like this:

static const struct file_operations mychardev_fops = {
        .owner   = THIS_MODULE,
        .open    = mychardev_open,
        .read    = mychardev_read,
        .write   = mychardev_write,
        .release = mychardev_release,
};

When a program uses the device, the call passes down through the kernel layers and into your routines. The path you are about to see in the trace looks like this:

  cat /dev/mychardev        (user space)
       |               |               |
     open()          read()         close()
       v               v               v
  chrdev_open       vfs_read        __fput          (kernel layers)
       |               |               |
       v               v               v
  mychardev_open   mychardev_read  mychardev_release   (your module)

Once the module is loaded, its functions become traceable and appear in available_filter_functions. Load the module and confirm:

raghu@techveda.org:~$ sudo insmod ./mychardev.ko
raghu@techveda.org:~$ sudo grep mychardev /sys/kernel/tracing/available_filter_functions
mychardev_open
mychardev_read
mychardev_write
mychardev_release

Rather than name each function, use the module filter, which selects every function belonging to one module. Set it, choose the plain tracer, clear the buffer, then read from the device:

raghu@techveda.org:~$ echo ':mod:mychardev' | sudo tee /sys/kernel/tracing/set_ftrace_filter
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
raghu@techveda.org:~$ cat /dev/mychardev > /dev/null
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace
            cat-2620  [001] ....  5123.001012: mychardev_open <-chrdev_open
            cat-2620  [001] ....  5123.001120: mychardev_read <-vfs_read
            cat-2620  [001] ....  5123.001195: mychardev_read <-vfs_read
            cat-2620  [001] ....  5123.001240: mychardev_release <-__fput

How to read it: your routines ran in the order open, read, read, release — exactly the lifecycle in the diagram above. The caller after <- tells you where in the kernel each one was invoked from: mychardev_open from chrdev_open (the character-device layer), mychardev_read from vfs_read (the VFS read path), and mychardev_release from __fput (the final close of the file). The second read returning is normal: it reads zero bytes, which is how cat detects end of file. If a routine you expected is missing — say mychardev_read never appears — that points at your file_operations table or your device registration, not at the program doing the reading. The same method works for any code you deploy: a kernel thread’s main function, a workqueue handler, or a sysfs show or store callback all become traceable the moment the module is loaded.

Example 2 — vfs_read: where does a read spend its time?

Question: a file read feels slow; which part costs the time? Use function_graph anchored at vfs_read, clear the buffer, then read a file:

raghu@techveda.org:~$ echo vfs_read | sudo tee /sys/kernel/tracing/set_graph_function
vfs_read
raghu@techveda.org:~$ echo function_graph | sudo tee /sys/kernel/tracing/current_tracer
function_graph
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace
raghu@techveda.org:~$ dd if=/var/log/syslog of=/dev/null bs=1M count=1 2>/dev/null
raghu@techveda.org:~$ sudo head -n 12 /sys/kernel/tracing/trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)               |  vfs_read() {
 1)   0.212 us   |    rw_verify_area();
 1) + 18.540 us  |    ext4_file_read_iter();
 1) + 19.860 us  |  }

How to read it: the closing brace shows the whole vfs_read took 19.860 us. Inside it, rw_verify_area is 0.212 us, which is negligible, while ext4_file_read_iter is 18.540 us and carries a + marker (over 10 us). The time is in the filesystem read, not the VFS wrapper. The next step is clear: set set_graph_function to ext4_file_read_iter and repeat, to see which part of the ext4 read dominates.

Example 3 — __schedule: was the task switched out voluntarily or preempted?

Question: a task keeps losing the CPU; is it giving it up on its own (for example, waiting for I/O) or being preempted? Trace __schedule, the core context-switch function, and read the caller:

raghu@techveda.org:~$ echo __schedule | sudo tee /sys/kernel/tracing/set_ftrace_filter
__schedule
raghu@techveda.org:~$ echo function | sudo tee /sys/kernel/tracing/current_tracer
function
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_pipe
            bash-2510  [000] d...  4011.220104: __schedule <-schedule
            bash-2510  [002] dN.1  4011.221455: __schedule <-preempt_schedule_common

How to read it: the caller after <- classifies the switch. <-schedule is a voluntary switch — the task called schedule() itself, usually because it is waiting for something. <-preempt_schedule_common is an involuntary switch — the kernel preempted the task. The flags on the second line reinforce this: N means need-resched was set (a higher-priority task was waiting) and the trailing 1 is a preempt-depth of one. The timestamps let you measure the gap as well: 4011.221455 − 4011.220104 = 1.351 ms between the two switches. Reading the caller and the flags together tells you why the task left the CPU, which a bare list of function names never would.

Limiting a trace to one process and getting call counts

To trace only a specific process, write its PID into set_ftrace_pid. A common pattern is to start a shell, capture its PID, and trace it:

raghu@techveda.org:~$ echo $$ | sudo tee /sys/kernel/tracing/set_ftrace_pid
3187

If you want aggregate numbers rather than a stream of events, the function profiler counts how often each function ran and the time spent in it. Enable it, wait, then read the per-CPU statistics:

raghu@techveda.org:~$ echo 1 | sudo tee /sys/kernel/tracing/function_profile_enabled
1
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_stat/function0 | head
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  schedule                              1842    9382.234 us     5.093 us        12.004 us
  mutex_lock                            5210    1204.882 us     0.231 us        0.003 us

The profiler answers a different question from the trace stream: instead of “what happened in order”, it tells you “which functions cost the most over the whole run”, which is often the fastest way to find a hotspot before you drill in with function_graph.

Cleaning up

ftrace settings persist until you reset them, so always return the tracer to nop and clear your filters and buffer when you are done:

raghu@techveda.org:~$ echo nop | sudo tee /sys/kernel/tracing/current_tracer
nop
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_ftrace_filter
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_graph_function
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_ftrace_pid
raghu@techveda.org:~$ echo 0 | sudo tee /sys/kernel/tracing/function_profile_enabled
0
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/trace

With the tracer back at nop, the filters cleared, the PID filter removed, and tracing_on left at 1, the system is returned to its normal state.

A faster front-end: trace-cmd

Writing to tracefs by hand is good for understanding the mechanism, but for daily use the trace-cmd tool wraps the same interface. It records to a file you can analyse later:

raghu@techveda.org:~$ sudo trace-cmd record -p function_graph -g vfs_read
raghu@techveda.org:~$ trace-cmd report | head

The recorded file can also be opened in the KernelShark graphical viewer. Under the hood, trace-cmd sets current_tracer, set_graph_function, and the other files exactly as you did above. Function tracing and driver debugging like this are covered in depth in our Linux Kernel Infrastructure and Linux Device Drivers tracks.

Key takeaways

ftrace is built into the kernel and controlled through files in tracefs at /sys/kernel/tracing.
Select a tracer by writing to current_tracer; function lists calls, function_graph shows nesting and timing, and nop turns tracing off.
Your own module’s routines become traceable once it is loaded; the :mod: filter selects all functions in a module.
Read every trace line in full: the PID, CPU, the context flags (d, N, h/s, preempt-depth), the timestamp, and the <- caller.
Interpret with a question in mind: isolate the window with tracing_on, annotate with trace_marker, search with grep, and scan function_graph durations and markers for the hotspot.
Always reset the tracer to nop and clear filters, the PID filter, and the buffer when you finish.

Function Tracing with ftrace: A Hands-On Kernel Tutorial

What you need

Mounting and exploring the trace filesystem

Function tracing with ftrace: your first trace

How to read a trace line

Narrowing the output with set_ftrace_filter

Reading a timed call graph with function_graph

Interpreting the output and finding what you need

Three worked examples

Example 1 — a char driver’s own routines: are my file operations being called?

Example 2 — vfs_read: where does a read spend its time?

Example 3 — __schedule: was the task switched out voluntarily or preempted?

Limiting a trace to one process and getting call counts

Cleaning up

A faster front-end: trace-cmd

Key takeaways

Further reading

What you need

Mounting and exploring the trace filesystem

Function tracing with ftrace: your first trace

How to read a trace line

Narrowing the output with set_ftrace_filter

Reading a timed call graph with function_graph

Interpreting the output and finding what you need

Three worked examples

Example 1 — a char driver’s own routines: are my file operations being called?

Example 2 — vfs_read: where does a read spend its time?

Example 3 — __schedule: was the task switched out voluntarily or preempted?

Limiting a trace to one process and getting call counts

Cleaning up

A faster front-end: trace-cmd

Key takeaways

Further reading

Related reading

Real-Time Linux with PREEMPT_RT — Part 1: Understanding Latency and Measuring It with cyclictest