Debug Story: Hung Task on an Embedded Linux Board

This is a debugging incident from an embedded Linux board I was working on, where a single stuck driver transfer brought a working system to a halt. The application I was testing stopped responding, and a couple of minutes later the kernel log filled with hung task warnings. The root cause turned out to be a completion that was never signalled on a hardware error path, and it taught me a lesson I now repeat to every driver author: a wait that has no bound will, sooner or later, wait forever. Here is how the failure looked, how the kernel’s own detector pointed me at the cause, and how I fixed it.

The symptom: a frozen application and a flood of warnings

The board ran a custom SPI flash controller with an out-of-tree driver I was bringing up. Under normal use everything worked. During a stress test that issued back-to-back flash reads, my user-space tool froze: it stopped producing output but did not exit, and I could not kill it with Ctrl-C. About two minutes later the kernel console started printing, and kept printing every two minutes:

INFO: task spi0:73 blocked for more than 122 seconds.
      Not tainted 6.12.0 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:spi0            state:D stack:0     pid:73    ppid:2     flags:0x00000008
Call trace:
 __switch_to+0xa0/0xfc
 __schedule+0x2b4/0x6d0
 schedule+0x40/0xb0
 schedule_timeout+0x18c/0x1b0
 wait_for_completion+0x8c/0x150
 acme_spi_transfer_one+0xd4/0x180 [acme_spi]
 spi_transfer_one_message+0x2e8/0x5d0
 ...

A second task was reported as blocked at the same time, a user-space process waiting on the bus:

INFO: task flashcfg:412 blocked for more than 122 seconds.
      Not tainted 6.12.0 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:flashcfg        state:D stack:0     pid:412   ppid:401   flags:0x00000000
Call trace:
 __schedule+0x2b4/0x6d0
 schedule+0x40/0xb0
 schedule_timeout+0x18c/0x1b0
 wait_for_completion+0x8c/0x150
 __spi_sync+0x2a4/0x3a0
 ...

What the hung task detector was telling me

The “blocked for more than 122 seconds” message comes from the kernel’s hung task detector, a kernel thread called khungtaskd built when CONFIG_DETECT_HUNG_TASK is set. Its logic is in kernel/hung_task.c. The thread periodically walks every task and, in check_hung_uninterruptible_tasks(), looks only at tasks in the TASK_UNINTERRUPTIBLE state, the “D” state you see in ps. For each such task it compares a stored context-switch count against the current one. If a task has not been scheduled at all for longer than the timeout, the detector assumes it is stuck and prints the warning.

Three points helped me read the message correctly. First, it only ever fires for D-state tasks, because those are the ones that cannot be woken by a signal; a task sleeping interruptibly can still be signalled or killed, so the detector ignores it. Second, the timeout is kernel.hung_task_timeout_secs, which defaults to 120 seconds, so the first report arrives no sooner than two minutes after the real freeze, which matches the delay I saw. Third, the detector does not fix or kill anything; by default it prints a limited number of warnings (kernel.hung_task_warnings, default 10) and otherwise leaves the task exactly as stuck as it found it. I confirmed the settings and the stuck tasks directly:

raghu@techveda.org:~$ cat /proc/sys/kernel/hung_task_timeout_secs
120
raghu@techveda.org:~$ ps -eo pid,stat,comm | grep '  *D'
   73 D    spi0
  412 D    flashcfg
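
To make the detector's rule concrete, here is a small user-space model of the per-task check described above. It is a sketch, not the code in kernel/hung_task.c: the struct, its field names and the function model_check_hung() are invented for illustration, and time is counted in whole seconds rather than jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified, user-space model of one task as the detector sees it.
 * Field names are illustrative, not the kernel's task_struct layout. */
struct model_task {
    bool uninterruptible;        /* true when the task is in D state */
    uint64_t switch_count;       /* total context switches so far */
    uint64_t last_switch_count;  /* count recorded by the previous scan */
    uint64_t last_switch_time;   /* time (seconds) that count was recorded */
};

/* Returns true when the task should be reported as hung at time `now`. */
static bool model_check_hung(struct model_task *t, uint64_t now,
                             uint64_t timeout_secs)
{
    if (!t->uninterruptible)
        return false;            /* only D-state tasks are considered */

    if (t->switch_count != t->last_switch_count) {
        /* The task ran since the last scan: record it, restart the clock. */
        t->last_switch_count = t->switch_count;
        t->last_switch_time = now;
        return false;
    }

    /* No progress: report once it has been stuck longer than the timeout. */
    return now - t->last_switch_time > timeout_secs;
}
```

In this model a task that never leaves D state and never context-switches is reported on the first scan that runs more than timeout_secs after the scan that last saw it move, which is why the warning trails the actual freeze.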

Reading the blocked-task stack

The call trace under each warning is printed by sched_show_task(), and it was the most useful part. Both stacks ended in wait_for_completion. That single frame told me each task was sleeping on a struct completion, waiting for some other context to call complete(). The kernel documentation is precise about this call: wait_for_completion() marks the task TASK_UNINTERRUPTIBLE and waits with no timeout. That is exactly the combination the hung task detector reports, so a stack ending in a plain wait_for_completion sent me straight to one question: who is supposed to call complete(), and can that ever fail to happen?

The top stack, task spi0, is the SPI core’s message-pump kernel thread. The SPI subsystem serialises messages through a single per-controller thread, and here that thread was stuck in my driver’s acme_spi_transfer_one(). That explained the second task too: flashcfg had called spi_sync(), which queues a message and waits for the pump thread to run it. Because the pump thread was itself stuck, every queued message waited forever, so unrelated callers piled up in D state behind the one stuck transfer. One blocked completion had become a system-wide freeze.

The root cause: a completion that is never signalled

My driver waited for the controller’s “done” interrupt like this:

static int acme_spi_transfer_one(struct spi_controller *ctlr,
                                 struct spi_device *spi,
                                 struct spi_transfer *xfer)
{
        struct acme_spi *acme = spi_controller_get_devdata(ctlr);

        reinit_completion(&acme->done);

        acme_writel(acme, xfer->len, ACME_LEN);
        acme_writel(acme, ACME_CMD_START | ACME_IRQ_EN, ACME_CMD);

        wait_for_completion(&acme->done);   /* no timeout */
        return 0;
}

And the interrupt handler signalled the completion:

static irqreturn_t acme_spi_isr(int irq, void *dev_id)
{
        struct acme_spi *acme = dev_id;
        u32 status = acme_readl(acme, ACME_STATUS);

        if (status & ACME_STATUS_DONE) {
                acme_writel(acme, ACME_STATUS_DONE, ACME_STATUS);  /* ack */
                complete(&acme->done);
                return IRQ_HANDLED;
        }

        return IRQ_NONE;
}

The handler only signalled the completion when it saw ACME_STATUS_DONE. During the stress test the controller occasionally finished a transfer with its error bit, ACME_STATUS_ERR, set instead of the done bit. On that path the handler took the return IRQ_NONE branch and never called complete(). The transfer in acme_spi_transfer_one() then waited on a completion that nothing would ever signal, and because the wait had no timeout, it never returned. The hardware error was rare, which is why the board passed my casual testing and only failed under sustained load.
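
The gap in the handler is easy to see once the decision is written as a pure function of the status word. The following is a user-space sketch only: the MODEL_ names are hypothetical, and the bit values are assumptions (0x2 for the error bit is suggested by the "status 0x2" log line later in this article, while 0x1 for the done bit is a guess).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit values, for illustration only. */
#define MODEL_STATUS_DONE 0x1u
#define MODEL_STATUS_ERR  0x2u

/* Mirrors the original handler: complete() is reached only when DONE is set. */
static bool model_original_isr_wakes(uint32_t status)
{
    return (status & MODEL_STATUS_DONE) != 0;
}

/* True when the hardware has finished (either bit set) but the original
 * handler would leave the waiter asleep. */
static bool model_waiter_stranded(uint32_t status)
{
    bool finished = (status & (MODEL_STATUS_DONE | MODEL_STATUS_ERR)) != 0;

    return finished && !model_original_isr_wakes(status);
}
```

With these assumed values, model_waiter_stranded() is true for exactly one input, the error-only status, which is the case the stress test eventually hit.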

The fix: bound the wait and signal on every exit path

Two changes were needed, and both mattered. First, the wait had to be bounded so a missed interrupt degraded into a reported error instead of a permanent freeze. I switched to wait_for_completion_timeout(), which returns 0 on timeout and the remaining jiffies otherwise, and recovered the controller when it fired:

static int acme_spi_transfer_one(struct spi_controller *ctlr,
                                 struct spi_device *spi,
                                 struct spi_transfer *xfer)
{
        struct acme_spi *acme = spi_controller_get_devdata(ctlr);
        unsigned long time_left;
        u32 status;

        reinit_completion(&acme->done);

        acme_writel(acme, xfer->len, ACME_LEN);
        acme_writel(acme, ACME_CMD_START | ACME_IRQ_EN, ACME_CMD);

        time_left = wait_for_completion_timeout(&acme->done,
                                                msecs_to_jiffies(100));
        if (time_left == 0) {
                acme_writel(acme, ACME_CMD_RESET, ACME_CMD);
                dev_err(&ctlr->dev, "transfer timed out\n");
                return -ETIMEDOUT;
        }

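        /* Use the status saved by the handler; the register has been acked. */
        status = acme->status;
        if (status & ACME_STATUS_ERR) {
                dev_err(&ctlr->dev, "transfer error, status %#x\n", status);
                return -EIO;
        }
        return 0;
}

Second, and more importantly, I made the interrupt handler wake the waiter on the error path as well, so the normal case never relies on the timeout at all. Because the handler acknowledges every status bit it saw, and on a write-to-clear register that ack would erase the error bit before the transfer function could read it, the handler also saves the status first (this assumes a u32 status field added to struct acme_spi):

static irqreturn_t acme_spi_isr(int irq, void *dev_id)
{
        struct acme_spi *acme = dev_id;
        u32 status = acme_readl(acme, ACME_STATUS);

        if (!(status & (ACME_STATUS_DONE | ACME_STATUS_ERR)))
                return IRQ_NONE;

        acme->status = status;                     /* save before the ack clears it */
        acme_writel(acme, status, ACME_STATUS);   /* ack everything */
        complete(&acme->done);                     /* wake the waiter either way */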
        return IRQ_HANDLED;
}

With both changes, a transfer that ends in an error now wakes the waiting thread immediately; the transfer function reads the status, reports -EIO, and the SPI core moves on to the next message. The timeout remains as a backstop for the worst case, where the controller raises no interrupt at all. This also reflects the standing advice in the kernel’s completion documentation to be careful about long, unbounded waits, especially while other work is serialised behind you.
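
The difference between a bounded and an unbounded wait can be tried out in user space. The sketch below models a completion with a pthread mutex and condition variable. It is an analogy, not the kernel implementation: all model_ names are hypothetical, and unlike wait_for_completion_timeout() the wait returns a plain true or false rather than the remaining jiffies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* User-space stand-in for struct completion (illustrative names). */
struct model_completion {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned int done;   /* pending complete() calls not yet consumed */
};

static void model_init_completion(struct model_completion *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = 0;
}

/* Signal one waiter, as the interrupt handler does with complete(). */
static void model_complete(struct model_completion *c)
{
    pthread_mutex_lock(&c->lock);
    c->done++;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Bounded wait: returns true if a completion was consumed, false on timeout. */
static bool model_wait_timeout(struct model_completion *c, long timeout_ms)
{
    struct timespec deadline;
    bool completed = false;
    int rc = 0;

    assert(timeout_ms >= 0);

    /* Build an absolute deadline on the clock the condition variable uses. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c->lock);
    while (c->done == 0 && rc != ETIMEDOUT)
        rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
    if (c->done > 0) {
        c->done--;           /* consume the completion */
        completed = true;
    }
    pthread_mutex_unlock(&c->lock);
    return completed;
}
```

Built with -pthread, calling model_wait_timeout() on a completion that nobody signals returns false once the deadline passes, where an unbounded wait would simply never return. If model_complete() has been called first, the wait returns true immediately.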

Confirming the fix

After the change, the same stress test no longer froze. I injected the error condition deliberately, by forcing the controller’s error bit, and now got a clean log line and a recovered bus instead of a dead board:

raghu@techveda.org:~$ dmesg | tail -2
[  812.447120] acme_spi 5a200000.spi: transfer error, status 0x2
[  812.447559] spi-nor spi0.0: error -5 reading 256 bytes

The hung task warnings stopped because no task is left in an unbounded D-state wait. If you want a board to reboot rather than sit frozen when this class of bug appears, for example in an automated test rack, set kernel.hung_task_panic=1 so the detector turns a hung task into a panic you can capture, and set kernel.panic to a non-zero number of seconds so the panic is followed by an automatic reboot.

Key takeaways

A “task blocked for more than N seconds” message is the hung task detector (khungtaskd, kernel/hung_task.c) reporting a task stuck in the TASK_UNINTERRUPTIBLE “D” state; it warns but does not fix or kill.
The default kernel.hung_task_timeout_secs is 120, so the first report lags the real freeze by at least two minutes.
A stack ending in a plain wait_for_completion means the task is waiting for a complete() that has not happened; always ask whether that complete() can be missed.
Because the SPI core serialises messages through one thread, a single stuck transfer blocks every queued message, turning one missed completion into a system-wide freeze.
Bound hardware waits with wait_for_completion_timeout() and signal the completion on every interrupt exit path, including error paths.
