Linux Workqueues: A Deep Dive into Deferred Work

This deep dive traces the Linux workqueue path through mainline kernel/workqueue.c. queue_work() claims WORK_STRUCT_PENDING_BIT and calls __queue_work(); insert_work() links the item into a worker pool’s worklist; the scheduler hook wq_worker_sleeping() decrements pool->nr_running and calls kick_pool() so another kworker starts the moment the last running one blocks; process_one_work() finally calls your handler between two tracepoints you can watch live.

Every driver engineer calls queue_work(), but few have read what stands between that call and their handler running. This deep dive walks the actual mainline source of Linux workqueues — kernel/workqueue.c — function by function: the enqueue path, the structures that hold pending work, the scheduler hook that implements concurrency management, the execution loop, and the rescuer that guarantees forward progress under memory pressure. All code below is quoted from current mainline (trimmed where marked); read it with the source open in one terminal and a tracing session in another.

The contract in brief

Since the concurrency-managed workqueue (cmwq) rework by Tejun Heo, merged in 2.6.36 (2010), a workqueue does not own threads. Before cmwq, every multi-threaded workqueue spawned one worker thread per CPU, and the kernel documentation records that large systems saturated the 32k PID space during boot. Today a workqueue_struct is a routing and policy object; shared worker pools execute everything.

The alloc_workqueue() flags are promises about execution. WQ_UNBOUND trades CPU locality for scheduler placement. WQ_HIGHPRI routes to a separate pool whose workers run at elevated priority (a negative nice level). WQ_CPU_INTENSIVE exempts long-running items from the concurrency count so they do not hold up other items in the pool. WQ_FREEZABLE makes the workqueue take part in the freeze phase of system suspend: pending items are drained and nothing new starts until thaw. WQ_MEM_RECLAIM reserves a rescuer thread. WQ_BH (kernel 6.9) executes items in softirq context as the tasklet replacement. max_active caps how many of the workqueue’s work items may execute concurrently, per CPU for a per-CPU workqueue; passing 0 selects the default of 1024, and the ceiling is 2048.

That is the contract every driver author codes against. The rest of this article walks the implementation that honours it.

The enqueue path: queue_work() to insert_work()

queue_work(wq, work) is a wrapper that calls queue_work_on(WORK_CPU_UNBOUND, wq, work). The core of queue_work_on() is one atomic bit operation — this is why queueing an already-pending work item is a no-op:

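bool queue_work_on(int cpu, struct workqueue_struct *wq,
		   struct work_struct *work)
{
	bool ret = false;
	unsigned long irq_flags;

	local_irq_save(irq_flags);

	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work)) &&
	    !clear_pending_if_disabled(work)) {
		__queue_work(cpu, wq, work);
		ret = true;
	}

	local_irq_restore(irq_flags);
	return ret;
}
EXPORT_SYMBOL(queue_work_on);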
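
The claim-then-queue pattern can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code: it assumes C11 atomic_fetch_or() as a stand-in for test_and_set_bit(), and the names model_work, model_queue_work() and model_start_execution() are hypothetical. It shows why a second queueing attempt on a pending item does nothing, and why the item can be queued again once a worker has cleared the bit, which process_one_work() does before it calls the handler.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_PENDING_MASK 1UL	/* stands in for WORK_STRUCT_PENDING_BIT */

struct model_work {
	atomic_ulong data;	/* models work->data */
	int times_linked;	/* how often the item actually reached a list */
};

static void model_init_work(struct model_work *w)
{
	atomic_init(&w->data, 0UL);
	w->times_linked = 0;
}

/* Only the caller that flips the bit from 0 to 1 gets to queue the item. */
static bool model_queue_work(struct model_work *w)
{
	unsigned long old = atomic_fetch_or(&w->data, MODEL_PENDING_MASK);

	if (old & MODEL_PENDING_MASK)
		return false;	/* already pending: nothing to do */
	w->times_linked++;	/* models the call to __queue_work() */
	return true;
}

/* Models the worker clearing PENDING just before it runs the handler. */
static void model_start_execution(struct model_work *w)
{
	atomic_fetch_and(&w->data, ~MODEL_PENDING_MASK);
}
```

Calling model_queue_work() twice in a row returns true and then false; after model_start_execution() it returns true again, which mirrors how a handler can requeue its own work item.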

WORK_STRUCT_PENDING_BIT lives inside work->data, so the pending state travels with the work_struct itself. From there, __queue_work() does the routing:

static void __queue_work(int cpu, struct workqueue_struct *wq,
		  struct work_struct *work)
{
	struct pool_workqueue *pwq;
	struct worker_pool *last_pool, *pool;
	unsigned int work_flags;
	unsigned int req_cpu = cpu;
	/* ... */

It selects the pool_workqueue (pwq) — the per-pool handle of this workqueue — for the requested CPU, or an unbound pool for WQ_UNBOUND queues. It then checks the concurrency budget through pwq_tryinc_nr_active():

static bool pwq_tryinc_nr_active(struct pool_workqueue *pwq, bool fill)
{
	struct workqueue_struct *wq = pwq->wq;
	struct worker_pool *pool = pwq->pool;
	struct wq_node_nr_active *nna = wq_node_nr_active(wq, pool->node);
	/* ... */
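
Stripped of the per-node accounting that the real function performs, the budget check behaves like a bounded counter. The sketch below is a simplified userspace model under that assumption; model_pwq, model_enqueue() and model_complete_one() are hypothetical names, not kernel symbols. Items that obtain a slot count as active, the rest are parked, and one parked item is activated each time an active item finishes.

```c
#include <stdbool.h>

struct model_pwq {
	int max_active;		/* the limit set at allocation time */
	int nr_active;		/* items currently holding a slot */
	int nr_inactive;	/* items parked until a slot frees up */
};

/* Take a slot if one is free. */
static bool model_tryinc_nr_active(struct model_pwq *pwq)
{
	if (pwq->nr_active >= pwq->max_active)
		return false;
	pwq->nr_active++;
	return true;
}

/* Returns true if the item became active at once, false if it was parked. */
static bool model_enqueue(struct model_pwq *pwq)
{
	if (model_tryinc_nr_active(pwq))
		return true;
	pwq->nr_inactive++;
	return false;
}

/* An active item finished: release its slot and activate one parked item. */
static void model_complete_one(struct model_pwq *pwq)
{
	pwq->nr_active--;
	if (pwq->nr_inactive > 0 && model_tryinc_nr_active(pwq))
		pwq->nr_inactive--;
}
```

With max_active set to 2, the third model_enqueue() call returns false and the item stays parked until model_complete_one() frees a slot.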

If an nr_active slot is obtained, the item goes straight onto the pool’s worklist; if the workqueue is already at its max_active limit, it is parked on the pwq’s inactive list and activated later. The final linking step is short enough to quote whole:

static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
			struct list_head *head, unsigned int extra_flags)
{
	debug_work_activate(work);

	/* record the work call stack in order to print it in KASAN reports */
	kasan_record_aux_stack(work);

	/* we own @work, set data and link */
	set_work_pwq(work, pwq, extra_flags);
	list_add_tail(&work->entry, head);
	get_pwq(pwq);
}

Note set_work_pwq(): the same work->data word that holds the pending bit now also records which pwq owns the item. This is how cancel_work_sync() later finds where a work item went.
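
Packing a pointer and flag bits into one word works because the pointed-to structure is aligned, so the low bits of its address are always zero. Here is a minimal userspace sketch of that idea. It is an illustration only: the 8-bit flag width and the names model_pwq, model_set_work_pwq() and model_get_work_pwq() are assumptions made for this example, not the kernel's layout.

```c
#include <stdalign.h>
#include <stdint.h>

#define MODEL_FLAG_BITS	8	/* hypothetical width, not the kernel's value */
#define MODEL_FLAG_MASK	(((uintptr_t)1 << MODEL_FLAG_BITS) - 1)
#define MODEL_PENDING	((uintptr_t)1 << 0)

/* Aligned so that the low MODEL_FLAG_BITS bits of any address are zero. */
struct model_pwq {
	alignas(1 << MODEL_FLAG_BITS) int id;
};

/* Combine the owner pointer and the flag bits into one data word. */
static uintptr_t model_set_work_pwq(struct model_pwq *pwq, uintptr_t flags)
{
	return (uintptr_t)pwq | (flags & MODEL_FLAG_MASK);
}

/* Recover the owner by masking the flag bits off again. */
static struct model_pwq *model_get_work_pwq(uintptr_t data)
{
	return (struct model_pwq *)(data & ~MODEL_FLAG_MASK);
}

static int model_work_pending(uintptr_t data)
{
	return (data & MODEL_PENDING) != 0;
}
```

A single load of the data word therefore answers both questions a canceller needs answered: is the item pending, and which queue owns it.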

Where pending work waits: struct worker_pool

Work items do not queue on threads; they queue on pools. The structure at the top of kernel/workqueue.c (trimmed):

struct worker_pool {
	raw_spinlock_t		lock;		/* the pool lock */
	int			cpu;		/* I: the associated cpu */
	int			node;		/* I: the associated node ID */
	int			id;		/* I: pool ID */
	unsigned int		flags;		/* L: flags */
	/* ... */
	int			nr_running;

	struct list_head	worklist;	/* L: list of pending works */

	int			nr_workers;	/* L: total number of workers */
	int			nr_idle;	/* L: currently idle workers */

	struct list_head	idle_list;	/* L: list of idle workers */
	struct timer_list	idle_timer;	/* L: worker idle timeout */
	/* ... */
	struct timer_list	mayday_timer;	/* L: SOS timer for workers */
};

nr_running is the single number the whole concurrency-management design revolves around. The source comment above it states the discipline: it is incremented in process context on the associated CPU with preemption disabled, and decremented in the same context with pool->lock held. The decision function built on it is a single expression:

static bool need_more_worker(struct worker_pool *pool)
{
	return !list_empty(&pool->worklist) && !pool->nr_running;
}

Pending work exists, and no worker of this pool is currently runnable on the CPU — only then does the pool start another worker. That is the entire cmwq policy.

The scheduler hook: wq_worker_sleeping()

The mechanism that keeps nr_running accurate is a direct hook from the scheduler: when a kworker is about to block inside your handler, schedule() calls into workqueue code through sched_submit_work():

void wq_worker_sleeping(struct task_struct *task)
{
	struct worker *worker = kthread_data(task);
	struct worker_pool *pool;

	/*
	 * Rescuers, which may not have all the fields set up like normal
	 * workers, also reach here, let's not access anything before
	 * checking NOT_RUNNING.
	 */
	if (worker->flags & WORKER_NOT_RUNNING)
		return;

	pool = worker->pool;
	/* ... */
	pool->nr_running--;
	if (kick_pool(pool))
		worker->current_pwq->stats[PWQ_STAT_CM_WAKEUP]++;

	raw_spin_unlock_irq(&pool->lock);
}

kick_pool() wakes the first idle worker if need_more_worker() is true — so the CPU never idles while the pool’s worklist is non-empty:

static bool kick_pool(struct worker_pool *pool)
{
	struct worker *worker = first_idle_worker(pool);
	struct task_struct *p;

	lockdep_assert_held(&pool->lock);

	if (!need_more_worker(pool) || !worker)
		return false;

	if (pool->flags & POOL_BH) {
		kick_bh_pool(pool);
		return true;
	}
	/* ... */

The POOL_BH branch is the kernel 6.9 BH-workqueue path (the tasklet replacement): those pools have no threads to wake, so the pool is kicked by raising softirq instead. The counterpart hook wq_worker_running() increments nr_running again when the worker resumes. This pair of functions is why a sleeping work item does not stall the queue behind it — the concurrency handover happens inside the scheduler, not on a timer.
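
The handover can be followed step by step in a small userspace model. This is a sketch under stated assumptions, not kernel code: model_pool and the model_*() functions are hypothetical, locking is omitted, and the model counts a woken worker as running immediately, whereas the kernel accounts for it later in the woken worker's own context.

```c
#include <stdbool.h>

struct model_pool {
	int nr_pending;		/* items on the worklist */
	int nr_running;		/* workers currently runnable */
	int nr_idle;		/* idle workers that could be woken */
	int cm_wakeups;		/* wakeups caused by concurrency management */
};

/* Same condition as need_more_worker(): work waits and nobody is running. */
static bool model_need_more_worker(const struct model_pool *pool)
{
	return pool->nr_pending > 0 && pool->nr_running == 0;
}

/* Wake one idle worker, but only if the condition above holds. */
static bool model_kick_pool(struct model_pool *pool)
{
	if (!model_need_more_worker(pool) || pool->nr_idle == 0)
		return false;
	pool->nr_idle--;
	pool->nr_running++;	/* simplification, see the note above */
	return true;
}

/* Models wq_worker_sleeping(): a running worker is about to block. */
static void model_worker_sleeping(struct model_pool *pool)
{
	pool->nr_running--;
	if (model_kick_pool(pool))
		pool->cm_wakeups++;
}

/* Models wq_worker_running(): the blocked worker resumes. */
static void model_worker_running(struct model_pool *pool)
{
	pool->nr_running++;
}
```

Starting from two pending items, one running worker and one idle worker, model_kick_pool() does nothing. Only when the running worker calls model_worker_sleeping() does the idle worker get woken.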

Execution: worker_thread() and process_one_work()

Every kworker runs worker_thread(), whose main loop is compact:

recheck:
	/* no more worker necessary? */
	if (!need_more_worker(pool))
		goto sleep;

	/* do we need to manage? */
	if (unlikely(!may_start_working(pool)) && manage_workers(worker))
		goto recheck;
	/* ... */
		process_scheduled_works(worker);
	} while (keep_working(pool));

	worker_set_flags(worker, WORKER_PREP);
sleep:
	/* ... */

manage_workers() is where new kworkers get created on demand — worker creation is itself lazy and driven by the same need_more_worker() test. Each item finally reaches process_one_work(), which brackets your function with the two tracepoints used for debugging:

	trace_workqueue_execute_start(work);
	worker->current_func(work);
	/*
	 * While we must be careful to not use "work" after this, the trace
	 * point will only record its address.
	 */
	trace_workqueue_execute_end(work, worker->current_func);

Immediately after the call, process_one_work() checks that your handler did not return with a lock held, in atomic context, or inside an RCU read section — the console message every driver developer should recognise is printed here: BUG: workqueue leaked atomic, lock or RCU: comm[pid], followed by the offending workfn pointer and held locks.

The kworker names you see in ps are produced when a worker identifies itself: bound workers format "kworker/%d:%d%s" — CPU, worker id, and an H suffix when the pool’s nice level is negative (a WQ_HIGHPRI pool); unbound workers format "kworker/u%d:%d", where the number after u is the internal pool id, not a CPU; rescuers format "kworker/R-%s" with the workqueue’s name.
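
The three format strings are easy to reproduce in userspace when you want to decode names from ps output. The helpers below are a hypothetical sketch: they use snprintf() with the formats quoted above, and the function names and the example workqueue name my_wq are assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Bound pool: CPU, worker id, and "H" when the pool's nice level is negative. */
static int model_bound_name(char *buf, size_t size, int cpu, int worker_id,
			    int nice)
{
	return snprintf(buf, size, "kworker/%d:%d%s", cpu, worker_id,
			nice < 0 ? "H" : "");
}

/* Unbound pool: the number after "u" is the pool id, not a CPU. */
static int model_unbound_name(char *buf, size_t size, int pool_id,
			      int worker_id)
{
	return snprintf(buf, size, "kworker/u%d:%d", pool_id, worker_id);
}

/* Rescuer: named after the workqueue it serves. */
static int model_rescuer_name(char *buf, size_t size, const char *wq_name)
{
	return snprintf(buf, size, "kworker/R-%s", wq_name);
}
```

For example, model_bound_name(buf, sizeof(buf), 2, 1, -20) yields kworker/2:1H, and model_unbound_name(buf, sizeof(buf), 8, 3) yields kworker/u8:3.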

The rescuer path: what WQ_MEM_RECLAIM buys

Worker creation needs memory. If the system is in reclaim and the pool cannot spawn a worker, a queue could deadlock waiting for the very memory its work item would free. alloc_workqueue() therefore calls init_rescuer(), which does nothing unless the flag is set:

	if (!(wq->flags & WQ_MEM_RECLAIM))
		return 0;

	rescuer = alloc_worker(NUMA_NO_NODE);

When a pool makes no progress, its mayday_timer fires and send_mayday() puts the starving pwq on the workqueue’s mayday list — the comment in the source reads, literally, /* mayday mayday mayday */ — and wakes the pre-allocated rescuer thread, which processes items from that pwq directly. This is the forward-progress guarantee: one reserved execution context per WQ_MEM_RECLAIM workqueue, created at alloc_workqueue() time, never at reclaim time.

Watching the path live

The two tracepoints quoted above, plus the queueing tracepoint fired near insert_work(), make the whole path observable on any running system:

raghu@techveda.org:~$ echo 'workqueue:workqueue_queue_work workqueue:workqueue_execute_start' | sudo tee /sys/kernel/tracing/set_event
raghu@techveda.org:~$ sudo cat /sys/kernel/tracing/trace_pipe | head -2
kworker/2:1-73    [002] d..1  1290.112034: workqueue_queue_work: work struct=00000000e3b1a6f4 function=vmstat_update workqueue=mm_percpu_wq req_cpu=2 cpu=2
kworker/2:1-73    [002] ....  1290.112051: workqueue_execute_start: work struct=00000000e3b1a6f4: function=vmstat_update
raghu@techveda.org:~$ echo | sudo tee /sys/kernel/tracing/set_event

The req_cpu field in the first event is exactly the req_cpu variable you saw in __queue_work(); the gap between the two timestamps is the queue-to-execute latency, which grows when need_more_worker() keeps returning false because a runnable worker already occupies the CPU. The last command clears the event selection. If a single kworker consumes unexpected CPU, cat /proc/<pid>/stack shows which work function it is stuck in.
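
For the two sample events, the arithmetic is 1290.112051 s minus 1290.112034 s, which is 17 microseconds. A small helper can do the same for any pair of timestamps copied from the trace. This is a hypothetical userspace sketch: the function names are made up for this example, and it assumes timestamps with exactly six fractional digits, as in the output shown.

```c
#include <stdio.h>

/*
 * Convert a trace timestamp such as "1290.112034" to microseconds.
 * Assumes six fractional digits; returns -1 on malformed input.
 */
static long long model_trace_ts_us(const char *ts)
{
	long long sec;
	unsigned int usec;

	if (sscanf(ts, "%lld.%6u", &sec, &usec) != 2)
		return -1;
	return sec * 1000000LL + usec;
}

/* Queue-to-execute latency in microseconds, or -1 if either input is bad. */
static long long model_queue_to_execute_us(const char *queued,
					   const char *started)
{
	long long q = model_trace_ts_us(queued);
	long long s = model_trace_ts_us(started);

	if (q < 0 || s < 0)
		return -1;
	return s - q;
}
```

Match the two events by their work struct address before subtracting, since unrelated items interleave freely in the trace.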

One current detail worth knowing when you read mainline: __queue_work() now begins with a pr_warn_once() for workqueues carrying the __WQ_DEPRECATED flag — part of the ongoing renaming of the system workqueues (system_wq to system_percpu_wq, system_unbound_wq to system_dfl_wq, with the explicit WQ_PERCPU flag as the complement of WQ_UNBOUND).

Key takeaways

The pending state and the owning pool of a work item both live inside work->data: WORK_STRUCT_PENDING_BIT is claimed in queue_work_on(), and set_work_pwq() records the pwq in insert_work().
Concurrency management is one condition — need_more_worker(): pending work on pool->worklist and pool->nr_running == 0 — evaluated from a scheduler hook (wq_worker_sleeping() → kick_pool()), not from a timer.
max_active is enforced at enqueue time in pwq_tryinc_nr_active(); items over the limit wait on the pwq’s inactive list, not on the pool’s worklist.
The rescuer is a pre-allocated worker created by init_rescuer() only for WQ_MEM_RECLAIM queues, woken by the pool’s mayday_timer through send_mayday().
trace_workqueue_execute_start/end bracket your handler in process_one_work(), so queue-to-execute latency and handler misbehaviour (the “leaked atomic, lock or RCU” check) are directly observable.

Frequently asked questions

When should a driver create its own workqueue instead of using schedule_work()?
Create a dedicated workqueue when you need a forward-progress guarantee via WQ_MEM_RECLAIM, need to flush your work items as a group, or need attributes such as WQ_UNBOUND or WQ_HIGHPRI. For ordinary one-off items, schedule_work() on the shared system workqueue is sufficient.

What does WQ_MEM_RECLAIM actually do?
It reserves a rescuer thread for that workqueue, guaranteeing at least one execution context even under memory pressure when new workers cannot be created. Any workqueue that can be used in a memory-reclaim path must set it, otherwise the system can deadlock.

Why does my system show so many kworker threads?
Worker pools create and retire kworker threads on demand for all workqueues in the system, so the count varies with load. The name encodes the pool: kworker/2:1 is a bound pool on CPU 2, an H suffix marks the high-priority pool, and a u prefix marks an unbound pool.

Do BH workqueues replace tasklets?
That is the stated goal. WQ_BH workqueues, added in kernel 6.9, run work items in softirq context on the queueing CPU in queueing order, and several tasklet users have already been converted. Unlike tasklets, they support flushing, canceling, and delayed queueing.
