The sched_ext Revolution: The Future of CPU Scheduling in Linux

Introduction

The CPU scheduler is the unsung hero of the Linux kernel. Its job is to answer three critical questions: which task runs, where, and for how long? For decades, a single general-purpose scheduler (the O(1) scheduler, then CFS, and most recently EEVDF) handled this, powering everything from phones to supercomputers. But with increasingly complex hardware and specialized software, the “one-size-fits-all” model began to crack. This tension set the stage for sched_ext.

Raghu Bharadwaj

Known for his unique ability to turn complex concepts into deep, practical insights. His thought-provoking writings challenge readers to look beyond the obvious, helping them not just understand technology but truly think differently about it.

His writing style encourages curiosity and helps readers discover fresh perspectives that stick with them long after reading.

The Cracks in a One-Size-Fits-All Model

A universal scheduler is a master of compromise, but compromise has its limits. Every decision involves trade-offs:

  • Throughput vs. Latency: Maximize raw power, lose responsiveness.
  • Cache Locality vs. CPU Utilization: Keep tasks local for speed, leave other cores idle.
  • Power Efficiency vs. Peak Performance: Save battery, sacrifice critical performance.

Why a single scheduler couldn’t optimize for everyone:

  • Data Centers: Need predictable performance for strict SLOs.
  • VR/AR: Demand millisecond-precise frame delivery.
  • Gaming: Prioritizes smooth, consistent frame rates over raw FPS.
  • Mobile Devices: Constant battle between performance and battery.

A single, universal algorithm cannot be optimal for every specific use case.

The Innovation Bottleneck

Why didn’t developers just write custom schedulers? Because changing the kernel’s scheduler was:

  • High-Risk: A small error can crash the system.
  • High-Cost: Significant engineering effort required.
  • Slow: Kernel maintainers have an extremely high bar for changes.

This led to:

  • Out-of-Tree Schedulers: Companies maintaining costly, fragmented custom kernels.
  • Stifled Innovation: Difficulty experimenting with new ideas safely.

Developers needed a way to experiment safely and deploy custom schedulers without having to convince the entire world their approach was the one true way.

sched_ext: A New Framework for a New Era

First proposed in late 2022 and merged into the mainline kernel with Linux 6.12, the vision became reality: extensible scheduling. sched_ext (the extensible scheduler class) is not another scheduling algorithm. It’s a framework that allows developers to write and deploy their own schedulers as BPF programs, which can be loaded directly into the kernel at runtime.

Why sched_ext is a Game-Changer:

Dynamic & Agile:

  • Load, unload, or switch schedulers at runtime—no reboots required.
  • Transforms development cycles from months to minutes, enabling rapid iteration.

Safety First:

  • BPF Verifier: Statically analyzes code to prevent kernel crashes, invalid memory access, or infinite loops.
  • Kernel Watchdog: Automatically unloads misbehaving schedulers at runtime and reverts to a safe default.

Focus on Policy, Not Mechanics:

  • sched_ext handles low-level details (context switching, runqueues).
  • Developers focus purely on the scheduling policy—the core logic for task selection.

This new model shifts Linux from a “one scheduler for all” philosophy to a platform for many schedulers, each perfectly tuned for its job.
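To make “policy, not mechanics” concrete, here is a toy model in Python. It is not the real sched_ext API (actual schedulers are BPF programs written in C, loaded via struct_ops); every name here is illustrative. The framework owns the run queue and the accounting, while the policy is a small pluggable function that only decides which task runs next:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    vruntime: float = 0.0   # virtual runtime consumed so far
    weight: float = 1.0     # higher weight => vruntime grows more slowly

class Framework:
    """Owns the run queue and the accounting (the 'mechanics')."""

    def __init__(self, policy):
        self.runqueue = []
        self.policy = policy        # pluggable, like a loaded BPF scheduler

    def enqueue(self, task):
        self.runqueue.append(task)

    def run_slice(self, slice_ms=4.0):
        task = self.policy(self.runqueue)        # policy picks *which* task
        task.vruntime += slice_ms / task.weight  # framework does the bookkeeping
        return task.name

def fair_policy(runqueue):
    """The entire 'scheduler': run the task with the least virtual runtime."""
    return min(runqueue, key=lambda t: t.vruntime)

fw = Framework(fair_policy)
fw.enqueue(Task("render", weight=2.0))  # latency-sensitive, weighted up
fw.enqueue(Task("batch"))
history = [fw.run_slice() for _ in range(4)]
print(history)   # render gets 3 of the 4 slices thanks to its weight
```

Swapping `fair_policy` for, say, a strict-priority function changes the machine’s behavior without touching the framework. That separation, with the kernel providing the mechanics, is the shape of the sched_ext contract.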

Summary: A New Era of Optimization

sched_ext represents a paradigm shift. It democratizes scheduler development, makes experimentation safe, and finally bridges the gap between the kernel’s stability and the unique needs of modern workloads. This isn’t just another update—it’s the beginning of a new era of extensible, workload-aware scheduling in Linux.


The Many Paths to init, Part 5: Unifying Themes

In this final installment of our series, we synthesize our exploration of diverse Linux boot processes by examining two critical, cross-platform themes: securing the chain of trust and ensuring system resiliency through atomic updates. While the implementations vary, the underlying goals are universal, reflecting the core challenges of building reliable and secure modern computing systems.


Securing the Chain of Trust

A secure boot process establishes a “chain of trust,” where each software stage cryptographically verifies the next before executing it. The implementation of this concept is tailored to the specific threat model of each platform.
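The chaining logic itself is simple enough to sketch. The toy below is illustrative only (real systems verify cryptographic signatures, not bare hashes): each stage carries a digest of the next stage’s payload and refuses to continue on a mismatch.

```python
import hashlib

def digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

# Each link: (this stage's payload, expected digest of the *next* payload).
spl    = b"secondary program loader"
kernel = b"linux kernel image"
chain = [
    (b"boot rom", digest(spl)),     # the immutable root of trust
    (spl,         digest(kernel)),  # SPL only runs this exact kernel
    (kernel,      None),            # last link: nothing left to verify
]

def verify_chain(chain) -> bool:
    """Walk the chain; refuse to 'execute' any stage that fails its check."""
    for (_, expected), (next_payload, _) in zip(chain, chain[1:]):
        if digest(next_payload) != expected:
            return False
    return True

print(verify_chain(chain))  # True: every stage matches its recorded digest

# Tampering with any later stage breaks the link that vouches for it.
chain[2] = (b"tampered kernel image", None)
print(verify_chain(chain))  # False
```

The platform differences described below are all about *where* the trusted first digest or key lives: a firmware key database, an eFuse, or an HMC.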


  • UEFI Secure Boot (PCs/Servers): This standard is designed to protect the user from malware like bootkits. It uses a database of keys in the firmware to verify EFI applications. Crucially, it is designed to be flexible; users can disable it or enroll their own keys to run any OS they choose, preserving user control over the hardware.


  • Hardware-Fused Boot (Embedded/IoT): In the embedded world, the threat model shifts to protecting the manufacturer from unauthorized firmware. Mechanisms like NXP’s High Assurance Boot (HAB) establish an immutable root of trust by permanently burning a hash of the manufacturer’s public key into one-time-programmable eFuses on the SoC. On a “closed” device, the Boot ROM will refuse to execute any code that isn’t signed by the manufacturer, creating a non-bypassable lockdown.


  • IBM Z Secure Boot (Mainframes): The mainframe threat model is focused on enterprise-grade compliance and auditability. Here, public keys are uploaded to the Hardware Management Console (HMC) and explicitly assigned to specific Logical Partitions (LPARs). The firmware will only boot an LPAR with code signed by its assigned keys, providing a strict, centrally managed chain of trust essential for high-security environments.


Ensuring Resiliency with Atomic Updates


A failed update can render a device unusable. To prevent this, modern systems employ atomic update strategies that ensure an update is either applied completely or not at all, always leaving the system in a bootable state.

  • The A/B Partitioning Model (State-Switching): This is the dominant strategy in the mobile and embedded worlds. The system has two full sets of OS partitions (slot A and slot B). While running from the active slot (A), an update is written to the inactive slot (B) in the background. Once complete, the bootloader is instructed to switch to slot B on the next reboot. If the new slot fails to boot, the bootloader automatically reverts to the original slot A, ensuring the device remains operational. This robust model is used by Android Seamless Updates and embedded frameworks like RAUC. Its main drawback is the storage overhead of duplicating the OS partitions.


  • The Transactional Update Model (State-Generation): For servers and modern desktops, a more storage-efficient model has emerged, exemplified by rpm-ostree (the technology behind Fedora CoreOS and Silverblue). This system treats the OS as a versioned, git-like repository. An update does not modify the running system; instead, it “checks out” a new filesystem tree into a new directory. The bootloader configuration is then atomically updated to point to this new deployment. The old deployment remains untouched on disk, allowing for instant rollback by simply changing the bootloader’s default entry. This “State-Generation” pattern is more flexible and storage-efficient, making it ideal for server and cloud environments.
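The A/B fallback behavior described above reduces to a small state machine. Here is a toy Python version (slot names and the health check are illustrative; real implementations live in the bootloader and the update client):

```python
class ABDevice:
    """Toy A/B ('state-switching') update flow with automatic rollback."""

    def __init__(self):
        self.slots = {"A": "os-v1", "B": None}
        self.active = "A"
        self.fallback = None     # slot to revert to if the next boot fails

    @property
    def inactive(self):
        return "B" if self.active == "A" else "A"

    def stage_update(self, image):
        # Written in the background; the running slot is never touched.
        self.slots[self.inactive] = image

    def reboot_into_new_slot(self):
        self.fallback = self.active
        self.active = self.inactive

    def boot(self, healthy: bool):
        # A failed health check reverts to the previous slot atomically.
        if not healthy and self.fallback:
            self.active = self.fallback
        self.fallback = None
        return self.slots[self.active]

dev = ABDevice()
dev.stage_update("os-v2")
dev.reboot_into_new_slot()
print(dev.boot(healthy=True))    # "os-v2": the update took

dev.stage_update("os-v3")
dev.reboot_into_new_slot()
print(dev.boot(healthy=False))   # "os-v2": broken update, rolled back
```

The transactional model replaces the two fixed slots with an open-ended list of deployments, but the invariant is the same: the previously known-good state is never modified in place.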


Conclusion


The Linux boot process is a rich tapestry of specialized adaptations. From the atomic UKIs on modern servers to the multi-stage ascent of embedded SoCs, each platform has forged a unique path from power-on to init. This diversity is a testament to the kernel’s flexibility. As technology evolves, we see a convergence of ideas: the principles of verifiable boot artifacts and transactional, image-based updates are becoming the new standard across all domains, pointing to a future of Linux systems that are simpler to manage, more resilient to failure, and provably secure from the very first instruction.


The Many Paths to init, Part 4: The Specialists

Beyond PCs and general-purpose embedded systems lie platforms where the Linux boot process has been specialized to an extreme degree. In this installment, we explore three of these unique environments: the security-focused world of Android, the legacy-rich domain of IBM Z mainframes, and the software-defined flexibility of QEMU/KVM virtualization.


The Mobile Ecosystem: The Android Boot Flow


The Android boot process is a masterclass in vertical integration, engineered for security and reliability at a massive consumer scale.

  • The boot.img Artifact: The central component is the boot.img file, a specially formatted binary that packages the kernel, a ramdisk, and a metadata header. The final bootloader stage, the Android Bootloader (ABL), parses this header to load the kernel.


  • Generic Kernel Image (GKI): To combat ecosystem fragmentation, modern Android uses a Generic Kernel Image (GKI). This decouples the core, Google-maintained kernel from device-specific components. The boot partition contains the generic kernel, while a separate vendor_boot partition holds all the device-specific drivers, kernel modules, and the Device Tree Blob (DTB). This architecture allows Google to push core kernel security updates directly, bypassing vendor integration bottlenecks.


  • Android Verified Boot (AVB): AVB establishes an unbroken chain of trust from the hardware Boot ROM to the system partitions. Each stage cryptographically verifies the next, and the device’s security state is communicated to the user with color-coded warning screens (e.g., ORANGE for an unlocked bootloader, RED for a verification failure).
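As a rough sketch of the packaging step, AOSP ships a mkbootimg tool that assembles these pieces into the boot.img format. The flags below are real options; the file names and header version are illustrative:

```shell
mkbootimg \
    --kernel Image.gz \
    --ramdisk ramdisk.cpio.gz \
    --header_version 4 \
    --output boot.img
```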


The Mainframe Environment: Linux on IBM Z


Booting Linux on an IBM Z mainframe follows a unique paradigm shaped by decades of mainframe design principles.

  • Initial Program Load (IPL): The process of “booting” is called an Initial Program Load (IPL). It is not an automatic discovery process but an explicit command issued by an operator through the Hardware Management Console (HMC).


  • The zipl Tool: The primary tool for preparing a boot device is zipl, the boot loader installer shipped in the s390-tools package. It is not an interactive bootloader but a deployment tool run from a live system. It takes the kernel, initramfs, and parameters and writes a boot record onto the target storage device, making it IPL-able.


  • Virtualization Contexts: The process differs depending on the virtualization context. In a hardware-level Logical Partition (LPAR), the IPL is initiated directly by the HMC. When running as a KVM guest, the hypervisor provides a standardized bootloader image (s390-ccw.img) to the guest, bypassing the need for a zipl-prepared disk.
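A sketch of preparing a device for IPL with zipl (the long-form flags are real zipl options; the paths are illustrative):

```shell
# Run from a live Linux system; writes an IPL record onto the target device.
zipl --image    /boot/vmlinuz-6.12 \
     --ramdisk  /boot/initramfs-6.12.img \
     --parmfile /boot/parmfile \
     --target   /boot
```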


The Virtualized Platform: QEMU/KVM Guests


In a virtualized environment like QEMU/KVM, the hardware is software-defined, making the boot process a highly configurable abstraction.

  • Emulated Firmware: QEMU provides virtual firmware for its guests. This can be SeaBIOS, which emulates a traditional legacy BIOS, or OVMF, which provides a full-featured UEFI environment, enabling modern features like Secure Boot within the virtual machine.


  • Direct Kernel Boot: For rapid development and testing, QEMU offers a powerful feature called direct kernel boot. Using command-line options (-kernel, -initrd, -append), a user can instruct QEMU to load a kernel and initramfs directly from the host filesystem, completely bypassing the virtual firmware and any bootloader on the guest’s virtual disk. This is invaluable for kernel developers, allowing them to test a new build in seconds.
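A typical direct-kernel-boot invocation looks like this (the options are standard QEMU flags; the paths are illustrative):

```shell
qemu-system-x86_64 \
    -kernel arch/x86/boot/bzImage \
    -initrd initramfs.cpio.gz \
    -append "console=ttyS0 root=/dev/vda rw" \
    -nographic
```

Because the kernel and initramfs come straight from the host filesystem, a developer can rebuild and reboot without ever touching the guest’s disk image.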

Despite their profound differences, these platforms all face common challenges in securing the boot chain and ensuring system updates are reliable. In our final article, we will explore the cross-platform themes of Secure Boot and atomic updates.


The Many Paths to init, Part 3: The Embedded Frontier

While the PC and server world has evolved towards the simplicity of Unified Kernel Images, the embedded systems domain—dominated by ARM and RISC-V architectures—operates under a completely different set of rules. Here, the boot process is dictated by resource constraints, non-discoverable hardware, and a relentless focus on cost optimization.


The Multi-Stage Ascent from Silicon to RAM


The boot process on a typical System-on-Chip (SoC) is a multi-stage climb, with each loader establishing a more capable environment for the next. This is a direct consequence of the hardware’s initial memory limitations.


  • Stage 0: Boot ROM: Execution begins with immutable code etched directly into the SoC’s silicon. This Boot ROM is the ultimate root of trust. Its job is minimal: perform basic setup and search a predetermined sequence of boot devices (eMMC, SD card, etc.) for the next-stage loader. It loads this next stage into the SoC’s small, on-chip Static RAM (SRAM).


  • Stage 1: Secondary Program Loader (SPL): The on-chip SRAM is often too small (typically tens to a few hundred kilobytes) to hold a full-featured bootloader. Therefore, a tiny intermediate loader, the SPL, is loaded first. The SPL has one critical function: to initialize the main system Dynamic RAM (DRAM) controller.


  • Stage 2: Main Bootloader: Once the much larger off-chip DRAM is available, the SPL loads the full-featured bootloader into it. This is typically Das U-Boot or its modern alternative, Barebox. This environment is far more powerful, providing an interactive shell, filesystem support, and networking capabilities. Its final task is to load the Linux kernel and hand over control.


U-Boot vs. Barebox: A Tale of Two Philosophies


While U-Boot and Barebox serve the same function, they represent different design philosophies.

  • U-Boot is the long-standing industry standard, known for its vast hardware support. Its configuration and scripting model is powerful but idiosyncratic, relying on a set of environment variables stored in non-volatile memory.


  • Barebox, which began as a fork of U-Boot, was created with the explicit goal of adopting a more Linux-like design. It provides a true shell environment where scripts are actual files, incorporates a Linux-style driver model, and even presents hardware resources through a virtual filesystem (e.g., /dev/mem). This makes development more intuitive for those already familiar with the Linux kernel.
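U-Boot’s environment-variable style can be sketched with a typical boot sequence (the commands are real U-Boot shell commands; the device numbers, file names, and the conventional kernel_addr_r/fdt_addr_r load addresses are board-specific):

```shell
setenv bootargs "console=ttyS0,115200 root=/dev/mmcblk0p2 rw"
saveenv                                  # persist to non-volatile storage
load mmc 0:1 ${kernel_addr_r} zImage     # kernel from first MMC partition
load mmc 0:1 ${fdt_addr_r} board.dtb     # device tree blob
bootz ${kernel_addr_r} - ${fdt_addr_r}   # boot; '-' means no initrd
```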


The Device Tree Blob (DTB): Describing the Undiscoverable


Unlike the PC world with its self-enumerating buses like PCI, the hardware peripherals on an SoC (UARTs, I2C controllers, etc.) are at fixed memory addresses and cannot be discovered by the kernel at runtime.

The Device Tree is the solution. It is a data structure, written in a human-readable text file (.dts), that explicitly describes all the hardware on a specific board: what peripherals exist, their memory addresses, their interrupt connections, and other properties. This file is compiled into a compact Device Tree Blob (.dtb). The bootloader loads this .dtb into memory alongside the kernel and passes a pointer to it. The kernel then parses this data to learn what hardware it is running on, allowing a single, generic kernel binary to support a wide variety of boards.
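A fragment of such a .dts file might describe a fixed-address UART like this (the node layout and property names follow standard device tree syntax; the addresses and values here are made up):

```dts
/ {
    soc {
        #address-cells = <1>;
        #size-cells = <1>;

        serial@10000000 {
            compatible = "ns16550a";     /* tells the kernel which driver binds */
            reg = <0x10000000 0x100>;    /* MMIO base address and size */
            interrupts = <10>;
            clock-frequency = <1843200>;
        };
    };
};
```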

From the resource-constrained world of embedded devices, we next turn to even more specialized platforms. In Part 4, we will examine the highly controlled boot flows of Android, IBM Z mainframes, and QEMU/KVM virtual machines.


The Many Paths to init, Part 2: The PC and Server Revolution

In the first part of our series, we established a four-stage framework for understanding any boot process. Now, we apply that model to the modern x86-64 PC and server, a world that has been reshaped by the move from the legacy BIOS to the Unified Extensible Firmware Interface (UEFI). This shift has driven a clear trend towards simpler, more secure, and more atomic boot processes.

From BIOS to a Filesystem-Aware Firmware

The legacy BIOS was a simple piece of firmware. After its Power-On Self-Test (POST), its only job was to read the first 512 bytes of a disk—the Master Boot Record (MBR)—and execute whatever code it found there. This tiny space forced a complex chain of loaders just to get to the point where a bootloader like GRUB could understand a filesystem.


UEFI is fundamentally different. It is a miniature operating system with its own drivers, shell, and, most importantly, the built-in ability to read standardized filesystems like FAT32. This capability led to the creation of the EFI System Partition (ESP), a dedicated FAT-formatted partition that acts as a universal, OS-agnostic hub for boot files. A bootloader is no longer a piece of code in a boot sector; it’s a standard executable file (e.g., grubx64.efi) that the firmware can find and run directly from the ESP.


Bypassing the Bootloader: EFI Stub and Unified Kernel Images (UKIs)


The power of UEFI opens the door to even simpler boot methods that can bypass a traditional bootloader entirely.

  • Direct Kernel Execution (EFI Stub): The Linux kernel can be compiled with a feature called the “EFI stub” (CONFIG_EFI_STUB=y). This embeds a small UEFI-compliant program into the kernel binary itself, allowing the UEFI firmware to execute the kernel directly. Using the efibootmgr tool from a running system, an administrator can create an entry in the firmware’s NVRAM that points directly to the kernel file on the ESP, completely bypassing GRUB. However, this method can be fragile, as some firmware implementations have bugs that prevent them from correctly passing necessary command-line arguments to the kernel.
  • Unified Kernel Images (UKIs): The UKI is the modern solution to these challenges. A UKI is a single, self-contained UEFI application that bundles all necessary boot components—the EFI stub, the Linux kernel, the initramfs, and the kernel command line—into one file. This atomic approach offers three key advantages:
  1. Atomicity: The kernel and its critical dependencies are updated as a single unit.
  2. Robustness: Embedding the command line directly into the file bypasses firmware bugs related to passing arguments.
  3. Security: The entire UKI file can be cryptographically signed. UEFI Secure Boot then verifies this single signature, closing a major security hole where an attacker could modify an unsigned initramfs without being detected.
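Both paths can be sketched from the command line. The flags below are real options of efibootmgr and of systemd’s ukify (shipped with recent systemd releases); the disk, partition, labels, and paths are illustrative:

```shell
# EFI stub: register the kernel itself as a firmware boot entry.
efibootmgr --create --disk /dev/nvme0n1 --part 1 \
    --label "Linux (EFI stub)" \
    --loader '\vmlinuz-6.12' \
    --unicode 'root=/dev/nvme0n1p2 rw initrd=\initramfs-6.12.img'

# UKI: bundle stub + kernel + initramfs + command line into one file,
# which can then be signed and verified as a single unit.
ukify build \
    --linux=/boot/vmlinuz-6.12 \
    --initrd=/boot/initramfs-6.12.img \
    --cmdline="root=/dev/nvme0n1p2 rw" \
    --output=/boot/efi/EFI/Linux/linux-6.12.efi
```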

The nmbl Project: The Bootloader-less Philosophy in Practice


The nmbl (“no more bootloader”) project, championed by Marta Lewandowska, is a practical initiative to make the UKI-based, bootloader-less paradigm the default for mainstream distributions like Fedora. The project argues that traditional bootloaders like GRUB add unnecessary complexity, duplicate functionality already in the kernel (like filesystem drivers), and represent a significant and less-scrutinized attack surface. By replacing GRUB with a directly bootable UKI, nmbl aims to deliver a faster, more secure, and more maintainable boot process that leverages the robust and rapidly evolving Linux kernel as the bootloader itself.

The streamlined, secure, and atomic boot process of the modern PC stands in stark contrast to the resource-constrained world of embedded systems. In our next article, we’ll explore the multi-stage boot process of ARM and RISC-V devices.


The Many Paths to init, Part 1: The Universal Blueprint

The Linux kernel is the most versatile operating system kernel in the world, powering everything from tiny embedded sensors to the world’s largest supercomputers. This adaptability means that the process of “booting Linux” is not a single, uniform sequence. It’s a collection of highly specialized strategies, each tailored to the unique hardware and security constraints of its platform.

This series will explore these divergent paths, from the familiar PC to the specialized worlds of embedded systems, mobile devices, and mainframes. To begin, we must first establish a common language—a conceptual framework that applies to every boot process, regardless of the underlying architecture.

The Four Universal Stages of Booting

At its core, booting is a procedure that takes a system from inert hardware to a fully operational state. This happens across four fundamental stages, each building upon the last.


Stage 1: Firmware Initialization (The First Spark)

The moment a device is powered on, the CPU begins executing code from a hard-coded program stored in non-volatile memory like a ROM chip. This is the system firmware. Its first job is to perform a Power-On Self-Test (POST), initializing and verifying critical hardware like memory controllers. This stage solves the most basic problem: making the system’s main RAM usable for the larger programs that will follow. It concludes when the firmware identifies a bootable device and hands over control to the first piece of software it finds—the bootloader.

Stage 2: Bootloader Execution (The Bridge to the OS)

The bootloader is the crucial intermediary between the firmware and the operating system kernel. While firmware can initialize hardware, it typically doesn’t understand complex filesystems like ext4 or btrfs where the OS resides. The bootloader’s purpose is to bridge this gap. It contains just enough logic to navigate the filesystem, find the kernel image, load it into RAM, and pass it essential configuration data. This stage can be a single program, like GRUB2 on a PC, or a multi-stage chain, as is common in embedded systems.

Stage 3: Kernel Initialization (The Core Takes Control)

Once loaded, the Linux kernel takes charge. It first decompresses itself into memory and begins initializing its own internal subsystems, like the process scheduler and memory management. It then uses its vast array of drivers to initialize all the system’s hardware.

However, the kernel faces its own bootstrapping dilemma. The final root filesystem might be on a device (like an encrypted disk or a network share) that requires special drivers to be mounted. To solve this, the bootloader also loads an initial RAM filesystem (initramfs). The kernel uses this initramfs as a temporary root, which contains the necessary drivers and tools to mount the real root filesystem.
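The hand-off inside the initramfs is usually driven by a small /init script. A minimal sketch (the device path is illustrative; real initramfs generators add module loading, device-mapper setup, and error handling):

```shell
#!/bin/sh
# Temporary root: mount the kernel pseudo-filesystems first.
mount -t proc     none /proc
mount -t sysfs    none /sys
mount -t devtmpfs none /dev

# Mount the real root filesystem, then become it.
mkdir -p /mnt/root
mount -o ro /dev/vda2 /mnt/root
exec switch_root /mnt/root /sbin/init   # never returns on success
```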

Stage 4: Hand-off to init (Welcome to Userspace)

After mounting the real root filesystem, the kernel’s final task is to execute the init program (typically /sbin/init), which is assigned Process ID 1 (PID 1). This marks the critical transition from kernel space to user space. On modern systems, this init process is almost always systemd. It is responsible for starting all the system services, daemons, and graphical interfaces that make up a fully functional Linux environment.

In the next installment, we will apply this universal framework to the platform most of us use every day: the modern PC and server, exploring the revolutionary shift to the UEFI paradigm.
