Edge AI Stack and the Fleet: Yocto, OTA, Federated ML

The edge AI stack on embedded Linux is built on Yocto, with vendor AI stacks (TI Edge AI/TIDL, NXP eIQ, JetPack) shipping as layers on top of the BSP. Keep ONNX as the framework-neutral source of truth and treat the vendor-compiled model as a regenerable build output. A product also needs secure boot, signed OTA updates, and a hardened Yocto LTS baseline, because shipping improved models to the field is a core capability, not an afterthought. Once the fleet is deployed, federated learning — with Flower as the pragmatic default framework — lets models improve across thousands of devices while raw data never leaves them.

Edge AI has moved inference off the cloud and onto the device, and the accelerator silicon is no longer the limiting factor: dedicated NPUs now ship in everything from sub-50 mW microcontrollers to compute modules rated above 2,000 TFLOPS. But an accelerator is inert until a software platform stands it up, and for the overwhelming majority of edge AI devices that platform is embedded Linux. This article covers the edge AI stack from the bottom up: the build system, the runtimes, and the security layer that turns a demo into a product. It then turns to the question that defines the second half of the story: how you keep improving models across a deployed fleet without centralising anyone’s data. For the silicon side of this picture, see our survey of the accelerator landscape in Edge AI in the Next 10 Years: The Silicon Shift.

The edge AI stack on embedded Linux

Yocto as the foundation

The industry has converged on Yocto as the base for edge AI reference platforms. What has changed recently is that silicon vendors now ship complete AI stacks on top of their Yocto BSPs, and those stacks are usually the fastest path from a trained model to a deployed one. Texas Instruments is a representative example: its Edge AI stack — Edge AI Studio, the TIDL tools, and a pre-trained Model Zoo — runs on Yocto-based reference images and compiles trained models directly to the accelerators on the SoC. The same pattern holds across vendors: NXP’s meta-imx carries the eIQ machine-learning components, the community meta-tegra layer packages JetPack for Jetson modules, and Hailo publishes meta-hailo.

The workflow that has become standard is: develop and run the program on a PC first, port the same program to the embedded target, then run it there with acceleration engaged. This loop decouples model development from hardware bring-up and lets both proceed in parallel. Your job as the embedded engineer is to make the porting step uneventful — a clean, reproducible Yocto layer where the accelerator driver, firmware, and runtime libraries are pinned together and simply work.
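
One way to keep the port uneventful is to check, at image build time or on first boot, that what is installed matches the pinned set. The following is a minimal sketch under stated assumptions: the component names and version strings are hypothetical and do not come from any vendor BSP.

```python
# Hypothetical pinned set for one BSP release; names and versions are illustrative.
PINNED = {
    "npu-kernel-driver": "1.4.2",
    "npu-firmware": "1.4.2",
    "npu-runtime": "1.4.2",
}


def find_drift(pinned, installed):
    """Return components whose installed version differs from the pinned one.

    The result maps component name -> (pinned_version, installed_version),
    with installed_version set to None when the component is missing.
    """
    drift = {}
    for name, want in pinned.items():
        have = installed.get(name)
        if have != want:
            drift[name] = (want, have)
    return drift


if __name__ == "__main__":
    # In practice this would be read from the image manifest or package database.
    installed = {"npu-kernel-driver": "1.4.2", "npu-firmware": "1.3.9"}
    for name, (want, have) in find_drift(PINNED, installed).items():
        print(f"{name}: pinned {want}, found {have}")
```

Failing the build when `find_drift` returns anything is one simple way to stop a half-upgraded stack from reaching a device.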

Runtimes and model formats

Between the model and the metal sits the runtime, and this is where portability is won or lost. ONNX has become the practical interchange format: train in PyTorch or TensorFlow, export to ONNX, and hand that to the vendor compiler. TensorFlow Lite — renamed LiteRT by Google in September 2024 — remains the workhorse at the microcontroller and mobile end of the spectrum. Vendor compilers such as TIDL, RKNN-Toolkit2, and TensorRT take the neutral model and produce something the on-chip accelerator executes efficiently; this is also where INT8/INT4 quantization is applied and validated, because a model trained in FP32 must shrink to run within an edge NPU’s compute and memory budget.
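
To make the quantization step concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only: vendor compilers choose their own schemes (per-channel scales, zero points, calibration data), and the weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.02, -0.75, 0.31, 1.27, -1.0]  # made-up FP32 weights
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("codes:", codes)
    print("scale:", scale, "worst-case error:", worst)
```

The rounding error per weight is bounded by half the scale, which is why validating accuracy after quantization, rather than assuming it, belongs in the same pipeline step.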

Two practices keep this manageable. First, keep the framework-neutral ONNX model as your source of truth, and treat the vendor-specific compiled artifact as a build output your CI can regenerate — not a hand-tuned blob no one dares touch. The compiled artifact is usually tied to the exact toolkit version that produced it, so driver, firmware, runtime, and toolkit must be upgraded as a set. Second, profile what actually runs on the NPU: operations the accelerator cannot handle fall back silently to the CPU, so a model can be “accelerated” and still spend most of its time on the Cortex-A cores.
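
Both practices can be sketched in a few lines. The first function fingerprints everything a compiled artifact depends on, so CI can tell when it is stale. The second summarises how much inference time ran on the CPU. The function names, the `target` parameter, and the per-layer profile format are assumptions for illustration; real vendor profilers report this in their own formats.

```python
import hashlib


def artifact_key(onnx_bytes, toolkit_version, target):
    """Fingerprint of the inputs that determine the compiled artifact.

    If the key stored next to the artifact differs from the key computed in CI,
    the artifact is stale and should be recompiled from the ONNX model.
    """
    h = hashlib.sha256()
    h.update(onnx_bytes)
    h.update(b"\0" + toolkit_version.encode() + b"\0" + target.encode())
    return h.hexdigest()


def cpu_fallback_share(layers):
    """Fraction of total inference time spent in layers that ran on the CPU.

    `layers` is a list of (name, device, milliseconds) tuples, as might be
    parsed from a per-layer profiler report (format assumed here).
    """
    total = sum(ms for _, _, ms in layers)
    on_cpu = sum(ms for _, device, ms in layers if device == "cpu")
    return on_cpu / total if total else 0.0


if __name__ == "__main__":
    # Made-up profile: one unsupported operator dominates the run time.
    profile = [("conv1", "npu", 1.2), ("resize", "cpu", 6.5), ("conv2", "npu", 1.3)]
    print(f"CPU share of inference time: {cpu_fallback_share(profile):.0%}")
```

A CI gate on the CPU share, with a threshold chosen per model, turns silent fallback into a visible build failure.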

The kernel is catching up: mainline NPU drivers

One development worth tracking for long-lived products: NPU drivers are reaching mainline. Kernel 6.2 created the drivers/accel subsystem for compute accelerators, whose devices appear as nodes such as /dev/accel/accel0. Intel’s ivpu driver arrived in 6.3, and, most relevant for embedded boards, the rocket driver for the Rockchip NPU in SoCs such as the RK3588 was merged in 2025. The etnaviv driver, which lives in the DRM subsystem rather than drivers/accel, exposes the VeriSilicon NPU in the NXP i.MX 8M Plus. On top of these, Mesa’s Teflon delegate gives TensorFlow Lite a fully open path from runtime to silicon. On a board with a drivers/accel NPU driver loaded, the device node is visible directly:

raghu@techveda.org:~$ ls /dev/accel/
accel0
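
The same check can be scripted, for example in a board bring-up test. This is a minimal sketch; the function name and the `dev_root` parameter (there so the check can run against a test directory) are assumptions.

```python
from pathlib import Path


def list_accel_nodes(dev_root="/dev"):
    """Return sorted names of accel device nodes (accel0, accel1, ...) under dev_root/accel."""
    accel_dir = Path(dev_root) / "accel"
    if not accel_dir.is_dir():
        return []  # no drivers/accel device registered, or an older kernel
    return sorted(p.name for p in accel_dir.iterdir() if p.name.startswith("accel"))


if __name__ == "__main__":
    nodes = list_accel_nodes()
    print(nodes if nodes else "no /dev/accel nodes found")
```

An empty result does not prove the board has no NPU: drivers outside drivers/accel, including vendor kernel drivers, expose their devices elsewhere.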

Most production BSPs still use the vendor kernel driver today. But a device that must take kernel updates for years is far easier to maintain on an upstream driver than on a frozen vendor branch, so when you select silicon, check the state of its mainline NPU support the way you check mainline support for its display or Ethernet.

Security-hardened distros and OTA

Here is the part that separates a demo from a product. A field-deployed edge AI device is a networked computer running valuable models on potentially sensitive data, which makes it a target. Hardened Yocto distributions built for this have emerged — Clea OS, a Yocto LTS-based foundation for industrial edge devices, bakes in secure boot, signed OTA updates, and unified access control as defaults. The specific product matters less than the pattern. For any serious deployment you should be able to answer three questions:

Secure boot. Can an attacker replace your firmware — or your model? A verified boot chain says no.
Signed OTA updates. When you push an improved model to 50,000 devices, how do you guarantee only your signed artifact runs? The ability to update models safely in the field is a core capability of an edge AI product, because your models will improve and you will need to ship them.
Access control and update integrity. Who can talk to the device, and how do you prove an update was not tampered with in transit?
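
The device-side check behind the second and third questions can be sketched as follows. One loud assumption: to stay within the Python standard library, this sketch authenticates the manifest with an HMAC, which is a symmetric stand-in. A production updater would instead verify an asymmetric signature against a public key anchored in the verified boot chain, so that no signing secret lives on the device. The manifest fields and function names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(manifest, key):
    """Build-server side: authenticate the canonical JSON form of the manifest.

    HMAC is a stand-in here; real OTA systems use asymmetric signatures.
    """
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_update(manifest, tag, artifact, key):
    """Device side: accept the artifact only if the manifest is authentic
    and the artifact matches the digest recorded in it."""
    if not hmac.compare_digest(sign_manifest(manifest, key), tag):
        return False  # manifest was altered or not produced by the key holder
    digest = hashlib.sha256(artifact).hexdigest()
    return hmac.compare_digest(digest, manifest.get("sha256", ""))


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    model = b"compiled model bytes"
    manifest = {"name": "detector", "version": 7,
                "sha256": hashlib.sha256(model).hexdigest()}
    tag = sign_manifest(manifest, key)
    print("genuine update accepted:", verify_update(manifest, tag, model, key))
    print("tampered payload accepted:", verify_update(manifest, tag, model + b"x", key))
```

Binding the model digest into an authenticated manifest means the payload can travel over any transport: tampering in transit changes the digest and the update is rejected.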

Regulation is moving in the same direction. The EU Cyber Resilience Act, to take the clearest example, phases in from September 2026, when its vulnerability-reporting obligations start to apply, with the remaining requirements following in December 2027. It makes vulnerability handling and secure updates a legal obligation for connected products sold in the EU. Building on a hardened Yocto LTS baseline now avoids a costly retrofit later.

Federated learning at the edge

You have shipped the fleet. Now: how do the models get better over time? The old answer was to collect real-world data from every device, centralise it, retrain in the cloud, and push an update. But the whole premise of edge AI — and the direction of privacy regulation — is that the data should not leave the device. Federated learning (FL) resolves that tension by moving the model to the data instead: a central server coordinates training, clients train locally on their private data, and only model updates travel — never raw data.
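
The server-side step can be sketched with federated averaging (FedAvg), a common aggregation rule. It is used here as an assumption, since the article does not prescribe a particular rule. Model weights are shown as flat lists of floats for simplicity.

```python
def federated_average(updates):
    """Weighted average of client weight vectors.

    `updates` is a list of (weights, num_examples) pairs; each client's
    contribution is proportional to how much local data it trained on.
    """
    total = sum(n for _, n in updates)
    if total == 0:
        raise ValueError("no training examples reported")
    size = len(updates[0][0])
    merged = [0.0] * size
    for weights, n in updates:
        if len(weights) != size:
            raise ValueError("clients reported different model shapes")
        for i, w in enumerate(weights):
            merged[i] += w * (n / total)
    return merged


if __name__ == "__main__":
    # Two hypothetical clients: the second trained on three times as much data.
    round_updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
    print(federated_average(round_updates))
```

Only the weight vectors and example counts reach the server; the training data that produced them is never part of the message.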

For an embedded fleet the wins are concrete. Privacy: raw sensor data, camera frames, and user data stay on the device. Bandwidth: model deltas are far smaller than the raw data that produced them, a decisive advantage on constrained or metered edge links. Model quality: the model learns from the full diversity of real-world conditions across the fleet, including cases no lab dataset would contain.
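
The bandwidth argument is easy to check with back-of-envelope arithmetic. The figures below are hypothetical (a 2-million-parameter FP32 model and a 1 fps VGA camera) and only illustrate the order of magnitude.

```python
def model_update_bytes(num_params, bytes_per_param=4):
    """Size of one full set of model weights sent per round (uncompressed)."""
    return num_params * bytes_per_param


def raw_video_bytes(width, height, channels, fps, seconds):
    """Size of uncompressed camera frames captured over a period."""
    return width * height * channels * fps * seconds


if __name__ == "__main__":
    update = model_update_bytes(2_000_000)       # hypothetical model, FP32 weights
    raw = raw_video_bytes(640, 480, 3, 1, 3600)  # hypothetical camera, one hour
    print(f"update: {update / 1e6:.1f} MB")
    print(f"raw frames: {raw / 1e6:.1f} MB")
    print(f"ratio: {raw / update:.0f}x")
```

With these made-up numbers, one round's upload is 8 MB against roughly 3.3 GB of raw frames for the hour. Compressed video would narrow the gap, and quantized or sparse updates would widen it again.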

Flower vs TensorFlow Federated

Two frameworks dominate the practical conversation. Flower has emerged as the de-facto standard for production federated learning. It is ML-framework-agnostic — PyTorch, TensorFlow, JAX, scikit-learn — and lets you port an existing training pipeline to a federated setup with essentially no changes to model code. It was designed for real-world scale: the Flower research paper demonstrates experiments with client cohorts up to 15 million, and the framework supports GPU execution and containerised deployment on real edge devices. Its pull-based connection model, where clients initiate connections to the server, is a practical advantage for devices behind NAT and firewalls. TensorFlow Federated (TFF) takes a different posture: its notable advantage is built-in differential privacy — a formal mathematical privacy guarantee on top of the “data stays local” property — but it has historically been oriented toward local simulation and TensorFlow-centric models.

Criterion	Flower	TensorFlow Federated
ML framework	Agnostic (PyTorch, TF, JAX, etc.)	TensorFlow-centric
Real-device deployment	Designed for it; GPU + containers	Historically simulation-first
Differential privacy	Via third-party libraries	Built-in
Client connectivity	Pull-based (NAT/firewall friendly)	—

In practice: for a heterogeneous fleet of real Linux devices where you want framework freedom, Flower is the pragmatic default; add a differential-privacy library when your threat model demands formal guarantees.
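
For orientation, the core mechanism that differential-privacy libraries apply to each client update is to clip its norm and then add calibrated noise. The sketch below shows only that mechanism, with hypothetical function and parameter names. It does not provide a formal guarantee by itself, because that also requires the privacy accounting a DP library performs.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    factor = max_norm / norm
    return [x * factor for x in update]


def privatize(update, max_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise with std = noise_multiplier * max_norm."""
    sigma = noise_multiplier * max_norm
    return [x + rng.gauss(0.0, sigma) for x in clip_update(update, max_norm)]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(privatize([3.0, 4.0], max_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Clipping bounds how much any single device can move the global model, and the noise masks what remains. Both cost accuracy, which is why the text frames this as a decision driven by the threat model.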

Scaling to production fleets

Getting FL working in simulation is quick. Running it across thousands of real, intermittently connected, resource-constrained devices is the hard part, and this is where the ecosystem has matured most recently. Flower’s architecture separates long-running infrastructure processes — SuperLink on the server side, SuperNode on the client side — which handle all network communication so your code contains only the ML logic. Beyond that, the Flower integration with Open Cluster Management enables declarative deployment of federated infrastructure across multi-cluster Kubernetes environments, from edge devices to cloud regions. The signal for planning is that federated learning has crossed from research technique to enterprise-grade, declaratively deployable infrastructure. If your plans include continuously improving models across a large fleet, FL is no longer a bet on immature tooling.
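
Intermittent connectivity mostly shows up as two decisions per round: which of the currently connected devices to sample, and whether enough of them reported back for the round to count. Here is a minimal sketch of that logic. The function and parameter names are hypothetical and are not Flower's API; production frameworks expose their own settings for this.

```python
import random


def sample_clients(online, fraction, min_clients, rng):
    """Pick the cohort for one round from the devices currently connected.

    Returns [] when fewer than min_clients are online, so the caller can
    skip the round instead of training on too small a cohort.
    """
    if len(online) < min_clients:
        return []
    k = max(min_clients, int(len(online) * fraction))
    return rng.sample(sorted(online), min(k, len(online)))


def round_is_valid(cohort, reported, min_results):
    """A round counts only if enough sampled clients returned an update."""
    return len(set(cohort) & set(reported)) >= min_results


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    online = {"dev-01", "dev-02", "dev-03", "dev-04", "dev-05", "dev-06"}
    cohort = sample_clients(online, fraction=0.5, min_clients=2, rng=rng)
    reported = cohort[:1]  # pretend only one device finished before the deadline
    print("cohort:", cohort)
    print("round valid:", round_is_valid(cohort, reported, min_results=2))
```

Discarding an under-populated round costs one round of progress, which is usually a better trade than letting a handful of devices dominate the global model.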

Where this leaves us

Taken together, this is a complete on-device picture: a hardened Yocto platform, a framework-neutral model runtime with vendor acceleration, secure OTA to update models in the field, and federated learning to improve those models across the fleet without centralising data. That is a deployable edge AI product architecture, buildable today — and building it is platform engineering from top to bottom.

Two forces on the horizon will reshape this picture over the coming decade — networks becoming AI-native with 6G, and neuromorphic silicon promising large efficiency gains for always-on workloads — with a wave of regulation layered over both. We will take those longer-horizon shifts up separately; none of them changes what to build now.

Key takeaways

The edge AI stack on embedded Linux is Yocto plus a vendor AI layer; make the PC-to-target port uneventful with a reproducible layer where driver, firmware, and runtime are pinned together.
Keep ONNX as the framework-neutral source of truth and treat the vendor-compiled model as a build output your CI regenerates; profile for silent CPU fallback.
Mainline NPU support is arriving (drivers/accel, the Rockchip rocket driver, etnaviv on i.MX 8M Plus, Mesa’s Teflon delegate) — weigh it in silicon selection for long-lived products.
Secure boot, signed OTA, and a hardened Yocto LTS baseline are product requirements: shipping improved models to the field is a core capability, and regulation is making it an obligation.
Federated learning lets a fleet improve without centralising data; Flower is the pragmatic default for real Linux devices, TFF where built-in differential privacy matters.

Frequently asked questions

What is federated learning?
A training approach where a central server coordinates many devices that each train on their own local data. Only model updates travel to the server for aggregation; raw data never leaves the device. It preserves privacy, saves bandwidth, and lets the model learn from real-world conditions across the whole fleet.

Why keep ONNX as the source of truth instead of the vendor format?
The vendor-compiled artifact is tied to one accelerator and usually to one toolkit version. The ONNX model is portable across vendors and training frameworks, so keeping it canonical — and regenerating the compiled artifact in CI — prevents lock-in and makes silicon changes survivable.

Can I run an NPU with only mainline components?
On some hardware, yes. The i.MX 8M Plus (via the etnaviv driver) and Rockchip NPUs such as the RK3588’s (via the rocket driver merged in 2025) can run TensorFlow Lite models through Mesa’s Teflon delegate with no vendor blobs. Coverage and performance still trail the vendor stacks for some models, so evaluate against your workload.

Why do OTA updates matter so much for edge AI specifically?
Because the model is part of the product and it will keep improving. A device that cannot receive signed updates in the field cannot benefit from retraining — federated or otherwise — and cannot meet the security and regulatory obligations now arriving for connected products.

Related reading

Edge AI in the Next 10 Years: The Silicon Shift