DMA Mapping API: Coherent vs Streaming Memory

If you write drivers for embedded Linux, the DMA mapping API is one of the interfaces you cannot avoid for long. The moment a device moves data into or out of memory on its own, without the CPU copying each byte, your driver has to tell the kernel how that memory should be prepared. Get it wrong and the symptoms are some of the hardest to debug in kernel work: data that is correct on a desktop x86 board but corrupted on an ARM target, or a buffer that reads back stale values only under load. This Deep Dive comes in two parts. First it covers the concepts and rules every driver author needs: coherent versus streaming mappings, the DMA mask, directions, and syncing. Then it goes one layer down and traces the kernel source that implements them, from the dispatch in kernel/dma/mapping.c to the arm64 cache hooks and the CONFIG_DMA_API_DEBUG facility. The source shown is from Linux 7.1, whose series reworked the DMA core to be physical-address based.

Three kinds of addresses

The first source of confusion is that DMA involves three different address spaces, and they are not interchangeable. The kernel works with virtual addresses, the kind returned by kmalloc() and stored in a void *. The memory management unit translates those into CPU physical addresses, the values you see in /proc/iomem. A device, however, sees a third kind of address called a bus address. On simple systems the bus address equals the physical address, but when an IOMMU or a host bridge sits between the device and memory, the two diverge.

This matters because a device performing DMA uses bus addresses, and it has no access to the CPU’s virtual memory system. You cannot hand a device a pointer from kmalloc() and expect it to work. The job of the DMA mapping API is to take a buffer the CPU can see and return a dma_addr_t value the device can use, setting up any IOMMU translation along the way. Every driver that touches DMA must include the header that defines this type:

#include <linux/dma-mapping.h>

Why the DMA mapping API exists

Beyond address translation, the API solves a second problem: cache coherency. Many embedded SoCs have CPU caches that are not kept coherent with DMA traffic. If the CPU writes a buffer, the data may still be sitting in the cache when the device reads main memory, so the device sees old contents. In the other direction, the device writes main memory while the CPU still holds a cached copy, so the CPU reads stale data. The DMA mapping API is the single place where the kernel inserts the cache maintenance operations needed to handle this, in an architecture-independent way. On a fully coherent platform those operations compile down to almost nothing; on a non-coherent ARM board they become real cache flushes and invalidations. Your driver code stays the same either way.

Tell the kernel your addressing limits

Before mapping anything, a driver must declare how many address bits the device can drive. By default the kernel assumes 32-bit DMA addressing. You change that with a single call that covers both the streaming and coherent interfaces:

if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64))) {
        dev_warn(dev, "No suitable 64-bit DMA available\n");
        /* fall back or refuse to probe */
}

The kernel saves this mask and uses it later when it allocates DMA addresses, so it never hands the device an address it cannot reach. Note that dma_set_mask_and_coherent() will not fail for masks of 32 bits or larger, so the common pattern is to set 64 bits when the device supports it and 32 bits otherwise, rather than trying a 64-bit call and falling back to 32. If the device has different limits for descriptors and for data, you can set the streaming and coherent masks separately with dma_set_mask() and dma_set_coherent_mask().
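To make the mask concrete, here is a small userspace model, not kernel code. dma_bit_mask() reproduces the value the kernel's DMA_BIT_MASK() macro yields for a given bit count. device_can_reach() is a hypothetical helper invented for this sketch; it only illustrates the kind of range check a mask makes possible, and the real check in the kernel considers more than the mask alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of DMA_BIT_MASK(n): the low n bits set.
 * n == 64 is special-cased because shifting a 64-bit value by 64
 * is undefined behaviour in C. */
uint64_t dma_bit_mask(unsigned int n)
{
        assert(n <= 64);
        return (n == 64) ? ~0ULL : ((1ULL << n) - 1);
}

/* Hypothetical helper: can a device limited to 'mask' address every
 * byte of the range [addr, addr + size)? */
bool device_can_reach(uint64_t addr, size_t size, uint64_t mask)
{
        if (size == 0)
                return false;
        if (addr > UINT64_MAX - (size - 1))     /* range would wrap */
                return false;
        return addr + (size - 1) <= mask;
}
```

With a 32-bit mask, a 4 KiB buffer that ends exactly at the 4 GiB boundary is reachable, while the same buffer placed at 0x100000000 is not. With a 64-bit mask both are.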

Coherent mappings: allocate once, keep for the device’s lifetime

A coherent mapping is memory for which a write by either the CPU or the device is immediately visible to the other, with no explicit flushing in your driver. Think of it as synchronous. You allocate it once, usually at probe time, and free it at removal. The classic uses are control structures the device polls continuously: network card ring descriptors, command mailboxes, or firmware microcode run out of main memory.

dma_addr_t dma_handle;
void *cpu_addr;

cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!cpu_addr)
        return -ENOMEM;

The call returns two things: a CPU virtual address you use to read and write the buffer, and a dma_handle of type dma_addr_t that you program into the device. When you are done, release both with the matching free call:

dma_free_coherent(dev, size, cpu_addr, dma_handle);

One subtlety that surprises people: coherent does not mean the CPU stops reordering writes. If the device must see word zero of a descriptor updated before word one, you still need a write memory barrier between the two stores:

desc->word0 = address;
wmb();
desc->word1 = DESC_VALID;

For many small allocations, carving them out of a single page is wasteful. The dma_pool interface acts like a kmem_cache built on top of dma_alloc_coherent(), and it understands alignment and boundary constraints that hardware queues often require.

Streaming mappings: map for one transfer, then unmap

A streaming mapping is for memory the CPU already owns, which you want to hand to the device for a single transfer and then take back. Think of it as asynchronous, outside the coherency domain. Network packets being transmitted or received, and filesystem buffers, are the standard examples. You map a buffer just before the transfer and unmap it as soon as the device signals completion:

dma_addr_t dma_handle;

dma_handle = dma_map_single(dev, addr, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, dma_handle))
        goto map_error;

/* program dma_handle into the device, start the transfer */

dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE);

The direction argument is not optional decoration. DMA_TO_DEVICE means memory to device, DMA_FROM_DEVICE means device to memory, and DMA_BIDIRECTIONAL covers both at a possible performance cost. The kernel uses the direction to decide which cache operations to perform, so specify it as precisely as you can. DMA_NONE exists only as a debugging placeholder.

Two rules are easy to miss. First, always check dma_mapping_error() on the returned address; mapping can fail when DMA address space is exhausted or an IOMMU mapping cannot be created, and using an unchecked address can lead to silent corruption. Second, never use the CPU buffer while it is mapped for the device. The buffer belongs to the device between map and unmap. The same rules apply to dma_map_page(), which takes a page and offset instead of a CPU pointer so it can map high memory (HIGHMEM), and to dma_map_sg() for scatter-gather lists.

Synchronising a buffer you reuse

Sometimes you need the CPU to look at a streaming buffer between transfers without fully unmapping it. That is what the sync calls are for. Before the CPU reads a buffer the device just wrote, give ownership back to the CPU; before handing it to the device again, return ownership to the device:

dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
/* CPU may now safely read the buffer */

dma_sync_single_for_device(dev, dma_handle, size, DMA_FROM_DEVICE);
/* device may now use the buffer again */

If you never touch the data between dma_map_*() and dma_unmap_*(), you do not need the sync calls at all. They exist precisely for the reuse case, and skipping them on a non-coherent platform is a frequent cause of intermittent corruption.
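The stale-data problem is easier to see in a toy model. The sketch below is plain userspace C and every name in it is invented for illustration: ram stands for what the device sees, cache for the CPU's private copy, and the two toy_sync_* functions stand in for the effect of the sync calls on a non-coherent platform. Real caches work per line and are far more involved; this only captures the ownership idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BUF_SIZE 16

struct toy_buf {
        uint8_t ram[TOY_BUF_SIZE];      /* what the device reads and writes */
        uint8_t cache[TOY_BUF_SIZE];    /* the CPU's private copy */
        bool cached;                    /* CPU currently holds a copy */
        bool dirty;                     /* the copy is newer than ram */
};

void toy_init(struct toy_buf *b)
{
        memset(b, 0, sizeof(*b));
}

/* CPU accesses go through the cache, filling it from ram on a miss. */
static void toy_fill(struct toy_buf *b)
{
        if (!b->cached) {
                memcpy(b->cache, b->ram, TOY_BUF_SIZE);
                b->cached = true;
                b->dirty = false;
        }
}

uint8_t toy_cpu_read(struct toy_buf *b, size_t i)
{
        assert(i < TOY_BUF_SIZE);
        toy_fill(b);
        return b->cache[i];
}

void toy_cpu_write(struct toy_buf *b, size_t i, uint8_t v)
{
        assert(i < TOY_BUF_SIZE);
        toy_fill(b);
        b->cache[i] = v;
        b->dirty = true;
}

/* Device accesses go straight to ram and never see the cache. */
uint8_t toy_dev_read(const struct toy_buf *b, size_t i)
{
        assert(i < TOY_BUF_SIZE);
        return b->ram[i];
}

void toy_dev_write(struct toy_buf *b, size_t i, uint8_t v)
{
        assert(i < TOY_BUF_SIZE);
        b->ram[i] = v;
}

/* Stand-in for sync-for-device: write dirty CPU data back to ram. */
void toy_sync_for_device(struct toy_buf *b)
{
        if (b->cached && b->dirty) {
                memcpy(b->ram, b->cache, TOY_BUF_SIZE);
                b->dirty = false;
        }
}

/* Stand-in for sync-for-cpu: drop the copy so the next read refetches. */
void toy_sync_for_cpu(struct toy_buf *b)
{
        b->cached = false;
        b->dirty = false;
}
```

In this model, if the CPU reads byte 0, the device then writes 0xAB to it, and the CPU reads again without toy_sync_for_cpu(), the second read still returns the old value. That is the intermittent corruption described above, reduced to its smallest form.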

Alignment and cache lines

One rule deserves special attention on embedded targets. You may DMA to memory from kmalloc() or the page allocator, but not to vmalloc() memory, the kernel stack, or static (data, text, bss) addresses. On a CPU with DMA-incoherent caches, a DMA buffer must also not share a cache line with other data; otherwise a CPU write to one word and a DMA write to a neighbouring word in the same line can overwrite each other. Architectures set ARCH_DMA_MINALIGN so that kmalloc() buffers are aligned safely, but if you embed a DMA buffer inside a larger structure next to fields the CPU writes, you are responsible for keeping them on separate cache lines.
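The embedded-buffer case can be checked with offsetof(). The following is a userspace sketch under stated assumptions: the 64-byte line size is a placeholder (a driver would rely on ARCH_DMA_MINALIGN, not a hard-coded number), both structures and the shares_cache_line() helper are hypothetical, and offsets are taken relative to a structure that itself starts on a line boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size for this sketch only. */
#define TOY_CACHE_LINE 64

/* Risky layout: the DMA buffer sits in the same line as CPU-written fields. */
struct risky_dev_state {
        uint32_t irq_count;             /* written by the CPU */
        uint8_t  rx_buf[32];            /* written by the device */
        uint32_t flags;                 /* written by the CPU */
};

/* Safer layout: the buffer starts on a line boundary, fills whole
 * lines, and the next field starts on a fresh line. */
struct padded_dev_state {
        uint32_t irq_count;
        _Alignas(TOY_CACHE_LINE) uint8_t rx_buf[TOY_CACHE_LINE];
        _Alignas(TOY_CACHE_LINE) uint32_t flags;
};

/* True if the byte ranges [off_a, off_a + len_a) and
 * [off_b, off_b + len_b) touch at least one common cache line. */
bool shares_cache_line(size_t off_a, size_t len_a,
                       size_t off_b, size_t len_b)
{
        assert(len_a > 0 && len_b > 0);

        size_t first_a = off_a / TOY_CACHE_LINE;
        size_t last_a = (off_a + len_a - 1) / TOY_CACHE_LINE;
        size_t first_b = off_b / TOY_CACHE_LINE;
        size_t last_b = (off_b + len_b - 1) / TOY_CACHE_LINE;

        return first_a <= last_b && first_b <= last_a;
}
```

Applied to risky_dev_state, the helper reports that irq_count and rx_buf share a line. Applied to padded_dev_state, neither neighbour shares a line with rx_buf, at the cost of padding the structure out to three lines.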

Inside the DMA mapping API: three back ends

Everything above is the contract your driver works to, and it is stable across kernel versions. The implementation underneath is not. The 7.x series reworked the DMA core to be physical-address based: the internal entry point that performs the dispatch is now dma_map_phys() in kernel/dma/mapping.c, the dma_map_ops operation map_page was renamed map_phys, and dma_direct_map_page() became dma_direct_map_phys(). The dma_map_single() and dma_map_page() you call are unchanged; they convert your buffer to a physical address and feed dma_map_phys() underneath. Trimmed to the decision that matters, the dispatch looks like this:

dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
                enum dma_data_direction dir, unsigned long attrs)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);
        dma_addr_t addr = DMA_MAPPING_ERROR;

        if (dma_map_direct(dev, ops))
                addr = dma_direct_map_phys(dev, phys, size, dir, attrs, true);
        else if (use_dma_iommu(dev))
                addr = iommu_dma_map_phys(dev, phys, size, dir, attrs);
        else if (ops->map_phys)
                addr = ops->map_phys(dev, phys, size, dir, attrs);

        debug_dma_map_phys(dev, phys, size, dir, addr, attrs);
        return addr;
}

There are still exactly three paths. The dma-direct path (dma_direct_map_phys) is the common one on most modern arm64 and x86 systems. The IOMMU path (iommu_dma_map_phys) runs when an IOMMU is managing the device. The legacy ops path (ops->map_phys) is for buses that install their own struct dma_map_ops. The selector is dma_map_direct(), which calls a small helper:

static bool dma_go_direct(struct device *dev, dma_addr_t mask,
                const struct dma_map_ops *ops)
{
        if (use_dma_iommu(dev))
                return false;
        if (likely(!ops))
                return true;
        /* CONFIG_DMA_OPS_BYPASS mask check omitted */
        return false;
}

The key line is if (likely(!ops)) return true;. When a device has no custom DMA ops, the kernel takes the direct path. And whether a device has ops is decided by get_dma_ops():

static inline const struct dma_map_ops *get_dma_ops(struct device *dev)
{
        if (dev->dma_ops)
                return dev->dma_ops;
        return get_arch_dma_ops();
}

On architectures built without CONFIG_ARCH_HAS_DMA_OPS (which includes today’s arm64 and x86), this returns NULL. A NULL ops pointer is precisely what makes dma_go_direct() return true. So on a typical embedded arm64 board with no IOMMU in the path, every mapping you make goes straight through the dma-direct layer. That is the code worth understanding well.

The dma-direct fast path

The dma-direct implementation lives in kernel/dma/direct.h and kernel/dma/direct.c. The single-buffer map is a static inline in the header, and in 7.x it takes a physical address directly rather than a page and offset. Trimmed to the normal-memory path (the source also handles MMIO and confidential-computing buffers), it is short and revealing:

static inline dma_addr_t dma_direct_map_phys(struct device *dev,
                phys_addr_t phys, size_t size, enum dma_data_direction dir,
                unsigned long attrs, bool flush)
{
        dma_addr_t dma_addr = phys_to_dma(dev, phys);

        if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
                if (is_swiotlb_active(dev))
                        return swiotlb_map(dev, phys, size, dir, attrs);
                return DMA_MAPPING_ERROR;
        }

        if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
                arch_sync_dma_for_device(phys, size, dir);
                if (flush)
                        arch_sync_dma_flush();
        }
        return dma_addr;
}

Read it from the top. The page-to-physical conversion that older kernels did here is gone; the caller already passes a phys_addr_t. The function turns it into a bus address with phys_to_dma(), then checks dma_capable(): can the device, given its DMA mask, reach this address? If not, and a software IOMMU is available, it bounces the transfer through swiotlb_map(); otherwise it returns DMA_MAPPING_ERROR, the value dma_mapping_error() tests for. This closes the loop on the mask you set in part one: the mask is the input to dma_capable(), and an honest mask is what triggers bouncing instead of silent corruption when a 32-bit device is handed a high buffer.

The last lines are the cache story. If the device is not cache-coherent and the caller did not set DMA_ATTR_SKIP_CPU_SYNC, the code calls arch_sync_dma_for_device(), then, when flush is set, arch_sync_dma_flush(). That second call is new in the 7.x series: the cache maintenance and its memory barrier were split apart, so a batch of mappings can issue one barrier at the end instead of one per buffer. On a coherent platform dev_is_dma_coherent(dev) is true and nothing happens. That single branch is the difference between a desktop x86 board where DMA “just works” and an embedded arm64 target where forgetting a sync corrupts data.

Where cache coherency actually happens

arch_sync_dma_for_device() and its counterpart arch_sync_dma_for_cpu() are per-architecture hooks, enabled by CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE. On arm64 they live in arch/arm64/mm/dma-mapping.c and are remarkably direct:

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
                enum dma_data_direction dir)
{
        unsigned long start = (unsigned long)phys_to_virt(paddr);

        dcache_clean_poc_nosync(start, start + size);
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
                enum dma_data_direction dir)
{
        unsigned long start = (unsigned long)phys_to_virt(paddr);

        if (dir == DMA_TO_DEVICE)
                return;
        dcache_inval_poc_nosync(start, start + size);
}

This is the concrete meaning of the DMA direction argument from part one. Before the device reads memory the CPU wrote (the for-device direction), arm64 cleans the data cache to the Point of Coherency with dcache_clean_poc_nosync(), pushing any dirty cache lines out to where the device will read them. After the device writes memory the CPU will read (the for-cpu direction), arm64 invalidates the cache with dcache_inval_poc_nosync() so stale cached copies are dropped and the CPU re-reads from RAM. The _nosync suffix is the 7.x change: the barrier that used to follow each cache operation is deferred to the single arch_sync_dma_flush() call we saw a moment ago, which the generic layer issues once per batch. The if (dir == DMA_TO_DEVICE) return; in the for-cpu path is an optimisation: if data only moved toward the device, there is nothing for the CPU to re-read, so no invalidation is needed. The direction selects which cache operation runs.
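Read as a table, the two arm64 hooks quoted above reduce to the following. This is a userspace restatement of that listing only, with invented model_* names; the enum values follow the kernel's enum dma_data_direction ordering, and nothing here performs real cache maintenance.

```c
#include <assert.h>

/* Same ordering as the kernel's enum dma_data_direction. */
enum model_dir {
        MODEL_BIDIRECTIONAL = 0,
        MODEL_TO_DEVICE = 1,
        MODEL_FROM_DEVICE = 2,
        MODEL_NONE = 3
};

enum model_cache_op {
        MODEL_OP_NONE,
        MODEL_OP_CLEAN,         /* write dirty lines out to memory */
        MODEL_OP_INVALIDATE     /* drop cached copies */
};

/* Handing the buffer to the device: the quoted hook always cleans. */
enum model_cache_op model_op_for_device(enum model_dir dir)
{
        assert(dir != MODEL_NONE);
        return MODEL_OP_CLEAN;
}

/* Taking the buffer back: invalidate unless data only went out. */
enum model_cache_op model_op_for_cpu(enum model_dir dir)
{
        assert(dir != MODEL_NONE);
        if (dir == MODEL_TO_DEVICE)
                return MODEL_OP_NONE;
        return MODEL_OP_INVALIDATE;
}
```

The only direction that costs nothing on the way back is MODEL_TO_DEVICE; a bidirectional mapping pays for both a clean and an invalidate, which is the performance cost of being imprecise about direction.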

Why coherent memory is a different memory type

Part one said a coherent buffer needs no syncing. The source explains why. The allocator dma_alloc_attrs() dispatches the same three ways, and on the direct path calls dma_direct_alloc() in direct.c. On a non-coherent architecture it allocates pages with __dma_direct_alloc_pages(), prepares them with arch_dma_prep_coherent() (on arm64 a cache clean over the region), and then, when CONFIG_DMA_DIRECT_REMAP is set, remaps them uncached with dma_common_contiguous_remap(); on architectures with CONFIG_ARCH_HAS_DMA_SET_UNCACHED it instead calls arch_dma_set_uncached().

So on a non-coherent SoC, the buffer returned by dma_alloc_coherent() is mapped uncached. That is how coherency is achieved without per-access cache maintenance, and it is also why coherent memory is the wrong choice for large data buffers: every CPU access bypasses the cache and is slow. This is the implementation reason behind the rule from part one to keep coherent memory for small control structures and use streaming maps for bulk data.

When the device cannot reach the buffer: swiotlb

Return to the dma_capable() check. When a device with a narrow DMA mask is handed a buffer above its reach, the direct path calls swiotlb_map(). The software I/O TLB keeps a low, device-addressable memory pool reserved at boot. swiotlb_map() copies (bounces) the buffer into that pool and returns a bus address the device can use. For a transfer toward the device, the bounce copy happens at map time; for a transfer from the device, it happens when you sync or unmap, which is one more reason the unmap and sync calls are mandatory rather than advisory. Bouncing is transparent to your driver, but it costs a copy, so a correct DMA mask is what lets the kernel skip it whenever the hardware can address the buffer directly.
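The timing of the two copies can be sketched in userspace. Everything below is hypothetical and heavily simplified: a single fixed slot stands in for the reserved pool, and bounce_map() and bounce_unmap() follow only the copy rules described in the paragraph above. The real swiotlb manages many slots and handles alignment, partial syncs, and more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BOUNCE_POOL_SIZE 256

enum bounce_dir {
        BOUNCE_BIDIRECTIONAL,
        BOUNCE_TO_DEVICE,
        BOUNCE_FROM_DEVICE
};

/* One-slot stand-in for the low, device-reachable pool. */
struct bounce_slot {
        uint8_t pool[BOUNCE_POOL_SIZE]; /* memory the device actually uses */
        void *orig;                     /* the driver's real buffer */
        size_t len;
        enum bounce_dir dir;
};

/* Map: remember the original buffer and, when data flows toward the
 * device, copy it into the pool now. Returns the pool address the
 * device should use, or NULL if the request does not fit. */
uint8_t *bounce_map(struct bounce_slot *s, void *buf, size_t len,
                    enum bounce_dir dir)
{
        assert(s != NULL && buf != NULL);
        if (len > BOUNCE_POOL_SIZE)
                return NULL;

        s->orig = buf;
        s->len = len;
        s->dir = dir;
        if (dir == BOUNCE_TO_DEVICE || dir == BOUNCE_BIDIRECTIONAL)
                memcpy(s->pool, buf, len);
        return s->pool;
}

/* Unmap: when data flowed from the device, copy it back to the
 * driver's buffer. Skipping this call loses the received data. */
void bounce_unmap(struct bounce_slot *s)
{
        assert(s != NULL && s->orig != NULL);
        if (s->dir == BOUNCE_FROM_DEVICE || s->dir == BOUNCE_BIDIRECTIONAL)
                memcpy(s->orig, s->pool, s->len);
        s->orig = NULL;
        s->len = 0;
}
```

In the receive case the driver's buffer stays untouched until bounce_unmap() runs, even though the device has long since finished writing into the pool. A driver that read its buffer without unmapping or syncing would see none of the received data.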

Streaming sync, in the source

The sync calls from part one map onto dma_direct_sync_single_for_cpu() and dma_direct_sync_single_for_device(). The for-cpu side shows both halves of the work in one place:

static inline void dma_direct_sync_single_for_cpu(struct device *dev,
                dma_addr_t addr, size_t size, enum dma_data_direction dir,
                bool flush)
{
        phys_addr_t paddr = dma_to_phys(dev, addr);

        if (!dev_is_dma_coherent(dev)) {
                arch_sync_dma_for_cpu(paddr, size, dir);
                if (flush)
                        arch_sync_dma_flush();
                arch_sync_dma_for_cpu_all();
        }
        swiotlb_sync_single_for_cpu(dev, paddr, size, dir);
}

It performs the architecture cache invalidate (only when the device is non-coherent), issues the deferred barrier, then asks swiotlb to copy any bounced data back. So the same call covers both the cache problem and the bounce-buffer problem, which is why a single dma_sync_single_for_cpu() in your driver is enough regardless of platform.

A debugging session with CONFIG_DMA_API_DEBUG

The kernel can validate every DMA call you make. Build with CONFIG_DMA_API_DEBUG=y and the code in kernel/dma/debug.c shadows each mapping in a hash table, checking that unmaps match maps, that the CPU does not touch memory currently owned by a device, and that drivers do not free memory with the wrong function. First confirm it is enabled and look at the debugfs directory it creates:

raghu@techveda.org:~$ zcat /proc/config.gz | grep DMA_API_DEBUG
CONFIG_DMA_API_DEBUG=y
raghu@techveda.org:~$ ls /sys/kernel/debug/dma-api/
all_errors        driver_filter     error_count       min_free_entries
disabled          dump              num_errors        nr_total_entries
                                                      num_free_entries

The most useful files are error_count (how many problems have been detected), dump (a listing of every mapping the kernel is currently tracking, which lets you spot leaks), and num_errors (how many warnings it will still print before going quiet, which you can raise). The driver_filter file restricts reporting to a single driver, so you can isolate your own:

raghu@techveda.org:~$ cat /sys/kernel/debug/dma-api/error_count
0
raghu@techveda.org:~$ echo mydev > /sys/kernel/debug/dma-api/driver_filter

When you break the rules from part one, the report names the driver and is specific about the fault. Unmapping with a different call than you mapped with is flagged as freeing DMA memory with the wrong function. Touching a buffer the device still owns is caught by the active-cacheline tracker, which in the 7.x series warns that you have exceeded the allowed number of overlapping mappings of a cacheline (the older “cpu touching an active dma mapped cacheline” wording was replaced). If heavy traffic exhausts the shadow entries, which you can watch by reading min_free_entries as it falls toward zero, raise the preallocated count from its default of 65536 at boot with the kernel parameter dma_debug_entries=. You can disable the facility entirely with dma_debug=off; note that it cannot be re-enabled at runtime. The tracking has a real performance cost, so this is a development-kernel tool, not something to ship.
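The shadowing idea itself is simple enough to sketch. The userspace model below is an illustration only: it uses a small linear array where the kernel uses a hash table, and all shadow_* names, the two mapping kinds, and the result codes are invented here. It shows how recording each map makes a mismatched unmap detectable, and how a fixed entry count can run out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHADOW_ENTRIES 8

enum shadow_kind { SHADOW_STREAMING, SHADOW_COHERENT };

enum shadow_result {
        SHADOW_OK,
        SHADOW_NO_FREE_ENTRY,   /* tracking table exhausted */
        SHADOW_NOT_MAPPED,      /* unmap of an address never mapped */
        SHADOW_WRONG_SIZE,      /* unmap size differs from map size */
        SHADOW_WRONG_FUNCTION   /* freed with the other API family */
};

struct shadow_entry {
        uint64_t addr;
        size_t size;
        enum shadow_kind kind;
        int in_use;
};

struct shadow_table {
        struct shadow_entry e[SHADOW_ENTRIES];
};

void shadow_init(struct shadow_table *t)
{
        assert(t != NULL);
        memset(t, 0, sizeof(*t));
}

/* Record a new mapping in the first free entry. */
enum shadow_result shadow_map(struct shadow_table *t, uint64_t addr,
                              size_t size, enum shadow_kind kind)
{
        for (size_t i = 0; i < SHADOW_ENTRIES; i++) {
                if (!t->e[i].in_use) {
                        t->e[i].addr = addr;
                        t->e[i].size = size;
                        t->e[i].kind = kind;
                        t->e[i].in_use = 1;
                        return SHADOW_OK;
                }
        }
        return SHADOW_NO_FREE_ENTRY;
}

/* Check an unmap against the recorded mapping; release it on a match. */
enum shadow_result shadow_unmap(struct shadow_table *t, uint64_t addr,
                                size_t size, enum shadow_kind kind)
{
        for (size_t i = 0; i < SHADOW_ENTRIES; i++) {
                if (!t->e[i].in_use || t->e[i].addr != addr)
                        continue;
                if (t->e[i].kind != kind)
                        return SHADOW_WRONG_FUNCTION;
                if (t->e[i].size != size)
                        return SHADOW_WRONG_SIZE;
                t->e[i].in_use = 0;
                return SHADOW_OK;
        }
        return SHADOW_NOT_MAPPED;
}

/* Mappings still tracked; entries that never go away suggest a leak. */
size_t shadow_live(const struct shadow_table *t)
{
        size_t n = 0;

        for (size_t i = 0; i < SHADOW_ENTRIES; i++)
                n += t->e[i].in_use ? 1 : 0;
        return n;
}
```

In this model shadow_live() plays the role of the dump file and SHADOW_NO_FREE_ENTRY corresponds to running out of preallocated entries. The bookkeeping on every map and unmap is also a fair picture of where the performance cost comes from.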

Learning both the contract and the implementation behind it, against real hardware and a real source tree, is the core of our Linux device drivers training, where DMA, interrupts, and the driver model are taught by tracing the kernel rather than memorising signatures.

Key takeaways

DMA involves virtual, physical, and bus addresses; the DMA mapping API converts a CPU buffer into a dma_addr_t the device can use, handling IOMMU and cache maintenance.
Declare the device’s addressing limits with dma_set_mask_and_coherent() before mapping; the mask feeds dma_capable() and decides whether the kernel must bounce through swiotlb.
Use coherent mappings (dma_alloc_coherent()) for small long-lived control structures; they are uncached on non-coherent SoCs, so use streaming maps (dma_map_single(), dma_map_page(), dma_map_sg()) with the correct direction and a dma_mapping_error() check for bulk data.
Call dma_sync_*() when the CPU touches a streaming buffer between transfers; on arm64 the direction selects the cache op, dcache_clean_poc_nosync() before the device reads and dcache_inval_poc_nosync() after it writes.
The 7.x DMA core is physical-address based: with no custom ops and no IOMMU every map takes the dma-direct path (dma_map_phys() then dma_direct_map_phys()); build with CONFIG_DMA_API_DEBUG and read /sys/kernel/debug/dma-api/ to catch misuse.
