Cache-Friendly Design

Modern CPUs are fast, but memory is much slower.

This is one of the most important facts in performance work. A simple arithmetic instruction typically completes in a cycle or two, while a load that misses the cache and goes all the way to main memory can cost hundreds of cycles. A program that keeps data close together in memory can be much faster than one that constantly jumps around.

Cache-friendly design means arranging data and code so the CPU can read memory efficiently.

The Basic Idea

A CPU does not read only one byte from memory at a time.

When it loads memory, it usually pulls in a small nearby block of memory called a cache line. A common cache line size is 64 bytes.

So if your program reads this:

const x = numbers[0];

the CPU may also load nearby values:

numbers[1]
numbers[2]
numbers[3]
...

If your program then reads the next item, it may already be in cache.

That is fast.
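A quick back-of-the-envelope sketch (assuming the common 64-byte line size, which is not guaranteed on every CPU): one line holds sixteen 4-byte u32 values, so a sequential scan misses at most once every sixteen reads.

```zig
// 64 bytes is a common cache line size, not a guarantee.
const cache_line = 64;
const values_per_line = cache_line / @sizeOf(u32); // 64 / 4 = 16
```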

Contiguous Data Is Fast

An array stores values next to each other.

const std = @import("std");

var numbers = [_]u32{ 10, 20, 30, 40, 50 };

for (numbers) |n| {
    std.debug.print("{}\n", .{n});
}

The loop reads memory in order.

This is friendly to the CPU cache.

The CPU can predict that the next value will probably be near the current value. It can prefetch memory before the program asks for it.

This is one reason arrays are often faster than pointer-heavy structures.

Pointer Chasing Is Slow

A linked list stores each node separately.

const Node = struct {
    value: u32,
    next: ?*Node,
};

Each next pointer may lead to a different place in memory.

The CPU cannot easily know where the next node lives until it reads the current node.

That creates a chain:

read node 1
find address of node 2
read node 2
find address of node 3
read node 3

This is called pointer chasing.

Pointer chasing often causes cache misses.
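A traversal loop makes the chain explicit. This sketch assumes a `head` pointer to the first node and sums the values; each hop depends on the memory load before it:

```zig
// Each iteration must finish loading the current node
// before the address of the next node is even known.
var sum: u32 = 0;
var node: ?*const Node = head;
while (node) |n| {
    sum += n.value;
    node = n.next; // potential cache miss on every hop
}
```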

A linked list can look elegant, but it is often slower than an array in real programs.

Locality

Locality means that memory accesses cluster together, either in space (nearby addresses) or in time (the same addresses again soon).

There are two common kinds.

Kind                 Meaning                                    Example
Spatial locality     Use memory near memory you already used    Reading an array in order
Temporal locality    Reuse the same memory soon                 Updating the same counter many times

Good programs often have both.

This loop has spatial locality:

for (items) |item| {
    process(item);
}

This loop has poor locality if each item points to far-away memory:

for (items) |item| {
    process(item.ptr.*);
}

The second loop may spend much of its time waiting for memory.

Array of Structs

A common layout is an array of structs.

const Particle = struct {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
};

var particles: [1000]Particle = undefined;

Each particle stores position and velocity together.

This layout is good when your code usually needs all fields of each particle.

Example:

for (&particles) |*p| {
    p.x += p.vx;
    p.y += p.vy;
    p.z += p.vz;
}

Here, each loop iteration uses all fields. The layout works well.

Struct of Arrays

Sometimes you only need one or two fields from many objects.

In that case, a struct of arrays may be faster.

const Particles = struct {
    x: []f32,
    y: []f32,
    z: []f32,
    vx: []f32,
    vy: []f32,
    vz: []f32,
};

Now all x values are stored together, all y values are stored together, and so on.

This can be faster for code like:

for (particles.x) |*x| {
    x.* += 1.0;
}

The CPU reads a dense stream of f32 values.

No unused y, z, vx, vy, or vz fields are pulled into cache.
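A sketch of a position update along the x axis only (assuming the slices have equal length and a `dt` time step is in scope); Zig's multi-object for loop walks both arrays in lockstep:

```zig
// Two dense f32 streams; y, z and the other velocities never enter cache.
for (particles.x, particles.vx) |*x, vx| {
    x.* += vx * dt;
}
```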

Wasted Cache Space

Suppose you have this struct:

const User = struct {
    id: u64,
    age: u8,
    name: [128]u8,
    email: [256]u8,
};

Now suppose you only want to count users above age 18:

var count: usize = 0;

for (users) |user| {
    if (user.age >= 18) {
        count += 1;
    }
}

Each User is large. The loop only needs age, but the CPU still loads cache lines containing other fields.

A better layout may separate hot and cold data.

Hot data is used often.

Cold data is used rarely.

const UserHot = struct {
    id: u64,
    age: u8,
};

const UserCold = struct {
    name: [128]u8,
    email: [256]u8,
};

This keeps frequently used data smaller and denser.
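One way to use the split is two parallel arrays that share an index, so hot[i] and cold[i] describe the same user. The age scan then touches only the small hot records. A minimal sketch:

```zig
var hot: [1000]UserHot = undefined;
var cold: [1000]UserCold = undefined; // hot[i] and cold[i] belong together

var count: usize = 0;
for (hot) |user| {
    if (user.age >= 18) count += 1; // never loads name or email
}
```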

Padding and Alignment

Struct fields may include padding.

Padding is unused space inserted so fields meet alignment requirements.

Example:

const Bad = extern struct {
    a: u8,
    b: u64,
    c: u8,
};

This may use more memory than expected because u64 usually needs 8-byte alignment: on a typical 64-bit target, Bad is 24 bytes, with 7 bytes of padding after a and 7 more after c. (An extern struct keeps fields in declared order, like C. Zig's plain structs make no layout guarantee and may reorder fields for you, so declaration order matters most for extern and packed structs.)

A better order may reduce padding; on the same target, Better is typically 16 bytes:

const Better = extern struct {
    b: u64,
    a: u8,
    c: u8,
};

You can inspect sizes:

const std = @import("std");

const Bad = extern struct {
    a: u8,
    b: u64,
    c: u8,
};

const Better = extern struct {
    b: u64,
    a: u8,
    c: u8,
};

pub fn main() void {
    std.debug.print("Bad: {}\n", .{@sizeOf(Bad)});
    std.debug.print("Better: {}\n", .{@sizeOf(Better)});
}

Smaller structs often improve cache behavior because more values fit into each cache line.

Avoid Random Access When Possible

Sequential access is usually faster than random access.

Fast:

for (numbers) |n| {
    sum += n;
}

Slower:

for (indices) |i| {
    sum += numbers[i];
}

The second loop jumps around depending on indices.

If indices is random, the CPU cache cannot help much.

Sometimes you can sort indices or reorganize data so access becomes more sequential.
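If the order of processing does not matter, sorting the indices first turns the scattered reads into a mostly forward sweep. A sketch using std.mem.sort, assuming `indices` is a mutable []usize and `numbers` is in scope:

```zig
const std = @import("std");

// Sort the indices so numbers[i] is visited in increasing address order.
std.mem.sort(usize, indices, {}, std.sort.asc(usize));

var sum: u64 = 0;
for (indices) |i| {
    sum += numbers[i]; // now mostly sequential
}
```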

Batching Work

Batching means doing similar work together.

Bad locality:

for (items) |item| {
    readInput(item);
    process(item);
    writeOutput(item);
}

Better locality may come from stages:

for (items) |item| {
    readInput(item);
}

for (items) |item| {
    process(item);
}

for (items) |item| {
    writeOutput(item);
}

This is not always faster. It depends on the data. But in many systems, grouping similar work improves cache behavior and makes performance easier to reason about.

Avoid Allocating Many Tiny Objects

Many tiny heap allocations often produce scattered memory.

Example:

const Node = struct {
    value: u32,
    next: ?*Node,
};

If each node is separately allocated, the list may be spread across the heap.

Better options include:

  • store nodes in an array
  • use an arena allocator
  • store indices instead of pointers
  • use a packed representation

Example using indices:

const Node = struct {
    value: u32,
    next_index: ?usize,
};

var nodes: [1000]Node = undefined;

Now the nodes can live close together.

This keeps memory easier for the CPU to read.
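Traversal now follows indices inside one array instead of addresses scattered across the heap. A sketch assuming a `head_index` for the first node:

```zig
var sum: u32 = 0;
var i: ?usize = head_index;
while (i) |idx| {
    sum += nodes[idx].value;
    i = nodes[idx].next_index; // stays inside the nodes array
}
```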

Indices Can Be Better Than Pointers

A pointer stores a memory address.

An index stores a position inside an array.

Pointer:

next: ?*Node

Index:

next_index: ?usize

Indices have advantages:

  • data can stay inside one array
  • structures can be moved or serialized more easily
  • indices can be smaller than pointers (for example, a u32 index instead of a 64-bit address)
  • cache behavior is often better

For high-performance code, index-based structures are often worth considering.
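Compactness is concrete: a node holding a u32 index can be half the size of one holding a 64-bit pointer. A sketch (the sentinel value for "no next node" is a design choice, not a rule):

```zig
const std = @import("std");

const none: u32 = std.math.maxInt(u32); // sentinel: "no next node"

const IndexNode = struct {
    value: u32,
    next: u32, // index into the node array, or `none`
};
// On a 64-bit target, IndexNode is typically 8 bytes,
// while a pointer-based node is typically 16.
```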

Measure Before Changing Layout

Cache-friendly design can produce large speedups, but it can also make code more complex.

Do not rewrite every struct just because you can.

First measure.

Then change layout if profiling shows memory access is the bottleneck.

A good rule:

Write clear code first. Then make data layout more specialized when measurement proves it matters.

Mental Model

When thinking about cache-friendly design, ask:

  • Is my data contiguous?
  • Does this loop read memory in order?
  • Am I chasing pointers?
  • Am I loading fields I do not use?
  • Are hot fields separated from cold fields?
  • Are many tiny heap allocations scattering my data?
  • Could an array plus indices replace pointers?
  • Can I process work in batches?

Performance is not only about doing fewer operations.

Very often, performance is about making memory easier for the CPU to read.