SIMD and Vectorization

SIMD stands for Single Instruction, Multiple Data.

It is a CPU feature that lets one instruction operate on several values at the same time.

A normal instruction might add one pair of numbers:

a + b

A SIMD instruction can add several pairs at once:

[a0, a1, a2, a3] + [b0, b1, b2, b3]

The result is:

[a0 + b0, a1 + b1, a2 + b2, a3 + b3]

This is useful for workloads that perform the same operation across many values.

Why SIMD Matters

Many programs contain loops like this:

for (items) |*item| {
    item.* += 1;
}

Each element receives the same operation.

That pattern is a good candidate for vectorization.

Vectorization means converting scalar operations into vector operations.

Scalar code works on one value at a time.

Vector code works on multiple values at a time.

Zig Vector Types

Zig has a built-in vector type:

const Vec4 = @Vector(4, f32);

This means:

a vector of 4 f32 values

You can create vector values like this:

const a: Vec4 = .{ 1.0, 2.0, 3.0, 4.0 };
const b: Vec4 = .{ 10.0, 20.0, 30.0, 40.0 };

Then you can add them:

const c = a + b;

The result is:

.{ 11.0, 22.0, 33.0, 44.0 }

The code looks like one addition, but it applies to every lane.

Vector Lanes

Each value inside a vector is called a lane.

In this example:

const Vec4 = @Vector(4, i32);
const values: Vec4 = .{ 10, 20, 30, 40 };

The vector has 4 lanes:

lane 0 = 10
lane 1 = 20
lane 2 = 30
lane 3 = 40

Most vector operations happen lane by lane.

const a: Vec4 = .{ 1, 2, 3, 4 };
const b: Vec4 = .{ 5, 6, 7, 8 };

const c = a * b;

The result is:

[1 * 5, 2 * 6, 3 * 7, 4 * 8]

So c becomes:

.{ 5, 12, 21, 32 }

A Simple Vector Example

Here is a complete Zig example:

const std = @import("std");

pub fn main() void {
    const Vec4 = @Vector(4, i32);

    const a: Vec4 = .{ 1, 2, 3, 4 };
    const b: Vec4 = .{ 10, 20, 30, 40 };

    const c = a + b;

    std.debug.print("{any}\n", .{c});
}

This adds four pairs of integers with one vector expression.

The exact machine instructions depend on the target CPU and optimization mode.

Scalar Loop vs Vector Loop

A scalar loop might look like this:

fn addScalar(out: []f32, a: []const f32, b: []const f32) void {
    for (out, a, b) |*dst, x, y| {
        dst.* = x + y;
    }
}

This is clear and correct.

A vectorized version works in chunks:

fn addVector(out: []f32, a: []const f32, b: []const f32) void {
    const Vec4 = @Vector(4, f32);

    var i: usize = 0;

    while (i + 4 <= out.len) : (i += 4) {
        const va: Vec4 = .{
            a[i],
            a[i + 1],
            a[i + 2],
            a[i + 3],
        };

        const vb: Vec4 = .{
            b[i],
            b[i + 1],
            b[i + 2],
            b[i + 3],
        };

        const vc = va + vb;

        out[i] = vc[0];
        out[i + 1] = vc[1];
        out[i + 2] = vc[2];
        out[i + 3] = vc[3];
    }

    while (i < out.len) : (i += 1) {
        out[i] = a[i] + b[i];
    }
}

The first loop handles four values at a time.

The second loop handles the leftover values when the length is not divisible by four.
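
On recent Zig versions you can usually load and store whole chunks through slice-to-array coercion instead of copying lane by lane. This is a small sketch, behaviorally equivalent to the chunked loop above:

while (i + 4 <= out.len) : (i += 4) {
    // A pointer to 4 consecutive elements coerces to a fixed-size array,
    // and that array coerces to a vector.
    const va: Vec4 = a[i..][0..4].*;
    const vb: Vec4 = b[i..][0..4].*;

    // Storing works the same way in reverse: the vector coerces back to an array.
    out[i..][0..4].* = va + vb;
}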

The Remainder Problem

Vector code often processes fixed-size chunks.

If a vector has 4 lanes, an array of 10 values splits like this:

[0, 1, 2, 3]   vector chunk
[4, 5, 6, 7]   vector chunk
[8, 9]         remainder

You still need scalar code for the remainder.

This is a common SIMD pattern.

while (i + 4 <= len) : (i += 4) {
    // vector work
}

while (i < len) : (i += 1) {
    // leftover scalar work
}

What SIMD Is Good For

SIMD works well when the same operation applies to many values.

Good examples:

  • image processing
  • audio processing
  • physics simulation
  • matrix math
  • compression
  • checksums
  • parsing
  • cryptography
  • machine learning kernels

For example, brightening pixels:

pixel = pixel + brightness

This operation applies to many pixels independently.

That is a good SIMD workload.
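
As a rough sketch of what that could look like with Zig vectors (the lane count of 16 and the saturating +| operator are illustrative choices, not the only way to do it):

const Vec16 = @Vector(16, u8);

fn brightenChunk(pixels: Vec16, brightness: u8) Vec16 {
    // Saturating add keeps each lane clamped at 255 instead of wrapping around.
    return pixels +| @as(Vec16, @splat(brightness));
}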

What SIMD Is Bad For

SIMD works poorly when each value requires different control flow.

Example:

for (items) |item| {
    if (item.kind == .text) {
        processText(item);
    } else if (item.kind == .image) {
        processImage(item);
    } else {
        processOther(item);
    }
}

Each item may go down a different path.

That makes SIMD harder.

SIMD prefers regular, repeated, predictable operations.

Auto-Vectorization

Sometimes the compiler can vectorize scalar loops automatically.

For example:

for (out, a, b) |*dst, x, y| {
    dst.* = x + y;
}

An optimizing compiler may turn this into SIMD instructions.

This is called auto-vectorization.

Auto-vectorization works best when:

  • arrays are contiguous
  • there are no hidden aliases
  • loop bounds are simple
  • operations are independent
  • memory access is predictable

But compilers cannot always prove a loop is safe to vectorize.

Manual vector types give you more direct control.

Alignment

SIMD can be affected by memory alignment.

Alignment means the memory address is a multiple of some value.

For example, 16-byte alignment means the address is divisible by 16.

Aligned memory can be faster for some vector instructions and targets.

Zig lets you express alignment in pointer types and allocations.

You do not need to master alignment immediately, but you should know that vector code often cares about it.
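
As a minimal sketch of how alignment can be expressed in Zig (using a fixed-size buffer for illustration):

// The buffer is 16-byte aligned, and that alignment is part of the pointer type.
var samples: [1024]f32 align(16) = undefined;
const ptr: *align(16) [1024]f32 = &samples;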

Wider Is Not Always Better

A vector with more lanes can process more values at once.

Example:

const Vec8 = @Vector(8, f32);
const Vec16 = @Vector(16, f32);

But wider vectors are not always faster.

Reasons include:

  • target CPU may not support that width directly
  • compiler may split the vector into smaller instructions
  • memory bandwidth may become the bottleneck
  • register pressure may increase
  • code may become less portable

A good SIMD implementation depends on the target machine.

Measure before assuming.
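
One way to avoid hard-coding a width is to ask the standard library for a suggestion. This is a hedged sketch, assuming the std.simd.suggestVectorLength helper available in recent Zig releases:

const std = @import("std");

// Ask for a reasonable lane count for f32 on the current target;
// fall back to 4 lanes if the target reports no SIMD support.
const lanes = std.simd.suggestVectorLength(f32) orelse 4;
const VecN = @Vector(lanes, f32);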

SIMD and Memory Bandwidth

SIMD speeds up computation, but it does not remove memory cost.

Suppose your loop does very little arithmetic:

out[i] = a[i] + b[i];

For each element, the program loads two values and stores one value.

The bottleneck may become memory bandwidth, not arithmetic.

SIMD helps most when the CPU has enough work to do per loaded byte.

Vector Comparisons

Vector operations can also compare values lane by lane.

Example:

const Vec4 = @Vector(4, i32);

const a: Vec4 = .{ 1, 5, 3, 8 };
const b: Vec4 = .{ 2, 4, 3, 9 };

const mask = a < b;

The result is a vector mask:

[true, false, false, true]

Masks are useful for conditional vector logic.

Instead of branching separately for each value, SIMD code often computes a mask and uses it to select results.
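
For example, picking the smaller value in each lane can use @select instead of a branch. This small sketch reuses the vectors above:

// Where the mask is true, take the lane from a; otherwise take it from b.
const smaller = @select(i32, mask, a, b);
// smaller is .{ 1, 4, 3, 8 }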

@splat

The @splat builtin creates a vector where every lane has the same value.

Example:

const Vec4 = @Vector(4, i32);

const all_ten: Vec4 = @splat(10);

This creates:

.{ 10, 10, 10, 10 }

This is useful when applying one value to many lanes.

Example:

const adjusted = values + @as(Vec4, @splat(5));

Every lane has 5 added to it.

@reduce

Sometimes you need to turn a vector into one value.

Example:

const Vec4 = @Vector(4, i32);
const values: Vec4 = .{ 10, 20, 30, 40 };

const sum = @reduce(.Add, values);

The result is:

100

Reduction combines lanes using an operation.

Common reductions include:

  • add
  • multiply
  • minimum
  • maximum
  • bitwise operations
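
For example, finding the largest lane of the same values vector is a one-line reduction:

const max_value = @reduce(.Max, values);
// max_value is 40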

A Small Sum Example

Here is a simple vectorized sum:

fn sumVector(values: []const i32) i32 {
    const Vec4 = @Vector(4, i32);

    var i: usize = 0;
    var acc: Vec4 = @splat(0);

    while (i + 4 <= values.len) : (i += 4) {
        const chunk: Vec4 = .{
            values[i],
            values[i + 1],
            values[i + 2],
            values[i + 3],
        };

        acc += chunk;
    }

    var total: i32 = @reduce(.Add, acc);

    while (i < values.len) : (i += 1) {
        total += values[i];
    }

    return total;
}

The vector accumulator stores four partial sums.

At the end, @reduce combines them into one scalar value.
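
A quick usage check (a small sketch; the expected total for 1 through 6 is 21):

const data = [_]i32{ 1, 2, 3, 4, 5, 6 };
const total = sumVector(&data);
// total is 21: one vector chunk covers 1..4, the scalar loop adds 5 and 6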

SIMD Is an Advanced Tool

SIMD can make code faster, but it can also make code harder to read.

Start with clear scalar code.

Then profile.

Then use SIMD only when measurement shows that a loop is hot and suitable for vectorization.

A good SIMD candidate usually has:

  • large input arrays
  • simple repeated operations
  • predictable memory access
  • little branching
  • independent elements

Mental Model

SIMD is useful when your program says:

do the same thing to many values

Zig gives you vector types so you can express that directly.

The basic pattern is:

  1. process several items at once with a vector
  2. handle leftover items with scalar code
  3. measure the result

SIMD is not magic. It is a way to use the CPU’s data-parallel hardware explicitly.