SIMD and Vectorization
SIMD means Single Instruction, Multiple Data.
It is a CPU feature that lets one instruction operate on several values at the same time.
A normal instruction might add one pair of numbers:

```
a + b
```

A SIMD instruction can add several pairs at once:

```
[a0, a1, a2, a3] + [b0, b1, b2, b3]
```

The result is:

```
[a0 + b0, a1 + b1, a2 + b2, a3 + b3]
```

This is useful for workloads that perform the same operation across many values.
Why SIMD Matters
Many programs contain loops like this:
```zig
for (items) |*item| {
    item.* += 1;
}
```

Each element receives the same operation.
That pattern is a good candidate for vectorization.
Vectorization means converting scalar operations into vector operations.
Scalar code works on one value at a time.
Vector code works on multiple values at a time.
Zig Vector Types
Zig has a built-in vector type:
```zig
const Vec4 = @Vector(4, f32);
```

This means: a vector of 4 f32 values.

You can create vector values like this:

```zig
const a: Vec4 = .{ 1.0, 2.0, 3.0, 4.0 };
const b: Vec4 = .{ 10.0, 20.0, 30.0, 40.0 };
```

Then you can add them:

```zig
const c = a + b;
```

The result is:

```zig
.{ 11.0, 22.0, 33.0, 44.0 }
```

The code looks like one addition, but it applies to every lane.
Vector Lanes
Each value inside a vector is called a lane.
In this example:
```zig
const Vec4 = @Vector(4, i32);
const values: Vec4 = .{ 10, 20, 30, 40 };
```

The vector has 4 lanes:

lane 0 = 10
lane 1 = 20
lane 2 = 30
lane 3 = 40

Most vector operations happen lane by lane.

```zig
const a: Vec4 = .{ 1, 2, 3, 4 };
const b: Vec4 = .{ 5, 6, 7, 8 };
const c = a * b;
```

The result is:

```
[1 * 5, 2 * 6, 3 * 7, 4 * 8]
```

So c becomes:

```zig
.{ 5, 12, 21, 32 }
```

A Simple Vector Example
Here is a complete Zig example:
```zig
const std = @import("std");

pub fn main() void {
    const Vec4 = @Vector(4, i32);
    const a: Vec4 = .{ 1, 2, 3, 4 };
    const b: Vec4 = .{ 10, 20, 30, 40 };
    const c = a + b;
    std.debug.print("{any}\n", .{c});
}
```

This adds four integers with one vector expression.
The exact machine instructions depend on the target CPU and optimization mode.
Scalar Loop vs Vector Loop
A scalar loop might look like this:
```zig
fn addScalar(out: []f32, a: []const f32, b: []const f32) void {
    for (out, a, b) |*dst, x, y| {
        dst.* = x + y;
    }
}
```

This is clear and correct.
A vectorized version works in chunks:
```zig
fn addVector(out: []f32, a: []const f32, b: []const f32) void {
    const Vec4 = @Vector(4, f32);
    var i: usize = 0;
    while (i + 4 <= out.len) : (i += 4) {
        const va: Vec4 = .{
            a[i],
            a[i + 1],
            a[i + 2],
            a[i + 3],
        };
        const vb: Vec4 = .{
            b[i],
            b[i + 1],
            b[i + 2],
            b[i + 3],
        };
        const vc = va + vb;
        out[i] = vc[0];
        out[i + 1] = vc[1];
        out[i + 2] = vc[2];
        out[i + 3] = vc[3];
    }
    while (i < out.len) : (i += 1) {
        out[i] = a[i] + b[i];
    }
}
```

The first loop handles four values at a time.
The second loop handles the leftover values when the length is not divisible by four.
The Remainder Problem
Vector code often processes fixed-size chunks.
If a vector has 4 lanes, an array of 10 values splits like this:
```
[0, 1, 2, 3]  vector chunk
[4, 5, 6, 7]  vector chunk
[8, 9]        remainder
```

You still need scalar code for the remainder.
This is a common SIMD pattern.
```zig
while (i + 4 <= len) : (i += 4) {
    // vector work
}
while (i < len) : (i += 1) {
    // leftover scalar work
}
```

What SIMD Is Good For
SIMD works well when the same operation applies to many values.
Good examples:
- image processing
- audio processing
- physics simulation
- matrix math
- compression
- checksums
- parsing
- cryptography
- machine learning kernels
For example, brightening pixels:
```
pixel = pixel + brightness
```

This operation applies to many pixels independently.
That is a good SIMD workload.
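As a sketch, that workload could look like this in Zig. The `brighten` function and the 16-lane width are illustrative choices, not a fixed recipe; `+|` is Zig's saturating add, which clamps at 255 instead of wrapping around:

```zig
const std = @import("std");

// Hypothetical helper: brighten a row of grayscale pixels in place.
fn brighten(pixels: []u8, amount: u8) void {
    const Vec16 = @Vector(16, u8);
    const add: Vec16 = @splat(amount);
    var i: usize = 0;
    while (i + 16 <= pixels.len) : (i += 16) {
        // A fixed-size chunk of the slice coerces to a vector.
        const v: Vec16 = pixels[i..][0..16].*;
        pixels[i..][0..16].* = v +| add; // saturating add: clamps at 255
    }
    while (i < pixels.len) : (i += 1) {
        pixels[i] +|= amount; // scalar remainder, same saturating add
    }
}

pub fn main() void {
    var row = [_]u8{ 0, 100, 200, 250 } ** 5; // 20 pixels
    brighten(&row, 30);
    std.debug.print("{any}\n", .{row});
}
```

Saturating arithmetic matters here: a plain `+` on a pixel value of 250 would wrap to a dark value instead of clamping to white.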
What SIMD Is Bad For
SIMD works poorly when each value requires different control flow.
Example:
```zig
for (items) |item| {
    if (item.kind == .text) {
        processText(item);
    } else if (item.kind == .image) {
        processImage(item);
    } else {
        processOther(item);
    }
}
```

Each item may go down a different path.
That makes SIMD harder.
SIMD prefers regular, repeated, predictable operations.
Auto-Vectorization
Sometimes the compiler can vectorize scalar loops automatically.
For example:
```zig
for (out, a, b) |*dst, x, y| {
    dst.* = x + y;
}
```

An optimizing compiler may turn this into SIMD instructions.
This is called auto-vectorization.
Auto-vectorization works best when:
- arrays are contiguous
- there are no hidden aliases
- loop bounds are simple
- operations are independent
- memory access is predictable
But compilers cannot always prove a loop is safe to vectorize.
Manual vector types give you more direct control.
Alignment
SIMD can be affected by memory alignment.
Alignment means the memory address is a multiple of some value.
For example, 16-byte alignment means the address is divisible by 16.
Aligned memory can be faster for some vector instructions and targets.
Zig lets you express alignment in pointer types and allocations.
You do not need to master alignment immediately, but you should know that vector code often cares about it.
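As a small illustration, `@alignOf` reports the alignment requirement of a vector type, and `align(...)` can over-align an array so that vector loads from it start at a vector-friendly address. The printed values depend on the target:

```zig
const std = @import("std");

pub fn main() void {
    const Vec4 = @Vector(4, f32);

    // A vector type usually requires stricter alignment than its
    // element type. The exact numbers are target-dependent.
    std.debug.print("alignment of f32:  {}\n", .{@alignOf(f32)});
    std.debug.print("alignment of Vec4: {}\n", .{@alignOf(Vec4)});

    // Over-align an array so vector loads from it are aligned:
    const data: [8]f32 align(@alignOf(Vec4)) = .{ 1, 2, 3, 4, 5, 6, 7, 8 };
    const first: Vec4 = data[0..4].*;
    std.debug.print("{any}\n", .{first});
}
```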
Wider Is Not Always Better
A vector with more lanes can process more values at once.
Example:
```zig
const Vec8 = @Vector(8, f32);
const Vec16 = @Vector(16, f32);
```

But wider vectors are not always faster.
Reasons include:
- target CPU may not support that width directly
- compiler may split the vector into smaller instructions
- memory bandwidth may become the bottleneck
- register pressure may increase
- code may become less portable
A good SIMD implementation depends on the target machine.
Measure before assuming.
SIMD and Memory Bandwidth
SIMD speeds up computation, but it does not remove memory cost.
Suppose your loop does very little arithmetic:
```zig
out[i] = a[i] + b[i];
```

For each element, the program loads two values and stores one value.
The bottleneck may become memory bandwidth, not arithmetic.
SIMD helps most when the CPU has enough work to do per loaded byte.
Vector Comparisons
Vector operations can also compare values lane by lane.
Example:
```zig
const Vec4 = @Vector(4, i32);
const a: Vec4 = .{ 1, 5, 3, 8 };
const b: Vec4 = .{ 2, 4, 3, 9 };
const mask = a < b;
```

The result is a vector mask:

```
[true, false, false, true]
```

Masks are useful for conditional vector logic.
Instead of branching separately for each value, SIMD code often computes a mask and uses it to select results.
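Zig's `@select` builtin does exactly that: it takes a boolean mask and picks each lane from one of two vectors. Here it computes a lane-wise minimum without any branches:

```zig
const std = @import("std");

pub fn main() void {
    const Vec4 = @Vector(4, i32);
    const a: Vec4 = .{ 1, 5, 3, 8 };
    const b: Vec4 = .{ 2, 4, 3, 9 };

    // Where the mask is true, take the lane from `a`;
    // where it is false, take the lane from `b`.
    const mask = a < b;
    const min = @select(i32, mask, a, b);
    std.debug.print("{any}\n", .{min}); // lane values: 1, 4, 3, 8
}
```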
@splat
The @splat builtin creates a vector where every lane has the same value.
Example:
```zig
const Vec4 = @Vector(4, i32);
const all_ten: Vec4 = @splat(10);
```

This creates:

```zig
.{ 10, 10, 10, 10 }
```

This is useful when applying one value to many lanes.

Example:

```zig
const adjusted = values + @as(Vec4, @splat(5));
```

Every lane receives + 5.
@reduce
Sometimes you need to turn a vector into one value.
Example:
```zig
const Vec4 = @Vector(4, i32);
const values: Vec4 = .{ 10, 20, 30, 40 };
const sum = @reduce(.Add, values);
```

The result is:

```
100
```

Reduction combines lanes using an operation.
Common reductions include:
- add
- multiply
- minimum
- maximum
- bitwise operations
A Small Sum Example
Here is a simple vectorized sum:
```zig
fn sumVector(values: []const i32) i32 {
    const Vec4 = @Vector(4, i32);
    var i: usize = 0;
    var acc: Vec4 = @splat(0);
    while (i + 4 <= values.len) : (i += 4) {
        const chunk: Vec4 = .{
            values[i],
            values[i + 1],
            values[i + 2],
            values[i + 3],
        };
        acc += chunk;
    }
    var total: i32 = @reduce(.Add, acc);
    while (i < values.len) : (i += 1) {
        total += values[i];
    }
    return total;
}
```

The vector accumulator stores four partial sums.
At the end, @reduce combines them into one scalar value.
SIMD Is an Advanced Tool
SIMD can make code faster, but it can also make code harder to read.
Start with clear scalar code.
Then profile.
Then use SIMD only when measurement shows that a loop is hot and suitable for vectorization.
A good SIMD candidate usually has:
- large input arrays
- simple repeated operations
- predictable memory access
- little branching
- independent elements
Mental Model
SIMD is useful when your program says:
"do the same thing to many values"

Zig gives you vector types so you can express that directly.
The basic pattern is:
- process several items at once with a vector
- handle leftover items with scalar code
- measure the result
SIMD is not magic. It is a way to use the CPU’s data-parallel hardware explicitly.