Benchmark Methodology

Benchmarking means measuring how fast code runs.

A benchmark answers a narrow question:

How long does this operation take under these conditions?

That last part matters. A benchmark is only useful when the conditions are clear. Different inputs, build modes, machines, allocators, and operating system states can produce different results.

Good benchmarking is more than starting a timer. It is designing a measurement that tells the truth.

Benchmark the Right Thing

Before writing a benchmark, define the question.

Bad question:

Is this code fast?

Better question:

How many bytes per second can this parser process on a 100 MB JSON file in ReleaseFast mode?

Better question:

How many requests per second can this server handle with 1,000 concurrent connections?

Better question:

How many nanoseconds does this small function take when called 10 million times?

A vague benchmark gives vague results.

A precise benchmark gives useful results.

Use Release Builds

Never benchmark Debug mode unless you are measuring Debug mode itself.

Debug builds include safety checks and perform little optimization, so their numbers say little about release performance.

Use:

zig build-exe main.zig -O ReleaseFast

or, when you want safety checks with optimization:

zig build-exe main.zig -O ReleaseSafe
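
If the project uses a standard build script, the same modes are typically selected through the optimize option:

zig build -Doptimize=ReleaseFast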

Record the mode with the result.

A result without the build mode is incomplete.

Make the Work Large Enough

A benchmark must run long enough to measure.

This is too small:

const start = std.time.nanoTimestamp();
const x = add(1, 2);
const end = std.time.nanoTimestamp();

std.debug.print("{} ns\n", .{end - start});

The timer's overhead and resolution may both be larger than the work itself.

Instead, repeat the work many times:

const std = @import("std");

fn add(a: u64, b: u64) u64 {
    return a + b;
}

pub fn main() void {
    const iterations = 100_000_000;

    const start = std.time.nanoTimestamp();

    var sum: u64 = 0;
    for (0..iterations) |i| {
        sum += add(i, 1);
    }

    const end = std.time.nanoTimestamp();

    std.debug.print("sum: {}\n", .{sum});
    std.debug.print("elapsed: {} ns\n", .{end - start});
}

The benchmark runs enough work for the timing to be meaningful.
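
std.time.nanoTimestamp reads the wall clock, which can jump when the system clock is adjusted. std.time.Timer uses a monotonic clock, which is a better default for timing. A minimal sketch of the same loop:

const std = @import("std");

pub fn main() !void {
    // Timer.start() verifies a monotonic clock is available.
    var timer = try std.time.Timer.start();

    var sum: u64 = 0;
    for (0..100_000_000) |i| {
        sum += i + 1;
    }

    // read() returns nanoseconds elapsed since start() as a u64.
    const elapsed_ns = timer.read();

    std.debug.print("sum: {}\n", .{sum});
    std.debug.print("elapsed: {} ns\n", .{elapsed_ns});
}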

Prevent Dead Code Elimination

Optimizing compilers remove useless work.

This benchmark may be invalid:

for (0..1000000) |i| {
    _ = i * i;
}

The result is unused. The compiler may remove the loop entirely.

Use the result in a visible way:

var sum: u64 = 0;

for (0..1000000) |i| {
    sum += i * i;
}

std.debug.print("{}\n", .{sum});

Now the compiler must preserve the computation.
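
Printing is one option. The standard library also provides std.mem.doNotOptimizeAway, which marks a value as used without printing it. A minimal sketch:

const std = @import("std");

pub fn main() void {
    var sum: u64 = 0;
    for (0..1_000_000) |i| {
        sum += i * i;
    }

    // Marks sum as observed so the optimizer cannot delete the loop.
    std.mem.doNotOptimizeAway(sum);
}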

Separate Setup from Measurement

Do not include setup time unless setup is part of what you want to measure.

Bad:

const start = std.time.nanoTimestamp();

const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);

fillInput(input);
process(input);

const end = std.time.nanoTimestamp();

This measures allocation, input generation, and processing together.

Better:

const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);

fillInput(input);

const start = std.time.nanoTimestamp();
process(input);
const end = std.time.nanoTimestamp();

Now the timed region measures only process.

Sometimes you do want total end-to-end time. That is fine. Just name it correctly.

Run Benchmarks Multiple Times

One run is not enough.

Performance varies because of:

  • operating system scheduling
  • CPU frequency scaling
  • background processes
  • disk cache
  • memory layout
  • thermal throttling

Run multiple trials.

Example output:

run 1: 104 ms
run 2: 101 ms
run 3: 103 ms
run 4: 102 ms
run 5: 101 ms

This is more trustworthy than one number.
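
A small harness can run the trials for you. A minimal sketch that times five runs and reports the minimum and median (process is a stand-in workload):

const std = @import("std");

fn process() u64 {
    var sum: u64 = 0;
    for (0..10_000_000) |i| {
        sum += i;
    }
    return sum;
}

pub fn main() !void {
    const trials = 5;
    var times: [trials]u64 = undefined;

    for (&times, 0..) |*t, run| {
        var timer = try std.time.Timer.start();
        std.mem.doNotOptimizeAway(process());
        t.* = timer.read();
        std.debug.print("run {}: {} ns\n", .{ run + 1, t.* });
    }

    // Sorting makes the minimum and median easy to report.
    std.mem.sort(u64, &times, {}, std.sort.asc(u64));
    std.debug.print("min: {} ns  median: {} ns\n", .{ times[0], times[trials / 2] });
}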

Look at Distribution, Not Only Average

The average can hide important behavior.

Suppose you measure request latency:

average: 10 ms

That sounds good.

But the distribution may be:

p50: 5 ms
p95: 40 ms
p99: 200 ms

For servers, games, databases, and interactive tools, tail latency matters.

A program that is usually fast but sometimes very slow may still be bad.
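
Percentiles fall out of sorted samples. A minimal sketch using the nearest-rank convention, with hypothetical latencies in milliseconds:

const std = @import("std");

// Nearest-rank percentile of a sorted slice: one common convention among several.
fn percentile(sorted: []const u64, p: f64) u64 {
    const n: f64 = @floatFromInt(sorted.len);
    const rank: usize = @intFromFloat(@ceil(p / 100.0 * n));
    return sorted[@min(rank -| 1, sorted.len - 1)];
}

pub fn main() void {
    // Hypothetical request latencies in milliseconds.
    var samples = [_]u64{ 5, 4, 6, 5, 40, 5, 7, 5, 200, 6 };
    std.mem.sort(u64, &samples, {}, std.sort.asc(u64));

    std.debug.print("p50: {} ms\n", .{percentile(&samples, 50)});
    std.debug.print("p95: {} ms\n", .{percentile(&samples, 95)});
    std.debug.print("p99: {} ms\n", .{percentile(&samples, 99)});
}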

Compare Against a Baseline

A benchmark needs comparison.

Example:

old parser: 800 MB/s
new parser: 1.2 GB/s

Now the result has meaning.

Without a baseline, “1.2 GB/s” may be good or bad depending on the workload and hardware.

Good comparisons include:

  • old implementation vs new implementation
  • scalar version vs SIMD version
  • heap allocation version vs buffer reuse version
  • different data layouts
  • different algorithms

Keep Inputs Realistic

Microbenchmarks are useful, but they can lie.

Example:

const input = "hello";

A parser that is fast on "hello" may be slow on real files.

Use realistic inputs:

  • small input
  • medium input
  • large input
  • common case
  • worst case
  • malformed input when relevant

Benchmarking only the happy path gives incomplete information.

Control the Environment

For serious benchmarks, control as much as possible.

Useful practices:

  • close unnecessary programs
  • use the same machine
  • use the same compiler version
  • use the same build mode
  • use the same input files
  • avoid measuring over network when testing CPU work
  • pin CPU frequency if needed
  • run enough iterations

You do not need extreme rigor for every small test, but you should know what can affect the result.

Record Hardware and Software

A benchmark result should include context.

At minimum, record:

Field        Example
CPU          Apple M3 Pro, Ryzen 7950X, etc.
RAM          32 GB
OS           Linux, macOS, Windows
Zig version  0.16.0
Build mode   ReleaseFast
Input        100 MB JSON file
Command      zig build-exe main.zig -O ReleaseFast

Without this, another person cannot reproduce the result.
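
The benchmark can record some of this context itself through the builtin module. A minimal sketch:

const std = @import("std");
const builtin = @import("builtin");

pub fn main() void {
    // All of these are known at compile time.
    std.debug.print("zig: {s}\n", .{builtin.zig_version_string});
    std.debug.print("mode: {s}\n", .{@tagName(builtin.mode)});
    std.debug.print("arch: {s}\n", .{@tagName(builtin.cpu.arch)});
    std.debug.print("os: {s}\n", .{@tagName(builtin.os.tag)});
}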

Measure Throughput and Latency

Two common performance measurements are throughput and latency.

Throughput measures the amount of work completed per unit of time.

MB/s
requests/s
items/s
frames/s

Latency measures the time taken by a single operation.

milliseconds per request
nanoseconds per item
seconds per file

A batch processor often cares about throughput.

An interactive program often cares about latency.

A server usually cares about both.
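
Both views come from the same raw measurement; only the arithmetic differs. A sketch with made-up numbers:

const std = @import("std");

pub fn main() void {
    // Made-up numbers: 100 MB processed as 1,000,000 items in 250 ms.
    const bytes: f64 = 100.0 * 1024.0 * 1024.0;
    const items: f64 = 1_000_000.0;
    const elapsed_ns: f64 = 250_000_000.0;

    // Throughput: work per unit of time.
    const mb_per_s = (bytes / (1024.0 * 1024.0)) / (elapsed_ns / 1e9);

    // Latency: time per unit of work.
    const ns_per_item = elapsed_ns / items;

    std.debug.print("throughput: {d:.0} MB/s\n", .{mb_per_s});
    std.debug.print("latency: {d:.0} ns/item\n", .{ns_per_item});
}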

Avoid Misleading Units

Use units that match the task.

For file processing:

MB/s

For function calls:

ns/op

For servers:

requests/s
p95 latency
p99 latency

For memory:

bytes allocated per operation
allocations per operation
peak memory

Good units make results easier to understand.

Benchmark Memory Too

Time is not the only performance metric.

A faster version may use much more memory.

Example:

Version  Time    Peak Memory
A        100 ms  10 MB
B        70 ms   500 MB

Version B is faster, but its memory use may be unacceptable.

Track memory when it matters.

Important memory metrics:

  • peak memory
  • allocation count
  • bytes allocated
  • cache misses
  • working set size
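
Peak memory is often easiest to observe from outside the process. On Linux, for example, GNU time reports the maximum resident set size of whatever binary you pass it:

/usr/bin/time -v ./main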

Avoid Benchmarking the Wrong Layer

Suppose you want to measure parsing speed.

Bad benchmark:

read file from disk + parse + print output

This measures disk and printing too.

Better:

load the file once
then measure the parser only

But if your real product reads files from disk, also run an end-to-end benchmark.

Use both:

  • component benchmark
  • end-to-end benchmark

They answer different questions.
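
In code, the component benchmark is the setup-separation pattern from earlier. A sketch, assuming a hypothetical data.json input and a stand-in parse function; file-reading APIs have shifted between Zig versions, so treat readFileAlloc as illustrative:

const std = @import("std");

// Stand-in for the parser under test.
fn parse(input: []const u8) usize {
    var depth: usize = 0;
    for (input) |c| {
        if (c == '{') depth += 1;
    }
    return depth;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Setup: read the file once, outside the timed region.
    const input = try std.fs.cwd().readFileAlloc(allocator, "data.json", 200 * 1024 * 1024);
    defer allocator.free(input);

    // Timed region: the parser only.
    var timer = try std.time.Timer.start();
    const result = parse(input);
    const elapsed_ns = timer.read();

    std.mem.doNotOptimizeAway(result);
    std.debug.print("parsed {} bytes in {} ns\n", .{ input.len, elapsed_ns });
}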

Beware I/O Benchmarks

I/O benchmarks are difficult.

File benchmarks are affected by:

  • OS page cache
  • disk type
  • filesystem
  • file size
  • write buffering
  • compression
  • background disk activity

Network benchmarks are affected by:

  • latency
  • packet loss
  • kernel tuning
  • TLS
  • connection reuse
  • remote server behavior

When possible, isolate CPU work from I/O work. Then separately measure end-to-end behavior.

Benchmark Algorithmic Complexity

A benchmark should test scaling.

Do not only test one input size.

Example:

Input Size  Time
1,000       1 ms
10,000      10 ms
100,000     100 ms
1,000,000   1,000 ms

This suggests linear behavior.

But this result is different:

Input Size  Time
1,000       1 ms
10,000      100 ms
100,000     10,000 ms
1,000,000   1,000,000 ms

That suggests quadratic behavior.

Scaling matters more than one isolated number.
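
One way to test scaling is to double the input size and watch the ratio: linear code roughly doubles in time, quadratic code roughly quadruples. A minimal sketch with a deliberately quadratic function:

const std = @import("std");

// Deliberately quadratic: compares every pair of elements.
fn countPairs(items: []const u32) usize {
    var count: usize = 0;
    for (items, 0..) |a, i| {
        for (items[i + 1 ..]) |b| {
            if (a == b) count += 1;
        }
    }
    return count;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Double the size each step: linear code should double in time,
    // quadratic code should quadruple.
    const sizes = [_]usize{ 1_000, 2_000, 4_000, 8_000 };
    for (sizes) |n| {
        const items = try allocator.alloc(u32, n);
        defer allocator.free(items);
        for (items, 0..) |*item, i| item.* = @intCast(i % 97);

        var timer = try std.time.Timer.start();
        std.mem.doNotOptimizeAway(countPairs(items));
        std.debug.print("n = {}: {} ns\n", .{ n, timer.read() });
    }
}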

Do Not Trust Tiny Differences

A 1% improvement may be noise.

Example:

old: 100.2 ms
new: 99.8 ms

That is probably not meaningful unless you have careful repeated measurements.

A 30% improvement is easier to trust.

But even then, verify.

Good benchmarking is skeptical.

Use Profiling with Benchmarking

Benchmarking tells you whether something improved.

Profiling tells you why.

Use both.

Example:

Benchmark result:

new version is 25% faster

Profiler result:

allocation time dropped from 40% to 5%

Now you know the reason.

A Simple Benchmark Harness

Here is a small pattern you can adapt:

const std = @import("std");

fn work(input: []const u8) usize {
    var count: usize = 0;

    for (input) |ch| {
        if (ch == 'x') {
            count += 1;
        }
    }

    return count;
}

pub fn main() !void {
    var data: [1024 * 1024]u8 = undefined;

    for (&data, 0..) |*byte, i| {
        byte.* = if (i % 17 == 0) 'x' else 'a';
    }

    const iterations = 1000;

    var total: usize = 0;

    const start = std.time.nanoTimestamp();

    for (0..iterations) |_| {
        total += work(data[0..]);
    }

    const end = std.time.nanoTimestamp();

    const elapsed_ns = end - start;
    const bytes_processed = data.len * iterations;

    std.debug.print("total: {}\n", .{total});
    std.debug.print("elapsed: {} ns\n", .{elapsed_ns});
    std.debug.print("bytes: {}\n", .{bytes_processed});
}

This benchmark:

  • prepares input before timing
  • repeats work many times
  • uses the result
  • records elapsed time
  • converts the elapsed time and byte count into throughput

Mental Model

A benchmark is an experiment.

A good experiment has:

  • a clear question
  • controlled inputs
  • realistic conditions
  • repeated trials
  • meaningful units
  • a baseline
  • recorded environment
  • skepticism about tiny differences

In Zig, performance is visible and controllable, but you still need disciplined measurement.

Fast code starts with correct measurement.