Benchmark Methodology

Benchmarking means measuring how fast code runs.

A benchmark answers a narrow question:

How long does this operation take under these conditions?

That last part matters. A benchmark is only useful when the conditions are clear. Different inputs, build modes, machines, allocators, and operating system states can produce different results.

Good benchmarking is more than starting a timer. It is designing a measurement that tells the truth.

Benchmark the Right Thing

Before writing a benchmark, define the question.

Bad question:

Is this code fast?

Better question:

How many bytes per second can this parser process on a 100 MB JSON file in ReleaseFast mode?

Better question:

How many requests per second can this server handle with 1,000 concurrent connections?

Better question:

How many nanoseconds does this small function take when called 10 million times?

A vague benchmark gives vague results.

A precise benchmark gives useful results.

Use Release Builds

Never benchmark Debug mode unless you are measuring Debug mode itself.

Debug builds include safety checks and perform little optimization, so their numbers say little about release performance.

Use:

zig build-exe main.zig -O ReleaseFast

or, when you want safety checks with optimization:

zig build-exe main.zig -O ReleaseSafe
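
If the project uses a standard build script, the same modes are typically selected through the optimize option:

zig build -Doptimize=ReleaseFast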

Record the mode with the result.

A result without the build mode is incomplete.

Make the Work Large Enough

A benchmark must run long enough to measure.

This is too small:

const start = std.time.nanoTimestamp();
const x = add(1, 2);
const end = std.time.nanoTimestamp();

std.debug.print("{} ns\n", .{end - start});

The timer's overhead and resolution may both be larger than the work itself.

Instead, repeat the work many times:

const std = @import("std");

fn add(a: u64, b: u64) u64 {
    return a + b;
}

pub fn main() void {
    const iterations = 100_000_000;

    const start = std.time.nanoTimestamp();

    var sum: u64 = 0;
    for (0..iterations) |i| {
        sum += add(i, 1);
    }

    const end = std.time.nanoTimestamp();

    std.debug.print("sum: {}\n", .{sum});
    std.debug.print("elapsed: {} ns\n", .{end - start});
}

The benchmark runs enough work for the timing to be meaningful.
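
std.time.nanoTimestamp reads the wall clock, which can jump when the system clock is adjusted. std.time.Timer uses a monotonic clock, which is a better default for timing. A minimal sketch of the same loop:

const std = @import("std");

pub fn main() !void {
    // Timer.start() verifies a monotonic clock is available.
    var timer = try std.time.Timer.start();

    var sum: u64 = 0;
    for (0..100_000_000) |i| {
        sum += i + 1;
    }

    // read() returns nanoseconds elapsed since start() as a u64.
    const elapsed_ns = timer.read();

    std.debug.print("sum: {}\n", .{sum});
    std.debug.print("elapsed: {} ns\n", .{elapsed_ns});
}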

Prevent Dead Code Elimination

Optimizing compilers remove useless work.

This benchmark may be invalid:

for (0..1000000) |i| {
    _ = i * i;
}

The result is unused. The compiler may remove the loop entirely.

Use the result in a visible way:

var sum: u64 = 0;

for (0..1000000) |i| {
    sum += i * i;
}

std.debug.print("{}\n", .{sum});

Now the compiler must preserve the computation.
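
Printing is one option. The standard library also provides std.mem.doNotOptimizeAway, which marks a value as used without printing it. A minimal sketch:

const std = @import("std");

pub fn main() void {
    var sum: u64 = 0;
    for (0..1_000_000) |i| {
        sum += i * i;
    }

    // Marks sum as observed so the optimizer cannot delete the loop.
    std.mem.doNotOptimizeAway(sum);
}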

Separate Setup from Measurement

Do not include setup time unless setup is part of what you want to measure.

Bad:

const start = std.time.nanoTimestamp();

const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);

fillInput(input);
process(input);

const end = std.time.nanoTimestamp();

This measures allocation, input generation, and processing together.

Better:

const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);

fillInput(input);

const start = std.time.nanoTimestamp();
process(input);
const end = std.time.nanoTimestamp();

Now the timed region measures only process.

Sometimes you do want total end-to-end time. That is fine. Just name it correctly.

Run Benchmarks Multiple Times

One run is not enough.

Performance varies because of:

  • operating system scheduling
  • CPU frequency scaling
  • background processes
  • disk cache
  • memory layout
  • thermal throttling

Run multiple trials.

Example output:

run 1: 104 ms
run 2: 101 ms
run 3: 103 ms
run 4: 102 ms
run 5: 101 ms

This is more trustworthy than one number.
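
A small harness can run the trials for you. A minimal sketch that times five runs and reports the minimum and median (process is a stand-in workload):

const std = @import("std");

fn process() u64 {
    var sum: u64 = 0;
    for (0..10_000_000) |i| {
        sum += i;
    }
    return sum;
}

pub fn main() !void {
    const trials = 5;
    var times: [trials]u64 = undefined;

    for (&times, 0..) |*t, run| {
        var timer = try std.time.Timer.start();
        std.mem.doNotOptimizeAway(process());
        t.* = timer.read();
        std.debug.print("run {}: {} ns\n", .{ run + 1, t.* });
    }

    // Sorting makes the minimum and median easy to report.
    std.mem.sort(u64, &times, {}, std.sort.asc(u64));
    std.debug.print("min: {} ns  median: {} ns\n", .{ times[0], times[trials / 2] });
}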

Look at Distribution, Not Only Average

The average can hide important behavior.

Suppose you measure request latency:

average: 10 ms

That sounds good.

But the distribution may be:

p50: 5 ms
p95: 40 ms
p99: 200 ms

For servers, games, databases, and interactive tools, tail latency matters.

A program that is usually fast but sometimes very slow may still be bad.
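
Percentiles fall out of sorted samples. A minimal sketch using the nearest-rank convention, with hypothetical latencies in milliseconds:

const std = @import("std");

// Nearest-rank percentile of a sorted slice: one common convention among several.
fn percentile(sorted: []const u64, p: f64) u64 {
    const n: f64 = @floatFromInt(sorted.len);
    const rank: usize = @intFromFloat(@ceil(p / 100.0 * n));
    return sorted[@min(rank -| 1, sorted.len - 1)];
}

pub fn main() void {
    // Hypothetical request latencies in milliseconds.
    var samples = [_]u64{ 5, 4, 6, 5, 40, 5, 7, 5, 200, 6 };
    std.mem.sort(u64, &samples, {}, std.sort.asc(u64));

    std.debug.print("p50: {} ms\n", .{percentile(&samples, 50)});
    std.debug.print("p95: {} ms\n", .{percentile(&samples, 95)});
    std.debug.print("p99: {} ms\n", .{percentile(&samples, 99)});
}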

Compare Against a Baseline

A benchmark needs comparison.

Example:

old parser: 800 MB/s
new parser: 1.2 GB/s

Now the result has meaning.

Without a baseline, “1.2 GB/s” may be good or bad depending on the workload and hardware.

Good comparisons include:

  • old implementation vs new implementation
  • scalar version vs SIMD version
  • heap allocation version vs buffer reuse version
  • different data layouts
  • different algorithms

Keep Inputs Realistic

Microbenchmarks are useful, but they can lie.

Example:

const input = "hello";

A parser that is fast on "hello" may be slow on real files.

Use realistic inputs:

  • small input
  • medium input
  • large input
  • common case
  • worst case
  • malformed input when relevant

Benchmarking only the happy path gives incomplete information.

Control the Environment

For serious benchmarks, control as much as possible.

Useful practices:

  • close unnecessary programs
  • use the same machine
  • use the same compiler version
  • use the same build mode
  • use the same input files
  • avoid measuring over network when testing CPU work
  • pin CPU frequency if needed
  • run enough iterations

You do not need extreme rigor for every small test, but you should know what can affect the result.

Record Hardware and Software

A benchmark result should include context.

At minimum, record:

Field        Example
CPU          Apple M3 Pro, Ryzen 7950X, etc.
RAM          32 GB
OS           Linux, macOS, Windows
Zig version  0.16.0
Build mode   ReleaseFast
Input        100 MB JSON file
Command      zig build-exe main.zig -O ReleaseFast

Without this, another person cannot reproduce the result.
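
The benchmark can record some of this context itself through the builtin module. A minimal sketch:

const std = @import("std");
const builtin = @import("builtin");

pub fn main() void {
    // All of these are known at compile time.
    std.debug.print("zig: {s}\n", .{builtin.zig_version_string});
    std.debug.print("mode: {s}\n", .{@tagName(builtin.mode)});
    std.debug.print("arch: {s}\n", .{@tagName(builtin.cpu.arch)});
    std.debug.print("os: {s}\n", .{@tagName(builtin.os.tag)});
}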

Measure Throughput and Latency

Two common performance measurements are throughput and latency.

Throughput measures the amount of work completed per unit of time.

MB/s
requests/s
items/s
frames/s

Latency measures the time taken by a single operation.

milliseconds per request
nanoseconds per item
seconds per file

A batch processor often cares about throughput.

An interactive program often cares about latency.

A server usually cares about both.
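
Both views come from the same raw measurement; only the arithmetic differs. A sketch with made-up numbers:

const std = @import("std");

pub fn main() void {
    // Made-up numbers: 100 MB processed as 1,000,000 items in 250 ms.
    const bytes: f64 = 100.0 * 1024.0 * 1024.0;
    const items: f64 = 1_000_000.0;
    const elapsed_ns: f64 = 250_000_000.0;

    // Throughput: work per unit of time.
    const mb_per_s = (bytes / (1024.0 * 1024.0)) / (elapsed_ns / 1e9);

    // Latency: time per unit of work.
    const ns_per_item = elapsed_ns / items;

    std.debug.print("throughput: {d:.0} MB/s\n", .{mb_per_s});
    std.debug.print("latency: {d:.0} ns/item\n", .{ns_per_item});
}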

Avoid Misleading Units

Use units that match the task.

For file processing:

MB/s

For function calls:

ns/op

For servers:

requests/s
p95 latency
p99 latency

For memory:

bytes allocated per operation
allocations per operation
peak memory

Good units make results easier to understand.

Benchmark Memory Too

Time is not the only performance metric.

A faster version may use much more memory.

Example:

Version  Time    Peak Memory
A        100 ms  10 MB
B        70 ms   500 MB

Version B is faster, but its memory use may be unacceptable.

Track memory when it matters.

Important memory metrics:

  • peak memory
  • allocation count
  • bytes allocated
  • cache misses
  • working set size
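
Peak memory is often easiest to observe from outside the process. On Linux, for example, GNU time reports the maximum resident set size of whatever binary you pass it:

/usr/bin/time -v ./main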

Avoid Benchmarking the Wrong Layer

Suppose you want to measure parsing speed.

Bad benchmark:

read file from disk + parse + print output

This measures disk and printing too.

Better:

load the file once
then measure the parser only

But if your real product reads files from disk, also run an end-to-end benchmark.

Use both:

  • component benchmark
  • end-to-end benchmark

They answer different questions.
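
In code, the component benchmark is the setup-separation pattern from earlier. A sketch, assuming a hypothetical data.json input and a stand-in parse function; file-reading APIs have shifted between Zig versions, so treat readFileAlloc as illustrative:

const std = @import("std");

// Stand-in for the parser under test.
fn parse(input: []const u8) usize {
    var depth: usize = 0;
    for (input) |c| {
        if (c == '{') depth += 1;
    }
    return depth;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Setup: read the file once, outside the timed region.
    const input = try std.fs.cwd().readFileAlloc(allocator, "data.json", 200 * 1024 * 1024);
    defer allocator.free(input);

    // Timed region: the parser only.
    var timer = try std.time.Timer.start();
    const result = parse(input);
    const elapsed_ns = timer.read();

    std.mem.doNotOptimizeAway(result);
    std.debug.print("parsed {} bytes in {} ns\n", .{ input.len, elapsed_ns });
}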

Beware I/O Benchmarks

I/O benchmarks are difficult.

File benchmarks are affected by:

  • OS page cache
  • disk type
  • filesystem
  • file size
  • write buffering
  • compression
  • background disk activity

Network benchmarks are affected by:

  • latency
  • packet loss
  • kernel tuning
  • TLS
  • connection reuse
  • remote server behavior

When possible, isolate CPU work from I/O work. Then separately measure end-to-end behavior.

Benchmark Algorithmic Complexity

A benchmark should test scaling.

Do not only test one input size.

Example:

Input Size  Time
1,000       1 ms
10,000      10 ms
100,000     100 ms
1,000,000   1,000 ms

This suggests linear behavior.

But this result is different:

Input Size  Time
1,000       1 ms
10,000      100 ms
100,000     10,000 ms
1,000,000   1,000,000 ms

That suggests quadratic behavior.

Scaling matters more than one isolated number.
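
One way to test scaling is to double the input size and watch the ratio: linear code roughly doubles in time, quadratic code roughly quadruples. A minimal sketch with a deliberately quadratic function:

const std = @import("std");

// Deliberately quadratic: compares every pair of elements.
fn countPairs(items: []const u32) usize {
    var count: usize = 0;
    for (items, 0..) |a, i| {
        for (items[i + 1 ..]) |b| {
            if (a == b) count += 1;
        }
    }
    return count;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Double the size each step: linear code should double in time,
    // quadratic code should quadruple.
    const sizes = [_]usize{ 1_000, 2_000, 4_000, 8_000 };
    for (sizes) |n| {
        const items = try allocator.alloc(u32, n);
        defer allocator.free(items);
        for (items, 0..) |*item, i| item.* = @intCast(i % 97);

        var timer = try std.time.Timer.start();
        std.mem.doNotOptimizeAway(countPairs(items));
        std.debug.print("n = {}: {} ns\n", .{ n, timer.read() });
    }
}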

Do Not Trust Tiny Differences

A 1% improvement may be noise.

Example:

old: 100.2 ms
new: 99.8 ms

That is probably not meaningful unless you have careful repeated measurements.

A 30% improvement is easier to trust.

But even then, verify.

Good benchmarking is skeptical.

Use Profiling with Benchmarking

Benchmarking tells you whether something improved.

Profiling tells you why.

Use both.

Example:

Benchmark result:

new version is 25% faster

Profiler result:

allocation time dropped from 40% to 5%

Now you know the reason.

A Simple Benchmark Harness

Here is a small pattern you can adapt:

const std = @import("std");

fn work(input: []const u8) usize {
    var count: usize = 0;

    for (input) |ch| {
        if (ch == 'x') {
            count += 1;
        }
    }

    return count;
}

pub fn main() !void {
    var data: [1024 * 1024]u8 = undefined;

    for (&data, 0..) |*byte, i| {
        byte.* = if (i % 17 == 0) 'x' else 'a';
    }

    const iterations = 1000;

    var total: usize = 0;

    const start = std.time.nanoTimestamp();

    for (0..iterations) |_| {
        total += work(data[0..]);
    }

    const end = std.time.nanoTimestamp();

    const elapsed_ns = end - start;
    const bytes_processed = data.len * iterations;

    std.debug.print("total: {}\n", .{total});
    std.debug.print("elapsed: {} ns\n", .{elapsed_ns});
    std.debug.print("bytes: {}\n", .{bytes_processed});
}

This benchmark:

  • prepares input before timing
  • repeats work many times
  • uses the result
  • records elapsed time
  • converts the elapsed time and byte count into throughput

Mental Model

A benchmark is an experiment.

A good experiment has:

  • a clear question
  • controlled inputs
  • realistic conditions
  • repeated trials
  • meaningful units
  • a baseline
  • recorded environment
  • skepticism about tiny differences

In Zig, performance is visible and controllable, but you still need disciplined measurement.

Fast code starts with correct measurement.