Benchmark Methodology
Benchmarking means measuring how fast code runs.
A benchmark answers a narrow question:
How long does this operation take under these conditions?
That last part matters. A benchmark is only useful when the conditions are clear. Different inputs, build modes, machines, allocators, and operating system states can produce different results.
Good benchmarking is not only running a timer. It is designing a measurement that tells the truth.
Benchmark the Right Thing
Before writing a benchmark, define the question.
Bad question:
Is this code fast?
Better question:
How many bytes per second can this parser process on a 100 MB JSON file in ReleaseFast mode?
Better question:
How many requests per second can this server handle with 1,000 concurrent connections?
Better question:
How many nanoseconds does this small function take when called 10 million times?
A vague benchmark gives vague results.
A precise benchmark gives useful results.
Use Release Builds
Never benchmark Debug mode unless you are measuring Debug mode itself.
Debug builds include extra checks and less optimization.
Use:
```sh
zig build-exe main.zig -O ReleaseFast
```

or, when you want safety checks with optimization:

```sh
zig build-exe main.zig -O ReleaseSafe
```

Record the mode with the result.
A result without the build mode is incomplete.
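If the project builds through a build.zig that uses the standard option helpers, the usual equivalent is:

```sh
zig build -Doptimize=ReleaseFast
```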
Make the Work Large Enough
A benchmark must run long enough to measure.
This is too small:
```zig
const start = std.time.nanoTimestamp();
const x = add(1, 2);
const end = std.time.nanoTimestamp();
std.debug.print("x = {}, elapsed: {} ns\n", .{ x, end - start });
```

The timer overhead may be larger than the work.
Instead, repeat the work many times:
```zig
const std = @import("std");

fn add(a: u64, b: u64) u64 {
    return a + b;
}

pub fn main() void {
    const iterations = 100_000_000;

    const start = std.time.nanoTimestamp();
    var sum: u64 = 0;
    for (0..iterations) |i| {
        sum += add(i, 1);
    }
    const end = std.time.nanoTimestamp();

    std.debug.print("sum: {}\n", .{sum});
    std.debug.print("elapsed: {} ns\n", .{end - start});
}
```

The benchmark runs enough work for the timing to be meaningful.
Prevent Dead Code Elimination
Optimizing compilers remove useless work.
This benchmark may be invalid:
```zig
for (0..1000000) |i| {
    _ = i * i;
}
```

The result is unused. The compiler may remove the loop entirely.
Use the result in a visible way:
```zig
var sum: u64 = 0;
for (0..1000000) |i| {
    sum += i * i;
}
std.debug.print("{}\n", .{sum});
```

Now the compiler must preserve the computation.
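Printing is not the only option. Recent Zig versions provide std.mem.doNotOptimizeAway, which keeps a value alive without doing I/O. A minimal sketch:

```zig
const std = @import("std");

pub fn main() void {
    var sum: u64 = 0;
    for (0..1_000_000) |i| {
        sum += i * i;
    }
    // Marks the value as observed so the optimizer cannot delete the loop.
    std.mem.doNotOptimizeAway(sum);
}
```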
Separate Setup from Measurement
Do not include setup time unless setup is part of what you want to measure.
Bad:
```zig
const start = std.time.nanoTimestamp();
const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);
fillInput(input);
process(input);
const end = std.time.nanoTimestamp();
```

This measures allocation, input generation, and processing together.
Better:
```zig
const input = try allocator.alloc(u8, 1024 * 1024);
defer allocator.free(input);
fillInput(input);

const start = std.time.nanoTimestamp();
process(input);
const end = std.time.nanoTimestamp();
```

Now the timed region measures only `process`.
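Here is the same pattern as a complete program. fillInput and process are hypothetical stand-ins for your real setup and work functions:

```zig
const std = @import("std");

// Hypothetical setup: fill the buffer with deterministic data.
fn fillInput(buf: []u8) void {
    for (buf, 0..) |*b, i| b.* = @truncate(i);
}

// Hypothetical work under test: sum the bytes.
fn process(buf: []const u8) u64 {
    var sum: u64 = 0;
    for (buf) |b| sum += b;
    return sum;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Setup: allocation and input generation stay outside the timed region.
    const input = try allocator.alloc(u8, 1024 * 1024);
    defer allocator.free(input);
    fillInput(input);

    // Timed region: only the work under test.
    const start = std.time.nanoTimestamp();
    const result = process(input);
    const end = std.time.nanoTimestamp();

    std.debug.print("result: {}, elapsed: {} ns\n", .{ result, end - start });
}
```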
Sometimes you do want total end-to-end time. That is fine. Just name it correctly.
Run Benchmarks Multiple Times
One run is not enough.
Performance varies because of:
- operating system scheduling
- CPU frequency scaling
- background processes
- disk cache
- memory layout
- thermal throttling
Run multiple trials.
Example output:
```
run 1: 104 ms
run 2: 101 ms
run 3: 103 ms
run 4: 102 ms
run 5: 101 ms
```

This is more trustworthy than one number.
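A simple trial loop produces output like that. A minimal sketch, where runOnce is a hypothetical stand-in for the operation under test:

```zig
const std = @import("std");

// Hypothetical workload; replace with the operation under test.
fn runOnce() u64 {
    var sum: u64 = 0;
    for (0..10_000_000) |i| sum += i;
    return sum;
}

pub fn main() void {
    var sink: u64 = 0;
    for (0..5) |trial| {
        const start = std.time.nanoTimestamp();
        sink +%= runOnce();
        const end = std.time.nanoTimestamp();
        const elapsed_ms = @divTrunc(end - start, std.time.ns_per_ms);
        std.debug.print("run {}: {} ms\n", .{ trial + 1, elapsed_ms });
    }
    // Keep the accumulated result alive so the work is not optimized out.
    std.mem.doNotOptimizeAway(sink);
}
```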
Look at Distribution, Not Only Average
The average can hide important behavior.
Suppose you measure request latency:
```
average: 10 ms
```

That sounds good.
But the distribution may be:
```
p50: 5 ms
p95: 40 ms
p99: 200 ms
```

For servers, games, databases, and interactive tools, tail latency matters.
A program that is usually fast but sometimes very slow may still be bad.
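Percentiles are easy to compute once you record every sample. A minimal sketch using the simple nearest-rank method (sort the samples, then index):

```zig
const std = @import("std");

// Nearest-rank percentile over a sorted slice of samples.
fn percentile(sorted: []const u64, p: usize) u64 {
    const idx = (sorted.len - 1) * p / 100;
    return sorted[idx];
}

pub fn main() void {
    // Hypothetical latency samples in milliseconds.
    var samples = [_]u64{ 5, 4, 6, 5, 40, 5, 7, 5, 200, 6 };
    std.mem.sort(u64, &samples, {}, std.sort.asc(u64));

    std.debug.print("p50: {} ms\n", .{percentile(&samples, 50)});
    std.debug.print("p95: {} ms\n", .{percentile(&samples, 95)});
    std.debug.print("p99: {} ms\n", .{percentile(&samples, 99)});
}
```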
Compare Against a Baseline
A benchmark needs comparison.
Example:
```
old parser: 800 MB/s
new parser: 1.2 GB/s
```

Now the result has meaning.
Without a baseline, “1.2 GB/s” may be good or bad depending on the workload and hardware.
Good comparisons include:
- old implementation vs new implementation
- scalar version vs SIMD version
- heap allocation version vs buffer reuse version
- different data layouts
- different algorithms
Keep Inputs Realistic
Microbenchmarks are useful, but they can lie.
Example:
```zig
const input = "hello";
```

A parser that is fast on "hello" may be slow on real files.
Use realistic inputs:
- small input
- medium input
- large input
- common case
- worst case
- malformed input when relevant
Benchmarking only the happy path gives incomplete information.
Control the Environment
For serious benchmarks, control as much as possible.
Useful practices:
- close unnecessary programs
- use the same machine
- use the same compiler version
- use the same build mode
- use the same input files
- avoid measuring over network when testing CPU work
- pin CPU frequency if needed
- run enough iterations
You do not need extreme rigor for every small test, but you should know what can affect the result.
Record Hardware and Software
A benchmark result should include context.
At minimum, record:
| Field | Example |
|---|---|
| CPU | Apple M3 Pro, Ryzen 7950X, etc. |
| RAM | 32 GB |
| OS | Linux, macOS, Windows |
| Zig version | 0.16.0 |
| Build mode | ReleaseFast |
| Input | 100 MB JSON file |
| Command | zig build-exe main.zig -O ReleaseFast |
Without this, another person cannot reproduce the result.
Measure Throughput and Latency
Two common performance measurements are throughput and latency.
Throughput measures work completed per unit of time:

```
MB/s
requests/s
items/s
frames/s
```

Latency measures the time for one operation:

```
milliseconds per request
nanoseconds per item
seconds per file
```

A batch processor often cares about throughput.
An interactive program often cares about latency.
A server usually cares about both.
Avoid Misleading Units
Use units that match the task.
For file processing:
```
MB/s
```

For function calls:

```
ns/op
```

For servers:

```
requests/s
p95 latency
p99 latency
```

For memory:

```
bytes allocated per operation
allocations per operation
peak memory
```

Good units make results easier to understand.
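Deriving these units from raw measurements is simple arithmetic. A minimal sketch, assuming you already have elapsed nanoseconds, a byte count, and an operation count:

```zig
const std = @import("std");

pub fn main() void {
    // Hypothetical raw measurements.
    const elapsed_ns: u64 = 1_250_000_000; // 1.25 seconds
    const bytes: u64 = 1024 * 1024 * 1024; // 1 GiB processed
    const ops: u64 = 10_000_000;

    // Bytes per nanosecond equals GB/s, so scale by 1000 for MB/s.
    const mb_per_s = bytes * 1000 / elapsed_ns;
    const ns_per_op = elapsed_ns / ops;

    std.debug.print("throughput: {} MB/s\n", .{mb_per_s});
    std.debug.print("latency: {} ns/op\n", .{ns_per_op});
}
```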
Benchmark Memory Too
Time is not the only performance metric.
A faster version may use much more memory.
Example:
| Version | Time | Peak Memory |
|---|---|---|
| A | 100 ms | 10 MB |
| B | 70 ms | 500 MB |
Version B is faster, but may be unacceptable.
Track memory when it matters.
Important memory metrics:
- peak memory
- allocation count
- bytes allocated
- cache misses
- working set size
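Allocation count and bytes allocated can be tracked by hand at the call sites. A minimal sketch with a hypothetical allocCounted helper (a wrapping allocator is the more thorough approach):

```zig
const std = @import("std");

// Hypothetical helper: tallies allocation count and requested bytes.
fn allocCounted(
    allocator: std.mem.Allocator,
    comptime T: type,
    n: usize,
    count: *usize,
    bytes: *usize,
) ![]T {
    count.* += 1;
    bytes.* += n * @sizeOf(T);
    return allocator.alloc(T, n);
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    var count: usize = 0;
    var bytes: usize = 0;

    const buf = try allocCounted(allocator, u8, 4096, &count, &bytes);
    defer allocator.free(buf);

    std.debug.print("allocations: {}, bytes: {}\n", .{ count, bytes });
}
```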
Avoid Benchmarking the Wrong Layer
Suppose you want to measure parsing speed.
Bad benchmark:
```
read file from disk + parse + print output
```

This measures disk and printing too.
Better:
```
load file once
then measure parser only
```

But if your real product reads files from disk, also run an end-to-end benchmark.
Use both:
- component benchmark
- end-to-end benchmark
They answer different questions.
Beware I/O Benchmarks
I/O benchmarks are difficult.
File benchmarks are affected by:
- OS page cache
- disk type
- filesystem
- file size
- write buffering
- compression
- background disk activity
Network benchmarks are affected by:
- latency
- packet loss
- kernel tuning
- TLS
- connection reuse
- remote server behavior
When possible, isolate CPU work from I/O work. Then separately measure end-to-end behavior.
Benchmark Algorithmic Complexity
A benchmark should test scaling.
Do not test only one input size.
Example:
| Input Size | Time |
|---|---|
| 1,000 | 1 ms |
| 10,000 | 10 ms |
| 100,000 | 100 ms |
| 1,000,000 | 1,000 ms |
This suggests linear behavior.
But this result is different:
| Input Size | Time |
|---|---|
| 1,000 | 1 ms |
| 10,000 | 100 ms |
| 100,000 | 10,000 ms |
| 1,000,000 | 1,000,000 ms |
That suggests quadratic behavior.
Scaling matters more than one isolated number.
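A scaling table like the ones above comes from running the same workload at several sizes. A minimal sketch, with a hypothetical quadratic workload so the growth is visible:

```zig
const std = @import("std");

// Hypothetical O(n^2) workload: compare every pair of bytes.
fn countEqualPairs(data: []const u8) u64 {
    var count: u64 = 0;
    for (data) |a| {
        for (data) |b| {
            if (a == b) count += 1;
        }
    }
    return count;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const sizes = [_]usize{ 1_000, 2_000, 4_000, 8_000 };

    for (sizes) |n| {
        const data = try allocator.alloc(u8, n);
        defer allocator.free(data);
        for (data, 0..) |*b, i| b.* = @truncate(i);

        const start = std.time.nanoTimestamp();
        const result = countEqualPairs(data);
        const end = std.time.nanoTimestamp();

        std.mem.doNotOptimizeAway(result);
        std.debug.print("n = {}: {} ns\n", .{ n, end - start });
    }
}
```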
Do Not Trust Tiny Differences
A 1% improvement may be noise.
Example:
```
old: 100.2 ms
new: 99.8 ms
```

That is probably not meaningful unless you have careful repeated measurements.
A 30% improvement is easier to trust.
But even then, verify.
Good benchmarking is skeptical.
Use Profiling with Benchmarking
Benchmarking tells you whether something improved.
Profiling tells you why.
Use both.
Example:
Benchmark result:

```
new version is 25% faster
```

Profiler result:

```
allocation time dropped from 40% to 5%
```

Now you know the reason.
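On Linux, a sampling profiler such as perf is one way to get that kind of breakdown. A sketch, assuming your benchmark binary is named bench:

```sh
perf record ./bench   # sample where time is spent while the benchmark runs
perf report           # inspect the hottest functions
```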
A Simple Benchmark Harness
Here is a small pattern you can adapt:
```zig
const std = @import("std");

fn work(input: []const u8) usize {
    var count: usize = 0;
    for (input) |ch| {
        if (ch == 'x') {
            count += 1;
        }
    }
    return count;
}

pub fn main() !void {
    var data: [1024 * 1024]u8 = undefined;
    for (&data, 0..) |*byte, i| {
        byte.* = if (i % 17 == 0) 'x' else 'a';
    }

    const iterations = 1000;
    var total: usize = 0;

    const start = std.time.nanoTimestamp();
    for (0..iterations) |_| {
        total += work(data[0..]);
    }
    const end = std.time.nanoTimestamp();

    const elapsed_ns = end - start;
    const bytes_processed = data.len * iterations;

    std.debug.print("total: {}\n", .{total});
    std.debug.print("elapsed: {} ns\n", .{elapsed_ns});
    std.debug.print("bytes: {}\n", .{bytes_processed});
}
```

This benchmark:
- prepares input before timing
- repeats work many times
- uses the result
- records elapsed time
- exposes enough data to calculate throughput
Mental Model
A benchmark is an experiment.
A good experiment has:
- a clear question
- controlled inputs
- realistic conditions
- repeated trials
- meaningful units
- a baseline
- recorded environment
- skepticism about tiny differences
In Zig, performance is visible and controllable, but you still need disciplined measurement.
Fast code starts with correct measurement.