Designing Concurrent Programs

Threads are a mechanism, not a design.

A program should not begin with the question, “How many threads do I need?” It should begin with the data.

Ask first:

What data exists?
Who owns it?
Which thread may change it?
Which thread may read it?
How does one thread notify another?

Most concurrency bugs come from unclear ownership.

A simple rule is useful: each mutable object should have exactly one owning thread.

Other threads may send messages to the owner, or they may access the object only through a small synchronized interface.

This is poor design:

var count: u32 = 0;
var failed: bool = false;
var path_buffer: [4096]u8 = undefined;
var mutex = std.Thread.Mutex{};

The data and the lock are separate. It is not clear which fields the mutex protects.

This is better:

const State = struct {
    mutex: std.Thread.Mutex = .{},
    count: u32 = 0,
    failed: bool = false,
    path_buffer: [4096]u8 = undefined,

    fn addOne(self: *State) void {
        self.mutex.lock();
        defer self.mutex.unlock();

        self.count += 1;
    }

    fn markFailed(self: *State) void {
        self.mutex.lock();
        defer self.mutex.unlock();

        self.failed = true;
    }
};

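The lock and the protected data are together. The methods define the allowed operations. Zig has no private fields, so this is a convention: code outside State should call the methods rather than touch count or failed directly.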

Keep critical sections small.

Do not hold a mutex while doing slow work:

state.mutex.lock();
defer state.mutex.unlock();

try readLargeFile();
try writeNetworkResponse();
state.count += 1;

This blocks every other thread that needs the state.

Do the slow work outside the lock:

try readLargeFile();
try writeNetworkResponse();

state.mutex.lock();
defer state.mutex.unlock();

state.count += 1;

A lock should protect memory, not waiting time.
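When the slow work needs values from the shared state, one option is to copy them out under the lock and then work on the copy. The following is a minimal sketch of that idea. The Stats struct, its Snapshot type, and the snapshot method are hypothetical names chosen for illustration, not part of any library.

```zig
const std = @import("std");

// Hypothetical shared state, used only for this illustration.
const Stats = struct {
    mutex: std.Thread.Mutex = .{},
    count: u32 = 0,
    failed: bool = false,

    // Plain copy of the protected fields. It is private to the caller,
    // so it can be used without holding the lock.
    const Snapshot = struct { count: u32, failed: bool };

    fn snapshot(self: *Stats) Snapshot {
        self.mutex.lock();
        defer self.mutex.unlock();

        // The copy is made first; the deferred unlock runs afterwards.
        return .{ .count = self.count, .failed = self.failed };
    }
};

pub fn main() void {
    var stats = Stats{};
    stats.count = 3; // single-threaded setup: no other thread exists yet

    // The lock is held only for the duration of the copy.
    const snap = stats.snapshot();

    // Slow work (here, just printing) runs on the copy with no lock held.
    std.debug.print("count = {d}\n", .{snap.count});
}
```

The copy may be stale by the time it is used. That is acceptable for reporting and logging; it is not acceptable when a later update depends on the value that was read.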

Prefer local data.

Local variables belong to the current thread. They need no lock.

fn worker() void {
    var local_count: u32 = 0;

    while (local_count < 1000) : (local_count += 1) {
        // no synchronization needed
    }
}

If many threads need to count work, let each thread count locally, then merge once.

const std = @import("std");

const State = struct {
    mutex: std.Thread.Mutex = .{},
    total: u64 = 0,

    fn add(self: *State, n: u64) void {
        self.mutex.lock();
        defer self.mutex.unlock();

        self.total += n;
    }
};

fn worker(state: *State) void {
    var local: u64 = 0;

    var i: u64 = 0;
    while (i < 1000000) : (i += 1) {
        local += 1;
    }

    state.add(local);
}

pub fn main() !void {
    var state = State{};

    const a = try std.Thread.spawn(.{}, worker, .{&state});
    const b = try std.Thread.spawn(.{}, worker, .{&state});

    a.join();
    b.join();

    std.debug.print("total = {d}\n", .{state.total});
}

This program takes the lock only twice, once per worker.

That is better than locking two million times.

Avoid sharing when message passing is enough.

A worker can receive a job, produce a result, and send the result back. It does not need to see the whole program state.

This shape is easier to reason about:

main thread -> jobs -> workers -> results -> main thread

The worker owns the job while processing it.

The main thread owns the result after receiving it.

Ownership moves. Shared mutation is reduced.
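The simplest form of this shape needs no queue at all. In the sketch below, spawn hands a job to a worker and join hands the result back. The Job struct and its fields are hypothetical, chosen for illustration. Each worker touches only its own job, so no lock appears.

```zig
const std = @import("std");

// Hypothetical job type: main fills in `input`, the worker fills in `result`.
const Job = struct {
    input: []const u32,
    result: u64 = 0,
};

// The worker owns `job` from the moment it is spawned until it returns.
fn worker(job: *Job) void {
    var sum: u64 = 0;
    for (job.input) |n| sum += n;
    job.result = sum;
}

const job_count = 2;

pub fn main() !void {
    const data = [_]u32{ 1, 2, 3, 4, 5, 6 };

    // One job per worker; no two workers share a job.
    var jobs = [job_count]Job{
        .{ .input = data[0..3] },
        .{ .input = data[3..6] },
    };

    var threads: [job_count]std.Thread = undefined;
    for (&threads, &jobs) |*t, *job| {
        // Ownership of the job moves to the worker here.
        t.* = try std.Thread.spawn(.{}, worker, .{job});
    }

    var total: u64 = 0;
    for (threads, &jobs) |t, *job| {
        t.join(); // ownership of the job returns to main here
        total += job.result;
    }

    std.debug.print("total = {d}\n", .{total});
}
```

Main must not read or write a job between spawn and join. A program with a continuous stream of jobs would need a real queue, but the ownership rule would be the same.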

Be careful with pointer lifetimes.

This is dangerous:

fn startThread() !std.Thread {
    var value: u32 = 42;
    return try std.Thread.spawn(.{}, worker, .{&value});
}

The variable value is local to startThread. Once the function returns, its stack storage is gone and the pointer dangles. The worker may read invalid memory.

The data passed to a thread must live until the thread finishes.

This is safe:

pub fn main() !void {
    var value: u32 = 42;

    const thread = try std.Thread.spawn(.{}, worker, .{&value});
    thread.join();
}

The thread is joined before value goes out of scope.

For long-running threads, store shared state in an object whose lifetime is clearly longer than the threads using it.
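One way to make that lifetime visible is to let a single struct own both the state and the thread, and to join the thread in the struct's stop method. The sketch below assumes a hypothetical Ticker type; the names are illustrative. It also assumes a Zig version in which std.atomic.Value and the lowercase ordering names (.acquire, .release, .monotonic) are available.

```zig
const std = @import("std");

// Hypothetical service: the struct owns the shared state and the thread.
const Ticker = struct {
    stop_requested: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    ticks: std.atomic.Value(u64) = std.atomic.Value(u64).init(0),
    thread: ?std.Thread = null,

    // `self` must stay at the same address until stop() has returned.
    fn start(self: *Ticker) !void {
        self.thread = try std.Thread.spawn(.{}, run, .{self});
    }

    fn run(self: *Ticker) void {
        while (!self.stop_requested.load(.acquire)) {
            _ = self.ticks.fetchAdd(1, .monotonic);
            std.Thread.yield() catch {};
        }
    }

    // Must be called before the Ticker goes out of scope.
    fn stop(self: *Ticker) void {
        self.stop_requested.store(true, .release);
        if (self.thread) |t| {
            t.join();
            self.thread = null;
        }
    }
};

pub fn main() !void {
    // The ticker lives in main's frame. It outlives the thread because
    // stop() joins the thread before main returns.
    var ticker = Ticker{};
    try ticker.start();

    // ... the rest of the program would run here ...

    ticker.stop();
    std.debug.print("ticks = {d}\n", .{ticker.ticks.load(.monotonic)});
}
```

The number of ticks depends on scheduling and may be zero. The point is the structure: the thread cannot outlive the state, because the state's owner joins it.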

Do not use sleeps for correctness.

This is wrong:

std.Thread.sleep(1 * std.time.ns_per_ms); // the argument is in nanoseconds
// assume worker is ready

A sleep guesses timing. Timing changes across machines, loads, and operating systems.

Use synchronization:

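state.mutex.lock();
defer state.mutex.unlock();
// wait() releases the mutex while blocked and reacquires it before returning.
while (!state.ready) {
    state.condition.wait(&state.mutex);
}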

A correct concurrent program should work even when the scheduler chooses the worst possible interleaving.
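A complete version of the wait loop might look like the sketch below. The Gate struct and its method names are hypothetical. Mutex and Condition are the std.Thread types. The flag, the mutex, and the condition live together, following the earlier rule about keeping a lock with its data.

```zig
const std = @import("std");

// Hypothetical one-shot gate: `ready` is protected by `mutex`.
const Gate = struct {
    mutex: std.Thread.Mutex = .{},
    condition: std.Thread.Condition = .{},
    ready: bool = false,

    fn markReady(self: *Gate) void {
        self.mutex.lock();
        defer self.mutex.unlock();

        self.ready = true;
        self.condition.signal(); // wake one waiting thread
    }

    fn waitUntilReady(self: *Gate) void {
        self.mutex.lock();
        defer self.mutex.unlock();

        // The loop re-checks the flag because a wait may return
        // before the flag has been set.
        while (!self.ready) {
            self.condition.wait(&self.mutex);
        }
    }
};

fn worker(gate: *Gate) void {
    // ... setup work would go here ...
    gate.markReady();
}

pub fn main() !void {
    var gate = Gate{};

    const thread = try std.Thread.spawn(.{}, worker, .{&gate});

    gate.waitUntilReady(); // blocks until the worker has called markReady
    std.debug.print("worker is ready\n", .{});

    thread.join();
}
```

The program behaves the same whether the worker runs before or after main reaches waitUntilReady. If the worker runs first, the flag is already set and main never waits.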

Good concurrent Zig programs have a few visible rules:

mutable state has an owner
shared state has a lock
locks protect data, not arbitrary code
waits use conditions, semaphores, or joins
threads are joined or intentionally detached
pointer lifetimes are explicit
slow work happens outside critical sections

Concurrency is not made safe by using one primitive everywhere. It is made safe by keeping ownership simple.

Exercise 18-31. Rewrite a global shared-state program so the lock and data are fields of one struct.

Exercise 18-32. Move expensive work outside a locked section.

Exercise 18-33. Write a worker that counts locally and merges once at the end.

Exercise 18-34. Find a program that uses sleep for ordering. Replace the sleep with a condition variable.