High Performance Concurrent Design

High performance concurrent design means using several threads or tasks without making the program slower, more fragile, or harder to reason about.

Concurrency does not automatically make a program fast.

A bad concurrent program can be slower than a simple single-threaded program. It can waste time on locks, memory allocation, thread scheduling, cache misses, and communication between threads.

The goal is not to use many threads.

The goal is to keep useful work moving.

Start with the Work

Before adding concurrency, ask what kind of work the program performs.

Work type                          Usually good tool
CPU-heavy independent work         Worker threads
Many waiting network operations    Event loop or async I/O
Simple shared counters             Atomics
Shared complex state               Mutexes
Producer-consumer pipelines        Queues
Periodic background work           Threads, timers, or event loop tasks

A program that compresses many files has a different shape from a program that handles many sockets.

Do not choose the concurrency tool first. Choose it after you understand the work.

Avoid Shared Mutable State

The fastest lock is the lock you do not need.

Shared mutable state forces coordination. Coordination costs time and creates bugs.

Prefer this shape:

thread 1 owns data A
thread 2 owns data B
thread 3 owns data C

main combines the results later

over this shape:

all threads update one shared object

For example, instead of one shared counter:

var counter = std.atomic.Value(u64).init(0);

you can often give each worker its own local counter:

fn worker(result: *u64) void {
    var local: u64 = 0;

    // do work
    local += 1;

    result.* = local;
}

Then combine the results after the threads join:

const total = a + b + c;

Local data is cheap. Shared data is expensive.
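
A minimal sketch of that whole shape, using std.Thread.spawn: main owns the three result slots, each worker writes only its own slot, and main reads them only after join.

const std = @import("std");

fn worker(result: *u64) void {
    var local: u64 = 0;

    // do work
    local += 1;

    result.* = local;
}

pub fn main() !void {
    var a: u64 = 0;
    var b: u64 = 0;
    var c: u64 = 0;

    // each thread gets a pointer to its own result slot
    const t1 = try std.Thread.spawn(.{}, worker, .{&a});
    const t2 = try std.Thread.spawn(.{}, worker, .{&b});
    const t3 = try std.Thread.spawn(.{}, worker, .{&c});

    t1.join();
    t2.join();
    t3.join();

    // safe: all workers have finished, so main owns a, b, c again
    const total = a + b + c;
    std.debug.print("total = {d}\n", .{total});
}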

Partition the Input

A good parallel program often splits input into independent chunks.

For example, suppose you need to process a large array.

Bad shape:

all threads pull one item at a time from one shared queue

Better shape:

thread 1 processes items 0..1000
thread 2 processes items 1000..2000
thread 3 processes items 2000..3000

Each thread owns a range.

const Range = struct {
    start: usize,
    end: usize,
};

The worker receives its range:

fn worker(items: []const u64, range: Range, result: *u64) void {
    var sum: u64 = 0;

    var i = range.start;
    while (i < range.end) : (i += 1) {
        sum += items[i];
    }

    result.* = sum;
}

This avoids locking inside the loop.
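
How main computes each range is plain arithmetic. A minimal sketch, using a hypothetical chunkRange helper, where the last worker absorbs any remainder when the length does not divide evenly:

// sketch helper: evenly sized ranges, remainder goes to the last worker
fn chunkRange(total: usize, worker_count: usize, index: usize) Range {
    const chunk = total / worker_count;
    const start = index * chunk;
    const end = if (index == worker_count - 1) total else start + chunk;
    return .{ .start = start, .end = end };
}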

Lock Outside Hot Loops

A hot loop is code that runs many times.

This is usually bad:

while (i < items.len) : (i += 1) {
    mutex.lock();
    shared_sum += items[i];
    mutex.unlock();
}

Every iteration locks and unlocks the mutex.

Better:

var local_sum: u64 = 0;

while (i < items.len) : (i += 1) {
    local_sum += items[i];
}

mutex.lock();
shared_sum += local_sum;
mutex.unlock();

Now the mutex is locked once per worker instead of once per item.

Do as much work locally as possible. Communicate less often.
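
Put together, a complete worker might look like the sketch below. The mutex and shared sum are globals here only to keep the example short; real code would usually pass them in a context struct.

const std = @import("std");

var shared_sum: u64 = 0;
var mutex = std.Thread.Mutex{};

fn worker(items: []const u64) void {
    var local_sum: u64 = 0;

    // hot loop: touches only local state
    for (items) |item| {
        local_sum += item;
    }

    // one lock per worker, not one lock per item
    mutex.lock();
    defer mutex.unlock();
    shared_sum += local_sum;
}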

Reduce Contention

Contention means several threads want the same resource at the same time.

The resource might be a mutex, allocator, queue, file, socket, cache line, or atomic counter.

High contention means threads spend time waiting instead of working.

Common fixes:

Problem                  Better design
One shared counter       Per-thread counters, then merge
One global queue         Work stealing or per-thread queues
One shared allocator     Arena per worker or fixed buffers
One large mutex          Smaller locks around independent state
Frequent tiny messages   Batch messages

A useful question is:

What are all threads fighting over?

Then remove or split that thing.
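
As one example, the arena-per-worker row above can look like the following sketch. Each worker owns its own arena, so temporary allocations never touch a shared allocator. The worker and its arguments are only illustrative.

const std = @import("std");

fn worker(lines: []const []const u8, out_count: *usize) void {
    // this arena belongs to one worker only; its allocations
    // never contend with other threads
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const allocator = arena.allocator();

    var count: usize = 0;
    for (lines) |line| {
        // example temporary allocation: a private copy of the line
        const copy = allocator.dupe(u8, line) catch continue;
        if (copy.len > 0) count += 1;
    }
    out_count.* = count;
}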

Be Careful with Atomics

Atomics can look cheaper than mutexes, but they can still be expensive under contention.

This can become a bottleneck:

_ = counter.fetchAdd(1, .seq_cst);

If every thread increments the same atomic millions of times, the CPU cores must constantly coordinate ownership of that memory location.

Better:

each thread counts locally
one final merge happens at the end

Atomics are good for small shared facts. They are not magic performance tools.
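
A sketch of that shape: the hot loop counts into a plain local variable, and the shared atomic is touched once per worker.

const std = @import("std");

var total = std.atomic.Value(u64).init(0);

fn worker(items: []const u64) void {
    var local: u64 = 0;

    // hot loop uses a plain local variable, no cross-core traffic
    for (items) |item| {
        if (item % 2 == 0) local += 1;
    }

    // one atomic add per worker instead of one per item
    // (.monotonic is enough for a pure counter)
    _ = total.fetchAdd(local, .monotonic);
}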

Cache Lines Matter

Modern CPUs move memory in cache lines, not individual bytes.

A cache line is commonly 64 bytes. If two threads write different variables that happen to sit on the same cache line, the CPU cores can still interfere with each other.

This is called false sharing.

Example:

const Counters = struct {
    a: u64,
    b: u64,
};

If thread 1 writes a and thread 2 writes b, they may still fight over the same cache line.

For very hot per-thread counters, you may need padding or alignment so each thread writes to separate cache lines.
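
One way to do that is an alignment attribute that forces each counter onto its own line. A sketch assuming a 64-byte cache line; check the target if the exact size matters.

const PaddedCounter = struct {
    // the alignment forces each struct to start on its own 64-byte
    // line, and the struct size is rounded up to match
    value: u64 align(64) = 0,
};

// one slot per worker; neighbors no longer share a cache line
var per_thread_counts = [_]PaddedCounter{ .{}, .{}, .{}, .{} };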

The beginner rule is simpler:

Do not put heavily written per-thread values tightly next to each other unless you have measured and know it is safe.

Use Batching

Batching means doing many operations together instead of one at a time.

Instead of pushing one job at a time:

push job
signal
push job
signal
push job
signal

push a batch:

lock
append many jobs
signal or broadcast
unlock

Batching reduces lock overhead, wakeups, allocator calls, and queue traffic.

The tradeoff is latency. A batch may make an individual job wait a little longer.

For high throughput, batching is often excellent.
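
A sketch of a batched push, using a fixed global buffer and a hypothetical pushBatch function to keep the example short; a real queue would wait or report overflow instead of dropping jobs.

const std = @import("std");

const Job = struct { id: u64 };

var mutex = std.Thread.Mutex{};
var not_empty = std.Thread.Condition{};
var queue: [256]Job = undefined;
var queue_len: usize = 0;

fn pushBatch(jobs: []const Job) void {
    mutex.lock();
    defer mutex.unlock();

    // one lock for the whole batch
    for (jobs) |job| {
        if (queue_len == queue.len) break; // sketch only: real code would wait
        queue[queue_len] = job;
        queue_len += 1;
    }

    // one wakeup for the whole batch
    not_empty.broadcast();
}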

Keep Ownership Clear

High performance code must still be readable.

A fast design with unclear ownership is dangerous.

Every object should have an owner:

this buffer belongs to worker 1
this queue owns these jobs
this result array belongs to main until workers finish
this connection owns its read buffer

Ownership gives you two benefits.

It prevents races.

It also reduces synchronization.

If only one thread owns a buffer, that buffer needs no lock.

Avoid Unbounded Thread Creation

Creating a thread has a cost.

This is usually bad:

for each request:
    create a new thread

A busy server could create thousands of threads. That can waste memory and overwhelm the scheduler.

Better:

create a fixed worker pool
send jobs to workers
reuse the same threads

A worker pool keeps concurrency under control.

The number of workers should usually be related to the work type.

For CPU-heavy work, start near the number of CPU cores.

For blocking I/O work, more threads may help, but measure.
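
A sketch of sizing the pool for CPU-heavy work:

const std = @import("std");

pub fn main() !void {
    // start near the number of cores for CPU-heavy work,
    // then adjust based on measurement
    const worker_count = try std.Thread.getCpuCount();
    std.debug.print("starting {d} workers\n", .{worker_count});
}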

Backpressure

Backpressure means the system slows producers down when consumers cannot keep up.

Without backpressure, queues can grow until memory runs out.

Bad design:

producer creates jobs forever
queue grows forever
workers fall behind
memory usage grows

Better design:

queue has a maximum size
producer waits when queue is full
workers drain the queue

A bounded queue protects the program under load.

High performance is not only about speed when things are normal. It is also about controlled behavior when things are overloaded.
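
A minimal sketch of such a bounded queue, built from one mutex and two condition variables; the producer blocking inside push is the backpressure.

const std = @import("std");

const BoundedQueue = struct {
    mutex: std.Thread.Mutex = .{},
    not_full: std.Thread.Condition = .{},
    not_empty: std.Thread.Condition = .{},
    buffer: [64]u64 = undefined,
    head: usize = 0,
    count: usize = 0,

    fn push(self: *BoundedQueue, item: u64) void {
        self.mutex.lock();
        defer self.mutex.unlock();

        // backpressure: the producer waits while the queue is full
        while (self.count == self.buffer.len) self.not_full.wait(&self.mutex);

        self.buffer[(self.head + self.count) % self.buffer.len] = item;
        self.count += 1;
        self.not_empty.signal();
    }

    fn pop(self: *BoundedQueue) u64 {
        self.mutex.lock();
        defer self.mutex.unlock();

        // consumer waits while the queue is empty
        while (self.count == 0) self.not_empty.wait(&self.mutex);

        const item = self.buffer[self.head];
        self.head = (self.head + 1) % self.buffer.len;
        self.count -= 1;
        self.not_full.signal();
        return item;
    }
};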

Measure Before Optimizing

Concurrency bugs are hard. Performance guesses are often wrong.

Measure before making the design more complex.

Useful things to measure:

Metric             Question
Throughput         How much work finishes per second?
Latency            How long does one job wait?
CPU usage          Are cores busy or idle?
Lock contention    Are threads waiting on locks?
Queue length       Are producers faster than consumers?
Allocation count   Is memory allocation dominating?
Cache misses       Is memory layout hurting performance?

Do not add lock-free structures because they sound fast. Add them only when measurement shows the current design is the bottleneck.
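
Throughput, the first row above, is the easiest place to start. A sketch using a monotonic timer; the loop body stands in for the real work:

const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();

    var jobs_done: u64 = 0;
    while (jobs_done < 1_000_000) : (jobs_done += 1) {
        // the work being measured would go here
    }

    const elapsed_ns = timer.read();
    const seconds = @as(f64, @floatFromInt(elapsed_ns)) / std.time.ns_per_s;
    std.debug.print("{d} jobs in {d:.3} s\n", .{ jobs_done, seconds });
}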

Prefer Simple Correct Designs First

A good first concurrent design is often:

main thread creates work
fixed workers process work
workers keep local results
main joins workers
main merges results

This design is boring, but strong.

It has limited sharing.

It has clear lifetimes.

It has predictable shutdown.

It is easy to test.

Only move to more advanced designs when this shape is not enough.

A Practical Pattern

For CPU-heavy batch work:

split input into N chunks
start N workers
each worker processes one chunk
each worker writes one result
join workers
merge results

For I/O-heavy servers:

event loop handles sockets
small handlers do minimal work
blocking or CPU-heavy work goes to workers
bounded queues provide backpressure
shutdown wakes all workers

For pipelines:

stage 1 parses input
stage 2 transforms data
stage 3 writes output
bounded queues connect stages
each stage owns its local buffers

Each design keeps communication explicit.

The Main Rule

Concurrency is a structure, not a decoration.

You do not make a program fast by adding threads around random functions. You make it fast by dividing ownership, reducing communication, keeping hot loops local, bounding queues, and measuring the real bottlenecks.

A good concurrent Zig program is explicit about:

who owns each piece of memory

which data is shared

which synchronization protects it

where work can wait

where work can run in parallel

how the program shuts down

That discipline gives you both speed and correctness.