High performance concurrent design means using several threads or tasks without making the program slower, more fragile, or harder to reason about.
Concurrency does not automatically make a program fast.
A bad concurrent program can be slower than a simple single-threaded program. It can waste time on locks, memory allocation, thread scheduling, cache misses, and communication between threads.
The goal is not to use many threads.
The goal is to keep useful work moving.
Start with the Work
Before adding concurrency, ask what kind of work the program performs.
| Work type | Usually good tool |
|---|---|
| CPU-heavy independent work | Worker threads |
| Many waiting network operations | Event loop or async I/O |
| Simple shared counters | Atomics |
| Shared complex state | Mutexes |
| Producer-consumer pipelines | Queues |
| Periodic background work | Threads, timers, or event loop tasks |
A program that compresses many files has a different shape from a program that handles many sockets.
Do not choose the concurrency tool first. Choose it after you understand the work.
Avoid Shared Mutable State
The fastest lock is the lock you do not need.
Shared mutable state forces coordination. Coordination costs time and creates bugs.
Prefer this shape:

```text
thread 1 owns data A
thread 2 owns data B
thread 3 owns data C
main combines the results later
```

over this shape:

```text
all threads update one shared object
```

For example, instead of one shared counter:

```zig
var counter = std.atomic.Value(u64).init(0);
```

you can often give each worker its own local counter:
```zig
fn worker(result: *u64) void {
    var local: u64 = 0;
    // do work
    local += 1;
    result.* = local;
}
```

Then combine results after join:

```zig
const total = a + b + c;
```

Local data is cheap. Shared data is expensive.
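Put together, that shape might look like the following minimal sketch. It assumes the `std.Thread.spawn`/`join` API of recent Zig versions, and the counting loop is a stand-in for real work:

```zig
const std = @import("std");

fn worker(result: *u64) void {
    var local: u64 = 0;
    var i: u64 = 0;
    while (i < 1000) : (i += 1) {
        local += 1; // stand-in for real per-item work
    }
    result.* = local;
}

pub fn main() !void {
    var a: u64 = 0;
    var b: u64 = 0;
    var c: u64 = 0;

    const t1 = try std.Thread.spawn(.{}, worker, .{&a});
    const t2 = try std.Thread.spawn(.{}, worker, .{&b});
    const t3 = try std.Thread.spawn(.{}, worker, .{&c});
    t1.join();
    t2.join();
    t3.join();

    // Each worker owned its own result slot, so no locks were needed.
    const total = a + b + c;
    std.debug.print("total = {d}\n", .{total});
}
```

Note that the merge happens only after all joins, when main is once again the sole owner of `a`, `b`, and `c`.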
Partition the Input
A good parallel program often splits input into independent chunks.
For example, suppose you need to process a large array.
Bad shape:

```text
all threads pull one item at a time from one shared queue
```

Better shape:

```text
thread 1 processes items 0..1000
thread 2 processes items 1000..2000
thread 3 processes items 2000..3000
```

Each thread owns a range.
```zig
const Range = struct {
    start: usize,
    end: usize,
};
```

The worker receives its range:
```zig
fn worker(items: []const u64, range: Range, result: *u64) void {
    var sum: u64 = 0;
    var i = range.start;
    while (i < range.end) : (i += 1) {
        sum += items[i];
    }
    result.* = sum;
}
```

This avoids locking inside the loop.
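A driver that builds the ranges and spawns the workers might look like this sketch. The worker count and data are illustrative, and it assumes the recent `std.Thread` API:

```zig
const std = @import("std");

const Range = struct {
    start: usize,
    end: usize,
};

fn worker(items: []const u64, range: Range, result: *u64) void {
    var sum: u64 = 0;
    var i = range.start;
    while (i < range.end) : (i += 1) {
        sum += items[i];
    }
    result.* = sum;
}

pub fn main() !void {
    var items: [3000]u64 = undefined;
    for (&items) |*item| item.* = 1;

    const n_workers = 3;
    const chunk = items.len / n_workers;
    var results = [_]u64{0} ** n_workers;
    var threads: [n_workers]std.Thread = undefined;

    for (&threads, 0..) |*t, w| {
        const start = w * chunk;
        // The last worker takes any remainder.
        const end = if (w == n_workers - 1) items.len else start + chunk;
        t.* = try std.Thread.spawn(.{}, worker, .{
            items[0..], Range{ .start = start, .end = end }, &results[w],
        });
    }
    for (threads) |t| t.join();

    var total: u64 = 0;
    for (results) |r| total += r; // 3000: one per item
    std.debug.print("sum = {d}\n", .{total});
}
```

Each worker writes only to its own `results` slot, so the hot loop needs no synchronization at all.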
Lock Outside Hot Loops
A hot loop is code that runs many times.
This is usually bad:

```zig
while (i < items.len) : (i += 1) {
    mutex.lock();
    shared_sum += items[i];
    mutex.unlock();
}
```

Every iteration locks and unlocks the mutex.
Better:

```zig
var local_sum: u64 = 0;
while (i < items.len) : (i += 1) {
    local_sum += items[i];
}
mutex.lock();
shared_sum += local_sum;
mutex.unlock();
```

Now the lock is used once.
Do as much work locally as possible. Communicate less often.
Reduce Contention
Contention means several threads want the same resource at the same time.
The resource might be a mutex, allocator, queue, file, socket, cache line, or atomic counter.
High contention means threads spend time waiting instead of working.
Common fixes:
| Problem | Better design |
|---|---|
| One shared counter | Per-thread counters, then merge |
| One global queue | Work stealing or per-thread queues |
| One shared allocator | Arena per worker or fixed buffers |
| One large mutex | Smaller locks around independent state |
| Frequent tiny messages | Batch messages |
A useful question is:

```text
What are all threads fighting over?
```

Then remove or split that thing.
Be Careful with Atomics
Atomics can look cheaper than mutexes, but they can still be expensive under contention.
This can become a bottleneck:

```zig
_ = counter.fetchAdd(1, .seq_cst);
```

If every thread increments the same atomic millions of times, the CPU cores must constantly coordinate ownership of that memory location.
Better:

```text
each thread counts locally
one final merge happens at the end
```

Atomics are good for small shared facts. They are not magic performance tools.
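As a sketch of that pattern in Zig (the summing work is illustrative):

```zig
const std = @import("std");

// One shared total, touched once per worker instead of once per item.
var total = std.atomic.Value(u64).init(0);

fn worker(items: []const u64) void {
    var local: u64 = 0;
    for (items) |item| {
        local += item; // the hot loop stays on thread-local data
    }
    // One atomic operation per thread, not one per item.
    _ = total.fetchAdd(local, .seq_cst);
}
```

The contention on `total` is now proportional to the number of threads, not the number of items.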
Cache Lines Matter
Modern CPUs move memory in cache lines, not individual bytes.
A cache line is commonly 64 bytes. If two threads write different variables that happen to sit on the same cache line, the CPU cores can still interfere with each other.
This is called false sharing.
Example:

```zig
const Counters = struct {
    a: u64,
    b: u64,
};
```

If thread 1 writes `a` and thread 2 writes `b`, they may still fight over the same cache line.
For very hot per-thread counters, you may need padding or alignment so each thread writes to separate cache lines.
The beginner rule is simpler:
Do not put heavily written per-thread values tightly next to each other unless you have measured and know it is safe.
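If measurement does show false sharing, one possible fix in Zig is field alignment. This sketch assumes `std.atomic.cache_line`, which recent Zig versions provide as the target's assumed cache-line size:

```zig
const std = @import("std");

// Each counter is forced onto its own cache line, so writers on
// different threads do not invalidate each other's lines.
const PaddedCounter = struct {
    value: u64 align(std.atomic.cache_line) = 0,
};

comptime {
    // Alignment propagates to the struct, so array elements
    // land on separate cache lines.
    std.debug.assert(@sizeOf(PaddedCounter) >= std.atomic.cache_line);
}

var counters = [_]PaddedCounter{.{}} ** 4;
```

The cost is memory: each hot counter now occupies a full line instead of 8 bytes. That trade is usually worth it only for heavily written values, which is why measuring first matters.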
Use Batching
Batching means doing many operations together instead of one at a time.
Instead of pushing one job at a time:

```text
push job
signal
push job
signal
push job
signal
```

push a batch:

```text
lock
append many jobs
signal or broadcast
unlock
```

Batching reduces lock overhead, wakeups, allocator calls, and queue traffic.
The tradeoff is latency. A batch may make an individual job wait a little longer.
For high throughput, batching is often excellent.
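A batched push might look like this sketch. The `pushBatch` name, the fixed-size job buffer, and the capacity assert are all illustrative, and it assumes `std.Thread.Mutex` and `std.Thread.Condition`:

```zig
const std = @import("std");

const Job = struct { id: u64 };

var mutex: std.Thread.Mutex = .{};
var cond: std.Thread.Condition = .{};
var jobs: [1024]Job = undefined;
var jobs_len: usize = 0;

// One lock acquisition and one wakeup for the whole batch,
// instead of one of each per job.
fn pushBatch(batch: []const Job) void {
    mutex.lock();
    defer mutex.unlock();
    std.debug.assert(jobs_len + batch.len <= jobs.len);
    @memcpy(jobs[jobs_len..][0..batch.len], batch);
    jobs_len += batch.len;
    cond.broadcast(); // wake all waiting workers once
}
```

Whether to `signal` one worker or `broadcast` to all depends on how many jobs a batch carries relative to how many workers are sleeping.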
Keep Ownership Clear
High performance code must still be readable.
A fast design with unclear ownership is dangerous.
Every object should have an owner:

```text
this buffer belongs to worker 1
this queue owns these jobs
this result array belongs to main until workers finish
this connection owns its read buffer
```

Ownership gives you two benefits.
It prevents races.
It also reduces synchronization.
If only one thread owns a buffer, that buffer needs no lock.
Avoid Unbounded Thread Creation
Creating a thread has a cost.
This is usually bad:

```text
for each request:
    create a new thread
```

A busy server could create thousands of threads. That can waste memory and overwhelm the scheduler.
Better:

```text
create a fixed worker pool
send jobs to workers
reuse the same threads
```

A worker pool keeps concurrency under control.
The number of workers should usually be related to the work type.
For CPU-heavy work, start near the number of CPU cores.
For blocking I/O work, more threads may help, but measure.
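A minimal pool skeleton, assuming `std.Thread.getCpuCount` from recent Zig versions; the worker body is left as a comment because it depends on the queue design:

```zig
const std = @import("std");

fn worker(id: usize) void {
    _ = id;
    // loop: take a job from a shared queue, process it, repeat
    // until a shutdown flag is set
}

pub fn main() !void {
    // Size the pool from the machine, not from the request rate.
    const n = try std.Thread.getCpuCount();

    const allocator = std.heap.page_allocator;
    const threads = try allocator.alloc(std.Thread, n);
    defer allocator.free(threads);

    for (threads, 0..) |*t, i| {
        t.* = try std.Thread.spawn(.{}, worker, .{i});
    }
    for (threads) |t| t.join();
}
```

The threads are created once and reused for every job, so the per-request cost is a queue operation rather than a thread creation.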
Backpressure
Backpressure means the system slows producers down when consumers cannot keep up.
Without backpressure, queues can grow until memory runs out.
Bad design:

```text
producer creates jobs forever
queue grows forever
workers fall behind
memory usage grows
```

Better design:

```text
queue has a maximum size
producer waits when queue is full
workers drain the queue
```

A bounded queue protects the program under load.
High performance is not only about speed when things are normal. It is also about controlled behavior when things are overloaded.
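A bounded queue with blocking push and pop can be sketched with a mutex and two condition variables. This is an illustrative LIFO version for brevity, not a production queue; a ring buffer would give FIFO order:

```zig
const std = @import("std");

// Push blocks while the queue is full; pop blocks while it is empty.
const BoundedQueue = struct {
    buf: [64]u64 = undefined,
    len: usize = 0,
    mutex: std.Thread.Mutex = .{},
    not_full: std.Thread.Condition = .{},
    not_empty: std.Thread.Condition = .{},

    fn push(self: *BoundedQueue, item: u64) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        // Backpressure: the producer waits instead of growing the queue.
        while (self.len == self.buf.len) self.not_full.wait(&self.mutex);
        self.buf[self.len] = item;
        self.len += 1;
        self.not_empty.signal();
    }

    fn pop(self: *BoundedQueue) u64 {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.len == 0) self.not_empty.wait(&self.mutex);
        self.len -= 1;
        const item = self.buf[self.len];
        self.not_full.signal();
        return item;
    }
};
```

The `while` loops around `wait` matter: condition variables can wake spuriously, so the predicate must be rechecked after every wakeup.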
Measure Before Optimizing
Concurrency bugs are hard. Performance guesses are often wrong.
Measure before making the design more complex.
Useful things to measure:
| Metric | Question |
|---|---|
| Throughput | How much work finishes per second? |
| Latency | How long does one job wait? |
| CPU usage | Are cores busy or idle? |
| Lock contention | Are threads waiting on locks? |
| Queue length | Are producers faster than consumers? |
| Allocation count | Is memory allocation dominating? |
| Cache misses | Is memory layout hurting performance? |
Do not add lock-free structures because they sound fast. Add them only when measurement shows the current design is the bottleneck.
Prefer Simple Correct Designs First
A good first concurrent design is often:

```text
main thread creates work
fixed workers process work
workers keep local results
main joins workers
main merges results
```

This design is boring, but strong.
It has limited sharing.
It has clear lifetimes.
It has predictable shutdown.
It is easy to test.
Only move to more advanced designs when this shape is not enough.
A Practical Pattern
For CPU-heavy batch work:

```text
split input into N chunks
start N workers
each worker processes one chunk
each worker writes one result
join workers
merge results
```

For I/O-heavy servers:

```text
event loop handles sockets
small handlers do minimal work
blocking or CPU-heavy work goes to workers
bounded queues provide backpressure
shutdown wakes all workers
```

For pipelines:

```text
stage 1 parses input
stage 2 transforms data
stage 3 writes output
bounded queues connect stages
each stage owns its local buffers
```

Each design keeps communication explicit.
The Main Rule
Concurrency is a structure, not a decoration.
You do not make a program fast by adding threads around random functions. You make it fast by dividing ownership, reducing communication, keeping hot loops local, bounding queues, and measuring the real bottlenecks.
A good concurrent Zig program is explicit about:

- who owns each piece of memory
- which data is shared
- which synchronization protects it
- where work can wait
- where work can run in parallel
- how the program shuts down
That discipline gives you both speed and correctness.