Skip to content

Writing a Compiler in Zig

A compiler translates one form of code into another form.

A compiler translates one form of code into another form.

For this chapter, we will use a small meaning:

source code -> tokens -> AST -> bytecode

The source code is what the user writes. The bytecode is what our virtual machine runs.

A full production compiler can generate machine code, optimize programs, perform type checking, emit debug information, and link object files. We do not need all of that yet. The first compiler should be small enough to understand.

The Compiler Pipeline

A simple compiler has stages.

source text
    -> tokenizer
    -> parser
    -> AST
    -> compiler
    -> bytecode
    -> VM

Each stage has a clear job.

The tokenizer reads characters and creates tokens.

The parser reads tokens and creates a syntax tree.

The compiler reads the syntax tree and creates instructions.

The VM executes the instructions.

Keeping these stages separate makes the system easier to test.

A Small Source Language

Suppose our source language only supports integer expressions:

1 + 2 * 3

From the parser section, this becomes an AST:

+
  1
  *
    2
    3

The compiler should turn that tree into bytecode:

push_i64 1
push_i64 2
push_i64 3
mul
add
halt

The VM then runs the bytecode and computes 7.

AST Input

We can reuse the expression type from the parser:

const Expr = union(enum) {
    integer: i64,
    add: Binary,
    sub: Binary,
    multiply: Binary,
    divide: Binary,

    const Binary = struct {
        left: *Expr,
        right: *Expr,
    };
};

This AST is already structured. The compiler does not need to rediscover precedence. The parser already handled that.

That is why the pipeline matters. Each stage should do its own job once.

Bytecode Output

Use a small instruction format:

const OpCode = enum(u8) {
    push_i64,
    add,
    sub,
    mul,
    div,
    halt,
};

const Instruction = union(OpCode) {
    push_i64: i64,
    add: void,
    sub: void,
    mul: void,
    div: void,
    halt: void,
};

The compiler will produce a list of Instruction values.

instructions: std.ArrayList(Instruction)

The Compiler Type

A compiler needs an allocator and an output buffer.

const std = @import("std");

const Compiler = struct {
    instructions: std.ArrayList(Instruction),

    pub fn init(allocator: std.mem.Allocator) Compiler {
        return .{
            .instructions = std.ArrayList(Instruction).init(allocator),
        };
    }

    pub fn deinit(self: *Compiler) void {
        self.instructions.deinit();
    }

    fn emit(self: *Compiler, instruction: Instruction) !void {
        try self.instructions.append(instruction);
    }
};

The emit function appends one instruction to the output.

This is the basic compiler action:

look at AST node
emit instruction

Compiling an Integer

An integer expression compiles to one push_i64 instruction.

fn compileExpr(self: *Compiler, expr: *const Expr) !void {
    switch (expr.*) {
        .integer => |value| {
            try self.emit(.{ .push_i64 = value });
        },
        else => {
            return error.UnsupportedExpression;
        },
    }
}

For this AST:

integer 42

the compiler emits:

push_i64 42

Compiling Addition

For binary operators, compile the left expression first, then the right expression, then emit the operator.

fn compileExpr(self: *Compiler, expr: *const Expr) !void {
    switch (expr.*) {
        .integer => |value| {
            try self.emit(.{ .push_i64 = value });
        },
        .add => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.add);
        },
        else => {
            return error.UnsupportedExpression;
        },
    }
}

For:

1 + 2

the compiler emits:

push_i64 1
push_i64 2
add

This works because our VM is stack-based.

Why Left Then Right

The stack machine expects operands to be pushed before the operator runs.

For:

1 + 2

the execution is:

push 1       stack: [1]
push 2       stack: [1, 2]
add          stack: [3]

For subtraction, order matters.

10 - 3

must emit:

push_i64 10
push_i64 3
sub

The VM pops 3 as the right operand and 10 as the left operand.

Compiling More Operators

Now add subtraction, multiplication, and division.

fn compileExpr(self: *Compiler, expr: *const Expr) !void {
    switch (expr.*) {
        .integer => |value| {
            try self.emit(.{ .push_i64 = value });
        },
        .add => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.add);
        },
        .sub => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.sub);
        },
        .multiply => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.mul);
        },
        .divide => |node| {
            try self.compileExpr(node.left);
            try self.compileExpr(node.right);
            try self.emit(.div);
        },
    }
}

The shape is repetitive, but clear.

Later, you can factor it. At first, clarity is more important.

Finishing the Program

After compiling the expression, emit halt.

pub fn compile(self: *Compiler, expr: *const Expr) ![]Instruction {
    try self.compileExpr(expr);
    try self.emit(.halt);

    return self.instructions.items;
}

This returns the instruction slice.

Be careful with ownership. The returned slice belongs to the compiler’s ArrayList. It stays valid until the compiler is deinitialized or the list reallocates.

A Small Manual AST

Before connecting the parser, test the compiler with a hand-built AST.

test "compile addition" {
    var compiler = Compiler.init(std.testing.allocator);
    defer compiler.deinit();

    var left = Expr{ .integer = 1 };
    var right = Expr{ .integer = 2 };
    var root = Expr{
        .add = .{
            .left = &left,
            .right = &right,
        },
    };

    const instructions = try compiler.compile(&root);

    try std.testing.expectEqual(@as(usize, 4), instructions.len);
    try std.testing.expectEqual(Instruction{ .push_i64 = 1 }, instructions[0]);
    try std.testing.expectEqual(Instruction{ .push_i64 = 2 }, instructions[1]);
    try std.testing.expectEqual(Instruction.add, instructions[2]);
    try std.testing.expectEqual(Instruction.halt, instructions[3]);
}

This test checks only the compiler. It does not require a tokenizer, parser, or VM.

That separation is valuable.

Compiler Errors

A compiler should report structured errors.

Examples:

unsupported expression
too many constants
variable not found
invalid assignment target
type mismatch
jump target too large

In Zig, start with an error set:

const CompileError = error{
    UnsupportedExpression,
    VariableNotFound,
    TypeMismatch,
};

Then return errors from compiler functions:

fn compileExpr(self: *Compiler, expr: *const Expr) CompileError!void {
    // ...
}

For a real compiler, an error message should also include source location. The parser can attach locations to AST nodes.

Adding Statements

Expressions compute values.

Statements perform actions.

Examples:

print 1 + 2;
let x = 10;
x = x + 1;

A small statement AST might be:

const Stmt = union(enum) {
    print: *Expr,
    expr: *Expr,
};

Add a print instruction to the VM:

const OpCode = enum(u8) {
    push_i64,
    add,
    sub,
    mul,
    div,
    print,
    halt,
};

Compile a print statement:

fn compileStmt(self: *Compiler, stmt: *const Stmt) !void {
    switch (stmt.*) {
        .print => |expr| {
            try self.compileExpr(expr);
            try self.emit(.print);
        },
        .expr => |expr| {
            try self.compileExpr(expr);
        },
    }
}

Now:

print 1 + 2;

emits:

push_i64 1
push_i64 2
add
print

Adding Variables

Variables need storage.

The VM can use local slots:

slot 0
slot 1
slot 2

Add bytecode instructions:

const OpCode = enum(u8) {
    push_i64,
    load_local,
    store_local,
    add,
    sub,
    mul,
    div,
    print,
    halt,
};

const Instruction = union(OpCode) {
    push_i64: i64,
    load_local: usize,
    store_local: usize,
    add: void,
    sub: void,
    mul: void,
    div: void,
    print: void,
    halt: void,
};

A source statement:

let x = 10;

can emit:

push_i64 10
store_local 0

Using the variable:

print x;

can emit:

load_local 0
print

The compiler maps variable names to local slots.

Symbol Table

A symbol table stores names known to the compiler.

const Symbol = struct {
    name: []const u8,
    slot: usize,
};

const Compiler = struct {
    instructions: std.ArrayList(Instruction),
    symbols: std.StringHashMap(usize),
};

When compiling:

let x = 10;

the compiler assigns a slot:

x -> 0

When compiling:

x + 1

the compiler looks up x and emits:

load_local 0
push_i64 1
add

A missing name is a compile error.

const slot = self.symbols.get(name) orelse {
    return error.VariableNotFound;
};

Control Flow

Control flow needs jumps.

For an if expression:

if condition {
    then_branch
} else {
    else_branch
}

the compiler emits something like:

compile condition
jump_if_zero else_start
compile then_branch
jump end
else_start:
compile else_branch
end:

The hard part is that the compiler may not know jump targets until later.

So it emits placeholder jumps, then patches them.

Patching Jumps

A compiler can remember the instruction index where a jump was emitted.

fn emitJump(self: *Compiler, instruction: Instruction) !usize {
    const index = self.instructions.items.len;
    try self.emit(instruction);
    return index;
}

Later it patches the target:

fn patchJump(self: *Compiler, index: usize, target: usize) void {
    switch (self.instructions.items[index]) {
        .jump => {
            self.instructions.items[index] = .{ .jump = target };
        },
        .jump_if_zero => {
            self.instructions.items[index] = .{ .jump_if_zero = target };
        },
        else => unreachable,
    }
}

This pattern appears in many bytecode compilers.

Compiler Design Rule

A compiler should not execute the program.

It should translate the program.

Bad compiler behavior:

while compiling, run user code accidentally
while compiling, depend on runtime variable values
while compiling, mutate VM state

Good compiler behavior:

read AST
check rules
emit bytecode
report errors

This keeps compile time and runtime separate.

Optimization

Do not start with optimization.

First make the compiler correct.

Then add small optimizations.

Example: constant folding.

Source:

1 + 2

Instead of emitting:

push_i64 1
push_i64 2
add

the compiler can emit:

push_i64 3

But only do this after the ordinary compiler works.

A simple optimizer should preserve meaning. Speed is useless if the compiler becomes wrong.

Testing the Whole Pipeline

After testing each stage separately, test the full pipeline.

source -> parse -> compile -> run -> result

Example test cases:

1 + 2 -> 3
1 + 2 * 3 -> 7
(1 + 2) * 3 -> 9
10 - 3 - 2 -> 5
8 / 2 -> 4

Whole-pipeline tests catch integration bugs.

Separate stage tests tell you where the bug probably lives.

Common Mistakes

A common mistake is mixing parsing and compilation too early. Keep the AST stage until you understand the language.

Another mistake is adding variables, functions, types, and optimization all at once. Add one feature at a time.

A third mistake is producing bytecode without a disassembler. You need to see what the compiler emitted.

A fourth mistake is ignoring source locations. Good errors need locations.

A fifth mistake is treating compiler bugs as VM bugs. If the VM runs the emitted bytecode correctly, the compiler may be the broken part.

A Practical Build Order

Build a small compiler in this order:

integer expressions
binary arithmetic
print statements
local variables
assignment
if statements
while loops
function calls
type checking
simple optimization

After each step, add tests.

Each feature should change the AST, compiler, VM, and tests in a controlled way.

The Main Idea

A compiler is a translator.

For a small Zig project, the compiler can translate an AST into bytecode for your VM.

Use clear stages. Keep the tokenizer, parser, compiler, and VM separate. Emit simple instructions. Add a disassembler. Test each stage alone, then test the full pipeline.