Binary File Formats

A binary file format stores data as bytes with a specific structure.

Text files store data as readable characters:

name: Alice
age: 30

Binary files store data in compact byte layouts:

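41 6C 69 63 65 1E

Here the first five bytes are the ASCII codes for "Alice" and 1E is the number 30, but nothing in the file says so. The same bytes could just as well be part of a timestamp, an image, a database page, or an executable header. The bytes only make sense if you know the format.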

Text vs Binary

A text file is designed to be read by humans.

A binary file is designed to be read by programs.

That does not mean binary files are mysterious. They are just more strict. Instead of reading lines and words, you read exact byte positions.

For example, a simple binary format might say:

bytes 0..4    magic number
bytes 4..8    version
bytes 8..16   record count
bytes 16..    records

Your program must follow that layout exactly. The ranges are half-open, as in Zig slice syntax: bytes 0..4 means bytes 0, 1, 2, and 3.

Magic Numbers

Many binary formats start with a magic number.

A magic number is a short byte sequence that identifies the file type.

For example, a custom format might begin with:

ZDB1

In Zig:

const magic = "ZDB1";

When reading the file, check the first bytes:

if (!std.mem.eql(u8, bytes[0..4], "ZDB1")) {
    return error.BadMagic;
}

This prevents your parser from treating the wrong file as valid data.
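The check above slices bytes[0..4], which is only valid when the input holds at least four bytes. A minimal sketch of a length-safe variant follows; the names zdb_magic and hasZdbMagic are illustrative and not part of any library.

```zig
const std = @import("std");

// Magic of the illustrative "ZDB1" format from the text.
const zdb_magic = "ZDB1";

/// Returns true only if `bytes` is long enough to hold the magic number
/// and starts with it. (Hypothetical helper, not a std function.)
fn hasZdbMagic(bytes: []const u8) bool {
    if (bytes.len < zdb_magic.len) return false;
    return std.mem.eql(u8, bytes[0..zdb_magic.len], zdb_magic);
}

pub fn main() void {
    std.debug.print("{}\n", .{hasZdbMagic("ZDB1 followed by data")});
    std.debug.print("{}\n", .{hasZdbMagic("ZD")});
    std.debug.print("{}\n", .{hasZdbMagic("NUMS....")});
}
```

Checking the length first means a two-byte file is reported as "not this format" instead of causing an out-of-bounds slice.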

A Tiny Binary Format

Let’s design a small file format for storing unsigned 32-bit numbers.

The file layout:

bytes 0..4    magic: "NUMS"
bytes 4..8    count: u32 little-endian
bytes 8..     count numbers, each u32 little-endian

A file with three numbers:

magic = "NUMS"
count = 3
numbers = 10, 20, 30

The byte layout is:

4E 55 4D 53  03 00 00 00  0A 00 00 00  14 00 00 00  1E 00 00 00

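Each number uses 4 bytes, so the whole file is 4 + 4 + 3 × 4 = 20 bytes long. The least significant byte comes first: 10 is stored as 0A 00 00 00.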
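One way to convince yourself of this layout is to encode the three numbers into a memory buffer and print the bytes. The sketch below does that without touching the file system; encodeNums is a hypothetical helper, and it assumes the caller supplies a large enough buffer.

```zig
const std = @import("std");

/// Encodes `numbers` in the NUMS layout into `out` and returns the bytes written.
/// Assumes `out` holds at least 8 + 4 * numbers.len bytes.
fn encodeNums(out: []u8, numbers: []const u32) []u8 {
    const count: u32 = @intCast(numbers.len);
    @memcpy(out[0..4], "NUMS");
    std.mem.writeInt(u32, out[4..8], count, .little);

    var offset: usize = 8;
    for (numbers) |n| {
        std.mem.writeInt(u32, out[offset..][0..4], n, .little);
        offset += 4;
    }
    return out[0..offset];
}

pub fn main() void {
    var buf: [64]u8 = undefined;
    const numbers = [_]u32{ 10, 20, 30 };
    const encoded = encodeNums(&buf, &numbers);

    // Print every byte as two uppercase hex digits.
    for (encoded) |b| std.debug.print("{X:0>2} ", .{b});
    std.debug.print("\n", .{});
}
```

The printed bytes should match the hex dump above, one group of four bytes per field.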

Writing the File

const std = @import("std");

pub fn main() !void {
    var file = try std.fs.cwd().createFile("numbers.bin", .{});
    defer file.close();

    const numbers = [_]u32{ 10, 20, 30 };

    try file.writeAll("NUMS");

    var buffer: [4]u8 = undefined;

    std.mem.writeInt(u32, &buffer, numbers.len, .little);
    try file.writeAll(&buffer);

    for (numbers) |n| {
        std.mem.writeInt(u32, &buffer, n, .little);
        try file.writeAll(&buffer);
    }
}

The key function is:

std.mem.writeInt(u32, &buffer, n, .little);

It writes an integer into bytes using little-endian order.

Reading the File

const std = @import("std");

const ParseError = error{
    BadMagic,
    Truncated,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    const bytes = try std.fs.cwd().readFileAlloc(
        allocator,
        "numbers.bin",
        1024 * 1024,
    );
    defer allocator.free(bytes);

    const numbers = try parseNumbers(bytes);

    for (numbers) |n| {
        std.debug.print("{}\n", .{n});
    }
}

fn parseNumbers(bytes: []const u8) ParseError![]const u32 {
    if (bytes.len < 8) {
        return error.Truncated;
    }

    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) {
        return error.BadMagic;
    }

    const count = std.mem.readInt(u32, bytes[4..8], .little);

    // Compute in u64 so the multiplication cannot overflow on 32-bit targets.
    const needed = 8 + @as(u64, count) * 4;
    if (bytes.len < needed) {
        return error.Truncated;
    }

    // The bytes are valid, but there is no []const u32 to return:
    // the file holds raw bytes, not native u32 values.
    // The placeholder error below only marks the missing return value.
    return error.Truncated;
}

This version shows the validation steps, but the return type is the wrong design: the function has no correct value to return, so it always ends in an error. The file contains bytes, not a native []const u32 slice. You should not pretend those bytes are already a safe Zig u32 array.

A better parser reads each integer from the byte slice.

const std = @import("std");

const ParseError = error{
    BadMagic,
    Truncated,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    const bytes = try std.fs.cwd().readFileAlloc(
        allocator,
        "numbers.bin",
        1024 * 1024,
    );
    defer allocator.free(bytes);

    try printNumbers(bytes);
}

fn printNumbers(bytes: []const u8) ParseError!void {
    if (bytes.len < 8) {
        return error.Truncated;
    }

    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) {
        return error.BadMagic;
    }

    const count = std.mem.readInt(u32, bytes[4..8], .little);

    // Compute in u64 so the multiplication cannot overflow on 32-bit targets.
    const needed = 8 + @as(u64, count) * 4;
    if (bytes.len < needed) {
        return error.Truncated;
    }

    var offset: usize = 8;
    var i: u32 = 0;

    while (i < count) : (i += 1) {
        const n = std.mem.readInt(u32, bytes[offset..][0..4], .little);
        offset += 4;

        std.debug.print("{}\n", .{n});
    }
}

This is safer. It treats the file as bytes and converts bytes into integers deliberately.
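If the caller needs the numbers as a real slice, one option is to allocate a []u32 and fill it one integer at a time. The following is a sketch under that assumption; parseNumbersAlloc is a hypothetical name, and it uses the page allocator only to keep the example short.

```zig
const std = @import("std");

const ParseError = error{ BadMagic, Truncated, OutOfMemory };

/// Parses a NUMS file image and returns a newly allocated slice of values.
/// The caller owns the result and must free it with the same allocator.
fn parseNumbersAlloc(allocator: std.mem.Allocator, bytes: []const u8) ParseError![]u32 {
    if (bytes.len < 8) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) return error.BadMagic;

    const count = std.mem.readInt(u32, bytes[4..8], .little);

    // Compare against the bytes actually present before allocating,
    // so a corrupt count cannot trigger a huge allocation.
    if ((bytes.len - 8) / 4 < count) return error.Truncated;

    const numbers = try allocator.alloc(u32, count);
    for (numbers, 0..) |*slot, i| {
        slot.* = std.mem.readInt(u32, bytes[8 + i * 4 ..][0..4], .little);
    }
    return numbers;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const image = "NUMS" ++ "\x03\x00\x00\x00" ++
        "\x0A\x00\x00\x00" ++ "\x14\x00\x00\x00" ++ "\x1E\x00\x00\x00";

    const numbers = try parseNumbersAlloc(allocator, image);
    defer allocator.free(numbers);

    for (numbers) |n| std.debug.print("{d}\n", .{n});
}
```

The returned slice is a copy in native byte order, so the rest of the program never touches the raw file bytes again.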

Endianness

Endianness is the order in which the bytes of a multi-byte integer are stored.

The integer 0x12345678 can be stored in memory as:

big-endian:    12 34 56 78
little-endian: 78 56 34 12

Many modern machines are little-endian, but file formats should not depend on the current machine unless they are explicitly machine-local.

A good binary format states its byte order.

Example:

All integers are little-endian.

Then every reader and writer must follow that rule.

In Zig, make the byte order explicit:

std.mem.writeInt(u32, &buffer, value, .little);
std.mem.readInt(u32, bytes[0..4], .little);

This makes the file format portable across machines.
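To see both byte orders side by side, the short sketch below writes 0x12345678 into a four-byte buffer twice and then reads the little-endian bytes back with the wrong order on purpose.

```zig
const std = @import("std");

pub fn main() void {
    var buffer: [4]u8 = undefined;

    std.mem.writeInt(u32, &buffer, 0x12345678, .big);
    std.debug.print("big-endian:    ", .{});
    for (buffer) |b| std.debug.print("{X:0>2} ", .{b});
    std.debug.print("\n", .{});

    std.mem.writeInt(u32, &buffer, 0x12345678, .little);
    std.debug.print("little-endian: ", .{});
    for (buffer) |b| std.debug.print("{X:0>2} ", .{b});
    std.debug.print("\n", .{});

    // Reading little-endian bytes as big-endian silently yields another value.
    const misread = std.mem.readInt(u32, &buffer, .big);
    std.debug.print("misread:       0x{X:0>8}\n", .{misread});
}
```

A reader that picks the wrong byte order gets no error, only a wrong number, which is why the format has to state the order.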

Alignment

A binary file is a sequence of bytes. It does not automatically obey the alignment rules of your CPU.

This is dangerous:

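const value: *const u32 = @ptrCast(@alignCast(bytes.ptr));

Zig will not compile this cast without @alignCast, because a byte pointer carries no guarantee of u32 alignment. @alignCast only asserts the alignment: if the assertion is false, safe builds panic and unsafe builds have undefined behavior. Even when the pointer happens to be aligned, the file may use a different endianness, and the layout may not match Zig’s in-memory layout.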

Prefer this:

const value = std.mem.readInt(u32, bytes[0..4], .little);

Parsing through bytes is clearer and safer.
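Reading through bytes also works at offsets that no u32 pointer could legally point to. The sketch below reads an integer that starts at offset 1; readU32At is a hypothetical helper, not a standard library function.

```zig
const std = @import("std");

/// Reads a little-endian u32 starting at `offset`.
/// Returns null if fewer than four bytes remain at that position.
fn readU32At(bytes: []const u8, offset: usize) ?u32 {
    if (offset > bytes.len or bytes.len - offset < 4) return null;
    return std.mem.readInt(u32, bytes[offset..][0..4], .little);
}

pub fn main() void {
    // One tag byte followed by a u32, so the integer starts at offset 1.
    const bytes = [_]u8{ 0xFF, 0x2A, 0x00, 0x00, 0x00 };

    if (readU32At(&bytes, 1)) |value| {
        std.debug.print("{d}\n", .{value});
    } else {
        std.debug.print("not enough bytes\n", .{});
    }
}
```

Because readInt assembles the value from individual bytes, the position of the integer in memory does not matter.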

Struct Layout Is Not a File Format

A common beginner mistake is to write a struct directly to disk and treat that as a file format.

const Header = struct {
    version: u32,
    count: u32,
};

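Zig does not guarantee the in-memory layout of a plain struct: the compiler may reorder fields and insert padding, and the result can differ between targets and compiler versions. An extern struct or packed struct has a defined layout, but it still stores integers in the machine’s native byte order, and it changes whenever the fields change.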

For file formats, define bytes, not structs.

Better:

bytes 0..4    version, u32 little-endian
bytes 4..8    count, u32 little-endian

Then write parsing code that follows the byte layout.

You may use structs internally after parsing, but the file format itself should be described as bytes.
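A minimal sketch of that separation: the struct is only the in-memory view, and two small functions translate between it and the eight bytes defined by the layout above. The names encodeHeader, decodeHeader, and header_size are illustrative.

```zig
const std = @import("std");

/// In-memory representation. Its layout is irrelevant to the file.
const Header = struct {
    version: u32,
    count: u32,
};

// Size of the header on disk, as defined by the byte layout.
const header_size = 8;

/// Writes the header fields at their documented byte positions.
fn encodeHeader(h: Header) [header_size]u8 {
    var out: [header_size]u8 = undefined;
    std.mem.writeInt(u32, out[0..4], h.version, .little);
    std.mem.writeInt(u32, out[4..8], h.count, .little);
    return out;
}

/// Builds a Header from exactly `header_size` bytes.
fn decodeHeader(bytes: *const [header_size]u8) Header {
    return .{
        .version = std.mem.readInt(u32, bytes[0..4], .little),
        .count = std.mem.readInt(u32, bytes[4..8], .little),
    };
}

pub fn main() void {
    const encoded = encodeHeader(.{ .version = 1, .count = 3 });
    const decoded = decodeHeader(&encoded);
    std.debug.print("version {d}, count {d}\n", .{ decoded.version, decoded.count });
}
```

Adding or reordering struct fields later does not change the file, because only the two functions know the byte positions.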

Offsets

Binary parsing is mostly offset management.

You keep track of where you are in the byte slice.

var offset: usize = 0;

const magic = bytes[offset..][0..4];
offset += 4;

const version = std.mem.readInt(u32, bytes[offset..][0..4], .little);
offset += 4;

This pattern appears everywhere in parsers.

For larger formats, it is useful to create a small reader.

const ByteReader = struct {
    bytes: []const u8,
    offset: usize = 0,

    fn readBytes(self: *ByteReader, n: usize) ![]const u8 {
        // Compare against the remaining length so the check cannot overflow.
        if (n > self.bytes.len - self.offset) {
            return error.Truncated;
        }

        const out = self.bytes[self.offset..][0..n];
        self.offset += n;
        return out;
    }

    fn readU32(self: *ByteReader) !u32 {
        const b = try self.readBytes(4);
        // readInt takes a pointer to exactly four bytes, not a slice.
        return std.mem.readInt(u32, b[0..4], .little);
    }
};

Now the parser is cleaner:

var reader = ByteReader{ .bytes = bytes };

const magic = try reader.readBytes(4);
const count = try reader.readU32();
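Put together, a reader like this can drive a complete parse of the NUMS format. The sketch below repeats a small ByteReader so that it is self-contained and sums the stored numbers; sumNums is a hypothetical function chosen only to give the parse a visible result.

```zig
const std = @import("std");

const ByteReader = struct {
    bytes: []const u8,
    offset: usize = 0,

    fn readBytes(self: *ByteReader, n: usize) error{Truncated}![]const u8 {
        // offset never exceeds bytes.len, so this subtraction cannot underflow.
        if (n > self.bytes.len - self.offset) return error.Truncated;
        const out = self.bytes[self.offset..][0..n];
        self.offset += n;
        return out;
    }

    fn readU32(self: *ByteReader) error{Truncated}!u32 {
        const b = try self.readBytes(4);
        return std.mem.readInt(u32, b[0..4], .little);
    }
};

/// Parses a NUMS image and returns the sum of its numbers.
fn sumNums(bytes: []const u8) error{ BadMagic, Truncated }!u64 {
    var reader = ByteReader{ .bytes = bytes };

    const magic = try reader.readBytes(4);
    if (!std.mem.eql(u8, magic, "NUMS")) return error.BadMagic;

    const count = try reader.readU32();

    var total: u64 = 0;
    var i: u32 = 0;
    while (i < count) : (i += 1) {
        // Each read is bounds-checked, so a short file ends in error.Truncated.
        total += try reader.readU32();
    }
    return total;
}

pub fn main() !void {
    const image = "NUMS" ++ "\x03\x00\x00\x00" ++
        "\x0A\x00\x00\x00" ++ "\x14\x00\x00\x00" ++ "\x1E\x00\x00\x00";
    std.debug.print("{d}\n", .{try sumNums(image)});
}
```

The parser body no longer contains any offset arithmetic; every bounds check lives in readBytes.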

Versioning

Binary formats should include a version field.

Example:

bytes 0..4    magic: "NUMS"
bytes 4..8    version: u32 little-endian
bytes 8..12   count: u32 little-endian
bytes 12..    numbers

Versioning lets your format evolve.

Version 1 might store only numbers.

Version 2 might add timestamps.

Version 3 might add compression.

Without a version field, future readers must guess which layout the file uses. Guessing is fragile.
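A reader can then branch on the version and refuse anything it does not know. The sketch below parses the 12-byte header shown above. The record sizes are assumptions made for illustration: version 1 stores a bare u32, and version 2 is imagined to add an 8-byte timestamp per record.

```zig
const std = @import("std");

const HeaderError = error{ BadMagic, Truncated, UnsupportedVersion };

const Header = struct {
    version: u32,
    count: u32,
    record_size: usize,
};

/// Hypothetical per-record sizes: v1 = u32 value, v2 = u32 value + 8-byte timestamp.
fn recordSize(version: u32) HeaderError!usize {
    return switch (version) {
        1 => 4,
        2 => 12,
        else => error.UnsupportedVersion,
    };
}

fn parseHeader(bytes: []const u8) HeaderError!Header {
    if (bytes.len < 12) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) return error.BadMagic;

    const version = std.mem.readInt(u32, bytes[4..8], .little);
    const count = std.mem.readInt(u32, bytes[8..12], .little);

    return .{
        .version = version,
        .count = count,
        .record_size = try recordSize(version),
    };
}

pub fn main() void {
    const v1 = "NUMS" ++ "\x01\x00\x00\x00" ++ "\x03\x00\x00\x00";
    const v9 = "NUMS" ++ "\x09\x00\x00\x00" ++ "\x03\x00\x00\x00";

    for ([_][]const u8{ v1, v9 }) |image| {
        if (parseHeader(image)) |h| {
            std.debug.print("version {d}: {d} records of {d} bytes\n", .{ h.version, h.count, h.record_size });
        } else |err| {
            std.debug.print("rejected: {s}\n", .{@errorName(err)});
        }
    }
}
```

Rejecting unknown versions explicitly is safer than parsing them with the newest layout the reader happens to know.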

Length Fields

Binary formats often use length fields.

Example:

bytes 0..4      name length, u32 little-endian
next N bytes    UTF-8 name bytes

When parsing length fields, always check bounds.

Bad:

const name = bytes[offset .. offset + name_len];

Good:

if (offset + name_len > bytes.len) {
    return error.Truncated;
}
const name = bytes[offset .. offset + name_len];

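Also watch for integer overflow when computing sizes. If offset + name_len overflows, safe builds panic and unsafe builds have undefined behavior, so compute the end position with a checked add and compare that against the buffer length:

const end = std.math.add(usize, offset, name_len) catch {
    return error.Truncated;
};
if (end > bytes.len) return error.Truncated;
const name = bytes[offset..end];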

For parsers that read untrusted files, these checks are not optional.
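The length-prefixed layout above can be wrapped in one function that performs every check before slicing. This is a sketch, with Field and readLengthPrefixed as hypothetical names. It avoids overflow by comparing against the remaining bytes instead of adding, which is an alternative to the checked add.

```zig
const std = @import("std");

const Field = struct {
    data: []const u8,
    // Offset of the first byte after this field.
    next: usize,
};

/// Reads a u32 little-endian length at `offset`, followed by that many bytes.
fn readLengthPrefixed(bytes: []const u8, offset: usize) error{Truncated}!Field {
    // Is there room for the 4-byte length itself?
    if (offset > bytes.len or bytes.len - offset < 4) return error.Truncated;
    const len = std.mem.readInt(u32, bytes[offset..][0..4], .little);

    const start = offset + 4;
    // start <= bytes.len here, so the subtraction cannot underflow.
    if (len > bytes.len - start) return error.Truncated;

    return .{ .data = bytes[start..][0..len], .next = start + len };
}

pub fn main() !void {
    const bytes = "\x05\x00\x00\x00" ++ "Alice";
    const field = try readLengthPrefixed(bytes, 0);
    std.debug.print("{s}, next offset {d}\n", .{ field.data, field.next });
}
```

Returning the next offset lets the caller chain several fields without repeating the arithmetic.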

Checksums

Some binary formats include checksums.

A checksum is a value computed from bytes to detect corruption.

Example layout:

bytes 0..4      magic
bytes 4..8      payload length
bytes 8..12     checksum
bytes 12..      payload

When reading, the parser recomputes the checksum and compares it with the stored checksum.

Checksums do not prove that data is safe or authentic. They mainly detect accidental corruption. For security, use cryptographic authentication such as MACs or signatures.
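As a concrete instance of the layout above, the sketch below verifies a frame whose checksum is a CRC-32 of the payload, using std.hash.Crc32 from the standard library. The magic "CHK1", the choice of CRC-32, and the little-endian u32 fields are assumptions for this example; the text does not fix them.

```zig
const std = @import("std");

const FrameError = error{ BadMagic, Truncated, BadChecksum };

/// Layout: magic (4) | payload length u32 LE (4) | CRC-32 u32 LE (4) | payload.
/// Returns the payload only if the stored checksum matches.
fn verifyFrame(bytes: []const u8) FrameError![]const u8 {
    if (bytes.len < 12) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "CHK1")) return error.BadMagic;

    const payload_len = std.mem.readInt(u32, bytes[4..8], .little);
    const stored = std.mem.readInt(u32, bytes[8..12], .little);

    if (payload_len > bytes.len - 12) return error.Truncated;
    const payload = bytes[12..][0..payload_len];

    // Recompute the checksum and compare it with the stored one.
    if (std.hash.Crc32.hash(payload) != stored) return error.BadChecksum;
    return payload;
}

pub fn main() !void {
    const payload = "hello";
    var frame: [12 + payload.len]u8 = undefined;
    @memcpy(frame[0..4], "CHK1");
    std.mem.writeInt(u32, frame[4..8], payload.len, .little);
    std.mem.writeInt(u32, frame[8..12], std.hash.Crc32.hash(payload), .little);
    @memcpy(frame[12..], payload);

    std.debug.print("{s}\n", .{try verifyFrame(&frame)});

    frame[12] ^= 0x01; // simulate one corrupted bit
    if (verifyFrame(&frame)) |_| {
        std.debug.print("unexpectedly valid\n", .{});
    } else |err| {
        std.debug.print("rejected: {s}\n", .{@errorName(err)});
    }
}
```

Anyone who can modify the payload can also recompute a CRC, which is why this detects accidents but not tampering.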

Binary Formats Must Be Defensive

A binary parser should assume the input may be invalid.

The file may be too short.

The magic number may be wrong.

The version may be unsupported.

A length field may point past the end.

A count may be huge.

Offsets may overflow.

Data may be compressed incorrectly.

Strings may not be valid UTF-8.

Your parser should reject bad data cleanly instead of crashing or reading outside the buffer.

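Zig helps because slices carry lengths and integer conversions are explicit. In safe build modes an out-of-bounds index or an overflowing addition panics instead of corrupting memory, but a panic is still a crash, so you still need to write the checks.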
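Several of the failure cases listed above can be turned into distinct errors and exercised with deliberately malformed inputs. The sketch below validates the versioned NUMS header; the max_count limit is a hypothetical policy value, not part of the format.

```zig
const std = @import("std");

const ParseError = error{ Truncated, BadMagic, UnsupportedVersion, TooManyRecords };

// Hypothetical policy limit on how many records a file may claim.
const max_count: u32 = 1_000_000;

/// Validates magic, version, and count, and returns the record count.
fn validate(bytes: []const u8) ParseError!u32 {
    if (bytes.len < 12) return error.Truncated;
    if (!std.mem.eql(u8, bytes[0..4], "NUMS")) return error.BadMagic;
    if (std.mem.readInt(u32, bytes[4..8], .little) != 1) return error.UnsupportedVersion;

    const count = std.mem.readInt(u32, bytes[8..12], .little);
    if (count > max_count) return error.TooManyRecords;
    // Division instead of multiplication keeps the check free of overflow.
    if (count > (bytes.len - 12) / 4) return error.Truncated;
    return count;
}

pub fn main() void {
    const cases = [_][]const u8{
        "NUM", // too short
        "XXXX" ++ "\x01\x00\x00\x00" ++ "\x00\x00\x00\x00", // wrong magic
        "NUMS" ++ "\x07\x00\x00\x00" ++ "\x00\x00\x00\x00", // unknown version
        "NUMS" ++ "\x01\x00\x00\x00" ++ "\xFF\xFF\xFF\xFF", // huge count
        "NUMS" ++ "\x01\x00\x00\x00" ++ "\x02\x00\x00\x00" ++ "\x0A\x00\x00\x00", // missing record
        "NUMS" ++ "\x01\x00\x00\x00" ++ "\x01\x00\x00\x00" ++ "\x0A\x00\x00\x00", // valid
    };

    for (cases) |input| {
        if (validate(input)) |count| {
            std.debug.print("ok: {d} record(s)\n", .{count});
        } else |err| {
            std.debug.print("rejected: {s}\n", .{@errorName(err)});
        }
    }
}
```

Keeping a list of malformed inputs like this next to the parser makes it easy to confirm that each one is rejected with an error rather than a panic.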

When Binary Formats Are Useful

Binary formats are useful when you care about compact size, fast parsing, exact layout, or compatibility with existing systems.

Common examples:

image files

audio files

video files

database files

index files

network packets

executables

object files

Mental Model

A binary file is a contract.

The contract says what each byte means.

Your Zig code should follow that contract exactly: check the magic number, read integers with explicit endianness, validate lengths, manage offsets carefully, and reject malformed data.

Do not treat file bytes as native structs too early. Parse bytes first. Build structured values after validation.