Efficient text processing means working with text without doing unnecessary allocation, copying, or decoding.
In Zig, text is usually a byte slice:

```zig
[]const u8
```

That means most text processing starts with a simple idea:

- read bytes first
- allocate only when needed
- decode UTF-8 only when needed

This fits Zig well because Zig makes memory and ownership visible.
Prefer Slices Over Copies
A slice lets you refer to part of a string without copying it.
```zig
const text = "hello zig";
const first = text[0..5];
const second = text[6..9];
```

Now:

```text
first = hello
second = zig
```

No new string was created. first and second both point into text.

This is cheap because a slice is only:

```text
pointer + length
```

Use this style when parsing, splitting, scanning, or inspecting text.
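One way to check that slicing does not copy is to compare addresses. The sketch below uses a hypothetical helper, isSubslice (not part of the standard library), that tests whether one slice lies inside the memory of another.

```zig
const std = @import("std");

/// Returns true when `part` lies entirely inside the memory of `whole`.
/// `isSubslice` is a hypothetical helper written for this illustration.
fn isSubslice(whole: []const u8, part: []const u8) bool {
    const whole_start = @intFromPtr(whole.ptr);
    const part_start = @intFromPtr(part.ptr);
    return part_start >= whole_start and
        part_start + part.len <= whole_start + whole.len;
}

pub fn main() void {
    const text = "hello zig";
    const first = text[0..5];
    const second = text[6..9];

    // Both slices refer to bytes that already exist inside `text`.
    std.debug.print("{} {}\n", .{ isSubslice(text, first), isSubslice(text, second) });
}
```

Both checks should print true, since each slice is just a pointer into text plus a length.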
Example: Get the First Word
```zig
const std = @import("std");

/// Returns the bytes before the first space, or the whole text if there is no space.
fn firstWord(text: []const u8) []const u8 {
    for (text, 0..) |byte, index| {
        if (byte == ' ') {
            return text[0..index];
        }
    }
    return text;
}

pub fn main() void {
    const text = "hello zig";
    const word = firstWord(text);
    std.debug.print("{s}\n", .{word});
}
```

Output:

```text
hello
```

The function returns a slice into the original text. It does not allocate.
That is efficient, but it also means the returned slice is valid only while the original text is valid.
Avoid Building Strings Too Early
Many programs waste work by creating new strings before they need to.
For example, suppose you want to check whether a file path ends with .zig.
You do not need to copy the path. You can inspect the existing bytes.
```zig
const std = @import("std");

fn isZigFile(path: []const u8) bool {
    return std.mem.endsWith(u8, path, ".zig");
}

pub fn main() void {
    std.debug.print("{}\n", .{isZigFile("main.zig")});
    std.debug.print("{}\n", .{isZigFile("main.c")});
}
```

Output:

```text
true
false
```

This function does not allocate, does not copy, and does not decode Unicode. It only compares bytes.
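The same idea extends to case-insensitive checks. A tempting approach is to build a lowercase copy of the path first; the sketch below instead compares the tail of the path in place. The helper name hasExtensionIgnoreCase is hypothetical, and the comparison handles ASCII case only.

```zig
const std = @import("std");

/// Reports whether `path` ends with `ext`, ignoring ASCII case.
/// `hasExtensionIgnoreCase` is a hypothetical helper for illustration.
fn hasExtensionIgnoreCase(path: []const u8, ext: []const u8) bool {
    if (path.len < ext.len) return false;
    // Compare only the tail of the path; nothing is copied or lowercased.
    const tail = path[path.len - ext.len ..];
    return std.ascii.eqlIgnoreCase(tail, ext);
}

pub fn main() void {
    std.debug.print("{}\n", .{hasExtensionIgnoreCase("MAIN.ZIG", ".zig")});
    std.debug.print("{}\n", .{hasExtensionIgnoreCase("main.c", ".zig")});
}
```

The first call should report true and the second false, without any intermediate string being created.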
Use std.mem for Byte Slice Work
The std.mem namespace contains many useful functions for byte slices.
Common examples:
| Function | Purpose |
|---|---|
| std.mem.eql | Compare two slices |
| std.mem.startsWith | Check prefix |
| std.mem.endsWith | Check suffix |
| std.mem.indexOf | Find a slice inside another slice |
| std.mem.splitScalar | Split by one byte |
| std.mem.tokenizeScalar | Split while skipping empty parts |
| std.mem.trim | Remove bytes from both ends |
Example:
```zig
const std = @import("std");

pub fn main() void {
    const text = "name=zig";

    if (std.mem.indexOf(u8, text, "=")) |index| {
        const key = text[0..index];
        const value = text[index + 1 ..];
        std.debug.print("key = {s}\n", .{key});
        std.debug.print("value = {s}\n", .{value});
    }
}
```

Output:

```text
key = name
value = zig
```

Again, key and value are slices into text. No allocation happens.
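These functions compose well. As a sketch, the hypothetical helper flagValue below combines startsWith, indexOfScalar, and eql to read a command-line style flag of the form --name=value. The flag syntax is an assumption made for the example.

```zig
const std = @import("std");

/// If `arg` looks like "--<name>=<value>", returns the value as a slice of `arg`.
/// `flagValue` is a hypothetical helper for illustration.
fn flagValue(arg: []const u8, name: []const u8) ?[]const u8 {
    if (!std.mem.startsWith(u8, arg, "--")) return null;
    const rest = arg[2..];
    const eq = std.mem.indexOfScalar(u8, rest, '=') orelse return null;
    if (!std.mem.eql(u8, rest[0..eq], name)) return null;
    return rest[eq + 1 ..];
}

pub fn main() void {
    if (flagValue("--mode=debug", "mode")) |value| {
        std.debug.print("mode = {s}\n", .{value});
    }
}
```

The returned value is still a slice of the argument, so the helper stays allocation-free.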
Split Without Allocation
Splitting text does not need to create new strings.
```zig
const std = @import("std");

pub fn main() void {
    const path = "usr/local/bin";

    var it = std.mem.splitScalar(u8, path, '/');
    while (it.next()) |part| {
        std.debug.print("{s}\n", .{part});
    }
}
```

Output:

```text
usr
local
bin
```

Each part is a slice into path.
This is efficient because the iterator only tracks positions.
splitScalar vs tokenizeScalar
splitScalar keeps empty fields.
```zig
const std = @import("std");

pub fn main() void {
    const text = "a,,b";

    var it = std.mem.splitScalar(u8, text, ',');
    while (it.next()) |part| {
        std.debug.print("[{s}]\n", .{part});
    }
}
```

Output:

```text
[a]
[]
[b]
```

The empty part between the two commas is preserved.
tokenizeScalar skips empty fields.
```zig
const std = @import("std");

pub fn main() void {
    const text = "a,,b";

    var it = std.mem.tokenizeScalar(u8, text, ',');
    while (it.next()) |part| {
        std.debug.print("[{s}]\n", .{part});
    }
}
```

Output:

```text
[a]
[b]
```

Use splitScalar when empty fields matter, such as CSV-like data. Use tokenizeScalar when repeated separators should be ignored, such as simple whitespace tokenization.
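The difference is easiest to see on edge cases such as leading or trailing separators and empty input. The sketch below counts the fields each iterator yields; countSplit and countTokens are hypothetical helper names.

```zig
const std = @import("std");

/// Counts fields produced by splitScalar (empty fields included).
fn countSplit(text: []const u8, sep: u8) usize {
    var it = std.mem.splitScalar(u8, text, sep);
    var n: usize = 0;
    while (it.next()) |_| n += 1;
    return n;
}

/// Counts fields produced by tokenizeScalar (empty fields skipped).
fn countTokens(text: []const u8, sep: u8) usize {
    var it = std.mem.tokenizeScalar(u8, text, sep);
    var n: usize = 0;
    while (it.next()) |_| n += 1;
    return n;
}

pub fn main() void {
    const inputs = [_][]const u8{ "a,,b", ",a,", "" };
    for (inputs) |input| {
        std.debug.print("\"{s}\": split={} tokenize={}\n", .{
            input, countSplit(input, ','), countTokens(input, ','),
        });
    }
}
```

Note that splitting an empty string still yields one (empty) field, while tokenizing it yields none.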
Trim Without Allocation
Trimming can also return a slice.
```zig
const std = @import("std");

pub fn main() void {
    const line = " hello zig ";
    const trimmed = std.mem.trim(u8, line, " ");
    std.debug.print("[{s}]\n", .{trimmed});
}
```

Output:

```text
[hello zig]
```

trimmed points into line. It does not allocate a new string.
You can trim several bytes:
```zig
const trimmed = std.mem.trim(u8, line, " \t\r\n");
```

This removes spaces, tabs, carriage returns, and newlines from both ends.
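Trimming combines naturally with other slice operations. As a sketch, the hypothetical helper stripComment below cuts a line at the first '#' and then trims the remainder. The '#' comment syntax is an assumption made for the example.

```zig
const std = @import("std");

/// Drops a trailing "# comment" and surrounding whitespace from a line.
/// The '#' comment syntax and the name `stripComment` are assumptions of this sketch.
fn stripComment(line: []const u8) []const u8 {
    const end = std.mem.indexOfScalar(u8, line, '#') orelse line.len;
    return std.mem.trim(u8, line[0..end], " \t\r\n");
}

pub fn main() void {
    const cleaned = stripComment("  mode = debug  # default\r\n");
    std.debug.print("[{s}]\n", .{cleaned});
}
```

The result is again a slice of the input line, so the caller can keep processing it without owning any new memory.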
Scan Once When Possible
A common performance rule is: avoid reading the same text many times.
For example, this counts lines:
```zig
fn countLines(text: []const u8) usize {
    var count: usize = 0;
    for (text) |byte| {
        if (byte == '\n') {
            count += 1;
        }
    }
    return count;
}
```

This is efficient because it scans once from left to right.
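If the input may not end with a newline, you may want to count the final line too:

```zig
fn countLines(text: []const u8) usize {
    var count: usize = 0;
    for (text) |byte| {
        if (byte == '\n') {
            count += 1;
        }
    }
    // Count a final line that has no trailing newline.
    if (text.len > 0 and text[text.len - 1] != '\n') {
        count += 1;
    }
    return count;
}
```

This version treats non-empty text as having at least one line, and it does not count an extra line when the text already ends with a newline.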
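Scanning once also means gathering several facts in the same pass instead of looping again for each one. The sketch below computes the line count and the longest line length together; LineStats and lineStats are hypothetical names, and lengths are measured in bytes.

```zig
const std = @import("std");

const LineStats = struct {
    lines: usize,
    longest: usize,
};

/// Computes the line count and the longest line length (in bytes) in one pass.
fn lineStats(text: []const u8) LineStats {
    var stats = LineStats{ .lines = 0, .longest = 0 };
    var current: usize = 0; // length of the line being scanned
    for (text) |byte| {
        if (byte == '\n') {
            stats.lines += 1;
            stats.longest = @max(stats.longest, current);
            current = 0;
        } else {
            current += 1;
        }
    }
    // A final line without a trailing newline still counts.
    if (current > 0) {
        stats.lines += 1;
        stats.longest = @max(stats.longest, current);
    }
    return stats;
}

pub fn main() void {
    const stats = lineStats("ab\ncdef\ng");
    std.debug.print("lines={} longest={}\n", .{ stats.lines, stats.longest });
}
```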
Parse Without Copying
Suppose you parse key-value lines:
```text
name=zig
version=0.16
mode=debug
```

You can parse each line using slices.

```zig
const std = @import("std");

fn printKeyValue(line: []const u8) void {
    if (std.mem.indexOf(u8, line, "=")) |index| {
        const key = line[0..index];
        const value = line[index + 1 ..];
        std.debug.print("key={s}, value={s}\n", .{ key, value });
    }
}

pub fn main() void {
    const text =
        \\name=zig
        \\version=0.16
        \\mode=debug
    ;

    var lines = std.mem.splitScalar(u8, text, '\n');
    while (lines.next()) |line| {
        if (line.len == 0) continue;
        printKeyValue(line);
    }
}
```

Output:

```text
key=name, value=zig
key=version, value=0.16
key=mode, value=debug
```

The parser does not allocate. Each key and value is a slice into the original text.
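The same approach supports lookups. As a sketch, the hypothetical function findValue below scans the lines for a key and returns the matching value as a slice of the input, or null when the key is absent.

```zig
const std = @import("std");

/// Returns the value for `key` in "key=value" lines, as a slice of `text`.
/// `findValue` is a hypothetical helper for illustration.
fn findValue(text: []const u8, key: []const u8) ?[]const u8 {
    var lines = std.mem.splitScalar(u8, text, '\n');
    while (lines.next()) |line| {
        // Lines without '=' are skipped.
        const index = std.mem.indexOfScalar(u8, line, '=') orelse continue;
        if (std.mem.eql(u8, line[0..index], key)) {
            return line[index + 1 ..];
        }
    }
    return null;
}

pub fn main() void {
    const text = "name=zig\nversion=0.16\nmode=debug";
    if (findValue(text, "version")) |value| {
        std.debug.print("version is {s}\n", .{value});
    }
}
```

Because the result borrows from text, it is valid only while text is.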
Allocate Only for Owned Results
Sometimes you need a result that outlives the input. Then you should allocate or copy.
For example, this function returns a slice into the input:
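```zig
const std = @import("std");

/// Returns the bytes after the last '.', borrowed from `path`.
fn extension(path: []const u8) ?[]const u8 {
    if (std.mem.lastIndexOfScalar(u8, path, '.')) |index| {
        return path[index + 1 ..];
    }
    return null;
}
```

That is fine when the caller keeps path alive.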
But if you need to store the extension after the original path is gone, make a copy:
```zig
fn copyExtension(
    allocator: std.mem.Allocator,
    path: []const u8,
) !?[]u8 {
    const ext = extension(path) orelse return null;

    const copy = try allocator.alloc(u8, ext.len);
    @memcpy(copy, ext);

    return copy;
}
```

The caller owns the returned copy and must free it.
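A caller might use the owned copy as in the sketch below. It overwrites the original storage to show that the copy is independent of it. The use of std.heap.page_allocator and of allocator.dupe (in place of alloc plus @memcpy) are choices made to keep the example short, not requirements.

```zig
const std = @import("std");

fn extension(path: []const u8) ?[]const u8 {
    const index = std.mem.lastIndexOfScalar(u8, path, '.') orelse return null;
    return path[index + 1 ..];
}

/// Returns an owned copy of the extension; the caller must free it.
fn copyExtension(allocator: std.mem.Allocator, path: []const u8) !?[]u8 {
    const ext = extension(path) orelse return null;
    return try allocator.dupe(u8, ext);
}

pub fn main() !void {
    // page_allocator is used only to keep the sketch short.
    const allocator = std.heap.page_allocator;

    var buffer: [16]u8 = undefined;
    const path = try std.fmt.bufPrint(&buffer, "{s}", .{"main.zig"});

    const owned = (try copyExtension(allocator, path)) orelse return;
    defer allocator.free(owned);

    // Overwrite the original storage; a borrowed slice would now read "xxx".
    for (&buffer) |*byte| byte.* = 'x';

    std.debug.print("{s}\n", .{owned});
}
```

The defer line is where ownership shows up in the code: whoever receives the copy is responsible for freeing it.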
Reuse Buffers
If you build temporary text repeatedly, reuse a buffer or ArrayList.
```zig
const std = @import("std");

pub fn main() !void {
    var buffer: [128]u8 = undefined;

    for (0..3) |i| {
        // Each call overwrites the previous message in the same buffer.
        const message = try std.fmt.bufPrint(buffer[0..], "item {}", .{i});
        std.debug.print("{s}\n", .{message});
    }
}
```

Output:

```text
item 0
item 1
item 2
```

The same stack buffer is reused for each message.
This avoids heap allocation.
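The trade-off of a fixed buffer is that output may not fit. std.fmt.bufPrint reports this as error.NoSpaceLeft rather than truncating silently. The sketch below shows one way to handle it; formatItem is a hypothetical helper and the fallback text is arbitrary.

```zig
const std = @import("std");

/// Formats "item N" into `buffer`; fails if the buffer is too small.
/// `formatItem` is a hypothetical helper for illustration.
fn formatItem(buffer: []u8, index: usize) ![]u8 {
    return std.fmt.bufPrint(buffer, "item {}", .{index});
}

pub fn main() void {
    var small: [4]u8 = undefined;
    var large: [32]u8 = undefined;

    // The small buffer cannot hold "item 7", so the fallback is used.
    const a: []const u8 = formatItem(&small, 7) catch "(too long)";
    const b: []const u8 = formatItem(&large, 7) catch "(too long)";

    std.debug.print("{s}\n{s}\n", .{ a, b });
}
```

Choosing a buffer size that covers the largest expected message keeps the fast path allocation-free while still failing loudly when the assumption is wrong.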
For variable-size output, reuse an ArrayList:
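```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    const allocator = gpa.allocator();

    // Managed ArrayList: the list stores the allocator passed to init.
    var text = std.ArrayList(u8).init(allocator);
    defer text.deinit();

    for (0..3) |i| {
        // Reset the length to zero but keep the allocated capacity.
        text.clearRetainingCapacity();
        try text.writer().print("item {}", .{i});
        std.debug.print("{s}\n", .{text.items});
    }
}
```

The list keeps its allocated capacity and reuses it. This example uses the managed ArrayList API, in which the list stores its allocator; more recent Zig releases pass the allocator to each call instead, so adjust the calls to match your compiler version.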
Avoid Holding Slices Across Reallocation
This is a common bug:
```zig
const old = text.items;
try text.appendSlice("more data");
// old may now be invalid
```

An ArrayList may reallocate when it grows. If it reallocates, old slices into its storage may become invalid.
Use text.items again after operations that may grow the list.
```zig
try text.appendSlice("more data");
const current = text.items;
```

This rule matters for efficient code because efficient code often keeps references. Keep them only as long as the underlying memory is stable.
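When a reference must survive growth, one option is to store offsets instead of slices and re-slice from the current storage when needed. The sketch below uses a hypothetical Span type and simulates a reallocation by copying the bytes into a second, larger buffer, so it does not depend on any particular container API.

```zig
const std = @import("std");

/// A byte range stored as offsets instead of a pointer.
/// `Span` is a hypothetical type used only in this sketch.
const Span = struct {
    start: usize,
    end: usize,

    /// Re-slices the range from whatever buffer currently holds the bytes.
    fn resolve(self: Span, bytes: []const u8) []const u8 {
        return bytes[self.start..self.end];
    }
};

pub fn main() void {
    const original: []const u8 = "name=zig";
    const key = Span{ .start = 0, .end = 4 };

    // Simulate growth: the bytes move to a larger buffer, as a reallocation would do.
    var larger: [32]u8 = undefined;
    @memcpy(larger[0..original.len], original);

    // The offsets are still meaningful in the new storage.
    std.debug.print("{s}\n", .{key.resolve(larger[0..original.len])});
}
```

Offsets cost one extra slicing step on each use, but they cannot dangle when the storage moves.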
Know When UTF-8 Decoding Is Needed
Many text tasks are byte tasks:
- check file extension
- split path by slash
- parse ASCII protocol headers
- find newline
- compare command names
- trim spaces

For these, byte operations are correct and fast.
Some tasks need Unicode-aware processing:
- count user-visible characters
- move cursor by character
- uppercase multilingual text
- validate user text
- slice without breaking code points
- display aligned columns with non-ASCII text

For these, use UTF-8 validation and decoding.
Do not decode Unicode when byte processing is enough. Do not use byte processing when Unicode meaning matters.
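A small sketch makes the distinction concrete: the byte length of a string is free to read, while the code point count requires decoding. Note that code points are still not the same as user-visible characters, which can span several code points.

```zig
const std = @import("std");

pub fn main() !void {
    const text = "Aé你";

    // Byte length: 1 + 2 + 3 bytes in UTF-8. No decoding needed.
    std.debug.print("bytes = {}\n", .{text.len});

    // Counting code points decodes the text and fails on invalid UTF-8.
    const count = try std.unicode.utf8CountCodepoints(text);
    std.debug.print("code points = {}\n", .{count});
}
```

For this input the byte length is 6 and the code point count is 3.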
Example: Validate Before Unicode Processing
```zig
const std = @import("std");

fn printCodepoints(text: []const u8) !void {
    // init validates the whole input and fails on malformed UTF-8.
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();

    while (it.nextCodepoint()) |cp| {
        std.debug.print("U+{X}\n", .{cp});
    }
}

pub fn main() !void {
    const text = "Aé你";
    try printCodepoints(text);
}
```

This checks that the text is valid UTF-8 before iterating over code points. If validation fails, Utf8View.init returns an error, so the loop never sees malformed input.
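For untrusted input, the caller usually wants to handle the validation failure rather than propagate it. The sketch below uses a hypothetical helper, firstCodepoint, and feeds it one valid and one invalid byte sequence.

```zig
const std = @import("std");

/// Returns the first code point, or null for empty text.
/// Fails with error.InvalidUtf8 when the bytes are not valid UTF-8.
/// `firstCodepoint` is a hypothetical helper for illustration.
fn firstCodepoint(text: []const u8) !?u21 {
    const view = try std.unicode.Utf8View.init(text);
    var it = view.iterator();
    return it.nextCodepoint();
}

pub fn main() void {
    // The second input is deliberately not valid UTF-8.
    const inputs = [_][]const u8{ "é", "\xff\xfe" };
    for (inputs) |input| {
        if (firstCodepoint(input)) |maybe_cp| {
            if (maybe_cp) |cp| std.debug.print("U+{X}\n", .{cp});
        } else |err| {
            std.debug.print("rejected: {s}\n", .{@errorName(err)});
        }
    }
}
```

The valid input is decoded and the invalid one is rejected before any code point is produced.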
Use Writers for Streaming Output
If output may become large, you do not always need to build one big string first.
You can write directly to a writer.
For example, this function writes CSV-style output:
```zig
const std = @import("std");

fn writeCsvRow(writer: anytype, name: []const u8, score: u32) !void {
    try writer.print("\"{s}\",{}\n", .{ name, score });
}
```

You can write to an ArrayList:

```zig
try writeCsvRow(text.writer(), "Ada", 95);
```

Or to another writer, such as a file writer.
The function does not care where the output goes. This avoids unnecessary intermediate strings.
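Because writer is anytype, any value with a compatible print method will work. To show that without depending on a specific standard library writer, the sketch below defines a hypothetical FixedSink type that formats into a fixed buffer. It is written for this example and is not a std type; in real code you would normally pass a writer provided by the standard library.

```zig
const std = @import("std");

fn writeCsvRow(writer: anytype, name: []const u8, score: u32) !void {
    try writer.print("\"{s}\",{}\n", .{ name, score });
}

/// A minimal fixed-buffer sink with a `print` method.
/// `FixedSink` is a hypothetical type written for this sketch, not a std type.
const FixedSink = struct {
    buffer: []u8,
    len: usize = 0,

    pub fn print(self: *FixedSink, comptime fmt: []const u8, args: anytype) !void {
        // Format into the unused tail of the buffer, then advance.
        const out = try std.fmt.bufPrint(self.buffer[self.len..], fmt, args);
        self.len += out.len;
    }

    pub fn written(self: *const FixedSink) []const u8 {
        return self.buffer[0..self.len];
    }
};

pub fn main() !void {
    var storage: [64]u8 = undefined;
    var sink = FixedSink{ .buffer = &storage };

    try writeCsvRow(&sink, "Ada", 95);
    try writeCsvRow(&sink, "Linus", 88);

    std.debug.print("{s}", .{sink.written()});
}
```

Each row is formatted straight into its destination; no per-row string is built and then copied.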
Complete Example
```zig
const std = @import("std");

/// Parses "key = value"; returns null for blank lines and lines without '='.
fn parseLine(line: []const u8) ?struct {
    key: []const u8,
    value: []const u8,
} {
    const trimmed = std.mem.trim(u8, line, " \t\r\n");
    if (trimmed.len == 0) return null;

    const index = std.mem.indexOfScalar(u8, trimmed, '=') orelse return null;

    return .{
        .key = std.mem.trim(u8, trimmed[0..index], " \t"),
        .value = std.mem.trim(u8, trimmed[index + 1 ..], " \t"),
    };
}

pub fn main() void {
    const text =
        \\ name = zig
        \\ version = 0.16
        \\ mode = debug
    ;

    var lines = std.mem.splitScalar(u8, text, '\n');
    while (lines.next()) |line| {
        if (parseLine(line)) |entry| {
            std.debug.print("{s} -> {s}\n", .{ entry.key, entry.value });
        }
    }
}
```

Output:

```text
name -> zig
version -> 0.16
mode -> debug
```

This example does not allocate. It uses slices into the original input.
Summary
Efficient text processing in Zig is mostly about restraint.
Use slices instead of copies. Use std.mem for byte-level text work. Allocate only when the result must be owned or must outlive the input. Reuse buffers when building temporary text. Use writers when output can be streamed.
For ASCII-like protocols and file formats, byte processing is often enough. For human language text, validate and decode UTF-8 when Unicode meaning matters.
Zig gives you the tools, but it does not hide the cost. That is the point: you can see when text is borrowed, copied, allocated, decoded, or written.