Compression means making data smaller.
A text file may contain repeated words. A log file may contain repeated timestamps. A binary file may contain repeated patterns. Compression algorithms find these patterns and store the same information using fewer bytes.
For example, this text has obvious repetition:

```
hello hello hello hello hello
```

A compressor can store it more compactly than the original bytes.
Compression is useful for:

- saving disk space
- sending less data over a network
- storing large logs
- packaging files
- working with archive formats
- reducing bandwidth costs
But compression has a cost. The program must spend CPU time compressing and decompressing data.
Compression and Decompression
There are two directions.
Compression turns original data into compressed data.

```
plain bytes -> compressed bytes
```

Decompression turns compressed data back into the original data.

```
compressed bytes -> plain bytes
```

A correct decompressor should recover exactly the original bytes.
If the original data is:

```
abcabcabc
```

then after compression and decompression, the result should still be:

```
abcabcabc
```

Compression should not change the meaning of the data.
Lossless and Lossy Compression
There are two broad kinds of compression.
Lossless compression preserves the exact original bytes.
Examples:

- gzip
- zlib
- zstd
- deflate
- lzma

Lossy compression does not preserve the exact original data. It removes details to make the result smaller.
Examples:

- JPEG
- MP3
- AAC
- some video codecs

For source code, JSON, logs, databases, executables, and archives, you need lossless compression.
If one byte changes, the file may become invalid.
This section is about lossless compression.
Common Compression Formats
You will see several names often.
| Name | Common use |
|---|---|
| gzip | .gz files, HTTP compression, logs |
| zlib | compressed data format used by many systems |
| deflate | compression algorithm used inside zlib and gzip |
| zstd | modern fast compression with good ratios |
| lzma | high compression ratio, often slower |
| tar | archive format, often combined with compression |
Be careful with vocabulary.
tar is not compression by itself. It combines many files into one archive.
gzip compresses one byte stream.
So this file:

```
archive.tar.gz
```

usually means:

1. First, many files were packed into one .tar.
2. Then the .tar stream was compressed with gzip.
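To make the two-step structure concrete, here is a sketch in Python, whose standard library `tarfile` module can write the archive and the gzip layer in one pass (the file names are made up for the example):

```python
import tarfile
import tempfile
from pathlib import Path

# Pack two files into one archive, compressing with gzip ("w:gz"),
# then reopen it and list its contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("hello\n")
    (root / "b.txt").write_text("world\n")

    archive = root / "archive.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:  # tar + gzip together
        tar.add(root / "a.txt", arcname="a.txt")
        tar.add(root / "b.txt", arcname="b.txt")

    with tarfile.open(archive, "r:gz") as tar:
        names = sorted(tar.getnames())

print(names)  # ['a.txt', 'b.txt']
```

The `"w:gz"` mode string is what combines the two steps; `"w"` alone would produce an uncompressed .tar.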
Working with Compressed Data
At a high level, compression code usually looks like this:
```zig
const compressed = try compress(allocator, original);
defer allocator.free(compressed);

const restored = try decompress(allocator, compressed);
defer allocator.free(restored);
```

The exact APIs depend on the compression format and Zig version.
But the resource rules are familiar:
- compression may allocate
- decompression may allocate
- both can fail
- allocated buffers must be freed
- compressed data is still just bytes
Why Decompression Can Fail
Decompression can fail for many reasons.
- The input may not be compressed data.
- The input may use the wrong compression format.
- The compressed data may be corrupted.
- The output may be too large.
- The allocator may fail.
- The stream may end early.
So decompression returns errors.
Do not treat compressed data as trusted input. A malformed compressed file can be accidental or malicious.
Compression Ratio
Compression ratio measures how much smaller the data becomes.
Suppose the original data is 1000 bytes and the compressed data is 250 bytes. The compressed size is one quarter of the original size, a 4:1 compression ratio. That is a good result.
But not all data compresses well.
This text compresses well:

```
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
```

This random-looking data may not:

```
8f 2a 90 11 c4 7b e2 09
```

Already compressed files often do not compress much more. Compressing a .jpg, .mp4, or .zip again may waste CPU and produce little benefit.
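You can see the difference directly. A sketch in Python, using the standard library's `zlib` (chosen here because it is readily runnable; any lossless compressor would show the same effect):

```python
import os
import zlib

repetitive = b"hello " * 1000   # 6000 bytes of obvious repetition
random_ish = os.urandom(6000)   # 6000 bytes with no patterns to exploit

comp_rep = zlib.compress(repetitive)
comp_rand = zlib.compress(random_ish)

# Repetition shrinks dramatically; random data barely shrinks, if at all.
print(len(repetitive), "->", len(comp_rep))
print(len(random_ish), "->", len(comp_rand))
```

The repetitive input collapses to a small fraction of its size, while the random input stays close to 6000 bytes.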
Compression Speed vs Size
Compression algorithms make tradeoffs.
Some are very fast but produce larger output.
Some produce smaller output but take more CPU time.
A log pipeline may prefer speed.
An archive stored for years may prefer size.
A network service may need a balance.
Many compressors have levels.
A low level is faster.
A high level may compress smaller.
Example conceptually:

```
level 1 -> faster, larger output
level 9 -> slower, smaller output
```

Do not assume the highest level is always best. Measure with your real data.
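A quick way to measure is to try several levels on your own data. A sketch in Python with `zlib`, which accepts levels 1 through 9:

```python
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog\n" * 20_000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Higher levels usually trade CPU time for smaller output.
    print(f"level {level}: {len(out)} bytes, {elapsed:.4f}s")
```

On repetitive data like this, level 9 produces output no larger than level 1, but the gap and the time cost depend entirely on the input, which is why measuring matters.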
Streaming Compression
Small data can be compressed all at once.
Large data should usually be compressed as a stream.
Streaming means processing data in chunks:

```
read chunk -> compress chunk -> write chunk
read chunk -> compress chunk -> write chunk
read chunk -> compress chunk -> write chunk
```

This avoids loading the whole file into memory.

The same idea applies to decompression:

```
read compressed chunk -> decompress chunk -> write plain chunk
```

Streaming is important for large files, network connections, and memory-constrained programs.
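The chunk loop above can be sketched with Python's `zlib` streaming objects; the generator names are illustrative, not a standard API:

```python
import zlib

def compress_stream(chunks):
    """Compress an iterable of byte chunks, yielding compressed chunks."""
    c = zlib.compressobj()
    for chunk in chunks:
        out = c.compress(chunk)
        if out:
            yield out
    yield c.flush()  # emit buffered bytes and the stream trailer

def decompress_stream(chunks):
    """Decompress an iterable of compressed chunks, yielding plain chunks."""
    d = zlib.decompressobj()
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    yield d.flush()

original = [b"chunk one ", b"chunk two ", b"chunk three "] * 1000
restored = b"".join(decompress_stream(compress_stream(original)))
assert restored == b"".join(original)
```

Only one chunk of plain data and one chunk of compressed data need to be in memory at a time, which is the whole point of streaming.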
Buffer-Based Compression
For small inputs, buffer-based compression is simpler.
Conceptual shape:

```zig
const input = "hello hello hello hello\n";

const compressed = try compressToBuffer(allocator, input);
defer allocator.free(compressed);

const plain = try decompressToBuffer(allocator, compressed);
defer allocator.free(plain);
```

This is easy to understand, but it requires enough memory for the input, compressed output, and decompressed output.
For small configuration data or tests, that is fine.
For large files, prefer streaming.
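A runnable version of the buffer-based shape, sketched in Python with `zlib` (the Zig equivalents depend on the standard library version):

```python
import zlib

input_data = b"hello hello hello hello\n"

compressed = zlib.compress(input_data)  # whole input held in memory
plain = zlib.decompress(compressed)     # whole output held in memory

assert plain == input_data              # lossless: exact bytes restored
print(len(input_data), "->", len(compressed))
```

All three buffers (input, compressed, decompressed) exist at once, which is fine for small data and a problem for large data.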
File Compression Pattern
A command-line compression tool usually follows this shape:

```
open input file
defer close input file
open output file
defer close output file

create compressor

while true:
    read input chunk
    if end: break
    write compressed chunk

finish compressor
```

The final “finish” step matters.
Many compression formats need to write footer data, checksums, or final buffered bytes.
If you forget to finish or flush the compressor, the output file may be incomplete.
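A sketch of the whole pattern in Python with the `gzip` module; here the finish step happens when the gzip stream is closed, which is why leaving the `with` block matters:

```python
import gzip
import tempfile
from pathlib import Path

CHUNK_SIZE = 64 * 1024

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.log"
    dst = Path(tmp) / "input.log.gz"
    src.write_bytes(b"timestamp=123 level=info msg=ok\n" * 50_000)

    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)
        # Leaving this `with` block closes the gzip stream, which writes
        # the final buffered bytes, the CRC32 checksum, and the footer.

    restored = gzip.decompress(dst.read_bytes())
    assert restored == src.read_bytes()
```

If the program exited before the gzip stream was closed, the footer would be missing and decompressors would report a truncated file.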
File Decompression Pattern
A decompression tool has the reverse shape:

```
open compressed file
defer close compressed file
open output file
defer close output file

create decompressor

while true:
    read compressed chunk
    if end: break
    write plain chunk
```

The decompressor must validate the input.
If the compressed stream is malformed, the program should return an error and avoid treating partial output as valid.
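The failure cases are easy to trigger. A sketch in Python: `zlib.decompress` raises `zlib.error` for input that is not compressed data and for a truncated stream:

```python
import zlib

valid = zlib.compress(b"some real data")

for candidate in (valid, b"not compressed at all", valid[:-4]):
    try:
        result = zlib.decompress(candidate)
        print("ok:", result)
    except zlib.error as err:  # malformed or truncated input
        print("failed:", err)
```

Only the first candidate succeeds; the garbage input fails at the header, and the truncated input fails because the stream ends early. Treat every decompression call as one that can fail.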
A Simple Design for a Compression API
If you write your own wrapper, make the direction clear.
Good names:

- compressGzip
- decompressGzip
- compressZstd
- decompressZstd

Avoid vague names:

- process
- convert
- handleData

Compression code handles many similar-looking byte streams. Clear names prevent mistakes.
A good function signature might look like:

```zig
fn compressData(
    allocator: std.mem.Allocator,
    input: []const u8,
) ![]u8 {
    // returns allocated compressed bytes
}
```

This signature says:
- It needs an allocator.
- It reads input bytes.
- It can fail.
- It returns allocated output.
- The caller must free the result.
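The same naming discipline translates to any language. A minimal sketch in Python, wrapping the standard `gzip` module behind direction-revealing names (the wrapper names are this example's own choice):

```python
import gzip

def compress_gzip(data: bytes) -> bytes:
    """Compress bytes into the gzip format."""
    return gzip.compress(data)

def decompress_gzip(data: bytes) -> bytes:
    """Restore the original bytes; raises on malformed input."""
    return gzip.decompress(data)

roundtrip = decompress_gzip(compress_gzip(b"clear names, clear directions"))
assert roundtrip == b"clear names, clear directions"
```

Reading a call site, there is never any doubt which direction the bytes are flowing.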
Avoid Compression Bombs
A compression bomb is compressed data that expands to a huge size.
For example, a small compressed file might decompress into gigabytes of data.
If your program blindly decompresses into memory, it can run out of memory.
So when decompressing untrusted data, set limits.
Conceptually:

```zig
if (output_size > max_allowed_size) {
    return error.OutputTooLarge;
}
```

This is important for servers, archive tools, package managers, and anything that accepts files from users.
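One way to enforce such a limit is to cap the output as you decompress, rather than checking after the damage is done. A sketch in Python using `zlib.decompressobj` with its `max_length` argument; the function name `decompress_limited` is illustrative:

```python
import zlib

def decompress_limited(data: bytes, max_size: int) -> bytes:
    """Decompress zlib data, refusing to expand beyond max_size bytes."""
    d = zlib.decompressobj()
    out = d.decompress(data, max_size)  # cap how much output is produced
    if d.unconsumed_tail or not d.eof:  # more output was still pending
        raise ValueError("OutputTooLarge")
    return out

# A ~50 KB payload that would expand to 50 MB: a small compression bomb.
bomb = zlib.compress(b"\x00" * 50_000_000)
print("bomb size:", len(bomb))
```

Calling `decompress_limited(bomb, 1_000_000)` raises instead of allocating 50 MB, while normal-sized inputs decompress as usual.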
Checksums
Some compression formats include checksums.
A checksum helps detect corrupted data.
For example, if one byte in a compressed file changes, decompression may fail with a checksum error.
A checksum is not the same as encryption or authentication.
It can detect accidental corruption.
It does not prove that the data came from a trusted source.
For security, use cryptographic authentication such as a MAC or digital signature.
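Gzip, for example, stores a CRC32 of the original data. A sketch in Python of the property that makes checksums useful: changing even one byte changes the checksum, so corruption is detected:

```python
import zlib

data = b"important log line\n"
corrupted = b"important log lime\n"  # one byte changed

# CRC32 detects any single-byte change in same-length data.
print(hex(zlib.crc32(data)))
print(hex(zlib.crc32(corrupted)))
```

The two values differ, which is how a decompressor notices corruption. But anyone can recompute a CRC32 over tampered data, which is exactly why a checksum is not authentication.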
Compression Is Not Encryption
Compression makes data smaller.
Encryption makes data unreadable without a key.
They solve different problems.
Compressed data may still reveal information.
Encrypted data should hide the contents.
Do not store secrets in compressed files and assume they are protected.
When to Compress
Compression is useful when:

- data has repeated patterns
- data is large
- storage or bandwidth matters
- CPU cost is acceptable
- data will be read less often than it is stored

Compression may be a poor choice when:

- data is already compressed
- latency matters more than size
- CPU is the bottleneck
- files are tiny
- the format must support random access
Some formats support block compression for random access, but plain stream compression usually does not.
What Beginners Should Learn First
Do not start by memorizing every compression API.
Start with the concepts:

- compressed data is bytes
- decompression can fail
- streaming avoids large memory use
- compression has speed-size tradeoffs
- finalization or flushing matters
- untrusted compressed data needs limits
Then learn one specific format, such as gzip or zstd.
The Core Pattern
For small data:

```zig
const compressed = try compress(allocator, input);
defer allocator.free(compressed);

const output = try decompress(allocator, compressed);
defer allocator.free(output);
```

For files:

```
open input
defer close input
open output
defer close output

while reading chunks:
    compress or decompress
    write result

finish or flush
```

For safety:

```zig
if (decompressed_size > max_size) {
    return error.OutputTooLarge;
}
```

What You Should Remember
- Compression makes data smaller.
- Decompression restores the original bytes.
- Lossless compression is required for code, text data, archives, databases, and structured files.
- Lossy compression is for media where exact bytes are not required.
- Compressed data is still byte data.
- Compression can fail.
- Decompression can fail.
- Large files should be processed in chunks.
- Always finish or flush compression streams.
- Do not trust compressed input blindly.
- Compression saves space or bandwidth by spending CPU time.