I was reading the https://www.podoliaka.org/2025/10/12/newlines/ blog post and realized that I wanted to replicate its optimizations in Zig 0.15.
The task is to write a wc -l implementation that can quickly count the lines in a 13 GB file with one billion lines.
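To try the programs below without the full 13 GB file, a smaller sample can be generated first. This is just a convenience sketch: the payload line is an arbitrary placeholder, since only the newlines matter for line counting.

```shell
# Generate a 1M-line sample file. The 'Hamburg;12.0' payload is arbitrary;
# only the number of newlines matters for a wc -l style count.
yes 'Hamburg;12.0' | head -n 1000000 > measurements.txt
```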
Compiler parameters
- Zig: zig build --release=fast
- Rust: cargo build --release, with the following .cargo/config.toml:

  [build]
  rustflags = ["-C", "target-cpu=native"]
std.Io.Reader
My very first thought was that I must try the new std.Io.Reader interface. It has all the helper methods I’ll ever need.
const std = @import("std");

pub fn main() !void {
    const measurements_file = try std.fs.cwd().openFile("measurements.txt", .{});
    defer measurements_file.close();

    var buf: [1 * 1024 * 1024]u8 = undefined;
    var reader = measurements_file.readerStreaming(&buf);

    var n: usize = 0;
    while (try reader.interface.takeDelimiter('\n')) |_| {
        n += 1;
    }

    std.debug.print("{d}\n", .{n});
}
Let’s not run it just yet; first, the Rust program.
Rust?
The equivalent Rust program should look something like this:
use std::{
    error::Error,
    fs::File,
    io::{BufRead, BufReader},
    result::Result,
};

fn main() {
    let lines = count_lines("measurements.txt").unwrap();
    println!("{}", lines)
}

fn count_lines(path: &str) -> Result<usize, Box<dyn Error>> {
    let mut reader = BufReader::with_capacity(1 * 1024 * 1024, File::open(path)?);
    let mut lines = 0;
    let mut buf = Vec::new();
    loop {
        // read until the next newline; n == 0 means EOF
        let n = reader.read_until(b'\n', &mut buf)?;
        if n == 0 {
            break;
        }
        lines += 1;
        buf.clear();
    }
    Ok(lines)
}
But it’s unfair! There’s one more hidden buffer (the Vec that read_until copies into), while the Zig version buffers only once. Let’s compare the stats anyway.
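As an aside, the Rust side could be made fair by scanning BufReader’s internal buffer directly via BufRead::fill_buf and consume, skipping the extra Vec entirely. A minimal sketch (not benchmarked here; count_lines_single_buffer is my own name for it):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn count_lines_single_buffer(path: &str) -> std::io::Result<usize> {
    let mut reader = BufReader::with_capacity(1024 * 1024, File::open(path)?);
    let mut lines = 0;
    loop {
        // Borrow the reader's internal buffer directly -- no copy into a Vec.
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // EOF
        }
        lines += buf.iter().filter(|&&b| b == b'\n').count();
        let n = buf.len();
        reader.consume(n); // mark the whole buffer as read
    }
    Ok(lines)
}
```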
$ poop ./zig-out/bin/io_interface_helper_1mb ./target/release/naive_helper_1mb
Benchmark 1 (3 runs): ./zig-out/bin/io_interface_helper_1mb
measurement mean ± σ min … max outliers delta
wall_time 9.66s ± 12.1ms 9.65s … 9.67s 0 ( 0%) 0%
peak_rss 1.05MB ± 0 1.05MB … 1.05MB 0 ( 0%) 0%
cpu_cycles 46.1G ± 6.49M 46.1G … 46.1G 0 ( 0%) 0%
instructions 45.0G ± 50.8 45.0G … 45.0G 0 ( 0%) 0%
cache_references 449M ± 1.21M 448M … 451M 0 ( 0%) 0%
cache_misses 2.04M ± 374K 1.61M … 2.31M 0 ( 0%) 0%
branch_misses 152K ± 2.54K 149K … 154K 0 ( 0%) 0%
Benchmark 2 (3 runs): ./target/release/naive_helper_1mb
measurement mean ± σ min … max outliers delta
wall_time 4.58s ± 5.26ms 4.58s … 4.59s 0 ( 0%) ⚡- 52.6% ± 0.2%
peak_rss 3.01MB ± 2.36KB 3.01MB … 3.01MB 0 ( 0%) 💩+187.4% ± 0.4%
cpu_cycles 19.3G ± 42.1M 19.2G … 19.3G 0 ( 0%) ⚡- 58.2% ± 0.1%
instructions 41.1G ± 185 41.1G … 41.1G 0 ( 0%) ⚡- 8.8% ± 0.0%
cache_references 451M ± 725K 450M … 452M 0 ( 0%) + 0.4% ± 0.5%
cache_misses 2.95M ± 250K 2.76M … 3.23M 0 ( 0%) 💩+ 44.7% ± 35.3%
branch_misses 383M ± 1.29M 382M … 384M 0 ( 0%) 💩+252606.8% ± 1360.0%
Unexpected! I tried compiling with Zig 0.16, but it requires an Io instance. OK, I’m too lazy; leaving it as is. NOTE: maybe revisit this in the future.
Simple loop with enough buffer size
Figuring out how to imitate the old read(buf: []u8) with the new std.Io.Reader interface forced me to read the docs. After some reading, I found fillMore and tossBuffered!
const std = @import("std");

pub fn main() !void {
    const measurements_file = try std.fs.cwd().openFile("measurements.txt", .{});
    defer measurements_file.close();

    var buf: [1 * 1024 * 1024]u8 = undefined;
    var reader = measurements_file.readerStreaming(&buf);

    var n: usize = 0;
    while (true) {
        reader.interface.fillMore() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        for (reader.interface.buffered()) |byte| {
            if (byte == '\n') {
                n += 1;
            }
        }
        reader.interface.tossBuffered();
    }

    std.debug.print("{d}\n", .{n});
}
This version is much quicker: the scan now loops over all the available data at once, instead of stopping the cycle after each newline. BTW, we are on par with Rust now.
$ poop ./zig-out/bin/read_for_1mb_buf 'wc -l measurements.txt'
Benchmark 1 (5 runs): ./zig-out/bin/read_for_1mb_buf
measurement mean ± σ min … max outliers delta
wall_time 1.20s ± 19.5ms 1.18s … 1.23s 0 ( 0%) 0%
peak_rss 1.02MB ± 58.6KB 918KB … 1.05MB 1 (20%) 0%
cpu_cycles 2.89G ± 12.2M 2.88G … 2.91G 0 ( 0%) 0%
instructions 8.19G ± 38.3 8.19G … 8.19G 0 ( 0%) 0%
cache_references 445M ± 2.05M 443M … 448M 0 ( 0%) 0%
cache_misses 2.64M ± 255K 2.29M … 2.90M 0 ( 0%) 0%
branch_misses 70.3K ± 2.35K 67.8K … 73.9K 0 ( 0%) 0%
Benchmark 2 (6 runs): wc -l measurements.txt
measurement mean ± σ min … max outliers delta
wall_time 897ms ± 4.58ms 893ms … 905ms 0 ( 0%) ⚡- 25.6% ± 1.5%
peak_rss 5.10MB ± 27.5KB 5.06MB … 5.13MB 0 ( 0%) 💩+398.8% ± 5.9%
cpu_cycles 993M ± 10.8M 982M … 1.01G 0 ( 0%) ⚡- 65.7% ± 0.5%
instructions 3.02G ± 12.3 3.02G … 3.02G 0 ( 0%) ⚡- 63.1% ± 0.0%
cache_references 249M ± 24.7M 232M … 286M 0 ( 0%) ⚡- 43.9% ± 5.7%
cache_misses 2.45M ± 1.84M 972K … 4.85M 0 ( 0%) - 7.2% ± 71.7%
branch_misses 171K ± 96.0 171K … 171K 0 ( 0%) 💩+142.6% ± 3.1%
Ha, such a simple change made a big difference! Let’s try one last optimization.
SIMD
The idea here is to parallelize the equality check. By loading data into special CPU registers, we can execute the same operation on all the bytes in a register at once.
const std = @import("std");

const simd_size = std.simd.suggestVectorLength(u8) orelse 32;
const Vector = @Vector(simd_size, u8);

pub fn main() !void {
    const measurements_file = try std.fs.cwd().openFile("measurements.txt", .{});
    defer measurements_file.close();

    var buf: [1 * 1024 * 1024]u8 = undefined;
    var reader = measurements_file.readerStreaming(&buf);

    var n: usize = 0;
    while (true) {
        reader.interface.fillMore() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };
        var it = std.mem.window(u8, reader.interface.buffered(), simd_size, simd_size);
        while (it.next()) |chunk| {
            if (chunk.len < simd_size) {
                for (chunk) |byte| {
                    if (byte == '\n') {
                        n += 1;
                    }
                }
                continue;
            }
            const vector: Vector = chunk[0..simd_size].*;
            const target: Vector = @splat('\n');
            n += std.simd.countTrues(vector == target);
        }
        reader.interface.tossBuffered();
    }

    std.debug.print("{d}\n", .{n});
}
I like the Zig program more, because Zig has a built-in Vector type, which uses parallel instructions if the hardware supports them. The algorithm is quite simple: first, determine the optimal window/chunk size; second, split the buffer using that size. In my case, the size is 32 bytes. Having chunks of known length enables us to apply SIMD operations on that data. We compare a Vector of input data to a Vector of newlines, resulting in a Vector of bool values. There’s a neat function to count the true values: std.simd.countTrues. Incomplete chunks require special handling, and for that a simple for loop will do!
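For comparison, the same chunk-then-handle-the-tail structure can be sketched on the Rust side without explicit SIMD intrinsics: iterating over fixed-size chunks gives LLVM a loop it can auto-vectorize. This is my own sketch (count_newlines is a hypothetical helper), not the blog’s actual SIMD implementation, which is in the linked post.

```rust
// Count newlines chunk by chunk. The fixed 32-byte chunk size mirrors the
// Zig version's simd_size and lets LLVM auto-vectorize the inner loop.
fn count_newlines(data: &[u8]) -> usize {
    let mut n = 0;
    let mut chunks = data.chunks_exact(32);
    for chunk in &mut chunks {
        n += chunk.iter().filter(|&&b| b == b'\n').count();
    }
    // Handle the incomplete tail chunk, like the scalar loop in the Zig version.
    n + chunks.remainder().iter().filter(|&&b| b == b'\n').count()
}
```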
$ poop ./zig-out/bin/read_chunks_vectorized ./target/release/read_chunks_vectorized 'wc -l measurements.txt'
Benchmark 1 (6 runs): ./zig-out/bin/read_chunks_vectorized
measurement mean ± σ min … max outliers delta
wall_time 851ms ± 17.8ms 833ms … 884ms 0 ( 0%) 0%
peak_rss 1.03MB ± 53.5KB 918KB … 1.05MB 1 (17%) 0%
cpu_cycles 1.07G ± 22.2M 1.05G … 1.10G 0 ( 0%) 0%
instructions 4.53G ± 17.2 4.53G … 4.53G 0 ( 0%) 0%
cache_references 447M ± 2.51M 443M … 450M 0 ( 0%) 0%
cache_misses 5.03M ± 387K 4.48M … 5.60M 0 ( 0%) 0%
branch_misses 67.3K ± 679 66.6K … 68.3K 0 ( 0%) 0%
Benchmark 2 (7 runs): ./target/release/read_chunks_vectorized
measurement mean ± σ min … max outliers delta
wall_time 791ms ± 14.8ms 774ms … 811ms 0 ( 0%) ⚡- 7.1% ± 2.3%
peak_rss 3.09MB ± 79.3KB 3.00MB … 3.17MB 0 ( 0%) 💩+201.0% ± 8.2%
cpu_cycles 848M ± 23.9M 809M … 882M 0 ( 0%) ⚡- 21.0% ± 2.6%
instructions 944M ± 399 944M … 944M 0 ( 0%) ⚡- 79.1% ± 0.0%
cache_references 484M ± 11.6M 477M … 509M 0 ( 0%) 💩+ 8.3% ± 2.4%
cache_misses 23.3M ± 2.35M 19.0M … 25.7M 0 ( 0%) 💩+363.8% ± 42.8%
branch_misses 57.9K ± 336 57.7K … 58.6K 0 ( 0%) ⚡- 13.9% ± 0.9%
Benchmark 3 (6 runs): wc -l measurements.txt
measurement mean ± σ min … max outliers delta
wall_time 896ms ± 3.46ms 892ms … 901ms 0 ( 0%) 💩+ 5.3% ± 1.9%
peak_rss 5.08MB ± 36.3KB 5.01MB … 5.12MB 0 ( 0%) 💩+394.5% ± 5.7%
cpu_cycles 984M ± 7.47M 974M … 996M 0 ( 0%) ⚡- 8.4% ± 2.0%
instructions 3.02G ± 30.2 3.02G … 3.02G 0 ( 0%) ⚡- 33.3% ± 0.0%
cache_references 237M ± 8.94M 232M … 255M 0 ( 0%) ⚡- 46.9% ± 1.9%
cache_misses 1.42M ± 685K 931K … 2.75M 0 ( 0%) ⚡- 71.8% ± 14.2%
branch_misses 171K ± 118 170K … 171K 0 ( 0%) 💩+153.7% ± 0.9%
All better now! The Rust SIMD version can be found in the blog post linked at the beginning of this article. Memory-mapped IO didn’t work for me, so I skipped it.