wc -l optimizations

I was reading the https://www.podoliaka.org/2025/10/12/newlines/ blog post and realized I wanted to replicate its optimizations in Zig 0.15.

The task is to write a wc -l implementation that can quickly count the lines in a 13 GB file with one billion lines.

Compiler parameters

std.Io.Reader

My very first thought was that I must try the new std.Io.Reader interface. It has all the helper methods I'll ever need.

const std = @import("std");

pub fn main() !void {
    const measurements_file = try std.fs.cwd().openFile("measurements.txt", .{});
    defer measurements_file.close();

    var buf: [1 * 1024 * 1024]u8 = undefined;
    var reader = measurements_file.readerStreaming(&buf);

    var n: usize = 0;
    while (try reader.interface.takeDelimiter('\n')) |_| {
        n += 1;
    }

    std.debug.print("{d}\n", .{n});
}

I'm not running it yet; first, let's look at the Rust program.

Rust?

The equivalent Rust program should look something like this:

use std::{
    error::Error,
    fs::File,
    io::{BufRead, BufReader},
    result::Result,
};

fn main() {
    let lines = count_lines("measurements.txt").unwrap();
    println!("{}", lines)
}

fn count_lines(path: &str) -> Result<usize, Box<dyn Error>> {
    let mut reader = BufReader::with_capacity(1 * 1024 * 1024, File::open(path)?);
    let mut lines = 0;

    let mut buf = Vec::new();
    loop {
        let n = reader.read_until(b'\n', &mut buf)?;
        if n == 0 {
            break;
        }

        lines += 1;
        buf.clear();
    }

    Ok(lines)
}

But it's unfair! The Rust version has one more hidden buffer (the Vec that read_until copies each line into), while the Zig version buffers only once.
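
For the record, the extra copy is avoidable: BufRead exposes fill_buf and consume, which let us scan the BufReader's internal buffer in place and count newlines there. A rough sketch of such a count_lines (my sketch, not code from the original post):

use std::{
    error::Error,
    fs::File,
    io::{BufRead, BufReader},
};

// Sketch: count newlines directly in BufReader's internal buffer,
// without copying each line into a separate Vec.
fn count_lines(path: &str) -> Result<usize, Box<dyn Error>> {
    let mut reader = BufReader::with_capacity(1 * 1024 * 1024, File::open(path)?);
    let mut lines = 0;

    loop {
        // Borrow whatever the reader has buffered, refilling from the file if it's empty.
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // EOF
        }

        lines += buf.iter().filter(|&&b| b == b'\n').count();

        let consumed = buf.len();
        reader.consume(consumed);
    }

    Ok(lines)
}

Still, let's compare the stats of the two versions above anyway.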

$ poop ./zig-out/bin/io_interface_helper_1mb ./target/release/naive_helper_1mb
Benchmark 1 (3 runs): ./zig-out/bin/io_interface_helper_1mb
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          9.66s  ± 12.1ms    9.65s  … 9.67s           0 ( 0%)        0%
  peak_rss           1.05MB ±    0      1.05MB … 1.05MB          0 ( 0%)        0%
  cpu_cycles         46.1G  ± 6.49M     46.1G  … 46.1G           0 ( 0%)        0%
  instructions       45.0G  ± 50.8      45.0G  … 45.0G           0 ( 0%)        0%
  cache_references    449M  ± 1.21M      448M  …  451M           0 ( 0%)        0%
  cache_misses       2.04M  ±  374K     1.61M  … 2.31M           0 ( 0%)        0%
  branch_misses       152K  ± 2.54K      149K  …  154K           0 ( 0%)        0%
Benchmark 2 (3 runs): ./target/release/naive_helper_1mb
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          4.58s  ± 5.26ms    4.58s  … 4.59s           0 ( 0%)        ⚡- 52.6% ±  0.2%
  peak_rss           3.01MB ± 2.36KB    3.01MB … 3.01MB          0 ( 0%)        💩+187.4% ±  0.4%
  cpu_cycles         19.3G  ± 42.1M     19.2G  … 19.3G           0 ( 0%)        ⚡- 58.2% ±  0.1%
  instructions       41.1G  ±  185      41.1G  … 41.1G           0 ( 0%)        ⚡-  8.8% ±  0.0%
  cache_references    451M  ±  725K      450M  …  452M           0 ( 0%)          +  0.4% ±  0.5%
  cache_misses       2.95M  ±  250K     2.76M  … 3.23M           0 ( 0%)        💩+ 44.7% ± 35.3%
  branch_misses       383M  ± 1.29M      382M  …  384M           0 ( 0%)        💩+252606.8% ± 1360.0%

Unexpected! The naive Rust version is about twice as fast. I also tried to compile the Zig version with Zig 0.16, but that requires an Io instance. OK, I'm too lazy; leaving it as it is. NOTE: maybe check it again in the future.

Simple loop with a large enough buffer

Figuring out how to imitate the old read(buf: []u8) using the new std.Io.Reader interface forced me to read the docs. After some reading, I found fillMore and tossBuffered!

const std = @import("std");

pub fn main() !void {
    const measurements_file = try std.fs.cwd().openFile("measurements.txt", .{});
    defer measurements_file.close();

    var buf: [1 * 1024 * 1024]u8 = undefined;
    var reader = measurements_file.readerStreaming(&buf);

    var n: usize = 0;
    while (true) {
        // fillMore pulls the next batch of bytes from the file into the reader's buffer.
        reader.interface.fillMore() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };

        for (reader.interface.buffered()) |byte| {
            if (byte == '\n') {
                n += 1;
            }
        }
        // Everything buffered has been scanned; discard it before the next fill.
        reader.interface.tossBuffered();
    }

    std.debug.print("{d}\n", .{n});
}

This version is much quicker. Now the scan loops over all the available data at once, instead of stopping the cycle after each newline. BTW, we are on par with Rust now.

$ poop ./zig-out/bin/read_for_1mb_buf 'wc -l measurements.txt'
Benchmark 1 (5 runs): ./zig-out/bin/read_for_1mb_buf
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.20s  ± 19.5ms    1.18s  … 1.23s           0 ( 0%)        0%
  peak_rss           1.02MB ± 58.6KB     918KB … 1.05MB          1 (20%)        0%
  cpu_cycles         2.89G  ± 12.2M     2.88G  … 2.91G           0 ( 0%)        0%
  instructions       8.19G  ± 38.3      8.19G  … 8.19G           0 ( 0%)        0%
  cache_references    445M  ± 2.05M      443M  …  448M           0 ( 0%)        0%
  cache_misses       2.64M  ±  255K     2.29M  … 2.90M           0 ( 0%)        0%
  branch_misses      70.3K  ± 2.35K     67.8K  … 73.9K           0 ( 0%)        0%
Benchmark 2 (6 runs): wc -l measurements.txt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           897ms ± 4.58ms     893ms …  905ms          0 ( 0%)        ⚡- 25.6% ±  1.5%
  peak_rss           5.10MB ± 27.5KB    5.06MB … 5.13MB          0 ( 0%)        💩+398.8% ±  5.9%
  cpu_cycles          993M  ± 10.8M      982M  … 1.01G           0 ( 0%)        ⚡- 65.7% ±  0.5%
  instructions       3.02G  ± 12.3      3.02G  … 3.02G           0 ( 0%)        ⚡- 63.1% ±  0.0%
  cache_references    249M  ± 24.7M      232M  …  286M           0 ( 0%)        ⚡- 43.9% ±  5.7%
  cache_misses       2.45M  ± 1.84M      972K  … 4.85M           0 ( 0%)          -  7.2% ± 71.7%
  branch_misses       171K  ± 96.0       171K  …  171K           0 ( 0%)        💩+142.6% ±  3.1%

Ha, such a simple change resulted in a big improvement! Let's try the last optimization.

SIMD

The idea here is to parallelize the equality check. By loading data into special CPU registers, we can execute the same operation on all the bytes in a register at once.

const std = @import("std");

const simd_size = std.simd.suggestVectorLength(u8) orelse 32;
const Vector = @Vector(simd_size, u8);

pub fn main() !void {
    const measurements_file = try std.fs.cwd().openFile("measurements.txt", .{});
    defer measurements_file.close();

    var buf: [1 * 1024 * 1024]u8 = undefined;
    var reader = measurements_file.readerStreaming(&buf);

    var n: usize = 0;
    while (true) {
        reader.interface.fillMore() catch |err| switch (err) {
            error.EndOfStream => break,
            else => return err,
        };

        var it = std.mem.window(u8, reader.interface.buffered(), simd_size, simd_size);
        while (it.next()) |chunk| {
            if (chunk.len < simd_size) {
                for (chunk) |byte| {
                    if (byte == '\n') {
                        n += 1;
                    }
                }
                continue;
            }

            const vector: Vector = chunk[0..simd_size].*;
            const target: Vector = @splat('\n');

            n += std.simd.countTrues(vector == target);
        }

        reader.interface.tossBuffered();
    }

    std.debug.print("{d}\n", .{n});
}

I like the Zig program more, because Zig has a built-in Vector type which uses parallel instructions if the hardware supports them. The algorithm is quite simple: first, determine the optimal window/chunk size; second, split the buffer using that size. In my case, the size is 32 bytes. Having chunks of known length enables us to apply SIMD operations to that data. We compare a Vector of input data to a Vector of newlines, resulting in a Vector of bool values, and there's a neat function to count the true values: std.simd.countTrues. Incomplete chunks require special handling, and for that a simple for loop will do!
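
For comparison, the same chunk-plus-tail shape can be written in Rust too. The sketch below is not the SIMD version from the linked post (that's the read_chunks_vectorized binary benchmarked below) and uses no explicit vector types, since portable SIMD is still nightly-only in Rust; it just splits the buffer with chunks_exact and leaves the inner comparison to the auto-vectorizer:

// Sketch: full 32-byte chunks first, then a plain loop over the incomplete tail,
// mirroring the structure of the Zig version above.
fn count_newlines(buf: &[u8]) -> usize {
    const CHUNK: usize = 32;
    let mut n = 0;

    let mut chunks = buf.chunks_exact(CHUNK);
    for chunk in &mut chunks {
        // LLVM can usually turn this byte-wise comparison into vector instructions.
        n += chunk.iter().filter(|&&b| b == b'\n').count();
    }

    // Leftover bytes that don't fill a whole chunk.
    n += chunks.remainder().iter().filter(|&&b| b == b'\n').count();

    n
}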

$ poop ./zig-out/bin/read_chunks_vectorized ./target/release/read_chunks_vectorized 'wc -l measurements.txt'
Benchmark 1 (6 runs): ./zig-out/bin/read_chunks_vectorized
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           851ms ± 17.8ms     833ms …  884ms          0 ( 0%)        0%
  peak_rss           1.03MB ± 53.5KB     918KB … 1.05MB          1 (17%)        0%
  cpu_cycles         1.07G  ± 22.2M     1.05G  … 1.10G           0 ( 0%)        0%
  instructions       4.53G  ± 17.2      4.53G  … 4.53G           0 ( 0%)        0%
  cache_references    447M  ± 2.51M      443M  …  450M           0 ( 0%)        0%
  cache_misses       5.03M  ±  387K     4.48M  … 5.60M           0 ( 0%)        0%
  branch_misses      67.3K  ±  679      66.6K  … 68.3K           0 ( 0%)        0%
Benchmark 2 (7 runs): ./target/release/read_chunks_vectorized
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           791ms ± 14.8ms     774ms …  811ms          0 ( 0%)        ⚡-  7.1% ±  2.3%
  peak_rss           3.09MB ± 79.3KB    3.00MB … 3.17MB          0 ( 0%)        💩+201.0% ±  8.2%
  cpu_cycles          848M  ± 23.9M      809M  …  882M           0 ( 0%)        ⚡- 21.0% ±  2.6%
  instructions        944M  ±  399       944M  …  944M           0 ( 0%)        ⚡- 79.1% ±  0.0%
  cache_references    484M  ± 11.6M      477M  …  509M           0 ( 0%)        💩+  8.3% ±  2.4%
  cache_misses       23.3M  ± 2.35M     19.0M  … 25.7M           0 ( 0%)        💩+363.8% ± 42.8%
  branch_misses      57.9K  ±  336      57.7K  … 58.6K           0 ( 0%)        ⚡- 13.9% ±  0.9%
Benchmark 3 (6 runs): wc -l measurements.txt
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           896ms ± 3.46ms     892ms …  901ms          0 ( 0%)        💩+  5.3% ±  1.9%
  peak_rss           5.08MB ± 36.3KB    5.01MB … 5.12MB          0 ( 0%)        💩+394.5% ±  5.7%
  cpu_cycles          984M  ± 7.47M      974M  …  996M           0 ( 0%)        ⚡-  8.4% ±  2.0%
  instructions       3.02G  ± 30.2      3.02G  … 3.02G           0 ( 0%)        ⚡- 33.3% ±  0.0%
  cache_references    237M  ± 8.94M      232M  …  255M           0 ( 0%)        ⚡- 46.9% ±  1.9%
  cache_misses       1.42M  ±  685K      931K  … 2.75M           0 ( 0%)        ⚡- 71.8% ± 14.2%
  branch_misses       171K  ±  118       170K  …  171K           0 ( 0%)        💩+153.7% ±  0.9%

All better now! The Rust SIMD version can be found in the blog post linked at the beginning of this article. Memory-mapped IO didn't work for me, so I skipped it.
