Zig Beating GNU cmp In An Unfair Fight

published: [nandalism home] (dark light)


UNIX has lots of simple command line tools which can be used together to do some really complex things.

One example, cmp, is a fairly simple tool which compares 2 files and gives a same/not-same answer. Unlike the more complicated diff, it does not split lines, attempt to find ranges of differences nor display the differences.


It can be fun to rewrite some of these tools and compare your version to, say, the GNU/linux version or a BSD unix version. I will be comparing mine with the GNU/linux cmp (on alpine linux). I will also be leaving out more advanced features.


(spoiler warning)

Here are the stages I went thru before reaching my final version.

The reason that this was an unfair fight is that SIMD instructions are non-standard CPU extensions. However, GNU/linux is released as a binary package and the tools must run on all machines. This means GNU/linux cannot release tools using SIMD, although I'm sure the original sources are capable of using SIMD when compiled natively on a platform supporting those.

Reading the Files

I do this for both files, giving me 2 buffers.

const fda = try os.open(fna, os.O.RDONLY, 0o444);
defer os.close(fda);
var bufa: [bsz]u8 = undefined;
const rsza = try os.read(fda, &bufa);

The Compare Loop

I loop over 64K blocks of the 2 files and either

var offset: usize = 0;
var lineno: usize = 1;
  const rsza = try os.read(fda, &bufa);
  const rszb = try os.read(fdb, &bufb);
  if(rsza != rszb){
    const sfn = if(rsza<rszb) fna else fnb;
    const o = if(rsza<rszb) rsza else rszb;
    debug.print("cmp: prefixes match but file {s} is shorter ({} bytes)\n", .{sfn, offset+o});
  if(0==rsza) break;
  if(mem.indexOfDiff(u8, bufa[0..rsza], bufb[0..rsza])) |o| {
    lineno += countScalarSIMD(u8, bufa[0..o], '\n');
    debug.print("cmp: mismatch at byte:{} line:{}\n", .{offset+o+1, lineno});
  offset += rsza;
  lineno += countScalarSIMD(u8, bufa[0..rsza], '\n');

The Vectorization of Line Count

I wrote a general countScalarSIMD() function but here only use it to count newlines. It will count the occurrences of any single byte.

The zig @Vector command intrigued me, I thought it created vectors, however it relates to cpu vector operations/SIMD. I hadn't tried these yet and used this as an excuse to learn about it. This zig feature allows the compiler to generate CPU SIMD instructions, which operate on multi-byte blocks in a single instruction, making them faster.

I tried 4 byte blocks at first, and after getting the vector code working it was easy to experiment with various vector widths, by changing a single constant.

On my machine 16-bytes was the peak of performance. Now I was faster than gnu cmp. I used objdump -d cmp to prove that my code was using SIMD e.g. popcnt.

Note that in C, using SIMD is quite esoteric (there are builtin functions) but zig makes this much more approachable. There are multiple ways to do the same thing. At first in fact, I didn't find a way to do a @reduce addition on the bool match vector to get a popcnt so I used a inline-while loop of 16 iterations and the zig compiler converted that to a single popcnt instruction in the output assembler.

pub fn countScalarSIMD(comptime T: type, haystack: []const T, needle: T) usize {
  var cnt: usize = 0;
  const vsz = 16;
  const lnub = (haystack.len % vsz);
  const l = haystack.len - lnub;
  const vneedle = @splat(vsz, needle);
  var i: usize = 0; while(i<l):(i+=vsz) {
    const slide: @Vector(vsz, u8) = haystack[i..][0..vsz].*;
    const mask = (vneedle == slide);
    comptime var j = 0; inline while(j<vsz):(j+=1) { cnt += @boolToInt(mask[j]); }
  i=l; while(i<haystack.len):(i+=1) {
    if(needle == haystack[i]) cnt += 1;
  return cnt;

This compiles down to popcnt.

comptime var j = 0; inline while(j<vsz):(j+=1) { cnt += @boolToInt(mask[j]); }

I tried various combinations of @boolToInt but it wouldn't work on the vector. The best approach is probably to use @select to convert the bool vector created by the equality match into a u1 integer vector and then add. I tried this, but ended up with an extra assembler instruction [and $0x1, reg] which seemed incorrect. Maybe I made a mistake in this code or misunderstood the @select api?

const imask = @select(u1, mask, @splat(vsz,@as(u1,1)), @splat(vsz,@as(u1,0)));
cnt += @reduce(.Add, imask);

203217:       f3 0f b8 c9             popcnt %ecx,%ecx
20321b:       83 e1 01                and    $0x1,%ecx   <--- all counts now 0 or 1 (wrong)

Intuitive Align Calculation

Another nice compiler surprise (probably more due to LLVM than zig).

It's nice that this, more intuitive, alignment calculation, without bit manipulations, compiled into a single instruction, so I didn't have to "improve" it later using bit masking.

const vsz = 16;
const lnub = (haystack.len % vsz);
const l = haystack.len - lnub;

20317a:       49 83 e0 f0             and    $0xfffffffffffffff0,%r8

The Code

The cmp source code.

site built using mf technology