Zig Beating GNU cmp In An Unfair Fight
published: [nandalism home] (dark light)
Lore
UNIX has lots of simple command line tools which can be used together to do some really complex things.
One example, cmp
, is a fairly simple tool which compares 2 files and gives a same/not-same answer. Unlike the more complicated
diff
, it does not split lines, attempt to find ranges of differences nor display the differences.
Rewrite
It can be fun to rewrite some of these tools and compare your version to, say, the GNU/linux version or a BSD unix version. I will be comparing mine with the GNU/linux cmp (on alpine linux). I will also be leaving out more advanced features.
Overview
(spoiler warning)
Here are the stages I went thru before reaching my final version.
- open 2 files and use low-level
os.read()
to read blocks from each. - I assume both file read()'s will give the max possible number of bytes and treat one shorter than the other as EOF on that file. This works out ok for files and most streams but e.g. interactive stdin stream will break this assumption. (bad)
- discovered
mem.indexOfDiff()
which gives the index of the first difference in 2 memory blocks. - added the ability to use a stdin stream for the second file.
- so far my cmp was faster than gnu cmp, but gnu cmp outputs the line number of the diff as well as the byte offset. Mine doesn't (I'm not convinced this is useful.)
- count the newlines, like gnu cmp. I can use
mem.count
to count all the newlines in a block. Now my code is a bit slower than gnu cmp. Seems fair. - mem.count could be faster. While zig usually has Scalar and 'string' versions of functions, there is no mem.countScalar(). So, using the std.mem.count() source code (part of the zig install) I made a clone which does a scalar search. This improved the speed again to equal gnu cmp.
- zig has an intriguing @Vector command which relates to cpu vector operations. Using vectorization/SIMD I was now faster than gnu cmp.
The reason that this was an unfair fight is that SIMD instructions are non-standard CPU extensions. However, GNU/linux is released as a binary package and the tools must run on all machines. This means GNU/linux cannot release tools using SIMD, although I'm sure the original sources are capable of using SIMD when compiled natively on a platform supporting those.
Reading the Files
I do this for both files, giving me 2 buffers.
const fda = try os.open(fna, os.O.RDONLY, 0o444); defer os.close(fda); var bufa: [bsz]u8 = undefined; const rsza = try os.read(fda, &bufa);
The Compare Loop
I loop over 64K blocks of the 2 files and either
- one file ends before the other
- there is a difference in the bytes in the blocks
- otherwise, I continue with the next block
var offset: usize = 0; var lineno: usize = 1; while(true){ const rsza = try os.read(fda, &bufa); const rszb = try os.read(fdb, &bufb); if(rsza != rszb){ const sfn = if(rsza<rszb) fna else fnb; const o = if(rsza<rszb) rsza else rszb; debug.print("cmp: prefixes match but file {s} is shorter ({} bytes)\n", .{sfn, offset+o}); os.exit(1); } if(0==rsza) break; if(mem.indexOfDiff(u8, bufa[0..rsza], bufb[0..rsza])) |o| { lineno += countScalarSIMD(u8, bufa[0..o], '\n'); debug.print("cmp: mismatch at byte:{} line:{}\n", .{offset+o+1, lineno}); os.exit(1); } offset += rsza; lineno += countScalarSIMD(u8, bufa[0..rsza], '\n'); }
The Vectorization of Line Count
I wrote a general countScalarSIMD()
function but here only use it to count newlines. It will count the occurrences of any single byte.
The zig @Vector command intrigued me, I thought it created vectors, however it relates to cpu vector operations/SIMD. I hadn't tried these yet and used this as an excuse to learn about it. This zig feature allows the compiler to generate CPU SIMD instructions, which operate on multi-byte blocks in a single instruction, making them faster.
I tried 4 byte blocks at first, and after getting the vector code working it was easy to experiment with various vector widths, by changing a single constant.
On my machine 16-bytes was the peak of performance.
Now I was faster than gnu cmp.
I used objdump -d cmp
to prove that my code was using SIMD e.g. popcnt.
Note that in C, using SIMD is quite esoteric (there are builtin functions) but zig makes this much more approachable. There are multiple ways to do the same
thing. At first in fact, I didn't find a way to do a @reduce
addition on the bool match vector
to get a popcnt so I used a inline-while loop of 16 iterations and the zig compiler converted that to a single popcnt instruction in the output assembler.
pub fn countScalarSIMD(comptime T: type, haystack: []const T, needle: T) usize { var cnt: usize = 0; const vsz = 16; const lnub = (haystack.len % vsz); const l = haystack.len - lnub; const vneedle = @splat(vsz, needle); var i: usize = 0; while(i<l):(i+=vsz) { const slide: @Vector(vsz, u8) = haystack[i..][0..vsz].*; const mask = (vneedle == slide); comptime var j = 0; inline while(j<vsz):(j+=1) { cnt += @boolToInt(mask[j]); } } i=l; while(i<haystack.len):(i+=1) { if(needle == haystack[i]) cnt += 1; } return cnt; }
This compiles down to popcnt
.
comptime var j = 0; inline while(j<vsz):(j+=1) { cnt += @boolToInt(mask[j]); }
I tried various combinations of @boolToInt
but it wouldn't work on the vector.
The best approach is probably to use @select
to convert the bool vector created by the equality match into a u1 integer vector and then add.
I tried this, but ended up with an extra assembler instruction [and $0x1, reg] which seemed incorrect. Maybe I made a mistake in this code or misunderstood the @select api?
const imask = @select(u1, mask, @splat(vsz,@as(u1,1)), @splat(vsz,@as(u1,0))); cnt += @reduce(.Add, imask); 203217: f3 0f b8 c9 popcnt %ecx,%ecx 20321b: 83 e1 01 and $0x1,%ecx <--- all counts now 0 or 1 (wrong)
Intuitive Align Calculation
Another nice compiler surprise (probably more due to LLVM than zig).
It's nice that this, more intuitive, alignment calculation, without bit manipulations, compiled into a single instruction, so I didn't have to "improve" it later using bit masking.
const vsz = 16; const lnub = (haystack.len % vsz); const l = haystack.len - lnub; 20317a: 49 83 e0 f0 and $0xfffffffffffffff0,%r8
The Code
The cmp
source code.