Zig Wrapper For pcre2 Regex
Wrapping a C Library
This is an example of how to wrap a C library in zig. The library I am wrapping is the Perl-compatible regular expression library, pcre2. The zig manual describes the language and how to use the C integration.
I've tried to model the wrapper on code used in the zig std library. The regex api is split into a Regex type and a Groups type, handling compiling the regex and storing the matched ranges, respectively.
There's lots of pcre2 functionality I have left out (contexts, for example), but the wrapper should be enough to be useful.
Zig Does C Integration Well
zig can compile C code as well as zig. It also has language features that import a C header file and automatically generate a low-level zig wrapper for the functions and types found in the header. This is a great feature. For example, this is all I had to do to get the 8-bit version of the pcre2 api:
    const c = @cImport({
        @cDefine("PCRE2_CODE_UNIT_WIDTH", "8");
        @cInclude("pcre2.h");
    });
In some ways the job is done now. One can use the wrapped C functions and types directly. Indeed one does just that when writing a wrapper. However, putting a more idiomatic zig layer on top helps to integrate the api into zig space.
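As a rough sketch of what "using the wrapped C functions directly" can look like, here is a small helper that compiles a pattern and runs one match with no zig layer in between. The helper name rawIsMatch is made up for illustration and is not part of the wrapper. It assumes the translated header exposes the width-suffixed names (pcre2_compile_8 and friends) and that the file is built with the same -lc -lpcre2-8 flags used for the tests.

```zig
const std = @import("std");

const c = @cImport({
    @cDefine("PCRE2_CODE_UNIT_WIDTH", "8");
    @cInclude("pcre2.h");
});

// Hypothetical helper, not part of the wrapper: returns true if `pattern`
// matches somewhere in `subject`, calling the translated C functions directly.
fn rawIsMatch(pattern: []const u8, subject: []const u8) bool {
    var errcode: c_int = 0;
    var erroffset: usize = 0;

    // Pass explicit lengths so the strings need not be zero-terminated.
    const code = c.pcre2_compile_8(pattern.ptr, pattern.len, 0, &errcode, &erroffset, null);
    if (code == null) return false; // compile failed; errcode/erroffset say why
    defer c.pcre2_code_free_8(code);

    // Match data sized to hold every group the pattern can capture.
    const md = c.pcre2_match_data_create_from_pattern_8(code, null);
    if (md == null) return false;
    defer c.pcre2_match_data_free_8(md);

    // A positive return value is the number of groups set (including group 0);
    // negative values are errors, with "no match" being one of them.
    const rc = c.pcre2_match_8(code, subject.ptr, subject.len, 0, 0, md, null);
    return rc > 0;
}
```

Everything here is manual: C-style error codes, null checks, and paired free calls. That bookkeeping is what a Regex type with init() and deinit() can hide behind zig errors and defer.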
Zig Testing
The tests use zig's integrated testing. Run the tests in any zig file like this:
    $ zig test pcre2.zig -lc -lpcre2-8
    All 2 tests passed.
First a very simple test without capture groups.
- re=Regex.init() compiles the pattern string into the pcre2 internal representation.
- re.match() matches the pre-compiled regex against the subject string and returns the number of match groups, where the 0th group is the entire match.
- In this case there are no capture groups so I expect a return value of 1 for the match() call.
- I expect the default 0th match group to equal the entire match (the entire subject string in this case).
    test "basic test" {
        const tt = std.testing;
        const nooptions = .{};

        const re = try Regex.init("abc*", nooptions);
        defer re.deinit();

        var grp = try Groups.init(re);
        defer grp.deinit();

        const subject = "abcccc";
        const nmatch = re.match(grp, subject, nooptions);
        try tt.expectEqual(nmatch, 1);

        const ncap = grp.count();
        try tt.expectEqual(nmatch, ncap);
        try tt.expectEqualStrings(subject, grp.nth(0, subject));
    }
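Note that grp.nth() takes the subject again. That suggests the groups hold only offsets into the subject and not copies of the text. The sketch below shows that idea in isolation. OffsetGroups and its sample data are hypothetical stand-ins, not the wrapper's actual Groups implementation, and the layout (a start/end pair per group, group 0 first) is an assumption modelled on how pcre2 reports match ranges.

```zig
const std = @import("std");

// Hypothetical stand-in for the wrapper's Groups type: it holds only
// (start, end) byte offsets, two entries per group, group 0 first.
const OffsetGroups = struct {
    offsets: []const usize,

    // Number of groups, including the 0th (whole-match) group.
    fn count(self: OffsetGroups) usize {
        return self.offsets.len / 2;
    }

    // Slice of `subject` covered by group `n`; no bytes are copied.
    fn nth(self: OffsetGroups, n: usize, subject: []const u8) []const u8 {
        return subject[self.offsets[2 * n]..self.offsets[2 * n + 1]];
    }
};

// Offsets a matcher could report for the pattern "(b+)(c+)" on "abbcc":
// whole match "bbcc", then "bb", then "cc".
const sample_offsets = [_]usize{ 1, 5, 1, 3, 3, 5 };
const sample = OffsetGroups{ .offsets = &sample_offsets };
```

Returning slices keeps nth() allocation-free, at the cost of the caller having to pass the same subject that was matched. A real wrapper would also need to handle groups that did not take part in the match, which pcre2 reports with PCRE2_UNSET offsets.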
Next, a more complex regex with 8 capture groups and some Perl-isms like \S for a non-whitespace character.
- ... the same init/match call as above (different regex and subject, of course).
- I expect the return value of match() to be 9 (8 groups plus one for the default 0th group).
- As usual, I expect the 0th match group to equal the entire match (the entire subject string in this case).
- I've made a constant string array to check the matches of the 8 capture groups.
I've used zig's \\ multiline string literal form, which does no escape processing, to avoid all the escaping required for an equivalent C string.
The example subject is a line from an nginx access.log, and the regex splits it into fields.
    test "group test" {
        const tt = std.testing;
        const nooptions = .{};

        const subject =
            \\99.99.99.99 - - [23/Feb/2023:09:46:18 +0000] "POST /u/nandalism/inbox HTTP/1.1" 200 0 "-" "http.rb/4.4.1 (Mastodon/3.3.3; +https://pawoo.net/)"
        ;

        const re = try Regex.init(
            \\(\S+) - (\S+) \[([^]]+)\] "([^"]+)" (\S+) (\S+) "([^"]+)" "([^"]+)"
        , nooptions);
        defer re.deinit();

        var grp = try Groups.init(re);
        defer grp.deinit();

        const nmatch = re.match(grp, subject, nooptions);
        try tt.expectEqual(nmatch, 9);

        const ncap = grp.count();
        try tt.expectEqual(nmatch, ncap);

        // Expected text of each group; group 0 is the entire match.
        const exstr = [_][]const u8{
            subject,
            "99.99.99.99",
            "-",
            "23/Feb/2023:09:46:18 +0000",
            "POST /u/nandalism/inbox HTTP/1.1",
            "200",
            "0",
            "-",
            "http.rb/4.4.1 (Mastodon/3.3.3; +https://pawoo.net/)",
        };

        var i: usize = 0;
        while (i < ncap) : (i += 1) {
            const s = grp.nth(i, subject);
            //debug.print("grp[{d:2}] = {s}\n", .{i, s});
            // expectEqualStrings takes the expected string first.
            try tt.expectEqualStrings(exstr[i], s);
        }
    }
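To make the saving from the \\ string form in the test above concrete, here is a small side-by-side of the same pattern fragment written both ways. The constant names are made up for this illustration.

```zig
const std = @import("std");

// Multiline string literal: everything after the leading `\\` up to the end
// of the line is taken verbatim, so `\S` and `"` need no escaping.
const raw_pattern =
    \\(\S+) "([^"]+)"
;

// The same bytes as an ordinary string literal, with each backslash and
// double quote escaped.
const escaped_pattern = "(\\S+) \"([^\"]+)\"";
```

Both constants hold identical bytes. The multiline form reads the way the regex would be typed at a shell prompt, which matters more as the pattern grows.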
Passing Options
Here I've copied the zig std library idiom of passing options/long parameter lists as a struct instance.
Using default values for struct members and struct literals we can succinctly specify just the options we want.
Example: .{.Anchored=true, .NoJit=true}.
I convert this struct into the bitmask the C pcre2 api requires by calling options.c_mask() inside the wrapper code.
Is this better than just having a set of constants and using bitwise or on them? Not sure, but it seems to be the accepted zig idiom. As long as the option lists are compile-time constants, zig's aggressive constant folding should make all this disappear and leave us with a simple constant integer bitmask at compile time.
Note: I haven't actually tried any of the options and my tests don't test any of them.
    pub const MatchOptions = struct {
        Anchored: bool = false,
        CopyMatchedSubject: bool = false,
        Endanchored: bool = false,
        Notbol: bool = false,
        Noteol: bool = false,
        Notempty: bool = false,
        NotemptyAtstart: bool = false,
        NoJit: bool = false,
        NoUtfCheck: bool = false,
        PartialHard: bool = false,
        PartialSoft: bool = false,

        pub fn c_mask(o: MatchOptions) u32 {
            var m: u32 = 0;
            if (o.Anchored) m |= c.PCRE2_ANCHORED;
            if (o.CopyMatchedSubject) m |= c.PCRE2_COPY_MATCHED_SUBJECT;
            if (o.Endanchored) m |= c.PCRE2_ENDANCHORED;
            if (o.Notbol) m |= c.PCRE2_NOTBOL;
            if (o.Noteol) m |= c.PCRE2_NOTEOL;
            if (o.Notempty) m |= c.PCRE2_NOTEMPTY;
            if (o.NotemptyAtstart) m |= c.PCRE2_NOTEMPTY_ATSTART;
            if (o.NoJit) m |= c.PCRE2_NO_JIT;
            if (o.NoUtfCheck) m |= c.PCRE2_NO_UTF_CHECK;
            if (o.PartialHard) m |= c.PCRE2_PARTIAL_HARD;
            if (o.PartialSoft) m |= c.PCRE2_PARTIAL_SOFT;
            return m;
        }
    };
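One way to check the compile-time claim without relying on the optimizer is to evaluate the mask in a context where zig must compute it during compilation. The cut-down sketch below does that with a two-field struct. DemoOptions and the FLAG_ values are placeholders invented for the example so that it builds without pcre2.h. They are not the real PCRE2 constants.

```zig
const std = @import("std");

// Placeholder bit values for illustration only; the real wrapper takes its
// bits from the translated pcre2.h instead.
const FLAG_ANCHORED: u32 = 0x1;
const FLAG_NO_JIT: u32 = 0x2;

// Hypothetical two-option version of the options struct.
const DemoOptions = struct {
    Anchored: bool = false,
    NoJit: bool = false,

    fn c_mask(o: DemoOptions) u32 {
        var m: u32 = 0;
        if (o.Anchored) m |= FLAG_ANCHORED;
        if (o.NoJit) m |= FLAG_NO_JIT;
        return m;
    }
};

// The initializer of a container-level constant is evaluated at compile
// time, so `demo_mask` is a plain integer in the compiled program.
const demo_mask = (DemoOptions{ .Anchored = true, .NoJit = true }).c_mask();

comptime {
    // Checked by the compiler itself: a wrong mask would be a compile error.
    if (demo_mask != 0x3) @compileError("unexpected option mask");
}
```

When the options are passed as a runtime argument the same function still works. Whether the branches are then folded away depends on inlining and the build mode, so the compile-time guarantee only holds where the call is evaluated at comptime.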
The Code
The full code is here. I didn't bother to make a zig library project. It's just one file you can copy into your project (or into a sub-directory thereof).
Other Regex Libraries
I had a look around first to see what zig regex libraries were available.
- Pure zig regex library: @tiehuis's zig-regex. I originally tried to use this. @tiehuis seems to be a regular contributor to zig and he references swtch.com's regex papers, which is always good. However, I found the implementation to be very slow.
- Wrapper on pcre (version 1): @kivikakk's wrapper for pcre. I had settled on this until I saw that it uses the original pcre, whereas I have pcre2 installed on my machine.
My (alpine linux) grep uses pcre2, which I saw as a good sign. So I ended up deciding to wrap pcre2 myself.