Zig Wrapper For pcre2 Regex

Wrapping a C Library

This is an example of how to wrap a C library in zig. The library I am wrapping is pcre2, the Perl Compatible Regular Expressions library. The zig manual describes the language and how to use its C integration.

I've tried to model the wrapper on code used in the zig std library. The regex api is split into two types: Regex compiles the pattern, and Groups stores the matched ranges.

There's lots of pcre2 functionality I have left out (contexts, for example), but the wrapper should be enough to be useful.

Zig Does C Integration Well

zig can compile C code as well as zig code. It can also import a C header file and automatically generate a low-level zig binding for the functions and types declared in it. This is a great feature. For example, this is all I had to do to get the 8-bit version of the pcre2 api:

const c = @cImport({
  @cDefine("PCRE2_CODE_UNIT_WIDTH", "8");
  @cInclude("pcre2.h");
});

In some ways the job is done now. One can use the wrapped C functions and types directly. Indeed one does just that when writing a wrapper. However, putting a more idiomatic zig layer on top helps to integrate the api into zig space.
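To give a flavour of what an idiomatic layer can add, here is a minimal sketch that maps a pcre2_match-style integer return code onto a zig error union. It is not taken from my wrapper: matchCount, MatchError and ERROR_NOMATCH are hypothetical names, and ERROR_NOMATCH is a stand-in for PCRE2_ERROR_NOMATCH (-1 in pcre2.h) so the sketch compiles without the library.

```zig
const std = @import("std");

// Assumed stand-in for PCRE2_ERROR_NOMATCH from pcre2.h.
const ERROR_NOMATCH: c_int = -1;

// Hypothetical error set; the names are illustrative only.
const MatchError = error{ NoMatch, MatchFailed };

// Hypothetical helper: negative codes become errors, non-negative codes
// are returned as the number of matched pairs (whole match plus groups).
fn matchCount(rc: c_int) MatchError!usize {
    if (rc == ERROR_NOMATCH) return error.NoMatch;
    // math.cast yields null for every other negative code.
    const n = std.math.cast(usize, rc) orelse return error.MatchFailed;
    return n;
}

test "return codes become errors or counts" {
    try std.testing.expectEqual(@as(usize, 9), try matchCount(9));
    try std.testing.expectError(error.NoMatch, matchCount(-1));
    try std.testing.expectError(error.MatchFailed, matchCount(-2));
}
```

With something like this in place, callers can use try and catch instead of comparing integers against C constants.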

Zig Testing

The wrapper is tested with zig's integrated testing. You can run the tests in any zig file like this:

$ zig test pcre2.zig -lc -lpcre2-8
All 2 tests passed.

First a very simple test without capture groups.

test "basic test" {
  const tt = std.testing;
  const nooptions = .{};
  const re = try Regex.init("abc*", nooptions);
  defer re.deinit();
  var grp = try Groups.init(re);
  defer grp.deinit();
  const subject = "abcccc";
  const nmatch = re.match(grp, subject, nooptions);
  try tt.expectEqual(nmatch,1);
  const ncap = grp.count();
  try tt.expectEqual(nmatch, ncap);
  try tt.expectEqualStrings(subject, grp.nth(0, subject));
}
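The call grp.nth(0, subject) turns a stored range back into a slice of the subject. A minimal sketch of how such a lookup could work, assuming the ranges are kept the way pcre2 reports them, as a flat array of (start, end) offset pairs. nthSlice is a hypothetical helper, not the wrapper's actual code, and it does not handle groups that did not take part in the match.

```zig
const std = @import("std");

// Hypothetical helper: pair n lives at ovector[2*n] and ovector[2*n + 1],
// and both values are byte offsets into the subject.
fn nthSlice(ovector: []const usize, n: usize, subject: []const u8) []const u8 {
    return subject[ovector[2 * n]..ovector[2 * n + 1]];
}

test "offset pairs become slices" {
    const subject = "key=value";
    // Pair 0 is the whole match, pairs 1 and 2 are capture groups.
    const ov = [_]usize{ 0, 9, 0, 3, 4, 9 };
    try std.testing.expectEqualStrings("key=value", nthSlice(&ov, 0, subject));
    try std.testing.expectEqualStrings("key", nthSlice(&ov, 1, subject));
    try std.testing.expectEqualStrings("value", nthSlice(&ov, 2, subject));
}
```

Because the result is a slice into the subject, nothing is copied or allocated, which is why the subject has to be passed in again.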

Next, a more complex regex with 8 capture groups and some Perlisms like \S for "not a space".

I've used the zig \\ raw string form to avoid all the escaping required for an equivalent C string.

The example subject is a line from an nginx access.log, and the regex splits it into fields.

test "group test" {
  const tt = std.testing;
  const nooptions = .{};
  const subject =
    \\99.99.99.99 - - [23/Feb/2023:09:46:18 +0000] "POST /u/nandalism/inbox HTTP/1.1" 200 0 "-" "http.rb/4.4.1 (Mastodon/3.3.3; +https://pawoo.net/)"
    ;
  const re = try Regex.init(
    \\(\S+) - (\S+) \[([^]]+)\] "([^"]+)" (\S+) (\S+) "([^"]+)" "([^"]+)"
    , nooptions);
  defer re.deinit();
  var grp = try Groups.init(re);
  defer grp.deinit();
  const nmatch = re.match(grp, subject, nooptions);
  try tt.expectEqual(nmatch,9);
  const ncap = grp.count();
  try tt.expectEqual(nmatch, ncap);
  const exstr = [_][]const u8{
    subject,
    "99.99.99.99",
    "-",
    "23/Feb/2023:09:46:18 +0000",
    "POST /u/nandalism/inbox HTTP/1.1",
    "200",
    "0",
    "-",
    "http.rb/4.4.1 (Mastodon/3.3.3; +https://pawoo.net/)",
  };
  var i: usize = 0;
  while(i<ncap):(i+=1){
    const s = grp.nth(i, subject);
    //debug.print("grp[{d:2}] = {s}\n", .{i, s});
    // expectEqualStrings takes the expected value first
    try tt.expectEqualStrings(exstr[i], s);
  }
}
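To show what the \\ raw string form saves, here is a small self-contained comparison of a fragment of that regex written both ways. The names raw and escaped are just for illustration.

```zig
const std = @import("std");

// Raw (multiline) literal: backslashes and double quotes are taken verbatim.
const raw =
    \\(\S+) "([^"]+)"
;

// The same pattern as an ordinary literal needs every backslash and
// double quote escaped.
const escaped = "(\\S+) \"([^\"]+)\"";

test "raw and escaped literals are the same string" {
    try std.testing.expectEqualStrings(escaped, raw);
}
```

A single-line raw literal has no trailing newline, so it can be handed to the regex compiler as is.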

Passing Options

Here I've copied the zig std library idiom of passing options or long parameter lists as a struct instance. With default values for the struct members and struct literals, we can succinctly specify just the options we want. Example: .{.Anchored=true, .NoJit=true}. I convert this struct into the bitmask the C pcre2 api requires by calling options.c_mask() inside the wrapper code.

Is this better than just having a set of constants and combining them with bitwise or? I'm not sure, but it seems to be the accepted zig idiom. As long as the option lists are compile-time constants, zig's aggressive constant folding should make all of this disappear and leave a plain integer bitmask in the compiled code.

Note: I haven't actually tried any of the options and my tests don't test any of them.

pub const MatchOptions = struct {
  Anchored: bool = false,
  CopyMatchedSubject: bool = false,
  Endanchored: bool = false,
  Notbol: bool = false,
  Noteol: bool = false,
  Notempty: bool = false,
  NotemptyAtstart: bool = false,
  NoJit: bool = false,
  NoUtfCheck: bool = false,
  PartialHard: bool = false,
  PartialSoft: bool = false,

  pub fn c_mask(o: MatchOptions) u32 {
    var m: u32 = 0;
    if(o.Anchored) m |= c.PCRE2_ANCHORED;
    if(o.CopyMatchedSubject) m |= c.PCRE2_COPY_MATCHED_SUBJECT;
    if(o.Endanchored) m |= c.PCRE2_ENDANCHORED;
    if(o.Notbol) m |= c.PCRE2_NOTBOL;
    if(o.Noteol) m |= c.PCRE2_NOTEOL;
    if(o.Notempty) m |= c.PCRE2_NOTEMPTY;
    if(o.NotemptyAtstart) m |= c.PCRE2_NOTEMPTY_ATSTART;
    if(o.NoJit) m |= c.PCRE2_NO_JIT;
    if(o.NoUtfCheck) m |= c.PCRE2_NO_UTF_CHECK;
    if(o.PartialHard) m |= c.PCRE2_PARTIAL_HARD;
    if(o.PartialSoft) m |= c.PCRE2_PARTIAL_SOFT;
    return m;
  }
};
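To check the claim that the struct-to-bitmask conversion can happen at compile time, here is a cut-down sketch with only two options. Options is a hypothetical stand-in for MatchOptions, and ANCHORED and NO_JIT are stand-in values so that it compiles without pcre2.h; the real wrapper reads the flags from the @cImport namespace.

```zig
const std = @import("std");

// Stand-in flag values, for illustration only.
const ANCHORED: u32 = 0x8000_0000;
const NO_JIT: u32 = 0x0000_2000;

// Hypothetical two-field version of the options struct.
const Options = struct {
    Anchored: bool = false,
    NoJit: bool = false,

    pub fn c_mask(o: Options) u32 {
        var m: u32 = 0;
        if (o.Anchored) m |= ANCHORED;
        if (o.NoJit) m |= NO_JIT;
        return m;
    }
};

// A container-level constant has to be known at compile time, so this
// line only compiles if the whole conversion can be evaluated then.
const mask = (Options{ .Anchored = true, .NoJit = true }).c_mask();

test "mask is a compile-time constant" {
    try std.testing.expectEqual(@as(u32, 0x8000_2000), mask);
    // Fields that are not mentioned keep their defaults: an empty mask.
    try std.testing.expectEqual(@as(u32, 0), (Options{}).c_mask());
}
```

This only shows that the conversion can be done at compile time. Whether a call with constant options inside the wrapper is actually folded depends on the optimiser, which I have not checked.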

The Code

The full code is here. I didn't bother to make a zig library project. It's just one file you can copy into your project (or into a sub-directory thereof).

Other Regex Libraries

I had a look around first to see what zig regex libraries were available.

My (Alpine Linux) grep uses pcre2, which I saw as a good sign. So I ended up deciding to wrap pcre2 myself.

