
                    JOURNAL FOR WEDNESDAY 11TH APRIL, 2018
______________________________________________________________________________

SUBJECT: Commenting regular expressions in Go
   DATE: Wed 11 Apr 21:51:14 BST 2018

I am currently waiting to see if there is any fallout from the recent player
saving and loading changes before preparing the next release. So far things
have been quiet, and my own testing has not uncovered any issues.

It is a sad fact that more time is spent debugging WolfMUD than adding new
features. I’ve been promising to write tests for WolfMUD for ages. I feel the
time has come to knuckle down and sort out testing. There are a few areas of
the code that have tests written already, but they need improving. Testing is
often seen as a boring, unglamorous and thankless task. However, I look at it
as a way of not just hunting for bugs, but of reviewing and improving the
code.

Once all of the tests are written I will be able to make some far-reaching
changes. Changes that until now I’ve not been confident I could make without
introducing subtle mistakes.

Where to start? I thought I’d start with the recordjar package. It has some
tests already, but I’m not happy with them. Looking at recordjar.go, one of
the first lines I see is this:


  var splitLine = regexp.MustCompile(`^(?:([^\s:]+):)?\s*(.*?)$`)


I love regular expressions, but was this one correct? What did it actually do?
The first thing to do is break the expression down and work out the parts. In
Go there is no easy way to document regular expressions. Something like Perl’s
/x modifier, which lets you put whitespace and comments inside the regular
expression, would be a nice addition.

The best compromise I could come up with was to join a []string, which can be
commented and formatted. However, all of the quoting and commas get in the way
of seeing the regular expression, and you can’t indent parts of it:


  var splitLine = regexp.MustCompile(strings.Join([]string{
    `^`,         // match start of string
    `(?:`,       // non-capture group for 'field:'
    `([^\s:]+)`, // capture 'field' - non-whitespace/non-colon
    `:`,         // non-capture match of colon as field:value separator
    `)?`,        // match non-captured 'field:' zero or once, prefer once
    `\s*`,       // consume any whitespace - leading or after 'field:' if matched
    `(.*?)`,     // capture everything left unmatched, not greedy
    `$`,         // match at end of string
  }, ""))


Meh, better than `^(?:([^\s:]+):)?\s*(.*?)$`, but still ugly :( So, while
working on this post I decided to do more than moan. I created a very simple
CommentedRE function:


  // uncommentRE is a regular expression to remove embedded comments and
  // leading/trailing whitespace from a regular expression string.
  var uncommentRE = regexp.MustCompile(`(?m)(?:\s*#\s.*$|^\s*|\n)`)

  // CommentedRE uncomments a commented regular expression. It takes a regular
  // expression as a string and removes: comments starting with a '#' followed
  // by at least one whitespace character, whitespace at the start of each
  // line, whitespace before a comment, and newlines. The resulting string is
  // then returned.
  func CommentedRE(re string) string {
    return uncommentRE.ReplaceAllString(re, "")
  }


I then modified the splitLine string and commented the regular expression:


  var splitLine = regexp.MustCompile(CommentedRE(`
    ^            # match start of string
    (?:          # non-capture group for 'field:'
      ([^\s:]+)  # capture 'field' - non-whitespace/non-colon
      :          # non-capture match of colon as field:value separator
    )?           # match non-captured 'field:' zero or once, prefer once
    \s*          # consume any whitespace - leading or after 'field:' if matched
    (.*?)        # capture everything left unmatched, not greedy
    $            # match at end of string
  `))


Much better! I decided to stick with Perl’s ‘#’ as the comment delimiter to
avoid confusion with real Go comments. There are still some downsides. Any ‘#’
followed by whitespace is treated as the start of a comment, so it and the
rest of its line are removed from the regular expression, even inside a
character class. Using ‘go fmt’ will not format the comments nicely, so they
have to be manually aligned. Lastly, I now have another regular expression to
write tests for!
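
As a rough illustration of both the benefit and the ‘#’ downside, here is a
small self-contained sketch. The uncommentRE and CommentedRE definitions are
copied from above; the `[# ]+` pattern and the `\x23\x20` hex-escape
workaround are assumptions made up for this example, not something WolfMUD
uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// uncommentRE and CommentedRE are copied from the post.
var uncommentRE = regexp.MustCompile(`(?m)(?:\s*#\s.*$|^\s*|\n)`)

func CommentedRE(re string) string {
	return uncommentRE.ReplaceAllString(re, "")
}

func main() {
	// The commented expression collapses back to the original one-liner.
	fmt.Println(CommentedRE(`
		^            # match start of string
		(?:          # non-capture group for 'field:'
		  ([^\s:]+)  # capture 'field' - non-whitespace/non-colon
		  :          # non-capture match of colon as field:value separator
		)?           # match non-captured 'field:' zero or once, prefer once
		\s*          # consume any whitespace
		(.*?)        # capture everything left unmatched, not greedy
		$            # match at end of string
	`)) // prints ^(?:([^\s:]+):)?\s*(.*?)$

	// Pitfall (hypothetical pattern): the '#' inside the character class is
	// followed by a space, so it is mistaken for the start of a comment.
	broken := CommentedRE(`
		[# ]+  # one or more hashes or spaces
	`)
	_, err := regexp.Compile(broken)
	fmt.Printf("%q compiles: %t\n", broken, err == nil)

	// One possible workaround: write '#' and ' ' as hex escapes.
	fixed := CommentedRE(`
		[\x23\x20]+  # one or more hashes or spaces
	`)
	fmt.Printf("%q matches \"# #\": %t\n",
		fixed, regexp.MustCompile("^"+fixed+"$").MatchString("# #"))
}
```

Using regexp.Compile instead of regexp.MustCompile for the broken pattern
means the mangled expression shows up as an error rather than a panic.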

Getting back to the splitLine regular expression. The regular expression is
used to split a line in a .wrj file into a field name and its data. For
example:


  Name: Diddymus


The non-capturing group will match ‘Name:’, and within that group the
capturing group will match ‘Name’. The colon sits inside the non-capturing
group so that it is only consumed together with a field name. If there is no
field name the whole group is skipped and the colon is captured by the second
capturing group, which captures the data associated with the field. For
example in:


  OnAction: The rabbit hops around a bit.
          : The rabbit makes a soft squeaking and chattering noise.


The second line is a continuation of the first, and the colon at the start is
part of the data.
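
To see this in action, the sketch below runs both lines through the splitLine
expression. The split helper is hypothetical and only exists for this
example. It uses FindStringSubmatch rather than FindSubmatch to avoid the
[]byte conversions.

```go
package main

import (
	"fmt"
	"regexp"
)

var splitLine = regexp.MustCompile(`^(?:([^\s:]+):)?\s*(.*?)$`)

// split is a hypothetical helper returning the field name and data for a
// line, or two empty strings if the line does not match at all.
func split(line string) (field, data string) {
	m := splitLine.FindStringSubmatch(line)
	if m == nil {
		return "", ""
	}
	return m[1], m[2]
}

func main() {
	for _, line := range []string{
		"OnAction: The rabbit hops around a bit.",
		"        : The rabbit makes a soft squeaking and chattering noise.",
	} {
		field, data := split(line)
		fmt.Printf("field=%q data=%q\n", field, data)
	}
}
```

For the continuation line the field comes back empty and the data starts with
the colon, because the leading spaces stop the optional ‘field:’ group from
matching and are then consumed by `\s*`.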

A nice property of using the Regexp.FindSubmatch method with the capturing
groups is that, for any line that matches, it returns a three element slice:
the original input, the field name and the data. If there is no field name or
no data those elements will be zero length []byte (nil for an unmatched field
name).
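
A minimal sketch of that property follows. The submatches helper and the
sample inputs are assumptions for illustration only. The last input contains
an embedded newline, which a single line read from a .wrj file should not,
and is there to show the one case where nil is returned instead of a slice.

```go
package main

import (
	"fmt"
	"regexp"
)

var splitLine = regexp.MustCompile(`^(?:([^\s:]+):)?\s*(.*?)$`)

// submatches is a hypothetical wrapper around FindSubmatch.
func submatches(in string) [][]byte {
	return splitLine.FindSubmatch([]byte(in))
}

func main() {
	for _, in := range []string{"Name: Diddymus", "Name:", "Diddymus", "", "a\nb"} {
		m := submatches(in)
		if m == nil {
			// No match at all: FindSubmatch returns nil, not a slice.
			fmt.Printf("%q: no match\n", in)
			continue
		}
		// m[1] is nil when the optional 'field:' group did not take part
		// in the match; converting it to a string still gives "".
		fmt.Printf("%q: len=%d field=%q (nil: %t) data=%q\n",
			in, len(m), m[1], m[1] == nil, m[2])
	}
}
```

Because string(nil) is the empty string, code that converts the elements to
strings does not need to treat a missing field name as a special case.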

Testing the splitLine regular expression is quite simple:


  func TestSplitLine(t *testing.T) {
    for _, test := range []struct {
      input string
      field string // Expected field name
      data  string // Expected data value
    }{
      {"a: b", "a", "b"},     // Normal 'field: data'
      {"a:b", "a", "b"},      // 'field:data' - no space
      {"a:", "a", ""},        // field only
      {"a: ", "a", ""},       // field only - with space
      {":b", "", ":b"},       // no field, ':' + data only
      {": b", "", ": b"},     // no field, ': ' + data only
      {"b", "", "b"},         // data only
      {":", "", ":"},         // colon only
      {"", "", ""},           // empty line
      {" ", "", ""},          // space only line
      {"a:b:c", "a", "b:c"},  // field:data + embedded colon
      {"a: b:c", "a", "b:c"}, // field: data + embedded colon

      // Don't expect to see these lines, such lines should be filtered out
      // and not passed to splitLine.
      {"// Comment", "", "// Comment"}, // a comment line
      {"%%", "", "%%"},                 // a record separator
    } {
      t.Run(test.input, func(t *testing.T) {
        have := splitLine.FindSubmatch([]byte(test.input))
        if lhave, lwant := len(have), 3; lhave != lwant {
          t.Errorf("length - have: %d %q, want %d [%q %q %q]",
            lhave, have, lwant, test.input, test.field, test.data)
          return
        }
        if have, want := string(have[1]), test.field; have != want {
          t.Errorf("field - have: %q, want: %q", have, want)
        }
        if have, want := string(have[2]), test.data; have != want {
          t.Errorf("data - have: %q, want: %q", have, want)
        }
      })
    }
  }


That’s one line tested, just over 6300 more lines of code to go…

--
Diddymus

