Up to Main Index                            Up to Journal for August, 2023

                    JOURNAL FOR TUESDAY 22ND AUGUST, 2023
______________________________________________________________________________

SUBJECT: Feedback helps improve regular expressions
   DATE: Tue 22 Aug 20:03:44 BST 2023

Sunday night, after posting, I continued working on regular expressions. I
think I have something nice that works, but it needs a new type. How long did
it take me to add the uint type? Ah well…

Actually two types are needed, regexp and []regexp — it does not make sense to
have a map with regexp keys. Throughout this post I’ll be using backticks out
of habit — it saves having to double quote escapes as I can type `\s+` instead
of "\\s+".

I’ve had some feedback on my previous regular expression post. I’ve been asked
to provide split and find functionality as well as match and substitute. This
didn’t fit well with the syntax I had planned, so I came up with: ~c ~f ~m ~s

For example:


    println "a,b,c" ~c `,`             // cut         prints "a" "b" "c"
    println "ant bat cat" ~f `.at`     // find        prints "bat" "cat"
    println "0xFF"  ~m `0x[a-fA-F]{2}` // match       prints true
    println "teh"   ~s "teh" "the"     // substitute  prints "the"
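

For readers who think in terms of a regular expression library, the four
operators can be roughly sketched with Go's standard regexp package. This is
only a comparison under assumptions: the function names cut, find, match and
substitute are made up for illustration, and it assumes that find returns
every match and that substitute replaces every occurrence.

```go
package main

import (
	"fmt"
	"regexp"
)

// cut, find, match and substitute are hypothetical stand-ins for the ~c, ~f,
// ~m and ~s operators. Each compiles its pattern on the fly.

func cut(s, pattern string) []string {
	return regexp.MustCompile(pattern).Split(s, -1)
}

func find(s, pattern string) []string {
	return regexp.MustCompile(pattern).FindAllString(s, -1)
}

func match(s, pattern string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

// Assumption: every occurrence is replaced, not just the first.
func substitute(s, pattern, repl string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(s, repl)
}

func main() {
	fmt.Printf("%q\n", cut("a,b,c", `,`))          // ["a" "b" "c"]
	fmt.Printf("%q\n", find("ant bat cat", `.at`)) // ["bat" "cat"]
	fmt.Println(match("0xFF", `0x[a-fA-F]{2}`))    // true
	fmt.Println(substitute("teh", "teh", "the"))   // the
}
```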


For substitution ‘~s’ only, there is also a combined substitute and assign:


    x = "teh quick grey wolf"
    x ~s= "teh" "the"
    println x                   // prints "the quick grey wolf"


I don’t think it makes sense for the others: ~m= would turn the source string
into a boolean, while ~c= and ~f= would turn it into a []string.

The regular expressions for the new operators can be specified as type string,
in which case they are compiled on the fly, or they can be of type regexp and
are compiled when defined:


    println "0xEF" ~m `0x[a-fA-F]{2}`           // a string
    println "0xEF" ~m regexp `0x[a-fA-F]{2}`    // a regexp


Compiling a regular expression to a regexp is beneficial when we know we want
to use it multiple times and can just compile it the once. For example:


    re = regexp `.at`                       // compile once
    range ; v; []string "ant", "bat", "cat"
      println v, " ", v ~m re               // re-use on each loop iteration
    next


Another benefit of using the regexp type is that the regular expression can be
annotated. Any backtick-quoted strings after the regexp keyword, up to the
next semi-colon (which may be implied), are assumed to be “regexp strings”.

For “regexp strings” only, line comments starting ‘# ’ within the string are
removed, as is any white-space. White-space, and a literal hash ‘#’ followed
by a space, can be preserved by escaping them with a backslash or encoding
them using octal or hexadecimal. For a hash ‘#’ the octal is ‘\043’ and the
hex is ‘\x23’.

A “regexp string” example:


    re = regexp `
      ^            # anchor at beginning
      0x           # hexadecimal prefix
      [a-fA-F0-9]  # a hexadecimal digit
      +            # one or more times
      $            # anchor at end
    `
    println literal re


This code produces:


    regexp(`^0x[a-fA-F0-9]+$`)
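

The stripping rules can be sketched as a small Go function. This is a guess at
the behaviour described above, not the actual implementation: the name
stripAnnotations is made up, and it assumes that an escaped white-space
character is kept as the bare character while any other escape is copied
through unchanged.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripAnnotations is a hypothetical helper. It removes "# " line comments
// and unescaped white-space from an annotated pattern.
func stripAnnotations(src string) string {
	var b strings.Builder
	r := []rune(src)
	for i := 0; i < len(r); i++ {
		c := r[i]
		switch {
		case c == '\\' && i+1 < len(r):
			// Assumption: escaped white-space becomes the bare character,
			// other escapes such as \# or \x23 keep their backslash.
			if !unicode.IsSpace(r[i+1]) {
				b.WriteRune(c)
			}
			b.WriteRune(r[i+1])
			i++
		case c == '#' && i+1 < len(r) && r[i+1] == ' ':
			// Skip the comment up to the end of the line.
			for i < len(r) && r[i] != '\n' {
				i++
			}
		case unicode.IsSpace(c):
			// Drop unescaped white-space.
		default:
			b.WriteRune(c)
		}
	}
	return b.String()
}

func main() {
	src := `
	  ^            # anchor at beginning
	  0x           # hexadecimal prefix
	  [a-fA-F0-9]  # a hexadecimal digit
	  +            # one or more times
	  $            # anchor at end
	`
	fmt.Println(stripAnnotations(src)) // ^0x[a-fA-F0-9]+$
}
```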


A “regexp string” is only found after the regexp keyword, must be quoted with
backticks, and is only interpreted once, at compile time.

It is also possible to create arrays and maps of regexp. An example with a
string map holding the regular expressions:


    res = [string](
      "hex"   `0x[a-fA-F0-9]{2}`,
      "color" `#[a-fA-F0-9]{6}`,
    )
    range k; v; res; res[k] = regexp v; next // convert string to regexp

    range ; v; []string "0xFF", "#C0FFEE", "42"
      range category; re; res
        if v ~m re
          println category ": " v
          continue 1
        fi
      next
      println "unknown: " v
    next


When run, this code produces:


    hex: 0xFF
    color: #C0FFEE
    unknown: 42
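

For comparison, the same classification loop sketched with Go's standard
regexp package. The names categories and classify are made up for this
sketch. Go does not guarantee map iteration order, so it relies on each input
matching at most one pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// categories maps a category name to a compiled pattern (hypothetical name).
var categories = map[string]*regexp.Regexp{
	"hex":   regexp.MustCompile(`0x[a-fA-F0-9]{2}`),
	"color": regexp.MustCompile(`#[a-fA-F0-9]{6}`),
}

// classify returns the name of a category whose pattern matches v, or
// "unknown" if none does.
func classify(v string) string {
	for name, re := range categories {
		if re.MatchString(v) {
			return name
		}
	}
	return "unknown"
}

func main() {
	for _, v := range []string{"0xFF", "#C0FFEE", "42"} {
		fmt.Printf("%s: %s\n", classify(v), v)
	}
}
```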


For maps it is necessary to explicitly make the string values a regexp. This
can be achieved by populating the map with strings and converting them, as in
the example above. The other way is to use regexp in the map definition. For
example, the above map definition could be rewritten using regexp directly
without the need for the range-next loop:


    res = [string](
      "hex"   regexp `0x[a-fA-F0-9]{2}`,
      "color" regexp `#[a-fA-F0-9]{6}`,
    )


For arrays of type regexp the conversion from string to regexp is automatic,
as the element type is known to be regexp:


    res = []regexp `0x[a-fA-F0-9]{2}` `#[a-fA-F0-9]{6}`
    print literal res


This produces:


    []regexp(regexp(`0x[a-fA-F0-9]{2}`) regexp(`#[a-fA-F0-9]{6}`))
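

The automatic conversion can be pictured as compiling each string as it goes
into the array. A rough Go sketch, where compileAll is a made-up helper and
the error handling is an assumption about what a bad pattern should do:

```go
package main

import (
	"fmt"
	"regexp"
)

// compileAll is a hypothetical helper. It compiles every pattern in order
// and stops at the first one that fails to compile.
func compileAll(patterns ...string) ([]*regexp.Regexp, error) {
	res := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", p, err)
		}
		res = append(res, re)
	}
	return res, nil
}

func main() {
	res, err := compileAll(`0x[a-fA-F0-9]{2}`, `#[a-fA-F0-9]{6}`)
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, re := range res {
		fmt.Println(re) // prints the pattern source
	}
}
```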


That’s about it for regular expressions. There are a few odds and ends to tidy
up. Most of the work is making sure that operators such as assignment, logical
AND, OR and NOT work with ~c, ~f, ~m and ~s and that built-ins such as delete,
dim, exists, literal, type, trace, dump, etc. know how to handle regexp and
[]regexp. Normal array operations such as appending should work for []regexp.

The hardest part, the design, I now consider done for regular expressions. I’m
really pleased with the changes and think that the new ~c, ~f, ~m, ~s and ~s=
work much better than the earlier ~~, ~, and ~= design. So, thank you to those
who spent some of their precious time to email me at diddymus@wolfmud.org :)

Implementation is 30-40% done. Documentation and tests still need writing…

--
Diddymus

