                     JOURNAL FOR SUNDAY 20TH AUGUST, 2023
______________________________________________________________________________

SUBJECT: Love it or hate it…
   DATE: Sun 20 Aug 20:46:00 BST 2023

Since the release of Mere v0.0.6 I’ve been busy tidying up some of the Mere
code. I’ve also been working on a new feature which people are either going to
love or hate…

I’ve used quite a number of programming languages on many different platforms
over the years. One of the features I have always loved is Perl’s built-in
support for regular expressions. I knew I wanted something similar for Mere.
It’s a feature I’ve alluded to before, and it’s why I reserved the tilde ‘~’
character ages ago. As of this afternoon I have a working implementation:


    >cat regexp.mr
    /*
    A mere RegExp example…

      ~~ is a regexp match (like compare '==')
      ~  is a regexp replacement
      ~= is a regexp replace and assign (like add and assign '+=')
    */

    text =  "The quick brown foxy..."
    println "Text is: " text

    print  "Text contains f..y?: "
    println text ~~ "f..y"

    print   "Replace '...' with '…' : "
    println text ~ "\\.\\.\\." "…"

    print  "Replace and assign 'foxy' with 'wolf': "
    println text ~= "foxy" "wolf"

    print   "Swap 'quick' and 'brown': "
    println text ~= "(.*)(quick)(.*)(brown)(.*)" "$1$4$3$2$5"

    println "Final text: " text

    >mere regexp.mr
    Text is: The quick brown foxy...
    Text contains f..y?: true
    Replace '...' with '…' : The quick brown foxy…
    Replace and assign 'foxy' with 'wolf': The quick brown wolf...
    Swap 'quick' and 'brown': The brown quick wolf...
    Final text: The brown quick wolf...
    >


There is quite a lot going on here. First we have ‘~~’ which is a regular
expression match. Think of it like the comparison operator ‘==’, but for
regular expressions. For example:


    a ~~ "the"        // does 'a' contain the letters "the" anywhere
    a ~~ "\\bthe\\b"  // does 'a' contain the word "the"
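
For readers who know Go, the match operator can be approximated with Go's
regexp package. This is only a sketch: the helper name matches is made up for
illustration, and it assumes Mere's expressions behave like Go's RE2 syntax.
Go raw strings need a single backslash where the Mere strings above use two.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical stand-in for Mere's `text ~~ pattern`. It reports
// whether text contains at least one match for pattern. MustCompile panics
// if the pattern is invalid, which is acceptable for a small sketch.
func matches(text, pattern string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	fmt.Println(matches("bathe", "the"))           // true: the letters appear
	fmt.Println(matches("bathe", `\bthe\b`))       // false: not a whole word
	fmt.Println(matches("pet the cat", `\bthe\b`)) // true: whole word "the"
}
```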


Next we have replacement by regular expression ‘~’. This returns, as a string,
the result of replacing every match of a regular expression with a string. For
example:


    a ~ "\\bteh\\b" "the" // returns the result of replacing all occurrences
                          // of the word "teh" in 'a' with "the", no
                          // assignment, just return the resulting string
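
A comparable sketch in Go, again assuming RE2-style expressions. The helper
name replace is hypothetical. The point it shows is that the original string
is left untouched and only the returned copy changes.

```go
package main

import (
	"fmt"
	"regexp"
)

// replace is a hypothetical stand-in for Mere's `text ~ pattern repl`. It
// returns a copy of text with every match of pattern replaced by repl.
func replace(text, pattern, repl string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(text, repl)
}

func main() {
	a := "teh cat sat on teh mat"
	fmt.Println(replace(a, `\bteh\b`, "the")) // the cat sat on the mat
	fmt.Println(a)                            // teh cat sat on teh mat
}
```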


Last of all we have ‘~=’ to perform a regular expression replacement and
assign the result back to a variable. For example:


    a ~= "\\bteh\\b" "the" // replace the word "teh" with "the" in 'a' and
                           // assign resulting string back to 'a'
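
Replace and assign could be sketched in Go by passing a pointer, so the helper
can store the result back into the caller's variable. The name replaceAssign
is hypothetical, and Go's ‘$1’ style group references are assumed to behave
like the ones in the Mere example at the top of this entry.

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceAssign is a hypothetical stand-in for Mere's `text ~= pattern repl`.
// It replaces every match of pattern in *text with repl, stores the result
// back in *text, and also returns it so that it can be printed directly.
func replaceAssign(text *string, pattern, repl string) string {
	*text = regexp.MustCompile(pattern).ReplaceAllString(*text, repl)
	return *text
}

func main() {
	text := "The quick brown foxy..."

	// Plain replacement, result assigned back to text.
	fmt.Println(replaceAssign(&text, "foxy", "wolf")) // The quick brown wolf...

	// Capture groups $1..$5 are reordered to swap "quick" and "brown".
	fmt.Println(replaceAssign(&text, "(.*)(quick)(.*)(brown)(.*)", "$1$4$3$2$5"))

	fmt.Println(text) // The brown quick wolf...
}
```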


I’m also thinking I should have a way to identify regular expression strings.
In Perl you would use ‘/’ as delimiters[1]. The only spare character I have is
‘@’, which would make code look ugly. I might be able to dual-purpose ‘\’
instead?


    $foo =~ m/abc/   // Perl: does $foo contain "abc"?
    foo  ~~ "abc"    // Mere: currently the same thing
    foo  ~~ \abc\    // Mere: possible change…


Why would I want to do that? If you know which strings are regular expressions
you can pre-compile them. This improves performance, as each expression is
compiled once and then reused. For example, assume I have a list of UK
telephone numbers I want to (very badly) validate. I could write something
like:


    >cat validate.mr
    range ; v; []string(
      "+44(0)20 7946 1234",
      "+44 20 7946 1234",
      "020 7946 1234",
      "020-7946-1234",
      "02079461234",
      "tel: 02079461234",
      "02079461234, ext 2",
    )
      clean = v ~ "\\(0\\)|[ ()-]" ""
      printf "%5t - %s (%s)\n", clean ~~ "^(\\+44|0)\\d{10}$", v, clean
    next

    >mere validate.mr
     true - +44(0)20 7946 1234 (+442079461234)
     true - +44 20 7946 1234 (+442079461234)
     true - 020 7946 1234 (02079461234)
     true - 020-7946-1234 (02079461234)
     true - 02079461234 (02079461234)
    false - tel: 02079461234 (tel:02079461234)
    false - 02079461234, ext 2 (02079461234,ext2)
    >
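
To make the pre-compiling idea concrete, here is a rough Go sketch of the same
clean-then-validate steps with both expressions compiled once, before the
loop runs. The names strip, valid and validate are made up for illustration,
and this is not a description of how Mere works internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both expressions are compiled once, at program start, and then reused for
// every number that is checked.
var (
	strip = regexp.MustCompile(`\(0\)|[ ()-]`)     // "(0)", spaces, brackets, hyphens
	valid = regexp.MustCompile(`^(\+44|0)\d{10}$`) // +44 or 0, then 10 digits
)

// validate returns the cleaned-up number and whether it passes the (very
// rough) UK telephone number check.
func validate(v string) (clean string, ok bool) {
	clean = strip.ReplaceAllString(v, "")
	return clean, valid.MatchString(clean)
}

func main() {
	for _, v := range []string{
		"+44(0)20 7946 1234",
		"020-7946-1234",
		"tel: 02079461234",
	} {
		clean, ok := validate(v)
		fmt.Printf("%5t - %s (%s)\n", ok, v, clean)
	}
}
```

Package level variables are one common Go idiom for this. Compiling inside
main before the loop would work just as well.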


However, the Mere script above has two regular expressions in its loop:
“\\(0\\)|[ ()-]” and “^(\\+44|0)\\d{10}$”. The current implementation compiles
both of them again on every iteration. Knowing a string is a regular
expression also lets you handle the string specially in other ways, for
example by allowing the regular expression to be annotated:


    validatePhone = \
      ^           // match start of string
      (\\+44|0)   // begins with +44 or 0
      \\d{10}     // followed by 10 digits
      $           // match end of string
    \


This is very important for regular expressions, as complex ones tend to look
a lot like line noise :|
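
The annotation idea can be approximated in Go as well. Go's regexp package has
no free-spacing mode, so this sketch strips ‘//’ comments and all whitespace
before compiling. The helper compileAnnotated is hypothetical, and it assumes
the expression itself contains no literal whitespace and no ‘//’.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileAnnotated is a hypothetical helper. For each line of src it drops
// everything from "//" onwards, removes all remaining whitespace, joins what
// is left and compiles the result as a regular expression.
func compileAnnotated(src string) (*regexp.Regexp, error) {
	var b strings.Builder
	for _, line := range strings.Split(src, "\n") {
		if i := strings.Index(line, "//"); i >= 0 {
			line = line[:i] // drop the annotation
		}
		b.WriteString(strings.Join(strings.Fields(line), ""))
	}
	return regexp.Compile(b.String())
}

func main() {
	validatePhone, err := compileAnnotated(`
		^           // match start of string
		(\+44|0)    // begins with +44 or 0
		\d{10}      // followed by 10 digits
		$           // match end of string
	`)
	if err != nil {
		panic(err)
	}
	fmt.Println(validatePhone.String())                       // ^(\+44|0)\d{10}$
	fmt.Println(validatePhone.MatchString("+442079461234"))   // true
	fmt.Println(validatePhone.MatchString("tel:02079461234")) // false
}
```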

I still need to experiment and make sure the addition of regular expressions
makes sense and fits with the rest of the language. However, I’m quite excited
for this feature :)

Love it? Hate it? Let me know your thoughts: diddymus@wolfmud.org

--
Diddymus

  [1] Other delimiters are available, but ‘/’ is commonly used.

