JOURNAL FOR SUNDAY 20TH AUGUST, 2023
______________________________________________________________________________
SUBJECT: Love it or hate it…
DATE: Sun 20 Aug 20:46:00 BST 2023
Since the release of Mere v0.0.6 I’ve been busy tidying up some of the Mere
code. I’ve also been working on a new feature which people are either going to
love or hate…
I’ve used quite a number of programming languages on many different platforms
over the years. One of the features I have always loved is Perl’s built-in
support for regular expressions, and I knew I wanted something similar for
Mere. It’s a feature I’ve alluded to before, and it’s why I reserved the tilde
‘~’ character ages ago. As of this afternoon I have a working implementation:
>cat regexp.mr
/*
A mere RegExp example…
~~ is a regexp match (like compare '==')
~ is a regexp replacement
~= is a regexp replace and assign (like add and assign '+=')
*/
text = "The quick brown foxy..."
println "Text is: " text
print "Text contains f..y?: "
println text ~~ "f..y"
print "Replace '...' with '…' : "
println text ~ "\\.\\.\\." "…"
print "Replace and assign 'foxy' with 'wolf': "
println text ~= "foxy" "wolf"
print "Swap 'quick' and 'brown': "
println text ~= "(.*)(quick)(.*)(brown)(.*)" "$1$4$3$2$5"
println "Final text: " text
>mere regexp.mr
Text is: The quick brown foxy...
Text contains f..y?: true
Replace '...' with '…' : The quick brown foxy…
Replace and assign 'foxy' with 'wolf': The quick brown wolf...
Swap 'quick' and 'brown': The brown quick wolf...
Final text: The brown quick wolf...
>
There is quite a lot going on here. First we have ‘~~’ which is a regular
expression match. Think of it like the comparison operator ‘==’, but for
regular expressions. For example:
a ~~ "the" // does 'a' contain the letters "the" anywhere
a ~~ "\\bthe\\b" // does 'a' contain the word "the"
Next we have replacement by regular expression ‘~’. This returns a new string
in which every match of the regular expression has been replaced with the
given string. The original is left untouched. For example:
a ~ "\\bteh\\b" "the" // returns the result of replacing all occurrences
// of the word "teh" in 'a' with "the", no
// assignment, just return the resulting string
Last of all we have ‘~=’ to perform a regular expression replacement and
assign the result back to a variable. For example:
a ~= "\\bteh\\b" "the" // replace the word "teh" with "the" in 'a' and
// assign resulting string back to 'a'
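For readers more at home outside Mere, the three operators can be roughly sketched in Go using the standard regexp package. This is only an illustration of the semantics described above, not Mere's actual implementation; the helper names match and replace are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// match is a hypothetical stand-in for Mere's '~~': it reports whether
// text contains a match for pattern anywhere.
func match(text, pattern string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

// replace is a hypothetical stand-in for Mere's '~': it returns a copy of
// text with every match of pattern replaced by repl. text is not modified.
func replace(text, pattern, repl string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(text, repl)
}

func main() {
	a := "teh cat sat on teh mat"

	fmt.Println(match(a, `\bthe\b`))          // a ~~ "\\bthe\\b"  prints: false
	fmt.Println(replace(a, `\bteh\b`, "the")) // a ~ "\\bteh\\b" "the"
	fmt.Println(a)                            // 'a' is unchanged by '~'

	a = replace(a, `\bteh\b`, "the") // a ~= "\\bteh\\b" "the"
	fmt.Println(a)                   // prints: the cat sat on the mat
}
```

Note that '~=' is just '~' followed by an assignment back to the same variable, in the same way that '+=' is an add followed by an assignment.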
I’m also thinking I should have a way to identify regular expression strings.
In Perl you would use ‘/’ as delimiters[1]. The only spare character I have is
‘@’, which would make code look ugly. I might be able to dual-use ‘\’ instead?
$foo =~ m/abc/ # Perl: does $foo contain "abc"?
foo ~~ "abc" // Mere: currently the same thing
foo ~~ \abc\ // Possible change…
Why would I want to do that? If you know which strings are regular expressions
you can pre-compile them once and reuse the compiled form, which improves
performance. For example, assume I have a list of UK telephone numbers I want
to (very badly) validate. I could write something like:
>cat validate.mr
range ; v; []string(
"+44(0)20 7946 1234",
"+44 20 7946 1234",
"020 7946 1234",
"020-7946-1234",
"02079461234",
"tel: 02079461234",
"02079461234, ext 2",
)
clean = v ~ "\\(0\\)|[ ()-]" ""
printf "%5t - %s (%s)\n", clean ~~ "^(\\+44|0)\\d{10}$", v, clean
next
>mere validate.mr
true - +44(0)20 7946 1234 (+442079461234)
true - +44 20 7946 1234 (+442079461234)
true - 020 7946 1234 (02079461234)
true - 020-7946-1234 (02079461234)
true - 02079461234 (02079461234)
false - tel: 02079461234 (tel:02079461234)
false - 02079461234, ext 2 (02079461234,ext2)
>
However, there are two regular expressions in the loop: “\\(0\\)|[ ()-]” and
“^(\\+44|0)\\d{10}$”. The current implementation compiles both of them again
on every iteration. Knowing a string is a regular expression also lets you
handle the string specially in other ways. For example, to allow the regular
expression to be annotated:
validatePhone = \
^ // match start of string
(\\+44|0) // begins with +44 or 0
\\d{10} // followed by 10 digits
$ // match end of string
\
This is very important for regular expressions, as complex ones tend to look a
lot like line noise :|
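To make the pre-compilation and annotation ideas a bit more concrete, here is a rough sketch of the phone number validation in Go. Both expressions are compiled once, outside the loop, and the validation pattern is written in annotated form. Go's regexp package has no built-in free-spacing mode, so the sketch uses a made-up helper, compact, to strip the comments and whitespace first. The names compact, strip, validPhone and validate are all assumptions for this example and say nothing about how Mere does or will do it.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compact is a hypothetical helper: it removes '//' comments and all
// whitespace from an annotated pattern, leaving a plain regular expression.
// It is deliberately naive: a pattern needing a literal space or a literal
// "//" would be mangled.
func compact(annotated string) string {
	var b strings.Builder
	for _, line := range strings.Split(annotated, "\n") {
		if i := strings.Index(line, "//"); i >= 0 {
			line = line[:i] // drop the trailing comment
		}
		b.WriteString(strings.Join(strings.Fields(line), ""))
	}
	return b.String()
}

// Both expressions are compiled once at start-up, not on every iteration.
var (
	strip = regexp.MustCompile(`\(0\)|[ ()-]`)

	validPhone = regexp.MustCompile(compact(`
		^          // match start of string
		(\+44|0)   // begins with +44 or 0
		\d{10}     // followed by 10 digits
		$          // match end of string
	`))
)

// validate cleans up a number and reports whether the cleaned form matches.
func validate(v string) (clean string, ok bool) {
	clean = strip.ReplaceAllString(v, "")
	return clean, validPhone.MatchString(clean)
}

func main() {
	for _, v := range []string{
		"+44(0)20 7946 1234",
		"020-7946-1234",
		"tel: 02079461234",
	} {
		clean, ok := validate(v)
		fmt.Printf("%5t - %s (%s)\n", ok, v, clean)
	}
}
```

The loop body now only reuses the two compiled expressions, which is the saving that knowing a string is a regular expression would make possible.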
I still need to experiment and make sure the addition of regular expressions
makes sense and fits with the rest of the language. However, I’m quite excited
for this feature :)
Love it? Hate it? Let me know your thoughts: diddymus@wolfmud.org
--
Diddymus
[1] Other delimiters are available, but ‘/’ is commonly used.