Up to Main Index                              Up to Journal for June, 2017

                     JOURNAL FOR THURSDAY 15TH JUNE, 2017
______________________________________________________________________________

SUBJECT: Rabbit holes, Unicode normalisation and data races
   DATE: Thu 15 Jun 22:56:57 BST 2017

As so often seems to happen, I started looking into something and ended up
down another rabbit hole.

Finally getting some time to work on WolfMUD, I thought I’d take a really
small, simple function and apply some testing to it. My thoughts were to write
some really good tests and then use it as a standard for my testing efforts
for the rest of the code. The function I picked was TitleFirst in WolfMUD’s
text package. TitleFirst just title cases the first rune in a string:


  func TitleFirst(s string) string {
    r := []rune(s)
    r[0] = unicode.ToTitle(r[0])
    return string(r)
  }


First bug found: it doesn’t check for zero length strings, so indexing r[0]
panics when s is empty. Somewhat of a rookie error, but simple enough to fix:


  func TitleFirst(s string) string {
    if s == "" {
      return s
    }
    r := []rune(s)
    r[0] = unicode.ToTitle(r[0])
    return string(r)
  }


Writing test cases for this function is where I found the rabbit hole. What
happens if the string starts with a non-ASCII rune? For example the rune 'ä'
or the runes 'a\u0308' which also display as 'ä'. What about ligatures such
as 'ﬁ' (U+FB01)?
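
As a rough sketch of what those test inputs look like in practice, the
following runs the fixed TitleFirst over a precomposed 'ä', an 'a' followed by
a combining diaeresis, and the 'ﬁ' ligature. The sample words are made up for
illustration and are not taken from WolfMUD's tests.

```go
package main

import (
	"fmt"
	"unicode"
)

// TitleFirst is the fixed version shown above: it title cases the first rune
// of s and returns s unchanged when it is empty.
func TitleFirst(s string) string {
	if s == "" {
		return s
	}
	r := []rune(s)
	r[0] = unicode.ToTitle(r[0])
	return string(r)
}

func main() {
	// Illustrative inputs only, written with escapes so the code points
	// are unambiguous.
	inputs := []string{
		"\u00e4rger",  // precomposed LATIN SMALL LETTER A WITH DIAERESIS
		"a\u0308rger", // 'a' followed by COMBINING DIAERESIS
		"\ufb01sh",    // LATIN SMALL LIGATURE FI
	}
	for _, in := range inputs {
		// %+q prints non-ASCII runes as \u escapes.
		fmt.Printf("%+q -> %+q\n", in, TitleFirst(in))
	}
}
```

This should print:

  "\u00e4rger" -> "\u00c4rger"
  "a\u0308rger" -> "A\u0308rger"
  "\ufb01sh" -> "\ufb01sh"

Both spellings of 'ä' come out displaying as 'Ä', but as different rune
sequences. The ligature is left alone because unicode.ToTitle only applies
one-to-one case mappings and 'ﬁ' has none.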

I wondered if there was any way that a mix of composed and/or combining runes
could produce different results. Before I knew it I was reading about Unicode
normalisation (again!) and pondering whether WolfMUD should be normalising
strings.

Normalisation is an issue when comparing strings. So it could be an issue if
there was a player called Chloé as the é could be either a “LATIN SMALL LETTER
E WITH ACUTE” or a “LATIN SMALL LETTER E” with a “COMBINING ACUTE ACCENT”. At
the moment players can only be created with names that use the letters a-z and
A-Z, mainly for practical reasons[1]. Player names are currently the only
instance in-game where we are concerned with comparing user supplied strings
with other user supplied strings. The rest of the time we are comparing user
supplied strings with strings from a known source — either from the code or
from zone files. For the text in source files and zone files normalisation can
be checked and enforced.
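
To make the Chloé example concrete, here is a minimal sketch showing that the
two spellings compare as different strings in Go, and that restricting names to
a-z and A-Z sidesteps the problem. The isASCIILetters helper is hypothetical
and is not WolfMUD's actual name validation.

```go
package main

import "fmt"

// isASCIILetters reports whether name is non-empty and contains only the
// letters a-z and A-Z. Hypothetical helper for illustration only.
func isASCIILetters(name string) bool {
	if name == "" {
		return false
	}
	// Checking bytes is enough: every byte of a multi-byte UTF-8
	// sequence is >= 0x80 and so fails the letter test.
	for i := 0; i < len(name); i++ {
		c := name[i]
		if !(('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')) {
			return false
		}
	}
	return true
}

func main() {
	composed := "Chlo\u00e9"    // LATIN SMALL LETTER E WITH ACUTE
	decomposed := "Chloe\u0301" // 'e' followed by COMBINING ACUTE ACCENT

	// Same on screen, different as far as == is concerned.
	fmt.Println(composed == decomposed)

	// The rune counts differ too: 5 versus 6.
	fmt.Println(len([]rune(composed)), len([]rune(decomposed)))

	// Neither spelling passes an a-z/A-Z check, but plain "Chloe" does.
	fmt.Println(isASCIILetters(composed), isASCIILetters(decomposed))
	fmt.Println(isASCIILetters("Chloe"))
}
```

With names limited like this, two players can never end up with names that
look identical but differ only in how an accent is encoded.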

The predominant form of normalisation appears to be NFC (Normalisation Form C,
canonical composition). This is the form recommended by the W3C for text on
the internet[2]. It is also the form WolfMUD already assumes text is in.

Using the uconv tool from the Debian icu-devtools package we can normalise a
file to NFC on the command line:


  $ uconv -f utf8 -t utf8 -x nfc -o normal.txt file.txt


We can also check if a file is already normalised to NFC:


  $ md5sum file.txt <(uconv -f utf8 -t utf8 -x nfc file.txt)
  84fd8dc84761513ee7a290093a6acc01  file.txt
  3216f89b8b1539260756b73beabf45d2  /dev/fd/63


Here we are computing the MD5 for the file before and after normalisation to
NFC. If the checksums are different then the file is not normalised to NFC.
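
Inside a Go program a much cruder check is possible with just the standard
library: flag any string containing a combining mark. This is only a hint and
not a real NFC test, since some combining sequences have no precomposed form
and are perfectly valid NFC. A proper check would need something like the
golang.org/x/text/unicode/norm package (norm.NFC.IsNormalString), which is not
used here. The function name hasCombiningMark is made up for this sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasCombiningMark reports whether s contains any rune in the Unicode mark
// categories (M). It is a heuristic hint that s may not be in NFC, not a
// substitute for a real normalisation check.
func hasCombiningMark(s string) bool {
	for _, r := range s {
		if unicode.IsMark(r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasCombiningMark("Chlo\u00e9"))  // precomposed é
	fmt.Println(hasCombiningMark("Chloe\u0301")) // e + COMBINING ACUTE ACCENT
}
```

The first call reports false and the second true, so text flagged this way is
worth running through uconv as above.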

By keeping all of our text normalised to NFC we should be mostly okay and
things should work out just fine. Does this mean WolfMUD does not need to
normalise text? Normalisation should probably be added at some point to
enforce NFC rather than relying on convention only. At the moment the
situation is good enough — I currently have bigger problems to worry about…

At some point when introducing events in WolfMUD I seem to have introduced a
data race. Whereas I was putting off writing new code to concentrate on
writing tests, I’m now putting off testing to debug and fix a data race :(

I’m still investigating the problem but it definitely involves events and
actions.

In the meantime I have pushed some updates to the public dev branch that I’ve
been sitting on. All minor tweaks, nothing really exciting:


  recordjar: Fixup comment whitespace in decoder
  recordjar: Fix broken pairs in PairList/KeyedString/KeyedStringList
  attr: Ignore incomplete pairs when unmarshaling Vetoes
  recordjar: Fix tabbing in documentation
  data,docs: Drop old Server.Debug references
  frontend: Check error from Chmod + note bug


Now it’s back to the races…

--
Diddymus

  [1] When is 'Jаne' <> 'Jane'?: ../../2016/10/16.html

  [2] W3C: Normalization in HTML and CSS
      https://www.w3.org/International/questions/qa-html-css-normalization

