JOURNAL FOR THURSDAY 15TH JUNE, 2017
______________________________________________________________________________

SUBJECT: Rabbit holes, Unicode normalisation and data races
   DATE: Thu 15 Jun 22:56:57 BST 2017

As so often seems to happen, I started looking into something and ended up down another rabbit hole.

Finally getting some time to work on WolfMUD I thought I’d take a really small, simple function and apply some testing to it. My thoughts were to write some really good tests and then use it as a standard for my testing efforts for the rest of the code.

The function I picked was TitleFirst in WolfMUD’s text package. TitleFirst just title cases the first rune in a string:

  func TitleFirst(s string) string {
    r := []rune(s)
    r[0] = unicode.ToTitle(r[0])
    return string(r)
  }

First bug found: it doesn’t check for zero length strings, so indexing r[0] panics on an empty string. Somewhat of a rookie error, simple enough to fix:

  func TitleFirst(s string) string {
    if s == "" {
      return s
    }
    r := []rune(s)
    r[0] = unicode.ToTitle(r[0])
    return string(r)
  }

Writing test cases for this function is where I found the rabbit hole. What happens if the string starts with a non-ASCII character? For example the rune 'ä' or the runes 'a\u0308' which also displays as 'ä'. What about ligatures such as 'fi'?

This is where I started going down the rabbit hole. I wondered if there was any way that a mix of composed and/or combining runes could produce different results. Before I knew it I was reading about Unicode normalisation (again!) and pondering whether WolfMUD should be normalising strings.

Normalisation is an issue when comparing strings. So it could be an issue if there was a player called Chloé as the é could be either a “LATIN SMALL LETTER E WITH ACUTE” or a “LATIN SMALL LETTER E” with a “COMBINING ACUTE ACCENT”. At the moment players can only be created with names that use the letters a-z and A-Z, mainly for practical reasons[1].

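To make the composed versus combining question concrete, here is a minimal sketch using only the Go standard library. It copies the fixed TitleFirst from above into a standalone program. The sample strings and the main function are my own illustration, not WolfMUD's actual tests.

```go
package main

import (
	"fmt"
	"unicode"
)

// TitleFirst mirrors the fixed function from this entry: it title cases the
// first rune of s and returns s unchanged when it is empty.
func TitleFirst(s string) string {
	if s == "" {
		return s
	}
	r := []rune(s)
	r[0] = unicode.ToTitle(r[0])
	return string(r)
}

func main() {
	// Illustrative inputs only: both display as "äbc".
	composed := "\u00e4bc"   // 'ä' as a single precomposed rune
	combining := "a\u0308bc" // 'a' followed by COMBINING DIAERESIS

	for _, s := range []string{composed, combining} {
		t := TitleFirst(s)
		fmt.Printf("%+q -> %+q (%d runes)\n", s, t, len([]rune(t)))
	}

	// Both results display as "Äbc" but are different rune sequences, so a
	// plain string comparison reports them as unequal.
	fmt.Println(TitleFirst(composed) == TitleFirst(combining)) // false

	// The same applies to the two spellings of Chloé.
	fmt.Println("Chlo\u00e9" == "Chloe\u0301") // false
}
```

TitleFirst itself copes with either form, since in both cases the first rune is the one that needs title casing. The trouble only shows up when the two forms are compared with each other, which is where normalisation would come in. Normalising in Go needs a package outside the standard library, such as golang.org/x/text/unicode/norm, so it is left out of this sketch.
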
Player names are currently the only instance in-game where we are concerned with comparing user supplied strings with other user supplied strings. The rest of the time we are comparing user supplied strings with strings from a known source, either from the code or from zone files. For the text in source files and zone files normalisation can be checked and enforced.

The predominant form of normalisation appears to be NFC (Normalization Form C, canonical composition). This is the form recommended by the W3C for text on the internet[2]. It is also the form WolfMUD already assumes text is in.

Using the uconv tool from the Debian icu-devtools package we can normalise a file to NFC on the command line:

  $ uconv -f utf8 -t utf8 -x nfc -o normal.txt file.txt

We can also check if a file is already normalised to NFC:

  $ md5sum file.txt <(uconv -f utf8 -t utf8 -x nfc file.txt)
  84fd8dc84761513ee7a290093a6acc01  file.txt
  3216f89b8b1539260756b73beabf45d2  /dev/fd/63

Here we are computing the MD5 for the file before and after normalisation to NFC. If the checksums are different, as they are here, then the file is not normalised to NFC.

By keeping all of our text normalised to NFC, string comparisons should mostly just work.

Does this mean WolfMUD does not need to normalise text? Normalisation should probably be added at some point to enforce NFC rather than relying on convention only. At the moment the situation is good enough; I currently have bigger problems to worry about…

At some point when introducing events in WolfMUD I seem to have introduced a data race. Whereas I was putting off writing new code to concentrate on writing tests, I’m now putting off testing to debug and fix a data race :( I’m still investigating the problem but it definitely involves events and actions.

In the meantime I have pushed some updates to the public dev branch that I’ve been sitting on.
All minor tweaks, nothing really exciting:

  recordjar: Fixup comment whitespace in decoder
  recordjar: Fix broken pairs in PairList/KeyedString/KeyedStringList
  attr: Ignore incomplete pairs when unmarshaling Vetoes
  recordjar: Fix tabbing in documentation
  data,docs: Drop old Server.Debug references
  frontend: Check error from Chmod + note bug

Now it’s back to the races…

--
Diddymus

  [1] When is 'Jаne' <> 'Jane'?: ../../2016/10/16.html

  [2] W3C: Normalization in HTML and CSS
      https://www.w3.org/International/questions/qa-html-css-normalization