
                  JOURNAL FOR WEDNESDAY 15TH SEPTEMBER, 2021
______________________________________________________________________________

SUBJECT: Text folding now enabled for text sent to players
   DATE: Wed 15 Sep 21:07:51 BST 2021

Plain text. Simple, reliable, portable, hrm… portable apart from various line
endings, tabs, code pages and control codes. What about Unicode? Is something
written using Unicode plain text? Some people say yes, some say no…

I think Unicode is not strictly plain text because it requires processing.
Take UTF-8, it is an encoding and needs to be decoded into codepoints. Even if
you start with just unencoded codepoints they require additional processing.

Take the grapheme ‘é’. In UTF-8 it could be encoded as the bytes ‘0xC3 0xA9’
or ‘0x65 0xCC 0x81’. Decoded, those bytes become the codepoints ‘U+00E9’ or
‘U+0065 U+0301’ respectively. The second form may then be normalised to
‘U+00E9’.
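
As a rough illustration of the two encodings, here is a minimal Go sketch using
only the standard library. The helper name describe is made up for this
example. Normalisation is left out because it needs a package outside the
standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (a hypothetical helper) reports the UTF-8 bytes of s, the
// codepoints they decode to, and how many of each there are.
func describe(s string) string {
	return fmt.Sprintf("% X -> %U (%d bytes, %d codepoints)",
		[]byte(s), []rune(s), len(s), utf8.RuneCountInString(s))
}

func main() {
	precomposed := "\u00e9" // a single codepoint
	decomposed := "e\u0301" // 'e' followed by a combining acute accent

	fmt.Println(describe(precomposed))
	fmt.Println(describe(decomposed))

	// Both should display as the same grapheme, yet as Go strings they
	// are different byte sequences and do not compare equal.
	fmt.Println(precomposed == decomposed)
}
```

Both strings display as one visible character, but their byte counts and
codepoint counts differ, which is exactly what makes measuring text awkward.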

Depending on your operating system, locale settings and how tarnished your
luck currently is, displaying a file containing the following:


  It's not like Zoë is going to the café, which has a very nice façade, for
  some crème brûlée with her doppelgänger!


could result in any of:


  It's not like Zoe<CC><88> is going to the cafe<CC><81>, which has a very
  nice fac<CC><A7>ade, for some cre<CC><80>me bru<CC><82>le<CC><81>e with her
  doppelga<CC><88>nger!

  It's not like ZoeÌ is going to the cafeÌ, which has a very nice façade,
  for some creÌme bruÌleÌe with her doppelgaÌnger!

  It's not like Zoe ~L is going to the cafe ~L, which has a very nice fac ~L
  ade, for some cre ~Lme bru ~Lle ~Le with her doppelga ~Lnger!

  It's not like Zo? is going to the caf?, which has a very nice fa?ade, for
  some cr?me br?l?e with her doppelg?nger!

  It's not like Zoe�. is going to the cafe�., which has a very nice façade,
  for some cre�.me bru�.le�.e with her doppelga�.nger!


Why do I bring this up? It sort of explains the rabbit hole I’ve been down for
the last week or so. WolfMUD needs to fold — or wrap — text so that paragraphs
fit nicely on a player’s screen. At the moment a width of 80 characters is
assumed but it will be variable in the future — set by the player.

Folding/wrapping ASCII is easy. Split text into words on an ASCII space ‘0x20’
and add words together until the next word length would exceed the line length
at which point you insert ‘\r\n’ to start the next line and continue adding
more words. The length of a word is the number of bytes, one byte per visible
character.
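
The steps above can be sketched in Go roughly as follows. This is only an
illustration of the ASCII-only approach, not WolfMUD's Fold. The name
foldASCII is an assumption, and a word longer than the width is simply left
on a line of its own.

```go
package main

import (
	"fmt"
	"strings"
)

// foldASCII (a hypothetical name) wraps ASCII text so that no line exceeds
// width bytes, unless a single word is itself longer than width. Lines are
// separated with "\r\n".
func foldASCII(text string, width int) string {
	var b strings.Builder
	lineLen := 0 // bytes written to the current line so far

	for _, word := range strings.Split(text, " ") {
		if word == "" {
			continue // empty strings come from runs of spaces
		}
		switch {
		case lineLen == 0:
			// first word on the line, no separator needed
		case lineLen+1+len(word) > width:
			b.WriteString("\r\n") // word will not fit, start a new line
			lineLen = 0
		default:
			b.WriteByte(' ')
			lineLen++
		}
		b.WriteString(word)
		lineLen += len(word) // one byte per visible character in ASCII
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", foldASCII("the quick brown fox jumps over the lazy dog", 10))
}
```

Using len(word) as the width is only valid because every ASCII character is
one byte and one column wide.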

WolfMUD will let you use Unicode. You can write zone files in any language you
want. Although, at the moment, you would need to translate any hard-coded
message text. I’m also not sure how well placeholders for substitutions would
work in languages other than English. “You put %X into %Y” for example.

This means that each and every message sent to a client is individually
folded/wrapped. Every message therefore needs to be converted to runes,
processed, converted to bytes and sent to the client. Converting bytes or a
string to runes takes time, as does converting back to bytes. The processing
takes time. Working out the length of a word is exceptionally tedious: until
you process the stream you have unknown bytes per codepoint and unknown
codepoints per grapheme (some are non-spacing, or combining, or zero width…)
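
To give a feel for the problem, here is a simplified Go sketch that estimates
the on-screen width of a word. It is an assumption-laden approximation, not
the method WolfMUD uses: the name wordWidth is invented, only the Unicode
categories Mn, Me and Cf are treated as zero width, and East Asian wide
characters, which usually take two columns, are counted as one.

```go
package main

import (
	"fmt"
	"unicode"
)

// wordWidth (a hypothetical helper) estimates how many screen columns a word
// occupies by counting codepoints and skipping those assumed not to advance
// the cursor.
func wordWidth(word string) int {
	width := 0
	for _, r := range word { // range decodes UTF-8 one codepoint at a time
		if unicode.In(r, unicode.Mn, unicode.Me, unicode.Cf) {
			continue // non-spacing marks, enclosing marks, format characters
		}
		width++
	}
	return width
}

func main() {
	for _, w := range []string{"cafe", "caf\u00e9", "cafe\u0301"} {
		fmt.Printf("%q: %d bytes, width %d\n", w, len(w), wordWidth(w))
	}
}
```

All three words come out four columns wide even though their byte lengths
differ, so byte length alone cannot drive the folding.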

Unicode can be quite a headache. Even with the built-in support Go provides. I
tried to relieve my headache with a dose of 3rd party libraries. For what I
needed they were bloated and/or slow :(

WolfMUD has a Fold function in text/fold.go which is quite good. It also
handles ANSI escape sequences for colours, ‘␠’[1] U+2420 for hard spaces and a
few other bits. Fold is already quite fast. However, a lot of effort and work
has been put into the experiment to make it fast. Using the current Fold
function would have just slowed things down again :( I wanted a faster Fold
method dammit! So I set out to write a better Fold, and took over a week doing
it…
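
For illustration only, this is one way the extra handling could look in Go.
It is not the code in text/fold.go. The names visibleLen and hardSpaces are
invented, only CSI style sequences (ESC, ‘[’, parameters, final byte) are
recognised, and it assumes hard spaces are swapped for real spaces after
folding so that lines never break on them.

```go
package main

import (
	"fmt"
	"strings"
)

// visibleLen (a hypothetical helper) counts the codepoints in s that would
// occupy a screen column, skipping ANSI CSI sequences such as "\x1b[31m".
func visibleLen(s string) int {
	const (
		text = iota // ordinary text
		esc         // just seen ESC
		csi         // inside ESC [ ..., waiting for the final byte
	)
	n, state := 0, text
	for _, r := range s {
		switch state {
		case esc:
			if r == '[' {
				state = csi
			} else {
				state = text // not CSI: ESC and this codepoint are dropped
			}
		case csi:
			if r >= '@' && r <= '~' {
				state = text // final byte ends the sequence
			}
		default:
			if r == '\x1b' {
				state = esc
			} else {
				n++
			}
		}
	}
	return n
}

// hardSpaces (a hypothetical helper) replaces the U+2420 placeholder with an
// ordinary space. Assumed to run after folding.
func hardSpaces(s string) string {
	return strings.ReplaceAll(s, "\u2420", " ")
}

func main() {
	s := "\x1b[31mDr\u2420Jones\x1b[0m"
	fmt.Println(visibleLen(s))
	fmt.Printf("%q\n", hardSpaces(s))
}
```

The colour codes add nine bytes to the example string but no columns, so a
folder has to measure what is seen rather than what is sent.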

The new implementation is over 165% faster ;) That works out at about 62% less
time per operation. Some results from benchstat folding different widths of
4.7k of ASCII text and 4.8k of Unicode text (most messages are a lot shorter,
usually less than 512 bytes):


  NAME                        OLD TIME/OP  NEW TIME/OP    DELTA
  FoldLipsumASCII/Width_20-4   117µs ± 6%    44µs ± 2%  -62.79%  (p=0 n=10+10)
  FoldLipsumASCII/Width_40-4   115µs ± 4%    43µs ± 2%  -62.25%  (p=0 n=10+10)
  FoldLipsumASCII/Width_80-4   116µs ± 4%    44µs ± 2%  -62.31%  (p=0 n=10+10)
  FoldLipsumASCII/Width_100-4  116µs ± 4%    43µs ± 1%  -62.64%  (p=0 n=10+10)
  FoldLipsumASCII/Width_120-4  116µs ± 3%    43µs ± 3%  -62.54%  (p=0 n=9+10)
  FoldLipsumASCII/Width_140-4  115µs ± 6%    44µs ± 2%  -61.96%  (p=0 n=10+9)
  FoldLipsumASCII/Width_160-4  116µs ± 3%    43µs ± 2%  -62.61%  (p=0 n=10+10)
  FoldLipsumUTF8/Width_20-4    122µs ± 4%    46µs ± 2%  -62.67%  (p=0 n=9+10)
  FoldLipsumUTF8/Width_40-4    121µs ± 4%    46µs ± 1%  -62.22%  (p=0 n=10+10)
  FoldLipsumUTF8/Width_80-4    123µs ± 2%    46µs ± 2%  -62.60%  (p=0 n=8+10)
  FoldLipsumUTF8/Width_100-4   123µs ± 3%    46µs ± 2%  -62.60%  (p=0 n=9+10)
  FoldLipsumUTF8/Width_120-4   124µs ± 1%    46µs ± 2%  -62.80%  (p=0 n=8+9)
  FoldLipsumUTF8/Width_140-4   125µs ± 2%    46µs ± 3%  -63.08%  (p=0 n=8+10)
  FoldLipsumUTF8/Width_160-4   122µs ± 4%    46µs ± 3%  -62.24%  (p=0 n=10+10)


The new Fold function even passes all of the current Fold tests. With a server
running 64,000 bots, folding text adds about 5-10% CPU overhead. I may have
taken a few liberties with my Unicode codepoints, UTF-8, ASCII and ANSI escape
sequence handling but it all seems to be hanging together.

Changes are out on the public experiment branch. I’ve reverted the previous
changes for the ghastly line endings hack for Windows players; the new Fold
method converts ‘\n’ to ‘\r\n’ just like the previous Fold method.
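
As a standalone illustration of that conversion, here is a small Go sketch.
It is not how Fold does it, since Fold converts as part of folding. The names
crlf and toNetwork are invented, and leaving an existing ‘\r\n’ untouched is
an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// crlf rewrites line endings. At each position the pairs are tried in
// argument order, so an existing "\r\n" is matched first and kept as it is,
// while a lone "\n" becomes "\r\n".
var crlf = strings.NewReplacer("\r\n", "\r\n", "\n", "\r\n")

// toNetwork (a hypothetical helper) returns s with every line ending as
// "\r\n".
func toNetwork(s string) string {
	return crlf.Replace(s)
}

func main() {
	fmt.Printf("%q\n", toNetwork("one\ntwo\r\nthree\n"))
}
```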

Now to add some colour and a little text formatting…

--
Diddymus

  [1] U+2420 is the “symbol for space” which may not render on this page as it
      is not in the Go Mono font this site uses, although your browser may
      substitute with another font. It should render as a superscript ‘S’ over
      a subscript ‘P’. It’s like ℅ but no slash and replace C&O with S&P :P

