Up to Main Index                          Up to Journal for February, 2021

                    JOURNAL FOR SUNDAY 7TH FEBRUARY, 2021
______________________________________________________________________________

SUBJECT: Time for a little chat
   DATE: Sun  7 Feb 20:11:44 GMT 2021

With my last entry I unleashed the botrunner. This simulates players logging
into the server, running around, screaming, shouting and generally behaving
badly. This has turned out to be a good thing: it broke the server, and that
is a very good thing. Why is it good to break the server? Now I know about
the issue and can work on fixing it.

This is what I have been working on over the last week in the evenings.

The problem is an obvious design flaw in the network handling code found in
the comms package. In a nutshell, a client that is slow in handling received
data from the server can stall the main critical code path. The main critical
code path for WolfMUD is:


                         .--relock--.
                         V          |
  client              acquire    handle   .-> respond to client  -.   release
  data in -> parse -> locks   -> command -+-> notify participant -+-> locks -.
    ^                                     '-> notify observers   -'          |
    |                                                                        |
    '------------------------------------------------------------------------'


If a client is slow in receiving data it means that ‘respond to client’,
‘notify participant’ and ‘notify observers’ take longer to complete. This in
turn increases the time it takes to get from ‘acquire locks’ to ‘release
locks’. It also increases the time taken to ‘relock’. This is very bad because
it means we are holding the locks longer and preventing other goroutines from
acquiring them.
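
To make the flaw concrete, here is a rough sketch of the pattern described
above. It is not the actual WolfMUD code; the type and names are made up for
illustration. The point is that the network writes happen while the locks are
still held, so one slow client delays everyone else waiting on those locks:


  package sketch

  import (
    "net"
    "sync"
  )

  // Hypothetical, simplified state for illustration only.
  type state struct {
    mu           sync.Mutex // stand-in for the location locks
    conn         net.Conn   // this player's network connection
    participants []net.Conn // other players at the location
  }

  func (s *state) handleCommand(cmd string) {
    s.mu.Lock()
    defer s.mu.Unlock()

    msg := []byte("You " + cmd + "...\n") // handle command, build response

    // Write blocks until the client's kernel buffers have room. While it
    // blocks the locks are still held, stalling everyone else.
    s.conn.Write(msg) // respond to client
    for _, p := range s.participants {
      p.Write(msg) // notify participants and observers
    }
  }
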

Playing with the size of the kernel network buffers, as shown in my previous
post, can only alleviate the situation to a certain extent. Once network data
backs up and goroutines start stalling, a chain reaction sets in: soon
everything grinds to a halt and network connections start to time out and
drop.

The WolfMUD server already has protections against badly behaving clients. If
a client stalls and stops responding, or drops its connection to the server,
it’s not a problem. However, the issue here is a slow client that is still
alive and responding to the server.

The solution I’ve been looking at involves using channels as a buffer between
the server game code layer and the networking layer. Here is a Write function
that puts messages onto a channel:


  // Write queues data to be sent to the client. If the client's output
  // channel is full the data is dropped instead of blocking the caller.
  func (c client) Write(data []byte) {
    select {
    case c.output <- data:
    default:
      fmt.Printf("Dropping data")
    }
  }


Under normal circumstances another goroutine would be reading from the channel
and sending the messages to the network. The constant writing and reading of
the channel means that messages should not be dropped. If the sending of
messages slows down then the channel’s buffer will start to fill. The larger
the channel’s capacity the more tolerant we are of brief connection slowdowns,
but at the cost of using more memory. If enough messages are delayed, causing
the channel’s buffer to fill to capacity, then the default clause in the
select will cause additional messages to be dropped. If we start dropping too
many messages due to a slow client we can drop the connection.
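
For completeness, here is roughly what that reading goroutine could look
like. It is only a sketch under my current assumptions, not the finished
code: a sender goroutine drains the channel and does the blocking network
writes, and the channel capacity of 100 is a made-up value balancing memory
use against tolerance for slow clients:


  package sketch

  import (
    "log"
    "net"
  )

  // Hypothetical client for illustration: output is the buffered channel the
  // Write method above sends on, conn is the client's network connection.
  type client struct {
    output chan []byte
    conn   net.Conn
  }

  // newClient wires a connection to a freshly started sender goroutine. The
  // channel capacity of 100 is just an example value.
  func newClient(conn net.Conn) *client {
    c := &client{output: make(chan []byte, 100), conn: conn}
    go c.sender()
    return c
  }

  // sender drains the output channel and writes each message to the network.
  // A slow client now only fills its own channel buffer instead of stalling
  // the goroutine holding the game locks. Closing output ends the goroutine.
  func (c *client) sender() {
    for data := range c.output {
      if _, err := c.conn.Write(data); err != nil {
        log.Printf("write failed, dropping connection: %s", err)
        c.conn.Close()
        return
      }
    }
  }
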

To explore this further I’ve been writing a simple chat server and another
botrunner that just has bots sending messages to each other. If people are
interested I’ll release the source code once I’ve finished experimenting.

The biggest issue I have found with the chat server so far has been goroutine
scheduling. Running with 100,000 bots tends to end in one of three ways:


  1. No messages are dropped.
  2. A few messages for each of a few bots will be dropped.
  3. Nearly all messages for a single bot only will be dropped.


Outcome 3 seems to indicate that the Go runtime’s scheduling can starve
goroutines of CPU time. I hope I just have a tight loop or something; I really
don’t want to start writing my own scheduler :(
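
For what it’s worth, one form such a tight loop could take, reusing the
hypothetical client type from the sketch above, is a sender that polls the
channel with a non-blocking select instead of blocking on it. The spinning
version never yields while the channel is empty, so it hogs a CPU; ranging
over the channel parks the goroutine and lets the scheduler run something
else. This is only a guess at the cause, not a diagnosis:


  // busySender never blocks: when the channel is empty the default case runs
  // and the loop spins flat out, hogging a CPU.
  func (c *client) busySender() {
    for {
      select {
      case data := <-c.output:
        c.conn.Write(data)
      default:
        // channel empty, try again immediately
      }
    }
  }

  // blockingSender parks on the receive when the channel is empty, giving
  // the Go scheduler a chance to run other goroutines.
  func (c *client) blockingSender() {
    for data := range c.output {
      c.conn.Write(data)
    }
  }
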

Unfortunately, sorting all of this out is going to take time and delay the
next release. However, I think the time taken is going to be worth it in the
long run, even if you don’t find 100,000+ players for your server :P

--
Diddymus


  Up to Main Index                          Up to Journal for February, 2021