JOURNAL FOR SUNDAY 7TH FEBRUARY, 2021
______________________________________________________________________________

SUBJECT: Time for a little chat
DATE: Sun 7 Feb 20:11:44 GMT 2021

With my last entry I unleashed the botrunner. This simulates players logging
into the server, running around, screaming, shouting and generally behaving
badly. This has turned out to be a good thing: it broke the server, which is a
very good thing. Why is it good to break the server? Because now I know about
the issue and can work on fixing it. That is what I have been doing in the
evenings over the last week.

The problem is an obvious design flaw in the network handling code found in
the comms package. In a nutshell, a client that is slow in handling data
received from the server can stall the main, critical code path. The main
critical code path for WolfMUD is:

                       .--relock--.
                       V          |
  client             acquire   handle   .-> respond to client  -.  release
  data in -> parse -> locks -> command -+-> notify participant -+-> locks -.
     ^                                  '-> notify observers   -'          |
     |                                                                     |
     '---------------------------------------------------------------------'

If a client is slow in receiving data it means that ‘respond to client’,
‘notify participant’ and ‘notify observers’ take longer to complete. This in
turn increases the time it takes to get from ‘acquire locks’ to ‘release
locks’. It also increases the time taken to ‘relock’. This is very bad
because it means we are holding the locks for longer, preventing other
goroutines from acquiring them.

Playing with the size of the kernel network buffers, as shown in my previous
post, can only alleviate the situation to a certain extent. Once network data
backs up and goroutines start stalling, it sets off a chain reaction: soon
everything grinds to a halt and network connections start to time out and
drop.

The WolfMUD server already has protections against badly behaved clients.
If a client stalls and stops responding, or drops its connection to the
server, that’s not a problem. The issue here is a slow client that is still
alive and responding to the server.

The solution I’ve been looking at involves using channels as a buffer
between the game code layer and the networking layer. Here is a Write
function that puts messages onto a channel:

  func (c client) Write(data []byte) {
    select {
    case c.output <- data:
    default:
      fmt.Printf("Dropping data")
    }
  }

Under normal circumstances another goroutine would be reading from the
channel and sending the messages over the network. The constant writing and
reading of the channel means that messages should not be dropped. If the
sending of messages slows down then the channel’s buffer will start to fill.
The larger the channel’s capacity, the more tolerant we are of brief
connection slowdowns, but at the cost of using more memory. If enough
messages are delayed, causing the channel’s buffer to fill to capacity, then
the default clause in the select will cause additional messages to be
dropped. If we start dropping too many messages due to a slow client we can
drop the connection.

To explore this further I’ve been writing a simple chat server and another
botrunner that just has bots sending messages to each other. If people are
interested I’ll release the source code once I’ve finished experimenting.

The biggest issue I have found with the chat server so far has been
goroutine scheduling. Running with 100,000 bots tends to end in one of three
ways:

  1. No messages are dropped.
  2. A few messages for each of a few bots will be dropped.
  3. Nearly all messages for a single bot will be dropped.

Outcome 3 seems to indicate that the Go runtime’s scheduling can starve
goroutines of CPU time. I hope I just have a tight loop or something, I
really don’t want to start writing my own scheduler :(

Unfortunately, sorting all of this out is going to take time and delay the
next release.
However, I think the time taken is going to be worth it in the long run,
even if you don’t find 100,000+ players for your server :P

--
Diddymus