Up to Main Index Up to Journal for August, 2020 JOURNAL FOR WEDNESDAY 19TH AUGUST, 2020 ______________________________________________________________________________ SUBJECT: Data race fixes and a more stable server DATE: Wed 19 Aug 19:44:21 BST 2020 Another long week of debugging and sifting through logs and stack traces. The good news is I think all my hard work and long nights has finally paid off :) My desktop machine has been pushed hard with multiple instances of the server running[1]. One instance for developing and debugging, another instance has been left running in “turbo mode”[2] for (sometimes) days at a time. I would like use my more power efficient Raspberry Pi 4 for long test runs. To do that I’d need to be running a 64bit OS on the Raspberry Pi as the race detector for Go on Arm is only available for ARM64. At the moment the 64bit Raspberry Pi OS is still in beta — while I like to tinker I don’t want to deal with OS issues as well as WolfMUD issues right now :P There are four new fixes on the public dev branch: - An issue with the new respawn code not disposing of copied spawnable items properly causing nil pointer panics and data races due to stale Inventory references has been fixed. - The inventory.Move method can now move disabled items instead of silently failing. - The JUNK command’s lockOrigins method now correctly locks the origins of disabled items, fixing a data race due to locks being missed. - An issue in state.sync where the wrong locks could be acquired leading to a data race has been fixed. Those fixes had me pulling out hair for days, especially the last fix for the state.sync data race. Turns out that with the recent spawning changes it is possible for the actor of a command to change location without moving, well actually the container the actor is in can move to a new location. A situation where this can occur would be an item resetting in a container where the container is being carried by a player and the player moves. There were actually two issues here. First issue: the actor’s location is obtained, then the locks acquired. There was a small window between obtaining the location and acquiring the locks where the actor could be moved. The second issue was that issue one changed the Inventory hierarchy. To explain that I need to cover a little background. Some people like this kind of detail, some don’t, stop reading if you don’t :) WolfMUD uses a coarse locking scheme based on a Big Room Lock (BRL) and the Inventory hierarchy. Specifically, it is always the outermost Inventory that is locked — which is usually a location, but can be any container. Once the BRL is acquired any items in the Inventory can be manipulated, even items deeply nested inside containers. It’s a simple scheme, works well and avoids having to have a lot of locks. An alternative would be locking at the item level: moving an item from a player to a container would mean locking the player, the item, the container, other players who see the action and we have to send message to. With the BRL we lock the outermost Inventory, move the item, and then unlock the outermost Inventory. When processing commands the BRL for the actor’s outermost Inventory is acquired automatically. A command can also request additional locks using state.AddLock — such as the when moving from one location to another: the BRL for the first location is acquired automatically, we work out where the player wants to move to and request the BRL for the second location, relock on both locations, move the player, release the BRLs. Getting back to the second issue, we have an item (I) in a container (C). The container is being carried by a player (P) who is at location one (L¹). So our Inventory hierarchy looks like: L¹←P←C←I. We find out where the item (I) with the Reset event is and find the outermost Inventory, which in this case is L¹. In the small window between getting the item’s location and acquiring the BRL for L¹ the player moves to location two (L²). The item hasn’t moved, it’s still in the container (C←I) and the container is still held by the player (P←C) but the player has moved and the hierarchy Inventory is now: L²←P←C←I. The upshot of this is that the item (and container) do not realise they have been moved and the Reset event still locks the BRL for L¹ instead of L² which results in a data race — as the Reset event is now happening under L² not L¹. Yup, took me a while to work all that out and realise what had gone wrong. Now that the window of opportunity between “where am I?” and acquiring the BRL has been closed, and the check for “where am I?” revalidates the outermost Inventory and not just the immediate Inventory, the data races are fixed. The server has now been running for over 24 hours without incident, so I’m going to publish this and push out the fixes to the public dev branch. -- Diddymus [1] Nothing special or fancy, I just edit config.wrj and set each instance to listen for incoming connections on a different port. [2] See journal for Saturday 25th July, 2020 — “Update on respawn, in-flight events and duplication issues”: ../7/25.html Up to Main Index Up to Journal for August, 2020