Up to Main Index                            Up to Journal for August, 2020

                   JOURNAL FOR WEDNESDAY 19TH AUGUST, 2020
______________________________________________________________________________

SUBJECT: Data race fixes and a more stable server
   DATE: Wed 19 Aug 19:44:21 BST 2020

Another long week of debugging and sifting through logs and stack traces. The
good news is I think all my hard work and long nights has finally paid off :)

My desktop machine has been pushed hard with multiple instances of the server
running[1]. One instance for developing and debugging, another instance has
been left running in “turbo mode”[2] for (sometimes) days at a time. I would
like use my more power efficient Raspberry Pi 4 for long test runs. To do that
I’d need to be running a 64bit OS on the Raspberry Pi as the race detector for
Go on Arm is only available for ARM64. At the moment the 64bit Raspberry Pi OS
is still in beta — while I like to tinker I don’t want to deal with OS issues
as well as WolfMUD issues right now :P

There are four new fixes on the public dev branch:


  - An issue with the new respawn code not disposing of copied spawnable items
    properly causing nil pointer panics and data races due to stale Inventory
    references has been fixed.

  - The inventory.Move method can now move disabled items instead of silently
    failing.

  - The JUNK command’s lockOrigins method now correctly locks the origins of
    disabled items, fixing a data race due to locks being missed.

  - An issue in state.sync where the wrong locks could be acquired leading to
    a data race has been fixed.


Those fixes had me pulling out hair for days, especially the last fix for the
state.sync data race. Turns out that with the recent spawning changes it is
possible for the actor of a command to change location without moving, well
actually the container the actor is in can move to a new location. A situation
where this can occur would be an item resetting in a container where the
container is being carried by a player and the player moves.

There were actually two issues here. First issue: the actor’s location is
obtained, then the locks acquired. There was a small window between obtaining
the location and acquiring the locks where the actor could be moved. The
second issue was that issue one changed the Inventory hierarchy. To explain
that I need to cover a little background.

Some people like this kind of detail, some don’t, stop reading if you don’t :)

WolfMUD uses a coarse locking scheme based on a Big Room Lock (BRL) and the
Inventory hierarchy. Specifically, it is always the outermost Inventory that
is locked — which is usually a location, but can be any container. Once the
BRL is acquired any items in the Inventory can be manipulated, even items
deeply nested inside containers. It’s a simple scheme, works well and avoids
having to have a lot of locks.

An alternative would be locking at the item level: moving an item from a
player to a container would mean locking the player, the item, the container,
other players who see the action and we have to send message to.

With the BRL we lock the outermost Inventory, move the item, and then unlock
the outermost Inventory. When processing commands the BRL for the actor’s
outermost Inventory is acquired automatically. A command can also request
additional locks using state.AddLock — such as the when moving from one
location to another: the BRL for the first location is acquired automatically,
we work out where the player wants to move to and request the BRL for the
second location, relock on both locations, move the player, release the BRLs.

Getting back to the second issue, we have an item (I) in a container (C). The
container is being carried by a player (P) who is at location one (L¹). So our
Inventory hierarchy looks like: L¹←P←C←I. We find out where the item (I) with
the Reset event is and find the outermost Inventory, which in this case is L¹.
In the small window between getting the item’s location and acquiring the BRL
for L¹ the player moves to location two (L²). The item hasn’t moved, it’s
still in the container (C←I) and the container is still held by the player
(P←C) but the player has moved and the hierarchy Inventory is now: L²←P←C←I.
The upshot of this is that the item (and container) do not realise they have
been moved and the Reset event still locks the BRL for L¹ instead of L² which
results in a data race — as the Reset event is now happening under L² not L¹.

Yup, took me a while to work all that out and realise what had gone wrong.

Now that the window of opportunity between “where am I?” and acquiring the BRL
has been closed, and the check for “where am I?” revalidates the outermost
Inventory and not just the immediate Inventory, the data races are fixed.

The server has now been running for over 24 hours without incident, so I’m
going to publish this and push out the fixes to the public dev branch.

--
Diddymus

  [1] Nothing special or fancy, I just edit config.wrj and set each instance
      to listen for incoming connections on a different port.

  [2] See journal for Saturday 25th July, 2020 — “Update on respawn, in-flight
      events and duplication issues”: ../7/25.html


  Up to Main Index                            Up to Journal for August, 2020