Up to Main Index                              Up to Journal for June, 2017

                      JOURNAL FOR MONDAY 19TH JUNE, 2017
______________________________________________________________________________

SUBJECT: An update on data races
   DATE: Mon 19 Jun 22:57:38 BST 2017

This weekend has been hot, over 30°C. This weekend has also seen me trying to
debug some data races in WolfMUD. This requires running WolfMUD for hours on
end with tens of thousands of players. As a result my tiny study has been
extremely hot and noisy what with the cooling fans in my machines ramping up
their speed and dumping the heat into the study.

I did start testing on my dinky desktop machine. A few times it became
non-responsive and locked up completely due to running out of RAM and starting
to swap heavily to disk. This machine only has a dual core CPU and 4GB RAM in
it, so I ended up running tests on another machine with 8 cores and 16GB RAM.
Running the race detector does have a large RAM overhead, especially when you
have to increase the history size (the history_size option in GORACE) to avoid
“failed to restore the stack” errors in the race reports[1].

Investigating and debugging the data races has left me a little mystified. The
errors I have found are indeed actual errors and have been present for some
time now. However the race detector has been silent until recently. Maybe the
race detector has become smarter?

So what is the issue in WolfMUD that is causing a data race? In the cmd
package there is a state type. The state type coordinates the parsing and
processing of commands. It also handles locking via the state.sync method and
the BRL (big room lock) found in attr/internal.

The data race arises because the state.newState method has this line in it:


  s.where = attr.FindLocate(t).Where()


We are accessing a Thing’s attributes to find a Where attribute to find out
where the current actor/command issuer is. However this is done before calling
the sync method and so before our locking has been set up.
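
To make the ordering problem concrete, here is a minimal sketch. The names
used (room, locate, newStateUnsafe, move) are stand-ins invented for
illustration and are not WolfMUD's actual code. Only the shape of the problem
is meant to match: a read that happens before any lock is taken, next to a
writer that does take a lock.

```go
package main

import (
	"fmt"
	"sync"
)

// room stands in for a location and brl plays the role of a room lock.
// All names in this sketch are hypothetical, not WolfMUD's.
type room struct {
	brl  sync.Mutex
	name string
}

// locate records where a thing currently is.
type locate struct {
	where *room
}

// newStateUnsafe mirrors the problem: where is read before any lock is held,
// so a concurrent move could be writing it at the same moment.
func newStateUnsafe(l *locate) *room {
	return l.where // unsynchronised read
}

// move relocates the thing while holding the destination's lock. The lock
// only helps if every reader of where takes a lock as well.
func move(l *locate, to *room) {
	to.brl.Lock()
	defer to.brl.Unlock()
	l.where = to // synchronised write
}

func main() {
	a, b := &room{name: "A"}, &room{name: "B"}
	l := &locate{where: a}

	// Called one after the other like this, the read and write are harmless.
	// If move ran in its own goroutine, a build with -race could flag the
	// pair as a data race.
	fmt.Println(newStateUnsafe(l).name)
	move(l, b)
	fmt.Println(newStateUnsafe(l).name)
}
```

The sketch deliberately keeps the two calls sequential so that it runs
cleanly. The point is only that the reader never touches the lock the writer
relies on.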

Fixing the data race requires a number of changes. The Locate type,
specifically its ‘where’ and ‘origin’ fields, needs to be accessible in a
concurrently safe way. The FindLocate method iterates over a Thing’s map of
attributes. This access also needs to be made concurrently safe and it looks
like I’m going to have to rework all of the current finders.

When I say access should be concurrently safe I mean access outside of the
protection of the state.sync method, which would normally handle all of the
locking and synchronisation.
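
As a rough sketch of what per-instance locking might look like, the code
below gives a simplified Thing and Locate their own sync.RWMutex. These are
cut-down stand-ins and not WolfMUD's real types: the Attribute interface, the
string-keyed map, the mu fields and the SetWhere method are all assumptions
made for the example. Only the names Thing, Locate, FindLocate, Where, where
and origin come from the text above.

```go
package main

import (
	"fmt"
	"sync"
)

// Attribute and Thing are simplified stand-ins for illustration only.
type Attribute interface{}

// Thing guards its map of attributes with its own lock (an assumption).
type Thing struct {
	mu    sync.RWMutex
	attrs map[string]Attribute
}

// Locate guards its where and origin fields with its own lock (an assumption).
type Locate struct {
	mu     sync.RWMutex
	where  *Thing
	origin *Thing
}

// Where returns the current location. It is safe to call on a nil *Locate so
// that FindLocate(t).Where() can still be chained when nothing is found.
func (l *Locate) Where() *Thing {
	if l == nil {
		return nil
	}
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.where
}

// SetWhere is a hypothetical setter that updates where under the write lock.
func (l *Locate) SetWhere(t *Thing) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.where = t
}

// FindLocate iterates over the Thing's attributes under a read lock and
// returns the first *Locate found, or nil if there is none.
func FindLocate(t *Thing) *Locate {
	t.mu.RLock()
	defer t.mu.RUnlock()
	for _, a := range t.attrs {
		if l, ok := a.(*Locate); ok {
			return l
		}
	}
	return nil
}

func main() {
	room, other := &Thing{}, &Thing{}
	player := &Thing{attrs: map[string]Attribute{
		"locate": &Locate{where: room, origin: room},
	}}

	// Several readers and one writer, none of them holding any outer lock.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = FindLocate(player).Where()
		}()
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		FindLocate(player).SetWhere(other)
	}()
	wg.Wait()

	fmt.Println(FindLocate(player).Where() == other)
}
```

The cost is visible straight away: two locks are taken for a single lookup
and every instance carries its own mutex.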

I’m none too happy about adding lots of locks as it will mean every Locate
instance will have a sync.Mutex and every Thing will have a sync.Mutex as
well. Instead of having a lock for the Thing type and the BRL for the Locate
type I’m wondering if I should just promote the BRL to the Thing type instead?

It’s now nearly 11 pm and the temperature in my study is still 32°C :P

--
Diddymus

  [1] See: https://golang.org/doc/articles/race_detector.html#Options

