Sleep on it

I was pretty annoyed. I had installed a program into a docker container and had invoked this container via our cloud computing platform and it was garbling my output files – with the exact same inputs and parameters I had just tested it with on my local machine.

At first, I didn’t realize anything was wrong. The program completed without complaints, the log file indicated a certain size of result which corresponded to the result I obtained locally. However, when I fed the output file into another tool it failed. The tool had failed with an esoteric error message which is very non-informative (yay bioinformatics tools), but after seeing this message several times I’ve come to understand that it usually means the input is truncated.

Sure enough, when I downloaded the file from the platform onto my local machine and gunzipped it gunzip ran for a while before complaining of a truncated file.

Now, I had just come out of dealing with an annoying issue related to running docker containers on our platform. When the containers ran, they ran using a non-interactive non-login shell. This leads to subtle differences in how many things work, which will take up a different post, but basically I was in a foul mood already.

The code I actually wanted to run looked like this

my-program \
  input-file \
  >(gzip > outputfile1) \
  >(gzip > outputfile2)

It turns out that the non-interactive shell on the ubuntu base image I used for my tool image is dash, rather than bash, and dash doesn’t like this process substitution thing and complains sh: 1: Syntax error: "(" unexpected

To get around this expediently I wrapped my program in a bash script my-script.sh and invoked it on the platform as bash my-script.sh. This worked, but now my file was getting truncated.

Same code, same data, same docker image. Container executes correctly on my mac, corrupts my file on the AWS linux machine. I cross checked that I was closing my file in my code, I read up a bit about flushing (close does it) and fsync (named pipes don’t have it – they don’t touch disk) and the fact that Python will raise an exception if something goes wrong with the write.

After some more thrashing about like this I suddenly wondered about the multiple processes involved here. By doing process substitution, I had created two new processes – both involving gzip – that were sucking up data from pipes linking to the process my main program was running in. It was quite possible that the two gzip processes were finishing slightly after my main process was done. Now suppose, just suppose, the original bash shell that I was invoking to run my program was terminating as soon as my main program was done without waiting for the gzip processes? This would lead to this kind of file truncation.

I put in a wait in the shell script, but this did not make any difference. I wasn’t explicitly sending any processes to the background, so this was perhaps expected. Then I added a sleep 2s command to the script:

my-program \
  input-file \
  >(gzip > outputfile1) \
  >(gzip > outputfile2)

sleep 2s # <--- BINGO!

and indeed my files were no longer being truncated. So, what was happening was that the parent shell had no way of ‘realizing’ that the two gzip subprocesses were running and was exiting and the docker container was shutting down, killing my gzips just before they were done writing the last bit of data.

Ah, the joys. The joys.

Random numbers in a parallel world

TL;DR: It’s always a great idea to explicitly initialize your random number generator especially in a parallel computing environment.

Random number generation in computing commonly uses pseudorandom number generators. These are iterated functions that take an initial number (a seed) and spit out a new number each time you call them. Each number is part of a sequence that is completely defined by the seed. The sequence of numbers itself satisfies all the important properties of a random number sequence.

When you do a computer “experiment” you can explicitly initialize your random number generator by feeding it a seed. You can also let the algorithm pick out a seed from a stream of random numbers being constantly spit out by the computer. If you don’t care about repeatability and just want good quality randomness, the default in modern packages – which is to pick up a seed from the computer – works fine.

Python numpy’s random number generator uses the Mersenne Twister which, I am assured, is good enough for government work. By default the twister is initialized using /dev/urandom or the clock, which should be, also, good enough. Normally it is, but when we move into the weird whacky twilight world of parallel computing strange things can happen…

Say we have an embarrassingly parallel job where we simply start up a bunch of nodes and do some computations. What happens if we ask for a stream of 5 random numbers in each child process?

Say we ask for five numbers each:

[-1.83711524 -0.57025788 -0.47002959  1.13873683 -1.20119582]
[ 1.24164455 -1.05592017 -2.04148199 -1.10787051  0.0561248 ] <--- B
[-0.95616591 -0.17653271 -1.81273353 -0.39608772 -0.08856469] <--- C
[ 1.23821309  0.44823325 -2.09541964 -1.7484197  -0.46045061] <--- D
[ 0.21477345 -1.30221712 -0.53970868 -0.00636153  0.85769071] <--- E
[-1.1929044   0.29644317  0.64373556 -1.38675638 -0.42406317]
[ 0.84775676  0.69999582  0.36933264  1.79700735  0.34603859]
[ 0.12984335  0.26137673 -0.32926777  0.06448171  0.33496994]
[ 0.10993647  0.6695855  -1.32065528  0.93429973  0.32059549]
[-1.83711524 -0.57025788 -0.47002959  1.13873683 -1.20119582]
[ 0.81037046  0.056652    0.54062643  1.18642807 -0.16265761]
[ 1.06970208  0.13867405  1.2972758  -1.02361929  0.08262749]
[ 1.24164455 -1.05592017 -2.04148199 -1.10787051  0.0561248 ] <--- B
[-0.95616591 -0.17653271 -1.81273353 -0.39608772 -0.08856469] <--- C
[ 0.35640486 -0.36840793 -0.84085094  1.82913844 -0.09785973]
[ 1.03816641 -0.26439036 -1.73092253 -1.05740638  1.40258063]
[ 0.69154406 -0.11333892 -0.55908131 -0.80996816  1.33116669]
[-0.42959555  1.31447804 -0.5917133  -0.26106957 -1.42283209]
[ 1.23821309  0.44823325 -2.09541964 -1.7484197  -0.46045061] <--- D
[ 0.21477345 -1.30221712 -0.53970868 -0.00636153  0.85769071] <--- E

It LOOKS OK, but on closer inspection our hair stands on end. There are sequence repeats (I’ve pointed them out). That should scare you. Want to see something even more scary?

How about another run that takes slightly longer for the computer to run? (All I did was change how many random numbers I asked for, see code below):

[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[ 1.20134648  0.99562444  0.39421463 -0.38797808  1.33273899 -0.44182553]

Oh, this is enough to keep us awake at night, huh?

We are not in control of when the child process reads from /dev/urandom or when it reads the system clock. If the process lands on the same machine it can read the same seed from the /dev/urandom or from the system clock, or not. Don’t take the risk, explicitly initialize the generator. An easy way is to pass a unique identifier to the child which can then form the seed.

Code follows:

import multiprocessing as mp, numpy

def child(n):
  numpy.random.seed(n) #<-- comment this out to get the fright of your life.
  m = numpy.random.randn(6)
  return m

N = 20
pool = mp.Pool()
results = pool.map(child, range(N))
for res in results: print res