Random numbers in a parallel world

TL;DR: It’s always a great idea to explicitly initialize your random number generator especially in a parallel computing environment.

Random number generation in computing commonly uses pseudorandom number generators. These are iterated functions that take an initial number (a seed) and spit out a new number each time you call them. Each number is part of a sequence that is completely defined by the seed. The sequence of numbers itself satisfies all the important properties of a random number sequence.

When you do a computer “experiment” you can explicitly initialize your random number generator by feeding it a seed. You can also let the algorithm pick out a seed from a stream of random numbers being constantly spit out by the computer. If you don’t care about repeatability and just want good quality randomness, the default in modern packages – which is to pick up a seed from the computer – works fine.

Python numpy’s random number generator uses the Mersenne Twister which, I am assured, is good enough for government work. By default the twister is initialized using /dev/urandom or the clock, which should be, also, good enough. Normally it is, but when we move into the weird whacky twilight world of parallel computing strange things can happen…

Say we have an embarrassingly parallel job where we simply start up a bunch of nodes and do some computations. What happens if we ask for a stream of 5 random numbers in each child process?

Say we ask for five numbers each:

[-1.83711524 -0.57025788 -0.47002959  1.13873683 -1.20119582]
[ 1.24164455 -1.05592017 -2.04148199 -1.10787051  0.0561248 ] <--- B
[-0.95616591 -0.17653271 -1.81273353 -0.39608772 -0.08856469] <--- C
[ 1.23821309  0.44823325 -2.09541964 -1.7484197  -0.46045061] <--- D
[ 0.21477345 -1.30221712 -0.53970868 -0.00636153  0.85769071] <--- E
[-1.1929044   0.29644317  0.64373556 -1.38675638 -0.42406317]
[ 0.84775676  0.69999582  0.36933264  1.79700735  0.34603859]
[ 0.12984335  0.26137673 -0.32926777  0.06448171  0.33496994]
[ 0.10993647  0.6695855  -1.32065528  0.93429973  0.32059549]
[-1.83711524 -0.57025788 -0.47002959  1.13873683 -1.20119582]
[ 0.81037046  0.056652    0.54062643  1.18642807 -0.16265761]
[ 1.06970208  0.13867405  1.2972758  -1.02361929  0.08262749]
[ 1.24164455 -1.05592017 -2.04148199 -1.10787051  0.0561248 ] <--- B
[-0.95616591 -0.17653271 -1.81273353 -0.39608772 -0.08856469] <--- C
[ 0.35640486 -0.36840793 -0.84085094  1.82913844 -0.09785973]
[ 1.03816641 -0.26439036 -1.73092253 -1.05740638  1.40258063]
[ 0.69154406 -0.11333892 -0.55908131 -0.80996816  1.33116669]
[-0.42959555  1.31447804 -0.5917133  -0.26106957 -1.42283209]
[ 1.23821309  0.44823325 -2.09541964 -1.7484197  -0.46045061] <--- D
[ 0.21477345 -1.30221712 -0.53970868 -0.00636153  0.85769071] <--- E

It LOOKS OK, but on closer inspection our hair stands on end. There are sequence repeats (I’ve pointed them out). That should scare you. Want to see something even more scary?

How about another run that takes slightly longer for the computer to run? (All I did was change how many random numbers I asked for, see code below):

[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[ 1.20134648  0.99562444  0.39421463 -0.38797808  1.33273899 -0.44182553]

Oh, this is enough to keep us awake at night, huh?

We are not in control of when the child process reads from /dev/urandom or when it reads the system clock. If the process lands on the same machine it can read the same seed from the /dev/urandom or from the system clock, or not. Don’t take the risk, explicitly initialize the generator. An easy way is to pass a unique identifier to the child which can then form the seed.

Code follows:

import multiprocessing as mp, numpy

def child(n):
  numpy.random.seed(n) #<-- comment this out to get the fright of your life.
  m = numpy.random.randn(6)
  return m

N = 20
pool = mp.Pool()
results = pool.map(child, range(N))
for res in results: print res

Down the rabbit hole

I was putting some finalizing touches to pre-processing some data in preparation for some analysis I was raring to do. The plan was to create some pretty pictures, get some insight, get this off my desk by noon and go into the weekend with no backlog and a clear conscience. But here I am, this Friday evening, rewriting the code to work around a bug in someone else’s software. I was angry for a while but then I wasterrified.

To be honest, it’s not a sudden realization, and I suspect that all of you have had this realization. It’s just that it has to happen to you personally, to bring that extra visceral element in.

Everyone who has used Windows (and now, sadly Mac OS X to an increasing and annoying degree) knows that all software has bugs. I used to think this had primarily to do with the fact that these are graphical operating systems that have to deal with asynchronous and random input.

These are difficult problems to find test cases for and the bulk of the software is a series of checks and balances to see if the user is allowed to do what they just did given what they have been doing in the past. I wrote GUIs once and I swore I would never do it again. No matter how many checks you put in, someone somewhere is going to come in and press the right combination of keys at just the right pace to lock your software and crash it, eventually.

Widely used, well tested computing packages, on the other hand, I pretty much trusted. Yes, it is true, that there are tricky algorithms such as integration and differentiation and optimization that are hard to get very right and not all implementations are equal. Some trade accuracy for speed and so on, but I don’t expect anything to collapse catastrophically if thousands of people have been using it for years.

And yet here I was sitting at my keyboard, fuming, because a module I was using presented a strange, extremely unexpected bug. To top it off, the library doesn’t do any fancy computations, doesn’t do any graphics or any user interface stuff. All it does is take tabular data and save it to disk.

The selling point of the software is that it allows you to build a file on disk, many gigabytes in size, much, much larger than your available memory, and still process data from it seamlessly. I’ve used it for a while and it works for me.

Another great thing about the library was that it had an easy way to indicate missing data. It uses something called a ‘NaN’ which expands to Not-a-Number which is a fairly common value we put in our data to say “hey, don’t take this value into account when you do some computation, like summing or multiplying this table of numbers, it’s not really there.”

So, I had this large table full of numbers with a few missing data points, which I had filled with NaNs. Say the rows of the table are individual people (my actual case is completely different, but this will do) and the columns are attributes such as height, weight, eye color, zip code and so on.

I was interested in asking questions like “Give me data for all the people who are taller than 5′, have blue eyes and live in zip code 02115”. I took a little chunk of my data, loaded it into memory, asked my questions and got back a list of names. Everything checked out.

So I saved the data to disk and then did the same exact thing, except I used the software’s special ability to pull chunks of data straight from disk. Sometimes I would get sensible answers but sometimes I would get nothing. The software would tell me nobody matched the set of conditions I was asking for, but I knew for a fact that there were people in the database matching the description I had given.

The even odder thing was that if I loaded the file back from disk in its entirety and then asked the questions I got all correct answers.

My first reaction was that I was screwing up somewhere and I had badly formatted my data and when I was saving the data to disk I was causing some strange corruption. I started to take away more and more of my code in an effort to isolate the loose valve, so to speak.

But as the hours went by, I started to decide, that however unlikely, the problem was actually in the other guy’s code. I started to go the other way. I started to build a new example using as little code as I possibly could to try and replicate the problem. Finally I found it. The problem was very very odd and very insidious.

In the table of data, if you had a missing value (a NaN) in the first or second row of  any column the software would behave just fine when the data were all in memory. When, however, you asked the software to process the same data from disk it would return nothing, but only when you asked about that column. If you asked about other columns the software would give the correct answer.

This took up most of my Friday. I wrote in to the person who made the software suggesting it was a bug. They got back to me saying, yep it was a bug, but they knew about it and it was actually a bug in this OTHER software from this other fella that they were using inside their own code.

By this time, I was less angry and more curious. I went over to these other folks and sniffed around a bit and read the thread where this bug was discussed. I couldn’t understand all the details but it seems, that in order to make things work fast, they used a clever algorithm to search for matching data on the data table when it was on disk. This clever algorithm, for all its speed and brilliance, would stumble and fall if the first value in the column it was searching in was not-a-number. Just like that.

Importantly, instead of raising an error, it would fail silently and lie, saying it did not find anything matching the question it was given. Just like that.

This exercise brought home a realization to me. You can make your code as water tight as possible, but you almost always rely on other people’s code somewhere. You don’t want to reinvent the wheel repeatedly. But you also inherit the other fellow’s bugs. And the bugs they inherited from yet another fellow and so and so forth. And you can’t test for ALL the bugs.

And then I thought, this is the stuff that launches the missiles. This is the stuff that runs the X-ray machine. This is the stuff that controls more and more of vehicles on the road.

The most common kind of code failure we are used to is the catastrophic kind, when the computer crashes and burns. When our browser quits, when we get the blue screen and the spinning beach ball. This is when you KNOW you’ve run into a bug. But this is not the worst kind. The worst kind are the silent ones. The ones where the X-ray machine delivers double the dose of x-rays and ten years later you get cancer.

And even though I knew all this cognitively, it took me a while to calm down.

What is Mutual Information?

The mutual information between two things is a measure of how much knowing one thing can tell you about the other thing. In this respect, it’s a bit like correlation, but cooler – at least in theory.

Suppose we have accumulated a lot of data about the size of apartments and their rent and we want to know if there is any relationship between the two quantities. We could do this by measuring their mutual information.

Say, for convenience, we’ve normalized our rent and size data so that the highest rent and size are 1 “unit” and the smallest ones are 0 “units”. We start out by plotting two dimensional probability distributions for the rent and size.

Screen Shot 2013-11-01 at 12.32.04 PM

We plot rent on the x-axis, size on the y-axis. The density – a normalized measure of how often we run into a particular (rent,size) combination), and called the joint distribution (p(r,s)) – is actually plotted on the z-axis, coming out of the screen, forming a surface. To simplify matters, let’s assume the joint distribution here is uniform all over, so this surface is flat and at a constant height.

So, here the joint distribution of rents and sizes (p(r,s)) is given by the square (which is actually the roof of a cube, poking out) and the distribution of rents and sizes by themselves (called the marginals, because they are drawn on the margins of the joint distribution) are given by p(r) and p(s).

To recall a bit of probability, and probability distributions, the probability of finding a house/apartment within a certain rent/size range combo is given by the volume of the plot within that rent/size range. The volume of the whole plot, is therefore, equal to 1, since all our data is within this range.

The mutual information is given by the equation:
\displaystyle I(R;S) = \int \int p(r,s) \log \frac{p(r,s)}{p(r)p(s)}drds

This equation takes in our rent/size data and spits out a single number. This is the value of the mutual information. The logarithm is one of the interesting parts of this equation. In practice the only effect of changing the base is to multiply your mutual information value by some number. If you use base 2 you get out an answer in ‘bits’ which makes sense in an interesting way.

Intuitively we see that, for this data, knowing the rent tells us nothing additional about the size (and vice versa).

If we work out the value of the mutual information by substituting the values for p(r,s), p(r) and p(s) into the equation above we see that, since all these quantities are constant, we can just perform the calculation within the integral sign and multiply the result by the area of the plot (which is 1 and indicated by the final x1 term)
I(R;S) = 1 \log_2 \frac{1}{1 \times 1} \times 1 = 0

So we have 0 bits of information in this relation, which jives with our intuition that there is no information here – rents just don’t tell us anything about size.

Now suppose our data came out like this.
Screen Shot 2013-11-01 at 12.32.55 PM
[one-bit diagram]

Substituting the values we see that (noting we have two areas to integrate, each of size 1/2 x 1/2 = 1/4)
I(R;S) = 2 \log_2 \frac{2}{1 \times 1} \times \frac{1}{4} \times 2 = 1

That’s interesting. We can see intuitively there is a relation between rent and size, but what is this 1 bit of information? One way of looking at our plot is to say, if you give me a value for rent, I can tell you in which range of sizes the apartment will fall, and this range splits the total range of sizes in two. 2^1=2 so we say we have 1 bit of information which allows us to distinguish between two alternatives: large size and small size.

Interestingly, if you tell me the size of the apartment, I can tell you the range of the rent, and this range splits the total range of rents in two, so the information is still 1 bit. The mutual information is symmetric, as you may have noted from the formula.

Now, suppose our data came out like this.
Screen Shot 2013-11-01 at 12.33.41 PM
[two-bit diagram]

You can see that:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits! The rents and sizes seem to split into four clusters, and knowing the rent will allow us to say in which one of four clusters the size will fall. Since 2^2=4 we have 2 bits of information here.

Now so far, this has been a bit ho-hum. You could imagine working out the correlation coefficient between rent and size and getting a similar notion of whether rents and sizes are related. True, we get a fancy number in bits, but so what?

Well, suppose our data came out like this.
Screen Shot 2013-11-01 at 12.34.29 PM
[two-bit, scrambled diagram]

It’s funny, but the computation for MI comes out exactly the same as before:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits again! There is no linear relationship that we can see between rents and sizes, but upon inspection we realize that rents and sizes cluster into four groups, and knowing the rent allows us to predict which one of four size ranges the apartment will fall in.

This, then, is the power of mutual information in exploratory data analysis. If there is some relationship between the two quantities we are testing, the mutual information will reveal this to us, without having to assume any model or pattern.

However, WHAT the relationship is, is not revealed to us, and we can not use the value of the mutual information to build any kind of predictive “box” that will allow us to predict, say, sizes from rents.

Knowing the mutual information, however, gives us an idea of how well a predictive box will do at best, regardless of whether it is a simple linear model, or a fancy statistical method like a support vector machine. Sometimes, computing the mutual information is a good, quick, first pass to check if it is worthwhile spending time training a computer to do the task at all.

A note on computing mutual information

In our toy examples above it has been pretty easy to compute mutual information because the forms of p(r,s), p(r) and p(s) have been given explicitly. In real life we don’t have the distributions and all we are given is a (not large enough) pile of data. We try to estimate the functions p(r,s), p(r) and p(s) from this data on our way to computing the mutual information.

You will notice that the term in the integral is always positive. p(r,s) \geq 0 because it is a probability and \frac{p(r,s)}{p(r)p(s)} \geq 1. This second fact can be seen by considering the extreme cases where r and s are independent (in which case p(r,s)=p(r)p(s) which leads us to \frac{p(r,s)}{p(r)p(s)} = 1) and when they are completely dependent (in which case p(r,s)=p(r)=p(s) which leads us to \frac{p(r,s)}{p(r)p(s)} = \frac{p(r)}{p^2(r)} = \frac{1}{p(r)} \geq 1).

You will immediately sense the problem here. In many calculations, when we have noise in the terms, the noise averages out because the plus terms balance out the minus terms. Here, all we have are plus terms, and our integral has a tendency to get bigger.

Histograms (where we take the data and bin it) is an expedient way of estimating probability distributions and they normally work alright. But this can lead us to a funny problem when computing mutual information because of this always positive nature of the integral term.

For example, say there was really no dependence between rents and sizes, but suppose our data and our binning interacted in an odd manner to give us a pattern such as this:
Screen Shot 2013-11-01 at 12.35.14 PM

We can see that the marginals are not affected badly, but the joint, because it is in two-dimensional space, is filled rather more sparsely which leads to us having ‘holes’ in the distribution. If we now compute the mutual information we find that we have ended up with 1 bit of information, when really, it should be 0 bits.

Most attempts to address this bias in mutual information computations recognize the problem with these ‘holes’ in the joint distribution and try to smear them out using various ever more sophisticated techniques. The simplest way is to make larger bins (which would completely solve our problem in this toy case) and other methods blur out the original data points themselves.

All of these methods, not matter how fancy, still leave us with the problem of how much to smear the data: smear too little and you inflate the mutual information, smear too much and you start to wipe it out.

Often, to be extra cautious, we do what I have known as ‘shuffle correction’ (and I was told by a pro is actually called the ‘null model’). Here you thoroughly jumble up your data so that any relation ship that existed between r and s is gone. You then compute the mutual information of that jumbled up data. You know that the mutual information should actually be zero, but because of the bias it comes out to something greater. You then compare the mutual information from the data with this jumbled one to see if there is something peeking above the bias.