Mutual information (MI), often given in terms of bits of information, is a neat way to describe the relationship between two variables, x and y. The MI tells us how predictive x is of y (and vice versa). Here I will explore what the MI value in bits means.

For example, if x takes 8 values and perfectly predicts which of 8 values y will take, it turns out that the mutual information between x and y is 3 bits. One interesting exercise is to build an intuition for how noise and uncertainty are reflected in the MI value. In the examples below we will explore this a little.
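To make the "bits" concrete, mutual information is defined as I(X;Y) = sum over x,y of p(x,y) * log2( p(x,y) / (p(x)p(y)) ). As a quick sanity check of the 8-value example, here is a sketch that applies this definition directly; the 8x8 diagonal joint distribution (x uniform over 8 values, y an exact copy of x) is my own assumption of the simplest such setup:

```python
import numpy as np

# Assumed setup: x uniform over 8 values, y determined exactly by x,
# so the joint distribution is 1/8 on the diagonal and 0 elsewhere.
Pxy = np.eye(8) / 8.0
Px = Pxy.sum(axis=1)   # marginal p(x)
Py = Pxy.sum(axis=0)   # marginal p(y)

# I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) p(y))), skipping zero cells
nz = Pxy > 0
mi_bits = np.sum(Pxy[nz] * np.log2(Pxy[nz] / np.outer(Px, Py)[nz]))
# Every occupied cell contributes (1/8) * log2((1/8)/(1/64)) = 3/8,
# so the total is exactly 3.0 bits = log2(8)
```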

At the end of the post I show all the Python code needed to reproduce the plots.

As our basic data we consider an array of numbers (x) each of which is either 0 or 1. So x can have two states (0,1).

Next we create another array (y). For each value in x, we have a value in y. Each value in y is sampled from a unit variance Gaussian distribution whose mean is 0 or 4 depending on the value of its corresponding x.

We have an intuition that x and y predict each other pretty well. The distribution of y has two peaks, depending on whether the values are associated with x=0 or x=1. The mutual information between the two comes out to be 1 bit. Our intuition is that if someone gave us a value of x, we could tell them whether the value of y is above or below 2, the midpoint of the two means. Alternatively, if someone gave us a value for y, we could tell them whether x was 0 or 1. Either way, we can reliably distinguish one of two alternatives. For this reason, we claim that there is 1 bit of information here: we can distinguish 2**1 = 2 alternatives with the value given to us.
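The "tell them whether x was 0 or 1" intuition can be checked directly: with means 0 and 4 and unit variance, thresholding y at the midpoint recovers x almost perfectly. A small sketch of that check (the seed and sample size are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10_000)      # x is 0 or 1, roughly half each
y = rng.normal(loc=4 * x, scale=1.0)     # y has mean 0 or 4, unit variance

x_hat = (y > 2).astype(int)              # threshold at the midpoint of the means
accuracy = (x_hat == x).mean()
# Each class errs with probability Phi(-2) ~ 0.023, so accuracy is ~0.98
```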

Say we now arrange things such that whatever the value of x, the value of y is always chosen from the same Gaussian distribution.

Now, our intuition is that there is no way to tell what our value of y will be based on the value of x, and vice versa. When we work out our mutual information, we see that we have close to zero bits of information, which matches our intuition that x and y carry no information about each other.
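The zero-bit case also falls straight out of the definition: when x and y are independent, p(x,y) = p(x)p(y), so every log term is log2(1) = 0. A minimal check on a 2x2 independent joint distribution (the marginal probabilities here are my own toy numbers):

```python
import numpy as np

# Independent joint: the outer product of the marginals, so p(x,y) = p(x) p(y)
Px = np.array([0.5, 0.5])
Py = np.array([0.3, 0.7])
Pxy = np.outer(Px, Py)

nz = Pxy > 0
mi_bits = np.sum(Pxy[nz] * np.log2(Pxy[nz] / np.outer(Px, Py)[nz]))
# Every ratio inside the log is exactly 1, so mi_bits is 0.0
```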

Next, suppose we arrange things such that the value of y is taken from Gaussian distributions with mean 0 or 1, depending on whether the value of the corresponding x is 0 or 1.

We see that the distributions overlap, but not completely. The mutual information computation tells us that there are roughly 0.15 bits of information between our two variables. What does that mean? What does it mean to be able to distinguish among 2**0.15 = 1.11 possibilities? I find it hard to interpret this value directly. Given knowledge about x (it can be 0 or 1), I know that the maximum mutual information possible is 1 bit. So anything less than 1 represents a noisy or flawed relation between x and y. I have little intuition on whether 0.5 bits of information is a lot better than 0.3 bits and just a little worse than 0.7 bits. We can check what happens when we gradually increase the separation between the two distributions, making them more easily separable.
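The histogram estimate can be cross-checked by numerically integrating the definition for this setup (binary x with equal probabilities, y drawn from N(0,1) or N(1,1)); the integration grid and its limits below are my own choices:

```python
import numpy as np

d = 1.0                                 # separation between the two means
ys = np.linspace(-8.0, 8.0 + d, 20001)  # grid covering both distributions
dy = ys[1] - ys[0]

p0 = np.exp(-0.5 * ys**2) / np.sqrt(2 * np.pi)        # p(y | x=0)
p1 = np.exp(-0.5 * (ys - d)**2) / np.sqrt(2 * np.pi)  # p(y | x=1)
py = 0.5 * (p0 + p1)                                  # marginal p(y)

# I(X;Y) = sum_x p(x) * integral of p(y|x) * log2(p(y|x)/p(y)) dy
mi_bits = 0.5 * dy * np.sum(p0 * np.log2(p0 / py) + p1 * np.log2(p1 / py))
# Comes out to roughly 0.15 bits for d = 1
```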

As we can see, a separation of one standard deviation (d'=1) gives roughly 0.15 bits, d'=2 gives roughly 0.45 bits, and beyond that the curve starts to saturate toward the 1-bit ceiling.

Code follows

import numpy as np
import matplotlib.pyplot as plt

def mi(x, y, bins=11):
    """Given two arrays x and y of equal length, return their mutual information in bits."""
    Hxy, xe, ye = np.histogram2d(x, y, bins=bins)
    Pxy = Hxy / float(x.size)
    Px = Pxy.sum(axis=1)
    Py = Pxy.sum(axis=0)
    pxy = Pxy.ravel()
    px = Px.repeat(Py.size)
    py = np.tile(Py, Px.size)
    # Only sum over cells with nonzero probability (0 * log 0 is taken as 0)
    idx = (pxy > 0) & (px > 0) & (py > 0)
    return (pxy[idx] * np.log2(pxy[idx] / (px[idx] * py[idx]))).sum()

x1 = np.zeros(1000)
x2 = np.ones(1000)
x = np.concatenate((x1, x2))

# Well separated: means 0 and 4, MI close to 1 bit
y = np.random.randn(x.size) + 4 * x
plt.figure()
plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Identical distributions: MI close to 0 bits
y = np.random.randn(x.size) + 0 * x
plt.figure()
plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Overlapping distributions: means 0 and 1
y = np.random.randn(x.size) + x
plt.figure()
plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Mutual information as a function of the separation between the two means
N = 100
kk = np.linspace(0, 5, N)
mutinf = [mi(x, np.random.randn(x.size) + k * x) for k in kk]
plt.figure()
plt.plot(kk, mutinf)
ax = plt.gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.xlabel('Separation')
plt.ylabel('Mutual information')
plt.show()