Computing mutual information and other scary things

A moderately technical writeup of some adventures in computing mutual information.

I told a colleague of mine, Boris H, of my plan to use mutual information to test data from an experiment. He encouraged me, but he also passed along a warning. “Be wary of discretization. Be very wary,” he said, or words to that effect. Boris is a professional: he knows the math so well that people pay him good money to do it. I was, like, “Yeah, yeah, whatever dude.” Long story short, always listen to the pros…

Summary: The Jack Knifed MLE gives a low-bias estimate of the MI. It is easy to implement, not too compute-intensive, and fairly robust to bin size.

In theory, mutual information is a wonderfully general way of uncovering the relationship between two quantities, and there is a quiet elegance to its formulation:

\displaystyle I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)}dxdy

Or, equivalently

\displaystyle I(X;Y) = H(X) + H(Y) - H(X,Y)

where H(X) = -\int p(x) \log p(x)dx is the entropy

These elegant formulae, on the surface, lend themselves to an easy discretization. Personally, I see integrals and think: ah, I'll just discretize the space and replace the integrals with summations; it works for everything else.

Basically, the idea would be to take the data, which come as pairs of numbers (x,y), and throw them into a two-dimensional histogram. Each cell of the histogram gives us an estimate of p(x,y), the joint probability of (x,y), and summing along the rows or columns of the 2d-histogram gives us the marginal probabilities p(x) and p(y), which can then be thrown into our mutual information formula (with integrals replaced by summations). Easy peasy.
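
In code, this plug-in recipe is just a 2D histogram plus a masked sum. Here is a minimal sketch of the idea (the bin_data and MI_mle functions in the listings further down are the versions actually used for the plots; mi_plugin is just an illustrative name):

import numpy

def mi_plugin(x, y, bins=11):
  # joint histogram -> estimate of p(x,y)
  counts, xe, ye = numpy.histogram2d(x, y, bins=bins)
  pxy = counts/counts.sum()
  px = pxy.sum(axis=1)   # marginal p(x): sum over y
  py = pxy.sum(axis=0)   # marginal p(y): sum over x
  nz = pxy > 0           # skip empty cells, i.e. treat 0*log(0) as 0
  return (pxy[nz]*numpy.log2(pxy[nz]/numpy.outer(px, py)[nz])).sum()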

It turns out that this particular computation, because of the form of the equations, is very sensitive to how the data are discretized. The easy formula, which is the maximum likelihood estimator (MLE) of the entropy, has a persistent bias. In all cases the number of bins used for creating the histogram plays an annoyingly critical role: too few bins and we underestimate the MI, too many bins and we overestimate it. People have come up with different corrections for the bias, of which I picked two to test (these formulae are nicely summarized in the introduction of Paninski 2003).

Defining N as the number of samples and \hat{m} as the number of bins with non-zero probability, we have:

Simple-minded discretization (MLE): \hat{H}_{MLE}(x) = -\sum_i \hat{p}(x_i)\log_2{\hat{p}(x_i)}, where \hat{p}(x_i) is the fraction of samples falling in bin i

Miller Madow correction (MLEMM): \hat{H}_{MLEMM}(x) = \hat{H}_{MLE}(x) + \frac{\hat{m}-1}{2N}

The Jack Knifed estimate (MLEJK): \hat{H}_{MLEJK}(x) = N\hat{H}_{MLE}(x) - \frac{N-1}{N}\sum_{j=1}^{N} \hat{H}_{MLE}^{(-j)}(x), where \hat{H}_{MLE}^{(-j)} is the MLE estimate computed with sample j left out (the leave-one-out formula)
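
To make the leave-one-out formula concrete, here is a sketch of the jackknife for a single variable with fixed bin edges (h_mle and h_jackknife are illustrative names; the MI version actually used for the plots is MI_mle_jack_knife in the listings below):

import numpy

def h_mle(x, edges):
  # plug-in entropy (in bits) from a histogram with fixed bin edges
  counts, e = numpy.histogram(x, bins=edges)
  p = counts[counts > 0]/float(counts.sum())
  return -(p*numpy.log2(p)).sum()

def h_jackknife(x, edges):
  # N*H_MLE minus (N-1)/N times the sum of the leave-one-out estimates
  N = x.size
  h_loo = sum(h_mle(numpy.delete(x, j), edges) for j in range(N))
  return N*h_mle(x, edges) - (N - 1.0)/N*h_loo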

I tested the formulae by generating pairs of correlated random numbers. If X,Y are two Gaussian random variables with correlation coefficient \rho, then their mutual information (in bits) is:

\displaystyle I(X;Y) = -\frac{1}{2}\log_2{(1-\rho^2)}

which I use as the ground truth to compare the numerical results against.
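
For reference, the standard trick for generating such correlated pairs, together with the closed form, looks like this (correlated_pairs is an illustrative name; the listings below use correlated_x and MI_exact for the same job):

import numpy

def correlated_pairs(N, rho):
  # mix two independent standard normals so that corr(x, y) = rho
  x, z = numpy.random.randn(2, N)
  y = rho*x + numpy.sqrt(1 - rho**2)*z
  return x, y

MI_exact = lambda rho: -0.5*numpy.log2(1 - rho**2)  # ground truth, in bits
# e.g. MI_exact(0.5) is about 0.21 bits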

The plot below compares the actual MI with the estimate spit out by the MLE formula for various values of the correlation coefficient \rho. The actual value is shown by the blue line; the gray shaded area covers 95% of the distribution of estimates (from repeating the calculation 1000 times) for a sample size of 100 (chosen as a reasonable sample size I usually run into in my experiments). The two shaded bands show the computation using 11 bins (darker gray) and 21 bins (lighter gray). As you can see, this formula sucks: it has a large, persistent bias throughout the range of correlation coefficients (the gray areas hang well above the blue line).

[Figure: MI estimated with the MLE formula vs. the exact value]

The MLEMM did not fare much better in some preliminary explorations I did (see below), so I went with the jack knife, which performed pretty well. The Jack Knife still has a bias, but it is not very large; it is also less sensitive to the binning, and the bias is stable throughout the range.

[Figure: MI estimated with the Jack Knifed MLE vs. the exact value]

I’m including this second plot, which isn’t that easy on the eyes, but it was the initial exploration that led me to look into the jack knife in more detail. The plots are distributions of the MI values computed from 1000 runs of each algorithm. The three rows are the three formulae: the MLE, the MLE with the Miller Madow correction (MLEMM) and the Jack Knifed MLE (MLEJK). The columns cycle through different sample sizes (10, 100, 1000), different bin counts (11, 21) and different correlation coefficients between x and y (0, 0.5). The vertical dotted line indicates the real MI. As you can see, the MLE and MLEMM have an upward bias that persists even for large sample sizes and is exacerbated by finer binning. The jack knife is impressive, but it too has an upward bias (as you saw in the previous plot).

[Figure: distributions of MI estimates from the three formulae]

Next steps

The next incremental improvement I would use is adaptive binning, which uses some formalized criteria (like making sure all bins have a non-zero probability) to adjust the binning to the data before calculating MI.
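
As a rough sketch of that idea, equal-occupancy (percentile) bins guarantee that no marginal bin is empty. This is only one simple flavor of adaptive binning, not the specific algorithm from Cellucci et al., and quantile_edges / mi_adaptive are illustrative names:

import numpy

def quantile_edges(x, bins=11):
  # place bin edges at percentiles so each marginal bin holds
  # roughly the same number of samples
  return numpy.percentile(x, numpy.linspace(0, 100, bins + 1))

def mi_adaptive(x, y, bins=11):
  counts, xe, ye = numpy.histogram2d(x, y, bins=[quantile_edges(x, bins),
                                                 quantile_edges(y, bins)])
  pxy = counts/counts.sum()
  px = pxy.sum(axis=1)
  py = pxy.sum(axis=0)
  nz = pxy > 0
  return (pxy[nz]*numpy.log2(pxy[nz]/numpy.outer(px, py)[nz])).sum()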

Further reading:

Paninski, Liam. “Estimation of entropy and mutual information.” Neural Computation 15.6 (2003): 1191-1253. – This paper was very tough for me to chew through. I used it for the formulae in the first few pages. Liam claims he has a method for correcting the MLE, but it seemed like a very complicated computation.

Cellucci, C. J., Alfonso M. Albano, and P. E. Rapp. “Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms.” Physical Review E 71.6 (2005): 066208. – This was a much more readable paper compared to Liam’s. I’m using it to see what kind of adaptive binning I can apply to the data.

The code for the first two plots:

import numpy, pylab

def correlated_x(N, rho=0.5):
  # generate N pairs of standard normal variates with correlation rho
  r = numpy.random.randn(N,2)
  r1 = r[:,0]
  r1_ = r[:,1]
  r2 = rho*r1 + (1-rho**2)**0.5*r1_
  return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
  # 2D histogram of the pairs; cells estimate p(x,y), sums give the marginals
  Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  Px = Pxy.sum(axis=1)
  Py = Pxy.sum(axis=0)
  return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
  # plug-in (MLE) entropy estimate in bits; only occupied bins contribute
  p = P[P > 0]
  return -(p*numpy.log2(p)).sum()

def MI_mle(x,y, bins=11, limits=[-4,4]):
  # I(X;Y) = H(X) + H(Y) - H(X,Y) with plug-in entropies
  Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
  return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
  # full-sample plug-in entropies
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  Hx = H_mle(Px)
  Hy = H_mle(Py)
  Hxy = H_mle(Pxy)
  Hx_jk = 0
  Hy_jk = 0
  Hxy_jk = 0
  # accumulate the leave-one-out entropy estimates
  for n in range(x.size):
    jx = numpy.concatenate((x[:n],x[n+1:]))
    jy = numpy.concatenate((y[:n],y[n+1:]))
    Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
    Hx_jk += H_mle(Px)
    Hy_jk += H_mle(Py)
    Hxy_jk += H_mle(Pxy)

  # jackknife correction: H_JK = N*H_MLE - (N-1)/N * sum_j H_MLE(-j)
  Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
  Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
  Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

  return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  # repeat the estimate b times; return the 2.5th, 50th and 97.5th percentiles
  mi = numpy.empty(b)
  for n in range(b):
    r1,r2 = correlated_x(N, rho)
    mi[n] = func(r1,r2,bins=bins)
  mi.sort()
  med_idx = int(b*0.5)
  lo_idx = int(b*0.025)
  hi_idx = int(b*0.975)
  return mi[lo_idx], mi[med_idx], mi[hi_idx]

b = 1000
Ni = 20
rho = numpy.linspace(0,.99,Ni)
mi_exact = MI_exact(rho)
N = 100
cols = [(.5,.5,.5), (.7,.7,.7)]
for lab, func in zip(['MI MLE', 'MI Jack Knife'], [MI_mle, MI_mle_jack_knife]):
  pylab.figure(figsize=(10,5))
  for k, bins in enumerate([11,21]):
    mi_est = numpy.empty((Ni,3))
    for n in range(rho.size):
      mi_est[n,:] = runmany(func, N=N, rho=rho[n], bins=bins, limits=[-4,4], b=b)
    pylab.fill_between(rho, mi_est[:,0], mi_est[:,2],color=cols[k],edgecolor='k',alpha=.5)
    pylab.plot(rho, mi_est[:,1], 'k', lw=2)
    pylab.plot(rho, mi_exact, 'b',lw=2)
    pylab.xlabel(r'$\rho$')
    pylab.ylabel(lab)
    pylab.setp(pylab.gca(),ylim=[-.1,3.5])
    pylab.savefig('{:s}.pdf'.format(lab))
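
For a quick sanity check, the functions above can be compared on a single draw (the numbers will vary from run to run):

x, y = correlated_x(100, rho=0.5)
print(MI_exact(0.5))             # exact value, about 0.21 bits
print(MI_mle(x, y))              # plug-in estimate (biased upward)
print(MI_mle_jack_knife(x, y))   # jackknifed estimate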

The code for the ugly, detailed plot:

import numpy, pylab

def correlated_x(N, rho=0.5):
  r = numpy.random.randn(N,2)
  r1 = r[:,0]
  r1_ = r[:,1]
  r2 = rho*r1 + (1-rho**2)**0.5*r1_
  return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
  Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  Px = Pxy.sum(axis=1)
  Py = Pxy.sum(axis=0)
  return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
  # plug-in (MLE) entropy estimate in bits; only occupied bins contribute
  p = P[P > 0]
  return -(p*numpy.log2(p)).sum()

def H_mm(P,N):
  # Miller-Madow correction: add (m-1)/(2N), m = number of occupied bins
  m = numpy.count_nonzero(P > 0)
  return H_mle(P) + (m-1)/(2.0*N)

def MI_mle(x,y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
  return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_miller_madow(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  return H_mm(Px, N) + H_mm(Py, N) - H_mm(Pxy, N)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  Hx = H_mle(Px)
  Hy = H_mle(Py)
  Hxy = H_mle(Pxy)
  Hx_jk = 0
  Hy_jk = 0
  Hxy_jk = 0
  for n in range(x.size):
    jx = numpy.concatenate((x[:n],x[n+1:]))
    jy = numpy.concatenate((y[:n],y[n+1:]))
    Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
    Hx_jk += H_mle(Px)
    Hy_jk += H_mle(Py)
    Hxy_jk += H_mle(Pxy)

  Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
  Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
  Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

  return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = numpy.empty(b)
  for n in range(b):
    r1,r2 = correlated_x(N, rho)
    mi[n] = func(r1,r2,bins=bins)
  return mi

def plot_bias_variance(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = runmany(func, N=N, rho=rho, bins=bins, limits=limits, b=b)
  pylab.hist(mi,range=[0,2],bins=61,histtype='stepfilled',color='gray',edgecolor='gray')
  pylab.plot([MI_exact(rho),MI_exact(rho)], pylab.getp(pylab.gca(),'ylim'), 'k:',lw=3)
  pylab.setp(pylab.gca(),xlim=[-.1,1.0],xticks=[0,.5],yticks=[],xticklabels=[])
  pylab.title('N={:d},rho={:2.2f},bins={:d}'.format(N,rho,bins),fontsize=6)

pylab.figure(figsize=(15,6))
pylab.subplots_adjust(left=0.03,right=.98,top=0.95,bottom=0.05,wspace=.01,hspace=.15)
b = 1000
for i, rho in enumerate([0,0.5]):
  for j, N in enumerate([10,100,1000]):
    for k, bins in enumerate([11,21]):
      c = 1+k+j*2+i*6
      pylab.subplot(3,12,c); plot_bias_variance(MI_mle,rho=rho, N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('MLE')
      pylab.subplot(3,12,12+c); plot_bias_variance(MI_mle_miller_madow,rho=rho,N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('MillerMadow')
      pylab.subplot(3,12,24+c); plot_bias_variance(MI_mle_jack_knife,rho=rho,N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('JackKnife')

pylab.savefig('mi_experiments.pdf')