Computing mutual information and other scary things

A moderately technical writeup of some adventures in computing mutual information.

I told a colleague of mine, Boris H, of my plan to use mutual information to test data from an experiment. He encouraged me, but he also passed along a warning. “Be wary of discretization. Be very wary,” he said, or words to that effect. Boris is a professional: he knows the math so well that people pay him good money to do it. I was, like, “Yeah, yeah, whatever dude.” Long story short, always listen to the pros…

Summary: The jackknifed MLE gives a low-bias estimate of the MI. It is easy to implement, is not too compute-intensive, and is fairly robust to bin size.

In theory, mutual information is a wonderfully general way of uncovering the relationship between two quantities. There is a quiet elegance to its formulation:

\displaystyle I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)}dxdy

Or, equivalently

\displaystyle I(X;Y) = H(X) + H(Y) - H(X,Y)

where H(X) = -\int p(x) \log p(x)dx is the entropy

These elegant formulae, on the surface, lend themselves to an easy discretization. Personally, I see integrals and think: ah, I’ll just discretize the space and replace the integrals with summations – it works for everything else.

Basically, the idea is to take the data, which come as pairs of numbers (x,y), and throw them into a two-dimensional histogram. Each cell of the histogram gives us an estimate of p(x,y) – the joint probability of (x,y) – and summing along the rows or columns of the 2D histogram gives us the marginal probabilities p(x) and p(y), which can then be thrown into our mutual information formula (with integrals replaced by summations). Easy peasy.

It turns out that this particular computation – because of the form of the equations – is very sensitive to how the data are discretized. The easy formula, which turns out to be the maximum likelihood estimator (MLE) of the entropy, has a persistent bias. In all cases the number of bins used to create the histogram plays an annoyingly critical role: with too few bins we underestimate the MI, and with too many bins we overestimate it. People have come up with different corrections for the bias, of which I picked two to test (these formulae are nicely summarized in the introduction of Paninski 2003).

Defining N as the number of samples and \hat{m} as the number of bins with non-zero probability we have:

Simple-minded discretization (MLE): \hat{H}_{MLE}(x) = -\sum_i p(x_i)\log_2{p(x_i)}

Miller Madow correction (MLEMM): \hat{H}_{MLEMM}(x) = \hat{H}_{MLE}(x) + \frac{\hat{m}-1}{2N}

The jackknifed estimate (MLEJK): \hat{H}_{MLEJK}(x) = N\hat{H}_{MLE}(x) - \frac{N-1}{N}\sum_{j=1}^{N} \hat{H}_{MLE-j}(x), where \hat{H}_{MLE-j} is the MLE estimate computed with sample j left out (the leave-one-out formula).

I tested the formulae by generating pairs of correlated random numbers. If X,Y are two Gaussian random variables with correlation coefficient \rho, then their mutual information is given by:

-\frac{1}{2}\log_2{(1-\rho^2)}

which I use as the ground truth to compare the numerical results against.

The plot below compares the actual MI with the estimate spit out by the MLE formula for various values of the correlation coefficient \rho. The actual value is shown by the blue line; the gray shaded areas show 95% of the distribution (from repeating the calculation 1000 times) for a sample size of 100 (chosen as a reasonable sample size I usually run into in my experiments). The two sets of shaded curves show the computation using 11 bins (darker gray) and 21 bins (lighter gray). As you can see, this formula sucks. It has a large, persistent bias throughout the range of correlation coefficients (the gray areas hang well above the blue line).

[Figure: MI estimated by the MLE formula vs. the actual MI, for 11 and 21 bins, N = 100]

The MLEMM did not fare much better in some preliminary explorations I did (see below), so I went to the jackknife, which performed pretty well. The jackknife still has a bias, but it is not very large; it is less sensitive to the binning and the bias is stable throughout the range.

[Figure: MI estimated by the jackknifed MLE vs. the actual MI, for 11 and 21 bins, N = 100]

I’m including this second plot, which isn’t that easy on the eyes, but it was the initial exploration that led me to look into the jackknife in more detail. The plots are distributions of the MI value computed from 1000 runs of each algorithm. The three rows are the three formulae: the MLE, the MLE with the Miller-Madow correction (MLEMM) and the jackknifed MLE. The columns cycle through different sample sizes (10, 100, 1000), different bin counts (11, 21) and different correlation coefficients between x and y (0, 0.5). The vertical dotted line indicates the real MI. As you can see, the MLE and MLEMM have an upward bias which persists even for large sample sizes and is exacerbated by finer binning. The jackknife is impressive, but it too has a slight upward bias (as you saw in the previous plot).

[Figure: distributions of MI estimates from the MLE, Miller-Madow and jackknifed formulae across sample sizes, bin counts and correlation coefficients]

Next steps

The next incremental improvement I would try is adaptive binning, which uses some formalized criterion (like making sure all bins have non-zero probability) to adjust the binning to the data before calculating the MI.
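To make that concrete, here is a minimal sketch of one flavour of adaptive binning: placing the bin edges at quantiles of each marginal so that every marginal bin is roughly equally occupied. This is only my sketch of the general idea (the helper names are mine, and it is not the specific algorithm from Cellucci et al.); the resulting counts can be fed straight into the jackknifed estimator above.

import numpy

def equiprobable_edges(v, bins=11):
  # bin edges at evenly spaced quantiles, so each marginal bin holds ~N/bins samples
  return numpy.percentile(v, numpy.linspace(0, 100, bins + 1))

def bin_data_adaptive(x, y, bins=11):
  xe = equiprobable_edges(x, bins)
  ye = equiprobable_edges(y, bins)
  Nxy, _, _ = numpy.histogram2d(x, y, bins=[xe, ye])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  return Pxy, Pxy.sum(axis=1), Pxy.sum(axis=0), N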

Further reading:

Paninski, Liam. “Estimation of entropy and mutual information.” Neural Computation 15.6 (2003): 1191-1253. – This paper was very tough for me to chew through. I used it for the formulae in the first few pages. Liam claims he has a method for correcting the MLE method, but it seemed like a very complicated computation

Cellucci, C. J., Alfonso M. Albano, and P. E. Rapp. “Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms.” Physical Review E 71.6 (2005): 066208. – This was a much more readable paper compared to Liam’s. I’m using it to see what kind of adaptive binning I can apply to the data.

import numpy, pylab

def correlated_x(N, rho=0.5):
  r = numpy.random.randn(N,2)
  r1 = r[:,0]
  r1_ = r[:,1]
  r2 = rho*r1 + (1-rho**2)**0.5*r1_
  return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
  Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  Px = Pxy.sum(axis=1)
  Py = Pxy.sum(axis=0)
  return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
  # Plug-in (MLE) entropy in bits, summing only over bins with non-zero probability
  idx = numpy.flatnonzero(P > 0)  # pylab.find has been removed from matplotlib
  return -(P.flat[idx]*numpy.log2(P.flat[idx])).sum()

def MI_mle(x,y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
  return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  Hx = H_mle(Px)
  Hy = H_mle(Py)
  Hxy = H_mle(Pxy)
  Hx_jk = 0
  Hy_jk = 0
  Hxy_jk = 0
  for n in range(x.size):
    jx = numpy.concatenate((x[:n],x[n+1:]))
    jy = numpy.concatenate((y[:n],y[n+1:]))
    Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
    Hx_jk += H_mle(Px)
    Hy_jk += H_mle(Py)
    Hxy_jk += H_mle(Pxy)

  Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
  Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
  Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

  return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = numpy.empty(b)
  for n in range(b):
    r1,r2 = correlated_x(N, rho)
    mi[n] = func(r1,r2,bins=bins)
  mi.sort()
  med_idx = int(b*0.5)
  lo_idx = int(b*0.025)
  hi_idx = int(b*0.975)
  return mi[lo_idx], mi[med_idx], mi[hi_idx]

b = 1000
Ni = 20
rho = numpy.linspace(0,.99,Ni)
mi_exact = MI_exact(rho)
N = 100
cols = [(.5,.5,.5), (.7,.7,.7)]
for lab, func in zip(['MI MLE', 'MI Jack Knife'], [MI_mle, MI_mle_jack_knife]):
  pylab.figure(figsize=(10,5))
  for k, bins in enumerate([11,21]):
    mi_est = numpy.empty((Ni,3))
    for n in range(rho.size):
      mi_est[n,:] = runmany(func, N=N, rho=rho[n], bins=bins, limits=[-4,4], b=b)
    pylab.fill_between(rho, mi_est[:,0], mi_est[:,2],color=cols[k],edgecolor='k',alpha=.5)
    pylab.plot(rho, mi_est[:,1], 'k', lw=2)
    pylab.plot(rho, mi_exact, 'b',lw=2)
    pylab.xlabel(r'$\rho$')
    pylab.ylabel(lab)
    pylab.setp(pylab.gca(),ylim=[-.1,3.5])
    pylab.savefig('{:s}.pdf'.format(lab))

The code for the ugly, detailed plot:

import numpy, pylab

def correlated_x(N, rho=0.5):
  r = numpy.random.randn(N,2)
  r1 = r[:,0]
  r1_ = r[:,1]
  r2 = rho*r1 + (1-rho**2)**0.5*r1_
  return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
  Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  Px = Pxy.sum(axis=1)
  Py = Pxy.sum(axis=0)
  return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
  # Plug-in (MLE) entropy in bits, summing only over bins with non-zero probability
  idx = numpy.flatnonzero(P > 0)  # pylab.find has been removed from matplotlib
  return -(P.flat[idx]*numpy.log2(P.flat[idx])).sum()

def H_mm(P,N):
  # Miller-Madow correction: add (m-1)/2N, with m the number of occupied bins
  m = numpy.count_nonzero(P > 0)
  return H_mle(P) + (m-1)/(2.0*N)


def MI_mle(x,y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
  return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_miller_madow(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  return H_mm(Px, N) + H_mm(Py, N) - H_mm(Pxy, N)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  Hx = H_mle(Px)
  Hy = H_mle(Py)
  Hxy = H_mle(Pxy)
  Hx_jk = 0
  Hy_jk = 0
  Hxy_jk = 0
  for n in range(x.size):
    jx = numpy.concatenate((x[:n],x[n+1:]))
    jy = numpy.concatenate((y[:n],y[n+1:]))
    Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
    Hx_jk += H_mle(Px)
    Hy_jk += H_mle(Py)
    Hxy_jk += H_mle(Pxy)

  Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
  Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
  Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

  return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = numpy.empty(b)
  for n in range(b):
    r1,r2 = correlated_x(N, rho)
    mi[n] = func(r1,r2,bins=bins)
  return mi

def plot_bias_variance(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = runmany(func, N=N, rho=rho, bins=bins, limits=limits, b=b)
  pylab.hist(mi,range=[0,2],bins=61,histtype='stepfilled',color='gray',edgecolor='gray')
  pylab.plot([MI_exact(rho),MI_exact(rho)], pylab.getp(pylab.gca(),'ylim'), 'k:',lw=3)
  pylab.setp(pylab.gca(),xlim=[-.1,1.0],xticks=[0,.5],yticks=[],xticklabels=[])
  pylab.title('N={:d},rho={:2.2f},bins={:d}'.format(N,rho,bins),fontsize=6)

pylab.figure(figsize=(15,6))
pylab.subplots_adjust(left=0.03,right=.98,top=0.95,bottom=0.05,wspace=.01,hspace=.15)
b = 1000
for i, rho in enumerate([0,0.5]):
  for j, N in enumerate([10,100,1000]):
    for k, bins in enumerate([11,21]):
      c = 1+k+j*2+i*6
      pylab.subplot(3,12,c); plot_bias_variance(MI_mle,rho=rho, N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('MLE')
      pylab.subplot(3,12,12+c); plot_bias_variance(MI_mle_miller_madow,rho=rho,N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('MillerMadow')
      pylab.subplot(3,12,24+c); plot_bias_variance(MI_mle_jack_knife,rho=rho,N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('JackKnife')

pylab.savefig('mi_experiments.pdf')

Parzen windows for estimating distributions

Part of a set of moderately technical writeups of some adventures in computing mutual information for neural data.

Often, for example when you are computing mutual information, you need to estimate the probability distribution of a random variable. The simplest way, which I had done for years, is to create a histogram: you take each sample and put it into a bin based on its value. Then you can use one of several tests to check if the shape of the histogram deviates from whatever distribution you are interested in.

When you don’t have enough data (aka Always) your histogram comes out jagged (Almost everyone who’s ever made a histogram knows what I’m sayin’). In undergrad stats I learned that 11 was a nice number of bins, and both Matplotlib and MATLAB default to something close to that (10 bins). What I ended up doing was plotting the data using various numbers of bins until, by inspection, I was satisfied by the smoothness of the histogram.

Turns out mathematicians have a whole cottage industry devoted to formalizing how many bins your histogram should have. The relevant keyword is (imaginatively enough) the “histogram problem”. I ran across this term while reading a Navy technical report by Cellucci et al found here. That paper has references to a bunch of dudes who worked on the problem. Anyhow, there are lots of complex schemes, but I liked Tukey’s formula, which is simply n^{1/2} (n being the number of data points).
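(A couple of these rules are easy to play with in numpy; the snippet below is just my illustration, and numpy’s built-in 'auto' rule is a Sturges/Freedman-Diaconis hybrid rather than Tukey’s formula.)

import numpy

x = numpy.random.randn(500)

# Tukey-style square-root rule: about sqrt(n) bins
bins_sqrt = int(numpy.ceil(numpy.sqrt(x.size)))

# numpy also ships several "histogram problem" rules of its own
edges_auto = numpy.histogram_bin_edges(x, bins='auto')
print(bins_sqrt, len(edges_auto) - 1)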

This post, however, is not about binning. It’s about using Parzen windows to get rid of binning altogether. There is a deep theoretical underpinning to Parzen windowing, but intuitively I understand it as a smoothing operation that takes the samples we have and creates a smooth distribution out of the points.

Here’s how I understand Parzen windows: each sample creates a splash – its own little gaussian (apparently, you can also use boxcar windows or whatever window has a nice property for your problem). This is the Parzen window. As a result, the sample is no longer tightly localized but has a bit of a blur to it. We then add up all the blurs to create a smoothed curve/surface, which is our estimate of the pdf of the samples. With a judicious choice of the width of the blurring and proper normalization of the height of each gaussian we can come up with a sensible pdf estimate. The advantage of this is that you now have a continuous function representing the pdf, which you can integrate.

Formally (I referred to a paper by Kwak and Choi – Input Feature Selection by Mutual Information based on Parzen window) the Parzen window estimate of the pdf is given by

\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} \phi(x-x_i,h)

where \phi is the window function, for which I used a gaussian. As you can see, the density estimate at any given point x is given by the sum of gaussians centered on each data point x_i, with the width of the gaussian set by h. The larger h is, the more washed out the estimate; the smaller h is, the more jagged the estimate.

We seem to have transferred our problem of finding an appropriate bin width (bin count) to finding an appropriate smoothing constant, the window width h (What! You expected a free lunch!?). I used what Wikipedia calls Silverman’s rule of thumb: h=1.06\hat{\sigma}n^{-1/5}.
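For what it’s worth, here is a minimal static version of the estimator described above. This is my own sketch, separate from the animation code at the end of the post, using a gaussian window and Silverman’s bandwidth.

import numpy

def parzen_pdf(x_grid, samples, h=None):
  # Gaussian-window Parzen estimate evaluated at each point of x_grid
  if h is None:
    h = 1.06*samples.std()*samples.size**(-1.0/5)  # Silverman's rule of thumb
  z = (x_grid[:, None] - samples[None, :])/h
  bumps = numpy.exp(-0.5*z**2)/(h*(2*numpy.pi)**0.5)  # one unit-area bump per sample
  return bumps.mean(axis=1)

samples = numpy.random.randn(200)
grid = numpy.linspace(-5, 5, 101)
p_hat = parzen_pdf(grid, samples)  # should track the standard normal pdf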

Here is a fun little animation showing how the Parzen window estimate of a pdf (thin black line) matches up with the actual pdf (thicker blue line). The histogram of the actual data points are shown in light gray in the background.

The interesting things about this smoothing are that there is no binning involved – the final curve depends only on the actual data samples – and that we make no strong assumptions about the pdf of the data; it’s not like we are trying to fit a parametric model of the pdf to the data. Here is an animation of the exact same technique being used to fit a uniform distribution. We could have done better with a choice of window width more tuned to the distribution, but the point is that it still works even if we have no idea what the actual pdf looks like.

This kind of estimate is also known as kernel density estimation, and with gaussian windows it is essentially a mixture of Gaussians with one component centered on each sample (hence the connection to mixture decompositions).

During my web-searches I ran across this nice set of lecture slides about estimating pdfs from a prof at TAMU.

import numpy, pylab, matplotlib.animation as animation

root_two_pi = (2*numpy.pi)**0.5
# Parzen (gaussian window) density estimate at point x from samples X; effective bandwidth is h*sigma
parzen_est = lambda x,X,h,sigma: numpy.exp(-(X-x)**2/(2*h**2*sigma**2)).sum()/(root_two_pi*h*sigma*X.size)
gaussian_pdf = lambda x: (1/(2*numpy.pi)**.5)*numpy.exp(-x**2/2)
def uniform_pdf(x):
  p = numpy.zeros(x.size)
  p[(-2 < x) & (x < 2)] = 0.25
  return p

def init():
  fig = pylab.figure(figsize=(3,6))
  ax = pylab.axes(xlim=(-5, 5), ylim=(-0.5, 0.7))
  return fig, ax

def animate(i, r, ax, distr):
  this_r = r[:i+2]
  ax.cla()
  h = 1.06*this_r.std()*this_r.size**(-.2)  # Silverman's rule of thumb (alternative: 1/(2*numpy.log(N)))
  lim = [-5,5]
  bins = 51
  x=numpy.linspace(lim[0],lim[1],num=100)
  pylab.text(-.4,-.025,'n=' + str(this_r.size))
  pylab.hist(this_r, bins=bins,range=lim,density=True,edgecolor=(.9,.9,.9),color=(.8,.8,.8),histtype='stepfilled')
  pylab.plot(x, distr(x), color=(.1,.1,1.0), lw=5)
  pylab.plot(x, [parzen_est(xx, this_r, h, this_r.std()) for xx in x],'k', lw=3)
  pylab.setp(ax,xlim=(-5, 5), ylim=(-0.05, 0.7))

N = 200
r0=numpy.random.randn(N); distr = gaussian_pdf
#r0=numpy.random.rand(N)*4-2; distr = uniform_pdf

fig, ax = init()
anim = animation.FuncAnimation(fig, animate, fargs=(r0, ax, distr), frames=N-2, interval=2, repeat=False)
#anim.save('parzen_smoothing.mp4', fps=30, extra_args=['-vcodec', 'libx264'])
pylab.show()

What is machine learning, and why should I care?

This is a (longish) informal intro to machine learning aimed at Biology/Neuroscience undergraduates who are encountering this for the first time in the context of biological data analysis.

We are, as you know, furiously lazy. It’s a biological adaptation: we always try to find the path of least resistance. Only people with distorted senses of reality, like Neurobio students, make things harder for themselves. Naturally, we dream of making machines do not just the easy work, like cutting our lawn, harvesting our crops and assembling our digital cameras, but also do the hard work, like picking stocks, diagnosing our diseases and analyzing all our data.

Categorization

Categorization – making discrete decisions based on complex information – is at the heart of many interesting things we do, which are collectively called decision making. For example, we have an x-ray done and the radiologist examines the picture and determines whether there is cancer or not. There are many, many input factors that go into this determination, but the output is relatively simple: it is either ‘Yes’ or ‘No’. A lot of machine learning is devoted to understanding and implementing this kind of decision making in computers.

Machine learning = Learning by examples

When we humans do halfway interesting things, like read, make coffee and do math, we attribute that to learning. We learned how to look at squiggles on a piece of paper and interpret them as information from somebody else’s mind, usually indicating an event or thought. Often we learn things by example: we are exposed to things, like words, we are given examples of what they sound like and what they represent, and we ‘pick up’ the words, their sounds and their meanings.

Machine learning is a set of statistical techniques that take inspiration from this learning-by-example and try to mimic this process on computers, thereby realizing our dream of a perfect society where we’ll all be chilling at the beach while the machines do all the work.

The basic idea is that we collect a set of examples (“samples”) and their meanings (“category”). Right now, we are doing baby steps, so the meaning (category) is restricted to simple concepts that have been reduced to a single label. The hope is that if we show the machine enough examples we can let it loose on the world and when it finds new things it can look to its book of samples and match up the new thing it sees with what it already has seen and then take a decision: This is a ‘tree’, this is ‘euler’s equation’, ‘you need to exercise more’.

Supervised learning

Suppose I want to teach the machine what a cat looks like. I collect a huge bunch of cat pictures and label them “cat”. We show the computer the pictures (more on that later) and tell it “These are cats”. Cool! Now the computer knows what a cat looks like.

[Image: i_know_cat]

But what happens when it sees a picture of a house, or a tree? It doesn’t know what a cat doesn’t look like. Some people will say they have a secret method where it’s possible to show the computer just pictures of cats, and when a non-cat comes along, the computer will know the difference. We are more cautious. For good measure we throw in a huge collection of pictures that are not cats and tell the computer about that too, so it knows to look for the difference.

This is called supervised learning, because we are supervising the computer by telling it what picture is what.

Unsupervised learning: the height of laziness

Humans learn many things by just observing. In machine learning there are techniques, called unsupervised learning, where we don’t even bother to label the examples. We just dump a whole lot of data into the computer and say to it, “Here’s all the information, you sort it out, I’m off camping.” The computer patiently goes through the samples, finds differences between them and sorts them into categories it comes up with on its own. This is a powerful method, but as you can imagine, without supervision the results can be quite hilarious. In general the computer, being prone to over-analysis, tends to find tiny differences between samples and tends to break them up into many, many categories. Without some kind of intervention the computer will patiently put each sample into its own separate category and be quite satisfied in the end.

How is this different from more traditional computing?

In traditional computing we do all the ‘thinking’ for the computer and program a specific algorithm (a clear-cut set of computing steps) based on an explicit mathematical formula. We are the ones who come up with the core process of how to solve the problem. For example, say we are trying to get a computer to drive a car and we want it to stop when it sees a red stop sign. We think hard and say, “The most salient thing about a stop sign is the insane amount of red color it has. We’ll program the computer to stop the car when it sees a lot of red.”

If we used a machine learning approach, we would say, “Why should I do all the thinking? I want to hang out with my homies. I’ll just show the computer a huge scrapbook of stop signs, yield signs and pedestrian crossing signs and let the computer figure it out. The only work I’ll do is to put the signs into the correct categories first. If I’m in a hurry, I might not even do that.”

“Learning” = finding similarities in the features

How do we distinguish cats from, say, fire hydrants? We’ve found things (features) that are common to cats that are not present in fire hydrants. More precisely, we’ve found combinations of features that are very likely to make up a cat and very unlikely to make up a fire hydrant.

So how and what does the computer learn? The details of this depend on which exact machine learning method you use from the veritable zoo of machine learning methods. In general, the input that you give to the computer is converted to a list of numbers called a feature vector. It can be as simple as taking a picture of a cat, rearranging all the pixels into one long row and then feeding in this row of pixels as numbers.
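(For the pixel example, that conversion really can be a one-liner; the array below is just made-up data standing in for a cat photo.)

import numpy

img = numpy.random.rand(64, 64, 3)  # stand-in for a 64x64 RGB picture of a cat
feature_vector = img.reshape(-1)    # "one long row" of 64*64*3 = 12288 numbers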

So how does the computer use this feature vector to learn what is a picture of a cat and what isn’t? You would expect that the feature vector taken from the picture of a cat is similar in some sense to the vectors from pictures of other cats, and different from the vectors of fire hydrant pictures. Machine learning algorithms use different mathematical formulae to automatically find the similarities between cat vectors and the differences between cat and fire hydrant vectors.

When we show the computer a new picture, the computer converts the picture into a feature vector, refers to its giant book of existing feature vectors, asks the question “Is this new vector more similar to a cat vector or to something else in my book?” and bases its decision on that.

Feature selection

Most things in the world have many, many features. A cat’s features would, for example, include the shape and size of the ears, the color of the eyes, the color and length of the fur and so on. If we were distinguishing between fire hydrants and cats we probably don’t have to look at the features of either in great detail; just a handful, such as color and overall shape, will do. If we are trying to distinguish between breeds of cats, however, we probably need to delve into a lot of the features in great detail.

Deciding which features to use has a great impact on our ability to teach the computer to perform categorization successfully. Sometimes we hand-pick the features, when we have a good idea of what distinguishes the categories we care about. Other times, we get lazy and, once again, let the computer sort it out.

Dirty little secrets: Overtraining/overfitting and the curse of dimensionality

[Image: a Fallout 3 in-game poster, via Destructoid]

As you may have noticed, this rambling piece of text bears a suspicious resemblance to those rose-tinted predictions from the 60s about how robots were going to do all our chores for us as we lay about in our bathrobes, smoking our pipes and ordering trifles by mail. If machine learning is so great, where are our jet-packs, flying cars and domestic robots?

Well, two things. First, the future is here, but it’s a lot more profane than the romantics of the 1960s depicted. Automated algorithms track our spending habits and place annoying ads on our social media pages. The US postal service uses robots that can read our scraggly handwriting to route our birthday card to mom. UPS fired all the human telephone operators, and now you can speak your tracking number into the phone and the computer will take it down despite your thick southern accent.

Secondly, all this glib talk about “letting the computer sort things out” should have you on your guard. After all, we all know about Skynet. There are two main problems with having machines do all the work.

Overfitting

The first is called overfitting. This is where the computer gets too fixated on the details of the input and builds over-elaborate classification rules. Basically, we give the computer 1000 pictures of cats, and the computer builds a super elaborate library memorizing every teeny detail of those 1000 pictures of cats. Now we show it the 1001st picture of a cat and, because the picture is a little bit different from each of the others, the computer says “That picture is not an exact match to any of my 1000 cat pictures, so it’s not a cat; it must be a fire hydrant.”

We can diagnose overfitting by doing something called cross-validation. Though this sounds like some kind of new-agey pop-psychology mumbo-jumbo, the concept is simple: you simply don’t test your students using the same question set that you trained them on. We take our set of 1000 cat pictures and divide it up into two sets. The first set we call the training set (say 700 pictures). We let the computer learn using those 700 pictures. Then we pull out the other 300 pictures, which the computer has never seen before, and make the computer classify those. The idea is that if the computer has done some kind of rote memorization trick and has not understood the essence of cat, it’s going to do rather poorly on them while it does spectacularly well on the original 700. This is how we catch computers, and other students, out.
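If you like code, here is a toy sketch of that train/test split. The data and the deliberately dumb “remember the average cat vector” classifier are made up; the point is only to show where the 700/300 division goes.

import numpy

def nearest_mean_classifier(train_x, train_y):
  # "learning" = remembering the mean feature vector of each category
  means = {c: train_x[train_y == c].mean(axis=0) for c in numpy.unique(train_y)}
  return lambda x: min(means, key=lambda c: numpy.linalg.norm(x - means[c]))

# fake data: 1000 samples, 5 features, two categories with slightly shifted means
x = numpy.random.randn(1000, 5)
y = (numpy.random.rand(1000) < 0.5).astype(int)
x[y == 1] += 0.5

idx = numpy.random.permutation(1000)
train, test = idx[:700], idx[700:]              # learn on 700, hold out 300
predict = nearest_mean_classifier(x[train], y[train])
guesses = numpy.array([predict(xx) for xx in x[test]])
accuracy = (guesses == y[test]).mean()          # scored only on samples it never saw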

But how do we ensure that the computer does not do rote memorization in the first place? That’s a tricky problem. There are no easy answers, but one thing we try is to figure out ways the computer could rote-learn and penalize it. If, for example, the computer is using too many features, we say “Bad computer, use fewer features.” But this is all very subjective, trial-and-error and domain specific. Our only objective measure of success is cross-validation.

The curse of dimensionality

This is very scary, but I’m sorry I have to tell you about it. Remember when we talked about feature vectors? A feature vector is a list of numbers that describes the input. The more complex the nature of the input the longer each feature vector (the more numbers in the list).

Suppose we have been given dossiers on 20 people and we are trying to figure out whom to pick for the basketball team. None of these people have played sports of any kind before, so all we have are physical attributes and a bunch of other things like hometown, what color car they drive, how many times they flunked math and so on and so forth. The only other thing we have is a similar set of dossiers on 50 other folks from last year, each with a notation that says “Good player” or “Bad player”. How do we do this?

We start off simple. We pick the physical attribute “Height” and look at last year’s dossiers. Interestingly, when we arrange the 50 previous players by height, it turns out most everyone above 5’8″ is a good player and most everyone below that height is a bad player. So we sort this year’s list into ‘Good’ and ‘Bad’ piles based on height. We send the kids out to play and it turns out our picks are pretty good.

Well, the coach comes back and says, can we do better? We say, hell, if we did so well with just one feature (height) why don’t we toss ALL the information in the dossier into the comparison. It can only get better right?

So we start to include everything we have. Pretty soon a cool pattern emerges. All the tall players who scored above D+, live in a yellow house and drive a red car, AND all the tall players who scored above B+, live in a blue house and drive a yellow car, are good players. Man, ring up that Nate Gladwell guy, he’ll want to know about this! We split the new players up according to this criterion and wait to hear the good news from the coach.

Next week the coach storms in and chews us out. He tells us we’re a bunch of dunderheads, that it’s such a mixed bag of players his pet cat could have picked a better team, and that we are FIRED!

(My apologies to people who realize I know nothing about sports)

What happened? How did we fail so badly? Does this remind you of the overfitting problem, where our computer got too hung up on the details? Well, yes. It’s kind of the same thing. When you add more features (also called dimensions of the feature vector – how many numbers there are in the list) you need many, many more sample points. If we had had enough dossiers, those phoney correlations between playing ability and house color would have been drowned out. However, because we only had a relatively small number of dossiers, we started to get hung up on coincidences between things. The more features we pick from the dossier, the more coincidences we find.

Statisticians, known for their rapier wit and general good humor, call this the curse of dimensionality.

As applied to Neuroscience: An information measure

As a neuroscientist, or neurophysiologist or general biology student why should you care? Well, these statisticians are encroaching on your turf. Statisticians can smell data a mile away, and biology absolutely reeks of it.

Using fMRI studies as an example: traditionally we have looked at each voxel individually and done a simple statistical test to answer the question “Is the activity in this cubic inch of the brain significantly different from chance?” Using somewhat dodgy corrections for multiple comparisons, we then create a picture of the brain with all the significantly active parts colored in, and we draw deep conclusions like “The visual cortex is more active when the person looks at visual stimuli” or “The motor cortex lights up when the subject does motor actions”.

When the statisticians sneak in with forged immigration papers, things get wilder. The statisticians are not happy with these stodgy answers, like “this brain region is differentially active during a working memory task”. They go straight to the heart of why anybody would do brain science in the first place: “Can I read your mind and guess which card you are holding?”

And the plan they have is to use these machine learning techniques we just discussed to answer this question. A popular method is to take those fMRI pictures of the brain, much like the ones you find on the internet and feed them into the computer, just as we talked about, along with some category information “Looking at cat”, “Looking at fire-hydrant” and so on.

Then you test the computer (recall: cross-validation): you show it an fMRI scan and ask it to guess what the person was thinking/seeing/doing when the fMRI was done.

The result is a value, a percentage correct. This percentage correct ranges from chance to 100% and is a crude measure of how much information there is in the brain about a certain thing, like a stimulus or a mood. And if we partition the fMRI data, we can answer questions like: how much information is there in the prefrontal cortex, in the basal ganglia, in the thalamus, and so on.

Again, this differs from traditional methods of analyzing fMRI data only in that we don’t fix in advance what exactly in the data will give us the answer. We let the computer figure it out. As we learned, this is exciting, but also dangerous if we let the computer totally loose on the data (recall the curse of dimensionality, overfitting and whatnot).

It is also important to remember that this is not a statement about the nuts and bolts of HOW the brain is processing the information, merely a statement about how MUCH information there is (possibly) in some part of the brain that the rest of the brain could possibly use.