What is Mutual Information?

The mutual information between two things is a measure of how much knowing one thing can tell you about the other thing. In this respect, it’s a bit like correlation, but cooler – at least in theory.

Suppose we have accumulated a lot of data about the size of apartments and their rent and we want to know if there is any relationship between the two quantities. We could do this by measuring their mutual information.

Say, for convenience, we’ve normalized our rent and size data so that the highest rent and size are 1 “unit” and the smallest ones are 0 “units”. We start out by plotting two dimensional probability distributions for the rent and size.

[uniform joint distribution diagram]

We plot rent on the x-axis, size on the y-axis. The density – a normalized measure of how often we run into a particular (rent, size) combination, called the joint distribution p(r,s) – is actually plotted on the z-axis, coming out of the screen, forming a surface. To simplify matters, let’s assume the joint distribution here is uniform all over, so this surface is flat and at a constant height.

So, here the joint distribution of rents and sizes p(r,s) is given by the square (which is actually the roof of a cube, poking out) and the distributions of rents and sizes by themselves (called the marginals, because they are drawn on the margins of the joint distribution) are given by p(r) and p(s).

To recall a bit of probability: the probability of finding an apartment within a certain rent/size range combo is given by the volume of the plot within that rent/size range. The volume of the whole plot is therefore equal to 1, since all our data is within this range.

The mutual information is given by the equation:
\displaystyle I(R;S) = \int \int p(r,s) \log \frac{p(r,s)}{p(r)p(s)}drds

This equation takes in our rent/size data and spits out a single number. This is the value of the mutual information. The logarithm is one of the interesting parts of this equation. In practice the only effect of changing the base of the logarithm is to multiply your mutual information value by a constant. If you use base 2 you get out an answer in ‘bits’ which makes sense in an interesting way.

Intuitively we see that, for this data, knowing the rent tells us nothing additional about the size (and vice versa).

If we work out the value of the mutual information by substituting the values for p(r,s), p(r) and p(s) into the equation above we see that, since all these quantities are constant, we can just perform the calculation within the integral sign and multiply the result by the area of the plot (which is 1, indicated by the final \times 1 term):
I(R;S) = 1 \log_2 \frac{1}{1 \times 1} \times 1 = 0

So we have 0 bits of information in this relation, which jibes with our intuition that there is no information here – rents just don’t tell us anything about size.

Now suppose our data came out like this.
[one-bit diagram]

Substituting the values we see that (noting we have two areas to integrate, each of size 1/2 x 1/2 = 1/4)
I(R;S) = 2 \log_2 \frac{2}{1 \times 1} \times \frac{1}{4} \times 2 = 1

That’s interesting. We can see intuitively there is a relation between rent and size, but what is this 1 bit of information? One way of looking at our plot is to say, if you give me a value for rent, I can tell you in which range of sizes the apartment will fall, and this range splits the total range of sizes in two. 2^1=2 so we say we have 1 bit of information which allows us to distinguish between two alternatives: large size and small size.

Interestingly, if you tell me the size of the apartment, I can tell you the range of the rent, and this range splits the total range of rents in two, so the information is still 1 bit. The mutual information is symmetric, as you may have noted from the formula.

Now, suppose our data came out like this.
[two-bit diagram]

You can see that:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits! The rents and sizes seem to split into four clusters, and knowing the rent will allow us to say in which one of four clusters the size will fall. Since 2^2=4 we have 2 bits of information here.
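The three toy calculations above can be checked numerically. This is only a sketch: the function name `mutual_information_bits` is mine, and I am assuming the clusters in the diagrams sit along the diagonal of an equal-cell grid on the unit square.

```python
import numpy as np

def mutual_information_bits(density):
    """Mutual information (in bits) for a joint density tabulated on an
    equal-cell grid over the unit square (rows = rent, columns = size)."""
    density = np.asarray(density, dtype=float)
    cell_area = 1.0 / density.size
    joint = density * cell_area               # probability mass in each cell
    p_r = joint.sum(axis=1, keepdims=True)    # marginal over rent
    p_s = joint.sum(axis=0, keepdims=True)    # marginal over size
    nonzero = joint > 0                       # 0 * log(0) is taken to be 0
    ratio = joint[nonzero] / (p_r @ p_s)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

# Assumed layouts of the three diagrams (density heights, not probabilities)
uniform = [[1, 1], [1, 1]]        # flat roof at height 1
one_bit = [[2, 0], [0, 2]]        # two squares of area 1/4 at height 2
two_bit = 4 * np.eye(4)           # four squares of area 1/16 at height 4

print(mutual_information_bits(uniform))
print(mutual_information_bits(one_bit))
print(mutual_information_bits(two_bit))
```

The three printed values should match the 0, 1 and 2 bits worked out by hand.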

Now so far, this has been a bit ho-hum. You could imagine working out the correlation coefficient between rent and size and getting a similar notion of whether rents and sizes are related. True, we get a fancy number in bits, but so what?

Well, suppose our data came out like this.
[two-bit, scrambled diagram]

It’s funny, but the computation for MI comes out exactly the same as before:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits again! There is no linear relationship that we can see between rents and sizes, but upon inspection we realize that rents and sizes cluster into four groups, and knowing the rent allows us to predict which one of four size ranges the apartment will fall in.
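To see the contrast with correlation in numbers, here is a small sketch. The particular scrambling `[2, 0, 3, 1]` is a made-up example (I don't know the exact layout in the diagram), chosen because the positions of its four clusters happen to have zero linear correlation.

```python
import numpy as np

def mutual_information_bits(joint):
    """Mutual information in bits from a table of joint probabilities."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_r = joint.sum(axis=1, keepdims=True)
    p_s = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    ratio = joint[nonzero] / (p_r @ p_s)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

def joint_table(rent_bin, size_bin):
    """Four equally likely clusters, one per (rent bin, size bin) pair."""
    table = np.zeros((4, 4))
    table[rent_bin, size_bin] = 0.25
    return table

rent_bin = np.arange(4)
ordered_size_bin = np.array([0, 1, 2, 3])     # clusters along the diagonal
scrambled_size_bin = np.array([2, 0, 3, 1])   # hypothetical scrambling

mi_ordered = mutual_information_bits(joint_table(rent_bin, ordered_size_bin))
mi_scrambled = mutual_information_bits(joint_table(rent_bin, scrambled_size_bin))

# Linear correlation between the cluster positions
corr_ordered = np.corrcoef(rent_bin, ordered_size_bin)[0, 1]
corr_scrambled = np.corrcoef(rent_bin, scrambled_size_bin)[0, 1]

print(mi_ordered, corr_ordered)
print(mi_scrambled, corr_scrambled)
```

Both layouts carry the same 2 bits, while the correlation between cluster positions drops from 1 to 0 for the scrambled one.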

This, then, is the power of mutual information in exploratory data analysis. If there is some relationship between the two quantities we are testing, the mutual information will reveal this to us, without having to assume any model or pattern.

However, WHAT the relationship is, is not revealed to us, and we cannot use the value of the mutual information to build any kind of predictive “box” that will allow us to predict, say, sizes from rents.

Knowing the mutual information, however, gives us an idea of how well a predictive box will do at best, regardless of whether it is a simple linear model, or a fancy statistical method like a support vector machine. Sometimes, computing the mutual information is a good, quick, first pass to check if it is worthwhile spending time training a computer to do the task at all.

A note on computing mutual information

In our toy examples above it has been pretty easy to compute mutual information because the forms of p(r,s), p(r) and p(s) have been given explicitly. In real life we don’t have the distributions and all we are given is a (not large enough) pile of data. We try to estimate the functions p(r,s), p(r) and p(s) from this data on our way to computing the mutual information.

You will notice that the mutual information can never come out negative. The individual terms in the integral can be negative (wherever p(r,s) < p(r)p(s) the logarithm dips below zero), but the integral as a whole cannot: it is the Kullback-Leibler divergence between the joint p(r,s) and the product of the marginals p(r)p(s), which is always \geq 0 and is exactly 0 only when r and s are independent (in which case p(r,s)=p(r)p(s) everywhere, so \frac{p(r,s)}{p(r)p(s)} = 1 and the logarithm vanishes).

You will immediately sense the problem here. In many calculations, when we have noise in the terms, the noise averages out because the overestimates balance out the underestimates. Here, the true value for independent data sits right at the floor of zero, so any noise that makes our estimated joint differ from the product of our estimated marginals can only push the result up, and our integral has a tendency to get bigger.

Histograms (where we take the data and bin it) are an expedient way of estimating probability distributions and they normally work alright. But this can lead us to a funny problem when computing mutual information because of this never-negative nature of the integral.

For example, say there was really no dependence between rents and sizes, but suppose our data and our binning interacted in an odd manner to give us a pattern such as this:
[binned joint distribution with holes diagram]

We can see that the marginals are not affected badly, but the joint, because it is in two-dimensional space, is filled rather more sparsely which leads to us having ‘holes’ in the distribution. If we now compute the mutual information we find that we have ended up with 1 bit of information, when really, it should be 0 bits.
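Here is a rough sketch of this effect with made-up data: 50 independent (rent, size) pairs, binned finely and then coarsely. The sample size, the bin counts and the helper name `histogram_mi_bits` are all my own choices for illustration.

```python
import numpy as np

def histogram_mi_bits(r, s, bins):
    """Plug-in mutual information estimate (bits) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(r, s, bins=bins, range=[[0, 1], [0, 1]])
    joint = counts / counts.sum()
    p_r = joint.sum(axis=1, keepdims=True)
    p_s = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0                     # empty bins contribute nothing
    ratio = joint[nonzero] / (p_r @ p_s)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

rng = np.random.default_rng(0)
rent = rng.random(50)
size = rng.random(50)      # drawn independently of rent: true MI is 0 bits

mi_fine = histogram_mi_bits(rent, size, bins=10)    # 100 cells for 50 points
mi_coarse = histogram_mi_bits(rent, size, bins=2)   # 4 cells for 50 points

print(mi_fine, mi_coarse)
```

With 100 cells and only 50 points the joint histogram is mostly holes and the estimate is well above zero; with 4 cells it falls back toward the true value.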

Most attempts to address this bias in mutual information computations recognize the problem with these ‘holes’ in the joint distribution and try to smear them out using various ever more sophisticated techniques. The simplest way is to make larger bins (which would completely solve our problem in this toy case) and other methods blur out the original data points themselves.

All of these methods, no matter how fancy, still leave us with the problem of how much to smear the data: smear too little and you inflate the mutual information, smear too much and you start to wipe it out.

Often, to be extra cautious, we do what I have known as ‘shuffle correction’ (and I was told by a pro is actually called the ‘null model’). Here you thoroughly jumble up your data so that any relationship that existed between r and s is gone. You then compute the mutual information of that jumbled up data. You know that the mutual information should actually be zero, but because of the bias it comes out to something greater. You then compare the mutual information from the data with this jumbled one to see if there is something peeking above the bias.
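A minimal sketch of shuffle correction, under my own assumptions: made-up data where size tracks rent plus a little noise, 8 bins per axis, 200 shuffles, and the 95th percentile of the shuffled values as the bar to clear. None of these numbers come from a real analysis.

```python
import numpy as np

def histogram_mi_bits(r, s, bins=8):
    """Plug-in mutual information estimate (bits) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(r, s, bins=bins)
    joint = counts / counts.sum()
    p_r = joint.sum(axis=1, keepdims=True)
    p_s = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    ratio = joint[nonzero] / (p_r @ p_s)[nonzero]
    return float(np.sum(joint[nonzero] * np.log2(ratio)))

rng = np.random.default_rng(1)
rent = rng.random(200)
size = np.clip(rent + 0.05 * rng.standard_normal(200), 0, 1)  # made-up dependence

observed = histogram_mi_bits(rent, size)

# Jumble the sizes so any link to rent is destroyed, many times over
null = np.array([histogram_mi_bits(rent, rng.permutation(size))
                 for _ in range(200)])
threshold = np.percentile(null, 95)

print("observed MI:", observed)
print("mean shuffled MI (the bias):", null.mean())
print("peeks above the bias:", observed > threshold)
```

The shuffled values are not zero even though the shuffled data has no relationship left in it; that is the bias, and the observed value only means something if it clears it.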

What is machine learning, and why should I care?

This is a (longish) informal intro to machine learning aimed at Biology/Neuroscience undergraduates who are encountering this for the first time in the context of biological data analysis.

We are, as you know, furiously lazy. It’s a biological adaptation: we always try to find the path of least resistance. Only people with distorted senses of reality, like Neurobio students, make things harder for themselves. Naturally, we dream of making machines do not just the easy work, like cutting our lawn, harvesting our crops and assembling our digital cameras, but also do the hard work, like picking stocks, diagnosing our diseases and analyzing all our data.


Categorization – making discrete decisions based on complex information – is at the heart of many interesting things we do, which are collectively called decision making. For example, we have an x-ray done and the radiologist examines the picture and determines if there is cancer or not. There are many, many input factors which go into this determination but the output is relatively simple: it is one of  ‘Yes’ or ‘No’. A lot of machine learning is devoted to understanding and implementing this kind of decision making in computers.

Machine learning = Learning by examples

When we humans do half-way interesting things, like read, make coffee and do math, we attribute that to learning. We learned how to look at squiggles on a piece of paper and interpret them as information from somebody else’s mind, usually indicating an event or thought. Often we learn things by example: we are exposed to things, like words, and we are given examples of what they sound like and what they represent and we ‘pick up’ the words, their sounds and their meanings.

Machine learning is a set of statistical techniques that take inspiration from this learning-by-example and try to mimic this process on computers, thereby realizing our dream of a perfect society where we’ll all be chilling at the beach while the machines do all the work.

The basic idea is that we collect a set of examples (“samples”) and their meanings (“category”). Right now, we are doing baby steps, so the meaning (category) is restricted to simple concepts that have been reduced to a single label. The hope is that if we show the machine enough examples we can let it loose on the world and when it finds new things it can look to its book of samples and match up the new thing it sees with what it already has seen and then take a decision: This is a ‘tree’, this is ‘euler’s equation’, ‘you need to exercise more’.

Supervised learning

Suppose I want to teach the machine what a cat looks like. I collect a huge bunch of cat pictures and label them “cat”. We show the computer the pictures (more on that later) and tell it “These are cats”. Cool! Now the computer knows what a cat looks like.

But, what happens when it sees a picture of a house, or a tree? It doesn’t know what a cat doesn’t look like. Some people will say they have a secret method where it’s possible to show the computer just pictures of cats, and when a non-cat comes along, the computer will know the difference. We are more cautious. For good measure we throw in a huge collection of pictures that are not cats and tell the computer about that too, so it knows to look for the difference.

This is called supervised learning, because we are supervising the computer by telling it what picture is what.

Unsupervised learning: the height of laziness

Humans learn many things by just observing. In machine learning there are techniques, called unsupervised learning, where we don’t even bother to label the examples. We just dump a whole lot of data into the computer and say to it “Here’s all the information, you sort it out, I’m off camping.” The computer patiently goes through the samples and finds differences between them and sorts them into categories it comes up with on its own. This is a powerful method, but as you can imagine, without supervision, the results can be quite hilarious. In general, the computer, being prone to over-analysis, tends to find tiny differences between samples, and tends to break them up into many, many categories. Without some kind of intervention the computer will patiently put each sample into its own separate category and be quite satisfied in the end.

How is this different from more traditional computing?

In traditional computing we do all the ‘thinking’ for the computer and program a specific algorithm (a clear cut set of computing steps) based on an explicit mathematical formula. We are the ones who come up with the core process of how to solve the problem. For example, say we are trying to get a computer to drive a car and we want it to stop when it sees a red stop sign. We think hard and say, “The most salient thing about a stop sign is the insane amount of red color it has. We’ll program the computer to stop the car when it sees a lot of red.”

If we used a machine learning approach, we would say, “Why should I do all the thinking? I want to hang out with my homies. I’ll just show the computer a huge scrapbook  of stop signs, yield signs, pedestrian crossing signs and let the computer figure it out. The only work I’ll do is that I’ll put the signs into the correct categories first. If I’m in a hurry, I might not even do that.”

“Learning” = finding similarities in the features

How do we distinguish cats from, say, fire hydrants? We’ve found things (features) that are common to cats that are not present in fire hydrants. More precisely, we’ve found combinations of features that are very likely to make up a cat and very unlikely to make up a fire hydrant.

So how and what does the computer learn? The details of this depend on which exact machine learning method you use from the veritable zoo of machine learning methods. In general, the input that you give to the computer is converted to a list of numbers called a feature vector. It can be as simple as taking a picture of a cat, rearranging all the pixels into one long row and then feeding in this row of pixels as numbers.

So how does the computer learn to use this feature vector to learn what is a picture of a cat and what isn’t? You would expect that the feature vector taken from the picture of the cat is similar in some sense to vectors from pictures of other cats, and different from vectors of fire hydrant pictures. Machine learning algorithms use different mathematical formulae to automatically find the similarity between cat vectors and the differences between cat and fire hydrant vectors.

When we show the computer a new picture, the computer converts the picture into a feature vector, refers to its giant book of existing feature vectors and asks the question “Is this new vector more similar to a cat vector or to something else in my book?” and bases its decision on that.
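Just to make this concrete, here is a toy sketch of the ‘book of feature vectors’ idea using the simplest possible rule: pick the label of the closest vector in the book (a nearest-neighbour classifier, which is only one animal in the zoo). The 2x2 ‘pictures’ and their pixel values are invented for illustration.

```python
import numpy as np

def to_feature_vector(image):
    """Rearrange all the pixels of a picture into one long row of numbers."""
    return np.asarray(image, dtype=float).ravel()

# The 'book': tiny made-up 2x2 pictures and the labels we gave them
book_images = [
    [[0.9, 0.8], [0.1, 0.2]],
    [[0.8, 0.9], [0.2, 0.1]],
    [[0.1, 0.2], [0.9, 0.8]],
    [[0.2, 0.1], [0.8, 0.9]],
]
book_labels = ["cat", "cat", "fire hydrant", "fire hydrant"]
book = np.array([to_feature_vector(img) for img in book_images])

def classify(image):
    """Answer with the label of the most similar vector in the book."""
    vector = to_feature_vector(image)
    distances = np.linalg.norm(book - vector, axis=1)  # Euclidean distance
    return book_labels[int(np.argmin(distances))]

new_picture = [[0.85, 0.75], [0.15, 0.1]]
print(classify(new_picture))
```

Here ‘similar’ means ‘small Euclidean distance’; other methods differ mostly in how they define and learn that notion of similarity.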

Feature selection

Most things in the world have many, many features. A cat’s features would, for example, include the shape and size of the ears, the color of the eyes, the color and length of the fur and so on. If we were distinguishing between fire hydrants and cats we probably don’t have to look in great detail into the features of both; just a handful, such as color and overall shape, will do. If we are trying to distinguish between breeds of cats, however, we probably need to delve in great detail into a lot of the features.

Deciding which features to use has a great impact on our ability to teach the computer to perform categorization successfully. Sometimes we hand select the features, where we have a good idea of what things distinguish the categories we care about. Other times, we get lazy and, once again, let the computer sort it out.

Dirty little secrets: Overtraining/overfitting and the curse of dimensionality

[image: a Fallout 3 in-game poster, from Destructoid]

As you may have noticed this rambling piece of text bears a suspicious resemblance to rose tinted predictions of the 60s of how robots were going to do all our chores for us as we lay about in our bathrobes, smoking our pipes and ordering trifles by mail. If machine learning is so great where are our jet-packs, flying cars and domestic robots?

Well, two things. First, the future is here, but it’s a lot more profane than the romantics of the 1960s depicted. Automated algorithms track our spending habits and place annoying ads on our social media pages. The US postal service uses robots that can read our scraggly handwriting to route our birthday card to mom. UPS fired all the human telephone operators and now you can speak in your tracking number and the computer will take it down despite your thick southern accent.

Secondly, all this glib talk about “Letting the computer sort things out” should have you on your guard. After all, we all know about Skynet. There are two main problems with having machines do all the work.


The first is called overfitting. This is where the computer gets too fixated on the details of the input and builds over-elaborate classification rules. Basically, we give the computer 1000 pictures of cats, and the computer builds a super elaborate library memorizing every teeny detail of those 1000 pictures of cats. So now we show it the 1001st picture of a cat and because the picture is a little bit different from each of the others the computer says “That picture is not an exact match to any of my 1000 cat pictures so that’s not a cat, it must be a fire hydrant.”

We can diagnose overfitting by doing something called cross-validation. Though this sounds like some kind of new-agey pop-psychology mumbo-jumbo, the concept is simple: you simply don’t test your students using the same question set that you trained them on. We take our set of 1000 cat pictures and divide it up into two sets. The first set we call the training set (say 700 pictures). We let the computer learn using those 700 pictures. Then, we pull out the other 300 pictures that the computer has never seen before and make the computer classify those. The idea is that if the computer has done some kind of rote memorization trick and has not understood the essence of cat, it’s going to do rather poorly on them while it does spectacularly well on the original 700. This is how we catch computers and other students out.

But how do we ensure that the computer does not do rote memorization in the first place? That’s a pretty problem. There are no easy answers, but one thing we try is to figure out ways the computer could rote learn and penalize it. If, for example, the computer is using too many features, we say “Bad computer, use fewer features.” But this is all very subjective and trial and error and domain specific. Our only objective measure of success is to do cross-validation.
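Here is a sketch of catching a rote memorizer with the 700/300 split described above. Everything in it is invented: the ‘pictures’ are random 5-number feature vectors from two overlapping classes, and the memorizer is a 1-nearest-neighbour rule that just parrots the label of the closest training sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
labels = rng.integers(0, 2, size=n)                      # 1 = cat, 0 = not cat
features = rng.standard_normal((n, 5)) + labels[:, None]  # overlapping classes

# Train on 700 samples, hold back 300 the computer has never seen
train_x, test_x = features[:700], features[700:]
train_y, test_y = labels[:700], labels[700:]

def predict_by_memorization(x):
    """Answer with the label of the closest memorized training sample."""
    diffs = train_x[None, :, :] - x[:, None, :]
    distances = np.linalg.norm(diffs, axis=2)
    return train_y[np.argmin(distances, axis=1)]

train_accuracy = np.mean(predict_by_memorization(train_x) == train_y)
test_accuracy = np.mean(predict_by_memorization(test_x) == test_y)

print("on the 700 it studied:", train_accuracy)
print("on the 300 it never saw:", test_accuracy)
```

The memorizer is perfect on the training set (every sample is its own closest match) and noticeably worse on the held-out set, which is exactly the gap we are looking for.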

The curse of dimensionality

This is very scary, but I’m sorry I have to tell you about it. Remember when we talked about feature vectors? A feature vector is a list of numbers that describes the input. The more complex the nature of the input the longer each feature vector (the more numbers in the list).

Suppose we have been given a dossier on 20 people and we are trying to figure out who to pick for the basketball team. None of these people have played sports of any kind before, so all we have are physical attributes and a bunch of other things like hometown, what color car they drive, how many times they flunked math and so on and so forth. The only other thing we have is a similar dossier from 50 other folks from last year and a notation that says “Good player” or “Bad player”. How do we do this?

We start off simple. We pick the physical attribute “Height” and look at last year’s dossier. Interestingly,  when we arrange the 50 previous players it turns out most everyone above 5’8″ is a good player and most everyone below that height is a bad player. So we sort this year’s list into ‘Good’ and ‘Bad’ piles based on height. We send the kids out to play and it turns out our picks are pretty good.

Well, the coach comes back and says, can we do better? We say, hell, if we did so well with just one feature (height) why don’t we toss ALL the information in the dossier into the comparison. It can only get better right?

So we start to include everything we have. Pretty soon a cool pattern emerges. All the tall players, who scored above D+ , who live in a yellow house and drive a red car AND all the tall players who scored above B+ and live in a blue house and drive a yellow car are all good players. Man, ring up that Nate Gladwell guy, he’ll want to know about this! We split the new players up according to this criterion and wait to hear the good news from the coach.

Next week the coach storms in and chews us out. He tells us we’re a bunch of dunderheads, it’s such a mixed bag of players his pet cat could have picked a better team, and we are FIRED!

(My apologies to people who realize I know nothing about sports)

What happened? How did we fail so badly? Does this remind you of the overfitting problem? Where our computer got too hung up about the details? Well, yes. It’s kind of the same thing. When you add more features (also called dimensions of the feature vector – how many numbers there are in the list) you need many, many more sample points. If we had enough dossiers those phoney correlations between playing ability and house color would have been drowned out. However, because we only had relatively few dossiers we started to get hung up on coincidences between things. The more features we pick from the dossier, the more coincidences we find.

Statisticians, known for their rapier wit and general good humor, call this the curse of dimensionality.
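You can watch the coincidences pile up in a small simulation. This is a sketch with invented data: 50 dossiers, a coin-flip ‘good player’ label, and up to 1000 features that are pure noise and have nothing to do with playing ability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dossiers = 50
good_player = rng.integers(0, 2, size=n_dossiers).astype(float)
irrelevant = rng.standard_normal((n_dossiers, 1000))   # pure-noise features

def abs_correlations(features, target):
    """Absolute Pearson correlation of every feature column with the target."""
    f = features - features.mean(axis=0)
    t = target - target.mean()
    return np.abs(f.T @ t) / (np.linalg.norm(f, axis=0) * np.linalg.norm(t))

corr = abs_correlations(irrelevant, good_player)

# Strongest 'pattern' found when we look at the first k features only
best_so_far = {k: float(corr[:k].max()) for k in (1, 10, 100, 1000)}
for k, c in best_so_far.items():
    print(k, "features -> best |correlation| =", round(c, 2))
```

The more noise features we allow ourselves to rummage through, the stronger the best phoney correlation looks, even though none of them mean anything.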

As applied to Neuroscience: An information measure

As a neuroscientist, or neurophysiologist or general biology student why should you care? Well, these statisticians are encroaching on your turf. Statisticians can smell data a mile away, and biology absolutely reeks of it.

Using fMRI studies as an example, traditionally we have looked at each voxel individually and done a simple statistical test to answer the question “Is the activity in this cubic inch of the brain significantly different from chance”. Using somewhat dodgy corrections for multiple comparisons we then create a picture of the brain with all the significantly active parts of the brain colored in, and we draw deep conclusions like “The visual cortex is more active when the person looks at visual stimuli” or “The motor cortex lights up when the subject does motor actions”.

When the statisticians sneak in with forged immigration papers, things get more wild. The statisticians are not happy with these stodgy answers, like this brain region is differentially active during a working memory task. They go straight to the heart of why anybody would do brain science in the first place. “Can I read your mind and guess which card you are holding?”

And the plan they have is to use these machine learning techniques we just discussed to answer this question. A popular method is to take those fMRI pictures of the brain, much like the ones you find on the internet and feed them into the computer, just as we talked about, along with some category information “Looking at cat”, “Looking at fire-hydrant” and so on.

Then you test the computer (Recall, cross-validation): you show it an fMRI and ask it to guess what it was the person was thinking/seeing/doing when the fMRI was done.

The result is a value, a percentage correct. This percentage correct ranges from chance to 100% and is a crude measure of how much information there is in the brain about a certain thing, like a stimulus, or a mood and, if we partition the fMRI, we can answer questions like, how much information is there in the prefrontal cortex, in the basal ganglia, in the thalamus and so on.

Again, this differs from traditional methods of analyzing fMRI data only in that we don’t fix what it is exactly in the data that will give us the answer. We let the computer figure it out. As we learned, this is exciting, but also dangerous if we let the computer totally loose on the data (Recall, curse of dimensionality, overfitting and whatnot).

It is also important to remember that this is not a statement on the nuts and bolts of HOW the brain is processing the information, merely a statement on how MUCH information there (possibly) is in some part of the brain that the rest of the brain could possibly use.

The curse of D- and the LDA

All you dataheads know the curse whose name must not be spoken. The curse of D(imensionality)! Let’s look at how the curse sickens us when we perform Linear Discriminant Analysis (LDA). Our intuition, when we perform LDA, is that we are rotating a higher dimensional data space and casting a shadow onto a lower dimensional surface. We rotate things such that the shadow exaggerates the separation of data coming from different sources (categories) and we hope that the data, which may look jumbled in a higher dimension, actually cast a shadow where the categories are better separated.

I have some data that can be thought of as having a large number of dimensions, say about a hundred. (Each dimension is a neuron in the brain, if you must know). I know, from plotting the data from the neurons individually, that some neurons just don’t carry any information I’m interested in, while others do. I’m interested in the question: if I combine multiple neurons together can I get more information out of them than if I look at them individually. It is possible to construct toy scenarios where this is true, so I want to know if this works in a real brain.

A question quickly arose: which neurons should I pick to combine? I know that by using some criteria I can rank the neurons as to informativeness and then pick the top 50% or 30% of the neurons to put together. But what happens if I just take all the neurons? Will LDA discard the useless noisy ones? Will my data space be rotated so that these useless neurons don’t cast any shadow and are eclipsed by the useful neurons?

This is not entirely an idle question, or a question simply of data analysis. The brain itself, if it is using data from these neurons, needs some kind of mechanism to figure out which neurons are important for what. I find this is an important question and I don’t think we are sure of the answers yet.

However, back to the math. We can generate a toy scenario where we can test this. I created a dataset and looked at its first 10, 25 and 50 dimensions. Only one dimension is informative, the rest are noise. Data from the first dimension come from two different classes and are separated. What happens when we rotate these spaces such that the points from the two classes are as well separated as possible?

The plot below shows the original data (blue and green classes). You can see that the ‘bumps’ are decently separated. Then you can see the 10d, 25d and 50d data.

Wow! Adding irrelevant dimensions sure helps us separate our data! Shouldn’t we all do this, just add noise as additional dimensions and then rotate the space to cast a well separated shadow? Uh, oh! The curse of D- strikes again!

We aren’t fooled though. We know what’s going on. In real life we have limited data. For example, in this data set I used 100 samples. Our intuition tells us that as our dimensions increase the points get less crowded. Each point is able to nestle into a nice niche in a higher dimension, further and further away from its neighbors. It’s like the more dimensions we add, the more streets and alleys we add to the city. All the points no longer have to live in the same street. They now have their own zip codes. (OK I’ll stop this theme now)

Poor old LDA knows nothing about this. LDA simply picks up our space and starts to rotate it and is extremely happy when the shadow looks like it’s well separated and stops. The illusion will be removed as soon as we actually try to use the shadow. Say we split our data into test and train sets. Our train set data look nicely separated, but the moment we dump in the test data: CHAOS! It’s really jumbled. Those separate zipcodes – mail fraud! Thanks to the curse of D-

Code follows:

import numpy as np
import matplotlib.pyplot as plt
# sklearn.lda was removed from newer scikit-learn releases; LDA now lives here
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

def myplot(ax, F, title):
  # Histogram the first column of F separately for the two classes
  bins = np.arange(-15, 15, 1)
  N = F.shape[0]
  ax.hist(F[N//2:, 0], bins, histtype='step', lw=3)  # second half: class 0
  ax.hist(F[:N//2, 0], bins, histtype='step', lw=3)  # first half: class 1
  ax.set_title(title)

D = 50 # Dimensions
N = 100 # Samples
F = np.random.randn(N, D)
C = np.zeros(N)
C[:N//2] = 1 # Category vector (integer division: slices need ints)
F[:, 0] += C*4 # Adjust 1st dimension to carry category information

lda = LDA(n_components=1)
fig, ax = plt.subplots(4, 1, sharex=True, figsize=(4, 8))

myplot(ax[0], F, 'original')
F_new = lda.fit_transform(F[:, :10], C) # Ten dimensions
myplot(ax[1], F_new, '10d')
F_new = lda.fit_transform(F[:, :25], C) # 25 dimensions
myplot(ax[2], F_new, '25d')
F_new = lda.fit_transform(F[:, :50], C) # 50 dimensions
myplot(ax[3], F_new, '50d')

plt.show()
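
The test/train chaos described above can be sketched too. To keep it self-contained I am not using scikit-learn here but a hand-rolled two-class Fisher discriminant in numpy (the same idea as LDA for two classes, not the library's implementation). The every-third-sample test split and the `separation` measure are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 50, 100
C = np.zeros(N)
C[:N // 2] = 1                      # category vector, as in the code above
F = rng.standard_normal((N, D))
F[:, 0] += C * 4                    # only the 1st dimension is informative

# Hold out every third sample as test data (34 test, 66 train)
is_test = np.arange(N) % 3 == 0
F_train, C_train = F[~is_test], C[~is_test]
F_test, C_test = F[is_test], C[is_test]

def fisher_direction(X, y):
    """Direction that best separates the two classes of the data it is given."""
    X1, X0 = X[y == 1], X[y == 0]
    Sw = (np.cov(X1, rowvar=False) * (len(X1) - 1)
          + np.cov(X0, rowvar=False) * (len(X0) - 1))   # within-class scatter
    return np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))

def separation(X, y, w):
    """Gap between the class means of the shadow, in units of its spread."""
    p1, p0 = X[y == 1] @ w, X[y == 0] @ w
    return abs(p1.mean() - p0.mean()) / np.sqrt(0.5 * (p1.var() + p0.var()))

w = fisher_direction(F_train, C_train)
train_sep = separation(F_train, C_train, w)
test_sep = separation(F_test, C_test, w)

print("train separation:", train_sep)
print("test separation:", test_sep)
```

The shadow of the training data looks far better separated than the shadow of the test data cast in the same direction.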

Gradient descent

We all have to get our car oil changed periodically. If we do it too often it’s a waste of money (and a strain on the environment), if we do it too infrequently we run the risk of damaging our engine. Say an insightful friend has developed a formula – a function – that informs us what our total cost is (cost of oil + probable cost of engine repair) when we plug in how often we are doing oil changes.


We have an idea that we can use this graph to figure out the best frequency for our oil change. Finding the value of the input of a function to obtain the lowest value of the output is called optimization.

Brute force

One easy way to find the optimum is to simply compute the function over a large range of input values and pick the value of the input that gives the smallest output. We already did this by inspection when we saw the above graph, because someone has already plotted the graph for us.


One problem with this is that, in the real world, the function may be complicated and involve lots of individual mathematical formulae making it very time consuming to compute the function over a large range of values. This is especially true for multi-dimensional functions that take multiple inputs that can be altered.

Say our formula not only takes into account oil change frequency but also our average driving speed and average length of travel and we are really obsessive and we want to optimize all three. We need to pick ranges for the three dimensions and then decide how densely we are going to sample this space of three dimensions. Even if we are very careless and decide to check only 10 values for periodicity, travel distance and average speed we end up having to check 10 x 10 x 10 = 1000 combinations of values!
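A sketch of the brute force approach for three inputs. The `total_cost` function below is a made-up stand-in (we don't have the friend's actual formula), and the ranges are arbitrary.

```python
import itertools
import numpy as np

def total_cost(months_between_changes, average_speed, trip_length):
    """Hypothetical stand-in for the friend's formula."""
    return ((months_between_changes - 6.0) ** 2
            + 0.01 * (average_speed - 55.0) ** 2
            + 0.1 * (trip_length - 20.0) ** 2)

# Only 10 values per dimension, and arbitrary ranges
periods = np.linspace(1, 12, 10)
speeds = np.linspace(20, 80, 10)
lengths = np.linspace(5, 50, 10)

# Every combination of the three inputs has to be evaluated
grid = list(itertools.product(periods, speeds, lengths))
costs = [total_cost(*point) for point in grid]
best = grid[int(np.argmin(costs))]

print("combinations checked:", len(grid))
print("best combination found:", best)
```

Each extra input multiplies the work by another factor of 10, and the answer is still only as fine as the grid.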

We need to find a short cut!

Gravity is our friend


Look again at the graph our friend has given us. Imagine the graph is actually a physical surface, like a bowl.
If we put a marble somewhere it’s going to roll down and end up at the bottom. Now, this is where we want to be: we want to be at the lowest point of the curve. If we have such a bowl shaped surface, the action of gravity on a marble placed on that surface acts as a minimizing algorithm: it finds the lowest point for us. (At this point, you might suggest that this is not always true, we won’t always find the lowest point, and we will discuss this later).

This gives us an idea for a method (an algorithm) by which we can fake this ‘rolling down a slope’ to find the input that minimizes a function. Before we continue, just so we are on the same page, the slope of the curve is defined as the ratio of the change in output to the change in input. This will be familiar to anyone who has been driving in a hilly area:

Road gradient


1. Place an imaginary marble at some arbitrarily picked starting position
2. Find the slope of the curve at that point
3. Move our point along the curve to a new position computed by adding the numerical value of the negative of the slope to the old position.
4. Keep repeating this until our slope becomes smaller than some value we are happy with calling ‘flat’ i.e. we are done.


Now, in our physical example, how fast the marble rolls down the slope depends on the strength of gravity. If we did our experiment in orbit (say on the International Space Station) the marble of course would not roll down but float around. If we did it on the moon, it would roll down, but slowly, since the gravity is lower there.

In our fake version of this experiment we also have to decide on a ‘gravity’. We don’t want to make the gravity too low or our point will take a long time to move along the slope. If we make our gravity too high, our point will overshoot and bounce around a lot.

We call our fake gravity ‘alpha’ and it is simply a number that we multiply our slope by to decide how far our point is going to move.
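The numbered steps and the role of alpha can be sketched in a few lines. This is a minimal illustration, not a production optimizer: the test function bowl, the finite-difference slope estimate, and the parameter names flat and max_steps are all assumptions made for the example.

```python
def slope(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x
    (change in output divided by change in input)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def gradient_descent(func, start, alpha=0.1, flat=1e-6, max_steps=10000):
    """Roll a point downhill. Returns (position, number_of_steps_taken)."""
    x = start                      # step 1: place the marble
    for step in range(max_steps):
        s = slope(func, x)         # step 2: find the slope here
        if abs(s) < flat:          # step 4: stop once it is 'flat' enough
            return x, step
        x = x - alpha * s          # step 3: move against the slope, scaled by alpha
    return x, max_steps


def bowl(x):
    """Hypothetical bowl-shaped cost with its lowest point at x = 3."""
    return (x - 3.0) ** 2 + 1.0


x_min, steps = gradient_descent(bowl, start=10.0, alpha=0.1)
print("ended near x =", x_min, "after", steps, "steps")
```

On this bowl the point ends up very close to x = 3 after a few dozen slope evaluations, far fewer than a dense grid would need.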

It is important to point out here that our marble in a bowl is an analogy. In the algorithm we are going to use, for practical reasons, we don’t include inertia. So, in the real world the ball would slide down to the bottom and then swing back upward due to momentum (inertia). In our algorithm, as soon as the slope goes to zero our ‘ball’ (point) stops instantly. However, it is still important to choose a gravity (alpha) that is not too large.

Stuck in local minima

You may have thought to yourself: what if the surface is kind of uneven? Can’t the ball get stuck on a ledge somewhere as it is rolling down, and not reach the bottom? Yes, this can happen using this method, and it is called getting stuck in a local minimum.

It’s a matter of luck if you get stuck in local minima. You could avoid it simply by accident, by starting from a different position or by having a different alpha value.
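To see how the starting position decides where we end up, here is a small sketch on a made-up uneven curve with two dips. The function bumpy and its slope are hypothetical, chosen only because one dip is deeper than the other.

```python
def bumpy(x):
    """Hypothetical uneven curve: a deep dip on the left, a shallow one on the right."""
    return x ** 4 - 4.0 * x ** 2 + x


def bumpy_slope(x):
    """Exact slope (derivative) of bumpy."""
    return 4.0 * x ** 3 - 8.0 * x + 1.0


def descend(slope_func, start, alpha=0.01, flat=1e-8, max_steps=100000):
    """Plain gradient descent without inertia; returns the final position."""
    x = start
    for _ in range(max_steps):
        s = slope_func(x)
        if abs(s) < flat:
            break
        x -= alpha * s
    return x


from_right = descend(bumpy_slope, start=2.0)    # rolls into the shallow dip
from_left = descend(bumpy_slope, start=-2.0)    # rolls into the deep dip
print("start at  2.0 -> x =", from_right, "cost =", bumpy(from_right))
print("start at -2.0 -> x =", from_left, "cost =", bumpy(from_left))
```

Both runs stop where the slope is flat, but only the run that started on the left reaches the lower of the two dips.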


The gradient descent method requires far fewer function evaluations than the brute force method, especially when the number of input variables we are tweaking (the dimensions of the input space) gets larger.

Getting stuck in local minima is a big problem which crops up quite frequently in many interesting applications. There are some tricks to get out of local minima which we will discuss later.

Also, if you pick too large an alpha value you could end up bouncing around in your function, perhaps lost for eternity, either spinning out into space, or cycling through the same positions, never getting to where you want. You have been warned …
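The effect of alpha is easy to see on the simplest possible bowl, f(x) = x*x, whose slope is 2*x. This toy function and the three alpha values are assumptions picked for illustration; each update multiplies x by (1 - 2*alpha).

```python
def run(alpha, start=1.0, steps=20):
    """Gradient descent on f(x) = x*x (slope 2*x); returns every position visited."""
    x = start
    path = [x]
    for _ in range(steps):
        x = x - alpha * 2.0 * x
        path.append(x)
    return path


settles = run(alpha=0.1)   # each step shrinks x by a factor 0.8
cycles = run(alpha=1.0)    # each step flips the sign: 1, -1, 1, -1, ...
escapes = run(alpha=1.1)   # each step flips the sign AND grows x by 1.2

print("alpha=0.1 ends at", settles[-1])
print("alpha=1.0 visits", cycles[:4])
print("alpha=1.1 ends at", escapes[-1])
```

With the small alpha the point creeps to the bottom, with alpha = 1.0 it hops between the same two positions forever, and with a slightly larger alpha it is flung further out on every step.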

The ROC and its AUC

Hark back to the caveman days. You are out there, in the tall grass, with your spear. You are stalking a tasty gnu. Suddenly, over the whispering of the wind over the green and the calls of the birds, you hear a twig snap! Was that a twig snap? Was that a saber-toothed tiger stalking you? Or was it just a bird chirp, or an insect falling? Do you keep stalking, or do you slink away? Categorizing things is a fundamental operation for animals and ‘intelligent’ machines.


When we talk about machines (including animals) categorizing things (like stimuli, choices etc.) we often run into the concept of the Receiver Operating Characteristic (ROC), and the area under the ROC. Taking the example above, you need to decide whether that sound you heard is simply prairie noise, or the stealthy sound of a predator. The stimuli are not that easy to distinguish, but you need to try.

In reality, we probably base our decision on many factors: time of day (maybe we know tigers like to hunt at particular times, while food grazes at other times), are there many such crackles, how exactly did the crackle sound, did the birds stop calling, are there vultures overhead and so on and so forth.

To simplify matters for this discussion, say we simply rely on how loud the sound was relative to the general hubbub of the grazing fields. Say the louder the sound, the more likely it is to be something extraordinary rather than a random sound that happened to be loud by chance. Say the probability distributions of sound level for noise on the prairie (blue) and tiger stalking (green) turn out to be as shown below.


Suppose we are not so hungry, and a little flighty and on edge today. Even the slightest sound can set us off and running for the home cave. We can model this as a criterion c (indicated by the vertical black dashed line) that is set pretty low. This means that even a soft tiger sound will cause us to respond (the green shaded area to the right). When we hear a sound and correctly identify it as a tiger we say it was a ‘Hit’. But we overreact and often interpret the regular prairie noise as a tiger (shaded blue region). These instances are called ‘False alarms’. As we can see, when we set our criterion low (we are ‘jumpy’) we have a lot of ‘Hits’ but also a lot of ‘False alarms’.

Now, on another day, we might be more even-keeled. We are less jumpy, more measured in our responses. We can model that as a higher criterion.


We can see that in such a case our ‘Hits’ go down, but our ‘False Alarms’ go down as well.

On another day, we might be very hungry. We are determined to hunt down that gnu, tiger or no tiger. We can model that as setting our criterion very high.


Now, we have very few ‘Hits’ and almost no ‘False Alarms’.
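The three moods above can be put into numbers with a short sketch. It assumes, purely for illustration, that both sound-level distributions are unit-variance Gaussians with the noise centred at 0 and the tiger at 2, and that the three criteria sit at 0, 1 and 3; none of these values are read off the original figures.

```python
import math

NOISE_MEAN = 0.0   # assumed centre of the prairie-noise distribution
TIGER_MEAN = 2.0   # assumed centre of the tiger distribution


def prob_above(criterion, mean, sd=1.0):
    """P(sample > criterion) for a Gaussian with the given mean and sd."""
    return 0.5 * math.erfc((criterion - mean) / (sd * math.sqrt(2.0)))


def hits_and_false_alarms(criterion):
    """Return (Hits, False alarms) as proportions for one criterion."""
    hits = prob_above(criterion, TIGER_MEAN)
    false_alarms = prob_above(criterion, NOISE_MEAN)
    return hits, false_alarms


for mood, c in [("jumpy", 0.0), ("even-keeled", 1.0), ("very hungry", 3.0)]:
    h, fa = hits_and_false_alarms(c)
    print("{:12s} c={:.1f}  H={:.3f}  FA={:.3f}".format(mood, c, h, fa))
```

Raising the criterion pulls both numbers down together, which is exactly the trade-off the ROC traces out.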

Under controlled conditions, it is possible to systematically vary the eagerness of an observer and obtain what is called the Receiver Operating Characteristic (ROC) – terminology harking back to the days when the hot research topic was radar receiver operators sitting at listening posts and how they could be made more efficient in detecting enemy aircraft from primitive radar returns. As we sweep the criterion point, the proportion of FA and H varies systematically, creating an ROC that looks like the curve shown below.


At high criteria there are no FA but no H either (lower left). For low criteria we have high H, but high FA too (upper right). At intermediate criteria we have a mixture.

We can see that the ROC curve bulges out upwards and leftwards. The size of this bulge depends on how well separated the two distributions are (how far apart the blue and green curves are). If the distributions are identical the ROC is a 45 degree line: changing our criterion changes our FA and H at the same rate all the time. If the blue and green are well separated the ROC bulges out a lot, such that lowering our criterion from a very high value, initially, causes our H to go up much faster than our FA (which is good – it means that, by being a little less conservative, we can get many more hits without making too many more false alarms, up to a point).

Here’s a funny thing: we now have some intuition that the bulge of the curve indicates something about how easy it is to distinguish the categories. The more bulging the curve, the larger the Area Under the Curve (AUC, the space enclosed between the ROC curve and the x-axis) is. It turns out that the area under the curve can be interpreted as a probability (it ranges from 0 to 1). In fact it is the ‘ideal observer’ probability. If we could somehow build a machine or train an observer to make the best use of the data available, the best that ‘ideal observer’ can do (on average) is to get it right with probability ‘p’, where p is the area under the curve.

Intuitively, this makes sense: if the ROC is a 45 degree line the AUC=0.5, which means that the best we can do is guess blindly, getting things right half the time. If the ROC is completely bulged out the AUC=1.0 and it’s a walk in the park: we get the classification right all the time.
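A quick numerical check of these two extremes: the sketch below sweeps a criterion across two unit-variance Gaussians and measures the area under the resulting ROC with the trapezoid rule. The Gaussian shapes, the sweep range and the name auc_for_separation are assumptions made for the example.

```python
import math


def prob_above(criterion, mean):
    """P(sample > criterion) for a unit-variance Gaussian."""
    return 0.5 * math.erfc((criterion - mean) / math.sqrt(2.0))


def auc_for_separation(separation, n_criteria=2001):
    """Area under the ROC when noise ~ N(0, 1) and signal ~ N(separation, 1)."""
    lo, hi = -8.0, separation + 8.0
    # Sweep the criterion from high to low so FA and H both rise from 0 to 1
    crit = [hi - (hi - lo) * k / (n_criteria - 1) for k in range(n_criteria)]
    fa = [prob_above(c, 0.0) for c in crit]
    h = [prob_above(c, separation) for c in crit]
    area = 0.0
    for k in range(1, n_criteria):
        # Trapezoid: width along the FA axis times average height in H
        area += (fa[k] - fa[k - 1]) * (h[k] + h[k - 1]) / 2.0
    return area


for sep in [0.0, 1.0, 2.0, 6.0]:
    print("separation {:.1f} -> AUC {:.4f}".format(sep, auc_for_separation(sep)))
```

Identical distributions give an area of one half, and the area climbs towards 1 as the two distributions are pulled apart.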

If the ROC bulges in (and our AUC < 0.5) we notice that we can simply invert the categories and get the ROC to bulge outward – we’ve simply got the sign backwards.

WHY the AUC of the ROC gives us this ‘ideal observer’ probability is another story, but to get us thinking, we make a fun note: suppose that we kept getting pairs of samples, one ‘noise’ and one ‘tiger’, and we simply decided that the larger sample was the tiger and the smaller one noise. It turns out that this simple strategy gives us the ‘ideal observer’ performance (which is the same as the AUC of the ROC).

The code for this and the rest of the post is given below and you might have a lot of fun pondering this ‘coincidence’. As a hint I would suggest asking the question, what does it mean to compare the two samples? Can this have anything to do with moving the criterion from -infinity to +infinity and checking what our percentage correct (H-FA) is?

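import numpy as np
import matplotlib.pyplot as plt


def plotit(x1, x2, cri):
    """Plot histograms of noise (x1) and tiger (x2) samples and shade the parts above the criterion."""
    h1 = plt.hist(x1, histtype='step', bins=31, color='b')
    h2 = plt.hist(x2, histtype='step', bins=31, color='green')

    # Hits: fraction of tiger samples in bins whose centers lie above the criterion
    c2 = (h2[1][:-1] + h2[1][1:]) / 2.
    idx = np.flatnonzero(c2 > cri)
    plt.fill_between(c2[idx], h2[0][idx], alpha=.2, color='green')
    Hits = h2[0][idx].sum() / float(h2[0].sum())

    # False alarms: fraction of noise samples in bins whose centers lie above the criterion
    c1 = (h1[1][:-1] + h1[1][1:]) / 2.
    idx = np.flatnonzero(c1 > cri)
    plt.fill_between(c1[idx], h1[0][idx], alpha=.2, color='b')
    FA = h1[0][idx].sum() / float(h1[0].sum())

    # Place the labels just right of the criterion (works even if no noise bin lies above it)
    plt.text(cri + .2, h1[0].max(), 'FA={:1.2f}'.format(FA))
    plt.text(cri + .2, h1[0].max() * .7, 'H={:1.2f}'.format(Hits))

def auroc(x1, x2, N=40):
  """Area under the ROC curve. Given scalar data from two groups (x1,x2) what is the probability that an ideal
  observer will be able to correctly classify the groups?"""
  n1 = float(x1.size)
  n2 = float(x2.size)
  av1 = x1.mean()
  av2 = x2.mean()
  # The group with the larger mean goes on the y-axis, so the ROC bulges outward
  if av1 > av2:
    FA = 1
    H = 0
  else:
    FA = 0
    H = 1

  st = min(x1.min(), x2.min())
  nd = max(x1.max(), x2.max())
  ra = nd - st
  st -= .01*ra
  nd += .01*ra
  cri = np.linspace(nd, st, N)  # sweep the criterion from high to low
  roc = [np.zeros(N), np.zeros(N)]
  for n in range(N):
    roc[H][n] = np.count_nonzero(x2 > cri[n])/n2
    roc[FA][n] = np.count_nonzero(x1 > cri[n])/n1

  # Trapezoid rule for the area under roc[1] plotted against roc[0]
  area = np.sum(np.diff(roc[0])*(roc[1][1:] + roc[1][:-1])/2.)
  return area, roc

x1 = np.random.randn(10000)      # 'noise' samples
x2 = np.random.randn(10000) + 2  # 'tiger' samples

p, roc = auroc(x1, x2, N=40)
plt.plot(roc[0], roc[1], lw=2)

N = 10000
i1 = np.random.randint(0, x1.size, N)  # N random indices into x1
i2 = np.random.randint(0, x2.size, N)  # N random indices into x2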
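The ‘larger sample is the tiger’ strategy from the fun note can be tried out directly. The following is a self-contained sketch using only the standard library; the sample sizes, the seed, and the names noise, tiger and empirical_auc are choices made for this example rather than part of the original listing.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n_samples = 5000
noise = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
tiger = [random.gauss(2.0, 1.0) for _ in range(n_samples)]

# Strategy from the text: draw random (noise, tiger) pairs and
# call the larger member of each pair the tiger
n_pairs = 5000
correct = 0
for _ in range(n_pairs):
    if random.choice(tiger) > random.choice(noise):
        correct += 1
fraction_correct = correct / n_pairs


def empirical_auc(noise, tiger, n_criteria=100):
    """Sweep a criterion from high to low and integrate H against FA."""
    hi = max(max(noise), max(tiger))
    lo = min(min(noise), min(tiger))
    crit = [hi - (hi - lo) * k / (n_criteria - 1) for k in range(n_criteria)]
    fa = [sum(v > c for v in noise) / len(noise) for c in crit]
    h = [sum(v > c for v in tiger) / len(tiger) for c in crit]
    area = 0.0
    for k in range(1, n_criteria):
        area += (fa[k] - fa[k - 1]) * (h[k] + h[k - 1]) / 2.0
    return area


area = empirical_auc(noise, tiger)
print("pairwise comparison correct:", fraction_correct)
print("area under the ROC:         ", area)
```

The two printed numbers should agree up to sampling noise, which is the ‘coincidence’ worth pondering.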

Musings on Mutual Information

Mutual information (MI), often given in terms of bits of information, is a neat way to describe the relationship between two variables, x and y. The MI tells us how predictive x is of y (and vice versa). Here I will explore what the MI value in bits means.

For example, if x takes 8 values and very reliably predicts which of 8 values y will take, it turns out that the mutual information between x and y is 3 bits. One interesting exercise is to build an intuition for how noise and uncertainty are reflected in the MI. In the example below we will explore this a little bit.
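The 3 bit figure can be checked by hand for discrete variables, where the integral in the definition becomes a sum over a table of joint probabilities. The sketch below is an illustration under that assumption; the function name mutual_information_bits and the two example tables are made up for the purpose.

```python
import math


def mutual_information_bits(joint):
    """Mutual information in bits for a table with joint[i][j] = P(x=i, y=j)."""
    px = [sum(row) for row in joint]            # marginal of x (row sums)
    py = [sum(col) for col in zip(*joint)]      # marginal of y (column sums)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # empty cells contribute nothing
                total += p * math.log2(p / (px[i] * py[j]))
    return total


n = 8
# x perfectly predicts y: all probability sits on the diagonal
perfect = [[1.0 / n if i == j else 0.0 for j in range(n)] for i in range(n)]
# x tells us nothing about y: every combination is equally likely
independent = [[1.0 / (n * n)] * n for _ in range(n)]

print("perfect prediction:", mutual_information_bits(perfect), "bits")
print("independent:       ", mutual_information_bits(independent), "bits")
```

With 8 equally likely, perfectly predicted values each term contributes (1/8) * log2(8), so the total is log2(8) = 3 bits, while the independent table gives zero.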

At the end of the post I show all the Python code needed to reproduce the plots.

As our basic data we consider an array of numbers (x) each of which is either 0 or 1. So x can have two states (0,1).

Next we create another array (y). For each value in x, we have a value in y. Each value in y is sampled from a unit variance Gaussian distribution whose mean is 0 or 4 depending on the value of its corresponding x.


We have an intuition that x and y predict each other pretty well. We can see the distribution of y has two peaks, one for the values associated with x=0 and one for those associated with x=1. The mutual information between the two comes out to be 1 bit. Our intuition is that if someone gave us a value of x we could tell them whether the value of y is above or below the midpoint between the two peaks. Alternatively, if someone gave us a value for y, we could tell them if x was 0 or 1. We can reliably tell them one of two alternatives. For this reason, we claim that there is 1 bit of information here. We can distinguish 2**1 alternatives with the value given to us.

Say we now arrange things such that whatever the value of x, the value of y is always chosen from the same Gaussian distribution.


Now, our intuition is that there is no way to tell what our value of y will be based on the value of x, and vice versa. When we work out our mutual information, we see that we have close to zero bits of information, which matches our intuition that x and y carry no information about each other.

Next, suppose we arrange things such that the value of y is taken from Gaussian distributions with mean 0 or 1, depending on whether the value of the corresponding x is 0 or 1.


We see that the distributions overlap, but not completely. The mutual information computation tells us that there are 0.45 bits of information between our two variables. What does that mean? What does it mean to be able to distinguish among 2**0.45=1.37 possibilities? I find it hard to interpret this value directly. Given knowledge about x (it can be 0 or 1) I know that the maximum mutual information possible is 1 bit. So anything less than 1 represents a noisy or flawed relation between x and y. I have little intuition on whether 1.5 bits of information is a lot better than 1.3 bits and just a little worse than 1.7 bits. We can check what happens when we gradually increase the separation between the two distributions, making them more easily separable.


As we can see, 0.4 bits of information implies the distributions are separated by one standard deviation (d’=1), 0.8 bits means d’=2, and then we start to saturate.

Code follows

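import numpy as np
import matplotlib.pyplot as plt


def mi(x, y, bins=11):
  """Given two arrays x and y of equal length, return their mutual information in bits."""
  Hxy, xe, ye = np.histogram2d(x, y, bins=bins)
  Hx = Hxy.sum(axis=1)
  Hy = Hxy.sum(axis=0)

  # Convert counts to probabilities
  Pxy = Hxy/float(x.size)
  Px = Hx/float(x.size)
  Py = Hy/float(x.size)

  # Flatten so that pxy[k], px[k] and py[k] all refer to the same (x bin, y bin) cell
  pxy = Pxy.ravel()
  px = Px.repeat(Py.size)
  py = np.tile(Py, Px.size)

  # Skip empty cells, which would otherwise give log(0)
  idx = np.flatnonzero((pxy > 0) & (px > 0) & (py > 0))
  return (pxy[idx]*np.log2(pxy[idx]/(px[idx]*py[idx]))).sum()

x1 = np.zeros(1000)
x2 = np.ones(1000)
# Note: the second group is labelled 2 (not 1) here, so y = randn + k*x puts the two group means 2*k apart
x = np.concatenate((x1, x2*2))

# Well separated distributions
plt.figure()
y = np.random.randn(x.size) + 4*x
h = plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Identical distributions
plt.figure()
y = np.random.randn(x.size) + 0*x
h = plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Partially overlapping distributions
plt.figure()
y = np.random.randn(x.size) + x
h = plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Mutual information as the separation is gradually increased
N = 100
kk = np.linspace(0, 5, N)
mutinf = [mi(x, np.random.randn(x.size) + k*x) for k in kk]
plt.figure()
plt.plot(kk, mutinf)
plt.ylabel('Mutual information')
plt.show()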