Self-Organizing Maps

Self-organizing maps are an old idea (Kohonen first published the algorithm in 1982) and take strong inspiration from empirical neurophysiological observations of that era. The original paper “Self-Organizing Semantic Maps” by Ritter and Kohonen has a nice discussion that took me back to some questions I was looking at in another life as a neurophysiologist. The discussion centers on cortical maps, which are most pronounced in the clearly sensory and motor areas of the brain.

Very early experiments (Lashley’s work is referenced in their paper) led some people to propose that the brain was a complete mishmash, with no functionally separable internal organization. Each part of the brain did exactly what the other parts did, a bit like how a lens cut in half still works like the original lens (just dimmer), or how a hologram cut in half still shows the original scene.

Later experiments looked more closely and found a wealth of specialization at all scales in the brain. The cortex (the most evolutionarily recent part of the brain) shows an especially fascinating organization at the smallest scales. This is most apparent in the visual cortex, where the neurons are often organized in two or higher dimensional maps. A famous example is the set of orientation maps in early visual cortex:

This figure is taken from Moser et al., Nature Reviews Neuroscience 15, 466–481 (2014), but you’ll run into such figures everywhere. It shows physical space along the cortex, color coded for the orientation of the bar that the neurons at that location best respond to. One can see that the colors are not randomly assigned but form contiguous regions, indicating that neighboring neurons respond to similar orientations.

I could go on about cortical organization like this (and indeed I have elsewhere) but this is a good segue into self-organizing maps. Researchers studying machine learning got interested in this feature of the brain and wondered if it held some use for classification problems. Kohonen drew up an algorithm for what he called a self-organizing map that has been used on and off, mainly for visualization via unsupervised learning.

The Kohonen map is constructed as follows.

  1. Construct a sheet of “neurons” with the dimensionality you want. Usually this is limited to one, two, or three dimensions, because the aim is to visualize a complex, higher-dimensional dataset.
  2. Each neuron is a template – it has the same dimensions as the input data and can be overlaid on each instance of the input data.
  3. We initialize the templates randomly – so the templates look like white noise.
  4. We “present” each input example to the map. Then we find the template neuron that most closely matches this input. We then morph the template slightly so that it even more closely matches the input. We do this for all the neighbors of the neuron too.
  5. We keep repeating this for all the input we have.
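The steps above can be sketched in a few lines of numpy. This is my own toy version, not the code behind the videos; the grid size, the linear decay schedules, and the Gaussian neighborhood are all arbitrary choices you'd fiddle with in practice:

```python
import numpy as np

def train_som(data, grid_w=8, grid_h=8, epochs=5, eta0=0.5, sigma0=3.0, seed=0):
    """Train a 2-D Kohonen map on `data` (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    # Step 3: initialize the templates randomly (white noise)
    W = rng.random((grid_h, grid_w, d))
    # Grid coordinates, used to find each neuron's neighbors
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]
    n_steps = epochs * n
    t = 0
    for epoch in range(epochs):
        for x in data[rng.permutation(n)]:
            # Shrink the learning rate and neighborhood as training goes on
            frac = t / n_steps
            eta = eta0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 0.5
            # Step 4: the winner is the template closest to this input
            dist = ((W - x) ** 2).sum(axis=2)
            by, bx = np.unravel_index(dist.argmin(), dist.shape)
            # Gaussian neighborhood around the winner
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            # Morph the winner (and, more weakly, its neighbors) toward the input
            W += eta * h[:, :, None] * (x - W)
            t += 1
    return W
```

Feeding it flattened digit images, each W[i, j] ends up as one of the templates you see in the videos.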

What does this result in? I have two videos for you, you products of the MTV generation, that show how this works.

I took the famous MNIST handwritten digits data set and fed it to a two-dimensional SOM.

The first video represents the self-organizing map as a set of templates that change as inputs are supplied to it.

This is easy to interpret as you can identify the templates as they evolve. Note how neurons form neighborhoods that naturally link together similar stimuli. It is easy to see with this data set because the data are two dimensional visual patterns and we can judge similarity intuitively.

The second video represents each neuron’s stimulus preferences, with its most preferred stimulus digit appearing as the largest.

You can see how the network evolves as examples are thrown at it and segregates into zones that respond to, or represent, a particular form of the stimulus. You can see how zones travel and try to spread out, ending up at the corners of the map. This is due to the competition created by the winner-take-all operation we use (the closest matching template wins) combined with the effect of adjusting the winning neuron’s neighbors too.

SOMs are fun things to play with, but they are a little fiddly. You may have picked up from the videos that we had two parameters, sigma and eta, that shrank as the learning went on. Sigma is the size of the neighborhood and eta is the learning rate – the smaller the value of eta, the less the neurons change in response to inputs. These are two things you have to fiddle with to get decent results. As you can imagine, the whole process is also sensitive to the random templates we start with.

Numerous attempts have been made to find objective ways of determining these parameters for a SOM. It turns out they are pretty data dependent, and most people just do multiple runs until they get results they are happy with.

As I was reading Bishop I ran into his version of the SOM, which he calls Generative Topographic Mapping and claims is more principled and mathematically grounded than SOMs. The GTM, which I hadn’t heard of before, seems like a fun thing to study, but as an ex-neurophysiologist I find SOMs more intuitive to understand.

Code (and link to data source) follows:

Bayesian networks and conditional independence

Bayesian networks are directed graphs that represent probabilistic relationships between variables. I’m using the super-excellent book “Pattern Recognition and Machine Learning” by C. Bishop as my reference to learn about Bayesian networks and conditional independence, and I’d like to share some intuitions about them.

First, let’s motivate our study with a common home situation related to kids. We start with the question “Do we have to gas up the car?” (As you know, everything related to kids is binary. There is no room for gray areas.) Well, this depends on whether we need to make a grocery trip or not. That, in turn, depends on whether the food in the fridge has gone bad and whether we need dino nuggets. Why would the food go bad? Well, that depends on whether the electricity went out and whether the darn kids broke the fridge again. The kids never broke your fridge? Lucky you. Whether we need dino nuggets depends on whether we’re having a party. The probability that the kids broke the fridge depends on whether there was a horde of them or not (i.e. a party).

Whoa, this paragraph looks complicated, like a GRE word problem. Let’s represent this as a directed graph:

[diagram: the party network as a directed graph]

Much better!

All the variables here are binary and we can put actual numbers to this network as follows:

[tables: conditional probabilities for each node of the network]

Explaining some of the numbers, the entry for the (P)arty node indicates that there is a probability of 0.1 we’re going to have a party, and this is uninfluenced by anything else. Looking at the entry for the (G)as up node, we see that whether we gas up depends on the states of F and D. For example, if both F and D are false, there is a 0.1 probability of needing to gas up, and if F is true and D is false, there is a 0.6 probability of needing to gas up, and so on.

Intuitively, for a pair of nodes A and B, there is a linkage if the probability of the output depends on the state of the input(s). So, in the truth tables above I’ve made every row different. If, for example, the probability that B would take the state “T” is 0.2 regardless of whether A was “T” or “F” then we can intuit that A doesn’t influence B and there is no line from A to B. If, on the other hand, B takes the state “T” with p=0.2 if A is “T” and with p=0.8 if A is “F” then we can see that the probability B takes the state “T” depends on A.
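In code, one of these truth tables is just a mapping from the parent's state to the probability of coming up “T”. A toy sketch (the node names and numbers here are made-up illustrations, not from the gist):

```python
import random

# Conditional probability table for a node B with a single parent A:
# maps the parent's state to P(B = "T"). Values are illustrative.
cpt_B = {True: 0.2, False: 0.8}

def sample_node(cpt, parent_state):
    """Draw the node's state: True with the probability the CPT assigns."""
    return random.random() < cpt[parent_state]

# A "flat" table like {True: 0.2, False: 0.2} would mean A has no
# influence on B, i.e. no arrow from A to B in the graph.
```

Chaining such samples, parent before child, simulates the whole network.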

Before we start to look at our kid’s party network, I’d like to examine three basic cases Bishop describes in his book, and try to develop intuitions for them, all involving just three nodes. For each of these cases I’m going to run some simulations and print results using the code given in this gist, in case you want to redo the experiments to your satisfaction.

The only thing to know going forward is that I’ll be using the phi coefficient to quantify the degree of correlation between pairs of nodes. Phi is a little trickier to interpret than Pearson’s r, but in general a phi of 0 indicates no correlation, and larger magnitudes indicate more correlation.
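For two binary variables, phi falls straight out of the 2×2 contingency table. Here's a little helper of my own (not from the gist) that I'll lean on below:

```python
import numpy as np

def phi(a, b):
    """Phi coefficient between two boolean arrays."""
    a = np.asarray(a, bool)
    b = np.asarray(b, bool)
    n11 = np.sum(a & b)    # counts for each cell of the 2x2 table
    n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b)
    n00 = np.sum(~a & ~b)
    num = float(n11 * n00 - n10 * n01)
    den = np.sqrt(float(n11 + n10) * float(n01 + n00) *
                  float(n11 + n01) * float(n10 + n00))
    return num / den if den else 0.0
```

Identical arrays give phi = 1, complementary arrays give phi = -1, and unrelated arrays hover around 0.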

The first case is the tail-to-head configuration:

[diagram: A → C → B]

Say A=”T” with p=0.5 (i.e. a fair coin), C=”T” with p=0.7 if A=”T” and p=0.3 otherwise, and B=”T” with p=0.7 if C=”T” and p=0.3 otherwise. Here “B” is related to “A” through an intermediary node “C”. Our intuition tells us that A influences C, C influences B, and so A has some influence on B. A and B are correlated (or dependent).

Now, here is the interesting thing: suppose we “observe C”, which effectively means, we do a bunch of experimental runs to get the states of A, C and B according to the probability tables and then we pick just those experiments where C = “T” or C = “F”. What happens to the correlation of A and B?

If we take just the experiments where C=”T”, while we do have a bias in A (because C comes up “T” a lot more times when A is “T”) and a bias in B (because B comes up “T” a lot more times when C is “T”) the two are actually not related anymore! This is because B is coming up “T” with a fixed 0.7 probability REGARDLESS OF WHAT A was. Yes, we have a lot more A=”T”, but that happens independently of what B does.

That’s kind of cool to think about. Bishop, of course, has algebra to prove this, but I find intuition to be more fun.

As a verification, running this configuration 100000 times I find the phi between A & B = 0.1638, while with C observed to be “T” it is -0.00037 and with “F” it is 0.0024. Observing C decorrelates A and B. MATH WORKS!
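If you'd rather not dig through the gist, a sketch of that verification looks like this (probability values as in the example above; the phi helper is my own):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
A = rng.random(n) < 0.5                     # fair coin
C = rng.random(n) < np.where(A, 0.7, 0.3)   # C tracks A
B = rng.random(n) < np.where(C, 0.7, 0.3)   # B tracks C

def phi(a, b):
    n11 = np.sum(a & b); n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b); n00 = np.sum(~a & ~b)
    num = float(n11 * n00 - n10 * n01)
    den = np.sqrt(float(n11 + n10) * float(n01 + n00) *
                  float(n11 + n01) * float(n10 + n00))
    return num / den if den else 0.0

print(phi(A, B))           # well above zero: A and B are dependent
print(phi(A[C], B[C]))     # near zero once we condition on C = "T"
print(phi(A[~C], B[~C]))   # near zero once we condition on C = "F"
```

The exact numbers wobble with the random seed, but the pattern (dependent, then decorrelated) is robust.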

As an aside, if I set up the network such that the bias of C does not change with the input from A, I get a phi of A & B = -0.00138, and a phi of A & B with C observed to be “T” = -0.00622 and “F” = 0.001665, confirming that there is never any correlation in the first place in such a “flat” network.

The second case Bishop considers is the tail-to-tail configuration:

[diagram: A ← C → B]

Here we have a good intuition that C influences A and B, causing A and B to be correlated. If C is “T” it causes A to be “T” with pA(T) and B to be “T” with pB(T), and when C is “F”, A is “T” with pA(F) and B is “T” with pB(F). These probabilities switch together as C changes, resulting in the linkage.

What happens if we observe C? Say C is “T”. Now this fixes the probability of “T” for both A and B. They may be different, but they are fixed, and the values A and B take are now independent!
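The same kind of simulation verifies this case too. A sketch, with made-up probabilities (A and B each track C, but are sampled independently given C):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
C = rng.random(n) < 0.5
A = rng.random(n) < np.where(C, 0.8, 0.2)   # A tracks C
B = rng.random(n) < np.where(C, 0.8, 0.2)   # B tracks C, independently of A

def phi(a, b):
    n11 = np.sum(a & b); n10 = np.sum(a & ~b)
    n01 = np.sum(~a & b); n00 = np.sum(~a & ~b)
    num = float(n11 * n00 - n10 * n01)
    den = np.sqrt(float(n11 + n10) * float(n01 + n00) *
                  float(n11 + n01) * float(n10 + n00))
    return num / den if den else 0.0

print(phi(A, B))          # well above zero: the common cause links them
print(phi(A[C], B[C]))    # near zero: conditioning on C breaks the link
print(phi(A[~C], B[~C]))  # near zero again
```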

The third case is very cool, the head-to-head configuration:

[diagram: A → C ← B]

We easily see that A and B are independent. What happens if we observe C? Let’s consider a concrete example with the probability table:

A B  p(T) for C
F F  0.1
F T  0.7
T F  0.3
T T  0.9

Say we run 100 trials and A and B come up “T” or “F” equally likely (p=0.5). We expect each AB pattern will occur equally often (25 times) and then the expected number of C=T states is

A B  E(C=T)
F F  0.1 * 25 =  2.5
F T  0.7 * 25 = 17.5
T F  0.3 * 25 =  7.5
T T  0.9 * 25 = 22.5

So we have an expected 50 trials where C=T. Of that subset of trials, the fraction of trials that each pattern comes up is:

A B  p(AB|C=T)
F F   2.5 / 50 = 0.05
F T  17.5 / 50 = 0.35
T F   7.5 / 50 = 0.15
T T  22.5 / 50 = 0.45

Ah! You say. Look, when A and B are independent, each of those patterns comes up equally often, but here, after we observe C, there is clearly an imbalance! How sneaky! Basically, because some patterns of AB are more likely to result in C=T, this sneaks into the statistics once we pick a particular C (those hoity-toity statisticians call this “conditioning on C”). A and B are now dependent!!
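You can check this conditioning arithmetic directly by simulation. A sketch, using the probability table above:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
A = rng.random(n) < 0.5          # two independent fair coins
B = rng.random(n) < 0.5
# P(C = T) for each (A, B) pattern, straight from the table
p_C = np.select([~A & ~B, ~A & B, A & ~B, A & B], [0.1, 0.7, 0.3, 0.9])
C = rng.random(n) < p_C

# Condition on C = T and see how often each (A, B) pattern shows up.
obs = {(a, b): np.mean((A[C] == a) & (B[C] == b))
       for a in (False, True) for b in (False, True)}
# Before conditioning every pattern is ~0.25; after, the imbalance
# worked out above (0.05, 0.35, 0.15, 0.45) shows through.
print(obs)
```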

I have to say, I got pretty excited by this. Perhaps I’m odd.

But wait! There’s more! Say C has a descendant node. Observing the descendant “biases” its ancestor C – so in a way you are partially observing C – and this, too, can cause A and B to become dependent!

Now, I was going to show you some experiments with a simulation of the kid’s party network we started with, illustrating all three conditions, plus a funky one where two paths link a pair of nodes (P and G) and observing the middle node reduces their correlation, but not all the way, because of the secondary path. Instead, I’ll leave you with the simulation code so you can play with it yourself.

Code for this little experiment is available as a github gist.

What is Mutual Information?

The mutual information between two things is a measure of how much knowing one thing can tell you about the other thing. In this respect, it’s a bit like correlation, but cooler – at least in theory.

Suppose we have accumulated a lot of data about the size of apartments and their rent and we want to know if there is any relationship between the two quantities. We could do this by measuring their mutual information.

Say, for convenience, we’ve normalized our rent and size data so that the highest rent and size are 1 “unit” and the smallest ones are 0 “units”. We start out by plotting two dimensional probability distributions for the rent and size.

[diagram: uniform joint distribution of rent and size, with marginals p(r) and p(s) on the margins]

We plot rent on the x-axis, size on the y-axis. The density – a normalized measure of how often we run into a particular (rent, size) combination, called the joint distribution p(r,s) – is actually plotted on the z-axis, coming out of the screen, forming a surface. To simplify matters, let’s assume the joint distribution here is uniform all over, so this surface is flat and at a constant height.

So, here the joint distribution of rents and sizes (p(r,s)) is given by the square (which is actually the roof of a cube, poking out) and the distribution of rents and sizes by themselves (called the marginals, because they are drawn on the margins of the joint distribution) are given by p(r) and p(s).

To recall a bit of probability, the probability of finding a house/apartment within a certain rent/size range combo is given by the volume of the plot within that range. The volume of the whole plot is therefore equal to 1, since all our data fall within this range.

The mutual information is given by the equation:
\displaystyle I(R;S) = \int \int p(r,s) \log \frac{p(r,s)}{p(r)p(s)}drds

This equation takes in our rent/size data and spits out a single number: the value of the mutual information. The logarithm is one of the interesting parts of this equation. In practice, the only effect of changing its base is to multiply your mutual information value by a constant. If you use base 2 the answer comes out in ‘bits’, which makes sense in an interesting way, as we’ll see.

Intuitively we see that, for this data, knowing the rent tells us nothing additional about the size (and vice versa).

If we work out the value of the mutual information by substituting the values for p(r,s), p(r) and p(s) into the equation above, we see that, since all these quantities are constant, we can just perform the calculation within the integral sign and multiply the result by the area of the plot (which is 1, and appears as the final ×1 term):
I(R;S) = 1 \log_2 \frac{1}{1 \times 1} \times 1 = 0

So we have 0 bits of information in this relation, which jibes with our intuition that there is no information here – rents just don’t tell us anything about size.

Now suppose our data came out like this.
[one-bit diagram]

Substituting the values we see that (noting we have two areas to integrate, each of size 1/2 x 1/2 = 1/4)
I(R;S) = 2 \log_2 \frac{2}{1 \times 1} \times \frac{1}{4} \times 2 = 1

That’s interesting. We can see intuitively there is a relation between rent and size, but what is this 1 bit of information? One way of looking at our plot is to say, if you give me a value for rent, I can tell you in which range of sizes the apartment will fall, and this range splits the total range of sizes in two. 2^1=2 so we say we have 1 bit of information which allows us to distinguish between two alternatives: large size and small size.
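For distributions like these, which are uniform over a handful of cells, the integral collapses into a sum over a joint probability table. Here's a small function of my own (the row/column layout is an assumption for illustration) applied to the one-bit example, treating each quadrant as one cell:

```python
import numpy as np

def mutual_information(Pxy):
    """MI in bits from a joint probability table (rows: rent, cols: size)."""
    Px = Pxy.sum(axis=1, keepdims=True)   # marginal over rows
    Py = Pxy.sum(axis=0, keepdims=True)   # marginal over columns
    nz = Pxy > 0                          # skip empty cells (0 log 0 = 0)
    return float((Pxy[nz] * np.log2(Pxy[nz] / (Px @ Py)[nz])).sum())

# One-bit example: all the mass sits in (low, low) and (high, high)
P = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(P))  # 1.0
```

The same function gives 2.0 for the four-cluster example and 0.0 for the uniform square.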

Interestingly, if you tell me the size of the apartment, I can tell you the range of the rent, and this range splits the total range of rents in two, so the information is still 1 bit. The mutual information is symmetric, as you may have noted from the formula.

Now, suppose our data came out like this.
[two-bit diagram]

You can see that:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits! The rents and sizes seem to split into four clusters, and knowing the rent will allow us to say in which one of four clusters the size will fall. Since 2^2=4 we have 2 bits of information here.

Now so far, this has been a bit ho-hum. You could imagine working out the correlation coefficient between rent and size and getting a similar notion of whether rents and sizes are related. True, we get a fancy number in bits, but so what?

Well, suppose our data came out like this.
[two-bit, scrambled diagram]

It’s funny, but the computation for MI comes out exactly the same as before:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits again! There is no linear relationship that we can see between rents and sizes, but upon inspection we realize that rents and sizes cluster into four groups, and knowing the rent allows us to predict which one of four size ranges the apartment will fall in.
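To see that the MI doesn't care where the clusters sit, only how rents and sizes pair up, you can shuffle the columns of a diagonal joint table (a sketch; cell values as in the four-cluster example above):

```python
import numpy as np

def mutual_information(Pxy):
    """MI in bits from a joint probability table."""
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    nz = Pxy > 0
    return float((Pxy[nz] * np.log2(Pxy[nz] / (Px @ Py)[nz])).sum())

diagonal = np.eye(4) / 4               # four clusters along the diagonal
scrambled = diagonal[:, [2, 0, 3, 1]]  # shuffle which size range each rent maps to

print(mutual_information(diagonal), mutual_information(scrambled))  # 2.0 2.0
```

The marginals and the cell masses are unchanged by the shuffle, so the MI is exactly the same, even though any linear relationship is destroyed.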

This, then, is the power of mutual information in exploratory data analysis. If there is some relationship between the two quantities we are testing, the mutual information will reveal this to us, without having to assume any model or pattern.

However, WHAT the relationship is, is not revealed to us, and we cannot use the value of the mutual information to build any kind of predictive “box” that will allow us to predict, say, sizes from rents.

Knowing the mutual information, however, gives us an idea of how well a predictive box will do at best, regardless of whether it is a simple linear model, or a fancy statistical method like a support vector machine. Sometimes, computing the mutual information is a good, quick, first pass to check if it is worthwhile spending time training a computer to do the task at all.

A note on computing mutual information

In our toy examples above it has been pretty easy to compute mutual information because the forms of p(r,s), p(r) and p(s) have been given explicitly. In real life we don’t have the distributions and all we are given is a (not large enough) pile of data. We try to estimate the functions p(r,s), p(r) and p(s) from this data on our way to computing the mutual information.

You will notice that the mutual information can never be negative. It is, in fact, the Kullback-Leibler divergence between the joint p(r,s) and the product of the marginals p(r)p(s), and that divergence is zero exactly when r and s are independent (p(r,s) = p(r)p(s)) and positive otherwise. The integrand itself can go either way at any particular (r,s), but the integral as a whole never dips below zero.

You will immediately sense the problem here. In many calculations, when we have noise in the terms, the noise averages out because the plus terms balance the minus terms. Here, the true value sits right at the floor of zero whenever the variables are independent, so noise can only push our estimate upward.

Histograms (where we take the data and bin it) are an expedient way of estimating probability distributions, and they normally work all right. But this leads us to a funny problem when computing mutual information, because of this floor-at-zero nature of the quantity.

For example, say there was really no dependence between rents and sizes, but suppose our data and our binning interacted in an odd manner to give us a pattern such as this:
[diagram: sparsely filled joint histogram with ‘holes’, marginals intact]

We can see that the marginals are not affected badly, but the joint, because it is in two-dimensional space, is filled rather more sparsely which leads to us having ‘holes’ in the distribution. If we now compute the mutual information we find that we have ended up with 1 bit of information, when really, it should be 0 bits.
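A quick sketch of this bias in action, using a plug-in estimator of my own on truly independent data:

```python
import numpy as np

def mi_plugin(x, y, bins):
    """Plug-in (histogram) estimate of MI in bits."""
    Pxy, _, _ = np.histogram2d(x, y, bins=bins)
    Pxy /= Pxy.sum()
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    nz = Pxy > 0
    return float((Pxy[nz] * np.log2(Pxy[nz] / (Px @ Py)[nz])).sum())

rng = np.random.default_rng(0)
x, y = rng.random(100), rng.random(100)  # independent: true MI is 0 bits

print(mi_plugin(x, y, 4))    # modest overestimate
print(mi_plugin(x, y, 20))   # badly inflated: 100 points rattling in 400 bins
```

The finer the binning, the more 'holes' in the joint, and the further the estimate floats above the true value of zero.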

Most attempts to address this bias in mutual information computations recognize the problem with these ‘holes’ in the joint distribution and try to smear them out using various, ever more sophisticated, techniques. The simplest way is to make larger bins (which would completely solve our problem in this toy case); other methods blur out the original data points themselves.

All of these methods, no matter how fancy, still leave us with the problem of how much to smear the data: smear too little and you inflate the mutual information; smear too much and you start to wipe it out.

Often, to be extra cautious, we do what I have known as the ‘shuffle correction’ (and which, I was told by a pro, is actually called the ‘null model’). Here you thoroughly jumble up your data so that any relationship that existed between r and s is gone. You then compute the mutual information of that jumbled-up data. You know that the mutual information should actually be zero, but because of the bias it comes out to something greater. You then compare the mutual information from the data with this jumbled one to see if there is something peeking above the bias.
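A sketch of the shuffle correction (the helper names are my own): permute one variable to destroy the pairing while keeping the marginals, and use the MI of the shuffled data as the floor to compare against.

```python
import numpy as np

def mi_plugin(x, y, bins=11):
    """Plug-in (histogram) estimate of MI in bits."""
    Pxy, _, _ = np.histogram2d(x, y, bins=bins)
    Pxy /= Pxy.sum()
    Px = Pxy.sum(axis=1, keepdims=True)
    Py = Pxy.sum(axis=0, keepdims=True)
    nz = Pxy > 0
    return float((Pxy[nz] * np.log2(Pxy[nz] / (Px @ Py)[nz])).sum())

def shuffle_floor(x, y, bins=11, reps=100, rng=None):
    """Null model: jumble the x-y pairing, keep the marginals, average the MI."""
    if rng is None:
        rng = np.random.default_rng(0)
    return float(np.mean([mi_plugin(rng.permutation(x), y, bins)
                          for _ in range(reps)]))

rng = np.random.default_rng(1)
x = rng.random(200)
y = x + 0.1 * rng.random(200)        # strongly related to x

excess = mi_plugin(x, y) - shuffle_floor(x, y)
print(excess)  # what peeks above the bias
```

The floor itself comes out well above zero even though the shuffled data carry no information, which is exactly the bias the correction is for.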

Computing mutual information and other scary things

A moderately technical writeup of some adventures in computing mutual information.

I told a colleague of mine, Boris H, of my plan to use mutual information to test data from an experiment. He encouraged me, but he also passed along a warning. “Be wary of discretization. Be very wary,” he said, or words to that effect. Boris is a professional: he knows the math so well, people pay him good money to do it. I was, like, “Yeah, yeah, whatever dude.” Long story short, always listen to the pros…

Summary: The Jack Knifed MLE gives a low bias estimate of the MI. It is easy to implement and is not too compute intensive. It is fairly robust to bin size.

In theory, mutual information is a super elegant way of uncovering the relationship between two quantities. There is a quiet elegance to its formulation

\displaystyle I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)}dxdy

Or, equivalently

\displaystyle I(X;Y) = H(X) + H(Y) - H(X,Y)

where H(X) = -\int p(x) \log p(x)dx is the entropy

These elegant formulae, on the surface, lend themselves to an easy discretization. Personally I see integrals and think – ah, I’m just going to discretize the space and replace the integrals with summations, it works for everything else.

Basically, the idea would be to take the data, which comes as pairs of numbers (x,y) and throw them into a two-dimensional histogram. Each cell of the histogram gives us an estimate of p(x,y) – the joint probability of (x,y) and summing along the rows or columns of the 2d-histogram gives us our marginal probabilities p(x), p(y) which can then be thrown in to our mutual information formula (with integrals replaced by summations). Easy Peasy.

It turns out that this particular computation – because of the form of the equations – is very sensitive to how the data are discretized. The easy formula, apparently the MLE for the entropy, has persistent biases. In all cases the number of bins used for creating the histogram plays an annoyingly critical role. If we have too few bins we underestimate the MI and if we have too many bins we over-estimate the MI. People have come up with different corrections for the bias of which I picked two to test (these formulae are nicely summarized in Paninski 2003, in the introduction).

Defining N as the number of samples and \hat{m} as the number of bins with non-zero probability we have:

Simple-minded discretization (MLE): \hat{H}_{MLE}(x) = -\sum p(x_i)log_2{p(x_i)}

Miller Madow correction (MLEMM): \hat{H}_{MLEMM}(x) = \hat{H}_{MLE}(x) + \frac{\hat{m}-1}{2N}

The Jack Knifed estimate (MLEJK): \hat{H}_{MLEJK}(x) = N\hat{H}_{MLE}(x) - \frac{N-1}{N}\sum_{j=1}^{N} \hat{H}_{MLE-j}(x) (Leave-one-out formula)

I tested the formulae by generating pairs of correlated random numbers. If X,Y are two Gaussian random variables with correlation coefficient \rho then their mutual information is given by:

\displaystyle I(X;Y) = -\frac{1}{2}\log_2(1-\rho^2)

which I use as the ground truth to compare the numerical results against.

The plot below compares the actual and estimated MI spit out by the MLE formula for various values of the correlation coefficient \rho. The actual value is shown by the blue line; the gray shaded areas show 95% of the distribution (from repeating the calculation 1000 times) for a sample size of 100 (chosen as a reasonable sample size I usually run into in my experiments). The two sets of shaded curves show the computation using 11 bins (darker gray) and 21 bins (lighter gray). As you can see, this formula sucks. It has a large, persistent bias throughout the range of correlation coefficients (the gray areas hang well above the blue line).


The MLEMM did not fare much better in some preliminary explorations I did (see below) so I went to the jack knife which performed pretty well. The Jack Knife still has a bias, but it is not very large. It is less sensitive to the binning and the bias is stable throughout the range.


I’m including this second plot, which isn’t that easy on the eyes, but it was the initial exploration that led me to look into the jack knife in more detail. The plots are distributions of the MI value computed from 1000 runs of each algorithm. The three rows are the three formulae: the MLE, the MLE with the Miller Madow correction (MLEMM) and the Jack Knifed MLE. The columns cycle through different sample sizes (10, 100, 1000), different bin counts (11, 21) and different correlation coefficients between x and y (0, 0.5). The vertical dotted line indicates the real MI. As you can see, the MLE and MLEMM have an upward bias which persists even for large sample sizes and is exacerbated by finer binning. The jack knife is impressive, but it too has an upward bias (as you saw in the previous plot).


Next steps

The next incremental improvement I would use is adaptive binning, which uses some formalized criteria (like making sure all bins have a non-zero probability) to adjust the binning to the data before calculating MI.

Further reading:

Paninski, Liam. “Estimation of entropy and mutual information.” Neural Computation 15.6 (2003): 1191-1253. – This paper was very tough for me to chew through; I used it for the formulae in the first few pages. Liam claims he has a method for correcting the MLE estimate, but it seemed like a very complicated computation.

Cellucci, C. J., Alfonso M. Albano, and P. E. Rapp. “Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms.” Physical Review E 71.6 (2005): 066208. – This was a much more readable paper than Liam’s. I’m using it to see what kind of adaptive binning I can apply to the data.

import numpy, pylab

def correlated_x(N, rho=0.5):
  r = numpy.random.randn(N,2)
  r1 = r[:,0]
  r1_ = r[:,1]
  r2 = rho*r1 + (1-rho**2)**0.5*r1_
  return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
  Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  Px = Pxy.sum(axis=1)
  Py = Pxy.sum(axis=0)
  return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
  # pylab.find has been removed from matplotlib; flatnonzero does the same job
  idx = numpy.flatnonzero(P > 0)
  return -(P.flat[idx]*numpy.log2(P.flat[idx])).sum()

def MI_mle(x,y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
  return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  Hx = H_mle(Px)
  Hy = H_mle(Py)
  Hxy = H_mle(Pxy)
  Hx_jk = 0
  Hy_jk = 0
  Hxy_jk = 0
  for n in range(x.size):
    jx = numpy.concatenate((x[:n],x[n+1:]))
    jy = numpy.concatenate((y[:n],y[n+1:]))
    Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
    Hx_jk += H_mle(Px)
    Hy_jk += H_mle(Py)
    Hxy_jk += H_mle(Pxy)

  Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
  Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
  Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

  return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = numpy.empty(b)
  for n in range(b):
    r1,r2 = correlated_x(N, rho)
    mi[n] = func(r1,r2,bins=bins)
  mi.sort()  # needed so the percentile indices below are meaningful
  med_idx = int(b*0.5)
  lo_idx = int(b*0.025)
  hi_idx = int(b*0.975)
  return mi[lo_idx], mi[med_idx], mi[hi_idx]

b = 1000
Ni = 20
rho = numpy.linspace(0,.99,Ni)
mi_exact = MI_exact(rho)
N = 100
cols = [(.5,.5,.5), (.7,.7,.7)]
for lab, func in zip(['MI MLE', 'MI Jack Knife'], [MI_mle, MI_mle_jack_knife]):
  for k, bins in enumerate([11,21]):
    mi_est = numpy.empty((Ni,3))
    for n in range(rho.size):
      mi_est[n,:] = runmany(func, N=N, rho=rho[n], bins=bins, limits=[-4,4], b=b)
    pylab.fill_between(rho, mi_est[:,0], mi_est[:,2],color=cols[k],edgecolor='k',alpha=.5)
    pylab.plot(rho, mi_est[:,1], 'k', lw=2)
    pylab.plot(rho, mi_exact, 'b',lw=2)

The code for the ugly, detailed plot:

import numpy, pylab

def correlated_x(N, rho=0.5):
  r = numpy.random.randn(N,2)
  r1 = r[:,0]
  r1_ = r[:,1]
  r2 = rho*r1 + (1-rho**2)**0.5*r1_
  return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
  Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
  N = float(Nxy.sum())
  Pxy = Nxy/N
  Px = Pxy.sum(axis=1)
  Py = Pxy.sum(axis=0)
  return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
  # pylab.find has been removed from matplotlib; flatnonzero does the same job
  idx = numpy.flatnonzero(P > 0)
  return -(P.flat[idx]*numpy.log2(P.flat[idx])).sum()

def H_mm(P,N):
  m = numpy.count_nonzero(P > 0)  # number of bins with non-zero probability
  return H_mle(P) + (m-1)/(2.0*N)

def MI_mle(x,y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
  return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_miller_madow(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  return H_mm(Px, N) + H_mm(Py, N) - H_mm(Pxy, N)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
  Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
  Hx = H_mle(Px)
  Hy = H_mle(Py)
  Hxy = H_mle(Pxy)
  Hx_jk = 0
  Hy_jk = 0
  Hxy_jk = 0
  for n in range(x.size):
    jx = numpy.concatenate((x[:n],x[n+1:]))
    jy = numpy.concatenate((y[:n],y[n+1:]))
    Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
    Hx_jk += H_mle(Px)
    Hy_jk += H_mle(Py)
    Hxy_jk += H_mle(Pxy)

  Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
  Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
  Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

  return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = numpy.empty(b)
  for n in range(b):
    r1,r2 = correlated_x(N, rho)
    mi[n] = func(r1,r2,bins=bins)
  return mi

def plot_bias_variance(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
  mi = runmany(func, N=N, rho=rho, bins=bins, limits=limits, b=b)
  pylab.hist(mi, bins=41)  # distribution of the MI estimate
  pylab.plot([MI_exact(rho),MI_exact(rho)], pylab.getp(pylab.gca(),'ylim'), 'k:',lw=3)

b = 1000
for i, rho in enumerate([0,0.5]):
  for j, N in enumerate([10,100,1000]):
    for k, bins in enumerate([11,21]):
      c = 1+k+j*2+i*6
      pylab.subplot(3,12,c); plot_bias_variance(MI_mle,rho=rho, N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('MLE')
      pylab.subplot(3,12,12+c); plot_bias_variance(MI_mle_miller_madow,rho=rho,N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('MillerMadow')
      pylab.subplot(3,12,24+c); plot_bias_variance(MI_mle_jack_knife,rho=rho,N=N,bins=bins,b=b)
      if c == 1: pylab.ylabel('JackKnife')


Parzen windows for estimating distributions

Part of a set of moderately technical writeups of some adventures in computing mutual information for neural data.

Often, for example when you are computing mutual information, you need to estimate the probability distribution of a random variable. The simplest way, which I had done for years, is to create a histogram: you take each sample and put it into a bin based on its value. Then you can use one of several tests to check whether the shape of the histogram deviates from whatever distribution you are interested in.

When you don’t have enough data (aka Always) your histogram comes out jagged (Almost everyone who’s ever made a histogram knows what I’m sayin’). In undergrad stats I learned that 11 was a nice number of bins, and indeed both Matplotlib and MATLAB seem to have that as the default. What I ended up doing was plotting the data using various bins until, by inspection, I was satisfied by the smoothness of the histogram.

Turns out mathematicians have a whole cottage industry devoted to formalizing how many bins your histogram should have. The relevant keyword is (imaginatively enough) the “histogram problem”. I ran into this term while reading a Navy technical report by Cellucci et al found here. That paper has references to a bunch of dudes who worked on the problem. Anyhow, there are lots of complex schemes, but I liked Tukey’s formula, which was n^{1/2} (n being the number of data points).
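As an aside, numpy now ships several of these bin-count rules. Here is a quick sketch (illustrative random data, my own check, not from the Cellucci report) comparing Tukey’s square-root rule with numpy’s built-in ‘sqrt’ estimator:

```python
import numpy

numpy.random.seed(0)
data = numpy.random.randn(1000)

# Tukey's rule: number of bins ~ n**0.5
tukey_bins = int(round(data.size ** 0.5))

# numpy implements the same idea as the 'sqrt' bin estimator
edges = numpy.histogram_bin_edges(data, bins='sqrt')
print(tukey_bins, edges.size - 1)  # both give 32 bins for n=1000
```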

This post however, is not about binning. It’s about Parzen windows to get rid of binning altogether. There is a deep theoretical underpinning to Parzen windowing, but intuitively I understand Parzen windowing as a smoothing operation that takes the samples we have and creates a smooth distribution out of the points.

Here’s how I understand Parzen windows: each sample creates a splash – its own little gaussian (apparently you can also use boxcar windows, or whatever window has a nice property for your problem). This is the Parzen window. As a result, the sample is no longer tightly localized but has a bit of a blur to it. We then add up all the blurs to create a smoothed curve/surface, which is our estimate of the pdf of the samples. With a judicious choice of the width of the blurring and proper normalization of the height of each gaussian we can come up with a sensible pdf estimate. The advantage of this is that you now have a continuous function representing the pdf, which you can integrate.

Formally (I referred to a paper by Kwak and Choi – Input Feature Selection by Mutual Information based on Parzen window) the Parzen window estimate of the pdf is given by

\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n} \phi(x-x_i,h)

where \phi is the window function, and I used a gaussian for that. As you can see, the density estimate at any given point x is given by the sum of gaussians centered around each data point x_i, with the width of the gaussian given by h. The larger h is, the more washed out the estimate; the smaller h is, the more jagged it becomes.

We seem to have transferred our problem of finding an appropriate bin width (bin count) to finding an appropriate smoothing constant (What! You expected a free lunch!?). I used what wikipedia calls Silverman’s rule of thumb: h=1.06\hat{\sigma}n^{-1/5}.
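To make this concrete, here is a minimal sketch of the estimator (mine, not from the Kwak and Choi paper) with a gaussian window and Silverman’s h, sanity-checked against the standard normal density:

```python
import numpy

def parzen_pdf(x, samples, h):
    # Density estimate at x: average of gaussian bumps of width h, one per sample
    x = numpy.atleast_1d(x)[:, None]                  # evaluation points as a column
    k = numpy.exp(-(x - samples) ** 2 / (2 * h ** 2))  # one gaussian per data point
    return (k / (h * numpy.sqrt(2 * numpy.pi))).mean(axis=1)

numpy.random.seed(0)
samples = numpy.random.randn(10000)
h = 1.06 * samples.std() * samples.size ** (-1 / 5)   # Silverman's rule of thumb
print(parzen_pdf(0.0, samples, h))  # should come out near 1/sqrt(2 pi) ~ 0.399
```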

Here is a fun little animation showing how the Parzen window estimate of a pdf (thin black line) matches up with the actual pdf (thicker blue line). The histogram of the actual data points are shown in light gray in the background.

The interesting things about this smoothing are that there is no binning involved – the final curve depends only on the actual data samples – and that we make no strong assumptions about the pdf of the data – it’s not like we are trying to fit a model of the pdf to the data. Here is an animation of the exact same technique being used to fit a uniform distribution. We could have done better with a different choice of window width, more tuned to the distribution, but the idea is that it still works even if we have no idea what the actual pdf looks like.

These are also known as mixture of Gaussians or mixture decompositions.

During my web-searches I ran across this nice set of lecture slides about estimating pdfs from a prof at TAMU.

import numpy, pylab, matplotlib.animation as animation

root_two_pi = (2*numpy.pi)**0.5
parzen_est = lambda x,X,h,sigma: numpy.exp(-(X-x)**2/(2*h**2*sigma**2)).sum()/(root_two_pi*h*sigma*X.size)
gaussian_pdf = lambda x: (1/(2*numpy.pi)**.5)*numpy.exp(-x**2/2)
def uniform_pdf(x):
  p = numpy.zeros(x.size)
  p[(-2 < x) & (x < 2)] = 0.25
  return p

def init():
  fig = pylab.figure(figsize=(3,6))
  ax = pylab.axes(xlim=(-5, 5), ylim=(-0.5, 0.7))
  return fig, ax

def animate(i, r, ax, distr):
  this_r = r[:i+2]
  h = 1.06*this_r.std()*this_r.size**(-.2)  # Silverman's rule of thumb
  lim = [-5,5]
  bins = 51
  pylab.text(-.4,-.025,'n=' + str(this_r.size))
  pylab.hist(this_r, bins=bins, range=lim, density=True, edgecolor=(.9,.9,.9), color=(.8,.8,.8), histtype='stepfilled')
  pylab.plot(x, distr(x), color=(.1,.1,1.0), lw=5)
  pylab.plot(x, [parzen_est(xx, this_r, h, this_r.std()) for xx in x], 'k', lw=3)
  pylab.setp(ax, xlim=(-5, 5), ylim=(-0.05, 0.7))

N = 200
x = numpy.linspace(-5, 5, 101)  # grid on which we evaluate the estimated pdf
r0 = numpy.random.randn(N); distr = gaussian_pdf
#r0 = numpy.random.rand(N)*4-2; distr = uniform_pdf

fig, ax = init()
anim = animation.FuncAnimation(fig, animate, fargs=(r0, ax, distr), frames=N-2, interval=2, repeat=False)
anim.save('parzen_smoothing.mp4', fps=30, extra_args=['-vcodec', 'libx264'])

What is machine learning, and why should I care?

This is a (longish) informal intro to machine learning aimed at Biology/Neuroscience undergraduates who are encountering this for the first time in the context of biological data analysis.

We are, as you know, furiously lazy. It’s a biological adaptation: we always try to find the path of least resistance. Only people with distorted senses of reality, like Neurobio students, make things harder for themselves. Naturally, we dream of making machines do not just the easy work, like cutting our lawn, harvesting our crops and assembling our digital cameras, but also the hard work, like picking stocks, diagnosing our diseases and analyzing all our data.


Categorization – making discrete decisions based on complex information – is at the heart of many interesting things we do, which are collectively called decision making. For example, we have an x-ray done and the radiologist examines the picture and determines if there is cancer or not. There are many, many input factors that go into this determination but the output is relatively simple: it is either ‘Yes’ or ‘No’. A lot of machine learning is devoted to understanding and implementing this kind of decision making in computers.

Machine learning = Learning by examples

When we humans do half-way interesting things, like read, make coffee and do math, we attribute that to learning. We learned how to look at squiggles on a piece of paper and interpret them as information from somebody else’s mind, usually indicating an event or thought. Often we learn things by example: we are exposed to things, like words, and we are given examples of what they sound like and what they represent and we ‘pickup’ the words, their sounds and their meanings.

Machine learning is a set of statistical techniques that take inspiration from this learning-by-example and try to mimic this process on computers, thereby realizing our dream of a perfect society where we’ll all be chilling at the beach while the machines do all the work.

The basic idea is that we collect a set of examples (“samples”) and their meanings (“category”). Right now, we are doing baby steps, so the meaning (category) is restricted to simple concepts that have been reduced to a single label. The hope is that if we show the machine enough examples we can let it loose on the world and when it finds new things it can look to its book of samples and match up the new thing it sees with what it already has seen and then take a decision: This is a ‘tree’, this is ‘euler’s equation’, ‘you need to exercise more’.

Supervised learning

Suppose I want to teach the machine what a cat looks like. I collect a huge bunch of cat pictures and label them “cat”. We show the computer the pictures (more on that later) and tell it “These are cats”. Cool! Now the computer knows what a cat looks like.

But, what happens when it sees a picture of a house, or a tree? It doesn’t know what a cat doesn’t look like. Some people will say they have a secret method where it’s possible to show the computer just pictures of cats, and when a non-cat comes along, the computer will know the difference. We are more cautious. For good measure we throw in a huge collection of pictures that are not cats and tell the computer about that too, so it knows to look for the difference.

This is called supervised learning, because we are supervising the computer by telling it what picture is what.

Unsupervised learning: the height of laziness

Humans learn many things by just observing. In machine learning there are techniques, called unsupervised learning, where we don’t even bother to label the examples. We just dump a whole lot of data into the computer and say to it “Here’s all the information, you sort it out, I’m off camping.” The computer patiently goes through the samples, finds differences between them, and sorts them into categories it comes up with on its own. This is a powerful method, but as you can imagine, without supervision, the results can be quite hilarious. In general, the computer, being prone to over-analysis, tends to find tiny differences between samples and to break them up into many, many categories. Without some kind of intervention the computer will patiently put each sample into its own separate category and be quite satisfied in the end.

How is this different from more traditional computing?

In traditional computing we do all the ‘thinking’ for the computer and program a specific algorithm (a clear cut set of computing steps) based on an explicit mathematical formula. We are the ones who come up with the core process of how to solve the problem. For example, say we are trying to get a computer to drive a car and we want it to stop when it sees a red stop sign. We think hard and say, “The most salient thing about a stop sign is the insane amount of red color it has. We’ll program the computer to stop the car when it sees lot of red.”

If we used a machine learning approach, we would say, “Why should I do all the thinking? I want to hang out with my homies. I’ll just show the computer a huge scrapbook of stop signs, yield signs, pedestrian crossing signs and let the computer figure it out. The only work I’ll do is put the signs into the correct categories first. If I’m in a hurry, I might not even do that.”

“Learning” = finding similarities in the features

How do we distinguish cats from, say, fire hydrants? We’ve found things (features) that are common to cats that are not present in fire hydrants. More precisely, we’ve found combinations of features that are very likely to make up a cat and very unlikely to make up a fire hydrant.

So how and what does the computer learn? The details of this depend on which exact machine learning method you use from the veritable zoo of machine learning methods. In general, the input that you give to the computer is converted to a list of numbers called a feature vector. It can be as simple as taking a picture of a cat, rearranging all the pixels into one long row and then feeding in this row of pixels as numbers.

So how does the computer learn to use this feature vector to learn what is a picture of a cat and what isn’t? You would expect that the feature vector taken from the picture of the cat is similar is some sense to vectors from pictures of other cats, and different from vectors of fire hydrant pictures. Machine learning algorithms use different mathematical formulae to automatically find the similarity between cat vectors and the differences between cat and fire hydrant vectors.

When we show the computer a new picture, the computer converts the picture into a feature vector,  refers to its giant book of existing feature vectors and asks the question “Is this new vector more similar to a cat vector or a something else in my book?” and bases its decision on that.
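A caricature of that decision rule is nearest-neighbour matching. The sketch below (toy data, two-“pixel” pictures, my own illustration rather than any particular production algorithm) shows the look-it-up-in-the-book idea:

```python
import numpy

def nearest_label(new_vec, book_vecs, book_labels):
    # Classify by the single most similar stored feature vector (1-nearest-neighbour)
    d = ((book_vecs - new_vec) ** 2).sum(axis=1)  # squared distance to every stored vector
    return book_labels[d.argmin()]

# Toy "book": cats cluster near (0,0), fire hydrants near (5,5)
book = numpy.array([[0.1, 0.2], [0.0, -0.1], [5.1, 4.9], [4.8, 5.2]])
labels = numpy.array(['cat', 'hydrant', 'hydrant', 'cat'][::-1])  # hypothetical labels
labels = numpy.array(['cat', 'cat', 'hydrant', 'hydrant'])
print(nearest_label(numpy.array([0.3, 0.1]), book, labels))  # -> 'cat'
```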

Feature selection

Most things in the world have many, many features. A cat’s features would, for example, include the shape and size of the ears, the color of the eyes, the color and length of the fur and so on. If we were distinguishing between fire hydrants and cats we probably don’t have to look in great detail into the features of both; just a handful, such as color and overall shape, will do. If we are trying to distinguish between breeds of cats, however, we probably need to delve in great detail into a lot of the features.

Deciding which features to use has a great impact on our ability to teach the computer to perform categorization successfully. Sometimes we hand select the features, where we have a good idea of what things distinguish the categories we care about. Other times, we get lazy and, once again, let the computer sort it out.

Dirty little secrets: Overtraining/overfitting and the curse of dimensionality

A fallout 3 in-game poster from destructoid

As you may have noticed, this rambling piece of text bears a suspicious resemblance to the rose-tinted predictions of the 60s, of how robots were going to do all our chores for us as we lay about in our bathrobes, smoking our pipes and ordering trifles by mail. If machine learning is so great, where are our jet-packs, flying cars and domestic robots?

Well, two things. First, the future is here, but it’s a lot more profane than the romantics of the 1960s depicted. Automated algorithms track our spending habits and place annoying ads on our social media pages. The US postal service uses robots that can read our scraggly handwriting to route our birthday card to mom. UPS fired all the human telephone operators and now you can speak in your tracking number and the computer will take it down despite your thick southern accent.

Secondly, all this glib talk about “Letting the computer sort things out” should have you on your guard. After all, we all know about skynet. There are two main problems with having machines do all the work.


The first is called overfitting. This is where the computer gets too fixated on the details of the input and builds over-elaborate classification rules. Basically, we give the computer 1000 pictures of cats, and the computer builds a super elaborate library memorizing every teeny detail of those 1000 pictures of cats. So now we show it the 1001st picture of a cat and because the picture is a little bit different from each of the others the computer says “That picture is not an exact match to any of my 1000 cat pictures, so that’s not a cat, it must be a fire hydrant.”

We can diagnose overfitting by doing something called cross-validation. Though this sounds like some kind of new-agey pop-psychology mumbo-jumbo, the concept is simple: you simply don’t test your students using the same question set that you trained them on. We take our set of 1000 cat pictures and divide it up into two sets. The first set we call the training set (say 700 pictures). We let the computer learn using those 700 pictures. Then, we pull out the other 300 pictures that the computer has never seen before and make the computer classify those. The idea is that if the computer has done some kind of rote memorization trick and has not understood the essence of cat, it’s going to do rather poorly on them while it does spectacularly well on the original 700. This is how we catch computers and other students out.
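The splitting itself is mundane. A sketch of a 700/300 split (the hypothetical sizes from the example above) in plain numpy:

```python
import numpy

def split_indices(n_samples, n_train, seed=0):
    # Shuffle indices once, then split: first n_train for training, rest held out
    rng = numpy.random.RandomState(seed)
    idx = rng.permutation(n_samples)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_indices(1000, 700)
print(train_idx.size, test_idx.size)  # 700 300
```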

But how do we ensure that the computer does not do rote memorization in the first place? That’s a pretty problem. There are no easy answers, but one thing we try is to figure out ways the computer could rote-learn and penalize it. If, for example, the computer is using too many features, we say “Bad computer, use fewer features.” But this is all very subjective, trial and error, and domain specific. Our only objective measure of success is cross-validation.

The curse of dimensionality

This is very scary, but I’m sorry I have to tell you about it. Remember when we talked about feature vectors? A feature vector is a list of numbers that describes the input. The more complex the nature of the input the longer each feature vector (the more numbers in the list).

Suppose we have been given dossiers on 20 people and we are trying to figure out who to pick for the basketball team. None of these people have played sports of any kind before, so all we have are physical attributes and a bunch of other things like hometown, what color car they drive, how many times they flunked math and so on and so forth. The only other thing we have is a similar dossier on 50 other folks from last year and a notation that says “Good player” or “Bad player”. How do we do this?

We start off simple. We pick the physical attribute “Height” and look at last year’s dossiers. Interestingly, when we arrange the 50 previous players by height it turns out most everyone above 5’8″ is a good player and most everyone below that height is a bad player. So we sort this year’s list into ‘Good’ and ‘Bad’ piles based on height. We send the kids out to play and it turns out our picks are pretty good.

Well, the coach comes back and says, can we do better? We say, hell, if we did so well with just one feature (height) why don’t we toss ALL the information in the dossier into the comparison. It can only get better right?

So we start to include everything we have. Pretty soon a cool pattern emerges. All the tall players who scored above D+, who live in a yellow house and drive a red car, AND all the tall players who scored above B+ and live in a blue house and drive a yellow car, are good players. Man, ring up that Nate Gladwell guy, he’ll want to know about this! We split the new players up according to this criterion and wait to hear the good news from the coach.

Next week the coach storms in and chews us out. He tells us we’re a bunch of dunder heads, it’s such a mixed bag of players his pet cat could have picked a better team, and we are FIRED!

(My apologies to people who realize I know nothing about sports)

What happened? How did we fail so badly? Does this remind you of the overfitting problem, where our computer got too hung up on the details? Well, yes. It’s kind of the same thing. When you add more features (also called dimensions of the feature vector – how many numbers there are in the list) you need many, many more sample points. If we had enough dossiers those phoney correlations between playing ability and house color would have been drowned out. However, because we had relatively few dossiers we started to get hung up on coincidences between things. The more features we pick from the dossier, the more coincidences we find.
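You can watch this happen in a simulation. In the sketch below (my own toy numbers, fixed seed for reproducibility) neither the features nor the “good player” labels carry any real information, yet the best of 1000 junk features correlates impressively with playing ability, while the best of a mere 5 looks modest:

```python
import numpy

numpy.random.seed(42)
n_dossiers = 50                              # few samples, like our 50 players
labels = numpy.random.randn(n_dossiers)      # 'playing ability' -- pure chance here

features = numpy.random.randn(n_dossiers, 1000)  # 1000 totally irrelevant features

def max_abs_corr(F):
    # Best-looking (most correlated) feature in the set F
    return max(abs(numpy.corrcoef(labels, F[:, j])[0, 1]) for j in range(F.shape[1]))

print(max_abs_corr(features[:, :5]))   # best of 5 junk features
print(max_abs_corr(features))          # best of 1000 junk features: looks 'significant'
```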

Statisticians, known for their rapier wit and general good humor, call this the curse of dimensionality.

As applied to Neuroscience: An information measure

As a neuroscientist, or neurophysiologist or general biology student why should you care? Well, these statisticians are encroaching on your turf. Statisticians can smell data a mile away, and biology absolutely reeks of it.

Using fMRI studies as an example, traditionally we have looked at each voxel individually and done a simple statistical test to answer the question “Is the activity in this cubic inch of the brain significantly different from chance”. Using somewhat dodgy corrections for multiple comparisons we then create a picture of the brain with all the significantly active parts of the brain colored in, and we draw deep conclusions like “The visual cortex is more active when the person looks at visual stimuli” or “The motor cortex lights up when the subject does motor actions”.

When the statisticians sneak in with forged immigration papers, things get more wild. The statisticians are not happy with these stodgy answers, like this brain region is differentially active during a working memory task. They go straight to the heart of why anybody would do brain science in the first place: “Can I read your mind and guess which card you are holding?”

And the plan they have is to use these machine learning techniques we just discussed to answer this question. A popular method is to take those fMRI pictures of the brain, much like the ones you find on the internet and feed them into the computer, just as we talked about, along with some category information “Looking at cat”, “Looking at fire-hydrant” and so on.

Then you test the computer (recall cross-validation): show it an fMRI and ask it to guess what the person was thinking/seeing/doing when the fMRI was done.

The result is a value, a percentage correct. This percentage correct ranges from chance to 100% and is a crude measure of how much information there is in the brain about a certain thing, like a stimulus or a mood, and, if we partition the fMRI, we can answer questions like how much information there is in the prefrontal cortex, in the basal ganglia, in the thalamus and so on.

Again, this differs from traditional methods of analyzing fMRI data only in that we don’t fix what exactly in the data will give us the answer. We let the computer figure it out. As we learned, this is exciting, but also dangerous if we let the computer totally loose on the data (recall the curse of dimensionality, overfitting and whatnot).

It is also important to remember that this is not a statement on the nuts and bolts of HOW the brain is processing the information, merely a statement on how MUCH information there is (possibly) in some part of the brain that the rest of the brain could use.

The curse of D- and the LDA

All you dataheads know the curse whose name must not be spoken. The curse of D(imensionality)! Let’s look at how the curse sickens us when we perform Linear Discriminant Analysis (LDA). Our intuition, when we perform LDA, is that we are rotating a higher dimensional data space and casting a shadow onto a lower dimensional surface. We rotate things such that the shadow exaggerates the separation of data coming from different sources (categories) and we hope that the data, which may look jumbled in a higher dimension, actually cast a shadow where the categories are better separated.

I have some data that can be thought of as having a large number of dimensions, say about a hundred. (Each dimension is a neuron in the brain, if you must know). I know, from plotting the data from the neurons individually, that some neurons just don’t carry any information I’m interested in, while others do. I’m interested in the question: if I combine multiple neurons together can I get more information out of them than if I look at them individually. It is possible to construct toy scenarios where this is true, so I want to know if this works in a real brain.

A question quickly arose: which neurons should I pick to combine? I know that by using some criteria I can rank the neurons by informativeness and then pick the top 50% or 30% of the neurons to put together. But what happens if I just take all the neurons? Will LDA discard the useless noisy ones? Will my data space be rotated so that these useless neurons don’t cast any shadow and are eclipsed by the useful neurons?

This is not entirely an idle question, or a question simply of data analysis. The brain itself, if it is using data from these neurons, needs some kind of mechanism to figure out which neurons are important for what. I find this is an important question and I don’t think we are sure of the answers yet.

However, back to the math. We can generate a toy scenario where we can test this. I created datasets with 10, 25 and 50 dimensions. Only one dimension is informative, the rest are noise. Data from the first dimension come from two different classes and are separated. What happens when we rotate these spaces such that the points from the two classes are as well separated as possible?

The plot below shows the original data (blue and green classes). You can see that the ‘bumps’ are decently separated. Then you can see the 10d, 25d and 50d data.

Wow! Adding irrelevant dimensions sure helps us separate our data! Shouldn’t we all do this, just add noise as additional dimensions and then rotate the space to cast a well separated shadow? Uh, oh! The curse of D- strikes again!

We aren’t fooled though. We know what’s going on. In real life we have limited data. For example, in this data set I used 100 samples. Our intuition tells us that as our dimensions increase the points get less crowded. Each point is able to nestle into a nice niche in a higher dimension, further and further away from its neighbors. It’s like the more dimensions we add, the more streets and alleys we add to the city. All the points no longer have to live in the same street. They now have their own zip codes. (OK I’ll stop this theme now)
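The depopulation of the streets is easy to demonstrate: keep the number of points fixed, crank up the dimensions, and watch the average nearest-neighbour distance grow. A quick sketch (my own toy example, fixed seed):

```python
import numpy

def mean_nn_distance(D, n=100, seed=0):
    # Average distance from each of n points (standard normal, D dims) to its nearest neighbour
    rng = numpy.random.RandomState(seed)
    X = rng.randn(n, D)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # all pairwise squared distances
    numpy.fill_diagonal(d2, numpy.inf)                   # ignore self-distance
    return numpy.sqrt(d2.min(axis=1)).mean()

print(mean_nn_distance(2), mean_nn_distance(50))  # neighbours drift apart as D grows
```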

Poor old LDA knows nothing about this. LDA simply picks up our space and starts to rotate it and is extremely happy when the shadow looks like it’s well separated and stops. The illusion will be removed as soon as we actually try to use the shadow. Say we split our data into test and train sets. Our train set data look nicely separated, but the moment we dump in the test data: CHAOS! It’s really jumbled. Those separate zipcodes – mail fraud! Thanks to the curse of D-

Code follows:

import pylab
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# (in older scikit-learn this lived at sklearn.lda.LDA)

def myplot(ax, F, title):
  bins = pylab.arange(-15,15,1)
  N = F.shape[0]
  ax.hist(F[N//2:,0], bins, histtype='step', lw=3)
  ax.hist(F[:N//2,0], bins, histtype='step', lw=3)
  ax.set_title(title)

D = 50 #Dimensions
N = 100 #Samples
F = pylab.randn(N,D)
C = pylab.zeros(N)
C[:N//2] = 1 #Category vector
F[:,0] += C*4 #Adjust 1st dimension to carry category information

lda = LDA(n_components=1)
fig, ax = pylab.subplots(4,1,sharex=True, figsize=(4,8))

myplot(ax[0], F, 'original')
F_new = lda.fit_transform(F[:,:10],C) #Ten dimensions
myplot(ax[1], F_new, '10d')
F_new = lda.fit_transform(F[:,:25],C) #25 dimensions
myplot(ax[2], F_new, '25d')
F_new = lda.fit_transform(F[:,:50],C) #50 dimensions
myplot(ax[3], F_new, '50d')
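To watch the illusion collapse, hold out half the data. This is a sketch of that cross-validation check, using scikit-learn’s current module path and a deliberately weaker signal than above (shift of 1 instead of 4, my choice, so the overfitting dominates); the class separation of the projection looks much better on the training half than on the held-out half:

```python
import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

numpy.random.seed(0)
D, N = 50, 200
F = numpy.random.randn(N, D)
C = numpy.zeros(N); C[:N//2] = 1
F[:, 0] += C * 1.0                            # weak real signal in dimension 0 only

train, test = slice(0, N, 2), slice(1, N, 2)  # alternate samples: train / held out
lda = LinearDiscriminantAnalysis(n_components=1).fit(F[train], C[train])

def gap(S, y):
    # Class-mean separation of the 1-d projection, in units of its std
    p = lda.transform(S)[:, 0]
    return abs(p[y == 1].mean() - p[y == 0].mean()) / p.std()

print(gap(F[train], C[train]), gap(F[test], C[test]))  # train looks better than test
```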

Slicing high dimensional arrays

When we learned about matrices it wasn’t hard to think of slicing 2D matrices along rows and columns. But what about slicing higher dimensional matrices? I always get confused when I go above three dimensions. Interestingly, for me, an easy way to imagine slicing such high dimensional matrices is by using trees and thinking of how the numbers are stored in a computer.

Consider a 2x2x2x2 matrix, something we can construct as follows:

import numpy
a = numpy.arange(16)
b = a.reshape((2,2,2,2))

a -> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

b ->
array([[[[ 0,  1],
         [ 2,  3]],

        [[ 4,  5],
         [ 6,  7]]],

       [[[ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15]]]])

Now imagine the matrix is stored in the computer as a simple sequence of numbers (this is kind of true). What makes these numbers behave as a matrix is the indexing scheme. The indexing scheme can be thought of as a kind of tree.


Each level of the tree depicts a dimension, with the root being the outermost dimension and the leaves the inner-most dimensions.

When we slice the array, say by doing

b[0,:,:,:] ->
array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

We are actually breaking off a branch of the tree and looking at that.


The same with deeper slices, such as

b[0,0,:,:] ->
array([[0, 1],
       [2, 3]])


But what about a slice that starts at the leaves, rather than the roots of the tree?

b[:,:,:,0] ->
array([[[ 0,  2],
        [ 4,  6]],

       [[ 8, 10],
        [12, 14]]])

In such a case I think in terms of breaking off unused branches/leaves and then merging the stubs to form a new tree.
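The branch-vs-leaf picture lines up with how numpy actually stores the numbers (row-major order with strides). A sketch:

```python
import numpy

a = numpy.arange(16)
b = a.reshape((2, 2, 2, 2))

# Slicing at the root breaks off one contiguous branch...
print(b[0].ravel())        # [0 1 2 3 4 5 6 7]

# ...while slicing at the leaves takes one leaf from every branch
print(b[..., 0].ravel())   # [ 0  2  4  6  8 10 12 14]

# The strides show why: the last axis steps by 1 element, the first by 8
print([s // a.itemsize for s in b.strides])  # [8, 4, 2, 1]
```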


Bayesian stats (githubbed, iPython notebooked)

Someone has put up a very interesting experiment on github, one I hope will succeed. They have started a tutorial on Bayesian statistics in the form of an iPython notebook (which, as you may recall, allows you to embed python code, plots, text and math into a notebook format, much like Maple or Mathematica). They have put this notebook collection up on github. This is awesome, because you can play with the examples as you learn about the theory, and you can put in corrections/enhancements to the text/code and share them with everyone else.

The other great analog-digital debate

Some of you are old enough to remember the great analog-v-digital debate: Vinyl or CD? This post is about the OTHER great (but slightly less well known) analog-v-digital debate: do we simulate neurons on digital computers or on custom designed analog VLSI chips?

When I was at the Univ. of Maryland I was hooked on Neuromorphic engineering by Timothy Horiuchi. The central tenet of Neuromorphic engineering is that transistors operating in the subthreshold (analog) zone are great mimics of the computations done by neurons, and the way to intelligent machines is through building networks of such neuro-mimetic neurons on analog Very Large Scale Integration (aVLSI) chips. This press release of some work being done at INI at Zurich reminded me of this field.

What the writeup also reminded me about, was the great debate between digital and analog implementations of neural circuits. Proponents of Neuromorphic VLSI base their work on the idea that transistors working in the sub-threshold zone give, for “free”, a nice non-linearity between input and output that is at the heart of neural circuits. When applying for funds from DARPA they also remind the grant reviewers that aVLSI circuits have very low power consumption compared to digital VLSI circuits.

A well designed and debugged aVLSI Neuromorphic chip is a great feat of engineering, often taking several fabrication rounds to weed out all the design problems, which makes iterating over designs very time consuming and unwieldy.

The proponents of old school digital computation, where neural behavior is encoded in an algorithm (implementing differential equation models of neurons), point to the ease of implementation (you can use your favorite programming language), the ease of debugging (you just recompile while you have a drink) and the ease of modifying and elaborating the design (comment your code!!).

There are some specific issues with aVLSI too. When you make giant neural networks, hooking up digital neurons is usually done using a connection matrix. This matrix simply tells the simulating program which neurons each neuron gets inputs from and which neurons it projects to. In aVLSI you need to physically wire up neurons on the chip layout. This means you can no longer modify the network organization to test out ideas on the fly – you need to design a new chip layout, send it for fabrication, wait, debug and so on. (And the moment you start changing connections you have to start moving the whole design around, because the exact routing of the wires affects the behavior of the chip: everything is so close and the voltages so low that the capacitance between wires matters. As I said, it is a true feat of engineering.)
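For concreteness, here is what the digital side’s connection matrix amounts to in practice: a toy rate-model simulation (my own sketch, not any particular simulator), where rewiring the network is just editing a matrix rather than re-fabricating a chip:

```python
import numpy

# Connection matrix W: W[i, j] is the synaptic weight from neuron j to neuron i.
numpy.random.seed(1)
n = 5
W = numpy.random.randn(n, n) * (numpy.random.rand(n, n) < 0.4)  # sparse random wiring

def step(rates, inp, dt=0.1):
    # One Euler step of a simple rate model: dr/dt = -r + tanh(W r + input)
    return rates + dt * (-rates + numpy.tanh(W @ rates + inp))

r = numpy.zeros(n)
for _ in range(100):
    r = step(r, numpy.ones(n))
print(r)  # firing rates after settling, for this particular wiring
```

Changing the network organization to test an idea is a one-line edit to W.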

People have come up with non-analog solutions to this ‘routing’ problem, by creating hardware versions of the connection matrix: separate circuits, often on a separate chip, that are dedicated to hooking up neurons to other neurons, somewhat like a telephone switchboard. These lose the low power advantage of aVLSI and increase the complexity of the circuits.

You know that I’m going to give my two cents. I think, not being very qualified to comment on either analog or digital implementations of neural circuits, that aVLSI might have some niche applications in tiny devices tailored to a specific task where small size and low power consumption are important. However, for the vast majority of machine intelligence applications, I think simple simulations of neural circuits, performed by ever more powerful and power efficient digital circuits, will prevail.