# Computing mutual information and other scary things

A moderately technical writeup of some adventures in computing mutual information.

I told a colleague of mine, Boris H, of my plan to use mutual information to test data from an experiment. He encouraged me, but he also passed along a warning. “Be wary of discretization. Be very wary” he said, or words to that effect. Boris is a professional: he knows the math so well, people pay him good money to do it. I was, like, “Yeah, yeah, whatever dude.” Long story short, always listen to the pros…

Summary: The Jack Knifed MLE gives a low bias estimate of the MI. It is easy to implement and is not too compute intensive. It is fairly robust to bin size.

In theory, mutual information is a super elegant way of uncovering the relationship between two quantities. There is a quiet elegance to its formulation

$\displaystyle I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)}dxdy$

Or, equivalently

$\displaystyle I(X;Y) = H(X) + H(Y) - H(X,Y)$

where $H(X) = -\int p(x) \log p(x)dx$ is the entropy

These elegant formulae, on the surface, lend themselves to an easy discretization. Personally I see integrals and think – ah, I’m just going to discretize the space and replace the integrals with summations, it works for everything else.

Basically, the idea would be to take the data, which comes as pairs of numbers (x,y) and throw them into a two-dimensional histogram. Each cell of the histogram gives us an estimate of p(x,y) – the joint probability of (x,y) and summing along the rows or columns of the 2d-histogram gives us our marginal probabilities p(x), p(y) which can then be thrown in to our mutual information formula (with integrals replaced by summations). Easy Peasy.

It turns out that this particular computation – because of the form of the equations – is very sensitive to how the data are discretized. The easy formula, apparently the MLE for the entropy, has persistent biases. In all cases the number of bins used for creating the histogram plays an annoyingly critical role. If we have too few bins we underestimate the MI and if we have too many bins we over-estimate the MI. People have come up with different corrections for the bias of which I picked two to test (these formulae are nicely summarized in Paninski 2003, in the introduction).

Defining $N$ as the number of samples and $\hat{m}$ as the number of bins with non-zero probability we have:

Simple-minded discretization (MLE): $\hat{H}_{MLE}(x) = -\sum p(x_i)log_2{p(x_i)}$

Miller Madow correction (MLEMM): $\hat{H}_{MLEMM}(x) = \hat{H}_{MLE}(x) + \frac{\hat{m}-1}{2N}$

The Jack Knifed estimate (MLEJK): $\hat{H}_{MLEJK}(x) = N\hat{H}_{MLE}(x) - \frac{N-1}{N}\sum_{j=1}^{N} \hat{H}_{MLE-j}(x)$ (Leave-one-out formula)

I tested the formulae by generating pairs of correlated random numbers. If $X,Y$ are two gaussian random variables with correlation coefficient $\rho$ then their mutual information is given by:

$-\frac{1}{2}\log_2{(1-\rho^2)}$

which I use as the ground truth to compare the numerical results against.

The plot below compares the actual and estimated MI estimate spit out by the MLE formula for various values of correlation coefficient $\rho$. The actual value is shown by the blue line, the gray shaded area shows 95% of the distribution (from repeating the calculation 1000 times) for a sample size of 100 (Chosen as a reasonable sample size I usually run into, in my experiments). The sets of shaded curves show computation using 11 bins (darker gray) and 21 bins (lighter gray). As you can see this formula sucks. It has a large persistent bias through out the range of correlation coeffcients (the gray areas hang well above the blue line).

The MLEMM did not fare much better in some preliminary explorations I did (see below) so I went to the jack knife which performed pretty well. The Jack Knife still has a bias, but it is not very large. It is less sensitive to the binning and the bias is stable throughout the range.

I’m including this second plot, which isn’t that easy on the eyes, but it was the initial exploration that led me to look into the jack knife in more detail. The plots are distributions of the MI value computed from 1000 runs of each algorithm. The three rows are the three formulae: the MLE, the MLE with the Miller Madow correction (MLEMM) and the Jack Knifed MLE. The columns cycle through different sample sizes (10,100,100), different bin sizes (11,21) and different correlation coefficients between x and y (0,0.5). The vertical dotted line indicates the real MI. As you can see the MLE and MLEMM have an upward bias, which persists even for large sample sizes and is exacerbated by finer binning. The jack knife is impressive, but it too has an upward bias (as you have seen in the previous plot).

### Next steps

The next incremental improvement I would use is adaptive binning, which uses some formalized criteria (like making sure all bins have a non-zero probability) to adjust the binning to the data before calculating MI.

Paninski, Liam. “Estimation of entropy and mutual information.” Neural Computation 15.6 (2003): 1191-1253. – This paper was very tough for me to chew through. I used it for the formulae in the first few pages. Liam claims he has a method for correcting the MLE method, but it seemed like a very complicated computation

Cellucci, C. J., Alfonso M. Albano, and P. E. Rapp. “Statistical validation of mutual information calculations: Comparison of alternative numerical algorithms.” Physical Review E 71.6 (2005): 066208. – This was a much more readable paper compared to Liam’s. I’m using to see what kind of adaptive binning I can apply to the data.

import numpy, pylab

def correlated_x(N, rho=0.5):
r = numpy.random.randn(N,2)
r1 = r[:,0]
r1_ = r[:,1]
r2 = rho*r1 + (1-rho**2)**0.5*r1_
return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
N = float(Nxy.sum())
Pxy = Nxy/N
Px = Pxy.sum(axis=1)
Py = Pxy.sum(axis=0)
return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
idx = pylab.find(P>0)
return -(P.flat[idx]*numpy.log2(P.flat[idx])).sum()

def MI_mle(x,y, bins=11, limits=[-4,4]):
Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
Hx = H_mle(Px)
Hy = H_mle(Py)
Hxy = H_mle(Pxy)
Hx_jk = 0
Hy_jk = 0
Hxy_jk = 0
for n in range(x.size):
jx = numpy.concatenate((x[:n],x[n+1:]))
jy = numpy.concatenate((y[:n],y[n+1:]))
Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
Hx_jk += H_mle(Px)
Hy_jk += H_mle(Py)
Hxy_jk += H_mle(Pxy)

Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
mi = numpy.empty(b)
for n in range(b):
r1,r2 = correlated_x(N, rho)
mi[n] = func(r1,r2,bins=bins)
mi.sort()
med_idx = int(b*0.5)
lo_idx = int(b*0.025)
hi_idx = int(b*0.975)
return mi[lo_idx], mi[med_idx], mi[hi_idx]

b = 1000
Ni = 20
rho = numpy.linspace(0,.99,Ni)
mi_exact = MI_exact(rho)
N = 100
cols = [(.5,.5,.5), (.7,.7,.7)]
for lab, func in zip(['MI MLE', 'MI Jack Knife'], [MI_mle, MI_mle_jack_knife]):
pylab.figure(figsize=(10,5))
for k, bins in enumerate([11,21]):
mi_est = numpy.empty((Ni,3))
for n in range(rho.size):
#mi_est[n,:] = runmany(MI_mle_jack_knife, N=N, rho=rho[n], bins=bins, limits=[-4,4], b=b)
#mi_est[n,:] = runmany(MI_mle, N=N, rho=rho[n], bins=bins, limits=[-4,4], b=b)
mi_est[n,:] = runmany(func, N=N, rho=rho[n], bins=bins, limits=[-4,4], b=b)
pylab.fill_between(rho, mi_est[:,0], mi_est[:,2],color=cols[k],edgecolor='k',alpha=.5)
pylab.plot(rho, mi_est[:,1], 'k', lw=2)
pylab.plot(rho, mi_exact, 'b',lw=2)
pylab.xlabel(r'$\rho$')
pylab.ylabel(lab)
pylab.setp(pylab.gca(),ylim=[-.1,3.5])
pylab.savefig('{:s}.pdf'.format(lab))


The code for the ugly, detailed plot:

import numpy, pylab

def correlated_x(N, rho=0.5):
r = numpy.random.randn(N,2)
r1 = r[:,0]
r1_ = r[:,1]
r2 = rho*r1 + (1-rho**2)**0.5*r1_
return r1, r2

def bin_data(x,y, bins=11, limits=[-4,4]):
Nxy, xe, ye = numpy.histogram2d(x,y,bins=bins,range=[limits,limits])
N = float(Nxy.sum())
Pxy = Nxy/N
Px = Pxy.sum(axis=1)
Py = Pxy.sum(axis=0)
return Pxy, Px, Py, N

MI_exact = lambda rho: -.5*numpy.log2(1-rho**2)

def H_mle(P):
idx = pylab.find(P>0)
return -(P.flat[idx]*numpy.log2(P.flat[idx])).sum()

def H_mm(P,N):
m = pylab.find(P>0).size
return H_mle(P) + (m-1)/(2.0*N)

def MI_mle(x,y, bins=11, limits=[-4,4]):
Pxy, Px, Py, Ntot = bin_data(x,y,bins=bins,limits=limits)
return H_mle(Px) + H_mle(Py) - H_mle(Pxy)

Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
return H_mm(Px, N) + H_mm(Py, N) - H_mm(Pxy, N)

def MI_mle_jack_knife(x, y, bins=11, limits=[-4,4]):
Pxy, Px, Py, N = bin_data(x,y,bins=bins,limits=limits)
Hx = H_mle(Px)
Hy = H_mle(Py)
Hxy = H_mle(Pxy)
Hx_jk = 0
Hy_jk = 0
Hxy_jk = 0
for n in range(x.size):
jx = numpy.concatenate((x[:n],x[n+1:]))
jy = numpy.concatenate((y[:n],y[n+1:]))
Pxy, Px, Py, Njk = bin_data(jx,jy,bins=bins,limits=limits)
Hx_jk += H_mle(Px)
Hy_jk += H_mle(Py)
Hxy_jk += H_mle(Pxy)

Hx_jk = N*Hx - (N-1.0)/N*Hx_jk
Hy_jk = N*Hy - (N-1.0)/N*Hy_jk
Hxy_jk = N*Hxy - (N-1.0)/N*Hxy_jk

return Hx_jk + Hy_jk - Hxy_jk

def runmany(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
mi = numpy.empty(b)
for n in range(b):
r1,r2 = correlated_x(N, rho)
mi[n] = func(r1,r2,bins=bins)
return mi

def plot_bias_variance(func, N=50, rho=0, bins=11, limits=[-4,4], b=1000):
mi = runmany(func, N=N, rho=rho, bins=bins, limits=limits, b=b)
pylab.hist(mi,range=[0,2],bins=61,histtype='stepfilled',color='gray',edgecolor='gray')
pylab.plot([MI_exact(rho),MI_exact(rho)], pylab.getp(pylab.gca(),'ylim'), 'k:',lw=3)
pylab.setp(pylab.gca(),xlim=[-.1,1.0],xticks=[0,.5],yticks=[],xticklabels=[])
pylab.title('N={:d},rho={:2.2f},bins={:d}'.format(N,rho,bins),fontsize=6)

pylab.figure(figsize=(15,6))
b = 1000
for i, rho in enumerate([0,0.5]):
for j, N in enumerate([10,100,1000]):
for k, bins in enumerate([11,21]):
c = 1+k+j*2+i*6
pylab.subplot(3,12,c); plot_bias_variance(MI_mle,rho=rho, N=N,bins=bins,b=b)
if c == 1: pylab.ylabel('MLE')
pylab.subplot(3,12,24+c); plot_bias_variance(MI_mle_jack_knife,rho=rho,N=N,bins=bins,b=b)
if c == 1: pylab.ylabel('JackKnife')

pylab.savefig('mi_experiments.pdf')


# Parzen windows for estimating distributions

Part of a set of moderately technical writeups of some adventures in computing mutual information for neural data.

Often, for example, when you are computing mutual information, you need to estimate the probability distribution of a random variable. The simplest way, which I had done for years, is to to create a histogram. You take each sample and put it into a bin based on its value. Then you can use one of several tests to check if the shape of the histogram deviates from whatever distribution you are interested in.

When you don’t have enough data (aka Always) your histogram comes out jagged (Almost everyone who’s ever made a histogram knows what I’m sayin’). In undergrad stats I learned that 11 was a nice number of bins, and indeed both Matplotlib and MATLAB seem to have that as the default. What I ended up doing was plotting the data using various bins until, by inspection, I was satisfied by the smoothness of the histogram.

Turns out mathematicians have a whole cottage industry devoted to formalizing how to compute the number of bins your histogram should have. The relevant keyword is (imaginatively enough) “histogram problem”. I ran into this word while reading up a Navy technical report by Cellucci et al found here. That paper has references to a bunch of dudes who worked on that problem. Anyhow, there are lots of complex schemes but I liked Tukey’s formula, which was $n^{1/2}$ (n being the number of data points).

This post however, is not about binning. It’s about Parzen windows to get rid of binning altogether. There is a deep theoretical underpinning to Parzen windowing, but intuitively I understand Parzen windowing as a smoothing operation that takes the samples we have and creates a smooth distribution out of the points.

Here’s how I understand Parzen windows: Each sample creates a splash – it’s own little gaussian (Apparently, you can also use boxcar windows or whatever window has a nice property for your problem). This is the Parzen window. As a result, the sample is no longer tightly localized but has a bit of a blur to it. We then add up all the blurs to create a smoothened curve/surface which is our estimate of the pdf of the samples. With a judicious choice of the width of the blurring and proper normalization of the height of each gaussian we can come up with a sensible pdf estimate. The advantage of this is that you know have a continuous function representing the pdf, which you can integrate.

Formally (I referred to a paper by Kwak and Choi – Input Feature Selection by Mutual Information based on Parzen window) the Parzen window estimate of the pdf is given by

$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} \phi(x-x_i,h)$

where $\phi$ is the window function, and I used a gaussian for that. As you can see, the density estimate at any given point $x$ is given by the sum of gaussians centered around each data point $x_i$ with the width of the gaussian being given by $h$. The larger $h$ is the more washed out the estimate is and the smaller $h$ is the more jaggy is the estimate.

We seem to have transferred our problem of finding an appropriate bin width (bin count) to finding an appropriate smoothening constant (What! You expected a free lunch!?). I used what wikipedia calls Silverman’s rule of thumb: $h=1.06\hat{\sigma}n^{-1/5}$.

Here is a fun little animation showing how the Parzen window estimate of a pdf (thin black line) matches up with the actual pdf (thicker blue line). The histogram of the actual data points are shown in light gray in the background.

Interesting things about this smoothing is that there is no binning involved – the final curve depends only on the actual data samples – and we make no strong assumptions about the pdf of the data – it’s not like we are trying to fit a model of the pdf to the data. Here is an animation of the exact same technique being used to fit a uniform distribution. We could have done better with a different choice of window width more tuned to the distribution, but the idea is that it still works if we don’t have any idea of what the actual pdf looks like.

These are also known as mixture of Gaussians or mixture decompositions.

During my web-searches I ran across this nice set of lecture slides about estimating pdfs from a prof at TAMU.

import numpy, pylab, matplotlib.animation as animation

root_two_pi = (2<em>numpy.pi)<strong>0.5
parzen_est = lambda x,X,h,sigma: numpy.exp(-(X-x)</strong>2/(2</em>h<strong>2*sigma</strong>2)).sum()/(root_two_pi*h*sigma*X.size)
gaussian_pdf = lambda x: (1/(2*numpy.pi)<strong>.5)*numpy.exp(-x</strong>2/2)
def uniform_pdf(x):
p = numpy.zeros(x.size)
p[(-2 &lt; x) &amp; (x &lt; 2)] = 0.25
return p

def init():
fig = pylab.figure(figsize=(3,6))
ax = pylab.axes(xlim=(-5, 5), ylim=(-0.5, 0.7))
return fig, ax

def animate(i, r, ax, distr):
this_r = r[:i+2]
ax.cla()
h= 1.06<em>this_r.std()</em>this_r.size<strong>(-.2)#1/(2<em>numpy.log(N))
lim = [-5,5]
bins = 51
x=numpy.linspace(lim[0],lim[1],num=100)
pylab.text(-.4,-.025,'n=' + str(this_r.size))
pylab.hist(this_r, bins=bins,range=lim,normed=True,edgecolor=(.9,.9,.9),color=(.8,.8,.8),histtype='stepfilled')
#pylab.plot(x, (1/(2</em>numpy.pi)</strong>.5)*numpy.exp(-x**2/2), color=(.1,.1,1.0), lw=5)
pylab.plot(x, distr(x), color=(.1,.1,1.0), lw=5)
pylab.plot(x, [parzen_est(xx, this_r, h, this_r.std()) for xx in x],'k', lw=3)
pylab.setp(ax,xlim=(-5, 5), ylim=(-0.05, 0.7))

N = 200
r0=numpy.random.randn(N); distr = gaussian_pdf
#r0=numpy.random.rand(N)*4-2; distr = uniform_pdf

fig, ax = init()
anim = animation.FuncAnimation(fig, animate, fargs=(r0, ax, distr), frames=N-2, interval=2, repeat=False)
#anim.save('parzen_smoothing.mp4', fps=30, extra_args=['-vcodec', 'libx264'])
pylab.show()


# Slicing high dimensional arrays

When we learned about matrices it wasn’t hard to think of slicing 2D matrices along rows and columns. But what about slicing higher dimensional matrices? I always get confused when I go above 3-dimensions. Interestingly, for me, an easy way to imagine slicing such high dimensional matrices is by using trees and thinking of how the numbers are stored in a computer.

Consider a 2x2x2x2 matrix, something we can construct as follows:

a=pylab.arange(16)
a -> array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

b=a.reshape((2,2,2,2))
b ->
array([[[[ 0,  1],
[ 2,  3]],

[[ 4,  5],
[ 6,  7]]],

[[[ 8,  9],
[10, 11]],

[[12, 13],
[14, 15]]]])


Now imagine the matrix is stored in the computer as a simple sequence of numbers (this is kind of true). What makes these numbers behave as a matrix is the indexing scheme. The indexing scheme can be thought of as a kind of tree.

Each level of the tree depicts a dimension, with the root being the outermost dimension and the leaves the inner-most dimensions.

When we slice the array, say by doing

b[0,:,:,:] ->
array([[[0, 1],
[2, 3]],

[[4, 5],
[6, 7]]])


We are actually breaking off a branch of the tree and looking at that.

The same with deeper slices, such as

b[0,0,:,:] ->
array([[0, 1],
[2, 3]])


But what about a slice that starts at the leaves, rather than the roots of the tree?

b[:,:,:,0] ->
array([[[ 0,  2],
[ 4,  6]],

[[ 8, 10],
[12, 14]]])


In such a case I think in terms of breaking off unused branches/leaves and then merging the stubs to form a new tree.

# Balancing training examples in classes

This one was a little non-intuitive for me.

I had a combination of neural and behavioral data from an experiment I’m running. The behavioral data has two categories: Correct and Incorrect (depending on whether the subject performed the task correctly on that particular trial). The neural data takes the form of a feature vector. So I have one neural vector and one category code (Correct/Incorrect) for every trial. I wanted to know how well the neural vector predicts the category code.

Now, the fly in the ointment is that we have more corrects than in-corrects (which is how we know that the subject is actually performing the task and not simply guessing). I had assumed that the SVM, because it is fitting a hyperplane to try and separate the classes would not be affected by an imbalance in the number of samples for each class. But it turns out that it is affected:

Say we have three samples of class ‘O’ and one sample of class ‘X’ as shown.

When we perform the optimization to place the dividing plane, there will be three ‘O’ points pushing the plane away from their side and only one point pushing away from the ‘X’ side. This means that the dividing plane will be placed further from the class with more ‘votes’, i.e. samples.

There are two major ways of dealing with this problem. One is to assign weights to the classes in inverse proportion to their contribution of samples, such that classes with more samples get lower weights.

This way the votes are normalized (using an electoral college, as it were) and the skew is reduced.

The other major way is to resample the data such that there are now equal numbers of samples in all the classes.

As an example (All code at end of article) I considered a control case where the feature data are simply samples taken from a gaussian distribution. Category data (0 or 1) is randomly assigned to each feature datum. True classification performance should therefore be 50%: pure chance.

Now if when you change the proportion of samples from each class you get the following classifier performance curve:

Only the balanced condition (class fraction is 0.5) gives us the correct performance of 50%. At other ratios we get spuriously high classifier performance because the test set has an over-representation of one of the classes and we can (effectively) just guess class (O) and get a lot right.

Using weights helps somewhat:

But I had the most luck with resampling:

The code:

First we just setup

import pylab
from sklearn import svm, cross_validation


We define a picker function that will split our data into train and test sets how we want

def simple_picker(C, boots, k_train=.5):
for b in range(boots):
r = pylab.rand(C.size)
idx_train = pylab.find(r < k_train)
idx_test = pylab.find(r >= k_train)
yield idx_train, idx_test


We define a function that will repeatedly test our classifier on different slices of the data to give us a bootstrap of the classifier performance

def run_classifier(clf, F, C, boots=50, k_train=.5, picker=simple_picker):
score = pylab.zeros(boots)
for n, (idx_train, idx_test) in enumerate(picker(C, boots, k_train)):
score[n] = clf.fit(F[idx_train], C[idx_train]).
score(F[idx_test],C[idx_test])
return score.mean()


A convenience function for plotting our result

def plot_results(k, P):
pylab.plot(k, P,'ko-',lw=2)
pylab.xlabel('Class imbalance fraction')
pylab.ylabel('Performance of SVM')
pylab.setp(pylab.gca(), 'xlim', [0.1,.9],'ylim',[.1,.9], 'yticks', [.3,.5,.7,.9])


A convenience function for running our classifier with various levels of class imbalance

def simulation_generator(N=100, ksep=0):
K = pylab.arange(.2,.8,.1)
P = pylab.zeros(f.size)
for n,k in enumerate(K):
r = pylab.rand(N)
C = pylab.zeros(N)
C[pylab.find(r>k)] = 1
F = pylab.randn(N,1)
F[:,0] += C*ksep
yield k, F, C


Run the simulation with a plain old SVM

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=0):
K.append(k)
P.append(run_classifier(clf, F, C, picker=simple_picker))
plot_results(K, P)


With class weights

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=0):
cw = {0: 1-k, 1:k}
clf = svm.SVC(kernel='linear', C=1, class_weight=cw)
K.append(k)
P.append(run_classifier(clf, F, C, picker=simple_picker))
plot_results(K, P)


A picker function that resamples and balances the classes

def balanced_picker(C, boots, k_train=.5):
"""Given category vector, a number of bootstraps and what fraction of samples
to reserve for training return us a series of indexes that serve to create train
and test sets with no regard to number of samples in a category."""
#We an generalize this code later, for now we keep it simple for current purposes
idx_0 = pylab.find(C == 0)
idx_1 = pylab.find(C == 1)
Npick = min(idx_0.size, idx_1.size)#Can be some arbitrary number - we pick with replacement
for b in range(boots):
sub_idx_0 = pylab.randint(0,high=idx_0.size,size=Npick)
sub_idx_1 = pylab.randint(0,high=idx_1.size,size=Npick)
this_idx = pylab.concatenate((idx_0[sub_idx_0], idx_1[sub_idx_1]))
r = pylab.rand(2*Npick)
idx_train = this_idx[pylab.find(r < k_train)]
idx_test = this_idx[pylab.find(r >= k_train)]
yield idx_train, idx_test


With balanced resampling

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=0):
K.append(k)
P.append(run_classifier(clf, F, C, picker=balanced_picker))
plot_results(K, P)


With resampling and an actual difference in the classes, just to make sure we are not making some kind of mistake (and our answer always comes to chance)

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=1):
K.append(k)
P.append(run_classifier(clf, F, C, picker=balanced_picker))
plot_results(K, P)


# Support Vector Machines and dimensionality

When I was introduced to support vector machines I initially thought: this is great, the method takes care of irrelevant dimensions. My intuition was that since the algorithm tilts hyperplanes to cut the space, adding irrelevant dimensions does not matter at all, since the hyperplane would just lie parallel to the irrelevant dimensions.

Practically speaking, however, as the number of dimensions increases our data start to become sparse which can ‘fool’ the partitioning algorithm.

We can run some simple simulations to explore this question.

import pylab
from sklearn import svm, cross_validation


Let’s generate a dataset which consists of 500 examples of 200 dimensional data. The category information of the data only depend on the 1st dimension

d = 200
N = 500
C = pylab.randint(0,high=2,size=N)
F = pylab.randn(N,d)
F[:,0] += C*2


We set up a linear SVM classifier and cross validate with K-folds

clf = svm.SVC(kernel='linear', C=1)
cv = cross_validation.StratifiedKFold(C, n_folds=10)


If we run the classifier with just the first dimension, we get a classifier accuracy of 0.83 (chance being 0.5)

scores = cross_validation.cross_val_score(clf, F[:,:1], C, cv=cv)
scores.mean()


As we add in the first 99 irrelevant dimensions, our accuracy drops to 0.784

scores = cross_validation.cross_val_score(clf, F[:,:100], C, cv=cv)
scores.mean()


and when we add in all the 199 irrelevant dimensions, our accuracy drops to 0.754

scores = cross_validation.cross_val_score(clf, F[:,:], C, cv=cv)
scores.mean()


Now, this an extreme example (with so many dimensions), but it is a good lesson to keep in mind. The more complex your dataset (in terms of features) the more data you have to collect.

PS. For those wondering, the featured image is from Deus Ex:Human Revolution. It has not relevance to the post except that it has cool geometrical features. If you haven’t played Deus Ex:HR yet, you should do it.