This one was a little non-intuitive for me.

I had a combination of neural and behavioral data from an experiment I’m running. The behavioral data has two categories: Correct and Incorrect (depending on whether the subject performed the task correctly on that particular trial). The neural data takes the form of a feature vector. So I have one neural vector and one category code (Correct/Incorrect) for every trial. I wanted to know how well the neural vector predicts the category code.

Now, the fly in the ointment is that we have more Corrects than Incorrects (which is how we know the subject is actually performing the task and not simply guessing). I had assumed that the SVM, because it fits a hyperplane to separate the classes, would not be affected by an imbalance in the number of samples per class. But it turns out that it is affected:

Say we have three samples of class ‘O’ and one sample of class ‘X’ as shown.

When we perform the optimization to place the dividing plane, there will be three ‘O’ points pushing the plane away from their side and only one point pushing away from the ‘X’ side. This means that the dividing plane will be placed further from the class with more ‘votes’, i.e. samples.
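This skew is easy to reproduce. The sketch below (using current numpy/scikit-learn idioms rather than the pylab-era code later in the post; the sample sizes and means are illustrative) fits a linear SVM to overlapping one-dimensional classes at a 9:1 ratio and ends up predicting the majority class almost everywhere:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Overlapping 1-D classes, 9:1 imbalance: class means differ by 1 std
n0, n1 = 180, 20
F = np.concatenate([rng.normal(0, 1, n0), rng.normal(1, 1, n1)])[:, None]
C = np.array([0] * n0 + [1] * n1)

clf = SVC(kernel='linear', C=1).fit(F, C)

# With the boundary pushed far toward the minority class,
# nearly every sample is predicted as the majority class 0
frac_pred_majority = (clf.predict(F) == 0).mean()
```

With these settings the majority class's many "votes" shove the dividing plane well past the minority class's mean, so the minority class is rarely predicted at all.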

There are two major ways of dealing with this problem. One is to assign weights to the classes in inverse proportion to their contribution of samples, such that classes with more samples get lower weights.

This way the votes are normalized (using an electoral college, as it were) and the skew is reduced.
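This inverse-proportional weighting is built into scikit-learn's `SVC` via the `class_weight` parameter; `class_weight='balanced'` computes weights as `n_samples / (n_classes * class_count)`. A sketch with the modern API (the 80/20 toy labels are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: 80 samples of class 0, 20 of class 1
y = np.array([0] * 80 + [1] * 20)

# 'balanced' weights are n_samples / (n_classes * class_count):
# the minority class gets proportionally more weight
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)

# scikit-learn applies the weighting directly during fitting
clf = SVC(kernel='linear', C=1, class_weight='balanced')
```

Here class 0 gets weight 100/(2*80) = 0.625 and class 1 gets 100/(2*20) = 2.5, so each class contributes equally to the loss.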

The other major way is to resample the data such that there are now equal numbers of samples in all the classes.
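A minimal sketch of the resampling idea, assuming plain numpy (the 30/10 split is just for illustration): downsample the larger class so both classes contribute the same number of samples:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([0] * 30 + [1] * 10)   # 30 'O' samples, 10 'X' samples

idx0 = np.flatnonzero(C == 0)
idx1 = np.flatnonzero(C == 1)
n = min(idx0.size, idx1.size)       # size of the smaller class

# Downsample each class (without replacement) to the smaller class's size
keep = np.concatenate([rng.choice(idx0, n, replace=False),
                       rng.choice(idx1, n, replace=False)])
balanced_C = C[keep]                # now 10 of each class
```

The same `keep` indices would then be used to subset the feature matrix before training.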

As an example (all code at the end of the article), I considered a control case where the feature data are simply samples drawn from a Gaussian distribution. Category data (0 or 1) are randomly assigned to each feature datum. True classification performance should therefore be 50%: pure chance.

Now, when you change the proportion of samples from each class you get the following classifier performance curve:

Only the balanced condition (class fraction 0.5) gives us the correct performance of 50%. At other ratios we get spuriously high classifier performance because the test set over-represents one of the classes, and the classifier can (effectively) just guess the majority class and get a lot right.
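The spurious performance is easy to see by arithmetic: a classifier that always guesses the majority class scores at the majority fraction, not at 50%. A quick sketch (the 80/20 split is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels carry no information about any feature: 80/20 split
C = (rng.random(10000) < 0.2).astype(int)

# A "classifier" that always predicts the majority class (0)
# scores at the majority fraction, ~0.8, even though there is
# nothing to learn; chance on a balanced test set would be 0.5
majority_acc = (C == 0).mean()
```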

But I had the most luck with resampling:

The code:

First we just set up:

```python
import pylab
from sklearn import svm, cross_validation
```

We define a picker function that will split our data into train and test sets the way we want:

```python
def simple_picker(C, boots, k_train=.5):
    for b in range(boots):
        r = pylab.rand(C.size)
        idx_train = pylab.find(r < k_train)
        idx_test = pylab.find(r >= k_train)
        yield idx_train, idx_test
```

We define a function that repeatedly tests our classifier on different splits of the data, giving us a bootstrapped estimate of classifier performance:

```python
def run_classifier(clf, F, C, boots=50, k_train=.5, picker=simple_picker):
    score = pylab.zeros(boots)
    for n, (idx_train, idx_test) in enumerate(picker(C, boots, k_train)):
        score[n] = clf.fit(F[idx_train], C[idx_train]).score(F[idx_test], C[idx_test])
    return score.mean()
```

A convenience function for plotting our results:

```python
def plot_results(k, P):
    pylab.plot(k, P, 'ko-', lw=2)
    pylab.xlabel('Class imbalance fraction')
    pylab.ylabel('Performance of SVM')
    pylab.setp(pylab.gca(), 'xlim', [0.1, .9], 'ylim', [.1, .9],
               'yticks', [.3, .5, .7, .9])
```

A convenience function for running our classifier at various levels of class imbalance:

```python
def simulation_generator(N=100, ksep=0):
    K = pylab.arange(.2, .8, .1)
    for k in K:
        r = pylab.rand(N)
        C = pylab.zeros(N)
        C[pylab.find(r > k)] = 1   # a fraction (1 - k) of the samples get class 1
        F = pylab.randn(N, 1)
        F[:, 0] += C * ksep        # shift the class 1 mean by ksep
        yield k, F, C
```

Run the simulation with a plain old SVM:

```python
K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k, F, C in simulation_generator(N=100, ksep=0):
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=simple_picker))
plot_results(K, P)
```

With class weights

```python
K = []
P = []
for k, F, C in simulation_generator(N=100, ksep=0):
    cw = {0: 1 - k, 1: k}   # weight each class by the other class's fraction
    clf = svm.SVC(kernel='linear', C=1, class_weight=cw)
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=simple_picker))
plot_results(K, P)
```

A picker function that resamples and balances the classes

```python
def balanced_picker(C, boots, k_train=.5):
    """Given a category vector, a number of bootstraps and the fraction of
    samples to reserve for training, yield a series of index pairs that
    form train and test sets with equal numbers of samples from each
    category."""
    # We can generalize this code later; for now we keep it simple
    idx_0 = pylab.find(C == 0)
    idx_1 = pylab.find(C == 1)
    Npick = min(idx_0.size, idx_1.size)  # Can be an arbitrary number - we pick with replacement
    for b in range(boots):
        sub_idx_0 = pylab.randint(0, high=idx_0.size, size=Npick)
        sub_idx_1 = pylab.randint(0, high=idx_1.size, size=Npick)
        this_idx = pylab.concatenate((idx_0[sub_idx_0], idx_1[sub_idx_1]))
        r = pylab.rand(2 * Npick)
        idx_train = this_idx[pylab.find(r < k_train)]
        idx_test = this_idx[pylab.find(r >= k_train)]
        yield idx_train, idx_test
```

With balanced resampling

```python
K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k, F, C in simulation_generator(N=100, ksep=0):
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=balanced_picker))
plot_results(K, P)
```

With resampling and an actual difference between the classes, just to make sure we are not making some kind of mistake (i.e. that our answer does not always come out to chance):

```python
K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k, F, C in simulation_generator(N=100, ksep=1):
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=balanced_picker))
plot_results(K, P)
```

Nice post man. I think we can use the prediction on Gaussian as a benchmark to evaluate other performances in skewed distribution cases. As for sampling, the problem is that it does not mimic the actual distribution and any new data that comes in will probably be in that original distribution. Our code may predict with a heavy skew towards the minority class and that might bring down the overall accuracy. I believe that a sort of mid-way sampling (say your original distro is 90-10, bringing it to 70-30 instead of 50-50) may make for a reasonable case.

Thank you. Your point about the distribution of data is well taken. As a verification step, I am a big fan of doing a shuffle test – or as real statisticians call it – the null model.
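A shuffle test of the kind mentioned here can be sketched as follows (illustrative data and seed; the labels are permuted to destroy any feature-label relationship, so performance on the shuffled data should fall to chance):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Data with a real signal: class 1's feature mean is shifted by 1
C = rng.integers(0, 2, 200)
F = rng.normal(size=(200, 1)) + C[:, None]

# Cross-validated performance on the real labels
real = cross_val_score(SVC(kernel='linear'), F, C, cv=5).mean()

# Null model: permute the labels, breaking the feature-label
# relationship; performance should drop to ~0.5
C_shuf = rng.permutation(C)
null = cross_val_score(SVC(kernel='linear'), F, C_shuf, cv=5).mean()
```

If the real score does not clearly exceed the shuffled score, the classifier is not picking up genuine structure.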

Also, maybe you could tag your post so it comes higher on google search. Just a suggestion.

Thanks for the hint. I never thought about tags and web search indexes. I will try to tag the posts better.