When I was introduced to support vector machines I initially thought: this is great, the method takes care of irrelevant dimensions. My intuition was that since the algorithm tilts hyperplanes to cut the space, adding irrelevant dimensions does not matter at all, since the hyperplane would just lie parallel to the irrelevant dimensions.
Practically speaking, however, as the number of dimensions increases our data start to become sparse, which can ‘fool’ the partitioning algorithm.
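To get a feel for what "sparse" means here, one rough proxy is the distance from each point to its nearest neighbour: with a fixed number of points, that distance grows as dimensions are added. The sketch below is only an illustration under my own assumptions (standard normal data, 100 points, and a hypothetical helper named `mean_nearest_neighbor_distance`); it is not part of the SVM experiment itself.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance from each point to its closest other point."""
    rng = random.Random(seed)
    # Draw n_points samples from a standard normal in n_dims dimensions.
    points = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over all other points.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


for d in (1, 10, 100):
    print(d, mean_nearest_neighbor_distance(100, d))
```

The printed distances should increase with the number of dimensions: the same 100 points cover the space less and less densely.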
We can run some simple simulations to explore this question.
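# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# newer versions provide the same tools in sklearn.model_selection.
import pylab
from sklearn import svm, cross_validation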
Let’s generate a dataset which consists of 500 examples of 200-dimensional data. The category of each example depends only on the first dimension:
d = 200
N = 500
C = pylab.randint(0,high=2,size=N)
F = pylab.randn(N,d)
F[:,0] += C*2
We set up a linear SVM classifier and cross-validate with stratified K-folds:
clf = svm.SVC(kernel='linear', C=1)
cv = cross_validation.StratifiedKFold(C, n_folds=10)
If we run the classifier with just the first dimension, we get a mean cross-validated accuracy of 0.83 (chance being 0.5):
scores = cross_validation.cross_val_score(clf, F[:,:1], C, cv=cv)
scores.mean()
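As a sanity check on that number, we can work out the best accuracy any threshold on the first dimension could achieve. This is a small sketch that assumes the two classes are equally likely, which is only approximately true for randomly drawn labels. One class is centred at 0 and the other at 2, both with unit variance, so the best threshold sits halfway between them at 1. The function names are mine, chosen for illustration.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def best_threshold_accuracy(shift=2.0):
    """Accuracy of the midpoint threshold for two unit-variance Gaussians
    whose means differ by `shift`, assuming equal class priors."""
    return normal_cdf(shift / 2.0)


print(round(best_threshold_accuracy(), 3))  # 0.841
```

So roughly 0.84 is the ceiling for this dataset, and the 0.83 we measured with a single dimension is about as good as it gets.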
As we add in the first 99 irrelevant dimensions, our accuracy drops to 0.784:
scores = cross_validation.cross_val_score(clf, F[:,:100], C, cv=cv)
scores.mean()
And when we add in all 199 irrelevant dimensions, our accuracy drops to 0.754:
scores = cross_validation.cross_val_score(clf, F[:,:], C, cv=cv)
scores.mean()
Now, this is an extreme example (with so many dimensions), but it is a good lesson to keep in mind: the more features your dataset has, the more data you have to collect.
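For anyone on a recent scikit-learn, where the `cross_validation` module no longer exists, here is a sketch of the same experiment using `sklearn.model_selection` and NumPy directly instead of pylab. The wrapper function `run_experiment`, its parameters, and the fixed seed are my additions for convenience. Because the data are random, the accuracies will not match the figures above digit for digit.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC


def run_experiment(n_samples=500, n_dims=200, dims_to_try=(1, 100, 200), seed=0):
    """Return {number of leading dimensions used: mean 10-fold accuracy}."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_samples)          # random 0/1 categories
    features = rng.standard_normal((n_samples, n_dims))  # pure noise everywhere...
    features[:, 0] += 2 * labels                         # ...except the first dimension
    clf = SVC(kernel="linear", C=1)
    cv = StratifiedKFold(n_splits=10)
    results = {}
    for k in dims_to_try:
        # Use only the first k dimensions; dimension 0 is the informative one.
        scores = cross_val_score(clf, features[:, :k], labels, cv=cv)
        results[k] = scores.mean()
    return results
```

Calling `run_experiment()` returns a dictionary of mean accuracies keyed by the number of dimensions used. Raising `n_samples` while keeping `n_dims` fixed is one way to see how much extra data it takes to win the accuracy back.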
PS. For those wondering, the featured image is from Deus Ex: Human Revolution. It has no relevance to the post except that it has cool geometrical features. If you haven’t played Deus Ex: HR yet, you should.