I yield to Python

I’ve been using Python for a few years now and I am amazed that I got by all these years without using yield. I think this is a result of my coming from a C/Matlab background and teaching myself Python. Now, of course, I have the zeal of a recent convert.

The way I came to use ‘yield’ was when I started to refactor my analysis code. I needed to repeat a computation many times (in a loop), in a slightly different way each time. A lot of the code was very similar, but some parts at the heart of the code were different.

So I started out by passing functions into other functions: I had a basic function that ran my computation loop, into which I would pass parameters and functions to run inside the loop. But then I sometimes needed to make subtle changes within the loop itself, or before the loop ran, and it started to get messy.

The code would get a lot cleaner and more understandable if I could refactor the concept of the loop out and reuse that. If only Python had something like that, if only there was some way …

So, here’s a short tutorial on Python’s yield from a recent convert using a contrived example.

Task: Given an integer N, take all the numbers 1…2*N, split them into odd/even pairs and then print the result of adding each pair, for various values of N.

def run_generator(Nmax):
    print """Any code before the first yield statement
    only runs once, when we first start iterating
    over this generator"""
    for n in range(1, Nmax + 1):
        yield n

def number_generator(N):
    for n in range(1, N + 1):
        yield 2*n - 1, 2*n
        
for N in run_generator(10):
    for a,b in number_generator(N):
        print a + b

Ah, you say, but I could have done that using just a set of explicit nested loops. Ah, but what if I come along and say that I now want the product of the number pairs AND a running total of those products? Before, you would be copying and pasting the code and making tweaks to the inner and outer loops. But now, because you are using generators:

total = 0
for N in run_generator(10):
    for a, b in number_generator(N):
        prod = a*b
        total += prod
        print prod, total

Some people will tell you that yield is really useful because it creates generators, which compute the next value just-in-time and so save memory (you don’t keep the entire list in memory at the same time). Pfft, memory schmemory, I find the real use is in keeping code elegant and reusable.
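(If you do care about memory, here is a minimal sketch, not from the original post, of what the laziness buys you: a generator expression computes values on demand instead of materializing the whole list.)

import sys

N = 1000000

squares_list = [n * n for n in range(N)]   # builds and stores the whole list up front
squares_gen = (n * n for n in range(N))    # a generator: computes each value on demand

print sys.getsizeof(squares_list)   # several megabytes of pointers
print sys.getsizeof(squares_gen)    # a few dozen bytes

print sum(squares_list) == sum(squares_gen)   # same total either way: True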

(I will not engage in lame wordplay for blog post titles. I will not engage in lame wordplay for blog post titles. I will not engage in lame wordplay for blog post titles. I will not engage in lame wordplay for blog post titles. I will not engage in lame wordplay for blog post titles. I will not engage in lame wordplay for blog post titles.)

Balancing training examples in classes

This one was a little non-intuitive for me.

I had a combination of neural and behavioral data from an experiment I’m running. The behavioral data has two categories: Correct and Incorrect (depending on whether the subject performed the task correctly on that particular trial). The neural data takes the form of a feature vector. So I have one neural vector and one category code (Correct/Incorrect) for every trial. I wanted to know how well the neural vector predicts the category code.

Now, the fly in the ointment is that we have more corrects than incorrects (which is how we know that the subject is actually performing the task and not simply guessing). I had assumed that the SVM, because it fits a hyperplane to try and separate the classes, would not be affected by an imbalance in the number of samples for each class. But it turns out that it is affected:

Say we have three samples of class ‘O’ and one sample of class ‘X’ as shown.
[Figure imbalanced_sketch1: three ‘O’ samples and one ‘X’ sample]
When we perform the optimization to place the dividing plane, there will be three ‘O’ points pushing the plane away from their side and only one point pushing away from the ‘X’ side. This means that the dividing plane will be placed further from the class with more ‘votes’, i.e. samples.

There are two major ways of dealing with this problem. One is to assign weights to the classes in inverse proportion to their contribution of samples, such that classes with more samples get lower weights.
[Figure imbalanced_sketch2: class weights in inverse proportion to sample counts]
This way the votes are normalized (using an electoral college, as it were) and the skew is reduced.
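As a concrete sketch (mine, not part of the original analysis; the function name is made up): weights inversely proportional to the class counts can be computed in a couple of lines and handed to the classifier through its class_weight argument.

import numpy as np

def inverse_frequency_weights(C):
    """Return {class: weight} with weights inversely proportional to class counts.
    This is the heuristic recent versions of scikit-learn apply for class_weight='balanced'."""
    classes, counts = np.unique(C, return_counts=True)
    return {c: float(len(C)) / (len(classes) * n) for c, n in zip(classes, counts)}

# e.g. 75 samples of class 0 and 25 of class 1 -> {0: 0.667, 1: 2.0},
# which could be passed as svm.SVC(kernel='linear', C=1, class_weight=...)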

The other major way is to resample the data such that there are now equal numbers of samples in all the classes.
[Figure imbalanced_sketch3: resampling to equalize class counts]

As an example (all code at the end of the article) I considered a control case where the feature data are simply samples drawn from a Gaussian distribution. Category data (0 or 1) are randomly assigned to each feature datum. True classification performance should therefore be 50%: pure chance.

Now, when you change the proportion of samples from each class, you get the following classifier performance curve:
[Figure fig1: classifier performance vs. class fraction]
Only the balanced condition (a class fraction of 0.5) gives us the correct performance of 50%. At other ratios we get spuriously high classifier performance because the test set over-represents one of the classes, and we can (effectively) just guess the majority class and get a lot right.
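To see why, a quick sanity check (my own, not from the post): a classifier that always guesses the majority class scores exactly the majority fraction, so a 70/30 test split already hands you 70% ‘performance’ for free.

import numpy as np

def majority_baseline(C):
    # Accuracy of always predicting the most common class in the label vector C
    _, counts = np.unique(C, return_counts=True)
    return float(counts.max()) / counts.sum()

print majority_baseline(np.array([0] * 70 + [1] * 30))   # 0.7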

Using weights helps somewhat:
[Figure fig2: classifier performance with class weights]

But I had the most luck with resampling:
[Figure fig3: classifier performance with balanced resampling]

The code:

First we just set up our imports

import pylab
from sklearn import svm, cross_validation

We define a picker function that will split our data into train and test sets how we want

def simple_picker(C, boots, k_train=.5):
    for b in range(boots):
        r = pylab.rand(C.size)
        idx_train = pylab.find(r < k_train)
        idx_test = pylab.find(r >= k_train)
        yield idx_train, idx_test

We define a function that will repeatedly test our classifier on different slices of the data to give us a bootstrap of the classifier performance

def run_classifier(clf, F, C, boots=50, k_train=.5, picker=simple_picker):
    score = pylab.zeros(boots)
    for n, (idx_train, idx_test) in enumerate(picker(C, boots, k_train)):
        score[n] = clf.fit(F[idx_train], C[idx_train]).score(F[idx_test], C[idx_test])
    return score.mean()

A convenience function for plotting our result

def plot_results(k, P):
    pylab.plot(k, P,'ko-',lw=2)
    pylab.xlabel('Class imbalance fraction')
    pylab.ylabel('Performance of SVM')
    pylab.setp(pylab.gca(), 'xlim', [0.1,.9],'ylim',[.1,.9], 'yticks', [.3,.5,.7,.9])

A convenience function for running our classifier with various levels of class imbalance

def simulation_generator(N=100, ksep=0):
    """Yield (k, F, C) where k is the expected fraction of class-0 samples
    and ksep is the separation between the classes along the single feature."""
    K = pylab.arange(.2, .8, .1)
    for k in K:
        r = pylab.rand(N)
        C = pylab.zeros(N)
        C[pylab.find(r > k)] = 1   # a fraction of roughly k stays in class 0
        F = pylab.randn(N, 1)
        F[:, 0] += C * ksep        # shift class 1 by ksep along the feature axis
        yield k, F, C

Run the simulation with a plain old SVM

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=0):
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=simple_picker))
plot_results(K, P)

With class weights

K = []
P = []
for k,F,C in simulation_generator(N=100, ksep=0):
    cw = {0: 1-k, 1: k}  # weight each class by the other class's expected fraction
    clf = svm.SVC(kernel='linear', C=1, class_weight=cw)
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=simple_picker))
plot_results(K, P)

A picker function that resamples and balances the classes

def balanced_picker(C, boots, k_train=.5):
    """Given a category vector, a number of bootstraps and the fraction of samples
    to reserve for training, yield a series of indexes that create train and test
    sets with equal numbers of samples from each category, regardless of how
    imbalanced C is."""
    # We can generalize this code later; for now we keep it simple for current purposes
    idx_0 = pylab.find(C == 0)
    idx_1 = pylab.find(C == 1)
    Npick = min(idx_0.size, idx_1.size)  # Can be some arbitrary number - we pick with replacement
    for b in range(boots):
        sub_idx_0 = pylab.randint(0, high=idx_0.size, size=Npick)
        sub_idx_1 = pylab.randint(0, high=idx_1.size, size=Npick)
        this_idx = pylab.concatenate((idx_0[sub_idx_0], idx_1[sub_idx_1]))
        r = pylab.rand(2*Npick)
        idx_train = this_idx[pylab.find(r < k_train)]
        idx_test = this_idx[pylab.find(r >= k_train)]
        yield idx_train, idx_test

With balanced resampling

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=0):
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=balanced_picker))
plot_results(K, P)

With balanced resampling and an actual difference between the classes, just to make sure we are not making some kind of mistake (i.e. that our answer does not always come out to chance):

K = []
P = []
clf = svm.SVC(kernel='linear', C=1)
for k,F,C in simulation_generator(N=100, ksep=1):
    K.append(k)
    P.append(run_classifier(clf, F, C, picker=balanced_picker))
plot_results(K, P)

Support Vector Machines and dimensionality

When I was introduced to support vector machines I initially thought: this is great, the method takes care of irrelevant dimensions. My intuition was that since the algorithm tilts hyperplanes to cut the space, adding irrelevant dimensions should not matter at all: the hyperplane would just lie parallel to the irrelevant dimensions.

[Figure hyperplane: a hyperplane lying parallel to an irrelevant dimension]

Practically speaking, however, as the number of dimensions increases our data start to become sparse which can ‘fool’ the partitioning algorithm.

We can run some simple simulations to explore this question.

import pylab
from sklearn import svm, cross_validation

Let’s generate a dataset which consists of 500 examples of 200-dimensional data. The category information only depends on the 1st dimension

d = 200
N = 500
C = pylab.randint(0,high=2,size=N)
F = pylab.randn(N,d)
F[:,0] += C*2

We set up a linear SVM classifier and cross validate with K-folds

clf = svm.SVC(kernel='linear', C=1)
cv = cross_validation.StratifiedKFold(C, n_folds=10)

If we run the classifier with just the first dimension, we get a classifier accuracy of 0.83 (chance being 0.5)

scores = cross_validation.cross_val_score(clf, F[:,:1], C, cv=cv)
scores.mean()

As we add in the first 99 irrelevant dimensions, our accuracy drops to 0.784

scores = cross_validation.cross_val_score(clf, F[:,:100], C, cv=cv)
scores.mean()

and when we add in all the 199 irrelevant dimensions, our accuracy drops to 0.754

scores = cross_validation.cross_val_score(clf, F[:,:], C, cv=cv)
scores.mean()

Now, this is an extreme example (with so many irrelevant dimensions), but it is a good lesson to keep in mind: the more complex your dataset (in terms of features), the more data you have to collect.

PS. For those wondering, the featured image is from Deus Ex: Human Revolution. It has no relevance to the post except that it has cool geometrical features. If you haven’t played Deus Ex: HR yet, you should.

The case of the hidden inversion

Josh is setting up a behavioral experiment where the subject has to press a button as part of their response during the experiment. He was writing some code in Matlab to drive his experiment hardware and called me over. “I have a problem. My button gives output high when not touched and output low when closed (touched). The software expects the input to be low when open (not touching the switch) and high when the subject touches the switch. How do I get this to work? I know Churly’s rig has the same button and the software works with it.”


I wasn’t very familiar with touch buttons, but I remembered helping Churly out with an experiment and I thought I recalled his switch was active-high, that is, when the subject touched the button the voltage went high. We went to investigate and discovered that, indeed, Churly’s switch behaved as if it were active high. But it was the same model as Josh’s switch. So, somewhere, the signal was being inverted.

We traced the connections to Churly’s rig. (I should mention here one of the cardinal rules of experimental setups: if you leave, other scientists can and will steal parts from your rig to repair/build their own. Churly left the lab about a week or so ago, and his rig is already showing signs of being cannibalized.) Anyhow, after tracing the crow’s nest of wires that is standard for experimental rigs, we became very puzzled.

The circuit was hooked up as follows:

[Figure: circuit diagram of the switch wiring]

The BNC coax plug went into a National Instruments breakout box.

After a little thought we decided that the only way this could work was if the two grounds (C, the power supply ground/negative, and B, the coax/system ground) were isolated from each other. Churly was using a separate power supply for the switch, but it was a cheap-looking one and nothing on the faceplate said ‘Isolated supply’, which is something you advertise, since it is expensive to build one.

We went in with a multimeter and started to poke around. Our hypothesis was that the power supply ground (C) was floating with respect to the system ground (B), and so, since A and B were tied together through a resistor, C would be at -10V with respect to the rest of the world. We stuck one end of the voltmeter on C and the other end on another BNC plug shield on the NI breakout box. No voltage difference.

This was very vexing, but it fit with our intuition that this was a cheap-ass power supply and everything was tied to ground. Now that our leading hypothesis was discarded, we were left twisting in the wind. We started to measure voltages across random pairs of points in the circuit. Nothing of interest happened until we put our voltmeter across the BNC shield (B) and the power supply ground (C). 9.6 V. WHAT!??

We pressed the switch. The voltage went to 0V. We started to putter around some more. We had already measured the voltage across the system ground and (C) before, or so we thought. Just to make sure, we went through each BNC shield terminal and measured the voltage with respect to C. All 0V. Really!?

It turns out that that particular input on the NI breakout box was configured as differential. Now, we knew about differential modes. What we had expected was that the differential signal pair would be the center conductors of the paired BNC channels.

What we learned new was that when you switch a pair of inputs to differential mode, the National Instruments BNC breakout board ties the shield of the BNC input to the reference input, thereby decoupling it from the system ground (and the other BNC shield terminals) which is what threw us off at the start.

I had expected that, in differential mode, the shields would still remain attached to system ground, and you would introduce the signal and reference through the ‘signal’ wires (the center conductors). I’m still not sure about the wisdom of passing the reference signal through the shield of the BNC cable, since the frequency characteristics of the shield are different from those of the core, but perhaps this is only significant way beyond the digitizing bandwidth of the cards and is therefore irrelevant.

“Impact”

Well, someone has finally come out and said what we are all thinking. A professor, Marc Kirschner, has an editorial in the tabloid ‘Science’ criticizing the NIH’s emphasis on pre-judging “impact” and translational “significance” of proposed studies.

The editorial is frank, with some of my favorite quotes being:

Thus, under the guise of an objective assessment of impact, such requirements [of judging impact of proposed and current research] invite exaggerated claims of the importance of the predictable outcomes—which are unlikely to be the most important ones. This is both misleading and dangerous.

In today’s competitive job and grant market, these demands create a strong inducement for sloppy science.

And they should reemphasize humility, banishing the words “impact” and “significance” and seeing them for what they really are: ways of asserting bias without being forced to defend it.

I’m not sure, however, if speaking our minds, even when done by senior scientists, is going to change anything. The problem is not really one of attitude, as Kirschner suggests. I think it is simply one of funds, as in not enough of them compared to the number of scientists vying for them.

Whenever demand exceeds supply, the supplier can do any arbitrary thing they want. In many jobs, like a coder position at Google or a tenure-track research position, where there are many more applicants than positions, we would think quality would be selected for, that the cream would rise to the top.

What really happens, however, is that since imperfect humans are judging other imperfect humans in qualities they hope are predictive of future success, we pick arbitrary criteria. We think we are being clever and competitive, when we are actually asking people to jump through hoops just for the heck of it. It would be better to just have a lottery.

I think that until we have more grants or fewer scientists we will continue to apply stupid criteria simply because we need to filter. When one thing is as good (or bad) as the other, it’s just chance. Like Kirschner, I agree we should not use “impact” and “significance” to judge the suitability of grant proposals.

I have a more radical suggestion: use a lottery.

My former adviser (John Maunsell) has said to me that he believes that once you go below the 20 percent line in grant scoring it’s all noise. The bad ones have been taken out, all the good ones are in the mix, and no one can tell which ones will pan out and be significant.

We should be honest, filter the badly designed experiments out, and then do a lottery on the proposals that contain technically sound science.

(On the topic of grant scoring, have you noticed how NIH scores have tended towards bimodal? It’s like reviewers only use two numbers, 1 and 5. Almost as if they know that with the tight paylines it’s a lottery, and so in order for their favorite grants to win they really need to push the distributions apart.)

Gradient descent

We all have to get our car oil changed periodically. If we do it too often it’s a waste of money (and a strain on the environment); if we do it too infrequently we run the risk of damaging our engine. Say an insightful friend has developed a formula – a function – that tells us our total cost (cost of oil + probable cost of engine repair) when we plug in how often we are doing oil changes.

[Figure fig1: total cost as a function of oil change frequency]

We have an idea that we can use this graph to figure out the best frequency for our oil change. Finding the value of the input of a function to obtain the lowest value of the output is called optimization.

Brute force

One easy way to find the optimum is to simply compute the function over a large range of input values and pick the value of the input that gives the smallest output. We already did this by inspection when we saw the above graph, because someone had already plotted the graph for us.

[Figure brute_force: evaluating the function over a range of inputs]

One problem with this is that, in the real world, the function may be complicated and involve lots of individual mathematical formulae making it very time consuming to compute the function over a large range of values. This is especially true for multi-dimensional functions that take multiple inputs that can be altered.

Say our formula takes into account not only oil change frequency but also our average driving speed and average trip length, and we are really obsessive and want to optimize all three. We need to pick ranges for the three dimensions and then decide how densely we are going to sample this three-dimensional space. Even if we are very careless and decide to check only 10 values each for change frequency, trip length and average speed, we end up having to check 10 x 10 x 10 = 1000 combinations of values!
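Here is a minimal sketch of what that brute-force search looks like; the cost function below is made up purely for illustration (our friend has the real one).

import itertools

def total_cost(freq, distance, speed):
    # A made-up stand-in for our friend's formula
    return 10.0 / freq + 0.5 * freq + 0.01 * distance + (speed - 60.0) ** 2 / 100.0

freqs = range(1, 11)             # 10 candidate oil change frequencies
distances = range(10, 110, 10)   # 10 candidate average trip lengths
speeds = range(30, 80, 5)        # 10 candidate average speeds

# Evaluate every combination and keep the smallest: 10 x 10 x 10 = 1000 calls
best = min(itertools.product(freqs, distances, speeds),
           key=lambda args: total_cost(*args))
print best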

We need to find a short cut!

Gravity is our friend

[Figure fig2: the cost curve as a bowl-shaped physical surface]

Look again at the graph our friend has given us. Imagine the graph is actually a physical surface, like a bowl.
If we put a marble somewhere on it, it’s going to roll down and end up at the bottom. Now, this is where we want to be: we want to be at the lowest point of the curve. If we have such a bowl-shaped surface, the action of gravity on a marble placed on that surface acts as a minimizing algorithm: it finds the lowest point for us. (At this point, you might suggest that this is not always true, that we won’t always find the lowest point; we will discuss this later.)

This gives us an idea for a method (an algorithm) by which we can fake this ‘rolling down a slope’ to find the input that minimizes a function. Before we continue, just so we are on the same page, the slope of the curve is defined as the ratio of the change in output to the change in input. This will be familiar to anyone who has been driving in a hilly area:

[Figure: road gradient sign]

Algorithm

1. Place an imaginary marble at some arbitrarily picked starting position
2. Find the slope of the curve at that point
3. Move our point along the curve to a new position, computed by adding the negative of the slope (scaled by a factor we will call alpha, below) to the old position.
4. Keep repeating this until our slope becomes smaller than some value we are happy to call ‘flat’, i.e. we are done.

Alpha

Now, in our physical example, how fast the marble rolls down the slope depends on the strength of gravity. If we did our experiment in orbit (say on the International Space Station) the marble of course would not roll down but float around. If we did it on the moon, it would roll down, but slowly, since the gravity is lower there.

In our fake version of this experiment we also have to decide on a ‘gravity’. We don’t want to make the gravity too low or our point will take a long time to move along the slope. If we make our gravity too high, our point will overshoot and bounce around a lot.

We call our fake gravity ‘alpha’ and it is simply a number that we multiply our slope by to decide how far our point is going to move.

It is important to point out here that our marble in a bowl is an analogy. In the algorithm we are going to use, for practical reasons, we don’t include inertia. So, while in the real world the ball would slide down to the bottom and then swing back upward due to momentum (inertia), in our algorithm, as soon as the slope goes to zero our ‘ball’ (point) stops instantly. However, it is still important to choose a gravity (alpha) that is not too large.
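Putting the algorithm and alpha together, here is a minimal one-dimensional sketch (the cost function is made up and the slope is estimated numerically; none of this is from the original post):

def total_cost(freq):
    # Made-up stand-in: oil cost grows with frequency, repair risk grows as changes get rare
    return 0.5 * freq + 10.0 / freq

def slope(f, x, dx=1e-6):
    # Numerical estimate of the slope: change in output over change in input
    return (f(x + dx) - f(x - dx)) / (2 * dx)

def gradient_descent(f, x0, alpha=0.1, tol=1e-6, max_steps=10000):
    x = x0                        # 1. arbitrarily picked starting position
    for _ in range(max_steps):
        s = slope(f, x)           # 2. slope of the curve at this point
        if abs(s) < tol:          # 4. stop once the curve is effectively 'flat'
            break
        x = x - alpha * s         # 3. move against the slope, scaled by alpha
    return x

print gradient_descent(total_cost, x0=1.0)   # settles near sqrt(20), about 4.47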

Stuck in local minima

You might have thought to yourself: what if the surface is kind of uneven, can’t the ball get stuck on a ledge somewhere as it is rolling down and never reach the bottom? Yes, this can happen using this method, and it is called getting stuck in local minima.

It’s a matter of luck if you get stuck in local minima. You could avoid it simply by accident, by starting from a different position or by having a different alpha value.

Discussion

The gradient descent method requires far fewer function evaluations than the brute force method, especially when the number of input variables we are tweaking (the dimensions of the input space) gets larger.

Getting stuck in local minima is a big problem which crops up quite frequently in many interesting applications. There are some tricks to get out of local minima which we will discuss later.

Also, if you pick too large an alpha value you could end up bouncing around in your function, perhaps lost for eternity, either spinning out into space, or cycling through the same positions, never getting to where you want. You have been warned …

The pitfall of big science

Neuroscience was in the news recently because the President of the United States is giving his name to a somewhat diffuse initiative called BRAIN. The way this is being sold is that this is a ‘challenge’ initiative, like Kennedy’s Moon shot or the genome project. A challenge initiative that will enable us to understand the functioning of the human brain in order to cure brain disorders.

I think this is the beginning of the end of what I liked about American science. What I liked about American science was that funding was spread across many scientists doing many diverse things. And I think that is the only way to fund basic science.

By its nature, each individual bit of basic science is a high-risk endeavor. Experiments, when properly thought out to maximize new knowledge, are most likely to fail, leading us into many blind alleys. The only criterion, in my mind, for a good experiment is that it tells us clearly when we’ve gone down one of those blind alleys.

The fact is, that no one can predict the blind alleys and the breakthroughs. Often scientific discoveries are not only accidental, but they are unrecognized at the time of discovery. Only later, after other discoveries are made or the socio-economic landscape has changed do these discoveries suddenly appear significant and revolutionary.

Funding science is like picking stocks. You don’t know which one is going to work out. In the stock market, you can do some things that are, at the very least, immoral, and perhaps illegal, to game the system and ensure a flow of money from the less connected small investor to the better connected bigger investor. In science, you cannot game the system. No one has special insight into where the next big breakthrough is coming from. You can’t time this market.

I used to think, therefore, that the NIH and NSF, by giving out a lot of grants to people doing many different things, was playing the game right. Diversify and fund as many different ideas as possible. A small fraction will bear fruit, but we will simultaneously maintain a decent mass of well trained scientists and reap the benefits of the occasional breakthrough.

But two factors seem to have broken the back of that system. The financial crunch has bled funding out of the NIH for over a decade, and simultaneously the funding agencies have gone into a strange turtle mode: they only want to fund science that is ‘guaranteed’ to get ‘results’. There are no guarantees in science. Often we do not know what a ‘result’ is.

As part of this effort to only fund great science, the funding agencies are looking to fund only the ‘super star’ scientists: scientists who, by some objective criterion, are ‘better’ than the rest. These seemingly objective criteria (publication impact, publication volume) increasingly boil down to political influence: how many friends and supporters they have on journal review staff, on grant panels, in the funding agencies and now, in the very machinery of government itself.

President Obama’s announcement of the BRAIN initiative, to me, is like a marker in the road to the increasing politicization of scientific funding. Usually when people think about politics in science they think about political appointees stopping scientific progress to benefit religious constituents or industrial powerhouses that pull their strings.

I think of politicization as the judging of scientific endeavor not from the strict lens of scientific correctness, but from the more subjective lens of ‘sexiness’ or ‘impact’. It is my firm opinion that this is a grave mistake. As scientists we can only judge scientific correctness. We may think we can judge impact, but the truth is that these are merely very biased opinions. We may sometimes be correct, but in most situations we do not have the foresight to see what combination of future socio-economic factors and other discoveries will make which contemporary scientific finding useful or useless.

With the BRAIN initiative we are now formalizing a centrally controlled system of doing science, where a small number of politically skillful people will control an ever-decreasing purse of money to fund an ever smaller and less diverse set of scientists. This is a disaster.

I would rather propose that the number of grants be kept constant (or increased) and the size of each grant go down. PIs should also be restricted in how many graduate students and postdocs they can hire. Scientists are creative people; they will find ways to stretch that research budget. Perhaps PIs could take a little less by way of salaries, and labs using common equipment could find ways to pool resources. But the science will go on. And it will be diverse, and we will let the future sort out what was a good result and what was a false positive or a mere curiosity of no practical value.