Hark back to the caveman days. You are out there, in the tall grass, with your spear. You are stalking a tasty gnu. Suddenly, over the whispering of the wind over the green and the calls of the birds, you hear a twig snap! Was that a twig snap? Was that a saber-toothed tiger stalking you? Or was it just a birdchirp, or an insect falling? Do you keep stalking, or do you slink away? Categorizing things is a fundamental operation for animals and ‘intelligent’ machines.

When we talk about machines (including animals) categorizing things (like stimuli, choices etc) we often run into the concept of the Reciever Operating Characteristic (ROC), and the area under the ROC. Taking the example above, you need to decide whether that sound you heard is simply prarie noise, or the stealthy sound of a predator. The stimuli are not that easy to distinguish, but you need to try.

In reality, we probably base our decision on many factors: time of day (maybe we know tigers like to hunt at particular times, while food grazes at other times), are there many such crackles, how exactly did the crackle sound, did the birds stop calling, are there vultures overhead and so on and so forth.

To simplify matters for this discussion, say we simply rely on how loud the sound was relative to the general hub hub of the grazing fields. Say the louder the sound the more likely it is to be something extra-ordinary rather than a random sound that happened to be loud by chance. Say, it turns out that the probability distributions of sound level for noise on the prarie (blue) and tiger stalking (green) turn out to be as shown below.

Suppose we are not so hungry, and a little flighty and on edge today. Even the slightest sound can set us off and running for the home cave. We can model this as a criterion c (indicated by the vertical black dashed line) that is set pretty low. This means that even a soft tiger sound will cause us to respond (the green shaded are to the right). When we hear a sound and correctly identify it as a tiger we say it was a ‘Hit’. But, we overreact and often interpret the regular prarie noise as a tiger (shaded blue region). These instances are called ‘False alarms’. As we can see, when we set our criterion low (we are ‘jumpy’) we have a lot of ‘Hits’ but also a lot of ‘False alarms’.

Now, on another day, we might me more even-keeled. We are less jumpy, more measured in our responses. We can model that as a higher criterion.

We can see that in such a case our ‘Hits’ go down, but our ‘False Alarms’ go down as well.

On another day, we might be very hungry. We are determined to hunt down that bison, tigers or no tiger. We can model that as setting our criterion very high.

Now, we have very few ‘Hits’ and almost no ‘False Alarms’.

Under controlled conditions, it is possible to systematically vary the eagerness of an observer and obtain what is called the Receiver Operating Characteristic (ROC) – terminology harking back to the days when the hot research topic was radar reciver operators sitting at listening posts and how they could be made more efficient in detecting enemy aircraft from primitive radar returns. As we sweep the criterion point, the proportion of FA and H varies systematically, creating an ROC that looks like the curve shown below.

At high criteria there are no FA but no H either (lower left). For low criteria we have high H, but high FA too (upper right). At intermediate criteria we have a mixture.

We can see that the ROC curve bulges out upwards and leftwards. The size of this bulge depends on how well separated the two distributions are (how far apart the blue and green curves are). If the distributions are identical the ROC is a 45 degree line: changing our criterion increases our FA and H at the same rate all the time. If the blue and green are well separated the ROC bulges out a lot, such that increasing our criterion, initially, causes our H to go up much faster than our FA (which is good – it means that, by being a little more conservative, we can get many more hits without making too many more false alarms, upto a point.

Here’s a funny thing: we now have some intuition that the bulge of the curve indicates something about how easy it is to distinguish the categories. The more bulging the curve the larger the Area Under the Curve (AUC, the space enclosed bewteen the ROC curve and the x-axis) is. It turns out that the area under the curve can be interpreted as a probability (it ranges from 0 to 1). In fact it is the ‘ideal observer’ probability. If could some how build a machine or train an observer to make the best use of the data available, the best that ‘ideal observer’ can do (on average) is to get it right with probability ‘p’, where p is the area under the curve.

Intuitively, this makes sense: if the ROC is a 45 degree line the AUC=0.5 which means that the best we can do is guess blindly, getting things light half the time. If the ROC is completely bulged out the AUC=1.0 and its a walk in the park and we get the classification right all the time.

If the ROC bulges in (and our AUC < 0.5) we notice that we can simply invert the categories and get the ROC to bulge outward – we’ve simply got the sign backwards.

WHY the AUC of the ROC gives is this ‘ideal observer’ probabilty is another story, but to get us thinking, we make a fun note: suppose that we kept getting samples of ‘noise’ and ‘tiger’ and we simply decided that the larger sample was the tiger and the smaller one noise. It turns out that this simple strategy gives us the ‘ideal observer’ performance (which is same as the AUC of the ROC).

The code for this and the rest of the post is given below and you might have a lot of fun pondering this ‘coincidence’. As a hint I would suggest asking the question, what does it mean to compare the two samples? Can this have anything to do with moving the criterion from -infiinty to +infinity and checking what our percentage correct (H-FA) is?

def plotit(x1,x2,cri): h1=hist(x1, histtype='step',bins=31) h2=hist(x2, histtype='step',bins=31) c2 = (h2[1][:-1]+h2[1][1:])/2. idx = find(c2 > cri) h=plot([c2[idx[0]],c2[idx[0]]],[0,1100],'k:',lw=4) h=fill_between(c2[idx], h2[0][idx], alpha=.2,color='green') Hits = h2[0][idx].sum()/float(h2[0].sum()) h1=hist(x1, histtype='step',bins=31) h2=hist(x2, histtype='step',bins=31) c1 = (h1[1][:-1]+h1[1][1:])/2. idx = find(c1 > cri) #h=plot([c1[idx[0]],c1[idx[0]]],[0,1100],'k:',lw=2) h=fill_between(c1[idx], h1[0][idx], alpha=.2,color='b') FA = h1[0][idx].sum()/float(h1[0].sum()) h=text(c1[idx[0]]+.2, h1[0].max(), 'FA={:1.2f}'.format(FA)) h=text(c1[idx[0]]+.2, h1[0].max()*.7, 'H={:1.2f}'.format(Hits)) def auroc(x1, x2, N=40): """Area under the ROC curve. Given scalar data from two groups (x1,x2) what is the probability that an ideal observer will be able to correctly classify the groups?""" n1 = float(x1.size) n2 = float(x2.size) av1 = x1.mean() av2 = x2.mean() if av1 > av2: FA = 1 H = 0 else: FA = 0 H = 1 st = min(x1.min(), x2.min()) nd = max(x1.max(), x2.max()) ra = nd - st st -= .01*ra nd += .01*ra cri = pylab.linspace(nd,st,N) roc = [pylab.zeros(N), pylab.zeros(N)] for n in xrange(N): roc[H][n] = pylab.find(x2 > cri[n]).size/n2 roc[FA][n] = pylab.find(x1 > cri[n]).size/n1 return pylab.trapz(roc[1], roc[0]), roc x1 = randn(10000) x2 = randn(10000)+2 plotit(x1,x2,cri=-2) plotit(x1,x2,cri=1) plotit(x1,x2,cri=3) p, roc = auroc(x1, x2, N=40) plot(roc[0], roc[1],lw=2) axis('scaled') xlabel('FA') ylabel('H') h=text(.5,.5,p) N = 10000 i1 = random_integers(0,x1.size-1,N) i2 = random_integers(0,x2.size-1,N) find(x2[i1]>x1[i2]).size/float(N)