Gareth McCaughan’s summary of what we know about D-Wave’s quantum computer

D-Wave is making waves (I know, I know) with its quantum computer. Scott Aaronson’s post about the latest in the D-Wave saga was linked on Hacker News. Gareth McCaughan has given a very nice layperson’s summary of that post, which also stands on its own as an introduction to D-Wave and the essence of quantum computing. This amazing comment (which encapsulates all that is great about HN) is linked here in the hope that it will not disappear any time soon. Gareth, do make this a blog post for us laypeople, if you have time. Thanks again.


The ROC and its AUC

Hark back to the caveman days. You are out there, in the tall grass, with your spear. You are stalking a tasty gnu. Suddenly, over the whispering of the wind over the grass and the calls of the birds, you hear a twig snap! Was that a twig snap? Was that a saber-toothed tiger stalking you? Or was it just a bird chirp, or an insect falling? Do you keep stalking, or do you slink away? Categorizing things is a fundamental operation for animals and ‘intelligent’ machines.


When we talk about machines (including animals) categorizing things (stimuli, choices, etc.) we often run into the concept of the Receiver Operating Characteristic (ROC), and the area under the ROC. Taking the example above, you need to decide whether that sound you heard was simply prairie noise, or the stealthy sound of a predator. The stimuli are not that easy to distinguish, but you need to try.

In reality, we probably base our decision on many factors: time of day (maybe we know tigers like to hunt at particular times, while food grazes at other times), are there many such crackles, how exactly did the crackle sound, did the birds stop calling, are there vultures overhead and so on and so forth.

To simplify matters for this discussion, say we rely only on how loud the sound was relative to the general hubbub of the grazing fields: the louder the sound, the more likely it is to be something extraordinary rather than a random sound that happened to be loud by chance. Say it turns out that the probability distributions of sound level for noise on the prairie (blue) and for a stalking tiger (green) are as shown below.
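A minimal sketch that generates two such distributions (the means of 0 for noise and 2 for the tiger, and the unit variances, are assumptions for illustration, not the exact parameters behind the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 10000)   # prairie-noise sound levels (assumed N(0,1))
tiger = rng.normal(2, 1, 10000)   # stalking-tiger sound levels (assumed N(2,1))

plt.hist(noise, histtype='step', bins=31, color='b', label='noise')
plt.hist(tiger, histtype='step', bins=31, color='g', label='tiger')
plt.xlabel('Sound level')
plt.legend()
```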


Suppose we are not so hungry, and a little flighty and on edge today. Even the slightest sound can set us off and running for the home cave. We can model this as a criterion c (indicated by the vertical black dashed line) that is set pretty low. This means that even a soft tiger sound will cause us to respond (the green shaded area to the right). When we hear a sound and correctly identify it as a tiger we call it a ‘Hit’. But we overreact, and often interpret regular prairie noise as a tiger (the shaded blue region). These instances are called ‘False alarms’. As we can see, when we set our criterion low (we are ‘jumpy’) we have a lot of ‘Hits’ but also a lot of ‘False alarms’.
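We can put rough numbers on this with a quick simulation (same assumed N(0,1) noise and N(2,1) tiger distributions, and a low criterion of -1):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 100_000)   # assumed noise distribution
tiger = rng.normal(2, 1, 100_000)   # assumed tiger distribution

c = -1.0                    # a 'jumpy', low criterion
H = (tiger > c).mean()      # Hit rate: tiger sounds we respond to
FA = (noise > c).mean()     # False-alarm rate: noise we mistake for a tiger
# With this low criterion H is near 1.0, but FA is high too (around 0.84)
```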

Now, on another day, we might be more even-keeled. We are less jumpy, more measured in our responses. We can model that as a higher criterion.


We can see that in such a case our ‘Hits’ go down, but our ‘False Alarms’ go down as well.

On another day, we might be very hungry. We are determined to hunt down that gnu, tiger or no tiger. We can model that as setting our criterion very high.


Now, we have very few ‘Hits’ and almost no ‘False Alarms’.

Under controlled conditions, it is possible to systematically vary the eagerness of an observer and obtain what is called the Receiver Operating Characteristic (ROC) – terminology harking back to the days when the hot research topic was how radar receiver operators sitting at listening posts could be made more efficient at detecting enemy aircraft from primitive radar returns. As we sweep the criterion, the proportions of False alarms (FA) and Hits (H) vary systematically, creating an ROC that looks like the curve shown below.
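The criterion sweep can be sketched in a few lines (again assuming N(0,1) noise and N(2,1) tiger samples; the full function appears in the code at the end of the post):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 10000)
tiger = rng.normal(2, 1, 10000)

# Sweep the criterion from very high to very low, recording (FA, H) at each step
criteria = np.linspace(6, -4, 50)
FA = np.array([(noise > c).mean() for c in criteria])
H = np.array([(tiger > c).mean() for c in criteria])
# (FA, H) traces the ROC from (0, 0) up to (1, 1)
```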


At high criteria there are no FA but no H either (lower left). For low criteria we have high H, but high FA too (upper right). At intermediate criteria we have a mixture.

We can see that the ROC curve bulges up and to the left. The size of this bulge depends on how well separated the two distributions are (how far apart the blue and green curves are). If the distributions are identical the ROC is a 45 degree line: changing our criterion increases our FA and H at the same rate all the time. If the blue and green are well separated the ROC bulges out a lot, such that, starting from a very high criterion and lowering it, our H initially goes up much faster than our FA (which is good – it means that, by being a little less conservative, we can get many more hits without making too many more false alarms, up to a point).

Here’s a funny thing: we now have some intuition that the bulge of the curve indicates something about how easy it is to distinguish the categories. The more bulging the curve, the larger the Area Under the Curve (AUC, the space enclosed between the ROC curve and the x-axis). It turns out that the area under the curve can be interpreted as a probability (it ranges from 0 to 1). In fact it is the ‘ideal observer’ probability: if we could somehow build a machine or train an observer to make the best use of the data available, the best that ‘ideal observer’ can do (on average) is to get it right with probability p, where p is the area under the curve.
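For equal-variance Gaussians this ideal-observer probability has a well-known closed form, AUC = Φ(d′/√2), where Φ is the standard normal CDF and d′ is the separation of the means in standard deviations. We can check it against an empirical estimate (a sketch; scipy is used only for Φ, and the estimate is the fraction of random pairs in which the tiger sample is louder):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 5000)
tiger = rng.normal(2, 1, 5000)   # d' = 2

# Empirical AUC: probability that a random tiger sample exceeds a random noise sample
auc_emp = (tiger[:, None] > noise[None, :]).mean()
auc_theory = norm.cdf(2 / np.sqrt(2))   # about 0.92
```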

Intuitively, this makes sense: if the ROC is a 45 degree line the AUC=0.5, which means that the best we can do is guess blindly, getting things right half the time. If the ROC is completely bulged out the AUC=1.0 and it’s a walk in the park: we get the classification right all the time.

If the ROC bulges in (and our AUC < 0.5) we notice that we can simply invert the categories and get the ROC to bulge outward – we’ve simply got the sign backwards.
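A quick sanity check of this inversion, using the pairwise-comparison view of the AUC (the fraction of random pairs in which one group’s sample exceeds the other’s):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 2000)
b = rng.normal(1, 1, 2000)

auc = (b[:, None] > a[None, :]).mean()          # about 0.76 for d' = 1
auc_flipped = (a[:, None] > b[None, :]).mean()  # categories inverted
# For continuous data (no ties) auc + auc_flipped == 1: flipping the sign
# turns an inward-bulging ROC into an outward-bulging one
```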

WHY the AUC of the ROC gives us this ‘ideal observer’ probability is another story, but to get us thinking, here is a fun note: suppose we kept getting pairs of samples, one of ‘noise’ and one of ‘tiger’, and we simply decided that the larger sample was the tiger and the smaller one noise. It turns out that this simple strategy gives us the ‘ideal observer’ performance (which is the same as the AUC of the ROC).

The code for this and the rest of the post is given below, and you might have a lot of fun pondering this ‘coincidence’. As a hint, I would suggest asking the question: what does it mean to compare the two samples? Can this have anything to do with sweeping the criterion from -infinity to +infinity and checking what our percentage correct (H-FA) is?

import numpy as np
import matplotlib.pyplot as plt

def plotit(x1, x2, cri):
    # Step histograms of the 'noise' (x1) and 'tiger' (x2) distributions
    h1 = plt.hist(x1, histtype='step', bins=31)
    h2 = plt.hist(x2, histtype='step', bins=31)

    # Shade the part of the 'tiger' distribution above the criterion (Hits)
    c2 = (h2[1][:-1] + h2[1][1:]) / 2.  # bin centers
    idx = np.where(c2 > cri)[0]
    plt.fill_between(c2[idx], h2[0][idx], alpha=.2, color='green')
    Hits = h2[0][idx].sum() / float(h2[0].sum())

    # Shade the part of the 'noise' distribution above the criterion (False alarms)
    c1 = (h1[1][:-1] + h1[1][1:]) / 2.
    idx = np.where(c1 > cri)[0]
    plt.fill_between(c1[idx], h1[0][idx], alpha=.2, color='b')
    FA = h1[0][idx].sum() / float(h1[0].sum())

    plt.text(c1[idx[0]] + .2, h1[0].max(), 'FA={:1.2f}'.format(FA))
    plt.text(c1[idx[0]] + .2, h1[0].max() * .7, 'H={:1.2f}'.format(Hits))

def auroc(x1, x2, N=40):
  """Area under the ROC curve. Given scalar data from two groups (x1, x2), what is
  the probability that an ideal observer will be able to correctly classify the groups?"""
  n1 = float(x1.size)
  n2 = float(x2.size)
  # Orient the curve so that the AUC comes out >= 0.5
  if x1.mean() > x2.mean():
    FA, H = 1, 0
  else:
    FA, H = 0, 1

  # Sweep the criterion from just above the largest to just below the smallest value
  st = min(x1.min(), x2.min())
  nd = max(x1.max(), x2.max())
  ra = nd - st
  st -= .01 * ra
  nd += .01 * ra
  cri = np.linspace(nd, st, N)
  roc = [np.zeros(N), np.zeros(N)]
  for n in range(N):
    roc[H][n] = (x2 > cri[n]).sum() / n2   # Hit rate
    roc[FA][n] = (x1 > cri[n]).sum() / n1  # False-alarm rate

  return np.trapz(roc[1], roc[0]), roc

x1 = np.random.randn(10000)
x2 = np.random.randn(10000) + 2

p, roc = auroc(x1, x2, N=40)
plt.plot(roc[0], roc[1], lw=2)

# Draw random pairs and guess that the larger sample is the tiger:
# the fraction correct matches the AUC computed above
N = 10000
i1 = np.random.randint(0, x1.size, N)
i2 = np.random.randint(0, x2.size, N)
frac_correct = (x2[i2] > x1[i1]).mean()

Musings on Mutual Information

Mutual information (MI), often given in terms of bits of information, is a neat way to describe the relationship between two variables, x and y. The MI tells us how predictive x is of y (and vice versa). Here I will explore what the MI value in bits means.

For example, if x takes 8 values and very reliably predicts which of 8 values y will take, it turns out that the mutual information between x and y is 3 bits. One interesting aspect of MI is getting an intuition for how noise and uncertainty are reflected in it. In the example below we will explore this a little bit.
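The 8-value example is easy to verify with a plug-in estimate of the MI from a joint histogram (a sketch, separate from the mi function given at the end of the post):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 8, 100_000)   # x takes 8 values, uniformly
y = x                             # y is perfectly predicted by x

# Plug-in mutual information from the joint histogram, in bits
Pxy, _, _ = np.histogram2d(x, y, bins=8)
Pxy /= Pxy.sum()
Px = Pxy.sum(axis=1, keepdims=True)   # marginal of x
Py = Pxy.sum(axis=0, keepdims=True)   # marginal of y
nz = Pxy > 0                          # skip empty bins (avoids log(0))
mi_bits = (Pxy[nz] * np.log2(Pxy[nz] / (Px * Py)[nz])).sum()
# mi_bits comes out very close to 3.0 = log2(8)
```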

At the end of the post I show all the Python code needed to reproduce the plots.

As our basic data we consider an array of numbers (x) each of which is either 0 or 1. So x can have two states (0,1).

Next we create another array (y). For each value in x, we have a value in y. Each value in y is sampled from a unit variance Gaussian distribution whose mean is 0 or 4 depending on the value of its corresponding x.


We have an intuition that x and y predict each other pretty well. We can see the distribution of y has two peaks, depending on whether the values are associated with x=0 or x=1. The mutual information between the two comes out to be 1 bit. Our intuition is that if someone gave us a value of x, we could tell them whether the value of y will be above or below 2 (the midpoint between the two means). Alternatively, if someone gave us a value for y, we could tell them whether x was 0 or 1. Either way, we can reliably pick one of two alternatives. For this reason, we say that there is 1 bit of information here: we can distinguish 2**1 alternatives with the value given to us.
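The ‘above or below’ intuition can be checked directly (a quick sketch): thresholding y at the midpoint between the two means recovers x almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 100_000)          # x is 0 or 1
y = rng.normal(0, 1, x.size) + 4 * x     # y ~ N(0,1) or N(4,1)

pred = (y > 2).astype(int)               # threshold at the midpoint
accuracy = (pred == x).mean()            # close to 1: the peaks barely overlap
```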

Say we now arrange things such that whatever the value of x, the value of y is always chosen from the same Gaussian distribution.


Now, our intuition is that there is no way to tell what our value of y will be based on the value of x, and vice versa. When we work out our mutual information, we see that we have close to zero bits of information, which matches our intuition that x and y carry no information about each other.

Next, suppose we arrange things such that the value of y is taken from Gaussian distributions with mean 0 or 1, depending on whether the value of the corresponding x is 0 or 1.


We see that the distributions overlap, but not completely. The mutual information computation tells us that there are 0.45 bits of information between our two variables. What does that mean? What does it mean to be able to distinguish among 2**0.45=1.36 possibilities? I find it hard to interpret this value directly. Given what we know about x (it can be 0 or 1), the maximum mutual information possible is 1 bit, so anything less than 1 represents a noisy or flawed relation between x and y. But I have little intuition about whether, say, 1.5 bits of information is a lot better than 1.3 bits, or just a little worse than 1.7 bits. We can check what happens when we gradually increase the separation between the two distributions, making them more easily separable.


As we can see, about 0.4 bits of mutual information implies the distributions are separated by one standard deviation (d’=1), about 0.8 bits means d’=2, and then we start to saturate.

Code follows

import numpy as np

def mi(x, y, bins=11):
  """Given two arrays x and y of equal length, return their mutual information in bits."""
  Hxy, xe, ye = np.histogram2d(x, y, bins=bins)
  Hx = Hxy.sum(axis=1)
  Hy = Hxy.sum(axis=0)

  # Convert counts to probabilities
  Pxy = Hxy / float(x.size)
  Px = Hx / float(x.size)
  Py = Hy / float(x.size)

  # Flatten the joint and align the marginals with it
  pxy = Pxy.ravel()
  px = Px.repeat(Py.size)
  py = np.tile(Py, Px.size)

  # Only sum over bins with nonzero probability (avoids log(0))
  idx = (pxy > 0) & (px > 0) & (py > 0)
  return (pxy[idx] * np.log2(pxy[idx] / (px[idx] * py[idx]))).sum()

x1 = np.zeros(1000)
x2 = np.ones(1000)
x = np.concatenate((x1, x2))

# Well-separated distributions (means 0 and 4): MI is close to 1 bit
y = np.random.randn(x.size) + 4 * x
h = plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Identical distributions: MI is close to 0 bits
y = np.random.randn(x.size) + 0 * x
h = plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Overlapping distributions (d' = 1): MI is about 0.45 bits
y = np.random.randn(x.size) + x
h = plt.hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = plt.text(2, 80, 'MI={:f}'.format(mi(x, y)))

# Sweep the separation between the two distributions
N = 100
kk = np.linspace(0, 5, N)
mutinf = [mi(x, np.random.randn(x.size) + k * x) for k in kk]
plt.plot(kk, mutinf)
plt.xlabel("Separation (d')")
plt.ylabel('Mutual information')