Down the rabbit hole

I was putting the finishing touches on some data pre-processing in preparation for an analysis I was raring to do. The plan was to create some pretty pictures, get some insight, get this off my desk by noon and go into the weekend with no backlog and a clear conscience. But here I am, this Friday evening, rewriting the code to work around a bug in someone else’s software. I was angry for a while, but then I was terrified.

To be honest, it’s not a sudden realization, and I suspect all of you have had it too. It’s just that it has to happen to you personally to bring in that extra visceral element.

Everyone who has used Windows (and now, sadly, Mac OS X to an increasing and annoying degree) knows that all software has bugs. I used to think this had primarily to do with the fact that these are graphical operating systems that have to deal with asynchronous and random input.

These are difficult problems to find test cases for and the bulk of the software is a series of checks and balances to see if the user is allowed to do what they just did given what they have been doing in the past. I wrote GUIs once and I swore I would never do it again. No matter how many checks you put in, someone somewhere is going to come in and press the right combination of keys at just the right pace to lock your software and crash it, eventually.

Widely used, well-tested computing packages, on the other hand, I pretty much trusted. Yes, it is true that there are tricky algorithms, such as integration, differentiation and optimization, that are hard to get exactly right, and not all implementations are equal. Some trade accuracy for speed and so on, but I don’t expect anything to collapse catastrophically if thousands of people have been using it for years.

And yet here I was sitting at my keyboard, fuming, because a module I was using presented a strange, extremely unexpected bug. To top it off, the library doesn’t do any fancy computations, doesn’t do any graphics or any user interface stuff. All it does is take tabular data and save it to disk.

The selling point of the software is that it allows you to build a file on disk, many gigabytes in size, much, much larger than your available memory, and still process data from it seamlessly. I’ve used it for a while and it works for me.

Another great thing about the library was that it had an easy way to indicate missing data. It uses something called a ‘NaN’, short for Not-a-Number, a fairly common value we put in our data to say “hey, don’t take this value into account when you do some computation, like summing or multiplying this table of numbers, it’s not really there.”
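For anyone who has not bumped into NaNs before, here is a minimal sketch of how they behave, in plain Python rather than in the library I was using. The `heights` list is made up for illustration. The point to notice is that a NaN is not skipped automatically: it fails every comparison and poisons a plain sum, so the code handling it has to step around it on purpose.

```python
import math

nan = float("nan")
heights = [5.5, nan, 6.1]  # hypothetical column with one missing value

# NaN is not equal to anything, including itself.
nan_equals_itself = (nan == nan)

# Every ordered comparison involving NaN evaluates to False.
nan_is_taller = (nan > 5.0)
nan_is_not_taller = (nan <= 5.0)

# A plain sum is poisoned by the NaN and comes out as NaN.
plain_total = sum(heights)

# The missing entries have to be skipped explicitly.
present = [h for h in heights if not math.isnan(h)]
clean_total = sum(present)
```

Both `nan_is_taller` and `nan_is_not_taller` come out `False`, which is exactly the kind of thing that trips up code written with ordinary numbers in mind.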

So, I had this large table full of numbers with a few missing data points, which I had filled with NaNs. Say the rows of the table are individual people (my actual case is completely different, but this will do) and the columns are attributes such as height, weight, eye color, zip code and so on.

I was interested in asking questions like “Give me data for all the people who are taller than 5′, have blue eyes and live in zip code 02115”. I took a little chunk of my data, loaded it into memory, asked my questions and got back a list of names. Everything checked out.

So I saved the data to disk and then did the same exact thing, except I used the software’s special ability to pull chunks of data straight from disk. Sometimes I would get sensible answers but sometimes I would get nothing. The software would tell me nobody matched the set of conditions I was asking for, but I knew for a fact that there were people in the database matching the description I had given.

The even odder thing was that if I loaded the file back from disk in its entirety and then asked the questions I got all correct answers.

My first reaction was that I was screwing up somewhere and I had badly formatted my data and when I was saving the data to disk I was causing some strange corruption. I started to take away more and more of my code in an effort to isolate the loose valve, so to speak.

But as the hours went by, I started to decide that, however unlikely, the problem was actually in the other guy’s code. I started to go the other way: I built a new example using as little code as I possibly could to try and replicate the problem. Finally I found it. The problem was very, very odd and very insidious.

In the table of data, if you had a missing value (a NaN) in the first or second row of any column, the software would behave just fine when the data were all in memory. When, however, you asked the software to process the same data from disk, it would return nothing, but only when you asked about that column. If you asked about other columns, the software would give the correct answer.

This took up most of my Friday. I wrote in to the person who made the software suggesting it was a bug. They got back to me saying, yep it was a bug, but they knew about it and it was actually a bug in this OTHER software from this other fella that they were using inside their own code.

By this time, I was less angry and more curious. I went over to these other folks, sniffed around a bit and read the thread where this bug was discussed. I couldn’t understand all the details, but it seems that, in order to make things work fast, they used a clever algorithm to search for matching data in the table when it was on disk. This clever algorithm, for all its speed and brilliance, would stumble and fall if the first value in the column it was searching in was not-a-number. Just like that.

Importantly, instead of raising an error, it would fail silently and lie, saying it did not find anything matching the question it was given. Just like that.
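I never read the actual search code, so what follows is only a toy illustration of how this class of bug can happen, not the library’s algorithm. It assumes a hypothetical speed-up in which each on-disk chunk keeps its minimum and maximum so that chunks which cannot possibly match are skipped without being read. The names `chunk_bounds`, `query_on_disk` and `query_in_memory` are made up for the sketch.

```python
import math


def chunk_bounds(values):
    """Naive min/max that seeds both bounds with the first value."""
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:  # always False once lo is NaN
            lo = v
        if v > hi:  # always False once hi is NaN
            hi = v
    return lo, hi


def query_in_memory(values, threshold):
    """Brute force: test every value. NaNs simply fail the comparison."""
    return [v for v in values if v > threshold]


def query_on_disk(values, threshold):
    """'Fast' path: skip the whole chunk if its max rules out a match."""
    _, hi = chunk_bounds(values)
    if not hi > threshold:  # NaN > threshold is False, so the chunk is skipped
        return []           # no error raised, just a confident "nothing found"
    return [v for v in values if v > threshold]


heights = [math.nan, 5.5, 6.1, 4.9]  # hypothetical column, NaN in the first row

in_memory_hits = query_in_memory(heights, 5.0)
on_disk_hits = query_on_disk(heights, 5.0)
```

With the NaN in the first row, `on_disk_hits` comes back empty while `in_memory_hits` contains the two tall people. In this toy version, moving the NaN to any later row makes both paths agree, which is what makes the failure so easy to miss in testing.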

This exercise brought home a realization to me. You can make your code as watertight as possible, but you almost always rely on other people’s code somewhere. You don’t want to reinvent the wheel repeatedly. But you also inherit the other fellow’s bugs. And the bugs they inherited from yet another fellow, and so on and so forth. And you can’t test for ALL the bugs.

And then I thought, this is the stuff that launches the missiles. This is the stuff that runs the X-ray machine. This is the stuff that controls more and more of the vehicles on the road.

The most common kind of code failure we are used to is the catastrophic kind, when the computer crashes and burns. When our browser quits, when we get the blue screen and the spinning beach ball. This is when you KNOW you’ve run into a bug. But this is not the worst kind. The worst kind are the silent ones. The ones where the X-ray machine delivers double the dose of x-rays and ten years later you get cancer.

And even though I knew all this cognitively, it took me a while to calm down.

The case of the hidden inversion

Josh is setting up a behavioral experiment in which the subject has to press a button as part of their response. He was writing some Matlab code to drive his experiment hardware and called me over. “I have a problem. My button gives output high when not touched and output low when closed (touched). The software is expecting that the input is low when open (not touching switch) and high when the subject touches the switch. How do you get this to work? I know Churly’s rig has the same button and the software works with it.”

Image

I wasn’t very familiar with touch buttons, but I remembered helping Churly out with an experiment and I thought I recalled his switch was active-high, that is, when the subject touched the button the voltage went high. We went to investigate and discovered that, indeed, Churly’s switch behaved as if it were active high. But it was the same model as Josh’s switch. So, somewhere, the signal was being inverted.
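As an aside, the mismatch Josh described can also be papered over in software. Here is a minimal sketch, in Python rather than Matlab, and not tied to any particular acquisition library: the function name `is_pressed` and the `active_low` flag are made up for illustration, and it assumes the raw input has already been read as a boolean logic level.

```python
def is_pressed(raw_level: bool, active_low: bool = True) -> bool:
    """Translate a raw logic level into 'button pressed', whatever the polarity.

    raw_level  -- True if the input line reads high, False if it reads low
    active_low -- True if the switch pulls the line low when touched
    """
    return (not raw_level) if active_low else raw_level


# Josh's switch: high when idle, low when touched (active-low).
josh_idle = is_pressed(raw_level=True, active_low=True)
josh_touched = is_pressed(raw_level=False, active_low=True)

# A switch that goes high when touched (active-high), as Churly's appeared to be.
churly_touched = is_pressed(raw_level=True, active_low=False)
```

That would have got the experiment running, but it would not have told us why two identical switches disagreed, which was the more interesting question.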

We traced the connections to Churly’s rig. (I should mention here one of the cardinal rules of experimental setups: if you leave, other scientists can and will steal parts from your rig to repair/build their own. Churly left the lab about a week or so ago, and his rig is already showing signs of being cannibalized.) Anyhow, after tracing the crow’s nest of wires that is standard for experimental rigs, we became very puzzled.

The circuit was hooked up as follows:

Image

The BNC coax plug went into a National Instruments breakout box.

After a little thought we decided that the only way this could work was if the two grounds (C, the power supply ground/negative, and B, the coax/system ground) were isolated from each other. Churly was using a separate power supply for the switch, but it was a cheap-looking one and nothing on the faceplate said ‘Isolated supply’, which is something you advertise, since it is expensive to build one.

We went in with a multimeter and started to poke around. Our hypothesis was that the power supply ground (C) was floating with respect to the system ground (B) and so, since A and B were tied together through a resistor, C would be at -10V with respect to the rest of the world. We stuck one end of the voltmeter in C and the other end on another BNC plug shield on the NI breakout box. No voltage difference.

This was very vexing, but fit with our intuition that this was a cheap ass power supply and everything was tied to ground. Now that our leading hypothesis was discarded we were left free in the wind. We started to measure voltages across random pairs of points in the circuit. Nothing of interest happened until we put our voltmeter across the BNC shield (B) and the power supply ground (C). 9.6 V. WHAT!??

We pressed the switch. The voltage went to 0V. We started to putter around some more. We had already measured the voltage across the system ground and (C) before, or so we thought. Just to make sure, we went through each BNC shield terminal and measured the voltage with respect to C. All 0V. Really!?

It turns out that that particular input on the NI breakout box was configured as differential. Now, we knew about differential modes. What we expected was that the differential signal pair would be the center conductors of the paired BNC channels.

What was new to us was that when you switch a pair of inputs to differential mode, the National Instruments BNC breakout board ties the shield of the BNC input to the reference input, thereby decoupling it from the system ground (and from the other BNC shield terminals), which is what threw us off at the start.

I had expected that, in differential mode, the shields would still remain attached to system ground, and you would introduce the signal and reference through the ‘signal’ wires (the center conductors). I’m still not sure about the wisdom of passing the reference signal through the shield of the BNC cable, since the frequency characteristics of the shield are different from those of the core, but perhaps this is only significant way beyond the digitizing bandwidth of the cards and is therefore irrelevant.

Gareth McCaughan’s summary of what we know about D-Wave’s quantum computer

D-Wave is making waves (I know, I know) with its quantum computer. Scott Aaronson’s post about the latest in the D-Wave saga was linked on Hacker News. Gareth McCaughan has given a very nice layperson’s summary of this post, which is actually a very nice standalone read on D-Wave and the essence of quantum computing. This amazing comment (which encapsulates all that is great about HN) is linked here in the hope that it will not disappear so soon. Gareth, do make this a blog post for us laypeople, if you have time. Thanks again.