I was putting some finalizing touches to pre-processing some data in preparation for some analysis I was raring to do. The plan was to create some pretty pictures, get some insight, get this off my desk by noon and go into the weekend with no backlog and a clear conscience. But here I am, this Friday evening, rewriting the code to work around a bug in someone else’s software. I was angry for a while but then I wasterrified.
To be honest, it’s not a sudden realization, and I suspect that all of you have had this realization. It’s just that it has to happen to you personally, to bring that extra visceral element in.
Everyone who has used Windows (and now, sadly Mac OS X to an increasing and annoying degree) knows that all software has bugs. I used to think this had primarily to do with the fact that these are graphical operating systems that have to deal with asynchronous and random input.
These are difficult problems to find test cases for and the bulk of the software is a series of checks and balances to see if the user is allowed to do what they just did given what they have been doing in the past. I wrote GUIs once and I swore I would never do it again. No matter how many checks you put in, someone somewhere is going to come in and press the right combination of keys at just the right pace to lock your software and crash it, eventually.
Widely used, well tested computing packages, on the other hand, I pretty much trusted. Yes, it is true, that there are tricky algorithms such as integration and differentiation and optimization that are hard to get very right and not all implementations are equal. Some trade accuracy for speed and so on, but I don’t expect anything to collapse catastrophically if thousands of people have been using it for years.
And yet here I was sitting at my keyboard, fuming, because a module I was using presented a strange, extremely unexpected bug. To top it off, the library doesn’t do any fancy computations, doesn’t do any graphics or any user interface stuff. All it does is take tabular data and save it to disk.
The selling point of the software is that it allows you to build a file on disk, many gigabytes in size, much, much larger than your available memory, and still process data from it seamlessly. I’ve used it for a while and it works for me.
Another great thing about the library was that it had an easy way to indicate missing data. It uses something called a ‘NaN’ which expands to Not-a-Number which is a fairly common value we put in our data to say “hey, don’t take this value into account when you do some computation, like summing or multiplying this table of numbers, it’s not really there.”
So, I had this large table full of numbers with a few missing data points, which I had filled with NaNs. Say the rows of the table are individual people (my actual case is completely different, but this will do) and the columns are attributes such as height, weight, eye color, zip code and so on.
I was interested in asking questions like “Give me data for all the people who are taller than 5′, have blue eyes and live in zip code 02115”. I took a little chunk of my data, loaded it into memory, asked my questions and got back a list of names. Everything checked out.
So I saved the data to disk and then did the same exact thing, except I used the software’s special ability to pull chunks of data straight from disk. Sometimes I would get sensible answers but sometimes I would get nothing. The software would tell me nobody matched the set of conditions I was asking for, but I knew for a fact that there were people in the database matching the description I had given.
The even odder thing was that if I loaded the file back from disk in its entirety and then asked the questions I got all correct answers.
My first reaction was that I was screwing up somewhere and I had badly formatted my data and when I was saving the data to disk I was causing some strange corruption. I started to take away more and more of my code in an effort to isolate the loose valve, so to speak.
But as the hours went by, I started to decide, that however unlikely, the problem was actually in the other guy’s code. I started to go the other way. I started to build a new example using as little code as I possibly could to try and replicate the problem. Finally I found it. The problem was very very odd and very insidious.
In the table of data, if you had a missing value (a NaN) in the first or second row of any column the software would behave just fine when the data were all in memory. When, however, you asked the software to process the same data from disk it would return nothing, but only when you asked about that column. If you asked about other columns the software would give the correct answer.
This took up most of my Friday. I wrote in to the person who made the software suggesting it was a bug. They got back to me saying, yep it was a bug, but they knew about it and it was actually a bug in this OTHER software from this other fella that they were using inside their own code.
By this time, I was less angry and more curious. I went over to these other folks and sniffed around a bit and read the thread where this bug was discussed. I couldn’t understand all the details but it seems, that in order to make things work fast, they used a clever algorithm to search for matching data on the data table when it was on disk. This clever algorithm, for all its speed and brilliance, would stumble and fall if the first value in the column it was searching in was not-a-number. Just like that.
Importantly, instead of raising an error, it would fail silently and lie, saying it did not find anything matching the question it was given. Just like that.
This exercise brought home a realization to me. You can make your code as water tight as possible, but you almost always rely on other people’s code somewhere. You don’t want to reinvent the wheel repeatedly. But you also inherit the other fellow’s bugs. And the bugs they inherited from yet another fellow and so and so forth. And you can’t test for ALL the bugs.
And then I thought, this is the stuff that launches the missiles. This is the stuff that runs the X-ray machine. This is the stuff that controls more and more of vehicles on the road.
The most common kind of code failure we are used to is the catastrophic kind, when the computer crashes and burns. When our browser quits, when we get the blue screen and the spinning beach ball. This is when you KNOW you’ve run into a bug. But this is not the worst kind. The worst kind are the silent ones. The ones where the X-ray machine delivers double the dose of x-rays and ten years later you get cancer.
And even though I knew all this cognitively, it took me a while to calm down.