Chasing down a Cython error

My python program segfaulted. When I wrote C++ programs segfaults were no big deal. I would hook into gdb mostly through my IDE’s debugger, and dance around the point the program crashed until I had an idea of where I was going wrong. Python gives a wonderful stack trace and tells you where things have gone wrong, so you can just look into the source code from the information Python spits out before it dies. But not for C-extensions.

When things go wrong in a Cython or C module that you are calling from Python you are stuck, a little bit, in no-mans land. Fortunately, from a stack-overflow thread here, you can just run Python in gdb, run your code and then debug it all as a C program (Python) operating on some data (Your Python program).

Unfortunately, I’m on Mac OS X, and though I could use homebrew to install gdb, I could not get it to run my python interpreter due to a binary compatibility issue. I decided to be brave and use LLDB (which comes with Mac OS X) to try and debug this error.

lldb python

This attaches Python to the debugger. Then you can run the Python program as follows

run /path/to/my/script arg1 arg2

My program crashed with

Fatal Python error: GC object already tracked
Process 59193 stopped

And some instructions in assembly which weren’t so helpful. However, I could do

thread backtrace

Which then led me to some candidate functions in the c-source file of the translated Cython code, like:

   frame #4: 0x000000010004d658 Python`PyTuple_New + 252
    frame #5: 0x00000001036ad6af util_cython.so`__Pyx_PyObject_CallOneArg(_object*, _object*) [inlined] __Pyx__PyObject_CallOneArg(arg=<unavailable>) + 5 at util_cython.cpp:7694
    frame #6: 0x00000001036ad6aa util_cython.so`__Pyx_PyObject_CallOneArg(func=0x000000010456f170, arg=0x0000000104c62a90) + 149 at util_cython.cpp:7712
    frame #7: 0x00000001036b1480 util_cython.so`__pyx_pw_5mitty_3lib_11util_cython_9markov_sequences(_object*, _object*, _object*) [inlined] __pyx_f_5mitty_3lib_11util_cython_markov_chain_sequence_gen(__pyx_v_first_letter='C', __pyx_v_l=<unavailable>, __pyx_v_max_len=10, __pyx_v_rng=<unavailable>) [5], unsigned char*, unsigned long*, unsigned long, _object*) + 110 at util_cython.cpp:2898
    frame #8: 0x00000001036b1412 util_cython.so`__pyx_pw_5mitty_3lib_11util_cython_9markov_sequences(_object*, _object*, _object*) [inlined] __pyx_pf_5mitty_3lib_11util_cython_8markov_sequences(__pyx_v_max_len=0x0000000100300678) + 2150 at util_cython.cpp:3710
    frame #9: 0x00000001036b0bac util_cython.so`__pyx_pw_5mitty_3lib_11util_cython_9markov_sequences(__pyx_self=<unavailable>, __pyx_args=<unavailable>, __pyx_kwds=<unavailable>) + 6070 at util_cython.cpp:3263
    frame #10: 0x000000010008614d Python`PyEval_EvalFrameEx + 8080
    frame #11: 0x0000000100084093 Python`PyEval_EvalCodeEx + 1641

Which was enough information for me to look more carefully at my markov_chain_sequence_gen function.

While x:

This is a cautionary tale about not misusing Python’s kindness. Or perhaps a cautionary tale about not trusting even widely used libraries to mirror what Python does. It is a little technical …

I have a program that, for the purposes of this post, operates on a list of pairs of numbers, going through each pair one by one until it exhausts the list. For certain reasons I’m using an iterator on the list and using Python’s ‘None’ to indicate that the list has run out, like so:

x = [(0, 0), (1, 2), (3,1)]
itr = x.__iter__()
v = next(itr, None)
while v:
  print v
  v = next(itr, None)

The pair of numbers represents particular data. The first number can range from 0 onwards and is always increasing monotonically while the second number is one of {0, 1, 2}. I have several tests for this function and have checked for different edge cases.

As you can see, in the snippet above, the list x is a list of Python tuples. After the bulk of the development for this (and related) functions had been completed, I had decided that a better way to store my list of tuples would be as a numpy structured array. That way I would have labels for my first and second numbers in the pair and wouldn’t get confused which was which six months from now.

And Python, being the great dynamic language it is, let me make the switch without a murmur. Python doesn’t care that the object I was initially passing it was a list of tuples and the new one is a numpy structured array! Python sees both objects yield iterators as needed and both objects are spitting out pairs of numbers at each iteration. So Python just works!

Things ran just fine for a while.

One day my colleague, who has used my program many times before, came across some very strange results in their experiments. I knew right away that this was going to be an interesting bug, just like any bug that is found after a program has been used for a while. The simple/easy bugs have all been squashed. What remains are the cockroaches. Evil, secretive errors in logic that scurry way from the daylight and come out only in the shadows of the night.

So I took the input data my colleague was using and ran my program. Indeed something bizarre was going on. For some of the data the program was acting exactly as it should. But for some of the data the program claimed that the list of pairs of numbers was empty and would exit right away. It would do the rest of the processing without any errors.

My first guess was that something was wrong with the input files. The input files were being generated with some hastily written code that I had not fully tested, so it was the logical choice, though this would make the bug less interesting.

I reviewed the input data and found that it was just fine. So the input to the program was correct. The program was just borking, seemingly, randomly.

I then started to do things the old fashioned way. I started to put in print statements at various parts of the code and simply print the length of my array of pairs of numbers. This was being done on a remote machine and I hadn’t learned yet how to hook my debugger into the remote machine. Also, sometimes, print statements are just a more pragmatic way to debug, regardless of what the eggheads will tell you.

The program loads the list. A million elements in the list. Good. The program does a transformation on the list. A million elements. Good. Another transformation. Still a million little pairs of numbers, hanging on for dear life. Then they enter this piece of code I sketched out above. And BAM! the program skips the loop, grinning as it whizzes by, zero little pairs of numbers.

I run the program again, this time with a different input set. A hundred thousand elements. Passes first transform, second transform, and then my loop. Round and round it goes in the loop, a hundred thousand times before shooting out, all dizzy at the end, having done exactly what it was supposed to do.

Very interesting.

What is different between the two data sets that would make the code DO this? So I opened by the data and started to look closely at the ones that failed and the ones that succeeded and then it struck me. All the ones that failed had (0, 0) as their first element.

And then it was clear.

When I wrote the loop:

itr = x.__iter__()
v = next(itr, None)
while v:
  print v
  v = next(itr, None)

I was a bit lazy. What I really meant was, “exit the loop if v is None”. Python is very kind and None evaluates to False in this test, so it is equivalent. The problem is that the loop will also terminate if v == 0. But, wait this is actually not a problem because my v is not actually a scalar number. It is a pair of numbers and during my testing I have repeatedly verified that (0, 0) != 0. One is a scalar value – zero – while the other is a Python object – a tuple. That can’t possibly be the problem!

But wait! Halfway through development I told you I switched from using Python lists of tuples to numpy arrays.

So I looked a bit into that, and BAM, sure enough, that’s where the behavior differs. For some reason the numpy library has been written to evaluate an element of a structured array the same as zero if both their elements are zero. This is different from taking a slice of a multi-dimensional array, where Python will complain that “ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”

This is what still throws me. When I ask numpy for an element of my structured array of pairs of numbers, I get back a pair of numbers. These numbers are not simply another numpy array of two elements – which is what you would get from slicing a 2D numpy array. It’s a pair of numbers. However, this pair does not act as a Python tuple. It’s a different beast altogether.

Well, folks, the moral of the story is, don’t be lazy. If you want to test for None, test it explicitly. You never know what evaluates to 1 or to 0 even if you think you do.

If you would like to see this bug for yourself, here is a sufficient example:

import numpy as np

def looper(x):
  itr = x.__iter__()
  v = next(itr, None)
  while v:
    print v
    v = next(itr, None)

def python_list():
  print 'Python list'
  l = [(0, 0), (1, 0), (2, 2), (3, 0)]
  looper(l)

def numpy_struct_array():
  print 'Numpy array'
  l = np.array([(0, 0), (1, 0), (2, 2), (3, 0)], dtype=[('x', 'int32'), ('y', 'int8')])
  looper(l)

if __name__ == '__main__':
  python_list()
  numpy_struct_array()

The three switches problem

The light bulb problem is a class of problem often presented at interviews as a test of lateral thinking. Here, in a mischievous spirit taken from the best subversive traditions we will attack this question by thinking a little too laterally …

There are different versions of the light bulb problem, but they rely on the same trick. Here is one version:

Your grand uncle lives in a rambling old house. He had an electrician over to wire a light in the attic but the electrician messed with the wiring and set up three switches instead of one. Your grand uncle wants to figure out which of the switches turns on the light in the attic. Problem is, you can’t tell from below whether the attic light is on or off. You can fiddle with the switches and then go up to the attic to inspect the light, but you can only do this once. The old coot says you can’t keep going back down and up the attic stairs fiddling with the light – apparently it disturbs the cat and he scratches the furniture.

The conventional answer to the question is to flip one switch on and leave it for a bit. Then turn it off and turn on another then go up to the attic. If the light is on, the last switch you flipped was the one. If the light is warm (Aha!) then the first switch you flipped is the one. If the light is off and cold, well is the remaining one. Let’s try and subvert the conventional answer to this problem by engaging in some mild subterfuge in the spirit of Calandra‘s “Angels-on-a-Pin“.

  1. Rapidly flip one of the switches on and off many times. Finally set this switch to the off position and flip another switch to on. Now go to the attic. If the light is on, it’s the currently on switch. If the light is burnt out, its the first switch.

  2. Call in your grand aunt, explain the problem to her, crawl up to the attic and have her flip the switches one by one. Yell when the light comes on.

  3. Flip a switch and inspect all the fixtures in the house, repeat and eliminate two of the switches. The third one is the attic switch. If none of the switches seem to be doing anything, where’s the problem? Just hit all three switches when you need the attic light on.

  4. Wait for a moonless night. Turn off all the lights in the house and draw all the curtains and blinds. Now flip the switches one by one. There will be a difference in the light level, however slight, from the attic light going on.

  5. Fly a drone up into the attic. Now you have eyes on target. Screw the cat.

  6. Take all bulbs out of their sockets and unplug all appliances. Now flip the switches one by one and watch the electric meter. This may take some patience.

  7. Open up the cover plate. Take a voltmeter and measure the drop in voltage as the switches are flipped. If only one has a voltage drop, that’s the one wired to the attic switch. If more than one have drops, throw stones into the attic until the light bulb is broken. Redo the measurements. The switch that now has no voltage drop (but did previously) is the one. Take a new bulb, go up into the attic and replace it.

  8. Flip each switch in turn and let the bulb warm up. Throw a bucket of water into the attic. If you hear a loud “POP” the bulb was on. Leave adequate cool down periods between tests. Take a new bulb, go up into the attic and replace it.

  9. Tie three strings to the switches. Go up into the attic with the strings. (Some will say this is subverting the question. Yes it is. Yes it is)

  10. Get all the shiny pots and pans your grand uncle has. Arrange them like mirrors so that you have a view of the attic from the switches.

As an aside, here is a problem I like better, as solving it actually teaches you some mathematical concepts:

Your grand uncle has passed on. In his will he’s bequeathed you an antique coin of immense value (so he says). He’s also left you with eight accurate forgeries of that coin that are indistinguishable from the original, except that they are just a fraction lighter. He’s also left you a balance, which can only be used twice before it breaks. So he says.

You are only allowed to take one of the coins with you. The original coin is priceless, the forgeries are worthless. So he says. You suspect all the coins are duds and the balance won’t break, but you take your Grand uncle’s will in the spirit it was meant, as a last fun puzzle to keep you busy so you don’t get too sad at the old codger’s passing.