How to shoot yourself in the foot with Cython

If you’ve grown fat and complacent using Python, try a little Cython. It’ll put hair on your chest and take some off your scalp.

Take the following Cython code I was writing (heavily paraphrased)

cdef:
  char *s = seq  # seq is passed in to the function
  unsigned char i, j
  unsigned char sub_mat[85]

for i, j in zip([65, 67, 71, 84], [84, 65, 67, 71]):
  sub_mat[i] = j

return [ sub_mat[s[i]]  for i in range(len(seq))]

This code compiles file. It even runs fine for a few test cases I set up for it. But when I put it into a standard testing pipeline I use, it caused a different test to freeze up.

My first reaction was, I’ve lost a pointer somewhere. So I started to dig around. (That reminds me, there has got to be a easier way to hook a debugger up to Cython. The prescribed ways are so complicated). I started to sprinkle print statements in the code. Nothing.

I then broke the final list comprehension out into a loop and put print statements in there. Suddenly I saw that my program wasn’t going off the reservation by losing a pointer. It was actually looping over and over again at that point. And then the lightbulb went off. The loop variable i had been typed as an unsigned char (max value 255) and if len(seq) was ever greater than that it would just cause the loop to go on for ever as i overflowed and reset to 0 repeatedly.

Of course, those of you programming in C will have immediately noticed this issue and asked – “Does len(seq) ever get larger than 255?” But I’ve grown fat and lazy on Python which lets me do things like 1<<1024 without missing a beat (yes, that creates a 1025 bit integer. Suck it up C programmer), so, on occasion, I have to lose some hair from the top of my head when I slum it with all you C-coders.

Get more out of Cython

Getting more out of your Cython code doesn’t have to be a black art. A key step is to peek into the cythonized .c file and note if inner loops are getting down to bare C without any of the overheads associated with Python’s dynamism.

In an earlier post I had introduced Cython, which is a really cool language that straddles Python and C. Cython is Python with some ctype defs that helps a translator (Cython) to convert the Python code into a .c file which can be compiled and imported as a module. I’m one of the many folks who rave about this system because, just by changing .py to .pyx and running the code through the Cython system get get a 2x speedup for your functions. The next stage is to add some type declarations to your code. These give the real speedups.

I became curious as to what the type declarations actually do, and how they can help. Beyond curiosity, I was hoping that understanding this process would help me write more efficient Cython code.

I was working on a simple piece of code to determine the transition matrix for a DNA sequence. The DNA sequence is a string of characters from the set {A,C,G,T,N} (The N represents an unknown character, or, sometimes, a don’t care character). The transition matrix simply represents the probabilities that, in the sequence, character X will be followed by character Y, for every possible transition. This means a 5×5 matrix.

My first pass looked like this:
actg.pyx

def compute_transition_matrix(bytes seq):
  # Given a string, computes the transition matrix description of that string
  cdef:
    unsigned long int t_mat[85][85]
    unsigned long int seq_len = len(seq), n, m

  for n in [65, 67, 71, 84, 78]:
    for m in [65, 67, 71, 84, 78]:
      t_mat[n][m] = 0

  for n in range(1, seq_len):
    t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1

  # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
  return [[t_mat[n][m] for m in [65, 67, 71, 84, 78]] for n in [65, 67, 71, 84, 78]]

Inspecting the actg.c code generated by running cython on this file I found that my inner loop looked like this:

  /* "actg.pyx":17
 *       t_mat[n][m] = 0
 * 
 *   for n in range(1, seq_len):             # <<<<<<<<<<<<<<
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1
 * 
 */
  __pyx_t_4 = __pyx_v_seq_len;
  for (__pyx_t_7 = 1; __pyx_t_7 < __pyx_t_4; __pyx_t_7+=1) {
    __pyx_v_n = __pyx_t_7;

    /* "actg.pyx":18
 * 
 *   for n in range(1, seq_len):
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1             # <<<<<<<<<<<<<<
 * 
 *   # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
 */
    if (unlikely(__pyx_v_seq == Py_None)) {
      PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    }
    __pyx_t_8 = (__pyx_v_n - 1);
    __pyx_t_9 = __Pyx_PyBytes_GetItemInt(__pyx_v_seq, __pyx_t_8, 1); if (unlikely(__pyx_t_9 == ((char)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __pyx_t_10 = ((unsigned char)__pyx_t_9);
    if (unlikely(__pyx_v_seq == Py_None)) {
      PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    }
    __pyx_t_9 = __Pyx_PyBytes_GetItemInt(__pyx_v_seq, __pyx_v_n, 1); if (unlikely(__pyx_t_9 == ((char)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __pyx_t_11 = ((unsigned char)__pyx_t_9);
    ((__pyx_v_t_mat[__pyx_t_10])[__pyx_t_11]) = (((__pyx_v_t_mat[__pyx_t_10])[__pyx_t_11]) + 1);
  }

I know that the first thing you are thinking is that, Man this looks butt-ugly. Also gotos?!?! Well, the code is not really meant for human consumption AND there are some things in here that makes it easy to find what you are looking for. The first is that the name mangling preserves the core of the function name, so you can do a search for the function in an editor. The second thing, which turns out to be very, very important is that the original code is in comments along side the relevant C code. And for this I take my hats off the Cython folks, because they did not have to do this, but they put it in any way, which makes our inspection a LOT easier.

Now, you can see right away that the inner loop has a bunch of funny checks related to making sure variable types are correct and have appropriate attributes etc. And I thought, I don’t need this cruft, I KNOW this is a string of bytes – an array of chars.

My change was to explicitly tell Cython that my bytes seq variable was really an array of char (Note the char *seq = s):

def compute_transition_matrix(bytes s):
  """Given a string, computes the transition matrix description of that string"""
  cdef:
    unsigned long int t_mat[85][85]
    unsigned long int seq_len = len(s), n, m
    char *seq = s

  for n in [65, 67, 71, 84, 78]:
    for m in [65, 67, 71, 84, 78]:
      t_mat[n][m] = 0

  for n in range(1, seq_len):
    t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1

  # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
  return [[t_mat[n][m] for m in [65, 67, 71, 84, 78]] for n in [65, 67, 71, 84, 78]]

The inner loop that results from this code, looks like:

  /* "actg.pyx":35
 *       t_mat[n][m] = 0
 * 
 *   for n in range(1, seq_len):             # <<<<<<<<<<<<<<
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1
 * 
 */
  __pyx_t_5 = __pyx_v_seq_len;
  for (__pyx_t_8 = 1; __pyx_t_8 < __pyx_t_5; __pyx_t_8+=1) {
    __pyx_v_n = __pyx_t_8;

    /* "actg.pyx":36
 * 
 *   for n in range(1, seq_len):
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1             # <<<<<<<<<<<<<<
 * 
 *   # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
 */
    __pyx_t_9 = ((unsigned char)(__pyx_v_seq[(__pyx_v_n - 1)]));
    __pyx_t_10 = ((unsigned char)(__pyx_v_seq[__pyx_v_n]));
    ((__pyx_v_t_mat[__pyx_t_9])[__pyx_t_10]) = (((__pyx_v_t_mat[__pyx_t_9])[__pyx_t_10]) + 1);
  }

Which looks A LOT BETTER.

This is reflected in some simple %timeit based benchmarking by running this code on Human chromosome 1. The first version takes 1.4s to run while the second version takes 0.27s – a 5x speedup, which corresponds with us taking all that dynamic cruft out of the inner loop.

One thing to note is that the PROPER way to do all this is profile your code first and find out where the bottle necks are. Sometimes doing %timeit on individual functions are all that is needed. Sometimes you will need to do a proper profile, and Cython as directives to allow you to profile Cython functions, but these can slow the code down.