# Get more out of Cython

Getting more out of your Cython code doesn’t have to be a black art. A key step is to peek into the cythonized .c file and note if inner loops are getting down to bare C without any of the overheads associated with Python’s dynamism.

In an earlier post I had introduced Cython, which is a really cool language that straddles Python and C. Cython is Python with some ctype defs that helps a translator (Cython) to convert the Python code into a .c file which can be compiled and imported as a module. I’m one of the many folks who rave about this system because, just by changing .py to .pyx and running the code through the Cython system get get a 2x speedup for your functions. The next stage is to add some type declarations to your code. These give the real speedups.

I became curious as to what the type declarations actually do, and how they can help. Beyond curiosity, I was hoping that understanding this process would help me write more efficient Cython code.

I was working on a simple piece of code to determine the transition matrix for a DNA sequence. The DNA sequence is a string of characters from the set {A,C,G,T,N} (The N represents an unknown character, or, sometimes, a don’t care character). The transition matrix simply represents the probabilities that, in the sequence, character X will be followed by character Y, for every possible transition. This means a 5×5 matrix.

My first pass looked like this:
actg.pyx

def compute_transition_matrix(bytes seq):
# Given a string, computes the transition matrix description of that string
cdef:
unsigned long int t_mat[85][85]
unsigned long int seq_len = len(seq), n, m

for n in [65, 67, 71, 84, 78]:
for m in [65, 67, 71, 84, 78]:
t_mat[n][m] = 0

for n in range(1, seq_len):
t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1

# Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
return [[t_mat[n][m] for m in [65, 67, 71, 84, 78]] for n in [65, 67, 71, 84, 78]]

Inspecting the actg.c code generated by running cython on this file I found that my inner loop looked like this:

/* "actg.pyx":17
*       t_mat[n][m] = 0
*
*   for n in range(1, seq_len):             # <<<<<<<<<<<<<<
*     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1
*
*/
__pyx_t_4 = __pyx_v_seq_len;
for (__pyx_t_7 = 1; __pyx_t_7 < __pyx_t_4; __pyx_t_7+=1) {
__pyx_v_n = __pyx_t_7;

/* "actg.pyx":18
*
*   for n in range(1, seq_len):
*     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1             # <<<<<<<<<<<<<<
*
*   # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
*/
if (unlikely(__pyx_v_seq == Py_None)) {
PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
__pyx_t_8 = (__pyx_v_n - 1);
__pyx_t_9 = __Pyx_PyBytes_GetItemInt(__pyx_v_seq, __pyx_t_8, 1); if (unlikely(__pyx_t_9 == ((char)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__pyx_t_10 = ((unsigned char)__pyx_t_9);
if (unlikely(__pyx_v_seq == Py_None)) {
PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
__pyx_t_9 = __Pyx_PyBytes_GetItemInt(__pyx_v_seq, __pyx_v_n, 1); if (unlikely(__pyx_t_9 == ((char)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
__pyx_t_11 = ((unsigned char)__pyx_t_9);
((__pyx_v_t_mat[__pyx_t_10])[__pyx_t_11]) = (((__pyx_v_t_mat[__pyx_t_10])[__pyx_t_11]) + 1);
}

I know that the first thing you are thinking is that, Man this looks butt-ugly. Also gotos?!?! Well, the code is not really meant for human consumption AND there are some things in here that makes it easy to find what you are looking for. The first is that the name mangling preserves the core of the function name, so you can do a search for the function in an editor. The second thing, which turns out to be very, very important is that the original code is in comments along side the relevant C code. And for this I take my hats off the Cython folks, because they did not have to do this, but they put it in any way, which makes our inspection a LOT easier.

Now, you can see right away that the inner loop has a bunch of funny checks related to making sure variable types are correct and have appropriate attributes etc. And I thought, I don’t need this cruft, I KNOW this is a string of bytes – an array of chars.

My change was to explicitly tell Cython that my bytes seq variable was really an array of char (Note the char *seq = s):

def compute_transition_matrix(bytes s):
"""Given a string, computes the transition matrix description of that string"""
cdef:
unsigned long int t_mat[85][85]
unsigned long int seq_len = len(s), n, m
char *seq = s

for n in [65, 67, 71, 84, 78]:
for m in [65, 67, 71, 84, 78]:
t_mat[n][m] = 0

for n in range(1, seq_len):
t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1

# Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
return [[t_mat[n][m] for m in [65, 67, 71, 84, 78]] for n in [65, 67, 71, 84, 78]]

The inner loop that results from this code, looks like:

/* "actg.pyx":35
*       t_mat[n][m] = 0
*
*   for n in range(1, seq_len):             # <<<<<<<<<<<<<<
*     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1
*
*/
__pyx_t_5 = __pyx_v_seq_len;
for (__pyx_t_8 = 1; __pyx_t_8 < __pyx_t_5; __pyx_t_8+=1) {
__pyx_v_n = __pyx_t_8;

/* "actg.pyx":36
*
*   for n in range(1, seq_len):
*     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1             # <<<<<<<<<<<<<<
*
*   # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
*/
__pyx_t_9 = ((unsigned char)(__pyx_v_seq[(__pyx_v_n - 1)]));
__pyx_t_10 = ((unsigned char)(__pyx_v_seq[__pyx_v_n]));
((__pyx_v_t_mat[__pyx_t_9])[__pyx_t_10]) = (((__pyx_v_t_mat[__pyx_t_9])[__pyx_t_10]) + 1);
}

Which looks A LOT BETTER.

This is reflected in some simple %timeit based benchmarking by running this code on Human chromosome 1. The first version takes 1.4s to run while the second version takes 0.27s – a 5x speedup, which corresponds with us taking all that dynamic cruft out of the inner loop.

One thing to note is that the PROPER way to do all this is profile your code first and find out where the bottle necks are. Sometimes doing %timeit on individual functions are all that is needed. Sometimes you will need to do a proper profile, and Cython as directives to allow you to profile Cython functions, but these can slow the code down.