functools.partial

functools.partial is one of those things that you don’t need until you need it, and then you need it badly. While trying to create a series of lambda functions I discovered that the simplest answer that came to mind did something counterintuitive, and functools.partial – which I had ignored until then – came to my rescue.

Effectively, I wanted to create a list of functions

f1(x), f2(x), f3(x) ...

where each function took the same input but did something different, for example

f1(x) = x + 1, f2(x) = x + 2, f3(x) = x + 3

What I tried was as follows

fl = [lambda x: x + i for i in range(4)]

It ran fine, but it gave me the following output

fl[0](1) -> 4
fl[1](1) -> 4
fl[2](1) -> 4
fl[3](1) -> 4

What was happening was that all the lambda functions closed over the same variable i, so each one saw its final value (3) rather than the value i had at the time the function was created.

An object oriented person would probably do something like

class F:
  def __init__(self, i):
    self.i = i
  def f(self, x):
    return x + self.i

fl = [F(i) for i in range(4)]  # used as fl[0].f(1)

But I enjoy Python because it borrows a lot of things from LISP and allows you to write code very succinctly, making it easier to grasp (if done well – it’s also possible to write Perl in Python).

Turns out the way you can capture the state of i in the lambda is to create a partial function, and functools.partial allows you to do this succinctly as

fl = [functools.partial(lambda x, j: x + j, j=i) for i in range(4)]
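A quick sanity check (outputs shown as comments) confirms each function now remembers its own value of i:

import functools

fl = [functools.partial(lambda x, j: x + j, j=i) for i in range(4)]

fl[0](1)  # -> 1
fl[1](1)  # -> 2
fl[3](1)  # -> 4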

For those who are curious WHY I had to do this – well, here is an analogy to my use case: I had data that was well expressed as a table of values. The data itself was stored as a list of dictionaries which were sometimes nested – some of the values formed part of a hierarchy. For example, one row of the data could look like this:

row = {
 'name': 'Baker',
 'rank': 22,
 'score': {
   'raw': 43,
   'scaled': 25
 }
}

I wanted a table with columns

Last Name | Rank | raw score | scaled score
----------|------|-----------|-------------
    Baker |   22 |        43 |           25

The actual table I was creating was meant to show ALL the data from an experiment with many variables and so had a lot of columns. I was wary of making a mistake in labeling the column relative to the value it had. It would be one of those silent and deadly errors.

If I had had a simple flat structure, I might simply have changed the code so that the dictionary keys matched the column headers, but I had to weigh against that the fact that there was other code – already written and debugged – that used the existing keys, and that the existing keys were short and succinct and I liked them.

My solution was to create a list of tuples

[(column header, function to extract column value from dictionary)]

And then simply loop over this list, first to create the header, and then, for each row, to extract each data value into its correct place. Having the column header right next to the extraction function minimized the chances I would mess up somewhere, especially when adding or moving a column while editing or refactoring the code.
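For the example row above, the list might look like this (a sketch; the lambdas are stand-ins for my real extraction functions):

columns = [
  ('Last Name',    lambda row: row['name']),
  ('Rank',         lambda row: row['rank']),
  ('raw score',    lambda row: row['score']['raw']),
  ('scaled score', lambda row: row['score']['scaled']),
]

header = [h for h, _ in columns]
values = [extract(row) for _, extract in columns]  # one table row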

Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I couldn’t live without it. I find Python generators and list/dict comprehensions key to making Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun notwithstanding, I was happy to learn that Common Lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting each K-character-long section (called a “kmer”). (Bioinformatics-savvy readers will recognize the file as a FASTQ file, and the K-character sections as k-mers from the reads in the FASTQ.)

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create)

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:  # the read is the second line of each four-line record
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) >= kmer_size  # >= so a read exactly kmer_size long still yields one kmer
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators, which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)
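(Since Python 3.3 the nesting collapses even further with yield from, which delegates to a sub-generator:)

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    yield from kmers_from_seq(seq, kmer_size)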

Update: My first run-through is now appended to the end of this article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be a nested, map-like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve made minor changes to the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to the seq line (second line) of every FASTQ record."
  (flet ((read-seq-line (in)
           (prog2
               (read-line in nil)
               (read-line in nil)
             (read-line in nil)
             (read-line in nil))))
    (loop for seq = (read-seq-line in)
          while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
    (loop
       for pos from 0
       for end = (+ pos kmer-size)
       while (<= end len)
       do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.
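For contrast, here is roughly what the same Russian-doll, callback-passing pattern would look like back in Python (a sketch reusing read_fastq and the variables from earlier):

def map_kmers_in_fastq(fname, kmer_size, fn):
  # fn, the computation, is passed inward instead of values being yielded outward
  def per_seq(seq):
    for n in range(len(seq) - kmer_size + 1):
      fn(seq[n:n + kmer_size])
  for seq in read_fastq(fname):
    per_seq(seq)

map_kmers_in_fastq(fastq_fname, kmer_size, print)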

Appendix: my first run through using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has to be the first idiom I know from Python that I cannot replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
        (prog2  ; <-- prog1 and prog2 are nifty!
            (read-line in nil)
            (read-line in nil)  ; <-- this is the sequence line
          (read-line in nil)
          (read-line in nil)))))  ; http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
      while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

(defun kmer-gen (seq &key (kmer-size 30))
  (let ((n 0) (n-max (- (length seq) kmer-size)))
    #'(lambda ()
        (prog1
            (if (<= n n-max)  ; <= so the last kmer is not dropped
                (subseq seq n (+ n kmer-size))
                nil)
          (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
      while seq do
        (let ((g (kmer-gen seq :kmer-size 10)))
          (loop for s = (funcall g)
                while s do (format t "~A~%" s))))

It’s not terrible, as the code still looks compact and readable, but it’s not as nicely abstracted as the nested Python generators.

(A complete non sequitur: remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well, now I have this issue with prefix and infix notation. I keep trying to write (print kmer), and print(kmer) is starting to look plain weird …)

Some notes on using Python multiprocessing

This is a short note describing a common code pattern useful for parallelizing some computations using the Python multiprocessing module.

Many problems are of the embarrassingly parallel type, where the task consists of the same set of computations done independently on a large set of input data. In many languages such problems are solved using multi-threaded code. Due to events too complicated and embarrassing to mention (keyword: the Global Interpreter Lock) Python doesn’t do too well with multiple threads, and we need to use multiple processes via the multiprocessing module instead.

A common pattern is to set up a number of workers and use queues to pass data back and forth. The pattern looks like this:

[Figure: worker processes connected to the parent process by task and result queues]

The use of this pattern is relatively straightforward, but there are some things to pay attention to.
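Here is a minimal sketch of the pattern (the squaring is a stand-in for the real computation; the worker count and queue sizes are arbitrary):

import multiprocessing as mp

def worker(work_q, result_q):
  # Pull items until the parent sends a None sentinel
  for item in iter(work_q.get, None):
    result_q.put(item * item)

if __name__ == '__main__':
  work_q, result_q = mp.Queue(maxsize=1000), mp.Queue(maxsize=1000)
  workers = [mp.Process(target=worker, args=(work_q, result_q)) for _ in range(4)]
  for w in workers:
    w.start()
  for n in range(10):
    work_q.put(n)
  for _ in workers:
    work_q.put(None)  # one sentinel per worker
  results = [result_q.get() for _ in range(10)]
  for w in workers:
    w.join()
  print(sorted(results))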

Sharing read-only data

The general topic of sharing data between processes is worthy of deeper discussion, and there are some pointers in the Python documentation, but there is an easy way to share read-only data between processes: use globals that are written in the parent process BEFORE the child processes are started.

If you declare a global variable and then modify it after the child processes are started, or if you modify the global from a child process, you’ll enter the twilight zone. There will be no errors, and from within the process doing the writing everything will look consistent, but the rest of the universe will pay no attention to your writes.

Queues have overhead – the object has to be pickled by the sender and then un-pickled by the receiver, and while the object is in the queue, it’s taking up memory. If you have a large amount of read-only data that needs to be shared amongst child processes – like a lookup table of primes, for example – this pattern of using globals can be pragmatic.
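For example (a sketch; this relies on a fork-based start method, which is the default on Linux and was the default on Mac OS X at the time):

from multiprocessing import Process

LOOKUP = {}  # read-only shared data, filled in by the parent BEFORE forking

def child(n):
  # Works: the child inherited LOOKUP as it was at fork time
  print('{} -> {}'.format(n, LOOKUP[n]))

if __name__ == '__main__':
  LOOKUP.update({n: n ** 2 for n in range(100)})  # written in the parent, pre-fork
  p = Process(target=child, args=(7,))
  p.start()
  p.join()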


Explicitly set Queue size

On Mac OS X the default size of the queue is 32767 items. On Linux, the default size is … INFINITY. I learned this the hard way. I usually prototype my code on my Mac and then deploy to a Linux container that runs god knows where, on data that is several orders of magnitude larger than my prototype set. On Linux things SEEMED to run fine, but the size of the result set was abnormally small compared to the input. After a lot of poking around I discovered this property of the Python implementation: on Linux my queues got larger and larger, taking up more and more RAM, until the OS started to kill off the child processes. Setting an explicit queue size makes producing processes block when the queue becomes full, capping memory usage.
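Concretely (10000 here is just a placeholder; pick a bound that fits your item sizes and RAM):

from multiprocessing import Queue

# put() now blocks once 10000 items are pending, instead of growing without bound
result_queue = Queue(maxsize=10000)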

Wrap the child process function using traceback

Error messages from child processes are messy and uninformative. Importantly, the friendly Python stack trace printout is missing. A fairly convenient pattern is to wrap the actual function call within a try-except block in a thin outer function, using the traceback module, like so:

def func_w(args):
  """A thin wrapper to allow proper tracebacks when things go wrong in a child process

  :param args: dict of keyword arguments for func, the real worker function
  :return:
  """
  import traceback
  try:
    return func(**args)  # func is whatever function does the actual work
  except Exception as e:
    traceback.print_exc()
    print('')
    raise e

Check for normal termination from the child

One of the reasons it took me so long to track down the above-mentioned memory error was that when the child processes were being killed by the OS, no Python error was raised. The principled way to check for normal termination is to look at the exitcode value of the process.
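For example (a sketch, reusing the workers list from the pattern sketched earlier; a child killed by a signal gets a negative exitcode, e.g. -9 for SIGKILL):

for p in workers:
  p.join()
  if p.exitcode != 0:
    raise RuntimeError('Worker {} died with exitcode {}'.format(p.pid, p.exitcode))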

When not to use this pattern

Passing data via queues can be expensive in terms of computation and memory. If your individual workers have to pass a lot of data back and forth – say large submatrices of an even larger matrix – it may be neater and faster to use the shared memory objects supplied by the multiprocessing module, which take care of the semaphores for you.
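For instance (a minimal sketch; multiprocessing.Array hands you a lock-protected block of shared memory):

from multiprocessing import Process, Array

def fill(shared):
  # Writes land directly in shared memory - nothing is pickled through a queue
  for i in range(len(shared)):
    shared[i] = i * i

if __name__ == '__main__':
  arr = Array('d', 8)  # eight C doubles, with an associated lock
  p = Process(target=fill, args=(arr,))
  p.start()
  p.join()
  print(list(arr))  # -> [0.0, 1.0, 4.0, ..., 49.0]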

__repr__()

I think __repr__() is Python’s genius touch. What?! I’m sure you’re yelling at the screen right now, incoherent as items from the long list of things you love about Python fight each other to gain command of your voice. But it’s what I think. And, as you know, Lisp got there first.

Now, of course, you’re really mad at me. But listen. What do we use Python a lot for? We use it for anything that requires our language to be nimble, fluid and user friendly, like data analysis and algorithm prototyping. And why do we use Python? Because it has a REPL. We’ll think of something to do, we’ll write a set of functions and classes to do it (sometimes never leaving the REPL) and then run our data on it.

The key feature of the REPL is being able to see the results of a computation immediately printed below what we did. This is a user comfort feature, and user comfort has a disproportionate impact on our motivation to keep trying out new ideas.

Python’s built-in types have decent printed representations, but these can get crowded (dicts with many keys get hard to read quickly, for example). And what about user-defined classes? Often, the results of a computation are not a few strings or simple, single numbers. The result of adding two arrays or two tables can be visually complex, for instance. Representing these as unreadable combinations of Python’s built-in types, or as equally unreadable pointer ids, starts to defeat the purpose of the interpreted, REPL workflow.

To mitigate this, Python supplies us with the special __repr__() method. When we create a class we don’t have to define this method, but if we do define it, making it return a succinct string that summarizes the object usefully, the REPL rewards us with these succinct, readable representations.

For example, say we are doing some work with vectors and are interested in how the magnitude of the resultant vector changes as we add vectors together. We could do this as follows:

class Vect:

  def __init__(self, x, y):
    self.x, self.y = x, y

  def __add__(self, other):
    return Vect(self.x + other.x, self.y + other.y)

  def __radd__(self, other):
    # sum() starts from 0, so handle a plain-number left operand
    if other == 0:
      return self
    return Vect(self.x + other.x, self.y + other.y)

  def __repr__(self):
    return "({}, {}):{:0.3f}".format(self.x, self.y, (self.x ** 2 + self.y ** 2) ** 0.5)

There are some things here in addition to __repr__ that enable us to do vector additions (including handling the plain 0 that sum() starts with). We will ignore those here.

So now when we create some vectors, we get a nice printout

Vect(3,4)   # -> (3, 4):5.000
Vect(5,12)  # -> (5, 12):13.000

This feature plays wonderfully with Python’s built-in types, so a set of these objects, a dict with them as keys or values, or a simple list of them all work as you would hope.

[Vect(3,4), Vect(5,12)]              # -> [(3, 4):5.000, (5, 12):13.000]
{Vect(3, 4), Vect(5, 12)}            # -> {(3, 4):5.000, (5, 12):13.000}
{Vect(3, 4): 'A', Vect(5, 12): 'B'}  # -> {(3, 4):5.000: 'A', (5, 12):13.000: 'B'}
sum([Vect(3,4), Vect(5,12)])         # -> (8, 16):17.889

PS. I forgot – I dropped a teaser about how Lisp got there first (as usual). In Common Lisp, when you define a structure, it gets a default “printer” which you can override:

(defstruct vect x y)    ; -> uses default printer

(make-vect :x 3 :y 4)   ; -> #S(VECT :X 3 :Y 4)

Now we can replace this with our own printer. One way (sketched below) is via defstruct’s :print-function option:

(defstruct (vect (:print-function
                  (lambda (v stream depth)
                    (declare (ignore depth))
                    (format stream "(~a, ~a):~,3f"
                            (vect-x v) (vect-y v)
                            (sqrt (+ (* (vect-x v) (vect-x v))
                                     (* (vect-y v) (vect-y v))))))))
  x y)

(make-vect :x 3 :y 4)   ; -> (3, 4):5.000

Python, global state, multiprocessing and other ways to hang yourself

C and C++ were supposed to be the “dangerous” languages. There’s Stroustrup’s quote, for example, widely circulated on the internet. However, Stroustrup clarifies that his statement applies to all powerful languages. I find Python to be a powerful language that, by design, protects you from “simple” dangers, but lets you wander into more complex dangers without much warning.

Stroustrup’s statement in more detail is (from his website):

“C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do it blows your whole leg off”. Yes, I said something like that (in 1986 or so). What people tend to miss, is that what I said there about C++ is to a varying extent true for all powerful languages. As you protect people from simple dangers, they get themselves into new and less obvious problems.

The surprising thing about Python I’d like to briefly discuss here is how the multiprocessing module can silently fail when dealing with shared state.

What I effectively did was have a variable declared in the parent process and passed to the child process, which then modified it. Python happily lets you do this, but changes to the variable are not seen by the parent process. As this answer to the question on Stack Overflow explains:

When you use multiprocessing to open a second process, an entirely new instance of Python, with its own global state, is created. That global state is not shared, so changes made by child processes to global variables will be invisible to the parent process.

I was thrown, though, by two additional layers of complexity. Firstly, as the example code below shows, I shared the state in a somewhat hidden manner – I passed an instance method to the new process. I should have realized that this implicitly shares the original object – via self – with the new process.

Secondly, when I printed out the ids of the relevant object in the parent and child processes, the ids came out the same. As the docs explain:

CPython implementation detail: This is the address of the object in memory.

The same id (= memory address) business threw me for a bit. Then I heard a faint voice in my head mumbling about ‘virtual addressing’, ‘copy-on-write’ and ‘os.fork()’. So what’s going on here? A little perusing on Stack Overflow allows us to collect the appropriate details.

As mentioned above, Python, for implementation reasons (keyword: Global Interpreter Lock – GIL), needs multiple processes for true parallelism, and on Unix the multiprocessing module uses os.fork() to create them. fork() creates an exact copy of the old process and starts it as a new one. This means that, for everything to be consistent, the original pointers in the program need to keep pointing to the same things, otherwise the new process would fall apart. But WAIT! This is chaos! Now we have two identical running processes writing and reading the same memory locations! Not quite.

Modern OSes use virtual addressing. Basically, the address values (pointers) you see inside your program are not actual physical memory locations; they are virtual addresses that a per-process lookup table maps to actual physical memory locations. Because of this indirection, the same virtual address can point to different physical addresses IF the virtual addresses belong to the lookup tables of separate processes.

In our case this explains the same id() value: when the child process modified the object with that id, it was actually accessing a different physical object, which explains why it and its parent doppelganger diverge.

For completeness, we should mention copy-on-write. What this means is that the OS cleverly manages things such that initially the virtual addresses actually point to the original physical addresses – as if you had copied the address table from the original process. This allows the two processes to share memory while reading (and saves a bunch of copying). Once either of the processes writes to a memory location, however, the relevant values are copied to a new physical location and that process’s virtual table is updated to reflect this.
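You can watch this happen in a few lines (a Unix-only sketch; os.fork() does not exist on Windows):

import os

val = [0]
pid = os.fork()
if pid == 0:  # child
  val[0] = 42  # copy-on-write kicks in here
  print('child  id={} val={}'.format(id(val), val))
  os._exit(0)
else:
  os.wait()
  print('parent id={} val={}'.format(id(val), val))  # same id, but still [0]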

Which brings us to the question: what the heck am I doing worrying about addresses in Python?! Shouldn’t this stuff just work? Isn’t that why I’m incurring the performance penalty, so I can have neater code and not worry about the low level details? Well, nothing’s perfect and keep in mind Stroustrup’s saw, I guess.

Also, never pass up a learning opportunity, much of which only comes through the school of hard knocks. It also gives a dopamine rush you’ll not believe.  I wonder what kind of shenanigans you can get into doing concurrent programming in Lisp, hmm …

So, what about the other ways to hang yourself, as promised in the title? Oh, that was just clickbait, sorry. This is all I got.

Code follows:

try:
  import queue as Q  # Python 3
except ImportError:
  import Queue as Q  # Python 2
from multiprocessing import Process, Queue
import time

def describe_set(s):
  return 'id: {}, contents: {}'.format(id(s), s)

class SetManager(object):
  def __init__(self):
    self.my_set = set()
    print('Child process: {}'.format(describe_set(self.my_set)))

  def add_to_set(self, item):
    print('Adding {}'.format(item))
    self.my_set.add(item)
    print('Child process: {}'.format(describe_set(self.my_set)))

def test1():
  print('\n\nBasic test, no multiprocessing')

  sm = SetManager()
  print('Parent process: {}'.format(describe_set(sm.my_set)))

  sm.add_to_set(1)
  print('Parent process: {}'.format(describe_set(sm.my_set)))

class SetManager2(SetManager):
  def __init__(self, q, q_reply):
    super(SetManager2, self).__init__()
    # SetManager.__init__(self)
    self.keep_running = True

    self.q, self.q_reply = q, q_reply
    self.sp = Process(target=self.loop, args=())  # passing a bound method implicitly ships self to the child
    self.sp.start()

  def loop(self):
    while self.keep_running:
      try:
        msg = self.q.get(timeout=1)
        if msg == 'quit':
          self.keep_running = False
        elif msg == 'peek':
          self.q_reply.put(list(self.my_set))
        else:
          self.add_to_set(msg)
      except Q.Empty:
        pass
      time.sleep(.1)

def test2():
  print('\n\nMultiprocessing with method set off in a new process')

  q, q_reply = Queue(), Queue()

  sm = SetManager2(q, q_reply)
  print('Parent process: {}'.format(describe_set(sm.my_set)))

  sm.q.put(1)
  time.sleep(1)
  print('Parent process: {}'.format(describe_set(sm.my_set)))

  sm.q.put(2)
  time.sleep(1)
  print('Parent process: {}'.format(describe_set(sm.my_set)))

  q.put('peek')
  time.sleep(1)
  print('Reply from child process: {}'.format(describe_set(q_reply.get())))

  sm.q.put('quit')
  sm.sp.join()

class SetManager3(SetManager2):
  def __init__(self, q, q_reply):
    super(SetManager2, self).__init__()  # skips SetManager2.__init__, so no process is spawned here
    self.keep_running = True
    self.q = q
    self.q_reply = q_reply

def start_set_manager3_in_process(q, q_reply):
  sm = SetManager3(q, q_reply)
  sm.loop()

def test3():
  print('\n\nMultiprocessing with object created in the child process')

  q, q_reply = Queue(), Queue()

  sp = Process(target=start_set_manager3_in_process, args=(q, q_reply))
  sp.start()
  # print('Parent process: items are: {}'.format(sm.my_set))

  q.put(1)
  time.sleep(1)

  q.put(2)
  time.sleep(2)

  q.put('peek')
  time.sleep(1)
  print('Reply from child process: {}'.format(describe_set(q_reply.get())))

  q.put('quit')
  sp.join()

test1()
test2()
test3()

Chasing down a Cython error

My Python program segfaulted. When I wrote C++ programs segfaults were no big deal. I would hook into gdb, mostly through my IDE’s debugger, and dance around the point where the program crashed until I had an idea of where I was going wrong. Python gives a wonderful stack trace and tells you where things have gone wrong, so you can just look into the source code using the information Python spits out before it dies. But not for C extensions.

When things go wrong in a Cython or C module that you are calling from Python, you are stuck, a little bit, in no-man’s-land. Fortunately, as a Stack Overflow thread here explains, you can just run Python in gdb, run your code, and then debug it all as a C program (Python) operating on some data (your Python program).

Unfortunately, I’m on Mac OS X, and though I could use homebrew to install gdb, I could not get it to run my python interpreter due to a binary compatibility issue. I decided to be brave and use LLDB (which comes with Mac OS X) to try and debug this error.

lldb python

This starts Python under the debugger. Then you can run the Python program as follows:

run /path/to/my/script arg1 arg2

My program crashed with

Fatal Python error: GC object already tracked
Process 59193 stopped

And some instructions in assembly which weren’t so helpful. However, I could do

thread backtrace

This led me to some candidate functions in the C source file of the translated Cython code, like:

frame #4:  0x000000010004d658 Python`PyTuple_New + 252
frame #5:  0x00000001036ad6af util_cython.so`__Pyx_PyObject_CallOneArg(_object*, _object*) [inlined] __Pyx__PyObject_CallOneArg(arg=<unavailable>) + 5 at util_cython.cpp:7694
frame #6:  0x00000001036ad6aa util_cython.so`__Pyx_PyObject_CallOneArg(func=0x000000010456f170, arg=0x0000000104c62a90) + 149 at util_cython.cpp:7712
frame #7:  0x00000001036b1480 util_cython.so`__pyx_pw_5mitty_3lib_11util_cython_9markov_sequences(_object*, _object*, _object*) [inlined] __pyx_f_5mitty_3lib_11util_cython_markov_chain_sequence_gen(__pyx_v_first_letter='C', __pyx_v_l=<unavailable>, __pyx_v_max_len=10, __pyx_v_rng=<unavailable>) [5], unsigned char*, unsigned long*, unsigned long, _object*) + 110 at util_cython.cpp:2898
frame #8:  0x00000001036b1412 util_cython.so`__pyx_pw_5mitty_3lib_11util_cython_9markov_sequences(_object*, _object*, _object*) [inlined] __pyx_pf_5mitty_3lib_11util_cython_8markov_sequences(__pyx_v_max_len=0x0000000100300678) + 2150 at util_cython.cpp:3710
frame #9:  0x00000001036b0bac util_cython.so`__pyx_pw_5mitty_3lib_11util_cython_9markov_sequences(__pyx_self=<unavailable>, __pyx_args=<unavailable>, __pyx_kwds=<unavailable>) + 6070 at util_cython.cpp:3263
frame #10: 0x000000010008614d Python`PyEval_EvalFrameEx + 8080
frame #11: 0x0000000100084093 Python`PyEval_EvalCodeEx + 1641

Which was enough information for me to look more carefully at my markov_chain_sequence_gen function.

While x:

This is a cautionary tale about misusing Python’s kindness. Or perhaps a cautionary tale about trusting even widely used libraries to mirror what Python does. It is a little technical …

I have a program that, for the purposes of this post, operates on a list of pairs of numbers, going through each pair one by one until it exhausts the list. For certain reasons I’m using an iterator on the list and using Python’s ‘None’ to indicate that the list has run out, like so:

x = [(0, 0), (1, 2), (3, 1)]
itr = iter(x)
v = next(itr, None)
while v:
  print(v)
  v = next(itr, None)

The pair of numbers represents particular data. The first number can range from 0 onwards and is always increasing monotonically while the second number is one of {0, 1, 2}. I have several tests for this function and have checked for different edge cases.

As you can see, in the snippet above, the list x is a list of Python tuples. After the bulk of the development of this (and related) functions had been completed, I decided that a better way to store my list of tuples would be as a numpy structured array. That way I would have labels for the first and second numbers in each pair and wouldn’t get confused about which was which six months from now.

And Python, being the great dynamic language it is, let me make the switch without a murmur. Python doesn’t care that the object I was initially passing it was a list of tuples and the new one is a numpy structured array! Python sees both objects yield iterators as needed and both objects are spitting out pairs of numbers at each iteration. So Python just works!

Things ran just fine for a while.

One day my colleague, who has used my program many times before, came across some very strange results in their experiments. I knew right away that this was going to be an interesting bug, just like any bug that is found after a program has been used for a while. The simple/easy bugs have all been squashed. What remain are the cockroaches. Evil, secretive errors in logic that scurry away from the daylight and come out only in the shadows of the night.

So I took the input data my colleague was using and ran my program. Indeed something bizarre was going on. For some of the data the program was acting exactly as it should. But for some of the data the program claimed that the list of pairs of numbers was empty and would exit right away. It would do the rest of the processing without any errors.

My first guess was that something was wrong with the input files. The input files were being generated with some hastily written code that I had not fully tested, so it was the logical choice, though this would make the bug less interesting.

I reviewed the input data and found that it was just fine. So the input to the program was correct. The program was just borking, seemingly at random.

I then started to do things the old fashioned way. I started to put in print statements at various parts of the code and simply print the length of my array of pairs of numbers. This was being done on a remote machine and I hadn’t learned yet how to hook my debugger into the remote machine. Also, sometimes, print statements are just a more pragmatic way to debug, regardless of what the eggheads will tell you.

The program loads the list. A million elements in the list. Good. The program does a transformation on the list. A million elements. Good. Another transformation. Still a million little pairs of numbers, hanging on for dear life. Then they enter this piece of code I sketched out above. And BAM! the program skips the loop, grinning as it whizzes by, zero little pairs of numbers.

I run the program again, this time with a different input set. A hundred thousand elements. Passes first transform, second transform, and then my loop. Round and round it goes in the loop, a hundred thousand times before shooting out, all dizzy at the end, having done exactly what it was supposed to do.

Very interesting.

What is different between the two data sets that would make the code DO this? So I opened up the data and started to look closely at the ones that failed and the ones that succeeded, and then it struck me. All the ones that failed had (0, 0) as their first element.

And then it was clear.

When I wrote the loop:

itr = iter(x)
v = next(itr, None)
while v:
  print(v)
  v = next(itr, None)

I was a bit lazy. What I really meant was “exit the loop if v is None”. Python is very kind, and None evaluates to False in this test, so it seems equivalent. The problem is that the loop will also terminate if v is falsy for any other reason, for example if v == 0. But wait, this is actually not a problem, because my v is not a scalar number. It is a pair of numbers, and during my testing I had repeatedly verified that (0, 0) != 0. One is a scalar value – zero – while the other is a Python object – a non-empty tuple, which is always truthy. That can’t possibly be the problem!

But wait! As I told you, halfway through development I switched from using Python lists of tuples to numpy structured arrays.

So I looked a bit into that and, BAM, sure enough, that’s where the behavior differs. For some reason the numpy library has been written to evaluate an element of a structured array as equal to zero – and hence falsy – if all of its fields are zero. This is different from taking a slice of a multi-dimensional array, where numpy will complain: “ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”.

This is what still throws me. When I ask numpy for an element of my structured array of pairs of numbers, I get back a pair of numbers. These numbers are not simply another numpy array of two elements – which is what you would get from slicing a 2D numpy array. It’s a pair of numbers, but this pair does not act like a Python tuple. It’s a different beast altogether – a numpy record.

Well, folks, the moral of the story is: don’t be lazy. If you want to test for None, test for it explicitly. You never know what evaluates truthy or falsy, even if you think you do.
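The explicit version is just as short:

itr = iter(x)
v = next(itr, None)
while v is not None:  # a falsy (0, 0) record no longer ends the loop early
  print(v)
  v = next(itr, None)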

If you would like to see this bug for yourself, here is a sufficient example:

import numpy as np

def looper(x):
  itr = iter(x)
  v = next(itr, None)
  while v:
    print(v)
    v = next(itr, None)

def python_list():
  print('Python list')
  l = [(0, 0), (1, 0), (2, 2), (3, 0)]
  looper(l)

def numpy_struct_array():
  print('Numpy array')
  l = np.array([(0, 0), (1, 0), (2, 2), (3, 0)], dtype=[('x', 'int32'), ('y', 'int8')])
  looper(l)

if __name__ == '__main__':
  python_list()
  numpy_struct_array()