Kaushik Ghose

All your genome are belong to us



Ex Machina


A movie review with some fun discussion of the Turing test.

Originally posted on Spoiler Alert!:

The thing I like the most about Ex Machina is that one of the protagonists not only gives us a proper definition of the Turing test, but he also describes a delicious modification of the Turing test that took me a second to savor fully. The plot also twists quite nicely in the end, though we kind of see it coming.

Movies about machines that replicate human thoughts and emotions pop up periodically. In most of these movies the machines win. Ex Machina is one of the better ones of this genre. Particularly satisfying is that the techno babble in the script touches on some advanced topics in machine intelligence.

To start with, there is a good definition of the Turing test. People make a lot of fuss about the Turing test and take it quite seriously and literally. The Turing test, to me, is basically an admission that, when people…




The magic of memoization

Project Euler problem #15 (this may be a spoiler, BTW) can be solved via recursion, but because the tree is so wide it quickly blows up. More irritatingly, much like computing the Fibonacci sequence recursively, we keep doing the same computation over and over again. I knew how to solve this using imperative programming and I was trying to shoehorn that solution into Racket, but just thinking about it made me feel dirty. If only there were some way to retain the elegance of recursion but not have to redo computations …

Project Euler problem #15 poses the challenge: Starting in the top left corner of a 2×2 grid, and only being able to move to the right and down, there are exactly 6 routes to the bottom right corner. How many such routes are there through a 20×20 grid?

It turns out that there is an analytical solution to this problem involving combinatorics. I kind of got there while sketching out solutions, but fortunately for me, I decided to let the computer do all the work. Fortunately because I learned a cool new functional tool.
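(For the record, the closed form for an n×n grid is the central binomial coefficient C(2n, n). A one-line sanity check using Racket's bundled math/number-theory library, shown here as an aside rather than part of the solution:)

(require math/number-theory)
(binomial 40 20) ; -> 137846528820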

The straightforward way to solve this problem using functional programming paradigms is to use recursion. Just set off little machines that keep forking off copies of themselves chasing down alternate paths. If the machines go off the grid, discard them, and if they reach the exit corner, tell them to report back. Count up the reports and you are done!

#lang racket
(require rackunit)

; A 2D value
(struct pos (x y) #:inspector #f)

;; The two legal moves
(define (move-right p0)
  (struct-copy pos p0 [x (+ (pos-x p0) 1)])) 

(define (move-down p0)
  (struct-copy pos p0 [y (+ (pos-y p0) 1)])) 

;; Return #t if we are at the grid boundary
(define (boundary-pos p0 grid-size)
  (or (= (pos-x grid-size) (pos-x p0)) (= (pos-y grid-size) (pos-y p0))))

;; Use recursion to explore all possible moves. Our base case occurs when we
;; land on the grid boundary. We know that from then on there is only one
;; path to the exit - either straight right or straight down. Return 1 when that
;; happens
; p0 - initial position (pos struct)
; grid-size - size of grid (pos struct)
(define (count-paths grid-size [p0 (pos 0 0)])
  (cond
    [(boundary-pos p0 grid-size) 1]
    [else (+ (count-paths grid-size (move-right p0))
             (count-paths grid-size (move-down p0)))]))

(test-case "Recursive"
           (check-equal? (count-paths (pos 2 2)) 6)
           (check-equal? (count-paths (pos 1 3)) 4)

(time (let ([k (count-paths (pos 10 10))]) (display k))) 
;-> 184756  cpu time: 88 real time: 88 gc time: 0
;; 20x20 grid does not finish and keeps asking for more memory

Now this thing blows up quickly because, as you can see, each node we visit spawns two machines (function calls) and pretty soon our grid is crawling with machines. On my machine I can’t even compute the 20×20 solution.

What is annoying is that it doesn’t have to be this way: we can see that the recursion repeatedly visits the same lattice nodes (after all, there are not THAT many of them). An interesting property of this problem is that the number of paths from a given node to the exit does not depend on how we got to that node. So we don’t really need to send machines down a node we have already visited.

The way I would have exploited this property in imperative programming is to build a two dimensional array representing the grid. I would have started to fill out this array by walking backwards from the exit and noting, in the array values, how many paths there are to the exit. The value of each node is the sum of the values of its two “child” nodes – the nodes that can be reached by going down or by going right.

If you want keywords, this is a bottom up dynamic programming approach because we are taking small, individual pieces of the problem and working out the answer separately and then putting things together to solve bigger parts of the problem. (Wikipedia’s page on dynamic programming is pretty good)
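For concreteness, here is a minimal bottom-up sketch of that idea in Racket, using a flattened mutable vector as the grid (my illustration, not part of the original solution). Cell (i, j) holds the number of paths from that node to the exit; boundary cells hold 1.

(define (count-paths-dp n)
  ; (n+1) x (n+1) nodes, flattened; index of (i, j) is i*(n+1)+j
  (let ([grid (make-vector (* (+ n 1) (+ n 1)) 1)])
    (for* ([i (in-range (- n 1) -1 -1)]
           [j (in-range (- n 1) -1 -1)])
      ; paths from (i, j) = paths from (i+1, j) + paths from (i, j+1)
      (vector-set! grid (+ (* i (+ n 1)) j)
                   (+ (vector-ref grid (+ (* (+ i 1) (+ n 1)) j))
                      (vector-ref grid (+ (* i (+ n 1)) j 1)))))
    (vector-ref grid 0)))

(count-paths-dp 20) ; -> 137846528820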

I started to poke around Racket and look into matrices and arrays and mutability and I started to get a yucky feeling. It started to feel a lot like housekeeping and housekeeping is one thing functional programming promises to free us from, so we may concentrate on the fun stuff.

While doing various web searches related to “dynamic programming functional languages” and “update mutable array Racket” (yes, I was desperate) I kept hitting the word memoization. I’d come across this term before, but had not paid any attention to it. Then I found this great post on the Racket blog.

Before we go any further, take a look at the code below:

(require memoize)

(define/memo* (count-paths-m grid-size p0)
  (cond
    [(boundary-pos p0 grid-size) 1]
    [else (+ (count-paths-m grid-size (move-right p0))
             (count-paths-m grid-size (move-down p0)))]))

(test-case "Memoize"
           (check-equal? (count-paths-m (pos 2 2) (pos 0 0)) 6)
           (check-equal? (count-paths-m (pos 1 3) (pos 0 0)) 4)

(time (let ([k (count-paths-m (pos 20 20) (pos 0 0))]) (display k))) 
; -> 137846528820  cpu time: 0 real time: 0 gc time: 0

;; Note that this last operation would basically run out of memory and crash using the naive recursive implementation

What just went down here? (Yes, I learned this catchy phrase from Tim Roughgarden’s lectures)

We only made one change – we used the memoize package and we defined the function as a memoized function. (We also had to remove the default argument value; apparently the memoize package does not support that.)

What is this Memoization magic?

Recall that what I was looking for was a scheme that would prevent us having to recompute already known values. The approach I was familiar with from imperative programming was an explicit table store of known values that I could reference each time I went to do a computation.

Memoize is a very elegant implementation of a related idea. However, instead of putting the implementation (e.g. a matrix of known values) first, it puts the function first and says “What we really want is a way for a function to remember the values it computes”.

What the two lines of Racket above do (and note, really, that we did not have to touch the algorithm at ALL) is to say, here’s a function count-paths-m and I want to remember the answer to each call that is made to it, so if I ever call the function with the same arguments, instead of executing the body of the function, just return the stored value.

And THAT’s IT! No need to mess with dirty implementation details. Memoize gets to the heart of what we want to do.

Now, such memoizations, because they have to work with any function, are usually implemented as a hash (I know that the Racket one is – by the way, the Racket implementation is one short file). Performance freaks may hem and haw: in this particular case, for example, a 2D array lookup would be faster and more compact than a hash lookup. But, as you can see from the timing, it’s fast enough.
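To see that there is no deep magic here, a hand-rolled sketch of the idea (illustrative only – the actual package also provides the define/memo forms and the struct-aware key comparison discussed below):

;; Wrap any function with a hash that caches results keyed on the argument list
(define (my-memoize f)
  (let ([cache (make-hash)])
    (lambda args
      (hash-ref! cache args (lambda () (apply f args))))))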

As a side note on Racket, I was very impressed by the smoothness of getting this package installed. I used the GUI to search for “memoize”, the manager searched in different repositories, found the github repo, downloaded the code and integrated the documentation with the rest, so now I can find the docs for memoize just like I would any of the pre-installed packages.

Finally, a small point specific to this Racket package: note I used define/memo*. I first used define/memo and got very unexpected results, which seemed a lot like memoization wasn’t working. It turns out to be a detail of the package: I was using structs – not primitive types – as inputs to the function, and the *-d versions of the functions implement the proper comparisons to handle such derived data types.

UPDATE: I forgot to put this in: Note that now we have a kind of state in the function. It’s not a state that changes the input/output characteristics of the function, but rather an “implementation state” if you will. The state changes how the function is implemented based on the previous history of the program. This does not give us any problems when reasoning about what the function does (as befits a proper functional program) but it does make it hard to reason about the speed characteristics of the function in a way that mirrors difficulties in reasoning about algorithmic characteristics of an imperative program with mutable state. Funny world.



Have a test suite

Two tests failed. They were rather benign tests too. I think I had put those tests there more for completeness than for some edge case I’d run into. The tests were for cases when some of my algorithms would produce an empty data structure. Almost as an afterthought I was also saving the empty data to file using h5py – a wrapper around the HDF5 C libraries. I was moving code to a production machine, and those two tests failed.

The fact that in both those cases the data structures were empty did not strike me at first. I spent most of my time making sure setup.py was correct, that I had listed the versions of the packages on my dev machine, and that the production machine indeed had at least those versions of the packages.

One of the things about Python is that the neat package manager works best for pure Python packages. Once you have packages that wrap libraries written in other languages which have to be compiled separately, you have to rely on the Python package maintainers to detect and flag buggy versions of the underlying libraries. It’s easy to forget this and be lulled into a sense of security once you have your setup.py file all annotated with required dependencies and you are sitting smug and warm in your insulated virtual environment. You forget that the cold clammy claws of the C++ ecosystem’s madness are never too far away.

After a lot of madness of my own – looking over the failed tests and the debug output, with some increasingly desperate web searches – I came across the following hit:

[Screenshot: the web search hit, pointing to a problem with HDF5 1.8.4]

Finally, I stopped looking at the Python ecosystem and looked at the libraries. The HDF5 version was indeed at 1.8.4 and upgrading to the latest library version made the test errors go away.

And I sat there for a while thinking how dangerous it is not to have a test suite. If I didn’t have those tests in there, this bug would have been sitting in the code, quietly waiting. It’s not even a bug in the code I wrote. It’s an anonymous bug, from some unknown, faceless colleague. Actually, it is code from a consortium, which is pretty well funded, since the library is well regarded and well used. And I would never have known until something blew up.

I mean, everyone says you should have a test suite, but no one ever gives points for it. People look at how fast your code is, perhaps at the algorithmic tricks you use, perhaps at the nice UI you present, but no one ever says, “Hey, that’s a nice test suite you have there.”

But that’s probably the only important thing in the code.



Quick thoughts on starting Functional Programming (in Racket)

Having been “functionally curious” for a while now, I’ve started with Racket, a language in the Scheme family, which is a type of Lisp. Yes, Racket rhymes with bracket.

My first foray into functional programming was with Haskell. When I started with Learn You a Haskell for Great Good I was initially excited, but then I got a little confused and this was enough of a barrier that I stopped.

When I came back to trying Functional Programming (almost a year later) I made two changes: I picked a different language (Racket), and I decided that, instead of my usual method of jumping into the deep end by starting a real-life project in the language, I would patiently step through the Project Euler problems in it. The combination seems to be working.


DrRacket

DrRacket is a very awesome learning tool. It’s what the IPython notebook and the IPython interactive console aspire to be, and more. I don’t use the pop-out hints (in the right-hand corner), but I make extensive use of searching the documentation for standard library descriptions, of the graphical display of function usage (DrRacket draws arrows from a function definition to its uses!), and of its error locating.

[Screenshot: DrRacket, with documentation search and definition arrows]

Here’s a nice use case for this. I copied over a function intending to modify it and make a different version of it (this is part of some learning code, so I like to keep older versions of algorithms around for comparison). By hovering over the older definition and noting the long arrow that goes down into the new (pasted) code I can see that I haven’t changed the function name in one place yet:

[Screenshot: the arrow from the old function definition runs into the pasted copy, flagging the unchanged name]

The documentation is very good, though I haven’t run into a user community (Edit: the folks on the mailing list are very kind and helpful!). Perhaps I should wander down to Northeastern U. I started with the tutorial in the guide, but after settling on Project Euler, I abandoned the tutorial and now make extensive but non-linear use of the guide and the reference (which are, incidentally, included with the distribution, so I don’t need an internet connection) and Stack Overflow.

In terms of syntax, Racket is a bit annoying. It’s fairly easy for me to write, but I don’t like to read it. The Polish prefix notation is annoying but strangely addictive. I definitely prefer infix, but there is a perverse pleasure in getting fluent in prefix. The real problem is the brackets. Perhaps I’ve just not learned how to format things and need to look at other people’s Racket code, but my code is unreadable.

Project Euler

Project Euler seems just the ticket for getting started with Functional Programming. The problems are paced and are a nice mixture of mathematical theory and code. For each problem I find myself scribbling on paper, thinking a bit and then coding. I really like this. Solving problems in order of difficulty keeps me from getting stuck in one place for too long.

Functional Programming

I have to admit learning a new paradigm is a lot more fun than just learning a new language. After a lifetime of coding imperatively it’s a nice challenge to deny myself the easy pleasures of loops and variable assignments (though that takes it a bit far if done strictly). I’m sure there is a lot more to FP, but that is what I’ve picked up doing the first few Project Euler problems.

Using list comprehensions in Python (which are implicitly map and/or filter) has made the concept of map and reduce very easy to grasp. (As an aside, I really like Python’s syntax for list/dictionary comprehensions which I find to be concise and readable.)

In a way map and reduce are easy. Map carries no state anyway and so is a really trivial “mapping” from the loop form

for i in range(N):
  l[i] = f(m[i])

simply becomes

l = map(f, m)

Reduce is slightly more interesting because it has an accumulator, but it almost seems like a cheat to me: reduce is a function that bundles an interesting functional concept (recursion) inside and lets you get away without thinking about it.
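For example, summing a list with reduce (a builtin in Python 2; it lives in functools in Python 3):

# The accumulator is threaded through the calls, hiding the recursion
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)
# total is now 10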

Pattern matching

The first pattern matching I was exposed to was Haskell’s, and I fell in love with it. The notation is so close to how you would write down functions on a piece of paper that it seems the only natural way to do it. (I’m sad to say that Racket’s pattern matching syntax is kind of cumbersome, especially when you want to match conditions. I started using if statements, as they are more succinct, and then I discovered cond, which leads to pretty neat code.)
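For instance, a multi-way branch written with cond stays flat where nested ifs would pile up (a toy example of mine):

;; The sign of a number as a flat multi-way branch
(define (sign x)
  (cond
    [(positive? x)  1]
    [(negative? x) -1]
    [else           0]))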

Recursions instead of loops

Though Racket has loop-like constructs, I’ve stayed away from them and tried to write everything as recursions. It was a little challenging at first but became more and more natural. It felt a little awkward because when I write Python, in the back of my mind is this voice that keeps saying <ghostly quiver in voice> “Function call overhead”, and I usually end up making somewhat meaty functions. In Racket, because I’m doing things like recursion, I end up abstracting computations into functions just to make the code look neater and more atomic.
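A small illustration of the shift (again my toy example): summing a list with a loop-like construct versus the recursion I reach for instead:

;; Loop-style, with for/fold ...
(define (sum-loop l)
  (for/fold ([acc 0]) ([x l]) (+ acc x)))

;; ... versus plain recursion
(define (sum-rec l)
  (if (empty? l)
      0
      (+ (first l) (sum-rec (rest l)))))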


Unit testing

I’m really happy that Racket has a very easy-to-use and fully featured unit test system (rackunit). It encourages me to include tests right underneath an important function, and I just run the file to execute the tests.

Get it right the first time

I’ve now written a bunch of short scripts in Racket, mostly solving Euler problems, but some also to follow along Tim Roughgarden’s Algorithms course on Coursera. One of the things that has startled me a bit is how often I get my Racket code right the first time. One big caveat is that I’m usually doing simple things, but it still startles me. I’ll be writing this dense looking code in this new language, I’ll write out the tests, and then run and it’ll all pass. I think a large part of this is because I’m doing pure algorithms – I’m not having to mess with real data, which has a thousand edge cases that clutter up the code – and part of it has to do with me breaking up each task into atomic parts and coding and testing each part separately.

When writing in Racket I often feel I’m writing machine code again: I’m down in the weeds doing one small task at a time. In that sense Python is more expressive – I feel I can do pretty complex things with a single list comprehension, for example. But functional code can be pretty cool and compact to look at when I get it right.

;; Given two sorted lists l1 and l2, merge them into a sorted list l3
(define (merge l1 l2)
  (cond
    [(= 0 (length l1)) l2]
    [(= 0 (length l2)) l1]
    [(< (first l1) (first l2)) (cons (first l1) (merge (rest l1) l2))]
    [else (cons (first l2) (merge l1 (rest l2)))]))

(test-case "merge"
           (check-equal? (merge `(1 2 3) `(4 5 6)) `(1 2 3 4 5 6))
           (check-equal? (merge `(1 3 5) `(2 4 6)) `(1 2 3 4 5 6))
           (check-equal? (merge `(2 4 6) `(1 3 5 7)) `(1 2 3 4 5 6 7))
           (check-equal? (merge `(2 4 6) `(2 4 6)) `(2 2 4 4 6 6))

(define (merge-sort l)
  (cond
    [(< (length l) 2) l]
    [else (let-values ([(l1 l2) (split-at l (quotient (length l) 2))])
            (merge (merge-sort l1) (merge-sort l2)))]))

(test-case "merge-sort"
           (check-equal? (merge-sort `(6 5 4 3 2 1)) `(1 2 3 4 5 6))
           (check-equal? (merge-sort `(5 4 3 2 1)) `(1 2 3 4 5))
           (check-equal? (merge-sort `()) `())
           (check-equal? (merge-sort `(2)) `(2))
           (check-equal? (merge-sort `(6 1 4 3 2 5)) `(1 2 3 4 5 6))           


Things that stay fixed, and other fairy tales

One of the reasons I hear from scientists/engineers transitioning from wet work to “dry” work is that they would like to work in a field where, when things are fixed, they stay fixed. It’s not entirely clear to me that code – actual computer code that has to work in the field – fits that description, or that folks should jump into computational work for that reason alone.

I have the opportunity, as part of the recruiting efforts of our company, to speak with folks at job fairs, conferences and during interviews. It takes me away from my desk, my plans and my deadlines, but it’s a nice change, and it’s one of the reasons I like working at the company. (I get to wear different hats, I rarely have to wear one I don’t like, and I’ve rarely found a hat that I didn’t at least want to try on – boy, that hat analogy goes pretty far!)

I will often speak with folks who have an engineering or math background who are now doing wet work (experiments with biological systems) and want to transition out of wet work, and often out of academia, and into industry doing computational work. Been there, thought that, done that, very sympathetic to that.

Often the underlying reasons are complex (they were in my case) but sometimes they will manifest as a feeling of “I just fixed this rig YESTERDAY! Why doesn’t it record the data TODAY! Who $%#^ moved my ground wire?! Which #$%@^ idiot changed my filter settings?! I want to move to a field where, when I fix things THEY STAY FIXED!!”

Technically minded people who do both their own wet work and their own computational/engineering work (analysis, rig setup) are falling prey to extinction. No, not evolutionary extinction. This is “extinction” as a psychophysical phenomenon: when one stimulus is so salient (or you have a certain kind of brain damage) that it drowns out another stimulus, which exists but is, for some reason, ignored by the brain.

In this case, technically minded people are magnifying the frustrations of working with biological and experimental systems over the frustrations of working on computer systems and this sometimes has to do with the nature of their setup.

In neurophysiology (and I imagine in most experimental work) we were pushing equipment to its limits. To the limits of signal transduction, to the limits of sampling, to the limits of its optics. As a result, it goes beyond its engineering curve and becomes unreliable. The exact position of a ground wire starts to matter a lot more in your dreams (or nightmares) than seems reasonable. The fact that someone, two blocks away, is doing construction and is messing up recordings (by making a biological sample jiggle about 1 nanometer more than one would like) suddenly becomes a strain on a marriage. When you throw biological systems into the mix, well, that’s a whole other can of worms.

The computational aspect, on the other hand, can be quite different. It’s challenging in the beginning, and in a good way. You need to write new code, debug it, (sometimes) test it with nasty looking data. You probably do it by yourself, so you know what’s going on; you have it all in your head and it all flows out smoothly and creatively. It’s a lot of fun. And once you get it working, it’s a dream. It runs fine on your computer, and at most you may have to recompile it once to put it on your institution’s high performance computing cluster. That may be a few days’ work, because the software those jokers have on their machines is like a decade out of date, but in the end it is fixed, and it stays fixed.

In my opinion, this comparison is a little unfair to the wet work and it can be dangerous to draw too much by way of differences between wet work and dry work from this experience.

Computer programs that have been fixed stay fixed only when kept in stasis. You must never alter the hardware they are running on, you must never present them with new data and you must never let anyone use them. Real world computer programs don’t match this description.

One of the first things that will happen when you give this delicate, finely crafted, pure-genius program that you have written to someone else is that they will say “But it doesn’t work on WINDOWS! I spent all afternoon trying to download and compile HDF5, but it doesn’t WORK”.

On the off chance that it does work, you should never watch them trying to run the program. It’s a bit like being five and giving your favorite custom built LEGO model (of which you have no pictures and of course no plans) to your friend. The first thing they will do is drop it and it will smash into individual pieces all jumbled up on the floor. “So, I put in -1 where you asked for sequence length, and I got a core dump. Oh, it also erased <important data file>”

Then, this other person will come along, and want to work on your code. “It’s on <version control system>, right? I want to add <shiny feature>”. A few days later the program stops working. Or, even more likely, it works just fine, until this OTHER colleague comes along and says “My data look funny after they come out of your machine. Did you change anything?”

Writing test suites and documentation can be a chore. Why are you writing a test suite when you can be having fun throwing your machine at <big data>? Time’s a-wastin’. Why are you writing documentation and comments? If they are smart coders they’ll read the code and understand! It’s so BOOORING! I mean, it takes time away from other tasks.

Oh, and when everything is running and things are working fine, you’ll be asked to move your code to this big machine we have running in the cloud, and everything will be going fine until you get this compilation error. And you have to spend the rest of the day looking high and low for which library needs to be updated to which version, and which library has to be downgraded to which version, before the test suite passes. You have a test suite, right? Because otherwise a whole lot of OTHER people will be sending you email saying “My data looks funny after it comes out of your machine”.

The point is, none of these are unreasonable or improbable or malicious occurrences. They are a natural part of the process and if one does not like to deal with these aspects too, it will not be a joyful life.

Going back to the introduction and reasoning about career choices: I think, when figuring out what kind of work to do, you should ask yourself, first, which matters more (and makes you happy) – what you do, or what you do it for? Is your job a means to an end, or an end in itself? Many people have judgements about this; all I can say is, everyone is different and you need to figure this out for yourself. Once you do, the path becomes clear.

For me, working on computer code and algorithms has multiple pulls over lab work. The main ones are that I like learning new math and I LIKE wrestling with algorithms and bugs in my code. I like figuring out why my past me (the older the better) forgot a key aspect of how to tell the computer what to do, or forgot an interesting corner case that is now causing my program to fail in a baffling way. It makes for good war stories too. Compilation blockers can get annoying though …

Basically, I do it because I like the dirty parts as well as the shiny ones.



(time (apply max a))

A while ago I started up with Haskell in an effort to figure out what this functional programming was all about. One of the simple tests I did was try to find the maximum of a list of numbers and time it. This led me down an interesting rabbit hole, documented here.

I’ve now started with Racket in earnest (and am documenting it here). The corresponding test for Racket – the maximum of a list of a million numbers – is as follows:

#lang racket
(define a (build-list 1000000 values))
(time (apply max a))

This gives a respectable 92 ms running time, 70 ms of which is garbage collection. To recap, the corresponding times were 20 ms for Python and 2 ms for C. Haskell’s performance was, well, confusing.
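(For reference, the Python measurement was along these lines – a sketch from memory; the details are in the earlier post:)

import time

a = list(range(1000000))
t0 = time.time()
max(a)
print(time.time() - t0)  # roughly 20 ms on that machine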



While x:

This is a cautionary tale about not misusing Python’s kindness. Or perhaps a cautionary tale about not trusting even widely used libraries to mirror what Python does. It is a little technical …

I have a program that, for the purposes of this post, operates on a list of pairs of numbers, going through each pair one by one until it exhausts the list. For certain reasons I’m using an iterator on the list and using Python’s ‘None’ to indicate that the list has run out, like so:

x = [(0, 0), (1, 2), (3,1)]
itr = x.__iter__()
v = next(itr, None)
while v:
  print v
  v = next(itr, None)

The pair of numbers represents particular data. The first number can range from 0 onwards and is always increasing monotonically while the second number is one of {0, 1, 2}. I have several tests for this function and have checked for different edge cases.

As you can see in the snippet above, the list x is a list of Python tuples. After the bulk of the development for this (and related) functions had been completed, I decided that a better way to store my list of tuples would be as a numpy structured array. That way I would have labels for the first and second numbers in each pair and wouldn’t get confused about which was which six months from now.

And Python, being the great dynamic language it is, let me make the switch without a murmur. Python doesn’t care that the object I was initially passing it was a list of tuples and the new one is a numpy structured array! Python sees both objects yield iterators as needed and both objects are spitting out pairs of numbers at each iteration. So Python just works!

Things ran just fine for a while.

One day my colleague, who has used my program many times before, came across some very strange results in their experiments. I knew right away that this was going to be an interesting bug, just like any bug that is found after a program has been used for a while. The simple/easy bugs have all been squashed. What remains are the cockroaches. Evil, secretive errors in logic that scurry way from the daylight and come out only in the shadows of the night.

So I took the input data my colleague was using and ran my program. Indeed something bizarre was going on. For some of the data the program was acting exactly as it should. But for some of the data the program claimed that the list of pairs of numbers was empty and would exit right away. It would do the rest of the processing without any errors.

My first guess was that something was wrong with the input files. The input files were being generated with some hastily written code that I had not fully tested, so it was the logical choice, though this would make the bug less interesting.

I reviewed the input data and found that it was just fine. So the input to the program was correct. The program was just borking, seemingly at random.

I then started to do things the old fashioned way. I started to put in print statements at various parts of the code and simply print the length of my array of pairs of numbers. This was being done on a remote machine and I hadn’t learned yet how to hook my debugger into the remote machine. Also, sometimes, print statements are just a more pragmatic way to debug, regardless of what the eggheads will tell you.

The program loads the list. A million elements in the list. Good. The program does a transformation on the list. A million elements. Good. Another transformation. Still a million little pairs of numbers, hanging on for dear life. Then they enter this piece of code I sketched out above. And BAM! the program skips the loop, grinning as it whizzes by, zero little pairs of numbers.

I run the program again, this time with a different input set. A hundred thousand elements. Passes first transform, second transform, and then my loop. Round and round it goes in the loop, a hundred thousand times before shooting out, all dizzy at the end, having done exactly what it was supposed to do.

Very interesting.

What is different between the two data sets that would make the code DO this? So I opened up the data and started to look closely at the ones that failed and the ones that succeeded, and then it struck me. All the ones that failed had (0, 0) as their first element.

And then it was clear.

When I wrote the loop:

itr = x.__iter__()
v = next(itr, None)
while v:
  print v
  v = next(itr, None)

I was a bit lazy. What I really meant was, “exit the loop if v is None”. Python is very kind, and None evaluates to False in this test, so it is equivalent. The problem is that the loop will also terminate if v == 0. But wait, this is actually not a problem, because my v is not a scalar number. It is a pair of numbers, and during my testing I had repeatedly verified that (0, 0) != 0. One is a scalar value – zero – while the other is a Python object – a tuple. That can’t possibly be the problem!

But wait! Halfway through development I told you I switched from using Python lists of tuples to numpy arrays.

So I looked a bit into that and, BAM, sure enough, that’s where the behavior differs. For some reason numpy evaluates an element of a structured array as false if all of its fields are zero. This is different from taking a slice of a multi-dimensional array, where numpy will complain: “ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”

This is what still throws me. When I ask numpy for an element of my structured array of pairs of numbers, I get back a pair of numbers. These numbers are not simply another numpy array of two elements – which is what you would get from slicing a 2D numpy array. It’s a pair of numbers. However, this pair does not act as a Python tuple. It’s a different beast altogether.

Well, folks, the moral of the story is: don’t be lazy. If you want to test for None, test for it explicitly. You never know what evaluates to True or False, even if you think you do.
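The fix is a one-word change – test for the sentinel explicitly:

v = next(itr, None)
while v is not None:   # explicit test; a falsy first element no longer ends the loop
  print v
  v = next(itr, None)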

If you would like to see this bug for yourself, here is a sufficient example:

import numpy as np

def looper(x):
  itr = x.__iter__()
  v = next(itr, None)
  while v:
    print v
    v = next(itr, None)

def python_list():
  print 'Python list'
  l = [(0, 0), (1, 0), (2, 2), (3, 0)]
  looper(l)  # prints all four pairs

def numpy_struct_array():
  print 'Numpy array'
  l = np.array([(0, 0), (1, 0), (2, 2), (3, 0)],
               dtype=[('x', 'int32'), ('y', 'int8')])
  looper(l)  # prints nothing: the all-zero first element tests as falsy

if __name__ == '__main__':
  python_list()
  numpy_struct_array()
