Have a test suite

Two tests failed. They were rather benign tests too; I think I had put them there more for completeness than for some edge case I’d run into. The tests were for cases when some of my algorithms would produce an empty data structure. Almost as an afterthought, I was also saving the empty data to file using h5py – a wrapper around the HDF5 C library. I was moving code to a production machine, and those two tests failed.
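
A minimal sketch of the kind of test that caught this (not the original code; the helper names and file name are made up for illustration), just to show the shape of the edge case: round-tripping a possibly empty result through h5py.

import numpy as np
import h5py

def save_result(fname, data):
    # The result may legitimately be empty -- that is the edge case under test
    with h5py.File(fname, 'w') as f:
        f.create_dataset('result', data=data)

def load_result(fname):
    with h5py.File(fname, 'r') as f:
        return f['result'][()]

def test_empty_result_roundtrip():
    save_result('empty_result.h5', np.array([]))
    assert load_result('empty_result.h5').size == 0

if __name__ == '__main__':
    test_empty_result_roundtrip()
    print('empty-data round trip OK')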

The fact that in both those cases the data structures were empty did not strike me at first. I spent most of my time making sure setup.py was correct, that I had listed the versions of the packages I had on my dev machine, and that the production machine indeed had at least those versions of the packages.

One of the things about Python is that its neat package manager works best for pure-Python packages. Once you have packages that wrap libraries written in other languages, which have to be compiled separately, you have to rely on the Python package maintainers to detect and flag buggy versions of the underlying libraries. It’s easy to forget this and be lulled into a sense of security once you have your setup.py file all annotated with required dependencies and you are sitting smug and warm in your insulated virtual environment. You forget that the cold clammy claws of the C++ ecosystem’s madness are never too far away.
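
To make that concrete, here is a hypothetical, stripped-down setup.py (not my actual one). It pins the Python-level wrappers, but says nothing about the version of the HDF5 C library that h5py was compiled against; that is exactly the gap the bug slipped through.

from setuptools import setup

setup(
    name='my_analysis',  # hypothetical package name
    version='0.1',
    packages=['my_analysis'],
    # These constraints pin the *Python* wrappers only; the HDF5 C library
    # they link against is outside the package manager's reach.
    install_requires=[
        'numpy>=1.9',
        'h5py>=2.5',
    ],
)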

After a lot of madness of my own, looking over the failed tests and the debug output, and some increasingly desperate web searches, I came across the following hit:

[Screenshot of the search hit: a report of a bug in the underlying HDF5 library]

Finally, I stopped looking at the Python ecosystem and looked at the libraries. The HDF5 version was indeed at 1.8.4 and upgrading to the latest library version made the test errors go away.

And I sat there for a while thinking how dangerous it is not to have a test suite. If I didn’t have those tests in there, this bug would have been sitting in the code, quietly waiting. It’s not even a bug in code I wrote. It’s an anonymous bug, from some unknown, faceless colleague. Actually, it’s code from a consortium that is pretty well funded, since the library is well regarded and widely used. And I would never have known until something blew up.

I mean, everyone says you should have a test suite, but no one ever gives points for it. People look at how fast your code is, perhaps at the algorithmic tricks you use, perhaps at the nice UI you present, but no one ever says, “Hey, that’s a nice test suite you have there.”

But that’s probably the only important thing in the code.

Quick thoughts on starting Functional Programming (in Racket)

Having been “functionally curious” for a while now, I’ve started with Racket, a language in the Scheme family, which is a type of Lisp. Yes. Racket rhymes with bracket.

My first foray into functional programming was with Haskell. When I started with Learn You a Haskell for Great Good I was initially excited, but then I got a little confused and this was enough of a barrier that I stopped.

When I came back to trying Functional Programming (almost a year later) I made two changes: I picked a different language (Racket), and I decided that, instead of my usual method of jumping into the deep end by starting a real-life project with the language, I would patiently step through the Project Euler problems in that language. The combination seems to be working.

Racket

DrRacket is a very awesome learning tool. It’s what the IPython notebook and the IPython interactive console aspire to be, and more. I don’t use the pop-out hints (in the right-hand corner), but I make extensive use of searching the documentation for standard library descriptions, of the graphical display of function usage (DrRacket draws arrows from a function definition to its uses!), and of locating errors.

[Screenshot of DrRacket]

Here’s a nice use case for this. I copied over a function intending to modify it and make a different version of it (this is part of some learning code, so I like to keep older versions of algorithms around for comparison). By hovering over the older definition and noting the long arrow that goes down into the new (pasted) code I can see that I haven’t changed the function name in one place yet:

[Screenshot: DrRacket drawing an arrow from the older function definition down into the pasted code, revealing the function name I haven’t changed yet]

The documentation is very good, though I haven’t run into a user community. (Edit: the folks on the mailing list are very kind and helpful!) Perhaps I should wander down to Northeastern U. I started with the tutorial in the guide, but after settling on Project Euler I abandoned the tutorial and now make extensive but non-linear use of the guide and reference (which are, incidentally, included with the distribution, so I don’t need an internet connection) and Stack Overflow.

In terms of syntax, Racket is a bit annoying. It’s fairly easy for me to write, but I don’t like to read it. The Polish prefix notation is annoying but strangely addictive. I definitely prefer infix, but there is a perverse pleasure in getting fluent in prefix. The real problem is the brackets. Perhaps I just haven’t learned how to format things and need to look at other people’s Racket code, but my code is unreadable.

Project Euler

Project Euler seems just the ticket for getting started with Functional Programming. The problems are paced and are a nice mixture of mathematical theory and code. For each problem I find myself scribbling on paper, thinking a bit and then coding. I really like this. Solving the problems in order of difficulty keeps me from getting stuck in one place for too long.

Functional Programming

I have to admit learning a new paradigm is a lot more fun than just learning a new language. After a lifetime of coding imperatively, it’s a nice challenge to deny myself the easy pleasures of loops and variable assignments (though it’s taking things a bit far to do this strictly). I’m sure there is a lot more to FP, but that is what I’ve picked up doing the first few Project Euler problems.

Using list comprehensions in Python (which are implicitly map and/or filter) has made the concept of map and reduce very easy to grasp. (As an aside, I really like Python’s syntax for list/dictionary comprehensions which I find to be concise and readable.)
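
For example (a toy snippet, not from any particular project), the comprehension below quietly does the work of both map and filter:

xs = [1, 2, 3, 4, 5]

# Explicit map and filter ...
squares_of_evens = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, xs)))

# ... and the list comprehension that does the same thing
also_squares_of_evens = [x * x for x in xs if x % 2 == 0]

assert squares_of_evens == also_squares_of_evens == [4, 16]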

In a way map and reduce are easy. Map carries no state anyway and so is a really trivial “mapping” from the loop form

for i in range(N):
  l[i] = f(m[i])

simply becomes

l = map(f, m)

Reduce is slightly more interesting because it has an accumulator, but it almost seems like a cheat to me: reduce is a function that bundles an interesting functional concept (recursion) inside and lets you get away without thinking about it.
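
A toy illustration of what I mean (my_reduce is just something I made up here, to show the recursion that reduce hides):

from functools import reduce

def my_reduce(f, xs, acc):
    # The recursion reduce quietly does for you: fold the accumulator
    # through the list one element at a time.
    if not xs:
        return acc
    return my_reduce(f, xs[1:], f(acc, xs[0]))

xs = [1, 2, 3, 4]
assert reduce(lambda acc, x: acc + x, xs, 0) == my_reduce(lambda acc, x: acc + x, xs, 0) == 10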

Pattern matching

The first pattern matching I was exposed to was Haskell’s, and I fell in love with it. The notation is so close to how you would write down functions on a piece of paper that it seems the only natural way to do it. (I’m sad to say that Racket’s pattern matching syntax is kind of cumbersome, especially when you want to match on conditions. I started using “if” statements as they are more succinct, and then I discovered cond, which leads to pretty neat code.)

Recursions instead of loops

Though Racket has loop-like constructs, I’ve stayed away from them and tried to write everything as recursions. It was a little challenging at first but became more and more natural. It felt a little awkward because when I write Python, in the back of my mind is this voice that keeps saying <ghostly quiver in voice> “Function call overhead”, and I usually end up writing somewhat meaty functions. In Racket, because I’m doing things like recursion, I end up abstracting computations into functions just to make the code look neater and more atomic.

Testing

I’m really happy that Racket has a very easy-to-use and fully featured unit test system. It encourages me to include the tests right underneath an important function, and I just run the file to execute them.

Get it right the first time

I’ve now written a bunch of short scripts in Racket, mostly solving Euler problems, but some also to follow along with Tim Roughgarden’s Algorithms course on Coursera. One of the things that has startled me a bit is how often I get my Racket code right the first time. One big caveat is that I’m usually doing simple things, but it still startles me. I’ll be writing this dense-looking code in this new language, I’ll write out the tests, run them, and they’ll all pass. I think a large part of this is because I’m doing pure algorithms – I’m not having to mess with real data, which has a thousand edge cases that clutter up the code, and part of it has to do with me breaking each task into atomic parts and coding and testing each part separately.

When writing in Racket I often feel I’m writing machine code again: I’m down in the weeds, doing one small task at a time. In this sense Python is more expressive; I feel I can do pretty complex things with a single list comprehension, for example. But functional code can be pretty cool and compact to look at when I get it right.

#lang racket
(require rackunit)  ; provides test-case and check-equal?

;; Given two sorted lists l1 and l2, merge them into a sorted list l3
(define (merge l1 l2)
  (cond
    [(= 0 (length l1)) l2]
    [(= 0 (length l2)) l1]
    [(< (first l1) (first l2)) (cons (first l1) (merge (rest l1) l2))]
    [else (cons (first l2) (merge l1 (rest l2)))]))

(test-case "merge"
           (check-equal? (merge `(1 2 3) `(4 5 6)) `(1 2 3 4 5 6))
           (check-equal? (merge `(1 3 5) `(2 4 6)) `(1 2 3 4 5 6))
           (check-equal? (merge `(2 4 6) `(1 3 5 7)) `(1 2 3 4 5 6 7))
           (check-equal? (merge `(2 4 6) `(2 4 6)) `(2 2 4 4 6 6))
           )

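;; Given a list l, sort it by splitting it in half, sorting each half and merging the sorted halves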
(define (merge-sort l)
  (cond
    [(< (length l) 2) l]
    [else (let-values ([(l1 l2) (split-at l (quotient (length l) 2))])
            (merge (merge-sort l1) (merge-sort l2)))]))

(test-case "merge-sort"
           (check-equal? (merge-sort `(6 5 4 3 2 1)) `(1 2 3 4 5 6))
           (check-equal? (merge-sort `(5 4 3 2 1)) `(1 2 3 4 5))
           (check-equal? (merge-sort `()) `())
           (check-equal? (merge-sort `(2)) `(2))
           (check-equal? (merge-sort `(6 1 4 3 2 5)) `(1 2 3 4 5 6))           
           )

Things that stay fixed, and other fairy tales

One of the reasons I hear from scientists/engineers transitioning from wet work to “dry” work is that they would like to work in a field where, when things are fixed, they stay fixed. It’s not entirely clear to me that code, actual computer code that has to work in the field, completely fits that requirement, or that folks should jump into computational work for that reason alone.

I have the opportunity, as part of the recruiting efforts of our company, to speak with folks at job fairs and conferences and during interviews. It takes me away from my desk, my plans and my deadlines, but it’s a nice change, and it’s one of the reasons I like working at the company (I get to wear different hats, I rarely have to wear one I don’t like, and I’ve rarely found a hat that I don’t at least want to try out – boy, that hat analogy goes pretty far!)

I will often speak with folks who have an engineering or math background who are now doing wet work (experiments with biological systems) and want to transition out of wet work, and often out of academia, and into industry doing computational work. Been there, thought that, done that, very sympathetic to that.

Often the underlying reasons are complex (they were in my case) but sometimes they will manifest as a feeling of “I just fixed this rig YESTERDAY! Why doesn’t it record the data TODAY! Who $%#^ moved my ground wire?! Which #$%@^ idiot changed my filter settings?! I want to move to a field where, when I fix things THEY STAY FIXED!!”

Technically minded people who do both their own wet work and their own computational/engineering work (analysis, rig setup) are falling prey to extinction. No, no, this is not evolutionary extinction. This is “extinction” as a psychophysical phenomenon: when one stimulus is so salient (or you have a certain kind of brain damage) that it drowns out another stimulus, which exists but is, for some reason, ignored by the brain.

In this case, technically minded people are magnifying the frustrations of working with biological and experimental systems over the frustrations of working on computer systems and this sometimes has to do with the nature of their setup.

In neurophysiology (and I imagine in most experimental work) we were pushing equipment to its limits. To the limits of signal transduction, to the limits of sampling, to the limits of its optics. As a result, it goes beyond its engineering curve and becomes unreliable. The exact position of a ground wire starts to matter a lot more in your dreams (or nightmares) than seems reasonable. The fact that someone two blocks away is doing construction and messing up recordings (by making a biological sample jiggle about 1 nanometer more than one would like) suddenly becomes a strain on a marriage. When you throw biological systems into the mix, well, that’s a whole other can of worms.

The computational aspect, on the other hand, can be quite different. It’s challenging in the beginning, and in a good way. You need to write new code, debug it, (sometimes) test it with nasty-looking data. You probably do it by yourself, so you know what’s going on; you have it all in your head and it all flows out smoothly and creatively. It’s a lot of fun. And once you get it working, it’s a dream. It runs fine on your computer, and at most you may have to recompile it once to put it on your institution’s high-performance computing cluster. It may be a few days’ work, because the software those jokers have on their machines is like a decade out of date, but in the end it is fixed, and it stays fixed.

In my opinion, this comparison is a little unfair to the wet work and it can be dangerous to draw too much by way of differences between wet work and dry work from this experience.

Computer programs that have been fixed stay fixed only when kept in stasis: you must never alter the hardware they run on, you must never present them with new data, and you must never let anyone use them. Real-world computer programs don’t match this description.

One of the first things that will happen when you give this delicate, finely crafted, pure-genius program you have written to someone else is that they will say, “But it doesn’t work on WINDOWS! I spent all afternoon trying to download and compile HDF5, but it doesn’t WORK!”

On the off chance that it does work, you should never watch them trying to run the program. It’s a bit like being five and giving your favorite custom-built LEGO model (of which you have no pictures and of course no plans) to your friend. The first thing they will do is drop it, and it will smash into individual pieces all jumbled up on the floor. “So, I put in -1 where you asked for sequence length, and I got a core dump. Oh, it also erased <important data file>.”

Then, this other person will come along, and want to work on your code. “It’s on <version control system>, right? I want to add <shiny feature>”. A few days later the program stops working. Or, even more likely, it works just fine, until this OTHER colleague comes along and says “My data look funny after they come out of your machine. Did you change anything?”

Writing test suites and documentation can be a chore. Why are you writing a test suite when you could be having fun throwing your machine at <big data>? Time’s-a-wastin’. Why are you writing documentation and comments? If they are smart coders they’ll read the code and understand! It’s so BOOORING! I mean, it takes time away from other tasks.

Oh, and when everything is running and things are working fine, you’ll be asked to move your code to this big machine we have running in the cloud, and everything will be going fine until you get this compilation error. And you have to spend the rest of the day looking high and low for which library needs to be updated to which version, and which library has to be downgraded to which version, before the test suite passes. You have a test suite, right? Because otherwise a whole lot of OTHER people will be sending you email saying “My data looks funny after it comes out of your machine”.

The point is, none of these are unreasonable or improbable or malicious occurrences. They are a natural part of the process and if one does not like to deal with these aspects too, it will not be a joyful life.

Going back to the introduction and reasoning about career choices: I think, when figuring out what kind of work to do, you should first ask yourself which matters more (and makes you happy): what you do, or what you do it for? Is your job a means to an end, or an end in itself? Many people have judgements about this; all I can say is, everyone is different and you need to figure this out for yourself. Once you do, the path becomes clear.

For me, working on computer code and algorithms has multiple pulls over lab work. The main ones are that I like learning new math and I LIKE wrestling with algorithms and bugs in my code. I like figuring out why my past me (the older the better) forgot a key aspect of how to tell the computer what to do, or forgot an interesting corner case that is now causing my program to fail in a baffling way. It makes for good war stories too. Compilation blockers can get annoying though …

Basically, I do it because I like the dirty parts as well as the shiny ones.