Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I could’t live without it. I find Python generators and list/dict comprehensions one of the key things to making Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun not-withstanding, I was happy to learn that common lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting each K character long section (called a “kmer”) (bioinformatics savvy readers will recognize the file as a FASTQ file, and the K character long sections as k-mers from the reads in the FASTQ.

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create)

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) > kmer_size
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)

Update: My first run through is now appended to the end of the main article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be to use a nested map like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve minorly changed the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to every second line of the FASTQ file (seq line)"
  (flet ((read-seq-line (in)
	   (prog2
	       (read-line in nil)
	       (read-line in nil)
	     (read-line in nil)
	     (read-line in nil))))
    (loop for seq = (read-seq-line in)
	 while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
   (loop
      for pos from 0
      for end = (+ pos kmer-size)
      while (<= end len)
      do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.

Appendix: my first run through using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has got to be the first idiom I know from Python that I can not replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
	(prog2  ; <-- prog1 and prog2 are nifty!
	    (read-line in nil)
	    (read-line in nil)  ; <-- this is the sequence line
	  (read-line in nil)
	  (read-line in nil)))))	;  http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
   while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

       
(defun kmer-gen (seq &key (kmer-size 30) (discard-N? nil))
  (let ((n 0) (n-max (- (length seq) kmer-size)))
   #'(lambda ()
       (prog1
	   (if (< n n-max)
	       (subseq seq n (+ n kmer-size))
	       nil)
	 (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
   while seq do
     (let ((g (kmer-gen seq :kmer-size 10)))
       (loop for s = (funcall g)
	    while s do (format t "~A~%" s))))

It’s not terrible as the code still looks compact and readable but it’s not as nicely abstracted away as the nested Python generator.

(A complete non sequitor: Remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well now I have this issue with prefix and infix notation. I keep trying to write (print kmer) and print(kmer) is starting to look plain weird …)

Common Lisp accessors

The Common Lisp setf function is deceptively humble looking. At the start I didn’t think much of it – except that it looked eccentric compared to the assignment syntax from other languages – but recently I’ve found out how powerful it is, though not quite as convenient as I would hope.

In most languages to perform a variable assignment one does something like

a = 22;

which places the value “22” into a place in memory which has been given the label “a” in the program.

In Common Lisp the corresponding syntax is

(setf a 22)

and the only thing weird about it is the everything-is-a-list pattern of Lisp.

I was blown away, though, when I saw code that looked like this

(setf x `(1 2 3 4 5))  ; x -> (1 2 3 4 5)
(setf (rest x) `(6 7 8))  ; x -> (1 6 7 8)

You can’t do this as transparently in C++ or Python. You can implement a function to do the equivalent thing, but you can’t assign a value to the return value of a function. This Lisp syntax makes writing code very compact and logical.

However, I stumbled when I tried to use this in a function I wrote. My function started at the root of a complete binary tree and then snaked recursively through it to find and return a particular node.

;; Use the binary representation trick to thread through the
;; tree to a given node. No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
(defun find-node (root index)
  (multiple-value-bind
   (n-idx rem) (floor index 2)
   (if (= n-idx 1)
     (find-node-pick-child root rem)
     (find-node-pick-child (find-node root n-idx) rem))))

(defun find-node-pick-child (nd rem)
  (if (= rem 0) (node-lchild nd) (node-rchild nd)))

When I tried to execute the statement (which I expected to attach new-node to the place returned by find-node)

(setf (find-node root 8) new-node)

I ran into compiler errors.

This lead me on a very interesting chase all round the internet with the word “generalized variables” turning up a lot. It turns out that (setf …) is a lisp macro that can figure out how to get hold of the “place” its first argument refers to. If the first argument is just a variable, (setf …) acts just like the assignment statement in any other language. If the first argument is a function, Lisp will look for a setter form for that function and setf on the place returned by that function (which is called an accessor function).

Following advice from answers on comp.lang.lisp, I came up with the following code:

;; Use the binary representation trick to thread through the
;; tree to return an tree node and some auxiliary information
;; No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
; Return value is a pair
; node - a node of the tree
; info - :me - this is the node we're looking for
;        :lchild - we want the left child
;        :rchild - we want the right child
(defun find-node-with-index (root index)
  (multiple-value-bind
   (new-idx rem) (floor index 2)
   (cond
     ((= new-idx 0) (values root :me))
     ((= new-idx 1) (values root (if (= rem 0) :lchild :rchild)))
     (t (find-node-with-index
         (if (= rem 0) (node-lchild root) (node-rchild root))
         new-idx)))))

(defun node-with-index (bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (bheap-root bh))  ; Special case - root of heap
     (:lchild (node-lchild nd))
     (:rchild (node-rchild nd)))))

;; The accessor or setter function. Note the "setf"
(defun (setf node-with-index) (new-node bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (setf (bheap-root bh) new-node))  ; Special case - root of heap
     (:lchild (setf (node-lchild nd) new-node))
     (:rchild (setf (node-rchild nd) new-node)))))

I like the DRY principle primarily because it reduces bugs and the total amount of code. I tried to abstract as much of the core functionality into a single function as I could, but the getter and setter functions look so duplicate, but I don’t think I can reduce them further.

Thanks go to Kaz Kylheku and Barry Margolin on comp.lang.lisp for answering my questions on this topic and to Rainer Joswig for corrections to this post.

Links-Rechts, Links-Rechts, Ahnentafel!

While looking into priority queues and binary heaps, I ran across two gems, one of which may not be so widely known. Consider a complete binary tree with nodes sequentially numbered:

binary-tree

You will observe that the parent of every node is the integer division of itself by two. For example, the parent of 11 is 11/2 = 5 (integer division), the parent of 14 is 14/2 = 7 and so on. Way back in 1590 a genealogist used this numbering pattern to organize people’s ancestries, calling it an Ahnentafel. From the article:

The subject (proband or progenitor) of the ahnentafel is listed as No. 1, the subject’s father as No. 2 and the mother as No. 3, the paternal grandparents as No. 4 and No. 5 and the maternal grandparents as No. 6 and No. 7, and so on, back through the generations. Apart from No. 1, who can be male or female, all even-numbered persons are male, and all odd-numbered persons are female. In this schema, the number of any person’s father is double the person’s number, and a person’s mother is double the person’s number plus one. Using this definition of numeration, one can derive some basic information about individuals who are listed without additional research.

This construct displays a person’s genealogy compactly, without the need for a diagram such as a family tree. It is particularly useful in situations where one may be restricted to presenting a genealogy in plain text, for example, in e-mails or newsgroup articles. In effect, an ahnentafel is a method for storing a binary tree in an array by listing the nodes (individuals) in level-order (in generation order).

Now, suppose we see the exact same complete binary tree and sequential numbering in binary:

binary-tree-in-binary

Do you notice a striking pattern? The children of any given node are basically the binary of the parent left shifted by one with a 0 or 1 in the new last bit, depending upon whether they are the left or right child. So, for example, the children of 101 are 1010 and 1011. This is the basis for the following neat algorithm (which I first came across here) for finding a given node in this tree.

To find the path from the root (here node #1) to some other node, say (11) express the target node number in binary (1011) and discard the first digit – which is always 1 – (011). Each digit now successively indicates the child to pick (0 = left, 1= right) as you traverse the path. So, to get to node #11 you would take (011) left, right, right. (I read this twice before realizing the choice of eleven (11) is peculiarly bad in this context. Here’s the example using 12. 12 => 1100 => 100 => right, left, left)

This algorithm is useful when using a pointer-based representation and, for example, inserting or deleting from a binary heap. This pattern, as you have guessed, comes from the number representation and the tree being both binary based.

PS. Pardon my terrible title. The Ahnentafel put “German” in my head, traversing the tree put “left, right” in my head and watching too many world war II movies in childhood did the rest.

PPS. Python code to generate the trees, not that you need it.

from matplotlib import pyplot as plt

def node_xy(k, n, dx):
  n0, n1 = 2**k, 2**(k+1)-1
  center = (n0 + n1) / 2.0
  return (n - center) * dx / n0, 0.5 - k / 10.0

def plot(ax, max_k):
  dx = 1.0
  for k in range(max_k):
    for n in range(2**k, 2**(k+1)):
      x, y = node_xy(k, n, dx)
      # ax.text(x, y, str(n), ha='center', va='bottom')
      ax.text(x, y, bin(n)[2:], ha='center', va='bottom')
      if k:
        px, py = node_xy(k - 1, int(n/2), dx)
        plt.plot([px, x], [py, y + 0.025], 'k')

fig = plt.figure()
ax = plt.subplot(111)
plot(ax, 5)
plt.setp(ax, xlim=[-0.5, 0.5], ylim=[0.0, 0.6])
plt.axis('off')

Problem 11

I’m using problems in The Advent of Code 2016 to motivate my study of Common Lisp though, sadly, I’ve not been devoting time to it recently. Luckily several of my colleagues are also doing the challenge and one of them came by and described what sounded like a particularly interesting problem – number 11 – , and I decided to try my hand at it in Lisp. It promises to be engaging and instructive.

I’m not going to repeat the problem here (but some of the stuff I’ll write will only make sense in the context of the details of the problem); in the abstract it is a path finding problem: The problem world is in some initial state, there are set of intricate rules by which you can change the world state, and there is an end goal state. The task is to find the shortest number of steps to that end goal.

For me this is a rich problem both in terms of algorithms and language. I think I’ll try and make short posts about the steps I take. My course of action will be (and these may change, of course):

  1. Figure out what algorithms apply to this problem
  2. Think of data structures that will help
  3. Implement this in Common Lisp, from scratch where it looks interesting
    1. I’ve had at least one case so far where I had a naive way of coding things and then I discovered a different Lisp idiom that made it better. I hope to show these examples too.

There are several tutorials and expositions on path finding. The one I like the best is from Red Blob Games. It has interactive visualizations of the algorithms which are very motivating and the tutorial itself starts slow and easy and makes nice progress to end at the A* algorithm. If you want a refresher, please refer to that. If you’ve read Problem 11 you’ll note that it looks different, but with the proper abstraction, this is what it is.

Now let’s take up where the tutorial leaves off. The A* algorithm seems to be the ticket and we want to figure out some implementation details. The interesting things the A* algorithm requires are a priority queue and a heuristic for determining how far a given state is from the end state.

Popular lore has it that, for this application, a Fibonacci heap is a good implementation for the priority queue for this application, but The algorithm design manual points us to rank-paired heaps as something that is better in practice. I found this paper on rank-pairing heaps. The authors have written the paper very well: They start with a nice systematic introduction to heaps and then slowly introduce modifications they make to existing structures and algorithms.

To start with however, I’m going to implement a priority queue using a simple binary heap (there is a good illustrated guide here). At a later stage, when the whole thing is worked out, I’m going to come back and implement rank-pairing heaps.

The data structure for holding the world state can be separated out from all this priority queue business, but it does interface with the A* algorithm in two places 1) The world state has to be amenable to hashing, such that we can put it in a map-like structure and 2) The implementation of rules for going from state to state i.e. the manner in which we find child nodes of a given parent node can, we intuit, be made considerably easy or difficult based on our choice of world state representation.

This whole thing is going to go down as a series of blog posts and I’ll link to each one (creating a linked list) but also try and set them out here, so that this post serves as a table of contents.

  1. Binary heap priority queue
    1. Learned about accessor functions
    2. And print-object which is what Python’s __repr__ is modeled on.
    3. and a diversion
  2. A* search framework
  3. Data structure and functions to handle world state
  4. Completion of problem 11
  5. (Bonus) Rank-pairing heap implementation
  6. A wrap up?

Holy $#!* THAT’s a Lisp macro!

Still struggling to understand what’s so special about Lisp, I came across a post linked on lobste.rs, innocuously titled “Hash Table Syntax in Common Lisp“. It had the arcane macro syntax I’d seen in some other examples and I was about to skim past it when I suddenly saw the expression {:key1 => "value1", :key2 => "value2"} being blithely evaluated by the REPL as if it was just any other Lisp expression. I sat there with my mouth slightly open.
 
Lisp, Common Lisp no exception, has a very basic syntax. Everything, and I mean everything, consists of function evaluations, and the evaluations all have the format (fun arg1 arg2 ...) and here was this expression full of braces and commas and odd operators being turned into a common lisp hash table.

There has to be some catch, I thought. It has to be invoked in a special way or something. Turns out, no. You can just paste the macro into the REPL and start using the curly braces notation.

I won’t claim that I “understand” Lisp. I won’t even claim I think Lisp is an amazing language – at this point I’m willing to be shown it is, but I still haven’t thought “Ah! I’m going to be using Lisp instead of Python (or C++) for my next project.” However, seeing this example I can attest that I had that religious feeling that some folks have claimed after studying Lisp. An Oh $#!@ moment, in today’s profaner terms.

And quite suddenly, I was taken back a decade and a half, to my first exposure to TeX and LaTeX. People have done crazy things using TeX, demonstrating that it is a programming language disguised as a typesetting system. The few times I dared to mess with TeX I was really modifying LaTeX macros to improve the style of some document I was working on – most probably procrastinating on my master’s thesis.

Both the effect of this Lisp macro, and it’s general appearance remind me of the terse, somewhat arcane alien, form of LaTeX macros and styles. Both change the syntax of the language you are using to make what you are doing more convenient.

I suppose this is not an accident. Both Lisp and TeX come from an era when people fiddling with computers really liked to get into the guts of everything, understand it and change it to their own personal taste.

Further posters on lobste.rs point out to me that this is a “reader-macro” which operates during parse time, rather than a regular macro, which operates during compile time. All in all, this was a very informative series of comments for me!

Sleep on it

I was pretty annoyed. I had installed a program into a docker container and had invoked this container via our cloud computing platform and it was garbling my output files – with the exact same inputs and parameters I had just tested it with on my local machine.

At first, I didn’t realize anything was wrong. The program completed without complaints, the log file indicated a certain size of result which corresponded to the result I obtained locally. However, when I fed the output file into another tool it failed. The tool had failed with an esoteric error message which is very non-informative (yay bioinformatics tools), but after seeing this message several times I’ve come to understand that it usually means the input is truncated.

Sure enough, when I downloaded the file from the platform onto my local machine and gunzipped it gunzip ran for a while before complaining of a truncated file.

Now, I had just come out of dealing with an annoying issue related to running docker containers on our platform. When the containers ran, they ran using a non-interactive non-login shell. This leads to subtle differences in how many things work, which will take up a different post, but basically I was in a foul mood already.

The code I actually wanted to run looked like this

my-program \
  input-file \
  >(gzip > outputfile1) \
  >(gzip > outputfile2)

It turns out that the non-interactive shell on the ubuntu base image I used for my tool image is dash, rather than bash, and dash doesn’t like this process substitution thing and complains sh: 1: Syntax error: "(" unexpected

To get around this expediently I wrapped my program in a bash script my-script.sh and invoked it on the platform as bash my-script.sh. This worked, but now my file was getting truncated.

Same code, same data, same docker image. Container executes correctly on my mac, corrupts my file on the AWS linux machine. I cross checked that I was closing my file in my code, I read up a bit about flushing (close does it) and fsync (named pipes don’t have it – they don’t touch disk) and the fact that Python will raise an exception if something goes wrong with the write.

After some more thrashing about like this I suddenly wondered about the multiple processes involved here. By doing process substitution, I had created two new processes – both involving gzip – that were sucking up data from pipes linking to the process my main program was running in. It was quite possible that the two gzip processes were finishing slightly after my main process was done. Now suppose, just suppose, the original bash shell that I was invoking to run my program was terminating as soon as my main program was done without waiting for the gzip processes? This would lead to this kind of file truncation.

I put in a wait in the shell script, but this did not make any difference. I wasn’t explicitly sending any processes to the background, so this was perhaps expected. Then I added a sleep 2s command to the script:

my-program \
  input-file \
  >(gzip > outputfile1) \
  >(gzip > outputfile2)

sleep 2s # <--- BINGO!

and indeed my files were no longer being truncated. So, what was happening was that the parent shell had no way of ‘realizing’ that the two gzip subprocesses were running and was exiting and the docker container was shutting down, killing my gzips just before they were done writing the last bit of data.

Ah, the joys. The joys.

Some notes on using Python multiprocessing

This is a short note describing a common code pattern useful for parallelizing some computations using the Python multiprocessing module.

Many problems are of the embarrassingly parallel type, where the task consists of the same set of computations done independently on a large set of input data. In many languages such problems are solved using multi-threaded code.  Due to events too complicated and embarrassing to mention Python doesn’t do too well with multiple threads, and we need to use multiple processes via the multiprocessing module instead.

A common pattern is to set up a number of workers and use queues to pass data back-and -forth. The pattern looks like this:

multiprocessing-queues.png

The use of this pattern is relatively straight forward but there are some things to pay attention to.

Sharing read only data

The general topic of sharing data between processes is worthy of deeper discussion, and there are some pointers in the Python documentation but there is an easy way to share read only data between processes – use globals that are written to in the parent process BEFORE the child processes are started.

If you declare a global variable and then modify it after the child processes are started, or if you modify the global from the child process you’ll enter into the twilight zone. There will be no errors and from within the process where you do the writing everything will look consistent, but the rest of the universe will not pay any attention to your writes.

Queues have overhead – the object has to pickled by the sender and then un-pickled by the receiver, and while the object is in the queue, it’s taking up memory. If you have a large amount of read-only data that needs to be shared amongst child processes – like a lookup table of primes, for example – this pattern of using globals can be pragmatic.

 

Explicitly set Queue size

On Mac OS X the default size of the queue is  32767 items. On Linux, the default size is … INFINITY. I learned this the hard way. I usually prototype my code on my mac, and then deploy to a linux container that runs god knows where on data that is several orders of magnitude larger than my prototype set. On Linux things SEEMED to run fine but the size of the result set was abnormally small compared to the input. After a lot of poking around I discovered this property of the Python implementation and what was happening was that in Linux my queues got larger and larger, taking up more and more RAM and the OS started to kill off the child processes. Setting an explicit queue size allows the processes to block when the queue becomes full, capping memory usage.

Wrap the child process function using traceback

Error messages from child processes are messy and uninformative. Importantly, the friendly python stack trace print out is missing. A fairly convenient pattern is to wrap the actual function call within a try-except block in thin outer function, using the traceback module, like so:

def func_w(args):
  """A thin wrapper to allow proper tracebacks when things go wrong in a thread

  :param args:
  :return:
  """
  import traceback
  try:
    return func(**args)
  except Exception as e:
    traceback.print_exc()
    print('')
    raise e

Check for normal termination from the child

One of the reasons it took me so long to track down the above mentioned memory error was that when the child processes was being killed by the OS no Python error was raised. The principled way to check for normal termination in is to look at the exitcode value of the process.

When not to use this pattern

Passing data via queues can be expensive in terms of computation and memory. If your individual workers have to pass a lot of data back and forth – say large submatrices of an even larger matrix – it may be neater and faster to use shared memory objects as supplied by the multiprocessing module which takes care of semaphores for you.