Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I could’t live without it. I find Python generators and list/dict comprehensions one of the key things to making Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun not-withstanding, I was happy to learn that common lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting each K character long section (called a “kmer”) (bioinformatics savvy readers will recognize the file as a FASTQ file, and the K character long sections as k-mers from the reads in the FASTQ.

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create)

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) > kmer_size
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)

Update: My first run through is now appended to the end of the main article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be to use a nested map like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve minorly changed the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to every second line of the FASTQ file (seq line)"
  (flet ((read-seq-line (in)
	   (prog2
	       (read-line in nil)
	       (read-line in nil)
	     (read-line in nil)
	     (read-line in nil))))
    (loop for seq = (read-seq-line in)
	 while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
   (loop
      for pos from 0
      for end = (+ pos kmer-size)
      while (<= end len)
      do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.

Appendix: my first run through using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has got to be the first idiom I know from Python that I can not replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
	(prog2  ; <-- prog1 and prog2 are nifty!
	    (read-line in nil)
	    (read-line in nil)  ; <-- this is the sequence line
	  (read-line in nil)
	  (read-line in nil)))))	;  http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
   while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

       
(defun kmer-gen (seq &key (kmer-size 30) (discard-N? nil))
  (let ((n 0) (n-max (- (length seq) kmer-size)))
   #'(lambda ()
       (prog1
	   (if (< n n-max)
	       (subseq seq n (+ n kmer-size))
	       nil)
	 (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
   while seq do
     (let ((g (kmer-gen seq :kmer-size 10)))
       (loop for s = (funcall g)
	    while s do (format t "~A~%" s))))

It’s not terrible as the code still looks compact and readable but it’s not as nicely abstracted away as the nested Python generator.

(A complete non sequitor: Remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well now I have this issue with prefix and infix notation. I keep trying to write (print kmer) and print(kmer) is starting to look plain weird …)

Common Lisp accessors

The Common Lisp setf function is deceptively humble looking. At the start I didn’t think much of it – except that it looked eccentric compared to the assignment syntax from other languages – but recently I’ve found out how powerful it is, though not quite as convenient as I would hope.

In most languages to perform a variable assignment one does something like

a = 22;

which places the value “22” into a place in memory which has been given the label “a” in the program.

In Common Lisp the corresponding syntax is

(setf a 22)

and the only thing weird about it is the everything-is-a-list pattern of Lisp.

I was blown away, though, when I saw code that looked like this

(setf x `(1 2 3 4 5))  ; x -> (1 2 3 4 5)
(setf (rest x) `(6 7 8))  ; x -> (1 6 7 8)

You can’t do this as transparently in C++ or Python. You can implement a function to do the equivalent thing, but you can’t assign a value to the return value of a function. This Lisp syntax makes writing code very compact and logical.

However, I stumbled when I tried to use this in a function I wrote. My function started at the root of a complete binary tree and then snaked recursively through it to find and return a particular node.

;; Use the binary representation trick to thread through the
;; tree to a given node. No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
(defun find-node (root index)
  (multiple-value-bind
   (n-idx rem) (floor index 2)
   (if (= n-idx 1)
     (find-node-pick-child root rem)
     (find-node-pick-child (find-node root n-idx) rem))))

(defun find-node-pick-child (nd rem)
  (if (= rem 0) (node-lchild nd) (node-rchild nd)))

When I tried to execute the statement (which I expected to attach new-node to the place returned by find-node)

(setf (find-node root 8) new-node)

I ran into compiler errors.

This lead me on a very interesting chase all round the internet with the word “generalized variables” turning up a lot. It turns out that (setf …) is a lisp macro that can figure out how to get hold of the “place” its first argument refers to. If the first argument is just a variable, (setf …) acts just like the assignment statement in any other language. If the first argument is a function, Lisp will look for a setter form for that function and setf on the place returned by that function (which is called an accessor function).

Following advice from answers on comp.lang.lisp, I came up with the following code:

;; Use the binary representation trick to thread through the
;; tree to return an tree node and some auxiliary information
;; No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
; Return value is a pair
; node - a node of the tree
; info - :me - this is the node we're looking for
;        :lchild - we want the left child
;        :rchild - we want the right child
(defun find-node-with-index (root index)
  (multiple-value-bind
   (new-idx rem) (floor index 2)
   (cond
     ((= new-idx 0) (values root :me))
     ((= new-idx 1) (values root (if (= rem 0) :lchild :rchild)))
     (t (find-node-with-index
         (if (= rem 0) (node-lchild root) (node-rchild root))
         new-idx)))))

(defun node-with-index (bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (bheap-root bh))  ; Special case - root of heap
     (:lchild (node-lchild nd))
     (:rchild (node-rchild nd)))))

;; The accessor or setter function. Note the "setf"
(defun (setf node-with-index) (new-node bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (setf (bheap-root bh) new-node))  ; Special case - root of heap
     (:lchild (setf (node-lchild nd) new-node))
     (:rchild (setf (node-rchild nd) new-node)))))

I like the DRY principle primarily because it reduces bugs and the total amount of code. I tried to abstract as much of the core functionality into a single function as I could, but the getter and setter functions look so duplicate, but I don’t think I can reduce them further.

Thanks go to Kaz Kylheku and Barry Margolin on comp.lang.lisp for answering my questions on this topic and to Rainer Joswig for corrections to this post.