A run-in with the garbage men

When you write and run a computer program you are processing data. You load data, create more data, process it and so on. All this data takes up memory on your computer. Without any other action the memory starts to look like a kid’s bedroom – all toys, and stale clothes, and empty cans and magazines and books everywhere. There are typically two ways programming languages deal with this clutter.

One is the “clean your room or I’ll blow up the house” method. I realize this is not what parents say, but this is what happens in languages like C++ where you – the programmer – are responsible for cleaning up after yourself. If you don’t your program crashes and burns and exits out with a nasty error message. If your operating system is not a grown up operating system, it often freezes up your whole computer.

The other is automatic memory management, which is a bit like having a housekeeping service. Common lisp is a garbage collected language. Which means that the runtime of the language takes care of the clutter for you. The service sends over a person who goes through your room and tidies up. As you can imagine, this is a lot easier for you, since now you don’t have to worry about all this low level housekeeping stuff and can concentrate on the fun things in life/computer science.

So imagine my surprise when I ran the following bit of code and my house blew up (program memory space, not my house literally, but you’ve caught on)

(defun random-string (length)
  (with-output-to-string (stream)
    (let ((*print-base* 36))
      (loop repeat length do (princ (+ (random 26) 10) stream)))))

(defun insert (count)
  (let ((ht (make-hash-table :test 'equal)))
    (loop
       repeat count
       do (setf (gethash (random-string 30) ht) 1))))

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (insert ins-count)))

(gc-test 20 1000000 nil)

I wasn’t actually doing this exact thing, but the principle was the same: I was inserting keys into a hash table which got larger and larger, and then I discarded the hash table. After a few cycles of this BAM! my program crashed and SBCL printed out the delightfully campy:

Heap exhausted, game over.

Common Lisp’s handy (room) function prints out the memory usage of the program, and when I instrumented my program with it I saw this:

no-forcing

Now you can see that up to the 4th iteration housekeeping kind of keeps up, though there is this funny sawtooth pattern where housekeeping slacks off on the first pass, does a thorough job on the second, slacks off on the third, catches up on the fourth. But then, the union goes on strike! Housekeeping refuses to show up and the dirty pizza boxes start to pileup and then the floor collapses. Contrary to what your friends working in sales will tell you, not all graphs that rise exponentially on the right hand side are good.

SBCL has a neat function

(sb-ext:gc)

that allows you to manually call for housekeeping, and I modified my program to do this each time I discarded the hash-table, and when I reran the program I saw this:

with-forcing

So, manually forcing the garbage collector to run does do the trick but it made me a little disappointed with SBCL’s implementation. I suspect commercial Lisp compilers do a better job. If any one has a guess as to why the GC fails in this case, please drop a line.

In case you are interested, the function, instrumented with (room) and with the garbage collection forcing looks like this

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (progn
	  (insert ins-count)
	  (when nudge-gc (sb-ext:gc :full t))
	  (room nil))))

(gc-test 20 1000000 nil)

Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I could’t live without it. I find Python generators and list/dict comprehensions one of the key things to making Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun not-withstanding, I was happy to learn that common lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting each K character long section (called a “kmer”) (bioinformatics savvy readers will recognize the file as a FASTQ file, and the K character long sections as k-mers from the reads in the FASTQ.

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create)

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) > kmer_size
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)

Update: My first run through is now appended to the end of the main article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be to use a nested map like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve minorly changed the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to every second line of the FASTQ file (seq line)"
  (flet ((read-seq-line (in)
	   (prog2
	       (read-line in nil)
	       (read-line in nil)
	     (read-line in nil)
	     (read-line in nil))))
    (loop for seq = (read-seq-line in)
	 while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
   (loop
      for pos from 0
      for end = (+ pos kmer-size)
      while (<= end len)
      do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.

Appendix: my first run through using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has got to be the first idiom I know from Python that I can not replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
	(prog2  ; <-- prog1 and prog2 are nifty!
	    (read-line in nil)
	    (read-line in nil)  ; <-- this is the sequence line
	  (read-line in nil)
	  (read-line in nil)))))	;  http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
   while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

       
(defun kmer-gen (seq &key (kmer-size 30) (discard-N? nil))
  (let ((n 0) (n-max (- (length seq) kmer-size)))
   #'(lambda ()
       (prog1
	   (if (< n n-max)
	       (subseq seq n (+ n kmer-size))
	       nil)
	 (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
   while seq do
     (let ((g (kmer-gen seq :kmer-size 10)))
       (loop for s = (funcall g)
	    while s do (format t "~A~%" s))))

It’s not terrible as the code still looks compact and readable but it’s not as nicely abstracted away as the nested Python generator.

(A complete non sequitor: Remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well now I have this issue with prefix and infix notation. I keep trying to write (print kmer) and print(kmer) is starting to look plain weird …)

Common Lisp accessors

The Common Lisp setf function is deceptively humble looking. At the start I didn’t think much of it – except that it looked eccentric compared to the assignment syntax from other languages – but recently I’ve found out how powerful it is, though not quite as convenient as I would hope.

In most languages to perform a variable assignment one does something like

a = 22;

which places the value “22” into a place in memory which has been given the label “a” in the program.

In Common Lisp the corresponding syntax is

(setf a 22)

and the only thing weird about it is the everything-is-a-list pattern of Lisp.

I was blown away, though, when I saw code that looked like this

(setf x `(1 2 3 4 5))  ; x -> (1 2 3 4 5)
(setf (rest x) `(6 7 8))  ; x -> (1 6 7 8)

You can’t do this as transparently in C++ or Python. You can implement a function to do the equivalent thing, but you can’t assign a value to the return value of a function. This Lisp syntax makes writing code very compact and logical.

However, I stumbled when I tried to use this in a function I wrote. My function started at the root of a complete binary tree and then snaked recursively through it to find and return a particular node.

;; Use the binary representation trick to thread through the
;; tree to a given node. No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
(defun find-node (root index)
  (multiple-value-bind
   (n-idx rem) (floor index 2)
   (if (= n-idx 1)
     (find-node-pick-child root rem)
     (find-node-pick-child (find-node root n-idx) rem))))

(defun find-node-pick-child (nd rem)
  (if (= rem 0) (node-lchild nd) (node-rchild nd)))

When I tried to execute the statement (which I expected to attach new-node to the place returned by find-node)

(setf (find-node root 8) new-node)

I ran into compiler errors.

This lead me on a very interesting chase all round the internet with the word “generalized variables” turning up a lot. It turns out that (setf …) is a lisp macro that can figure out how to get hold of the “place” its first argument refers to. If the first argument is just a variable, (setf …) acts just like the assignment statement in any other language. If the first argument is a function, Lisp will look for a setter form for that function and setf on the place returned by that function (which is called an accessor function).

Following advice from answers on comp.lang.lisp, I came up with the following code:

;; Use the binary representation trick to thread through the
;; tree to return an tree node and some auxiliary information
;; No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
; Return value is a pair
; node - a node of the tree
; info - :me - this is the node we're looking for
;        :lchild - we want the left child
;        :rchild - we want the right child
(defun find-node-with-index (root index)
  (multiple-value-bind
   (new-idx rem) (floor index 2)
   (cond
     ((= new-idx 0) (values root :me))
     ((= new-idx 1) (values root (if (= rem 0) :lchild :rchild)))
     (t (find-node-with-index
         (if (= rem 0) (node-lchild root) (node-rchild root))
         new-idx)))))

(defun node-with-index (bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (bheap-root bh))  ; Special case - root of heap
     (:lchild (node-lchild nd))
     (:rchild (node-rchild nd)))))

;; The accessor or setter function. Note the "setf"
(defun (setf node-with-index) (new-node bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (setf (bheap-root bh) new-node))  ; Special case - root of heap
     (:lchild (setf (node-lchild nd) new-node))
     (:rchild (setf (node-rchild nd) new-node)))))

I like the DRY principle primarily because it reduces bugs and the total amount of code. I tried to abstract as much of the core functionality into a single function as I could, but the getter and setter functions look so duplicate, but I don’t think I can reduce them further.

Thanks go to Kaz Kylheku and Barry Margolin on comp.lang.lisp for answering my questions on this topic and to Rainer Joswig for corrections to this post.

Problem 11

I’m using problems in The Advent of Code 2016 to motivate my study of Common Lisp though, sadly, I’ve not been devoting time to it recently. Luckily several of my colleagues are also doing the challenge and one of them came by and described what sounded like a particularly interesting problem – number 11 – , and I decided to try my hand at it in Lisp. It promises to be engaging and instructive.

I’m not going to repeat the problem here (but some of the stuff I’ll write will only make sense in the context of the details of the problem); in the abstract it is a path finding problem: The problem world is in some initial state, there are set of intricate rules by which you can change the world state, and there is an end goal state. The task is to find the shortest number of steps to that end goal.

For me this is a rich problem both in terms of algorithms and language. I think I’ll try and make short posts about the steps I take. My course of action will be (and these may change, of course):

  1. Figure out what algorithms apply to this problem
  2. Think of data structures that will help
  3. Implement this in Common Lisp, from scratch where it looks interesting
    1. I’ve had at least one case so far where I had a naive way of coding things and then I discovered a different Lisp idiom that made it better. I hope to show these examples too.

There are several tutorials and expositions on path finding. The one I like the best is from Red Blob Games. It has interactive visualizations of the algorithms which are very motivating and the tutorial itself starts slow and easy and makes nice progress to end at the A* algorithm. If you want a refresher, please refer to that. If you’ve read Problem 11 you’ll note that it looks different, but with the proper abstraction, this is what it is.

Now let’s take up where the tutorial leaves off. The A* algorithm seems to be the ticket and we want to figure out some implementation details. The interesting things the A* algorithm requires are a priority queue and a heuristic for determining how far a given state is from the end state.

Popular lore has it that, for this application, a Fibonacci heap is a good implementation for the priority queue for this application, but The algorithm design manual points us to rank-paired heaps as something that is better in practice. I found this paper on rank-pairing heaps. The authors have written the paper very well: They start with a nice systematic introduction to heaps and then slowly introduce modifications they make to existing structures and algorithms.

To start with however, I’m going to implement a priority queue using a simple binary heap (there is a good illustrated guide here). At a later stage, when the whole thing is worked out, I’m going to come back and implement rank-pairing heaps.

The data structure for holding the world state can be separated out from all this priority queue business, but it does interface with the A* algorithm in two places 1) The world state has to be amenable to hashing, such that we can put it in a map-like structure and 2) The implementation of rules for going from state to state i.e. the manner in which we find child nodes of a given parent node can, we intuit, be made considerably easy or difficult based on our choice of world state representation.

This whole thing is going to go down as a series of blog posts and I’ll link to each one (creating a linked list) but also try and set them out here, so that this post serves as a table of contents.

  1. Binary heap priority queue
    1. Learned about accessor functions
    2. And print-object which is what Python’s __repr__ is modeled on.
    3. and a diversion
  2. A* search framework
  3. Data structure and functions to handle world state
  4. Completion of problem 11
  5. (Bonus) Rank-pairing heap implementation
  6. A wrap up?

Holy $#!* THAT’s a Lisp macro!

Still struggling to understand what’s so special about Lisp, I came across a post linked on lobste.rs, innocuously titled “Hash Table Syntax in Common Lisp“. It had the arcane macro syntax I’d seen in some other examples and I was about to skim past it when I suddenly saw the expression {:key1 => "value1", :key2 => "value2"} being blithely evaluated by the REPL as if it was just any other Lisp expression. I sat there with my mouth slightly open.
 
Lisp, Common Lisp no exception, has a very basic syntax. Everything, and I mean everything, consists of function evaluations, and the evaluations all have the format (fun arg1 arg2 ...) and here was this expression full of braces and commas and odd operators being turned into a common lisp hash table.

There has to be some catch, I thought. It has to be invoked in a special way or something. Turns out, no. You can just paste the macro into the REPL and start using the curly braces notation.

I won’t claim that I “understand” Lisp. I won’t even claim I think Lisp is an amazing language – at this point I’m willing to be shown it is, but I still haven’t thought “Ah! I’m going to be using Lisp instead of Python (or C++) for my next project.” However, seeing this example I can attest that I had that religious feeling that some folks have claimed after studying Lisp. An Oh $#!@ moment, in today’s profaner terms.

And quite suddenly, I was taken back a decade and a half, to my first exposure to TeX and LaTeX. People have done crazy things using TeX, demonstrating that it is a programming language disguised as a typesetting system. The few times I dared to mess with TeX I was really modifying LaTeX macros to improve the style of some document I was working on – most probably procrastinating on my master’s thesis.

Both the effect of this Lisp macro, and it’s general appearance remind me of the terse, somewhat arcane alien, form of LaTeX macros and styles. Both change the syntax of the language you are using to make what you are doing more convenient.

I suppose this is not an accident. Both Lisp and TeX come from an era when people fiddling with computers really liked to get into the guts of everything, understand it and change it to their own personal taste.

Further posters on lobste.rs point out to me that this is a “reader-macro” which operates during parse time, rather than a regular macro, which operates during compile time. All in all, this was a very informative series of comments for me!

(if (too (many parentheses)) (use parinfer) ‘())

When I first started learning Common Lisp I was a little annoyed by the syntax. After practicing writing Lisp code I have grown to like the simplicity of the basic language. However, like C/C++, even though you don’t have to format the code (i.e. add whitespace) it helps readability immensely if you do. In such cases a decently smart editor helps a lot, and the parinfer plugin is really, really neat.

For me, formatting Lisp code like it was Python improves it’s readability by a huge amount. Syntax highlighting, for Lisp, can’t do so much since Lisp syntax is minimal, but indenting things like meaty if statements, nested function calls and lengthy ‘cond’ constructs makes things easy to follow.

Some editor features, for example paredit, tailored to Lisps, are very helpful but, as the plugin manual itself notes:

ParEdit helps **keep parentheses balanced** and adds many keys for moving S-expressions and moving around in S-expressions. Its behavior can be jarring for those who may want transient periods of unbalanced parentheses, such as when typing parentheses directly or commenting out code line by line.

Jarring is right. I use Atom and had paredit running and had a session once where I thought my keyboard was broken because I couldn’t type parens. And not being able to type parens is pretty traumatic when learning Lisp.

Parinfer brings a very useful concept to the domain of Lisp editing. Like a most editor assists, it will close parentheses for you when you open them, and place the cursor in the right place. However, it does a bit of sophisticated parsing to handle the following condition:

(defun func (x)
  (setf x (+ x 3))
  (if (< x 3) t nil))
(defun func (x)
  (setf x (+ x 3)))      <--- note we now need a close parens here
  ; (if (< x 3) t nil))  <--- parinfer automatically generates it
                         <--- when we comment out the last line

(As a side-note, WordPress’ syntax highlighting for the tag does not have Lisp as an option, but I got the desired effect by claiming this was Python!)

Parinfer does take a little getting used to if you’ve been managing parens and indentation by yourself. I had to learn in which part of my S-exp to hit the “return” key to get the desired effect – the two outcomes being advancing to a new level of indentation (nesting) under the parent expression or dropping out to a new, separate expression