A run-in with the garbage men

When you write and run a computer program you are processing data. You load data, create more data, process it and so on. All this data takes up memory on your computer. Without any other action, the memory starts to look like a kid’s bedroom – toys, stale clothes, empty cans, magazines, and books everywhere. There are typically two ways programming languages deal with this clutter.

One is the “clean your room or I’ll blow up the house” method. I realize this is not what parents say, but this is what happens in languages like C++ where you – the programmer – are responsible for cleaning up after yourself. If you don’t, your program crashes and burns and exits with a nasty error message. If your operating system is not a grown-up operating system, it often freezes up your whole computer.

The other is automatic memory management, which is a bit like having a housekeeping service. Common Lisp is a garbage-collected language, which means that the runtime of the language takes care of the clutter for you. The service sends over a person who goes through your room and tidies up. As you can imagine, this is a lot easier for you, since now you don’t have to worry about all this low-level housekeeping stuff and can concentrate on the fun things in life/computer science.

So imagine my surprise when I ran the following bit of code and my house blew up (the program’s memory space, not my house literally, but you’ve caught on):

(defun random-string (length)
  (with-output-to-string (stream)
    (let ((*print-base* 36))
      (loop repeat length do (princ (+ (random 26) 10) stream)))))

(defun insert (count)
  (let ((ht (make-hash-table :test 'equal)))
    (loop
       repeat count
       do (setf (gethash (random-string 30) ht) 1))))

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (insert ins-count)))

(gc-test 20 1000000 nil)

I wasn’t actually doing this exact thing, but the principle was the same: I was inserting keys into a hash table which got larger and larger, and then I discarded the hash table. After a few cycles of this, BAM! my program crashed and SBCL printed out the delightfully campy:

Heap exhausted, game over.

Common Lisp’s handy (room) function prints out the memory usage of the program, and when I instrumented my program with it I saw this:

[Figure: memory usage per iteration, without forced garbage collection]

Now you can see that up to the 4th iteration housekeeping kind of keeps up, though there is this funny sawtooth pattern where housekeeping slacks off on the first pass, does a thorough job on the second, slacks off on the third, catches up on the fourth. But then the union goes on strike! Housekeeping refuses to show up, the dirty pizza boxes start to pile up, and then the floor collapses. Contrary to what your friends working in sales will tell you, not all graphs that rise exponentially on the right-hand side are good.

SBCL has a neat function

(sb-ext:gc)

that allows you to manually call for housekeeping. I modified my program to do this each time I discarded the hash table, and when I reran the program I saw this:

[Figure: memory usage per iteration, with forced garbage collection]

So, manually forcing the garbage collector to run does do the trick, but it made me a little disappointed with SBCL’s implementation. I suspect commercial Lisp compilers do a better job. If anyone has a guess as to why the GC fails in this case, please drop a line.

In case you are interested, the function, instrumented with (room) and with the garbage-collection forcing, looks like this:

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (progn
	  (insert ins-count)
	  (when nudge-gc (sb-ext:gc :full t))
	  (room nil))))

(gc-test 20 1000000 t)  ; pass T so the GC is nudged after every insert

Millions of tiny hypotheses

I think I recall the exact moment when I began to get a little scared of math. It was an algebra class and we were being taught tricks (tools) that could be used to solve different classes of problems. Except, I don’t recall the teacher ever explicitly saying this was a toolbox. I think what he did was go through a list of equations, performing some trick on each – like adding x to both sides, multiplying by y on both sides and so on – and then proceeding with algebraic simplification. This demoralized me, and started to make me less enthusiastic about actually doing mathematics, though I remained fascinated by it.

I also, I think, recall why I got demoralized. I watched him solve an equation with one of these tricks and sat there staring glumly at the board. I fully understood how to apply the technique; what I couldn’t figure out was how I would know to apply that technique to that problem as opposed to some other technique. What didn’t help was that I had classmates who seemed to breeze through it, knowing intuitively which equation required which tool.

Unfortunately I did not have enough self-realization at that time to go and ask for help over this, or to talk it over with my mother (who was a mathematician). I just decided I didn’t have the innate talent for mathematics. Fortunately this did not cripple me and I had no hesitation diving into topics like physics and then electrical engineering (which I probably chose because it was applied physics) which had enough interesting math, with cool applications.

Many years later a stray comment on some message board somewhere by someone stuck in my head. They said that the way they did mathematics was to form hypothesis after hypothesis, testing each one. If the hypothesis results in a falsehood, discard it and start again. Another comment, on a different message board by a different person, said that it had been explicitly stated to them that there were no magical solutions to mathematical problems, and it was a myth that there were people with innate mathematical skills and those without: you solved mathematical problems through gumption – you kept banging your head against them, trying different things, figuring it out as you went.

These two simple comments made a big impression on me. They suddenly freed me from the expectation that mathematical talent was innate. They raised the possibility that stubbornness – a trait my mother was fond of pointing out in me – was all you needed to solve mathematical problems. I was not worried about the top 99th percentile of mathematicians, who most probably had something special going on in their brains. I just wanted to enjoy math, learn more and more of it, and see if I could use it in daily life. These comments were immensely liberating. I had looked at the journey between the problem and the solution as mostly an embarrassment, as in the longer it took, the stupider you were. These comments turned the journey into part of the process, something to take joy in.

I just wish my math teachers had said things like this. I know that when I teach my daughter mathematics this is what I’m going to emphasize. It’s about creating millions of small hypotheses – magical worlds with crazy ideas – and then applying mathematical rules to figure out if the ideas contradicted each other, or if they fit. Just like the shape puzzles she loves to solve. Each shape fits in a particular slot. Sometimes, you can see right away that the shape will fit in a slot. Sometimes, you make a guess, if it works, good. If it doesn’t, ALSO GOOD. YOU LEARNED SOMETHING! Now try a different slot!

Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I couldn’t live without it. I find Python generators and list/dict comprehensions to be among the key things that make Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun notwithstanding, I was happy to learn that Common Lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting every K-character-long section (called a “kmer”). Bioinformatics-savvy readers will recognize the file as a FASTQ file and the K-character-long sections as k-mers from the reads in the FASTQ.

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create):

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) > kmer_size
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators, which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)

Update: My first run-through is now appended to the end of the main article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be a nested map-like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve made minor changes to the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to every second line of the FASTQ file (seq line)"
  (flet ((read-seq-line (in)
	   (prog2
	       (read-line in nil)
	       (read-line in nil)
	     (read-line in nil)
	     (read-line in nil))))
    (loop for seq = (read-seq-line in)
	 while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
   (loop
      for pos from 0
      for end = (+ pos kmer-size)
      while (<= end len)
      do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.
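
For comparison with the Python loop above, calling these functions looks something like this (the file name here is just a placeholder):

(with-open-file (in "reads.fastq")   ; "reads.fastq" is a made-up file name
  (map-to-kmer-in-fastq in 10
                        (lambda (kmer) (format t "~A~%" kmer))))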

Appendix: my first run-through, using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has got to be the first idiom I know from Python that I cannot replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
	(prog2  ; <-- prog1 and prog2 are nifty!
	    (read-line in nil)
	    (read-line in nil)  ; <-- this is the sequence line
	  (read-line in nil)
	  (read-line in nil)))))	;  http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
   while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

(defun kmer-gen (seq &key (kmer-size 30) (discard-N? nil))
  (declare (ignorable discard-N?))  ; option reserved for later, not used yet
  (let ((n 0) (n-max (- (length seq) kmer-size)))
    #'(lambda ()
        (prog1
            (if (<= n n-max)  ; <= so the final kmer is included, matching the Python version
                (subseq seq n (+ n kmer-size))
                nil)
          (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
   while seq do
     (let ((g (kmer-gen seq :kmer-size 10)))
       (loop for s = (funcall g)
	    while s do (format t "~A~%" s))))

It’s not terrible, since the code still looks compact and readable, but it’s not as nicely abstracted away as the nested Python generators.

(A complete non sequitur: Remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well, now I have this issue with prefix and infix notation. I keep trying to write (print kmer) and print(kmer) is starting to look plain weird …)

Common Lisp accessors

The Common Lisp setf macro is deceptively humble-looking. At the start I didn’t think much of it – except that it looked eccentric compared to the assignment syntax from other languages – but recently I’ve found out how powerful it is, though not quite as convenient as I would hope.

In most languages to perform a variable assignment one does something like

a = 22;

which places the value “22” into a place in memory which has been given the label “a” in the program.

In Common Lisp the corresponding syntax is

(setf a 22)

and the only thing weird about it is the everything-is-a-list pattern of Lisp.

I was blown away, though, when I saw code that looked like this

(setf x (list 1 2 3 4 5))    ; x -> (1 2 3 4 5)   (built with LIST so the cells are safe to modify)
(setf (rest x) (list 6 7 8)) ; x -> (1 6 7 8)

You can’t do this as transparently in C++ or Python. You can implement a function to do the equivalent thing, but you can’t assign a value to the return value of a function. This Lisp syntax makes writing code very compact and logical.
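
For reference, here are a couple of other standard places setf already knows how to store into (v and ht stand for a vector and a hash table you have already created):

(setf (aref v 0) 99)            ; the first element of the vector v
(setf (gethash :name ht) "x")   ; the value stored under the key :name in the hash table ht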

However, I stumbled when I tried to use this in a function I wrote. My function started at the root of a complete binary tree and then snaked recursively through it to find and return a particular node.

;; Use the binary representation trick to thread through the
;; tree to a given node. No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
(defun find-node (root index)
  (multiple-value-bind
   (n-idx rem) (floor index 2)
   (if (= n-idx 1)
     (find-node-pick-child root rem)
     (find-node-pick-child (find-node root n-idx) rem))))

(defun find-node-pick-child (nd rem)
  (if (= rem 0) (node-lchild nd) (node-rchild nd)))

When I tried to execute the statement (which I expected to attach new-node to the place returned by find-node)

(setf (find-node root 8) new-node)

I ran into compiler errors.

This led me on a very interesting chase all around the internet, with the phrase “generalized variables” turning up a lot. It turns out that (setf …) is a Lisp macro that can figure out how to get hold of the “place” its first argument refers to. If the first argument is just a variable, (setf …) acts just like the assignment statement in any other language. If the first argument is a function call, Lisp looks for a corresponding setter – either a setf expander or a function named (setf name) – and uses that to store into the place the call refers to; such a function is called an accessor function.
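
Boiled down to a toy example, the mechanism looks like this – the accessor name middle is something I made up for illustration, not a standard function:

(defun middle (lst)
  "Return the middle element of a list."
  (nth (floor (length lst) 2) lst))

;; Defining a function named (setf middle) teaches setf how to store into
;; the place that (middle lst) refers to. By convention it returns the new value.
(defun (setf middle) (new-value lst)
  (setf (nth (floor (length lst) 2) lst) new-value))

;; (defparameter xs (list 1 2 3 4 5))
;; (setf (middle xs) :m)   ; xs -> (1 2 :M 4 5)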

Following advice from answers on comp.lang.lisp, I came up with the following code:

;; Use the binary representation trick to thread through the
;; tree to return a tree node and some auxiliary information
;; No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
; Return value is a pair
; node - a node of the tree
; info - :me - this is the node we're looking for
;        :lchild - we want the left child
;        :rchild - we want the right child
(defun find-node-with-index (root index)
  (multiple-value-bind
   (new-idx rem) (floor index 2)
   (cond
     ((= new-idx 0) (values root :me))
     ((= new-idx 1) (values root (if (= rem 0) :lchild :rchild)))
     (t (find-node-with-index
         (if (= rem 0) (node-lchild root) (node-rchild root))
         new-idx)))))

(defun node-with-index (bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (bheap-root bh))  ; Special case - root of heap
     (:lchild (node-lchild nd))
     (:rchild (node-rchild nd)))))

;; The accessor or setter function. Note the "setf"
(defun (setf node-with-index) (new-node bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (setf (bheap-root bh) new-node))  ; Special case - root of heap
     (:lchild (setf (node-lchild nd) new-node))
     (:rchild (setf (node-rchild nd) new-node)))))

I like the DRY principle, primarily because it reduces bugs and the total amount of code. I tried to abstract as much of the core functionality into a single function as I could, but the getter and setter still look awfully duplicative, and I don’t think I can reduce them further.

Thanks go to Kaz Kylheku and Barry Margolin on comp.lang.lisp for answering my questions on this topic and to Rainer Joswig for corrections to this post.

Links-Rechts, Links-Rechts, Ahnentafel!

While looking into priority queues and binary heaps, I ran across two gems, one of which may not be so widely known. Consider a complete binary tree with nodes sequentially numbered:

[Figure: a complete binary tree with the nodes numbered sequentially in level order]

You will observe that the parent of every node is its own number divided by two (integer division). For example, the parent of 11 is 11/2 = 5 (integer division), the parent of 14 is 14/2 = 7, and so on. Way back in 1590 a genealogist used this numbering pattern to organize people’s ancestries, calling it an Ahnentafel. From the article:

The subject (proband or progenitor) of the ahnentafel is listed as No. 1, the subject’s father as No. 2 and the mother as No. 3, the paternal grandparents as No. 4 and No. 5 and the maternal grandparents as No. 6 and No. 7, and so on, back through the generations. Apart from No. 1, who can be male or female, all even-numbered persons are male, and all odd-numbered persons are female. In this schema, the number of any person’s father is double the person’s number, and a person’s mother is double the person’s number plus one. Using this definition of numeration, one can derive some basic information about individuals who are listed without additional research.

This construct displays a person’s genealogy compactly, without the need for a diagram such as a family tree. It is particularly useful in situations where one may be restricted to presenting a genealogy in plain text, for example, in e-mails or newsgroup articles. In effect, an ahnentafel is a method for storing a binary tree in an array by listing the nodes (individuals) in level-order (in generation order).

Now, suppose we take the exact same complete binary tree and write the sequential numbering in binary:

[Figure: the same complete binary tree with the node numbers written in binary]

Do you notice a striking pattern? The children of any given node are basically the parent’s binary representation shifted left by one, with a 0 or 1 in the new last bit depending upon whether they are the left or right child. So, for example, the children of 101 are 1010 and 1011. This is the basis for the following neat algorithm (which I first came across here) for finding a given node in this tree.

To find the path from the root (here node #1) to some other node, say 11, express the target node number in binary (1011) and discard the first digit, which is always 1, leaving (011). Each digit now successively indicates the child to pick (0 = left, 1 = right) as you traverse the path. So, to get to node #11 you would take (011): left, right, right. (I read this twice before realizing the choice of eleven (11) is peculiarly bad in this context. Here’s the example using 12: 12 => 1100 => 100 => right, left, left.)
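
That trick fits in a few lines of Lisp – a sketch using the standard integer-length and logbitp functions; path-to-node is my own name for it:

(defun path-to-node (index)
  "Return the root-to-node path as a list of :left / :right turns,
read off the binary digits of INDEX after the leading 1."
  (loop for pos from (- (integer-length index) 2) downto 0
        collect (if (logbitp pos index) :right :left)))

;; (path-to-node 12)  ; -> (:RIGHT :LEFT :LEFT)
;; (path-to-node 11)  ; -> (:LEFT :RIGHT :RIGHT)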

This algorithm is useful when using a pointer-based representation and, for example, inserting into or deleting from a binary heap. This pattern, as you have guessed, comes from the number representation and the tree both being binary.

PS. Pardon my terrible title. The Ahnentafel put “German” in my head, traversing the tree put “left, right” in my head, and watching too many World War II movies in childhood did the rest.

PPS. Python code to generate the trees, not that you need it.

from matplotlib import pyplot as plt

def node_xy(k, n, dx):
  n0, n1 = 2**k, 2**(k+1)-1
  center = (n0 + n1) / 2.0
  return (n - center) * dx / n0, 0.5 - k / 10.0

def plot(ax, max_k):
  dx = 1.0
  for k in range(max_k):
    for n in range(2**k, 2**(k+1)):
      x, y = node_xy(k, n, dx)
      # ax.text(x, y, str(n), ha='center', va='bottom')
      ax.text(x, y, bin(n)[2:], ha='center', va='bottom')
      if k:
        px, py = node_xy(k - 1, int(n/2), dx)
        plt.plot([px, x], [py, y + 0.025], 'k')

fig = plt.figure()
ax = plt.subplot(111)
plot(ax, 5)
plt.setp(ax, xlim=[-0.5, 0.5], ylim=[0.0, 0.6])
plt.axis('off')
plt.show()

Problem 11

I’m using problems in The Advent of Code 2016 to motivate my study of Common Lisp though, sadly, I’ve not been devoting time to it recently. Luckily several of my colleagues are also doing the challenge, and one of them came by and described what sounded like a particularly interesting problem – number 11 – and I decided to try my hand at it in Lisp. It promises to be engaging and instructive.

I’m not going to repeat the problem here (but some of the stuff I’ll write will only make sense in the context of the details of the problem); in the abstract it is a path-finding problem: the problem world is in some initial state, there is a set of intricate rules by which you can change the world state, and there is an end goal state. The task is to find the smallest number of steps to that end goal.

For me this is a rich problem both in terms of algorithms and language. I think I’ll try and make short posts about the steps I take. My course of action will be (and these may change, of course):

  1. Figure out what algorithms apply to this problem
  2. Think of data structures that will help
  3. Implement this in Common Lisp, from scratch where it looks interesting
    1. I’ve had at least one case so far where I had a naive way of coding things and then I discovered a different Lisp idiom that made it better. I hope to show these examples too.

There are several tutorials and expositions on path finding. The one I like the best is from Red Blob Games. It has interactive visualizations of the algorithms, which are very motivating, and the tutorial itself starts slow and easy and makes nice progress, ending at the A* algorithm. If you want a refresher, please refer to that. If you’ve read Problem 11 you’ll note that it looks different, but with the proper abstraction, this is what it is.

Now let’s take up where the tutorial leaves off. The A* algorithm seems to be the ticket and we want to figure out some implementation details. The interesting things the A* algorithm requires are a priority queue and a heuristic for determining how far a given state is from the end state.

Popular lore has it that a Fibonacci heap is a good implementation of the priority queue for this application, but The Algorithm Design Manual points us to rank-pairing heaps as something that is better in practice. I found this paper on rank-pairing heaps. The authors have written the paper very well: they start with a nice systematic introduction to heaps and then slowly introduce the modifications they make to existing structures and algorithms.

To start with, however, I’m going to implement a priority queue using a simple binary heap (there is a good illustrated guide here). At a later stage, when the whole thing is worked out, I’m going to come back and implement rank-pairing heaps.
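
To make the plan concrete, here is a minimal sketch of the kind of array-backed binary min-heap I have in mind – the names (make-heap, heap-push, heap-pop) and the (priority . item) representation are placeholders of my own, not from any library, and the real implementation may well look different:

(defun make-heap ()
  "An adjustable vector of (priority . item) conses, kept in min-heap order."
  (make-array 0 :adjustable t :fill-pointer t))

(defun heap-push (heap priority item)
  "Add ITEM with PRIORITY, then sift it up until its parent is no larger."
  (vector-push-extend (cons priority item) heap)
  (let ((i (1- (fill-pointer heap))))
    (loop while (plusp i)
          do (let ((parent (floor (1- i) 2)))
               (when (<= (car (aref heap parent)) (car (aref heap i)))
                 (return))
               (rotatef (aref heap i) (aref heap parent))
               (setf i parent))))
  heap)

(defun heap-pop (heap)
  "Remove and return the item with the smallest priority (heap must be non-empty)."
  (let* ((top (aref heap 0))
         (moved (vector-pop heap))
         (n (fill-pointer heap)))
    (when (plusp n)
      (setf (aref heap 0) moved)          ; move the last entry to the root ...
      (let ((i 0))
        (loop                             ; ... and sift it down
          (let ((l (+ (* 2 i) 1))
                (r (+ (* 2 i) 2))
                (smallest i))
            (when (and (< l n) (< (car (aref heap l)) (car (aref heap smallest))))
              (setf smallest l))
            (when (and (< r n) (< (car (aref heap r)) (car (aref heap smallest))))
              (setf smallest r))
            (when (= smallest i) (return))
            (rotatef (aref heap i) (aref heap smallest))
            (setf i smallest)))))
    (cdr top)))

;; (defparameter pq (make-heap))
;; (heap-push pq 3 :c) (heap-push pq 1 :a) (heap-push pq 2 :b)
;; (heap-pop pq)  ; -> :A

(Note the array is zero-indexed, so the children of slot i sit at 2i+1 and 2i+2 rather than the 2i and 2i+1 of the one-based Ahnentafel numbering above.)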

The data structure for holding the world state can be separated out from all this priority-queue business, but it does interface with the A* algorithm in two places: 1) the world state has to be amenable to hashing, so that we can put it in a map-like structure, and 2) the implementation of the rules for going from state to state – i.e. the manner in which we find the child nodes of a given parent node – can, we intuit, be made considerably easier or harder by our choice of world-state representation.
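
As a tiny illustration of requirement (1), an EQUAL or EQUALP hash table can key directly on a list used as the world state – the state representation here is made up purely for the example:

(defparameter *seen* (make-hash-table :test 'equalp))
(setf (gethash '(:elevator 1 :floors ((:a :b) () () ())) *seen*) 0)   ; made-up state
(gethash '(:elevator 1 :floors ((:a :b) () () ())) *seen*)            ; -> 0, T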

This whole thing is going to go down as a series of blog posts and I’ll link to each one (creating a linked list) but also try and set them out here, so that this post serves as a table of contents.

  1. Binary heap priority queue
    1. Learned about accessor functions
    2. And print-object which is what Python’s __repr__ is modeled on.
    3. and a diversion
  2. A* search framework
  3. Data structure and functions to handle world state
  4. Completion of problem 11
  5. (Bonus) Rank-pairing heap implementation
  6. A wrap up?

Holy $#!* THAT’s a Lisp macro!

Still struggling to understand what’s so special about Lisp, I came across a post linked on lobste.rs, innocuously titled “Hash Table Syntax in Common Lisp”. It had the arcane macro syntax I’d seen in some other examples and I was about to skim past it when I suddenly saw the expression {:key1 => "value1", :key2 => "value2"} being blithely evaluated by the REPL as if it were just any other Lisp expression. I sat there with my mouth slightly open.
 
Lisp, Common Lisp being no exception, has a very basic syntax. Everything, and I mean everything, consists of function evaluations, and the evaluations all have the format (fun arg1 arg2 ...) – and here was this expression full of braces and commas and odd operators being turned into a Common Lisp hash table.

There has to be some catch, I thought. It has to be invoked in a special way or something. Turns out, no. You can just paste the macro into the REPL and start using the curly braces notation.

I won’t claim that I “understand” Lisp. I won’t even claim I think Lisp is an amazing language – at this point I’m willing to be shown it is, but I still haven’t thought “Ah! I’m going to be using Lisp instead of Python (or C++) for my next project.” However, seeing this example I can attest that I had that religious feeling that some folks have claimed after studying Lisp. An Oh $#!@ moment, in today’s profaner terms.

And quite suddenly, I was taken back a decade and a half, to my first exposure to TeX and LaTeX. People have done crazy things using TeX, demonstrating that it is a programming language disguised as a typesetting system. The few times I dared to mess with TeX I was really modifying LaTeX macros to improve the style of some document I was working on – most probably procrastinating on my master’s thesis.

Both the effect of this Lisp macro and its general appearance remind me of the terse, somewhat arcane, alien form of LaTeX macros and styles. Both change the syntax of the language you are using to make what you are doing more convenient.

I suppose this is not an accident. Both Lisp and TeX come from an era when people fiddling with computers really liked to get into the guts of everything, understand it and change it to their own personal taste.

Other posters on lobste.rs pointed out to me that this is a “reader macro”, which operates at read (parse) time, rather than a regular macro, which operates at compile time. All in all, this was a very informative series of comments for me!
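
To give a flavor of the mechanism, here is a much simpler cousin of the macro from that post (not the original code): it skips the commas and the => and just reads alternating keys and values between the braces.

;; Make } behave like ) so READ-DELIMITED-LIST can use it as a terminator.
(set-macro-character #\} (get-macro-character #\)))

;; When the reader hits {, read everything up to the matching } and pair the
;; items up into a fresh hash table. The table is built at read time.
(set-macro-character #\{
  (lambda (stream char)
    (declare (ignore char))
    (let ((table (make-hash-table :test 'equal)))
      (loop for (key value) on (read-delimited-list #\} stream t) by #'cddr
            do (setf (gethash key table) value))
      table)))

;; After pasting this into the REPL:
;; (gethash :key2 {:key1 "value1" :key2 "value2"})   ; -> "value2"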