Development environment

While it can be an avenue for procrastination, searching for a proper development environment often pays off with greatly improved productivity in the long term. I had some finicky requirements. I wanted to maintain my own CMakeLists file so that the project was IDE/editor/platform independent. I wanted to be able to just add files to the project from anywhere – from VI, by OS copy, by git pull – (I set wildcards in the CMakeLists.txt file) – and definitely not have to fiddle in the IDE to add files. I wanted to edit markdown files which I was using for documentation, and placing in a separate docs folder. I wanted to see a decent preview of the markdown occassionally. Ideally, my environment would simply track my directory organization, including non C++ files. However, I also wanted intelligent code completion, including for my own code.

At work, where a ton of my current stuff is in Python, I happily use Pycharm which has been a very worthwhile investment. So CLion was a natural candidate to start with. I didn’t get very far, because I balked at the license terms. It felt a little annoying to have to apply for an open source license and then renew every year and so on. When I want to play with red-tape I talk to the government.

I completely understand needing to charge people for your software. It’s a necessity for maintenance and continued development. But you can not get money from people who have none. To me the rational model is to charge professional entities for the software, to make it free to use for non-professional use and just not bother with the small fry. Basically, have a smallish legal department that goes after a big player if they violate. For companies and individual professionals a software license is the least of their costs (and probably the item with the highest ROI), and they will gladly pay to have some one to go to for support. Chasing down individual hobbyists just doesn’t make sense. I’m sorry, where did that soapbox come from? I must have stepped on it by accident. Moving on to the next editor/IDE …

I tried XCode next. It was a medium-ish install, and I think I needed an iTunes account. I was encouraged to use it because CMake can generate XCode projects and many coders working on Macs swore by it. I have used it infrequently – mostly when I was editing a largish Objective C project during my postdoc. I found it fine to use and debug with. It was fine, but I disliked how CMake organized the XCode project virtual directories, and I disliked how I had to manually add non-source and new files. And memories of the over complicated build system started to come back to me.

I then found something called CodeLite. It was confusing because I spent some time trying to figure out if it was related to Code::Blocks. CodeLite looked kind of promising. There was a nice tutorial on how to integrate with CMake based projects. But it looked cumbersome, and I had to add files to the project through the interface. By now, I was willing to endure this but I couldn’t get over the fact that there was no preview for Markdown. I liked that they limited the number of plugins, but I really wanted markdown preview. I told you I was finicky.

Then, quite by accident I started to poke around Visual Studio Code and downloaded it to try it out. In brief, it seems to satisfy all my finicky requirements. This choice rather surprised me, because it was just not on my radar. I also realize this does nothing to advance my attempt to get into the cool kids programming club where people use vim or emacs, but that was a lost cause anyway. (I now recall that I initially avoided VSC because a colleague spoke out very strongly against it a few months ago. Goes to show, you need to try things out for yourself)

CMake and XCode

Oddly enough, though I’ve been working on Macs for eight years now, I haven’t used Xcode a lot and in the past I used Bloodshed C++ and eclipse. I also have fond memories of using vi and make with my IDE being a cluster of four terminals (code, compile, debug and run output, I think they were. Or maybe one was pine). Also, the fondness is perhaps an effect of nostalgia.

Since I’m using CMake I looked around for ways to make XCode work with it. CMake will generate an xcode project with the command -G XCode but the structure of this project looked atrocious and I wondered what I was doing wrong. This post by John Lamp gives some nice details about how and why the generated XCode project is so structured.

One trick I learned (Lamp mentions it but I really paid attention when I watched this video) was to add the header files to the sources list as well, in the CMake project. After I did this the organization of the files in the XCode Project view became more sensiple – XCode knows how to split up header and source files.

Also, before you open up XCode, add the build directory to .gitignore. When XCode tries to manage your versioning, it seems to want to add just about everything to version control by default, and the build directory is an un-holy mess.

Cmake tutorial, slightly easier than the official one.

 

Starting a project

It’s that time of the decade again – to start a new hobby programming project that will probably span a few years, given that it’s a side project. So I went through the usual decision tree and this time I decided to document it for my future self.

Picking a language

I really surprised myself. I picked c++. A year or so ago I would have told you I was going to use Haskell or Common Lisp for new projects. I’m pretty proficient in Python and so I wouldn’t pick that. But C++??? Now, I’ll still be doing coding challenge type tasks in common lisp this year, and next year will be the year of Haskell, but every time I touch one of these I get fingers bitten off. There is always something easy I want to do that breaks things and the solution, often from practitioners in the field, seems to be pretty messy.

Python, funnily enough, is the closest I’ve come to an ideal language for practical use. I don’t use custom defined classes very much and it’s only fake functional (it comes out in the major – no tail recursion – and in the minor – try and use lambdas with the multiprocessing module) and it can be astoundingly slow for compute intensive tasks that can’t be cast as numpy operations. But you can’t beat it in time it takes to go from a blank file to a working, self-documenting, fairly bug free program where you can often write very complex things very concisely and elegantly.

So why C++?? C++14, to be precise, drew me back from these REPL oriented languages that make prototyping so fast and easy. In a nutshell I think the standard library has turned into quite a nice thing and things like CATCH and Cmake make medium sized projects easier to manage. I anticipate an intense amount of numerical computation, potentially large amounts of generated data and an attendant desire to control the type of the data and how it’s laid out. Common Lisp, Python and, I suspect, Haskell aren’t so bothered by such low level details. I suspect I’d be able to quickly prototype what I want to do in those languages, but then I’d get into a hellhole trying to remove the bottlenecks.

Build system: Cmake: I’ve used this before and found it manageable
Command line parser: cxxopts – one file (just header)
Test system: CATCH – I’ve used this before and I like how easy it is.

One thing I’m excited to experiment with is Cling – allegedly a REPL for C++. We’ll see …

Picking a version system

This was easy. git. I toyed briefly with using this as an excuse to learn mercurial, but I decided to reduce the number of moving parts for this particular project. Mercurial will have to wait.

Picking a license

I’ve always used the GPL. This time I picked the MIT license. There were a couple of funny thoughts that went through my head as I was picking the license. Funny because they reveal very deep (but perhaps not so serious) desires that I don’t often reflect on. One dominant thought was, what if this becomes widely popular (yeah, sure) and people start to capitalize on this but I don’t get any of the upside (fame/money). You’d probably want to use the GPL to block people like that.

And so I went with the MIT license, because almost everything I know, I know because random people on the internet unconditionally shared their knowledge with me, and almost every tool I use has been unconditionally granted to me by random people on the internet.  The GPL started to sound a little mean spirited. A little too over bearing. The MIT license, on the other hand, sounded more in the hacker spirit. Here is some code. I throw it to the wind. If you find it useful send, me a post card. Just don’t sue me. Leave the lawyers out of this.

Picking a host

github, because of the network effects, and I’m using git. And to be fair, github has made a lot of effort to make the site feature full. I do hope it does not end up like sourceforge. Remember when sourceforge was where we put all our projects?

Picking a documentation system

Text files in markdown. Algorithm design notes in the code files. Because everything else gets out of date and becomes a giant interdependent mess that you deal with only if this is the day job.

Work style

Ok, this one I’m gonna get right (yeah, sure). I’m starting with the specifications, then the user manual, then the tests, then the code. This is not a magic formula, and it doesn’t stop me from skipping around, but this set of practices has always seemed more logical and a lot less irritating to me. Code is living, you always go back and change things, but as long as you follow this order when you change things, sanity prevails.

Off we go …

A run-in with the garbage men

When you write and run a computer program you are processing data. You load data, create more data, process it and so on. All this data takes up memory on your computer. Without any other action the memory starts to look like a kid’s bedroom – all toys, and stale clothes, and empty cans and magazines and books everywhere. There are typically two ways programming languages deal with this clutter.

One is the “clean your room or I’ll blow up the house” method. I realize this is not what parents say, but this is what happens in languages like C++ where you – the programmer – are responsible for cleaning up after yourself. If you don’t your program crashes and burns and exits out with a nasty error message. If your operating system is not a grown up operating system, it often freezes up your whole computer.

The other is automatic memory management, which is a bit like having a housekeeping service. Common lisp is a garbage collected language. Which means that the runtime of the language takes care of the clutter for you. The service sends over a person who goes through your room and tidies up. As you can imagine, this is a lot easier for you, since now you don’t have to worry about all this low level housekeeping stuff and can concentrate on the fun things in life/computer science.

So imagine my surprise when I ran the following bit of code and my house blew up (program memory space, not my house literally, but you’ve caught on)

(defun random-string (length)
  (with-output-to-string (stream)
    (let ((*print-base* 36))
      (loop repeat length do (princ (+ (random 26) 10) stream)))))

(defun insert (count)
  (let ((ht (make-hash-table :test 'equal)))
    (loop
       repeat count
       do (setf (gethash (random-string 30) ht) 1))))

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (insert ins-count)))

(gc-test 20 1000000 nil)

I wasn’t actually doing this exact thing, but the principle was the same: I was inserting keys into a hash table which got larger and larger, and then I discarded the hash table. After a few cycles of this BAM! my program crashed and SBCL printed out the delightfully campy:

Heap exhausted, game over.

Common Lisp’s handy (room) function prints out the memory usage of the program, and when I instrumented my program with it I saw this:

no-forcing

Now you can see that up to the 4th iteration housekeeping kind of keeps up, though there is this funny sawtooth pattern where housekeeping slacks off on the first pass, does a thorough job on the second, slacks off on the third, catches up on the fourth. But then, the union goes on strike! Housekeeping refuses to show up and the dirty pizza boxes start to pileup and then the floor collapses. Contrary to what your friends working in sales will tell you, not all graphs that rise exponentially on the right hand side are good.

SBCL has a neat function

(sb-ext:gc)

that allows you to manually call for housekeeping, and I modified my program to do this each time I discarded the hash-table, and when I reran the program I saw this:

with-forcing

So, manually forcing the garbage collector to run does do the trick but it made me a little disappointed with SBCL’s implementation. I suspect commercial Lisp compilers do a better job. If any one has a guess as to why the GC fails in this case, please drop a line.

In case you are interested, the function, instrumented with (room) and with the garbage collection forcing looks like this

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (progn
	  (insert ins-count)
	  (when nudge-gc (sb-ext:gc :full t))
	  (room nil))))

(gc-test 20 1000000 nil)

Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I could’t live without it. I find Python generators and list/dict comprehensions one of the key things to making Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun not-withstanding, I was happy to learn that common lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting each K character long section (called a “kmer”) (bioinformatics savvy readers will recognize the file as a FASTQ file, and the K character long sections as k-mers from the reads in the FASTQ.

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create)

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) > kmer_size
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)

Update: My first run through is now appended to the end of the main article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be to use a nested map like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve minorly changed the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to every second line of the FASTQ file (seq line)"
  (flet ((read-seq-line (in)
	   (prog2
	       (read-line in nil)
	       (read-line in nil)
	     (read-line in nil)
	     (read-line in nil))))
    (loop for seq = (read-seq-line in)
	 while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
   (loop
      for pos from 0
      for end = (+ pos kmer-size)
      while (<= end len)
      do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.

Appendix: my first run through using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has got to be the first idiom I know from Python that I can not replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
	(prog2  ; <-- prog1 and prog2 are nifty!
	    (read-line in nil)
	    (read-line in nil)  ; <-- this is the sequence line
	  (read-line in nil)
	  (read-line in nil)))))	;  http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
   while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

       
(defun kmer-gen (seq &key (kmer-size 30) (discard-N? nil))
  (let ((n 0) (n-max (- (length seq) kmer-size)))
   #'(lambda ()
       (prog1
	   (if (< n n-max)
	       (subseq seq n (+ n kmer-size))
	       nil)
	 (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
   while seq do
     (let ((g (kmer-gen seq :kmer-size 10)))
       (loop for s = (funcall g)
	    while s do (format t "~A~%" s))))

It’s not terrible as the code still looks compact and readable but it’s not as nicely abstracted away as the nested Python generator.

(A complete non sequitor: Remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well now I have this issue with prefix and infix notation. I keep trying to write (print kmer) and print(kmer) is starting to look plain weird …)

Common Lisp accessors

The Common Lisp setf function is deceptively humble looking. At the start I didn’t think much of it – except that it looked eccentric compared to the assignment syntax from other languages – but recently I’ve found out how powerful it is, though not quite as convenient as I would hope.

In most languages to perform a variable assignment one does something like

a = 22;

which places the value “22” into a place in memory which has been given the label “a” in the program.

In Common Lisp the corresponding syntax is

(setf a 22)

and the only thing weird about it is the everything-is-a-list pattern of Lisp.

I was blown away, though, when I saw code that looked like this

(setf x `(1 2 3 4 5))  ; x -> (1 2 3 4 5)
(setf (rest x) `(6 7 8))  ; x -> (1 6 7 8)

You can’t do this as transparently in C++ or Python. You can implement a function to do the equivalent thing, but you can’t assign a value to the return value of a function. This Lisp syntax makes writing code very compact and logical.

However, I stumbled when I tried to use this in a function I wrote. My function started at the root of a complete binary tree and then snaked recursively through it to find and return a particular node.

;; Use the binary representation trick to thread through the
;; tree to a given node. No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
(defun find-node (root index)
  (multiple-value-bind
   (n-idx rem) (floor index 2)
   (if (= n-idx 1)
     (find-node-pick-child root rem)
     (find-node-pick-child (find-node root n-idx) rem))))

(defun find-node-pick-child (nd rem)
  (if (= rem 0) (node-lchild nd) (node-rchild nd)))

When I tried to execute the statement (which I expected to attach new-node to the place returned by find-node)

(setf (find-node root 8) new-node)

I ran into compiler errors.

This lead me on a very interesting chase all round the internet with the word “generalized variables” turning up a lot. It turns out that (setf …) is a lisp macro that can figure out how to get hold of the “place” its first argument refers to. If the first argument is just a variable, (setf …) acts just like the assignment statement in any other language. If the first argument is a function, Lisp will look for a setter form for that function and setf on the place returned by that function (which is called an accessor function).

Following advice from answers on comp.lang.lisp, I came up with the following code:

;; Use the binary representation trick to thread through the
;; tree to return an tree node and some auxiliary information
;; No bounds checking is done here
; root - the root of the tree
; index - the index number of the node (1 = root)
; Return value is a pair
; node - a node of the tree
; info - :me - this is the node we're looking for
;        :lchild - we want the left child
;        :rchild - we want the right child
(defun find-node-with-index (root index)
  (multiple-value-bind
   (new-idx rem) (floor index 2)
   (cond
     ((= new-idx 0) (values root :me))
     ((= new-idx 1) (values root (if (= rem 0) :lchild :rchild)))
     (t (find-node-with-index
         (if (= rem 0) (node-lchild root) (node-rchild root))
         new-idx)))))

(defun node-with-index (bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (bheap-root bh))  ; Special case - root of heap
     (:lchild (node-lchild nd))
     (:rchild (node-rchild nd)))))

;; The accessor or setter function. Note the "setf"
(defun (setf node-with-index) (new-node bh index)
  (multiple-value-bind
   (nd info) (find-node-with-index (bheap-root bh) index)
   (case info
     (:me (setf (bheap-root bh) new-node))  ; Special case - root of heap
     (:lchild (setf (node-lchild nd) new-node))
     (:rchild (setf (node-rchild nd) new-node)))))

I like the DRY principle primarily because it reduces bugs and the total amount of code. I tried to abstract as much of the core functionality into a single function as I could, but the getter and setter functions look so duplicate, but I don’t think I can reduce them further.

Thanks go to Kaz Kylheku and Barry Margolin on comp.lang.lisp for answering my questions on this topic and to Rainer Joswig for corrections to this post.

Links-Rechts, Links-Rechts, Ahnentafel!

While looking into priority queues and binary heaps, I ran across two gems, one of which may not be so widely known. Consider a complete binary tree with nodes sequentially numbered:

binary-tree

You will observe that the parent of every node is the integer division of itself by two. For example, the parent of 11 is 11/2 = 5 (integer division), the parent of 14 is 14/2 = 7 and so on. Way back in 1590 a genealogist used this numbering pattern to organize people’s ancestries, calling it an Ahnentafel. From the article:

The subject (proband or progenitor) of the ahnentafel is listed as No. 1, the subject’s father as No. 2 and the mother as No. 3, the paternal grandparents as No. 4 and No. 5 and the maternal grandparents as No. 6 and No. 7, and so on, back through the generations. Apart from No. 1, who can be male or female, all even-numbered persons are male, and all odd-numbered persons are female. In this schema, the number of any person’s father is double the person’s number, and a person’s mother is double the person’s number plus one. Using this definition of numeration, one can derive some basic information about individuals who are listed without additional research.

This construct displays a person’s genealogy compactly, without the need for a diagram such as a family tree. It is particularly useful in situations where one may be restricted to presenting a genealogy in plain text, for example, in e-mails or newsgroup articles. In effect, an ahnentafel is a method for storing a binary tree in an array by listing the nodes (individuals) in level-order (in generation order).

Now, suppose we see the exact same complete binary tree and sequential numbering in binary:

binary-tree-in-binary

Do you notice a striking pattern? The children of any given node are basically the binary of the parent left shifted by one with a 0 or 1 in the new last bit, depending upon whether they are the left or right child. So, for example, the children of 101 are 1010 and 1011. This is the basis for the following neat algorithm (which I first came across here) for finding a given node in this tree.

To find the path from the root (here node #1) to some other node, say (11) express the target node number in binary (1011) and discard the first digit – which is always 1 – (011). Each digit now successively indicates the child to pick (0 = left, 1= right) as you traverse the path. So, to get to node #11 you would take (011) left, right, right. (I read this twice before realizing the choice of eleven (11) is peculiarly bad in this context. Here’s the example using 12. 12 => 1100 => 100 => right, left, left)

This algorithm is useful when using a pointer-based representation and, for example, inserting or deleting from a binary heap. This pattern, as you have guessed, comes from the number representation and the tree being both binary based.

PS. Pardon my terrible title. The Ahnentafel put “German” in my head, traversing the tree put “left, right” in my head and watching too many world war II movies in childhood did the rest.

PPS. Python code to generate the trees, not that you need it.

from matplotlib import pyplot as plt

def node_xy(k, n, dx):
  n0, n1 = 2**k, 2**(k+1)-1
  center = (n0 + n1) / 2.0
  return (n - center) * dx / n0, 0.5 - k / 10.0

def plot(ax, max_k):
  dx = 1.0
  for k in range(max_k):
    for n in range(2**k, 2**(k+1)):
      x, y = node_xy(k, n, dx)
      # ax.text(x, y, str(n), ha='center', va='bottom')
      ax.text(x, y, bin(n)[2:], ha='center', va='bottom')
      if k:
        px, py = node_xy(k - 1, int(n/2), dx)
        plt.plot([px, x], [py, y + 0.025], 'k')

fig = plt.figure()
ax = plt.subplot(111)
plot(ax, 5)
plt.setp(ax, xlim=[-0.5, 0.5], ylim=[0.0, 0.6])
plt.axis('off')