CIGAR strings

When I first joined Seven Bridges my boss would be describing things and he would say “Do this with the CIGAR” or “compute that from the CIGAR”. Reluctant to be found out so early I nodded sagely and made a furtive note in my notebook to check out what the heck a CIGAR was.

You are most likely to run into this word – and it is in all caps, for reasons that may become clear later – if you operate closer to the nasty end of any work that requires genomic sequencing. One of the more compute intensive things we have to do to raw genomic data before we can get with the real sciencey stuff is to figure out how a sequence obtained in an experiment relates to some reference data.

A lot of life is tied together and it shows in the DNA. There are similarities not only between the DNA of humans, between the DNA of humans and chips, but also between mice and men, but the closest match is by far between human beings of all “races” and ethnicities, despite what some ignorant folk will try to tell you.

At the tiniest level, at very early stages of genomic sequence data analysis, we take a string of data, say

data = ACTGACTGACTGACTG

and compare it against some reference data, say

ref = ACTTACTAAGACTACTG

We notice that we can align the data against the reference by changing one letter (marked by *), assuming some letters have been deleted (– in the data) and some letters have been inserted (- in the ref)

data = ACTGACT--GACTGACTG
       |||*|||  |||| ||||
ref  = ACTTACTAAGACT-ACTG

 
In general the vast majority of these alignments of data against the reference can be described as a series of exact matches interspersed by single base mismatches, insertions and deletions.

Now, what does CIGAR stand for? It stands for Compact Idiosyncratic Gapped Alignment Record. Yes, it does sound like a backronym doesn’t it? But the “gapped alignment part” has probably given you a good clue. The major operation above was in inserting gaps, either in the data or the reference, in order to get the two strings to align against each other.

A CIGAR string is simply the set of operations we did encoded succinctly. So, in the case above, the CIGAR is

7M2D4M1I4M

 

In the CIGAR we usually ignore single base mismatches and focus on the insertions and deletions (though the “extended CIGAR” is much more detailed and includes information about mismatches) and so we convert the gapless bits to “M”, the number indicating how many letters (bases) form a stretch of gapless alignment. This is interspersed with “D”s and “I”s to indicate deleted subsequences and inserted subsequences.

And now you know.

Development environment

While it can be an avenue for procrastination, searching for a proper development environment often pays off with greatly improved productivity in the long term. I had some finicky requirements. I wanted to maintain my own CMakeLists file so that the project was IDE/editor/platform independent. I wanted to be able to just add files to the project from anywhere – from VI, by OS copy, by git pull – (I set wildcards in the CMakeLists.txt file) – and definitely not have to fiddle in the IDE to add files. I wanted to edit markdown files which I was using for documentation, and placing in a separate docs folder. I wanted to see a decent preview of the markdown occassionally. Ideally, my environment would simply track my directory organization, including non C++ files. However, I also wanted intelligent code completion, including for my own code.

At work, where a ton of my current stuff is in Python, I happily use Pycharm which has been a very worthwhile investment. So CLion was a natural candidate to start with. I didn’t get very far, because I balked at the license terms. It felt a little annoying to have to apply for an open source license and then renew every year and so on. When I want to play with red-tape I talk to the government.

I completely understand needing to charge people for your software. It’s a necessity for maintenance and continued development. But you can not get money from people who have none. To me the rational model is to charge professional entities for the software, to make it free to use for non-professional use and just not bother with the small fry. Basically, have a smallish legal department that goes after a big player if they violate. For companies and individual professionals a software license is the least of their costs (and probably the item with the highest ROI), and they will gladly pay to have some one to go to for support. Chasing down individual hobbyists just doesn’t make sense. I’m sorry, where did that soapbox come from? I must have stepped on it by accident. Moving on to the next editor/IDE …

I tried XCode next. It was a medium-ish install, and I think I needed an iTunes account. I was encouraged to use it because CMake can generate XCode projects and many coders working on Macs swore by it. I have used it infrequently – mostly when I was editing a largish Objective C project during my postdoc. I found it fine to use and debug with. It was fine, but I disliked how CMake organized the XCode project virtual directories, and I disliked how I had to manually add non-source and new files. And memories of the over complicated build system started to come back to me.

I then found something called CodeLite. It was confusing because I spent some time trying to figure out if it was related to Code::Blocks. CodeLite looked kind of promising. There was a nice tutorial on how to integrate with CMake based projects. But it looked cumbersome, and I had to add files to the project through the interface. By now, I was willing to endure this but I couldn’t get over the fact that there was no preview for Markdown. I liked that they limited the number of plugins, but I really wanted markdown preview. I told you I was finicky.

Then, quite by accident I started to poke around Visual Studio Code and downloaded it to try it out. In brief, it seems to satisfy all my finicky requirements. This choice rather surprised me, because it was just not on my radar. I also realize this does nothing to advance my attempt to get into the cool kids programming club where people use vim or emacs, but that was a lost cause anyway. (I now recall that I initially avoided VSC because a colleague spoke out very strongly against it a few months ago. Goes to show, you need to try things out for yourself)

CMake and XCode

Oddly enough, though I’ve been working on Macs for eight years now, I haven’t used Xcode a lot and in the past I used Bloodshed C++ and eclipse. I also have fond memories of using vi and make with my IDE being a cluster of four terminals (code, compile, debug and run output, I think they were. Or maybe one was pine). Also, the fondness is perhaps an effect of nostalgia.

Since I’m using CMake I looked around for ways to make XCode work with it. CMake will generate an xcode project with the command -G XCode but the structure of this project looked atrocious and I wondered what I was doing wrong. This post by John Lamp gives some nice details about how and why the generated XCode project is so structured.

One trick I learned (Lamp mentions it but I really paid attention when I watched this video) was to add the header files to the sources list as well, in the CMake project. After I did this the organization of the files in the XCode Project view became more sensiple – XCode knows how to split up header and source files.

Also, before you open up XCode, add the build directory to .gitignore. When XCode tries to manage your versioning, it seems to want to add just about everything to version control by default, and the build directory is an un-holy mess.

Cmake tutorial, slightly easier than the official one.

 

Starting a project

It’s that time of the decade again – to start a new hobby programming project that will probably span a few years, given that it’s a side project. So I went through the usual decision tree and this time I decided to document it for my future self.

Picking a language

I really surprised myself. I picked c++. A year or so ago I would have told you I was going to use Haskell or Common Lisp for new projects. I’m pretty proficient in Python and so I wouldn’t pick that. But C++??? Now, I’ll still be doing coding challenge type tasks in common lisp this year, and next year will be the year of Haskell, but every time I touch one of these I get fingers bitten off. There is always something easy I want to do that breaks things and the solution, often from practitioners in the field, seems to be pretty messy.

Python, funnily enough, is the closest I’ve come to an ideal language for practical use. I don’t use custom defined classes very much and it’s only fake functional (it comes out in the major – no tail recursion – and in the minor – try and use lambdas with the multiprocessing module) and it can be astoundingly slow for compute intensive tasks that can’t be cast as numpy operations. But you can’t beat it in time it takes to go from a blank file to a working, self-documenting, fairly bug free program where you can often write very complex things very concisely and elegantly.

So why C++?? C++14, to be precise, drew me back from these REPL oriented languages that make prototyping so fast and easy. In a nutshell I think the standard library has turned into quite a nice thing and things like CATCH and Cmake make medium sized projects easier to manage. I anticipate an intense amount of numerical computation, potentially large amounts of generated data and an attendant desire to control the type of the data and how it’s laid out. Common Lisp, Python and, I suspect, Haskell aren’t so bothered by such low level details. I suspect I’d be able to quickly prototype what I want to do in those languages, but then I’d get into a hellhole trying to remove the bottlenecks.

Build system: Cmake: I’ve used this before and found it manageable
Command line parser: cxxopts – one file (just header)
Test system: CATCH – I’ve used this before and I like how easy it is.

One thing I’m excited to experiment with is Cling – allegedly a REPL for C++. We’ll see …

Picking a version system

This was easy. git. I toyed briefly with using this as an excuse to learn mercurial, but I decided to reduce the number of moving parts for this particular project. Mercurial will have to wait.

Picking a license

I’ve always used the GPL. This time I picked the MIT license. There were a couple of funny thoughts that went through my head as I was picking the license. Funny because they reveal very deep (but perhaps not so serious) desires that I don’t often reflect on. One dominant thought was, what if this becomes widely popular (yeah, sure) and people start to capitalize on this but I don’t get any of the upside (fame/money). You’d probably want to use the GPL to block people like that.

And so I went with the MIT license, because almost everything I know, I know because random people on the internet unconditionally shared their knowledge with me, and almost every tool I use has been unconditionally granted to me by random people on the internet.  The GPL started to sound a little mean spirited. A little too over bearing. The MIT license, on the other hand, sounded more in the hacker spirit. Here is some code. I throw it to the wind. If you find it useful send, me a post card. Just don’t sue me. Leave the lawyers out of this.

Picking a host

github, because of the network effects, and I’m using git. And to be fair, github has made a lot of effort to make the site feature full. I do hope it does not end up like sourceforge. Remember when sourceforge was where we put all our projects?

Picking a documentation system

Text files in markdown. Algorithm design notes in the code files. Because everything else gets out of date and becomes a giant interdependent mess that you deal with only if this is the day job.

Work style

Ok, this one I’m gonna get right (yeah, sure). I’m starting with the specifications, then the user manual, then the tests, then the code. This is not a magic formula, and it doesn’t stop me from skipping around, but this set of practices has always seemed more logical and a lot less irritating to me. Code is living, you always go back and change things, but as long as you follow this order when you change things, sanity prevails.

Off we go …

A run-in with the garbage men

When you write and run a computer program you are processing data. You load data, create more data, process it and so on. All this data takes up memory on your computer. Without any other action the memory starts to look like a kid’s bedroom – all toys, and stale clothes, and empty cans and magazines and books everywhere. There are typically two ways programming languages deal with this clutter.

One is the “clean your room or I’ll blow up the house” method. I realize this is not what parents say, but this is what happens in languages like C++ where you – the programmer – are responsible for cleaning up after yourself. If you don’t your program crashes and burns and exits out with a nasty error message. If your operating system is not a grown up operating system, it often freezes up your whole computer.

The other is automatic memory management, which is a bit like having a housekeeping service. Common lisp is a garbage collected language. Which means that the runtime of the language takes care of the clutter for you. The service sends over a person who goes through your room and tidies up. As you can imagine, this is a lot easier for you, since now you don’t have to worry about all this low level housekeeping stuff and can concentrate on the fun things in life/computer science.

So imagine my surprise when I ran the following bit of code and my house blew up (program memory space, not my house literally, but you’ve caught on)

(defun random-string (length)
  (with-output-to-string (stream)
    (let ((*print-base* 36))
      (loop repeat length do (princ (+ (random 26) 10) stream)))))

(defun insert (count)
  (let ((ht (make-hash-table :test 'equal)))
    (loop
       repeat count
       do (setf (gethash (random-string 30) ht) 1))))

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (insert ins-count)))

(gc-test 20 1000000 nil)

I wasn’t actually doing this exact thing, but the principle was the same: I was inserting keys into a hash table which got larger and larger, and then I discarded the hash table. After a few cycles of this BAM! my program crashed and SBCL printed out the delightfully campy:

Heap exhausted, game over.

Common Lisp’s handy (room) function prints out the memory usage of the program, and when I instrumented my program with it I saw this:

no-forcing

Now you can see that up to the 4th iteration housekeeping kind of keeps up, though there is this funny sawtooth pattern where housekeeping slacks off on the first pass, does a thorough job on the second, slacks off on the third, catches up on the fourth. But then, the union goes on strike! Housekeeping refuses to show up and the dirty pizza boxes start to pileup and then the floor collapses. Contrary to what your friends working in sales will tell you, not all graphs that rise exponentially on the right hand side are good.

SBCL has a neat function

(sb-ext:gc)

that allows you to manually call for housekeeping, and I modified my program to do this each time I discarded the hash-table, and when I reran the program I saw this:

with-forcing

So, manually forcing the garbage collector to run does do the trick but it made me a little disappointed with SBCL’s implementation. I suspect commercial Lisp compilers do a better job. If any one has a guess as to why the GC fails in this case, please drop a line.

In case you are interested, the function, instrumented with (room) and with the garbage collection forcing looks like this

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (progn
	  (insert ins-count)
	  (when nudge-gc (sb-ext:gc :full t))
	  (room nil))))

(gc-test 20 1000000 nil)

Millions of tiny hypotheses

I think I recall the exact moment when I began to get a little scared of math. It was an algebra class and we were being taught tricks (tools) that could be used to solve different classes of problems. Except, I don’t recall the teacher ever explicitly saying this was a toolbox. I think what he did was go through a list of equations, performing some trick on each – like adding x to both sides, multiplying by y on both sides and so on – and then proceeding with algebraic simplification. This demoralized me, and started to make me less enthusiastic about actually doing mathematics, though I remained fascinated by it.

I also, I think, recall why I got demoralized. I watched him solve an equation with one of these tricks and sat there staring glumly at the board. I fully understood how to apply the technique, what I couldn’t figure out was how I would know that I would have to apply that technique to that problem as opposed to some other technique. What didn’t help was that I had classmates who seemed to breeze through it, knowing intuitively which equation required which tool.

Unfortunately I did not have enough self-realization at that time to go and ask for help over this, or to talk it over with my mother (who was a mathematician). I just decided I didn’t have the innate talent for mathematics. Fortunately this did not cripple me and I had no hesitation diving into topics like physics and then electrical engineering (which I probably chose because it was applied physics) which had enough interesting math, with cool applications.

Many years later a stray comment on some message board somewhere by someone stuck in my head. They said that the way they did mathematics was to form hypothesis after hypothesis, testing them. If the hypothesis results in a falsehood, discard it and start again. Another comment, on a different message board by a different person, said that it had been explicitly stated to them that there was no magical solutions to mathematical problems, and it was a myth that there were people with innate mathematical skills and those without: you solved mathematical problems through gumption – you kept banging your head against it, trying different things, figuring it out as you went.

These two simple comments made a big impression on me. They suddenly freed me from the expectation that mathematical talent was innate. It raised the possibility that stubbornness – a trait my mother was fond of pointing out in me – was all you needed to solve mathematical problems. I was not worried about the top 99th percentile of mathematicians who most probably had something special going on their brains. I just wanted to enjoy math, learn more and more of it, and see if I could use it in daily life. These comments were immensely liberating. I had looked at the journey between the problem and the solution as mostly an embarrassment, as in the longer it took, the stupider you were. These comments turned the journey into part of the process, something to take joy in.

I just wish my math teachers had said things like this. I know that when I teach my daughter mathematics this is what I’m going to emphasize. It’s about creating millions of small hypotheses – magical worlds with crazy ideas – and then applying mathematical rules to figure out if the ideas contradicted each other, or if they fit. Just like the shape puzzles she loves to solve. Each shape fits in a particular slot. Sometimes, you can see right away that the shape will fit in a slot. Sometimes, you make a guess, if it works, good. If it doesn’t, ALSO GOOD. YOU LEARNED SOMETHING! Now try a different slot!

Common Lisp doesn’t yield (but it maps)

I didn’t use Python’s yield statement until I’d been programming for a few years. Within a few days of learning it, however, I could’t live without it. I find Python generators and list/dict comprehensions one of the key things to making Python code compact and readable. I was pretty bummed to learn that Common Lisp does not yield to this kind of program structure. The clever pun not-withstanding, I was happy to learn that common lisp has a different paradigm: a map-like one.

Consider the following straightforward problem: I have a text file and am interested in the string of characters (called a “read”) present on every fourth line. For each such read I’m interested in getting each K character long section (called a “kmer”) (bioinformatics savvy readers will recognize the file as a FASTQ file, and the K character long sections as k-mers from the reads in the FASTQ.

In Python there’s a very elegant way to abstract this extraction of data from the FASTQ file using generators (which is what yield allows us to create)

def read_fastq(fname):
  with open(fname, 'r') as fp:
    for n, ln in enumerate(fp):
      if (n - 1) % 4 == 0:
        yield ln.strip()

def kmers_from_seq(seq, kmer_size):
  assert len(seq) > kmer_size
  for n in range(len(seq) - kmer_size + 1):
    yield seq[n:n + kmer_size]

def kmers_from_fastq(fname, kmer_size):
  for seq in read_fastq(fname):
    for kmer in kmers_from_seq(seq, kmer_size):
      yield kmer

Note how nonchalantly I nest the two generators which allows me to do:

for kmer in kmers_from_fastq(fastq_fname, kmer_size):
  print(kmer)

Update: My first run through is now appended to the end of the main article. Rainer Joswig saw my original post and showed me the Lisp way to do this, which turns out to be to use a nested map like pattern. Many thanks to Rainer! My only reservation was the medium of instruction – a series of tweets. It might work for the 45th president, but I personally prefer the more civilized medium of post comments, which creates a more coherent record.

Rainer’s example code is here. I’ve minorly changed the code below:

(defun map-to-seq-in-fastq (in fn)
  "Apply fn to every second line of the FASTQ file (seq line)"
  (flet ((read-seq-line (in)
	   (prog2
	       (read-line in nil)
	       (read-line in nil)
	     (read-line in nil)
	     (read-line in nil))))
    (loop for seq = (read-seq-line in)
	 while seq do (funcall fn seq))))


(defun map-to-kmer-in-seq (seq kmer-size fn)
  "Apply fn to all kmers of length kmer-size in seq."
  (let ((len (length seq)))
   (loop
      for pos from 0
      for end = (+ pos kmer-size)
      while (<= end len)
      do (funcall fn (subseq seq pos end)))))


(defun map-to-kmer-in-fastq (in kmer-size fn)
  (map-to-seq-in-fastq
   in
   (lambda (seq)
     (map-to-kmer-in-seq seq kmer-size fn))))

You can see what is happening. Computation functions are being passed in, Russian doll-like, into a nest of other functions that loop over some source of data. I get it, but it’s still not as neat looking as Python.

Appendix: my first run through using closures.

Well, I can’t do things like this in Common Lisp. Not easily and not with built-ins.

I’m pretty surprised at this. This has got to be the first idiom I know from Python that I can not replicate in CL. The closest I could get was via a closure:

;; generator (closure) based on https://www.cs.northwestern.edu/academics/courses/325/readings/graham/generators.html
(defun fastq-reader (fname)
  (let ((in (open fname)))
    #'(lambda ()
	(prog2  ; <-- prog1 and prog2 are nifty!
	    (read-line in nil)
	    (read-line in nil)  ; <-- this is the sequence line
	  (read-line in nil)
	  (read-line in nil)))))	;  http://www.gigamonkeys.com/book/files-and-file-io.html using nil for return value

This can be called using the loop macro as follows

;; loop based on http://www.gigamonkeys.com/book/files-and-file-io.html
(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen) ; <-- Note this
   while seq do (format t "~A~%" seq))

Similarly, I can write the kmer extraction as a closure

       
(defun kmer-gen (seq &key (kmer-size 30) (discard-N? nil))
  (let ((n 0) (n-max (- (length seq) kmer-size)))
   #'(lambda ()
       (prog1
	   (if (< n n-max)
	       (subseq seq n (+ n kmer-size))
	       nil)
	 (incf n)))))

And combine the two closures as:

(defparameter gen (fastq-reader "../../DATA/sm_r1.fq"))
(loop for seq = (funcall gen)
   while seq do
     (let ((g (kmer-gen seq :kmer-size 10)))
       (loop for s = (funcall g)
	    while s do (format t "~A~%" s))))

It’s not terrible as the code still looks compact and readable but it’s not as nicely abstracted away as the nested Python generator.

(A complete non sequitor: Remember how I had issues with [square brackets] and (parentheses) for array indexing when I moved from MATLAB to Python? Well now I have this issue with prefix and infix notation. I keep trying to write (print kmer) and print(kmer) is starting to look plain weird …)