CIGAR strings

When I first joined Seven Bridges my boss would be describing things and he would say “Do this with the CIGAR” or “compute that from the CIGAR”. Reluctant to be found out so early I nodded sagely and made a furtive note in my notebook to check out what the heck a CIGAR was.

You are most likely to run into this word – and it is in all caps, for reasons that may become clear later – if you operate closer to the nasty end of any work that requires genomic sequencing. One of the more compute intensive things we have to do to raw genomic data before we can get with the real sciencey stuff is to figure out how a sequence obtained in an experiment relates to some reference data.

A lot of life is tied together and it shows in the DNA. There are similarities not only between the DNA of humans, between the DNA of humans and chips, but also between mice and men, but the closest match is by far between human beings of all “races” and ethnicities, despite what some ignorant folk will try to tell you.

At the tiniest level, at very early stages of genomic sequence data analysis, we take a string of data, say

data = ACTGACTGACTGACTG

and compare it against some reference data, say

ref = ACTTACTAAGACTACTG

We notice that we can align the data against the reference by changing one letter (marked by *), assuming some letters have been deleted (– in the data) and some letters have been inserted (- in the ref)

data = ACTGACT--GACTGACTG
       |||*|||  |||| ||||
ref  = ACTTACTAAGACT-ACTG

 
In general the vast majority of these alignments of data against the reference can be described as a series of exact matches interspersed by single base mismatches, insertions and deletions.

Now, what does CIGAR stand for? It stands for Compact Idiosyncratic Gapped Alignment Record. Yes, it does sound like a backronym doesn’t it? But the “gapped alignment part” has probably given you a good clue. The major operation above was in inserting gaps, either in the data or the reference, in order to get the two strings to align against each other.

A CIGAR string is simply the set of operations we did encoded succinctly. So, in the case above, the CIGAR is

7M2D4M1I4M

 

In the CIGAR we usually ignore single base mismatches and focus on the insertions and deletions (though the “extended CIGAR” is much more detailed and includes information about mismatches) and so we convert the gapless bits to “M”, the number indicating how many letters (bases) form a stretch of gapless alignment. This is interspersed with “D”s and “I”s to indicate deleted subsequences and inserted subsequences.

And now you know.

Development environment

While it can be an avenue for procrastination, searching for a proper development environment often pays off with greatly improved productivity in the long term. I had some finicky requirements. I wanted to maintain my own CMakeLists file so that the project was IDE/editor/platform independent. I wanted to be able to just add files to the project from anywhere – from VI, by OS copy, by git pull – (I set wildcards in the CMakeLists.txt file) – and definitely not have to fiddle in the IDE to add files. I wanted to edit markdown files which I was using for documentation, and placing in a separate docs folder. I wanted to see a decent preview of the markdown occassionally. Ideally, my environment would simply track my directory organization, including non C++ files. However, I also wanted intelligent code completion, including for my own code.

At work, where a ton of my current stuff is in Python, I happily use Pycharm which has been a very worthwhile investment. So CLion was a natural candidate to start with. I didn’t get very far, because I balked at the license terms. It felt a little annoying to have to apply for an open source license and then renew every year and so on. When I want to play with red-tape I talk to the government.

I completely understand needing to charge people for your software. It’s a necessity for maintenance and continued development. But you can not get money from people who have none. To me the rational model is to charge professional entities for the software, to make it free to use for non-professional use and just not bother with the small fry. Basically, have a smallish legal department that goes after a big player if they violate. For companies and individual professionals a software license is the least of their costs (and probably the item with the highest ROI), and they will gladly pay to have some one to go to for support. Chasing down individual hobbyists just doesn’t make sense. I’m sorry, where did that soapbox come from? I must have stepped on it by accident. Moving on to the next editor/IDE …

I tried XCode next. It was a medium-ish install, and I think I needed an iTunes account. I was encouraged to use it because CMake can generate XCode projects and many coders working on Macs swore by it. I have used it infrequently – mostly when I was editing a largish Objective C project during my postdoc. I found it fine to use and debug with. It was fine, but I disliked how CMake organized the XCode project virtual directories, and I disliked how I had to manually add non-source and new files. And memories of the over complicated build system started to come back to me.

I then found something called CodeLite. It was confusing because I spent some time trying to figure out if it was related to Code::Blocks. CodeLite looked kind of promising. There was a nice tutorial on how to integrate with CMake based projects. But it looked cumbersome, and I had to add files to the project through the interface. By now, I was willing to endure this but I couldn’t get over the fact that there was no preview for Markdown. I liked that they limited the number of plugins, but I really wanted markdown preview. I told you I was finicky.

Then, quite by accident I started to poke around Visual Studio Code and downloaded it to try it out. In brief, it seems to satisfy all my finicky requirements. This choice rather surprised me, because it was just not on my radar. I also realize this does nothing to advance my attempt to get into the cool kids programming club where people use vim or emacs, but that was a lost cause anyway. (I now recall that I initially avoided VSC because a colleague spoke out very strongly against it a few months ago. Goes to show, you need to try things out for yourself)

CMake and XCode

Oddly enough, though I’ve been working on Macs for eight years now, I haven’t used Xcode a lot and in the past I used Bloodshed C++ and eclipse. I also have fond memories of using vi and make with my IDE being a cluster of four terminals (code, compile, debug and run output, I think they were. Or maybe one was pine). Also, the fondness is perhaps an effect of nostalgia.

Since I’m using CMake I looked around for ways to make XCode work with it. CMake will generate an xcode project with the command -G XCode but the structure of this project looked atrocious and I wondered what I was doing wrong. This post by John Lamp gives some nice details about how and why the generated XCode project is so structured.

One trick I learned (Lamp mentions it but I really paid attention when I watched this video) was to add the header files to the sources list as well, in the CMake project. After I did this the organization of the files in the XCode Project view became more sensiple – XCode knows how to split up header and source files.

Also, before you open up XCode, add the build directory to .gitignore. When XCode tries to manage your versioning, it seems to want to add just about everything to version control by default, and the build directory is an un-holy mess.

Cmake tutorial, slightly easier than the official one.

 

Starting a project

It’s that time of the decade again – to start a new hobby programming project that will probably span a few years, given that it’s a side project. So I went through the usual decision tree and this time I decided to document it for my future self.

Picking a language

I really surprised myself. I picked c++. A year or so ago I would have told you I was going to use Haskell or Common Lisp for new projects. I’m pretty proficient in Python and so I wouldn’t pick that. But C++??? Now, I’ll still be doing coding challenge type tasks in common lisp this year, and next year will be the year of Haskell, but every time I touch one of these I get fingers bitten off. There is always something easy I want to do that breaks things and the solution, often from practitioners in the field, seems to be pretty messy.

Python, funnily enough, is the closest I’ve come to an ideal language for practical use. I don’t use custom defined classes very much and it’s only fake functional (it comes out in the major – no tail recursion – and in the minor – try and use lambdas with the multiprocessing module) and it can be astoundingly slow for compute intensive tasks that can’t be cast as numpy operations. But you can’t beat it in time it takes to go from a blank file to a working, self-documenting, fairly bug free program where you can often write very complex things very concisely and elegantly.

So why C++?? C++14, to be precise, drew me back from these REPL oriented languages that make prototyping so fast and easy. In a nutshell I think the standard library has turned into quite a nice thing and things like CATCH and Cmake make medium sized projects easier to manage. I anticipate an intense amount of numerical computation, potentially large amounts of generated data and an attendant desire to control the type of the data and how it’s laid out. Common Lisp, Python and, I suspect, Haskell aren’t so bothered by such low level details. I suspect I’d be able to quickly prototype what I want to do in those languages, but then I’d get into a hellhole trying to remove the bottlenecks.

Build system: Cmake: I’ve used this before and found it manageable
Command line parser: cxxopts – one file (just header)
Test system: CATCH – I’ve used this before and I like how easy it is.

One thing I’m excited to experiment with is Cling – allegedly a REPL for C++. We’ll see …

Picking a version system

This was easy. git. I toyed briefly with using this as an excuse to learn mercurial, but I decided to reduce the number of moving parts for this particular project. Mercurial will have to wait.

Picking a license

I’ve always used the GPL. This time I picked the MIT license. There were a couple of funny thoughts that went through my head as I was picking the license. Funny because they reveal very deep (but perhaps not so serious) desires that I don’t often reflect on. One dominant thought was, what if this becomes widely popular (yeah, sure) and people start to capitalize on this but I don’t get any of the upside (fame/money). You’d probably want to use the GPL to block people like that.

And so I went with the MIT license, because almost everything I know, I know because random people on the internet unconditionally shared their knowledge with me, and almost every tool I use has been unconditionally granted to me by random people on the internet.  The GPL started to sound a little mean spirited. A little too over bearing. The MIT license, on the other hand, sounded more in the hacker spirit. Here is some code. I throw it to the wind. If you find it useful send, me a post card. Just don’t sue me. Leave the lawyers out of this.

Picking a host

github, because of the network effects, and I’m using git. And to be fair, github has made a lot of effort to make the site feature full. I do hope it does not end up like sourceforge. Remember when sourceforge was where we put all our projects?

Picking a documentation system

Text files in markdown. Algorithm design notes in the code files. Because everything else gets out of date and becomes a giant interdependent mess that you deal with only if this is the day job.

Work style

Ok, this one I’m gonna get right (yeah, sure). I’m starting with the specifications, then the user manual, then the tests, then the code. This is not a magic formula, and it doesn’t stop me from skipping around, but this set of practices has always seemed more logical and a lot less irritating to me. Code is living, you always go back and change things, but as long as you follow this order when you change things, sanity prevails.

Off we go …