CIGAR strings

When I first joined Seven Bridges, my boss would be describing things and would say “Do this with the CIGAR” or “Compute that from the CIGAR”. Reluctant to be found out so early, I nodded sagely and made a furtive note in my notebook to check out what the heck a CIGAR was.

You are most likely to run into this word (and it is in all caps, for reasons that may become clear later) if you operate closer to the nasty end of any work that requires genomic sequencing. One of the more compute-intensive things we have to do to raw genomic data, before we can get on with the real sciencey stuff, is to figure out how a sequence obtained in an experiment relates to some reference data.

A lot of life is tied together, and it shows in the DNA. There are similarities between the DNA of humans and chimps, and even between mice and men, but the closest match by far is between human beings of all “races” and ethnicities, despite what some ignorant folk will try to tell you.

At the tiniest level, at very early stages of genomic sequence data analysis, we take a string of data, say

data = ACTGACTGACTGACTG

and compare it against some reference data, say

ref = ACTTACTAAGACTACTG

We notice that we can align the data against the reference by changing one letter (marked by *), assuming some letters have been deleted (- in the data) and some letters have been inserted (- in the ref):

data = ACTGACT--GACTGACTG
       |||*|||  |||| ||||
ref  = ACTTACTAAGACT-ACTG

 
In general, the vast majority of these alignments of data against the reference can be described as a series of exact matches interspersed with single base mismatches, insertions and deletions.
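To make that concrete, here is a minimal Python sketch that walks the two gapped strings from the example column by column and labels each column. The function name `classify_columns` and the label names are my own inventions for illustration, not part of any standard tool, and it assumes `-` is the gap character.

```python
from collections import Counter


def classify_columns(data: str, ref: str, gap: str = "-") -> list[str]:
    """Label each column of a gapped alignment (hypothetical helper)."""
    if len(data) != len(ref):
        raise ValueError("gapped strings must have the same length")
    labels = []
    for d, r in zip(data, ref):
        if d == gap and r == gap:
            raise ValueError("a column cannot be a gap in both strings")
        if d == gap:
            labels.append("deletion")   # base is in the ref but missing from the data
        elif r == gap:
            labels.append("insertion")  # base is in the data but missing from the ref
        elif d == r:
            labels.append("match")
        else:
            labels.append("mismatch")
    return labels


data = "ACTGACT--GACTGACTG"
ref = "ACTTACTAAGACT-ACTG"
counts = Counter(classify_columns(data, ref))
print(counts["match"], counts["mismatch"], counts["deletion"], counts["insertion"])
```

For the example alignment this tallies 14 matching columns, 1 mismatch, 2 deleted bases and 1 inserted base, which accounts for all 18 columns.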

Now, what does CIGAR stand for? It stands for Compact Idiosyncratic Gapped Alignment Report. Yes, it does sound like a backronym, doesn’t it? But the “gapped alignment” part has probably given you a good clue. The major operation above was inserting gaps, either in the data or in the reference, in order to get the two strings to align against each other.

A CIGAR string is simply the sequence of operations we performed, encoded succinctly. So, in the case above, the CIGAR is

7M2D4M1I4M

 

In the CIGAR we usually ignore single base mismatches and focus on the insertions and deletions (though the “extended CIGAR” is much more detailed and includes information about mismatches). So we convert the gapless bits to “M”, with the number in front indicating how many letters (bases) form that stretch of gapless alignment. These are interspersed with “D”s and “I”s, each preceded by a count, to indicate subsequences deleted from or inserted into the data relative to the reference. The example above, 7M2D4M1I4M, therefore reads as: 7 aligned bases, 2 deleted, 4 aligned, 1 inserted, 4 aligned.
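If you want to see the encoding spelled out, here is a rough Python sketch under a few assumptions: the gap character is `-`, and only the three operations discussed here (M, I, D) are handled. Both function names are hypothetical; real toolkits have their own APIs and support more operation codes. The first function run-length encodes the per-column operations, and the second goes the other way and does the kind of "compute that from the CIGAR" arithmetic my boss was talking about.

```python
import re
from itertools import groupby


def cigar_from_alignment(data: str, ref: str, gap: str = "-") -> str:
    """Run-length encode a gapped alignment as a CIGAR (M, I, D only)."""
    if len(data) != len(ref):
        raise ValueError("gapped strings must have the same length")
    ops = []
    for d, r in zip(data, ref):
        if d == gap:
            ops.append("D")  # gap in the data: bases deleted relative to the ref
        elif r == gap:
            ops.append("I")  # gap in the ref: bases inserted relative to the ref
        else:
            ops.append("M")  # aligned column, whether or not the letters agree
    # Collapse each run of identical operations into "<count><op>".
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def cigar_lengths(cigar: str) -> tuple[int, int]:
    """Return (data length, reference length) implied by an M/I/D-only CIGAR."""
    if not re.fullmatch(r"(\d+[MID])+", cigar):
        raise ValueError(f"unsupported CIGAR for this sketch: {cigar!r}")
    data_len = ref_len = 0
    for count, op in re.findall(r"(\d+)([MID])", cigar):
        n = int(count)
        if op in "MI":
            data_len += n  # M and I consume letters of the data
        if op in "MD":
            ref_len += n   # M and D consume letters of the reference
    return data_len, ref_len


cigar = cigar_from_alignment("ACTGACT--GACTGACTG", "ACTTACTAAGACT-ACTG")
print(cigar)                 # 7M2D4M1I4M
print(cigar_lengths(cigar))  # (16, 17)
```

The lengths come out as 16 and 17, which are the lengths of the original ungapped data and ref strings, so the CIGAR alone tells you how much of each sequence the alignment covers.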

And now you know.
