CIGAR strings

When I first joined Seven Bridges my boss would be describing things and he would say “Do this with the CIGAR” or “compute that from the CIGAR”. Reluctant to be found out so early I nodded sagely and made a furtive note in my notebook to check out what the heck a CIGAR was.

You are most likely to run into this word – and it is in all caps, for reasons that may become clear later – if you operate closer to the nasty end of any work that requires genomic sequencing. One of the more compute-intensive things we have to do to raw genomic data, before we can get on with the real sciencey stuff, is to figure out how a sequence obtained in an experiment relates to some reference data.

A lot of life is tied together and it shows in the DNA. There are similarities not only between the DNA of different humans, but between the DNA of humans and chimps, and even between mice and men. The closest match of all, though, is between human beings of every “race” and ethnicity, despite what some ignorant folk will try to tell you.

At the tiniest level, at very early stages of genomic sequence data analysis, we take a string of data, say

ACAAGCTTGCATACGT

and compare it against some reference data, say

GGGTTTAACATGCTAATGCAACGT

We notice that we can align the data against the reference by changing one letter (marked by *), assuming some letters have been deleted (- in the data) and some letters have been inserted (- in the ref)

GGGTTTAACATGCTAATGCA-ACGT
       ACAAGCT--TGCATACGT
       |||*|||  |||| ||||

In general the vast majority of these alignments of data against the reference can be described as a series of exact matches interspersed by single base mismatches, insertions and deletions.

Now, what does CIGAR stand for? It stands for Compact Idiosyncratic Gapped Alignment Record. Yes, it does sound like a backronym, doesn’t it? But the “gapped alignment” part has probably given you a good clue. The major operation above was inserting gaps, either in the data or the reference, in order to get the two strings to align against each other.

A CIGAR string is simply the set of operations we did, encoded succinctly. So, in the case above, the CIGAR is

7M2D4M1I4M

In the CIGAR we usually ignore single base mismatches and focus on the insertions and deletions (though the “extended CIGAR” is much more detailed and includes information about mismatches). So we convert the gapless bits to “M”, with the number indicating how many letters (bases) form a stretch of gapless alignment. This is interspersed with “D”s and “I”s to indicate deleted and inserted subsequences.
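To make the encoding concrete, here is a minimal sketch of decoding a CIGAR string by hand (the string 7M2D4M1I4M is purely illustrative; in practice pysam hands you this pre-parsed as cigartuples):

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string like '7M2D4M1I4M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_and_ref_lengths(cigar):
    """How many read bases and reference bases does the alignment consume?"""
    read_ops = set("MIS=X")  # ops that consume bases of the read
    ref_ops = set("MDN=X")   # ops that consume bases of the reference
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in read_ops)
    ref_len = sum(n for n, op in ops if op in ref_ops)
    return read_len, ref_len

print(parse_cigar("7M2D4M1I4M"))           # [(7, 'M'), (2, 'D'), (4, 'M'), (1, 'I'), (4, 'M')]
print(read_and_ref_lengths("7M2D4M1I4M"))  # (16, 17)
```

Note that "M" consumes both read and reference bases (it covers matches and mismatches alike), while "I" consumes only read bases and "D" only reference bases.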

And now you know.

Use pysam

This is a plug for Pysam: a well-designed and feature-rich wrapper for HTSlib. I’ve used Pysam for a few years now and it has gotten better with time. It is now a very efficient and convenient library for reading and writing BAM files, and for reading VCF, FASTQ and FASTA files. For VCF and FASTA files it can use an index (tabix and faidx, respectively) to read regions.

I first used Pysam to read BAM files, which was the most feature-complete part when I started, and began using it to write BAM files once the quirks were ironed away. I used to use PyVCF for reading VCF files and my own (inefficient) reader for FASTA and FASTQ files.

In its current state, Pysam is also a very effective solution for reading VCF, FASTA and FASTQ files. It transparently handles gzipped files and BCF files, so you don’t have to keep track of the file type yourself.

Not only is it able to use the VCF and FASTA indices, it also returns the data nicely encapsulated. My only complaint is that I feel the VCF record format is a tad over-complicated (it uses a similar interface to PyVCF). I always end up with very complicated accessor expressions when I wish to get alleles and genotypes from a record.

What I really liked about the VCF reader is that it intelligently handles the case of deletions crossing a region boundary. So for example, if you ask for variants in the range 10 to 20 and there is a 5 base deletion starting at position 8, it will be captured because it crossed into the requested regions. It’s very nice not to have to write your own logic to handle such cases.
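The library spares you from writing that logic yourself, but it is worth seeing how simple the half-open interval test it implements for you is. A sketch with made-up coordinates:

```python
def overlaps(record_start, record_end, region_start, region_end):
    """Half-open intervals: does [record_start, record_end) touch [region_start, region_end)?"""
    return record_start < region_end and record_end > region_start

# A 5-base deletion starting at position 8 spans [8, 13), so it pokes into
# the requested region [10, 20) even though its start lies outside it.
assert overlaps(8, 13, 10, 20)

# A variant ending exactly at the region start does not overlap a half-open region.
assert not overlaps(2, 10, 10, 20)
```

The subtlety is remembering to compare the variant's end coordinate (start plus deleted length), not just its start, against the region.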

I’ve also found VCF reading efficient enough for my purposes. When reading a single sample from multi-sample (population) VCF files it helps to set the filter, so that only the required sample is loaded.

Molecular combing

Molecular combing is a fun technique used to study longer stretches of DNA, and it is quite different from the next generation sequencing-by-synthesis I’ve been exposed to. In this technique DNA is mechanically stretched out on a slide and then locations of interest are labeled using fluorescent compounds. One can then image the slide and inspect the regions that light up.

DNA in solution coils up randomly. In molecular combing a charged probe is dipped in the solution and attracts the ends of the DNA molecules. The probe is then retracted slowly from the solution, moving against a glass slide. The surface tension of the fluid pulls on the DNA and helps to stretch it out. Fluorescent DNA probes designed to attach to particular locations on the DNA can then be applied, and the slide is observed under a fluorescent microscope.

One use of this method is to detect large genetic events, like structural variations (long insertions, long deletions, inversions), that are really hard to detect with short read technologies and even with long read technologies. The method does not give single base resolution – a person I talked to said they get 1kb resolution (which is pretty great, if you consider that we are looking at DNA molecules under the microscope!) – but it is the only way to get these classes of events. Typically this would be used when you know what you are looking for and want to test individual samples to see if they have this feature or not.

As a side note this reminds me of my Neuroscience days. One way we would classify techniques to study brain activity was a two-dimensional grouping based on resolution in time and space, by which we mean, if you detected an event with a technique, how well could you tell at what point in time it occurred – 1ms, or 1hour? and where it occurred – in THIS CELL, or in this half of the brain?

So, for example, invasive single cell recording would give very high resolution in both time (1ms or less) and space (this neuron), fMRI would sit in the middle, with moderate resolution in both time (1min or so) and space (1mm cubed chunks of the brain), and EEG would have relatively high time resolution (a few ms) and very crummy spatial resolution (This half of the brain, or if you used fancy statistical techniques, this region of the brain).

An effective study would try and combine different techniques to get at a question from different levels of detail and to cross-validate the findings (we can rarely do this because of resource limitations, though).

Similarly, in genomics, groups are now trying fusion-based approaches where “evidence classes” from different techniques are combined to cross-validate the existence of genetic features. (It’s fun – when you move to new fields you always find the same concepts, but with totally different names. I guess everyone wants to sound cool in different ways. In electrical engineering (robotics) the buzzword for a while was “sensor fusion”.)

The devil in the details of course is how you weight the different evidence classes and what do you do when they conflict with each other.


Seeing chromosomes

Did you know that chromosomes can be seen under a light microscope, and they get their name from the fact that they are strongly stained by certain colored dyes?

I was reading through our government’s collection of educational materials on genomics. Going through the history of the gene I was startled to learn that people knew about chromosomes by the late 1800s. I had assumed they were a later discovery, requiring more sophisticated instruments than a simple light microscope.

Walther Flemming was the first to report seeing these structures while studying cell division. The dye he was using, aniline, was heavily absorbed by these mysterious granules in the nucleus, which he called chromatin. Though human chromosomes can be quite long (Chromosome 1, the longest, is 85mm long), being a single molecule (!), you can imagine it is quite invisible even under a microscope when stretched out.

It turns out, however, that during cell division the chromosomes condense and become dense enough that a stain will make them visible under a light microscope. Since the job of the DNA in the chromosomes is to serve as instructions for creating proteins, it is probably important for the DNA to be “unfurled” so it has a large surface area for ready transcription. When a cell needs to divide, however, the nuclear material needs to be transported and shuffled around, and it is best to have it in a tidy ball to do so.

The study of the structure of chromosomes using microscopes is called cytogenetics. One could imagine that the physical structure of the chromosome is not important for the understanding of the genetic code; we mostly study DNA as a one dimensional, linear sequence of characters. It turns out, however, that the physical three dimensional organization of chromosomes can affect which proteins are produced and in what quantity (“gene expression”).

I’m also told that chromosomes have their own domains in the nucleus – they are not just tendrils floating willy nilly in the nucleus.

Yup, that was your small, sticky ball of biological facts from me today …

The masked chromosome

A tale of grand theft genome and the mysterious origins of the Y chromosome …

I was looking at the sequence of the Y chromosome and trundling merrily along:

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Whaaa? Did someone fall asleep at the wheel? Did the government shutdown in the middle of sequencing the Y chromosome?

You know that DNA sequences of chromosomes are represented by a four letter alphabet, ACGT, called bases. You also know that we have to figure out the sequence of bases in each chromosome through great effort. When we can’t figure out what a base is (but we know there should be a base there) we put in an ‘N’. Typically, chromosomes start with a long string of Ns because there are often difficult-to-sequence parts at the ends of the chromosomes.
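If you ever want to find such runs of Ns yourself, a few lines of Python will do it (a sketch, run here on a made-up string rather than a real chromosome):

```python
from itertools import groupby

def n_runs(seq):
    """Return (start, length) for every run of 'N' in seq."""
    pos = 0
    runs = []
    for is_n, group in groupby(seq, key=lambda base: base == "N"):
        length = sum(1 for _ in group)  # consume the group to count it
        if is_n:
            runs.append((pos, length))
        pos += length
    return runs

print(n_runs("NNNNACGTACGTNNNNNNNNACGT"))  # [(0, 4), (12, 8)]
```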

However, I ran into this string of Ns bang in the middle of chromosome Y. And it did not stop. It went on and on and on. It took up the entire second half of the Y chromosome! Did I get a bad file? Did I somehow download an older version of the genome?

I fired up IGV – a genome browser developed by the Broad Institute – and took a look at what it had to say. Nope, even there, somewhere halfway through, the Y chromosome disappears into a sea of Ns.

[Screenshot: IGV view of chromosome Y, with the second half dissolving into Ns]

I went over to the University of California, Santa Cruz (UCSC) genome browser and looked at the most recent sequence data from chromosome Y.

[Screenshot: browser view of chromosome Y sequence data, the black blocks vanishing midway and reappearing at the right edge]

Nope! See that middle part where the black blocks disappear, only to reappear right at the end (right hand border)? The place where the blocks disappear is the missing part of chromosome Y.

Then it hit me. Of course. It was obvious. Somebody had stolen half of Chromosome Y from right under our noses! In a panic I dialed up my colleagues at Seven Bridges. Devin picked up first.

“The pseudoautosomal region (PAR) of Y is commonly masked out with Ns because it closely matches the X chromosome in this region and would result in two very similar sequences within the reference if left unmasked,” he said after hearing me tell him about grand theft genome.

The what-what-in-the-what-now?

“The. Pseudoautosomal. Region …” he began to repeat, slowly and distinctly. After some web searching and reading of wikipedia pages I pieced together the story behind that string of ‘N’s in the Y chromosome of the human reference genome.

The story starts about 160 million years ago. As you know, we have two copies of every chromosome except for the 23rd chromosome, also called X/Y. Females have two copies (XX) but males have one regular copy of X and a strange, shorter copy called Y. Around 160 million years ago, one of our great-great-great-great …. – great grandfathers (whom, if we met today, we would not recognize, him being a different species and all) obtained, through dubious means, a mutated copy of an X chromosome.

Our current conjecture is that, simply possessing this mutation ensured that the individual would develop into a male, and so this mutation turned that copy of the X chromosome into a sex-determining Y chromosome passed on down from father to son for millions of years. The Y chromosome has the fastest mutation rate of all the chromosomes in the genome, and differs from the chimpanzee by 30% (The whole genome, in comparison, differs by less than 1% between humans and chimps).

However, most relevant to us here is the fact that some parts of the Y chromosome are virtually identical to the X chromosome because of their shared origins. These are called pseudo-autosomal regions (PAR). When the Y chromosome was sequenced, it was decided not to sequence the PARs and instead copy the corresponding (homologous) section from the X chromosome sequence.

Now, if you recall, when we analyze genetic data (DNA) we don’t actually get the sequences of each chromosome as complete wholes. Rather, we get short (typically 100 letter long) DNA sequences from our sample which come from all the chromosomes together in one big pile. We then use a program, called an aligner, that takes these short sequences and matches them against the chromosomes in the reference sequence.

Now you can see we have a problem with the X and Y chromosomes. In our reference sequence we have parts of X and Y which are exactly identical. An aligner will take a read which came from a PAR of the X chromosome and often place it on the Y chromosome. To avoid this the PAR regions on the Y chromosome are often hard-masked – filled with Ns.
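Hard-masking itself is nothing fancy: you overwrite the region with Ns. A sketch with made-up coordinates:

```python
def hard_mask(seq, start, end):
    """Replace seq[start:end] with Ns (0-based, half-open coordinates)."""
    return seq[:start] + "N" * (end - start) + seq[end:]

print(hard_mask("ACGTACGTACGT", 4, 8))  # ACGTNNNNACGT
```

The masked sequence keeps its original length, which matters: coordinates on the chromosome (and hence all downstream annotations) stay valid.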

(A very succinct description of the rationale for doing this is given here if you are interested)

So that is how the masked chromosome came to be. I love genomics. You ask a simple question and you discover stuff that started 160 million years ago!

Raw (unaligned), paired reads in a BAM file

Since BAM files are the binary version of SAM files, which in turn stands for Sequence Alignment/Map, it’s a little strange to store unaligned data (the raw reads from the sequencer) in a BAM. However, as eloquently argued by others, the text based FASTQ format is showing its age and an indexed binary file is a very tempting store for our big bucket of raw reads.

It turns out that storing unaligned reads in a BAM file in a way that will not break downstream tools is fairly easy – you simply set all information related to alignment to be unavailable.

Storing paired reads in FASTQ files is an afterthought with a bunch of ad hoc solutions, including storing the reads in pairs of files (labeled 1 and 2) or as an interleaved file, where the read pairs are stored one after the other. In a BAM file, on the other hand, there is a very systematic way of indicating paired reads.

You simply:

  1. Set FLAG bit 0x1 (template having multiple segments)
  2. Set FLAG bit 0x40 if the read is closer to the 5′ end, or 0x80 if the read is from the 3′ end
  3. Set the TLEN value to the expected length of the template.
  4. Make sure that the QNAME of the reads match.

(It took me a while to catch on that ‘template’ meant the shred of DNA these reads come from)

A toy script that uses Pysam to dump a valid BAM file representing unaligned paired reads is:

import pysam

# The header must declare the name and length of at least one reference
# sequence, even though these reads are unaligned
header = { 'HD': {'VN': '1.4'},
           'SQ': [{'LN': 702, 'SN': 'chr1'}] }

outfile = pysam.Samfile('test.bam', "wb", header=header)
for n in range(10):
  a = pysam.AlignedRead()
  a.qname = "r{:d}".format(n)
  a.seq = "A" * 100   # dummy sequence
  a.qual = "I" * 100  # dummy base qualities (set after seq, which resets qual)
  a.flag = 0x41  # end1: 0x01 (multiple segments) | 0x40 (first segment)
  a.tlen = 100
  outfile.write(a)

for n in range(10):
  a = pysam.AlignedRead()
  a.qname = "r{:d}".format(n)  # same QNAME as the corresponding end1 read
  a.seq = "A" * 100
  a.qual = "I" * 100
  a.flag = 0x81  # end2: 0x01 (multiple segments) | 0x80 (last segment)
  a.tlen = 100
  outfile.write(a)

outfile.close()


SAM! BAM! VCF! What?

As part of an exercise I’m doing I was supplied with a FASTQ file which contains a set of reads and I needed to figure out how to get a VCF file out. A what what now? Exactly.

When we try to obtain the sequence of some DNA, the big machine into which we dump our DNA does not — contrary to what we have been trained by CSI:Boston* to believe — spit out the complete DNA sequence and tell us who the perpetrator of the crime is. (*What? There was no CSI:Boston? Every other city had one … )

When we run “The Machine” (some of which actually increasingly look like a cheapo laser printer from the 2000s) the long chains of DNA are shred into random little bits. The sequencing machine then looks at even smaller bits of these small shreds and gives us a list of these shreds of shreds which are called ‘reads’. If you are thinking what I’m thinking:

You are right.

You’ll be also thinking, why don’t we do this thing correctly and sequence the whole strand of DNA at one go instead of getting this pile of reads and how are we going to put it back together? Why did we take the document and then shred it? Couldn’t we hire a better intern? There is an interesting story behind this which we will look at another time, but the short answer – as always – is money (and time).

If you are thinking what I’m thinking:

You are right.

Long story short, the government spent about 9 billion dollars – roughly two Nimitz class aircraft carriers – and came up with the (almost) complete sequence of the human genome. People who do not have a cool 10 billion lying around take the shredder approach and get a big pile of reads. Then they buy or rent computers which run programs that take these shreds and put them together like so many puzzle pieces. The government’s sequenced genome is like the picture on the puzzle box. The computer software looks at the government’s genome and tries to figure out where each little piece goes and slowly builds it up together. This turns out to be much, much cheaper.

Of course, it turns out we are not all perfect clones of one another and the differences amongst us are reflected in differences in our DNA. Now you are thinking: Great! I have these little puzzle pieces AND the picture on the box is DIFFERENT from what the pieces add up to? That’s one of the challenges of Genomics which makes it fun and interesting. The differences also turn out to be very important for things like cancer and why we get it and why some of us respond to some medicines and others don’t.

Anyhoo – to get back on topic – often people take their big old pile of reads – which are a huge collection of short strips of DNA letters looking hopelessly similar to each other:


and dump them into a semi-text file called the FASTQ format. I just love this format.

Anyhow, there are tools that take this FASTQ file and a reference sequence (the picture on the box) and align the reads (the shreds) to try and make the best fit to the reference. This alignment information is stored in a Sequence Alignment Map (SAM) file which is a text file. Someone got tired of lugging ginormous text files around and went home and developed the Binary Alignment Map (BAM) format which is the same as SAM, but in binary, which tends to save space and is faster to process and so on.

We then take this SAM/BAM file and try and compare it to the reference sequence. Any differences to the reference are then attributed to individual variation between the chap who gave the sample and the mysterious caped figure who gave the government their sample. These are called variants (or mutations – yes we are all mutants, compared to each other) and are stored, logically, in a Variant Call Format (VCF) file.

So this is my journey. I’m taking my FASTQ, getting a SAM/BAM and then generating a VCF. See, you talk pretty one day.

So I buttonholed Devin and asked him to suggest some tools I could start with. I’m sure Nate already told me this, but I have a short attention span. Devin suggested that I try out BWA and then GATK. I ended up starting with BWA and samtools which turned out to be sufficient for my toy program.

The tools compiled straightforwardly (by typing ‘make’) on my mac and were self contained binaries that could just be copied over to my local binary directory. They both have good command line help but some quirks. I’m sure there are many of them but in the example workflow below I annotated the ones I ran into.

The reference sequence is a string of nucleotide base symbols (GATC…). In order to do fast searching and matching on this string – which can be very, very long – programs first index the string, which basically means they create a lookup table of sorts that allows them to find matches to parts of the sequence more efficiently.

bwa index ref.fa
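BWA actually builds an FM-index based on the Burrows-Wheeler transform, which is far more compact, but the pay-once-to-index, query-many-times idea can be sketched with a plain k-mer table (a toy illustration, nothing like BWA's internals; the reference string is made up):

```python
from collections import defaultdict

def build_kmer_index(ref, k):
    """Map every k-letter substring of ref to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

ref = "GGGTTTAACATGCTAATGCAACGT"
index = build_kmer_index(ref, 4)
print(index["ACAT"])  # [7] -- candidate positions, found without scanning ref
```

Building the table costs one pass over the reference; after that, looking up where a read might align is a dictionary hit instead of a scan of three billion letters.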

Now that we have the index, we can run the aligner – the workhorse program that pieces together the puzzle.

bwa mem ref.fa reads.fastq > aligned.sam

The next tools we will be using need binary formatted alignment maps, rather than text ones, so we convert the text data to a binary format. No new information is being added here. Genomics is a relatively new field and we are kind of in a transition between text formats (which are easily human readable, but bloated) and binary formats, which humans can’t read directly, but which are more space and time efficient for processing large sequences (like the human genome).

samtools view -Sb aligned.sam > aligned.bam

Now, before we can do further processing on the aligned reads in the bam file, we need to index it (for the same reason we indexed the reference sequence). In order to index the file we need to sort it – the data has to be in order for the index to be efficient.

samtools sort aligned.bam aligned_sorted

Now we can index it

samtools index aligned_sorted.bam

Now, finally, we can pull up everything – the reference sequence and the aligned reads – and see what they look like. You should see any variants (differences from the reference) highlighted.

samtools tview aligned_sorted.bam ref.fa

Next we would like to print out these differences (variants) into the standard format (VCF).

samtools mpileup -uf ref.fa aligned_sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf

As an aside, while samtools has a neat alignment viewer – and I really like the command line – I found Tablet to be pretty slick too.

Why do we use ASCII files in genomics?

This rant was triggered when I was going over the format for FASTQ files. This is a pure rant: I propose no good solutions for anything here. I’m not even angry – just bemused. I’m a novice, so there are probably good reasons for keeping ASCII files for raw data which I just don’t know about.

First, a short run down of the trigger: A FASTQ file stores “reads” from a genomic sequencing machine. The entire genome is not read at once, but rather as a set of short reads – short sequences of base pairs, like ACTGGTAC – which are then put together into a chromosome or genome using alignment programs.

Following convention, we store the data in plain text (ASCII encoded) files, where we use the actual letter (one byte) to represent the base (‘A’,’C’,’G’,’T’). It is easy to guess the background for this choice. I love plain text files – there is nothing to interpret here – you simply open them up in a text editor. Fifty-year-old files can be read this way. When people were fighting to sequence parts of the genomes of viruses and bacteria this was a convenient and practical choice. For example, you can write down the entire genome of the bacteriophage phiX174 on one page of your writing block.

The human genome has something like 3 billion base pairs. There are four bases, which can therefore be represented in 2 bits each. When we get the sequence data from sequencing machines there is naturally some slop about the identity of each unit in the sequence. This uncertainty in the raw data is represented by a quality score which can probably be quantized into a six bit space (the convention is 94 levels). Therefore, we can represent both the identity of a sequence element and its quality in one byte. Using this kind of binary bit field representation, reads from the whole human genome could be packed into a file on the order of a few GB. (And a quick search reveals the 2bit file format! But it uses masks and things because real data has low quality/unknown pieces which need extra bits to represent.)
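The 2-bits-per-base arithmetic is easy to sketch (ignoring the Ns and quality scores that make real formats messier):

```python
# Pack a DNA string into 2 bits per base and unpack it again.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    bits = 0
    for base in seq:
        bits = (bits << 2) | CODE[base]  # append 2 bits per base
    return bits

def unpack(bits, length):
    out = []
    for _ in range(length):
        out.append(BASE[bits & 0b11])  # peel off the last 2 bits
        bits >>= 2
    return "".join(reversed(out))

seq = "GATTACA"
assert unpack(pack(seq), len(seq)) == seq
print(f"{len(seq)} bases fit in {(2 * len(seq) + 7) // 8} bytes")  # 7 bases fit in 2 bytes
```

At 2 bits per base, 3 billion bases come to roughly 750 MB, versus 3 GB for one-byte-per-letter text.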

However, we still use the old text based format (FASTQ) to store data right from the sequencing machines. We do something funny with the FASTQ format: we keep the sequence data in human readable format but store the quality data (94 levels) using 94 printable but otherwise meaningless characters:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
You will find that these nonsense characters are actually the printed representations of the numbers 33 to 126, in order. This sleight of hand betrays our dilemma and our sneaky weaseling around it. Older formats kept human readable numbers for the quality values, but these would use three bytes (two for the digits and one for the gap between the numbers). This format saves us some space by reducing the quality code to one character (but if you read the FASTQ specs you will note some symbols in this scale have additional meanings as well, such as indicating the start of a read and so on).
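Decoding the quality characters is just an ASCII offset of 33 (the Sanger convention; other historical offsets have existed):

```python
def qual_to_phred(qual_string):
    """Decode a FASTQ quality string (Sanger encoding, offset 33) into Phred scores."""
    return [ord(c) - 33 for c in qual_string]

def phred_to_error_prob(q):
    """Phred score Q relates to the base-call error probability p by Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

print(qual_to_phred("!I"))       # [0, 40]
print(phred_to_error_prob(30))   # 0.001, i.e. a 1-in-1000 chance the base call is wrong
```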

We have here an admission that we no longer gaze at whole genomes letter by letter, bedazzled by the sight of our own source code, AND that we need to save space, because, man, those genomes are big! And we are accumulating so many of them! But we go only halfway. We COULD go all the way and compress everything into one byte, not bother with opening up the files in text editors, and settle on a simple, well defined binary format.

But we don’t. Are we afraid we will lose the magic of the ancients? The ability to stare at the source code? I don’t think so. Humans are pretty good at learning symbols you know …