Raw (unaligned), paired reads in a BAM file

Since BAM files are the binary version of SAM files, which in turn stands for Sequence Alignment/Map, it's a little strange to store unaligned data (the raw reads from the sequencer) in a BAM. However, as eloquently argued by others, the text based FASTQ format is showing its age and an indexed binary file is a very tempting store for our big bucket of raw reads.

It turns out that storing unaligned reads in a BAM file in a way that will not break downstream tools is fairly easy – you simply set all information related to alignment to be unavailable.

Storing paired reads in FASTQ files is an afterthought with a bunch of ad hoc solutions, including storing the reads in pairs of files (labeled 1 and 2) or as an interleaved file, where the read pairs are stored one after the other. In a BAM file, on the other hand, there is a very systematic way of indicating paired reads.

You simply:

  1. Set FLAG bit 0x1 (template having multiple segments)
  2. Set FLAG bit 0x40 if the read is the first segment in the template (the 5′ read), or 0x80 if it is the last segment (the 3′ read)
  3. Set the TLEN value to the expected length of the template.
  4. Make sure that the QNAME of the reads match.

(It took me a while to catch on that ‘template’ meant the shred of DNA these reads come from)
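Combining steps 1 and 2, the FLAG values for the two mates are just ORed bit masks. A quick Python sanity check (the constant names are mine; the bit values are from the SAM spec):

```python
# FLAG bits from the SAM spec
PAIRED = 0x1   # template has multiple segments
READ1  = 0x40  # first segment in the template (the 5' read)
READ2  = 0x80  # last segment in the template (the 3' read)

# The two mates of a pair carry these FLAG values
flag_end1 = PAIRED | READ1
flag_end2 = PAIRED | READ2

print(flag_end1)  # 65  (0x41)
print(flag_end2)  # 129 (0x81)
```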

A toy script that uses PySAM to dump a valid BAM file representing unaligned paired reads is:

import pysam

# You need to declare at least one reference sequence (name and length) in
# the header, even though the reads themselves will be unaligned
header = { 'HD': {'VN': '1.4'},
           'SQ': [{'LN': 702, 'SN': 'chr1'}] }

outfile = pysam.Samfile( 'test.bam', "wb", header = header )

for n in range(10):
  a = pysam.AlignedRead()
  a.qname = "r{:d}".format(n)
  a.seq = "ACGT" * 25   # toy sequence
  a.qual = "I" * 100    # toy quality string, same length as seq
  a.flag = 0x01 | 0x40 | 0x04 | 0x08  #end1: paired, first in pair, read and mate unmapped
  a.tid = -1   #no reference,
  a.pos = -1   #no position and
  a.mapq = 0   #no mapping quality - all alignment information is 'unavailable'
  a.tlen = 100 #expected length of the template
  outfile.write(a)

for n in range(10):
  a = pysam.AlignedRead()
  a.qname = "r{:d}".format(n)  #QNAME must match that of the mate
  a.seq = "ACGT" * 25
  a.qual = "I" * 100
  a.flag = 0x01 | 0x80 | 0x04 | 0x08  #end2: paired, second in pair, read and mate unmapped
  a.tid = -1
  a.pos = -1
  a.mapq = 0
  a.tlen = 100
  outfile.write(a)

outfile.close()


SAM! BAM! VCF! What?

As part of an exercise I’m doing I was supplied with a FASTQ file which contains a set of reads and I needed to figure out how to get a VCF file out. A what what now? Exactly.

When we try to obtain the sequence of some DNA, the big machine into which we dump our DNA does not — contrary to what we have been trained by CSI:Boston* to believe — spit out the complete DNA sequence and tell us who the perpetrator of the crime is. (*What? There was no CSI:Boston? Every other city had one … )

When we run “The Machine” (some of which actually increasingly look like a cheapo laser printer from the 2000s) the long chains of DNA are shredded into random little bits. The sequencing machine then looks at even smaller bits of these small shreds and gives us a list of these shreds of shreds, which are called ‘reads’. If you are thinking what I’m thinking:

You are right.

You’ll also be thinking: why don’t we do this thing correctly and sequence the whole strand of DNA in one go, instead of getting this pile of reads that we then have to put back together? Why did we take the document and then shred it? Couldn’t we hire a better intern? There is an interesting story behind this which we will look at another time, but the short answer – as always – is money (and time).

If you are thinking what I’m thinking:

You are right.

Long story short, the government spent about 9 billion dollars – roughly two Nimitz class aircraft carriers – and came up with the (almost) complete sequence of the human genome. People who do not have a cool 10 billion lying around take the shredder approach and get a big pile of reads. Then they buy or rent computers which run programs that take these shreds and put them together like so many puzzle pieces. The government’s sequenced genome is like the picture on the puzzle box. The computer software looks at the government’s genome and tries to figure out where each little piece goes and slowly builds it up together. This turns out to be much, much cheaper.

Of course, it turns out we are not all perfect clones of one another and the differences amongst us are reflected in differences in our DNA. Now you are thinking: Great! I have these little puzzle pieces AND the picture on the box is DIFFERENT from what the pieces add up to? That’s one of the challenges of Genomics which makes it fun and interesting. The differences also turn out to be very important for things like cancer and why we get it and why some of us respond to some medicines and others don’t.

Anyhoo – to get back on topic – often people take their big old pile of reads – which are a huge collection of short strips of DNA letters looking hopelessly similar to each other – and dump them into a semi-text file called the FASTQ format. I just love this format.
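For the curious, a FASTQ record is four lines: an ‘@’-prefixed read name, the bases, a ‘+’ separator, and one quality character per base. A made-up record, pulled apart with a few lines of Python:

```python
record = "@read1\nGATTACA\n+\nIIIIIII\n"

# Four lines: '@'-prefixed name, the bases, a '+' separator,
# and a quality string the same length as the sequence
name, seq, plus, qual = record.strip().split("\n")

print(name)                   # @read1
print(len(seq) == len(qual))  # True
```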

Anyhow, there are tools that take this FASTQ file and a reference sequence (the picture on the box) and align the reads (the shreds) to try and make the best fit to the reference. This alignment information is stored in a Sequence Alignment Map (SAM) file which is a text file. Someone got tired of lugging ginormous text files around and went home and developed the Binary Alignment Map (BAM) format which is the same as SAM, but in binary, which tends to save space and is faster to process and so on.

We then take this SAM/BAM file and try and compare it to the reference sequence. Any differences to the reference are then attributed to individual variation between the chap who gave the sample and the mysterious caped figure who gave the government their sample. These are called variants (or mutations – yes we are all mutants, compared to each other) and are stored, logically, in a Variant Call Format (VCF) file.
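To give a flavour of what comes out the other end, here is a made-up VCF record (the column layout follows the VCF spec’s own example). Each line says, for one position, what the reference has and what our sample has instead:

```
#CHROM  POS    ID  REF  ALT  QUAL  FILTER  INFO
chr1    14370  .   G    A    29    PASS    DP=14
```

Read as: at position 14370 on chr1 the reference has a G, our sample has an A, with a quality of 29 and a read depth of 14.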

So this is my journey. I’m taking my FASTQ, getting a SAM/BAM and then generating a VCF. See, you talk pretty one day.

So I buttonholed Devin and asked him to suggest some tools I could start with. I’m sure Nate already told me this, but I have a short attention span. Devin suggested that I try out BWA and then GATK. I ended up starting with BWA and samtools which turned out to be sufficient for my toy program.

The tools compiled straightforwardly (by typing ‘make’) on my mac and were self-contained binaries that could just be copied over to my local binary directory. They both have good command line help, but some quirks. I’m sure there are many more, but in the example workflow below I annotated the ones I ran into.

The reference sequence is a string of nucleotide base symbols (GATC…). In order to do fast searching and matching on this string – which can be very, very long – programs first index the string, which basically means they create a lookup table of sorts that allows them to find matches to parts of the sequence more efficiently.
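As a toy illustration of what an index buys you (this is just a sketch – BWA actually uses a far more sophisticated structure based on the Burrows–Wheeler transform):

```python
def build_kmer_index(ref, k=4):
    """Map every k-letter substring of the reference to the positions it occurs at."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    return index

ref = "GATTACAGATTACACATGATT"
index = build_kmer_index(ref, k=4)

# Instead of scanning the whole reference for each read, we jump straight
# to candidate positions via the lookup table.
print(index["GATT"])  # [0, 7, 17]
```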

bwa index ref.fa

Now that we have the index, we can run the aligner – the workhorse program that pieces together the puzzle.

bwa mem ref.fa reads.fastq > aligned.sam

The next tools we will be using need binary formatted alignment maps, rather than text ones, so we convert the text data to a binary format. No new information is being added here. Genomics is a relatively newer field and we are kind of in a transition between text formats (which are easily human readable, but bloated) to binary formats, which humans can’t read directly, but which are more space and time efficient for processing large sequences (like the human genome).

samtools view -Sb aligned.sam > aligned.bam

Now, before we can do further processing on the aligned reads in the bam file, we need to index it (for the same reason we indexed the reference sequence). In order to index the file we need to sort it – the data has to be in order for the index to be efficient.

samtools sort aligned.bam aligned_sorted

Now we can index it:

samtools index aligned_sorted.bam

Now, finally, we can pull up everything – the reference sequence and the aligned reads – and see what they look like. You should see any variants (differences from the reference) highlighted.

samtools tview aligned_sorted.bam ref.fa

Next we would like to print out these differences (variants) into the standard format (VCF).

samtools mpileup -uf ref.fa aligned_sorted.bam | bcftools view -bvcg - > var.raw.bcf
bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf 

As an aside, while samtools has a neat alignment viewer – and I really like the command line – I found Tablet to be pretty slick too.

Text File formats – ASCII Delimited Text – Not CSV or TAB delimited text

One issue with doing this is that, by including non-printing characters, we are breaking our ‘readable using a simple text editor’ pledge. When we open this file in a regular text editor we will not get the nice alignment of units and records that commas and tabs give us. (OK, I exaggerated on the readability of csv and tab delimited files.)
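To make the idea concrete, here is a sketch of writing and reading ‘ASCII Delimited Text’ using the unit (0x1F) and record (0x1E) separators. The field values are free to contain commas, quotes and tabs, and no escaping rules are needed:

```python
US = "\x1f"  # unit (field) separator
RS = "\x1e"  # record separator

rows = [["name", "quote"],
        ["Doe, John", 'He said "hi"\tand left']]

# Serialize: no quoting or escaping needed, since the separators are
# non-printing characters that never show up in normal text
blob = RS.join(US.join(fields) for fields in rows)

# Parse it back with two plain splits
parsed = [record.split(US) for record in blob.split(RS)]
print(parsed == rows)  # True
```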

Ronald Duncan's Blog

Unfortunately a quick google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format.  ASCII Delimited Text should use the record separators defined as ASCII 28-31.

The most common formats are CSV (Comma Separated Values) and tab delimited text.  Tab delimited text breaks when ever you have either a field with a tab or a new line in it, and CSV breaks depending on the implementation on Quotes, Commas and lines. Sadly Quotes, Commas and Tab characters are very common in text, and this makes the formats extremely bad for exporting and importing data.  There are some other formats such as pipe (|) delimited text, and whilst better in that | is less frequently used they still suffer from being printable characters that are entered into text, and worst of all people, when they…


Why do we use ASCII files in genomics?

This rant was triggered when I was going over the format for FASTQ files. This is a pure rant: I propose no good solutions for anything here. I’m not even angry – just bemused. I’m a novice, so there are probably good reasons for keeping ASCII files for raw data which I just don’t know about.

First, a short run down of the trigger: A FASTQ file stores “reads” from a genomic sequencing machine. The entire genome is not read at once, but rather as a set of short reads – short sequences of base pairs, like ACTGGTAC – which are then put together into a chromosome or genome using alignment programs.

Following convention, we store the data in plain text (ASCII encoded) files, where we use the actual letter (one byte) to represent the base (‘A’,’C’,’G’,’T’). It is easy to guess the background for this choice. I love plain text files – there is nothing to interpret here – you simply open them up in a text editor. 50-year-old files can be read this way. When people were fighting to sequence parts of the genomes of viruses and bacteria this was a convenient and practical choice. For example, you can write down the entire genome of the bacteriophage phiX174 on one page of your writing block.

The human genome has something like 3 billion base pairs. There are four bases, so each can be represented in 2 bits. When we get the sequence data from sequencing machines there is naturally some slop about the identity of each unit in the sequence. This uncertainty in the raw data is represented by a quality score, which could probably be quantized into a six bit space (the convention is 94 levels). Therefore, we can represent both the identity of a sequence element and its quality in one byte. Using this kind of binary bit field based representation, reads from the whole human genome could then be packed into a file of the order of a few GB. (And a quick search reveals the 2bit file format! But it uses masks and things because real data has low quality/unknown pieces which need extra bits to represent.)
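A quick back-of-the-envelope sketch of the 2-bits-per-base idea (ignoring the unknown/low-quality bases that formats like 2bit need extra masks for):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack a sequence of A/C/G/T into 2 bits per base, 4 bases per byte.
    Sketch only: a trailing chunk shorter than 4 bases just gets a short shift,
    so the original length would need to be stored separately."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

print(len(pack_2bit("ACGTACGT")))  # 2 -- eight bases fit in two bytes
```

At 2 bits per base, 3 billion bases pack into roughly 750 MB; add a byte per base for quality and you land in the ‘few GB’ range mentioned above.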

However, we still use the old text based format (FASTQ) to store data right from the sequencing machines. We do something funny with the FASTQ format: we keep the sequence data in human readable format but store the quality data (94 levels) using 94 printable but otherwise meaningless characters (the printable ASCII characters from ‘!’ to ‘~’).

These nonsense characters are actually the printed representations of the numbers 33 to 126, in order. This sleight of hand betrays our dilemma and our sneaky weaseling around it. Older formats kept human readable numbers for the quality values, but these would use three bytes (two for the digits and one for the gap between the numbers). This format saves us some space by reducing the quality code to one character (but if you read the FASTQ specs you will note some symbols in this scale have additional meanings as well, such as indicating the start of a read and so on).
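The sleight of hand is just `chr(score + 33)`: each quality score is stored as the printable character 33 places up the ASCII table. A minimal decoder:

```python
def decode_quality(qual_string):
    """Convert a FASTQ quality string (ASCII 33-126) back to numeric scores."""
    return [ord(ch) - 33 for ch in qual_string]

# '!' is ASCII 33 (score 0), 'I' is ASCII 73 (score 40)
print(decode_quality("!I"))  # [0, 40]
```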

We have here an admission that we no longer gaze at whole genomes letter by letter, bedazzled by the sight of our own source code, AND we need to save space, because, man, those genomes are big! And we are accumulating so many of them! But we go halfway. We COULD go all the way and compress everything into one byte, not bother with opening up the files in text editors and settle on a simple, well defined binary format.

But we don’t. Are we afraid we will lose the magic of the ancients? The ability to stare at the source code? I don’t think so. Humans are pretty good at learning symbols you know …