As part of an exercise I’m doing I was supplied with a FASTQ file which contains a set of reads and I needed to figure out how to get a VCF file out. A what what now? Exactly.
When we try to obtain the sequence of some DNA the big machine that we dump our DNA does not — contrary to what we have been trained by CSI:Boston* to believe — spit out the complete DNA sequence and tell us who the perpetrator of the crime is. (*What? There was no CSI:Boston? Every other city had one … )
When we run “The Machine” (some of which actually increasingly look like a cheapo laser printer from the 2000s) the long chains of DNA are shred into random little bits. The sequencing machine then looks at even smaller bits of these small shreds and gives us a list of these shreds of shreds which are called ‘reads’. If you are thinking what I’m thinking:
You are right.
You’ll be also thinking, why don’t we do this thing correctly and sequence the whole strand of DNA at one go instead of getting this pile of reads and how are we going to put it back together? Why did we take the document and then shred it? Couldn’t we hire a better intern? There is an interesting story behind this which we will look at another time, but the short answer – as always – is money (and time).
If you are thinking what I’m thinking:
You are right.
Long story short, the government spent about 9 billion dollars – roughly two Nimitz class aircraft carriers – and came up with the (almost) complete sequence of the human genome. People who do not have a cool 10 billion lying around take the shredder approach and get a big pile of reads. Then they buy or rent computers which run programs that take these shreds and put them together like so many puzzle pieces. The government’s sequenced genome is like the picture on the puzzle box. The computer software looks at the government’s genome and tries to figure out where each little piece goes and slowly builds it up together. This turns out to be much, much cheaper.
Of course, it turns out we are not all perfect clones of one another and the differences amongst us are reflected in differences in our DNA. Now you are thinking: Great! I have these little puzzle pieces AND the picture on the box is DIFFERENT from what the pieces add up to? That’s one of the challenges of Genomics which makes it fun and interesting. The differences also turn out to be very important for things like cancer and why we get it and why some of us respond to some medicines and others don’t.
Anyhoo – to get back on topic – often people take their big old pile of reads – which are a huge collection of short strips of DNA letters looking hopelessly similar to each other :
Read1: CCCTCCTGGGCGGTGGACATGATGAGATTCAATATTAATGACTTTCTTCCCCCAGGAGGGGGCTCAAACC Read2: CAGTGCTGTTATTCTAGATGATAACTTTGTAACAAAGGCCACAGCCCTCACCTATGACCCCTATGTAAAC ... CCAAGGAGGCGTTACCGGAGAAGAAGACACCGCCCCCGCAGCCATCTTGGCCAGATCCTCCGCCGCCGCC ... TTGAATACTACAGAATAAGAAAGGTTAAGGTTGAATTCTGGCCCTGCTCCCCGATCACCCAGGGTGACAG
and dump them into a semi-text file called the FASTQ format. I just love this format.
Anyhow, there are tools that take this FASTQ file and a reference sequence (the picture on the box) and align the reads (the shreds) to try and make the best fit to the reference. This alignment information is stored in a Sequence Alignment Map (SAM) file which is a text file. Someone got tired of lugging ginormous text files around and went home and developed the Binary Alignment Map (BAM) format which is the same as SAM, but in binary, which tends to save space and is faster to process and so on.
We then take this SAM/BAM file and try and compare it to the reference sequence. Any differences to the reference are then attributed to individual variation between the chap who gave the sample and the mysterious caped figure who gave the government their sample. These are called variants (or mutations – yes we are all mutants, compared to each other) and are stored, logically, in a Variant Call Format (VCF) file.
So this is my journey. I’m taking my FASTQ, getting a SAM/BAM and then generating a VCF. See, you talk pretty one day.
So I buttonholed Devin and asked him to suggest some tools I could start with. I’m sure Nate already told me this, but I have a short attention span. Devin suggested that I try out BWA and then GATK. I ended up starting with BWA and samtools which turned out to be sufficient for my toy program.
The tools compiled straightforwardly (by typing ‘make’) on my mac and were self contained binaries that could just be copied over to my local binary directory. They both have good command line help but some quirks. I’m sure there are many of them but in the example workflow below I annotated the ones I ran into.
The reference sequence is a string of nucleotide base symbols (GATC…). In order to do fast searching and matching on this string – which can be very, very long – programs first index the string, which basically means they create a look up table of sorts that allows them to matches to parts of the sequence more efficiently.
bwa index ref.fa
Now that we have the index, we can run the aligner – the workhorse program that pieces together the puzzle.
bwa mem ref.fa reads.fastq > aligned.sam
The next tools we will be using need binary formatted alignment maps, rather than text ones, so we convert the text data to a binary format. No new information is being added here. Genomics is a relatively newer field and we are kind of in a transition between text formats (which are easily human readable, but bloated) to binary formats, which humans can’t read directly, but which are more space and time efficient for processing large sequences (like the human genome).
samtools view -Sb aligned.sam > aligned.bam
Now, before we can do further processing on the aligned reads in the bam file, we need to index it (for the same reason we indexed the reference sequence). In order to index the file we need to sort it – the data has to be in order for the index to be efficient.
samtools sort aligned.bam aligned_sorted
Now we can index it
samtools index aligned_sorted.bam
Now, finally we can pull up everything – the reference sequence and the aligned reads – and see how they look like. You should see any variants (differences from the references) highlighted.
samtools tview aligned_sorted.bam ref.fa
Next we would like to print out these differences (variants) into the standard format (VCF).
samtools mpileup -uf ref.fa aligned_sorted.bam | bcftools view -bvcg - > var.raw.bcf bcftools view var.raw.bcf | vcfutils.pl varFilter -D100 > var.flt.vcf
As an aside, while samtools has a neat alignment viewer – and I really like the command line – I found Tablet to be pretty slick too.