Raw (unaligned), paired reads in a BAM file

Since BAM files are the binary version of SAM files, which in turn stand for Sequence Alignment/Mapping, its a little strange to store unaligned data (the raw reads from the sequencer) in a BAM. However, as eloquently argued by others, the text based FASTQ format is showing its age and an indexed binary file is a very tempting store for our big bucket of raw reads.

It turns out that storing unaligned reads in a BAM file in a way that will not break downstream tools is fairly easy – you simply set all information related to alignment to be unavailable.

Storing paired reads in FASTQ files is an after thought with a bunch of ad hoc solutions, including storing the reads in pairs of files (labeled 1 and 2) or as an interleaved file, where the read pairs are stored one after the other. In a BAM file, on the other hand, there is a very systematic way of indicating paired reads

You simply:

  1. Set FLAG bit 0x1 (template having multiple segments)
  2. Set FLAG bit 0x40 if the read is closer to the 5′ end, or 0x80 if the read is from the 3′ end
  3. Set the TLEN value to the expected length of the template.
  4. Make sure that the QNAME of the reads match.

(It took me a while to catch on that ‘template’ meant the shred of DNA these reads come from)

A toy script that uses PySAM to dump a valid BAM file representing unaligned paired read is:

import pysam

header = { 'HD': {'VN': '1.4'},
'SQ': [{'LN': 702, 'SN': 'chr1'}] } #You need to indicate the length of the reference and give

outfile = pysam.Samfile( 'test.bam', "wb", header = header )
for n in range(10):
  a = pysam.AlignedRead()
  a.qname = "r{:d}".format(n)
  a.seq="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
  a.flag = 0x41 #end1 0x01 flag has to be set to indicate multiple segments
  a.tlen = 100
  outfile.write(a)

for n in range(10):
  a = pysam.AlignedRead()
  a.qname = "r{:d}".format(n)
  a.seq="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
  a.flag = 0x81 #end2 0x01 flag has to be set to indicate multiple segments
  a.tlen = 100
  outfile.write(a)

outfile.close()
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s