Raw (unaligned), paired reads in a BAM file

Since BAM files are the binary version of SAM files, and SAM in turn stands for Sequence Alignment/Map, it's a little strange to store unaligned data (the raw reads from the sequencer) in a BAM. However, as eloquently argued by others, the text-based FASTQ format is showing its age and an indexed binary file is a …


SAM! BAM! VCF! What?

As part of an exercise I'm doing, I was supplied with a FASTQ file containing a set of reads, and I needed to figure out how to get a VCF file out of it. A what what now? Exactly. When we try to obtain the sequence of some DNA the big machine that we dump our DNA does …

Text File formats – ASCII Delimited Text – Not CSV or TAB delimited text

One issue with doing this is that, by including non-printing characters, we are breaking our ‘readable using a simple text editor’ pledge. When we open this file in a regular text editor we will not get the nice alignment of units and records that commas and tabs give us. (OK, I exaggerated the readability of CSV and tab-delimited files.)

Ronald Duncan's Blog

Unfortunately, a quick Google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the separator characters defined as ASCII 28-31 (the file, group, record and unit separators).
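To make the idea concrete, here is a minimal Python sketch of writing and reading ASCII delimited text. It assumes the usual reading of the ASCII control characters, with 31 (unit separator) between fields and 30 (record separator) between records; the function names `encode` and `decode` are made up for this illustration and do not come from any library.

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # ASCII 31: separates fields (units) within a record
RECORD_SEP = "\x1e"  # ASCII 30: separates records


def encode(rows):
    """Join rows of string fields into one ASCII-delimited string.

    Hypothetical helper: assumes no field contains the separators themselves.
    """
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def decode(text):
    """Split an ASCII-delimited string back into rows of fields."""
    if not text:
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Fields containing commas, quotes, tabs and newlines need no escaping.
rows = [
    ["name", "comment"],
    ["Smith, John", 'said "hi"\tthen left\nfor the day'],
]
assert decode(encode(rows)) == rows
```

The round trip works without any quoting rules because the separators are characters that essentially never appear in ordinary text. The trade-off is that a plain text editor will not display them in a readable way.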

The most common formats are CSV (Comma Separated Values) and tab-delimited text. Tab-delimited text breaks whenever a field contains a tab or a newline, and CSV breaks, depending on the implementation, on quotes, commas and line breaks. Sadly, quotes, commas and tab characters are very common in text, which makes these formats extremely bad for exporting and importing data. There are some other formats such as pipe (|) delimited text, and whilst these are better in that | is less frequently used, they still suffer from relying on printable characters that are entered into text, and worst of all people, when they…


Why do we use ASCII files in genomics?

This rant was triggered when I was going over the format for FASTQ files. This is a pure rant: I propose no good solutions for anything here. I'm not even angry - just bemused. I'm a novice, so there are probably good reasons for keeping ASCII files for raw data which I just don't know about. First, …