This rant was triggered when I was going over the format for FASTQ files. This is a pure rant: I propose no good solutions for anything here. I’m not even angry – just bemused. I’m a novice, so there are probably good reasons for keeping ASCII files for raw data which I just don’t know about.
First, a short run down of the trigger: A FASTQ file stores “reads” from a genomic sequencing machine. The entire genome is not read at once, but rather as a set of short reads – short sequences of base pairs, like ACTGGTAC – which are then put together into a chromosome or genome using alignment programs.
Following convention, we store the data in plain text (ASCII encoded) files, where we use the actual letter (one byte) to represent the base (‘A’,’C’,’G’,’T’). It is easy to guess the background for this choice. I love plain text files – there is nothing to interpret here – you simply open it up in a text editor. 50 year old files can be read this way. When people were fighting to sequence parts of the genomes of viruses and bacteria this was a convenient and practical choice. For example you can write down the entire genome of the bacteriophage phiX174 one one page of your writing block.
The human genome has something like 3 billion base pairs. There are four base pairs which can therefore be represented in 2 bits each. When we get the sequence data from sequencing machines there is naturally some slop about the identity of each unit in the sequence. This uncertainty in the raw data is represented by a quality score which can probably be quantized into a six bit space (the convention is 94 levels). Therefore, we can represent both the identity of a sequence element and its quality in one byte.Using this kind of binary bit field based representation reads from the whole human genome could then be packed into a file of the order of a few GB. (And a quick search reveals the 2bit file format! But it uses masks and things because real data has low quality/unknown pieces which need extra bits to represent.)
However, we still use the old text based format (FASTQ) to store data right from the sequencing machines. We do something funny with the FASTQ format: we keep the sequence data in human readable format but store the quality data (94) levels using 94 printable but otherwise meaningless characters:
You will find that these nonsense characters are actually the printed representations of the numbers 33 to 126 in order. This sleight of hand betrays our dilemma and our sneaky weaseling around it. Older formats kept human readable numbers for the quality values, but these would use three bytes (two for the digits and one for the gap between the numbers). This format saves us some space by reducing the quality code to one digit (but if you read the FASTQ specs you will note some symbols in this scale have additional meanings as well, such as indicating the start of a read and so on).
We have here an admission that we no longer gaze at whole genomes letter by letter, bedazzled by the sight of our own source code AND we need to save space, because, man, those genomes are big! And we are accumulating so many of them! But we go half way. We COULD go all the way and compress everything into one byte, not bother with opening up the files in text editors and settle on a simple, well defined binary format.
But we don’t. Are we afraid we will lose the magic of the ancients? The ability to stare at the source code? I don’t think so. Humans are pretty good at learning symbols you know …