The masked chromosome

A tale of grand theft genome and the mysterious origins of the Y chromosome …

I was looking at the sequence of the Y chromosome and trundling merrily along:


Whaaa? Did someone fall asleep at the wheel? Did the government shut down in the middle of sequencing the Y chromosome?

You know that the DNA sequences of chromosomes are written in a four-letter alphabet, ACGT, whose letters are called bases. You also know that we have to figure out the sequence of bases in each chromosome through great effort. When we can’t figure out what a base is (but we know there should be a base there) we put in an ‘N’. Typically, chromosomes start with a long string of Ns because the ends of chromosomes often contain parts that are difficult to sequence.

However, I ran into this string of Ns bang in the middle of chromosome Y. And it did not stop. It went on and on and on. It took up the entire second half of the Y chromosome! Did I get a bad file? Did I somehow download an older version of the genome?
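
If you want to hunt for runs like this yourself, here is a minimal Python sketch. It assumes the chromosome has already been loaded into a plain string; the function name `find_n_runs` and the toy sequence are made up for illustration and are not part of any genomics library.

```python
import re

def find_n_runs(sequence, min_length=1):
    """Return (start, end) pairs for each run of Ns.

    Positions are 0-based and end-exclusive, like Python slices.
    Lower-case 'n' is counted too, since some files use it.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    return [(m.start(), m.end()) for m in pattern.finditer(sequence)]

# Toy sequence: Ns at the start, then a longer run in the middle.
toy = "NNNNACGTACGTNNNNNNNNACGT"
for start, end in find_n_runs(toy, min_length=4):
    print(start, end, end - start)
```

On a real chromosome you would read the sequence from the FASTA file first; the run lengths then tell you at a glance whether you are looking at a small gap or a missing half.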

I fired up IGV – a genome browser developed by the Broad Institute – and took a look at what it had to say. Nope, even there, somewhere halfway through, the Y chromosome disappears into a sea of Ns.

[Screenshot: chromosome Y in IGV]

I went over to the UCSC Genome Browser (University of California, Santa Cruz) and looked at the most recent sequence data for chromosome Y.

[Screenshot: chromosome Y in the genome browser]

Nope! See that middle part where the black blocks disappear, only to reappear right at the end (right hand border)? The place where the blocks disappear is the missing part of chromosome Y.

Then it hit me. Of course. It was obvious. Somebody had stolen half of Chromosome Y from right under our noses! In a panic I dialed up my colleagues at Seven Bridges. Devin picked up first.

“The pseudoautosomal region (PAR) of Y is commonly masked out with Ns because it closely matches the X chromosome in this region and would result in two very similar sequences within the reference if left unmasked,” he said after hearing me tell him about grand theft genome.

The what-what-in-the-what-now?

“The. Pseudoautosomal. Region …” he began to repeat, slowly and distinctly. After some web searching and reading of Wikipedia pages I pieced together the story behind that string of ‘N’s in the Y chromosome of the human reference genome.

The story starts about 160 million years ago. As you know, we have two matching copies of every chromosome except for the 23rd pair, the sex chromosomes X and Y. Females have two copies of X (XX) but males have one regular copy of X and a strange, shorter chromosome called Y. Around 160 million years ago, one of our great-great-great-great … great grandfathers (whom, if we met today, we would not recognize, him being a different species and all) obtained, through dubious means, a mutated copy of an X chromosome.

Our current conjecture is that simply possessing this mutation ensured that the individual would develop into a male, and so this mutation turned that copy of the X chromosome into a sex-determining Y chromosome, passed down from father to son for millions of years. The Y chromosome is among the fastest-changing chromosomes in the genome: the human Y differs from the chimpanzee Y by about 30% (the genome as a whole, in comparison, differs by only a percent or two between humans and chimps).

However, most relevant to us here is the fact that some parts of the Y chromosome are virtually identical to the X chromosome because of their shared origins. These are called pseudoautosomal regions (PARs). When the Y chromosome was sequenced, it was decided not to sequence the PARs and instead to copy the corresponding (homologous) sections from the X chromosome sequence.

Now, if you recall, when we analyze genetic data (DNA) we don’t actually get the sequences of each chromosome as complete wholes. Rather, we get short DNA sequences (typically 100 letters long), called reads, which come from all the chromosomes in our sample together in one big pile. We then use a program, called an aligner, that takes these reads and matches them against the chromosomes in the reference sequence.
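
To make the idea concrete, here is a deliberately naive sketch of what an aligner does. It only finds exact matches by brute-force string search; real aligners build an index of the reference and tolerate mismatches and gaps. The function `align_read` and the tiny "chromosomes" are invented for this example.

```python
def align_read(read, reference):
    """Return every (chromosome, position) where the read matches exactly.

    `reference` is a dict mapping chromosome names to sequence strings.
    Positions are 0-based.
    """
    hits = []
    for name, seq in reference.items():
        pos = seq.find(read)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(read, pos + 1)  # keep looking for later matches
    return hits

# Made-up reference with two short "chromosomes".
reference = {
    "chr1": "ACGTTGCAAGGCTTAC",
    "chr2": "TTGACCGTAGGATCCA",
}

# A pile of reads with no labels saying where they came from.
for read in ["GCAAGG", "GTAGGA", "CCCCCC"]:
    print(read, align_read(read, reference))
```

Each read either finds a home on one of the chromosomes or, like the last one, matches nowhere. The interesting case is a read that matches in more than one place.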

Now you can see we have a problem with the X and Y chromosomes. In our reference sequence we have parts of X and Y which are identical. An aligner cannot tell which of the two copies a read from these parts came from, so it will often take a read which came from a PAR of the X chromosome and place it on the Y chromosome. To avoid this, the PARs on the Y chromosome are often hard-masked – filled with Ns.
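
Here is a toy illustration of why hard-masking helps, again as a sketch rather than how any particular tool does it. The sequences, the helper names `hard_mask` and `chromosomes_matching`, and the 12-base "PAR" are all made up; the matching is exact string search, which real aligners do not rely on.

```python
def hard_mask(sequence, start, end):
    """Replace sequence[start:end] with Ns, keeping the length unchanged."""
    return sequence[:start] + "N" * (end - start) + sequence[end:]

def chromosomes_matching(read, reference):
    """Names of the reference sequences that contain the read exactly."""
    return sorted(name for name, seq in reference.items() if read in seq)

shared = "ACGTTGCAAGGC"  # made-up stand-in for a PAR, identical on X and Y
reference = {
    "chrX": shared + "TTTTAAAACCCC",
    "chrY": shared + "GGGGCCCCAATT",
}
read = "TTGCAAGG"  # a read that falls inside the shared region

# Unmasked: the read fits both chromosomes equally well.
print(chromosomes_matching(read, reference))

# Mask the shared region on Y only; coordinates on Y stay the same.
masked = dict(reference, chrY=hard_mask(reference["chrY"], 0, len(shared)))
print(chromosomes_matching(read, masked))
```

Because the Ns replace the bases one for one, everything downstream of the masked region on Y keeps its original position, and reads from the shared region now have only one place to land.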

(A very succinct description of the rationale for doing this is given here if you are interested)

So that is how the masked chromosome came to be. I love genomics. You ask a simple question and you discover stuff that started 160 million years ago!