Use pysam

This is a plug for Pysam: a well designed and feature rich wrapper for HTSlib. I’ve used pysam for a few years now and it has gotten better with time. It is now a very efficient and convenient library for reading and writing BAM files, reading VCF, FASTQ and FASTA files. For VCF and FASTA files it can use the tabix index to read regions.

I first used Pysam to read BAM files, which was the most feature complete part when I started. I started to use it for writing BAM files when the quirks were ironed away. I used to use PyVCF for reading use my own (inefficient) reader for FASTA and FASTQ files.

In it’s current state (0.9.1.4) Pysam is a very effective solution to reading, in addition, VCF, FASTA and FASTQ files. It transparently handles gzipped files and BCF files, so you don’t have to keep track of the file type yourself.

Not only is it able to use the VCF and FASTA indices it also returns the data nicely encapsulated. My only complaint is that I feel the VCF record format is a tad over-complicated (it uses a similar interface to PyVCF). I always end up with very complicated accessor expressions when I wish to get alleles and genotypes from a record.

What I really liked about the VCF reader is that it intelligently handles the case of deletions crossing a region boundary. So for example, if you ask for variants in the range 10 to 20 and there is a 5 base deletion starting at position 8, it will be captured because it crossed into the requested regions. It’s very nice not to have to write your own logic to handle such cases.

I’ve also found VCF reading efficient enough for my purposes. When reading a single sample from multi-sample (population) VCF files it helps to set the filter, so that only the required sample is loaded.

Don’t hang up!

I started using multiuser Unix systems in graduate school and I quickly learned to love 'nohup'. It let you do the impossible thing: you logged into a machine, started a long computation and then logged out! You didn’t have to hang around guarding the terminal till 2am. You could go get some sleep, hang out with friends, whatever.  This post is a short survey of ‘nohup’ like things that I’m still using and learning about, a decade and a half later.

Screen: For the longest time nohup <some command> & was my staple. I’d package all my code so that it was non-interactive, reading instructions from a parameter file, and writing out everything to data and log files, fire it off and come back the next day. Then someone introduced me to ‘screen‘. This was amazing. It let me reattach to a terminal session and carry on where I left off, allowing me to keep sessions open as I shuttle between work and home. I like doing screen -S <a name> to start sessions with distinct names.

Mosh: The one thing about screen is that I have to ssh back into the machine and start it up again. Mosh is a client server system that allow roaming, which means in practice, I can open a terminal, work on it, send my computer to sleep, work on the train with no WiFi, get into the office, and when my computer finds a WiFi again, Mosh takes up where it last left off, seamlessly – except for a little glitching on the display. Mosh doesn’t quite replace screen though – if I reboot my laptop or shutdown the terminal app, I lose the Mosh client. The session keeps running, but I can no longer interact with it.

reptyr: Ok, this is a fun one. Now suppose you have a running process that you started in a normal ssh session. Then you realize, this is actually a long running process. Oh great, the flashbacks to the late night terminal guarding sessions in grad school come back. There’s a program caller reptyr that lets you hop a running process from one terminal to another. What you want to do in this case, is start up a screen session and then invoke reptyr from the screen session.

Unfortunately I’ve not been able to get it to work properly – I always get

Unable to attach to pid 11635: Operation not permitted
The kernel denied permission while attaching. If your uid matches
the target's, check the value of /proc/sys/kernel/yama/ptrace_scope.
For more information, see /etc/sysctl.d/10-ptrace.conf

In my original terminal the program just stops and hands me back my command prompt. Nothing happens in my new terminal, but I can see that the program is running. It’s a bit like doing disown, which in turn is like doing nohup retroactively – your process can keep running, but there is no terminal. Fortunately you redirected stderr to a file by doing 2> err.txt, right? right?

Disassembly

Adventures in functional programming

Reading through Let over Lambda I ran into several instances where the author showed the disassembly of a bit of Lisp code. This struck me as awesome for several reasons and I wanted to do it myself. I initially thought I had to dig up a disassembler for my system (as I would have done it for C/C++) and I was blown away when I learnt that  there was a Lisp command for this!

I found getting the disassembly amazing because I didn’t think of Lisp – a dynamically typed high level language – as generating succinct machine code. I knew it could be compiled, but I expected the compilation to be something very messy, a bit like the C++ code that Cython generates for Python without the types put in – full of instruments to infer types at run time. Instead, what I saw was tight machine code that one…

View original post 186 more words

Rip Van C++

After a long hiatus, probably about a decade, I once again began coding in C++. It wasn’t hard to get back into the stride but I am surprised at how different modern C++ is, and how pleasurable it is to write. In no particular order here are the things that surprised me.

No more “this->” in class functions. I discovered this by accident when I forgot to add the “this->” qualifier in front of an instance variable and the code still compiled. I’m not sure if I like this – but it sure as hell is convenient.

CATCH. Man! One of my big draws for Python is how rapidly I can set up tests and run them. It basically made parallel testing and development (I’m sure the kids have some kind of cool name for it nowadays) possible for me. I remember with agony how it was to setup a test suite for C++ code. CATCH makes it almost as easy to write tests for C++. And you can break into the debugger. I think this settled the matter for me.

std::random. No more hanging out in dark alleys, setting up rendezvous with shady characters to get your random numbers. It’s right there, in the standard library! And you can seed it and everything!

CMAKE! Felix sent me a small self contained example, sat with me for a few minutes and that was enough to get me up and running. I do NOT miss creating makefiles by hand. Cmake appears to be very logical.

std::threads – I haven’t actually used this yet, but after looking through several examples it’s simplicity rivals that of Python’s multiprocessing pool. We’ve come a long way baby.

Contrary to what you might think, I’m not that excited by the auto keyword or by the (optional) funky new way you can write loops. I still have to get used to the lambda syntax – I think Python gets it just right – straight and simple.

Update: OK, I really like the:

for(auto x : <some collection>) {
   // process x
}

syntax. It’s pretty slick.

I was also amused by the difference between, I guess you would call it culture, for C++ and Python. I was looking for a logo for this post and, firstly, did not really find an official one, and secondly, landed on the standards page:

Screen Shot 2016-02-18 at 6.11.02 AM

which stands in contrast to the first hit for “Python” which is the official one stop shop for all your Python needs:

Screen Shot 2016-02-18 at 6.10.48 AM

Interestingly, not six months ago, I think I took a stab at getting back into C++, for a personal project, I think. And I had an adverse reaction. Nothing has really changed with C++ since then, so I guess this time round I actually stopped to ask for help and my kind colleagues pointed me in the right direction.

 

Ex Machina

A movie review with some fun discussion of the Turing test.

Spoiler Alert!

The thing I like the most about Ex Machina is that one of the protagonists not only gives us a proper definition of the Turing test, but he also describes a delicious modification of the Turing test that took me a second to savor fully. The plot also twists quite nicely in the end, though we kind of see it coming.

Movies about machines that replicate human thoughts and emotions pop up periodically. In most of these movies the machines win. Ex Machina is one of the better ones of this genre. Particularly satisfying is that the techno babble in the script touches on some advanced topics in machine intelligence.

To start with, there is a good definition of the Turing test. People make a lot of fuss about the Turing test and take it quite seriously and literally. The Turing test, to me, is basically is an admission that, when people…

View original post 1,046 more words

The three switches problem

The light bulb problem is a class of problem often presented at interviews as a test of lateral thinking. Here, in a mischievous spirit taken from the best subversive traditions we will attack this question by thinking a little too laterally …

There are different versions of the light bulb problem, but they rely on the same trick. Here is one version:

Your grand uncle lives in a rambling old house. He had an electrician over to wire a light in the attic but the electrician messed with the wiring and set up three switches instead of one. Your grand uncle wants to figure out which of the switches turns on the light in the attic. Problem is, you can’t tell from below whether the attic light is on or off. You can fiddle with the switches and then go up to the attic to inspect the light, but you can only do this once. The old coot says you can’t keep going back down and up the attic stairs fiddling with the light – apparently it disturbs the cat and he scratches the furniture.

The conventional answer to the question is to flip one switch on and leave it for a bit. Then turn it off and turn on another then go up to the attic. If the light is on, the last switch you flipped was the one. If the light is warm (Aha!) then the first switch you flipped is the one. If the light is off and cold, well is the remaining one. Let’s try and subvert the conventional answer to this problem by engaging in some mild subterfuge in the spirit of Calandra‘s “Angels-on-a-Pin“.

  1. Rapidly flip one of the switches on and off many times. Finally set this switch to the off position and flip another switch to on. Now go to the attic. If the light is on, it’s the currently on switch. If the light is burnt out, its the first switch.

  2. Call in your grand aunt, explain the problem to her, crawl up to the attic and have her flip the switches one by one. Yell when the light comes on.

  3. Flip a switch and inspect all the fixtures in the house, repeat and eliminate two of the switches. The third one is the attic switch. If none of the switches seem to be doing anything, where’s the problem? Just hit all three switches when you need the attic light on.

  4. Wait for a moonless night. Turn off all the lights in the house and draw all the curtains and blinds. Now flip the switches one by one. There will be a difference in the light level, however slight, from the attic light going on.

  5. Fly a drone up into the attic. Now you have eyes on target. Screw the cat.

  6. Take all bulbs out of their sockets and unplug all appliances. Now flip the switches one by one and watch the electric meter. This may take some patience.

  7. Open up the cover plate. Take a voltmeter and measure the drop in voltage as the switches are flipped. If only one has a voltage drop, that’s the one wired to the attic switch. If more than one have drops, throw stones into the attic until the light bulb is broken. Redo the measurements. The switch that now has no voltage drop (but did previously) is the one. Take a new bulb, go up into the attic and replace it.

  8. Flip each switch in turn and let the bulb warm up. Throw a bucket of water into the attic. If you hear a loud “POP” the bulb was on. Leave adequate cool down periods between tests. Take a new bulb, go up into the attic and replace it.

  9. Tie three strings to the switches. Go up into the attic with the strings. (Some will say this is subverting the question. Yes it is. Yes it is)

  10. Get all the shiny pots and pans your grand uncle has. Arrange them like mirrors so that you have a view of the attic from the switches.

As an aside, here is a problem I like better, as solving it actually teaches you some mathematical concepts:

Your grand uncle has passed on. In his will he’s bequeathed you an antique coin of immense value (so he says). He’s also left you with eight accurate forgeries of that coin that are indistinguishable from the original, except that they are just a fraction lighter. He’s also left you a balance, which can only be used twice before it breaks. So he says.

You are only allowed to take one of the coins with you. The original coin is priceless, the forgeries are worthless. So he says. You suspect all the coins are duds and the balance won’t break, but you take your Grand uncle’s will in the spirit it was meant, as a last fun puzzle to keep you busy so you don’t get too sad at the old codger’s passing.

Text File formats – ASCII Delimited Text – Not CSV or TAB delimited text

One issue of doing this is, by including non-printing characters, we are breaking our ‘readable using a simple text editor’ pledge. When we open this file in a regular text editor we will not get the nice alignment of units and records that commas and tabs give us. (Ok I exaggerated on the readability of cvs and tab delimitated files)

Ronald Duncan's Blog

Unfortunately a quick google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format.  ASCII Delimited Text should use the record separators defined as ASCII 28-31.

The most common formats are CSV (Comma Separated Values) and tab delimited text.  Tab delimited text breaks when ever you have either a field with a tab or a new line in it, and CSV breaks depending on the implementation on Quotes, Commas and lines. Sadly Quotes, Commas and Tab characters are very common in text, and this makes the formats extremely bad for exporting and importing data.  There are some other formats such as pipe (|) delimited text, and whilst better in that | is less frequently used they still suffer from being printable characters that are entered into text, and worst of all people, when they…

View original post 183 more words