Sleep on it

I was pretty annoyed. I had installed a program into a docker container, invoked the container via our cloud computing platform, and it was garbling my output files – with the exact same inputs and parameters I had just tested on my local machine.

At first, I didn’t realize anything was wrong. The program completed without complaints, and the log file reported a result of the same size as the one I had obtained locally. However, when I fed the output file into another tool, that tool failed with an esoteric, very non-informative error message (yay bioinformatics tools) – but after seeing this message several times I’ve come to understand that it usually means the input is truncated.

Sure enough, when I downloaded the file from the platform onto my local machine and gunzipped it, gunzip ran for a while before complaining of a truncated file.

Now, I had just come out of dealing with an annoying issue related to running docker containers on our platform. When the containers run, they run under a non-interactive, non-login shell. This leads to subtle differences in how many things work – which will take up a different post – but basically I was in a foul mood already.

The code I actually wanted to run looked like this

my-program \
  input-file \
  >(gzip > outputfile1) \
  >(gzip > outputfile2)

It turns out that the non-interactive shell on the ubuntu base image I used for my tool image is dash, rather than bash, and dash doesn’t support process substitution; it complains sh: 1: Syntax error: "(" unexpected

To get around this expediently I wrapped my program in a bash script, my-script.sh, and invoked it on the platform as bash my-script.sh. This worked, but now my file was getting truncated.

Same code, same data, same docker image. The container executed correctly on my Mac, but corrupted my file on the AWS Linux machine. I cross-checked that I was closing my file in my code, read up a bit about flushing (close does it) and fsync (named pipes don’t have it – they don’t touch disk), and confirmed that Python will raise an exception if something goes wrong with the write.

After some more thrashing about like this I suddenly wondered about the multiple processes involved here. By doing process substitution, I had created two new processes – both running gzip – that were sucking up data from pipes connected to the process my main program was running in. It was quite possible that the two gzip processes were finishing slightly after my main process was done. Now suppose, just suppose, the original bash shell that I was invoking to run my program was terminating as soon as my main program was done, without waiting for the gzip processes? That would lead to exactly this kind of file truncation.

I put in a wait in the shell script, but this did not make any difference. I wasn’t explicitly sending any processes to the background, so this was perhaps expected. Then I added a sleep 2s command to the script:

my-program \
  input-file \
  >(gzip > outputfile1) \
  >(gzip > outputfile2)

sleep 2s # <--- BINGO!

and indeed my files were no longer being truncated. So what was happening was this: the parent shell had no way of ‘realizing’ that the two gzip subprocesses were still running, so it exited, the docker container shut down, and my gzips were killed just before they were done writing the last bit of data.
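In hindsight, a less hacky fix is to create the named pipes and the gzip processes explicitly, so that they become ordinary background jobs the shell can wait on. A sketch, using the same placeholder file names as above:

#!/bin/bash
# Create the intermediate pipes ourselves instead of relying on >( ... )
mkfifo pipe1 pipe2

# Start the compressors as explicit background jobs the shell knows about
gzip < pipe1 > outputfile1 &
gzip < pipe2 > outputfile2 &

# The program writes to the pipes as if they were ordinary files
my-program input-file pipe1 pipe2

wait            # now this really does block until both gzips have exited
rm pipe1 pipe2

With this arrangement the wait that had no effect earlier does exactly what we want, because the gzip processes are regular children of the script’s shell.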

Ah, the joys. The joys.

Some notes on using Python multiprocessing

This is a short note describing a common code pattern useful for parallelizing some computations using the Python multiprocessing module.

Many problems are of the embarrassingly parallel type, where the task consists of the same set of computations done independently on a large set of input data. In many languages such problems are solved using multi-threaded code. Due to events too complicated and embarrassing to mention, Python doesn’t do too well with multiple threads, and we need to use multiple processes via the multiprocessing module instead.

A common pattern is to set up a number of workers and use queues to pass data back and forth. The pattern looks like this:

[Figure: the multiprocessing queues pattern – a parent process feeding worker processes through an input queue and collecting their results through an output queue]

The use of this pattern is relatively straightforward, but there are some things to pay attention to.
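As a reference point for the notes below, here is a minimal sketch of the pattern (the worker computation and the queue sizes are made up for illustration):

import multiprocessing as mp

STOP = None  # sentinel telling a worker there is no more work

def worker(in_q, out_q):
  # Pull tasks until the sentinel arrives, push results back
  for task in iter(in_q.get, STOP):
    out_q.put(task * task)  # stand-in for the real computation

if __name__ == '__main__':
  in_q, out_q = mp.Queue(maxsize=10000), mp.Queue(maxsize=10000)
  workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(4)]
  for p in workers:
    p.start()

  tasks = range(100)
  for t in tasks:
    in_q.put(t)        # blocks if the queue is full, capping memory use
  for _ in workers:
    in_q.put(STOP)     # one sentinel per worker

  results = [out_q.get() for _ in tasks]
  for p in workers:
    p.join()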

Sharing read only data

The general topic of sharing data between processes is worthy of deeper discussion, and there are some pointers in the Python documentation, but there is an easy way to share read-only data between processes – use globals that are written to in the parent process BEFORE the child processes are started.

If you declare a global variable and then modify it after the child processes are started, or if you modify the global from a child process, you’ll enter the twilight zone. There will be no errors, and from within the process doing the writing everything will look consistent, but the rest of the universe will not pay any attention to your writes.

Queues have overhead – the object has to be pickled by the sender and then un-pickled by the receiver, and while the object is in the queue, it’s taking up memory. If you have a large amount of read-only data that needs to be shared amongst child processes – like a lookup table of primes, for example – this pattern of using globals can be pragmatic.
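A minimal sketch of the globals trick (this relies on fork semantics, the default on Linux; the lookup table here is just an illustration):

import multiprocessing as mp

LOOKUP = {}  # read-only data, filled in by the parent BEFORE the children start

def worker(q):
  # Children inherit a copy of LOOKUP at fork time; reads are fine, but any
  # writes made here would be invisible to the parent and the other children
  q.put(sum(LOOKUP.values()))

if __name__ == '__main__':
  LOOKUP.update({n: n * n for n in range(1000)})  # written before start()
  q = mp.Queue()
  workers = [mp.Process(target=worker, args=(q,)) for _ in range(4)]
  for p in workers:
    p.start()
  print([q.get() for _ in workers])
  for p in workers:
    p.join()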

 

Explicitly set Queue size

On Mac OS X the default size of the queue is 32767 items. On Linux, the default size is … INFINITY. I learned this the hard way. I usually prototype my code on my Mac, and then deploy to a Linux container that runs god knows where, on data that is several orders of magnitude larger than my prototype set. On Linux things SEEMED to run fine, but the size of the result set was abnormally small compared to the input. After a lot of poking around I discovered this property of the Python implementation: on Linux my queues got larger and larger, taking up more and more RAM, until the OS started to kill off the child processes. Setting an explicit queue size makes the producing processes block when the queue becomes full, capping memory usage.
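Concretely (the cap of 10000 items is arbitrary – tune it to your memory budget):

import multiprocessing as mp

# put() blocks once 10000 items are waiting, instead of letting the
# queue grow without bound as it would on Linux by default
result_queue = mp.Queue(maxsize=10000)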

Wrap the child process function using traceback

Error messages from child processes are messy and uninformative. Importantly, the friendly Python stack trace printout is missing. A fairly convenient pattern is to wrap the actual function call in a try-except block in a thin outer function, using the traceback module, like so:

import traceback

def func_w(args):
  """A thin wrapper to allow proper tracebacks when things go wrong in a child process

  :param args: dict of keyword arguments passed on to the real work function func
  :return: whatever func returns
  """
  try:
    return func(**args)
  except Exception:
    traceback.print_exc()  # print the full stack trace from inside the child
    print('')
    raise                  # re-raise so the failure still propagates

Check for normal termination from the child

One of the reasons it took me so long to track down the above-mentioned memory error was that when the child processes were being killed by the OS, no Python error was raised. The principled way to check for normal termination is to look at the exitcode value of the process.
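For example (a sketch, assuming workers is the list of Process objects from the pattern sketched earlier):

for p in workers:
  p.join()
  if p.exitcode != 0:
    # A negative exitcode is the signal number, e.g. -9 if the OOM killer sent SIGKILL
    raise RuntimeError('Worker {} died with exit code {}'.format(p.pid, p.exitcode))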

When not to use this pattern

Passing data via queues can be expensive in terms of computation and memory. If your individual workers have to pass a lot of data back and forth – say large submatrices of an even larger matrix – it may be neater and faster to use the shared memory objects supplied by the multiprocessing module, which take care of the semaphores for you.
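A minimal sketch of that alternative, using a flat shared array of doubles (the sizes and the ‘computation’ are made up):

import multiprocessing as mp

def worker(arr, start, stop):
  # Work directly on the shared buffer instead of pickling chunks through a
  # queue; get_lock() guards the multi-element update
  with arr.get_lock():
    for i in range(start, stop):
      arr[i] += 1.0

if __name__ == '__main__':
  arr = mp.Array('d', 10000)  # shared, lock-protected array of doubles
  ps = [mp.Process(target=worker, args=(arr, i * 5000, (i + 1) * 5000))
        for i in range(2)]
  for p in ps:
    p.start()
  for p in ps:
    p.join()
  print(arr[0], arr[9999])    # both halves were updated in place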

Named pipes, process substitution and tee

Most people using bash are familiar with piping: the ability to direct the output of one process to the input of another, allowing the second process to work on data as soon as it becomes available. What is not so well known is that piping is a particular case of a more general and powerful inter-process communication system bash gives us.

The basic bash piping flow looks like:

[Figure: cmd1’s stdout piped into cmd2, whose output is redirected to a file]

In this setup cmd1 writes to stdout and the pipe operator (|) redirects this to cmd2, which is set up to accept input from stdin and which finally either writes directly to a file, or writes to stdout again, which is then redirected to a file. The bash command would look like

cmd1 | cmd2 > file

There are two great advantages to this kind of data flow. Firstly, we skip writing data from cmd1 to disk, which saves us a potentially very expensive operation. Secondly, cmd2 runs in parallel with cmd1 and can start processing cmd1’s results as soon as they become available.

(This kind of structure is of course restricted to a data flow that can work in chunks, i.e. cmd2 does not need the whole output of cmd1 to be written before starting work.)

Now suppose that our first program produces multiple output files and so cannot simply write to stdout. Our second program, similarly, accepts two files as input and produces two files as output.

[Figure: cmd1 writing two output files, which cmd2 reads as its two inputs before writing two output files of its own]

Here cmd1 expects to write two streams of data to two files, whose names are represented by the first and second arguments to cmd1. cmd2 expects to read data from two files and write data to two files, all represented by four positional arguments.

Can we still use pipes? Can we still get parallelization for free if cmd1 and cmd2 can work in chunks? It turns out that we can, using something called named pipes. The above data flow, in bash, can be executed as follows

mkfifo tf1
mkfifo tf2
cmd1 tf1 tf2 &
cmd2 tf1 tf2 file1 file2

The fun thing here is the mkfifo command, which creates named pipes, also called First In First Out (FIFO) structures. After executing mkfifo tf1 and mkfifo tf2, if you do a directory listing you will see two odd files named tf1 and tf2. They have no size and their permissions flag string starts with ‘p’.

ls -al 
prw-r--r--   1 kghose  staff      0 Oct 26 21:35 tf1
prw-r--r--   1 kghose  staff      0 Oct 26 21:35 tf2

Whenever a process asks to write to or read from these ‘files’ the OS sets up an in-memory buffer. The process reading from the buffer blocks until data is placed in the buffer. The process writing to the buffer also blocks until it can be sure there is a receiving process at the other end. Once both ends are connected, an in-memory transfer of data occurs from the writing process to the reading process, never touching the disk.

To me the amazing thing is that the original programs cmd1 and cmd2 don’t know anything about “pipes”. They just expect to write to/read from files, and the OS presents an interface that looks just like a file. (There are differences, and it is possible to write programs that won’t work with pipes. Any program, for example, that needs to do random access on the file won’t work with this paradigm.)

Now, what if you want to maintain this data flow, but in addition, want to save the original data after passing it through another process, say a compression program?

[Figure: the same cmd1 → cmd2 flow, with each intermediate stream also teed through gzip to a saved, compressed copy]

Now things get a little involved, and we’ll invoke two additional concepts: process substitution and the tee command.

Consider the command

cmd1 >(gzip > fileA) >(gzip > fileB) &

In the places where cmd1 expects to find a file name we have supplied it with a very strange looking construct! The pattern >(...) is called process substitution. It tells Bash to construct a named pipe on the fly, hand its name to cmd1 in place of a file name, and connect whatever cmd1 writes to that ‘file’ to the input of the command in the parens (in this case the gzip command). We do this twice because we want both outputs to be gzipped.

Symmetrically, we can do process substitution with the pattern <(...). Here we tell Bash to send the output of the process in the parens via a named pipe to the relevant input of the other process.
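A classic everyday example of the <(...) form (assuming two unsorted text files we want to compare):

# Each <(...) becomes a /dev/fd/NN path that diff reads like an ordinary file,
# so we can compare the sorted contents without writing temporary files
diff <(sort file1.txt) <(sort file2.txt)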

Now consider the command

cmd1 >(tee > tf1 >(gzip > fileA)) >(tee > tf2 >(gzip > fileB) ) &

The new thing here is the tee command (and the nesting of process substitutions, which is just neat). The tee command takes input from stdin and copies it to stdout as well as to any files that are supplied to it. In this case tee’s stdout is redirected to our named pipe (the > tf1 construct), and tee also copies its input to the ‘file’ represented by the nested process substitution >(gzip > fileA).

So, the data flow drawn above can be represented as:

mkfifo tf1
mkfifo tf2
cmd1 >(tee > tf1 >(gzip > fileA)) >(tee > tf2 >(gzip > fileB) ) &
cmd2 tf1 tf2 file1 file2
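If you want to convince yourself this works, here is a toy, self-contained version of the same flow, with seq standing in for cmd1 and wc for cmd2 (the file names are made up):

mkfifo tf1
seq 1 1000 > >(tee >(gzip > fileA.gz) > tf1) &
wc -l < tf1 > file1    # file1 ends up containing 1000
wait
sleep 2                # the 'Sleep on it' lesson: give the gzip inside >() time to finish
rm tf1

Afterwards, gunzip -c fileA.gz should give back the original 1000 numbers.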

Part of the subtle magic in all of this is that it’s a bit like functional programming with parallelization managed by the OS. We’re crunching data in one process as soon as it has emerged from another process. We’ve managed to isolate our concerns but still parallelize some of the work! And if we take our programs to some non-fancy operating system that can’t do all these fireworks we can still run the code just fine, except we’ll be writing to disk files and operating serially.

Frankly, the people who developed *nix systems are geniuses.

(But a distant memory comes back to me, of sitting at an IBM PC in what we called  “The Measurement and Automation Lab”  at Jadavpur University in far, far Calcutta, and an excited professor explaining to me how a RAMdisk worked in PC DOS, though this is clearly not of the same calibre as the *nix features)

A short rant on Darwin

So when I first tried this all out, the code did not work on my Mac. I was pretty sure I was doing things correctly but things just hung and I had to kill the processes. When I tried the exact same commands on a Linux box, things worked as expected.

Then I changed the way I was reading from the pipe to:

python cmd2.py <(cat < tf1) <(cat < tf2) fo1.txt fo2.txt

Using a – seemingly redundant – process substitution via cat to read the pipes made the code work on my Mac.

So we have here, again, a subtle way in which Darwin’s bash differs from Linux’s, lurking and waiting to screw you up and waste your time. It’s a pity that the Mac has taken off among developers (myself included): Darwin will clearly never be a proper *nix, and the adoption of Macs by developers probably takes some energy away from getting Linux onto laptops.


Use pysam

This is a plug for Pysam: a well designed and feature-rich wrapper for HTSlib. I’ve used pysam for a few years now and it has gotten better with time. It is now a very efficient and convenient library for reading and writing BAM files, and for reading VCF, FASTQ and FASTA files. For VCF and FASTA files it can use the index (tabix and faidx, respectively) to read regions.

I first used Pysam to read BAM files, which was the most feature-complete part when I started. I started to use it for writing BAM files once the quirks were ironed out. I used to use PyVCF for reading VCF and my own (inefficient) reader for FASTA and FASTQ files.

In its current state (0.9.1.4) Pysam is also a very effective solution for reading VCF, FASTA and FASTQ files. It transparently handles gzipped files and BCF files, so you don’t have to keep track of the file type yourself.

Not only is it able to use the VCF and FASTA indices, it also returns the data nicely encapsulated. My only complaint is that I feel the VCF record format is a tad over-complicated (it uses a similar interface to PyVCF). I always end up with very complicated accessor expressions when I wish to get alleles and genotypes from a record.

What I really liked about the VCF reader is that it intelligently handles the case of deletions crossing a region boundary. So, for example, if you ask for variants in the range 10 to 20 and there is a 5 base deletion starting at position 8, it will be captured because it crosses into the requested region. It’s very nice not to have to write your own logic to handle such cases.

I’ve also found VCF reading efficient enough for my purposes. When reading a single sample from multi-sample (population) VCF files it helps to set the filter, so that only the required sample is loaded.
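A sketch of the kind of usage described above (the file, sample and region names are made up, and the exact accessors may differ a little between pysam versions):

import pysam

vcf = pysam.VariantFile('population.vcf.gz')  # handles VCF, bgzipped VCF and BCF
vcf.subset_samples(['SAMPLE_01'])             # only load the sample we care about

# fetch() uses the tabix index; deletions that start before the region
# but cross into it are also returned
for rec in vcf.fetch('chr1', 1000000, 2000000):
  print(rec.pos, rec.alleles, rec.samples['SAMPLE_01']['GT'])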

Molecular combing

Molecular combing is a fun technique that is used to study longer stretches of DNA and is quite different from the next generation sequencing by synthesis I’ve been exposed to. In this technique DNA is mechanically stretched out on a slide and then locations of interest are labeled using fluorescent compounds. One can then image the slide and inspect the regions that light up.

DNA in solution coils up randomly. In molecular combing a charged probe is dipped into the solution and attracts the ends of the DNA molecules. The probe is then retracted slowly from the solution, moving against a glass slide. The surface tension of the fluid pulls on the DNA and helps to stretch it out. Fluorescent DNA probes designed to attach to particular locations on the DNA can then be applied, and the slide is observed under a fluorescence microscope.

One use of this method is to detect large genetic events, like structural variations (long insertions, long deletions, inversions), that are really hard to detect with short read technologies and even with long read technologies. The method does not give single base resolution – a person I talked to said they get 1 kb resolution (which is pretty great, if you consider that we are looking at DNA molecules under the microscope!) – but it is the only way to get at these classes of events. Typically this would be used when you know what you are looking for and want to test individual samples to see if they have this feature or not.

As a side note, this reminds me of my neuroscience days. One way we would classify techniques for studying brain activity was a two-dimensional grouping based on resolution in time and space. By this we mean: if you detected an event with a technique, how well could you tell at what point in time it occurred – 1 ms, or 1 hour? – and where it occurred – in THIS CELL, or in this half of the brain?

So, for example, invasive single cell recording would give very high resolution in both time (1ms or less) and space (this neuron), fMRI would sit in the middle, with moderate resolution in both time (1min or so) and space (1mm cubed chunks of the brain), and EEG would have relatively high time resolution (a few ms) and very crummy spatial resolution (This half of the brain, or if you used fancy statistical techniques, this region of the brain).

An effective study would try to combine different techniques to get at a question from different levels of detail and cross-validate the findings (we can rarely do this because of resource limitations, though).

Similarly, in genomics, groups are now trying fusion-based approaches where “evidence classes” from different techniques are combined to cross-validate the existence of genetic features. (It’s fun – when you move to new fields you always find the same concepts, but with totally different names. I guess everyone wants to sound cool in different ways. In electrical engineering (robotics) the buzzword for a while was “sensor fusion”.)

The devil in the details, of course, is how you weight the different evidence classes and what you do when they conflict with each other.


Why can’t I print from Google photos?

I love Google Photos as a means of backing up and sharing photos. On the Mac it requires minimal configuration, works without supervision, and makes it easy to share albums and photos. So I’m really puzzled why there is no way to print photos.

Google Photos’ byline is “Free storage and automatic organization for all your memories.” and the software works! It appears to be written professionally – so perhaps a team from outside Google made it originally – I kid, I kid.

The auto uploader is easy to configure and non-intrusive. I tell it where my photos are and it silently looks for new ones, de-duplicates them and streams all my personal photos into google’s servers.  Wait. God! I just re-read that last sentence slowly. It’s too late now. … Anyway

Google’s statistical learning algorithms do some semi-useful things like image categorization and some cute things like animations with music, which are nice but not essential, nor something I use often. I haven’t looked, but I assume that there is a way to bulk download if I ever need to recover the photos.

Update: Google Photos is pretty much just a web-only photo sharing service. The quality of the stored photos is OK for web viewing but does not stand up to closer scrutiny. I would only use this as a “backup” of last resort, a kind of cache in case all other real backups have failed. And I guess that’s why there is no print option – the quality is just too poor to really print.

[Screenshot: the Google Photos copy (left) and the original photo (right), both at 1:1]

In the example above the left image is the Google Photos copy at 1:1 and the right is the original photo, also at 1:1. You can clearly see Google Photos’ compression artifacts and poorer underlying resolution. There are also software glitches when viewing photos – the web viewer often gets stuck at a very low resolution of the photo, and you have to reload, or otherwise ‘jiggle’ the software to get it working again.

So, imagine my surprise and frustration when I went to print my photos and started to feel like Marcel The Mime stuck in that glass box. I tried to find the print button for two days, searching forums and stack overflow, convinced that it was just hidden and if I was just diligent enough I would find it, perhaps earning $100 in free prints at the end of it.

Once, I ran into a post that said I just needed to log into the Picasa web service: I’d be able to see the photos I’d uploaded and then select them for print. I went to picasaweb and, indeed, found my albums and found the print option. I was overjoyed. I started to collect photos to print. I then navigated away. A few days later I came back and discovered that the design had changed and I no longer had the “Print” button. I realized I was part of a giant psychological experiment which made the events in Gaslight look like kindness.

It was then that a bigger mystery began to occupy my mind. Why do this? Why fuck with your users like this? Why take a course of action that both leaves money on the table and angers users at the same time? I couldn’t stop thinking about it and this post is a form of therapy. I hope it works. So hear me out.

[Screenshot: Google search results with a string of ads above them]

Now, Google is desperate to make money from their services.

Whenever I do a search I see a string of ads above my search results that are either identical to my search results or considerably less informative.

Google is sacrificing search result accuracy and user convenience for revenue. Google was earning a healthy ad revenue before it started to advertise so luridly, and so it’s not clear to me why they’ve become so desperate.

[Screenshot: another example of search ads duplicating the results]

So, in this context the absence of any way to print photos from Google photos strikes me as particularly odd.

I’m not very experienced in product commercialization, but I imagine that if you create an online photo storage and management service, it’s a net plus to either offer a printing service yourself or, if that takes you too far outside your traditional domain of expertise, have an arrangement with an established photo printing service. Not letting your users print, and being ambiguous about it, is, on the other hand, a net negative.

So, is this lack of functionality malice or stupidity? Let’s take malice first.

When we upload our photos to Google’s servers we are giving them intimate personal data. The images are processed by statistical learning algorithms which can cluster faces and probably recognize backgrounds. We also give Google our personal and professional email. These data streams are a marketer’s dream. It’s the kind of information that allows Google to insert ads for baby clothes in emails once you use the word ‘pregnancy’ in an email. In the future one can imagine that Google will insert such ads once you upload photos of your pregnancy to share with family.

Perhaps, though, that fear is overdone, as we can see from the clumsy state of targeted marketing; the brightest minds of our generation, thankfully and contrary to popular perception, have not been occupied in trying to serve ads to us (they have, of course, been occupied in borking our encryption algorithms and back-dooring our router hardware, but that is a matter for a different post) but an army of second rate minds have certainly been trying to productize our personal information.

So, from this point of view, as far as Google is concerned, we are the product, and in exchange for some free storage we are giving Google an even more complete peek into our personal lives so they can build a better psychological profile of us, so that they may judiciously prey on our deepest insecurities to sell us disposable razors. They don’t care if we can’t print, and they want this fact to be hard to discover. What they really want is for us to upload our photos for their analysis.

 

What about stupidity? Google is a big company with many, many failed products. Most of the products failed not because of buggy software but because of a lack of imagination. A basic misunderstanding of what people want their computers to do for them. Like, say, print a bunch of photos into a photo book to give as a gift. The lack of a print facility is, under this hypothesis, just another example of product management sleeping at the helm.

There is of course another option – strategic insight.

Perhaps Google has decided for us that the vast majority of people no longer print photos. Perhaps they have seen into the future and it’s all digital, from the screens on our phones to the screens on our fridges. There will be no more eight-by-ten color glossy pictures of children and of wives and of parents and of halloween parties hanging on our walls, or inserted into albums (real albums, made of cardboard paper and cellophane) to be shown to relatives on thanksgiving. Perhaps we’ll be offering a guest a drink and instead of pulling out an album from our bookcase, we’ll swipe on our refrigerator and say ‘Hey did I show you our wedding photos?’

Well, that’s the future, and it ain’t here yet. I have relatives here and now that want photos of Mom and Dad, and I can’t waste half an hour downloading them and then uploading them to some other service EVERY TIME.

 

__repr__()

I think __repr__() is Python’s genius touch. What?! I’m sure you’re yelling at the screen right now, incoherent as items from the long list of things you love about Python fight each other to gain command of your voice. But it’s what I think. And, as you know, Lisp got there first.

Now, of course, you’re really mad at me. But listen. What do we use Python a lot for? We use it for anything that requires our language to be nimble, fluid and user friendly, like data analysis and algorithm prototyping. And why do we use Python? Because it has a REPL. We’ll think of something to do, we’ll write a set of functions and classes to do it (sometimes never leaving the REPL) and then run our data on it.

The key feature of the REPL is being able to see the results of a computation immediately printed below what we did. This is a user comfort feature, and user comfort has a disproportionate impact on our motivation to keep trying out new ideas.

Python’s built-in types have decent printed representations, but these can get crowded (dicts with many keys get hard to read quickly, for example). And what about user defined classes? Often, the results of a computation are not a few strings or simple, single numbers. The result of adding two arrays or two tables can be visually complex, for instance. Representing these as unreadable combinations of Python’s built-in types or equally unreadable pointer ids starts to defeat the purpose of the interpreted, REPL workflow.

To mitigate this, Python supplies us with the special __repr__() method. When we create a class we don’t have to define this method, but if we do define it, making it return a succinct string that we find useful as a summary of the object, then at the REPL we are rewarded with these succinct, readable representations.

For example, say we are doing some work with vectors and are interested in how the magnitude of the resultant vector changes as we add vectors together. We could do this as follows:

class Vect:

  def __init__(self, x, y):
    self.x, self.y = x, y

  def __add__(self, other):
    return Vect(self.x + other.x, self.y + other.y)

  def __radd__(self, other):
    # sum() starts from 0, so handle a non-Vect on the left gracefully
    if not isinstance(other, Vect):
      return Vect(self.x, self.y)
    return Vect(self.x + other.x, self.y + other.y)

  def __repr__(self):
    return "({}, {}):{:0.3f}".format(
      self.x, self.y, (self.x ** 2 + self.y ** 2) ** 0.5)

There are some things here in addition to __repr__ which enable us to do vector additions. We will ignore those here.

So now when we create some vectors, we get a nice printout

Vect(3,4)   # -> (3, 4):5.000
Vect(5,12)  # -> (5, 12):13.000

This feature plays wonderfully with Python’s built-in types, so a set of these objects, a dict with them as keys or values, or a simple list of them all work as you would hope.

[Vect(3,4), Vect(5,12)]              # -> [(3, 4):5.000, (5, 12):13.000]
{Vect(3, 4), Vect(5, 12)}            # -> {(3, 4):5.000, (5, 12):13.000}
{Vect(3, 4): 'A', Vect(5, 12): 'B'}  # -> {(3, 4):5.000: 'A', (5, 12):13.000: 'B'}
sum([Vect(3,4), Vect(5,12)])         # -> (8, 16):17.889

PS. I forgot – I dropped a teaser about how Lisp got there first (as usual). In Common Lisp when you define a structure, the structure gets a default “printer” which you can override:

(defstruct vect x y)    ; -> uses default printer

(make-vect :x 3 :y 4)   ; -> #S(VECT :X 3 :Y 4)

Now we can replace this with our own printer

; One way to do it: defstruct's :print-function option
(defstruct (vect (:print-function
                  (lambda (v stream depth)
                    (declare (ignore depth))
                    (format stream "(~a, ~a):~,3f" (vect-x v) (vect-y v)
                            (sqrt (+ (* (vect-x v) (vect-x v))
                                     (* (vect-y v) (vect-y v))))))))
  x y)

(make-vect :x 3 :y 4)   ; -> (3, 4):5.000