Named pipes, process substitution and tee

Most people using bash are familiar with piping: the ability to direct the output of one process to the input of another, allowing the second process to work on data as soon as it becomes available. What is not so well known is that piping is a particular case of a more general and powerful inter-process communication system bash gives us.

The basic bash piping flow looks like:

[Figure: pipe-fig1]

In this setup cmd1 writes to stdout and the pipe operator (|) redirects this to cmd2, which is set up to accept input from stdin and which finally either writes directly to a file, or writes to stdout again, which is then redirected to a file. The bash command would look like:

cmd1 | cmd2 > file

There are two great advantages to this kind of data flow. Firstly, we skip writing data from cmd1 to disk, which saves us a potentially very expensive operation. Secondly, cmd2 runs in parallel with cmd1 and can start processing results from cmd1 as soon as they become available.

(This kind of structure is of course restricted to a data flow that can work in chunks, i.e. cmd2 does not need the whole output of cmd1 to be written before starting work.)
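As a toy, runnable version of this flow – seq and awk here are stand-ins I’ve chosen for cmd1 and cmd2, not anything from the figure:

```shell
# seq plays cmd1 and awk plays cmd2; awk processes each line as soon
# as it arrives, and only the final result is written to disk.
seq 1 5 | awk '{ print $1 * 2 }' > doubled.txt
cat doubled.txt   # prints 2 4 6 8 10, one number per line
```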

Now suppose that our first program produces multiple output files and so cannot simply write everything to stdout. Our second program, similarly, accepts two files as input and produces two files as output.

[Figure: pipe-fig2]

Here cmd1 expects to write two streams of data to two files, whose names are represented by the first and second arguments to cmd1. cmd2 expects to read data from two files and write data to two files, all represented by four positional arguments.

Can we still use pipes? Can we still get parallelization for free if cmd1 and cmd2 can work in chunks? It turns out that we can, using something called named pipes. The above data flow, in bash, can be executed as follows:

mkfifo tf1
mkfifo tf2
cmd1 tf1 tf2 &
cmd2 tf1 tf2 file1 file2

The fun thing here is the mkfifo command, which creates a named pipe, also called a First In First Out (FIFO) structure. After executing the mkfifo commands, if you do a directory listing you will see two odd files named tf1 and tf2. They have no size and their permissions flag string starts with ‘p’.

ls -al 
prw-r--r--   1 kghose  staff      0 Oct 26 21:35 tf1
prw-r--r--   1 kghose  staff      0 Oct 26 21:35 tf2

Whenever a process asks to write to or read from these ‘files’ the OS sets up an in-memory buffer. The process reading from the buffer blocks until data is placed on it. The process writing to the buffer also blocks until it can be sure there is a receiving process at the other end. Once both ends are connected an in-memory transfer of data occurs from one process to the other, never touching the disk.
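The blocking behaviour is easy to demonstrate with a toy reader and writer (echo and cat here are arbitrary stand-ins):

```shell
mkfifo demo_fifo

# The writer blocks until a reader opens the other end, so we put it
# in the background; the data moves through an in-memory buffer and
# never touches the disk.
echo "hello through the fifo" > demo_fifo &

msg=$(cat demo_fifo)   # blocks until the writer appears, then reads
wait                   # reap the background writer
rm demo_fifo
echo "$msg"            # prints: hello through the fifo
```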

To me the amazing thing is that the original programs cmd1 and cmd2 don’t know anything about “pipes”. They just expect to write to/read from files and the OS presents an interface that looks just like a file. (There are differences, and it is possible to write programs that won’t work with pipes. Any program, for example, that needs to do random access on the file won’t work with this paradigm)
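The four-line flow above can be exercised end to end with toy stand-ins for cmd1 and cmd2 – the two shell functions below are invented purely for illustration:

```shell
mkfifo tf1 tf2

# cmd1 writes two streams; cmd2 reads two streams and writes two
# files (here it just maps digits to letters with tr).
cmd1() { seq 1 3 > "$1"; seq 4 6 > "$2"; }
cmd2() { tr '123' 'abc' < "$1" > "$3"; tr '456' 'def' < "$2" > "$4"; }

cmd1 tf1 tf2 &            # background: each write blocks until read
cmd2 tf1 tf2 file1 file2
wait
rm tf1 tf2
cat file1 file2           # prints a b c d e f, one letter per line
```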

Now, what if you want to maintain this data flow, but in addition, want to save the original data after passing it through another process, say a compression program?

[Figure: pipe-fig3]

Now things get a little involved and we’ll invoke two additional concepts: process substitution and the tee command.

Consider the command

cmd1 >(gzip > fileA) >(gzip > fileB) &

In the places where cmd1 expects to find a file name we have supplied it with a very strange looking construct! The pattern >(...) is called process substitution. It tells bash to construct a named pipe on the fly and connect the corresponding output stream from cmd1 to the input of the command in the parens (in this case the gzip command). We do this twice because we want both outputs to be gzipped.

Symmetrically, we can do process substitution with the pattern <(...). Here we tell bash to send the output of the process in the parens via a named pipe to the relevant input of the other process.
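Process substitution works with everyday commands too; nothing below depends on our hypothetical cmd1 and cmd2:

```shell
# bash replaces each <(...) with a path like /dev/fd/63, which the
# outer command simply opens and reads like a file.
diff <(seq 1 3) <(seq 1 3) && echo "streams identical"

# paste reads both anonymous pipes in parallel:
paste -d, <(seq 1 3) <(seq 4 6)   # prints 1,4 2,5 3,6 one pair per line
```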

Now consider the command

cmd1 >(tee > tf1 >(gzip > fileA)) >(tee > tf2 >(gzip > fileB)) &

The new thing here is the tee command (and the nesting of process substitutions, which is just neat). The tee command takes input from stdin and copies it to stdout as well as to any files supplied to it as arguments. In this case tee’s stdout is redirected to our named pipe (the > tf1 construct) and its file argument is the ‘file’ represented by the nested process substitution >(gzip > fileA).
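A small runnable sketch of this pattern – the file names are made up for the example:

```shell
# tee's stdout goes to copy.txt, while its file argument is a process
# substitution running gzip, so one stream feeds both copies.
printf 'line1\nline2\n' | tee >(gzip > copy.gz) > copy.txt

sleep 1   # crude: bash's wait does not reliably cover process
          # substitutions, so give the background gzip a moment

gunzip -c copy.gz   # prints line1 and line2, same as copy.txt
```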

So, the data flow drawn above can be represented as:

mkfifo tf1
mkfifo tf2
cmd1 >(tee > tf1 >(gzip > fileA)) >(tee > tf2 >(gzip > fileB)) &
cmd2 tf1 tf2 file1 file2

Part of the subtle magic in all of this is that it’s a bit like functional programming with parallelization managed by the OS. We’re crunching data in one process as soon as it has emerged from another process. We’ve managed to isolate our concerns but still parallelize some of the work! And if we take our programs to some non-fancy operating system that can’t do all these fireworks we can still run the code just fine, except we’ll be writing to disk files and operating serially.

Frankly, the people who developed *nix systems are geniuses.

(But a distant memory comes back to me, of sitting at an IBM PC in what we called  “The Measurement and Automation Lab”  at Jadavpur University in far, far Calcutta, and an excited professor explaining to me how a RAMdisk worked in PC DOS, though this is clearly not of the same calibre as the *nix features)

A short rant on Darwin

So when I first tried this all out, the code did not work on my Mac. I was pretty sure I was doing things correctly but things just hung and I had to kill the processes. When I tried the exact same commands on a Linux box, things worked as expected.

Then I changed the way I was reading from the pipe to:

python cmd2.py <(cat < tf1) <(cat < tf2) fo1.txt fo2.txt

Using a – seemingly redundant – process substitution via cat to read the pipes made the code work on my Mac.

So we have here, again, a subtle way in which Darwin’s BASH shell differs from Linux’s and lurks, waiting, to screw you up and waste your time. It’s a pity that the Mac has taken off for developers (myself included) because it’s clear Darwin will never be a proper *nix but the adoption of Macs by developers probably takes some energy away from getting Linux on laptops.

Use pysam

This is a plug for Pysam: a well-designed and feature-rich wrapper for HTSlib. I’ve used pysam for a few years now and it has gotten better with time. It is now a very efficient and convenient library for reading and writing BAM files and for reading VCF, FASTQ and FASTA files. For VCF and FASTA files it can use an index (tabix for VCF, faidx for FASTA) to read regions.

I first used Pysam to read BAM files, which was the most feature-complete part when I started. I started to use it for writing BAM files when the quirks were ironed away. I used to use PyVCF for reading VCF files and my own (inefficient) readers for FASTA and FASTQ files.

In its current state (0.9.1.4) Pysam is also a very effective solution for reading VCF, FASTA and FASTQ files. It transparently handles gzipped files and BCF files, so you don’t have to keep track of the file type yourself.

Not only is it able to use the VCF and FASTA indices, it also returns the data nicely encapsulated. My only complaint is that I feel the VCF record format is a tad over-complicated (it uses a similar interface to PyVCF). I always end up with very complicated accessor expressions when I wish to get alleles and genotypes from a record.

What I really liked about the VCF reader is that it intelligently handles the case of deletions crossing a region boundary. So for example, if you ask for variants in the range 10 to 20 and there is a 5 base deletion starting at position 8, it will be captured because it crosses into the requested region. It’s very nice not to have to write your own logic to handle such cases.

I’ve also found VCF reading efficient enough for my purposes. When reading a single sample from multi-sample (population) VCF files it helps to set the filter, so that only the required sample is loaded.

Molecular combing

Molecular combing is a fun technique used to study longer stretches of DNA, and it is quite different from the next generation sequencing by synthesis I’ve been exposed to. In this technique DNA is mechanically stretched out on a slide and then locations of interest are labeled using fluorescent compounds. One can then image the slide and inspect the regions that light up.

DNA in solution coils up randomly. In molecular combing a charged probe is dipped in the solution and attracts the ends of the DNA molecules. The probe is then retracted slowly from the solution, moving against a glass slide. The surface tension of the fluid pulls on the DNA and helps to stretch it out. Fluorescent DNA probes designed to attach to particular locations on the DNA can then be applied and the slide is observed under a fluorescent microscope.

One use of this method is to detect large genetic events, like structural variations (long insertions, long deletions, inversions), that are really hard to detect with short read technologies and even with long read technologies. The method does not give single base resolution – a person I talked to said they get 1kb resolution (which is pretty great, if you consider that we are looking at DNA molecules under the microscope!) – but it is the only way to get at these classes of events. Typically this would be used when you know what you are looking for and want to test individual samples to see if they have this feature or not.

As a side note this reminds me of my Neuroscience days. One way we would classify techniques to study brain activity was a two-dimensional grouping based on resolution in time and space, by which we mean, if you detected an event with a technique, how well could you tell at what point in time it occurred – 1ms, or 1hour? and where it occurred – in THIS CELL, or in this half of the brain?

So, for example, invasive single cell recording would give very high resolution in both time (1ms or less) and space (this neuron), fMRI would sit in the middle, with moderate resolution in both time (1min or so) and space (1mm cubed chunks of the brain), and EEG would have relatively high time resolution (a few ms) and very crummy spatial resolution (This half of the brain, or if you used fancy statistical techniques, this region of the brain).

An effective study would try to combine different techniques to get at a question from different levels of detail and cross-validate the findings (we can rarely do this because of resource limitations, though).

Similarly, in genomics, groups are now trying fusion-based approaches where “evidence classes” from different techniques are combined to cross-validate the existence of genetic features. (It’s fun – when you move to new fields you always find the same concepts, but with totally different names. I guess everyone wants to sound cool in different ways. In electrical engineering (robotics) the buzzword for a while was “Sensor Fusion”.)

The devil in the details of course is how you weight the different evidence classes and what do you do when they conflict with each other.


Why can’t I print from Google photos?

I love Google Photos as a means of backing up and sharing photos. On the Mac it requires minimal configuration, works without supervision, and makes it easy to share albums and photos. So I’m really puzzled why there is no way to print photos.

Google Photos’ byline is “Free storage and automatic organization for all your memories.” And the software works! It appears to be written professionally – so perhaps a team from outside Google made it originally – I kid, I kid.

The auto uploader is easy to configure and non-intrusive. I tell it where my photos are and it silently looks for new ones, de-duplicates them and streams all my personal photos into Google’s servers. Wait. God! I just re-read that last sentence slowly. It’s too late now. … Anyway.

Google’s statistical learning algorithms do some semi-useful things like image categorization and some cute things like animations with music which are nice but not essential or something I use often. I haven’t looked, but I assume that there is a way to bulk download if I ever need to recover the photos.

Update: Google Photos is pretty much just a web-only photo sharing service. The quality of the stored photos is OK for web viewing but does not stand up to closer scrutiny. I would only use this as a “backup” of last resort, a kind of cache in case all other real backups have failed. And I guess that’s why there is no print option – the quality is just too poor to really print.

[Screen Shot 2016-10-22 at 7.56.10 PM.png]

In the example above the left image is the Google Photos copy at 1:1 and the right is the original photo, also at 1:1. You can clearly see Google Photos’ compression artifacts and poorer underlying resolution. There are also software glitches when viewing photos – the web viewer often gets stuck at a very low resolution of the photo, and you have to reload, or otherwise ‘jiggle’ the software to get it working again.

So, imagine my surprise and frustration when I went to print my photos and started to feel like Marcel The Mime stuck in that glass box. I tried to find the print button for two days, searching forums and stack overflow, convinced that it was just hidden and if I was just diligent enough I would find it, perhaps earning $100 in free prints at the end of it.

Once, I ran into a post that said I just needed to log into the Picasa web service: I’d be able to see the photos I’d uploaded and then select them for print. I went to picasaweb and, indeed, found my albums and found the print option. I was overjoyed. I started to collect photos to print. I then navigated away. A few days later I came back and discovered that the design had changed and I no longer had the “Print” button. I realized I was part of a giant psychological experiment which made the events in Gaslight look like kindness.

It was then that a bigger mystery began to occupy my mind. Why do this? Why fuck with your users like this? Why take a course of action that both leaves money on the table and angers users at the same time? I couldn’t stop thinking about it and this post is a form of therapy. I hope it works. So hear me out.

[screen-shot-2016-10-09-at-7-34-10-pm]

Now, Google is desperate to make money from their services.

Whenever I do a search I see a string of ads above my search results that are either identical to my search results or considerably less informative.

Google is sacrificing search result accuracy and user convenience for revenue. Google was earning a healthy ad revenue before it started to advertise so luridly, and so it’s not clear to me why they’ve become so desperate.

[screen-shot-2016-10-06-at-11-00-06-am]

So, in this context the absence of any way to print photos from Google photos strikes me as particularly odd.

I’m not very experienced in product commercialization, but I imagine that if you create an online photo storage and management service, it’s a net plus to either offer a printing service yourself or, if that takes you too far outside your traditional domain of expertise, have an arrangement with an established photo printing service. Not letting your users print, and being ambiguous about it, is, on the other hand, a net negative.

So, is this lack of functionality malice or stupidity? Let’s take malice first.

When we upload our photos to Google’s servers we are giving them intimate personal data. The images are processed through statistical learning algorithms which can cluster faces and probably recognize backgrounds. We also give Google our personal and professional email. These data streams are a marketer’s dream. It’s the kind of information that allows Google to insert ads for baby clothes in emails once you use the word ‘pregnancy’ in an email. In the future one can imagine that Google will insert such ads once you upload photos of your pregnancy to share with family.

Perhaps, though, that fear is overdone, as we can see from the clumsy state of targeted marketing; the brightest minds of our generation, thankfully and contrary to popular perception, have not been occupied in trying to serve ads to us (they have, of course, been occupied in borking our encryption algorithms and back-dooring our router hardware, but that is a matter for a different post) but an army of second rate minds have certainly been trying to productize our personal information.

So, from this point of view, as far as Google is concerned, we are the product: in exchange for some free storage we are giving Google an even more complete peek into our personal lives so they can build a better psychological profile of us, so that they may judiciously prey on our deepest insecurities to sell us disposable razors. They don’t care if we can’t print, and they want this fact to be hard to discover. What they really want is for us to upload our photos for their analysis.


What about stupidity? Google is a big company with many, many failed products. Most of the products failed not because of buggy software but because of a lack of imagination. A basic misunderstanding of what people want their computers to do for them. Like, say, print a bunch of photos into a photo book to give as a gift. The lack of a print facility is, under this hypothesis, just another example of product management sleeping at the helm.

There is of course another option – strategic insight.

Perhaps Google has decided for us that the vast majority of people no longer print photos. Perhaps they have seen into the future and it’s all digital, from the screens on our phones to the screens on our fridges. There will be no more eight-by-ten color glossy pictures of children and of wives and of parents and of Halloween parties hanging on our walls, or inserted into albums (real albums, made of cardboard paper and cellophane) to be shown to relatives on Thanksgiving. Perhaps we’ll be offering a guest a drink and instead of pulling out an album from our bookcase, we’ll swipe on our refrigerator and say ‘Hey did I show you our wedding photos?’

Well, that’s the future, and it ain’t here yet. I have relatives here and now that want photos of Mom and Dad, and I can’t waste half an hour downloading them and then uploading them to some other service EVERY TIME.