Most people using bash are familiar with piping: the ability to direct the output of one process to the input of another, allowing the second process to work on data as soon as it becomes available. What is not so well known is that piping is a particular case of a more general and powerful inter-process communication system that bash gives us.

The basic bash piping flow looks like:

pipe-fig1

In this setup cmd1 writes to stdout and the pipe operator (|) redirects this to cmd2, which is set up to accept input from stdin and which finally either writes directly to a file, or writes to stdout again, which is then redirected to a file. The bash command would look like

cmd1 | cmd2 > file

There are two great advantages to this kind of data flow. Firstly, we skip writing the data from cmd1 to disk, which saves us a potentially very expensive operation. Secondly, cmd2 runs in parallel with cmd1 and can start processing results from cmd1 as soon as they become available.

(This kind of structure is of course restricted to a data flow that can work in chunks, i.e. cmd2 does not need the whole output of cmd1 to be written before it can start work.)
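
As a concrete sketch of this basic flow (the file names here are just hypothetical), decompressing a log and filtering it in one pass might look like:

gunzip -c big.log.gz | grep "ERROR" > errors.txt   # grep starts filtering as soon as gunzip emits data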

Now suppose that our first program produces multiple output files and so cannot simply write to stdout. Our second program, similarly, accepts two files as input and produces two files as output.

pipe-fig2.png

Here cmd1 expects to write two streams of data to two files, whose names are represented by the first and second arguments to cmd1. cmd2 expects to read data from two files and write data to two files, all represented by four positional arguments.

Can we still use pipes? Can we still get parallelization for free if cmd1 and cmd2 can work in chunks? It turns out that we can, using something called named pipes. The above data flow, in bash, can be executed as follows

mkfifo tf1
mkfifo tf2
cmd1 tf1 tf2 &
cmd2 tf1 tf2 file1 file2
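
To make this concrete, here is a minimal sketch you can run as-is, with seq standing in for cmd1's two output streams and paste standing in for cmd2 (unlike the real cmd2, it produces a single merged output file):

mkfifo tf1 tf2
seq 1 5 > tf1 &           # stand-in for cmd1's first output stream
seq 6 10 > tf2 &          # stand-in for cmd1's second output stream
paste tf1 tf2 > file1     # stand-in for cmd2, reading both named pipes
rm tf1 tf2                # named pipes persist on disk until removed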

The fun thing here is the mkfifo command, which creates named pipes, also called First In First Out (FIFO) structures. After executing the two mkfifo commands, if you do a directory listing you will see two odd files named tf1 and tf2. They have no size and their permissions flag string starts with ‘p’.

ls -al 
prw-r--r--   1 kghose  staff      0 Oct 26 21:35 tf1
prw-r--r--   1 kghose  staff      0 Oct 26 21:35 tf2

Whenever a process asks to write to or read from these ‘files’ the OS sets up an in-memory buffer. The process reading from the buffer blocks until data is placed on it. The process writing to the buffer likewise blocks until there is a receiving process at the other end. Once both ends are connected, the data is transferred in memory from the first process to the second, never touching the disk.
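
You can watch this blocking behaviour in an interactive shell (demo_pipe is just a throwaway name):

mkfifo demo_pipe
cat demo_pipe &             # the reader blocks: no writer has opened the pipe yet
echo "hello" > demo_pipe    # a writer connects, the data flows through memory, cat prints it
rm demo_pipe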

To me the amazing thing is that the original programs cmd1 and cmd2 don’t know anything about “pipes”. They just expect to write to and read from files, and the OS presents an interface that looks just like a file. (There are differences, and it is possible to write programs that won’t work with pipes. Any program, for example, that needs to do random access on the file won’t work with this paradigm.)

Now, what if you want to maintain this data flow, but in addition, want to save the original data after passing it through another process, say a compression program?

pipe-fig3.png

Now things get a little more involved and we’ll invoke two additional concepts: process substitution and the tee command.

Consider the command

cmd1 >(gzip > fileA) >(gzip > fileB) &

In the places where cmd1 expects to find a file name we have supplied it with a very strange-looking construct! The pattern >(...) is called process substitution. It tells Bash to construct a named pipe on the fly and connect the corresponding output stream from cmd1 to the input of the command in the parens (in this case, gzip). We do this twice because we want both outputs to be gzipped.
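
One way to see what cmd1 actually receives as its arguments is to echo a process substitution; on Linux it typically expands to a path under /dev/fd (the descriptor number will vary, and some systems use a named pipe in a temporary directory instead):

echo >(true)
# prints something like: /dev/fd/63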

Symmetrically, we can do process substitution with the pattern <(...). Here we tell Bash to send the output of the process in the parens via a named pipe to the relevant input of the other process.
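
A classic use of the <(...) form is comparing the output of two commands without writing intermediate files, for example (file names hypothetical):

diff <(sort list1.txt) <(sort list2.txt)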

Now consider the command

cmd1 >(tee > tf1 >(gzip > fileA)) >(tee > tf2 >(gzip > fileB)) &

The new thing here is the tee command (and the nesting of process substitutions, which is just neat). The tee command takes input from stdin and copies it to stdout as well as to any files supplied to it as arguments. In this case tee’s stdout is redirected to our named pipe (the > tf1 construct) and tee also copies its input to the ‘file’ represented by the nested process substitution >(gzip > fileA).
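
Stripped of all the nesting, tee on its own looks like this (copy.txt is just an illustrative name): one copy of the data lands in copy.txt and another continues down the pipe to gzip.

seq 1 100 | tee copy.txt | gzip > copy.txt.gz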

So, the data flow drawn above can be represented as:

mkfifo tf1
mkfifo tf2
cmd1 >(tee > tf1 >(gzip > fileA)) >(tee > tf2 >(gzip > fileB)) &
cmd2 tf1 tf2 file1 file2
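
As a rough, runnable approximation of this whole flow, again with seq and paste standing in for cmd1 and cmd2 (and ordinary pipes into tee, since seq writes to stdout rather than to named file arguments):

mkfifo tf1 tf2
seq 1 5 | tee tf1 | gzip > fileA.gz &    # first stream: into the named pipe and into an archive
seq 6 10 | tee tf2 | gzip > fileB.gz &   # second stream, likewise
paste tf1 tf2 > file1                    # the downstream consumer reads both named pipes
rm tf1 tf2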

Part of the subtle magic in all of this is that it’s a bit like functional programming with parallelization managed by the OS. We’re crunching data in one process as soon as it has emerged from another process. We’ve managed to isolate our concerns but still parallelize some of the work! And if we take our programs to some non-fancy operating system that can’t do all these fireworks we can still run the code just fine, except we’ll be writing to disk files and operating serially.

Frankly, the people who developed *nix systems are geniuses.

(But a distant memory comes back to me, of sitting at an IBM PC in what we called  “The Measurement and Automation Lab”  at Jadavpur University in far, far Calcutta, and an excited professor explaining to me how a RAMdisk worked in PC DOS, though this is clearly not of the same calibre as the *nix features)

A short rant on Darwin

So when I first tried this all out, the code did not work on my Mac. I was pretty sure I was doing things correctly but things just hung and I had to kill the processes. When I tried the exact same commands on a Linux box, things worked as expected.

Then I changed the way I was reading from the pipe to:

python cmd2.py <(cat < tf1) <(cat < tf2) fo1.txt fo2.txt

Using a – seemingly redundant – process substitution via cat to read the pipes made the code work on my Mac.

So we have here, again, a subtle way in which Darwin’s BASH shell differs from Linux’s and lurks, waiting, to screw you up and waste your time. It’s a pity that the Mac has taken off for developers (myself included): it’s clear Darwin will never be a proper *nix, and the adoption of Macs by developers probably takes some energy away from getting Linux on laptops.

Further reading