I was pretty annoyed. I had installed a program into a docker container and had invoked this container via our cloud computing platform and it was garbling my output files – with the exact same inputs and parameters I had just tested it with on my local machine.
At first, I didn’t realize anything was wrong. The program completed without complaint, and the log file reported a result of the same size as the one I had obtained locally. However, when I fed the output file into another tool, that tool failed with an esoteric, uninformative error message (yay bioinformatics tools) – but having seen this message several times before, I’ve come to understand that it usually means the input is truncated.
Sure enough, when I downloaded the file from the platform onto my local machine and gunzipped it, gunzip ran for a while before complaining of a truncated file.
Now, I had just come out of dealing with an annoying issue related to running docker containers on our platform: the containers run under a non-interactive, non-login shell, which leads to subtle differences in how many things work (that deserves a post of its own). Basically, I was in a foul mood already.
The code I actually wanted to run looked like this:
my-program \
    input-file \
    >(gzip > outputfile1) \
    >(gzip > outputfile2)
It turns out that the non-interactive shell (/bin/sh) on the ubuntu base image I used for my tool image is dash, rather than bash, and dash doesn’t support this process substitution thing at all – it complains
sh: 1: Syntax error: "(" unexpected
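Process substitution is a bash (and zsh) feature, not part of POSIX sh. You can see the difference directly by handing the same snippet to both shells (a sketch, assuming dash is installed, as it is by default on Debian/Ubuntu):

```shell
# bash understands process substitution:
bash -c 'cat <(echo hello)'                # prints "hello"

# dash (POSIX sh) fails to even parse it:
dash -c 'cat <(echo hello)' 2>&1 || true   # Syntax error: "(" unexpected
```

The `|| true` is just so the failing dash invocation doesn’t abort a script running under `set -e`.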
To get around this expediently I wrapped my program in a bash script, my-script.sh, and invoked it on the platform as bash my-script.sh. This worked, but now my file was getting truncated.
Same code, same data, same docker image. The container executed correctly on my mac but corrupted my file on the AWS linux machine. I cross-checked that I was closing my file in my code, read up a bit about flushing (close does it) and fsync (named pipes don’t have it – they never touch disk), and confirmed that Python will raise an exception if something goes wrong with the write.
After some more thrashing about like this I suddenly wondered about the multiple processes involved here. By doing process substitution, I had created two new processes – both running gzip – that were sucking up data from pipes connected to the process my main program was running in. It was quite possible that the two gzip processes were finishing slightly after my main process was done. Now suppose, just suppose, the original bash shell that I was invoking to run my program was terminating as soon as my main program was done, without waiting for the gzip processes? That would produce exactly this kind of file truncation.
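This hypothesis is easy to check in isolation: a bash process substitution is not a normal background job, and the shell will happily exit while the substituted process is still running. A sketch, using sleep to stand in for a slow gzip:

```shell
# The >(...) process sleeps for 2 seconds, but the shell returns immediately:
start=$(date +%s)
bash -c 'echo data > >(sleep 2)'
end=$(date +%s)
elapsed=$((end - start))
echo "shell exited after ${elapsed}s"   # ~0s, not 2s
```

The echoed data sits in the pipe buffer, bash exits, and the sleeping process is left orphaned – exactly the situation a container teardown would then kill.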
I put in a wait in the shell script, but this did not make any difference. I wasn’t explicitly sending any processes to the background, so this was perhaps expected. Then I added a sleep 2s command to the script:
my-program \
    input-file \
    >(gzip > outputfile1) \
    >(gzip > outputfile2)

sleep 2s # <--- BINGO!
and indeed my files were no longer being truncated. So what was happening was this: the parent shell had no way of ‘realizing’ that the two gzip subprocesses were still running, so it exited, the docker container shut down, and my gzips were killed just before they wrote the last bit of data.
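The sleep works, but it’s a race, not a fix. A more robust sketch (with tee standing in for my program, and explicit named pipes instead of process substitution) runs the compressors as ordinary background jobs, which a plain wait does reliably block on:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Explicit FIFOs replace the >(...) substitutions:
tmpdir=$(mktemp -d)
mkfifo "$tmpdir/p1" "$tmpdir/p2"

# Ordinary background jobs -- these ARE in the shell's job table:
gzip < "$tmpdir/p1" > outputfile1.gz &
gzip < "$tmpdir/p2" > outputfile2.gz &

# Stand-in for my-program: write the same data to both pipes.
seq 100000 | tee "$tmpdir/p1" > "$tmpdir/p2"

# wait blocks until BOTH gzip jobs have flushed and exited,
# so the script (and the container) cannot die under them.
wait
rm -rf "$tmpdir"
```

(If I read the bash changelogs right, in bash 4.4+ you can also keep the process substitutions and call wait "$!" right after the command, since $! is set to the last substitution’s PID – but a bare wait still ignores them.)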
Ah, the joys. The joys.