Progressive Cactus is a set of tools that will align multiple genomes and save the result in the interesting HAL format. I decided to take it out for a spin.
Compiling on Mac OS Mavericks was easy (I just followed their Readme), except for this one problem with wget, but it was an easy fix. The command line arguments and input file format are very simple and logical. I started out by running the algorithm on the GRCh38 Y chromosome and the Venter Institute Y chromosome (found here).
My first run terminated rather quickly. Fortunately they provide log files (and tell you where the log files are saved). The program does not like non-alphanumeric characters in the headers of the FASTA files, which in my case turned out to be the | characters. I simply used vi to replace the |s with spaces. I didn’t do any clever monitoring of the code, but I did note that all four of my cores hit 100% right away and cactus took over top. Interestingly, my memory usage remained low, which is my metric of a well-designed big-data-crunching program. I assume there are options that can trade off computation for space, allowing us to tweak the program for our setup. Cactus doesn’t talk a lot on the command line, but peeking into cactus.log in the working directory you specify can tell you a little of what is going on. Cactus also claims that it stores some state in files, so that an interrupted computation can pick up approximately where it stopped.
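For anyone who would rather script the header cleanup than do it by hand in vi, here is a minimal sketch in Python. It only replaces the | character in header lines (the ones starting with >), since that was the only offender in my files; I have not checked which other characters Cactus objects to. The function names and the file names in the usage note are made up for illustration.

```python
def sanitize_header(line: str) -> str:
    """Replace '|' with a space in a FASTA header line; leave other lines alone."""
    if line.startswith(">"):
        return line.replace("|", " ")
    return line


def sanitize_fasta(src_path: str, dst_path: str) -> None:
    """Copy a FASTA file line by line, cleaning only the header lines.

    Reading one line at a time keeps memory use small even for a whole
    chromosome, as long as the sequence is wrapped across lines.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(sanitize_header(line))
```

Calling `sanitize_fasta("chrY_raw.fa", "chrY_clean.fa")` (hypothetical file names) writes a cleaned copy and leaves the original untouched, so you can diff the two before handing the new file to cactus.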
I made two attempts at running cactus. During the first run I got stuck in a program called ktserver for a very long time. After a while I got bored and killed the program, then restarted it. I let this second instance run for days. (Sorry, I don’t have reliable stats: this was my laptop, so I set it to sleep, did other work on it, and at one point paused the task because I needed the CPU.)
At one point the program switched to a low-CPU, high-memory mode. In fact my 16GB Mac laptop started giving me repeated low-memory warnings. I was probably lucky that I had an SSD; with an HDD I would probably have gone into the thrash of death. But even the SSD could not save me. My laptop became semi-responsive and I finally killed the process.
My memory came back quickly (I don’t know whether I should credit cactus or Mac OS for not creating memory leaks), but I had to kill Chrome (my only other application) since I never got Chrome back.
I’m unsure of what to think. I will peek into the logs, but I don’t know how far the program got or whether we were just about done. However, the fact that memory usage ballooned like this is not a good sign. My endeavor is always to design programs that can run on any machine whatsoever. If there is less RAM the program should take more time, but it should never push the machine to unresponsiveness.