A recent PLoS Computational Biology article (W.S. Noble, "A Quick Guide to Organizing Computational Biology Projects", PLoS Comput Biol 5(7), 2009) offers one person's suggestions for organizing computational projects. I particularly enjoyed the sections on experiment structure and on scripts.
Many of the scripting goals discussed in the article might be accomplished by using makefiles to drive analyses. Makefile rules could implement building-block operations, and dependencies could be used to ensure that every downstream step is rerun whenever the data change. A properly crafted Makefile may also enable "easy" parallelization: for example, GNU make can run independent build steps in parallel (make -j) where the dependency graph allows it.
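As a rough illustration of the idea, here is a minimal sketch of such a Makefile. The file names and the clean.py, analyze.R, and plot.R scripts are hypothetical placeholders, not anything taken from Noble's article.

    # Hypothetical analysis pipeline driven by make. Each rule is a
    # building-block operation; recipe lines must begin with a tab character.
    all: figure.png

    # Derive a cleaned dataset from the raw data.
    clean.csv: raw.csv clean.py
            python clean.py raw.csv > clean.csv

    # These two analyses are independent, so "make -j 2" can run them in parallel.
    stats_a.csv: clean.csv analyze.R
            Rscript analyze.R a clean.csv stats_a.csv

    stats_b.csv: clean.csv analyze.R
            Rscript analyze.R b clean.csv stats_b.csv

    # Combine the two result files into a figure.
    figure.png: stats_a.csv stats_b.csv plot.R
            Rscript plot.R stats_a.csv stats_b.csv figure.png

    .PHONY: all

Touching raw.csv and rerunning make rebuilds only the targets that depend on it, and make -j 2 runs the two independent analyses concurrently because nothing in the dependency graph forces them to be serialized.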
I also got to wondering how virtualization might be used to maintain snapshots of an experiment. For example, a delta VM backed by an existing base VM could be created when an experiment is started. Changes (creation or modification of files, etc.) are all that would be stored in the delta VM, and subsequent changes to state would branch off from the base VM. Using a VM allows one to return to an experiment with the entire "machine" in the same state it was in when the experiment was originally performed. Contextual changes, such as updated datasets or changes to the OS, can optionally be brought forward. One also gains the ability to migrate long-running experiments in the face of required system maintenance. These benefits come at the cost of a small amount of extra disk space and minimal performance loss.
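Here is a rough sketch of the delta-VM idea using QEMU's copy-on-write disk images; the image names are hypothetical, and other hypervisors (VirtualBox, VMware) expose similar snapshot facilities.

    # Create a "delta" image backed by the existing base VM image.
    # Only blocks written during the experiment are stored in the delta file;
    # base.qcow2 itself is never modified.
    qemu-img create -f qcow2 -b base.qcow2 experiment-2009-07.qcow2

    # Boot the experiment from the delta image.
    qemu-system-x86_64 -hda experiment-2009-07.qcow2 -m 2048

    # A later experiment simply branches again from the same base image.
    qemu-img create -f qcow2 -b base.qcow2 experiment-2009-09.qcow2

Returning to an experiment later is just a matter of booting its delta image again, with the whole machine exactly as it was left. (Newer versions of qemu-img may also ask for the backing format to be stated explicitly with -F qcow2.)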
If you've tried either of these approaches, or have other suggestions for organizing computational experiments, please share your experiences.