They told you you’d have to manage your time, manage your equipment, even manage your research assistants, but nobody told you that you’d have to manage your data. This post is a haphazard list of tips that I’ve found useful for keeping data well behaved and obedient, and that I would like to share with you.
- It’s helpful to get the data into a uniform, tabular format during the exploratory stage – this really encourages playing and experimenting with the data.
- Make sure the data format you use is amenable to pulling out chunks (subsets) – when you are experimenting you don’t want to wait hours for an analysis to run; you just want to pilot things with small chunks of the data to test your logic and your coding. Also, make sure you have a convenient way of pulling out subsets of the data based on conditions. SQL helps, but tends to be cumbersome.
- Make data structures expandable, so you can easily add new data to the collection. However, don’t overthink things. It’s like coding: make it general, but not too general. If a new application or question comes up, you might have to redesign stuff, but that’s OK, and is better than wasting a lot of time designing data structures that never come in useful.
- Automate as much of the processing chain as possible – make scripts so that you can rerun the whole chain (from raw data to highly processed format) efficiently, especially if there is a change in specifications or data format.
- Document not just the code but also the data structures and file formats – otherwise it will be hell. I love ASCII-art diagrams; you can put them right in the code or in a simple text Readme file. Also, try to use open formats for as much of the data storage as you can (I use HDF5 for data and simple text for documentation). The next guy working on the data will thank you for this. Frequently the next guy is yourself, one year in the future.
- Make the data as uniform as possible, preferably with numerical coding for non-numeric data, so things end up as uniformly shaped matrices that can be efficiently concatenated, sliced and chunked.
- For the exploratory phase, consider using notebook-style software, like the IPython notebook. Otherwise, write well-documented short scripts (though pretty soon you will have a flotilla of scripts that need their own manager). Often it is useful to keep an actual, physical notebook. If nothing else, there is an aesthetic satisfaction to seeing your progress as you flip through the notebook. I’ve tried electronic notes, but personally I find nothing beats ink scribbles on paper.
- Decide how you will split the ‘small’ and ‘large’ parts of the data. The ‘small’ part is often metadata that will be used to select out parts of the ‘large’ data. The large data can be unruly and take up massive amounts of space (such as raw traces from a data recorder), while the small data lends itself to a simple tabular format. A pragmatic solution is often to save the ‘large’ data separately, linked to the ‘small’ data by an index. The large data is then pulled off disk in chunks based on subsets selected from the ‘small’ data table.
- Write the analysis code and design the data structures such that, as new data comes in, new analysis can be appended to the existing results. This often saves time. At some stage this is not possible, since we work on aggregate statistics, but where you can, delay the aggregation as long as possible and make it modular – so expensive computations are done in stages and the partial computations are saved in intermediate data files.
- Personally, I like to turn everything into a command-line script and basically rerun a sequence of scripts when I change the data structure or design a new analysis. This probably stems from an exposure to Unix machines in my formative years: I have many colleagues who just as profitably write GUIs to do similar tasks.
- Use spreadsheets to keep track of metadata like files/sessions/subjects/experiments – the tabular format can be easily exported, keeps things consistent, is faster than writing a CRUD application, and is more durable.
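To make the tip about pulling out subsets based on conditions concrete, here is a minimal sketch using pandas. The column names (`subject`, `condition`, `rt`) are made up for illustration – the point is just that a boolean condition on a tabular format gives you a small pilot chunk instantly:

```python
import pandas as pd

# A tiny stand-in for a tabular dataset (hypothetical columns).
df = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s2"],
    "condition": ["a", "b", "a", "b"],
    "rt": [0.41, 0.52, 0.38, 0.61],
})

# Pull out a small chunk matching a condition -- cheap to rerun
# while you test your logic, before unleashing the full dataset.
pilot = df[(df["subject"] == "s1") & (df["rt"] < 0.5)]
print(len(pilot))  # -> 1
```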
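The automation tip can be sketched as a chain of small stages, each a function from input to output, with one entry point that reruns everything. The stage names here (`clean`, `featurize`) are hypothetical, not from any particular pipeline:

```python
import json
import os
import tempfile

def clean(raw):
    """Stage 1: drop missing entries from the raw data."""
    return [x for x in raw if x is not None]

def featurize(cleaned):
    """Stage 2: reduce cleaned data to summary features."""
    return {"n": len(cleaned), "mean": sum(cleaned) / len(cleaned)}

def run_chain(raw, out_dir):
    """Rerun the whole chain, raw -> processed, in one call."""
    feats = featurize(clean(raw))
    with open(os.path.join(out_dir, "features.json"), "w") as f:
        json.dump(feats, f)
    return feats

out = run_chain([1.0, None, 3.0], tempfile.mkdtemp())
print(out)  # -> {'n': 2, 'mean': 2.0}
```

When a specification changes, you edit one stage and call `run_chain` again, rather than redoing steps by hand.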
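The numerical-coding tip might look like this in practice – pandas’ `factorize` maps labels to integer codes, so categorical data can sit in the same uniform matrix as everything else (the labels here are invented for the example):

```python
import numpy as np
import pandas as pd

labels = pd.Series(["left", "right", "left", "center"])

# Map each distinct label to an integer code.
codes, uniques = pd.factorize(labels)
print(codes)  # [0 1 0 2]

# Now the categorical column stacks into a plain numeric matrix
# alongside other columns, ready for slicing and concatenation.
trials = np.column_stack([codes, np.arange(len(codes))])
print(trials.shape)  # (4, 2)
```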
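The small/large split can be sketched as follows. Here the ‘large’ data is a memory-mapped NumPy array on disk standing in for an HDF5 file (h5py dataset slicing works the same way), and the ‘small’ data is a metadata table holding an index into it; all names and shapes are illustrative:

```python
import os
import tempfile
import numpy as np

# 'Large' data: 1000 traces of 100 samples each, saved to disk.
path = os.path.join(tempfile.mkdtemp(), "traces.npy")
np.save(path, np.random.randn(1000, 100))

# 'Small' data: one row per trace, linking an index into the large
# file with a (made-up) experimental condition code.
meta = np.array([(i, i % 4) for i in range(1000)],
                dtype=[("trace_idx", "i4"), ("condition", "i4")])

# Select rows from the small table, then pull only those traces off
# disk -- the rest of the large file is never read into memory.
wanted = meta["trace_idx"][meta["condition"] == 2]
traces = np.load(path, mmap_mode="r")[wanted]
print(traces.shape)  # (250, 100)
```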
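Delayed, modular aggregation – the appendable-analysis tip – can be sketched like this: the expensive per-batch stage produces small partial results that could be saved to intermediate files, and the final aggregation over them is cheap, so new data just appends a partial. The function names are hypothetical:

```python
import numpy as np

def partial_stats(batch):
    """Expensive per-batch stage; its result is small and saveable."""
    return {"n": len(batch), "sum": float(np.sum(batch))}

def combine(parts):
    """Cheap final aggregation over the saved partials."""
    n = sum(p["n"] for p in parts)
    return sum(p["sum"] for p in parts) / n

parts = [partial_stats(np.array([1.0, 2.0, 3.0]))]

# A new session arrives later: fold in its partial and re-aggregate,
# without touching the earlier batches again.
parts.append(partial_stats(np.array([4.0, 5.0])))
print(combine(parts))  # -> 3.0
```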
You’ll know when you’ve gotten past the data management stage: your code starts to become shorter, dealing more with mathematical transforms and less with handling exceptions in the data. It’s nice to come to this stage. It’s a bit like those fights in The Lord of the Rings, where you spend a lot of time crossing the murky swamp full of nasty creatures – not that much of a challenge, but you could die if you don’t pay attention. Then you get out of the swamp and into the evil lair, and that’s when things get interesting, short and quick.
Coding up analyses and data management are fairly intertwined (the demands of the analyses drive the data management), and I will try to write a separate post with another haphazard list of tips for analysis coding.