Data Management

They told you you’d have to manage your time, manage your equipment, even manage your research assistants, but nobody told you that you’d have to manage your data. This post is a haphazard list of tips I’ve found useful for keeping data well behaved and obedient, and that I would like to share with you.

[Cartoon: data management]

  1. It’s helpful to get the data into a uniform, tabular format during the exploratory stage; this really encourages playing and experimenting with the data.
  2. Make sure the data format you use is amenable to pulling out chunks (subsets) – when you are experimenting you don’t want to wait hours for an analysis to run, you just want to pilot things with small chunks of the data to test your logic and your code. Also, make sure you have a convenient way of pulling out subsets of the data based on conditions. SQL helps, but tends to be cumbersome.
  3. Make data structures expandable, so you can easily add new data to the collection. However, don’t overthink things. It’s like coding: make it general, but not too general. If a new application/question comes up, you might have to redesign stuff, but that’s OK, and is better than wasting a lot of time designing data structures that never turn out to be useful.
  4. Automate as much of the processing chain as possible – make scripts so that you can rerun the whole chain (from raw data to highly processed format) efficiently, especially if there is a change in specifications or data format.
  5. Document not just the code but also the data structures and file formats – otherwise it will be hell. I love ASCII-art diagrams. You can put them right in the code or in a simple text README file. Also, try to use open formats for as much of the data storage as you can (I use HDF5 for data and simple text for documentation). The next guy working on the data will thank you for this. Frequently the next guy is yourself, one year in the future.
  6. Make the data as uniform as possible, preferably with numerical coding for non-numeric data, so things end up as uniform shaped matrices which can be efficiently concatenated together, sliced and chunked.
  7. For the exploratory phase consider using notebook-style software, like the IPython notebook. Otherwise make well-documented short scripts (but pretty soon you have a flotilla of scripts that need their own manager). Often it is useful to keep an actual, physical notebook. If nothing else, there is an aesthetic satisfaction to seeing your progress as you flip through the notebook. I’ve tried electronic notes but personally I find nothing beats ink scribbles on paper.
  8. Decide how you will split the ‘small’ and ‘large’ parts of the data. The ‘small’ part is often metadata that will be used to select out parts of the ‘large’ data. The large data can be unruly and take up massive amounts of space (such as raw traces from a data recorder), while the small data lends itself to a simple tabular format. A pragmatic solution is often to save the ‘large’ data separately, linked to the ‘small’ data by an index. The large data is then pulled off disk in chunks, based on subsets selected from the ‘small’ data table (see the sketch after this list).
  9. Write the analysis code and design the data structures such that as new data comes in, the analysis can be appended to the existing analysis. This often saves time. At some stage this is not possible, since we work on aggregate statistics, but if possible, delay the aggregation as much as possible and make it modular – so we have expensive computations done in stages and the partial computations saved in intermediate data files.
  10. Personally, I like to turn everything into a command line script and basically rerun a sequence of scripts when I change the data structure or design some new analysis. This probably stems from an exposure to Unix machines in my formative years: I have many colleagues who just as profitably write GUIs to do similar tasks.
  11. Use spreadsheets to keep track of metadata like files/sessions/subjects/experiments – the tabular format can be easily exported, keeps things consistent, is faster than writing a CRUD application and more durable.
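
To make point 8 concrete, here is a minimal sketch of the ‘small’/‘large’ split, assuming h5py and pandas are available (the file names, columns and session IDs are invented for the example):

import numpy as np
import pandas as pd
import h5py

# 'Small' data: one row of metadata per recording session (hypothetical columns)
meta = pd.DataFrame({
    'session_id': ['s01', 's02', 's03'],
    'subject':    ['A', 'A', 'B'],
    'condition':  ['ctrl', 'drug', 'ctrl'],
})
meta.to_csv('sessions.csv', index=False)

# 'Large' data: one raw trace per session, stored separately and keyed by session_id
with h5py.File('traces.h5', 'w') as fp:
    for sid in meta['session_id']:
        fp.create_dataset(sid, data=np.random.randn(1000))  # stand-in for a raw trace

# Later: select sessions from the small table, then pull only those chunks off disk
wanted = meta.query("subject == 'A' and condition == 'ctrl'")['session_id']
with h5py.File('traces.h5', 'r') as fp:
    traces = {sid: fp[sid][:] for sid in wanted}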

You’ll know when you’ve gotten past the data management stage: your code starts to become shorter, dealing more with mathematical transforms and less with handling exceptions in the data. It’s nice to come to this stage. It’s a bit like those fights in Lord of the Rings, where you spend a lot of time crossing the murky swamp full of nasty creatures, which isn’t that much of a challenge, but you could die if you don’t pay attention. Then you get out of the swamp and into the evil lair and that’s when things get interesting, short and quick.

Coding up analyses and data management are fairly intertwined (the demands of the analyses drive the data management), and I will try to write a separate post with another haphazard list of tips for analysis coding.

Depth-of-field

Depth-of-field (DoF) is one of the most fun things about photography. It is enjoyable on both the technical and artistic levels. Depth-of-field is the extent (“depth”) in a scene that is in focus (“field”) on a photograph. Artistically it is usually used to isolate a subject from the surroundings and can be used to indicate depth. The subjective nature of the out of focus elements (the blur) is called Bokeh and is a marketing term invented by the Japanese to sell lenses with larger apertures that are insanely expensive (I’m kidding, I’m kidding).

There are many, many good technical articles on DoF and many, many religious wars over Bokeh. What I want to do here is focus on the values of DoF that you can get at different focal lengths and apertures, and try to relate that to everyday portraits.

A good place to get started on the geometry of DoF is Wikipedia’s article, which also has a nice derivation of the formulae for DoF that I will be using.

Intuitively, DoF exists because a lens brings to focus only an infinitely thin slice of the three-dimensional scene onto a two-dimensional plane. The parts of the scene in front of and behind this slice become progressively more and more defocused. Depending on the size at which you view the image, you will notice this blur at an earlier (large image) or later (small image) stage. Aperture plays an important role in DoF, with larger apertures giving shallower DoF.

Like most things in photography, DoF is the one-dimensional shadow of a multi-dimensional creature, depending upon focal length (f), aperture (a), subject distance (s) and sensor size. This creature is well expressed by the equation:

D = \frac{sf^2}{f^2 \pm ac(s-f)}

Here c is the circle of confusion: the amount of blur you will tolerate before you say “this is out of focus”. Typically c is given a value based on the resolving power of the sensor. Film with larger grain will have a larger c, and smaller sensors will have a smaller c. I’m doing the calculations with c = 0.02mm, which is the accepted value for APS-C sized sensors (entry level DSLRs).
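
As a quick sanity check of the equation, here is a tiny sketch (the particular lens, aperture and subject distance are made up for the example and aren’t used in the analysis below):

# Near/far limits of D = s*f^2 / (f^2 +/- a*c*(s-f)), all lengths in metres
f, a, c, s = 0.050, 1.8, 0.00002, 1.5   # 50mm lens at f/1.8, APS-C c, subject 1.5 m away
near = s*f**2/(f**2 + a*c*(s - f))
far  = s*f**2/(f**2 - a*c*(s - f))
print('In focus from {:.2f} m to {:.2f} m ({:.0f} mm deep)'.format(near, far, (far - near)*1e3))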

As I mentioned, here I’m interested in figuring out the DoF at common focal lengths for when I want to take portraits. The general advice is that for a proper portrait you need fast glass, so you can open the aperture up wide and isolate your subject by blurring the background. I was interested in getting some values for what range of apertures to use at different focal lengths in order to create good portraits.

For reasons that will become clear shortly, I need to know how far back I should stand to get the framing I want with each lens. As you can guess, the longer the focal length of the lens, the further back I need to stand. I can work out this distance by computing the magnification factor m for the portrait. This is the ratio of the size of the image on the sensor to the actual size of the subject.

I’m working this out for APS-C sensors, which are 24×16 mm. A person’s head is about 26 cm high × 15 cm deep (nose to ears) × 18 cm wide (I know, it really looks like a clueless nerd trying to do ‘art’, but bear with me). I’m going to split portraits into two classes I like:

Close-up: The face fills the frame, every part is in focus (especially the eyes) and any background is blurred as much as possible (generally oriented tall). Here m = \frac{24.}{260} \simeq \frac{24}{240} = 0.1.

Environmental: The head and upper body take up about 50% or less of the frame area and the rest comes from the subject’s surroundings, which are blurred but retain enough structure that you can tell what they are, giving the subject a context (generally oriented wide). Here m \simeq \frac{10.}{260} \simeq 0.04.

I can compute the subject distance (how far I need to stand) from the equation m = \frac{f}{s-f}, i.e. s = \frac{f(1+m)}{m}. Substituting this into the DoF equation above ultimately results in

\displaystyle D = \frac{f^3 (\frac{1+m}{m})}{f^2 \pm ac\frac{f}{m}}
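
For example, here are the subject distances this gives for the close-up framing (m = 0.1); this is a quick sketch of my own, using the same focal lengths as the plotting code further below:

m = 0.1
for f_mm in [18, 35, 50, 85, 105, 135, 200, 300]:
    f = f_mm/1000.0          # focal length in metres
    s = f*(1 + m)/m          # from m = f/(s - f)
    print('{:3d} mm lens: stand about {:.2f} m away'.format(f_mm, s))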

Now we can plot our DoF for a range of focal lengths for our purpose at different apertures for an APS-C sensor:

[Plot: close-up DoF by focal length and aperture]

[Plot: environmental DoF by focal length and aperture]

In the plots, the shaded gray area represents the nose-to-ear depth; the black lines represent the DoF at the given aperture and subject distance for the indicated focal lengths (given in mm above the shaded area, e.g. 18, 35, 50, …).

Here are two interesting things I take away from the plots.

The first is that DoF remains constant whatever focal length you use (as long as you keep the size of the subject in the photo constant). Without working this out I would have expected longer focal lengths to have smaller DoFs: that is true for the same subject distance, but since we are keeping image size constant, we get this different result.

The second is that, for close-ups, you need to go down to f/22 to get a DoF that covers nose-to-ears, and even for environmental portraits you need to stop down to at least f/5.6 to get both the nose and the ears in focus.

For the longest time I was trying portraits with lenses like the 35mm/1.8 and 50mm/1.8 wide open and I was failing miserably (especially with the 50mm, which I focus manually). Then I started getting braver and started to stop down more, to f/4.0, even f/5.6, and I got good results. These calculations show me why.

I would say you don’t necessarily need fast glass to obtain subject isolation. In some cases it looks cool to have one eye in focus and the other eye, nose and ears not, but I prefer to have the whole face in focus, with the environment thrown out of focus. For this, it seems, I can use pretty much any lens, since most lenses will give me at least f/5.6. In terms of aesthetics, of course, 35mm and longer is preferred for portraits, to avoid distorting the face unpleasantly.

I have studiously stayed away from the ‘artistic’ aspects of Bokeh. If you do want to go into depth in this aspect of the field, all you really need to know is that the worst insult you can throw in a Bokeh debate is to tell the other fellow his Bokeh looks like donuts.

Code follows:

"""
https://en.wikipedia.org/wiki/Depth_of_field#Derivation_of_the_DOF_formulas.
http://www.dofmaster.com/dofjs.html"""
import pylab

#m - subject magnification
#f - focal length
#a - aperture number
#c - circle of confusion

# Plot the depth of field by focal length and aperture

#D = \frac{f^3 (\frac{1+m}{m})}{f^2 \pm ac\frac{f}{m}}
Dn = lambda m,f,a,c: ((f**3)*(1.+m)/m)/(f**2 + a*c*f/m)
Df = lambda m,f,a,c: ((f**3)*(1.+m)/m)/(f**2 - a*c*f/m)

H = 0.15 #Nose to ears

F = [.018, .035, .050, .085, .105, .135, .2, .3] #Our choice of focal lengths, need to be expressed in m.
A = [22, 16, 8.0, 5.6, 3.4, 1.8, 1.4] #Our choice of f-stops

c = 0.00002 #0.02mm is circle of confusion http://www.dofmaster.com/dofjs.html for D5100 (APS-C)

dn = pylab.empty((len(F), len(A)))
df = pylab.empty((len(F), len(A)))
m = 0.1
#m = 0.04

for i,f in enumerate(F):
  for j,a in enumerate(A):
    dn[i,j] = Dn(m,f,a,c)
    df[i,j] = Df(m,f,a,c)

pylab.figure(figsize=(10,4))
pylab.subplots_adjust(bottom=.15)
for i,f in enumerate(F):
  s = (f/m)+f
  pylab.fill_between([s-H/2.0, s+H/2.0], [len(A)-.9, len(A)-.9], y2=-.1, color='gray', edgecolor='gray')
  pylab.text(s-.1,len(A)-.5,'{:02d}'.format(int(f*1000)))
  for j,a in enumerate(A):
    pylab.plot([dn[i,j], df[i,j]], [j, j], 'k',lw=2)
pylab.xlabel('Distance (m)')
pylab.ylabel('Aperture f-stop')
pylab.setp(pylab.gca(), 'ylim', [-.1,len(A)], 'yticks', range(len(A)), 'yticklabels', A)
pylab.suptitle('Close-up')
#pylab.suptitle('Environmental')

The curse of D- and the LDA

All you dataheads know the curse whose name must not be spoken. The curse of D(imensionality)! Let’s look at how the curse sickens us when we perform Linear Discriminant Analysis (LDA). Our intuition, when we perform LDA, is that we are rotating a higher dimensional data space and casting a shadow onto a lower dimensional surface. We rotate things such that the shadow exaggerates the separation of data coming from different sources (categories) and we hope that the data, which may look jumbled in a higher dimension, actually cast a shadow where the categories are better separated.

I have some data that can be thought of as having a large number of dimensions, say about a hundred. (Each dimension is a neuron in the brain, if you must know). I know, from plotting the data from the neurons individually, that some neurons just don’t carry any information I’m interested in, while others do. I’m interested in the question: if I combine multiple neurons together can I get more information out of them than if I look at them individually. It is possible to construct toy scenarios where this is true, so I want to know if this works in a real brain.

A question quickly arose: which neurons should I pick to combine? I know that by using some criterion I can rank the neurons by informativeness and then pick the top 50% or 30% of the neurons to put together. But what happens if I just take all the neurons? Will LDA discard the useless, noisy ones? Will my data space be rotated so that these useless neurons don’t cast any shadow and are eclipsed by the useful neurons?

This is not entirely an idle question, or a question simply of data analysis. The brain itself, if it is using data from these neurons, needs some kind of mechanism to figure out which neurons are important for what. I find this is an important question and I don’t think we are sure of the answers yet.

However, back to the math. We can generate a toy scenario to test this. I created a dataset with 10, 25 and 50 dimensions. Only one dimension is informative; the rest are noise. Data in the first dimension come from two different classes and are separated. What happens when we rotate these spaces such that the points from the two classes are as well separated as possible?

The plot below shows the original data (blue and green classes). You can see that the ‘bumps’ are decently separated. Then you can see the 10d, 25d and 50d data.

[Plot: lda_dimensionality]

Wow! Adding irrelevant dimensions sure helps us separate our data! Shouldn’t we all do this, just add noise as additional dimensions and then rotate the space to cast a well separated shadow? Uh, oh! The curse of D- strikes again!

We aren’t fooled, though. We know what’s going on. In real life we have limited data. For example, in this dataset I used 100 samples. Our intuition tells us that as our dimensions increase the points get less crowded. Each point is able to nestle into a nice niche in a higher dimension, further and further away from its neighbors. It’s like the more dimensions we add, the more streets and alleys we add to the city. All the points no longer have to live on the same street. They now have their own zip codes. (OK, I’ll stop this theme now.)

Poor old LDA knows nothing about this. LDA simply picks up our space, starts to rotate it, and is extremely happy when the shadow looks well separated, and then stops. The illusion will be removed as soon as we actually try to use the shadow. Say we split our data into test and train sets. Our training data look nicely separated, but the moment we dump in the test data: CHAOS! It’s really jumbled. Those separate zip codes – mail fraud! Thanks to the curse of D-. (There is a small sketch of this check after the code below.)

Code follows:

import pylab
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA  #sklearn.lda.LDA in older scikit-learn

def myplot(ax, F, title):
  bins = pylab.arange(-15,15,1)
  N = F.shape[0]
  pylab.subplot(ax)
  pylab.hist(F[N//2:,0], bins, histtype='step', lw=3)
  pylab.hist(F[:N//2,0], bins, histtype='step', lw=3)
  pylab.title(title)

D = 50 #Dimensions
N = 100 #Samples
pylab.np.random.seed(0)
F = pylab.randn(N,D)
C = pylab.zeros(N)
C[:N//2] = 1 #Category vector
F[:,0] += C*4 #Adjust 1st dimension to carry category information

lda = LDA(n_components=1)
#bins = pylab.arange(-15,15,1)
fig, ax = pylab.subplots(4,1,sharex=True, figsize=(4,8))

myplot(ax[0], F, 'original')
F_new = lda.fit_transform(F[:,:10],C) #Ten dimensions
myplot(ax[1], F_new, '10d')
F_new = lda.fit_transform(F[:,:25],C) #25 dimensions
myplot(ax[2], F_new, '25d')
F_new = lda.fit_transform(F[:,:50],C) #50 dimensions
myplot(ax[3], F_new, '50d')
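
To see the curse bite, here is a small follow-up sketch of the train/test check described above. It is my addition rather than part of the analysis: it uses scikit-learn’s train_test_split and LDA’s classification score instead of the histograms, fitting on half the samples and scoring on the held-out half for an increasing number of noise dimensions.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split

np.random.seed(0)
N, D = 100, 50
F = np.random.randn(N, D)   # noise in every dimension
C = np.zeros(N)
C[:N//2] = 1                # category vector
F[:,0] += C*4               # only the first dimension carries category information

for d in [1, 10, 25, 50]:
    Ftr, Fte, Ctr, Cte = train_test_split(F[:,:d], C, test_size=0.5, random_state=0)
    lda = LDA().fit(Ftr, Ctr)
    print('{:2d}d: train accuracy {:.2f}, test accuracy {:.2f}'.format(
        d, lda.score(Ftr, Ctr), lda.score(Fte, Cte)))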