Use pysam

This is a plug for Pysam: a well designed and feature rich wrapper for HTSlib. I’ve used pysam for a few years now and it has gotten better with time. It is now a very efficient and convenient library for reading and writing BAM files, reading VCF, FASTQ and FASTA files. For VCF and FASTA files it can use the tabix index to read regions.

I first used Pysam to read BAM files, which was the most feature complete part when I started. I started to use it for writing BAM files when the quirks were ironed away. I used to use PyVCF for reading use my own (inefficient) reader for FASTA and FASTQ files.

In it’s current state (0.9.1.4) Pysam is a very effective solution to reading, in addition, VCF, FASTA and FASTQ files. It transparently handles gzipped files and BCF files, so you don’t have to keep track of the file type yourself.

Not only is it able to use the VCF and FASTA indices it also returns the data nicely encapsulated. My only complaint is that I feel the VCF record format is a tad over-complicated (it uses a similar interface to PyVCF). I always end up with very complicated accessor expressions when I wish to get alleles and genotypes from a record.

What I really liked about the VCF reader is that it intelligently handles the case of deletions crossing a region boundary. So for example, if you ask for variants in the range 10 to 20 and there is a 5 base deletion starting at position 8, it will be captured because it crossed into the requested regions. It’s very nice not to have to write your own logic to handle such cases.

I’ve also found VCF reading efficient enough for my purposes. When reading a single sample from multi-sample (population) VCF files it helps to set the filter, so that only the required sample is loaded.

Don’t hang up!

I started using multiuser Unix systems in graduate school and I quickly learned to love 'nohup'. It let you do the impossible thing: you logged into a machine, started a long computation and then logged out! You didn’t have to hang around guarding the terminal till 2am. You could go get some sleep, hang out with friends, whatever.  This post is a short survey of ‘nohup’ like things that I’m still using and learning about, a decade and a half later.

Screen: For the longest time nohup <some command> & was my staple. I’d package all my code so that it was non-interactive, reading instructions from a parameter file, and writing out everything to data and log files, fire it off and come back the next day. Then someone introduced me to ‘screen‘. This was amazing. It let me reattach to a terminal session and carry on where I left off, allowing me to keep sessions open as I shuttle between work and home. I like doing screen -S <a name> to start sessions with distinct names.

Mosh: The one thing about screen is that I have to ssh back into the machine and start it up again. Mosh is a client server system that allow roaming, which means in practice, I can open a terminal, work on it, send my computer to sleep, work on the train with no WiFi, get into the office, and when my computer finds a WiFi again, Mosh takes up where it last left off, seamlessly – except for a little glitching on the display. Mosh doesn’t quite replace screen though – if I reboot my laptop or shutdown the terminal app, I lose the Mosh client. The session keeps running, but I can no longer interact with it.

reptyr: Ok, this is a fun one. Now suppose you have a running process that you started in a normal ssh session. Then you realize, this is actually a long running process. Oh great, the flashbacks to the late night terminal guarding sessions in grad school come back. There’s a program caller reptyr that lets you hop a running process from one terminal to another. What you want to do in this case, is start up a screen session and then invoke reptyr from the screen session.

Unfortunately I’ve not been able to get it to work properly – I always get

Unable to attach to pid 11635: Operation not permitted
The kernel denied permission while attaching. If your uid matches
the target's, check the value of /proc/sys/kernel/yama/ptrace_scope.
For more information, see /etc/sysctl.d/10-ptrace.conf

In my original terminal the program just stops and hands me back my command prompt. Nothing happens in my new terminal, but I can see that the program is running. It’s a bit like doing disown, which in turn is like doing nohup retroactively – your process can keep running, but there is no terminal. Fortunately you redirected stderr to a file by doing 2> err.txt, right? right?

Disassembly

Adventures in functional programming

Reading through Let over Lambda I ran into several instances where the author showed the disassembly of a bit of Lisp code. This struck me as awesome for several reasons and I wanted to do it myself. I initially thought I had to dig up a disassembler for my system (as I would have done it for C/C++) and I was blown away when I learnt that  there was a Lisp command for this!

I found getting the disassembly amazing because I didn’t think of Lisp – a dynamically typed high level language – as generating succinct machine code. I knew it could be compiled, but I expected the compilation to be something very messy, a bit like the C++ code that Cython generates for Python without the types put in – full of instruments to infer types at run time. Instead, what I saw was tight machine code that one…

View original post 186 more words

Rip Van C++

After a long hiatus, probably about a decade, I once again began coding in C++. It wasn’t hard to get back into the stride but I am surprised at how different modern C++ is, and how pleasurable it is to write. In no particular order here are the things that surprised me.

No more “this->” in class functions. I discovered this by accident when I forgot to add the “this->” qualifier in front of an instance variable and the code still compiled. I’m not sure if I like this – but it sure as hell is convenient.

CATCH. Man! One of my big draws for Python is how rapidly I can set up tests and run them. It basically made parallel testing and development (I’m sure the kids have some kind of cool name for it nowadays) possible for me. I remember with agony how it was to setup a test suite for C++ code. CATCH makes it almost as easy to write tests for C++. And you can break into the debugger. I think this settled the matter for me.

std::random. No more hanging out in dark alleys, setting up rendezvous with shady characters to get your random numbers. It’s right there, in the standard library! And you can seed it and everything!

CMAKE! Felix sent me a small self contained example, sat with me for a few minutes and that was enough to get me up and running. I do NOT miss creating makefiles by hand. Cmake appears to be very logical.

std::threads – I haven’t actually used this yet, but after looking through several examples it’s simplicity rivals that of Python’s multiprocessing pool. We’ve come a long way baby.

Contrary to what you might think, I’m not that excited by the auto keyword or by the (optional) funky new way you can write loops. I still have to get used to the lambda syntax – I think Python gets it just right – straight and simple.

Update: OK, I really like the:

for(auto x : <some collection>) {
   // process x
}

syntax. It’s pretty slick.

I was also amused by the difference between, I guess you would call it culture, for C++ and Python. I was looking for a logo for this post and, firstly, did not really find an official one, and secondly, landed on the standards page:

Screen Shot 2016-02-18 at 6.11.02 AM

which stands in contrast to the first hit for “Python” which is the official one stop shop for all your Python needs:

Screen Shot 2016-02-18 at 6.10.48 AM

Interestingly, not six months ago, I think I took a stab at getting back into C++, for a personal project, I think. And I had an adverse reaction. Nothing has really changed with C++ since then, so I guess this time round I actually stopped to ask for help and my kind colleagues pointed me in the right direction.

 

Ex Machina

A movie review with some fun discussion of the Turing test.

Spoiler Alert!

The thing I like the most about Ex Machina is that one of the protagonists not only gives us a proper definition of the Turing test, but he also describes a delicious modification of the Turing test that took me a second to savor fully. The plot also twists quite nicely in the end, though we kind of see it coming.

Movies about machines that replicate human thoughts and emotions pop up periodically. In most of these movies the machines win. Ex Machina is one of the better ones of this genre. Particularly satisfying is that the techno babble in the script touches on some advanced topics in machine intelligence.

To start with, there is a good definition of the Turing test. People make a lot of fuss about the Turing test and take it quite seriously and literally. The Turing test, to me, is basically is an admission that, when people…

View original post 1,046 more words

The three switches problem

The light bulb problem is a class of problem often presented at interviews as a test of lateral thinking. Here, in a mischievous spirit taken from the best subversive traditions we will attack this question by thinking a little too laterally …

There are different versions of the light bulb problem, but they rely on the same trick. Here is one version:

Your grand uncle lives in a rambling old house. He had an electrician over to wire a light in the attic but the electrician messed with the wiring and set up three switches instead of one. Your grand uncle wants to figure out which of the switches turns on the light in the attic. Problem is, you can’t tell from below whether the attic light is on or off. You can fiddle with the switches and then go up to the attic to inspect the light, but you can only do this once. The old coot says you can’t keep going back down and up the attic stairs fiddling with the light – apparently it disturbs the cat and he scratches the furniture.

The conventional answer to the question is to flip one switch on and leave it for a bit. Then turn it off and turn on another then go up to the attic. If the light is on, the last switch you flipped was the one. If the light is warm (Aha!) then the first switch you flipped is the one. If the light is off and cold, well is the remaining one. Let’s try and subvert the conventional answer to this problem by engaging in some mild subterfuge in the spirit of Calandra‘s “Angels-on-a-Pin“.

  1. Rapidly flip one of the switches on and off many times. Finally set this switch to the off position and flip another switch to on. Now go to the attic. If the light is on, it’s the currently on switch. If the light is burnt out, its the first switch.

  2. Call in your grand aunt, explain the problem to her, crawl up to the attic and have her flip the switches one by one. Yell when the light comes on.

  3. Flip a switch and inspect all the fixtures in the house, repeat and eliminate two of the switches. The third one is the attic switch. If none of the switches seem to be doing anything, where’s the problem? Just hit all three switches when you need the attic light on.

  4. Wait for a moonless night. Turn off all the lights in the house and draw all the curtains and blinds. Now flip the switches one by one. There will be a difference in the light level, however slight, from the attic light going on.

  5. Fly a drone up into the attic. Now you have eyes on target. Screw the cat.

  6. Take all bulbs out of their sockets and unplug all appliances. Now flip the switches one by one and watch the electric meter. This may take some patience.

  7. Open up the cover plate. Take a voltmeter and measure the drop in voltage as the switches are flipped. If only one has a voltage drop, that’s the one wired to the attic switch. If more than one have drops, throw stones into the attic until the light bulb is broken. Redo the measurements. The switch that now has no voltage drop (but did previously) is the one. Take a new bulb, go up into the attic and replace it.

  8. Flip each switch in turn and let the bulb warm up. Throw a bucket of water into the attic. If you hear a loud “POP” the bulb was on. Leave adequate cool down periods between tests. Take a new bulb, go up into the attic and replace it.

  9. Tie three strings to the switches. Go up into the attic with the strings. (Some will say this is subverting the question. Yes it is. Yes it is)

  10. Get all the shiny pots and pans your grand uncle has. Arrange them like mirrors so that you have a view of the attic from the switches.

As an aside, here is a problem I like better, as solving it actually teaches you some mathematical concepts:

Your grand uncle has passed on. In his will he’s bequeathed you an antique coin of immense value (so he says). He’s also left you with eight accurate forgeries of that coin that are indistinguishable from the original, except that they are just a fraction lighter. He’s also left you a balance, which can only be used twice before it breaks. So he says.

You are only allowed to take one of the coins with you. The original coin is priceless, the forgeries are worthless. So he says. You suspect all the coins are duds and the balance won’t break, but you take your Grand uncle’s will in the spirit it was meant, as a last fun puzzle to keep you busy so you don’t get too sad at the old codger’s passing.

Text File formats – ASCII Delimited Text – Not CSV or TAB delimited text

One issue of doing this is, by including non-printing characters, we are breaking our ‘readable using a simple text editor’ pledge. When we open this file in a regular text editor we will not get the nice alignment of units and records that commas and tabs give us. (Ok I exaggerated on the readability of cvs and tab delimitated files)

Ronald Duncan's Blog

Unfortunately a quick google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format.  ASCII Delimited Text should use the record separators defined as ASCII 28-31.

The most common formats are CSV (Comma Separated Values) and tab delimited text.  Tab delimited text breaks when ever you have either a field with a tab or a new line in it, and CSV breaks depending on the implementation on Quotes, Commas and lines. Sadly Quotes, Commas and Tab characters are very common in text, and this makes the formats extremely bad for exporting and importing data.  There are some other formats such as pipe (|) delimited text, and whilst better in that | is less frequently used they still suffer from being printable characters that are entered into text, and worst of all people, when they…

View original post 183 more words

A lame adventure with progressiveCactus

Progressive Cactus is a set of tools that will align multiple DNA/protein sequences and save to the interesting HAL format. I decided to take it out for a spin.

Compiling on Max OS Mavericks was easy (I just followed their Readme), except for this one problem with wget, but it was a easy fix. The command line arguments and input file format are very simple and logical. I started out by running the algorithm on the GRCh38 Y chromosome and the Venter institute Y chromosome (found here.)

My first run terminated rather quickly. Fortunately they provide log-files (and tell you where the log-files are saved). The program does not like non-alphanumeric characters in the header of the fasta files which turns out to be the | characters. I simply used vi to replace the |s with spaces. I didn’t do any clever monitoring of the code, but I did note that all my four cores hit 100% right away and cactus took over top. Interestingly, my memory usage remained low, which is my metric of a well designed big data crunching program. I’m sure there are options that can trade off computation for space, allowing us to tweak the program for our setup. Screen Shot 2014-01-25 at 9.51.27 AM Cactus doesn’t talk a lot on the command line but peeking into cactus.log in the working directory you specify can tell you a little of what is going on. Cactus also claims that it stores some state in files, so that the computation can take off approximately where it stopped.

I made two attempts at running cactus. During the first run I got stuck in a program called ktserver for a very long time. After a while I got bored and killed the program, then restarted it. I let this instance run for days (Sorry, I don’t have reliable stats – this was my laptop, I set it to sleep, did other work on it and at one point paused the task because I needed the CPU).

At one point the program switched to a low CPU high memory mode. In fact my 16GB Mac laptop started to gave me repeated low memory warning. I was probably lucky that I had a SSD, if I had a HDD I would probably have gone into the thrash of death. But even the SSD could not save me. My laptop became semi responsive and I finally killed the process.

Screen Shot 2014-02-06 at 8.59.39 AM

My memory came back quickly – don’t know if I should credit cactus or Mac OS for not creating memory leaks – but I had to kill Chrome (my only other application) since I never got Chrome back.

I’m unsure of what to think. I will peek into the logs, but I don’t know how far the program got, whether we were just done, however, the fact that memory usage ballooned like this is not a good sign. My endeavor is to always design programs that can run on any machine whatever. If there is less RAM it should take more time, but it should never push the machine to unresponsiveness.

Random numbers in a parallel world

TL;DR: It’s always a great idea to explicitly initialize your random number generator especially in a parallel computing environment.

Random number generation in computing commonly uses pseudorandom number generators. These are iterated functions that take an initial number (a seed) and spit out a new number each time you call them. Each number is part of a sequence that is completely defined by the seed. The sequence of numbers itself satisfies all the important properties of a random number sequence.

When you do a computer “experiment” you can explicitly initialize your random number generator by feeding it a seed. You can also let the algorithm pick out a seed from a stream of random numbers being constantly spit out by the computer. If you don’t care about repeatability and just want good quality randomness, the default in modern packages – which is to pick up a seed from the computer – works fine.

Python numpy’s random number generator uses the Mersenne Twister which, I am assured, is good enough for government work. By default the twister is initialized using /dev/urandom or the clock, which should be, also, good enough. Normally it is, but when we move into the weird whacky twilight world of parallel computing strange things can happen…

Say we have an embarrassingly parallel job where we simply start up a bunch of nodes and do some computations. What happens if we ask for a stream of 5 random numbers in each child process?

Say we ask for five numbers each:

[-1.83711524 -0.57025788 -0.47002959  1.13873683 -1.20119582]
[ 1.24164455 -1.05592017 -2.04148199 -1.10787051  0.0561248 ] <--- B
[-0.95616591 -0.17653271 -1.81273353 -0.39608772 -0.08856469] <--- C
[ 1.23821309  0.44823325 -2.09541964 -1.7484197  -0.46045061] <--- D
[ 0.21477345 -1.30221712 -0.53970868 -0.00636153  0.85769071] <--- E
[-1.1929044   0.29644317  0.64373556 -1.38675638 -0.42406317]
[ 0.84775676  0.69999582  0.36933264  1.79700735  0.34603859]
[ 0.12984335  0.26137673 -0.32926777  0.06448171  0.33496994]
[ 0.10993647  0.6695855  -1.32065528  0.93429973  0.32059549]
[-1.83711524 -0.57025788 -0.47002959  1.13873683 -1.20119582]
[ 0.81037046  0.056652    0.54062643  1.18642807 -0.16265761]
[ 1.06970208  0.13867405  1.2972758  -1.02361929  0.08262749]
[ 1.24164455 -1.05592017 -2.04148199 -1.10787051  0.0561248 ] <--- B
[-0.95616591 -0.17653271 -1.81273353 -0.39608772 -0.08856469] <--- C
[ 0.35640486 -0.36840793 -0.84085094  1.82913844 -0.09785973]
[ 1.03816641 -0.26439036 -1.73092253 -1.05740638  1.40258063]
[ 0.69154406 -0.11333892 -0.55908131 -0.80996816  1.33116669]
[-0.42959555  1.31447804 -0.5917133  -0.26106957 -1.42283209]
[ 1.23821309  0.44823325 -2.09541964 -1.7484197  -0.46045061] <--- D
[ 0.21477345 -1.30221712 -0.53970868 -0.00636153  0.85769071] <--- E

It LOOKS OK, but on closer inspection our hair stands on end. There are sequence repeats (I’ve pointed them out). That should scare you. Want to see something even more scary?

How about another run that takes slightly longer for the computer to run? (All I did was change how many random numbers I asked for, see code below):

[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[-0.1035066   0.93604127  0.34371349 -2.03675653  1.47444023  0.29328905]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[ 2.45623686  1.61541923 -0.57659283  1.33272398 -0.1824892  -0.98865263]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[-2.49928063  0.0620409   0.953208   -1.11290331 -0.0086512  -1.18000836]
[ 1.20134648  0.99562444  0.39421463 -0.38797808  1.33273899 -0.44182553]

Oh, this is enough to keep us awake at night, huh?

We are not in control of when the child process reads from /dev/urandom or when it reads the system clock. If the process lands on the same machine it can read the same seed from the /dev/urandom or from the system clock, or not. Don’t take the risk, explicitly initialize the generator. An easy way is to pass a unique identifier to the child which can then form the seed.

Code follows:

import multiprocessing as mp, numpy

def child(n):
  numpy.random.seed(n) #<-- comment this out to get the fright of your life.
  m = numpy.random.randn(6)
  return m

N = 20
pool = mp.Pool()
results = pool.map(child, range(N))
for res in results: print res

What is Mutual Information?

The mutual information between two things is a measure of how much knowing one thing can tell you about the other thing. In this respect, it’s a bit like correlation, but cooler – at least in theory.

Suppose we have accumulated a lot of data about the size of apartments and their rent and we want to know if there is any relationship between the two quantities. We could do this by measuring their mutual information.

Say, for convenience, we’ve normalized our rent and size data so that the highest rent and size are 1 “unit” and the smallest ones are 0 “units”. We start out by plotting two dimensional probability distributions for the rent and size.

Screen Shot 2013-11-01 at 12.32.04 PM

We plot rent on the x-axis, size on the y-axis. The density – a normalized measure of how often we run into a particular (rent,size) combination), and called the joint distribution (p(r,s)) – is actually plotted on the z-axis, coming out of the screen, forming a surface. To simplify matters, let’s assume the joint distribution here is uniform all over, so this surface is flat and at a constant height.

So, here the joint distribution of rents and sizes (p(r,s)) is given by the square (which is actually the roof of a cube, poking out) and the distribution of rents and sizes by themselves (called the marginals, because they are drawn on the margins of the joint distribution) are given by p(r) and p(s).

To recall a bit of probability, and probability distributions, the probability of finding a house/apartment within a certain rent/size range combo is given by the volume of the plot within that rent/size range. The volume of the whole plot, is therefore, equal to 1, since all our data is within this range.

The mutual information is given by the equation:
\displaystyle I(R;S) = \int \int p(r,s) \log \frac{p(r,s)}{p(r)p(s)}drds

This equation takes in our rent/size data and spits out a single number. This is the value of the mutual information. The logarithm is one of the interesting parts of this equation. In practice the only effect of changing the base is to multiply your mutual information value by some number. If you use base 2 you get out an answer in ‘bits’ which makes sense in an interesting way.

Intuitively we see that, for this data, knowing the rent tells us nothing additional about the size (and vice versa).

If we work out the value of the mutual information by substituting the values for p(r,s), p(r) and p(s) into the equation above we see that, since all these quantities are constant, we can just perform the calculation within the integral sign and multiply the result by the area of the plot (which is 1 and indicated by the final x1 term)
I(R;S) = 1 \log_2 \frac{1}{1 \times 1} \times 1 = 0

So we have 0 bits of information in this relation, which jives with our intuition that there is no information here – rents just don’t tell us anything about size.

Now suppose our data came out like this.
Screen Shot 2013-11-01 at 12.32.55 PM
[one-bit diagram]

Substituting the values we see that (noting we have two areas to integrate, each of size 1/2 x 1/2 = 1/4)
I(R;S) = 2 \log_2 \frac{2}{1 \times 1} \times \frac{1}{4} \times 2 = 1

That’s interesting. We can see intuitively there is a relation between rent and size, but what is this 1 bit of information? One way of looking at our plot is to say, if you give me a value for rent, I can tell you in which range of sizes the apartment will fall, and this range splits the total range of sizes in two. 2^1=2 so we say we have 1 bit of information which allows us to distinguish between two alternatives: large size and small size.

Interestingly, if you tell me the size of the apartment, I can tell you the range of the rent, and this range splits the total range of rents in two, so the information is still 1 bit. The mutual information is symmetric, as you may have noted from the formula.

Now, suppose our data came out like this.
Screen Shot 2013-11-01 at 12.33.41 PM
[two-bit diagram]

You can see that:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits! The rents and sizes seem to split into four clusters, and knowing the rent will allow us to say in which one of four clusters the size will fall. Since 2^2=4 we have 2 bits of information here.

Now so far, this has been a bit ho-hum. You could imagine working out the correlation coefficient between rent and size and getting a similar notion of whether rents and sizes are related. True, we get a fancy number in bits, but so what?

Well, suppose our data came out like this.
Screen Shot 2013-11-01 at 12.34.29 PM
[two-bit, scrambled diagram]

It’s funny, but the computation for MI comes out exactly the same as before:
I(R;S) = 4 \log_2 \frac{4}{1 \times 1} \times \frac{1}{16} \times 4 = 2

Two bits again! There is no linear relationship that we can see between rents and sizes, but upon inspection we realize that rents and sizes cluster into four groups, and knowing the rent allows us to predict which one of four size ranges the apartment will fall in.

This, then, is the power of mutual information in exploratory data analysis. If there is some relationship between the two quantities we are testing, the mutual information will reveal this to us, without having to assume any model or pattern.

However, WHAT the relationship is, is not revealed to us, and we can not use the value of the mutual information to build any kind of predictive “box” that will allow us to predict, say, sizes from rents.

Knowing the mutual information, however, gives us an idea of how well a predictive box will do at best, regardless of whether it is a simple linear model, or a fancy statistical method like a support vector machine. Sometimes, computing the mutual information is a good, quick, first pass to check if it is worthwhile spending time training a computer to do the task at all.

A note on computing mutual information

In our toy examples above it has been pretty easy to compute mutual information because the forms of p(r,s), p(r) and p(s) have been given explicitly. In real life we don’t have the distributions and all we are given is a (not large enough) pile of data. We try to estimate the functions p(r,s), p(r) and p(s) from this data on our way to computing the mutual information.

You will notice that the term in the integral is always positive. p(r,s) \geq 0 because it is a probability and \frac{p(r,s)}{p(r)p(s)} \geq 1. This second fact can be seen by considering the extreme cases where r and s are independent (in which case p(r,s)=p(r)p(s) which leads us to \frac{p(r,s)}{p(r)p(s)} = 1) and when they are completely dependent (in which case p(r,s)=p(r)=p(s) which leads us to \frac{p(r,s)}{p(r)p(s)} = \frac{p(r)}{p^2(r)} = \frac{1}{p(r)} \geq 1).

You will immediately sense the problem here. In many calculations, when we have noise in the terms, the noise averages out because the plus terms balance out the minus terms. Here, all we have are plus terms, and our integral has a tendency to get bigger.

Histograms (where we take the data and bin it) is an expedient way of estimating probability distributions and they normally work alright. But this can lead us to a funny problem when computing mutual information because of this always positive nature of the integral term.

For example, say there was really no dependence between rents and sizes, but suppose our data and our binning interacted in an odd manner to give us a pattern such as this:
Screen Shot 2013-11-01 at 12.35.14 PM
[checkerboard]

We can see that the marginals are not affected badly, but the joint, because it is in two-dimensional space, is filled rather more sparsely which leads to us having ‘holes’ in the distribution. If we now compute the mutual information we find that we have ended up with 1 bit of information, when really, it should be 0 bits.

Most attempts to address this bias in mutual information computations recognize the problem with these ‘holes’ in the joint distribution and try to smear them out using various ever more sophisticated techniques. The simplest way is to make larger bins (which would completely solve our problem in this toy case) and other methods blur out the original data points themselves.

All of these methods, not matter how fancy, still leave us with the problem of how much to smear the data: smear too little and you inflate the mutual information, smear too much and you start to wipe it out.

Often, to be extra cautious, we do what I have known as ‘shuffle correction’ (and I was told by a pro is actually called the ‘null model’). Here you thoroughly jumble up your data so that any relation ship that existed between r and s is gone. You then compute the mutual information of that jumbled up data. You know that the mutual information should actually be zero, but because of the bias it comes out to something greater. You then compare the mutual information from the data with this jumbled one to see if there is something peeking above the bias.