Bayesian networks are directed graphs that represent probabilistic relationships between variables. I’m using the super-excellent book “Pattern Recognition and Machine Learning” by C. Bishop as my reference to learn about bayesian networks and conditional independence, and I’d like to share some intuitions about these.

First, let’s motivate our study with a common home situation related to kids. We start with the question “Do we have to gas up the car?” (As you know, everything related to kids is binary. There is no room for gray areas). Well, this **depends upon **whether we need to do a grocery trip, or not. That, in turn depends upon whether the food in the fridge has gone bad and whether we need dino nuggets. Why would the food go bad? Well that depends on whether the electricity went out, and whether the darn kids broke the fridge again. The kids never broke your fridge? Lucky you. Whether we need dino nuggets depends on whether we’re having a party. The probability that the kids broke the fridge depends on whether there were a horde of them or not (i.e. a party).

Whoa, this paragraph looks complicated, like a GRE word problem. Let’s represent this as a directed graph:

Much better!

All the variables here are binary and we can put actual numbers to this network as follows:

Explaining some of the numbers, the entry for the (P)arty node indicates that there is a probability of 0.1 we’re going to have a party, and this is uninfluenced by anything else. Looking at the entry for the (G)as up node, we see that whether we gas up depends on the states of F and D. For example, if both F and D are false, there is a 0.1 probability of needing to gas up, and if F is true and D is false, there is a 0.6 probability of needing to gas up, and so on.

Intuitively, for a pair of nodes A and B, there is a linkage if the probability of the output depends on the state of the input(s). So, in the truth tables above I’ve made every row different. If, for example, the probability that B would take the state “T” is 0.2 regardless of whether A was “T” or “F” then we can intuit that A doesn’t influence B and there is no line from A to B. If, on the other hand, B takes the state “T” with p=0.2 if A is “T” and with p=0.8 if A is “F” then we can see that the probability B takes the state “T” depends on A.

Before we start to look at our kid’s party network, I’d like to examine three basic cases Bishop describes in his book, and try an develop intuitions for those, all involving just three nodes. For each of these cases I’m going to run some simulations and print results using the code given in this gist in case you want to redo the experiments for your satisfaction.

The only thing to know going forward is that I’ll be using the phi coefficient to quantify the degree of correlation between pairs of nodes. phi is a little more tricky to interpret than pearson’s R, but in general a phi of 0 indicates no correlation, and larger values indicate more correlation.

The first case is the tail-to-head configuration:

Say A=”T” with p=0.5 (i.e. a fair coin) and C=”T” p=0.7 if A=”T” and p=0.3 otherwise, and B=”T” p=0.7 if C=”T” and p=0.3 otherwise. Here “B” is related to “A” through an intermediary node “C”. Our intuition tells us that in general, A influences C, C influences B and so A has some influence on B. A and B are correlated (or dependent).

Now, here is the interesting thing: suppose we “observe C”, which effectively means, we do a bunch of experimental runs to get the states of A, C and B according to the probability tables and then we pick just those experiments where C = “T” or C = “F”. What happens to the correlation of A and B?

If we take just the experiments where C=”T”, while we do have a bias in A (because C comes up “T” a lot more times when A is “T”) and a bias in B (because B comes up “T” a lot more times when C is “T”) the two are actually not related anymore! This is because B is coming up “T” with a fixed 0.7 probability REGARDLESS OF WHAT A was. Yes, we have a lot more A=”T”, but that happens independently of what B does.

That’s kind of cool to think about. Bishop, of course, has algebra to prove this, but I find intuition to be more fun.

As a verification, running this configuration 100000 times I find the phi between A & B = 0.16379837272156458 while with C observed to be “T” it is -0.00037 and “F” it is 0.0024. Observing C decorrelates A and B. MATH WORKS!

As an aside, if I setup the network such that the bias of C does not change with the input from A, I get phi of A & B = -0.00138 and phi of A & B with C observed to be “T” = -0.00622 and “F” = 0.001665 confirming that there is never any correlation in the first place in such a “flat” network.

The second case bishop considers is the tail-to-tail configuration:

Here we have a good intuition that C influences A and B, causing A and B to be correlated. If C is “T” it causes A to be “T” with pA(T) and B to be “T” with pB(T) and when C is “F” A is “T” with pB(F) and B is “T” with pB(F). These probabilities switch to track C as it changes, resulting in the linkage.

What happens if we observe C? Say C is “T”. Now this fixes the probability of “T” for both A and B. They maybe different, but they are fixed, and the values A and B take are now independent!

The third case is very cool, the head-to-head configuration:

We easily see that A and B are independent. What happens if we observe C? Let’s consider a concrete example with a probability table

A B p(T) for C F F 0.1 F T 0.7 T F 0.3 T T 0.9

Say we run 100 trials and A and B come up “T” or “F” equally likely (p=0.5). We expect each AB pattern will occur equally often (25 times) and then the expected number of C=T states is

A B E(C=T) F F 0.1 * 25 = 2.5 F T 0.7 * 25 = 17.5 T F 0.3 * 25 = 7.5 T T 0.9 * 25 = 22.5

So we have an expected 50 trials where C=T. Of that subset of trials, the fraction of trials that each pattern comes up is:

A B p(AB|C=T) F F 2.5 / 50 = 0.05 F T 17.5 / 50 = 0.35 T F 7.5 / 50 = 0.15 T T 22.5 / 50 = 0.45

Ah! You say. Look, when A and B are independent, each of those patterns come up equally often, but here, after we observe C, there is clearly an imbalance! How sneaky! Basically, because some patterns of AB are more likely to result in C=T this sneaks into the statistics once we pick a particular C (Those hoity, toity statisticians call this “conditioning on C”). A and B are now dependent!!

I have to say, I got pretty excited by this. Perhaps I’m odd.

But wait! There’s more! Say C has a descendent node. Now observing a descendent node actually “biases” the ancestor node – so in a way you are partially observing the ancestor node and this can also cause A and B to become dependent!

Now, I was going to then show you some experiments I did with a simulation of the Kid’s party network that I started with, and show you all these three conditions, and a funky one where there are two paths linking a pair of nodes (P and G) and how observing the middle node reduces their correlation, but not all the way, cause of the secondary path, but I think I’ll leave you with the simulation code so you can play with it yourself.

Code for this little experiment is available as a github gist.