Bayesian networks and conditional independence

Bayesian networks are directed graphs that represent probabilistic relationships between variables. I'm using the super-excellent book "Pattern Recognition and Machine Learning" by C. Bishop as my reference to learn about Bayesian networks and conditional independence, and I'd like to share some intuitions about these.

First, let's motivate our study with a common home situation related to kids. We start with the question "Do we have to gas up the car?" (As you know, everything related to kids is binary. There is no room for gray areas). Well, this depends upon whether we need to do a grocery trip or not. That, in turn, depends upon whether the food in the fridge has gone bad and whether we need dino nuggets. Why would the food go bad? Well, that depends on whether the electricity went out and whether the darn kids broke the fridge again. The kids never broke your fridge? Lucky you. Whether we need dino nuggets depends on whether we're having a party. The probability that the kids broke the fridge depends on whether there was a horde of them or not (i.e. a party).

Whoa, this paragraph looks complicated, like a GRE word problem. Let’s represent this as a directed graph:

[Figure: the situation drawn as a directed graph of binary variables]

Much better!

All the variables here are binary and we can put actual numbers to this network as follows:

[Figure: the same network with probability tables attached to each node]

Explaining some of the numbers, the entry for the (P)arty node indicates that there is a probability of 0.1 we’re going to have a party, and this is uninfluenced by anything else. Looking at the entry for the (G)as up node, we see that whether we gas up depends on the states of F and D. For example, if both F and D are false, there is a 0.1 probability of needing to gas up, and if F is true and D is false, there is a 0.6 probability of needing to gas up, and so on.

Intuitively, for a pair of nodes A and B, there is a linkage if the probability of the output depends on the state of the input(s). So, in the probability tables above, I've made every row different. If, for example, the probability that B takes the state "T" is 0.2 regardless of whether A is "T" or "F", then we can intuit that A doesn't influence B and there is no line from A to B. If, on the other hand, B takes the state "T" with p=0.2 if A is "T" and with p=0.8 if A is "F", then we can see that the probability that B takes the state "T" depends on A.

Before we start to look at our kids' party network, I'd like to examine three basic cases Bishop describes in his book, and try and develop intuitions for those, all involving just three nodes. For each of these cases I'm going to run some simulations and print results using the code given in this gist, in case you want to redo the experiments for your satisfaction.

The only thing to know going forward is that I'll be using the phi coefficient to quantify the degree of correlation between pairs of nodes. Phi is a little trickier to interpret than Pearson's r, but in general a phi of 0 indicates no correlation, and larger values indicate more correlation.
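
For two binary variables, phi can be read straight off the 2×2 contingency table. Here is a minimal sketch of such a helper in Python (my own illustrative code, not the code from the gist):

import numpy as np

def phi(x, y):
    """Phi coefficient between two boolean arrays x and y."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    n11, n10 = float(np.sum(x & y)), float(np.sum(x & ~y))
    n01, n00 = float(np.sum(~x & y)), float(np.sum(~x & ~y))
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

For binary data this is the same number you would get from Pearson's r computed on the 0/1 codes, which is why the simulation sketches below simply lean on np.corrcoef.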

The first case is the tail-to-head configuration:

[Figure: the tail-to-head configuration, A → C → B]

Say A="T" with p=0.5 (i.e. a fair coin), C="T" with p=0.7 if A="T" and p=0.3 otherwise, and B="T" with p=0.7 if C="T" and p=0.3 otherwise. Here B is related to A through an intermediary node C. Our intuition tells us that, in general, A influences C, C influences B, and so A has some influence on B. A and B are correlated (or dependent).

Now, here is the interesting thing: suppose we "observe C", which effectively means we do a bunch of experimental runs to get the states of A, C and B according to the probability tables, and then we pick just those experiments where C = "T" or C = "F". What happens to the correlation of A and B?

If we take just the experiments where C="T", while we do have a bias in A (because C comes up "T" a lot more times when A is "T") and a bias in B (because B comes up "T" a lot more times when C is "T"), the two are actually not related anymore! This is because B is coming up "T" with a fixed 0.7 probability REGARDLESS OF WHAT A WAS. Yes, we have a lot more A="T", but that happens independently of what B does.

That’s kind of cool to think about. Bishop, of course, has algebra to prove this, but I find intuition to be more fun.

As a verification, running this configuration 100,000 times I find the phi between A & B is 0.1638, while with C observed to be "T" it is -0.00037 and with C observed to be "F" it is 0.0024. Observing C decorrelates A and B. MATH WORKS!
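
If you would rather not dig through the gist, a rough, self-contained sketch of this experiment might look like the following (my own illustrative code, using the probabilities from the example above):

import numpy as np

rng = np.random.default_rng(42)
N = 100_000

def phi(x, y):
    # for binary data the phi coefficient equals Pearson's r on the 0/1 codes
    return np.corrcoef(x.astype(float), y.astype(float))[0, 1]

A = rng.random(N) < 0.5                    # fair coin
C = rng.random(N) < np.where(A, 0.7, 0.3)  # C tracks A
B = rng.random(N) < np.where(C, 0.7, 0.3)  # B tracks C and never looks at A

print(phi(A, B))           # clearly non-zero
print(phi(A[C], B[C]))     # ~0: within the C="T" runs, A and B are unrelated
print(phi(A[~C], B[~C]))   # ~0 within the C="F" runs as well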

As an aside, if I set up the network such that the bias of C does not change with the input from A, I get a phi of A & B of -0.00138, and a phi of A & B of -0.00622 with C observed to be "T" and 0.001665 with C observed to be "F", confirming that there is never any correlation in the first place in such a "flat" network.

The second case Bishop considers is the tail-to-tail configuration:

[Figure: the tail-to-tail configuration, A ← C → B]

Here we have a good intuition that C influences A and B, causing A and B to be correlated. If C is "T" it causes A to be "T" with pA(T) and B to be "T" with pB(T), and when C is "F", A is "T" with pA(F) and B is "T" with pB(F). These probabilities switch to track C as it changes, resulting in the linkage.

What happens if we observe C? Say C is "T". Now this fixes the probability of "T" for both A and B. They may be different, but they are fixed, and the values A and B take are now independent!
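
A quick sketch of this case too, continuing from the previous snippet (same rng, N and phi; the probabilities here are illustrative numbers of my own, not from the post):

C = rng.random(N) < 0.5
A = rng.random(N) < np.where(C, 0.8, 0.2)  # A tracks the common parent C
B = rng.random(N) < np.where(C, 0.7, 0.3)  # so does B

print(phi(A, B))          # correlated, via the common parent
print(phi(A[C], B[C]))    # ~0 once C is observed
print(phi(A[~C], B[~C]))  # ~0 here too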

The third case is very cool, the head-to-head configuration:

[Figure: the head-to-head configuration, A → C ← B]

We easily see that A and B are independent. What happens if we observe C? Let’s consider a concrete example with a probability table

A B  p(T) for C
F F  0.1
F T  0.7
T F  0.3
T T  0.9

Say we run 100 trials and A and B come up “T” or “F” equally likely (p=0.5). We expect each AB pattern will occur equally often (25 times) and then the expected number of C=T states is

A B  E(C=T)
F F  0.1 * 25 =  2.5
F T  0.7 * 25 = 17.5
T F  0.3 * 25 =  7.5
T T  0.9 * 25 = 22.5

So we have an expected 50 trials where C=T. Of that subset of trials, the fraction of trials that each pattern comes up is:

A B  p(AB|C=T)
F F   2.5 / 50 = 0.05
F T  17.5 / 50 = 0.35
T F   7.5 / 50 = 0.15
T T  22.5 / 50 = 0.45

Ah! you say. Look, when A and B are independent, each of those patterns comes up equally often, but here, after we observe C, there is clearly an imbalance! How sneaky! Basically, because some patterns of AB are more likely to result in C=T, this sneaks into the statistics once we pick a particular C (those hoity-toity statisticians call this "conditioning on C"). A and B are now dependent!!
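
Continuing the same sketch once more (same rng, N and phi as before), with the probability table for C given above, a simulation shows exactly this effect:

A = rng.random(N) < 0.5
B = rng.random(N) < 0.5
pC = np.select([~A & ~B, ~A & B, A & ~B, A & B], [0.1, 0.7, 0.3, 0.9])
C = rng.random(N) < pC

print(phi(A, B))        # ~0: A and B start out independent
print(phi(A[C], B[C]))  # clearly non-zero (negative here): conditioning on C couples A and B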

I have to say, I got pretty excited by this. Perhaps I’m odd.

But wait! There's more! Say C has a descendant node. Observing a descendant node actually "biases" the ancestor node, so in a way you are partially observing the ancestor node, and this can also cause A and B to become dependent!

Now, I was going to show you some experiments I did with a simulation of the kids' party network that I started with, demonstrating all three of these conditions, plus a funky one where two paths link a pair of nodes (P and G) and observing the middle node reduces their correlation, but not all the way, because of the secondary path. Instead, I'll leave you with the simulation code so you can play with it yourself.

Code for this little experiment is available as a github gist.

CIGAR strings

When I first joined Seven Bridges my boss would be describing things and he would say “Do this with the CIGAR” or “compute that from the CIGAR”. Reluctant to be found out so early I nodded sagely and made a furtive note in my notebook to check out what the heck a CIGAR was.

You are most likely to run into this word – and it is in all caps, for reasons that may become clear later – if you operate closer to the nasty end of any work that requires genomic sequencing. One of the more compute-intensive things we have to do to raw genomic data, before we can get to the real sciencey stuff, is to figure out how a sequence obtained in an experiment relates to some reference data.

A lot of life is tied together, and it shows in the DNA. There are similarities not only between the DNA of humans and chimps, but also between mice and men; the closest match of all, though, is between human beings of all "races" and ethnicities, despite what some ignorant folk will try to tell you.

At the tiniest level, at very early stages of genomic sequence data analysis, we take a string of data, say

data = ACTGACTGACTGACTG

and compare it against some reference data, say

ref = ACTTACTAAGACTACTG

We notice that we can align the data against the reference by changing one letter (marked by *), assuming some letters have been deleted from the data (the "-" gaps in the data line) and some letters have been inserted into it (the "-" gap in the ref line):

data = ACTGACT--GACTGACTG
       |||*|||  |||| ||||
ref  = ACTTACTAAGACT-ACTG

 
In general, the vast majority of these alignments of data against the reference can be described as a series of exact matches interspersed with single-base mismatches, insertions and deletions.

Now, what does CIGAR stand for? It stands for Compact Idiosyncratic Gapped Alignment Record. Yes, it does sound like a backronym, doesn't it? But the "gapped alignment" part has probably given you a good clue. The major operation above was inserting gaps, either in the data or the reference, in order to get the two strings to align against each other.

A CIGAR string is simply the set of operations we did, encoded succinctly. So, in the case above, the CIGAR is

7M2D4M1I4M

 

In the CIGAR we usually ignore single-base mismatches and focus on the insertions and deletions (though the "extended CIGAR" is much more detailed and includes information about mismatches), and so we convert the gapless bits to "M", the number indicating how many letters (bases) form a stretch of gapless alignment. This is interspersed with "D"s and "I"s to indicate deleted and inserted subsequences.
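
To make the encoding concrete, here is a small sketch (mine, not from the post) that derives this simple CIGAR from the gapped alignment above, and then splits a CIGAR back into its operations:

import re

def cigar_from_alignment(aln_data, aln_ref):
    """Turn two gap-padded strings of equal length into a simple CIGAR:
    M for aligned columns, D for gaps in the data, I for gaps in the ref."""
    ops = []
    for d, r in zip(aln_data, aln_ref):
        op = "D" if d == "-" else "I" if r == "-" else "M"
        if ops and ops[-1][1] == op:
            ops[-1][0] += 1
        else:
            ops.append([1, op])
    return "".join(f"{n}{op}" for n, op in ops)

def parse_cigar(cigar):
    """Split a CIGAR like '7M2D4M1I4M' into [(7, 'M'), (2, 'D'), ...]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

print(cigar_from_alignment("ACTGACT--GACTGACTG", "ACTTACTAAGACT-ACTG"))  # 7M2D4M1I4M
print(parse_cigar("7M2D4M1I4M"))  # [(7, 'M'), (2, 'D'), (4, 'M'), (1, 'I'), (4, 'M')]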

And now you know.

Development environment

While it can be an avenue for procrastination, searching for a proper development environment often pays off with greatly improved productivity in the long term. I had some finicky requirements. I wanted to maintain my own CMakeLists file so that the project was IDE/editor/platform independent. I wanted to be able to just add files to the project from anywhere – from vi, by OS copy, by git pull (I set wildcards in the CMakeLists.txt file) – and definitely not have to fiddle in the IDE to add files. I wanted to edit the markdown files I was using for documentation and placing in a separate docs folder, and to see a decent preview of the markdown occasionally. Ideally, my environment would simply track my directory organization, including non-C++ files. However, I also wanted intelligent code completion, including for my own code.

At work, where a ton of my current stuff is in Python, I happily use PyCharm, which has been a very worthwhile investment. So CLion was a natural candidate to start with. I didn't get very far, because I balked at the license terms. It felt a little annoying to have to apply for an open-source license and then renew it every year and so on. When I want to play with red tape I talk to the government.

I completely understand needing to charge people for your software. It's a necessity for maintenance and continued development. But you cannot get money from people who have none. To me the rational model is to charge professional entities for the software, make it free for non-professional use, and just not bother with the small fry. Basically, have a smallish legal department that goes after a big player if they violate the terms. For companies and individual professionals a software license is the least of their costs (and probably the item with the highest ROI), and they will gladly pay to have someone to go to for support. Chasing down individual hobbyists just doesn't make sense. I'm sorry, where did that soapbox come from? I must have stepped on it by accident. Moving on to the next editor/IDE …

I tried Xcode next. It was a medium-ish install, and I think I needed an iTunes account. I was encouraged to use it because CMake can generate Xcode projects and many coders working on Macs swore by it. I have used it infrequently – mostly when I was editing a largish Objective-C project during my postdoc – and found it fine to use and debug with. But I disliked how CMake organized the Xcode project's virtual directories, and I disliked having to manually add non-source and new files. And memories of the over-complicated build system started to come back to me.

I then found something called CodeLite. It was confusing because I spent some time trying to figure out if it was related to Code::Blocks. CodeLite looked kind of promising. There was a nice tutorial on how to integrate it with CMake-based projects. But it looked cumbersome, and I had to add files to the project through the interface. By now I was willing to endure this, but I couldn't get over the fact that there was no preview for Markdown. I liked that they limited the number of plugins, but I really wanted markdown preview. I told you I was finicky.

Then, quite by accident, I started to poke around Visual Studio Code and downloaded it to try it out. In brief, it seems to satisfy all my finicky requirements. This choice rather surprised me, because it was just not on my radar. I also realize this does nothing to advance my attempt to get into the cool kids' programming club where people use vim or emacs, but that was a lost cause anyway. (I now recall that I initially avoided VSC because a colleague spoke out very strongly against it a few months ago. Goes to show, you need to try things out for yourself.)

CMake and Xcode

Oddly enough, though I've been working on Macs for eight years now, I haven't used Xcode a lot; in the past I used Bloodshed C++ and Eclipse. I also have fond memories of using vi and make, with my IDE being a cluster of four terminals (code, compile, debug and run output, I think they were. Or maybe one was pine). Also, the fondness is perhaps an effect of nostalgia.

Since I'm using CMake I looked around for ways to make Xcode work with it. CMake will generate an Xcode project with the generator flag -G Xcode, but the structure of this project looked atrocious and I wondered what I was doing wrong. This post by John Lamp gives some nice details about how and why the generated Xcode project is structured the way it is.

One trick I learned (Lamp mentions it, but I really paid attention when I watched this video) was to add the header files to the sources list as well in the CMake project. After I did this, the organization of the files in the Xcode project view became more sensible – Xcode knows how to split up header and source files.

Also, before you open up Xcode, add the build directory to .gitignore. When Xcode tries to manage your versioning, it seems to want to add just about everything to version control by default, and the build directory is an unholy mess.

CMake tutorial, slightly easier than the official one.

 

Starting a project

It’s that time of the decade again – to start a new hobby programming project that will probably span a few years, given that it’s a side project. So I went through the usual decision tree and this time I decided to document it for my future self.

Picking a language

I really surprised myself. I picked C++. A year or so ago I would have told you I was going to use Haskell or Common Lisp for new projects. I'm pretty proficient in Python, so I wouldn't pick that. But C++??? Now, I'll still be doing coding-challenge-type tasks in Common Lisp this year, and next year will be the year of Haskell, but every time I touch one of these languages I get my fingers bitten off. There is always something easy I want to do that breaks things, and the solution, often from practitioners in the field, seems to be pretty messy.

Python, funnily enough, is the closest I've come to an ideal language for practical use. I don't use custom-defined classes very much, it's only fake-functional (it shows in the major – no tail recursion – and in the minor – try to use lambdas with the multiprocessing module), and it can be astoundingly slow for compute-intensive tasks that can't be cast as numpy operations. But you can't beat it in the time it takes to go from a blank file to a working, self-documenting, fairly bug-free program, where you can often write very complex things very concisely and elegantly.

So why C++?? C++14, to be precise, drew me back from these REPL-oriented languages that make prototyping so fast and easy. In a nutshell, I think the standard library has turned into quite a nice thing, and tools like CATCH and CMake make medium-sized projects easier to manage. I anticipate an intense amount of numerical computation, potentially large amounts of generated data, and an attendant desire to control the type of the data and how it's laid out. Common Lisp, Python and, I suspect, Haskell aren't so bothered by such low-level details. I suspect I'd be able to quickly prototype what I want to do in those languages, but then I'd get into a hellhole trying to remove the bottlenecks.

Build system: CMake – I've used this before and found it manageable.
Command line parser: cxxopts – one file (header only).
Test system: CATCH – I've used this before and I like how easy it is.

One thing I’m excited to experiment with is Cling – allegedly a REPL for C++. We’ll see …

Picking a version system

This was easy. git. I toyed briefly with using this as an excuse to learn mercurial, but I decided to reduce the number of moving parts for this particular project. Mercurial will have to wait.

Picking a license

I've always used the GPL. This time I picked the MIT license. There were a couple of funny thoughts that went through my head as I was picking the license. Funny because they reveal very deep (but perhaps not so serious) desires that I don't often reflect on. One dominant thought was: what if this becomes widely popular (yeah, sure) and people start to capitalize on it but I don't get any of the upside (fame/money)? You'd probably want to use the GPL to block people like that.

And so I went with the MIT license, because almost everything I know, I know because random people on the internet unconditionally shared their knowledge with me, and almost every tool I use has been unconditionally granted to me by random people on the internet. The GPL started to sound a little mean-spirited. A little too overbearing. The MIT license, on the other hand, sounded more in the hacker spirit. Here is some code. I throw it to the wind. If you find it useful, send me a postcard. Just don't sue me. Leave the lawyers out of this.

Picking a host

GitHub, because of the network effects, and because I'm using git. And to be fair, GitHub has made a lot of effort to make the site feature-full. I do hope it does not end up like SourceForge. Remember when SourceForge was where we put all our projects?

Picking a documentation system

Text files in markdown. Algorithm design notes in the code files. Because everything else gets out of date and becomes a giant interdependent mess that you deal with only if this is the day job.

Work style

Ok, this one I’m gonna get right (yeah, sure). I’m starting with the specifications, then the user manual, then the tests, then the code. This is not a magic formula, and it doesn’t stop me from skipping around, but this set of practices has always seemed more logical and a lot less irritating to me. Code is living, you always go back and change things, but as long as you follow this order when you change things, sanity prevails.

Off we go …

A run-in with the garbage men

When you write and run a computer program you are processing data. You load data, create more data, process it and so on. All this data takes up memory on your computer. Without any other action the memory starts to look like a kid’s bedroom – all toys, and stale clothes, and empty cans and magazines and books everywhere. There are typically two ways programming languages deal with this clutter.

One is the "clean your room or I'll blow up the house" method. I realize this is not what parents say, but this is what happens in languages like C++ where you – the programmer – are responsible for cleaning up after yourself. If you don't, your program crashes and burns and exits with a nasty error message. If your operating system is not a grown-up operating system, it often freezes up your whole computer.

The other is automatic memory management, which is a bit like having a housekeeping service. Common Lisp is a garbage-collected language, which means that the runtime of the language takes care of the clutter for you. The service sends over a person who goes through your room and tidies up. As you can imagine, this is a lot easier for you, since now you don't have to worry about all this low-level housekeeping stuff and can concentrate on the fun things in life/computer science.

So imagine my surprise when I ran the following bit of code and my house blew up (the program's memory space, not my house literally, but you've caught on):

(defun random-string (length)
  (with-output-to-string (stream)
    (let ((*print-base* 36))
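      ;; in base 36 the values 10..35 print as the letters a..z, so each princ emits one random letter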
      (loop repeat length do (princ (+ (random 26) 10) stream)))))

(defun insert (count)
  (let ((ht (make-hash-table :test 'equal)))
    (loop
       repeat count
       do (setf (gethash (random-string 30) ht) 1))))

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
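  ;; nudge-gc is accepted but unused here; the instrumented version further down acts on it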
  (loop
     repeat loop-count
     do (insert ins-count)))

(gc-test 20 1000000 nil)

I wasn't actually doing this exact thing, but the principle was the same: I was inserting keys into a hash table which got larger and larger, and then I discarded the hash table. After a few cycles of this, BAM! My program crashed and SBCL printed out the delightfully campy:

Heap exhausted, game over.

Common Lisp’s handy (room) function prints out the memory usage of the program, and when I instrumented my program with it I saw this:

[Plot: memory usage per iteration, without forcing the GC]

Now you can see that up to the 4th iteration housekeeping kind of keeps up, though there is this funny sawtooth pattern where housekeeping slacks off on the first pass, does a thorough job on the second, slacks off on the third, catches up on the fourth. But then the union goes on strike! Housekeeping refuses to show up, the dirty pizza boxes start to pile up, and then the floor collapses. Contrary to what your friends working in sales will tell you, not all graphs that rise exponentially on the right-hand side are good.

SBCL has a neat function

(sb-ext:gc)

that allows you to manually call for housekeeping, and I modified my program to do this each time I discarded the hash-table, and when I reran the program I saw this:

[Plot: memory usage per iteration, with the GC forced after each insert]

So, manually forcing the garbage collector to run does the trick, but it made me a little disappointed with SBCL's implementation. I suspect commercial Lisp compilers do a better job. If anyone has a guess as to why the GC fails in this case, please drop a line.

In case you are interested, the function, instrumented with (room) and with the garbage collection forcing, looks like this:

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (progn
	  (insert ins-count)
	  (when nudge-gc (sb-ext:gc :full t))
	  (room nil))))

(gc-test 20 1000000 t)

Millions of tiny hypotheses

I think I recall the exact moment when I began to get a little scared of math. It was an algebra class and we were being taught tricks (tools) that could be used to solve different classes of problems. Except, I don’t recall the teacher ever explicitly saying this was a toolbox. I think what he did was go through a list of equations, performing some trick on each – like adding x to both sides, multiplying by y on both sides and so on – and then proceeding with algebraic simplification. This demoralized me, and started to make me less enthusiastic about actually doing mathematics, though I remained fascinated by it.

I also, I think, recall why I got demoralized. I watched him solve an equation with one of these tricks and sat there staring glumly at the board. I fully understood how to apply the technique; what I couldn't figure out was how I would know to apply that particular technique to that particular problem, as opposed to some other technique. What didn't help was that I had classmates who seemed to breeze through it, knowing intuitively which equation required which tool.

Unfortunately I did not have enough self-realization at that time to go and ask for help over this, or to talk it over with my mother (who was a mathematician). I just decided I didn’t have the innate talent for mathematics. Fortunately this did not cripple me and I had no hesitation diving into topics like physics and then electrical engineering (which I probably chose because it was applied physics) which had enough interesting math, with cool applications.

Many years later a stray comment on some message board somewhere by someone stuck in my head. They said that the way they did mathematics was to form hypothesis after hypothesis, testing each one. If the hypothesis results in a falsehood, discard it and start again. Another comment, on a different message board by a different person, said that it had been explicitly stated to them that there were no magical solutions to mathematical problems, and that it was a myth that there were people with innate mathematical skills and people without: you solved mathematical problems through gumption – you kept banging your head against them, trying different things, figuring it out as you went.

These two simple comments made a big impression on me. They suddenly freed me from the expectation that mathematical talent was innate. They raised the possibility that stubbornness – a trait my mother was fond of pointing out in me – was all you needed to solve mathematical problems. I was not worried about the top 99th percentile of mathematicians, who most probably had something special going on in their brains. I just wanted to enjoy math, learn more and more of it, and see if I could use it in daily life. These comments were immensely liberating. I had looked at the journey between the problem and the solution as mostly an embarrassment, as in: the longer it took, the stupider you were. These comments turned the journey into part of the process, something to take joy in.

I just wish my math teachers had said things like this. I know that when I teach my daughter mathematics this is what I’m going to emphasize. It’s about creating millions of small hypotheses – magical worlds with crazy ideas – and then applying mathematical rules to figure out if the ideas contradicted each other, or if they fit. Just like the shape puzzles she loves to solve. Each shape fits in a particular slot. Sometimes, you can see right away that the shape will fit in a slot. Sometimes, you make a guess, if it works, good. If it doesn’t, ALSO GOOD. YOU LEARNED SOMETHING! Now try a different slot!