# Self-Organizing Maps

Self-organizing maps are an old idea (first published in 1989) and take strong inspiration from some empirical neurophysiological observations from that time. The original paper “Self-Organizing Semantic Maps” by Ritter and Kohonen (pdf) has a nice discussion that took me back to some questions I was looking at in another life as a neurophysiologist. The discussion centers around cortical maps, which are most pronounced in the clearly sensory and motor areas of the brain.

Very early experiments (in their paper Lashley‘s work is referenced) led some people to propose the brain was a complete mishmash, with no functional separable internal organization. Each part of the brain did exactly what the other parts did, a bit like how a lens cut in half still works like the original lens (just dimmer), or how a hologram cut in half, still shows the original scene.

Later experiments looked more closely and found a wealth of specialization at all scales in the brain. The cortex (the most evolutionarily recent part of the brain) shows an especially fascinating organization at the smallest scales. This is most apparent in the visual cortex, where the neurons are often organized in two or higher dimensional maps. A famous example is the set of orientation maps in early visual cortex:

This figure is taken from Moser et al, Nature Reviews Neuroscience 15,466–481(2014) but you’ll run into such figures every where. It shows physical space along the cortex, color coded for the orientation of bar that the neurons at that location best respond to. One can see that the colors are not randomly assigned but form contiguous regions, indicating that neighboring neurons respond to similar orientations.

I could go on about cortical organization like this (and indeed I have elsewhere) but this is a good segway into self-organizing maps. Researchers studying machine learning got interested in this feature of the brain and wondered if it held some use for classification problems. Kohonen drew up an algorithm for what he called a self-organizing map that has been used on and off mainly for visualization via un-supervised learning.

The Kohonen map is constructed as follows.

1. Construct a sheet of “neurons” with the dimensionality you want. Usually this is limited to one, two or three dimensions, because we want to visualize a complex higher dimensional dataset.
2. Each neuron is a template – it has the same dimensions as the input data and can be overlaid on each instance of the input data.
3. We initialize the templates randomly – so the templates look like white noise.
4. We “present” each input example to the map. Then we find the template neuron that most closely matches this input. We then morph the template slightly so that it even more closely matches the input. We do this for all the neighbors of the neuron too.
5. We keep repeating this for all the input we have.

What does this result in? I have two videos for you, you product of the MTV generation, that shows how this works.

I took the famous NIST handwritten digits data set and fed it to a two dimensional SOM.

The first video represents the self-organizing map as a set of templates that change as inputs are supplied to it.

This is easy to interpret as you can identify the templates as they evolve. Note how neurons form neighborhoods that naturally link together similar stimuli. It is easy to see with this data set because the data are two dimensional visual patterns and we can judge similarity intuitively.

The second video represents each neuron’s stimulus preferences, with it’s most preferred stimulus digit appearing as the largest.

You can see of the network evolves as examples are thrown at it and segregates out into zones that respond to or represent a particular form of the stimulus. You can see how zones travel and try to spread-out, ending up at the corners of the map. This is due to the competition created by the winner-take-all operation we use (closest matching template wins) combined with the effect of adjusting the winning neuron’s neighbors too.

SOMs are fun things to play with, but are a little fiddly. You may have picked up from the videos that we had two parameters called sigma and eta that changed as the learning went on. The numbers got smaller as the training went on. Sigma is the size of the neighborhood and eta is the learning rate – the smaller the value of eta the less the neurons change in response to inputs. These are two things you have to fiddle with to get decent results. Also as you can imagine, this whole process is sensitive to the random templates we start with.

Numerous attempts have been made to try and figure out objective ways to determine these numerous parameters for a SOM. It turns out they are pretty data dependent and most people just do multiple runs until they get results they are happy with.

As I was reading Bishop I ran into his version of the SOM which he calls Generative Topographic Mapping which Bishop claims is more principled and mathematically grounded than SOMs. The GTM, which I really haven’t heard about at all, seems like a fun thing to study, but as a ex-neurophysiologist SOMs are definitely more intuitive to understand.

Code (and link to data source) follows:

# Bayesian networks and conditional independence

Bayesian networks are directed graphs that represent probabilistic relationships between variables. I’m using the super-excellent book “Pattern Recognition and Machine Learning” by C. Bishop as my reference to learn about bayesian networks and conditional independence, and I’d like to share some intuitions about these.

First, let’s motivate our study with a common home situation related to kids. We start with the question “Do we have to gas up the car?” (As you know, everything related to kids is binary. There is no room for gray areas). Well, this depends upon whether we need to do a grocery trip, or not. That, in turn depends upon whether the food in the fridge has gone bad and whether we need dino nuggets. Why would the food go bad? Well that depends on whether the electricity went out, and whether the darn kids broke the fridge again. The kids never broke your fridge? Lucky you. Whether we need dino nuggets depends on whether we’re having a party. The probability that the kids broke the fridge depends on whether there were a horde of them or not (i.e. a party).

Whoa, this paragraph looks complicated, like a GRE word problem. Let’s represent this as a directed graph:

Much better!

All the variables here are binary and we can put actual numbers to this network as follows:

Explaining some of the numbers, the entry for the (P)arty node indicates that there is a probability of 0.1 we’re going to have a party, and this is uninfluenced by anything else. Looking at the entry for the (G)as up node, we see that whether we gas up depends on the states of F and D. For example, if both F and D are false, there is a 0.1 probability of needing to gas up, and if F is true and D is false, there is a 0.6 probability of needing to gas up, and so on.

Intuitively, for a pair of nodes A and B, there is a linkage if the probability of the output depends on the state of the input(s). So, in the truth tables above I’ve made every row different. If, for example, the probability that B would take the state “T” is 0.2 regardless of whether A was “T” or “F” then we can intuit that A doesn’t influence B and there is no line from A to B. If, on the other hand, B takes the state “T” with p=0.2 if A is “T” and with p=0.8 if A is “F” then we can see that the probability B takes the state “T” depends on A.

Before we start to look at our kid’s party network, I’d like to examine three basic cases Bishop describes in his book, and try an develop intuitions for those, all involving just three nodes. For each of these cases I’m going to run some simulations and print results using the code given in this gist in case you want to redo the experiments for your satisfaction.

The only thing to know going forward is that I’ll be using the phi coefficient to quantify the degree of correlation between pairs of nodes. phi is a little more tricky to interpret than pearson’s R, but in general a phi of 0 indicates no correlation, and larger values indicate more correlation.

The first case is the tail-to-head configuration:

Say A=”T” with p=0.5 (i.e. a fair coin) and C=”T” p=0.7 if A=”T” and p=0.3 otherwise, and B=”T” p=0.7 if C=”T” and p=0.3 otherwise. Here “B” is related to “A” through an intermediary node “C”. Our intuition tells us that in general, A influences C, C influences B and so A has some influence on B. A and B are correlated (or dependent).

Now, here is the interesting thing: suppose we “observe C”, which effectively means, we do a bunch of experimental runs to get the states of A, C and B according to the probability tables and then we pick just those experiments where C = “T” or C = “F”. What happens to the correlation of A and B?

If we take just the experiments where C=”T”, while we do have a bias in A (because C comes up “T” a lot more times when A is “T”) and a bias in B (because B comes up “T” a lot more times when C is “T”) the two are actually not related anymore! This is because B is coming up “T” with a fixed 0.7 probability REGARDLESS OF WHAT A was. Yes, we have a lot more A=”T”, but that happens independently of what B does.

That’s kind of cool to think about. Bishop, of course, has algebra to prove this, but I find intuition to be more fun.

As a verification, running this configuration 100000 times I find the phi between A & B = 0.16379837272156458 while with C observed to be “T” it is -0.00037 and “F” it is 0.0024. Observing C decorrelates A and B. MATH WORKS!

As an aside, if I setup the network such that the bias of C does not change with the input from A, I get phi of A & B = -0.00138 and phi of A & B with C observed to be “T” = -0.00622 and “F” =  0.001665 confirming that there is never any correlation in the first place in such a “flat” network.

The second case bishop considers is the tail-to-tail configuration:

Here we have a good intuition that C influences A and B, causing A and B to be correlated. If C is “T” it causes A to be “T” with pA(T)  and B to be “T” with pB(T) and when C is “F” A is “T” with pB(F) and B is “T” with pB(F). These probabilities switch to track C as it changes, resulting in the linkage.

What happens if we observe C? Say C is “T”. Now this fixes the probability of “T” for both A and B. They maybe different, but they are fixed, and the values A and B take are now independent!

We easily see that A and B are independent. What happens if we observe C? Let’s consider a concrete example with a probability table

```A B  p(T) for C
F F  0.1
F T  0.7
T F  0.3
T T  0.9
```

Say we run 100 trials and A and B come up “T” or “F” equally likely (p=0.5). We expect each AB pattern will occur equally often (25 times) and then the expected number of C=T states is

```A B  E(C=T)
F F  0.1 * 25 =  2.5
F T  0.7 * 25 = 17.5
T F  0.3 * 25 =  7.5
T T  0.9 * 25 = 22.5
```

So we have an expected 50 trials where C=T. Of that subset of trials, the fraction of trials that each pattern comes up is:

```A B  p(AB|C=T)
F F   2.5 / 50 = 0.05
F T  17.5 / 50 = 0.35
T F   7.5 / 50 = 0.15
T T  22.5 / 50 = 0.45
```

Ah! You say. Look, when A and B are independent, each of those patterns come up equally often, but here, after we observe C, there is clearly an imbalance! How sneaky! Basically, because some patterns of AB are more likely to result in C=T this sneaks into the statistics once we pick a particular C (Those hoity, toity statisticians call this “conditioning on C”). A and B are now dependent!!

I have to say, I got pretty excited by this. Perhaps I’m odd.

But wait! There’s more! Say C has a descendent node. Now observing a descendent node actually “biases” the ancestor node – so in a way you are partially observing the ancestor node and this can also cause A and B to become dependent!

Now, I was going to then show you some experiments I did with a simulation of the Kid’s party network that I started with, and show you all these three conditions, and a funky one where there are two paths linking a pair of nodes (P and G) and how observing the middle node reduces their correlation, but not all the way, cause of the secondary path, but I think I’ll leave you with the simulation code so you can play with it yourself.

Code for this little experiment is available as a github gist.

# Millions of tiny hypotheses

I think I recall the exact moment when I began to get a little scared of math. It was an algebra class and we were being taught tricks (tools) that could be used to solve different classes of problems. Except, I don’t recall the teacher ever explicitly saying this was a toolbox. I think what he did was go through a list of equations, performing some trick on each – like adding x to both sides, multiplying by y on both sides and so on – and then proceeding with algebraic simplification. This demoralized me, and started to make me less enthusiastic about actually doing mathematics, though I remained fascinated by it.

I also, I think, recall why I got demoralized. I watched him solve an equation with one of these tricks and sat there staring glumly at the board. I fully understood how to apply the technique, what I couldn’t figure out was how I would know that I would have to apply that technique to that problem as opposed to some other technique. What didn’t help was that I had classmates who seemed to breeze through it, knowing intuitively which equation required which tool.

Unfortunately I did not have enough self-realization at that time to go and ask for help over this, or to talk it over with my mother (who was a mathematician). I just decided I didn’t have the innate talent for mathematics. Fortunately this did not cripple me and I had no hesitation diving into topics like physics and then electrical engineering (which I probably chose because it was applied physics) which had enough interesting math, with cool applications.

Many years later a stray comment on some message board somewhere by someone stuck in my head. They said that the way they did mathematics was to form hypothesis after hypothesis, testing them. If the hypothesis results in a falsehood, discard it and start again. Another comment, on a different message board by a different person, said that it had been explicitly stated to them that there was no magical solutions to mathematical problems, and it was a myth that there were people with innate mathematical skills and those without: you solved mathematical problems through gumption – you kept banging your head against it, trying different things, figuring it out as you went.

These two simple comments made a big impression on me. They suddenly freed me from the expectation that mathematical talent was innate. It raised the possibility that stubbornness – a trait my mother was fond of pointing out in me – was all you needed to solve mathematical problems. I was not worried about the top 99th percentile of mathematicians who most probably had something special going on their brains. I just wanted to enjoy math, learn more and more of it, and see if I could use it in daily life. These comments were immensely liberating. I had looked at the journey between the problem and the solution as mostly an embarrassment, as in the longer it took, the stupider you were. These comments turned the journey into part of the process, something to take joy in.

I just wish my math teachers had said things like this. I know that when I teach my daughter mathematics this is what I’m going to emphasize. It’s about creating millions of small hypotheses – magical worlds with crazy ideas – and then applying mathematical rules to figure out if the ideas contradicted each other, or if they fit. Just like the shape puzzles she loves to solve. Each shape fits in a particular slot. Sometimes, you can see right away that the shape will fit in a slot. Sometimes, you make a guess, if it works, good. If it doesn’t, ALSO GOOD. YOU LEARNED SOMETHING! Now try a different slot!

# The three switches problem

The light bulb problem is a class of problem often presented at interviews as a test of lateral thinking. Here, in a mischievous spirit taken from the best subversive traditions we will attack this question by thinking a little too laterally …

There are different versions of the light bulb problem, but they rely on the same trick. Here is one version:

Your grand uncle lives in a rambling old house. He had an electrician over to wire a light in the attic but the electrician messed with the wiring and set up three switches instead of one. Your grand uncle wants to figure out which of the switches turns on the light in the attic. Problem is, you can’t tell from below whether the attic light is on or off. You can fiddle with the switches and then go up to the attic to inspect the light, but you can only do this once. The old coot says you can’t keep going back down and up the attic stairs fiddling with the light – apparently it disturbs the cat and he scratches the furniture.

The conventional answer to the question is to flip one switch on and leave it for a bit. Then turn it off and turn on another then go up to the attic. If the light is on, the last switch you flipped was the one. If the light is warm (Aha!) then the first switch you flipped is the one. If the light is off and cold, well is the remaining one. Let’s try and subvert the conventional answer to this problem by engaging in some mild subterfuge in the spirit of Calandra‘s “Angels-on-a-Pin“.

1. Rapidly flip one of the switches on and off many times. Finally set this switch to the off position and flip another switch to on. Now go to the attic. If the light is on, it’s the currently on switch. If the light is burnt out, its the first switch.

2. Call in your grand aunt, explain the problem to her, crawl up to the attic and have her flip the switches one by one. Yell when the light comes on.

3. Flip a switch and inspect all the fixtures in the house, repeat and eliminate two of the switches. The third one is the attic switch. If none of the switches seem to be doing anything, where’s the problem? Just hit all three switches when you need the attic light on.

4. Wait for a moonless night. Turn off all the lights in the house and draw all the curtains and blinds. Now flip the switches one by one. There will be a difference in the light level, however slight, from the attic light going on.

5. Fly a drone up into the attic. Now you have eyes on target. Screw the cat.

6. Take all bulbs out of their sockets and unplug all appliances. Now flip the switches one by one and watch the electric meter. This may take some patience.

7. Open up the cover plate. Take a voltmeter and measure the drop in voltage as the switches are flipped. If only one has a voltage drop, that’s the one wired to the attic switch. If more than one have drops, throw stones into the attic until the light bulb is broken. Redo the measurements. The switch that now has no voltage drop (but did previously) is the one. Take a new bulb, go up into the attic and replace it.

8. Flip each switch in turn and let the bulb warm up. Throw a bucket of water into the attic. If you hear a loud “POP” the bulb was on. Leave adequate cool down periods between tests. Take a new bulb, go up into the attic and replace it.

9. Tie three strings to the switches. Go up into the attic with the strings. (Some will say this is subverting the question. Yes it is. Yes it is)

10. Get all the shiny pots and pans your grand uncle has. Arrange them like mirrors so that you have a view of the attic from the switches.

As an aside, here is a problem I like better, as solving it actually teaches you some mathematical concepts:

Your grand uncle has passed on. In his will he’s bequeathed you an antique coin of immense value (so he says). He’s also left you with eight accurate forgeries of that coin that are indistinguishable from the original, except that they are just a fraction lighter. He’s also left you a balance, which can only be used twice before it breaks. So he says.

You are only allowed to take one of the coins with you. The original coin is priceless, the forgeries are worthless. So he says. You suspect all the coins are duds and the balance won’t break, but you take your Grand uncle’s will in the spirit it was meant, as a last fun puzzle to keep you busy so you don’t get too sad at the old codger’s passing.

# The Felix problem

Say you have a multiplayer game where there is no central server. You need to indicate to a pair of players if they are within a certain distance from each other, but you can’t let any of the players know the actual positions of any other players. Could you do it?

I was driving us back from the 2014 Biological Data Science meeting at Cold Spring Harbor when Felix posed this problem to me. The drive was alternatingly boring and annoying as we hit pockets of over polite New York drivers on the way out (Though there was a stretch of road along the Hutchinson River Parkway that was SPECTACULAR in terms of fall foliage. We didn’t take any pictures, but that alone made driving to and from the conference worth it)

As a complete aside, Felix had the GPS voice on his phone on and we were entertained by numerous howlers, one them being the voice calling it the “Hutchinson River Peek-wee” (pkwy) and the other being: “Take I95 New Hampshire minus Maine” which Felix declared would simply be New Hampshire, since there is no intersection between the two sets.

Anyhow, back on point, to relieve the boredom Felix was telling me crypto related things and came up with this question, which he claimed he did not have a definitive answer to.

Here are my attempts during the drive:

Players A, B want their distance checked. Players C, D are “third-parties”

### Scheme 1 (Fails)

Player A sends their position along with $N-1$ random positions to C. Each position has a tag. Only A knows the correct position by tag. Player B does the same

Player C computes $N^2$ position combinations and returns the results to both A and B.

My idea was that when they get the results back A and B will be able to pick the correct results, because they’ll know which is their tag, but, at this point, I realized the scheme will fail, because each will get back $N$ results with their tag in it, but not know which was the original one.

But, this led me to the second scheme which works:

### Scheme 2

Player A and B send their position, along with $N-1$ random positions, tagged, to C. They also separately they send their tags to D.

C sends the $N^2$ computation results to D who then matches the correct answer based on the tag pair from A and B and then sends the result back to A and B

### Scheme 3 – nice refinement of 2

I then realized that you can save on the large number of computations ($N^2$ even if they are simple) by doing the following: A adds a random number to their position $(p_a + q_a)$ and sends this to C. B sends a similarly randomly offset number to C $(p_b + q_b)$. Both A and B send the random offsets to D.

C computes $r_1 = (p_a + q_a) - (p_b + q_b)$ and sends it to D.

D then subtracts the difference of the offsets from the result from C $r_2 = r_1 - (q_a - q_b)$, which is $r_2 = (p_a + q_a) - (p_b + q_b) - (q_a - q_b)$ i.e. $r_2 = p_a - p_b$.

This last is the computation we want and D broadcasts the flag “In range” or “Out of range”. Note that neither C nor D know the actual positions of A and B, and A and B don’t know each others’ positions either.

I was pretty happy at arriving at this solution and by this time we had meandered our way out of New Rochelle.

# Musings on Mutual Information

Mutual information (MI), often given in terms of bits of information, is a neat way to describe the relationship between two variables, x and y. The MI tells us how predictive x is of y (and vice versa). Here I will explore what the MI value in bits means.

For example, if x takes 8 values and very reliably predicts which of 8 values y will take, it turns out that the mutual information between x and y is 3 bits. One interesting aspect of MI is to get an intuition of how noise and uncertainty reflect in the MI. In the example below we will explore this a little bit.

At the end of the post I show all the Python code needed to reproduce the plots.

As our basic data we consider an array of numbers (x) each of which is either 0 or 1. So x can have two states (0,1).

Next we create another array (y). For each value in x, we have a value in y. Each value in y is sampled from a unit variance Gaussian distribution whose mean is 0 or 4 depending on the value of its corresponding x.

We have an intuition that x and y predict each other pretty well. We can see the distribution of y has two peaks depending on whether they are associated with x=0 or x=1. The mutual information between the two comes out to be 1 bit. Our intuition is that if some one gave us a value of x we could tell them if the value of y is above or below 4. Alternatively, if some one gave us a value for y, we could tell them is x was 0 or 1. We can tell them, reliably one of two alternatives. For this reason, we claim that there is 1 bit of information here. We can distinguish 2**1 alternatives with the value given to us.

Say we now arrange things such that whatever the value of x, the value of y is always chosen from the same Gaussian distribution.

Now, our intuition is that there is no way to tell what our value of y will be based on the value of x, and vice versa. When we work out our mutual information, we see that we have close to zero bits of information, which matches our intuition that x and y carry no information about each other.

Next, suppose we arrange things such that the value of y is taken from Gaussian distributions with mean 0 or 1, depending on whether the value of the corresponding x is 0 or 1.

We see that the distributions overlap, but not completely. The mutual information computation tells us that there are 0.45 bits of information between our two variables. What does that mean? What does it mean to be able to distinguish among 2**0.45=1.36 possibilities? I find it hard to interpret this value directly. Given knowledge about x (it can be 0 or 1) I know that the maximum mutual information possible is 1 bit. So anything less than 1 represents a noisy or flawed relation between x and y. I have little intution on whether 1.5 bits of information is a lot better than 1.3 bits and just a little worse than 1.7 bits. We can check what happens when we gradually increase the separation between the two distributions, making them more easily separable.

As we can see, 0.4 bits of separation imply the distributions are separated by one standard deviation (d’=1) 0.8 bits means d’=2 and then we start to saturate.

Code follows

```def mi(x,y, bins=11):
"""Given two arrays x and y of equal length, return their mutual information in bits
"""
Hxy, xe, ye = pylab.histogram2d(x,y,bins=bins)
Hx = Hxy.sum(axis=1)
Hy = Hxy.sum(axis=0)

Pxy = Hxy/float(x.size)
Px = Hx/float(x.size)
Py = Hy/float(x.size)

pxy = Pxy.ravel()
px = Px.repeat(Py.size)
py = pylab.tile(Py, Px.size)

idx = pylab.find((pxy > 0) & (px > 0) & (py > 0))
return (pxy[idx]*pylab.log2(pxy[idx]/(px[idx]*py[idx]))).sum()

x1 = pylab.zeros(1000)
x2 = pylab.ones(1000)
x = pylab.concatenate((x1,x2*2))

y = pylab.randn(x.size)+4*x
h = hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = text(2,80,'MI={:f}'.format(mi(x,y)))

y = pylab.randn(x.size)+0*x
h = hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = text(2,80,'MI={:f}'.format(mi(x,y)))

y = pylab.randn(x.size)+x
h = hist([y[:1000], y[1000:]], histtype='step', bins=31)
h = text(2,80,'MI={:f}'.format(mi(x,y)))

N = 100
kk = pylab.linspace(0,5,N)
mutinf = [mi(x,pylab.randn(x.size)+k*x) for k in kk]
plot(kk, mutinf)
ax = gca()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
xlabel('Separation')
ylabel('Mutual information')

```