Self-Organizing Maps

Self-organizing maps are an old idea (Kohonen introduced them in the early 1980s) and take strong inspiration from empirical neurophysiological observations of that time. The 1989 paper “Self-Organizing Semantic Maps” by Ritter and Kohonen has a nice discussion that took me back to some questions I was looking at in another life as a neurophysiologist. The discussion centers around cortical maps, which are most pronounced in the clearly sensory and motor areas of the brain.

Very early experiments (their paper references Lashley’s work) led some people to propose that the brain was a complete mishmash, with no functionally separable internal organization. Each part of the brain did exactly what the other parts did, a bit like how a lens cut in half still works like the original lens (just dimmer), or how a hologram cut in half still shows the original scene.

Later experiments looked more closely and found a wealth of specialization at all scales in the brain. The cortex (the most evolutionarily recent part of the brain) shows an especially fascinating organization at the smallest scales. This is most apparent in the visual cortex, where the neurons are often organized in two or higher dimensional maps. A famous example is the set of orientation maps in early visual cortex:

This figure is taken from Moser et al., Nature Reviews Neuroscience 15, 466–481 (2014), but you’ll run into such figures everywhere. It shows physical space along the cortex, color coded for the orientation of bar that the neurons at that location best respond to. One can see that the colors are not randomly assigned but form contiguous regions, indicating that neighboring neurons respond to similar orientations.

I could go on about cortical organization like this (and indeed I have elsewhere) but this is a good segue into self-organizing maps. Researchers studying machine learning got interested in this feature of the brain and wondered if it held some use for classification problems. Kohonen drew up an algorithm for what he called a self-organizing map, which has been used on and off, mainly for visualization via unsupervised learning.

The Kohonen map is constructed as follows.

  1. Construct a sheet of “neurons” with the dimensionality you want. Usually this is limited to one, two or three dimensions, because we want to visualize a complex higher dimensional dataset.
  2. Each neuron is a template – it has the same dimensions as the input data and can be overlaid on each instance of the input data.
  3. We initialize the templates randomly – so the templates look like white noise.
  4. We “present” each input example to the map. Then we find the template neuron that most closely matches this input. We then morph the template slightly so that it even more closely matches the input. We do this for all the neighbors of the neuron too.
  5. We keep repeating this for all the input we have.
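The five steps above can be sketched in a few lines of Python with NumPy. This is a minimal sketch and not the code behind the videos: the Gaussian neighborhood function, the fixed values of `sigma` (neighborhood size) and `eta` (how far a template moves per input), the grid size and the toy data are all assumptions made for illustration.

```python
import numpy as np

def train_som(data, grid_shape=(6, 6), n_epochs=5, sigma=1.0, eta=0.2, seed=0):
    """Return templates of shape (rows, cols, n_features) after n_epochs passes."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Steps 1-3: a 2-D sheet of neurons, each a random template shaped like one input.
    templates = rng.random((rows, cols, data.shape[1]))
    # Position of every neuron on the sheet, used to decide who counts as a neighbor.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for _ in range(n_epochs):  # Step 5: keep repeating over all the input.
        for x in data[rng.permutation(len(data))]:
            # Step 4a: the winner is the template closest to this input.
            dists = np.linalg.norm(templates - x, axis=-1)
            winner = np.unravel_index(np.argmin(dists), dists.shape)
            # Step 4b: morph the winner and, more weakly, its neighbors toward the input.
            grid_dist_sq = np.sum((grid - np.array(winner)) ** 2, axis=-1)
            influence = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
            templates += eta * influence[..., None] * (x - templates)
    return templates

def quantization_error(data, templates):
    """Mean distance from each input to its best-matching template."""
    flat = templates.reshape(-1, templates.shape[-1])
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())

# Toy data (made up): two tight clusters in 4 dimensions, clipped to [0, 1].
rng = np.random.default_rng(1)
centers = np.repeat([[0.2], [0.8]], 100, axis=0)
data = np.clip(centers + rng.normal(scale=0.05, size=(200, 4)), 0.0, 1.0)

before = quantization_error(data, train_som(data, n_epochs=0))  # random templates only
after = quantization_error(data, train_som(data, n_epochs=5))
print(f"mean distance to best template: before = {before:.3f}, after = {after:.3f}")
```

Passing `n_epochs=0` returns the untouched random templates, so the two printed numbers compare the map before and after training on the same starting point.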

What does this result in? I have two videos for you, you product of the MTV generation, that show how this works.

I took the famous NIST handwritten digits data set and fed it to a two dimensional SOM.

The first video represents the self-organizing map as a set of templates that change as inputs are supplied to it.

This is easy to interpret as you can identify the templates as they evolve. Note how neurons form neighborhoods that naturally link together similar stimuli. It is easy to see with this data set because the data are two dimensional visual patterns and we can judge similarity intuitively.

The second video represents each neuron’s stimulus preferences, with its most preferred stimulus digit appearing as the largest.

You can see how the network evolves as examples are thrown at it and segregates out into zones that respond to or represent a particular form of the stimulus. You can see how zones travel and try to spread out, ending up at the corners of the map. This is due to the competition created by the winner-take-all operation we use (closest matching template wins) combined with the effect of adjusting the winning neuron’s neighbors too.

SOMs are fun things to play with, but are a little fiddly. You may have picked up from the videos that we had two parameters called sigma and eta that changed as the learning went on. The numbers got smaller as the training went on. Sigma is the size of the neighborhood and eta is the learning rate – the smaller the value of eta the less the neurons change in response to inputs. These are two things you have to fiddle with to get decent results. Also as you can imagine, this whole process is sensitive to the random templates we start with.
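As a rough illustration of what "getting smaller as the training goes on" can look like, here is a small sketch of a shrinking schedule for sigma and eta. The exponential form and all the start and end values are assumptions; the post does not say which schedule the videos used, and linear or stepwise schedules are equally plausible choices.

```python
def exponential_decay(start, end, step, n_steps):
    """Shrink geometrically from `start` (at step 0) to `end` (at step n_steps)."""
    return start * (end / start) ** (step / n_steps)

n_steps = 1000  # hypothetical total number of inputs presented during training
for step in (0, 250, 500, 750, 1000):
    sigma = exponential_decay(5.0, 0.5, step, n_steps)  # neighborhood size
    eta = exponential_decay(0.5, 0.01, step, n_steps)   # learning rate
    print(f"step {step:4d}: sigma = {sigma:.3f}, eta = {eta:.4f}")
```

A wide neighborhood and large learning rate early on let the map rough out its overall layout; the small values late in training only fine-tune individual templates.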

Numerous attempts have been made to try and figure out objective ways to determine these numerous parameters for a SOM. It turns out they are pretty data dependent and most people just do multiple runs until they get results they are happy with.

As I was reading Bishop I ran into his version of the SOM, which he calls Generative Topographic Mapping and claims is more principled and mathematically grounded than SOMs. The GTM, which I really haven’t heard about at all, seems like a fun thing to study, but as an ex-neurophysiologist I find SOMs definitely more intuitive to understand.

Code (and link to data source) follows:

What is machine learning, and why should I care?

This is a (longish) informal intro to machine learning aimed at Biology/Neuroscience undergraduates who are encountering this for the first time in the context of biological data analysis.

We are, as you know, furiously lazy. It’s a biological adaptation: we always try to find the path of least resistance. Only people with distorted senses of reality, like Neurobio students, make things harder for themselves. Naturally, we dream of making machines do not just the easy work, like cutting our lawn, harvesting our crops and assembling our digital cameras, but also do the hard work, like picking stocks, diagnosing our diseases and analyzing all our data.


Categorization – making discrete decisions based on complex information – is at the heart of many interesting things we do, which are collectively called decision making. For example, we have an x-ray done and the radiologist examines the picture and determines if there is cancer or not. There are many, many input factors which go into this determination but the output is relatively simple: it is one of  ‘Yes’ or ‘No’. A lot of machine learning is devoted to understanding and implementing this kind of decision making in computers.

Machine learning = Learning by examples

When we humans do half-way interesting things, like read, make coffee and do math, we attribute that to learning. We learned how to look at squiggles on a piece of paper and interpret them as information from somebody else’s mind, usually indicating an event or thought. Often we learn things by example: we are exposed to things, like words, and we are given examples of what they sound like and what they represent and we ‘pick up’ the words, their sounds and their meanings.

Machine learning is a set of statistical techniques that take inspiration from this learning-by-example and try to mimic this process on computers, thereby realizing our dream of a perfect society where we’ll all be chilling at the beach while the machines do all the work.

The basic idea is that we collect a set of examples (“samples”) and their meanings (“category”). Right now, we are doing baby steps, so the meaning (category) is restricted to simple concepts that have been reduced to a single label. The hope is that if we show the machine enough examples we can let it loose on the world and when it finds new things it can look to its book of samples and match up the new thing it sees with what it already has seen and then take a decision: This is a ‘tree’, this is ‘Euler’s equation’, ‘you need to exercise more’.

Supervised learning

Suppose I want to teach the machine what a cat looks like. I collect a huge bunch of cat pictures and label them “cat”. We show the computer the pictures (more on that later) and tell it “These are cats”. Cool! Now the computer knows what a cat looks like.

But what happens when it sees a picture of a house, or a tree? It doesn’t know what a cat doesn’t look like. Some people will say they have a secret method where it’s possible to show the computer just pictures of cats, and when a non-cat comes along, the computer will know the difference. We are more cautious. For good measure we throw in a huge collection of pictures that are not cats and tell the computer about that too, so it knows to look for the difference.

This is called supervised learning, because we are supervising the computer by telling it what picture is what.

Unsupervised learning: the height of laziness

Humans learn many things by just observing. In machine learning there are techniques, called unsupervised learning, where we don’t even bother to label the examples. We just dump a whole lot of data into the computer and say to it “Here’s all the information, you sort it out, I’m off camping.” The computer patiently goes through the samples, finds differences between them and sorts them into categories it comes up with on its own. This is a powerful method, but as you can imagine, without supervision, the results can be quite hilarious. In general, the computer, being prone to over-analysis, tends to find tiny differences between samples, and tends to break them up into many, many categories. Without some kind of intervention the computer will patiently put each sample into its own separate category and be quite satisfied in the end.
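To make the "each sample in its own category" failure concrete, here is a small sketch using one very simple unsupervised method, a threshold-based ("leader") clustering. It is only one of many unsupervised techniques and is chosen here because it is short. The one-number samples and the threshold values are made up; the threshold plays the role of the "intervention" that tells the computer how small a difference is worth caring about.

```python
import random

def leader_cluster(samples, threshold):
    """Put each sample in the first category whose founding sample is within
    `threshold` of it; otherwise the sample founds a new category."""
    leaders, labels = [], []
    for s in samples:
        for idx, leader in enumerate(leaders):
            if abs(s - leader) <= threshold:
                labels.append(idx)
                break
        else:  # no existing category was close enough
            leaders.append(s)
            labels.append(len(leaders) - 1)
    return labels

# Made-up data: one measurement per sample, drawn from two well separated groups.
rng = random.Random(0)
samples = [rng.gauss(0.0, 0.3) for _ in range(30)] + [rng.gauss(5.0, 0.3) for _ in range(30)]
rng.shuffle(samples)

for threshold in (2.5, 0.5, 0.0):
    n_categories = len(set(leader_cluster(samples, threshold)))
    print(f"threshold {threshold}: {n_categories} categories")
```

With a generous threshold the two underlying groups come out; as the threshold shrinks toward zero the number of categories climbs until every sample sits alone.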

How is this different from more traditional computing?

In traditional computing we do all the ‘thinking’ for the computer and program a specific algorithm (a clear cut set of computing steps) based on an explicit mathematical formula. We are the ones who come up with the core process of how to solve the problem. For example, say we are trying to get a computer to drive a car and we want it to stop when it sees a red stop sign. We think hard and say, “The most salient thing about a stop sign is the insane amount of red color it has. We’ll program the computer to stop the car when it sees a lot of red.”

If we used a machine learning approach, we would say, “Why should I do all the thinking? I want to hang out with my homies. I’ll just show the computer a huge scrapbook  of stop signs, yield signs, pedestrian crossing signs and let the computer figure it out. The only work I’ll do is that I’ll put the signs into the correct categories first. If I’m in a hurry, I might not even do that.”

“Learning” = finding similarities in the features

How do we distinguish cats from, say, fire hydrants? We’ve found things (features) that are common to cats that are not present in fire hydrants. More precisely, we’ve found combinations of features that are very likely to make up a cat and very unlikely to make up a fire hydrant.

So how and what does the computer learn? The details of this depend on which exact machine learning method you use from the veritable zoo of machine learning methods. In general, the input that you give to the computer is converted to a list of numbers called a feature vector. It can be as simple as taking a picture of a cat, rearranging all the pixels into one long row and then feeding in this row of pixels as numbers.
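The simplest version of this, rearranging the pixels into one long row, looks like the sketch below. The tiny 3x3 "picture" is made up, and dividing by 255 to rescale the numbers is a common convenience rather than a requirement.

```python
import numpy as np

# A made-up 3x3 grayscale "picture" (0 = black, 255 = white).
picture = np.array([
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
], dtype=np.uint8)

# Rearrange the rows of pixels into one long row and rescale to the range 0..1.
feature_vector = picture.reshape(-1).astype(float) / 255.0
print(feature_vector.shape)
print(feature_vector)
```

A real photograph works the same way, only the row is much longer: one number per pixel (or three per pixel for a color image).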

So how does the computer learn to use this feature vector to learn what is a picture of a cat and what isn’t? You would expect that the feature vector taken from the picture of the cat is similar in some sense to vectors from pictures of other cats, and different from vectors of fire hydrant pictures. Machine learning algorithms use different mathematical formulae to automatically find the similarity between cat vectors and the differences between cat and fire hydrant vectors.

When we show the computer a new picture, the computer converts the picture into a feature vector, refers to its giant book of existing feature vectors and asks the question “Is this new vector more similar to a cat vector or to something else in my book?” and bases its decision on that.
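Taken literally, "look it up in the book and pick the most similar entry" is the nearest-neighbor method, which is just one member of the zoo. A minimal sketch follows, assuming straight-line (Euclidean) distance as the measure of similarity and using a made-up book of four three-number feature vectors.

```python
import numpy as np

def nearest_neighbor_label(new_vector, book_vectors, book_labels):
    """Return the label of the book entry closest to new_vector."""
    distances = np.linalg.norm(book_vectors - new_vector, axis=1)
    return book_labels[int(np.argmin(distances))]

# The "giant book" (made up): each row is the feature vector of one labeled picture.
book_vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.1, 0.2, 0.9],
    [0.2, 0.1, 0.8],
])
book_labels = ["cat", "cat", "fire hydrant", "fire hydrant"]

new_picture = np.array([0.85, 0.7, 0.15])
print(nearest_neighbor_label(new_picture, book_vectors, book_labels))
```

Other methods replace the lookup with something more compact, such as a boundary drawn between the cat vectors and the fire hydrant vectors, but the question being asked is the same.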

Feature selection

Most things in the world have many, many features. A cat’s features would, for example, include the shape and size of the ears, the color of the eyes, the color and length of the fur and so on. If we were distinguishing between fire hydrants and cats we probably wouldn’t have to look in great detail into the features of both; just a handful, such as color and overall shape, will do. If we are trying to distinguish between breeds of cats, however, we probably need to delve in great detail into a lot of the features.

Deciding which features to use has a great impact on our ability to teach the computer to perform categorization successfully. Sometimes we hand select the features, where we have a good idea of what things distinguish the categories we care about. Other times, we get lazy and, once again, let the computer sort it out.

Dirty little secrets: Overtraining/overfitting and the curse of dimensionality

A Fallout 3 in-game poster, from Destructoid

As you may have noticed this rambling piece of text bears a suspicious resemblance to rose tinted predictions of the 60s of how robots were going to do all our chores for us as we lay about in our bathrobes, smoking our pipes and ordering trifles by mail. If machine learning is so great where are our jet-packs, flying cars and domestic robots?

Well, two things. First, the future is here, but it’s a lot more profane than the romantics of the 1960s depicted. Automated algorithms track our spending habits and place annoying ads on our social media pages. The US postal service uses robots that can read our scraggly handwriting to route our birthday card to mom. UPS fired all the human telephone operators and now you can speak in your tracking number and the computer will take it down despite your thick southern accent.

Secondly, all this glib talk about “Letting the computer sort things out” should have you on your guard. After all, we all know about Skynet. There are two main problems with having machines do all the work.


The first is called overfitting. This is where the computer gets too fixated on the details of the input and builds over-elaborate classification rules. Basically, we give the computer 1000 pictures of cats, and the computer builds a super elaborate library memorizing every teeny detail of those 1000 pictures of cats. So now we show it the 1001st picture of a cat and because the picture is a little bit different from each of the others the computer says “That picture is not an exact match to any of my 1000 cat pictures so that’s not a cat, it must be a fire hydrant.”

We can diagnose overfitting by doing something called cross-validation. Though this sounds like some kind of new-agey pop-psychology mumbo-jumbo, the concept is simple: you don’t test your students using the same question set that you trained them on. We take our set of 1000 cat pictures and divide it up into two sets. The first set we call the training set (say 700 pictures). We let the computer learn using those 700 pictures. Then, we pull out the other 300 pictures that the computer has never seen before and make the computer classify those. The idea is that if the computer has done some kind of rote memorization trick and has not understood the essence of cat, it’s going to do rather poorly on them while it does spectacularly well on the original 700. This is how we catch computers and other students out.
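Here is a small sketch of the 700/300 split catching a rote memorizer. Everything in it is made up for illustration: the 1000 "pictures" are random five-number feature vectors, half labeled cat and half fire hydrant, the memorizer is a plain lookup table, and the comparison method (nearest class average) is simply a convenient non-memorizing rule. A single split like this is the simplest form of the idea; fuller versions repeat it with different splits and average the results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 1000 labeled pictures with 5 features each.
n = 1000
labels = np.array([1] * 500 + [0] * 500)  # 1 = cat, 0 = fire hydrant
features = rng.normal(loc=labels[:, None] * 2.0, scale=1.0, size=(n, 5))

# Shuffle, then use 700 for training and hold back 300 for testing.
order = rng.permutation(n)
train_idx, test_idx = order[:700], order[700:]

# Rote memorizer: exact lookup, and anything it has not seen "must be a fire hydrant".
memory = {tuple(features[i]): labels[i] for i in train_idx}
def memorizer(x):
    return memory.get(tuple(x), 0)

# A rule that generalizes: which class average is the new vector closer to?
cat_center = features[train_idx][labels[train_idx] == 1].mean(axis=0)
hydrant_center = features[train_idx][labels[train_idx] == 0].mean(axis=0)
def centroid_rule(x):
    return int(np.linalg.norm(x - cat_center) < np.linalg.norm(x - hydrant_center))

def accuracy(classify, idx):
    return float(np.mean([classify(features[i]) == labels[i] for i in idx]))

for name, rule in (("memorizer", memorizer), ("centroid rule", centroid_rule)):
    print(f"{name}: train = {accuracy(rule, train_idx):.2f}, test = {accuracy(rule, test_idx):.2f}")
```

The memorizer is perfect on the 700 pictures it has seen and falls to roughly coin-flip level on the 300 it has not, which is exactly the gap the held-back set is there to expose.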

But how do we ensure that the computer does not do rote memorization in the first place? That’s a pretty problem. There are no easy answers, but one thing we try is to figure out ways the computer could rote learn and penalize it. If, for example, the computer is using too many features, we say “Bad computer, use fewer features.” But this is all very subjective, trial and error, and domain specific. Our only objective measure of success is to do cross-validation.

The curse of dimensionality

This is very scary, but I’m sorry I have to tell you about it. Remember when we talked about feature vectors? A feature vector is a list of numbers that describes the input. The more complex the nature of the input the longer each feature vector (the more numbers in the list).

Suppose we have been given a dossier on 20 people and we are trying to figure out who to pick for the basketball team. None of these people have played sports of any kind before, so all we have are physical attributes and a bunch of other things like hometown, what color car they drive, how many times they flunked math and so on and so forth. The only other thing we have is a similar dossier on 50 other folks from last year and a notation that says “Good player” or “Bad player”. How do we do this?

We start off simple. We pick the physical attribute “Height” and look at last year’s dossier. Interestingly,  when we arrange the 50 previous players it turns out most everyone above 5’8″ is a good player and most everyone below that height is a bad player. So we sort this year’s list into ‘Good’ and ‘Bad’ piles based on height. We send the kids out to play and it turns out our picks are pretty good.

Well, the coach comes back and says, can we do better? We say, hell, if we did so well with just one feature (height) why don’t we toss ALL the information in the dossier into the comparison. It can only get better, right?

So we start to include everything we have. Pretty soon a cool pattern emerges. All the tall players who scored above D+, live in a yellow house and drive a red car AND all the tall players who scored above B+, live in a blue house and drive a yellow car are good players. Man, ring up that Nate Gladwell guy, he’ll want to know about this! We split the new players up according to this criterion and wait to hear the good news from the coach.

Next week the coach storms in and chews us out. He tells us we’re a bunch of dunderheads, it’s such a mixed bag of players his pet cat could have picked a better team, and we are FIRED!

(My apologies to people who realize I know nothing about sports)

What happened? How did we fail so badly? Does this remind you of the overfitting problem, where our computer got too hung up on the details? Well, yes. It’s kind of the same thing. When you add more features (also called dimensions of the feature vector, that is, how many numbers there are in the list) you need many, many more sample points. If we had enough dossiers those phoney correlations between playing ability and house color would have been drowned out. However, because we only had a relatively small number of dossiers we started to get hung up on coincidences between things. The more features we pick from the dossier, the more coincidences we find.

Statisticians, known for their rapier wit and general good humor, call this the curse of dimensionality.
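The "more features, more coincidences" effect is easy to reproduce with made-up numbers. In the sketch below the 50 dossiers get a coin-flip "Good player" or "Bad player" label and up to 1000 entries of pure noise, so by construction nothing in the dossier has anything to do with playing ability. We then ask how strong the best coincidental correlation is when we are allowed to search through 1, 10, 100 or 1000 entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dossiers = 50

# Made-up labels: 1 = "Good player", 0 = "Bad player", assigned by coin flip.
good = rng.integers(0, 2, size=n_dossiers)
# 1000 made-up dossier entries that are pure noise, unrelated to playing ability.
entries = rng.normal(size=(n_dossiers, 1000))

# Strength of the (purely coincidental) correlation of each entry with the label.
coincidence = np.array(
    [abs(np.corrcoef(entries[:, j], good)[0, 1]) for j in range(1000)]
)

# Best correlation found when we search the first k entries of the dossier.
best_by_count = {k: float(coincidence[:k].max()) for k in (1, 10, 100, 1000)}
for k, best in best_by_count.items():
    print(f"{k:4d} features: strongest coincidental correlation = {best:.2f}")
```

The strongest correlation can only go up as more entries are searched, even though every one of them is noise. With only 50 dossiers, the yellow-house-and-red-car pattern is the kind of thing this search will happily turn up.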

As applied to Neuroscience: An information measure

As a neuroscientist, or neurophysiologist or general biology student why should you care? Well, these statisticians are encroaching on your turf. Statisticians can smell data a mile away, and biology absolutely reeks of it.

Using fMRI studies as an example, traditionally we have looked at each voxel individually and done a simple statistical test to answer the question “Is the activity in this tiny cube of the brain (a few millimeters on a side) significantly different from chance?” Using somewhat dodgy corrections for multiple comparisons we then create a picture of the brain with all the significantly active parts of the brain colored in, and we draw deep conclusions like “The visual cortex is more active when the person looks at visual stimuli” or “The motor cortex lights up when the subject does motor actions”.

When the statisticians sneak in with forged immigration papers, things get more wild. The statisticians are not happy with these stodgy answers, like “this brain region is differentially active during a working memory task”. They go straight to the heart of why anybody would do brain science in the first place. “Can I read your mind and guess which card you are holding?”

And the plan they have is to use these machine learning techniques we just discussed to answer this question. A popular method is to take those fMRI pictures of the brain, much like the ones you find on the internet, and feed them into the computer, just as we talked about, along with some category information: “Looking at cat”, “Looking at fire-hydrant” and so on.

Then you test the computer (recall, cross-validation): you show it an fMRI and ask it to guess what the person was thinking/seeing/doing when the fMRI was done.

The result is a value, a percentage correct. This percentage correct ranges from chance to 100% and is a crude measure of how much information there is in the brain about a certain thing, like a stimulus or a mood. If we partition the fMRI, we can answer questions like: how much information is there in the prefrontal cortex, in the basal ganglia, in the thalamus and so on?
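A toy version of this "percentage correct per brain region" idea is sketched below. Nothing here is real fMRI data: the two "regions" are arrays of random numbers, one of which has a stimulus-dependent shift added so that it carries information and one of which does not. The decoder (nearest class average) and the 140/60 split are likewise assumptions chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_scans, n_voxels = 200, 20
stimulus = np.array([0, 1] * (n_scans // 2))  # 0 = looking at cat, 1 = looking at fire hydrant

# Hypothetical regions: one whose voxels shift with the stimulus, one that ignores it.
informative_region = rng.normal(size=(n_scans, n_voxels)) + 0.8 * stimulus[:, None]
uninformative_region = rng.normal(size=(n_scans, n_voxels))

def decoding_accuracy(scans, labels, n_train=140):
    """Learn class averages on the first n_train scans, then guess the held-back ones."""
    train, test = slice(0, n_train), slice(n_train, None)
    centers = [scans[train][labels[train] == c].mean(axis=0) for c in (0, 1)]
    dist_to_cat = np.linalg.norm(scans[test] - centers[0], axis=1)
    dist_to_hydrant = np.linalg.norm(scans[test] - centers[1], axis=1)
    guesses = (dist_to_hydrant < dist_to_cat).astype(int)
    return float(np.mean(guesses == labels[test]))

acc_informative = decoding_accuracy(informative_region, stimulus)
acc_uninformative = decoding_accuracy(uninformative_region, stimulus)
print(f"informative region:   {100 * acc_informative:.0f}% correct")
print(f"uninformative region: {100 * acc_uninformative:.0f}% correct (chance is 50%)")
```

The region with the built-in shift should decode well above chance and the pure-noise region should hover around 50%, which is the sense in which percentage correct acts as a crude measure of information.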

Again, this differs from traditional methods of analyzing fMRI data only in that we don’t fix what it is exactly in the data that will give us the answer. We let the computer figure it out. As we learned, this is exciting, but also dangerous if we let the computer totally loose on the data (recall the curse of dimensionality, overfitting and whatnot).

It is also important to remember that this is not a statement on the nuts and bolts of HOW the brain is processing the information, merely a statement on how MUCH information there is (possibly) in some part of the brain that the rest of the brain could possibly use.