Parzen windows for estimating distributions

Part of a set of moderately technical writeups of some adventures in computing mutual information for neural data.

Often, for example, when you are computing mutual information, you need to estimate the probability distribution of a random variable. The simplest way, which I had done for years, is to to create a histogram. You take each sample and put it into a bin based on its value. Then you can use one of several tests to check if the shape of the histogram deviates from whatever distribution you are interested in.

When you don’t have enough data (aka Always) your histogram comes out jagged (Almost everyone who’s ever made a histogram knows what I’m sayin’). In undergrad stats I learned that 11 was a nice number of bins, and indeed both Matplotlib and MATLAB seem to have that as the default. What I ended up doing was plotting the data using various bins until, by inspection, I was satisfied by the smoothness of the histogram.

Turns out mathematicians have a whole cottage industry devoted to formalizing how to compute the number of bins your histogram should have. The relevant keyword is (imaginatively enough) “histogram problem”. I ran into this word while reading up a Navy technical report by Cellucci et al found here. That paper has references to a bunch of dudes who worked on that problem. Anyhow, there are lots of complex schemes but I liked Tukey’s formula, which was $n^{1/2}$ (n being the number of data points).

This post however, is not about binning. It’s about Parzen windows to get rid of binning altogether. There is a deep theoretical underpinning to Parzen windowing, but intuitively I understand Parzen windowing as a smoothing operation that takes the samples we have and creates a smooth distribution out of the points.

Here’s how I understand Parzen windows: Each sample creates a splash – it’s own little gaussian (Apparently, you can also use boxcar windows or whatever window has a nice property for your problem). This is the Parzen window. As a result, the sample is no longer tightly localized but has a bit of a blur to it. We then add up all the blurs to create a smoothened curve/surface which is our estimate of the pdf of the samples. With a judicious choice of the width of the blurring and proper normalization of the height of each gaussian we can come up with a sensible pdf estimate. The advantage of this is that you know have a continuous function representing the pdf, which you can integrate.

Formally (I referred to a paper by Kwak and Choi – Input Feature Selection by Mutual Information based on Parzen window) the Parzen window estimate of the pdf is given by

$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} \phi(x-x_i,h)$

where $\phi$ is the window function, and I used a gaussian for that. As you can see, the density estimate at any given point $x$ is given by the sum of gaussians centered around each data point $x_i$ with the width of the gaussian being given by $h$. The larger $h$ is the more washed out the estimate is and the smaller $h$ is the more jaggy is the estimate.

We seem to have transferred our problem of finding an appropriate bin width (bin count) to finding an appropriate smoothening constant (What! You expected a free lunch!?). I used what wikipedia calls Silverman’s rule of thumb: $h=1.06\hat{\sigma}n^{-1/5}$.

Here is a fun little animation showing how the Parzen window estimate of a pdf (thin black line) matches up with the actual pdf (thicker blue line). The histogram of the actual data points are shown in light gray in the background.

Interesting things about this smoothing is that there is no binning involved – the final curve depends only on the actual data samples – and we make no strong assumptions about the pdf of the data – it’s not like we are trying to fit a model of the pdf to the data. Here is an animation of the exact same technique being used to fit a uniform distribution. We could have done better with a different choice of window width more tuned to the distribution, but the idea is that it still works if we don’t have any idea of what the actual pdf looks like.

These are also known as mixture of Gaussians or mixture decompositions.

During my web-searches I ran across this nice set of lecture slides about estimating pdfs from a prof at TAMU.

import numpy, pylab, matplotlib.animation as animation

root_two_pi = (2*numpy.pi)**0.5
parzen_est = lambda x,X,h,sigma: numpy.exp(-(X-x)**2/(2*h**2*sigma**2)).sum()/(root_two_pi*h*sigma*X.size)
gaussian_pdf = lambda x: (1/(2*numpy.pi)**.5)*numpy.exp(-x**2/2)
def uniform_pdf(x):
p = numpy.zeros(x.size)
p[(-2 < x) & (x < 2)] = 0.25
return p

def init():
fig = pylab.figure(figsize=(3,6))
ax = pylab.axes(xlim=(-5, 5), ylim=(-0.5, 0.7))
return fig, ax

def animate(i, r, ax, distr):
this_r = r[:i+2]
ax.cla()
h= 1.06*this_r.std()*this_r.size**(-.2)#1/(2*numpy.log(N))
lim = [-5,5]
bins = 51
x=numpy.linspace(lim[0],lim[1],num=100)
pylab.text(-.4,-.025,'n=' + str(this_r.size))
pylab.hist(this_r, bins=bins,range=lim,normed=True,edgecolor=(.9,.9,.9),color=(.8,.8,.8),histtype='stepfilled')
#pylab.plot(x, (1/(2*numpy.pi)**.5)*numpy.exp(-x**2/2), color=(.1,.1,1.0), lw=5)
pylab.plot(x, distr(x), color=(.1,.1,1.0), lw=5)
pylab.plot(x, [parzen_est(xx, this_r, h, this_r.std()) for xx in x],'k', lw=3)
pylab.setp(ax,xlim=(-5, 5), ylim=(-0.05, 0.7))

N = 200
r0=numpy.random.randn(N); distr = gaussian_pdf
#r0=numpy.random.rand(N)*4-2; distr = uniform_pdf

fig, ax = init()
anim = animation.FuncAnimation(fig, animate, fargs=(r0, ax, distr), frames=N-2, interval=2, repeat=False)
#anim.save('parzen_smoothing.mp4', fps=30, extra_args=['-vcodec', 'libx264'])
pylab.show()