I have some data where each point consists of two values (x,y). Most of the points are likely to come from a distribution centered on the origin. A few may come from distributions where x is significantly greater than zero. I wanted to find out whether y for those points is also significantly greater than zero.

My first idea was to fit a best fit line to the (x,y) pairs and see whether the slope was significantly different from zero. I don’t have any hypothesis about how x is related to y, except that I think it likely that if x is not different from zero, y won’t be either.
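As a minimal sketch of that first idea: `scipy.stats.linregress` reports a p-value for the null hypothesis that the slope is zero. The data below are made up for illustration (100 uncorrelated points plus 5 points shifted to around (5, 5)), and the helper name `slope_significance` is hypothetical.

```python
import numpy as np
from scipy import stats


def slope_significance(x, y):
    """Fit y = m*x + b; return (slope, p-value for the null hypothesis m = 0)."""
    res = stats.linregress(x, y)
    return res.slope, res.pvalue


rng = np.random.default_rng(0)  # fixed seed, only so the run is repeatable

# Hypothetical data: 100 uncorrelated points around the origin ...
x = rng.standard_normal(100)
y = rng.standard_normal(100)
# ... plus 5 points shifted to around (5, 5). These numbers are illustrative only.
x_shifted = np.concatenate([x, 5 + rng.standard_normal(5)])
y_shifted = np.concatenate([y, 5 + rng.standard_normal(5)])

m0, p0 = slope_significance(x, y)
m1, p1 = slope_significance(x_shifted, y_shifted)
print(f"ball only:      m={m0:+.2f}, p={p0:.3g}")
print(f"ball + shifted: m={m1:+.2f}, p={p1:.3g}")
```

A small p-value here only says that a straight line with non-zero slope describes the pooled data better than a flat one. It does not say which points are responsible.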

I don’t think such a fit is the correct approach, but I pursued it for want of a better idea. One thing I worried about was that the noise from the many uncorrelated points around the origin would drown out the far fewer points away from it.

Suppose you have a cluster of (x,y) points where both x and y come from a normal distribution with mean 0 and variance 1 such that x,y are independent. Now suppose you have one additional point for which x=y. What is the best fit line (‘chain’) through this cloud of points?

As your intuition will probably tell you, that depends on how far the lone correlated point is from the ‘ball’ (the cluster of uncorrelated points around the origin). The following two animations show how the slope and correlation coefficient change as this lone point gets further and further from the origin. In the first animation we have 10 points around the origin and in the second we have 100 points.
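The dependence on distance can be estimated roughly. If the ball has N points and the lone point sits at (d, d), replacing the sums in the least-squares slope by their expected values gives m ≈ d²/(N + d²). This is my own back-of-the-envelope approximation (a ratio of expectations, so it is crude for small N), and the function names below are hypothetical. The sketch compares it with the average slope over repeated simulated clouds.

```python
import numpy as np


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)


def mean_slope(n_ball, d, trials=1000, seed=0):
    """Average OLS slope over many random balls plus one point at (d, d)."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(trials)
    for t in range(trials):
        x = np.append(rng.standard_normal(n_ball), d)
        y = np.append(rng.standard_normal(n_ball), d)
        slopes[t] = ols_slope(x, y)
    return slopes.mean()


def approx_slope(n_ball, d):
    """Rough approximation: ratio of the expected sums, d^2 / (N + d^2)."""
    return d**2 / (n_ball + d**2)


for n_ball in (10, 100):
    for d in (2.0, 5.0, 10.0, 20.0):
        print(f"N={n_ball:3d} d={d:5.1f}  "
              f"simulated={mean_slope(n_ball, d):.3f}  "
              f"approx={approx_slope(n_ball, d):.3f}")
```

Under this approximation the slope reaches 1/2 when d is about √N, so a ball of 100 points needs the lone point roughly three times further out than a ball of 10 points to pull the line equally hard.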

The further the point is from the origin, the more influence it has on the best fit line and the correlation coefficient. It is interesting to note that the best fit line is biased towards the horizontal. This comes from the form of the fit: y = mx + b penalizes only vertical distances, so for an uncorrelated ball the fitted slope scatters around zero. If we had instead proposed a three parameter fit ax + by + c = 0, the direction of the ‘best fit line’ for the ball alone would be sampled from all around the circle.

The effect of the uncorrelated ball is therefore to pull the chain towards the horizontal, which shows up as a reduced correlation coefficient. With a symmetric fitting form (such as ax+by+c=0) this bias would disappear.
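One way to read the ax+by+c=0 form is as an orthogonal (total least squares) fit, which minimizes perpendicular rather than vertical distances. That reading is an assumption on my part, and the function names below are hypothetical. The sketch compares the angle of the ordinary fit with the angle of the orthogonal fit for a ball of 100 points plus one point at (10, 10).

```python
import numpy as np


def ols_angle(x, y):
    """Angle (degrees) of the ordinary least-squares line y = m*x + b."""
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    return np.degrees(np.arctan(slope))


def orthogonal_fit(x, y):
    """Return (a, b, c) with a^2 + b^2 = 1 for the line a*x + b*y + c = 0
    that minimizes the sum of squared perpendicular distances."""
    pts = np.column_stack([x, y])
    centroid = pts.mean(axis=0)
    # eigh sorts eigenvalues in ascending order, so column 0 is the direction
    # of least variance, i.e. the normal (a, b) of the fitted line.
    _, evecs = np.linalg.eigh(np.cov(pts, rowvar=False))
    a, b = evecs[:, 0]
    c = -(a * centroid[0] + b * centroid[1])  # line passes through the centroid
    return a, b, c


def orthogonal_angle(x, y):
    """Angle (degrees) of the orthogonal-fit line, folded into [-90, 90)."""
    a, b, _ = orthogonal_fit(x, y)
    ang = np.degrees(np.arctan2(a, -b))  # (-b, a) points along the line
    return (ang + 90.0) % 180.0 - 90.0


def mean_angles(n_ball, d, trials=500, seed=0):
    """Average angle of both fits over random balls plus one point at (d, d)."""
    rng = np.random.default_rng(seed)
    ols, orth = [], []
    for _ in range(trials):
        x = np.append(rng.standard_normal(n_ball), d)
        y = np.append(rng.standard_normal(n_ball), d)
        ols.append(ols_angle(x, y))
        orth.append(orthogonal_angle(x, y))
    return np.mean(ols), np.mean(orth)


ols_mean, orth_mean = mean_angles(100, 10.0)
print(f"ordinary fit:   {ols_mean:5.1f} degrees")
print(f"orthogonal fit: {orth_mean:5.1f} degrees")
```

Because the orthogonal fit treats x and y symmetrically, it has no preferred direction for the ball alone, so the lone point on y = x should pull it towards 45 degrees rather than towards the horizontal.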

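import pylab
import scipy.stats as ss
import matplotlib.animation as animation

N = 100  # total number of points per frame
Nc = 1   # how many of those points are forced onto the line y = x

def run_fit(i=0):
    # Draw a fresh uncorrelated cloud on every frame
    x = pylab.randn(N)
    y = pylab.randn(N)
    # Frames 0-99: the correlated point sits at a random spot inside the ball.
    # From frame 100 on it moves out along y = x in steps of 1/20.
    xx = pylab.randn(Nc) if i < 100 else (i-100)/20.
    x[:Nc] = xx #+ pylab.randn(Nc)
    y[:Nc] = x[:Nc]
    fitp = ss.linregress(x,y)  # (slope, intercept, r, p-value, stderr)
    xsim = pylab.array([-20, 20])
    yhat = fitp[0]*xsim + fitp[1]
    return x,y,xsim,yhat,fitp

def init():
    fig = pylab.figure(figsize=(3,3))
    # Create the axes once; calling pylab.axis('equal') before pylab.axes()
    # left a stray second set of axes on the figure
    ax = pylab.axes(xlim=(-20, 20), ylim=(-20, 20))
    ax.set_aspect('equal')
    x,y,xsim,yhat,fitp = run_fit()
    line1, = pylab.plot(x,y,'ko')
    line2, = pylab.plot(xsim, yhat, 'k--', lw=2)
    text = pylab.text(-10,10,'m={:+4.2f}\nb={:+4.2f}\nr={:+4.2f}'.format(fitp[0],fitp[1],fitp[2]))
    return fig, line1, line2, text

def animate(i, line1, line2, text):
    # Redraw the cloud, the fitted line and the slope/intercept/r label
    x,y,xsim,yhat,fitp = run_fit(i)
    line1.set_data(x, y)
    line2.set_data(xsim, yhat)
    text.set_text('m={:6.2f}\nb={:6.2f}\nr={:6.2f}'.format(fitp[0],fitp[1],fitp[2]))

fig, line1, line2, text = init()
anim = animation.FuncAnimation(fig, animate, fargs=(line1, line2, text), frames=400, interval=20, repeat=False)
# Saving to mp4 with these arguments requires ffmpeg with libx264
anim.save('ball_and_chain.mp4', fps=30, extra_args=['-vcodec', 'libx264'])
pylab.show()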