We all have to get our car oil changed periodically. If we do it too often its a waste of money (and a strain on the environment), if we do it too infrequently we run the risk of damaging our engine. Say an insightful friend has developed a formula – a function – that informs us what our total cost is (cost of oil + probable cost of engine repair) when we plug in how often we are doing oil changes.
We have an idea that we can use this graph to figure out the best frequency for our oil change. Finding the value of the input of a function to obtain the lowest value of the output is called optimization.
One easy way to find the optimum is to simply compute the function over a large range of input values and pick the value of the input that gives the smallest output. We already did this by inspection when we saw the above graph, because some one has already plotted the graph for us.
One problem with this is that, in the real world, the function may be complicated and involve lots of individual mathematical formulae making it very time consuming to compute the function over a large range of values. This is especially true for multi-dimensional functions that take multiple inputs that can be altered.
Say our formula not only takes into account oil change frequency but also our average driving speed and average length of travel and we are really obsessive and we want to optimize all three. We need to pick ranges for the three dimensions and then decide how densely we are going to sample this space of three dimensions. Even if we are very careless and decide to check only 10 values for periodicity, travel distance and average speed we end up having to check 10 x 10 x 10 = 1000 combinations of values!
We need to find a short cut!
Gravity is our friend
Look again at the graph out friend has given us. Imagine the graph is actually a physical surface, like a bowl.
If we put a marble somewhere it’s going to roll down and end up at the bottom. Now, this is where we want to be: we want to be at the lowest point of the curve. If we have such a bowl shaped surface, the action of gravity on a marble placed on that surface acts as a minimizing algorithm: it finds the lowest point for us. (At this point, you might suggest that this is not always true, we won’t always find the lowest point, and we will discuss this later).
This gives us an idea for a method (an algorithm) by which we can fake this ‘rolling down a slope’ to find the input that minimizes a function. Before we continue, just so we are on the same page, teh slope of the curve is defined as the ratio of the change in output to the change in input. This will be familiar to any one who has been driving in a hilly area:
1. Place an imaginary marble at some arbitrarily picked starting position
2. Find the slope of the curve at that point
3. Move our point along the curve to a new position computed by adding the numerical value of the negative of the slope to the old position.
4. Keep repeating this until our slope becomes smaller than some value we are happy with calling ‘flat’ i.e. we are done.
Now, in our physical example, how fast the marble rolls down the slope depends on the strength of gravity. If we did our experiment in orbit (say on the International Space Station) the marble of course would not roll down but float around. If we did it on the moon, it would roll down, but slowly, since the gravity is lower there.
In our fake version of this experiment we also have to decide on a ‘gravity’. We don’t want to make the gravity too low or our point will take a long time to move along the slope. If we make our gravity too high, our point will overshoot and bounce around a lot.
We call our fake gravity ‘alpha’ and it is simply a number that we multiply our slope by to decide how far our point is going to move.
It is important to point out here that our marble in a ball is an analogy. In our algorithm that we are going to use, for practical reasons, we don’t include inertia. So, in the real world the ball would slide down to the bottom and then swing back upward due to momentum (inertia). In our algorithm, as soon as the slope goes to zero our ‘ball’ (point) stops instantly. However, it is still important to choose a gravity (alpha) that is not too large.
Stuck in local minima
You would have thought to yourself, what if the surface is kind of uneven, can’t the ball get stuck on a ledge somewhere as it is rolling down, and not reach the bottom? Yes, this can happen using this method, and is called getting stuck in local minima.
It’s a matter of luck if you get stuck in local minima. You could avoid it simply by accident, by starting from a different position or by having a different alpha value.
The Gradient descent method requires far fewer function evaluations than the brute force method, especially when the number of input variables we are tweaking (the dimensions of the input space) get larger.
Getting stuck in local minima is a big problem which crops up quite frequently in many interesting applications. There are some tricks to get out of local minima which we will discuss later.
Also, if you pick too large an alpha value you could end up bouncing around in your function, perhaps lost for eternity, either spinning out into space, or cycling through the same positions, never getting to where you want. You have been warned …