Modern CPUs are fast, and main memory is comparatively slow. A CPU can sit idle for many cycles while it waits for data to arrive from main memory, which can be a huge processing bottleneck. Fundamentally, like most things, this comes down to an economic tradeoff.

CPU caches

As a compromise, CPU manufacturers add some fast memory on the CPU chip itself, called cache memory. This memory is expensive and takes up precious space on the CPU chip, so cache memory is tiny compared to main memory. There are usually several levels of cache, each with a different size and speed: L1 cache is the fastest and smallest, followed by L2 and L3. My laptop has an Intel Core i5; its L1 cache is 32 KiB/core and its L2 is 256 KiB/core.

Roughly speaking, when the CPU goes to fetch some data from main memory, instead of grabbing just a few bytes it grabs a whole chunk of memory (called a cache line, typically 64 bytes) and puts it into the cache. The next time the CPU needs some data it first looks in the cache. If the data is there, it grabs it from the cache, which is very fast. This is called a “cache hit”. If the data is not in the cache (a “miss”), it has to be fetched from main memory. Of course, since the cache is so small, there is a limited amount of data that can be kept in it at one time. Bits of data that haven’t been used recently are dropped from the cache and overwritten with more recently fetched data.
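
You can feel the effect of cache lines directly in code. Here is a minimal sketch in C (my own illustration, not code from this post): walking a large array with a stride of 16 ints does one sixteenth of the work of walking it with a stride of 1, yet on a typical machine the two loops take a surprisingly similar amount of time, because 16 four-byte ints span one 64-byte cache line and every line still has to be fetched either way.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)  /* 64M ints, far larger than any cache */

    /* Multiply every `stride`-th element and report the elapsed time. */
    static double touch(int *a, int stride) {
        clock_t t0 = clock();
        for (size_t i = 0; i < N; i += (size_t)stride)
            a[i] *= 3;
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        int *a = calloc(N, sizeof *a);
        printf("stride  1: %.3f s\n", touch(a, 1));   /* 64M updates */
        printf("stride 16: %.3f s\n", touch(a, 16));  /* 4M updates, but the
                                                         same cache lines   */
        free(a);
        return 0;
    }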

Using the cache wisely: data locality

As you can guess, if we can organize our computations such that once a line of data is fetched into the cache all of that data is used in the computation before a different line is fetched, we will be a lot faster than if we keep going back and forth, fetching and refetching data from different parts of memory. This property is called data locality.
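
A classic way to see data locality at work, again as a rough sketch in C rather than anything from the gist: summing a large row-major matrix row by row walks memory in the order it is laid out, while summing it column by column jumps a full row’s width between consecutive reads.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define ROWS 4096
    #define COLS 4096

    int main(void) {
        /* A large matrix stored row-major as one flat block of memory. */
        double *m = calloc((size_t)ROWS * COLS, sizeof *m);
        double sum = 0.0;
        clock_t t;

        /* Good locality: walk memory in the order it is laid out, so every
           cache line that is fetched is fully used before we move on.     */
        t = clock();
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = 0; j < COLS; j++)
                sum += m[i * COLS + j];
        printf("row order:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Poor locality: consecutive reads are COLS doubles apart, so almost
           every access lands on a cache line that must be fetched afresh.   */
        t = clock();
        for (size_t j = 0; j < COLS; j++)
            for (size_t i = 0; i < ROWS; i++)
                sum += m[i * COLS + j];
        printf("column order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        printf("sum = %f\n", sum);  /* use sum so the loops aren't optimized away */
        free(m);
        return 0;
    }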

A demonstration using matrix multiplication

Multiplying matrix A by matrix B involves taking each row of A together with each column of B, multiplying their elements pairwise and summing the products: element (i, j) of the result is the dot product of row i of A and column j of B.
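
In code this is the textbook triple loop. A minimal sketch in C, assuming square n×n matrices held as ordinary 2D arrays (the gist’s actual code may look different):

    /* C = A * B for n x n matrices: c[i][j] is the sum over k of
       a[i][k] * b[k][j]. Note the access pattern: A is read along
       its rows and B is read down its columns.                    */
    void matmul(int n, double a[n][n], double b[n][n], double c[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }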

Now, in memory, there are two ways to store a matrix. We can take the rows of the matrix and tile them side by side, a pattern called row-major ordering:

Row major ordering

Or, we can take the columns and store them side by side, a pattern called column-major ordering:

Column major ordering
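
The difference between the two layouts is just the formula that maps a (row, column) pair to an offset into the flat block of memory. A sketch in C (0-based indices, helper names mine):

    #include <stddef.h>

    /* For an nrows x ncols matrix stored as one flat array: */

    /* Row-major: the elements of a row sit next to each other. */
    size_t row_major(size_t i, size_t j, size_t ncols) {
        return i * ncols + j;
    }

    /* Column-major: the elements of a column sit next to each other. */
    size_t col_major(size_t i, size_t j, size_t nrows) {
        return j * nrows + i;
    }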

Now, we can see from the description of matrix multiplication above that when we do a multiplication it would be nice if we could pull a row (or at least part of a row) of the first matrix and a column (or part of a column) of the second matrix into the cache.

A good way, then, to store the matrices would be row-major for the first matrix and column-major for the second. Let’s call this row-col.

A so-so way would be to store both matrices in row-major order. In this case we make good use of the cache for the first matrix, but the second matrix keeps sending us back to main memory: we ask for an element from its first row, so that row gets cached, but immediately afterward we ask for an element from the next row, which is stored far away and likely requires a fresh pull from main memory because of its non-locality. Let’s call this row-row.

The worst way to store the data would be to store the first matrix column-wise and the second matrix row-wise. Here we absolutely butcher the caching algorithm and force the CPU to keep fetching data from main memory each time. We call this col-row.
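
To make the three cases concrete, here is a sketch of the inner loops for each layout (my own illustration in C over flat arrays, not the code from the gist; the result is stored row-major in all three for simplicity). The arithmetic is identical in every version; only the addresses visited change.

    #include <stddef.h>

    /* n x n matrices stored as flat arrays of n * n doubles. */

    /* row-col: A row-major, B column-major. For a fixed (i, j) both
       operands are read sequentially as k increases -- best locality. */
    void mul_row_col(size_t n, const double *a, const double *b, double *c) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += a[i * n + k] * b[j * n + k];
                c[i * n + j] = sum;
            }
    }

    /* row-row: A is read sequentially, but B jumps n doubles between
       consecutive values of k -- a likely miss on almost every step. */
    void mul_row_row(size_t n, const double *a, const double *b, double *c) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }

    /* col-row: both A and B jump n doubles between consecutive values
       of k -- the worst case for the caches.                          */
    void mul_col_row(size_t n, const double *a, const double *b, double *c) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < n; k++)
                    sum += a[k * n + i] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }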

Code to do this experiment can be found in this gist.

Hardware performance counters

Before we go further, we should briefly mention hardware performance counters. I first read about them in this presentation. They are basically:

  • 64-bit registers
  • Four to eight per core
  • Distinct from the regular integer and floating point registers
  • Can each be configured to count occurrences of a CPU-level event, such as instructions executed, L2 cache misses, branches taken and so on

As you can guess, they are effectively on-chip diagnostic tools that give you access to really detailed statistics about low-level processor events, like cache misses, which is very useful for improving the performance of your code.

Cache profiling using Instruments

Instruments, of course, is macOS’s spectacular developer power tool suite for profiling code, and is the sole reason I have 7 GB of junk (Xcode) floating around on my hard disk. I do wish they would distribute Instruments separately …

Instruments has a mode called “Counters” that allows us to run an application, configure these hardware performance counters, and record and display various events.

I’m interested in understanding how often I get a cache miss when performing the computations, because I can see that the code doesn’t have the best data locality. Then, I’d like to rearrange the data structures a bit so that access is sequential as the computation proceeds, rather than jumpy, as I suspect it is now.

Even though Instruments has helpful tooltips, the names of the events that can be counted are a bit cryptic.

From looking through Intel’s documentation here, I decided to look at the events named MEM_LOAD_UOPS_RETIRED.L1_MISS and MEM_LOAD_UOPS_RETIRED.L2_MISS as measures of how the code does at finding its data in the L1 and L2 caches. In this context the word “retired” means the instruction was executed and completed, with some nuance related to out-of-order execution.

I do have to add here that, as much as I love Instruments, the Counters UI for adding and manipulating variables and formulae is buggy and will crash the app on occasion.

Results

I will run a piece of code (given in this gist) that sets up two 1200×1200 matrices and multiplies them together. It does this three times. The first time the data is stored in row-major order in both matrices, the second time the data is stored row-major in the first matrix and column-major in the second matrix, and the third time the data is stored column-major and row-major. In all three cases the computations are exactly the same, and done in the same order. The only thing that changes is how the data is stored internally.

I have outfitted Instruments as described above. The result of a typical run of this code is given here:

Segment 1: row-row, segment 2: row-col, segment 3: col-row. Computation is fastest, making best use of the caches, when we multiply a matrix organized row-major with a matrix organized column-major. This data organization is congruent with the computation we are doing. Note how different the computation times are (and note the L1 and L2 miss counts) even though the actual computation (and values) are exactly identical when we change the data layout in memory.

Notice the traces marked L1 miss and L2 miss. They are moderate in the first section, tiny in the next section and largest in the third section.

You can also see that the first section is moderately long (16s), the middle one is the shortest (7.7s) and the third one is the longest (21.2s).

You can see that the row-col data organization leads to the lowest number of cache misses and the fastest execution time, while the col-row organization has the most misses and the longest execution time, even though all three perform exactly the same computations.

And, from the discussion we’ve had about cache and data locality, you know why!

Further reading

  1. http://igoro.com/archive/gallery-of-processor-cache-effects/
  2. https://gameprogrammingpatterns.com/data-locality.html