Programmed Profiling

3.2 Loops Blocked

This code is called mmbu. Performance is substantially better than for the naive code, and it is quite sensitive to the block size near the point where all three blocks just fit in the L1 cache. Other observations:

  • The ratio of data memory references to flops is a little less than 1
  • The number of data memory references per cache line brought in is about 6
  • The number of L1 cache lines brought in per L2 cache line brought in is also about 6

L2 cache reuse is considerably better than in the unblocked case, and the total number of memory references is also lower.
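
The source of mmbu is not reproduced in this section, so the following is only a minimal sketch of loop blocking for a square matrix multiply, not the code that was measured. It assumes row-major matrices of doubles; the names max_block_edge and matmul_blocked are hypothetical. The first helper estimates the largest block edge for which three blocks fit in a given L1 size, which is the threshold the text describes as performance-sensitive. It ignores associativity and anything else occupying the cache, so the real crossover may occur at a smaller block.

```c
#include <assert.h>
#include <stddef.h>

/* Largest block edge b such that three b-by-b blocks of doubles
   (one each from A, B and C) fit in l1_bytes of cache.
   Simplification: ignores associativity and other cache residents. */
static size_t max_block_edge(size_t l1_bytes)
{
    size_t b = 0;
    while (3 * (b + 1) * (b + 1) * sizeof(double) <= l1_bytes)
        b++;
    return b;
}

/* Smaller of two sizes; used to clip blocks at the matrix edge. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* C += A * B for n-by-n row-major matrices, computed in bs-by-bs blocks.
   The three outer loops walk over blocks; the three inner loops do an
   ordinary multiply restricted to the current blocks of A, B and C. */
static void matmul_blocked(size_t n, size_t bs,
                           const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k]; /* reused across the j loop */
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

As an illustration only, a hypothetical 32 KiB L1 data cache gives max_block_edge(32768) = 36, since 3 × 36 × 36 × 8 = 31104 bytes fits and the same product for 37 does not. Timing matmul_blocked with block sizes just below and just above such an estimate is one way to look for the sensitivity noted above.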