Programmed Profiling

3.3 MKL Library
This code is called mme. Its Mflop rate is about half of the clock rate, and even in this case the processor is stalled waiting for resources about half the time.
  • The ratio of data memory references to flops is about 2/3
  • The number of data memory references per cache line is about 50
  • The number of L1 cache lines loaded per L2 cache line loaded is about 1.11
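As a minimal sketch of how the three ratios listed above could be derived from raw hardware counter totals, consider the following. The Counters type, its field names, and the derived_ratios function are hypothetical and are not taken from any profiling tool. The sketch also assumes that "per cache line" in the second ratio refers to lines loaded into L1, which the text does not state. The sample totals are made-up inputs chosen so that the ratios land near the quoted values; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw event totals for one profiled run (hypothetical field names)."""
    flops: int         # floating point operations
    data_refs: int     # data memory references (loads + stores)
    l1_lines_in: int   # cache lines loaded into L1
    l2_lines_in: int   # cache lines loaded into L2


def derived_ratios(c: Counters) -> dict:
    """Return the three derived ratios discussed in the text."""
    return {
        # Data memory references per floating point operation.
        "refs_per_flop": c.data_refs / c.flops,
        # Data references served per line loaded into L1 (assumed meaning).
        "refs_per_l1_line": c.data_refs / c.l1_lines_in,
        # L1 line loads per L2 line load; near 1 means almost every
        # L1 miss also misses in L2.
        "l1_lines_per_l2_line": c.l1_lines_in / c.l2_lines_in,
    }


if __name__ == "__main__":
    # Illustrative totals only, not measured data.
    sample = Counters(flops=3000, data_refs=2000, l1_lines_in=40, l2_lines_in=36)
    for name, value in derived_ratios(sample).items():
        print(f"{name}: {value:.2f}")
```

A value of the last ratio close to 1 is what motivates the focus on L2 misses: if nearly every line loaded into L1 must also be loaded into L2, then L1 misses are rarely satisfied from L2.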
One conclusion to draw from this is that blocking to make more efficient use of L1 cache is very helpful. However, because L2 cache misses are even more expensive than L1 cache misses, it is more effective to minimize L2 cache misses than to pursue reuse of L1 cache.
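The idea of blocking for L2 can be sketched as follows, using dense matrix multiplication as a stand-in kernel; the text does not say what mme computes, and this is not MKL's implementation. The functions block_size_for_cache and matmul_blocked are hypothetical names, and the rule of keeping three square blocks of 8-byte values resident in the cache is an assumption made for illustration. Python is used only to show the loop structure; it will not show the performance effect.

```python
import math


def block_size_for_cache(cache_bytes: int, elem_bytes: int = 8) -> int:
    """Largest block edge bs such that three bs x bs blocks fit in the cache.

    Assumes one block each of A, B and C must be resident at once.
    """
    return max(1, math.isqrt(cache_bytes // (3 * elem_bytes)))


def matmul_blocked(a, b, n: int, bs: int):
    """Compute C = A * B for n x n lists of lists, one bs x bs block at a time."""
    c = [[0.0] * n for _ in range(n)]
    # Outer loops walk over blocks; each block triple is finished
    # before moving on, so its data is reused while still cached.
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Inner loops stay inside the current blocks; min() handles
                # edge blocks when bs does not divide n.
                for i in range(ii, min(ii + bs, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            row_c[j] += aik * row_b[j]
    return c


if __name__ == "__main__":
    n = 4
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Multiplying by the identity should return A unchanged.
    print(matmul_blocked(a, identity, n, bs=2) == a)
```

Under these assumptions the block edge would be chosen from the L2 capacity rather than the L1 capacity, for example block_size_for_cache(l2_bytes), which is the design choice the conclusion above argues for.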