This code is called mma. Many performance-counter events could be measured, but I have focused on the following six:
- Processor:RESOURCE_STALLS
- FPU:flops
- Cache:DATA_MEM_REFS
- Cache:DCU_LINES_IN
- Cache:L2_LINES_IN
- Cache:DCU_MISS_OUTSTANDING
The C code performs very poorly, even when using many of the compiler's optimization flags. The code gets only 21.2 Mflops and takes more than 100 seconds to complete. Some indicators of poor performance are:
- The ratio of data memory references to flops is greater than 1.5
- The ratio of data memory references to DCU lines in is near 3
- The CPU is stalled for resources more than 95% of the time
The Compaq Fortran compiler does significantly better if the highest optimization level (5) is used. Here the rate is 126 Mflops and the run completes in 17 seconds. It is no coincidence that BOTH numbers are about 6 times better than in the C case: the total number of flops is essentially independent of the degree of optimization, so the run time falls by the same factor that the flop rate rises. Comparing with the C code, we see:
- The ratio of data memory references to flops is greater than 1.8!
- The ratio of data memory references to DCU lines in is about 14
- The CPU is stalled for resources only 77% of the time
This hardly approaches the potential of the machine, but a six-fold speedup isn't bad either!
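The "no coincidence" argument can be checked with a little arithmetic: if the total flop count is fixed, then rate times run time should come out about the same for both compilers. The sketch below uses only the figures quoted above; the function name is hypothetical, and since the C run took "more than 100 seconds", 100 is used as a lower bound.

```python
def total_mflop(mflops_rate, seconds):
    """Total work in millions of flops, given a rate (Mflops) and a run time (s)."""
    return mflops_rate * seconds


# Figures quoted in the text. The C time is "more than 100 seconds",
# so 100 is a lower bound and c_total is a slight underestimate.
c_total = total_mflop(21.2, 100)
f_total = total_mflop(126, 17)

rate_speedup = 126 / 21.2   # how much faster the flop rate is
time_speedup = 100 / 17     # how much shorter the run is

print(f"C total:       about {c_total:.0f} Mflop (lower bound)")
print(f"Fortran total: about {f_total:.0f} Mflop")
print(f"rate speedup:  {rate_speedup:.2f}x")
print(f"time speedup:  {time_speedup:.2f}x")
```

The two totals agree to within about one percent, and both speedups come out just under 6, which is what one would expect if the compilers change how fast the flops are done but not how many there are.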