Programmed Profiling

Lab Exercise

Prerequisites

This lab follows the module CPU Profiling. You should complete the lab Optimizing on a Single Processor before starting this lab.

These prerequisites cover the following material, which you need to understand before proceeding to the lab below:

  • Using CPUMon
  • Instrumenting a code using the PerfThread library


Overview

Although the ultimate goal of optimization is reducing the execution time of a program, it is often helpful to monitor other CPU statistics to determine what attempts at optimization have been effective and what other strategies for optimization might be pursued.

This exercise gives you the opportunity to monitor various CPU statistics for the codes that you used in the "Optimizing on a Single Processor" Lab.

This exercise will probably take 20 minutes to complete, though because it is open-ended, you might spend all day!


Exercise

Before You Begin

  1. Since you presumably created a subdirectory for the exercises in the "Optimizing on a Single Processor" lab, you will probably find it useful to use the same one for this lab.

  2. Copy all lab files found in
    H:\VWlabs\Performance\Optimization\[C|Fortran]

    to your home directory (or subdirectory) on H:, e.g.

    copy H:\VWlabs\Performance\Optimization\C\*   H:\Users\your_userid\Lab\

 

Exercise 1: Monitor some CPU statistics

C lab files: mma.exe, mmbu.exe, mme.exe
C source files: src\mma.c, src\mmbu.c, src\mme.c

Fortran lab files: mma.exe, mmbu.exe, mme.exe
Fortran source files: src\mma.f, src\mmbu.f, src\mme.f

These files have already been compiled for you, so you need only to run them and work with the output.

mma is a naive matrix multiply routine that multiplies two 512-element square matrices and places the result in a third matrix.
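For orientation, a naive multiply is usually the textbook triple loop. The following is only a minimal sketch of that idea, not the contents of src\mma.c; the function name matmul_naive and the row-major double-precision layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Textbook triple-loop multiply: c = a * b for n x n row-major matrices.
   Hypothetical helper for illustration; not the code in src\mma.c. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                /* b is read down a column here, a stride of n doubles
                   per step, which tends to use the cache poorly. */
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

The strided walk through b in the inner loop is the kind of access pattern that the cache counters in this exercise are meant to expose.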

mmbu is a less naive matrix multiply routine that multiplies two 1024-element square matrices and places the result in a third matrix. This routine is an improvement over mma in that multiplication is carried out on 50x50 blocks of the matrix and the inner loop has been unrolled to a level of 4.
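The two ideas named above, blocking and unrolling the inner loop by 4, can be sketched as follows. This is an illustrative version only: the loop order, the handling of partial blocks, and the names matmul_blocked and min_size are assumptions, and the actual mmbu source may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked multiply with the innermost loop unrolled by 4.
   c = a * b for n x n row-major matrices; nb is the block edge length.
   Hypothetical sketch; not the code in mmbu. */
void matmul_blocked(size_t n, size_t nb,
                    const double *a, const double *b, double *c)
{
    assert(nb > 0);
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += nb) {
        size_t i_end = min_size(ii + nb, n);   /* last block may be partial */
        for (size_t kk = 0; kk < n; kk += nb) {
            size_t k_end = min_size(kk + nb, n);
            for (size_t jj = 0; jj < n; jj += nb) {
                size_t j_end = min_size(jj + nb, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        const double *brow = b + k * n;
                        double *crow = c + i * n;
                        size_t j = jj;
                        /* Inner loop unrolled to a level of 4. */
                        for (; j + 4 <= j_end; j += 4) {
                            crow[j]     += aik * brow[j];
                            crow[j + 1] += aik * brow[j + 1];
                            crow[j + 2] += aik * brow[j + 2];
                            crow[j + 3] += aik * brow[j + 3];
                        }
                        /* Remainder when the block width is not a multiple of 4. */
                        for (; j < j_end; j++)
                            crow[j] += aik * brow[j];
                    }
                }
            }
        }
    }
}
```

In this sketch each block step reuses a small tile of each matrix many times before moving on, and the unrolled loop does four updates per loop test, which is the general motivation for both techniques.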

mme is the smart matrix multiply routine that multiplies two 1024-element square matrices by calling a routine from the Intel Math Kernel Library.

Before running any of these codes, go to the directory containing the executables and start cpumon by typing "cpumon" at a command prompt. A window opens that lists the quantities you can count. Choose any two quantities and click the start button. Then go back to the command window and invoke an executable by typing its name at the command prompt. The output from the two counters, sampled at one-second intervals, will be displayed on your screen and written to a file called mm.out.

You may find it useful to view the output in an Excel spreadsheet. One named mm.xls has been provided. Select the top left cell (A1) and go to data->refresh data. Say OK to import data from mm.out. Go to cell A43 and put in the time reported for the matrix multiply that you did. Then row 41 contains the average counts per second of whatever you were counting and row 42 contains the total counts over the course of the multiplication step.

Interesting counters to look at are: FPU:FLOPS, Cache:DATA_MEM_REFS, Cache:DCU_LINES_IN, Cache:L2_LINES_IN, and Processor:RESOURCE_STALLS.
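Once you have total counts and a run time, a few derived numbers make the three codes easier to compare. The helpers below are a hedged sketch: the function names are hypothetical, the flop count assumes a textbook n x n multiply (2n^3 operations), and reading the ratio of lines brought into a cache to data memory references as an approximate miss ratio is an interpretation, not something the counters report directly. The FLOPS counter may also count operations somewhat differently from the textbook figure.

```c
#include <assert.h>

/* Floating-point operations in a textbook n x n matrix multiply:
   n*n*n multiplies plus n*n*n adds. */
double matmul_flop_count(double n)
{
    return 2.0 * n * n * n;
}

/* Rate in millions of events per second (e.g. MFLOPS when the
   total is a floating-point operation count). */
double mega_rate(double total_count, double seconds)
{
    assert(seconds > 0.0);
    return total_count / seconds / 1.0e6;
}

/* Fraction of data memory references that caused a line to be
   brought into a cache; a rough stand-in for a miss ratio. */
double lines_in_ratio(double lines_in, double data_mem_refs)
{
    assert(data_mem_refs > 0.0);
    return lines_in / data_mem_refs;
}
```

For example, if "512-element" is taken to mean 512 x 512, matmul_flop_count(512) gives 268,435,456 operations, which you could compare with the FPU:FLOPS total for mma. Likewise, lines_in_ratio applied to the Cache:DCU_LINES_IN and Cache:DATA_MEM_REFS totals gives one number per code that should drop as the cache behavior improves.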

 

Exercise 2: (optional) Change the blocking factor in mmbu

C lab file: mmbu.c

Fortran lab file: mmbu.f

To work with the C file, open Microsoft Developer Studio and the mmc workspace. To work with the Fortran file, open Compaq Visual Fortran Developer Studio and the mmf workspace.

Open the mmbu source file in the Developer Studio editor and change the line that defines nb from 50 to a value of your choice. (24 gives the fastest results for this algorithm.) Then recompile and run it as you did in the previous exercise to see how your results have changed.
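One rough way to choose candidate values for nb is to estimate how much data a single block step touches and compare it with the size of the first-level data cache. The sketch below assumes 8-byte double-precision elements, one nb x nb tile from each of the three matrices, and a 16 KB data cache; none of these figures comes from the lab files, and the function names are hypothetical. It is a rule of thumb only, since unrolling and conflict misses also affect the best value.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one block step, assuming one nb x nb tile
   from each of the three matrices. */
size_t block_working_set_bytes(size_t nb, size_t elem_size)
{
    assert(nb > 0 && elem_size > 0);
    return 3 * nb * nb * elem_size;
}

/* Largest nb whose three tiles fit in cache_bytes under the
   assumption above; returns 0 if not even nb = 1 fits. */
size_t largest_fitting_block(size_t cache_bytes, size_t elem_size)
{
    size_t nb = 0;
    while (block_working_set_bytes(nb + 1, elem_size) <= cache_bytes)
        nb++;
    return nb;
}
```

Under these assumptions, nb = 50 touches 60,000 bytes per block step, well over 16,384 bytes, while nb = 24 touches 13,824 bytes and fits. That is consistent with 24 performing well, but check it against the Cache:DCU_LINES_IN counts rather than taking the estimate on trust.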


Solution


Cleanup

You may wish to delete any files you copied into your \Lab\ folder on H:.