Single Processor Performance Considerations

Lab Exercise
Prerequisites Overview Exercise Solution Cleanup

Prerequisites

This lab exercise is intended to augment the related talk, Single Processor Performance Considerations. This exercise is not a tutorial. You should already be familiar with the information in the Single Processor Performance Considerations module, and be comfortable compiling and running programs C or FORTRAN programs in CTC's computing environment.


Overview

This lab has two principal objectives:

  • To give you an opportunity to try applying some of the tuning techniques presented in Single Processor Performance Considerations; and
  • To provide a realistic experience of the code tuning process, giving you a chance to observe the way different techniques interact with each other.

The exercise consists of a number of steps, each working on the same basic program, but implementing different performance improvement techniques. If you have the time, you will get the most from this lab by doing all of the steps in order. However, if you are short on time, try doing the first three lab steps on your own, and then just copying, studying and compiling the provided solution files for the remaining steps.


Exercise

General Lab Instructions

  • Programming Languages

    These instructions are written for either Fortran or C programmers. Where there is a difference due to language, the difference will be clearly indicated by two separate sets of instructions. Otherwise, differences in the lab depend on the program files themselves, which follow these naming conventions:

    Lab Files Solutions
    Fortran something.f something.soln.f
    C something.c something.soln.c

  • Copying lab and solution files

    Copy all lab files found in
    H:\VWlabs\Performance\SingleProcPerf\

    to your home directory (or subdirectory) on H:, e.g.

    copy H:\VWlabs\Performance\SingleProcPerf\*   H:\Users\your_userid\

  • Running setup files

    Run the setup files for the compiler(s) you will be using. The compiler(s) you choose use must be installed on the machine where you will be doing the compilations, typically in the Collaboratory or on a ctclogin node via Terminal Server or telnet. The setup files are located in "javascript:void(null);", which should be in your default path.

    No matter which compiler you choose, you should first run setup_visualc to make sure Microsoft's link.exe is in your path.

    Fortran:
    • For df, the Compaq Visual Fortran Optimizing Compiler, the setup file is:  setup_compaqf
    • For ifl, the Intel Fortran Compiler, the setup file is:  setup_intelf32

    C:
    • For cl, the Microsoft C/C++ Optimizing Compiler, the setup files is:  setup_visualc
    • For cl and icl, the Intel C/C++ Compiler, the setup file is:  setup_intelc32
      (the purpose of this step for cl is to verify that the Intel MKL is available for linking).

The Basic Program

We need a "baseline" of performance for a simple program before we start tuning the code.

  1. This section uses mma.f (Fortran) or mma.c (C version). Look over the code, so that you understand what it does.

  2. Compile the program with the appropriate batch file, either compilef.bat or compilec.bat.

    Look over the script to see how it works. You will see that it takes 3 arguments:

    • the name of the source,
    • the name of the command-line compiler, and
    • the name of the optimization set (blank, "opt", or "debug").

  3. Run the code. At your option, you can either do this on your local Pentium III computer, on a Collaboratory node, or on a CTC batch node. (Note that running codes on ctclogin nodes is contrary to CTC policy and leads to erratic timing results.) If you decide to run the code in batch, you may wish to use the runtest.bat template script provided for that purpose (it just requires some simple editing).

Note the execution time for this naive matrix-matrix multiply. Some sample results are also output; these are provided for debugging purposes later. The code will take a minute or so to run, so you may want to go on to the next couple of steps while waiting for the results.

Compiler Options

Here we see what different compiler options can do to help our simple program.

  1. Compile with "opt" as the final argument to the batch script; run the code and note the time.

    • Check the sample output (several matrix elements) carefully. It may not be the same as your first run. If it is not, but you find the results are "close enough", then ignore the differences and go on. (When a compiler reorders operations, it can affect the last digit of accuracy.)

    • You should have seen a substantial improvement over the plain-compiled code for "df", but you may have seen almost none for the other compilers. Why is this? First of all, "df" knows some tricks that the others don't, as we will see later. But another important factor is that the default optimization levels for the Intel compilers are already set fairly aggressively. In fact, most current compilers try to optimize extensively unless specifically instructed not to do so.

  2. Now compile with "debug" as the final argument to the batch script; run the code and note the time.

    • One common circumstance in which optimization is generally turned off is during debugging. You should see a significant degradation in performance.

Loop Tricks

Now we begin experimenting with "hand tuning" the code. First, we tinker with the structure of the loops.

  1. Loop reordering: FORTRAN programmers only need to complete this step.

    Create another copy of mma.f and switch the I and J loops. The simple solution is mmr.soln.f. Compile with "opt", run and time. This lets you know whether it is important to minimize the memory stride in the C(I,J) matrix.

  2. Loop unrolling: Now unroll the middle loop to depth 4.

    The object is to stride by 4 over the middle loop, while creating 4 temporary variables (think of registers) to use as accumulators in the innermost loop. This reduces the probability of stalls due to loads, stores, and cache misses. Furthermore, it will be advantageous to create a temporary variable (again, think "register") to hold the matrix element accessed in each of the 4 multiplications in the innermost loop. Section 2.7 in the related talk shows an example. Note that an extra "cleanup" loop must be inserted after the main, unrolled loop unless the matrix size is restricted to be a multiple of 4.

    Fortran:
    1. Use a copy of the mmr.f code.
    2. Stride the middle loop by 4.
    3. Create temporary variables to replace C(I,J) through C(I+3,J).
    4. Create a temporary variable to hold B(K,J).
    5. Compile with "opt", run and time.
    6. Your solution should look like mmu.soln.f.

    C:
    1. Use a copy of the mma.c code.
    2. Stride the middle loop by 4.
    3. Create temporary variables to replace c[i][j] through c[i][j+3].
    4. Create a temporary variable to hold a[i][k].
    5. Compile with "opt", run and time.
    6. Your solution should look like mmu.soln.c.

  3. Loop Blocking: Create a copy of mma.f or mma.c and implement loop blocking on it.

    You'll need to nest three more loops inside the triply-nested main loop, giving the outer loops a stride which makes the inner loops fit entirely within cache. (See Sections 3.2 and 3.3 in the related talk for discussion and an example.) Compile with "opt", run, and note the time. Your solution should look like mmb.soln.f or mmb.soln.c in your working directory.

    Did you get a speedup? You may not have! With optimzation options, you are asking the compiler to make code transformations of its own. But you have tinkered around with the loops already, which may be interfering with what the compiler is trying to do for you. So be aware that hand-tuning has its risks...

    It can also have its benefits. The current Intel compilers, in particular, appear to benefit substantially from loop blocking. With this technique handed to them, they are able to outdo Compaq on this example. This may be an indication that the Intel compilers are incapable of doing loop blocking, even at the highest optimization levels.

Array Padding

Now we apply array padding, a simple hand tuning step. The matrix size of 512x512 was deliberately chosen to be a large power of 2. Recall that this reduces the effective size of the cache (see section 3.5 in the related talk).

To fix this problem, enlarge one of the dimensions of the arrays to A(ND,ND) where the parameter ND is not equal to a power of 2: say, N+1 instead of N. Do not change the contents of the matrices nor any of the loops! The goal is merely to change the way the matrices of the original problem are stored in the main memory.

Apply this simple fix to each of the programs mma, mmb, and mmu. Examine the results after you compile with "opt", run, and time. Do you think the compiler you are using is able to do array padding on its own? You may also want to try "debug" and compare the results with and without array padding. The provided solutions are mmp.soln.f, mmbp.soln.f and mmup.soln.f for Fortran; and mmp.soln.c, mmbp.soln.c and mmup.soln.c for C.

 

Advanced Exercise

For all you real hackers, here's a puzzle to solve that will test your understanding of the cache mapping problem. It is actually possible to achieve the performance improvement you've seen by padding the array in a single dimension. Which one?

You can find out via a test, of course, but here's a hint: The dimension you choose depends on the language your program uses. Can you explain what happened? Click here to check your reasoning.

Everything Together

Now we try the all of the loop tricks together with array padding. You can do this by combining your mmbp and mmup codes.

The inner 3 loops of the blocked code will need to be reordered (for FORTRAN) and unrolled. The easy way to do this is to take the loops from mmup and use them as replacements for the inner loops in mmbp, after modifying them in a straightforward way. Again, an extra "cleanup" loop must be included in the innermost iteration unless one restricts both the matrix and block sizes to be multiples of 4.

Compile with "opt", run and time. For comparison, See how much of this improved performance is retained when ND=N instead of N+1 (i.e., no padding). The provided solutions are mmbup.soln.f and mmbu.soln.f for Fortran; and mmbup.soln.c and mmbu.soln.c.

 

Advanced Exercise

The Intel Pentium III Xeon chip is capable of executing vectorized instructions called "Streaming SIMD Extensions". To try an exercise exploring how this type of vectorization can be implemented in our matrix multiply example, click here.

What Really Works

Of course, we've saved the best strategy for the last: replace the entire set of loops by a single call to DGEMM, the appropriate BLAS Level 3 routine.

BLAS stands for the Basic Linear Algebra Subroutines. This collection of routines (from www.netlib.org) is usually provided to you by the compiler vendors, who routinely include the whole suite in their custom-built optimized libraries. We are dealing with Level 3 here because matrix-matrix multiplication requires O(N^3) operations for a pair of NxN matrices.

Fortran:
  1. Using either mma.f or mmp.soln.f, add:
    INCLUDE 'CXML_INCLUDE.F90'     for the Compaq Extended Math Library
    (nothing is needed for the Intel Math Kernel Library)
  2. Replace the loops with:
    CALL DGEMM('N', 'N', N, N, N, 1.0D0, A, ND, B, ND, 0.0D0, C, ND)
  3. (optional) Alternatively, replace the loops in the non-padded code with:
    C = MATMUL(A,B)
    This is the Fortran 90/95 intrinsic function for matrix multiplication. Is it optimized? How well?
    (If you are using the df compiler, take note that "blaslibs" is set to /link /stack:100000000 in the compilef.bat script. This is to prevent stack overflow in df's MATMUL routine.)

C:
  1. Using either mma.c or mmp.soln.c, add:
    #include <mkl.h>    for the Intel Math Kernel Library
  2. Replace the loops with:
    cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, &a[0][0], nd, &b[0][0], nd, 0.0, &c[0][0], nd );

Note the C arrays are passed as pointers to the first elements of the arrays. This is because the underlying library is written in Fortran, and Fortran uses the "call by reference" method for passing arguments. C also requires an extra argument as compared to Fortran. This is the first one in the list; it determines whether the matrices are stored in row major order (C convention) or column major order (Fortran convention).

Here is a neat thing you can try in C: you can change the first argument to CblasColMajor and switch &a[0][0] and &b[0][0] in the calling sequence. This alternate calling order is just making use of the matrix property (AB)' = B'A' = C' where ' denotes the matrix transpose operation. The result, [C' stored in column major order], is equivalent to [C stored in row major order], which is just [the C matrix stored correctly for the C language]. It is therefore mathematically equivalent to the first way we called clas_dgemm, even though the data access pattern is different. Is the performance better, worse, or the same?

Both solution versions are provided for you: mmep.soln.f or mmep.soln.c, which use padded arrays; and mme.soln.f or mme.soln.c (as well as mmf90.soln.f), which don't.

Humble pie: Compile your program that calls BLAS, either with or without "opt". The necessary include/link files and paths have already been incorporated into either the source codes or the batch scripts. Run the codes, time them and groan... It seems the folks who write these libraries know the tricks and the hardware at least as well as we do. In selected cases there is little doubt they have gone to the trouble of writing out key loops in assembler language in order to milk every last flop out of the available cycles.

 

Advanced Exercise

If you have linked to the Intel MKL, you need to be aware that MKL has a powerful built-in multithreading capability via OpenMP. What this means to you is that if you run the code on a machine with multiple processors (an SMP node), MKL will automatically spawn a number of threads equal to the number of processors!

Don't believe it? Try it! The current CTC "development" cluster is composed of two-way P3 nodes. If you run the code there, you will find it runs in half the time it would otherwise.

On the surface, this seems pretty nice. However, what if your code has been parallelized with MPI, and your MPI processes are already occupying all the processors? Then you can turn off the MKL multithreading by setting the environment variable OMP_NUM_THREADS to 1. Try this out, too. (Remove "REM" from the relevant line in runtest.bat.)

 


Solution

Even though we failed to improve upon the time of the library call, this is not to say that hand-tuning has been an empty exercise. Not every operation is available as a prepackaged library routine. As we have seen, the compilers can occasionally turn out better performance when a trick like loop blocking or array padding is implemented by hand. This kind of experience is typical when hand-optimizing a code: a number of things must be tried in order to find the one thing (or the combination) that is most effective.

The message is to experiment with a number of approaches, including compiler flags, and then to rely on timing tests (in addition to profiling and hardware counters, which are covered in other modules) to reveal what works best. This principle extends to trying out different libraries, too. While MKL outperformed CXML in this lab exercise, it is unwise to attach too much importance to just one linear algebra example that has been run at one particular matrix size.

The morals of the story:

  • Don't reinvent the wheel. Standard operations have been optimized by the experts, and you probably won't be able to outdo the software that is found in the numerical libraries.

  • If you're doing something nonstandard, a good compiler still knows the basic tricks pretty well. For the most part, you'll save yourself the work by relying on the compiler and experimenting with the available options.

  • The best plan is to bear in mind these basic optimization strategies when writing your code. Try not to do things that will interfere with the compiler. Cooperate with the compiler in an intelligent way, and you will end up with speedy code!

If you're curious, you can click here for typical timing results on the CTC's IBM SP2 circa 1998.


Cleanup

If you wish to delete the directory and files you created for this exercise:

  • If you are using Windows, choose "Up" in the lab directory window and delete the lab folder.

  • If you are using telnet, change directories up one level and delete the lab directory you created earlier together with its contents.

      cd ..
      rd /S /Q labdir