Performance Basics

4.2 Memory Hierarchy
The nodes of the Velocity Cluster are virtual memory machines; that is, they allow a program's data structures to be larger than main memory. At any moment, a data element may reside in any of the following locations, listed in order from fastest to slowest access:
  • A register, which holds data actively being used in computations
  • Cache, a high-speed memory which serves as a buffer between the CPU and main memory
  • Main memory, the central repository for program instructions and data. Memory is managed in pages (4096 bytes). A page that has recently been accessed may have its address translation (page reference) held in the Translation Lookaside Buffer (TLB), which speeds up later accesses to that page
  • Swap space, the area on disk set aside for the portions of virtual memory that won't fit in the node's real main memory. It is considered to be separate from paging space.
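The page size quoted above can be checked on a given node. The following is a minimal sketch in C, assuming a POSIX system where sysconf(_SC_PAGESIZE) is available; the helper names query_page_size, pages_needed, and report_pages are hypothetical and chosen only for illustration. It also estimates how many pages an array occupies, which matters because each TLB entry covers a single page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>   /* sysconf, _SC_PAGESIZE (POSIX) */

/* Page size reported by the operating system, or 0 if the query fails.
   The value may differ from the 4096 bytes quoted in the text. */
size_t query_page_size(void)
{
    long size = sysconf(_SC_PAGESIZE);
    return size > 0 ? (size_t)size : 0;
}

/* Number of pages needed to hold `bytes` bytes, rounding up.
   Assumption: the data starts on a page boundary. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Hypothetical helper: report how many pages an array of n doubles spans. */
void report_pages(size_t n)
{
    size_t page = query_page_size();
    if (page == 0) {
        fprintf(stderr, "could not determine page size\n");
        return;
    }
    printf("page size: %zu bytes, %zu doubles need %zu pages\n",
           page, n, pages_needed(n * sizeof(double), page));
}
```

Calling report_pages(1000000) from a main function prints the page size of the node and the page count for a one-million-element array of doubles. An array that spans more pages than the TLB has entries cannot have all of its page references held in the TLB at once.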

The floating-point and fixed-point units can operate only on data held in registers. The following diagram shows the delay in moving data from one level of the hierarchy to another. In the diagram, "1 word" denotes 32 bits. Note: the numbers used in the diagram are representative; they do not reflect performance on current machines.

[Figure: DataMove (delays in moving data between levels of the memory hierarchy)]


The keys to performance:

  • Arrange the calculations in your program so that once a piece of data has been loaded into a register, it is reused as often as possible. For example, if you need to perform three calculations on each array element, do all three, where possible, before moving on to the next element. (temporal locality)

  • Try to reference contiguous elements of data. Data moves between main memory and cache in "cache lines," groups of adjacent memory locations that are always fetched together. By operating consecutively on data elements in the same cache line, you can reduce the number of costly trips to main memory. (spatial locality)
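
The first key, temporal locality, can be sketched in C as follows. This is an illustrative example only: the function names three_passes and one_pass and the particular three calculations are assumptions made for the sketch, not taken from the module. Both functions compute the same result; the second performs all three calculations on an element while it is held in a local variable, which the compiler can keep in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Three separate passes: every element is loaded from memory and
   stored back three times, once per calculation. */
void three_passes(double *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) a[i] = a[i] * 2.0;
    for (size_t i = 0; i < n; i++) a[i] = a[i] + 1.0;
    for (size_t i = 0; i < n; i++) a[i] = a[i] * a[i];
}

/* One fused pass: each element is loaded once, all three calculations
   are applied to the local copy, and the result is stored once. */
void one_pass(double *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        double x = a[i];   /* single load */
        x = x * 2.0;
        x = x + 1.0;
        a[i] = x * x;      /* single store */
    }
}
```

For arrays small enough to stay in cache, or with a compiler that fuses the loops itself, the difference may be negligible; the benefit of the fused form is expected to show mainly when the array is much larger than the cache.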

Examples are given in the "Getting on Top of the Memory Hierarchy" section of the Single Processor Performance Considerations module.
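
As a small illustration of spatial locality, the sketch below sums a two-dimensional array in two different orders. It is written in C, where a matrix stored as m[i * cols + j] has the elements of each row adjacent in memory; the function names sum_row_order and sum_column_order are hypothetical. The row-order version steps through memory one element at a time and so uses every element of each cache line it fetches, while the column-order version jumps cols elements between consecutive accesses.

```c
#include <assert.h>
#include <stddef.h>

/* Visit elements in the order they are stored (stride of 1 element). */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL || rows * cols == 0);
    double total = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            total += m[i * cols + j];
    return total;
}

/* Visit the same elements column by column (stride of cols elements),
   so consecutive accesses usually fall in different cache lines. */
double sum_column_order(const double *m, size_t rows, size_t cols)
{
    assert(m != NULL || rows * cols == 0);
    double total = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            total += m[i * cols + j];
    return total;
}
```

In Fortran, arrays are stored column by column, so the roles are reversed: there the innermost loop should run over the first index.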