Distributed Memory Programming

2.2 Shared Memory

There's little argument that, conceptually, dealing with all the memory in a parallel system as if it were logically contiguous is much simpler than having to be concerned with exactly where it is, and exactly how one gets to it. This is the great advantage of shared memory systems, be they actual (i.e., physically shared) or virtual (logically shared, but not necessarily physically shared).

Physically shared systems enjoy additional benefits, above and beyond conceptual simplicity: they are also potentially much faster in operation than their distributed alternatives, and more consistent in their makeup, utilization and other characteristics. However:

  1. Practical effectiveness depends critically on characteristics of memory system

    The potential speed advantage can only be realized if the memory being shared, and the network over which the sharing takes place, meet certain criteria, such as:

    • Uniformly low latency, high bandwidth access to global memory, requiring a complicated high-speed memory interconnection network

      A particularly convenient feature of shared memory systems is uniform memory access: any processor should be able to access any memory location in roughly the same amount of time. The more uniform that access time is, across all processors and to every memory location, the better. Even more significantly, the converse applies: the less uniform the access times, the worse things get for applications that depend on uniformity assumptions.

      Guaranteeing such uniformity requires significant design complexity in the communications network linking the memory subsystem to the processors; the more complex the network, the higher the price tag, and the more likely it is that things will go wrong.

    • Capacity must grow at least linearly with the number of processors, but latencies must remain constant because of processor architectures

      As more processors are added to a shared memory system, operational efficiency requires that the memory subsystem grow proportionally, or you run the risk of having too many requests and not enough memory to service them: increased contention, leading to increased latency. However, just because you've added sufficient memory to offset the additional processors doesn't mean you're home free. If your network didn't have the necessary expansion potential, and the additional memory had to be hung off a shoehorned-in sidelink to the main subsystem, then you've likely destroyed the advertised uniform access characteristic. Besides which, it's not all that easy to add either memory or processors that the network wasn't originally designed to accommodate, and that design had "uniform access" as a guiding principle, anyway.

    • "Single system image" (cache coherence)

      Just because you have a single global address space doesn't mean that every piece of data is only ever going to be found in a single location; there are a number of reasons why you might want to allow multiple copies of some data objects to be scattered around. Recall our earlier discussion about multilayer caches, and consider how much more frequently many processors might want to access an important piece of data, and how much easier that would be if there were multiple copies of it available. But whenever you do allow data to be copied to more than one location, you have to make sure that, at significant synchronization points, all of those separate copies have the same value. This is what is meant by "single system image": every memory bank or cache that holds a copy of a data object holds the same value for it (cache coherence is a more technical term referring to the same set of concepts).
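Returning to the capacity point in the second bullet above, a toy model can make the contention argument concrete. This is only a sketch under loud assumptions that are not part of the original discussion: every processor issues exactly one request in the same cycle, processor p always targets bank p % n_banks, and each bank services one request per cycle. The function name `mean_latency` is hypothetical.

```python
def mean_latency(n_procs, n_banks):
    """Average cycles until a request completes in the toy model.

    Assumptions (illustrative only): one request per processor, all issued
    in the same cycle; processor p targets bank p % n_banks; each bank
    services exactly one request per cycle, in arrival order.
    """
    queue_len = [0] * n_banks
    total_cycles = 0
    for p in range(n_procs):
        bank = p % n_banks
        queue_len[bank] += 1
        # A request finishes after every request ahead of it in the same
        # bank's queue, plus its own service cycle.
        total_cycles += queue_len[bank]
    return total_cycles / n_procs


if __name__ == "__main__":
    print(mean_latency(8, 8))    # memory matched to processors
    print(mean_latency(16, 8))   # processors doubled, memory unchanged
    print(mean_latency(16, 16))  # memory grown along with the processors
```

In this model latency stays flat only when the number of banks grows in step with the number of processors; real interconnects and access patterns are far less tidy, so treat the numbers as qualitative.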

  2. Cache coherence scheme must be integrated into processor design

    To build on that last point, cache coherence is a very sophisticated, complex set of operations, which automatically leads to:

    • Expensive

      You'd better be willing to pay for it, as it's not going to come cheap. You'll likely need a fair amount of hardware and software whose only function is to serve cache coherency, and its expense isn't simply in terms of cash (sorry!): it takes time to make sure that all memory locations referencing the same data object have equivalent values, and that time adds to the latency you're trying to minimize.

    • Limited system design options

      When you've got a very important component to accommodate, and it occupies a critical place in a crucial subsystem, it shouldn't come as a surprise that the system as a whole ends up looking as if it were designed around that one component. In this case, the coherency requirement leads to any number of restrictions on system design having to do with, for example, contention for write-access (only one write can be allowed at any one time), and guaranteeing that all copies are automatically invalidated whenever a new write has been done, until an update can be accomplished. These, in turn, have implications for other operations farther downstream, for example, the intelligent scheduling of operations based on the availability of reliable data.
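The write-invalidate behavior described above (one writer at a time, with every other copy invalidated on a write) can be sketched in a few lines. This is a minimal single-threaded simulation, not a description of any particular machine's protocol: the class name `CoherentMemory`, the write-through policy, and the invalidation counter are all assumptions made for illustration.

```python
class CoherentMemory:
    """Toy write-invalidate scheme: one main memory, one private cache
    per processor. Assumptions: write-through to main memory, and each
    call runs to completion, which models "one write at a time"."""

    def __init__(self, n_procs):
        self.memory = {}
        self.caches = [dict() for _ in range(n_procs)]
        self.invalidations = 0  # bookkeeping, to show coherence has a cost

    def read(self, proc, addr):
        cache = self.caches[proc]
        if addr not in cache:
            # Miss (or previously invalidated): refetch from main memory.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, proc, addr, value):
        # Invalidate every other processor's copy before the write lands.
        for other, cache in enumerate(self.caches):
            if other != proc and addr in cache:
                del cache[addr]
                self.invalidations += 1
        self.caches[proc][addr] = value
        self.memory[addr] = value  # write-through, for simplicity


if __name__ == "__main__":
    mem = CoherentMemory(n_procs=3)
    mem.write(0, "x", 1)
    print(mem.read(1, "x"), mem.read(2, "x"))  # both processors cache x
    mem.write(1, "x", 7)                       # copies on 0 and 2 are dropped
    print(mem.read(0, "x"), mem.invalidations)
```

Even in this toy, every write to shared data triggers extra work proportional to the number of copies, which is the latency cost the preceding bullets warn about.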

  3. Availability of atomic operations on global variables for low overhead synchronization (test-and-set, fetch-and-add, etc.)

    An extremely effective mechanism for helping to ensure the utmost speed when synchronization is necessary involves doubling up on commonly associated operations and essentially making the pair a single atomic (i.e., "done without interrupt", and in as few cycles as possible) mini-procedure. For example, test-and-set combines testing a global variable for a particular value (say, not-locked) with setting it to a specified value (say, locked), so that no other processor can change the variable between the test and the set. Fetch-and-add is another very common example: a value is read from a memory location, something is immediately added to it, and the result is written back to the same location.

    The un-interruptability of these atomic operations is extremely important, not simply from the standpoint of speed, but also because it prevents two processors from interleaving their updates and corrupting shared data.
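The semantics of these two operations can be illustrated with a small simulation. Python exposes no hardware test-and-set or fetch-and-add, so the hypothetical `AtomicCell` class below uses a private lock as a stand-in for the hardware guarantee of un-interruptability; in C++, for comparison, `std::atomic_flag::test_and_set` and `std::atomic<T>::fetch_add` provide the real thing. The sketch uses the common formulation in which test-and-set stores unconditionally and returns the old value.

```python
import threading
import time


class AtomicCell:
    """Simulated shared word. Assumption: the private lock stands in for
    hardware that completes each operation without interruption."""

    def __init__(self, value=0):
        self._guard = threading.Lock()
        self._value = value

    def test_and_set(self, new=1):
        """Store `new` and return the previous value, as one step."""
        with self._guard:
            old = self._value
            self._value = new
            return old

    def fetch_and_add(self, delta):
        """Add `delta` and return the previous value, as one step."""
        with self._guard:
            old = self._value
            self._value = old + delta
            return old

    def store(self, value):
        with self._guard:
            self._value = value

    def load(self):
        with self._guard:
            return self._value


def _run(worker, n_threads):
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_with_fetch_and_add(n_threads, n_increments):
    """Threads bump a shared counter directly; callers take no lock."""
    counter = AtomicCell(0)

    def worker():
        for _ in range(n_increments):
            counter.fetch_and_add(1)

    _run(worker, n_threads)
    return counter.load()


def count_with_spinlock(n_threads, n_increments):
    """Threads guard an ordinary counter with a test-and-set spinlock."""
    lock_word = AtomicCell(0)  # 0 = not-locked, 1 = locked
    total = [0]                # plain shared data, not atomic itself

    def worker():
        for _ in range(n_increments):
            while lock_word.test_and_set(1) == 1:
                time.sleep(0)  # lock was already held: yield and retry
            total[0] += 1      # critical section
            lock_word.store(0) # release the lock

    _run(worker, n_threads)
    return total[0]


if __name__ == "__main__":
    print(count_with_fetch_and_add(4, 500))
    print(count_with_spinlock(4, 200))
```

Both functions should report every increment; the difference is that the fetch-and-add version needs no separate lock word at all, which is why such operations suit low overhead synchronization.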

  4. Mature compiler, debugging and development technologies are widely available

    Shared memory systems wouldn't be able to deliver the kinds of performance the sales literature leads you to expect if the enabling software weren't equal to the task; so it shouldn't come as a shock to find out that quite a bit of effort has been put into the tools that users need to turn designs into working applications, and that effort has been largely successful ... proprietary, useful only on the system for which it was developed, but successful, nonetheless.

    However, just because you've got the tools, doesn't mean that the performance is just going to suddenly appear: "There are a few provisos, some quid-pro-quos...", as the genie so aptly put it:

    • Achievable performance dependent on characteristics of memory system

      All of the processors depend heavily on the speed and efficiency of the memory-and-network subsystem, and the extent to which these facilities are not operating at peak capability largely determines the bottom-line performance of the entire system.

    • Potential to exploit fine-grained parallelism depends on availability of atomic operations on global variables

      Fine-grained parallelism, you'll recall, is defined as a low computation-to-communication ratio -- the closer to 1 (i.e., one computation operation to one communication operation ... in fact, actual cycles are compared), the more fine-grained. With this in mind, and realizing that quite a few applications could be designed to utilize such capabilities (indeed, there are no small number of applications whose efficiency is greatly augmented by it), it follows that you'll do no better at providing fine-grained parallelism than you do at implementing the features it depends on, like atomic operations.
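A short worked example shows why cheap atomic operations matter at fine granularity. The cycle counts below are hypothetical, chosen only for illustration, and the model assumes computation and communication do not overlap; the function names are likewise made up for this sketch.

```python
def comp_to_comm_ratio(compute_cycles, comm_cycles):
    """Computation-to-communication ratio, compared in cycles."""
    return compute_cycles / comm_cycles


def useful_fraction(compute_cycles, comm_cycles):
    """Share of total time spent computing, assuming no overlap
    between computation and communication."""
    return compute_cycles / (compute_cycles + comm_cycles)


# Hypothetical costs, for illustration only.
COMPUTE = 10       # cycles of real work per shared update
ATOMIC_SYNC = 10   # cycles if the update uses a cheap atomic operation
SLOW_SYNC = 200    # cycles if synchronization is heavyweight

if __name__ == "__main__":
    for label, sync in (("atomic", ATOMIC_SYNC), ("slow", SLOW_SYNC)):
        print(label,
              comp_to_comm_ratio(COMPUTE, sync),
              round(useful_fraction(COMPUTE, sync), 3))
```

With these made-up numbers, the same fine-grained task spends half its time on useful work when synchronization is cheap, and under 5% when it is not.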