Distributed Memory Programming

2.3 Distributed Memory
As cluster-computing comes more and more into vogue, attention naturally becomes directed towards that form of system image best suited to realize parallelism upon it: distributed memory.

Dist

  1. Exploit commodity microprocessors and conventional uniprocessor memory systems

    One of the nicest aspects of distributed memory systems is that you can basically build very respectable parallel capability from essentially off-the-shelf components: commodity processors are reaching higher and higher clock-rates and larger and larger MACHO-FLOPs, and when you buy one, you get a complete working system built around it. This gives you a number of very nice features, like being able to swap in/out entire uniprocessor systems if something breaks, and tailoring your parallel/serial systems to meet your needs at the moment.

    Communications is also off-the-shelf, and is essentially point-to-point using whatever you can afford to run between your machines: ethernet, co-ax, fiber, ATM ... tin-cans and string, if your budget is under real pressure.

  2. Non-uniform memory access times (NUMA)

    It's immediately obvious that one of the costs associated with this form of parallelism is the fact that the time it takes one processor to obtain data from any other processor is going to be directly dependent on how "far apart" they are in terms of communication-latency; the technical term for this kind of situation is NUMA.

    This characteristic has very significant implications for the complexity of application designs whenever data is expected to be spread across multiple processors. The concept of data-locality has no real referent within a shared-memory system within which all data is equally close to all processors, but it is tremendously significant when processors may have to wait for different lengths of time for a piece of data, the wait depending on how far away the data has to be brought from ... in such a case, a great deal of effort is expended (by good designs) on doing everything possible to guarantee that, by the time the data is actually needed, it has already been brought as close as possible to the processor about to use it.

  3. No atomic operations on global variables for synchronization

    Clearly, synchronization is going to be much more expensive within a distributed environment than it is within a shared one -- there are, after all, no global variables that all processors can be "watching" for signals as to when to do what. Instead, synchronization information has to be ferried around the entire network, and thus leads to rather painful delays and costly latencies while "far-away" units are brought into step.

  4. Scalability

    • Hardware and software designed with nominal (asymptotic) scalability to large numbers of processors. Large constant factors rather than asymptotic complexity may dominate realizable performance

      Beware of "scalability" which necessarily has only asymptotic validity (e.g. so-and-so holds as the number of processors approach infinity). In the real world there are only finite numbers of processors. In many practical situations, the magnitude of constant factors may be large in comparison to the number of processors, etc. for realizable machine configurations. It is easy to see that 100*N*N > N*N*N for N <100. It is routine to find algorithms which are have higher complexities (worse scalabilities) which give better performance on existing machines.

    • Algorithm may scale less than the maximum system size

      You may have 1000 processors to throw at a particular application that can only efficiently use 10 or 20; if you force your algorithm to the limits of your available system, you'll wind up with lots of wasted cycles and really terrible performance, something similar to using a 16-lb. sledge hammer to tack up a picture-frame: you probably mashed your fingers, and what are you going to do about that hole in the wall?

      Bottom-line: use only what the problem requires, any more is just going to make things worse, not better.