Parallel Programming Concepts

5.3 Some Approaches

Distributed memory is, for all intents and purposes, virtually synonymous with message-passing, although the actual characteristics of the particular communication schemes used by different systems may hide that fact.

  • Message-passing approach: tasks communicate by sending data packets to each other

    Messages are discrete units of information, discrete meaning that they have a definite identity, and can be distinguished from all other messages ... well, that's the theory, at least. In practice, one of the most common programming errors is to forget to actually make the messages distinctly different, by giving them unique identifiers or tags. Regardless, parallel tasks use these messages to send information and requests for same to their peers.

    • Overhead is proportional to size and number of packets (more communication means greater costs; sending data is slower than accessing shared memory.)

      Message-passing isn't cheap: every one of those messages has to be individually constructed, addressed, sent, delivered, and read, all before the information it contains can be acted upon. Obviously, then, the more messages being sent, the more time and cycles spent in servicing message-oriented duties, and the less spent on the actual tasks that the messages are supposed to be subservient to. It is also clear from this portrayal that, in the general case, message-passing will take more time and effort than shared-memory.

      Having said that, it should be pointed out that shared memory scales less well than message passing, and, once past its maximum effective bandwidth utilization, the latency associated with message-passing may actually be lower than that encountered on an over-extended shared memory communications network.

    • Effective message-passing schemes overlap calculations & message-passing.

      There are ways to decrease the overhead associated with message-passing, the most significant being to somehow arrange to do as much valuable computation as possible while communication is occurring. The most easily conceived method of doing this is to have two completely separate processors, each dedicated to either computation or communication, and coupled via a dual-ported DMA (direct memory access) in order to cooperate. This is something of the nature of shared-memory being put in the service of distributed-memory, and does require a multi-processor configuration for a single entity in the distributed system.

      Other schemes involve time-slicing between the two tasks, or active-waiting where a processor waiting for a communications event, such as receipt of an awaited message or acknowledgment of delivery of a sent message, arranges for a preemptive signal to be generated when the event occurs, and then goes off and does independent computation. These alternatives require considerably more sophistication in the control programs than simply sitting and twiddling one's thumbs until the communication process completes, but can be made to be very effective.

    • Examples:

      The following are few examples of common types of message-passing systems:

      • MPP ( ASCI White, ASCI Red, ASCI Blue-Pacific), Clusters of Workstations / PCs;

        Utilizing high bandwidth switch networking architecture, the nodes within such systems use a special very low-level message-passing library for the lowest-level (i.e., most efficient, fastest) mechanism for communicating with one another. On top of this is built the MPI /PVM message-passing facility.

      • hypercube machines (e.g., nCube2S)

        Hypercube architectures typically utilize off-the-shelf processors coupled via proprietary networks (both hardware and message-passing software); the processor, as is the case in the above example, can also be proprietary, and often is when the decision is to trade price for performance, especially communications performance.

    • ccNUMA ( Cache Coherent Non-Uniform Memory Access ) machines : systems that have a rather small (up to 16) number of RISC processors that are tightly integrated in a cluster, a Symmetric Multi-Processing (SMP) node (SGI Origin 2000). The processors in such a node are virtually always connected by a 1-stage crossbar while these clusters are connected by a less costly network.

      ccNUMA approach provide the programmer with a shared-memory programming model, even though the actual design of the hardware and the network was distributed. A very sophisticated address-mapping scheme insured that the entire distributed memory space could be uniquely represented as a single shared resource. The actual communication model utilized at the lowest level, however, was in fact "message-passing", but that "lowest level" was not accessible to the programmer.