|
"Yes, Virginia, there is such a thing as distributed shared memory...", so there are ways in which you can pretty much "have your cake and eat it, too" -- "pretty much":
- NUMA characteristics *still* define performance
We're still talking distributed memory, here, and anytime you hear that phrase, you should immediately think non-uniform memory access. So even though you have the illusion of a single global address space, you're still going to have to deal with the real-world fact that "some of that global address space is farther away than other parts". Plan on still having to worry about getting your data close to where it's going to be used before it's called for.
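To put a rough number on "farther away", here is a minimal sketch of the average cost of one memory access as the local fraction changes. The latencies (100 ns local, 2000 ns remote) are made-up placeholders, not measurements from any real machine, and the function name is hypothetical.

```python
def average_access_ns(local_fraction, local_ns=100.0, remote_ns=2000.0):
    """Expected cost of one access when `local_fraction` of accesses are local.

    The default latencies are illustrative assumptions only.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        cost = average_access_ns(fraction)
        print(f"{fraction:.0%} local: about {cost:.0f} ns per access")
```

Under these assumed numbers, even a small share of remote accesses dominates the average, which is why data placement still matters behind a shared address space.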
- Implementation: Software (logical)
There are software packages available, some free and some not, that will allow you to impose a logically shared mask over your in-fact distributed memory:
- Convenient extension of familiar concepts
Using these packages, you are pretty much freed of all of the hassle involved in dealing with multiple, independently-addressed memory banks and multiple, independently-programmed machines: you essentially have a single memory space coupled to a single execution environment -- everything relating to the actual independent modules, both processor and memory, is more-or-less handled behind the scenes.
- Must use API calls for access
You do have to use specific function calls (API stands for application programming interface, in case you were wondering) to invoke the illusion, which means that you have a learning curve to climb regarding both the available commands and their argument lists, not to mention any "...provisos, a few quid pro quos..." related to their use.
- Large overheads on multicomputers
And did I forget to mention that ... TANSTAAFL ("There Ain't No Such Thing As A Free Lunch", as so aptly expressed by Robert A. Heinlein): sure, you get to write your code as if you were dealing with a single, large computing resource, but that just means that there have to be layers of software below your code taking care of all of the details that this model lifted off of your back. What's more, the way those layers of software have been coded has to be so general (read "not necessarily as efficient as possible in every case") as to be able to handle many different kinds of situations, so you're dealing with inefficiency not just in the translation of your shared model into a distributed reality but, much more significantly, in the fact that that translation doesn't know anything about the actual workings of your application.
The bottom line, here, is that you get potentially large cognitive gains and simplified code, but you pay a probably large price in performance.
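As a rough illustration of both points (access only through API calls, and hidden cost), here is a toy sketch of a logically shared space laid over separate memory banks. Everything in it (`SharedSpace`, `read`, `write`, the `remote_accesses` counter) is invented for this example and is not the interface of any real package; a real layer would send a message where this one merely bumps a counter.

```python
class SharedSpace:
    """Toy global address space spread over several 'nodes' (plain lists here)."""

    def __init__(self, num_nodes, words_per_node):
        self.words_per_node = words_per_node
        self.banks = [[0] * words_per_node for _ in range(num_nodes)]
        self.remote_accesses = 0  # stand-in for messages a real layer would send

    def _locate(self, address):
        # Translate a global address into (owning node, offset within that node).
        return divmod(address, self.words_per_node)

    def read(self, my_node, address):
        node, offset = self._locate(address)
        if node != my_node:
            self.remote_accesses += 1
        return self.banks[node][offset]

    def write(self, my_node, address, value):
        node, offset = self._locate(address)
        if node != my_node:
            self.remote_accesses += 1
        self.banks[node][offset] = value


def demo():
    """Node 0 fills and then sums the whole space, oblivious to where data lives."""
    space = SharedSpace(num_nodes=4, words_per_node=8)
    for address in range(32):
        space.write(0, address, address * address)
    total = sum(space.read(0, address) for address in range(32))
    return total, space.remote_accesses
```

The calling code looks like it owns one flat memory, yet three quarters of its accesses in `demo()` are remote; the layer underneath has no way of knowing that the loop would have been better split up by owner.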
- Examples: Linda, TCGMSG, Global Arrays
Network-Linda, a commercial product sold by Scientific Computing Associates, Inc., implements a novel form of object-oriented-ness termed tuple-space over the entire computing resource, be it a single system or a distributed environment. Conceptually one of the simplest programming models, it also has the smallest set of new function calls to learn (6), but it pays for this user-oriented simplicity with a reliance on intelligent sophistication and foresight in the compilation and linking tools (in order to provide as much data-locality as possible), and with potential limitations to both scalability and performance.
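To give a feel for the tuple-space idea, here is a minimal in-process sketch. It is not Network-Linda and does not use its API: it covers only three operations (`out`, `rd`, and `in_`, renamed because `in` is a Python keyword), treats `None` as a wildcard field (an assumption made for brevity), and uses threads in one process where the real thing spans machines.

```python
import threading


class TupleSpace:
    """Toy tuple space: deposit tuples, then read or withdraw them by pattern."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        # Deposit a tuple and wake anyone waiting for a match.
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, candidate):
        # None in the pattern matches any value in that position.
        return len(pattern) == len(candidate) and all(
            p is None or p == c for p, c in zip(pattern, candidate)
        )

    def _find(self, pattern):
        for candidate in self._tuples:
            if self._matches(pattern, candidate):
                return candidate
        return None

    def rd(self, *pattern):
        # Block until a matching tuple exists; return it and leave it in place.
        with self._cond:
            while (found := self._find(pattern)) is None:
                self._cond.wait()
            return found

    def in_(self, *pattern):
        # Block until a matching tuple exists; remove it and return it.
        with self._cond:
            while (found := self._find(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(found)
            return found
```

A producer might call `space.out("task", 7, 3.5)` while any number of workers call `space.in_("task", None, None)`; nobody names a sender or a receiver, which is where the conceptual simplicity comes from.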
TCGMSG ("Theoretical Chemistry Group MeSsaGe-passing") is arguably one of the leanest and lowest-overhead of the message-passing libraries for distributed computing. Its inclusion in this section comes about due to its collective communications capabilities, allowing applications to perform various global operations (e.g., sum, max, min, etc.), and its implementation of a simplified shared counter, allowing for loop-level distribution of processing.
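The shared-counter idea can be sketched in a few lines. This is not TCGMSG code: threads stand in for the separate processes, and `SharedCounter` and `parallel_sum_of_squares` are hypothetical names. The point is only the pattern, in which each worker asks the counter for the next loop index until the loop is exhausted.

```python
import threading


class SharedCounter:
    """Hands out 0, 1, 2, ... with each value going to exactly one caller."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            value = self._value
            self._value += 1
            return value


def parallel_sum_of_squares(n_items, n_workers=4):
    counter = SharedCounter()
    partial = [0] * n_workers  # one slot per worker, so no sharing conflicts

    def worker(rank):
        while True:
            i = counter.next()  # claim the next loop iteration
            if i >= n_items:
                break
            partial[rank] += i * i

    threads = [threading.Thread(target=worker, args=(rank,))
               for rank in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(partial)  # the "global sum" step
```

Because iterations are claimed one at a time, faster workers simply take more of them, which gives dynamic load balancing at the price of one counter access per iteration.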
Quoting from the Supercomputing '94 paper "Global Arrays: A Portable 'Shared-Memory' Programming Model for Distributed Memory Computers", by Nieplocha, Harrison and Littlefield:
The key concept of GA [Global Arrays] is that it provides a portable interface through which each process in a MIMD parallel program can independently, asynchronously, and efficiently access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. In this respect, it is similar to the shared-memory programming model. However, the GA model also acknowledges that remote data is slower to access than local, and it allows data locality to be explicitly specified and used. In this respect, it is similar to message passing.
For more information on GA, please refer to the paper cited above, ISSN: 1063-9535.
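The two halves of that description (block access without the owner's cooperation, plus explicit locality) can be sketched with a toy one-dimensional array. This is not the Global Arrays library: the class and its `get`, `put`, and `distribution` methods are hypothetical names chosen for this example, and the "processes" are just lists in one program.

```python
class GlobalArray1D:
    """Toy 1-D array whose elements are block-distributed over several owners."""

    def __init__(self, length, n_procs):
        self.length = length
        self.block = -(-length // n_procs)  # ceiling division: elements per owner
        self.local = []
        for proc in range(n_procs):
            lo, hi = self.distribution(proc)
            self.local.append([0.0] * (hi - lo))

    def distribution(self, proc):
        # Half-open global index range [lo, hi) held locally by `proc`.
        lo = min(proc * self.block, self.length)
        hi = min(lo + self.block, self.length)
        return lo, hi

    def get(self, lo, hi):
        # Fetch a logical block, whichever owners it happens to span.
        return [self.local[i // self.block][i % self.block]
                for i in range(lo, hi)]

    def put(self, lo, values):
        # Store a logical block starting at global index `lo`.
        for k, value in enumerate(values):
            i = lo + k
            self.local[i // self.block][i % self.block] = value
```

A process that cares about speed would ask `distribution(my_rank)` first and work mostly inside that range, reaching for `get` and `put` on other ranges only when it must.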
- Hardware (physical)
Certain systems provide a true shared-memory environment implemented on top of a physically-distributed foundation.
- Single address space and enforced consistency
The programmer deals strictly with a logical view of a single, global address space; hidden beneath the covers is the fact that, quite the contrary, the memory is truly distributed, with each unit owning its own, and all data communication is accomplished via tightly controlled and heavily optimized message passing. Cache coherency is closely monitored during all data manipulations in order to assure a single system image over all units.
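What "enforced consistency" means can be sketched with a generic write-invalidate scheme. This is a textbook-style toy, not the protocol of any particular machine: the class name, the write-through policy, and the `invalidations` counter are all assumptions made for illustration, and real hardware does this per cache line, in silicon.

```python
class CoherentMemory:
    """Toy write-through, write-invalidate coherence over per-CPU caches."""

    def __init__(self, n_cpus):
        self.memory = {}                              # backing store
        self.caches = [dict() for _ in range(n_cpus)]  # one cache per CPU
        self.invalidations = 0

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:
            # Miss: fetch the current value from the backing store.
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        # Invalidate every other cached copy before the write takes effect,
        # so that no CPU can later read a stale value.
        for other, cache in enumerate(self.caches):
            if other != cpu and address in cache:
                del cache[address]
                self.invalidations += 1
        self.caches[cpu][address] = value
        self.memory[address] = value  # write-through keeps the store current
```

After CPU 1 writes an address that CPU 0 had cached, CPU 0's next read misses and picks up the new value; that forced miss is the price of the single system image.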
- Examples: KSR-1, Convex Exemplar
The KSR-1 (and, shortly before the company's demise, the KSR-2) incorporated proprietary processors fixed into a uni-directional, high-bandwidth communications ring, each such ring, called ring:0, comprising a maximum of 32 processors, each paired with 32MB of local cache memory. The hierarchical nature of the ring structure allowed the system to employ a ring-of-rings, called ring:1, with a maximum dimension of 32 ring:0's ... so a fully populated ring:1 would comprise 1024 processors and 32GB (1024 x 32MB) of cache memory. A sophisticated virtual-to-real memory addressing scheme, and a hierarchical, 16-way associative memory for tracking data requests, provided the underlying cache coherency supporting a single system image across the entire distributed resource.
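Taking the per-ring figures quoted above at face value (32 processors per ring:0, 32MB of cache per processor, 32 ring:0's per ring:1), the totals for a fully populated ring:1 work out as follows. The constant and function names are just labels for this calculation.

```python
PROCESSORS_PER_RING0 = 32
RING0S_PER_RING1 = 32
CACHE_MB_PER_PROCESSOR = 32


def ring1_capacity():
    """Return (processors, total cache in MB, total cache in GB)."""
    processors = PROCESSORS_PER_RING0 * RING0S_PER_RING1
    cache_mb = processors * CACHE_MB_PER_PROCESSOR
    cache_gb = cache_mb / 1024
    return processors, cache_mb, cache_gb


if __name__ == "__main__":
    processors, cache_mb, cache_gb = ring1_capacity()
    print(f"{processors} processors, {cache_mb} MB = {cache_gb:.0f} GB of cache")
```

So the aggregate cache is 1024 x 32MB = 32768MB, or 32GB.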
The Exemplar, a Convex design using Hewlett-Packard processors, has a topology that mixes both crossbar switches and rings: the fundamental unit is a hypernode, a set of 8 processors (arranged in pairs) and 8 banks of DRAM connected to a 5-by-5 (5 simultaneous inputs, 5 simultaneous outputs) crossbar; hypernodes can themselves be connected together into a multi-hypernode system via 4 Coherent Toroidal Interconnect communication rings, up to a maximum of 16 hypernodes or 128 processors. Four memory types are implemented, differentiating both sharing and latency characteristics:
- CPU-private: accessed by only a single processor (e.g., stack memory);
- Hypernode-private: accessed by any processor in the hypernode, for use in, e.g., fine-grained parallelism;
- Near-shared: may be accessed by any processor on any hypernode, but the majority of all accesses will occur from processes within the same hypernode; and
- Far-shared: like Near-shared, but without any knowledge of access-characteristics -- memory access contention is reduced by spreading the pages of this memory across all hypernodes.
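The difference between the last two classes comes down to where pages are put. Here is a minimal sketch of that placement decision; it is not Convex's allocator. The function name, the class strings, and the round-robin rule for far-shared pages are assumptions made for illustration (the description above only says the pages are spread across all hypernodes).

```python
def place_pages(n_pages, n_hypernodes, memory_class, home=0):
    """Return, for each page, the index of the hypernode that holds it."""
    if not 0 <= home < n_hypernodes:
        raise ValueError("home must be a valid hypernode index")
    if memory_class == "far-shared":
        # No knowledge of who will use the data: spread pages evenly
        # (round-robin here) so no single hypernode becomes a hot spot.
        return [page % n_hypernodes for page in range(n_pages)]
    if memory_class in ("near-shared", "hypernode-private"):
        # Most or all accesses come from one hypernode: keep pages there.
        return [home] * n_pages
    raise ValueError(f"unknown memory class: {memory_class}")


if __name__ == "__main__":
    print(place_pages(8, 4, "far-shared"))
    print(place_pages(3, 4, "near-shared", home=2))
```

Near-shared placement buys low latency for the favored hypernode; far-shared placement gives that up in exchange for spreading contention.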
|