Parallel Programming Concepts

3. Taxonomy

Taxonomy1

To set a foundation for our examination of parallel processing, we need to understand just what kinds of processing alternatives have already been identified, and where they fit into the "parallel picture", if you will. One of the longest-lived and still very reasonable classification schemes was proposed by Flynn, in 1966, and distinguishes computer architectures according to how they can be classified along two independent, binary-valued dimensions; independent simply asserts that neither of the two dimensions has any effect on the other, and binary-valued means that each dimension has only two possible states, as a coin has only two distinct flat sides. For computer architectures, Flynn proposed that the two dimensions be termed Instruction and Data, and that, for both of them, the two values they could take be Single or Multiple. The two dimensions could then be drawn like a matrix having two rows and two columns, and each of the four cells thus defined would characterize a unique type of computer architecture.

Single Instruction, Single Data (SISD)

Sisd2

This is the oldest style of computer architecture, and still one of the most important: all personal computers fit within this category, as did most computers ever designed and built until fairly recently. Single instruction refers to the fact that there is only one instruction stream being acted on by the CPU during any one clock tick; single data means, analogously, that one and only one data stream is being employed as input during any one clock tick. These factors lead to two very important characteristics of SISD style computers:

  • Serial

    Instructions are executed one after the other, in lock-step; this type of sequential execution is commonly called serial, as opposed to parallel, in which multiple instructions may be processed simultaneously.

  • Deterministic

    Because each instruction has a unique place in the execution stream, and thus a unique time during which it and it alone is being processed, the entire execution is said to be deterministic, meaning that you (can potentially) know exactly what is happening at all times, and, ideally, you can exactly recreate the process, step by step, at any later time.

  • Examples: Most non-supercomputers

    Most computers commonly available today are of the SISD variety: all personal computers, all single-instruction-unit-CPU workstations, mini-computers, and mainframes.


Multiple Instruction, Single Data (MISD)

Misd

Few actual examples of computers in this class exist; this category was included more for the sake of completeness than to identify a working group of actual computer systems. However, special-purpose machines are certainly conceivable that would fit into this niche: multiple frequency filters operating on a single signal stream, or multiple cryptography algorithms attempting to crack a single coded message. Both of these are examples of this type of processing where multiple, independent instruction streams are applied simultaneously to a single data stream.

A less technological, but perhaps more cosmopolitan example, was suggested by a participant in the Cornell Theory Center's Virtual Workshop:

    "I thought of another example of a MISD process that is carried out routinely at [the] United Nations. When a delegate speaks in a language of his/her choice, his speech is simultaneously translated into a number of other languages for the benefit of other delegates present. Thus the delegate's speech (a single data) is being processed by a number of translators (processors) yielding different results."


    Single Instruction, Multiple Data (SIMD)

    Simd3

    A very important class of architectures in the history of computation, single-instruction/multiple-data machines are capable of applying the exact same instruction stream to multiple streams of data simultaneously. For certain classes of problems, e.g., those known as data-parallel problems, this type of architecture is perfectly suited to achieving very high processing rates, as the data can be split into many different independent pieces, and the multiple instruction units can all operate on them at the same time.

    • Synchronous (lock-step)

      These types of systems are generally considered to be synchronous, meaning that they are built in such a way as to guarantee that all instruction units will receive the same instruction at the same time, and thus all will potentially be able to execute the same operation simultaneously.

    • Deterministic

      SIMD architectures are deterministic because, at any one point in time, there is only one instruction being executed, even though multiple units may be executing it. So, every time the same program is run on the same data, using the same number of execution units, exactly the same result is guaranteed at every step in the process.

    • Well-suited to instruction/operation level parallelism

      The "single" in single-instruction doesn't mean that there's only one instruction unit, as it does in SISD, but rather that there's only one instruction stream, and this instruction stream is executed by multiple processing units on different pieces of data, all at the same time, thus achieving parallelism.

    The most advanced parallel processor arrays that are in production today:

    • The Cambridge Parallel Processing Gamma II Plus;

      The Gamma 1000 models 1024 processors are ordered in a 32×32 array, while the Gamma 4000 has 4096 processors arranged in a 64×64 square. As in all processor-array machines, the control processor (called the Master Control Unit (MCU) in the Gamma-II) has a separate memory to hold program instructions while the data are held in the data memory associated with each Processing Element (PE) in the processor array.

    • The Quadrics Apemille;

      The Apemille is a commercial spin-off of the APE-1000 project of the Italian National Institute for Nuclear Physics and a successor to the APE-100 systems. The systems are available in multiples of 8 processor nodes where up to 16 boards can be fitted into one crate or in multiples of 128 nodes by adding up to 15 crates to the minimal 1-crate system. The interconnection topology of the Quadrics is a 3-D grid with interconnections to the opposite sides (so, in effect a 3-D torus). Communication is controlled by the Memory Controller and the Communication Controller which are both housed on the backplane of a crate. The TAO language has several extensions to employ the SIMD features of the Quadrics. Firstly, floating-point variables are assumed to be local to the processor that owns them, while integer variables are assumed to be global.

    Note: The face of  HPC is changing very quickly. While few years ago a lot of SIMD vector pipeline supercomputers had been produced, few of them are produced today. However some of them are still in production run. You may find here the list of major SIMD platform that may be still available at some supercomputing centers.


    Multiple Instruction, Multiple Data (MIMD)

    Mimd4

    Many believe that the next major advances in computational capabilities will be enabled by this approach to parallelism which provides for multiple instruction streams simultaneously applied to multiple data streams. The most general of all of the major categories, a MIMD machine is capable of being programmed to operate as if it were in fact any of the four.

    • Synchronous or asynchronous

      MIMD instruction streams can potentially be executed either synchronously or asynchronously, i.e., either in tightly controlled lock-step or in a more loosely bound "do your own thing" mode. Some kinds of algorithms require one or the other, and different kinds of MIMD systems are better suited to one or the other; optimum efficiency depends on making sure that the system you run your code on reflects the style of synchronicity required by your code.

    • Deterministic or non-deterministic

      MIMD systems are potentially capable of deterministic behavior, that is, of reproducing the exact same set of processing steps every time a program is run on the same data. A number of other factors, for example, how multiple messages at a receiver are handled, go into the actual determination of this characteristic, but, if it is important for the system to be deterministic, there is nothing in the nature of MIMD parallelism that fundamentally precludes it.

    • Well-suited to block, loop, or subroutine level parallelism

      The more code each processor in an MIMD assembly is given domain over, the more efficiently the entire system will operate, in general. This is largely due to communications requirements, particularly synchronization, which are characteristically less stringent at levels above instruction-oriented parallelism.

    • Multiple Instruction or Single Program

      MIMD-style systems are capable of running in true "multiple-instruction" mode, with every processor doing something different, or every processor can be given the same code; this latter case is called SPMD, "Single Program Multiple Data", and is a generalization of SIMD-style parallelism, with much less strict synchronization requirements.

    • MIMD Examples

      The following are representative of the many different ways that MIMD parallelism can be realized:

      • Asynchronous

        All of these systems implement MIMD parallelism in terms of more or less loosely coupled instruction streams:

        • IBM SPx, or clusters of workstations, using PVM, MPI etc.

          The nature of message-passing libraries, especially general ones applicable to many different kinds of processors, makes the resulting parallel system much more suited to coarser-grained, loosely-coupled tasks.

        • Multiple vector units working on one problem (e.g., Fujitsu VPP5000)

          Vector-parallel systems are very efficient at data parallel tasks, where each vector unit is given responsibility for computations involving one unique segment of the overall data.

        • Hypercubes (e.g., nCube2S) and Meshes (e.g., Intel Paragon)

          The hypercube is one of a whole family of network architectures that provide multiple connection points among the processors, typically allowing each processor to be directly connected to 8 or more processors. Meshes do much the same kind of thing, and come in 2- or 3-dimensional configurations, the former looking like a simple flat grid, the latter like a box.

      • Synchronous: e.g., IBM RS/6000 (up to 4 instructions per cycle), NEC SX-5.

        Processors are now capable of executing multiple instructions every cycle, although not every cycle is so filled. This capability allows these processors to execute a number of related instructions simultaneously, for example, the multiplication of two floating point values already loaded into registers, the integer addition corresponding to the array location in memory of the next floating point value to be added, and the fetch of that value. This type of parallel processing, however, requires a sophisticated compiler with the ability to order instructions in such a way as to both preserve necessary instruction precedence and recognize instruction independence.