|
We finally come to the proper topic with which to close our introduction to distributed computing: how can I tell how good a job of parallelizing I've done? This is generally termed the speedup question, and is not to be considered a well-defined and non-contentious issue -- there are a number of different schools of thought on the issue of parallel performance, and what we'll cover here is simply one of the best known and most widely used characterizations, but, even so, it has its detractors.
- Defined as ratio of execution times, S(p)=E(1)/E(p)
You start, ideally, with the best serial implementation of your application in one hand, and the best distributed implementation of it in your other. Speedup, then, will be how much faster the parallel version runs than does the serial version, if at all; this is expressed as a ratio of the serial runtime over the parallel runtime -- if, as expected, the parallel is faster, then the ratio is greater than 1, and a perfect speedup is said to occur when that ratio is exactly equal to the number of processors you've parallelized over.
- Amdahl's Law (fixed problem size, no overhead for parallel processing)
But along comes Amdahl who claims "You'll never do it!" There are inescapable regions of serialization that will forever keep you from achieving perfect speedup, and, further, your parallelization doesn't come for free, either, so, no matter how closely you cut it, and even when you assume that you'll pay no price for parallelization, for a problem of fixed size, you'll always be under (and usually 'way under) that elusive perfect speedup.
- E(p)=G+H/p where G is elapsed time for serial operations, and H is elapsed time for the parallelizable operations in serial mode.
We start by taking into account the unavoidable serial operations, added to the amount of time it would take to do the parallel sections in serial (again, assume the best possible serial algorithms for doing these operations; we want as fair a comparison as possible), this last term then divided by the total number of processors used in the parallel run.
- S(p)=(G+H)/(G+H/p)
Now refine the terms originally introduced, replacing the serial execution time by the sum of the unavoidably serial section plus the serialization of the parallel operations, all divided by the expression we just finished putting together, and you can clearly see that, no matter many processors you toss at the parallel section, you'll always have the unavoidably serial section standing there, keeping you from attaining perfect speedup.
- For large numbers of processors, S(p) approaches 1+H/G
In fact, the more processors you effectively use, the smaller the role played by the parallel portions, and the more closely the final figure tracks the ratio of the serialized-parallel to the unavoidably serial: this would imply, then, that the higher the number of processors you use, the more you should focus on reducing the unavoidably serial sections of your code -- this derivation claims that you will get larger returns on your efforts this way, than by trying to further refine your parallel sections.
- Parallel overheads such as message passing, and load imbalance reduce speedup relative to ideal
Three primary factors in driving down realized parallel performance are communications, process managment (creation, initialization, signalling, and destruction), and inefficient allocation of work among the available processors. Certainly the communications and management functions cannot be entirely avoided, even in what are termed embarassingly parallel applications (this is something of a misnomer: you're supposed to be embarassed that your application is so nicely, completely parallelizable ... like sorting n different sets of numbers, each one separately. Personally, I've always thought that that designation was just a sign of sour grapes coined by poor researchers whose own algorithms didn't parallelize quite so easily); and it's not always possible to keep all processors equally busy ... in fact, it's rarely possible to have all processors finish at the same time.
So, here again, we have intuitively understandable evidence that perfect speedup is an unattainable ideal.
- "Superlinear" speedups possible due to increase in aggregate physical memory space or improved data reuse
You may shock yourself, someday, by calculating your speedup and finding that, contrary to what we've just agreed, your figures claim that you've actually done better than perfect speedup -- this is known as superlinear speedup, and it doesn't mean that you've done your math wrong. Once you've tracked down all of the factors, you'll find that, most commonly, you've achieved this enviable state by virtue of having much more memory available to you in the parallel situation (which means that many things that will bog down your serial implementation get tremendously efficient when spread out among all of the available processors), and/or that your parallel case was better able (again, generally due to the spread-out nature of the implementation) to re-use data it already had loaded and easily (quickly) accessible. It's fairly obvious, after all, that the more processors you have, the less data per processor will have to supported, and the more likely it becomes that you'll be able to benefit from a higher frequency of cache-hits for the data you request.
|