Distributed Memory Programming

9.5 Load Balance

If you only have a single machine available to you, the whole question of load balance is moot: there's only one way to divy-up the work, and that's to give it all to the only machine you've got. However, if you've got your application spread across a whole bunch of machines (let's play worst-case, here, and assume that they're all single-user, so if you have them, no one else can), and you're doing the bulk of the work on just a few of the machines, you could get some pretty irate messages from other users complaining about "inefficient use of scarce resources", and "massive load-imbalance" ... besides, any reasonable compute-shop is going to charge you for the number of machines you use, whether or not you use them wisely, so, if you request and obtain n machines, your most cost-effective strategy would be to use all of them as much as possible.

  1. Goal is to keep all processors busy all the time

    If only for the sake of your poor pocketbook, you'd like to have your per-processor usage to be as high as possible. Any time you have a processor sitting idle, you have evidence that your workload is unevenly spread among the available machines, and there's a likelihood that your application isn't running as fast or as efficiently as it might, were you simply able to discover a more balanced distribution of the workload.

    There are, basically, two different classes of methods by which load balance may be approached; their principle difference lies in whether or not actual aspects of the data are utilized in determining the workload assignments:

  2. Static methods

    If the only information used to allocate work to processors is one of dimension, then the method is said to be static in nature, as it doesn't change regardless of what is found within the data, but only as the dimension changes.

    • Matrix decomposition

      Many matrix operations are very nicely decomposable algorithmically, in terms of regular subsections of the original; for example, the calculation of the determinate uses aspects of the matrix structure quite apart from whatever data values might inhabit any particular cell. The computation can be easily split among the available processors without regard for the actual data itself, using only knowledge of the dimensions of the matrix.

    • Index set partitioning

      In a more general vein, if you have n things to be worked on, and m processors available to do the work, it's a very easy process to simply give the first n/m to the first processor, the next n/m to the next, and on and on ... you might have to play a little with remainders and such, but, still, it's as easy a distribution scheme as sharing out candy among children ... and there's a lot less fuss and crying about it, too.

  3. Dynamic

    As easy and intuitive as the static methods might be, they do only focus on the structure of the data, and not on characteristics of the data itself; for example, what if the matrix whose determinate was being calculated was tremendously sparse (i.e., had a majority of "0" elements): a static approach wouldn't even take that into consideration, while a more intelligent design might be able to recognize it as significant, and process it more efficiently.

    • Adaptive grid methods

      There are a whole class of problems which can be characterized as by use of grids on the data ... the finer the grid, the more exact the result, but, just as surely, the more calculation required. Very often you don't know how fine a grid you need to employ until you actually start calculating, and then you discover that, for example, the error range on your calculations is too large, requiring a smaller sampling grid and more exact calculations. A serial algorithm would simply re-calculate the grid points, and brute-force its way through the finer mesh; a distributed approach would allow for use of adaptive grid methods, which would treat the finer mesh as an overall data element, split it among the available processors, and thereby achieve much better load-balancing than would have been the case if the processor working the original chunk had tried to do the whole refined section itself.

    • N-body simulations

      Another whole class of problems (actually, there's some overlap between the two) deals with situations where N uniquely identifiable particles (more generally known as bodies), which may or may not have effects on one another, must have some calculations performed on each of them. In some cases, especially those in which there are no inter-particle effects, this devolves into a static, index set partitioning situation; more commonly, though, each particle exerts some form of influence over its peers, sometimes very localized (i.e., the effects don't travel far, like the Strong Atomic Force), sometimes not (e.g., gravity). In cases where there are well-defined areas of interaction, it sometimes works well to assign all interacting particles to the same processor, and handle any remaining interactions (i.e., interactions involving particles outside of the local area) as if all of the local particles were simply one (albeit possibly large) particle.

    • Master/Worker

      It is sometimes possible to design your application so that work gets created in units that can be put into a queue, and matched up with processors added to another queue as they finish the task they've been working on. In fact, this is a general strategy applicable to either of the above situations, or, indeed, any where data dependencies don't exist between chunks. Keep in mind, though, that this particular approach doesn't scale well to large numbers of processors.