The most significant inhibitor to parallel performance is the overhead of communicating data between compute nodes. In designing the parallel implementation of an application, it is useful to model the communication overhead.
The use of MPI_Scatter and MPI_Scatterv is encouraged. These routines are supposed to optimally and automatically make the choice of round-robin (point-to-point) vs. binary tree pattern for data distribution. Discussions of these routines can be found in the talks MPI Collective Communication I and MPI Collective Communication II.
Here is an example of modeling the communications cost of a scatter operation for N processors, with each message of size p. Pictorially, the scatter operation is:

There are two natural ways of implementing a scatter operation:
- A round-robin, point-to-point (PTP) send where the root node sends each other node its data
- A binary tree (BT) implementation (shown below)

A simplified model of data communications costs for the round-robin method is:
The root node must send one message to each of N-1 other nodes. Because the messages must be sent one after another, each message incurs a new latency penalty of LATENCY microseconds. Thus the total latency cost is
(N-1)*LATENCY microseconds
The data volume for N-1 messages is (N-1)*p bytes. Assuming a bandwidth of BANDWIDTH MB/sec (or BANDWIDTH Bytes/microsecond), the bandwidth penalty is
(N-1)*p Bytes x 1 microsecond/BANDWIDTH Bytes =
(N-1)*p/BANDWIDTH microseconds
The corresponding model for the binary tree implementation is:
The messages at each level of the tree are occurring at the same time, so the latency cost is directly related to the number of levels in the tree of N nodes. The number of levels is log2 (read: "log-base-2") N plus 1, and messages are sent by all but the last (bottom) level. Thus the total latency is
# levels * latency/level = log2(N)*LATENCY microseconds
This method causes more data to be transferred, but the messages at each stage can be sent in parallel. Since the SP switch has minimal contention between parallel messages, the bandwidth penalty is still
(N-1)*p/BANDWIDTH microseconds
You may wonder how the (N-1) was derived. Referring to the diagram above, you see that at stage 1, the root node transmits N/2 messages and keeps the other half. At stage 2, each of two nodes transmit 1/4 of the data and hold on to the other 1/4, but since this is done in parallel, the bandwidth penalty is as though only N/4 messages were sent. At the last stage, each sending node sends out exactly 1 message. Using a very well-known series expression, the bandwidth penalty is:
[N/2+N/4+...+1] * p Bytes * 1 microsecond/BANDWIDTH Bytes =
1 microsecond
(N-1) * p Bytes * ----------------
BANDWIDTH Bytes
Important Note!: This is an incomplete model of data communications and should be viewed as a rough guide. If communications costs will be a significant factor, timing runs should be made with prototype codes to verify results.