Non-blocking standard send, message size > threshold
Non-blocking receive
The case of a non-blocking standard send MPI_Isend (S) for a message larger than the threshold is more interesting:
Compared to a blocking send, synchronization overhead is reduced by the time between the MPI_Isend (S) and the MPI_Wait (S).
For a blocking send, the synchronization overhead would be the period between the blocking call and the copy over the network. For a non-blocking call, the synchronization overhead is reduced by the amount of time between the non-blocking call and the MPI_Wait (S), in which useful computation is proceeding.
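The arithmetic behind this comparison can be sketched with a toy timeline model. This is only an illustration, not MPI code: the function name `sync_overhead` and the time values are hypothetical, and the model assumes that the copy over the network starts at a single known instant and that the sender computes usefully right up to the moment it blocks.

```c
/* Toy model (not MPI code): idle time the sending task spends blocked
 * before the copy over the network begins.
 *
 * blocked_from   : time at which the sender stops computing and blocks,
 *                  i.e. the blocking send call, or the MPI_Wait that
 *                  follows a non-blocking send (hypothetical values).
 * transfer_start : time at which the copy over the network begins.
 */
double sync_overhead(double blocked_from, double transfer_start)
{
    double idle = transfer_start - blocked_from;
    /* If the transfer has already started, the sender never idles. */
    return idle > 0.0 ? idle : 0.0;
}
```

For example, suppose the send is called at time 2 and the transfer starts at time 10. A blocking send idles for `sync_overhead(2.0, 10.0)`, which is 8 time units. If the send is non-blocking and the matching wait is called at time 7, the sender idles for `sync_overhead(7.0, 10.0)`, which is 3 time units; the difference of 5 is exactly the time between the non-blocking call and the wait.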
Posting MPI_Irecv (S) early now brings even more benefit: imagine the synchronization overhead if MPI_Recv (S) replaced MPI_Wait (S).
Again, the non-blocking receive MPI_Irecv (S) will reduce synchronization overhead on the receiving task for the case in which the receive is posted first. There is also a benefit to using a non-blocking receive when the send is posted first. Consider how the figure would change if a blocking receive were posted. Typically, blocking receives are posted immediately before the message data must be used (to allow the maximum amount of time for the communication to complete). So, the blocking receive would be posted in place of the MPI_Wait. This would delay the synchronization with the send call until this later point in the program, and thus increase synchronization overhead on the sending task.
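The effect on the sending task can be sketched with a second toy timeline model. Again, this is an illustration under stated assumptions rather than MPI code: the function names and time values are hypothetical, handshake latency is ignored, and the model assumes that the transfer of a message above the threshold can begin as soon as both the send and the matching receive have been posted.

```c
/* Toy model (not MPI code): the transfer of a large message is assumed
 * to start only once both the send and the matching receive are posted. */
double transfer_start(double send_posted, double recv_posted)
{
    return send_posted > recv_posted ? send_posted : recv_posted;
}

/* Idle time the sending task spends in its wait call before the
 * transfer begins.
 *
 * send_posted : time of the non-blocking send call
 * send_wait   : time at which the sender calls its wait
 * recv_posted : time at which the receiver posts its receive
 *               (early for a non-blocking receive, late for a
 *               blocking receive posted where the wait would be)
 */
double sender_sync_overhead(double send_posted, double send_wait,
                            double recv_posted)
{
    double idle = transfer_start(send_posted, recv_posted) - send_wait;
    return idle > 0.0 ? idle : 0.0;
}
```

Suppose the send is posted at time 2 and the sender waits at time 6. If the receiver posts a non-blocking receive at time 4, the transfer can start at time 4 and `sender_sync_overhead(2.0, 6.0, 4.0)` is 0. If the receiver instead posts a blocking receive at time 9, where its wait would have been, the transfer cannot start until time 9 and `sender_sync_overhead(2.0, 6.0, 9.0)` is 3: the sender idles for 3 time units that the early non-blocking receive would have avoided.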