As we've seen in previous posts, within a parallel region all threads execute the same code. When we parallelised the calculation of the Mandelbrot set, we had to devise ways of dividing the work amongst the threads ourselves.
However, OpenMP provides additional directives, called work-sharing directives, to help with this. Work-sharing directives indicate that work should be divided up between threads, rather than replicated.
OpenMP has extensive support for parallelising loops, since loops are the main source of parallelism in many, particularly scientific, applications.
!$OMP DO [<clauses>...]
   ...
[!$OMP END DO]
The closing directive is optional.
The !$OMP DO directive can take a number of clauses, including the data-sharing clauses we met earlier. Recall that loop counters are private by default (in Fortran).
Because the !$OMP PARALLEL / !$OMP DO combination is so common, there's a shorthand combined directive:

!$OMP PARALLEL DO [<clauses>...]
   ...
[!$OMP END PARALLEL DO]

This can take all the clauses that the !$OMP DO and !$OMP PARALLEL directives can take.
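As a small sketch of the equivalence (the arrays a and b and the loop bounds here are hypothetical, chosen only for illustration):

```fortran
program combined_example
   implicit none
   integer :: i
   integer, parameter :: n = 8
   real :: a(n), b(n)

   b = 1.0

   ! Separate parallel region and work-sharing directive:
   !$OMP PARALLEL
   !$OMP DO
   do i = 1, n
      a(i) = 2.0 * b(i)
   end do
   !$OMP END DO
   !$OMP END PARALLEL

   ! Equivalent shorthand combined directive:
   !$OMP PARALLEL DO
   do i = 1, n
      a(i) = 2.0 * b(i)
   end do
   !$OMP END PARALLEL DO
end program combined_example
```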
Returning to our Mandelbrot set example, we can replace the logic for manually distributing the work among threads with the !$OMP DO directive:

!$OMP DO
do ipix = 1_ik, npix
   ! ... loop body ...
enddo
!$OMP END DO
The speedup we see is comparable with the first example, where we divided the pixels into a set of \(N\) contiguous blocks, where \(N\) was the number of threads used.
Without any clauses, !$OMP DO or !$OMP PARALLEL DO will attempt to distribute the work as evenly as possible across threads. The scheduling can also be controlled explicitly using the SCHEDULE clause:

!$OMP DO SCHEDULE(STATIC[,n] | DYNAMIC[,n] | GUIDED[,n] | AUTO | RUNTIME)
   ...
Here \(n\) is an integer chunk size to be passed to the scheduling algorithm. For the STATIC schedule, if \(n\) is specified, the iterations are divided into chunks of size \(n\) and distributed cyclically to the threads; this is called a block cyclic schedule. If a chunk size is not specified, one chunk per thread is created, of approximately equal size; this is a block schedule.
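As a small illustration, consider a hypothetical loop of 8 iterations on 2 threads (the fragment assumes use omp_lib for omp_get_thread_num). The three static variants assign iterations as follows:

```fortran
! Block:        SCHEDULE(STATIC)      thread 0: 1-4       thread 1: 5-8
! Cyclic:       SCHEDULE(STATIC, 1)   thread 0: 1,3,5,7   thread 1: 2,4,6,8
! Block cyclic: SCHEDULE(STATIC, 2)   thread 0: 1,2,5,6   thread 1: 3,4,7,8
!$OMP PARALLEL DO SCHEDULE(STATIC, 2)
do i = 1, 8
   print *, "iteration", i, "ran on thread", omp_get_thread_num()
end do
!$OMP END PARALLEL DO
```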
In the manual work distribution in the Mandelbrot set exercise, the first, poorer way we chose of distributing the work was equivalent to SCHEDULE(STATIC), and the second, more successful attempt was equivalent to SCHEDULE(STATIC, 1). Looking at the run times, we can see this:
The figure shows speed-ups relative to the serial version (higher is better) as a function of chunk size, for different schedules. The manual block schedule is the green curve, and the manual cyclic schedule is also shown. Using the !$OMP DO SCHEDULE(STATIC, n) directive instead gives the red curve. For \(n = 1\) the performance is similar to the manual cyclic schedule, and for \(n = 525\) it is similar to the manual block schedule (the total number of pixels in the problem was 2100, and four threads were used).
With the dynamic schedule, the problem is broken up into chunks of size \(n\) as before, but instead of being cyclically allocated to threads, the chunks are allocated dynamically; i.e. on a first-come, first-served basis. Once a thread finishes its chunk, it is assigned the next chunk in the list. If no chunk size is specified, it defaults to one.
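Applied to the Mandelbrot loop, a dynamic schedule with a chunk size of, say, 4 (an arbitrary choice for illustration) looks like this:

```fortran
! Chunks of 4 iterations are handed out on a first-come, first-served
! basis; a thread that finishes early simply takes the next free chunk.
!$OMP DO SCHEDULE(DYNAMIC, 4)
do ipix = 1_ik, npix
   ! ... loop body as before ...
enddo
!$OMP END DO
```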
As might be expected, this schedule improves the overall performance slightly. Not only this, but the performance is less sensitive to chunk size.
Here we've also added single points for both the static and dynamic schedules, to indicate the performance when no chunk size is specified.
The GUIDED schedule is like DYNAMIC, except that the chunk size starts off large and gets exponentially smaller as the loop proceeds. The size of the next chunk is proportional to the number of remaining iterations divided by the number of threads. In this scheme \(n\) defines the minimum chunk size; when this is not specified, the default is one.
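To make the shrinking chunk sizes concrete, here is a rough sketch; the exact sizes are implementation-dependent, so the figures in the comments are only indicative:

```fortran
! With 1000 remaining iterations and 4 threads, successive chunks are
! roughly remaining/nthreads: ~250, then ~188 (750/4), then ~141, ...
! shrinking until the minimum chunk size (here 8) is reached.
!$OMP DO SCHEDULE(GUIDED, 8)
do ipix = 1_ik, npix
   ! ... loop body as before ...
enddo
!$OMP END DO
```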
The GUIDED schedule performs poorly for this problem at this scale; better
than serial, but consistently worse than all the other ways of dividing the
problem considered so far.
The AUTO schedule allows the runtime complete freedom to choose the assignment of loop iterations to threads. This is useful if a loop is executed many times, as the runtime can then evolve a good schedule with high performance and low overheads.
The RUNTIME schedule causes the schedule to be determined at run time, from the OMP_SCHEDULE environment variable, e.g.:
export OMP_SCHEDULE="dynamic, 8"
The following rules of thumb will help when choosing a schedule:

STATIC has the lowest overhead, but is only appropriate for load-balanced loops.

STATIC, n can be an improvement for mildly load-imbalanced loops; the smaller chunks can help mix up the more and less costly iterations so they are spread more evenly across threads.

DYNAMIC is often better if the load imbalance is severe. However, care must be taken with data locality.

GUIDED can be an improvement over DYNAMIC, but care should be taken with loops where the earlier iterations are more costly. In such cases, since the loop by definition can be performed in any order, it might be possible to re-order it to begin with the less costly iterations.

AUTO might be useful if the loop is executed many times over.

RUNTIME should only be used for interactive experimentation.
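The re-ordering tip for GUIDED can be sketched as follows. This is a hypothetical fragment: do_work stands in for the loop body, and we assume the iterations are independent and that cost decreases with the loop index:

```fortran
! Hypothetical: iteration 1 is the most costly and cost falls with i.
! Running the loop in reverse means GUIDED's large initial chunks
! contain the cheaper iterations.
!$OMP DO SCHEDULE(GUIDED)
do i = n, 1, -1
   call do_work(i)   ! do_work is a placeholder for the loop body
end do
!$OMP END DO
```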
These notes are built on the "Hands-on Introduction to OpenMP" tutorial given at the UK OpenMP Users' Conference.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This means you are free to copy and redistribute the material and adapt and build on the material under the following terms: you must give appropriate credit, provide a link to the license and indicate if changes were made. If you adapt or build on the material you must distribute your work under the same license as the original.