OpenMP is an emerging standard for parallelizing programs in a shared memory environment. It provides a set of pragmas that let programmers parallelize their code easily. This article gives a brief introduction to OpenMP and a few tips on using it to parallelize your program. It is of particular interest to programmers who are new to OpenMP and have minimal parallel programming experience.
Introduction
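OpenMP is a set of standards and interfaces for parallelizing programs in a shared memory environment. OpenMP provides a set of pragmas, runtime library routines, and environment variables that programmers use to specify parallelism.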
OpenMP supports two basic kinds of parallelism: loops and sections. The #pragma omp for pragma divides the iterations of a loop among threads, and #pragma omp sections encloses a group of #pragma omp section blocks, each of which can be executed in parallel with the others. These pragmas can be nested, and parallel for and sections pragmas can be combined. The #pragma omp master pragma instructs the compiler that the following block of code is to be executed by the master thread only. The #pragma omp barrier pragma makes all threads wait for each other. There is an implicit barrier at the end of a parallel region. The #pragma omp single pragma indicates that only one thread should execute the following block of code; this thread is not necessarily the master thread. You can protect blocks of code that are not thread-safe with the #pragma omp critical pragma. All of these make sense only inside a parallel region, that is, within the scope of a #pragma omp parallel pragma.
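As a minimal sketch of how these pragmas fit together inside one parallel region, consider the function below. The function name demo_constructs, the fill values, and the scale factor are hypothetical and chosen only for illustration. Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.

```c
#include <stdio.h>

/* Hypothetical demo: fills data[0..n-1] in two parallel sections, then
   returns scale * (sum of all elements), accumulated in a critical section. */
int demo_constructs(int *data, int n)
{
    int scale = 0;   /* declared before the region, so shared by default */
    int total = 0;   /* shared as well; updates must be protected */

    #pragma omp parallel
    {
        /* Only the master thread runs this. master has no implied barrier... */
        #pragma omp master
        scale = 3;

        /* ...so wait explicitly until scale is set before any thread reads it. */
        #pragma omp barrier

        /* Each section runs once, on whichever thread picks it up. */
        #pragma omp sections
        {
            #pragma omp section
            {
                for (int i = 0; i < n / 2; i++) data[i] = 1;
            }
            #pragma omp section
            {
                for (int i = n / 2; i < n; i++) data[i] = 2;
            }
        }   /* implicit barrier: both halves are filled past this point */

        /* Exactly one thread (not necessarily the master) prints. */
        #pragma omp single
        printf("initialized %d elements\n", n);

        /* The loop iterations are divided among the threads. */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int scaled = data[i] * scale;   /* private: declared inside the region */
            #pragma omp critical
            total += scaled;                /* one thread at a time */
        }
    }
    return total;
}
```

The explicit barrier is needed only because master, unlike single, sections, and for, does not end with an implicit barrier.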
Using a simple matrix multiplication program you can see how to use OpenMP to parallelize a loop nest:
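/* c = a * b */
for (ii = 0; ii < nrows; ii++) {
    for (jj = 0; jj < ncols; jj++) {
        c[ii][jj] = 0;
        for (kk = 0; kk < nrows; kk++) {
            c[ii][jj] += a[ii][kk] * b[kk][jj];
        }
    }
}
Parallelizing the above code segment is straightforward: insert the #pragma omp parallel for pragma before the first loop. It is best to apply the pragma to the outermost loop, since that gives each thread the largest piece of work and pays the thread startup and synchronization overhead only once. Each iteration of the outer loop writes a different row of c, so there are no dependencies between iterations. The outer loop index ii is made private automatically, but in C the inner loop indices jj and kk are shared by default when they are declared outside the parallel region, so they must be declared private. The preceding code now becomes:
/* c = a * b, rows of c computed in parallel */
#pragma omp parallel for private(jj, kk)
for (ii = 0; ii < nrows; ii++) {
    for (jj = 0; jj < ncols; jj++) {
        c[ii][jj] = 0;
        for (kk = 0; kk < nrows; kk++) {
            c[ii][jj] += a[ii][kk] * b[kk][jj];
        }
    }
}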
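One way to gain confidence in a parallelized loop nest like this is to compare its output against a serial run. The sketch below assumes square matrices of a hypothetical fixed size N, separate input matrices a and b, and the illustrative function names matmul_serial, matmul_parallel, and matmul_results_match. Because the inner loop indices are declared outside the loop, the sketch lists them in a private clause.

```c
#define N 32   /* hypothetical matrix size for this sketch */

static double a[N][N], b[N][N];

/* Serial reference: c = a * b */
static void matmul_serial(double c[N][N])
{
    int ii, jj, kk;
    for (ii = 0; ii < N; ii++) {
        for (jj = 0; jj < N; jj++) {
            c[ii][jj] = 0.0;
            for (kk = 0; kk < N; kk++) {
                c[ii][jj] += a[ii][kk] * b[kk][jj];
            }
        }
    }
}

/* Parallel version: each thread computes a disjoint set of rows of c. */
static void matmul_parallel(double c[N][N])
{
    int ii, jj, kk;
    /* ii is private automatically; jj and kk must be listed explicitly. */
    #pragma omp parallel for private(jj, kk)
    for (ii = 0; ii < N; ii++) {
        for (jj = 0; jj < N; jj++) {
            c[ii][jj] = 0.0;
            for (kk = 0; kk < N; kk++) {
                c[ii][jj] += a[ii][kk] * b[kk][jj];
            }
        }
    }
}

/* Returns 1 if the serial and parallel products agree element by element. */
int matmul_results_match(void)
{
    static double c_serial[N][N], c_parallel[N][N];
    int ii, jj;

    for (ii = 0; ii < N; ii++) {
        for (jj = 0; jj < N; jj++) {
            a[ii][jj] = ii + jj;
            b[ii][jj] = ii - jj;
        }
    }
    matmul_serial(c_serial);
    matmul_parallel(c_parallel);

    for (ii = 0; ii < N; ii++) {
        for (jj = 0; jj < N; jj++) {
            if (c_serial[ii][jj] != c_parallel[ii][jj]) return 0;
        }
    }
    return 1;
}
```

Each element of c is accumulated by a single thread in the same order as in the serial loop, so an exact comparison is reasonable here.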
Another example
Consider the following code fragment that finds the sum of f(a[i]) for 0 <= i < n.
for (ii = 0; ii < n; ii++) {
    sum = sum + some_complex_long_fuction(a[ii]);
}
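To parallelize the above fragment, the first step could be to compute each value in parallel and protect the update of the shared sum with a critical section. The temporary value must be private to each thread:
#pragma omp parallel for shared(sum) private(value)
for (ii = 0; ii < n; ii++) {
    value = some_complex_long_fuction(a[ii]);
    #pragma omp critical
    sum = sum + value;
}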
Or better, you can use the reduction clause, which gives each thread a private copy of sum and combines the copies when the loop ends:
#pragma omp parallel for reduction(+: sum)
for (ii = 0; ii < n; ii++) {
    sum = sum + some_complex_long_fuction(a[ii]);
}
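The two approaches can be put side by side in a self-contained sketch. Here expensive_f is a hypothetical stand-in for the long-running function, and the names sum_with_critical and sum_with_reduction are chosen for illustration. Integer arithmetic is used so that both versions return exactly the same value regardless of the order in which threads add their contributions.

```c
/* Hypothetical stand-in for an expensive function f(x). */
static long expensive_f(long x)
{
    return x * x;
}

/* Version 1: every update of the shared sum goes through a critical section. */
long sum_with_critical(const long *a, int n)
{
    long sum = 0;
    long value;
    int ii;

    #pragma omp parallel for shared(sum) private(value)
    for (ii = 0; ii < n; ii++) {
        value = expensive_f(a[ii]);   /* computed in parallel */
        #pragma omp critical
        sum = sum + value;            /* serialized update */
    }
    return sum;
}

/* Version 2: each thread accumulates into its own copy of sum,
   and the copies are added together when the loop ends. */
long sum_with_reduction(const long *a, int n)
{
    long sum = 0;
    int ii;

    #pragma omp parallel for reduction(+: sum)
    for (ii = 0; ii < n; ii++) {
        sum = sum + expensive_f(a[ii]);
    }
    return sum;
}
```

The reduction version avoids entering a critical section on every iteration, which is usually why it is preferred.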
OpenMP provides a few runtime environment variables that can be used to control the behaviour of an OpenMP program. The most important and widely used variable is
How to Begin
There are several ways to parallelize programs. First determine whether you need parallelization at all. Sometimes parallelization requires big machines, and some algorithms are not suitable for parallelizing. If you are starting a new project, you could choose an algorithm that can be parallelized. It is very important to be sure that the code is correct when run serially before trying to parallelize it. Be sure to keep timings of your serial run, so that you can decide whether parallelization is useful.
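A small timing helper makes it easier to keep those measurements. The sketch below is one possible approach, not a prescribed method: the names available_threads, wall_seconds, time_run, and sample_work are hypothetical. It uses the standard OpenMP runtime routines omp_get_max_threads and omp_get_wtime when the program is compiled with OpenMP support, and falls back to the C library clock function otherwise, so the same source builds either way.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads a parallel region may use; 1 in a serial build. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Current time in seconds. Note that the serial fallback, clock(),
   measures processor time rather than wall-clock time. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Runs work() once and returns how long it took, in seconds. */
double time_run(void (*work)(void))
{
    double start = wall_seconds();
    work();
    return wall_seconds() - start;
}

/* Hypothetical workload, present only so the helper can be tried out. */
void sample_work(void)
{
    volatile double x = 0.0;
    int i;
    for (i = 0; i < 1000000; i++) {
        x += i * 0.5;
    }
}
```

In an OpenMP build, the value reported by available_threads can be changed without recompiling by setting the standard OMP_NUM_THREADS environment variable, so the same binary can be timed with one thread and with several.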
Compile the serial version with several optimization options. The compiler can generally perform more low-level optimizations than you can. Try using the automatic parallelization options of the compiler. Delegating parallelization to the compiler makes it easier to maintain a common source code base. The autoparallelizer can also help you identify pieces of code that can be parallelized, or point out things in the code that could prevent parallelization (for example, a function call inside a for loop). You can accomplish this by compiling your program with the
Identify bottlenecks in the program using a profiling tool, such as Forte Performance Analyzer or Rational Quantify. This should help you identify the routines (hot routines) where most of the time is spent. It is important that this is user CPU time and not system time, since system time often reflects work that is serialized anyway (for example, two threads trying to read the same disk segment).
Once you have identified the hot routines, study them to find loops that do much of the computation. Try using the
OpenMP and MPI
MPI (Message Passing Interface[2]) is another specification for parallelizing programs. Unlike OpenMP, MPI spawns multiple processes that communicate by passing messages, for example over TCP/IP. Since these processes do not share the same address space, they can run on remote machines (or a cluster of machines). It is difficult to say whether OpenMP or MPI is better; both have their advantages and disadvantages. What is more interesting is that OpenMP can be used together with MPI. The ideal situation would be to use MPI to distribute work coarsely among several SMP machines, and then use OpenMP to parallelize at a finer level within each machine. For more information on using mixed mode
Tools for Using OpenMP
The vendors supporting OpenMP on SPARC and Solaris products include Sun, KAI (KAP/Pro tool set), and OMNI OpenMP (an open source OpenMP compiler). KAI supports C, C++, and Fortran compilers. NAS Parallel benchmarks are popular for measuring the performance of
For more information on C, C++, and Fortran support for Sun compilers, please see: Sun ONE Studio 7, Compiler Collection (formerly Forte Compiler Collection 7).
You can profile your
Resources
References/Bibliography/Footnotes