Using OpenMP to Parallelize a Program

 
By Neelakanth Nadgir, June 2001  
Abstract

OpenMP is an emerging standard for parallelizing programs in a shared memory environment. It provides a set of pragmas that let programmers parallelize their code easily. This article provides a brief introduction to OpenMP and gives a few tips on using it to parallelize your program.

This information is of particular interest to programmers who are new to OpenMP and who have minimal parallel programming experience.

Introduction

OpenMP is a set of standards and interfaces for parallelizing programs in a shared memory environment. OpenMP provides a set of pragmas which, when used in a program, instruct an OpenMP-capable compiler to parallelize it. No other source code modifications are necessary. (However, you will typically modify your sources to get the maximum performance.) OpenMP pragmas give you an elegant and uniform interface for parallelizing programs on various architectures and systems. OpenMP is a widely accepted standard, and vendors like Sun, KAI, and SGI support it. The latest version of the OpenMP spec is 2.0[1]. OpenMP specs are currently available for the Fortran, C, and C++ programming languages. OpenMP takes parallel programming to the next level by creating and synchronizing threads for you. All you need to do is insert the appropriate pragmas in the source program, and then build the program with a compiler that supports OpenMP. The compiler interprets these pragmas and parallelizes the code following each pragma. Compilers that are not OpenMP-aware silently ignore the pragmas.

OpenMP Pragmas

The OpenMP specification defines a set of pragmas. These pragmas are compiler directives that tell the compiler how to process the block of code that follows the pragma. The most basic pragma is #pragma omp parallel, which denotes a parallel region. The main thread of execution is called the master thread. When the master thread encounters the parallel pragma, it creates a team of worker threads, and the work is then distributed among the workers and the master thread. The environment variable OMP_NUM_THREADS controls the number of threads in the team (the master thread included). At the end of the parallel region, all threads wait for each other (as they would at a barrier pragma), and the program continues executing sequentially with the master thread.
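
As a minimal sketch of a parallel region, the following function counts how many threads execute the region. The name count_team_threads is hypothetical and is not part of OpenMP. When built without OpenMP support, the pragmas are ignored and the function returns 1.

```c
/* Hypothetical helper: returns the number of threads that ran the region. */
int count_team_threads(void)
{
    int count = 0;   /* shared by every thread in the team */

    #pragma omp parallel
    {
        /* Each thread in the team executes this block once. */
        #pragma omp critical
        count++;     /* critical: one thread at a time updates count */
    }
    /* Implicit barrier: all threads have finished before we get here. */
    return count;
}
```

With an OpenMP compiler and OMP_NUM_THREADS set to 4, the function would typically return 4.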

OpenMP supports two basic kinds of parallelism: loops and sections. The #pragma omp for pragma is used for loops. Sections are blocks of code that can be executed in parallel; each block is marked with #pragma omp section inside an enclosing #pragma omp sections construct. These pragmas can be used in a nested fashion. The combined forms #pragma omp parallel for and #pragma omp parallel sections open a parallel region and divide the work in a single step.
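
The sections form can be sketched as follows, assuming two independent tasks over the same input array. The function name sum_and_max and its parameters are hypothetical, and the sketch assumes n is at least 1.

```c
/* Hypothetical helper: computes the sum and the maximum of a[0..n-1].
 * The two tasks are independent, so each can run in its own section. */
void sum_and_max(const int *a, int n, long *sum_out, int *max_out)
{
    long sum = 0;     /* written only by the first section */
    int max = a[0];   /* written only by the second section */

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < n; i++)
                sum += a[i];
        }
        #pragma omp section
        {
            for (int i = 1; i < n; i++)
                if (a[i] > max)
                    max = a[i];
        }
    }
    /* Implicit barrier: both sections are complete here. */
    *sum_out = sum;
    *max_out = max;
}
```

Each shared variable is written by exactly one section, so no critical section is needed in this sketch.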

The #pragma omp master pragma tells the compiler that the following block of code is to be executed by the master thread only. The #pragma omp barrier pragma makes all threads wait for each other. There is an implicit barrier at the end of a parallel region. The #pragma omp single pragma indicates that only one thread should execute the following block of code; this thread need not be the master thread. You can protect blocks of code that are not thread-safe by using the #pragma omp critical pragma. Of course, all of these make sense only in the context of a parallel pragma (parallel region).
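
A minimal sketch of how master, barrier, and critical might be combined inside one parallel region is shown below. The function name count_after_setup and the variables ready and visits are hypothetical.

```c
/* Hypothetical helper: the master thread performs a setup step, every
 * thread waits for it, and then each thread records one visit. */
int count_after_setup(void)
{
    int ready = 0;    /* shared: set by the master thread only */
    int visits = 0;   /* shared: updated by every thread */

    #pragma omp parallel
    {
        #pragma omp master
        ready = 1;

        /* master has no implied barrier, so wait explicitly until the
         * setup step is visible to all threads */
        #pragma omp barrier

        /* only one thread at a time may update visits */
        #pragma omp critical
        visits += ready;
    }
    return visits;
}
```

The return value equals the number of threads in the team, or 1 when the pragmas are ignored.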

Using a simple matrix multiplication program, you can see how to use OpenMP to parallelize a program. Consider the following small code fragment that multiplies two matrices. This is a very simple example; if you really want a good matrix multiply routine, you will have to consider cache effects or use a better algorithm (Strassen's, or Coppersmith and Winograd's, etc.).

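/* a is nrows x nrows, b is nrows x ncols, and the product c is nrows x ncols */
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    c[ii][jj] = 0;
    for (kk = 0; kk < nrows; kk++){
       c[ii][jj] += a[ii][kk] * b[kk][jj];
    }
  }
}

Parallelizing the above code segment is straightforward: insert the #pragma omp parallel for pragma before the first loop. It is beneficial to apply the pragma to the outermost loop, since that gives each thread the most work per parallel region and usually the largest performance gain. There are no dependencies between iterations of the outer loop, because each iteration writes a different row of c. The matrices can stay shared, which is the default. The loop variable ii of the parallelized loop is made private automatically, but the inner loop counters jj and kk must be declared private; otherwise all threads would update the same two counters. The preceding code now becomes:

#pragma omp parallel for private(jj, kk)
for (ii = 0; ii < nrows; ii++){
  for (jj = 0; jj < ncols; jj++){
    c[ii][jj] = 0;
    for (kk = 0; kk < nrows; kk++){
       c[ii][jj] += a[ii][kk] * b[kk][jj];
    }
  }
}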
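
A self-contained version of the parallel matrix multiply might look like the sketch below. It assumes separate input and output matrices stored row-major in flat arrays; the function name matmul and the parameter ninner (the shared inner dimension) are hypothetical.

```c
/* Hypothetical helper: c = a * b, all matrices stored row-major.
 * a is nrows x ninner, b is ninner x ncols, c is nrows x ncols. */
void matmul(const double *a, const double *b, double *c,
            int nrows, int ninner, int ncols)
{
    int ii, jj, kk;

    /* Outer-loop iterations write disjoint rows of c, so they are
     * independent. ii is private automatically; jj and kk are listed. */
    #pragma omp parallel for private(jj, kk)
    for (ii = 0; ii < nrows; ii++) {
        for (jj = 0; jj < ncols; jj++) {
            double acc = 0.0;   /* declared inside the loop, so private */
            for (kk = 0; kk < ninner; kk++) {
                acc += a[ii * ninner + kk] * b[kk * ncols + jj];
            }
            c[ii * ncols + jj] = acc;
        }
    }
}
```

Accumulating into a local variable and storing once per element also avoids repeated writes to c inside the innermost loop.
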

Another example: consider the following code fragment, which finds the sum of f(x) for 0 <= x < n.
        
for(ii = 0; ii < n; ii++){
   sum = sum + some_complex_long_function(a[ii]);
}

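To parallelize the above fragment, a first step could be the following, where value is a per-thread temporary (and so must be private) and the update of the shared sum is protected by a critical section:

#pragma omp parallel for shared(sum) private(value)
for(ii = 0; ii < n; ii++){
   value = some_complex_long_function(a[ii]);
   /* only one thread at a time may update the shared sum */
   #pragma omp critical
   sum = sum + value;
}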

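Or better, you can use the reduction clause. Each thread then accumulates into its own private copy of sum, and the copies are combined when the loop ends, so no critical section is needed. Note that sum must appear only in the reduction clause; listing it in a private clause as well is an error.

#pragma omp parallel for reduction(+: sum)
for(ii = 0; ii < n; ii++){
   sum = sum + some_complex_long_function(a[ii]);
}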
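
Wrapped in a function, the reduction version can be sketched as below. The body of the expensive function is a hypothetical stand-in (it simply squares its argument), and the name sum_of_f is likewise an assumption made for illustration.

```c
/* Hypothetical stand-in for the expensive function in the text. */
static double some_complex_long_function(double x)
{
    return x * x;
}

/* Hypothetical helper: returns the sum of f(a[ii]) for 0 <= ii < n. */
double sum_of_f(const double *a, int n)
{
    double sum = 0.0;
    int ii;

    /* reduction(+: sum): each thread sums into its own copy, which starts
     * at 0, and the copies are added into sum when the loop ends. */
    #pragma omp parallel for reduction(+: sum)
    for (ii = 0; ii < n; ii++) {
        sum = sum + some_complex_long_function(a[ii]);
    }
    return sum;
}
```

Because the per-thread partial sums are combined in an unspecified order, floating-point results may differ in the last digits from a serial run.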

OpenMP provides a few runtime environment variables that can be used to control the behavior of an OpenMP program. The most important and widely used variable is OMP_NUM_THREADS, which determines the number of threads used when the master thread encounters a parallel region. The general rule is to make the number of threads equal to the number of processors in the system.
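
A program can check the thread count it will get by calling the OpenMP runtime routine omp_get_max_threads. The sketch below guards the call with the standard _OPENMP macro so that it also builds with a compiler that ignores the pragmas; the wrapper name available_threads is hypothetical.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: upper bound on the size of the thread team that
 * the next parallel region would use, or 1 when built without OpenMP. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Setting OMP_NUM_THREADS in the environment before starting the program changes the value reported in an OpenMP build.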

How to Begin

There are several ways to parallelize programs. First, determine whether you need parallelization at all. Sometimes parallelization requires big machines, and some algorithms are not suitable for parallelizing. If you are starting a new project, you could choose an algorithm that can be parallelized. It is very important to be sure that the code is correct (serially) before trying to parallelize it. Be sure to record timings of your serial run, so that you can decide whether parallelization is useful.
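
One way to keep such timings is sketched below. The helper names wall_seconds and speedup are hypothetical. The sketch uses the OpenMP routine omp_get_wtime when it is available and falls back to the standard clock function otherwise; clock reports processor time, which is close to elapsed time only for a serial run.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: a timestamp in seconds. Subtract two calls to get
 * the duration of the code between them. */
double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();                     /* elapsed wall-clock time */
#else
    return (double)clock() / CLOCKS_PER_SEC;    /* processor time */
#endif
}

/* Hypothetical helper: speedup of a parallel run over the serial run. */
double speedup(double serial_seconds, double parallel_seconds)
{
    return serial_seconds / parallel_seconds;
}
```

A speedup well below the number of threads suggests that overhead or serial sections dominate the run.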

Compile the serial version with several optimization options. The compiler can generally perform more lower-level optimizations than you can. Try using the automatic parallelization options of the compiler. Delegating parallelization to the compiler makes it easier for you to maintain a common source code base. The autoparallelizer can also help you identify pieces of code that can be parallelized, or point out things in the code that could prevent parallelization (for example, a function call inside a for loop). You can see this commentary by compiling your program with the -g flag and using the er_src utility (a part of Forte Developer 6 Update 2).

er_src program functionname displays the program listing with indented compiler commentary.

Identify bottlenecks in the program using a profiling tool, such as Forte Performance Analyzer or Rational Quantify. This should help you identify the routines (hot routines) where most of the time is spent. It is important that this is user CPU time and not system time, since system time may be inherently sequential (for example, two threads trying to read the same disk segment).

Once you have identified the hot routines, study them to find the loops that do much of the computation. Try using the -xautopar option of the Forte C compiler to identify loops that the compiler thinks can be parallelized. Identify shared and private variables by studying the inter-loop dependencies, and then parallelize the loops using OpenMP pragmas. If you are lucky, they should work fine. If not, try setting OMP_NUM_THREADS to 1 and see if the correct results are generated; results that are correct with one thread but wrong with several usually point to a variable that should have been private or to an unprotected update of shared data. You can also use dbx's runtime checking or tools like AssureView to find bugs in the program.
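
The difference between a loop that can be parallelized and one that cannot might be sketched as follows. Both function names are hypothetical. The first loop has independent iterations and needs only its scratch variable made private; the second has a loop-carried dependency and is left serial.

```c
/* Independent iterations: out[i] depends only on in[i]. */
void scale_and_shift(const double *in, double *out, int n, double factor)
{
    int i;
    double tmp;   /* scratch value: must be private, one copy per thread */

    #pragma omp parallel for private(tmp)
    for (i = 0; i < n; i++) {
        tmp = in[i] * factor;
        out[i] = tmp + 1.0;
    }
}

/* Loop-carried dependency: a[i] needs the a[i-1] computed by the previous
 * iteration, so this loop is left serial. */
void prefix_sum(double *a, int n)
{
    int i;

    for (i = 1; i < n; i++) {
        a[i] = a[i] + a[i - 1];
    }
}
```

If tmp were left shared in the first function, threads could overwrite each other's value between the two statements, which is the kind of bug that disappears when OMP_NUM_THREADS is set to 1.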

OpenMP and MPI

MPI (Message Passing Interface[2]) is another specification for parallelizing programs. Unlike OpenMP, MPI spawns multiple processes that then communicate by passing messages (over TCP/IP, for example). Since these processes do not share the same address space, they can run on remote machines (or a cluster of machines). It is difficult to say whether OpenMP or MPI is better; both have their advantages and disadvantages. What is more interesting is that OpenMP can be used with MPI. The ideal situation would be to use MPI to distribute work coarsely among several SMP machines, and then use OpenMP to parallelize at a finer level within each machine. For more information on using mixed mode OpenMP, see MPI NPACI and Mixed Mode MPI/OpenMP Programming.

Tools for Using OpenMP

The vendors supporting OpenMP on SPARC and Solaris products include Sun, KAI (KAP/Pro tool set), and OMNI OpenMP (an open source OpenMP compiler). KAI supports C, C++, and Fortran compilers. The NAS Parallel Benchmarks are popular for measuring the performance of OpenMP compilers.

For more information on C, C++, and Fortran support for Sun compilers, please see: Sun ONE Studio 7, Compiler Collection (formerly Forte Compiler Collection 7).

You can profile your OpenMP programs using Forte Performance Analyzer. It can also be used to profile programs that do not use OpenMP.

Resources

  • OpenMP spec, plus some sample Fortran programs and links to other OpenMP vendors.
  • OpenMP support in Forte Developer Fortran or C User's Guide.
  • KAI's OpenMP solution.
  • OpenMP FAQ

References/Bibliography/Footnotes

  1. The Fortran version of the spec is 2.0. The C and C++ versions of 2.0 are being worked on.
  2. MPI Standard