OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to express multi-threaded shared-memory parallelism in C, C++, and Fortran programs. OpenMP is fast becoming the standard paradigm for parallelizing applications. With a relatively small amount of coding effort, programmers can obtain scalable performance for their applications on a shared-memory multi-processor system.
This paper presents an overview of the OpenMP model of computation, and describes OpenMP support in the Sun Studio compilers and tools. In addition, the paper reports on the performance of the SPEC OMP2001 benchmarks and outlines directions for future work.
OpenMP is an Application Programming Interface (API) that can be used to explicitly specify multi-threaded shared-memory parallelism in C, C++, and Fortran programs. The OpenMP API is composed of three components:
compiler directives, runtime library routines, and environment variables. In C and C++, OpenMP directives are specified using the #pragma mechanism. In Fortran, OpenMP directives are specified using special comments that are identified by unique sentinels (in fixed form source files, the sentinels !$omp, c$omp, and *$omp are recognized).

The OpenMP Specification is the definitive reference on OpenMP and can be found at [1]. The latest Specification is Version 2.5, which is a combined Specification for C, C++, and Fortran. The Specification is owned and managed by the OpenMP Architecture Review Board (ARB) [2], a non-profit organization established in 1997. Membership in the ARB is open to corporations, research organizations, and academic institutions. Sun Microsystems is a member of the OpenMP ARB and plays a prominent role in shaping the future of the API. Sun Microsystems takes an active part in weekly meetings of the language committee of the ARB, where the Specification is discussed and updated. In addition, a Sun Microsystems Distinguished Engineer is on the Board of Directors of the ARB, and a Sun Microsystems engineer serves as the Secretary of the ARB.

The main motivations for using OpenMP are performance, scalability, portability, and standardization. As an application's requirements and data set become larger, more computing power is needed. Higher performance can be achieved by utilizing many processors together to execute a single application. OpenMP provides a widely supported API for programming shared-memory machines. With a relatively small amount of coding effort, users can obtain scalable performance for their applications on these machines.

The Sun Studio application development suite [3] is a comprehensive, integrated set of compilers and tools for the development and deployment of applications on Sun platforms. The Sun Studio compilers support the OpenMP Specification Version 2.5. In addition, the Sun Studio tools support writing, debugging, and analyzing the performance of OpenMP applications.
The underlying machine model for OpenMP is a shared-memory machine, where all the processors access one global memory. Examples of shared-memory machines include the Sun Fire V40z server with up to four dual-core AMD Opteron processors, the Sun Fire V890 server with up to eight dual-core UltraSPARC IV+ processors, and the Sun Fire Enterprise E25K with up to 72 dual-core UltraSPARC IV+ processors. Figure 1 gives a simplified view of a shared-memory machine with n processors.

OpenMP uses the fork-join model of parallel execution. When a thread encounters a parallel construct, the thread creates a team of threads composed of itself and some additional (possibly zero) number of threads. The thread that encounters the parallel construct is called the master thread of the team. The other threads are called slave threads of the team. All team members execute the code inside the parallel construct. When a thread finishes its work within the parallel construct, it waits at an implicit barrier at the end of the construct. When all team members have arrived at the barrier, the master thread alone continues execution of user code beyond the end of the parallel construct. Any number of parallel constructs can be specified in a single program.
OpenMP has a rich set of directives that the user can use to specify parallelism in a program. In this section, we give examples of three OpenMP directives, namely the PARALLEL directive, the DO/for directive, and the SECTIONS directive. More detailed information about these and other OpenMP directives can be found in [1] and [4].

Since OpenMP is based on the shared-memory programming model, variables are shared by default. OpenMP data scope attribute clauses can be used to explicitly define the scope of variables. These clauses include the SHARED, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses.

The number of threads used to execute a parallel construct can be specified by setting the OMP_NUM_THREADS environment variable, by calling the omp_set_num_threads runtime library routine, or by using the NUM_THREADS clause.

Parallel

The PARALLEL directive defines a region of code to be executed in parallel by multiple threads. All the threads participating in the execution of the PARALLEL construct will execute the same region of code. In effect, the region of code is replicated across the threads. When a thread reaches a PARALLEL construct, it creates a team of threads and becomes the master of the team. The master has thread number 0 within the team. The other threads are numbered 1, 2, ..., n-1, where n is the total number of threads in the team. There is an implied barrier at the end of the PARALLEL construct. Only the master thread continues execution past this point. See Appendix A.1 for an example of the PARALLEL construct.

DO/for

The DO/for directive is a work-sharing directive that applies to a DO loop (Fortran) or a for loop (C/C++). The DO/for construct divides the iterations of the DO/for loop among the team of threads that encounters the construct. There is an implied barrier at the end of the DO/for construct, unless a NOWAIT clause is specified. It is the programmer's responsibility to ensure that the iterations of a loop with the DO/for directive have no dependencies. That is, the result of one loop iteration does not depend on the result of any other loop iteration. If this condition holds, then two different loop iterations can be executed in parallel by two different threads. See Appendix A.2 for an example of the DO/for directive.

Sections

The SECTIONS directive is a work-sharing directive that applies to a set of structured blocks of code (each structured block is called a SECTION). The SECTIONS construct divides the blocks of code among the team of threads that encounters the construct. Each block is executed once by a thread in the team. There is an implied barrier at the end of the SECTIONS construct, unless a NOWAIT clause is specified. It is the programmer's responsibility to ensure that the structured blocks of code in the SECTIONS construct are independent of each other and can be executed in parallel by different threads. See Appendix A.3 for an example of the SECTIONS directive.
The -xopenmp compiler option instructs the Sun Studio compilers to recognize OpenMP directives in a program.

OpenMP support in the compilers consists of two parts. First, the compiler processes OpenMP directives and transforms the code so it can be executed by multiple threads. Second, a runtime library provides support for thread management, synchronization, and scheduling of work.

Compiler Support

Figure 2 shows the various stages and components of a compiler. These consist of a language-specific Front-End, an Optimizer, and a machine-dependent Code Generator. The Front-End component of the compiler recognizes OpenMP directives, processes the information associated with them, and then passes that information to the Optimizer. The Optimizer processes the information passed by the Front-End and transforms the code so it can be executed by multiple threads. In transforming the code, the Optimizer inserts calls to the OpenMP runtime support library, libmtsk. Finally, the Code Generator generates target machine code.

When the Optimizer processes a PARALLEL construct in the program, it does the following:

First, the Optimizer analyzes the scopes of variables in the PARALLEL construct. That is, it determines whether a variable accessed in the body of the PARALLEL construct is SHARED, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, etc.

Second, the Optimizer extracts the body of the PARALLEL construct and places it in a separate routine, called an outlined routine. Variables that are SHARED are passed as arguments to the outlined routine so they can be accessed by multiple threads. Variables that are PRIVATE are declared to be local in the outlined routine, so separate copies of these variables are allocated on different thread stacks. Additional code is added to the outlined routine to initialize FIRSTPRIVATE variables, update LASTPRIVATE variables, combine reduction results, etc.

Third, the Optimizer replaces the original PARALLEL construct by a call to the libmtsk library routine, __mt_MasterFunction_. The address of the outlined routine is passed as an argument to __mt_MasterFunction_. When __mt_MasterFunction_ is executed at runtime, it dispatches a team of threads to execute the outlined routine.

The outlining transformation described above has advantages. First, an outlined routine defines a context for parallel execution that can be easily executed by multiple threads. Second, outlining simplifies storage management, since variables that are local to the outlined routine will automatically be allocated on different thread stacks, thus making them local to each thread.

Figures 3(a) and 3(b) illustrate the outlining transformation. The body of the PARALLEL construct in Figure 3(a) is extracted and placed in the outlined routine __mf_par_001. Since variable n is SHARED in the PARALLEL construct, its address is passed as an argument to __mf_par_001, so all threads executing __mf_par_001 would access the same copy. On the other hand, since variable id is PRIVATE, it is declared local to __mf_par_001, so every thread executing __mf_par_001 would have its own copy of the variable on its stack.

Figure 3(b) shows how the PARALLEL construct is replaced by a call to __mt_MasterFunction_, and the address of __mf_par_001 is passed as an argument to __mt_MasterFunction_.

Other constructs, such as the work-sharing DO/for and SECTIONS, are processed by the Optimizer in a similar fashion. The Optimizer, however, replaces a work-sharing construct by a call to the libmtsk library routine __mt_WorkSharing_.
Automatic Parallelization

Besides OpenMP directive-based parallelization, the Optimizer can also automatically parallelize loops in a program. When a program is compiled with the -xautopar option, the Optimizer examines all loops in the program and uses data-flow analysis to determine which loops have iterations that can be executed independently of each other. The Optimizer then transforms these loops in a fashion similar to that described above for OpenMP.

Runtime Library Support

The OpenMP runtime support library, libmtsk, provides support for thread management, synchronization, and scheduling of work. The library is implemented on top of the POSIX threads library (libpthread).

As described above, the Optimizer replaces the code for a PARALLEL construct by a call to __mt_MasterFunction_. When a thread calls __mt_MasterFunction_, it creates a team of threads to execute the PARALLEL construct and it becomes the master thread of the team. Then the master thread dispatches the slave threads to work on the outlined routine. The master thread itself also takes part in executing the outlined routine. When finished, the master thread synchronizes with other threads in the team via a call to the barrier routine __mt_EndOfTask_Barrier_. The general logic of __mt_MasterFunction_ is shown in Figure 4.
The runtime library, libmtsk, maintains a pool of threads that can be used as slave threads for PARALLEL constructs. The threads in the pool are created via calls to the POSIX threads library routine pthread_create. When a master thread needs to create a team of more than one thread, the master thread checks the pool and grabs idle threads from the pool, making them slave threads of the team. When the team finishes executing the PARALLEL region, the slave threads are returned to the pool.

Throughout its lifetime, a slave thread executes the runtime routine slave_startup_function, where it alternates between waiting for the next PARALLEL task and executing a PARALLEL task. While waiting for a PARALLEL task, a slave thread may be spinning or sleeping. This behavior can be controlled by setting the environment variable SUNW_MP_THR_IDLE. When a thread finishes working on a task, it synchronizes with the master thread and other threads in the team via a call to the barrier routine __mt_EndOfTask_Barrier_. The general logic of slave_startup_function is shown in Figure 5.

OpenMP allows PARALLEL regions to be nested inside each other. The runtime library, libmtsk, supports nested parallelism. If nested parallelism is enabled by setting the environment variable OMP_NESTED or by calling omp_set_nested, then a nested PARALLEL region can be executed by a team that consists of more than
one thread.

In addition, the runtime library, libmtsk, supports multiple user threads. If a user program is threaded via explicit calls to the POSIX threads library (libpthread) or the Solaris OS threads library (libthread), then libmtsk will treat each of the user program threads as a master thread, and provide it with its own team of slave threads.

Tools Support

The Sun Studio application development suite provides a variety of tools that facilitate and support OpenMP programming. These include tools that aid the programmer in parallelizing a program using OpenMP, as well as tools for checking, debugging, and analyzing the performance of OpenMP programs. Some of these tools are described below.

Automatic Scoping of Variables

The process of manually specifying scopes of variables when writing an OpenMP program is both tedious and error-prone. To improve productivity, an autoscoping feature was implemented in the Sun Studio compilers, as a Sun-specific extension to OpenMP. The Sun Studio compilers are currently the only commercially available compilers that support this feature.

The autoscoping feature leverages the analysis capability of the Optimizer to determine the appropriate scopes of variables. The programmer specifies which variables in a given PARALLEL construct should be scoped automatically by the Optimizer. The Optimizer determines the appropriate scopes of these variables by analyzing the program and applying a set of autoscoping rules. The scoping results are displayed in an annotated source code listing as compiler commentary. This automatic scoping feature offers a very attractive compromise between automatic and manual parallelization. For additional information on autoscoping, refer to [7].

Static Error Checking

Under the control of the compiler option -vpara (Fortran) or -xvpara (C), the compiler can check a program for a variety of static errors. These include invalid nesting of OpenMP constructs, invalid scoping of variables, data races, etc.

In addition, under the control of the compiler option -XlistMP, the Fortran compiler can perform global (inter-procedural) analysis of the program and report inconsistencies and possible runtime problems in the code. Problems reported include invalid use of OpenMP directives, errors in alignment, disagreement in the number or type of procedure arguments, etc.

Runtime Error Checking

If the environment variable SUNW_MP_WARN is set to TRUE, then the runtime library libmtsk checks the program for a variety of runtime errors. Problems reported include semantic errors that violate the OpenMP Specification, invalid nesting of OpenMP constructs, inconsistencies in the use of OpenMP directives, invalid chunk sizes, deadlock at barriers, etc.

OpenMP Debugging

The dbx tool in the Sun Studio software can be used to debug C, C++, and Fortran OpenMP programs. An OpenMP program should first be prepared for debugging with dbx by compiling it with the options -xopenmp=noopt -g. All of the dbx commands that operate on threads can be used for OpenMP debugging. dbx allows the user to single-step into a PARALLEL region, set breakpoints in the body of an OpenMP construct, and print the values of SHARED, PRIVATE, THREADPRIVATE, etc., variables for a given thread.

Performance Analysis

The Collector and Performance Analyzer [8] are a pair of tools in Sun Studio that can be used to collect and analyze performance data for an application. Both tools can be used from the command line or from a graphical user interface.

The Collector tool collects performance data using a statistical method called profiling and by tracing function calls. The data can include call-stacks, microstate accounting information, thread synchronization delay data, hardware counter overflow data, memory allocation data, and summary information for the operating system and the process.

The Performance Analyzer processes the data recorded by the Collector, and displays various metrics of performance at program, function, caller-callee, source-line, and assembly instruction levels. The Performance Analyzer can also display the raw data in a graphical format as a function of time.

The Performance Analyzer can present the performance of an OpenMP program in either of two modes: User mode and Machine mode.
In User mode, the Performance Analyzer presents profile data in a manner that matches the user's intuitive understanding of the program. In this mode, the master thread and slave thread call-stacks are reconciled, and artificial functions are constructed.

In Machine mode, the Performance Analyzer presents the call-stacks as measured, with no transformations done and no artificial functions constructed, thus exposing the implementation details of the runtime library, libmtsk.

Compiler Commentary

Compiler commentary in annotated source code listings informs the user about the various optimizations and transformations that have been applied to the source code by the compiler. To generate compiler commentary, the program should be compiled with -g. The compiler commentary can be viewed in an annotated source code listing by using the Performance Analyzer or by running the command-line utility er_src.
SPEC OMP2001 is a software benchmark produced by the High-Performance Group (HPG) of the Standard Performance Evaluation Corporation (SPEC). The benchmark is designed to evaluate the performance of real scientific and engineering applications parallelized using OpenMP, and is representative of high performance technical computing applications from the areas of chemistry, mechanical engineering, climate modeling, and physics (see Table 1). The SPEC OMP2001 benchmark contains two suites. The first suite, SPEC OMPM2001, uses a medium-sized data set and is designed to measure the performance of shared-memory systems with between four and 32 processors. The second suite, SPEC OMPL2001, uses a large-sized data set and is designed to measure the performance of systems with a larger number of processors.
Table 1: Applications in the SPEC OMP2001 Benchmark
The SPEC OMP2001 benchmark exhibits superior scaling and performance on Sun systems. Figure 6 shows the scaling of the SPEC OMPL2001 suite (base performance) on a Sun Fire 6800 configured with 1.2 GHz UltraSPARC III Cu processors. Figure 7 shows the scaling of the SPEC OMPL2001 suite (base performance) on a Sun Fire 15K also configured with 1.2 GHz UltraSPARC III Cu processors. Sun Microsystems has announced several world-record performance results for the SPEC OMP2001 benchmark:
Figure 6: Scaling of OMPL2001 (Base) on Sun Fire 6800

Figure 7: Scaling of OMPL2001 (Base) on Sun Fire 15K
Sun Microsystems continues to invest in delivering world-class, high quality OpenMP support in its compilers and tools. Current and future projects include the following:
1. OpenMP Specification, http://www.openmp.org/drupal/node/view/8
2. OpenMP Architecture Review Board, http://www.openmp.org
3. Sun Studio Software, http://www.sun.com/software/products/studio/index.html
4. Sun Studio 11 OpenMP API User's Guide, http://docs.sun.com/doc/819-3694
5. Sun Fire E25K server, http://www.sun.com/servers/highend/sunfire_e25k/index.xml
6. The SPEC OMP benchmark suite, http://www.spec.org/omp
7. Yuan Lin, Christian Terboven, Dieter an Mey, and Nawal Copty, “Automatic Scoping of Variables in Parallel Regions of an OpenMP Program”, WOMPAT 2004.
8. Sun Studio 11: Performance Analyzer, http://docs.sun.com/app/docs/doc/819-3687
9. Myungho Lee, Brian Whitney, and Nawal Copty, “Performance and Scalability of OpenMP Programs on the Sun Fire E25K Throughput Computing Server”, WOMPAT 2004.
10. SPEC OMP2001 benchmarks results, http://www.spec.org/omp/results
The following is a simple “Hello World” program with a PARALLEL directive. The number of threads to be used is specified via a call to the OpenMP runtime library routine omp_set_num_threads. Dynamic adjustment of the number of threads is disabled by calling the OpenMP runtime library routine omp_set_dynamic.

The initial thread of the program executes sequentially until it reaches the PARALLEL construct. At that point, the initial thread creates a team of 10 threads. The team is composed of the initial thread itself (master of the team) and 9 other threads (slaves of the team). All the threads in the team execute the code enclosed in the PARALLEL construct concurrently. When a thread reaches the end of the PARALLEL construct, it waits at the implicit barrier at the end of the construct. When all the threads have reached the barrier, only the master thread continues executing the code following the PARALLEL construct.

Fortran – PARALLEL Directive Example:
PROGRAM HELLO

C/C++ – PARALLEL Directive Example:
#include <stdio.h>
The following is an example program with a DO/for directive. The initial thread of the program executes sequentially until it reaches the PARALLEL construct. At that point, the initial thread creates a team of 20 threads. The team is composed of the initial thread itself (master of the team) and 19 other threads (slaves of the team). All the threads in the team execute the code enclosed in the PARALLEL construct concurrently. When the threads in the team encounter the DO/for construct, the 100 iterations of the loop are divided among the 20 threads. When the iterations are divided evenly, each thread executes 5 iterations of the loop. The threads execute their iterations concurrently. When a thread completes its work, it waits at the implicit barrier at the end of the DO/for loop. When all threads have reached the barrier, the threads continue executing the PARALLEL region code.

Fortran – DO Directive Example:
PROGRAM VECTOR_ADD

C/C++ – For Directive Example:
#include <stdio.h>
The following is an example program with a SECTIONS directive applied to three sections of code. The initial thread of the program executes sequentially until it reaches the PARALLEL construct. At that point, the initial thread creates a team of 3 threads. The team is composed of the initial thread itself (master of the team) and 2 other threads (slaves of the team). All the threads in the team execute the code enclosed in the PARALLEL construct concurrently. When the threads in the team encounter the SECTIONS construct, the 3 sections are divided among the 3 threads in the team. Each section is executed only once by a thread in the team. When a thread completes its work, it waits at the implicit barrier at the end of the SECTIONS construct. When all threads have reached the barrier, the threads continue executing the PARALLEL region code.

Fortran – SECTIONS Directive Example:
PROGRAM SECTIONS

C/C++ – SECTIONS Directive Example:
#include <stdio.h>

Nawal Copty is a staff engineer in the Scalable Systems Group, and OpenMP project lead.