Sun Java Solaris Communities My SDN Account

Article

Developing Applications For Parallel Computing

 
By Liang T. Chen, Distinguished Engineer and Software Architect, Platform Developer Tools, Sun Microsystems , December 15, 2005  
This article discusses the important parallel application development issues now emerging from the parallel computing technology trend. It introduces and explains some of the most important industry standards such as OpenMP, MPI, and Grid Computing, and describes the current state of parallel application software development. This article highlights several significant design issues from two classes of real application case studies. It also discusses the important developer tools which are critical at the various stages during the software development cycle. In the end, it summarizes the right mindset for developers facing the challenge of developing parallel applications.
1: Introduction

There are two critical forces shaping software development today. One is the popular adoption of Parallel Computing and the other is the trend toward Service Oriented Architecture. Both ideas have existed for quite a while, but the current technology of CMT(Chip Multi-Threading) processor designs, horizontally scaled systems, near zero latency interconnects and new web service standards are all accelerating both ideas into the main stream and are becoming adopted everywhere. It is quite easy to predict that most desktop machines or even laptops will be powered by multi-core or CMT processors over the next few years. A more intriguing and important issue is whether the current state of software development is sufficient to produce good quality parallel applications for the new computing machines.  

Advanced scientific and engineering communities have used parallel computing to solve their large scale complex problems for a long time. But even these advanced developers find it hard to implement their parallel applications effectively. These advanced developers have designed and implemented several interesting programming models to help develop parallel applications. The most popular ones are OpenMP for shared memory programming and MPI(Message Passing Interface) for distributed memory programming. Although there are several other important programming models such as UPC(Unified Parallel C)  and CAF(Co-Array Fortran), we will discuss only OpenMP and MPI. Even though OpenMP and MPI are widely used today, both models have their own intrinsic constraints and still don't meet the rigorous requirements of  many parallel application types.   

This article will also talk about another important technology trend related to parallel computing, Grid computing. Internet technology has steadily improved during the past several years. People are exploring and beginning to transform the internet from passive information sharing to active service sharing and resource sharing. Grid computing is an important technology that makes the computing power circulate like currency so it's utilized in the optimal way.

The case study section will discuss the unique challenges from two classes of real life applications:  parallel search and parallel simulation. These two classes of applications are widely used in various fields of academics and industries. Both classes of real applications exemplify the difficulty of developing parallel applications and point toward one potential direction for the future advancement of parallel software technology.

There is a famous ancient Chinese saying “A craftsman would sharpen his tools before  laboring on his work”. This is very true on software development, especially for complex parallel applications. Many experienced software developers are more afraid of programming faults caused by wild pointer or memory leakage than by bad logic. Parallel application development amplifies significantly the programming difficulty. Race conditions are common and difficult problems in multi-thread programming. Even the logic problem related to load balancing is pretty hard to fix. These tough problems would make a competent programmer feel humble. In the tools section of this article, we will talk about the important tools for the various development stages during the software development cycle. We will also discuss how to maximize the existing tools to reduce the pain of parallel application development.

2: OpenMP

2.1 OpenMP General
OpenMP is an industry standard API design specification developed by joint effort of many top computer manufactures and software developers such as Sun Microsystems, Hewlett-Packard, IBM, and Intel etc,. The reader can find more OpenMP information at the website www.openmp.org. The goal is to offer an common specification that will let the software programmers easily design new parallel applications or modify and parallelize the existing sequential applications to take advantage of shared memory configured multi-processor computing systems. Portability is an important goal of OpenMP. A parallel application source code developed using OpenMP can be compiled by any compiler supporting OpenMP and the compiled binary should run on the target hardware platform to achieve parallel performance gain.

OpenMP is supported on the most popular native programming languages: Fortran and C/C++.  A simple example OpenMP program written in both C/C++ and Fortran is shown in the figure 1.



The loop iterations of adding the y array to the x array will be executed in parallel in this example. OpenMP constructs are expressed by pragmas, directives and programming API calls in the source code. OpenMP  constructs allow the programmers to specify parallel regions, synchronization, and data scope attributes. It also supports run-time environment variables to specify the run-time configuration.  For example, the environment variable  OMP_NUM_THREADS specifies the number of working threads used at run time.

2.2 OpenMP example
Sun Studio Compiler holds several world records including the OpenMP specific SPEC OMPM on 2P,  4P and dual core benchmarks. To play with Sun Studio OpenMP, I implemented a small C++ OpenMP program to look for the prime numbers within a specified range. I ran the program to find all the prime numbers smaller than 3 million on a Sun Fire 6800 machine which has 24 1.2GHz US-III+ processors with 9.8GB main memory and Solaris 5.9 OS.  The result is interesting and shown in table 1.

I spent only few hours in converting my sequential program into the OpenMP version. The parallelization effect is beyond my expectation for such a small and simple program with minimal effort. When this OpenMP program ran with single working thread, its performance dropped 8.65% due to the openMP operation overhead. However the performance got a good gain starting from 2 threads up to 20 threads. Interestingly when the program ran with more than 20 threads, it hit the scalability limitation and suffered the saturation drop. An OpenMP program should be able to scale up much better than 20 threads. One potential explanation for this early saturation is that the programming logic of this program is not complex enough for further parallelization. Indeed there are just few lines of code within the program's OpenMP parallel loop.     

Table 1: An example OpenMP program performance
1
Serial Version
6.636 sec
base
2
OpenMP 1 Thread
7.210 sec
8.65% drop
3
OpenMP 2 Threads
3.771 sec
1.76x faster
4
OpenMP 4 Threads
1.988 sec
3.34x faster
5
OpenMP 8 Threads
1.090 sec
6.09x faster
6
OpenMP 16 Threads
0.638 sec
10.40x faster
7
OpenMP 20 Threads
0.550 sec
12.06x faster
8
OpenMP 24 Threads
0.931 sec
saturation


2.3 OpenMP Constraints
Because OpenMP programming model is specified for single process, its scalability is ultimately restricted by the  number of processors (threads) in a single machine. The scalability is also limited by the programming logic complexity and programming style as explained in the above example. Current OpenMP semantics may not be flexible enough to handle some applications. For example,  current OpenMP specification only allows limited dynamic thread creation in parallel region. The OpenMP language committee, whose members include Sun Microsystems, Intel and many other companies are in the process of specifying a dynamic task queue construct to remove this critical limitation. The other more general problems when programming an OpenMP application are avoiding  the race condition errors from multiple threads and achieving good load balance among the concurrent working threads.     
 
3: MPI

3.1 MPI General
MPI is an industry standard API specifications designed for high performance computing on multi-processor machines and clusters of machines. The standard was designed jointly by a broad groups of computer vendors and software developers. There are a number of  MPI implementations coming from different research institutes and companies. The most popular one is MPICH which is quite often used as the code base for MPI implementations optimized for a specific platform or interconnect. Interested readers can get MPI standard and more MPICH information from the website:  http://www-unix.mcs.anl.gov/mpi.  

MPI offers a distributed memory programming model for parallel applications. Although the entire MPI API set is relative large containing more than 300 routines, many MPI applications can be programmed with less than a dozen of basic routines. Figure 2 shows a pair of send and receive routines to communicate a message.



The first 3 arguments of both routines specify the location, data type and amount of the message. The fourth argument identifies the target process to communicate with. The fifth argument tag ID provides a further mechanism to distinguish between different messages. The sixth argument specifies the communication context. There is an additional argument in the receive routine to report the receiving status.

In MPICH design, the entire API set is implemented and built on top of a small core set of low level device interconnect routines. This design architecture offers good portability across different platforms. One only needs to re-work the small core set of device interconnect routines to port or optimize MPICH for a new platform or interconnect.

The MPI implementations are still evolving. MPI-1 supports the key features such as point to point and collective message communication with a communicator. A message can contain MPI data of primitive data types or derived (user-defined) data types. The message data content can be in packed or unpacked format. It also supports interconnect topology. MPI-2 provides many advanced communication features such as remote memory access and one-side communication. It also supports dynamic process creation/management and parallel IO.

3.2 Memory Hierarchy
At the current state of semiconductor technology and computer architecture, the dominant system  performance  factor is  the memory  hierarchy rather than CPU clock rate. It shows the non-uniform memory performance graph in figure 3.



An application program will run faster, if most of  its instruction and data memory accesses fall within the cache range. Because MPI is a distributed memory programming model, it usually can achieve good linear scalability for a large scale application. When a MPI application is partitioned to run on a large cluster of computing nodes, the memory space of each MPI process is reduced  and the memory accesses could fall within the high performance range of the memory hierarchy. The non-uniform memory performance effect can apply to the other programming models including OpenMP.

3.3 MPI Constraints
In general MPI offers a good solution for parallel application on computer clustering, but it is also a difficult programming model for many developers. Because MPI has longer communication latency, the program core logic must be partitioned well to justify the distribution overhead  It is not an intuitive task to analyze and partition an application problem and map it to a set of distributed processes. . Because of the complexity of interaction among many MPI processes, it is also quite challenging to debug and tune a MPI application running on a large number of nodes even with the right tools. The quality of MPI routine implementation may impose additional software development challenges. MPI performance depends on the underlying hardware platform and interconnect. In some extreme cases, the MPI application  behavior may be affected by heavy interconnect traffic. Another big issue is the reliability of large scale MPI application. For many MPI implementations, a MPI program will stop working whenever a single computing node fails to respond correctly. Unfortunately when an MPI application runs for a long periods on thousands of computing nodes, the failure rate is no longer negligible.

4: Grid Computing

4.1 Grid General
Grid computing is a very important technology trend and part of the parallel computing spectrum. Some European countries think Grid computing is the next wave of the information technology revolution after the internet and invested huge resources to explore Grid technology. However Grid computing is at such an early stage, we really don't know where it will lead to. In the mid 80's when PCs started to have the sound cards, Multi-media became a very popular buzzword and many people pursued the future multi-media technology. It's doubtful that anyone could think of anything like today's Ipod at that time.

Here we discuss only the basic things of what we know about Grid computing today. In a simple definition, Grid computing is the virtualization of  the computing resources with a utility delivery mechanism. Grid computing will shift the internet paradigm from the passive sharing of information to the active sharing of computing resources. Computing power is just one kind of the computing resources and can be circulated like a currency. A Grid computing environment is to provide  an integrated  service environment for resource directory and discovery, security, computing and monitoring.

4.2 DRM
One core component in Grid computing is the Distributed Resource Manager (DRM).  DRM has an master daemon to  schedule and dispatch the submitted tasks in the job queue(s). The master daemon constantly monitors the status of the execution nodes in order to dispatch a task to the most appropriate execution node. The DRM also has an administration module to manage  the system configuration, accounting and resource usage policy. The current DRM market leader is LSF (Load Sharing Facility) from Platform Computing, but LSF requires an expensive license fee. Sun Microsystems offers an full integrated DRM called Sun N1 Grid Engine which doesn't require any license fee. Sun N1 Grid Engine runs most of popular operating systems such as Solaris, Windows, Linux, HP-UX, IBM AIX and Apple Macintosh OS/X etc,. It also supports the standard DRM Application API DRMAA for Java and Native language binding. Interested readers should go to the web page  http://www.sun.com/software/gridware to get more detailed information.   

4.3 Grid Constraints
As mentioned earlier, Grid computing is at very early stage today. There are still no really compelling software infrastructure and tools to enable a robust Grid environment. Globus Toolkit is a good initial effort from Globus ( http://www.globus.org ) to provide enabling software for Grid computing. Globus Toolkit is an open source software with good adoption. It includes a wide category of software components ranging from resource monitoring and discovery to security. However this software system is too complex for many users. Building and deploying the toolkit software over a network environment is not an easy task. Furthermore its design specifications and software are changed quite often and let the existing investment become obsolete.

As you might think, Grid computing is tightly related to SOA. One of Grid computing main goals is to deliver the utility service of computing resources. Indeed the current trend is to merge Grid computing with Web Service. There are many important Web Service infrastructures which Grid computing can leverage and build upon further. One important issue is how to align the different interests from the perspectives of Web Service and Grid to evolve Web Service specifications. Overall the biggest challenge of Grid computing is to create an unified set of industry standards and design specifications accepted by most people and institutions.        

5: A Real Case Study

After the above discussion on shared memory and distributed memory programming models and Grid computing, let's examine two classes of real applications to evaluate the difficulty of parallel application development.  

5.1 Parallel Search
Tree search is a very popular approach to finding the optimal solution or all the enumerated solutions out of a huge number of choices that solve a well formulated problem. Many scientific or daily life problems can be transformed and formulated as a parallel tree search problem. The potential applications range from trivial puzzle solving to complex strategy analysis. One popular approach is to construct a state tree with each node representing a specific application state and search through the state tree to find the target node(s) with the optimal state or reach a target node in a minimum number of steps.

There are two steps to developing software for such a parallel search application. First step is to create a good abstraction model to transform the problem and represent it in a state tree. The quality of the abstraction model will affect the tree shape and tree size, which affects the required searching effort. The second step is to search through the state tree using parallel computing. The first step is easy if the tree quality doesn't matter, but it becomes quite hard and requires good domain expertise to construct a good quality tree for even a modestly complex problem. The second step is to search through the state tree. The step is more difficult than it looks. In most cases, the tree is dynamically constructed during the program run time. Because the tree tends to grow in an un-balanced way, it could lose load balance by assigning a child branch randomly to a working thread or process. A more complex approach is to schedule a node operation in a task queue and distribute the tasks to the working threads or processes. Even this approach would not work when a single tree node operation is too light weight compared to the queuing overhead.

Overall it is difficult to generalize parallel search programming. The challenge starts with model abstraction, then tree construction, and finally the parallel search algorithm. A good tree model needs domain specific expertise to formulate the problem and represent it with the tree node presentation. When the state tree is constructed, it usually requires experimental analysis on the tree patterns to design an efficient parallel searching algorithm. The algorithm usually needs to partition and group the tree nodes in order to get parallel performance.  

5.2 Parallel Simulation
Simulation is an important and critical application in various industries. The applications range from financial Monte Carol simulation, EDA(Electronic Design Automation) logic simulation, life science protein folding to applied physics simulation etc. The biggest challenge of large scale simulation is dealing with huge amount of objects and data. Many applications like EDA logic simulation or life science molecular simulation often handle millions or even billions of modeling objects. It is unrealistic to create millions of  threads or processes to model the simulation objects. Furthermore it takes extensive computation to evaluate the interaction among the neighboring elements and propagate the interaction outcome to the other elements until the effect dies down.

The conventional parallel programming method is inadequate to handle a large scale simulation of modest complexity. The solution is to sacrifice a little accuracy by taking a mixed level modeling approach to work around the extensive computation and communication problem. It usually applies a large grain modeling in the global scope to reduce the amount of objects or to approximate the interaction effect in order to simulate a large scale of real things. Fine grain modeling can be applied to the hot spot regions for more accurate simulation. In some applications, the simulation developers may even take extra effort to build an application specific hardware accelerator to solve their problems.  The hardware accelerator is just another way to implement the parallel computing algorithm.       

6: Tools for the Developer

Many software developers think Sun Studio provides the best integrated set of developer tools for parallel application development. Recently, starting with Sun Studio 11, release on November 14, 2005, Sun Studio tools have become free of charge. Therefore in this section we will focus on Sun Studio tools when discussing how the developers can leverage the existing tools to develop their parallel applications. Interested readers can find Sun Studio product information directly at http://www.sun.com/software/products/studio/index.xml. In the following sub-sections, we will discuss the critical developer tools for the three main stages in the software development cycle: compilation, debugging and performance tuning.
 
6.1 Compilation
Parallel application development requires various developer tools through the development life cycle. The most fundamental tool is compiler. The quality of a chosen compiler determines the ultimate performance and quality of the executable code. The compiler should support the latest OpenMP standard for OpenMP applications. Furthermore the compiler's internal OpenMP working thread implementation profoundly determines the application parallel performance running on CMT or CMP machines. Sun Studio compilers have consistently produced many SPEC OMP world records on Sun SMP systems in the recent years. Interested readers can find more benchmark information at: http://www.sun.com/software/products/studio/benchmarks.xml.

A few compilers such as Sun Studio compilers and Intel compilers can analyze and parallelize a sequential application into a multi-thread application automatically. In general it is a very difficult problem to do automatic parallelization on a large section of the program. Therefore the compilers tend to focus on the loop constructs. Sun Studio compilers analyze the application data dependency of loop iterations and parallelize the iterations into multiple threads if there are no data dependencies.

6.2 Debugging
Debugging is a very time consuming stage in a application development cycle. A good debugging environment should consist of a fully integrated debugging GUI and a powerful debugging back-end such as dbx. Sun compilers produce better quality symbol mapping data for debugging than most compilers, so Sun Studio debugger has better transparency in dealing with the program data and the instructions.

Sun Studio's fully integrated GUI environment allows the developers to set and manage breakpoints and watch points effectively. It provides calling stack and local variable windows to display the current program context at any time.  It is very easy to navigate the program flow by applying single step and step in commands on the source code editor window directly. The fix and continue feature can cache the incremental fixes to avoid  the large program build time and speed the debugging cycle. Sun Studio debugging environment also includes a run-time checking tool which can instrument the program execution and detect any invalid memory access from a wild pointer or un-initialized variable. The tool also monitors the memory allocation and free operations and reports any memory leakage.

The Sun Studio debugging environment supports  the above features for OpenMP programs and Posix thread (Pthread) applications. Additionally Sun Studio provides a multi-session debugging feature which can load multiple program instances and debug them simultaneously. The developers can use this feature to debug and compare a faulty multi-thread version with a sequential reference version of the same application program.

6.3 Performance Tuning
Sun Studio performance analyzer is one of the best performance tools in the industry. The tool  supports C/C++, Fortran and Java. Sun Studio performance analyzer is easy to use and works with unmodified binaries. The developers only need to recompile the source codes with -g or -g0 for C++ to get the complete source line level information. It works on parallel programs created by either automatic parallelization or explicit parallelization such as OpenMP, MPI and Pthreads.

Using the performance analyzer has two simple steps. First the analyzer collects the profiling data by running through the experiment(s). It collects the data by applying statistical callstack samplings triggered by clock or hardware counter. Next the analyzer loads the performance data and presents them in multiple presentation views. The developers can navigate and examine the performance data at routine, statement and instruction levels. Besides the interactive mode, the developer can collect and print the performance data in batch command mode.

The analyzer can load multiple experiments to aggregate and filter the data by experiment, thread, function and time. The performance tool also provides a set of API routines to allow the developers to instrument the codes manually. Sun Studio performance analyzer supports Java Profiling in three modes. In user mode, it only shows the user threads and Java callstacks. In expert mode, it shows all threads and Java callstacks for user threads and machine callstacks for the other threads. Last, in machine mode, it can show all threads and machine callstacks for all threads.   

7: Summary

In the above tool section, it only discusses three critical developer tools. There are other critical tools for parallel application development, for example a race condition checker. However it is very important for the developers to realize that the developer tools alone cannot solve the  problems of developing parallel application. As discussed in the real case study section, many parallel application developers need to adopt a customized abstraction modeling and programming approach to develop their software.

Facing the emerging platforms based on CMT and CMP processors, we, the software developers, need to change our mind set and attitude toward application development. Starting from the problem definition, we need to analyze, abstract and map the problem into parallel domain directly. The data structures and the algorithm need to be designed for parallel execution. This is definitely not a trivial change from what we are doing today. But this is a small price to pay compared to the  reward we get from the parallel platforms.

Acknowledgements

Special thanks to my three colleagues.  Mike Ball, Sun Distinguished Engineer and HPCS researcher, reviewed and polished this article. Yuan Lin, Sun Senior Staff Engineer, Compiler Optimization and Parallelization team, improved the prime number program and experimented it on the Sun Fire 6800 machine. Yuan also reviewed this article. Ruud van der Pas, Sun Senior Staff Engineer, Compilers, Libraries and Performance Technologies, reviewed and also contributed important content to this article from his Sun HPC 2005 workshop proceedings.     
About the Author
Liang Chen is Distinguished Engineer and Software Architect for Platform Developer Tools at Sun Microsystems. His focus is on the architecture and future technology of Sun Studio developer tools. Liang has more than 2 decades of application development experience. His current main interest is to investigate parallel application development from the tightly coupled OpenMP to loosely coupled SOA and Grid. Before his current job, Liang was software architect and manager of a research project at SunLabs to build a massive parallel machine with thousands of processors to simulate Sun future computing systems. Liang worked at several companies including SGI and AMD before joining Sun.
 
Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Comments:
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.
 
Related Links
 

Oracle is reviewing the Sun product roadmap and will provide guidance to customers in accordance with Oracle's standard product communication policies. Any resulting features and timing of release of such features as determined by Oracle's review of roadmaps, are at the sole discretion of Oracle. All product roadmap information, whether communicated by Sun Microsystems or by Oracle, does not represent a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. It is intended for information purposes only, and may not be incorporated into any contract.