This article describes the profiling of Message Passing Interface (MPI) applications with the Sun Studio Performance Tools. It starts with an overview of MPI performance data, explains how to profile MPI applications, shows examples from the analysis of performance data, and finishes with a discussion of supported MPI implementations.
The Sun Studio Performance Tools require a supported version of Solaris or Linux and a supported version of Java. You can verify that all the appropriate components and patches are installed by running the [...]

This document assumes you are familiar with MPI. A description of the MPI standard API and runtime system can be found in Wikipedia or in any number of reference books on MPI. This article is based on the Sun Studio Express SSX 11/08 software. The examples are based on the Sun HPC ClusterTools 8.1 MPI implementation.

Overview of MPI Performance Data
MPI applications consist of processes that use MPI to synchronize and communicate. The processes may be instantiations of a single executable or of multiple executables. An application is launched with an agent, typically mpirun or mpiexec, that assigns the processes to nodes of an MPI cluster. Each process is assigned a sequential identifier known as its global rank. The processes use MPI API calls to exchange data, combine partial results of computations, or synchronize with one another.

MPI performance issues can be grouped into two categories: problems that can be identified by focusing on communication and synchronization, and problems that can be identified by looking at the application's computation itself. The two categories are addressed with different kinds of data collection. Communication data is collected by tracing MPI API calls. Computation data is collected with clock profiling and, optionally, CPU hardware-counter profiling.

Communication Profiling

Communication profiling data can be used to show the exact sequence of MPI API calls. It can also be used to identify imbalances, outliers, and communication patterns. (Examples of analysis screen shots can be found in MPI Timeline and MPI Charts.) This data is gathered by tracing MPI API calls using a technique called interposition.
When the application calls the [...]

During the application's run, the data collector records the name of each call, the number of messages sent, the sent-byte count, the number of messages received, and the received-byte count. Each trace event also includes an entry and an exit timestamp.
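The fields listed above can be pictured as one record per traced call. The following Python sketch is only an illustration of that idea: the `TraceEvent` layout and the `summarize` helper are hypothetical names, not the collector's actual data format, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    """One traced MPI API call (hypothetical record layout)."""
    rank: int
    function: str        # name of the MPI call, e.g. "MPI_Send"
    msgs_sent: int
    bytes_sent: int
    msgs_received: int
    bytes_received: int
    entry_time: float    # seconds
    exit_time: float     # seconds

    @property
    def duration(self) -> float:
        """Time spent inside the call."""
        return self.exit_time - self.entry_time


def summarize(events):
    """Aggregate call count, total time, and byte counts per function."""
    summary = {}
    for ev in events:
        row = summary.setdefault(
            ev.function,
            {"calls": 0, "time": 0.0, "bytes_sent": 0, "bytes_received": 0},
        )
        row["calls"] += 1
        row["time"] += ev.duration
        row["bytes_sent"] += ev.bytes_sent
        row["bytes_received"] += ev.bytes_received
    return summary


# Invented sample data for two ranks.
events = [
    TraceEvent(0, "MPI_Send", 1, 1024, 0, 0, 0.5, 0.75),
    TraceEvent(1, "MPI_Recv", 0, 0, 1, 1024, 0.25, 0.75),
    TraceEvent(0, "MPI_Send", 1, 2048, 0, 0, 1.0, 1.5),
]
summary = summarize(events)
```

Because every event carries both timestamps, the time spent in each MPI function can be summed exactly rather than estimated statistically.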
During [...]

Computation Profiling

Computation profiling data can show which of an application's functions and source lines are consuming system resources or causing program stalls. (Examples of analysis screen shots can be found in Examining Computation Profiling.) Computation data is based on clock profiling and, optionally, CPU hardware-counter profiling. The data consists of performance metrics and stack information that are recorded each time a clock-profile or CPU hardware-counter event occurs.

Clock Profiling and MPI State Data
Clock profiling is enabled by the [...]

Two additional performance metrics, known collectively as MPI State Data, are obtained from the MPI runtime library whenever clock profiling is performed on ClusterTools 8.1 or later. The two metrics are MPI-Work Time, which accumulates when the process is doing work inside the MPI library, and MPI-Wait Time, which accumulates when the process is busy-waiting or sleep-waiting. MPI State Data can be used to identify the nature of MPI-related stalls.

Hardware-Counter Profiling Data
Hardware-counter profiling works on a principle similar to clock profiling, but uses CPU hardware-counter events to trigger the recording of samples instead. The events are CPU-specific, but they typically measure the use of resources such as CPU cycles, caches, and TLBs. Hardware-counter profiling is available on Solaris systems as well as on supported Linux systems that have the [...]

Comparing MPI API Tracing and Clock-Profiling Data

MPI API trace data is orthogonal to MPI State data, but both get their information from the MPI runtime, so the two are correlated. Specifically, the MPI Time metric measured by tracing represents the real time spent in the MPI runtime, summed across processes, while MPI-Wait and MPI-Work are statistical measures which approximately sum to the MPI Time.

Scalability Differences between API Tracing and Clock-Profiling

MPI can be used for very large jobs that use hundreds, if not thousands, of processors and MPI processes. In such cases, the scalability of the performance measurements becomes significant.

The data volume of MPI API tracing is approximately proportional to the number of MPI API calls and messages. As a result, data volume can only be reduced by limiting the application's runtime or process count. Selective profiling of a subset of ranks is not possible because the collector would not have enough information to match message sends and receives.
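As a rough illustration of this scaling behavior, the Python sketch below estimates record counts under simple assumptions: one trace record per MPI API call, and one clock-profile sample per sampling interval per rank. The function names, the call counts, and the 10 ms and 100 ms intervals are illustrative values chosen for the example, not figures taken from the tools.

```python
def trace_records(calls_per_rank: int, ranks: int) -> int:
    """Trace volume grows with the total number of MPI API calls."""
    return calls_per_rank * ranks


def profile_samples(runtime_s: float, interval_ms: float, ranks: int) -> int:
    """Clock-profile volume grows with runtime divided by sampling interval."""
    return int(runtime_s * 1000 / interval_ms) * ranks


ranks = 64
traced = trace_records(calls_per_rank=1_000_000, ranks=ranks)
sampled_fast = profile_samples(runtime_s=600, interval_ms=10, ranks=ranks)
sampled_slow = profile_samples(runtime_s=600, interval_ms=100, ranks=ranks)

print(traced)        # 64000000
print(sampled_fast)  # 3840000
print(sampled_slow)  # 384000
```

In this model the only levers on trace volume are the call count and the number of ranks, whereas the sample count can also be reduced by lengthening the sampling interval.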
On the other hand, the data volume from clock profiling depends on the number of samples taken. The sampling frequency can be managed by choosing a lower profiling rate. For example, [...]

How to Profile MPI Applications
This section discusses how to compile and launch your application for performance profiling.

Compiling MPI Applications

You can compile your application with either the Sun Studio compilers or the GNU compilers. Most MPI implementations provide wrapper scripts for the C, C++, and Fortran compilers to ensure that the proper include files and libraries are found. For C++ applications, be sure to use a version of the MPI implementation compiled with the same compiler as your application. Some Fortran compilers have similar restrictions.
If you are going to profile your application, specify [...]

Some MPI distributions are available with either static linking or shared-object linking. You must use a version built with shared-object linking in order to use the Sun Studio Profiling Tools.

Launching MPI Performance Data-Collection
Use the [...]

MPI API trace data can only be collected for applications that are run on supported MPIs. We also recommend using ClusterTools 8.1 or later for MPI profiling. It is based on Open MPI and is tuned for running on Sun equipment. In addition, it has a profiling feature that no other MPI has: the profiling of MPI States, which is automatically collected with clock profiling.

If you run an MPI job using ClusterTools 8.1 with:
You can collect MPI performance data on it with:
Note the [...]

For MPI applications that run longer than several minutes, you may want to reduce the frequency of statistical samples with the [...]

Invoking the [...]

Although many users employ a script to launch their MPI [...]

Disabling MPI API Tracing
MPI API tracing is enabled by default with [...]

[...] produces an MPI Experiment with only computation profiling data and [...] produces an MPI Experiment with computation profiling data recorded at a reduced rate.

Selective Computation Profiling of MPI Jobs

[This section is based on a suggestion of Hans Joraandstad.]

In addition to disabling MPI API tracing, you can further reduce the data volume of an MPI profile by using selective collection on a subset of ranks. Instead of using the [...]

Replace the computation-only MPI profiling command:

[...]

with

[...]

where [...]

The following simple script profiles the first two ranks, but not any others, with ClusterTools 8.1:
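As an illustration of the rank-selection logic only, here is a minimal Python sketch. It is not the script referred to above. It assumes the rank is exposed through the `OMPI_COMM_WORLD_RANK` environment variable that Open MPI sets for each process (ClusterTools 8.1 is based on Open MPI), and it takes the profiler command line as a parameter; `build_command` and the `profiler-cmd` placeholder are hypothetical names.

```python
import os


def build_command(target, profiler_prefix, env=None,
                  rank_var="OMPI_COMM_WORLD_RANK", max_rank=1):
    """Return the argument list this process should execute.

    Ranks 0..max_rank are wrapped with the profiler prefix; all other
    ranks, and processes whose rank cannot be determined, run the
    target unmodified.
    """
    env = os.environ if env is None else env
    rank_text = env.get(rank_var)
    if rank_text is not None and rank_text.isdigit():
        if int(rank_text) <= max_rank:
            return list(profiler_prefix) + list(target)
    return list(target)


# "profiler-cmd" is a placeholder, not a real command name.
profiled = build_command(["./a.out"], ["profiler-cmd"],
                         env={"OMPI_COMM_WORLD_RANK": "0"})
unprofiled = build_command(["./a.out"], ["profiler-cmd"],
                           env={"OMPI_COMM_WORLD_RANK": "2"})
print(profiled)    # ['profiler-cmd', './a.out']
print(unprofiled)  # ['./a.out']
```

A wrapper built on this idea would be started once per rank by the MPI launcher and would then replace itself with the returned command, for example with `os.execvp(cmd[0], cmd)`.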
Some MPI implementations use a different environment variable to specify the MPI rank, and some pass the rank as an argument to the target. You can adapt the script, and its invocation in the [...]

You should exercise care in selecting the MPI ranks for which you want to collect data. Rank zero is usually different from all the other ranks, and it may or may not be particularly interesting. You should choose whichever ranks are relevant to your performance problems.

Analysis of Performance Data
Performance data can be examined with the [...]

The following sections show a few examples of these analysis views. The MPI Timeline and MPI Charts tabs can be used to explore the MPI API trace data.

MPI Timeline

The MPI Timeline graphically displays the MPI activity that occurred during an application's run. For each MPI process, you can look horizontally to see what the process is doing as a function of elapsed time. Figure 1 is a screen shot of the Sun Studio Analyzer window, which shows processes P0 to P24 over a timespan of 560 milliseconds.
The Absolute Time, measured in seconds, is shown at the top along the horizontal axis. The Relative Time, measured in milliseconds, is shown at the bottom along the horizontal axis. At the left, the MPI process ranks from P0 to P24 are listed. At this level of zoom, the names of MPI API functions are not visible.

Figure 2 shows a zoomed-in view of the data shown above with [...]
The Messages slider, on the right, controls the number of message lines displayed on the screen. You can adjust the message volume so that the screen is readable and the tool remains responsive. If fewer than 100% of the messages are shown, the visible messages are those that are most "costly" in terms of the total time used in the send and receive functions of that message.

MPI Charts

The MPI Charts tab generates scatter plots and histograms to visualize the MPI API trace data. MPI Charts can be used to identify communication patterns, imbalances, and outliers. The charting facility is a general one in which you specify which data types and metrics to plot. The initial view, shown in Figure 3, helps users visualize the ratio of user Application time to time spent in various MPI API functions:
MPI Charts can be used to show communication patterns. Figure 4 shows the data volume sent to and from each rank:
One way to identify outliers is to use scatter plots. For example, Figure 5 shows a distribution of average function duration. The function Entry Times, measured in seconds, are shown along the X axis, and the average function durations for each time period are shown along the Y axis. Colors near the red end of the scale identify when average function duration is higher than normal. There are two outliers in this screen shot, one with an Entry Time near 20 seconds and another near 40 seconds.
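The same idea can be expressed numerically: group calls by entry time, average the durations in each period, and flag periods whose average is well above the rest. The Python sketch below is a hypothetical illustration of that approach and does not describe how the Analyzer computes its charts; the helper names, the 10-second bin width, the two-standard-deviation threshold, and the sample data are all assumptions.

```python
from statistics import mean, pstdev


def average_duration_by_bin(events, bin_width):
    """Group (entry_time, duration) pairs into time bins and average each bin."""
    bins = {}
    for entry_time, duration in events:
        bins.setdefault(int(entry_time // bin_width), []).append(duration)
    return {b: mean(ds) for b, ds in sorted(bins.items())}


def outlier_bins(averages, threshold=2.0):
    """Return bins whose average is more than `threshold` standard
    deviations above the mean of all bin averages."""
    values = list(averages.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [b for b, v in averages.items() if (v - mu) / sigma > threshold]


# Invented data: one call per 10-second period, with a slow call at 21 s.
events = [(t * 10.0 + 1.0, 0.5 if t == 2 else 0.01) for t in range(10)]
averages = average_duration_by_bin(events, bin_width=10.0)
outliers = outlier_bins(averages)
print(outliers)  # [2], the bin covering 20 s to 30 s
```

A fixed threshold like this is a blunt instrument; a scatter plot lets you judge by eye what counts as higher than normal for a given run.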
For more information on MPI Timeline, MPI Charts, and MPI Filtering, please see the MPI Analyzer Tutorial.

Examining Computation Profiling

You can identify the functions, source lines, and calling sequences that contribute the most to each performance metric by browsing the Functions, Source, and Callers-callees tabs, respectively.

Functions Tab

The Functions Tab is used to identify the functions that consumed the most resources. It consists of a table of functions and the associated metrics for each function. Any column can be sorted by clicking on the column header. Functions can be selected in order to see more details, and the source can be viewed for a selected function.

Figure 6 shows the Functions tab with selected metrics on the left. Details for the selected function, [...]
Of the metrics shown on the right, clock profiling provides the information from User CPU down to Other Wait. The metrics supplied by MPI API tracing follow:
At the end of the list are the metrics supplied by ClusterTools 8.1 with clock-profiling:
Source Tab

The Source Tab shows source code annotated with performance metrics. Clicking on the Source Tab after selecting the [...]

The text in blue is commentary supplied by the Sun Studio compiler. On the left, the MPI Work and MPI Wait metrics are shown. These metrics show that the call to [...]

There are a number of other Analyzer tabs available. For further information, please try [...]

Supported MPI Implementations
MPI Implementations Recognized for MPI API Tracing
Table 1 lists the MPI implementations which are recognized for MPI API Tracing. You must specify the implementation as the parameter to the [...]

Table 1. MPI Implementations recognized for MPI API Tracing

Additional MPI Implementations Recognized for Computation-Only Profiling

While you cannot collect communication profile data for an unsupported version of MPI, you can collect computation profile data. For example, you can collect computation profiling data using the command:
(Note that some MPI implementations use a differently-named command for [...].)

The command will collect computation profiles in separate experiments for each MPI rank. If the MPI rank environment variable is recognized, the experiments will be named by rank. If the rank is specified by a variable that is not recognized, the experiments will be named in the order created.

You can bring up Analyzer either on a single experiment, or on all the experiments to see the data aggregated across all ranks. Specifying the [...]

Table 2 lists the MPI implementations which are not recognized for MPI API Tracing. If the environment variable specifying the rank is recognized, experiments will be named by rank.
Table 2. MPI Implementations not recognized for MPI API Tracing