Using the Solaris DTrace Utility With Open MPI Applications

 
By Terry Dontje, Karen Norteman, Rolf vandeVaart, August 2007  

This article describes how to use the Solaris Dynamic Tracing (DTrace) utility with Open MPI applications. DTrace is a comprehensive dynamic tracing utility that you can use to monitor the behavior of application programs as well as the operating system itself. You can use DTrace on live production systems to understand those systems' behavior and to track down any problems that might be occurring.

The D programming language is used to create the source code for DTrace programs.

This article assumes knowledge of the D language and how to use DTrace. For more information about the D language and DTrace, refer to the Solaris Dynamic Tracing Guide. This guide is part of the Solaris 10 OS Software Developer Collection.

Note: The programs and scripts mentioned in the sections that follow are located here:

/opt/SUNWhpc/examples/mpi/dtrace

Checking the mpirun Privileges

Before you run a program with DTrace under mpirun, you need to make sure that you have the correct mpirun privileges on the entire cluster. The privileges you need are dtrace_proc and dtrace_user. Without these privileges, DTrace returns the following error message:

dtrace:  failed to initialize dtrace:  DTrace requires additional privileges

To determine the correct privileges on the cluster:

  1. Use your favorite text editor to create the following shell script, named mpppriv.sh:
    #!/bin/sh
    # mpppriv.sh -  run ppriv under a shell so you can get the privileges
    #	of the process that mpirun creates
    ppriv $$
    

  2. Type the following command, replacing host1 and host2 with the names of hosts in your cluster:
    mpirun -np 2 --host host1,host2 mpppriv.sh
    

If the output of ppriv shows that the E privilege set has the dtrace privileges, then you will be able to run dtrace under mpirun (see the following two examples). Otherwise, you must adjust your system to get dtrace access.

The following example shows the output from ppriv when the privileges have not been set.

% ppriv $$
4084:  -csh 
flags = <none>
        E:  basic 
        I:  basic 
        P:  basic 
        L:  all

This example shows ppriv output when the privileges have been set.

% ppriv $$
2075:  tcsh 
flags = <none>
        E:  basic,dtrace_proc,dtrace_user
        I:  basic,dtrace_proc,dtrace_user
        P:  basic,dtrace_proc,dtrace_user
        L:  all

Note: To update your privileges, ask your system administrator to add the dtrace_user and dtrace_proc privileges to your account in the /etc/user_attr file.

After the privileges have been changed, you can use the ppriv command to view the changed privileges.
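
To automate the check, the ppriv output can be filtered by a small helper. The following is a minimal sketch, not part of the shipped examples: the function name has_dtrace_privs is hypothetical, and it assumes that the E line is a plain comma-separated list like the ones shown above (a set printed as "all" is also treated as sufficient).

```shell
#!/bin/sh
# has_dtrace_privs (hypothetical helper): read ppriv output on stdin and
# succeed only if the effective (E) set lists both dtrace_proc and dtrace_user.
has_dtrace_privs() {
    awk '
        /^[ \t]*E:/ {
            sub(/^[ \t]*E:[ \t]*/, "")            # drop the "E:" label
            n = split($0, p, /[ \t]*,[ \t]*/)     # one privilege per element
            for (i = 1; i <= n; i++) {
                gsub(/[ \t]+$/, "", p[i])         # trim trailing blanks
                if (p[i] == "dtrace_proc") proc = 1
                if (p[i] == "dtrace_user") user = 1
                if (p[i] == "all") proc = user = 1
            }
        }
        END { exit !(proc && user) }
    '
}
```

With this function defined in mpppriv.sh, the last line of the script could become: ppriv $$ | has_dtrace_privs || echo "$(hostname): dtrace privileges missing". Running the script under mpirun would then name only the hosts that still need adjusting.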

Running DTrace with MPI Programs

There are two ways to use dynamic tracing with MPI programs:

  • Run the MPI program directly under DTrace
  • Attach DTrace to a running MPI program

Running an MPI Program Under DTrace

Assume you have a program named mpiapp.

To trace a program using the mpitrace.d script, type the following command:

% mpirun -np 4 dtrace -s mpitrace.d -c mpiapp

The advantage of tracing an MPI program in this way is that all the processes in the job are traced from the beginning. This method is probably most useful in doing performance measurements, when you need to start at the beginning of an application and you need all the processes in a job to participate in collecting data.

This approach also has some disadvantages. One is that the tracing output from all four processes in the above example is directed to the same standard output (stdout). One way around this problem is to create a script similar to the following example.

#!/bin/sh 
# partrace.sh - a helper script to dtrace Open MPI jobs from the 
#         start of the job.  
dtrace -s "$1" -c "$2" -o "$2.${OMPI_MCA_ns_nds_vpid}.trace"

Note: The status of the OMPI_MCA_ns_nds_vpid variable is unstable and subject to change. Use this variable with caution.

To trace a parallel program and get separate trace files:

  1. Create a shell script (named partrace.sh in this example) similar to the following:
    #!/bin/sh
    # partrace.sh - a helper script to dtrace Open MPI jobs from the
    #         start of the job.
    dtrace -s "$1" -c "$2" -o "$2.${OMPI_MCA_ns_nds_vpid}.trace"
    

  2. Type the following command to run the partrace.sh shell script:
    % mpirun -np 4 partrace.sh mpitrace.d mpiapp
    

    This runs mpiapp under dtrace using the mpitrace.d script. The script saves the trace output for each process in a job under a separate file name, based on the program name and rank of the process. Note that subsequent runs append the data into the existing trace files.
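
Because the rank variable is flagged as unstable, a helper script might guard against it being unset. The following is a sketch under stated assumptions: the function name trace_file_name is hypothetical, OMPI_COMM_WORLD_RANK is a variable exported by later Open MPI releases and may not exist in the release described here, and the fallback to the shell's process ID is an illustrative choice.

```shell
#!/bin/sh
# trace_file_name PROGRAM (hypothetical helper): print a per-process trace
# file name for PROGRAM.
# OMPI_MCA_ns_nds_vpid is the variable used by partrace.sh. The fallbacks to
# OMPI_COMM_WORLD_RANK and to the shell's PID are assumptions added here.
trace_file_name() {
    rank=${OMPI_MCA_ns_nds_vpid:-${OMPI_COMM_WORLD_RANK:-}}
    if [ -n "$rank" ]; then
        printf '%s.%s.trace\n' "$1" "$rank"
    else
        # No rank available: keep the files distinct by using the PID.
        printf '%s.pid%s.trace\n' "$1" "$$"
    fi
}
```

In a script like partrace.sh, the dtrace line could then read: dtrace -s "$1" -c "$2" -o "$(trace_file_name "$2")". Without a guard of this kind, an unset variable makes every process write to the same file.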

Attaching DTrace to a Running MPI Program

The second way to use dtrace with Open MPI is to attach dtrace to a running MPI program.

To attach DTrace to a running MPI program:

  1. Log in to the node on which you want to run the program.
  2. Type commands similar to the following example to get the process ID (PID) of the running program on the node of interest:
    % prstat 0 1 | grep mpiapp
    24768 joeuser     526M 3492K sleep   59    0   0:00:08 0.1% mpiapp/1
    24770 joeuser     518M 3228K sleep   59    0   0:00:08 0.1% mpiapp/1
    

  3. Decide which rank you want to use to attach dtrace.

    The lower PID number is usually the lower rank on the node.

  4. Type the following command to attach to the rank 1 process (identified by its process ID, which is 24770 in the example) and run the DTrace script mpitrace.d:
    dtrace -p 24770 -s mpitrace.d
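
The PID lookup in steps 2 and 3 can be scripted. The following is a minimal sketch: the function name pids_of is hypothetical, and it assumes prstat output in the column layout shown above, with the PID in the first column and PROCESS/NLWP in the last.

```shell
#!/bin/sh
# pids_of PROGRAM (hypothetical helper): read prstat output on stdin and
# print the PIDs whose last column starts with "PROGRAM/", lowest PID first.
pids_of() {
    awk -v prog="$1" 'index($NF, prog "/") == 1 { print $1 }' | sort -n
}
```

For example, prstat 0 1 | pids_of mpiapp would print 24768 and then 24770 for the listing above. Since the lower PID is usually, but not always, the lower rank, treat the ordering as a hint only.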
    

Simple MPI Tracing

DTrace enables you to easily trace programs. When used in conjunction with MPI and the more than 200 functions defined in the MPI standard, DTrace provides an easy way to determine which functions might be in error during the debugging process, or those functions that might be of interest. After you determine the function that shows the error, it is easy to locate the desired job, process, and rank on which to run your scripts. As demonstrated above, DTrace allows you to perform these determinations while the program is running.

Although the MPI standard provides the MPI profiling interface, using DTrace provides a number of advantages. The advantages of using DTrace include the following:

  • DTrace can attach to a job that is already running, whereas the PMPI interface requires you to restart a job every time you make changes to the interposing library.
  • DTrace allows you to define probes that let you capture tracing information on MPI without having to code the specific details for each function that you want to capture.
  • The DTrace scripting language (D) has several built-in functions that help in debugging problematic programs.

The following example shows a simple script that traces the entry and exit into all the MPI API calls.

mpitrace.d:  
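/* Fires on entry to every MPI_* function in libmpi. */
pid$target:libmpi:MPI_*:entry
{
  printf("Entered %s...", probefunc);
}

/* Fires on return; in a pid provider return probe, arg1 holds the return value. */
pid$target:libmpi:MPI_*:return
{
  printf("exiting, return value = %d\n", arg1);
}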

When you use this example script to attach DTrace to a job that performs send and recv operations, the output looks similar to the following example.

% dtrace -q -p 24770 -s mpitrace.d
Entered MPI_Send...exiting, return value = 0
Entered MPI_Recv...exiting, return value = 0 
Entered MPI_Send...exiting, return value = 0 
Entered MPI_Recv...exiting, return value = 0 
Entered MPI_Send...exiting, return value = 0 ...
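
When a trace like this is saved to a file, standard text tools can summarize it. The following is a minimal sketch that counts the calls per function: the function name count_mpi_calls is hypothetical, and it assumes one complete "Entered ... exiting" record per line, as in the output above.

```shell
#!/bin/sh
# count_mpi_calls (hypothetical helper): read mpitrace.d output on stdin and
# print "FUNCTION COUNT" pairs, sorted by function name.
count_mpi_calls() {
    sed -n 's/^Entered \(MPI_[A-Za-z0-9_]*\)\.\.\..*$/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For the five lines shown above, this prints "MPI_Recv 2" and "MPI_Send 3". A D aggregation can produce the same counts inside DTrace; the shell version is useful when only the saved trace files are available.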

You can easily modify the mpitrace.d script to include an argument list. The resulting output resembles truss output, as shown in the following example.

mpitruss.d:  
/* Entry to the send and receive calls matched by these names. */
pid$target:libmpi:MPI_Send:entry,
pid$target:libmpi:MPI_*send:entry,
pid$target:libmpi:MPI_Recv:entry,
pid$target:libmpi:MPI_*recv:entry
{
  /* Print the first six arguments in truss style. */
  printf("%s(0x%x, %d, 0x%x, %d, %d, 0x%x)", probefunc, arg0, arg1, arg2, arg3, arg4, arg5);
}

/* On return, arg1 holds the function's return value. */
pid$target:libmpi:MPI_Send:return,
pid$target:libmpi:MPI_*send:return,
pid$target:libmpi:MPI_Recv:return,
pid$target:libmpi:MPI_*recv:return
{
  printf("\t\t = %d\n", arg1);
}

The mpitruss.d script shows how you can specify wildcard names to match the functions. Both probe clauses match all send and receive type function calls in the MPI library. The first clause shows the usage of the built-in arg variables (arg0 through arg5) to print out the argument list of the function being traced.

Take care when wildcarding the entry point and formatting the argument output, because you could end up printing either too many arguments, or not enough arguments, for certain functions. For example, in the above case, the MPI_Irecv and MPI_Isend functions take a seventh parameter, the Request handle, which is not printed out.

The following example shows a sample output of the mpitruss.d script.

% dtrace -q -p 24770 -s mpitruss.d
MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0
MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0
MPI_Send(0x80470b0, 1, 0x8060f48, 0, 1, 0x8060d48) = 0
MPI_Recv(0x80470a8, 1, 0x8060f48, 0, 0, 0x8060d48) = 0
...
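
The six printed values follow the MPI_Send and MPI_Recv parameter order: buffer, count, datatype, destination or source rank, tag, and communicator. The following is a sketch that labels the more readable fields in saved mpitruss.d output. The function name label_args is hypothetical, "peer" is an illustrative label for the destination or source rank, and lines that do not carry exactly six arguments are skipped.

```shell
#!/bin/sh
# label_args (hypothetical helper): read mpitruss.d output on stdin and print
# the count, peer rank (dest or source), and tag of each six-argument call.
label_args() {
    awk '
        match($0, /^MPI_[A-Za-z_]+\(/) {
            name = substr($0, 1, RLENGTH - 1)     # function name without "("
            args = substr($0, RLENGTH + 1)        # text after "("
            sub(/\).*/, "", args)                 # drop ")" and the return value
            n = split(args, a, /,[ \t]*/)
            if (n == 6)
                printf "%s count=%s peer=%s tag=%s\n", name, a[2], a[4], a[5]
        }'
}
```

For the first line of the sample output, this prints "MPI_Send count=1 peer=0 tag=1".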


Tracking Down Resource Leaks

One of the biggest issues with programming is the unintentional leaking of resources, such as memory. With MPI, tracking and repairing resource leaks can be somewhat more challenging because the objects being leaked are in the middleware, and thus are not easily detected by the use of memory checkers.

DTrace helps with debugging such problems using variables, the profile provider, and a callstack function. The mpicommcheck.d script (shown in the example below) probes for all the MPI communicator calls that allocate and deallocate communicators, and keeps track of the stack each time the function is called. Every 10 seconds the script dumps out the current count of MPI communicator calls and the total calls for the allocation and deallocation of communicators. When the dtrace session ends (usually by pressing Ctrl-C, if you attached to a running MPI program), the script prints out the totals and all the different stack traces, as well as the number of times those stack traces were reached.

To perform these tasks, the script uses DTrace features such as variables, associative arrays, built-in functions (count, ustack), and the predefined variable probefunc.

The following example shows the mpicommcheck.d script.

mpicommcheck.d:  
BEGIN 
{ 
  allocations = 0; 
  deallocations = 0; 
  prcnt = 0; 
}

pid$target:libmpi:MPI_Comm_create:entry, 
pid$target:libmpi:MPI_Comm_dup:entry,
pid$target:libmpi:MPI_Comm_split:entry 
{ 
  ++allocations; 
  @counts[probefunc] = count(); 
  @stacks[ustack()] = count(); 
}

pid$target:libmpi:MPI_Comm_free:entry 
{ 
  ++deallocations; 
  @counts[probefunc] = count(); 
  @stacks[ustack()] = count(); 
}

profile:::tick-1sec
/++prcnt >= 10/
{
  printf("=====================================================================");
  printa(@counts); 
  printf("Communicator Allocations = %d \n", allocations);
  printf("Communicator Deallocations = %d\n", deallocations); 
  prcnt = 0; 
}

END 
{ 
  printf("Communicator Allocations = %d, Communicator Deallocations = %d\n",
  	allocations, deallocations); 
}


To use this script, attach dtrace to your program while it runs a suspect section of code (that is, a section of code that might contain a resource leak). If, while the script runs, you see that the printed totals for allocations and deallocations are starting to steadily diverge, you might have a resource leak. Depending on how your program is designed, it might take some time and observation of the allocation and deallocation totals to definitively determine that the code contains a resource leak. When you determine that a resource leak is definitely occurring, you can press Ctrl-C to break out of the dtrace session. Next, using the stack traces dumped, you can try to determine where the issue might be occurring.

The following example shows code that contains a resource leak, and the output that is displayed using the mpicommcheck.d script.

The sample MPI program that contains the resource leak is named mpicommleak. In each iteration of a loop, this program performs three MPI_Comm_dup operations and two MPI_Comm_free operations. The program thus "leaks" one communicator with each iteration of the loop.

When you attach dtrace to mpicommleak using the mpicommcheck.d script above, you see a 10-second periodic output. This output shows that the count of the allocated communicators is growing faster than the count of deallocations.

When you finally end the dtrace session by pressing Ctrl-C, the session outputs a total of five stack traces, showing the three distinct MPI_Comm_dup call stacks and the two distinct MPI_Comm_free call stacks, as well as the number of times each call stack was encountered.

Here is the example session.

% prstat 0 1 | grep mpicommleak
 24952 joeuser    518M 3212K sleep   59    0   0:00:01 1.8% mpicommleak/1
 24950 joeuser    518M 3212K sleep   59    0   0:00:00 0.2% mpicommleak/1
% dtrace -q -p 24952  -s mpicommcheck.d
=====================================================================
  MPI_Comm_free                                                     4
  MPI_Comm_dup                                                      6
Communicator Allocations = 6
Communicator Deallocations = 4
=====================================================================
  MPI_Comm_free                                                     8
  MPI_Comm_dup                                                     12
Communicator Allocations = 12
Communicator Deallocations = 8
=====================================================================
  MPI_Comm_free                                                    12
  MPI_Comm_dup                                                     18
Communicator Allocations = 18
Communicator Deallocations = 12
^C
Communicator Allocations = 21, Communicator Deallocations = 14

              libmpi.so.0.0.0`MPI_Comm_free
              mpicommleak`deallocate_comms+0x19
              mpicommleak`main+0x6d
              mpicommleak`0x805081a
                7

              libmpi.so.0.0.0`MPI_Comm_free
              mpicommleak`deallocate_comms+0x26
              mpicommleak`main+0x6d
              mpicommleak`0x805081a
                7

              libmpi.so.0.0.0`MPI_Comm_dup
              mpicommleak`allocate_comms+0x1e
              mpicommleak`main+0x5b
              mpicommleak`0x805081a
                7

              libmpi.so.0.0.0`MPI_Comm_dup
              mpicommleak`allocate_comms+0x30
              mpicommleak`main+0x5b
              mpicommleak`0x805081a
                7

              libmpi.so.0.0.0`MPI_Comm_dup
              mpicommleak`allocate_comms+0x42
              mpicommleak`main+0x5b
              mpicommleak`0x805081a
                7
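
To make the divergence easier to see in a long session, the periodic reports can be reduced to one number each. The following is a minimal sketch: the function name outstanding_comms is hypothetical, and it assumes the report format that mpicommcheck.d prints above.

```shell
#!/bin/sh
# outstanding_comms (hypothetical helper): read mpicommcheck.d output on
# stdin and print allocations minus deallocations for each periodic report.
# The one-line summary printed at the end of the session is ignored.
outstanding_comms() {
    awk '
        /^Communicator Allocations = [0-9]+ *$/   { alloc = $4 }
        /^Communicator Deallocations = [0-9]+ *$/ { print alloc - $4 }
    '
}
```

For the session above, this prints 2, 4, and 6. A value that keeps growing from report to report matches the leak described, while a value that levels off suggests communicators that are simply held for a long time.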

Authors

Terry Dontje is a Senior Staff Engineer and Technical Lead in the Cluster Tools group who has spent over 20 years in the parallel computing industry. He has spent over 10 years working on MPI implementations including being a vendor representative to the MPI-2 forum and co-designer and implementor of Sun's various MPI implementations starting with Cluster Tools 3 through Cluster Tools 7. He's done work around interfacing MPI implementations to many different user by-pass network protocols. He currently is an active member and Sun's technical representative to the Open MPI community. Terry enjoys interruptions from his daughter and prechewed crackers in his hands. At least the former is true.

Karen Norteman is a software technical writer for the Systems Group at Sun Microsystems. She is the lead writer for the Sun HPC ClusterTools project, based in Burlington, MA. Karen lives in rural southern Maine with her other half, Greg Hall, three bearded collies, and one disgruntled cat.

Rolf vandeVaart is a software developer who has spent many years working on MPI. Initially, he worked on Sun HPC ClusterTools 3, 4, 5 and 6 and he now works on Sun ClusterTools 7, which is based upon Open MPI. He is also an active member of the Open MPI community. He is based at Sun's Burlington, MA campus.
