Sun Java Solaris Communities My SDN Account Join SDN

Article

Performance Analysis Made Simple Using SPOT

 
By Darryl Gove, Senior Performance Engineer, Sun Microsystems, March 7, 2006  

Introduction

Performance analysis is the process of looking at how fast an application runs, determining why it is running as fast as it does, and then finding ways to improve the performance. A further complication is the problem of knowing when performance can not be improved any further.

An application's performance depends on a combination of hardware and software factors. For example, what events must the hardware deal with, and what degree of optimisation was applied when compiling the application? There are a number of tools that can be used to extract information or collect this kind of information, but knowing which tools to pick for a given application can be tricky.

This paper introduces a new tool that aims to simplify the process of performance analysis. We call it the Simple Performance Optimisation Tool – or 'spot'. Spot is an add-on package to Sun Studio 11, and it is only available for UltraSPARC based systems. Spot has been released as part of the Cool Tools project, further information on the tool and how to obtain it is available from this site.

What is spot?

The idea behind spot is to make it as simple as possible to produce a report on the performance of a given application. There are two ways of using spot to do this, the way that produces the most data is for applications that can be run multiple times. In this case spot can run the application under a comprehensive set of tools and collect data for each tool over the entire run of the application. Alternatively, some applications run continuously, and in this case spot can attach to the running application and use each tool for a short period of time to gather the data.

One very useful feature of spot is that it generates a report on the application's performance as a set of hyperlinked web-pages. This addresses a couple of hard-to-solve problems. First of all, the spot-report archives a snapshot of the performance of the application under a given workload, complete with build flags and source code for the parts of the application where the time is spent (assuming that this information is available). Secondly, because the spot-report is a hypertext document, it is possible to pinpoint a particular place in the code with a URL – this feature is very helpful for either working on remote machines, or for collaborating with remote colleagues.

The architecture of spot is shown in the following figure:

Frame1

Figure 1 - Architecture of SPOT

Spot uses a number of tools, and each tool generates a part of the report. The major tools spot uses are:

  • The 'collect' command from the Sun Studio Performance Analyzer. This command can gather either a profile of where the time is spent in an application, or where processor events occur in the application.

  • The command 'ripc' gathers information about how much time is spent in processor stalls, and reports this information by the type of stall that causes the problem. For example, it will report the number of seconds spent waiting for data that is located in memory (the counter for this is named Re_EC_miss. The counter names are those that are found in the User's Guide for the processor). This command can also select counters that are worth profiling using 'collect' to determine exactly where in the program the particular events are happening. The tool ripc will also generate a set of graphs if it finds gnuplot to be available on the user's path.

  • The BIT tool instruments the application (so long as the application has been compiled appropriately) and counts the number of times each routine and instruction is executed. Using this information it is possible to derive things like the average trip count for a particular loop, or how frequently a condition is true or false.

  • The output from the 'collect' command and from BIT is fed through another tool called er_html. er_html uses er_print from the Performance Analyzer to render the data as a set of hyperlinked webpages. These pages show the source (if the application has been compiled with debug) and disassembly for the parts of the code where the time is spent.

  • The tool 'bw' reports system-wide bandwidth utilisation over the entire run of the application. The tool 'traps' reports system-wide trap statistics. These tools need root privileges to run. Both of these tools will also generate graphs of the results if gnuplot is found on the path.

Running spot

Spot is installed on top of Sun Studio 11, if this is installed in the default place of /opt/SUNWspro, then spot will be located in /opt/SUNWspro/extra/bin.

Spot can be invoked in one of two ways:

When invoked as  /opt/SUNWspro/extra/bin/spot <application> <parameters>

it will run the target application multiple times, each time under a new tool.

When invoked as:  /opt/SUNWspro/extra/bin/spot -P <pid>

it will attach to a running process and rotate through each tool for a short period of time. In both cases the report will be broadly the same.

By default spot will produce a short report on the performance of the application, but if spot is started with the -X flag it will attempt to provide a more detailed set of data (if it has sufficient permissions).

Compiling the application to get the most data from spot

To get the most benefit from spot, the application needs to be compiled with the Sun Studio 11 compiler. The following flags are recommended:

  • Enable debug information using the flag -g for C or Fortran or the flag -g0 for C++. This will enable spot to report time attributed to lines of source, not just assembly language instructions.

  • Allow spot to gather instruction and function count data using BIT by ensuring that the application is compiled with -O, -fast, or at least -xO1, and the -xbinopt=prepare flag. The effect of these flags is to record sufficient information in the binary for the tool BIT to be able to instrument it.

An example

The code test.c demonstrates some of the key features of the spot-report. This test code has three separate stages which do the following things:

  • Consume memory bandwidth.

  • Cause cache misses.

  • Cause TLB misses.

For the purposes of being an effective demonstration of the tool, the code should be compiled with the minimum amount of optimisation, and the appropriate compiler flags:

cc -g -O -xbinopt=prepare -o test test.c

Then it can be run under spot using the following command:

spot -X test

Recording system and build information

The first thing that spot does is to record information on the test bed that was used to gather the data, and also on the flags that were used to build the application. This is shown in the following figure:

Frame2

Figure 2 - System and Build Information

In this particular case the application built as shown above, and then run on a Sun Fire v880 with 8 900 Mhz UltraSPARC-III+ processors.

Hardware performance counter information

The tool ripc then runs the application and gathers data on the relevant events recorded by the performance counters on the processor. The information gathered from the performance counters can be used to guide the process of selecting appropriate compiler options, as suggested in this paper. The information from ripc is presented as a table, as shown in the next figure.

Frame3

Figure 3 - Hardware Performance Counter Information

The table has three major sections. The first section attempts to attributed time that the application spent stalled to the various hardware conditions that caused the stall. In this particular example nearly the entire runtime of the application was spent stalled.

About 14% of the time was lost to Data TLB misses (DTLB_miss), this is one condition which can often be corrected using the compiler flag -xpagesize=4M.

A further ~80% of the runtime was spent stalled waiting for data that was not resident in the Data Cache (Re_DC_miss). About 60% of the runtime was spent waiting for data that was not in the External Cache (Re_EC_miss). This means that, the difference, about 20% of the runtime the data was not in the Data Cache, but was in the External Cache.

The next section reports information on IPC (Instructions per Cycle) which is a measure of how much useful work the processor is achieving each cycle. The idea of a Grouped IPC is a measure of how much work would be achieved if the processor did not encounter any stall conditions.

At this point the tool also reports the number of unfinished floating point traps that occurred during the run (on a system-wide basis). Floating point traps occur very rarely, but if they do happen it can be tricky to realise that it is a problem.

The tool produces a section which reports events as a proportion, for example cache misses per reference.

The final section reports the memory usage of the application.

If gnuplot is available on the path, this information is also presented in a graphical form.


Figure 4 - ripc Output Showing DTLB Misses Over The Run Of The Program

In the figure above the Data TLB misses are shown over the entire run of the program. In the first part of the run there were very few TLB misses, but in the last part there were significant numbers.

Instruction frequencies

The tool BIT provides a number of pieces of information. The most useful of these are the number of times each routine is called and each instruction is executed. These will be covered in the section on profiles. However, BIT also gathers data on individual instruction frequencies – for example how many floating point instructions were executed during the run of the benchmark.

The counts from BIT are for just the instrumented part of the application, so do not include library calls. In some cases (for example for some templates in C++) not all the code is instrumented, so the resulting counts are not complete.

The following figure shows instruction frequencies for the test program.

Frame5

Figure 5 - Instruction Frequences From BIT

The difference between the number of floating point operations and the number of floating point operations is a count of the number of floating point calculations (the additions and multiplications). In the above example, 53% of the operations involve floating point, 40% of the instructions are floating point loads or stores, which means that 13% of the instructions are floating point calculations.

BIT also correctly counts instructions in delay slots and annulled instructions. The delay slot is the instruction after a branch. Depending on whether the branch is taken or not this instruction may or may not end up being executed. If it is not executed it is an annulled instruction.

Time based profiles

Spot uses the 'collect' command to gather a time based profile of the application. Spot then renders this as a set of hyperlinked webpages. The profile of the test code is shown below:

Frame6
Figure 6 - Profile of the Test Code

The profiles combine the information from collect with the information from BIT. The columns shown are as follows:

  • The exclusive user time is the time spent executing code in the named routine.

  • The inclusive user time is time spent executing code in a named routine, and all the routines that it calls.

  • The system time column is the time that a given routine spends in system code.

  • The wall time column is the elapsed time for the master thread. In single threaded applications this profile is similar to the user and system time. For multi-threaded applications this profile tracks what the master thread is doing, and may not correspond to where the work is performed in the application.

  • The exclusive function count column counts the number of times that a given routine is called during the run of the application. This information is provided by BIT and is only available on suitably compiled binaries. This count data is not available for routines that reside in libraries. This can be observed in the example where it appears as if _memset is never called.

  • The exclusive instructions executed column is a count of the number of instructions executed in each routine during the run of the application. Again this information is not available for all routines – in particular routines that reside in libraries are excluded.

  • The exclusive instructions annulled column is a count of the number of instructions in delay slots that were annulled (not executed).

On the right of the page are hyperlinks that go to the following results:

  • The trimmed link goes to disassembly (and source if available) which has been cut down in size so that only the important parts of the code remain. This is usually the best link to use to examine the profile in more detail because the untrimmed disassembly can sometimes be several MB in size.

  • The name of the routine links to the untrimmed disassembly (and source if available).

  • The src link goes to just the source for the given routine, showing the time and events attributed to each line of source code. This link is only available if the tool has been able to locate the source code for a particular routine.

  • The caller-callee link goes to a page which show which routines call which other routines.

The following figure shows an example of the mix of disassembly with source code, showing the attribution of time to individual assembly language instructions.

Frame7
Figure 7 - Profile of Assembly Language Instructions

The columns show the time spent on each assembly language instruction and the number of times that it was executed. Using this information it is possible to determine that the trip count for the loop between lines 0x1243c and 0x12448 is nearly 170 million times, and that the loop was only entered once.

The time is recorded on the increment instruction at 0x1243c. The sampling method used by the Performance Analyzer records time spent on the instruction waiting to be executed. So in this case it is necessary to identify the instruction that is executed before the increment; which turns out to be the load (in the delay slot) at 0x12448. It is this load instruction that is taking all the time.

There are hyperlinks in the disassembly. Those that are in square brackets will jump to the line of source code that the assembly language instruction belongs to. The links that are part of branch instructions will jump to the target of the branch.

Hardware event profiles

If spot is run with the -X option, it will gather profile using the performance counters that record the most stall time. In the following figure the profile shows where the hardware counter events occur in the test program.

Frame8
Figure 8 - Profiling Using Hardware Counters

Looking a the first of the two profiles, it is apparent that the cycles spent stalled because of Data Cache miss events are in all three routines. However, looking at the second profile it can be seen that the External Cache miss events only happen in the cache_miss and tlb_miss routines.

System-wide bandwidth consumption and trap data

If spot is run with root permissions and the -X flag, it will try to gather system-wide information on the bandwidth consumption and the traps encountered over the entire run of the application. If other applications are also running on the system, these statistics will include bandwidth consumption and traps from those other applications.

The bandwidth consumption and trap information is presented as a table that summarises the entire run. If gnuplot is present in the user's path, then the results will also be available as graphs over time.

The following shows a graph of bandwidth consumption over the run of the application.


Figure 9 - Bandwidth Consumption Over Run of Application

The most bandwidth is consumed during the first routine where data is being streamed from memory. The subsequent routines which are more tests of memory latency, consume less bandwidth.

The next figure shows the Data TLB traps reported over the entire run of the application.


Figure 10 - Data TLB Traps Over Run of Application

As might be expected, the trap data looks very similar to the data reported by the hardware counters for the same event.

Concluding remarks

Spot generates a wealth of performance data that in most cases enables the user to rapidly narrow down on the places in the code where time is being spent. It also assists in identifying the causes of application slow-down.

For more about spot, including downloading information, visit http://cooltools.sunsource.net/spot/index.html.

Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Comments:
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.
Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK. Read Darryl's blog at blogs.sun.com/d/.