IntroductionPerformance analysis is the process of looking at how fast an application runs, determining why it is running as fast as it does, and then finding ways to improve the performance. A further complication is the problem of knowing when performance can not be improved any further. An application's performance depends on a combination of hardware and software factors. For example, what events must the hardware deal with, and what degree of optimisation was applied when compiling the application? There are a number of tools that can be used to extract information or collect this kind of information, but knowing which tools to pick for a given application can be tricky. This paper introduces a new tool that aims to simplify the process of performance analysis. We call it the Simple Performance Optimisation Tool – or 'spot'. Spot is an add-on package to Sun Studio 11, and it is only available for UltraSPARC based systems. Spot has been released as part of the Cool Tools project, further information on the tool and how to obtain it is available from this site. What is spot?The idea behind spot is to make it as simple as possible to produce a report on the performance of a given application. There are two ways of using spot to do this, the way that produces the most data is for applications that can be run multiple times. In this case spot can run the application under a comprehensive set of tools and collect data for each tool over the entire run of the application. Alternatively, some applications run continuously, and in this case spot can attach to the running application and use each tool for a short period of time to gather the data. One very useful feature of spot is that it generates a report on the application's performance as a set of hyperlinked web-pages. This addresses a couple of hard-to-solve problems. First of all, the spot-report archives a snapshot of the performance of the application under a given workload, complete with build flags and source code for the parts of the application where the time is spent (assuming that this information is available). Secondly, because the spot-report is a hypertext document, it is possible to pinpoint a particular place in the code with a URL – this feature is very helpful for either working on remote machines, or for collaborating with remote colleagues. The architecture of spot is shown in the following figure:
Figure 1 - Architecture
of SPOT
Spot uses a number of tools, and each tool generates a part of the report. The major tools spot uses are:
Running spotSpot is installed on top of Sun Studio 11, if this is installed in the default place of /opt/SUNWspro, then spot will be located in /opt/SUNWspro/extra/bin. Spot can be invoked in one of two ways: When invoked as
it will run the target application multiple times, each time under a new tool. When invoked as:
it will attach to a running process and rotate through each tool for a short period of time. In both cases the report will be broadly the same. By default spot will produce a short report on the performance of the application, but if spot is started with the -X flag it will attempt to provide a more detailed set of data (if it has sufficient permissions). Compiling the application to get the most data from spotTo get the most benefit from spot, the application needs to be compiled with the Sun Studio 11 compiler. The following flags are recommended:
An exampleThe code test.c demonstrates some of the key features of the spot-report. This test code has three separate stages which do the following things:
For the purposes of being an effective demonstration of the tool, the code should be compiled with the minimum amount of optimisation, and the appropriate compiler flags:
Then it can be run under spot using the following command:
Recording system and build informationThe first thing that spot does is to record information on the test bed that was used to gather the data, and also on the flags that were used to build the application. This is shown in the following figure:
Figure 2 - System and
Build Information
In this particular case the application built as shown above, and then run on a Sun Fire v880 with 8 900 Mhz UltraSPARC-III+ processors. Hardware performance counter informationThe tool ripc then runs the application and gathers data on the relevant events recorded by the performance counters on the processor. The information gathered from the performance counters can be used to guide the process of selecting appropriate compiler options, as suggested in this paper. The information from ripc is presented as a table, as shown in the next figure.
Figure 3 - Hardware
Performance Counter Information
The table has three major sections. The first section attempts to attributed time that the application spent stalled to the various hardware conditions that caused the stall. In this particular example nearly the entire runtime of the application was spent stalled. About 14% of the time was lost to Data TLB misses (DTLB_miss), this is one condition which can often be corrected using the compiler flag -xpagesize=4M. A further ~80% of the runtime was spent stalled waiting for data that was not resident in the Data Cache (Re_DC_miss). About 60% of the runtime was spent waiting for data that was not in the External Cache (Re_EC_miss). This means that, the difference, about 20% of the runtime the data was not in the Data Cache, but was in the External Cache. The next section reports information on IPC (Instructions per Cycle) which is a measure of how much useful work the processor is achieving each cycle. The idea of a Grouped IPC is a measure of how much work would be achieved if the processor did not encounter any stall conditions. At this point the tool also reports the number of unfinished floating point traps that occurred during the run (on a system-wide basis). Floating point traps occur very rarely, but if they do happen it can be tricky to realise that it is a problem. The tool produces a section which reports events as a proportion, for example cache misses per reference. The final section reports the memory usage of the application. If gnuplot is available on the path, this information is also presented in a graphical form.
In the figure above the Data TLB misses are shown over the entire run of the program. In the first part of the run there were very few TLB misses, but in the last part there were significant numbers. Instruction frequenciesThe tool BIT provides a number of pieces of information. The most useful of these are the number of times each routine is called and each instruction is executed. These will be covered in the section on profiles. However, BIT also gathers data on individual instruction frequencies – for example how many floating point instructions were executed during the run of the benchmark. The counts from BIT are for just the instrumented part of the application, so do not include library calls. In some cases (for example for some templates in C++) not all the code is instrumented, so the resulting counts are not complete. The following figure shows instruction frequencies for the test program.
Figure 5 - Instruction
Frequences From BIT
The difference between the number of floating point operations and the number of floating point operations is a count of the number of floating point calculations (the additions and multiplications). In the above example, 53% of the operations involve floating point, 40% of the instructions are floating point loads or stores, which means that 13% of the instructions are floating point calculations. BIT also correctly counts instructions in delay slots and annulled instructions. The delay slot is the instruction after a branch. Depending on whether the branch is taken or not this instruction may or may not end up being executed. If it is not executed it is an annulled instruction. Time based profilesSpot uses the 'collect' command to gather a time based profile of the application. Spot then renders this as a set of hyperlinked webpages. The profile of the test code is shown below:
The profiles combine the information from collect with the information from BIT. The columns shown are as follows:
On the right of the page are hyperlinks that go to the following results:
The following figure shows an example of the mix of disassembly with source code, showing the attribution of time to individual assembly language instructions.
The columns show the time spent on each assembly language instruction and the number of times that it was executed. Using this information it is possible to determine that the trip count for the loop between lines 0x1243c and 0x12448 is nearly 170 million times, and that the loop was only entered once. The time is recorded on the increment instruction at 0x1243c. The sampling method used by the Performance Analyzer records time spent on the instruction waiting to be executed. So in this case it is necessary to identify the instruction that is executed before the increment; which turns out to be the load (in the delay slot) at 0x12448. It is this load instruction that is taking all the time. There are hyperlinks in the disassembly. Those that are in square brackets will jump to the line of source code that the assembly language instruction belongs to. The links that are part of branch instructions will jump to the target of the branch. Hardware event profilesIf spot is run with the -X option, it will gather profile using the performance counters that record the most stall time. In the following figure the profile shows where the hardware counter events occur in the test program.
Looking a the first of the two profiles, it is apparent that the cycles spent stalled because of Data Cache miss events are in all three routines. However, looking at the second profile it can be seen that the External Cache miss events only happen in the cache_miss and tlb_miss routines. System-wide bandwidth consumption and trap dataIf spot is run with root permissions and the -X flag, it will try to gather system-wide information on the bandwidth consumption and the traps encountered over the entire run of the application. If other applications are also running on the system, these statistics will include bandwidth consumption and traps from those other applications. The bandwidth consumption and trap information is presented as a table that summarises the entire run. If gnuplot is present in the user's path, then the results will also be available as graphs over time. The following shows a graph of bandwidth consumption over the run of the application.
The most bandwidth is consumed during the first routine where data is being streamed from memory. The subsequent routines which are more tests of memory latency, consume less bandwidth. The next figure shows the Data TLB traps reported over the entire run of the application.
Figure 10 - Data TLB Traps Over Run of Application As might be expected, the trap data looks very similar to the data reported by the hardware counters for the same event. Concluding remarksSpot generates a wealth of performance data that in most cases enables the user to rapidly narrow down on the places in the code where time is being spent. It also assists in identifying the causes of application slow-down. For more about spot, including downloading information, visit http://cooltools.sunsource.net/spot/index.html. |
| ||||||||
|
| ||||||||||||