
Using UltraSPARC-IIICu Performance Counters to Improve Application Performance

 
By Darryl Gove, Compiler Performance Engineering Group, Sun Microsystems  
The UltraSPARC-IIICu implements a number of very useful hardware performance counters. These either count processor events (such as the number of times that a floating point operation completed), or they count the number of cycles for which something was true (such as the number of cycles that the processor was waiting for data from memory). This article introduces you to the performance counters, indicating which ones are of most interest, and demonstrates how you might use the Sun ONE Studio Performance Tools to identify where in your application these events are happening. You can use this information to improve the performance of your application.

This is an overview of the topic and the tools you can use. The commands and compiler options in these examples may not apply to your code as shown, and it may be necessary to use different options to get the best results.
 
Contents
 
Performance Counters of Interest
Reading the Performance Counters
Using Performance Counters with the Sun ONE Studio Performance Tools
Interpreting Analyzer Output
Common Solutions
Conclusions
Links to More Information
 
Performance Counters of Interest

There are a variety of performance counters available on the UltraSPARC-IIICu processor. See the processor User's Guide for a complete overview. The following table shows which performance counters may be of interest. For clarity the table is broken into three groups of events.

Interesting UltraSPARC-IIICu Performance Counters

| Type of Event | Counter | Description |
|---|---|---|
| Data layout | DTLB_miss | Number of times that the conversion of a virtual data address to a physical data address was not immediately available. Each occurrence costs about 100 cycles. |
| Data layout | Re_DC_miss | Number of cycles lost due to data not being in the Level-1 cache. |
| Data layout | Re_EC_miss | Number of cycles lost due to data not being in the Level-2 cache (these cycles are included in the Level-1 cache miss cycles, i.e. Re_EC_miss < Re_DC_miss). |
| Data layout | Rstall_storeQ | Number of cycles where the processor was stalled waiting for store operations to complete. |
| Code layout | ITLB_miss | Number of times that the conversion of a virtual instruction address to a physical instruction address was not immediately available. Each occurrence costs about 100 cycles. |
| Code layout | Dispatch0_IC_miss | Number of cycles where no instructions were dispatched because of an instruction cache miss. |
| Code layout | Re_RAW_miss | Number of cycles stalled due to Read after Write of data. |
| Code layout | Dispatch0_mispredict | Number of cycles where no instructions were dispatched because of a mispredicted branch. |
| Load-use stalls | Rstall_FP_use | Number of cycles stalled because the processor was waiting for a floating point value to be generated. |
| Load-use stalls | Rstall_IU_use | Number of cycles stalled because the processor was waiting for an integer value to be generated. |
 
Reading the Performance Counters

There are three possible tools that you might use to read and interpret the hardware performance counters.

  • Starting with the Solaris 8 operating environment, there are two very useful tools for reading the performance counters - cpustat and cputrack. cpustat is run as root, and reports the performance counter statistics on a system-wide basis. cputrack is run on a single application and reports only those events which occur to that application. (Links to their man pages appear below.)

  • The tool har is available to provide you with synthetic system-wide performance metrics (like flops). (See links below.)

  • The Sun ONE Studio Performance Tools can read performance counters and attribute the performance counter events to the location in the code where the events occurred.

My own approach is to use cputrack to gather high-level summary stats of the application, and then to collect detailed information on just the top scoring events using the Sun ONE Studio Performance Tools. I will run through an example of doing this on a fictitious code:

Our Code Example
$ more sumtest.c
int main()
{
  double d[20000];
  double total,total1;
  int count, rpt;

  for (count=0; count<20000; count++)
    d[count]=0.01;

  total=1;
  total1=0.5;
  for (rpt=0; rpt<50000; rpt++)
  {
    for (count=0;count<20000;count++)
      total+=total1*d[count];
    total1=total/1.776;
  }
  if (total==0.5) return 1 ; else return 0;
}
 

The above code was compiled in the following way:

cc -g -O -o sumtest sumtest.c


The -g is necessary to include debug information which will greatly help with the analysis later.

You can generate debug information for your application by compiling with the -g flag. Note the following:

  • At levels of optimisation higher than -xO3, the compiler performs some transformations which make the code harder to understand under the debugger. At these levels, the generated code (and its performance) is the same whether or not debug information is present.

  • At levels of optimisation -xO3 and below, the compiler favours clarity of the generated code over performance, so compiling with -g has some performance impact.

With C++, -g tells the compiler to avoid doing some inlining of routines. Because C++ code often has many short methods that benefit from inlining, this can have a significant performance impact. In order to get the best performance while still generating debug information, use the C++ compiler flag -g0, which generates the debug information and does the inlining - however it will make the resulting code harder to debug.

Both cpustat and cputrack can either report on just a single pair of performance counters, or they can rotate through a selection of counters. If your application executes for a sufficiently long time, and you're happy that the application's behaviour is reasonably homogeneous (i.e. the events are evenly spread through the entire runtime, and not bunched up), then rotating through the performance counters is a reasonable approach. If you are not sure, you can always run the application several times, collecting one pair of counters on each run.

Invoking cputrack in the following way will collect information on the most useful counters.

 
Running cputrack
cputrack -nfe -T 1 \
-c pic0=Cycle_cnt,pic1=DTLB_miss,sys \
-c pic0=Instr_cnt,pic1=Re_DC_miss,sys \
-c pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys \
-c pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys \
-c pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys \
<app> <params>
 

Every second, cputrack will report the statistics for every thread in the program - these are reported in rows labelled 'tick'. At the end of the application's run, cputrack will report summary data for each pair of performance counters - these rows are labelled 'exit'. The following is example data.

Output from cputrack
# cputrack -nfe -T 1 \
-c pic0=Cycle_cnt,pic1=DTLB_miss,sys \
-c pic0=Instr_cnt,pic1=Re_DC_miss,sys \
-c pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys \
-c pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys \
-c pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys \
sumtest
1.018 14135 1 tick 1065399307 649 # pic0=Cycle_cnt,pic1=DTLB_miss,sys
2.168 14135 1 tick 752041715 3709955 # pic0=Instr_cnt,pic1=Re_DC_miss,sys
3.128 14135 1 tick 41193 212699 # pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys
4.058 14135 1 tick 58242 542371011 # pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys
5.048 14135 1 tick 6621 3203 # pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys
6.058 14135 1 tick 1064502279 13 # pic0=Cycle_cnt,pic1=DTLB_miss,sys
7.058 14135 1 tick 932935082 4051475 # pic0=Instr_cnt,pic1=Re_DC_miss,sys
8.058 14135 1 tick 29433 210045 # pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys
6.058 14135 1 exit 2129901586 662 # pic0=Cycle_cnt,pic1=DTLB_miss,sys
7.058 14135 1 exit 1684976797 7761430 # pic0=Instr_cnt,pic1=Re_DC_miss,sys
8.058 14135 1 exit 70626 422744 # pic0=Dispatch0_IC_miss,pic1=Re_EC_miss,sys
8.932 14135 1 exit 122575 1052317277 # pic0=Rstall_storeQ,pic1=Rstall_FP_use,sys
5.048 14135 1 exit 6621 3203 # pic0=Rstall_IU_use,pic1=Re_RAW_miss,sys
 

The lines of interest are the totals - the rows labelled 'exit'. The columns show:

  • The time that the sample was collected
  • The pid of the process (in this case 14135)
  • The ID of the LWP that the counts refer to (in this case there is only 1 LWP, and so it is identified as 1).
  • A label which is either tick or exit - tick indicates the completion of a single second of data collection (notice that the default interval is a second, but this can be overridden), exit indicates that the numbers on the line are totals for the counter observed over the entire run of the application.
  • The event count for the first counter
  • The event count for the second counter
  • A comment indicating what the two counters were counting.

The next thing to do is to present this data in a readable way, as shown in the following table.

Data Summary

| Counter | Event Count | Converted to Cycles | % of Total Cycles |
|---|---|---|---|
| Cycle_cnt | 2,129,901,586 | 2,129,901,586 | N/A |
| Instr_cnt | 1,684,976,797 | 0 | N/A |
| DTLB_miss | 662 | 66,200 | 0.00% |
| Re_DC_miss | 7,761,430 | 7,761,430 | 0.36% |
| Re_EC_miss | 422,744 | 422,744 | 0.02% |
| Dispatch0_IC_miss | 70,626 | 70,626 | 0.00% |
| Rstall_storeQ | 122,575 | 122,575 | 0.01% |
| Rstall_FP_use | 1,052,317,277 | 1,052,317,277 | 49.41% |
| Rstall_IU_use | 6,621 | 6,621 | 0.00% |
| Re_RAW_miss | 3,203 | 3,203 | 0.00% |
 

Most of these events are already recorded in terms of the number of cycles. The exception is DTLB_miss, which counts occurrences: recall that each TLB miss is estimated to cost about 100 cycles, so the 662 misses convert to 66,200 cycles.

Calculating Instructions Per Clock (IPC)
 

One of the often quoted metrics is IPC - or instructions per clock. Using the above table we can calculate the IPC as 0.79. This means that just under one instruction is completed every cycle. IPC is sometimes used as a measure of performance - the higher the IPC the better. Unfortunately, it's not a good metric for this task.

IPC can be made higher in two ways. The first is to decrease the number of cycles that the application takes to run - this is the situation that you would prefer. The second is to increase the number of instructions issued while the number of cycles stays the same or gets worse - this is not such a good outcome.

 

Looking at the example data it is apparent that this application suffers from FP-use stalls - half the total number of cycles is spent waiting for FP data. We can use this information to collect some detailed data using the Sun ONE Studio Analyzer.

 
Using Performance Counters with the Sun ONE Studio Performance Tools

The vital component of the Sun ONE Studio Performance Tools is the Analyzer (see the links at the end of this article). Here is a quick summary of the Analyzer's features.

  • The Analyzer looks at what your application is doing one hundred times per second (by default), plus it can also look every few hundred thousand performance counter events. Whenever the Analyzer looks, it records where in its code the application was executing. Consequently, at the end of a run, you can determine which routines were hot, and where the events occurred.

  • The Analyzer can attribute time to lines of source code - so long as the application was compiled with -g (or -g0 for C++). Compiling with -g will have no effect on performance if the application was compiled at high levels of optimization (-xO4, -xO5), but if the application was compiled at low levels of optimisation (-O, -xO2, -xO3) then some minor optimizations are disabled to make the output clearer. A suggestion is to compile with -g whenever possible as it will really make a difference to both debugging your application and to investigating its performance.

  • The Analyzer has two parts - the tool collect which collects the data, and the GUI analyzer which displays the results. You can invoke collect multiple times on the same application with the same or different parameters, and load all the experiments into the GUI at the same time - this allows you to get really good code coverage on your application. If you prefer there is also a command line version of the Analyzer called er_print. The examples on this page use er_print to generate the appropriate output.

Run your application under collect as follows.

Running With collect
collect -p on -h <performance counter 1>,on,<performance counter 2>,on <app> <params>
 

The flag "-p on" tells collect to use time based profiling, and selects the default interval (10ms) between samples. The flag -h tells collect to collect counter overflow events on the specified performance counters.

Picking the instruction count and FP-use stalls counters on the application above we can do the following.

Collecting Data with Two Counters
$ collect -p on -h Instr_cnt,on,Rstall_FP_use,on sumtest
Creating experiment database test.1.er ...
 

Here we have collected the first experiment (test.1.er) and will look at the results. Rather than showing screen shots of the GUI, I'll use the command line version of the tool (er_print) to display text output.

Our first display shows which functions are hot.

Displaying Test Results with er_print
$ er_print -limit 20 -metrics e.user -func test.1.er
current: e.user:name
Functions sorted by metric: Exclusive User CPU Time

Excl. Name
User CPU
sec.
8.830 <Total>
8.830 main
0. <Unknown>
0. __collector_open_experiment
0. __open
0. _audit_objclose
0. _exithandle
....
 

Here we ask er_print to display the following.

  • Only show the first 20 functions (-limit 20). Often applications can have hundreds of routines, and telling er_print to only display the first ones means that the hottest routines have not scrolled off the screen.

  • Only show the exclusive user time metric (-metrics e.user). A multitude of metrics are collected; the most useful ones are often user time, system time, and wall time. The metrics are typically available in two flavours - exclusive and inclusive. Exclusive means attributable to this routine and this routine alone. Inclusive means attributable to this routine and all the routines that it calls.

  • Show a list of the time attributed to functions (-func)

  • Use the experiment test.1.er

Unsurprisingly, we can see that the function main is hot. There are various other functions that also appear, but basically no time was spent in them. We need to look at main in a bit more detail.

Order of Parameters for er_print
 

er_print, like most applications, interprets the command line from left to right, which means that options on the left will apply to those on the right, and that options on the right will override those on the left. We would have very different output if I had swapped the position on the command line of the -func flag and the -metrics <metrics> flag.

 
 
Interpreting Analyzer Output

There are two basic rules for interpreting the output of the Analyzer:

  • The time attributed to an instruction is the time that the instruction spent waiting to be executed - not the time that the instruction spent executing. So if you see an instruction which has a lot of time attributed to it - look at the previous instructions to determine which one is really to blame for the time. (Note that this means that the source code attribution of timing may not be totally accurate.)

  • The processor hesitates before attributing time to floating point instructions, this means that in a sequence of floating point instructions, it is sometimes the case that the first couple do not get any time attributed to them - even when they really should.

Let's have a look at a couple of examples:

   0.            10980:  inc         9, %l6
   0.070         10984:  ld          [%i4 + %g2], %f16
   2.590         10988:  nop
   0.            1098c:  cmp         %l4, %l3
 

In the above example, the load instruction is actually taking all the time, but the 'nop' following it gets attributed with the 2.59 seconds of time.

   0.010        5699f0:  ldd         [%l0], %f0
   0.           5699f4:  fsubd       %f6, %f4, %f26
   2.630        5699f8:  ldd         [%l0 + %o1], %f2
 

In the above example, the first floating point load double actually causes the delay, but the processor hesitates in reporting the delay while it completes the floating point subtraction, and it finally reports the delay on the second load.

Interpreting the Analyzer output takes some skill. However, in most cases it is not too hard if you remember the above two rules. Note that there are other situations which can be confusing - for example a branch target might accumulate a large amount of time. The Analyzer documentation contains more details.

Returning to the example, let's have a look at how the attribution of time to the source code works:

$ er_print -metrics e.user:e.Rstall_FP_use -src main test.1.er
current: e.user:e.Rstall_FP_use:name
Source file: ./sumtest.c
Object file: ./sumtest.o
Load Object: ./sumtest

   Excl.     Excl.
   User CPU  Rstall_FP_use
    sec.     Events sec.
                            1. int main()
   0.        0.             2. {
                            3. double d[20000];
                            4. double total,total1;
                            5. int count, rpt;
                            6.
   0.        0.             7. for (count=0; count<20000; count++)
   0.        0.             8. d[count]=0.01;
                            9.
   0.        0.            10. total=1;
   0.        0.            11. total1=0.5;
   0.        0.            12. for (rpt=0; rpt<50000; rpt++)
                           13. {
   0.        0.            14. for (count=0;count<20000;count++)
## 8.830     4.742         15. total+=total1*d[count];
   0.        0.            16. total1=total/1.776;
                           17. }
   0.        0.            18. if (total==0.5) return 1 ; else return 0;
                           19. }
 

In this case I have asked er_print to:

  • Use the metrics exclusive user CPU time, and exclusive Rstall_FP_use events (-metrics e.user:e.Rstall_FP_use). Rstall_FP_use is the hot performance counter that we collected the data for. It is quite useful to limit the number of columns of counter data reported, because interpreting the output can be tricky, and it's almost impossible to do if the lines wrap around.
  • Show the source for the routine main (-src main). To show the source code you need to have compiled your application with debug information.

So immediately you can see (as you would expect) that the bulk of the time is spent on line 15, which does the summation part of the code. We need to drill down a bit further to see what's going on at the assembly code level. Here's the disassembly for the hottest routine; most of the code has been omitted for clarity.

$ er_print -metrics e.user:e.Rstall_FP_use -dis main test.1.er
current: e.user:e.Rstall_FP_use:name
Source file: ./sumtest.c
Object file: ./sumtest.o
Load Object: ./sumtest

   Excl.     Excl.
   User CPU  Rstall_FP_use
    sec.     Events sec.
....
   0.        0.            [12]    106d8:  add         %i2, 848, %l6
                            13. {
                            14. for (count=0;count<20000;count++)
   0.        0.            [14]    106dc:  clr         %i1
   0.        0.            [14]    106e0:  mov         %l4, %l7
                            15. total+=total1*d[count];
   0.400     0.903         [15]    106e4:  ld          [%l7], %f0
   0.550     0.016         [15]    106e8:  inc         %i1
   0.460     0.943         [15]    106ec:  ld          [%l7 + 4], %f1
   0.720     0.            [15]    106f0:  cmp         %i1, %l5
   0.        0.            [15]    106f4:  fmuld       %f6, %f0, %f2
   0.        0.            [15]    106f8:  faddd       %f8, %f2, %f8
## 6.240     2.879         [15]    106fc:  bl          0x106e4
   0.460     0.            [15]    10700:  inc         8, %l7
                            16. total1=total/1.776;
   0.        0.            [16]    10704:  inc         %i3
...
 

In this case I have asked er_print to:

  • Use the metrics exclusive user CPU time and exclusive number of cycles spent waiting for FP data (-metrics e.user:e.Rstall_FP_use).
  • Disassemble the routine main (-dis main). Note that for real codes it is probably useful to use the option -outfile <filename> which will send all the following output to the specified file. The disassembly can get quite long, and it is probably best to either view it in an editor, or run the analyzer GUI.

You can see that the bulk of the time appears to be spent on the branch instruction. However, from the discussion above, you'll be looking to see if the time is really caused by other instructions.

You can see nearly 3 seconds of FP use stalls on the branch. So we can be sure that the bulk of the time spent on that branch instruction is due to the preceding floating point instructions.

So let's talk through this snippet of code.

Notice that there are two single-precision floating point load instructions, one to load %f0 and one to load %f1. What is happening here is that the compiler is assuming that the data is only 4-byte aligned, and therefore uses two four-byte loads rather than a single eight-byte load. If you look at the fmuld instruction, it consumes the double-precision floating point number %f0 (the double-precision floating point register %f0 is made up of the two single-precision registers %f0 and %f1). The compiler flag -dalign would help the compiler use a single floating point load double rather than the two floating point load singles.

The time attributable to the load instructions appears on the instructions following them. As expected there is little time here - the data is cache resident. However, if the arrays were larger, it would be useful for the compiler to insert prefetch instructions - the compiler will do this if the flags -xprefetch -xdepend -xtarget=ultra3 -xarch=v8plusa are used on the compile line.

Now we have the two floating point instructions. First the fmuld. Unfortunately the data that this instruction requires is provided by the second load instruction - this instruction will still be completing when the multiply gets issued - so there will be a delay at this point. Then there's the add. The add requires the data from the multiply - and it takes a couple of cycles for the multiply to complete - so once again there's another FP use stall.

Finally there is a branch instruction.

What can we do about this code? We can try to recompile with -fast to let the compiler be more aggressive with its optimizations. This may not be suitable for all codes, but we can give it a try for this one.

Trying -fast
$ cc -g -fast -o sumtest sumtest.c
$ collect sumtest
Creating experiment database test.2.er ...
$ er_print -metrics e.user -func test.2.er
current: e.user:name
Functions sorted by metric: Exclusive User CPU Time

Excl. Name
User CPU
sec.
4.990 <Total>
4.990 main
0. __collector_open_experiment
0. __open
0. _init
0. _open
0. _private_close
0. _rt_boot
 

Now you can see that by increasing the optimization level from -O to -fast we went from nearly 9 seconds of runtime down to about 5 seconds. Let's have a look at what it did to the hot bit of code to get this performance gain.

$ er_print -metrics e.user -dis main test.2.er
current: e.user:name
Source file: ./sumtest.c
Object file: ./sumtest.o
Load Object: ./sumtest

Excl.
User CPU
sec.
...
Loop below pipelined with steady-state cycle count = 1 before unrolling
Loop below unrolled 6 times
Loop below has 1 loads, 0 stores, 0 prefetches, 1 FPadds, 1 FPmuls, and 0 FPdivs per iteration
14. for (count=0;count<20000;count++)
15. total+=total1*d[count];
0. [15] 106d0: sethi %hi(0x27000), %g1
....
0. [14] 10764: add %g1, %fp, %l5
0. [15] 10768: fmuld %f6, %f24, %f2
0. [15] 1076c: faddd %f8, %f16, %f4
0.210 [15] 10770: inc 6, %i5
0.080 [15] 10774: ldd [%l5], %f24
0. [15] 10778: fmuld %f6, %f60, %f30
0. [15] 1077c: faddd %f10, %f18, %f0
## 1.450 [15] 10780: cmp %i5, %i0
0. [15] 10784: ldd [%l5 + 8], %f60
0. [15] 10788: fmuld %f6, %f26, %f16
0. [15] 1078c: faddd %f12, %f20, %f8
0.130 [15] 10790: ldd [%l5 + 16], %f26
## 1.470 [15] 10794: inc 48, %l5
0. [15] 10798: fmuld %f6, %f24, %f18
0. [15] 1079c: faddd %f14, %f22, %f10
0.080 [15] 107a0: ldd [%l5 - 24], %f24
0. [15] 107a4: fmuld %f6, %f60, %f20
0. [15] 107a8: faddd %f4, %f2, %f12
0.150 [15] 107ac: ldd [%l5 - 16], %f60
0. [15] 107b0: fmuld %f6, %f26, %f22
0. [15] 107b4: faddd %f0, %f30, %f14
## 1.440 [15] 107b8: ble,pt %icc,0x10768
0.050 [15] 107bc: ldd [%l5 - 8], %f26
...
 

Let's discuss the commentary that the compiler inserts into the output. Here it is telling us:

  • That the loop was unrolled six times. This means that six iterations of the loop were done back-to-back. This optimisation reduces the number of times that the processor has to take a branch.
  • The loop was pipelined. This is a more complex optimisation where the next iteration of the loop is started before the current one completes - think of it as taking the six unrolled iterations of the loop and interleaving them so that one instruction is done from the first, then an instruction from the second etc. You can see this by looking at the load at 0x10774: it loads %f24, which is used by the multiply at 0x10798. That multiply generates %f18, which is not consumed until the add at 0x1077c on the next trip around the loop body.
  • The compiler commentary also tells us of the instruction make up of the loop - the number of each type of instruction per iteration.

In this case the big gain comes from pipelining the loop: there's now sufficient time between one instruction generating a result and the instruction that requires that result.

There are a few other things that it is worth pointing out about the loop:

  • -fast includes the -dalign flag, so the two loads of single precision values have now become a single load of a double precision value.
  • The flag which allows the compiler freedom to do the floating point optimisations is -fsimple=2, which is included in -fast. This flag tells the compiler that it is not necessary to do the floating point calculations in the exact same order that they are specified in the code. So if you look at the floating point adds in the above disassembly, you can identify that what was in the source a simple summation, has been split into two separate summations (which will be added together after the loop completes). %f4, %f8, and %f12 are one group, and %f0, %f10, and %f14 are another group.
 
Common Solutions

While the causes of a high number of events for a given counter are very dependent on the characteristics of the application, the following table outlines some suggestions for what you might try depending on the performance counter events.

Suggested Solutions

| Condition | Counters | Solutions |
|---|---|---|
| Data Layout | DTLB_miss | You are using a lot of data in your application - so the processor needs to be able to map a large amount of memory. There is the facility in Solaris 9 to use large pages (the compiler flag -xpagesize is available to assist in doing this). It may also be the case that you are using data structures which have a low data-density. For example, you may only be accessing a single element from a large data structure. Check the data structures to determine whether they are being used efficiently or not. |
| Data Layout | Re_DC_miss, Re_EC_miss | L1 and L2 cache misses indicate that the application is spending time having to go to memory. The compiler can do a good job at reducing the time spent in cache misses if the application is recompiled with -xprefetch enabled - you should also include the flags -xdepend -xtarget=ultra3 and -xarch=v8plusa. If the problem persists, take a look at the data structures in the application and see if they are using memory efficiently. Can the accesses be made such that adjacent memory locations are used? This can be done by reordering the elements in structures so that the fields which are accessed frequently are placed close together, or by going through arrays one adjacent element at a time. |
| Data Layout | Rstall_storeQ | This indicates that stores are being put into the store queue faster than they can drain to memory. Recompiling with prefetch can improve this. Changing the way data is stored so that stores to adjacent memory locations appear together in the code will also help. |
| Code Layout | ITLB_miss | Instruction TLB misses probably indicate that the application has a lot of code in it. Using large pages to map the application into memory will help. It is also possible that the compiler can generate a more optimal code layout through one of the following - profile feedback (-xprofile=collect: and -xprofile=use:), interprocedural optimisation (-xipo), the link time optimiser (-xlinkopt), or mapfiles (-xmapfile=). Mapfiles can be generated from the Analyzer, and tell the compiler the best order in which to lay out the routines in memory. |
| Code Layout | Dispatch0_IC_miss | Instruction cache misses are similar to ITLB misses, but less severe. The solutions are broadly similar. Use mapfiles to organise the layout of the hot routines in memory. Use profile feedback to optimise branches in order to make the normal case the one with the linear code-path. Add -xipo to get crossfile optimisations, which will inline short routines. In Sun ONE Studio 8, use -xlinkopt to invoke the link time optimiser to improve instruction cache utilisation. |
| Code Layout | Re_RAW_miss | This means that some locations in memory are being accessed in such a way that the processor is finding it hard to determine whether the stored data should be passed directly to the following load. The easiest solution to this is to recompile with the latest compiler and specify -xtarget=ultra3 to get scheduling for UltraSPARC-IIICu. If the condition persists, alter the source code to avoid stores and loads from locations that are very close in memory. |
| Code Layout | Dispatch0_mispredict | If this counter is high, it means that there are many mispredicted branches. Use profile feedback to improve the scheduling of the branches so as to improve branch predictability. |
| Load-Use Stalls | Rstall_FP_use | This counter indicates that there are floating point instructions that are being delayed whilst they wait for the results of previous floating point operations. Recompile the application with -fsimple=2 (if possible) to substitute simpler (but non-IEEE 754 compliant) code sequences. Or if this is not appropriate, locate the hot spots in the code and see if there is an alternative way of coding them to avoid the problem. In particular watch out for FP divides and square roots, which are time consuming operations. |
| Load-Use Stalls | Rstall_IU_use | This counter indicates that there are integer instructions waiting for the completion of previous integer operations. The simplest solution is to recompile with the latest compiler and specify -xchip=ultra3 to get appropriate instruction latencies for the UltraSPARC-IIICu platform. However, for C and C++ applications it may be possible to recompile with aliasing information (-xalias_level or -xrestrict) if appropriate - these flags are quite complex, and you should be sure that you understand what you are telling the compiler before you use them. |
 
 
Conclusions

You should now have a much better idea about how to use the performance counters to highlight performance opportunities in your application. You should also understand how to use this knowledge to drill down in more detail using the Sun ONE Studio Analyzer. If you do follow this procedure, you should not be surprised to find that you can make significant performance improvements for your application. Obviously the performance gains you make will depend on how well optimised the original code was, how recent the compiler you are using is, and what your application is trying to do. Having said that, we often find that performance gains ranging from 10% to orders of magnitude are possible just by looking at what the counters and the Analyzer are telling us.

 
Links to More Information
 
cputrack Command
cpustat Command
Sun ONE Studio Performance Analysis Tools
Technical Article: Performance Analysis and Monitoring Using Hardware Counters in Solaris.
UltraSPARC-IIICu User's Guide
Compiling for the UltraSPARC-IIICu Processor
 
 

About the Author

Darryl Gove is a staff engineer in the Compiler Performance Engineering group at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.