Performance Analysis and Monitoring Using Hardware Counters

 
By Frédéric Parienté, December 2001  

Synopsis

The hardware performance counters found in the UltraSPARC and Pentium microprocessors provide a powerful means to monitor and analyze the performance of a given application or an entire system. Starting with the Solaris 8 Operating Environment (OE), there is increasing software support for hardware counters and easier access for application developers, including programming interfaces and utility tools. On top of the new libraries, a tool called the Hardware Activity Reporter (HAR) has been built to answer the needs of application programmers. HAR combines low-level processor events into common performance metrics, such as MIPS, FLOPS and cache miss rates. In two case studies in the fields of CFD and MCAD, HAR proves useful in identifying performance bottlenecks and quantifying code-tuning improvements.

Contents

Introduction
Solaris 8 OE Support for Hardware Performance Counters
Solaris 8 OE Libraries
Solaris 8 OE Utility Tools
The Hardware Activity Reporter (HAR)
Performance Metrics for UltraSPARC-II Chip
The Hardware Activity Reporter Tool
Case Studies
Case Study 1: CFD Code
Case Study 2: MCAD Code
Additional Reading

1 Introduction

The UltraSPARC and Pentium microprocessors contain hardware performance counters that count a range of processor events, such as cache misses, pipeline stalls and floating-point operations. Statistics of processor events can be collected in hardware with little or no overhead, making these counters a powerful means to monitor an application and analyze its performance. Such counters have the advantage of being non-intrusive: they do not require recompilation of applications, and they are available in every UltraSPARC-based system that Sun Microsystems currently ships. Nonetheless, these counters are not widely used beyond hardware specialists because of the lack of programming interfaces and sparse documentation.

Starting with release 8 of the Solaris OE, support for hardware counters has been increasing, with the availability of utility tools and public programming interfaces. This support for hardware counters at the operating system level is a major milestone for the Solaris software developer community but is not a direct answer to its needs. Application programmers work at a higher level, where the common metrics are combinations of the low-level event counts accumulated in hardware. There is a need for higher-level tools that hide the complexity of the underlying hardware design and present, in a uniform manner, the common metrics that programmers expect: millions of instructions per second (MIPS), floating-point operations per second (FLOPS), and cache miss rates.

This paper covers the Solaris 8 platform's support for hardware performance counters and introduces the Hardware Activity Reporter, a tool that builds on the new Solaris interfaces and aims to address the needs of application programmers.

2 Solaris 8 OE Support for Hardware Performance Counters

In February 2000, Sun Microsystems announced a new release of the Solaris Operating Environment, Solaris 8. With the latest release of Solaris 8 OE, Sun began to deliver public interfaces to the UltraSPARC and Pentium hardware performance counters. A series of application programming interfaces (APIs) has been made available as shared libraries to program the various hardware counters, along with the utility tools cpustat and cputrack to access the microprocessor (CPU) hardware counters from the command line. The Solaris 8 platform also provides similar libraries and tools to access the hardware counters of the system bus and I/O boards.

2.1 Solaris 8 OE Libraries

Solaris 8 software ships with two libraries directly related to the usage of the CPU performance counters: libcpc, to access the CPU counters, and libpctx, to track a process. Using these APIs, one can instrument code to access the hardware performance counters and collect performance information. The steps to instrument a piece of code are:

  1. Check the versions and accessibility of the hardware performance counters with cpc_version() and cpc_access();

  2. Initialize a cpc_event_t data structure, using cpc_getcpuver() and cpc_strtoevent(), to be used in conjunction with the counters;

  3. Bind the data structure to the CPU using cpc_bind_event();

  4. Read the counters, as desired, using cpc_take_sample();

  5. Release the CPU using cpc_rele() when finished.

The attractive aspect of these interfaces is their simplicity. Once the counters are initialized, a program reads them by a single call to cpc_take_sample(). No matter how easy-to-use these APIs are, however, people are usually reluctant to instrument code. To avoid instrumenting code, Solaris 8 software also ships with a couple of command-line based utilities, cpustat and cputrack, that report on CPU performance counters.

2.2 Solaris 8 OE Utility Tools

The cpustat utility reports on CPU performance counters in a system-wide fashion. cpustat is invoked from the command line, and a pair of processor events to monitor must be passed as an argument. Events come in pairs because the current UltraSPARC and Pentium microprocessors have two hardware performance counters that must be programmed simultaneously. Optional arguments are the sampling interval and count, i.e., how often the counters are read and how many times they are read. Multiple pairs of events can be specified; in that case, the system alternates between the pairs from one sample to the next. Because cpustat is a system-wide utility, it must be run as root.

Sample Output from cpustat on a 4-CPU Machine

# cpustat -c IC_ref,IC_hit -c Cycle_cnt,Instr_cnt 1 3
time	cpu	event	pic0	pic1
1.007	1	tick	152762	133720	# pic0=IC_ref,pic1=IC_hit
1.007	0	tick	13907	12547	# pic0=IC_ref,pic1=IC_hit
1.007	3	tick	114929	96667	# pic0=IC_ref,pic1=IC_hit
1.007	2	tick	178574	161523	# pic0=IC_ref,pic1=IC_hit
2.007	1	tick	241815	110649	# pic0=Cycle_cnt,pic1=Instr_cnt
2.007	2	tick	240013	121404	# pic0=Cycle_cnt,pic1=Instr_cnt
2.007	0	tick	313879	106311	# pic0=Cycle_cnt,pic1=Instr_cnt
2.007	3	tick	2079680	549262	# pic0=Cycle_cnt,pic1=Instr_cnt
3.007	1	tick	6122	4997	# pic0=IC_ref,pic1=IC_hit
3.007	2	tick	29094	24663	# pic0=IC_ref,pic1=IC_hit
3.007	0	tick	24966	19550	# pic0=IC_ref,pic1=IC_hit
3.007	3	tick	196420	174700	# pic0=IC_ref,pic1=IC_hit
3.007	4	total	716774	628367	# pic0=IC_ref,pic1=IC_hit
2.007	4	total	2875387	887626	# pic0=Cycle_cnt,pic1=Instr_cnt
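The per-CPU counts in this output can be combined by hand or by a small script. The following Python sketch is not part of cpustat or HAR; it only assumes the column layout shown above (time, cpu, event, pic0, pic1, then a comment naming the events). It sums the "tick" lines of the sample and derives a system-wide instruction cache miss rate and CPI from them. The names sum_events and CPUSTAT_OUTPUT are ours.

```python
from collections import defaultdict

# "tick" lines copied from the sample cpustat output above.
CPUSTAT_OUTPUT = """\
1.007 1 tick 152762 133720 # pic0=IC_ref,pic1=IC_hit
1.007 0 tick 13907 12547 # pic0=IC_ref,pic1=IC_hit
1.007 3 tick 114929 96667 # pic0=IC_ref,pic1=IC_hit
1.007 2 tick 178574 161523 # pic0=IC_ref,pic1=IC_hit
2.007 1 tick 241815 110649 # pic0=Cycle_cnt,pic1=Instr_cnt
2.007 2 tick 240013 121404 # pic0=Cycle_cnt,pic1=Instr_cnt
2.007 0 tick 313879 106311 # pic0=Cycle_cnt,pic1=Instr_cnt
2.007 3 tick 2079680 549262 # pic0=Cycle_cnt,pic1=Instr_cnt
3.007 1 tick 6122 4997 # pic0=IC_ref,pic1=IC_hit
3.007 2 tick 29094 24663 # pic0=IC_ref,pic1=IC_hit
3.007 0 tick 24966 19550 # pic0=IC_ref,pic1=IC_hit
3.007 3 tick 196420 174700 # pic0=IC_ref,pic1=IC_hit
"""


def sum_events(text):
    """Sum pic0/pic1 over all 'tick' lines, keyed by event name."""
    totals = defaultdict(int)
    for line in text.splitlines():
        data, _, comment = line.partition("#")
        fields = data.split()
        if len(fields) != 5 or fields[2] != "tick":
            continue  # skip headers, totals and malformed lines
        # The comment looks like "pic0=IC_ref,pic1=IC_hit".
        names = dict(item.split("=") for item in comment.strip().split(","))
        totals[names["pic0"]] += int(fields[3])
        totals[names["pic1"]] += int(fields[4])
    return dict(totals)


totals = sum_events(CPUSTAT_OUTPUT)
icache_miss_rate = 1 - totals["IC_hit"] / totals["IC_ref"]
cpi = totals["Cycle_cnt"] / totals["Instr_cnt"]
print(f"I$ miss rate: {icache_miss_rate:.1%}, CPI: {cpi:.2f}")
```

The sums computed this way agree with the "total" lines that cpustat prints at the end of the run. Note that the two event pairs were sampled in different one-second intervals, so ratios should only be formed within a pair.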

The cputrack utility is similar to cpustat but only reports the contribution of a single process. A pair of processor events is a required argument as well as a command to execute or a process ID to which cputrack attaches. Because cputrack attaches to a single process, the counters suspend counting when that process is swapped out; they resume counting when the process is back on the CPU. There is no need to be root to run the cputrack utility as long as one is attaching to one's own processes.

Sample Output from cputrack

# cputrack -T 0 -fve -c Cycle_cnt,Instr_cnt sh -c date
time	pid	lwp	event		pic0	pic1
0.008	9526	1	init_lwp	0	0
0.021	9526	1	fork				# 9527
0.023	9527	1	init_lwp	0	0
0.025	9527	1	fini_lwp	93760	60136
0.025	9527	1	exec		93760	60136
0.000	9527	1	exec				# 'date'
0.033	9527	1	init_lwp	0	0
Tue Oct 24 16:37:52 MEST 2000
0.041	9527	1	fini_lwp	787164	435118
0.041	9527	1	exit		787164	435118
0.044	9526	1	fini_lwp	1085444	542027
0.044	9526	1	exit		1085444	542027

3 The Hardware Activity Reporter (HAR)

Many have found tools like cpustat useful and simple to use but, at the same time, too low-level to use for application tuning. As mentioned in the introduction, application programmers need to work at a higher level where the common metrics are MIPS, FLOPS, cache miss rates, i.e. a combination of the numbers output by cpustat. This section introduces the Hardware Activity Reporter, a tool that builds on the Solaris 8 libcpc library and combines the low-level counts into higher-level metrics more useful to application programmers.

3.1 Performance Metrics for UltraSPARC-II Chip

Application programmers are typically interested in the following metrics: cycles per instruction (CPI), FLOPS, MIPS, address bus percentage utilization, cache miss rates, branch and branch miss rates, and stall rates.

These performance metrics are computed from primitive event counts collected in hardware. Therefore, for a given metric to be available on a specific microprocessor, the CPU needs to have hardware support for all the underlying events involved. From now on, we will focus on the UltraSPARC-II microprocessor, on which HAR was first developed. However, keep in mind that HAR is not tied to this microprocessor and can be ported to every microprocessor supported by the Solaris 8 platform.

The UltraSPARC-II is a four-way superscalar microprocessor with a two-level cache hierarchy and the following features:

  • One floating-point add plus one floating-point multiply per cycle
  • Two integer operations per cycle
  • One load/store per cycle (counts as one integer operation)
  • Two level-1 caches for data (D$) and instructions (I$)
  • One common external level-2 cache (E$)

The chip has a theoretical peak CPI of 0.25; i.e., at 0.25 CPI all four units are simultaneously executing code. With respect to this architecture, the application programmer will be interested in knowing the achieved CPI, to know how efficiently the application is using the various microprocessor units, and the various cache miss rates, to know how efficiently the application is accessing data.

The UltraSPARC-II microprocessor supports the hardware events presented in Table 1. There are two Performance Instrumentation Counters, PIC0 and PIC1; therefore, two events can be measured simultaneously. A given event can typically be measured from only one of the two counters. There is a third counter, denoted %TICK, that is incremented at each clock cycle and is reset to zero at each power-on.

Table 1. UltraSPARC-II Performance Instrumentation

Event ID            Description                         Available in counter
Cycle_cnt           accumulated cycles                  pic0, pic1
Instr_cnt           number of instructions completed    pic0, pic1
IC_ref              I$ references                       pic0
IC_hit              I$ hits                             pic1
EC_ref              E$ references                       pic0
EC_hit              E$ hits                             pic1
EC_wb               E$ misses that do write-backs       pic1
EC_ic_hit           E$ hits from I$ misses              pic1
EC_rd_hit           E$ hits from D$ misses              pic0
EC_write_hit_RDO    E$ hits that do a RDO transaction   pic0
EC_snoop_cb         E$ snoop copy-backs                 pic1
EC_snoop_inv        E$ invalidates                      pic0
DC_rd               D$ read references                  pic0
DC_wr               D$ write references                 pic0
DC_rd_hit           D$ read hits                        pic1
DC_wr_hit           D$ write hits                       pic1
Load_use            stalls on load                      pic0
Load_use_RAW        stalls on read-after-write          pic1
Dispatch0_IC_miss   stalls on I$ miss                   pic0
Dispatch0_storeBuf  stalls on store                     pic0
Dispatch0_mispred   stalls on branch misprediction      pic1
Dispatch0_FP_use    stalls on floating-point unit       pic1

Remark 1. Stalls are counted for each clock when the associated condition is true.

Remark 2. Cycle_cnt, unlike %TICK, selectively counts user or kernel cycles.
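Because most events are tied to one counter, not every pair of events can be measured together. The following Python sketch encodes the counter availability of Table 1 and checks whether a pair is measurable. It also builds the -c arguments in the form used in the cpustat sample earlier. The helper names is_valid_pair and cpustat_event_args are ours and are not part of any Solaris interface.

```python
# Counter availability, transcribed from Table 1.
BOTH = {"Cycle_cnt", "Instr_cnt"}
PIC0_EVENTS = BOTH | {
    "IC_ref", "EC_ref", "EC_rd_hit", "EC_write_hit_RDO", "EC_snoop_inv",
    "DC_rd", "DC_wr", "Load_use", "Dispatch0_IC_miss", "Dispatch0_storeBuf",
}
PIC1_EVENTS = BOTH | {
    "IC_hit", "EC_hit", "EC_wb", "EC_ic_hit", "EC_snoop_cb",
    "DC_rd_hit", "DC_wr_hit", "Load_use_RAW", "Dispatch0_mispred",
    "Dispatch0_FP_use",
}


def is_valid_pair(pic0_event, pic1_event):
    """True if the first event can run on PIC0 and the second on PIC1."""
    return pic0_event in PIC0_EVENTS and pic1_event in PIC1_EVENTS


def cpustat_event_args(pairs):
    """Build '-c ev0,ev1' arguments for a list of (pic0, pic1) event pairs."""
    args = []
    for pic0_event, pic1_event in pairs:
        if not is_valid_pair(pic0_event, pic1_event):
            raise ValueError(f"cannot measure {pic0_event},{pic1_event} together")
        args += ["-c", f"{pic0_event},{pic1_event}"]
    return args


# The D$ miss rate needs four events, hence two alternating pairs.
dcm_pairs = [("DC_rd", "DC_rd_hit"), ("DC_wr", "DC_wr_hit")]
print(" ".join(["cpustat"] + cpustat_event_args(dcm_pairs)))
```

For instance, DC_rd and DC_wr are both PIC0 events, so they cannot be counted in the same interval; a metric that needs both has to be assembled from two alternating pairs.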

From the above UltraSPARC-II events, we compute the following metrics.

Table 2. UltraSPARC-II Performance Metrics

Metric  Formula
mips    Instr_cnt * CPU_clock / %tick
cpi     Cycle_cnt / Instr_cnt
bus     2 * (EC_ref - EC_hit + EC_wb) / (%tick * Bus_clock / CPU_clock)
dcm     1 - (DC_rd_hit + DC_wr_hit) / (DC_rd + DC_wr)
icm     1 - IC_hit / IC_ref
ecm     1 - EC_hit / EC_ref
dsr     (Load_use + Load_use_RAW + Dispatch0_storeBuf) / Cycle_cnt
isr     Dispatch0_IC_miss / Cycle_cnt
fsr     Dispatch0_FP_use / Cycle_cnt
bsr     Dispatch0_mispred / Cycle_cnt

Remark 3. DCM, ICM, ECM are the data, instruction, external cache miss rates.

Remark 4. DSR, ISR, FSR, BSR are the data, instruction, floating-point, branch stall rates. Stalls are the number of cycles when the chip is stalling because of a given resource. A data stall rate equates to waiting on data, typically a load. A floating-point stall rate equates to waiting for the next available floating-point unit. Branch stalls are the number of cycles recovering from a mispredicted branch.

Remark 5. The bus utilization does not take into account disk and DMA access. The factor of 2 comes from the fact that each bus cycle takes 2 clocks on the system bus.

Remark 6. CPU_clock and Bus_clock refer to the frequencies of the CPU and system bus. They are constant numbers for a given machine.

Remark 7. Miss and stall rates are presented here as per-reference ratios; per-reference ratios are interesting because they are dimensionless numbers and can be presented as percentages. Depending on the programmer's background, rates can also be computed as per-second and per-instruction ratios. Database people like per-instruction ratios because they are absolute numbers and relate easily to common database performance measures, such as the number of transactions per instruction.
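As an illustration, the ratio metrics and the bus utilization of Table 2 can be written as a single Python function over a dictionary of event counts. This is a sketch, not HAR's source code: the function name and the example counts are hypothetical, and the clock frequencies (400 MHz CPU, 100 MHz bus) are chosen only to make the arithmetic easy to follow.

```python
def ultrasparc2_metrics(counts, tick, cpu_clock, bus_clock):
    """Return Table 2 ratios as fractions (multiply by 100 for percentages)."""
    c = counts
    # Number of system-bus cycles elapsed while %tick advanced by `tick`.
    bus_cycles = tick * bus_clock / cpu_clock
    return {
        "cpi": c["Cycle_cnt"] / c["Instr_cnt"],
        "bus": 2 * (c["EC_ref"] - c["EC_hit"] + c["EC_wb"]) / bus_cycles,
        "dcm": 1 - (c["DC_rd_hit"] + c["DC_wr_hit"]) / (c["DC_rd"] + c["DC_wr"]),
        "icm": 1 - c["IC_hit"] / c["IC_ref"],
        "ecm": 1 - c["EC_hit"] / c["EC_ref"],
        "dsr": (c["Load_use"] + c["Load_use_RAW"] + c["Dispatch0_storeBuf"])
               / c["Cycle_cnt"],
        "isr": c["Dispatch0_IC_miss"] / c["Cycle_cnt"],
        "fsr": c["Dispatch0_FP_use"] / c["Cycle_cnt"],
        "bsr": c["Dispatch0_mispred"] / c["Cycle_cnt"],
    }


# Hypothetical counts, for illustration only (not measured data).
example_counts = {
    "Cycle_cnt": 1_000_000, "Instr_cnt": 400_000,
    "IC_ref": 100_000, "IC_hit": 90_000,
    "EC_ref": 20_000, "EC_hit": 18_000, "EC_wb": 500,
    "DC_rd": 80_000, "DC_wr": 20_000, "DC_rd_hit": 60_000, "DC_wr_hit": 15_000,
    "Load_use": 150_000, "Load_use_RAW": 10_000, "Dispatch0_storeBuf": 40_000,
    "Dispatch0_IC_miss": 50_000, "Dispatch0_FP_use": 30_000,
    "Dispatch0_mispred": 20_000,
}
metrics = ultrasparc2_metrics(example_counts, tick=1_000_000,
                              cpu_clock=400e6, bus_clock=100e6)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In practice the counts come from different event pairs, measured in different sampling intervals, so a tool has to accumulate or normalize them before forming these ratios; the sketch ignores that step.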

3.2 The Hardware Activity Reporter Tool

Based on the Solaris 8 hardware performance counter interfaces and the source code of cpustat, a utility tool called HAR has been developed in the C programming language. HAR is modular and microprocessor independent in the sense that knowledge about a specific chip (availability of hardware events, formulas to compute performance metrics) is plugged in by means of function pointers. For the UltraSPARC-II chip, HAR implements the performance metrics of Table 2, which can be presented as per-reference, per-second and per-instruction ratios (-r, -s, -i options at the command line).
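The plug-in idea can be pictured with a short sketch. HAR itself is written in C and uses function pointers; the Python below only mirrors that design with a table of callables and does not reproduce HAR's actual data structures. All names (ChipSupport, CHIPS, report) are hypothetical, and only two metrics are registered to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

Counts = Dict[str, int]


@dataclass(frozen=True)
class ChipSupport:
    """Chip-specific knowledge: known events and metric formulas."""
    events: FrozenSet[str]
    metrics: Dict[str, Callable[[Counts], float]]


ULTRASPARC_II = ChipSupport(
    events=frozenset({"Cycle_cnt", "Instr_cnt", "IC_ref", "IC_hit"}),
    metrics={
        "cpi": lambda c: c["Cycle_cnt"] / c["Instr_cnt"],
        "icm": lambda c: 1 - c["IC_hit"] / c["IC_ref"],
    },
)

# Supporting another chip means adding one more entry here.
CHIPS = {"UltraSPARC-II": ULTRASPARC_II}


def report(chip_name, counts):
    """Compute every metric registered for the chip from raw event counts."""
    chip = CHIPS[chip_name]
    unknown = set(counts) - chip.events
    if unknown:
        raise ValueError(f"events not supported on {chip_name}: {sorted(unknown)}")
    return {name: formula(counts) for name, formula in chip.metrics.items()}


print(report("UltraSPARC-II",
             {"Cycle_cnt": 300, "Instr_cnt": 100, "IC_ref": 50, "IC_hit": 45}))
```

The reporting code never mentions a specific chip; everything chip-dependent sits behind the table lookup, which is the property the function-pointer design provides in C.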

To summarize, HAR:

  • Requires no installation.
  • Needs root execution privilege.
  • Can report on both user and kernel modes.
  • Can report maximum, average, and per-CPU values inside a multiprocessor machine.
  • Has a command line interface, named har.
  • Outputs data similar to vmstat, line-by-line, as shown below.

Sample Output from HAR

# har -r 1 3
mips	bus	cpi	dcm	icm	ecm	dsr	isr	fsr	bsr
2	0.0	3.1	33.7	25.1	15.4	15.6	31.9	0.0	3.1
0	0.0	3.6	31.3	22.8	1.0	24.4	25.4	0.1	6.1
0	0.0	3.3	34.2	21.9	0.0	24.6	19.6	0.0	5.3

4 Case Studies

In this section, two case studies are presented that show the usefulness of the HAR tool. In the field of Computational Fluid Dynamics (CFD), we use HAR to characterize the code signature and to quantify the improvements from manually tuning the code. In the field of Mechanical Computer-Aided Design (MCAD), we use HAR to show the effect of the large memory pages feature of the Solaris platform.

4.1 Case Study 1: CFD Code

The following loop is extracted from a commercial CFD solver. It has been over-simplified for the sake of clarity. There is indirection in accessing the arrays, resulting in some random memory access.

Original Solver Loop

for i=1,n
	if test(a[i]) then w=w+a[i]*b[i]

With current compiler technology, the loop is not pipelined. Also, the UltraSPARC-II microprocessor has limited branch prediction capabilities. As a result, the chip stalls on the "if" test. The above loop has been replaced by the following modified solver loop.

Modified Solver Loop

f(x)	= x if test(x) true
	= 0 if not

for i=1,n
	a[i]=f(a[i])
	w=w+a[i]*b[i]

We expect this modified loop to improve the execution of the solver because function "f" can be inlined. It is implemented using Boolean operations rather than "if" tests, which enables the compiler to pipeline the loop. Based on the original or the modified loop, different versions of the CFD solver have been built using the various levels of compiler-generated prefetch available in the compilers of the Forte Developer 6 products: none, automatic, and explicit, where the arrays to be prefetched are specified in the source code by the programmer using directives. Table 3 shows the timings and performance metrics for the different versions of the CFD solver.

Looking at elapsed time for the original code, we see that adding automatic prefetch results in no gain. Explicit prefetch, however, gives some gain. This is a first interesting result: automatic prefetch did not identify the right arrays to prefetch, indicating that programming explicit prefetch was worth the effort. Also, as expected, the modified code runs faster than the original code. To better understand where the various performance gains come from, it is useful to represent the CPI breakdown graphically.

In the graph (Figure 1), the total value represents the total CPI. The CPI breaks down into the time waiting on the floating-point unit (FSR), the time waiting on data loads (DSR) and some remaining time that we will call execution CPI (EXE). Execution CPI represents the actual execution, and its numerical value can be related to the number of instructions executed in parallel. For instance, an execution CPI close to 0.25 means that, on average, four instructions are issued per cycle.

Table 3. Performance Metrics for CFD Case

                      time   mips   bus             dcm        ecm
                      (ms)          (% bandwidth)   (% refs)   (% refs)
original code         58     180    5               28         18
w. auto. prefetch     58     180    5               28         18
w. explicit prefetch  45     250    6               25         12
modified code         35     290    8               28         18
w. auto. prefetch     28     340    10              28         17

                      cpi    dsr     dsr          fsr     fsr
                             (cpi)   (% cycles)   (cpi)   (% cycles)
original code         2.5    1.4     56           0.5     21
w. auto. prefetch
w. explicit prefetch                 41           0.5     26
modified code         1.5    1.2     78           0.0     0
w. auto. prefetch     1.2    0.9     72           0.0     0

Remark 8. DSR and FSR are given as a fraction of CPI and a percentage of cycles.
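The execution CPI is not listed in Table 3, but it follows from the breakdown described above as EXE = CPI - DSR - FSR. The short Python sketch below applies this to the rows of Table 3 that report all three values. The numbers are the rounded table entries, so the results are approximate; the names TABLE3_CPI and execution_cpi are ours.

```python
# (cpi, dsr, fsr), all expressed in CPI units, copied from Table 3.
TABLE3_CPI = {
    "original code": (2.5, 1.4, 0.5),
    "modified code": (1.5, 1.2, 0.0),
    "modified code w. auto. prefetch": (1.2, 0.9, 0.0),
}


def execution_cpi(cpi, dsr, fsr):
    """CPI left over once data and floating-point stalls are subtracted."""
    return cpi - dsr - fsr


for label, (cpi, dsr, fsr) in TABLE3_CPI.items():
    exe = execution_cpi(cpi, dsr, fsr)
    print(f"{label}: EXE = {exe:.1f} CPI, "
          f"about {1 / exe:.1f} instructions per cycle")
```

With the rounded inputs, the original code comes out at roughly 0.6 execution CPI (fewer than two instructions per cycle), and both modified versions at roughly 0.3 (more than three instructions per cycle).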

For the original code, the blocking nature of the "if" test (the loop is not pipelined) shows in the presence of floating-point stalls and in an execution CPI close to 0.5, that is, at most two instructions executing in parallel, far from what the UltraSPARC-II can theoretically achieve. Prefetch does not change the structure of the loop, so we logically find the same FSR and EXE times. The gain from explicit prefetch shows in the reduction of data stalls. This is exactly what one should expect from prefetch: fewer data stalls, because data is brought into the caches ahead of its use.

In the modified code, the loop is pipelined. The compiler has done a good job in spreading the floating-point operations and in issuing the instructions so that they parallelize well on the chip. As a result, floating-point stalls are completely gone and the execution CPI is now close to the peak CPI of 0.25 for the UltraSPARC-II.

Figure 1: CPI Breakdown for CFD Solver

4.2 Case Study 2: MCAD Code

The memory access pattern inside a commercial MCAD program can be emulated by:

for i=1,n
	if (j > NIND) then j = 0
	k1 = index[j++]
	k2 = index[j++]
	sum = sum + array[k1] 

The memory access pattern is pseudo-random, typically resulting in high cache and Data Translation Lookaside Buffer (dTLB) miss counts and leading to a high data stall rate and poor performance. The TLB is a fast lookup table that caches the translations from virtual memory pages to pages in physical memory. Missing the TLB typically delays the loading of a piece of data by 70 to 100 cycles. As a rule, larger memory pages, available under Solaris 8 OE through the shared memory interface, will help in such a case. Table 4 shows the timing and performance metrics for the above sample code, with both regular and large memory pages. The ability of HAR to distinguish between user and kernel modes is crucial here in analyzing performance.
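A back-of-the-envelope way to see why larger pages help is to count how many distinct pages a random access pattern touches. The Python sketch below does this for 8 KB and 4 MB pages. The array size (4M elements of 8 bytes, i.e. 32 MB), the number of accesses and the random seed are all assumptions made for illustration; they are not taken from the MCAD program.

```python
import random


def pages_touched(indices, elem_size, page_size):
    """Number of distinct memory pages hit by the given array indices."""
    return len({(i * elem_size) // page_size for i in indices})


random.seed(0)                 # fixed seed so the run is repeatable
N_ELEMS = 4 * 1024 * 1024      # hypothetical array length
ELEM_SIZE = 8                  # hypothetical element size in bytes
indices = [random.randrange(N_ELEMS) for _ in range(10_000)]

small_pages = pages_touched(indices, ELEM_SIZE, 8 * 1024)
large_pages = pages_touched(indices, ELEM_SIZE, 4 * 1024 * 1024)
print(f"8 KB pages touched: {small_pages}")
print(f"4 MB pages touched: {large_pages}")
```

With 8 KB pages the accesses spread over thousands of pages, far more than a TLB with a small, fixed number of entries can hold, whereas the same array fits in at most eight 4 MB pages. The cache behavior is unchanged, since the data and the access order are the same.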

Table 4. Performance Metrics for MCAD Case

                  mips   cpi   dsr     dsr          dtlb misses    time
                               (cpi)   (% cycles)   (per second)   (ms)
8K page   user    27     7.4   6       81           2.6M           36
          kernel  34     7.3   1.3     18
4M page   user    45     9.8   9.3     94           11K            19
          kernel  -      -     -       -

With regular 8KB pages, there is high kernel activity, due to the high number of dTLB misses. In moving to 4MB pages, the code still misses cache frequently, because it is still accessing the same data the same way, but is no longer missing the TLB. When kernel activity is gone, the number of user MIPS is up and elapsed time goes from 36ms to 19ms, almost a factor of 2.

Remark 9. For lack of hardware support in the UltraSPARC-II, the TLB data was obtained by trapstat, a kernel monitoring tool developed at Sun Microsystems.

5 Conclusion

HAR builds on the new Solaris interfaces to hardware performance counters and reports performance metrics of interest to application programmers, such as MIPS, CPI, and cache miss rates. In two case studies in the fields of CFD and MCAD, we used HAR to identify performance bottlenecks and quantify code-tuning improvements. Currently available for the UltraSPARC family, HAR is portable to any microprocessor supported by the Solaris 8 OE, such as the Pentium microprocessor.

The new UltraSPARC-III can count three times as many events as its predecessor. As a result, new HAR metrics for the UltraSPARC-III will include instruction and data TLB misses, FLOPS, branch and branch miss rates. While HAR will become more useful by reporting on more metrics, it will become more necessary as well. Future-generation machines will indeed have an increasing complexity in memory hierarchy and processor design. To understand the behavior of an application, we will need tools such as HAR to locate bottlenecks and guide tuning efforts.

Beyond HAR, we encourage the performance community to continue to build on the Solaris interfaces to hardware performance counters, for example by including knowledge about the application and the computing environment. The goal is to provide intelligent tools that draw their own conclusions on performance status and make conjectures on performance improvement strategies, in terms of code tuning or system upgrade.

Additional Reading

About the Author

Frédéric Parienté is a member of the Market Development Engineering group at Sun Microsystems, where he works on performance analysis and optimization of commercial applications for mechanical computer-aided engineering, operations research, bio-informatics and financial markets. He graduated from École Nationale Supérieure de Techniques Avancées, Paris, France in 1994 and received an MS in Mechanical Engineering from the University of Illinois at Urbana-Champaign in 1995. His professional interests include high-performance computing, grid and portal computing, parallel and distributed systems, and performance analysis and optimization. Frédéric Parienté can be contacted at frederic.pariente@sun.com.
