Synopsis
The hardware performance counters found in the UltraSPARC and Pentium microprocessors provide a powerful means to monitor and analyze the performance of a given application or an entire system. Starting with the Solaris 8 Operating Environment (OE), software support for hardware counters is increasing and access is easier for application developers, through programming interfaces and utility tools. On top of the new libraries, a tool called the Hardware Activity Reporter (HAR) has been built to answer the needs of application programmers. HAR combines low-level processor events into common performance metrics, such as MIPS, FLOPS and cache miss rates. In two case studies in the fields of CFD and MCAD, HAR proves useful in identifying performance bottlenecks and quantifying code-tuning improvements.
Introduction
The UltraSPARC and Pentium microprocessors contain hardware performance counters that count processor events such as cache misses, pipeline stalls and floating-point operations. Statistics of processor events can be collected in hardware with little or no overhead, making these counters a powerful means to monitor an application and analyze its performance. The counters are non-intrusive, do not require recompilation of applications and are available in every UltraSPARC-based system that Sun Microsystems currently ships. Nonetheless, they are not widely used beyond hardware specialists because of the lack of programming interfaces and sparse documentation.

Starting with release 8 of the Solaris OE, support for hardware counters is increasing, with the availability of utility tools and public programming interfaces. This support at the operating system level is a major milestone for the Solaris software developer community but is not a direct answer to its needs. Application programmers work at a higher level, where the common metrics are combinations of the low-level event counts accumulated in hardware. There is a need for higher-level tools that hide the complexity of the underlying hardware design and present, in a uniform manner, the common metrics programmers expect: million instructions per second (MIPS), floating-point operations per second (FLOPS), cache miss rate.

This paper covers the Solaris 8 platform's support for hardware performance counters and introduces a tool, the Hardware Activity Reporter, that builds on the new Solaris interfaces and aims to address the needs of application programmers.

2 Solaris 8 OE Support for Hardware Performance Counters
In February 2000, Sun Microsystems announced a new release of the Solaris Operating Environment, Solaris 8.
With the latest release of Solaris 8 OE, Sun began to deliver public interfaces to the UltraSPARC and Pentium hardware performance counters. A series of application programming interfaces (APIs) has been made available as shared libraries to program various hardware counters, along with the utility tools cpustat and cputrack to access the microprocessor (CPU) hardware counters from the command line. The Solaris 8 platform also provides similar libraries and tools to access the hardware counters of the system bus and I/O boards.

2.1 Solaris 8 OE Libraries
Solaris 8 software ships with two libraries directly related to the usage of the CPU performance counters:
The attractive aspect of these interfaces is their simplicity. Once the counters are initialized, a program reads them by a single call to

2.2 Solaris 8 OE Utility Tools
The Sample Output from
Remark 1. Stalls are counted for each clock when the associated condition is true.
Remark 2. Cycle_cnt, unlike %TICK, selectively counts user or kernel cycles.
From the above UltraSPARC-II events, we compute the following metrics.
Table 2. UltraSPARC-II Performance Metrics
Remark 3. DCM, ICM, ECM are the data, instruction, external cache miss rates.
Remark 4. DSR, ISR, FSR, BSR are the data, instruction, floating-point, branch stall rates. Stalls are the number of cycles when the chip is stalling because of a given resource. A data stall rate equates to waiting on data, typically a load. A floating-point stall rate equates to waiting for the next available floating-point unit. Branch stalls are the number of cycles recovering from a mispredicted branch.
Remark 5. The bus utilization does not take into account disk and DMA access. The factor of 2 comes from the fact that each bus cycle takes 2 clocks on the system bus.
Remark 6. CPU_clock and Bus_clock refer to the frequencies of the CPU and system bus. They are constant numbers for a given machine.
Remark 7. Miss and stall rates are presented here as per-reference ratios; per-reference ratios are interesting because they are dimensionless numbers and can be presented as percentages. Depending on the programmer's background, rates can also be computed as per-second and per-instruction ratios. Database people like per-instruction ratios because they are absolute numbers and relate easily to common database performance measures, such as the number of transactions per instruction.
Based on the Solaris 8 hardware performance counter interfaces and the source code of cpustat, a utility tool called HAR has been developed in the C programming language. HAR is modular and microprocessor independent in the sense that knowledge about a specific chip (availability of hardware events, formulas to compute performance metrics) is plugged in by means of function pointers. For the UltraSPARC-II chip, HAR implements the performance metrics of Table 2, which can be presented as per-reference, per-second or per-instruction ratios (-r, -s, -i options at the command line).
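The plug-in idea described above can be sketched in a few lines. This is an illustration in Python, not HAR's C source: the metric table, the function names and the sample counts are all hypothetical, and Instr_cnt is assumed here as the name of the instruction-count event alongside Cycle_cnt. Only the definitions of MIPS (instructions per second divided by one million) and CPI (cycles per instruction) are standard.

```python
from typing import Callable, Dict

# Raw event counts for one sampling interval (hypothetical values below).
Counts = Dict[str, float]
Metric = Callable[[Counts, float], float]


def mips(counts: Counts, seconds: float) -> float:
    """Million instructions per second over the interval."""
    return counts["Instr_cnt"] / seconds / 1e6


def cpi(counts: Counts, seconds: float) -> float:
    """Cycles per instruction; the interval length is not needed."""
    return counts["Cycle_cnt"] / counts["Instr_cnt"]


# Chip-specific knowledge lives in a table of callables, the Python
# counterpart of a C table of function pointers. Supporting another
# chip means adding another entry, not changing report().
CHIP_METRICS: Dict[str, Dict[str, Metric]] = {
    "UltraSPARC-II": {"mips": mips, "cpi": cpi},
}


def report(chip: str, counts: Counts, seconds: float) -> Dict[str, float]:
    """Evaluate every metric registered for the given chip."""
    return {name: fn(counts, seconds) for name, fn in CHIP_METRICS[chip].items()}


# Hypothetical sample: 300 million cycles, 100 million instructions in 1 s.
sample = {"Cycle_cnt": 300e6, "Instr_cnt": 100e6}
result = report("UltraSPARC-II", sample, 1.0)
print(result)
```

The reporting loop never names a chip-specific event, which is the sense in which such a design stays microprocessor independent.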
To summarize, HAR:
har.vmstat, line-by-line, as shown below.

# har -r 1 3
  mips   bus   cpi   dcm   icm   ecm   dsr   isr   fsr   bsr
     2   0.0   3.1  33.7  25.1  15.4  15.6  31.9   0.0   3.1
     0   0.0   3.6  31.3  22.8   1.0  24.4  25.4   0.1   6.1
     0   0.0   3.3  34.2  21.9   0.0  24.6  19.6   0.0   5.3
In this section, two case studies are presented that show the usefulness of the HAR tool. In the field of Computational Fluid Dynamics (CFD), we use HAR to characterize the code signature and to quantify the improvements from manually tuning the code. In the field of Mechanical Computer-Aided Design (MCAD), we use HAR to show the effect of the large memory pages feature of the Solaris platform.
The following loop is extracted from a commercial CFD solver. It has been heavily simplified for the sake of clarity. In the real code, the arrays are accessed through indirection, resulting in some random memory access.
for i=1,n
    if test(a[i]) then
        w=w+a[i]*b[i]
With current compiler technology, the loop is not pipelined. Also, the UltraSPARC-II microprocessor has limited branch prediction capabilities. As a result, the chip stalls on the "if" test. The above loop has been replaced by the following modified solver loop.
for i=1,n
    a[i]=f(a[i])
    w=w+a[i]*b[i]
We expect this modified loop to improve the execution of the solver: function "f" can be inlined, and it is implemented using Boolean operations rather than "if" tests, which enables the compiler to pipeline the loop. Based on the original or the modified loop, different versions of the CFD solver have been built using the levels of compiler-generated prefetch available in the Forte Developer 6 compilers: none, automatic, and explicit, where the programmer specifies the arrays to be prefetched with directives in the source code. Table 3 shows the timings and performance metrics for the different versions of the CFD solver.
Looking at elapsed time for the original code, we see that adding automatic prefetch results in no gain. Explicit prefetch, however, gives some gain. This is a first interesting result: automatic prefetch did not identify the right arrays to prefetch, so programming explicit prefetch was worth the effort. Also, as expected, the modified code runs faster than the original code. To better understand where the various performance gains come from, it is useful to represent the CPI breakdown graphically.
In the graph, the total value represents the total CPI. The CPI breaks down into the time waiting on the floating-point unit (FSR), the time waiting on data loads (DSR) and some remaining time that we will call execution CPI (EXE). Execution CPI represents the actual execution, and its numerical value can be related to the number of instructions executed in parallel. For instance, an execution CPI close to 0.25 means that, on average, four instructions are issued per cycle.
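The breakdown is simple arithmetic, sketched below under the assumption that the stall rates are expressed as fractions of all cycles. The function name and the input values are hypothetical and are not measurements from the CFD solver.

```python
def cpi_breakdown(cpi: float, dsr: float, fsr: float) -> dict:
    """Split total CPI into data stalls, FP stalls and execution CPI.

    cpi: total cycles per instruction
    dsr: fraction of cycles stalled on data loads (0.0 to 1.0)
    fsr: fraction of cycles stalled on the floating-point unit (0.0 to 1.0)
    """
    dsr_cpi = cpi * dsr               # cycles per instruction lost to data stalls
    fsr_cpi = cpi * fsr               # cycles per instruction lost to FP stalls
    exe_cpi = cpi - dsr_cpi - fsr_cpi  # what remains is execution CPI
    return {
        "DSR": dsr_cpi,
        "FSR": fsr_cpi,
        "EXE": exe_cpi,
        # Reciprocal of execution CPI: instructions issued per non-stalled cycle.
        "issue_rate": 1.0 / exe_cpi,
    }


# Hypothetical run: CPI of 2.0 with 50% data stalls and 25% FP stalls.
breakdown = cpi_breakdown(2.0, 0.5, 0.25)
print(breakdown)
```

In this two-stall breakdown, any other stall cycles (instruction or branch stalls, for example) end up in the EXE remainder, so the issue rate derived from it is a lower bound on what the chip sustains while actually executing.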
Table 3. Performance Metrics for CFD Case
Remark 8. DSR and FSR are given as a fraction of CPI and a percentage of cycles.
For the original code, the blocking nature of the "if" test and the fact that the loop is not pipelined show up as floating-point stalls and an execution CPI close to 0.5, that is, at most two instructions executing in parallel, far from what the UltraSPARC-II can theoretically achieve. Prefetch does not change the structure of the loop, so we logically find the same FSR and EXE times. The gain from explicit prefetch shows in the reduction of data stalls. This is exactly what one should expect from prefetch: fewer data stalls, because data is brought into the caches ahead of use.
In the modified code, the loop is pipelined. The compiler has done a good job in spreading the floating-point operations and in issuing the instructions so that they parallelize well on the chip. As a result, floating-point stalls are completely gone and the execution CPI is now close to the peak CPI of 0.25 for the UltraSPARC-II.
Figure 1: CPI Breakdown for CFD Solver
The memory access pattern inside a commercial MCAD program can be emulated by:
for i=1,n
    if (j > NIND) then j = 0
    k1 = index[j++]
    k2 = index[j++]
    sum = sum + array[k1]
The memory access pattern is pseudo-random, typically resulting in high cache and Data Translation Lookaside Buffer (dTLB) miss rates and leading to a high data stall rate and poor performance. The TLB is a small, fast lookup table that holds recent translations from virtual memory pages to physical memory. Missing the TLB typically delays the loading of a piece of data by 70 to 100 cycles. As a rule, larger memory pages, available under Solaris 8 OE through the shared memory interface, will help in such a case. Table 4 shows the timing and performance metrics for the above sample code, with both regular and large memory pages. The ability of HAR to distinguish between user and kernel modes is crucial here in analyzing performance.
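Why larger pages help can be seen from the amount of memory a TLB can map at once, often called its reach. The sketch below is illustrative only: the entry count of 64 and the miss count are assumptions, not figures from this article; the 8KB and 4MB page sizes and the 70 to 100 cycle penalty come from the text.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory that can be mapped without a TLB miss: entries times page size."""
    return entries * page_size


def tlb_miss_cycles(misses: int, penalty_cycles: int) -> int:
    """Total cycles lost to TLB misses at a fixed per-miss penalty."""
    return misses * penalty_cycles


ENTRIES = 64  # assumed dTLB size, for illustration only

small_reach = tlb_reach_bytes(ENTRIES, 8 * KB)  # regular pages
large_reach = tlb_reach_bytes(ENTRIES, 4 * MB)  # large pages
print(small_reach // KB, "KB reach with 8KB pages")
print(large_reach // MB, "MB reach with 4MB pages")

# Hypothetical workload: one million dTLB misses at 70 to 100 cycles each.
low = tlb_miss_cycles(1_000_000, 70)
high = tlb_miss_cycles(1_000_000, 100)
print(low, "to", high, "cycles lost")
```

Whatever the real entry count, the ratio between the two reaches is the ratio of the page sizes, a factor of 512. A pseudo-random walk over an array larger than the small-page reach misses the TLB on most accesses, while the same array may fit entirely within the large-page reach.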
Table 4. Performance Metrics for MCAD Case
With regular 8KB pages, there is high kernel activity, due to the high number of dTLB misses. In moving to 4MB pages, the code still misses the cache frequently, because it is still accessing the same data the same way, but it no longer misses the TLB. With kernel activity gone, user MIPS go up and elapsed time drops from 36ms to 19ms, almost a factor of 2.
Remark 9. For lack of hardware support in the UltraSPARC-II, the TLB data was obtained by trapstat, a kernel monitoring tool developed at Sun Microsystems.
HAR builds on the new Solaris interfaces to hardware performance counters and reports performance metrics of interest to application programmers, such as MIPS, CPI, and cache miss rates. In two case studies in the fields of CFD and MCAD, we used HAR to identify performance bottlenecks and quantify code-tuning improvements. Currently available for the UltraSPARC family, HAR is portable to any microprocessor supported by the Solaris 8 OE, such as the Pentium microprocessor.
The new UltraSPARC-III can count three times as many events as its predecessor. As a result, new HAR metrics for the UltraSPARC-III will include instruction and data TLB misses, FLOPS, branch and branch miss rates. While HAR will become more useful by reporting on more metrics, it will become more necessary as well. Future-generation machines will indeed have an increasing complexity in memory hierarchy and processor design. To understand the behavior of an application, we will need tools such as HAR to locate bottlenecks and guide tuning efforts.
Beyond HAR, we encourage the performance community to continue to build on the Solaris interfaces to hardware performance counters, for example by incorporating knowledge about the application and the computing environment. The goal is intelligent tools that draw their own conclusions about performance status and propose strategies for improving performance, whether through code tuning or system upgrades.
Frédéric Parienté is a member of the Market Development Engineering group at Sun Microsystems, where he works on performance analysis and optimization of commercial applications for mechanical computer-aided engineering, operations research, bio-informatics and financial markets. He graduated from Ecole Nationale Superieure de Techniques Avancées, Paris, France in 1994 and received an MS in Mechanical Engineering from the University of Illinois at Urbana-Champaign in 1995. His professional interests include high-performance computing, grid and portal computing, parallel and distributed systems, and performance analysis and optimization. Frédéric Parienté can be contacted at frederic.pariente@sun.com.
December 2001