Many modern system architectures use a Non-Uniform Memory Architecture (NUMA). Some applications have difficulty scaling on such architectures as CPUs are added. The addition of CPU cores does not proportionately increase the application's ability to do work. In this article, I present how the customer and I increased application performance on a Sun SPARC E6900, a NUMA machine, by reducing the effect of the machine's "NUMAness." I will also discuss how we achieved similar results with an AMD Opteron X4600 for a different application. I will explain the methodology used, which has been characterized as the scientific method of performance analysis. I will emphasize the importance of the business metric and give an overview of several NUMA-aware tools from the OpenSolaris project. The Business Metric
When doing performance analysis, it is useful to relate all performance-related metrics to what we really care about. In this case, that includes calls per time unit, transactions per time unit, records processed per time unit, and so on. I call this metric the business metric. Performance metrics are only interesting in the context of how they relate to the business metric. By considering the effect on the business metric, we shift the focus of investigation from identifying metrics that are out of bounds to identifying metrics that correlate with the business metric. For example, if achievement of the business metric suffers when the tcp retransmission rate increases, improving the tcp retransmission rate will likely improve the business metric. Likewise, if slow disk I/O does not correlate with degradation of the business metric, improving disk I/O is not likely to improve the business metric. This article's graphs will compare measurements of the business metric to specific performance-related metrics. The Scientific Method: Performance Analysis
When planning a performance investigation, it is useful to specify the parameters of the experiment before beginning. First, we set a goal that expresses what we are trying to learn or achieve. The goal is expressed in terms of the business metric. The goal specifies the criteria for success. We then determine the use cases for the experiment. In the case of production applications, a use case is defined as a workload cycle during which the load on the system rises and falls. It is important to capture time-stamped measurements of the business metric during the cycle. In the lab, we want to establish repeatable load steps. For example, 1 million, 2 million, and 3 million busy calls per hour define three load steps for a virtual telephone switch. During analysis, I use graphs or plots to identify correlations between performance metrics and the business metric. The Problem With NUMA
Non-Uniform Memory Architectures (NUMAs) are commonplace in modern systems. In a NUMA architecture, CPUs do not all have the same view of memory. Consider the illustration in Figure 1.
CPU1 can access the memory on its own system board in less time than it can access the memory on the system board where CPU2 resides and vice versa. When an application thread runs on different CPUs in a NUMA architecture, it runs in environments where the cost of a load and/or store can be different depending on which CPU the thread executes. The amount of business metric that an application can do per unit time is thus dependent on how "far" the application is from its memory when it runs. One class of SUN SPARC machines has system boards attached to a central bus architecture, similar to Figure 1. The machine can have multiple system boards, each populated with memory and CPUs. A CPU can access memory on its own board faster than it can access memory on a different system board. On the AMD Opteron-based machines, each CPU socket is connected to other CPU sockets through HyperTransport links. The speed to access a memory bank is dependent on how many HyperTransport links must be traversed. Table 1 gives some examples, listing times in nanoseconds (nsecs).
Sun's Opteron-based platforms take approximately 80 nsecs to access memory local to a core and approximately 50 additional nsecs per HyperTransport link. A dual-socket platform has an average access time of approximately 105 nsecs. The average access time for a quad-socket system is 124 nsecs. And the average access time is 155 nsecs for an eight-socket system. Case Study One: Sun Fire E6900 Server
The Sun Fire E6900 server has six system boards. Each system board has four sockets. In our configuration, each socket was populated with a dual-core UltraSPARC IV+ CPU containing a 64 KB L1 instruction cache, a 64 KB L1 data cache, a 2 MB shared L2 cache, and a 32 MB shared L3 cache. The application is a component of a virtual telephone switch. Although the business metric is expressed in calls processed per second or in busy calls per hour, we will generically use transactions per second in the graphs. The application consists of numerous Java technology-based and C++ processes. Problem: CPU utilization increases nonlinearly with load. Specifically, when the customer compared a one-system board configuration to a six-system board configuration, the application was able to process only about three times the load with the six-system board solution, as opposed to the expected factor of six. In other words, six times the number of CPUs does not handle six times the load. Goals:
Plan: First, we identified four load cases: 15,000,
30,000, 45,000, and 60,000 transactions per second. Next, we ran the
load cases sequentially and collected performance metrics. For data
collection, we used a
In Figure 2, 100 equals 100 percent of the system's CPU capacity. The business metric is given in transactions per second. What we want to see is an increase in CPU utilization that corresponds to an equal increase in the load. What we see is that the CPU consumption rose more than expected for the third load step and twice as much as expected for the fourth load step. The first step in discovering the reason for the increased
utilization is to look at the other possible performance inhibitors.
For this, we use a tool named Here is the sample output:
The interesting metrics include but are not limited to thread
migrations ( A thread migration occurs when a thread does not run on the same CPU core this time as the last time it ran. An involuntary context switch occurs when a thread is taken off a CPU core so that a higher-priority thread might run or because the thread exhausted its time slice. Spins on mutex locks count how many unsuccessful attempts were made to acquire a mutex lock in the kernel. Cross calls are software interrupts that occur when one CPU core has to interrupt another CPU core to get a request satisfied. Among the causes of cross calls are signals and page-mapping activity. Involuntary context switches did correlate with the increased CPU utilization at higher load steps, as did thread migrations, but the most interesting evidence came from system calls per second. See Figure 3.
Normally, we expect a certain number of system calls per unit of work. The graph in Figure 3 shows that we were doing fewer system calls per unit of work. This is unusual, but it is possible for systems that use timed events such as poll and timed waits on condition variables to use fewer system calls as load increases. We used dTrace to sample what system calls were being used, and we decided that timed events were not the cause of our slowdown. We also eliminated increased spins on mutex locks as a source of slower system calls. Finally, we theorized that user instructions in particular were actually taking longer to run and thus making fewer system calls per second than what we expected. We decided to concentrate on determining whether loads and stores were taking longer due to off-board CPU migrations and L3 cache misses, in effect, using more cycles per instruction. Fortunately, the Solaris OS contains a tool for accessing
hardware performance counters,
An example of the output follows, where
By using the ratio of L3 miss cost in cycles
(
The result was that 41 percent of the system was being used to service L3 misses. We identified the following possible remedies: processor affinity and lgroups. The Solaris OS has a built-in processor affinity that gives a thread a tendency to run where it has run before. The scheduler considers how long it has been since the thread last ran as well as the length of the run queue where the thread ran before. To put things most simply, if it has been too long or if the run queue on the desired CPU core is too deep, then the thread migrates. Obviously, processor affinity was not sufficient here to assure that threads would tend to run on the same CPU core repeatedly. The Solaris OS has a concept known as lgroups. An lgroup is a set of CPUs that have the same view of memory. You can find several tools that are NUMA-aware. Among these tools are the following:
What follows are some examples of output from use of these NUMA-aware tools. Here is the output from
The number in the Here is the output from
The 6, 3,and 4 indicate the lgroup from which the memory was allocated. Using The first remedy we evaluated was assigning all the threads of a process to the same home lgroup. We spread the process assignments out over all of the system boards. We expected that lgroup affinity would reduce off-board migrations and reduce L3 miss cost. But this increased performance only slightly, from 10 to 15 percent. More drastic action was needed. Then we considered processor sets. Processor sets are exclusive, and we did not have the knowledge needed to assign all of the processes to processor sets without running the risk of unbalancing the load. Finally, we considered processor binding. We would bind the latency-sensitive threads of the processes to their own CPUs. The danger in processor binding is that once a thread is bound to a CPU, it can only run on that CPU. There is a danger that it may not get to run promptly or even that it may be starved for CPU cycles. We reduced this danger by running the bound threads in the Fixed Priority (FX) scheduling class at a high priority with a large time slice. This solution provided the final answer. We used the following commands:
The graph in Figure 5 shows the improved scalability after our efforts.
99,000 transactions per second vs. 60,000 is quite an improvement. This scalability can largely be attributed to a stabilization of the cost of L3 misses. Although we reduced involuntary context switches as well as thread migrations, the primary increase in performance was due to the reduction in cost of the L3 misses, as shown in Figure 6.
Case Study Two: Sun Fire X4600 Server
The Sun Fire X4200 has two sockets, and each socket has two cores. The maximum number of HyperTransport hops to memory is one. The V40z has four sockets, each with two cores. The largest number of hops to memory is two. But the Sun Fire X4600 server has eight sockets, each with two cores. Now the number of hops to memory can be as much as three. The average memory access time for the X4600 is 155 nsecs, as compared to 105 nsecs for a two-socket system. The application analyzed was Java technology-based. The problem description follows. The X4200 server could process a block of work in 214 minutes. This was accomplished with one Java Virtual Machine (JVM) and four worker threads.* When the application was deployed on an X4600 server with one JVM and 16 worker threads, the execution time was 195 minutes. The goal was to run the workload on the X4600 in one-fourth the time that it took to run the application on the X4200. Figure 7 shows the layout of the sockets on the X4600.
Using a Sun internal tool that displays near real-time CPU utilization, Recall that each socket contains a CPU with two cores. We reasoned that the key to making one X4600 server (16 cores) perform in the same manner as four X4200 servers (4 cores) was to partition the X4600 into processor sets so that each set resembled an X4200. In terms of the diagram in Figure 7, we wanted to pair sockets 0 and 1, sockets 2 and 4, sockets 3 and 5, and sockets 6 and 7. Other combinations are possible. The commands to do this partitioning are as follows:
We made the above changes to ensure that memory allocations would occur within the processor set. By splitting up a single large Java process into four processes, assigning each to one of the above processor sets, using FX scheduling, and processor set-aware memory allocations, we were able to complete a piece of work that had been taking 195 minutes in fewer than 80 minutes. We did not reach our goal, but 80 minutes represented a considerable improvement nevertheless. Summary
Integrated memory management units give modern processors significant performance advantages when accessing local memory. This advantage becomes a liability when memory accesses become disproportionately remote. Modern systems seek to maximize execution efficiency in three ways: by giving a thread an affinity for the processor on which it most recently ran, by giving a thread an affinity for the processor that is "closest" to the memory that the thread needs to access (lgroups in the Solaris OS), and by using large L2 and L3 caches. Sometimes these policies are not enough to ensure efficient and scalable execution. Intelligent policy-based scheduling would help, but the operating system can only do so much. An experienced competent analyst with sufficient knowledge of the workload can make the crucial difference between a smooth-running efficient system and an unscalable, erratically performing system.
_______ References
Solaris Memory Placement Optimization and Sunfire Servers (PDF), technical white paper, Sun Microsystems, March 2003. Sunfire X4600/X4600M2 Server Architecture (PDF), technical white paper, May 2007. Richard McDougall and Jim Mauro, Solaris Internals, 2nd edition, Sun Microsystems Press, 2006. Acknowledgments
Thanks to Bob Sneed of Sun Microsystems, who first articulated the concept of the business metric and its importance, and to Jim Fiori of Sun Microsystems for reviewing this paper and making valuable suggestions. About the Author
Rickey C. Weisner is a senior engineer for Sun Microsystems. He currently works for Solaris 10 OS adoption in the U.S. Systems practice. His main areas of interest are performance analysis and C and C++ software development. | |||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||