Sun Java Solaris Communities My SDN Account Join SDN

Article

Achieving Near-Linear Scalability Using Solaris OS on NUMA Architectures

 
By Rickey C. Weisner, December 2007  

Many modern system architectures use a Non-Uniform Memory Architecture (NUMA). Some applications have difficulty scaling on such architectures as CPUs are added. The addition of CPU cores does not proportionately increase the application's ability to do work.

In this article, I present how the customer and I increased application performance on a Sun SPARC E6900, a NUMA machine, by reducing the effect of the machine's "NUMAness." I will also discuss how we achieved similar results with an AMD Opteron X4600 for a different application. I will explain the methodology used, which has been characterized as the scientific method of performance analysis. I will emphasize the importance of the business metric and give an overview of several NUMA-aware tools from the OpenSolaris project.

The Business Metric

When doing performance analysis, it is useful to relate all performance-related metrics to what we really care about. In this case, that includes calls per time unit, transactions per time unit, records processed per time unit, and so on. I call this metric the business metric.

Performance metrics are only interesting in the context of how they relate to the business metric. By considering the effect on the business metric, we shift the focus of investigation from identifying metrics that are out of bounds to identifying metrics that correlate with the business metric. For example, if achievement of the business metric suffers when the tcp retransmission rate increases, improving the tcp retransmission rate will likely improve the business metric. Likewise, if slow disk I/O does not correlate with degradation of the business metric, improving disk I/O is not likely to improve the business metric.

This article's graphs will compare measurements of the business metric to specific performance-related metrics.

The Scientific Method: Performance Analysis

When planning a performance investigation, it is useful to specify the parameters of the experiment before beginning. First, we set a goal that expresses what we are trying to learn or achieve. The goal is expressed in terms of the business metric. The goal specifies the criteria for success.

We then determine the use cases for the experiment. In the case of production applications, a use case is defined as a workload cycle during which the load on the system rises and falls. It is important to capture time-stamped measurements of the business metric during the cycle. In the lab, we want to establish repeatable load steps. For example, 1 million, 2 million, and 3 million busy calls per hour define three load steps for a virtual telephone switch. During analysis, I use graphs or plots to identify correlations between performance metrics and the business metric.

The Problem With NUMA

Non-Uniform Memory Architectures (NUMAs) are commonplace in modern systems. In a NUMA architecture, CPUs do not all have the same view of memory. Consider the illustration in Figure 1.

NUMA Architecture
Figure 1. NUMA Architecture
 

CPU1 can access the memory on its own system board in less time than it can access the memory on the system board where CPU2 resides and vice versa.

When an application thread runs on different CPUs in a NUMA architecture, it runs in environments where the cost of a load and/or store can be different depending on which CPU the thread executes. The amount of business metric that an application can do per unit time is thus dependent on how "far" the application is from its memory when it runs.

One class of SUN SPARC machines has system boards attached to a central bus architecture, similar to Figure 1. The machine can have multiple system boards, each populated with memory and CPUs. A CPU can access memory on its own board faster than it can access memory on a different system board.

On the AMD Opteron-based machines, each CPU socket is connected to other CPU sockets through HyperTransport links. The speed to access a memory bank is dependent on how many HyperTransport links must be traversed. Table 1 gives some examples, listing times in nanoseconds (nsecs).

Table 1. Example Memory-Access Times
 
 
 
Sun Fire 3800-6800 Servers
Sun Fire 12K/15K Servers
Same-board memory access
180-207
180-197
Remote-board memory access
240
333-440
 

Sun's Opteron-based platforms take approximately 80 nsecs to access memory local to a core and approximately 50 additional nsecs per HyperTransport link. A dual-socket platform has an average access time of approximately 105 nsecs. The average access time for a quad-socket system is 124 nsecs. And the average access time is 155 nsecs for an eight-socket system.

Case Study One: Sun Fire E6900 Server

The Sun Fire E6900 server has six system boards. Each system board has four sockets. In our configuration, each socket was populated with a dual-core UltraSPARC IV+ CPU containing a 64 KB L1 instruction cache, a 64 KB L1 data cache, a 2 MB shared L2 cache, and a 32 MB shared L3 cache.

The application is a component of a virtual telephone switch. Although the business metric is expressed in calls processed per second or in busy calls per hour, we will generically use transactions per second in the graphs. The application consists of numerous Java technology-based and C++ processes.

Problem: CPU utilization increases nonlinearly with load. Specifically, when the customer compared a one-system board configuration to a six-system board configuration, the application was able to process only about three times the load with the six-system board solution, as opposed to the expected factor of six. In other words, six times the number of CPUs does not handle six times the load.

Goals:

  • Immediate: 15 percent more supportable load on the six-system board configuration
  • Within six months: 20 percent more supportable load on the six-system board configuration
  • Within one year: 50 percent more supportable load on the six-system board configuration

Plan: First, we identified four load cases: 15,000, 30,000, 45,000, and 60,000 transactions per second. Next, we ran the load cases sequentially and collected performance metrics. For data collection, we used a ksh shell script that uses largely Solaris OS-provided tools to gather time-stamped performance metrics. Figure 2 shows a comparison of the business metric or load with the CPU utilization.

Business Metric Versus CPU
Utilization
Figure 2. Business Metric Versus CPU Utilization
Click here for a larger image
 

In Figure 2, 100 equals 100 percent of the system's CPU capacity. The business metric is given in transactions per second. What we want to see is an increase in CPU utilization that corresponds to an equal increase in the load. What we see is that the CPU consumption rose more than expected for the third load step and twice as much as expected for the fourth load step.

The first step in discovering the reason for the increased utilization is to look at the other possible performance inhibitors. For this, we use a tool named mpstat. To find more information about this and other tools, consult the Sun Microsystems documentation site.

Here is the sample output:

CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl usr sys  wt idl
0   217   0  925  312  206   344    4  133 3024    0  1196   9   3   0  87
1    43   0 1146   36    8  1476   19  492  201    0  1816  10   4   0  86
2    55   0 1108   18    6  1285    5  486  171    0  1665  10   4   0  86
3    45   0 1069   18    7  1238    5  467  169    0  1614  10   4   0  86
4   283   0 1615   18    6  1521    5  517  206    0  2000  11   5   0  84
 

The interesting metrics include but are not limited to thread migrations (migr), involuntary context switches (icsw), spins on mutex locks (smtx), and cross calls (xcal).

A thread migration occurs when a thread does not run on the same CPU core this time as the last time it ran. An involuntary context switch occurs when a thread is taken off a CPU core so that a higher-priority thread might run or because the thread exhausted its time slice. Spins on mutex locks count how many unsuccessful attempts were made to acquire a mutex lock in the kernel. Cross calls are software interrupts that occur when one CPU core has to interrupt another CPU core to get a request satisfied. Among the causes of cross calls are signals and page-mapping activity.

Involuntary context switches did correlate with the increased CPU utilization at higher load steps, as did thread migrations, but the most interesting evidence came from system calls per second. See Figure 3.

System Calls per Second per CPU
Core Versus the Business Metric
Figure 3. System Calls per Second per CPU Core Versus the Business Metric
Click here for a larger image
 

Normally, we expect a certain number of system calls per unit of work. The graph in Figure 3 shows that we were doing fewer system calls per unit of work. This is unusual, but it is possible for systems that use timed events such as poll and timed waits on condition variables to use fewer system calls as load increases.

We used dTrace to sample what system calls were being used, and we decided that timed events were not the cause of our slowdown. We also eliminated increased spins on mutex locks as a source of slower system calls. Finally, we theorized that user instructions in particular were actually taking longer to run and thus making fewer system calls per second than what we expected.

We decided to concentrate on determining whether loads and stores were taking longer due to off-board CPU migrations and L3 cache misses, in effect, using more cycles per instruction.

Fortunately, the Solaris OS contains a tool for accessing hardware performance counters, cpustat. For the E6900, we can measure the miss cost by using the following command:

cpustat -c pic0=Cycle_cnt,pic1=Re_L3_miss 10 100 (10 seconds between samples, 100 samples)
 

An example of the output follows, where pic0 is the cycle count and pic1 is the L3 miss cost in cycles (Re_L3_miss):

time    cpu event    pic0      pic1
10.008    8  tick 264339684  64702523
10.008   16  tick 379033424  82264350
10.009   17  tick 139120474  43937429
10.009    2  tick 188557829  40718416
10.008    1  tick 196036061  51324303
10.009    9  tick 147137952  31141793
10.008  512  tick 179496352  47598315
10.009  514  tick 213559632  51053565
10.009   10  tick 146540546  29848051
10.009   18  tick 197444045  54358956
10.009    3  tick 241315996  59839941
10.009  515  tick 169493146  37817954
10.009   11  tick 206307589  48544746
10.009   19  tick 139896698  38977705
 

By using the ratio of L3 miss cost in cycles (Re_L3_miss) to Cycle_cnt, we saw what percentage of CPU cycles is used to service L3 cache misses. Figure 4 shows the graph.

Percentage of Total System Time
Spent Waiting for L3 Miss
Figure 4. Percentage of Total System Time Spent Waiting for L3 Miss
Click here for a larger image
 

The result was that 41 percent of the system was being used to service L3 misses.

We identified the following possible remedies: processor affinity and lgroups.

The Solaris OS has a built-in processor affinity that gives a thread a tendency to run where it has run before. The scheduler considers how long it has been since the thread last ran as well as the length of the run queue where the thread ran before. To put things most simply, if it has been too long or if the run queue on the desired CPU core is too deep, then the thread migrates. Obviously, processor affinity was not sufficient here to assure that threads would tend to run on the same CPU core repeatedly.

The Solaris OS has a concept known as lgroups. An lgroup is a set of CPUs that have the same view of memory. You can find several tools that are NUMA-aware. Among these tools are the following:

  • lgrpinfo(1): Perl script that displays lgroup hierarchy, contents, and characteristics
  • plgrp(1): proc tool for observing and affecting home lgroup and lgroup affinities
  • pmadvise(1): proc tool for applying advice with madvise(3C)
  • pmap(1) extensions to display lgroup containing physical memory backing given a virtual address in a specified process
  • ps(1) extensions for lgroups
  • prstat(1) extensions for lgroups

What follows are some examples of output from use of these NUMA-aware tools.

Here is the output from plgrp -l 12502:

PID/LWPID    HOME
 12502/1     1
 12502/2     3
 12502/3     4
 12502/4     5
 12502/5     2
 12502/6     6
 12502/7     1
 12502/8     4
 12502/9     5
 

The number in the HOME column indicates the home lgroup for each lightweight process (lwp).

Here is the output from pmap -x -L 5962:

5962:  xxxx
00010000      8K r-x--    6 xxxx
00020000      8K rwx--    6 xxxx
00022000    800K rwx--    6   [ heap ]
000EA000      8K rwx--    -   [ heap ]
000EC000     40K rwx--    6   [ heap ]
000F6000      8K rwx--    -   [ heap ]
000F8000     32K rwx--    6   [ heap ]
00100000     32K rwx--    -   [ heap ]
00108000      8K rwx--    6   [ heap ]
0010A000     32K rwx--    -   [ heap ]
00112000      8K rwx--    6   [ heap ]
00114000     32K rwx--    -   [ heap ]
0011C000    216K rwx--    6   [ heap ]
00152000     32K rwx--    -   [ heap ]
0015A000   1416K rwx--    6   [ heap ]
002BC000   1008K rwx--    3   [ heap ]
003B8000     32K rwx--    -   [ heap ]
003C0000      8K rwx--    3   [ heap ]
...
FF3B0000    184K r-x--    6 /lib/ld.so.1
FF3E6000      8K r-x--    4 /lib/libthread.so.1
FF3E8000      8K r-x--    - /lib/libthread.so.1
FF3EE000      8K rwx--    6 /lib/ld.so.1
FF3F0000      8K rwx--    6 /lib/ld.so.1
FF3F8000     16K r-x--    6 /lib/libpthread.so.1
FF400000     24K -----    -   [ anon ]
FFB2C000    848K rwx--    6   [ stack ]
 total  3320488K
 

The 6, 3,and 4 indicate the lgroup from which the memory was allocated.

Using pmap -xL, we saw that the processes of interest had their memory allocations scattered throughout memory. Furthermore, using plgrp, we also saw that the threads of the processes of interest had their home lgroups on different system boards.

The first remedy we evaluated was assigning all the threads of a process to the same home lgroup. We spread the process assignments out over all of the system boards. We expected that lgroup affinity would reduce off-board migrations and reduce L3 miss cost. But this increased performance only slightly, from 10 to 15 percent. More drastic action was needed.

Then we considered processor sets. Processor sets are exclusive, and we did not have the knowledge needed to assign all of the processes to processor sets without running the risk of unbalancing the load.

Finally, we considered processor binding. We would bind the latency-sensitive threads of the processes to their own CPUs. The danger in processor binding is that once a thread is bound to a CPU, it can only run on that CPU. There is a danger that it may not get to run promptly or even that it may be starved for CPU cycles. We reduced this danger by running the bound threads in the Fixed Priority (FX) scheduling class at a high priority with a large time slice. This solution provided the final answer.

We used the following commands:

  • To bind a thread or lwp (there is a 1:1 relationship in Solaris OS 9 and 10 between threads and lwps):
     
    pbind -b processor_id pid/lwpid
    
     
  • To assign a PID to the FX scheduling class:
     
    priocntl -s -c FX -m 59 -p 59 -t 300 -i pid "PID"
    
     

    That is, priority 59, 300 msec time slice.

The graph in Figure 5 shows the improved scalability after our efforts.

CPU Utilization Averaged Over All CPU Cores vs. the Business
Metric
Figure 5. CPU Utilization Averaged Over All CPU Cores vs. the Business Metric
Click here for a larger image
 

99,000 transactions per second vs. 60,000 is quite an improvement. This scalability can largely be attributed to a stabilization of the cost of L3 misses. Although we reduced involuntary context switches as well as thread migrations, the primary increase in performance was due to the reduction in cost of the L3 misses, as shown in Figure 6.

Miss Cost as Percent of System CPU
Figure 6. Miss Cost as Percent of System CPU
Click here for a larger image
 
Case Study Two: Sun Fire X4600 Server

The Sun Fire X4200 has two sockets, and each socket has two cores. The maximum number of HyperTransport hops to memory is one. The V40z has four sockets, each with two cores. The largest number of hops to memory is two. But the Sun Fire X4600 server has eight sockets, each with two cores. Now the number of hops to memory can be as much as three. The average memory access time for the X4600 is 155 nsecs, as compared to 105 nsecs for a two-socket system.

The application analyzed was Java technology-based. The problem description follows. The X4200 server could process a block of work in 214 minutes. This was accomplished with one Java Virtual Machine (JVM) and four worker threads.* When the application was deployed on an X4600 server with one JVM and 16 worker threads, the execution time was 195 minutes. The goal was to run the workload on the X4600 in one-fourth the time that it took to run the application on the X4200.

Figure 7 shows the layout of the sockets on the X4600.

X4600 Socket Layout
Figure 7. X4600 Socket Layout
 

Using a Sun internal tool that displays near real-time CPU utilization, perfbar, we noticed that there were periods when the Java process became single threaded. By turning on verbose garbage collection, we confirmed that these single-threaded periods corresponded to garbage collection. The decision was made to split the application into multiple Java processes. Because the workload was batch in nature, balancing the work between multiple processes was straightforward. Because the X4600 has four times the cores of the X4200, we decided to use four Java processes.

Recall that each socket contains a CPU with two cores. We reasoned that the key to making one X4600 server (16 cores) perform in the same manner as four X4200 servers (4 cores) was to partition the X4600 into processor sets so that each set resembled an X4200. In terms of the diagram in Figure 7, we wanted to pair sockets 0 and 1, sockets 2 and 4, sockets 3 and 5, and sockets 6 and 7. Other combinations are possible.

The commands to do this partitioning are as follows:

#delete any existing processor sets
psrset -d 1 2 3
#create processor set 1, assign cores 0,1 and 2,3 to processor set 0 (sockets 0 and 1)
psrset -c -F 0 1 2 3
#create processor set 2, assign cores 4,5 and 8,9 to processor set 1 (sockets 2 and 4)
psrset -c -F 4 5 8 9
#create processor set 3, assign cores  6,7 and 10,11 to processor set 2 (sockets 3 and 5)
psrset -c -F 6 7 10 11
# which leaves Cores 12,13 and 14,15 in the default set (sockets 6 and 7)

Use FX(Fixed Priority scheduling, $$ = pid of interest)
priocntl -s -c FX -m59 -p59 -t300 -i pid $$

in /etc/system,
set lgrp_mem_pset_aware=1
set lgrp_mem_default_policy=3 ( LGRP_MEM_POLICY_RANDOM_PSET)
 

We made the above changes to ensure that memory allocations would occur within the processor set.

By splitting up a single large Java process into four processes, assigning each to one of the above processor sets, using FX scheduling, and processor set-aware memory allocations, we were able to complete a piece of work that had been taking 195 minutes in fewer than 80 minutes. We did not reach our goal, but 80 minutes represented a considerable improvement nevertheless.

Summary

Integrated memory management units give modern processors significant performance advantages when accessing local memory. This advantage becomes a liability when memory accesses become disproportionately remote.

Modern systems seek to maximize execution efficiency in three ways: by giving a thread an affinity for the processor on which it most recently ran, by giving a thread an affinity for the processor that is "closest" to the memory that the thread needs to access (lgroups in the Solaris OS), and by using large L2 and L3 caches. Sometimes these policies are not enough to ensure efficient and scalable execution.

Intelligent policy-based scheduling would help, but the operating system can only do so much. An experienced competent analyst with sufficient knowledge of the workload can make the crucial difference between a smooth-running efficient system and an unscalable, erratically performing system.

_______
* As used on this Web site, the terms "Java Virtual Machine" or "JVM" mean a virtual machine for the Java platform.

References

Solaris Memory Placement Optimization and Sunfire Servers (PDF), technical white paper, Sun Microsystems, March 2003.

Sunfire X4600/X4600M2 Server Architecture (PDF), technical white paper, May 2007.

Richard McDougall and Jim Mauro, Solaris Internals, 2nd edition, Sun Microsystems Press, 2006.

Acknowledgments

Thanks to Bob Sneed of Sun Microsystems, who first articulated the concept of the business metric and its importance, and to Jim Fiori of Sun Microsystems for reviewing this paper and making valuable suggestions.

About the Author

Rickey C. Weisner is a senior engineer for Sun Microsystems. He currently works for Solaris 10 OS adoption in the U.S. Systems practice. His main areas of interest are performance analysis and C and C++ software development.

Rate and Review
Tell us what you think of the content of this page.
Excellent   Good   Fair   Poor  
Comments:
Your email address (no reply is possible without an address):
Sun Privacy Policy

Note: We are not able to respond to all submitted comments.