Abstract: This article introduces CMT technology, such as chip multi-processing and simultaneous multi-threading, from a software developer's perspective.
Contents: 1. Introduction to Chip Multi-Threading
Chip multi-threading (CMT) brings to hardware the concept of multi-threading, similar to software multi-threading. Software multi-threading refers to the execution of multiple tasks within a process. The tasks are executed using software threads. The software threads are executed on a single processor or on many processors simultaneously. A CMT-enabled processor, similar to software multi-threading, executes many software threads simultaneously on the cores within a processor. So in a system with CMT processors, software threads can be executed simultaneously within one processor or across many processors. Executing software threads simultaneously within a single processor increases the processor's efficiency because wait latencies are minimized.
The paper introduces CMT technology, such as chip multi-processing (CMP), chip multi-processing with hardware threads, and simultaneous multi-threading, from a software developer's perspective. It introduces Sun's throughput computing strategy and Sun's first CMT processors, UltraSPARC US-IV and the Niagara. Software tuning techniques to improve application efficiency on a CMT processor-based system are then introduced. A prototype Java technology-based application that uses some of the techniques in the paper demonstrates scalability on a CMT system.
Throughput Computing: Sun introduced throughput computing [1],[2] to cut the cost and complexity of network computing. The idea is to maximize application workload throughput, or the aggregate amount of work done in a given amount of time [1], by using CMT-enabled processors. The CMT processors execute multiple software threads simultaneously -- in contrast to the single thread processed by current non-CMT processors -- and this can help improve price/performance and reduce the total cost of ownership. Throughput computing maximizes the throughput per processor and per system.
A processor with multiple cores can therefore increase throughput by up to the number of cores per processor. This increase in performance comes with lower cost, fewer systems, reduced power consumption, and less maintenance and administration, and with increased reliability because fewer systems are involved.
1 Introduction to Chip Multi-Threading
CMT brings to hardware the concept of multi-threading, similar to software multi-threading. Software multi-threading refers to the execution of multiple tasks within a process. The tasks are executed using software threads. This allows other tasks to execute when one thread is blocked, for example on an IO operation. A central processing unit or a processor executes instructions from a software application [3]. Figure 1 shows a one-core processor with the L1 cache on the core and the L2 cache off the core. The processor executes instructions from one software thread and waits, wasting clock cycles, when it executes instructions that need memory operations. This wait accounts for most of the execution time [4].
Figure 2 shows compute cycles followed by a wait, as memory is accessed. The compute cycles are 'C', and the memory wait time is 'M'. The wait time is dependent on the speed of the memory operations. Memory operations are slow compared to the processor speed. Processor speed has increased many times -- it doubles every two years, while memory is still very slow, doubling every six years [1]. Memory access like a cache miss can take quite a few clock cycles (it could be in hundreds of clock cycles), and the processor has to wait until the memory access is complete. Increases in processor speed without an increase in memory speed will only improve performance by a small amount, since the processor still has to wait for the clock cycles for the memory access to complete [5]. Figure 3 shows a faster processor; compute cycles are faster, but memory access time is still the same. The compute time saved is not much as the memory access slows the processor. (The faster processor waits the same amount of time for memory access.)
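The effect shown in Figures 2 and 3 can be illustrated with a small calculation. The following is a minimal sketch, not taken from the paper: the class name and the figures of 100 time units of compute ('C') and 300 time units of memory wait ('M') are assumptions chosen only for illustration. It models one compute phase followed by one memory wait, where only the compute phase shrinks as the processor gets faster.

```java
// Illustrative model only: the time values used in main are assumptions, not measurements.
class MemoryWaitModel {
    // Total time for one compute phase followed by one memory wait.
    // computeTime shrinks as the processor gets faster; memoryTime does not.
    static double totalTime(double computeTime, double memoryTime, double cpuSpeedup) {
        return computeTime / cpuSpeedup + memoryTime;
    }

    // Overall speedup obtained from a processor that is cpuSpeedup times faster.
    static double overallSpeedup(double computeTime, double memoryTime, double cpuSpeedup) {
        return totalTime(computeTime, memoryTime, 1.0)
                / totalTime(computeTime, memoryTime, cpuSpeedup);
    }

    public static void main(String[] args) {
        double c = 100.0; // hypothetical compute time ('C')
        double m = 300.0; // hypothetical memory wait ('M')
        System.out.println("baseline time: " + totalTime(c, m, 1.0));
        System.out.println("2x CPU time:   " + totalTime(c, m, 2.0));
        System.out.println("speedup:       " + overallSpeedup(c, m, 2.0));
    }
}
```

With these assumed numbers, doubling the processor speed reduces the total from 400 to 350 time units, an overall speedup of about 1.14 rather than 2, because the memory wait is unchanged.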
Similar to software multi-threading, instruction execution could also be multi-threaded in hardware using additional cores on the same silicon die. The additional cores can continue processing instructions from other threads while one core is waiting for memory to be accessed. The use of multiple cores (two or more) on a silicon die to process instructions from multiple threads is chip multi-processing (CMP) [6]. The core using additional registers and logic could switch the threads in hardware to continue execution instead of waiting for the memory operation to complete. This switching in hardware is known as vertical multi-threading (VMT) [7]. The core can process instructions from multiple threads simultaneously without switching between software threads. This is known as horizontal multi-threading (HMT) or simultaneous multi-threading (SMT). These techniques to improve application performance by executing more than one software thread on a silicon die are known as chip multi-threading. CMT improves application efficiency by executing instructions from multiple software threads in parallel per clock cycle. These instructions are executed in parallel without any changes to the multi-threaded application code. Using the latest compilers to optimize the multi-threaded application code should increase this performance. The operating system needs to be CMT aware, to minimize resource conflicts, maximizing the parallelization and application throughput. The throughput can be increased by the number of cores per processor. 1.1 Chip Multi-Processing (CMP) Chip multi-processing (CMP) involves having cores on a silicon die. Figure 4 shows a two-core CMP processor with an on-core L1 cache and an external split L2 cache. A core is made up of functional units, pipelines, caches, and so on, which can process computer instructions. The cores could have their own L1 and L2 cache and share a L3 cache. The cores could also share the memory and cache bus.
If access is required to memory due to a cache miss, a memory stall, or a Translation Lookaside Buffer (TLB) miss, this can take from tens to hundreds of cycles. A processor with just one core needs to wait until this access is completed (Figure 2). The OS can context switch [6] the thread instead of waiting for the memory access to complete. Software context switches can take quite a few cycles. A processor with multiple cores can continue executing software threads while a thread on one core waits for a memory access. Having more than one core improves performance per processor since other threads continue execution. This also enables thread level parallelism (TLP) [8] as instructions from multiple threads are executed simultaneously per processor. Figure 6 shows two threads executing simultaneously on two cores. 1.1.1 UltraSPARC US-IV UltraSPARC US-IV [9],[10] is a CMP processor derived from the UIII [11] processor but with two cores. UIII has only one core. Each core on a US-IV has an internal L1 cache (64 Kbyte data, 32 Kbyte instruction) and an external L2 cache (16 MB). The L2 cache is a split L2 cache and each core can access 8 MB of the split cache. The address and data bus to the cache are shared. The memory controller is on the processor, and the path to the memory is shared between the two cores. Each core can process 4 instructions per clock cycle. The cores have a 14-stage pipeline and use instruction level parallelism (ILP) [12] to execute 4 instructions every cycle. The two cores could together execute a maximum of 8 instructions per clock cycle. Figure 5 shows independent instructions in a single thread being executed simultaneously on a core.
The two cores appear as processors to the OS, and the OS schedules software threads on the cores. Each core can execute one software thread, so two software threads can be executed simultaneously. On a memory access, only the core involved in the access waits, while the second core continues processing instructions. The number of instructions processed by the processor is now two times that of a processor with a single core. The processor executes instructions from two different threads simultaneously, providing thread level parallelism (TLP) in addition to the instruction level parallelism on each core. TLP can help in increasing the performance of applications by almost doubling the throughput. 1.2 Chip Multi-Processing With Hardware Threads The OS schedules software threads on the cores in a US-IV processor. In the case of a memory access, a core needs to wait for the access to complete while the other core continues processing instructions. This is not very efficient as a core wastes clock cycles. The core using more registers and logic in hardware can context-switch the software thread to another software thread. This logic in hardware holding the thread state is called a hardware thread. The hardware thread appears as a logical processor to the OS. A processor with n hardware threads would appear as n logical processors. For example, if the US-IV processor had 4 hardware threads per core, it would appear to have 8 logical processors (4 hardware threads * 2 cores).
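From an application's point of view, the logical processor count is what matters when sizing thread pools. The sketch below is an illustration under stated assumptions: the class and method names are hypothetical, and the topology values are the ones used as examples in the text. It multiplies out the topology and also prints the count the JVM reports through `Runtime.availableProcessors()`, which on a CMT system would typically reflect the logical processors the OS exposes to the process.

```java
// Hypothetical helper for illustration; the topology numbers are examples from the text.
class LogicalProcessors {
    // Logical processors the OS would see for the given topology.
    static int logicalProcessors(int processors, int coresPerProcessor, int hwThreadsPerCore) {
        return processors * coresPerProcessor * hwThreadsPerCore;
    }

    public static void main(String[] args) {
        // The text's hypothetical: one two-core processor with 4 hardware threads per core.
        System.out.println("two cores x 4 hardware threads: " + logicalProcessors(1, 2, 4));
        // What this JVM actually sees on the machine it runs on.
        System.out.println("available to this JVM: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```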
The OS schedules software threads on these logical processors. On a memory access, the core switches to a new hardware thread so it does not have to wait for the access to be complete. The context switch is in hardware so no clock cycles are wasted. Hardware threads improve the efficiency of a processor by processing instructions every clock cycle. Figure 7 shows a context switch to a new thread on a memory access. The switch enables the core to continue processing instructions from the new thread while memory is being accessed from the old thread. (Faster processors can execute threads in parallel.)
1.2.1 Niagara Processor The Niagara1 processor (code name for Sun's new CMT processor) [13] has eight cores and four hardware threads per core. Each core can process one instruction every cycle. It has a L1 cache per core and a shared L2 cache. The processor does not use ILP [14] but focuses more on TLP. The processor has a simple mechanism to prevent execution stalls [13], switching to another hardware thread. This leads to a simpler processor with very low power consumption. It is well suited for systems such as blade servers, middleware, database servers, and so on. The Niagara processor has an on-chip Ethernet controller and a TCP/IP off load engine. The on-chip Ethernet processing saves processor cycles and avoids the memory access needed to process the network packets. 1.2.1.1 Locks in a Single CMT Processor Niagara based systems are single processor based systems. The L2 cache has 12-way associativity and is shared among the logical processors. Once a shared variable like a lock is accessed into the L2 cache then it can be shared by all the threads. This improves the performance compared to a symmetric multi-processing (SMP) system, where threads could be running on different processors, and the L2 cache is shared on each processor but not among processors. 1.2.1.2 Performance The Niagara processor can process eight instructions simultaneously every clock cycle or one instruction per core. 1.3 Simultaneous Multi-Threading Simultaneous multi-threading [15],[16],[17] uses hardware threads layered on top of a core to execute instructions from multiple threads. The hardware threads consist of all the different registers to keep track of a thread execution state. These hardware threads are also called logical processors. The logical processors can process instructions from multiple software thread streams simultaneously on a core, as compared to a CMP processor with hardware threads where instructions from only one thread are processed on a core. 
SMT processors have an L1 cache per logical processor, while the L2 and L3 caches are usually shared. The L2 cache is usually on the processor, with the L3 cache off the processor. SMT processors usually have logic for ILP as well as TLP. The core is usually not only multi-issue for a single thread, but can also simultaneously process multiple streams of instructions from multiple software threads.
1.4 Chip Multi-Threading
Chip multi-threading encompasses the techniques of CMP, CMP with hardware threads, and SMT to improve the instructions processed per cycle. To increase the number of instructions processed per cycle, CMT uses TLP [8] (as in Figure 6) as well as ILP (see Figure 5). ILP exploits parallelism within a single thread, using compiler and processor technology to simultaneously execute independent instructions from that thread. There is a limit to the ILP [1],[12],[18] that can be found and executed within a single thread. TLP can be used to improve on ILP by executing parallel tasks from multiple threads simultaneously [18],[19].
1.5 Symmetric Multi-Processing With CMT
SMP involves multiple processors on a single system sharing a single memory system. The memory appears as one to all the processors and can be accessed with the same latency. An SMP-based system with CMT has multiple processors, with each processor having cores and logical processors. The cores/logical processors share the single memory system.
2 Improving Application Performance on a System With CMT
2.1 Operating System The OS is the most important component that can improve the performance of applications on a system with CMT. CMT enables a single processor to process instructions from multiple software threads simultaneously. The software threads are scheduled on the cores in CMP processors and logical processors in CMP processors with hardware threads and SMT processors. If the OS is CMT aware then it can schedule the application threads to minimize contention for the resources. The resources of contention are the system bus and shared caches. The OS needs to schedule threads per processor and then per core. 2.1.1 Solaris Operating System The Solaris Operating System (OS) is CMT enabled with the Solaris 10 release. Applications can see improved performance with the new TCP/IP stack, improved CMT scheduling (available on x86 platforms also), improved MPO, Solaris Containers, DTrace, and so on. The Solaris 9 OS also has some CMT features but applications can reap more benefits by being on the Solaris 10 OS. The Solaris OS provides the following features to improve application performance. 2.1.1.1 Multi-Threaded Library The Solaris OS provides a POSIX compliant as well as a native multi-threaded library for multi-threading applications. The library uses a 1:1 threading model. In a 1:1 thread model, an application thread has a corresponding kernel thread. A multi-threaded application with n application threads will have n kernel threads. Each application thread has a light-weight process (LWP) connecting it to the kernel thread. An LWP is always associated with a kernel thread, but not every kernel thread is associated with an LWP [20]. The kernel threads are the entities scheduled on the cores. 2.1.1.1.1 Usage 2.1.1.1.2 Using POSIX Threads CC -mt program -lpthread 2.1.1.1.3 Using Solaris Threads CC -mt program
Note: In the Solaris 10 OS, POSIX and Solaris thread library calls are now in
2.1.1.2 Adaptive Mutexes
The Solaris OS offers adaptive mutexes for use in device drivers. With an adaptive mutex, a thread will spin if the current thread holding the lock is on a core. This avoids the overhead of going to sleep. The thread also spins before sleeping, if there are n cores. Another feature in the Solaris OS has this effect: Before unlocking a lock, a thread spins to see if other threads are waiting for the lock and if the lock is acquired. This avoids waking up sleeping threads unnecessarily.
2.1.1.2.1 Usage
void mutex_init(kmutex_t *mp, char *name, kmutex_type_t type, void *arg);
void mutex_destroy(kmutex_t *mp);
void mutex_enter(kmutex_t *mp);
void mutex_exit(kmutex_t *mp);
int mutex_owned(kmutex_t *mp);
int mutex_tryenter(kmutex_t *mp);
2.1.1.3 Scheduler Activation
The Solaris OS provides support for an application thread to communicate with the kernel. The kernel informs the application thread of the current execution state of the LWP. Similarly, the application thread informs the kernel about the priority of the currently running thread, and also about preemption control. If the application thread is in a critical section, it can inform the kernel to not preempt it.
2.1.1.3.1 Usage
2.1.1.4 Rechoose Interval Rechoose Interval2 is a thread affinity tunable used by the scheduler to switch a thread from one processor to another [21]. This interval tells the scheduler that the L2 cache is still warm, and it could schedule the thread back on the same processor. This tunable is now core specific, and if L2 caches are shared by the cores in a processor, this tunable is for all the cores. This interval could be varied for some types of application to improve the performance [22]. 2.1.1.4.1 Usage
Rechoose interval can be set in
2.1.1.5 Memory Placement Optimization (MPO)
MPO3 can help ensure that data related to a process is stored as close to the processor as possible. This improves locality, reduces access latency, and improves bandwidth. Access latency refers to how quickly data can be brought from memory into the cache, and bandwidth refers to how much data can be moved in a given time. MPO [23] is an algorithm in the Solaris OS that places process data close to the thread accessing it, optimizing for latency and bandwidth.
The Solaris OS uses the concept of locality groups ( Note: MPO works using a "first touch" algorithm. An application that initializes all of its data structures within an initial serial thread and then spins off lots of compute threads will have all of its memory allocated as close as possible to the processor which ran the initialization thread (thereby guaranteeing that the N-1 threads will be working with "remote" memory). MPO can degrade performance in such cases. The degradation can be severe, a 30% decrease in performance. Usage:
2.1.1.6 Multiple Page Size Support
An application using the default page size (8 Kbyte) will have a high number of TLB misses. A TLB miss is very expensive [24]. The Solaris OS provides support for multiple page [25] sizes and allows applications to choose an optimum page size from 8 Kbyte to 4 Mbyte. The page sizes supported can be obtained using the 2.1.1.6.1 Usage
a. Using
export LD_PRELOAD=$LD_PRELOAD:mpss.so.1
export MPSSHEAP=size
export MPSSSTACK=size
a1. Example
export LD_PRELOAD=$LD_PRELOAD:mpss.so.1
export MPSSHEAP=4m
export MPSSSTACK=4m
b. Using
The /usr/bin/ppgsz [-F] -o option[,option] cmd | -p pid...
b1. Example of using
c. Using
The
c1. Example of using int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg,int attr, int mask);
The d. Using The /usr/bin/pagesize -a 2.1.1.6.2 Library /usr/lib/libmpss.so 2.1.1.7 Processor Sets Processor sets in the Solaris OS allow processors to be grouped. A process can then be scheduled to run in this set. Software threads from the process are scheduled on processors in the set. This can be used to increase the efficiency of scheduling and locality. Each set has its own dispatcher queue, so the scheduler looks in this queue to dispatch the threads. This reduces the latency of scheduling since the scheduler does not have to look at a global queue. Processors are also organized by system boards on Sun systems. So creating a processor set by using processors on the same board helps in improving locality. Interrupts within a processor set can be disabled to allow threads to run without being context switched. Note: Create processor sets using cores across processors. This will allow threads to be scheduled across processors minimizing resource conflicts. For Niagara based systems, create processor sets across cores. 2.1.1.7.1 Usage a. Processor sets can be created and cores assigned using the following command: %psrset -c %psrset -a 1 1 2 3 4 512 513 514 515 b. Interrupts can be turned off within a set using: %psrset -f 1 c. An application can be run within the set or LWPs of an application can be bound into the set using:
2.1.1.8 Multi-Threaded
2.1.1.8.1 Usage
a. Using LD_PRELOAD=libmtmalloc.so
b. Using LD_PRELOAD=libumem.so cc [ flag... ] file... -lumem [ library... ] 2.1.1.9 Compilers From version 9 onwards, Sun Studio software provides compile options4 that allow optimizations for UltraSPARC US-IV processors.[21],[26],[27] The compiler also provides software pre-fetch insertions. This allows data to be pre-fetched from memory. The compiler also provides a high performance library, especially optimized for multi-threaded applications. It also provides options to multi-thread single-threaded applications, an optimized math library, and profile feedback. The profile feedback can be used to track runtime paths and generate optimized code for branch predictions. 2.1.1.9.1 Usage a. Compile options
b. New Pragmas Introduced With Sun Studio 9 Software
2.1.1.10 Java Virtual Machine The JVM in the Solaris OS is optimized for CMT processors. The JVM is written in C++ and is compiled with the latest version of Sun Studio software. Using the latest Sun compilers with CMT optimizations can further improve application performance. 2.2 C/C++/Java Applications on the Solaris OS 2.2.1 Multi-Threaded Applications Multi-threading is the execution of many tasks within a single process. The tasks could be executed concurrently on different software threads. On an SMP system with no CMT, the software threads are executed on different processors in the system. On an SMP system with CMT, the threads are executed on cores in CMP processors, and logical processors in CMP processors with hardware threads and SMT processors.
Multi-threaded applications run with no changes on CMT-enabled systems. Optimizations made for SMP-enabled systems hold for CMT-enabled systems. The optimization specific to CMT-enabled systems is to avoid cache thrashing, as multiple threads now run simultaneously on a processor. Other optimizations include keeping threads that access shared data close to the shared L2 cache, and using lock-free data structures, atomic operations, thread local storage, multi-threaded
2.2.1.1 Lock-Free Applications
Locks are memory locations used to synchronize access to a critical section. Locks are accessed to read the current value. The value of a lock needs to be changed in an atomic way. Locks are shared by threads which need access to a critical section. If the lock is not in the L2 cache, it must be fetched from memory (an expensive operation). If it is in the L2 cache, threads running on cores within a processor can access the lock value in the shared L2 cache without a memory operation. A lock protects a critical section, so when one thread enters the critical section, other threads accessing the critical section will be blocked. When the lock is released, these blocked threads need to be woken up, resulting in context switches. Lock-free structures [28],[29],[30] avoid thread blocking and context switches. The data can also be in the shared caches, avoiding memory access.
2.2.1.1.2 Usage
Refer to Practical lock-free data structures and AppCore: A Portable High Performance Thread Synchronization Library.
2.2.1.2 Atomic Operations
Atomic operations [31],[32],[33] are similar to lock-free structures and are completed without needing a lock to access/change data. The operation could be incrementing/decrementing/setting a value at a memory location. This will be faster if the memory location value is in the shared L2 cache and accessed by threads running on cores within a processor. Atomic operations also avoid a thread context switch.
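The lock-free and atomic techniques described in the two sections above can be sketched with a shared counter. This is a minimal illustration rather than code from the paper: the class name `LockFreeCounter` and its methods are hypothetical, and the lambda syntax assumes a Java release newer than the J2SE 5.0 discussed here. The increment retries a compare-and-set instead of taking a lock, so a thread that loses the race loops again rather than blocking and being context switched.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example class; not part of the paper's test application.
class LockFreeCounter {
    private final AtomicLong value = new AtomicLong(0);

    // Lock-free increment: retry the compare-and-set until no other thread interfered.
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    long get() {
        return value.get();
    }

    // Starts `threads` threads, each incrementing `perThread` times; returns the final count.
    static long run(int threads, int perThread) {
        LockFreeCounter counter = new LockFreeCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("final count: " + run(8, 100000));
    }
}
```

In practice `AtomicLong.incrementAndGet()` does the same job in one call; the explicit loop is shown only to make the retry visible.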
2.2.1.2.1 Usage Atomic libraries are part of the Solaris 10 OS and part of Java 2 Platform, Standard Edition (J2SE) 5.0. a. C/C++ applications a1. Include file
a2. Library
Releases earlier than the Solaris 10 OS can use the library referenced in [33].
b. Java Applications
import java.util.concurrent.atomic.*;
// flag is an illustrative variable name
AtomicBoolean flag = new AtomicBoolean(false);
flag.set(true);
2.2.1.3 Cache Thrashing
Threads now run simultaneously within a processor. Because they all access memory simultaneously, small shared L1 and L2 caches can cause cache data that is still in use to be evicted as new accesses fill the caches. A running thread might then need the data brought in again from memory, degrading application performance [34]. Most applications cannot control how data is laid out in the cache [35],[36], but system-level applications like database servers, the JVM, drivers, and so on, can make use of the cache sizes to keep the most accessed data in the cache.
Cache statistics can be measured using
2.2.1.3.1 Usage
Example A
cpustat [-c events] [-nhD] [interval [count]]
cpustat -c IC_ref,IC_hit -c Cycle_cnt,Instr_cnt 1 3
Example B
cputrack [-T secs] [-N count] [-Defhnv] [-o file] -c events [command [args] | -p pid]
cputrack -T 0 -fve -c Cycle_cnt,Instr_cnt sh -c date
Example C
har [-{i|s|r|M}] [-{u|a|k}] [-DnmRh] [interval [count]]
har -r 1
2.2.1.4 Thread Local Storage Thread local storage (TLS) is data local to a thread and not accessible from other threads. This is similar to local variables in a function or a method. Local variables are allocated on the stack knowing that these can be destroyed at the end of the function or method. Similarly, thread local allocations can be kept on the stack but are needed for the lifetime of the thread. The compiler can optimize to pre-fetch these into the L2 cache for improved performance. Reducing global data to thread local data, if the data is not shared between threads, should improve performance and eliminate synchronization. 2.2.1.4.1 Usage a. C/C++ Applications
Variables are declared thread-local using the
__thread int i;
__thread char *p;
__thread struct state s;
cc -xthreadvar Program
b. Java Applications
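import java.lang.ThreadLocal; // java.lang is imported implicitly; shown for clarity

public class SerialNum {
    // The next serial number to be assigned
    private static int nextSerialNum = 0;

    // Each thread gets its own serial number the first time it calls get()
    private static final ThreadLocal<Integer> serialNum = new ThreadLocal<Integer>() {
        // synchronized protects the shared counter while a thread is initialized
        @Override
        protected synchronized Integer initialValue() {
            return Integer.valueOf(nextSerialNum++);
        }
    };

    // Returns the calling thread's serial number
    public static int get() {
        return serialNum.get().intValue();
    }
}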
2.2.1.5 Use VIS Instruction Set VIS6 is an extension of the SPARC instruction set. The VIS instructions can process four 16-bit data, packed into a register with one instruction. This instruction set can accelerate processing of some algorithms by as much as 7 times, by performing up to 10 operations in parallel per cycle [38]. Applications using imaging, linear algebra, signal processing, audio, video and networking can benefit from this increased performance. Applications written in the Java programming language can use VIS by means of wrappers. Wrappers in Java technology use Java Native Interface (JNI) to access the library. 2.2.1.5.1 Usage VIS can be used with the VSDK [39] or the Media Library [40]. a. Examples:
b. Wrappers for Java applications are also available for accessing the VIS library using JNI. 2.3 Single-Threaded Applications As the Solaris OS uses a 1:1 threading model, an application thread has a corresponding kernel thread. A multi-threaded application has n application threads (LWPs), and n kernel threads. A single-threaded application has one LWP. Running multiple single-threaded processes results in your application appearing as multiple LWPs to the kernel. The difference between the single-threaded application and multi-threaded application is the size of the process, how the data segments are loaded in memory, and the address space. In single-threaded apps, text and libraries segments are shared, but data can only be shared through interprocess communication across an address space. On the other hand, in a multi-threaded process, data between LWPs can be shared through intra-process communication within an address space, including direct access to a global variable. The second difference is the scheduling. The Time Share (TS) class has the time-quanta per process, so each LWP gets more core time. A single-threaded application waiting for IO will be context switched, but in a multi-threaded app, a second thread could continue processing using the time-quanta. Running multiple copies of a single-threaded process results in a process context switch, which is more expensive than a thread context switch. 2.3.1 Multi-Threading a Single-Threaded Application
If a single-threaded application cannot be multi-threaded, the loops and some of the library routines it uses can be made multi-threaded to improve the efficiency [41]. The Sun Studio compiler offers an
2.3.1.1 Usage
cc -fast -xO4 -xautopar example.c -o example
setenv PARALLEL 2
setenv SUNW_MP_THR_IDLE value
2.3.2 Single-Threaded Applications in a Container
The Solaris 10 OS provides containers called zones [42],[54], which provide fault isolation, resource allocation, and management. Solaris Zones can be used to sand-box applications on a server within a single operating system instance. This can be used to run multiple copies of a single-threaded application in isolation and is similar to running the single-threaded application on multiple systems. Zones provide a solution to virtualize the operating environment. Each zone can have its own IP address and its own file system. Single-threaded applications that depend on ports, IP addresses, and so on, can now be run in zones. This allows single-threaded applications to scale horizontally on a vertical system.
2.3.2.1 Usage
zonecfg -z lisa 'create; set zonepath=/aux0/lisa'
zoneadm list -vc
zoneadm -z lisa install
zoneadm list -vc
zoneadm -z lisa boot
zoneadm list -vc
zlogin lisa
[install and run your app]
2.4 Java Technology-Based Applications on the Solaris OS
Java applications need no code changes since the JVM is optimized for a CMT system. Using lock-free, atomic, or TLS data structures should improve the performance.
2.4.1 Garbage Collection Tuning
Tuning garbage collection [43],[44],[45],[46] is rarely considered. However, making just a few changes to the runtime environment, such as changing the garbage collector, can result in a substantial improvement in performance (see 3.1.2 for a 39% improvement). J2SE 5.0 technology offers automatic tuning, but this is limited to garbage collection (GC) [47] at the moment. GC can be tuned differently for throughput applications and for low-pause applications [47].
For throughput applications, automatic tuning in J2SE 5.0 software offers better performance. Low-pause applications can use automatic tuning or the concurrent collector for optimum performance at low pause.
2.4.1.1 Usage
a. For throughput applications, use:
java -XX:+UseParallelGC -XX:GCTimeRatio=xx application
b. For low-pause applications, use:
b1. Automatic tuning
java -XX:+UseParallelGC -XX:GCTimeRatio=xx -XX:MaxGCPauseMillis=xx application
b2. Concurrent collector
java -Xconcgc application
2.5 C/C++ Applications on the Solaris OS
A multi-threaded C/C++ application that has been optimized for scaling on an SMP system runs with no changes on a CMT-enabled SMP system. Using the latest compilers like Sun Studio with the compiler optimizations mentioned above should further improve the performance.
3 CMT System Efficiency
The efficiency of a CMT-enabled system can be identified by the memory stalls, cache references and misses, and number of instructions processed per cycle. Cache misses, memory stalls, and instructions processed can be obtained using the An SMP system with CMT should be able to increase the number of instructions processed by the number of cores. 3.1 Test Application in Java Programming Language I wrote a small Java application to find the scalability on an UltraSPARC US-IV based system. The system is a Sun Fire E6900 server with multiple domains running the Solaris 10 OS (Build 63). A domain with 4 processors (8 cores) @ 1.2 GHz (16 Gbyte of memory) was used to test the Java application. I wrote three different tests to test the scalability. One of the tests is a version in Java programming language of a working set problem from an earlier paper [24]. The output of the test is units of work done in a known amount of time. A thread checks if time has elapsed, and if not, goes ahead and starts performing the work. To reduce concurrency locking overhead, an atomic boolean test has been used [32]. The work done is stored on the stack (thread local storage could be used here), and this is updated in the instance through an atomic add [34] at the end of the test. This reduces contention to the boolean test, so increasing the number of threads should increase the work done linearly.
Each test was run for 300 seconds, and the number of threads was doubled for each test. The test outputs the results at the end of the test.
The program was run with no options to capture the baseline performance and scalability. Adding GC tuning, MPSS, and multi-threaded
The baseline command line:

%java CMT.Cmt 1 0 0 300000

Here is the tuned command line:

export LD_PRELOAD=$LD_PRELOAD:mpss.so.1:libumem.so
export MPSSHEAP=512k
export MPSSSTACK=64k
%java -ms2g -mx2g -XX:+UseParallelGC -XX:GCTimeRatio=30 CMT.Cmt 0 8 0 300000

The program takes four arguments. The first three each select a type of work and give the number of threads to use for it: the first is compute work, the second is working set work, and the third is IO work. The fourth is the time to work (300000 for the 300-second runs). Increasing the number of threads should increase the work done linearly, unless no more compute resources are available or a bottleneck occurs.

3.1.1 Test 1: Computation Work Results

Table 1: Test 1:1
Note: Tests were run to get the best numbers. Table 2: Test 1:2
The system scales almost linearly and flattens out at eight threads, as the system has only eight cores. Increasing the number of threads beyond eight does not increase the work done. The maximum efficiency achieved was 33% at 8 threads.

The efficiency was calculated against the maximum MIPS (footnote 7) possible. At 100% efficiency, the system with eight cores can execute 38400 MIPS at 0.25 CPI. The formula used is (processor speed * cores * number of instructions possible per cycle)/1000000. So for eight US-IV cores at 1.2 GHz, the maximum MIPS would be (1200000000 * 8 * 4)/1000000 = 38400 MIPS. With eight threads the MIPS achieved was 12761, which is 33% of the maximum MIPS.

3.1.2 Test 2: Working Set Results

Table 3: Test 2:1
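The Test 1 efficiency arithmetic from section 3.1.1 can be checked with a few lines of Java. This is only a worked restatement of the formula quoted there (processor speed * cores * instructions per cycle / 1000000), using the figures given in the text. The class and method names are hypothetical.

```java
// Hypothetical helper that restates the peak-MIPS formula from section 3.1.1.
public class MipsEfficiency {
    /** Peak MIPS = clock rate in Hz * cores * instructions per cycle / 1,000,000. */
    public static long peakMips(long clockHz, int cores, int instructionsPerCycle) {
        return clockHz * cores * instructionsPerCycle / 1_000_000L;
    }

    /** Achieved MIPS as a percentage of peak MIPS. */
    public static double efficiencyPercent(long achievedMips, long peakMips) {
        return 100.0 * achievedMips / peakMips;
    }

    public static void main(String[] args) {
        // Figures from the text: 1.2 GHz, 8 cores, 4 instructions per cycle (0.25 CPI).
        long peak = peakMips(1_200_000_000L, 8, 4);
        // 12761 MIPS is the value reported for eight threads.
        double efficiency = efficiencyPercent(12_761L, peak);
        System.out.println("peak MIPS = " + peak);
        System.out.println("efficiency % = " + efficiency);
    }
}
```

With the numbers above, the peak works out to 38400 MIPS and the efficiency to a little over 33%, matching the figures quoted for Test 1.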
Table 4: Test 2:2
The second test exercises working set access. The tests were run sequentially to use a warm cache. The scalability seen is due to caching, and it again peaks at 8 threads. The efficiency is poor because of the high number of cache misses and memory stalls.

3.1.3 Test 3: IO Work Results

Table 5: Test 3:1
The IO test checks scalability with disk access. The biggest bottleneck turned out to be file open. If the test is changed to open the file once, disk performance becomes the bottleneck. The domain had only one root disk, and at 8 threads iostat showed up to 400 writes/sec and an actv queue length of up to 250, with %w and %b at 100%. A faster disk (hardware RAID) should show good scalability.

Table 6: Test 3:2
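The difference between opening the file for every operation and opening it once, noted for the IO test, can be illustrated with a small comparison. The sketch below is an assumption about the shape of such a test, not the paper's IO workload: the class `OpenOnceSketch`, the record size, the use of temporary files, and the append-mode writes are all chosen for the example.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch comparing "open per write" with "open once".
public class OpenOnceSketch {
    /** Reopens the file in append mode for every record, paying for open and close each time. */
    public static void writeReopening(Path file, byte[] record, int count) {
        for (int i = 0; i < count; i++) {
            try (OutputStream out = new FileOutputStream(file.toFile(), true)) {
                out.write(record);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Opens the file once and reuses the same stream for every record. */
    public static void writeOpenOnce(Path file, byte[] record, int count) {
        try (OutputStream out = new FileOutputStream(file.toFile(), true)) {
            for (int i = 0; i < count; i++) {
                out.write(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns { bytes written reopening, bytes written open-once, nanos reopening, nanos open-once }. */
    public static long[] writeBothWays(int count, int recordSize) {
        byte[] record = new byte[recordSize];
        try {
            Path reopened = Files.createTempFile("reopen", ".dat");
            Path once = Files.createTempFile("once", ".dat");
            long t0 = System.nanoTime();
            writeReopening(reopened, record, count);
            long t1 = System.nanoTime();
            writeOpenOnce(once, record, count);
            long t2 = System.nanoTime();
            long[] result = { Files.size(reopened), Files.size(once), t1 - t0, t2 - t1 };
            Files.delete(reopened);
            Files.delete(once);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] r = writeBothWays(10_000, 512);
        System.out.println("reopen per write: " + r[2] / 1_000_000 + " ms, " + r[0] + " bytes");
        System.out.println("open once:        " + r[3] / 1_000_000 + " ms, " + r[1] + " bytes");
    }
}
```

Both variants write the same bytes. Only the number of open and close calls differs, which is the cost the IO test identified as its first bottleneck. Timings vary with the file system and disk, so none are quoted here.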
4 Conclusion
CMT offers an easy way to improve application efficiency without changes to the application code. Throughput can scale with the number of cores or logical processors. The power of CMT is in scheduling. The OS needs to be CMT-aware, and it should schedule threads with knowledge of which resources they share in order to maximize performance.
Applications using non-locking structures, atomic operations,

5 Acknowledgments
I would like to thank my colleagues for some very good feedback which has provided structure to this paper. Morgan Herrington did a great review and suggested the current structure. His experience on memory management is reflected in the MPO and MPSS sections, including the first touch degradation example. His review was a valuable input that increased the readability of the paper. Teresa Majer provided some very good feedback, including pointing me to CMT material, and feedback on CMP and CMT when I first started the paper. Nils Gura gave great feedback, including suggesting a glossary to manage the new terminology in the paper. Denis Sheahan provided valuable feedback on Niagara based systems. Hugo Rivero provided access to a Sun Fire E6900 server, and his feedback increased the readability of the figures and charts in the paper, including the scalability results. Bruce Chapman and S.R. Venkatramanan offered some excellent review and feedback, including the start of the huge reference section.

I would especially like to thank a lot of folks on the Sun internal aliases who again provided valuable feedback. Don Kretsch's suggestion of including single-threaded applications opened up the sections on compiler options and automatic parallelization. Patrick McGehearty and Paul Ryan's suggestions increased the readability of the paper, while Chien-Hua Yen's suggestions increased the accuracy of the sections on the Solaris 10 OS. I would like to thank my ISVs, Sonic Software and Highdeal Transactive, who took part in the OSS/J IP Billing Massive Scalability Study, for also reviewing the paper, especially Simon Waterer who provided some very good suggestions. I also wish to thank the folks on the Sun developer portals who spent time on editing, proofing and reindexing this paper. Finally, I would like to thank everyone else not mentioned in the paper but who provided valuable suggestions that made writing this paper possible.

6 References
7 Footnotes
Appendix A: Glossary
Appendix B: Prototype of Java Technology-Based Application
In this prototype, I have changed the working set and float values to integers to demonstrate better computational scalability. The prototype might show numbers which are much higher than the results in the paper.

About the Author
Nagendra Nagarajayya has been working at Sun for the last 11 years. He is a Staff Engineer in Market Development Engineering (MDE), where he works with Independent Software Vendors (ISVs) in the telecommunications industry, on issues related to architecture, performance tuning, sizing and scaling, benchmarking, POCs, porting, and so on. His interests include multi-threading issues, concurrency and parallelism, HA, distributed computing, networking, and performance tuning for Java and Solaris platforms.