This article provides a set of best practices for developers and system administrators who want to achieve maximum application performance on chip multithreading (CMT) architectures such as Sun servers with CoolThreads technology. A brief introduction to throughput computing is presented along with several examples that illustrate what CMT means in the context of the Solaris Operating System. Finally, several best practices for application development and deployment on CMT architectures are presented. For the latest information, see Sun Fire Servers with CoolThreads Technology on sun.com.
1. Introduction to Throughput Computing
Throughput computing is a new approach to system design that delivers higher throughput -- the aggregate amount of work done -- by relying on processors with CMT technology, providing multiple threads of execution on a single chip. Traditional processor designs have focused on increasing the speed of execution of a single instruction stream. However, those designs are limited in that memory speeds have not been increasing at the same rate as processor speeds, so the processor often spends most of its time waiting for memory references. The example below shows how a reduction in processor compute time for a single thread of execution results in small time savings when the workload's time is already dominated by memory latencies, a common case in today's server applications (see Figure 1).
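The effect can be put in rough numbers. The following is a minimal sketch, assuming a hypothetical workload in which each unit of work costs a fixed compute time plus a fixed memory stall time; the function name and the 25/75 split are illustrative and are not taken from Figure 1.

```python
def single_thread_speedup(compute, memory, compute_speedup):
    """Overall speedup when only the compute phase gets faster.

    compute, memory: time spent per unit of work in each phase
                     (any consistent unit).
    compute_speedup: factor by which the compute phase is accelerated.
    """
    before = compute + memory
    after = compute / compute_speedup + memory
    return before / after


# Hypothetical workload: 25% compute, 75% waiting on memory.
# Doubling compute speed shrinks 100 time units to 87.5,
# which is only about a 1.14x overall improvement.
print(round(single_thread_speedup(25, 75, 2.0), 2))
```

With no memory stalls at all, the same function returns the full 2x, which shows that the limited gain comes from memory latency rather than from the processor itself.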
Throughput computing, in contrast, exploits the fact that server workloads typically run multiple jobs at the same time. Given a transistor density, the processor is optimized for parallel computation. In the example below, the processor is enabled to run four execution threads, and can alternate between them at every clock cycle. When one thread is stalled waiting for memory, it is simply skipped. This approach leads to higher processor utilization (the sum of the "C" blocks), even when running at the same clock rate as the previous example. Throughput workloads that are comparatively memory intensive, therefore, see little benefit from clock rate improvements on traditional processor architectures, whereas on threaded CMT architectures (for example, using the UltraSPARC T1 processor), these workloads can see enormous benefit. These workloads spend more of their time waiting for memory requests to be satisfied, and that time can now be used to execute other threads (see Figure 2).
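A deliberately idealized model makes the utilization argument concrete. The sketch below assumes that every thread alternates between a fixed compute phase and a fixed memory stall, that the core can issue for one thread per cycle, and that switching between threads is free. These assumptions and the function name are illustrative only; real workloads and the actual UltraSPARC T1 pipeline are more complex.

```python
def core_utilization(threads, compute, memory):
    """Fraction of cycles in which the core has a runnable thread.

    Each thread is busy for compute / (compute + memory) of the time;
    the core cannot be more than 100% utilized.
    """
    demand = threads * compute / (compute + memory)
    return min(1.0, demand)


# Hypothetical workload: 25 cycles of compute per 75 cycles of memory stall.
for n in (1, 2, 4):
    print(n, core_utilization(n, 25, 75))
```

Under these assumptions a single thread keeps the core busy a quarter of the time, while four such threads are enough to fill the cycles that would otherwise be spent waiting on memory.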
Taking this approach further, a CMT processor can replicate multiple processing units (also known as cores) on the same chip, leading to substantial improvement in processing density. For example, the UltraSPARC T1 processor can have up to 8 cores and 32 hardware threads, all on the same chip. This technology not only delivers better aggregate throughput, but is also more efficient on power consumption. For more details, visit the Throughput Computing section on sun.com.
2. Throughput Computing in the Solaris OS
With the Solaris OS on a CMT architecture (for example, using the UltraSPARC T1 processor), the first thing to notice is that each of the hardware threads is treated as a logical processor. The Solaris OS will schedule LWPs (either processes or software threads) on each of them, and let the chip handle the low-level thread switching in hardware, where it can be done at every clock cycle. Contrast this to the thousands of instructions needed to do a context switch in software, and we can see why this architecture is so efficient in running multiple jobs in parallel.
The Solaris OS will assign each hardware thread its own CPU ID. On an eight-core UltraSPARC T1 processor, the command
In the above output (logical) processors 0 through 3 correspond to the hardware threads on the first core, (logical) processors 4 through 7 correspond to the hardware threads on the second core, and so on. Thus the server looks, for practical purposes, like an SMP on a chip. Similarly, if an application inquires about the number of processors configured (for example, via the
Despite the radical chip design with a highly threaded architecture, CMT did not require fundamental changes to the operating system. The Solaris OS has been optimized over many years to scale to a large number of processors. For example, a Sun Fire E25K server has 144 (single-threaded) cores so scheduling jobs on 32 logical processors was familiar territory. The CMT-specific optimizations required by the multicore and multithreaded nature of the processor have been implemented in the Solaris 10 OS. By making the OS aware of the relationship between logical CPUs and physical resources (cores, caches, and Translation Lookaside Buffers [TLBs]), the scheduler can make educated decisions on how best to use the available resources. For example, here's the
Notice that the eight jobs were evenly distributed, one running on each core, instead of being blindly assigned to any available hardware thread. Another Solaris enhancement is the support of a new machine architecture, sun4v. This is needed to make use of the UltraSPARC T1 processor's new Hypervisor interface, a thin layer of firmware that presents a virtualized machine environment to the operating system:
This new machine architecture is transparent to applications: it only affects the contract between the processor and the Solaris kernel. It does not affect the interfaces between the Solaris OS and user applications. The UltraSPARC T1 is a fully compatible SPARC v9 implementation, and existing SPARC binaries will run unchanged. More details on Hypervisor are available on the OpenSPARC web site.
Multithreaded vs. Multi-process
It is a common misconception that CMT only works with multithreaded applications. Because the term "thread" is used for the hardware, people may assume that the software needs to be threaded as well. This is not the case. Multiple single-threaded processes can also take good advantage of CMT. The Solaris OS handles processes and threads in a similar fashion: they are both scheduled as LWPs (lightweight processes). As long as there are enough active LWPs to keep the cores busy, applications will reap the performance benefits.
Idle Time on CMT
In traditional SMP architectures, when a processor became idle, it entered an idle loop looking for new work to do (and consuming CPU cycles in the process). Clearly, this would be suboptimal in a CMT architecture, where the core's compute cycles are shared among its hardware threads. To improve efficiency, the Solaris platform has been modified to park an idle hardware thread when it is not running any job. The hardware thread then stays out of the way, and is reactivated only when the OS scheduler finds something to run on it.
This means that the concept of idle time, as reported by traditional monitoring tools like
For example, here's the
This shows CPU 1 as being idle, so it could seem that 25 percent of the core's cycles are wasted. In reality, the processor is doing its magic, distributing its compute cycles among hardware threads 0, 2 and 3. For people interested in a low-level analysis of processor utilization, the Solaris OS allows access to the processor counters via the
3. Development Best Practices
The most important consideration for software developers looking at Sun's CMT architecture is that the same practices still apply. Development recommendations are no different from those for traditional SMP architectures. As radical as the processor architecture is, it does not introduce new paradigms in software design. In fact, it has lowered the barriers of entry to highly-threaded servers, so good parallel programming becomes even more relevant than before. As the industry moves to chips with multiple cores and threads, the applications that are designed to scale to large numbers of logical processors will be more competitive. Having said that, here are some specific recommendations that developers should keep in mind:
4. Deployment Best Practices
At first glance, it may seem reasonable to configure applications according to the number of cores on the system. For example, an UltraSPARC T1 processor with eight cores can be thought of as an eight-way system. However, it is important to remember that the CMT architecture delivers the most benefit when all of its hardware threads are in use, because this increases the utilization of each core and, therefore, the processor's overall throughput. These are, after all, "thread-hungry" processors. So, when deploying applications on a CMT processor, it is recommended to configure them to use a large enough number of active LWPs. Most applications provide tunable parameters to set the number of software threads or processes used, for example, to adjust the size of a process pool or the number of worker threads.
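As an illustration of such a tunable, the sketch below sizes a worker pool from the number of logical processors rather than from the number of cores. It is a minimal example under stated assumptions: `default_worker_count`, `per_cpu`, and `handle` are hypothetical names, and `os.cpu_count()` is assumed to report logical processors (which on a CMT system would be expected to correspond to hardware threads).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count(per_cpu=1):
    """Suggest a pool size based on logical processors, not cores.

    per_cpu is a hypothetical tuning knob: values above 1 may help
    when workers spend much of their time blocked on I/O.
    """
    cpus = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, cpus * per_cpu)


def handle(request_id):
    # Placeholder for real per-request work.
    return request_id * 2


# A thread pool is used here for brevity; a process pool is the
# analogous choice when the work is CPU-bound in CPython.
with ThreadPoolExecutor(max_workers=default_worker_count()) as pool:
    results = list(pool.map(handle, range(8)))

print(default_worker_count(), results)
```

The right value still depends on the workload, so a setting like this is best treated as a starting point to be validated by measurement.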
The
Notice that only one of the hardware threads on this core is active. This is most likely a suboptimal configuration, and performance could improve with a higher number of LWPs.
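This kind of check can be automated by grouping busy logical CPUs by core. The sketch below assumes the UltraSPARC T1 numbering described earlier, in which logical processors 0 through 3 share the first core, 4 through 7 the second, and so on. The list of busy CPU IDs is hypothetical sample data, not output from any Solaris tool, and the function names are illustrative.

```python
from collections import defaultdict

# Assumption: four hardware threads per core, numbered consecutively.
THREADS_PER_CORE = 4


def core_of(cpu_id, threads_per_core=THREADS_PER_CORE):
    """Map a logical CPU ID to the index of the core it belongs to."""
    return cpu_id // threads_per_core


def active_threads_per_core(busy_cpu_ids, threads_per_core=THREADS_PER_CORE):
    """Count how many busy hardware threads each core has."""
    counts = defaultdict(int)
    for cpu_id in busy_cpu_ids:
        counts[core_of(cpu_id, threads_per_core)] += 1
    return dict(counts)


# Hypothetical sample: one busy thread on core 0, four on core 1.
print(active_threads_per_core([0, 4, 5, 6, 7]))  # {0: 1, 1: 4}
```

A core that reports only one busy thread out of four, as core 0 does in this sample, is a candidate for running more LWPs.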
To better monitor LWP activity, the
Now we can see how many LWPs that process has (
This information can help administrators to adjust the number of active LWPs so that it better matches the available hardware threads.
Application Consolidation
Because of the increased throughput capabilities and large number of logical processors, CMT architectures are good candidates for workload consolidation -- an example would be combining multiple logical tiers, such as application server and database tiers, on the same physical host. The Solaris OS does a good job of balancing the load over all the cores in a CMT processor, so in most cases no additional tuning is needed. However, there may be situations, especially if the applications have very different runtime characteristics, when some kind of isolation can provide additional benefits, such as better management of the system resources and improved hit rates on the cores' caches.
The Solaris platform offers several ways to segregate applications running on the same host. One is to use processor sets, with the Solaris command
Then, processes or even individual LWPs can be assigned to this processor set:
Another way to segregate multiple applications running on the same host is to use Solaris Containers and the Solaris Resource Manager. Containers, a virtualization tool, allow the creation of multiple private execution environments within a single instance of the Solaris OS. Processor sets can be created in the context of a resource pool, and then associated with a specific container. For detailed instructions, see the How To guide to Solaris Containers. On a CMT processor like the UltraSPARC T1, it is recommended to keep the grouping at the core level. Splitting a core's hardware threads over multiple processor sets or resource manager pools could lead to suboptimal use of shared resources like the Level 1 cache.
5. Application Qualification
In general, software vendors qualify their products to a target Solaris version, and not to specific hardware configurations. For example, they test and support their software on the Solaris 10 OS for SPARC platforms, instead of testing it on every single hardware configuration that runs this operating system. This is possible because Sun maintains the same instruction set and the same operating system across the entire SPARC product line.
Sun's implementation of CMT does not disrupt this model. The UltraSPARC T1 chip is binary compatible with existing SPARC processors, as shown by the
Furthermore, servers with the UltraSPARC T1 processor use the same Solaris 10 OS as other servers based on the SPARC platform. Thus, applications that are qualified to run on the Solaris 10 release on the SPARC architecture are automatically qualified to run on servers using the UltraSPARC T1 processors, such as the Sun servers based on CoolThreads technology.
6. For More Information
7. Glossary