|
Memory allocation performance in single and multithreaded environments is an important aspect of any application. Some allocators, such as malloc in the Solaris Operating System, work best with single-threaded applications. However, a different approach must be taken when designing an allocator optimized for a multithreaded application.
Using a single-threaded malloc in a multithreaded application can degrade performance. As memory is being allocated concurrently in multiple threads, all the threads must wait in a queue while malloc() handles one request at a time. With a few extra threads, this can slow down performance, causing a problem known as heap contention. In other words, all the threads are competing for access to the same heap.
One indication of heap contention is that the application is making a considerably high number of calls to malloc(). To see these kinds of calls, use an application profiler or the Solaris OS apptrace command.
System library implementers take various approaches to alleviate the bottleneck of a singly threaded malloc(). One solution to this problem is to use multiple heaps. Here the term "heap" refers to blocks of memory that a process can allocate or free at run time. A heap is private to the process it is in. In other words, heap memory is the memory that is allocated or deallocated with calls to malloc() and free().
When there are multiple heaps, the threads do not need to wait for a heap's lock; instead each thread can use the one allocated to the CPU it is running on. Some modern memory allocators take this into consideration, and yield much greater performance with multithreaded code than a traditional allocator. Even better performance can be yielded with these modern allocators by using them on a multiprocessor system.
A faster, multi-heap memory allocator can be implemented while hiding the intricacies of the underlying memory operations with multiple heaps. Whatever the implementation, access to the heap(s) is handled by calls to malloc() and free() (in C), or new and delete (in C++). This makes it easy to switch memory allocators without much change to your program. These faster allocators can then be implemented into shared libraries that can be linked at runtime. The developer need only replace or override the library load path. However, for best performance, developers want to understand the underlying implementation of various allocators so they can choose the best implementation for their applications.
This article summarizes some performance tests that were run on various memory allocators. The allocators tested were Solaris malloc, Solaris mtmalloc, ptmalloc, Hoard, and libumem. malloc is the default memory allocator in the Solaris OS. Similarly, Solaris mtmalloc is a general-purpose allocator that is used with multithreaded applications. ptmalloc is the standard allocator used in the GNU libc library. Hoard is a fast, memory-efficient allocator that utilizes multiple processors. libumem is a port of the kernel memory allocator to the user level.
The Tests
Two test programs were run on these allocators. The first is ThreadTest, which is a program that comes with Hoard. ThreadTest generates one or more threads that allocate and free a large number of chunks of memory, all of the same size. It then repeats this process over a fixed number of iterations. The other is mtmalloctest, written by Greg Nakhimovsky. mtmalloctest allocates and frees blocks of memory, of varying size. It then repeats the process, also over another fixed number of iterations.
It should be noted that these tests do not represent typical allocation patterns. They are worst-case stress tests that rapidly allocate and deallocate large chunks of memory. Therefore, keep in mind that these numbers represent worst-case performance. Performance in real-world applications varies with how those applications allocate memory.
These tests were run on a Sun Fire 4800 machine, with 12 750-MHz UltraSPARC III processors and 24 GB of RAM, running on the Solaris 9 OS. The tests were built using the Sun ONE Studio Compiler Collection. The libraries were linked in various ways, using a combination of static linking and LD_PRELOAD. Several trials were run for each allocator, each for a different number of threads: 1, 2, 4, 6, 8, and 10. The results presented here are the average of all the runs. Two things were measured in these tests: First, the execution time was recorded. In these tests a shorter execution time indicates faster performance. Second, the amount of heap usage was recorded. With the heap tests, a smaller heap usage indicates less wasted memory and thus better efficiency. What follows is the results of some of these tests, and conclusions we can draw from them. See the Analysis section for more explanation of these results.
ThreadTest Average Time
Clearly, the slowest performer here is malloc on the Solaris OS. A smaller execution time means faster performance, and therefore represents a better result. All the other allocators get faster as more threads are added, with Solaris mtmalloc and Hoard winning neck-in-neck.
Figure 1: ThreadTest Performance by Number of Threads
|
mtmalloctest Average Time
Again, malloc on the Solaris OS slows down more and more as the number of threads increases. Just as with the ThreadTest graph, a shorter execution time is better. The other allocators remain near constant, performing very well.
Figure 2: mtmalloctest Performance by Number of Threads
|
ThreadTest Heap Usage
libumem uses the greatest amount of heap memory, as shown in the following figure. With the heap tests, a smaller number is better. The others follow about the same pattern, except for ptmalloc, which has very low numbers. See the Analysis section for more explanation.
Figure 3: ThreadTest Heap Usage by Number of Threads
|
mtmalloctest Heap Usage
Again, in this instance, smaller heap usage is better. In the trials of mtmalloctest, we see greatly increased heap usage by mtmalloc as more memory is allocated.
Once again, heap usage for ptmalloc is very low, barely showing up at all on the graph. See the Analysis section for further explanation on heap usage for ptmalloc.
Figure 4: mtmalloctest Heap Usage by Number of Threads
|
Analysis
Runtime Performance
malloc on the Solaris OS performs well only in single-threaded applications. Its performance is unsuitable for multithreaded applications, because its performance worsens as threads are added. This is due to the fact that malloc maintains a single heap, and thus only one thread can allocate/deallocate memory at any given time. Therefore, with the addition of more threads, we find more threads waiting, and the wait time grows longer, resulting in increasingly slow execution times.
mtmalloc and Hoard are extremely close, both giving very fast multithreaded performance. All of the allocators (except for Solaris malloc) showed performance gains with more threads, but these two had the best performance at all thread levels. This is an improvement over the mtmalloc that was included with the Solaris 8 Operating System. In previous tests, Sun engineers found that mtmalloc in the Solaris 8 OS did speed up as more threads were added, but did not scale as well as Hoard or ptmalloc. mtmalloc in the Solaris 9 OS has closed that gap, performing even slightly faster than Hoard in our ThreadTest test.
Heap Usage
When an application executes, the operating system creates a virtual address space for the new process. This address space contains items such as the stack, the application's code, data, and the heap. The heap is where dynamic memory allocation occurs, as discussed in this document. Various system calls are used to manage a process's heap. The brk() and sbrk() system calls can be used to resize a process's data segment to accommodate requests for more memory allocations. See the manual page for brk() and sbrk() for more details.
malloc on the Solaris OS is slow because only one thread may allocate or deallocate memory at any given time. However, since only one memory operation is going on at a time, the overall heap usage is low. While low heap usage is good, the slow performance of malloc in multithreaded applications still should not justify using it in such applications.
ptmalloc has the lowest reported heap usage. The allocation scheme for ptmalloc is different than for the others we tested, allocating memory in anonymous pages. An anonymous page is a page that is not associated with a particular vnode, or file/directory. The program used in these tests to measure the heap size did not take anonymous pages into account, and therefore ptmalloc's heap usage numbers appear lower than they really are.
Heap usage for Hoard and libumem is in the middle, and does fairly well.
Heap usage for Solaris mtmalloc was much higher than the others, especially in mtmalloctest. This is most likely because mtmalloc is a power of two allocator, that is, it rounds the size of requested blocks up to the next power of two. For example, if a request is made to allocate 17 bytes, the actual size of the allocated block would be rounded up to 32 bytes. As these requests add up, it can lead to internal fragmentation, so that the "extra" memory that is allocated is wasted.
Conclusions
So, which allocator should you use in your application? malloc on the Solaris OS was shown to be a strong choice for single-threaded code. However, for better performance and scalability, you should try to multithread the application. Multithreaded code can improve performance and scalability on a multiprocessor system using a multithreaded memory allocator. For multithreaded code, you should use Solaris mtmalloc or Hoard. As shown in this document, both options yielded close results. If heap usage is a concern, use Hoard, since mtmalloc wastes some space by using fixed power of two allocation sizes. Otherwise, either of these will perform well for your application.
Keep in mind that these conclusions apply to applications running on systems with multiple processors. Uniprocessor systems were not tested in these benchmarks, and may show different results for these allocators.
Resources
About the Authors
Neelakanth Nadgir is a software engineer in Sun's Market Development Engineering organization. He works with tool vendors to develop "best of breed" applications on Sun systems. He also volunteers for the GNU project. In his spare time, he likes to go hiking in Big Basin State Park.
Joe Attardi is an intern at Sun Microsystems in Burlington, Massachusetts. He is currently in his senior year at the University of Massachusetts at Lowell.
|