The organization of your C++ software can have a significant impact on both compile-time and run-time performance. We present a number of techniques for improving performance. Topics include compilation, build environment, include files, inline functions, class design, function design, catching errors, memory allocation, and template design and instantiation. Because inconsiderate pursuit of performance can lead to unmaintainable software, we also discuss the engineering and maintenance implications of the techniques. IntroductionThe goal of any software development effort should be to produce a quality product in a timely manner. The run-time performance of the product contributes to its quality by delivering results faster. The compile-time performance of the product contributes to its timeliness by shortening the edit-compile-debug cycle. However, both run-time performance and compile-time performance are secondary factors in achieving timely quality. Therefore, you should consider run-time and compile-time performance improvements only when justified by improvements in overall product quality and timeliness. This paper presents a number of techniques for improving the performance of C++ programs ["Programming languages -- C++", ISO/IEC 14882:2003] using the Sun C++ compiler. Most of these techniques are not specific to the Sun C++ compiler and may be more generally applied. However, these techniques are not a complete list, so also see the references. The techniques presented are tactical in nature, providing generally local and incremental improvements in performance. You can often gain significantly more performance by improvements in data structures, algorithms, abstractions, goals, and management. Many of the techniques presented in this paper are not always helpful. For instance, some run-time performance improvements will degrade compile time and coding time. Likewise, some compile-time performance improvements will degrade run time. Fortunately, while all of a program contributes to compile time and coding time, usually only a small fraction of the program contributes most of the execution time. You can concentrate run-time performance improvements on the fraction of the program that contributes significantly to execution time, and apply compile-time performance improvements to the rest of the program. The fraction of the program that contributes most of the execution time is usually not obvious. Furthermore, evidence indicates that developers are usually very poor judges of where the time is spent in programs. Therefore, you should apply performance improvements only after the performance is measured, and the hot spots are known, not suspected. There are a number of tools available to measure performance. We recommend the Sun Studio Performance Analyzer [http://docs.sun.com/doc/819-3687], which is part of our product. For more system-level effects, truss and especially dtrace, can also be useful tools. CompilationWe are continually improving our compilers, both in their compile time and the run time of generated programs. Therefore, the most cost-effective technique for improving run-time and compile-time performance is often to simply use the newest compiler. TurnaroundTo achieve the fastest build times, compile without -g and without optimization. OptimizationThe compiler flag -O expands to -xO3, which has good balance between compile time and run time. For faster compilation, use -xO1 or -xO2. For faster execution, there are a number of options, including -xO4, -xO5, -xipo, -xprofile, -xlinkopt, and -xalias_level. Finally, a note of caution about the -fast flag. This flag is carefully interpreted to provide maximum performance on a single machine. It is not good choice for programs that are to run on a wide variety of machines. Build EnvironmentThe build environment can have a significant affect on product build times. Build times can easily be reduced by an order of magnitude by paying careful attention to system-level effects. The following rules are usually helpful, but their significance to overall build times can vary quite a bit. Use file systems that are local to the process. NFS file access is significantly slower than local file access, and the compiler may need to read and write many files. Use make, not scripts fired by make. Taking the time to write explicit dependencies on directory creation, etc, will generally yield faster build times. Use flat make structures and efficient makefiles. The cost of yet another invocation of make can be quite large, particularly when the makefiles use macros that resolve to executing a program. [Miller 1997] Use dmake, which is part of our product. Even on a single-processor machine, dmake can sometimes halve build times by overlapping computation and I/O. Install a local compiler. For single compilations, the cost of loading the compiler from an NFS file system will never be amortized. LibrariesMost C++ libraries are designed for good performance over a wide range of uses. For example, the C++ standard library embodies good algorithms with a significant number of services. However, their general applicability means that such libraries are not optimal for any given application. You can achieve better run-time performance, and often better compile-time performance, with your own custom library. However, designing, documenting, coding, and debugging a custom library is detrimental to timeliness. Make your own custom library only when you need product quality more than you need product timeliness. Include FilesA significant cost in compilation of C++ programs is simply reading what can be a very large number of include files. Program organizations that reduce the number of files included can significantly reduce compile times. AbstractionThe most effective technique for reducing file inclusion is to reduce interface complexity. Reducing interface complexity generally improves the understandability of the product and hence increases quality and timeliness, so good interfaces generally improve both compilation time and debugging time. You should understand abstractions that can help organize your program effectively [Gamma 1995]. However, unbridled abstraction can be detrimental to the product. For example, when a project group must use an external interface, the group often wraps the external interface by an internal interface. This internal interface is often just a simple renaming of the external interface, providing none of the decoupling benefits that abstraction normally provides, but consuming all of the costs in development time, compile time, and run time. So do not create abstractions that do not provide recognizable services. As an example of how abstractions can improve compile times, consider the case of a service provided through an abstract base class and factory, rather than through full classes and the new operator. For example: class service {
public:
virtual answer * query() = 0;
virtual void statement( data * ) = 0;
static service * generate();
protected:
service();
};
...
service * instance = service::generate;
instance->statement( data1 );
answer * result = instance->query();
In this example, the entire structure of the actual service is hidden from the clients of the service, and those clients need not spend the time to compile that structure, or include other structures implied by the internals of the service. Just as importantly, clients need not be recompiled when the internals of the service change, which can dramatically reduce incremental product build times. Multiple ClassesOften several class definitions are needed to provide a service. These classes should all be defined in a single include file, rather than one-class per include file. This technique reduces the number of files opened, which can have a significant effect on compilation time. Incomplete ClassesMany classes only get used indirectly within an include file. For instance, the answer and data classes in the above example are used only through pointers. In these cases, the compiler does not need the full definition of the classes, and incomplete definitions suffice. So, instead of #include "answer.h" use class answer; Guarded IncludesBecause managing the proper inclusion of class definitions can often become difficult, programs tend to rely on idempotent, or self-protecting, include files. For example, #ifndef SERVICE_H
#define SERVICE_H
class service { ... };
#endif
If the file has not been previously read, the class is defined, otherwise, the file is essentially empty. We can avoid the unnecessary read of the file by testing the guard on the outside of the include as well. #ifndef SERVICE_H #include "service.h" #endif This technique is most effective when the compiler is hitting the limits of available main memory and as a result the file cache is ineffective. Many compilers, including Sun Studio 11, will recognize the typical include guards and never open the file a second time. Thus, the additional guard at the include is redundant. So, this technique is only useful with some compilers. Inline FunctionsInline function expansion can significantly increase run time, but the cost is significantly increased compilation time. Do not use inline functions when you anticipate changes to the function definition and recompiling all callers is expensive. In particular, an inline function definition commits you to a much larger ABI. Do not commit yourself without due consideration. Cost/BenefitCalls to small and quick functions can be smaller and quicker when expanded inline than when called normally. Conversely, calls to large or slow functions can be larger and slower when expanded inline than when branched to. Furthermore, all calls to an inline function must be recompiled whenever the function definition changes. Consequently, the decision to use inline functions requires considerable care. The general advice is to use inline functions when the inlined function results in fewer instructions than the call to the function, or when the application performs significantly faster with the function inline. Usually this advice means that inline functions are either trivial, or used extensively in the hot areas of the program. Un-Inlinable FunctionsThe compiler cannot inline all function calls, so making the most effective use of function inlining may require some source changes. Use the +w option to learn when function inlining does not occur. The compiler currently will not inline a function below -xO4 under the following circumstances:
However, our compiler's inlining effectiveness is growing, so what is true this release may not be true next release. Conditional InliningInlining reduces run time, but increases compile time. Unfortunately, the value of run time versus compile time is not fixed throughout program development. During initial development and debugging, compile time is significantly more important than run time. Conversely, during final program build and benchmarking, run time is more important than compile time. You can control the amount of inlining by making inline definitions conditional on a preprocessor variable. The technique has the general structure:
The following example illustrates the above technique, as well as some techniques described earlier. (The code itself is meaningless.)
This technique is particularly valuable during debugging because the compiler does not inline when the -g switch is present. As a consequence, during debugging, inline functions consume object space and compile time without any performance gain. Likewise, the reduced number of file inclusions needed by inlining also improves compile time. However, when inlining is enabled, compilations will generally include more files than they would without the technique. However, because inlining is generally most useful during optimization, and optimization itself is slow, the relative cost in compile time due to extra includes is low. So, for debuggable builds, use -g and do not define INLINING. For fast builds without debugging, do not use -g and do not define INLINING. For run-time optimized builds, use one of the optimization flags and define INLINING. There is one strong note of caution in using this technique. All objects including files testing INLINING must be compiled with the same definition of INLINING. In practice, this restriction means that only application programmers can use the technique. Class DesignThe design of classes and class hierarchies can have a significant impact on run-time performance, through increased operation efficiency, reduced data size, and reduced code size. Bit-FieldsConvert boolean and small integer values into bit-fields, and then place these fields adjacent to each other. This technique can substantially reduce data size, though it generally requires more instructions to implement. The cost of reading bit-fields depends primarily on how much instruction parallelism is available, while the cost of writing bit-fields is usually double the cost of writing a non-bit-field. Field OrderOrder data fields to limit the size of unused padding between fields. In particular, order fields by their types in the order long double, double, long long int, pointers, long int, float, int, short, and char. This transformation may not be feasible or reliable when the program uses typedefs that change with the build environment. Devirtualizing MethodsConvert a virtual method to a non-virtual method when the virtualness of the method is not used. This technique will improve the speed of calls, and enable inlining for further speed improvements. In general, this technique should be applied towards the end of program development, when the abstractions used by the program have become stable and program run time is more important than development flexibility. Converter MethodsThe standard mechanism for dynamic cast is very general, and consequently is more expensive than most specific needs warrant. You can achieve similar functionality by providing dynamic converter methods rather than using dynamic_cast. For example, instead of car* car_ptr = dynamic_cast<car*>( vehicle_ptr ) use car* car_ptr vehicle_ptr->to_car() where class vehicle {
virtual car* to_car() { return (car*)0; }
};
class car : vehicle {
virtual car* to_car() { return this; }
};
Duplicate MethodsRather than duplicate methods in different derived classes, define them in their least common base class. This transformation reduces the working set, which improves run time, and reduces the number of functions, which improves compile time. Default OperatorsUse the default operators. If a class definition does not declare a parameterless constructor, a copy constructor, a copy assignment operator, or a destructor, the compiler will implicitly declare them. These are called default operators. A C-like struct has these default operators. When the compiler builds a default operator, it knows a great deal about the work that needs to be done and can produce very good code. This code is often much faster than user-written code because the compiler can take advantage of assembly-level facilities while the programmer usually cannot. So, when the default operators do what is needed, the program should not declare user-defined versions of these operators. Default operators are inline functions, so do not use default operators when inline functions are inappropriate (see the previous section). Otherwise, default operators are appropriate when:
Some C++ programming texts suggest that class programmers always define all operators so that any reader of the code will know that the class programmer did not forget to consider the semantics of the default operators. Obviously, this advice interferes with the optimization discussed above. The resolution of the conflict is to place a comment in the code stating that the class is using the default operator. Value ClassesC++ classes, including structures and unions, are passed and returned by value. For Plain-Old-Data (POD) classes, the C++ compiler is required to pass the struct as would the C compiler. Objects of these classes are passed directly. For objects of classes with user-defined copy constructors, the compiler is effectively required to construct a copy of the object, pass a pointer to the copy, and destruct the copy after the return. Objects of these classes are passed indirectly. For classes that fall between these two requirements, the compiler can choose. However, this choice affects binary compatibility, so the compiler must choose consistently for every class. For most compilers, passing objects directly can result in faster execution. This execution improvement is particularly noticeable with small value classes, such as complex numbers or probability values. You can sometimes improve program efficiency by designing classes that are more likely to be passed directly than indirectly. In compatibility mode (-compat[=4]), a class is passed indirectly if it has any one of the following:
Otherwise, the class is passed directly. In standard mode (the default mode), a class is passed indirectly if it has any one of the following:
Otherwise, the class is passed directly. To maximize the chance that a class will be passed directly:
Classes (and unions) that are passed directly by the C++ compiler are passed exactly as the C compiler would pass a struct (or union). However, C++ structs and unions are passed differently on different architectures.
Function DesignFunction efficiency is the core of good run-time performance, and C++ functions often have subtle inefficiencies because the language often hides significant work behind very terse syntax. Reference ParametersProgrammers often code functions with reference parameters rather than value parameters because "it is more efficient", rather than because the semantics are appropriate. However, as we saw above, value parameters may be more efficient, and even when they are not directly more efficient, the compiler knows that a value parameter cannot be aliased, and so can better optimize access to the parameter. However, switching between reference parameters and value parameters is not neutral with respect to semantics, so care must be taken that the switch is appropriate. In particular, if the class has virtual functions, passing the class as a value parameter is probably not appropriate. Temporary ObjectsC++ functions often produce implicit temporary objects, each of which must be created and destroyed. For non-trivial classes, the creation and destruction of temporary objects can be expensive in terms of processing time and memory usage. The C++ compiler does eliminate some temporary objects, but it cannot eliminate all of them. Write functions to minimize the number of temporary objects as long as your programs remain comprehensible. Techniques include using explicit variables rather than implicit temporary objects and using reference parameters rather than value parameters. Another technique is to implement and use operations such as += rather than implementing and using only + and =. For example, the first line below introduces a temporary object for the result of a + b, while the second line does not. T x = a + b; T x( a ); x += b; Cache Member VariablesAccessing member variables is a common operation in C++ member functions. The compiler must often load member variables from memory through the this pointer. Because values are being loaded through a pointer, the compiler sometimes cannot determine when a second load must be performed or whether the value loaded before is still valid. In these cases, the compiler must choose the safe, but slow, approach and reload the member variable each time it is accessed. You can avoid unnecessary memory reloads by explicitly caching the values of member variables in local variables, as follows:
This optimization is most productive when the values can reside in registers, as is the case with primitive types. The optimization may also be productive for memory-based values because the reduced aliasing gives the compiler more opportunity to optimize. This optimization may be counter-productive if the member variable is often passed by reference, either explicitly or implicitly. On occasion, the desired semantics of a class requires explicit caching of member variables, for instance when there is a potential alias between the current object and one of the member function's arguments. For example: void complex::operator *= ( const complex& right ) {
real = real * right.real + imag * right.imag;
imag = real * right.imag + imag * right.real;
}
will yield unintended results when called with: x *= x; The cached version does not exhibit the problem. void complex::operator *=(const complex& right){
double this_real = real, this_imag = image;
double that_real = right.real, that_imag = right.image;
real = this_real *that_real + this_imag *that_image;
imag = this_real *that_imag + this_imag *that_real;
}
Common Call EliminationC++ programs usually rely heavily on procedure calls, and a function often calls the same function with the same data more than once. While compilers are generally effective at sharing the results of common subexpressions, they are not effective at sharing the results of common calls. For example, person* grandfather = me->father()->father(); person* grandmother = me->father()->mother(); has a common call, and will generally be more efficient written as person* dad = me->father(); person* grandfather = dad->father(); person* grandmother = dad->mother(); While this technique may seem obvious, and indeed it is, it often requires special attention because the extremely concise nature of C++ calls often provides insufficient reminders to consider efficiency. Const DeclarationsMany programmers use const reference parameters with the intent to inform the compiler that the parameter is read-only, and hence should be better optimized. Unfortunately, this intent is at odds with the C++ language definition. The const keyword says that the storage may not be modified through the given name. What it does not say is that the storage cannot be modified through some other name. With the exception of variables directly declared const, which means you can only initialize them, const is basically ineffective a improving run-time performance. It does, however, catch errors in the programming process. Catching Run-Time ErrorsWhile errors are generally uncommon, their very possibility can have a significant effect on program development time and on program performance. ExitIn general, you should use assert, abort, or exit (or an application-specific variant) when program termination is acceptable. This technique yields the smallest intrusion on program performance. ExceptionsWhen program termination is unacceptable and errors are rare, use exceptions. The C++ exception mechanism is designed for effective handling of rare events. The Sun C++ compiler has an extremely efficient implementation of exceptions when they are not thrown, and it is getting faster with each release. Furthermore, exceptions limit the coding needed to deal with the problem to the point of detection and the point of handling, and avoid handling by intermediates. However, exceptions have a high cost when actually thrown, and are thus not suitable when an error is common. Return CodesFor common errors, the traditional function return codes provide the most efficient run-time solution. Memory AllocationMemory allocation can consume a very large amount of time. Reducing the amount of allocation is usually the most effective technique for improving performance. Deallocating MemoryOnce unnecessary allocations are eliminated, the next most effective technique for improving performance is to deallocate memory when it is no longer needed. This is best accomplished with explicit calls to delete. However, applications may lose control of their memory, and a conservative garbage collector can sometimes be a critical tool. The C++ compiler provides a conservative garbage collector with the option -library=gc. Allocating EfficientlyOnce memory is deallocated, you must turn to improving the efficiency of allocation. The option -library=gc provides a faster malloc and free. If that is not enough, use class-specific new and delete operators. These are commonly called pool allocators. Finally, when processing character strings, consider using realloc when adjusting string size rather than separate new and delete. TemplatesPrograms that use templates heavily can benefit quite a bit from careful template design. Factoring Common CodeThe simple intent of templates is to adapt a general class or function design to a particular type. This is readily accomplished by a simple template design. Unfortunately, these simple designs often result in template instances with large amounts of identical code because the entire class or function is specialized, not just the part that must be specialized. template< typename T >
struct list {
T* chain;
T* next() { return chain; }
void method();
};
You can reduce the amount of code generated, and hence the working set size of the application, by defining a common non-template base class and a derived template class. The base class contains the common code, and the derived class contains the part that must be specialized. struct list_base {
void* chain;
void method();
};
template< typename T >
struct list : list_base {
T* chain;
T* next() { return (T*)chain; }
};
However, there are some costs associated with this technique. First, debugging may be hindered when the base class uses void* rather than a pointer to the actual type. Second, use of a base class will introduce another layer of function calls, which can cost more than the lower working set size justifies. Third, some type-based optimizations may no longer be possible, which is a concern only at very high levels of optimization. DefinitionsYou can organize your template definitions in two ways: with definitions included and with definitions separated. The definitions-included organization places the template definitions within the template header file.
The definitions-separated organization places the template definitions in a separate file, which is read as-needed during template instantiation.
It is important to note that the template definition file is essentially an automatic include file. Do not compile this file. The definitions-included organization requires the compiler to store template definitions, thus increasing compiler data size, even when no templates are instantiated, but it reads the least number of files. The definitions-separated organization avoids saving templates when none are instantiated, but requires searching directories and reading additional files when templates are instantiated. If a particular compilation usually does not instantiate a template, the definitions-separated organization is most efficient. Otherwise, the definitions-included organization is most efficient. InstantiationTemplate instantiation can have a major impact on compile time. The default instantiation mode of the compiler is -instances=global, which provides full support of the language with minimal programming effort, and good compilation time. However, it produces the largest object files. The fastest template instantiation mode is -instances=explicit, but it requires manual instantiation of all templates needed, which can require as substantial amount of programmer effort to identify all the needed instantiations. A slightly easier mode is -instances=semiexplicit, which requires manual instantiation of the directly used instances, but permits the compiler to automatically produce dependent instances. However, this mode restricts the kinds of templates one can use, and produces larger programs than -instances=explicit. See compiler manual for a discussion of the restrictions. In summary, use the default instantiation mode when object size and compilation constraints are not severe. Use -instances=explicit or -instances=semiexplicit when both are important and you are willing to work for the benefits. SummaryFirst and foremost, design your program well. Well-designed programs will often have good compile-time performance. If not, local changes will usually bring good compile-time performance. When the run-time performance is inadequate, explore compilation options and alternate libraries. When the run-time performance is still inadequate, reprogram the small portion of the code that contributes most to run time.Further Reading
| ||||||||||||||||||||||||||
|
| ||||||||||||