By Darryl Gove, Updated September 2009
This article suggests how to get the best performance from an
UltraSPARC or x86/EM64T (x64) processor running on the latest Solaris
platforms by compiling with the best set of compiler options and the
latest compilers. These are suggestions of things you should try, but
before you release the final version of your program, you should
understand exactly what you have asked the compiler to do.
The fundamental questions
There are two questions that you need to ask when compiling your
program:
- What do I know about the platforms that this program will run
on?
- What do I know about the assumptions that are made in the
code?
The answers to these two questions determine what compiler options
you should use.
The target platform
What platforms do you expect your code to run on? The choice of
platform determines:
- 32-bit or 64-bit instruction set
- Instruction set extensions the compiler can use
- Instruction scheduling depending on instruction execution
times
- Cache configuration
The first three are often the ones that will have the greatest
impact on the performance of the application.
32-bit versus 64-bit code
The UltraSPARC and x64 families of processors can run both 32-bit
and 64-bit code. The critical advantage of 64-bit code is that the
application can handle a larger data set than 32-bit code, which has
a size limit of 4GB for the application and data. However, the cost
of this larger address space is a larger memory footprint for the
application; long variable types and pointers increase in size from 4
bytes to 8 bytes. The increase in footprint can cause the 64-bit
application to run more slowly than the 32-bit version.
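The footprint effect can be seen with a small C sketch. The struct node type and the two helper functions are hypothetical names chosen for illustration; they do not come from any Sun Studio header. Under the 32-bit (ILP32) data model each pointer and long is 4 bytes, so one node takes 12 bytes; under the Solaris 64-bit (LP64) data model each is 8 bytes, so the same node takes 24 bytes.

```c
#include <stddef.h>

/* Hypothetical linked-list node: two pointers and one long. */
struct node {
    struct node *next;
    struct node *prev;
    long value;
};

/* Bytes occupied by one node under the data model used for this compile. */
size_t node_bytes(void)
{
    return sizeof(struct node);
}

/* Bytes occupied by n such nodes. */
size_t list_bytes(size_t n)
{
    return n * sizeof(struct node);
}
```

Compiling the same source once with -m32 and once with -m64 and comparing the value returned by node_bytes() shows how much of the data set grows purely because of the wider types.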
However, the x86/x64 platform has some architectural advantages
when running 64-bit code compared to running 32-bit code. In
particular, the application can use more registers, and can use a
better calling convention. On an x86 processor, these advantages will
typically enable a 64-bit version of an application to run faster
than a 32-bit version of the same code, unless the memory footprint
of the application has significantly increased.
The UltraSPARC line of processors was architected to enable the
32-bit version of the application to already use the architectural
features of the 64-bit instruction set. So there is no architectural
performance gain going from 32-bit to 64-bit code. Consequently the
UltraSPARC processors will only see the additional cost of the
increase in memory footprint.
Hence best performance is likely to be attained if SPARC binaries
are compiled as 32-bit, and x86 binaries are compiled as 64-bit. The
compiler flags that determine whether a 32-bit or 64-bit binary is
generated are the flags -m32 and -m64.
For additional details about migrating from 32-bit to 64-bit code,
refer to "Converting 32-bit Applications Into 64-bit Applications:
Things to Consider" and "64-bit x86 Migration, Debugging, and Tuning,
With the Sun Studio 10 Toolset".
Specifying an appropriate target processor
The default for the compiler is to produce a 'generic' binary; a
binary that will work well on all platforms. In many situations this
will be the best choice. However, there are some situations where it
is appropriate to select a different target.
- To override a previous target setting. The compiler evaluates
options from left to right, so if the flag -fast has been
specified on the compile line, it may be appropriate to
override its implicit setting of -xtarget=native with a
different choice later on the command line.
- To take advantage of features of a particular processor. For
example, newer processors tend to have more features; the compiler
can use these features at the expense of producing a binary that
does not run on older processors that lack them.
The -xtarget flag actually sets three flags:
- The -xarch flag specifies the architecture of the machine. This is
basically the instruction set that the compiler can use. If the
processor that runs the application does not support the
appropriate architecture then the application may not run.
- The -xchip flag tells the compiler which processor to assume is
running the code. This tells the compiler which patterns of
instructions to favour when it has a choice between multiple ways
of coding the same operation. It also tells the compiler the
instruction latency to use for scheduling instructions to minimise
stalls.
- The -xcache flag tells the compiler the cache hierarchy to assume.
This can have a significant impact on floating point codes where
the compiler is able to make a choice about how to arrange loops so
that the data being manipulated fits into the caches.
The impact of these three performance settings will depend on
the characteristics of the application. Codes that spend time in
floating point computation tend to be those that show most
sensitivity to the settings used for the target.
Target architectures for SPARC processors
The default -xtarget=generic option should be appropriate
for most situations. The compiler will generate a 32-bit binary that
uses the SPARC V8 instruction set, or a 64-bit binary that uses the
SPARC V9 instruction set. The most common situation where a different
setting might be required would be with a code doing a significant
number of floating point computations. Here, use of the hardware
floating point multiply-accumulate (FMA or FMAC) instructions would
be effective.
The SPARC64 line of processors supports FMA instructions. These
instructions combine a floating point multiply and a floating point
addition (or subtraction) into a single operation. An FMA typically
takes the same number of cycles to complete as either a floating
point addition or a floating point multiplication, so the performance
gain from using these instructions can be significant. However, the
results from an application compiled to use FMA instructions may
differ from those of the same application compiled without them.
An FMAC instruction performs the following operation, called a
“fused multiply-accumulate”:
Result = ROUND( (value1 * value2) + value3)
Here ROUND indicates that the value
is rounded to the nearest representable floating point number when it
is stored into the result. This single FMAC instruction replaces the
following two instructions:
tmp = ROUND(value1 * value2)
Result = ROUND(tmp + value3)
Notice that the two-instruction version has two round operations, and
this difference can result in a difference in the least significant
bits of the calculated result.
To generate FMA instructions, the binary needs to be compiled with
two flags: one to specify an architecture that supports the FMA
instructions, and another to tell the compiler that it is acceptable
to use these instructions:
-xarch=sparcfmaf -fma=fused
Alternatively the flags -xtarget=sparc64vi -fma=fused will
enable the generation of the FMA instruction and will also tell the
compiler to assume the characteristics of the SPARC64 VI processor
when compiling the code. This will produce optimal code for the
SPARC64 VI platform. Code compiled to contain FMA instructions will
not run on platforms that do not support the instructions.
Specifying the target processor for the x86/x64 processor family
By default the compiler targets a 32-bit generic x86 based
processor, so the code will run on any x86 processor from a Pentium
Pro up to an AMD Opteron architecture. Whilst this produces code that
can run over the widest range of processors, this does not take
advantage of the extensions offered by the latest processors. Most
currently available x86 processors have the SSE2 instruction set
extensions. To take advantage of these instructions the flag
-xarch=sse2 should be used. However, the compiler may not
recognise all opportunities to use these instructions unless the
vectorisation flag -xvector=simd is also used.
So for x86/x64 processors, compile with at least:
-xarch=sse2 -xvector=simd
Summary of target settings for various address spaces and architectures
The following table summarizes the options to use for various
processors and architectures.
Address Space | SPARC | SPARC64 | x86 | x64/sse2 |
32-bit | -xtarget=generic -m32 | -xtarget=sparc64vi -m32 -fma=fused | -xtarget=generic -m32 | -xtarget=generic -xarch=sse2 -m32 -xvector=simd |
64-bit | -xtarget=generic -m64 | -xtarget=sparc64vi -m64 -fma=fused | | -xtarget=generic -xarch=sse2 -m64 -xvector=simd |
Optimization and debug
Compiling with an optimization flag alters three important
characteristics: the runtime of the compiled application, the length
of time that the compilation takes, and the amount of debugging that
is possible with the final binary. In general, the higher the level of
optimization, the faster the application runs (and the longer it takes
to compile), but the less debug information is available. The
particular impact of optimization levels will vary from application
to application.
The easiest way of thinking about this is to consider three
degrees of optimization, as outlined in the following table.
Purpose | Flags | Comments |
Full debug | [no optimization flags] -g | The application will have full debug capabilities, but almost no compiler optimizations will be performed, leading to lower performance. |
Optimised | -g -O [-g0 for C++] | The application will have good debug capabilities, and a reasonable set of optimizations will be performed, typically leading to significantly better performance. |
High optimization | -g -fast | The application will have good debug capabilities, and a full set of compiler optimizations, typically leading to higher performance. |
Note: For C++ at optimisation levels of -O
and lower, the debug flag -g will inhibit some of the
inlining of methods. This can have a significant performance impact
on the binary. The flag -g0 will provide debug information
without inhibiting the inlining of these methods. Consequently it can
be useful to use the flag -g0 with -O
if it is important to have the same level of performance as the
non-debug version. The behaviour of -g
for C++ was changed to this in Sun Studio 12 Update 1; prior releases
of the C++ compiler always disabled front-end inlining when the flag -g
was used.
Suggestion: In general, an optimization level of at least -O
is suggested. The two situations where lower levels might be
considered are (1) where more detailed debug information is
required and (2) where the semantics of the program require that
variables are treated as volatile,
in which case the optimization level should be lowered to -xO2.
More details on debug information
The compiler will generate information for the debugger if the -g
flag is present. For lower levels of optimization, the -g
flag disables some minor optimizations to make the generated code
easier to debug. At higher levels of optimization, the presence of
the flag does not alter the code generated (or its performance) --
but be aware that at high levels of optimization it is not always
possible for the debugger to relate the disassembled code to the
exact line of source, or to determine the value of local variables
held in registers rather than stored to memory.
As discussed earlier, at low levels of optimisation the C++
compiler will disable some of the inlining performed by the compiler
when the -g compiler flag is used. However, the flag -g0
will tell the compiler to do all the inlining that it would normally
do as well as generating the debug information.
A very strong reason for compiling with the -g flag is
that the generated debug information lets the Sun Studio Performance
Analyzer attribute time spent in the code directly to lines of source
code -- making the process of finding performance bottlenecks
considerably easier. Also should the application produce a core file,
the debugger will usually be able to report the line of code which
produced the core file.
Suggestion
Using -fast for performance
The flag -fast is a good starting point when optimizing
code. However, it may not necessarily give you the right set of
optimizations you want for the finished program. The -fast
flag is a macro that enables a full set of optimisations that
will often lead to near optimal performance for many applications.
However, some of these optimisations may not be appropriate for your
particular application.
- On x86 platforms, -xregs=frameptr allows the
compiler to use the frame pointer as an unallocated callee-saves
register, which can result in increased run-time performance. This
option is included in -fast for C. Use of this flag may
mean that some tools are unable to correctly generate callstack
information.
- For the C compiler, the -fast flag includes -xalias_level=basic,
which declares that the application does not contain pointer
aliasing between different data types. Code not complying with
language standards might not run correctly when compiled with this
flag. We discuss pointer aliasing later in this article.
- The -fast flag also enables
certain floating point optimisations, which we discuss in the next
section in more detail.
The -fast flag is a good starting
point for getting the best performance out of your application. It is
recommended that the optimisations it enables are reviewed before a
final set of compiler flags is decided for the production build of
your application. The flags -#, -xdryrun, or -V
will cause the compiler to print out the options that -fast
includes, and the list can be used to select the appropriate ones for
your application.
Refer to Comparing
the -fast Option Expansion on x86 Platforms and SPARC Platforms
for the expansion of -fast by Sun Studio 10 C, C++, and
Fortran compilers, cc, CC, and f95,
respectively.
The implications for floating-point arithmetic when using the -fast option
One issue to be aware of is the inclusion of certain
floating-point arithmetic simplifications implied with -fast.
These are the options -fns and -fsimple=2, which
allow the compiler to do some optimizations that do not comply with
the IEEE-754 floating-point arithmetic standard, and also allow the
compiler to relax language standards regarding floating point
expression reordering.
With -fns, subnormal numbers (that is, very small numbers
that are too small to be represented in normal form) are flushed to
zero. Calculations on subnormal numbers are often done in software,
which is very slow, so codes which have significant numbers of
calculations on subnormal numbers will also run slowly. Subnormal
numbers are stored with fewer significant figures of accuracy, so
codes that see large numbers of them will not only run slower, but
may also be performing inaccurate calculations. Hence the presence of
subnormals is not only a performance problem but a cause for further
investigation of the numerics of the application.
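One way to investigate whether an application is producing subnormal values is to test intermediate results directly. The following C sketch uses the standard C99 fpclassify() macro; the helper names is_subnormal and count_subnormals are hypothetical and chosen only for this illustration. It should be built without -fns, since that flag flushes subnormals to zero.

```c
#include <math.h>
#include <stddef.h>

/* Returns 1 if x is subnormal: nonzero, but too small for normal form. */
int is_subnormal(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Counts the subnormal entries in an array of n doubles. */
size_t count_subnormals(const double *values, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_subnormal(values[i])) {
            count++;
        }
    }
    return count;
}
```

A nonzero count on representative data would be a reason to examine the numerics before deciding whether flushing to zero is acceptable.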
With -fsimple=2, the compiler can treat floating-point
arithmetic as you would expect to find in a mathematics textbook. For
example, the order in which additions are performed doesn't matter,
and it is considered safe to replace a divide operation by
multiplication by the reciprocal. These transformations seem
perfectly acceptable when performed on paper, and can give some
performance gains, but they can result in a loss of precision when
algebra becomes real numerical computation with numbers of limited
precision.
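The loss of precision from reordering can be shown with three numbers. In this C sketch the function names sum_left and sum_right are made up for the example, and the volatile temporaries are only there to force each intermediate sum to be rounded to double so that the result does not depend on the platform's register width.

```c
/* Adds left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* round the first sum to double */
    return t + c;
}

/* Adds right to left: a + (b + c). */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;   /* round the first sum to double */
    return a + t;
}
```

For a = 1e16, b = -1e16 and c = 1.0, sum_left returns 1.0 because a and b cancel exactly before c is added. sum_right returns 0.0 because -1e16 + 1.0 is not representable in double precision and rounds back to -1e16. On paper the two orders are identical; in floating point they are not.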
Also, -fsimple=2 allows the compiler to make
optimizations that assume that the data used in floating-point
calculations will not be NaNs (Not a Number). Compiling with
-fsimple=2 is not recommended if you expect computation with
NaNs, or if your application is sensitive to the exact order
in which floating point computations are performed.
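As an illustration of code that depends on NaN behaviour, consider the common idiom of detecting a NaN by comparing a value with itself. The helper names below are hypothetical. Under IEEE-754 rules the comparison x != x is true only for a NaN, but a compiler that has been told NaNs never occur is entitled to treat it as always false.

```c
#include <stddef.h>

/* Under IEEE-754 rules a NaN compares unequal to everything, itself included. */
int is_nan_by_compare(double x)
{
    return x != x;
}

/* Returns the index of the first NaN in the array, or n if there is none. */
size_t first_nan(const double *values, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (is_nan_by_compare(values[i])) {
            return i;
        }
    }
    return n;
}
```

Code that relies on checks of this kind is a candidate for compiling with a lower -fsimple setting.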
Notes
- The use of the flags -fns and -fsimple can
result in significant performance gains. However, they may also
result in a loss of precision. Before committing to using them in
production code, it is best to evaluate the performance gain you get
from using the flags, and whether there is any difference in the
results of the application.
- Avoid using -fsimple=2 with applications that
perform calculations on NaNs, or are known to be sensitive to
the order of floating point computation.
- For more information on floating-point computation, see the
Sun Studio Numerical Computation Guide.
Crossfile optimization
The -xipo option performs interprocedural optimizations
over the whole program at link time. This means that the object files
are examined again at link time to see if there are any further
optimization opportunities. The most common opportunity is to inline
code from one file into code from another file. The term inlining
means that the compiler replaces a call to a routine with the actual
code from that routine.
Inlining is good for two reasons, the most obvious being that it
eliminates the overhead of calling another routine. A second, less
obvious reason is that inlining may expose additional optimizations
that can now be performed on the object code. For example, imagine
that a routine calculates the color of a particular point in an image
by taking the x and y position of the point and calculating the
location of the point in the block of memory containing the image
(image_offset = y * row_length + x). By inlining that code in
the routine that works over all the pixels in the image, the compiler
is able to generate code to just add one to the current offset to get to
the next point instead of having to do a multiplication and an
addition to calculate each address of each point, resulting in a
performance gain.
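The image example can be sketched in C as follows. The function names are hypothetical, and for brevity both routines are shown in one file; in the situation that -xipo addresses, image_offset would live in a different source file from the loop that calls it. The second loop is a hand-written picture of the kind of code the compiler can arrive at once the call has been inlined, not a record of actual compiler output.

```c
#include <stddef.h>

/* Offset of pixel (x, y) in a row-major image, as described in the text. */
size_t image_offset(size_t x, size_t y, size_t row_length)
{
    return y * row_length + x;
}

/* Visits every pixel by computing its offset with a multiply and an add. */
unsigned long sum_by_offset(const unsigned char *image,
                            size_t width, size_t height)
{
    unsigned long total = 0;
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            total += image[image_offset(x, y, width)];
        }
    }
    return total;
}

/* Equivalent loop after inlining: the offset just advances by one. */
unsigned long sum_by_increment(const unsigned char *image,
                               size_t width, size_t height)
{
    unsigned long total = 0;
    size_t offset = 0;
    for (size_t i = 0; i < width * height; i++) {
        total += image[offset];
        offset++;
    }
    return total;
}
```

Both functions return the same sum; the difference is only in how much address arithmetic is done per pixel.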
The downside of using -xipo is that it can significantly
increase the compile time of the application and may also increase
the size of the executable.
Suggestion:
- Try compiling with -xipo to see if the increase in
compile time is worth the gain in performance.
Profile feedback
When compiling a program, the compiler takes a best guess at how
the flow of the program might go -- which branches are taken and
which branches are not taken. For floating-point intensive code, this
generally gives good performance. But programs with many branching
operations might not obtain the best performance.
Profile feedback assists the compiler in optimizing your
application by giving it real information about the paths actually
taken by your program. Knowing the critical routes through the code
allows the compiler to make sure these are the optimized ones.
Profile feedback requires that you first compile and execute a
version of your application with -xprofile=collect and then
run the application with representative input data to collect a
runtime performance profile. You then recompile with -xprofile=use
and use the performance profile data collected. The downside of doing
this is that the compile cycle can be significantly longer (you are
doing two compiles and a run of your application), but the compiler
can produce much more optimal execution paths, which means a faster
runtime.
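As a hedged illustration of the kind of code where this matters, consider a loop whose branch direction depends entirely on the input data. The function name sum_non_negative is made up for this example. Without a profile the compiler has to guess whether the negative-value path is common or rare; with profile data collected from a representative run it can lay out the code so that the frequently taken path is the fast one.

```c
#include <stddef.h>

/* Sums the non-negative values in an array, skipping negative ones. */
long sum_non_negative(const int *values, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (values[i] < 0) {
            /* How often this path is taken is known only at run time. */
            continue;
        }
        total += values[i];
    }
    return total;
}
```

If production data almost never contains negative values but the training run is full of them, the profile will point the compiler in the wrong direction, which is why the choice of training data matters.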
A representative data set should be one that will exercise the
code in ways similar to the actual data that the application will see
in production; the program can be run multiple times with different
workloads to build up the representative data set. Of course if the
representative data manages to exercise the code in ways which are
not representative of the real workloads, then performance may not be
optimal. However, it is often the case that the code is always
executed through similar routes, and so regardless of whether the
data is representative or not, the performance will improve. For more
information on determining whether a workload is representative read
my article Selecting
Representative Training Workloads for Profile Feedback Through
Coverage and Branch Analysis.
Suggestion:
Using large pages for data
If a program manipulates large data sets, it might help improve
performance by using large pages to hold the data.
A page is a region of contiguous physical memory; the
processor works with virtual memory, which allows the processor the
freedom to move the data around in physical memory, or even store it
to and load it from disk. However, working with virtual
memory means the processor has to look up virtual addresses in a table to find the actual physical
location of that data page in real memory. This takes a small amount of time, but if it happens often
the time spent in table lookups can become significant. The default size of these pages is
8KB for SPARC, 4KB for x86. However, the processor can use a range of
page sizes. The advantage of using a large page size is that the
processor will perform fewer lookups, but the disadvantage is
that the processor may not be able to find a sufficiently large chunk
of contiguous memory to allocate the large page on (in which case a
set of smaller size pages will be allocated instead).
The compiler option that controls page size is -xpagesize=size.
The options for the size depend on the platform. On UltraSPARC
processors, allowable sizes are 4K, 8K, 64K, 512K, 2M, 4M, 32M, 256M,
2G, or 16G. For example, changing the page size from 8K (the default)
to 64K will reduce the number of look ups by a factor of 8. On the
x86 platform, the default page size is 4K, and the actual sizes
that are available depend on the processor, often 4K, 2M, 4M, and 1G.
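The factor of 8 follows from simple arithmetic, sketched here in C. The function pages_needed is a hypothetical helper for illustration and does not query the operating system; it only counts how many pages of a given size are required to cover a region of memory, and it assumes page_size is nonzero.

```c
#include <stddef.h>

/* Number of pages of page_size bytes needed to hold `bytes` bytes,
   rounding up to a whole page. Assumes page_size is nonzero. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}
```

For a 1GB data set, 8K pages give 131072 pages while 64K pages give 16384, so there are one eighth as many distinct translations for the processor to look up.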
It is possible to detect performance issues from page sizes using
either trapstat, if it is available and if the processor traps into
Solaris to handle Translation Lookaside Buffer (TLB) misses, or
cpustat when the processor provides hardware performance counters
that count TLB miss events.
The command that reports the page sizes available on a particular
system is
If the application incurs significant numbers of TLB miss events
during its run then it is likely that recompiling with a setting for
-xpagesize will improve performance.
Advanced compiler options: C/C++ pointer aliasing
Two pointers "alias" if they point to the same location
in memory. For the compiler, aliasing means that stores to the memory
addressed by one pointer may change the memory addressed by the other
pointer -- this means that the compiler has to be very careful never
to reorder stores and loads in expressions containing pointers, and
it may also have to reload the values of memory accessed through
pointers after new data is stored into memory.
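A small C sketch shows why the reload is necessary. The two function names are hypothetical. In scale_reload the factor is read through the pointer on every iteration, which is what the compiler must do when it cannot rule out that factor points into the array being written. In scale_cached the factor is read once, which is the faster code a compiler could generate only if it knew the pointers did not alias.

```c
#include <stddef.h>

/* Reads *factor on every iteration, so a store to v[i] that happens to
   change *factor is seen by later iterations. */
void scale_reload(int *v, size_t n, const int *factor)
{
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] * *factor;
    }
}

/* Reads *factor once before the loop. This is correct only when factor
   does not point into v. */
void scale_cached(int *v, size_t n, const int *factor)
{
    int f = *factor;
    for (size_t i = 0; i < n; i++) {
        v[i] = v[i] * f;
    }
}
```

If v holds {2, 3, 4} and factor points at v[0], scale_reload produces {4, 12, 16} because the first store changes the factor, while scale_cached produces {4, 6, 8}. When the pointers do not alias, both functions give the same answer.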
There are two flags that you can use to make assertions about the
use of pointers in your program. These flags will tell the compiler
something that it can assume about the use of pointers in your
source. It does not check to see if the assertion is ever violated,
so if your code violates the assertion, then your program might not
behave in the way you intended it to. Note that lint can
help you do some validity checking of the code at a particular
-xalias_level. (See Chapter
4, lint Source Code Checker, in Sun Studio 12: C User’s
Guide.)
The two assertions are:
The following table summarizes the options for -xalias_level
for C (cc).
cc -xalias_level= | Comment |
any | Any pointers can alias (default) |
basic | Basic types do not alias each other (for example, int* and float*) |
weak | Structure pointers alias by offset. Structure members of the same type at the same offset (in bytes) from the structure pointer may alias. |
layout | Structure pointers alias by common fields. If the first few fields of two structure pointers have identical types, then they may potentially alias. |
strict | Pointers to structures with different variable types in them do not alias |
std | Pointers to differently named structures do not alias (so even if all the elements in the structures have the same types, if they have different names, then the structures do not alias). This is the level of aliasing allowed by the language standard. |
strong | There are no pointers to the interiors of structures and char* is considered a basic type (at lower levels char* is considered as potentially aliasing with any other pointers) |
The following table summarizes the options for -xalias_level
for C++ (CC).
CC -xalias_level= | Comment |
any | Any pointers can alias (default) |
simple | Basic types do not alias (same as basic for C) |
compatible | Corresponds to layout for C |
Notes
- Specifying -xrestrict and -xalias_level
correctly can lead to significant performance gains. But if your
code does not conform to the requirements of the flags, then the
results of running the application may be unpredictable.
- For C, -xalias_level=std means that pointers behave
in the same way as the 1999 ISO C standard suggests. Use it for
standard-conforming codes.
- The flag -fast for C includes
-xalias_level=basic. If the code
contains aliasing of different basic types, then -fast
needs to be followed by the flag -xalias_level=any
to tell the compiler that any pointers may potentially alias.
A set of flags to try
The final thing to do is to pull all these points together to make
a suggestion for a good set of flags. Remember that this set of flags
may not actually be appropriate for your application, but it is hoped
that they will give you a good starting point. (Use of the flags in
square brackets, [..] depends on special circumstances.)
Flags | Comment |
-g | Generate debugging information (may use -g0 for C++) |
-fast | Aggressive optimization |
-xtarget=generic [-xtarget=sparc64vi -fma=fused] [-xarch=sse2 -xvector=simd] | Specify target platform |
-xipo | Enable interprocedural optimization |
-xprofile=[collect|use] | Compile with profile feedback |
[-fsimple=0 -fns=no] | No floating-point arithmetic optimizations. Use if IEEE-754 compliance is important |
[-xalias_level=val] | Set level of pointer aliasing (for C and C++). Use only if you know the option to be safe for your program. |
[-xrestrict] | Uses restricted pointers (for C). Use only if you know the option to be safe for your program. |
Final remarks
There are many other options that the compilers recognize. The
ones presented here probably give the most noticeable performance
gains for most programs and are relatively easy to use. When
selecting the compiler options for your program:
- It is important to be aware of just what you are telling the
compiler to do. A program may have unpredictable results if it does
not conform to the requirements of the flags.
- Optimization is a tradeoff between increased compile time and
improved runtime performance.
- So, only use those flags that give you both a performance
benefit and make acceptable assertions about the code.
For details on all these options, see the Sun Studio compiler
user guides and man pages.