By Sun Studio Compiler Engineering Staff, revised July 2007
Users wanting the best performance from CPU-intensive codes may wish to explore the use of additional libraries and advanced compiler options that control individual compiler components.
Performance libraries
- The optimized math library, selected by the switch -xlibmopt in Fortran and C++, or by including -lmopt in C. This library may produce slightly different results, usually differing only in the last bit, in order to achieve higher performance.
- Memory allocation libraries. A guide to several choices can be found in the NOTES section of the umem_alloc(3MALLOC) man page. In addition, the library libfast.a is available. To use it, add -lfast to your C, Fortran, or C++ compile command. Like libbsdmalloc, libfast keeps free lists of various sizes to provide very fast allocation, at the expense of additional memory. libfast also samples the allocation stream; if consecutive samples are sufficiently smaller than the current size of a free list, a new free list is created. Note that libfast.a is suitable for use only in 32-bit, single-threaded environments. Do not use it if you have set the environment variables PARALLEL or OMP_NUM_THREADS to a value greater than 1, or if your application code calls omp_set_num_threads() to increase the number of threads.
- The Sun Performance Library, a set of optimized, high-speed mathematical subroutines for solving linear algebra and other numerically intensive problems. Link the library to your application with the switch -xlic_lib=sunperf.
  Note - The optimized math library, sunperf, and several other math libraries are described in the Sun Studio 12: Numerical Computation Guide.
- An optimized memset/memcpy library, suitable for UltraSPARC-III or later systems. To use it, add the switch -ll2amm to your compile command. You might also need to add -xarch=v8plusb to your command line. Note that compiling with -ll2amm produces a binary that cannot be used on previous-generation processors.
- For Solaris 8 or earlier systems, use the Intimate Shared Memory library to enable use of large pages (for example, 4MB). Using large pages may significantly reduce the number of Translation Lookaside Buffer (TLB) misses, and may therefore improve performance for programs that access large amounts of memory.
  To use large pages:
  - Link your program with the libprism32 library. Be sure that your LD_LIBRARY_PATH includes /opt/SUNWspro/prod/lib/v8plusb and then use the switch -lprism32.
  - Set kernel parameters that allow use of intimate shared memory:
    shmsys:shminfo_shmmin
    shmsys:shminfo_shmmax
    shmsys:shminfo_shmmni
    shmsys:shminfo_shmseg
    Documentation on these parameters may be found in the Solaris 8 System Administration Guide, Volume II.
  - Set environment variables that allow use of large pages:
  For systems using Solaris 9 or later, the page size setting facility mpss.so.1, the ppgsz command, and the -xpagesize family of compiler flags are the preferred methods for using large pages.
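The libfast.a restrictions described earlier in this section lend themselves to a quick check before linking. The following is a minimal sketch, not a Sun Studio tool: the helper name libfast_ok and its address-size argument are assumptions made for illustration. It looks only at the PARALLEL and OMP_NUM_THREADS environment variables, so it cannot detect an application that calls omp_set_num_threads() itself.

```shell
#!/bin/sh
# Sketch: decide whether linking with -lfast is appropriate, following the
# restrictions stated in the text (32-bit, single-threaded environments only).
# libfast_ok is a hypothetical helper name, not part of Sun Studio.
# Usage: libfast_ok <address size: 32 or 64>
libfast_ok() {
    bits=$1
    # libfast.a is described as suitable for 32-bit use only.
    [ "$bits" = "32" ] || return 1
    # PARALLEL or OMP_NUM_THREADS greater than 1 implies a multithreaded run.
    # Unset or empty variables are treated as 1.
    for n in "${PARALLEL:-1}" "${OMP_NUM_THREADS:-1}"; do
        case $n in
            *[!0-9]*) return 1 ;;   # not a plain number: be conservative
        esac
        [ "$n" -le 1 ] || return 1
    done
    return 0
}

# Emit the extra link flag only when the checks pass.
if libfast_ok 32; then
    echo "-lfast"
fi
```

In a build script, the printed flag could be appended to the link line; when the check fails, nothing is printed and the default allocator is used.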
Compiler Component Options
The Sun Studio compilers are divided into several components. Performance can sometimes improve when switches are sent directly to individual components, including:
- CC, driver for C++
- cg, the code generator
- d, driver for C
- f90comp, front end for Fortran
- iropt, the global optimizer
- ld, the link editor
- ube, the x86/x64 code generator
- ube_ipa, the x86/x64 interprocedural optimizer
NOTE: Although the use of these options is supported, there are several caveats that must be understood before using them.
- Usually, these options are set automatically:
  The compiler itself normally picks values for these options, based on the other options that are selected. Most users will achieve adequate performance without needing to set them.
- Subject to change:
  These options may change from release to release of the compiler. The spelling, effect, and even presence of these performance options may evolve over time; Makefile authors should be prepared to cope with such evolution.
- Performance testing is required:
  Some of these options may help your code run faster; others might actually make it run more slowly. Do not use one of these options unless you have a good test case to demonstrate its effect. A good test case:
  - represents an important real-life use of the code;
  - can be compiled both with and without the compiler option in question;
  - includes a repeatable workload; and
  - has available a machine environment where changes can be reliably measured.
- Understand what the driver is doing:
  If you choose to experiment with the options documented on this page, you will probably find it useful to examine what the compiler driver is passing to each stage of the compilation, both before and after your changes. You can do this by adding -v to your Fortran or C++ compile, or by adding -# to your C compile. For example, to check whether the driver passes -Aheap to iropt, you could type (in a C shell) something like this:
  % cc -# -fast -W2,-Aheap tmp.c |& grep bin/iropt | fold -s -20 | grep Aheap
  -O5 -Aheap
  %
  The above command pipes the driver's output, including stderr, to grep, which looks for the line that invokes iropt. Since that line is very long, we fold it into smaller chunks, and then examine the chunks for one containing "Aheap".
- Correctness testing is recommended:
  The use of performance options may sometimes lead to unexpected results. For example, a program may have a bug that is harmless when compiled without optimization, but that causes incorrect operation when the compiler uses more advanced optimizations. In addition, although these options have been tested by Sun's internal testing, they have had less exposure to customer applications.
  Therefore, as in any other performance improvement exercise, it is prudent to include testing for correct output as you tune performance.
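The performance-testing and correctness-testing advice above can be combined into one small harness. This is a minimal sketch under stated assumptions: compare_builds is a hypothetical helper name, the two commands are whatever runs your baseline and candidate binaries on a repeatable workload, and the timing relies on `date +%s` and `mktemp -d`, which are common but not guaranteed on every system.

```shell
#!/bin/sh
# Sketch: run a baseline build and a candidate build (compiled with the extra
# component option) on the same workload, check that the outputs match, then
# time repeated runs of each. compare_builds is a hypothetical helper name.
# Usage: compare_builds "<baseline command>" "<candidate command>" [runs]
compare_builds() {
    base_cmd=$1
    cand_cmd=$2
    runs=${3:-3}
    tmpdir=$(mktemp -d) || return 2

    # Correctness first: both builds must produce byte-identical output.
    sh -c "$base_cmd" > "$tmpdir/base.out" 2>&1
    sh -c "$cand_cmd" > "$tmpdir/cand.out" 2>&1
    if ! cmp -s "$tmpdir/base.out" "$tmpdir/cand.out"; then
        echo "MISMATCH: outputs differ; do not trust any speedup" >&2
        rm -rf "$tmpdir"
        return 1
    fi

    # Performance second: wall-clock seconds for the same number of runs.
    # Whole-second resolution, so the workload should run long enough.
    for cmd in "$base_cmd" "$cand_cmd"; do
        start=$(date +%s)
        i=0
        while [ "$i" -lt "$runs" ]; do
            sh -c "$cmd" > /dev/null 2>&1
            i=$((i + 1))
        done
        end=$(date +%s)
        echo "$((end - start))s for $runs runs: $cmd"
    done
    rm -rf "$tmpdir"
    return 0
}
```

A call might look like `compare_builds "./app.base < input.dat" "./app.tuned < input.dat" 5`, where the binary and input names are placeholders. Byte-exact comparison is the strictest check; numerical codes whose results legitimately differ in the last bit may need a tolerance-based comparison instead.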
Below is a selected list of options that can be passed directly to Sun Studio compiler components. These are passed using the -W flag (when using the C compiler) and the -Qoption flag (when using the Fortran or C++ compilers). The table below shows the relationship between the C and Fortran/C++ compilers and how to invoke the options. The f90comp component is Fortran-specific and is not available from the C or C++ compiler. The ld component does not need a pass-through from the C or C++ compiler; use the -M flag directly.
CC | |
cg | |
d | |
f90comp | -Qoption f90comp suboption |
iropt | |
ld | |
ube | |
ube_ipa | -Qoption ube_ipa suboption |
As an example, the following compile lines pass the -Abopt option to the iropt component of the compiler when compiling a program in C, Fortran, and C++:
cc -W2,-Abopt c_example.c
f95 -Qoption iropt -Abopt fortran_example.f95
CC -Qoption iropt -Abopt cxx_example.cxx
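The three compile lines above, together with the earlier advice to inspect what the driver passes to each component, can be wrapped in a few small helpers. This is a sketch only: iropt_flag, verbose_flag, and show_component_opts are hypothetical names, and only the iropt spellings shown in the examples above are covered.

```shell
#!/bin/sh
# Sketch: spell an iropt suboption for each driver, following the example
# compile lines in the text. All function names here are hypothetical.

# Usage: iropt_flag <cc|f95|CC> <suboption>
iropt_flag() {
    case $1 in
        cc)     printf '%s\n' "-W2,$2" ;;              # C: -W2,suboption
        f95|CC) printf '%s\n' "-Qoption iropt $2" ;;   # Fortran and C++
        *)      echo "unknown driver: $1" >&2; return 1 ;;
    esac
}

# Flag that makes each driver print the commands it runs.
# Usage: verbose_flag <cc|f95|CC>
verbose_flag() {
    case $1 in
        cc)     echo '-#' ;;
        f95|CC) echo '-v' ;;
        *)      return 1 ;;
    esac
}

# Read driver output on stdin, keep the line that invokes the named
# component, split it into words, and print the words matching a pattern.
# Usage: <driver output> | show_component_opts <component> <pattern>
show_component_opts() {
    grep "bin/$1" | tr ' \t' '\n\n' | grep -e "$2"
}
```

In a Bourne-style shell, the check from the earlier example could then be written as `cc -# -fast -W2,-Aheap tmp.c 2>&1 | show_component_opts iropt Aheap`. Splitting the line into words plays the role of the fold step in the C shell pipeline shown earlier.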
The "Component" column below is the name of the compiler component as specified for the Fortran and C++ compilers. The "Option" column is the action requested of the specific compiler component. The "Description" column describes what the option does.
Component | Option | Description |
CC, d | | Use iropt in the profile phase of the compilers (iropt is the global optimizer). |
cg | -Qdepgraph-early_cross_call=1 | Enable cross-call instruction scheduling. This option controls whether the "early" schedulers may move instructions across a call instruction. The early schedulers are those run before register allocation. Because of SPARC register windows, this is sometimes useful. |
cg | | Allow generation of speculative loads during Enhanced Pipeline Scheduling (EPS). A speculative load may reduce load latency if the speculation is correct; but if the speculation is incorrect (e.g., the other path is taken, and the load misses in the cache or the TLB), then the overhead may be hundreds of cycles for the incorrect speculation. The EPS scheduler needs to have a very good chance of speculating correctly in order for EPS speculative loads to be an overall win. |
cg | | Use enhanced pipeline scheduling (EPS) and selective scheduling algorithms for instruction scheduling. The EPS scheduler improves the performance of some applications, but others will run more slowly. |
cg | -Qeps:rp_filtering_margin=n | The number of live variables allowed at any given point is n more than the number of physical registers. Setting n to a significantly large number (e.g., 100) will disable register pressure heuristics in EPS. |
cg | | Set the EPS window size to n, that is, the number of instructions it will consider across all paths when trying to find independent instructions to schedule in a parallel group. Larger values may result in better run time, at the cost of increased compile time. |
cg | | Enable the (late) trace scheduler. This is a new feature of the compiler which is being tuned from release to release. It may become the default in a future release. |
cg | | When performing trace scheduling, set the aggressiveness of the trace formation to level n, where n is 4, 5, or 6. The higher the value of n, the lower the branch probability needed to include a basic block in a trace. |
cg | -Qgsched-trace_spec_load=1 | When performing trace scheduling, enable the conversion of loads to non-faulting loads inside the trace. |
cg | | Enable optimizations to reduce branch after branch penalty. On some machines, the instruction fetcher will operate more effectively if branches are separated from each other; for example, not having one branch occupy the delay slot of another branch. Adding no-ops into the code may make the fetcher run more effectively. -Qicache-chbab is not currently on by default because it may increase code size and therefore make the icache less effective, and the algorithm for adding the nops has not been shown to benefit all applications. |
cg | | Inline calls to memcpy with n bytes or fewer being copied. If there are many calls to memcpy with a small number of bytes, the call overhead may be significant. |
cg | | Use profile feedback data to predict values and attempt to generate faster code along these control paths, even at the expense of possibly slower code along paths leading to different values. Correct code is generated for both paths. |
cg | | Align function entry points at n-byte boundaries. Aligning functions may make the instruction fetcher more effective on some machines. In general, this option causes the binary to be larger, and it may cause the I-cache to be less well packed. Default settings are likely to differ from machine to machine. |
cg | | Peel the most frequent test branches/cases off a switch until the branch probability reaches less than 1/n. This is effective only when profile feedback is used. |
cg | -Qlp[=n][-av=n][-t=n][-fa=n][-fl=n][-ip=n][-it=n][-imb=n][-prt=n][-prwt=n][-pwt=n][-pt=weak][-ol=n] | Control prefetching for loops with control flow. lp=n: turns the module on (1) or off (0) (default is on for f95; off for C/C++). lp (no value): in Fortran, equivalent to -Qlp=1 and used as a means for setting the sub-options listed below; in C/C++, equivalent to -Qlp=0, except that when used with the options -xprefetch=auto or -xprefetch_level=[2|3] it is equivalent to -Qlp=1 and used as a means for setting the sub-options listed below. -av=n: sets the prefetch look-ahead distance, in bytes; default is 256. -t=n: sets the number of attempts at prefetching; if not specified, t=2 if -xprefetch_level=3 has been set, otherwise t=1. -fa=n: 1 = force user settings to override internally computed values. -fl=n: 1 = force the optimization to be turned on for all languages. -ip=n: turns on (1) prefetching for one-level indirect memory accesses. -it=n: tells the compiler to insert n extra prefetches for each indirect access in outer loops. -imb=n: tells the compiler (1) to insert indirect prefetches when the indirect access chain spans across basic blocks. -pt=weak: use weak prefetches in the general loop prefetch. -prt=n: 1 = use prefetch with function code 1 (prefetch for one read) for read-only memory accesses. -prwt=n: 3 = use prefetch with function code 3 (prefetch for one write) for memory accesses which are read and then written. -pwt=n: 1 = use prefetch with function code 3 (prefetch for one write) for accesses that are written only. -ol=n: turns on (1) prefetching for outer loops. |
cg | | Specifies that all loops can be pipelined without needing to be concerned about loop-carried dependencies. |
cg | | Use fp divide for signed integer division. |
cg | | Set the prefetch ahead distance assuming that the number of outstanding prefetches is n. With larger n, the ahead distance gets larger. |
cg | | Disable prefetching within modulo scheduling (used in software pipelining). |
cg | | Turn off prefetching in the prolog of modulo scheduled loops. |
cg | | Turn off prefetching for stores in the pipeliner. |
cg | -Qms_pipe-pref_prefstrong=0 | Turn off the use of strong prefetches in modulo scheduled loops. |
cg | | Assert (to the pipeliner) that unsigned int computations will not overflow. |
cg | | Disable the max live base registers algorithm for sethi hoisting. A sethi is a SPARC instruction for forming large constants, especially address constants. Sethi hoisting uses an algorithm that may increase register pressure. Usually, this option is likely to help performance. |
f90comp | | Set the optimization level of the f95 front/middle end to the specified optimization level (Fortran only). |
f90comp | | Enable padding of f95 arrays by n (Fortran only). |
f90comp | -hoist_expensive,-hoist_trivial | Enable additional loop-invariant code motion, hoisting operations out of loops. |
iropt | | Increase the probability that the compiler will perform memcpy/memset transformations. |
iropt | | Enable aggressive optimizations of all branches, such as reversing the branch condition. This is only useful when profile feedback is used. |
iropt | | Turn on analysis of data access patterns for scalars and array regions accessed in each loop. The information is used by various loop transformations, such as loop fusion, to determine the profitability of those transformations. Unlike regular data dependence analysis, this analyzes detailed array sections accessed in a loop, so the analysis can be expensive in terms of compilation time. |
iropt | | Ignore parallelization factors in loop interchange heuristics. |
iropt | | Set memory store operation weight for loop interchange to n. A higher value of n indicates a greater performance cost for stores. This flag gives more weight to store operations in determining whether some loop transformations, such as loop interchange, should be done. |
iropt | | Allow the compiler to recognize malloc-like memory allocation functions. If -xbuiltin is specified, this option is implied. |
iropt | -Ainline[:cp=n][:cs=n][:inc=n][:irs=n][:mi][:rs=n][:recursion=n] | cp=n: the minimum call site frequency counter required to consider a routine for inlining. cs=n: set the inline callee size limit to n; the unit roughly corresponds to the number of instructions. inc=n: the inliner is allowed to increase the size of the program by up to n%. irs=n: allow routines to increase by up to n; the unit roughly corresponds to the number of instructions. mi: perform maximum inlining (without considering code size increase). rs=n: the inliner only considers routines smaller than n pseudo instructions as possible inline candidates. recursion=n: allow a recursive call to be inlined up to n levels. |
iropt | | More aggressive strength reduction by replicating loops. |
iropt | | Increase the probability that loop induction variables will be replaced, so that some extraneous code can be eliminated from loops. |
iropt | -Aloop_dist:ignore_parallel | Ignore parallelization factors in loop distribution heuristics. |
iropt | | Reconstruct array subscripts during memory allocation merging and data layout program transformation. The transformation uses the same arrays, but modifies the ways the arrays are referenced to make them more efficient globally. |
iropt | -Apf:llist=n:noinnerllist | Do speculative prefetching for linked-list data structures. llist=n: perform prefetching n iterations ahead. noinnerllist: do not attempt this for innermost loops. |
iropt | | Allow prefetching through up to n levels of indirect memory references. |
iropt | | Assume global pointers are not aliased (restricted). |
iropt | | Convert multiple short memory operations into single long memory operations. ldld: convert multiple short memory loads into single long load operations. |
iropt | | Perform loop tiling which is enabled by loop skewing. Loop skewing transforms a non-fully interchangeable loop nest to a fully interchangeable loop nest. The optional bn sets the tiling block size to n. |
iropt | | Increase the probability that small-trip-count inner loops will be fully unrolled. |
iropt | | Enable outer-loop unrolling. |
iropt | | Enable optimization of critical control paths. This uses profile data to select critical paths and create super blocks, so that more optimizations and better scheduling can be done on the critical paths, resulting in better overall performance. |
iropt | | Do not inline calls when parameters are arrays and the actual and formal array dimensions are mismatched. |
iropt | | Enable inlining of routines with frame size up to n. |
iropt | | Set the maximum code increase due to inlining to n instruction triples per module. A higher value of n allows more inlining to occur. |
iropt | | Set the maximum code increase due to inlining to n instruction triples per routine. A higher value of n allows more inlining to occur. |
iropt | | Set the maximum level of recursive inlining to a depth of n. A higher value of n allows more inlining to occur. |
iropt | | Set the maximum size of a routine body eligible for inlining to n instruction triples. A higher value of n allows larger routines to be inlined. |
iropt | | Disable scalar replacement optimization. Generally, scalar replacement will reduce memory accesses in a loop, and therefore improve the loop's performance. But it can also increase register pressure, which can lead to register spills (stores of registers to memory), an expensive operation. |
iropt | | Do not perform loop distribution transformations. |
iropt | | Enable loop rerolling. |
iropt | | Do whole program optimizations. Allows the compiler to do a better job of inter-procedural analysis. |
iropt | | Treat formal pointer parameters as restricted pointers (not aliased). |
ld | -M,/usr/lib/ld/map.bssalign | Instructs the linker to use the mapfile from /usr/lib/ld/map.bssalign. This provides an appropriate alignment for large page mapping of the heap, allowing for more efficient usage of large pages. (Fortran) |
ube | | Allow the optimizer to use x87 hardware instructions for sine, cosine, and rsqrt. The precision and rounding effects are determined by the underlying hardware implementation, rather than by standard IEEE754 semantics (x86). |
ube | | yes (the default) assumes that callee-save registers are saved; no assumes they are not saved (x86). |
ube | | Enable the instruction scheduling phase before the global register allocator. |
ube_ipa | | Enable more aggressive inlining, especially with profile feedback (x86). |
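Several rows in the table, such as -Ainline, take optional colon-separated sub-fields. As an illustration of how such a string is assembled before being handed to iropt, here is a minimal sketch. The helper name ainline_opt is hypothetical, and the numeric values in the usage note are placeholders rather than recommended settings.

```shell
#!/bin/sh
# Sketch: assemble an -Ainline option string from the sub-fields listed in
# the table (cp, cs, inc, irs, mi, rs, recursion). ainline_opt is a
# hypothetical helper name; it only checks spelling, not whether a value
# is sensible for a given program.
# Usage: ainline_opt [cp=N] [cs=N] [inc=N] [irs=N] [mi] [rs=N] [recursion=N]
ainline_opt() {
    opt="-Ainline"
    for field in "$@"; do
        case $field in
            mi) ;;   # the only sub-field that takes no value
            cp=*|cs=*|inc=*|irs=*|rs=*|recursion=*)
                case ${field#*=} in
                    ''|*[!0-9]*)
                        echo "non-numeric value: $field" >&2
                        return 1 ;;
                esac ;;
            *)
                echo "unknown -Ainline field: $field" >&2
                return 1 ;;
        esac
        opt="$opt:$field"
    done
    printf '%s\n' "$opt"
}
```

Following the pass-through spellings shown earlier for iropt, a C compile might then use `cc -W2,$(ainline_opt cs=400 recursion=2) prog.c`, and a Fortran compile `f95 -Qoption iropt $(ainline_opt cs=400 recursion=2) prog.f95`. As with every option on this page, measure with and without the setting before keeping it.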