Advanced Compiler Options for Performance

 
By Sun Studio Compiler Engineering Staff, revised July 2007  
Users wanting the best performance from CPU-intensive codes may wish to explore the use of additional libraries and advanced compiler options that control individual compiler components.

Performance libraries

  1. The optimized math library, selected by the switch -xlibmopt in Fortran and C++, or by including -lmopt in C. This library may produce slightly different results, usually differing only in the last bit, in order to achieve higher performance.

  2. Several memory allocation libraries are available. A guide to the choices can be found in the NOTES section of the umem_alloc(3MALLOC) man page. In addition, the library libfast.a is available. To use it, add -lfast to your C, Fortran, or C++ compile command. Like libbsdmalloc, libfast keeps free lists of various sizes to provide very fast allocation, but at the expense of additional memory. Also, libfast samples the allocation stream, and if consecutive samples are sufficiently smaller than the current size of a free list, a new free list is created. Note that libfast.a is suitable for use only in 32-bit, single-threaded environments. Do not use it if you have set the environment variables PARALLEL or OMP_NUM_THREADS to a value greater than 1, or if your application code calls omp_set_num_threads() to increase the number of threads.

  3. The Sun Performance Library, which is a set of optimized, high-speed mathematical subroutines that are used to solve linear algebra and other numerically intensive problems. The library is linked to your application by using the switch -xlic_lib=sunperf.

    Note - The optimized math library, sunperf, and several other math libraries are described in the Sun Studio 12: Numerical Computation Guide.

  4. An optimized memset/memcpy library, suitable for UltraSPARC-III or later systems. To use it, add the switch -ll2amm to your compile command. You might also need to add -xarch=v8plusb to your command line. Note that compiling with -ll2amm produces a binary that cannot be used on previous generation processors.

  5. For Solaris 8 or earlier systems, use the Intimate Shared Memory library to enable use of large pages (for example, 4MB). Using large pages may significantly reduce the number of Translation Lookaside Buffer (TLB) misses, and may therefore improve performance for programs that access large amounts of memory.

    To use large pages:

    • Link your program with the libprism32 library. Be sure that your LD_LIBRARY_PATH includes /opt/SUNWspro/prod/lib/v8plusb and then use the switch -lprism32.

    • Set kernel parameters that allow use of intimate shared memory:

      shmsys:shminfo_shmmin
      shmsys:shminfo_shmmax
      shmsys:shminfo_shmmni
      shmsys:shminfo_shmseg

      Documentation on these parameters may be found in the Solaris 8 System Administration Guide, Volume II.

    • Set environment variables that allow use of large pages:

      • PRISM_HEAP=n: the maximum amount of memory allowed for large pages in the heap.

      • PRISM_MODE=n: what kind of data should be put into which kind of large pages:

        value of n   action
        1            attempt to put text in a shared ISM segment
        2            attempt to put text, data, and current heap in a private ISM segment
        3            attempt to put text and read-only data in a shared ISM segment, but initialized data, bss, and heap in a private ISM segment

    For systems using Solaris 9 or later, the page size setting facility mpss.so.1, the ppgsz command, and the -xpagesize family of compiler flags are the preferred methods for using large pages.
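As a rough illustration of the three Solaris 9 methods just mentioned, the sketch below prints (rather than executes) one command line for each. The program names myprog and myprog.c are hypothetical, the 4M size is only an example, and the ppgsz option syntax and the MPSSHEAP variable are taken from the ppgsz(1) and mpss.so.1(1) man pages rather than from this article, so check them against your own system.

```sh
#!/bin/sh
# Dry-run sketch: print three ways to request 4 MB heap pages on
# Solaris 9 or later. Nothing is compiled or run.
# "myprog" and "myprog.c" are hypothetical names; 4M is an example size.

PAGE=4M

# 1. Compile-time request using the -xpagesize family of flags.
compile_cmd() {
    printf '%s\n' "cc -fast -xpagesize=$PAGE -o myprog myprog.c"
}

# 2. Run-time request for an existing binary using the ppgsz command.
ppgsz_cmd() {
    printf '%s\n' "ppgsz -o heap=$PAGE ./myprog"
}

# 3. Run-time request by preloading mpss.so.1 (MPSSHEAP sets the heap page size).
mpss_cmd() {
    printf '%s\n' "env LD_PRELOAD=mpss.so.1 MPSSHEAP=$PAGE ./myprog"
}

compile_cmd
ppgsz_cmd
mpss_cmd
```

Whether a given page size is actually available depends on the processor and the Solaris release, so any measured benefit should be confirmed on the target machine.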

Compiler Component Options

The Sun Studio compilers are divided into several different components. Performance can sometimes be improved by passing switches directly to individual components, including:

  • CC, driver for C++

  • cg, the code generator

  • d, driver for C

  • f90comp, front end for Fortran

  • iropt, the global optimizer

  • ld, the link editor

  • ube, the x86/x64 code generator

  • ube_ipa, the x86/x64 interprocedural optimizer

NOTE: although the use of these options is supported, there are several caveats that must be understood before using them.

  • Usually, these options are set automatically:

    Usually, the compiler itself picks values for these options, based on other options that are selected. Most users will achieve adequate performance without needing to set these options.

  • Subject to change:

    These options may change from release to release of the compiler. The spelling, effect, and even presence of these performance options may evolve from time to time; Makefile authors should be prepared to cope with such evolution.

  • Performance testing is required:

    Some of these options may help your code run faster; others might actually make it run more slowly. You should not use one of these options unless you believe that you have a good test case to demonstrate its effect. A good test case:

    • represents an important real-life use of the code;

    • can be compiled both with and without the compiler option in question;

    • includes a repeatable workload; and

    • has available a machine environment where changes can be reliably measured.

  • Understand what the driver is doing:

    If you choose to experiment with the options documented on this page, you will probably find it useful to examine what the compiler driver is passing to each stage of the compilation, both before and after your changes. You can do this by adding -v to your Fortran or C++ compile, or by adding -# to your C compile. For example, if you wanted to check whether the driver passes -Aheap to iropt, you could find out by typing (in a C shell) something like this:

    % cc -# -fast -W2,-Aheap tmp.c |& grep bin/iropt | fold -s -20 | grep Aheap 
    -O5 -Aheap
    %

    The above command pipes stderr to grep, which looks for the line that invokes iropt. Since that line is very long, we fold it into smaller chunks, and then examine the chunks for one containing "Aheap".

  • Correctness testing is recommended:

    The use of performance options may sometimes lead to unexpected results. For example, a program may have a bug which is harmless when compiled without optimization, but which causes incorrect operation when the compiler uses more advanced optimizations. In addition, although these options have been tested internally at Sun, they have had less exposure to customer applications.

    Therefore, as in any other performance improvement exercise, it is prudent to include testing for correct output as you tune performance.
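The driver check described under "Understand what the driver is doing" can also be written for a Bourne-style shell. The following is a minimal sketch, not part of the compiler distribution: find_flag is a hypothetical helper that reads a driver trace on standard input, and the trace line used in the demonstration is made up to stand in for real cc -# output.

```sh
#!/bin/sh
# find_flag COMPONENT FLAG
# Read a driver trace on stdin, keep only the lines that invoke COMPONENT,
# split them into one word per line, and print FLAG if it is present.
# Exit status is 0 if the flag was found and 1 otherwise.
find_flag() {
    grep "bin/$1" | tr -s ' ' '\n' | grep -F -x -e "$2"
}

# Made-up trace line standing in for real driver output.
trace='/opt/SUNWspro/prod/bin/iropt -O5 -Aheap tmp.c'

printf '%s\n' "$trace" | find_flag iropt -Aheap   # prints -Aheap
```

With a real compiler the trace would come from the driver itself, for example cc -# -fast -W2,-Aheap tmp.c 2>&1 | find_flag iropt -Aheap, where 2>&1 plays the role of the C shell's |& operator. Splitting the line into words avoids having to guess a fold width.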

Below is a selected list of options that can be passed directly to Sun Studio compiler components. These are passed using the -W flag (when using the C compiler) and the -Qoption flag (when using the Fortran or C++ compilers). The table below shows how each component is addressed from the Fortran, C++, and C compilers. The f90comp component is Fortran specific and not available from the C or C++ compiler. The ld component is not needed from the C or C++ compiler; use the -M flag instead.

Component   Fortran                        C++                            C
CC          -                              -Qoption CC suboption          -
cg          -Qoption cg suboption          -Qoption cg suboption          -Wc,suboption
d           -                              -                              -Wd,suboption
f90comp     -Qoption f90comp suboption     -                              -
iropt       -Qoption iropt suboption       -Qoption iropt suboption       -W2,suboption
ld          -Qoption ld suboption          -                              -
ube         -Qoption ube suboption         -Qoption ube suboption         -Wu,suboption
ube_ipa     -Qoption ube_ipa suboption     -Qoption ube_ipa suboption     -Wi,suboption

As an example, the following shows how the compile line would look to invoke the -Abopt option to the iropt component of the compiler when compiling a program for C, Fortran and C++.

cc -W2,-Abopt c_example.c
f95 -Qoption iropt -Abopt fortran_example.f95
CC -Qoption iropt -Abopt cxx_example.cxx
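
The mapping in the component table can be expressed as a small lookup. The sketch below is illustrative only: component_option is a hypothetical helper, not a Sun Studio tool, and it encodes just the combinations listed in the table, returning a non-zero status for entries marked "-".

```sh
#!/bin/sh
# component_option COMPILER COMPONENT SUBOPTION
# Print the driver flag that forwards SUBOPTION to COMPONENT.
# Returns 1 for combinations that the component table marks "-".
component_option() {
    compiler=$1 component=$2 suboption=$3
    case $compiler in
    cc)
        # The C driver uses -W followed by a one-character component code.
        case $component in
        cg)      letter=c ;;
        d)       letter=d ;;
        iropt)   letter=2 ;;
        ube)     letter=u ;;
        ube_ipa) letter=i ;;
        *)       return 1 ;;
        esac
        printf '%s\n' "-W$letter,$suboption"
        ;;
    f95)
        # The Fortran driver uses -Qoption with the component name.
        case $component in
        cg|f90comp|iropt|ld|ube|ube_ipa) ;;
        *) return 1 ;;
        esac
        printf '%s\n' "-Qoption $component $suboption"
        ;;
    CC)
        # The C++ driver also uses -Qoption with the component name.
        case $component in
        CC|cg|iropt|ube|ube_ipa) ;;
        *) return 1 ;;
        esac
        printf '%s\n' "-Qoption $component $suboption"
        ;;
    *)
        return 1
        ;;
    esac
}

component_option cc  iropt -Abopt   # prints -W2,-Abopt
component_option f95 iropt -Abopt   # prints -Qoption iropt -Abopt
component_option CC  iropt -Abopt   # prints -Qoption iropt -Abopt
```

A helper of this kind can keep a Makefile that builds mixed C, Fortran, and C++ sources consistent when the same suboption is tried across all three compilers.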

The "Component" column below is the name of the compiler component as specified for the Fortran and C++ compilers. The "Option" column is the action requested of the specific compiler component. The "Description" column describes what the option does.

Component
Option
Description
CC, d
-iropt-prof
Use iropt in the profile phase of the compilers (iropt is the global optimizer).
cg
-Qdepgraph-early_cross_call=1
Enable cross-call instruction scheduling.

This option controls whether the "early" schedulers may move instructions across a call instruction. The early schedulers are those run before register allocation. Because of SPARC register windows, this is sometimes useful.

cg
-Qeps:do_spec_load=1
Allow generation of speculative loads during Enhanced Pipeline Scheduling (EPS). A speculative load may reduce load latency, if the speculation is correct; but if the speculation is incorrect (e.g. the other path is taken, and the load misses in the cache or the TLB), then the overhead may be hundreds of cycles for the incorrect speculation. The EPS scheduler needs to have a very good chance of speculating correctly in order for EPS speculative loads to be an overall win.
cg
-Qeps:enabled=1
Use enhanced pipeline scheduling (EPS) and selective scheduling algorithms for instruction scheduling. The EPS scheduler will cause some applications to improve performance, but some will run more slowly.
cg
-Qeps:rp_filtering_margin=n
The number of live variables allowed at any given point is n more than the number of physical registers. Setting n to a significantly large number (e.g., 100) will disable register pressure heuristics in EPS.
cg
-Qeps:ws=n
Set the EPS window size to n, that is, the number of instructions it will consider across all paths when trying to find independent instructions to schedule a parallel group. Larger values may result in better run time, at the cost of increased compile time.
cg
-Qgsched-trace_late=1
Enable the (late) trace scheduler. This is a new feature of the compiler which is being tuned from release to release. It may become the default in a future release.
cg
-Qgsched-Tn
When performing trace scheduling, set the aggressiveness of the trace formation to level n, where n is 4, 5, or 6. The higher the value of n, the lower the branch probability needed to include a basic block in a trace.
cg
-Qgsched-trace_spec_load=1
When performing trace scheduling, enable the conversion of loads to non-faulting loads inside the trace.
cg
-Qicache-chbab=1
Enable optimizations to reduce branch after branch penalty. On some machines, the instruction fetcher will operate more effectively if branches are separated from each other; for example, not having one branch occupy the delay slot of another branch. Adding no-ops into the code may make the fetcher run more effectively. -Qicache-chbab is not currently on by default because it may increase code size and therefore make the icache less effective, and the algorithm for adding the nops has not been shown to benefit all applications.
cg
-Qinline_memcpy=n
Inline calls to memcpy with n bytes or fewer being copied. If there are many calls to memcpy with a small number of bytes, the call overhead may be significant.
cg
-Qipa:valueprediction
Use profile feedback data to predict values and attempt to generate faster code along these control paths, even at the expense of possibly slower code along paths leading to different values. Correct code is generated for both paths.
cg
-Qiselect-funcalign=n 
Align function entry points at n-byte boundaries. Aligning functions may make the instruction fetcher more effective on some machines. In general, this option causes the binary to be larger, and it may cause the I-cache to be less well packed. Default settings are likely to differ from machine to machine.
cg
-Qiselect-sw_pf_tbl_th=n 
Peel the most frequent test branches/cases off a switch until the branch probability drops below 1/n. This is effective only when profile feedback is used.
cg
-Qlp[=n][-av=n][-t=n][-fa=n][-fl=n][-ip=n][-it=n][-imb=n][-prt=n][-prwt=n][-pwt=n][-pt=weak][-ol=n]
Control prefetching for loops with control flow:

lp=n Turns the module on (1) or off (0) (default is on for f95; off for C/C++)

lp (with no value) In Fortran, this is equivalent to -Qlp=1 and serves as a prefix for setting the sub-options listed below. In C/C++, it is equivalent to -Qlp=0. However, when used with the options -xprefetch=auto or -xprefetch_level=[2|3], it is equivalent to -Qlp=1 and likewise serves as a prefix for setting the sub-options listed below.

-av=n Sets the prefetch look ahead distance, in bytes. Default is 256.

-t=n Sets the number of attempts at prefetching. If not specified, t=2 if -xprefetch_level=3 has been set; otherwise, defaults to t=1.

-fa=n 1=Force user settings to override internally computed values.

-fl=n 1=Force the optimization to be turned on for all languages

-ip=n Turns on (1) prefetching for one-level indirect memory accesses.

-it=n Indicates to the compiler to insert n extra prefetches for each indirect access in outer loops.

-imb=n Indicates to the compiler (1) to insert indirect prefetches when the indirect access chain spans across basic blocks.

-pt=weak Use weak prefetches in the general loop prefetch.

-prt=n 1= Use prefetch with function code 1 (prefetch for one read) for read-only memory accesses.

-prwt=n 3= Use prefetch with function code 3 (prefetch for one write) for memory accesses which are read and then written.

-pwt=n 1= Use prefetch with function code 3 (prefetch for one write) for accesses that are written only.

-ol=n Turns on (1) prefetching for outer loop.

cg
-Qms_pipe+alldoall 
Specifies that all loops can be pipelined without needing to be concerned about loop-carried dependencies.
cg
-Qms_pipe+intdivusefp
Use fp divide for signed integer division.
cg
-Qms_pipe+prefolim=n
Set the prefetch ahead distance assuming that the number of outstanding prefetches is n. With larger n, the ahead distance gets larger.
cg
-Qms_pipe-pref
Disable prefetching within modulo scheduling (used in software pipelining).
cg
-Qms_pipe-pref_prolog
Turn off prefetching in the prolog of modulo scheduled loops.
cg
-Qms_pipe-prefst
Turn off prefetching for stores in the pipeliner.
cg
-Qms_pipe-pref_prefstrong=0 
Turn off the use of strong prefetches in modulo scheduled loops.
cg
-Qms_pipe+unoovf
Assert (to the pipeliner) that unsigned int computations will not overflow.
cg
-Qpeep-Sh0
Disable the max live base registers algorithm for sethi hoisting. A sethi is a SPARC instruction for forming large constants, especially address constants. Sethi hoisting uses an algorithm that may increase register pressure. Usually, this option is likely to help performance.
f90comp
-O[3-5]
Set the optimization level of the f95 front/middle end to the specified level (Fortran only).
f90comp
-array_pad_rows,n
Enable padding of f95 arrays by n (Fortran only).
f90comp
-hoist_expensive,-hoist_trivial
Enables additional loop invariant code motion, hoisting operations out of loops.
iropt
-Abcopy 
Increase the probability that the compiler will perform memcpy/memset transformations.
iropt
-Abopt
Enable aggressive optimizations of all branches, such as reversing the branch condition. This is only useful when profile feedback is used.
iropt
-Adata_access
This option turns on analysis of data access patterns for the scalars and array regions accessed in each loop. The information is used by various loop transformations, such as loop fusion, to determine whether those transformations are profitable. Unlike regular data dependence analysis, this analyzes detailed array sections accessed in a loop, so the analysis can be expensive in terms of compilation time.
iropt
-Addint:ignore_parallel
Ignore parallelization factors in loop interchange heuristics.
iropt
-Addint:sf=n
Set memory store operation weight for loop interchange to n. A higher value of n indicates a greater performance cost for stores. This flag gives more weight to store operations in determining whether some loop transformations such as loop interchange should be done.
iropt
-Aheap
Allow the compiler to recognize malloc-like memory allocation functions. If -xbuiltin is specified, this option is implied.
iropt
-Ainline[:cp=n][:cs=n][:inc=n][:irs=n][:mi][:rs=n][:recursion=n]
cp=n The minimum call site frequency counter in order to consider a routine for inlining

cs=n Set inline callee size limit to n. The unit roughly corresponds to the number of instructions.

inc=n The inliner is allowed to increase the size of the program by up to n%.

irs=n Allow routines to increase by up to n. The unit roughly corresponds to the number of instructions.

mi Perform maximum inlining (without considering code size increase).

rs=n Inliner only considers routines smaller than n pseudo instructions as possible inline candidates.

recursion=n Allow a recursive call to be inlined up to n levels deep.

iropt
-Aivel:duplicate_loops
More aggressive strength reduction by replicating loops.
iropt
-Aivsub3
Increase the probability that loop induction variables will be replaced, so that some extraneous code can be eliminated from loops.
iropt
-Aloop_dist:ignore_parallel
Ignore parallelization factors in loop distribution heuristics.
iropt
-Amemopt:arrayloc 
Reconstruct array subscripts during memory allocation merging and data layout program transformation. The transformation uses the same arrays, but modifies the ways the arrays are referenced to make them more efficient globally.
iropt
-Apf:llist=n:noinnerllist
Do speculative prefetching for linked-list data structures:

llist=n Perform prefetching n iterations ahead.

noinnerllist Do not attempt prefetching for innermost loops.

iropt
-Apf:pdl=n
Allow prefetching through up to n levels of indirect memory references.
iropt
-Arestrict_g
Assumes global pointers are not aliased (restricted).
iropt
-Ashort_ldst[:ldld] 
Convert multiple short memory operations into single long memory operations.

ldld Convert multiple short memory loads into single long load operations.

iropt
-Atile:skewp[:bn] 
Perform loop tiling which is enabled by loop skewing. Loop skewing transforms a non-fully interchangeable loop nest to a fully interchangeable loop nest. The optional bn sets the tiling block size to n.
iropt
-Aujam:inner=g
Increase the probability that small-trip-count inner loops will be fully unrolled.
iropt
-Aunroll 
Enable outer-loop unrolling.
iropt
-crit
Enable optimization of critical control paths. This is based on profile data to select critical paths and create super blocks so that more optimizations and better scheduling can be done on the critical paths, and result in better overall performance.
iropt
-MR
Do not inline calls when parameters are arrays and the actual and formal array dimensions are mismatched.
iropt
-Man
Enable inlining of routines with frame size up to n.
iropt
-Mmn
Set the maximum code increase due to inlining to n instruction triples per module. A higher value of n allows more inlining to occur.
iropt
-Mrn
Set the maximum code increase due to inlining to n instruction triples per routine. A higher value of n allows more inlining to occur.
iropt
-Msn
Set the maximum level of recursive inlining to a depth of n. A higher value of n allows more inlining to occur.
iropt
-Mtn
Set the maximum size of a routine body eligible for inlining to n instruction triples. A higher value of n allows larger routines to be inlined.
iropt
-Rscalarrep
Disable scalar replacement optimization. Generally, scalar replacement reduces memory accesses in a loop and therefore improves the loop's performance. But it can also increase register pressure, which can lead to register spills (stores of registers to memory), an expensive operation.
iropt
-Rloop_dist
Do not perform loop distribution transformations.
iropt
-reroll=1
Enable loop rerolling.
iropt
-whole
Do whole program optimizations. Allows the compiler to do a better job of inter-procedural analysis.
iropt
-xrestrict 
Treat formal pointer parameters as restricted pointers (not aliased).
ld
-M,/usr/lib/ld/map.bssalign
Instructs linker to use mapfile from /usr/lib/ld/map.bssalign. This provides an appropriate alignment for large page mapping of the heap, allowing for more efficient usage of large pages. (Fortran)
ube
-fsimple=3
Allow optimizer to use x87 hardware instructions for sine, cosine, and rsqrt. The precision and rounding effects are determined by the underlying hardware implementation, rather than by standard IEEE754 semantics (x86).
ube
-xcallee[=yes|no]
Assume that callee-save registers are saved (yes, the default) or that they are not saved (no) (x86).
ube
-sched_first_pass=1
Enable the instruction scheduling phase before the global register allocator.
ube_ipa
-inl_alt
Enables more aggressive inlining, especially with profile feedback (x86).
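
Several options in the table above (for example -Abopt, -crit, and -Qiselect-sw_pf_tbl_th=n) are described as useful only with profile feedback. The sketch below prints, without executing, the three steps of such a build for a C program. The names app, app.c, and ./feedback are hypothetical, and the -xO5 and -xprofile=collect/use flags are standard Sun Studio driver options that this article does not itself describe, so treat the exact spelling as an assumption to verify against your compiler release.

```sh
#!/bin/sh
# Dry-run sketch: print the steps of a profile-feedback build that passes
# -Abopt to iropt on the final compile. Nothing is compiled or run.
# "app", "app.c", and "./feedback" are hypothetical names.

SRC=app.c
EXE=app
PROFILE=./feedback

feedback_steps() {
    # Step 1: build an instrumented binary that records profile data.
    printf '%s\n' "cc -xO5 -xprofile=collect:$PROFILE -o $EXE $SRC"
    # Step 2: run a representative, repeatable workload.
    printf '%s\n' "./$EXE"
    # Step 3: rebuild using the collected profile, adding the iropt option.
    printf '%s\n' "cc -xO5 -xprofile=use:$PROFILE -W2,-Abopt -o $EXE $SRC"
}

feedback_steps
```

Repeating step 3 without -W2,-Abopt gives the baseline binary needed to compare the two builds on the same workload.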