By Sheldon Lobo, Compiler Technology And Performance Engineering, Sun Microsystems, November 30, 2005
Improving binary performance is a frequent request from customers.
These requests usually come from end customers of Sun systems or even
performance, benchmarking and production groups of large independent
software vendors (ISVs). The common theme is the non-availability of
the original source code. Without a re-compile, it is usually a hard,
time-consuming and costly endeavor to meaningfully improve binary
performance. Sometimes system tweaks to a non-optimized system will do
the trick, but often a complete system upgrade is necessary.
The Binary Optimizer is a tool that improves binary performance,
without the need for system changes or upgrades. This tool modifies the
binary by updating the binary instructions to generate more optimal
code. Capability exists to instrument the binary for profile
collection. When data from such a profile training run is fed back to
the Binary Optimizer, significant performance improvements may be
achieved. This is especially true for binaries that were not built with
high levels of optimizations, or were built without profile data, or
even built with profile data that is not representative of the end
customer's unique workload.
What is the Binary Optimizer?
The Binary Optimizer is a static SPARC optimizer that
accepts as input a binary and creates an optimized binary as the
output. We define a binary as either an executable or a shared object.
The availability of the original source code is not a pre-requisite for
using this tool. It can optimize binaries irrespective of the source
language used (C, C++ or FORTRAN). It can also optimize mixed source
language binaries. The Binary Optimizer can also be used to instrument
a binary to collect the basic block execution count profile for a given
run. This data can be used to perform profile guided optimizations at
the binary level. If new profile data is available, or the execution
environment of the binary has changed, a previously optimized binary
may be re-optimized with this new data.
A Quick-Start Guide
Without going into details about optimizations, command-line options,
and debugging, here are the essential steps to optimize a binary.
- The binary must be compiled and linked with optimizations (-O or -xOn) and the special compiler option -xbinopt=prepare.
- The resulting binary should be instrumented for profile collection using the -binstrument flag.
- Run the application with one or more representative workloads.
- Optimize the binary with this profile data via the -buse flag.
Example:
% cc -O -xbinopt=prepare -o myapp *.c
% binopt -binstrument myapp
% myapp < input_data
% binopt -buse myapp
Why Use the Binary Optimizer?
The global optimizations performed by the Binary Optimizer usually show
greater performance improvements on large applications. We see the
following potential users of binary optimization technology:
End Users on SPARC Platforms:
Experienced users of Sun systems (for example, database administrators)
are often looking for ways to improve binary performance. For such
users, who are ready to go the extra mile to tune binaries they receive
from software vendors, the Binary Optimizer is an ideal tool. It is
also available for free as part of the Sun Studio toolkit.
For the software vendor, the necessary step to follow is:
- Build the binary with the -xbinopt=prepare compiler flag.
For the end user, the following steps will optimize the binary for
their specific workload:
- Instrument the binary, using binopt.
- Run the instrumented binary on a representative workload.
- Use binopt again to optimize the binary with the collected runtime data.
For example, the end user optimizes app using binopt:
% binopt -binstrument -bdata=datafile -o app.instr app
% app.instr < input_data
% binopt -buse -bdata=datafile -o app.opt app
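The three end-user steps can be wrapped in one function. The following is a minimal sketch in POSIX sh rather than the csh prompt used in the examples. The function name binopt_tune and the DRY_RUN switch are hypothetical, and the sketch assumes that the application sits in the current directory and reads its workload from standard input. Only the binopt flags already shown in this article are used.

```shell
#!/bin/sh
# Sketch of the end-user workflow as one function (names are illustrative).
# Assumes the application is in the current directory and reads its
# workload from standard input. Set DRY_RUN=1 to print the commands
# instead of running them.

binopt_tune() {
    app=$1      # prepared binary, for example "app"
    data=$2     # profile data file, for example "datafile"
    input=$3    # file holding a representative workload

    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "binopt -binstrument -bdata=$data -o $app.instr $app"
        echo "./$app.instr < $input"
        echo "binopt -buse -bdata=$data -o $app.opt $app"
        return 0
    fi

    # 1. Instrument the binary for profile collection.
    binopt -binstrument -bdata="$data" -o "$app.instr" "$app" || return 1
    # 2. Training run on a representative workload.
    "./$app.instr" < "$input" || return 1
    # 3. Optimize the original binary with the collected profile.
    binopt -buse -bdata="$data" -o "$app.opt" "$app"
}
```

Running it once with DRY_RUN=1 shows the exact commands before anything is modified; stopping at the first failing step avoids optimizing with a missing or partial profile.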
Software Vendors:
The Binary Optimizer performs optimizations that are not normally
performed by the compiler. Hence, by including the Binary Optimizer in
the build process, a better performing production binary may be
obtained.
The steps necessary to create a production binary with the Binary
Optimizer are:
- Compile the application with the -xbinopt=prepare flag.
- Instrument the resulting binary for profile collection using the -binstrument flag of binopt.
- Run the application with one or more representative workloads.
- Optimize the binary using the collected profile data and the binopt -buse flag.
Example:
% cc -xO4 -xbinopt=prepare -o app *.c
% binopt -binstrument -bdata=datafile -o app.instr app
% app.instr < input_data
% binopt -buse -bdata=datafile -o app.opt app
It is important to note that if you are already using profile feedback
(-xprofile=collect|use compiler flags) to build the application, it may
be easier to use the -xlinkopt compiler flag in the build, rather than
the Binary Optimizer, to obtain similar optimizations.
Performance With binopt
We see significant performance improvements on large applications when
the Binary Optimizer is used. This is especially true for applications
that are not built with profile feedback or are built with feedback
that does not truly represent the end customer's workload. In these
situations, a performance gain of 10% or more is not unheard of.
The user must also be aware that using the Binary Optimizer increases
the size of the binary, because optimized code is placed in a new
segment in the binary. On large applications, a size increase of up to
1.8x has been observed.
The Binary Optimizer runtime is usually a fraction of the build time
of the entire binary. For large applications, where the build time is
usually several hours, binopt runtime can be measured in minutes. For
example, building a well-known database application from source takes
over 5 hours. Performing binary optimizations on the resultant binary
takes 8 minutes.
Optimization Levels
The -blevel=1 optimization level is the default level of optimization
for binopt(1). At this level, code ordering and control flow
optimizations are performed. While ordering code, functions may be
split to optimize I-cache performance.
At the -blevel=2 optimization level, data-flow information is
constructed and more aggressive optimizations are performed. These
include inlining, address simplification and load instruction
optimizations. Usually better performance is derived from using this
higher level of optimization. The tradeoff is an increased binopt
runtime.
At -blevel=0, no optimizations are performed.
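To weigh the extra binopt runtime of the higher level against the gain on a given workload, one option is to build one optimized binary per level and time them against each other. The sketch below, in POSIX sh, only prints the commands. The function name level_commands and the output names app.opt1 and app.opt2 are hypothetical, and combining -blevel with -buse and -bdata on one command line is an assumption based on the flags described in this article.

```shell
#!/bin/sh
# Sketch: print one binopt command per optimization level so the
# resulting binaries can be timed against each other.
# The output names (<app>.opt1, <app>.opt2) are illustrative only.

level_commands() {
    app=$1      # prepared binary
    data=$2     # profile data from an earlier training run
    for level in 1 2; do
        echo "binopt -buse -bdata=$data -blevel=$level -o $app.opt$level $app"
    done
}

# Review the commands first, then pipe them to sh to execute them:
#   level_commands app datafile | sh
```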
Profile Instrumentation
Collecting and using a profile of the execution characteristics of a
binary is crucial to making effective use of the Binary Optimizer.
Instrumenting a binary and executing a training run to collect the data
is relatively easy when using this tool. A single command line
instruments the binary. The instrumented binary may be freely copied to
a potentially different run machine: it is self-contained, and no
dependencies need to be maintained. After the training run is complete,
a file containing the profile data is created. Accumulation of profile
data from multiple training runs is another useful feature: the user
just needs to specify the pre-existing data file on the -binstrument
command line.
When collecting profile data for applications that contain one or more
executables and/or shared objects, all binaries for which optimizations
are planned need to be instrumented. In the example below, the
executable app has a dependency on the shared object x.so. As
demonstrated, both binaries need to be instrumented and optimized
separately.
% binopt -binstrument -bdata=app.data -o app.instr app
% binopt -binstrument -bdata=x.so.data -o x.so x.so
% app.instr < input_data
% binopt -buse -bdata=app.data -o app.opt app
% binopt -buse -bdata=x.so.data -o x.so x.so
Debugging
The Binary Optimizer maintains full compatibility with tools that
statically or dynamically examine a binary (analyzer(1), dbx(1),
pstack(1), etc.). The symbol tables are updated to reflect all
transformations. Mangled symbol names are often assigned (see the
example below), which are automatically de-mangled when displayed by
the Studio tools.
If the prepared binary was built for debugging (with the -g compiler
option), debugging information is automatically propagated to the
binary, instead of leaving it in the object file by default. When such
a binary is optimized by binopt, the available debugging information is
updated to reflect the transformations performed.
Example
Here is a small example to help understand how the Binary Optimizer
transforms the binary.
In the example shown in Figure 1, there are three functions: main(),
add() and sub(). The frequently executed parts of the code are denoted
by the red rectangles, while the less frequently executed code is
colored green. The layout of the optimized binary is shown on the
right-hand side. Here are some of the characteristics of the new
binary:
- The optimized code is placed in a new segment of the binary (named “Optimized code” in Figure 1 below).
- Functions may be split while laying out code (function main() is split; the hot fragment that is not the entry point is given the mangled name _$o1cexhO0.main()).
- The original functions are given new mangled names (_$r1.main(), _$b1.add(), _$b1.sub()).
Figure 1: Typical code layout from the binary optimizer
Additional Details
-xbinopt=prepare Considerations:
The -xbinopt=prepare compiler flag, when used to build a binary, adds
certain information to the binary that allows it to be transformed by
the Binary Optimizer. This information describes the location of the
executable code, points out control flow structures like function
boundaries and switch tables, and provides data flow information about
the code. This data is stored in a new ELF section named .annotate.
This additional information in the binary results in a 5% increase in
size, on average. There is no noticeable build time impact when this
flag is used.
In addition, prepared binaries built for debugging (with the -g
compiler option) have an additional size increase due to the presence
of debugging information. On average we see a 50% increase in binary
file size when compared to a debuggable binary built without the
-xbinopt=prepare option.
Profile Instrumentation (-binstrument) Considerations:
While doing a training run to collect binary profile information, the
user will notice a slowdown in application performance. This is to be
expected since there is an overhead associated with recording the
execution count profile of the executable code. Usually we see a 2.5x
to 3x slowdown in application performance.
There is also an increase in binary file size associated with adding
instrumentation code. We usually see a 2.5x increase in binary size due
to profile instrumentation.
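These typical factors can be turned into rough planning numbers before scheduling a training run. The sketch below, in POSIX sh, uses integer arithmetic on the 2.5x size increase and the 2.5x to 3x slowdown quoted above. The function names are hypothetical, and the results are only estimates based on those averages.

```shell
#!/bin/sh
# Sketch: rough planning numbers from the typical factors quoted above.
# Integer arithmetic only; sizes in KB, times in seconds.

# Typical size of the instrumented binary (about 2.5x the original).
instrumented_size_kb() {
    echo $(( $1 * 25 / 10 ))
}

# Expected training-run time as "low high" (2.5x to 3x slowdown).
training_time_range() {
    echo "$(( $1 * 25 / 10 )) $(( $1 * 3 ))"
}

# Example: a 1000 KB binary whose workload normally takes 120 seconds.
#   instrumented_size_kb 1000
#   training_time_range 120
```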
-bfinal Usage:
As mentioned above, a binary that may be optimized by the Binary
Optimizer must be prepared using the -xbinopt=prepare compiler flag.
This results in additional information being placed in an ELF section
in the binary. When creating a final binary that is to be deployed on
the run systems, and on which no future optimizations are planned, the
-bfinal option may be used to strip the -xbinopt=prepare information
from the resultant binary. This flag may be used to prevent users of
the binary from making any further modifications to it. For example:
% cc -xO4 -xbinopt=prepare -o app *.c
% binopt -binstrument -bdata=datafile -o app.instr app
% app.instr < input_data
% binopt -buse -bdata=datafile -bfinal -o app.opt app
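Because the -xbinopt=prepare information lives in the .annotate ELF section, one way to check whether a binary can still be re-optimized is to look for that section in a section-header listing. The sketch below, in POSIX sh, reads such a listing from standard input. The function name has_annotate is hypothetical, and the use of elfdump -c to produce the listing is an assumption about the tools available on the system.

```shell
#!/bin/sh
# Sketch: succeed if a section-header listing read from standard input
# mentions the .annotate section, fail otherwise.

has_annotate() {
    grep '\.annotate' >/dev/null
}

# Possible usage, assuming elfdump -c lists section headers:
#   if elfdump -c app.opt | has_annotate; then
#       echo "app.opt can still be re-optimized"
#   else
#       echo "app.opt has no .annotate section"
#   fi
```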
Handling Modules Not Built With -xbinopt=prepare
If the binary contains a combination of legacy code and newly created
code (built with -xbinopt=prepare), the Binary Optimizer may still be
gainfully employed. The Binary Optimizer optimizes only the code that
was built with the -xbinopt=prepare compiler option, leaving the legacy
code untouched.
Conflicts
The Binary Optimizer has some restrictions.
It will not optimize binaries built as follows:
- With the -xprofile=collect compiler option.
- With the -xlinkopt compiler option.
- With the -s compiler option or stripped using the strip(1) tool.
- Binopt will not optimize that part of the code compiled with the -xF compiler option.
- Binopt will not optimize the template code portion of a C++ application.
The Binary Optimizer also does not optimize those parts of the
executable code in a binary that were derived from assembly language
files. As mentioned earlier, code derived from object files compiled
without the -xbinopt=prepare flag is not optimized either. On the other
hand, the presence of assembly code or legacy object code in a binary
does not prevent binopt(1) from optimizing the remainder of the binary.
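The flag-related restrictions above lend themselves to a quick check of a build command line before a long build is started. The following POSIX sh sketch scans a list of compiler arguments for the conflicting flags named in this section. The function name check_build_flags and its message texts are hypothetical, and matching variants such as -xlinkopt=N or -xprofile=collect:name is an assumption about how those flags may be spelled.

```shell
#!/bin/sh
# Sketch: scan compiler arguments for flags that, per the list above,
# conflict with binopt. Returns 0 if the build looks usable, 1 otherwise.

check_build_flags() {
    prepared=no
    rc=0
    for flag in "$@"; do
        case $flag in
            -xbinopt=prepare)
                prepared=yes ;;
            -xprofile=collect*|-xlinkopt*|-s)
                echo "conflict: $flag prevents binopt from optimizing the binary"
                rc=1 ;;
            -xF*)
                echo "note: code compiled with $flag is left unoptimized" ;;
        esac
    done
    if [ "$prepared" = no ]; then
        echo "missing: -xbinopt=prepare"
        rc=1
    fi
    return $rc
}

# Example:
#   check_build_flags -xO4 -xbinopt=prepare -xlinkopt -o app main.c
```

The check only inspects the flags it is given; it cannot detect a binary that was stripped later with strip(1), or code that came from assembly language files.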
Future Updates
The Sun Studio 11 release includes the Binary Optimizer, binopt.
Updates in future releases may include new
functionality and more optimizations. Also, several of the restrictions
and conflicts will be addressed. Stay tuned!
Sheldon Lobo is a staff engineer in the SPARC compiler backend
team. He works primarily on developing Sun's object and binary file
optimization and analysis tools.
(Last updated December 1, 2005)