By Timothy Jacobson, Sun Microsystems, June 2007
For developers who need faster performance out of C, C++, or Fortran programs, Sun Studio compilers provide several efficient methods. Performance tuning has always been a difficult task requiring extensive
knowledge of the machine architecture and instructions. To make this process easier, the Sun Studio C, C++, and Fortran compilers provide easy-to-use performance flags.
By using performance flags, developers can quickly improve execution speed.
However, sometimes compiler flags alone do not result in optimum performance. For this reason,
Sun Studio compilers also allow inline assembly code to be placed in critical areas.
The inline code behaves similarly to a function or subroutine call, which enables cleaner,
more readable code and also enables variables to be directly accessed in the inline assembly code.
This paper provides a demonstration of how to measure the performance of a critical piece of code. An
example using a compiler flag and another example using inline assembly code are provided. The
results are compared to show the benefits and differences of each approach.
Introduction
For demonstration purposes, this paper uses an academic
program that generates the Mandelbrot set. The example Mandelbrot program is written in C.
First, the program is compiled with the Sun Studio compiler using no optimization flags, and the computation of all the pixel values is timed.
Then, optimization flags are used and the computation is timed again. Finally, example Sun Studio inline assembly
code is used, and the computation is timed once more and compared with the previous timings. The examples demonstrate two different
methods for improving performance with the Sun Studio compiler: using flags and using inline assembly code.
Example 1: The Mandelbrot Set Algorithm
The Mandelbrot program calculates unique values for a display that is 1000 pixels by 1000
pixels. Each pixel represents a position in the complex plane of the display. A
value from 0 to 255 is calculated by performing a series of multiplications and additions.
This iteration process is the heart of the Mandelbrot set algorithm, which is shown in Example 1.
Let c = a + bi, a coordinate in the complex plane
z_1 = c
z_2 = z_1^2 + c
z_3 = z_2^2 + c
z_4 = z_3^2 + c
z_5 = z_4^2 + c
...
z_(k+1) = z_k^2 + c
until |z_(k+1)|^2 >= 4.0 or k+1 = 255
The color values are placed in a two-dimensional array of integers that is also
1000 by 1000 elements in size. It is the calculation and placement of these values into
the array that is timed so that any latency caused by displaying the pixels can be avoided.
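The recurrence in Example 1 can be sketched directly with C99 complex arithmetic. Writing z = u + vi and c = a + bi, the update z^2 + c has real part u^2 - v^2 + a and imaginary part 2uv + b, which is the form the C loop in Example 2 uses. The function names mandel_complex and mandel_uv below are hypothetical, chosen only for this illustration, and the sketch assumes a compiler that supports <complex.h>.

```c
#include <complex.h>

/* Iterate z = z*z + c with C99 complex numbers (illustrative only). */
int mandel_complex(float a, float b, int max_iter)
{
    const float complex c = a + b * I;
    float complex z = 0.0f;
    int k = 0;

    /* |z|^2 = re^2 + im^2 is compared with 4.0, so no square root is needed. */
    while (crealf(z) * crealf(z) + cimagf(z) * cimagf(z) < 4.0f && k < max_iter) {
        z = z * z + c;
        k++;
    }
    return k;
}

/* The same iteration expanded into real part u and imaginary part v. */
int mandel_uv(float a, float b, int max_iter)
{
    float u = 0.0f, v = 0.0f, u2 = 0.0f, v2 = 0.0f;
    int k = 0;

    while (u2 + v2 < 4.0f && k < max_iter) {
        v = 2 * v * u + b;      /* imaginary part: 2uv + b (uses the old u) */
        u = u2 - v2 + a;        /* real part: u^2 - v^2 + a */
        u2 = u * u;
        v2 = v * v;
        k++;
    }
    return k;
}
```

Both functions count the same iterations for a given point; the expanded form avoids complex types and reuses the squares u2 and v2 for the escape test.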
Example 2: Mandelbrot Calculation in C
The Mandelbrot program is written in C (as shown in Example 2), but
similar results can be found using Sun Studio C++ and Fortran compilers.
start = gethrtime();
for(i = 0; i < disp_width; i++)
{
    for(j = 0; j < disp_height; j++)
    {
        x = ((float)i * scale_real) - 2;
        y = ((float)j * scale_imag) - 2;
        u = 0.0;
        v = 0.0;
        u2 = 0.0;
        v2 = 0.0;
        iter = 0;
        while ( u2 + v2 < 4.0 && iter < max_iter )
        {
            v = 2 * v * u + y;
            u = u2 - v2 + x;
            u2 = u*u;
            v2 = v*v;
            iter = iter + 1;
        }
        array[i][j] = iter;
    }
}
end = gethrtime();
printf("Time = %lld nsec\n", end - start );
In the Mandelbrot algorithm, the majority of the time is spent in the doubly nested
loop that calculates the values of the pixel colors, as shown in Example 2.
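The timing in Example 2 relies on gethrtime(), which is specific to Solaris. On other POSIX systems, a similar nanosecond reading can be sketched with clock_gettime(). The helper name now_nsec is hypothetical, and the sketch assumes the platform provides CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: monotonic time in nanoseconds, similar in spirit
   to gethrtime(). Assumes CLOCK_MONOTONIC is available. */
long long now_nsec(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

As with gethrtime(), the value is only meaningful as a difference: read it before and after the nested loops and subtract.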
Example 3: Compiling and Timing mandelbrot.c
To establish a baseline for timing, the program is compiled using the Sun Studio C compiler with no special
flags or optimizations, as shown in Example 3.
$ cc -xarch=amd64 -o Mandelbrot Mandelbrot.c -lX11
$ Mandelbrot
Time = 434277313 nsec
Example 4: Compiling and Timing mandelbrot.c With -fast
One of the easiest ways a developer can get faster performance is to use the -fast flag
in the Sun Studio compilers. The -fast flag is an umbrella flag that invokes a collection of flags
in the correct dependency order to achieve optimization. For further details on -fast,
see the compiler man page (that is, man cc, man CC, or man f90).
Example 4 shows how -fast is used as a compilation option.
$ cc -xarch=amd64 -fast -o Mandelbrot Mandelbrot.c -lX11
$ Mandelbrot
Time = 206874465 nsec
Wow, the -fast option has more than cut the time in half. The beautiful
thing about this is that it was so easy to use. However, it is widely believed
that for ultimate performance, a program should be written in assembly code. Writing
assembly code is not trivial for most people. To make it easier, Sun provides a clean,
inline assembly feature called .il, which looks like a function call in the code.
Example 5: Sun Studio Inline Assembly Declaration for C
For the next example, a .il descriptor named mandel_il is used. It is declared
just like any function would be, as shown in Example 5.
The naming convention for inline code is
name_il(arg1, arg2, ...). Example 5 passes four
arguments to the inline code and returns an integer to the main program.
The argument list tells the compiler how to arrange variables in the correct registers.
int mandel_il(float, float, float, int);
Example 6: Mandelbrot Calculation in C With Inline Assembly Code
The inline assembly code replaces the critical code in the while loop and
appears as a function call, as shown in Example 6. This makes the program more readable
because the assembly instructions are hidden in the inline file. With a real function call,
control would jump to code elsewhere in the program and back again, which adds latency.
With inline code, there is no call or return, so execution continues straight through the inserted instructions.
By placing only the variables needed in the argument list, the inline code knows which
registers hold those arguments. The arguments are passed to the inline code portion
in the same register order that would be found in a function call. Likewise, the
return value is placed in the register that would normally be used for the return
of a function. This allows inline assembly code to be consistent and reusable, similar to a macro.
scale_real = 4.0 / (float)disp_width;
scale_imag = 4.0 / (float)disp_height;
start = gethrtime();
for(i = 0; i < disp_width; i++)
{
    for(j = 0; j < disp_height; j++)
    {
        x = ((float)i * scale_real) - 2;
        y = ((float)j * scale_imag) - 2;
        array[i][j] = mandel_il(x, y, 4.0, max_iter);
    }
}
The actual inline code is stored in a separate file that has a .il ending.
For this example, a file called Mandelbrot.il is used. When a file.il entry
is included on the compile line, this indicates to the Sun Studio compiler that inline code
is used. The compiler then searches that .il file to find a section beginning with name_il.
Example 7: Typical Inline Assembly Code Template
The key structures of the inline code are shown in Example 7. Each inline portion
begins with the .inline keyword followed by the name used in the C code, a comma,
and finally the value 0. The last line is .end, which is a keyword
indicating the conclusion of the inline assembly code.
.inline name_il, 0
// inline code placed here
.end
For calculation of the Mandelbrot set, the inline code first
needs to read the four input values. These are held in registers.
Scalar floating-point parameters are passed in registers
%xmm0, %xmm1, %xmm2, and so on.
Scalar integer parameters are passed in %rdi,
%rsi, %rdx, %rcx, and so on.
In this example, the AMD64 Application Binary Interface (ABI)
is used to define the registers that are used.
Each architecture has an ABI that defines the register order for passing
parameters. Also, the ABI defines what register is used to pass a
parameter back to the calling routine. This example returns an integer
in the %rax register, according to the AMD64 ABI.
Example 8: Inline Assembly Code for the Iterative Mandelbrot Calculation
Knowing all these facts, the inline code can be written, as shown in Example 8.
.inline mandel_il,0
// x is stored in %xmm0
// y is stored in %xmm1
// 4.0 is stored in %xmm2
// max_iter is stored in %edi (the low 32 bits of %rdi)
// set registers to zero:
// u = %xmm3, v = %xmm4, u2 = %xmm5, v2 = %xmm6, scratch = %xmm7, iter = %eax
    xorps %xmm3, %xmm3
    xorps %xmm4, %xmm4
    xorps %xmm5, %xmm5
    xorps %xmm6, %xmm6
    xorps %xmm7, %xmm7
    xorq %rax, %rax
.loop:
// exit if u2 + v2 >= 4.0 (or if the comparison is unordered)
    movss %xmm5, %xmm7
    addss %xmm6, %xmm7
    ucomiss %xmm2, %xmm7
    jp .exit
    jae .exit
// v = 2 * v * u + y
    mulss %xmm3, %xmm4
    addss %xmm4, %xmm4
    addss %xmm1, %xmm4
// u = u2 - v2 + x
    movss %xmm5, %xmm3
    subss %xmm6, %xmm3
    addss %xmm0, %xmm3
// u2 = u * u
    movss %xmm3, %xmm5
    mulss %xmm3, %xmm5
// v2 = v * v
    movss %xmm4, %xmm6
    mulss %xmm4, %xmm6
// iter = iter + 1; repeat while iter < max_iter
    incl %eax
    cmpl %edi, %eax
    jl .loop
.exit:
// end of mandel_il; the iteration count is returned in %eax
.end
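When experimenting with an inline template, it can help to have a plain C function with the same argument list to check results against. The following is a minimal sketch that mirrors the control flow of Example 8: test first, update, then count. The name mandel_ref is hypothetical and is not part of the original program.

```c
/* Hypothetical C stand-in for the mandel_il template, for checking results. */
int mandel_ref(float x, float y, float limit, int max_iter)
{
    float u = 0.0f, v = 0.0f, u2 = 0.0f, v2 = 0.0f;
    int iter = 0;

    do {
        /* Mirrors ucomiss with jp/jae: leave when u2 + v2 >= limit
           or when the comparison is unordered (NaN). */
        if (!(u2 + v2 < limit))
            break;
        v = 2 * v * u + y;      /* uses the old u, as in the assembly */
        u = u2 - v2 + x;
        u2 = u * u;
        v2 = v * v;
        iter++;                 /* incl %eax */
    } while (iter < max_iter);  /* cmpl %edi, %eax; jl .loop */

    return iter;
}
```

Calling mandel_ref(x, y, 4.0, max_iter) in place of mandel_il(x, y, 4.0, max_iter) should fill the array with the same values for any positive max_iter, which gives a quick way to confirm that a hand-written template still computes the right answer.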
Example 9: Compiling and Timing for mandelbrot.c With Inline Assembly Code
To compile code with .il inline code, just add filename.il
to the compile line. The Sun Studio compiler then searches that file for
the matching .inline name_il entry and reads the inline code into the
program. Example 9 shows the compile line and timing result from the inline code.
$ cc -xarch=amd64 -o Mandelbrot2 Mandelbrot2.c Mandelbrot2.il -lX11
$ Mandelbrot2
Time = 242519854 nsec
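The timings reported in Examples 3, 4, and 9 are easier to compare as ratios. The small helper below, with the hypothetical name speedup, is one way to compute them.

```c
/* Ratio of two elapsed times; a value above 1.0 means "after" is faster. */
double speedup(long long before_nsec, long long after_nsec)
{
    return (double)before_nsec / (double)after_nsec;
}
```

With the numbers above, speedup(434277313LL, 206874465LL) is about 2.10 for -fast, and speedup(434277313LL, 242519854LL) is about 1.79 for the inline version. Comparing the two optimized runs, speedup(242519854LL, 206874465LL) is about 1.17 in favor of -fast.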
Conclusions
The timing for the inline assembly code also shows a significant improvement over the baseline.
However, it does not perform quite as well as the -fast method. There are
several reasons this might be the case. One is that the Mandelbrot set iterates very few times
for many points in the array. The inline
example has to do extra work to move values into registers before and
after. For cases where the inline code does not iterate, this might be an inefficient
use of time. The longer the iteration, the more beneficial inline code becomes.
Another reason is that -fast might be unrolling the loops that traverse the array.
This could be done in the inline code as well; however, the code would be much longer
and more complicated to write by hand. In the inline code, only one floating-point
number is stored in each register, but these registers can hold up to four
32-bit floating-point numbers. The -fast version might be using the multiple
data features of Streaming SIMD Extensions 2 (SSE2) better than the inline version. Finally, -fast might
be using prefetch commands for the data.
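To illustrate what using all four 32-bit lanes of a register could look like, here is a minimal sketch in C with SSE intrinsics that iterates four points at once. It is not what -fast generates and is not taken from the original program. The function name mandel4 is hypothetical, and the sketch assumes an x86 compiler that provides the <xmmintrin.h> header.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumed to be available */

/* Hypothetical helper: compute iteration counts for four points at once.
   Counts are kept as floats, which is exact for max_iter below 2^24. */
void mandel4(const float x[4], const float y[4], int max_iter, int out[4])
{
    const __m128 cx = _mm_loadu_ps(x);
    const __m128 cy = _mm_loadu_ps(y);
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 u = _mm_setzero_ps(), v = _mm_setzero_ps();
    __m128 u2 = _mm_setzero_ps(), v2 = _mm_setzero_ps();
    __m128 count = _mm_setzero_ps();
    __m128 alive = _mm_cmpeq_ps(u, u);  /* all-ones mask: every lane starts active */
    __m128 uv;
    float lanes[4];
    int k, i;

    for (k = 0; k < max_iter; k++) {
        /* A lane stays active only while u2 + v2 < 4.0; once it drops out,
           it never becomes active again. */
        alive = _mm_and_ps(alive, _mm_cmplt_ps(_mm_add_ps(u2, v2), four));
        if (_mm_movemask_ps(alive) == 0)
            break;                      /* all four points have escaped */
        uv = _mm_mul_ps(u, v);
        v = _mm_add_ps(_mm_add_ps(uv, uv), cy);   /* v = 2*v*u + y */
        u = _mm_add_ps(_mm_sub_ps(u2, v2), cx);   /* u = u2 - v2 + x */
        u2 = _mm_mul_ps(u, u);
        v2 = _mm_mul_ps(v, v);
        /* Add 1 to the count of active lanes only. */
        count = _mm_add_ps(count, _mm_and_ps(alive, one));
    }

    _mm_storeu_ps(lanes, count);
    for (i = 0; i < 4; i++)
        out[i] = (int)lanes[i];
}
```

The loop runs until the slowest of the four points escapes, so the benefit depends on neighboring points needing similar iteration counts. This is one reason a hand-written vector version is harder to get right than the scalar template.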
One might think that combining inline code and using the -fast flag
would make even greater improvements. This is not the case for this example.
There are errors in compiling because of multiple copies of labels, which is
due to the fact that -fast unrolls the loop where the inline code lies. Because the
inline code uses a label to jump to the top of the iteration loop (.loop)
and a label to jump to the end (.exit), the labels appear multiple times in the
unrolled loop. The compiler does not change or modify the inline code to
rename these labels. With some tinkering, it would be possible to match
the performance in -fast with inline assembly code. For the developer who has
unlimited time and wants the satisfaction of getting every bit of performance,
this might be the thing to do.
Since most developers want good performance with little time and effort,
this raises an important question. Why bother to write inline code
when the -fast flag performs better? The answer is that -fast
doesn't always perform better than inline code. The -fast flag makes
assumptions that work very well for an example such as this, but there
are cases where -fast does not help much. So for performance tuning,
trying -fast is a great place to start. If -fast shows significant improvement,
then it might not be worth all the effort to write inline assembly code.
Conversely, if -fast does not make significant improvements, exploring inline
assembly code is the better option. Sun Studio compilers provide the flexibility
to do either, which benefits developers.
References
For further information on performance tuning, docs.sun.com has the most current documentation on Sun Studio compilers. In the docs.sun.com Product categories, Sun Studio is under Software -> Application Development -> Development Tools.