AMD64 ABI Summary of Features
The AMD64 ABI provides the following enhancements over the 32-bit x86 ABI:
Where to Find the AMD64 ABI
http://www.x86-64.org/documentation
Implications for 64-bit Code on Linux Platforms
Our goal is object interoperability between Linux and Solaris systems for the 64-bit AMD Opteron instruction set over a useful range of programs. We are not yet at our goal, but we are working closely with AMD, Linux, and Solaris developers to produce a common Application Binary Interface (ABI). This document will likely result in changes to Linux, so you may need to upgrade to a newer version of Linux to get full object interoperability.
Note, however, that ABI compatibility has limitations when files appear in different places within the file system. Furthermore, the Solaris operating system is POSIX compliant and Linux is not. So, binary compatibility will only be effective if programmers code to the common subset of Linux and Solaris systems.
With the caveats given above, 64-bit code compiled in conformance to the AMD64 ABI can be linked together and run on either Linux or Solaris 10 x86 systems.
Size and Alignment of C Data Types - Differences Between 32-bit (ILP32) and 64-bit (LP64) x86 and SPARC
How Recompiling Affects Performance
The following items will tend to increase the speed of recompiled code:
The following items will tend to decrease the speed of recompiled code:
Pointers can reduce the speed of a 64-bit program because they are larger. If your application data is mostly pointers to other data, and you spend most of your execution time waiting on main memory, the increased size of pointers decreases the number of pointers that fit in the cache and is more likely to saturate the bandwidth to memory, thus reducing performance.
Varargs processing is relatively slow on 64-bit x86 because arguments are packed into registers, so the callee must track a fair amount of information to get the next parameter from the proper place. Normal non-varargs functions should be faster because of this approach, but the varargs functions themselves will be slower.
To perform stack walkback, the calling convention needs a lot of information about each function. Much of this information is stored in the executable as auxiliary information, separate from the actual code. The result is that object files are much larger, often as much as twice as large as they would be on 32-bit x86.
Pulling together all the information necessary to walk back up the stack means that C++ exception processing, Java exception processing, and POSIX thread cancellation may be slower for a 64-bit application than for a similar 32-bit application.
It is generally hard to predict whether a specific application will be faster or slower when recompiled. Your best bet is to measure the performance of both the 32-bit and the 64-bit x86 builds, and then choose the faster one.
Consequences of varargs Being Passed in Registers
The AMD64 ABI requires parameters to be passed in specific registers. If you pass a floating-point value to an integer hex printf specifier, it will not work unless the value is explicitly cast, because floating-point and integer arguments travel in different registers. Example (the macro reinterprets the storage of d as an unsigned long long so that its bits can be passed as an integer argument):
#define L(d) ((unsigned long long *) &d)[0]
Reusing the Frame Pointer Register
The AMD64 ABI permits the compiler to reuse the register that normally contains the frame pointer. The reason is that one extra register can sometimes make a significant difference in the speed of loops. Unfortunately, without the frame pointer available and in a consistent location, debugging and performance analysis tools cannot easily follow the chain of function calls. In particular, when the compiler reuses the frame pointer register, dtrace will not work.
Dtrace is a Solaris 10 OS application for whole-system performance analysis. It can help you identify the big problems in system performance. Because this facility is so important, Sun Studio compilers will not reuse the frame pointer by default.
For some applications, particularly benchmarks, the higher-level performance problems that dtrace will help you find have already been eliminated. In these circumstances, reusing the frame pointer register will provide an extra boost of speed. To make this boost more easily available, we reuse the frame pointer register by compiling with the
The AMD64 ABI specifies four memory address space models: kernel, small, medium, and large. The medium and large models were not finalized in the ABI during the development of the Sun Studio 10 release, so only the kernel and small models are implemented in the compilers, with the small model as the default. The medium and large models will appear in future releases.
The small model as defined in the AMD64 ABI covers an address range of about 2 gigabytes and provides the fastest data access. Note that this is smaller than the address space for 32-bit mode, which is about 4 gigabytes for absolute addressing. As a result, some programs compile and link in 32-bit mode but fail in 64-bit mode, as shown below.
Describing the Linker Problem
The linker may issue an error message under -xarch=amd64 for large data objects. For example:
% cc t.c -xarch=amd64
where t.c may be:
#pragma weak buf
This is not a compiler error, but rather a misunderstanding of the address space under AMD64 as stated in the ABI. Currently with -xarch=amd64, we have -xmodel=[small|kernel]. The compiler may generate 3 different 32-bit relocation types to handle static memory access:
If the above conditions of R_AMD64_32/32S are not satisfied, the linker will issue the "does not fit" relocation error. This linker check is required by the AMD64 ABI. So in a sense, the address space for static memory objects is only about 2 gigabytes using R_AMD64_32/32S, since only 31 bits are available. That is smaller than in x86 32-bit mode, which can access around 4 gigabytes.
In some cases, the compiler can optimize by using R_AMD64_PC32, which is the difference between the memory location to be accessed and the current code location. But again this is only 32 bits wide and might still be insufficient for programs with very large data objects.
Workaround Solution
Meanwhile, the user can work around a "does not fit" relocation error by:
1. Using the -Kpic option. This creates position-independent code, for which the compiler generates 64-bit memory references through register indirection via the Global Offset Table, using the R_AMD64_GOTPCREL relocation type. This works as long as the difference between the current code location and the location in the Global Offset Table for the corresponding data object fits in 32 bits.
2. Allocating all static data objects on the heap, then referencing the objects via pointer indirection.
Note that either workaround may cause a small performance degradation in memory access because of the extra indirection.