Introduction

In general, you should never need to use inline templates. It is normally possible to do all the coding in a high-level language, and the compiler does an excellent job of optimising it. However, in some cases you may know more about the target hardware or about the behaviour of the code than the compiler does, or you may want to do something that the compiler doesn't readily support. In these rare situations you will find inline templates helpful. The following are examples where inline templates are particularly useful:
To use inline templates, you place a regular function call in the source code and write an inline template with the matching name. At compile time the source file and the file containing the inline template are compiled together, and the compiler inserts the code from the inline template into the code generated from the source. The documentation for inlining using .il files can be found in the inline(1) man page (man inline). This paper is based on that information.

Figure 1 - inline man page
The inline man page is also available in HTML from the documentation index of man pages.

Compiling with Inline Templates

You compile inline templates by placing them on the same compile line as the file which uses them. The code is inlined by the code-generator stage of compilation.

Figure 2 - Compiling with an inline template file
The compile line in Figure 2 compiles prog.c and inlines the code from code.il at the appropriate points.

Layout of Code in Inline Templates

The inline template file can contain a number of inline templates. Each template starts with a declaration and ends with an end statement:

Figure 3 - Layout of an inline template
The identifier is the name of the template, and the argument_size is the size of the arguments in bytes (this is not required for the latest compiler versions). Multiple templates of the same name can be placed in the file, but the compiler will pick the first one. There is no need for a return instruction, since the template is inlined directly into your code without a call. Note that you must prototype the template in your high-level source code to ensure that the compiler assigns the correct types to all the parameters.

Figure 4 - Example of a prototype for an inline template
Figure 5 - Example of a template
Figure 4 shows the prototype as it might end up in code.h. Figure 5 shows the inline template code as it might end up in a separate code.il file. Inline templates are always in files with the suffix .il. In the following examples the prototype has been included in the same box as the inline template code to make the paper more readable; in practice they must go into different files.

Guidelines for Coding Inline Templates

The inline code can only use integer registers %o0 to %o5 and floating point registers %f0 to %f31 for temporary values; other registers should not be used. These registers are referred to as the 'caller-saved' registers. Calls can be made to other routines from the inline template, but these calls are subject to the same constraint. The compiler will handle most of the SPARC instruction set. If the template contains only instructions which the compiler normally generates, then it will be early inlined (see below), and the code will be scheduled optimally. If the template contains instructions that the compiler understands but does not typically generate (such as VIS instructions or atomics), then the code may be late inlined and consequently may not be optimally scheduled, resulting in a slight loss of performance.

Parameter Passing

Parameter passing follows the calling convention defined for the target architecture, so it is different for 32-bit and 64-bit code. It is described by the SPARC ABI, which can be referenced at http://www.sparc.org/standards.html: SCD 2.3 describes v8 (32-bit code) and SCD 2.4.1 describes v9 (64-bit code). On entering the template, arguments are passed in %o0-%o5 and then continue on the stack. For 32-bit code, the first stack argument is at [%sp+0x5c] and %sp is guaranteed to be 8-byte (64-bit) aligned; for 64-bit code it is at [%sp+0x8af] (note that %sp+2047 is aligned to a 16-byte boundary).

Figure 6 - Example of 32-bit parameter passing using the stack
The following is an example for 64-bit code. Note that when a 32-bit int is passed on the stack, the full 64 bits of the register are saved:

Figure 7 - Example of 64-bit parameter passing using the stack
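The two stack offsets quoted above follow from the frame layout in the SPARC ABI. The sketch below is not one of the original figures; it is a small C illustration, with hypothetical function names, of how the offset of any argument slot can be worked out. It assumes the usual layout: a 64-byte register save area plus a 4-byte structure-return word in v8, and a 2047-byte stack bias plus a 128-byte save area in v9.

```c
#include <assert.h>

/* Byte offset from %sp of argument slot `slot` (0-based) in 32-bit (v8) code.
 * Assumed layout: 64-byte register save area, then a 4-byte word reserved
 * for structure returns, then 4-byte argument slots. */
int v8_arg_offset(int slot)
{
    assert(slot >= 0);
    return 64 + 4 + 4 * slot;
}

/* Byte offset from %sp of argument slot `slot` (0-based) in 64-bit (v9) code.
 * Assumed layout: stack bias of 2047, then a 128-byte register save area,
 * then 8-byte argument slots. */
int v9_arg_offset(int slot)
{
    assert(slot >= 0);
    return 2047 + 128 + 8 * slot;
}
```

Slots 0 to 5 correspond to the arguments that arrive in %o0-%o5, so slot 6 is the first argument that really lives on the stack: v8_arg_offset(6) gives 0x5c and v9_arg_offset(6) gives 0x8af.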
For 32-bit code, floating point values are passed in the integer registers; for 64-bit code they are passed in the floating point registers.

Figure 8 - Example of 32-bit parameter passing by value
Figure 9 - Example of 64-bit floating point parameter passing
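For the 32-bit by-value case, a double arrives as two 32-bit halves in a pair of integer registers, with the most significant half in the lower-numbered register (SPARC is big-endian). The template has to store both halves to memory and load them back into a floating point register. The following is only a portable C sketch of that reassembly, with a hypothetical function name; it assumes IEEE 754 doubles and is not the template itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rebuild a double from the two 32-bit halves that a 32-bit caller would
 * place in an integer register pair: `hi` is the most significant word
 * (for example %o0), `lo` the least significant word (for example %o1). */
double double_from_halves(uint32_t hi, uint32_t lo)
{
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double value;

    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);  /* copy the bit pattern, no conversion */
    return value;
}
```

For instance, the halves 0x3FF00000 and 0x00000000 together form the bit pattern of 1.0.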
For values passed in memory, single precision floating point values and integers are guaranteed to be 4-byte aligned. Double precision floating point values will be 8-byte aligned if their offset in the parameters is a multiple of 8 bytes.

Integer return values are passed in %o0. Floating point return values are passed in %f0/%f1 (single precision values in %f0, double precision values in the register pair %f0,%f1).

For 32-bit code there are two ways of passing floating point parameters: by value and by reference. Either way, the compiler will do its best to optimise out the load and store instructions, and it is often more successful at doing this if the floating point parameters are passed by reference. Example of 32-bit by reference parameter passing:

Figure 10 - Example of 32-bit parameter passing by reference
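Seen from the C side, the two choices differ only in the prototype. The sketch below uses hypothetical names and gives each function a plain C body so that it can be compiled on its own; with a real template, only the prototypes would appear in the header and the bodies would live in the .il file.

```c
/* By value: in 32-bit code each double arrives in a pair of integer
 * registers, so a template has to move it to the floating point
 * registers through memory. */
double sum_by_value(double a, double b)
{
    return a + b;
}

/* By reference: a template receives the two addresses in %o0 and %o1
 * and can load the operands directly into floating point registers. */
double sum_by_reference(const double *a, const double *b)
{
    return *a + *b;
}
```

Both forms compute the same result; the by-reference form simply hands the template addresses instead of register pairs.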
Stack Space

Sometimes it is necessary to store variables to the stack in order to load them back later; this is the case for moving values between the integer and floating point registers. The best way of doing this is to use the space which is already set aside for the parameters passed into the function. For example, in the v8 code shown in Figure 8, the location %sp+0x48 is 8-byte aligned (%sp is 8-byte aligned), and it corresponds to the place where the 2nd and 3rd 4-byte integer parameters would be stored if they were passed on the stack (note that the first parameter would be stored at a non-8-byte boundary).

Branches and Calls

Branches and calls are supported. Every branch or call must be followed by a nop instruction to fill the branch delay slot. It is possible to put instructions in the delay slot of branches, which can be useful if you wish to use the processor support for annulled instructions, but doing so will cause the code to be late inlined (described below) and may result in sub-optimal performance. Call instructions must have an extra last argument which indicates the number of registers used to pass arguments in the call parameters. In general you should avoid inlining call instructions.

The destinations of branches must be indicated with a number, and the branch instructions should use this number to indicate the appropriate destination, together with an f for a forward branch or a b for a backward branch. Example:

Figure 11 - Example of using branches in an inline template
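As a separate illustration of the branch rules, and not a reproduction of Figure 11, the C function below is a reference version of what a small branching template might compute. The comment shows a hypothetical template for it; the name clamp_nonneg and the template text are assumptions made for this sketch and have not been compiled or run.

```c
/* Reference C version of a hypothetical inline template:
 *
 *   .inline clamp_nonneg,4
 *       cmp   %o0, 0      ! compare the argument with zero
 *       bge   1f          ! non-negative: branch forward to label 1
 *       nop               ! fill the branch delay slot
 *       clr   %o0         ! negative: the result becomes 0
 *   1:
 *   .end
 *
 * The argument arrives in %o0 and the integer result is left in %o0. */
int clamp_nonneg(int x)
{
    if (x >= 0) {
        return x;   /* corresponds to taking the forward branch */
    }
    return 0;       /* corresponds to falling through to clr */
}
```

The numeric label 1 is referenced as 1f because the destination lies after the branch; a loop that jumps back to an earlier label 1 would use 1b instead.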
Late and Early Inlining

Inlining of templates is done by the code generator part of the compiler. There are two opportunities for inlining: before and after optimisation. If the inline template is 'complex', then it will be inlined after optimisation (i.e. late inlined), which means that the code will appear more or less exactly as it is written in the template. If the code is inlined before optimisation (early inlining), then it will be merged with the other code around the call site. Early inlining will lead to better performance. Things that will cause late inlining are:
When the code is compiled with -g, the compiler commentary includes information on inlining. This information will tell you if a routine is late inlined; if there is no comment, then the routine has been early inlined. An example of this is attempting to inline the following (incorrect) template:

Figure 12 - Incorrect inline template
The template in Figure 12 is incorrect because the code uses the frame pointer (%fp) rather than the stack pointer (%sp). The compiler will still inline the code, but because of this error it cannot early inline it and has to late inline it instead.

Figure 13 - Compiling with -g to generate debug information
Figure 13 shows the compile line used to generate a 32-bit executable with debug information. Note that the debug information is stored in the .o files by default, so it is necessary to keep these files available.

Figure 14 - Using er_src to output compiler commentary
The utility er_src can be used to examine the compiler commentary for a particular file. It takes two parameters: the name of the executable and the name of the function which you wish to examine. In this case the template which cannot be early inlined is sum_val. Each time the compiler comes across the %fp register it inserts a debug message, so you can tell that there are six references to %fp in the template.

Decoding the Calling Convention

The calling convention for the architecture can be a bit tricky to master. The easiest way of dealing with this is to write a test function and see how it gets converted into assembly language.

Figure 15 - Examining the 32-bit calling convention
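Test functions of the kind described might look like the following. This is a sketch with hypothetical names rather than the original listing: one function takes four doubles, which in 32-bit code is more than fit in %o0-%o5, and the other takes seven integers, which is one more than fit in registers in either mode. Compiling the file to assembly (for example with the compiler's -S option) then shows where each parameter is fetched from.

```c
/* Four double parameters: every one is used, so the generated code has to
 * fetch each argument from wherever the calling convention placed it. */
double fp_probe(double a, double b, double c, double d)
{
    return a + b + c + d;
}

/* Seven integer parameters: the seventh cannot be passed in %o0-%o5. */
long int_probe(long a, long b, long c, long d, long e, long f, long g)
{
    return a + b + c + d + e + f + g;
}
```

Using every parameter in the return value keeps the compiler from discarding arguments, so each one shows up as a register read or a stack load in the generated assembly.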
In the 32-bit example (Figure 15) you can see that the first three fp parameters are passed in %o0-%o5, and that the fourth fp parameter is passed on the stack at locations %sp+92 and %sp+96. Note that this location is only 4-byte aligned, so it is not possible to use a single floating point load double instruction to load it. Example for 64-bit code:

Figure 16 - Examining the 64-bit calling convention
In the 64-bit example (Figure 16) you can see that the first action is to load the seventh integer parameter from the stack.

Other Examples of Templates

Templates are used in libm.il, the inline math library, and in vis.il, the Visual Instruction Set inline library. These two files can be found in /opt/SUNWspro/prod/lib/. They are linked in by the compiler when the flags -xlibmil (for the math templates) or -xvis (for the VIS templates) are specified. The include files which prototype the functions in the template libraries are math.h and vis.h.

Complete Source Code for 32-Bit Examples

Figure 17 - inline32.il file for 32-bit inline template examples
Figure 18 - driver32.c source file for 32-bit examples
Complete Source Code for 64-Bit Examples

Figure 19 - inline64.il template file for 64-bit template examples
Figure 20 - driver64.c source file for 64-bit examples
Running Examples

Figure 21 - Compile and run sequence for the examples
About the Author

Darryl Gove is a staff engineer in the Compiler Performance Engineering group at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.