Introduction
The VIS instruction set includes a number of instructions that can handle several items of data at the same time. These are called SIMD (Single Instruction Multiple Data) instructions. The VIS instructions work on data held in floating point registers. The floating point registers are 8 bytes in size, and the VIS instructions can operate on them as two 4-byte ints, four 2-byte shorts, or eight 1-byte chars, as shown in the following figure:
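The same partitioning can be pictured in plain C. The following is only an illustrative sketch: the type `packed8_t` and the helpers `pack_two_ints` and `lanes_for_width` are hypothetical names introduced here, and the code models the register layout rather than using any VIS instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type (not part of VIS): one 8-byte value viewed as
   packed lanes of different widths. */
typedef union {
    uint64_t whole;      /* the full 8-byte register image */
    uint32_t ints[2];    /* two 4-byte lanes */
    uint16_t shorts[4];  /* four 2-byte lanes */
    uint8_t  chars[8];   /* eight 1-byte lanes */
} packed8_t;

/* Number of lanes of a given width (in bytes) that fit in 8 bytes. */
unsigned lanes_for_width(unsigned lane_bytes)
{
    assert(lane_bytes == 1 || lane_bytes == 2 ||
           lane_bytes == 4 || lane_bytes == 8);
    return 8u / lane_bytes;
}

/* Place two 4-byte ints side by side in one 8-byte value. */
packed8_t pack_two_ints(uint32_t first, uint32_t second)
{
    packed8_t p;
    p.ints[0] = first;
    p.ints[1] = second;
    return p;
}
```

Which bytes of `chars` hold which part of each int depends on the byte order of the machine, so the sketch deliberately says nothing about that.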
The advantage of using VIS instructions is that an operation can be applied to different items of data in parallel, so it takes the same time to compute eight 1-byte results as it does to calculate one 8-byte result. In theory this means that code that uses VIS instructions can be many times faster than code without them. Further information on the VIS instruction set, including manuals and libraries, can be found at http://www.sun.com/processors/vis/ .
VIS performance
There can be a performance gain from using VIS instructions. However, determining the size of that gain is not straightforward, since the following factors come into play:
Compiling with VIS
Using VIS requires a target architecture of at least v8plusa or v9a. This can be achieved by compiling using the
There are two ways to get the compiler to generate VIS instructions:
Because VIS instructions are not directly generated by the compiler, the generated code may be suboptimal (the VIS instructions will typically be late-inlined, as discussed in the article on inline templates). Therefore, if performance is lower than expected, it is always worth checking the resulting assembly code to see whether it looks reasonable.
Example routine coded without VIS instructions
The example we will use is a simple search-type routine that works on integer (4-byte) data rather than characters. The routine scans an array for a particular value, and then reports the number of elements scanned.
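As a rough sketch of a routine of this kind (an illustration under assumptions, not necessarily the article's own listing), a scalar search over int data might look like the following. The name `search_int` and the convention of returning the array length when the value is absent are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar baseline: scan `array` for `target` and return the
   number of elements examined before the match. If the value is not
   present, the whole array was scanned and `length` is returned. */
size_t search_int(const int *array, size_t length, int target)
{
    assert(array != NULL || length == 0);

    size_t i = 0;
    while (i < length && array[i] != target) {
        i++;
    }
    return i;
}
```

Each iteration performs one load and one compare, which is the pattern the later versions try to speed up.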
Table 1 - Example search routine coded in C without VIS
We also define a test harness so that the performance of the existing code can be measured.
Table 2 - Test harness code
The timing loop is more complex than might appear necessary. However, the loop has the following characteristics:
There are some weaknesses in the test harness:
Building and running the example code
First we'll run the example code.
Table 3 - Performance with no optimisation
The code was compiled without optimisation, so performance will be poor. It is interesting to look at the hot loop in this light.
Table 4 - Unoptimised assembly code
It is readily apparent that this is very poor code. The loop index variable (held in
Table 5 - Performance of optimised code
So the performance improves by nearly a factor of two. The disassembly for the hot loop looks like the following:
Table 6 - Optimised assembly code
In this case the loop has been unrolled twice (two iterations are performed before the predicted-taken branch at the end of the loop), but not pipelined (the two iterations are not interleaved). The most significant gain is that the index variable is now held in a register and is no longer stored and reloaded every iteration. There are two loads in the block of code, but the block covers two iterations, so the code has the optimal number of loads.
However, the loop still does not contain prefetch instructions. Adding prefetch instructions enables the processor to start fetching data before it is needed. The data will then often be ready when the processor needs it, so the processor spends less time waiting for data to be returned from memory.
Manually adding prefetch
Using the header file
Table 7 - Source code with manual prefetch statement
Running this code with the manual prefetch statements gives the following results.
Table 8 - Performance of optimised code with prefetch
The prefetch statement prefetches the data 16 ints ahead of the current location within the array. This is 16*4 bytes = 64 bytes (each int takes four bytes of memory), or one cacheline ahead. Prefetch can be made more effective by allowing more time for the prefetch to complete before the load is issued. To demonstrate this, the offset can be changed from +16 to +64, which prefetches 64*4 = 256 bytes ahead, or four cachelines. The following result is obtained.
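The prefetch-ahead arithmetic can be sketched in portable C. This is a minimal illustration under stated assumptions: it uses the GCC/Clang builtin `__builtin_prefetch` rather than the Sun header discussed in this article, and the names `PREFETCH_AHEAD_INTS`, `prefetch_distance_bytes` and `search_int_prefetch` are hypothetical. On other compilers the prefetch is simply omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed prefetch distance in ints. With 4-byte ints, 64 ints is
   64*4 = 256 bytes, i.e. four 64-byte cachelines ahead. */
#define PREFETCH_AHEAD_INTS 64

/* Convert a distance in ints to a distance in bytes. */
size_t prefetch_distance_bytes(size_t ahead_ints)
{
    return ahead_ints * sizeof(int);
}

/* Scalar search that issues a prefetch hint on every iteration. */
size_t search_int_prefetch(const int *array, size_t length, int target)
{
    assert(array != NULL || length == 0);

    for (size_t i = 0; i < length; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Assumption: GCC/Clang builtin. The address is computed as an
           integer so that no out-of-range pointer arithmetic is done;
           the prefetch is only a hint and does not fault when the
           address lies beyond the end of the array. */
        uintptr_t ahead = (uintptr_t)array
            + prefetch_distance_bytes(i + PREFETCH_AHEAD_INTS);
        __builtin_prefetch((const void *)ahead, 0 /* read */, 0);
#endif
        if (array[i] == target) {
            return i;
        }
    }
    return length;
}
```

Issuing one hint per element is more than is strictly needed, since one hint per cacheline would cover the same data, but it keeps the sketch short.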
Table 9 - Performance with increased prefetch ahead distance
So by using optimisation and manually inserting prefetch it is possible to get a nearly five times gain in performance for this bit of code.
Including VIS instructions in the source code
One way of using VIS instructions is to include them in the source code of the application. This requires the use of the
The code can be modified to use VIS instructions. Since the comparison is of four-byte integers, two can be loaded and then compared with the target value at the same time. The macro
Table 10 - Using VIS instructions in C source code
The compile line for this code is:
Table 11 - Performance of C code containing VIS instructions
The VIS code is faster than the previous integer code. There are two main reasons for this. There is some performance gain from being able to compare two integer values at once, but the instructions that do so have longer latency, so there is not much to be gained from this (of course, if the code were working with eight bytes, or four shorts, then the VIS instructions would lead to greater performance gains). The other contributor to performance is that the floating point load instructions can load data from the on-chip prefetch cache, which reduces the time spent waiting for data from the off-chip caches.
It is worth discussing the code, which looks significantly more complex than the previous versions of the same code. The reasons for the complexity are as follows:
Using VIS and inline templates
In order to obtain the best possible performance from VIS, it is necessary to use inline templates to schedule the code. The following code implements essentially the same algorithm as the C source code, but the code layout is slightly tweaked to improve performance.
Table 12 - Inline template using VIS instructions
The VIS code starts by checking and correcting for arrays that are not eight-byte aligned. It then duplicates the integer value into both halves of the floating point value. The inner loop is then very similar to the VIS example coded in C. The following figure shows compiling and running this code. There is a slight performance gain over the VIS instructions used at source code level. The gain can be attributed to using non-faulting (speculative) load instructions to fetch the next value before the compare of the current value has completed.
The non-faulting load instruction is equivalent to a normal load, except in the case where the load would access unmapped memory (for example, off the end of an array). In this circumstance a normal load would cause a runtime error because the memory is not mapped, whereas a non-faulting load would not cause an error and would just return a zero value. The advantage of using this instruction is that the load can now be moved before the test of whether the end of the array has been reached, safe in the knowledge that no fault will occur if the load does happen to pass the end of the array.
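The structure just described (fix up alignment, duplicate the target into both halves, then compare two ints per step) can be sketched in portable C. This is an emulation using 64-bit integer operations, not VIS code, and it assumes 4-byte ints; the names `duplicate_int`, `either_half_equal` and `search_int_pairs` are hypothetical. It does not model non-faulting loads, so it never reads past the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Duplicate a 4-byte value into both halves of an 8-byte word. */
uint64_t duplicate_int(uint32_t value)
{
    return ((uint64_t)value << 32) | value;
}

/* Nonzero if either 4-byte half of `pair` equals the matching half of
   `dup`. XOR turns a matching half into all zero bits. */
int either_half_equal(uint64_t pair, uint64_t dup)
{
    uint64_t diff = pair ^ dup;
    return (uint32_t)diff == 0 || (uint32_t)(diff >> 32) == 0;
}

/* Search two ints at a time; returns the index of the first match,
   or `length` if the value is absent. */
size_t search_int_pairs(const int *array, size_t length, int target)
{
    assert(sizeof(int) == 4);            /* assumption of this sketch */
    assert(array != NULL || length == 0);

    size_t i = 0;

    /* Step 1: if the array does not start on an 8-byte boundary, handle
       the first element on its own so the pair loads are aligned. */
    if (length > 0 && ((uintptr_t)array % 8) != 0) {
        if (array[0] == target) {
            return 0;
        }
        i = 1;
    }

    /* Step 2: duplicate the target into both halves of an 8-byte word. */
    uint64_t dup = duplicate_int((uint32_t)target);

    /* Step 3: load two ints per iteration and test both at once. */
    for (; i + 1 < length; i += 2) {
        uint64_t pair;
        memcpy(&pair, &array[i], sizeof pair);
        if (either_half_equal(pair, dup)) {
            return (array[i] == target) ? i : i + 1;
        }
    }

    /* Step 4: at most one element is left over at the end. */
    if (i < length && array[i] == target) {
        return i;
    }
    return length;
}
```

Because the target occupies both halves of `dup`, the comparison gives the same answer whichever byte order the machine uses.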
Table 13 - Performance of inline template code
It is possible to further improve the performance of the hand-coded VIS routine. For example, the inner loop could be unrolled and pipelined so that two or more iterations are computed in parallel. Whilst that optimisation would improve performance, it would also make the code less clear, so it is not shown here.
Concluding remarks
This article has demonstrated a number of useful techniques for improving performance. To recap:
Full source code
Here is a link to the full source code
Compiling and running the program should produce results similar to those shown below.
Table 14 - Compile line and timing data for all routines
Darryl Gove is a senior staff engineer in Compiler Performance Engineering at Sun Microsystems Inc., analyzing and optimizing the performance of applications on current and future UltraSPARC systems. Darryl has an M.Sc. and Ph.D. in Operational Research from the University of Southampton in the UK. Before joining Sun, Darryl held various software architecture and development roles in the UK.