Traditional parallelization techniques are often limited in their usefulness by inherent serialism and data dependencies. The data flow approach provided by the Forte Application Scalability Tool (FAST) enables you to extract the theoretical maximum parallelism from a problem. This article demonstrates that while traditional parallelization methods on an LU factorization problem produce a performance boost of 2.7x on a 4-CPU system for a problem size of N=1000, scaling is not sustained: using 8 CPUs gives a speedup of only 3.7x. By using FAST data flow parallelism, scalability increased to 3.37x on 4-CPU systems, and 5.64x on 8-CPU systems. With FAST and some hand tuning, the Sun Performance Library team was able to achieve 23.8x with the LINPACK benchmark on a 24-CPU system. IntroductionStarting with an overview of the LU Factorization algorithm, the problems with using traditional methods for parallelizing this algorithm are discussed. The FAST simulator will be introduced which allows you to see these problems in the design stage. Next we will show how FAST overcomes these problems and can be used to generate highly parallel code that delivers more performance. Finally we will show how FAST allows the user to easily modify an algorithm as we present two simple modifications that increase the performance above the initial FAST implementation. This article highlights the advantages provided by FAST and therefore will not cover all the implementation details. The FAST data files and source code will be provided for those interested in exploring the implementation in depth. Algorithm DetailWe will use the LAPACK routine DGETRF as a baseline implementation.
It computes a factorization of the form: The main processing is performed by a loop which performs the three steps described below and shown in Figure 1.
The algorithm moves down and to the right one block and repeats the procedure. Figure 2 shows the regions that are processed in the second iteration. The processing continues down into the lower-right corner.
LU with Parallel KernelsThe majority of the computation in the LU Factorization is performed in the matrix multiply and the triangular solve. Over 90% of the computation from the LAPACK subroutine DGETRF, which performs the LU factorization on a double precision real matrix, can be attributed to the BLAS3 routines DGEMM(matrix multiply) and DTRSM(triangular solve). Both of these can be parallelized by writing simple drivers. Drivers for both these routines are contained in the file PARALLEL_LU1/dgetrf.90. Parallel BLAS is also available as part of Sun Performance Library. These parallel drivers are sufficient for a system with a small number of processors. For a 1000 X 1000 matrix the following percent scalability was observed. Perfect scalability, 100 percent, would indicate that a computation run in parallel on N processors runs N times faster than it does on one processor. Scalability is computed as (T_1 * N) /T_N where T_1 is the time for a one processor run, N is the number of processors, and T_N is the time for the run on N processors.
This decrease in scalability shows that as more processors are added they are used less efficiently. This is predicted by Amdahl's Law which can be stated as Total_time = Parallel_time / NCPUS + Serial_time
where the total time for a computation to complete on a given number of CPUs is a function of the amount of the computation that can be parallelized, Parallel_time, and the amount that must be done serially, Serial_time. As the number of processors increase the serial time comprises more of the total runtime, even if it is much smaller than the parallel time to begin with. This is exactly the case in the LU Factorization. Figure 4 shows the timeline display from the FAST profiler. The timeline shows what task is being performed at each time during a run. Each horizontal bars represents the processes happening in one thread of execution. The horizontal axis represents time. This timeline shows the first second of the run. The color of the bar represents what task is being performed at each point. For two threads, most of the work is spent in computation (represented by blue, green, and yellow segments of the bands). Very little time is spent idly, waiting for work (represented by red).
For the eight processor case, shown in Figure 5, red is a dominate color indicating that a large number of cycles are lost waiting for work. The waits that occur while a single thread performs the factorization now appear to be much larger. The duration of the waits appear larger because parallelization has decreased the duration of the parallel events. Also seven threads are sitting idle instead of one.
The timelines in Figure 4 and Figure 5 visually affirm the truth of Amdahl's Law, small serial regions will kill parallel performance on large number of processors. FAST SimulationsThe FAST Simulator enables the user to experiment with an algorithm to see if a change would benefit the overall run without going through the work of implementing it. It can also be used to simulate performance on large multi-processor systems that may not be available during development. Before the simulator can be used, a FAST project must be developed with the
FAST Editor. The file
Each task that is performed in the algorithm has a node in the Sun ONE Studio explorer tree with an associated set of properties. These properties are used as parameters for simulating a parallel run. This allows a user to see the effect of code modifications in the design stage without going through the effort of implementing them. Figure 7 shows the properties for the task that performs matrix multiply.
The scale property allows the user to multiply the execution time for a task by a factor to increase or decrease its serial run time. The shatter property allows the work being done in one node to be broken into a number of smaller pieces to be performed in parallel. For the LU, shattering the matrix multiply and the triangular solve allows the user to examine what the performance would be if execution was based on the dependency graph in Figure 8.
The simulation execution times produced by FAST for the LU accurately predict the results achieved with parallel BLAS3 drivers as shown in the following graph.
The simulator allows developers to see the results of parallelizing a section of code or speeding it up without actually doing the work. This leads to more efficient development as effort can be concentrated on worthwhile areas. Parallel FAST LUThe FAST project in Additionally a loop was added to do the trailing swaps. These were handled as part of the factorization step in the serial implementation. Adding a second loop causes these swaps to execute after the main loop has completed. Removing the computation from the main loop simplifies the data dependencies for the main computation. To change the step to a parallel step the user must decide how to decompose the work that is being done. The matrix multiply step from the serial project has the three data regions shown in Figure 10 on the left of the arrow.
To decompose this computation so that each column can be computed independently the regions need to be deleted and redrawn as shown to the right of the arrow in Figure 10. Then movements are define for the new small regions to move them across the entire area that needs to be processed. In FAST, movements indicate parallelism. A similar modification is made to the triangular solve step so that each column is computed independently. The data dependencies for the LU can then be expressed with the graph below.
This graph shows that the data dependencies allow for more potential parallelism than used in the traditional approach. The yellow nodes represent the column factorizations that are the only serial task. The data dependencies indicate that this task can run concurrently with other tasks. From the second factorization node in the graph, there is only a data dependency on one of the preceding matrix multiply nodes shown as blue. Only one matrix multiply and one triangular solve, shown as green, are required to be completed before the second factorization node can start executing. The rest of the matrix multiply and triangular solve nodes represent work that can potentially overlap the serial factorization. By taking advantage of this the scalability is improved as shown below.
Using the FAST profiler timeline view, it can be seen that this is exactly the behavior for a run created from this project. The timeline shown in Figure 13 demonstrates FAST parallelization. The previously serial factorizations are running in parallel with matrix multiply and triangular solve computations. This is seen at several places in the display where a vertical line through the display would intersect one band colored yellow and other bands that are blue and green.
Figure 13 shows that though the initial FAST implementation allowed some of
the factorization stage to run in parallel with other computation there is still
a lot of time wasted by threads waiting for work. As the run proceeds, the amount
of parallel work decreases so that there is none left to overlap with the factorizations
and stalls result. If the factorizations could be started sooner so that they
would run while there was more work to overlap with, then the scalability would
increase. FAST allows for a priority to be set for each step that is performed
in an algorithm. The file Using these priorities FAST produced another increase in scalability as shown in Figure 14.
The timeline in Figure 15 shows how the improvement was achieved. This run manages to get more of the serial factorizations done while there is parallel work to overlap with. From the previous version the factorizations (yellow areas) have been executed sooner. In the timeline display they have slid to the left and are closer together.
Critical Path AnalysisIf we could further compress the factorization steps we could increase scalability
more. As we've seen, for each iteration, before the factorization can start, only
one matrix multiply from the previous iteration must complete. The multiply on
this column is the only multiply that lies on the critical path. We can get this
multiply done sooner by parallelizing it further into square blocks so that each
block row in the column is computed independently. The file
The four nodes highlighted in the graph perform the matrix multiply for one column. Each acts on a block of rows within that column. These are super-parallel in the sense that they are more parallel than the other nodes that perform the matrix multiply on columns. Using this approach, a further increase in scalability was achieved as reflected in the graph below (see Figure 17).
As shown in the timeline in Figure 18, the yellow factorization bands have been pushed even further left. This indicates that the factorizations are starting earlier, allowing more of them to run concurrently with parallel work. This results in fewer stalls at the end of the run.
ConclusionUsing FAST the scalability of this algorithm was greatly increased from under 50% to over 70% on 8 processors. FAST allows for parallelization based on data dependencies which offers a way to overcome Amdahl's Law by allowing serial tasks to run concurrently with other tasks. Amdahl's Law limits the number of processors most real algorithms can run on efficiently to a small number. FAST allows users to run algorithms more efficiently on parallel system with minimal coding effort. With additional effort FAST can allow algorithms to be tuned to run efficiently on larger hardware. Figure 19 shows the run-times for a 2000 X 2000 LU on 16 CPUs using algorithms discussed in this paper. FAST allows the user to overcome Amdahl's Law and approach perfect even in scaling algorithms that are not embarrassingly parallel, such as the LU Factorization.
Related Information
About the AuthorsBrad Lewis is a Software Engineer in the Sun ONE Tools division of Sun
Microsystems. He has worked in the Sun Performance Library Group for the last
7 years, optimizing and parallelizing linear algebra routines. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Oracle is reviewing the Sun product roadmap and will provide guidance to customers in accordance with Oracle's standard product communication policies. Any resulting features and timing of release of such features as determined by Oracle's review of roadmaps, are at the sole discretion of Oracle. All product roadmap information, whether communicated by Sun Microsystems or by Oracle, does not represent a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. It is intended for information purposes only, and may not be incorporated into any contract.
|
| ||||||||||||