
[IEEE 2013 5th International Conference on Computer Science and Information Technology (CSIT) - Amman, Jordan, 27-28 March 2013]

[Fig. 1. Left and right looking factorization: (a) left-looking, with columns 1 to k-1 computed and the k-th column being factorized; (b) right-looking, with the computed parts of L and U bordering the not-yet-computed submatrix of A.]

Hardware Implementation of LU Decomposition Using Dataflow Architecture on FPGA

Mahmoud Eljammaly, Yasser Hanafy, Abdelmoniem Wahdan, Amr Bayoumi Arab Academy for Science and Technology and Maritime Transport, Cairo, Egypt.

Abstract—Recent FPGA technology advances have permitted the hardware implementation of selected software functions to enhance program performance. Most of the work done has been concerned only with integer operations; little effort has addressed floating-point operations. In this paper we propose a dataflow implementation of LU decomposition on FPGA. A modified Kernighan-Lin based task partitioning and assignment algorithm is also presented. The algorithm shows acceptable improvement over existing techniques.

Keywords—LU factorization; tagged-token dataflow architecture; FPGA; Kernighan-Lin algorithm; parallel processing.

I. INTRODUCTION

Usage of matrices is significant across science and engineering fields. It has become necessary to find fast and exact solutions for matrices with thousands or millions of elements in problems such as circuit simulation, load-flow calculation of power systems, and finite element analysis. In such cases, solving these problems on a general-purpose processor may face limitations in speed, memory, and accuracy. This motivates the use of specific hardware architectures to solve such complex problems.

The lower-upper (LU) factorization algorithm [1] is implemented on a parallel dataflow architecture capable of performing double-precision floating-point operations. This is done by extracting a computation graph from the LU algorithm, partitioning it, and then assigning the partitions to processing elements (PEs) connected by an interconnection network. Two methods of partitioning are tested: automatic partitioning using a graph partitioning tool, and user-defined partitioning using parallel code. This paper shows that user-defined partitioning exploits the parallelism available in the algorithm better than a partitioning tool does. A modified Kernighan-Lin algorithm is also proposed as an assignment algorithm for the graph partitions. The Kernighan-Lin algorithm [2] is a partitioning algorithm, so to make it work for partition assignment it is modified to take into account the effect of already-assigned partitions. A comparison between random assignment and the modified Kernighan-Lin assignment is carried out, showing that the modified Kernighan-Lin assignment reduces the total traffic on the interconnection network.

II. BACKGROUND

A. LU Factorization

To solve the linear system Ax = b, the matrix A is factorized into a lower triangular matrix L and an upper triangular matrix U. Using these two matrices, the unknown vector x is calculated via the forward reduction equation Ly = b followed by the backward substitution equation Ux = y.
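As an illustration of the two triangular solves, a minimal Python sketch can be given; the matrix and vector values below are hypothetical examples, not from the paper's hardware implementation:

```python
# Minimal sketch of forward reduction (Ly = b) and backward
# substitution (Ux = y); illustrative only, not the hardware design.

def forward_reduction(L, b):
    n = len(b)
    y = [0.0] * n
    for i in range(n):                 # solve top-down: y[i] depends on y[0..i-1]
        s = sum(L[i][j] * y[j] for j in range(i))
        y[i] = (b[i] - s) / L[i][i]
    return y

def backward_substitution(U, y):
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):       # solve bottom-up: x[i] depends on x[i+1..n-1]
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / U[i][i]
    return x

# Hypothetical example: A = L*U with L unit lower triangular
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 3.0]]
b = [6.0, 9.0]
y = forward_reduction(L, b)
x = backward_substitution(U, y)
print(x)   # solves Ax = b for A = [[4, 2], [2, 4]]
```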

The factorization is either left-looking or right-looking; see Fig. 1. In the left-looking algorithm, the matrix is factorized column by column from left to right, using the previously computed columns on the left. The right-looking algorithm, on the other hand, factorizes the matrix from top-left to bottom-right by computing a column of L and a row of U and then updating the trailing submatrix. The LU algorithm used in this paper is a right-looking algorithm; Listing 1 illustrates it. LU factorization is intended for dense matrices, while other algorithms, such as KLU (K: Clark Kent) [3] and SuperLU [4], are intended for sparse matrices.

B. Previous Work

Different field-programmable gate array (FPGA)-based matrix decomposition solutions have been presented. In [5], a solution is proposed for LU decomposition of complex double-precision numbers on a heterogeneous system composed of a host microprocessor and multiple FPGAs. A block-partitioned LU factorization algorithm is adopted to factorize dense matrices. Unlike the solution presented in this paper, part

2013 5th International Conference on Computer Science and Information Technology (CSIT) ISBN: 978-1-4673-5825-5

298 978-1-4673-5825-5/13/$31.00 ©2013 IEEE


for k = 1 to n do
    for i = k+1 to n do
        l(i,k) = a(i,k) / a(k,k)
        u(k,i) = a(k,i)
    end do
    for i = k+1 to n do
        for j = k+1 to n do
            a(i,j) = a(i,j) - l(i,k) * u(k,j)
        end do
    end do
end do

Listing 1: LU factorization
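A runnable Python rendition of Listing 1 might look like the following; note that the diagonal handling for L and U, which Listing 1 leaves implicit, is added here, and no pivoting is performed:

```python
# Python sketch of Listing 1 (right-looking LU, no pivoting).
# Illustrative only; the paper maps this loop nest to dataflow hardware.

def lu_right_looking(A):
    n = len(A)
    a = [row[:] for row in A]          # working copy, updated in place
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = 1.0                  # unit diagonal of L (implicit in Listing 1)
        U[k][k] = a[k][k]              # pivot goes to U (implicit in Listing 1)
        for i in range(k + 1, n):      # column k of L and row k of U
            L[i][k] = a[i][k] / a[k][k]
            U[k][i] = a[k][i]
        for i in range(k + 1, n):      # rank-1 update of the trailing submatrix
            for j in range(k + 1, n):
                a[i][j] -= L[i][k] * U[k][j]
    return L, U

L, U = lu_right_looking([[4.0, 2.0], [2.0, 4.0]])
print(L)   # unit lower triangular factor
print(U)   # upper triangular factor
```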

of the factorization process is done on the host microprocessor in each iteration, and the result is distributed to the computing FPGAs. The host microprocessor also needs to reconfigure the FPGAs multiple times per factorization, which leads to a huge amount of communication between the host microprocessor and the FPGAs.

In [6], a parallel LU factorization targeting electric power matrices is proposed. The solution is based on parallel Block-Diagonal-Bordered (BDB) sparse linear solvers. It uses Altera's Nios soft IP processor as the processing element, integrated with a single-precision floating-point unit; the PEs and other peripherals communicate over the multi-mastering Altera Avalon bus. That solution targets BDB factorization in single precision using a control-flow architecture, while the solution proposed in this paper uses double precision with a dataflow architecture.

This paper builds strongly on the work done in [7][8], a parallel dataflow architecture implemented to factor and solve circuit matrices based on the KLU algorithm. This paper differs in implementing the LU factorization algorithm instead of KLU and in using a user-defined partitioning technique based on parallel code. A modified Kernighan-Lin assignment is also used to reduce traffic on the network.

III. PARALLEL LU ON FPGA

A. Parallelism Potential

In each iteration of the LU factorization algorithm two steps are performed: computing a column of L and a row of U, and then updating the submatrix using the resulting L column and U row. Listing 1 shows that computing the L column and the U row offers high parallelism potential, since the computations are highly independent. Updating the submatrix is also a highly parallel task and can be computed using multiple processing elements. These sources of parallelism can be exploited by extracting a computation graph from the algorithm, partitioning it, and assigning each partition to a processing element on the FPGA. Using a parallel version of the algorithm also exploits the available parallelism better, as this paper shows.

B. Hardware Architecture

The proposed system consists of a number of processing elements (PEs) that can perform double-precision floating-point calculations, connected by an interconnection network, and an on-chip memory that holds the computation graph. The graph is partitioned by the number of processing elements, so each processing element handles a group of the graph nodes.

The architecture used is based on the tagged-token dataflow architecture [9]. In this architecture there is no program counter; instead, an instruction executes as soon as its input data is available. Each node in the computation graph represents an instruction, and a node is ready when all of its inputs are available. This is called the firing rule. This architecture exploits available parallelism that is hard to exploit with a control-flow architecture.
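The firing rule can be illustrated with a small software sketch; the graph, node names, and scheduling loop below are invented for exposition only (the actual hardware is token-driven, not a polling loop):

```python
# Tiny tagged-token style sketch: a node fires as soon as all of its
# input operands have arrived. Hypothetical illustration of the firing
# rule, not the paper's hardware controller.
import operator

# graph: node -> (operation, list of input names); invented example
graph = {
    "t":   (operator.mul, ["a", "b"]),
    "out": (operator.sub, ["t", "c"]),
}

def run(inputs):
    values = dict(inputs)              # tokens that have arrived so far
    fired = set()
    progress = True
    while progress:
        progress = False
        for node, (op, ins) in graph.items():
            # firing rule: a node is ready when every input token is available
            if node not in fired and all(i in values for i in ins):
                values[node] = op(values[ins[0]], values[ins[1]])
                fired.add(node)
                progress = True
    return values

v = run({"a": 3.0, "b": 2.0, "c": 1.0})   # out = 3*2 - 1
print(v["out"])
```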

The processing elements communicate using messages routed through a packet-switched interconnection network. Each processing element is connected to a switch on a mesh network, as shown in Fig. 2. Messages are sent as multiple packets; each message contains a floating-point number and a destination address. The switches route messages using the Dimension-Ordered Routing (DOR) algorithm, the most widely used routing algorithm in on-chip networks owing to its simple implementation and deadlock freedom [10][11]. In DOR, a message travels from source to destination one dimension at a time: it moves along the X-dimension until it reaches its X destination, then along the Y-dimension to its final destination. Switches are built from a combination of split and merge units [12], grouped to form bidirectional ports.
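The X-then-Y traversal can be sketched as follows; the coordinate representation and step logic are assumptions for illustration, not the paper's switch hardware:

```python
# Sketch of X-then-Y dimension-ordered routing on a mesh: the hop
# sequence a message follows from source to destination switch.
# Illustrative only; coordinates are (x, y) grid positions.

def dor_route(src, dst):
    """Return the list of (x, y) switches visited, X dimension first."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # travel the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(dor_route((0, 0), (2, 1)))
```

Because every message between the same pair of switches takes the same fixed path, no cyclic channel dependencies arise, which is why DOR is deadlock-free.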

A processing element consists of two main components, a floating-point unit (FPU) and a dataflow controller, along with buffers for incoming and outgoing messages. The floating-point unit can perform subtraction, multiplication, and division on double-precision numbers, as shown in Fig. 3. The dataflow controller manages the processing element: it fetches messages from the input buffer, manages the FPU, manages reads and writes to the dedicated on-chip graph memories, and creates messages and sends them to the output buffers.

[Fig. 2. Mesh network of processing elements (S: Switch, PE: Processing Element): a 3x3 grid of PEs, each connected to its own switch.]


Bisect S into A and B                 // |A| = n, |B| = n
do
    for i = 1 to 2n do
        compute I(i)                  // internal cost
        compute E(i)                  // external cost
        D(i) = E(i) - I(i)
    end do
    for i = 1 to n do
        find unlocked a in A and b in B such that
            g(a,b) = D(a) + D(b) - 2C(a,b) is max
        lock and swap a and b
        update D such that
            Dnew(x) = Dold(x) + 2C(x,a) - 2C(x,b), if x in A
            Dnew(x) = Dold(x) + 2C(x,b) - 2C(x,a), if x in B
    end do
    find k such that g_max = g(1) + ... + g(k) is max
    if g_max > 0 then
        swap a(1)...a(k) with b(1)...b(k)
until g_max <= 0
if n > 1 then
    call modified K-L for A and B

Listing 2: Modified Kernighan-Lin Algorithm

[Fig. 3. FPU block diagram: Add, Mul, and Div units taking operands A and B, with a decoder and MUX driven by a select signal, and separate data and control paths.]

The computation graph that holds the operations required to factorize and solve a matrix is partitioned and stored in on-chip memory units. Each partition is stored in two on-chip memories: one for nodes, which contain the instructions, and one for edges, which store the addresses the outputs will be sent to.

The design also includes testing units connected to each PE's node memories and to additional memories containing the expected results from the PEs. The testing units indicate whether the results are correct.

C. Modified Kernighan-Lin

To minimize traffic on the network, a modified Kernighan-Lin algorithm is implemented to assign the graph partitions. The goal is to place partitions that communicate heavily with each other near each other. The Kernighan-Lin (K-L) algorithm [2] uses a heuristic technique to partition a set of graph nodes into two groups while maintaining a minimum number of edges/links between the two groups (minimum edge-cut).

To do so, the algorithm first finds the cost reduction of moving each node, which is the number of the node's external links minus the number of its internal links. It then computes the gain of swapping each node in the first group with each node in the second group: the gain of swapping two nodes equals the sum of their cost reductions minus twice the number of links between them. The pair with the highest gain is swapped and locked, so it is not considered in subsequent calculations. The cost reductions of the unlocked nodes are then updated to reflect the effect of the swap. Finding the highest gain, swapping the pair, and locking it is repeated until all nodes are locked. Finally, to find the set of swaps that yields minimum communication, the largest partial sum of the gains is computed; if it is greater than zero, there is a gain from swapping, and the corresponding pair or pairs are swapped.
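One greedy step of the K-L pass just described can be sketched as follows; the partition names and communication weights are invented for illustration:

```python
# One greedy step of the Kernighan-Lin pass described above:
# D(v) = external - internal edge weight, g(a,b) = D(a) + D(b) - 2*C(a,b).
# Hypothetical minimal sketch, not the full recursive algorithm.

def cost_reduction(v, group, other, C):
    """D(v): external minus internal edge weight for node v of `group`."""
    external = sum(C.get((v, u), 0) + C.get((u, v), 0) for u in other)
    internal = sum(C.get((v, u), 0) + C.get((u, v), 0) for u in group if u != v)
    return external - internal

def best_swap(A, B, C):
    """Return (gain, a, b) for the pair with the highest swap gain."""
    D = {v: cost_reduction(v, A, B, C) for v in A}
    D.update({v: cost_reduction(v, B, A, C) for v in B})
    pairs = ((D[a] + D[b] - 2 * (C.get((a, b), 0) + C.get((b, a), 0)), a, b)
             for a in A for b in B)
    return max(pairs)

# Invented example: 4 partitions, C maps a pair to its communication weight
A, B = ["p0", "p1"], ["p2", "p3"]
C = {("p0", "p2"): 5, ("p1", "p3"): 1, ("p0", "p1"): 4}
print(best_swap(A, B, C))
```

After the best pair is swapped and locked, the D values of the remaining nodes would be updated and the step repeated, exactly as in Listing 2.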

In the modified K-L algorithm, the partitions are treated as the nodes and the amount of communication between them as the edges/links. The K-L algorithm is applied recursively: first the set of partitions is divided into two groups, up and down, then each group is divided into left and right, and so on until each group holds only one partition; in this way the partitions are assigned. In addition, the original K-L algorithm does not consider already-assigned partitions and their effect on the gain of swapping nodes (here, partitions). Therefore, when computing the cost reduction of a node, an assigned node neighboring the group the node belongs to is treated as an internal node, while an assigned node neighboring the other group is treated as an external node. See Listing 2.

D. Graph Partitioning

The graph extracted from the LU factorization is partitioned using two methods: the METIS graph partitioning software [13], and user-defined partitioning based on parallel code. METIS provides a range of partitioning algorithms and schemes, such as multilevel recursive bisection and multilevel k-way partitioning. The multilevel k-way partitioning scheme, which minimizes edge-cuts, is used to partition the graph.

In the user-defined partitioning, the computation of the L column is partitioned over the processing elements, as is the computation of the submatrix update, such that each processing element holds the part of the L column it needs to compute its part of the submatrix. This approach takes into consideration the dependencies between the operations, which in this case are the nodes.
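As a hypothetical illustration of this idea, a simple round-robin distribution of the rows below the pivot could look like the following; the row-wise mapping is an invented example, not the exact distribution used in the paper:

```python
# Hypothetical sketch of the user-defined partitioning idea: the rows
# below the pivot are distributed over the PEs, so each PE owns both
# its part of the L column and the submatrix rows that consume it.

def assign_rows(n, k, num_pes):
    """Map rows k+1..n-1 of iteration k to PEs, round-robin."""
    owner = {}
    for idx, i in enumerate(range(k + 1, n)):
        owner[i] = idx % num_pes
    return owner

# Iteration k=0 of an 8x8 matrix on 4 PEs: each PE computes l(i,0) for
# its rows and later updates the same rows of the submatrix, so the
# L-column entries it needs never cross the network.
print(assign_rows(8, 0, 4))
```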

IV. HARDWARE SETUP

The proposed architecture is implemented and tested on a Xilinx Virtex-6 XC6VLX75T FPGA with an interconnection network bit-width of 16 bits, for 4 and 9 processing elements. The network routes messages sent as 6 packets, forming a message


of size 96 bits. The double-precision floating-point units are implemented using Xilinx floating-point cores [14], which are available in the Xilinx CORE Generator. Xilinx ISE 12.1 is used to implement and test the designs. The computation time for each PE is computed by counting the clock cycles elapsed while its graph partition is being processed. This is done internally on each PE and can be observed in real time using the Xilinx ChipScope Pro debugging tool.

V. RESULTS

To test and evaluate the design, computation graphs were extracted from randomly generated matrices and partitioned into 4 and 9 partitions. The matrices were then tested on two versions of the design: a 4-PE network and a 9-PE network. To show the impact of the modified K-L algorithm, larger matrices were also partitioned onto different grid sizes, since 4 (2x2) and 9 (3x3) processing elements form small grids.

Tables I and II show the number of nodes in each partition and the time spent by each processing element computing its assigned partition for the two methods (METIS partitioning and user-defined parallel-code partitioning), on 4 PEs and 9 PEs respectively. Table III compares random assignment against the modified K-L assignment in terms of total traffic on the network (number of hops).

The results in Tables I and II show that the user-defined partitioning provides a faster solution than the METIS-partitioned one. While METIS provides an equal distribution of nodes, as shown in Tables I and II, its processing elements do not take similar times to finish, since the partitioning did not take into consideration the dependencies among the nodes. The user-defined partitioning, on the other hand, provides better load balancing, so the processing elements finish at nearly the same time and, in total, the solution is around 1.5x faster. Table III also shows that the modified K-L assignment reduced the total traffic on the network by up to 37%, which in turn speeds up the system. This follows from taking into account the amount of communication between partitions when assigning them, placing the pairs of partitions with the most communication beside each other.

VI. CONCLUSION

In this paper a dataflow implementation of the LU factorization is proposed, and a modified Kernighan-Lin algorithm is presented for partition assignment. Two methods of partitioning the LU factorization graph are also tested: METIS partitioning and user-defined parallel-code partitioning. The results show that the modified Kernighan-Lin assignment can speed up the system by reducing network traffic by up to 37% compared with random assignment, and that user-defined partitioning based on parallel code better exploits the available parallelism, providing a solution around 1.5x faster than an automatic partitioning tool such as METIS.

TABLE I. PARTITION SIZE AND RUNTIME ON 4 PEs

        Automatically Partitioned        Partitioned by User-Defined
        by METIS                         Parallel Code
        Nodes      Time (clocks)         Nodes      Time (clocks)
PE1     6404       108,142               8188       123,111
PE2     6597       104,467               8188       122,980
PE3     6685       178,524               6618       122,876
PE4     6550       104,062               3242       123,170

TABLE II. PARTITION SIZE AND RUNTIME ON 9 PEs

        Automatically Partitioned        Partitioned by User-Defined
        by METIS                         Parallel Code
        Nodes      Time (clocks)         Nodes      Time (clocks)
PE1     2902       60,552                3068       85,855
PE2     2895       88,844                3068       85,711
PE3     2965       88,206                3068       85,244
PE4     2880       59,933                3068       84,759
PE5     2896       63,566                3068       84,016
PE6     2866       68,516                3068       83,127
PE7     2889       74,827                3068       83,566
PE8     2996       117,316               3068       85,411
PE9     2947       97,304                1692       85,930


TABLE III. AMOUNT OF TRAFFIC ON THE NETWORK IN RANDOM ASSIGNMENT AND MODIFIED K-L ASSIGNMENT

         Automatically Partitioned by METIS        Partitioned by User-Defined Parallel Code
         Random    Modified K-L    Traffic         Random    Modified K-L    Traffic
                                   Reduced                                   Reduced
35x35    12602     8689            31%             101725    81647           19%
40x40    16590     12083           27%             136216    110631          18%
45x45    21989     13668           37%             195667    153902          21%
50x50    26993     18252           32%             268374    222031          17%

REFERENCES

[1] D. Poole, Linear Algebra: A Modern Introduction, 3rd ed., 2010.
[2] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Systems Technical Journal, vol. 49, pp. 291-307, 1970.
[3] E. Natarajan, "KLU - A high performance sparse linear solver for circuit simulation problems," Master's thesis, University of Florida, Gainesville, 2005.
[4] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, "A supernodal approach to sparse partial pivoting," SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720-755, 1999.
[5] T. Hauser, A. Dasu, A. Sudarsanam, and S. Young, "Performance of a LU decomposition on a multi-FPGA system compared to a low power commodity microprocessor system," Scalable Computing: Practice and Experience, vol. 8, no. 4, pp. 373-385, 2007.
[6] T. Nechma, M. Zwolinski, and J. Reeve, "Parallel sparse matrix solver for direct circuit simulations on FPGAs," Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS), May 30-June 2, 2010.
[7] N. Kapre and A. DeHon, "Parallelizing sparse matrix solve for SPICE circuit simulation using FPGAs," IEEE International Conference on Field-Programmable Technology (FPT 2009), December 9-11, 2009.
[8] N. Kapre, "SPICE² - a spatial parallel architecture for accelerating the SPICE circuit simulator," Ph.D. dissertation, California Institute of Technology, 2010.
[9] G. M. Papadopoulos and D. E. Culler, "Monsoon: an explicit token-store architecture," SIGARCH Comput. Archit. News, vol. 18, no. 3a, pp. 82-91, 1990.
[10] N. E. Jerger and L. Peh, On-Chip Networks, Synthesis Lectures on Computer Architecture, 2009.
[11] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks, Morgan Kaufmann, 2003.
[12] N. Kapre et al., "Packet switched vs. time multiplexed FPGA overlay networks," Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2006, pp. 205-216.
[13] G. Karypis and V. Kumar, "Multilevel k-way partitioning scheme for irregular graphs," Journal of Parallel and Distributed Computing, vol. 48, no. 1, pp. 96-129, 1998.
[14] Xilinx, "Xilinx CORE Generator Floating-Point Operator v5.0," March 2011. [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/floating_point_ds335.pdf [Accessed: 12 January 2013].
