Page 1

Matrix Multiplication on FPGA

Final presentation
One semester – winter 2014/15

By: Dana Abergel and Alex Fonariov
Supervisor: Mony Orbach

High Speed Digital System Laboratory

Page 2

Motivation and Background

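• Matrix multiplication is a computationally heavy operation: multiplying two n×n matrices with the standard algorithm takes n³ multiplications and n²(n−1) additions.

• A naive hardware implementation of the standard algorithm can consume a large amount of FPGA resources and run time.

• An efficient matrix multiplication implementation is therefore needed.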

Page 3

Project Goals

• Implementation of an efficient algorithm and infrastructure:
  o Minimum FPGA resources.
  o Minimum run time.
  o Maximum throughput.

• Examine the trade-offs between these goals.
• Working with memory interfaces (DDR).

Page 4

Development Platform

• VHDL
• Simulation: ModelSim/Vivado

• Xilinx VC709 Evaluation Board
• FPGA: Virtex-7
• Synthesis: Vivado

Page 5

Algorithm and Specifications

$$C = A \cdot B, \qquad A, B, C \text{ are } 128 \times 128 \text{ matrices}$$

$$C_{i,j} = \sum_{n=1}^{128} A_{i,n} \, B_{n,j}$$

• A: 8 bit per element
• B: 8 bit per element
• C: 23 bit per element
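A minimal Python reference model of the specification above, assuming the 8-bit elements of A and B are unsigned (the slide gives only the widths). It also shows why 23 bits suffice for C: an 8×8-bit product needs 16 bits, and summing 128 = 2^7 such products adds at most 7 more bits. The names `matmul` and `max_entry` are illustrative and not taken from the VHDL design.

```python
import random

N = 128        # matrix dimension from the specification
IN_BITS = 8    # bits per element of A and B (assumed unsigned)
OUT_BITS = 23  # bits per element of C


def matmul(a, b):
    """Reference model: C[i][j] = sum over n of A[i][n] * B[n][j]."""
    cols = list(zip(*b))  # columns of B (the design also reads M2 column by column)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Worst case: every element equals the largest unsigned 8-bit value (255).
max_entry = (2**IN_BITS - 1) ** 2 * N
print(max_entry, max_entry.bit_length())  # 8323200 23

# Random check that every element of C fits in OUT_BITS bits.
random.seed(0)
A = [[random.randrange(2**IN_BITS) for _ in range(N)] for _ in range(N)]
B = [[random.randrange(2**IN_BITS) for _ in range(N)] for _ in range(N)]
C = matmul(A, B)
assert all(0 <= c < 2**OUT_BITS for row in C for c in row)
```

Since 255² · 128 = 8,323,200 < 2^23 = 8,388,608, a 23-bit result never overflows under the unsigned assumption.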

Page 6

Block Diagram

• Two clock domains.
• Read FIFO: write width 512 bit, read width 1024 bit.
• Write FIFO: write width 64 bit, read width 32 bit.

[Figure: block diagram. DDR1 and DDR2 connect to the FPGA through DDR_Interconnect. Inside the FPGA, two parallel paths each run FIFO → Multiplier ([1023:0] in) → Adder ([22:0] out); each 23-bit sum is padded with 9'b0 in bits [31:23] to a 32-bit word [31:0] and written to the results FIFO. Clock frequencies shown: 100 MHz and 200 MHz.]
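The FIFO width conversions can be modeled in a few lines of Python. This is a sketch only: the bit ordering (element 0 in bits [7:0], the first 512-bit beat in bits [511:0], the low 32 bits read first from the write FIFO) is an assumption, and the helper names are hypothetical.

```python
ELEM_BITS = 8
N = 128
MASK_512 = (1 << 512) - 1
MASK_32 = (1 << 32) - 1


def pack_vector(elems):
    """Pack N 8-bit elements into one 1024-bit word, element 0 in bits [7:0] (assumed order)."""
    assert len(elems) == N
    word = 0
    for k, e in enumerate(elems):
        word |= (e & 0xFF) << (ELEM_BITS * k)
    return word


def read_fifo_combine(first_beat, second_beat):
    """Read FIFO: two 512-bit writes become one 1024-bit read.
    Assumption: the first beat lands in bits [511:0]."""
    return ((second_beat & MASK_512) << 512) | (first_beat & MASK_512)


def write_fifo_split(word64):
    """Write FIFO: one 64-bit write is read out as two 32-bit words.
    Assumption: bits [31:0] are read first."""
    return word64 & MASK_32, (word64 >> 32) & MASK_32


# A full 128-element row arrives as two 512-bit beats and is reassembled.
row = pack_vector(list(range(N)))
assert read_fifo_combine(row & MASK_512, row >> 512) == row
```

A 1024-bit read word holds exactly one row or column (128 × 8 bit), which is why the read FIFO doubles the 512-bit write width.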

Page 7

Multiplier

• Consists of 129 multipliers.
• Multiplier input width: 8 bit.
• Multiplier output width: 16 bit.
• The multipliers are implemented in DSP slices.
• One additional multiplier is used for the valid bit.

[Figure: multiplier stage. The 1024-bit row held in Row_Reg [1023:0] and the 1024-bit column [1023:0] are split into 8-bit elements [7:0]; each pair feeds a multiplier producing a 16-bit product [15:0]. A separate path carries Valid_din.]
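A behavioral sketch of the multiplier stage in Python, assuming element k of a 1024-bit word sits in bits [8k+7:8k]. It models only the 128 element products; the extra multiplier that carries the valid bit is not modeled, and the function names are hypothetical.

```python
N = 128
ELEM_BITS = 8


def unpack_vector(word, count=N, width=ELEM_BITS):
    """Split a 1024-bit word into 8-bit elements; element k from bits [8k+7:8k] (assumed order)."""
    mask = (1 << width) - 1
    return [(word >> (width * k)) & mask for k in range(count)]


def multiplier_stage(row_word, col_word):
    """128 parallel 8x8-bit multiplies; every product fits in 16 bits.
    The valid-bit multiplier is not modeled here."""
    products = [a * b for a, b in zip(unpack_vector(row_word), unpack_vector(col_word))]
    assert all(p < 1 << 16 for p in products)
    return products


# Example: a row of all 255s times a column of all 2s.
row = int.from_bytes(bytes([255] * N), "little")
col = int.from_bytes(bytes([2] * N), "little")
print(multiplier_stage(row, col)[:3])  # [510, 510, 510]
```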

Page 8

Adder

• Consists of 127 adders.
• The adder width grows by 1 bit at each level as the data advances through the pipeline.
• Pipelined implementation.

[Figure: adder tree. Pairs of 16-bit products [15:0] are summed into 17-bit values [16:0], then 18-bit values [17:0], and so on up to the final 23-bit sum [22:0].]
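The 127 adders and the growth from 16 to 23 bits are consistent with a binary reduction tree over 128 inputs (64 + 32 + 16 + 8 + 4 + 2 + 1 = 127 adders in 7 levels). The Python sketch below checks that arithmetic; it models values and widths only, not the pipeline registers, and `adder_tree` is a hypothetical name.

```python
def adder_tree(products, in_width=16):
    """Pairwise (binary-tree) reduction of the 16-bit products.

    Returns the final sum, the number of two-input adders used, and the
    output width of each tree level. Pipelining is not modeled.
    """
    level = list(products)
    width = in_width
    adders = 0
    widths = []
    while len(level) > 1:
        assert len(level) % 2 == 0, "tree assumes a power-of-two input count"
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        adders += len(level)
        width += 1  # a sum of two w-bit values needs at most w + 1 bits
        widths.append(width)
        assert all(v < 1 << width for v in level)
    return level[0], adders, widths


total, n_adders, widths = adder_tree([255 * 255] * 128)
print(total, n_adders, widths)
# 8323200 127 [17, 18, 19, 20, 21, 22, 23]
```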

Page 9

DDR_Interconnect (IP Integrator)

• Contains the following IPs:
  o Axi_data_mover
  o Axi_interconnect
  o mig (Memory Interface Generator)

Page 10

Memory Organization

[Figure: memory organization. DDR1 holds the odd rows of M1 (Row1(M1), Row3(M1), Row5(M1), ...), each followed by the odd columns of M2 (Column1(M2), Column3(M2), Column5(M2), ..., Column127(M2)); every entry is 1024 bits [1023:0]. DDR2 holds the even rows of M1 (Row2(M1), Row4(M1), ...), each followed by the even columns of M2 (Column2(M2), Column4(M2), Column6(M2), ..., Column128(M2)). The result matrix is stored in DDR2 as 32-bit elements [31:0] in the order Elem(1,1), Elem(2,1), Elem(1,2), Elem(2,2), Elem(1,3), Elem(2,3), ..., Elem(128,128).]

Page 11

Flow Chart

1. Read row(i) (fifo1).
2. Read row(i+1) (fifo2).
3. Read col1 (fifo1), col2 (fifo2), col3 (fifo1), col4 (fifo2), ..., col128 (fifo2).
4. i ← i+2 and repeat.
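The read order of the flow chart can be written as a small Python generator. The loop bounds are an assumption (i starts at 1 and steps by 2 until all 128 rows are read), and `read_schedule` is a hypothetical helper, not part of the design.

```python
def read_schedule(n=128):
    """Yield (kind, index, fifo) in flow-chart order, using 1-based indices."""
    for i in range(1, n + 1, 2):  # i = 1, 3, ..., n - 1; i advances by 2
        yield ("row", i, "fifo1")
        yield ("row", i + 1, "fifo2")
        for j in range(1, n + 1):
            # odd columns go to fifo1, even columns to fifo2
            yield ("col", j, "fifo1" if j % 2 == 1 else "fifo2")


schedule = list(read_schedule())
print(schedule[:4])
print(len(schedule))  # 64 iterations x (2 rows + 128 columns) = 8320
```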

Page 12

Design Verification

• BIST (Built-In Self Test).
• The memory tests (C code running on the MicroBlaze) were adapted to our purposes:
  o Loading the matrices into the DDR.
  o Reading the result matrix from the DDR.

Page 13

Verification Process

• Three bitstream files are involved in the process:
  1. BIST, modified to write the matrices to the DDRs.
  2. Our design, which reads the matrices, performs the arithmetic, and writes the result matrix to the DDR.
  3. BIST, modified to read the result matrix and print it on the screen.
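Checking the printed result by hand does not scale to 128 × 128 elements. One option is a software golden model that is compared against the read-back values. The sketch below assumes a hypothetical `readback` dictionary keyed by 1-based (row, column) that a script would build from the printed matrix; the print format itself is not specified on the slides.

```python
def reference(a, b):
    """Golden model: C = A * B computed in software."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def find_mismatches(a, b, readback):
    """Compare read-back results against the golden model.

    `readback` is a hypothetical dict {(i, j): value} with 1-based indices.
    Returns (i, j, expected, got) for every mismatch; got is None if missing.
    """
    expected = reference(a, b)
    mismatches = []
    for i, row in enumerate(expected, start=1):
        for j, exp in enumerate(row, start=1):
            got = readback.get((i, j))
            if got != exp:
                mismatches.append((i, j, exp, got))
    return mismatches


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
good = {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}
print(find_mismatches(A, B, good))                 # []
print(find_mismatches(A, B, {**good, (2, 2): 0}))  # [(2, 2, 50, 0)]
```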

Page 14

Performance

• Total run time: 1.1 sec (220,940,637 clock cycles at 200 MHz).
• Throughput:

$$\frac{128 \cdot 128 \cdot 23}{220{,}940{,}637} = \frac{376{,}832}{220{,}940{,}637} \approx 0.0017 \ \text{result bits per clock cycle}$$

• Total FPGA utilization:
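The run-time and throughput figures follow directly from the measured cycle count; this short Python check reproduces the arithmetic (the constant names are illustrative).

```python
CLOCK_HZ = 200_000_000        # 200 MHz
CYCLES = 220_940_637          # measured clock cycles (from the slide)
RESULT_BITS = 128 * 128 * 23  # 128x128 results, 23 bits each

run_time_s = CYCLES / CLOCK_HZ
bits_per_cycle = RESULT_BITS / CYCLES

print(f"run time: {run_time_s:.2f} s")                # run time: 1.10 s
print(f"result bits: {RESULT_BITS}")                  # result bits: 376832
print(f"throughput: {bits_per_cycle:.4f} bit/cycle")  # throughput: 0.0017 bit/cycle
```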

Page 15

Summary and Conclusions

• Using two DDRs and a pipelined design increases throughput and reduces latency.
• The whole operation (reading the matrices, multiplying, summing, and writing the results back) completes in 1.1 seconds.
• We considered several ways of verification:
  o PCI Express
  o UART
  o MicroBlaze
  o BIST

• Suggestion: build an automatic verification environment.
• Results:
  o Memory reading, multiplication and summation, and writing back to memory with good latency and throughput.
  o Low FPGA utilization.

Page 16

Thank you!