Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems


Ali Irturk†, Bridget Benson†, Nikolay Laptev‡, Ryan Kastner†

† Department of Computer Science and Engineering, University of California, San Diego

{airturk, b1benson, kastner}@cs.ucsd.edu


‡ Department of Computer Science, University of California, Los Angeles

[email protected]

April 2009

Motivation

Matrix Decompositions are essential computations for wireless communications;

Matrix decompositions simplify matrix inversion, which is used in
• Equalization algorithms to remove the effect of the channel on the signal,
• Minimum mean square error algorithms for pre-coding in spatial multiplexing,
• Detection-estimation algorithms in space-time coding.

[Figure labels: QR, A^-1]

Motivation


There are a number of tools that translate Matlab algorithms to a hardware description language;

However, we believe that the majority of these tools take the wrong approach;

We take a more focused approach: we develop a tool that specifically targets matrix computation algorithms.

Computing Platforms


Platforms: ASICs, DSPs, FPGAs, GPU, CELL BE
• ASICs: Exceptional Performance; Long Time to Market; Substantial Costs
• DSPs: Ease of Development; Fast Time to Market; Low Performance
• FPGAs: Ease of Development; Fast Time to Market; ASIC-like Performance

Major Contributions


Design of a novel tool, GUSTO, for automatic generation and optimization of application specific matrix computation architectures from a given Matlab algorithm;

Comparison of different matrix decomposition methods in terms of different matrix dimensions, bit widths and parallelism;

Thorough study of area and throughput tradeoffs of matrix decomposition architectures using different parameterizations;

A case study: Implementation of Adaptive Weight Calculation Core using QRD-RLS algorithm.

GUSTO General architecture design Utility and Synthesis Tool for Optimization

GUSTO is an easy-to-use tool for more efficient design space exploration and development. It
• automatically generates and optimizes application specific architectures;
• creates a prototype hardware system in just minutes instead of days or weeks.

[Figure: GUSTO block diagram. Labels: Algorithm (e.g. QR decomposition); Bit width (e.g. 19 bits of precision); Resource Allocation (e.g. 4 multipliers and 3 adders); Modes (e.g. heterogeneous cores connected using hierarchical datapaths); HDL files; Error Analysis. Inset plot: Average Error (log scale, 10^-15 to 10^0) versus Number of bits used (16 to 64).]

Outline

Motivation

GUSTO: Design Tool and Methodology

Decomposition Methods

Results
• Inflection Point Analysis
• Architectural Design Alternatives

Conclusions

GUSTO Design Flow

[Figure: GUSTO design flow]
• Inputs: Algorithm; Type and # of Arithmetic Resources; Design Library; Data Representation
• Steps: Algorithm Analysis; Instruction Generation; Resource Allocation; Error Analysis; Architecture Generation; Collecting Scheduling Information; Resource Trimming for Hardware Optimization
• Outputs: Error Analysis; Area, Latency and Throughput Results; Simulation Results
• Architectures: General Purpose Architecture; Application Specific Architecture

GUSTO Design Flow

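Algorithm Analysis (input: Algorithm)

[Figure: a Software Defined Radio composed of an array of processing elements (PEs). Each Processing Element contains an instruction controller (Inst. Cont.), a memory controller (Mem. Cont.) and arithmetic units labeled A and M.]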

GUSTO provides options to divide the given algorithm into smaller processing elements which are small in area and highly optimized for throughput.


GUSTO Design Flow

Instruction Generation

Resource Allocation (inputs: Type and # of Arithmetic Resources; Design Library with +, -, *, / units)

GUSTO uses instruction scheduling for better resource utilization and provides different scheduling methods.

GUSTO generates resource constrained architectures, i.e. the user chooses the number and type of arithmetic units.
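The idea of resource-constrained scheduling can be illustrated with a small list scheduler. This is only a sketch and not GUSTO's actual scheduler: the instruction tuple format, the `schedule` function and the one-cycle latency per operation are all assumptions made for the example.

```python
def schedule(instrs, resources):
    """Assign each instruction a cycle under resource limits.

    instrs: list of (name, op, deps); deps are names of instructions
            whose results are needed (names are assumed unique).
    resources: dict mapping an op type to the number of available units.
    Assumption: every operation completes in one cycle.
    """
    done = {}               # name -> cycle in which it was issued
    pending = list(instrs)
    cycle = 0
    while pending:
        free = dict(resources)   # units still free in this cycle
        issued = []
        for instr in pending:
            name, op, deps = instr
            # Results issued in earlier cycles are available now.
            ready = all(d in done for d in deps)
            if ready and free.get(op, 0) > 0:
                free[op] -= 1
                issued.append(instr)
        if not issued:
            raise ValueError("nothing can be issued: check deps and resources")
        for instr in issued:
            done[instr[0]] = cycle
            pending.remove(instr)
        cycle += 1
    return done


# Hypothetical program: r = a*b + c*d
program = [("t1", "mul", []), ("t2", "mul", []), ("r", "add", ["t1", "t2"])]
print(schedule(program, {"mul": 1, "add": 1}))  # products are serialized
print(schedule(program, {"mul": 2, "add": 1}))  # products issue together
```

With one multiplier the two products occupy consecutive cycles, while a second multiplier lets them issue in the same cycle, which is the area versus latency tradeoff the user controls by choosing the number of arithmetic units.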

[Figure: Processing Element with an instruction controller (Inst. Cont.), a memory controller (Mem. Cont.) and arithmetic units labeled A and M.]

GUSTO Design Flow

Error Analysis

GUSTO employs fixed point arithmetic in generated architectures;

GUSTO performs error analysis to find an appropriate fixed point representation, one whose results have accuracy similar to that of a floating point implementation.

[Figure: error analysis flow. User Defined Input Data is processed by GUSTO, giving Fixed Point Arithmetic Results (using variable bit width), and by MATLAB, giving Floating Point Arithmetic Results (Single/Double precision).]

Error Analysis Metrics:
1) Mean Error
2) Peak Error
3) Standard Deviation of Error
4) Mean Percentage Error
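The four metrics can be sketched for a toy computation as follows. This is a minimal illustration, not GUSTO's error analysis: the helper names (`quantize`, `fixed_mul`, `error_metrics`), the rounding scheme, the use of absolute errors and the sample data are assumptions.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest value with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def fixed_mul(a, b, frac_bits):
    """Multiply two quantized operands and quantize the product."""
    return quantize(quantize(a, frac_bits) * quantize(b, frac_bits), frac_bits)


def error_metrics(reference, fixed):
    """Compare fixed point results against floating point references."""
    errs = [abs(r - f) for r, f in zip(reference, fixed)]
    n = len(errs)
    mean = sum(errs) / n
    peak = max(errs)
    std = math.sqrt(sum((e - mean) ** 2 for e in errs) / n)
    # Percentage error is only defined where the reference is nonzero.
    pct = [100.0 * e / abs(r) for e, r in zip(errs, reference) if r != 0]
    mean_pct = sum(pct) / len(pct) if pct else 0.0
    return {"mean": mean, "peak": peak, "std": std, "mean_pct": mean_pct}


# Hypothetical input data: products of a few operand pairs.
pairs = [(0.1, 0.7), (1.3, 2.9), (0.33, 0.45)]
reference = [a * b for a, b in pairs]   # double precision reference
for bits in (8, 16):
    fixed = [fixed_mul(a, b, bits) for a, b in pairs]
    print(bits, error_metrics(reference, fixed))
```

Sweeping `bits` in this way mirrors the search for the smallest bit width whose error is acceptable for the application.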

GUSTO Design Flow

Architecture Generation

GUSTO generates a CPU-like architecture with
• Dynamic Instruction Scheduling;
• Dynamic Memory Assignments;
• Full Connectivity between functional units.

[Figure: Instruction Controller, Memory Controller and Arithmetic Units (Adders, Multipliers) with Full Connectivity, Dynamic Instruction Scheduling and Dynamic Memory Assignments.]

GUSTO Design Flow

Collecting Scheduling Information

[Figure: Instruction Controller, Memory Controller and Arithmetic Units (Adders, Multipliers) with Full Connectivity, Static Instruction Scheduling and Static Memory Assignments.]

GUSTO collects scheduling information from instruction and memory controllers.

GUSTO uses this information to eliminate unneeded resources, automatically creating a small, fast statically scheduled architecture.


GUSTO Design Flow

Resource Trimming for Hardware Optimization

GUSTO simulates the architecture to define the usage of arithmetic units, multiplexers, register entries and input/output ports and trims away the unused components with their interconnects.

GUSTO's optimization provides tremendous silicon savings while ensuring the correctness of the solution.

[Figure: Multiplier, Adder and Memory shown twice, once with Full Connectivity and once with Required Connectivity.]

GUSTO Trimming Feature

[Figure: units A (inputs In_A1, In_A2; output Out_A), B (inputs In_B1, In_B2; output Out_B) and mem (input In_mem1; outputs Out_mem1, Out_mem2). Each input of A can select among Out_A, Out_B, Out_mem1 and Out_mem2. Over the simulation runs, the usage pattern recorded for Out_A is 01011010.]

GUSTO Trimming Feature

[Figure: units A (inputs In_A1, In_A2; output Out_A), B (inputs In_B1, In_B2; output Out_B) and mem (input In_mem1; outputs Out_mem1, Out_mem2). Each input of B can select among Out_A, Out_B, Out_mem1 and Out_mem2. Over the simulation runs, the usage pattern recorded for Out_B is 00000000.]
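The trimming step can be sketched as a filter over connections observed during simulation. This is an illustrative sketch rather than GUSTO's implementation: the port names follow the slides, but the `trim` function, the data layout and the simulation trace are hypothetical.

```python
# Output ports that any input multiplexer may select from (names from the slides).
SOURCES = ["Out_A", "Out_B", "Out_mem1", "Out_mem2"]

# Input ports of each unit (names from the slides).
INPUTS = {
    "A": ["In_A1", "In_A2"],
    "B": ["In_B1", "In_B2"],
    "mem": ["In_mem1"],
}


def trim(inputs, sources, trace):
    """Keep only connections that were exercised during simulation.

    trace: iterable of (source_output, destination_input) pairs observed
           while simulating the fully connected architecture.
    Returns a dict unit -> {input port: [sources still required]}.
    Units whose inputs were never driven are dropped entirely.
    """
    used = {port: set() for ports in inputs.values() for port in ports}
    for src, port in trace:
        used[port].add(src)
    kept = {}
    for unit, ports in inputs.items():
        conns = {p: [s for s in sources if s in used[p]] for p in ports}
        if any(conns.values()):
            kept[unit] = conns
    return kept


# Hypothetical trace: unit B is never used in this run.
trace = [
    ("Out_mem1", "In_A1"),
    ("Out_mem2", "In_A2"),
    ("Out_A", "In_mem1"),
    ("Out_A", "In_A1"),
]
print(trim(INPUTS, SOURCES, trace))
```

In this trace unit B and every multiplexer input that was never selected disappear from the result, which corresponds to moving from full connectivity to required connectivity.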


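MATRIX DECOMPOSITIONS: QR, LU AND CHOLESKY

QR: $A = Q \times R$, where $A$ is the given matrix, $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix.

$$
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} Q_{11} & Q_{12} & Q_{13} \\ Q_{21} & Q_{22} & Q_{23} \\ Q_{31} & Q_{32} & Q_{33} \end{bmatrix}
\begin{bmatrix} R_{11} & R_{12} & R_{13} \\ 0 & R_{22} & R_{23} \\ 0 & 0 & R_{33} \end{bmatrix}
$$

$$
Q^T \times Q = Q \times Q^T = I, \qquad Q^{-1} = Q^T
$$

LU: $A = L \times U$, where $A$ is the given matrix, $L$ is a lower triangular matrix and $U$ is an upper triangular matrix.

$$
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ L_{21} & 1 & 0 \\ L_{31} & L_{32} & 1 \end{bmatrix}
\begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix}
$$

Cholesky: $A = G \times G^T$, where $A$ is the given matrix, $G$ is the unique lower triangular matrix (Cholesky triangle) and $G^T$ is the transpose of the lower triangular matrix.

$$
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} G_{11} & 0 & 0 \\ G_{21} & G_{22} & 0 \\ G_{31} & G_{32} & G_{33} \end{bmatrix}
\begin{bmatrix} G_{11} & G_{21} & G_{31} \\ 0 & G_{22} & G_{32} \\ 0 & 0 & G_{33} \end{bmatrix}
$$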
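As a software reference for the LU and Cholesky factorizations, the following sketch may help. It is a textbook formulation in floating point (Doolittle LU without pivoting, and the standard Cholesky recurrence), not the fixed point hardware that GUSTO generates; the function names and the test matrix are chosen for illustration.

```python
import math


def lu(a):
    """Doolittle LU without pivoting: a = L * U, L has ones on its diagonal."""
    n = len(a)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):          # row i of U
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):      # column i of L
            s = a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))
            L[j][i] = s / U[i][i]
    return L, U


def cholesky(a):
    """a = G * G^T for a symmetric positive definite a; G is lower triangular."""
    n = len(a)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(G[i][k] * G[j][k] for k in range(j))
            G[i][j] = math.sqrt(s) if i == j else s / G[j][j]
    return G


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


# Symmetric positive definite example matrix (chosen for illustration).
A = [[4, 12, -16], [12, 37, -43], [-16, -43, 98]]
L, U = lu(A)
G = cholesky(A)
print(L, U)
print(G)
```

Multiplying the factors back together with `matmul` is a quick way to check a factorization against the original matrix.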

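MATRIX INVERSION

$A \times A^{-1} = I$, where $A$ is the given matrix, $A^{-1}$ is the inverse matrix and $I$ is the identity matrix.

$$
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$

$$
A_{11}x_{11} + A_{12}x_{21} + A_{13}x_{31} = 1
$$

$$
A_{11}x_{12} + A_{12}x_{22} + A_{13}x_{32} = 0
$$

$$
A_{11}x_{13} + A_{12}x_{23} + A_{13}x_{33} = 0
$$

Full Matrix Inversion is costly!

$$
A^{-1} = R^{-1} \times Q^T, \qquad A^{-1} = U^{-1} \times L^{-1}, \qquad A^{-1} = (G^T)^{-1} \times G^{-1}
$$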
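The QR route to the inverse, $A^{-1} = R^{-1} \times Q^T$, can be sketched in a few lines. This is a floating point illustration under assumptions: it uses modified Gram-Schmidt for the QR step (the slides do not say which QR method is used), and the function names and example matrix are hypothetical.

```python
import math


def qr(a):
    """Modified Gram-Schmidt QR of a square matrix with independent columns."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of v along the k-th orthonormal column.
            R[k][j] = sum(q_cols[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * q_cols[k][i] for i in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R


def invert_upper(R):
    """Inverse of an upper triangular matrix by back substitution."""
    n = len(R)
    X = [[0.0] * n for _ in range(n)]
    for j in range(n):                  # solve R * x = e_j column by column
        for i in range(j, -1, -1):
            rhs = 1.0 if i == j else 0.0
            s = rhs - sum(R[i][k] * X[k][j] for k in range(i + 1, j + 1))
            X[i][j] = s / R[i][i]
    return X


def inverse_via_qr(a):
    """A^-1 = R^-1 * Q^T; only the triangular factor needs inverting."""
    Q, R = qr(a)
    r_inv = invert_upper(R)
    n = len(a)
    # Entry (i, j) of R^-1 * Q^T is sum_k r_inv[i][k] * Q[j][k].
    return [[sum(r_inv[i][k] * Q[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


# Hypothetical nonsingular example matrix.
A = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 0.0]]
print(inverse_via_qr(A))
```

The saving comes from the structure of the factors: the orthogonal factor is inverted by a transpose, and the triangular factor is inverted by substitution rather than by solving the full system.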


Results
Inflection Point Analysis: Sequential

[Figure: # of clock cycles (sequential, axis 0 to 6000) versus matrix size (2×2 to 8×8) for QR, LU and Cholesky decomposition at 16, 32 and 64 bits.]

Results
Inflection Point Analysis: Parallel

[Figure: # of clock cycles (parallel, axis 0 to 1400) versus matrix size (2×2 to 8×8) for QR, LU and Cholesky decomposition at 16, 32 and 64 bits.]

Results
Finding the Optimal Hardware: Decomposition Methods

[Figure: # of slices (axis 0 to 14,000) for the General Purpose Architecture and the Application Specific Architecture, for QR, LU and Cholesky.]

Decrease in Area (Percentage) for QR, LU, Cholesky: 94%, 83%, 86%

Results
Finding the Optimal Hardware: Decomposition Methods

[Figure: throughput (axis 0.00 to 2.00) for the General Purpose Architecture (Mode 1) and the Application Specific Architecture (Mode 2), for QR, LU and Cholesky.]

Increase in Throughput (Percentage) for QR, LU, Cholesky: 68%, 16%, 14%

Results
Finding the Optimal Hardware: Matrix Inversion (using QR)

• Average of 59% decrease in area
• 3X increase in throughput

[Figure: # of slices (axis 0 to 14,000) and throughput (axis 0 to 0.3) versus # of Adder, Subtractor, Multiplier, Divider (2222, 2244, 3444, 4444). Legend: Slices (Mode 1), Slices (Mode 2), Throughput (Mode 1).]

Results
Architectural Design Alternatives

Results
Comparison with Previously Published Work: AWC

               Edman et al.       Karkooti et al.    Dick et al.   GUSTO
Application    Matrix Inversion   Matrix Inversion   Beamformer    AWC
Method         QR                 QR                 QR            QR
Bit width      12                 20                 18            20
Data type      fixed              floating           NR            fixed
Device type    Virtex 2           Virtex 4           Virtex 4      Virtex 4
Slices         4400               9117               3530          2558
DSP48s         NR                 22                 13            12
BRAMs          NR                 9                  6             1

Matrix Size (in slide order): 4 × 4, 4 × 4, 3 × 3, 5 × 5, 4 × 4
Throughput (10^6 × s^-1, in slide order): 0.28, 0.12, 0.27, 0.11, 0.13

• F. Edman, V. Öwall, “A Scalable Pipelined Complex Valued Matrix Inversion Architecture”, IEEE International Symposium on Circuits and Systems (2005).
• M. Karkooti, J.R. Cavallaro, C. Dick, “FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm”, Asilomar Conference on Signals, Systems and Computers (2005).
• C. Dick, F. Harris, M. Pajic, D. Vuletic, “Real-Time QRD-Based Beamforming on an FPGA Platform”, Asilomar Conference on Signals, Systems and Computers (2006).

Adaptive Weight Calculation (AWC) Core


GUSTO General architecture design Utility and Synthesis Tool for Optimization

GUSTO is a tool to provide automatic generation and optimization of a variety of application specific processing elements (PEs) with different parameterization options;

Current projects include the implementation of
• Short Preamble Processing unit for OFDM Receiver design.


Thank You


[email protected]
