Architectural Optimization of Decomposition Algorithms for Wireless Communication Systems
Ali Irturk†, Bridget Benson†, Nikolay Laptev‡, Ryan Kastner†
† Department of Computer Science and Engineering, University of California, San Diego
{airturk, b1benson, kastner}@cs.ucsd.edu
‡ Department of Computer Science, University of California, Los Angeles
April 2009
Motivation
Matrix decompositions are essential computations for wireless communications;
Matrix decompositions are used to simplify matrix inversion, which is used in
• Equalization algorithms to remove the effect of the channel on the signal,
• Minimum mean square error algorithms for pre-coding in spatial multiplexing,
• Detection-estimation algorithms in space-time coding.
Motivation
There are a number of tools that translate Matlab algorithms to a hardware description language;
However, we believe that the majority of these tools take the wrong approach;
We take a more focused approach, developing a tool that specifically targets matrix computation algorithms.
Computing Platforms
Platforms: ASICs, DSPs, FPGAs, GPU, CELL BE
• ASICs: Exceptional Performance; Long Time to Market; Substantial Costs
• DSPs: Ease of Development; Fast Time to Market; Low Performance
• FPGAs: Ease of Development; Fast Time to Market; ASIC-like Performance
Major Contributions
Design of a novel tool, GUSTO, for automatic generation and optimization of application specific matrix computation architectures from a given Matlab algorithm;
Comparison of different matrix decomposition methods in terms of different matrix dimensions, bit widths and parallelism;
Thorough study of area and throughput tradeoffs of matrix decomposition architectures using different parameterizations;
A case study: implementation of an Adaptive Weight Calculation (AWC) core using the QRD-RLS algorithm.
GUSTO: General architecture design Utility and Synthesis Tool for Optimization
GUSTO is an easy-to-use tool for more efficient design space exploration and development. It automatically generates and optimizes application specific architectures, and creates a prototype hardware system in minutes instead of days or weeks.
GUSTO inputs:
• Algorithm (e.g. QR decomposition)
• Bit width (e.g. 19 bits of precision)
• Resource Allocation (e.g. 4 multipliers and 3 adders)
• Modes (e.g. heterogeneous cores connected using hierarchical datapaths)
GUSTO outputs:
• HDL files
• Error Analysis
[Figure: Error Analysis. Average error (log scale, 10^-15 to 10^0) versus number of bits used (16 to 64).]
Outline
Motivation
GUSTO: Design Tool and Methodology
Decomposition Methods
Results
• Inflection Point Analysis
• Architectural Design Alternatives
Conclusions
GUSTO Design Flow
1) Algorithm Analysis (input: Algorithm)
2) Instruction Generation
3) Resource Allocation (inputs: Type and # of Arithmetic Resources; Design Library)
4) Error Analysis (output: Error Analysis)
5) Architecture Generation (input: Data Representation; output: General Purpose Architecture)
6) Collecting Scheduling Information
7) Resource Trimming for Hardware Optimization (output: Application Specific Architecture)
Reported results: Area, Latency and Throughput Results; Simulation Results
GUSTO Design Flow
Algorithm Analysis (input: Algorithm)
[Figure: A Processing Element (PE) with an instruction controller, arithmetic units (labeled A and M) and a memory controller; multiple PEs are combined to build a Software Defined Radio.]
GUSTO provides options to divide the given algorithm into smaller processing elements which are small in area and highly optimized for throughput.
GUSTO Design Flow
Instruction Generation
Resource Allocation (inputs: Type and # of Arithmetic Resources; Design Library: +, -, *, /)
GUSTO uses instruction scheduling for better resource utilization and provides different scheduling methods.
GUSTO generates resource constrained architectures, i.e. the user chooses the number and type of arithmetic units.
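As a rough illustration of resource-constrained instruction scheduling, the sketch below implements a greedy list scheduler in Python. It is not GUSTO's scheduler: the instruction format, the one-cycle latency for every operation, and the names `list_schedule` and `program` are all assumptions made for this example.

```python
from collections import defaultdict

def list_schedule(instructions, resources):
    """Greedy list scheduling under resource constraints.

    instructions: dict name -> (unit_type, [names it depends on])
    resources:    dict unit_type -> number of available units
    Assumes every operation takes one cycle (a simplification).
    Returns dict name -> cycle in which the instruction issues.
    """
    schedule = {}
    remaining = dict(instructions)
    cycle = 0
    while remaining:
        used = defaultdict(int)
        issued = []
        for name, (unit, deps) in remaining.items():
            # Ready only if every dependency issued in an earlier cycle.
            ready = all(d in schedule for d in deps)
            if ready and used[unit] < resources.get(unit, 0):
                used[unit] += 1
                issued.append(name)
        if not issued:
            raise ValueError("nothing can issue: missing unit or cyclic dependency")
        for name in issued:
            schedule[name] = cycle
            del remaining[name]
        cycle += 1
    return schedule

# Hypothetical program: s2 = (a*b + c*d) + e*f
program = {
    "t1": ("mul", []),
    "t2": ("mul", []),
    "t3": ("mul", []),
    "s1": ("add", ["t1", "t2"]),
    "s2": ("add", ["s1", "t3"]),
}
for n_mul in (1, 2):
    result = list_schedule(program, {"mul": n_mul, "add": 1})
    print(n_mul, "multiplier(s):", max(result.values()) + 1, "cycles")
```

Changing the number of multipliers in `resources` changes the schedule length, which is the kind of area versus throughput tradeoff that the user-selected number and type of arithmetic units controls.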
[Figure: Processing Element with an instruction controller, arithmetic units (labeled A and M) and a memory controller.]
GUSTO Design Flow
Error Analysis
GUSTO employs fixed point arithmetic in generated architectures;
GUSTO performs error analysis to find an appropriate fixed point representation that provides accuracy similar to that of a floating point implementation.
Error analysis setup:
• User Defined Input Data is given to both GUSTO and MATLAB
• GUSTO: Fixed Point Arithmetic Results (using variable bit width)
• MATLAB: Floating Point Arithmetic Results (Single/Double precision)
Error Analysis Metrics:
1) Mean Error
2) Peak Error
3) Standard Deviation of Error
4) Mean Percentage Error
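The four error metrics can be illustrated with a short Python sketch that compares floating point reference values against a fixed point approximation. The exact metric definitions used by GUSTO are not given on the slide, so the formulas below (all based on the absolute error) are assumptions, as are the helper names `to_fixed` and `error_metrics` and the sample data.

```python
import math

def to_fixed(x, frac_bits):
    """Quantize x to a grid with frac_bits fractional bits (round to nearest).

    Only the fractional precision is modeled; overflow of the integer part
    is ignored in this sketch.
    """
    scale = 1 << frac_bits
    return round(x * scale) / scale

def error_metrics(reference, approx):
    """Mean, peak, standard deviation and mean percentage of the absolute error.

    Reference values must be nonzero for the percentage metric.
    """
    errors = [abs(r - a) for r, a in zip(reference, approx)]
    n = len(errors)
    mean = sum(errors) / n
    peak = max(errors)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    mean_pct = 100.0 * sum(e / abs(r) for e, r in zip(errors, reference)) / n
    return {"mean": mean, "peak": peak, "std": std, "mean_pct": mean_pct}

# Made-up floating point results standing in for the MATLAB reference.
reference = [0.1, 0.7, 1.3]
for bits in (8, 16):
    approx = [to_fixed(x, bits) for x in reference]
    print(bits, "fractional bits:", error_metrics(reference, approx))
```

Sweeping `frac_bits` and watching where the metrics fall below an acceptable threshold mirrors the idea of searching for a fixed point representation whose accuracy is close to floating point.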
GUSTO Design Flow
Architecture Generation
GUSTO generates a CPU-like architecture with
• Dynamic Instruction Scheduling;
• Dynamic Memory Assignments;
• Full Connectivity between functional units.
[Figure: Instruction controller, memory controller and arithmetic units (adders, multipliers) with full connectivity, dynamic instruction scheduling and dynamic memory assignments.]
GUSTO Design Flow
Collecting Scheduling Information
[Figure: Instruction controller, memory controller and arithmetic units (adders, multipliers) with full connectivity, now with static instruction scheduling and static memory assignments.]
GUSTO collects scheduling information from instruction and memory controllers.
GUSTO uses this information to eliminate unneeded resources, automatically creating a small, fast statically scheduled architecture.
GUSTO Design Flow
Resource Trimming for Hardware Optimization
GUSTO simulates the architecture to determine the usage of arithmetic units, multiplexers, register entries and input/output ports, and trims away the unused components along with their interconnects.
GUSTO's optimization provides substantial silicon savings while ensuring the correctness of the solution.
[Figure: Multiplier, adder and memory with full connectivity (before trimming) versus the same units with only the required connectivity (after trimming).]
GUSTO Trimming Feature
[Figure: Units A (inputs In_A1, In_A2; output Out_A), B (inputs In_B1, In_B2; output Out_B) and mem (input In_mem1; outputs Out_mem1, Out_mem2). For unit A, simulation runs record which of Out_A, Out_B, Out_mem1 and Out_mem2 are used by In_A1 and In_A2; usage bits shown: 01011010.]
GUSTO Trimming Feature
[Figure: Same units A, B and mem. For unit B, simulation runs record which of Out_A, Out_B, Out_mem1 and Out_mem2 are used by In_B1 and In_B2; usage bits shown: 00000000.]
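The trimming idea can be sketched in Python as follows: start from full connectivity, record which output-to-input transfers actually occur during simulation, and keep only those. The port names are taken from the slides, but the simulation log, the bit ordering (one bit per input/output pair, inputs first) and the function names `trim_connections` and `usage_bits` are hypothetical and not GUSTO's actual data structures.

```python
def trim_connections(usage_log, inputs, outputs):
    """Keep only the (output -> input) connections observed during simulation.

    usage_log: iterable of (output_port, input_port) pairs, one per transfer.
    Returns dict input_port -> sorted list of output ports that must stay wired.
    """
    required = {port: set() for port in inputs}
    for src, dst in usage_log:
        if src not in outputs or dst not in inputs:
            raise ValueError(f"unknown port in log: {src} -> {dst}")
        required[dst].add(src)
    return {port: sorted(srcs) for port, srcs in required.items()}

def usage_bits(required, inputs, outputs):
    """Encode kept connections as one bit per (input, output) pair."""
    return "".join(
        "1" if out in required[inp] else "0"
        for inp in inputs
        for out in outputs
    )

outputs = ["Out_A", "Out_B", "Out_mem1", "Out_mem2"]
inputs = ["In_A1", "In_A2"]
# Made-up simulation log for unit A; repeated transfers are counted once.
log = [
    ("Out_B", "In_A1"), ("Out_mem2", "In_A1"),
    ("Out_A", "In_A2"), ("Out_mem1", "In_A2"),
    ("Out_B", "In_A1"),
]
required = trim_connections(log, inputs, outputs)
kept = sum(len(srcs) for srcs in required.values())
print(usage_bits(required, inputs, outputs))
print(kept, "of", len(inputs) * len(outputs), "connections kept")
```

In this sketch an input whose bits are all zero would need no multiplexer at all, which is where the area savings of a statically scheduled architecture come from.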
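MATRIX DECOMPOSITIONS: QR, LU AND CHOLESKY
QR decomposition: A = Q × R, where A is the given matrix, Q is an orthogonal matrix and R is an upper triangular matrix.
\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} Q_{11} & Q_{12} & Q_{13} \\ Q_{21} & Q_{22} & Q_{23} \\ Q_{31} & Q_{32} & Q_{33} \end{bmatrix}
\times
\begin{bmatrix} R_{11} & R_{12} & R_{13} \\ 0 & R_{22} & R_{23} \\ 0 & 0 & R_{33} \end{bmatrix}
\]
\[
Q^{T} \times Q = Q \times Q^{T} = I, \qquad Q^{-1} = Q^{T}
\]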
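LU decomposition: A = L × U, where A is the given matrix, L is a lower triangular matrix and U is an upper triangular matrix.
\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ L_{21} & 1 & 0 \\ L_{31} & L_{32} & 1 \end{bmatrix}
\times
\begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix}
\]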
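Cholesky decomposition: A = G × G^T, where A is the given matrix, G is the unique lower triangular matrix (Cholesky triangle) and G^T is the transpose of the lower triangular matrix.
\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
=
\begin{bmatrix} G_{11} & 0 & 0 \\ G_{21} & G_{22} & 0 \\ G_{31} & G_{32} & G_{33} \end{bmatrix}
\times
\begin{bmatrix} G_{11} & G_{21} & G_{31} \\ 0 & G_{22} & G_{32} \\ 0 & 0 & G_{33} \end{bmatrix}
\]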
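To make the LU and Cholesky factorizations concrete, here is a minimal pure-Python sketch. It uses the textbook Doolittle (no pivoting) and Cholesky recurrences, which are assumptions for illustration; the slides do not state which variants GUSTO implements, and the function names and the example matrix are hypothetical.

```python
import math

def lu_decompose(A):
    """Doolittle LU without pivoting: A = L U, ones on the diagonal of L."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):       # row i of U
            U[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):   # column i of L
            L[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U

def cholesky(A):
    """Lower triangular G with A = G G^T, for symmetric positive definite A."""
    n = len(A)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(G[i][k] * G[j][k] for k in range(j))
            G[i][j] = math.sqrt(s) if i == j else s / G[j][j]
    return G

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Example symmetric positive definite matrix (chosen for exact arithmetic).
A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
L, U = lu_decompose(A)
G = cholesky(A)
print("L =", L)
print("U =", U)
print("G =", G)
print("L*U == A:", matmul(L, U) == A)
print("G*G^T == A:", matmul(G, transpose(G)) == A)
```

Cholesky applies only to symmetric positive definite matrices, while this LU variant assumes no zero pivots are encountered.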
MATRIX INVERSION
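A × A^{-1} = I, where A is the given matrix, A^{-1} is the inverse matrix and I is the identity matrix.
\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
\times
\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \\ x_{31} & x_{32} & x_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
For the first row of A:
\[
\begin{aligned}
A_{11} x_{11} + A_{12} x_{21} + A_{13} x_{31} &= 1 \\
A_{11} x_{12} + A_{12} x_{22} + A_{13} x_{32} &= 0 \\
A_{11} x_{13} + A_{12} x_{23} + A_{13} x_{33} &= 0
\end{aligned}
\]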
Full Matrix Inversion is costly!
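Inversion using the decompositions:
\[
A^{-1} = R^{-1} \times Q^{T}, \qquad A^{-1} = U^{-1} \times L^{-1}, \qquad A^{-1} = (G^{T})^{-1} \times G^{-1}
\]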
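The QR-based inversion formula can be sketched in pure Python. Modified Gram-Schmidt is used here only as one common way to obtain Q and R; the slides do not say which QR method is used, and the function names and test matrix are assumptions. Because R is upper triangular, R^{-1} × Q^T is obtained by back substitution rather than by a general inversion.

```python
import math

def qr_gram_schmidt(A):
    """Modified Gram-Schmidt: Q with orthonormal columns, upper triangular R, A = Q R."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]  # columns of A
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i, q in enumerate(q_cols):
            # Remove the component of v along each earlier q.
            R[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - R[i][j] * qk for qk, vk in zip(q, v)]
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))
        q_cols.append([vk / R[j][j] for vk in v])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R

def invert_via_qr(A):
    """A^{-1} = R^{-1} Q^T, computed by back substitution on R X = Q^T."""
    n = len(A)
    Q, R = qr_gram_schmidt(A)
    Qt = [[Q[j][i] for j in range(n)] for i in range(n)]
    X = [[0.0] * n for _ in range(n)]
    for c in range(n):                  # one column of the inverse at a time
        for i in reversed(range(n)):    # last row first
            s = Qt[i][c] - sum(R[i][k] * X[k][c] for k in range(i + 1, n))
            X[i][c] = s / R[i][i]
    return X

# Hypothetical nonsingular test matrix.
A = [[2.0, 1.0, 1.0], [1.0, 3.0, 2.0], [1.0, 0.0, 0.0]]
X = invert_via_qr(A)
product = [[sum(A[i][k] * X[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
print(product)  # close to the identity matrix, up to rounding
```

The same pattern applies to the other two formulas: triangular factors are inverted by forward or back substitution, which is what makes decomposition-based inversion cheaper than inverting the full matrix directly.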
Results: Inflection Point Analysis (Sequential)
[Figure: Number of clock cycles for sequential execution (0 to 6000) versus matrix size (2×2 to 8×8), for QR, LU and Cholesky decomposition at 16, 32 and 64 bits.]
Results: Inflection Point Analysis (Parallel)
[Figure: Number of clock cycles for parallel execution (0 to 1400) versus matrix size (2×2 to 8×8), for QR, LU and Cholesky decomposition at 16, 32 and 64 bits.]
Results: Finding the Optimal Hardware (Decomposition Methods)
[Figure: Number of slices (0 to 14,000) for the General Purpose Architecture and the Application Specific Architecture, for QR, LU and Cholesky.]
Decrease in area (percentage), as listed on the slide: 94%, 83%, 86%
Results: Finding the Optimal Hardware (Decomposition Methods)
[Figure: Throughput (0.00 to 2.00) for the General Purpose Architecture (Mode 1) and the Application Specific Architecture (Mode 2), for QR, LU and Cholesky.]
Increase in throughput (percentage), as listed on the slide: 68%, 16%, 14%
Results: Finding the Optimal Hardware (Matrix Inversion using QR)
Average of 59% decrease in area; 3X increase in throughput.
[Figure: Number of slices (0 to 14,000) and throughput (0 to 0.3) versus the number of adders, subtractors, multipliers and dividers (2222, 2244, 3444, 4444). Legend: Slices (Mode 1), Slices (Mode 2), Throughput (Mode 1).]
Results: Architectural Design Alternatives
Results: Comparison with Previously Published Work (AWC)
Designs compared (in column order): Edman et al.; Karkooti et al.; Dick et al.; GUSTO
Application: Matrix Inversion; Matrix Inversion; Beamformer; AWC
Method: QR; QR; QR; QR
Matrix Size: 4 × 4; 4 × 4; 3 × 3; 5 × 5; 4 × 4
Bit width: 12; 20; 18; 20
Data type: fixed; floating; NR; fixed
Device type: Virtex 2; Virtex 4; Virtex 4; Virtex 4
Slices: 4400; 9117; 3530; 2558
DSP48s: NR; 22; 13; 12
BRAMs: NR; 9; 6; 1
Throughput (10^6 × s^-1): 0.28; 0.12; 0.27; 0.11; 0.13
• F. Edman, V. Öwall, "A Scalable Pipelined Complex Valued Matrix Inversion Architecture", IEEE International Symposium on Circuits and Systems (2005).
• M. Karkooti, J.R. Cavallaro, C. Dick, "FPGA Implementation of Matrix Inversion Using QRD-RLS Algorithm", Asilomar Conference on Signals, Systems and Computers (2005).
• C. Dick, F. Harris, M. Pajic, D. Vuletic, "Real-Time QRD-Based Beamforming on an FPGA Platform", Asilomar Conference on Signals, Systems and Computers (2006).
Adaptive Weight Calculation (AWC) Core
Conclusions
GUSTO: General architecture design Utility and Synthesis Tool for Optimization
GUSTO is a tool to provide automatic generation and optimization of a variety of application specific processing elements (PEs) with different parameterization options;
Current projects include implementation of
• Short Preamble Processing unit for OFDM Receiver design.