DSP Algorithms on FPGA, Part II: Digital Image Processing
Content
• Overview of image processing and FPGA
• Algorithm to FPGA Mapping Flow
• Nested Loop Algorithms and MODG
• Example: Motion Estimation
• Conclusion and Future Trends
Video signal in different formats
Format   Resolution (pixels)   Frame rate (f/s)   Pixel rate (Mp/s)
PAL      720*576               25                 10.4
NTSC     720*480               29.97              10.4
HDTV     1920*1080             30.0               62.2
Common delivery forms:
• Analog (cable)
• USB
• FireWire
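The pixel rates listed for each video format follow from width x height x frame rate. A minimal sketch that reproduces them (the dictionary and function names are illustrative, not from the slides):

```python
# Pixel rate = width x height x frames per second, reported in Mpixels/s.
FORMATS = {
    "PAL": (720, 576, 25.0),
    "NTSC": (720, 480, 29.97),
    "HDTV": (1920, 1080, 30.0),
}


def pixel_rate_mps(width, height, fps):
    """Return the pixel rate in megapixels per second."""
    return width * height * fps / 1e6


for name, (w, h, fps) in FORMATS.items():
    print(f"{name}: {pixel_rate_mps(w, h, fps):.1f} Mp/s")
```

This is the sustained input rate a processing pipeline has to keep up with, before any per-pixel neighbourhood operations are counted.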
Image Processing Characteristics
• Need to maximize the available logic by supporting multiple N-D configurable devices
For example:
          1 2 1
Image  *  2 4 2
          1 2 1
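The example above can be read as a 2-D convolution of the image with a 3x3 weight mask. A minimal sketch under that reading follows. Dividing by 16 (the sum of the weights) and copying border pixels unchanged are assumptions, since the slide shows only the weights.

```python
# 3x3 mask from the slide; it is symmetric, so convolution and
# correlation give the same result.
KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]


def convolve3x3(image, kernel=KERNEL, divisor=16):
    """Apply the 3x3 mask to the interior pixels of a 2-D list of ints.

    Border pixels are copied unchanged (assumed boundary handling).
    divisor=16 is an assumed normalization (sum of the weights).
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            out[r][c] = acc // divisor
    return out
```

Each output pixel needs nine multiply-accumulates that are independent of every other output pixel, which is the kind of spatial parallelism an FPGA can exploit.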
Challenges
How to...?
• Appropriate partitioning of algorithms between hardware and software
• Exploiting spatial and temporal parallelism
• Integrating the configurable computer into the software framework
• Selecting a suitable configuration strategy
How shall we deal with these challenges?
Why SRAM-Based FPGAs? (Pros)
• Higher logic/storage capacity
  * Fast carry chain for adders/subtractors
  * Built-in XOR gates/LUT
  * Array of bit-parallel multipliers
  * Fast and local storage: array of SRAM blocks
  * Interconnect supports: three-state buffers/LUT
• Equivalent to fine-grained reconfigurable hardware
  * Finer-grained pipelining can help preserve the performance at low power supply voltage
• More mature CMOS manufacturing technology
Algorithm to FPGA Mapping Flow
[Figure: flow diagram. Box labels: Nested Do Loop Algorithm; Compilation; MODG Formulation; MODG; Space-Time Mapping; Cost Functions subject to Constraints; 1D Schedule Proc. Array; Inter-PE Pipelined Array; Intra-PE Pipelining; Fully Pipelined Array; New Mapping Matrix T1; High-level Synthesis; HDL Synthesis; PMPR Config. Generation]
The Matrix Multiplication MODG
[Figure: MODG for 3x3 matrix multiplication, with inputs a11..a33 and b11..b33, outputs c11..c33, and the c entries initialized to 0]
A number of different execution orders can be carried out to achieve the same algorithm.
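The point about execution order can be checked directly on the matrix multiplication example. The sketch below nests the i, j, k loops in all six possible orders and confirms that the result does not change. The function name and the test matrices are illustrative only.

```python
from itertools import permutations, product


def matmul_in_order(a, b, order):
    """Multiply square matrices, nesting the i, j, k loops in the given order.

    `order` is a sequence such as ("k", "i", "j"); the first name is the
    outermost loop because product() varies its last position fastest.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for point in product(range(n), repeat=3):
        idx = dict(zip(order, point))
        i, j, k = idx["i"], idx["j"], idx["k"]
        c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
results = [matmul_in_order(a, b, order) for order in permutations("ijk")]
print(all(r == results[0] for r in results))
```

With integer data every order gives an identical result. With floating-point data the sums could differ in the last bits, because addition is then no longer exactly associative.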
Nested Do Loop Algorithms and Inter-Iteration Dependence Graph
    Do i=1 to M
      Do j=1 to N
        c[i,j]=0;
        Do k=1 to K
          c[i,j]=c[i,j]+a[i,k]*b[k,j];
        EndDo k
      EndDo j
    EndDo i
Dependence vectors:
    $d_a = (i,j,k)^t = (0,1,0)^t$
    $d_b = (i,j,k)^t = (1,0,0)^t$
    $d_c = (i,j,k)^t = (0,0,1)^t$
Index space: $J^3 = \{(i,j,k)^t : 1 \le i,j,k \le 3\}$ ($M=N=K=3$)
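One way to read the dependence vectors: stepping from an index point along $d_a$, $d_b$, or $d_c$ reaches a point that touches the same element of a, b, or c. The sketch below enumerates the index space for M=N=K=3 and checks that property. The helper names are hypothetical and not part of the original material.

```python
from itertools import product

M = N = K = 3
# Index space J^3: all (i, j, k) with 1 <= i, j, k <= 3.
INDEX_SPACE = list(product(range(1, M + 1), range(1, N + 1), range(1, K + 1)))

# Dependence vectors as given on the slide.
DEPENDENCE = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}


def operand(var, point):
    """Element of `var` used at (i, j, k) by c[i,j] = c[i,j] + a[i,k]*b[k,j]."""
    i, j, k = point
    return {"a": (i, k), "b": (k, j), "c": (i, j)}[var]


def reuses_same_element(var):
    """True if every step along d_var stays on the same element of `var`."""
    d = DEPENDENCE[var]
    in_space = set(INDEX_SPACE)
    for p in INDEX_SPACE:
        q = tuple(pi + di for pi, di in zip(p, d))
        if q in in_space and operand(var, q) != operand(var, p):
            return False
    return True


print(len(INDEX_SPACE), [reuses_same_element(v) for v in "abc"])
```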
Inter-Iteration Data Dependence Graph (DG)
[Figure: dependence graph for 3x3 matrix multiplication, with inputs a11..a33 and b11..b33, outputs c11..c33, and c initialized to 0; each node takes a, b, and c and performs a multiply (X) followed by an add (+)]
Systolic Mapping (space-time) of Matrix Multiplication
[Figure: the 3-D DG (Dependence Graph) of the 3x3 matrix multiplication mapped onto a 2-D Processor Array; labels P and s mark the mapping, and D marks the delay elements between processing elements]
Systolic Mapping of Matrix Multiplication, cont.
[Figure: space-time snapshots of the a, b, and c entries moving through the 3x3 processor array, with the c values starting at 0; D marks the delay elements]
Why is Space-Time Mapping suitable for FPGAs?
• It can bridge nested Do loop signal/image processing algorithms to the processor array implementation.
• The space-time array matches the modular and regular FPGA structure.
• The localized/pipelined interprocessor links can overcome the long programmable interconnect delay.
• The size of configuration storage can be significantly reduced because of the almost identical processing elements and interconnect structure.
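To make the space-time idea concrete, here is a small sketch for the 3x3 matrix multiplication. The specific mapping is an assumption chosen for illustration: index point (i, j, k) runs on processor (i, j) at time i + j + k. The slides do not state which mapping matrix they use. The code checks that every dependence moves forward in time and that no processor is asked to do two operations in the same time step.

```python
from itertools import product

# Dependence vectors of c[i,j] = c[i,j] + a[i,k]*b[k,j].
DEPENDENCE = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}
S = (1, 1, 1)  # assumed schedule vector: time = i + j + k


def processor(point):
    """Assumed space mapping: project along k onto a 2-D array."""
    i, j, _k = point
    return (i, j)


def time_step(point):
    return sum(s * x for s, x in zip(S, point))


def check_mapping(n=3):
    points = list(product(range(1, n + 1), repeat=3))
    # Causality: each dependence must advance time by at least one step.
    causal = all(
        sum(s * d for s, d in zip(S, dep)) > 0 for dep in DEPENDENCE.values()
    )
    # No conflicts: one operation per processor per time step.
    slots = {(processor(p), time_step(p)) for p in points}
    times = {time_step(p) for p in points}
    return {
        "causal": causal,
        "conflict_free": len(slots) == len(points),
        "num_pes": len({processor(p) for p in points}),
        "num_steps": max(times) - min(times) + 1,
    }


print(check_mapping())
```

Under this assumed mapping the a and b operands travel only between neighbouring processors and c stays in place, which illustrates the localized links mentioned above.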
Problems with Existing Design Methodologies/Tools
• The dependence graphs of many other algorithms are not uniform and must be predetermined by human designers.
• Existing methodologies
  * cannot handle these complex algorithms
  * use unrealistic cost functions (metrics)
• No built-in features of FPGAs have been incorporated.
• Longer interconnect delay in deep submicron CMOS technology
• Much lower hardware utilization due to programmable interconnect delay in FPGAs
There is another problem: speed
What is Intra-PE Pipelining?
[Figure: (a) three processing elements PE0, PE1, PE2 passing c along, each computing c=c+a_i x b_i (i=0,1,2) within one clock period; (b) the same PEs with an extra register d, so that each PE computes d=a_i x b_i and then c=c+d; the schedule of each version is shown against CLK]
• The interconnect delay of FPGAs results in an even longer clock period.
• To enhance the overall throughput, intra-iteration parallelism must be exploited.
• Example: a simple vector dot product array
• It can be observed that the utilization of each operator is increased.
• Of course, the control mechanism is more complex.
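The dot product example can be modelled cycle by cycle. The sketch below compares a PE chain that does the multiply and the add in one long clock cycle with a chain in which a register d separates the two operations. The delay values T_MUL and T_ADD are hypothetical numbers for illustration, and the schedule is one plausible reading of the figure, not a reproduction of it.

```python
T_MUL, T_ADD = 15.0, 10.0  # hypothetical operator delays in ns


def dot_single_cycle(a, b):
    """Each PE computes c = c + a_i * b_i within one clock cycle."""
    c, cycles = 0, 0
    for a_i, b_i in zip(a, b):
        c = c + a_i * b_i
        cycles += 1
    return c, cycles


def dot_intra_pe_pipelined(a, b):
    """Each PE is split by register d: d = a_i * b_i, then c = c + d.

    In cycle t, PE t multiplies while PE t-1 adds the product that was
    registered in the previous cycle.
    """
    n = len(a)
    d = [0] * n
    c, cycles = 0, 0
    for t in range(n + 1):
        if t >= 1:
            c = c + d[t - 1]      # add stage of PE t-1
        if t < n:
            d[t] = a[t] * b[t]    # multiply stage of PE t
        cycles += 1
    return c, cycles


a, b = [1, 2, 3], [4, 5, 6]
c1, n1 = dot_single_cycle(a, b)
c2, n2 = dot_intra_pe_pipelined(a, b)
# Clock period is set by the slowest stage in each design.
print(c1, n1 * (T_MUL + T_ADD), c2, n2 * max(T_MUL, T_ADD))
```

The pipelined version needs one more cycle, but each cycle only has to cover the slower of the two operators, so the clock period shrinks and both operators are kept busy when a stream of vectors is processed.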
Examples of Nested Do Loop Algorithms
• Motion estimation
  * One of the most time-consuming operations (tasks) in digital video compression
• Stereo matching
  * Used to build a disparity map for 3D robot/computer navigation
• Matrix/Vector Multiplication
  * FFT, DCT, 2D/3D graphics, etc.
• 2D Linear Transform/Operations
  * 2D FFT, 2D DCT, etc.
Tennis frame 0
[Figure: previous frame]
Tennis frame 1
[Figure: current frame]
Motion Vectors of 8x8-Pixel Blocks
[Figure: motion vector field of frame 1]
Reconstructed Frame 1 from Frame 0 and Motion Vectors
[Figure: motion compensated frame]
Illustration of Full Search Block Matching Motion Estimation (6-level nested Do loop)
[Figure: a block of the current frame, x, is compared against candidate blocks of the previous frame, y, at search positions m=0..2 and n=0..2; pixels are indexed by i and j; motion vector=(m,n)]
Example: A Simpler PE Microarchitecture
[Figure: PE datapath built from an |x-y| unit, an adder (+), a Min unit, AND gates, select signals Sel1 and Sel2, and registers (Reg); signals labeled x2(l-1,k), x2(l,k), y2(l,k), MAD(l,N2-1), Dmin(l-1,N2-1), and Dmin(l,N2-1) = Min(Dmin(l-1,N2-1), MAD(l,N2-1))]
$\mathrm{MAD}(m,n) = \mathrm{MAD}(m,n) + |x(hN+i,\,vN+j) - y(hN+i+m-p,\,vN+j+n-p)|$
• Xilinx Core Generator System
• Critical path delay = 25 ns, based on Xilinx Virtex data
• 1,500-2,000 equivalent gate count
• Critical path (blue line) can be shortened further by Intra-PE pipelining
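For reference, the MAD recurrence above can be written out in software as the six-level nested loop it describes: over h, v (block position), m, n (search position), and i, j (pixel within the block). This is a plain sequential sketch, not the PE array itself. Skipping candidate blocks that fall outside the frame and reporting the vector as the displacement (m-p, n-p) are assumptions made for the example.

```python
def full_search_block_matching(x, y, N, p):
    """Full search block matching with an NxN block and search range p.

    x is the current frame and y the previous frame, both 2-D lists
    indexed in the same order as x(.,.) and y(.,.) in the recurrence.
    Returns {(h, v): (m - p, n - p)} for the candidate with minimum MAD.
    """
    H, V = len(x) // N, len(x[0]) // N
    vectors = {}
    for h in range(H):
        for v in range(V):
            best = None
            for m in range(2 * p + 1):
                for n in range(2 * p + 1):
                    u0, w0 = h * N + m - p, v * N + n - p
                    # Assumed boundary rule: skip candidates outside y.
                    if u0 < 0 or w0 < 0 or u0 + N > len(y) or w0 + N > len(y[0]):
                        continue
                    mad = 0
                    for i in range(N):
                        for j in range(N):
                            mad = mad + abs(
                                x[h * N + i][v * N + j]
                                - y[h * N + i + m - p][v * N + j + n - p]
                            )
                    if best is None or mad < best[0]:
                        best = (mad, (m - p, n - p))
            vectors[(h, v)] = best[1]
    return vectors
```

The two innermost loops are the accumulation performed inside a PE, and the running minimum corresponds to the Dmin path in the microarchitecture figure.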
Significance of the Contributions
• The MODG representation for nested Do loop algorithms
  * The actual execution is not constrained to any predetermined order.
  * keeps track of every variable instance so that there is no redundant memory access, to save I/O, bandwidth, and power consumption.
  * can be automated using memory .
• Without the MODG,
  * the motion estimation and many other nested Do loop algorithms can be written in many different DGs,
  * a human must be involved to formulate a DG,
  * the built-in ROM/RAM of the FPGA may not be exploited, and
Significance of the Contributions, cont.
• Space-Time mapping for the MODG can be applied to
  * any SRAM-based FPGA architecture
• Constraints and practical cost functions
  * any coarse-grained architecture
• Intra-PE pipelining
  * enhances/preserves the throughput rate in low power mode.
Conclusion
• Users demand more communication/multimedia processing capabilities on resource-limited Internet appliances.
• Reconfigurable SOC is the ultimate solution for designing the challenging low-power/high-performance platform.
• Its success relies on the embedded high-density FPGA core as reconfigurable (programmable) accelerating hardware.
• As technology (supply voltage) scales down, logic (transistors) is virtually free while the interconnect becomes the bottleneck and the main power consumer.
• Parallel execution of nested Do loop algorithms by an array of localized processing elements at moderate clock frequency is a viable solution.
• It can strike a compromise among the three main issues: design time, power consumption, and performance.
Future Trends
• Memory (storage) organization should be investigated, because multiple reads per clock cycle are needed to sustain such high throughput.
• The control mechanism of the entire array is one of the aspects that will determine its success.
• A given MODG may need to be partitioned so that the resulting array fits the on-chip reconfigurable FPGA core.