Parallel Beam Back Projection: Implementation
Srdjan Coric
Miriam Leeser
Eric Miller
Outline
• Annapolis Wildstar
• “Simple Architecture”
  – algorithm
  – datapath
  – Performance
  – Results
• Parallelism extraction
• “Advanced Architecture 4x”
  – datapath
  – Performance
  – Results
  – Implementation issues
• Future directions
Data Flow
• Sinogram data address generation
• Sinogram data retrieval
• Linear interpolation
• Data accumulation
• Data write
• Data read
• Sinogram data prefetch
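Critical error-accumulation path
• LUT1 quantization error (corner starting position)
• LUT2 quantization error
• LUT3 quantization error
• Bit reduction error
• Interpolation factor error
[Equations: the numeric error bounds for LUT1, LUT2, and LUT3 are not recoverable from the transcript]
LUT contents:
LUT1: starting position of the corner pixel, built from $x_0 \cos\theta$, $y_0 \sin\theta$, the detector spacing $\Delta_{detector}$, and the offset $(N_{detectors} - 1)/2$
LUT2: $(\Delta_{pixel} / \Delta_{detector}) \sin\theta$
LUT3: $(\Delta_{pixel} / \Delta_{detector}) \cos\theta$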
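The role of the three lookup tables can be illustrated with a minimal floating-point sketch. It is not the hardware design, which works in fixed point. The function names, the sign convention (detector position = x cos θ + y sin θ), the even spacing of projection angles over 180 degrees, and the unit pixel and detector sizes are all assumptions made for illustration. LUT1 holds the detector position of the corner pixel, and LUT2 and LUT3 hold the per-row and per-column increments, so the inner loop needs only additions plus one subtract, multiply, and add for the linear interpolation.

```python
import math

def build_luts(n_proj, n_det, n_pix, pixel_size=1.0, det_size=1.0):
    """Per-projection tables in detector units (sign convention is an assumption)."""
    ratio = pixel_size / det_size
    half = (n_pix - 1) / 2.0
    x0 = y0 = -half * pixel_size          # centre of the corner pixel
    lut1, lut2, lut3 = [], [], []
    for p in range(n_proj):
        theta = math.pi * p / n_proj      # assumed: angles evenly spaced over 180 degrees
        c, s = math.cos(theta), math.sin(theta)
        lut1.append((x0 * c + y0 * s) / det_size + (n_det - 1) / 2.0)  # corner start
        lut2.append(ratio * s)            # step when moving one image row
        lut3.append(ratio * c)            # step when moving one image column
    return lut1, lut2, lut3

def back_project(sinogram, n_pix):
    """Accumulate linearly interpolated sinogram samples into an n_pix x n_pix image."""
    n_proj, n_det = len(sinogram), len(sinogram[0])
    lut1, lut2, lut3 = build_luts(n_proj, n_det, n_pix)
    image = [[0.0] * n_pix for _ in range(n_pix)]
    for p in range(n_proj):
        row_start = lut1[p]
        for r in range(n_pix):
            pos = row_start
            for col in range(n_pix):
                k = math.floor(pos)       # integer part: sinogram address
                if 0 <= k < n_det - 1:
                    frac = pos - k        # fractional part: interpolation factor
                    s0, s1 = sinogram[p][k], sinogram[p][k + 1]
                    image[r][col] += s0 + frac * (s1 - s0)   # SUB, MULT, ADD
                pos += lut3[p]
            row_start += lut2[p]
    return image

# Usage: a uniform sinogram back-projects to a uniform image.
uniform = [[1.0] * 16 for _ in range(8)]
result = back_project(uniform, 8)
```

With 8 projections of constant value 1.0 and a detector wide enough to cover the image diagonal, every pixel receives one contribution per projection.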
“Simple Architecture” Datapath
[Figure: datapath block diagram. Labeled blocks: LUT 1, LUT 2, LUT 3, projection counter, write address counter, even and odd RAMs, local RAMs, mezzanine RAM, and SUB, MULT, ADD, ROUND, SWAP, MUX, and DEMUX units. Annotated bus widths range from 4 to 25 bits.]
Performance Results: Software vs. FPGA Hardware
A Software - Floating point - 450 MHz Pentium: ~240 s
B Software - Floating point - 1 GHz Dual Pentium: ~94 s
C Software - Fixed point - 450 MHz Pentium: ~50 s
D Software - Fixed point - 1 GHz Dual Pentium: ~28 s
E Hardware - 50 MHz: ~5.4 s
[Figure: bar chart of execution time for A to E, axis from 0 to 250]
Parameters:
• 1024 projections
• 1024 samples per projection
• 512 × 512 pixel image
• 9-bit sinogram data
• 3-bit interpolation factor
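To make the 9-bit sinogram data and the 3-bit interpolation factor concrete, here is a hedged integer-only sketch of one interpolation step. The function names are hypothetical, and truncating the fraction (rather than rounding it) is an assumption; the slides do not say which the hardware uses.

```python
SINO_BITS = 9   # sinogram sample width, from the parameter list
FRAC_BITS = 3   # interpolation factor width, from the parameter list

def quantize_fraction(frac):
    """Reduce a fraction in [0, 1) to FRAC_BITS bits (assumed: truncation)."""
    return int(frac * (1 << FRAC_BITS))

def interpolate_fixed(s0, s1, frac_q):
    """Linear interpolation between two unsigned samples using integers only.

    The result is scaled by 2**FRAC_BITS, so no precision is discarded here.
    """
    assert 0 <= s0 < (1 << SINO_BITS) and 0 <= s1 < (1 << SINO_BITS)
    assert 0 <= frac_q < (1 << FRAC_BITS)
    return (s0 << FRAC_BITS) + (s1 - s0) * frac_q

# Usage: halfway between samples 100 and 108.
scaled = interpolate_fixed(100, 108, quantize_fraction(0.5))
value = scaled / (1 << FRAC_BITS)
```

Keeping the result scaled defers the division, which is one common way to avoid losing bits before accumulation; whether the actual datapath does this is not stated in the slides.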
[Figure: original image and hardware output image. Zoom: ~200%; grayscale range < pixel value range (heart features in focus)]
[Figure: original image and hardware output image. Zoom: ~200%; grayscale range < pixel value range (lung features in focus)]
[Figure: difference image, original image - hardware output image]
[Figure: three volumes V1, V2, V3 with axes image rows, image columns, and projections]
Case 1: No parallelism extracted: T ~ k1·V1
Case 2: Pixel level parallelism extracted: T ~ k1·V2
Case 3: Projection level parallelism extracted: T ~ k2·V3
k1 < k2, V2 = V3 = V1/4, T = execution time
Parallelism Issues
Memory bandwidth requirements at 50 MHz (for data accumulation):
• Case 1: 0.4 GB/s
• Case 2: 1.6 GB/s
• Case 3: 0.4 GB/s
Memory bandwidth limit: 1.2 GB/s
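One way to reproduce these bandwidth figures is sketched below. The breakdown is an assumption, not something the slides state: each accumulated pixel is taken to be stored in a 4-byte word and to need one read and one write per update, with one update per clock cycle per parallel pixel stream. Under that assumption, pixel-level parallelism multiplies the accumulation traffic by four, while projection-level parallelism sums four projections into the same pixel and leaves the traffic unchanged.

```python
CLOCK_HZ = 50e6            # from the slide
BYTES_PER_WORD = 4         # assumption: accumulated pixel stored as a 32-bit word
ACCESSES_PER_UPDATE = 2    # assumption: one read plus one write per accumulation
LIMIT_GB_S = 1.2           # memory bandwidth limit from the slide

def accumulation_bandwidth_gb_s(pixels_per_cycle):
    """Accumulation traffic in GB/s for a given number of pixels updated per cycle."""
    bytes_per_cycle = pixels_per_cycle * ACCESSES_PER_UPDATE * BYTES_PER_WORD
    return CLOCK_HZ * bytes_per_cycle / 1e9

# Pixels updated per cycle in each case (assumed mapping to the three cases).
cases = {"Case 1": 1, "Case 2": 4, "Case 3": 1}
bandwidth = {name: accumulation_bandwidth_gb_s(n) for name, n in cases.items()}
feasible = {name: bw <= LIMIT_GB_S for name, bw in bandwidth.items()}
```

Under these assumptions only Case 2 exceeds the 1.2 GB/s limit, which is consistent with the choice of projection-level parallelism for the advanced architecture.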
Advanced Architecture - Data Path (projection parallelism extracted)
[Figure: datapath block diagram of the 4-way parallel design, with the Simple Architecture datapath shown alongside for comparison. Labeled blocks: LUT 1.1 to LUT 4.3, projection counter, even and odd write counters, even and odd RAMs, local RAMs, left and right mezzanine RAMs, and SUB, MULT, ADD, ROUND, SWAP, MUX, and DEMUX units. Annotated bus widths range from 4 to 25 bits.]
Performance Results: Software vs. FPGA Hardware
A Software - Floating point - 450 MHz Pentium: ~240 s
B Software - Floating point - 1 GHz Dual Pentium: ~94 s
C Software - Fixed point - 450 MHz Pentium: ~50 s
D Software - Fixed point - 1 GHz Dual Pentium: ~28 s
E Hardware - 50 MHz: ~5.4 s
F Hardware (Advanced Architecture) - 50 MHz: ~1.3 s
[Figure: bar chart of execution time for A to F, axis from 0 to 250]
Parameters:
• 1024 projections
• 1024 samples per projection
• 512 × 512 pixel image
• 9-bit sinogram data
• 3-bit interpolation factor
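The relative speedups implied by these timings can be worked out with a few lines of arithmetic. The dictionary below simply restates the approximate times from the slide; the helper name is hypothetical.

```python
# Approximate execution times in seconds, as listed on the slide.
times_s = {"A": 240.0, "B": 94.0, "C": 50.0, "D": 28.0, "E": 5.4, "F": 1.3}

def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return times_s[baseline] / times_s[candidate]

# Print the advanced architecture (F) against every other configuration.
for label in "ABCDE":
    print(f"F vs {label}: {speedup(label, 'F'):.1f}x")
```

Since the listed times are approximate, the ratios are too: roughly 4.2x for the advanced architecture over the simple one, and roughly 21.5x over the fastest software run (D).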
Implementation Issues - fanout -
prj_num(3): fanout = 1565!
routing delay = 7.913 ns (~39.99%)
odd_2_A_4[4]: fanout = 144!
Memory Bridges Stuff
3 architectures implemented:
A “Simple Architecture” = non-parallel (on slide 6)
B “Advanced Architecture” = 4-way parallel (slide 12)
C “Bridge Free Advanced Arch” = as B, but contains no memory bridges (all design buffers in BlockRAMs) between the PCI bus and the memory banks, which are required for Host-Memory communication.
The bridges are a separate design. It is downloaded before design C so that input data can be stored to the memories on the WildStar board, and again after design C so that output data can be read from them.
Virtex1000 resource utilization:
A 11% logic, 90% BlockRAMs (with bridges)
B 39% logic, 100% BlockRAMs
C 21% logic, 100% BlockRAMs
Floorplan of the “Bridge Free Advanced Architecture”
(design C on the previous slide)
Future Directions
• Graduate