Parallel Beam Back Projection: Implementation

Srdjan Coric

Miriam Leeser

Eric Miller

Outline

• Annapolis Wildstar
• “Simple Architecture”
  – algorithm
  – datapath
  – Performance
  – Results
• Parallelism extraction
• “Advanced Architecture 4x”
  – datapath
  – Performance
  – Results
  – Implementation issues
• Future directions

Data Flow

[Figure: data-flow diagram. Blocks: sinogram data prefetch, sinogram data address generation, sinogram data retrieval, linear interpolation, data read, data accumulation, data write.]
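The stages named on this slide can be illustrated in software. The following is a minimal floating-point sketch of pixel-driven parallel-beam back projection, not the hardware design itself: the function name `backproject`, its parameters, and the choice to skip samples that fall outside the detector array are assumptions made for illustration.

```python
import math


def backproject(sinogram, angles, image_size, pixel_size=1.0, detector_spacing=1.0):
    """Pixel-driven back projection with linear interpolation (illustrative only).

    sinogram: one list of detector samples per projection angle.
    angles:   projection angles in radians, same length as sinogram.
    """
    n_det = len(sinogram[0])
    center = (image_size - 1) / 2.0
    image = [[0.0] * image_size for _ in range(image_size)]
    for proj, theta in zip(sinogram, angles):
        c, s = math.cos(theta), math.sin(theta)
        for row in range(image_size):
            y = (row - center) * pixel_size
            for col in range(image_size):
                x = (col - center) * pixel_size
                # Address generation: detector coordinate hit by this pixel.
                t = (x * c + y * s) / detector_spacing + (n_det - 1) / 2.0
                if t < 0 or t > n_det - 1:
                    continue  # assumption: ignore rays that miss the detector
                # Data retrieval: the two neighbouring detector samples.
                i = min(int(t), n_det - 2)
                frac = t - i
                # Linear interpolation between the two samples.
                value = proj[i] + frac * (proj[i + 1] - proj[i])
                # Data read, accumulation and write of the running pixel sum.
                image[row][col] += value
    return image


if __name__ == "__main__":
    ramp = [[float(i) for i in range(8)]]  # one projection whose samples are 0..7
    img = backproject(ramp, [0.0], 4)
    print(img[0][0])  # 2.0
```

With a single projection at angle 0 and a ramp-shaped sinogram, each pixel simply receives the interpolated detector value at its horizontal position, which makes the interpolation step easy to check by hand.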

Critical error-accumulation path

[Figure: error sources along the path: LUT1 (starting position) quantization error, corner starting position, LUT2 quantization error, LUT3 quantization error, interpolation factor error, bit reduction error.]

LUT1: $\frac{x_0\cos\theta + y_0\sin\theta}{\Delta_{detector}} + \frac{N_{detectors}-1}{2}$

LUT2: $\frac{\Delta_{pixel}}{\Delta_{detector}}\sin\theta$

LUT3: $\frac{\Delta_{pixel}}{\Delta_{detector}}\cos\theta$
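A small sketch can show how three per-angle tables allow detector addresses to be generated incrementally, and how table quantization error accumulates along a row and column. It assumes that LUT1 holds the detector position of the corner pixel, LUT2 the per-row step (sine term) and LUT3 the per-column step (cosine term). The function names, the single `frac_bits` width shared by all three tables, and the image and detector sizes are hypothetical.

```python
import math


def build_luts(angles, n_pixels, n_det, pixel_size, det_spacing, frac_bits):
    """Return (lut1, lut2, lut3) as fixed-point integers, one entry per angle."""
    scale = 1 << frac_bits
    corner = -(n_pixels - 1) / 2.0 * pixel_size  # x0 = y0: centre of the corner pixel
    lut1, lut2, lut3 = [], [], []
    for theta in angles:
        c, s = math.cos(theta), math.sin(theta)
        start = (corner * c + corner * s) / det_spacing + (n_det - 1) / 2.0
        lut1.append(round(start * scale))                         # starting position
        lut2.append(round(pixel_size * s / det_spacing * scale))  # step per row
        lut3.append(round(pixel_size * c / det_spacing * scale))  # step per column
    return lut1, lut2, lut3


def worst_address_error(angles, n_pixels, n_det, pixel_size=1.0,
                        det_spacing=1.0, frac_bits=15):
    """Largest gap between incremental fixed-point and direct float addresses."""
    scale = 1 << frac_bits
    lut1, lut2, lut3 = build_luts(angles, n_pixels, n_det,
                                  pixel_size, det_spacing, frac_bits)
    corner = -(n_pixels - 1) / 2.0 * pixel_size
    worst = 0.0
    for k, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        row_start = lut1[k]
        for row in range(n_pixels):
            addr = row_start
            for col in range(n_pixels):
                x = corner + col * pixel_size
                y = corner + row * pixel_size
                exact = (x * c + y * s) / det_spacing + (n_det - 1) / 2.0
                worst = max(worst, abs(addr / scale - exact))
                addr += lut3[k]      # next pixel in the row: one addition
            row_start += lut2[k]     # next row: one addition
    return worst


def error_bound(n_pixels, frac_bits):
    """Each rounded table entry is off by at most half an LSB; the farthest
    pixel uses the start value once and each step (n_pixels - 1) times."""
    return (2 * n_pixels - 1) * 0.5 / (1 << frac_bits)


if __name__ == "__main__":
    angles = [k * math.pi / 16 for k in range(16)]
    print(worst_address_error(angles, 32, 64), error_bound(32, 15))
```

The bound grows linearly with image size, which is why the accumulation path is the one to watch when choosing table widths.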

“Simple Architecture” Datapath

[Figure: datapath block diagram. Blocks: LUT 1, LUT 2, LUT 3, projection counter, write address counter, even and odd RAMs, local RAMs, mezzanine RAM, and SUB, MULT, ADD, MUX, DEMUX, SWAP and ROUND units. Annotated signal widths range from 4 to 25 bits.]

Performance Results: Software vs. FPGA Hardware

A Software - Floating point - 450 MHz Pentium : ~ 240 s

B Software - Floating point - 1 GHz Dual Pentium : ~ 94 s

C Software - Fixed point - 450 MHz Pentium : ~ 50 s

D Software - Fixed point - 1 GHz Dual Pentium : ~ 28 s

E Hardware - 50 MHz : ~ 5.4 s

[Figure: bar chart of the execution times of A to E, vertical axis from 0 to 250.]

Parameters:
1024 projections
1024 samples per projection
512*512 pixels image
9-bit sinogram data
3-bit interpolation factor
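The hardware time can be cross-checked against the parameters with a short calculation. The sketch below assumes the datapath completes one pixel update per projection per clock cycle and ignores pipeline fill and host transfers; that assumption and the function name `estimated_runtime_s` are not stated on the slide.

```python
def estimated_runtime_s(projections, image_side, clock_hz, updates_per_cycle=1):
    """Run time if every (pixel, projection) pair costs 1/updates_per_cycle cycles."""
    updates = projections * image_side * image_side
    cycles = updates / updates_per_cycle
    return cycles / clock_hz


if __name__ == "__main__":
    t = estimated_runtime_s(projections=1024, image_side=512, clock_hz=50e6)
    print(f"{t:.2f} s")  # 5.37 s
```

Under that assumption the estimate of about 5.37 s is consistent with the reported ~5.4 s for the 50 MHz hardware.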

[Figure: original image beside hardware output image. Zoom: ~200%. Grayscale range < pixel value range (heart features in focus).]

[Figure: original image beside hardware output image. Zoom: ~200%. Grayscale range < pixel value range (lung features in focus).]

[Figure: original image - hardware output image.]

[Figure: three volumes V1, V2 and V3 drawn on axes of image rows, image columns and projections.]

Case 1: No parallelism extracted: T ~ k1 V1
Case 2: Pixel level parallelism extracted: T ~ k1 V2
Case 3: Projection level parallelism extracted: T ~ k2 V3

k1 < k2, V2 = V3 = V1/4, T = execution time

Parallelism Issues

Memory bandwidth requirements at 50 MHz (for data accumulation)

Case 1: 0.4 GB/s
Case 2: 1.6 GB/s
Case 3: 0.4 GB/s

Memory bandwidth limit: 1.2 GB/s
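These three figures can be reproduced with a simple model. The sketch assumes each accumulation reads and writes one 4-byte running sum per pixel per cycle, and that projection-level parallelism combines its four contributions before a single accumulation. The word size, the function name and the per-case pixel counts are assumptions chosen to match the slide.

```python
def accumulation_bandwidth_gb_s(clock_hz, pixels_per_cycle, bytes_per_word=4):
    """Bandwidth for one read plus one write of the running sum per pixel."""
    return clock_hz * pixels_per_cycle * 2 * bytes_per_word / 1e9


LIMIT_GB_S = 1.2  # memory bandwidth limit quoted on the slide

# Pixels whose running sums are updated in each clock cycle (assumed).
CASES = {
    "Case 1 (no parallelism)": 1,
    "Case 2 (4 pixels in parallel)": 4,
    "Case 3 (4 projections in parallel)": 1,
}

if __name__ == "__main__":
    for name, pixels in CASES.items():
        need = accumulation_bandwidth_gb_s(50e6, pixels)
        verdict = "exceeds limit" if need > LIMIT_GB_S else "within limit"
        print(f"{name}: {need:.1f} GB/s, {verdict}")
```

Only the pixel-parallel case multiplies the accumulation traffic, so it is the only one that crosses the 1.2 GB/s limit.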

Advanced Architecture - Data Path (projection parallelism extracted)

[Figure: datapath block diagram of the advanced architecture, shown together with the “Simple Architecture” datapath. Blocks: LUT 1.1 through LUT 4.3, projection counter, even and odd write counters, banks of even and odd RAMs, local RAMs, left and right mezzanine RAMs, and SUB, MULT, ADD, MUX, DEMUX, SWAP and ROUND units. Annotated signal widths range from 4 to 25 bits.]

Performance Results: Software vs. FPGA Hardware

A Software - Floating point - 450 MHz Pentium : ~ 240 s

B Software - Floating point - 1 GHz Dual Pentium : ~ 94 s

C Software - Fixed point - 450 MHz Pentium : ~ 50 s

D Software - Fixed point - 1 GHz Dual Pentium : ~ 28 s

E Hardware - 50 MHz : ~ 5.4 s

F Hardware (Advanced Architecture) - 50 MHz : ~ 1.3 s

[Figure: bar chart of the execution times of A to F, vertical axis from 0 to 250.]

Parameters:
1024 projections
1024 samples per projection
512*512 pixels image
9-bit sinogram data
3-bit interpolation factor
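The relative gains follow directly from the times listed for A to F. The sketch below only divides the reported figures; the dictionary keys mirror the labels on the slide and the helper name `speedup` is hypothetical.

```python
# Reported execution times in seconds, copied from the list of results.
TIMES_S = {"A": 240.0, "B": 94.0, "C": 50.0, "D": 28.0, "E": 5.4, "F": 1.3}


def speedup(baseline, target, times=TIMES_S):
    """How many times faster `target` is than `baseline`."""
    return times[baseline] / times[target]


if __name__ == "__main__":
    for base in ("A", "D", "E"):
        print(f"F vs {base}: {speedup(base, 'F'):.1f}x")
```

From these figures the advanced architecture is roughly 185 times faster than the slowest software run (A), about 22 times faster than the fastest software run (D), and about 4.2 times faster than the simple architecture (E), in line with its 4-way parallelism.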

Implementation Issues - fanout -

prj_num(3): fanout = 1565!
routing delay = 7.913 ns (~39.99%)

odd_2_A_4[4]: fanout = 144!

Memory Bridges Stuff

3 architectures implemented:
A “Simple Architecture” = non-parallel (on slide 6)
B “Advanced Architecture” = 4-way parallel (slide 12)
C “Bridge Free Advanced Arch” = as B, but with no memory bridges (all design buffers in BlockRAMs) from the PCI bus to the memory banks, which are required for host-memory communication. The bridges are a separate design that is downloaded before (after) design C is downloaded, so that input data can be stored to (output data read from) the memories on the WildStar board.

Virtex1000 resource utilization:
A 11% logic, 90% BlockRAMs (with bridges)
B 39% logic, 100% BlockRAMs
C 21% logic, 100% BlockRAMs

Floorplan of the “Bridge Free Advanced Architecture”

(design C on the previous slide)

Future Directions

• Graduate
