Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems

DeVille #229 MAPLD 2005


Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George

High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida


Introduction

JPEG-2000 Encoding
  - State-of-the-art low bit-rate compression algorithm
  - Progressive transmission by quality, resolution, component, or spatial locality
  - Spatially random access to bitstream
  - Region of interest coding

Motivation for porting JPEG-2000 to RC systems
  - High-performance and low-cost solution is attractive for airborne and satellite imaging systems
  - Speedup readily available with fine-grain and coarse-grain parallelism opportunities

[Figure: JPEG-2000 encoder block diagram. Multicomponent Transform -> Discrete Wavelet Transform -> Quantization -> Tier-1 Encoding (compression) -> Tier-2 Encoding (packetization); label: EBCOT Algorithm]


Related Research

EBCOT encoder designs
  - Group of Column optimization method
  - Previous RC designs
      - Space systems prototype [5]
      - Scalable entropy encoder [6]
      - Dual processing elements architecture [7]

2D Discrete Wavelet Transform designs
  - Several mimic early VLSI designs [8, 9]
  - Multiple architecture design classifications [10]
      - Direct: 1D, transpose, perform another 1D
          - Intrinsically slow
      - Separate serial and parallel filters, or parallel row and parallel column filters
          - Processes along rows and columns
          - Represents significant performance improvement
      - Symmetrically extended
          - Improves processing efficiency, especially towards center of image


JPEG-2000 Encoder Design & Development

  - Software code profiling first used to determine effort distribution
      - Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
      - Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
  - Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image processing images (lena, etc.)

[Figure: JasPer Execution Time Profile. Stacked bars (0-100% of execution time) for water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, peppers.ras; series: MCT, FWT, QUANT, TIER1, TIER2]


Discrete Wavelet Transform (DWT)

Features
  - Second-most computationally intensive block in compression process
  - Transforms each component's tile data into coefficients
      - Reversible transform involves all integer operations
      - Represents high- and low-frequency components of image
      - Amenable to compression: results in better compression ratios
  - Recursive application yields frequency bands at multiple resolutions

Operation
  - 2D transform achieved by successively applying 1D transform in X and Y directions
  - Each 1D transform consists of
      - Filtering step
      - De-interleave step: reorganizing data into bands
  - Available data and functional parallelism can be exploited
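The filtering and de-interleave steps can be sketched in software. The slides state only that the reversible transform uses integer operations; the sketch below assumes the 5/3 lifting filter that JPEG-2000 specifies for its reversible path, with whole-sample symmetric extension at the borders. All function names are hypothetical, and this is an illustration, not the authors' hardware design.

```python
def _mirror(i, n):
    """Index after whole-sample symmetric extension of a length-n signal."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def dwt53_1d(x):
    """One level of the reversible 5/3 lifting transform on a list of ints.

    Returns (low, high), the de-interleaved low- and high-pass bands.
    """
    n = len(x)
    if n < 2:
        return list(x), []
    y = list(x)
    # Filtering, predict step: odd samples become high-pass coefficients.
    for i in range(1, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    # Filtering, update step: even samples become low-pass coefficients.
    for i in range(0, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    # De-interleave step: regroup the interleaved coefficients into bands.
    return y[0::2], y[1::2]


def idwt53_1d(low, high):
    """Inverse of dwt53_1d; undoes the lifting steps in reverse order."""
    n = len(low) + len(high)
    if n < 2:
        return list(low)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    for i in range(0, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    for i in range(1, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    return y


def dwt53_2d(tile):
    """One 2D decomposition level: 1D transform on columns, then on rows.

    The result keeps the low-low band in the top-left quadrant.
    """
    rows, cols = len(tile), len(tile[0])
    col_out = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        low, high = dwt53_1d([tile[r][c] for r in range(rows)])
        for r, value in enumerate(low + high):
            col_out[r][c] = value
    out = []
    for row in col_out:
        low, high = dwt53_1d(row)
        out.append(low + high)
    return out
```

Because every lifting step is an integer add and shift, `idwt53_1d(*dwt53_1d(x))` returns `x` exactly, which is what makes the transform usable for lossless coding. Applying `dwt53_2d` again to the top-left quadrant yields the next resolution level.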

[Figure: three-level DWT subband decomposition; bands a1HL, a1HH, a1LH, a2HL, a2LH, a2HH, a3HL, a3LH, a3HH, a3LL]


DWT Hardware Architecture

[Figure: DWT pipeline. Tile Data -> Input Buffer -> DWT Column -> Temp Buffer -> De-interleave Column -> Temp Buffer -> DWT Row -> Temp Buffer -> De-interleave Row -> Output Buffer; even and odd coefficients]

Challenges presented by DWT
  - Parallel processing limited by memory bandwidth requirements
  - Some sequential nature in processing involved

Design features
  - Data-level parallelism exploited by operating on multiple “tiles”
  - Function-level parallelism exploited by pipelining different processing steps
  - Data reuse eliminates extra read cycles

Internal architecture
  - Each tile is stored entirely in a single Block RAM to minimize data movement
  - Overlapped processing to further reduce latency


Embedded Block Coding with Optimized Truncation (EBCOT): Tier-1

Features
  - Specially adapted arithmetic coder
  - Four bit-plane coding primitives
  - Three coding passes for each bit-plane (except the most significant)

Operation
  - Coding passes: cleanup pass (CUP) begins at most significant bit plane
  - Iteratively perform coding passes over remaining bit planes
  - Coding-pass-generated context and bit data serially encoded and compressed by arithmetic encoder
  - Flush and reset arithmetic coder at completion
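The pass ordering described above can be written out as a short schedule generator. This is a minimal sketch: the function name is hypothetical, and the order of the three passes within a bit plane (significance propagation, magnitude refinement, cleanup) is taken from the JPEG-2000 standard rather than from the slides.

```python
def coding_pass_schedule(num_bit_planes):
    """List (bit_plane, pass_name) pairs in coding order, MSB plane first.

    Bit planes are numbered from 0 (least significant) upward.
    """
    if num_bit_planes < 1:
        return []
    msb = num_bit_planes - 1
    # The most significant bit plane gets a cleanup pass only.
    schedule = [(msb, "cleanup")]
    # Every remaining plane gets all three passes.
    for plane in range(msb - 1, -1, -1):
        schedule.append((plane, "significance propagation"))
        schedule.append((plane, "magnitude refinement"))
        schedule.append((plane, "cleanup"))
    return schedule


def bit_planes_needed(coefficients):
    """Number of magnitude bit planes needed for a codeblock's coefficients."""
    return max((abs(c) for c in coefficients), default=0).bit_length()
```

A codeblock with P magnitude bit planes therefore produces 3(P - 1) + 1 passes, all of which feed the single arithmetic coder in sequence.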


Tier-1 Encoding Hardware Architecture

Challenges presented by Tier-1 encoding
  - Serial process: creation of current MQ context data directly depends upon previous pass results
  - “Bursty” communication: contextual data from a pass arrives in short, semi-continuous bursts
  - Large amounts of data and flags must be stored through multiple iterations of algorithm, requiring high memory bandwidth

Internal architecture (high-level)
  - Retrieve current stripe from memory for processing
  - Data is operated on in a pipelined fashion through registers
  - Context and data information sent to queues
  - Serializing agent: arithmetic entropy encoder
  - MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
  - Data from arithmetic entropy encoder is written to a separate, final buffer

[Figure: Tier-1 encoder architecture. Blocks: Codeblock, Read buffer, Cleanup Pass, Magnitude Refinement Pass, Significance Propagation Pass, Pass Queues, MQ Input Controller, Arithmetic Entropy Encoder, Write buffer]

Design decision to use MQ encoder as serializing agent saves area and BlockRAM space without sacrificing too much performance.


Target HPEC Platform

[Figure: Nallatech BenNUEY with BenBLUE-II block diagram. 64-bit/66 MHz PCI connector; PCI FPGA (Xilinx Spartan2); BenNUEY User FPGA (Xilinx Virtex2 6000, -4); BenBLUE-II Primary FPGA and Secondary FPGA (Xilinx Virtex2 6000, -4); four ZBT SSRAM banks (2 MB, 2 MB, 4 MB, 4 MB) with address/control and data lines (widths 32, 32, 64, 64); PCI COMMS bus (32-bit data, 40 MHz); Local Bus (64-bit data, 66 MHz); Inter-FPGA communications bus (159 IO, user-defined clock)]

High-Performance Embedded Computing: Nallatech BenNUEY with BenBLUE-II
  - Three FPGAs (all Xilinx Virtex2 6000, -4)
      - Single “user” FPGA on BenNUEY PCI board
      - Dual FPGAs on BenBLUE-II daughter card
  - Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
  - Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)

Advantages/Disadvantages
  - High configuration time (PCI bus + chained JTAG interface)
  - Large memory storage helps alleviate strain on PCI bus
  - Very good IO interface support with proprietary tools

* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.


DWT Single FPGA Results

Results for single DWT module design for BenNUEY board operating at 80 MHz

                                Processing one tile (μs)   Processing eight tiles (μs)
DMA write time                  127                        1001
DMA read time                   80                         573
Computation time (part 1)       52                         56
Total time for FPGA solution    307                        2034
Time for software solution      130                        1043

Results for eight DWT modules design for BenNUEY board operating at 40 MHz

                                Processing eight tiles (μs)   Processing forty tiles (μs)
DMA write time                  758                           3750
DMA read time                   382                           1900
Total time for FPGA solution    1302                          6154
Time for software solution      1043                          5219

Note: software solution comes from execution on server with 2.4 GHz Xeon CPU
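The effect of DMA overhead on these DWT results can be worked out directly from the reported numbers. The sketch below is illustrative arithmetic only: it assumes that the non-DMA time equals the reported total minus the two DMA times, and the names `DWT_RUNS` and `analyze` are hypothetical.

```python
# (configuration, dma_write, dma_read, fpga_total, software), all in microseconds,
# copied from the DWT results reported above.
DWT_RUNS = [
    ("1 module, 1 tile", 127, 80, 307, 130),
    ("1 module, 8 tiles", 1001, 573, 2034, 1043),
    ("8 modules, 8 tiles", 758, 382, 1302, 1043),
    ("8 modules, 40 tiles", 3750, 1900, 6154, 5219),
]


def analyze(dma_write, dma_read, fpga_total, software):
    """Derive DMA share and speedups from one column of the results."""
    dma = dma_write + dma_read
    non_dma = fpga_total - dma  # assumption: everything that is not DMA
    return {
        "dma_share": dma / fpga_total,
        "speedup_with_dma": software / fpga_total,
        "speedup_without_dma": software / non_dma,
    }


if __name__ == "__main__":
    for name, *times in DWT_RUNS:
        r = analyze(*times)
        print(f"{name:20s} DMA share {r['dma_share']:.0%}  "
              f"speedup with DMA {r['speedup_with_dma']:.2f}x  "
              f"without DMA {r['speedup_without_dma']:.2f}x")
```

Under that assumption, DMA accounts for roughly two thirds to nine tenths of the FPGA total in every configuration, and the FPGA only beats software once DMA time is excluded (about 10x for eight modules on forty tiles).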

[Figure: Performance Comparison. Exec. time (μs, 0-2500) vs. tiles processed (1, 8); series: FPGA solution (without DMA), FPGA solution (with DMA), software solution]

[Figure: Performance Comparison. Exec. time (μs, 0-8000) vs. tiles processed (8, 40); series: FPGA solution (without DMA), FPGA solution (with DMA), software solution]

Resource Utilization on Virtex2 6000 -4

# of Modules     Slices        BRAMs
Single Module    1157 (3%)     6 (4%)
Eight Modules    5742 (17%)    48 (33%)


Tier-1 Encoding Current Results

Results for Tier-1 module design for BenNUEY board operating at 90 MHz

                    Single-module design processing   Eight-module design processing
                    one codeblock (μs)                one codeblock each (μs)
DMA write time      70                                218
DMA read time       49                                388
Computation time    175                               175
Total time          294                               781
Software time       276                               2189

Note: software solution comes from execution on server with 2.4 GHz Xeon processor

* Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3

# of modules    Slices          BlockRAMs
Single          3,527 (10%)     7 (5%)
Eight           25,556 (75%)    56 (38%)
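How the Tier-1 numbers scale from one module to eight can be checked with a few lines of arithmetic. This is a sketch over the reported values only; it assumes the eight-module run handles eight codeblocks in total (one per module), and the names `TIER1_RUNS` and `summarize` are hypothetical.

```python
# design: (codeblocks, dma_write, dma_read, computation, software), times in microseconds,
# copied from the Tier-1 results reported above.
TIER1_RUNS = {
    "single module": (1, 70, 49, 175, 276),
    "eight modules": (8, 218, 388, 175, 2189),
}


def summarize(codeblocks, dma_write, dma_read, computation, software):
    """Total FPGA time, amortized time per codeblock, and speedups over software."""
    total = dma_write + dma_read + computation
    return {
        "total_us": total,
        "us_per_codeblock": total / codeblocks,
        "speedup_with_dma": software / total,
        "speedup_compute_only": software / computation,
    }


if __name__ == "__main__":
    for design, values in TIER1_RUNS.items():
        s = summarize(*values)
        print(f"{design}: total {s['total_us']} us, "
              f"{s['us_per_codeblock']:.1f} us per codeblock, "
              f"{s['speedup_with_dma']:.2f}x with DMA, "
              f"{s['speedup_compute_only']:.2f}x compute only")
```

The recomputed totals match the reported 294 μs and 781 μs. With DMA included, the single module is slightly slower than software, while eight modules reach roughly 2.8x; counting computation alone, eight modules reach roughly 12.5x.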

Profiling shows performance projections with DMA transfer times included.

[Figure: projected execution time profile (0-100%) for water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, peppers.ras; series: MCT, FWT, QUANT, TIER1, TIER2]


Conclusions from HPEC Platform
  - Multi-chip system offers resources for increased parallelism or a multi-component application
  - Order of magnitude improvement in total computation time
      - Faster computation times on FPGA
      - But communication overhead severely hinders performance improvement
  - Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands


Target HPC Platform

High-Performance Computing: SGI Altix 350 with FPGA Brick
  - Single FPGA: Virtex2 6000 (-6 speed grade)
      - Approximately 33% of chip used for SGI’s RASC system layer
      - Two algorithm clock speeds: 200 MHz and 100 MHz
  - High bandwidth to system memory through proprietary NUMAlink interconnect (12.8 GB/s) through Scalable System Port (6.4 GB/s)
  - 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s for each read and write)

Advantages/Disadvantages
  - Extremely low reconfiguration time
  - High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2K

[Figure: SGI Altix w/ RASC extension block diagram. Algorithm FPGA; three 2 MB QDR SRAM banks (36-bit data lines, address and control lines); Loader FPGA; TIO; NUMAlink connectors (72-bit links); PCI 33 MHz; SelectMAP programming interface]

* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.


Performance Projections

[Figure: projected execution time profile (0-100%); series: MCT, FWT, QUANT, TIER1, TIER2. Profile shows projections for no-latency, infinite-bandwidth interconnect.]

NUMAlink interconnect
  - Approximate order-of-magnitude improvement of transfers in similar designs
  - Mitigates communication overhead bottleneck
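The order-of-magnitude figure for transfers is consistent with a comparison of peak interconnect bandwidths. The sketch below is a back-of-the-envelope check that assumes a nominal 66 MHz PCI clock with one 64-bit word per cycle and uses the 6.4 GB/s Scalable System Port figure quoted for the Altix; sustained rates on either platform will be lower. The function name is hypothetical.

```python
def peak_bus_bandwidth_gbs(width_bits, clock_mhz):
    """Theoretical peak of a parallel bus moving one word per clock, in GB/s."""
    return (width_bits / 8) * clock_mhz * 1e6 / 1e9


# HPEC platform: 64-bit/66 MHz PCI (nominal clock, an assumption of this sketch).
PCI_GBS = peak_bus_bandwidth_gbs(64, 66)

# HPC platform: figures quoted in the slides.
SSP_GBS = 6.4                      # Scalable System Port
QDR_TOTAL_GBS = 3 * (1.6 + 1.6)    # three banks, 1.6 GB/s each for read and write

if __name__ == "__main__":
    print(f"PCI peak: {PCI_GBS:.3f} GB/s")
    print(f"SSP / PCI ratio: {SSP_GBS / PCI_GBS:.1f}x")
    print(f"QDR SRAM aggregate: {QDR_TOTAL_GBS:.1f} GB/s")
```

The PCI peak comes out at about 0.53 GB/s, so the Scalable System Port offers roughly 12x more, and the three QDR banks add up to the quoted 9.6 GB/s.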


Lessons Learned and Conclusions

Lessons Learned
  - HW/SW codesign
      - Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
      - PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
  - Memory bandwidth constrains parallelism in DWT design
  - Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement

Conclusions
  - Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl’s Law)
  - Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
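The Amdahl's Law caveat can be made concrete with the profiling figure from earlier in the talk, where DWT and Tier-1 together account for more than 90% of execution time. The sketch below assumes exactly 90% and treats the accelerated portion as speeding up uniformly; the function name and the chosen speedup values are illustrative.

```python
def amdahl_speedup(accelerated_fraction, component_speedup):
    """Overall speedup when only part of the run time is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / component_speedup)


if __name__ == "__main__":
    fraction = 0.90  # assumed share of time spent in DWT and Tier-1
    for s in (2, 10, 100):
        print(f"{s:>3}x on DWT and Tier-1 -> {amdahl_speedup(fraction, s):.2f}x overall")
```

With 90% of the time accelerated, a 10x component speedup gives only about 5.3x overall, and no component speedup can exceed 10x overall, which is why moving MCT and Tier-2 onto the FPGA matters for further gains.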


Future Work and Acknowledgments

Future Work
  - Full system implementation on SGI Altix with RASC
  - Region of Interest capability
  - Lossy encoding and rate capability
  - MCT and Tier-2 encoding on FPGA as well
  - Single FPGA JPEG-2000 encoding application

Acknowledgments
  - We wish to thank the following vendors for equipment and/or tools in support of this research: SGI, Nallatech, Xilinx, Aldec
  - Special thanks to SGI Digital Media group and SGI RASC engineers for their help and suggestions


References

[1] Adams, M.D. and Ward, R.K., “JasPer: a portable flexible open-source software tool kit for image coding/processing”, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’04), pp. 241-244, May 2004.

[2] OpenJPEG. http://www.opegjpeg.org/

[3] Liu, L., Li, D., Li, Z., Wang, Z. and Chen, H., “A VLSI architecture of EBCOT encoder for JPEG2000”, in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.

[4] Chen, K., Lian, C., Chen, H., and Chen, L., “Analysis and architecture design of EBCOT for JPEG-2000”, in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.

[5] Van Buren, D., “A high-rate JPEG2000 compression system for space”, in IEEE Aerospace Conference, March 2005.

[6] Aouadi, I. and Hammami, O., “Analysis and hardware design of a scalable dual JPEG-2000 entropy coder”, in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.

[7] Gangadhar, M. and Bhatia, D., “FPGA based EBCOT architecture for JPEG 2000”, in IEEE International Conference on Field-Programmable Technology (FPT’03), pp. 228-233, Dec. 2003.

[8] Hung, K., Huang, Y., Truong, T., and Wang, C., “FPGA implementation for 2D discrete wavelet transform”, in Electronics Letters, pp. 639-640, April 1998.

[9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K. and Sriram, G., “Design and FPGA implementation of image block encoders with 2D-DWT”, in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.

[10] McCanny, P., Masud, S., and McCanny, J., “Design and implementation of the symmetrically extended 2-D wavelet transform”, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’02), vol. 3, pp. 3108-3111, May 2002.

[11] Taubman, D., “High performance scalable image compression with EBCOT”, in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.

[12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.

[13] Acharya, T. and Tsai, P.-S., JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.
