Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems
DeVille #229 MAPLD 2005
Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems
Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Laboratory
Department of Electrical and Computer Engineering
University of Florida

Introduction
- JPEG-2000 Encoding
  - State-of-the-art low bit-rate compression algorithm
  - Progressive transmission by quality, resolution, component, or spatial locality
  - Spatially random access to bitstream
  - Region of interest coding
- Motivation for porting JPEG-2000 to RC systems
  - High-performance and low-cost solution is attractive for airborne and satellite imaging systems
  - Speedup readily available with fine-grain and coarse-grain parallelism opportunities
[Figure: JPEG-2000 encoding pipeline. Multicomponent Transform, Discrete Wavelet Transform, Quantization, Tier-1 Encoding (compression), Tier-2 Encoding (packetization); Tier-1 and Tier-2 encoding are grouped as the EBCOT algorithm.]

Related Research
- EBCOT encoder designs
  - Group of Column optimization method
  - Previous RC designs
    - Space systems prototype [5]
    - Scalable entropy encoder [6]
    - Dual processing elements architecture [7]
- 2D Discrete Wavelet Transform designs
  - Several mimic early VLSI designs [8, 9]
  - Multiple architecture design classifications [10]
    - Direct: 1D, transpose, perform another 1D
      - Intrinsically slow
    - Separate serial and parallel filters, or parallel row, parallel column filters
      - Processes along rows and columns
      - Represents significant performance improvement
    - Symmetrically extended
      - Improves processing efficiency, especially towards center of image

JPEG-2000 Encoder Design & Development
- Software code profiling first used to determine effort distribution
  - Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
  - Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
- Benchmark images selected from the Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image processing images (lena, etc.)
[Figure: JasPer Execution Time Profile. Stacked bars (0% to 100% of execution time) split into MCT, FWT, QUANT, TIER1, and TIER2 for water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, and peppers.ras.]

Discrete Wavelet Transform (DWT)
- Features
  - Second-most computationally intensive block in compression process
  - Transforms each component tile data into coefficients
    - Reversible transform involves all integer operations
    - Represents high- and low-frequency components of image
    - Amenable to compression: results in better compression ratios
  - Recursive application yields frequency bands at multiple resolutions
- Operation
  - 2D transform achieved by successively applying 1D transform in X and Y directions
  - Each 1D transform consists of
    - Filtering step
    - De-interleave step: reorganizing of data in bands
  - Available data and functional parallelism can be exploited
[Figure: three-level DWT subband decomposition with bands a1HL, a1LH, a1HH, a2HL, a2LH, a2HH, a3HL, a3LH, a3HH, and a3LL.]
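
The filtering and de-interleave steps can be illustrated in software. The sketch below assumes the reversible transform is the integer 5/3 lifting filter with whole-sample symmetric extension, which is the usual reversible filter in JPEG-2000; the slides do not name the filter, and the function names are hypothetical. It is not the hardware design, only a reference model of one 1D level and of the column-then-row 2D ordering.

```python
def _mirror(i, n):
    """Whole-sample symmetric extension: reflect index i into [0, n)."""
    if i < 0:
        i = -i
    if i >= n:
        i = 2 * (n - 1) - i
    return i


def dwt53_forward(x):
    """One level of the integer 5/3 lifting transform on a list of ints.

    Returns the de-interleaved result: low-pass band first, then high-pass.
    """
    n = len(x)
    if n < 2:
        raise ValueError("need at least two samples")
    n_high = n // 2
    n_low = n - n_high
    # Filtering, predict step: odd samples become high-pass (detail) values.
    high = [x[2 * k + 1] - (x[2 * k] + x[_mirror(2 * k + 2, n)]) // 2
            for k in range(n_high)]

    def d(k):  # detail value with symmetric extension at both ends
        return high[min(max(k, 0), n_high - 1)]

    # Filtering, update step: even samples become low-pass values.
    low = [x[2 * k] + (d(k - 1) + d(k) + 2) // 4 for k in range(n_low)]
    return low + high  # de-interleave step


def dwt53_inverse(coeffs):
    """Exact inverse of dwt53_forward (integer arithmetic, no loss)."""
    n = len(coeffs)
    n_high = n // 2
    n_low = n - n_high
    low, high = coeffs[:n_low], coeffs[n_low:]

    def d(k):
        return high[min(max(k, 0), n_high - 1)]

    x = [0] * n
    for k in range(n_low):   # undo update step
        x[2 * k] = low[k] - (d(k - 1) + d(k) + 2) // 4
    for k in range(n_high):  # undo predict step
        x[2 * k + 1] = high[k] + (x[2 * k] + x[_mirror(2 * k + 2, n)]) // 2
    return x


def dwt53_2d(tile):
    """One 2D level: 1D transform on every column, then on every row."""
    cols = [dwt53_forward(list(c)) for c in zip(*tile)]
    rows = [list(r) for r in zip(*cols)]  # back to row-major order
    return [dwt53_forward(r) for r in rows]
```

Because every step uses only integer adds and floor divisions, `dwt53_inverse(dwt53_forward(x))` returns `x` exactly, which is what makes the transform usable for lossless coding.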

DWT Hardware Architecture
[Figure: DWT hardware pipeline. Blocks, in the order extracted: Input Buffer, DWT Column, Temp Buffer, De-interleave Column, Temp Buffer, DWT Row, Temp Buffer, De-interleave Row, Output Buffer. Labels: Tile Data, Even Coeff, Odd Coeff.]
- Challenges presented by DWT
  - Parallel processing limited by memory bandwidth requirements
  - Some sequential nature in processing involved
- Design features
  - Data-level parallelism exploited by operating on multiple "tiles"
  - Function-level parallelism exploited by pipelining different processing steps
  - Data reuse eliminates extra read cycles
- Internal architecture
  - Each tile is entirely stored in a single Block RAM to minimize data movement
  - Overlapped processing to further reduce latency

Embedded Block Coding with Optimized Truncation (EBCOT): Tier-1
- Features
  - Specially adapted arithmetic coder
  - Four bit-plane coding primitives
  - Three coding passes for each bit-plane (except the most significant)
- Operation
  - Coding passes: cleanup pass (CUP) begins at most significant bit plane
  - Iteratively perform coding passes over remaining bit planes
  - Coding-pass-generated context and bit data serially encoded and compressed by arithmetic encoder
  - Flush and reset arithmetic coder at completion
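
The pass ordering described above can be written out as a short sketch. It assumes the three passes are the significance propagation pass (SPP), magnitude refinement pass (MRP), and cleanup pass (CUP), run in that order on every bit plane below the most significant one; the function name is hypothetical and no actual coding is performed.

```python
def coding_pass_schedule(num_bitplanes):
    """List (bit plane, pass) pairs from the most significant plane down to 0.

    The most significant plane receives only the cleanup pass; every
    remaining plane receives SPP, MRP, then CUP.
    """
    if num_bitplanes < 1:
        return []
    msb = num_bitplanes - 1
    schedule = []
    for plane in range(msb, -1, -1):
        if plane != msb:
            schedule.append((plane, "SPP"))
            schedule.append((plane, "MRP"))
        schedule.append((plane, "CUP"))
    return schedule
```

For a code block with n bit planes this gives 3(n - 1) + 1 passes, all of which feed the single arithmetic coder in sequence.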

Tier-1 Encoding Hardware Architecture
- Challenges presented by Tier-1 encoding:
  - Serial process: creation of current MQ context data directly depends upon previous pass results
  - "Bursty" communication: contextual data from a pass arrives in short, semi-continuous bursts
  - Large amounts of data and flags must be stored through multiple iterations of algorithm, requiring high memory bandwidth
- Internal architecture (high-level)
  - Retrieve current stripe from memory for processing
  - Data is operated on in a pipelined fashion through registers
  - Context and data information sent to queues
  - Serializing agent: arithmetic entropy encoder
  - MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
  - Data from arithmetic entropy encoder is written to a separate, final buffer
[Figure: Tier-1 encoder block diagram. Codeblock read buffer; Significance Propagation Pass, Magnitude Refinement Pass, and Cleanup Pass units; Pass Queues; MQ Input Controller; Arithmetic Entropy Encoder; write buffer.]
Design decision to use MQ encoder as serializing agent saves area and BlockRAM space without sacrificing too much performance.
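
As a small illustration of what retrieving the "current stripe" involves, the sketch below enumerates samples in the standard JPEG-2000 code-block scan order: stripes of four rows, visited column by column, top to bottom within each column. The four-row stripe height comes from the JPEG-2000 standard rather than from these slides, and the function name is hypothetical.

```python
def stripe_scan_order(width, height, stripe_height=4):
    """Yield (row, col) positions of a code block in stripe scan order."""
    for top in range(0, height, stripe_height):
        bottom = min(top + stripe_height, height)  # last stripe may be short
        for col in range(width):
            for row in range(top, bottom):
                yield (row, col)
```

Each coding pass walks the code block in this order, so a stripe-sized working set is a natural unit to hold in registers while the passes are pipelined.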
Target HPEC Platform
[Figure: Nallatech BenNUEY board with BenBLUE-II daughter card. 64-bit/66 MHz PCI connector; PCI FPGA (Xilinx Spartan2); BenNUEY User FPGA, BenBLUE-II Primary FPGA, and BenBLUE-II Secondary FPGA (Xilinx Virtex2 6000, -4); four ZBT SSRAM banks (2 MB, 2 MB, 4 MB, 4 MB) with data widths 32, 32, 64, and 64 plus Addr & Ctrl lines. Bus labels, in the order extracted: PCI COMMS bus (32-bit data, 40 MHz); Local Bus; Inter-FPGA communications bus; (64-bit data, 66 MHz); (159 IO, user-defined clk).]
- High-Perf. Embedded Computing: Nallatech BenNUEY w/ BenBLUE-II
  - Three FPGAs (all Xilinx Virtex2 6000, -4)
    - Single "user" FPGA on BenNUEY PCI board
    - Dual FPGAs on BenBLUE-II daughter card
  - Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
  - Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)
- Advantages/Disadvantages
  - High configuration time (PCI bus + chained JTAG interface)
  - Large memory storage helps alleviate strain on PCI bus
  - Very good IO interface support with proprietary tools
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

DWT Single FPGA Results

Results for single DWT module design for BenNUEY board operating at 80 MHz (times in μs)

                                One tile   Eight tiles
DMA write time                       127          1001
DMA read time                         80           573
Computation time (part 1)             52            56
Computation time (part 2)             48           404
Total time for FPGA solution         307          2034
Time for software solution           130          1043

Results for eight DWT modules design for BenNUEY board operating at 40 MHz (times in μs)

                                Eight tiles   Forty tiles
DMA write time                          758          3750
DMA read time                           382          1900
Computation time (part 1)                80            80
Computation time (part 2)                82           424
Total time for FPGA solution           1302          6154
Time for software solution             1043          5219

Note: software solution comes from execution on server with 2.4 GHz Xeon CPU
[Figure: two "Performance Comparison" charts. Execution time (μs) versus tiles processed (1 and 8 in the first chart, 8 and 40 in the second) for three series: FPGA Solution (without DMA), FPGA Solution (with DMA), and Software Solution.]
Resource Utilization on Virtex2 6000 -4

# of Modules      Slices         BRAMs
Single Module     1157 ( 3%)      6 ( 4%)
Eight Modules     5742 (17%)     48 (33%)
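
The effect of DMA overhead on these DWT results can be made explicit with a small calculation. The sketch below only re-derives ratios from the timings reported in the result tables; the dictionary layout and function name are hypothetical.

```python
# Timings in microseconds, copied from the DWT result tables.
# "dma" = DMA write + DMA read; "compute" = computation part 1 + part 2.
DWT_RUNS = {
    "1 module @ 80 MHz, 1 tile":    {"dma": 127 + 80,    "compute": 52 + 48,  "software": 130},
    "1 module @ 80 MHz, 8 tiles":   {"dma": 1001 + 573,  "compute": 56 + 404, "software": 1043},
    "8 modules @ 40 MHz, 8 tiles":  {"dma": 758 + 382,   "compute": 80 + 82,  "software": 1043},
    "8 modules @ 40 MHz, 40 tiles": {"dma": 3750 + 1900, "compute": 80 + 424, "software": 5219},
}


def speedups(run):
    """Return (compute-only speedup, speedup with DMA included) over software."""
    compute_only = run["software"] / run["compute"]
    with_dma = run["software"] / (run["dma"] + run["compute"])
    return compute_only, with_dma


for name, run in DWT_RUNS.items():
    compute_only, with_dma = speedups(run)
    print(f"{name}: {compute_only:.2f}x compute only, {with_dma:.2f}x with DMA")
```

With eight modules and forty tiles the computation alone is roughly ten times faster than software, yet once DMA time is included every configuration is slower than software, which is the communication bottleneck the charts show.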
Tier-1 Encoding Current Results
* Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3

Results for Tier-1 module design for BenNUEY board operating at 90 MHz (times in μs)

                      Single-module design,   Eight-module design,
                      one codeblock           one codeblock each
DMA write time                  70                     218
DMA read time                   49                     388
Computation time               175                     175
Total time                     294                     781
Software time                  276                    2189

Note: software solution comes from execution on server with 2.4 GHz Xeon processor

# of modules     Slices           BlockRAMs
Single            3,527 (10%)      7 ( 5%)
Eight            25,556 (75%)     56 (38%)

[Figure: projected execution time profile. Horizontal stacked bars (0% to 100%) split into MCT, FWT, QUANT, TIER1, and TIER2 for water.pnm, lena.ras, baboon.ras, kodim23.ras, kodim22.ras, kodim21.ras, kodim16.ras, kodim11.ras, kodim10.ras, kodim06.ras, camera.ras, and peppers.ras. Profiling shows performance projections with DMA transfer times included.]

Conclusions from HPEC Platform
- Multi-chip system offers resources for increased parallelism or a multi-component application
- Order of magnitude improvement in total computation time
  - Faster computation times on FPGA
  - But communication overhead severely hinders performance improvement
- Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands

Target HPC Platform
- High-Performance Computing: SGI Altix 350 with FPGA Brick
  - Single FPGA: Virtex2 6000 (-6 speed grade)
    - Approximately 33% of chip used for SGI's RASC system layer
    - Two algorithm clock speeds: 200 MHz and 100 MHz
  - High bandwidth to system memory through proprietary NUMAlink interconnect (12.8 GB/s) through Scalable System Port (6.4 GB/s)
  - 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s for each read and write)
- Advantages/Disadvantages
  - Extremely low reconfiguration time
  - High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2K
[Figure: SGI Altix w/ RASC extension. Algorithm FPGA; three QDR SRAM banks labeled 2 MB each, with 36-bit data paths and Addr & Ctrl lines; Loader FPGA; TIO with 72-bit links to NUMAlink connectors; PCI 33 MHz; SelectMAP programming interface.]
* Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.

Performance Projections
[Figure: projected execution time profile. Stacked bars (0% to 100%) split into MCT, FWT, QUANT, TIER1, and TIER2. Profile shows projections for no-latency, infinite-bandwidth interconnect.]
- NUMAlink interconnect
  - Approximate order-of-magnitude improvement of transfers in similar designs
  - Mitigates communication overhead bottleneck
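
A rough sanity check on the order-of-magnitude transfer claim compares theoretical peak rates of the two interconnects. The sketch below assumes the PCI peak is simply bus width times clock rate and uses decimal units (1 GB/s = 1000 MB/s); sustained DMA rates are lower on both platforms, so this is an upper-bound comparison only, and the helper name is hypothetical.

```python
def peak_bandwidth_mb_s(bus_bits, clock_mhz):
    """Theoretical peak of a parallel bus: bytes per transfer times MHz."""
    return bus_bits / 8 * clock_mhz


# HPEC platform: 64-bit / 66 MHz PCI connection.
pci_mb_s = peak_bandwidth_mb_s(64, 66)

# HPC platform: Scalable System Port figure quoted on the platform slide.
ssp_mb_s = 6.4 * 1000

ratio = ssp_mb_s / pci_mb_s
print(f"PCI peak: {pci_mb_s:.0f} MB/s, SSP: {ssp_mb_s:.0f} MB/s, ratio: {ratio:.1f}x")
```

The ratio of peak rates comes out a little above 12, in line with the approximate order-of-magnitude improvement stated above.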

Lessons Learned and Conclusions
- Lessons Learned
  - HW/SW codesign
    - Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
    - PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
  - Memory bandwidth constrains parallelism in DWT design
  - Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement
- Conclusions
  - Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl's Law)
  - Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
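
The Amdahl's Law caveat can be quantified with a short sketch. It assumes an accelerated fraction of 0.9, taken from the profiling result that more than 90% of execution time is spent in DWT and Tier-1, and an illustrative 10x speedup of that fraction; both numbers are inputs to the example rather than measured system results.

```python
def amdahl_speedup(accelerated_fraction, component_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / component_speedup)


# DWT + Tier-1 take about 90% of runtime; suppose both run 10x faster.
overall = amdahl_speedup(0.9, 10.0)

# Upper bound if the accelerated part took no time at all.
limit = amdahl_speedup(0.9, float("inf"))

print(f"overall: {overall:.2f}x, limit: {limit:.2f}x")
```

Even an infinitely fast DWT and Tier-1 would cap the whole encoder near 10x under this assumption, which is why moving MCT and Tier-2 onto the FPGA matters for further gains.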

Future Work and Acknowledgments
- Future Work:
  - Full system implementation on SGI Altix with RASC
    - Region of Interest capability
    - Lossy encoding and rate capability
    - MCT and Tier-2 encoding on FPGA as well
  - Single FPGA JPEG-2000 encoding application
- Acknowledgments
  - We wish to thank the following vendors for equipment and/or tools in support of this research: SGI, Nallatech, Xilinx, Aldec
  - Special thanks to SGI Digital Media group and SGI RASC engineers for their help and suggestions

References
[1] Adams, M.D. and Ward, R.K., "JasPer: a portable flexible open-source software tool kit for image coding/processing", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'04), pp. 241-244, May 2004.
[2] OpenJPEG. http://www.opegjpeg.org/
[3] Liu, L., Li, D., Li, Z., Wang, Z. and Chen, H., "A VLSI architecture of EBCOT encoder for JPEG2000", in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
[4] Chen, K., Lian, C., Chen, H., and Chen, L., "Analysis and architecture design of EBCOT for JPEG-2000", in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
[5] Van Buren, D., "A high-rate JPEG2000 compression system for space", in IEEE Aerospace Conference, March 2005.
[6] Aouadi, I., and Hammami, O., "Analysis and hardware design of a scalable dual JPEG-2000 entropy coder", in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
[7] Gangadhar, M. and Bhatia, D., "FPGA based EBCOT architecture for JPEG 2000", in IEEE International Conference on Field-Programmable Technology (FPT'03), pp. 228-233, Dec. 2003.
[8] Hung, K., Huang, Y., Truong, T., and Wang, C., "FPGA implementation for 2D discrete wavelet transform", in Electronics Letters, pp. 639-640, April 1998.
[9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K. and Sriram, G., "Design and FPGA implementation of image block encoders with 2D-DWT", in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
[10] McCanny, P., Masud, S., and McCanny, J., "Design and implementation of the symmetrically extended 2-D wavelet transform", in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'02), vol. 3, pp. 3108-3111, May 2002.
[11] Taubman, D., "High performance scalable image compression with EBCOT", in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
[12] Richardson, I.E.G., Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
[13] Acharya, T. and Tsai, P.-S., JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.