Design of Effective Memory Architectures for Mobile 2D/3D Graphics Processor
Jinhong Park
Department of Computer Science, The Graduate School, Yonsei University
2011/03

Agenda
- Abstract
- Chapter 1. Introduction
  - 1.1. Motivation
  - 1.2. Overview of Dissertation
- Chapter 2. Background & Related Works
  - 2.1. OpenVG Pipeline
  - 2.2. Vector Graphics Hardware Issue
- Chapter 3. Vector Graphics Hardware
  - 3.1. An Effective Rasterization Architecture for Mobile Vector Graphics Processors
  - 3.2. Design of Vector Graphics Paint Processing Hardware
  - 3.3. Implementation of Vector Graphics Processor
- Chapter 4. Memory Architecture for 3D Graphics Hardware
  - 4.1. The Design of Compressed Memory System for Depth Data in 3D Rendering Processors
  - 4.2. An Effective Depth Data Memory System Using Escape Count Buffer for 3D Rendering Processors
- Chapter 5. Conclusion
- Bibliography

Introduction
- Mobile devices require high-quality and high-resolution graphics services
- 2D vector graphics (VG)
  - Supports high-quality graphics services by using geometrical primitives
- 3D graphics
  - Presents real-world images by using 3D objects

Background
OpenVG Pipeline & H/W Design Issues

Stage: Path generation
  - Convert a curve into multiple lines
  - Convert a stroked path into multiple simple paths
Stage: Path transformation
  - Coordinate system transformation: matrix-vector multiplication
Stage: Rasterization
  - Traditional scan conversion algorithm: edge table building, active edge table building, span generation
  - Allocate/search/sort of linked list data structure
  - External memory bandwidth
Stage: Clipping & Masking
  - Out-of-screen clipping
  - Rectangular clipping: scissoring
  - Alpha masking
Stage: Paint generation
  - Linear/radial/focal radial gradient processing
  - Gradient path transformation
  - Heavy arithmetic calculation for focal-radial gradient (multiplication, square, division, square root, etc.)
Stage: Image interpolation
  - Image mapping cache
  - Image path transformation
  - Image sampling (point, bilinear, or tri-linear)
Stage: Blending & Anti-aliasing
  - Method of calculating a coverage value in the rasterization stage
  - Anti-aliasing method
  - Blend rule

An Effective Rasterization Architecture for Mobile Vector Graphics Processors
Introduction
- A geometry described with VG is defined by one or more paths. Each path consists of a series of edges.
- Rasterizing each edge generates numerous pixels on the edge, called cells.
- Rasterization requires a large number of memory accesses and comparison operations for sorting.
Proposed index board rasterization hardware architecture
- Uses a Y-index board (internal SRAM) to store the number of cells generated for each scanline, and an X-index board (internal SRAM), sized to one scanline, to store the sorted cells
- Reduces external memory traffic by up to 53.6% compared to traditional algorithms

An Effective Rasterization Architecture for Mobile Vector Graphics Processors (cont.)
Related Works
- Active edge based algorithms
  - Kallio's active edge tables algorithm: after generating the Edge Table (ET), the Active Edge Table (AET) for the current scanline is dynamically updated by tracking its active edges
  - Kim's algorithm: stores modified active edge data in the AET without an ET
- Cell based scanline processing algorithm
  - AGG's pixel tables algorithm: generates cells for all pixels touched by each edge through the line-drawing algorithm, then sorts them in x-axis order
Comparisons
- Memory traffic: active edge based > cell based

An Effective Rasterization Architecture for Mobile Vector Graphics Processors (cont.)
Proposed Rasterization Architecture
Flow
1. For a given path, cells are generated from its series of edges and stored in the corresponding scanline of the cell array, in the order they are generated.
2. The number of generated cells for each scanline is updated in the Y-index board SRAM.
3. After cell generation for the whole path is completed, the cells of the same scanline are retrieved from the cell array, and active spans are generated along the scanning direction of the scanline.
4. The generated active spans are transferred as spans, and the remaining processes are executed.
Y-index board SRAM - Stores the number of cells generated for each scanline
X-index board SRAM - Stores the sorted cells; sized to one scanline
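The flow above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware design from the dissertation: the function name `rasterize`, the cell layout `(x, direction)`, sampling at scanline centers, and the non-zero fill rule are all hypothetical choices, and coverage values for anti-aliasing are omitted. The X-index board is modeled as an array indexed by x, so cells land in sorted order without any comparison-based sort.

```python
import math

def rasterize(edges, width, height):
    """Sketch of index board rasterization for one path (assumed model)."""
    cell_array = [[] for _ in range(height)]  # stands in for the cell array
    y_index = [0] * height                    # Y-index board: cells per scanline

    # Steps 1 and 2: generate cells per edge, count them per scanline.
    for x0, y0, x1, y1 in edges:
        if y0 == y1:
            continue                          # horizontal edges add no cells here
        direction = 1 if y1 > y0 else -1
        if y0 > y1:
            x0, y0, x1, y1 = x1, y1, x0, y0
        first = max(0, math.ceil(y0 - 0.5))
        last = min(height - 1, math.ceil(y1 - 0.5) - 1)
        for y in range(first, last + 1):
            t = (y + 0.5 - y0) / (y1 - y0)
            x = min(max(int(x0 + t * (x1 - x0)), 0), width)
            cell_array[y].append((x, direction))
            y_index[y] += 1

    # Steps 3 and 4: per scanline, drop cells into the X-index board and scan.
    spans = {}
    for y in range(height):
        if y_index[y] == 0:
            continue                          # no cells: scanline is skipped
        x_index = [0] * (width + 1)           # X-index board: one scanline wide
        for x, direction in cell_array[y]:
            x_index[x] += direction
        winding, start, row = 0, None, []
        for x in range(width + 1):
            winding += x_index[x]
            if winding != 0 and start is None:
                start = x
            elif winding == 0 and start is not None:
                row.append((start, x))        # span covers [start, x)
                start = None
        if start is not None:
            row.append((start, width))
        spans[y] = row
    return spans, y_index
```

Calling `rasterize` with the four edges of an axis-aligned rectangle returns one span per covered scanline, and `y_index` shows two cells on each of those scanlines.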

An Effective Rasterization Architecture for Mobile Vector Graphics Processors (cont.)
Experimental Result
- External memory traffic (bytes)
  - ~53.6% reduction @ QVGA / ~57.5% reduction @ VGA
- Processing throughput (number of external memory accesses)
  - ~211.8% improvement @ QVGA / ~219.8% improvement @ VGA

Design of Vector Graphics Paint Processing Hardware
Gradient paints
- Decide each pixel color of the surface using a location parameter of the pixel
- Types: linear gradient, radial gradient
- Conventional methods require dividers, multipliers, and SQRT units to implement gradient processing. These arithmetic modules are not suitable for mobile hardware because they occupy a large area.
Image paints
- Map a bitmap image onto the inside of the surface
- Require high memory bandwidth
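To show where the dividers, multipliers, and SQRT come from, here is a minimal sketch of the two gradient functions as I understand them from the OpenVG 1.1 specification. The function and parameter names are hypothetical, and this is the conventional per-pixel formulation, not the proposed hardware method.

```python
import math

def linear_gradient(x, y, x0, y0, x1, y1):
    """Gradient value g at (x, y) for a linear gradient from (x0, y0) to (x1, y1)."""
    dx, dy = x1 - x0, y1 - y0
    return (dx * (x - x0) + dy * (y - y0)) / (dx * dx + dy * dy)

def radial_gradient(x, y, cx, cy, fx, fy, r):
    """Gradient value g at (x, y) for center (cx, cy), focal point (fx, fy), radius r.

    Assumes the focal point lies strictly inside the circle.
    """
    dx, dy = x - fx, y - fy                 # pixel relative to the focal point
    fxp, fyp = fx - cx, fy - cy             # focal point relative to the center
    dot = dx * fxp + dy * fyp
    cross = dx * fyp - dy * fxp
    disc = r * r * (dx * dx + dy * dy) - cross * cross
    return (dot + math.sqrt(disc)) / (r * r - (fxp * fxp + fyp * fyp))
```

Both functions return 0 at the gradient start and 1 at the gradient end; values outside that range are handled later by the spread mode.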

Design of Vector Graphics Paint Processing Hardware (cont.)
Proposed Paints Processor
- Command/cell data read unit
- Span generation unit
- Gradient processing unit
- Image processing unit
- Shared setup unit
- Shared texture/gradient LUT cache unit

Design of Vector Graphics Paint Processing Hardware (cont.)
Proposed Gradient Paints Processor
- Stage 1: gradient space transformation
- Stage 2: gradient function
- Stage 3: color ramp LUT index calculation
- Stage 4: color ramp
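Stages 3 and 4 can be illustrated with a small color ramp lookup table. This is a sketch under assumptions: the LUT size, the rounding of the index, and the names `spread`, `build_ramp`, and `ramp_lookup` are hypothetical, and the three spread modes (pad, repeat, reflect) are those defined by OpenVG rather than details taken from the slides.

```python
import math

def spread(g, mode):
    """Map an unbounded gradient value g into [0, 1] according to the spread mode."""
    if mode == "pad":
        return min(max(g, 0.0), 1.0)
    if mode == "repeat":
        return g - math.floor(g)
    if mode == "reflect":
        t = g % 2.0                      # non-negative in Python for a positive divisor
        return t if t <= 1.0 else 2.0 - t
    raise ValueError(mode)

def build_ramp(stops, size):
    """Precompute a color ramp LUT from at least two (offset, rgba) stops."""
    lut = []
    for i in range(size):
        t = i / (size - 1)
        for (o0, c0), (o1, c1) in zip(stops, stops[1:]):
            if t <= o1:
                break                    # found the segment that contains t
        w = 0.0 if o1 == o0 else min(max((t - o0) / (o1 - o0), 0.0), 1.0)
        lut.append(tuple(round(a + w * (b - a)) for a, b in zip(c0, c1)))
    return lut

def ramp_lookup(lut, g, mode):
    """Stage 3 (index calculation) followed by stage 4 (color fetch)."""
    index = int(spread(g, mode) * (len(lut) - 1) + 0.5)
    return lut[index]
```

Precomputing the ramp once per paint moves the interpolation between color stops out of the per-pixel path, so each pixel costs one index calculation and one table read.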

Design of Vector Graphics Paint Processing Hardware (cont.)
Features of Proposed Gradient Paints
- Stage 1
  - The corresponding pixel is translated into the pre-defined gradient surface using the shared setup units
- Stage 2
  - Linear gradients: use the matrix transformation unit already used in the existing geometric transformation
  - Radial gradients: use the proposed floating-point SQRT unit, which is based on a suitably sized LUT
- Stage 3 & 4
  - Reduce the number of high-cost arithmetic units and calculations by using a gradient LUT
  - Share a small internal SRAM between the gradient LUT and the texture cache
Figures: Proposed SQRT; Gradient Color Generation Method
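The slides do not describe the internals of the LUT-based SQRT unit, so the following is only one common way such a unit can work, given as a hedged sketch. The table size `LUT_BITS`, the mantissa range, and the name `lut_sqrt` are assumptions. The idea is to split the input into mantissa and exponent, make the exponent even, look up the square root of the mantissa, and halve the exponent.

```python
import math

LUT_BITS = 8                      # hypothetical table size: 256 entries
LUT_SIZE = 1 << LUT_BITS
# Square roots sampled at bin centers over the mantissa range [0.5, 2).
SQRT_LUT = [math.sqrt(0.5 + (i + 0.5) * 1.5 / LUT_SIZE) for i in range(LUT_SIZE)]

def lut_sqrt(x):
    """Approximate sqrt(x) with one table lookup and an exponent shift."""
    if x <= 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    if e % 2:                     # make the exponent even so it can be halved
        m *= 2.0
        e -= 1
    index = int((m - 0.5) * LUT_SIZE / 1.5)
    return math.ldexp(SQRT_LUT[index], e // 2)
```

With 256 entries the relative error of this sketch stays below about 0.3%, which is small next to the quantization of a color ramp index; a real design would pick the table size against its own accuracy target.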

Design of Vector Graphics Paint Processing Hardware (cont.)
Implementation and Result
Comparison of the number of arithmetic operations per pixel in the gradient function

                  Linear Gradient       Radial Gradient
                  OpenVG   Proposed     OpenVG   Proposed
  division          1        0            1        0
  multiplication    4        3           10        8
  addition          2        2            4        5
  subtraction       0        0            3        0
  sqrt              0        0            1        1

FPGA implementation
- Virtex-4 LX200 @ 48 MHz, a 320x240 LCD, 512MB DDR, and an AHB bus
Results
- Image: 14.3 fps
- Linear gradient: 13.3 fps
- Radial gradient: 11.2 fps
Figures: Image; Linear Gradient; Radial Gradient

Implementation of Vector Graphics Processor
Vector graphics processor: supported feature list
- OpenVG v1.1 pipeline hardware acceleration support
- Screen resolution: max. 1024 x 1024
- Output color format: 24bpp (RGB) / 32bpp (RGBA)
- SW path transformation & clipping
- Image filtering: kernel size (4x4 matrix)
- HW frame buffer clearing
- HW gradient/pattern generation
- HW masking: 8-bit alpha mask
- HW scissoring: max. 32 scissoring rectangles
- Image sampling mode: point/bilinear sampling
- Image file format: 32bpp (RGBA)
- Alpha blending: OpenVG API standard modes
- Anti-aliasing: 8-bit sub-pixel coverage

Implementation of Vector Graphics Processor (cont.)
Architecture of the proposed vector graphics processor
4-stage pipeline architecture
- Cell generation stage
- Span generation stage
- Solid/image/gradient processing stage
- Anti-aliasing stage

Implementation of Vector Graphics Processor (cont.)
- Cell generation stage
  - Uses the index board rasterization algorithm
- Span generation stage
  - Active scanline
    - Eliminates unnecessary cell array accesses
    - Reduces unnecessary X-index board SRAM accesses
- Solid/image/gradient processing stage
  - Uses the Vector Graphics Paint Processing Hardware proposed in Chapter 3.2
  - Shares common setup units
  - Shares a 4KB SRAM between the gradient LUT and the texture cache
- Anti-aliasing stage
  - Uses an area sampling method
Figures: Example of Active Scanline; Active Scan Line Architecture

Implementation of Vector Graphics Processor (cont.)
VGP commands

  Command Name          Value  Support      Operand
  Standard OpenVG Command
  VG_CLOSE_PATH         0x00   VGP support  no operand
  VG_MOVE_TO            0x02   VGP support  X(s15.16), Y(s15.16)
  VG_LINE_TO            0x04   VGP support  X(s15.16), Y(s15.16)
  VG_HLINE_TO           0x06   API support
  VG_VLINE_TO           0x08   API support
  VG_QUAD_TO            0x0A   API support
  VG_CUBIC_TO           0x0C   API support
  VG_SQUAD_TO           0x0E   API support
  VG_SCUBIC_TO          0x10   API support
  VG_SCCWARC_TO         0x12   API support
  VG_SCWARC_TO          0x14   API support
  VG_LCCWARC_TO         0x16   API support
  VG_LCWARC_TO          0x18   API support
  BatchMode Extension
  VGP_PC (Path Color)   0xFC   VGP support  fillRule(32), color(RGBA32)
  VGP_FE (Frame End)    0xFE   VGP support  no operand

VGP Registers

  Register Name            Address     Write / Read
  VGP_R_Reset              0x6FF30000  Write: 1: Start, 2: Start (BatchMode). Read: 0: EndOfRendering; 1, 2: Processing Of Rendering
  VGP_R_addr_frameBuffer   0x6FF30004  Frame Buffer Address
  VGP_R_addr_path          0x6FF30008  Path Array Address
  VGP_R_addr_command       0x6FF3000C  Command Array Address
  VGP_R_addr_texture       0x6FF30010  Texture Array Address
  VGP_R_addr_CellArray     0x6FF30014  Cell Array Address
  VGP_R_ScreenSize         0x6FF30018  ScreenSize: Width[31:16], Height[15:0]
  VGP_R_CMD_value          0x6FF3001C  ClearColor RGBA32
  VGP_R_CMD_status         0x6FF30020  Write: 1: StartClear, 2: StartFlushing. Read: 0: EndOfClearing; 1, 2: Processing Of Clearing
  VGP_R_matrix_sx          0x6FF30024  Matrix sx
  VGP_R_matrix_shy         0x6FF30028  Matrix shy
  VGP_R_matrix_shx         0x6FF3002C  Matrix shx
  VGP_R_matrix_sy          0x6FF30030  Matrix sy
  VGP_R_matrix_tx          0x6FF30034  Matrix tx
  VGP_R_matrix_ty          0x6FF30038  Matrix ty
  VGP_R_matrix_w0          0x6FF3003C  Matrix w0 (not used)
  VGP_R_matrix_w1          0x6FF30040  Matrix w1 (not used)
  VGP_R_matrix_w2          0x6FF30044  Matrix w2 (not used)
  VGP_R_paint_color        0x6FF30048  PaintColor RGBA32
  VGP_R_paint_mode         0x6FF3004C  FillRule[8], PaintMode[7:0]: [8] even_odd, [7:0] painting mode
  VGP_R_image_size         0x6FF30050  ImageSize: Width[31:16], Height[15:0]
  VGP_R_image_color        0x6FF30054  ImageColor RGBA32
  VGP_R_image_addr         0x6FF30058  Image Address (offset)
  VGP_R_grad_LUT_data      0x6FF3005C  LUT size, [2] type, [1:0] spread
  VGP_R_grad_d1            0x6FF30060  Gradient D1 S15.16
  VGP_R_grad_d2            0x6FF30064  Gradient D2 S15.16
  VGP_R_grad_fx            0x6FF30068  Gradient FX S15.16
  VGP_R_grad_fy            0x6FF3006C  Gradient FY S15.16
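As a rough illustration of how driver software might prepare values for these registers and commands, the sketch below packs a few of the documented fields. The register addresses, opcodes, bit positions, and the s15.16 operand format come from the tables above. The helper names, and the assumption that opcodes and coordinates live in two separate arrays (suggested by the separate path and command address registers), are hypothetical.

```python
# Subset of the register map and opcodes listed in the tables above.
VGP_REGS = {
    "VGP_R_Reset": 0x6FF30000,
    "VGP_R_addr_path": 0x6FF30008,
    "VGP_R_addr_command": 0x6FF3000C,
    "VGP_R_ScreenSize": 0x6FF30018,
    "VGP_R_paint_mode": 0x6FF3004C,
}
VG_CLOSE_PATH, VG_MOVE_TO, VG_LINE_TO = 0x00, 0x02, 0x04

def to_s15_16(value):
    """Convert a number to signed 15.16 fixed point as a 32-bit word."""
    return round(value * 65536) & 0xFFFFFFFF

def pack_size(width, height):
    """Width in bits [31:16], height in bits [15:0]."""
    return ((width & 0xFFFF) << 16) | (height & 0xFFFF)

def pack_paint_mode(even_odd, paint_mode):
    """Fill rule flag in bit [8], painting mode in bits [7:0]."""
    return ((1 if even_odd else 0) << 8) | (paint_mode & 0xFF)

def encode_path(segments):
    """Split ("move"|"line"|"close", x, y) tuples into opcode and coordinate arrays.

    The two-array layout is an assumption, not confirmed by the slides.
    """
    opcodes = {"move": VG_MOVE_TO, "line": VG_LINE_TO, "close": VG_CLOSE_PATH}
    commands, data = [], []
    for name, *coords in segments:
        commands.append(opcodes[name])
        data.extend(to_s15_16(c) for c in coords)
    return commands, data
```

Only the three commands marked "VGP support" are encoded here; per the table, the remaining OpenVG segment types are handled at the API level, presumably by converting them to move and line segments before they reach the hardware.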

Implementation of Vector Graphics Processor (cont.)
Software Implementation
- Proposed API of the VGP

Implementation of Vector Graphics Processor (cont.)
Verification Environment
- Benchmark images

Implementation of Vector Graphics Processor (cont.)
Results
Performance (our implementation on FPGA)

                          Frame/Sec   Vertex/Sec   Pixel/Sec
  Tiger                      7.9        94,610        982,792
  Picasso                    4.6        44,169      1,356,660
  Lion                      15.9        39,273      1,268,947
  Image                     14.3         2,874        572,400
  Linear Gradient           13.3            53      1,021,440
  Radial Focal Gradient     11.2            45        860,160

Total logic gate counts
- 572K logic gates and 11.38KB of SRAM in a SMIC 0.13 µm process

An Effective Depth Data Memory System Using Escape Count Buffer for 3D Rendering Processors
- Memory bandwidth is one of the important research issues in improving the performance of graphics processing units
- A new compressed memory system is proposed that reduces the bandwidth required from external memory by controlling the data transaction size of the compressed block data
- By using an escape count buffer (ECB) to store the compression levels of the depth data blocks, the data transaction size for external memory accesses can be accurately controlled

An Effective Depth Data Memory System Using Escape Count Buffer for 3D Rendering Processors (cont.)
Proposed Architecture
Processing flow
1. When a cache miss occurs during the depth read step, the cache-missed depth block is stored in the missed block buffer and the address of the missed depth block is sent to the ECB controller.
2. At the same time, the ECB controller reads the escape count of the cache-missed depth block from the ECB, from which the size of the compressed data is calculated.
3. With this size information, the transaction control unit issues the memory transaction, so that compressed data of the exact size is retrieved from the external DRAM and sent to the decompression unit.
4. With the compressed data retrieved from the external DRAM and its escape count from the ECB, the decompression unit restores the original block data into the depth cache. After the requested data has been processed, the cache-missed depth block data in the missed block buffer is compressed.
5. The compressed data for the cache-missed depth block is written into the depth buffer through the transaction control unit. At the same time, its escape count is updated in the ECB through the ECB controller.
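Steps 2 and 3 of the flow can be sketched as below. The slides do not give the compression format, so every size constant here is a hypothetical placeholder: an 8x8 block of 32-bit depth values, one reference value per block, a one-byte delta per pixel, and one full-width value per escaped pixel. Only the principle is taken from the source: the escape count alone determines the compressed size, so the transaction can be sized exactly before the data is fetched.

```python
import math

BLOCK_PIXELS = 64     # hypothetical: 8x8 depth block
DEPTH_BYTES = 4       # hypothetical: 32-bit depth values
DELTA_BYTES = 1       # hypothetical: per-pixel delta
HEADER_BYTES = 4      # hypothetical: one reference depth per block
RAW_BYTES = BLOCK_PIXELS * DEPTH_BYTES

def compressed_bytes(escape_count):
    """Size of a compressed block, derived from its escape count alone."""
    size = HEADER_BYTES + BLOCK_PIXELS * DELTA_BYTES + escape_count * DEPTH_BYTES
    return min(size, RAW_BYTES)          # never larger than the raw block

def bus_beats(num_bytes, bus_bytes):
    """Number of bus transfers needed for num_bytes on a bus_bytes-wide bus."""
    return math.ceil(num_bytes / bus_bytes)

class EscapeCountBuffer:
    """Maps a depth block address to the escape count of its compressed form."""

    def __init__(self):
        self._counts = {}

    def read(self, block_addr):
        # Blocks never written are treated as uncompressed in this sketch.
        return self._counts.get(block_addr, BLOCK_PIXELS)

    def update(self, block_addr, escape_count):
        self._counts[block_addr] = escape_count

def read_missed_block(ecb, block_addr, bus_bytes):
    """Steps 2 and 3: look up the escape count, then size the DRAM transaction."""
    size = compressed_bytes(ecb.read(block_addr))
    return size, bus_beats(size, bus_bytes)
```

Without such a buffer the memory controller would not know the compressed size in advance and would have to fetch a fixed or worst-case amount, which is the overhead the escape count lookup is meant to avoid.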

An Effective Depth Data Memory System Using Escape Count Buffer for 3D Rendering Processors (cont.)
Simulation and Results
Test benches: Quake3, UT2004

Bandwidth Requirements (MB/s)

                    Quake3            UT2004
                    VGA     SVGA      VGA     SVGA
  No compression   137.6   283.4     305.1   418.3
  Compression       40.0    76.0      78.5    99.1
  ATI               26.4    58.3      72.8    86.1
  Proposed          22.4    48.5      60.9    71.6

Throughput (ACPF)

  ACPF, Quake3      VGA                      SVGA
                    32 bit bus  64 bit bus   32 bit bus  64 bit bus
  No compression      1.76        1.44         1.75        1.44
  ATI                 1.52        1.35         1.48        1.34
  Proposed            1.14        1.11         1.17        1.13

  ACPF, UT2004      VGA                      SVGA
                    32 bit bus  64 bit bus   32 bit bus  64 bit bus
  No compression      1.65        1.39         1.64        1.38
  ATI                 1.38        1.29         1.23        1.26
  Proposed            1.19        1.15         1.17        1.14
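The relative savings can be read off the bandwidth table with a short calculation. The sketch below uses only the Quake3 VGA and UT2004 SVGA columns from the slide; the dictionary layout and the function name `reduction_percent` are illustrative.

```python
# Bandwidth requirements in MB/s, copied from the table on this slide.
BANDWIDTH_MBPS = {
    "Quake3 VGA": {"No compression": 137.6, "ATI": 26.4, "Proposed": 22.4},
    "UT2004 SVGA": {"No compression": 418.3, "ATI": 86.1, "Proposed": 71.6},
}

def reduction_percent(baseline, value):
    """Percentage by which value is lower than baseline."""
    return 100.0 * (baseline - value) / baseline

for bench, row in BANDWIDTH_MBPS.items():
    vs_raw = reduction_percent(row["No compression"], row["Proposed"])
    vs_ati = reduction_percent(row["ATI"], row["Proposed"])
    print(f"{bench}: {vs_raw:.1f}% below no compression, {vs_ati:.1f}% below ATI")
```

For Quake3 at VGA this gives roughly an 84% reduction against the uncompressed case and about 15% against the ATI scheme.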

Conclusions
This dissertation focuses on the research and implementation of a 2D vector graphics processor and an effective memory architecture for mobile 2D/3D graphics hardware
- An effective rasterization architecture for mobile vector graphics processors
- Design of vector graphics paint processing hardware
- Implementation of vector graphics processor
- The design of compressed memory system for depth data in 3D rendering processors
- An effective depth data memory system using escape count buffer for 3D rendering processors
