A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) •...
Transcript of A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) •...
A Fixed-point Multimedia Co-processor with 50Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications
Ju-Ho Sohn, Jeong-Ho Woo, Ramchan Woo, and Hoi-Jun Yoo
Semiconductor System Lab.Dept. of EECS
Korea Advanced Institute of Science and Technology
Ju-Ho Sohn 2ESSCIRC 2005
Outline
• Introduction• System Architecture• Low Power Features
– Fixed-point processing– Instruction-wise clock gating
• SIMD Datapath Structure– ALU, Multiply, SFU
• Implementation Results• Summary
Ju-Ho Sohn 3ESSCIRC 2005
3G Mobile Terminals
• Fixed-point multimedia co-processor– Low power consumption in ARM-10 architecture– Programmability with SIMD vertex shading
• Requirements– Low power
(< 200mW)– High performance
(>1Mvertices/s)– Variable
Applications(Programmability)
– Low cost (<5$)
3D GraphicsMPEG Video
MP3 Audio Java Game
Wireless Multimedia Center
Ju-Ho Sohn 4ESSCIRC 2005
Programmability in Graphics Hardware
Transform
Lighting
Rasterization
TextureMapping
Clipping
Rasterization
TextureMapping
Clipping
Vertex Shader
VertexProgram
User-definedVertex Processing
TraditionalGraphics Pipeline
ProgrammableGraphics Pipeline
SimpleGraphics
Flexible 3D Graphics& Multimedia
Funtions
FixedFunctions
Ju-Ho Sohn 5ESSCIRC 2005
Co-processor Architecture
• 128-bit 4-way SIMD ARM-10 Co-processor
Co-ProcI/F
Unit&
ControlSIMDALU
opA opB opC
SIMDMultiply
opA opB opC
SFU
op
VGR
VIR
SGR32kBDisplayBuffer
SWZ
StoreReg.
LoadReg.
C_BUS
B_BUS
A_BUS ARM
MUX
MUX
RE
VOR0
VOR1
VOR2
VertexBuffer
WB_BUS
Ju-Ho Sohn 6ESSCIRC 2005
Dual Operations
• Two operating states– Tightly coupled
co-processor (TCC)• Normal co-processor• General SIMD instructions• Handshaking with ARM-10
– Parallel processor (PP)• Independent processor• Graphics instructions• Parallel operations with ARM-10
ARM Inst .1
ARM Inst. 3
SIMD Inst.2
ARM-10 Co-processor
SIMD Inst. 4
ARMProgram 1
VertexProgram
ARMProgram 2
ARM-10 Co-processor
Ju-Ho Sohn 7ESSCIRC 2005
Processor State Control
• Enhanced co-processor I/F for PP state– VPCTRL with 2kB code memory – Share all HW blocks with TCC state– Drive CPBusy for synchronization
2kB CodeMemory
INSTRDEC
&CTRL
Co-ProcI/F
VPCTRL
Fetc
h
ControlReg. (SGR)
VPen
VPen
CPbusy GraphicsINSTR(PP)
SIMDINSTR(TCC)
State
INSTR DatapathSignalsARM-10
Ju-Ho Sohn 8ESSCIRC 2005
Programmable Vertex Shader
• Shader instruction extensions in PP state– Source swizzling and write masks
– More operands
• Parallelism for streaming graphics data
VGR2 = VGR0 + VGR1 VGR2.xyzw = VGR0 + VGR.xyzw(TCC state) (PP state)
VGR2 = VGR0 + VGR1 VOR2.xyzw = VIR0 + MEM[1](TCC state) (PP state)
ARM-10
Co-processor
ExternalMemory
INIT. 1st VertexFetch
1st VertexProcessing
2nd VertexFetch
2nd VertexProcessing
1st Vertex 2nd Vertex
TransferTransfer Hide external memorytransfer cycles
Shading Shading
Ju-Ho Sohn 9ESSCIRC 2005
Fixed-point Processing: Low Power Features (1)
• Fixed-point number representation
• Energy efficiency– 40% less energythan floating-point on avg. in 3D graphics
32 22 12 02 12− 22− 32− 42−7b 6b 5b 4b 3b 2b 1b 0b }1,0{∈ibBit index
Value
sign m-bit integerpart n-bit fraction part
Fraction point
(ex) Q4.4 format
Fixed-pointArithmetic
Unit
PowerConsumption
Execution Time
Floating-pointArithmetic
Unit17 % Less Power
30%Faster
Ju-Ho Sohn 10ESSCIRC 2005
Instruction-wise Clock Gating: Low Power Features (2)
• SIMD RF and datapath are the most power-consuming parts– Activation in instruction-by-instruction basis
ARM-10 Co-ProcI/F
Enable
ClockSource
SIMDReg. Files
Reg.Valid Fixed-point
SIMDDatapathO
PLatchD
EQ
MultimediaCo-processor
High only @CP is called
Clock-off
Prevent signaltransitions
Ju-Ho Sohn 11ESSCIRC 2005
SIMD Datapath : ALU
• Single-cycle integer and fixed-point op.– Arithmetic, logic, shuffle, packing, alignment– SW floating-point emulation (80Mflops peak)
NZCV
Shuffle
Align
Pack
ALU
CLZShifter
aluCode= {alu, N}
shCode= {sh, N}
shAmt
opA
opB
opC
Out
Compare
Negative
ADD or(LSR)
SUB or(LSL)
Yes
No
Ju-Ho Sohn 12ESSCIRC 2005
SIMD Datapath: Multiply
• 4-cycle matrix transformation
Fixed-pointResult
32x16MUL
32x16MUL
SHIFT
CSA
CPA
CSA
CPA SHIFT
Accum.
MUL
Convert 32b Fixed to64b Integer
Convert 64b Integer to 32b Fixed
Bypass of Low 32b Bypass of High 32b
M4M5M6M7
M8M9M10M11
M12M13M14M15
XYZW
M0M1M2M3
M0M1M2M3
Broadcasted Vector Elements
XXXX
MUL X
M4M5M6M7
YYYY
M8M9M10M11
ZZZZ
M12M13M14M15
MAC Y MAC Z MAC W
WWWW
Ju-Ho Sohn 13ESSCIRC 2005
SIMD Datapath: Special Function Unit
• Calculate RCP (1/x) and RSQ (1/sqrt(x))– Fixed-point input Fixed-point output– No additional data type conversion circuits
CountLeading
Zero
SHIFT
Radix-4Integer
DIVSQRT
CarryPropagate
Adder
Fixed-pointInput
Fixed-pointOutput1.0
in Qm.n
For Qm.n Fixed-point Format
Pre-scaleValue
Normalized Input
QuotientRemainder
Scale up fixed-point dividend to 64b spaceOP Pre-scale Value
RCPRSQ
# of Leading Zeron/2 + # of Leading Zero
Ju-Ho Sohn 14ESSCIRC 2005
Integration into a Mobile Graphics Processor
• 0.18μm CMOS– 1P6M
• 200MHz• 10.2 mm2
– Full chip (36 mm2 )• 75.4 mW
– Full chip (155mW)• 3D performance
– 50Mvertiecs/s (Parallel projection)
– 3.6Mvertices/s (Full 3-D geometry)
Ju-Ho Sohn 15ESSCIRC 2005
System Evaluation Board
GraphicsProcessor
LinuxHost
Ju-Ho Sohn 16ESSCIRC 2005
Conclusion
• Low power multimedia co-processor for 3G wireless terminals– Programmable SIMD vertex shader
• ARM-10 co-processor architecture• Flexibility for various multimedia applications
– Low power features• Energy-efficient fixed-point SIMD datapath• Instruction-wise clock gating
– High performance• 50Mvertices/s for parallel projection
• <75.4mW, 10.2 mm2
– Integrated into the mobile graphics processor– Successfully demonstrated on evaluation board