A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) •...

16
A Fixed-point Multimedia Co-processor with 50Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications Ju-Ho Sohn , Jeong-Ho Woo, Ramchan Woo, and Hoi-Jun Yoo Semiconductor System Lab. Dept. of EECS Korea Advanced Institute of Science and Technology

Transcript of A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) •...

Page 1: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

A Fixed-point Multimedia Co-processor with 50Mvertices/s Programmable SIMD Vertex Shader for Mobile Applications

Ju-Ho Sohn, Jeong-Ho Woo, Ramchan Woo, and Hoi-Jun Yoo

Semiconductor System Lab.Dept. of EECS

Korea Advanced Institute of Science and Technology

Page 2: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 2ESSCIRC 2005

Outline

• Introduction• System Architecture• Low Power Features

– Fixed-point processing– Instruction-wise clock gating

• SIMD Datapath Structure– ALU, Multiply, SFU

• Implementation Results• Summary

Page 3: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 3ESSCIRC 2005

3G Mobile Terminals

• Fixed-point multimedia co-processor– Low power consumption in ARM-10 architecture– Programmability with SIMD vertex shading

• Requirements– Low power

(< 200mW)– High performance

(>1Mvertices/s)– Variable

Applications(Programmability)

– Low cost (<5$)

3D GraphicsMPEG Video

MP3 Audio Java Game

Wireless Multimedia Center

Page 4: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 4ESSCIRC 2005

Programmability in Graphics Hardware

Transform

Lighting

Rasterization

TextureMapping

Clipping

Rasterization

TextureMapping

Clipping

Vertex Shader

VertexProgram

User-definedVertex Processing

TraditionalGraphics Pipeline

ProgrammableGraphics Pipeline

SimpleGraphics

Flexible 3D Graphics& Multimedia

Funtions

FixedFunctions

Page 5: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 5ESSCIRC 2005

Co-processor Architecture

• 128-bit 4-way SIMD ARM-10 Co-processor

Co-ProcI/F

Unit&

ControlSIMDALU

opA opB opC

SIMDMultiply

opA opB opC

SFU

op

VGR

VIR

SGR32kBDisplayBuffer

SWZ

StoreReg.

LoadReg.

C_BUS

B_BUS

A_BUS ARM

MUX

MUX

RE

VOR0

VOR1

VOR2

VertexBuffer

WB_BUS

Page 6: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 6ESSCIRC 2005

Dual Operations

• Two operating states– Tightly coupled

co-processor (TCC)• Normal co-processor• General SIMD instructions• Handshaking with ARM-10

– Parallel processor (PP)• Independent processor• Graphics instructions• Parallel operations with ARM-10

ARM Inst .1

ARM Inst. 3

SIMD Inst.2

ARM-10 Co-processor

SIMD Inst. 4

ARMProgram 1

VertexProgram

ARMProgram 2

ARM-10 Co-processor

Page 7: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 7ESSCIRC 2005

Processor State Control

• Enhanced co-processor I/F for PP state– VPCTRL with 2kB code memory – Share all HW blocks with TCC state– Drive CPBusy for synchronization

2kB CodeMemory

INSTRDEC

&CTRL

Co-ProcI/F

VPCTRL

Fetc

h

ControlReg. (SGR)

VPen

VPen

CPbusy GraphicsINSTR(PP)

SIMDINSTR(TCC)

State

INSTR DatapathSignalsARM-10

Page 8: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 8ESSCIRC 2005

Programmable Vertex Shader

• Shader instruction extensions in PP state– Source swizzling and write masks

– More operands

• Parallelism for streaming graphics data

VGR2 = VGR0 + VGR1 VGR2.xyzw = VGR0 + VGR.xyzw(TCC state) (PP state)

VGR2 = VGR0 + VGR1 VOR2.xyzw = VIR0 + MEM[1](TCC state) (PP state)

ARM-10

Co-processor

ExternalMemory

INIT. 1st VertexFetch

1st VertexProcessing

2nd VertexFetch

2nd VertexProcessing

1st Vertex 2nd Vertex

TransferTransfer Hide external memorytransfer cycles

Shading Shading

Page 9: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 9ESSCIRC 2005

Fixed-point Processing: Low Power Features (1)

• Fixed-point number representation

• Energy efficiency– 40% less energythan floating-point on avg. in 3D graphics

32 22 12 02 12− 22− 32− 42−7b 6b 5b 4b 3b 2b 1b 0b }1,0{∈ibBit index

Value

sign m-bit integerpart n-bit fraction part

Fraction point

(ex) Q4.4 format

Fixed-pointArithmetic

Unit

PowerConsumption

Execution Time

Floating-pointArithmetic

Unit17 % Less Power

30%Faster

Page 10: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 10ESSCIRC 2005

Instruction-wise Clock Gating: Low Power Features (2)

• SIMD RF and datapath are the most power-consuming parts– Activation in instruction-by-instruction basis

ARM-10 Co-ProcI/F

Enable

ClockSource

SIMDReg. Files

Reg.Valid Fixed-point

SIMDDatapathO

PLatchD

EQ

MultimediaCo-processor

High only @CP is called

Clock-off

Prevent signaltransitions

Page 11: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 11ESSCIRC 2005

SIMD Datapath : ALU

• Single-cycle integer and fixed-point op.– Arithmetic, logic, shuffle, packing, alignment– SW floating-point emulation (80Mflops peak)

NZCV

Shuffle

Align

Pack

ALU

CLZShifter

aluCode= {alu, N}

shCode= {sh, N}

shAmt

opA

opB

opC

Out

Compare

Negative

ADD or(LSR)

SUB or(LSL)

Yes

No

Page 12: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 12ESSCIRC 2005

SIMD Datapath: Multiply

• 4-cycle matrix transformation

Fixed-pointResult

32x16MUL

32x16MUL

SHIFT

CSA

CPA

CSA

CPA SHIFT

Accum.

MUL

Convert 32b Fixed to64b Integer

Convert 64b Integer to 32b Fixed

Bypass of Low 32b Bypass of High 32b

M4M5M6M7

M8M9M10M11

M12M13M14M15

XYZW

M0M1M2M3

M0M1M2M3

Broadcasted Vector Elements

XXXX

MUL X

M4M5M6M7

YYYY

M8M9M10M11

ZZZZ

M12M13M14M15

MAC Y MAC Z MAC W

WWWW

Page 13: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 13ESSCIRC 2005

SIMD Datapath: Special Function Unit

• Calculate RCP (1/x) and RSQ (1/sqrt(x))– Fixed-point input Fixed-point output– No additional data type conversion circuits

CountLeading

Zero

SHIFT

Radix-4Integer

DIVSQRT

CarryPropagate

Adder

Fixed-pointInput

Fixed-pointOutput1.0

in Qm.n

For Qm.n Fixed-point Format

Pre-scaleValue

Normalized Input

QuotientRemainder

Scale up fixed-point dividend to 64b spaceOP Pre-scale Value

RCPRSQ

# of Leading Zeron/2 + # of Leading Zero

Page 14: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 14ESSCIRC 2005

Integration into a Mobile Graphics Processor

• 0.18μm CMOS– 1P6M

• 200MHz• 10.2 mm2

– Full chip (36 mm2 )• 75.4 mW

– Full chip (155mW)• 3D performance

– 50Mvertiecs/s (Parallel projection)

– 3.6Mvertices/s (Full 3-D geometry)

Page 15: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 15ESSCIRC 2005

System Evaluation Board

GraphicsProcessor

LinuxHost

Page 16: A Fixed-point Multimedia Co-processor with 50Mvertices/s ...€¦ · : Low Power Features (2) • SIMD RF and datapath are the most power-consuming parts – Activation in instruction-by-instruction

Ju-Ho Sohn 16ESSCIRC 2005

Conclusion

• Low power multimedia co-processor for 3G wireless terminals– Programmable SIMD vertex shader

• ARM-10 co-processor architecture• Flexibility for various multimedia applications

– Low power features• Energy-efficient fixed-point SIMD datapath• Instruction-wise clock gating

– High performance• 50Mvertices/s for parallel projection

• <75.4mW, 10.2 mm2

– Integrated into the mobile graphics processor– Successfully demonstrated on evaluation board