A User-Programmable Vertex Engine Erik Lindholm Mark Kilgard Henry Moreton NVIDIA Corporation...

Post on 29-Jan-2016


A User-Programmable Vertex Engine

Erik Lindholm

Mark Kilgard

Henry Moreton

NVIDIA Corporation

Presented by Han-Wei Shen

Where does the Vertex Engine fit?

Traditional Graphics Pipeline

[Figure: Transform & Lighting → setup, rasterizer → texture, blending → frame-buffer, anti-aliasing]

GeForce 3 Vertex Engine

[Figure: the same pipeline, with a Vertex Program stage alongside Transform & Lighting]

API Support

• Designed to fit into the OpenGL and D3D APIs

• Program mode vs. fixed-function mode

• Load and bind program

• Simple to add to existing D3D and OpenGL programs

Programming Model

• Enable vertex program: glEnable(GL_VERTEX_PROGRAM_NV);

• Create vertex program object

• Bind vertex program object

• Execute vertex program object

Create Vertex Program

• Programs (assembly) are defined inline as character strings

static const GLubyte vpgm[] = "!!VP1.0\
    DP4 o[HPOS].x, c[0], v[0];\
    DP4 o[HPOS].y, c[1], v[0];\
    DP4 o[HPOS].z, c[2], v[0];\
    DP4 o[HPOS].w, c[3], v[0];\
    MOV o[COL0], v[3];\
    END";

Create Vertex Program (2)

• Load and bind vertex programs, similar to texture objects

glLoadProgramNV(GL_VERTEX_PROGRAM_NV, 7, strlen(programString), programString);

…

glBindProgramNV(GL_VERTEX_PROGRAM_NV, 7);

Invoke Vertex Program

• The vertex program is initiated when a vertex is given, i.e., when

glBegin(…)

glVertex3f(x,y,z)

…

glEnd()

Let’s look at the sample program

static const GLubyte vpgm[] = "!!VP1.0\
    DP4 o[HPOS].x, c[0], v[0];\
    DP4 o[HPOS].y, c[1], v[0];\
    DP4 o[HPOS].z, c[2], v[0];\
    DP4 o[HPOS].w, c[3], v[0];\
    MOV o[COL0], v[3];\
    END";

o[HPOS] = M(c[0], c[1], c[2], c[3]) * v[0] (what is HPOS?)

o[COL0] = v[3] (what is COL0?)

The program calculates the clip-space position of the vertex (HPOS) and assigns v[3] to the vertex as its diffuse color (COL0).

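Programming Model

[Figure: data flow from Vertex Source through the Vertex Program to Vertex Output, with Program Constants and Temporary Registers attached to the program]

• Vertex Source: v[0] … v[15] (16x4 registers)

• Program Constants: c[0] … c[95] (96x4 registers)

• Temporary Registers: R0 … R11 (12x4 registers)

• Vertex Program: 128 instructions

• Vertex Output: o[HPOS], o[COL0], o[COL1], o[FOGP], o[PSIZ], o[TEX0] … o[TEX7] (15x4 registers)

• All registers are quad floats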

Input Vertex Attributes

• v[0] – v[15]

• Aliased (tracked) with conventional per-vertex attributes (Table 3)

• Use glVertexAttrib*NV() to explicitly assign values

• Can also specify values for an array of vertex attributes: glVertexAttribs*NV()

• Can change values inside or outside a glBegin()/glEnd() pair

Program Constants

• Can only change values outside a glBegin()/glEnd() pair

• No automatic aliasing

• Can be used to track OpenGL matrices (modelview, projection, texture, etc.)

• Example:

glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 0, GL_MODELVIEW_PROJECTION_NV, GL_IDENTITY_NV);

- tracks 4 contiguous program constants starting with c[0]

Program Constants (cont’d)

DP4 o[HPOS].x, c[0], v[OPOS];

DP4 o[HPOS].y, c[1], v[OPOS];

DP4 o[HPOS].z, c[2], v[OPOS];

DP4 o[HPOS].w, c[3], v[OPOS];

What does it do?
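One way to answer the question is to emulate the four DP4 instructions on the CPU. The sketch below assumes that c[0] … c[3] hold the rows of the tracked matrix, as in the glTrackMatrixNV example; the names vec4, dp4 and transform_position are hypothetical helpers for illustration, not part of any API.

```c
#include <assert.h>

/* One quad-float register. */
typedef struct { float x, y, z, w; } vec4;

/* DP4: 4-component dot product of two registers. */
static float dp4(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Emulates the four DP4 instructions: each one writes a single
 * component of o[HPOS], so together they compute M * v[OPOS],
 * where row i of M is assumed to be stored in c[i]. */
static vec4 transform_position(const vec4 c[4], vec4 opos) {
    vec4 hpos;
    assert(c != 0);
    hpos.x = dp4(c[0], opos);
    hpos.y = dp4(c[1], opos);
    hpos.z = dp4(c[2], opos);
    hpos.w = dp4(c[3], opos);
    return hpos;
}
```

With a matrix that translates by (1, 2, 3), the object-space point (1, 1, 1, 1) comes out as (2, 3, 4, 1), which is the usual matrix-times-vector result.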

Program Constants (cont’d)

glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 4, GL_MODELVIEW, GL_INVERSE_TRANSPOSE_NV);

DP3 R0.x, c[4], v[NRML];

DP3 R0.y, c[5], v[NRML];

DP3 R0.z, c[6], v[NRML];

What does it do?
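The three DP3 instructions multiply the normal by the upper 3x3 part of the inverse-transpose modelview matrix. A minimal CPU sketch follows, assuming c[4] … c[6] hold the first three rows of that matrix; vec4, dp3 and transform_normal are hypothetical names chosen for illustration.

```c
#include <assert.h>

/* One quad-float register. */
typedef struct { float x, y, z, w; } vec4;

/* DP3: dot product of the x, y, z components only. */
static float dp3(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Emulates the three DP3 instructions. c must have at least 7 entries;
 * c[4..6] are assumed to hold rows of the inverse-transpose modelview.
 * R0.w is never written by the program, so it is left at 0 here. */
static vec4 transform_normal(const vec4 c[], vec4 nrml) {
    vec4 r0 = {0.0f, 0.0f, 0.0f, 0.0f};
    assert(c != 0);
    r0.x = dp3(c[4], nrml);
    r0.y = dp3(c[5], nrml);
    r0.z = dp3(c[6], nrml);
    return r0;
}
```

As a usage example, take a modelview that scales x by 2, so its inverse transpose scales x by 0.5. The normal (1, -1, 0) becomes (0.5, -1, 0), which is still perpendicular to the transformed tangent (2, 1, 0). Multiplying the normal by the modelview itself would give (2, -1, 0), which is not perpendicular to that tangent; this is why the inverse transpose is tracked.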

Hardware Block Diagram

[Figure: Vertex In → Vertex Attribute Buffer (VAB) → Vector FP Core → Vertex Out]

Vertex Attribute Buffer (VAB)

[Figure: VAB with entries 0 … 15, each 128 bits (32 x 4), with dirty bits; the VAB feeds input buffers (IB) 0 … n-1]

HW Block Diagram

[Figure: input buffers (IB) 0 … n-1 feed the SIMD Vector Unit and the Special Function Unit, which read Constant Memory, Instruction Memory and Registers through swizzle/negate and write results through a write mask to output buffers (OB) 0 … n-1]

Data Path

[Figure: three source operands (X Y Z W), each passing through swizzle and negate, enter the FPU core; the result (X Y Z W) passes through the write mask]

Instruction Set: The ops

• 17 instructions total

• MOV, MUL, ADD, MAD, DST

• DP3, DP4

• MIN, MAX, SLT, SGE

• RCP, RSQ, LOG, EXP, LIT

• ARL

Instruction Set: The Core Features

• Immediate access to sources

• Swizzle/negate on all sources

• Write mask on all destinations

• DP3, DP4 most common graphics ops

• Cross product is MUL+MAD with swizzling

• LIT instruction implements Phong lighting
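The source modifiers and the destination write mask can be sketched on the CPU as follows. This is an illustration only: read_source, write_masked and component_index are hypothetical helpers, and the swizzle is passed as a four-character string such as "yzxw" to mirror the assembly syntax.

```c
#include <assert.h>

/* One quad-float register; v[0..3] = x, y, z, w. */
typedef struct { float v[4]; } vec4;

/* Maps a component letter to its index, or -1 if invalid. */
static int component_index(char c) {
    switch (c) {
    case 'x': return 0;
    case 'y': return 1;
    case 'z': return 2;
    case 'w': return 3;
    }
    return -1;
}

/* Source modifier: reorder components by a 4-letter swizzle,
 * then optionally negate all of them (e.g. -R0.yzxw). */
static vec4 read_source(vec4 src, const char *swz, int negate) {
    vec4 out;
    for (int i = 0; i < 4; ++i) {
        int k = component_index(swz[i]);
        assert(k >= 0);  /* swz must be exactly four of x, y, z, w */
        out.v[i] = negate ? -src.v[k] : src.v[k];
    }
    return out;
}

/* Write mask: only the components named in mask are updated
 * (e.g. "xz" for a destination written as R1.xz). */
static void write_masked(vec4 *dst, vec4 value, const char *mask) {
    for (; *mask; ++mask) {
        int k = component_index(*mask);
        assert(k >= 0);
        dst->v[k] = value.v[k];
    }
}
```

For example, reading (1, 2, 3, 4) as a negated .yzxw source gives (-2, -3, -1, -4); writing that through an "xz" mask into (9, 9, 9, 9) leaves (-2, 9, -1, 9).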

Dot Product Instruction

DP3 R0.x, R1, R2

R0.x = R1.x * R2.x + R1.y * R2.y + R1.z * R2.z

DP4 R0.x, R1, R2

4-component dot product

MUL instruction

MUL R1, R0, R2 (component-wise multiplication)

R1.x = R0.x * R2.x

R1.y = R0.y * R2.y

R1.z = R0.z * R2.z

R1.w = R0.w * R2.w

MAD instruction

MAD R1, R2, R3, R4

R1 = R2 * R3 + R4

*: component-wise multiplication

Example:

MAD R1, R0.yzxw, R2.zxyw, -R1;

What does it do?

Cross Product Coding Example

# Cross product R2 = R0 x R1

MUL R2, R0.zxyw, R1.yzxw;

MAD R2, R0.yzxw, R1.zxyw, -R2;
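To check that the MUL and MAD pair really yields a cross product, the two instructions can be traced on the CPU. The sketch below mirrors them one-to-one; vec4, yzxw, zxyw, neg, mul, mad and cross are hypothetical helper names introduced for the illustration.

```c
/* One quad-float register. */
typedef struct { float x, y, z, w; } vec4;

/* The two swizzles used by the example. */
static vec4 yzxw(vec4 a) { vec4 r = {a.y, a.z, a.x, a.w}; return r; }
static vec4 zxyw(vec4 a) { vec4 r = {a.z, a.x, a.y, a.w}; return r; }

/* Source negation. */
static vec4 neg(vec4 a) { vec4 r = {-a.x, -a.y, -a.z, -a.w}; return r; }

/* MUL: component-wise product. */
static vec4 mul(vec4 a, vec4 b) {
    vec4 r = {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w};
    return r;
}

/* MAD: component-wise a * b + c. */
static vec4 mad(vec4 a, vec4 b, vec4 c) {
    vec4 r = {a.x * b.x + c.x, a.y * b.y + c.y,
              a.z * b.z + c.z, a.w * b.w + c.w};
    return r;
}

/* R2 = R0 x R1, written exactly as the two-instruction sequence. */
static vec4 cross(vec4 r0, vec4 r1) {
    vec4 r2 = mul(zxyw(r0), yzxw(r1));      /* MUL R2, R0.zxyw, R1.yzxw;      */
    r2 = mad(yzxw(r0), zxyw(r1), neg(r2));  /* MAD R2, R0.yzxw, R1.zxyw, -R2; */
    return r2;
}
```

Tracing the x component gives R0.y * R1.z - R0.z * R1.y, the familiar cross-product term. The w component comes out as R0.w * R1.w - R0.w * R1.w, which is 0 for finite inputs.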

Lighting instruction

LIT R1, R0 (Phong light model)

Input: R0 = (diffuse, specular, ??, shininess)

Output: R1 = (1, diffuse, specular^shininess, 1)

Usually followed by

DP3 o[COL0], c[21], R1; (assuming the coefficients are in c[21])

where c[xx] = (ka, kd, ks, ??)

Ready to trace some programs?

Previous Work: Geometry Engine

• High bandwidth + lots of Flops

• Low clock rate

• No architectural continuity

• VERY hard to program

• Some high-level language support (maybe)

• A compromise solution (vtx, prim, pix, …)

Alternative: The CPU

• Low bandwidth + reasonable Flops

• High clock rate

• Excellent architectural continuity

• VERY hard to use efficiently

• Excellent high-level language support

• Flexible, but often too slow

New Design: The Vertex Engine

• Simple hardware for a commodity GPU

• Allows user to manipulate vertex transform

• Simple-to-use programming model

• Superset of fixed-function mode

Why Vertex Processing?

• Very parallel

• Use single vertex programming model

• Hardware can batch or interleave

• KISS

Why Not Primitive Processing?

• Face culling and clipping break parallelism

• Complicates memory accesses

• Inefficient (control takes time)

• Let hardware designers optimize

Programming Model: Vertex I/O

• Streaming vertex architecture

• Source data converted to floats

• Source data loaded

• Run program

• Destination data drained

• Destination data re-formatted for hardware

Hardware Implementation

• Vector SIMD Unit + Special Function Unit

• Multithreaded and pipelined to hide latency

• Any one instruction per cycle

• All instructions have equal latency

• Free swizzling/negate/write mask support

Conclusion

• Very simple, efficient implementation

• Allows vertex programming continuity

• Stanford Imagine Architecture

• A work in progress, lots more to come…

• We welcome your feedback