SIMD ColumnP - Computer Graphicsgraphics.stanford.edu/~eldridge/academia/ms/eldridge_ms_2up.pdf ·...

SIMD Column�Parallel Polygon Rendering

by

Matthew Willard Eldridge

Submitted to the Department of Electrical Engineering and Computer

Science

in partial ful�llment of the requirements for the degrees of

Bachelor of Science

and

Master of Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May ��

c� Matthew Willard Eldridge� MCMXCV� All rights reserved�

The author hereby grants to MIT permission to reproduce and

distribute publicly paper and electronic copies of this thesis document

in whole or in part� and to grant others the right to do so�

Author � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

Department of Electrical Engineering and Computer Science

May ��

Certi�ed by � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

Seth Teller

Assistant Professor of Electrical Engineering and Computer Science

Thesis Supervisor

Certi�ed by � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

Elizabeth C� Palmer

Director� Employment and Development� David Sarno� Research

Center

Company Supervisor

Accepted by � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

Frederic R� Morgenthaler

Chairman� Departmental Committee on Graduate Students

�

SIMD Column�Parallel Polygon Rendering

by

Matthew Willard Eldridge

Submitted to the Department of Electrical Engineering and Computer Science

on May �� in partial ful�llment of the

requirements for the degrees of

Bachelor of Science

and

Master of Science

Abstract

This thesis describes the design and implementation of a polygonal rendering system

for a large one dimensional single instruction multiple data �SIMD� array of pro

cessors Polygonal rendering enjoys a traditionally high level of available parallelism�

but e�cient realization of this parallelism requires communication On large systems�

such as the �� processor Princeton Engine studied here� communication overhead

limits performance

Special attention is paid to study of the analytical and experimental scalability

of the design alternatives Sort�last communication is determined to be the only

design to o�er linear performance improvements in the number of processors We

demonstrate that sort��rst communication incurs increasingly high redundancy of

calculations with increasing numbers of processors� while sort�middle communication

has performance linear in the number of primitives and independent of the number

of processors Performance is thus be communication�limited on large sort��rst and

sort�middle systems

The system described achieves interactive rendering rates of up to �� frames per

second at NTSC resolution Scalable solutions such as sort�last proved too expensive

to achieve real�time performance Thus� to maintain interactivity a sort�middle

render was implemented A large set of communication optimizations are analyzed

and implemented under the processor�managed communication architecture

Finally a novel combination of sort�middle and sort�last communication is pro

posed and analyzed The algorithm e�ciently combines the performance of sort�

middle architectures for small numbers of processors and polygons with the scala

bility of sort�last architectures for large numbers of processors to create a system

with speedup linear in the number of processors The communication structure is

substantially more e�cient than traditional sort�last architectures both in time and

in memory

Thesis Supervisor� Seth Teller

Title� Assistant Professor of Electrical Engineering and Computer Science

Company Supervisor� Elizabeth C Palmer

Title� Director� Employment and Development� David Sarno� Research Center

Acknowledgments

I would like to thank David Sarno� Research Center and the Sarno� Real Time Cor

poration for their support of this research In particular� Herb Taylor for his expertise

in assembly level programming of the Princeton Engine and many useful discussions�

Jim Fredrickson for his willingness to modify the operating system and programming

environment during development of my thesis� Joe Peters for his patience with my

complaints� and Jim Kaba for his tutoring and frequent insights I would like to

thank Scott Whitman for many helpful discussions about parallel rendering and so

willingly sharing the Princeton Engine with me Thanks to all my friends at David

Sarno� Research Center and Sarno� Real Time Corporation� you made my work and

research very enjoyable

I would like to thank my advisor� Seth Teller� for providing continual motivation to

improve and extend my research Seth has been a never ending source of suggestions

and ideas� and an inspiration

Most of all� a special thank you to my parents for supporting me here during my

undergraduate years and convincing me to pursue graduate studies Without their

love and support I would have never completed this thesis

This material is based upon work supported under a National Science Foundation

Graduate Research Fellowship Any opinions� �ndings� conclusions or recommenda

tions expressed in this publication are those of the author and do not necessarily

re�ect the views of the National Science Foundation

�

Contents

� Introduction ��

�� Background � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Hardware Approaches � � � � � � � � � � � � � � � � � � � � � � ��

�� Software Approaches � � � � � � � � � � � � � � � � � � � � � � � ��

�� Overview � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Princeton Engine Architecture ��

�� Processor Organization � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Interprocessor Communication � � � � � � � � � � � � � � � � � � � � � � �

�� Video I�O � Line�Locked Programming � � � � � � � � � � � � � � � � ��

� Implementation � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Graphics Pipeline ��

�� UniProcessor Pipeline � � � � � � � � � � � � � � � � � � � � � � � � � � �

�� Geometry � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

�� Rasterization � � � � � � � � � � � � � � � � � � � � � � � � � � � �

�� Details � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� MultiProcessor Pipeline � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Communication � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Geometry � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Rasterization � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�

� Scalability ��

� Geometry � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Rasterization � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Communication � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

�� Model � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

�� Sort�First � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Sort�Middle � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Sort�Last � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

Analysis � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Sort�First vs Sort�Middle � � � � � � � � � � � � � � � � � � � � �

� Sort�Last� Dense vs Sparse � � � � � � � � � � � � � � � � � � � �

� Sort�Middle vs Dense Sort�Last � � � � � � � � � � � � � � � � ��

� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Communication Optimizations ��

�� Sort�Middle Overview � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Dense Network � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Distribution � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Data Duplication � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Temporal Locality � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

� SortTwice Communication ��

�� Why is Sort�Last so Expensive� � � � � � � � � � � � � � � � � � � � � � ��

�� Leveraging Sort�Middle � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Sort�Last Compositing � � � � � � � � � � � � � � � � � � � � � � � � � ��

� Sort�Twice Cost � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Deferred Shading and Texture Mapping � � � � � � � � � � � � � � � � � ��

�� Performance � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��

�

�Im

plementatio

n��

��Parallel

Pipelin

e��

��

��Line�L

ockedProgram

ming

��

��

��Geom

etry��

��

��Represen

tation��

��

��Tran

sformation

��

��

��Visib

ility

Clip

ping

��

��

��Ligh

ting

��

��

��Coe

cientCalcu

lation��

��

��Com

munication

��

��

��Passin

gaSingle

Poly

gonDescrip

tor��

��

��Passin

gMultip

leDescrip

tors��

��

��Rasterization

��

��

��Norm

alizeLinear

Equation

s��

��

��Rasterize

Poly

gons

��

��

��Rasterization

Optim

izations

��

��

��Summary

��

��

��Control

��

��

��Summary

��

��

�Results

��

��Raw

Perform

ance

��

��

��Geom

etry��

��

��Com

munication

��

��

��Rasterization

��

��

��Aggregate

Perform

ance

��

��

��Accou

ntin

g��

��

��Perform

ance

Com

parison

��

��

�

�FutureDire

ctio

ns

��

��Distrib

ution

Optim

ization��

��

��Sort�T

wice

��

��

��NextGeneration

Hard

ware

��

��

��Conclusio

ns

��

ASim

ulator

��

A��

Model

��

��

A��

InputFiles

��

��

A��

Param

eters��

��

A��

Operation

��

��

A��

Data

Initialization

Geom

etry��

��

A��

Com

munication

��

��

A��

Rasterization

��

��

A��

Correctn

ess��

��

A��

Summary

��

��

�

List

ofFigures

��Prin

cetonEngin

eVideo

I�O��

��

��Prin

cetonEngin

e��

��

��Com

munication

Option

son

thePrin

cetonEngin

e��

��

��Prin

cetonEngin

e��

��

��Synthetic

Eyep

oint

��

��

��Beeth

oven��

��

��Crocod

ile��

��

��Teap

ot��

��

��Texture

Mapped

Crocod

ile��

��

��Sam

pleTexture

Image

��

��

��Uniprocessor

Geom

etryPipelin

e��

��

��D

Persp

ectiveTran

sformation

��

��

��Poly

gonLinear

Equation

s��

��

��Trian

gleDe�nition

byHalf�P

laneIntersection

��

��

��Uniprocessor

Rasterization

��

�

��Sort

Order

��

��

��SIM

DGeom

etryPipelin

e��

��

��SIM

DRasterization

��

��

��Sort�F

irstLoad

Imbalan

ce��

��

��Sort�M

iddle

��

��

��Com

munication

Costs

��

��

��

��Com

munication

Optim

izations��

�

��Sort�M

iddleCom

munication

��

��

��Sim

ulation

ofSparse

vs�

Dense

Netw

orks��

��

��Pass

Cost

��

��

��i�

vs�

dfor

Distrib

ution

Optim

ization��

��

��Poly

gonDuplication

��

��

��Poly

gonsPer

Processor

vs�

m��

��

��Instru

ctionsPer

Poly

gonas

afunction

ofm

andd��

��

��Tem

poral

Locality

��

��

��Multip

leUsers

onaParallel

Poly

gonRenderer

��

��

��Sort�L

astCom

positin

g��

��

��Pixelto

Processor

Map

��

��

��Sort�T

wice

Com

positin

gTim

e��

��

��Pred

ictedSort�M

iddlevs�

Sort�T

wice

Com

munication

Tim

e��

��

��Sim

ulated

Sort�M

iddlevs�

Sort�T

wice

Com

munication

Tim

e��

��

��Sim

ulated

Sort�M

iddlevs�

Sort�T

wice

Execu

tionTim

e��

��

��Beeth

oven��

��

��Crocod

ile��

��

��Texture

Mapped

Crocod

ile��

��

��Teap

ot��

��

��Poly

gonRepresen

tation��

��

��Poly

gonDescrip

tor��

��

��Im

plem

entation

forData

Distrib

ution

��

��

��Linear

Equation

Norm

alization��

��

��Rasterizin

g��

��

��Virtu

alTrack

ball

Interface

��

��

��Execu

tionPro�

le��

��

��

A��

Bounding�b

oxInputFile

��

��

A��

Sam

pleScreen

Map

��

��

A��

Sim

ulator

Model

��

��

A��

Destin

ationFalse

Positiv

e��

��

A��

Sim

ulator

Com

munication

��

��

A��

Destin

ationVector

Align

ment

��

��

A��

Destin

ationVector

forg��

��

��

��

��

List

ofTables

��Com

munication

Modelin

gVariab

les��

��

��Com

munication

sCosts

��

��

��Com

munication

Modelin

gValu

es��

��

��Renderin

gPerform

ance

forTypical

Scen

es��

��

A��

Sim

ulator

Param

eters��

��

��

��

Chapter�

Introductio

n

Thisthesis

explores

theim

plem

entationof

aninteractive

poly

gonal

renderer

onthe

Prin

cetonEngin

e�a��p

rocessorsin

gle�instru

ctionmultip

le�data

�SIM

D�parallel

computer�

Each

processor

isasim

plearith

metic

coreable

tocom

municate

with

its

leftandrigh

tneigh

bors�

Thisisan

importan

tarch

itectureon

which

tostu

dytheim

plem

entation

ofpoly

gon

renderin

g�for

manyreason

s�

�Theinterp

rocessorcom

munication

structu

reissim

pleandinexpensiv

e�making

thecon

struction

oflarger

system

spractical�

with

costlin

earin

thenumber

of

processors�

�Theuse

ofSIM

Dprocessors

allowsvery

carefully

managed

communication

to

beim

plem

ented

�anda�ord

ssign

i�can

topportu

nities

foroptim

izations�

�Curren

tpoly

gonren

derin

galgorith

msandarch

itectures

oftenexhibitpoor

scal�

ability�

andare

design

edwith

largeandexpensiv

ecrossb

arsto

insure

adequate

interp

rocessorbandwidth�A

reduction

intheam

ountof

communication

re�

quired

andthecarefu

lstru

cturin

gof

thecom

munication

algorithm�as

man�

dated

bythemodest

communication

resources

ofthePrin

cetonEngin

e�will

have

largee�ects

onthefeasib

ilityof

future

largermach

ines�

��

Manyparallel

renderers

have

been

implem

ented

�butmost

were

either

execu

ted

onsm

allermach

inesor

onmach

ineswith

ahigh

erdim

ension

communication

netw

ork

than

thePrin

cetonEngin

e�Theuse

ofalarge

mach

inewith

narrow

intercon

nect

such

asthePrin

cetonEngin

emakes

thepoly

gonren

derin

gprob

lemsign

i�can

tlydi�eren

t�

Poly

gonren

derin

gprov

ides

aview

port

into

athree�d

imension

al��D

�space

ona

two�d

imension

al��D

�disp

laydevice�

Thespace

may

beofanysort�

realor

imagin

ed�

There

aretwobroad

approach

estaken

topoly

gonren

derin

g�realism

andinteractiv

ity�

Realism

requires

precise

andgen

erallyexpensiv

ecom

putation

toaccu

ratelymodela

scene�

Interactiv

ityreq

uires

real�timecom

putation

ofascen

e�at

apossib

leloss

of

realism�

Poly

gonren

derin

gmodels

ascen

eas

composed

ofaset

ofpoly

gonsin

�Dbein

g

observ

edbyaview

erfrom

anyarb

itrarylocation

�Geom

etrycom

putation

stran

s�

formthepoly

gonal

scenefrom

a�D

model

toa�D

model

andthen

rasterization

computation

sren

der

the�D

poly

gonsto

thedisp

laydevice�

Both

thegeom

etryand

therasterization

computation

sare

high

lyparallel�

buttheir

parallelization

exposes

a

sortingprob

lem�Poly

gonren

derin

ggen

erallyrelies

onthenotion

ofpoly

gonsbein

g

rendered

inthe�correct�

order�

sothat

poly

gonsnearer

totheobserver�s

view

poin

t

obscu

repoly

gonsfarth

eraw

ay�Ifthepoly

gonren

derin

gprocess

isdistrib

uted

across

processors

thisord

eringcon

straintwill

require

interp

rocessorcom

munication

tore�

solve�

Theord

eringcon

straintis

generally

resolvedin

oneof

three

ways�

referredto

bywhere

they

place

thecom

munication

intheren

derin

gprocess�

Sort��

rstand

sort�middlealgorith

msplace

poly

gonsthat

overlaponce

projected

tothescreen

on

thesam

eprocessor

forrasterization

sothat

theocclu

sioncan

beresolv

edwith

local

data�

Sort�last

rasterizeseach

poly

gonon

theprocessor

that

perform

sthepoly

gon�s

geometry

calculation

s�andthen

uses

pixel

bypixel

inform

ationafter

thefact

to

combineall

oftheprocessor�s

rasterizationresu

lts�Theadvan

tagesanddisad

vantages

ofthese

approach

esare

discu

ssed�both

ingen

eral�andwith

aspeci�

cfocu

son

an

e cien

tim

plem

entation

onthePrin

cetonEngin

e�

��

Amajor

goalofthisthesis

isto

�ndandextract

scalability

fromtheexplored

com�

munication

algorithms�so

that

asthenumber

ofprocessors

increases

theren

derin

g

perform

ance

will

increase

commensurately�

Wedem

onstrate

that

forsom

ecom

muni�

cationapproach

es�most

notab

lysort�m

iddle�

thecom

munication

timeislin

earin

the

numberofpoly

gonsandindepen

den

tofthenumberofprocessors�

With

outsom

ecare

communication

candom

inate

execu

tiontim

eofpoly

gonren

derin

gon

largemach

ines�

Sort�last

algorithmsmake

constan

t�timecom

munication

possib

le�independentof

both

thenumber

ofpoly

gonsandthenumber

ofprocessors�

butunfortu

nately

the

constan

tcost

isvery

large�

Anovel

combination

ofsort�m

iddleandsort�last

communication

isprop

osedand

analy

zedto

address

theprob

lemof

�ndingscalab

lecom

munication

algorithms�

The

algorithm

e�cien

tlycom

bines

theperform

ance

ofsort�m

iddlearch

itectures

forsm

all

numbers

ofprocessors

andpoly

gonswith

thescalab

ilityof

sort�lastarch

itectures

to

createasystem

with

perform

ance

linearin

thenumber

ofprocessors�

Thesystem

describ

edin

thisthesis

achieves

interactive

renderin

grates

ofupto

��

frames

per

second�Tomain

taininteractiv

ityasort�m

iddleren

der

was

implem

ented

�

asscalab

lesolu

tionssuch

assort�last

proved

tooexpensiv

eto

achieve

real�timeper�

formance�

Alarge

setof

optim

izationsare

analy

zedandim

plem

ented

toaddress

the

scalability

prob

lemsof

sort�middlecom

munication

�

This

thesis

signi�can

tlydiers

fromcurren

twork

onparallel

poly

gonren

der�

ing�

Contem

porary

parallel

renderin

gfocu

sesvery

heav

ilyon

multip

le�instru

ction

multip

le�data

MIM

D�arch

itectures

with

large�fast

andexpensive

netw

orks�

We

have

instead

pursu

edaSIM

Darch

itecture

with

arath

ernarrow

netw

ork�

Back

�

groundisprov

ided

forsoftw

areandhard

ware

parallel

renderers�

both

ofwhich

have

signi�can

tlyin�uenced

thiswork

�

�

��

Background

Poly

gonren

derin

gsystem

svary

widely

intheir

capabilities

beyondjustpain

tingpoly

�

gonson

thescreen

�Themost

common

features

foundinmost

ofthesystem

sdescrib

ed

below

inclu

de�

Depth

Bu�erin

gApixel�b

y�pixelcom

putation

ofthedistan

ceofthepoly

gonfrom

theusers

eyeisstored

ina�depth

buer��

Poly

gonare

only

rendered

atthe

pixels

where

they

arecloser

totheobserver

than

theprev

iously

rendered

poly

�

gons�

Smooth

ShadingThecolor

ofthepoly

gonissm

oothly

interp

olatedbetw

eenthe

verticesto

givetheim

pression

ofasm

ooth

lyvary

ingsurface�

rather

than

a

surface

madeupof

�at�sh

aded

plan

arfacets�

Texture

MappingThecolor

ofthepoly

gonisrep

lacedwith

anim

ageto

increase

thedetail

ofthescen

e�For

exam

ple�

anim

agecan

betex

ture

mapped

onto

a

billb

oardpoly

gon�

More

complex

capabilities

arepresen

tin

manyof

thesystem

sand�while

tech�

nically

interestin

g�are

leftunaddressed

asthey

don�tbear

directly

onthepoly

gon

renderin

gsystem

describ

edherein

�

There

have

been

numerou

sparallel

poly

gonren

derin

gsystem

sbuilt�

Iwill

discu

ss

someof

them

here�

andprov

idereferen

cesto

theliteratu

re�It

issign

i�can

tthat

most

ofthework

deals

either

with

largeVLSIsystem

s�or

MIM

Darch

itectures

with

pow

erfulindividual

nodes�

andarich

intercon

nection

netw

ork�Thisthesis

brid

ges

thegap

betw

eenthese

approach

es�mergin

gthevery

high

parallelism

ofhard

ware

approach

eswith

the�exibility

ofsoftw

areapproach

es�

��

HardwareApproaches

High

perform

ance

poly

gonren

derin

ghas

tradition

allybeen

thedom

ainof

special

purpose

hard

ware

solution

s�as

hard

ware

canread

ilyprov

ideboth

theparallelism

��

required

andthecom

munication

bandwidth

necessary

tofully

exploit

theparallelism

inpoly

gonren

derin

g�Turner

Whitted

prov

ides

asurvey

ofthe ��

stateof

theart

inhard

ware

graphics

architectu

resin

��

Iwill

brie�

ydiscu

ssthree

speci�

csystem

shere�

thePixelF

lowarch

itecture

from

theUniversity

ofNorth

Carolin

a�andtwoarch

itectures

fromSilicon

Grap

hics�

Inc��

theIR

IS�D

�GTXandtheReality

Engin

e�Allof

thearch

itectures

prov

ideinteractive

renderin

gwith

variousdegrees

ofrealism

�

PixelFlow

PixelF

low�describ

edin

� ��and� ��

isalarge�scale

parallel

renderin

gsystem

based

onasort�last

compositin

garch

itecture�

Itprov

ides

substan

tialprogram

mability

of

thehard

ware�

andat

aminim

um

perform

sdepth

buerin

g�sm

ooth

shadingand

texture

mappingof

poly

gons�

PixelF

lowiscom

posed

ofapipelin

eof

renderin

gboard

s�Each

renderin

gboard

consists

ofaRISCprocessor

toperform

geometry

calculation

anda ��

� ��

SIM

D

arrayof

pixel

processors

that

rasterizean

entire

poly

gonsim

ultan

eously�

prov

iding

parallelism

intheextrem

e�Thepoly

gonsdatab

aseisdistrib

uted

arbitrarily

among

theren

derin

gboard

sto

achiev

ean

evenload

acrosstheprocessors�

After

renderin

g

allofits

poly

gonseach

renderin

gboard

has

apartial

image

ofthescreen

�correct

only

forthat

subset

ofthepoly

gonsitren

dered

�Thesep

aratepartial

images

ofthescreen

arethen

composited

togen

erateasin

gleoutputim

agevia

a��

gigabit�secon

d

intercon

nection

bus�

PixelF

lowim

plem

entsshadingandtex

turemappingintwospecial

shadingboard

s

attheendofthepipelin

eofren

derin

gboard

s�Each

pixelis

thusshaded

atmost

once�

eliminatin

gneed

lessshadingofnon�visib

lepixels�

andtheam

ountoftex

turemem

ory

required

ofthesystem

islim

itedto

theam

ountthat

will

prov

idesu�cien

tbandwidth

totex

ture

map

fullresolu

tionim

agesat

update

rates�

Thesort�last

architectu

reprov

idesperform

ance

increases

nearly

linear

inthenum�

ber

ofren

derin

gboard

sandread

ilyadmits

loadbalan

cing�

ThePixelF

lowarch

itec�

�

ture�

when

complete�

isexpected

toscale

toperform

ance

ofat

least�million

poly

�

gonsper

second�Bycom

parison

thecurren

tSilicon

Grap

hics

Reality

Engin

eren

ders

slightly

more

than

million

poly

gonsper

second�

IRIS

�D�GTX

TheSilicon

Grap

hics

IRIS

�D�G

TX�describ

edin

��was

�rst

soldin

��andisthe

Prin

cetonEngin

e�scon

temporary�

TheGTX

perform

ssm

oothshadinganddepth�

buerin

gof

litpoly

gonsandcan

render

over ��

triangles

asecon

d�TheGTX

prov

ides

notex

ture

mappingsupport�

Geom

etrycom

putation

sare

perform

edbyapipelin

eof�high

perform

ance

proces�

sorswhich

eachperform

adistin

ctphase

ofthegeom

etrycom

putation

s�Theoutput

ofthegeom

etrypipelin

eisfed

into

apoly

gonprocessor

which

dices

poly

gonsinto

trapezoid

swith

parallel

verticaledges�

These

trapezoid

sare

consumed

byan

edge

processor

that

generates

aset

ofvertical

spansto

interp

olateto

render

eachpoly

gon�

Aset

of�span

processors

perform

interp

olationof

thevertical

spans�

Each

span

processor

isfurth

ersupported

by�im

ageengin

esthat

perform

thecom

position

of

thepixels

with

prev

iously

rendered

poly

gonsandmanage

thefully

distrib

uted

frame�

buer�

Thespan

processors

areinterleaved

columnbycolu

mnof

thescreen

andthe

image

engin

esfor

eachspan

processor

areinterleaved

pixelbypixel�

The�negrain

interleaved

interleav

ingof

thespan

processors

andim

ageengin

esinsure

that

most

prim

itiveswill

berasterized

inpart

byall

oftheim

ageengin

es�

Com

munication

isim

plem

ented

bythesort�m

iddle

screen�sp

acedecom

position

perform

edbetw

eentheedge

processor

andthespan

processors�

Thischoice�

while

limitin

gthescalab

ilityof

thesystem

�avoid

stheextrem

elyhigh

constan

tcost

of

sort�lastapproach

essuch

asPixelF

low�

Reality

Engine

TheReality

Engin

e�describ

edin

� ��rep

resentsthecurren

thigh

�endpoly

gonren

der�

ingsystem

fromSilicon

Grap

hics�

Itsupports

fulltex

ture

mapping�shading�

lightin

g

��

anddepth

buerin

gof

prim

itivesat

ratesexceed

ingonemillion

poly

gonsasecon

d�

Logically

theReality

Engin

eisvery

similar

tothearch

itecture

oftheGTX�The

��processor

geometry

pipelin

ehas

been

replaced

by �

independentRISCprocessors

eachim

plem

entingacom

plete

version

ofthepipelin

e�to

which

poly

gonsare

assigned

inarou

nd�rob

infash

ion�A

triangle

busprov

ides

thesort�m

iddleintercon

nect

be�

tween

thegeom

etryengin

esandaset

ofupto

��interleaved

fragmentgen

erators�

Each

fragmentgen

eratoroutputs

shaded

andtex

tured

fragments

toaset

of �

im�

ageengin

eswhich

perform

multisam

pleantialiasin

gof

theim

age�andprov

ideafully

distrib

uted

framebuer�

Themost

signi�can

tchange

fromtheGTX

arethefrag�

mentgen

erators�which

arethecom

bination

ofof

thespan

processors

andtheedge

processors�

Thetex

ture

mem

oryisfully

duplicated

ateach

ofthe��

fragmentgen

eratorsto

prov

idethehigh

parallel

bandwidth

required

toaccess

textures

atfullthrottle

byall

fragmentgen

erators�andas

such

thesystem

operates

with

littleperform

ance

penalty

fortex

ture

mapping�

Afully

con�gured

Reality

Engin

econ

sistsof

over��

processors

andnearly

half

agigab

yte

ofmem

ory�andprov

ides

theclosest

analog

tothePrin

cetonEngin

ein

rawcom

putation

alpow

er�Substan

tialdieren

cesin

communication

bandwidth

and

thespeed

oftheprocessors

makethecom

parison

farfrom

perfect

how

ever�andthe

Prin

cetonEngin

e�sperform

ance

ismore

comparab

leto

itscon

temporary�

theGTX�

��

Softw

are

Approaches

There

have

been

anumber

ofsoftw

areparallel

renderers

implem

ented

�both

inthe

interests

ofrealism

and�or

interactiv

ity�Themajority

ofthese

renderers

have

either

been

implem

ented

onsm

allmach

ines

tensof

processors�

oron

largemach

inewith

substan

tialintercon

nection

�In

eithercase

theseverity

ofthecom

munication

prob

lem

isminim

ized�

��

Connectio

nMachine

Fran

kCrow

etal�

implem

ented

apoly

gonal

renderin

gsystem

onaConnection

Ma�

chine�C

M��

atWhitn

ey�D

emos

Prod

uction

s�as

describ

edin

��TheCM��

� ��

isamassiv

elyparallel

SIM

Dcom

puter

composed

ofupto

��bit�serial

processors

and��

�oatin

g�poin

tcop

rocessors�Thenodes

arecon

nected

ina�D

mesh

and

alsoviaa ��d

imension

alhypercu

bewith

�processors

ateach

vertex�Each

nodein

theCM��

has

�KBof

mem

ory�for

atotal

main

mem

orysize

of� �M

B�

Crow

distrib

utes

thefram

ebuerover

themach

inewith

afew

pixels

per

processor�

Renderin

gisperform

edwith

apoly

gonassign

edto

eachprocessor�

Allp

rocessorsren

�

dersim

ultan

eously

andas

pixels

aregen

eratedthey

aretran

smitted

viathehypercu

be

intercon

nect

totheprocessor

with

theapprop

riatepiece

ofthefram

ebuer

fordepth

buerin

g�Thisistherefore

alarge�scale

implem

entation

ofsort�last

communication

�

Theissu

eof

thebandwidth

thisreq

uires

fromthenetw

orkisleft

untou

ched�It

is

reasonableto

assumethat

itisnot

inexpensiv

eon

such

alarge

mach

ine�

Crow

alludes

totheprob

lemofcollision

sbetw

eenpixels

destin

edfor

thesam

eprocessor�

butleaves

itunresolved

�

Theuse

ofaSIM

Darch

itecture

forcesall

processors

toalw

aysperform

thesam

e

operation

son

poten

tiallydieren

tdata��

Thusifpoly

gonsvary

insize

allprocessors

will

have

towait

fortheprocessor

with

thebiggest

poly

gonto

�nish

rasterizingbefore

they

proceed

tothenextpoly

gon�Thisprob

lemisexacerb

atedbythe��D

natu

reof

thepoly

gons�which

generally

break

stherasterization

into

aloop

overthehorizon

tal

�spans�

ofthepoly

gons�

with

aninner

loopover

thepixels

ineach

span�With

out

carethis

canresu

ltin

allprocessors

execu

tingin

timeprop

ortional

totheworst

casepoly

gonwidth�Crow

addresses

thisprob

lemin

part

byallow

ingprocessors

to

independently

advan

ceto

thenextspan

oftheir

curren

tpoly

gonevery

fewpixels�

butcon

strainsall

processors

tosynchron

izeat

poly

gonboundaries�

Ourexperien

ce

suggests

that

asubstan

tialam

ountof

timeiswasted

bythisapproach

�

Crow

prov

idesagood

overview

ofthedi�

culties

ofadaptin

gthepoly

gonren

derin

g

algorithmto

avery

largeSIM

Dmach

ine�b

utleaves

somecriticalload

balan

cingissu

es

��

unresolved

�andunfortu

nately

prov

ides

noperform

ance

numbers�

Prin

cetonEngine

Contem

poran

eouswith

thisthesis�

Scott

Whitm

anim

plem

ented

amultip

le�user

poly

�

gonren

derer

onthePrin

cetonEngin

e�Thesystem

describ

edin

��divides

the

Prin

cetonEngin

eam

ongatotal

of�sim

ultan

eoususers�

assigningagrou

pof

��

processors

toeach

user�

Sort�m

iddlecom

munication

isused

�with

somediscu

ssionmadeofthecost

ofthis

approach

�Asin

Crow

��Whitm

anem

ploy

saperiod

ictest

durin

gtherasterization

ofeach

scan�lin

eto

allowprocessors

toindependently

advan

ceto

thenextscan

�line�

Healso

synchron

izesat

thepoly

gonlevel�

constrain

ingthealgorith

mto

perform

in

timeprop

ortional

tothelargest

poly

gonon

anyprocessor�

Whitm

ansuggests

that

hisalgorith

mwould

attainperform

ance

ofapprox

imately

��trian

glesper

secondper

��processors

onthenextgen

erationhard

ware

we

discu

ssinx��

Thisassu

mes

that

thesystem

remain

sdevoted

tomultip

leusers�

soa

single

user

will

not

attainan

addition

al��

triangles

per

secondfor

everygrou

p

of ��

processors

assigned

tohim

�

Whitm

analso

prov

ides

agood

general

discu

ssionandback

groundfor

parallel

computer

renderin

gmeth

odsin

hisbook

� ��

ParallelRenderM

an

Mich

aelCox

addresses

issues

of�ndingscalab

leparallel

algorithmsandarch

itectures

forren

derin

g�andspeci�

callyexam

ines

aparallel

implem

entationof

RenderM

an��

Cox

implem

entsboth

asort��

rstren

derer�as

ane�

cientdecom

position

oftheren

�

derin

gprob

lemin

image

space�

andasort�last

renderer�

asan

e�cien

tdecom

position

inob

jectspace�

tostu

dythedieren

cesin

e�cien

cybetw

eenthese

twoapproach

es�

Both

implem

entation

swere

onaCM��

fromThinkingMach

ines�

a� �

processor

MIM

Darch

itecture

with

apow

erfulcom

munication

netw

ork�

Theprob

lemsof

eachapproach

arediscu

ssedin

detail�

Sort��

rstan

dsort�

��

middle�

cansuer

fromload

imbalan

cesas

thedistrib

ution

ofpoly

gonsvaries

in

image

space�

andsort�last

suers

fromextrem

ecom

munication

requirem

ents�

Cox

addresses

these

issues

with

ahybrid

algorithm

which

uses

sort�lastcom

munication

asaback

endto

loadbalan

ceto

sort��rst

communication

�

Distrib

utedMemory

MIM

DRenderin

g

Thom

asCrocket

andTobias

Orlo

prov

ideadiscu

ssionof

parallel

renderin

gon

a

distrib

uted

mem

oryMIM

Darch

itecturein

��with

sort�middlecom

munication

�Par�

ticular

attention

ispaid

toload

�imbalan

cesdueto

irregular

conten

tdistrib

ution

over

thescen

e�

They

use

arow

�parallel

decom

position

ofthescreen

�assign

ingeach

processor

agrou

pof

contigu

ousrow

sfor

which

itperform

srasterization

�Thegeom

etryand

rasterizationcalcu

lationsare

perform

edin

aloop

��rst

perform

ingsom

egeom

etry

calculation

sandtran

smittin

gtheresu

ltsto

theoth

erprocessors�

then

rasterizing

somepoly

gonsreceiv

edfrom

other

processors�

then

perform

ingmore

geometry�

etc�

This

distrib

utes

thecom

munication

overtheentire

timeof

thealgorith

mandcan

resultin

much

better

utilization

ofthecom

munication

netw

orkbynot

constrain

ing

allof

thecom

munication

tohappen

inaburst

betw

eengeom

etryandrasterization

�

Interactiv

eParallelGraphics

David

Ellsw

orthdescrib

esan

algorithm

forinteractive

graphics

onparallel

computers

in��

Ellsw

orth�s

algorithm

divides

thescreen

into

anumber

ofregion

sgreater

than

thenumber

ofprocessors�

anddynam

icallycom

putes

amappingof

regionsto

processors

toach

ievegood

loadbalan

cing�

Thealgorith

mwas

implem

ented

onthe

Touchston

eDelta

prototy

peparallel

computer

with

� �Intel

i��com

pute

nodes

intercon

nected

ina�D

mesh

�

Poly

gonsare

communicated

insort�m

iddlefash

ionwith

thecaveat

that

processors

areplaced

into

groups�

Aprocessor

communicates

poly

gonsafter

geometry

calcula�

tionsbybundlin

gall

thepoly

gonsto

berasterized

byaparticu

largrou

pinto

alarge

��

message

andsen

dingthem

toa�forw

arding�

processor

atthat

groupthat

then

redis�

tributes

them

locally�Thisapproach

amortizes

message

overhead

bycreatin

glarger

message

sizes�

Thisam

ortizationof

thecom

munication

netw

orkoverh

eadisapow

erfulidea�

and

isthecen

tralthem

eof

several

optim

izationsexplored

inthisthesis�

��

Overview

Thisthesis

deals

with

anumber

ofissu

esin

parallel

renderin

g�Agreat

deal

ofatten

�

tionispaid

toe�

cientcom

munication

strategieson

SIM

Drin

gs�andtheir

speci�

c

implem

entation

onthePrin

cetonEngin

e�

Chapter

�prov

ides

anoverv

iewof

thePrin

cetonEngin

earch

itecture

which

will

driv

eouranaly

sisof

sortingmeth

ods�andourim

plem

entation

�

Chapter

�discu

ssestheuniprocessor

poly

gonren

derin

gpipelin

eandthen

exten

ds

itto

themultip

rocessorpipelin

e�Approach

esto

thesortin

gprob

lemare

addressed

�

andissu

esin

extractin

ge�

ciency

fromaparallel

SIM

Dmach

ineare

raised�

Chapter

�exam

ines

thescalab

ilityof

poly

gonren

derin

galgorith

ms�bylook

ing

atthescalab

ilityof

thethree

prim

arycom

ponents

ofren

derin

g�geom

etry�com

�

munication

�andrasterization

�A

general

model

fortheperform

ance

ofsort��

rst�

sort�middleandsort�last

communication

ona D

ringisprop

osedandused

tomake

broad

pred

ictionsaboutthescalab

ilityof

thesortin

galgorith

msin

both

thenumber

ofpoly

gonsandthenumber

ofprocessors�

Atten

tionisalso

paid

totheload

bal�

ancin

gprob

lemsthat

thechoice

ofsortin

galgorith

mcan

introd

uce

andtheir

eects

ontheoverall

scalability

oftheren

derer�

Chapter

�con

cludes

with

justi�

cationfor

thechoice

ofsort�m

iddlecom

munication

forthePrin

cetonEngin

epoly

gonren

derin

g

implem

entation

�desp

iteits

lackof

scalability�

Chapter

�prop

osesandanaly

zesoptim

izationsfor

sort�middle

communication

�

both

with

ananaly

ticalmodel�

andwith

thesupport

ofinstru

ction�accu

ratesim

�

ulation

�Thefeasib

ilityof

verytigh

tuser

control

overthecom

munication

netw

ork

��

isexploited

toprov

ideaset

ofprecisely

managed

andhigh

lye�

cientoptim

izations

which

more

than

tripletheoverall

perform

ance

fromthemost

naiv

ecom

munication

implem

entation

�

Chapter

�prop

osesandanaly

zesanovel

communication

strategythat

marries

the

e�cien

cyofsort�m

iddlecom

munication

forsm

allnumbers

ofpoly

gonsandprocessors

with

thecon

stanttim

eperform

ance

ofsort�last

communication

toobtain

ahigh

ly

scalable

algorithm�Cycle

bycycle

control

ofthecom

munication

channel

isused

to

createacom

munication

structu

rethat

while

requirin

gthesam

eam

ountof

band�

width

assort�last

communication

achieves

nearly

afactor

of�speed

upover

atypical

implem

entation

byam

ortizingthecom

munication

overhead

�TheVLIW

codingof

thePrin

cetonEngin

eallow

safulldeferred

lightin

g�shadingandtex

ture

mapping

implem

entation

which

increases

rasterizere�

ciency

byafactor

ofmore

than

two�

Chapter

�details

ourim

plem

entation

ofpoly

gonren

derin

gon

thePrin

cetonEn�

gine�

Acom

plete

discu

ssionof

thee�

cienthandlin

gof

thecom

munication

channel

ismade�speci�

callydiscu

ssingthequeuein

gof

poly

gonsfor

communication

tomain

�

tainsatu

rationofthechannel�

andtheuseofcarefu

llypipelin

edpoin

terindirection

to

achieve

general

all�to�allcom

munication

with

neigh

bor�to�n

eighbor

communication

with

outanyaddition

alcop

yoperation

s�Speci�

cmech

anism

sandoptim

izationfor

theim

plem

entation

offast

ande�

cientparallel

geometry

andrasterization

arealso

discu

ssed�

Chapter

�discu

ssestheperform

ance

ofoursystem

�andprov

ides

anaccou

ntof

where

alltheinstru

ctionsgo

inasin

glesecon

dof

execu

tion�Statistics

fromexecu

�

tionsof

thesystem

areprov

ided

forthree

samplescen

esof

varyingcom

plex

ity�alon

g

with

benchmark

�gures

oftheraw

throu

ghputof

thesystem

forindividual

geome�

try�com

munication

�andrasterization

operation

s�Theresu

ltsare

compared

with

the

GTX

fromSilicon

Grap

hics�

acon

temporary

ofthePrin

cetonEngin

e�andwith

the

Magic� �

ournextgen

erationhard

ware�

Chapter

�discu

ssesournextgen

erationhard

ware�

theMagic� �

curren

tlyunder

testanddevelop

ment�TheMagic

prov

idessign

i�can

tincreases

incom

munication

and

��

computation

support�

inaddition

toagreatly

increased

clockrate�

andispred

icted

toprov

idean

immensely

pow

erfulpoly

gonren

derin

gsystem

�Ourestim

ationsfor

an

implem

entation

ofasort�tw

iceren

derin

gsystem

onthisarch

itecture

areprov

ided�

andindicate

asystem

with

perform

ance

inthemillion

sof

poly

gons�with

�exibility

farbeyon

dthat

ofcurren

thigh

perform

ance

hard

ware

system

s�

Chapter

��summarizes

theresu

ltsof

thisthesis�

andcon

cludes

with

adiscu

ssion

ofthecon

tribution

sof

thiswork

�

AppendixAdiscu

ssestheinstru

ction�accu

ratesim

ulator

develop

edto

explore

the

e�cien

cyof

di�eren

tcom

munication

algorithmsandtheload

�balan

cingissu

esraised

onahigh

lyparallel

mach

ine�

��

Chapter�

Prin

cetonEngineArchite

cture

ThePrin

cetonEngin

e��

asuper�com

puter

classparallel

mach

ine�was

design

edand

built

attheDavid

Sarn

o�Research

Center

in��

Itcon

sistsof

betw

een��

and

��SIM

Dprocessors

connected

inaneigh

bor�to�n

eighbor

ring�

TheEngin

ewas

originally

design

edpurely

asavideo

processor�

andits

only

trueinput�ou

tputpath

s

arevideo�

Itaccep

tstwoNTSC

composite

video

inputs

andprod

uces

upto

four

interlaced

orprogressiv

escan

componentvideo

outputs�

asshow

nin

Figu

re��

ThePrin

cetonEngin

eisaSIM

Darch

itecture�

soall

processors

execu

tethesam

e

instru

ctionstream

�Condition

alexecu

tionof

theinstru

ctionstream

isach

ievedwith

a�sleep

�state

that

processors

may

condition

allyenter

andrem

ainin

until

awaken

ed�

Cam

eraP

rinceton Engine

VC

R

Figu

re��

Prin

cetonEngineVideoI�O�

thePrin

cetonEngin

esupports

si�

multan

eousvideo

inputs

and�sim

ultan

eousoutputs�

��

Theinstru

ctionstream

implicitly

execu

tes

ifwakeupthen

asleep�false

executeinstruction

if�asleepthen

commitresult

forevery

instru

ction�More

complex

if��then��else��execu

tioncan

beach

ievedby�rst

puttin

gsom

esubset

oftheprocessors

tosleep

�execu

tingthethen

body�togglin

gtheasleep�ag�

andexecu

tingtheelsebody�

Thedetails

oftheprocessors�

thecom

munication

intercon

nect

andthevideo

I�O

capabilities

aredescrib

edbelow

�Thevideo

I�Ostru

cture

isparticu

larlycritical

asit

determ

ineswhat

algorithmscan

bepractically

implem

ented

onthePrin

cetonEngin

e�

��

Processo

rOrganiza

tion

ThePrin

cetonEngin

e�show

nsch

ematically

inFigu

re��

isaSIM

Dlin

eararray

of��

processors

connected

inarin

g�Each

processor

isa��b

itload

�storecore�

with

aregister

�le�

ALU�a��

��

multip

lierand��K

Bof

mem

ory�

The��

multip

liercom

putes

a�

bitresu

lt�andaprod

uct

picker

allowsthese�

lectionofwhich

��bits

oftheoutputare

used

�prov

idingtheoperation

c��a�b��

n�

This

allowse�

cientcod

ingof

�xed�poin

toperation

s�ThePrin

cetonEngin

ewas

design

edto

process

video

inreal�tim

e�andas

such

exten

ded

precision

integer

arith�

metic

�beyon

d��

bits�

and�oatin

g�poin

tarith

metic

delegated

tosoftw

aresolu

tions

andare

relativelyexpensive

operation

s�A��b

itsign

edinteger

multip

lyreq

uires

��

instru

ctions�anda��b

itsign

edinteger

division

requires

over��

instru

ctions�

Themem

oryiscom

posed

oftypes�

��KBof

fastmem

oryand��K

Bof

slow

mem

ory�Fast

mem

oryisaccessib

lein

asin

gleclock

cycle�

Slow

mem

oryisdivided

into

�banksof

��KB

andis

accessible

incycles�

Thecom

piler

places

alluser

�

IPC

Register

File

28MB

/s

ALU

P

Product Picker

144bit

Instruction Bus

Synchronous

640KB

SR

AM 16bit

Multiplier

PP

P

IPC

Register

File

28MB

/s

PP

PP

PP

PP

16bit

Multiplier

Product Picker

ALU

SR

AM

640KB

Figu

re��

Prin

cetonEngine�theprocessors

areorgan

izedin

a�D

ring�

The

instru

ctionstream

isdistrib

uted

synchron

ously

fromtheseq

uencer�

data

andvariab

lesinto

fastmem

ory�Anyuse

oftheslow

mem

oryis

byexplicit

program

mer

access�

Theinstru

ctionstream

supplied

bytheseq

uencer

is��

bits

wide�

TheVLIW

�Very

LongInstru

ctionWord

�cod

edinstru

ctionstream

prov

ides

parallel

control

over

allof

theintern

aldata

path

sof

theprocessors�

Theaverage

amountof

instru

ction

levelparallelism

obtain

edwith

intheprocessors

isbetw

eenand��

Unlike

typical

uniprocessors�

thePrin

cetonEngin

eprocessors

arenot

pipelin

ed�as

theinstru

ction

decod

ehas

alreadyhappened

attheseq

uencer�

Theparallelism

obtain

edbyVLIW

on

thePrin

cetonEngin

eiscom

parab

lyto

theparallelism

obtain

edbypipelin

edexecu

tion

inaRISCprocessor�

andthe��M

Hzexecu

tionof

theprocessin

gelem

entisrou

ghly

equivalen

tto

��RISCMIPS�

Theprocessors

arecon

trolledbyacen

tralseq

uencer

board

which

manages

the

program

counter

andcon

dition

alexecu

tion�Each

ofthe��

processors

operates

at��M

Hzfor

anaggregate

perform

ance

ofapprox

imately

��RISCMIPSor

��

billion

instru

ctionsper

second�

��

��

Interprocesso

rCommunica

tion

Theprocessors

areintercon

nected

via

aneigh

bor�to�n

eighbor

�interp

rocessorcom

�

munication

��IP

C�bus�

Each

processor

may

selectively

transm

itor

receivedata

onanygiven

instru

ction�allow

ingmultip

lecom

munication

sto

occursim

ultan

eously�

There

isnoshared

mem

oryon

thePrin

cetonEngin

e�all

communication

betw

een

processors

mustvia

theIPC�

Interp

rocessorcom

munication

�IPC�occu

rsvia

a��b

itdata

registerusab

leon

everycycle�

forabisection

�neigh

bor

toneigh

bor�

bandwidth

of�M

B�s�

Therin

g

canbecon

�gured

asashift

registerwith

��entries�

Each

processor

has

control

over

oneentry

intheshift

register�After

ashift

operation

eachprocessor�s

IPC

register

hold

sthevalu

eof

itsneigh

bor�s

�leftor

right�

IPC

registerprev

iousto

theshift

operation

�Durin

ganygiven

instru

ctiontheIPCisshifted

left�righ

tor

unchanged

�

TheIPCregister

may

beread

non�destru

ctively�so

once

data

isplaced

intheIPC

registersit

may

beshifted

andread

arbitrarily

manytim

es�to

distrib

ute

aset

of

values

tomultip

leprocessors�

There

areseveral

modes

ofcom

munication

supported

onthePrin

cetonEngin

e�

They

arerep

resented

schem

aticallyin

Figu

re��

Neighbor�to�NeighborThemost

natu

raloption

�Each

processor

passes

a��b

it

word

toeith

erits

leftor

rightneigh

bor�

Allprocessors

pass

simultan

eously

either

totheleft

orto

therigh

t�

Broadcast

Oneprocessor

isthebroad

caster�andall

other

processors

arereceivers�

Neighbor�to�Neighbor�N

Each

processor

passes

asin

gledatu

mto

aprocessor

Nprocessors

away�

Allprocessors

dothisin

parallel�

sothisisas

e�cien

tas

neigh

bor�to�n

eighbor

communication

interm

sof

inform

ationdensity

inthe

ring�

Multi�

Broadcast

Som

esubset

oftheprocessors

aredesign

atedbroad

casters�Each

broad

castertran

smits

values

toall

processors�

inclu

dingitself�

betw

eenitself

��

13

31

43 3

21

Multi-B

roadcast

Neighbor-to-N

eighbor-N

Broadcast

Neighbor-to-N

eighbor

41

34

2

32

1 12

41

43

221

14

31

11

Figu

re��

Communicatio

nOptio

nsonthePrin

cetonEngine�therigh

thand

sidedepicts

thecon

tents

ofthecom

munication

registeron

eachprocessor

aftera

single

stepof

thecom

munication

meth

od�

andthenextbroad

caster�

Thediscu

ssionof

communication

approach

eswill

ignore

thepossib

ilityof

usin

g

broad

cast�in

either

mode�

becau

seitreq

uires

more

instru

ctionsthan

neigh

bor�to�

neigh

bor

totran

sferthesam

eam

ountof

data�

Consid

erthetran

smission

ofpdata

aroundamach

ineof

sizen�where

eachdatu

misthewidth

ofthecom

munication

channel�

Atthestart

ofthecom

munication

eachprocessor

hold

sonedatu

m�andat

theendof

thecom

munication

eachprocessor

has

acop

yof

every

datu

m�Wecan

countthecycles

required

foreach

approach

�

approach

parallelism

inst�d

atum

totalinst

neigh

bor�to�n

eighbor

n��n

p��

��n��n

��p

broad

cast�

��p

Theneigh

bor�to�n

eighbor

approach

isthree

tofou

rtim

esfaster

than

broad

casting

becau

seitnever

has

topay

theexpense

ofwaitin

gfor

data

toprop

agateto

allof

the

processors

�theprop

agationdelay

betw

eenneigh

bors

issm

all��Theinstru

ctionsper

datu

mis��n

instead

of��n

becau

seeach

processor

loadstheir

datu

min

parallel

��

��instru

ction�then

perform

srep

eatedshifts

andnon�destru

ctiveread

sof

thecom

�

munication

register�Sim

ilarresu

ltsapplyto

neigh

bor�to�n

eighbor�N

communication

versusmulti�b

roadcast

communication

�

��

VideoI�O

�Line�LockedProgramming

Thevideo

I�Ostru

cture

oftheEngin

eassign

saprocessor

toeach

columnof

the

outputvideo�

Each

processor

isresp

onsib

lefor

supplyingall

oftheoutputpixels

forthecolu

mnof

thedisp

layitisassign

edto�

Theaddition

ofan

�OutputTim

ing

Sequence�

�OTS�allow

sprocessors

tobeassign

edregion

ssom

ewhat

more

arbitrary

than

acolu

mnof

theoutputvideo�

When

thePrin

cetonEngin

ewas

originally

design

editwas

conceiv

edof

only

asa

video

supercom

puter�

Itwas

expected

that

theprocessin

gwould

bevirtu

allyidentical

forevery

pixel�

andthusprogram

sshouldn�toperate

onfram

esof

video�

they

should

operate

onpixels�

andevery

new

pixelwould

constitu

teanew

execu

tionof

thepro�

gram�Theparad

igmisenforced

byresettin

gtheprogram

counter

atthebegin

ning

ofeach

scanlin

e�

Before

theendof

thescan

lineall

oftheprocessors

must

have

processed

their

pixel

andplaced

theresu

ltin

their

video

outputregister�

Becau

setheprogram

counter

isreset

atthebegin

ningof

eachscan

lineall

program

smustinsure

that

their

longest

execu

tionpath

iswith

inthescan

line�in

struction

budget�

of��

instru

ctions

at��M

Hz�

This

necessarily

leadsto

someobscu

recod

ing�

ascom

plex

operation

s

requirin

gmore

than

��instru

ctionsmustbedecom

posed

into

aseq

uence

ofsm

aller

stepsthat

canindividually

occurwith

intheinstru

ctionbudget�

��

Implementatio

n

Theprocessors

arefab

ricatedin

a��

micron

CMOSprocess

which

allowstwo

processors

tobeplaced

onthesam

edie�

Thephysical

con�guration

ofthemach

ineis

��

and video

2 processors w/

640KB

mem

ory

piggy-backed

on each.

I/O chips for

ring comm

unication

512 Processor C

abinet

2 Processor C

abinets + S

equencer

1024 Processor S

ystem

Figu

re��

Prin

cetonEngine�each

ofthecab

inets

show

nisapprox

imately

feet

wide��feet

deep

and�feet

high

�Theprocessors

arecooled

byaforced

airsystem

�

show

ninFigu

re��

Themach

inecon

sistsof�to

�processor

cabinets

andaseq

uencer

cabinet�

Theprocessor

cabinets

hold

��processors

each�alon

gwith

their

associated

mem

ory�Theseq

uencer

cabinet

contain

stheseq

uencer

which

prov

ides

instru

ctions

totheprocessors�

thevideo

inputandoutputhard

ware�

aHiPPiinterface�

anda

dedicated

controller

cardwith

aneth

ernet

interface�

Theseq

uencer

prov

ides

theinstru

ctionstream

totheprocessors�

Serial

control

�ow

�proced

ure

calls�ishandled

bytheseq

uencer�

Condition

alcon

trol�ow

isper�

formed

byevalu

atingthecon

dition

alexpression

onall

theprocessors

inparallel�

Processor

�then

transm

itsits

resultto

theseq

uencer

via

theinterp

rocessorcom

mu�

nication

bus�

Theseq

uencer

bran

ches

condition

allybased

onthevalu

eitsees�

High

�level

program

control

isperform

edvia

aneth

ernet

connection

andaded�

icatedcon

troller�Thecon

trollerallow

sthestartin

gandstop

pingof

theseq

uencer�

andtheload

ingofprogram

sinto

sequencer

mem

oryanddata

into

processor

mem

ory�

Facilities

fornon�video

inputandoutputare

not

supported

inthevideo

processin

g

modeof

operation

�so

they

arenot

discu

ssedhere�

Sim

ilarly�theHiPPiinput�ou

tput

channel

isnot

compatib

lewith

therealtim

evideo

processin

gmodeof

operation

�A

piece

oftheHiPPiim

plem

entation�astatu

sregister

used

for�ow

control�

canbe

��

used

asacru

dedata�p

assingmeth

odbetw

eenthecon

trollerandtheprocessors�

Data

inputthrou

ghthis

channel

islim

itedto

relatively

lowrates

�hundred

sof

values

a

second�as

there

isnohandshakingprov

ided�

Thevideo

inputandoutputchannelare

distrib

uted

throu

ghamech

anism

similar

totheinterp

rocessorcom

munication

register�Over

thecou

rseof

avideo

scanlin

ethe

I�Ochipssit

intheback

groundandshift

incom

posite

video

inputs

andshift

out

�com

ponentvideo

outputs�

Atthestart

ofeach

scanlin

etheuser

program

canread

thevideo

inputregisters�

andbefore

theendof

eachlin

etim

etheuser

program

must

write

thevideo

outputregisters�

��

Chapter�

GraphicsPipelin

e

Poly

gonalren

derin

guses

asynthetic

scenecom

posed

ofpoly

gonsto

visu

alizeareal

sceneor

theresu

ltsof

computation

oranynumber

ofoth

erenviron

ments�

Objects

in

thevirtu

alspace

arede�ned

assets

of�D

poin

tsandpoly

gonsde�ned

with

vertices

atthese

poin

ts�In

themost

straightforw

ardapplication

ofpoly

gonren

derin

g�these

verticesare

projected

tothedisp

layplan

e�th

emonitor�

viaatran

sformation

de�ned

byasynthetic

eyepoin

tand�eld

ofview

�as

show

nin

Figu

re��

Subseq

uently

the

poly

gonsde�ned

bytheset

ofvertices

with

intheclip

volu

meare

rendered

�En�

hancem

ents

tothisbasic

algorithm

inclu

deocclu

sion�ligh

ting�

shading�

andtex

ture

mapping�

Perp

ixelocclu

sioninsures

that

thepoly

gonsclosest

totheobserver

obscu

repoly

�

display plane

frustum

eyepoint

Figu

re��

Synthetic

Eyepoint�

asynthetic

eyepoin

tandfru

stum

de�nes

both

the

setof

visib

leob

jectsandadisp

layplan

e��

gonsfurth

eraw

ay�Occlu

sionhas

been

implem

ented

anumber

ofways�butismost

commonlyim

plem

entedwith

adepthbu�er�

Bycom

putin

gthedepth

ofeach

vertex

ofapoly

gonwecan

interp

olatethese

depthsacross

theinterior

pixels

ofthepoly

gon

tocom

pute

thedepth

ofanypixel�

When

wewrite

apixel

tothedisp

laydevice

we

actually

perform

acon

dition

alwrite�

only

writin

gthepixelifthesurface

fragmentit

represen

tsis closer�

totheeyethan

anysurface

alreadyren

dered

atthat

pixel�

Ligh

tingandshadinggreatly

increase

therealism

ofascen

ebyprov

idingdepth

cues

via

changes

ininten

sityacross

surfaces�

Ligh

tingassu

mes

foranygiv

enpoly

gon

that

itistheonly

poly

gonin

thescen

e�so

there

arenoshadow

scast

bytheobstru

c�

tionof

lightsou

rcesbyoth

erscen

eelem

entsor

re�ection

ofligh

to�

ofoth

ersurfaces�

These

e�ects

arecap

tured

inoth

erapproach

esto

renderin

gsuch

asray

tracing�

and

arenot

generally

implem

ented

inpoly

gonalren

derers�

Shadingisperform

edbyapply�

ingaligh

tingmodel�

such

asthePhongmodel��

toanorm

alwhich

represen

tsthe

surface

norm

alof

theob

jectbein

gapprox

imated

atthat

poin

t�Theligh

tingmodel

may

either

beapplied

atevery

pixelthepoly

goncovers�

byinterp

olatingthevertex

norm

alsto

determ

ineinterior

norm

als�or

itmay

beapplied

justto

thevertex

norm

als

ofthepoly

gon�andtheresu

ltingcolors

interp

olatedacross

thepoly

gon�Theform

er

approach

�Phongshading�

will

oftenyield

more

accurate

results

andisfree

ofsom

e

artifactspresen

tin

thelatter

approach

�Gourau

dinterp

olation�butitmore

compu�

tationally

expensive�

Figu

res��

��and��

areall

exam

ples

ofdepthbu�ered

�

Phonglit�

Gourau

din

terpolated

poly

gonal

scenes�

Texture

mappingisafurth

erenhancem

entto

poly

gonal

renderin

g�Instead

ofjust

associatingacolor

with

eachvertex

ofapoly

gon�weassociate

a�u�v�coord

inate

in

anim

agewith

eachvertex

�Bycarefu

lassign

mentof

these

coordinates�

wecan

create

amappingthat

places

anim

ageonto

aset

ofpoly

gons�for

exam

ple

apictu

reonto

abillb

oard�carp

etonto

a�oor�

even

aface

onto

ahead

�Texture

mapping�

ina

fashion

analogou

sto

shadinganddepthbu�erin

g�lin

earlyinterp

olatesthetex

ture

coordinates

ateach

vertexacross

thepoly

gon�Thecolor

ofaparticu

larpixelwith

in

thepoly

gonisdeterm

ined

bylook

ingupthepixelin

thetex

ture

map

corresponding

��

Figu

re��

Beethoven�

a��

poly

gonmodelwith

��non�cu

lledpoly

gons�

tothecurren

tvalu

esof

uandv�Figu

re��

show

sasam

ple

texture

mapped

scene

alongwith

thecorresp

ondingtex

ture

map

inFigu

re��

��

UniP

rocesso

rPipelin

e

Allof

theabove

iscom

bined

into

anim

plem

entation

ofapoly

gonal

renderer

ina

graphics

pipelin

e��Theseq

uence

ofoperation

sis

apipelin

ebecau

sethey

areap�

plied

sequentially�

Thedepthbu�er

prov

ides

order

invarian

ce�so

theresu

ltsof

one

computation

�poly

gon�have

noe�ect

onoth

ercom

putation

s�

Theuniprocessor

pipelin

econ

sistsof

twostages

geometry

andrasterization

�The

geometry

stagetran

sformsapoly

gonfrom

its�D

represen

tationto

screencoord

i�

nates�

testsvisib

ility�perform

sclip

pingandligh

tingat

thevertices�

andcalcu

lates

interp

olationcoe�

cients

fortheparam

etersof

thepoly

gon�Therasterization

stage

takesthecom

puted

results

ofgeom

etryfor

apoly

gonandren

ders

thepixels

onthe

screen�

Amore

complete

descrip

tionof

theuniprocessor

pipelin

eisnow

presen

ted�

��

Figu

re��

Crocodile�

a��

poly

gonmodelwith

��noncu

lledpoly

gons�

Figu

re��

Teapot�

a��

poly

gonmodelwith

��noncu

lledpoly

gons�

��

Figu

re��

Texture

MappedCrocodile�

a��

poly

gonmodelwith

��

visib

lepoly

gons�

Figu

re��

SampleTexture

Image�

theim

ageof

synthetic

crocodile

skin

applied

tocroco

dile�

��

next polygon

light

clip

YN

visible?

transform

calculate coefficients

Figu

re��

Uniprocesso

rGeometry

Pipelin

e

��

Geometry

Thegeom

etrystage

canbeform

edinto

apipelin

eof

operation

sto

beperform

ed�as

show

nin

Figu

re��

andexplain

edbelow

�

Transfo

rmThevertices

areprojected

totheuser�s

poin

tofview

�This

involves

translation

�so

the�D

originof

thepoin

tsbecom

estheuser�s

location�and

rotation�so

that

thelin

eofsigh

toftheobserver

correspondsto

lookingdow

nthe

zaxis�bycon

vention

�in

�D�Thistran

sformation

isperform

edwith

amatrix

multip

licationof

theform

��

x�

y�

z�

�

��

��

ab

ctx

de

fty

gh

itz

��

��

��

xyz�

��

Theuse

ofahom

ogeneou

scoord

inate

system

allowsthetran

slationto

bein�

cluded

inthesam

ematrix

with

therotation

�

��

Z

Y�z�y�

Z�d

�z��y

�

disp

layplan

e

d�y�z

�

d�y�z

Figu

re��

�DPersp

ectiv

eTransfo

rmatio

n�the�D

caseisanalogou

sto

the�D

case�Here

theaxes

have

been

labeled

tocorresp

ondto

thestan

dard

�Dcoord

inate

system

�Thevertex

eyecoordinates

arenow

projected

tothescreen

�Thisintrod

uces

the

familiar

persp

ectivein

computer

�andnatu

ral�im

ages�where

distan

tob

jects

appear

smaller

than

closerob

jects�Thescreen

coordinates

ofavertex

��s

x �sy �

arerelated

tothetran

sformed

�Dcoord

inates

�x��y

��z��by

sx�dx

�

z�

��

sy�

dy

�

z�

��

The�D

to�D

projection

isshow

nin

Figu

re��

The�D

to�D

projection

is

identical�

just

repeated

foreach

coordinate�

Clip

ping�

Rejectio

nOnce

thepoly

gonsare

transform

edto

eyecoordinates

they

mustbeclip

ped

against

theview

ablearea�

Theray

scast

fromtheeyep

ointof

theobserver

throu

ghthefou

rcorn

ersof

theview

port

de�neaview

ingfru

stum�

Allpoly

gonsoutsid

eof

thisfru

stum

arenot

visib

leandneed

not

beren

dered

�

Poly

gonsthat

intersect

thefru

stum

arepartially

visib

le�andare

intersected

with

thevisib

leportion

ofthedisp

layplan

e�or

clipped��

Clip

pingtak

esa

poly

gonandcon

vertsitinto

anew

poly

gonthat

isidentical

with

intheview

ing

��

frustu

m�andisbounded

bythefru

stum�

Back

facecullin

gculls

poly

gonsbased

ontheface

ofthepoly

gonorien

tedtow

ards

theobserver�

Each

poly

gonhas

twofaces�

andof

course

youcan

only

seeoneof

thetwofaces

atanytim

e�For

exam

ple�if

wecon

structed

abox

outof

poly

gons

wecou

lddividethepoly

gonfaces

into

twosets�

aset

offaces

poten

tiallyvisib

le

foranyeyep

ointinsid

ethebox�andaset

offaces

poten

tialvisib

lefor

any

eyepoin

toutsid

eof

thebox�andtheunion

ofthese

setsisall

thefaces

ofall

of

thepoly

gons�

Ifweknow

apriori

that

theview

poin

twill

never

beinterior

to

thebox

weneverhave

toren

der

anyof

thefaces

intheinterior

set�Poly

gons

that

areculled

becau

setheir

interior�

faceistow

ardstheview

erare

saidto

be

back

facing�

andare

back

faceculled

�

Lightin

gLigh

tingis

now

applied

tothevertex

norm

als�A

typical

lightin

gmodel

inclu

des

contrib

ution

sfrom

di�use

�back

groundor

ambien

t�illu

mination

�dis�

tantillu

mination

�likethesun�andlocal

poin

tillu

mination

�likealigh

tbulb��

Poly

gonsare

assigned

prop

ertiesinclu

dingcolor

�RGB��shininess

andspecu

�

laritywhich

determ

inehow

theincid

entligh

tis

re�ected

�See

Foley

andvan

Dam

��pg�

��for

adescrip

tionof

lightin

gmodels�

Thenorm

alassociated

with

eachvertex

may

alsobetran

sformed�depending

ontheview

ingmodel�

Ifthemodelof

interaction

assumes

theview

erismoving

aroundin

a�xed

space�

then

theligh

tingshould

remain

constan

t�as

should

thevertex

norm

alsused

tocom

pute

lightin

g�How

ever�ifthemodel

assumes

that

theob

jectisbein

gmanipulated

byan

observer

whoissittin

gstill

then

the

norm

alsbutalso

betran

sformed�Anorm

alisrep

resented

byavector

��

nx

ny

nz

�

��

��

andistran

sformed

bytheinverse

transpose

ofthetran

sformation

matrix

�

LinearEquatio

nSetup

Rasterization

isperform

edbyiteratin

ganumber

oflin

ear

equation

softheform

Ax�By�C�Thecoe�

cientsofthese

equation

srep

resent

theslop

esof

thevalu

esbein

giterated

alongthexandyaxes

inscreen

space�

Atypical

system

interp

olatesmanyvalu

es�inclu

dingthecolor

ofthepixel�

the

depth

ofthepixel�

andthetex

ture

map

coordinates�

Thecom

putation

ofthese

linear

equation

coe�cien

ts�while

straightforw

ard�is

timecon

suming�

Itreq

uires

both

accuracy

anddynam

icran

geto

encap

sulate

thefullran

geofvalu

esofinterest�

Figu

re��

show

stheform

ofthecom

putation

forthedepth�Thesam

eequation

sare

evaluated

foreach

componentto

be

interp

olatedacross

thepoly

gon�

Dueto

thesim

ilarityof

thework

perform

ed�lots

ofarith

metic�

andin

anticip

a�

tionof

theparallel

discu

ssion�Iinclu

delin

earequation

setupwith

thegeom

etry

pipelin

e�butitcou

ldalso

becon

sidered

thestart

oftherasterization

stage�

Therasterization

stagedescrib

ednextren

ders

theset

ofvisib

lepoly

gonsbyiter�

atingthelin

earequation

scalcu

lateddurin

gthegeom

etrystage�

��

Raste

rizatio

n

Rasterization

must

perform

twotask

s�It

must

determ

inetheset

ofpixels

that

a

poly

gonoverlap

s�is

poten

tiallyvisib

leat�

andfor

eachofthose

pixels

itmustevalu

ate

thevariou

sinterp

olants

associatedwith

thepoly

gon�

Thetypical

uniprocessor

software

implem

entationof

poly

gonrasterization

ana�

lyzes

thepoly

gonto

beren

dered

anddeterm

ines

foreach

horizon

talscan

linethe

leftmost

andrigh

tmost

edge

ofthepoly

gon�Thealgorith

mcan

then

evaluate

the

intersection

ofthepoly

gonedges

with

anygiven

scanlin

eandrasterize

thepixels

with

in�Thisalgorith

mise�

cient�as

itexam

ines

only

thepixels

with

inthepoly

gon

anditiseasy

todeterm

inetheleft

andrigh

tedges�

��

X

Y

Z

1

2

0

�z�

�x�

�y�

�z�

�y�

�x�

Linear

equation

foran

edge

fromvertex

�to

vertex�

E�x�y�

�Ax�By�C

��

A�

��y

��

B�

�x�

��

C�

x��y��y��x

��

��

Linear

equation

fordepth

Z�x�y�

�Ax�By�C

��

��

�x��y

��y

��x

��

A�

��z��y

��y

��z

� ��

B�

��x��z

��z

��x

� ��

C�

z��A�x��B�y�

��

��

Figu

re��

PolygonLinearEquatio

ns�

Theedge

equation

sare

unnorm

alizedas

only

thesign

ofsubseq

uentevalu

ationsisim

portan

t�Accu

ratemagn

itudeisreq

uired

fortheoth

erequation

s�resu

ltingin

more

complex

expression

sfor

thecoe�

cients�

��

po

sitive

neg

ative

half-p

lane

half-p

lane

Figu

re��

Tria

ngleDe�nitio

nbyHalf�PlaneInterse

ctio

n�

apositive

and

negative

half�p

laneare

de�ned

byalin

ein

screenspace�

Theinterior

ofapoly

gonis

allpoin

tsin

thepositiv

ehalf�p

laneof

alltheedges�

Diagram

isafter

��

Typical

hard

ware

implem

entation

s�for

exam

ple

��tak

eadi eren

tapproach

�

Each

poly

goncan

bede�ned

astheintersection

ofthensem

i��nite

plain

sboundby

thenedges

ofthepoly

gon�as

explain

edbyPinedain

��Wecan

de�netheequation

ofalin

ein

�DasAx�By�C��andthusahalf�p

laneisgiven

byAx�By�C�

��

With

theapprop

riatechoice

ofcoe�

cients

wecan

insure

that

theequation

sof

the

nsem

i��nite

plain

swill

allbepositiv

ely�or

negatively

valued�with

inthepoly

gon�

Figu

re��

depicts

thisapproach

�

Thisprov

ides

uswith

avery

cheap

test�th

eevalu

ationof

onelin

earequation

per

vertex�to

determ

ineifanypixelisinsid

eof

apoly

gon�Furth

ermore�

wecan

cheap

ly

boundtheset

ofpixels

that

arepoten

tiallyinterior

tothepoly

gonas

theset

ofpixels

interior

totheboundingbox

ofthepoly

gon�

This

approach

results

inexam

ining

more

pixels

then

will

beren

dered

�as

theboundingbox

isgen

erallyacon

servative

estimate

oftheset

ofpixels

inapoly

gon�For

triangles

atleast

half

ofthepixels

exam

ined

will

beexterior

tothepoly

gon�Experim

entally

thishas

been

determ

ined

tobeagood

tradeo

against

thehigh

ercom

putation

alcom

plex

ityofspan

algorithms

becau

seSIM

Dim

plem

entationsbene�tfrom

asim

ple

design

with

fewexcep

tional

cases�

Therasterization

process

isactu

allypoorly

describ

edas

apipelin

e�becau

seitcon

�

sistsof

aseries

ofearly

�aborts�

atleast

inits

serialform

�Figu

re��

show

satypical

view

oftherasterization

process�

Thenext

pixel

operation

takesstep

horizon

tally

until

itreach

estherigh

tsid

eof

theboundingbox�andthen

takesavertical

stepand

��

interior pixel?

pixel visible?

shade pixel

write R

GB

Z

N

compute z

textured polygon?

next polygon

NN

NN

more pixels?

next pixel

NY

N

compute z

interior pixel?

pixel visible?

write R

GB

Z

get texture RG

B

shade pixel

next pixel

more pixels?

Figu

re��

Uniprocesso

rRaste

rizatio

n

��

returnsto

takinghorizon

talstep

s�Theinterior

pixel

testisthesim

plelin

earequation

testshow

nin

Figu

re��

Compute

Zevalu

atesthelin

earequation

fordepth

and

thepixel

visibletest

determ

ines

ifthispixeliscloser

totheobserv

erthan

thecurren

t

pixelin

thefram

ebu er�

Ifso

itisthen

shaded

andwritten

tothefram

ebu er�

The

proced

ure

fortex

ture

mapped

poly

gonsisidentical�

excep

tthepixelthat

getsshaded

isretriev

edfrom

atex

ture

map�rath

erthan

just

bein

gthecolor

ofthepoly

gon�

Consid

erable

complex

ityhas

been

sweptunder

therugin

thisview

ofrasteriza�

tion�Most

signi�can

tisthetex

ture

mappingoperation

�which

isrelativ

elycom

plex

inpractice�

Thesim

plest

way

toperform

texture

mappingis�poin

tsam

plin

g��Poin

t

samplin

gtakes

thecolor

ofatex

ture

mapped

pixel

asthecolor

oftheclosest

texel

�texture

pixel�

tothecurren

tinterp

olatedtex

ture

coordinates�

Amore

pleasin

g

meth

oduses

aweigh

tedcom

bination

ofthe�closest

texels

fromthetex

ture

map�A

yetmore

complex

andpleasin

gapproach

isach

ievedwith

MIP�M

apping�

describ

ed

byLance

William

sin

��Allof

these

meth

odsattem

ptto

correcttheinaccu

racyof

poin

tsam

plin

gatran

sformed

image�

Amath

ematically

correctresam

plin

gof

thein�

putim

agewould

involve

complex

�lterin

ganddi�

cultcom

putation

�Approx

imation

meth

odsattem

ptto

correctthesam

plin

gerrors

atlow

ercost�

usually

with

acceptab

le

visu

alresu

lts�

��

Details

There

areafew

auxiliary

details

which

accompanytheren

derin

gprocess

which

have

been

ignored

sofar�

Most

importan

tly�there

isgen

erallyaway

fortheuser

tointeract

with

thesystem

�A

similarly

importan

tsubject

isdouble�b

u erin

g�

Theparticu

larinputmeth

odprov

ided

bythesystem

isnot

importan

t�just

some

meth

odmust

exist�

These

meth

odsvary

greatly�from

bein

gable

tomodify

the

poly

gondatab

aseandview

matrix

contin

ually

andinteractiv

ely�to

simply

execu

ting

ascrip

tof

pred

e�ned

view

poin

ts�

Double�b

u erin

gisused

tohidethemach

inery

ofpoly

gonren

derin

g�Users

are

�usually

�only

interested

inthe�nal

rendered

image�

not

allthework

that

goesinto

��

it�Tothisend�twofram

ebu ers

areused

�Onecom

pleted

framebu er

isdisp

layed�

while

thesecon

dfram

ebu er

isren

dered

�When

therasterization

ofall

poly

gonsis

complete�th

efram

ebu ers

areexchanged

�usually

durin

gtheverticalretrace

interval

ofthedisp

lay�so

theim

agesdon�t�ick

erdistractin

gly�

��

MultiP

rocesso

rPipelin

e

Theparallel

versionofthegrap

hics

pipelin

edi ers

signi�can

tlyfrom

theuniprocessor

pipelin

e�There

arenow

nprocessors

work

ingon

theren

derin

gprob

leminstead

of

asin

gleprocessor�

andoptim

allywewould

likeafactor

ofnspeed

up�How

dowe

dividethework

amongtheprocessors�

Given

that

wehave

divided

thework

�how

doweexecu

teitin

parallel�

While

these

question

sappear

super�

ciallyobviou

s�they

arenot�

Byparallelizin

gthealgorith

mwehave

exposed

that

itisactu

allyasortin

g

prob

lem�resolved

bythedepth�bu er

intheuniprocessor

system

�butpresu

mably

resolvedin

parallel

inamultip

rocessorim

plem

entation�

Itis

relativelyobviou

show

topartition

thegeom

etrywork

�Instead

ofgiv

ing

ppoly

gonsto

�processor�

givep�n

poly

gonsto

eachprocessor�

andallow

them

to

proceed

inparallel�

Thisshould

yield

afactor

ofnspeed

up�which

isthemost

wecan

hopefor�

Therasterization

stageisless

obviou

s�Dowelet

eachprocessor

rasterize

��nofthepoly

gons�

Dowelet

eachprocessor

rasterize��n

ofthetotal

pixels�

When

apoly

gonhas

toberasterized

doall

processors

work

onitat

thesam

etim

e�

Wewould

alsolike

todividethework

aseven

lyas

possib

leam

ongtheprocessors�

Ifweputmore

work

onanyoneprocessor

wehave

comprom

isedtheperform

ance

of

oursystem

�How

ever�itisnot

alwaysthecase

that

thesam

enumber

ofpoly

gonson

eachprocessor

correspondsto

thesam

eam

ountof

work

�if�

forexam

ple�th

epoly

gons

varyin

size�

Rasterization

isactu

allyasortin

galgorith

mbased

onscreen

locationanddepth�

Ifwedividethework

acrossmultip

leprocessors

wemust

stillperform

this

sorting

operation

�Theaddition

ofcom

munication

totheren

derin

gpipelin

eenables

usto

��

resolvethesortin

gprob

lemandren

der

correctim

ages�that

is�thesam

eim

agesa

uniprocessor

renderer

would

generate�

Becau

setherasterization

sortingissu

ehas

forcedusto

introd

uce

communication

�

our�rst

departu

refrom

theuniprocessor

pipelin

e�itisnatu

ralto

startbydiscu

ssing

itsparallelization

�

��

Communicatio

n

Given

that

wehave

toperform

communication

toprod

uce

the�nal

image�

wemust

choose

how

weparallelize

therasterization

�This

choice

bears

heav

ilyon

how

the

image

isgen

eratedandwill

determ

inewhat

communication

option

sare

available

to

us�

There

aretwowayswecou

lddividetherasterization


��give

eachprocessor

aset

ofpixels

torasterize�

or

��give

eachprocessor

aset

ofpoly

gonsto

rasterize�

Ifwechoose

the�rst

option

then

thesortin

gprob

lemis

solved

bymakingthe

pixelsets

disjoin

t�so

asin

gleprocessor

has

alltheinform

ationnecessary

tocorrectly

generate

itssubset

ofthepixels�

Thisreq

uires

communication

before

rasterization�

sothat

eachprocessor

may

betold

aboutall

thepoly

gonsthat

intersects

itspixel

set�These

approach

esare

referredto

as�sort��

rst�and�sort�m

iddle�

becau

sethe

communication

occurseith

er�rst�

before

geometry

andrasterization

�or

inthemiddle�

betw

eengeom


�

Ifwechoose

thesecon

doption

then

thepoly

gonsrasterized

onanygiven

processor

will

overlapsom

esubset

ofthedisp

laypixels�

andthere

may

bepoly

gonson

some

other

processor

orprocessors

that

overlapsom

eof

thesam

edisp

laypixels�

Soifwe

parallelize

therasterization

bypoly

gonsrath

erthan

pixels

then

thesortin

gprob

lem

mustbesolv

edafter

rasterization�Thisapproach

isreferred

toas

�sort�last�becau

se

thesortin

goccu

rsafter

both

rasterizationandgeom

etry�

Figu

re��

show

sthese

communication

option

ssch

ematically�

Inall

casesthe

communication

isshow

nas

anarb

itraryintercon

nection

netw

ork�Ideally

wewould

��

G

SO

RT

-FIR

ST

display

polygon database

R

R GR

Distribute P

ixels

SO

RT

-LA

ST

display

polygon database

G

R G

RR G

G

Distribute P

olygons

R

G

RR G

display

G

Distribute P

olygons

SO

RT

-MID

DL

E

polygon database

Figu

re��

Sort

Order�

somepossib

lewaysto

arrange

thepoly

gonren

derin

gprocess�

Figu

reisafter

��

likethisnetw

orkto

beacrossb

arwith

in�nite

bandwidth�so

that

thecom

munication

stagereq

uires

notim

eto

execu

te�Ofcou

rsein

practice

wewill

never

obtain

such

a

communication

netw

ork�

Sort�

First

Sort��

rstperform

scom

munication

before

geometry

andrasterization

�Thescreen

is

diced

upinto

aset

ofregion

sandeach

processor

isassign

edaregion

orregion

s�and

rasterizesall

ofthepoly

gonsthat

intersect

itsregion

s�Allof

theregion

sare

disjoin

t�

sothe�nal

image

issim

ply

theunion

ofall

theregion

sfrom

alltheprocessors�

andnocom

munication

innecessary

tocom

binetheresu

ltsof

rasterization�

The

communication

stagemustgive

eachprocessor

acop

yof

allthepoly

gonsthat

overlap

itsscreen

area�This

inform

ationcan

only

bedeterm

ined

aftertheworld

�space

to

screen�sp

acetran

sformation

ismade�so

there

isgen

erallysom

enon�triv

ialam

ount

ofcom

putation

donebefore

thecom

munication

isperform

ed�

��

Sort�

Middle

Sort�m

iddle

alsoassu

mes

that

eachprocessor

israsterizin

gauniquearea

ofthe

screen�How

ever�thecom

munication

isnow

doneafter

thegeom

etrystage�

sorath

er

than

communicatin

gthepoly

gonsthem

selves�

wecan

directly

distrib

ute

thelin

ear

equation

coe�cien

tsthat

describ

etheedges

ofthepoly

gonon

thescreen

alongwith

thedepth�shading�

etc�

Sort�

Last

Sort�last

assignseach

processor

asubset

ofthepoly

gonsto

rasterize�rath

erthan

a

subset

ofthepixels�

Asthepoly

gonsrasterized

ondi eren

tprocessors

may

overlap

arbitrarily

inthe�nal

image�

thecom

position

isperform

edafter

rasterization�Each

processor

canperform

thegeom


stagewith

nocom

munication

whatsoever�

only

communicatin

gonce

they

have

generated

allof

their

pixels

forthe

outputim

age�

Compariso

n

These

three

sortingchoices

allincurdi eren

ttrad

eo s�andtheparticu

larchoice

of

algorithm

will

dependon

both

details

ofthemach

ine�sp

eed�number

ofprocessors�

etc��andthesize

ofthedata

setto

beren

dered

�

Sort��

rstcom

munication

canexploit

temporal

locality�Generally

aview

erwill

navigate

ascen

ein

asm

oothandcon

tinuousfash

ion�so

theview

poin

tchanges

slowly�

andthelocation

ofeach

poly

gonon

thescreen

changes

slowly�

Ifthelocation

ofpoly

�

gonsischangin

gslow

lythen

thechange

inwhich

poly

gonseach

processor

willrasterize

will

besm

all�andsort��

rstwill

require

littlecom

munication

�Thisismost

obviou

s

iftheview

erisstan

dingstill�

inwhich

casethepoly

gonsare

alreadycorrectly

sorted

fromren

derin

gtheprev

iousfram

eto

render

thecurren

tfram

e�How

ever�sort��

rst

incurs

theoverh

eadof

duplicated

e ort

inthegeom

etrystage�

Every

poly

gonhas

its

geometry

computation

perform

edbyallp

rocessorswhose

rasterizationregion

sitover�

laps�

For

largeparallel

system

s�which

will

have

correspondingly

small

rasterization

��

regions�thiscost

could

becom

erelativ

elyhigh

�

Sort�m

iddle

communication

calculates

thegeom

etryonly

once

per

poly

gonand

thuswill

not

have

theduplication

ofe ort

incurred

bysort��

rst�How

ever�sort�

middlemustcom

municate

theentire

poly

gondatab

asefor

eachren

dered

frame�w

hich

ispresu

mably

substan

tiallymore

expensive

than

ameth

odwhich

exploits

temporal

locality�

Sort�last

communication

tra�cs

inpixels

instead

ofpoly

gons�

andhas

thein�

terestingprop

ertythat

theam

ountof

communication

iscon

stant�

Each

processor

will

have

tocom

municate

nomore

than

ascreen

�sworth

ofpixels

tocom

posite

the

�nal

image�

soas

thepoly

gondatab

asegrow

sarb

itrarilylarge

sort�lastwill

o er

the

cheap

estcom

munication

�Alth

ough

thiscon

stantboundon

communication

isusefu

l�

itisshow

nlater

tobeavery

largeam

ountof

communication

compared

tosort��

rst

andsort�m

iddlefor

ourscen

esof

interest�

Wehave

laidafram

ework

forthediscu

ssionto

follow�and�while

there

areinterac�

tions�theactu

algeom


stagesare

tosom

eexten

tindependentof

thecom

munication

topology�

Thecom

munication

topology

will

determ

ineprecisely

what

poly

gonsare

computed

onwhat

processor

foreach

stage�butwill

have

little

e ect

onthenatu

reof

thecom

putation

itself�Wewill

defer

theactu

alchoice

of�and

justi�

cationfor�

acom

munication

topology

until

Chapter

��

��

Geometry

Thegeom

etrycalcu

lationscon

sistsof

�vedistin

ctstep

s�

�tran

sform

�clip

�reject

�project

toscreen

coordinates

�ligh

t

�lin

earequation

setup

�

clip/reject

transform

next polygon

add to visible list

Nproject to screen

next visible polygon


light

Y

Figu

re��

SIM

DGeometry

Pipelin

e�thepipelin

eissplit

betw

eenclip

pingand

lightin

g�

Each

ofthese

stageshas

anupper�b

oundon

theam

ountof

work

perform

edper

poly

gon�

Thecalcu

lationsto

beperform

edfor

eachpoly

gon�while

execu

tingon

di eren

tdata

andyield

ingdi eren

tresu

lts�are

identical

algebraically�

Theexecu

tion

ofeach

processor

over

itssubset

ofthepoly

gons�

wheth

erobtain

edfrom

theinput

poly

gonal

datab

asedi eren

tly�or

fromtheresu

ltsof

sort��rst

communication

�will

beidentical�

andtheexecu

tiontim

ewill

bedeterm

ined

bytheprocessor

with

most

poly

gonsto

process�

Often

aftertheclip

�rejectstage

avery

largenumber

ofprocessors

will

have

dis�

carded

their

poly

gonandhave

noth

ingto

dowhile

theoth

erprocessors

compute�

Thiswill

achieve

poor

utilization

oftheprocessors

andaless

than

optim

alspeed

up

��

forthegeom

etrystage�

Arev

isedgeom

etrypipelin

eapprop

riatefor

parallel

execu

�

tionisshow

nin

Figu

re��

Bybreak

ingthegeom

etrypipelin

einto

twosep

arate

pieces

andqueuein

gthework

betw

eenthem

wewill

�rst

perform

work

prop

ortional

tothemaxim

um

number

ofpoly

gonson

anyprocessor�

then

work

prop

ortional

to

themaxim

um

number

ofvisible

poly

gonson

anyprocessor�

Thecost

ofthesecon

d

stageof

thepipelin

eper

poly

gonwill

prove

tobesubstan

tiallyhigh

erthan

the�rst

stage�so

ifthenumber

ofvisib

lepoly

gonsismuch

smaller

than

thetotal

number

of

poly

gonsthesav

ings

could

besubstan

tial�Section

��prov

ides

adiscu

ssionof

this

implem

entation

issue�

��

Raste

rizatio

n

Thesim

plerasterization

algorithm

suggested

inFigu

re��

has

thesam

eim

plem

en�

tationprob

lemsas

thegeom

etrystage�

Ifthenumber

ofpoly

gonsvaries

signi�can

tly

acrossprocessors

theam

ountof

work

tobedoneon

eachprocessor

will

alsovary

signi�can

tly�This

prob

lemiscom

pounded

bythevary

ingsizes

ofpoly

gons�

Ifwe

assumeouralgorith

mcan

rasterizeapixelin

somecon

stantam

ountoftim

e�wewould

likeourtim

eto

rasterizeapoly

gonto

beprop

ortional

tothearea

ofthepoly

gon�

How

ever�theuniprocessor

implem

entationhas

aforced

serializationof

thepoly

gons�

which

introd

uces

anexpensiv

esynchron

izationpoin

tin

thealgorith

m�Thissynchro�

nization

forcesall

processors

torasterize

their

poly

gonin

theworstcase

time�

To

achieve

ane�

cientexecu

tionitwill

benecessary

todecou

pletherasterization

process

fromthespeci�

cpoly

gonbein

gprocessed

�

Figu

re��

show

sasuggested

implem

entation

foramore

e�cien

tparallel

ren�

derer�

Thedouble

loopin

theuniprocessor

implem

entation�show

nin

Figu

re��

has

been

replaced

with

asin

gleloop

�andthetest

casehas

been

mademore

com�

plex

�Thisdecou

ples

thesize

di eren

cesin

poly

gonsbetw

eenprocessors

atthecost

ofmakingeach

iterationof

theloop

more

expensiv

e�

Di eren

cesin

thesize

ofpoly

gonsasid

e�there

isreason

tobeliev

ethere

will

be

substan

tialdi eren

cesin

thenumbers

ofpoly

gonson

eachprocessor�

Ifweuse

a

��

next polygonnext pixel

more pixels?

NY

get texture RG

B

textured polygon?

Y

NY interior pixel?

Npixel visible?

compute z

N

shade pixel

write R

GB

Z

Figu

re��

SIM

DRaste

rizatio

n�therasterization

process

onaSIM

Dmulti�

processor�

Thedouble

loopisrep

lacedwith

asin

gleloop

todecou

pletheprocessor

execu

tiontim

efrom

poly

gonsize

di�eren

ces�

sort��rst

orsort�m

iddle

communication

structu

rethen

anyim

balan

cesin

thedis�

tribution

ofpoly

gonsacross

thescreen

�which

there

oftenare

will

bere�

ectedin

imbalan

cesin

thenumber

ofpoly

gonsoverlap

pingeach

processor�s

rasterizationre�

gion�lead

ingto

furth

erload

imbalan

ces�Thissuggests

that

analgorith

mbased

on

sort�lastcom

munication

�which

canassign

poly

gonsto

processors

arbitrarily�

could

perform

signi�can

tlybetter

than

either

sort��rst

orsort�m

iddlecom

munication

�at

leastdurin

gtherasterization

stage�

��

Summary

Themajor

function

alityof

apoly

gonren

derin

gpipelin

ehas

been

describ

ed�Particu

�

laratten

tionhas

been

paid

tothereq

uirem

entsfor

ane�

cientSIM

Dim

plem

entation

�

Ingen

eralan

attemptismadeto

queueupasizab

leportion

ofwork

foreach

stage

ofcom

putation

onall

processors

tominim

izethenumber

ofidle

processors

atany

poin

tin

time�

Ine�ect

ane�

cientSIM

Dgrap

hics

pipelin

ewill

move

allpoly

gons

simultan

eously

throu

ghthepipelin

e�so

that

statisticallyeach

processor

should

be

keptbusy

desp

itepoly

gonbypoly

gonvariation

sin

theam

ountof

work

tobedone�

Unfortu

nately

lackof

uniform

ityin

thepoly

gondatab

ase�particu

larlywith

re�

gardsto

asymmetric

distrib

ution

ofpoly

gonsover

thescreen

�can

resultin

di�

cult

loadbalan

cingprob

lems�Thenextchapter

will

discu

ssthescalab

ilityof

thedi�eren

t

communication

schem

eswith

aneyeboth

totheir

inheren

te�

ciency

andtheir

e�ects

onthee�

ciency

ofthegeom


stages�

��

Chapter�

Scalability

While

weare

interested

speci�

callyin

afast

poly

gonren

derin

galgorith

mfor

the

Prin

cetonEngin

e�weare

more

generally

interested

inafast

algorithm

forarb

itrary

parallel

computers�

Inparticu

larwewould

likeasca

lable

algorithm

sowecan

readily

constru

ctbigger

andprop

ortionally

fastersystem

s�Optim

allywewould

likealin

ear

speed

upin

thenumber

ofprocessors�

soouralgorith

mmust

beO�p�n

�Twoforces

will

preven

tusfrom

actually

achiev

ingthisperform

ance�

Communicatio

nCom

munication

isnecessary

inaparallel

renderer

tocom

binethe

results

oftheparallel

execu

tionsof

thealgorith

m�Com

munication

not

only

represen

tsacost

non�ex

istentin

theuniprocessor

case�it

will

alsoprove

to

bean

O�p

operation

forourim

plem

entation

�so

communication

will

cometo

dom

inate

perform

ance

foralarge

enough

system

�O��

solution

sexist�

butare

extrem

elyexpensive�

requirin

gtim

eprop

ortional

tothenumber

ofpixels

inthe

disp

lay�

Imperfe

ctloadbalancingToobtain

perfect

linear

speed

upwemustbeableto

per�

fectlydividethework


Ingen

eralthere

areunavoid

able

di�eren

cesin

computation

alload

betw

eenprocessors�

Manyren

derin

gsystem

s

attackthis

prob

lembysubdividingtheprob

lemto

�ner

levels

ofparallelism

which

aremore

easilybalan

ced�butfor

largesystem

sthese

regionsmay

be�

��

comearb

itrarilysm

all�andthesubdivision

itselfwill

startto

incursubstan

tial

overhead

�

While

wedesire

linear

perform

ance

improvem

entsin

thenumberofprocessors�

we

alsodesire

thefastest

possib

lesolu

tionfor

ourparticu

lararch

itecture�

thePrin

ceton

Engin

e�In

particu

lar�thegoal

ofinteractiv

efram

erates

will

preclu

detheuse

of

existin

glin

earlyscalab

lesolu

tions�

For

exam

ple�

sort�lastcom

munication

requires

constan

ttim

e�independentof

thenumber

ofprocessors�

Unfortu

nately

itwill

prove

proh

ibitively

expensive

forourscen

esof

interest�

Inthischapter

wewill

discu

sstheexpected

scalability

ofthegeom

etry�com

muni�

cationandrasterization

stages�In

addition

wewill

prov

ideafram

ework

forpick

ing

themost

e�cien

tcom

munication

strategyin

general�

andspeci�

callyapply

itto

the

Prin

cetonEngin

e�Thechoice

ofsort

order

will

play

anim

portan

trole�

not

only

bydeterm

iningthetotal

amountof

timespentperform

ingcom

munication

andthe

scalability

ofthealgorith

m�butalso

bya�ectin

gtheload

balan

ceandduplication

of

e�ort

inthegeom


stages�

��

Geometry

Thegeom

etrycalcu

lationsare

iden

ticalfor

everypoly

gon�Som

epoly

gonswill

be

culled

becau

sethey

areoutof

theview

ingfru

stum�or

rejectedbecau

sethey

are

back

facing�

butas

discu

ssedin

x��thiscan

easilybecou

ntered

bybreak

ingthe

geometry

pipelin

einto

twopieces

andqueuein

gthevisib

lepoly

gonsbetw

eenthese

stages�Thegeom

etrycalcu

lationsreq

uired

forapoly

goniscon

stant�so

weshould

be

able

todistrib

ute

poly

gonsover

processors

uniform

ly�andobtain

perform

ance

that

isO�p�n

for

thegeom

etrystage�

��

��

Raste

rizatio

n

Thedriv

ingfactor

ofouranaly

sisof

parallelism

has

been

thenumber

ofprocessors

involved

�particu

larlylarge

numbers

ofprocessors�

Asthenumber

ofprocessors

in�

volvedbecom

eslarger

thedivision

ofthework

�assumingacon

stantam

ountof

work

tobedone

requires

a�ner

and�ner

granularity

oftheprob

lem�Even

forvery

large

numbers

ofprocessors

�thousan

ds there

isgen

erallyenough

parallelism

availabledur�

ingthegeom

etrystage

tokeep

them

alloccu

pied

�Thisisnot

asread

ilytru

efor

the

rasterizationstage�

Ifweassu

meapoly

gonparallel

implem

entation

ofrasterization

�wegive

each

processor

p�nof

thepoly

gonsto

rasterize�an

dhave

thusassu

med

asort�last

archi�

tecture

andwith

afew

weak

assumption

saboutthedistrib

ution

ofpoly

gonsand

their

sizes�weobtain

asystem

that

exhibits

excellen

tparallelism

�Unfortu

nately

we

will

soonsee

that

sort�lastalgorith

msare

proh

ibitiv

elyexpensiv

eon

thePrin

ceton

Engin

e�Thisleaves

sort��rst

andsort�m

iddlealgorith

msto

becon

sidered

�Both

of

these

assignaregion

ofthescreen

toeach

processor

forrasterization

�rath

erthan

somenumber

ofprim

itives�

andthusreq

uire

image

parallel

rasterization�

Sort��

rstandsort�m

iddledividethescreen

into

multip

leregion

sandassign

them

totheprocessors

insom

efash

ion�Asdiscu

ssedin

thefollow

ingsection

s�sort��

rst

andsort�m

iddleare

most

e�cien

tfor

small

numbers

ofprocessors

andsm

allnumbers

ofpoly

gons�

Inparticu

larthecom

munication

timefor

sort�middleproves

tobelin

ear

inthenumber

ofpoly

gonsandindependentof

thenumber

ofprocessors�

Sort��

rst

introd

uces

redundantcalcu

lationwhich

increases

with

thenumber

ofregion

sthe

screenistiled

into�

andthuswith

thenumber

ofprocessors�

Soboth

sort��rst

and

sort�middlelack

scalability

foroursystem

sof

interest�

Sort��

rstandsort�m

iddlealso

a�ect

thee�

ciency

oftherasterization

stage�Ifwe

staticallydividethescreen

intonregion

sthese

regionswill

becom

equite

small

�tens

ofpixels

forlarge

n�Unfortu

nately

thepoly

goncom

plex

ityof

ascen

eisusually

not

distrib

uted

uniform

lyover

thedisp

lay�so

someregion

swill

have

veryfew

poly

gons�

��

while

other

regionswill

have

many�

Thiscan

leadto

substan

tialrasterization

load

imbalan

cesbetw

eenprocessors�

How

ever�ifweim

plem

entsort�last

communication

thisprob

lemisavoid

ed�as

wewillassign

poly

gonsrath

erthan

pixels

toeach

processor�

Chapter

�discu

ssesahybrid

algorithmentitled

�sort�twice�

that

combinesthelow

costof

sort�middlefor

small

numbers

ofpoly

gonsandprocessors

with

theexcellen

t

scalability

ofsort�last

toobtain

asystem

more

e�cien

tthan

either

approach

for

scenes

often

sof

thousan

dsof

poly

gons�

��

Communicatio

n

Thecom

munication

stageacts

asacrossb

ar�com

municatin

gtheresu

ltsof

onephase

ofcom

putation

toanoth

er�andmay

beplaced

inthree

distin

ctplaces

intheren

derin

g

algorithm�as

show

nsch

ematically

inFigu

re��

Sort��

rstandsort�m

iddle

both

communicate

poly

gons�

orsom

epartially

computed

represen

tationof

them

�while

sort�lasttra�

csin

pixels�

Neith

erthescalab

ilitynor

thee�

ciency

ofanyof

thesort

option

sappear

in�

tuitively

obviou

s�Asevidenced

bytherasterization

discu

ssiontheord

eringof

the

communication

will

have

e�ects

onboth

therasterization

andgeom

etrystages�

A

modelisintrod

uced

toquantify

anumber

ofasp

ectsof

these

system

sandprov

idea

basis

fortheir

comparison

�

��

Model

Themodelabstracts

quite

faraw

ayfrom

theactu

alren

derin

gprocess�

Anattem

pt

has

been

madeto

capture

allof

the�rst

order

e�ects

incom

munication

andas

manyof

thesecon

dord

ere�ects

asdeem

edpractical�

Wewill

make

anumber

of

simplify

ingassu

mption

s�andnote

when

weare

bein

goverly

optim

isticor

pessim

istic

inourassu

mption

s�Wehave

foundthat

evenwith

crudeassu

mption

sthemodelcan

prov

ideaclearly

preferab

lecom

munication

strategy�Becau

sethecom

munication

structu

rehas

asign

i�can

tim

pact

onthegeom


stagesthey

are

��

variable

de�nition

pnumber

ofpoly

gons

nnumber

ofprocessors

Gtim

eto

perform

geometry

per

poly

gonR

timeto

rasterizeapixel

fasy

mmetry

factor�ratio

ofmaxim

um

number

ofpoly

gonsin

aregion

ofthescreen

totheaverage

number

ofpoly

gonsper

region�

onumber

ofregion

sthat

apoly

gonoverlap

sQ

areaof

screenin

pixels

Aaverage

areaof

apoly

gonin

pixels

Ctim

eto

communicate

adatu

mfrom

neigh

bor�to�n

eighbor�

Di�eren

tfor

eachcom

munication

structu

re�

Table��

Modelin

gVaria

bles

inclu

ded

inthemodel�

Table��

presen

tsourvariab

lesfor

modelin

gthecom

munication

prob

lem�Wewill

modeltheren

derin

gprocess

asoccu

rringon

amach

inewith

nprocessors�

renderin

g

appoly

gonscen

e�

Wemodelthegeom

etrystage

asreq

uirin

gtim

eGto

execu

teper

poly

gon�andthe

totaltim

espentwill

beG

timesthemaxim

umnumber

ofpoly

gonson

anyprocessor�

Sim

ilarly�therasterization

stageismodeled

asreq

uirin

gtim

eR

torasterize

apixel�

andwill

require

atotal

timeequal

tothemaxim

um

number

ofpixels

rasterizedby

anyprocessor�

timesR�

For

eachcom

munication

schem

ewewill

determ

inetheam

ountof

data

qthat

it

communicates

andthedistan

cedthat

eachdatu

mmusttravel�

Thecom

munication

topology

isa��D

ringso

fore�

ciency

reasonswewill

consid

erneigh

bor�to�n

eighbor

communication

�Passin

gadatu

mfrom

processor

nato

processor

nbwill

therefore

require

jna�nb jpasses�

sotim

eisprop

ortional

todistan

ce�

Allprocessors

may

pass

adatu

mto

their

neigh

bor

simultan

eously�

soat

anypoin

t

intim

endata

canbein

communication

�Sothetotal

amountof

timereq

uired

to

communicate

thedata

betw

eenprocessors

isO�q

�d�n �each

datu

mtakes

time

prop

ortionaldto

communicate�

wecan

communicate

nof

them

simultan

eously�

and

�

there

areatotal

ofqdata

todistrib

ute�

Ingen

eralthedistrib

ution

ofpoly

gonsover

thescreen

isnot

uniform

�Thede�

viation

fromuniform

ity�expressed

asthemaxim

um

number

ofpoly

gonsin

ascreen

regiondivided

bytheaverage�

isf�In

addition

poly

gonscan

overlapmore

than

a

single

region�Theaverage

number

ofregion

sapoly

gonoverlap

siso�

Note

that

the

sizeof

aregion

isdecreasin

gin

n�becau

sethescreen

must

besubdivided

furth

er

andfurth

erto

share

thework

amongall

processors

asnincreases�

Asthesize

ofthe

regionsdecreases

both

fandowill

increase�

fwill

increase

becau

sethesam

plin

gof

smaller

andsm

allerregion

sof

thescreen

will

expose

largervariation

sin

thepoly

gon

distrib

ution

�Likew

ise�as

thesize

oftheregion

sdecrease

eachpoly

gonwill

overlap

more

regions�soowill

increase�

Thedependence

offandoon

nwill

have

importan

t

e�ects

onthee�

ciency

ofthedi�eren

tcom

munication

approach

es�

Wewill

now

analy

zethecom

munication

option

susin

gthisfram

ework

toquantify

their

perform

ance�

Note

that

optim

allywewould

likean

algorithm

which

runsin

time

toptim

al�

pn�G

�AR

��

sothat

asweincrease

thenumberofprocessors

weobtain

aperfect

linear

speed

up�

Thisassu

mes

that

weperform

nocom

munication

�that

wearen

�tpenalized

bythe

overlapfactor�

andthere

isnoasy

mmetry

inthedistrib

ution

ofload

acrossthepro�

cessors�Thefollow

inganaly

siswill

show

how

eachof

thecom

munication

algorithms

violates

these

assumption

sin

someway�

��

Sort�First

Sort��

rstplaces

thecom

munication

stageas

earlyas

possib

lein

thealgorith

m�ac�

tually

preced

ingthegeom

etrystage�

Com

munication

will

place

eachpoly

gonon

the

processor�s

responsib

lefor

rasterizingit�

Iftheset

ofprocessors

responsib

lefor

renderin

gapoly

gonwillch

ange

slowly�

sort�

��

�rst

canleverage

thistem

poral

localityto

perform

lesscom

munication

�Modelin

gthe

exploitation

oflocality

isdi�

cultwith

ourmodel�

andwewill

make

theoptim

isticas�

sumption

that

sort��rst

algorithmsoperate

with

nocom

munication

�Thisassu

mption

proves

tobeusefu

l�as

thesort��

rstapproach

stillincurssubstan

tialoverh

eadover

a

uniprocessor

pipelin

ewhich

alonecan

beused

tocom

pare

itto

other

approach

es�In

particu

lar�byplacin

gcom

munication

before

geometry

eachprocessor

that

apoly

gon

overlapswillh

aveto

perform

thegeom

etrycom

putation

sfor

thepoly

gon�Theoverlap

factoroexpresses

thisduplication

�Furth

ermore

anyasy

mmetries

inthedistrib

ution

ofpoly

gonsover

thescreen

regions�f�will

beexposed

tothegeom

etrystage�

sowe

will

spendtim

eGfopnperform

inggeom

etrycom

putation

s�

Therasterization

stagewill

alsobepenalized

bytheasy

mmetry

factorf�andthe

worst

caseprocessor

will

have

torasterize

fpnpoly

gons�

How

ever�theprocessor

will

only

have

torasterize

A�o

pixels

ofeach

poly

gon�for

atotal

rasterizationtim

eof

AR

fo

pn�

Thetotal

timefor

sort��rst

communication

isthen

tsf�

pnf�oG

�RAo

��

which

isnear

optim

alfor

fandoclose

to��

Weexpect

both

fandoto

be

increasin

ginnhow

ever�makingthescalab

ilityof

sort��rst

communication

question

�

able�

Ouranaly

sisin

addition

has

completely

ignored

thecom

munication

overhead

�

which

ispresu

mably

non�zero�

��

Sort�Middle

Sort�M

iddle

perform

sthegeom

etrycalcu

lationsandthen

distrib

utes

computed

re�

sults

totheprocessors

responsib

lefor

rasterization�Nolocality

isexploited

�so

the

processor

which

perform

sthegeom

etrycom

putation

sisuncorrelated

with

theras�

terizationprocessor

andtheexpected

distan

ceadatu

mwill

travelisn��

Inreality

manypoly

gonswill

berasterized

bymore

than

oneprocessor�

Theoverlap

factoro

��

will

determ

inehow

manyprocessors

eachpoly

gonwill

berasterized

by�andthushave

tobecom

municated

to�In

general

there

isnoreason

tobeliev

ethat

theprocessors

will

belocal

toeach

other�

sotheactu

aldistan

cemay

approach

nas

oincreases�

We

will

pessim

isticallyassu

med�nhere�

Sort�m

iddleonly

computes

thegeom

etryonce

foreach

poly

gon�so

itavoid

sthe

penalty

ofduplicated

work

incurred

insort��

rst�Likesort��

rsthow

ever�itpaysthe

asymmetry

penalty

inscreen

poly

gondistrib

ution

durin

grasterization

�

Ifwecall

thetim

eto

pass

thecom

puted

results

ofapoly

gonfrom

oneprocessor

toits

neigh

bor

Csmthen

communication

will

require

time

pn�n�C

smRasterization

timeisidentical

tothesort��

rstcase�

sowehave

atotal

timeof

tsm�

pn��fRo �pC

sm

��

Note

that

thetotal

communication

timefor

sort�middle

ispC

smwhich

isinde�

pen

den

tof

thenumber

ofprocessors�

soas

ngets

largesort�m

iddlealgorith

mswill

spendall

their

timecom

municatin

g�

��

Sort�Last

Sort�L

astisabitof

awild

card�Itdoesn

�tcom

municate

poly

gondata�

butinstead

communicates

pixels�

therasterization

results�

Each

processor

renders

somearb

itrary

subset

ofthedatab

ase�andsortin

gisperform

edafter

rasterization�as

these

subsets

will

overlapin

somearb

itraryway

inthe�nal

image�

There

aretwowaysto

thinkaboutcom

positin

gthisinform

ation�Therasterization

results

could

either

becon

sidered

asafram

ebu�er

oneach

processor�

oras

aset

of

poly

gonfragm

entson

eachprocessor�

Theform

erview

suggests

that

wecom

binethe

framebu�ers

fromallp

rocessorsto

obtain

a�nalfram

ebu�er�

Thelatter

view

suggests

wecom

binetheren

dered

pixels

fromeach

processor

togen

erateafram

ebu�er�

These

twopossib

ilitiesare

referredto

respectiv

elyas

�dense��

becau

seitcom

municates

an

entire

framebu�er

fromeach

processor�

and�sp

arse��becau

seitcom

municates

only

��

thepixels

rasterizedon

eachprocessor�

Inboth

casesthetotal

geometry

andrasterization

work

isp�G

�AR��n

�The

asymmetry

factorfisnot

exposed

becau

sewecan

evenly

dividethepoly

gonsover

processors�

andtheoverlap

factoroisnot

exposed

becau

seweallow

processors

to

rasterizeoverlap

pingregion

s�Ign

oringcom

munication

overhead

�sort�last

o�ers

the

linear

scalability

wehave

been

seeking�

andin

factprov

ides

theoptim

alparallel

speed

up�

Dense

sort�lastcom

municates

ascreen

sworth

ofpixels

fromeach

processor�

How

�

ever�each

pixel

only

has

tobesen

tto

theprocessors

neigh

bor�

Upon

receiptof

a

pixeltheprocessor

either

passes

thepixelon

tothenextprocessor

ifitocclu

des

the

processors

pixel�

orinstead

forward

sits

ownpixel�

Section

��prov

ides

anexam

ple

ofthiscom

munication

structu

re�andamore

carefulexplan

ationof

whythedistan

ce

eachpixelmusttravel

is��

Thescreen

isatotal

ofQ

pixels�

sodense

sort�lastmust

communicate

atotal

ofnQ

totalpixels�

npixels

arecom

municated

inparallel�

andeach

pixel

travelsa

distan

ceof

��so

thetotal

communication

timeisQCsld

ense �

Aco

nsta

ntam

ountof

timeisspentin

dense

sort�lastcom

munication

�independentof

both

thenumber

of

poly

gonsandthenumber

ofprocessors�

Sparse

sort�lastlack

sacom

plete

framebu�er

oneach

processor�

Ifweassu

me

thefram

ebu�er

forthe nal

complete

image

isdistrib

uted

overall

theprocessors

then

eachprocessor

has

tocom

municate

eachof

itsrasterized

pixels

tosom

eoth

er

arbitrary

processor�

Thustheexpected

distan

ceapixelwill

travelispessim

istically

n�Each

poly

gonhas

anaverage

areaA�andthere

areatotal

ofppoly

gons�so

there

areAppixels

tocom

municate�

Wecan

communicate

npixels

inparallel�

eachover

a

distan

cen�for

atotal

communication

timeofApC

slsparse �

Dense

sort�lastreq

uires

atotal

amountof

time

tsld

ense�

pn�G

�AR��QCsldense

��

toren

der

ascen

e�while

sparse

sort�lastreq

uires

time

�

tsls

parse�

pn�G

�AR��ApC

slsparse

��

toren

der

ascen

e�Dense

sort�lasto�ers

communication

that

iscon

stantover�

head

�while

sparse

sort�lastclosely

resembles

sort�middle�

butlack

stheasy

mmetry

penalties

in�icted

onrasterization

insort�m

iddlestru

ctures�

��

Analysis

Thusfar

ourmodelhas

only

assumed

that

ourexecu

tiontim

esare

bounded

bythe

sum

oftheworst

casetim

esof

eachstage�

Thisistru

efor

anySIM

Darch

itecture�

Wewill

now

quantify

eachof

thevariab

lesin

themodelfor

thePrin

cetonEngin

e�In

somecases

wewill

use

actual

gures

obtain

edbyexam

iningan

implem

entation

�and

inoth

ercases

represen

tativenumbers

will

beused

�

Determ

ination

ofthecost

ofcom

munication

ofcou

rsereq

uires

amodel

ofthe

actualcom

munication

mech

anism

�OnthePrin

cetonEngin

eeach

processor

canpass

asin

gle��b

itdatu

mto

itsneigh

bor

in�operation

s�write

thecom

munication

register�perform

sthepass�

andread

thecom

munication

register�Passin

gadatu

m

ofsize

awill

require

i��

�a�b

��

where

bisan

overhead

factorrelatin

gto

setupandtest

ofthereceiv

eddatu

m�The

factorof

�re�

ectsthat

passin

gasin

gleinteger

throu

ghthecom

munication

register

requires

writin

gtheregister�

perform

ingthepass�

andread

ingtheregister�

Dueto

resource

constrain

tsthese

operation

scan

tbepipelin

edover

eachoth

er�

There

are�distin

ctcom

munication

approach

esto

analy

ze�Sort�

rstandsort�

middle

both

deal

inpoly

gons�or

computed

results

thereof�

andsort�last

deals

in

pixels�

Agood

estimate

ofthesize

ofthedatu

mfor

eachisnecessary

todeterm

ine

realisticexecu

tiontim

es�Apoly

gonisdescrib

edinourim

plem

entation

by��

integers�

��

approach

datu

msize

overhead

C

sort� rst

��

��sort�m

iddle

��

��sort�last

full

��

��sparse

��

��

Table

��Communicatio

nCosts�

comparison

ofdatu

msizes

fordi�eren

tcom

�munication

approach

es�

while

apoly

gondescrip

tor�th

esort�m

iddle

datu

m�is��

integers�

Dense

sort�last

requires

thecom

munication

ofthree

colorcom

ponents

of��b

itseach

anda��b

itz

value�

Asparse

sort�lastapproach

alsoreq

uires

eachpixelto

betagged

with

screen

coordinates�

foran

addition

altwointegers�

Theoverh

ead gures

anddatu

msize

arebased

onan

implem

entation

forsort�

middle�

Thesort�

rstdatu

msize

isthesize

ofapoly

gonin

ourdatab

ase�andthe

overhead

gure

istak

enas

equal

tothesort�m

iddlecase

becau

sethey

arecom

para�

ble

operation

s�Sort�last

has

been

approx

imated

andisbased

onexperien

ce�The

overhead

gures

re�ect

that

durin

gthedense

approach

thecoord

inates

ofthenext

pixelreceived

areknow

napriori�

soless

computation

need

sto

beperform

edon

each

receivedpixels�

Sparse

sort�laston

theoth

erhandreq

uires

afram

ebu�er

address

cal�

culation

foreach

received

pixel�

andtheunpred

ictability

ofthepixels

received

makes

overlappingoperation

sdi�

cult�

Ourmodelin

gmadeanumber

ofassu

mption

s�Thesim

ulator

was

instru

mented

tomeasu

reanumberoftherelevan

tquantities�

andveri

edourassu

mption

s�In

par�

ticular�

Figu

re��

show

sthedistrib

ution

ofpoly

gonsover

processors

forasort�

rst

algorithm�andin

particu

lardem

onstrates

thesev

ereload

balan

ceprob

lemsresu

lting

fromtheob

jectdetail

bein

gcen

teredin

thescreen

�Figu

re��

veri esourassu

mption

that

thenumber

ofcom

munication

operation

sfor

sort�middlewould

dependlin

early

onthenumber

ofpoly

gons�andindependentof

thenumber

ofprocessors�

Given

thevalu

esof

theparam

eterscom

mon

toall

thealgorith

ms�

speci

edin

��

variable

value

pleft

unspeci

edfor

amore

general

analy

sisn

��thenumber

ofprocessors

inourPrin

cetonEngin

eG

��obtain

edfrom

ourim

plem

entationA

��thecom

mon

benchmark

is��p

ixelpoly

gons

R��

obtain

edfrom

ourim

plem

entationQ

��thescreen

is��x

��pixels

f��

taken

fromthesam

pleteap

otscen

eo

��tak

enfrom

thesam

pleteap

otscen

e

Table��

Modelin

gValues

0 50

100

150

200

250

300

350

400

0256

512768

1024

polygons

processor

Figu

re��

Sort�

First

LoadIm

balance�thenumber

ofpoly

gonsper

processor

variesdram

atically�Theworst

caseprocessor

has

��poly

gonsintersectin

gits

region�

while

theaverage

caseis��

poly

gons��

��

0

0.2

0.4

0.6

0.8 1

1.2

0512

10241536

2048

shifts/polygon

# processors

beethovencrocodile

teapot

Figu

re��

Sort�

Middle�thenumber

ofshifts

per

poly

gonisthetotal

number

ofshifts

required

todistrib

ute

thepoly

gons�divided

bythetotal

number

ofpoly

gons�

Themargin

alcost

ofcom

municatin

gapoly

gonisapprox

imately

constan

tat

�shift

per

poly

gon�

��

Table

��andthesize

ofthedatu

meach

communication

structu

reuses�

givenin

Table��

wenow

perform

comparison

sbetw

eenthese

communication

approach

esto

determ

inewhich

ismost

practical

onthePrin

cetonEngin

e�

��

Sort�

First

vs�

Sort�

Middle

Ifwecom

pare

sort� rst

andsort�m

iddlewesee

that

sort�middleisfaster

than

sort�

rst

when

tsm

�tsf

��

pn�G

�fRAo��pC

sm�

pnf�oG

�RAo�

��

G�nCsm

�foG

��

fo��nCsm

G��

For

ournumbers

sort�middle

issuperior

iffo�

��This

analy

sissuggests

that

sort� rst

will

beaboutafactor

�faster

than

sort�middle�

asfo�

��for

oursam

ple

scene�

Noneth

elesssort�m

iddle

was

implem

ented

preferen

tiallybecau

se

itisbelieved

that

inreality

sort� rst

will

besign

i can

tlyslow

erthan

assumed

here�

fortworeason

s�First�

theanaly

sisabove

assumed

that

sort� rst

would

require

no

communication

�which

isclearly

false�Secon

d�itcom

pared

aworst�case

behavior

for

sort�middle�

Inreality

throu

ghoptim

izations�w

hich

donot

apply

tothesort�

rst

case�wecan

implem

entsort�m

iddleabouttwice

asfast

asassu

med

here�

Overall

we

expect

that

sort�middleisat

leastas

fastas

sort� rst

onthisarch

itecture�

andlikely

tobesign

i can

tlyfaster�

��

Sort�

Last�

Dense

vs�

Sparse

Ignorin

gcom

mon

factors�th

erasterization

andgeom

etrytim

e�wesee

that

thetim

e

required

forasparse

sort�lastalgorith

misprop

ortionalto

p�while

thedense

sort�last

��

timeiscon

stant�Thuswecan

calculate

thenumberofpoly

gonswhere

dense

sort�last

communication

ispreferab

leto

sparse

tobe

p�

Q�Csld

ense

A�C

slsparse

��

orp�

��A

��poly

gonscen

eisextrem

elysim

ple�

At��fp

samere

��

poly

gonsper

secondwould

beren

dered

�Sofor

allbutthemost

trivial

scenes

dense

sort�lastproves

superior

tosparse

sort�last�

��

Sort�

Middle

vs�

Dense

Sort�

Last

Ourtworem

ainingalgorith

msare

sort�middleandsort�last�

Sort�m

iddle

has

per�

formance

linear

inthenumber

ofpoly

gons�anddense

sort�lasthas

constan

tperfor�

mance�

soonce

againwecan

ndthetrad

eo�poin

twhere

dense

sort�lastwill

bethe

algorithm

ofchoice�

p�

Q�Csld

ense

Csm

��f�o��AR

n

��

Ifp�

��then

sort�lastwill

bepreferab

le�Thetim

ereq

uired

toperform

thesort�last

composite

will

beQ

�Csld

ense�

��Minstru

ctions�which

isnearly

an

entire

secondon

thePrin

cetonEngin

e�Sosort�last�

while

o�erin

gcon

stanttim

e

perform

ance�

andthusscalab

ility�proves

proh

ibitively

expensive�

Figu

re��

compares

theperform

ance

ofthese

algorithms�Aswould

beexpected

�

sort� rst

requires

thefew

estinstru

ctions�

followed

bysort�m

iddle

and nally

the

sort�lastsparse

implem

entation

�

Sort�m

iddle

isade nite

win

onthis

architectu

re�A

number

ofoptim

izations

arediscu

ssedlater

inthepaper

which

will

apply

tomost

ofthese

communication

topologies�

how

everord

ersofmagn

itudesep

aratetheperform

ance

ofthese

algorithms�

andthere

arenoord

ersofmagn

itudeinperform

ance

tobefou

ndinanyoptim

izations�

sosort�m

iddlerem

ainstheoptim

alchoice

ofcom

munication

structu

re�

��

1 10

100

1000100010000

100000

10^6 instructions

polygons

sort-firstsort-m

iddledense sort-lastsparse sort-last

optimal

Figu

re��

Communicatio

nCosts�

comparison

ofinstru

ctionsreq

uired

toexecu

tevariou

scom

munication

structu

resversu

snumber

ofpoly

gons�

�

��

Summary

Afram

ework

has

been

prov

ided

toexam

inevariou

scom

munication

alternativ

esfor

poly

gonal

renderin

g�Toprov

ideastraigh

tforward

andanaly

ticalsolu

tionanumber

ofassu

mption

shave

been

made�

Assu

mption

shave

been

both

optim

istic�sort�

rst

requires

nocom

munication

�andpessim

istic�sort�m

iddlereq

uires

all�to�allcom

mu�

nication

�andhave

yield

edinsigh

tfulresu

lts�

Theapplication

ofthisanaly

sisto

thePrin

cetonEngin

erev

ealsthat

algorithms�

such

assort�last�

which

execu

tein

constan

ttim

e�may

prove

tobeproh

ibitiv

ely

expensive�

while

communication

that

execu

tesin

linear

time�such

assort�

rst�may

induce

somuch

overhead

that

they

prove

impractical�

Sort�m

iddle�

which

runsin

linear

timein

thenumber

ofpoly

gonsand

indep

enden

tof

thenumber

ofprocessors

would

seemat

rst

glance

tobean

infeasib

lesolu

tion�butactu

allyproves

tobethe

most

e�cien

tof

thealgorith

msfor

ourscen

esof

interest�

Theobservation

that

sort�lastyield

scon

stantperform

ance

andsort�m

iddleyield

s

perform

ance

independentof

the�clu

mping�

ofprim

itivesthat

sort� rst

issubject

to

issign

i can

t�Wewill

return

tothese

twoalgorith

msin

Chapter

to

develop

ahybrid

algorithm

which

extracts

thecon

stanttim

ebene ts

ofsort�last

andthee�

ciency

of

sort�middlefor

small

numbers

ofprocessors�

��

��

Chapter�

Communicatio

nOptim

izatio

ns

Chapter

�motivated

theim

plem

entationof

asort�m

iddlecom

munication

schem

eto

maxim

izeperform

ance�

How

ever�ifwelook

attheratio

ofthetim

espentcom

muni�

catingto

thetotal

execu

tiontim

e

n�Csm

G�n�Csm�fARo

��

andevalu

atefor

thePrin

cetonEngin

ewesee

that

communication

will

occupy

approx

imately

��of

ourtotal

execu

tiontim

e�Even

allowingfor

ourpessim

istic

assumption

s�weare

alreadyclearly

facingtheissu

eof

scalability

forsort�m

iddle

communication

�Thesin

glelargest

obstacle

totheim

plem

entation

ofahigh

�speed

poly

gonal

renderer

onaSIM

Drin

gisthecom

munication

bottlen

eck�

Wehave

alreadydeterm

ined

that

sort�middlecom

munication

will

execu

teinO�p

time�so

wewill

focuscon

siderab

lee ort

onminim

izingthecon

stantfactors

inthe

perform

ance�

Anumber

ofpossib

ilitiescou

ldbeexploited

inthedesign

ofamore

e�cien

tsort�

middlecom

munication

algorithm�They

arelargely

orthogon

alto

eachoth

er�so

when

combined

they

yield

atree

ofim

plem

entation

s�show

inFigu

re��

Fortu

nately

some

ofthebran

ches

prove

ine ective�

andthey

canbeabandoned

earlyin

theanaly

sis�

Theoption

sto

beexplored

inthedesign

ofthecom

munication

algorithm

have

��

dense

m>

1

m=

1

netw

ork

locality

du

plicatio

nd

atad

istribu

tion

sparsestriped

unstripedyes

no

Figu

re��

Communicatio

nOptim

izatio

ns�

theset

ofpossib

lecom

munication

optim

izationscreate

atree

ofpossib

ilitiesto

explore�

��

been

ordered

byboth

easeofim

plem

entation

�andexpected

importan

ce�Theoption

s

are�

Network

Thecom

munication

netw

orkhas

a�nite

capacity

forinform

ation�Itmay

bedesirab

leto

attemptto

saturate

thischannel�

placin

gas

much

inform

ation

into

itas

possib

le�There

isacost

associatedwith

saturatin

gthechannel

be�

cause

thesetu

pandtests

toach

ievesatu

rationreq

uire

time�durin

gwhich

no

communication

isach

ieved�Theoptim

izationto

keepthenetw

orkas

fullas

possib

leis

referredto

asa�dense�

netw

ork�not

tobecon

fused

with

dense

sort�lastcom

positin

g�

Distrib

utio

nThecost

ofpassin

gadatu

mbetw

eenarb

itraryprocessors

isdecreasin

g

inthesize

ofthepasses

used

�Usin

glarge

passes

�neigh

bor�to�n

eighbor�N

amortizes

thecost

ofwritin

gandread

ingthecom

munication

register�alon

g

with

theoverh

eadof

thepass

operation

�A

two�p

hase

sort�middle

algorithm

which

uses

largeinitial

passes

toget

data

closeto

itsdestin

ation�follow

edby

neigh

bor�to�n

eighbor

passes

toprecisely

deliv

erthedata

couldbemore

e�cien

t

than

anaive

sort�middlealgorith

m�

Data

Duplicatio

nThepoly

gondatab

asecou

ldbeduplicated

ontheprocessors�

so

that

instead

ofthere

bein

gacop

yof

apoly

gonon

only

processor�

there

is

acop

yon

mprocessors�

Byduplicatin

gthepoly

gonsperiod

icallyacross

the

processors

wehave

reduced

themaxim

umdistan

ceanypoly

gonmusttravel

be�

tween

geometry

computation

sandrasterization

�thusred

ucin

gcom

munication

time�at

thepossib

leexpense

ofgeom

etrytim

e�

TemporalLocality

Interactiv

epoly

gonren

dered

scenes�

such

asanim

ations�build

�

ingwalk

throu

ghs�andmolecu

larmodelin

g�gen

erallyhave

high

frame�to�fram

e

coheren

ce�Apoly

gonren

dered

inonefram

ewill

beren

dered

invery

nearly

the

sameposition

inthenextfram

e�Com

munication

could

accountfor

thisand

allowpoly

gonsto

�migrate�

aroundtherin

g�stay

ingreason

ably

closeto

the

processors

that

have

torasterize

them

�Astim

eprogresses

thepoly

gonswill

�

move

fromprocessor

toprocessor

totrack

thechangin

gview

frustu

m�butthese

moves

will

besm

all�thusdecreasin

gtheam

ountof

timespentin

communica�

tion�

Allof

these

option

s�while

appealin

g�are

di�

cult

toanaly

zewith

outactu

ally

implem

entin

gthealgorith

ms�

Ingen

eralthey

deal

with

thesubtle

poin

tsof

the

communication

�andwhile

they

may

prove

tobepred

ictable�

agen

eralmodel

is

di�

cultto

�nd�

Before

discu

ssingtheoptim

izationin

detail

adescrip

tionof

theim

plem

entation

ofsort�m

iddlecom

munication

isgiven

toclarify

theissu

esaddressed

�

��

Sort�MiddleOverview

Themost

straightforw

ardim

plem

entation

ofsort�m

iddlecom

munication

isshow

nin

Figu

re��

After

thegeom

etrystage

eachprocessor

has

aqueueofcom

puted

poly

gon

�descrip

tors��show

nsch

ematically

astrian

glesin

the�gure�

Com

munication

begin

s

with

eachprocessor

sendingthe�rst

poly

gonin

itsqueueto

itsrigh

tneigh

bor�

and

simultan

eously

receivingits

leftneigh

bor�s

poly

gon�Theprocessors

then

perform

n��

more

passes�

eachtim

epassin

gthedescrip

torthey

received

fromtheir

leftneigh

bor

totheir

rightneigh

bor�

Acop

yof

eachreceived

poly

gonis

keptifitoverlap

sthe

rasterizationregion

fortheprocessor�

Once

the�rst

batch

ofpoly

gonshas

traveled

alltheway

aroundtherin

gtheprocessors

then

alldequeuethenextpoly

gonfrom

their

queuesandrep

eattheprocess

untilallp

olygon

shave

been

communicated

�When

complete

eachprocessor

has

seeneach

ofthe�rst

poly

gonsfrom

alltheprocessors

queues�

��

DenseNetwork

Thecom

munication

ofapoly

gonhas

been

assumed

toreq

uire

npasses�

where

nis

number

ofprocessors�

Thisassu

mes

that

eachprocessor

isinterested

ineach

poly

�

��

1

processor3

operation

2 0

2

1

2

2

01

Figu

re��

Sort�

MiddleCommunicatio

n�thework

queues�

with

asin

glepoly

gonin

each�are

show

nat

thetop

�The�rst

operation

�passes

the�rst

poly

gonfrom

eachprocessor

toits

neigh

bor�

Thesubseq

uentoperation

s��

pass

these

poly

gons

aroundtherin

guntil

everyonehas

seenthem

�

gon�In

realityonly

somesm

allsubset

oftheprocessors

will

rasterizeanyparticu

lar

poly

gon�andonce

allprocessors

inthisset

have

seenthepoly

gonitneed

n�tbecom

�

municated

anyfurth

er�Onanygiven

pass

operation

upto

npoly

gonswill

bein

communication

�Byceasin

gthecom

munication

ofpoly

gonsthat

have

been

fully

dis�

tributed

wecan

more

fully

utilize

thecom

munication

slotsavailab

leon

eachiteration

ofthealgorith

m�

These

option

scan

bereferred

toas

sparse

anddense

communication

�not

tobe

confused

with

thesort�last

approach

es�Thesparse

approach

naively

communicates

every

poly

gonto

every

processor�

andthedense

approach

stopscom

municatin

ga

poly

gonas

soonas

allof

itsrasterization

processors

have

seenit�

Figu

re��

compares

theperform

ance

ofsparse

anddense

communication

under

simulation

�Thedata

setsshow

nall

have

anaverage

distan

cebetw

eenthegeom

etry

processor

andtherasterization

processor

foreach

poly

gonof

approx

imately

��pro�

cessors�half

thetotal

number

ofprocessors�

Ithas

been

assumed

that

thecost

of

thesparse

anddense

distrib

ution

schem

esper

pass

operation

areidentical�

which

is

��

0

5000

10000

15000

20000

25000

30000

35000

beethovencrocodile

teapot

shifts

densesparse

Figu

re��

Simulatio

nofSparse

vs�

Dense

Networks�

thesparse

netw

orkcan

require

upto

twice

aslon

gto

distrib

ute

agiv

endata

set�

��

not

strictlytru

e�How

ever�unless

thetests

necessary

toobtain

dense

communication

were

veryexpensive

itisclearly

advan

tageousto

use

dense

communication

�There�

main

der

oftheoptim

izationanaly

sesare

perform

edwith

theassu

mption

that

dense

communication

isused

�

��

Distribution

Thenumber

ofinstru

ctionsreq

uired

byaneigh

bor�to�n

eighbor

pass

isi�

�a�b

where

aisthesize

ofthedatu

mandbistheoverh

eadof

theoperation

�Thefactor

of

�re�

ectsthedistin

ctoperation

son

thePrin

cetonEngin

eof

writin

gthecom

munica�

tionregister�

perform

ingthepass

operation

�andread

ingthecom

munication

register�

Arch

itecture

constrain

tsproh

ibitpipelin

ingthese

operation

sover

eachoth

er�

Ifweperform

apass

ofsom

earb

itrarydistan

ced�neigh

bor�to�n

eighbor�N

itwill

then

require

i��da

�binstru

ctions�Adatu

mcou

ldbecom

municated

adistan

ce

dbyperform

ingasin

glepass

oflen

gthd�

ordneigh

bor�to�n

eighbor

passes�

Clearly

asin

glelon

gpass

issubstan

tiallycheap

er�as

you

only

pay

thesetu

pandoverh

ead

costsonce�

rather

than

dtim

es�Ellsw

orthdiscu

ssesan

analogou

sstrategy

in��

which

places

processors

into

groupsandbundles

together

poly

gonmessages

across

groupboundaries

into

asin

glemessage

toam

ortizethecost

ofmessage

overhead

�

Sim

ilaram

ortizationisn

�tpossib

lehere�

asouroverh

eadis

per

poly

gon�not

per

message�

Thecheap

estway

toget

apoly

gonfrom

oneprocessor

toanoth

eristo

perform

asin

gleshift

ofthesize

necessary�

rather

than

anumber

ofsm

allshifts�

How

ever�

perform

inganumber

ofspecial

shifts

ofvary

ingsizes

will

leadto

poor

utilization

ofthecom

munication

netw

ork�Instead

wecon

sider

atwo�stage

algorithm

inwhich

poly

gons�rst

takestrid

esof

sizedarou

ndtherin

gto

getclose

totheir

destin

ations

cheap

ly�andthen

arepassed

neigh

bor�to�n

eighbor

tobeprecisely

aligned�

The

instru

ctioncost

savingin

this

approach

could

prove

tobesubstan

tial�Figu

re��

show

sthetrad

eo betw

eencost

anddistan

ce�Note

that

wecan

obtain

perform

ance

��

40 60 80

100

120

140

160

05

1015

2025

3035

inst/d

d

Figu

re��

Pass

Cost�

thecost

ofcom

municatin

gadatu

mper

unitdistan

ceis

inversely

prop

ortional

tothedistan

ceof

thepass�

nobetter

than

thecase

ofasin

glepass

which

places

thepoly

gonprecisely

onits

destin

ation�so

ourperform

ance

improvem

entislim

itedto

theratio

ofthetim

efor

a

single

pass

oflen

gthdto

thetim

efor

dneigh

bor�to�n

eighbor

passes�

limd��

a��

�d

�b

d��a�b

��ba

��

For

oursystem

a��

andb�

��so

wecan

expect

best�case

speed

upof

��in

thetim

ereq

uired

tocom

municate

asin

gledatu

m�

Wecan

solvefor

theoptim

alstrid

esize

dunder

someassu

mption

saboutthe

distrib

ution

ofdata�

Uniform

SourceDistrib

utio

nThepoly

gonsto

berasterized

areuniform

lydis�

tributed

acrosstheprocessors

afterthegeom

etrystage�

This

isagood

as�

sumption

�as

thepoly

gondatab

aseisdistrib

uted

uniform

lyover

theprocessors

initially�

��

Uniform

Destin

atio

nDistrib

utio

nThenumber

ofpoly

gonsto

berasterized

by

eachprocessor

isthesam

e�Thistherefore

assumes

auniform

poly

gondistrib

u�

tionover

thescreen

�Itislikely

someprocessors

�those

rasterizingthemiddleof

thescen

ewill

receivemanymore

poly

gonsthan

other

processors�

Thisintro�

duces

asymmetries

inthecom

munication

pattern

andwill

beasou

rceof

error

betw

eenouranaly

sisandsim

ulation

results�

SingleDestin

atio

nEach

poly

gonisrasterized

byasin

gleprocessor�

sotheover�

lapfactor

is�

This

isastron

gassu

mption

�Aswehave

alreadyseen

in

Chapter

�theaverage

poly

gonis

rasterizedon

about�di eren

tprocessors�

How

ever�colu

mn�parallel

decom

position

sof

thescreen

�in

which

eachproces�

sorisassign

edacolu

mnofthescreen

torasterize�

closelymatch

thisassu

mption

becau

setheprocessors

that

rasterizeapoly

gonwill

beadjacen

t�

FullRingAtanystep

inthedistrib

ution

allpoin

tsof

therin

gare

carryingdata�

Thisisnot

strictlytru

e�becau

seas

thecom

munication

reachescom

pletion

there

isnoabrupttran

sitionfrom

therin

gbein

gfullof

inform

ationto

bein

gem

pty�

For

largedata

sets�p�

nandthedense

distrib

ution

schem

ethisshould

bea

reasonableapprox

imation

�

Given

these

assumption

s�andtheir

accompanyingcaveats�

wecan

calculate

an

optim

aldistrib

ution

pattern

�andits

perform

ance

relativeto

asim

pler

neigh

bor�to�

neigh

bor

algorithm�

Wehave

assumed

that

thepoly

gonsou

rcesanddistrib

ution

sare

uniform

lydis�

tributed

�so

theexpected

distan

ceapoly

gonwill

travelisn��

Atwo�stage

distrib

u�

tionsch

eme�were

we�rst

perform

passes

ofsom

estrid

edandthen

passes

ofstrid

e

will

require

atotal

ofi�instru

ctions�com

posed

�rst

ofinstru

ctionsthat

perform

passes

ofsize

d�andthen

instru

ctionsthat

perform

theneigh

bor�to�n

eighbor

passes�

Thesize

dpasses

will

require

idinstru

ctions�

id�

pn

n�

df��

�da

�bg

��

��

Each

pass

requires

��d�a�binstru

ctions�each

poly

gontakes

anaverage

of

n��d

passes

tocom

municate

inthe�rst

stage�andthere

areatotal

ofppoly

gons

bein

gcom

municated

�nat

atim

e�

After

this

�rst

pass

ofcom

munication

eachpoly

gonwill

bean

averageof

d��

processors

fromwhere

itneed

sto

be�

so�sim

ilarto

equation

��thesecon

dpass

of

distrib

ution

will

require

i�instru

ctions�

i��

pn

d�f��

�a

�bg

��

andi��id�i� �

i�

�pn �

n�

d��

�da

�b��

d��

�a

�b� �

��

Solv

ingto

�ndtheminim

um

number

ofinstru

ctionsyield

s

d�

s��a

�bn

�a�b

��

i�

�p �� s

��a�b��a

�b

n�a� ��

��

Theoptim

alstrid

eisd��

forn��

a��

andb��

Figu

re��

compares

thepred

ictedandsim

ulated

perform

ance

ofthealgorith

m�

Thecurve

isvery

�at

attheminim

um�w

hich

isreason

ableto

expect�

since

thecost

per

unitdistan

cetraveled

approach

esasy

mptotically

with

increasin

gd�

Asdincreases�

data

isdistrib

uted

lessclosely

toits

optim

alposition

inthe�rst

stageandthecost

ofthesecon

dstage

grows�

Thesim

ulation

show

saclear

minim

um

number

ofinstru

ctionsper

poly

gonfor

d��or

d��

Thisisalittle

more

than

half

oftheanaly

ticallydeterm

ined

optim

al

d�Thediscrep

ancy

arisesbecau

sethe�D

ense

Ring�

assumption

startsto

break

dow

n

earlierfor

largevalu

esofd�

soouranaly

sisunderestim

atedi� �

Thesim

ulation

sshow

speed

upsof��

��and��

while

ourpred

ictedspeed

upis��

compared

tothe

��

0 20 40 60 80

100

120

140

160

180

200

020

4060

80100

120

inst/p

d

beethovencrocodile

teapotpredicted

Figu

re��

i�

vs�

dforDistrib

utio

nOptim

izatio

n�data

based

onsim

ulation

ofa��p

rocessormach

ine

discon

strained

tobeafactor

ofnbythesim

ulator

�

maxim

um

speed

upof

�

��

Data

Duplic

atio

n

Data

duplication

isan

intrigu

ingpossib

ilityfor

reducin

gthetotal

communication

re�

quirem

ents

Byplacin

gmore

than

onecop

yofeach

poly

gonin

theprocessor

arraywe

reduce

themaxim

umpossib

ledistan

ceanypoly

gonmusttravel

tobecom

municated

toanygiven

processor

Ifweduplicate

thepoly

gondatab

asem

times

acrossthearray

then

apoly

gon

which

lieson

processor

k�

n�m

alsolies

onprocessor

k�n�m

�processor

k�

�n�m

�throu

ghprocessor

k��m

��n

�mThemaxim

um

distan

ceapoly

gonwill

have

totravel

fromasou

rceto

destin

ationisnow

n�m

instead

ofnFurth

ermore�

thedistrib

ution

ofthecop

iesacross

thearray

isdeterm

inistic�

soanyprocessor

can

determ

inewith

outcommunica

tionwhich

ofm

processors

should

source

eachpoly

gon

forwhich

ithas

acop

yThuswehave

reduced

theam

ountof

communication

wehave

toperform

attheexpense

ofdoin

gextra

work

inthegeom

etrystage

Each

processor

will

now

have

toexam

inepm

�npoly

gons�andcom

pute

thegeom

etryfor

thesubset

ofthem

that

itdeterm

ines

itisthebest

source

for

Toanaly

zethepoten

tialperform

ance

improvem

entsof

data

duplication

wewill

use

thesam

eassu

mption

sas

used

durin

gthedistrib

ution

analy

sis�nam

ely�uniform

source

anddestin

ationdistrib

ution

�sin

gledestin

ationanddense

ring

Tomake

amean

ingfu

lanaly

siswehave

tokeep

aneye

onthecost

ofthegeom

etry

operation

sAsalread

ydiscu

ssed�thegeom

etrystage

may

bebroken

into

twopieces�

�rst

exam

iningall

ofthepoly

gonson

aprocessor

anddeterm

iningwhich

ofthem

are

visib

le�andthen

perform

ingthecom

plete

geometry

calculation

sonly

forthevisib

le

poly

gons

Inthecase

ofdata

duplication

thevisib

ilitytest

will

inclu

dedecid

ing

ifthis

processor

cancom

municate

thepoly

gonmost

cheap

lyof

them

processors

with

acop

yof

itLet

Gibethenumber

ofinstru

ctionsreq

uired

foraprocessor

to

transform

apoly

gonto

thescreen

anddeterm

ineifitshould

source

itLet

Gfbethe

��

remain

inginstru

ctionsto

complete

thegeom

etrystage

per

poly

gonThenumber

of

instru

ctionsreq

uired

toperform

thegeom

etrycalcu

lationsanddistrib

ute

theresu

lts

foraduplication

factorm

isim�

im�

pmnGi�

p�nGf�

p�n

n�m��a

�b�

��

Note

that

theduplication

costsafactor

mon

thenumber

ofpoly

gonswehave

to

exam

ine�G

i ��butcosts

nothingfor

theGfcalcu

lations�as

these

areonly

perform

ed

byoneprocessor

foreach

poly

gonThe�nal

termin

thesum

isthecom

munication

time�which

now

takes

distan

cen�mfor

eachpoly

gon�thanksto

theduplication

We

assumethat

half

ofall

poly

gonsare

visib

le

Implem

entation

reveals

that

Gi�

��andGf�

��instru

ctions

Solv

ingfor

theoptim

alchoice

ofm

yield

s

m�

sn��a

�b�

�Gi

��

im

�p �� s

Gi ��a

�b�

n�G

f

�n ��

andm

��

forn��

a��

andb��

Itisinterestin

gto

note

that

theoptim

um

constrain

tdoes

notdependon

thecost

ofG

f �as

wehave

assumed

that

thevisib

lepoly

gonsare

uniform

lydistrib

uted

across

theprocessors

fortheGfstage

ofcom

putation

Ifthere

areanynon�uniform

ities

inthedestin

ationdistrib

ution

these

will

bere�

ectedback

inthechoice

ofwhich

processors

source

thepoly

gonsfor

them

�creatin

ganon�uniform

ityin

thegeom

etry

stage

Ouranaly

sisthusfar

has

been

largelyad

hoc

dueto

thedi�

culty

ofmodelin

g

asymmetries

inthesou

rcedistrib

ution

anddestin

ationdistrib

ution

Thesim

ulator

canmodelthese

e�ects

precisely�

andyield

sinterestin

gresu

ltsFigu

re��

compares

theanaly

ticalexpression

fortheoptim

alchoice

ofm

tothesim

ulator

results

��

0 20 40 60 80

100

120

140

160

08

1624

32

inst/poly

m

simulated

predicted

Figu

re��

PolygonDuplic

atio

n�instru

ctionsexecu

tedperpoly

gonon

asim

ulated

��processor

mach

inevsm

��

Alth

ough

thecurves

have

thesam

egen

eralshape�

there

isalarge

discrep

ancy

betw

eenthetworesu

ltsPart

ofthediscrep

ancy

isdueto

themistaken

assumption

that

��of

thepoly

gonswill

bevisib

le�while

theactu

alnumber

iscloser

to� �

in

thisscen

eThesim

ulator

alsoexpresses

results

ininstru

ctionsper

visiblepoly

gon�

which

isin

somewaysmore

usefu

lthan

instru

ctionsper

totalpoly

gon�becau

seit

re�ects

thee�ort

that

goesinto

theusefu

l�visib

le�poly

gons�rath

erthan

allof

the

poly

gons

How

ever�these

twofactors

alonefar

fromaccou

ntfrom

thedi�eren

cein

theresu

lts

Figu

re��

revealshow

ouranaly

siscou

ldhave

ledusso

farastray

Theclu

mping

oftheob

jectin

themiddle

ofthescen

ecreates

aclu

mpingof

thegeom

etrywork

onapprox

imately

onefou

rthof

theprocessors�

sothegeom

etrystage

was

�tim

es

more

expensive

then

weexpected

Furth

ermore

thecom

munication

ismore

expensiv

e

becau

seithas

allbecom

erelativ

elylocal�

andlarge

parts

oftherin

ggo

unutilized

How

ever�thesim

ulation

stillreveals

aperform

ance

improvem

enton

theord

erof

��

forthebest

caseim

provem

entat

m��

Itappears

that

wehave

foundanoth

ermeth

odfor

reducin

gthetotal

amountof

communication

�at

theexpense

ofincreased

geometry

computation

Attheoptim

al

poin

tm

��

wespendan

averageof

��instru

ctionsper

poly

gonversu

s��

forthe

noduplication

case�asav

ings

of��

Thelogical

exten

sionto

tothissim

ulation

isto

incorp

oratethedistrib

ution

optim

izationalread

ydiscu

ssedfor

afurth

erperform

ance

improvem

ent

Figu

re��

show

stheinstru

ctionsper

visib

lepoly

gonas

afunction

ofd�

thedata

distrib

ution

stepsize�

andm�theduplication

factorTheminim

umpoin

tisfou

ndat

d��

andm

��

which

indicates

that

theduplication

optim

ization�while

e�ectiv

e

inandof

itself�does

not

combinewell

with

other

optim

izations

Intuitiv

elythe

duplication

optim

izationisminim

izingthedistan

ceanypoly

gonhas

tobetraveled

�

thusmakingthedistrib

ution

optim

izationless

e�ectiv

e

��

0 5 10 15 20 25 30 35 40 45 50

0256

512768

1024

polygons

processor

m =

1m

= 8

Figu

re��

PolygonsPerProcesso

rvs�

m�theasy

mmetric

distrib

ution

ofscen

econ

tentisre�

ectedin

thedistrib

ution

ofpoly

gonsdurin

ggeom

etryprocessin

gData

isdisp

layedas

themaxim

um

ofadjacen

tprocessors

forread

ability

andistak

enfrom

asim

ulation

oftheteap

otdata

set��

12

48

161

24

8

1650 60 70 80 90

100110120130

d

m

inst/poly

Figu

re��

Instru

ctio

nsPerPolygonasafunctio

nofm

andd�theoptim

alchoice

ofd��

andtheoptim

alm

��donot

together

yield

anoptim

alsolu

tion

Sim

ulation

isbased

onthecrocod

iledata

set

��

��

TemporalLocality

Tem

poral

localityattem

ptsto

exploit

localityinasim

ilarfash

ionto

data

duplication

Thepoly

gonsare

allowed

tomigrate

aroundtherin

gfor

thegeom

etrycom

putation

s�

rather

than

alwaysbein

gcom

puted

onthesam

esou

rceprocessor

Apoly

gonrem

ains

asclose

aspossib

leto

theprocessors

that

will

berasterizin

git�

thusminim

izing

communication

time

Unlikesort��

rstonly

asin

glecop

yof

thepoly

goniscom

puted

durin

gthegeom

etrystage�

thusrem

ovingtheduplicated

computation

penalty�

but

stillincurrin

ganyload

imbalan

cesdueto

non�uniform

poly

gondistrib

ution

overthe

screenAsinthedata

duplication

case�thisismost

mean

ingfu

lformapsofprocessors

topixels

that

keepsadjacen

tscreen

areason

adjacen

tprocessors�

such

asacolu

mn�

parallel

decom

position

Tem

poral

localitycan

bethoughtof

asdata

duplication

where

m�n�excep

twe

don�thave

toactu

allyduplicate

theentire

datab

aseon

eachprocessor

Thecom

munication

timeis

greatlyred

uced

becau

seeach

poly

gonis

rasterized

onprocessors

immediately

adjacen

tto

theprocessor

which

perform

edits

geometry

Theaverage

distan

ceapoly

gonmusttravel

becom

estheaverage

width

ofapoly

gon

Inthis

caseanoth

ercom

munication

costgets

added

into

theoverh

ead�wemust

now

communicate

thepoly

gon�not

thecom

puted

represen

tationwewill

rasterize�

betw

eenprocessors

fromfram

eto

frame

These

moves

will

besm

allin

general�

as

weare

exploitin

glocality

How

ever�they

arecom

plex

For

instan

ce�apoly

goncou

ld

move

slightly

totheleft

ofwhere

itison

thescreen

�or

slightly

totherigh

t�so

we

must

now

dopasses

ineach

direction

ifwedon�twantto

have

topass

poly

gonsall

theway

totherigh

tto

move

them

slightly

totheleft

Furth

ermore

weneed

todo

someth

ingwith

theculled

poly

gons

Atsom

epoin

tthey

could

bevisib

leagain

and

weneed

tokeep

them

aroundon

someprocessor

Ignorin

gthese

issues

wecan

�rst

just

exam

inethekindof

loadim

balan

ceswe

arefacin

gin

thissystem

In

general

they

will

bemuch

high

erthan

those

ofdata

duplication

becau

seweare

gettingthepoly

gonsin

anoptim

alposition

which

will

��

expose

allof

theasy

mmetries

intheload

distrib

ution

Figu

re��

show

sthepoly

gon

distrib

ution

durin

gthegeom

etrystage

fortheteap

otdataset

Here

theworst

case

processor

has

� �visib

lepoly

gonsto

perform

thegeom

etryfor�

atacost

ofGi �

Gf

instru

ctionsper

poly

gon�ign

oringtheculled

poly

gonswhich

mustalso

behandled

Theoverh

eadofthisapproach

overabalan

cedgeom

etrystage

istheexcess

number

ofpoly

gonson

thisprocessor

Theteap

otdata

sethas

��poly

gons�or

��poly

gons

per

processor�

sotem

poral

localityim

poses

a��

poly

gonpenalty

onthee�ective

poly

gonload

per

processor

Wecan

compute

thenumber

ofextra

instru

ctionsthis

requires

per

visib

lepoly

gonas

��G

i �Gf ��p

vis��

which

exceed

seven

anaive

implem

entation

ofsort�m

iddle

communication

which

incurs

amargin

alcost

of��

instru

ctionsper

poly

gonOnce

weinclu

dethenecessity

toperform

communication

both

topass

apoly

gonto

itsneigh

bors

forrasterization

andto

main

taintem

poral

localityweare

likelyto

befar

inexcess

ofthis�gure

Tem

poral

localityrem

ainsan

intrigu

ingoptim

izationdueto

itssim

ilaritywith

sort��rst

communication

structu

resHow

ever�its

vulnerab

ilityto

asymmetries

in

thepoly

gondistrib

ution

overthescreen

make

itunsuitab

lefor

use

here

��

Summary

Fourcom

munication

optim

izationswere

exam

ined�netw

ork�distrib

ution

�duplica�

tion�andlocality

Speci�

catten

tionispaid

totheapplication

oftheoptim

izations

tothePrin

cetonEngin

e�andtheinteraction

oftheoptim

izations

Maxim

izingthedensity

ofinform

ationin

thecom

munication

channel�

themost

obviou

soptim

ization�prod

uces

startlingly

positiv

eresu

ltsDistrib

ution

contin

ues

thistren

dando�ers

speed

upsof

greaterthan

��byam

ortizingthecost

ofload

ing

andstorin

gthecom

munication

registerover

passes

largerthan

neigh

bor�to�n

eighbor

Data

duplication

strives

toim

prove

theperform

ance

ofthecom

munication

time

bydecreasin

gthemaxim

um�an

dthemean

�distan

cebetw

eenasou

rceprocessor

and

thedestin

ationprocessors

Non�uniform

scenecon

tentdistrib

ution

becom

esre�

ected

�

0 20 40 60 80

100

120

140

160

180

0256

512768

1024

polygons

processor

Figu

re��

TemporalLocality

�thenumberofpoly

gonsper

processor

forgeom

etrycom

putation

sdirectly

re�ects

theunderly

ingscen

ecom

plex

ityfor

eachcolu

mnThe

worst

caseprocessor

has

� �poly

gonsto

perform

geometry

calculation

sfor�

almost

��tim

estheaverage

numberofpoly

gonsto

compute

Data

isdisp

layedas

themaxim

um

ofadjacen

tprocessors

forread

ability

�

inthedistrib

ution

ofwork

durin

ggeom

etrycom

putation

s�makingthisapproach

more

similar

tosort��

rst�While

o�erin

gasign

i�can

tspeed

up�ito�ers

asm

allerperfor�

mance

improvem

entthan

thedistrib

ution

meth

od�an

ddoes

not

o�eranyperform

ance

improvem

entswhen

combined

with

distrib

ution

�

Tem

poral

localityiscon

ceptually

thelogical

extrem

eof

data

duplication

�where

theduplication

isach

ieved

bysort��

rststy

lecom

munication

�Itproves

tobevulner�

able

tothesam

eload

imbalan

cesthat

sort��rst

communication

fallsvictim

to�The

extrem

ely�nesubdivision

ofthescreen

coupled

with

ahigh

lyvariab

ledistrib

ution

of

scenecon

tentcreates

afew

extrem

elyheav

ilyload

edprocessors

which

impose

ahuge

perform

ance

penalty

onthegeom

etrycalcu

lations�

��

Chapter�

Sort�TwiceCommunicatio

n

Aspart

ofthisthesis

anovel

new

approach

tothesortin

gprob

lemwas

develop

ed�en�

titledsort�tw

ice��Sort�tw

iceleverages

thee�

ciency

ofsort�m

iddlecom

munication

forlow

poly

goncou

ntscen

esto

createan

implem

entationof

sort�lastcom

positin

g

that

ispractical

with

modest

bandwidth

processor

intercon

nect�

Theim

plem

entation

ofatwo�stage

communication

algorithm

allowsanoveloverlap

pingof

therasteriza�

tionpipelin

ewith

thesecon

dcom

munication

stageto

createahigh

lye�

cientdeferred

shadingandtex

turin

gsystem

�

��

WhyisSort�

Last

soExpensiv

e�

Sort�last

has

prim

arilybeen

thedom

ainof

hard

ware

solution

sdueto

thevery

high

bandwidth

required

�For

exam

ple�

PixelF

low ��

uses

adedicated

��bit

busat

��MHz�for

abisection

bandwidth

of��G

B�s�

toprov

idethenecessary

compositin

g

intercon

nect�

Bycom

parison

�thePrin

cetonEngin

e�which

employ

sthesam

esty

leof

linear

intercon

nect�

has

ameager

��MB�s

ofintercon

nection

bandwidth�more

than

twoord

ersof

magn

itudeless

than

PixelF

low�

Whyisso

much

bandwidth

required

�Thebandwidth

forsort�last

ispurely

driv

en

bythenumber

ofsam

ples

tocom

posite

andtheupdate

rateof

thedisp

lay�Amodest

single�sam

plesystem

ofof

��pixels

by��

pixels

at�

frames

per

secondreq

uires

��

astaggerin

g��M

B�s

ofcom

munication

�which

�alth

ough

attainable

onoursecon

d

generation

mach

ine�would

leavelittle

timefor

anyoth

ercom

putation

�

��

LeveragingSort�Middle

Sort�tw

icecom

munication

leverages

thelow

expense

ofsort�m

iddlecom

munication

over

short

dista

nces

tomakesort�last

compositin

gpractical

onagen

eralpurpose

mach

inesuch

asthePrin

cetonEngin

e�Theuse

ofsort�m

iddlecom

munication

ina

system

supportin

gmultip

leusers

prov

ides

thefou

ndation

forthisnew

approach

�

Figu

re��

show

sa��

processor

mach

inesupportin

g�sim

ultan

eoususers�

Each

user

isassign

edto

agrou

pof

�processors

forgeom

etrycalcu

lations�

andan

adja�

centgrou

pof

�processors

forrasterization

�Theskew

betw

eenthegeom

etryand

rasterizationprocessors

allowsall

communication

shifts

tobeperform

edin

thesam

e

direction

�aperform

ance

boon

onaSIM

Dmach

ine�

Thecom

munication

stageonly

has

topass

apoly

gonacross

amaxim

um

of�processors�

rather

than

��processors

ifthemach

inewere

dedicated

toasin

gleuser�

Thissystem

deliv

ersperform

ance

to

eachuser

indistin

guish

able

fromamach

inethesam

esize

asthegrou

pof

processors

assigned

totheuser�

Ifweincrease

thetotaln

umberofprocessors

while

leavingthenumberofprocessors

assigned

toeach

user

unchanged

theaggregate

poly

gonren

derin

gperform

ance

will

increase

linearly�

Whitm

andiscu

ssestheim

plem

entationof

aparallel

renderer

onthe

Prin

cetonEngin

especi�

callyto

support

multip

leusers

in ��

Note

that

thenumber

ofprocessors

assigned

toauser

has

noe�ect

onthenumber

ofpoly

gonseach

processor

cantran

sformper

second�or

thenumber

ofpixels

itcan

rasterize�Ifweassign

asin

gleprocessor

toeach

user

wedon�tneed

toperform

any

communication

andwecan

maxim

izetheaggregate

renderin

gperform

ance

ofthe

system

�of

course

attheexpense

oftheper

user

poly

gonsdelivered

eachsecon

d�

Ifwewish

tofocu

sall

ofourcom

putin

gresou

rceson

asin

gleuser

wecan

stilluse

themultip

leuser

decom

position

�excep

teach

groupof

processors

while

concep

tually

��

78

1011

136

5

User 4

User 2

User 3

User 1

User 0

414

912

01

23

12

Com

munication

Rasterization

Geom

etry9

63

27

54

08

114

1311

10

Figu

re��

Multip

leUsers

onaParallelPolygonRenderer�

eachuser

isassign

edto

asm

allfraction

oftheprocessors�

work

ingfor

di�eren

tusers

will

inreality

bework

ingfor

thesam

euser�

Wewill

dividethemach

ineinto

groupsof

gprocessors�

where

gisthegrou

psize

�g�

�in

Figu

re��

Sim

ilarly�wedividethescreen

into

gnon�overlap

pingregion

s�numbered

throu

ghg��

Apoly

gonthat

overlapsaregion

r��

r�

gmay

berasterized

byanyprocessor

m�m

�r�k�g�

Ifwedistrib

ute

thepoly

gondatab

aseover

allprocessors�

nomatter

what

processor

perform

sthegeom

etrycom

putation

sfor

a

poly

gonthepoly

gondescrip

torwill

have

totravel

amaxim

um

ofg��processors

toarrive

ataprocessor

that

canrasterize

it�as

opposed

toamaxim

um

ofn�

�

processors

inthetypical

sort�middleim

plem

entation

�After

alltheprocessor

groups

complete

rasterization�wecan

composite

theim

agesgen

eratedbythedistin

ctsets

of

processors

inafash

ionanalogou

sto

dense

sort�lastcom

positin

gto

generate

the�nal

outputim

age�

Equation

��giv

esthetim

efor

sort�middleren

derin

g�Ifwechange

thisequation

tore�

ectthenew

maxim

um

distan

ceapoly

gonmust

travelwesee

t�sm�

pn�G

�gCsm�fRo�

��

andournew

implem

entation

requires

timeprop

ortionalto

thenumberofpoly

gons

��

andinverse

inthenumber

ofprocessors�

foralin

earlyscalab

lesystem

�How

ever�

we

now

must

pay

thecost

ofperform

ingsort�last

composition

tocom

binetheresu

ltsof

alltheindividual

renderin

ggrou

ps�

��

Sort�Last

Compositin

g

Sort�last

compositin

gdeserves

furth

erexplan

ationfor

thisalgorith

m�as

itdeparts

signi�can

tlyfrom

theclassical

dense

sort�lastapproach

�Processors

may

nolon

ger

rasterizearb

itrarypixels

fromtheentire

screen�Instead

eachprocessor

may

render

only

pixels

inits

regionof

thescreen

�Then�g

processors

that

have

rendered

pixels

inthesam

eregion

ofthedisp

laymust

then

composite

their

results

togen

eratethe

�nal

outputim

age�

AsFigu

re��

show

s�theregion

ofthescreen

that

aprocessor

isresp

onsib

lefor

re�

peats

everygprocessors�

Thusthecom

munication

forthesort�last

compositin

gstage

ofsort�tw

iceiscom

pletely

determ

inistic�

Most

importan

tly�thiscom

positin

gcan

be

perform

edin

constan

ttim

e�in

afash

ionanalogou

sto

norm

alsort�last

compositin

g�

The�nal

result

ofthecom

position

operation

sis

eachpixel

isknow

nat

some

determ

inistic

processor

m�Thecorrect

pixelisnot

know

nat

alltheprocessors

that

could

have

rendered

thepixel�

asthisreq

uires

all�to�allcom

munication

andwould

require

timeprop

ortionalto

n�Ifweonly

have

thecorrect

answ

er�for

anyparticu

lar

pixelon

oneprocessor

wecan

achieve

thecom

position

incon

stanttim

e�

Thecom

position

ofasin

glepixelis

perform

edas

follows�

processor

m�gtak

esits

copyof

thepixelandpasses

itto

processor

m��g�

which

composites

thepixelwith

itsow

nren

dered

pixel�

andpasses

theresu

ltto

processor

m��g�

Thisprocessor

is

repeated

until

�nally

thepixelispassed

toprocessor

m�which

composites

thepixel

into

itsfram

ebu�er�

Thusn�g�

�shifts

arereq

uired

tocom

posite

asin

glepixel�

Follow

ingthepath

ofpixel

�in

Figu

re��

itstarts

onprocessor

which

simply

passes

itscop

yof

thepixelto

processor

��Processor

�then

composites

thereceiv

ed

pixelwith

itspixel�andpasses

theresu

ltto

processor

��Thiscon

tinues�

until

�nally

��

7

12

8

9

45 214

0

11

63

148

6

10

31

13

0

11

9

12

4

13

7

12

5

10

40

1

1 23

3

12

2

3

1 23

4

4

0

0

0

4

Figu

re��

Sort�

Last

Compositin

g�asam

plemach

inewith

��processors

anda

groupsize

of��

Theassign

mentof

processors

toregion

sisshow

non

thedisp

layat

left�

processor

��receives

thepixelandcom

posites

itinto

itsfram

ebu�er�

AsFigu

re��

show

s�all

n�g

processors

responsib

lefor

thesam

eregion

cansi�

multan

eously

composite

n�g

pixels

inn�g�

�shifts�

There

aregsuch

groupsof

processors

compositin

g�so

npixels

arecom

posited

inn�g��operation

s�Theresu

lt�

ingcom

posited

image

will

bedistrib

uted


Theprocessor

that

will

hold

the�nal

composited

version

ofeach

pixelisshow

nin

Figu

re��

There

areP

pixels

intheoutputfram

e�so

thefram

ewill

require

Pn� ng��

�P� �g�

�n�

��

shifts

tocom

posite�

Thenumber

ofshifts

isbounded

byP�g�

acon

stanttim

e

solution

forsom

echoice

ofg�independentofthenumberofprocessors

andthenumber

ofpoly

gons�

Note

that

ifg�

�then

wehave

thedense

sort�lastcom

position

�as

discu

ssedprev

iously�

While

thiscom

position

requires

communication

operation

sinversely

prop

ortional

tothegrou

psize�

eachcom

munication

operation

isacross

gprocessors�

with

cost

prop

ortional

tothegrou

psize�

foroverall

constan

tperform

ance�

��

4

85

107

11

9

9

14

14

7

0

012

4

2

2

1

110 13

69

96

612

0

0

0

0312

12

3

3

93

3

3

3

6

Figu

re��

Pixelto

Processo

rMap�the�nalmappingoffully

composited

pixels

toprocessors

results

fromtheinterleav

ingof

thecom

positin

goperation

s�Theprocessor

that

isresp

onsib

lefor

eachpixellab

elsthepixel�

��

Sort�TwiceCost

Com

positin

gthe�nal

image

will

require

P��g�

�n�com

munication

operation

s�Each

operation

will

beapass

ofapixelacross

gprocessors�

Thecost

ofcom

municatin

gapixel

isof

thesam

eform

asgiv

enfor

poly

gons�

i�

��d�a

�b�

Inthis

cased�

g�each

pass

isacross

anentire

group��a�

�

��bits

ofcolor

and��

bits

ofz�

andb�

��as

taken

fromtab

le��

Sowhile

thenumber

ofshifts

isdecreasin

glin

earlyin

thegrou

psize�

thecost

oftheshifts

is

increasin

glin

earlyin

thegrou

psize�

andthecom

position

will

require

constan

ttim

e�

Exam

iningthetotal

number

ofinstru

ctionsreq

uired

toperform

thecom

positin

g

wesee

i�

P� �g�

�n� ��

�g�a

�b�

��

iP�

a�

�a�b

gif�n�

g��

Equation

��isenligh

tening�

While

thenumber

ofinstru

ctionsexecu

tedper

sam�

��

ple

isindependentof

nandp�

itisinversely

prop

ortional

tog�

sowewould

liketo

makeglarge

tominim

izethecost

ofeach

sample

wecom

posite�

For

ourmodel

we

have

a��and�a

�b�

��Asgincreases

wepay

lessandless

per

sample�

asymp�

toticallyapproach

ingamere

�instru

ctionsper

pixel�

forafactor

of�speed

upover

theclassical

sort�lastapproach

�g�

��Aswith

thedistrib

ution

�optim

ization�

weare

amortizin

gthecost

ofthetests

andwrite�read

ofthecom

munication

register

overalarger

pass

�caused

byalarger

g��It

isinterestin

gthat

theclassical

dense

sort�lastcom

positin

gcase

issim

ultan

eously

themost

expensive

tocom

posite�

requir�

ing��

instru

ctionsper

pixel�

andthemost

expensiv

ein

mem

ory�req

uirin

gan

entire

framebu�er

oneach

processor�

��

Deferre

dShadingandTexture

Mapping

Asshow

nin

Figu

re��

thetim

ereq

uired

toperform

sort�lastcom

position

isde�

creasingin

thegrou

psize�

sologically

wewould

liketo

maxim

izethegrou

psize

�with

aneye

onthecost

ofsort�m

iddle

communication

�to

maxim

izeouroverall

perfor�

mance�

Durin

gthepure

communication

ofthecurren

tpixel

data

�not

loadingor

storingregisters

orperform

ingtests

onthedata�

buttheactu

altim

ethedata

isin

transit��

theprocessors

areessen

tiallyidle�

Itisknow

napriori

that

theprocessors

nextoperation

will

beto

composite

thepixel

itreceiv

eswith

itsow

npixel�

While

waitin

gto

receivethispixel�

shadingandtex

ture

mappingfor

thisprocessor�s

pixel

may

beperform

ed�Byoverlap

pingshadingandtex

ture

mappingoperation

suntil

the

sort�lastcom

munication

stagewesave

timeande�ort

intherasterization

stage�and

leaveourcom

munication

timeuna�ected

�

Ourexpression

forthecost

ofapass

is��d�a

�binstru

ctions�

�ainstru

ctionsare

spentwritin

gandread

ingthecom

munication

register�andbinstru

ctionsare

spent

testingthereceived

datu

mandprep

aringto

transm

itthenext�

dainstru

ctionsare

spentwith

data

simply

intran

sit�which

mean

sourVLIW

processors

areexecu

ting

essentially

just

passRightfor

eachinstru

ction�andnoth

ingmore�

For

apixelsize

of

��

0 5 10 15 20 25

016

3248

6480

96112

128

inst/pixel

group size

Figu

re��

Sort�

TwiceCompositin

gTim

e�

thepayo�

forincreasin

ggdim

in�

ishes

rapidly�

Note

that

simultan

eouswith

increasin

gganddecreasin

gthecom

posi�

tiontim

e�thesort�m

iddletim

eisincreasin

g�

��

a��andamodest

groupsize

ofg��

thisam

ounts

to��

instru

ctionsper

pixel

which

goalm

ostcom

pletely

unused

bytheprocessors

Ifwelook

ahead

totheresu

ltsin

Chapter

�ourcurren

tim

plem

entationreq

uires

��instru

ctionsperbilin

earlyinterp

olatedtex

turemapped

pixelandonly�

instru

c�

tionspershaded

pixel�so

texturemappingisa��

instru

ctionprem

iumClearly

with

��instru

ctionsavailab

lewecou

lddefer

both

texture

mappingandshading

Bydeferrin

gthese

operation

suntil

afterrasterization

�therasterization

stageis

mademuch

simpler

Inparticu

lar�con

sider

thecase

oftex

ture

mapping

Prev

iously

ifanypoly

gonwas

texture

mapped�all

processors

had

tooperate

intex

ture

mapped

modeandtake

theperform

ance

penalty

Now

theperform

ance

penalty

excep

tthe

extra

param

etersto

interp

olate�iscom

pletely

deferred

until

sort�lastcom

position

�so

thetex

ture

mappingoverh

eadisentirely

hidden

Arap

idprototy

peof

thissystem

show

sthat

inunoptim

ized com

piler

generated

�

codeitwill

rasterizeasin

glepixelwith

deferred

texture

mappingandshadingin

�

instru

ctions�afactor

of�im

provem

entover

ourcurren

tim

plem

entation

��

Performance

Thetim

eto

composite

afram

eiscom

pletely

determ

inistic�an

dwehave

had

goodluck

pred

ictingsort�m

iddlecom

munication

�sowecan

con�dently

pred

icttheperform

ance

ofsort�tw

icecom

munication

Figu

re��

show

stheinstru

ctionsreq

uired

per

pixelfor

variouschoices

ofg

Ifwecon

sider

thelim

itingcase

n�

g�

��wemust

have

atleast

�instru

ctionsper

pixelto

composite�

andwecan

seethat

thiswill

require

��instru

ctionsfor

a��

by��

pixeldisp

layThePrin

cetonEngin

eoperates

at��M

Hzso

thetotal

execu

tiontim

eto

composite

anoutputim

ageis��

seconds

Ifwedonoth

ingbutcom

posite

frames

wecan

operate

atamaximumof

��fram

es

persecon

dClearly

thissolu

tionisinfeasib

lefor

aninteractiv

eren

derin

genviron

ment

onthePrin

cetonEngin

e

Itisinterestin

gto

consid

ertheattain

able

perform

ance

ofthissystem

inandof

��

itselfhow

everPart

oftheperform

ance

isstill

tiedupin

thesort�m

iddle

commu�

nication

Asdiscu

ssedearlier�

areason

able

model

forsort�m

iddle

communication

assumesthat

eachpoly

gonwillhave

tobepassed

overhalf

theprocessors

Sort�tw

ice

reduces

thenumber

ofe�ective

processors

tothegrou

psize

Thesort�m

iddlestage

will

execu

te

ism�

�pn

g� �

�d�a

�b��

pn�g

��

instru

ctions�

where

pis

thenumber

ofvisib

lepoly

gons

Thesort�last

stage�

assumingn�g�

��will

execu

te

isl�

�P ��g�

��

instru

ctions

Solv

ingfor

theoptim

algrou

psize

asdeterm

ined

bytheminimum

number

ofinstru

ctions�giv

es

g�

s��P

n

�p ��

is�

�ism��

isl��P

� s��P

p

n ��

sothegrou

psize

increases

with

number

ofprocessors�

amortizin

gthe�xed

costs

ofapass

operation

�anddecreases

inthenumber

ofpoly

gons�red

ucin

gthecost

of

thesort�m

iddleoperation

s

Bycom

parison

�pure

sort�middlecom

munication

with

thedistrib

ution

anddense

netw

orkoptim

izationsreq

uires

��

0

0.5 1

1.5 2

2.5 3

3.5 4

2000040000

6000080000

100000120000

10^6 instructions

polygons

sort-middle

sort-twice

Figu

re��

PredictedSort�Middle

vs�

Sort�

TwiceCommunicatio

nTim

e�

atabout��

poly

gonsper

frametheperform

ance

ofsort�tw

icewill

exceed

the

perform

ance

ofsort�m

iddle

ism

�p �� s

�a�

b� �a�

b�

n�

a� �A�

p ��pn

��

��

instru

ctions

Thenumber

ofinstru

ctionsreq

uired

bysort�m

iddleandsort�tw

ice

intersect

atp�

��This

isareason

ably

high

number

ofpoly

gonsper

frame

Assu

mingweare

renderin

gat

��fram

esper

secondthisim

plies

apoly

gonrate

of

over��M

rendered

poly

gonsper

second

Figu

re��

show

sthepred

ictedperform

ance

ofthesort�tw

icecom

munication

al�

gorithm

versu

stheperform

ance

ofsort�m

iddle

Inparticu

lar�note

theextrem

ely

gentle

increase

incom

munication

timeas

afunction

ofthenumber

ofpoly

gonsfor

sort�twice

Figu

re��

compares

thesim

ulated

perform

ance

ofsort�m

iddle

andsort�tw

ice

��

0.1 1 10

100

10100

1000

10^6 instructions

10^3 polygons

sort-twice

sort-middle

Figu

re��

Sim

ulatedSort�

Middle

vs�

Sort�

TwiceCommunicatio

nTim

e�

theraw

communication

times

ofthealgorith

msclosely

match

thepred

ictedresu

ltsNote

that

thesim

ulator

restrictsthechoice

ofgrou

psize

gto

beafactor

ofthenumber

ofprocessors

Ateach

data

poin

tgwas

chosen

asthelargest

factorof

nsm

allerthan

thepred

ictedoptim

alchoice

ofg

communication

algorithmsfor

scenes

ofupto

��Mpoly

gons

Thescen

eswere

generated

asran

dom

poly

gonsuniform

lydistrib

uted

overthescreen

with

��ofthe

poly

gonsvisib

le

Figu

re��

inclu

des

thecosts

ofgeom


For

scenes

beyon

d

��poly

gonssort�tw

icecon

sistently

outperform

ssort�m

iddle

��

Summary

Sort�tw

iceo�ers

anextrem

elye�

cientmeth

odto

obtain

linearly

scalable

perfor�

mance

inthenumber

ofprocessors�

with

outpayingthefullcost

ofdense

sort�last

compositin

g

Thedisp

layisdivided

into

gregion

s�andevery

processor

r�

kgisallow

edto

rasterizeanypoly

gonthat

intersects

regionr

Byallow

inganumber

ofdi�eren

t

��

0.1 1 10

100

10100

1000

10^6 instructions

10^3 polygons

sort-twice

sort-middle

Figu

re��

Sim

ulatedSort�Middle

vs�

Sort�

TwiceExecutio

nTim

e�

in�

struction

counts

inclu

degeom


stagesfor

both

approach

esAll

poly

gonsare

textured

�andthesort�tw

iceim

plem

entation

uses

deferred

shadingand

texturin

gdurin

gthesort�last

operation

processors

torasterize

thesam

eregion

theam

ountof

sort�middle

communication

required

isminim

izedThen�g

copies

ofeach

regionofthescreen

arethen

composited

afterrasterization

insort�last

fashion

Thenumber

ofpixels

tocom

posite

fromeach

processor

is��g

ofthetotal

disp

laypixels

butthey

mustbepassed

adistan

cegdurin

g

composition

Theincreased

distan

cefor

eachcom

munication

operation

amortizes

the

overhead

operation

sassociated

with

communication

�andsubstan

tiallyred

uces

the

totaltim

espentin

sort�lastcom

munication

Theintelligen

tuseofthecom

munication

instru

ctionsdurin

gsort�last

composition

allowstheim

plem

entationof

deferred

shadingandtex

ture

mappingat

nocost

Ras�

terizationcan

bemademore

than

twice

asfast

asthenon�deferred

implem

entation

prov

ided

with

sort�middlesch

emes

Thesort�last

composition

stillreq

uires

operation

sprop

ortional

tothenumber

of

pixels�

andin

thelim

itcan

execu

tein

nofew

erinstru

ctionsthan

thetotal

number

ofpixels

inthedisp

laytim

esthesize

ofeach

pixel

Inthelim

itingcase

afactor

��

ofspeed

upisobtain

able

onthePrin

cetonEngin

eover

anorm

alsort�last

imple�

mentation

Sort�tw

icecom

munication

createsan

e�cien

tsort�last

implem

entation

�

not

only

intim

eandin

mem

ory�butalso

byadmittin

gamore

e�cien

trasterization

schem

ewith

fully

deferred

shadingandtex

ture

mapping

��

Chapter�

Implementatio

n

Theprev

iouschapters�

while

prov

idingtheback

groundandanaly

sisnecessary

to

specify

thesystem

�have

only

hinted

attheactu

alunderly

ingim

plem

entation

This

chapter

will

discu

sstheissu

esin

theactu

alim

plem

entation

ofan

interactive

poly

gon

renderin

gsystem

onthePrin

cetonEngin

eIwill

prim

arilydiscu

sse�

ciency

issues

andoptim

izations

Theactu

allin

e�by�lin

edetails

ofthecod

eare

straightforw

ard�

andrep

resentative

ofthecalcu

lationsperform

edin

anynumberofpoly

gonren

derers

Thepoly

gonren

derer

supports

depth

bu�erin

g�ligh

ting�

shadingandtex

ture

mappingof

triangular

prim

itivesTherasterization

stagehas

been

implem

ented

with

acolu

mn�parallel

decom

position

ofthescreen

�which

readily

admits

thefuture

inclu

sionof

variousoptim

izationsof

thecom

munication

structu

re�andin

addition

simpli�

estherasterization

process

foreach

poly

gonFigu

res��

��

and��

are

represen

tativeoutputof

thesystem

�cap

tured

throu

ghadigital

framestore

attached

tothePrin

cetonEngin

eThehistogram

acrossthebottom

ofthe�gures

islin

early

prop

ortional

tothenumber

ofpoly

gonsthat

intersect

eachcolu

mnof

thedisp

lay

There

areanumber

ofcaveats

toim

plem

entin

gpoly

gonren

derin

gon

thePrin

ce�

tonEngin

eFirst

isthegen

eralissu

eof

managin

gtheim

plem

entation

pipelin

eon

aSIM

Dmach

ine�discu

ssedinx��

andsecon

disthedi�

culty

ofdealin

gwith

the

line�locked

program

mingparad

igm�discu

ssedinx��

These

constrain

tsprov

idesub�

stantial

motivation

forourim

plem

entation�so

wewill

discu

ssthem

�rst�

andthen

��

Figu

re��

Beethoven��

��visib

lepoly

gonsina��

poly

gonmodelren

dered

at��

frames

per

secondThehistogram

ingreen

represen

tsthenumber

ofpoly

gonslices

rasterizedbyeach

processor

Thepeak

number

ofslices

atapprox

imately

the

center

ofthe�gure�

is��

Theshort

redbars

note

which

processors

had

atim

econ

sumingrasterization

load

Figu

re��

Crocodile��

��visib

lepoly

gonsin

a��

poly

gonmodelren

�dered

at�fram

esper

secondThepeak

rasterizationload

is��

poly

gonslices

��

Figu

re��

TextureMappedCrocodile��

��visib

lepoly

gonsin

a��

poly

gonmodelren

dered

at�fram

esper

secondThepeak

rasterizationload

is��

poly

gonslices

Figu

re��

Teapot��

��visib

lepoly

gonsin

a��

poly

gonmodelren

dered

at��

frames

per

secondThepeak

rasterizationload

is��

poly

gonslices

Note

the

high

lightsfrom

twoligh

tsou

rces

��

theactu

alim

plem

entation

ofthe�pipelin

estages�

geometry�

communication

and

rasterization

��

Paralle

lPipelin

e

Thepipelin

emodel

ofexecu

tionprov

ides

agreat

deal

ofparallelism

�butlen

dsit�

selfto

adirect

implem

entation

only

where

atru

epipelin

eexists

inhard

ware

The

Prin

cetonEngin

e�while

high

lyparallel�

isnot

apipelin

eof

processors�

itisavector

ofprocessors

Toextract

maxim

ume�

ciency

fromthesystem

thepipelin

ealgorith

m

mustbecarefu

llyim

plem

ented

Astraigh

tforward

implem

entation

would

assigneach

ofnprocessors

apoly

gon�

perform

allthegeom

etry�com

munication

andrasterization

forthose

npoly

gons�

andthen

process

thenextnpoly

gons

Whyis

this

not

agood

approach

�The

graphics

pipelin

erejects

inform

ationas

itproceed

sThe�rst

stepin

thegeom

etry

stagetran

sformsandculls

poly

gons

Ingen

eral�som

elarge

fractionof

thepoly

gons

westart

with

will

almost

immediately

bethrow

naw

ay�leav

inglarge

amounts

of

resources

unused

Furth

ermore�

di�eren

tpoly

gonswill

require

di�erin

gam

ounts

of

timeto

rasterize�as

agood

implem

entation

will

makerasterization

proceed

intim

e

prop

ortional

tothearea

oftheprim

itiveIfall

processors

rasterizeat

most

asin

gle

poly

gonandthen

return

togeom

etrycom

putation

sthealgorith

mwillexecu

teintim

e

prop

ortional

tothebiggest

poly

gon

Thismotivates

arearran

gementofthepipelin

einto

aset

ofseq

uentially

execu

ted

stagesAne�

cientim

plem

entation

will

�rst

transform

andtest

visib

ilityfor

all

poly

gons�then

perform

lightin

g�then

clipping�etc

Thisserialization

ofthealgorith

m

imposes

noloss

inperform

ance�as

clearlyall

ofthesam

eoperation

smustbeexecu

ted

regardless

oford

erTheonly

costassociated

with

thisserialization

isthestorage

necessary

tobu�er

results

betw

eenstages

Byprocessin

gtheentire

poly

gondatab

ase

ateach

stagebefore

proceed

ingvariation

sin

loadbetw

eenprocessors

exposed

ona

poly

gonbypoly

gonbasis

should

beminim

ized

��

��

Line�LockedProgramming

Apoly

gonren

derin

gim

plem

entation

onthePrin

cetonEngin

efaces

anumberofobsta�

clesbeyon

dextractin

gparallelism

�Most

challen

gingisthelin

e�lockedprogram

ming

parad

igmenforced

bythehard

ware�

Durin

geach

horizon

talscan

linetheprocessors

execu

tean

instru

ctionbudget

of

��instru

ctions�

attheendof

which

theprogram

counter

isreset

tothestart

of

theprogram

inprep

arationof

thenextlin

eof

video�

After

theoverh

eadof

clocking

outvideo

andoth

erbook

keepingtask

stheavailab

leprogram

instru

ctionbudget

is

approx

imately

��instru

ctionsper

scanlin

e�

Theforced

segmentation

oftheprogram

into

block

sof

code��

instru

ctionslon

g

greatlyincreases

thedi

culty

ofextractin

ge

ciency

fromtheim

plem

entation

�Code

mustnot

only

bee

cientin

itsutilization

ofprocessors�

itmustbee

cientin

itsuse

ofcod

eblock

s�Anycod

eblock

lessthan

��instru

ctionsin

length

wastes

execu

tion

timeneed

lessly�

Aclever

approach

istaken

tohandlin

gthiscon

straint�

Theprogram

isbrok

en

into

aset

ofproced

ures�

eachof

which

execu

tesin

lessthan

��instru

ctions�

Atthe

begin

ningof

eachscan

linetheproced

ure

addressed

byaglob

alfunction

poin

teris

called�Each

proced

ure

isthen

responsib

lefor

settingthispoin

terto

theaddress

of

thenextproced

ure

toexecu

te�In

largeareas

ofstraigh

t�linecod

e�such

asdurin

g

thegeom

etrystage�

eachproced

ure

just

setsthepoin

terto

theproced

ure

which

contin

ues

itsexecu

tion�while

insection

sof

codewith

asin

gletigh

tlycod

edloop

�

such

asrasterization

�theproced

ure

may

leavethepoin

terunchanged

until

itdetects

completion

byallp

rocessors�thusexecu

tingthesam

efunction

formultip

lescan

�lines�

Thismeth

od�while

somew

hat

awkward

�mapsrelatively

easilyto

poly

gonren

der�

ing�

For

exam

ple�all

ofthestate

isalread

ymain

tained

inglob

alvariab

lesto

minim

ize

theoverh

eadof

passin

gvariab

les�so

thefunction

disp

atchdoes

not

need

tocon

cern

itselfwith

param

eters�andcan

beavery

lowoverh

eadoperation

�

Ofcou

rse�justperform

ingthepack

ingofinstru

ctionsinto

instru

ctionbudget

sized

��

blocksisdi

cult�

asevery

timetheprogram

isrecom

piled

carefulatten

tionmustbe

paid

that

theinstru

ctionbudget

has

not

been

exceed

edbyanyof

theproced

ures�

In

somecases

complete

utilization

oftheinstru

ctionbudget

isim

possib

le�For

exam

ple�

longdivision

requires

slightly

more

than

��instru

ctionsto

execu

teon

thePrin

ceton

Engin

e�makingitnearly

impossib

leto

execu

temore

than

asin

glelon

gdividewith

in

theinstru

ctionbudget�

sosection

sof

thegeom

etrycod

ewhich

execu

temultip

lelon

g

divides

areoften

ine

cientbecau

sethey

don�thave

enough

extra

work

topack

into

theleft�over

instru

ctions�

Unfortu

nately

thecom

piler

prov

ides

nodirect

support

for

these

codepack

ingoperation

s�andtheprocess

isoften

anard

uoustrial

anderror

e ort�

TheMagic��

ournextgen

erationhard

ware

describ

edin

x��elim

inates

the

line�locked

program

mingcon

straint�

��

Geometry

Thegeom

etrystage

prov

ides

themost

obviou

sparallelism

�as

allvisib

le�after

rejec�

tionandclip

ping�

poly

gonsreq

uire

exactly

thesam

eam

ountof

work

tocom

pute�

How

ever�to

extract

ecien

cywestill

have

tobreak

thepipelin

einto

twopieces�

First

wetran

sformall

thepoly

gonsandtest

forvisib

ility�gen

eratingalist

ofvisib

le

poly

gonson

eachprocessor�

andthen

wecom

plete

thegeom

etryprocessin

gon

just

thevisib

leset

ofpoly

gons�

Thistwostage

process

allowsusto

ecien

tlyhandle

a

largeset

ofpoly

gons�hopefu

llymanyof

which

arenot

visib

le�Thuswecan

perform

acheap

transform

andtest

operation

onall

poly

gons�anddefer

theexpensiv

ework

tilllater�

when

wewill

have

toperform

iton

fewer

poly

gons�

��

Representatio

n

Thepoly

gonrep

resentation

isshow

nin

Figu

re��

Allpoly

gonsare

triangles

for

simplicity�

Anall

integer

represen

tationhas

been

chosen

forits

greatere

ciency

than

�oatin

gpoin

t�

Each

vertexhas

associatedwith

ita�D

location��x�y

�z��anorm

al��n

x �ny �n

z �

��

VE

RT

EX

:(14 bytes)

mem

ber

Vertex 2

Vertex 0

Vertex 1

1 22222

bytes

Green

1

bytes

3 141414

(48 bytes)

1

Red

Blue

texture reference

1

PO

LY

GO

N:

1

(x,y,z)

(u,v)(x,y,z)

(x,y,z)

(u,v)00

(Nx,N

y,Nz)(N

x,Ny,N

z)2

r,g,b

(u,v)1

Nx

Ny

Nzuv

mem

ber

2

Z

1

1

22

Y X(N

x,Ny,N

z)0

Figu

re��

PolygonRepresentatio

n

andatex

ture

coordinate�

�u�v��

Thevertex

coordinates

are��b

itintegers�

The

vertexnorm

alsare

��bit

signed

�xed�poin

twith

��fraction

albits�

Thetex

ture

map

coordinates

are��b

itunsign

edintegers�

with

�fraction

albits�

forthefullran

ge

of��

intex

ture

coordinates�

Thedecision

was

madenot

tosupport

per

vertex

colorto

savespace

andsim

plify

theligh

tingmodel�

Thepoly

gon�in

addition

to�

vertices�inclu

des

acolor�

�r�g�b��

andareferen

ceto

thetex

ture

touse�

ifapplicab

le�

��

Transfo

rmatio

n

Thepoly

gondata

setismodeled

asasin

glerigid

object�

soasin

gletran

sformation

matrix

may

beused

forall

poly

gons�

Tran

sformation

matrices

aretran

smitted

tothePrin

cetonEngin

ebyacon

trol

application

runningon

arem

otework

station�There

isnodirect

meth

odto

inputdata

inreal

timeto

thePrin

cetonEngin

e�with

theobviou

sexcep

tionofvideo

inform

ation�

Aclever

reuse

ofaHiPPiregister

allowstran

sformation

matrices

tobesen

tto

the

engin

eat

framerates�

Thetran

sformation

matrix

isinputas

��integers�

specify

ingthe�elem

ents

of

��

therotation

matrix

with

��fraction

albits

andthetran

sformation

entries

assign

ed

integers�

Theuse

of��b

itintegers

forboth

thevertex

coordinates

andthetran

sformation

matrix

prov

ides

asign

i�can

tadvan

tage�Theworld

toeyesp

acetran

sformation

may

now

beperform

edwith

��bit

arithmetic

which

allowsthefast

�single

clockcycle�

single�p

recisiondata

path

sto

beused

throu

ghout�

Thetran

sformation

toscreen

�space

requires

�divides�

sx�

x�z

andsy�

y�z�As

alreadydiscu

ssed�division

isparticu

larlyexpensiv

eon

thismach

ine�so

wemake

the

optim

izationof

perform

ingdivision

bytab

lelook

up�Division

ofak�bitnumber

bya

k�bitnumber

may

alwaysbeperform

edbylook

upin

a�kentry

k�bittab

lewith

no

lossofprecision

�A��entry

lookuptab

leof��z

isprov

ided

oneach

processor�

Afull

��entries

arenot

necessary

asdivision

isonly

donebypositiv

ez�

While

expensiv

e

inmem

ory�thistab

lewill

prove

tobeinvalu

ablelater

forpersp

ectivecorrect

texture

mapping�

��

Visib

ility�Clipping

Visib

ilityisperform

edas

aset

ofscreen

�space

tests�Tobevisib

leapoly

gonmust

liein

frontof

thedisp

layplan

eandtheboundingbox

ofthepoly

gonmust

intersect

thevisib

learea

ofthedisp

lay�plan

e�Alloth

erpoly

gonsare

rejectedas

they

mustbe

non�visib

le�

Allpoly

gonsare

de�ned

with

vertices

incou

nterclock

wise

order

when

facingthe

view

er�admittin

gatriv

ialback

facingtest�

Theback

facingtest

isperform

edby

evaluatin

gonevertex

ofthetrian

glein

thelin

earequation

de�ned

bytheoth

ertwo

vertices�Ifthevertex

liesin

thenegativ

ehalf�p

lanethepoly

gonisback

�facingand

isculled

�

Clip

pingis

not

explicitly

perform

edin

this

implem

entation�as

ourparticu

lar

meth

odfor

rasterizationmakes

itunnecessary�

Therasterizer

never

exam

ines

o �

screenpixels�

sothedetails

ofthepoly

gonintersection

with

screenedges

areunim

�

portan

t�

��

Theoptim

izationdiscu

ssedearlier

ofgen

eratingavisib

lelist

ofpoly

gons�rst

andthen

completin

gthegeom

etryprocessin

gonly

forthevisib

lepoly

gonsis

not

implem

ented

�For

scenes

ofinterest

thegeom

etrystage

typically

consumes

lessthan

��of

thetotal

instru

ctionswith

outthisoptim

ization�

��

Lightin

g

Theim

plem

ented

lightin

gmodel

supports

three

lightsou

rces�an

ambien

t�adirec�

tional�

andapoin

tligh

tsou

rce�Allthree

lightsou

rcesare

constrain

edto

bewhite�

andthepoly

goniscon

strained

tobeasin

glecolor�

sothat

wecan

interp

olateasin

gle

quantity�

thewhite

lightinten

sityre�

ectedfrom

aface

totheobserver�

rather

than

separate

interp

olationsof

thered

�green

andblueligh

tre�

ectedto

theobserver�

Theam

bien

tligh

tcon

trolstheback

groundlevelof

illumination

inthescen

e�Con�

trolisprov

ided

over

theinten

sityof

theligh

t�

Thedirection

alligh

tacts

asaligh

tsou

rcein�nitely

faraw

ayfrom

thescen

e�so

allligh

tincid

entupon

thescen

ehas

thesam

edirection

vector�

Controls

areprov

ided

tomodify

thedirection

andinten

sityof

theligh

t�

Thepoin

tligh

tsou

rcemodels

aphysicalligh

tplaced

somew

here

inclose

prox

imity

tothescen

e�so

thedirection

ofincid

entray

sto

avertex

dependson

theposition

of

thevertex

relativeto

theligh

tsou

rce�Controls

areprov

ided

overtheposition

and

inten

sityof

theligh

t�

Separate

controls

areprov

ided

overthedi use

componentof

re�ection

andthe

specu

larcom

ponentof

re�ection

fortheob

ject�

Theuse

ofaligh

tingmodelwith

direction

alandpoin

tligh

tsou

rcesalso

requires

theability

tocom

pute

unitvectors

fromarb

itraryvectors�

Alook

uptab

leisused

to

perform

thereq

uired

inverse

square

root��

��

CoecientCalculatio

n

Calcu

lationofcoe

cientsisthemost

timecon

sumingfunction

ofthegeom

etrystage�

Theworld

toeyetran

sformation

�cullin

g�clip

pingandligh

tingare

allperform

edin

lessthan

��instru

ctions�

Thecalcu

lationof

thecoe

cientsof

thelin

earequation

s

toiterate

require

anoth

er��

instru

ctionsto

execu

te�

Figu

re��

show

stheform

oftheequation

sto

compute�

Theexpenseofthese

oper�

ationsresu

ltsfrom

thenecessity

ofperform

ingtwo��b

itdivides

foreach

param

eter

tointerp

olate�Note

that

thedivisor

�is

common

toallthecalcu

lations�depth�

inten

sity�tex

ture

coordinates��

asitonly

dependson

vertexscreen

coordinates�

and

not

theparam

eterbein

ginterp

olated�

Oneoptim

izationthat

would

prov

idealarge

perform

ance

improvem

entwould

be

carefulcalcu

lationof

theinverse

ofthecom

mon

denom

inator

�once

into

a��b

it

�xed

poin

tnumber�

Subseq

uentdivision

scou

ldbeperform

edas

amultip

lyanda

shift�

Unfortu

nately

thisreq

uires

a��b

itmultip

lywith

a��b

itresu

lt�an

option

not

supported

bythecom

piler�

anddeem

edto

troublesom

eto

implem

entin

lightof

therelativ

elysm

all�less

than

��fraction

oftim

espentexecu

tingthegeom

etry

code�Thepoly

gondescrip

torgen

eratedas

the�nalresu

ltofthegeom

etrycom

putation

s�

show

nin

Figu

re��

completely

describ

esthepoly

gonfor

rasterization�Com

munica�

tion�thenextstage

ofoperation

�will

transm

itthese

descrip

torsbetw

eenprocessors

forrasterization

�

��

Communicatio

n

Theim

plem

entation

uses

sort�middlecom

munication

�Allof

theoptim

izationanal�

ysis

forsort�m

iddlecom

munication

was

based

ontheassu

mption

oflocality

ofdes�

tination

foreach

poly

gonto

becom

municated

�This

directed

ourchoice

ofscreen

decom

position

forrasterization

�as

alreadysuggested

�to

becolu

mn�parallel�

Byas�

signingeach

processor

asin

glecolu

mnof

pixels

onthedisp

layto

rasterizeweleave

��

C

224

B

(84 bytes)

A

coefficientcoefficient

bytes

(8 bytes)

-or-

(12 bytes)L

INE

AR

EQ

UA

TIO

N:

bytes

ABC4 4 4

PO

LY

GO

N D

ES

CR

IPT

OR

:

bytes

81281212 8

texture v

linear equation

edge0

edge1

edge2

zdepth

intensity

texture u

8322222

bytes

3texture reference

r,g,b

left

right

support information

# passes

bottom

top

Figu

re��

PolygonDescrip

tor�

resultof

allof

thegeom

etrycom

putation

sfor

asin

glepoly

gon�

aquarter

oftheprocessors

idledurin

grasterization

�thedisp

layisonly

��colu

mns

wide�

butwegain

agreat

deal

oflocality

inpoly

gondestin

ation�Allprocessors

ras�

terizingapoly

gonare

guaran

teedto

becon

tiguous�so

ourprev

iousanaly

seswhich

reliedon

theassu

mption

ofasin

gledestin

ationprocessor

forcom

munication

are

approx

imately

correct�andouroptim

izationswill

prove

e ective�

Theuse

ofthe�distrib

ution

�sch

emefor

decreasin

gcom

munication

timewas

un�

fortunately

not

implem

ented

�Thecon

sideration

andanaly

sisof

thispossib

ilitydid

not

ariseuntil

latein

theim

plem

entation

�preven

tingits

inclu

sion�How

ever�analy

sis

show

sthat

ithappensto

map

tothelin

e�lockedprogram

mingparad

igmvery

e�

ciently�

Passin

gasin

glepoly

gonfrom

neigh

bor

toneigh

bor

wouldreq

uire

��d�

�a�b

operation

s�where

a��

�a��

integer

poly

gondescrip

tor��b��

�thetest

overhead

�

andd�

��optim

ally�for

atotal

of��

instru

ctions�

��instru

ctions�tcom

fort�

ably

with

inthebudget

of��

instru

ctions�andallow

stim

efor

aglob

al�ifto

detect

completion

ofthe�rst

pass

ofthedistrib

ution

algorithm�

Adescrip

tionof

themech

anics

forperform

ingneigh

bor�to�n

eighbor

communica�

tionof

thepoly

gondescrip

torsisnow

presen

ted�Neigh

bor�to�n

eighbor�N

forthe

distrib

ution

optim

izationisan

obviou

sexten

sion�sim

ply

usin

glarge

pass

sizes�

��

��

Passin

gaSinglePolygonDescrip

tor

Themost

essential

communication

operation

istheneigh

bor�to�n

eighbor

pass�

Ar�

bitrary

communication

canthen

beperform

edwith

repeated

neigh

bor�to�n

eighbor

passes�

Passin

gapoly

gondescrip

tor�rep

resented

as��

��bit

integers�

fromone

processor

toits

neigh

bor

isaloop

overthedescrip

tor�passin

ga��b

itpiece

ofitat

atim

e�

voidpass�int�sourceDescriptor�int�destDescriptor�

�

intcounter�

for�counter��counter��counter��

�

communicationRegister�sourceDatum counter��

passRight��orpassLeft��

destDatum counter��communicationRegister�

�

�

ThepassRight��or

passLeft��operation

isacom

piler

prim

itivethat

shifts

the

datu

min

eachprocessor�s

communication

registeroneprocessor

totherigh

t�or

left��

Each

processor

isperform

ingthesam

eaction

�so

eachprocessor

issim

ultan

eously

sourcin

gandreceiv

ingapoly

gondescrip

tor�Theloop

iscom

pletely

unrolled

and

coded

inassem

bly

language

fore

ciency�

��

Passin

gMultip

leDescrip

tors

Ingen

eraleach

processor

will

have

more

than

asin

gledescrip

torto

distrib

ute�

and

more

than

asin

gledescrip

torwillb

edestin

edfor

thisprocessor�

Itistheresp

onsib

ility

ofeach

processor

tostore

thisdata

asitarriv

es�

Each

processor

manages

asourcearray

ofdata

andareceivearray

ofdata�

The

sourcearray

contain

sall

ofthedescrip

torsthat

thisprocessor

will

communicate

to

theoth

erprocessors�

Thereceivearray

contain

s�at

theendof

thecom

munication

�

allof

thedescrip

torsthat

theoth

erprocessors

communicated

tothisprocessor�

��

voiddistribute�intn�DESCRIPTORsource ��DESCRIPTORreceive ��

�

DESCRIPTOR�s��r�

s�

source�

r�

receive�

while�globalAnd�n��

pass�s�r��

s�r�

if�

GOOD�r��r��

if��INTERESTING�s��n��

s��source�

�

�

Figu

re��

Implementatio

nforData

Distrib

utio

n

Thevariab

lespoin

tsto

thenextdescrip

torto

pass�

while

rpoin

tsto

thelocation

tostore

thenextdescrip

torreceiv

ed�sping�p

ongs

back

andforth

betw

eenthesou

rce

arrayandreceive

array�Initially

eachprocessors

sources

adatu

mfrom

itssource

array�andreceiv

eadatu

minto

itsreceivearray�

When

aprocessor

receivesa

descrip

toritim

mediately

poin

tssat

it�theassu

mption

bein

gthat

itwill

prob

ably

have

topass

thisdescrip

torfurth

eralon

g�as

itisunlikely

thisprocessor

isthelast

processor

that

need

sto

seethispoly

gon�Thisstore

andforw

ardoperation

insures

that

eachpoly

gondescrip

tortravels

aroundtherin

gfar

enough

tobeseen

byall

processors

rasterizingit�

Note

that

whatever

descrip

torswas

poin

tingat

itwas

just

passed

tothis

processors

neigh

bor�

which

isnow

responsib

lefor

it�Thusthe

autom

atics�roperation

forgetswhich

descrip

torthisprocessor

just

sourced

�butit

simultan

eously

becom

esanoth

erprocessors

responsib

ility�

Ifthedescrip

torreceiv

edisGOOD�a

descrip

torfor

apoly

gonthat

thisprocessor

will

bein

part

responsib

lefor

rasterizing�

theprocessor

increm

entsthereceive

poin

ter

rso

that

itdoesn

�toverw

riteitwith

thenextpoly

gonreceiv

ed�Otherw

iserisleft

alone�andthenextdescrip

torreceiv

edwilloverw

riteit�

Thiscan

cause�an

dgen

erally

��

does�

asitu

ationwhere

s��r�andtheprocessor

justcon

tinually

forward

spoly

gons

foroth

erprocessors�

Ifthedescrip

torreceiv

edbyaprocessor

onanygiven

cycle

isnot

INTERESTING

then

ithas

alreadybeen

communicated

toall

processors

that

need

tosee

it�The

processor

will

then

source

oneof

itsrem

ainingsourcepoly

gonsifithas

anyleft�

Iftheprocessor

has

nomore

descrip

torsto

contrib

ute

itwill

simply

pass

o this

�uninterestin

g�poly

gonto

thenextprocessor�

which

may

ormay

not

choose

to

replace

itwith

apoly

gonof

itsow

n�etc�

Thewhileloop

terminates

when

noprocessor

isstill

communicatin

gdata

ofin�

terest�Ifnooneiscom

municatin

gdata

ofinterest

then

thecom

munication

mustbe

complete�an

dthefunction

terminates�

Thishas

tobedoneviaaglob

altest

asitisn

�t

know

napriori

how

longitwill

taketo

communicate

agiv

ennumber

ofpoly

gons��

Thisim

plem

entationavoid

sas

manycop

yoperation

sas

possib

le�as

acop

ywill

require

aread

sandawrites

�aisthedescrip

torsize��

which

isalm

ostas

expensive

asthecom

munication

operation

itself�Thisisparticu

larlyim

portan

tfor

communi�

cation�as

itisgen

erallythecase

that

onagiv

encycle

only

somesm

allpercen

tageof

theprocessors

will

behandlin

gdata

they

need

acop

yof�

andtherest

will

bejustfor�

ward

ingdata

inten

ded

foroth

ers�Anycon

dition

aloperation

s�su

chas

acon

dition

al

copy�would

beparticu

larlyexpensive�

asonly

asm

allfraction

ofthemach

inewould

beutilized

�

When

communication

has

completed

eachprocessor

has

acop

yof

allthedescrip

�

torsfor

thepoly

gonsitwill

beresp

onsib

lefor

rasterizing�andthealgorith

mproceed

s

directly

torasterization

�

�Theactu

alim

plem

entation

optim

izesto

perform

this

testperio

dically

bymak

ingan

optim

isticgu

essof

how

manyiteration

sof

theloop

therem

ainingdescrip

torswill

taketo

communicate�

and

afterthatman

yitera

tionsanew

estimate

isform

ed�etc�

Theestim

ateswill

decrease

mon

otonically

�beca

use

they

are

based

ontherem

ainingnumber

ofpoly

gonseach

processor

has

tocom

municate�

which

isdecrea

singifprog

ressisbein

gmad

e�an

dthefreq

uency

ofcom

pleten

esscheck

swillth

ereforeincrea

seuntil

actualcom

pletion

�W

ithalittle

carethenumber

oftests

canbeminim

izedwith

out

perfo

rmingmore

communication

thannecessary��

F(x,y) =

Ax +

By +

CF

’(x,y) = B

y + C

’; C’ =

Asx

Figu

re��

LinearEquatio

nNorm

alizatio

n�lin

earequation

sare

norm

alizedto

thehorizon

talposition

oftheprocessor

inscreen

space�

��

Raste

rizatio

n

After

communication

eachprocessor

has

acop

yin

thereceivearray

ofall

poly

gon

descrip

torsthat

intersect

itsrasterization

region�A

column�parallel

decom

position

ofthescreen

has

been

implem

ented

�so

eachprocessor

isresp

onsib

lefor

rasterizinga

verticalslice

ofeach

ofthese

poly

gonsinto

itssin

glecolu

mnof

thefram

ebu�er�

Rasterization

isbroken

into

a�step

process

intheim

plem

entation

�

�Norm

alizelin

earequation

sto

processor

columnposition

�Rasterize

poly

gons

��

Norm

alizeLinearEquatio

ns

Theequation

sthat

resideon

every

processor

aspart

ofthepoly

gondescrip

torsare

oftheform

Fx�y��Ax By C�Wehave

implem

ented

acolu

mn�parallel

decom

�

position

ofthescreen

�so

eachprocessor

isresp

onsib

lefor

rasterizingasin

glecolu

mn

ofpixels�

thuseach

processor

only

need

sto

iteratethelin

earequation

svertically�

Thenorm

alizationC

��C Asxismadefor

eachequation

�where

sxisthehorizon

�

talposition

ofthisprocessor�s

columnon

thedisp

lay�These

norm

alizationsreq

uire

approx

imately

��instru

ctionsper

poly

gonandare

applied

tothereceivepoly

gon

arrayon

eachprocessor

immediately

afterthecom

pletion

ofthecom

munication

stage�

��

Now

simpler

equation

sof

theform

F�x

�y��By C

�can

beevalu

atedon

each

processor�

asshow

nin

Figu

re��

Ofcou

rsetheequation

sare

evaluated

iteratively�

sotheactu

alevalu

ationprocess

isFx�y��Fx�y

��

B

foreach

verticalstep

�

Rasterization

begin

sonce

alltheprocessors

have

norm

alizedtheir

linear

equation

s

forall

receivedpoly

gons�

��

Raste

rizePolygons

Poly

gonrasterization

�discu

ssedin

section��

isim

plem

entedbyiteratin

gaset

of

linear

equation

sover

theboundingbox

ofapoly

gon�In

thiscase

weonly

iteratethe

equation

sover

avertical

span

ofthepoly

gondueto

thecolu

mn�parallel

decom

posi�

tionof

thescreen

�Torasterize

asin

glepixelwecom

pare

thecurren

tlyinterp

olated

depth

forthepoly

gonto

thecorresp

ondingpixelin

thedepth�bu�er�

Ifthecurren

t

pixelocclu

des

thepixelin

thedepth�bu�er

wethen

shadeandtex

ture

map

thepixel

andplace

itin

thefram

ebu�er

andits

depth

valuein

thedepth�bu�er�

Thuswehave

anocclu

sionmodelthat

isaccu

rateon

apixel�b

y�pixelbasis�

There

areanumber

ofissu

esandoptim

izationswhich

were

tackled

intheim

�

plem

entation

�Thedi�

cult

implem

entation

issues

will

bediscu

ssed�rst�

andthe

optim

izationswill

bediscu

ssedat

theendof

thesection

�

Decouplin

gRaste

rizatio

nandPolygonSize

Themost

importan

tissu

ealth

ough

arguably

anoptim

ization�in

therasterizer

is

allowingprocessors

toin

dep

enden

tlyadvan

ceto

their

nextpoly

gonafter

completin

g

rasterizationof

their

curren

tpoly

gon�Thisdecou

ples

therasterization

process

from

sizedi�eren

cesin

theparticu

larpoly

gonsbein

grasterized

oneach

processor

atany

poin

tintim

e�With

outthisdecou

plin

geach

processor

wouldrasterize

its�rst

poly

gon�

wait

untilalloth

erprocessors

were

done�andthen

allprocessors

wouldsim

ultan

eously

proceed

totheir

secondpoly

gon�etc�

forcingall

processors

torasterize

their

poly

gon

intim

eprop

ortional

tothesize

oftheworst

casepoly

gon�

Thedecou

plin

gisach

ievedbyperform

ingaperiod

ictest

durin

gpixelrasterization

��

toallow

processors

tomove

onto

their

nextpoly

gon�Every

k�pixels

allprocessors

testifthey

have

�nish

edrasterizin

gtheir

curren

tpoly

gon�andthose

that

have

update

theequation

sthey

areiteratin

gwith

thecoe�

cientsof

their

nextpoly

gonto

raster�

ize�Thistest

andcon

dition

aladvan

cementisexpensiv

ebecau

seitreq

uires

copying

indirectly

addressed

data

into

registers�Furth

ermore�

thelin

e�lockedprogram

ming

parad

igmmakes

somechoices

ofhow

oftento

perform

thetest

thevalu

eofk�more

e�cien

tthan

other�

interm

sof

totalinstru

ctionsexecu

tedevery

durin

geach

scan�

line�

Theshaded

poly

gonrasterizer

perform

sonetest

andeigh

tpixelrasterization

s

k�

��per

scanlin

e�andthefulltex

ture

mappingrasterizer

perform

sonetest

and

fourpixelrasterization

sper

scanlin

e�

Thisoptim

izationissim

ilarto

those

implem

ented

byCrow

��andWhitm

an��

Crow

andWhitm

anboth

use

asim

ilartest

todecou

plethescan

lineof

thepoly

gon

eachprocessor

iswork

ingon�butstill

forcesynchron

izationat

thepoly

gonlevel�

Thissuggests

they

only

gaine�

ciency

where

poly

gonare

similar

inarea�

buthave

varyingasp

ectratios�

Ourapproach

isinsen

sitiveto

poly

gonasp

ectratios

becau

se

eachprocessor

only

rasterizesasin

glecolu

mnof

anypoly

gon�andto

poly

gonarea�

which

would

seemto

o�er

amore

substan

tialadvan

tage�

Persp

ectiv

e�Corre

ctTexture

Mapping

Ourmodelof

textured

poly

gonsassociates

atex

turecoord

inate

u�v�with

eachofthe

verticesof

thepoly

gon�andthetex

ture

coordinate

ofanypoin

twith

inthepoly

gon

may

befou

ndas

thelin

earcom

bination

ofthese

vertexvalu

es�How

ever�this

is

linear

interp

olationin

object

space�

butnot

inscreen

space�

Thegen

eralform

ofthe

interp

olationshould

be

u�

Ax By C

��

v�

Dx Ey F

��

��

inob

jectspace�

butweinterp

olateall

ourparam

etersin

screenspace�

Naive

screenspace

interp

olationof

quantities

nonlin

earin

screenspace

leadsto

distractin

gartifacts�

perh

apsmost

noticeab

lein

texture

mapping�

Theerrors

made

destroy

theforesh

orteninge�ect

associatedwith

objects

twisted

away

fromtheob�

server�

Ifwecarefu

llyperform

only

interp

olationsof

quantities

that

arelin

earin

screenspace

wecan

createapersp

ectiveco

rrectren

derer�

Exam

iningourpersp

ectivetran

sformation

�sx�x�z

andsy�y�z�

wesee

wecan

express

these

equation

sas

u�

zAsx Bsy C�

��

v�

zDsx Esy F�

��

solin

earinterp

olationofu�z

andv�z

canbecorrectly

perform

edin

screenspace�

andthevalu

eofuandvrecovered

ateach

pixelbymultip

licationwith

z�How

ever�

multip

licationbyzrein

troduces

theprob

lemof

screenspace

linearity�

Ifweexam

inetheequation

ofaplan

ein

�D�andthustheequation

ofdepth�

wesee

that

liketex

ture

coordinates�

depth

isnot

linear

inscreen

space�

thein

verse

depth

islin

earin

screenspace�

How

ever�wecan

easilyinterp

olateinverse

zinstead

ofz�

andstill

use

depth�bu�erin

g�Ourcom

positin

gtest

for��z

values

issim

plythe

complem

entof

thetest

forz�

Now

wecan

trivially

interp

olateu�z�v�z

and��z

foreach

poly

gon�andtex

ture

mappingwill

require

thecalcu

lation

u�

u�z

��z��

v�

v�z

��z��

ateach

pixel�

Wereu

setheinverse

ztab

leused

toperform

thepersp

ectivedivision

�

��

avoidingthevery

high

costofperform

ingtwodividesper

iterationoftherasterization

loop�Heck

bert

andMoreton

prov

ideacom

preh

ensive

treatmentof

theissu

eof

object�

linear

vs�

screen�lin

earquantities

in��

Bilin

earInterpolatio

n

Texturemappingismuddied

furth

erbeyon

dpersp

ectivecorrection

bythenecessity

of

perform

ingsam

plin

g�Asim

pletex

ture

mappingalgorith

mperform

spoin

t�samplin

g

ofthetex

ture

image�

takingthecolor

ofthepoly

gonpixelas

thetex

elclosest

tothe

curren

tinterp

olatedu�v�coord

inate�

Unfortu

nately

small

motion

sof

objects

can

induce

jitteringin

thesam

plin

gof

thetex

elsandcreate

unpleasan

tartifacts

under

poin

tsam

plin

g�Bilin

earinterp

olationhelp

sto

reduce

these

e�ects

bycom

putin

ga

pixelcolor

astheweigh

tedaverage

ofthe�tex

elsnearest

tothetex

ture

coordinate�

Onmanyarch

itectures

thisprov

ides

asubstan

tialperform

ance

penalty�

asitre�

quires

perform

ing�tex

elfetch

esfor

everytex

turemapped

pixel�

How

everthePrin

ce�

tonEngin

ehas

enorm

ousaggregate

mem

orybandwidth�andin

factthere

isacop

y

ofall

ofthetex

tures

onevery

processor�

sotheaddition

altex

elfetch

esare

atriv

ial

expense�

��

Raste

rizatio

nOptim

izatio

ns

Distin

ctTexture�MappedandNon�Texture�MappedAlgorith

ms

Theinclu

sionof

texture

mappingaddssign

i�can

tlyto

thecost

ofrasterization

�A

texture

mapped

pixel

requires

approx

imately

twice

aslon

gto

rasterizeas

apixel

with

outtex

ture

mapping�

Manyscen

esof

interest

have

notex

ture

mappingat

all�so

acon

trolisprov

ided

intheinterface

todisab

letex

ture

mappingwhich

results

inthe

use

ofan

optim

izedshading�on

lyrasterizer�

Early

Abort

��

(a)(b)

erroraliasing

for this slice

rasterizationcom

plete

Figu

re��

Raste

rizing�trian

glesoccu

pyless

than

half

ofthepixel

coveredby

their

boundingboxes�

Inboth

casessign

i�can

tadvan

tagecan

bemadeof

notin

gwhen

rasterizationexits

thepoly

gon�Case

b�addsthecom

plex

itythat

acareless

rasterizerwill

stepover

thepoly

gondueto

integer

pixel

coordinates

andrasterize

forever�

waitin

gto

enter

thepoly

gon�

Observe

that

when

rasterizingpoly

gons�

show

nin

Figu

re��

that

atlea

stone

half

ofthearea

intheboundingbox

ofatrian

gleis

notwith

inthetrian

gle�It

would

beadvan

tageousto

avoidexam

iningall

ofthese

pixels

need

lessly�Weinclu

de

asim

pletest

which

onaverage

avoidsrasterizin

g��

ofthese

�exterior�

pixels�

As

rasterizationproceed

s�a�ag

isset

when

thepoly

gonisentered

�Becau

setrian

gles

arenecessarily

convex

�as

soonas

oneof

theedge

equation

stest

negative

andthe�ag

isset�

therasterizer

know

sthat

ithas

completely

rendered

itspiece

ofthispoly

gon

andrasterization

ofthenextpoly

goncan

begin

�

��

Summary

This

sectionhas

describ

edthemajor

implem

entation

issues

andoptim

izationsin

therasterizer�

Particu

laratten

tionhas

been

paid

tothedecou

plin

gof

thepoly

gon

bein

gprocessed

fromrasterization

�In

addition

persp

ective�correct

texture

mapping�

requirin

gan

expensiv

etwodivides

per

pixel�

has

been

e�cien

tlyim

plem

ented

with

a

tablelook

up�

Theuse

ofsep

araterasterization

algorithmswith

andwith

outtex

ture

mapping

support

prov

ides

asubstan

tialperform

ance

gainfor

scenes

with

outtex

turin

g�An

��

earlyabort

mech

anism

allowsrasterization

todetect

completion

ofapoly

gonbefore

theentire

vertical

exten

tof

theboundingbox

has

been

exam

ined�andprov

ides

a

signi�can

tred

uction

inthetotal

pixels

exam

ined

torasterize

agiv

enpoly

gon�

Thusfar

thegeom

etry�com

munication

andrasterization

stageshave

been

de�

scribed�Thenextsection

will

discu

ssthe�nal

issue�

theactu

alcon

trolof

theren

�

derer�

��

Control

Avirtu

altrack

ball

isused

tointeract

with

renderin

gsoftw

are�Awork

stationisused

todisp

laya�con

trolcube��

show

nin

Figu

re��

which

represen

tstheorien

tation

andtran

slationof

theob

jectren

dered

bythePrin

cetonEngin

e�Thecubemay

be

grabbed

with

themouse

androtated

andtran

slatedarb

itrarily�Theuser

may

also

setthecubespinningaboutan

arbitrary

axis

ofrotation

�Quatern

ions�

describ

ed

in��

areused

tointerp

olatebetw

eensuccessive

position

sof

therotatin

gcon

trol

cube�

TheGLmodelv

iewmatrix

��corresp

ondingto

theorien

tationof

thecubeis

transm

ittedto

thePrin

cetonEngin

econ

tinually

via

anauxiliary

data

channel�

The

renderer

may

beoperatin

gat

lessthan

thematrix

update

rate�in

which

casesom

eof

theupdates

areign

ored�so

while

theupdate

rateoftheengin

edisp

laymay

not

match

thevirtu

altrack

ball�

therate

ofrotation

andtran

slationof

theren

dered

object

does�

Clever

reuseofacom

munication

registerused

tosupport

aHiPPi�interface

allows

asin

gleinteger

tobeinputto

allprocessors

perscan

line�

Wecan

input��

��

integers

asecon

dto

theengin

ethrou

ghthischannel�

Alow

data

rate�butmore

than

adequate

forsim

pleupdates�

such

asthetran

sformation

matrix

�Thebandwidth

into

theengin

ethrou

ghthis�au

xiliary

�channelisobviou

slysubstan

tiallylarger

than

that

required

forsim

plematrix

updates�

andallow

sthefuture

possib

ilityof

sendingextra

inform

ationthrou

ghthischannel�

such

astheposition

sandprop

ertiesofligh

tsou

rces

�Unfortu

nately

use

oftheactu

alHiPPichannel

requires

reprogrammingtheIO

circuitry

ofthe

engineandcan�tbeused

concurren

tlywith

video

output�

��

Figu

re��

Virtu

alTrackballInterfa

ce

with

intheworld

�or

simulation

data�

Afullhandshakingprotocol

isused

betw

eenthework

stationandtheengin

e�

which

insures

perfect

synchron

ization�Thechannel

isalso

robust

andread

ilyhan�

dles

drop

ped�scram

bled

anddelayed

data�

sothat

theoccasion

alerror

isgracefu

lly

handled

�A

serialinterface

betw

eenthework

stationrunningthevirtu

altrack

ball

software

andthePrin

cetonEngin

econ

trollerisused

toavoid

control

di�

culties

due

tovariation

sin

ethern

ettra�

c�

��

Summary

Thischapter

has

discu

ssedtheim

plem

entation

ofacolu

mn�parallelp

olygon

renderer

onthePrin

cetonEngin

e�Thepoly

gonren

derer

supports

depth

bu�erin

g�ligh

ting�

shadingandtex

ture

mappingof

triangular

prim

itives�

Anexplan

ationof

thegeom

etrystage

has

been

prov

ided�inclu

dingthedetails

of

therep

resentation

ofthepoly

gondatab

aseandtheform

ofthecalcu

lationsperform

ed�

Theinclu

sionofaperprocessor

inverse

ztab

lesupports

e�cien

tpersp

ectivedivision

s�

andisreu

sedlater

tosupport

e�cien

tpersp

ective�correcttex

ture

mapping�

Thecom

munication

stagehas

been

carefully

describ

ed�as

itdom

inates

theexe�

cution

timeof

thealgorith

m�andits

e�cien

cybears

heav

ilyon

theoverall

e�cien

cy

��

oftheprogram

�Thedetails

ofqueuein

gpoly

gondescrip

torsfor

transm

issionand

receiptisexplain

ed�andacarefu

lexplan

ationof

theactu

alprogram

med

managed

communication

algorithm

isgiven

�

Sign

i�can

tissu

esandoptim

izationof

therasterization

stage�secon

donly

tocom

�

munication

intotal

execu

tiontim

e�have

been

exam

ined�Decou

plin

gof

thepoly

gon

sizedependence

inrasterization

incon

junction

with

anearly

abort

mech

anism

a�ord

s

ane�

cientim

plem

entation

�Theuse

ofan

inverse

zlook

uptab

leprov

ides

visu

ally

pleasin

gpersp

ectivecorrect

texture

mappingcheap

ly�

Anintuitive

virtu

altrack

ball

interface

prov

ides

real�timeuser

interaction

and

control

oftheren

dered

scene�

Chapter

�will

prov

ideperform

ance

results

ofthisim

plem

entationandacom

par�

isonwith

theperform

ance

ofaPrin

cetonEngin

econ

temporary�

theSilicon

Grap

hics

GTX�

��

��

Chapter�

Results

Thepoly

gonren

derer

has

achiev

edpeak

perform

ance

ofover

��visib

le�shaded

andbilin

earlyinterp

olatedtex

turemapped

poly

gonsper

second�There

areanumber

ofperpoly

gonandperpixelp

erformance

�gures

that

combineto

yield

the�nalaggre�

gateperform

ance

ofthesystem

�Thischapter

will

�rst

prov

idetheraw

perform

ance

ofeach

operation

�andthen

anaccou

ntin

gof

how

these

instru

ctionsare

spentper

visib

lepoly

gonren

dered

�Finally

acom

parison

will

bemadebetw

eenthePrin

ceton

Engin

eperform

ance

andthat

ofits

contem

porary�

theSilicon

Grap

hics

GTX�

��

Raw

Perfo

rmance

Theraw

perform

ance

ofeach

oftheindividualstages

oftheim

plem

entationisread

ily

quanti�

ed�Unfortu

nately

thelin

e�lockedprogram

mingparad

igmofthethePrin

ceton

Engin

eoften

obscu

resthetru

eperform

ance

ofthealgorith

m�Where

feasible�gures

areprov

ided

both

fortheactu

alim

plem

entationandfor

ahypoth

eticalPrin

ceton

Engin

elack

ingthelin

e�lockedcon

straint�

Thelin

e�lockedprogram

mingcon

straintmakes

itmost

natu

ralto

express

oper�

ationsas

thenumber

ofoperation

sach

ieved

per

linetim

e�where

alin

etim

eis��

instru

ctions�andinclu

des

thefram

ebuer

managem

entcod

eandanyunused

cycles�

��

NTSCvideo

operates

at��

frames

per

second�and� �

lines

per

frame�for

atotal

of

��lin

esper

second�

��

Geometry

Tran

sformingto

screencoord

inates�

lightin

g�andcom

putin

glin

earequation

coe��

cientsisvery

fast�

� �procs�

�

polygon

proc�lines��

lines

second��shadedpolygons

second

��

with

texture

mapping�w

hich

requires

thecalcu

lationof

linear

equation

sfor

the

addition

alparam

etersuandv��

� �procs�

polygon

proc�lines��

lines

second��texturedpolygons

second

��

��

Communicatio

n

Characterizin

gthecom

munication

perform

ance

isdi�

cult�as

thedistrib

ution

ofpoly

�

gonsandaverage

distan

cethey

have

totravel

willa

ectthee�

ciency

oftheshift

ring�

Ausefu

lbenchmark

isthenumberofpoly

gonpasses

�proc

n�

proc

n��

persecon

d�

� �procs��

passes

proc�line��

lines

second��polygonpasses

second

��

Ifeach

poly

gonreq

uires

approx

imately

� passes

�travelshalfw

ayarou

ndthe

ringto

reachits

destin

ation�then

thepoly

gonscom

municated

per

secondis

�NTSC

video�asde�ned

bytheSMPTEdraftsta

ndard

actu

ally

opera

tesat��eld

s

per

second�orapproxim

ately

��fra

mes

per

second

��

��

��passes

second�

�

polygon

passes

��polygons

second

��

��

Raste

rizatio

n

Perform

ance

remain

slargely

communication

limited

�which

isindependentofpoly

gon

size��Thismakestheusual��p

ixelpoly

gon�aless

usefu

lbenchmark

�asitstresses

therasterizin

gperform

ance

excessiv

elyover

thecom

munication

perform

ance�

More

relevantispixels

per

second�as

thisyield

sinsigh

tto

theaverage

depth

complex

ity

that

canbehandled

�Anaggregate

rateof

almost

��million

bilin

earlyinterp

olated�

Gourau

dshaded

texturemapped

pixels

isattain

ed�With

outtex

turemapping�alm

ost

��million

pixels

canberen

dered

per

second�A

secondof

NTSC

video

is��

�

� ��

�

��

pixels�

orless

than

oneeigh

thof

thepeak

pixel�llrate

ofthis

implem

entation

�

Ifwecon

sider

texture�m

apped

�bilin

earlyinterp

olated��Gourau

d�sh

aded

pixels

per

second�wecan

characterize

theresu

ltingperform

ance

as��

��procs��

pixels

line�proc��

lines

second��pixels

second

��

With

outthebilin

earinterp

olationnecessary

fortex

ture

mapping�

thepixel

�ll

ratedoubles�

��procs��

pixels

line�proc��

lines

second��pixels

second

��

With

outthelin

e�lockedprogram

mingparad

igm�andign

oringtheoverh

eadto

�Thisistru

eto

�rst

order

Ofcourse

larger

polygonswill

haveto

stayin

therin

glonger

befo

re

every

onehasseen

them

�butgenera

llythesize

ofapolygonis

smallcompared

tothedista

nce

it

must

travel

��processo

rsare

speci�

ed�asonly

theren

derin

gofon�screen

processo

rsis

used

inthis

algorith

m

�

change

activepoly

gons�

therasterizer

requires

� instru

ctionsper

shaded

texture

mapped

pixel�

foran

aggregate�llrate

with

afully

distrib

uted

framebuer

�all� �

processors�

of�

� �procs��

�inst

sec�proc�

�

pixel

inst

��pixels

second

��

Shaded

pixels

only

require

��instru

ctionsper

pixel�

foran

aggregate�llrate

of�

� �procs��

�inst

sec�proc�

��

pixel

inst

��pixels

second

��

Thefullpixelren

derin

grate

isneverrealized

becau

seweperform

aboundingbox

scanofthetrian

gular

prim

itives�which

guaran

teeswewill

test�an

dnot

render�

some

non�prim

itivepixels�

Wealso

perform

aperiod

ictest

�not

afterevery

pixel�

tocon

�

dition

allyadvan

ceto

thenextpoly

gon�which

while

improv

ingoverall

perform

ance�

alsoincreases

thenumber

ofpixels

wewill

exam

inefor

eachprim

itivenecessarily�

If

weapprox

imate

theloss

ine�

ciency

dueto

thepixels

exam

ined

outsid

eof

thepoly

�

gonandthegran

ularity

intherasterization

algorithm�therasterizer

achieves

��

parallel

e�cien

cy�Assu

minguniform

loadbalan

cing�

theren

derer

canrasterize�

��

��pixels

second��

�

��

polygon

pixels

��pixelpolygons

second

��

with

outtex

ture

mapping�

��

��pixels

second��

�

��

polygon

pixels

��pixelpolygons

second

��

��

Aggregate

Perfo

rmance�

percen

tof

timein

eachphase

object

frames�s��

poly

s�s��

geometry

communicate

rasterizebeeth

oven��

��

��

��

��

��

� ��

��

crocodile

��

��

��

��

��

��

��

�teap

ot��

��

� ��

��

��

��

��

��

Table��

Renderin

gPerfo

rmanceforTypicalScenes�

Rendered

scenes

always

countan

integral

number

offram

etim

esto

compute�

asfram

ebuers

canonly

be

swapped

atfram

eboundaries�

Any�ex

cess�tim

ein

thefram

espentidlin

giscou

nted

asren

der

time�cau

singan

unfairly

high

estimate

ofren

der

timeversu

scom

putation

anddistrib

ution

�

Theactu

alperform

ance

oftheren

derer

ismeasu

redbyren

derin

gthescen

esshow

n

in�gures

��

and

��Typical

results

aregiven

inTable��

Multip

leentries

forthesam

eob

jectare

atdieren

torien

tations�

ascom

munication

andren

derin

g

costwill

dependon

orientation

�Figu

re��

show

sthebreak

dow

nof

renderin

gtim

e

betw

eengeom

etry�com

munication

andrasterization

�

��

Accountin

g

ThePrin

cetonEngin

eprov

ides

anaggregate

�BIPsof

computin

gperform

ance�

At

��poly

gonsasecon

dthisim

plies

approx

imately

��instru

ctionsperren

dered

poly

gon�Thiscost

seemsexcessiv

e�butwecan

accountfor

where

alltheinstru

ctions

have

gone�

Exam

iningthescen

ewith

ourpeak

perform

ance�

thetex

ture

mapped

crocodile�

wetally

theinstru

ctionusage�

Ourscen

ehas

��poly

gons��

ofwhich

arevisib

leat

thegiv

enorien

ta�

tion�Thescen

eisren

dered

at�fp

s�for

anaggregate

perform

ance

of� ��

poly

gons

per

second�Wewill

express

times

foroperation

interm

sof

scanlin

es�becau

seall

function

alityisim

plem

ented

inscan

linesized

proced

ure

block

s�

��

0 10 20 30 40 50 60 70

beethovencrocodile

teapot

percent

geometry

comm

unicationrasterization

Figu

re��

Executio

nPro�le�thecom

munication

timefor

most

scenes

isthe

dom

inantperform

ance

factor�Asthenumber

ofpoly

gonsincrease

�andtheir

sizedecreases�

communication

timebecom

eseven

more

dom

inant�

Thegeom

etrystage

isnot

optim

izedto

perform

atwo�stage

pipelin

e�so

allpoly

�

gonshave

tobefully

processed

�Atex

ture�m

apped

poly

gonreq

uires

scan

lines

to

compute�

foratotal

of��inst

line�

lines

polygon��

��inst

polygon

��

there

areatotal

of��

poly

gons�for

��

polygons

frame

��

inst

polygon��frames

second� ��

��

inst

second

��

We�ll

countthecom

munication

stageabit

dieren

tly�A

single

pass

consumes

MIPsacross

allof

theprocessors

andwecan

perform

�passes

inascan

line�for

��inst

line�proc�

�

lines

pass��

procs��

��inst

pass

��

��

Each

poly

gonhas

amargin

alpass

cost�total

number

ofpasses

divided

bythe

number

ofpoly

gons�

of��

foratotal

of

��

inst

pass��

��pass

polygon�

�� inst

polygon

��

Thecom

munication

stagethusreq

uires

��

polygons

frame

��

inst

polygon��frames

second��

��

inst

second

��

Bythetim

ewereach

therasterization

stagewehave

hitalarge

loadim

balan

ce�

dueto

theasy

mmetric

distrib

ution

ofscen

econ

tent�

Ifweexam

ineFigu

re��

and

note

thehistogram

wesee

that

thepeak

processor

has

��poly

gonsintersectin

g

itscolu

mn�Theaverage

poly

gonheigh

tis��

pixels

�obtain

edfrom

thesim

ulator�

andourtex

ture

mappingcod

eforces

poly

gonsto

rasterizein

�pixelchunks�so

each

poly

gonwillh

avean

eectiv

eaverage

heigh

tof�pixels�

poly

gonscan

benorm

alized

inascan

line�and�pixels

canberasterized

inascan

line�so

theinstru

ctionsreq

uired

toprocess

asin

glepoly

gonare

��inst

line��

normalize

polygon

�

line

normalize��

pixels

polygon�

�

line

pixel��

inst

polygon

��

andourworst�case

processor

has

��poly

gonslices

torasterize�

foratotal

of

��polygons

proc

��inst

polygon��

procs��frames

second��

��

inst

second

��

Ourtotal

instru

ctionsfor

renderin

gthisscen

eare

thus

��

�instru

ctionsper

second�which

veryclosely

match

esthe�B

IPcap

ability

ofthePrin

cetonEngin

e�

Theextra

BIPs�a

non�triv

ialnumber

ofinstru

ctions�isaccou

nted

forin

anumber

ofplaces�

There

areanumberofproced

ures

that

arecalled

asin

gletim

e�forexam

ple�

��

toplace

thehistogram

ontheim

age�or

toinitialize

thedatab

asefor

renderin

g�In

addition

anycycles

betw

eentheendrasterization

andthestart

ofthenextfram

e

�when

thefram

ebuers

canbeswapped�are

missin

gfrom

thisanaly

sis�They

have

been

counted

bytheren

derer

asrasterization

instru

ctionsin

Figu

re��

��

Perfo

rmanceCompariso

n

Theperform

ance

ofPrin

cetonEngin

epoly

gonren

derer

iscom

pared

tothat

ofthat

ofits

contem

porary�

theSilicon

Grap

hics

GTX�describ

edin

� ��

Thecom

parison

isabitstilted

ofcou

rse�TheGTX

represen

tsthestate

ofthe

artin

special

purpose

poly

gonren

derin

ghard

ware

for��

Likew

ise�thePrin

ceton

Engin

e�alth

ough

never

soldin

volu

me�

isamillion

dollar

mach

ine�

Noneth

eless

they

arecom

parab

leon

anumber

oflev

els�Both

mach

ines

exhibit

ahigh

level

ofparallelism

�andare

implem

entedin

thetech

nology

era�so

clockspeed

smay

be

expected

tobesim

ilar�etc�

Thegeom

etrystage

oftheGTX

iscom

posed

of�high

perform

ance

�geometry

engin

es�con

nected

inapipelin

e�with

eachprocessor

handlin

gaspeci�

casp

ectof

thegeom

etrycalcu

lations�

Togeth

erthey

prov

idean

aggregateperform

ance

of��

million

�oatin

gpoin

toperation

sper

second�M

�ops��

Bycom

parison

thePrin

ceton

Engin

ecan

perform

a�oatin

gpoin

tmultip

lyanddividein

asin

glelin

etim

eon

each

processor�

foran

aggregateperform

ance

ofapprox

imately

� M�ops�

ThePrin

ceton

Engin

ewas

not

design

edto

perform

�oatin

gpoin

toperation

s�andcon

sequently

pays

aprem

ium

fortheir

use�

Bydirect

comparison

both

ofthese

system

sare

implem

entin

gageom

etrypipelin

e�

ThePrin

cetonEngin

eim

plem

entation

sacri�ces

dynam

icran

geanduses

anall

in�

tegerrep

resentation

forits

prim

itives

andcan

conseq

uently

compute

approx

imately

million

triangular

prim

itives

per

second�TheGTX

perform

s�w

ithsubstan

tially

fewer

processors

ofcou

rse��

triangular

prim

itives

per

second�

Itishard

toquantify

theanalog

ofthePrin

cetonEngin

ecom

munication

stage

��

ontheGTX�TheGTX

has

adedicated

high

speed

buswhich

transports

prim

itives

betw

eenthegeom

etrypipelin

eandtheir

�nal

destin

ationat

theim

ageengin

es�The

buswas

speci�

callydesign

edto

acceptprim

itivesat

thefullrate

thegeom

etrystage

generates

them

�andwill

thusnever

beaperform

ance

limitin

gfactor�

Therasterization

stagereq

uires

someguessin

gto

determ

inetheperform

ance

of

theindividual

image

engin

esof

theGTX�They

arecited

ashavingan

aggregate

perform

ance

of��

million

depth�buered

pixels

per

second�which

isless

than

half

of

ourpixel�

llrate�Assu

mingeach

image

engin

emustperform

ontheord

erof �

integer

operation

sper

depth�buered

pixeltheaggregate

perform

ance

oftheim

ageengin

es

is��M

IPs�or

��BIPs�sligh

tlyless

than

��of

thePrin

cetonEngin

eperform

ance�

Ofcou

rsethePrin

cetonEngin

espendsapprox

imately

�instru

ctionsper

pixel�

so

they

aren�tdirectly

comparab

le�

Curiou

sly�theGTX

has

realtimevideo

inputandoutputcap

ability�

makingita

particu

larlyaptcom

parison

tothePrin

cetonEngin

e�Alth

ough

not

part

ofthepoly

�

gonren

derin

gprob

lem�theperform

ance

ofthePrin

cetonEngin

edoubtless

exceed

s

theGTXfor

video

relatedoperation

�

Som

efeatu

reshave

nocom

parison

�Most

notab

ly�thePrin

cetonEngin

eren

derer

perform

sfullb

ilinear�in

terpolated

texturemapping�which

theGTXoers

nosupport

for�In

addition

thePrin

cetonEngin

eisasoftw

aresolu

tion�andwill

admitmany

exten

sionsthat

aresim

ply

impossib

lein

ahard

ware

solution

�

The�nalandmost

tellingcom

parison

isofcou

rsetheaggregate

perform

ance�

The

GTX

canren

der

��con

nected

�shared

vertex

�trian

glesper

second�com

pared

tothePrin

cetonEngin

eim

plem

entation

�which

achieves

apeak

of� ��

texture

mapped

triangles

persecon

d�Given

theluxury

ofcostless

communication

�asprov

ided

intheGTX�thePrin

cetonEngin

ewould

immediately

double

itsperform

ance

and

becom

edirectly

comparab

leto

theGTX�

In��

thetim

eofthePrin

cetonEngin

e�s�rst

operation

�itwould

have

prov

ided

poly

gonren

derin

gdirectly

comparab

leto

high

endgrap

hics

work

stationsof

theera�

More

signi�can

tlyit

oers

anunparalleled

�exibility

ofoperation

throu

ghan

all

��

software

implem

entation

�

��

Chapter�

FutureDirectio

ns

Dueto

timeandresou

rcecon

straints

acou

ple

ofideas

inthisthesis

have

remain

ed

unim

plem

ented

�Future

work

will

address

these

issues

when

possib

le�

First

andforem

ostistheim

plem

entation

ofthe�distrib

ution

�optim

ization�ex�

pected

todouble

thesort�m

iddle

communication

optim

izationof

thecurren

tim

�

plem

entation

�Theoth

erissu

eto

beaddressed

isan

implem

entationof

sort�twice

communication

�Both

ofthese

issues

must

alsobeexam

ined

inligh

tof

thenext

generation

Prin

cetonEngin

e�just

becom

ingoperation

alat

thetim

eof

thiswritin

g�

which

will

o�er

greatlyim

prov

edcap

abilities�

Theim

plem

entation

ofthe�distrib

ution

�algorith

mis

expected

todouble

the

perform

ance

ofthealgorith

m�obtain

ingat

least��

poly

gonsper

secondon

the

Prin

cetonEngin

e�Its

implem

entation

should

consist

chie

yoftheaddition

ofanoth

er

proced

ureto

theren

derer

andsom

ecarefu

lmanagem

entofthecom

munication

bu�ers�

��

Distr

ibutio

nOptim

izatio

n

Thedistrib

ution

optim

izationfor

sort�middlecom

munication

prom

isesafactor

of�

improvem

entin

communication

time�

Thedistrib

ution

optim

izationperform

scom

�

munication

intwostages�

�rst

sendingpoly

gonsclose

totheir

�nal

destin

ationwith

anumber

oflarge

steps�andthen

placin

gpoly

gonson

their

�nal

destin

ationproces�

� �

sorswith

neigh

bor�to�n

eighbor

passes�

Distrib

ution

gainsperform

ance

overaregu

lar

sort�middleim

plem

entationbecau

setheoverh

eadfor

communicatin

gapoly

gonisper

pass�

soalarge

pass

iscom

parativ

elycheap

erthan

aneigh

bor�to�n

eighbor

pass�

Thedistrib

ution

optim

izationyield

safactor

of�speed

upin

communication

in

simulation

�Exam

iningourpeak

perform

ance

scene�thecrocod

ileat

�� poly

gons

asecon

d�wesee

that

��of

thetim

eisspentin

communication

�Ifthat

timecou

ld

behalved

anim

mediate

��im

provem

entin

perform

ance

would

beseen

�brin

ging

theperform

ance

ofthesystem

to��

poly

gonsper

second�

Asdiscu

ssedin

x�� thedistrib

ution

optim

izationmapsvery

neatly

tothelin

e�

lockedprogram

mingparad

igmof

thePrin

cetonEngin

eandim

plem

entation

should

bereason

ably

straightforw

ard�

��

Sort�Twice

Sort�tw

iceo�ers

themost

prom

isingpoten

tialperform

ance

improvem

entseen

sofar�

Ourinitial

analy

sissuggests

that

sort�twice

implem

ented

onoursecon

dgen

eration

hard

ware�

discu

ssedin

x��would

operate

more

than

tim

esfaster

than

sort�twice

onthePrin

cetonEngin

e�thanksto

atrip

lingof

theclock

rate�andawider

commu�

nication

bu�er�

Anim

plem

entationof

sort�twice

renderin

gwould

prov

ideafurth

erexploration

ofthis

design

space

andinterestin

gfeed

back

onthespeci�

ccosts

ofthesort�last

compositin

gstep

�Hopefu

llyitcou

ldinuence

thedesign

offuturemach

inesto

inclu

de

lowcost

support

hard

ware

tofurth

erstream

lineits

operation

�

��

NextGeneratio

nHardware

ThePrin

cetonEngin

eis

aneigh

tyear

oldarch

itecture

atthis

poin

t�Curren

tlya

spin�ou

tcom

panyfrom

theDavid

Sarn

o�Research

Center�

theSarn

o�Real

Tim

e

Corp

oration�is

develop

ingacom

mercial

version

oftheengin

ecalled

the�M

agic�

��

��TheMagic�

although

based

ontheorigin

alPrin

cetonEngin

e�has

signi�can

t

enhancem

ents�

andrep

resents

aninterm

ediate

stepbetw

eenthePrin

cetonEngin

e

andthe�Sarn

o�Engin

e��as

describ

edin

��

Theim

provem

entsinclu

deatrip

ledclock

rate�more

mem

ory�disk

arraycap

abil�

ity��b

itinterp

rocessorcom

munication

�elim

ination

of�lin

e�locked�program

ming

model�

hard

ware

support

forbilin

earandtrilin

earinterp

olation�support

forhigh

bandwidth

video

andnon

video

data

input�andacom

puted

outputtim

ingseq

uence�

Thefaster

clockrate

will

directly

yield

a��

perform

ance

improvem

entper

processor�

Theincreased

mem

oryprov

ides

thepossib

ilityof

verylarge

poly

gonal

data

sets�and

largetex

ture

maps�

Acom

puted

�outputtim

ingseq

uence�

enables

eachprocessor

to

independently

determ

inewhen

itspixelisoutput�thiscom

bined

with

auniquefeed

�

back

capability

that

feedstheoutputof

theengin

einto

theinputallow

sexploration

ofe�

cientsort�last

compositin

galgorith

ms�

Beyon

dthebasic

clockrate

improvem

ents�thedoublin

gof

thewidth

ofthecom

�

munication

registerwill

prov

idesubstan

tialperform

ance

gains�as

pass

operation

swill

pass

twice

asmuch

data�

Theoverh

eadwill

bethesam

eto

write

andread

theregister

�stillonly

a��b

itcore�

andthetest

operation

swill

presu

mably

bethesam

e�but

distan

tpasses

will

require

fewer

cycles

totran

smitthesam

eam

ountof

data�

This

will

have

asubstan

tialim

pact

ontheperform

ance

ofasort�m

iddleim

plem

entation

with

the�distrib

ution

�optim

ization�

More

signi�can

tly�it

will

greatlycheap

enthecost

ofasort�tw

iceapproach

�If

wetake

thewrite�read

andtest

overhead

asnegligib

le�dueto

thesize

ofthepasses

employed

�then

compositin

ghas

becom

etw

iceas

cheap

�Coupled

with

theclock

�

rateim

provem

ents�

composition

will

only

require

onesix

thof

thetim

eas

thecurren

t

architectu

re�asizab

leim

prov

ement�w

hich

brin

gstheattain

ableperform

ance

into

the

�Hzupdate

rateregim

e�

Thenew

support

fornon

video

inputdata

durin

goperation

will

prov

ide�M

B�s

ofinputbandwidth

which

may

bedirected

toanyarb

itrarysubset

oftheprocessors

simultan

eously�

Thiswill

allowtheren

der

tooperate

inamore

conven

ientim

mediate

��

modesty

leof

renderin

gandto

support

datab

asesof

poly

gonsandtex

tures

toolarge

to�tin

main

mem

ory�

Magic

isfully

operation

alat

thispoin

t�andportin

gthepoly

gonren

derin

gto

it

will

beamatter

ofobtain

ingadequate

timeon

themach

inegiven

other

conictin

g

dem

ands�

��

Chapter��

Conclusio

ns

Thisthesis

has

addressed

twoparts

ofthebroad

topicof

parallel

poly

gonren

derin

g�

Acarefu

lanaly

sisofthesort�m

iddlecom

munication

prob

lemhas

dem

onstrated

its

lackof

scalability�

How

ever�particu

laratten

tionto

optim

izationshas

dem

onstrated

thee�ectiven

essof

acarefu

llytuned

andcon

trolledcom

munication

algorithm

forob�

tainingsubstan

tialperform

ance

improvem

entsover

typicalcom

munication

structu

res�

Theuse

ofatwo�p

asscom

munication

algorithm

reduces

thehigh

costof

perform

ing

manyneigh

bor�to�n

eighbor

communication

operation

sandach

ievesnetw

orkutiliza�

tionmuch

closerto

maxim

um

throu

ghput�

Thesearch

forscalab

leandcost�e�

ectcom

munication

strategyled

tothedevel�

opmentof

anovel

new

communication

strategyentitled

sort�twice

communication

�

Sort�tw

icecom

munication

marries

thee�

ciency

ofsort�m

iddle

forsm

allnumbers

ofboth

poly

gonsandprocessors

with

thecon

stanttim

eperform

ance

toobtain

an

algorithm

with

thescalab

ilityof

dense

sort�lastcom

munication

while

amortizin

gthe

high

costof

neigh

bor�to�n

eighbor

communication

overgrou

psof

processors�

Cycle

bycycle

control

ofthecom

munication

channel

createsavery

e�cien

tpipelin

eof

communication

operation

s�TheVLIW

architectu

reof

thePrin

cetonEngin

eislever�

agedto

support

afully

deferred

shadingandtex

ture

mappingmodel�

with

substan

tial

increases

intheperform

ance

oftherasterizer�

Anim

plem

entation

ofa�D

renderin

gengin

eon

adistrib

uted

framebu�er

SIM

D

��

architectu

rethat

achieves

�llrates

of�

million

pixels

asecon

dandover

��

texture�m

apped�lit

andshaded

poly

gonsasecon

dwas

presen

ted�

Theren

derer

implem

entation

addressed

several

issues�

�Com

munication

ofpoly

gondescrip

torsin

arin

gtop

ology

�Rasterization

inaprocessor

per

columnorgan

ization

�Flex

ibleinteraction

with

areal�tim

eren

derin

genviron

menton

aslave

mach

ine

Truereal�tim

eren

derin

gperform

ance

has

been

dem

onstrated

onaSIM

Dproces�

sorarray�

with

results

that

scalewell

with

increasin

gnumbers

ofprocessors�

Poly

gons

areof

relatively

arbitrary

size�andbenchmarked

astheclassic

��pixel

poly

gon��

Thealgorith

msare

readily

modi�ed

andadapted

toexperim

entation

�

ThePrin

cetonEngin

eprov

ides

auniqueexible

architectu

refor

real�timein�

teractiveren

derin

g�Thenextgen

erationof

thePrin

cetonEngin

e�theMagic��

is

expected

toprov

ideren

derin

gperform

ance

ofover

�million

poly

gonsper

secondin

asort�tw

iceim

plem

entation

�

��

Appendix

A

Sim

ulator

Adetailed

andaccu

ratesim

ulator

was

develop

edto

experim

entwith

algorithmchanges

andoptim

izationswith

outhavingto

rewrite

theim

plem

entation

�Thesim

ulator

pro�

vides

theexibility

andease

need

edto

rapidly

trymanyoption

sandcom

bination

s

offeatu

reswith

littlee�ort�

Thesim

ulator

alsoprov

ides

theability

tosim

ulate

ma�

chines

unavailab

lefor

actual

use�

such

asmach

ines

with

more

processors�

orournext

generation

hard

ware�

Thisthesis

has

presen

tedalarge

numberofcom

munication

algorithmsandoption

s

tobeanaly

zed�

Itis

impractical

torew

ritetheim

plem

entation

toanaly

zeeach

optim

ization�for

anumber

ofreason

s�

�Writin

gcod

efor

thePrin

cetonEngin

eisintricate

andtim

econ

suming�

�Thelin

e�lockedprogram

mingcon

straintoften

obscu

resthetru

eperform

ance

of

analgorith

mbymakingitdi�

cultto

use

allavailab

lecycles

forcom

putation

�

�Itisdi�

cultto

establish

correctness

oftheim

plem

entation�Thelack

ofdebug�

gingtools

makedata

distrib

ution

errorsdi�

cultto

detect

andanaly

ze�

Fortu

nately

thePrin

cetonEngin

elen

dsitself

very

well

tosim

ulation

�ThePrin

ce�

tonEngin

eisaSIM

Darch

itecture�

sobyobserv

ingtheworst

caseexecu

tionpath

ofa

program

weobserve

thetotal

execu

tiontim

ereq

uired

�Interp

rocessorcom

munication

��

iscom

pletely

controlled

bytheuser

program

�allow

inganycom

munication

algorithm

tobestu

died

atthecycle

levelwith

complete

accuracy�

With

inthesim

ulator

attention

ispaid

almost

exclu

sively

tothedetails

ofthe

communication

stage�butonly

becau

seithas

proven

themost

di�

cultstage

toana�

lytically

model�

particu

larlywhen

theinteraction

sof

more

than

asin

gleoptim

ization

mustbecon

sidered

�Thegeom


stages�while

requirin

gsign

i�can

t

amounts

ofcom

putation

intheim

plem

entation

�are

trivially

simulated

byobserv

ing

that

theworst

caseprocessor

�most

poly

gonsto

compute�

most

pixels

toren

der�

will

determ

inetheperform

ance

ofeach

stage�

A��

Model

Thesim

ulator

modelis

signi�can

tlyabstracted

fromthePrin

cetonEngin

e�Itassu

mes

somearb

itrarynumber

ofprocessors

connected

inaneigh

bor�to�n

eighbor

ring�

The

structu

reof

thesim

ulated

algorithm

isbased

largelyon

implem

entationexperien

ce�

andisorgan

izedto

support

themaxim

umam

ountof

exibility

inthecom

munication

stage�

A��

InputFile

s

Theinputto

thesim

ulator

consists

oftwopieces�

a�bounding�b

ox��lethat

describ

es

thepoly

gondatab

aseas

aset

ofpoly

gonboundingboxesinscreen

space�

anda�screen

map�that

speci�

esthedecom

position

ofthescreen

forrasterization

�

Thebounding�b

ox�le

isgen

eratedbyprep

rocessingapoly

gondatab

aseas

ob�

served

fromaspeci�

cview

poin

t�Every

poly

gonin

theorigin

aldatab

aseisspeci�

ed

inthebounding�b

ox�le�

inclu

dingpoly

gonsthat

areculled

asoutof

theview

ing

frustu

mor

back

facing�

Each

bounding�b

oxistagged

with

a�or

a�indicatin

gvis�

ible�cu

lled�Theinclu

sionof

culled

poly

gonsin

thebounding�b

ox�le

help

smodel

thegeom

etrystage�

durin

gwhich

somefraction

ofthepoly

gonscom

puted

will

be

back

facingand�or

outof

theview

ingfru

stum�It

isim

portan

tto

modelthese

culled

��

NumPolygons��

��

��

��

��

��

��

��

Figu

reA��

Bounding�boxFile�Each

poly

gonisspeci�

edas

avisib

le�nonvisib

letag

andarectan

gular

bounding�b

ox�

poly

gonsbecau

se�

�Culled

poly

gonshave

acom

putation

alcost

associatedwith

rejectingthem

�

�It

may

becheap

erto

compute

andreject

apoly

gonthan

avisib

lepoly

gon�

which

will

a�ect

thetotal

instru

ctionsexecu

tedin

thegeom

etrystage�

�Poly

goncullin

gyield

sasy

mmetries

inthedistrib

ution

ofpoly

gonsacross

pro�

cessors�which

may

a�ect

theexecu

tiontim

eof

both

thegeom

etryandcom

mu�

nication

stages�

Theuse

ofabounding�b

ox�le

allowsthedetails

oftheactu

algeom

etrycalcu

�

lationsto

berem

ovedfrom

consid

eration�insurin

gthat

wemain

tainastab

leinput

data

setfor

di�eren

tsim

ulation

s�Italso

removes

theburden

ofinsurin

gthat

thege�

ometry

pipelin

eiscorrectly

implem

ented

inthesim

ulator

inaddition

totheren

derer�

Figu

reA��

show

sasam

plebounding�b

oxinput�le�

Thescreen

map

allowscom

pletely

arbitrary

mapsof

pixels

toprocessors

forras�

terizationto

bespeci�

ed�A

defau

ltcolu

mn�parallel

decom

position

ofthescreen

is

assumed�butgen

eralrectan

gular

regionsmay

bespeci�

ed�inclu

dingnon�con

tiguous

regions�allow

ingtheanaly

sisof

staticload

balan

cingtech

niques

such

asthose

sug�

gestedin

��Theonly

constrain

tson

thescreen

map

isthat

every

pixelisrasterized

byprecisely

oneprocessor

��Asam

plescreen

map

isshow

inFigu

reA��

�Thesort�

twice

case�

forwhich

multip

leprocesso

rsrasterize

thesamereg

ion

ofthescreen

�is

��

��

��

��

��

��

��

��

��

Figu

reA��

Sample

ScreenMap�a processor

system

�with

eachprocessor

as�sign

edaquadran

tof

a��

by��

pixel

screen�Theprocessor

isgiven

inthe�rst

column�follow

edbytherectan

gular

regionit

rasterizes�Processors

may

belisted

multip

letim

esin

the�lefor

non�con

tiguousrasterization

regions�

Given

thepoly

gonboundingboxes

inthebounding�b

ox�le

andthepixel

to

processor

map

inthescreen

map

itcan

bedeterm

ined

precisely

what

processors

will

rasterizeeach

poly

gon�Anygiven

poly

gonisrasterized

bytheunion

oftheprocessors

whose

pixels

itoverlap

s�

A��

Parameters

Thesim

ulator

isfully

param

eterized�Allof

theparam

etersof

interest

may

bespeci�

�ed

viacom

mandlin

eoption

s�andmore

drastic

changes

�such

asin

thewidth

ofthe

interp

rocessorcom

munication

bus�

etc��may

betriv

iallymadethrou

ghrecom

pila�

tion�Theparam

etersof

interest

areshow

nin

TableA��

Most

oftheparam

etersare

fairlyobviou

s�Theparticu

larparam

etersof

interest

forspecify

ingthecom

munication

structu

reare

m�t�

d��andg�

which

enable

the

simulation

oftheduplication

�dense�

distrib

ution

andsort�tw

iceoptim

izations�

Data

Duplicatio

nThedata

duplication

pattern

isspeci�

edbym�which

isthe

number

oftim

esthepoly

gondatab

aseisduplicated


Itis

speci�

edas

ageom

etryparam

eterbecau

seitisfully

resolved

bythegeom

etry

stageof

thesim

ulator�

andnot

exposed

tothecom

munication

stage�

Dense

Communicatio

nTheperiod

betw

eentests�

t�speci�

eshow

oftenthesim

u�

latedprocessors

will

testthepoly

gonthey

justreceiv

edto

seeifitiscom

pletely

discu

ssedlater�

��

param

eterdefau

ltmean

ing

n��

number

ofprocessors

polyfile

�none�

bounding�b

ox�le

mapf

ilecolu

mn

pixelto

processor

map

parallel

geometry

Gi

��instru

ctionsto

transform

andkeep

�rejectpoly

gonG

f��

instru

ctionsto

lightandcom

pute

linear

equation

sfor

apoly

gonm

�duplication

factorcom

municationt

�period

betw

eentests

d�

arrayspecify

ingseq

uence

ofpass

sizesto

beused

gn

number

ofprocessors

per

group

rasterizationR

��instru

ctionsto

rasterizeapixel

Table

A��

SimulatorParameters

communicated

andcan

berep

lacedwith

their

nextpoly

gonto

communicate�

Dense

communication

t�

� has

been

determ

ined

toalw

aysbeausefu

lopti�

mization

andis

used

bydefau

lt�For

simulation

ofthesparse

communication

case discu

ssedin

x�� t�

n�

Distrib

utio

nThedistrib

ution

pattern

isspeci�

edin

das

aseq

uence

ofpass

sizes

tobeused

durin

gcom

munication

�Theoptim

aldistrib

ution

strategy as

deter�

mined

experim

entally

would

berep

resented

asd�

��

Sort�

TwiceThegrou

psize

gis

used

forsort�tw

icesim

ulation

sandspeci�

esthe

number

ofprocessors

responsib

lefor

rasterizationof

onecom

plete

copyof

the

screen�

Allof

these

param

etersinteract

aswould

beexpected

�Thusaduplication

fac�

torm

��andadistrib

ution

pattern

canbecom

bined

inthesam

esim

ulation

for

exam

ple�

Notab

lyabsen

tissupport

fortem

poral

localityoptim

izations�

Tem

poral

locality

��

requires

asubstan

tialexten

sionof

thesim

ulator�

Acom

mon

benchmark

fortem

poral

locality used

in��

forexam

ple

isto

render

ascen

e sligh

tlychange

theview

poin

t

andren

der

thescen

eagain

�Thesm

allchange

intheview

poin

tforces

somecom

mu�

nication

tobeperform

edso

that

thee�

ciency

oftem

poral

localityin

communication

canbemod

eled�How

ever theuse

ofabounding�b

ox�le

tocollap

sethepoly

gon

inform

ationdestroy

sthevery

inform

ationnecessary

tospecify

arbitrary

view

poin

ts�

Sim

ulation

oftem

poral

localityload

imbalan

cesin

x��exposed

severeenough

inef�

�cien

ciesthat

theapproach

was

abandoned

atthat

poin

t elim

inatin

gtheneed

for

simulator

support�

Inaddition

tothese

param

eters there

areanumber

ofinstru

mentation

knobsthat

may

beturned

onfor

anygiven

simulation

tocollect

extra

data�

Ofparticu

laruse

istheoption

togen

erateahistogram

ofthenumber

ofpoly

gonson

eachprocessor

before

andafter

eachphase

ofcom

munication

�This

exposes

both

geometry

load

imbalan

ces as

may

becreated

bym

��

andrasterization

loadim

balan

cesas

may

becreated

byanon�uniform

distrib

ution

ofpoly

gonsover

thedisp

lay�

A��

Operatio

n

Thehigh

�leveldiagram

inFigu

reA��

schem

aticallyrep

resents

theexecu

tionof

the

simulator�

Alm

ostall

oftheop

erationis

wrap

ped

upin

thedetails

oftheas

yet

unspeci�

edcom

munication

simulation

�In

addition

there

issom

eprep

rocessingto

loadthepoly

gondatab

aseinto

theprocessors

andsom

epost

processin

gto

countthe

expected

instru

ctionsfor

rasterization�Thefollow

ingsection

sdescrib

etheop

eration

ofthestages

ofthesim

ulator

indetail�

A��

Data

Initia

lizatio

n�

Geometry

Data

initialization

determ

ines

themappingof

poly

gonsto

processors

forgeom

etry

computation

s andthemappingof

pixels

toprocessors

forrasterization

�Theform

er

isspeci�

edbysim

ply

placin

gpoly

gonson

processors

while

thelatter

isdeterm

ined

��

clip/reject

geometry

transform

rasterizationproc 3

proc 2proc 1

proc 0

algorithms &

optimizations

comm

unication

polygon database

proc 2

screen map

proc 0


light

proc 1proc 2

proc 3geom

etry

23

proc 0proc 1

proc 3

10

Figu

reA��

SimulatorModel�

a�processor

poly

gonren

derer

with

a�poly

gondatab

ase�

��

bycalcu

latinga�destin

ationvector�

foreach

poly

gonwhich

tagseach

processor

responsib

lefor

rasterizingapart

ofthepoly

gon�

Thebounding�b

ox�leonly

de�nes

thetotal

setof

poly

gonsin

thesim

ulation

and

leavesunspeci�

edthemappingbetw

eenpoly

gonsandprocessors

forthegeom

etry

calculation

s�Thesim

ulator

allowstw

odistrib

ution

pattern

s�

Norm

alThepoly

gonsare

assigned

totheprocessors

inrou

ndrob

infash

ion�Num�

berin

gthepoly

gonsseq

uentially

asthey

areread

fromthebounding�b

ox�le

poly

gonbwill

lieon

processor

i�

mod�b�n

�on

annprocessor

system

�

Duplicated

Aduplication

factorm

��speci�

esthat

eachpoly

gonshould

bedu�

plicated

mtim

esacross

theprocessor

array�A

poly

gonbisinstan

tiatedon

all

processors

i�

mod�b�n

�m��km�

Thenorm

aldistrib

ution

isjust

adegen

eratecase

�m�

��of

theduplicated

distrib

ution

pattern

�

Thepossib

ilityofaran

dom

assignmentofpoly

gonsto

processors

was

not

inclu

ded

asthiscan

besim

ulated

byshu�ingthebounding�b

oxinput�lebefore

itisread

by

thesim

ulator

andexperim

ental

results

show

that

anycorrelation

inthedata

setis

alreadybrok

enupbytherou

ndrob

inassign

mentof

poly

gonsto

theprocessors�

Thecom

bination

ofthepoly

gonboundingboxes

fromthebounding�b

ox�leand

thepixel

toprocessor

map

fromthescreen

map

prov

ides

allof

thenecessary

in�

formation

forperform

ingcom

munication

�Each

poly

gonis

describ

edin

part

byan

n�bitdestin

ationvector

where

abitiisset

ifandonly

ifthepoly

gonboundingbox

intersects

theset

ofpixels

rasterizedbyprocessor

i�Figu

reA��

show

sthedestin

a�

tionvectors

associatedwith

eachpoly

gonas

ashort

bitvector

nextto

thepoly

gon�

Note

that

multip

lebits

areset

ifmore

than

asin

gleprocessor

will

beresp

onsib

lefor

rasterizationof

thepoly

gon�

Theuse

ofabounding�b

ox�lewhich

speci�

estherectan

gular

exten

tof

thepoly

�

gonsguaran

teesthat

wewill

never

incorrectly

assumeapoly

gondoesn

�toverlap

a

��

proc 2proc 3

proc 0proc 1

Figu

reA��

Destin

atio

nFalse

Positiv

e�thepoly

gonis

incorrectly

assumed

tointersect

therasterization

regionof

processor

��

processor�s

pixels

how

ever in

somecases

wemay

incorrectly

assumecoverage�

Con�

sider

thepoly

gonandscreen

map

show

nin

Figu

reA��

processor

�will

attemptto

rasterizeapiece

ofthispoly

gon even

though

itdoesn

�tactu

allyoverlap

processor

��s

region�It

isassu

med

that

these

occasional

falsepositiv

esincuralow

ercost

than

a

precise

test�Given

thecost

ofsort�m

iddlecom

munication

itmay

make

sense

toper�

formatest

toelim

inate

falsepositiv

es how

ever itisunnecessary

foroursim

ulation

s

ofinterest�

For

both

ourim

plem

entation

andmost

ofoursim

ulation

s�barrin

gsort�

twice

approach

es� acolu

mn�parallel

decom

position

ofthescreen

isused

in

which

casetheboundingbox

ofthepoly

gonwill

only

overlapaprocessor�s

columnof

the

screenifthere

will

bepixels

torasterize

sothere

isnoexcess

communication

done

dueto

falsepositives�

The�rst

stageof

geometry

computation

will

process

aworst

caseof

dp�m�n

e

poly

gonson

aprocessor

andreq

uires

Giinstru

ctionsper

poly

gonto

perform

�In

general

eachprocessor

will

beresp

onsib

lefor

communicatin

g andthusperform

ing

thesecon

dstage

ofgeom

etrycom

putation

all

visib

lepoly

gonsassign

edto

it�The

data

duplication

optim

ization d

iscussed

inx��

introd

uces

asligh

ttw

ist as

thesam

e

poly

gonwill

exist

onm

processors�

how

ever only

oneprocessor

should

communicate

thepoly

gon in

particu

lartheprocessor

which

canperform

thecom

munication

most

cheap

ly�Each

processor

will

source

apoly

gononly

ifitistherigh

tmost

processor

left

��

ofthe�rst

processor

rasterizingthispoly

gon��

After

thepoly

gondatab

aseis

loaded

onto

theprocessors

theligh

tingmod

elis

applied

tothevisib

lepoly

gonson

eachprocessor

andlin

earequation

coe�cien

tsare

calculated

�Thisisa�pseu

do�step

�in

thesim

ulator

which

only

deals

with

poly

gon

boundingboxes

andisinclu

ded

tomod

eltheactu

alalgorith

m�Itreq

uires

max

i �piv ��

Gfinstru

ctionsto

execu

te where

pivisthenumber

ofvisib

lepoly

gonson

processor

i�

Note

that

atthispoin

tweonly

have

asin

glecop

yof

eachpoly

gonread

yto

distrib

ute

regardless

ofthechoice

ofm�Theduplication

factorm

ishidden

fromtherem

ainder

ofthesim

ulator

ine�ect

��

A��

Communicatio

n

Com

munication

isthemost

challen

gingpart

ofthepoly

gonren

derin

galgorith

ms

tosim

ulate�

Alloth

erasp

ectsof

thesim

ulation

may

beperform

edat

agross

level

bysim

ply

countin

gpoly

gons�pixels�

andmultip

lyingbythenumber

ofinstru

ctions

required

per

poly

gon�pixel��

Com

munication

how

everreq

uires

adetailed

understan

d�

ingof

what

happensin

theinterp

rocessorcom

munication

channel

onapass

bypass

level�Com

munication

ismod

eledas

occurrin

gon

a�D

ringof

processors�

Each

proces�

sorisassign

eda�slot�

intherin

gwhich

itcan

place

apoly

gondescrip

torinto

and

readapoly

gondescrip

torfrom

�Therin

gis

represen

tedbyan

arrayof

poin

tersto

poly

gondescrip

tors�Each

poly

gondescrip

torrep

resentsaminim

alsetof

inform

ation

areferen

cenumber

forthepoly

gon�for

laterverify

ingthecorrectn

essof

thecom

mu�

nication

�andadestin

ationvector

which

speci�

estheprocessors

that

must

receive

acop

yof

thispoly

gonfor

thisstage

ofcom

munication

�Figu

reA��

show

stherin

g�

�Thisiswhatwemeanby�clo

sest��theprocesso

rthatwill

haveto

makethefew

estpasses

befo

re

thepolygonhasatlea

ststa

rtedto

arriv

eatits

destin

atio

n�This

proves

tobeausefu

lcost

metric

forcolumn�parallel

deco

mpositio

nsofthescreen

�where

alldestin

atio

nprocesso

rsare

contig

uous�

Amore

sophistica

tedcommunica

tioncost

measure

must

beused

fordisjo

intdeco

mpositio

nsofthe

screen�

�Ofcourse

duplica

tionmayintro

duce

largeasymmetries

inthevalues

ofpiv �

butthatwill

a�ect

only

thesim

ulated

execu

tioncost�

��

proc 3proc 2

proc 1proc 0

proc 3proc 0

proc 1proc 2

Figu

reA��

SimulatorCommunicatio

n�theinterp

rocessorcom

munication

isab�

stractedas

passin

gan

entire

poly

gondescrip

torneigh

bor�to�n

eighbor

atonce�

Processor

�poin

tsto

a�null�

which

indicates

that

there

isnopoly

gondescrip

tor

curren

tlyassociated

with

that

locationin

therin

g�

Each

processor

has

anarray

ofpoly

gondescrip

torsto

source

andan

arrayof

poly

gonsreceived

fromcom

munication

analogou

sto

thesourceandreceivearray

s

intheim

plem

entation

�Initially

therin

gis

empty

andtheprocessors

allhave

their

source

arraysfullandtheir

receivearray

sem

pty

exactly

liketheactu

alim

plem

en�

tation�Sim

ilarly at

theendof

communication

eachprocessor

will

have

anem

pty

source

arrayandareceive

arraywith

allpoly

gonsitwill

have

torasterize�

Com

munication

operation

sare

perform

edby�rotatin

g�therin

g�Allof

thepoin

t�

ersare

shifted

oneplace

totherigh

tin

thearray

with

thepoin

terthat

fallso�

the

endshifted

back

tothemiddle�

Larger

shifts

�forthedistrib

ution

optim

ization�are

implem

ented

byperform

ingalarger

rotationof

therin

g�Thecom

munication

isn�t

actually

broken

dow

nto

the�nest

granularity

ofsim

ulatin

geach

individual

commu�

nication

operation

necessary

topass

apoly

gondescrip

torfrom

processor

toprocessor�

Thecost

mod

elfor

apass

isprecisely

thei�

��d�a

�bmod

elused

intheanaly

tical

analy

ses with

a��

andb�

��

After

eachpass

operation

eachprocessor

teststhedestin

ationvector

ofthepoly

gon

descrip

torin

itsslot

oftherin

g andifthepoly

gonisdestin

edfor

itnotes

thereferen

ce

number

inthereceiv

earray

andclears

itsbit

inthevector�

Apoly

gonhas

been

communication

toall

interested

processors

when

nobitin

destin

ationvector

isset�

Thedense

distrib

ution

andsort�tw

icecom

munication

optim

izationsare

allex�

��

posed

with

inthecom

munication

simulation

andare

discu

ssedin

turn

below

�

Dense

Communicatio

n

Thesparse

vs�

dense

communication

optim

izationsare

implem

entedviathetvariab

le

which

speci�

eshow

oftentheprocessors

testthepoly

gondescrip

torin

their

�slot�

tosee

ifeveryon

ehas

seenit�

For

simplicity�s

sakeeach

processor

will

autom

atically

replace

anullin

itsslot

oftherin

gwith

oneof

thepoly

gonsfrom

itssou

rcearray�

The

period

icitytof

thetest

determ

ineshow

oftenthedestin

ationvector

ofeach

descrip

tor

intherin

gistested

andturned

into

anullifall

zero�

Distrib

utio

n

Thedistrib

ution

optim

izationis

relatively

trickyto

implem

ent�It

isperform

edby

execu

tingthecom

munication

algorithm

anumber

oftim

es andbetw

eeneach

execu

�

tionthereceiv

earray

foreach

processor

isused

toreb

uild

thesou

rcearray�

Durin

g

allstages

ofcom

munication

excep

tthelast

theshifts

that

areperform

edare

ofsize

greaterthan

� so

thenotion

ofthedestin

ationvector

becom

esabitmuddied

�Poly

�

gonscan

nolon

gerbearb

itrarilyrou

ted butcan

only

besen

tto

processors

i�

s�kd

where

sistheprocessor

thispoly

gonissou

rcedfrom

anddisthesize

ofthepass�

To

achieve

thisdistrib

ution

pattern

thedestin

ationvectors

arebuilt

fortheactu

alpoly

�

gondestin

ationsandthen

allof

thedestin

ationsare

adjusted

toalign

with

the�rst

validdestin

ationbefore

thedesired

destin

ationprocessor

asshow

nin

Figu

reA��a�

Inoursim

ple

exam

ple

d�

��andprocessor

�issou

rcingapoly

gondestin

ed

forprocessors

� �and��

The�rst

stageof

distrib

ution

results

inboth

processor

�andprocessor

�havingacop

yof

thepoly

gon�Thesecon

dpass

ofcom

munication

will

beof

size�andwill

route

thepoly

gonto

its�nal

destin

ations

processors

� �

and��

Ifwearen

�tcarefu

lboth

processor

�and�will

tryto

route

thepoly

gonto

all

ofthese

processors

which

intheworst

caseresu

ltsin

these

processors

rasterizingthe

poly

gontw

iceandalso

wastes

communication

bandwidth�Theobservation

ismade

that

afterperform

ingacom

munication

stagewith

passes

ofsize

deach

processor

can

��

128

40

(b)

(a)

processor

smooshed/m

asked destinationsvalid destinations

source processor

Figu

reA��

Destin

atio

nVectorAlignment�

a��

processor

system

usin

gadistri�

bution

pattern

of��Thedestin

ationvectors

arefor

apoly

gonthat

isdistrib

uted

fromprocessor

�to

�nal

destin

ationsof

processors

� �and��

�a�isthedestin

ationvector

used

forthe�rst

pass

ofcom

munication

and�b�are

thedestin

ationvectors

forprocessors

�and�to

perform

thesecon

dand�nal

pass

ofcom

munication

�

onlypass

anypoly

gonsithas

tosou

rceover

d��processors

toavoid

duplication

�Thus

processor

�will

only

pass

poly

gonsas

faras

processor

� etc�

Infact

wecan

always

thinkof

communication

asstartin

gwith

apass

ofsize

n thenumber

ofprocessors

afterwhich

eachpoly

gonmay

only

bepassed

overat

most

n��processors

before

it

will

beseen

twice

bysom

eprocessor�

Figu

reA��b

show

sthedestin

ationvectors

used

byprocessors

�and�for

thesecon

dstage

ofcom

munication

�

Sort�

Twice

The�nal

optim

izationto

deal

with

issort�tw

icecom

munication

�Theplacem

entof

processors

into

groupsof

sizegintrod

uces

asubtle

complex

ityinto

thecom

munication

structu

rebymakingprocessors

equivalen

tat

somelev

el�Ifapoly

gonoverlap

sregion

ritcan

berasterized

byanyprocessor

r�

kg

sothecom

munication

pattern

isno

longer

determ

inistic�

Inreality

ofcou

rseit

isdeterm

inistic

becau

sewewanteach

poly

gonto

traveltheminim

um

amountof

distan

cepossib

le which

mean

sits

only

destin

ationis

the�rst

processor

r�

kgafter

theprocessor

itstarts

on�How

ever

��

04

expand

mask

processor

valid destinationssource processor

128

destinations

Figu

reA��

Destin

atio

nVectorforg��

��thedestin

ationprocessor

setisinferred

fromthe�rst

gprocessors

andthen

mask

edso

that

thepoly

gonisreceived

atmost

oneof

theprocessors

which

canrasterize

theregion

�s�itoverlap

s�

thepixelto

processor

map

will

only

map

pixels

tothe�rst

gprocessors

becau

sethe

screenis

divided

into

gregion

s andonly

asin

gleprocessor

may

beresp

onsib

lefor

anypixel�

Thuswhen

thedestin

ationvector

isbuilt

itonly

speci�

esdestin

ationsin

the�rst

gprocessors�

Figu

reA��

show

sasam

ple

destin

ationvector

forsort�tw

ice

communication

�Thedestin

ationvector

isexpanded

toset

orclear

allof

thebits

based

onthe�rst

gprocessor

bits

tore�

ectall

thevalid

rasterizationprocessors�

Thevector

isthen

mask

edso

that

thepoly

gonisshifted

amaxim

um

ofg��tim

es

toavoid

unnecessary

communication

�After

this

more

complex

destin

ationvector

constru

ction thecom

munication

may

proceed

asit

norm

allywould alth

ough

of

course

theextra

instru

ctionsreq

uired

toperform

thesort�last

compositin

gmust

beaccou

nted

forlater�

Thesort�last

composition

isadata

independentop

eration

andthenumber

ofinstru

ctionsreq

uired

isjust

calculated

based

onthegrou

psize

andthenumber

ofprocessors�

Thecon

tribution

is�instru

ctionsfor

g�

n�van

illa

sort�middlecom

munication

�as

would

beexpected

�

Thissection

has

describ

edall

ofthemajor

features

ofthecom

munication

simula�

tionandsom

eof

thecare

that

isnecessary

intheir

implem

entation

�Itisworth

notin

g

that

complex

ityin

thesim

ulation

will

likely

alsobere�

ectedin

anyim

plem

entation

andanyaddition

alinsigh

tthesim

ulator

prov

ides

inim

plem

entation

canbequite

��

valuable�

A��

Rasterization

Therasterization

stageissim

ulated

inafash

ionsim

ilarto

thegeom

etrystage�

Itis

possib

lewith

reasonably

high

accuracy

toguess

how

longitwill

taketo

rasterizea

setof

poly

gons�

For

eachprocessor

i�riiscom

puted

�which

isthesum

ofthenumber

ofpixels

�rasterizedbythisprocessor�

that

allpoly

gonboundingboxes

overlap�The

totalrasterization

timeisthen

just

max

i �ri �

�R�where

Risthetim

eto

rasterizea

single

pixel�

Thisign

oresdetails

oftheim

plem

entation

such

asthecost

ofadvan

cing

tothenextpoly

gon�andthee�ects

ofperform

ingthetest

foradvan

cementto

the

nextpoly

gonperiod

icallyrath

erthan

afterevery

single

pixel�

These

e�ects

have

been

inclu

ded

inan

adhoc

fashion

byincreasin

gthecost

ofrasterizin

gasin

glepixel�

While

not

strictlycorrect�

thisapprox

imation

yield

sgood

results�

Thegoal

ofthesim

ulator

has

been

tomeasu

rethee�ectiven

essof

variouscom

munication

optim

izations�the

e�ects

ofwhich

areentirely

wrap

ped

upin

thegeom

etryandcom

munication

stages�

makingtheaccu

racyof

therasterization

stagesim

ulation

lessim

portan

t�

A��

Correctness

Theresu

ltsof

thesim

ulator

areto

alarge

exten

tveri�

able�

Asim

ple�an

dtherefore

prob

ably

correct�check

isdoneat

theendof

thecom

munication

stagethat

check

s

that��

Each

poly

gonwas

received

onall

processors

whose

pixels

itoverlap

ped�

�Noprocessor

receivedapoly

gonitdidn�tneed

�

�Noprocessor

receivedmore

than

onecop

yof

apoly

gon�

The�rst

check

isof

course

most

critical�Ifanyprocessor

doesn

�treceiv

eapoly

gonthat

itneed

sto

rasterizeitcou

ldresu

ltin

ahole

intheren

dered

scene�

The

��

secondcheck

insures

wearen

�tintrod

ucin

gextra

work

bygiv

ingprocessors

poly

gons

torasterize

that

don�toverlap

their

screenareas�

Thenecessity

ofthethird

check

isnot

immediately

obviou

s�Thisdetects

subtle

errorswhich

canbeintrod

uced

with

thedistrib

ution

optim

ization�In

thiscase

abug

inthealgorith

mcou

ld�an

ddid�accid

entally

distrib

ute

poly

gonsin

duplicate

�and

eventrip

licate�in

someboundary

cases�

Ofcou

rse�there

aresom

eerrors

that

thesim

ulator

can�tdetect�

andthey

will

necessary

allrelate

toperform

ance�

rather

than

thecorrectn

essof

theresu

lt�Such

errorswould

inclu

dedistrib

utin

gapoly

gonfurth

erthan

necessary

�past

thelast

processor

interested

init��

ormiscou

ntin

gtheinstru

ctioncycles

forsom

eoperation

�

These

areasof

thecod

ehave

allbeen

carefully

checked

�andtheresu

ltsgiven

bythe

simulator

re ect

ourexpectation

s�so

itisbeliev

edthey

arecorrect�

A��

Summary

Theanaly

sisof

variousoptim

izationshas

dem

onstrated

that

while

analy

ticalmod

elsmay

accurately

model

theform

ofthecorrect

answer�

simulation

may

yield

a

substan

tiallydi�eren

t�andpresu

mablymore

accurate�

solution

�Largely

these

di�er

ences

inaccu

racyare

caused

bytheassu

mption

sused

tomake

theanaly

ticalprob

lem

tractable�

Assu

mption

ssuch

asa�dense

ring�

aremadeto

compensate

forthelack

ofanygood

modelfor

theactu

aldistrib

ution

ofinform

ationin

therin

g�Variation

s

inthedistrib

ution

ofsou

rcesanddestin

ationsfor

poly

gonsfrom

uniform

ityexacer

bates

things

furth

er�makingsim

ulation

invalu

ableto

obtain

accurate

answers

forreal

poly

gondatab

ases�

Thevery

factorsthat

make

thecom

munication

sohard

toanaly

ticallymodelm

ake

itequally

hard

tosim

plify

thesim

ulation

�Fortu

nately

forourdata

setsit

isnot

proh

ibitively

expensive

toperform

afullsim

ulation

ofthecom

munication

structu

re�

��

Biblio

graphy

��Akeley�K�

Reality

Engin

eGrap

hics�

InProceed

ings

Siggra

ph��

�August

��pp��

��Akeley�K��andJermoluk�T�High

�Perform

ance

Poly

gonRenderin

g�In

Proceed

ings

Siggra

ph��A

ugust

��pp��

��Chin�D��Passe�J��Bernard�F��Taylor�H��andKnight�S�

The

Prin

cetonEngin

e�A

Real�T

ImeVideo

System

Sim

ulator�

IEEE

Transactio

ns

onConsumer

Electro

nics

��M

ay��

��Cox�M�B�Algo

rithmsforParallel

Renderin

g�PhD

dissertation

�Prin

ceton

University�

Departm

entof

Com

puter

Scien

ce�May

��

��Crockett�T�W��andOrloff�T�

AParallel

Renderin

gAlgorith

mfor

MIM

DArch

itectures�

InProceed

ings

Parallel

Renderin

gSymposiu

m�N

ewYork

�

��ACM

Press�

pp��

��Crow�F�C��Demos�G��Hardy�J��McLaughlin�J��andSims�K��D

Image

Synthesis

ontheConnection

Mach

ine�In

Parallel

Processin

gforComputer

Visio

nandDisp

lay�Addison

Wesley�

��

��Ellsworth�D�ANew

Algorith

mfor

Interactiv

eGrap

hics

onMulticom

puters�

IEEEComputer

Graphics

andApplica

tions��

��Ju

ly��

��

��Foley�J�D��vanDam�A��Feiner�S�K��Hughes�J�F��andPhillips�

R�L�Intro

ductio

nto

Computer

Graphics�

Addison

Wesley�

��

�

��Heckbert�P��andMoreton�H�Interp

olationfor

Poly

gonTextureMapping

andShading�

InState

oftheArt

inComputer

Graphics�

Visu

aliza

tionand

Modelin

g�Sprin

ger�Verlag�

��

��Knight�S��Chin�D��Taylor�H��andPeters�J�TheSarn

o�Engin

e�A

Massively

Parallel

Com

puter

forHigh

De�nition

System

Sim

ulation

�Journalof

VLSISign

alProcessin

g��

��

��Molnar�S��Cox�M��Ellsworth�D��andFuchs�H�ASortin

gClassi�

cationof

Parallel

Renderin

g�IEEEComputer

Graphics

andApplica

tions�Ju

ly

��

��Molnar�S��Eyles�J��andPoulton�J�PixelF

low�High

�Speed

Renderin

g

Usin

gIm

ageCom

position

�In

Proceed

ings

Siggra

ph��Ju

ly��

pp��

��Molnar�S�E�Im

age�C

ompositio

nArch

itectures

forReal�T

imeIm

age

Gen�

eratio

n�

PhD

dissertation

�TheUniversity

ofNorth

Carolin

aat

Chapel

Hill�

Departm

entof

Com

puter

Scien

ce�Octob

er��

��Neider�J��Davis�T��andWoo�M�OpenGLProgra

mmingGuide�Addison

Wesley�

��

��Phong�B�T�Illu

mination

forCom

puter

Generated

Pictu

res�Communica

tion

oftheACM

��

��

��

��Pineda�J�A

Parallel

Algorith

mfor

Poly

gonRasterization

�In

Proceed

ings

of

SIG

GRAPH

��

pp��

��Shoemake�K�Anim

atingRotation

with

Quatern

ionCurves�

InProceed

ings

Siggra

ph��Ju

ly��

pp��

��Connection

Mach

ineModel

CM�

Tech

nical

Summary�

Tech

�Rep�HA��

ThinkingMach

ines�

April

��

��

��Whitman�S�Multip

rocesso

rMeth

odsforComputer

Graphics

Renderin

g�AK

Peters�

Wellesley�

MA��

��Whitman�S�

ALoad

Balan

cedSIM

DPoly

gonRender�

unpublish

edarticle

�August

��

��Whitted�T�A

Taxonom

yof

Image

Com

position

Arch

itectures�

InIntern

a�

tionalSummer

Institu

teonState

oftheStart

inComputer

Graphics

��

��Williams�L�Pyram

idal

Param

etrics�Computer

Graphics

��Ju

ly��

��

��

SIMD ColumnP - Computer Graphicsgraphics.stanford.edu/~eldridge/academia/ms/eldridge_ms_2up.pdf ·...

Documents

Transcript of SIMD ColumnP - Computer Graphicsgraphics.stanford.edu/~eldridge/academia/ms/eldridge_ms_2up.pdf ·...