SIMD ColumnP - Computer Graphicsgraphics.stanford.edu/~eldridge/academia/ms/eldridge_ms_2up.pdf ·...
Transcript of SIMD ColumnP - Computer Graphicsgraphics.stanford.edu/~eldridge/academia/ms/eldridge_ms_2up.pdf ·...
SIMD Column�Parallel Polygon Rendering
by
Matthew Willard Eldridge
Submitted to the Department of Electrical Engineering and Computer
Science
in partial ful�llment of the requirements for the degrees of
Bachelor of Science
and
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May ����
c� Matthew Willard Eldridge� MCMXCV� All rights reserved�
The author hereby grants to MIT permission to reproduce and
distribute publicly paper and electronic copies of this thesis document
in whole or in part� and to grant others the right to do so�
Author � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
Department of Electrical Engineering and Computer Science
May ��� ����
Certi�ed by � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
Seth Teller
Assistant Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Certi�ed by � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
Elizabeth C� Palmer
Director� Employment and Development� David Sarno� Research
Center
Company Supervisor
Accepted by � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
Frederic R� Morgenthaler
Chairman� Departmental Committee on Graduate Students
�
SIMD Column�Parallel Polygon Rendering
by
Matthew Willard Eldridge
Submitted to the Department of Electrical Engineering and Computer Science
on May ��� ����� in partial ful�llment of the
requirements for the degrees of
Bachelor of Science
and
Master of Science
Abstract
This thesis describes the design and implementation of a polygonal rendering system
for a large one dimensional single instruction multiple data �SIMD� array of pro
cessors Polygonal rendering enjoys a traditionally high level of available parallelism�
but e�cient realization of this parallelism requires communication On large systems�
such as the ��� �processor Princeton Engine studied here� communication overhead
limits performance
Special attention is paid to study of the analytical and experimental scalability
of the design alternatives Sort�last communication is determined to be the only
design to o�er linear performance improvements in the number of processors We
demonstrate that sort��rst communication incurs increasingly high redundancy of
calculations with increasing numbers of processors� while sort�middle communication
has performance linear in the number of primitives and independent of the number
of processors Performance is thus be communication�limited on large sort��rst and
sort�middle systems
The system described achieves interactive rendering rates of up to �� frames per
second at NTSC resolution Scalable solutions such as sort�last proved too expensive
to achieve real�time performance Thus� to maintain interactivity a sort�middle
render was implemented A large set of communication optimizations are analyzed
and implemented under the processor�managed communication architecture
Finally a novel combination of sort�middle and sort�last communication is pro
posed and analyzed The algorithm e�ciently combines the performance of sort�
middle architectures for small numbers of processors and polygons with the scala
bility of sort�last architectures for large numbers of processors to create a system
with speedup linear in the number of processors The communication structure is
substantially more e�cient than traditional sort�last architectures both in time and
in memory
Thesis Supervisor� Seth Teller
Title� Assistant Professor of Electrical Engineering and Computer Science
Company Supervisor� Elizabeth C Palmer
Title� Director� Employment and Development� David Sarno� Research Center
SIMD Column�Parallel Polygon Rendering
by
Matthew Willard Eldridge
Submitted to the Department of Electrical Engineering and Computer Science
on May ��� ����� in partial ful�llment of the
requirements for the degrees of
Bachelor of Science
and
Master of Science
Abstract
This thesis describes the design and implementation of a polygonal rendering system
for a large one dimensional single instruction multiple data �SIMD� array of pro
cessors Polygonal rendering enjoys a traditionally high level of available parallelism�
but e�cient realization of this parallelism requires communication On large systems�
such as the ��� �processor Princeton Engine studied here� communication overhead
limits performance
Special attention is paid to study of the analytical and experimental scalability
of the design alternatives Sort�last communication is determined to be the only
design to o�er linear performance improvements in the number of processors We
demonstrate that sort��rst communication incurs increasingly high redundancy of
calculations with increasing numbers of processors� while sort�middle communication
has performance linear in the number of primitives and independent of the number
of processors Performance is thus be communication�limited on large sort��rst and
sort�middle systems
The system described achieves interactive rendering rates of up to �� frames per
second at NTSC resolution Scalable solutions such as sort�last proved too expensive
to achieve real�time performance Thus� to maintain interactivity a sort�middle
render was implemented A large set of communication optimizations are analyzed
and implemented under the processor�managed communication architecture
Finally a novel combination of sort�middle and sort�last communication is pro
posed and analyzed The algorithm e�ciently combines the performance of sort�
middle architectures for small numbers of processors and polygons with the scala
bility of sort�last architectures for large numbers of processors to create a system
with speedup linear in the number of processors The communication structure is
substantially more e�cient than traditional sort�last architectures both in time and
in memory
Thesis Supervisor� Seth Teller
Title� Assistant Professor of Electrical Engineering and Computer Science
Company Supervisor� Elizabeth C Palmer
Title� Director� Employment and Development� David Sarno� Research Center
Acknowledgments
I would like to thank David Sarno� Research Center and the Sarno� Real Time Cor
poration for their support of this research In particular� Herb Taylor for his expertise
in assembly level programming of the Princeton Engine and many useful discussions�
Jim Fredrickson for his willingness to modify the operating system and programming
environment during development of my thesis� Joe Peters for his patience with my
complaints� and Jim Kaba for his tutoring and frequent insights I would like to
thank Scott Whitman for many helpful discussions about parallel rendering and so
willingly sharing the Princeton Engine with me Thanks to all my friends at David
Sarno� Research Center and Sarno� Real Time Corporation� you made my work and
research very enjoyable
I would like to thank my advisor� Seth Teller� for providing continual motivation to
improve and extend my research Seth has been a never ending source of suggestions
and ideas� and an inspiration
Most of all� a special thank you to my parents for supporting me here during my
undergraduate years and convincing me to pursue graduate studies Without their
love and support I would have never completed this thesis
This material is based upon work supported under a National Science Foundation
Graduate Research Fellowship Any opinions� �ndings� conclusions or recommenda
tions expressed in this publication are those of the author and do not necessarily
re�ect the views of the National Science Foundation
�
Contents
� Introduction ��
�� Background � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
��� Hardware Approaches � � � � � � � � � � � � � � � � � � � � � � ��
��� Software Approaches � � � � � � � � � � � � � � � � � � � � � � � ��
�� Overview � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Princeton Engine Architecture ��
�� Processor Organization � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Interprocessor Communication � � � � � � � � � � � � � � � � � � � � � � �
�� Video I�O � Line�Locked Programming � � � � � � � � � � � � � � � � ��
� Implementation � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Graphics Pipeline ��
�� UniProcessor Pipeline � � � � � � � � � � � � � � � � � � � � � � � � � � �
��� Geometry � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
��� Rasterization � � � � � � � � � � � � � � � � � � � � � � � � � � � �
��� Details � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� MultiProcessor Pipeline � � � � � � � � � � � � � � � � � � � � � � � � � ��
��� Communication � � � � � � � � � � � � � � � � � � � � � � � � � � ��
��� Geometry � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
��� Rasterization � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�
� Scalability ��
� Geometry � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Rasterization � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Communication � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
�� Model � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
�� Sort�First � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Sort�Middle � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Sort�Last � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
Analysis � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Sort�First vs Sort�Middle � � � � � � � � � � � � � � � � � � � � �
� Sort�Last� Dense vs Sparse � � � � � � � � � � � � � � � � � � � �
� Sort�Middle vs Dense Sort�Last � � � � � � � � � � � � � � � � ��
� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Communication Optimizations ��
�� Sort�Middle Overview � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Dense Network � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Distribution � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Data Duplication � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Temporal Locality � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
�� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ��
� SortTwice Communication ��
�� Why is Sort�Last so Expensive� � � � � � � � � � � � � � � � � � � � � � ���
�� Leveraging Sort�Middle � � � � � � � � � � � � � � � � � � � � � � � � � ���
�� Sort�Last Compositing � � � � � � � � � � � � � � � � � � � � � � � � � ��
� Sort�Twice Cost � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ���
�� Deferred Shading and Texture Mapping � � � � � � � � � � � � � � � � � ���
�� Performance � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ���
�� Summary � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � ���
�
�Im
plementatio
n���
���Parallel
Pipelin
e�����������������������������
���
���Line�L
ockedProgram
ming
�����������������������
���
���Geom
etry���������������������������������
��
�����Represen
tation��������������������������
��
�����Tran
sformation
��������������������������
���
�����Visib
ility
Clip
ping
�����������������������
���
�����Ligh
ting
������������������������������
���
�����Coe
cientCalcu
lation����������������������
���
���Com
munication
�����������������������������
���
�����Passin
gaSingle
Poly
gonDescrip
tor���������������
���
�����Passin
gMultip
leDescrip
tors�������������������
���
���Rasterization
�������������������������������
���
�����Norm
alizeLinear
Equation
s�������������������
���
�����Rasterize
Poly
gons
������������������������
��
�����Rasterization
Optim
izations
�������������������
���
�����Summary
�����������������������������
���
���Control
����������������������������������
���
���Summary
���������������������������������
���
�Results
���
���Raw
Perform
ance
�����������������������������
���
�����Geom
etry�����������������������������
��
�����Com
munication
��������������������������
��
�����Rasterization
���������������������������
���
�����Aggregate
Perform
ance
����������������������
���
���Accou
ntin
g��������������������������������
���
���Perform
ance
Com
parison
�������������������������
���
�
�FutureDire
ctio
ns
���
���Distrib
ution
Optim
ization������������������������
���
���Sort�T
wice
��������������������������������
��
���NextGeneration
Hard
ware
�����������������������
��
��Conclusio
ns
���
ASim
ulator
���
A��
Model
�����������������������������������
���
A����
InputFiles
�����������������������������
���
A����
Param
eters����������������������������
���
A��
Operation
���������������������������������
��
A����
Data
Initialization
Geom
etry�����������������
��
A����
Com
munication
��������������������������
���
A����
Rasterization
���������������������������
���
A��
Correctn
ess��������������������������������
���
A��
Summary
���������������������������������
��
�
List
ofFigures
���Prin
cetonEngin
eVideo
I�O�����������������������
��
���Prin
cetonEngin
e�����������������������������
��
���Com
munication
Option
son
thePrin
cetonEngin
e�����������
��
���Prin
cetonEngin
e�����������������������������
��
���Synthetic
Eyep
oint
����������������������������
��
���Beeth
oven���������������������������������
��
���Crocod
ile���������������������������������
��
���Teap
ot�����������������������������������
��
���Texture
Mapped
Crocod
ile������������������������
��
���Sam
pleTexture
Image
��������������������������
��
���Uniprocessor
Geom
etryPipelin
e���������������������
��
����D
Persp
ectiveTran
sformation
���������������������
��
���Poly
gonLinear
Equation
s������������������������
��
���Trian
gleDe�nition
byHalf�P
laneIntersection
�������������
��
����Uniprocessor
Rasterization
������������������������
�
����Sort
Order
���������������������������������
��
����SIM
DGeom
etryPipelin
e�������������������������
��
����SIM
DRasterization
����������������������������
��
���Sort�F
irstLoad
Imbalan
ce������������������������
��
���Sort�M
iddle
��������������������������������
��
���Com
munication
Costs
��������������������������
��
��
���Com
munication
Optim
izations����������������������
�
���Sort�M
iddleCom
munication
�����������������������
��
���Sim
ulation
ofSparse
vs�
Dense
Netw
orks����������������
��
���Pass
Cost
���������������������������������
��
���i�
vs�
dfor
Distrib
ution
Optim
ization������������������
��
���Poly
gonDuplication
���������������������������
��
���Poly
gonsPer
Processor
vs�
m����������������������
��
���Instru
ctionsPer
Poly
gonas
afunction
ofm
andd�����������
��
���Tem
poral
Locality
����������������������������
��
���Multip
leUsers
onaParallel
Poly
gonRenderer
�������������
��
���Sort�L
astCom
positin
g��������������������������
��
���Pixelto
Processor
Map
��������������������������
��
���Sort�T
wice
Com
positin
gTim
e����������������������
��
���Pred
ictedSort�M
iddlevs�
Sort�T
wice
Com
munication
Tim
e�����
���
���Sim
ulated
Sort�M
iddlevs�
Sort�T
wice
Com
munication
Tim
e�����
���
���Sim
ulated
Sort�M
iddlevs�
Sort�T
wice
Execu
tionTim
e��������
���
���Beeth
oven���������������������������������
���
���Crocod
ile���������������������������������
���
���Texture
Mapped
Crocod
ile������������������������
���
���Teap
ot�����������������������������������
���
���Poly
gonRepresen
tation�������������������������
���
���Poly
gonDescrip
tor����������������������������
���
���Im
plem
entation
forData
Distrib
ution
������������������
���
���Linear
Equation
Norm
alization���������������������
���
���Rasterizin
g��������������������������������
���
���Virtu
alTrack
ball
Interface
������������������������
���
���Execu
tionPro�
le�����������������������������
���
��
A��
Bounding�b
oxInputFile
�������������������������
���
A��
Sam
pleScreen
Map
����������������������������
���
A��
Sim
ulator
Model
�����������������������������
���
A��
Destin
ationFalse
Positiv
e������������������������
���
A��
Sim
ulator
Com
munication
������������������������
���
A��
Destin
ationVector
Align
ment
����������������������
���
A��
Destin
ationVector
forg���
�����������������������
���
��
��
List
ofTables
���Com
munication
Modelin
gVariab
les�������������������
��
���Com
munication
sCosts
��������������������������
��
���Com
munication
Modelin
gValu
es��������������������
��
���Renderin
gPerform
ance
forTypical
Scen
es���������������
���
A��
Sim
ulator
Param
eters���������������������������
���
��
��
Chapter�
Introductio
n
Thisthesis
explores
theim
plem
entationof
aninteractive
poly
gonal
renderer
onthe
Prin
cetonEngin
e�a����p
rocessorsin
gle�instru
ctionmultip
le�data
�SIM
D�parallel
computer�
Each
processor
isasim
plearith
metic
coreable
tocom
municate
with
its
leftandrigh
tneigh
bors�
Thisisan
importan
tarch
itectureon
which
tostu
dytheim
plem
entation
ofpoly
gon
renderin
g�for
manyreason
s�
�Theinterp
rocessorcom
munication
structu
reissim
pleandinexpensiv
e�making
thecon
struction
oflarger
system
spractical�
with
costlin
earin
thenumber
of
processors�
�Theuse
ofSIM
Dprocessors
allowsvery
carefully
managed
communication
to
beim
plem
ented
�anda�ord
ssign
i�can
topportu
nities
foroptim
izations�
�Curren
tpoly
gonren
derin
galgorith
msandarch
itectures
oftenexhibitpoor
scal�
ability�
andare
design
edwith
largeandexpensiv
ecrossb
arsto
insure
adequate
interp
rocessorbandwidth�A
reduction
intheam
ountof
communication
re�
quired
andthecarefu
lstru
cturin
gof
thecom
munication
algorithm�as
man�
dated
bythemodest
communication
resources
ofthePrin
cetonEngin
e�will
have
largee�ects
onthefeasib
ilityof
future
largermach
ines�
��
Manyparallel
renderers
have
been
implem
ented
�butmost
were
either
execu
ted
onsm
allermach
inesor
onmach
ineswith
ahigh
erdim
ension
communication
netw
ork
than
thePrin
cetonEngin
e�Theuse
ofalarge
mach
inewith
narrow
intercon
nect
such
asthePrin
cetonEngin
emakes
thepoly
gonren
derin
gprob
lemsign
i�can
tlydi�eren
t�
Poly
gonren
derin
gprov
ides
aview
port
into
athree�d
imension
al��D
�space
ona
two�d
imension
al��D
�disp
laydevice�
Thespace
may
beofanysort�
realor
imagin
ed�
There
aretwobroad
approach
estaken
topoly
gonren
derin
g�realism
andinteractiv
ity�
Realism
requires
precise
andgen
erallyexpensiv
ecom
putation
toaccu
ratelymodela
scene�
Interactiv
ityreq
uires
real�timecom
putation
ofascen
e�at
apossib
leloss
of
realism�
Poly
gonren
derin
gmodels
ascen
eas
composed
ofaset
ofpoly
gonsin
�Dbein
g
observ
edbyaview
erfrom
anyarb
itrarylocation
�Geom
etrycom
putation
stran
s�
formthepoly
gonal
scenefrom
a�D
model
toa�D
model
andthen
rasterization
computation
sren
der
the�D
poly
gonsto
thedisp
laydevice�
Both
thegeom
etryand
therasterization
computation
sare
high
lyparallel�
buttheir
parallelization
exposes
a
sortingprob
lem�Poly
gonren
derin
ggen
erallyrelies
onthenotion
ofpoly
gonsbein
g
rendered
inthe�correct�
order�
sothat
poly
gonsnearer
totheobserver�s
view
poin
t
obscu
repoly
gonsfarth
eraw
ay�Ifthepoly
gonren
derin
gprocess
isdistrib
uted
across
processors
thisord
eringcon
straintwill
require
interp
rocessorcom
munication
tore�
solve�
Theord
eringcon
straintis
generally
resolvedin
oneof
three
ways�
referredto
bywhere
they
place
thecom
munication
intheren
derin
gprocess�
Sort��
rstand
sort�middlealgorith
msplace
poly
gonsthat
overlaponce
projected
tothescreen
on
thesam
eprocessor
forrasterization
sothat
theocclu
sioncan
beresolv
edwith
local
data�
Sort�last
rasterizeseach
poly
gonon
theprocessor
that
perform
sthepoly
gon�s
geometry
calculation
s�andthen
uses
pixel
bypixel
inform
ationafter
thefact
to
combineall
oftheprocessor�s
rasterizationresu
lts�Theadvan
tagesanddisad
vantages
ofthese
approach
esare
discu
ssed�both
ingen
eral�andwith
aspeci�
cfocu
son
an
e cien
tim
plem
entation
onthePrin
cetonEngin
e�
��
Amajor
goalofthisthesis
isto
�ndandextract
scalability
fromtheexplored
com�
munication
algorithms�so
that
asthenumber
ofprocessors
increases
theren
derin
g
perform
ance
will
increase
commensurately�
Wedem
onstrate
that
forsom
ecom
muni�
cationapproach
es�most
notab
lysort�m
iddle�
thecom
munication
timeislin
earin
the
numberofpoly
gonsandindepen
den
tofthenumberofprocessors�
With
outsom
ecare
communication
candom
inate
execu
tiontim
eofpoly
gonren
derin
gon
largemach
ines�
Sort�last
algorithmsmake
constan
t�timecom
munication
possib
le�independentof
both
thenumber
ofpoly
gonsandthenumber
ofprocessors�
butunfortu
nately
the
constan
tcost
isvery
large�
Anovel
combination
ofsort�m
iddleandsort�last
communication
isprop
osedand
analy
zedto
address
theprob
lemof
�ndingscalab
lecom
munication
algorithms�
The
algorithm
e�cien
tlycom
bines
theperform
ance
ofsort�m
iddlearch
itectures
forsm
all
numbers
ofprocessors
andpoly
gonswith
thescalab
ilityof
sort�lastarch
itectures
to
createasystem
with
perform
ance
linearin
thenumber
ofprocessors�
Thesystem
describ
edin
thisthesis
achieves
interactive
renderin
grates
ofupto
��
frames
per
second�Tomain
taininteractiv
ityasort�m
iddleren
der
was
implem
ented
�
asscalab
lesolu
tionssuch
assort�last
proved
tooexpensiv
eto
achieve
real�timeper�
formance�
Alarge
setof
optim
izationsare
analy
zedandim
plem
ented
toaddress
the
scalability
prob
lemsof
sort�middlecom
munication
�
This
thesis
signi�can
tlydiers
fromcurren
twork
onparallel
poly
gonren
der�
ing�
Contem
porary
parallel
renderin
gfocu
sesvery
heav
ilyon
multip
le�instru
ction
multip
le�data
MIM
D�arch
itectures
with
large�fast
andexpensive
netw
orks�
We
have
instead
pursu
edaSIM
Darch
itecture
with
arath
ernarrow
netw
ork�
Back
�
groundisprov
ided
forsoftw
areandhard
ware
parallel
renderers�
both
ofwhich
have
signi�can
tlyin�uenced
thiswork
�
�
���
Background
Poly
gonren
derin
gsystem
svary
widely
intheir
capabilities
beyondjustpain
tingpoly
�
gonson
thescreen
�Themost
common
features
foundinmost
ofthesystem
sdescrib
ed
below
inclu
de�
Depth
Bu�erin
gApixel�b
y�pixelcom
putation
ofthedistan
ceofthepoly
gonfrom
theusers
eyeisstored
ina�depth
buer��
Poly
gonare
only
rendered
atthe
pixels
where
they
arecloser
totheobserver
than
theprev
iously
rendered
poly
�
gons�
Smooth
ShadingThecolor
ofthepoly
gonissm
oothly
interp
olatedbetw
eenthe
verticesto
givetheim
pression
ofasm
ooth
lyvary
ingsurface�
rather
than
a
surface
madeupof
�at�sh
aded
plan
arfacets�
Texture
MappingThecolor
ofthepoly
gonisrep
lacedwith
anim
ageto
increase
thedetail
ofthescen
e�For
exam
ple�
anim
agecan
betex
ture
mapped
onto
a
billb
oardpoly
gon�
More
complex
capabilities
arepresen
tin
manyof
thesystem
sand�while
tech�
nically
interestin
g�are
leftunaddressed
asthey
don�tbear
directly
onthepoly
gon
renderin
gsystem
describ
edherein
�
There
have
been
numerou
sparallel
poly
gonren
derin
gsystem
sbuilt�
Iwill
discu
ss
someof
them
here�
andprov
idereferen
cesto
theliteratu
re�It
issign
i�can
tthat
most
ofthework
deals
either
with
largeVLSIsystem
s�or
MIM
Darch
itectures
with
pow
erfulindividual
nodes�
andarich
intercon
nection
netw
ork�Thisthesis
brid
ges
thegap
betw
eenthese
approach
es�mergin
gthevery
high
parallelism
ofhard
ware
approach
eswith
the�exibility
ofsoftw
areapproach
es�
�����
HardwareApproaches
High
perform
ance
poly
gonren
derin
ghas
tradition
allybeen
thedom
ainof
special
purpose
hard
ware
solution
s�as
hard
ware
canread
ilyprov
ideboth
theparallelism
��
required
andthecom
munication
bandwidth
necessary
tofully
exploit
theparallelism
inpoly
gonren
derin
g�Turner
Whitted
prov
ides
asurvey
ofthe ���
stateof
theart
inhard
ware
graphics
architectu
resin
�� ��
Iwill
brie�
ydiscu
ssthree
speci�
csystem
shere�
thePixelF
lowarch
itecture
from
theUniversity
ofNorth
Carolin
a�andtwoarch
itectures
fromSilicon
Grap
hics�
Inc��
theIR
IS�D
�GTXandtheReality
Engin
e�Allof
thearch
itectures
prov
ideinteractive
renderin
gwith
variousdegrees
ofrealism
�
PixelFlow
PixelF
low�describ
edin
� ��and� ���
isalarge�scale
parallel
renderin
gsystem
based
onasort�last
compositin
garch
itecture�
Itprov
ides
substan
tialprogram
mability
of
thehard
ware�
andat
aminim
um
perform
sdepth
buerin
g�sm
ooth
shadingand
texture
mappingof
poly
gons�
PixelF
lowiscom
posed
ofapipelin
eof
renderin
gboard
s�Each
renderin
gboard
consists
ofaRISCprocessor
toperform
geometry
calculation
anda ��
� ��
SIM
D
arrayof
pixel
processors
that
rasterizean
entire
poly
gonsim
ultan
eously�
prov
iding
parallelism
intheextrem
e�Thepoly
gonsdatab
aseisdistrib
uted
arbitrarily
among
theren
derin
gboard
sto
achiev
ean
evenload
acrosstheprocessors�
After
renderin
g
allofits
poly
gonseach
renderin
gboard
has
apartial
image
ofthescreen
�correct
only
forthat
subset
ofthepoly
gonsitren
dered
�Thesep
aratepartial
images
ofthescreen
arethen
composited
togen
erateasin
gleoutputim
agevia
a����
gigabit�secon
d
intercon
nection
bus�
PixelF
lowim
plem
entsshadingandtex
turemappingintwospecial
shadingboard
s
attheendofthepipelin
eofren
derin
gboard
s�Each
pixelis
thusshaded
atmost
once�
eliminatin
gneed
lessshadingofnon�visib
lepixels�
andtheam
ountoftex
turemem
ory
required
ofthesystem
islim
itedto
theam
ountthat
will
prov
idesu�cien
tbandwidth
totex
ture
map
fullresolu
tionim
agesat
update
rates�
Thesort�last
architectu
reprov
idesperform
ance
increases
nearly
linear
inthenum�
ber
ofren
derin
gboard
sandread
ilyadmits
loadbalan
cing�
ThePixelF
lowarch
itec�
�
ture�
when
complete�
isexpected
toscale
toperform
ance
ofat
least�million
poly
�
gonsper
second�Bycom
parison
thecurren
tSilicon
Grap
hics
Reality
Engin
eren
ders
slightly
more
than
million
poly
gonsper
second�
IRIS
�D�GTX
TheSilicon
Grap
hics
IRIS
�D�G
TX�describ
edin
����was
�rst
soldin
���andisthe
Prin
cetonEngin
e�scon
temporary�
TheGTX
perform
ssm
oothshadinganddepth�
buerin
gof
litpoly
gonsandcan
render
over ������
triangles
asecon
d�TheGTX
prov
ides
notex
ture
mappingsupport�
Geom
etrycom
putation
sare
perform
edbyapipelin
eof�high
perform
ance
proces�
sorswhich
eachperform
adistin
ctphase
ofthegeom
etrycom
putation
s�Theoutput
ofthegeom
etrypipelin
eisfed
into
apoly
gonprocessor
which
dices
poly
gonsinto
trapezoid
swith
parallel
verticaledges�
These
trapezoid
sare
consumed
byan
edge
processor
that
generates
aset
ofvertical
spansto
interp
olateto
render
eachpoly
gon�
Aset
of�span
processors
perform
interp
olationof
thevertical
spans�
Each
span
processor
isfurth
ersupported
by�im
ageengin
esthat
perform
thecom
position
of
thepixels
with
prev
iously
rendered
poly
gonsandmanage
thefully
distrib
uted
frame�
buer�
Thespan
processors
areinterleaved
columnbycolu
mnof
thescreen
andthe
image
engin
esfor
eachspan
processor
areinterleaved
pixelbypixel�
The�negrain
interleaved
interleav
ingof
thespan
processors
andim
ageengin
esinsure
that
most
prim
itiveswill
berasterized
inpart
byall
oftheim
ageengin
es�
Com
munication
isim
plem
ented
bythesort�m
iddle
screen�sp
acedecom
position
perform
edbetw
eentheedge
processor
andthespan
processors�
Thischoice�
while
limitin
gthescalab
ilityof
thesystem
�avoid
stheextrem
elyhigh
constan
tcost
of
sort�lastapproach
essuch
asPixelF
low�
Reality
Engine
TheReality
Engin
e�describ
edin
� ��rep
resentsthecurren
thigh
�endpoly
gonren
der�
ingsystem
fromSilicon
Grap
hics�
Itsupports
fulltex
ture
mapping�shading�
lightin
g
��
anddepth
buerin
gof
prim
itivesat
ratesexceed
ingonemillion
poly
gonsasecon
d�
Logically
theReality
Engin
eisvery
similar
tothearch
itecture
oftheGTX�The
��processor
geometry
pipelin
ehas
been
replaced
by �
independentRISCprocessors
eachim
plem
entingacom
plete
version
ofthepipelin
e�to
which
poly
gonsare
assigned
inarou
nd�rob
infash
ion�A
triangle
busprov
ides
thesort�m
iddleintercon
nect
be�
tween
thegeom
etryengin
esandaset
ofupto
��interleaved
fragmentgen
erators�
Each
fragmentgen
eratoroutputs
shaded
andtex
tured
fragments
toaset
of �
im�
ageengin
eswhich
perform
multisam
pleantialiasin
gof
theim
age�andprov
ideafully
distrib
uted
framebuer�
Themost
signi�can
tchange
fromtheGTX
arethefrag�
mentgen
erators�which
arethecom
bination
ofof
thespan
processors
andtheedge
processors�
Thetex
ture
mem
oryisfully
duplicated
ateach
ofthe��
fragmentgen
eratorsto
prov
idethehigh
parallel
bandwidth
required
toaccess
textures
atfullthrottle
byall
fragmentgen
erators�andas
such
thesystem
operates
with
littleperform
ance
penalty
fortex
ture
mapping�
Afully
con�gured
Reality
Engin
econ
sistsof
over���
processors
andnearly
half
agigab
yte
ofmem
ory�andprov
ides
theclosest
analog
tothePrin
cetonEngin
ein
rawcom
putation
alpow
er�Substan
tialdieren
cesin
communication
bandwidth
and
thespeed
oftheprocessors
makethecom
parison
farfrom
perfect
how
ever�andthe
Prin
cetonEngin
e�sperform
ance
ismore
comparab
leto
itscon
temporary�
theGTX�
�����
Softw
are
Approaches
There
have
been
anumber
ofsoftw
areparallel
renderers
implem
ented
�both
inthe
interests
ofrealism
and�or
interactiv
ity�Themajority
ofthese
renderers
have
either
been
implem
ented
onsm
allmach
ines
tensof
processors�
oron
largemach
inewith
substan
tialintercon
nection
�In
eithercase
theseverity
ofthecom
munication
prob
lem
isminim
ized�
��
Connectio
nMachine
Fran
kCrow
etal�
implem
ented
apoly
gonal
renderin
gsystem
onaConnection
Ma�
chine�C
M���
atWhitn
ey�D
emos
Prod
uction
s�as
describ
edin
����TheCM��
� ��
isamassiv
elyparallel
SIM
Dcom
puter
composed
ofupto
������bit�serial
processors
and�����
�oatin
g�poin
tcop
rocessors�Thenodes
arecon
nected
ina�D
mesh
and
alsoviaa ��d
imension
alhypercu
bewith
�processors
ateach
vertex�Each
nodein
theCM��
has
�KBof
mem
ory�for
atotal
main
mem
orysize
of� �M
B�
Crow
distrib
utes
thefram
ebuerover
themach
inewith
afew
pixels
per
processor�
Renderin
gisperform
edwith
apoly
gonassign
edto
eachprocessor�
Allp
rocessorsren
�
dersim
ultan
eously
andas
pixels
aregen
eratedthey
aretran
smitted
viathehypercu
be
intercon
nect
totheprocessor
with
theapprop
riatepiece
ofthefram
ebuer
fordepth
buerin
g�Thisistherefore
alarge�scale
implem
entation
ofsort�last
communication
�
Theissu
eof
thebandwidth
thisreq
uires
fromthenetw
orkisleft
untou
ched�It
is
reasonableto
assumethat
itisnot
inexpensiv
eon
such
alarge
mach
ine�
Crow
alludes
totheprob
lemofcollision
sbetw
eenpixels
destin
edfor
thesam
eprocessor�
butleaves
itunresolved
�
Theuse
ofaSIM
Darch
itecture
forcesall
processors
toalw
aysperform
thesam
e
operation
son
poten
tiallydieren
tdata��
Thusifpoly
gonsvary
insize
allprocessors
will
have
towait
fortheprocessor
with
thebiggest
poly
gonto
�nish
rasterizingbefore
they
proceed
tothenextpoly
gon�Thisprob
lemisexacerb
atedbythe��D
natu
reof
thepoly
gons�which
generally
break
stherasterization
into
aloop
overthehorizon
tal
�spans�
ofthepoly
gons�
with
aninner
loopover
thepixels
ineach
span�With
out
carethis
canresu
ltin
allprocessors
execu
tingin
timeprop
ortional
totheworst
casepoly
gonwidth�Crow
addresses
thisprob
lemin
part
byallow
ingprocessors
to
independently
advan
ceto
thenextspan
oftheir
curren
tpoly
gonevery
fewpixels�
butcon
strainsall
processors
tosynchron
izeat
poly
gonboundaries�
Ourexperien
ce
suggests
that
asubstan
tialam
ountof
timeiswasted
bythisapproach
�
Crow
prov
idesagood
overview
ofthedi�
culties
ofadaptin
gthepoly
gonren
derin
g
algorithmto
avery
largeSIM
Dmach
ine�b
utleaves
somecriticalload
balan
cingissu
es
��
unresolved
�andunfortu
nately
prov
ides
noperform
ance
numbers�
Prin
cetonEngine
Contem
poran
eouswith
thisthesis�
Scott
Whitm
anim
plem
ented
amultip
le�user
poly
�
gonren
derer
onthePrin
cetonEngin
e�Thesystem
describ
edin
����divides
the
Prin
cetonEngin
eam
ongatotal
of�sim
ultan
eoususers�
assigningagrou
pof
��
processors
toeach
user�
Sort�m
iddlecom
munication
isused
�with
somediscu
ssionmadeofthecost
ofthis
approach
�Asin
Crow
����Whitm
anem
ploy
saperiod
ictest
durin
gtherasterization
ofeach
scan�lin
eto
allowprocessors
toindependently
advan
ceto
thenextscan
�line�
Healso
synchron
izesat
thepoly
gonlevel�
constrain
ingthealgorith
mto
perform
in
timeprop
ortional
tothelargest
poly
gonon
anyprocessor�
Whitm
ansuggests
that
hisalgorith
mwould
attainperform
ance
ofapprox
imately
�������trian
glesper
secondper
��processors
onthenextgen
erationhard
ware
we
discu
ssinx����
Thisassu
mes
that
thesystem
remain
sdevoted
tomultip
leusers�
soa
single
user
will
not
attainan
addition
al�������
triangles
per
secondfor
everygrou
p
of ��
processors
assigned
tohim
�
Whitm
analso
prov
ides
agood
general
discu
ssionandback
groundfor
parallel
computer
renderin
gmeth
odsin
hisbook
� ���
ParallelRenderM
an
Mich
aelCox
addresses
issues
of�ndingscalab
leparallel
algorithmsandarch
itectures
forren
derin
g�andspeci�
callyexam
ines
aparallel
implem
entationof
RenderM
an����
Cox
implem
entsboth
asort��
rstren
derer�as
ane�
cientdecom
position
oftheren
�
derin
gprob
lemin
image
space�
andasort�last
renderer�
asan
e�cien
tdecom
position
inob
jectspace�
tostu
dythedieren
cesin
e�cien
cybetw
eenthese
twoapproach
es�
Both
implem
entation
swere
onaCM��
fromThinkingMach
ines�
a� �
processor
MIM
Darch
itecture
with
apow
erfulcom
munication
netw
ork�
Theprob
lemsof
eachapproach
arediscu
ssedin
detail�
Sort��
rstan
dsort�
��
middle�
cansuer
fromload
imbalan
cesas
thedistrib
ution
ofpoly
gonsvaries
in
image
space�
andsort�last
suers
fromextrem
ecom
munication
requirem
ents�
Cox
addresses
these
issues
with
ahybrid
algorithm
which
uses
sort�lastcom
munication
asaback
endto
loadbalan
ceto
sort��rst
communication
�
Distrib
utedMemory
MIM
DRenderin
g
Thom
asCrocket
andTobias
Orlo
prov
ideadiscu
ssionof
parallel
renderin
gon
a
distrib
uted
mem
oryMIM
Darch
itecturein
���with
sort�middlecom
munication
�Par�
ticular
attention
ispaid
toload
�imbalan
cesdueto
irregular
conten
tdistrib
ution
over
thescen
e�
They
use
arow
�parallel
decom
position
ofthescreen
�assign
ingeach
processor
agrou
pof
contigu
ousrow
sfor
which
itperform
srasterization
�Thegeom
etryand
rasterizationcalcu
lationsare
perform
edin
aloop
��rst
perform
ingsom
egeom
etry
calculation
sandtran
smittin
gtheresu
ltsto
theoth
erprocessors�
then
rasterizing
somepoly
gonsreceiv
edfrom
other
processors�
then
perform
ingmore
geometry�
etc�
This
distrib
utes
thecom
munication
overtheentire
timeof
thealgorith
mandcan
resultin
much
better
utilization
ofthecom
munication
netw
orkbynot
constrain
ing
allof
thecom
munication
tohappen
inaburst
betw
eengeom
etryandrasterization
�
Interactiv
eParallelGraphics
David
Ellsw
orthdescrib
esan
algorithm
forinteractive
graphics
onparallel
computers
in����
Ellsw
orth�s
algorithm
divides
thescreen
into
anumber
ofregion
sgreater
than
thenumber
ofprocessors�
anddynam
icallycom
putes
amappingof
regionsto
processors
toach
ievegood
loadbalan
cing�
Thealgorith
mwas
implem
ented
onthe
Touchston
eDelta
prototy
peparallel
computer
with
� �Intel
i���com
pute
nodes
intercon
nected
ina�D
mesh
�
Poly
gonsare
communicated
insort�m
iddlefash
ionwith
thecaveat
that
processors
areplaced
into
groups�
Aprocessor
communicates
poly
gonsafter
geometry
calcula�
tionsbybundlin
gall
thepoly
gonsto
berasterized
byaparticu
largrou
pinto
alarge
��
message
andsen
dingthem
toa�forw
arding�
processor
atthat
groupthat
then
redis�
tributes
them
locally�Thisapproach
amortizes
message
overhead
bycreatin
glarger
message
sizes�
Thisam
ortizationof
thecom
munication
netw
orkoverh
eadisapow
erfulidea�
and
isthecen
tralthem
eof
several
optim
izationsexplored
inthisthesis�
���
Overview
Thisthesis
deals
with
anumber
ofissu
esin
parallel
renderin
g�Agreat
deal
ofatten
�
tionispaid
toe�
cientcom
munication
strategieson
SIM
Drin
gs�andtheir
speci�
c
implem
entation
onthePrin
cetonEngin
e�
Chapter
�prov
ides
anoverv
iewof
thePrin
cetonEngin
earch
itecture
which
will
driv
eouranaly
sisof
sortingmeth
ods�andourim
plem
entation
�
Chapter
�discu
ssestheuniprocessor
poly
gonren
derin
gpipelin
eandthen
exten
ds
itto
themultip
rocessorpipelin
e�Approach
esto
thesortin
gprob
lemare
addressed
�
andissu
esin
extractin
ge�
ciency
fromaparallel
SIM
Dmach
ineare
raised�
Chapter
�exam
ines
thescalab
ilityof
poly
gonren
derin
galgorith
ms�bylook
ing
atthescalab
ilityof
thethree
prim
arycom
ponents
ofren
derin
g�geom
etry�com
�
munication
�andrasterization
�A
general
model
fortheperform
ance
ofsort��
rst�
sort�middleandsort�last
communication
ona D
ringisprop
osedandused
tomake
broad
pred
ictionsaboutthescalab
ilityof
thesortin
galgorith
msin
both
thenumber
ofpoly
gonsandthenumber
ofprocessors�
Atten
tionisalso
paid
totheload
bal�
ancin
gprob
lemsthat
thechoice
ofsortin
galgorith
mcan
introd
uce
andtheir
eects
ontheoverall
scalability
oftheren
derer�
Chapter
�con
cludes
with
justi�
cationfor
thechoice
ofsort�m
iddlecom
munication
forthePrin
cetonEngin
epoly
gonren
derin
g
implem
entation
�desp
iteits
lackof
scalability�
Chapter
�prop
osesandanaly
zesoptim
izationsfor
sort�middle
communication
�
both
with
ananaly
ticalmodel�
andwith
thesupport
ofinstru
ction�accu
ratesim
�
ulation
�Thefeasib
ilityof
verytigh
tuser
control
overthecom
munication
netw
ork
��
isexploited
toprov
ideaset
ofprecisely
managed
andhigh
lye�
cientoptim
izations
which
more
than
tripletheoverall
perform
ance
fromthemost
naiv
ecom
munication
implem
entation
�
Chapter
�prop
osesandanaly
zesanovel
communication
strategythat
marries
the
e�cien
cyofsort�m
iddlecom
munication
forsm
allnumbers
ofpoly
gonsandprocessors
with
thecon
stanttim
eperform
ance
ofsort�last
communication
toobtain
ahigh
ly
scalable
algorithm�Cycle
bycycle
control
ofthecom
munication
channel
isused
to
createacom
munication
structu
rethat
while
requirin
gthesam
eam
ountof
band�
width
assort�last
communication
achieves
nearly
afactor
of�speed
upover
atypical
implem
entation
byam
ortizingthecom
munication
overhead
�TheVLIW
codingof
thePrin
cetonEngin
eallow
safulldeferred
lightin
g�shadingandtex
ture
mapping
implem
entation
which
increases
rasterizere�
ciency
byafactor
ofmore
than
two�
Chapter
�details
ourim
plem
entation
ofpoly
gonren
derin
gon
thePrin
cetonEn�
gine�
Acom
plete
discu
ssionof
thee�
cienthandlin
gof
thecom
munication
channel
ismade�speci�
callydiscu
ssingthequeuein
gof
poly
gonsfor
communication
tomain
�
tainsatu
rationofthechannel�
andtheuseofcarefu
llypipelin
edpoin
terindirection
to
achieve
general
all�to�allcom
munication
with
neigh
bor�to�n
eighbor
communication
with
outanyaddition
alcop
yoperation
s�Speci�
cmech
anism
sandoptim
izationfor
theim
plem
entation
offast
ande�
cientparallel
geometry
andrasterization
arealso
discu
ssed�
Chapter
�discu
ssestheperform
ance
ofoursystem
�andprov
ides
anaccou
ntof
where
alltheinstru
ctionsgo
inasin
glesecon
dof
execu
tion�Statistics
fromexecu
�
tionsof
thesystem
areprov
ided
forthree
samplescen
esof
varyingcom
plex
ity�alon
g
with
benchmark
�gures
oftheraw
throu
ghputof
thesystem
forindividual
geome�
try�com
munication
�andrasterization
operation
s�Theresu
ltsare
compared
with
the
GTX
fromSilicon
Grap
hics�
acon
temporary
ofthePrin
cetonEngin
e�andwith
the
Magic� �
ournextgen
erationhard
ware�
Chapter
�discu
ssesournextgen
erationhard
ware�
theMagic� �
curren
tlyunder
testanddevelop
ment�TheMagic
prov
idessign
i�can
tincreases
incom
munication
and
��
computation
support�
inaddition
toagreatly
increased
clockrate�
andispred
icted
toprov
idean
immensely
pow
erfulpoly
gonren
derin
gsystem
�Ourestim
ationsfor
an
implem
entation
ofasort�tw
iceren
derin
gsystem
onthisarch
itecture
areprov
ided�
andindicate
asystem
with
perform
ance
inthemillion
sof
poly
gons�with
�exibility
farbeyon
dthat
ofcurren
thigh
perform
ance
hard
ware
system
s�
Chapter
��summarizes
theresu
ltsof
thisthesis�
andcon
cludes
with
adiscu
ssion
ofthecon
tribution
sof
thiswork
�
AppendixAdiscu
ssestheinstru
ction�accu
ratesim
ulator
develop
edto
explore
the
e�cien
cyof
di�eren
tcom
munication
algorithmsandtheload
�balan
cingissu
esraised
onahigh
lyparallel
mach
ine�
��
Chapter�
Prin
cetonEngineArchite
cture
ThePrin
cetonEngin
e�� �
asuper�com
puter
classparallel
mach
ine�was
design
edand
built
attheDavid
Sarn
o�Research
Center
in����
Itcon
sistsof
betw
een��
and
���SIM
Dprocessors
connected
inaneigh
bor�to�n
eighbor
ring�
TheEngin
ewas
originally
design
edpurely
asavideo
processor�
andits
only
trueinput�ou
tputpath
s
arevideo�
Itaccep
tstwoNTSC
composite
video
inputs
andprod
uces
upto
four
interlaced
orprogressiv
escan
componentvideo
outputs�
asshow
nin
Figu
re���
ThePrin
cetonEngin
eisaSIM
Darch
itecture�
soall
processors
execu
tethesam
e
instru
ctionstream
�Condition
alexecu
tionof
theinstru
ctionstream
isach
ievedwith
a�sleep
�state
that
processors
may
condition
allyenter
andrem
ainin
until
awaken
ed�
Cam
eraP
rinceton Engine
VC
R
Figu
re���
Prin
cetonEngineVideoI�O�
thePrin
cetonEngin
esupports
si�
multan
eousvideo
inputs
and�sim
ultan
eousoutputs�
��
Theinstru
ctionstream
implicitly
execu
tes
ifwakeupthen
asleep�false
executeinstruction
if�asleepthen
commitresult
forevery
instru
ction�More
complex
if���then���else���execu
tioncan
beach
ievedby�rst
puttin
gsom
esubset
oftheprocessors
tosleep
�execu
tingthethen
body�togglin
gtheasleep�ag�
andexecu
tingtheelsebody�
Thedetails
oftheprocessors�
thecom
munication
intercon
nect
andthevideo
I�O
capabilities
aredescrib
edbelow
�Thevideo
I�Ostru
cture
isparticu
larlycritical
asit
determ
ineswhat
algorithmscan
bepractically
implem
ented
onthePrin
cetonEngin
e�
���
Processo
rOrganiza
tion
ThePrin
cetonEngin
e�show
nsch
ematically
inFigu
re��
isaSIM
Dlin
eararray
of���
processors
connected
inarin
g�Each
processor
isa���b
itload
�storecore�
with
aregister
�le�
ALU�a��
���
multip
lierand���K
Bof
mem
ory�
The�����
multip
liercom
putes
a�
bitresu
lt�andaprod
uct
picker
allowsthese�
lectionofwhich
��bits
oftheoutputare
used
�prov
idingtheoperation
c��a�b��
n�
This
allowse�
cientcod
ingof
�xed�poin
toperation
s�ThePrin
cetonEngin
ewas
design
edto
process
video
inreal�tim
e�andas
such
exten
ded
precision
integer
arith�
metic
�beyon
d��
bits�
and�oatin
g�poin
tarith
metic
delegated
tosoftw
aresolu
tions
andare
relativelyexpensive
operation
s�A��b
itsign
edinteger
multip
lyreq
uires
���
instru
ctions�anda��b
itsign
edinteger
division
requires
over���
instru
ctions�
Themem
oryiscom
posed
oftypes�
��KBof
fastmem
oryand��K
Bof
slow
mem
ory�Fast
mem
oryisaccessib
lein
asin
gleclock
cycle�
Slow
mem
oryisdivided
into
�banksof
��KB
andis
accessible
incycles�
Thecom
piler
places
alluser
�
IPC
Register
File
28MB
/s
ALU
P
Product Picker
144bit
Instruction Bus
Synchronous
640KB
SR
AM 16bit
Multiplier
PP
P
IPC
Register
File
28MB
/s
PP
PP
PP
PP
16bit
Multiplier
Product Picker
ALU
SR
AM
640KB
Figu
re��
Prin
cetonEngine�theprocessors
areorgan
izedin
a�D
ring�
The
instru
ctionstream
isdistrib
uted
synchron
ously
fromtheseq
uencer�
data
andvariab
lesinto
fastmem
ory�Anyuse
oftheslow
mem
oryis
byexplicit
program
mer
access�
Theinstru
ctionstream
supplied
bytheseq
uencer
is���
bits
wide�
TheVLIW
�Very
LongInstru
ctionWord
�cod
edinstru
ctionstream
prov
ides
parallel
control
over
allof
theintern
aldata
path
sof
theprocessors�
Theaverage
amountof
instru
ction
levelparallelism
obtain
edwith
intheprocessors
isbetw
eenand��
Unlike
typical
uniprocessors�
thePrin
cetonEngin
eprocessors
arenot
pipelin
ed�as
theinstru
ction
decod
ehas
alreadyhappened
attheseq
uencer�
Theparallelism
obtain
edbyVLIW
on
thePrin
cetonEngin
eiscom
parab
lyto
theparallelism
obtain
edbypipelin
edexecu
tion
inaRISCprocessor�
andthe��M
Hzexecu
tionof
theprocessin
gelem
entisrou
ghly
equivalen
tto
��RISCMIPS�
Theprocessors
arecon
trolledbyacen
tralseq
uencer
board
which
manages
the
program
counter
andcon
dition
alexecu
tion�Each
ofthe���
processors
operates
at��M
Hzfor
anaggregate
perform
ance
ofapprox
imately
������RISCMIPSor
��
billion
instru
ctionsper
second�
��
���
Interprocesso
rCommunica
tion
Theprocessors
areintercon
nected
via
aneigh
bor�to�n
eighbor
�interp
rocessorcom
�
munication
��IP
C�bus�
Each
processor
may
selectively
transm
itor
receivedata
onanygiven
instru
ction�allow
ingmultip
lecom
munication
sto
occursim
ultan
eously�
There
isnoshared
mem
oryon
thePrin
cetonEngin
e�all
communication
betw
een
processors
mustvia
theIPC�
Interp
rocessorcom
munication
�IPC�occu
rsvia
a���b
itdata
registerusab
leon
everycycle�
forabisection
�neigh
bor
toneigh
bor�
bandwidth
of�M
B�s�
Therin
g
canbecon
�gured
asashift
registerwith
���entries�
Each
processor
has
control
over
oneentry
intheshift
register�After
ashift
operation
eachprocessor�s
IPC
register
hold
sthevalu
eof
itsneigh
bor�s
�leftor
right�
IPC
registerprev
iousto
theshift
operation
�Durin
ganygiven
instru
ctiontheIPCisshifted
left�righ
tor
unchanged
�
TheIPCregister
may
beread
non�destru
ctively�so
once
data
isplaced
intheIPC
registersit
may
beshifted
andread
arbitrarily
manytim
es�to
distrib
ute
aset
of
values
tomultip
leprocessors�
There
areseveral
modes
ofcom
munication
supported
onthePrin
cetonEngin
e�
They
arerep
resented
schem
aticallyin
Figu
re���
Neighbor�to�NeighborThemost
natu
raloption
�Each
processor
passes
a���b
it
word
toeith
erits
leftor
rightneigh
bor�
Allprocessors
pass
simultan
eously
either
totheleft
orto
therigh
t�
Broadcast
Oneprocessor
isthebroad
caster�andall
other
processors
arereceivers�
Neighbor�to�Neighbor�N
Each
processor
passes
asin
gledatu
mto
aprocessor
Nprocessors
away�
Allprocessors
dothisin
parallel�
sothisisas
e�cien
tas
neigh
bor�to�n
eighbor
communication
interm
sof
inform
ationdensity
inthe
ring�
Multi�
Broadcast
Som
esubset
oftheprocessors
aredesign
atedbroad
casters�Each
broad
castertran
smits
values
toall
processors�
inclu
dingitself�
betw
eenitself
��
13
31
43 3
21
Multi-B
roadcast
Neighbor-to-N
eighbor-N
Broadcast
Neighbor-to-N
eighbor
41
34
2
32
1 12
41
43
221
14
31
11
Figu
re���
Communicatio
nOptio
nsonthePrin
cetonEngine�therigh
thand
sidedepicts
thecon
tents
ofthecom
munication
registeron
eachprocessor
aftera
single
stepof
thecom
munication
meth
od�
andthenextbroad
caster�
Thediscu
ssionof
communication
approach
eswill
ignore
thepossib
ilityof
usin
g
broad
cast�in
either
mode�
becau
seitreq
uires
more
instru
ctionsthan
neigh
bor�to�
neigh
bor
totran
sferthesam
eam
ountof
data�
Consid
erthetran
smission
ofpdata
aroundamach
ineof
sizen�where
eachdatu
misthewidth
ofthecom
munication
channel�
Atthestart
ofthecom
munication
eachprocessor
hold
sonedatu
m�andat
theendof
thecom
munication
eachprocessor
has
acop
yof
every
datu
m�Wecan
countthecycles
required
foreach
approach
�
approach
parallelism
inst�d
atum
totalinst
neigh
bor�to�n
eighbor
n���n
p���
��n��n
��p
broad
cast�
���p
Theneigh
bor�to�n
eighbor
approach
isthree
tofou
rtim
esfaster
than
broad
casting
becau
seitnever
has
topay
theexpense
ofwaitin
gfor
data
toprop
agateto
allof
the
processors
�theprop
agationdelay
betw
eenneigh
bors
issm
all��Theinstru
ctionsper
datu
mis���n
instead
of��n
becau
seeach
processor
loadstheir
datu
min
parallel
��
��instru
ction�then
perform
srep
eatedshifts
andnon�destru
ctiveread
sof
thecom
�
munication
register�Sim
ilarresu
ltsapplyto
neigh
bor�to�n
eighbor�N
communication
versusmulti�b
roadcast
communication
�
���
VideoI�O
�Line�LockedProgramming
Thevideo
I�Ostru
cture
oftheEngin
eassign
saprocessor
toeach
columnof
the
outputvideo�
Each
processor
isresp
onsib
lefor
supplyingall
oftheoutputpixels
forthecolu
mnof
thedisp
layitisassign
edto�
Theaddition
ofan
�OutputTim
ing
Sequence�
�OTS�allow
sprocessors
tobeassign
edregion
ssom
ewhat
more
arbitrary
than
acolu
mnof
theoutputvideo�
When
thePrin
cetonEngin
ewas
originally
design
editwas
conceiv
edof
only
asa
video
supercom
puter�
Itwas
expected
that
theprocessin
gwould
bevirtu
allyidentical
forevery
pixel�
andthusprogram
sshouldn�toperate
onfram
esof
video�
they
should
operate
onpixels�
andevery
new
pixelwould
constitu
teanew
execu
tionof
thepro�
gram�Theparad
igmisenforced
byresettin
gtheprogram
counter
atthebegin
ning
ofeach
scanlin
e�
Before
theendof
thescan
lineall
oftheprocessors
must
have
processed
their
pixel
andplaced
theresu
ltin
their
video
outputregister�
Becau
setheprogram
counter
isreset
atthebegin
ningof
eachscan
lineall
program
smustinsure
that
their
longest
execu
tionpath
iswith
inthescan
line�in
struction
budget�
of��
instru
ctions
at��M
Hz�
This
necessarily
leadsto
someobscu
recod
ing�
ascom
plex
operation
s
requirin
gmore
than
��instru
ctionsmustbedecom
posed
into
aseq
uence
ofsm
aller
stepsthat
canindividually
occurwith
intheinstru
ctionbudget�
���
Implementatio
n
Theprocessors
arefab
ricatedin
a���
micron
CMOSprocess
which
allowstwo
processors
tobeplaced
onthesam
edie�
Thephysical
con�guration
ofthemach
ineis
��
and video
2 processors w/
640KB
mem
ory
piggy-backed
on each.
I/O chips for
ring comm
unication
512 Processor C
abinet
2 Processor C
abinets + S
equencer
1024 Processor S
ystem
Figu
re���
Prin
cetonEngine�each
ofthecab
inets
show
nisapprox
imately
feet
wide��feet
deep
and�feet
high
�Theprocessors
arecooled
byaforced
airsystem
�
show
ninFigu
re���
Themach
inecon
sistsof�to
�processor
cabinets
andaseq
uencer
cabinet�
Theprocessor
cabinets
hold
��processors
each�alon
gwith
their
associated
mem
ory�Theseq
uencer
cabinet
contain
stheseq
uencer
which
prov
ides
instru
ctions
totheprocessors�
thevideo
inputandoutputhard
ware�
aHiPPiinterface�
anda
dedicated
controller
cardwith
aneth
ernet
interface�
Theseq
uencer
prov
ides
theinstru
ctionstream
totheprocessors�
Serial
control
�ow
�proced
ure
calls�ishandled
bytheseq
uencer�
Condition
alcon
trol�ow
isper�
formed
byevalu
atingthecon
dition
alexpression
onall
theprocessors
inparallel�
Processor
�then
transm
itsits
resultto
theseq
uencer
via
theinterp
rocessorcom
mu�
nication
bus�
Theseq
uencer
bran
ches
condition
allybased
onthevalu
eitsees�
High
�level
program
control
isperform
edvia
aneth
ernet
connection
andaded�
icatedcon
troller�Thecon
trollerallow
sthestartin
gandstop
pingof
theseq
uencer�
andtheload
ingofprogram
sinto
sequencer
mem
oryanddata
into
processor
mem
ory�
Facilities
fornon�video
inputandoutputare
not
supported
inthevideo
processin
g
modeof
operation
�so
they
arenot
discu
ssedhere�
Sim
ilarly�theHiPPiinput�ou
tput
channel
isnot
compatib
lewith
therealtim
evideo
processin
gmodeof
operation
�A
piece
oftheHiPPiim
plem
entation�astatu
sregister
used
for�ow
control�
canbe
��
used
asacru
dedata�p
assingmeth
odbetw
eenthecon
trollerandtheprocessors�
Data
inputthrou
ghthis
channel
islim
itedto
relatively
lowrates
�hundred
sof
values
a
second�as
there
isnohandshakingprov
ided�
Thevideo
inputandoutputchannelare
distrib
uted
throu
ghamech
anism
similar
totheinterp
rocessorcom
munication
register�Over
thecou
rseof
avideo
scanlin
ethe
I�Ochipssit
intheback
groundandshift
incom
posite
video
inputs
andshift
out
�com
ponentvideo
outputs�
Atthestart
ofeach
scanlin
etheuser
program
canread
thevideo
inputregisters�
andbefore
theendof
eachlin
etim
etheuser
program
must
write
thevideo
outputregisters�
��
Chapter�
GraphicsPipelin
e
Poly
gonalren
derin
guses
asynthetic
scenecom
posed
ofpoly
gonsto
visu
alizeareal
sceneor
theresu
ltsof
computation
oranynumber
ofoth
erenviron
ments�
Objects
in
thevirtu
alspace
arede�ned
assets
of�D
poin
tsandpoly
gonsde�ned
with
vertices
atthese
poin
ts�In
themost
straightforw
ardapplication
ofpoly
gonren
derin
g�these
verticesare
projected
tothedisp
layplan
e�th
emonitor�
viaatran
sformation
de�ned
byasynthetic
eyepoin
tand�eld
ofview
�as
show
nin
Figu
re����
Subseq
uently
the
poly
gonsde�ned
bytheset
ofvertices
with
intheclip
volu
meare
rendered
�En�
hancem
ents
tothisbasic
algorithm
inclu
deocclu
sion�ligh
ting�
shading�
andtex
ture
mapping�
Perp
ixelocclu
sioninsures
that
thepoly
gonsclosest
totheobserver
obscu
repoly
�
display plane
frustum
eyepoint
Figu
re���
Synthetic
Eyepoint�
asynthetic
eyepoin
tandfru
stum
de�nes
both
the
setof
visib
leob
jectsandadisp
layplan
e���
gonsfurth
eraw
ay�Occlu
sionhas
been
implem
ented
anumber
ofways�butismost
commonlyim
plem
entedwith
adepthbu�er�
Bycom
putin
gthedepth
ofeach
vertex
ofapoly
gonwecan
interp
olatethese
depthsacross
theinterior
pixels
ofthepoly
gon
tocom
pute
thedepth
ofanypixel�
When
wewrite
apixel
tothedisp
laydevice
we
actually
perform
acon
dition
alwrite�
only
writin
gthepixelifthesurface
fragmentit
represen
tsis closer�
totheeyethan
anysurface
alreadyren
dered
atthat
pixel�
Ligh
tingandshadinggreatly
increase
therealism
ofascen
ebyprov
idingdepth
cues
via
changes
ininten
sityacross
surfaces�
Ligh
tingassu
mes
foranygiv
enpoly
gon
that
itistheonly
poly
gonin
thescen
e�so
there
arenoshadow
scast
bytheobstru
c�
tionof
lightsou
rcesbyoth
erscen
eelem
entsor
re�ection
ofligh
to�
ofoth
ersurfaces�
These
e�ects
arecap
tured
inoth
erapproach
esto
renderin
gsuch
asray
tracing�
and
arenot
generally
implem
ented
inpoly
gonalren
derers�
Shadingisperform
edbyapply�
ingaligh
tingmodel�
such
asthePhongmodel�����
toanorm
alwhich
represen
tsthe
surface
norm
alof
theob
jectbein
gapprox
imated
atthat
poin
t�Theligh
tingmodel
may
either
beapplied
atevery
pixelthepoly
goncovers�
byinterp
olatingthevertex
norm
alsto
determ
ineinterior
norm
als�or
itmay
beapplied
justto
thevertex
norm
als
ofthepoly
gon�andtheresu
ltingcolors
interp
olatedacross
thepoly
gon�Theform
er
approach
�Phongshading�
will
oftenyield
more
accurate
results
andisfree
ofsom
e
artifactspresen
tin
thelatter
approach
�Gourau
dinterp
olation�butitmore
compu�
tationally
expensive�
Figu
res����
���and���
areall
exam
ples
ofdepthbu�ered
�
Phonglit�
Gourau
din
terpolated
poly
gonal
scenes�
Texture
mappingisafurth
erenhancem
entto
poly
gonal
renderin
g�Instead
ofjust
associatingacolor
with
eachvertex
ofapoly
gon�weassociate
a�u�v�coord
inate
in
anim
agewith
eachvertex
�Bycarefu
lassign
mentof
these
coordinates�
wecan
create
amappingthat
places
anim
ageonto
aset
ofpoly
gons�for
exam
ple
apictu
reonto
abillb
oard�carp
etonto
a�oor�
even
aface
onto
ahead
�Texture
mapping�
ina
fashion
analogou
sto
shadinganddepthbu�erin
g�lin
earlyinterp
olatesthetex
ture
coordinates
ateach
vertexacross
thepoly
gon�Thecolor
ofaparticu
larpixelwith
in
thepoly
gonisdeterm
ined
bylook
ingupthepixelin
thetex
ture
map
corresponding
��
Figu
re���
Beethoven�
a�����
poly
gonmodelwith
�����non�cu
lledpoly
gons�
tothecurren
tvalu
esof
uandv�Figu
re���
show
sasam
ple
texture
mapped
scene
alongwith
thecorresp
ondingtex
ture
map
inFigu
re����
���
UniP
rocesso
rPipelin
e
Allof
theabove
iscom
bined
into
anim
plem
entation
ofapoly
gonal
renderer
ina
graphics
pipelin
e��Theseq
uence
ofoperation
sis
apipelin
ebecau
sethey
areap�
plied
sequentially�
Thedepthbu�er
prov
ides
order
invarian
ce�so
theresu
ltsof
one
computation
�poly
gon�have
noe�ect
onoth
ercom
putation
s�
Theuniprocessor
pipelin
econ
sistsof
twostages
geometry
andrasterization
�The
geometry
stagetran
sformsapoly
gonfrom
its�D
represen
tationto
screencoord
i�
nates�
testsvisib
ility�perform
sclip
pingandligh
tingat
thevertices�
andcalcu
lates
interp
olationcoe�
cients
fortheparam
etersof
thepoly
gon�Therasterization
stage
takesthecom
puted
results
ofgeom
etryfor
apoly
gonandren
ders
thepixels
onthe
screen�
Amore
complete
descrip
tionof
theuniprocessor
pipelin
eisnow
presen
ted�
��
Figu
re���
Crocodile�
a������
poly
gonmodelwith
������noncu
lledpoly
gons�
Figu
re���
Teapot�
a�����
poly
gonmodelwith
�����noncu
lledpoly
gons�
��
Figu
re���
Texture
MappedCrocodile�
a������
poly
gonmodelwith
�������
visib
lepoly
gons�
Figu
re���
SampleTexture
Image�
theim
ageof
synthetic
crocodile
skin
applied
tocroco
dile�
��
next polygon
light
clip
YN
visible?
transform
calculate coefficients
Figu
re���
Uniprocesso
rGeometry
Pipelin
e
�����
Geometry
Thegeom
etrystage
canbeform
edinto
apipelin
eof
operation
sto
beperform
ed�as
show
nin
Figu
re���
andexplain
edbelow
�
Transfo
rmThevertices
areprojected
totheuser�s
poin
tofview
�This
involves
translation
�so
the�D
originof
thepoin
tsbecom
estheuser�s
location�and
rotation�so
that
thelin
eofsigh
toftheobserver
correspondsto
lookingdow
nthe
zaxis�bycon
vention
�in
�D�Thistran
sformation
isperform
edwith
amatrix
multip
licationof
theform
����������
x�
y�
z�
�
�����������
����������
ab
ctx
de
fty
gh
itz
��
��
����������� ����������
xyz�
���������������
Theuse
ofahom
ogeneou
scoord
inate
system
allowsthetran
slationto
bein�
cluded
inthesam
ematrix
with
therotation
�
��
Z
Y�z�y�
Z�d
�z��y
�
disp
layplan
e
d�y�z
�
d�y�z
Figu
re���
�DPersp
ectiv
eTransfo
rmatio
n�the�D
caseisanalogou
sto
the�D
case�Here
theaxes
have
been
labeled
tocorresp
ondto
thestan
dard
�Dcoord
inate
system
�Thevertex
eyecoordinates
arenow
projected
tothescreen
�Thisintrod
uces
the
familiar
persp
ectivein
computer
�andnatu
ral�im
ages�where
distan
tob
jects
appear
smaller
than
closerob
jects�Thescreen
coordinates
ofavertex
��s
x �sy �
arerelated
tothetran
sformed
�Dcoord
inates
�x��y
��z��by
sx�dx
�
z�
�����
sy�
dy
�
z�
�����
The�D
to�D
projection
isshow
nin
Figu
re����
The�D
to�D
projection
is
identical�
just
repeated
foreach
coordinate�
Clip
ping�
Rejectio
nOnce
thepoly
gonsare
transform
edto
eyecoordinates
they
mustbeclip
ped
against
theview
ablearea�
Theray
scast
fromtheeyep
ointof
theobserver
throu
ghthefou
rcorn
ersof
theview
port
de�neaview
ingfru
stum�
Allpoly
gonsoutsid
eof
thisfru
stum
arenot
visib
leandneed
not
beren
dered
�
Poly
gonsthat
intersect
thefru
stum
arepartially
visib
le�andare
intersected
with
thevisib
leportion
ofthedisp
layplan
e�or
clipped��
Clip
pingtak
esa
poly
gonandcon
vertsitinto
anew
poly
gonthat
isidentical
with
intheview
ing
��
frustu
m�andisbounded
bythefru
stum�
Back
facecullin
gculls
poly
gonsbased
ontheface
ofthepoly
gonorien
tedtow
ards
theobserver�
Each
poly
gonhas
twofaces�
andof
course
youcan
only
seeoneof
thetwofaces
atanytim
e�For
exam
ple�if
wecon
structed
abox
outof
poly
gons
wecou
lddividethepoly
gonfaces
into
twosets�
aset
offaces
poten
tiallyvisib
le
foranyeyep
ointinsid
ethebox�andaset
offaces
poten
tialvisib
lefor
any
eyepoin
toutsid
eof
thebox�andtheunion
ofthese
setsisall
thefaces
ofall
of
thepoly
gons�
Ifweknow
apriori
that
theview
poin
twill
never
beinterior
to
thebox
weneverhave
toren
der
anyof
thefaces
intheinterior
set�Poly
gons
that
areculled
becau
setheir
interior�
faceistow
ardstheview
erare
saidto
be
back
facing�
andare
back
faceculled
�
Lightin
gLigh
tingis
now
applied
tothevertex
norm
als�A
typical
lightin
gmodel
inclu
des
contrib
ution
sfrom
di�use
�back
groundor
ambien
t�illu
mination
�dis�
tantillu
mination
�likethesun�andlocal
poin
tillu
mination
�likealigh
tbulb��
Poly
gonsare
assigned
prop
ertiesinclu
dingcolor
�RGB��shininess
andspecu
�
laritywhich
determ
inehow
theincid
entligh
tis
re�ected
�See
Foley
andvan
Dam
���pg�
���for
adescrip
tionof
lightin
gmodels�
Thenorm
alassociated
with
eachvertex
may
alsobetran
sformed�depending
ontheview
ingmodel�
Ifthemodelof
interaction
assumes
theview
erismoving
aroundin
a�xed
space�
then
theligh
tingshould
remain
constan
t�as
should
thevertex
norm
alsused
tocom
pute
lightin
g�How
ever�ifthemodel
assumes
that
theob
jectisbein
gmanipulated
byan
observer
whoissittin
gstill
then
the
norm
alsbutalso
betran
sformed�Anorm
alisrep
resented
byavector
����������
nx
ny
nz
�
���������������
��
andistran
sformed
bytheinverse
transpose
ofthetran
sformation
matrix
�
LinearEquatio
nSetup
Rasterization
isperform
edbyiteratin
ganumber
oflin
ear
equation
softheform
Ax�By�C�Thecoe�
cientsofthese
equation
srep
resent
theslop
esof
thevalu
esbein
giterated
alongthexandyaxes
inscreen
space�
Atypical
system
interp
olatesmanyvalu
es�inclu
dingthecolor
ofthepixel�
the
depth
ofthepixel�
andthetex
ture
map
coordinates�
Thecom
putation
ofthese
linear
equation
coe�cien
ts�while
straightforw
ard�is
timecon
suming�
Itreq
uires
both
accuracy
anddynam
icran
geto
encap
sulate
thefullran
geofvalu
esofinterest�
Figu
re���
show
stheform
ofthecom
putation
forthedepth�Thesam
eequation
sare
evaluated
foreach
componentto
be
interp
olatedacross
thepoly
gon�
Dueto
thesim
ilarityof
thework
perform
ed�lots
ofarith
metic�
andin
anticip
a�
tionof
theparallel
discu
ssion�Iinclu
delin
earequation
setupwith
thegeom
etry
pipelin
e�butitcou
ldalso
becon
sidered
thestart
oftherasterization
stage�
Therasterization
stagedescrib
ednextren
ders
theset
ofvisib
lepoly
gonsbyiter�
atingthelin
earequation
scalcu
lateddurin
gthegeom
etrystage�
�����
Raste
rizatio
n
Rasterization
must
perform
twotask
s�It
must
determ
inetheset
ofpixels
that
a
poly
gonoverlap
s�is
poten
tiallyvisib
leat�
andfor
eachofthose
pixels
itmustevalu
ate
thevariou
sinterp
olants
associatedwith
thepoly
gon�
Thetypical
uniprocessor
software
implem
entationof
poly
gonrasterization
ana�
lyzes
thepoly
gonto
beren
dered
anddeterm
ines
foreach
horizon
talscan
linethe
leftmost
andrigh
tmost
edge
ofthepoly
gon�Thealgorith
mcan
then
evaluate
the
intersection
ofthepoly
gonedges
with
anygiven
scanlin
eandrasterize
thepixels
with
in�Thisalgorith
mise�
cient�as
itexam
ines
only
thepixels
with
inthepoly
gon
anditiseasy
todeterm
inetheleft
andrigh
tedges�
��
X
Y
Z
1
2
0
�z�
�x�
�y�
�z�
�y�
�x�
Linear
equation
foran
edge
fromvertex
�to
vertex�
E�x�y�
�Ax�By�C
�����
A�
��y
������
B�
�x�
�����
C�
x��y��y��x
������
�����
Linear
equation
fordepth
Z�x�y�
�Ax�By�C
������
��
�x���y
���y
���x
�������
A�
��z���y
���y
���z
� ���������
B�
��x���z
���z
���x
� ���������
C�
z��A�x��B�y�
������
������
Figu
re���
PolygonLinearEquatio
ns�
Theedge
equation
sare
unnorm
alizedas
only
thesign
ofsubseq
uentevalu
ationsisim
portan
t�Accu
ratemagn
itudeisreq
uired
fortheoth
erequation
s�resu
ltingin
more
complex
expression
sfor
thecoe�
cients�
��
po
sitive
neg
ative
half-p
lane
half-p
lane
Figu
re�����
Tria
ngleDe�nitio
nbyHalf�PlaneInterse
ctio
n�
apositive
and
negative
half�p
laneare
de�ned
byalin
ein
screenspace�
Theinterior
ofapoly
gonis
allpoin
tsin
thepositiv
ehalf�p
laneof
alltheedges�
Diagram
isafter
���
Typical
hard
ware
implem
entation
s�for
exam
ple
����tak
eadi eren
tapproach
�
Each
poly
goncan
bede�ned
astheintersection
ofthensem
i��nite
plain
sboundby
thenedges
ofthepoly
gon�as
explain
edbyPinedain
���Wecan
de�netheequation
ofalin
ein
�DasAx�By�C��andthusahalf�p
laneisgiven
byAx�By�C�
��
With
theapprop
riatechoice
ofcoe�
cients
wecan
insure
that
theequation
sof
the
nsem
i��nite
plain
swill
allbepositiv
ely�or
negatively
valued�with
inthepoly
gon�
Figu
re����
depicts
thisapproach
�
Thisprov
ides
uswith
avery
cheap
test�th
eevalu
ationof
onelin
earequation
per
vertex�to
determ
ineifanypixelisinsid
eof
apoly
gon�Furth
ermore�
wecan
cheap
ly
boundtheset
ofpixels
that
arepoten
tiallyinterior
tothepoly
gonas
theset
ofpixels
interior
totheboundingbox
ofthepoly
gon�
This
approach
results
inexam
ining
more
pixels
then
will
beren
dered
�as
theboundingbox
isgen
erallyacon
servative
estimate
oftheset
ofpixels
inapoly
gon�For
triangles
atleast
half
ofthepixels
exam
ined
will
beexterior
tothepoly
gon�Experim
entally
thishas
been
determ
ined
tobeagood
tradeo
against
thehigh
ercom
putation
alcom
plex
ityofspan
algorithms
becau
seSIM
Dim
plem
entationsbene�tfrom
asim
ple
design
with
fewexcep
tional
cases�
Therasterization
process
isactu
allypoorly
describ
edas
apipelin
e�becau
seitcon
�
sistsof
aseries
ofearly
�aborts�
atleast
inits
serialform
�Figu
re����
show
satypical
view
oftherasterization
process�
Thenext
pixel
operation
takesstep
horizon
tally
until
itreach
estherigh
tsid
eof
theboundingbox�andthen
takesavertical
stepand
��
interior pixel?
pixel visible?
shade pixel
write R
GB
Z
N
compute z
textured polygon?
next polygon
NN
NN
more pixels?
next pixel
NY
N
compute z
interior pixel?
pixel visible?
write R
GB
Z
get texture RG
B
shade pixel
next pixel
more pixels?
Figu
re�����
Uniprocesso
rRaste
rizatio
n
��
returnsto
takinghorizon
talstep
s�Theinterior
pixel
testisthesim
plelin
earequation
testshow
nin
Figu
re�����
Compute
Zevalu
atesthelin
earequation
fordepth
and
thepixel
visibletest
determ
ines
ifthispixeliscloser
totheobserv
erthan
thecurren
t
pixelin
thefram
ebu er�
Ifso
itisthen
shaded
andwritten
tothefram
ebu er�
The
proced
ure
fortex
ture
mapped
poly
gonsisidentical�
excep
tthepixelthat
getsshaded
isretriev
edfrom
atex
ture
map�rath
erthan
just
bein
gthecolor
ofthepoly
gon�
Consid
erable
complex
ityhas
been
sweptunder
therugin
thisview
ofrasteriza�
tion�Most
signi�can
tisthetex
ture
mappingoperation
�which
isrelativ
elycom
plex
inpractice�
Thesim
plest
way
toperform
texture
mappingis�poin
tsam
plin
g��Poin
t
samplin
gtakes
thecolor
ofatex
ture
mapped
pixel
asthecolor
oftheclosest
texel
�texture
pixel�
tothecurren
tinterp
olatedtex
ture
coordinates�
Amore
pleasin
g
meth
oduses
aweigh
tedcom
bination
ofthe�closest
texels
fromthetex
ture
map�A
yetmore
complex
andpleasin
gapproach
isach
ievedwith
MIP�M
apping�
describ
ed
byLance
William
sin
����Allof
these
meth
odsattem
ptto
correcttheinaccu
racyof
poin
tsam
plin
gatran
sformed
image�
Amath
ematically
correctresam
plin
gof
thein�
putim
agewould
involve
complex
�lterin
ganddi�
cultcom
putation
�Approx
imation
meth
odsattem
ptto
correctthesam
plin
gerrors
atlow
ercost�
usually
with
acceptab
le
visu
alresu
lts�
�����
Details
There
areafew
auxiliary
details
which
accompanytheren
derin
gprocess
which
have
been
ignored
sofar�
Most
importan
tly�there
isgen
erallyaway
fortheuser
tointeract
with
thesystem
�A
similarly
importan
tsubject
isdouble�b
u erin
g�
Theparticu
larinputmeth
odprov
ided
bythesystem
isnot
importan
t�just
some
meth
odmust
exist�
These
meth
odsvary
greatly�from
bein
gable
tomodify
the
poly
gondatab
aseandview
matrix
contin
ually
andinteractiv
ely�to
simply
execu
ting
ascrip
tof
pred
e�ned
view
poin
ts�
Double�b
u erin
gisused
tohidethemach
inery
ofpoly
gonren
derin
g�Users
are
�usually
�only
interested
inthe�nal
rendered
image�
not
allthework
that
goesinto
��
it�Tothisend�twofram
ebu ers
areused
�Onecom
pleted
framebu er
isdisp
layed�
while
thesecon
dfram
ebu er
isren
dered
�When
therasterization
ofall
poly
gonsis
complete�th
efram
ebu ers
areexchanged
�usually
durin
gtheverticalretrace
interval
ofthedisp
lay�so
theim
agesdon�t�ick
erdistractin
gly�
���
MultiP
rocesso
rPipelin
e
Theparallel
versionofthegrap
hics
pipelin
edi ers
signi�can
tlyfrom
theuniprocessor
pipelin
e�There
arenow
nprocessors
work
ingon
theren
derin
gprob
leminstead
of
asin
gleprocessor�
andoptim
allywewould
likeafactor
ofnspeed
up�How
dowe
dividethework
amongtheprocessors�
Given
that
wehave
divided
thework
�how
doweexecu
teitin
parallel�
While
these
question
sappear
super�
ciallyobviou
s�they
arenot�
Byparallelizin
gthealgorith
mwehave
exposed
that
itisactu
allyasortin
g
prob
lem�resolved
bythedepth�bu er
intheuniprocessor
system
�butpresu
mably
resolvedin
parallel
inamultip
rocessorim
plem
entation�
Itis
relativelyobviou
show
topartition
thegeom
etrywork
�Instead
ofgiv
ing
ppoly
gonsto
�processor�
givep�n
poly
gonsto
eachprocessor�
andallow
them
to
proceed
inparallel�
Thisshould
yield
afactor
ofnspeed
up�which
isthemost
wecan
hopefor�
Therasterization
stageisless
obviou
s�Dowelet
eachprocessor
rasterize
��nofthepoly
gons�
Dowelet
eachprocessor
rasterize��n
ofthetotal
pixels�
When
apoly
gonhas
toberasterized
doall
processors
work
onitat
thesam
etim
e�
Wewould
alsolike
todividethework
aseven
lyas
possib
leam
ongtheprocessors�
Ifweputmore
work
onanyoneprocessor
wehave
comprom
isedtheperform
ance
of
oursystem
�How
ever�itisnot
alwaysthecase
that
thesam
enumber
ofpoly
gonson
eachprocessor
correspondsto
thesam
eam
ountof
work
�if�
forexam
ple�th
epoly
gons
varyin
size�
Rasterization
isactu
allyasortin
galgorith
mbased
onscreen
locationanddepth�
Ifwedividethework
acrossmultip
leprocessors
wemust
stillperform
this
sorting
operation
�Theaddition
ofcom
munication
totheren
derin
gpipelin
eenables
usto
��
resolvethesortin
gprob
lemandren
der
correctim
ages�that
is�thesam
eim
agesa
uniprocessor
renderer
would
generate�
Becau
setherasterization
sortingissu
ehas
forcedusto
introd
uce
communication
�
our�rst
departu
refrom
theuniprocessor
pipelin
e�itisnatu
ralto
startbydiscu
ssing
itsparallelization
�
�����
Communicatio
n
Given
that
wehave
toperform
communication
toprod
uce
the�nal
image�
wemust
choose
how
weparallelize
therasterization
�This
choice
bears
heav
ilyon
how
the
image
isgen
eratedandwill
determ
inewhat
communication
option
sare
available
to
us�
There
aretwowayswecou
lddividetherasterization
amongtheprocessors�
��give
eachprocessor
aset
ofpixels
torasterize�
or
��give
eachprocessor
aset
ofpoly
gonsto
rasterize�
Ifwechoose
the�rst
option
then
thesortin
gprob
lemis
solved
bymakingthe
pixelsets
disjoin
t�so
asin
gleprocessor
has
alltheinform
ationnecessary
tocorrectly
generate
itssubset
ofthepixels�
Thisreq
uires
communication
before
rasterization�
sothat
eachprocessor
may
betold
aboutall
thepoly
gonsthat
intersects
itspixel
set�These
approach
esare
referredto
as�sort��
rst�and�sort�m
iddle�
becau
sethe
communication
occurseith
er�rst�
before
geometry
andrasterization
�or
inthemiddle�
betw
eengeom
etryandrasterization
�
Ifwechoose
thesecon
doption
then
thepoly
gonsrasterized
onanygiven
processor
will
overlapsom
esubset
ofthedisp
laypixels�
andthere
may
bepoly
gonson
some
other
processor
orprocessors
that
overlapsom
eof
thesam
edisp
laypixels�
Soifwe
parallelize
therasterization
bypoly
gonsrath
erthan
pixels
then
thesortin
gprob
lem
mustbesolv
edafter
rasterization�Thisapproach
isreferred
toas
�sort�last�becau
se
thesortin
goccu
rsafter
both
rasterizationandgeom
etry�
Figu
re����
show
sthese
communication
option
ssch
ematically�
Inall
casesthe
communication
isshow
nas
anarb
itraryintercon
nection
netw
ork�Ideally
wewould
��
G
SO
RT
-FIR
ST
display
polygon database
R
R GR
Distribute P
ixels
SO
RT
-LA
ST
display
polygon database
G
R G
RR G
G
Distribute P
olygons
R
G
RR G
display
G
Distribute P
olygons
SO
RT
-MID
DL
E
polygon database
Figu
re�����
Sort
Order�
somepossib
lewaysto
arrange
thepoly
gonren
derin
gprocess�
Figu
reisafter
����
likethisnetw
orkto
beacrossb
arwith
in�nite
bandwidth�so
that
thecom
munication
stagereq
uires
notim
eto
execu
te�Ofcou
rsein
practice
wewill
never
obtain
such
a
communication
netw
ork�
Sort�
First
Sort��
rstperform
scom
munication
before
geometry
andrasterization
�Thescreen
is
diced
upinto
aset
ofregion
sandeach
processor
isassign
edaregion
orregion
s�and
rasterizesall
ofthepoly
gonsthat
intersect
itsregion
s�Allof
theregion
sare
disjoin
t�
sothe�nal
image
issim
ply
theunion
ofall
theregion
sfrom
alltheprocessors�
andnocom
munication
innecessary
tocom
binetheresu
ltsof
rasterization�
The
communication
stagemustgive
eachprocessor
acop
yof
allthepoly
gonsthat
overlap
itsscreen
area�This
inform
ationcan
only
bedeterm
ined
aftertheworld
�space
to
screen�sp
acetran
sformation
ismade�so
there
isgen
erallysom
enon�triv
ialam
ount
ofcom
putation
donebefore
thecom
munication
isperform
ed�
��
Sort�
Middle
Sort�m
iddle
alsoassu
mes
that
eachprocessor
israsterizin
gauniquearea
ofthe
screen�How
ever�thecom
munication
isnow
doneafter
thegeom
etrystage�
sorath
er
than
communicatin
gthepoly
gonsthem
selves�
wecan
directly
distrib
ute
thelin
ear
equation
coe�cien
tsthat
describ
etheedges
ofthepoly
gonon
thescreen
alongwith
thedepth�shading�
etc�
Sort�
Last
Sort�last
assignseach
processor
asubset
ofthepoly
gonsto
rasterize�rath
erthan
a
subset
ofthepixels�
Asthepoly
gonsrasterized
ondi eren
tprocessors
may
overlap
arbitrarily
inthe�nal
image�
thecom
position
isperform
edafter
rasterization�Each
processor
canperform
thegeom
etryandrasterization
stagewith
nocom
munication
whatsoever�
only
communicatin
gonce
they
have
generated
allof
their
pixels
forthe
outputim
age�
Compariso
n
These
three
sortingchoices
allincurdi eren
ttrad
eo s�andtheparticu
larchoice
of
algorithm
will
dependon
both
details
ofthemach
ine�sp
eed�number
ofprocessors�
etc��andthesize
ofthedata
setto
beren
dered
�
Sort��
rstcom
munication
canexploit
temporal
locality�Generally
aview
erwill
navigate
ascen
ein
asm
oothandcon
tinuousfash
ion�so
theview
poin
tchanges
slowly�
andthelocation
ofeach
poly
gonon
thescreen
changes
slowly�
Ifthelocation
ofpoly
�
gonsischangin
gslow
lythen
thechange
inwhich
poly
gonseach
processor
willrasterize
will
besm
all�andsort��
rstwill
require
littlecom
munication
�Thisismost
obviou
s
iftheview
erisstan
dingstill�
inwhich
casethepoly
gonsare
alreadycorrectly
sorted
fromren
derin
gtheprev
iousfram
eto
render
thecurren
tfram
e�How
ever�sort��
rst
incurs
theoverh
eadof
duplicated
e ort
inthegeom
etrystage�
Every
poly
gonhas
its
geometry
computation
perform
edbyallp
rocessorswhose
rasterizationregion
sitover�
laps�
For
largeparallel
system
s�which
will
have
correspondingly
small
rasterization
��
regions�thiscost
could
becom
erelativ
elyhigh
�
Sort�m
iddle
communication
calculates
thegeom
etryonly
once
per
poly
gonand
thuswill
not
have
theduplication
ofe ort
incurred
bysort��
rst�How
ever�sort�
middlemustcom
municate
theentire
poly
gondatab
asefor
eachren
dered
frame�w
hich
ispresu
mably
substan
tiallymore
expensive
than
ameth
odwhich
exploits
temporal
locality�
Sort�last
communication
tra�cs
inpixels
instead
ofpoly
gons�
andhas
thein�
terestingprop
ertythat
theam
ountof
communication
iscon
stant�
Each
processor
will
have
tocom
municate
nomore
than
ascreen
�sworth
ofpixels
tocom
posite
the
�nal
image�
soas
thepoly
gondatab
asegrow
sarb
itrarilylarge
sort�lastwill
o er
the
cheap
estcom
munication
�Alth
ough
thiscon
stantboundon
communication
isusefu
l�
itisshow
nlater
tobeavery
largeam
ountof
communication
compared
tosort��
rst
andsort�m
iddlefor
ourscen
esof
interest�
Wehave
laidafram
ework
forthediscu
ssionto
follow�and�while
there
areinterac�
tions�theactu
algeom
etryandrasterization
stagesare
tosom
eexten
tindependentof
thecom
munication
topology�
Thecom
munication
topology
will
determ
ineprecisely
what
poly
gonsare
computed
onwhat
processor
foreach
stage�butwill
have
little
e ect
onthenatu
reof
thecom
putation
itself�Wewill
defer
theactu
alchoice
of�and
justi�
cationfor�
acom
munication
topology
until
Chapter
��
�����
Geometry
Thegeom
etrycalcu
lationscon
sistsof
�vedistin
ctstep
s�
�tran
sform
�clip
�reject
�project
toscreen
coordinates
�ligh
t
�lin
earequation
setup
�
clip/reject
transform
next polygon
add to visible list
Nproject to screen
next visible polygon
calculate coefficients
light
Y
Figu
re�����
SIM
DGeometry
Pipelin
e�thepipelin
eissplit
betw
eenclip
pingand
lightin
g�
Each
ofthese
stageshas
anupper�b
oundon
theam
ountof
work
perform
edper
poly
gon�
Thecalcu
lationsto
beperform
edfor
eachpoly
gon�while
execu
tingon
di eren
tdata
andyield
ingdi eren
tresu
lts�are
identical
algebraically�
Theexecu
tion
ofeach
processor
over
itssubset
ofthepoly
gons�
wheth
erobtain
edfrom
theinput
poly
gonal
datab
asedi eren
tly�or
fromtheresu
ltsof
sort��rst
communication
�will
beidentical�
andtheexecu
tiontim
ewill
bedeterm
ined
bytheprocessor
with
most
poly
gonsto
process�
Often
aftertheclip
�rejectstage
avery
largenumber
ofprocessors
will
have
dis�
carded
their
poly
gonandhave
noth
ingto
dowhile
theoth
erprocessors
compute�
Thiswill
achieve
poor
utilization
oftheprocessors
andaless
than
optim
alspeed
up
��
forthegeom
etrystage�
Arev
isedgeom
etrypipelin
eapprop
riatefor
parallel
execu
�
tionisshow
nin
Figu
re�����
Bybreak
ingthegeom
etrypipelin
einto
twosep
arate
pieces
andqueuein
gthework
betw
eenthem
wewill
�rst
perform
work
prop
ortional
tothemaxim
um
number
ofpoly
gonson
anyprocessor�
then
work
prop
ortional
to
themaxim
um
number
ofvisible
poly
gonson
anyprocessor�
Thecost
ofthesecon
d
stageof
thepipelin
eper
poly
gonwill
prove
tobesubstan
tiallyhigh
erthan
the�rst
stage�so
ifthenumber
ofvisib
lepoly
gonsismuch
smaller
than
thetotal
number
of
poly
gonsthesav
ings
could
besubstan
tial�Section
���prov
ides
adiscu
ssionof
this
implem
entation
issue�
�����
Raste
rizatio
n
Thesim
plerasterization
algorithm
suggested
inFigu
re����
has
thesam
eim
plem
en�
tationprob
lemsas
thegeom
etrystage�
Ifthenumber
ofpoly
gonsvaries
signi�can
tly
acrossprocessors
theam
ountof
work
tobedoneon
eachprocessor
will
alsovary
signi�can
tly�This
prob
lemiscom
pounded
bythevary
ingsizes
ofpoly
gons�
Ifwe
assumeouralgorith
mcan
rasterizeapixelin
somecon
stantam
ountoftim
e�wewould
likeourtim
eto
rasterizeapoly
gonto
beprop
ortional
tothearea
ofthepoly
gon�
How
ever�theuniprocessor
implem
entationhas
aforced
serializationof
thepoly
gons�
which
introd
uces
anexpensiv
esynchron
izationpoin
tin
thealgorith
m�Thissynchro�
nization
forcesall
processors
torasterize
their
poly
gonin
theworstcase
time�
To
achieve
ane�
cientexecu
tionitwill
benecessary
todecou
pletherasterization
process
fromthespeci�
cpoly
gonbein
gprocessed
�
Figu
re����
show
sasuggested
implem
entation
foramore
e�cien
tparallel
ren�
derer�
Thedouble
loopin
theuniprocessor
implem
entation�show
nin
Figu
re�����
has
been
replaced
with
asin
gleloop
�andthetest
casehas
been
mademore
com�
plex
�Thisdecou
ples
thesize
di eren
cesin
poly
gonsbetw
eenprocessors
atthecost
ofmakingeach
iterationof
theloop
more
expensiv
e�
Di eren
cesin
thesize
ofpoly
gonsasid
e�there
isreason
tobeliev
ethere
will
be
substan
tialdi eren
cesin
thenumbers
ofpoly
gonson
eachprocessor�
Ifweuse
a
��
next polygonnext pixel
more pixels?
NY
get texture RG
B
textured polygon?
Y
NY interior pixel?
Npixel visible?
compute z
N
shade pixel
write R
GB
Z
Figu
re�����
SIM
DRaste
rizatio
n�therasterization
process
onaSIM
Dmulti�
processor�
Thedouble
loopisrep
lacedwith
asin
gleloop
todecou
pletheprocessor
execu
tiontim
efrom
poly
gonsize
di�eren
ces�
sort��rst
orsort�m
iddle
communication
structu
rethen
anyim
balan
cesin
thedis�
tribution
ofpoly
gonsacross
thescreen
�which
there
oftenare
will
bere�
ectedin
imbalan
cesin
thenumber
ofpoly
gonsoverlap
pingeach
processor�s
rasterizationre�
gion�lead
ingto
furth
erload
imbalan
ces�Thissuggests
that
analgorith
mbased
on
sort�lastcom
munication
�which
canassign
poly
gonsto
processors
arbitrarily�
could
perform
signi�can
tlybetter
than
either
sort��rst
orsort�m
iddlecom
munication
�at
leastdurin
gtherasterization
stage�
���
Summary
Themajor
function
alityof
apoly
gonren
derin
gpipelin
ehas
been
describ
ed�Particu
�
laratten
tionhas
been
paid
tothereq
uirem
entsfor
ane�
cientSIM
Dim
plem
entation
�
Ingen
eralan
attemptismadeto
queueupasizab
leportion
ofwork
foreach
stage
ofcom
putation
onall
processors
tominim
izethenumber
ofidle
processors
atany
poin
tin
time�
Ine�ect
ane�
cientSIM
Dgrap
hics
pipelin
ewill
move
allpoly
gons
simultan
eously
throu
ghthepipelin
e�so
that
statisticallyeach
processor
should
be
keptbusy
desp
itepoly
gonbypoly
gonvariation
sin
theam
ountof
work
tobedone�
Unfortu
nately
lackof
uniform
ityin
thepoly
gondatab
ase�particu
larlywith
re�
gardsto
asymmetric
distrib
ution
ofpoly
gonsover
thescreen
�can
resultin
di�
cult
loadbalan
cingprob
lems�Thenextchapter
will
discu
ssthescalab
ilityof
thedi�eren
t
communication
schem
eswith
aneyeboth
totheir
inheren
te�
ciency
andtheir
e�ects
onthee�
ciency
ofthegeom
etryandrasterization
stages�
��
Chapter�
Scalability
While
weare
interested
speci�
callyin
afast
poly
gonren
derin
galgorith
mfor
the
Prin
cetonEngin
e�weare
more
generally
interested
inafast
algorithm
forarb
itrary
parallel
computers�
Inparticu
larwewould
likeasca
lable
algorithm
sowecan
readily
constru
ctbigger
andprop
ortionally
fastersystem
s�Optim
allywewould
likealin
ear
speed
upin
thenumber
ofprocessors�
soouralgorith
mmust
beO�p�n
�Twoforces
will
preven
tusfrom
actually
achiev
ingthisperform
ance�
Communicatio
nCom
munication
isnecessary
inaparallel
renderer
tocom
binethe
results
oftheparallel
execu
tionsof
thealgorith
m�Com
munication
not
only
represen
tsacost
non�ex
istentin
theuniprocessor
case�it
will
alsoprove
to
bean
O�p
operation
forourim
plem
entation
�so
communication
will
cometo
dom
inate
perform
ance
foralarge
enough
system
�O��
solution
sexist�
butare
extrem
elyexpensive�
requirin
gtim
eprop
ortional
tothenumber
ofpixels
inthe
disp
lay�
Imperfe
ctloadbalancingToobtain
perfect
linear
speed
upwemustbeableto
per�
fectlydividethework
amongtheprocessors�
Ingen
eralthere
areunavoid
able
di�eren
cesin
computation
alload
betw
eenprocessors�
Manyren
derin
gsystem
s
attackthis
prob
lembysubdividingtheprob
lemto
�ner
levels
ofparallelism
which
aremore
easilybalan
ced�butfor
largesystem
sthese
regionsmay
be�
��
comearb
itrarilysm
all�andthesubdivision
itselfwill
startto
incursubstan
tial
overhead
�
While
wedesire
linear
perform
ance
improvem
entsin
thenumberofprocessors�
we
alsodesire
thefastest
possib
lesolu
tionfor
ourparticu
lararch
itecture�
thePrin
ceton
Engin
e�In
particu
lar�thegoal
ofinteractiv
efram
erates
will
preclu
detheuse
of
existin
glin
earlyscalab
lesolu
tions�
For
exam
ple�
sort�lastcom
munication
requires
constan
ttim
e�independentof
thenumber
ofprocessors�
Unfortu
nately
itwill
prove
proh
ibitively
expensive
forourscen
esof
interest�
Inthischapter
wewill
discu
sstheexpected
scalability
ofthegeom
etry�com
muni�
cationandrasterization
stages�In
addition
wewill
prov
ideafram
ework
forpick
ing
themost
e�cien
tcom
munication
strategyin
general�
andspeci�
callyapply
itto
the
Prin
cetonEngin
e�Thechoice
ofsort
order
will
play
anim
portan
trole�
not
only
bydeterm
iningthetotal
amountof
timespentperform
ingcom
munication
andthe
scalability
ofthealgorith
m�butalso
bya�ectin
gtheload
balan
ceandduplication
of
e�ort
inthegeom
etryandrasterization
stages�
���
Geometry
Thegeom
etrycalcu
lationsare
iden
ticalfor
everypoly
gon�Som
epoly
gonswill
be
culled
becau
sethey
areoutof
theview
ingfru
stum�or
rejectedbecau
sethey
are
back
facing�
butas
discu
ssedin
x�����thiscan
easilybecou
ntered
bybreak
ingthe
geometry
pipelin
einto
twopieces
andqueuein
gthevisib
lepoly
gonsbetw
eenthese
stages�Thegeom
etrycalcu
lationsreq
uired
forapoly
goniscon
stant�so
weshould
be
able
todistrib
ute
poly
gonsover
processors
uniform
ly�andobtain
perform
ance
that
isO�p�n
for
thegeom
etrystage�
��
���
Raste
rizatio
n
Thedriv
ingfactor
ofouranaly
sisof
parallelism
has
been
thenumber
ofprocessors
involved
�particu
larlylarge
numbers
ofprocessors�
Asthenumber
ofprocessors
in�
volvedbecom
eslarger
thedivision
ofthework
�assumingacon
stantam
ountof
work
tobedone
requires
a�ner
and�ner
granularity
oftheprob
lem�Even
forvery
large
numbers
ofprocessors
�thousan
ds there
isgen
erallyenough
parallelism
availabledur�
ingthegeom
etrystage
tokeep
them
alloccu
pied
�Thisisnot
asread
ilytru
efor
the
rasterizationstage�
Ifweassu
meapoly
gonparallel
implem
entation
ofrasterization
�wegive
each
processor
p�nof
thepoly
gonsto
rasterize�an
dhave
thusassu
med
asort�last
archi�
tecture
andwith
afew
weak
assumption
saboutthedistrib
ution
ofpoly
gonsand
their
sizes�weobtain
asystem
that
exhibits
excellen
tparallelism
�Unfortu
nately
we
will
soonsee
that
sort�lastalgorith
msare
proh
ibitiv
elyexpensiv
eon
thePrin
ceton
Engin
e�Thisleaves
sort��rst
andsort�m
iddlealgorith
msto
becon
sidered
�Both
of
these
assignaregion
ofthescreen
toeach
processor
forrasterization
�rath
erthan
somenumber
ofprim
itives�
andthusreq
uire
image
parallel
rasterization�
Sort��
rstandsort�m
iddledividethescreen
into
multip
leregion
sandassign
them
totheprocessors
insom
efash
ion�Asdiscu
ssedin
thefollow
ingsection
s�sort��
rst
andsort�m
iddleare
most
e�cien
tfor
small
numbers
ofprocessors
andsm
allnumbers
ofpoly
gons�
Inparticu
larthecom
munication
timefor
sort�middleproves
tobelin
ear
inthenumber
ofpoly
gonsandindependentof
thenumber
ofprocessors�
Sort��
rst
introd
uces
redundantcalcu
lationwhich
increases
with
thenumber
ofregion
sthe
screenistiled
into�
andthuswith
thenumber
ofprocessors�
Soboth
sort��rst
and
sort�middlelack
scalability
foroursystem
sof
interest�
Sort��
rstandsort�m
iddlealso
a�ect
thee�
ciency
oftherasterization
stage�Ifwe
staticallydividethescreen
intonregion
sthese
regionswill
becom
equite
small
�tens
ofpixels
forlarge
n�Unfortu
nately
thepoly
goncom
plex
ityof
ascen
eisusually
not
distrib
uted
uniform
lyover
thedisp
lay�so
someregion
swill
have
veryfew
poly
gons�
��
while
other
regionswill
have
many�
Thiscan
leadto
substan
tialrasterization
load
imbalan
cesbetw
eenprocessors�
How
ever�ifweim
plem
entsort�last
communication
thisprob
lemisavoid
ed�as
wewillassign
poly
gonsrath
erthan
pixels
toeach
processor�
Chapter
�discu
ssesahybrid
algorithmentitled
�sort�twice�
that
combinesthelow
costof
sort�middlefor
small
numbers
ofpoly
gonsandprocessors
with
theexcellen
t
scalability
ofsort�last
toobtain
asystem
more
e�cien
tthan
either
approach
for
scenes
often
sof
thousan
dsof
poly
gons�
���
Communicatio
n
Thecom
munication
stageacts
asacrossb
ar�com
municatin
gtheresu
ltsof
onephase
ofcom
putation
toanoth
er�andmay
beplaced
inthree
distin
ctplaces
intheren
derin
g
algorithm�as
show
nsch
ematically
inFigu
re�����
Sort��
rstandsort�m
iddle
both
communicate
poly
gons�
orsom
epartially
computed
represen
tationof
them
�while
sort�lasttra�
csin
pixels�
Neith
erthescalab
ilitynor
thee�
ciency
ofanyof
thesort
option
sappear
in�
tuitively
obviou
s�Asevidenced
bytherasterization
discu
ssiontheord
eringof
the
communication
will
have
e�ects
onboth
therasterization
andgeom
etrystages�
A
modelisintrod
uced
toquantify
anumber
ofasp
ectsof
these
system
sandprov
idea
basis
fortheir
comparison
�
�����
Model
Themodelabstracts
quite
faraw
ayfrom
theactu
alren
derin
gprocess�
Anattem
pt
has
been
madeto
capture
allof
the�rst
order
e�ects
incom
munication
andas
manyof
thesecon
dord
ere�ects
asdeem
edpractical�
Wewill
make
anumber
of
simplify
ingassu
mption
s�andnote
when
weare
bein
goverly
optim
isticor
pessim
istic
inourassu
mption
s�Wehave
foundthat
evenwith
crudeassu
mption
sthemodelcan
prov
ideaclearly
preferab
lecom
munication
strategy�Becau
sethecom
munication
structu
rehas
asign
i�can
tim
pact
onthegeom
etryandrasterization
stagesthey
are
��
variable
de�nition
pnumber
ofpoly
gons
nnumber
ofprocessors
Gtim
eto
perform
geometry
per
poly
gonR
timeto
rasterizeapixel
fasy
mmetry
factor�ratio
ofmaxim
um
number
ofpoly
gonsin
aregion
ofthescreen
totheaverage
number
ofpoly
gonsper
region�
onumber
ofregion
sthat
apoly
gonoverlap
sQ
areaof
screenin
pixels
Aaverage
areaof
apoly
gonin
pixels
Ctim
eto
communicate
adatu
mfrom
neigh
bor�to�n
eighbor�
Di�eren
tfor
eachcom
munication
structu
re�
Table����
Modelin
gVaria
bles
inclu
ded
inthemodel�
Table���
presen
tsourvariab
lesfor
modelin
gthecom
munication
prob
lem�Wewill
modeltheren
derin
gprocess
asoccu
rringon
amach
inewith
nprocessors�
renderin
g
appoly
gonscen
e�
Wemodelthegeom
etrystage
asreq
uirin
gtim
eGto
execu
teper
poly
gon�andthe
totaltim
espentwill
beG
timesthemaxim
umnumber
ofpoly
gonson
anyprocessor�
Sim
ilarly�therasterization
stageismodeled
asreq
uirin
gtim
eR
torasterize
apixel�
andwill
require
atotal
timeequal
tothemaxim
um
number
ofpixels
rasterizedby
anyprocessor�
timesR�
For
eachcom
munication
schem
ewewill
determ
inetheam
ountof
data
qthat
it
communicates
andthedistan
cedthat
eachdatu
mmusttravel�
Thecom
munication
topology
isa��D
ringso
fore�
ciency
reasonswewill
consid
erneigh
bor�to�n
eighbor
communication
�Passin
gadatu
mfrom
processor
nato
processor
nbwill
therefore
require
jna�nb jpasses�
sotim
eisprop
ortional
todistan
ce�
Allprocessors
may
pass
adatu
mto
their
neigh
bor
simultan
eously�
soat
anypoin
t
intim
endata
canbein
communication
�Sothetotal
amountof
timereq
uired
to
communicate
thedata
betw
eenprocessors
isO�q
�d�n �each
datu
mtakes
time
prop
ortionaldto
communicate�
wecan
communicate
nof
them
simultan
eously�
and
�
there
areatotal
ofqdata
todistrib
ute�
Ingen
eralthedistrib
ution
ofpoly
gonsover
thescreen
isnot
uniform
�Thede�
viation
fromuniform
ity�expressed
asthemaxim
um
number
ofpoly
gonsin
ascreen
regiondivided
bytheaverage�
isf�In
addition
poly
gonscan
overlapmore
than
a
single
region�Theaverage
number
ofregion
sapoly
gonoverlap
siso�
Note
that
the
sizeof
aregion
isdecreasin
gin
n�becau
sethescreen
must
besubdivided
furth
er
andfurth
erto
share
thework
amongall
processors
asnincreases�
Asthesize
ofthe
regionsdecreases
both
fandowill
increase�
fwill
increase
becau
sethesam
plin
gof
smaller
andsm
allerregion
sof
thescreen
will
expose
largervariation
sin
thepoly
gon
distrib
ution
�Likew
ise�as
thesize
oftheregion
sdecrease
eachpoly
gonwill
overlap
more
regions�soowill
increase�
Thedependence
offandoon
nwill
have
importan
t
e�ects
onthee�
ciency
ofthedi�eren
tcom
munication
approach
es�
Wewill
now
analy
zethecom
munication
option
susin
gthisfram
ework
toquantify
their
perform
ance�
Note
that
optim
allywewould
likean
algorithm
which
runsin
time
toptim
al�
pn�G
�AR
����
sothat
asweincrease
thenumberofprocessors
weobtain
aperfect
linear
speed
up�
Thisassu
mes
that
weperform
nocom
munication
�that
wearen
�tpenalized
bythe
overlapfactor�
andthere
isnoasy
mmetry
inthedistrib
ution
ofload
acrossthepro�
cessors�Thefollow
inganaly
siswill
show
how
eachof
thecom
munication
algorithms
violates
these
assumption
sin
someway�
�����
Sort�First
Sort��
rstplaces
thecom
munication
stageas
earlyas
possib
lein
thealgorith
m�ac�
tually
preced
ingthegeom
etrystage�
Com
munication
will
place
eachpoly
gonon
the
processor�s
responsib
lefor
rasterizingit�
Iftheset
ofprocessors
responsib
lefor
renderin
gapoly
gonwillch
ange
slowly�
sort�
��
�rst
canleverage
thistem
poral
localityto
perform
lesscom
munication
�Modelin
gthe
exploitation
oflocality
isdi�
cultwith
ourmodel�
andwewill
make
theoptim
isticas�
sumption
that
sort��rst
algorithmsoperate
with
nocom
munication
�Thisassu
mption
proves
tobeusefu
l�as
thesort��
rstapproach
stillincurssubstan
tialoverh
eadover
a
uniprocessor
pipelin
ewhich
alonecan
beused
tocom
pare
itto
other
approach
es�In
particu
lar�byplacin
gcom
munication
before
geometry
eachprocessor
that
apoly
gon
overlapswillh
aveto
perform
thegeom
etrycom
putation
sfor
thepoly
gon�Theoverlap
factoroexpresses
thisduplication
�Furth
ermore
anyasy
mmetries
inthedistrib
ution
ofpoly
gonsover
thescreen
regions�f�will
beexposed
tothegeom
etrystage�
sowe
will
spendtim
eGfopnperform
inggeom
etrycom
putation
s�
Therasterization
stagewill
alsobepenalized
bytheasy
mmetry
factorf�andthe
worst
caseprocessor
will
have
torasterize
fpnpoly
gons�
How
ever�theprocessor
will
only
have
torasterize
A�o
pixels
ofeach
poly
gon�for
atotal
rasterizationtim
eof
AR
fo
pn�
Thetotal
timefor
sort��rst
communication
isthen
tsf�
pnf�oG
�RAo
����
which
isnear
optim
alfor
fandoclose
to��
Weexpect
both
fandoto
be
increasin
ginnhow
ever�makingthescalab
ilityof
sort��rst
communication
question
�
able�
Ouranaly
sisin
addition
has
completely
ignored
thecom
munication
overhead
�
which
ispresu
mably
non�zero�
�����
Sort�Middle
Sort�M
iddle
perform
sthegeom
etrycalcu
lationsandthen
distrib
utes
computed
re�
sults
totheprocessors
responsib
lefor
rasterization�Nolocality
isexploited
�so
the
processor
which
perform
sthegeom
etrycom
putation
sisuncorrelated
with
theras�
terizationprocessor
andtheexpected
distan
ceadatu
mwill
travelisn���
Inreality
manypoly
gonswill
berasterized
bymore
than
oneprocessor�
Theoverlap
factoro
��
will
determ
inehow
manyprocessors
eachpoly
gonwill
berasterized
by�andthushave
tobecom
municated
to�In
general
there
isnoreason
tobeliev
ethat
theprocessors
will
belocal
toeach
other�
sotheactu
aldistan
cemay
approach
nas
oincreases�
We
will
pessim
isticallyassu
med�nhere�
Sort�m
iddleonly
computes
thegeom
etryonce
foreach
poly
gon�so
itavoid
sthe
penalty
ofduplicated
work
incurred
insort��
rst�Likesort��
rsthow
ever�itpaysthe
asymmetry
penalty
inscreen
poly
gondistrib
ution
durin
grasterization
�
Ifwecall
thetim
eto
pass
thecom
puted
results
ofapoly
gonfrom
oneprocessor
toits
neigh
bor
Csmthen
communication
will
require
time
pn�n�C
smRasterization
timeisidentical
tothesort��
rstcase�
sowehave
atotal
timeof
tsm�
pn��fRo �pC
sm
����
Note
that
thetotal
communication
timefor
sort�middle
ispC
smwhich
isinde�
pen
den
tof
thenumber
ofprocessors�
soas
ngets
largesort�m
iddlealgorith
mswill
spendall
their
timecom
municatin
g�
�����
Sort�Last
Sort�L
astisabitof
awild
card�Itdoesn
�tcom
municate
poly
gondata�
butinstead
communicates
pixels�
therasterization
results�
Each
processor
renders
somearb
itrary
subset
ofthedatab
ase�andsortin
gisperform
edafter
rasterization�as
these
subsets
will
overlapin
somearb
itraryway
inthe�nal
image�
There
aretwowaysto
thinkaboutcom
positin
gthisinform
ation�Therasterization
results
could
either
becon
sidered
asafram
ebu�er
oneach
processor�
oras
aset
of
poly
gonfragm
entson
eachprocessor�
Theform
erview
suggests
that
wecom
binethe
framebu�ers
fromallp
rocessorsto
obtain
a�nalfram
ebu�er�
Thelatter
view
suggests
wecom
binetheren
dered
pixels
fromeach
processor
togen
erateafram
ebu�er�
These
twopossib
ilitiesare
referredto
respectiv
elyas
�dense��
becau
seitcom
municates
an
entire
framebu�er
fromeach
processor�
and�sp
arse��becau
seitcom
municates
only
��
thepixels
rasterizedon
eachprocessor�
Inboth
casesthetotal
geometry
andrasterization
work
isp�G
�AR��n
�The
asymmetry
factorfisnot
exposed
becau
sewecan
evenly
dividethepoly
gonsover
processors�
andtheoverlap
factoroisnot
exposed
becau
seweallow
processors
to
rasterizeoverlap
pingregion
s�Ign
oringcom
munication
overhead
�sort�last
o�ers
the
linear
scalability
wehave
been
seeking�
andin
factprov
ides
theoptim
alparallel
speed
up�
Dense
sort�lastcom
municates
ascreen
sworth
ofpixels
fromeach
processor�
How
�
ever�each
pixel
only
has
tobesen
tto
theprocessors
neigh
bor�
Upon
receiptof
a
pixeltheprocessor
either
passes
thepixelon
tothenextprocessor
ifitocclu
des
the
processors
pixel�
orinstead
forward
sits
ownpixel�
Section
��prov
ides
anexam
ple
ofthiscom
munication
structu
re�andamore
carefulexplan
ationof
whythedistan
ce
eachpixelmusttravel
is��
Thescreen
isatotal
ofQ
pixels�
sodense
sort�lastmust
communicate
atotal
ofnQ
totalpixels�
npixels
arecom
municated
inparallel�
andeach
pixel
travelsa
distan
ceof
��so
thetotal
communication
timeisQCsld
ense �
Aco
nsta
ntam
ountof
timeisspentin
dense
sort�lastcom
munication
�independentof
both
thenumber
of
poly
gonsandthenumber
ofprocessors�
Sparse
sort�lastlack
sacom
plete
framebu�er
oneach
processor�
Ifweassu
me
thefram
ebu�er
forthe nal
complete
image
isdistrib
uted
overall
theprocessors
then
eachprocessor
has
tocom
municate
eachof
itsrasterized
pixels
tosom
eoth
er
arbitrary
processor�
Thustheexpected
distan
ceapixelwill
travelispessim
istically
n�Each
poly
gonhas
anaverage
areaA�andthere
areatotal
ofppoly
gons�so
there
areAppixels
tocom
municate�
Wecan
communicate
npixels
inparallel�
eachover
a
distan
cen�for
atotal
communication
timeofApC
slsparse �
Dense
sort�lastreq
uires
atotal
amountof
time
tsld
ense�
pn�G
�AR��QCsldense
�����
toren
der
ascen
e�while
sparse
sort�lastreq
uires
time
�
tsls
parse�
pn�G
�AR��ApC
slsparse
�����
toren
der
ascen
e�Dense
sort�lasto�ers
communication
that
iscon
stantover�
head
�while
sparse
sort�lastclosely
resembles
sort�middle�
butlack
stheasy
mmetry
penalties
in�icted
onrasterization
insort�m
iddlestru
ctures�
���
Analysis
Thusfar
ourmodelhas
only
assumed
that
ourexecu
tiontim
esare
bounded
bythe
sum
oftheworst
casetim
esof
eachstage�
Thisistru
efor
anySIM
Darch
itecture�
Wewill
now
quantify
eachof
thevariab
lesin
themodelfor
thePrin
cetonEngin
e�In
somecases
wewill
use
actual
gures
obtain
edbyexam
iningan
implem
entation
�and
inoth
ercases
represen
tativenumbers
will
beused
�
Determ
ination
ofthecost
ofcom
munication
ofcou
rsereq
uires
amodel
ofthe
actualcom
munication
mech
anism
�OnthePrin
cetonEngin
eeach
processor
canpass
asin
gle��b
itdatu
mto
itsneigh
bor
in�operation
s�write
thecom
munication
register�perform
sthepass�
andread
thecom
munication
register�Passin
gadatu
m
ofsize
awill
require
i��
�a�b
����
where
bisan
overhead
factorrelatin
gto
setupandtest
ofthereceiv
eddatu
m�The
factorof
�re�
ectsthat
passin
gasin
gleinteger
throu
ghthecom
munication
register
requires
writin
gtheregister�
perform
ingthepass�
andread
ingtheregister�
Dueto
resource
constrain
tsthese
operation
scan
tbepipelin
edover
eachoth
er�
There
are�distin
ctcom
munication
approach
esto
analy
ze�Sort�
rstandsort�
middle
both
deal
inpoly
gons�or
computed
results
thereof�
andsort�last
deals
in
pixels�
Agood
estimate
ofthesize
ofthedatu
mfor
eachisnecessary
todeterm
ine
realisticexecu
tiontim
es�Apoly
gonisdescrib
edinourim
plem
entation
by��
integers�
��
approach
datu
msize
overhead
C
sort� rst
����
���sort�m
iddle
����
��sort�last
full
���
��sparse
���
��
Table
����Communicatio
nCosts�
comparison
ofdatu
msizes
fordi�eren
tcom
�munication
approach
es�
while
apoly
gondescrip
tor�th
esort�m
iddle
datu
m�is��
integers�
Dense
sort�last
requires
thecom
munication
ofthree
colorcom
ponents
of��b
itseach
anda���b
itz
value�
Asparse
sort�lastapproach
alsoreq
uires
eachpixelto
betagged
with
screen
coordinates�
foran
addition
altwointegers�
Theoverh
ead gures
anddatu
msize
arebased
onan
implem
entation
forsort�
middle�
Thesort�
rstdatu
msize
isthesize
ofapoly
gonin
ourdatab
ase�andthe
overhead
gure
istak
enas
equal
tothesort�m
iddlecase
becau
sethey
arecom
para�
ble
operation
s�Sort�last
has
been
approx
imated
andisbased
onexperien
ce�The
overhead
gures
re�ect
that
durin
gthedense
approach
thecoord
inates
ofthenext
pixelreceived
areknow
napriori�
soless
computation
need
sto
beperform
edon
each
receivedpixels�
Sparse
sort�laston
theoth
erhandreq
uires
afram
ebu�er
address
cal�
culation
foreach
received
pixel�
andtheunpred
ictability
ofthepixels
received
makes
overlappingoperation
sdi�
cult�
Ourmodelin
gmadeanumber
ofassu
mption
s�Thesim
ulator
was
instru
mented
tomeasu
reanumberoftherelevan
tquantities�
andveri
edourassu
mption
s�In
par�
ticular�
Figu
re���
show
sthedistrib
ution
ofpoly
gonsover
processors
forasort�
rst
algorithm�andin
particu
lardem
onstrates
thesev
ereload
balan
ceprob
lemsresu
lting
fromtheob
jectdetail
bein
gcen
teredin
thescreen
�Figu
re���
veri esourassu
mption
that
thenumber
ofcom
munication
operation
sfor
sort�middlewould
dependlin
early
onthenumber
ofpoly
gons�andindependentof
thenumber
ofprocessors�
Given
thevalu
esof
theparam
eterscom
mon
toall
thealgorith
ms�
speci
edin
��
variable
value
pleft
unspeci
edfor
amore
general
analy
sisn
����thenumber
ofprocessors
inourPrin
cetonEngin
eG
����obtain
edfrom
ourim
plem
entationA
���thecom
mon
benchmark
is����p
ixelpoly
gons
R���
obtain
edfrom
ourim
plem
entationQ
�����thescreen
is��x
���pixels
f���
taken
fromthesam
pleteap
otscen
eo
���tak
enfrom
thesam
pleteap
otscen
e
Table����
Modelin
gValues
0 50
100
150
200
250
300
350
400
0256
512768
1024
polygons
processor
Figu
re����
Sort�
First
LoadIm
balance�thenumber
ofpoly
gonsper
processor
variesdram
atically�Theworst
caseprocessor
has
���poly
gonsintersectin
gits
region�
while
theaverage
caseis����
poly
gons��
��
0
0.2
0.4
0.6
0.8 1
1.2
0512
10241536
2048
shifts/polygon
# processors
beethovencrocodile
teapot
Figu
re����
Sort�
Middle�thenumber
ofshifts
per
poly
gonisthetotal
number
ofshifts
required
todistrib
ute
thepoly
gons�divided
bythetotal
number
ofpoly
gons�
Themargin
alcost
ofcom
municatin
gapoly
gonisapprox
imately
constan
tat
�shift
per
poly
gon�
��
Table
����andthesize
ofthedatu
meach
communication
structu
reuses�
givenin
Table����
wenow
perform
comparison
sbetw
eenthese
communication
approach
esto
determ
inewhich
ismost
practical
onthePrin
cetonEngin
e�
�����
Sort�
First
vs�
Sort�
Middle
Ifwecom
pare
sort� rst
andsort�m
iddlewesee
that
sort�middleisfaster
than
sort�
rst
when
tsm
�tsf
�����
pn�G
�fRAo��pC
sm�
pnf�oG
�RAo�
�����
G�nCsm
�foG
�����
fo���nCsm
G������
For
ournumbers
sort�middle
issuperior
iffo�
�����This
analy
sissuggests
that
sort� rst
will
beaboutafactor
�faster
than
sort�middle�
asfo�
���for
oursam
ple
scene�
Noneth
elesssort�m
iddle
was
implem
ented
preferen
tiallybecau
se
itisbelieved
that
inreality
sort� rst
will
besign
i can
tlyslow
erthan
assumed
here�
fortworeason
s�First�
theanaly
sisabove
assumed
that
sort� rst
would
require
no
communication
�which
isclearly
false�Secon
d�itcom
pared
aworst�case
behavior
for
sort�middle�
Inreality
throu
ghoptim
izations�w
hich
donot
apply
tothesort�
rst
case�wecan
implem
entsort�m
iddleabouttwice
asfast
asassu
med
here�
Overall
we
expect
that
sort�middleisat
leastas
fastas
sort� rst
onthisarch
itecture�
andlikely
tobesign
i can
tlyfaster�
�����
Sort�
Last�
Dense
vs�
Sparse
Ignorin
gcom
mon
factors�th
erasterization
andgeom
etrytim
e�wesee
that
thetim
e
required
forasparse
sort�lastalgorith
misprop
ortionalto
p�while
thedense
sort�last
��
timeiscon
stant�Thuswecan
calculate
thenumberofpoly
gonswhere
dense
sort�last
communication
ispreferab
leto
sparse
tobe
p�
Q�Csld
ense
A�C
slsparse
������
orp�
�����A
����poly
gonscen
eisextrem
elysim
ple�
At��fp
samere
�����
poly
gonsper
secondwould
beren
dered
�Sofor
allbutthemost
trivial
scenes
dense
sort�lastproves
superior
tosparse
sort�last�
�����
Sort�
Middle
vs�
Dense
Sort�
Last
Ourtworem
ainingalgorith
msare
sort�middleandsort�last�
Sort�m
iddle
has
per�
formance
linear
inthenumber
ofpoly
gons�anddense
sort�lasthas
constan
tperfor�
mance�
soonce
againwecan
ndthetrad
eo�poin
twhere
dense
sort�lastwill
bethe
algorithm
ofchoice�
p�
Q�Csld
ense
Csm
��f�o���AR
n
������
Ifp�
�����then
sort�lastwill
bepreferab
le�Thetim
ereq
uired
toperform
thesort�last
composite
will
beQ
�Csld
ense�
����Minstru
ctions�which
isnearly
an
entire
secondon
thePrin
cetonEngin
e�Sosort�last�
while
o�erin
gcon
stanttim
e
perform
ance�
andthusscalab
ility�proves
proh
ibitively
expensive�
Figu
re���
compares
theperform
ance
ofthese
algorithms�Aswould
beexpected
�
sort� rst
requires
thefew
estinstru
ctions�
followed
bysort�m
iddle
and nally
the
sort�lastsparse
implem
entation
�
Sort�m
iddle
isade nite
win
onthis
architectu
re�A
number
ofoptim
izations
arediscu
ssedlater
inthepaper
which
will
apply
tomost
ofthese
communication
topologies�
how
everord
ersofmagn
itudesep
aratetheperform
ance
ofthese
algorithms�
andthere
arenoord
ersofmagn
itudeinperform
ance
tobefou
ndinanyoptim
izations�
sosort�m
iddlerem
ainstheoptim
alchoice
ofcom
munication
structu
re�
��
1 10
100
1000100010000
100000
10^6 instructions
polygons
sort-firstsort-m
iddledense sort-lastsparse sort-last
optimal
Figu
re����
Communicatio
nCosts�
comparison
ofinstru
ctionsreq
uired
toexecu
tevariou
scom
munication
structu
resversu
snumber
ofpoly
gons�
�
���
Summary
Afram
ework
has
been
prov
ided
toexam
inevariou
scom
munication
alternativ
esfor
poly
gonal
renderin
g�Toprov
ideastraigh
tforward
andanaly
ticalsolu
tionanumber
ofassu
mption
shave
been
made�
Assu
mption
shave
been
both
optim
istic�sort�
rst
requires
nocom
munication
�andpessim
istic�sort�m
iddlereq
uires
all�to�allcom
mu�
nication
�andhave
yield
edinsigh
tfulresu
lts�
Theapplication
ofthisanaly
sisto
thePrin
cetonEngin
erev
ealsthat
algorithms�
such
assort�last�
which
execu
tein
constan
ttim
e�may
prove
tobeproh
ibitiv
ely
expensive�
while
communication
that
execu
tesin
linear
time�such
assort�
rst�may
induce
somuch
overhead
that
they
prove
impractical�
Sort�m
iddle�
which
runsin
linear
timein
thenumber
ofpoly
gonsand
indep
enden
tof
thenumber
ofprocessors
would
seemat
rst
glance
tobean
infeasib
lesolu
tion�butactu
allyproves
tobethe
most
e�cien
tof
thealgorith
msfor
ourscen
esof
interest�
Theobservation
that
sort�lastyield
scon
stantperform
ance
andsort�m
iddleyield
s
perform
ance
independentof
the�clu
mping�
ofprim
itivesthat
sort� rst
issubject
to
issign
i can
t�Wewill
return
tothese
twoalgorith
msin
Chapter
to
develop
ahybrid
algorithm
which
extracts
thecon
stanttim
ebene ts
ofsort�last
andthee�
ciency
of
sort�middlefor
small
numbers
ofprocessors�
��
��
Chapter�
Communicatio
nOptim
izatio
ns
Chapter
�motivated
theim
plem
entationof
asort�m
iddlecom
munication
schem
eto
maxim
izeperform
ance�
How
ever�ifwelook
attheratio
ofthetim
espentcom
muni�
catingto
thetotal
execu
tiontim
e
n�Csm
G�n�Csm�fARo
���
andevalu
atefor
thePrin
cetonEngin
ewesee
that
communication
will
occupy
approx
imately
���of
ourtotal
execu
tiontim
e�Even
allowingfor
ourpessim
istic
assumption
s�weare
alreadyclearly
facingtheissu
eof
scalability
forsort�m
iddle
communication
�Thesin
glelargest
obstacle
totheim
plem
entation
ofahigh
�speed
poly
gonal
renderer
onaSIM
Drin
gisthecom
munication
bottlen
eck�
Wehave
alreadydeterm
ined
that
sort�middlecom
munication
will
execu
teinO�p
time�so
wewill
focuscon
siderab
lee ort
onminim
izingthecon
stantfactors
inthe
perform
ance�
Anumber
ofpossib
ilitiescou
ldbeexploited
inthedesign
ofamore
e�cien
tsort�
middlecom
munication
algorithm�They
arelargely
orthogon
alto
eachoth
er�so
when
combined
they
yield
atree
ofim
plem
entation
s�show
inFigu
re���
Fortu
nately
some
ofthebran
ches
prove
ine ective�
andthey
canbeabandoned
earlyin
theanaly
sis�
Theoption
sto
beexplored
inthedesign
ofthecom
munication
algorithm
have
��
dense
m>
1
m=
1
netw
ork
locality
du
plicatio
nd
atad
istribu
tion
sparsestriped
unstripedyes
no
Figu
re���
Communicatio
nOptim
izatio
ns�
theset
ofpossib
lecom
munication
optim
izationscreate
atree
ofpossib
ilitiesto
explore�
��
been
ordered
byboth
easeofim
plem
entation
�andexpected
importan
ce�Theoption
s
are�
Network
Thecom
munication
netw
orkhas
a�nite
capacity
forinform
ation�Itmay
bedesirab
leto
attemptto
saturate
thischannel�
placin
gas
much
inform
ation
into
itas
possib
le�There
isacost
associatedwith
saturatin
gthechannel
be�
cause
thesetu
pandtests
toach
ievesatu
rationreq
uire
time�durin
gwhich
no
communication
isach
ieved�Theoptim
izationto
keepthenetw
orkas
fullas
possib
leis
referredto
asa�dense�
netw
ork�not
tobecon
fused
with
dense
sort�lastcom
positin
g�
Distrib
utio
nThecost
ofpassin
gadatu
mbetw
eenarb
itraryprocessors
isdecreasin
g
inthesize
ofthepasses
used
�Usin
glarge
passes
�neigh
bor�to�n
eighbor�N
amortizes
thecost
ofwritin
gandread
ingthecom
munication
register�alon
g
with
theoverh
eadof
thepass
operation
�A
two�p
hase
sort�middle
algorithm
which
uses
largeinitial
passes
toget
data
closeto
itsdestin
ation�follow
edby
neigh
bor�to�n
eighbor
passes
toprecisely
deliv
erthedata
couldbemore
e�cien
t
than
anaive
sort�middlealgorith
m�
Data
Duplicatio
nThepoly
gondatab
asecou
ldbeduplicated
ontheprocessors�
so
that
instead
ofthere
bein
gacop
yof
apoly
gonon
only
processor�
there
is
acop
yon
mprocessors�
Byduplicatin
gthepoly
gonsperiod
icallyacross
the
processors
wehave
reduced
themaxim
umdistan
ceanypoly
gonmusttravel
be�
tween
geometry
computation
sandrasterization
�thusred
ucin
gcom
munication
time�at
thepossib
leexpense
ofgeom
etrytim
e�
TemporalLocality
Interactiv
epoly
gonren
dered
scenes�
such
asanim
ations�build
�
ingwalk
throu
ghs�andmolecu
larmodelin
g�gen
erallyhave
high
frame�to�fram
e
coheren
ce�Apoly
gonren
dered
inonefram
ewill
beren
dered
invery
nearly
the
sameposition
inthenextfram
e�Com
munication
could
accountfor
thisand
allowpoly
gonsto
�migrate�
aroundtherin
g�stay
ingreason
ably
closeto
the
processors
that
have
torasterize
them
�Astim
eprogresses
thepoly
gonswill
�
move
fromprocessor
toprocessor
totrack
thechangin
gview
frustu
m�butthese
moves
will
besm
all�thusdecreasin
gtheam
ountof
timespentin
communica�
tion�
Allof
these
option
s�while
appealin
g�are
di�
cult
toanaly
zewith
outactu
ally
implem
entin
gthealgorith
ms�
Ingen
eralthey
deal
with
thesubtle
poin
tsof
the
communication
�andwhile
they
may
prove
tobepred
ictable�
agen
eralmodel
is
di�
cultto
�nd�
Before
discu
ssingtheoptim
izationin
detail
adescrip
tionof
theim
plem
entation
ofsort�m
iddlecom
munication
isgiven
toclarify
theissu
esaddressed
�
���
Sort�MiddleOverview
Themost
straightforw
ardim
plem
entation
ofsort�m
iddlecom
munication
isshow
nin
Figu
re����
After
thegeom
etrystage
eachprocessor
has
aqueueofcom
puted
poly
gon
�descrip
tors��show
nsch
ematically
astrian
glesin
the�gure�
Com
munication
begin
s
with
eachprocessor
sendingthe�rst
poly
gonin
itsqueueto
itsrigh
tneigh
bor�
and
simultan
eously
receivingits
leftneigh
bor�s
poly
gon�Theprocessors
then
perform
n��
more
passes�
eachtim
epassin
gthedescrip
torthey
received
fromtheir
leftneigh
bor
totheir
rightneigh
bor�
Acop
yof
eachreceived
poly
gonis
keptifitoverlap
sthe
rasterizationregion
fortheprocessor�
Once
the�rst
batch
ofpoly
gonshas
traveled
alltheway
aroundtherin
gtheprocessors
then
alldequeuethenextpoly
gonfrom
their
queuesandrep
eattheprocess
untilallp
olygon
shave
been
communicated
�When
complete
eachprocessor
has
seeneach
ofthe�rst
poly
gonsfrom
alltheprocessors
queues�
���
DenseNetwork
Thecom
munication
ofapoly
gonhas
been
assumed
toreq
uire
npasses�
where
nis
number
ofprocessors�
Thisassu
mes
that
eachprocessor
isinterested
ineach
poly
�
��
1
processor3
operation
2 0
2
1
2
2
01
Figu
re����
Sort�
MiddleCommunicatio
n�thework
queues�
with
asin
glepoly
gonin
each�are
show
nat
thetop
�The�rst
operation
�passes
the�rst
poly
gonfrom
eachprocessor
toits
neigh
bor�
Thesubseq
uentoperation
s��
pass
these
poly
gons
aroundtherin
guntil
everyonehas
seenthem
�
gon�In
realityonly
somesm
allsubset
oftheprocessors
will
rasterizeanyparticu
lar
poly
gon�andonce
allprocessors
inthisset
have
seenthepoly
gonitneed
n�tbecom
�
municated
anyfurth
er�Onanygiven
pass
operation
upto
npoly
gonswill
bein
communication
�Byceasin
gthecom
munication
ofpoly
gonsthat
have
been
fully
dis�
tributed
wecan
more
fully
utilize
thecom
munication
slotsavailab
leon
eachiteration
ofthealgorith
m�
These
option
scan
bereferred
toas
sparse
anddense
communication
�not
tobe
confused
with
thesort�last
approach
es�Thesparse
approach
naively
communicates
every
poly
gonto
every
processor�
andthedense
approach
stopscom
municatin
ga
poly
gonas
soonas
allof
itsrasterization
processors
have
seenit�
Figu
re���
compares
theperform
ance
ofsparse
anddense
communication
under
simulation
�Thedata
setsshow
nall
have
anaverage
distan
cebetw
eenthegeom
etry
processor
andtherasterization
processor
foreach
poly
gonof
approx
imately
��pro�
cessors�half
thetotal
number
ofprocessors�
Ithas
been
assumed
that
thecost
of
thesparse
anddense
distrib
ution
schem
esper
pass
operation
areidentical�
which
is
��
0
5000
10000
15000
20000
25000
30000
35000
beethovencrocodile
teapot
shifts
densesparse
Figu
re����
Simulatio
nofSparse
vs�
Dense
Networks�
thesparse
netw
orkcan
require
upto
twice
aslon
gto
distrib
ute
agiv
endata
set�
��
not
strictlytru
e�How
ever�unless
thetests
necessary
toobtain
dense
communication
were
veryexpensive
itisclearly
advan
tageousto
use
dense
communication
�There�
main
der
oftheoptim
izationanaly
sesare
perform
edwith
theassu
mption
that
dense
communication
isused
�
���
Distribution
Thenumber
ofinstru
ctionsreq
uired
byaneigh
bor�to�n
eighbor
pass
isi�
�a�b
where
aisthesize
ofthedatu
mandbistheoverh
eadof
theoperation
�Thefactor
of
�re�
ectsthedistin
ctoperation
son
thePrin
cetonEngin
eof
writin
gthecom
munica�
tionregister�
perform
ingthepass
operation
�andread
ingthecom
munication
register�
Arch
itecture
constrain
tsproh
ibitpipelin
ingthese
operation
sover
eachoth
er�
Ifweperform
apass
ofsom
earb
itrarydistan
ced�neigh
bor�to�n
eighbor�N
itwill
then
require
i����da
�binstru
ctions�Adatu
mcou
ldbecom
municated
adistan
ce
dbyperform
ingasin
glepass
oflen
gthd�
ordneigh
bor�to�n
eighbor
passes�
Clearly
asin
glelon
gpass
issubstan
tiallycheap
er�as
you
only
pay
thesetu
pandoverh
ead
costsonce�
rather
than
dtim
es�Ellsw
orthdiscu
ssesan
analogou
sstrategy
in���
which
places
processors
into
groupsandbundles
together
poly
gonmessages
across
groupboundaries
into
asin
glemessage
toam
ortizethecost
ofmessage
overhead
�
Sim
ilaram
ortizationisn
�tpossib
lehere�
asouroverh
eadis
per
poly
gon�not
per
message�
Thecheap
estway
toget
apoly
gonfrom
oneprocessor
toanoth
eristo
perform
asin
gleshift
ofthesize
necessary�
rather
than
anumber
ofsm
allshifts�
How
ever�
perform
inganumber
ofspecial
shifts
ofvary
ingsizes
will
leadto
poor
utilization
ofthecom
munication
netw
ork�Instead
wecon
sider
atwo�stage
algorithm
inwhich
poly
gons�rst
takestrid
esof
sizedarou
ndtherin
gto
getclose
totheir
destin
ations
cheap
ly�andthen
arepassed
neigh
bor�to�n
eighbor
tobeprecisely
aligned�
The
instru
ctioncost
savingin
this
approach
could
prove
tobesubstan
tial�Figu
re���
show
sthetrad
eo betw
eencost
anddistan
ce�Note
that
wecan
obtain
perform
ance
��
40 60 80
100
120
140
160
05
1015
2025
3035
inst/d
d
Figu
re����
Pass
Cost�
thecost
ofcom
municatin
gadatu
mper
unitdistan
ceis
inversely
prop
ortional
tothedistan
ceof
thepass�
nobetter
than
thecase
ofasin
glepass
which
places
thepoly
gonprecisely
onits
destin
ation�so
ourperform
ance
improvem
entislim
itedto
theratio
ofthetim
efor
a
single
pass
oflen
gthdto
thetim
efor
dneigh
bor�to�n
eighbor
passes�
limd��
a��
�d
�b
d��a�b
���ba
����
For
oursystem
a���
andb�
���so
wecan
expect
best�case
speed
upof
���in
thetim
ereq
uired
tocom
municate
asin
gledatu
m�
Wecan
solvefor
theoptim
alstrid
esize
dunder
someassu
mption
saboutthe
distrib
ution
ofdata�
Uniform
SourceDistrib
utio
nThepoly
gonsto
berasterized
areuniform
lydis�
tributed
acrosstheprocessors
afterthegeom
etrystage�
This
isagood
as�
sumption
�as
thepoly
gondatab
aseisdistrib
uted
uniform
lyover
theprocessors
initially�
��
Uniform
Destin
atio
nDistrib
utio
nThenumber
ofpoly
gonsto
berasterized
by
eachprocessor
isthesam
e�Thistherefore
assumes
auniform
poly
gondistrib
u�
tionover
thescreen
�Itislikely
someprocessors
�those
rasterizingthemiddleof
thescen
ewill
receivemanymore
poly
gonsthan
other
processors�
Thisintro�
duces
asymmetries
inthecom
munication
pattern
andwill
beasou
rceof
error
betw
eenouranaly
sisandsim
ulation
results�
SingleDestin
atio
nEach
poly
gonisrasterized
byasin
gleprocessor�
sotheover�
lapfactor
is�
This
isastron
gassu
mption
�Aswehave
alreadyseen
in
Chapter
�theaverage
poly
gonis
rasterizedon
about�di eren
tprocessors�
How
ever�colu
mn�parallel
decom
position
sof
thescreen
�in
which
eachproces�
sorisassign
edacolu
mnofthescreen
torasterize�
closelymatch
thisassu
mption
becau
setheprocessors
that
rasterizeapoly
gonwill
beadjacen
t�
FullRingAtanystep
inthedistrib
ution
allpoin
tsof
therin
gare
carryingdata�
Thisisnot
strictlytru
e�becau
seas
thecom
munication
reachescom
pletion
there
isnoabrupttran
sitionfrom
therin
gbein
gfullof
inform
ationto
bein
gem
pty�
For
largedata
sets�p�
nandthedense
distrib
ution
schem
ethisshould
bea
reasonableapprox
imation
�
Given
these
assumption
s�andtheir
accompanyingcaveats�
wecan
calculate
an
optim
aldistrib
ution
pattern
�andits
perform
ance
relativeto
asim
pler
neigh
bor�to�
neigh
bor
algorithm�
Wehave
assumed
that
thepoly
gonsou
rcesanddistrib
ution
sare
uniform
lydis�
tributed
�so
theexpected
distan
ceapoly
gonwill
travelisn���
Atwo�stage
distrib
u�
tionsch
eme�were
we�rst
perform
passes
ofsom
estrid
edandthen
passes
ofstrid
e
will
require
atotal
ofi�instru
ctions�com
posed
�rst
ofinstru
ctionsthat
perform
passes
ofsize
d�andthen
instru
ctionsthat
perform
theneigh
bor�to�n
eighbor
passes�
Thesize
dpasses
will
require
idinstru
ctions�
id�
pn
n�
df��
�da
�bg
����
��
Each
pass
requires
���d�a�binstru
ctions�each
poly
gontakes
anaverage
of
n����d
passes
tocom
municate
inthe�rst
stage�andthere
areatotal
ofppoly
gons
bein
gcom
municated
�nat
atim
e�
After
this
�rst
pass
ofcom
munication
eachpoly
gonwill
bean
averageof
d��
processors
fromwhere
itneed
sto
be�
so�sim
ilarto
equation
����thesecon
dpass
of
distrib
ution
will
require
i�instru
ctions�
i��
pn
d�f��
�a
�bg
����
andi��id�i� �
i�
�pn �
n�
d���
�da
�b��
d����
�a
�b� �
����
Solv
ingto
�ndtheminim
um
number
ofinstru
ctionsyield
s
d�
s��a
�bn
�a�b
����
i�
�p �� s
��a�b��a
�b
n�a� ��
����
Theoptim
alstrid
eisd������
forn�����
a���
andb����
Figu
re���
compares
thepred
ictedandsim
ulated
perform
ance
ofthealgorith
m�
Thecurve
isvery
�at
attheminim
um�w
hich
isreason
ableto
expect�
since
thecost
per
unitdistan
cetraveled
approach
esasy
mptotically
with
increasin
gd�
Asdincreases�
data
isdistrib
uted
lessclosely
toits
optim
alposition
inthe�rst
stageandthecost
ofthesecon
dstage
grows�
Thesim
ulation
show
saclear
minim
um
number
ofinstru
ctionsper
poly
gonfor
d��or
d���
Thisisalittle
more
than
half
oftheanaly
ticallydeterm
ined
optim
al
d�Thediscrep
ancy
arisesbecau
sethe�D
ense
Ring�
assumption
startsto
break
dow
n
earlierfor
largevalu
esofd�
soouranaly
sisunderestim
atedi� �
Thesim
ulation
sshow
speed
upsof����
����and����
while
ourpred
ictedspeed
upis����
compared
tothe
��
0 20 40 60 80
100
120
140
160
180
200
020
4060
80100
120
inst/p
d
beethovencrocodile
teapotpredicted
Figu
re����
i�
vs�
dforDistrib
utio
nOptim
izatio
n�data
based
onsim
ulation
ofa�����p
rocessormach
ine
discon
strained
tobeafactor
ofnbythesim
ulator
�
maxim
um
speed
upof
�
���
Data
Duplic
atio
n
Data
duplication
isan
intrigu
ingpossib
ilityfor
reducin
gthetotal
communication
re�
quirem
ents
Byplacin
gmore
than
onecop
yofeach
poly
gonin
theprocessor
arraywe
reduce
themaxim
umpossib
ledistan
ceanypoly
gonmusttravel
tobecom
municated
toanygiven
processor
Ifweduplicate
thepoly
gondatab
asem
times
acrossthearray
then
apoly
gon
which
lieson
processor
k�
n�m
alsolies
onprocessor
k�n�m
�processor
k�
�n�m
�throu
ghprocessor
k��m
���n
�mThemaxim
um
distan
ceapoly
gonwill
have
totravel
fromasou
rceto
destin
ationisnow
n�m
instead
ofnFurth
ermore�
thedistrib
ution
ofthecop
iesacross
thearray
isdeterm
inistic�
soanyprocessor
can
determ
inewith
outcommunica
tionwhich
ofm
processors
should
source
eachpoly
gon
forwhich
ithas
acop
yThuswehave
reduced
theam
ountof
communication
wehave
toperform
attheexpense
ofdoin
gextra
work
inthegeom
etrystage
Each
processor
will
now
have
toexam
inepm
�npoly
gons�andcom
pute
thegeom
etryfor
thesubset
ofthem
that
itdeterm
ines
itisthebest
source
for
Toanaly
zethepoten
tialperform
ance
improvem
entsof
data
duplication
wewill
use
thesam
eassu
mption
sas
used
durin
gthedistrib
ution
analy
sis�nam
ely�uniform
source
anddestin
ationdistrib
ution
�sin
gledestin
ationanddense
ring
Tomake
amean
ingfu
lanaly
siswehave
tokeep
aneye
onthecost
ofthegeom
etry
operation
sAsalread
ydiscu
ssed�thegeom
etrystage
may
bebroken
into
twopieces�
�rst
exam
iningall
ofthepoly
gonson
aprocessor
anddeterm
iningwhich
ofthem
are
visib
le�andthen
perform
ingthecom
plete
geometry
calculation
sonly
forthevisib
le
poly
gons
Inthecase
ofdata
duplication
thevisib
ilitytest
will
inclu
dedecid
ing
ifthis
processor
cancom
municate
thepoly
gonmost
cheap
lyof
them
processors
with
acop
yof
itLet
Gibethenumber
ofinstru
ctionsreq
uired
foraprocessor
to
transform
apoly
gonto
thescreen
anddeterm
ineifitshould
source
itLet
Gfbethe
��
remain
inginstru
ctionsto
complete
thegeom
etrystage
per
poly
gonThenumber
of
instru
ctionsreq
uired
toperform
thegeom
etrycalcu
lationsanddistrib
ute
theresu
lts
foraduplication
factorm
isim�
im�
pmnGi�
p�nGf�
p�n
n�m��a
�b�
���
Note
that
theduplication
costsafactor
mon
thenumber
ofpoly
gonswehave
to
exam
ine�G
i ��butcosts
nothingfor
theGfcalcu
lations�as
these
areonly
perform
ed
byoneprocessor
foreach
poly
gonThe�nal
termin
thesum
isthecom
munication
time�which
now
takes
distan
cen�mfor
eachpoly
gon�thanksto
theduplication
We
assumethat
half
ofall
poly
gonsare
visib
le
Implem
entation
reveals
that
Gi�
����andGf�
����instru
ctions
Solv
ingfor
theoptim
alchoice
ofm
yield
s
m�
sn��a
�b�
�Gi
����
im
�p �� s
Gi ��a
�b�
n�G
f
�n �������
andm
�����
forn������
a���
andb���
Itisinterestin
gto
note
that
theoptim
um
constrain
tdoes
notdependon
thecost
ofG
f �as
wehave
assumed
that
thevisib
lepoly
gonsare
uniform
lydistrib
uted
across
theprocessors
fortheGfstage
ofcom
putation
Ifthere
areanynon�uniform
ities
inthedestin
ationdistrib
ution
these
will
bere�
ectedback
inthechoice
ofwhich
processors
source
thepoly
gonsfor
them
�creatin
ganon�uniform
ityin
thegeom
etry
stage
Ouranaly
sisthusfar
has
been
largelyad
hoc
dueto
thedi�
culty
ofmodelin
g
asymmetries
inthesou
rcedistrib
ution
anddestin
ationdistrib
ution
Thesim
ulator
canmodelthese
e�ects
precisely�
andyield
sinterestin
gresu
ltsFigu
re���
compares
theanaly
ticalexpression
fortheoptim
alchoice
ofm
tothesim
ulator
results
��
0 20 40 60 80
100
120
140
160
08
1624
32
inst/poly
m
simulated
predicted
Figu
re����
PolygonDuplic
atio
n�instru
ctionsexecu
tedperpoly
gonon
asim
ulated
����processor
mach
inevsm
��
Alth
ough
thecurves
have
thesam
egen
eralshape�
there
isalarge
discrep
ancy
betw
eenthetworesu
ltsPart
ofthediscrep
ancy
isdueto
themistaken
assumption
that
���of
thepoly
gonswill
bevisib
le�while
theactu
alnumber
iscloser
to� �
in
thisscen
eThesim
ulator
alsoexpresses
results
ininstru
ctionsper
visiblepoly
gon�
which
isin
somewaysmore
usefu
lthan
instru
ctionsper
totalpoly
gon�becau
seit
re�ects
thee�ort
that
goesinto
theusefu
l�visib
le�poly
gons�rath
erthan
allof
the
poly
gons
How
ever�these
twofactors
alonefar
fromaccou
ntfrom
thedi�eren
cein
theresu
lts
Figu
re��
revealshow
ouranaly
siscou
ldhave
ledusso
farastray
Theclu
mping
oftheob
jectin
themiddle
ofthescen
ecreates
aclu
mpingof
thegeom
etrywork
onapprox
imately
onefou
rthof
theprocessors�
sothegeom
etrystage
was
�tim
es
more
expensive
then
weexpected
Furth
ermore
thecom
munication
ismore
expensiv
e
becau
seithas
allbecom
erelativ
elylocal�
andlarge
parts
oftherin
ggo
unutilized
How
ever�thesim
ulation
stillreveals
aperform
ance
improvem
enton
theord
erof
���
forthebest
caseim
provem
entat
m���
Itappears
that
wehave
foundanoth
ermeth
odfor
reducin
gthetotal
amountof
communication
�at
theexpense
ofincreased
geometry
computation
Attheoptim
al
poin
tm
���
wespendan
averageof
���instru
ctionsper
poly
gonversu
s���
forthe
noduplication
case�asav
ings
of���
Thelogical
exten
sionto
tothissim
ulation
isto
incorp
oratethedistrib
ution
optim
izationalread
ydiscu
ssedfor
afurth
erperform
ance
improvem
ent
Figu
re��
show
stheinstru
ctionsper
visib
lepoly
gonas
afunction
ofd�
thedata
distrib
ution
stepsize�
andm�theduplication
factorTheminim
umpoin
tisfou
ndat
d���
andm
���
which
indicates
that
theduplication
optim
ization�while
e�ectiv
e
inandof
itself�does
not
combinewell
with
other
optim
izations
Intuitiv
elythe
duplication
optim
izationisminim
izingthedistan
ceanypoly
gonhas
tobetraveled
�
thusmakingthedistrib
ution
optim
izationless
e�ectiv
e
��
0 5 10 15 20 25 30 35 40 45 50
0256
512768
1024
polygons
processor
m =
1m
= 8
Figu
re�� �
PolygonsPerProcesso
rvs�
m�theasy
mmetric
distrib
ution
ofscen
econ
tentisre�
ectedin
thedistrib
ution
ofpoly
gonsdurin
ggeom
etryprocessin
gData
isdisp
layedas
themaxim
um
ofadjacen
tprocessors
forread
ability
andistak
enfrom
asim
ulation
oftheteap
otdata
set��
12
48
161
24
8
1650 60 70 80 90
100110120130
d
m
inst/poly
Figu
re���
Instru
ctio
nsPerPolygonasafunctio
nofm
andd�theoptim
alchoice
ofd����
andtheoptim
alm
��donot
together
yield
anoptim
alsolu
tion
Sim
ulation
isbased
onthecrocod
iledata
set
��
���
TemporalLocality
Tem
poral
localityattem
ptsto
exploit
localityinasim
ilarfash
ionto
data
duplication
Thepoly
gonsare
allowed
tomigrate
aroundtherin
gfor
thegeom
etrycom
putation
s�
rather
than
alwaysbein
gcom
puted
onthesam
esou
rceprocessor
Apoly
gonrem
ains
asclose
aspossib
leto
theprocessors
that
will
berasterizin
git�
thusminim
izing
communication
time
Unlikesort��
rstonly
asin
glecop
yof
thepoly
goniscom
puted
durin
gthegeom
etrystage�
thusrem
ovingtheduplicated
computation
penalty�
but
stillincurrin
ganyload
imbalan
cesdueto
non�uniform
poly
gondistrib
ution
overthe
screenAsinthedata
duplication
case�thisismost
mean
ingfu
lformapsofprocessors
topixels
that
keepsadjacen
tscreen
areason
adjacen
tprocessors�
such
asacolu
mn�
parallel
decom
position
Tem
poral
localitycan
bethoughtof
asdata
duplication
where
m�n�excep
twe
don�thave
toactu
allyduplicate
theentire
datab
aseon
eachprocessor
Thecom
munication
timeis
greatlyred
uced
becau
seeach
poly
gonis
rasterized
onprocessors
immediately
adjacen
tto
theprocessor
which
perform
edits
geometry
Theaverage
distan
ceapoly
gonmusttravel
becom
estheaverage
width
ofapoly
gon
Inthis
caseanoth
ercom
munication
costgets
added
into
theoverh
ead�wemust
now
communicate
thepoly
gon�not
thecom
puted
represen
tationwewill
rasterize�
betw
eenprocessors
fromfram
eto
frame
These
moves
will
besm
allin
general�
as
weare
exploitin
glocality
How
ever�they
arecom
plex
For
instan
ce�apoly
goncou
ld
move
slightly
totheleft
ofwhere
itison
thescreen
�or
slightly
totherigh
t�so
we
must
now
dopasses
ineach
direction
ifwedon�twantto
have
topass
poly
gonsall
theway
totherigh
tto
move
them
slightly
totheleft
Furth
ermore
weneed
todo
someth
ingwith
theculled
poly
gons
Atsom
epoin
tthey
could
bevisib
leagain
and
weneed
tokeep
them
aroundon
someprocessor
Ignorin
gthese
issues
wecan
�rst
just
exam
inethekindof
loadim
balan
ceswe
arefacin
gin
thissystem
In
general
they
will
bemuch
high
erthan
those
ofdata
duplication
becau
seweare
gettingthepoly
gonsin
anoptim
alposition
which
will
��
expose
allof
theasy
mmetries
intheload
distrib
ution
Figu
re���
show
sthepoly
gon
distrib
ution
durin
gthegeom
etrystage
fortheteap
otdataset
Here
theworst
case
processor
has
� �visib
lepoly
gonsto
perform
thegeom
etryfor�
atacost
ofGi �
Gf
instru
ctionsper
poly
gon�ign
oringtheculled
poly
gonswhich
mustalso
behandled
Theoverh
eadofthisapproach
overabalan
cedgeom
etrystage
istheexcess
number
ofpoly
gonson
thisprocessor
Theteap
otdata
sethas
����poly
gons�or
��poly
gons
per
processor�
sotem
poral
localityim
poses
a���
poly
gonpenalty
onthee�ective
poly
gonload
per
processor
Wecan
compute
thenumber
ofextra
instru
ctionsthis
requires
per
visib
lepoly
gonas
�����G
i �Gf ��p
vis��� �
which
exceed
seven
anaive
implem
entation
ofsort�m
iddle
communication
which
incurs
amargin
alcost
of���
instru
ctionsper
poly
gonOnce
weinclu
dethenecessity
toperform
communication
both
topass
apoly
gonto
itsneigh
bors
forrasterization
andto
main
taintem
poral
localityweare
likelyto
befar
inexcess
ofthis�gure
Tem
poral
localityrem
ainsan
intrigu
ingoptim
izationdueto
itssim
ilaritywith
sort��rst
communication
structu
resHow
ever�its
vulnerab
ilityto
asymmetries
in
thepoly
gondistrib
ution
overthescreen
make
itunsuitab
lefor
use
here
���
Summary
Fourcom
munication
optim
izationswere
exam
ined�netw
ork�distrib
ution
�duplica�
tion�andlocality
Speci�
catten
tionispaid
totheapplication
oftheoptim
izations
tothePrin
cetonEngin
e�andtheinteraction
oftheoptim
izations
Maxim
izingthedensity
ofinform
ationin
thecom
munication
channel�
themost
obviou
soptim
ization�prod
uces
startlingly
positiv
eresu
ltsDistrib
ution
contin
ues
thistren
dando�ers
speed
upsof
greaterthan
����byam
ortizingthecost
ofload
ing
andstorin
gthecom
munication
registerover
passes
largerthan
neigh
bor�to�n
eighbor
Data
duplication
strives
toim
prove
theperform
ance
ofthecom
munication
time
bydecreasin
gthemaxim
um�an
dthemean
�distan
cebetw
eenasou
rceprocessor
and
thedestin
ationprocessors
Non�uniform
scenecon
tentdistrib
ution
becom
esre�
ected
�
0 20 40 60 80
100
120
140
160
180
0256
512768
1024
polygons
processor
Figu
re����
TemporalLocality
�thenumberofpoly
gonsper
processor
forgeom
etrycom
putation
sdirectly
re�ects
theunderly
ingscen
ecom
plex
ityfor
eachcolu
mnThe
worst
caseprocessor
has
� �poly
gonsto
perform
geometry
calculation
sfor�
almost
��tim
estheaverage
numberofpoly
gonsto
compute
Data
isdisp
layedas
themaxim
um
ofadjacen
tprocessors
forread
ability
�
inthedistrib
ution
ofwork
durin
ggeom
etrycom
putation
s�makingthisapproach
more
similar
tosort��
rst�While
o�erin
gasign
i�can
tspeed
up�ito�ers
asm
allerperfor�
mance
improvem
entthan
thedistrib
ution
meth
od�an
ddoes
not
o�eranyperform
ance
improvem
entswhen
combined
with
distrib
ution
�
Tem
poral
localityiscon
ceptually
thelogical
extrem
eof
data
duplication
�where
theduplication
isach
ieved
bysort��
rststy
lecom
munication
�Itproves
tobevulner�
able
tothesam
eload
imbalan
cesthat
sort��rst
communication
fallsvictim
to�The
extrem
ely�nesubdivision
ofthescreen
coupled
with
ahigh
lyvariab
ledistrib
ution
of
scenecon
tentcreates
afew
extrem
elyheav
ilyload
edprocessors
which
impose
ahuge
perform
ance
penalty
onthegeom
etrycalcu
lations�
��
�
Chapter�
Sort�TwiceCommunicatio
n
Aspart
ofthisthesis
anovel
new
approach
tothesortin
gprob
lemwas
develop
ed�en�
titledsort�tw
ice��Sort�tw
iceleverages
thee�
ciency
ofsort�m
iddlecom
munication
forlow
poly
goncou
ntscen
esto
createan
implem
entationof
sort�lastcom
positin
g
that
ispractical
with
modest
bandwidth
processor
intercon
nect�
Theim
plem
entation
ofatwo�stage
communication
algorithm
allowsanoveloverlap
pingof
therasteriza�
tionpipelin
ewith
thesecon
dcom
munication
stageto
createahigh
lye�
cientdeferred
shadingandtex
turin
gsystem
�
���
WhyisSort�
Last
soExpensiv
e�
Sort�last
has
prim
arilybeen
thedom
ainof
hard
ware
solution
sdueto
thevery
high
bandwidth
required
�For
exam
ple�
PixelF
low ���
uses
adedicated
����bit
busat
���MHz�for
abisection
bandwidth
of���G
B�s�
toprov
idethenecessary
compositin
g
intercon
nect�
Bycom
parison
�thePrin
cetonEngin
e�which
employ
sthesam
esty
leof
linear
intercon
nect�
has
ameager
��MB�s
ofintercon
nection
bandwidth�more
than
twoord
ersof
magn
itudeless
than
PixelF
low�
Whyisso
much
bandwidth
required
�Thebandwidth
forsort�last
ispurely
driv
en
bythenumber
ofsam
ples
tocom
posite
andtheupdate
rateof
thedisp
lay�Amodest
single�sam
plesystem
ofof
���pixels
by���
pixels
at�
frames
per
secondreq
uires
��
astaggerin
g��M
B�s
ofcom
munication
�which
�alth
ough
attainable
onoursecon
d
generation
mach
ine�would
leavelittle
timefor
anyoth
ercom
putation
�
���
LeveragingSort�Middle
Sort�tw
icecom
munication
leverages
thelow
expense
ofsort�m
iddlecom
munication
over
short
dista
nces
tomakesort�last
compositin
gpractical
onagen
eralpurpose
mach
inesuch
asthePrin
cetonEngin
e�Theuse
ofsort�m
iddlecom
munication
ina
system
supportin
gmultip
leusers
prov
ides
thefou
ndation
forthisnew
approach
�
Figu
re���
show
sa��
processor
mach
inesupportin
g�sim
ultan
eoususers�
Each
user
isassign
edto
agrou
pof
�processors
forgeom
etrycalcu
lations�
andan
adja�
centgrou
pof
�processors
forrasterization
�Theskew
betw
eenthegeom
etryand
rasterizationprocessors
allowsall
communication
shifts
tobeperform
edin
thesam
e
direction
�aperform
ance
boon
onaSIM
Dmach
ine�
Thecom
munication
stageonly
has
topass
apoly
gonacross
amaxim
um
of�processors�
rather
than
��processors
ifthemach
inewere
dedicated
toasin
gleuser�
Thissystem
deliv
ersperform
ance
to
eachuser
indistin
guish
able
fromamach
inethesam
esize
asthegrou
pof
processors
assigned
totheuser�
Ifweincrease
thetotaln
umberofprocessors
while
leavingthenumberofprocessors
assigned
toeach
user
unchanged
theaggregate
poly
gonren
derin
gperform
ance
will
increase
linearly�
Whitm
andiscu
ssestheim
plem
entationof
aparallel
renderer
onthe
Prin
cetonEngin
especi�
callyto
support
multip
leusers
in ���
Note
that
thenumber
ofprocessors
assigned
toauser
has
noe�ect
onthenumber
ofpoly
gonseach
processor
cantran
sformper
second�or
thenumber
ofpixels
itcan
rasterize�Ifweassign
asin
gleprocessor
toeach
user
wedon�tneed
toperform
any
communication
andwecan
maxim
izetheaggregate
renderin
gperform
ance
ofthe
system
�of
course
attheexpense
oftheper
user
poly
gonsdelivered
eachsecon
d�
Ifwewish
tofocu
sall
ofourcom
putin
gresou
rceson
asin
gleuser
wecan
stilluse
themultip
leuser
decom
position
�excep
teach
groupof
processors
while
concep
tually
��
78
1011
136
5
User 4
User 2
User 3
User 1
User 0
414
912
01
23
12
Com
munication
Rasterization
Geom
etry9
63
27
54
08
114
1311
10
Figu
re����
Multip
leUsers
onaParallelPolygonRenderer�
eachuser
isassign
edto
asm
allfraction
oftheprocessors�
work
ingfor
di�eren
tusers
will
inreality
bework
ingfor
thesam
euser�
Wewill
dividethemach
ineinto
groupsof
gprocessors�
where
gisthegrou
psize
�g�
�in
Figu
re�����
Sim
ilarly�wedividethescreen
into
gnon�overlap
pingregion
s�numbered
throu
ghg���
Apoly
gonthat
overlapsaregion
r��
r�
gmay
berasterized
byanyprocessor
m�m
�r�k�g�
Ifwedistrib
ute
thepoly
gondatab
aseover
allprocessors�
nomatter
what
processor
perform
sthegeom
etrycom
putation
sfor
a
poly
gonthepoly
gondescrip
torwill
have
totravel
amaxim
um
ofg��processors
toarrive
ataprocessor
that
canrasterize
it�as
opposed
toamaxim
um
ofn�
�
processors
inthetypical
sort�middleim
plem
entation
�After
alltheprocessor
groups
complete
rasterization�wecan
composite
theim
agesgen
eratedbythedistin
ctsets
of
processors
inafash
ionanalogou
sto
dense
sort�lastcom
positin
gto
generate
the�nal
outputim
age�
Equation
���giv
esthetim
efor
sort�middleren
derin
g�Ifwechange
thisequation
tore�
ectthenew
maxim
um
distan
ceapoly
gonmust
travelwesee
t�sm�
pn�G
�gCsm�fRo�
�����
andournew
implem
entation
requires
timeprop
ortionalto
thenumberofpoly
gons
��
andinverse
inthenumber
ofprocessors�
foralin
earlyscalab
lesystem
�How
ever�
we
now
must
pay
thecost
ofperform
ingsort�last
composition
tocom
binetheresu
ltsof
alltheindividual
renderin
ggrou
ps�
���
Sort�Last
Compositin
g
Sort�last
compositin
gdeserves
furth
erexplan
ationfor
thisalgorith
m�as
itdeparts
signi�can
tlyfrom
theclassical
dense
sort�lastapproach
�Processors
may
nolon
ger
rasterizearb
itrarypixels
fromtheentire
screen�Instead
eachprocessor
may
render
only
pixels
inits
regionof
thescreen
�Then�g
processors
that
have
rendered
pixels
inthesam
eregion
ofthedisp
laymust
then
composite
their
results
togen
eratethe
�nal
outputim
age�
AsFigu
re���
show
s�theregion
ofthescreen
that
aprocessor
isresp
onsib
lefor
re�
peats
everygprocessors�
Thusthecom
munication
forthesort�last
compositin
gstage
ofsort�tw
iceiscom
pletely
determ
inistic�
Most
importan
tly�thiscom
positin
gcan
be
perform
edin
constan
ttim
e�in
afash
ionanalogou
sto
norm
alsort�last
compositin
g�
The�nal
result
ofthecom
position
operation
sis
eachpixel
isknow
nat
some
determ
inistic
processor
m�Thecorrect
pixelisnot
know
nat
alltheprocessors
that
could
have
rendered
thepixel�
asthisreq
uires
all�to�allcom
munication
andwould
require
timeprop
ortionalto
n�Ifweonly
have
thecorrect
answ
er�for
anyparticu
lar
pixelon
oneprocessor
wecan
achieve
thecom
position
incon
stanttim
e�
Thecom
position
ofasin
glepixelis
perform
edas
follows�
processor
m�gtak
esits
copyof
thepixelandpasses
itto
processor
m��g�
which
composites
thepixelwith
itsow
nren
dered
pixel�
andpasses
theresu
ltto
processor
m��g�
Thisprocessor
is
repeated
until
�nally
thepixelispassed
toprocessor
m�which
composites
thepixel
into
itsfram
ebu�er�
Thusn�g�
�shifts
arereq
uired
tocom
posite
asin
glepixel�
Follow
ingthepath
ofpixel
�in
Figu
re����
itstarts
onprocessor
which
simply
passes
itscop
yof
thepixelto
processor
��Processor
�then
composites
thereceiv
ed
pixelwith
itspixel�andpasses
theresu
ltto
processor
��Thiscon
tinues�
until
�nally
��
7
12
8
9
45 214
0
11
63
148
6
10
31
13
0
11
9
12
4
13
7
12
5
10
40
1
1 23
3
12
2
3
1 23
4
4
0
0
0
4
Figu
re����
Sort�
Last
Compositin
g�asam
plemach
inewith
��processors
anda
groupsize
of��
Theassign
mentof
processors
toregion
sisshow
non
thedisp
layat
left�
processor
��receives
thepixelandcom
posites
itinto
itsfram
ebu�er�
AsFigu
re���
show
s�all
n�g
processors
responsib
lefor
thesam
eregion
cansi�
multan
eously
composite
n�g
pixels
inn�g�
�shifts�
There
aregsuch
groupsof
processors
compositin
g�so
npixels
arecom
posited
inn�g��operation
s�Theresu
lt�
ingcom
posited
image
will
bedistrib
uted
acrosstheprocessors�
Theprocessor
that
will
hold
the�nal
composited
version
ofeach
pixelisshow
nin
Figu
re����
There
areP
pixels
intheoutputfram
e�so
thefram
ewill
require
Pn� ng���
�P� �g�
�n�
�����
shifts
tocom
posite�
Thenumber
ofshifts
isbounded
byP�g�
acon
stanttim
e
solution
forsom
echoice
ofg�independentofthenumberofprocessors
andthenumber
ofpoly
gons�
Note
that
ifg�
�then
wehave
thedense
sort�lastcom
position
�as
discu
ssedprev
iously�
While
thiscom
position
requires
communication
operation
sinversely
prop
ortional
tothegrou
psize�
eachcom
munication
operation
isacross
gprocessors�
with
cost
prop
ortional
tothegrou
psize�
foroverall
constan
tperform
ance�
��
4
85
107
11
9
9
14
14
7
0
012
4
2
2
1
110 13
69
96
612
0
0
0
0312
12
3
3
93
3
3
3
6
Figu
re����
Pixelto
Processo
rMap�the�nalmappingoffully
composited
pixels
toprocessors
results
fromtheinterleav
ingof
thecom
positin
goperation
s�Theprocessor
that
isresp
onsib
lefor
eachpixellab
elsthepixel�
���
Sort�TwiceCost
Com
positin
gthe�nal
image
will
require
P��g�
�n�com
munication
operation
s�Each
operation
will
beapass
ofapixelacross
gprocessors�
Thecost
ofcom
municatin
gapixel
isof
thesam
eform
asgiv
enfor
poly
gons�
i�
���d�a
�b�
Inthis
cased�
g�each
pass
isacross
anentire
group��a�
�
���bits
ofcolor
and��
bits
ofz�
andb�
���as
taken
fromtab
le����
Sowhile
thenumber
ofshifts
isdecreasin
glin
earlyin
thegrou
psize�
thecost
oftheshifts
is
increasin
glin
earlyin
thegrou
psize�
andthecom
position
will
require
constan
ttim
e�
Exam
iningthetotal
number
ofinstru
ctionsreq
uired
toperform
thecom
positin
g
wesee
i�
P� �g�
�n� ��
�g�a
�b�
�����
iP�
a�
�a�b
gif�n�
g������
Equation
���isenligh
tening�
While
thenumber
ofinstru
ctionsexecu
tedper
sam�
��
ple
isindependentof
nandp�
itisinversely
prop
ortional
tog�
sowewould
liketo
makeglarge
tominim
izethecost
ofeach
sample
wecom
posite�
For
ourmodel
we
have
a��and�a
�b�
���Asgincreases
wepay
lessandless
per
sample�
asymp�
toticallyapproach
ingamere
�instru
ctionsper
pixel�
forafactor
of�speed
upover
theclassical
sort�lastapproach
�g�
���Aswith
thedistrib
ution
�optim
ization�
weare
amortizin
gthecost
ofthetests
andwrite�read
ofthecom
munication
register
overalarger
pass
�caused
byalarger
g��It
isinterestin
gthat
theclassical
dense
sort�lastcom
positin
gcase
issim
ultan
eously
themost
expensive
tocom
posite�
requir�
ing��
instru
ctionsper
pixel�
andthemost
expensiv
ein
mem
ory�req
uirin
gan
entire
framebu�er
oneach
processor�
��
Deferre
dShadingandTexture
Mapping
Asshow
nin
Figu
re����
thetim
ereq
uired
toperform
sort�lastcom
position
isde�
creasingin
thegrou
psize�
sologically
wewould
liketo
maxim
izethegrou
psize
�with
aneye
onthecost
ofsort�m
iddle
communication
�to
maxim
izeouroverall
perfor�
mance�
Durin
gthepure
communication
ofthecurren
tpixel
data
�not
loadingor
storingregisters
orperform
ingtests
onthedata�
buttheactu
altim
ethedata
isin
transit��
theprocessors
areessen
tiallyidle�
Itisknow
napriori
that
theprocessors
nextoperation
will
beto
composite
thepixel
itreceiv
eswith
itsow
npixel�
While
waitin
gto
receivethispixel�
shadingandtex
ture
mappingfor
thisprocessor�s
pixel
may
beperform
ed�Byoverlap
pingshadingandtex
ture
mappingoperation
suntil
the
sort�lastcom
munication
stagewesave
timeande�ort
intherasterization
stage�and
leaveourcom
munication
timeuna�ected
�
Ourexpression
forthecost
ofapass
is���d�a
�binstru
ctions�
�ainstru
ctionsare
spentwritin
gandread
ingthecom
munication
register�andbinstru
ctionsare
spent
testingthereceived
datu
mandprep
aringto
transm
itthenext�
dainstru
ctionsare
spentwith
data
simply
intran
sit�which
mean
sourVLIW
processors
areexecu
ting
essentially
just
passRightfor
eachinstru
ction�andnoth
ingmore�
For
apixelsize
of
��
0 5 10 15 20 25
016
3248
6480
96112
128
inst/pixel
group size
Figu
re����
Sort�
TwiceCompositin
gTim
e�
thepayo�
forincreasin
ggdim
in�
ishes
rapidly�
Note
that
simultan
eouswith
increasin
gganddecreasin
gthecom
posi�
tiontim
e�thesort�m
iddletim
eisincreasin
g�
��
a��andamodest
groupsize
ofg����
thisam
ounts
to���
instru
ctionsper
pixel
which
goalm
ostcom
pletely
unused
bytheprocessors
Ifwelook
ahead
totheresu
ltsin
Chapter
�ourcurren
tim
plem
entationreq
uires
���instru
ctionsperbilin
earlyinterp
olatedtex
turemapped
pixelandonly�
instru
c�
tionspershaded
pixel�so
texturemappingisa���
instru
ctionprem
iumClearly
with
���instru
ctionsavailab
lewecou
lddefer
both
texture
mappingandshading
Bydeferrin
gthese
operation
suntil
afterrasterization
�therasterization
stageis
mademuch
simpler
Inparticu
lar�con
sider
thecase
oftex
ture
mapping
Prev
iously
ifanypoly
gonwas
texture
mapped�all
processors
had
tooperate
intex
ture
mapped
modeandtake
theperform
ance
penalty
Now
theperform
ance
penalty
excep
tthe
extra
param
etersto
interp
olate�iscom
pletely
deferred
until
sort�lastcom
position
�so
thetex
ture
mappingoverh
eadisentirely
hidden
Arap
idprototy
peof
thissystem
show
sthat
inunoptim
ized com
piler
generated
�
codeitwill
rasterizeasin
glepixelwith
deferred
texture
mappingandshadingin
�
instru
ctions�afactor
of�im
provem
entover
ourcurren
tim
plem
entation
���
Performance
Thetim
eto
composite
afram
eiscom
pletely
determ
inistic�an
dwehave
had
goodluck
pred
ictingsort�m
iddlecom
munication
�sowecan
con�dently
pred
icttheperform
ance
ofsort�tw
icecom
munication
Figu
re���
show
stheinstru
ctionsreq
uired
per
pixelfor
variouschoices
ofg
Ifwecon
sider
thelim
itingcase
n�
g�
��wemust
have
atleast
�instru
ctionsper
pixelto
composite�
andwecan
seethat
thiswill
require
��������instru
ctionsfor
a��
by���
pixeldisp
layThePrin
cetonEngin
eoperates
at��M
Hzso
thetotal
execu
tiontim
eto
composite
anoutputim
ageis����
seconds
Ifwedonoth
ingbutcom
posite
frames
wecan
operate
atamaximumof
����fram
es
persecon
dClearly
thissolu
tionisinfeasib
lefor
aninteractiv
eren
derin
genviron
ment
onthePrin
cetonEngin
e
Itisinterestin
gto
consid
ertheattain
able
perform
ance
ofthissystem
inandof
���
itselfhow
everPart
oftheperform
ance
isstill
tiedupin
thesort�m
iddle
commu�
nication
Asdiscu
ssedearlier�
areason
able
model
forsort�m
iddle
communication
assumesthat
eachpoly
gonwillhave
tobepassed
overhalf
theprocessors
Sort�tw
ice
reduces
thenumber
ofe�ective
processors
tothegrou
psize
Thesort�m
iddlestage
will
execu
te
ism�
�pn
g� �
�d�a
�b��
pn�g
���
instru
ctions�
where
pis
thenumber
ofvisib
lepoly
gons
Thesort�last
stage�
assumingn�g�
��will
execu
te
isl�
�P ����g�
���
instru
ctions
Solv
ingfor
theoptim
algrou
psize
asdeterm
ined
bytheminimum
number
ofinstru
ctions�giv
es
g�
s��P
n
�p ���
is�
�ism��
isl���P
� s���P
p
n ��
sothegrou
psize
increases
with
number
ofprocessors�
amortizin
gthe�xed
costs
ofapass
operation
�anddecreases
inthenumber
ofpoly
gons�red
ucin
gthecost
of
thesort�m
iddleoperation
s
Bycom
parison
�pure
sort�middlecom
munication
with
thedistrib
ution
anddense
netw
orkoptim
izationsreq
uires
���
0
0.5 1
1.5 2
2.5 3
3.5 4
2000040000
6000080000
100000120000
10^6 instructions
polygons
sort-middle
sort-twice
Figu
re����
PredictedSort�Middle
vs�
Sort�
TwiceCommunicatio
nTim
e�
atabout������
poly
gonsper
frametheperform
ance
ofsort�tw
icewill
exceed
the
perform
ance
ofsort�m
iddle
ism
�p �� s
�a�
b� �a�
b�
n�
a� �A�
p ������pn
��� �
���
instru
ctions
Thenumber
ofinstru
ctionsreq
uired
bysort�m
iddleandsort�tw
ice
intersect
atp�
�����This
isareason
ably
high
number
ofpoly
gonsper
frame
Assu
mingweare
renderin
gat
��fram
esper
secondthisim
plies
apoly
gonrate
of
over��M
rendered
poly
gonsper
second
Figu
re���
show
sthepred
ictedperform
ance
ofthesort�tw
icecom
munication
al�
gorithm
versu
stheperform
ance
ofsort�m
iddle
Inparticu
lar�note
theextrem
ely
gentle
increase
incom
munication
timeas
afunction
ofthenumber
ofpoly
gonsfor
sort�twice
Figu
re���
compares
thesim
ulated
perform
ance
ofsort�m
iddle
andsort�tw
ice
���
0.1 1 10
100
10100
1000
10^6 instructions
10^3 polygons
sort-twice
sort-middle
Figu
re����
Sim
ulatedSort�
Middle
vs�
Sort�
TwiceCommunicatio
nTim
e�
theraw
communication
times
ofthealgorith
msclosely
match
thepred
ictedresu
ltsNote
that
thesim
ulator
restrictsthechoice
ofgrou
psize
gto
beafactor
ofthenumber
ofprocessors
Ateach
data
poin
tgwas
chosen
asthelargest
factorof
nsm
allerthan
thepred
ictedoptim
alchoice
ofg
communication
algorithmsfor
scenes
ofupto
��Mpoly
gons
Thescen
eswere
generated
asran
dom
poly
gonsuniform
lydistrib
uted
overthescreen
with
���ofthe
poly
gonsvisib
le
Figu
re���
inclu
des
thecosts
ofgeom
etryandrasterization
For
scenes
beyon
d
������poly
gonssort�tw
icecon
sistently
outperform
ssort�m
iddle
���
Summary
Sort�tw
iceo�ers
anextrem
elye�
cientmeth
odto
obtain
linearly
scalable
perfor�
mance
inthenumber
ofprocessors�
with
outpayingthefullcost
ofdense
sort�last
compositin
g
Thedisp
layisdivided
into
gregion
s�andevery
processor
r�
kgisallow
edto
rasterizeanypoly
gonthat
intersects
regionr
Byallow
inganumber
ofdi�eren
t
���
0.1 1 10
100
10100
1000
10^6 instructions
10^3 polygons
sort-twice
sort-middle
Figu
re����
Sim
ulatedSort�Middle
vs�
Sort�
TwiceExecutio
nTim
e�
in�
struction
counts
inclu
degeom
etryandrasterization
stagesfor
both
approach
esAll
poly
gonsare
textured
�andthesort�tw
iceim
plem
entation
uses
deferred
shadingand
texturin
gdurin
gthesort�last
operation
processors
torasterize
thesam
eregion
theam
ountof
sort�middle
communication
required
isminim
izedThen�g
copies
ofeach
regionofthescreen
arethen
composited
afterrasterization
insort�last
fashion
Thenumber
ofpixels
tocom
posite
fromeach
processor
is��g
ofthetotal
disp
laypixels
butthey
mustbepassed
adistan
cegdurin
g
composition
Theincreased
distan
cefor
eachcom
munication
operation
amortizes
the
overhead
operation
sassociated
with
communication
�andsubstan
tiallyred
uces
the
totaltim
espentin
sort�lastcom
munication
Theintelligen
tuseofthecom
munication
instru
ctionsdurin
gsort�last
composition
allowstheim
plem
entationof
deferred
shadingandtex
ture
mappingat
nocost
Ras�
terizationcan
bemademore
than
twice
asfast
asthenon�deferred
implem
entation
prov
ided
with
sort�middlesch
emes
Thesort�last
composition
stillreq
uires
operation
sprop
ortional
tothenumber
of
pixels�
andin
thelim
itcan
execu
tein
nofew
erinstru
ctionsthan
thetotal
number
ofpixels
inthedisp
laytim
esthesize
ofeach
pixel
Inthelim
itingcase
afactor
���
ofspeed
upisobtain
able
onthePrin
cetonEngin
eover
anorm
alsort�last
imple�
mentation
Sort�tw
icecom
munication
createsan
e�cien
tsort�last
implem
entation
�
not
only
intim
eandin
mem
ory�butalso
byadmittin
gamore
e�cien
trasterization
schem
ewith
fully
deferred
shadingandtex
ture
mapping
���
Chapter�
Implementatio
n
Theprev
iouschapters�
while
prov
idingtheback
groundandanaly
sisnecessary
to
specify
thesystem
�have
only
hinted
attheactu
alunderly
ingim
plem
entation
This
chapter
will
discu
sstheissu
esin
theactu
alim
plem
entation
ofan
interactive
poly
gon
renderin
gsystem
onthePrin
cetonEngin
eIwill
prim
arilydiscu
sse�
ciency
issues
andoptim
izations
Theactu
allin
e�by�lin
edetails
ofthecod
eare
straightforw
ard�
andrep
resentative
ofthecalcu
lationsperform
edin
anynumberofpoly
gonren
derers
Thepoly
gonren
derer
supports
depth
bu�erin
g�ligh
ting�
shadingandtex
ture
mappingof
triangular
prim
itivesTherasterization
stagehas
been
implem
ented
with
acolu
mn�parallel
decom
position
ofthescreen
�which
readily
admits
thefuture
inclu
sionof
variousoptim
izationsof
thecom
munication
structu
re�andin
addition
simpli�
estherasterization
process
foreach
poly
gonFigu
res����
��������
and���
are
represen
tativeoutputof
thesystem
�cap
tured
throu
ghadigital
framestore
attached
tothePrin
cetonEngin
eThehistogram
acrossthebottom
ofthe�gures
islin
early
prop
ortional
tothenumber
ofpoly
gonsthat
intersect
eachcolu
mnof
thedisp
lay
There
areanumber
ofcaveats
toim
plem
entin
gpoly
gonren
derin
gon
thePrin
ce�
tonEngin
eFirst
isthegen
eralissu
eof
managin
gtheim
plem
entation
pipelin
eon
aSIM
Dmach
ine�discu
ssedinx���
andsecon
disthedi�
culty
ofdealin
gwith
the
line�locked
program
mingparad
igm�discu
ssedinx��
These
constrain
tsprov
idesub�
stantial
motivation
forourim
plem
entation�so
wewill
discu
ssthem
�rst�
andthen
���
Figu
re����
Beethoven��
�����visib
lepoly
gonsina�����
poly
gonmodelren
dered
at��
frames
per
secondThehistogram
ingreen
represen
tsthenumber
ofpoly
gonslices
rasterizedbyeach
processor
Thepeak
number
ofslices
atapprox
imately
the
center
ofthe�gure�
is���
Theshort
redbars
note
which
processors
had
atim
econ
sumingrasterization
load
Figu
re����
Crocodile��
������visib
lepoly
gonsin
a������
poly
gonmodelren
�dered
at�fram
esper
secondThepeak
rasterizationload
is���
poly
gonslices
���
Figu
re����
TextureMappedCrocodile��
������visib
lepoly
gonsin
a������
poly
gonmodelren
dered
at�fram
esper
secondThepeak
rasterizationload
is���
poly
gonslices
Figu
re����
Teapot��
�����visib
lepoly
gonsin
a����
poly
gonmodelren
dered
at��
frames
per
secondThepeak
rasterizationload
is���
poly
gonslices
Note
the
high
lightsfrom
twoligh
tsou
rces
���
theactu
alim
plem
entation
ofthe�pipelin
estages�
geometry�
communication
and
rasterization
���
Paralle
lPipelin
e
Thepipelin
emodel
ofexecu
tionprov
ides
agreat
deal
ofparallelism
�butlen
dsit�
selfto
adirect
implem
entation
only
where
atru
epipelin
eexists
inhard
ware
The
Prin
cetonEngin
e�while
high
lyparallel�
isnot
apipelin
eof
processors�
itisavector
ofprocessors
Toextract
maxim
ume�
ciency
fromthesystem
thepipelin
ealgorith
m
mustbecarefu
llyim
plem
ented
Astraigh
tforward
implem
entation
would
assigneach
ofnprocessors
apoly
gon�
perform
allthegeom
etry�com
munication
andrasterization
forthose
npoly
gons�
andthen
process
thenextnpoly
gons
Whyis
this
not
agood
approach
�The
graphics
pipelin
erejects
inform
ationas
itproceed
sThe�rst
stepin
thegeom
etry
stagetran
sformsandculls
poly
gons
Ingen
eral�som
elarge
fractionof
thepoly
gons
westart
with
will
almost
immediately
bethrow
naw
ay�leav
inglarge
amounts
of
resources
unused
Furth
ermore�
di�eren
tpoly
gonswill
require
di�erin
gam
ounts
of
timeto
rasterize�as
agood
implem
entation
will
makerasterization
proceed
intim
e
prop
ortional
tothearea
oftheprim
itiveIfall
processors
rasterizeat
most
asin
gle
poly
gonandthen
return
togeom
etrycom
putation
sthealgorith
mwillexecu
teintim
e
prop
ortional
tothebiggest
poly
gon
Thismotivates
arearran
gementofthepipelin
einto
aset
ofseq
uentially
execu
ted
stagesAne�
cientim
plem
entation
will
�rst
transform
andtest
visib
ilityfor
all
poly
gons�then
perform
lightin
g�then
clipping�etc
Thisserialization
ofthealgorith
m
imposes
noloss
inperform
ance�as
clearlyall
ofthesam
eoperation
smustbeexecu
ted
regardless
oford
erTheonly
costassociated
with
thisserialization
isthestorage
necessary
tobu�er
results
betw
eenstages
Byprocessin
gtheentire
poly
gondatab
ase
ateach
stagebefore
proceed
ingvariation
sin
loadbetw
eenprocessors
exposed
ona
poly
gonbypoly
gonbasis
should
beminim
ized
��
���
Line�LockedProgramming
Apoly
gonren
derin
gim
plem
entation
onthePrin
cetonEngin
efaces
anumberofobsta�
clesbeyon
dextractin
gparallelism
�Most
challen
gingisthelin
e�lockedprogram
ming
parad
igmenforced
bythehard
ware�
Durin
geach
horizon
talscan
linetheprocessors
execu
tean
instru
ctionbudget
of
���instru
ctions�
attheendof
which
theprogram
counter
isreset
tothestart
of
theprogram
inprep
arationof
thenextlin
eof
video�
After
theoverh
eadof
clocking
outvideo
andoth
erbook
keepingtask
stheavailab
leprogram
instru
ctionbudget
is
approx
imately
��instru
ctionsper
scanlin
e�
Theforced
segmentation
oftheprogram
into
block
sof
code��
instru
ctionslon
g
greatlyincreases
thedi
culty
ofextractin
ge
ciency
fromtheim
plem
entation
�Code
mustnot
only
bee
cientin
itsutilization
ofprocessors�
itmustbee
cientin
itsuse
ofcod
eblock
s�Anycod
eblock
lessthan
��instru
ctionsin
length
wastes
execu
tion
timeneed
lessly�
Aclever
approach
istaken
tohandlin
gthiscon
straint�
Theprogram
isbrok
en
into
aset
ofproced
ures�
eachof
which
execu
tesin
lessthan
��instru
ctions�
Atthe
begin
ningof
eachscan
linetheproced
ure
addressed
byaglob
alfunction
poin
teris
called�Each
proced
ure
isthen
responsib
lefor
settingthispoin
terto
theaddress
of
thenextproced
ure
toexecu
te�In
largeareas
ofstraigh
t�linecod
e�such
asdurin
g
thegeom
etrystage�
eachproced
ure
just
setsthepoin
terto
theproced
ure
which
contin
ues
itsexecu
tion�while
insection
sof
codewith
asin
gletigh
tlycod
edloop
�
such
asrasterization
�theproced
ure
may
leavethepoin
terunchanged
until
itdetects
completion
byallp
rocessors�thusexecu
tingthesam
efunction
formultip
lescan
�lines�
Thismeth
od�while
somew
hat
awkward
�mapsrelatively
easilyto
poly
gonren
der�
ing�
For
exam
ple�all
ofthestate
isalread
ymain
tained
inglob
alvariab
lesto
minim
ize
theoverh
eadof
passin
gvariab
les�so
thefunction
disp
atchdoes
not
need
tocon
cern
itselfwith
param
eters�andcan
beavery
lowoverh
eadoperation
�
Ofcou
rse�justperform
ingthepack
ingofinstru
ctionsinto
instru
ctionbudget
sized
���
blocksisdi
cult�
asevery
timetheprogram
isrecom
piled
carefulatten
tionmustbe
paid
that
theinstru
ctionbudget
has
not
been
exceed
edbyanyof
theproced
ures�
In
somecases
complete
utilization
oftheinstru
ctionbudget
isim
possib
le�For
exam
ple�
longdivision
requires
slightly
more
than
���instru
ctionsto
execu
teon
thePrin
ceton
Engin
e�makingitnearly
impossib
leto
execu
temore
than
asin
glelon
gdividewith
in
theinstru
ctionbudget�
sosection
sof
thegeom
etrycod
ewhich
execu
temultip
lelon
g
divides
areoften
ine
cientbecau
sethey
don�thave
enough
extra
work
topack
into
theleft�over
instru
ctions�
Unfortu
nately
thecom
piler
prov
ides
nodirect
support
for
these
codepack
ingoperation
s�andtheprocess
isoften
anard
uoustrial
anderror
e ort�
TheMagic���
ournextgen
erationhard
ware
describ
edin
x����elim
inates
the
line�locked
program
mingcon
straint�
���
Geometry
Thegeom
etrystage
prov
ides
themost
obviou
sparallelism
�as
allvisib
le�after
rejec�
tionandclip
ping�
poly
gonsreq
uire
exactly
thesam
eam
ountof
work
tocom
pute�
How
ever�to
extract
ecien
cywestill
have
tobreak
thepipelin
einto
twopieces�
First
wetran
sformall
thepoly
gonsandtest
forvisib
ility�gen
eratingalist
ofvisib
le
poly
gonson
eachprocessor�
andthen
wecom
plete
thegeom
etryprocessin
gon
just
thevisib
leset
ofpoly
gons�
Thistwostage
process
allowsusto
ecien
tlyhandle
a
largeset
ofpoly
gons�hopefu
llymanyof
which
arenot
visib
le�Thuswecan
perform
acheap
transform
andtest
operation
onall
poly
gons�anddefer
theexpensiv
ework
tilllater�
when
wewill
have
toperform
iton
fewer
poly
gons�
�����
Representatio
n
Thepoly
gonrep
resentation
isshow
nin
Figu
re���
Allpoly
gonsare
triangles
for
simplicity�
Anall
integer
represen
tationhas
been
chosen
forits
greatere
ciency
than
�oatin
gpoin
t�
Each
vertexhas
associatedwith
ita�D
location��x�y
�z��anorm
al��n
x �ny �n
z �
���
VE
RT
EX
:(14 bytes)
mem
ber
Vertex 2
Vertex 0
Vertex 1
1 22222
bytes
Green
1
bytes
3 141414
(48 bytes)
1
Red
Blue
texture reference
1
PO
LY
GO
N:
1
(x,y,z)
(u,v)(x,y,z)
(x,y,z)
(u,v)00
(Nx,N
y,Nz)(N
x,Ny,N
z)2
r,g,b
(u,v)1
Nx
Ny
Nzuv
mem
ber
2
Z
1
1
22
Y X(N
x,Ny,N
z)0
Figu
re���
PolygonRepresentatio
n
andatex
ture
coordinate�
�u�v��
Thevertex
coordinates
are���b
itintegers�
The
vertexnorm
alsare
���bit
signed
�xed�poin
twith
��fraction
albits�
Thetex
ture
map
coordinates
are��b
itunsign
edintegers�
with
�fraction
albits�
forthefullran
ge
of�����
intex
ture
coordinates�
Thedecision
was
madenot
tosupport
per
vertex
colorto
savespace
andsim
plify
theligh
tingmodel�
Thepoly
gon�in
addition
to�
vertices�inclu
des
acolor�
�r�g�b��
andareferen
ceto
thetex
ture
touse�
ifapplicab
le�
�����
Transfo
rmatio
n
Thepoly
gondata
setismodeled
asasin
glerigid
object�
soasin
gletran
sformation
matrix
may
beused
forall
poly
gons�
Tran
sformation
matrices
aretran
smitted
tothePrin
cetonEngin
ebyacon
trol
application
runningon
arem
otework
station�There
isnodirect
meth
odto
inputdata
inreal
timeto
thePrin
cetonEngin
e�with
theobviou
sexcep
tionofvideo
inform
ation�
Aclever
reuse
ofaHiPPiregister
allowstran
sformation
matrices
tobesen
tto
the
engin
eat
framerates�
Thetran
sformation
matrix
isinputas
��integers�
specify
ingthe�elem
ents
of
���
therotation
matrix
with
��fraction
albits
andthetran
sformation
entries
assign
ed
integers�
Theuse
of���b
itintegers
forboth
thevertex
coordinates
andthetran
sformation
matrix
prov
ides
asign
i�can
tadvan
tage�Theworld
toeyesp
acetran
sformation
may
now
beperform
edwith
���bit
arithmetic
which
allowsthefast
�single
clockcycle�
single�p
recisiondata
path
sto
beused
throu
ghout�
Thetran
sformation
toscreen
�space
requires
�divides�
sx�
x�z
andsy�
y�z�As
alreadydiscu
ssed�division
isparticu
larlyexpensiv
eon
thismach
ine�so
wemake
the
optim
izationof
perform
ingdivision
bytab
lelook
up�Division
ofak�bitnumber
bya
k�bitnumber
may
alwaysbeperform
edbylook
upin
a�kentry
k�bittab
lewith
no
lossofprecision
�A���entry
lookuptab
leof��z
isprov
ided
oneach
processor�
Afull
���entries
arenot
necessary
asdivision
isonly
donebypositiv
ez�
While
expensiv
e
inmem
ory�thistab
lewill
prove
tobeinvalu
ablelater
forpersp
ectivecorrect
texture
mapping�
�����
Visib
ility�Clipping
Visib
ilityisperform
edas
aset
ofscreen
�space
tests�Tobevisib
leapoly
gonmust
liein
frontof
thedisp
layplan
eandtheboundingbox
ofthepoly
gonmust
intersect
thevisib
learea
ofthedisp
lay�plan
e�Alloth
erpoly
gonsare
rejectedas
they
mustbe
non�visib
le�
Allpoly
gonsare
de�ned
with
vertices
incou
nterclock
wise
order
when
facingthe
view
er�admittin
gatriv
ialback
facingtest�
Theback
facingtest
isperform
edby
evaluatin
gonevertex
ofthetrian
glein
thelin
earequation
de�ned
bytheoth
ertwo
vertices�Ifthevertex
liesin
thenegativ
ehalf�p
lanethepoly
gonisback
�facingand
isculled
�
Clip
pingis
not
explicitly
perform
edin
this
implem
entation�as
ourparticu
lar
meth
odfor
rasterizationmakes
itunnecessary�
Therasterizer
never
exam
ines
o �
screenpixels�
sothedetails
ofthepoly
gonintersection
with
screenedges
areunim
�
portan
t�
���
Theoptim
izationdiscu
ssedearlier
ofgen
eratingavisib
lelist
ofpoly
gons�rst
andthen
completin
gthegeom
etryprocessin
gonly
forthevisib
lepoly
gonsis
not
implem
ented
�For
scenes
ofinterest
thegeom
etrystage
typically
consumes
lessthan
��of
thetotal
instru
ctionswith
outthisoptim
ization�
�����
Lightin
g
Theim
plem
ented
lightin
gmodel
supports
three
lightsou
rces�an
ambien
t�adirec�
tional�
andapoin
tligh
tsou
rce�Allthree
lightsou
rcesare
constrain
edto
bewhite�
andthepoly
goniscon
strained
tobeasin
glecolor�
sothat
wecan
interp
olateasin
gle
quantity�
thewhite
lightinten
sityre�
ectedfrom
aface
totheobserver�
rather
than
separate
interp
olationsof
thered
�green
andblueligh
tre�
ectedto
theobserver�
Theam
bien
tligh
tcon
trolstheback
groundlevelof
illumination
inthescen
e�Con�
trolisprov
ided
over
theinten
sityof
theligh
t�
Thedirection
alligh
tacts
asaligh
tsou
rcein�nitely
faraw
ayfrom
thescen
e�so
allligh
tincid
entupon
thescen
ehas
thesam
edirection
vector�
Controls
areprov
ided
tomodify
thedirection
andinten
sityof
theligh
t�
Thepoin
tligh
tsou
rcemodels
aphysicalligh
tplaced
somew
here
inclose
prox
imity
tothescen
e�so
thedirection
ofincid
entray
sto
avertex
dependson
theposition
of
thevertex
relativeto
theligh
tsou
rce�Controls
areprov
ided
overtheposition
and
inten
sityof
theligh
t�
Separate
controls
areprov
ided
overthedi use
componentof
re�ection
andthe
specu
larcom
ponentof
re�ection
fortheob
ject�
Theuse
ofaligh
tingmodelwith
direction
alandpoin
tligh
tsou
rcesalso
requires
theability
tocom
pute
unitvectors
fromarb
itraryvectors�
Alook
uptab
leisused
to
perform
thereq
uired
inverse
square
root����
�����
CoecientCalculatio
n
Calcu
lationofcoe
cientsisthemost
timecon
sumingfunction
ofthegeom
etrystage�
Theworld
toeyetran
sformation
�cullin
g�clip
pingandligh
tingare
allperform
edin
lessthan
���instru
ctions�
Thecalcu
lationof
thecoe
cientsof
thelin
earequation
s
toiterate
require
anoth
er���
instru
ctionsto
execu
te�
Figu
re���
show
stheform
oftheequation
sto
compute�
Theexpenseofthese
oper�
ationsresu
ltsfrom
thenecessity
ofperform
ingtwo���b
itdivides
foreach
param
eter
tointerp
olate�Note
that
thedivisor
�is
common
toallthecalcu
lations�depth�
inten
sity�tex
ture
coordinates��
asitonly
dependson
vertexscreen
coordinates�
and
not
theparam
eterbein
ginterp
olated�
Oneoptim
izationthat
would
prov
idealarge
perform
ance
improvem
entwould
be
carefulcalcu
lationof
theinverse
ofthecom
mon
denom
inator
�once
into
a���b
it
�xed
poin
tnumber�
Subseq
uentdivision
scou
ldbeperform
edas
amultip
lyanda
shift�
Unfortu
nately
thisreq
uires
a���b
itmultip
lywith
a���b
itresu
lt�an
option
not
supported
bythecom
piler�
anddeem
edto
troublesom
eto
implem
entin
lightof
therelativ
elysm
all�less
than
���fraction
oftim
espentexecu
tingthegeom
etry
code�Thepoly
gondescrip
torgen
eratedas
the�nalresu
ltofthegeom
etrycom
putation
s�
show
nin
Figu
re����
completely
describ
esthepoly
gonfor
rasterization�Com
munica�
tion�thenextstage
ofoperation
�will
transm
itthese
descrip
torsbetw
eenprocessors
forrasterization
�
���
Communicatio
n
Theim
plem
entation
uses
sort�middlecom
munication
�Allof
theoptim
izationanal�
ysis
forsort�m
iddlecom
munication
was
based
ontheassu
mption
oflocality
ofdes�
tination
foreach
poly
gonto
becom
municated
�This
directed
ourchoice
ofscreen
decom
position
forrasterization
�as
alreadysuggested
�to
becolu
mn�parallel�
Byas�
signingeach
processor
asin
glecolu
mnof
pixels
onthedisp
layto
rasterizeweleave
���
C
224
B
(84 bytes)
A
coefficientcoefficient
bytes
(8 bytes)
-or-
(12 bytes)L
INE
AR
EQ
UA
TIO
N:
bytes
ABC4 4 4
PO
LY
GO
N D
ES
CR
IPT
OR
:
bytes
81281212 8
texture v
linear equation
edge0
edge1
edge2
zdepth
intensity
texture u
8322222
bytes
3texture reference
r,g,b
left
right
support information
# passes
bottom
top
Figu
re����
PolygonDescrip
tor�
resultof
allof
thegeom
etrycom
putation
sfor
asin
glepoly
gon�
aquarter
oftheprocessors
idledurin
grasterization
�thedisp
layisonly
���colu
mns
wide�
butwegain
agreat
deal
oflocality
inpoly
gondestin
ation�Allprocessors
ras�
terizingapoly
gonare
guaran
teedto
becon
tiguous�so
ourprev
iousanaly
seswhich
reliedon
theassu
mption
ofasin
gledestin
ationprocessor
forcom
munication
are
approx
imately
correct�andouroptim
izationswill
prove
e ective�
Theuse
ofthe�distrib
ution
�sch
emefor
decreasin
gcom
munication
timewas
un�
fortunately
not
implem
ented
�Thecon
sideration
andanaly
sisof
thispossib
ilitydid
not
ariseuntil
latein
theim
plem
entation
�preven
tingits
inclu
sion�How
ever�analy
sis
show
sthat
ithappensto
map
tothelin
e�lockedprogram
mingparad
igmvery
e�
ciently�
Passin
gasin
glepoly
gonfrom
neigh
bor
toneigh
bor
wouldreq
uire
���d�
�a�b
operation
s�where
a���
�a��
integer
poly
gondescrip
tor��b���
�thetest
overhead
�
andd�
��optim
ally�for
atotal
of���
instru
ctions�
���instru
ctions�tcom
fort�
ably
with
inthebudget
of��
instru
ctions�andallow
stim
efor
aglob
al�ifto
detect
completion
ofthe�rst
pass
ofthedistrib
ution
algorithm�
Adescrip
tionof
themech
anics
forperform
ingneigh
bor�to�n
eighbor
communica�
tionof
thepoly
gondescrip
torsisnow
presen
ted�Neigh
bor�to�n
eighbor�N
forthe
distrib
ution
optim
izationisan
obviou
sexten
sion�sim
ply
usin
glarge
pass
sizes�
��
�����
Passin
gaSinglePolygonDescrip
tor
Themost
essential
communication
operation
istheneigh
bor�to�n
eighbor
pass�
Ar�
bitrary
communication
canthen
beperform
edwith
repeated
neigh
bor�to�n
eighbor
passes�
Passin
gapoly
gondescrip
tor�rep
resented
as��
���bit
integers�
fromone
processor
toits
neigh
bor
isaloop
overthedescrip
tor�passin
ga���b
itpiece
ofitat
atim
e�
voidpass�int�sourceDescriptor�int�destDescriptor�
�
intcounter�
for�counter���counter��counter���
�
communicationRegister�sourceDatum counter��
passRight�����orpassLeft�����
destDatum counter��communicationRegister�
�
�
ThepassRight���or
passLeft���operation
isacom
piler
prim
itivethat
shifts
the
datu
min
eachprocessor�s
communication
registeroneprocessor
totherigh
t�or
left��
Each
processor
isperform
ingthesam
eaction
�so
eachprocessor
issim
ultan
eously
sourcin
gandreceiv
ingapoly
gondescrip
tor�Theloop
iscom
pletely
unrolled
and
coded
inassem
bly
language
fore
ciency�
�����
Passin
gMultip
leDescrip
tors
Ingen
eraleach
processor
will
have
more
than
asin
gledescrip
torto
distrib
ute�
and
more
than
asin
gledescrip
torwillb
edestin
edfor
thisprocessor�
Itistheresp
onsib
ility
ofeach
processor
tostore
thisdata
asitarriv
es�
Each
processor
manages
asourcearray
ofdata
andareceivearray
ofdata�
The
sourcearray
contain
sall
ofthedescrip
torsthat
thisprocessor
will
communicate
to
theoth
erprocessors�
Thereceivearray
contain
s�at
theendof
thecom
munication
�
allof
thedescrip
torsthat
theoth
erprocessors
communicated
tothisprocessor�
���
voiddistribute�intn�DESCRIPTORsource ��DESCRIPTORreceive ��
�
DESCRIPTOR�s��r�
s�
source�
r�
receive�
while�globalAnd�n�����
pass�s�r��
s�r�
if�
GOOD�r��r���
if��INTERESTING�s����n������
s���source�
�
�
Figu
re����
Implementatio
nforData
Distrib
utio
n
Thevariab
lespoin
tsto
thenextdescrip
torto
pass�
while
rpoin
tsto
thelocation
tostore
thenextdescrip
torreceiv
ed�sping�p
ongs
back
andforth
betw
eenthesou
rce
arrayandreceive
array�Initially
eachprocessors
sources
adatu
mfrom
itssource
array�andreceiv
eadatu
minto
itsreceivearray�
When
aprocessor
receivesa
descrip
toritim
mediately
poin
tssat
it�theassu
mption
bein
gthat
itwill
prob
ably
have
topass
thisdescrip
torfurth
eralon
g�as
itisunlikely
thisprocessor
isthelast
processor
that
need
sto
seethispoly
gon�Thisstore
andforw
ardoperation
insures
that
eachpoly
gondescrip
tortravels
aroundtherin
gfar
enough
tobeseen
byall
processors
rasterizingit�
Note
that
whatever
descrip
torswas
poin
tingat
itwas
just
passed
tothis
processors
neigh
bor�
which
isnow
responsib
lefor
it�Thusthe
autom
atics�roperation
forgetswhich
descrip
torthisprocessor
just
sourced
�butit
simultan
eously
becom
esanoth
erprocessors
responsib
ility�
Ifthedescrip
torreceiv
edisGOOD�a
descrip
torfor
apoly
gonthat
thisprocessor
will
bein
part
responsib
lefor
rasterizing�
theprocessor
increm
entsthereceive
poin
ter
rso
that
itdoesn
�toverw
riteitwith
thenextpoly
gonreceiv
ed�Otherw
iserisleft
alone�andthenextdescrip
torreceiv
edwilloverw
riteit�
Thiscan
cause�an
dgen
erally
���
does�
asitu
ationwhere
s��r�andtheprocessor
justcon
tinually
forward
spoly
gons
foroth
erprocessors�
Ifthedescrip
torreceiv
edbyaprocessor
onanygiven
cycle
isnot
INTERESTING
then
ithas
alreadybeen
communicated
toall
processors
that
need
tosee
it�The
processor
will
then
source
oneof
itsrem
ainingsourcepoly
gonsifithas
anyleft�
Iftheprocessor
has
nomore
descrip
torsto
contrib
ute
itwill
simply
pass
o this
�uninterestin
g�poly
gonto
thenextprocessor�
which
may
ormay
not
choose
to
replace
itwith
apoly
gonof
itsow
n�etc�
Thewhileloop
terminates
when
noprocessor
isstill
communicatin
gdata
ofin�
terest�Ifnooneiscom
municatin
gdata
ofinterest
then
thecom
munication
mustbe
complete�an
dthefunction
terminates�
Thishas
tobedoneviaaglob
altest
asitisn
�t
know
napriori
how
longitwill
taketo
communicate
agiv
ennumber
ofpoly
gons��
Thisim
plem
entationavoid
sas
manycop
yoperation
sas
possib
le�as
acop
ywill
require
aread
sandawrites
�aisthedescrip
torsize��
which
isalm
ostas
expensive
asthecom
munication
operation
itself�Thisisparticu
larlyim
portan
tfor
communi�
cation�as
itisgen
erallythecase
that
onagiv
encycle
only
somesm
allpercen
tageof
theprocessors
will
behandlin
gdata
they
need
acop
yof�
andtherest
will
bejustfor�
ward
ingdata
inten
ded
foroth
ers�Anycon
dition
aloperation
s�su
chas
acon
dition
al
copy�would
beparticu
larlyexpensive�
asonly
asm
allfraction
ofthemach
inewould
beutilized
�
When
communication
has
completed
eachprocessor
has
acop
yof
allthedescrip
�
torsfor
thepoly
gonsitwill
beresp
onsib
lefor
rasterizing�andthealgorith
mproceed
s
directly
torasterization
�
�Theactu
alim
plem
entation
optim
izesto
perform
this
testperio
dically
bymak
ingan
optim
isticgu
essof
how
manyiteration
sof
theloop
therem
ainingdescrip
torswill
taketo
communicate�
and
afterthatman
yitera
tionsanew
estimate
isform
ed�etc�
Theestim
ateswill
decrease
mon
otonically
�beca
use
they
are
based
ontherem
ainingnumber
ofpoly
gonseach
processor
has
tocom
municate�
which
isdecrea
singifprog
ressisbein
gmad
e�an
dthefreq
uency
ofcom
pleten
esscheck
swillth
ereforeincrea
seuntil
actualcom
pletion
�W
ithalittle
carethenumber
oftests
canbeminim
izedwith
out
perfo
rmingmore
communication
thannecessary����
F(x,y) =
Ax +
By +
CF
’(x,y) = B
y + C
’; C’ =
Asx
Figu
re����
LinearEquatio
nNorm
alizatio
n�lin
earequation
sare
norm
alizedto
thehorizon
talposition
oftheprocessor
inscreen
space�
���
Raste
rizatio
n
After
communication
eachprocessor
has
acop
yin
thereceivearray
ofall
poly
gon
descrip
torsthat
intersect
itsrasterization
region�A
column�parallel
decom
position
ofthescreen
has
been
implem
ented
�so
eachprocessor
isresp
onsib
lefor
rasterizinga
verticalslice
ofeach
ofthese
poly
gonsinto
itssin
glecolu
mnof
thefram
ebu�er�
Rasterization
isbroken
into
a�step
process
intheim
plem
entation
�
�Norm
alizelin
earequation
sto
processor
columnposition
�Rasterize
poly
gons
�����
Norm
alizeLinearEquatio
ns
Theequation
sthat
resideon
every
processor
aspart
ofthepoly
gondescrip
torsare
oftheform
Fx�y��Ax By C�Wehave
implem
ented
acolu
mn�parallel
decom
�
position
ofthescreen
�so
eachprocessor
isresp
onsib
lefor
rasterizingasin
glecolu
mn
ofpixels�
thuseach
processor
only
need
sto
iteratethelin
earequation
svertically�
Thenorm
alizationC
��C Asxismadefor
eachequation
�where
sxisthehorizon
�
talposition
ofthisprocessor�s
columnon
thedisp
lay�These
norm
alizationsreq
uire
approx
imately
���instru
ctionsper
poly
gonandare
applied
tothereceivepoly
gon
arrayon
eachprocessor
immediately
afterthecom
pletion
ofthecom
munication
stage�
��
Now
simpler
equation
sof
theform
F�x
�y��By C
�can
beevalu
atedon
each
processor�
asshow
nin
Figu
re����
Ofcou
rsetheequation
sare
evaluated
iteratively�
sotheactu
alevalu
ationprocess
isFx�y��Fx�y
���
B
foreach
verticalstep
�
Rasterization
begin
sonce
alltheprocessors
have
norm
alizedtheir
linear
equation
s
forall
receivedpoly
gons�
�����
Raste
rizePolygons
Poly
gonrasterization
�discu
ssedin
section�����
isim
plem
entedbyiteratin
gaset
of
linear
equation
sover
theboundingbox
ofapoly
gon�In
thiscase
weonly
iteratethe
equation
sover
avertical
span
ofthepoly
gondueto
thecolu
mn�parallel
decom
posi�
tionof
thescreen
�Torasterize
asin
glepixelwecom
pare
thecurren
tlyinterp
olated
depth
forthepoly
gonto
thecorresp
ondingpixelin
thedepth�bu�er�
Ifthecurren
t
pixelocclu
des
thepixelin
thedepth�bu�er
wethen
shadeandtex
ture
map
thepixel
andplace
itin
thefram
ebu�er
andits
depth
valuein
thedepth�bu�er�
Thuswehave
anocclu
sionmodelthat
isaccu
rateon
apixel�b
y�pixelbasis�
There
areanumber
ofissu
esandoptim
izationswhich
were
tackled
intheim
�
plem
entation
�Thedi�
cult
implem
entation
issues
will
bediscu
ssed�rst�
andthe
optim
izationswill
bediscu
ssedat
theendof
thesection
�
Decouplin
gRaste
rizatio
nandPolygonSize
Themost
importan
tissu
ealth
ough
arguably
anoptim
ization�in
therasterizer
is
allowingprocessors
toin
dep
enden
tlyadvan
ceto
their
nextpoly
gonafter
completin
g
rasterizationof
their
curren
tpoly
gon�Thisdecou
ples
therasterization
process
from
sizedi�eren
cesin
theparticu
larpoly
gonsbein
grasterized
oneach
processor
atany
poin
tintim
e�With
outthisdecou
plin
geach
processor
wouldrasterize
its�rst
poly
gon�
wait
untilalloth
erprocessors
were
done�andthen
allprocessors
wouldsim
ultan
eously
proceed
totheir
secondpoly
gon�etc�
forcingall
processors
torasterize
their
poly
gon
intim
eprop
ortional
tothesize
oftheworst
casepoly
gon�
Thedecou
plin
gisach
ievedbyperform
ingaperiod
ictest
durin
gpixelrasterization
���
toallow
processors
tomove
onto
their
nextpoly
gon�Every
k�pixels
allprocessors
testifthey
have
�nish
edrasterizin
gtheir
curren
tpoly
gon�andthose
that
have
update
theequation
sthey
areiteratin
gwith
thecoe�
cientsof
their
nextpoly
gonto
raster�
ize�Thistest
andcon
dition
aladvan
cementisexpensiv
ebecau
seitreq
uires
copying
indirectly
addressed
data
into
registers�Furth
ermore�
thelin
e�lockedprogram
ming
parad
igmmakes
somechoices
ofhow
oftento
perform
thetest
thevalu
eofk�more
e�cien
tthan
other�
interm
sof
totalinstru
ctionsexecu
tedevery
durin
geach
scan�
line�
Theshaded
poly
gonrasterizer
perform
sonetest
andeigh
tpixelrasterization
s
k�
��per
scanlin
e�andthefulltex
ture
mappingrasterizer
perform
sonetest
and
fourpixelrasterization
sper
scanlin
e�
Thisoptim
izationissim
ilarto
those
implem
ented
byCrow
���andWhitm
an����
Crow
andWhitm
anboth
use
asim
ilartest
todecou
plethescan
lineof
thepoly
gon
eachprocessor
iswork
ingon�butstill
forcesynchron
izationat
thepoly
gonlevel�
Thissuggests
they
only
gaine�
ciency
where
poly
gonare
similar
inarea�
buthave
varyingasp
ectratios�
Ourapproach
isinsen
sitiveto
poly
gonasp
ectratios
becau
se
eachprocessor
only
rasterizesasin
glecolu
mnof
anypoly
gon�andto
poly
gonarea�
which
would
seemto
o�er
amore
substan
tialadvan
tage�
Persp
ectiv
e�Corre
ctTexture
Mapping
Ourmodelof
textured
poly
gonsassociates
atex
turecoord
inate
u�v�with
eachofthe
verticesof
thepoly
gon�andthetex
ture
coordinate
ofanypoin
twith
inthepoly
gon
may
befou
ndas
thelin
earcom
bination
ofthese
vertexvalu
es�How
ever�this
is
linear
interp
olationin
object
space�
butnot
inscreen
space�
Thegen
eralform
ofthe
interp
olationshould
be
u�
Ax By C
����
v�
Dx Ey F
���
���
inob
jectspace�
butweinterp
olateall
ourparam
etersin
screenspace�
Naive
screenspace
interp
olationof
quantities
nonlin
earin
screenspace
leadsto
distractin
gartifacts�
perh
apsmost
noticeab
lein
texture
mapping�
Theerrors
made
destroy
theforesh
orteninge�ect
associatedwith
objects
twisted
away
fromtheob�
server�
Ifwecarefu
llyperform
only
interp
olationsof
quantities
that
arelin
earin
screenspace
wecan
createapersp
ectiveco
rrectren
derer�
Exam
iningourpersp
ectivetran
sformation
�sx�x�z
andsy�y�z�
wesee
wecan
express
these
equation
sas
u�
zAsx Bsy C�
����
v�
zDsx Esy F�
����
solin
earinterp
olationofu�z
andv�z
canbecorrectly
perform
edin
screenspace�
andthevalu
eofuandvrecovered
ateach
pixelbymultip
licationwith
z�How
ever�
multip
licationbyzrein
troduces
theprob
lemof
screenspace
linearity�
Ifweexam
inetheequation
ofaplan
ein
�D�andthustheequation
ofdepth�
wesee
that
liketex
ture
coordinates�
depth
isnot
linear
inscreen
space�
thein
verse
depth
islin
earin
screenspace�
How
ever�wecan
easilyinterp
olateinverse
zinstead
ofz�
andstill
use
depth�bu�erin
g�Ourcom
positin
gtest
for��z
values
issim
plythe
complem
entof
thetest
forz�
Now
wecan
trivially
interp
olateu�z�v�z
and��z
foreach
poly
gon�andtex
ture
mappingwill
require
thecalcu
lation
u�
u�z
��z����
v�
v�z
��z����
ateach
pixel�
Wereu
setheinverse
ztab
leused
toperform
thepersp
ectivedivision
�
��
avoidingthevery
high
costofperform
ingtwodividesper
iterationoftherasterization
loop�Heck
bert
andMoreton
prov
ideacom
preh
ensive
treatmentof
theissu
eof
object�
linear
vs�
screen�lin
earquantities
in����
Bilin
earInterpolatio
n
Texturemappingismuddied
furth
erbeyon
dpersp
ectivecorrection
bythenecessity
of
perform
ingsam
plin
g�Asim
pletex
ture
mappingalgorith
mperform
spoin
t�samplin
g
ofthetex
ture
image�
takingthecolor
ofthepoly
gonpixelas
thetex
elclosest
tothe
curren
tinterp
olatedu�v�coord
inate�
Unfortu
nately
small
motion
sof
objects
can
induce
jitteringin
thesam
plin
gof
thetex
elsandcreate
unpleasan
tartifacts
under
poin
tsam
plin
g�Bilin
earinterp
olationhelp
sto
reduce
these
e�ects
bycom
putin
ga
pixelcolor
astheweigh
tedaverage
ofthe�tex
elsnearest
tothetex
ture
coordinate�
Onmanyarch
itectures
thisprov
ides
asubstan
tialperform
ance
penalty�
asitre�
quires
perform
ing�tex
elfetch
esfor
everytex
turemapped
pixel�
How
everthePrin
ce�
tonEngin
ehas
enorm
ousaggregate
mem
orybandwidth�andin
factthere
isacop
y
ofall
ofthetex
tures
onevery
processor�
sotheaddition
altex
elfetch
esare
atriv
ial
expense�
�����
Raste
rizatio
nOptim
izatio
ns
Distin
ctTexture�MappedandNon�Texture�MappedAlgorith
ms
Theinclu
sionof
texture
mappingaddssign
i�can
tlyto
thecost
ofrasterization
�A
texture
mapped
pixel
requires
approx
imately
twice
aslon
gto
rasterizeas
apixel
with
outtex
ture
mapping�
Manyscen
esof
interest
have
notex
ture
mappingat
all�so
acon
trolisprov
ided
intheinterface
todisab
letex
ture
mappingwhich
results
inthe
use
ofan
optim
izedshading�on
lyrasterizer�
Early
Abort
���
(a)(b)
erroraliasing
for this slice
rasterizationcom
plete
Figu
re����
Raste
rizing�trian
glesoccu
pyless
than
half
ofthepixel
coveredby
their
boundingboxes�
Inboth
casessign
i�can
tadvan
tagecan
bemadeof
notin
gwhen
rasterizationexits
thepoly
gon�Case
b�addsthecom
plex
itythat
acareless
rasterizerwill
stepover
thepoly
gondueto
integer
pixel
coordinates
andrasterize
forever�
waitin
gto
enter
thepoly
gon�
Observe
that
when
rasterizingpoly
gons�
show
nin
Figu
re����
that
atlea
stone
half
ofthearea
intheboundingbox
ofatrian
gleis
notwith
inthetrian
gle�It
would
beadvan
tageousto
avoidexam
iningall
ofthese
pixels
need
lessly�Weinclu
de
asim
pletest
which
onaverage
avoidsrasterizin
g���
ofthese
�exterior�
pixels�
As
rasterizationproceed
s�a�ag
isset
when
thepoly
gonisentered
�Becau
setrian
gles
arenecessarily
convex
�as
soonas
oneof
theedge
equation
stest
negative
andthe�ag
isset�
therasterizer
know
sthat
ithas
completely
rendered
itspiece
ofthispoly
gon
andrasterization
ofthenextpoly
goncan
begin
�
�����
Summary
This
sectionhas
describ
edthemajor
implem
entation
issues
andoptim
izationsin
therasterizer�
Particu
laratten
tionhas
been
paid
tothedecou
plin
gof
thepoly
gon
bein
gprocessed
fromrasterization
�In
addition
persp
ective�correct
texture
mapping�
requirin
gan
expensiv
etwodivides
per
pixel�
has
been
e�cien
tlyim
plem
ented
with
a
tablelook
up�
Theuse
ofsep
araterasterization
algorithmswith
andwith
outtex
ture
mapping
support
prov
ides
asubstan
tialperform
ance
gainfor
scenes
with
outtex
turin
g�An
���
earlyabort
mech
anism
allowsrasterization
todetect
completion
ofapoly
gonbefore
theentire
vertical
exten
tof
theboundingbox
has
been
exam
ined�andprov
ides
a
signi�can
tred
uction
inthetotal
pixels
exam
ined
torasterize
agiv
enpoly
gon�
Thusfar
thegeom
etry�com
munication
andrasterization
stageshave
been
de�
scribed�Thenextsection
will
discu
ssthe�nal
issue�
theactu
alcon
trolof
theren
�
derer�
���
Control
Avirtu
altrack
ball
isused
tointeract
with
renderin
gsoftw
are�Awork
stationisused
todisp
laya�con
trolcube��
show
nin
Figu
re�����
which
represen
tstheorien
tation
andtran
slationof
theob
jectren
dered
bythePrin
cetonEngin
e�Thecubemay
be
grabbed
with
themouse
androtated
andtran
slatedarb
itrarily�Theuser
may
also
setthecubespinningaboutan
arbitrary
axis
ofrotation
�Quatern
ions�
describ
ed
in�����
areused
tointerp
olatebetw
eensuccessive
position
sof
therotatin
gcon
trol
cube�
TheGLmodelv
iewmatrix
����corresp
ondingto
theorien
tationof
thecubeis
transm
ittedto
thePrin
cetonEngin
econ
tinually
via
anauxiliary
data
channel�
The
renderer
may
beoperatin
gat
lessthan
thematrix
update
rate�in
which
casesom
eof
theupdates
areign
ored�so
while
theupdate
rateoftheengin
edisp
laymay
not
match
thevirtu
altrack
ball�
therate
ofrotation
andtran
slationof
theren
dered
object
does�
Clever
reuseofacom
munication
registerused
tosupport
aHiPPi�interface
allows
asin
gleinteger
tobeinputto
allprocessors
perscan
line�
Wecan
input��
���������
integers
asecon
dto
theengin
ethrou
ghthischannel�
Alow
data
rate�butmore
than
adequate
forsim
pleupdates�
such
asthetran
sformation
matrix
�Thebandwidth
into
theengin
ethrou
ghthis�au
xiliary
�channelisobviou
slysubstan
tiallylarger
than
that
required
forsim
plematrix
updates�
andallow
sthefuture
possib
ilityof
sendingextra
inform
ationthrou
ghthischannel�
such
astheposition
sandprop
ertiesofligh
tsou
rces
�Unfortu
nately
use
oftheactu
alHiPPichannel
requires
reprogrammingtheIO
circuitry
ofthe
engineandcan�tbeused
concurren
tlywith
video
output�
���
Figu
re�����
Virtu
alTrackballInterfa
ce
with
intheworld
�or
simulation
data�
Afullhandshakingprotocol
isused
betw
eenthework
stationandtheengin
e�
which
insures
perfect
synchron
ization�Thechannel
isalso
robust
andread
ilyhan�
dles
drop
ped�scram
bled
anddelayed
data�
sothat
theoccasion
alerror
isgracefu
lly
handled
�A
serialinterface
betw
eenthework
stationrunningthevirtu
altrack
ball
software
andthePrin
cetonEngin
econ
trollerisused
toavoid
control
di�
culties
due
tovariation
sin
ethern
ettra�
c�
���
Summary
Thischapter
has
discu
ssedtheim
plem
entation
ofacolu
mn�parallelp
olygon
renderer
onthePrin
cetonEngin
e�Thepoly
gonren
derer
supports
depth
bu�erin
g�ligh
ting�
shadingandtex
ture
mappingof
triangular
prim
itives�
Anexplan
ationof
thegeom
etrystage
has
been
prov
ided�inclu
dingthedetails
of
therep
resentation
ofthepoly
gondatab
aseandtheform
ofthecalcu
lationsperform
ed�
Theinclu
sionofaperprocessor
inverse
ztab
lesupports
e�cien
tpersp
ectivedivision
s�
andisreu
sedlater
tosupport
e�cien
tpersp
ective�correcttex
ture
mapping�
Thecom
munication
stagehas
been
carefully
describ
ed�as
itdom
inates
theexe�
cution
timeof
thealgorith
m�andits
e�cien
cybears
heav
ilyon
theoverall
e�cien
cy
���
oftheprogram
�Thedetails
ofqueuein
gpoly
gondescrip
torsfor
transm
issionand
receiptisexplain
ed�andacarefu
lexplan
ationof
theactu
alprogram
med
managed
communication
algorithm
isgiven
�
Sign
i�can
tissu
esandoptim
izationof
therasterization
stage�secon
donly
tocom
�
munication
intotal
execu
tiontim
e�have
been
exam
ined�Decou
plin
gof
thepoly
gon
sizedependence
inrasterization
incon
junction
with
anearly
abort
mech
anism
a�ord
s
ane�
cientim
plem
entation
�Theuse
ofan
inverse
zlook
uptab
leprov
ides
visu
ally
pleasin
gpersp
ectivecorrect
texture
mappingcheap
ly�
Anintuitive
virtu
altrack
ball
interface
prov
ides
real�timeuser
interaction
and
control
oftheren
dered
scene�
Chapter
�will
prov
ideperform
ance
results
ofthisim
plem
entationandacom
par�
isonwith
theperform
ance
ofaPrin
cetonEngin
econ
temporary�
theSilicon
Grap
hics
GTX�
���
���
Chapter�
Results
Thepoly
gonren
derer
has
achiev
edpeak
perform
ance
ofover
������visib
le�shaded
andbilin
earlyinterp
olatedtex
turemapped
poly
gonsper
second�There
areanumber
ofperpoly
gonandperpixelp
erformance
�gures
that
combineto
yield
the�nalaggre�
gateperform
ance
ofthesystem
�Thischapter
will
�rst
prov
idetheraw
perform
ance
ofeach
operation
�andthen
anaccou
ntin
gof
how
these
instru
ctionsare
spentper
visib
lepoly
gonren
dered
�Finally
acom
parison
will
bemadebetw
eenthePrin
ceton
Engin
eperform
ance
andthat
ofits
contem
porary�
theSilicon
Grap
hics
GTX�
���
Raw
Perfo
rmance
Theraw
perform
ance
ofeach
oftheindividualstages
oftheim
plem
entationisread
ily
quanti�
ed�Unfortu
nately
thelin
e�lockedprogram
mingparad
igmofthethePrin
ceton
Engin
eoften
obscu
resthetru
eperform
ance
ofthealgorith
m�Where
feasible�gures
areprov
ided
both
fortheactu
alim
plem
entationandfor
ahypoth
eticalPrin
ceton
Engin
elack
ingthelin
e�lockedcon
straint�
Thelin
e�lockedprogram
mingcon
straintmakes
itmost
natu
ralto
express
oper�
ationsas
thenumber
ofoperation
sach
ieved
per
linetim
e�where
alin
etim
eis��
instru
ctions�andinclu
des
thefram
ebuer
managem
entcod
eandanyunused
cycles�
��
NTSCvideo
operates
at��
frames
per
second�and� �
lines
per
frame�for
atotal
of
����lin
esper
second�
�����
Geometry
Tran
sformingto
screencoord
inates�
lightin
g�andcom
putin
glin
earequation
coe��
cientsisvery
fast�
� �procs�
�
polygon
proc�lines�����
lines
second����������shadedpolygons
second
����
with
texture
mapping�w
hich
requires
thecalcu
lationof
linear
equation
sfor
the
addition
alparam
etersuandv��
� �procs�
polygon
proc�lines�����
lines
second����������texturedpolygons
second
��� �
�����
Communicatio
n
Characterizin
gthecom
munication
perform
ance
isdi�
cult�as
thedistrib
ution
ofpoly
�
gonsandaverage
distan
cethey
have
totravel
willa
ectthee�
ciency
oftheshift
ring�
Ausefu
lbenchmark
isthenumberofpoly
gonpasses
�proc
n�
proc
n��
persecon
d�
� �procs��
passes
proc�line�����
lines
second�����������polygonpasses
second
�����
Ifeach
poly
gonreq
uires
approx
imately
� passes
�travelshalfw
ayarou
ndthe
ringto
reachits
destin
ation�then
thepoly
gonscom
municated
per
secondis
�NTSC
video�asde�ned
bytheSMPTEdraftsta
ndard
actu
ally
opera
tesat������������eld
s
per
second�orapproxim
ately
�����fra
mes
per
second
��
������
����passes
second�
�
polygon
passes
�������polygons
second
�����
�����
Raste
rizatio
n
Perform
ance
remain
slargely
communication
limited
�which
isindependentofpoly
gon
size��Thismakestheusual����p
ixelpoly
gon�aless
usefu
lbenchmark
�asitstresses
therasterizin
gperform
ance
excessiv
elyover
thecom
munication
perform
ance�
More
relevantispixels
per
second�as
thisyield
sinsigh
tto
theaverage
depth
complex
ity
that
canbehandled
�Anaggregate
rateof
almost
��million
bilin
earlyinterp
olated�
Gourau
dshaded
texturemapped
pixels
isattain
ed�With
outtex
turemapping�alm
ost
��million
pixels
canberen
dered
per
second�A
secondof
NTSC
video
is���
�
� ����
�
��������
pixels�
orless
than
oneeigh
thof
thepeak
pixel�llrate
ofthis
implem
entation
�
Ifwecon
sider
texture�m
apped
�bilin
earlyinterp
olated��Gourau
d�sh
aded
pixels
per
second�wecan
characterize
theresu
ltingperform
ance
as��
���procs��
pixels
line�proc�����
lines
second�����������pixels
second
�����
With
outthebilin
earinterp
olationnecessary
fortex
ture
mapping�
thepixel
�ll
ratedoubles�
���procs��
pixels
line�proc�����
lines
second���������pixels
second
�����
With
outthelin
e�lockedprogram
mingparad
igm�andign
oringtheoverh
eadto
�Thisistru
eto
�rst
order
Ofcourse
larger
polygonswill
haveto
stayin
therin
glonger
befo
re
every
onehasseen
them
�butgenera
llythesize
ofapolygonis
smallcompared
tothedista
nce
it
must
travel
���processo
rsare
speci�
ed�asonly
theren
derin
gofon�screen
processo
rsis
used
inthis
algorith
m
�
change
activepoly
gons�
therasterizer
requires
� instru
ctionsper
shaded
texture
mapped
pixel�
foran
aggregate�llrate
with
afully
distrib
uted
framebuer
�all� �
processors�
of�
� �procs����
�inst
sec�proc�
�
pixel
inst
������pixels
second
�����
Shaded
pixels
only
require
��instru
ctionsper
pixel�
foran
aggregate�llrate
of�
� �procs����
�inst
sec�proc�
��
pixel
inst
����������pixels
second
�����
Thefullpixelren
derin
grate
isneverrealized
becau
seweperform
aboundingbox
scanofthetrian
gular
prim
itives�which
guaran
teeswewill
test�an
dnot
render�
some
non�prim
itivepixels�
Wealso
perform
aperiod
ictest
�not
afterevery
pixel�
tocon
�
dition
allyadvan
ceto
thenextpoly
gon�which
while
improv
ingoverall
perform
ance�
alsoincreases
thenumber
ofpixels
wewill
exam
inefor
eachprim
itivenecessarily�
If
weapprox
imate
theloss
ine�
ciency
dueto
thepixels
exam
ined
outsid
eof
thepoly
�
gonandthegran
ularity
intherasterization
algorithm�therasterizer
achieves
���
parallel
e�cien
cy�Assu
minguniform
loadbalan
cing�
theren
derer
canrasterize�
������
����pixels
second�����
�
��
polygon
pixels
�����������pixelpolygons
second
�����
with
outtex
ture
mapping�
������
����pixels
second�����
�
��
polygon
pixels
������������pixelpolygons
second
�����
�����
Aggregate
Perfo
rmance�
percen
tof
timein
eachphase
object
frames�s��
poly
s�s��
geometry
communicate
rasterizebeeth
oven���
�������
����
���� ����
��
�����
� �����
����
crocodile
���������
��
����
�������
����
���� ����
���
�teap
ot���
������
� ��
���������
���
�����
�������
����
Table���
Renderin
gPerfo
rmanceforTypicalScenes�
Rendered
scenes
always
countan
integral
number
offram
etim
esto
compute�
asfram
ebuers
canonly
be
swapped
atfram
eboundaries�
Any�ex
cess�tim
ein
thefram
espentidlin
giscou
nted
asren
der
time�cau
singan
unfairly
high
estimate
ofren
der
timeversu
scom
putation
anddistrib
ution
�
Theactu
alperform
ance
oftheren
derer
ismeasu
redbyren
derin
gthescen
esshow
n
in�gures
�����
and
����Typical
results
aregiven
inTable���
Multip
leentries
forthesam
eob
jectare
atdieren
torien
tations�
ascom
munication
andren
derin
g
costwill
dependon
orientation
�Figu
re��
show
sthebreak
dow
nof
renderin
gtim
e
betw
eengeom
etry�com
munication
andrasterization
�
���
Accountin
g
ThePrin
cetonEngin
eprov
ides
anaggregate
�BIPsof
computin
gperform
ance�
At
������poly
gonsasecon
dthisim
plies
approx
imately
������instru
ctionsperren
dered
poly
gon�Thiscost
seemsexcessiv
e�butwecan
accountfor
where
alltheinstru
ctions
have
gone�
Exam
iningthescen
ewith
ourpeak
perform
ance�
thetex
ture
mapped
crocodile�
wetally
theinstru
ctionusage�
Ourscen
ehas
������poly
gons������
ofwhich
arevisib
leat
thegiv
enorien
ta�
tion�Thescen
eisren
dered
at�fp
s�for
anaggregate
perform
ance
of� ����
poly
gons
per
second�Wewill
express
times
foroperation
interm
sof
scanlin
es�becau
seall
function
alityisim
plem
ented
inscan
linesized
proced
ure
block
s�
��
0 10 20 30 40 50 60 70
beethovencrocodile
teapot
percent
geometry
comm
unicationrasterization
Figu
re���
Executio
nPro�le�thecom
munication
timefor
most
scenes
isthe
dom
inantperform
ance
factor�Asthenumber
ofpoly
gonsincrease
�andtheir
sizedecreases�
communication
timebecom
eseven
more
dom
inant�
Thegeom
etrystage
isnot
optim
izedto
perform
atwo�stage
pipelin
e�so
allpoly
�
gonshave
tobefully
processed
�Atex
ture�m
apped
poly
gonreq
uires
scan
lines
to
compute�
foratotal
of��inst
line�
lines
polygon��
���inst
polygon
����
there
areatotal
of������
poly
gons�for
������
polygons
frame
�����
inst
polygon��frames
second� ���
���
inst
second
��� �
We�ll
countthecom
munication
stageabit
dieren
tly�A
single
pass
consumes
MIPsacross
allof
theprocessors
andwecan
perform
�passes
inascan
line�for
��inst
line�proc�
�
lines
pass�� �
procs���
����inst
pass
�����
��
Each
poly
gonhas
amargin
alpass
cost�total
number
ofpasses
divided
bythe
number
ofpoly
gons�
of����
foratotal
of
������
inst
pass��
��pass
polygon�
�� �inst
polygon
�����
Thecom
munication
stagethusreq
uires
�����
polygons
frame
��� �
inst
polygon��frames
second�����
���
inst
second
�����
Bythetim
ewereach
therasterization
stagewehave
hitalarge
loadim
balan
ce�
dueto
theasy
mmetric
distrib
ution
ofscen
econ
tent�
Ifweexam
ineFigu
re��
and
note
thehistogram
wesee
that
thepeak
processor
has
���poly
gonsintersectin
g
itscolu
mn�Theaverage
poly
gonheigh
tis���
pixels
�obtain
edfrom
thesim
ulator�
andourtex
ture
mappingcod
eforces
poly
gonsto
rasterizein
�pixelchunks�so
each
poly
gonwillh
avean
eectiv
eaverage
heigh
tof�pixels�
poly
gonscan
benorm
alized
inascan
line�and�pixels
canberasterized
inascan
line�so
theinstru
ctionsreq
uired
toprocess
asin
glepoly
gonare
��inst
line��
normalize
polygon
�
line
normalize��
pixels
polygon�
�
line
pixel������
inst
polygon
�����
andourworst�case
processor
has
���poly
gonslices
torasterize�
foratotal
of
���polygons
proc
����inst
polygon�� �
procs��frames
second����
���
inst
second
�����
Ourtotal
instru
ctionsfor
renderin
gthisscen
eare
thus
�����
�instru
ctionsper
second�which
veryclosely
match
esthe�B
IPcap
ability
ofthePrin
cetonEngin
e�
Theextra
BIPs�a
non�triv
ialnumber
ofinstru
ctions�isaccou
nted
forin
anumber
ofplaces�
There
areanumberofproced
ures
that
arecalled
asin
gletim
e�forexam
ple�
��
toplace
thehistogram
ontheim
age�or
toinitialize
thedatab
asefor
renderin
g�In
addition
anycycles
betw
eentheendrasterization
andthestart
ofthenextfram
e
�when
thefram
ebuers
canbeswapped�are
missin
gfrom
thisanaly
sis�They
have
been
counted
bytheren
derer
asrasterization
instru
ctionsin
Figu
re���
���
Perfo
rmanceCompariso
n
Theperform
ance
ofPrin
cetonEngin
epoly
gonren
derer
iscom
pared
tothat
ofthat
ofits
contem
porary�
theSilicon
Grap
hics
GTX�describ
edin
� ��
Thecom
parison
isabitstilted
ofcou
rse�TheGTX
represen
tsthestate
ofthe
artin
special
purpose
poly
gonren
derin
ghard
ware
for����
Likew
ise�thePrin
ceton
Engin
e�alth
ough
never
soldin
volu
me�
isamillion
dollar
mach
ine�
Noneth
eless
they
arecom
parab
leon
anumber
oflev
els�Both
mach
ines
exhibit
ahigh
level
ofparallelism
�andare
implem
entedin
thetech
nology
era�so
clockspeed
smay
be
expected
tobesim
ilar�etc�
Thegeom
etrystage
oftheGTX
iscom
posed
of�high
perform
ance
�geometry
engin
es�con
nected
inapipelin
e�with
eachprocessor
handlin
gaspeci�
casp
ectof
thegeom
etrycalcu
lations�
Togeth
erthey
prov
idean
aggregateperform
ance
of��
million
�oatin
gpoin
toperation
sper
second�M
�ops��
Bycom
parison
thePrin
ceton
Engin
ecan
perform
a�oatin
gpoin
tmultip
lyanddividein
asin
glelin
etim
eon
each
processor�
foran
aggregateperform
ance
ofapprox
imately
� M�ops�
ThePrin
ceton
Engin
ewas
not
design
edto
perform
�oatin
gpoin
toperation
s�andcon
sequently
pays
aprem
ium
fortheir
use�
Bydirect
comparison
both
ofthese
system
sare
implem
entin
gageom
etrypipelin
e�
ThePrin
cetonEngin
eim
plem
entation
sacri�ces
dynam
icran
geanduses
anall
in�
tegerrep
resentation
forits
prim
itives
andcan
conseq
uently
compute
approx
imately
million
triangular
prim
itives
per
second�TheGTX
perform
s�w
ithsubstan
tially
fewer
processors
ofcou
rse�������
triangular
prim
itives
per
second�
Itishard
toquantify
theanalog
ofthePrin
cetonEngin
ecom
munication
stage
��
ontheGTX�TheGTX
has
adedicated
high
speed
buswhich
transports
prim
itives
betw
eenthegeom
etrypipelin
eandtheir
�nal
destin
ationat
theim
ageengin
es�The
buswas
speci�
callydesign
edto
acceptprim
itivesat
thefullrate
thegeom
etrystage
generates
them
�andwill
thusnever
beaperform
ance
limitin
gfactor�
Therasterization
stagereq
uires
someguessin
gto
determ
inetheperform
ance
of
theindividual
image
engin
esof
theGTX�They
arecited
ashavingan
aggregate
perform
ance
of��
million
depth�buered
pixels
per
second�which
isless
than
half
of
ourpixel�
llrate�Assu
mingeach
image
engin
emustperform
ontheord
erof �
integer
operation
sper
depth�buered
pixeltheaggregate
perform
ance
oftheim
ageengin
es
is���M
IPs�or
���BIPs�sligh
tlyless
than
��of
thePrin
cetonEngin
eperform
ance�
Ofcou
rsethePrin
cetonEngin
espendsapprox
imately
�instru
ctionsper
pixel�
so
they
aren�tdirectly
comparab
le�
Curiou
sly�theGTX
has
realtimevideo
inputandoutputcap
ability�
makingita
particu
larlyaptcom
parison
tothePrin
cetonEngin
e�Alth
ough
not
part
ofthepoly
�
gonren
derin
gprob
lem�theperform
ance
ofthePrin
cetonEngin
edoubtless
exceed
s
theGTXfor
video
relatedoperation
�
Som
efeatu
reshave
nocom
parison
�Most
notab
ly�thePrin
cetonEngin
eren
derer
perform
sfullb
ilinear�in
terpolated
texturemapping�which
theGTXoers
nosupport
for�In
addition
thePrin
cetonEngin
eisasoftw
aresolu
tion�andwill
admitmany
exten
sionsthat
aresim
ply
impossib
lein
ahard
ware
solution
�
The�nalandmost
tellingcom
parison
isofcou
rsetheaggregate
perform
ance�
The
GTX
canren
der
������con
nected
�shared
vertex
�trian
glesper
second�com
pared
tothePrin
cetonEngin
eim
plem
entation
�which
achieves
apeak
of� ����
texture
mapped
triangles
persecon
d�Given
theluxury
ofcostless
communication
�asprov
ided
intheGTX�thePrin
cetonEngin
ewould
immediately
double
itsperform
ance
and
becom
edirectly
comparab
leto
theGTX�
In����
thetim
eofthePrin
cetonEngin
e�s�rst
operation
�itwould
have
prov
ided
poly
gonren
derin
gdirectly
comparab
leto
high
endgrap
hics
work
stationsof
theera�
More
signi�can
tlyit
oers
anunparalleled
�exibility
ofoperation
throu
ghan
all
��
software
implem
entation
�
��
Chapter�
FutureDirectio
ns
Dueto
timeandresou
rcecon
straints
acou
ple
ofideas
inthisthesis
have
remain
ed
unim
plem
ented
�Future
work
will
address
these
issues
when
possib
le�
First
andforem
ostistheim
plem
entation
ofthe�distrib
ution
�optim
ization�ex�
pected
todouble
thesort�m
iddle
communication
optim
izationof
thecurren
tim
�
plem
entation
�Theoth
erissu
eto
beaddressed
isan
implem
entationof
sort�twice
communication
�Both
ofthese
issues
must
alsobeexam
ined
inligh
tof
thenext
generation
Prin
cetonEngin
e�just
becom
ingoperation
alat
thetim
eof
thiswritin
g�
which
will
o�er
greatlyim
prov
edcap
abilities�
Theim
plem
entation
ofthe�distrib
ution
�algorith
mis
expected
todouble
the
perform
ance
ofthealgorith
m�obtain
ingat
least��
poly
gonsper
secondon
the
Prin
cetonEngin
e�Its
implem
entation
should
consist
chie
yoftheaddition
ofanoth
er
proced
ureto
theren
derer
andsom
ecarefu
lmanagem
entofthecom
munication
bu�ers�
���
Distr
ibutio
nOptim
izatio
n
Thedistrib
ution
optim
izationfor
sort�middlecom
munication
prom
isesafactor
of�
improvem
entin
communication
time�
Thedistrib
ution
optim
izationperform
scom
�
munication
intwostages�
�rst
sendingpoly
gonsclose
totheir
�nal
destin
ationwith
anumber
oflarge
steps�andthen
placin
gpoly
gonson
their
�nal
destin
ationproces�
� �
sorswith
neigh
bor�to�n
eighbor
passes�
Distrib
ution
gainsperform
ance
overaregu
lar
sort�middleim
plem
entationbecau
setheoverh
eadfor
communicatin
gapoly
gonisper
pass�
soalarge
pass
iscom
parativ
elycheap
erthan
aneigh
bor�to�n
eighbor
pass�
Thedistrib
ution
optim
izationyield
safactor
of�speed
upin
communication
in
simulation
�Exam
iningourpeak
perform
ance
scene�thecrocod
ileat
��� poly
gons
asecon
d�wesee
that
���of
thetim
eisspentin
communication
�Ifthat
timecou
ld
behalved
anim
mediate
��im
provem
entin
perform
ance
would
beseen
�brin
ging
theperform
ance
ofthesystem
to���
poly
gonsper
second�
Asdiscu
ssedin
x�� �thedistrib
ution
optim
izationmapsvery
neatly
tothelin
e�
lockedprogram
mingparad
igmof
thePrin
cetonEngin
eandim
plem
entation
should
bereason
ably
straightforw
ard�
���
Sort�Twice
Sort�tw
iceo�ers
themost
prom
isingpoten
tialperform
ance
improvem
entseen
sofar�
Ourinitial
analy
sissuggests
that
sort�twice
implem
ented
onoursecon
dgen
eration
hard
ware�
discu
ssedin
x���would
operate
more
than
tim
esfaster
than
sort�twice
onthePrin
cetonEngin
e�thanksto
atrip
lingof
theclock
rate�andawider
commu�
nication
bu�er�
Anim
plem
entationof
sort�twice
renderin
gwould
prov
ideafurth
erexploration
ofthis
design
space
andinterestin
gfeed
back
onthespeci�
ccosts
ofthesort�last
compositin
gstep
�Hopefu
llyitcou
ldinuence
thedesign
offuturemach
inesto
inclu
de
lowcost
support
hard
ware
tofurth
erstream
lineits
operation
�
���
NextGeneratio
nHardware
ThePrin
cetonEngin
eis
aneigh
tyear
oldarch
itecture
atthis
poin
t�Curren
tlya
spin�ou
tcom
panyfrom
theDavid
Sarn
o�Research
Center�
theSarn
o�Real
Tim
e
Corp
oration�is
develop
ingacom
mercial
version
oftheengin
ecalled
the�M
agic�
��
���TheMagic�
although
based
ontheorigin
alPrin
cetonEngin
e�has
signi�can
t
enhancem
ents�
andrep
resents
aninterm
ediate
stepbetw
eenthePrin
cetonEngin
e
andthe�Sarn
o�Engin
e��as
describ
edin
����
Theim
provem
entsinclu
deatrip
ledclock
rate�more
mem
ory�disk
arraycap
abil�
ity����b
itinterp
rocessorcom
munication
�elim
ination
of�lin
e�locked�program
ming
model�
hard
ware
support
forbilin
earandtrilin
earinterp
olation�support
forhigh
bandwidth
video
andnon
video
data
input�andacom
puted
outputtim
ingseq
uence�
Thefaster
clockrate
will
directly
yield
a��
perform
ance
improvem
entper
processor�
Theincreased
mem
oryprov
ides
thepossib
ilityof
verylarge
poly
gonal
data
sets�and
largetex
ture
maps�
Acom
puted
�outputtim
ingseq
uence�
enables
eachprocessor
to
independently
determ
inewhen
itspixelisoutput�thiscom
bined
with
auniquefeed
�
back
capability
that
feedstheoutputof
theengin
einto
theinputallow
sexploration
ofe�
cientsort�last
compositin
galgorith
ms�
Beyon
dthebasic
clockrate
improvem
ents�thedoublin
gof
thewidth
ofthecom
�
munication
registerwill
prov
idesubstan
tialperform
ance
gains�as
pass
operation
swill
pass
twice
asmuch
data�
Theoverh
eadwill
bethesam
eto
write
andread
theregister
�stillonly
a���b
itcore�
andthetest
operation
swill
presu
mably
bethesam
e�but
distan
tpasses
will
require
fewer
cycles
totran
smitthesam
eam
ountof
data�
This
will
have
asubstan
tialim
pact
ontheperform
ance
ofasort�m
iddleim
plem
entation
with
the�distrib
ution
�optim
ization�
More
signi�can
tly�it
will
greatlycheap
enthecost
ofasort�tw
iceapproach
�If
wetake
thewrite�read
andtest
overhead
asnegligib
le�dueto
thesize
ofthepasses
employed
�then
compositin
ghas
becom
etw
iceas
cheap
�Coupled
with
theclock
�
rateim
provem
ents�
composition
will
only
require
onesix
thof
thetim
eas
thecurren
t
architectu
re�asizab
leim
prov
ement�w
hich
brin
gstheattain
ableperform
ance
into
the
�Hzupdate
rateregim
e�
Thenew
support
fornon
video
inputdata
durin
goperation
will
prov
ide�M
B�s
ofinputbandwidth
which
may
bedirected
toanyarb
itrarysubset
oftheprocessors
simultan
eously�
Thiswill
allowtheren
der
tooperate
inamore
conven
ientim
mediate
���
modesty
leof
renderin
gandto
support
datab
asesof
poly
gonsandtex
tures
toolarge
to�tin
main
mem
ory�
Magic
isfully
operation
alat
thispoin
t�andportin
gthepoly
gonren
derin
gto
it
will
beamatter
ofobtain
ingadequate
timeon
themach
inegiven
other
conictin
g
dem
ands�
���
Chapter��
Conclusio
ns
Thisthesis
has
addressed
twoparts
ofthebroad
topicof
parallel
poly
gonren
derin
g�
Acarefu
lanaly
sisofthesort�m
iddlecom
munication
prob
lemhas
dem
onstrated
its
lackof
scalability�
How
ever�particu
laratten
tionto
optim
izationshas
dem
onstrated
thee�ectiven
essof
acarefu
llytuned
andcon
trolledcom
munication
algorithm
forob�
tainingsubstan
tialperform
ance
improvem
entsover
typicalcom
munication
structu
res�
Theuse
ofatwo�p
asscom
munication
algorithm
reduces
thehigh
costof
perform
ing
manyneigh
bor�to�n
eighbor
communication
operation
sandach
ievesnetw
orkutiliza�
tionmuch
closerto
maxim
um
throu
ghput�
Thesearch
forscalab
leandcost�e�
ectcom
munication
strategyled
tothedevel�
opmentof
anovel
new
communication
strategyentitled
sort�twice
communication
�
Sort�tw
icecom
munication
marries
thee�
ciency
ofsort�m
iddle
forsm
allnumbers
ofboth
poly
gonsandprocessors
with
thecon
stanttim
eperform
ance
toobtain
an
algorithm
with
thescalab
ilityof
dense
sort�lastcom
munication
while
amortizin
gthe
high
costof
neigh
bor�to�n
eighbor
communication
overgrou
psof
processors�
Cycle
bycycle
control
ofthecom
munication
channel
createsavery
e�cien
tpipelin
eof
communication
operation
s�TheVLIW
architectu
reof
thePrin
cetonEngin
eislever�
agedto
support
afully
deferred
shadingandtex
ture
mappingmodel�
with
substan
tial
increases
intheperform
ance
oftherasterizer�
Anim
plem
entation
ofa�D
renderin
gengin
eon
adistrib
uted
framebu�er
SIM
D
���
architectu
rethat
achieves
�llrates
of�
million
pixels
asecon
dandover
��
texture�m
apped�lit
andshaded
poly
gonsasecon
dwas
presen
ted�
Theren
derer
implem
entation
addressed
several
issues�
�Com
munication
ofpoly
gondescrip
torsin
arin
gtop
ology
�Rasterization
inaprocessor
per
columnorgan
ization
�Flex
ibleinteraction
with
areal�tim
eren
derin
genviron
menton
aslave
mach
ine
Truereal�tim
eren
derin
gperform
ance
has
been
dem
onstrated
onaSIM
Dproces�
sorarray�
with
results
that
scalewell
with
increasin
gnumbers
ofprocessors�
Poly
gons
areof
relatively
arbitrary
size�andbenchmarked
astheclassic
���pixel
poly
gon��
Thealgorith
msare
readily
modi�ed
andadapted
toexperim
entation
�
ThePrin
cetonEngin
eprov
ides
auniqueexible
architectu
refor
real�timein�
teractiveren
derin
g�Thenextgen
erationof
thePrin
cetonEngin
e�theMagic���
is
expected
toprov
ideren
derin
gperform
ance
ofover
�million
poly
gonsper
secondin
asort�tw
iceim
plem
entation
�
��
Appendix
A
Sim
ulator
Adetailed
andaccu
ratesim
ulator
was
develop
edto
experim
entwith
algorithmchanges
andoptim
izationswith
outhavingto
rewrite
theim
plem
entation
�Thesim
ulator
pro�
vides
theexibility
andease
need
edto
rapidly
trymanyoption
sandcom
bination
s
offeatu
reswith
littlee�ort�
Thesim
ulator
alsoprov
ides
theability
tosim
ulate
ma�
chines
unavailab
lefor
actual
use�
such
asmach
ines
with
more
processors�
orournext
generation
hard
ware�
Thisthesis
has
presen
tedalarge
numberofcom
munication
algorithmsandoption
s
tobeanaly
zed�
Itis
impractical
torew
ritetheim
plem
entation
toanaly
zeeach
optim
ization�for
anumber
ofreason
s�
�Writin
gcod
efor
thePrin
cetonEngin
eisintricate
andtim
econ
suming�
�Thelin
e�lockedprogram
mingcon
straintoften
obscu
resthetru
eperform
ance
of
analgorith
mbymakingitdi�
cultto
use
allavailab
lecycles
forcom
putation
�
�Itisdi�
cultto
establish
correctness
oftheim
plem
entation�Thelack
ofdebug�
gingtools
makedata
distrib
ution
errorsdi�
cultto
detect
andanaly
ze�
Fortu
nately
thePrin
cetonEngin
elen
dsitself
very
well
tosim
ulation
�ThePrin
ce�
tonEngin
eisaSIM
Darch
itecture�
sobyobserv
ingtheworst
caseexecu
tionpath
ofa
program
weobserve
thetotal
execu
tiontim
ereq
uired
�Interp
rocessorcom
munication
���
iscom
pletely
controlled
bytheuser
program
�allow
inganycom
munication
algorithm
tobestu
died
atthecycle
levelwith
complete
accuracy�
With
inthesim
ulator
attention
ispaid
almost
exclu
sively
tothedetails
ofthe
communication
stage�butonly
becau
seithas
proven
themost
di�
cultstage
toana�
lytically
model�
particu
larlywhen
theinteraction
sof
more
than
asin
gleoptim
ization
mustbecon
sidered
�Thegeom
etryandrasterization
stages�while
requirin
gsign
i�can
t
amounts
ofcom
putation
intheim
plem
entation
�are
trivially
simulated
byobserv
ing
that
theworst
caseprocessor
�most
poly
gonsto
compute�
most
pixels
toren
der�
will
determ
inetheperform
ance
ofeach
stage�
A��
Model
Thesim
ulator
modelis
signi�can
tlyabstracted
fromthePrin
cetonEngin
e�Itassu
mes
somearb
itrarynumber
ofprocessors
connected
inaneigh
bor�to�n
eighbor
ring�
The
structu
reof
thesim
ulated
algorithm
isbased
largelyon
implem
entationexperien
ce�
andisorgan
izedto
support
themaxim
umam
ountof
exibility
inthecom
munication
stage�
A����
InputFile
s
Theinputto
thesim
ulator
consists
oftwopieces�
a�bounding�b
ox��lethat
describ
es
thepoly
gondatab
aseas
aset
ofpoly
gonboundingboxesinscreen
space�
anda�screen
map�that
speci�
esthedecom
position
ofthescreen
forrasterization
�
Thebounding�b
ox�le
isgen
eratedbyprep
rocessingapoly
gondatab
aseas
ob�
served
fromaspeci�
cview
poin
t�Every
poly
gonin
theorigin
aldatab
aseisspeci�
ed
inthebounding�b
ox�le�
inclu
dingpoly
gonsthat
areculled
asoutof
theview
ing
frustu
mor
back
facing�
Each
bounding�b
oxistagged
with
a�or
a�indicatin
gvis�
ible�cu
lled�Theinclu
sionof
culled
poly
gonsin
thebounding�b
ox�le
help
smodel
thegeom
etrystage�
durin
gwhich
somefraction
ofthepoly
gonscom
puted
will
be
back
facingand�or
outof
theview
ingfru
stum�It
isim
portan
tto
modelthese
culled
���
NumPolygons��
����������������
�����������������
������������ ���
����������������
���������������
��������������
���������������
Figu
reA���
Bounding�boxFile�Each
poly
gonisspeci�
edas
avisib
le�nonvisib
letag
andarectan
gular
bounding�b
ox�
poly
gonsbecau
se�
�Culled
poly
gonshave
acom
putation
alcost
associatedwith
rejectingthem
�
�It
may
becheap
erto
compute
andreject
apoly
gonthan
avisib
lepoly
gon�
which
will
a�ect
thetotal
instru
ctionsexecu
tedin
thegeom
etrystage�
�Poly
goncullin
gyield
sasy
mmetries
inthedistrib
ution
ofpoly
gonsacross
pro�
cessors�which
may
a�ect
theexecu
tiontim
eof
both
thegeom
etryandcom
mu�
nication
stages�
Theuse
ofabounding�b
ox�le
allowsthedetails
oftheactu
algeom
etrycalcu
�
lationsto
berem
ovedfrom
consid
eration�insurin
gthat
wemain
tainastab
leinput
data
setfor
di�eren
tsim
ulation
s�Italso
removes
theburden
ofinsurin
gthat
thege�
ometry
pipelin
eiscorrectly
implem
ented
inthesim
ulator
inaddition
totheren
derer�
Figu
reA��
show
sasam
plebounding�b
oxinput�le�
Thescreen
map
allowscom
pletely
arbitrary
mapsof
pixels
toprocessors
forras�
terizationto
bespeci�
ed�A
defau
ltcolu
mn�parallel
decom
position
ofthescreen
is
assumed�butgen
eralrectan
gular
regionsmay
bespeci�
ed�inclu
dingnon�con
tiguous
regions�allow
ingtheanaly
sisof
staticload
balan
cingtech
niques
such
asthose
sug�
gestedin
����Theonly
constrain
tson
thescreen
map
isthat
every
pixelisrasterized
byprecisely
oneprocessor
��Asam
plescreen
map
isshow
inFigu
reA���
�Thesort�
twice
case�
forwhich
multip
leprocesso
rsrasterize
thesamereg
ion
ofthescreen
�is
���
��
��
�����
��
��� ��� ��
�����
�������
����� ������ ��
Figu
reA���
Sample
ScreenMap�a processor
system
�with
eachprocessor
as�sign
edaquadran
tof
a���
by���
pixel
screen�Theprocessor
isgiven
inthe�rst
column�follow
edbytherectan
gular
regionit
rasterizes�Processors
may
belisted
multip
letim
esin
the�lefor
non�con
tiguousrasterization
regions�
Given
thepoly
gonboundingboxes
inthebounding�b
ox�le
andthepixel
to
processor
map
inthescreen
map
itcan
bedeterm
ined
precisely
what
processors
will
rasterizeeach
poly
gon�Anygiven
poly
gonisrasterized
bytheunion
oftheprocessors
whose
pixels
itoverlap
s�
A����
Parameters
Thesim
ulator
isfully
param
eterized�Allof
theparam
etersof
interest
may
bespeci�
�ed
viacom
mandlin
eoption
s�andmore
drastic
changes
�such
asin
thewidth
ofthe
interp
rocessorcom
munication
bus�
etc��may
betriv
iallymadethrou
ghrecom
pila�
tion�Theparam
etersof
interest
areshow
nin
TableA���
Most
oftheparam
etersare
fairlyobviou
s�Theparticu
larparam
etersof
interest
forspecify
ingthecom
munication
structu
reare
m�t�
d��andg�
which
enable
the
simulation
oftheduplication
�dense�
distrib
ution
andsort�tw
iceoptim
izations�
Data
Duplicatio
nThedata
duplication
pattern
isspeci�
edbym�which
isthe
number
oftim
esthepoly
gondatab
aseisduplicated
acrosstheprocessors�
Itis
speci�
edas
ageom
etryparam
eterbecau
seitisfully
resolved
bythegeom
etry
stageof
thesim
ulator�
andnot
exposed
tothecom
munication
stage�
Dense
Communicatio
nTheperiod
betw
eentests�
t�speci�
eshow
oftenthesim
u�
latedprocessors
will
testthepoly
gonthey
justreceiv
edto
seeifitiscom
pletely
discu
ssedlater�
���
param
eterdefau
ltmean
ing
n����
number
ofprocessors
polyfile
�none�
bounding�b
ox�le
mapf
ilecolu
mn
pixelto
processor
map
parallel
geometry
Gi
����instru
ctionsto
transform
andkeep
�rejectpoly
gonG
f����
instru
ctionsto
lightandcom
pute
linear
equation
sfor
apoly
gonm
�duplication
factorcom
municationt
�period
betw
eentests
d�
arrayspecify
ingseq
uence
ofpass
sizesto
beused
gn
number
ofprocessors
per
group
rasterizationR
���instru
ctionsto
rasterizeapixel
Table
A���
SimulatorParameters
communicated
andcan
berep
lacedwith
their
nextpoly
gonto
communicate�
Dense
communication
t�
� has
been
determ
ined
toalw
aysbeausefu
lopti�
mization
andis
used
bydefau
lt�For
simulation
ofthesparse
communication
case discu
ssedin
x��� t�
n�
Distrib
utio
nThedistrib
ution
pattern
isspeci�
edin
das
aseq
uence
ofpass
sizes
tobeused
durin
gcom
munication
�Theoptim
aldistrib
ution
strategy as
deter�
mined
experim
entally
would
berep
resented
asd�
�����
Sort�
TwiceThegrou
psize
gis
used
forsort�tw
icesim
ulation
sandspeci�
esthe
number
ofprocessors
responsib
lefor
rasterizationof
onecom
plete
copyof
the
screen�
Allof
these
param
etersinteract
aswould
beexpected
�Thusaduplication
fac�
torm
���andadistrib
ution
pattern
canbecom
bined
inthesam
esim
ulation
for
exam
ple�
Notab
lyabsen
tissupport
fortem
poral
localityoptim
izations�
Tem
poral
locality
���
requires
asubstan
tialexten
sionof
thesim
ulator�
Acom
mon
benchmark
fortem
poral
locality used
in��
forexam
ple
isto
render
ascen
e sligh
tlychange
theview
poin
t
andren
der
thescen
eagain
�Thesm
allchange
intheview
poin
tforces
somecom
mu�
nication
tobeperform
edso
that
thee�
ciency
oftem
poral
localityin
communication
canbemod
eled�How
ever theuse
ofabounding�b
ox�le
tocollap
sethepoly
gon
inform
ationdestroy
sthevery
inform
ationnecessary
tospecify
arbitrary
view
poin
ts�
Sim
ulation
oftem
poral
localityload
imbalan
cesin
x���exposed
severeenough
inef�
�cien
ciesthat
theapproach
was
abandoned
atthat
poin
t elim
inatin
gtheneed
for
simulator
support�
Inaddition
tothese
param
eters there
areanumber
ofinstru
mentation
knobsthat
may
beturned
onfor
anygiven
simulation
tocollect
extra
data�
Ofparticu
laruse
istheoption
togen
erateahistogram
ofthenumber
ofpoly
gonson
eachprocessor
before
andafter
eachphase
ofcom
munication
�This
exposes
both
geometry
load
imbalan
ces as
may
becreated
bym
���
andrasterization
loadim
balan
cesas
may
becreated
byanon�uniform
distrib
ution
ofpoly
gonsover
thedisp
lay�
A��
Operatio
n
Thehigh
�leveldiagram
inFigu
reA��
schem
aticallyrep
resents
theexecu
tionof
the
simulator�
Alm
ostall
oftheop
erationis
wrap
ped
upin
thedetails
oftheas
yet
unspeci�
edcom
munication
simulation
�In
addition
there
issom
eprep
rocessingto
loadthepoly
gondatab
aseinto
theprocessors
andsom
epost
processin
gto
countthe
expected
instru
ctionsfor
rasterization�Thefollow
ingsection
sdescrib
etheop
eration
ofthestages
ofthesim
ulator
indetail�
A����
Data
Initia
lizatio
n�
Geometry
Data
initialization
determ
ines
themappingof
poly
gonsto
processors
forgeom
etry
computation
s andthemappingof
pixels
toprocessors
forrasterization
�Theform
er
isspeci�
edbysim
ply
placin
gpoly
gonson
processors
while
thelatter
isdeterm
ined
���
clip/reject
geometry
transform
rasterizationproc 3
proc 2proc 1
proc 0
algorithms &
optimizations
comm
unication
polygon database
proc 2
screen map
proc 0
calculate coefficients
light
proc 1proc 2
proc 3geom
etry
23
proc 0proc 1
proc 3
10
Figu
reA���
SimulatorModel�
a�processor
poly
gonren
derer
with
a�poly
gondatab
ase�
���
bycalcu
latinga�destin
ationvector�
foreach
poly
gonwhich
tagseach
processor
responsib
lefor
rasterizingapart
ofthepoly
gon�
Thebounding�b
ox�leonly
de�nes
thetotal
setof
poly
gonsin
thesim
ulation
and
leavesunspeci�
edthemappingbetw
eenpoly
gonsandprocessors
forthegeom
etry
calculation
s�Thesim
ulator
allowstw
odistrib
ution
pattern
s�
Norm
alThepoly
gonsare
assigned
totheprocessors
inrou
ndrob
infash
ion�Num�
berin
gthepoly
gonsseq
uentially
asthey
areread
fromthebounding�b
ox�le
poly
gonbwill
lieon
processor
i�
mod�b�n
�on
annprocessor
system
�
Duplicated
Aduplication
factorm
��speci�
esthat
eachpoly
gonshould
bedu�
plicated
mtim
esacross
theprocessor
array�A
poly
gonbisinstan
tiatedon
all
processors
i�
mod�b�n
�m��km�
Thenorm
aldistrib
ution
isjust
adegen
eratecase
�m�
��of
theduplicated
distrib
ution
pattern
�
Thepossib
ilityofaran
dom
assignmentofpoly
gonsto
processors
was
not
inclu
ded
asthiscan
besim
ulated
byshu�ingthebounding�b
oxinput�lebefore
itisread
by
thesim
ulator
andexperim
ental
results
show
that
anycorrelation
inthedata
setis
alreadybrok
enupbytherou
ndrob
inassign
mentof
poly
gonsto
theprocessors�
Thecom
bination
ofthepoly
gonboundingboxes
fromthebounding�b
ox�leand
thepixel
toprocessor
map
fromthescreen
map
prov
ides
allof
thenecessary
in�
formation
forperform
ingcom
munication
�Each
poly
gonis
describ
edin
part
byan
n�bitdestin
ationvector
where
abitiisset
ifandonly
ifthepoly
gonboundingbox
intersects
theset
ofpixels
rasterizedbyprocessor
i�Figu
reA��
show
sthedestin
a�
tionvectors
associatedwith
eachpoly
gonas
ashort
bitvector
nextto
thepoly
gon�
Note
that
multip
lebits
areset
ifmore
than
asin
gleprocessor
will
beresp
onsib
lefor
rasterizationof
thepoly
gon�
Theuse
ofabounding�b
ox�lewhich
speci�
estherectan
gular
exten
tof
thepoly
�
gonsguaran
teesthat
wewill
never
incorrectly
assumeapoly
gondoesn
�toverlap
a
���
proc 2proc 3
proc 0proc 1
Figu
reA���
Destin
atio
nFalse
Positiv
e�thepoly
gonis
incorrectly
assumed
tointersect
therasterization
regionof
processor
��
processor�s
pixels
how
ever in
somecases
wemay
incorrectly
assumecoverage�
Con�
sider
thepoly
gonandscreen
map
show
nin
Figu
reA���
processor
�will
attemptto
rasterizeapiece
ofthispoly
gon even
though
itdoesn
�tactu
allyoverlap
processor
��s
region�It
isassu
med
that
these
occasional
falsepositiv
esincuralow
ercost
than
a
precise
test�Given
thecost
ofsort�m
iddlecom
munication
itmay
make
sense
toper�
formatest
toelim
inate
falsepositiv
es how
ever itisunnecessary
foroursim
ulation
s
ofinterest�
For
both
ourim
plem
entation
andmost
ofoursim
ulation
s�barrin
gsort�
twice
approach
es� acolu
mn�parallel
decom
position
ofthescreen
isused
in
which
casetheboundingbox
ofthepoly
gonwill
only
overlapaprocessor�s
columnof
the
screenifthere
will
bepixels
torasterize
sothere
isnoexcess
communication
done
dueto
falsepositives�
The�rst
stageof
geometry
computation
will
process
aworst
caseof
dp�m�n
e
poly
gonson
aprocessor
andreq
uires
Giinstru
ctionsper
poly
gonto
perform
�In
general
eachprocessor
will
beresp
onsib
lefor
communicatin
g andthusperform
ing
thesecon
dstage
ofgeom
etrycom
putation
all
visib
lepoly
gonsassign
edto
it�The
data
duplication
optim
ization d
iscussed
inx���
introd
uces
asligh
ttw
ist as
thesam
e
poly
gonwill
exist
onm
processors�
how
ever only
oneprocessor
should
communicate
thepoly
gon in
particu
lartheprocessor
which
canperform
thecom
munication
most
cheap
ly�Each
processor
will
source
apoly
gononly
ifitistherigh
tmost
processor
left
���
ofthe�rst
processor
rasterizingthispoly
gon��
After
thepoly
gondatab
aseis
loaded
onto
theprocessors
theligh
tingmod
elis
applied
tothevisib
lepoly
gonson
eachprocessor
andlin
earequation
coe�cien
tsare
calculated
�Thisisa�pseu
do�step
�in
thesim
ulator
which
only
deals
with
poly
gon
boundingboxes
andisinclu
ded
tomod
eltheactu
alalgorith
m�Itreq
uires
max
i �piv ��
Gfinstru
ctionsto
execu
te where
pivisthenumber
ofvisib
lepoly
gonson
processor
i�
Note
that
atthispoin
tweonly
have
asin
glecop
yof
eachpoly
gonread
yto
distrib
ute
regardless
ofthechoice
ofm�Theduplication
factorm
ishidden
fromtherem
ainder
ofthesim
ulator
ine�ect
��
A����
Communicatio
n
Com
munication
isthemost
challen
gingpart
ofthepoly
gonren
derin
galgorith
ms
tosim
ulate�
Alloth
erasp
ectsof
thesim
ulation
may
beperform
edat
agross
level
bysim
ply
countin
gpoly
gons�pixels�
andmultip
lyingbythenumber
ofinstru
ctions
required
per
poly
gon�pixel��
Com
munication
how
everreq
uires
adetailed
understan
d�
ingof
what
happensin
theinterp
rocessorcom
munication
channel
onapass
bypass
level�Com
munication
ismod
eledas
occurrin
gon
a�D
ringof
processors�
Each
proces�
sorisassign
eda�slot�
intherin
gwhich
itcan
place
apoly
gondescrip
torinto
and
readapoly
gondescrip
torfrom
�Therin
gis
represen
tedbyan
arrayof
poin
tersto
poly
gondescrip
tors�Each
poly
gondescrip
torrep
resentsaminim
alsetof
inform
ation
areferen
cenumber
forthepoly
gon�for
laterverify
ingthecorrectn
essof
thecom
mu�
nication
�andadestin
ationvector
which
speci�
estheprocessors
that
must
receive
acop
yof
thispoly
gonfor
thisstage
ofcom
munication
�Figu
reA��
show
stherin
g�
�Thisiswhatwemeanby�clo
sest��theprocesso
rthatwill
haveto
makethefew
estpasses
befo
re
thepolygonhasatlea
ststa
rtedto
arriv
eatits
destin
atio
n�This
proves
tobeausefu
lcost
metric
forcolumn�parallel
deco
mpositio
nsofthescreen
�where
alldestin
atio
nprocesso
rsare
contig
uous�
Amore
sophistica
tedcommunica
tioncost
measure
must
beused
fordisjo
intdeco
mpositio
nsofthe
screen�
�Ofcourse
duplica
tionmayintro
duce
largeasymmetries
inthevalues
ofpiv �
butthatwill
a�ect
only
thesim
ulated
execu
tioncost�
���
proc 3proc 2
proc 1proc 0
proc 3proc 0
proc 1proc 2
Figu
reA���
SimulatorCommunicatio
n�theinterp
rocessorcom
munication
isab�
stractedas
passin
gan
entire
poly
gondescrip
torneigh
bor�to�n
eighbor
atonce�
Processor
�poin
tsto
a�null�
which
indicates
that
there
isnopoly
gondescrip
tor
curren
tlyassociated
with
that
locationin
therin
g�
Each
processor
has
anarray
ofpoly
gondescrip
torsto
source
andan
arrayof
poly
gonsreceived
fromcom
munication
analogou
sto
thesourceandreceivearray
s
intheim
plem
entation
�Initially
therin
gis
empty
andtheprocessors
allhave
their
source
arraysfullandtheir
receivearray
sem
pty
exactly
liketheactu
alim
plem
en�
tation�Sim
ilarly at
theendof
communication
eachprocessor
will
have
anem
pty
source
arrayandareceive
arraywith
allpoly
gonsitwill
have
torasterize�
Com
munication
operation
sare
perform
edby�rotatin
g�therin
g�Allof
thepoin
t�
ersare
shifted
oneplace
totherigh
tin
thearray
with
thepoin
terthat
fallso�
the
endshifted
back
tothemiddle�
Larger
shifts
�forthedistrib
ution
optim
ization�are
implem
ented
byperform
ingalarger
rotationof
therin
g�Thecom
munication
isn�t
actually
broken
dow
nto
the�nest
granularity
ofsim
ulatin
geach
individual
commu�
nication
operation
necessary
topass
apoly
gondescrip
torfrom
processor
toprocessor�
Thecost
mod
elfor
apass
isprecisely
thei�
���d�a
�bmod
elused
intheanaly
tical
analy
ses with
a���
andb�
���
After
eachpass
operation
eachprocessor
teststhedestin
ationvector
ofthepoly
gon
descrip
torin
itsslot
oftherin
g andifthepoly
gonisdestin
edfor
itnotes
thereferen
ce
number
inthereceiv
earray
andclears
itsbit
inthevector�
Apoly
gonhas
been
communication
toall
interested
processors
when
nobitin
destin
ationvector
isset�
Thedense
distrib
ution
andsort�tw
icecom
munication
optim
izationsare
allex�
���
posed
with
inthecom
munication
simulation
andare
discu
ssedin
turn
below
�
Dense
Communicatio
n
Thesparse
vs�
dense
communication
optim
izationsare
implem
entedviathetvariab
le
which
speci�
eshow
oftentheprocessors
testthepoly
gondescrip
torin
their
�slot�
tosee
ifeveryon
ehas
seenit�
For
simplicity�s
sakeeach
processor
will
autom
atically
replace
anullin
itsslot
oftherin
gwith
oneof
thepoly
gonsfrom
itssou
rcearray�
The
period
icitytof
thetest
determ
ineshow
oftenthedestin
ationvector
ofeach
descrip
tor
intherin
gistested
andturned
into
anullifall
zero�
Distrib
utio
n
Thedistrib
ution
optim
izationis
relatively
trickyto
implem
ent�It
isperform
edby
execu
tingthecom
munication
algorithm
anumber
oftim
es andbetw
eeneach
execu
�
tionthereceiv
earray
foreach
processor
isused
toreb
uild
thesou
rcearray�
Durin
g
allstages
ofcom
munication
excep
tthelast
theshifts
that
areperform
edare
ofsize
greaterthan
� so
thenotion
ofthedestin
ationvector
becom
esabitmuddied
�Poly
�
gonscan
nolon
gerbearb
itrarilyrou
ted butcan
only
besen
tto
processors
i�
s�kd
where
sistheprocessor
thispoly
gonissou
rcedfrom
anddisthesize
ofthepass�
To
achieve
thisdistrib
ution
pattern
thedestin
ationvectors
arebuilt
fortheactu
alpoly
�
gondestin
ationsandthen
allof
thedestin
ationsare
adjusted
toalign
with
the�rst
validdestin
ationbefore
thedesired
destin
ationprocessor
asshow
nin
Figu
reA��a�
Inoursim
ple
exam
ple
d�
���andprocessor
�issou
rcingapoly
gondestin
ed
forprocessors
� �and���
The�rst
stageof
distrib
ution
results
inboth
processor
�andprocessor
�havingacop
yof
thepoly
gon�Thesecon
dpass
ofcom
munication
will
beof
size�andwill
route
thepoly
gonto
its�nal
destin
ations
processors
� �
and���
Ifwearen
�tcarefu
lboth
processor
�and�will
tryto
route
thepoly
gonto
all
ofthese
processors
which
intheworst
caseresu
ltsin
these
processors
rasterizingthe
poly
gontw
iceandalso
wastes
communication
bandwidth�Theobservation
ismade
that
afterperform
ingacom
munication
stagewith
passes
ofsize
deach
processor
can
���
128
40
(b)
(a)
processor
smooshed/m
asked destinationsvalid destinations
source processor
Figu
reA���
Destin
atio
nVectorAlignment�
a��
processor
system
usin
gadistri�
bution
pattern
of����Thedestin
ationvectors
arefor
apoly
gonthat
isdistrib
uted
fromprocessor
�to
�nal
destin
ationsof
processors
� �and���
�a�isthedestin
ationvector
used
forthe�rst
pass
ofcom
munication
and�b�are
thedestin
ationvectors
forprocessors
�and�to
perform
thesecon
dand�nal
pass
ofcom
munication
�
onlypass
anypoly
gonsithas
tosou
rceover
d��processors
toavoid
duplication
�Thus
processor
�will
only
pass
poly
gonsas
faras
processor
� etc�
Infact
wecan
always
thinkof
communication
asstartin
gwith
apass
ofsize
n thenumber
ofprocessors
afterwhich
eachpoly
gonmay
only
bepassed
overat
most
n��processors
before
it
will
beseen
twice
bysom
eprocessor�
Figu
reA��b
show
sthedestin
ationvectors
used
byprocessors
�and�for
thesecon
dstage
ofcom
munication
�
Sort�
Twice
The�nal
optim
izationto
deal
with
issort�tw
icecom
munication
�Theplacem
entof
processors
into
groupsof
sizegintrod
uces
asubtle
complex
ityinto
thecom
munication
structu
rebymakingprocessors
equivalen
tat
somelev
el�Ifapoly
gonoverlap
sregion
ritcan
berasterized
byanyprocessor
r�
kg
sothecom
munication
pattern
isno
longer
determ
inistic�
Inreality
ofcou
rseit
isdeterm
inistic
becau
sewewanteach
poly
gonto
traveltheminim
um
amountof
distan
cepossib
le which
mean
sits
only
destin
ationis
the�rst
processor
r�
kgafter
theprocessor
itstarts
on�How
ever
���
04
expand
mask
processor
valid destinationssource processor
128
destinations
Figu
reA���
Destin
atio
nVectorforg��
��thedestin
ationprocessor
setisinferred
fromthe�rst
gprocessors
andthen
mask
edso
that
thepoly
gonisreceived
atmost
oneof
theprocessors
which
canrasterize
theregion
�s�itoverlap
s�
thepixelto
processor
map
will
only
map
pixels
tothe�rst
gprocessors
becau
sethe
screenis
divided
into
gregion
s andonly
asin
gleprocessor
may
beresp
onsib
lefor
anypixel�
Thuswhen
thedestin
ationvector
isbuilt
itonly
speci�
esdestin
ationsin
the�rst
gprocessors�
Figu
reA��
show
sasam
ple
destin
ationvector
forsort�tw
ice
communication
�Thedestin
ationvector
isexpanded
toset
orclear
allof
thebits
based
onthe�rst
gprocessor
bits
tore�
ectall
thevalid
rasterizationprocessors�
Thevector
isthen
mask
edso
that
thepoly
gonisshifted
amaxim
um
ofg��tim
es
toavoid
unnecessary
communication
�After
this
more
complex
destin
ationvector
constru
ction thecom
munication
may
proceed
asit
norm
allywould alth
ough
of
course
theextra
instru
ctionsreq
uired
toperform
thesort�last
compositin
gmust
beaccou
nted
forlater�
Thesort�last
composition
isadata
independentop
eration
andthenumber
ofinstru
ctionsreq
uired
isjust
calculated
based
onthegrou
psize
andthenumber
ofprocessors�
Thecon
tribution
is�instru
ctionsfor
g�
n�van
illa
sort�middlecom
munication
�as
would
beexpected
�
Thissection
has
describ
edall
ofthemajor
features
ofthecom
munication
simula�
tionandsom
eof
thecare
that
isnecessary
intheir
implem
entation
�Itisworth
notin
g
that
complex
ityin
thesim
ulation
will
likely
alsobere�
ectedin
anyim
plem
entation
andanyaddition
alinsigh
tthesim
ulator
prov
ides
inim
plem
entation
canbequite
���
valuable�
A����
Rasterization
Therasterization
stageissim
ulated
inafash
ionsim
ilarto
thegeom
etrystage�
Itis
possib
lewith
reasonably
high
accuracy
toguess
how
longitwill
taketo
rasterizea
setof
poly
gons�
For
eachprocessor
i�riiscom
puted
�which
isthesum
ofthenumber
ofpixels
�rasterizedbythisprocessor�
that
allpoly
gonboundingboxes
overlap�The
totalrasterization
timeisthen
just
max
i �ri �
�R�where
Risthetim
eto
rasterizea
single
pixel�
Thisign
oresdetails
oftheim
plem
entation
such
asthecost
ofadvan
cing
tothenextpoly
gon�andthee�ects
ofperform
ingthetest
foradvan
cementto
the
nextpoly
gonperiod
icallyrath
erthan
afterevery
single
pixel�
These
e�ects
have
been
inclu
ded
inan
adhoc
fashion
byincreasin
gthecost
ofrasterizin
gasin
glepixel�
While
not
strictlycorrect�
thisapprox
imation
yield
sgood
results�
Thegoal
ofthesim
ulator
has
been
tomeasu
rethee�ectiven
essof
variouscom
munication
optim
izations�the
e�ects
ofwhich
areentirely
wrap
ped
upin
thegeom
etryandcom
munication
stages�
makingtheaccu
racyof
therasterization
stagesim
ulation
lessim
portan
t�
A��
Correctness
Theresu
ltsof
thesim
ulator
areto
alarge
exten
tveri�
able�
Asim
ple�an
dtherefore
prob
ably
correct�check
isdoneat
theendof
thecom
munication
stagethat
check
s
that��
Each
poly
gonwas
received
onall
processors
whose
pixels
itoverlap
ped�
�Noprocessor
receivedapoly
gonitdidn�tneed
�
�Noprocessor
receivedmore
than
onecop
yof
apoly
gon�
The�rst
check
isof
course
most
critical�Ifanyprocessor
doesn
�treceiv
eapoly
gonthat
itneed
sto
rasterizeitcou
ldresu
ltin
ahole
intheren
dered
scene�
The
��
secondcheck
insures
wearen
�tintrod
ucin
gextra
work
bygiv
ingprocessors
poly
gons
torasterize
that
don�toverlap
their
screenareas�
Thenecessity
ofthethird
check
isnot
immediately
obviou
s�Thisdetects
subtle
errorswhich
canbeintrod
uced
with
thedistrib
ution
optim
ization�In
thiscase
abug
inthealgorith
mcou
ld�an
ddid�accid
entally
distrib
ute
poly
gonsin
duplicate
�and
eventrip
licate�in
someboundary
cases�
Ofcou
rse�there
aresom
eerrors
that
thesim
ulator
can�tdetect�
andthey
will
necessary
allrelate
toperform
ance�
rather
than
thecorrectn
essof
theresu
lt�Such
errorswould
inclu
dedistrib
utin
gapoly
gonfurth
erthan
necessary
�past
thelast
processor
interested
init��
ormiscou
ntin
gtheinstru
ctioncycles
forsom
eoperation
�
These
areasof
thecod
ehave
allbeen
carefully
checked
�andtheresu
ltsgiven
bythe
simulator
re ect
ourexpectation
s�so
itisbeliev
edthey
arecorrect�
A��
Summary
Theanaly
sisof
variousoptim
izationshas
dem
onstrated
that
while
analy
ticalmod
elsmay
accurately
model
theform
ofthecorrect
answer�
simulation
may
yield
a
substan
tiallydi�eren
t�andpresu
mablymore
accurate�
solution
�Largely
these
di�er
ences
inaccu
racyare
caused
bytheassu
mption
sused
tomake
theanaly
ticalprob
lem
tractable�
Assu
mption
ssuch
asa�dense
ring�
aremadeto
compensate
forthelack
ofanygood
modelfor
theactu
aldistrib
ution
ofinform
ationin
therin
g�Variation
s
inthedistrib
ution
ofsou
rcesanddestin
ationsfor
poly
gonsfrom
uniform
ityexacer
bates
things
furth
er�makingsim
ulation
invalu
ableto
obtain
accurate
answers
forreal
poly
gondatab
ases�
Thevery
factorsthat
make
thecom
munication
sohard
toanaly
ticallymodelm
ake
itequally
hard
tosim
plify
thesim
ulation
�Fortu
nately
forourdata
setsit
isnot
proh
ibitively
expensive
toperform
afullsim
ulation
ofthecom
munication
structu
re�
��
Biblio
graphy
��Akeley�K�
Reality
Engin
eGrap
hics�
InProceed
ings
Siggra
ph��
�August
�����pp������
���Akeley�K��andJermoluk�T�High
�Perform
ance
Poly
gonRenderin
g�In
Proceed
ings
Siggra
ph���A
ugust
�����pp���������
���Chin�D��Passe�J��Bernard�F��Taylor�H��andKnight�S�
The
Prin
cetonEngin
e�A
Real�T
ImeVideo
System
Sim
ulator�
IEEE
Transactio
ns
onConsumer
Electro
nics
���M
ay�����
���Cox�M�B�Algo
rithmsforParallel
Renderin
g�PhD
dissertation
�Prin
ceton
University�
Departm
entof
Com
puter
Scien
ce�May
����
���Crockett�T�W��andOrloff�T�
AParallel
Renderin
gAlgorith
mfor
MIM
DArch
itectures�
InProceed
ings
Parallel
Renderin
gSymposiu
m�N
ewYork
�
�����ACM
Press�
pp�������
���Crow�F�C��Demos�G��Hardy�J��McLaughlin�J��andSims�K��D
Image
Synthesis
ontheConnection
Mach
ine�In
Parallel
Processin
gforComputer
Visio
nandDisp
lay�Addison
Wesley�
����
���Ellsworth�D�ANew
Algorith
mfor
Interactiv
eGrap
hics
onMulticom
puters�
IEEEComputer
Graphics
andApplica
tions���
��Ju
ly�����
������
���Foley�J�D��vanDam�A��Feiner�S�K��Hughes�J�F��andPhillips�
R�L�Intro
ductio
nto
Computer
Graphics�
Addison
Wesley�
����
�
���Heckbert�P��andMoreton�H�Interp
olationfor
Poly
gonTextureMapping
andShading�
InState
oftheArt
inComputer
Graphics�
Visu
aliza
tionand
Modelin
g�Sprin
ger�Verlag�
���
���Knight�S��Chin�D��Taylor�H��andPeters�J�TheSarn
o�Engin
e�A
Massively
Parallel
Com
puter
forHigh
De�nition
System
Sim
ulation
�Journalof
VLSISign
alProcessin
g��������
������
��Molnar�S��Cox�M��Ellsworth�D��andFuchs�H�ASortin
gClassi�
cationof
Parallel
Renderin
g�IEEEComputer
Graphics
andApplica
tions�Ju
ly
�����������
���Molnar�S��Eyles�J��andPoulton�J�PixelF
low�High
�Speed
Renderin
g
Usin
gIm
ageCom
position
�In
Proceed
ings
Siggra
ph���Ju
ly�����
pp��������
���Molnar�S�E�Im
age�C
ompositio
nArch
itectures
forReal�T
imeIm
age
Gen�
eratio
n�
PhD
dissertation
�TheUniversity
ofNorth
Carolin
aat
Chapel
Hill�
Departm
entof
Com
puter
Scien
ce�Octob
er���
���Neider�J��Davis�T��andWoo�M�OpenGLProgra
mmingGuide�Addison
Wesley�
����
���Phong�B�T�Illu
mination
forCom
puter
Generated
Pictu
res�Communica
tion
oftheACM
���
�������
�����
���Pineda�J�A
Parallel
Algorith
mfor
Poly
gonRasterization
�In
Proceed
ings
of
SIG
GRAPH
��������
pp������
���Shoemake�K�Anim
atingRotation
with
Quatern
ionCurves�
InProceed
ings
Siggra
ph��Ju
ly�����
pp���������
���Connection
Mach
ineModel
CM�
Tech
nical
Summary�
Tech
�Rep�HA����
ThinkingMach
ines�
April
����
��
���Whitman�S�Multip
rocesso
rMeth
odsforComputer
Graphics
Renderin
g�AK
Peters�
Wellesley�
MA�����
����Whitman�S�
ALoad
Balan
cedSIM
DPoly
gonRender�
unpublish
edarticle
�August
�����
���Whitted�T�A
Taxonom
yof
Image
Com
position
Arch
itectures�
InIntern
a�
tionalSummer
Institu
teonState
oftheStart
inComputer
Graphics
������
����Williams�L�Pyram
idal
Param
etrics�Computer
Graphics
�����Ju
ly�����
��
��