General Purpose Systolic Arrays 1
7/25/2019 General Purpose Systolic Arrays 1
1/12
General-Purpose Systolic Arrays
Kurtis T. Johnson and A.R. Hurson, Pennsylvania State University
Behrooz Shirazi, University of Texas, Arlington
Systolic arrays effectively exploit massive parallelism in computationally intensive applications. With advances in VLSI, WSI, and FPGA technologies, they have progressed from fixed-function to general-purpose architectures.
When Sun Microsystems introduced its first workstation, the company could not have imagined how quickly workstations would revolutionize computing. The idea of a community of engineers, scientists, or researchers time-sharing on a single mainframe computer could hardly have become ancient any more quickly. The almost instant wide acceptance of workstations and desktop computers indicates that they were quickly recognized as giving the best and most flexible performance for the dollar.
Desktop computers proliferated for three reasons. First, very large scale integration (VLSI) and wafer scale integration (WSI), despite some problems, increased the gate density of chips while dramatically lowering their production cost. Moreover, increased gate density permits a more complicated processor, which in turn promotes parallelism.
Second, desktop computers distribute processing power to the user in an easily customized open architecture. Real-time applications that require intensive I/O and computation need not consume all the resources of a supercomputer. Also, desktop computers support high-definition screens with color and motion far exceeding those available with any multiple-user, shared-resource mainframe.
Third, economical, high-bandwidth networks allow desktop computers to share data, thus retaining the most appealing aspect of centralized computing, resource sharing. Moreover, networks allow the computers to share data with dissimilar computing machines. That is perhaps the most important reason for the acceptance of desktop computers, since all the performance in the world is worth little if the machine is isolated.
Today's workstations have redefined the way the computing community distributes processing resources, and tomorrow's machines will continue this trend with higher bandwidth networks and higher computational performance. One way to obtain higher computational performance is to use special parallel coprocessors to perform functions such as motion and color support of high-definition screens. Future computationally intensive applications suited for desktop computing machines include real-time text, speech, and image processing. These applications require massive parallelism.
Many computational tasks are by their very nature sequential; for other tasks the degree of parallelism varies. Therefore, a massively parallel computational architecture must maintain sufficient application flexibility and computational efficiency. It must be
• reconfigurable to exploit application-dependent parallelisms,
• high-level-language programmable for task control and flexibility,
• scalable for easy extension to many applications, and
• capable of supporting single-instruction stream, multiple-data stream (SIMD) organizations for vector operations and multiple-instruction stream, multiple-data stream (MIMD) organizations to exploit nonhomogeneous parallelism requirements.
Systolic arrays are ideally qualified for computationally intensive applications. Whether functioning as a dedicated fixed-function graphics processor or a more complicated and flexible coprocessor shared across a network, a systolic array effectively exploits massive parallelism. Falling into an area between vector computers and massively parallel computers, systolic arrays typically combine intensive local communication and computation with decentralized parallelism in a compact package. They capitalize on regular, modular, rhythmic, synchronous, concurrent processes that require intensive, repetitive computation. While systolic arrays originally were used for fixed or special-purpose architectures, the systolic concept has been extended to general-purpose SIMD and MIMD architectures.
Why systolic arrays?
Ever since Kung proposed the systolic model, its elegant solutions to demanding problems and its potential performance have attracted great attention. In physiology, the term systolic describes the contraction (systole) of the heart, which regularly sends blood to all cells of the body through the arteries, veins, and capillaries. Analogously, systolic computer processes perform operations in a rhythmic, incremental, cellular, and repetitive manner. The systolic computational rate is restricted by the array's I/O operations, much as the heart controls blood flow to the cells since it is the source and destination for all blood.
Although there is no widely accepted standard definition of systolic arrays and systolic cells, the following description serves as a working definition in this article (also see "Systolic array summary" below). Systolic arrays have balanced, uniform, gridlike architectures (Figure 1) in which each line indicates a communication path and each intersection represents a cell or a systolic element. However, systolic arrays are more than processor arrays that execute systolic algorithms. Systolic arrays consist of elements that take one of the following forms:
• a special-purpose cell with hardwired functions,
• a vector-computer-like cell with an instruction decoding unit and a processing unit, or
• a processor complete with a control unit and a processing unit.
In all cases, the systolic elements or cells are customized for intensive local communications and decentralized parallelism. Because an array consists of cells of only one or, at most, a few kinds, it has regular and simple characteristics. The array usually is extensible with minimal difficulty.
Figure 1. General systolic organization.
Three factors have contributed to the systolic array's evolution into a leading approach for handling computationally intensive applications: technology advances, concurrency processing, and demanding scientific applications.
Systolic array summary
Systolic array: A gridlike structure of special processing elements that processes data much like an n-dimensional pipeline. Unlike a pipeline, however, the input data as well as partial results flow through the array. In addition, data can flow in a systolic organization at multiple speeds in multiple directions. Systolic arrays usually have a very high rate of I/O and are well suited for intensive parallel operations.
Applications: Matrix arithmetic, signal processing, image processing, language recognition, relational database operations, data structure manipulation, and character string manipulation.
Special-purpose systolic array: An array of hardwired systolic processing elements tailored for a specific application. Typically, many tens or hundreds of cells fit on a single chip.
General-purpose systolic array: An array of systolic processing elements that can be adapted to a variety of applications via programming or reconfiguration.
Programmable systolic array: An array of programmable systolic elements that operates either in SIMD or MIMD fashion. Either the array's interconnect or each processing unit is programmable, and a program controls dataflow through the elements.
Reconfigurable systolic array: An array of systolic elements that can be programmed at the lowest level. FPGA (field-programmable gate array) technology allows the array to emulate hardwired systolic elements at a very low level for each unique application.
November 1993
Technology advances. Advances in
VLSI/WSI technology complement the systolic array's qualifications in the following ways:
• Smaller and faster gates allow a higher rate of on-chip communication because data has a shorter distance to travel.
• Higher gate densities permit more complicated cells with higher individual and group performance. Granularity increases as word length increases, and concurrency increases with more complicated cells.
• Economical design and fabrication processes produce less expensive systolic chips, even in small quantities.
• Better design tools allow arrays to be designed more efficiently. A systolic cell can be fully simulated before fabrication, reducing the chances that it will fail to work as designed. With advances in simulation techniques, fully tested, unique cells can now be quickly copied and arranged in regular, modular arrays. As VLSI/WSI designs become more complicated, systolicizing them provides an efficient way to ensure fault tolerance: any fault tolerance precautions built into one cell are extensible to all cells.
• Relatively new field-programmable gate array (FPGA) technology permits a reconfigurable architecture, as opposed to a reprogrammable architecture.
Concurrency processing. Past efforts to add concurrency to the conventional von Neumann computer architecture have yielded coprocessors, multiple processing units, data pipelining, and multiple homogeneous processors. Systolic arrays combine features from all of these architectures in a massively parallel architecture that can be integrated into existing platforms without a complete redesign. A systolic array can act as a coprocessor, can contain multiple processing units and/or processors, and can act as an n-dimensional pipeline. Although data pipelining reduces I/O requirements by allowing adjacent cells to reuse the input data, the systolic array's real novelty is its incremental instruction processing or computational pipelining. Each cell computes an incremental result, and the computer derives the complete result by interpreting the incremental results from the entire array in a prespecified algorithmic format.
Demanding scientific applications. The technology growth of the last three decades has produced computing environments that make it feasible to attack demanding scientific applications on a larger scale. Large-matrix multiplication, feature extraction, cluster analysis, and radar signal processing are only a few examples. As recent history shows, when many computer users work on a wide variety of applications, they develop new applications requiring increased computational performance. Examples of these innovative applications include interactive language recognition, relational database operations, text recognition, and virtual reality. These applications require massive repetitive and rhythmic parallel processing, as well as intensive I/O operation. Hence, systolic computing.
Implementation issues
A number of implementation issues determine a systolic array's performance efficiency. Designers should understand the following performance trade-offs at the design stage.
Algorithms and mapping. Designers must be intimately familiar with the algorithms they are implementing on systolic arrays. Designing a systolic array heuristically from an algorithm is slow and error-prone, requiring simulation for verification and often producing a less-than-optimum algorithm. Thus, automatic array synthesis is an important research area. At present, however, most array designs are based on heuristics.
Integration into existing systems. Generally, a systolic array is integrated into an existing host as a back-end processor. The array's high I/O requirements often make system integration a significant problem. Because the existing I/O channel rarely satisfies the array's bandwidth requirement, a memory subsystem often must be added between the host and the systolic array to support data access and data multiplexing and demultiplexing. The memory subsystem can range from the complicated support and cluster processors in the Warp array to the simpler staging memory in the Splash array. (These systems will be discussed in greater detail later.)
Cell granularity. The level of cell granularity directly affects the array's throughput and flexibility and determines the set of algorithms that it can efficiently execute. Each cell's basic operation can range from a logical or bitwise operation, to a word-level multiplication or addition, to a complete program. Granularity is subject to technology capabilities and limitations as well as design goals. For example, integration-substrate families have different performance and density characteristics. Packaging also introduces I/O pin restrictions.
Extensibility. Because systolic arrays are built of cellular building blocks, the cell design should be sufficiently flexible for use in a wide variety of topologies implemented in a wide variety of substrate technologies.
Clock synchronization. Clock lines of different lengths within integrated chips, as well as external to the chips, can introduce skews. The risk of clock skew is greater when dataflow in the systolic array is bidirectional. Wavefront arrays reduce the clock skew problem by introducing more complicated, asynchronous, intercellular communications.
Reliability. As integrated circuits grow larger, designers must build in greater fault tolerance to maintain reliability, and diagnostics to verify proper operation.
Systolic array taxonomy
The term systolic array originally referred to special-purpose or fixed-function architectures designed as hardware implementations of a given algorithm. In mass quantities, the production of these arrays was manageable and economical, and, thus, they were well suited for common applications. But these designs were bound to the specific application at hand and were not flexible or versatile. Every time a systolic array was to be used on a new application, the manufacturer had to undertake the long, costly, and potentially risky process of designing, testing, and fabricating an application-specific integrated chip. Although the cost and risks of developing ASICs have decreased in recent
years, budget constraints have motivated a trend away from unique hardware development. Consequently, general-purpose systolic architectures have become a logical alternative. In addition to serving in a wide variety of applications, they also provide test beds for developing, verifying, and debugging new systolic algorithms. Table 1 shows a taxonomy of general-purpose and special-purpose systolic arrays.
COMPUTER
Table 1. Systolic array taxonomy.
Class: General-purpose (types: Programmable, Reconfigurable, Hybrid); Special-purpose (type: Hardwired)
Organization: SIMD or MIMD; VFIMD; VFIMD
Topology: Programmable; Fixed
Interconnections: Static; Dynamic; Fixed
Dimensions: n-dimensional (n > 2 is rare due to complexity); n-dimensional
SIMD: single-instruction stream, multiple-data stream; MIMD: multiple-instruction stream, multiple-data stream; VFIMD: very-few-instruction stream, multiple-data stream
Special-purpose architectures. Special-purpose systolic architectures are custom designed for each application. Few problems resist attack from systolic arrays, but some problems may require elegant algorithms. Generally speaking, the systolic design requires a performance algorithm that can be efficiently implemented with today's VLSI technology.
One area that easily utilizes systolic algorithms is matrix operations. Figure 2 illustrates the algorithm for the sum of a scalar product, computed in a single systolic element. After the cell is initialized, the a's and the b's are synchronously shifted through the processing element. The accumulator stores the sum of the a × b products. All the a and b data synchronously exits the processing element unmodified to be available for the next element. At the end of processing, the sum of the products is shifted out of the accumulator. This principle easily extends to a matrix product, as shown in Figure 3. The only difference between single-element processing and array processing is that the latter delays each additional column and row by one cycle so that the columns and rows line up for a matrix multiply. The product matrix is shifted out after completion of processing.
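The skewed data flow described above can be sketched as a cycle-by-cycle simulation. The Python below is an illustration only, not the authors' implementation. It assumes an arrangement in which cell (i, j) keeps its own accumulator, a values move to the right, b values move down, and each additional row and column enters one cycle later than the previous one. The name systolic_matmul and all variable names are hypothetical.

```python
def systolic_matmul(A, B):
    """Simulate an n x p grid of multiply-accumulate cells computing A times B.

    A is n x m and B is m x p, both given as lists of lists.
    Hypothetical helper: the data-flow arrangement is an assumption
    made for illustration.
    """
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * p for _ in range(n)]  # a value currently held by each cell
    b_reg = [[0] * p for _ in range(n)]  # b value currently held by each cell

    # Counting cycles from 0, the last product (for cell (n-1, p-1))
    # is formed on cycle n + m + p - 3.
    for t in range(n + m + p - 2):
        new_a = [[0] * p for _ in range(n)]
        new_b = [[0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                if j == 0:
                    k = t - i  # row i of A enters i cycles late
                    new_a[i][j] = A[i][k] if 0 <= k < m else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # a passes on unmodified
                if i == 0:
                    k = t - j  # column j of B enters j cycles late
                    new_b[i][j] = B[k][j] if 0 <= k < m else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # b passes on unmodified
        a_reg, b_reg = new_a, new_b
        # Every cell multiplies the pair it now holds and accumulates.
        for i in range(n):
            for j in range(p):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc
```

With this skew, the a and b values that meet in cell (i, j) on a given cycle always share the same inner index, which is why the one-cycle delays make the rows and columns line up. In hardware the accumulators would be shifted out at the end; here they are simply returned.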
An obvious problem with this approach is that matrix products involving a matrix larger than the systolic array must be divided into a set of smaller matrix products. This resource and implementation problem affects all systolic arrays. Proper algorithm development compensates for the problem, but performance decreases nevertheless.
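One common way to divide a large product into array-sized pieces is block (tiled) multiplication. The sketch below is an assumption about how such a decomposition could look, not a method taken from the article. The parameter s stands for the side length of a hypothetical s x s array, tile_product is a plain software stand-in for one pass through that array, and the returned pass count gives a rough sense of the extra work.

```python
def tile_product(a, b):
    """Stand-in for one pass through a fixed-size array: product of two small tiles."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def blocked_matmul(A, B, s):
    """Multiply A (n x m) by B (m x p) using tiles of at most s x s elements.

    Returns the product and the number of tile passes that were needed.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    passes = 0
    for i0 in range(0, n, s):          # tile row of the result
        for j0 in range(0, p, s):      # tile column of the result
            for k0 in range(0, m, s):  # tiles along the shared dimension
                a = [row[k0:k0 + s] for row in A[i0:i0 + s]]
                b = [row[j0:j0 + s] for row in B[k0:k0 + s]]
                # Partial results from each pass are summed into the result tile.
                for di, row in enumerate(tile_product(a, b)):
                    for dj, value in enumerate(row):
                        C[i0 + di][j0 + dj] += value
                passes += 1
    return C, passes
```

For square n x n matrices the number of passes grows as the cube of ceil(n / s), which is one way to see why performance decreases once the problem no longer fits the array.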
The matrix product example also demonstrates another problem with special-purpose systolic arrays and hardware in general. The more specialized the hardware, the higher the performance; but cost per application also rises and flexibility decreases. Therein lies the attractiveness of general-purpose systolic architectures.
Figure 2. A systolic processing element that computes the sum of a scalar product.
Figure 3. The systolic product of two 3 x 3 matrices.
Figure 4. General organization of SIMD programmable linear systolic arrays.
General-purpose architectures. The two basic types of general-purpose systolic arrays are the programmable model and the reconfigurable model. Recently, hybrid models have also been proposed.
In the programmable model, cell architectures and array architectures remain the same from application to application. However, a program controls data operations in the cells and data routing through the array. All communication paths and functional units are fixed, and the program determines when and which paths are utilized. One of the first programmable architectures was Carnegie Mellon's programmable systolic chip.
In the reconfigurable model, cell architectures as well as array architectures change from one application to another. The architecture for each application appears as a special-purpose array. The primary means of implementing the reconfigurable model is FPGA technology. Splash was one of the early FPGA-based reconfigurable systolic arrays.
Hybrid models make use of both VLSI and FPGA technology. They usually consist of VLSI circuits embedded in an FPGA-reconfigurable interconnection network.
Systolic topologies. Array topologies can be either programmable or reconfigurable. Likewise, array cells are either programmable or reconfigurable.
A programmable systolic architecture is a collection of interconnected, general-purpose systolic cells, each of which is either programmable or reconfigurable. Programmable systolic cells are flexible processing elements specially designed to meet the computational and I/O requirements of systolic arrays. Programmable systolic architectures can be classified according to their cell interconnection topologies: fixed or programmable.
Fixed cell interconnections limit a given topology to some subset of all possible algorithms. That topology can emulate other topologies by means of the proper mapping transformation, but reduced performance is often a consequence.
Programmable cell interconnection topologies typically consist of programmable cells embedded in a switch lattice that allows the array to assume many different topologies. Programmable topologies are either static or dynamic. Static topologies can be altered between applications, and dynamic topologies can be altered within an application. Static programmable topologies can be implemented with much less complexity than dynamic programmable topologies. There has been little research in dynamic programmable topologies because a highly complex interconnection network could undermine the regular and simple principles of systolic architectures.
Reconfigurable systolic architectures capitalize on FPGA technology, which allows the user to configure a low-level logic circuit for each cell. Reconfigurable arrays also have either fixed or reconfigurable cell interconnections. The user reconfigures an array's topology by means of a switch lattice. Any general-purpose array that is not conventionally programmable is usually considered reconfigurable. All FPGA reconfiguring is static due to technology limitations.
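To make the idea of configuring a low-level logic circuit per cell more concrete, the following Python sketch models a cell as a small lookup table whose contents are loaded before the application runs. This is an assumed, simplified model: the article does not describe the internal structure of any particular FPGA, and the names make_lut_cell and configure_array are hypothetical.

```python
def make_lut_cell(truth_table):
    """Return a 2-input logic cell whose behavior is fixed by a 4-entry truth table.

    truth_table[(a << 1) | b] is the cell output for input bits a and b.
    """
    table = tuple(truth_table)
    if len(table) != 4 or any(bit not in (0, 1) for bit in table):
        raise ValueError("truth table must hold exactly four bits")

    def cell(a, b):
        return table[(a << 1) | b]

    return cell


def configure_array(config):
    """Build one cell per truth table; done once, before the application runs."""
    return [make_lut_cell(table) for table in config]


# The same "hardware" behaves as different logic depending on its configuration.
XOR_TABLE = [0, 1, 1, 0]
AND_TABLE = [0, 0, 0, 1]
cells = configure_array([XOR_TABLE, AND_TABLE])
```

Because the tables are loaded once and then left alone for the whole run, the sketch corresponds to the static style of reconfiguration described above; no instructions are fetched while the cells operate.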
Array dimensions. We can further classify general-purpose and special-purpose systolic architectures by their array dimensions. The two most common structures are the linear array and the two-dimensional array. Linear systolic arrays are by default statically reconfigurable in one-dimensional space. Two-dimensional arrays allow more efficient execution of complicated algorithms. Due to I/O limitations, general-purpose systolic arrays of dimensions greater than two are not common.
Programmable array organization. Programmable systolic arrays are programmable either at a high level or a low level. At either level, programmable arrays can be categorized as either SIMD or MIMD machines. They are typically back-end processors with an additional buffer memory to handle the high systolic I/O rates. High-level programmable arrays usually are programmed in high-level languages and are word oriented. Low-level arrays are programmed in assembly language and are bit oriented.
SIMD. SIMD systolic machines (Figure 4) operate similarly to a vector processor. The host workstation preloads a controller and a memory, which are external to the array, with the instructions and data for the application. The systolic cells store no programs or instructions. Each cell's instruction-processing functional unit serves as a large decoder. Once the workstation enables execution, the controller sequences through the external memory, thus delivering instructions and data to the systolic array. Within the array, instructions are broadcast and all cells perform the same operation on different data. Adjacent cells may share memory, but generally no memory is shared by the entire array. After exiting the array, data is collected in the external buffer memory. SIMD-programmable systolic cells take less space on the VLSI wafer than their MIMD counterparts because they are simple instruction-processing elements requiring no program memory. From tens to hundreds of such cells fit on one integrated chip.
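The broadcast behavior described above can be sketched in a few lines of Python. This is a deliberately simplified model and not the organization of any machine named in the article: the instruction names, the (opcode, operand) program format, and the function run_simd are all hypothetical. The program list plays the role of the external memory that the controller steps through, and the cells hold only data.

```python
import operator

# Hypothetical instruction set: each opcode maps to an operation every cell can decode.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_simd(program, cell_data):
    """Broadcast each instruction in turn to every cell of a linear array.

    program   : list of (opcode, operand) pairs held outside the array
    cell_data : one data value per cell; the cells store no instructions
    """
    regs = list(cell_data)
    for opcode, operand in program:  # the controller sequences through the program
        if opcode == "shift":
            # Systolic move: each cell hands its value to the right neighbor,
            # and the operand enters the first cell.
            regs = [operand] + regs[:-1]
        else:
            fn = OPS[opcode]         # every cell decodes the same instruction
            regs = [fn(r, operand) for r in regs]
    return regs
```

For example, run_simd([("mul", 2), ("add", 1)], [1, 2, 3]) applies the same two operations in every cell and leaves 3, 5, and 7 in the three cells.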
MIMD. MIMD systolic machines (Figure 5) operate similarly to homogeneous von Neumann multiprocessor machines. The workstation downloads a program to each MIMD systolic cell. Each cell may be loaded with a different program, or all the cells in the array may be loaded with the same program. Each cell's architecture is somewhat similar to the conventional von Neumann architecture: It contains a control unit, an ALU, and local memory. MIMD systolic cells have more local memory than their SIMD counterparts to support the von Neumann-style organization. Some may have a small amount of global memory, but generally no memory is shared by all the cells. Whenever data is to be shared by processors, it must be passed to the next cell. Thus, data availability becomes a very important issue. High-level MIMD systolic cells are very complicated, and usually only one fits on a single integrated chip. For example, the iWarp is a Warp cell without memory on a single chip. Local memory for each iWarp cell must be supplied by additional chips.
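A minimal Python sketch of this organization, under simplifying assumptions, is a linear array in which each cell runs its own downloaded program and hands its result to the next cell on every cycle. The function run_mimd and the representation of a cell program as a Python callable are hypothetical choices made for illustration; real cells such as those in Warp are far more elaborate.

```python
def run_mimd(cell_programs, inputs):
    """Push a stream of inputs through a linear array of independently programmed cells.

    cell_programs : one callable per cell (cells may run different programs)
    inputs        : values fed into the first cell, one per cycle
    Returns the values that leave the last cell, in order.
    """
    n = len(cell_programs)
    latches = [None] * n                # value waiting at each cell's input
    outputs = []
    stream = list(inputs) + [None] * n  # trailing bubbles flush the array
    for x in stream:
        # Every cell runs its own program on the value it currently holds.
        results = [prog(v) if v is not None else None
                   for prog, v in zip(cell_programs, latches)]
        if results[-1] is not None:
            outputs.append(results[-1])
        # Data is shared only by passing it to the next cell.
        latches = [x] + results[:-1]
    return outputs
```

Here run_mimd([lambda v: v + 1, lambda v: v * 10], [1, 2, 3]) yields 20, 30, and 40. A cell whose latch holds None has nothing to work on during that cycle, which is a small-scale picture of why data availability matters.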
Reconfigurable array organization. Recent gate-density advances in FPGA technology have produced a low-level, reconfigurable systolic array architecture that bridges the gap between special-purpose arrays and the more versatile, programmable, general-purpose arrays. The FPGA architecture is unusual because a single hardware platform can be logically reconfigured as an exact duplicate of a special-purpose systolic array. Figure 6 shows the general organization of a reconfigurable array. The designer logically draws cell architectures on the workstation with a schematic editor (such as Mentor), converts the design to FPGA code with another utility, and then downloads the code in a few seconds to configure the FPGA architecture.
Figure 5. General organization of MIMD programmable linear systolic arrays.
Reconfigurable systolic arrays do not fall into the SIMD or MIMD categories for the same reason that special-purpose systolic arrays do not. Reconfigurable arrays, like special-purpose arrays, are generally limited to VFIMD (very-few-instruction streams, multiple-data streams) organization due to FPGA gate density. Instructions are implicit in the configuration of each cell; therefore, there is no need to download them from the workstation. Since special-purpose systolic architectures consist of a very few unique cells repeated throughout the array, the entire array also tends to be VFIMD. Purely reconfigurable architectures are fine-grain, low-level devices best suited for logical or bit manipulations. They typically lack the gate density to support high-level functions such as multiplication.
Tab les
2 , 3,
a n d
4
l is t most of the
recen t p rog rammab le and reconf igu -
rab le , gene ra l -pu rpose sys to l ic a r rays
repo r ted in the l i t e ra tu re . When in fo r -
ma t ion i s unc lea r in the l i t e ra tu re , the
co r re spond ing space in the tab le is lef t
b lank . Th e tab le s ind ica te th ree s tages
o f p roduc t deve lopmen t :
1
Figure 6. General
organization of
reconfigurable
arrays.
N o v e m b e r 1993
25
-
7/25/2019 General Purpose Systolic Arrays 1
7/12
Table 2. SIMD-organized programmable systolic architectures.
System name, developer | Development stage | Topology | Key features
Brown Systolic Array [1], Brown University | Prototype | Linear, 470 cells | Very small VLSI footprint; 100s of cells per chip; ISA, SSR architectures; 8-bit ALU only; 157 MOPS
Micmacs [2], IRISA, Campus de Beaulieu, France | Prototype | Linear, 18 cells | 16-bit fixed-point math; broadcast data; 90 MOPS
Geometric Arithmetic Parallel Processor [3], NCR | Commercial | 48 x 8 array, 2,304 cells | Bit-slice cellular architecture; global data; 900 MOPS
Saxpy-1M [4], Saxpy Computer Corp. | Commercial | Linear, 32 cells | 32-bit floating-point capability; broadcast and global data with block processing; 1,000 MFLOPS
Systolic/Cellular Architecture [5], Hughes Research Laboratories | Prototype | 16 x 16 array, 256 cells | 32-bit fixed-function units
Cylindrical Banyan Multicomputer [6], University of Texas at Austin | Research | | Packet-switched programmable topology with programmable cells
Programmable Systolic Device [7], Australian National University | Research | | Instruction decoding occurs once per chip; each chip has many cells
1. R. Hughey and D. Lopresti, "Architecture of a Programmable Systolic Array," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 41-49.
2. P. Frison et al., "Micmacs: A VLSI Programmable Systolic Architecture," Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 145-155.
3. P. Greussay, "Programmation des Mega-Processeurs: du GAPP a la Connection Machine," Course Notes DEA Artificial Intelligence, Paris Univ. VIII, Vincennes, 1985.
4. D. Foulser and R. Schreiber, "The Saxpy Matrix-1: A General-Purpose Systolic Computer," Computer, Vol. 20, No. 7, July 1987, pp. 35-43.
5. K.W. Przytula and J.B. Nash, "Implementation of Synthetic Aperture Radar Algorithms on a Systolic/Cellular Architecture," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 21-27.
6. M. Malek and E. Opper, "The Cylindrical Banyan Multicomputer: A Reconfigurable Systolic Architecture," Parallel Computing, Vol. 10, No. 3, May 1989, pp. 319-326.
7. P. Lenders and H. Schroder, "A Programmable Systolic Device for Image Processing Based on Morphology," Parallel Computing, Vol. 13, No. 3, Mar. 1990, pp. 337-344.
* Research: A functional system that has not yet been implemented.
* Prototype: At least a partial system has been implemented.
* Commercial: The system has been implemented, and an integrated chip with multiple cells, a complete system, or both are available for purchase.
Architectural issues of programmable systolic arrays
To be effective, a programmable systolic architecture must adhere to the general principles of systolic arrays: regularity, simplicity, concurrency, and rhythmic communications. In addition, the introduction of a programmable systolic cell with significant local memory has resulted in a new mode of operations: block processing. Block processing combines periods of intensive systolic I/O and periods of sequential von Neumann-style processing to form what is known as a pseudosystolic model.
Replicating and interconnecting a basic programmable systolic cell to form an array carries out the regularity and simplicity principles, provided an appropriate algorithm and program are utilized. However, programmable-cell-based architectures must introduce special features to address the systolic properties of concurrency, rhythmic communications, and block processing.
Concurrency. Two levels of concurrency are possible in a programmable systolic architecture: concurrency across cells and concurrency within cells, both implicit to hardwired designs. To make concurrency feasible in programmable systolic architectures, the designer must add special mechanisms that facilitate processing control.
Intercell concurrency. Coordination of concurrency control has always been inherent in a hardwired systolic design because it has been addressed at the algorithm development stage. The advent of the programmable systolic array required that innovative features be incorporated into designs to satisfy a programmable coordination requirement. Methods of program loading, memory initialization, program switching, and local memory access at the cell level had
Table 3. MIMD-organized programmable systolic architectures.
System name, developer | Development stage | Topology | Key features
Warp [2], Carnegie Mellon University | Commercial | Linear, 10 cells | 32-bit floating-point multiplication; block processing; I/O queuing; 100 MFLOPS
iWarp [3], Carnegie Mellon University | | |
Computer for Experimental Synthetic Aperture Radar [4], Norwegian Defense Res. Estab. | Prototype | Four 8 x 16 arrays, 512 cells | Bit-serial cellular I/O; 32-bit floating-point multipliers in each cell; 320 MFLOPS
Cellular Array Processor [5], Fujitsu Laboratories, Japan | Commercial | |
Configurable Highly Parallel Computer [6], Purdue University | Research | | Programmable cells embedded in switch lattice for programmable topology
Associative String Processor [7], Brunel University, UK | Research | |
PICAP3 [8], University of Paris | Prototype | 8 x 8 array, 64 cells | 16-bit word length; image oriented
PSDP [9], Univ. of South Florida, Honeywell | Research | | 4-bit word length; wafer-scale design
1. A.L. Fisher, K. Sarocky, and H.T. Kung, "Experience with the CMU Programmable Systolic Chip," Proc. Soc. of Photo-Optical Instrumentation Engineers, Real-Time Signal Processing VII, SPIE, San Diego, Calif., 1984, p. 495.
2. M. Annaratone et al., "Architecture of Warp," Proc. Compcon, IEEE CS Press, Los Alamitos, Calif., Order No. 764, Feb. 1987, pp. 264-267.
3. S. Borkar et al., "iWarp: An Integrated Solution to High-Speed Parallel Computing," Proc. Supercomputing, IEEE CS Press, Los Alamitos, Calif., Order No. 882, 1988, pp. 330-339.
4. M. Toverud and V. Anderson, "CESAR: A Programmable High-Performance Systolic Array Processor," Proc. Int'l Conf. Computer Design, IEEE CS Press, Los Alamitos, Calif., Order No. 872 (microfiche only), 1988, pp. 414-417.
5. M. Ishii et al., "Cellular Array Processor CAP and Application," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 535-544.
6. L. Snyder, "Introduction to the Configurable Highly Parallel Computer," Computer, Vol. 15, No. 1, Jan. 1982, pp. 47-56.
7. R.M. Lea, "The ASP, a Fault-Tolerant VLSI/ULSI/WSI Associative String Processor for Cost-Effective Processing," Proc. Int'l Conf. Systolic Arrays, pp. 515-524.
8. B. Lindscog and P.E. Danielsson, "PICAP3: A Parallel Processor Tuned for 3D Image Operations," Proc. Eighth Int'l Conf. Pattern Recognition, IEEE CS Press, Order No. 742 (microfiche only), 1986, pp. 1248-1250.
9. D. Landis et al., "A Wafer-Scale Programmable Systolic Data Processor," Proc. Ninth Biennial Univ./Gov't/Ind. Microelectronics Symp., IEEE, Piscataway, N.J., 1991, pp. 252-256.
to be incorporated into a systolic environment. The following mechanisms facilitate programmable concurrency throughout the array:
* Broadcast data permits all cells of an array to be reset simultaneously with initial conditions and constants at the start of a processing block during nonsystolic modes of operation. The larger the array, the more efficiency it will gain from a broadcast data capability.
* Broadcast instructions permit operations similar to those of a vector computer by making each systolic cell analogous to the vector computer's processing unit. However, unlike most vector computers, systolic cells can support a high-bandwidth communication channel with adjacent cells.
* Broadcast instruction addresses allow fast and efficient switching among programs in an MIMD cell's program memory. When all MIMD cells have programs stored in the same places in their program memories, a jump to the broadcast instruction address carries out simultaneous program switching throughout the array. If the programs in all the MIMD cells happen
Table 4. VFIMD-organized programmable systolic architectures.
System name, developer | Development stage | Topology | Key features
Splash Systolic Engine [1], Supercomputing Research Center | Prototype | Linear, 32 cells | FPGA-reconfigurable cell architecture
Cellular Array Logic [2], Edinburgh University, UK | Prototype | 16 x 16 array, 256 cells | Based on custom chip
Hybrid Architecture [3], University of Illinois | Research | | Reconfigurable cell architecture integrating FPGA capability and 32-bit floating-point multiplication
Programmable Adaptive Computing Engine [4], University of Wales, Bangor, UK | Prototype | Programmable | Functional units embedded in each cell; reconfigurable connections in cell as well as programmable topology
Configurable Functional Array [5], Tsinghua University, Beijing | Research | | Configurable array topology and cell architecture
1. M. Gokhale et al., "Splash: A Reconfigurable Linear Logic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2101, 1990, pp. 1526-1531.
2. T. Kean and J. Gray, "Configurable Hardware: Two Case Studies of Micrograin Computation," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 310-319.
3. R. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Vol. 23, No. 10, Dec. 1991, pp. 669-675.
4. S. Jones, A. Spray, and A. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.
5. C. Wenyang, L. Yanda, and J. Yue, "Systolic Realization for 2D Convolution Using Configurable Functional Method in VLSI Parallel Array Designs," Proc. IEEE Computers and Digital Technology, Vol. 138, No. 5, Sept. 1991, pp. 361-370.
to be identical, the array is performing SIMD operations.
* Instruction systolic arrays provide a precise low-level method of coordinating processing from cell to cell. In an ISA, instructions travel through the array with the data. The processed data passes with the original instruction from each cell to the next. An appropriately wide ISA instruction also has built-in microcodable parallelism. ISA algorithms can be lengthy and more difficult to develop than algorithms for other SIMD and MIMD arrays. But they provide a way to uniquely specify concurrency across cells without an overly complicated circuit.
* Direct local memory access or global memory. Status flags can be stored either in each cell's local memory or in a global memory. In the case of local memory, the controller must be able to directly access each cell's local memory to determine flag status; otherwise, the controller must issue a request and wait for it to propagate through the array. When global memory is available, the controller obtains status information much more easily. Direct local memory access or global memory also facilitates initialization within a cell. The Saxpy-1M is one of the first systolic arrays with global memory.
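The broadcast-instruction-address mechanism described in the list above can be illustrated with a minimal Python sketch. The class and function names (Cell, broadcast_address, is_simd) and the three toy programs are hypothetical, introduced only for illustration; they do not come from any of the surveyed machines. Each cell keeps its own program memory, every cell jumps to the same broadcast address, and the array behaves as SIMD only when the programs stored at that address happen to be identical.

```python
from typing import Callable, Dict, List

Program = Callable[[int], int]


class Cell:
    """One MIMD cell: a local value plus a program memory indexed by address."""

    def __init__(self, value: int, program_memory: Dict[int, Program]) -> None:
        self.value = value
        self.program_memory = program_memory

    def jump_and_run(self, address: int) -> None:
        # Every cell receives the same address but runs its own stored program.
        self.value = self.program_memory[address](self.value)


def broadcast_address(cells: List[Cell], address: int) -> None:
    """Switch every cell to the program stored at `address`."""
    for cell in cells:
        cell.jump_and_run(address)


def is_simd(cells: List[Cell], address: int) -> bool:
    """True when all cells hold the identical program at `address`."""
    first = cells[0].program_memory[address]
    return all(cell.program_memory[address] is first for cell in cells)


def double(x: int) -> int:
    return 2 * x


def negate(x: int) -> int:
    return -x


def increment(x: int) -> int:
    return x + 1


# Address 0 holds different programs (MIMD); address 1 holds the same one (SIMD).
cells = [
    Cell(1, {0: double, 1: increment}),
    Cell(2, {0: negate, 1: increment}),
    Cell(3, {0: double, 1: increment}),
]

broadcast_address(cells, 0)
after_mimd = [cell.value for cell in cells]

broadcast_address(cells, 1)
after_simd = [cell.value for cell in cells]
```

In this toy model a single broadcast switches the whole array at once, which is the property the text attributes to storing programs at the same places in every cell's program memory.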
Intracell concurrency. Programmable cells require the ability to perform multiple similar or dissimilar operations simultaneously. Mechanisms incorporated into programmable systolic cells to support concurrency are all proven techniques that originated in conventional von Neumann processors: microcodable processing elements, duplication of functional units, multiple data paths, sufficiently wide instruction words, and pipelining of functional units.
Rhythmic communications. The principle of rhythmic communications clearly separates systolic arrays from other architectures. Programmability creates a more flexible systolic architecture, but the penalties are complexity and possible slowing of operations. When systolic cells are programmable, the issue of data availability arises. To minimize the performance degradation of programmable systolic designs, each cell's processing element must have data available when it is required, high-speed arithmetic capabilities, and the ability to transfer or store the processed data. Thus, each cell must have sufficient memory for data storage as well as program storage to facilitate intercell communications.
The following architectural features support rhythmic communications:
* Queued I/O streamlines cellular communications by allowing the source cell to send data to the destination cell when the data is available, not when the destination cell is ready to accept it. Data exits the queue on a FIFO (first in, first out) basis that allows program computation to proceed irrespective of communications status. Queued I/O helps considerably when all cells are computing a single program that is skewed in time across the cells. A disadvantage is that this mechanism uses memory space that is expensive on VLSI wafers.
* Serial I/O reduces the bandwidth requirement on the systolic array's workstation machine, at the same time promoting rhythmic cellular communication. The resulting programmable systolic array has bit-serial cellular communications and word-wide operations in each cell. Each cell stores partial words until it receives the full word and then performs functional operations. Using more cells can partially offset the disadvantage of reduced throughput.
* Multiple data paths provide the cell with fast, parallel internal and external communications. One or more data paths are dedicated to computational needs, another to cellular input operations, and yet another to cellular output operations.
* Systolic shared registers provide several benefits specifically for programmable systolic architectures. SSR architectures (Figure 7) provide a program-implicit means of streamlined data flow. Each pair of adjacent processing elements shares a small register memory. Data flows concurrently with computation as each processing element receives new data from the register memory upstream, operates on it, and directs it to the next register memory downstream. This architecture eliminates the requirement that data movement be explicitly specified, and data flow through the array becomes a result of processing. The input register and the output register are specified by the instruction word. Therefore, bidirectional dataflow is a natural result. An SSR architecture is advantageous over queued I/O because communications do not occur on a FIFO basis. A disadvantage is that the register bank must be kept small to minimize access time and thus maximize systolic bandwidth. The small register makes programming difficult for some of the more complicated algorithms.
* Broadcast data eases systolic I/O requirements in certain instances by allowing all functional units and local memory of a cell to be initialized or set to a common variable at once.
* Global data eases systolic I/O requirements and the cell's storage requirements by keeping only one copy of the same data and allowing all cells to share it. With proper global memory coherency, any modifications to global data are instantly available to other cells.
Figure 7. An SIMD systolic shared-register architecture.
* Bidirectional and wraparound dataflow must be explicitly specified in the program, whereas in hardwired designs dataflow is implicit. Bidirectional and wraparound dataflow can reduce the I/O bandwidth requirements of the external buffer memory by recirculating data within the array. More flexible dataflow also allows arrays to execute a wide variety of algorithms more efficiently.
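As one illustration of the features listed above, the following Python sketch models a systolic shared-register (SSR) linear array. It is a simplified assumption, not a description of any surveyed machine: data moves in one direction only, each shared register holds a single value, and the names ssr_step and run_ssr are hypothetical. Register i sits between processing element i-1 and processing element i, so a value moves downstream simply because each element reads its upstream register and writes its downstream register.

```python
from typing import Callable, List, Optional

Operation = Callable[[int], int]


def ssr_step(registers: List[Optional[int]],
             operations: List[Operation]) -> List[Optional[int]]:
    """Advance the array by one clock.

    Processing element i reads registers[i] and writes registers[i + 1].
    All elements act on the old register contents, so a fresh list is built.
    """
    next_registers: List[Optional[int]] = [None] * len(registers)
    for i, operation in enumerate(operations):
        value = registers[i]
        if value is not None:
            next_registers[i + 1] = operation(value)
    return next_registers


def run_ssr(inputs: List[int], operations: List[Operation]) -> List[int]:
    """Feed inputs into register 0 and collect results from the last register."""
    n = len(operations)
    registers: List[Optional[int]] = [None] * (n + 1)
    outputs: List[int] = []
    # One input per clock, followed by n idle clocks to drain the array.
    for item in list(inputs) + [None] * n:
        registers[0] = item
        registers = ssr_step(registers, operations)
        if registers[n] is not None:
            outputs.append(registers[n])
    return outputs


# Three processing elements computing (x + 1) * 2 - 3 in pipelined fashion.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results = run_ssr([1, 2, 3], stages)
```

No instruction in the sketch says "send to neighbor"; movement through the array is a side effect of each element's read and write, which is the program-implicit data flow the SSR description refers to. A fuller model would let the instruction word select the input and output registers, giving bidirectional flow.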
Block processing. If each programmable systolic cell has significant local memory, block processing is possible in the systolic environment. During block processing, very little systolic I/O occurs, and each cell executes a series of instructions on local data. Once the cells generate an intermediate result, processing pauses while systolic I/O takes place, and new data is written to each cell's local memory.
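The alternation between I/O phases and local compute phases can be sketched in Python as follows. This is a minimal model under stated assumptions: the function name block_process is hypothetical, the local computation (a sum of squares) is an arbitrary stand-in for "a series of instructions on local data," and the input length is assumed to be a multiple of the data loaded per round.

```python
from typing import List, Tuple


def block_process(stream: List[int], num_cells: int,
                  block_size: int) -> Tuple[List[List[int]], int]:
    """Alternate systolic I/O phases with purely local compute phases."""
    local_memory: List[List[int]] = [[] for _ in range(num_cells)]
    results: List[List[int]] = []
    io_phases = 0
    words_per_round = num_cells * block_size

    for start in range(0, len(stream), words_per_round):
        chunk = stream[start:start + words_per_round]

        # I/O phase: computation pauses while each cell's local memory is refilled.
        for c in range(num_cells):
            local_memory[c] = chunk[c * block_size:(c + 1) * block_size]
        io_phases += 1

        # Compute phase: every cell works only on its own local data.
        results.append([sum(x * x for x in memory) for memory in local_memory])

    return results, io_phases


block_results, phases = block_process(list(range(8)), num_cells=2, block_size=2)
```

With eight input words, two cells, and two words per cell, the sketch performs two I/O phases, each followed by a compute phase that touches no neighbor.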
Block processing on programmable systolic architectures can result in a more efficient, pseudosystolic operation in some applications for two reasons. First, the merger of von Neumann programmable cell architectures into a systolic array environment causes many of the same problems found in homogeneous multiprocessor systems. The serial nature of von Neumann machines interferes with the rhythmic systolic I/O of the array, causing an unacceptable amount of time wasted in waiting for data to become available. Block processing minimizes this wasted time in applications that can be divided into equal segments. Applications are divided into parallel tasks that utilize local data.
Second, for any systolic array, the bandwidth and size of the external memory are always the limiting factors on throughput and performance. Block processing reduces the array's systolic I/O requirements by reducing the amount of cellular I/O.
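A small worked calculation shows why larger local blocks reduce cellular I/O. The example is an assumption chosen for illustration, not one taken from the article: a cell multiplies two b x b matrix blocks held in local memory, which costs b^3 multiply-adds but moves only 3b^2 words (two operand blocks in, one result block out). The helper name ops_per_word is hypothetical.

```python
def ops_per_word(block: int) -> float:
    """Multiply-adds performed per word of cellular I/O for one b x b block product."""
    words_in = 2 * block * block    # two operand blocks loaded into local memory
    words_out = block * block       # one result block written back
    multiply_adds = block ** 3      # work done entirely on local data
    return multiply_adds / (words_in + words_out)


ratio_small = ops_per_word(3)    # 27 operations over 27 words
ratio_large = ops_per_word(30)   # 27,000 operations over 2,700 words
```

Under these assumptions the ratio grows as b/3, so a tenfold increase in block dimension yields ten times as much computation per word transferred, provided the cell's local memory can hold the larger blocks.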
Important work has been done on characterizing block processing in a general-purpose systolic array. Block processing can be performed on either SIMD or MIMD programmable systolic machines. Requirements for efficient block processing include an appropriate algorithm and significant local memory in each cell. Other types of array communication such as broadcast data and global data are also desirable.
Architectural issues of reconfigurable systolic arrays
The primary architectural issues in designing a reconfigurable systolic array are the hardware platform of FPGAs and the class of algorithms to be targeted to that platform. The number of FPGAs, their topology, and their gate density will determine the set of systolic architectures that can be synthesized and the set of systolic algorithms that can be implemented on that hardware platform. Reconfigurable systolic architectures are very interesting because the cell's architecture is actually programmed. Consequently, the architecture programmer has complete control over what architectural features are incorporated in each FPGA. Determining the cell architecture can be an iterative process that continuously refines the architecture until it satisfies application requirements.
FPGA architecture programming differs from conventional programming in that one programs the circuit's logical function, instead of programming a model of operations in a high-level language. Reconfiguring an FPGA is the same as logically drawing a new circuit. Therefore, an FPGA platform can assume any architecture that FPGA gate density and package pinout permit.
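The idea of programming a circuit's logical function rather than a sequence of operations can be sketched in Python. The model below is an assumption made for illustration: it represents a configurable logic element as a truth table (a lookup table), which is one common way such elements are built but is not something the article specifies. The class name ConfigurableCell and its methods are hypothetical.

```python
from typing import Callable, List


class ConfigurableCell:
    """A logic element whose function is defined entirely by a stored truth table."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.table: List[int] = [0] * (1 << num_inputs)

    def configure(self, function: Callable[..., int]) -> None:
        """Load a new truth table; this plays the role of redrawing the circuit."""
        for index in range(1 << self.num_inputs):
            bits = tuple((index >> k) & 1 for k in range(self.num_inputs))
            self.table[index] = function(*bits) & 1

    def evaluate(self, *bits: int) -> int:
        """Look up the output for the given input bits; no instructions execute."""
        index = sum(bit << k for k, bit in enumerate(bits))
        return self.table[index]


cell = ConfigurableCell(num_inputs=2)

cell.configure(lambda a, b: a ^ b)   # the element now behaves as an XOR gate
xor_outputs = [cell.evaluate(a, b) for a in (0, 1) for b in (0, 1)]

cell.configure(lambda a, b: a & b)   # same hardware, reconfigured as an AND gate
and_outputs = [cell.evaluate(a, b) for a in (0, 1) for b in (0, 1)]
```

The same element yields different circuits depending only on its configuration data, which mirrors the statement that an FPGA platform can assume any architecture its gate density and pinout permit.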
Communication and concurrency must be evaluated on a case-by-case basis at the time of configuration. Current gate-density technology does not make MIMD arrays feasible due to cell memory and control requirements, but FPGA platforms are well suited for logical or bit-level applications. Splash, for example, is a linear reconfigurable array appropriate for bit-level applications.
Figure 8. A hybrid SIMD programmable systolic architecture.
An FPGA chip can be configured for one medium-size systolic element or for several simple systolic cells. In either case, current technology limits the circuit size to an order of magnitude approaching 25,000 gates. Packaging and I/O pins also influence the design. Current technology limits I/O to approximately 200 pins.
Reconfigurable architectures are also highly qualified for fault-tolerant computing. Reconfiguration can simply eliminate a bad cell from the array. Cell architectures can be configured around VLSI defects.
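Eliminating a bad cell by reconfiguration can be sketched in Python for a linear array. The fault map, the function names (logical_to_physical, neighbor_links), and the assumption that the interconnect can route past a faulty cell are all hypothetical illustrations rather than details of any surveyed design.

```python
from typing import List, Tuple


def logical_to_physical(cell_ok: List[bool]) -> List[int]:
    """Map logical cell positions onto the physical cells that passed testing."""
    return [index for index, ok in enumerate(cell_ok) if ok]


def neighbor_links(mapping: List[int]) -> List[Tuple[int, int]]:
    """Physical cell pairs that must be connected to form the logical linear array."""
    return list(zip(mapping, mapping[1:]))


# Physical array of six cells in which cells 2 and 4 are defective.
fault_map = [True, True, False, True, False, True]

mapping = logical_to_physical(fault_map)
links = neighbor_links(mapping)
```

The resulting logical array has four cells instead of six, so fault tolerance in this model trades array length for continued operation; the algorithm mapped onto the array would need to fit the shorter chain.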
Architectural issues of hybrid arrays
High-level programmable arrays require extensive efforts to map algorithms to high-level languages after algorithm development. When mapping is completed, the resulting system often is not as efficient as a system implemented in a fixed-function ASIC. On the other hand, low FPGA gate density makes it unlikely that large-grain tasks will ever be completely implemented in FPGAs or that reconfigurable arrays will replace conventional SIMD and MIMD architectures in the near future. Consequently, a natural step is to merge the two approaches, keeping their desirable aspects and discarding their undesirable aspects.
The hybrid SIMD architecture (Figure 8) is best utilized for intensive floating-point computational applications but does not degrade in performance as much as high-level programmable arrays when significant logical or control operations are included. The hybrid design combines a commercial floating-point multiplier chip and an FPGA controller to form a systolic cell. The commercial multiplier is used for its economy, speed, and package density; the FPGA closely binds the cell to a specific application. Another project is attempting to integrate a hybrid general-purpose systolic array cell on a single chip.
A recently developed simulator allows simulation and testing of the cell and the array design for a hybrid architecture. Unlike most commercially available computer-aided design utilities, which verify designs at the chip level, it verifies the performance of algorithms and architectures through simulation of the complete array. The simulator maintains the iterative nature of purely reconfigurable array design by facilitating tailoring of hybrid designs for specific applications.
Existing architectures
A review of projects initiated during the last decade shows that the trend was to develop large systolic array processors that require elaborate, customized host support. Warp, the Computer for Experimental Synthetic Aperture Radar, the Cellular Array Processor, and Saxpy-1M all require expensive and complicated I/O support or intensive instruction I/O as well as data I/O. These machines are high-performance supercomputers (200 MFLOPS, 320 MFLOPS, and 1,000 MFLOPS, respectively) with centralized concurrency, constructed to fill an existing performance gap.
The introduction of workstations into the workplace has changed the way a significant portion of the computing community views computers. Users demand improved desktop computing performance as applications continue to increase in size and complexity. Workstation architecture can always be improved, especially for emerging applications such as text and speech processing and gene matching. The more recent general-purpose systolic array projects (Brown, Micmacs, Splash) show that back-end systolic processors are effective in boosting a workstation's performance for these applications. These arrays are small enough that the host's open architecture with limited I/O bandwidth does not severely impact the array's performance for low- to moderate-level granularity. Small back-end systolic processors are also economically sound. For example, Splash consists of a two-board add-on set for a Sun workstation. One board supports the linear array, and the other supports a buffer memory. The set costs from $13,000 to $35,000.
Almost without exception, current research emphasizes reconfigurable cells, reconfigurable arrays, and hybrids of functional units embedded in reconfigurable FPGA arrays. Reconfigurable designs have proven to be unmatched for low- to moderate-granularity requirements but are not yet mature enough for high-granularity applications. In addition, as we have said, reconfigurable topologies and cells are highly fault-tolerant. Their fault tolerance is a configuration issue, not a design and fabrication issue.
Until FPGA chip density progresses to the point where a very large FPGA can achieve high-level granularity, hybrid architectures present perhaps the most practical means to a reconfigurable high-level systolic array. Hybrids make use of the most attractive features of programmable and reconfigurable methods while adding greater flexibility than either method.
No discussion of general-purpose systolic arrays is complete without addressing the issues of programming and configuration. High-level-language programming is desirable for promoting widespread use of programmable systolic back-end processors. Of the projects we surveyed, only the larger systems had mature programming environments. Currently, the smaller programmable arrays have implemented only assembly programming. As mentioned earlier, FPGA-configurable arrays are configured with the use of a schematic editor. One drawback of this approach is that it typically takes more effort than programming a programmable cell.
The systolic array is a formidable approach to exploiting concurrencies in a computationally rhythmic and intensive environment. General-purpose systolic arrays provide an economical way to enhance computational performance by emphasizing concurrency and parallelism. Arrays ranging from low-level to high-level granularity have been applied to problems from bit-oriented pixel mapping to 32-bit floating-point-based scientific computing. Systolic arrays hold great promise to be a pervasive form of concurrency processing. As a solution to the intensive computational performance requirements of tomorrow's applications, general-purpose systolic arrays cannot be overlooked.
References
1. W.K. Fuchs and E.E. Swartzlander Jr., "Wafer-Scale Integration: Architectures and Algorithms," Computer, Vol. 25, No. 4, Apr. 1992, pp. 6-8.
2. P. Quinton and Y. Robert, Systolic Algorithms and Architectures, Prentice-Hall, Englewood Cliffs, N.J., 1991.
3. A. Krikelis and R.M. Lea, "Architectural Constructs for Cost-Effective Parallel Computers," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 287-300.
4. H.T. Kung, "Why Systolic Architectures?" Computer, Vol. 15, No. 1, Jan. 1982, pp. 37-46.
5. J.A.B. Fortes and B.W. Wah, "Systolic Arrays: From Concept to Implementation," Computer (special issue on systolic arrays), Vol. 20, No. 7, July 1987, pp. 12-17.
6. C.K. Ko and O. Wing, "Mapping Strategy for Automated Design of Systolic Arrays," Proc. Int'l Conf. Systolic Arrays, IEEE CS Press, Los Alamitos, Calif., Order No. 860, 1988, pp. 285-294.
7. R. Hughey and D. Lopresti, "B-SYS: A 470-Processor Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 2355-22, 1991, pp. 1580-1583.
8. M. Gokhale et al., "Building and Using a Highly Parallel Programmable Logic Array," Computer, Vol. 24, Jan. 1991, pp. 81-89.
9. H. Schroder and P. Strazdins, "Program Compression on the ISA," Parallel Computing, Vol. 17, No. 2-3, June 1991, pp. 207-215.
10. B. Friedlander, "Block Processing on a Programmable Systolic Array," Proc. Int'l Conf. Parallel Processing, IEEE CS Press, Los Alamitos, Calif., Order No. 889, 1987, pp. 184-187.
11. R. Smith and G. Sobelman, "Simulation-Based Design of Programmable Systolic Arrays," Computer-Aided Design, Vol. 23, No. 10, Dec. 1991, pp. 669-675.
12. S. Jones, A. Spray, and A. Ling, "A Flexible Building Block for the Construction of Processor Arrays," in Systolic Array Processors, Prentice-Hall, Englewood Cliffs, N.J., 1989, pp. 459-466.
Kurtis T. Johnson is an advanced engineer at HRB Systems, an E-Systems subsidiary. His interests include parallel computer architectures, parallel algorithms, and digital signal processing. He received his BS in 1986 and his MS in 1992, both in electrical engineering, from Pennsylvania State University.
A.R. Hurson is an associate professor of computer engineering at Pennsylvania State University. His research is directed toward the design and analysis of general-purpose and special-purpose computer architectures. He has published more than 110 papers on topics including computer architecture, parallel processing, database systems and machines, dataflow architectures, and VLSI algorithms. He coauthored the IEEE tutorial books Parallel Architectures for Database Systems and Multidatabase Systems: An Advanced Solution for Global Information Sharing Process. He cofounded the IEEE Symposium on Parallel and Distributed Processing. Hurson is a member of the IEEE Computer Society Press Editorial Board and a member of the IEEE Distinguished Visitors Program.
Behrooz Shirazi is an associate professor of computer science engineering at the University of Texas at Arlington. Previously, he was an assistant professor at Southern Methodist University. His research interests include task scheduling, heterogeneous computing, dataflow computation, and parallel and distributed processing. He has published widely on these topics. He is a cofounder of the IEEE Symposium on Parallel and Distributed Processing. He has served as chair of the Computer Society section in the Dallas chapter of the IEEE and as chair of the IEEE Region V Area Activities Board. Shirazi received his MS and PhD degrees in computer science from the University of Oklahoma in 1980 and 1985, respectively.
Send correspondence about this article to A.R. Hurson, Dept. of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802; or
Laxmi Bhuyan, Computer's system architecture area editor, coordinated and recommended this article for publication. His e-mail address is [email protected].