General Purpose Systolic Arrays 1

7/25/2019 General Purpose Systolic Arrays 1

1/12

General-Purpose

Systolic Arrays

Kurtis

T.

Johnson and

A.R.

Hurson Pennsylvania State University

Behrooz Shirazi University of Texas Arlington

Systolic arrays

effectively exploit

massive parallelism in

computationally

intensive applications.

With advances in VLSI

WSI and

FPGA

technologies they have

progressed from fixed-

function to general-

purpose architectures.

hen Sun Mic rosys tems in t roduced i t s fi r st works ta t ion , the com pany

cou ld no t have imag ined how qu ick ly works ta t ions wou ld revo lu t ion -

ize compu t ing . Th e idea o f a communi ty o f eng inee rs , sc ien t i s ts , o r

re sea rche rs t ime-sha r ing on a s ing le ma in f rame compu te r cou ld ha rd ly have

become anc ien t any mo re qu ick ly . Th e a lmos t in s tan t wide accep tance o f works ta -

t ions and deskt op com puters indicates that [hey were quickly recognized as giving

the bes t a nd mos t f lex ib le pe r fo rmance fo r the do l la r .

Desk top compu te rs p ro l i fe ra ted fo r th ree reasons . F i rs t , very la rge sca le in teg ra -

t ion (VLSI) and wafe r sca le in teg ra t ion

(WSI) ,

desp i te some p rob lem s , inc reased

the ga te dens i ty of chips while dramatically lowering their production cost . '

Moreove r . inc reased ga te dens i ty pe rmi ts a m ore compl ica ted p rocesso r , wh ich in

tu rn p rom otes pa ra l le l ism.

Second , desk top compu te rs d i s t r ibu te p rocess ing power t o the u se r in an eas ily

cus tomized open a rch i tec tu re . Rca l - t ime app l ica t ions tha t requ i re in tens ive

110

and compu ta t ion need no t consume a l l the re sou rces o f a supe rcompu te r . Also,

desk top compu te rs suppor t h igh -de f in i t ion sc reens wi th co lo r and mo t ion fa r

exceed ing those ava i lab le wi th any m u l t ip le -use r . sha red - re sou rce ma in f rame .

Th i r d , economica l . h igh -bandwid th ne tworks a l low desk t op compu te rs to sha re

data . thus re ta ining the most app ealing aspect of centralized computing , resource

sha r ing . Moreove r , ne tworks a l low the compu te rs t o sha re da ta wi th d i ss imila r

compu t ing mach ines . Tha t i s pe rhaps the mos t impor tan t reason fo r the accep -

tance o f desk t op compu te rs . s ince al l the pe r fo rmance in the wor ld i s wor th l i t t l e

if

the ma chine is isolated.

Today ' s works ta t ions have rede f ined the way the compu t ing comm uni ty d i s t rib -

u te s p rocess ing re sou rces , and tomorrow's mach ines wil l con t inue th i s t r en d wi th

h ighe r bandwid th ne tworks and h ighe r compu ta t iona l pe r fo rmance . O ne way to

ob ta in h ighe r compu ta t iona l pe r fo rmance is t o

use

spec ia l pa ra lle l cop rocesso rs to

pe r fo rm func t ions such as mot ion and co lo r suppor t o f h igh -de f in it ion sc reens .

Fu tu re com pu ta t iona l ly in tens ive app l ica t ions su i ted fo r desk top compu t ing ma-

ch ines inc lude rea l - t ime tex t , speech , and image p rocess ing . These app l ica t ions

require massive paralle l ism. '

I


2/12

Many comp utational tasks are by their

ve ry na tu re sequ en t ia l : fo r o the r ta sks

t h e d e g r e e

of

para l le l ism va r ie s . The re -

fo re . a mass ive ly pa ra l le l com pu ta t ion -

al archi tectur e must maintain suffic ient

application f lexibil ity and com putational

efficiency.

It

must be

reconf igu rab le to exp lo i t app l ica -

t ion -dependen t pa ra l le l i sms ,

h igh - leve l - language p rog rammab le

for task co ntrol a nd f lexibil ity ,

sca lab le fo r easy ex tens ion to m a n y

app l ica t ions. and

capable of supportingsingle-instruc-

t ion s t ream, mu l t ip le -da ta s t ream

( S I M D ) o r g a n i z a t i o n s f o r v e c t o r

ope ra t ions a nd mu l t ip le - in s t ruc t ion

s t r e a m , m u l t i p l e - d a t a s t r e a m

( M I M D ) o r g a n i z a t i o n s t o e x p l o i t

n o n h o m o g e n e o u s p a r a l l e l i s m r e -

qu i remen ts .

Systolic arrays are ideally qualif ied

fo r compu ta t iona l ly in tens ive app l ica -

t ions . Wh e the r func t ion ingas a ded ica t -

ed f ixed - func t ion g raph ics p rocesso r o r

a more com pl ica ted and f lex ib le cop ro -

cesso r sha red ac ross a ne twork , a sys to l -

ic array effectively exploits massive p ar-

a l le l i sm. Fa l l ing in to an a rea be tw een

vec to r compu te rs and mass ively pa ra l -

lel

computers . systolic arrays typically

combine in tens ive loca l communica t ion

a n d c o m p u t a t i o n w i t h d e c e n t r a l i z e d

pa ra l le l ism in a comp ac t package . They

cap i ta l ize on regu la r , modu la r , rhy th -

mic , synch ronous . concu r ren t p rocess -

e s ha t requ i re in tens ive , r epe t i t ive com-

putation. While systolic arrays originally

were u sed fo r f ixed o r spec ia l -pu rpose

architectures, the systolic concept has

b e e n e x t e n d e d t o g e n e r a 1 p u r p o s e

S I M D a n d M I M D a r c h it e c tu r e s.

W hy sys tolic ar rays?

Ever s ince Kung p roposed the sys tol -

ic model. . i ts e legant solutions to de-

mand ing p rob lem s and i t s po ten t ia l pe r -

fo rmance have a t t rac ted g rea t a t ten t ion .

In physiology, the term sysfolicdescribes

the con t rac t ion ( sys to le ) o f the hea r t ,

wh ich regu la r ly sends b lood to all cells

o f the body th rough the a r te r ie s , ve ins ,

and capil lar ies . Analogously . systolic

compu te r p rocesses pe r fo rm ope ra t ions

in a rhy thmic . inc remen ta l , ce l lu la r . and

repe t i t ive manner . T he sys tol ic compu-

ta t iona l ra te

is

re s t r ic ted by the a r ray s

U

opera t ions . much a s the hea r t con -

I I

than p rocesso r a r ray s tha t execu te sys-

tolic a lgorithm s. Systolic arra ys consist

of

e lemen ts tha t t ake o ne o f the fo llow-

ing forms:

a special-purpose cell with h ardwire d

func t ions ,

a vec to r -compu te r l ike ce ll wi th an

ins t ruc t ion decod ing un i t a nd a p ro -

cessing un i t , o r

a p rocesso r comple te wi th a con t ro l

un i t a nd a p rocess ing un it .

Figure 1. General systolic organization.

tro ls b lood f low to the cells s ince i t is the

sou rce an d des t ina t ion fo r a l l b lood . J

Al though the re i s no widely accep ted

standard definit ion of systolic arrays

and systolic cells . the follo wing descrip-

t ion se rves a s a work ing de f in i t ion in

th i s a r t ic le (a l so see Sys to l ic a r ray

summ ary be low) . Sys to lic a r rays have

balanced . uniform, gridlike architectures

(F igu re

I

in which each l ine indicates a

communica t ion pa th and each in te r sec -

t ion represents a cell or a systolic e le-

men t . H owever , sy s tol ic a r rays a re m ore

In all cases, the systolic e lements or

ce l l s a re cus tomized fo r in tens ive loca l

comm unica t ions and decen t ra l ized pa r -

alle l ism. Because an array consists

of

ce l ls o f on ly one o r , a t m os t , a few k inds ,

it

has regu la r and s imp le cha rac te r i s -

t ics . T he arra y usually is extensible with

minim al diff iculty.

Thr ee fac to rs have con t r ibu ted

to

the

sys to l ic a r ray s evo lu t ion in to a lead ing

approach fo r hand l ing compu ta t iona l ly

in tens ive app l ica t ions : t echno logy ad -

vances , concu r rency p rocess ing , and

demanding scientif ic applicationsP

Technology advances. Advances in

Systolic array summary

Systol ic array:

A g ridlike structure of sp ecial processing elements that

processes data much like an n-dimensional pipeline. Unlike a pipeline, how-

ever, the input data as w ell as partial resu lts flow through the array. In addi-

tion, data can flow in a systolic organiza tion at m ultiple speeds in multiple di-

rections. Systolic arrays usually have a very high rate of I/O and are well

suited for intensive parallel operations .

Appl icat ions:

Matrix arithmetic, signal processing, mage processing, an-

guage recog nition, relational database operations, data structure manipula -

tion, and character string manipulation.

Special purpose systol ic array: An array of hardwired systolic process-

ing elements tailored for a spec ific application. Typically, many tens or hun-

dreds of cells fit on a single chip.

General purpose systol ic array:

An array of systolic processing ele-

ments that can be adapted to a variety of app lications via programm ing or

reconfiguration.

Programmable systol ic array: An array of programmable systolic ele-

ments that operates either in SIMD or MIMD fashion. E ither the arrays inter-

connect or each processing unit is program mable and a program controls

dataflow through the elemen ts.

Reconf igurab le sys to l ic a r ray : An array of s ystolic elements that can be

program med at the lowest level. FPGA field-program mablegate array) tech-

nology allows the array to emu late hardwired systolic elements at a very low

level for each unique application.

N o v e m b e r

l Y Y 3

21


3/12

VLSI /WSI techno logy complem en t the

systolic arrays qualif ications in the fol-

lowing ways:

Demand ing sc ien t i f ic app l ica t ions .

Th e techno logy g rowth o f the la st th ree

decades has p roduced com pu t ing env i -

ronmen ts tha t mak e i t f ea s ib le to a t tack

demand ing sc ien t if ic app l ica t ions on

a

l a rge r sca le . La rge -ma t r ix mu l t ip l ica -

t ion , f ea tu re ex t rac t ion , c lu s te r ana ly -

s i s, and rada r s igna l p rocess ing a re on ly

afe w examples. As recent h is tory shows,

w h e n m a n y c o m p u t e r u s e r s w o r k

on

a

wide va r ie ty o f app l ica t ions , they deve l -

op new app l ica t ions requ i r ing inc reased

Cellgran ulari ty .The level of cell gran-

u l a r i t y d i r e c tl y a f f e c t s t h e a r r a y s

th roughpu t and f lex ib i l i ty and de te r -

mines the se t o f a lgo r i thms tha t i t c an

e f f ic ien t ly execu te . Each ce l l s bas ic

ope ra t ion can range f rom a log ica l o r

b i twise ope ra t ion . to a word- leve l mu l -

t ip l ica t ion o r add i t ion .

to a

c o m p l e t e

p rog ram. G ranu la r i ty i s sub jec t to tech -

nology capabil i t ies

and

l imi ta t ions a s

well

as

des ign

goals.

F o r e x a m p l e . i n t e-

g ra t ion -subs t ra te fami l ie s have d i f fe r -

Smalle r and faster gatesa llow a high-

e r ra te o f on -ch ip communica t ion be -

cause da ta has a sho r te r d i s tance to

t rave l .

Highe r ga te dens i t i e s pe rmi t more

complicated cells with higher individu-

a l and g roup pe r fo rmance . Granu la r i ty

inc reases

as

word leng th inc reases , and

concur rency inc reases wi th more com -

plicated cells .

Economica l des ign a nd fab r ica t ion

p rocesses p roduce le ss expens ive sys-

to l ic ch ips , even in sma l l quan t i t i e s .

Be t t e r des ign

tools

a l low a r rays t o be

designed m ore eff ic iently . A systoliccell

can be fu lly s imu la ted be fo re fab r ica -

t ion , r educ ing the chances tha t i t wi ll

fa il to w ork as des igned . Wi th advances

in s imulation techniques, fully tested,

un ique ce l l s can now be qu ick ly cop ied

and a r ranged in regu la r , modu la r a r -

rays. As V LSIiWSI designs becom e more

complicated. systolic iz ing them pro-

v ides an e f f ic ien t way t o ensu re fau l t

to le rance : any fau l t to le rance p recau -

t ions bu i lt in to one ce ll a r e ex tens ib le to

all cells.

compu ta t iona l pe r fo rmance . Examples

of these innovativ e applications include

in te rac t ive language recogn i t ion , re la -

t iona l da tabase op e ra t ions , t ex t recog -

nit ion, and vir tual reali ty . These appli -

ca t ions requ i re mass ive repe t i t ive an d

rhythm ic paralle l processin g, as well as

in tens iveU0 opera t ion . H ence , sy sto lic

compu t ing .

en t pe r fo rmance and dens i ty cha rac te r -

is t ics . Packaging also in trod uces

U0

p in

restr ic t ions.

Extensibil ity . B ecaus e systolic arrays

a re bu i l t o f ce l lu la r bu i ld ing b locks, the

cell design should be suffic iently f lexi-

ble for use in a wide varie ty of topolo-

g ie s imp lem en ted in a w ide va r ie ty o f

subs t ra te techno log ie s .

Implementation issues

Clock

synchron ization. Clock l ine s of

different lengths within in tegrated chips,

as

well

a s

ex te rna l to the ch ips , can

in t roduce skews . The r i sk o f c lock skew

is g rea te r when da ta f low in the sys to l ic

array is b idirectional. Wavefron t arrays

reduce the c lock skew p rob lem by in -

t roduc ing more com pl ica ted , a synch ro -

A number o f imp lemen ta t ion i s sues

de te rmine

a

systolic arraysperfo rmance

effic iency. Designe rs should understa nd

the fo l lowing pe r fo rm ance t rade -o f f s a t

the design stage.

Re la t ive ly new f ie ld -p rog rammab le

g a t e a r r a y ( F P G A ) t e c h n ol o g y p e r m i t s

a

reconf igu rab le a rch i tec tu re , a s

op-

posed to a rep rog rammab le a rch i tec -

tu re .

Algo r i thms a nd mapp ing . Des igne rs

mus t be in t ima te ly fami l ia r wi th the

a lgo r i thms they a re imp lemen t ing on

systolic arrays. Designing

a

systolic ar-

ray heu r i s t ica l ly f rom

an

a lgo r i thm i s

s low an d e r ro r -p ron e , equ i r ing s imu la -

t ion fo r ve r i f ica t ion and o f ten p roduc -

ing aless-than-optim um algorithm. Thus,

au tomat ic a r ray syn thes i s i s an impor -

t a n t r es e ar c h a r e a 6 A t p r e s e n t. h o w e v -

nous , in te rce l lu la r communica t ions .

Reliabil i ty . Asin tegratedcirc uits grow

la rge r . des igne rs mus t bu i ld in g rea te r

fau l t to le rance to ma in ta in re l iab i l i ty ,

and d iagnos t ic s to ve r ify p rop e r ope ra -

t ion.oncurrency processing. Past efforts

to add concur rency to the conven t iona l ,

v o n N e u m a n n c o m p u t e r a r c h i t e c t u r e

have yielded coprocesso rs , multip le pro-

t ip le hom ogene ous processo rs . Systolic r is t ics .

a r rays combine fea tu res f rom

all

of these

a rch i tec tu res in a massively paralle l ar-

ch i tec tu re tha t can be in teg ra ted in to

ex ist ing p la t fo rms w i thou t a com ple te

redesign. A systolic array can act as a

coprocesso r , can con ta in mu l t ip le p ro -

Systolic array

cessing units . data pipelin ing. and mul-

e r , mos t a r ray des igns a re based on heu- taxonomy

I n t e g r a t i o n i n t o e x i s t i n g s y s t e m s .

Genera l ly ,

a

systolic array is in tegrat ed

in to an ex is t ing host a s

a

back -end p ro -

cesso r . The a r ray s h igh

U

requ i re -

men ts o f te n make sys tem in teg ra t ion a

T h e t e rm systolic

array

originally re-

f e r r e d

to

spec ia l -pu rpose o r f ixed - func -

t ion a rch i tec tu res des igned

as

h a r d w a r e

imp leme n ta t ions o f a g iven a lgo r i thm.

In

mass quan t i t i e s , the p roduc t ion o f

cessing un i t s and lo r p rocesso rs , an d can

ac t a s an n -d imens iona l p ipe l ine . Al -

though da ta p ipe l in ing reduces

IiO

re -

qu i remen ts by a l lowing ad jacen t ce l l s

to reuse the inpu t da ta , the sys tol ic a r -

rays real novelty is i ts incremental in-

s t ruc t ion p rocess ing o r compu ta t iona l

pipelining. Each cell computes an in-

c r e m e n t a l r e s ul t . a n d t h e c o m p u t e r d e -

r ives the co mple te re su l t by in te rp re t -

ing the inc remen ta l r e su l t s f rom the

en t i re a r ray in

a

prespecif ied algorith-

mic fo rma t .

s ignif icant problem. Because the exist-

ing IiO channel rarely satisf ies the ar-

ray s bandwid th requ i remen t , a m emo-

r y s u b s y s t e m o f t e n m u s t b e a d d e d

be tween the hos t an d the sys to l ic a r ray

t o s u p p o r t d a t a a c c e s s a n d d a t a m u l t i-

p lex ing and demul t ip lex ing . Th e mem-

ory subsys tem can range f rom the com-

p l ica ted suppor t a nd c lu s te r p rocesso rs

i n t h e W a r p a r r a y

to

the s imp le r s tag ing

memory in the Sp la sh a r ray . (These

sys tems wi l l be d i scussed in g rea te r

de ta i l l a te r . )

these a r rays was manageab le and eco -

nomica l , and , thus , they were we l l su i t -

ed fo r common app l ica t ions . Bu t these

d e s ig n s w e r e b o u n d to the speci f ic ap -

p l ica t ion a t hand an d were no t f lex ib le

o r ve rsa t i l e . Eve ry t ime a systolic array

was to b e u s e d o n a new app l ica t ion , the

manufac tu re r had to unde r take the long ,

costly , and potentia l ly r isky process of

des ign ing . t e s t ing , and fab r ica t ing an

app l ica t ion -spec i f ic in teg ra ted ch ip .

Al though the cos t and r i sks of deve lop -

ing

A S l C s

have dec reased in recen t

C O M P U T E R

2


4/12

Table

1.

Systolic array taxonom y.

Class

T y p e

Genera l -pu rpose Spec ia l -pu rpose

Prog rammab le Reconf igu rab le Hybr id Hardwired

O r g a n i za t i o n S I M D o r M I M D

V F I M D V F I M D

Fixed

F ixed

Topo logy

In te rconnec t ions

Dimens ions n -d imens iona l (n

> 2

i s r a r e d u e

to

complex i ty )

n -d imens iona l

P rog rammab le

S ta t ic Dynamic

S I MD: single-instruction stream, multiple-data stream: MI MD: multiple-instruction stream, multiple-data stream;

VFTMD:

very-few-

instruction stream, multiple-data stream

Fixed

yea rs , budge t cons t ra in t s have m o t iva t -

ed a t rend away f rom un ique ha rdware

deve lopmen t . Consequen t ly . gene ra l -

purpose systolic architectures have be-

come a logical a l ternative. I n add i t ion

to serving in a wide varie ty of applica-

t ions . they also prov ide te s t beds fo r

developing, verifying, and debugging

new systolic a lgorithm s. Tab le 1shows

a taxonomy o f gene ra l -pu rpose and spe -

cial-purpose systolic arrays.

S ta t ic Dynamic

Special-purpose architectures. Spe-

cial-purpose systolic architectures are

cus tom des igned fo r each app l ica t ion .

Few p rob lems re si s t a t tack f rom sys to l -

ic a r rays , bu t som e p rob lems m ay re -

q u i r e e l e g a n t a l g o r i t h m s . G e n e r a l l y

speaking, the systolic design requir es a

per fo rmance a lgo r i thm tha t can be e f fi -

c ien t ly imp lemen ted wi th today s VLSI

technology.

One area that easily uti l izes systolic

a lgo r i thms i s ma t r ix ope ra t ions . Figure

2

i l lu s tra te s the a lgo r i thm fo r th e sum o f

a

scalar product, computed in a s ingle

systolic e lem ent. A fter the cell is in it ia l-

ized . the

as

a n d t h e bs a re synch ro -

nously shif ted through the processing

e lemen t . The accumula to r s to re s the

sum o f the a,b products .

All

t h e

a

a n d

b

da ta synch ronous ly ex i t s the p rocess ing

e lemen t unmod i f ied to be ava i lab le fo r

the nex t e lemen t .

At

t h e e n d o f p r o ce s s -

ing,

the sum o f the p roduc ts i s shi f ted

o u t

of

the acc umula to r . Th is p r inc ip le

easily extends

to

a ma t r ix p roduc t . as

shown in Figure

3.

The on ly d i f fe rence

be tween s ing le -e lemen t p rocess ing and

a r ray p rocess ing is tha t th e la t te r de lays

each add i t ional co lumn and row by one

cycle so that

the

columns and rows l ine

u p f o r

a

matr ix mu l t ip ly . The p roduc t

ma t r ix i s sh i f ted ou t a f te r comple t ion o f

processing.

An

obv ious p rob lem wi th th i s ap -

p roach i s tha t m a t r ix p roduc ts invo lv ing

a

mat r ix la rge r tha n the sys to l ic a r ray

mus t be d iv ided in to a se t o f sma l le r

ma t r ix p roduc ts . Th is re sou rce an d im-

plem entatio n prob lem affects a l l systol-

ica r rays . P rope r a lgo r i thm deve lopmen t

compensa te s fo r the p rob lem, bu t pe r -

fo rmance dec reases neve r the le ss.

The m a t r ix p roduc t example a l so dem-

ons t ra te s a no the r p rob lem wi th spec ial -

pu rpose sys to l ic a r rays and ha rdware in

ib -ty pe data exits

~ ~ ~

Figure

2.

A

systolic processing elemen t that compute s the sum

of

a scalar

product.

a13 a 2 a l l

a23

22

*

a

a32

a3

*

b-type data exits array

Figure 3 The sys-

tolic product

of

two 3

x

3 matrices.

N o v e m b e r I993

23


5/12

Host

Data in

_ _

Array

Data out

Figure 4. General organization of

SIMD

programmable linear systolic arrays.

gene ra l . The mo re spec ia l ized the ha rd -

ware , the h ighe r the pe r fo rmance : bu t

cost per application also r ises and f lex-

ibil i ty decreases. Therein l ies the a t-

tractiveness of gene ral-pu rpose systolic

a rch i tec tu res .

General-purpose architectures.

T h e

two basic types of genera l-purp ose sys-

t o li c a r r a y s a r e t h e p r o g r a m m a b l e m o d -

e l and the recon f igu rab le mode l . Re -

cen t ly . hyb r id mode ls have a l so been

proposed .

In

the p rog ramm ab le mo de l , cel l a r -

ch i tec tu res and a r ray a rch i tec tu res re -

ma in the sam e f rom app l ica t ion to ap -

p l ica tion . How ever , a prog ram con t ro ls

da ta ope ra t ions in the ce l l s and da ta

rou t ing th rough the a r ray . Al l comm u-

n ica t ion pa ths and func t iona l un i t s a re

f ixed , and the p rog ram de te rmines when

and wh ich pa ths a re u t i l i zed . O ne o f the

f i rs t p rog ramma b le a rch i tec tu res was

Carneg ie M el lon s p rog ramm ab le sys-

tolic chip .

In

the recon f igu rab le m ode l , ce ll a r -

ch i tec tu res as well as array architec-

tu re s change f rom one app l ica t ion to

a n o t h e r . T h e a r c h i t ec t u r e f o r e a c h a p -

p l ica t ion appea rs

as

a

spec ia l -pu rpose

a r ray . Th e p r imary means o f imp lemen t -

ing the reconf igu rab le mode l i s FP GA

techno logy . Sp la sh was on e o f the ea r ly

F P G A - b a s e d r e c o n f i g u r a bl e s y st o li c

arrays.

Hybr idmode lsmake use

of

b o t h

VLSI

24

and FPGA techno logy . They usua l ly

consi st of

VLSI

c i rcu it s embed ded in an

FPGA-reconf igu rab le in te rconnec t ion

ne twork .

S y s t o l i c t o p o l o g i e s .

A r r a y t o p o l o g ie s

c a n b e e i t h e r p r o g r a m m a b l e o r r e -

con f igurab le . L ikewise , a r ray ce l ls a r e

e i t h e r p r o g r a m m a b l e o r r e c o n f i g u r -

ab le .

A

prog ramm able systolic architecture

is

a

co l lect ion o f in te rconnec ted , gene r -

al-pu rpose systolic cells , each of w hich

is e i the r p rog rammab le o r recon f igu -

rab le . P rog ramm ab le sys to lic ce l ls a re

flexible processing elements specially

des igned to mee t the com pu ta t iona l and

U

equ i remen ts

of

systolic arrays. Pro -

g ramm ab le sys to l ic a rch i tec tu res can be

classif ied accordin g to their cell in ter-

connectio n topologies: f ixed or program -

m a b l e .

F ixed ce l l in te rconnec t ions l imi t a

g iven topo logy to some subse t o f

all

poss ib le a lgo r i thms . Tha t topo logy can

e m u l a t e o t h e r t o p o l o g i e s by m e a n s of

t h e p r o p e r m a p p i n g t r a n s f o r m a ti o n , b u t

reduced pe r fo rmance i s o f ten a conse -

quence .

P rog rammab le ce l l in te rconnec t ion

topol ogies typically consist of prog ram -

mab le ce l ls emb edde d in a swi tch la t ti ce

tha t a l lows the a r ray to a s s u m e m a n y

d i f fe ren t topo log ie s. P rog ramm ab le to -

po log ie s a re e i the r s ta t ic o r dynamic .

S ta t ic opo log ie s can be a l te red be tween

app l ica t ions , and dynamic topo log ie s

can be a l te red wi th in

an

application.

S ta t ic p rog ramm ab le topo log ie s can be

imp lem en ted wi th much le ss complex i -

t y t h a n d y n a m i c p r o g r a m m a b l e t o p o l o -

g ie s. The re has been l i t tl e re sea rch

in

d y n a m i c p r o g r a m m a b l e t o p o l o g ie s b e -

cause

a

h igh ly complex in te rco nnec t ion

ne twork cou ld unde rmine the regu la r

and simple principles of systolic archi-

tec tu res .

Reconf igu rab le sys to l ic a rch i tec tu res

cap i tal ize on F PG A techno logy , wh ich

a l lows the u se r to con f igu re

a

low-level

logic c ircuit for each cell . Reconfigu-

rab le a r rays a l so have e i the r f ixed o r

reconf igu rab le ce l l in te rconnec t ions .

Th e use r recon f igu res

an

a r ray s top o l -

ogy by means o f a swi tch la t t i ce . Any

gene ra l -pu rpose a r ray tha t i s no t con -

ven t iona l ly p rog rammab le i s u sua l ly

c o n s i d e re d r e c o n f i g u r ab l e . A l l F P G A

reconf igu r ing is s ta t ic due to techno lo -

gy l imitations.

A r r a y d im e n s i o n s . We can further c las-

s i fy gene ra l -pu rpose and spec ia l -pu r -

pose sys to l ic a rch i tec tu res by the i r a r -

r a y d i m e n s io n s . T h e t w o m o s t c o m m o n

s t ruc tu res a re the l inea r a r ray and the

two-d imens iona l a r ray . Lin ea r sys to l ic

array s are by default s ta t ically reconfig-

u rab le in one -d imens io na l space . Two-

d imens iona l a r rays a l low m ore e f f ic ient

execution of complicated algorithms.

Due

t o

I/O

imi ta t ions . gene ra l -pu rp ose

sys to l ic a r rays o f d imens ions g rea te r

t h a n t w o a r e n o t c o m m o n .

Programmable array organization.

Programm ab le sys to l ic a r rays a re p ro -

g rammab le e i the r a t

a

high level or a

low level . A t e i the r

level,

p r o g r a m m a -

b le a r rays can be ca tego r ized

as

e i the r

S I M D

o r

M I M D

m a c h i n es . T h e y a r e

typically back-end processors with an

add i t iona l bu ffe r memory to hand le the

high systolic U a te s . High - leve l p ro -

g r a m m a b l e a r r a y s u s u a ll y a r e p r o -

g r a m m e d

in

h igh - leve l l anguages and

a re word o r ien ted . Low- leve l a r rays a re

p rog ramm ed in a ssembly language and

a re b i t o r ien ted .

SZMD. S I MD

systolic mac hines (Fig-

u re

4)

o p e r a t e s i m il a rl y t o

a

vec to r p ro -

cesso r. The hos t works ta t ion p re load s a

con t ro l le r and

a

mem ory , wh ich a re ex -

te rna l to the a r ray , with the in s t ruc t ions

and d a ta fo r the app l ica t ion . The sys tol -

ic ce l l s s to re no p rog rams o r in s t ruc -

t ions. Ea ch cells instruction-process-

C O M P U T E R


6/12

ing functional unit serv es as a large de-

code r . Once the works ta t ion enab le s

e x e c u t i o n , t h e c o n t r o l l e r s e q u e n c e s

t h r o u g h t h e e x t e r n a l m e m o r y , t h u s d e -

l ivering instructions a nd da ta to the sys-

to l ic a r ray . Wi th in th e a r ray , in s t ruc -

t ions a re b roadcas t and a l l ce l ls pe r fo rm

t h e s a m e o p e r a t i o n o n d i f f er e n t d a t a .

Ad jacen t ce l l s may sha re memory , bu t

generally

no

memory i s sha red by the

en t i re a r ray . Af te r ex i t ing the a r ray ,

da ta i s co l lec ted in the ex te rna l bu f fe r

memo ry . S IMD-prog rammab le sys to l ic

ce l ls t ake le s s space on the VLSI w afe r

t h a n t h e i r M I M D c o u n t e r p a rt s b e c a u s e

they are s imple instruction-processing

e l e m e n t s r e q u ir i n g n o p r o g r a m m e m o -

ry . F rom tens t o hundreds o f such ce ll s

fit on one in teg ra ted ch ip . '

M I M D . M I M D s y s t o l i c m a c h i n e s

(F igu re

5)

opera te s imila rly to hom oge-

n e o u s v o n N e u m a n n m u l t i p r o c e s s o r

mach ines . The works ta t ion downloads

a p rog ram to each M IM D sys to l ic cel l .

Ea ch cell may be loa ded with a different

p rog ram, o r a l l the ce ll s in the a r ray may

be loaded wi th the same p rog ram. Each

cell 's architecture

is

somewha t s imi lar

t o t h e c o n v e n t io n a l v on N e u m a n n a r -

ch i tec tu re : I t con ta in s a con t ro l un i t , an

A L U , a n d l o c a l m e m o r y . M I M D s y st o l-

ic ce l l s have more loca l memory than

t h e i r S I M D c o u n t e r p a r t s t o s u p p o r t

the von Neumann-s ty le o rgan iza t ion .

Some may have a sma l l amoun t o f glo-

ba l memory , bu t gene ral ly no mem ory

is sha red by a l l the ce l l s . Wheneve r da ta

is to be sh are d by processors . i t must be

passed to the nex t ce l l . Thus , da t a ava i l -

abil i ty become s a very impo rtant issue.

High - leve l MIM D sys tol ic cel l s a re ve ry

complicated. and usually only one f i ts

on

a s ing le in teg ra ted ch ip . Fo r exam -

p le , the iWarp i s a Warp ce l l wi thou t

m e m o r y

on

a s ing le ch ip. Loca l mem ory

fo r each iW arp ce ll mus t be supp l ied by

addit ional chips.

Reconfigurable array organization.

Recen t ga te -densi ty advances in FP GA

techno logy have p roduced a low- leve l,

reconfigurable systolic array architec-

tu re tha t b r idges the gap be tween spe -

c ia l -pu rpose a r rays and the m ore ve rsa -

t i l e , p r o g r a m m a b l e . g e n e r a l - p u r p o s e

a r rays . The FPGA a rch i tec tu re i s un -

usual because a s ingle hardware plat-

form can be logically reconfigured as an

exact duplic ate of a special-purpose sys-

tolic array. F igu re

6

shows the gene ra l

organization of a reconfigurable array.

Data in

_ _ _ _ _

Array

i

ata

out

Figure 5. General organization of M IMD program mable linear systolic arrays.

Th e design er logically draws cell archi-

tec tu res on the works ta t ion wi th a sche -

m a t i c e d i t o r ( s u ch a s M e n t o r ) , c o n v e r ts

t h e d e s ig n t o F P G A c o d e w i th a n o t h e r

u t il i ty , and th en dow nloads the code in

a f e w s e c o n d s t o c o n f i g ur e t h e F P G A

arch i tec tu re .x

Reconf igu rab le sys to l ic a r rays do no t

f a ll i n t o t h e S I M D or M I M D ca tego r ie s

fo r the same reason tha t spec ia l -pu r -

pose sys to l ic a r rays do no t . R econf igu -

rable arrays, l ike special-purpose ar-

rays , a re gene ra l ly l imi ted to VFIMD

(very-few-instruction streams. multip le-

da ta s t reams) o rgan izat ion due to FPG A

ga te dens i ty . In s t ruc t ions a r e imp l ic i t in

the configuration of each cell ; there-

fo re , the re i s

no

n e e d t o d o w n l o a d t h e m

from th e workstation. Since special-pur-

pose systolic architectures consist of a

ve ry few un ique ce ll s repea ted th rou gh-

ou t the a r ray , the en t i re a r ray a l so tends

t o b e V F I M D . P u r e l y r e c o n f i g u r a b l e

architectures are f ine-grain , low-level

devices best suited for logical or b it

manipulations. They typically lack the

ga te dens ity to supp or t h igh - level func -

t ions such as multiplication.

Tab les

2 , 3,

a n d

4

l is t most of the

recen t p rog rammab le and reconf igu -

rab le , gene ra l -pu rpose sys to l ic a r rays

repo r ted in the l i t e ra tu re . When in fo r -

ma t ion i s unc lea r in the l i t e ra tu re , the

co r re spond ing space in the tab le is lef t

b lank . Th e tab le s ind ica te th ree s tages

o f p roduc t deve lopmen t :

1

Figure 6. General

organization of

reconfigurable

arrays.

N o v e m b e r 1993

25


7/12

Table

2

SIMD -organized programmable systolic architectures.

Sys tem nam e , Deve lopmen t

deve lope r s tage Topo logy Key fea tu res

Brow n Sys to l ic Ar ray

Brown U n ive rs i ty

P r o t o t y p e

Micmacs*

I R I S A , C a m p us d e B e a d i e r .

F r a n c e P r o t o t y p e

Geom etr ic Ar i thme t ic Pa ra l le l P rocesso r Comm erc ia l

N C R

Saxpy-1M4

C o m p u t e r C o r p .

Commerc ia l

Sys tol ic /Ce l lu la r Arch i tec tu re5 P ro to type

H u g h e s R e s e a r c h L a b o r a t o r i e s

Cy l indr ica l Banyan Mul t icompu te rh Resea rch

Unive rs i ty of Texas a t A us t in

Programmable Systolic Device

Aus t ra l ian N a t iona l Un ive rs i ty

Resea rch

Linea r

470 cells

Linea r

18

cells

48 x 8 a r r a y

2,304 cells

Linea r

32

cells

1 6

x 16

a r ray

256 cells

Very sma l l VLSI foo tp r in t ;

100s

of cells

pe r ch ip ; ISA, SSR a rch i tec tu res ; 8 -b i t

ALU only; 157 MOPS

16-b i t f ixed -po int ma th ; b roadcas t d a ta ;

90 MOPS

Bit-sl ice cellular architecture; g lobal data;

900

M O P S

32-bit f loating-point capabil i ty ; broadcas t

and Saxpy g loba l da ta wi th b lock

process ing

1,000

M F L O P S

32-bit f ixed-function units

Packe t - swi tched p rog ram mab le topology

with programmable cells

In s t ruc t ion decod ing occu rs once pe r ch ip ;

each ch ip has m any ce l l s

1.

R. Hughey and

D.

Lopresti. Architecture

of

a Programm able Systolic Array,

Proc. Inrl Con,f.Sysfolic Arruys.

I E E E CS Press.

LOS

2. P. Frison et al., Micmacs: A VLSI Programmable Systolic Architecture. Systolic Array Processors, Prentice-Hall. Englewood Cliffs,

3. P. Greussay, Programmation des Mega-Processeurs: du GAPP

a

la Connection Machinc Course Notes DEA Artificial Intelligence,

4.

D.

Foulser and R. Scheiber, Th e Saxpy Matrix-1: A General-Purp ose Systolic Com puter. Computer . Vol. 20. No. 7, July 1987.pp. 35-43.

5. K.W. Przytula and

J.B.

Nash, lmplementation of Synthetic Apertu re R adar A lgorithms on a SystoliciCellular Architecture, Proc. In[ /

6. M. Malek and

E .

Oppe r. The C ylindrical Banyan Multicomputer: A Reconfigurable Systolic Architecture. Parallel C o m p u t i n g , Vol.

7 .

P. Lenders and

H .

Schroder, A ProRrammable Systolic Device

for

Image Processing Based on M orphology.

Purullel Conzp uting.

Vol.

Alamitos, Calif., Ord er No. 860, 1988, pp. 41-49.

N.J.:

1989,pp. 145-155.

Paris Univ.. VIII, Vincennes, 1985.

Conf Systolic Arr ays , IE EE CS Press,

Los

Alamitos, Calif., Order

No.

860, 1988, pp. 21-27.

10,

No. 3. May 1989.

pp.

319-326.

13, No. 3

Mar. 1990.

pp.

337-344

Resea rch : A func tiona l sys tem tha t

has no t ye t been imp lemen ted .

* P r o t o t y p e : At least a part ia l system

has been imp lemen ted .

Comm erc ia l : The sys tem has been

implem ented. and an integrated chip

with multip le cells , a complete sys-

tem , o r bo th a re ava i lab le fo r pu r -

chase.

Architectural issues

of

programmable systolic

arrays

To be effective. a programm able sys-

to l ic a rch i tec tu re mus t adhe re to the

genera l principles of systolic arrays: reg-

u la r i ty , s imp l ic i ty , concu r rency , and

rhy thmic communica t ions . In add i t ion .

the in troduction of apr ogr am ma ble sys-

tolic cell with s ignif icant local memo ry

has re su l ted in a new mode o f ope ra -

t ions: b lock processing. Block process-

ing combines periods of in tensive sys-

tolic

IiO

and pe r iods

of

sequen t ia l von

Neumann-s ty le p rocess ingto fo rm wha t

is known as a pseudosystolic mod el.

R e p l i c at i n g a n d i n t e r c o n n e c t i n g a

basic program mabl e systolic cell to form

an a r ray ca r r ie s ou t th e regu lar i ty and

simplicity principles , provided an ap-

p rop r ia te a lgo r i thm and p rog ram a re

u t i l i z e d . H o w e v e r . p r o g r a m m a b l e -

c e l l - b a s e d a r c h i t e c t u r e s m u s t i n t r o -

duce spec ia l f ea tu res

to

address th e sys -

tolic properties

of

concurrency, rhyth-

mic communica t ions , and b lock p ro -

cessing.

Concurrency. Tw o leve ls of concur -

rency a re poss ib le in a p rog rammab le

systolic architecture: concurren cy across

cells and co ncurre ncy within cells , both

implicit to hardw ired designs. To m a k e

concur rency feas ib le in p rog ram mab le

systolic archite ctures, he designer must

add special mechanisms that facil i ta te

p rocess ing con t ro l .

I n t e rc e l l c onc urre nc y .

Coord ina t ion

of concur rency con t ro l has a lways been

inhe ren t

in

a hardwired systolic design

because i t has been add ressed a t the

a lgo r i thm deve lopmen t s tage . Th e ad -

ven t o f the p rog ram mab le systo lic a r ray

requ i red tha t innova t ive fea tu res be in -

co rpo ra ted in to des igns to sa t i s fy a p ro -

g rammab le coo rd ina t ion requ i remen t .

Methods of prog ram load ing . memory

init ia l ization, program switching. and

local memo ry access a t the cell level had

26

C O M P U T E R

I


8/12

Table 3. MIMD-organized programmable systolic architectures.

Sys tem name ,

deve lope r

D e v e l o p m e n t

s tage Topo logy Key fea tu res

Warp

Carneg ie Mel lon Un ive rs i ty

A r c h i t e c t u r e P r o t

Comm erc ia l Linea r 32 -bi t f loa t ing -po in t mu l t ip l ica t ion :

b lock p rocess ing ;

IiO

queu ing .

100

M F L O P S

10

cells

i W a r p 3

Carneg ie M el lon Un ive rs i ty

C o m p u t e r f o r E x p e r i m e n t a l

S y n t h e t ic A p e r t u r e R a d a r 4

Norweg ian Defense Res . Es tab .

Ce l lu la r Ar ra y P rocesso r5

Fu j i t su Labora to r ie s , Japan

Conf igu rab le High ly Pa ra l le l Compu te rh

Purdue Un ive rs i ty

Associative Str ing Processor

B r u n e l U n i v e r s it y , U K

PICAP3

University of Paris

P S D P 9

Univ .

of

Sou th F lo r ida ,

H o n e y w e l l

P r o t o t y p e F o u r

8

x 16

Bit-seria l cellu lar

IiO;

32-bit f loating-

a r rays

512 ce l ls 320 MF LO PS

point multip liers in each cell ;

C o m m e r

Resea rch P rog ramm ab le ce l ls em bedd ed in swi tch

la t t i ce fo r p rog rammab le topo logy

R e s e a r c h a t h i

P ro to type

8

x

8

a r ray 16 -b it word leng th ; image o r ien ted

64

cells

R e s e a r c h

4 -b i t word leng th ; wafe r - sca le des ign

1.

A.L. Fisher.

K.

Sarocky. and H.T. Kung. Experience with the CMU Programmable Systolic Chip,

Proc. Soc. of Phoro-Oprical

2. M . Annaratone et a l . . Architecture of Warp, Proc.

Compcon,

IEE E CS Press, L o s Alamitos, Calif.. Order

N o .

764, Feb. 1987, pp.

3 . S.

Borkar et al.. iWarp: An Integrated Solution

to

High-speed Parallel Computing.

Proc. Superconzpuring.

I E E E CS Press.

Los

4. M. Toverud and V. Anderson. CE SAR : A Programmable H igh-Performance Systolic Array Proccssor,

Proc. In / [

Con

Coni pu f e r

5.

M .

Ishii et al., Cellular Array Proccssor C AP and Application,

Proc.

Intl

Conf. Sys to l i c

Arrays,

IE EE C S Press. Los Alamitos. Calif.,

6. L. Snyder. Introduction

to

the Configurable Highly Parallel Comp uter. Conzputer. Vol.

15,

No. 1. Jan. 1982. pp. 47-56.

7. R.M. Lea, Th e ASP. a Fault-Tolerant V LSI/ULSI/WSI Associative String Processor for Cost-Effective Processing,

Proc. Inrl

Conf:

8.

B. Lindscog and P.E. Danielsson. PICA P3: A Parallel Processor

Tuned

for 3D Image Operations.

Proc. E i g h th Intl

Conf:

Patrern

9. D. Landis et al. . A W afer-Scale Programm able Systolic D ata Processor, Proc. N i n th Biennial Univ . /Gov f / lnd.Microrlec /ronics Synp . ,

1n.strunzenfufion Engineers, Re d -T ime

Signal

Processing

V I I ,

Spie. San Di ego. C alif., 1984. p. 495.

264-267.

Alamitos, Calif., Ord er No. 882, 1988, pp. 330.339.

Design,

IEEE CS Press,

Los

Alamitos. Calif., Ord er

No.

872 (micr ofiche only) , 1988, pp. 414-417.

Order

N o.

860. 1988. pp. 535-544.

Sysrolic

Ar r a y s . pp. 515.524.

Recognilion, IEEE

CS Press, O rder No. 742 (mic rofiche o nly ), 1986, pp. 1248-1250.

IE EE , Piscataway,

N.I . ,

1991, pp. 252-256.

to be inco rpo ra ted in to a sys to lic env i -

r o n m e n t . T h e f o l l o w i n g m e c h a n i s m s

f a c il i ta t e p r o g r a m m a b l e c o n c u r r e n c y

th roughou t th e a r ray :

B r o a d c a s t d a t a

permits a l l cells of

an a r ray to be re se t s imu l taneous ly

wi th in i t ia l cond i t ions and cons tan ts

a t the s ta r t o f a p rocess ing b lock

dur ing nonsys to l ic modes o f ope ra -

t ion . The la rge r the a r ray , the more

effic iency i t will gain from a broad-

cas t da ta capab i l i ty .

B r o a d c a s t i n s t r u c t i o n s p e r m i t o p e r -

a t ions s imi la r to those o f a vec to r

compu te r by mak ing each sys to l ic

c e ll a n a l o g o u s t o t h e v e c t o r c o m -

pu te r s p rocess ing un i t . However ,

un l ike mos t vec to r compu te rs ,

sys-

to lic cells can support a h igh-band-

wid th comm unica t ion channe l wi th

adjacent cells .

B r o a d c a s t i n s t r u c t i o n a d d r e s s e s al-

l o w f a s t a n d e f f i c i e n t s w i t c h i n g

a m o n g p r o g r a m s i n a n M I M D c e ll s

p r o g r a m m e m o r y . W h e n a l l

M I M D

ce l l s have p rog rams s to red in the

s a m e p l a c e s in t h e i r p r o g r a m

m e m o r i e s , a j u m p t o t h e b r o a d c a s t

in s t ruc t ion add ress ca r r ie s ou t s i -

m u l t a n e o u s p r o g r a m s w i t ch i n g

th roughou t the a r ray . I f the p ro -

g rams in a l l the M IM D ce l ls happen

N o v e m b e r

1993

27


9/12

Table 4. VFIMD-organized programmable systolic architectures.

S y s t em n a m e ,

deve lope r

D e v e l o p m e n t

s tage Topo logy Key fea tu res

~~~ ~ ~~~~~~~

Splash Systolic Engine '

S u p e r C o m p u t i n g L i n e a r b a s e d on

R e s e a r c h C e n t e r P r o t o t y p e

32

cells

Ce l lu la r Ar r ay Log ic '

Ed inburgh Un ive rs i ty , U K Pro to type 256 cells custom chip

16

x

16

a r ray

FPGA -reconf igu rab le ce l l a rch i tec tu re based on

Hybr id Arch i tec tu re3

Unive rs i ty

of

Illinois

R e s e a r c h

Reconf igu rab le ce l l a rch i tec tu re in teg ra t ing FPGA

capab i l i ty an d 32 -b i t f loa t ing -po in t mu l t ip l ica t ion

P r o g r a m m a b l e A d a p t i v e

Computing Engine. '

Un ive rs i ty o f Wales , Bangor , UK Pro to type P rog ramm ab le p rog ramm ab le topo logy

C o n f i g u r a b le F u n c t i o n a l A r r a y 5

Ts inghua Un ive rs i ty , Be i j ing Resea rch Conf igu rab le a r ray topo logy and ce l l a rch i tec tu re

Func t iona l un i t s embe dded in each ce l l :

reconfigu rable con necti ons in cell as well as

1.

M . Gok hale et al.. Splash: A Reconfigurable Linear Logic Array, Pm c . In i l

Conf:

Parr i l l e l Processing,

IEEE CS

Press. Los Alamitos,

2.

T.

Kcan and

J .

Gray, Configurable Hardware: Two Case Studies of Micrograin Computation.

in

Sysro l i c

Array Processorh,

Prentice-

3.

R. Smith and G. Sobelman, Simulation-Based Design of Programmablc Systolic Arrays.

Coniprrier-Aided Desig n.

Vol.

2.3. N o .

IO. Dec.

4.

S .

Jones. A. Spray, and A. Ling. A Flexib lc Building Block for

t h e

Construction of Processor Arrays. in

Sys/o/ic

A r r n v P r o c r s s o r s

5 . C. Wenyang, L. Yanda. and J . Yuc, Systolic Realization for 2D Convolution Using Conrigurable Functional Method in VLSI Parallel

Calif., Orde r

No.

2101. 1990.

pp. 1526-1531.

Hall, E nglcwood C liffs,

N.J.,

1989. pp . 310-319.

1901,

pp.

669-675.

Prcntice-Hall, E nglewood Cliffs. N.J. ,

19S9,

pp.

459-466.

Array Dcsigns.

Proc .

I E E E Compurrrs

u n d

Digirtrl

e~hnology~ol.

138, No. S Sept.

1991.

pp. 361-370.

to be iden t ica l . the a r ray i s pe r fo rm-

i n g S I M D o p e r a t i o n s .

I n s t r u c t i o n s y s t o l i c urriiys prov ide

a precise low-level method of coor-

d ina t ing p rocess ing f rom ce l l to ce l l .

In an IS A. instructions travel through

t h e a r r a y w i th t h e d a t a . T h e p r o -

cessed da ta passes wi th th e o r ig ina l

instruction from ea ch cell to the n ext.

An ap p rop r ia te ly wide ISA ins t ruc -

t ion also has buil t- in microcodable

pa ra l le l i sm. ISA a lgo r i thms can be

leng thy and m ore d i f f icu l t to deve l -

o p t h a n a l g o r it h m s f o r o t h e r S I M D

a n d M I M D a r r a y s. B u t t h e y p r o v i de

a way to un ique ly spec i fy concu r -

rency across cells withou t an overly

complicated circuit .

D i re c t l oc a l memory a c c es s o r g l o -

bal

rnen io ry .

Sta tus f lags can be

s to red e i the r in each ce l l ' s loca l

m e m o r y or in a g loba l memory . In

the case o f loca l mem ory , the con -

t ro l le r mus t b e ab le to d i rec t ly ac -

cess each ce ll 's loca l memory t o de -

te rmine f lag s ta tu s ; o the rwise , the

con t ro l le r mus t i s sue a reques t and

wa i t fo r i t to p ropaga te th rough the

a r ray . When g loba l memory i s avai l -

ab le . the con t ro l le r ob ta in s s ta tu s

in fo rma t ion much m ore eas i ly . Di -

rec t loca l memory access o r g loba l

mem ory also facil i ta tes in it ia l ization

wi th in a ce l l. The Saxpy-1M is one

of th e f irs t systolic array s with glo-

b a l m e m o r y .

I tifrace11c o n c u r r e n c y . P r o g r a m m a b l e

ce l ls requ i re t he ab i l i ty t o pe r fo rm mul-

t ip le s imi la r o r d i s simi lar ope ra t ions s i -

mu l taneous ly . Mechan isms inco rpo ra t -

ed in to p rog ramm ab le sys to l ic ce l ls to

suppor t concu r rency a re a l l p roven tech -

n iques tha t o r ig ina ted in conven t iona l

von Neum ann p rocesso rs : mic rocodab le

processing elements . duplication offun c-

t iona l un i t s , mu l t ip le da ta p a ths , su f fi -

c ien t lywide in s t ruc t ion words . a nd p ipe -

l in ing of functional units .

Rhythmic communications. The p r in -

ciple of rhythm iccom munic ationsclearly

sepa ra te s sys to l ic a r rays f rom o the r a r -

ch i tec tu res . P rog rammab i l i ty c rea te s a

mo re f lex ib le sys to l ic a rch i tec tu re . bu t

the pena l t ie s a re complex i ty and poss i -

ble s lowing of operat ions. Wh en systol-

ic ce ll s a re p rog ramm ab le , the i s sue o f

da ta availabil i ty arises. To min imize the

pe r fo rmance deg rada t ion

of

p r o g r a m -

mable systolic designs, each cell 's pro-

cessing e lemen t mus t have da ta ava i l -

able when i t is required. h igh-speed ari th -

me t ic capab i l i t i e s , and the ab i l i ty to

t rans fe r o r s to re the p rocessed da ta .

Thus , each ce l l mus t have su f f ic ien t

mem ory fo r da ta s to rage a s we l l a s p ro -

g ram s to rage to fac i l i t a te in te rce l l com-

mun ica t ions .

Th e fo l lowing a rch i tec tu ra l f ea tu res

suppor t rhy thmic comm unica t ions :

QueiLed //O s t r e a m l i n e s c e l l u l a r

c o m m u n i ca

on s

b y a l o w in g t h e

s o u r c e c e ll t o s e n d d a t a t o t h e d e s ti -

nation cell when the data is avail-

ab le . no t when the des t ina t ion ce ll i s

ready to accep t i t . Da ta ex i t s the

queu e on a F IF O ( f i rs t in . f i rs t ou t )

bas is tha t a l lows p rog ram com pu ta -

t ion to p roceed i r re spec t ive o f com-

m u n i c a t i o n s s t a t u s . Q u e u e d IiO

he lps cons ide rab lywh en a l l ce l ls a re

compu t ing a s ing le p rog ram tha t i s

skewed in t ime ac ross the ce l ls . A

d isadvan tage i s tha t th i s mechan ism

uses memory space tha t i s expen -

s ive on

VLSI

wafe rs .

Se r i a l I/O reduces the bandwid th

requ i remen t on the sys to l ic a r ray ' s

works ta t ion mach ine , a t the same

t ime p romot ing rhy thmic ce l lu la r

c o m m u n i c a t i o n . T h e r e s ul t i ng p r o -

28

C O M P U T E R


10/12

grammable systolic array has bit-

se r ia l ce llu la r communica t ions and

word-wide operations in each cell .

Each cell s tores part ia l words unti l

i t r ece ives the fu ll word and then

p e r f o r m s f u n c t i o n a l o p e r a t i o n s .

Using mo re cells can part ia l ly offset

t h e d i s a d v a n t a g e of r e d u c e d

th roughpu t .

M u l t i p l e d a tu p a t h s

provide t he cell

with fast , paralle l in ternal and ex-

ternal comm unications.O n e o r m o r e

da ta pa ths a re ded ica ted to compu-

ta t iona l needs . ano the r to ce l lu la r

inpu t ope ra t ions , and ye t ano th e r to

ce l lu lar ou tp u t ope ra t ions .

Sy s t o l i c share d re g i s t e rs provide sev-

eral benefits specif ically for

pro-

grammab le sys to l ic a rch i tec tu res .

SSR a rch i tec tu res (F igu re

7

pro -

vide a program-implicit means of

s t reaml ined da ta f low. Each two a d -

jacen t p rocess ing e lemen ts sha re a

sma l l r eg is te r memory . Da ta f lows

concur ren t ly wi th compu ta t ion a s

each p rocess ing e lemen t rece ives

new da ta f rom the reg is te r memory

ups t ream, ope ra te s on i t , and d i -

rec t s i t to the nex t reg is te r memory

downstream . This architecture e lim-

i n a t e s t h e r e q u i r e m e n t t h a t d a t a

movement be explic i t ly specif ied,

and da ta f low th rough the a r ray be -

comes a re su l t o f p rocess ing . The

inpu t reg is te r and the ou tpu t reg is -

te r a re spec i f ied by the in s t ruc tion

word . The re fo re . b id i rec t iona l da ta -

f low i s a na tu ra l r e su l t . An SSR

arch i tec tu re i s advan tageous ove r

queued

110

because communications

do no t occu r on a F IFO bas i s . A

disadvantage

is

that the register bank

must be kept small to minimize ac-

cess t ime a nd thus max imize sys to l -

ic bandwid th . The sma l l r eg is te r

m a k e s p r o g r a m m i n g d i f f i c u l t f o r

some o f the mo re compl ica ted a lgo -

ri thms.

B r o u d c a s t

ita

eases systolic

IiO

require ment s in certa in instances by

a l lowing all func t iona l un i t s and

local memo ry of a cell to be in i t i al -

ized o r se t to a common va r iab le a t

once .

Global datu eases systolic IiO re -

qu i remen ts and the ce l l ' s s to rage

requ i remen ts by keep ing on ly one

copy o f the sam e da ta an d a l lowing

a l l ce l l s to sha re i t . Wi th p rope r

g loba l memory cohe rency . any mod-

if ications to g loba l da ta a re in s tan t -

ly available to ot her cells .

I I

I

Figure

7 An

SIMD systolic shared-register architecture.

B i d i re c t i ona l und wrapurou t zd data-

,flow

must be explic i tly specif ied in

the p rog ram, whereas in ha rdwired

des igns da ta f low i s imp l ic i t . Bi -

d i rec t iona l and wrapa round da ta -

f l ow can reduce the

I/O

bandwid th

requ i remen ts o f the ex te rna l bu f fe r

memory by rec i rcu la ting da ta wi th -

in the a r ray . More f lex ib le da taf low

also allows arrays to execute a wide

va r ie ty o f a lgo r i th ms more e f f i -

c iently .

Block processing

If each program-

mable systolic cell has significant local

mem ory. b lock processing is possible in

the sys to lic env i ronmen t . Dur ing b lock

processing, very little systolic I/O oc-

curs , and ea ch cell executes a series of

in s t ruct ions on loca l da ta . O nce t he ce l l s

gene ra te an in te rmed ia te re su l t , p ro -

cessing pauses w hile systolic IiO t akes

p lace , and new da ta i s wr i tt en to each

cell 's local memory.

Block processing on prog rammab le

systolic architectures can result in a more

effic ient . pseudosystolic operation in

some a pplications for two reasons. First ,

the merge r o f von Neumann p rog ram -

mable cell architectures in to a systolic

a r ray env i ronmen t causes many o f the

sam e p rob lems found in homogeneous

mul t ip rocesso r sys tems . The se r ia l na -

tu re o f \ ' on Neumann mach ines in te r -

feres with the rhythmic systolic I/O of

t h e a r r a y , c a u si n g a n u n a c c e pt a b l e

amoun t o f t ime was ted in wa i t ing fo r

d a t a to become available . Block pro-

cessing minimizes th is wasted t ime in

app l ica t ions tha t can be d iv ided in to

equa l segmen ts . App l ica t ions a re d iv id -

ed into paralle l tasks that u ti l ize local

da ta .

Second, for any systolic array, the

bandwidth and size of the external mem-

ory a re a lways the l imi t ing fac to rs on

t h r o u g h p u t a n d p e r f o r m a n c e . B l o c k

processing reduces the array 's systolic

11 r e q u i r e m e n t s b y r e d u c i n g t h e

amou n t o f ce l lu lar IiO.

I m p o r t a n t w o r k h a s b e en d o n e o n

characteriz ing block processing in a gen-

eral-pu rpose systolic array. ' Block pro-

c e s s i n g c a n b e p e r f o r m e d o n e i t h e r

SIMD or MI MD prog ram mab le sys to l -

ic mach ines. Re qu i rem en ts fo r e f f ic ient

b lock p rocess ing inc lude an app rop r i -

a te a lgorithm a nd signif icant local mem -

ory in each ce l l . Othe r types o f a r ray

communica t ion such a s b roadcas t da ta

and g loba l da ta a re a l so des i rab le .

Architectural issues

of

reconfigurable systolic

arrays

The p r imary a rch i tec tu ra l i s sues in

d e s i g n i n g a r e c o n f i g u r a b l e s y s t o l i c

a r r a y a r e t h e h a r d w a r e p l a t f o r m o f

FPG As and th e c lass of a lgo r i thms to be

t a r g et e d t o t h a t p l a t fo r m . T h e n u m b e r

of FPG As, the i r topo logy , and the i r ga te

density will determ ine th e set of systolic

a rch i tec tu res tha t can be syn thes ized

a n d t h e s e t of systolic a lgorithms that

c a n b e i m p l e m e n t ed o n t h a t h a r d w a r e

platform. R econfigurable systolic archi-

tec tu res a r e ve ry in te re s t ing because

the cell 's architecture is actually pro-

g rammed . C onsequen t ly , the a rch i tec -

tu re p rog rammer has comple te con t ro l

ove r wha t a rch i tec tu ra l f ea tu res a re in -

c o r p o r a t e d i n e a ch F P G A . D e t e rm i n -

ing the ce ll a rch i tec tu re can be an i t e r -

ative process that continuously refines

the architecture unti l i t sa t isf ies appli-

ca t ion requ i remen ts .

FP GA a rch i tec tu re p rog ramming d i f-

fe r s f rom conven t iona l p rog ramming in

tha t one p rog rams t he c i rcu i t ' s log ica l

func t ion , in s tead o f p rog ramming a

mode l of operations in a high-level lan-

guage . Reconf igu r ing an FPGA is the

sam e as logically drawing a new circu it .

T h e r e f or e , a n F P G A p l a t f o rm c a n a s -

s u m e a n y a r c h it e c t ur e t h a t F P G A g a t e

dens i ty and package p inou t pe rmi t .

C o m m u n i c a t i o n a n d c o n c u r r e n c y

N o v c m b e r 1993 29


11/12

Broadcast

instructions

I

Figure 8. A hybrid S l M D programmable systolic architecture.

must be eva lua ted on a case -by -case

bas is a t the t ime o f con f igu ra t ion . Cur -

ren t ga te -dens i ty techno logy does no t

m a k e M I M D a r r a y s f e a si b l e d u e t o c e ll

m e m o r y a n d c o n t r o l r e q u i re m e n t s , b u t

FP GA p la t fo rms a re we l l su ited fo r log-

ical or b it- level applications. Splash, for

example , i s

a

l inear reconfigurable ar-

ray app rop r ia te fo r b i t - leve l app l ica -

t ions.

An

FP GA ch ip can be con f igu red fo r

on e med ium-s ize systo lic e lemen t o r fo r

several s imple systolic cells .

In

e i the r

case, current technology l imits the c ir-

cu i t s ize to an o rde r o f m agn i tude ap -

p roach ing 25 ,000 ga te s . Packag ing a nd

110

p ins a l so in f luence the des ign . Cur -

rent technology l imits

I/O

t o a p p r o x i-

ma te ly

200

pins.

Reconf igu rab le a rch i tec tu res a re a l so

h ighly qua l i f ied fo r fau l t - to le ran t com-

puting. Recon figuration can simply elim-

ina te a bad ce l l f rom the a r ray . Ce l l

a rch i tec tu res can be con figu red a round

VLSI defects .

Architectural issues of

hybrid arrays

High- leve l p rog ramm ab le a r rays re -

quire extensive efforts to ma p algorithms

to h igh -level l anguages a f te r a lgo r i thm

deve lopmen t . When mapp ing i s com-

p le ted , the re su l t ing sys tem o f ten i s no t

a s e f f ic ien t a s a sys tem imp lemen ted in

a f ixed - func t ion ASIC.

On

t h e o t h e r

hand . low FPG A ga te dens i ty makes i t

unlikely t hat large-g rain tasks will ever

b e c o m p l e te l y i m p l e m e n t e d i n F P G A s

or tha t recon f igu rab le a r rays wi l l r e -

place conventional

S l M D

a n d M I M D

arch i tec tu res in the nea r fu tu re . Conse -

quen t ly , a na tu ra l s tep i s

to m e r g e t h e

two app roaches , keep ing the i r des i r -

ab le a spec ts and d isca rd ing the i r unde -

sirable aspects .

T h e h y b ri d S l M D a r c h i te c t u r e ( F i g-

u r e

8

is best u ti l ized for in tensive f lo at-

ing-point computationa l applications but

d o e s n o t d e g r a d e i n p e r f o r m a n c e a s

much

as

h igh - leve l p rog rammab le a r -

rays when s ign i f ican t log ica l o r con t ro l

o p e r a t i o n s a r e i n c l u d e d . T h e h y b r i d

des ign co mbines a commerc ia l f loa t ing -

p o i n t m u l t i pl i e r c hi p a n d a n F P G A c o n -

t ro l le r to fo rm a sys to l ic ce l l . Th e com-

m e r c i a l m u l t i p l i e r i s u s e d f o r i t s

economy. speed , an d package dens i ty :

the FPGA c lose ly b inds the ce l l to a

specif ic application. Another project is

a t tempt ing to in teg ra te a hyb r id gene r -

al-pur pose systolic array cell

on

a single

chip.':

A recen t ly deve loped s imu la to r a l -

lows simulation and test ing of the cell

and the a r ray des ign fo r a hyb r id a rch i -

tec tu re . Un l ike mos t commerc ia l ly

available computer-aided design uti l i-

t i e s , wh ich ve r i fy des igns a t the ch ip

leve l , it ve r i f ie s the pe r fo rma nce o f

al-

gor i thms and a rch i tec tu res h rough s im-

u la t ion o f the comple te a r ray . T he s im-

u la to r ma in ta in s the i t e ra t ive na tu re o f

purely reconfigurable array design by

facil i ta t ing ta i loring of hybrid designs

for specif ic applications.

Existing architectures

A review of projects in it ia ted during

the la s t decade shows tha t the t ren d was

to deve lop la rge sys to l ic a r ray p roces -

so rs tha t r equ i re e labo ra te . cus tomized

hos t suppor t . Warp . the Compu te r fo r

Exper imen ta l Syn the t ic Aper tu re Ra-

d a r , t h e C e l l ul a r A r r a y P r o c e s so r , a n d

Saxpy-1M a l l r equ i re expens ive and

compl ica ted

I/O

suppor t o r the in -

tensive instruction I/O as we l l a s the

d a t a

1/0.

T h e s e m a c h i n e s a r e h i g h -

p e r f o r m a n c e s u p e r c o m p u t e r s

(200

M F L O P S , 3 2 0 M F L O P S , a n d 1 , 00 0

MFLOPS. re spec t ive ly ) wi th cen t ra l -

ized concu r rency , cons t ruc ted to f i l l an

ex is t ing pe r fo rmance gap .

Th e introduction of workstations in t o

the workp lace has changed the way a

s ign i f ican t po r t ion o f the com pu t ing

communi ty v iews compu te rs . Use rs de -

m a n d i m p r o v e d d e s k t o p c o m p u t i n g

pe r fo rmance a s app l ica t ions con t inue

t o i n c r e a s e i n s i z e a n d c o m p l e x i t y .

Works tation architectur e can always be

improved . e spec ia l ly fo r emerg ing ap -

p l ica t ions such a s tex t an d speech p ro -

cess ing and gene ma tch ing . The more

recen t gene ra l -pu rpose sys to l ic a r ray

p r o j e c t s ( B r o w n , M i c m a c s , S p l a s h )

show tha t back -end sys to l ic p rocesso rs

are effective in boosting a workstation's

p e r f o r m a n c e f o r t h e s e a p p l i c a t i o n s .

These a r rays a re sma l l enough tha t the

host 's open a rch i tec tu re wi th l imi ted

U

bandwid th does no t seve re ly im-

pac t the a r ray ' s pe r fo rmance fo r

low-

to modera te - leve l g ranu la r i ty . Sma l l

back-end systolic processors are a lso

e c o n o m i c a l l y s o u n d . F o r e x a m p l e .

Sp la sh cons is ts of a two-boa rd add -on

se t fo r a Sun works ta t ion . One boa rd

suppor t s the l inea r a r ray . and the o th -

e r s u p p o r t s a bu f f er m e m o r y . T h e set

costs fr om $13,000 to $35,000.'

Almos t w i thou t excep t ion , cu r ren t

r e s e a r c h e m p h a s i z e s r e c o n f i g u r a b l e

cells , reconfigura ble arrays , and hybrid s

o f func t iona l un i t s embed ded in recon -

f igu rab le FP GA a r rays . Reconf igu rab le

des igns have p roven to be unmatched

for low- to moderate -granulari ty require-

m e n t s b u t a r e n o t y e t m a t u r e e n o u g h

fo r h igh -g ranu la r i ty app l ica t ions . I n

add i t ion , a s we have sa id . r econ f igu -

rable topologies and cells are highly

fau l t - to le ran t . The i r fau l t to le rance i s a

configuration issue. not a design and

fabrication issue.

Un t i l FP GA ch ip dens i ty p rog resses

t o t h e p o i n t w h e r e a v e r y l a r g e F P G A

can achieve high-level granulari ty , hy-

b r id a rch i tec tu res p re sen t pe rhaps the

mos t p rac t ica l means to a recon f igu -

rable high-level systolic array. Hybrid s

make use o f the mos t a t t rac t ive fea tu res

o f p rog rammab le and reconf igu rab le

30

C O M P U T E R


12/12

me tho ds while addin g greater f lexibil i-

ty than e i the r me thod .

N o d i s c u s s i o n o f g e n e r a l - p u r p o s e

systolic arrays is complete without ad-

dressing the issues

of

p r o g ra m m i n g a n d

configuration. High-level- lan guagepro -

g ramming i s des i rab le fo r p romot ing

widesp read use of prog rammab le sys -

tolic back-end processors . Of the projects

we su rveyed . on ly the la rge r sys tems

had ma tu re p rog ramming env i ronmen ts.

Cur ren t ly , the sma l le r p rog rammab le

a r rays have imp lemen ted on ly a ssem-

b ly p rog ramming . As men t ioned ea r -

l i er , F P G A - c o n f i g u r a b l e a r r a y s a r e

configured with th e use of

a

schemat ic

ed i to r . One d raw back o f th i s app roach

is tha t i t typ ica lly takes m ore e f fo r t than

p rog ramming a p rog rammab le ce ll .

T

e systolic array is a formida ble

approach to exp lo i t ing concu r -

renc ie s in a compu ta t iona l ly

rhy thmic and in tens ive env i ronmen t .

Gene ral-purp ose systolic arrays provide

an economical way to enhance compu-

ta t iona l pe r fo rmance by emphas iz ing

concur rency and pa ra l le l i sm. Arrays

rang ing f rom low- leve l to h igh - leve l

g ranu la ri ty have been app l ied to p rob -

lems f rom b i t -o r ien ted p ixe l mapp ing

to 32-bit f loating-point-based scientif ic

computing. Systolic arrays hold great

p romise to be

a

pervas ive fo rm

of

con-

cu r rency p rocess ing . As a so lu t ion to

t h e i n t e n s i v e c o m p u t a t i o n a l p e r f o r -

mance requ i remen ts of tomorrows ap -

plications, general-purpose systolic ar-

rays canno t be ove r looked .

References

1. W.K. Fuchs and

E.E.

Swartzlander Jr.,

Wafer-Scale Integration: Architectures

and Algorithms.

Com put e r .

Vol. 25.

N o .

4, Apr. 1092,pp. 6-8.

2. P. Quinton and Y . Robert, S j s t o l i c Algo-

rifhni.c Architectures,

Prentice-Hall.

Englewood Cliffs, N.J . , 1991.

3. A. Krikelis and R.M. Lea. Architectur-

al Constructs for Cost-Effective Parallel

Computers. in

Systolic Array Proces-

sors. Prentice-Hall, Englewood Cliffs.

N.J . .

1989, pp. 287-300.

4. H.T. Kung. Why Systolic Architec-

tures? Cotnpurer , Vol. 15. No.

1.

Jan.

1982, pp. 37-46.

N o v e m b e r 1993

5.

6.

7.

8.

9.

10

11

12

J.A.B. Fortes and B.W. Wah , Systolic

Arrays: From Concept to Implementa-

tion. Computer (special issue on sys-

tolic arrays). Vol.

20, No.

7, July 1987, pp.

12-17.

C.K. KO and 0 Wing. Mapping Strate -

gy for Automated Design of Systolic

Arrays. Proc. Inrl Conf. Systol ic Ar-

rays.

IEE E CS Press,

Los

Alamitos. Ca-

l i f . .

Order

No.

860, 1988. pp. 285-294.

R. Hughey and D. Loprcsti. B-SYS: A

470-Processor Programmable Systolic

Array. Proc Intl Conf. Parallel Pro-

cersing.

IEEE CS Press. Los Alamitos,

Calif., Ord er N o 2355-22,1991 . p p . 1580-

1583.

M. Gokh ale et al., Building and Using a

Highly Parallel Programmable Logic

Array. Computer . Vol. 24. Jan. 1991.pp.

81

-89.

H.

Schroder and P. Strardins. Program

Compression on the ISA, Parallel Coni-

piiting, Vol. 17,

No.

2-3, June 1991, pp.

207-2

15.

B . Friedlander. Block Processing on a

Programmable Systolic Array ,Proc.Intl

Conf Parallel Proces sing. IE EE CS Press.

Los Alamitos, Calif.. Ord er No. 889.1987.

pp. 184-187.

R . Smith and G . Sobelman, Simulation-

Based Design of Programmable Systolic

Arrays.

Computer-Aided Design.

Vol.

2-3 . N o . 10, Dec. 7991. pp. 669-675.

S. Jones, A. Spray, and A. L ing, A Flex-

ible Building Block for the C onstruction

of Processor Arrays, in

Systolic Array

ProcrsJors,

Prentice-Hall. Englewood

Cliffs, N.J., 1989. pp. 459-466.

Kurtis

T.

Johnson

is an advanccd enginee r at

HR B Systems, an E-Systems subsidiary. His

intcrests include parallel computer architec-

tures, parallel algorithms, and digital signal

proces sing. He received his BS in 1986 and

his

M S

in 1992, both in electrical en gineer -

ing, from Pennsylvania State University.

A R

Hurson

is an associate professor of

computer engineering at Pennsylvania State

University. His research is directed toward

the design and analysis of general-purpose

and special-purpose computer architectures.

He has published m ore than

110

papers on

topics including computer architecture, par-

allel processing. database systems and ma-

chines, dataflow architectures. and V LSI al-

gorithms. He coauthored the IE EE tutorial

hooks,

Parallel Architectures

for

Database

Systenis and Multidatabase Systems:

An Ad-

vanced Solut ion or Globallnfornzatioti Shar-

ing Process. He cofounded the IEEE Sym-

posium on Paralle l and Distr ibuted

Processing. Hurson is a member of the IE EE

Com puter Society Press Editorial Board and

a member of the IE EE Distinguished Visi-

tors Program.

Behrooz Shirazi

is an associate professor of

computer science engineering at the U niver-

sity of Tex as at Arli ngton. Previously, he was

an assistant professor at Southern Meth odist

University. His research interests include

task scheduling, heterogeneous computing,

dataflow computation, and parallel and dis-

tributed processing. He has published wide-

ly

on these topics. He is a cofounder of the

IEE E Symposium on Parallel and Distribut-

ed Processing. Hc has served as chair

of

the

Com puter Society section in the Dalla s chap-

ter

oC

the IEEE and as chair

of

the IEEE

Region V A rea Activities B oard. Shirazi re-

ceived his MS and PhD degrees in computer

science from the University of Oklahoma. n

1980 and 1Y85. respectively.

Send correspondence about this article to

A.R. Hurson. Dept. of Computer Science

and Engineering. Pennsylvania State Uni-

versity, University Park, PA 16802: o r

[email protected].

Laxmi Bhuyan,

Conzputers

system archi-

tecture area ed itor, coordinated and recom-

mended this article for publication. His c-

mail addre ss is [email protected].

31
mailto:[email protected]:[email protected]:[email protected]:[email protected]

General Purpose Systolic Arrays 1

Documents

Transcript of General Purpose Systolic Arrays 1