General Purpose Systolic Arrays 1

download General Purpose Systolic Arrays 1

of 12

Transcript of General Purpose Systolic Arrays 1

  • 7/25/2019 General Purpose Systolic Arrays 1

    1/12

    General-Purpose

    Systolic Arrays

    Kurtis

    T.

    Johnson and

    A.R.

    Hurson Pennsylvania State University

    Behrooz Shirazi University of Texas Arlington

    Systolic arrays

    effectively exploit

    massive parallelism in

    computationally

    intensive applications.

    With advances in VLSI

    WSI and

    FPGA

    technologies they have

    progressed from fixed-

    function to general-

    purpose architectures.

    hen Sun Mic rosys tems in t roduced i t s fi r st works ta t ion , the com pany

    cou ld no t have imag ined how qu ick ly works ta t ions wou ld revo lu t ion -

    ize compu t ing . Th e idea o f a communi ty o f eng inee rs , sc ien t i s ts , o r

    re sea rche rs t ime-sha r ing on a s ing le ma in f rame compu te r cou ld ha rd ly have

    become anc ien t any mo re qu ick ly . Th e a lmos t in s tan t wide accep tance o f works ta -

    t ions and deskt op com puters indicates that [hey were quickly recognized as giving

    the bes t a nd mos t f lex ib le pe r fo rmance fo r the do l la r .

    Desk top compu te rs p ro l i fe ra ted fo r th ree reasons . F i rs t , very la rge sca le in teg ra -

    t ion (VLSI) and wafe r sca le in teg ra t ion

    (WSI) ,

    desp i te some p rob lem s , inc reased

    the ga te dens i ty of chips while dramatically lowering their production cost . '

    Moreove r . inc reased ga te dens i ty pe rmi ts a m ore compl ica ted p rocesso r , wh ich in

    tu rn p rom otes pa ra l le l ism.

    Second , desk top compu te rs d i s t r ibu te p rocess ing power t o the u se r in an eas ily

    cus tomized open a rch i tec tu re . Rca l - t ime app l ica t ions tha t requ i re in tens ive

    110

    and compu ta t ion need no t consume a l l the re sou rces o f a supe rcompu te r . Also,

    desk top compu te rs suppor t h igh -de f in i t ion sc reens wi th co lo r and mo t ion fa r

    exceed ing those ava i lab le wi th any m u l t ip le -use r . sha red - re sou rce ma in f rame .

    Th i r d , economica l . h igh -bandwid th ne tworks a l low desk t op compu te rs to sha re

    data . thus re ta ining the most app ealing aspect of centralized computing , resource

    sha r ing . Moreove r , ne tworks a l low the compu te rs t o sha re da ta wi th d i ss imila r

    compu t ing mach ines . Tha t i s pe rhaps the mos t impor tan t reason fo r the accep -

    tance o f desk t op compu te rs . s ince al l the pe r fo rmance in the wor ld i s wor th l i t t l e

    if

    the ma chine is isolated.

    Today ' s works ta t ions have rede f ined the way the compu t ing comm uni ty d i s t rib -

    u te s p rocess ing re sou rces , and tomorrow's mach ines wil l con t inue th i s t r en d wi th

    h ighe r bandwid th ne tworks and h ighe r compu ta t iona l pe r fo rmance . O ne way to

    ob ta in h ighe r compu ta t iona l pe r fo rmance is t o

    use

    spec ia l pa ra lle l cop rocesso rs to

    pe r fo rm func t ions such as mot ion and co lo r suppor t o f h igh -de f in it ion sc reens .

    Fu tu re com pu ta t iona l ly in tens ive app l ica t ions su i ted fo r desk top compu t ing ma-

    ch ines inc lude rea l - t ime tex t , speech , and image p rocess ing . These app l ica t ions

    require massive paralle l ism. '

    I

  • 7/25/2019 General Purpose Systolic Arrays 1

    2/12

    Many comp utational tasks are by their

    ve ry na tu re sequ en t ia l : fo r o the r ta sks

    t h e d e g r e e

    of

    para l le l ism va r ie s . The re -

    fo re . a mass ive ly pa ra l le l com pu ta t ion -

    al archi tectur e must maintain suffic ient

    application f lexibil ity and com putational

    efficiency.

    It

    must be

    reconf igu rab le to exp lo i t app l ica -

    t ion -dependen t pa ra l le l i sms ,

    h igh - leve l - language p rog rammab le

    for task co ntrol a nd f lexibil ity ,

    sca lab le fo r easy ex tens ion to m a n y

    app l ica t ions. and

    capable of supportingsingle-instruc-

    t ion s t ream, mu l t ip le -da ta s t ream

    ( S I M D ) o r g a n i z a t i o n s f o r v e c t o r

    ope ra t ions a nd mu l t ip le - in s t ruc t ion

    s t r e a m , m u l t i p l e - d a t a s t r e a m

    ( M I M D ) o r g a n i z a t i o n s t o e x p l o i t

    n o n h o m o g e n e o u s p a r a l l e l i s m r e -

    qu i remen ts .

    Systolic arrays are ideally qualif ied

    fo r compu ta t iona l ly in tens ive app l ica -

    t ions . Wh e the r func t ion ingas a ded ica t -

    ed f ixed - func t ion g raph ics p rocesso r o r

    a more com pl ica ted and f lex ib le cop ro -

    cesso r sha red ac ross a ne twork , a sys to l -

    ic array effectively exploits massive p ar-

    a l le l i sm. Fa l l ing in to an a rea be tw een

    vec to r compu te rs and mass ively pa ra l -

    lel

    computers . systolic arrays typically

    combine in tens ive loca l communica t ion

    a n d c o m p u t a t i o n w i t h d e c e n t r a l i z e d

    pa ra l le l ism in a comp ac t package . They

    cap i ta l ize on regu la r , modu la r , rhy th -

    mic , synch ronous . concu r ren t p rocess -

    e s ha t requ i re in tens ive , r epe t i t ive com-

    putation. While systolic arrays originally

    were u sed fo r f ixed o r spec ia l -pu rpose

    architectures, the systolic concept has

    b e e n e x t e n d e d t o g e n e r a 1 p u r p o s e

    S I M D a n d M I M D a r c h it e c tu r e s.

    W hy sys tolic ar rays?

    Ever s ince Kung p roposed the sys tol -

    ic model. . i ts e legant solutions to de-

    mand ing p rob lem s and i t s po ten t ia l pe r -

    fo rmance have a t t rac ted g rea t a t ten t ion .

    In physiology, the term sysfolicdescribes

    the con t rac t ion ( sys to le ) o f the hea r t ,

    wh ich regu la r ly sends b lood to all cells

    o f the body th rough the a r te r ie s , ve ins ,

    and capil lar ies . Analogously . systolic

    compu te r p rocesses pe r fo rm ope ra t ions

    in a rhy thmic . inc remen ta l , ce l lu la r . and

    repe t i t ive manner . T he sys tol ic compu-

    ta t iona l ra te

    is

    re s t r ic ted by the a r ray s

    U

    opera t ions . much a s the hea r t con -

    I I

    than p rocesso r a r ray s tha t execu te sys-

    tolic a lgorithm s. Systolic arra ys consist

    of

    e lemen ts tha t t ake o ne o f the fo llow-

    ing forms:

    a special-purpose cell with h ardwire d

    func t ions ,

    a vec to r -compu te r l ike ce ll wi th an

    ins t ruc t ion decod ing un i t a nd a p ro -

    cessing un i t , o r

    a p rocesso r comple te wi th a con t ro l

    un i t a nd a p rocess ing un it .

    Figure 1. General systolic organization.

    tro ls b lood f low to the cells s ince i t is the

    sou rce an d des t ina t ion fo r a l l b lood . J

    Al though the re i s no widely accep ted

    standard definit ion of systolic arrays

    and systolic cells . the follo wing descrip-

    t ion se rves a s a work ing de f in i t ion in

    th i s a r t ic le (a l so see Sys to l ic a r ray

    summ ary be low) . Sys to lic a r rays have

    balanced . uniform, gridlike architectures

    (F igu re

    I

    in which each l ine indicates a

    communica t ion pa th and each in te r sec -

    t ion represents a cell or a systolic e le-

    men t . H owever , sy s tol ic a r rays a re m ore

    In all cases, the systolic e lements or

    ce l l s a re cus tomized fo r in tens ive loca l

    comm unica t ions and decen t ra l ized pa r -

    alle l ism. Because an array consists

    of

    ce l ls o f on ly one o r , a t m os t , a few k inds ,

    it

    has regu la r and s imp le cha rac te r i s -

    t ics . T he arra y usually is extensible with

    minim al diff iculty.

    Thr ee fac to rs have con t r ibu ted

    to

    the

    sys to l ic a r ray s evo lu t ion in to a lead ing

    approach fo r hand l ing compu ta t iona l ly

    in tens ive app l ica t ions : t echno logy ad -

    vances , concu r rency p rocess ing , and

    demanding scientif ic applicationsP

    Technology advances. Advances in

    Systolic array summary

    Systol ic array:

    A g ridlike structure of sp ecial processing elements that

    processes data much like an n-dimensional pipeline. Unlike a pipeline, how-

    ever, the input data as w ell as partial resu lts flow through the array. In addi-

    tion, data can flow in a systolic organiza tion at m ultiple speeds in multiple di-

    rections. Systolic arrays usually have a very high rate of I/O and are well

    suited for intensive parallel operations .

    Appl icat ions:

    Matrix arithmetic, signal processing, mage processing, an-

    guage recog nition, relational database operations, data structure manipula -

    tion, and character string manipulation.

    Special purpose systol ic array: An array of hardwired systolic process-

    ing elements tailored for a spec ific application. Typically, many tens or hun-

    dreds of cells fit on a single chip.

    General purpose systol ic array:

    An array of systolic processing ele-

    ments that can be adapted to a variety of app lications via programm ing or

    reconfiguration.

    Programmable systol ic array: An array of programmable systolic ele-

    ments that operates either in SIMD or MIMD fashion. E ither the arrays inter-

    connect or each processing unit is program mable and a program controls

    dataflow through the elemen ts.

    Reconf igurab le sys to l ic a r ray : An array of s ystolic elements that can be

    program med at the lowest level. FPGA field-program mablegate array) tech-

    nology allows the array to emu late hardwired systolic elements at a very low

    level for each unique application.

    N o v e m b e r

    l Y Y 3

    21

  • 7/25/2019 General Purpose Systolic Arrays 1

    3/12

    VLSI /WSI techno logy complem en t the

    systolic arrays qualif ications in the fol-

    lowing ways:

    Demand ing sc ien t i f ic app l ica t ions .

    Th e techno logy g rowth o f the la st th ree

    decades has p roduced com pu t ing env i -

    ronmen ts tha t mak e i t f ea s ib le to a t tack

    demand ing sc ien t if ic app l ica t ions on

    a

    l a rge r sca le . La rge -ma t r ix mu l t ip l ica -

    t ion , f ea tu re ex t rac t ion , c lu s te r ana ly -

    s i s, and rada r s igna l p rocess ing a re on ly

    afe w examples. As recent h is tory shows,

    w h e n m a n y c o m p u t e r u s e r s w o r k

    on

    a

    wide va r ie ty o f app l ica t ions , they deve l -

    op new app l ica t ions requ i r ing inc reased

    Cellgran ulari ty .The level of cell gran-

    u l a r i t y d i r e c tl y a f f e c t s t h e a r r a y s

    th roughpu t and f lex ib i l i ty and de te r -

    mines the se t o f a lgo r i thms tha t i t c an

    e f f ic ien t ly execu te . Each ce l l s bas ic

    ope ra t ion can range f rom a log ica l o r

    b i twise ope ra t ion . to a word- leve l mu l -

    t ip l ica t ion o r add i t ion .

    to a

    c o m p l e t e

    p rog ram. G ranu la r i ty i s sub jec t to tech -

    nology capabil i t ies

    and

    l imi ta t ions a s

    well

    as

    des ign

    goals.

    F o r e x a m p l e . i n t e-

    g ra t ion -subs t ra te fami l ie s have d i f fe r -

    Smalle r and faster gatesa llow a high-

    e r ra te o f on -ch ip communica t ion be -

    cause da ta has a sho r te r d i s tance to

    t rave l .

    Highe r ga te dens i t i e s pe rmi t more

    complicated cells with higher individu-

    a l and g roup pe r fo rmance . Granu la r i ty

    inc reases

    as

    word leng th inc reases , and

    concur rency inc reases wi th more com -

    plicated cells .

    Economica l des ign a nd fab r ica t ion

    p rocesses p roduce le ss expens ive sys-

    to l ic ch ips , even in sma l l quan t i t i e s .

    Be t t e r des ign

    tools

    a l low a r rays t o be

    designed m ore eff ic iently . A systoliccell

    can be fu lly s imu la ted be fo re fab r ica -

    t ion , r educ ing the chances tha t i t wi ll

    fa il to w ork as des igned . Wi th advances

    in s imulation techniques, fully tested,

    un ique ce l l s can now be qu ick ly cop ied

    and a r ranged in regu la r , modu la r a r -

    rays. As V LSIiWSI designs becom e more

    complicated. systolic iz ing them pro-

    v ides an e f f ic ien t way t o ensu re fau l t

    to le rance : any fau l t to le rance p recau -

    t ions bu i lt in to one ce ll a r e ex tens ib le to

    all cells.

    compu ta t iona l pe r fo rmance . Examples

    of these innovativ e applications include

    in te rac t ive language recogn i t ion , re la -

    t iona l da tabase op e ra t ions , t ex t recog -

    nit ion, and vir tual reali ty . These appli -

    ca t ions requ i re mass ive repe t i t ive an d

    rhythm ic paralle l processin g, as well as

    in tens iveU0 opera t ion . H ence , sy sto lic

    compu t ing .

    en t pe r fo rmance and dens i ty cha rac te r -

    is t ics . Packaging also in trod uces

    U0

    p in

    restr ic t ions.

    Extensibil ity . B ecaus e systolic arrays

    a re bu i l t o f ce l lu la r bu i ld ing b locks, the

    cell design should be suffic iently f lexi-

    ble for use in a wide varie ty of topolo-

    g ie s imp lem en ted in a w ide va r ie ty o f

    subs t ra te techno log ie s .

    Implementation issues

    Clock

    synchron ization. Clock l ine s of

    different lengths within in tegrated chips,

    as

    well

    a s

    ex te rna l to the ch ips , can

    in t roduce skews . The r i sk o f c lock skew

    is g rea te r when da ta f low in the sys to l ic

    array is b idirectional. Wavefron t arrays

    reduce the c lock skew p rob lem by in -

    t roduc ing more com pl ica ted , a synch ro -

    A number o f imp lemen ta t ion i s sues

    de te rmine

    a

    systolic arraysperfo rmance

    effic iency. Designe rs should understa nd

    the fo l lowing pe r fo rm ance t rade -o f f s a t

    the design stage.

    Re la t ive ly new f ie ld -p rog rammab le

    g a t e a r r a y ( F P G A ) t e c h n ol o g y p e r m i t s

    a

    reconf igu rab le a rch i tec tu re , a s

    op-

    posed to a rep rog rammab le a rch i tec -

    tu re .

    Algo r i thms a nd mapp ing . Des igne rs

    mus t be in t ima te ly fami l ia r wi th the

    a lgo r i thms they a re imp lemen t ing on

    systolic arrays. Designing

    a

    systolic ar-

    ray heu r i s t ica l ly f rom

    an

    a lgo r i thm i s

    s low an d e r ro r -p ron e , equ i r ing s imu la -

    t ion fo r ve r i f ica t ion and o f ten p roduc -

    ing aless-than-optim um algorithm. Thus,

    au tomat ic a r ray syn thes i s i s an impor -

    t a n t r es e ar c h a r e a 6 A t p r e s e n t. h o w e v -

    nous , in te rce l lu la r communica t ions .

    Reliabil i ty . Asin tegratedcirc uits grow

    la rge r . des igne rs mus t bu i ld in g rea te r

    fau l t to le rance to ma in ta in re l iab i l i ty ,

    and d iagnos t ic s to ve r ify p rop e r ope ra -

    t ion.oncurrency processing. Past efforts

    to add concur rency to the conven t iona l ,

    v o n N e u m a n n c o m p u t e r a r c h i t e c t u r e

    have yielded coprocesso rs , multip le pro-

    t ip le hom ogene ous processo rs . Systolic r is t ics .

    a r rays combine fea tu res f rom

    all

    of these

    a rch i tec tu res in a massively paralle l ar-

    ch i tec tu re tha t can be in teg ra ted in to

    ex ist ing p la t fo rms w i thou t a com ple te

    redesign. A systolic array can act as a

    coprocesso r , can con ta in mu l t ip le p ro -

    Systolic array

    cessing units . data pipelin ing. and mul-

    e r , mos t a r ray des igns a re based on heu- taxonomy

    I n t e g r a t i o n i n t o e x i s t i n g s y s t e m s .

    Genera l ly ,

    a

    systolic array is in tegrat ed

    in to an ex is t ing host a s

    a

    back -end p ro -

    cesso r . The a r ray s h igh

    U

    requ i re -

    men ts o f te n make sys tem in teg ra t ion a

    T h e t e rm systolic

    array

    originally re-

    f e r r e d

    to

    spec ia l -pu rpose o r f ixed - func -

    t ion a rch i tec tu res des igned

    as

    h a r d w a r e

    imp leme n ta t ions o f a g iven a lgo r i thm.

    In

    mass quan t i t i e s , the p roduc t ion o f

    cessing un i t s and lo r p rocesso rs , an d can

    ac t a s an n -d imens iona l p ipe l ine . Al -

    though da ta p ipe l in ing reduces

    IiO

    re -

    qu i remen ts by a l lowing ad jacen t ce l l s

    to reuse the inpu t da ta , the sys tol ic a r -

    rays real novelty is i ts incremental in-

    s t ruc t ion p rocess ing o r compu ta t iona l

    pipelining. Each cell computes an in-

    c r e m e n t a l r e s ul t . a n d t h e c o m p u t e r d e -

    r ives the co mple te re su l t by in te rp re t -

    ing the inc remen ta l r e su l t s f rom the

    en t i re a r ray in

    a

    prespecif ied algorith-

    mic fo rma t .

    s ignif icant problem. Because the exist-

    ing IiO channel rarely satisf ies the ar-

    ray s bandwid th requ i remen t , a m emo-

    r y s u b s y s t e m o f t e n m u s t b e a d d e d

    be tween the hos t an d the sys to l ic a r ray

    t o s u p p o r t d a t a a c c e s s a n d d a t a m u l t i-

    p lex ing and demul t ip lex ing . Th e mem-

    ory subsys tem can range f rom the com-

    p l ica ted suppor t a nd c lu s te r p rocesso rs

    i n t h e W a r p a r r a y

    to

    the s imp le r s tag ing

    memory in the Sp la sh a r ray . (These

    sys tems wi l l be d i scussed in g rea te r

    de ta i l l a te r . )

    these a r rays was manageab le and eco -

    nomica l , and , thus , they were we l l su i t -

    ed fo r common app l ica t ions . Bu t these

    d e s ig n s w e r e b o u n d to the speci f ic ap -

    p l ica t ion a t hand an d were no t f lex ib le

    o r ve rsa t i l e . Eve ry t ime a systolic array

    was to b e u s e d o n a new app l ica t ion , the

    manufac tu re r had to unde r take the long ,

    costly , and potentia l ly r isky process of

    des ign ing . t e s t ing , and fab r ica t ing an

    app l ica t ion -spec i f ic in teg ra ted ch ip .

    Al though the cos t and r i sks of deve lop -

    ing

    A S l C s

    have dec reased in recen t

    C O M P U T E R

    2

  • 7/25/2019 General Purpose Systolic Arrays 1

    4/12

    Table

    1.

    Systolic array taxonom y.

    Class

    T y p e

    Genera l -pu rpose Spec ia l -pu rpose

    Prog rammab le Reconf igu rab le Hybr id Hardwired

    O r g a n i za t i o n S I M D o r M I M D

    V F I M D V F I M D

    Fixed

    F ixed

    Topo logy

    In te rconnec t ions

    Dimens ions n -d imens iona l (n

    > 2

    i s r a r e d u e

    to

    complex i ty )

    n -d imens iona l

    P rog rammab le

    S ta t ic Dynamic

    S I MD: single-instruction stream, multiple-data stream: MI MD: multiple-instruction stream, multiple-data stream;

    VFTMD:

    very-few-

    instruction stream, multiple-data stream

    Fixed

    yea rs , budge t cons t ra in t s have m o t iva t -

    ed a t rend away f rom un ique ha rdware

    deve lopmen t . Consequen t ly . gene ra l -

    purpose systolic architectures have be-

    come a logical a l ternative. I n add i t ion

    to serving in a wide varie ty of applica-

    t ions . they also prov ide te s t beds fo r

    developing, verifying, and debugging

    new systolic a lgorithm s. Tab le 1shows

    a taxonomy o f gene ra l -pu rpose and spe -

    cial-purpose systolic arrays.

    S ta t ic Dynamic

    Special-purpose architectures. Spe-

    cial-purpose systolic architectures are

    cus tom des igned fo r each app l ica t ion .

    Few p rob lems re si s t a t tack f rom sys to l -

    ic a r rays , bu t som e p rob lems m ay re -

    q u i r e e l e g a n t a l g o r i t h m s . G e n e r a l l y

    speaking, the systolic design requir es a

    per fo rmance a lgo r i thm tha t can be e f fi -

    c ien t ly imp lemen ted wi th today s VLSI

    technology.

    One area that easily uti l izes systolic

    a lgo r i thms i s ma t r ix ope ra t ions . Figure

    2

    i l lu s tra te s the a lgo r i thm fo r th e sum o f

    a

    scalar product, computed in a s ingle

    systolic e lem ent. A fter the cell is in it ia l-

    ized . the

    as

    a n d t h e bs a re synch ro -

    nously shif ted through the processing

    e lemen t . The accumula to r s to re s the

    sum o f the a,b products .

    All

    t h e

    a

    a n d

    b

    da ta synch ronous ly ex i t s the p rocess ing

    e lemen t unmod i f ied to be ava i lab le fo r

    the nex t e lemen t .

    At

    t h e e n d o f p r o ce s s -

    ing,

    the sum o f the p roduc ts i s shi f ted

    o u t

    of

    the acc umula to r . Th is p r inc ip le

    easily extends

    to

    a ma t r ix p roduc t . as

    shown in Figure

    3.

    The on ly d i f fe rence

    be tween s ing le -e lemen t p rocess ing and

    a r ray p rocess ing is tha t th e la t te r de lays

    each add i t ional co lumn and row by one

    cycle so that

    the

    columns and rows l ine

    u p f o r

    a

    matr ix mu l t ip ly . The p roduc t

    ma t r ix i s sh i f ted ou t a f te r comple t ion o f

    processing.

    An

    obv ious p rob lem wi th th i s ap -

    p roach i s tha t m a t r ix p roduc ts invo lv ing

    a

    mat r ix la rge r tha n the sys to l ic a r ray

    mus t be d iv ided in to a se t o f sma l le r

    ma t r ix p roduc ts . Th is re sou rce an d im-

    plem entatio n prob lem affects a l l systol-

    ica r rays . P rope r a lgo r i thm deve lopmen t

    compensa te s fo r the p rob lem, bu t pe r -

    fo rmance dec reases neve r the le ss.

    The m a t r ix p roduc t example a l so dem-

    ons t ra te s a no the r p rob lem wi th spec ial -

    pu rpose sys to l ic a r rays and ha rdware in

    ib -ty pe data exits

    ~ ~ ~

    Figure

    2.

    A

    systolic processing elemen t that compute s the sum

    of

    a scalar

    product.

    a13 a 2 a l l

    a23

    22

    *

    a

    a32

    a3

    *

    b-type data exits array

    Figure 3 The sys-

    tolic product

    of

    two 3

    x

    3 matrices.

    N o v e m b e r I993

    23

  • 7/25/2019 General Purpose Systolic Arrays 1

    5/12

    Host

    Data in

    _ _

    Array

    Data out

    Figure 4. General organization of

    SIMD

    programmable linear systolic arrays.

    gene ra l . The mo re spec ia l ized the ha rd -

    ware , the h ighe r the pe r fo rmance : bu t

    cost per application also r ises and f lex-

    ibil i ty decreases. Therein l ies the a t-

    tractiveness of gene ral-pu rpose systolic

    a rch i tec tu res .

    General-purpose architectures.

    T h e

    two basic types of genera l-purp ose sys-

    t o li c a r r a y s a r e t h e p r o g r a m m a b l e m o d -

    e l and the recon f igu rab le mode l . Re -

    cen t ly . hyb r id mode ls have a l so been

    proposed .

    In

    the p rog ramm ab le mo de l , cel l a r -

    ch i tec tu res and a r ray a rch i tec tu res re -

    ma in the sam e f rom app l ica t ion to ap -

    p l ica tion . How ever , a prog ram con t ro ls

    da ta ope ra t ions in the ce l l s and da ta

    rou t ing th rough the a r ray . Al l comm u-

    n ica t ion pa ths and func t iona l un i t s a re

    f ixed , and the p rog ram de te rmines when

    and wh ich pa ths a re u t i l i zed . O ne o f the

    f i rs t p rog ramma b le a rch i tec tu res was

    Carneg ie M el lon s p rog ramm ab le sys-

    tolic chip .

    In

    the recon f igu rab le m ode l , ce ll a r -

    ch i tec tu res as well as array architec-

    tu re s change f rom one app l ica t ion to

    a n o t h e r . T h e a r c h i t ec t u r e f o r e a c h a p -

    p l ica t ion appea rs

    as

    a

    spec ia l -pu rpose

    a r ray . Th e p r imary means o f imp lemen t -

    ing the reconf igu rab le mode l i s FP GA

    techno logy . Sp la sh was on e o f the ea r ly

    F P G A - b a s e d r e c o n f i g u r a bl e s y st o li c

    arrays.

    Hybr idmode lsmake use

    of

    b o t h

    VLSI

    24

    and FPGA techno logy . They usua l ly

    consi st of

    VLSI

    c i rcu it s embed ded in an

    FPGA-reconf igu rab le in te rconnec t ion

    ne twork .

    S y s t o l i c t o p o l o g i e s .

    A r r a y t o p o l o g ie s

    c a n b e e i t h e r p r o g r a m m a b l e o r r e -

    con f igurab le . L ikewise , a r ray ce l ls a r e

    e i t h e r p r o g r a m m a b l e o r r e c o n f i g u r -

    ab le .

    A

    prog ramm able systolic architecture

    is

    a

    co l lect ion o f in te rconnec ted , gene r -

    al-pu rpose systolic cells , each of w hich

    is e i the r p rog rammab le o r recon f igu -

    rab le . P rog ramm ab le sys to lic ce l ls a re

    flexible processing elements specially

    des igned to mee t the com pu ta t iona l and

    U

    equ i remen ts

    of

    systolic arrays. Pro -

    g ramm ab le sys to l ic a rch i tec tu res can be

    classif ied accordin g to their cell in ter-

    connectio n topologies: f ixed or program -

    m a b l e .

    F ixed ce l l in te rconnec t ions l imi t a

    g iven topo logy to some subse t o f

    all

    poss ib le a lgo r i thms . Tha t topo logy can

    e m u l a t e o t h e r t o p o l o g i e s by m e a n s of

    t h e p r o p e r m a p p i n g t r a n s f o r m a ti o n , b u t

    reduced pe r fo rmance i s o f ten a conse -

    quence .

    P rog rammab le ce l l in te rconnec t ion

    topol ogies typically consist of prog ram -

    mab le ce l ls emb edde d in a swi tch la t ti ce

    tha t a l lows the a r ray to a s s u m e m a n y

    d i f fe ren t topo log ie s. P rog ramm ab le to -

    po log ie s a re e i the r s ta t ic o r dynamic .

    S ta t ic opo log ie s can be a l te red be tween

    app l ica t ions , and dynamic topo log ie s

    can be a l te red wi th in

    an

    application.

    S ta t ic p rog ramm ab le topo log ie s can be

    imp lem en ted wi th much le ss complex i -

    t y t h a n d y n a m i c p r o g r a m m a b l e t o p o l o -

    g ie s. The re has been l i t tl e re sea rch

    in

    d y n a m i c p r o g r a m m a b l e t o p o l o g ie s b e -

    cause

    a

    h igh ly complex in te rco nnec t ion

    ne twork cou ld unde rmine the regu la r

    and simple principles of systolic archi-

    tec tu res .

    Reconf igu rab le sys to l ic a rch i tec tu res

    cap i tal ize on F PG A techno logy , wh ich

    a l lows the u se r to con f igu re

    a

    low-level

    logic c ircuit for each cell . Reconfigu-

    rab le a r rays a l so have e i the r f ixed o r

    reconf igu rab le ce l l in te rconnec t ions .

    Th e use r recon f igu res

    an

    a r ray s top o l -

    ogy by means o f a swi tch la t t i ce . Any

    gene ra l -pu rpose a r ray tha t i s no t con -

    ven t iona l ly p rog rammab le i s u sua l ly

    c o n s i d e re d r e c o n f i g u r ab l e . A l l F P G A

    reconf igu r ing is s ta t ic due to techno lo -

    gy l imitations.

    A r r a y d im e n s i o n s . We can further c las-

    s i fy gene ra l -pu rpose and spec ia l -pu r -

    pose sys to l ic a rch i tec tu res by the i r a r -

    r a y d i m e n s io n s . T h e t w o m o s t c o m m o n

    s t ruc tu res a re the l inea r a r ray and the

    two-d imens iona l a r ray . Lin ea r sys to l ic

    array s are by default s ta t ically reconfig-

    u rab le in one -d imens io na l space . Two-

    d imens iona l a r rays a l low m ore e f f ic ient

    execution of complicated algorithms.

    Due

    t o

    I/O

    imi ta t ions . gene ra l -pu rp ose

    sys to l ic a r rays o f d imens ions g rea te r

    t h a n t w o a r e n o t c o m m o n .

    Programmable array organization.

    Programm ab le sys to l ic a r rays a re p ro -

    g rammab le e i the r a t

    a

    high level or a

    low level . A t e i the r

    level,

    p r o g r a m m a -

    b le a r rays can be ca tego r ized

    as

    e i the r

    S I M D

    o r

    M I M D

    m a c h i n es . T h e y a r e

    typically back-end processors with an

    add i t iona l bu ffe r memory to hand le the

    high systolic U a te s . High - leve l p ro -

    g r a m m a b l e a r r a y s u s u a ll y a r e p r o -

    g r a m m e d

    in

    h igh - leve l l anguages and

    a re word o r ien ted . Low- leve l a r rays a re

    p rog ramm ed in a ssembly language and

    a re b i t o r ien ted .

    SZMD. S I MD

    systolic mac hines (Fig-

    u re

    4)

    o p e r a t e s i m il a rl y t o

    a

    vec to r p ro -

    cesso r. The hos t works ta t ion p re load s a

    con t ro l le r and

    a

    mem ory , wh ich a re ex -

    te rna l to the a r ray , with the in s t ruc t ions

    and d a ta fo r the app l ica t ion . The sys tol -

    ic ce l l s s to re no p rog rams o r in s t ruc -

    t ions. Ea ch cells instruction-process-

    C O M P U T E R

  • 7/25/2019 General Purpose Systolic Arrays 1

    6/12

    ing functional unit serv es as a large de-

    code r . Once the works ta t ion enab le s

    e x e c u t i o n , t h e c o n t r o l l e r s e q u e n c e s

    t h r o u g h t h e e x t e r n a l m e m o r y , t h u s d e -

    l ivering instructions a nd da ta to the sys-

    to l ic a r ray . Wi th in th e a r ray , in s t ruc -

    t ions a re b roadcas t and a l l ce l ls pe r fo rm

    t h e s a m e o p e r a t i o n o n d i f f er e n t d a t a .

    Ad jacen t ce l l s may sha re memory , bu t

    generally

    no

    memory i s sha red by the

    en t i re a r ray . Af te r ex i t ing the a r ray ,

    da ta i s co l lec ted in the ex te rna l bu f fe r

    memo ry . S IMD-prog rammab le sys to l ic

    ce l ls t ake le s s space on the VLSI w afe r

    t h a n t h e i r M I M D c o u n t e r p a rt s b e c a u s e

    they are s imple instruction-processing

    e l e m e n t s r e q u ir i n g n o p r o g r a m m e m o -

    ry . F rom tens t o hundreds o f such ce ll s

    fit on one in teg ra ted ch ip . '

    M I M D . M I M D s y s t o l i c m a c h i n e s

    (F igu re

    5)

    opera te s imila rly to hom oge-

    n e o u s v o n N e u m a n n m u l t i p r o c e s s o r

    mach ines . The works ta t ion downloads

    a p rog ram to each M IM D sys to l ic cel l .

    Ea ch cell may be loa ded with a different

    p rog ram, o r a l l the ce ll s in the a r ray may

    be loaded wi th the same p rog ram. Each

    cell 's architecture

    is

    somewha t s imi lar

    t o t h e c o n v e n t io n a l v on N e u m a n n a r -

    ch i tec tu re : I t con ta in s a con t ro l un i t , an

    A L U , a n d l o c a l m e m o r y . M I M D s y st o l-

    ic ce l l s have more loca l memory than

    t h e i r S I M D c o u n t e r p a r t s t o s u p p o r t

    the von Neumann-s ty le o rgan iza t ion .

    Some may have a sma l l amoun t o f glo-

    ba l memory , bu t gene ral ly no mem ory

    is sha red by a l l the ce l l s . Wheneve r da ta

    is to be sh are d by processors . i t must be

    passed to the nex t ce l l . Thus , da t a ava i l -

    abil i ty become s a very impo rtant issue.

    High - leve l MIM D sys tol ic cel l s a re ve ry

    complicated. and usually only one f i ts

    on

    a s ing le in teg ra ted ch ip . Fo r exam -

    p le , the iWarp i s a Warp ce l l wi thou t

    m e m o r y

    on

    a s ing le ch ip. Loca l mem ory

    fo r each iW arp ce ll mus t be supp l ied by

    addit ional chips.

    Reconfigurable array organization.

    Recen t ga te -densi ty advances in FP GA

    techno logy have p roduced a low- leve l,

    reconfigurable systolic array architec-

    tu re tha t b r idges the gap be tween spe -

    c ia l -pu rpose a r rays and the m ore ve rsa -

    t i l e , p r o g r a m m a b l e . g e n e r a l - p u r p o s e

    a r rays . The FPGA a rch i tec tu re i s un -

    usual because a s ingle hardware plat-

    form can be logically reconfigured as an

    exact duplic ate of a special-purpose sys-

    tolic array. F igu re

    6

    shows the gene ra l

    organization of a reconfigurable array.

    Data in

    _ _ _ _ _

    Array

    i

    ata

    out

    Figure 5. General organization of M IMD program mable linear systolic arrays.

    Th e design er logically draws cell archi-

    tec tu res on the works ta t ion wi th a sche -

    m a t i c e d i t o r ( s u ch a s M e n t o r ) , c o n v e r ts

    t h e d e s ig n t o F P G A c o d e w i th a n o t h e r

    u t il i ty , and th en dow nloads the code in

    a f e w s e c o n d s t o c o n f i g ur e t h e F P G A

    arch i tec tu re .x

    Reconf igu rab le sys to l ic a r rays do no t

    f a ll i n t o t h e S I M D or M I M D ca tego r ie s

    fo r the same reason tha t spec ia l -pu r -

    pose sys to l ic a r rays do no t . R econf igu -

    rable arrays, l ike special-purpose ar-

    rays , a re gene ra l ly l imi ted to VFIMD

    (very-few-instruction streams. multip le-

    da ta s t reams) o rgan izat ion due to FPG A

    ga te dens i ty . In s t ruc t ions a r e imp l ic i t in

    the configuration of each cell ; there-

    fo re , the re i s

    no

    n e e d t o d o w n l o a d t h e m

    from th e workstation. Since special-pur-

    pose systolic architectures consist of a

    ve ry few un ique ce ll s repea ted th rou gh-

    ou t the a r ray , the en t i re a r ray a l so tends

    t o b e V F I M D . P u r e l y r e c o n f i g u r a b l e

    architectures are f ine-grain , low-level

    devices best suited for logical or b it

    manipulations. They typically lack the

    ga te dens ity to supp or t h igh - level func -

    t ions such as multiplication.

    Tab les

    2 , 3,

    a n d

    4

    l is t most of the

    recen t p rog rammab le and reconf igu -

    rab le , gene ra l -pu rpose sys to l ic a r rays

    repo r ted in the l i t e ra tu re . When in fo r -

    ma t ion i s unc lea r in the l i t e ra tu re , the

    co r re spond ing space in the tab le is lef t

    b lank . Th e tab le s ind ica te th ree s tages

    o f p roduc t deve lopmen t :

    1

    Figure 6. General

    organization of

    reconfigurable

    arrays.

    N o v e m b e r 1993

    25

  • 7/25/2019 General Purpose Systolic Arrays 1

    7/12

    Table

    2

    SIMD -organized programmable systolic architectures.

    Sys tem nam e , Deve lopmen t

    deve lope r s tage Topo logy Key fea tu res

    Brow n Sys to l ic Ar ray

    Brown U n ive rs i ty

    P r o t o t y p e

    Micmacs*

    I R I S A , C a m p us d e B e a d i e r .

    F r a n c e P r o t o t y p e

    Geom etr ic Ar i thme t ic Pa ra l le l P rocesso r Comm erc ia l

    N C R

    Saxpy-1M4

    C o m p u t e r C o r p .

    Commerc ia l

    Sys tol ic /Ce l lu la r Arch i tec tu re5 P ro to type

    H u g h e s R e s e a r c h L a b o r a t o r i e s

    Cy l indr ica l Banyan Mul t icompu te rh Resea rch

    Unive rs i ty of Texas a t A us t in

    Programmable Systolic Device

    Aus t ra l ian N a t iona l Un ive rs i ty

    Resea rch

    Linea r

    470 cells

    Linea r

    18

    cells

    48 x 8 a r r a y

    2,304 cells

    Linea r

    32

    cells

    1 6

    x 16

    a r ray

    256 cells

    Very sma l l VLSI foo tp r in t ;

    100s

    of cells

    pe r ch ip ; ISA, SSR a rch i tec tu res ; 8 -b i t

    ALU only; 157 MOPS

    16-b i t f ixed -po int ma th ; b roadcas t d a ta ;

    90 MOPS

    Bit-sl ice cellular architecture; g lobal data;

    900

    M O P S

    32-bit f loating-point capabil i ty ; broadcas t

    and Saxpy g loba l da ta wi th b lock

    process ing

    1,000

    M F L O P S

    32-bit f ixed-function units

    Packe t - swi tched p rog ram mab le topology

    with programmable cells

    In s t ruc t ion decod ing occu rs once pe r ch ip ;

    each ch ip has m any ce l l s

    1.

    R. Hughey and

    D.

    Lopresti. Architecture

    of

    a Programm able Systolic Array,

    Proc. Inrl Con,f.Sysfolic Arruys.

    I E E E CS Press.

    LOS

    2. P. Frison et al., Micmacs: A VLSI Programmable Systolic Architecture. Systolic Array Processors, Prentice-Hall. Englewood Cliffs,

    3. P. Greussay, Programmation des Mega-Processeurs: du GAPP

    a

    la Connection Machinc Course Notes DEA Artificial Intelligence,

    4.

    D.

    Foulser and R. Scheiber, Th e Saxpy Matrix-1: A General-Purp ose Systolic Com puter. Computer . Vol. 20. No. 7, July 1987.pp. 35-43.

    5. K.W. Przytula and

    J.B.

    Nash, lmplementation of Synthetic Apertu re R adar A lgorithms on a SystoliciCellular Architecture, Proc. In[ /

    6. M. Malek and

    E .

    Oppe r. The C ylindrical Banyan Multicomputer: A Reconfigurable Systolic Architecture. Parallel C o m p u t i n g , Vol.

    7 .

    P. Lenders and

    H .

    Schroder, A ProRrammable Systolic Device

    for

    Image Processing Based on M orphology.

    Purullel Conzp uting.

    Vol.

    Alamitos, Calif., Ord er No. 860, 1988, pp. 41-49.

    N.J.:

    1989,pp. 145-155.

    Paris Univ.. VIII, Vincennes, 1985.

    Conf Systolic Arr ays , IE EE CS Press,

    Los

    Alamitos, Calif., Order

    No.

    860, 1988, pp. 21-27.

    10,

    No. 3. May 1989.

    pp.

    319-326.

    13, No. 3

    Mar. 1990.

    pp.

    337-344

    Resea rch : A func tiona l sys tem tha t

    has no t ye t been imp lemen ted .

    * P r o t o t y p e : At least a part ia l system

    has been imp lemen ted .

    Comm erc ia l : The sys tem has been

    implem ented. and an integrated chip

    with multip le cells , a complete sys-

    tem , o r bo th a re ava i lab le fo r pu r -

    chase.

    Architectural issues

    of

    programmable systolic

    arrays

    To be effective. a programm able sys-

    to l ic a rch i tec tu re mus t adhe re to the

    genera l principles of systolic arrays: reg-

    u la r i ty , s imp l ic i ty , concu r rency , and

    rhy thmic communica t ions . In add i t ion .

    the in troduction of apr ogr am ma ble sys-

    tolic cell with s ignif icant local memo ry

    has re su l ted in a new mode o f ope ra -

    t ions: b lock processing. Block process-

    ing combines periods of in tensive sys-

    tolic

    IiO

    and pe r iods

    of

    sequen t ia l von

    Neumann-s ty le p rocess ingto fo rm wha t

    is known as a pseudosystolic mod el.

    R e p l i c at i n g a n d i n t e r c o n n e c t i n g a

    basic program mabl e systolic cell to form

    an a r ray ca r r ie s ou t th e regu lar i ty and

    simplicity principles , provided an ap-

    p rop r ia te a lgo r i thm and p rog ram a re

    u t i l i z e d . H o w e v e r . p r o g r a m m a b l e -

    c e l l - b a s e d a r c h i t e c t u r e s m u s t i n t r o -

    duce spec ia l f ea tu res

    to

    address th e sys -

    tolic properties

    of

    concurrency, rhyth-

    mic communica t ions , and b lock p ro -

    cessing.

    Concurrency. Tw o leve ls of concur -

    rency a re poss ib le in a p rog rammab le

    systolic architecture: concurren cy across

    cells and co ncurre ncy within cells , both

    implicit to hardw ired designs. To m a k e

    concur rency feas ib le in p rog ram mab le

    systolic archite ctures, he designer must

    add special mechanisms that facil i ta te

    p rocess ing con t ro l .

    I n t e rc e l l c onc urre nc y .

    Coord ina t ion

    of concur rency con t ro l has a lways been

    inhe ren t

    in

    a hardwired systolic design

    because i t has been add ressed a t the

    a lgo r i thm deve lopmen t s tage . Th e ad -

    ven t o f the p rog ram mab le systo lic a r ray

    requ i red tha t innova t ive fea tu res be in -

    co rpo ra ted in to des igns to sa t i s fy a p ro -

    g rammab le coo rd ina t ion requ i remen t .

    Methods of prog ram load ing . memory

    init ia l ization, program switching. and

    local memo ry access a t the cell level had

    26

    C O M P U T E R

    I

  • 7/25/2019 General Purpose Systolic Arrays 1

    8/12

    Table 3. MIMD-organized programmable systolic architectures.

    Sys tem name ,

    deve lope r

    D e v e l o p m e n t

    s tage Topo logy Key fea tu res

    Warp

    Carneg ie Mel lon Un ive rs i ty

    A r c h i t e c t u r e P r o t

    Comm erc ia l Linea r 32 -bi t f loa t ing -po in t mu l t ip l ica t ion :

    b lock p rocess ing ;

    IiO

    queu ing .

    100

    M F L O P S

    10

    cells

    i W a r p 3

    Carneg ie M el lon Un ive rs i ty

    C o m p u t e r f o r E x p e r i m e n t a l

    S y n t h e t ic A p e r t u r e R a d a r 4

    Norweg ian Defense Res . Es tab .

    Ce l lu la r Ar ra y P rocesso r5

    Fu j i t su Labora to r ie s , Japan

    Conf igu rab le High ly Pa ra l le l Compu te rh

    Purdue Un ive rs i ty

    Associative Str ing Processor

    B r u n e l U n i v e r s it y , U K

    PICAP3

    University of Paris

    P S D P 9

    Univ .

    of

    Sou th F lo r ida ,

    H o n e y w e l l

    P r o t o t y p e F o u r

    8

    x 16

    Bit-seria l cellu lar

    IiO;

    32-bit f loating-

    a r rays

    512 ce l ls 320 MF LO PS

    point multip liers in each cell ;

    C o m m e r

    Resea rch P rog ramm ab le ce l ls em bedd ed in swi tch

    la t t i ce fo r p rog rammab le topo logy

    R e s e a r c h a t h i

    P ro to type

    8

    x

    8

    a r ray 16 -b it word leng th ; image o r ien ted

    64

    cells

    R e s e a r c h

    4 -b i t word leng th ; wafe r - sca le des ign

    1.

    A.L. Fisher.

    K.

    Sarocky. and H.T. Kung. Experience with the CMU Programmable Systolic Chip,

    Proc. Soc. of Phoro-Oprical

    2. M . Annaratone et a l . . Architecture of Warp, Proc.

    Compcon,

    IEE E CS Press, L o s Alamitos, Calif.. Order

    N o .

    764, Feb. 1987, pp.

    3 . S.

    Borkar et al.. iWarp: An Integrated Solution

    to

    High-speed Parallel Computing.

    Proc. Superconzpuring.

    I E E E CS Press.

    Los

    4. M. Toverud and V. Anderson. CE SAR : A Programmable H igh-Performance Systolic Array Proccssor,

    Proc. In / [

    Con

    Coni pu f e r

    5.

    M .

    Ishii et al., Cellular Array Proccssor C AP and Application,

    Proc.

    Intl

    Conf. Sys to l i c

    Arrays,

    IE EE C S Press. Los Alamitos. Calif.,

    6. L. Snyder. Introduction

    to

    the Configurable Highly Parallel Comp uter. Conzputer. Vol.

    15,

    No. 1. Jan. 1982. pp. 47-56.

    7. R.M. Lea, Th e ASP. a Fault-Tolerant V LSI/ULSI/WSI Associative String Processor for Cost-Effective Processing,

    Proc. Inrl

    Conf:

    8.

    B. Lindscog and P.E. Danielsson. PICA P3: A Parallel Processor

    Tuned

    for 3D Image Operations.

    Proc. E i g h th Intl

    Conf:

    Patrern

    9. D. Landis et al. . A W afer-Scale Programm able Systolic D ata Processor, Proc. N i n th Biennial Univ . /Gov f / lnd.Microrlec /ronics Synp . ,

    1n.strunzenfufion Engineers, Re d -T ime

    Signal

    Processing

    V I I ,

    Spie. San Di ego. C alif., 1984. p. 495.

    264-267.

    Alamitos, Calif., Ord er No. 882, 1988, pp. 330.339.

    Design,

    IEEE CS Press,

    Los

    Alamitos. Calif., Ord er

    No.

    872 (micr ofiche only) , 1988, pp. 414-417.

    Order

    N o.

    860. 1988. pp. 535-544.

    Sysrolic

    Ar r a y s . pp. 515.524.

    Recognilion, IEEE

    CS Press, O rder No. 742 (mic rofiche o nly ), 1986, pp. 1248-1250.

    IE EE , Piscataway,

    N.I . ,

    1991, pp. 252-256.

    to be inco rpo ra ted in to a sys to lic env i -

    r o n m e n t . T h e f o l l o w i n g m e c h a n i s m s

    f a c il i ta t e p r o g r a m m a b l e c o n c u r r e n c y

    th roughou t th e a r ray :

    B r o a d c a s t d a t a

    permits a l l cells of

    an a r ray to be re se t s imu l taneous ly

    wi th in i t ia l cond i t ions and cons tan ts

    a t the s ta r t o f a p rocess ing b lock

    dur ing nonsys to l ic modes o f ope ra -

    t ion . The la rge r the a r ray , the more

    effic iency i t will gain from a broad-

    cas t da ta capab i l i ty .

    B r o a d c a s t i n s t r u c t i o n s p e r m i t o p e r -

    a t ions s imi la r to those o f a vec to r

    compu te r by mak ing each sys to l ic

    c e ll a n a l o g o u s t o t h e v e c t o r c o m -

    pu te r s p rocess ing un i t . However ,

    un l ike mos t vec to r compu te rs ,

    sys-

    to lic cells can support a h igh-band-

    wid th comm unica t ion channe l wi th

    adjacent cells .

    B r o a d c a s t i n s t r u c t i o n a d d r e s s e s al-

    l o w f a s t a n d e f f i c i e n t s w i t c h i n g

    a m o n g p r o g r a m s i n a n M I M D c e ll s

    p r o g r a m m e m o r y . W h e n a l l

    M I M D

    ce l l s have p rog rams s to red in the

    s a m e p l a c e s in t h e i r p r o g r a m

    m e m o r i e s , a j u m p t o t h e b r o a d c a s t

    in s t ruc t ion add ress ca r r ie s ou t s i -

    m u l t a n e o u s p r o g r a m s w i t ch i n g

    th roughou t the a r ray . I f the p ro -

    g rams in a l l the M IM D ce l ls happen

    N o v e m b e r

    1993

    27

  • 7/25/2019 General Purpose Systolic Arrays 1

    9/12

    Table 4. VFIMD-organized programmable systolic architectures.

    S y s t em n a m e ,

    deve lope r

    D e v e l o p m e n t

    s tage Topo logy Key fea tu res

    ~~~ ~ ~~~~~~~

    Splash Systolic Engine '

    S u p e r C o m p u t i n g L i n e a r b a s e d on

    R e s e a r c h C e n t e r P r o t o t y p e

    32

    cells

    Ce l lu la r Ar r ay Log ic '

    Ed inburgh Un ive rs i ty , U K Pro to type 256 cells custom chip

    16

    x

    16

    a r ray

    FPGA -reconf igu rab le ce l l a rch i tec tu re based on

    Hybr id Arch i tec tu re3

    Unive rs i ty

    of

    Illinois

    R e s e a r c h

    Reconf igu rab le ce l l a rch i tec tu re in teg ra t ing FPGA

    capab i l i ty an d 32 -b i t f loa t ing -po in t mu l t ip l ica t ion

    P r o g r a m m a b l e A d a p t i v e

    Computing Engine. '

    Un ive rs i ty o f Wales , Bangor , UK Pro to type P rog ramm ab le p rog ramm ab le topo logy

    C o n f i g u r a b le F u n c t i o n a l A r r a y 5

    Ts inghua Un ive rs i ty , Be i j ing Resea rch Conf igu rab le a r ray topo logy and ce l l a rch i tec tu re

    Func t iona l un i t s embe dded in each ce l l :

    reconfigu rable con necti ons in cell as well as

    1.

    M . Gok hale et al.. Splash: A Reconfigurable Linear Logic Array, Pm c . In i l

    Conf:

    Parr i l l e l Processing,

    IEEE CS

    Press. Los Alamitos,

    2.

    T.

    Kcan and

    J .

    Gray, Configurable Hardware: Two Case Studies of Micrograin Computation.

    in

    Sysro l i c

    Array Processorh,

    Prentice-

    3.

    R. Smith and G. Sobelman, Simulation-Based Design of Programmablc Systolic Arrays.

    Coniprrier-Aided Desig n.

    Vol.

    2.3. N o .

    IO. Dec.

    4.

    S .

    Jones. A. Spray, and A. Ling. A Flexib lc Building Block for

    t h e

    Construction of Processor Arrays. in

    Sys/o/ic

    A r r n v P r o c r s s o r s

    5 . C. Wenyang, L. Yanda. and J . Yuc, Systolic Realization for 2D Convolution Using Conrigurable Functional Method in VLSI Parallel

    Calif., Orde r

    No.

    2101. 1990.

    pp. 1526-1531.

    Hall, E nglcwood C liffs,

    N.J.,

    1989. pp . 310-319.

    1901,

    pp.

    669-675.

    Prcntice-Hall, E nglewood Cliffs. N.J. ,

    19S9,

    pp.

    459-466.

    Array Dcsigns.

    Proc .

    I E E E Compurrrs

    u n d

    Digirtrl

    e~hnology~ol.

    138, No. S Sept.

    1991.

    pp. 361-370.

    to be iden t ica l . the a r ray i s pe r fo rm-

    i n g S I M D o p e r a t i o n s .

    I n s t r u c t i o n s y s t o l i c urriiys prov ide

    a precise low-level method of coor-

    d ina t ing p rocess ing f rom ce l l to ce l l .

    In an IS A. instructions travel through

    t h e a r r a y w i th t h e d a t a . T h e p r o -

    cessed da ta passes wi th th e o r ig ina l

    instruction from ea ch cell to the n ext.

    An ap p rop r ia te ly wide ISA ins t ruc -

    t ion also has buil t- in microcodable

    pa ra l le l i sm. ISA a lgo r i thms can be

    leng thy and m ore d i f f icu l t to deve l -

    o p t h a n a l g o r it h m s f o r o t h e r S I M D

    a n d M I M D a r r a y s. B u t t h e y p r o v i de

    a way to un ique ly spec i fy concu r -

    rency across cells withou t an overly

    complicated circuit .

    D i re c t l oc a l memory a c c es s o r g l o -

    bal

    rnen io ry .

    Sta tus f lags can be

    s to red e i the r in each ce l l ' s loca l

    m e m o r y or in a g loba l memory . In

    the case o f loca l mem ory , the con -

    t ro l le r mus t b e ab le to d i rec t ly ac -

    cess each ce ll 's loca l memory t o de -

    te rmine f lag s ta tu s ; o the rwise , the

    con t ro l le r mus t i s sue a reques t and

    wa i t fo r i t to p ropaga te th rough the

    a r ray . When g loba l memory i s avai l -

    ab le . the con t ro l le r ob ta in s s ta tu s

    in fo rma t ion much m ore eas i ly . Di -

    rec t loca l memory access o r g loba l

    mem ory also facil i ta tes in it ia l ization

    wi th in a ce l l. The Saxpy-1M is one

    of th e f irs t systolic array s with glo-

    b a l m e m o r y .

    I tifrace11c o n c u r r e n c y . P r o g r a m m a b l e

    ce l ls requ i re t he ab i l i ty t o pe r fo rm mul-

    t ip le s imi la r o r d i s simi lar ope ra t ions s i -

    mu l taneous ly . Mechan isms inco rpo ra t -

    ed in to p rog ramm ab le sys to l ic ce l ls to

    suppor t concu r rency a re a l l p roven tech -

    n iques tha t o r ig ina ted in conven t iona l

    von Neum ann p rocesso rs : mic rocodab le

    processing elements . duplication offun c-

    t iona l un i t s , mu l t ip le da ta p a ths , su f fi -

    c ien t lywide in s t ruc t ion words . a nd p ipe -

    l in ing of functional units .

    Rhythmic communications. The p r in -

    ciple of rhythm iccom munic ationsclearly

    sepa ra te s sys to l ic a r rays f rom o the r a r -

    ch i tec tu res . P rog rammab i l i ty c rea te s a

    mo re f lex ib le sys to l ic a rch i tec tu re . bu t

    the pena l t ie s a re complex i ty and poss i -

    ble s lowing of operat ions. Wh en systol-

    ic ce ll s a re p rog ramm ab le , the i s sue o f

    da ta availabil i ty arises. To min imize the

    pe r fo rmance deg rada t ion

    of

    p r o g r a m -

    mable systolic designs, each cell 's pro-

    cessing e lemen t mus t have da ta ava i l -

    able when i t is required. h igh-speed ari th -

    me t ic capab i l i t i e s , and the ab i l i ty to

    t rans fe r o r s to re the p rocessed da ta .

    Thus , each ce l l mus t have su f f ic ien t

    mem ory fo r da ta s to rage a s we l l a s p ro -

    g ram s to rage to fac i l i t a te in te rce l l com-

    mun ica t ions .

    Th e fo l lowing a rch i tec tu ra l f ea tu res

    suppor t rhy thmic comm unica t ions :

    QueiLed //O s t r e a m l i n e s c e l l u l a r

    c o m m u n i ca

    on s

    b y a l o w in g t h e

    s o u r c e c e ll t o s e n d d a t a t o t h e d e s ti -

    nation cell when the data is avail-

    ab le . no t when the des t ina t ion ce ll i s

    ready to accep t i t . Da ta ex i t s the

    queu e on a F IF O ( f i rs t in . f i rs t ou t )

    bas is tha t a l lows p rog ram com pu ta -

    t ion to p roceed i r re spec t ive o f com-

    m u n i c a t i o n s s t a t u s . Q u e u e d IiO

    he lps cons ide rab lywh en a l l ce l ls a re

    compu t ing a s ing le p rog ram tha t i s

    skewed in t ime ac ross the ce l ls . A

    d isadvan tage i s tha t th i s mechan ism

    uses memory space tha t i s expen -

    s ive on

    VLSI

    wafe rs .

    Se r i a l I/O reduces the bandwid th

    requ i remen t on the sys to l ic a r ray ' s

    works ta t ion mach ine , a t the same

    t ime p romot ing rhy thmic ce l lu la r

    c o m m u n i c a t i o n . T h e r e s ul t i ng p r o -

    28

    C O M P U T E R

  • 7/25/2019 General Purpose Systolic Arrays 1

    10/12

    grammable systolic array has bit-

    se r ia l ce llu la r communica t ions and

    word-wide operations in each cell .

    Each cell s tores part ia l words unti l

    i t r ece ives the fu ll word and then

    p e r f o r m s f u n c t i o n a l o p e r a t i o n s .

    Using mo re cells can part ia l ly offset

    t h e d i s a d v a n t a g e of r e d u c e d

    th roughpu t .

    M u l t i p l e d a tu p a t h s

    provide t he cell

    with fast , paralle l in ternal and ex-

    ternal comm unications.O n e o r m o r e

    da ta pa ths a re ded ica ted to compu-

    ta t iona l needs . ano the r to ce l lu la r

    inpu t ope ra t ions , and ye t ano th e r to

    ce l lu lar ou tp u t ope ra t ions .

    Sy s t o l i c share d re g i s t e rs provide sev-

    eral benefits specif ically for

    pro-

    grammab le sys to l ic a rch i tec tu res .

    SSR a rch i tec tu res (F igu re

    7

    pro -

    vide a program-implicit means of

    s t reaml ined da ta f low. Each two a d -

    jacen t p rocess ing e lemen ts sha re a

    sma l l r eg is te r memory . Da ta f lows

    concur ren t ly wi th compu ta t ion a s

    each p rocess ing e lemen t rece ives

    new da ta f rom the reg is te r memory

    ups t ream, ope ra te s on i t , and d i -

    rec t s i t to the nex t reg is te r memory

    downstream . This architecture e lim-

    i n a t e s t h e r e q u i r e m e n t t h a t d a t a

    movement be explic i t ly specif ied,

    and da ta f low th rough the a r ray be -

    comes a re su l t o f p rocess ing . The

    inpu t reg is te r and the ou tpu t reg is -

    te r a re spec i f ied by the in s t ruc tion

    word . The re fo re . b id i rec t iona l da ta -

    f low i s a na tu ra l r e su l t . An SSR

    arch i tec tu re i s advan tageous ove r

    queued

    110

    because communications

    do no t occu r on a F IFO bas i s . A

    disadvantage

    is

    that the register bank

    must be kept small to minimize ac-

    cess t ime a nd thus max imize sys to l -

    ic bandwid th . The sma l l r eg is te r

    m a k e s p r o g r a m m i n g d i f f i c u l t f o r

    some o f the mo re compl ica ted a lgo -

    ri thms.

    B r o u d c a s t

    ita

    eases systolic

    IiO

    require ment s in certa in instances by

    a l lowing all func t iona l un i t s and

    local memo ry of a cell to be in i t i al -

    ized o r se t to a common va r iab le a t

    once .

    Global datu eases systolic IiO re -

    qu i remen ts and the ce l l ' s s to rage

    requ i remen ts by keep ing on ly one

    copy o f the sam e da ta an d a l lowing

    a l l ce l l s to sha re i t . Wi th p rope r

    g loba l memory cohe rency . any mod-

    if ications to g loba l da ta a re in s tan t -

    ly available to ot her cells .

    I I

    I

    Figure

    7 An

    SIMD systolic shared-register architecture.

    B i d i re c t i ona l und wrapurou t zd data-

    ,flow

    must be explic i tly specif ied in

    the p rog ram, whereas in ha rdwired

    des igns da ta f low i s imp l ic i t . Bi -

    d i rec t iona l and wrapa round da ta -

    f l ow can reduce the

    I/O

    bandwid th

    requ i remen ts o f the ex te rna l bu f fe r

    memory by rec i rcu la ting da ta wi th -

    in the a r ray . More f lex ib le da taf low

    also allows arrays to execute a wide

    va r ie ty o f a lgo r i th ms more e f f i -

    c iently .

    Block processing

    If each program-

    mable systolic cell has significant local

    mem ory. b lock processing is possible in

    the sys to lic env i ronmen t . Dur ing b lock

    processing, very little systolic I/O oc-

    curs , and ea ch cell executes a series of

    in s t ruct ions on loca l da ta . O nce t he ce l l s

    gene ra te an in te rmed ia te re su l t , p ro -

    cessing pauses w hile systolic IiO t akes

    p lace , and new da ta i s wr i tt en to each

    cell 's local memory.

    Block processing on prog rammab le

    systolic architectures can result in a more

    effic ient . pseudosystolic operation in

    some a pplications for two reasons. First ,

    the merge r o f von Neumann p rog ram -

    mable cell architectures in to a systolic

    a r ray env i ronmen t causes many o f the

    sam e p rob lems found in homogeneous

    mul t ip rocesso r sys tems . The se r ia l na -

    tu re o f \ ' on Neumann mach ines in te r -

    feres with the rhythmic systolic I/O of

    t h e a r r a y , c a u si n g a n u n a c c e pt a b l e

    amoun t o f t ime was ted in wa i t ing fo r

    d a t a to become available . Block pro-

    cessing minimizes th is wasted t ime in

    app l ica t ions tha t can be d iv ided in to

    equa l segmen ts . App l ica t ions a re d iv id -

    ed into paralle l tasks that u ti l ize local

    da ta .

    Second, for any systolic array, the

    bandwidth and size of the external mem-

    ory a re a lways the l imi t ing fac to rs on

    t h r o u g h p u t a n d p e r f o r m a n c e . B l o c k

    processing reduces the array 's systolic

    11 r e q u i r e m e n t s b y r e d u c i n g t h e

    amou n t o f ce l lu lar IiO.

    I m p o r t a n t w o r k h a s b e en d o n e o n

    characteriz ing block processing in a gen-

    eral-pu rpose systolic array. ' Block pro-

    c e s s i n g c a n b e p e r f o r m e d o n e i t h e r

    SIMD or MI MD prog ram mab le sys to l -

    ic mach ines. Re qu i rem en ts fo r e f f ic ient

    b lock p rocess ing inc lude an app rop r i -

    a te a lgorithm a nd signif icant local mem -

    ory in each ce l l . Othe r types o f a r ray

    communica t ion such a s b roadcas t da ta

    and g loba l da ta a re a l so des i rab le .

    Architectural issues

    of

    reconfigurable systolic

    arrays

    The p r imary a rch i tec tu ra l i s sues in

    d e s i g n i n g a r e c o n f i g u r a b l e s y s t o l i c

    a r r a y a r e t h e h a r d w a r e p l a t f o r m o f

    FPG As and th e c lass of a lgo r i thms to be

    t a r g et e d t o t h a t p l a t fo r m . T h e n u m b e r

    of FPG As, the i r topo logy , and the i r ga te

    density will determ ine th e set of systolic

    a rch i tec tu res tha t can be syn thes ized

    a n d t h e s e t of systolic a lgorithms that

    c a n b e i m p l e m e n t ed o n t h a t h a r d w a r e

    platform. R econfigurable systolic archi-

    tec tu res a r e ve ry in te re s t ing because

    the cell 's architecture is actually pro-

    g rammed . C onsequen t ly , the a rch i tec -

    tu re p rog rammer has comple te con t ro l

    ove r wha t a rch i tec tu ra l f ea tu res a re in -

    c o r p o r a t e d i n e a ch F P G A . D e t e rm i n -

    ing the ce ll a rch i tec tu re can be an i t e r -

    ative process that continuously refines

    the architecture unti l i t sa t isf ies appli-

    ca t ion requ i remen ts .

    FP GA a rch i tec tu re p rog ramming d i f-

    fe r s f rom conven t iona l p rog ramming in

    tha t one p rog rams t he c i rcu i t ' s log ica l

    func t ion , in s tead o f p rog ramming a

    mode l of operations in a high-level lan-

    guage . Reconf igu r ing an FPGA is the

    sam e as logically drawing a new circu it .

    T h e r e f or e , a n F P G A p l a t f o rm c a n a s -

    s u m e a n y a r c h it e c t ur e t h a t F P G A g a t e

    dens i ty and package p inou t pe rmi t .

    C o m m u n i c a t i o n a n d c o n c u r r e n c y

    N o v c m b e r 1993 29

  • 7/25/2019 General Purpose Systolic Arrays 1

    11/12

    Broadcast

    instructions

    I

    Figure 8. A hybrid S l M D programmable systolic architecture.

    must be eva lua ted on a case -by -case

    bas is a t the t ime o f con f igu ra t ion . Cur -

    ren t ga te -dens i ty techno logy does no t

    m a k e M I M D a r r a y s f e a si b l e d u e t o c e ll

    m e m o r y a n d c o n t r o l r e q u i re m e n t s , b u t

    FP GA p la t fo rms a re we l l su ited fo r log-

    ical or b it- level applications. Splash, for

    example , i s

    a

    l inear reconfigurable ar-

    ray app rop r ia te fo r b i t - leve l app l ica -

    t ions.

    An

    FP GA ch ip can be con f igu red fo r

    on e med ium-s ize systo lic e lemen t o r fo r

    several s imple systolic cells .

    In

    e i the r

    case, current technology l imits the c ir-

    cu i t s ize to an o rde r o f m agn i tude ap -

    p roach ing 25 ,000 ga te s . Packag ing a nd

    110

    p ins a l so in f luence the des ign . Cur -

    rent technology l imits

    I/O

    t o a p p r o x i-

    ma te ly

    200

    pins.

    Reconf igu rab le a rch i tec tu res a re a l so

    h ighly qua l i f ied fo r fau l t - to le ran t com-

    puting. Recon figuration can simply elim-

    ina te a bad ce l l f rom the a r ray . Ce l l

    a rch i tec tu res can be con figu red a round

    VLSI defects .

    Architectural issues of

    hybrid arrays

    High- leve l p rog ramm ab le a r rays re -

    quire extensive efforts to ma p algorithms

    to h igh -level l anguages a f te r a lgo r i thm

    deve lopmen t . When mapp ing i s com-

    p le ted , the re su l t ing sys tem o f ten i s no t

    a s e f f ic ien t a s a sys tem imp lemen ted in

    a f ixed - func t ion ASIC.

    On

    t h e o t h e r

    hand . low FPG A ga te dens i ty makes i t

    unlikely t hat large-g rain tasks will ever

    b e c o m p l e te l y i m p l e m e n t e d i n F P G A s

    or tha t recon f igu rab le a r rays wi l l r e -

    place conventional

    S l M D

    a n d M I M D

    arch i tec tu res in the nea r fu tu re . Conse -

    quen t ly , a na tu ra l s tep i s

    to m e r g e t h e

    two app roaches , keep ing the i r des i r -

    ab le a spec ts and d isca rd ing the i r unde -

    sirable aspects .

    T h e h y b ri d S l M D a r c h i te c t u r e ( F i g-

    u r e

    8

    is best u ti l ized for in tensive f lo at-

    ing-point computationa l applications but

    d o e s n o t d e g r a d e i n p e r f o r m a n c e a s

    much

    as

    h igh - leve l p rog rammab le a r -

    rays when s ign i f ican t log ica l o r con t ro l

    o p e r a t i o n s a r e i n c l u d e d . T h e h y b r i d

    des ign co mbines a commerc ia l f loa t ing -

    p o i n t m u l t i pl i e r c hi p a n d a n F P G A c o n -

    t ro l le r to fo rm a sys to l ic ce l l . Th e com-

    m e r c i a l m u l t i p l i e r i s u s e d f o r i t s

    economy. speed , an d package dens i ty :

    the FPGA c lose ly b inds the ce l l to a

    specif ic application. Another project is

    a t tempt ing to in teg ra te a hyb r id gene r -

    al-pur pose systolic array cell

    on

    a single

    chip.':

    A recen t ly deve loped s imu la to r a l -

    lows simulation and test ing of the cell

    and the a r ray des ign fo r a hyb r id a rch i -

    tec tu re . Un l ike mos t commerc ia l ly

    available computer-aided design uti l i-

    t i e s , wh ich ve r i fy des igns a t the ch ip

    leve l , it ve r i f ie s the pe r fo rma nce o f

    al-

    gor i thms and a rch i tec tu res h rough s im-

    u la t ion o f the comple te a r ray . T he s im-

    u la to r ma in ta in s the i t e ra t ive na tu re o f

    purely reconfigurable array design by

    facil i ta t ing ta i loring of hybrid designs

    for specif ic applications.

    Existing architectures

    A review of projects in it ia ted during

    the la s t decade shows tha t the t ren d was

    to deve lop la rge sys to l ic a r ray p roces -

    so rs tha t r equ i re e labo ra te . cus tomized

    hos t suppor t . Warp . the Compu te r fo r

    Exper imen ta l Syn the t ic Aper tu re Ra-

    d a r , t h e C e l l ul a r A r r a y P r o c e s so r , a n d

    Saxpy-1M a l l r equ i re expens ive and

    compl ica ted

    I/O

    suppor t o r the in -

    tensive instruction I/O as we l l a s the

    d a t a

    1/0.

    T h e s e m a c h i n e s a r e h i g h -

    p e r f o r m a n c e s u p e r c o m p u t e r s

    (200

    M F L O P S , 3 2 0 M F L O P S , a n d 1 , 00 0

    MFLOPS. re spec t ive ly ) wi th cen t ra l -

    ized concu r rency , cons t ruc ted to f i l l an

    ex is t ing pe r fo rmance gap .

    Th e introduction of workstations in t o

    the workp lace has changed the way a

    s ign i f ican t po r t ion o f the com pu t ing

    communi ty v iews compu te rs . Use rs de -

    m a n d i m p r o v e d d e s k t o p c o m p u t i n g

    pe r fo rmance a s app l ica t ions con t inue

    t o i n c r e a s e i n s i z e a n d c o m p l e x i t y .

    Works tation architectur e can always be

    improved . e spec ia l ly fo r emerg ing ap -

    p l ica t ions such a s tex t an d speech p ro -

    cess ing and gene ma tch ing . The more

    recen t gene ra l -pu rpose sys to l ic a r ray

    p r o j e c t s ( B r o w n , M i c m a c s , S p l a s h )

    show tha t back -end sys to l ic p rocesso rs

    are effective in boosting a workstation's

    p e r f o r m a n c e f o r t h e s e a p p l i c a t i o n s .

    These a r rays a re sma l l enough tha t the

    host 's open a rch i tec tu re wi th l imi ted

    U

    bandwid th does no t seve re ly im-

    pac t the a r ray ' s pe r fo rmance fo r

    low-

    to modera te - leve l g ranu la r i ty . Sma l l

    back-end systolic processors are a lso

    e c o n o m i c a l l y s o u n d . F o r e x a m p l e .

    Sp la sh cons is ts of a two-boa rd add -on

    se t fo r a Sun works ta t ion . One boa rd

    suppor t s the l inea r a r ray . and the o th -

    e r s u p p o r t s a bu f f er m e m o r y . T h e set

    costs fr om $13,000 to $35,000.'

    Almos t w i thou t excep t ion , cu r ren t

    r e s e a r c h e m p h a s i z e s r e c o n f i g u r a b l e

    cells , reconfigura ble arrays , and hybrid s

    o f func t iona l un i t s embed ded in recon -

    f igu rab le FP GA a r rays . Reconf igu rab le

    des igns have p roven to be unmatched

    for low- to moderate -granulari ty require-

    m e n t s b u t a r e n o t y e t m a t u r e e n o u g h

    fo r h igh -g ranu la r i ty app l ica t ions . I n

    add i t ion , a s we have sa id . r econ f igu -

    rable topologies and cells are highly

    fau l t - to le ran t . The i r fau l t to le rance i s a

    configuration issue. not a design and

    fabrication issue.

    Un t i l FP GA ch ip dens i ty p rog resses

    t o t h e p o i n t w h e r e a v e r y l a r g e F P G A

    can achieve high-level granulari ty , hy-

    b r id a rch i tec tu res p re sen t pe rhaps the

    mos t p rac t ica l means to a recon f igu -

    rable high-level systolic array. Hybrid s

    make use o f the mos t a t t rac t ive fea tu res

    o f p rog rammab le and reconf igu rab le

    30

    C O M P U T E R

  • 7/25/2019 General Purpose Systolic Arrays 1

    12/12

    me tho ds while addin g greater f lexibil i-

    ty than e i the r me thod .

    N o d i s c u s s i o n o f g e n e r a l - p u r p o s e

    systolic arrays is complete without ad-

    dressing the issues

    of

    p r o g ra m m i n g a n d

    configuration. High-level- lan guagepro -

    g ramming i s des i rab le fo r p romot ing

    widesp read use of prog rammab le sys -

    tolic back-end processors . Of the projects

    we su rveyed . on ly the la rge r sys tems

    had ma tu re p rog ramming env i ronmen ts.

    Cur ren t ly , the sma l le r p rog rammab le

    a r rays have imp lemen ted on ly a ssem-

    b ly p rog ramming . As men t ioned ea r -

    l i er , F P G A - c o n f i g u r a b l e a r r a y s a r e

    configured with th e use of

    a

    schemat ic

    ed i to r . One d raw back o f th i s app roach

    is tha t i t typ ica lly takes m ore e f fo r t than

    p rog ramming a p rog rammab le ce ll .

    T

    e systolic array is a formida ble

    approach to exp lo i t ing concu r -

    renc ie s in a compu ta t iona l ly

    rhy thmic and in tens ive env i ronmen t .

    Gene ral-purp ose systolic arrays provide

    an economical way to enhance compu-

    ta t iona l pe r fo rmance by emphas iz ing

    concur rency and pa ra l le l i sm. Arrays

    rang ing f rom low- leve l to h igh - leve l

    g ranu la ri ty have been app l ied to p rob -

    lems f rom b i t -o r ien ted p ixe l mapp ing

    to 32-bit f loating-point-based scientif ic

    computing. Systolic arrays hold great

    p romise to be

    a

    pervas ive fo rm

    of

    con-

    cu r rency p rocess ing . As a so lu t ion to

    t h e i n t e n s i v e c o m p u t a t i o n a l p e r f o r -

    mance requ i remen ts of tomorrows ap -

    plications, general-purpose systolic ar-

    rays canno t be ove r looked .

    References

    1. W.K. Fuchs and

    E.E.

    Swartzlander Jr.,

    Wafer-Scale Integration: Architectures

    and Algorithms.

    Com put e r .

    Vol. 25.

    N o .

    4, Apr. 1092,pp. 6-8.

    2. P. Quinton and Y . Robert, S j s t o l i c Algo-

    rifhni.c Architectures,

    Prentice-Hall.

    Englewood Cliffs, N.J . , 1991.

    3. A. Krikelis and R.M. Lea. Architectur-

    al Constructs for Cost-Effective Parallel

    Computers. in

    Systolic Array Proces-

    sors. Prentice-Hall, Englewood Cliffs.

    N.J . .

    1989, pp. 287-300.

    4. H.T. Kung. Why Systolic Architec-

    tures? Cotnpurer , Vol. 15. No.

    1.

    Jan.

    1982, pp. 37-46.

    N o v e m b e r 1993

    5.

    6.

    7.

    8.

    9.

    10

    11

    12

    J.A.B. Fortes and B.W. Wah , Systolic

    Arrays: From Concept to Implementa-

    tion. Computer (special issue on sys-

    tolic arrays). Vol.

    20, No.

    7, July 1987, pp.

    12-17.

    C.K. KO and 0 Wing. Mapping Strate -

    gy for Automated Design of Systolic

    Arrays. Proc. Inrl Conf. Systol ic Ar-

    rays.

    IEE E CS Press,

    Los

    Alamitos. Ca-

    l i f . .

    Order

    No.

    860, 1988. pp. 285-294.

    R. Hughey and D. Loprcsti. B-SYS: A

    470-Processor Programmable Systolic

    Array. Proc Intl Conf. Parallel Pro-

    cersing.

    IEEE CS Press. Los Alamitos,

    Calif., Ord er N o 2355-22,1991 . p p . 1580-

    1583.

    M. Gokh ale et al., Building and Using a

    Highly Parallel Programmable Logic

    Array. Computer . Vol. 24. Jan. 1991.pp.

    81

    -89.

    H.

    Schroder and P. Strardins. Program

    Compression on the ISA, Parallel Coni-

    piiting, Vol. 17,

    No.

    2-3, June 1991, pp.

    207-2

    15.

    B . Friedlander. Block Processing on a

    Programmable Systolic Array ,Proc.Intl

    Conf Parallel Proces sing. IE EE CS Press.

    Los Alamitos, Calif.. Ord er No. 889.1987.

    pp. 184-187.

    R . Smith and G . Sobelman, Simulation-

    Based Design of Programmable Systolic

    Arrays.

    Computer-Aided Design.

    Vol.

    2-3 . N o . 10, Dec. 7991. pp. 669-675.

    S. Jones, A. Spray, and A. L ing, A Flex-

    ible Building Block for the C onstruction

    of Processor Arrays, in

    Systolic Array

    ProcrsJors,

    Prentice-Hall. Englewood

    Cliffs, N.J., 1989. pp. 459-466.

    Kurtis

    T.

    Johnson

    is an advanccd enginee r at

    HR B Systems, an E-Systems subsidiary. His

    intcrests include parallel computer architec-

    tures, parallel algorithms, and digital signal

    proces sing. He received his BS in 1986 and

    his

    M S

    in 1992, both in electrical en gineer -

    ing, from Pennsylvania State University.

    A R

    Hurson

    is an associate professor of

    computer engineering at Pennsylvania State

    University. His research is directed toward

    the design and analysis of general-purpose

    and special-purpose computer architectures.

    He has published m ore than

    110

    papers on

    topics including computer architecture, par-

    allel processing. database systems and ma-

    chines, dataflow architectures. and V LSI al-

    gorithms. He coauthored the IE EE tutorial

    hooks,

    Parallel Architectures

    for

    Database

    Systenis and Multidatabase Systems:

    An Ad-

    vanced Solut ion or Globallnfornzatioti Shar-

    ing Process. He cofounded the IEEE Sym-

    posium on Paralle l and Distr ibuted

    Processing. Hurson is a member of the IE EE

    Com puter Society Press Editorial Board and

    a member of the IE EE Distinguished Visi-

    tors Program.

    Behrooz Shirazi

    is an associate professor of

    computer science engineering at the U niver-

    sity of Tex as at Arli ngton. Previously, he was

    an assistant professor at Southern Meth odist

    University. His research interests include

    task scheduling, heterogeneous computing,

    dataflow computation, and parallel and dis-

    tributed processing. He has published wide-

    ly

    on these topics. He is a cofounder of the

    IEE E Symposium on Parallel and Distribut-

    ed Processing. Hc has served as chair

    of

    the

    Com puter Society section in the Dalla s chap-

    ter

    oC

    the IEEE and as chair

    of

    the IEEE

    Region V A rea Activities B oard. Shirazi re-

    ceived his MS and PhD degrees in computer

    science from the University of Oklahoma. n

    1980 and 1Y85. respectively.

    Send correspondence about this article to

    A.R. Hurson. Dept. of Computer Science

    and Engineering. Pennsylvania State Uni-

    versity, University Park, PA 16802: o r

    [email protected].

    Laxmi Bhuyan,

    Conzputers

    system archi-

    tecture area ed itor, coordinated and recom-

    mended this article for publication. His c-

    mail addre ss is [email protected].

    31

    mailto:[email protected]:[email protected]:[email protected]:[email protected]