PACT2013Slides.pdf


  • 8/13/2019 PACT2013Slides.pdf

    1/33

    Exposing ILP in Custom Hardware with a Dataflow Compiler IR

    Ali Mustafa Zaidi

    Supervisor: Dr. David Greaves

    University of Cambridge Computer Laboratory

  • 8/13/2019 PACT2013Slides.pdf

    2/33

    The Dark Silicon Problem

    2.1GHz @90nm (80W)

    18%

    5.2GHz @45nm (80W)

    7%

    7.3GHz @32nm (80W)

    3%

    Amdahl's Law + Utilization Wall = Dark Silicon

    45nm to 8nm (32x resources)

    CPU: 3.5x, GPU 2.4x (Cnsr.)

    CPU: 7.9x, GPU 2.7x (ITRS)

    Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", IEEE Micro 2012.

  • 8/13/2019 PACT2013Slides.pdf

    3/33

    The Dark Silicon Problem

    Can we achieve Superscalar Performance, w/o Superscalar

  • 8/13/2019 PACT2013Slides.pdf

    4/33

    Solution: Spatial Architectures?

    Custom Hardware, FPGAs, CGRAs, MPPAs, etc.

    Advantages

    Scalable, Decentralized architectures, with short, p2p wiring.

    High Computational Density

    10-1000x Energy and Performance efficiency.

    Issues

    Poor Programmability: often requiring low-level hardware knowledge

    Limited Amenability: poor performance on sequential, irregular, or complex control-flow code.

    Examples

    Conservation Cores: Performance ≈ in-order MIPS24KE core

    Phoenix/CASH Hardware: Performance 30% less than 4-way

  • 8/13/2019 PACT2013Slides.pdf

    5/33

    Solution: Spatial Architectures?

    Key Reasons for High Performance of Complex, out-of-order execution scheduling

    Custom hardware has very limited speculation

    Single flow of control

    If-conversion & hyperblock formation for forward branches.

    No acceleration of backwards branches!

    [Figure: Control-Data Flow Graph (CDFG) of an example loop. Nodes: Start, i = 0, load A[i], compare > 0 (T/F), foo(), bar(), i++, compare < 100 (T/F), End]

    McFarlin et al., "Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?", ASPLOS '13

  • 8/13/2019 PACT2013Slides.pdf

    6/33

    Solution: Spatial Architectures

    Our Solution

    Instead of

    CDFG + Compile-time Execution Scheduling

    We Employ

    VSFG + Dataflow Execution Model

    [Figure: Control-Data Flow Graph (CDFG) of the example loop. Nodes: Start, i = 0, load A[i], compare > 0 (T/F), foo(), bar(), i++, compare < 100 (T/F), End]

  • 8/13/2019 PACT2013Slides.pdf

    7/33

  • 8/13/2019 PACT2013Slides.pdf

    8/33

    Flow with the VSFG

    Value State Flow Graph

    Infinite DAG

    Loops represented as Tail Recursion

    Branches represented via if-conversion

    Enables Aggressive Speculation!

    No single 'Flow of Control'

    Instead, control implemented via 'Boolean Predicate Expressions'.

    Logic minimization can simplify expressions, facilitating Control Dependence Analysis!

    [Figure: VSFG of the example loop. Inputs: i = 0, A, STATE_IN, inPred. Nodes: load A[i], compare > 0, foo(), bar(), i++, compare < 100, and a nested subgraph labelled "Next iteration of 'for' loop". Output: STATE_OUT]
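    The two ideas on this slide, loops as tail recursion and branches as Boolean predicates, can be sketched in ordinary software. The Python below is only an illustration under assumptions: the loop shape (call foo() when A[i] > 0, otherwise bar()) is read off the figure labels, and `foo`, `bar`, `loop_cdfg` and `loop_body` are hypothetical names, with a list standing in for the STATE_IN/STATE_OUT state edge.

```python
from typing import List

State = List[str]  # hypothetical stand-in for the state edge (STATE_IN / STATE_OUT)


def foo(state: State, i: int) -> State:
    # Side effect modelled as a new state value, so ordering stays explicit.
    return state + [f"foo({i})"]


def bar(state: State, i: int) -> State:
    return state + [f"bar({i})"]


def loop_cdfg(A: List[int], n: int) -> State:
    """Conventional form: one flow of control with a backward branch."""
    state: State = []
    i = 0
    while i < n:
        if A[i] > 0:
            state = foo(state, i)
        else:
            state = bar(state, i)
        i += 1
    return state


def loop_body(A: List[int], n: int, i: int, state_in: State, in_pred: bool) -> State:
    """One iteration as an acyclic subgraph; the next iteration is a tail call."""
    if not in_pred:
        return state_in  # predicate false: the subgraph does nothing
    cond = A[i] > 0
    p_foo = in_pred and cond        # Boolean predicate guarding foo()
    p_bar = in_pred and not cond    # Boolean predicate guarding bar()
    state = foo(state_in, i) if p_foo else state_in
    state = bar(state, i) if p_bar else state
    next_pred = in_pred and (i + 1 < n)  # replaces the backward branch
    return loop_body(A, n, i + 1, state, next_pred)


A = [(-1) ** k * (k + 1) for k in range(100)]
print(loop_body(A, 100, 0, [], 0 < 100) == loop_cdfg(A, 100))
```

    Both forms produce the same sequence of calls. The recursive form has no loop-carried control edge, only values (i, the state, the predicate) passed from one iteration's subgraph to the next.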

  • 8/13/2019 PACT2013Slides.pdf

    9/33

    Flow with the VSFG

    Value State Flow Graph

    Hierarchical Dataflow Graph

    Subgraphs may be 'predicated', or executed speculatively (via 'if-conversion').

    'Flattening' loop tail-call subgraphs yields loop unrolling/pipelining.

    Multiple loops in a loop-nest may be unrolled independently to expose ILP

    [Figure: VSFG of the example loop. Inputs: i = 0, A, STATE_IN, inPred. Nodes: load A[i], compare > 0, foo(), bar(), i++, compare < 100, and a nested subgraph labelled "Next iteration of 'for' loop". Output: STATE_OUT]
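    As a rough software analogy for 'flattening' a loop's tail-call subgraph, the sketch below inlines one copy of the next iteration into the current one, which is what unrolling once amounts to. It is an assumption-laden illustration, not the toolchain's algorithm: `step`, `loop_vsfg` and `loop_vsfg_unrolled` are hypothetical names, and the loop body (summing the positive elements of A) is chosen only to keep the example small.

```python
from typing import List, Tuple


def step(A: List[int], i: int, acc: int, pred: bool) -> Tuple[int, int, bool]:
    """One predicated iteration; returns (next i, next acc, next predicate)."""
    if not pred:
        return i, acc, False  # predicate false: pass values through unchanged
    acc = acc + (A[i] if A[i] > 0 else 0)
    return i + 1, acc, (i + 1 < len(A))


def loop_vsfg(A: List[int], i: int = 0, acc: int = 0, calls: int = 0) -> Tuple[int, int]:
    """Loop as tail recursion: one iteration per subgraph invocation."""
    if not i < len(A):
        return acc, calls
    i, acc, _ = step(A, i, acc, True)
    return loop_vsfg(A, i, acc, calls + 1)


def loop_vsfg_unrolled(A: List[int], i: int = 0, acc: int = 0, calls: int = 0) -> Tuple[int, int]:
    """Same loop with the tail call flattened once: two iterations per invocation."""
    if not i < len(A):
        return acc, calls
    i, acc, pred = step(A, i, acc, True)   # iteration k
    i, acc, pred = step(A, i, acc, pred)   # flattened copy: iteration k+1, guarded by pred
    return loop_vsfg_unrolled(A, i, acc, calls + 1)


print(loop_vsfg([3, -1, 4, -1, 5]))           # (12, 5)
print(loop_vsfg_unrolled([3, -1, 4, -1, 5]))  # (12, 3)
```

    The result is unchanged while the number of subgraph invocations drops from 5 to 3. The second copy of the body is guarded by the predicate from the first, so an odd trip count is handled without a separate remainder loop.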

  • 8/13/2019 PACT2013Slides.pdf

    10/33

    Flow with the VSFG

  • 8/13/2019 PACT2013Slides.pdf

    11/33

    Flow with the VSFG

  • 8/13/2019 PACT2013Slides.pdf

    12/33

    High Level Synthesis Case Study

    Any High Level Language -> LLVM -> VSFG -> Low-Level IR -> Bluespec SystemVerilog -> ASIC / FPGA

    %1 = mul i32 %x, %y
    %2 = srem i32 %1, %z
    %3 = icmp slt i32 %2, %1

    FIFOF#(int) x <- mkFIFOF1;
    FIFOF#(int) y <- mkFIFOF1;
    FIFOF#(int) z <- mkFIFOF1;

    FIFOF#(int) srem_1 <- mkFIFOF1;
    FIFOF#(int) icmp_1 <- mkFIFOF1;
    FIFOF#(int) icmp_2 <- mkFIFOF1;
    FIFOF#(int) out_3 <- mkFIFOF1;

    rule mul_inst;
       let val1 = x.first; x.deq;
       let val2 = y.first; y.deq;
       let rslt = val1 * val2;
       srem_1.enq(rslt);
       icmp_1.enq(rslt);
    endrule

    rule srem_inst;
       let val1 = srem_1.first; srem_1.deq;
       let val2 = z.first; z.deq;
       let rslt = val1 % val2;
       icmp_2.enq(rslt);
    endrule
    ...
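    To make the firing behaviour of these rules concrete, here is a small Python simulation of the same three-instruction dataflow graph. It is a sketch under assumptions: `Fifo1`, `srem` and `run` are hypothetical helpers, the `icmp_inst` rule is not shown on the slide and is inferred from the `%3 = icmp slt i32 %2, %1` line, and 32-bit wrap-around is ignored.

```python
class Fifo1:
    """One-element FIFO, loosely modelled on a one-place hardware FIFO."""

    def __init__(self):
        self.slot = None

    def not_empty(self):
        return self.slot is not None

    def not_full(self):
        return self.slot is None

    def first(self):
        return self.slot

    def deq(self):
        self.slot = None

    def enq(self, value):
        assert self.slot is None, "enq on a full FIFO"
        self.slot = value


def srem(a, b):
    """Signed remainder taking the sign of the dividend, as LLVM's srem does."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def run(xv, yv, zv):
    names = ("x", "y", "z", "srem_1", "icmp_1", "icmp_2", "out_3")
    f = {n: Fifo1() for n in names}
    f["x"].enq(xv)
    f["y"].enq(yv)
    f["z"].enq(zv)

    def mul_inst():
        ready = (f["x"].not_empty() and f["y"].not_empty()
                 and f["srem_1"].not_full() and f["icmp_1"].not_full())
        if ready:
            rslt = f["x"].first() * f["y"].first()
            f["x"].deq()
            f["y"].deq()
            f["srem_1"].enq(rslt)  # fan-out: the product feeds two consumers
            f["icmp_1"].enq(rslt)
        return ready

    def srem_inst():
        ready = (f["srem_1"].not_empty() and f["z"].not_empty()
                 and f["icmp_2"].not_full())
        if ready:
            rslt = srem(f["srem_1"].first(), f["z"].first())
            f["srem_1"].deq()
            f["z"].deq()
            f["icmp_2"].enq(rslt)
        return ready

    def icmp_inst():  # assumed rule, inferred from the icmp line of the IR
        ready = (f["icmp_2"].not_empty() and f["icmp_1"].not_empty()
                 and f["out_3"].not_full())
        if ready:
            rslt = f["icmp_2"].first() < f["icmp_1"].first()
            f["icmp_2"].deq()
            f["icmp_1"].deq()
            f["out_3"].enq(rslt)
        return ready

    # Rules are listed in reverse program order on purpose: a rule fires only
    # when its operands are present, so the listing order does not matter.
    rules = [icmp_inst, srem_inst, mul_inst]
    fired = True
    while fired:
        fired = False
        for rule in rules:
            if rule():
                fired = True
    return f["out_3"].first()


print(run(7, 5, 4))  # True: 35 srem 4 = 3, and 3 < 35
```

    Each rule's guard is just the availability of its input tokens and of space in its output FIFOs, which is the static dataflow firing condition that the Bluespec rules express implicitly.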

  • 8/13/2019 PACT2013Slides.pdf

    13/33

    LegUp

    LLVM 2.9

  • 8/13/2019 PACT2013Slides.pdf

    14/33

    Performance (Cycle Counts)

    Normalised to LegUp

    Compared to Nios II/f & Intel Nehalem Core i7 (SniperSim)

    [Figure panels: Matrix Transpose (x1k cycles), adpcm (x1k cycles), dfsin (x1k cycles), Neural Net Simulator (x1M cycles)]

  • 8/13/2019 PACT2013Slides.pdf

    15/33

    Frequency & Delay

    [Figure, two panels: Frequency in MHz (Higher is Better) and Normalized Delay (Lower is Better), for LegUp (CDFG), VSFG_0, VSFG_1 and VSFG_3 on epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips** and small_bimpa. Annotation: Nios II/f @250MHz]

  • 8/13/2019 PACT2013Slides.pdf

    16/33

    Power and Speculation

    [Figure: misspeculated activity (bits) and useful activity (bits) for epic, adpcm, dfadd, dfdiv, dfmul, dfsin and mips]

  • 8/13/2019 PACT2013Slides.pdf

    17/33

    Power and Speculation

    [Figure: misspeculated activity (bits) and useful activity (bits) for epic, adpcm, dfadd, dfdiv, dfmul, dfsin and mips]

  • 8/13/2019 PACT2013Slides.pdf

    18/33

    Normalized Energy

    [Figure: normalized energy on a log scale (0.1 to 100) for LegUp, VSFG_0, VSFG_1, VSFG_3 and Nios on epic, adpcm, dfadd, dfdiv, dfmul, dfsin and mips, plus GEOMEAN. Bar labels are not recoverable from the transcript.]

  • 8/13/2019 PACT2013Slides.pdf

    19/33

    Energy Cost Comparison:

    vs Nios II/f: 0.25x (GE 4x (GE

  • 8/13/2019 PACT2013Slides.pdf

    20/33

    Limitations on Performance

    35% better performance than statically scheduled CFG, without any optimizations:

    Improvements due to dynamic scheduling, MFC & CDA

    Unrolling helps, but speed-up saturates quickly.

    Further Improvements possible:

    Balance between predication & speculation, to improve speed-up without unrolling (thus reducing area and energy costs)

    State-edge is on critical path; limits both unrolling & MFC.

    Last remnant of 'sequential' nature of program.

    Frequency Scaling limited by Memory Interconnect

    Partition memory & pipeline memory access tree

  • 8/13/2019 PACT2013Slides.pdf

    21/33

    Thank You

    Implicit Parallelism

    State edge Partitioning

  • 8/13/2019 PACT2013Slides.pdf

    22/33

    [Figure: axis "Increasing Programmer / Compiler Effort". Labels: Alias Analysis, Specul. Loads, edge., edge Partitioning, SpMT / TLS, Dynamic]

  • 8/13/2019 PACT2013Slides.pdf

    23/33

    Performance (Cycle Counts)

  • 8/13/2019 PACT2013Slides.pdf

    24/33

    Performance (Cycle Counts)

    Cycle counts normalized to LegUp results

    VSFG implemented with all loops unrolled 0, 1, and 3 times

    Full Speculation: all subgraphs (except loops) triggered without predicates

    [Figure: Cycle Counts with Full Speculation, for LegUp (CDFG), VSFG_0, VSFG_1 and VSFG_3 on epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips** and small_bimpa]

  • 8/13/2019 PACT2013Slides.pdf

    25/33

    Performance (Cycle Counts)

    Predication: only one block will execute

    Speculation: both blocks execute, but only one result is chosen

    [Figure: Cycle Counts with Full Speculation, for LegUp (CDFG), VSFG_0, VSFG_1 and VSFG_3 on epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips** and small_bimpa]
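    The trade-off between the two modes can be made concrete with a toy latency model. This is a minimal sketch under assumptions: each block is a hypothetical (value, latency-in-cycles) pair, both blocks are free of side effects, the final select costs nothing, and the latencies are invented for illustration rather than taken from the results.

```python
from typing import Tuple

Block = Tuple[int, int]  # (result value, latency in cycles); illustrative only


def predicated(cond: Tuple[bool, int], then_blk: Block, else_blk: Block):
    """Predication: wait for the predicate, then trigger only the chosen block."""
    c_val, c_lat = cond
    value, lat = then_blk if c_val else else_blk
    return value, c_lat + lat, 1  # (result, total latency, blocks executed)


def speculated(cond: Tuple[bool, int], then_blk: Block, else_blk: Block):
    """Speculation: trigger both blocks at once, then select one result."""
    c_val, c_lat = cond
    value = then_blk[0] if c_val else else_blk[0]
    # The select waits for the predicate and for both blocks to finish.
    return value, max(c_lat, then_blk[1], else_blk[1]), 2


cond = (True, 3)   # predicate is true and takes 3 cycles to compute
then_blk = (10, 4)
else_blk = (20, 2)
print(predicated(cond, then_blk, else_blk))  # (10, 7, 1)
print(speculated(cond, then_blk, else_blk))  # (10, 4, 2)
```

    In this model speculation hides the predicate latency (4 cycles instead of 7) at the price of executing a block whose result is discarded, which is the misspeculated activity that shows up as an energy cost.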

  • 8/13/2019 PACT2013Slides.pdf

    26/33

    Performance (Cycle Counts)

    [Figure: Cycle Counts with Full Speculation, for LegUp (CDFG), VSFG_0, VSFG_1 and VSFG_3 on epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips** and small_bimpa]

  • 8/13/2019 PACT2013Slides.pdf

    27/33

    Performance (Cycle Counts)

    [Figure, two panels: Cycle Counts with Full Speculation and Cycle Counts with Predicated Subgraphs, each for LegUp (CDFG), VSFG_0, VSFG_1 and VSFG_3 on epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips** and small_bimpa]

  • 8/13/2019 PACT2013Slides.pdf

    28/33

    Performance (Cycle Counts)

    [Figure, four panels: absolute cycle counts for Core i7, Nios 2f, LegUp, VSFG_0, VSFG_1 and VSFG_3 on small_bimpa, dfsin, epic and adpcm. Bar labels are not recoverable from the transcript.]

  • 8/13/2019 PACT2013Slides.pdf

    29/33

    Performance (Cycle Counts)

    [Figure, four panels: absolute cycle counts for Core i7, Nios 2f, LegUp, VSFG_0, VSFG_1 and VSFG_3 on dfadd, dfdiv, dfmul and mips**]

    Understanding

  • 8/13/2019 PACT2013Slides.pdf

    30/33

    Formalizing & Evaluating the VSFG

    Understanding

    If-conversion & hyperblock formation for forward branches.

    No acceleration of backwards branches!

    [Figure: Control-Data Flow Graph (CDFG) of the example loop. Nodes: Start, i = 0, load A[i], compare > 0 (T/F), foo(), bar(), i++, compare < 100 (T/F), End]

  • 8/13/2019 PACT2013Slides.pdf

    31/33

    Formalizing & Evaluating the VSFG

    Any High Level Language -> LLVM -> VSFG -> Low-Level IR -> Bluespec SystemVerilog -> ASIC / FPGA

    Plotkin-style operational semantics developed for VSFG

    Assuming Static Dataflow execution model

    Low-Level IR developed to facilitate conversion to Bluespec

    Based on Hierarchical Coloured Petri-nets

    High-Level Synthesis Toolchain implemented

    Hardware

  • 8/13/2019 PACT2013Slides.pdf

    32/33

    Hardware

  • 8/13/2019 PACT2013Slides.pdf

    33/33

    Hardware

    Bluespec Code

    Petri Net based Low Level Dataflow