L04-Pipelining

download L04-Pipelining

of 33

Transcript of L04-Pipelining

  • 8/6/2019 L04-Pipelining

    1/33

    CS 152 Computer Architecture andEngineering

    Lecture 4 - Pipelining

    Krste Asanovic

    Electrical Engineering and Computer SciencesUniversity of California at Berkeley

    http://www.eecs.berkeley.edu/~krste

    http://inst.eecs.berkeley.edu/~cs152

    http://www.t-ram.com/http://www.t-ram.com/http://www.t-ram.com/http://www.t-ram.com/
  • 8/6/2019 L04-Pipelining

    2/33

    2/3/2009 CS152-Spring09 2

    Last time in Lecture 3

    Microcoding became less attractive as gap betweenRAM and ROM speeds reduced

    Complex instruction sets difficult to pipeline, sodifficult to increase performance as gate count grew

    Iron-law explains architecture design space Trade instructions/program, cycles/instruction, and time/cycle

    Load-Store RISC ISAs designed for efficientpipelined implementations Very similar to vertical microcode

    Inspired by earlier Cray machines

  • 8/6/2019 L04-Pipelining

    3/33

    2/3/2009 CS152-Spring09 3

    Iron Law of ProcessorPerformance

    Time = Instructions Cycles TimeProgram Program * Instruction * Cycle

    Instructions per program depends on source code,compiler technology, and ISA

    Cycles per instructions (CPI) depends upon theISA and the microarchitecture

    Time per cycle depends upon themicroarchitecture and the base technology

    Microarchitecture CPI cycle time

    Microcoded >1 short

    Single-cycle unpipelined 1 long

    Pipelined 1 shortThis lecture

  • 8/6/2019 L04-Pipelining

    4/33

    2/3/2009 CS152-Spring09 4

    An Ideal Pipeline

    All objects go through the same stages

    No sharing of resources between any two stages

    Propagation delay through all pipeline stages is equal

    The scheduling of an object entering the pipelineis not affected by the objects in other stages

    stage1

    stage2

    stage3

    stage4

    These conditions generally hold for industrialassembly lines.But can an instruction pipeline satisfy the last

    condition?

  • 8/6/2019 L04-Pipelining

    5/33

    2/3/2009 CS152-Spring09 5

    Pipelined MIPS

    To pipeline MIPS:

    First build MIPS without pipelining with

    CPI=1

    Next, add pipeline registers to reduce

    cycle time while maintaining CPI=1

  • 8/6/2019 L04-Pipelining

    6/33

    2/3/2009 CS152-Spring09 6

    Pipelined Datapath

    Clock period can be reduced by dividing the execution of aninstruction into multiple cycles

    tC > max {tIM, tRF , tALU , tDM , tRW } ( = tDM probably)

    However, CPI will increase unless instructions are pipelined

    write-back

    phase

    fetch

    phase

    execute

    phase

    decode & Reg-fetch

    phase

    memory

    phase

    addr

    wdata

    rdataData

    Memory

    we

    ALU

    ImmExt

    0x4Add

    addrrdata

    Inst.Memory

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    IRPC

  • 8/6/2019 L04-Pipelining

    7/332/3/2009 CS152-Spring09 7

    Technology Assumptions

    Thus, the following timing assumption is reasonable

    A small amount of very fast memory (caches)backed up by a large, slower memory

    Fast ALU (at least for integers)

    Multiported Register files (slower!)

    tIM tRF tALU tDM tRW

    A 5-stage pipelined Harvard architecture will bethe focus of our detailed design

  • 8/6/2019 L04-Pipelining

    8/332/3/2009 CS152-Spring09 8

    5-Stage Pipelined Execution

    time t0 t1 t2 t3 t4 t5 t6 t7 . . . .instruction1 IF1 ID1 EX1 MA1 WB1instruction2 IF2 ID2 EX2 MA2 WB2instruction3 IF3 ID3 EX3 MA3 WB3instruction4 IF4 ID4 EX4 MA4 WB4

    instruction5 IF5 ID5 EX5 MA5 WB5

    Write-Back(WB)

    I-Fetch(IF)

    Execute(EX)

    Decode, Reg. Fetch(ID)

    Memory(MA)

    addr

    wdata

    rdataData

    Memory

    we

    ALU

    ImmExt

    0x4Add

    addrrdata

    Inst.

    Memory

    rd1

    GPRs

    rs1rs2

    wswdrd2

    we

    IRPC

  • 8/6/2019 L04-Pipelining

    9/332/3/2009 CS152-Spring09 9

    5-Stage Pipelined ExecutionResource Usage Diagram

    time t0 t1 t2 t3 t4 t5 t6 t7 . . . .IF I1 I2 I3 I4 I5ID I1 I2 I3 I4 I5EX I1 I2 I3 I4 I5MA I1 I2 I3 I4 I5

    WB I1 I2 I3 I4 I5

    Resour

    ces

    Write-Back(WB)

    I-Fetch(IF)

    Execute(EX)

    Decode, Reg. Fetch(ID)

    Memory(MA)

    addr

    wdata

    rdataData

    Memory

    we

    ALU

    ImmExt

    0x4Add

    addrrdata

    Inst.

    Memory

    rd1

    GPRs

    rs1rs2

    wswdrd2

    we

    IRPC

  • 8/6/2019 L04-Pipelining

    10/332/3/2009 CS152-Spring09 10

    Pipelined Execution:ALU Instructions

    IRIR IR

    31

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1

    rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataDataMemory

    we

    Not quite correct!

    We need an Instruction Reg (IR) for each stage

  • 8/6/2019 L04-Pipelining

    11/332/3/2009 CS152-Spring09 11

    Pipelined MIPS Datapathwithout jumps

    IRIR IR

    31

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    DataMemory

    wdata

    addr

    wdata

    rdata

    we

    OpSel

    ExtSel BSrc

    WBSrcMemWrite

    RegDstRegWrite

    F D E M W

    Control Points Needto Be Connected

  • 8/6/2019 L04-Pipelining

    12/332/3/2009 CS152-Spring09 12

    How Instructions can Interact witheach other in a pipeline

    An instruction in the pipeline may need aresource being used by another instruction

    in the pipelinestructural hazard

    An instruction may depend on somethingproduced by an earlier instruction Dependence may be for a data value

    data hazard

    Dependence may be for the next instructionsaddress

    control hazard (branches, exceptions)

  • 8/6/2019 L04-Pipelining

    13/332/3/2009 CS152-Spring09 13

    Data Hazards

    ...r1 r0 + 10r4 r1 + 17...

    r1 is stale. Oops!

    r1 r4 r1

    IRIR IR31

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1

    rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataDataMemory

    we

  • 8/6/2019 L04-Pipelining

    14/332/3/2009 CS152-Spring09 14

    CS152 Administrivia

    PS 1 due Tuesday Feb 10 in class

    Section covering PS 1 on Wednesday Feb 11 Room/time TBD

    First Quiz on Thursday Feb 12

    In class, closed-book, no computers or calculators Covers lectures 1-5 (this weeks lectures)

    Lecture 7, Tuesday Feb 17 in 320 Soda

    Lecture 8, Thursday Feb 19 back in 306 Soda

    See website for full schedule

  • 8/6/2019 L04-Pipelining

    15/332/3/2009 CS152-Spring09 15

    Resolving Data Hazards (1)

    Strategy 1:

    Wait for the result to be available by freezingearlier pipeline stages interlocks

  • 8/6/2019 L04-Pipelining

    16/332/3/2009 CS152-Spring09 16

    Feedback to Resolve Hazards

    Later stages provide dependence information to earlier stageswhich can stall (or kill) instructions

    FB1

    stage1

    stage2

    stage

    3

    stage

    4

    FB2 FB3 FB4

    Controlling a pipeline in this manner works providedthe instruction at stage i+1 can complete withoutany interference from instructions in stages 1 to i

    (otherwise deadlocks may occur)

  • 8/6/2019 L04-Pipelining

    17/332/3/2009 CS152-Spring09 17

    IRIR IR

    31

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    Inst

    Memory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdata

    DataMemory

    we

    nop

    Interlocks to resolve Data Hazards

    ...r1 r0 + 10r4 r1 + 17...

    Stall Condition

  • 8/6/2019 L04-Pipelining

    18/332/3/2009 CS152-Spring09 18

    stalled stages

    timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

    IF I1 I2 I3 I3 I3 I3 I4 I5ID I1 I2 I2 I2 I2 I3 I4 I5EX I1 nop nop nop I2 I3 I4 I5MA I1 nop nop nop I2 I3 I4 I5WB I1 nop nop nop I2 I3 I4 I5

    Stalled Stages and Pipeline Bubbles

    timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

    (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1(I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3

    (I4) IF4 ID4 EX4 MA4 WB4(I5) IF5 ID5 EX5 MA5 WB5

    ResourceUsage

    nop pipeline bubble

  • 8/6/2019 L04-Pipelining

    19/332/3/2009 CS152-Spring09 19

    IRIR IR31

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataDataMemory

    we

    nop

    Interlock Control Logic

    Compare the source registers of the instruction in the decodestage with the destination registerof the uncommitted

    instructions.

    stall

    Cstall

    ws

    rsrt

    ?

  • 8/6/2019 L04-Pipelining

    20/332/3/2009 CS152-Spring09 20

    Cdest

    Interlock Control Logicignoring jumps & branches

    Should we always stall if the rs field matches some rd?

    IRIR IR

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataDataMemory

    we

    31

    nop

    stall

    Cstall

    ws

    rsrt

    ?

    we

    re1 re2

    Cre

    ws we ws

    Cdest Cdest

    we

    not every instruction writes a register we

    not every instruction reads a register re

  • 8/6/2019 L04-Pipelining

    21/332/3/2009 CS152-Spring09 21

    Source & Destination Registers

    source(s) destination

    ALU rd (rs) func (rt) rs, rt rdALUi rt (rs) op imm rs rtLW rt M [(rs) + imm] rs rtSW M [(rs) + imm] (rt) rs, rtBZ cond(rs)

    true: PC (PC) + imm rsfalse: PC (PC) + 4 rs

    J PC (PC) + immJAL r31 (PC), PC (PC) + imm 31JR PC (rs) rs

    JALR r31

    (PC), PC

    (rs) rs 31

    R-type: op rs rt rd func

    I-type: op rs rt immediate16

    J-type: op immediate26

  • 8/6/2019 L04-Pipelining

    22/332/3/2009 CS152-Spring09 22

    Deriving the Stall Signal

    Cdest

    ws = Case opcodeALU rdALUi, LW rtJAL, JALR R31

    we = Case opcodeALU, ALUi, LW (ws 0)JAL, JALR on... off

    Cre

    re1 = Case opcodeALU, ALUi,

    on off

    re2 = Case opcode on off

    LW, SW, BZ,JR, JALRJ, JAL

    ALU, SW...

    Cstall

    stall = ((rsD =wsE).weE +(rsD =wsM).weM +

    (rsD =wsW).weW) . re1D +

    ((rtD =wsE).weE +

    (rtD =wsM).weM +

    (rtD =wsW).weW) . re2D

    This

    isnot

    thefu

    llstory!

  • 8/6/2019 L04-Pipelining

    23/332/3/2009 CS152-Spring09 23

    Hazards due to Loads & Stores

    ...M[(r1)+7] (r2)r4 M[(r3)+5]

    ...

    IRIR IR

    31

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    Inst

    Memory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdata

    DataMemory

    we

    nop

    Stall Condition

    Is there any possible data hazardin this instruction sequence?

    What if(r1)+7 = (r3)+5 ?

  • 8/6/2019 L04-Pipelining

    24/33

    2/3/2009 CS152-Spring09 24

    Load & Store Hazards

    However, the hazard is avoided because ourmemory system completes writes in one cycle !

    Load/Store hazards are sometimes resolved in the

    pipeline and sometimes in the memory systemitself.

    More on this later in the course.

    ...M[(r1)+7] (r2)r4 M[(r3)+5]...

    (r1)+7 = (r3)+5 data hazard

  • 8/6/2019 L04-Pipelining

    25/33

    2/3/2009 CS152-Spring09 25

    Resolving Data Hazards (2)

    Strategy 2:

    Route data as soon as possible after it iscalculated to the earlier pipeline stage bypass

  • 8/6/2019 L04-Pipelining

    26/33

    2/3/2009 CS152-Spring09 26

    Bypassing

    Each stall or killintroduces a bubble in the pipeline CPI > 1

    time t0 t1 t2 t3 t4 t5 t6 t7 . . . .

    (I1)r1

    r0 + 10 IF1 ID1 EX1 MA1 WB1(I2) r4 r1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3(I4) stalled stages IF4 ID4 EX4(I5) IF5 ID5

    time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) r1 r0 + 10 IF1 ID1 EX1 MA1 WB1(I2) r4 r1 + 17 IF2 ID2 EX2 MA2 WB2(I3) IF3 ID3 EX3 MA3 WB3(I4) IF4 ID4 EX4 MA4 WB4

    (I5) IF5 ID5 EX5 MA5 WB5

    A new datapath, i.e., a bypass, can get the data fromthe output of the ALU to its input

  • 8/6/2019 L04-Pipelining

    27/33

    2/3/2009 CS152-Spring09 27

    Adding a Bypass

    ASrc

    ...(I1) r1 r0 + 10

    (I2) r4 r1 + 17

    r4 r1... r1 ...

    IRIR IR

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR

    ImmExt

    ALU

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataDataMemory

    we

    31

    nop

    stall

    D

    E M W

    When does this bypass help?

    r1 M[r0 + 10]r4 r1 + 17

    JAL 500r4 r31 + 17

    yes no no

  • 8/6/2019 L04-Pipelining

    28/33

    2/3/2009 CS152-Spring09 28

    The Bypass SignalDeriving it from the Stall Signal

    ASrc = (rsD=wsE).weE.re1D

    we = Case opcodeALU, ALUi, LW (ws 0)JAL, JALR on

    ... off

    No because only ALU and ALUi instructions can benefit

    from this bypass

    Is this correct?

    Split weE into two components: we-bypass, we-stall

    stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D +((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D )

    ws = Case opcodeALU rdALUi, LW rt

    JAL, JALR R31

  • 8/6/2019 L04-Pipelining

    29/33

    2/3/2009 CS152-Spring09 29

    Bypass and Stall Signals

    we-bypassE = Case opcodeEALU, ALUi (ws 0)... off

    ASrc = (rsD =wsE).we-bypassE . re1D

    Split weE into two components: we-bypass, we-stall

    stall = ((rsD =wsE).we-stallE+

    (rsD=wsM).weM + (rsD=wsW).weW). re1D

    +((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D

    we-stallE = Case opcodeELW (ws 0)JAL, JALR on... off

  • 8/6/2019 L04-Pipelining

    30/33

    2/3/2009 CS152-Spring09 30

    Fully Bypassed Datapath

    ASrcIRIR IR

    PCA

    B

    Y

    R

    MD1 MD2

    addrinst

    InstMemory

    0x4

    Add

    IR ALU

    ImmExt

    rd1

    GPRs

    rs1rs2

    wswd rd2

    we

    wdata

    addr

    wdata

    rdataDataMemory

    we

    31

    nop

    stall

    D

    E M W

    PC for JAL, ...

    BSrcIs there stilla need for thestall signal ? stall = (rsD=wsE). (opcodeE=LWE).(wsE 0 ).re1D

    + (rtD=wsE). (opcodeE=LWE).(wsE 0 ).re2D

  • 8/6/2019 L04-Pipelining

    31/33

    2/3/2009 CS152-Spring09 31

    Resolving Data Hazards (3)

    Strategy 3:

    Speculate on the dependence. Two cases:

    Guessed correctly do nothing

    Guessed incorrectly kill and restart

  • 8/6/2019 L04-Pipelining

    32/33

    2/3/2009 CS152-Spring09 32

    Next Time: Control Hazards

    Branches/Jumps

    Exceptions/Interrupts

  • 8/6/2019 L04-Pipelining

    33/33

    Acknowledgements

    These slides contain material developed and

    copyright by: Arvind (MIT)

    Krste Asanovic (MIT/UCB)

    Joel Emer (Intel/MIT)

    James Hoe (CMU)

    John Kubiatowicz (UCB)

    David Patterson (UCB)

    MIT material derived from course 6.823

    UCB material derived from course CS252