David M. Harris and Sarah L. Harris, pages.hmc.edu/harris/class/e85/lect22.pdf

Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 Elsevier Chapter 7 <1> Digital Design and Computer Architecture, RISC-V Edition Chapter 6 David M. Harris and Sarah L. Harris


  • Chapter 6 Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Digital Design and Computer Architecture, RISC-V Edition

    Chapter 6

    David M. Harris and Sarah L. Harris

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Chapter 7 :: Microarchitecture

    • Introduction
    • Performance Analysis
    • Single-Cycle Processor
    • Multicycle Processor
    • Pipelined Processor
    • Advanced Microarchitecture

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single Cycle Processor

    [Figure: single-cycle RISC-V processor. Datapath: PC, Instruction Memory, Register File, Extend, ALU, Data Memory, and the PCPlus4 and PCTarget adders. Control Unit inputs: op (Instr 6:0), funct3 (Instr 14:12), funct7 bit 5 (Instr 30), Zero. Control signals: PCSrc, ResultSrc1:0, MemWrite, ALUControl2:0, ALUSrc, ImmSrc1:0, RegWrite.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single Cycle Main Decoder

    op   Instruction  RegWrite  ImmSrc  ALUSrc  MemWrite  ResultSrc  Branch  ALUOp  PCUpdate
    3    lw           1         00      1       0         10         0       00     0
    35   sw           0         01      1       1         XX         0       00     0
    51   R-type       1         XX      0       0         01         0       10     0
    99   beq          0         10      0       0         XX         1       01     0
    19   addi         1         00      1       0         01         0       10     0
    111  jal          1         11      X       0         00         0       XX     1

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Processor Performance

    Program Execution Time = (# instructions)(cycles/instruction)(seconds/cycle)
                           = # instructions × CPI × TC

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single-Cycle Performance

    TC is limited by the critical path (lw)

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single-Cycle Performance

    • Single-cycle critical path:
      Tc1 = tpcq_PC + tmem + max[tRFread, tdec + text + tmux] + tALU + tmem + tmux + tRFsetup

    • Typically, the limiting paths are memory, ALU, and register file:
      Tc1 = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle RISC-V Processor

    • Single-cycle:
      + simple
      - cycle time limited by longest instruction (lw)
      - separate memories for instruction and data
      - 3 adders/ALUs

    • Multicycle processor addresses these issues by breaking each instruction into shorter steps:
      o shorter instructions take fewer steps
      o can re-use hardware
      o cycle time is faster

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single-Cycle Performance Example

    Element                  Parameter  Delay (ps)
    Register clock-to-Q      tpcq_PC    40
    Register setup           tsetup     50
    Multiplexer              tmux       30
    ALU                      tALU       120
    Decoder (Control Unit)   tdec       35
    Memory read              tmem       200
    Register file read       tRFread    100
    Register file setup      tRFsetup   60

    Tc1 = tpcq_PC + 2tmem + tRFread + tALU + tmux + tRFsetup
        = [40 + 2(200) + 100 + 120 + 30 + 60] ps
        = 750 ps
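    The arithmetic above can be checked with a few lines of Python. This is only a sketch: the dictionary name and key spellings are chosen here for illustration, and the delays are copied from the table on this slide.

```python
# Delays in picoseconds, copied from the slide's table.
# The dictionary name and key spellings are illustrative choices.
DELAYS_PS = {
    "tpcq_PC": 40, "tsetup": 50, "tmux": 30, "tALU": 120,
    "tdec": 35, "tmem": 200, "tRFread": 100, "tRFsetup": 60,
}

def single_cycle_tc(d):
    """Tc1 = tpcq_PC + 2*tmem + tRFread + tALU + tmux + tRFsetup (the lw path)."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tALU"] + d["tmux"] + d["tRFsetup"])

tc1_ps = single_cycle_tc(DELAYS_PS)
print(tc1_ps)  # 750
```

    Changing a single entry (for example a faster memory) shows directly how much the two memory accesses dominate the single-cycle period.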

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single-Cycle Performance Example

    Program with 100 billion instructions:

    Execution Time = # instructions × CPI × TC
                   = (100 × 10^9)(1)(750 × 10^-12 s)
                   = 75 seconds

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Review: Multicycle RISC-V Processor

    [Figure: multicycle RISC-V processor. Datapath: PC, combined Instr / Data Memory, instruction register (Instr, OldPC), Register File, Extend, ALU, and the Data and ALUOut registers. Control Unit inputs: op (6:0), funct3 (14:12), funct7 bit 5 (30), Zero. Control signals: PCWrite, AdrSrc, MemWrite, IRWrite, ResultSrc1:0, ALUControl2:0, ALUSrcB1:0, ALUSrcA1:0, ImmSrc1:0, RegWrite.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Review: Multicycle Main FSM

    State     Datapath µOp
    Fetch     Instr ← Mem[PC]; PC ← PC+4
    Decode    ALUOut ← PCTarget
    MemAdr    ALUOut ← rs1 + imm
    MemRead   Data ← Mem[ALUOut]
    MemWB     rd ← Data
    MemWrite  Mem[ALUOut] ← rd
    ExecuteR  ALUOut ← rs1 op rs2
    ExecuteI  ALUOut ← rs1 op imm
    ALUWB     rd ← ALUOut
    BEQ       ALUResult = rs1-rs2; if Zero, PC = ALUOut
    JAL       PC = ALUOut, ALUOut = PC+4

    [Figure: main FSM state diagram. Reset enters S0.]

    S0: Fetch      AdrSrc = 0, IRWrite, ALUSrcA = 00, ALUSrcB = 10, ALUOp = 00, ResultSrc = 00, PCUpdate
    S1: Decode     ALUSrcA = 01, ALUSrcB = 01, ALUOp = 00
    S2: MemAdr     ALUSrcA = 10, ALUSrcB = 01, ALUOp = 00
    S3: MemRead    ResultSrc = 10, AdrSrc = 1
    S4: MemWB      ResultSrc = 01, RegWrite
    S5: MemWrite   ResultSrc = 10, AdrSrc = 1, MemWrite
    S6: ExecuteR   ALUSrcA = 10, ALUSrcB = 00, ALUOp = 10
    S7: ALUWB      ResultSrc = 10, RegWrite
    S8: ExecuteI   ALUSrcA = 10, ALUSrcB = 01, ALUOp = 10
    S9: JAL        ALUSrcA = 01, ALUSrcB = 10, ALUOp = 00, ResultSrc = 10, PCUpdate
    S10: BEQ       ALUSrcA = 10, ALUSrcB = 00, ALUOp = 01, ResultSrc = 10, Branch

    Transitions out of S1 (Decode): op = lw OR op = sw → S2; op = R-type → S6; op = addi → S8; op = jal → S9; op = beq → S10
    Transitions out of S2 (MemAdr): op = lw → S3; op = sw → S5

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Processor Performance

    • Instructions take different numbers of cycles:

    [Figure: main FSM state diagram (states S0 to S10, same control outputs as on the Main FSM review slide), with transitions labeled by opcode: op = 0000011 (lw), 0100011 (sw), 0110011 (R-type), 0010011 (addi), 1101111 (jal), 1100011 (beq).]
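    One way to see why instructions take different numbers of cycles is to walk the FSM and count the states each instruction visits. The sketch below is illustrative only: the dictionary and function names are made up here, and the transitions into MemWB and ALUWB are assumptions taken from the usual form of this FSM, since the extracted slide text does not spell them out.

```python
# State names follow the Main FSM slide. Transitions not listed explicitly in the
# slide text (MemRead -> MemWB, ExecuteR/ExecuteI/JAL -> ALUWB) are assumptions.
DECODE_NEXT = {"lw": "MemAdr", "sw": "MemAdr", "R-type": "ExecuteR",
               "addi": "ExecuteI", "jal": "JAL", "beq": "BEQ"}
MEMADR_NEXT = {"lw": "MemRead", "sw": "MemWrite"}
FIXED_NEXT = {"MemRead": "MemWB", "ExecuteR": "ALUWB",
              "ExecuteI": "ALUWB", "JAL": "ALUWB"}

def state_path(op):
    """Return the list of FSM states visited by one instruction of type `op`."""
    path = ["Fetch", "Decode"]
    state = DECODE_NEXT[op]
    while True:
        path.append(state)
        if state == "MemAdr":
            state = MEMADR_NEXT[op]
        elif state in FIXED_NEXT:
            state = FIXED_NEXT[state]
        else:
            return path  # MemWB, MemWrite, ALUWB and BEQ return to Fetch

for op in DECODE_NEXT:
    print(op, len(state_path(op)), state_path(op))
```

    Each state takes one clock cycle, so the length of the path is the cycle count for that instruction type.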

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Processor Performance

    • Instructions take different numbers of cycles:
      – 3 cycles: beq
      – 4 cycles: R-type, addi, sw, jal
      – 5 cycles: lw

    • CPI is a weighted average
    • SPECINT2000 benchmark:
      – 25% loads
      – 10% stores
      – 13% branches
      – 52% R-type

    Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
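    The weighted average can be reproduced with a short Python sketch. The instruction mix and cycle counts are the ones on this slide; the names `MIX` and `average_cpi` are illustrative.

```python
# (fraction of instructions, cycles per instruction), from the slide.
MIX = {
    "loads":    (0.25, 5),
    "stores":   (0.10, 4),
    "branches": (0.13, 3),
    "R-type":   (0.52, 4),
}

def average_cpi(mix):
    """Weighted average: sum of fraction * cycles over all instruction classes."""
    return sum(fraction * cycles for fraction, cycles in mix.values())

print(f"{average_cpi(MIX):.2f}")
```

    The fractions should sum to 1; a different benchmark mix changes the CPI without changing the hardware.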

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Processor Critical Path

    • Assumptions:
      • RF is faster than memory
      • Writing memory is faster than reading memory

    • Two possibilities:
      • Read memory (MemRead state)
      • PC = PC + 4 path (Fetch state)

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Processor Critical Path

    Option 1: Read memory (MemRead state)

    Tc2 = tpcq + tmux + tmux + tmem + tsetup
        = tpcq + 2tmux + tmem + tsetup

    [Figure: multicycle RISC-V processor datapath and control unit.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Processor Critical Path

    Option 2: PC = PC + 4 path (Fetch state)

    Tc2 = tpcq + tmux + tALU + tmux + tsetup
        = tpcq + 2tmux + tALU + tsetup

    [Figure: multicycle RISC-V processor datapath and control unit.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Processor Critical Path

    • Two possibilities:
      • Read memory (MemRead state)
      • PC = PC + 4 path (Fetch state)

    Tc2 = tpcq + 2tmux + max[tALU, tmem] + tsetup

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multi-Cycle Performance Example

    Element                  Parameter  Delay (ps)
    Register clock-to-Q      tpcq_PC    40
    Register setup           tsetup     50
    Multiplexer              tmux       30
    AND-OR gate              tAND-OR    20
    ALU                      tALU       120
    Decoder (control unit)   tdec       35
    Memory read              tmem       200
    Register file read       tRFread    100
    Register file setup      tRFsetup   60

    Tc2 = tpcq + 2tmux + max[tALU, tmem] + tsetup
        = (40 + 2(30) + 200 + 50) ps
        = 350 ps
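    As a quick check of the multicycle period, here is a small Python sketch of the formula above. The function name and dictionary keys are illustrative; the key "tpcq" holds the register clock-to-Q delay that the table lists as tpcq_PC.

```python
# Delays in picoseconds from the slide's table (only the ones the formula uses).
# "tpcq" is the clock-to-Q delay, listed as tpcq_PC in the table.
DELAYS_PS = {"tpcq": 40, "tsetup": 50, "tmux": 30, "tALU": 120, "tmem": 200}

def multicycle_tc(d):
    """Tc2 = tpcq + 2*tmux + max(tALU, tmem) + tsetup."""
    return d["tpcq"] + 2 * d["tmux"] + max(d["tALU"], d["tmem"]) + d["tsetup"]

print(multicycle_tc(DELAYS_PS))  # 350
```

    With these numbers the memory read dominates the ALU, so the MemRead path sets the clock period.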

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Multicycle Performance Example

    For a program with 100 billion instructions executing on a multicycle RISC-V processor:

    – CPI = 4.12 cycles/instruction
    – Clock cycle time: Tc2 = 350 ps

    Execution Time = (# instructions) × CPI × Tc
                   = (100 × 10^9)(4.12)(350 × 10^-12 s)
                   ≈ 144 seconds

    This is slower than the single-cycle processor (75 sec.)
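    The comparison between the two designs can be sketched in Python using the execution-time formula (# instructions × CPI × Tc). The function name is illustrative; the instruction count, CPI values, and cycle times are the ones quoted in these slides.

```python
def execution_time_s(instructions, cpi, tc_ps):
    """Execution time in seconds; the cycle time is given in picoseconds."""
    return instructions * cpi * tc_ps * 1e-12

N = 100e9  # 100 billion instructions
single_cycle_s = execution_time_s(N, 1.0, 750)   # CPI = 1, Tc1 = 750 ps
multicycle_s = execution_time_s(N, 4.12, 350)    # CPI = 4.12, Tc2 = 350 ps

print(f"single-cycle: {single_cycle_s:.1f} s")
print(f"multicycle:   {multicycle_s:.1f} s")
print(f"multicycle / single-cycle: {multicycle_s / single_cycle_s:.2f}")
```

    The shorter clock period does not make up for the higher CPI: the multicycle cycle is less than half as long, but each instruction needs about four of them.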

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Pipelined RISC-V Processor

    • Temporal parallelism
    • Divide single-cycle processor into 5 stages:
      – Fetch
      – Decode
      – Execute
      – Memory
      – Writeback
    • Add pipeline registers between stages

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single-Cycle vs. Pipelined

    [Figure: timing diagrams, Time (ps) from 0 to 1500, comparing instruction execution on the single-cycle and pipelined processors. Each instruction passes through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read / Write, and Wr Reg.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Pipelined Processor Abstraction

    [Figure: pipeline diagram over Time (cycles) 1 to 10. Each instruction passes through IM, RF, ALU, DM, and RF, starting one cycle after the previous instruction:]

    lw  s2, 40(s0)
    add s3, s9, s10
    sub s4, t1, s8
    and s5, s11, t5
    sw  s6, 20(t4)
    or  s7, t2, t3

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Single-Cycle & Pipelined Datapath

    [Figure: the single-cycle datapath and the pipelined datapath. The pipelined version adds pipeline registers between the Fetch, Decode, Execute, Memory, and Writeback stages; signal names carry a stage suffix (for example PCF, InstrD, SrcAE, ALUResultM, ResultW).]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Corrected Pipelined Datapath

    • Rd must arrive at same time as Result
    • Register file written on falling edge of CLK

    [Figure: pipelined datapath with Rd (Instr 11:7) carried through the pipeline registers as RdD, RdE, RdM, and RdW.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Pipelined Processor with Control

    • Same control unit as single-cycle processor
    • Control signals travel with the instruction (drop off when used)

    [Figure: pipelined datapath with Control Unit (inputs op, funct3, funct7 bit 5). Control signals generated in Decode (RegWriteD, ResultSrcD1:0, MemWriteD, JumpD, BranchD, ALUControlD2:0, ALUSrcD, ImmSrcD) are registered into the Execute, Memory, and Writeback stages; PCSrcE is formed in Execute from JumpE, BranchE, and ZeroE.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Pipeline Hazards

    • A hazard occurs when an instruction depends on the result of an instruction that hasn’t completed
    • Types:
      – Data hazard: register value not yet written back to register file
      – Control hazard: next instruction not decided yet (caused by branch)

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Data Hazard

    [Figure: pipeline diagram over Time (cycles) 1 to 8. The and, or, and sub instructions all read s1, which add writes.]

    add s1, s4, s5
    and s2, s1, s3
    or  s9, t6, s1
    sub s7, s1, t2

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Handling Data Hazards

    How do we ensure that our programs run correctly?
    • Insert nops in code at compile time
    • Rearrange code at compile time
    • Forward data at run time
    • Stall the processor at run time

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Compile-Time Hazard Elimination

    • Insert enough nops for result to be ready
    • Or move independent useful instructions forward

    [Figure: pipeline diagram over Time (cycles) 1 to 10 with two nops inserted after add:]

    add s1, s4, s5
    nop
    nop
    and s2, s1, s3
    or  s9, t6, s1
    sub s7, s1, t2
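    The nop-insertion idea can be sketched as a small compile-time pass. This is a toy model, not a real assembler: instructions are hypothetical (text, dest, sources) tuples, and it assumes, as in the diagram, that a consumer must sit at least three instruction slots after its producer (the register file is written in the producer's Writeback cycle and read in the consumer's Decode cycle).

```python
MIN_DISTANCE = 3  # assumption: consumer must be >= 3 slots after its producer
NOP = ("nop", None, ())

def insert_nops(program):
    """program: list of (text, dest, sources). Returns a copy padded with nops."""
    out = []
    for text, dest, sources in program:
        needed = 0
        # Inspect the previous MIN_DISTANCE - 1 emitted slots, nearest first.
        window = out[-(MIN_DISTANCE - 1):]
        for back, (_, prev_dest, _) in enumerate(reversed(window), start=1):
            if prev_dest not in (None, "zero", "x0") and prev_dest in sources:
                needed = max(needed, MIN_DISTANCE - back)
        out.extend([NOP] * needed)
        out.append((text, dest, sources))
    return out

program = [
    ("add s1, s4, s5", "s1", ("s4", "s5")),
    ("and s2, s1, s3", "s2", ("s1", "s3")),
    ("or s9, t6, s1",  "s9", ("t6", "s1")),
    ("sub s7, s1, t2", "s7", ("s1", "t2")),
]
for text, _, _ in insert_nops(program):
    print(text)
```

    For the four-instruction example this yields two nops between add and and, matching the slide; the later or and sub are already far enough from add once the nops are in place.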

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Data Forwarding

    • Check if register read in Execute stage matches register written in Memory or Writeback stage
    • If so, forward result

    [Figure: pipeline diagram over Time (cycles) 1 to 8 for the sequence below, with s1 forwarded from add to the instructions that read it.]

    add s1, s4, s5
    and s2, s1, s3
    or  s9, t6, s1
    sub s7, s1, t2

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Data Forwarding

    [Figure: pipelined processor with a Hazard Unit. Hazard Unit inputs include Rs1E, Rs2E, RdM, RdW, RegWriteM, and RegWriteW; its outputs ForwardAE and ForwardBE select the inputs of the forwarding multiplexers (00, 01, 10) that feed SrcAE and SrcBE.]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Data Forwarding

    • Case 1: Execute stage Rs1 or Rs2 matches Memory stage Rd? Forward from Memory stage
    • Case 2: Execute stage Rs1 or Rs2 matches Writeback stage Rd? Forward from Writeback stage
    • Case 3: Otherwise use value read from register file (as usual)

    Equations for Rs1:
      if      ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0)   // Case 1
        ForwardAE = 10
      else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0)   // Case 2
        ForwardAE = 01
      else
        ForwardAE = 00                                    // Case 3

    ForwardBE equation is similar (replace Rs1E with Rs2E)
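    The forwarding equations translate directly into a small behavioral model. The Python sketch below is an illustration only (the function name and integer register numbers are choices made here, not part of the slides); it returns the 2-bit select value for one source operand, so calling it with Rs2E gives ForwardBE.

```python
def forward_select(rs_e, rd_m, reg_write_m, rd_w, reg_write_w):
    """Return the forwarding mux select for one Execute-stage source register.

    0b10: forward from the Memory stage (Case 1)
    0b01: forward from the Writeback stage (Case 2)
    0b00: use the value read from the register file (Case 3)
    """
    if rs_e != 0 and rs_e == rd_m and reg_write_m:
        return 0b10
    if rs_e != 0 and rs_e == rd_w and reg_write_w:
        return 0b01
    return 0b00

S1 = 9  # s1 is register x9 in the RISC-V register naming
forward_ae = forward_select(rs_e=S1, rd_m=S1, reg_write_m=True, rd_w=0, reg_write_w=False)
print(format(forward_ae, "02b"))  # 10
```

    Checking the Memory stage first matters: if both later stages hold a write to the same register, the Memory-stage value is the more recent one.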

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Stalling

    [Figure: pipeline diagram over Time (cycles) 1 to 8 for the sequence below. The and instruction needs s1 in its Execute stage, before lw has finished reading it from data memory. Trouble!]

    lw  s1, 40(s5)
    and s8, s1, t3
    or  t2, s6, s1
    sub s3, s1, s7

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Stalling

    [Figure: the same sequence over Time (cycles) 1 to 9 with a one-cycle stall: and repeats its Decode (RF) stage and or repeats its Fetch (IM) stage.]

    lw  s1, 40(s5)
    and s8, s1, t3
    or  t2, s6, s1
    sub s3, s1, s7

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Stalling Hardware

    [Figure: pipelined processor with Hazard Unit extended for stalls. New Hazard Unit inputs: Rs1D, Rs2D, RdE, and ResultSrcE bit 0. New outputs: StallF and StallD (driving the EN inputs of the PC and Decode pipeline registers) and FlushE (driving CLR of the Execute pipeline register).]

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Stalling Logic

    • Is either source register in Decode stage the same as the one to be written by instruction in Execute stage? AND
    • Is the instruction in Execute stage a lw?

    lwStall = ((Rs1D == RdE) | (Rs2D == RdE)) & ResultSrcE0

    StallF = StallD = FlushE = lwStall

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Stalling Logic

    • Is either source register in Decode stage the same as the one to be written by instruction in Execute stage? AND
    • Is the instruction in Execute stage a lw? AND
    • Is the lw’s destination register (RdE) not x0? AND
    • Are the source registers (Rs1D and Rs2D) used?

    lwStall = ((Rs1D == RdE) | (Rs2D == RdE) & ~ALUSrcD) & ResultSrcE0 & (RdE != 0) & (~JumpD)

    StallF = StallD = FlushE = lwStall

    • JAL doesn’t use rs1, rs2
    • I-type instructions don’t use rs2 (ALUSrcD = 1 selects ExtImm as SrcB of ALU)
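    The stall condition can be modeled as a single boolean function. The sketch below simply mirrors the lwStall equation on this slide, term by term; the function and parameter names are illustrative, and registers are passed as integer register numbers.

```python
def lw_stall(rs1_d, rs2_d, rd_e, result_src_e0, alu_src_d, jump_d):
    """Mirror of the slide's lwStall equation.

    rs1_d, rs2_d:  source registers of the instruction in Decode
    rd_e:          destination register of the instruction in Execute
    result_src_e0: ResultSrcE bit 0 (set when the Execute-stage instruction is lw)
    alu_src_d:     ALUSrcD (1 means SrcB is the immediate, so rs2 is ignored)
    jump_d:        JumpD (jal uses neither rs1 nor rs2)
    """
    rs1_hit = rs1_d == rd_e
    rs2_hit = rs2_d == rd_e and not alu_src_d
    return bool((rs1_hit or rs2_hit) and result_src_e0 and rd_e != 0 and not jump_d)

# lw s1, 40(s5) in Execute; and s8, s1, t3 in Decode (s1 = x9, t3 = x28).
stall = lw_stall(rs1_d=9, rs2_d=28, rd_e=9, result_src_e0=1, alu_src_d=0, jump_d=0)
stall_f = stall_d = flush_e = stall
print(stall)  # True
```

    When the result is true, StallF and StallD hold the Fetch and Decode stages for one cycle while FlushE inserts a bubble into Execute.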

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Control Hazards

    • beq:
      – Branch not determined until the Execute stage of pipeline
      – Instructions after branch fetched before branch occurs
      – These 2 instructions must be flushed if branch happens

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Control Hazards

    Branch misprediction penalty
    • number of instructions flushed when branch is taken (2)

    [Figure: pipeline diagram over Time (cycles) 1 to 10. The sub and or instructions fetched after beq are flushed ("Flush these instructions"), and add at L1 is fetched next.]

    20      beq s1, s2, L1
    24      sub s8, t1, s3
    28      or  s9, t6, s5
    2C      ...
    ...     ...
    58  L1: add s7, s3, s4

  • Digital Design and Computer Architecture: RISC-V Edition Harris & Harris © 2020 ElsevierChapter 7

    Flushing Hardware for Control Hazards

    [Figure: pipelined processor with Hazard Unit extended for control hazards. PCSrcE is an input to the Hazard Unit; the new output FlushD drives CLR of the Decode pipeline register, alongside FlushE for the Execute pipeline register.]


    Control Flushing Logic

    • If branch is taken in Execute stage, need to flush the instructions in the Fetch and Decode stages
      – Do this by clearing Decode and Execute pipeline registers using FlushD and FlushE

    • Equations:
      FlushD = PCSrcE
      FlushE = lwStall | PCSrcE
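The two flush equations can be tabulated with a short Python sketch. The function name is hypothetical and the signals are modeled as booleans; it only restates the equations on this slide.

```python
from itertools import product


def flush_signals(lw_stall, pc_src_e):
    """Return (FlushD, FlushE) for one cycle."""
    flush_d = bool(pc_src_e)              # taken branch kills the fetched instruction
    flush_e = bool(lw_stall or pc_src_e)  # bubble for a load stall or a taken branch
    return flush_d, flush_e


# Print the full truth table for the two inputs.
for lw, pc in product([False, True], repeat=2):
    print(f"lwStall={lw!s:5} PCSrcE={pc!s:5} -> FlushD, FlushE = {flush_signals(lw, pc)}")
```

Note that a load stall alone sets only FlushE, so the stalled instruction stays in Decode while a bubble enters Execute.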


    Pipeline Hazard Summary

    Forward to solve data hazards when possible
      if ((Rs1E == RdM) & RegWriteM) & (Rs1E != 0) then ForwardAE = 10
      else if ((Rs1E == RdW) & RegWriteW) & (Rs1E != 0) then ForwardAE = 01
      else ForwardAE = 00

    Stall when a load hazard occurs
      lwStall = ((Rs1D == RdE) | (Rs2D == RdE) & ~ALUSrcD) & ResultSrcE0 & (RdE != 0) & ~JumpD
      StallF = lwStall
      StallD = lwStall

    Flush when a branch is taken or a load introduces a bubble
      FlushD = PCSrcE
      FlushE = lwStall | PCSrcE
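The forwarding priority (Memory stage before Writeback stage) can be sketched in Python as follows. The function name is hypothetical. The slide gives only the ForwardAE case; reusing the same function for ForwardBE with Rs2E is an assumption made here by symmetry.

```python
def forward_select(rs_e, rd_m, rd_w, reg_write_m, reg_write_w):
    """Return the 2-bit forwarding mux select for one Execute-stage source."""
    if rs_e == rd_m and reg_write_m and rs_e != 0:
        return 0b10  # newest value: forward from the Memory stage
    if rs_e == rd_w and reg_write_w and rs_e != 0:
        return 0b01  # otherwise forward from the Writeback stage
    return 0b00      # no hazard: use the register file output


# ForwardAE for Rs1E; ForwardBE (assumed analogous) would pass Rs2E instead.
forward_ae = forward_select(rs_e=8, rd_m=8, rd_w=8,
                            reg_write_m=True, reg_write_w=True)
print(format(forward_ae, "02b"))
```

Checking the Memory stage first matters when both later stages write the same register, since the Memory-stage value is the more recent one.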


    RISC-V Pipelined Processor with Hazard Unit

    [Figure: complete pipelined RISC-V datapath with Control Unit and Hazard Unit; the Hazard Unit outputs are StallF, StallD, FlushD, FlushE, ForwardAE, and ForwardBE.]



    Pipelined Performance Example

    • SPECINT2000 benchmark:
      – 25% loads
      – 10% stores
      – 13% branches
      – 52% R-type

    • Suppose:
      – 40% of loads used by next instruction
      – 50% of branches mispredicted

    • What is the average CPI?
      – Load CPI = 1 when not stalling, 2 when stalling
        So, CPIlw = 1(0.6) + 2(0.4) = 1.4
      – Branch CPI = 1 when not stalling, 3 when stalling
        So, CPIbeq = 1(0.5) + 3(0.5) = 2

    Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
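The weighted-average calculation above can be reproduced with a small Python sketch, which makes it easy to try other instruction mixes. The dictionary keys and function name are hypothetical; the numbers are the ones on the slide.

```python
LOAD_STALL_CPI = 2    # load followed by a dependent instruction: 1 stall cycle
BRANCH_MISS_CPI = 3   # mispredicted branch: 2 flushed instructions


def average_cpi(mix, load_use_rate, mispredict_rate):
    """Weight each instruction class's CPI by its share of the mix."""
    cpi_load = 1 * (1 - load_use_rate) + LOAD_STALL_CPI * load_use_rate
    cpi_branch = 1 * (1 - mispredict_rate) + BRANCH_MISS_CPI * mispredict_rate
    return (mix["load"] * cpi_load
            + mix["store"] * 1
            + mix["branch"] * cpi_branch
            + mix["r_type"] * 1)


specint2000 = {"load": 0.25, "store": 0.10, "branch": 0.13, "r_type": 0.52}
cpi = average_cpi(specint2000, load_use_rate=0.40, mispredict_rate=0.50)
print(f"Average CPI = {cpi:.2f}")
```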



    Pipelined Performance

    Pipelined processor critical path:

    Tc3 = max of
      tpcq + tmem + tsetup                       Fetch
      2(tRFread + tsetup)                        Decode
      tpcq + 4tmux + tALU + tAND-OR + tsetup     Execute
      tpcq + tmem + tsetup                       Memory
      2(tpcq + tmux + tRFwrite)                  Writeback

    • Decode and Writeback stages both use the register file in each cycle
    • So each stage gets half of the cycle time (Tc/2) to do its work
    • Or, stated a different way, 2x of each stage's work must fit in a cycle (Tc)


    Pipelined Processor Critical Path

    tpcq + 4tmux + tALU + tAND-OR + tsetup     Execute

    [Figure: pipelined RISC-V datapath with Hazard Unit, illustrating the Execute-stage critical path.]

    beq in Execute stage that requires forwarding


    Pipelined Performance Example

    Element                    Parameter    Delay (ps)
    Register clock-to-Q        tpcq_PC      40
    Register setup             tsetup       50
    Multiplexer                tmux         30
    AND-OR gate                tAND-OR      20
    ALU                        tALU         120
    Decoder (control unit)     tdec         35
    Memory read                tmem         200
    Register file read         tRFread      100
    Register file setup        tRFsetup     60

    Cycle time: Tc3 = tpcq + 4tmux + tALU + tAND-OR + tsetup
                    = (40 + 4(30) + 120 + 20 + 50) ps = 350 ps
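As a cross-check that Execute really is the slowest stage, the per-stage formulas from the Pipelined Performance slide can be evaluated with the delays in this table. This is a sketch under two assumptions: the dictionary keys are hypothetical names, and tRFwrite in the Writeback formula is taken to be the tRFsetup value (60 ps), since the table lists no separate tRFwrite.

```python
delays_ps = {
    "tpcq": 40, "tsetup": 50, "tmux": 30, "tAND_OR": 20,
    "tALU": 120, "tmem": 200, "tRFread": 100, "tRFsetup": 60,
}


def stage_delays(d):
    """Evaluate each stage's delay expression, in picoseconds."""
    return {
        "Fetch": d["tpcq"] + d["tmem"] + d["tsetup"],
        "Decode": 2 * (d["tRFread"] + d["tsetup"]),
        "Execute": d["tpcq"] + 4 * d["tmux"] + d["tALU"] + d["tAND_OR"] + d["tsetup"],
        "Memory": d["tpcq"] + d["tmem"] + d["tsetup"],
        # Assumption: tRFwrite is approximated by tRFsetup from the table.
        "Writeback": 2 * (d["tpcq"] + d["tmux"] + d["tRFsetup"]),
    }


stages = stage_delays(delays_ps)
critical_stage = max(stages, key=stages.get)
print(stages)
print(f"Tc3 = {stages[critical_stage]} ps, set by the {critical_stage} stage")
```

Under these assumptions Decode comes second at 300 ps, so a faster ALU or fewer multiplexers in Execute would soon make the register file the limit.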


    Pipelined Performance Example

    Program with 100 billion instructions

    Execution Time = (# instructions) × CPI × Tc
                   = (100 × 10^9)(1.23)(350 × 10^-12)
                   = 43 seconds


    Processor Performance Comparison

    Processor       Execution Time (seconds)    Speedup (single-cycle as baseline)
    Single-cycle    75                          1
    Multicycle      144                         0.5
    Pipelined       43                          1.7
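The pipelined execution time and the speedup column can be recomputed with a few lines of Python. The function names are hypothetical; the single-cycle and multicycle times (75 s and 144 s) are taken from the table as given, and the pipelined time is rounded to 43 s before computing its speedup, as in the table.

```python
def execution_time_s(instructions, cpi, cycle_time_s):
    """Execution time = instruction count x CPI x clock period."""
    return instructions * cpi * cycle_time_s


def speedup(baseline_s, time_s):
    """Speedup relative to a baseline; values below 1 mean a slowdown."""
    return baseline_s / time_s


pipelined_s = execution_time_s(100e9, 1.23, 350e-12)
times_s = {"Single-cycle": 75, "Multicycle": 144, "Pipelined": round(pipelined_s)}

for name, t in times_s.items():
    print(f"{name:12} {t:4} s  speedup {speedup(times_s['Single-cycle'], t):.1f}")
```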