CS:APP Chapter 4 Computer Architecture Pipelined...

Randal E. Bryant

Carnegie Mellon University

CS:APP3e

CS:APP Chapter 4Computer Architecture

PipelinedImplementation

Part I

http://csapp.cs.cmu.edu

– 2 – CS:APP3e

OverviewGeneral Principles of Pipelining

n  Goaln  Difficulties

Creating a Pipelined Y86-64 Processorn  Rearranging SEQn  Inserting pipeline registersn  Problems with data and control hazards

– 3 – CS:APP3e

Real-World Pipelines: Car Washes

Idean  Divide process into

independent stagesn  Move objects through stages

in sequencen  At any given times, multiple

objects being processed

Sequential Parallel

Pipelined

– 4 – CS:APP3e

Computational Example

Systemn  Computation requires total of 300 picosecondsn  Additional 20 picoseconds to save result in registern  Must have clock cycle of at least 320 ps

Combinational logic

300 ps 20 ps

Delay = 320 ps Throughput = 3.12 GIPS

– 5 – CS:APP3e

3-Way Pipelined Version

Systemn  Divide combinational logic into 3 blocks of 100 ps eachn  Can begin new operation as soon as previous one passes

through stage A.l  Begin new operation every 120 ps

n  Overall latency increasesl  360 ps from start to finish

Comb. logic

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

– 6 – CS:APP3e

Pipeline DiagramsUnpipelined

n  Cannot start new operation until previous one completes

3-Way Pipelined

n  Up to 3 operations in process simultaneously

OP1OP2OP3

A B CA B C

OP1OP2OP3

– 7 – CS:APP3e

Operating a Pipeline

OP1OP2OP3

A B CA B C

0 120 240 360 480 640

Comb. logic

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Comb. logic

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Comb. logic

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

– 8 – CS:APP3e

Limitations: Nonuniform Delays

n  Throughput limited by slowest stagen  Other stages sit idle for much of the timen  Challenging to partition system into balanced stages

Comb. logic

50 ps 20 ps 150 ps 20 ps 100 ps 20 ps

Comb. logic A

OP1OP2OP3

A B CA B C

– 9 – CS:APP3e

Limitations: Register Overhead

n  As try to deepen pipeline, overhead of loading registers becomes more significant

n  Percentage of clock cycle spent loading register:l  1-stage pipeline: 6.25% l  3-stage pipeline: 16.67% l  6-stage pipeline: 28.57%

n  High speeds of modern processor designs obtained through very deep pipelining

Delay = 420 ps, Throughput = 14.29 GIPS Clock

Comb. logic

50 ps 20 ps

Comb. logic

50 ps 20 ps

Comb. logic

50 ps 20 ps

Comb. logic

50 ps 20 ps

Comb. logic

50 ps 20 ps

Comb. logic

50 ps 20 ps

– 10 – CS:APP3e

Data Dependencies

Systemn  Each operation depends on result from preceding one

Combinational logic

OP1OP2OP3

– 11 – CS:APP3e

Data Hazards

n  Result does not feed back around in time for next operationn  Pipelining has changed behavior of system

Comb. logic

OP1OP2OP3

A B CA B C

A B COP4 A B C

– 12 – CS:APP3e

Data Dependencies in Processors

n  Result from one instruction used as operand for anotherl  Read-after-write (RAW) dependency

n  Very common in actual programsn  Must make sure our pipeline handles these properly

l  Get correct resultsl  Minimize performance impact

1 irmovq $50, %rax

2 addq %rax , %rbx

3 mrmovq 100( %rbx ), %rdx

– 13 – CS:APP3e

SEQ Hardwaren  Stages occur in sequencen  One operation in process

at a time

– 14 – CS:APP3e

SEQ+ Hardwaren  Still sequential

implementationn  Reorder PC stage to put at

beginning

PC Stagen  Task is to select PC for

current instructionn  Based on results

computed by previous instruction

Processor Staten  PC is no longer stored in

registern  But, can determine PC

based on other stored information

– 15 – CS:APP3e

Adding Pipeline Registers

Instructionmemory

PCincrement

CCCCALUALU

Datamemory

Decode

Execute

Memory

Write back

icode, ifunrA , rB

Registerfile

srcA, srcBdstA, dstB

valA, valB

aluA, aluB

Addr, Data

PCvalE , valM

– 16 – CS:APP3e

Pipeline StagesFetch

n  Select current PCn  Read instructionn  Compute incremented PC

Decoden  Read program registers

Executen  Operate ALU

Memoryn  Read or write data memory

Write Backn  Update register file

– 17 – CS:APP3e

PIPE- Hardwaren  Pipeline registers hold

intermediate values from instruction execution

Forward (Upward) Pathsn  Values passed from one

stage to nextn  Cannot jump past

stagesl  e.g., valC passes

through decode

– 18 – CS:APP3e

Signal Naming ConventionsS_Field

n  Value of Field held in stage S pipeline register

s_Fieldn  Value of Field computed in stage S

– 19 – CS:APP3e

Feedback PathsPredicted PC

n  Guess value of next PC

Branch informationn  Jump taken/not-takenn  Fall-through or target

address

Return pointn  Read from memory

Register updatesn  To register file write

– 20 – CS:APP3e

Predicting the PC

n  Start fetch of new instruction after current one has completed fetch stagel  Not enough time to reliably determine next instruction

n  Guess which instruction will followl  Recover if prediction was incorrect

– 21 – CS:APP3e

Our Prediction StrategyInstructions that Don’t Transfer Control

n  Predict next PC to be valPn  Always reliable

Call and Unconditional Jumpsn  Predict next PC to be valC (destination)n  Always reliable

Conditional Jumpsn  Predict next PC to be valC (destination)n  Only correct if branch is taken

l  Typically right 60% of time

Return Instructionn  Don’t try to predict

– 22 – CS:APP3e

Recovering from PC Misprediction

n  Mispredicted Jumpl  Will see branch condition flag once instruction reaches memory

stagel  Can get fall-through PC from valA (value M_valA)

n  Return Instructionl  Will get return PC when ret reaches write-back stage (W_valM)

– 23 – CS:APP3e

Pipeline Demonstration

File: demo-basic.ys

irmovq $1,%rax #I1

1 2 3 4 5 6 7 8 9

F D E MWirmovq $2,%rcx #I2 F D E M

irmovq $3,%rdx #I3 F D E M Wirmovq $4,%rbx #I4 F D E M Whalt #I5 F D E M W

Cycle 5

– 24 – CS:APP3e

Data Dependencies: 3 Nop’s

0x000: irmovq $10,%rdx

1 2 3 4 5 6 7 8 9

F D E M WF D E M W0x00a: irmovq $3,%rax F D E M WF D E M W0x014: nop F D E M WF D E M W0x015: nop F D E M WF D E M W0x016: nop F D E M WF D E M W0x017: addq %rdx,%rax F D E M WF D E M W

WR[%rax] f3

DvalA fR[%rdx] = 10valB fR[%rax] = 3

# demo-h3.ys

Cycle 6

0x019: halt F D E M WF D E M W

Cycle 7

– 25 – CS:APP3e

Data Dependencies: 2 Nop’s0x000: irmovq $10,%rdx

1 2 3 4 5 6 7 8 9

F D E M WF D E M W0x00a: irmovq $3,%rax F D E M WF D E M W0x014: nop F D E M WF D E M W0x015: nop F D E M WF D E M W0x016: addq %rdx,%rax F D E M WF D E M W0x018: halt F D E M WF D E M W

10# demo-h2.ys

WR[%rax] f3

•••

WR[%rax] f3

•••

Cycle 6

– 26 – CS:APP3e

Data Dependencies: 1 Nop0x000: irmovq $10,%rdx

1 2 3 4 5 6 7 8 9

F D E MW0x00a: irmovq $3,%rax F D E M

0x014: nop F D E M WF D E M W0x015: addq %rdx,%rax F D E M WF D E M W0x017: halt F D E M WF D E M W

# demo-h1.ys

WR[%rdx] f10

•••

Cycle 5

MM_valE = 3M_dstE = %rax

– 27 – CS:APP3e

Data Dependencies: No Nop0x000: irmovq $10,%rdx

1 2 3 4 5 6 7 8

F D E MW0x00a: irmovq $3,%rax F D E M

F D E M W0x014: addq %rdx,%rax

F D E M W0x016: halt

# demo-h0.ys

Cycle 4

MM_valE = 10M_dstE = %rdx

e_valE f 0 + 3 = 3 E_dstE = %rax

– 28 – CS:APP3e

Branch Misprediction Example

n  Should only execute first 8 instructions

0x000: xorq %rax,%rax 0x002: jne t # Not taken 0x00b: irmovq $1, %rax # Fall through 0x015: nop 0x016: nop 0x017: nop 0x018: halt 0x019: t: irmovq $3, %rdx # Target (Should not execute) 0x023: irmovq $4, %rcx # Should not execute 0x02d: irmovq $5, %rdx # Should not execute

demo-j.ys

– 29 – CS:APP3e

Branch Misprediction Trace

n  Incorrectly execute two instructions at branch target

0x000: xorq % rax ,% rax 1 2 3 4 5 6 7 8 9 F D E M

W 0x002: jne t # Not taken F D E M W

0x019: t: irmovq $3, % rdx # Target F D E M W 0x023: irmovq $4, % rcx # Target+1 F D E M W 0x00b: irmovq $1, % rax # Fall Through F D E M W

# demo - j

F D E M W

Cycle 5

E valE f 3 dstE = % rdx

M M_Cnd = 0 M_ valA = 0x007

D valC = 4 dstE = % ecx

D valC = 4 dstE = % rcx

F valC f 1 rB f % rax

– 30 – CS:APP3e

0x000: irmovq Stack,%rsp # Intialize stack pointer 0x00a: nop # Avoid hazard on %rsp 0x00b: nop 0x00c: nop 0x00d: call p # Procedure call 0x016: irmovq $5,%rsi # Return point 0x020: halt 0x020: .pos 0x20 0x020: p: nop # procedure 0x021: nop 0x022: nop 0x023: ret 0x024: irmovq $1,%rax # Should not be executed 0x02e: irmovq $2,%rcx # Should not be executed 0x038: irmovq $3,%rdx # Should not be executed 0x042: irmovq $4,%rbx # Should not be executed 0x100: .pos 0x100 0x100: Stack: # Initial stack pointer

Return Example

n  Require lots of nops to avoid data hazards

demo-ret.ys

– 31 – CS:APP3e

Incorrect Return Example

n  Incorrectly execute 3 instructions following ret

– 32 – CS:APP3e

Pipeline SummaryConcept

n  Break instruction execution into 5 stagesn  Run instructions through in pipelined mode

Limitationsn  Can’t handle dependencies between instructions when

instructions follow too closelyn  Data dependencies

l  One instruction writes register, later one reads itn  Control dependency

l  Instruction sets PC in way that pipeline did not predict correctlyl  Mispredicted branch and return

Fixing the Pipelinen  We’ll do that next time

CS:APP Chapter 4 Computer Architecture Pipelined...

Documents

Transcript of CS:APP Chapter 4 Computer Architecture Pipelined...

Experiment 5 Superfund Lab: Isolating and Identifying Contaminates from ...personal.denison.edu/.../Chem-132-resources/Chem_132-01_Lab_5.pdf · Superfund Lab: Isolating and Identifying

Open Source Software and Consequential Responsibility: GPU ...personal.denison.edu/~havill/cs111s15/open_source_software.pdf— Philosophy and Computers — ... Richard Stallman and

CS:APP Chapter 4 Computer Architecture Pipelined ...raymond.namyst.emi.u-bordeaux.fr/ens/archi-enseirb... · – 2 – CS:APP Overview Make the pipelined processor work! Data Hazards

Using the IBM Opteron 1350 at OSC - Personal Pages - …personal.denison.edu/.../Supplements/opteron-mod1-1010.pdf · 2011-03-02 · • Accessing the IBM 1350 Opteron Cluster •

CS:APP3e CS:APP Chapter 4 Computer Architecture Logic Design CS:APP Chapter 4 Computer Architecture Logic Design CENG331 - Computer Organization Murat.

CSCI2467: Systems Programming Concepts2467.cs.uno.edu/lectures/02data.pdf · Source: CS:APP Bryant & O’Hallaron (Section 2.1) Course Instructors: Matthew Toups Caitlin Boyce Course

ASYMPTOTIC ANALYSIS OF THE SPECTRUM OF …personal.denison.edu/~meim/SMI/MegTaraMaria_Paper.pdfASYMPTOTIC ANALYSIS OF THE SPECTRUM OF THE DISCRETE HAMILTONIAN WITH PERIOD DOUBLING

Using Glenn—the IBM Opteron 1350 Introduction to …personal.denison.edu/~bressoud/cs402-s11/Supplements/...Screen Editors—Cheat Sheet • cat – Create a file: cat > file.name

Anany Levitin. Introduction to the design and analysis …personal.denison.edu/~havill/371/backtracking.pdfAnany Levitin. Introduction to the design and analysis of algorithms, 3rd

E˙icient k-Regret ˚ery Algorithm with Restriction-free Bound for …personal.denison.edu/~lalla/papers/sphere4.pdf · Science and Technology raywong@cse.ust.hk Jian Li Tsinghua

Cache Memory and Performance Memory Hierarchy 1courses.cs.vt.edu/cs2506/Fall2016/Notes/L16.MemoryHierarchy.pdfMemory Hierarchy 19 CS@VT Computer Organization II ©2005-2015 CS:APP

The Confusion Range, west-central Utah: Fold-thrust ...personal.denison.edu/~greened/ges009721full.pdf · Similar structural style and fold-thrust struc-tures are continuous southward

CS:APP Chapter 4 Computer Architecture Sequential ...€¦ · Randal E. Bryant Carnegie Mellon University CS:APP CS:APP Chapter 4 Computer Architecture Sequential Implementation

Chemistry/Biology 302 – Biochemistry: Exam 1 Practice …personal.denison.edu/~kuhlman/courses/biochem/302_Exam1_Q.pdf · Chemistry/Biology 302 – Biochemistry: Exam 1 Practice

CS:APP Chapter 4 Computer Architecture Pipelined ...webby.cc.denison.edu/~bressoud/cs-281-2/05-pipeline-b.pdf– 2 – CS:APP3e Overview Make the pipelined processor work! Data Hazards

[Pipeline] Inspecting Pipeline Installation

CS492B Analysis of Concurrent Programs Memory Hierarchy Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.

Cache Memory and Performance Virtual Memory 1courses.cs.vt.edu/.../Spring2020/notes/L18_VirtualMemory.pdfVirtual Memory 1 CS@VT Computer Organization II ©2005-2020 CS:APP & WD McQuain

Randal E. Bryant Carnegie Mellon University CS:APP CS:APP Chapter 4 Computer Architecture PipelinedImplementation Part II CS:APP Chapter 4 Computer Architecture.

CS492B Analysis of Concurrent Programs Consistency Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.