EE108A Lecture 11: Pipelining - Stanford CVA...

19
2/24/2005 1 EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1 11/1/2006 EE108A Lecture 11: Pipelining EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2 11/1/2006 Announcements Lab 5: easier to implement than lab 4, but harder to debug Quiz 2 will be on 11/14 Metastability demonstration at next week’s lecture (11/7) When good flip-flops go bad This is cool because no one every demonstrates this except Bill!

Transcript of EE108A Lecture 11: Pipelining - Stanford CVA...

Page 1: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

1

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 111/1/2006

EE108A

Lecture 11: Pipelining

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 211/1/2006

Announcements

• Lab 5: easier to implement than lab 4, but harder to debug

• Quiz 2 will be on 11/14

• Metastability demonstration at next week’s lecture (11/7)– When good flip-flops go bad– This is cool because no one every demonstrates this except Bill!

Page 2: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

2

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 311/1/2006

Some Comments on Labs 4 and 5

• Complete your design before you start coding– Understand your interfaces– Prepare a timing table

(remember it takes a cycle to read a RAM/ROM)– Partition functions

• Keep it simple– Address each feature in exactly one place– Don’t change interfaces

• Incremental debugging– Get one piece working at a time– Find out where the signal stops

• Think about “event flow”– What event causes things to happen, then what event

• e.g., new_frame, then next_note, etc…

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 411/1/2006

System Design – a process

• Design (25% of the time)– Specification

• Understand what you need to build– Divide and conquer

• Break it down into manageable pieces– Define interfaces (this takes practice to get it right)

• Clearly specify every signal between pieces• Hide implementation• Choose representations

– Timing and sequencing• Overall timing – use a table• Timing of each interface – use a simple convention (e.g., valid – ready)

– Add parallelism as needed (pipeline or duplicate units)– Timing and sequencing (of parallel structures)– Design each module

• Code (25% of the time)• Verify (50+% of the time) <= Note the time here. Plan for it.

Iterate back to the top at any step as needed.

Page 3: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

3

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 511/1/2006

Making things Faster

• Question:– Why do people design circuits for computation?– Why not just use a programmable processor and a compiler?– Verilog is so much harder to program and debug...what’s the benefit?

• Answer:– Performance (parallelism)– Efficiency (just use what you need)

• You can tailor the circuit to be just the right tradeoff for your design• Exactly as much computation power as you need• Result: Much higher (100x or more) efficiency than a generic processor

but...much higher (1000x or more) cost to develop(when does this tradeoff make sense?)

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 611/1/2006

Two ways to make things faster: pipeline and parallel

A B C D

Master

A B C D

Page 4: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

4

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 711/1/2006

Pipelines

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 811/1/2006

More Like an Assembly Line

Page 5: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

5

Pipelines

• Like an assembly line – each pipeline stage does part of the work andpasses the ‘workpiece’ to the next stage

• Example 1: Pipelined 32b Adder

FAa0

b0

c0

s0

c1

FAa1

b1

s1

c2

FAa31

b31

s31

c32

c31

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1011/1/2006

Split into 4 8-bit adders

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adda[15:8]

b[15:8]s[15:8]

c16

Adda[23:16]

b[23:16]s[23:16]

c24

Adda[31:24]

b[31:24]s[23:16]

c32

Page 6: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

6

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1111/1/2006

Split into stages4 problems ‘in process’ at once

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

Pipeline Register(Stores intermediate results)

Pipeline DiagramIllustrates pipeline timing

Problems

Time

P0

P1

P2

P3

P4

7:0 15:8

0

Cycle

23:16 31:24

1 2 3 4 5 6 7

7:0 15:8 23:16 31:24

7:0 15:8 23:16 31:24

7:0 15:8 23:16 31:24

7:0 15:8 23:16 31:24

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

Page 7: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

7

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1311/1/2006

Movie

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1411/1/2006

Cycle 1

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

S0

P0

Page 8: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

8

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1511/1/2006

Cycle 2

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

S0

P1

P0

S1

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1611/1/2006

Cycle 3

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

P2

S2

S0

P0

P1

S1

Page 9: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

9

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1711/1/2006

Cycle 3

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

P2

S2

S0

P1

S1

P2

S2

Pipeline DiagramIllustrates pipeline timing

Problems

Time

P0

P1

P2

P3

P4

7:0 15:8

0

Cycle

23:16 31:24

1 2 3 4 5 6 7

7:0 15:8 23:16 31:24

7:0 15:8 23:16 31:24

7:0 15:8 23:16 31:24

7:0 15:8 23:16 31:24

a[31:24]

b[31:24]

a[23:16]

b[23:16]

Adda[7:0]

b[7:0]

c0

s[7:0]

c8

Adds[15:8]

c16

Adds[23:16]

Adds[23:16]

a[15:8]

b[15:8]

c24

c32 c32

Takes 3 cycles to fill the pipelineThen 1 output per cycleTakes 3 cycles to drain the pipeline

Page 10: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

10

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 1911/1/2006

Latency and Throughput of a Pipeline

• Suppose before pipelining the delay of our 32b adder is 3200ps (100psper bit) and this adder can do one problem each 3200ps for a throughputof 1/3200ps = 312Mops

• What is the delay (latency) and throughput of the adder with pipelining?

Suppose tdCQ = 100ps, ts = 50ps, tk = 50ps (200ps Overhead for each FF)

• tpipe = n(tstage + tdCQ + ts + tk) 4(1000) = 4000

• Θ = n/tpipe = 1/(tstage + tdCQ + ts + tk) 4/4000 = 1Gops

• So it now takes 4000ps for the first result, but we get one every 1000ps• Big win if you’re doing lots of adds back-to-back…

…loss if you only do them every 4 cycles or less.

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2011/1/2006

Example 2: Processor Pipeline

PC

Inst

Cache

IRR

Regs

IRA

ALU

IRM

IRW

RegsMux

Data

Cache

Inst1

Inst2

Inst3

Inst4

Fetch Read ALU Mem Write

Fetch Read ALU Mem Write

Fetch Read ALU Mem Write

Fetch Read ALU Mem Write

Page 11: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

11

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2111/1/2006

• Example 3: Graphics rendering pipeline

Xformtriangles

Clip Light Rasterize

triangle pipeline

Shade Composite

fragment pipeline

Textures Z-bufferFrame

Buffer

triangle

fragment

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2211/1/2006

Example 4 – Packet Processing Pipeline

And each of these modules is internally pipelined

You get the idea. Lots of systems are organized this way.

Framer PolicingRoute

Lookup

Switch

Scheduling

Queue

Mgt

Output

SchedulerFramer

Page 12: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

12

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2311/1/2006

Pipelines: Key Point

• Slower for a single problem (overhead from FFs and partitioning)

• If we can input one new problem per cycle wecan get an output every cycle

• By splitting up the logic the cycles get shorter

• So, if we can keep the pipeline full, we can get dramaticallybetter throughput.

• But it’s rarely so easy...

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2411/1/2006

Issues with pipelines(all deal with time per stage)• Load balance (across stages)

– one stage takes longer to process each input than the others –becomes a ‘bottleneck’

– Example• Rasterizing an ‘average’ triangle in a graphics pipeline takes more time

than ‘lighting’ its vertices.• Variable load (across data)

– A given stage takes more time on some inputs than others– Example

• The the time needed to rasterize a triangle is proportional to the numberof fragments in the triangle. The average triangle may contain 20fragments, but triangles range from 0 to over 1M

• Long latency– A stage may require a long latency operation (e.g., texture access)

• Rigid Pipeline: each stage takes the same time, so all stages must takethe maximum time. (Inefficient if some are waiting for others.)

Page 13: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

13

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2511/1/2006

Load Balancing Pipelines

• Suppose transform takes 2 cycles and clip 4 cycles

• Clip is a ‘bottleneck’ pipeline stage

• Xform unit is busy only half the time

Xform Clip

Stall ClipXform

Stall ClipXform

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2611/1/2006

Load Balancing Solutions1 – Parallel copies of slow unit

Xform

Clip (1)

Distribute

Clip (2)

Join

Xform Clip (1)

Clip (2)Xform

Clip (1)Xform

Clip (2)Xform

Page 14: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

14

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 2711/1/2006

Load Balancing Solutions2 – Split slow pipeline stage

Xform Clip A Clip B

Xform Clip A Clip B

Xform Clip A Clip B

Xform Clip A Clip B

Xform Clip A Clip B

When is it better to split? To copy?Throughput and latency are the same

Xform Clip A Clip B

Xform

Clip (1)

Distribute

Clip (2)

Join

Page 15: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

15

Variable loadStage A always takes 10 cycles.Stage B takes 5 or 15 cycles – averages 10 cyclesPipeline averages ____ cycles per element

A10 cycles

B5 or 15 cycles

IS

A B

A B

A B

SA B

ISA B

Stalling a rigid pipelineA stall in any stage halts all stages upstream of the stall pointinstantly (on the next clock)

What if we stopped all stages, not just upstream stages?

How does the delay of this structure scale with thenumber of stages?

Avalid

ready

int_ready

Bvalid

ready

int_ready

Cvalid

ready

int_ready

valid

ready

Page 16: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

16

Double BufferAdd an extra buffer to each stage that is filled during the firstcycle of a stall.

Full

Buffer Mux

validuvalidb

readyu

readyb

readyd

validd

int_ready

Input goes to stage whenready, or to buffer when stalled.

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 3211/1/2006

What is the logic equation for “next_full”? “next_buf”? “mux_sel”?

Full

Buffer Mux

validuvalidb

readyu

readyb

readyd

validd

int_ready

Page 17: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

17

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 3311/1/2006

Double Buffer Timing

stage

time

0 1 2 3

Not

ready

A B C

B CA

ABA

D

BA

Ready

DC

E

A

B

DC

F

E

A

B

CD

FE

A

B

C

D

0

1

2

3

4

5

FE

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 3411/1/2006

Double Buffer Alternate Timing(how do you make this happen?)

stage

time

0 1 2 3

Not

ready

A B C

B CA

ABA

D

BA

Ready

DC

E

A

B

DC

FE

A

B

C

D

FE

A

B

C

D

E

F0

1

2

3

4

5

Page 18: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

18

Elastic PipelinesA FIFO between stages decouples timingAllows stages to operate at their ‘average’ speed

A10 cycles

B5 or 15 cycles

FIFO

I

A B

A B

A B

A B

A B

A B

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 3611/1/2006

Resource SharingSuppose two pipeline stages need to access the same memory•

A B C D

Memory

Mux

a

Arb

d

How would you set the priority on the arbiter?

Page 19: EE108A Lecture 11: Pipelining - Stanford CVA Groupcva.stanford.edu/people/davidbbs/classes/ee108a/fall0708 lectures/lect... · –Verilog is so much harder to program and debug...what’s

2/24/2005

19

EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer 3711/1/2006

Pipeline overview

• Divide large problem into stages assembly-line style• Divide evenly or load imbalance will occur

– Fix by splitting or copying bottleneck stage• Rigid pipelines have no extra storage between stages

– A stall on any stage halts all upstream stages– Hard to stop 100 stages at once

• Make this scalable with double-buffering• Variable load results in stalls and idle cycles on a rigid pipeline

– Make pipeline elastic by adding FIFOs between key stages