EE108A Lecture 11: Pipelining - Stanford CVA...

2/24/2005


EE 108A Lecture 11 (c) 2005-2007 W. J. Dally and D. Black-Schaffer, 11/1/2006

EE108A

Lecture 11: Pipelining


Announcements

• Lab 5: easier to implement than lab 4, but harder to debug

• Quiz 2 will be on 11/14

• Metastability demonstration at next week’s lecture (11/7)
  – When good flip-flops go bad
  – This is cool because no one ever demonstrates this except Bill!



Some Comments on Labs 4 and 5

• Complete your design before you start coding
  – Understand your interfaces
  – Prepare a timing table (remember it takes a cycle to read a RAM/ROM)
  – Partition functions

• Keep it simple
  – Address each feature in exactly one place
  – Don’t change interfaces

• Incremental debugging
  – Get one piece working at a time
  – Find out where the signal stops

• Think about “event flow”
  – What event causes things to happen, then what event follows
    • e.g., new_frame, then next_note, etc…


System Design – a process

• Design (25% of the time)
  – Specification
    • Understand what you need to build
  – Divide and conquer
    • Break it down into manageable pieces
  – Define interfaces (this takes practice to get it right)
    • Clearly specify every signal between pieces
    • Hide implementation
    • Choose representations
  – Timing and sequencing
    • Overall timing – use a table
    • Timing of each interface – use a simple convention (e.g., valid – ready)
  – Add parallelism as needed (pipeline or duplicate units)
  – Timing and sequencing (of parallel structures)
  – Design each module

• Code (25% of the time)

• Verify (50+% of the time) <= Note the time here. Plan for it.

Iterate back to the top at any step as needed.



Making things Faster

• Question:
  – Why do people design circuits for computation?
  – Why not just use a programmable processor and a compiler?
  – Verilog is so much harder to program and debug... what’s the benefit?

• Answer:
  – Performance (parallelism)
  – Efficiency (just use what you need)
    • You can tailor the circuit to be just the right tradeoff for your design
    • Exactly as much computation power as you need
    • Result: much higher (100x or more) efficiency than a generic processor

but... much higher (1000x or more) cost to develop (when does this tradeoff make sense?)


Two ways to make things faster: pipeline and parallel

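[Figure: a pipelined organization (units A, B, C, D in sequence) and a parallel organization (a Master distributing work to units A, B, C, D)]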



Pipelines


More Like an Assembly Line


Pipelines

• Like an assembly line – each pipeline stage does part of the work and passes the ‘workpiece’ to the next stage

• Example 1: Pipelined 32b Adder

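[Figure: 32-bit ripple-carry adder built from full adders (FA) for bits 0 through 31; bit i takes a_i, b_i, and carry-in c_i and produces s_i and carry-out c_(i+1), from c0 at bit 0 up to c32 at bit 31]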


Split into 4 8-bit adders

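[Figure: four 8-bit adders chained by their carries. The first adds a[7:0] and b[7:0] with carry-in c0, producing s[7:0] and c8; the second adds a[15:8] and b[15:8], producing s[15:8] and c16; the third adds a[23:16] and b[23:16], producing s[23:16] and c24; the fourth adds a[31:24] and b[31:24], producing s[31:24] and c32]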
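The split into four 8-bit adders can be checked with a small behavioral model. This is only a sketch in Python, not the lecture's circuit or Verilog: the function names `add8` and `add32_by_slices` are made up for illustration, and each 8-bit adder is modeled as plain integer addition on one byte plus a carry.

```python
def add8(a, b, carry_in):
    """Add two 8-bit slices plus a carry; return (8-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add32_by_slices(a, b, c0=0):
    """Add two 32-bit values with four chained 8-bit adders.

    Returns (32-bit sum, c32). The carry out of each slice (c8, c16, c24)
    becomes the carry in of the next slice.
    """
    carry = c0
    result = 0
    for i in range(4):  # slice 0 is bits [7:0], slice 3 is bits [31:24]
        shift = 8 * i
        s, carry = add8((a >> shift) & 0xFF, (b >> shift) & 0xFF, carry)
        result |= s << shift
    return result, carry


if __name__ == "__main__":
    total, c32 = add32_by_slices(0x12345678, 0x11111111)
    print(hex(total), c32)
```

Because each slice only needs the carry from the slice below it, the four slices can later be placed in separate pipeline stages.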



Split into stages: 4 problems ‘in process’ at once

[Figure: the four 8-bit adders arranged as four pipeline stages, one per byte, from a[7:0], b[7:0] (carry-in c0) up to a[31:24], b[31:24], with carries c8, c16, c24, and c32 passed from stage to stage. Pipeline registers between the stages store intermediate results.]

Pipeline Diagram: illustrates pipeline timing

(rows: problems; columns: time in cycles; entries: the bits being added in that cycle)

Cycle   0      1      2      3      4      5      6      7
P0      7:0    15:8   23:16  31:24
P1             7:0    15:8   23:16  31:24
P2                    7:0    15:8   23:16  31:24
P3                           7:0    15:8   23:16  31:24
P4                                  7:0    15:8   23:16  31:24



Movie

[Animation: the pipelined adder figure shown cycle by cycle (Cycle 1, Cycle 2, Cycle 3, ...). P0 appears at the first stage in Cycle 1; in each later frame a new problem (P1, P2, ...) enters while the earlier ones advance one stage, with the labels S0, S1, S2 marking their results.]


Takes 3 cycles to fill the pipeline
Then 1 output per cycle
Takes 3 cycles to drain the pipeline
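The fill, steady-state, and drain behavior can be reproduced with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the lecture's design: the function name `pipelined_add` and the tuple layout of the pipeline registers are invented for illustration, and one new problem is assumed to enter on every cycle until the input runs out.

```python
def pipelined_add(problems, stages=4, width=8):
    """Simulate the 4-stage pipelined adder, one new problem per cycle.

    problems: list of (a, b) pairs of 32-bit values.
    Returns (results, cycles): results[i] is the 32-bit sum of problems[i]
    (the final carry c32 is dropped) and cycles is the number of clock
    cycles needed to finish every problem.
    """
    regs = [None] * stages          # regs[k]: pipeline register after stage k
    queue = list(problems)
    results = []
    cycles = 0
    mask = (1 << width) - 1
    while len(results) < len(problems):
        new_regs = [None] * stages
        for k in range(stages):
            if k == 0:
                # (a, b, partial sum so far, carry into this stage)
                item = (*queue.pop(0), 0, 0) if queue else None
            else:
                item = regs[k - 1]  # value latched by the previous stage
            if item is not None:
                a, b, partial, carry = item
                shift = width * k
                total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
                new_regs[k] = (a, b, partial | ((total & mask) << shift),
                               total >> width)
        regs = new_regs             # clock edge: all registers update together
        cycles += 1
        if regs[-1] is not None:
            results.append(regs[-1][2])
    return results, cycles


if __name__ == "__main__":
    sums, cycles = pipelined_add([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)])
    print(sums, cycles)
```

In this model a single problem takes 4 cycles, and N back-to-back problems take N + 3 cycles, which matches five problems occupying cycles 0 through 7 in the pipeline diagram.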



Latency and Throughput of a Pipeline

• Suppose that before pipelining the delay of our 32b adder is 3200ps (100ps per bit), and this adder can do one problem every 3200ps for a throughput of 1/3200ps ≈ 312Mops

• What is the delay (latency) and throughput of the adder with pipelining?

Suppose t_dCQ = 100ps, t_s = 50ps, t_k = 50ps (200ps overhead for each FF), with n = 4 stages and t_stage = 3200ps/4 = 800ps

• t_pipe = n(t_stage + t_dCQ + t_s + t_k) = 4(800ps + 200ps) = 4(1000ps) = 4000ps

• Θ = n/t_pipe = 1/(t_stage + t_dCQ + t_s + t_k) = 4/4000ps = 1Gops

• So it now takes 4000ps for the first result, but we get one every 1000ps

• Big win if you’re doing lots of adds back-to-back…

…loss if you only do an add every 4 cycles or less often.
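The same arithmetic can be written as a short calculation. This is a sketch that assumes the logic delay divides evenly across the stages (3200ps / 4 = 800ps per stage); the function name `pipeline_timing` and its parameter names are chosen here for illustration.

```python
def pipeline_timing(t_logic_ps, n_stages, t_dcq_ps, t_s_ps, t_k_ps):
    """Return (latency in ps, throughput in ops per second).

    Assumes the combinational delay t_logic_ps splits evenly over n_stages
    and that every stage pays the full flip-flop overhead.
    """
    t_stage = t_logic_ps / n_stages                  # logic delay per stage
    t_cycle = t_stage + t_dcq_ps + t_s_ps + t_k_ps   # clock period
    latency_ps = n_stages * t_cycle                  # t_pipe
    throughput = 1e12 / t_cycle                      # one result per cycle
    return latency_ps, throughput


if __name__ == "__main__":
    latency, rate = pipeline_timing(3200, 4, 100, 50, 50)
    print(latency, "ps latency,", rate / 1e9, "Gops")
    print(1e12 / 3200 / 1e6, "Mops without pipelining")
```

With the numbers above this gives 4000ps of latency and 1Gops, against 3200ps and about 312Mops for the unpipelined adder: latency gets worse by the register overhead while throughput improves by a bit more than 3x.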


Example 2: Processor Pipeline

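[Figure: processor pipeline with stages Fetch, Read, ALU, Mem, and Write, built from the PC, Inst Cache, Regs, ALU, Data Cache, and a Mux, with instruction registers IRR, IRA, IRM, and IRW between the stages; Inst1 through Inst4 are shown overlapping in the pipeline]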






• Example 3: Graphics rendering pipeline

[Figure: a triangle pipeline (Xform, Clip, Light, Rasterize) that takes triangles as input, followed by a fragment pipeline (Shade, Composite) that works on fragments, with Textures, Z-buffer, and Frame Buffer memories]


Example 4 – Packet Processing Pipeline

[Figure: packet processing pipeline with stages Framer, Policing, Route Lookup, Switch Scheduling, Queue Mgt, Output Scheduler, and Framer]

And each of these modules is internally pipelined

You get the idea. Lots of systems are organized this way.



Pipelines: Key Point

• Slower for a single problem (overhead from FFs and partitioning)

• If we can input one new problem per cycle we can get an output every cycle

• By splitting up the logic the cycles get shorter

• So, if we can keep the pipeline full, we can get dramatically better throughput.

• But it’s rarely so easy...


Issues with pipelines (all deal with time per stage)

• Load balance (across stages)
  – One stage takes longer to process each input than the others and becomes a ‘bottleneck’
  – Example
    • Rasterizing an ‘average’ triangle in a graphics pipeline takes more time than ‘lighting’ its vertices.

• Variable load (across data)
  – A given stage takes more time on some inputs than others
  – Example
    • The time needed to rasterize a triangle is proportional to the number of fragments in the triangle. The average triangle may contain 20 fragments, but triangles range from 0 to over 1M

• Long latency
  – A stage may require a long-latency operation (e.g., texture access)

• Rigid pipeline: each stage takes the same time, so all stages must take the maximum time. (Inefficient if some are waiting for others.)



Load Balancing Pipelines

• Suppose transform takes 2 cycles and clip 4 cycles

• Clip is a ‘bottleneck’ pipeline stage

• Xform unit is busy only half the time

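[Timing diagram: each item goes through Xform and then Clip; after the first item, Xform shows a Stall before each Clip period while it waits for Clip to finish the previous item]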


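Load Balancing Solutions: 1 – Parallel copies of slow unit

[Figure: Xform feeds a Distribute block that sends items to Clip (1) and Clip (2); a Join block merges their outputs]

[Timing diagram: successive items pass through Xform and then alternate between Clip (1) and Clip (2)]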



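Load Balancing Solutions: 2 – Split slow pipeline stage

[Figure: Xform followed by Clip A and Clip B as separate pipeline stages]

[Timing diagram: successive items each pass through Xform, Clip A, and Clip B, overlapped one stage apart]

When is it better to split? To copy?
Throughput and latency are the same

[Figure: the split pipeline (Xform, Clip A, Clip B) shown with the copied pipeline (Xform, Distribute, Clip (1) and Clip (2), Join)]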
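The three arrangements (one Clip unit, two copies of Clip, Clip split in two) can be compared with a simple steady-state calculation. This is a sketch with idealized assumptions that are not stated on the slides: Clip splits evenly into two 2-cycle halves, the Distribute and Join blocks add no delay, and extra pipeline registers cost nothing. The function name `pipeline_rate` is made up for illustration.

```python
def pipeline_rate(stages):
    """stages: list of (name, cycles_per_item, copies).

    Returns (cycles between outputs in steady state, latency in cycles).
    A stage with several copies accepts a new item every cycles/copies
    cycles; the slowest stage sets the rate of the whole pipeline.
    """
    interval = max(cycles / copies for _, cycles, copies in stages)
    latency = sum(cycles for _, cycles, _ in stages)
    return interval, latency


baseline = [("Xform", 2, 1), ("Clip", 4, 1)]
copied = [("Xform", 2, 1), ("Clip", 4, 2)]            # two parallel Clip units
split = [("Xform", 2, 1), ("Clip A", 2, 1), ("Clip B", 2, 1)]

if __name__ == "__main__":
    for name, design in [("baseline", baseline), ("copied", copied), ("split", split)]:
        interval, latency = pipeline_rate(design)
        xform_busy = 2 / interval                      # fraction of time Xform works
        print(name, interval, latency, xform_busy)
```

Under these assumptions the baseline produces one item every 4 cycles with Xform busy half the time, while both fixes produce one item every 2 cycles with the same 6-cycle latency. The choice between them then comes down to factors this model leaves out, such as area and whether the slow stage can actually be divided.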


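Variable load
Stage A always takes 10 cycles.
Stage B takes 5 or 15 cycles – averages 10 cycles
Pipeline averages ____ cycles per element

[Figure: stage A (10 cycles) followed by stage B (5 or 15 cycles)]

[Timing diagram: successive elements pass through A and then B, with slots marked I and S between some of the A and B periods]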

Stalling a rigid pipeline
A stall in any stage halts all stages upstream of the stall point instantly (on the next clock)

What if we stopped all stages, not just upstream stages?

How does the delay of this structure scale with the number of stages?

[Figure: stages A, B, and C chained with valid and ready signals; each stage also has an int_ready signal]


Double Buffer
Add an extra buffer to each stage that is filled during the first cycle of a stall.

[Figure: a pipeline stage with an extra Buffer, a Full flag, and a Mux; signals valid_u, valid_b, valid_d, ready_u, ready_b, ready_d, and int_ready]

Input goes to stage when ready, or to buffer when stalled.


What is the logic equation for “next_full”? “next_buf”? “mux_sel”?

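One way to explore the double buffer is a cycle-level behavioral model. The sketch below is one possible behavior consistent with the description ("input goes to stage when ready, or to buffer when stalled"), not necessarily the answer intended in the lecture. The class `DoubleBufferStage`, the helper `run`, and their fields are hypothetical names; only `full`, the buffer, and the mux correspond to labels in the figure.

```python
class DoubleBufferStage:
    """Behavioral model of one pipeline stage with an extra buffer."""

    def __init__(self):
        self.out = None     # main stage register; None means empty
        self.buf = None     # extra buffer, used only during a stall
        self.full = False   # True while the extra buffer holds an item

    def ready_up(self):
        # Ready to upstream depends only on local state (the Full flag),
        # so no combinational ready chain runs through every stage.
        return not self.full

    def clock(self, item_in, ready_down):
        """Advance one cycle; return the item handed downstream, or None.

        item_in: item offered by upstream this cycle (None if nothing).
        ready_down: True if downstream accepts the current output.
        """
        accepted = item_in is not None and not self.full
        delivered = self.out if ready_down else None
        stalled = self.out is not None and not ready_down
        if stalled:
            if accepted:                  # first cycle of a stall: park the input
                self.buf, self.full = item_in, True
        elif self.full:                   # after a stall the mux picks the buffer
            self.out, self.buf, self.full = self.buf, None, False
        else:                             # normal flow: input goes to the stage
            self.out = item_in if accepted else None
        return delivered


def run(items, ready_pattern):
    """Feed items through one stage; upstream holds an item until accepted."""
    stage, pending, received = DoubleBufferStage(), list(items), []
    for ready_down in ready_pattern:
        offer = pending.pop(0) if pending and stage.ready_up() else None
        out = stage.clock(offer, ready_down)
        if out is not None:
            received.append(out)
    return received


if __name__ == "__main__":
    print(run("ABCD", [True, True, False, False, True, True, True]))
```

In this model the upstream stage learns about a stall one cycle late, and the item it sends during that cycle lands in the extra buffer instead of being lost, so items still come out in order.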



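Double Buffer Timing

[Timing diagram: stage (0 to 3) versus time (0 to 5) for items A through F. The output is marked Not ready at first and Ready later; while it is not ready, items back up and some stages hold two items at once (shown as pairs such as BA, DC, and FE).]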


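Double Buffer Alternate Timing (how do you make this happen?)

[Timing diagram: stage (0 to 3) versus time (0 to 5) for items A through F, with the output marked Not ready and then Ready. Items again double up in stages (pairs such as BA, DC, and FE), but follow an alternate schedule once the output is ready.]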


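Elastic Pipelines
A FIFO between stages decouples timing
Allows stages to operate at their ‘average’ speed

[Figure: stage A (10 cycles) and stage B (5 or 15 cycles) connected through a FIFO]

[Timing diagram: successive elements pass through A and then B, with one slot marked I]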
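The effect of the FIFO can be estimated with a small timing model of the two stages. This is a sketch under assumptions that are not in the slides: stage B's times arrive in bursts (four 15-cycle items, then four 5-cycle items, repeating), a rigid pipeline is modeled as both stages advancing in lock step, and the function names `rigid_total` and `elastic_total` are made up for illustration.

```python
def rigid_total(t_a, t_b_list):
    """Total cycles for a lock-step two-stage pipeline (non-empty input).

    Both stages advance together, so each step lasts as long as the
    slower of the two stages needs for its current item.
    """
    total = t_a                               # fill: A works on item 0 alone
    for i in range(1, len(t_b_list)):         # A on item i while B on item i-1
        total += max(t_a, t_b_list[i - 1])
    return total + t_b_list[-1]               # drain: B finishes the last item


def elastic_total(t_a, t_b_list, depth):
    """Total cycles when a FIFO of `depth` items (depth >= 1) sits between A and B."""
    b_start = []                              # time at which B starts each item
    a_free = 0                                # time at which A can start its next item
    b_done = 0
    for i, t_b in enumerate(t_b_list):
        a_done = a_free + t_a
        # A can push item i only once B has started item i - depth,
        # which is when the FIFO has a free slot again.
        push = a_done if i < depth else max(a_done, b_start[i - depth])
        a_free = push
        start = max(push, b_done)             # B needs the item and must be idle
        b_start.append(start)
        b_done = start + t_b
    return b_done


if __name__ == "__main__":
    b_times = ([15] * 4 + [5] * 4) * 125      # 1000 items, average 10 cycles
    n = len(b_times)
    print("rigid  :", rigid_total(10, b_times) / n, "cycles per element")
    print("elastic:", elastic_total(10, b_times, depth=8) / n, "cycles per element")
```

With these inputs the lock-step model averages roughly 12.5 cycles per element, while the version with an 8-entry FIFO averages close to 10, the average speed of the stages. How deep the FIFO must be depends on how long the slow bursts last.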


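Resource Sharing
Suppose two pipeline stages need to access the same memory

[Figure: pipeline stages A, B, C, D; signals labeled a and d go to an arbiter (Arb) that controls a Mux in front of a shared Memory]

How would you set the priority on the arbiter?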



Pipeline overview

• Divide large problem into stages assembly-line style
• Divide evenly or load imbalance will occur
  – Fix by splitting or copying bottleneck stage
• Rigid pipelines have no extra storage between stages
  – A stall on any stage halts all upstream stages
  – Hard to stop 100 stages at once
    • Make this scalable with double-buffering
• Variable load results in stalls and idle cycles on a rigid pipeline
  – Make pipeline elastic by adding FIFOs between key stages
