CS252 Spring 2017 Graduate Computer Architecture Lecture...

WU UCB CS252 SP17 CS252 Spring 2017 Graduate Computer Architecture Lecture 13: Cache Coherence Part 2 Multithreading Part 1 Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17


Last Time in Lecture 12

• Reviewed store policies and cache read/write policies
  • Write through vs. write back
  • Write allocate vs. write no allocate
• Shared memory multiprocessor cache coherence
  • Snoopy protocols: MSI, MESI
  • Intervention
  • False sharing

Review: Cache Coherence vs. Memory Consistency

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date.

~ "A Primer on Memory Consistency and Cache Coherence", D. J. Sorin, M. D. Hill, and D. A. Wood


Cache Coherence: Directory Protocol

© Krste Asanovic, 2015; CS252, Fall 2015, Lecture 13

Scalable Approach: Directories

§ Every memory line has associated directory information
  - keeps track of copies of cached lines and their states
  - on a miss, find directory entry, look it up, and communicate only with the nodes that have copies if necessary
  - in scalable networks, communication with directory and copies is through network transactions
§ Many alternatives for organizing directory information

Directory Cache Protocol

§ Assumptions: Reliable network, FIFO message delivery between any given source-destination pair

[Figure: CPUs, each with its own cache, and directory controllers, each with a DRAM bank, all connected through an interconnection network.]

Each line in cache has state field plus tag (fields: Stat., Tag, Data)

Each line in memory has state field plus bit vector directory with one bit per processor (fields: Stat., Directory, Data)
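The two line formats above can be sketched in a few lines of Python. This is only an illustration: the class names, the NUM_PROCESSORS constant, and the use of a single integer for the line data are assumptions, not part of the lecture.

```python
from dataclasses import dataclass

NUM_PROCESSORS = 6  # assumption: one directory bit per CPU, six CPUs as in the figure


@dataclass
class CacheLine:
    state: str = "invalid"   # per-line state field
    tag: int = 0             # address tag
    data: int = 0            # a single int stands in for the whole line


@dataclass
class MemoryLine:
    state: str = "R"         # per-line state field kept at the home directory
    sharers: int = 0         # bit vector: bit p is set when processor p holds a copy
    data: int = 0

    def add_sharer(self, p: int) -> None:
        self.sharers |= 1 << p

    def remove_sharer(self, p: int) -> None:
        self.sharers &= ~(1 << p)

    def sharer_list(self) -> list:
        return [p for p in range(NUM_PROCESSORS) if (self.sharers >> p) & 1]
```

The bit vector costs one bit per processor for every memory line, which is one reason the earlier slide mentions alternative ways of organizing directory information.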

Cache States

§ For each cache line, there are 4 possible states:
  - C-invalid (= Nothing): The accessed data is not resident in the cache.
  - C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
  - C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
  - C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).

Home directory states

§ For each memory line, there are 4 possible states:
  - R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
  - W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
  - TR(dir): The memory line is in a transient state waiting for the acknowledgements to the invalidation requests that the home site has issued.
  - TW(id): The memory line is in a transient state waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.
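One way to make the four home directory states concrete is to model each as a small tagged value. The sketch below is an assumption about representation only (class names mirror the slide's R, W, TR, TW; the helper functions are hypothetical).

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class R:
    """Shared by the sites in `dir`; memory is valid."""
    dir: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class W:
    """Exclusively cached and modified at site `id`; memory is stale."""
    id: int


@dataclass(frozen=True)
class TR:
    """Transient: waiting for acknowledgements to invalidation requests."""
    dir: FrozenSet[int]


@dataclass(frozen=True)
class TW:
    """Transient: waiting for site `id` to bring memory up to date."""
    id: int


def is_uncached(state) -> bool:
    # R with an empty dir means no site caches the line.
    return isinstance(state, R) and not state.dir


def is_transient(state) -> bool:
    return isinstance(state, (TR, TW))
```

For example, is_uncached(R()) holds, while W(2) and R(frozenset({1})) both describe a line that is cached somewhere.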

Directory Protocol Messages

Message type      Source          Destination     Msg contents
Read miss         Local cache     Home directory  P, A
  – Processor P reads data at address A; send data and make P a read sharer
Write miss        Local cache     Home directory  P, A
  – Processor P writes data at address A; send data and make P the exclusive owner
Invalidate        Home directory  Remote caches   A
  – Invalidate a shared copy at address A.
Fetch             Home directory  Remote cache    A
  – Fetch the block at address A and send it to its home directory
Fetch/Invalidate  Home directory  Remote cache    A
  – Fetch the block at address A and send it to its home directory; invalidate the block in the cache
Data value reply  Home directory  Local cache     Data
  – Return a data value from the home memory
Data write-back   Remote cache    Home directory  A, Data
  – Write-back a data value for address A

Dave Patterson, CS252, Fall 1996

Example Directory Protocol

• Message sent to directory causes two actions:
  – Update the directory
  – More messages to satisfy request
• Block is in Uncached state: the copy in memory is the current value; only possible requests for that block are:
  – Read miss: requesting processor is sent data from memory & requestor is made the only sharing node; state of block is made Shared.
  – Write miss: requesting processor is sent the value & becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
  – Read miss: requesting processor is sent back the data from memory & requesting processor is added to the sharing set.
  – Write miss: requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, & Sharers is set to identity of requesting processor. The state of the block is made Exclusive.

Example Directory Protocol

• Block is Exclusive: current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
  – Read miss: owner processor is sent a data fetch message, which causes state of block in owner's cache to transition to Shared and causes owner to send data to directory, where it is written to memory & sent back to requesting processor. Identity of requesting processor is added to set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  – Data write-back: owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now Uncached, and the Sharers set is empty.
  – Write miss: block has a new owner. A message is sent to old owner causing the cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to identity of new owner, and state of block is made Exclusive.
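The directory-side rules on the last two slides can be condensed into a small state machine. The following is a minimal sketch, not the lecture's implementation: the DirEntry class, the handle function, and the lower-case message names are illustrative, data values are not modeled, and skipping the invalidate for a requester that is itself a sharer is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    state: str = "Uncached"                     # "Uncached", "Shared", or "Exclusive"
    sharers: set = field(default_factory=set)   # the set Sharers from the slides


def handle(entry: DirEntry, msg: str, p: str) -> list:
    """Apply one request from processor p; return the messages the directory sends."""
    out = []
    if msg == "read_miss":
        if entry.state == "Exclusive":
            (owner,) = entry.sharers
            out.append(("fetch", owner))        # owner supplies the block and stays a sharer
        out.append(("data_value_reply", p))
        entry.sharers.add(p)
        entry.state = "Shared"
    elif msg == "write_miss":
        if entry.state == "Shared":
            # Assumption: the requester's own copy needs no invalidate message.
            out += [("invalidate", s) for s in sorted(entry.sharers - {p})]
        elif entry.state == "Exclusive":
            (owner,) = entry.sharers
            out.append(("fetch_invalidate", owner))
        out.append(("data_value_reply", p))
        entry.sharers = {p}
        entry.state = "Exclusive"
    elif msg == "data_write_back":
        entry.sharers = set()
        entry.state = "Uncached"
    else:
        raise ValueError(f"unknown message: {msg}")
    return out
```

Feeding it a write miss from P1, a read miss from P2, a write miss from P2, and a write-back from P2 walks one block through Exclusive {P1}, Shared {P1, P2}, Exclusive {P2}, and back to Uncached.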

Example

A1 and A2 map to the same cache block

                   | P1               | P2               | Bus                     | Directory          | Memory
step               | State Addr Value | State Addr Value | Action Proc. Addr Value | Addr State {Procs} | Value
P1: Write 10 to A1 |                  |                  | WrMs   P1    A1         | A1   Excl. {P1}    |
                   | Excl. A1   10    |                  | DaRp   P1    A1   0     |                    |
P1: Read A1        | Excl. A1   10    |                  |                         |                    |
P2: Read A1        |                  | Shar. A1         | RdMs   P2    A1         |                    |
                   | Shar. A1   10    |                  | Ftch   P1    A1   10    |                    | 10
                   |                  | Shar. A1   10    | DaRp   P2    A1   10    | A1   Shar. {P1,P2} | 10
P2: Write 20 to A1 |                  | Excl. A1   20    | WrMs   P2    A1         |                    | 10
                   | Inv.             |                  | Inval. P1    A1         | A1   Excl. {P2}    | 10
P2: Write 40 to A2 |                  |                  | WrMs   P2    A2         | A2   Excl. {P2}    | 0
                   |                  |                  | WrBk   P2    A1   20    | A1   Unca. {}      | 20
                   |                  | Excl. A2   40    | DaRp   P2    A2   0     | A2   Excl. {P2}    | 0
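The write-back of A1 in the last step happens because A1 and A2 map to the same cache block. A small sketch shows how two different addresses can collide. The direct-mapped organization, block size, block count, and concrete addresses are all assumptions chosen for illustration; the slides give none of them.

```python
BLOCK_SIZE = 32      # bytes per block (assumed)
NUM_BLOCKS = 256     # blocks in a direct-mapped cache (assumed)


def block_index(addr: int) -> int:
    """Cache block that an address maps to."""
    return (addr // BLOCK_SIZE) % NUM_BLOCKS


def tag(addr: int) -> int:
    """Remaining high-order address bits, stored in the line's tag field."""
    return addr // (BLOCK_SIZE * NUM_BLOCKS)


A1 = 0x1040                          # hypothetical address
A2 = A1 + BLOCK_SIZE * NUM_BLOCKS    # one full cache size later: same index, different tag
```

Under these assumptions block_index(A1) equals block_index(A2) while the tags differ, so bringing in A2 must first evict the modified copy of A1.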

Read miss, to uncached or shared line

[Figure: a CPU and its cache connected through the interconnection network to a directory controller and its DRAM bank.]

1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data and return load data to CPU.

Write miss, to read shared line

[Figure: the requesting CPU and cache, the directory controller with its DRAM bank, and multiple sharer CPUs with their caches, all connected through the interconnection network.]

1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.
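Steps 4 to 10 above, seen from the home directory, amount to collecting one acknowledgement per sharer before granting exclusive access. A minimal sketch follows; the DirectoryLine class and its method names are hypothetical, and moving to the TR and W states is an assumption based on the home directory states defined earlier.

```python
class DirectoryLine:
    """Home-side bookkeeping for one memory line (illustrative sketch)."""

    def __init__(self, sharers):
        self.state = "R"
        self.sharers = set(sharers)
        self.requester = None

    def on_ex_req(self, requester):
        """Steps 4-6: send one InvReq per sharer, or reply at once if there are none."""
        assert self.state == "R"
        targets = sorted(self.sharers - {requester})
        if not targets:
            return self._grant(requester)
        self.state, self.requester = "TR", requester
        self.sharers = set(targets)          # sharers whose InvRep is still pending
        return [("InvReq", s) for s in targets]

    def on_inv_rep(self, sharer):
        """Steps 9-10: clear the sharer bit; send ExRep once no sharers remain."""
        self.sharers.discard(sharer)
        if self.sharers:
            return []
        return self._grant(self.requester)

    def _grant(self, requester):
        self.state, self.sharers, self.requester = "W", {requester}, None
        return [("ExRep", requester)]
```

With sharers {1, 2} and a request from site 0, the directory sends two InvReq messages and only returns ExRep after the second InvRep arrives.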

Concurrency Management

§ Protocol would be easy to design if only one transaction in flight across entire system
§ But, want greater throughput and don't want to have to coordinate across entire system
§ Great complexity in managing multiple outstanding concurrent transactions to cache lines
  - Can have multiple requests in flight to same cache line!

Multithreading: Intro to MT and SMT

Multithreading

§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor

Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on same pipeline

Interleave 4 threads, T1-T4, on non-bypassed 5-stage pipe

                      t0 t1 t2 t3 t4 t5 t6 t7 t8 t9
T1: LD   x1,0(x2)     F  D  X  M  W
T2: ADD  x7,x1,x4        F  D  X  M  W
T3: XORI x5,x4,12           F  D  X  M  W
T4: SD   0(x7),x5              F  D  X  M  W
T1: LD   x5,12(x1)                F  D  X  M  W

Prior instruction in a thread always completes write-back before next instruction in same thread reads register file
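The timing claim above can be checked with simple arithmetic on stage positions. This sketch assumes the register file is read in D and written in W, and that the write must land in a strictly earlier cycle than the read; the function names are illustrative.

```python
STAGES = ["F", "D", "X", "M", "W"]   # 5-stage pipe from the slide


def stage_cycle(fetch_cycle: int, stage: str) -> int:
    """Cycle in which an instruction fetched at fetch_cycle occupies the given stage."""
    return fetch_cycle + STAGES.index(stage)


def hazard_free(num_threads: int) -> bool:
    """True if a thread's instruction writes back before that thread's next one decodes."""
    first_fetch = 0
    next_fetch = first_fetch + num_threads      # fixed round-robin interleave
    return stage_cycle(first_fetch, "W") < stage_cycle(next_fetch, "D")
```

With 4 threads the first T1 instruction writes back at t4 and the second T1 instruction decodes at t5, so hazard_free(4) is true; with only 3 threads both events fall in t4 and the check fails under the stated assumption.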

CDC 6600 Peripheral Processors (Cray, 1964)

§ First multithreaded hardware
§ 10 "virtual" I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state

Simple Multithreaded Pipeline

§ Have to carry thread select down pipeline to ensure correct state bits read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs

[Figure: pipeline datapath with four PC registers and four GPR files chosen by a 2-bit thread select, a +1 PC incrementer, I$, IR, X and Y operand registers, and D$.]

Multithreading Costs

§ Each thread requires its own user state
  - PC
  - GPRs
§ Also, needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
§ Other overheads:
  - Additional cache/TLB conflicts from competing threads
  - (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)

Thread Scheduling Policies

§ Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If thread not ready to go in its slot, insert pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme
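The difference between fixed interleave and hardware-controlled scheduling can be sketched as two selection functions. The function names are illustrative, and the round-robin search in hardware_pick is an assumption: the slides do not say which priority scheme a given machine used.

```python
def fixed_interleave(ready, cycle):
    """Slot `cycle` belongs to thread cycle % N; return None (a bubble) if it is not ready."""
    n = len(ready)
    t = cycle % n
    return t if ready[t] else None


def hardware_pick(ready, last):
    """Pick the next ready thread after `last`, searching round-robin (assumed priority scheme)."""
    n = len(ready)
    for step in range(1, n + 1):
        t = (last + step) % n
        if ready[t]:
            return t
    return None


def per_thread_period_ns(num_threads, cycle_ns):
    """Under fixed interleave each thread issues once every N cycles."""
    return num_threads * cycle_ns
```

With ready = [True, False, True, True], fixed interleave wastes cycle 1 on a bubble, whereas the hardware-controlled pick skips thread 1 and issues from thread 2. The period helper reproduces the CDC 6600 figure: 10 virtual processors on a 100 ns pipeline give one instruction per virtual processor every 1000 ns.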

Issue Slots: Vertical vs. Horizontal Waste

"Simultaneous Multithreading: Maximizing On-Chip Parallelism", D. M. Tullsen, S. J. Eggers, and H. M. Levy, University of Washington, ISCA 1995

Simultaneous Multithreading (SMT) for OoO Superscalars

§ Techniques presented so far have all been "vertical" multithreading where each pipeline stage works on one thread at a time
§ SMT uses fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on same clock cycle. Gives better utilization of machine resources.

For most apps, most execution units lie idle in an OoO superscalar

For an 8-way superscalar.

From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.

Superscalar Machine Efficiency

[Figure: grid of issue slots, issue width against time, marking each instruction issue, a completely idle cycle (vertical waste), and a partially filled cycle, i.e., IPC < 4 (horizontal waste).]
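The two kinds of waste can be counted directly from a per-cycle issue trace. The sketch below is illustrative: the function name and the sample trace are assumptions, and the issue width of 4 follows the "IPC < 4" label in the figure.

```python
def issue_waste(issued_per_cycle, width):
    """Return (vertical_waste, horizontal_waste, used_slots) for an issue trace."""
    # A completely idle cycle wastes every slot in that cycle.
    vertical = sum(width for n in issued_per_cycle if n == 0)
    # A partially filled cycle wastes only the slots it left empty.
    horizontal = sum(width - n for n in issued_per_cycle if 0 < n < width)
    used = sum(issued_per_cycle)
    return vertical, horizontal, used
```

For the hypothetical trace [2, 0, 4, 1, 0, 3] on a 4-wide machine, 8 slots are lost to vertical waste, 6 to horizontal waste, and 10 of the 24 slots do useful work.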

Vertical Multithreading

Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste

[Figure: grid of issue slots, issue width against time, with a second thread interleaved cycle-by-cycle; marks instruction issue and partially filled cycles, i.e., IPC < 4 (horizontal waste).]

Chip Multiprocessing (CMP)

§ What is the effect of splitting into multiple processors?
  - reduces horizontal waste,
  - leaves some vertical waste, and
  - puts upper limit on peak throughput of each thread.

[Figure: grid of issue slots, issue width against time.]

Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]

§ Interleave multiple threads to multiple issue slots with no restrictions

[Figure: grid of issue slots, issue width against time.]

SMT adaptation to parallelism type

For regions with high thread-level parallelism (TLP) entire machine width is shared by all threads

For regions with low thread-level parallelism (TLP) entire machine width is available for instruction-level parallelism (ILP)

[Figure: two grids of issue slots, issue width against time, one for each case.]
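A toy model makes the comparison between a single-threaded superscalar, vertical multithreading, and SMT concrete. Everything here is an assumption for illustration: ready[t][c] is taken to be the number of instructions thread t could issue in cycle c, unissued instructions do not carry over, and the function names are hypothetical.

```python
def superscalar(ready, width):
    """Single thread (thread 0) issues alone, limited by the issue width."""
    return [min(n, width) for n in ready[0]]


def vertical_mt(ready, width):
    """Cycle-by-cycle interleave: cycle c belongs entirely to thread c % N."""
    n = len(ready)
    return [min(ready[c % n][c], width) for c in range(len(ready[0]))]


def smt(ready, width):
    """Any mix of threads may fill the issue slots of a cycle."""
    return [min(sum(thread[c] for thread in ready), width)
            for c in range(len(ready[0]))]
```

With width 4 and ready = [[2, 0, 1, 3], [1, 2, 0, 2]], the three models issue 6, 7, and 10 instructions over four cycles. When only one thread has work, smt gives it the full width, which matches the low-TLP case above.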

Multithreaded Design Discussion

§ Want to build a multithreaded processor, how should each component be changed and what are the tradeoffs?
§ L1 caches (instruction and data)
§ L2 caches
§ Branch predictor
§ TLB
§ Physical register file

Summary: Multithreaded Categories

[Figure: issue slots over time (processor cycle) for five organizations: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  - Krste Asanovic (UCB)
  - Arvind (MIT)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)