
CS252 Spring 2017 Graduate Computer Architecture

Lecture 13: Cache Coherence Part 2, Multithreading Part 1

Lisa Wu, Krste Asanovic

http://inst.eecs.berkeley.edu/~cs252/sp17


Last Time in Lecture 12

• Reviewed store policies and cache read/write policies
  - Write through vs. write back
  - Write allocate vs. write no allocate
• Shared memory multiprocessor cache coherence
• Snoopy protocols: MSI, MESI
• Intervention
• False sharing


Review: Cache Coherence vs. Memory Consistency

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date.

From "A Primer on Memory Consistency and Cache Coherence", D. J. Sorin, M. D. Hill, and D. A. Wood
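To make the coherence problem itself concrete, here is a minimal sketch (ours, not from the primer or the slides) in which two private write-back caches each hold a copy of location A; with no protocol keeping the copies up-to-date, one CPU keeps reading a stale value after the other writes.

```python
# Two private write-back caches each hold a copy of location A; nothing keeps them in sync.
memory = {"A": 0}
cache0 = {"A": 0}     # CPU 0's cached copy
cache1 = {"A": 0}     # CPU 1's cached copy

cache0["A"] = 42      # CPU 0 stores 42; only its own copy (not memory, not cache1) changes

assert cache1["A"] == 0   # CPU 1 still hits on its stale copy and reads 0, not 42

# A coherence protocol (snoopy or directory-based) would invalidate or update
# cache1's copy on CPU 0's write, so a later read by CPU 1 returns the new value.
```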


Cache Coherence: Directory Protocol


Scalable Approach: Directories

§ Every memory line has associated directory information
  - keeps track of copies of cached lines and their states
  - on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with the directory and the copies is through network transactions
§ Many alternatives for organizing directory information


Directory Cache Protocol


§ Assumptions: reliable network, FIFO message delivery between any given source-destination pair

[Figure: several CPU/cache nodes and several directory controller/DRAM bank nodes, all connected through the interconnection network.]

Each line in a cache has a state field plus tag and data.

Each line in memory has a state field plus data and a bit-vector directory with one bit per processor.
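As a rough structural sketch (field names are ours, and a 4-processor machine is assumed), the per-cache-line and per-memory-line bookkeeping described above could be modeled as:

```python
from dataclasses import dataclass, field

NUM_PROCESSORS = 4  # assumed machine size for the one-bit-per-processor directory

@dataclass
class CacheLine:
    state: str = "Invalid"  # per-line state field (see the four cache states below)
    tag: int = 0            # address tag
    data: bytes = b""       # cached copy of the line

@dataclass
class MemoryLine:
    state: str = "R"        # home state for this line
    data: bytes = b""       # memory copy of the line
    # bit-vector directory: a presence bit for each processor that may cache the line
    sharers: list = field(default_factory=lambda: [False] * NUM_PROCESSORS)
```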


Cache States

§ For each cache line, there are 4 possible states (sketched in code below):
  - C-invalid (= Nothing): The accessed data is not resident in the cache.
  - C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
  - C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
  - C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).
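A minimal encoding of these four cache-line states, as a sketch; the slide's names are kept in the comments:

```python
from enum import Enum, auto

class CacheState(Enum):
    NOTHING = auto()   # C-invalid: data not resident in this cache
    SH      = auto()   # C-shared: resident here, possibly cached at other sites; memory valid
    EX      = auto()   # C-modified: exclusive and dirty here; memory is stale
    PENDING = auto()   # C-transient: request issued, matching reply not yet received
```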


Home Directory States

§ For each memory line, there are 4 possible states (see the sketch below):
  - R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
  - W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
  - TR(dir): The memory line is in a transient state waiting for the acknowledgements to the invalidation requests that the home site has issued.
  - TW(id): The memory line is in a transient state waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.
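Since the home states carry an argument (a sharer set dir or an owner id), a tagged record is a natural encoding. This is a sketch with invented field names, not part of the protocol specification:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HomeState:
    kind: str                                   # "R", "W", "TR", or "TW"
    sharers: set = field(default_factory=set)   # dir: used by R(dir) and TR(dir)
    owner: Optional[int] = None                 # id: used by W(id) and TW(id)

uncached      = HomeState("R")                     # R(dir) with dir empty: cached nowhere
shared_by_012 = HomeState("R", sharers={0, 1, 2})  # R(dir): memory valid, three sharers
owned_by_3    = HomeState("W", owner=3)            # W(id): memory stale, site 3 has the data
awaiting_acks = HomeState("TR", sharers={0, 2})    # TR(dir): invalidations still outstanding
```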


Directory Protocol Messages

Message type      | Source         | Destination    | Message contents
Read miss         | Local cache    | Home directory | P, A
  Processor P reads data at address A; send P the data and make P a read sharer.
Write miss        | Local cache    | Home directory | P, A
  Processor P writes data at address A; send P the data and make P the exclusive owner.
Invalidate        | Home directory | Remote caches  | A
  Invalidate a shared copy at address A.
Fetch             | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory.
Fetch/Invalidate  | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply  | Home directory | Local cache    | Data
  Return a data value from the home memory.
Data write-back   | Remote cache   | Home directory | A, Data
  Write back a data value for address A.

A compact encoding of these messages is sketched below.


Dave Patterson, CS252, Fall 1996
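One possible encoding of these message types for a toy simulator; only the message names and contents come from the table above, while the field layout is an assumption of ours:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DirectoryMessage:
    kind: str                     # "ReadMiss", "WriteMiss", "Invalidate", "Fetch",
                                  # "FetchInvalidate", "DataValueReply", or "DataWriteBack"
    src: str                      # "local cache", "home directory", or "remote cache"
    dst: str
    proc: Optional[int] = None    # P: requesting processor, where applicable
    addr: Optional[int] = None    # A: block address, where applicable
    data: Optional[int] = None    # data payload for replies and write-backs

# Processor 2 read-misses on block 0x40 and asks the home directory for it.
msg = DirectoryMessage("ReadMiss", "local cache", "home directory", proc=2, addr=0x40)
print(msg)
```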


Example Directory Protocol

• A message sent to the directory causes two actions:
  - Update the directory
  - Send more messages to satisfy the request
• Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  - Read miss: the requesting processor is sent the data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
  - Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
  - Read miss: the requesting processor is sent back the data from memory, and the requesting processor is added to the sharing set.
  - Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

(These two cases are sketched in code below.)
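A sketch of the directory-side actions just listed, covering only the Uncached and Shared cases (the dictionary layout and returned message tuples are ours; the Exclusive case is sketched after the next slide):

```python
def directory_request(entry, kind, proc):
    """Directory actions for a line in the Uncached or Shared state.
    entry: {'state', 'sharers', 'value'}; returns the messages to send."""
    msgs = []
    if entry["state"] == "Uncached":               # memory copy is the current value
        msgs.append(("DataValueReply", proc, entry["value"]))
        entry["sharers"] = {proc}                  # requestor becomes the only sharing node
        entry["state"] = "Shared" if kind == "ReadMiss" else "Exclusive"
    elif entry["state"] == "Shared":               # memory is up to date
        if kind == "ReadMiss":
            msgs.append(("DataValueReply", proc, entry["value"]))
            entry["sharers"].add(proc)             # add requestor to the sharing set
        elif kind == "WriteMiss":
            msgs += [("Invalidate", s, None) for s in entry["sharers"] if s != proc]
            msgs.append(("DataValueReply", proc, entry["value"]))
            entry["sharers"] = {proc}              # Sharers now names the owner
            entry["state"] = "Exclusive"
    return msgs

line = {"state": "Uncached", "sharers": set(), "value": 0}
print(directory_request(line, "ReadMiss", 1))  # [('DataValueReply', 1, 0)]; line is now Shared by {1}
```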


Example Directory Protocol (continued)

• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
  - Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  - Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
  - Write miss: the block has a new owner. A message is sent to the old owner, causing that cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains Exclusive.
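A matching sketch for the Exclusive case, again with invented message tuples; in a real implementation the data reply would be deferred until the fetched data actually arrives from the old owner:

```python
def directory_request_exclusive(entry, kind, proc, data=None):
    """Directory actions for a line whose state is Exclusive; the single member of
    entry['sharers'] is the current owner."""
    owner = next(iter(entry["sharers"]))
    msgs = []
    if kind == "ReadMiss":
        msgs.append(("Fetch", owner, None))          # owner writes data back, drops to Shared
        entry["sharers"].add(proc)                   # old owner keeps a readable copy
        entry["state"] = "Shared"
        msgs.append(("DataValueReply", proc, None))  # forwarded once the fetched data arrives
    elif kind == "DataWriteBack":
        entry["value"] = data                        # memory copy becomes up to date
        entry["sharers"] = set()                     # block is now uncached
        entry["state"] = "Uncached"
    elif kind == "WriteMiss":
        msgs.append(("FetchInvalidate", owner, None))  # old owner gives up its copy
        entry["sharers"] = {proc}                      # new owner; state stays Exclusive
        msgs.append(("DataValueReply", proc, None))
    return msgs

line = {"state": "Exclusive", "sharers": {1}, "value": 0}
print(directory_request_exclusive(line, "WriteMiss", 2))
# [('FetchInvalidate', 1, None), ('DataValueReply', 2, None)]; the owner is now P2
```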


Example

Trace table columns: Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)

Requests, in program order:
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2

A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1          | P2 | Bus           | Directory  | Memory
P1: Write 10 to A1  |             |    | WrMs P1 A1    | A1 Ex {P1} |
                    | Excl. A1 10 |    | DaRp P1 A1 0  |            |

Remaining requests: P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1          | P2 | Bus | Directory | Memory
P1: Read A1         | Excl. A1 10 |    |     |           |

(Read hit in P1's cache: no bus or directory action.)
Remaining requests: P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1          | P2          | Bus            | Directory        | Memory
P2: Read A1         |             | Shar. A1    | RdMs P2 A1     |                  |
                    | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                    |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10

Remaining requests: P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1   | P2          | Bus           | Directory     | Memory
P2: Write 20 to A1  |      | Excl. A1 20 | WrMs P2 A1    |               | 10
                    | Inv. |             | Inval. P1 A1  | A1 Excl. {P2} | 10

Remaining request: P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (complete trace)

Step                | P1          | P2          | Bus            | Directory        | Memory
P1: Write 10 to A1  |             |             | WrMs P1 A1     | A1 Ex {P1}       |
                    | Excl. A1 10 |             | DaRp P1 A1 0   |                  |
P1: Read A1         | Excl. A1 10 |             |                |                  |
P2: Read A1         |             | Shar. A1    | RdMs P2 A1     |                  |
                    | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                    |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1  |             | Excl. A1 20 | WrMs P2 A1     |                  | 10
                    | Inv.        |             | Inval. P1 A1   | A1 Excl. {P2}    | 10
P2: Write 40 to A2  |             |             | WrMs P2 A2     | A2 Excl. {P2}    | 0
                    |             |             | WrBk P2 A1 20  | A1 Unca. { }     | 20
                    |             | Excl. A2 40 | DaRp P2 A2 0   | A2 Excl. {P2}    | 0

A1 and A2 map to the same cache block.


Read miss, to uncached or shared line


[Figure: a CPU/cache node and a directory controller with its DRAM bank, connected through the interconnection network.]

1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data and return load data to CPU.
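The same nine-step read-miss transaction can be sketched as message passing between a cache controller and a directory controller, with Python deques standing in for the interconnection network; the ShReq/ShRep names and the R state follow the slide, the rest is our scaffolding:

```python
from collections import deque

net_to_dir, net_to_cache = deque(), deque()          # stand-in for the interconnection network
directory = {0x100: {"state": "R", "sharers": set(), "data": 7}}
cache = {}                                           # addr -> {"state", "data"}

def cpu_load(cpu_id, addr):
    if addr in cache and cache[addr]["state"] != "Nothing":
        return cache[addr]["data"]                   # hit: no protocol activity
    net_to_dir.append(("ShReq", cpu_id, addr))       # steps 2-3: miss, send ShReq

def directory_step():
    kind, cpu_id, addr = net_to_dir.popleft()        # step 4: message arrives at directory
    entry = directory[addr]
    assert kind == "ShReq" and entry["state"] == "R" # step 5: line is R, zero or more sharers
    entry["sharers"].add(cpu_id)                     # step 6: set bit for the new sharer
    net_to_cache.append(("ShRep", addr, entry["data"]))  # step 7: reply with line contents

def cache_fill():
    kind, addr, data = net_to_cache.popleft()        # step 8: ShRep arrives at cache
    cache[addr] = {"state": "Sh", "data": data}      # step 9: update tag and data,
    return data                                      #          return load data to the CPU

cpu_load(cpu_id=1, addr=0x100)
directory_step()
print(cache_fill())                                  # 7
```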


Write miss, to read-shared line


[Figure: the requesting CPU/cache node, the directory controller with its DRAM bank, and multiple sharer CPU/cache nodes, all connected through the interconnection network.]

1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at a sharer's cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.

Steps 7-9 repeat for each of the multiple sharers.
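The subtle part of this flow is steps 6-10: the directory may send ExRep only after every sharer has answered with InvRep. A sketch of that acknowledgement bookkeeping (the message names follow the slide; the transient-state handling is our assumption, loosely matching the TR state defined earlier):

```python
def handle_exreq(entry, requester):
    """Steps 4-6: ExReq arrives for a line in state R with some set of sharers."""
    entry["pending_acks"] = set(entry["sharers"]) - {requester}
    entry["requester"] = requester
    entry["state"] = "TR"                          # transient: waiting for invalidation acks
    return [("InvReq", s) for s in entry["pending_acks"]]   # step 6: one InvReq per sharer

def handle_invrep(entry, sharer):
    """Step 9: one InvRep arrives; step 10: send ExRep only when the last ack is in."""
    entry["pending_acks"].discard(sharer)          # clear down this sharer's bit
    if entry["pending_acks"]:
        return []
    entry["sharers"] = {entry["requester"]}
    entry["state"] = "W"                           # requester is now the exclusive owner
    return [("ExRep", entry["requester"])]

line = {"state": "R", "sharers": {0, 2, 3}}
invalidations = handle_exreq(line, requester=1)    # InvReq to caches 0, 2, and 3
for _, sharer in invalidations:
    print(handle_invrep(line, sharer))             # [], [], then [('ExRep', 1)]
```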


Concurrency Management

§ The protocol would be easy to design if only one transaction were in flight across the entire system
§ But we want greater throughput and don't want to have to coordinate across the entire system
§ Great complexity in managing multiple outstanding concurrent transactions to cache lines
  - Can have multiple requests in flight to the same cache line!


Multithreading: Intro to MT and SMT


Multithreading

§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor


Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on the same pipeline.


Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

  T1: LD   x1, 0(x2)
  T2: ADD  x7, x1, x4
  T3: XORI x5, x4, 12
  T4: SD   0(x7), x5
  T1: LD   x5, 12(x1)

[Figure: pipeline diagram over cycles t0-t9; each instruction occupies F, D, X, M, W in successive cycles, with one thread fetching per cycle.]

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file (checked in the sketch below).
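A quick sanity check of that claim, under the slide's assumptions (4 threads, fixed round-robin fetch, non-bypassed 5-stage pipeline where registers are read in the decode stage and written in the write-back stage):

```python
THREADS = 4          # T1-T4, fixed round-robin fetch
STAGES = 5           # F D X M W

def regfile_read_cycle(fetch_cycle):
    return fetch_cycle + 1            # registers are read in D, one cycle after F

def writeback_cycle(fetch_cycle):
    return fetch_cycle + STAGES - 1   # registers are written in W

# An instruction of some thread fetches at cycle t; the same thread's next
# instruction fetches THREADS cycles later.
for t in range(20):
    assert regfile_read_cycle(t + THREADS) > writeback_cycle(t), \
        "next same-thread instruction would read the register file too early"
print("4-way interleave: every same-thread register read follows the prior write-back")
```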


CDC 6600 Peripheral Processors (Cray, 1964)

§ First multithreaded hardware
§ 10 "virtual" I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state


Simple Multithreaded Pipeline

§ Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs


[Figure: multithreaded pipeline datapath. A 2-bit thread select chooses among per-thread PCs (PC1-PC4) for fetch from the I$, and the same select, carried down the pipeline, chooses among per-thread register files (GPR1-GPR4) before the execute stages (X, Y) and the D$.]


Multithreading Costs

§ Each thread requires its own user state (sketched below)
  - PC
  - GPRs
§ Also needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
§ Other overheads:
  - Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)
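The replicated per-thread state can be summarized as a small record; this sketch uses generic names rather than any particular ISA's register set:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    # Per-thread user state
    pc: int = 0
    gprs: list = field(default_factory=lambda: [0] * 32)
    # Per-thread system state
    page_table_base: int = 0     # virtual-memory page-table-base register
    exception_pc: int = 0        # exception-handling registers
    exception_cause: int = 0

contexts = [ThreadContext() for _ in range(4)]   # e.g. a 4-way multithreaded core
```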


Thread Scheduling Policies


§ Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If a thread is not ready to go in its slot, insert a pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme

(The first and third policies are contrasted in the sketch below.)
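A sketch contrasting the first and third policies: fixed interleave rotates through thread slots and inserts a bubble when the slot's thread is stalled, while hardware-controlled scheduling picks the highest-priority ready thread each cycle (the priority scheme here is an arbitrary example, not HEP's actual mechanism):

```python
def fixed_interleave(cycle, num_threads, ready):
    tid = cycle % num_threads              # each thread owns every Nth slot
    return tid if ready[tid] else None     # None means a pipeline bubble

def hw_priority_schedule(ready, priority):
    candidates = [tid for tid, ok in enumerate(ready) if ok]
    if not candidates:
        return None
    return max(candidates, key=lambda tid: priority[tid])   # highest-priority ready thread

ready    = [True, False, True, True]
priority = [1, 3, 2, 0]
print(fixed_interleave(cycle=1, num_threads=4, ready=ready))   # None: thread 1 stalled, bubble
print(hw_priority_schedule(ready, priority))                   # 2: highest-priority ready thread
```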


Issue Slots: Vertical vs. Horizontal Waste


“Simultaneous Multithreading: Maximizing On-Chip Parallelism”, D. M. Tullsen, S. J. Eggers, and H. M. Levy, University of Washington, ISCA 1995


Simultaneous Multithreading (SMT) for OoO Superscalars

§ Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
§ SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources (see the sketch below).
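The contrast with vertical multithreading fits in a few lines: rather than giving a whole cycle to a single thread, an SMT issue stage fills its issue slots from whichever threads have ready instructions. This is an idealized sketch with an assumed issue width of 4 and no structural-hazard modeling:

```python
def smt_issue(ready_insts_per_thread, issue_width=4):
    """ready_insts_per_thread: {tid: [inst, ...]}; returns the (tid, inst) pairs issued this cycle."""
    issued = []
    tids = sorted(ready_insts_per_thread)
    while len(issued) < issue_width:
        progressed = False
        for tid in tids:                       # round-robin across threads within one cycle
            if ready_insts_per_thread[tid] and len(issued) < issue_width:
                issued.append((tid, ready_insts_per_thread[tid].pop(0)))
                progressed = True
        if not progressed:
            break                              # no thread has anything ready: wasted slots
    return issued

print(smt_issue({0: ["ld", "add"], 1: ["mul"], 2: [], 3: ["sub", "xor"]}))
# [(0, 'ld'), (1, 'mul'), (3, 'sub'), (0, 'add')]
```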


For most apps, most execution units lie idle in an OoO superscalar


[Figure: breakdown of wasted issue slots per benchmark for an 8-way superscalar. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.]


Superscalar Machine Efficiency


[Figure: instruction issue slots (issue width 4) over time. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste. Both kinds of waste are quantified in the sketch below.]
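The two kinds of waste can be made precise with a small calculation over a per-cycle issue trace; this sketch assumes a 4-wide machine, matching the IPC < 4 note above:

```python
def issue_slot_waste(issued_per_cycle, issue_width=4):
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)      # fully idle cycles
    horizontal = sum(issue_width - n for n in issued_per_cycle if 0 < n < issue_width)
    used = sum(issued_per_cycle)
    total = issue_width * len(issued_per_cycle)
    return vertical, horizontal, used / total

# 6 cycles on a 4-wide machine: two completely idle cycles, several partially filled ones.
trace = [3, 0, 2, 4, 1, 0]
vert, horiz, util = issue_slot_waste(trace)
print(vert, horiz, f"{util:.0%}")   # 8 vertical-waste slots, 6 horizontal-waste slots, 42% utilization
```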


Vertical Multithreading


Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste.

[Figure: a second thread interleaved cycle-by-cycle across the issue slots over time; partially filled cycles (IPC < 4) remain as horizontal waste.]


Chip Multiprocessing (CMP)


§ What is the effect of splitting into multiple processors?
  - reduces horizontal waste,
  - leaves some vertical waste, and
  - puts an upper limit on the peak throughput of each thread.

[Figure: issue slots over time for two narrower cores.]


Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]


§ Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue slots over time with instructions from several threads filling each cycle.]


SMT Adaptation to Parallelism Type


For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.

For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).

[Figure: issue slots over time in the two regimes.]


Multithreaded Design Discussion


§ We want to build a multithreaded processor: how should each component be changed, and what are the tradeoffs?
  - L1 caches (instruction and data)
  - L2 caches
  - Branch predictor
  - TLB
  - Physical register file


Summary: Multithreaded Categories


[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; colors distinguish threads 1-5 and idle slots.]


Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  - Krste Asanovic (UCB)
  - Arvind (MIT)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
