
CS252 Spring 2017 Graduate Computer Architecture

Lecture 13: Cache Coherence Part 2, Multithreading Part 1

Lisa Wu, Krste Asanovic

http://inst.eecs.berkeley.edu/~cs252/sp17


Last Time in Lecture 12

• Reviewed store policies and cache read/write policies
  - Write through vs. write back
  - Write allocate vs. write no allocate
• Shared memory multiprocessor cache coherence
• Snoopy protocols: MSI, MESI
• Intervention
• False sharing


Review: Cache Coherence vs. Memory Consistency

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date.

From "A Primer on Memory Consistency and Cache Coherence", D. J. Sorin, M. D. Hill, and D. A. Wood
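To make the coherence problem itself concrete, here is a minimal sketch (ours, not from the primer or the slides) in which two private write-back caches each hold a copy of location A; with no protocol keeping the copies up-to-date, one CPU keeps reading a stale value after the other writes.

```python
# Two private write-back caches each hold a copy of location A; nothing keeps them in sync.
memory = {"A": 0}
cache0 = {"A": 0}     # CPU 0's cached copy
cache1 = {"A": 0}     # CPU 1's cached copy

cache0["A"] = 42      # CPU 0 stores 42; only its own copy (not memory, not cache1) changes

assert cache1["A"] == 0   # CPU 1 still hits on its stale copy and reads 0, not 42

# A coherence protocol (snoopy or directory-based) would invalidate or update
# cache1's copy on CPU 0's write, so a later read by CPU 1 returns the new value.
```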


Cache Coherence: Directory Protocol


Scalable Approach: Directories

§ Every memory line has associated directory information
  - keeps track of copies of cached lines and their states
  - on a miss, find the directory entry, look it up, and communicate only with the nodes that have copies, if necessary
  - in scalable networks, communication with the directory and the copies is through network transactions
§ Many alternatives for organizing directory information


Directory Cache Protocol


§ Assumptions: reliable network, FIFO message delivery between any given source-destination pair

[Figure: several CPU/cache nodes and several directory controller/DRAM bank nodes, all connected through the interconnection network.]

Each line in a cache has a state field plus tag and data.

Each line in memory has a state field plus data and a bit-vector directory with one bit per processor.
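As a rough structural sketch (field names are ours, and a 4-processor machine is assumed), the per-cache-line and per-memory-line bookkeeping described above could be modeled as:

```python
from dataclasses import dataclass, field

NUM_PROCESSORS = 4  # assumed machine size for the one-bit-per-processor directory

@dataclass
class CacheLine:
    state: str = "Invalid"  # per-line state field (see the four cache states below)
    tag: int = 0            # address tag
    data: bytes = b""       # cached copy of the line

@dataclass
class MemoryLine:
    state: str = "R"        # home state for this line
    data: bytes = b""       # memory copy of the line
    # bit-vector directory: a presence bit for each processor that may cache the line
    sharers: list = field(default_factory=lambda: [False] * NUM_PROCESSORS)
```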


Cache States

§ For each cache line, there are 4 possible states (sketched in code below):
  - C-invalid (= Nothing): The accessed data is not resident in the cache.
  - C-shared (= Sh): The accessed data is resident in the cache, and possibly also cached at other sites. The data in memory is valid.
  - C-modified (= Ex): The accessed data is exclusively resident in this cache, and has been modified. Memory does not have the most up-to-date data.
  - C-transient (= Pending): The accessed data is in a transient state (for example, the site has just issued a protocol request, but has not received the corresponding protocol reply).
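A minimal encoding of these four cache-line states, as a sketch; the slide's names are kept in the comments:

```python
from enum import Enum, auto

class CacheState(Enum):
    NOTHING = auto()   # C-invalid: data not resident in this cache
    SH      = auto()   # C-shared: resident here, possibly cached at other sites; memory valid
    EX      = auto()   # C-modified: exclusive and dirty here; memory is stale
    PENDING = auto()   # C-transient: request issued, matching reply not yet received
```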


Home Directory States

§ For each memory line, there are 4 possible states (see the sketch below):
  - R(dir): The memory line is shared by the sites specified in dir (dir is a set of sites). The data in memory is valid in this state. If dir is empty (i.e., dir = ε), the memory line is not cached by any site.
  - W(id): The memory line is exclusively cached at site id, and has been modified at that site. Memory does not have the most up-to-date data.
  - TR(dir): The memory line is in a transient state waiting for the acknowledgements to the invalidation requests that the home site has issued.
  - TW(id): The memory line is in a transient state waiting for a line exclusively cached at site id (i.e., in C-modified state) to make the memory line at the home site up-to-date.
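Since the home states carry an argument (a sharer set dir or an owner id), a tagged record is a natural encoding. This is a sketch with invented field names, not part of the protocol specification:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class HomeState:
    kind: str                                   # "R", "W", "TR", or "TW"
    sharers: set = field(default_factory=set)   # dir: used by R(dir) and TR(dir)
    owner: Optional[int] = None                 # id: used by W(id) and TW(id)

uncached      = HomeState("R")                     # R(dir) with dir empty: cached nowhere
shared_by_012 = HomeState("R", sharers={0, 1, 2})  # R(dir): memory valid, three sharers
owned_by_3    = HomeState("W", owner=3)            # W(id): memory stale, site 3 has the data
awaiting_acks = HomeState("TR", sharers={0, 2})    # TR(dir): invalidations still outstanding
```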


Directory Protocol Messages

Message type      | Source         | Destination    | Message contents
Read miss         | Local cache    | Home directory | P, A
  Processor P reads data at address A; send P the data and make P a read sharer.
Write miss        | Local cache    | Home directory | P, A
  Processor P writes data at address A; send P the data and make P the exclusive owner.
Invalidate        | Home directory | Remote caches  | A
  Invalidate a shared copy at address A.
Fetch             | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory.
Fetch/Invalidate  | Home directory | Remote cache   | A
  Fetch the block at address A and send it to its home directory; invalidate the block in the cache.
Data value reply  | Home directory | Local cache    | Data
  Return a data value from the home memory.
Data write-back   | Remote cache   | Home directory | A, Data
  Write back a data value for address A.

A compact encoding of these messages is sketched below.


Dave Patterson, CS252, Fall 1996
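One possible encoding of these message types for a toy simulator; only the message names and contents come from the table above, while the field layout is an assumption of ours:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DirectoryMessage:
    kind: str                     # "ReadMiss", "WriteMiss", "Invalidate", "Fetch",
                                  # "FetchInvalidate", "DataValueReply", or "DataWriteBack"
    src: str                      # "local cache", "home directory", or "remote cache"
    dst: str
    proc: Optional[int] = None    # P: requesting processor, where applicable
    addr: Optional[int] = None    # A: block address, where applicable
    data: Optional[int] = None    # data payload for replies and write-backs

# Processor 2 read-misses on block 0x40 and asks the home directory for it.
msg = DirectoryMessage("ReadMiss", "local cache", "home directory", proc=2, addr=0x40)
print(msg)
```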


Example Directory Protocol

• A message sent to the directory causes two actions:
  - Update the directory
  - Send more messages to satisfy the request
• Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  - Read miss: the requesting processor is sent the data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
  - Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
  - Read miss: the requesting processor is sent back the data from memory, and the requesting processor is added to the sharing set.
  - Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.

(These two cases are sketched in code below.)
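A sketch of the directory-side actions just listed, covering only the Uncached and Shared cases (the dictionary layout and returned message tuples are ours; the Exclusive case is sketched after the next slide):

```python
def directory_request(entry, kind, proc):
    """Directory actions for a line in the Uncached or Shared state.
    entry: {'state', 'sharers', 'value'}; returns the messages to send."""
    msgs = []
    if entry["state"] == "Uncached":               # memory copy is the current value
        msgs.append(("DataValueReply", proc, entry["value"]))
        entry["sharers"] = {proc}                  # requestor becomes the only sharing node
        entry["state"] = "Shared" if kind == "ReadMiss" else "Exclusive"
    elif entry["state"] == "Shared":               # memory is up to date
        if kind == "ReadMiss":
            msgs.append(("DataValueReply", proc, entry["value"]))
            entry["sharers"].add(proc)             # add requestor to the sharing set
        elif kind == "WriteMiss":
            msgs += [("Invalidate", s, None) for s in entry["sharers"] if s != proc]
            msgs.append(("DataValueReply", proc, entry["value"]))
            entry["sharers"] = {proc}              # Sharers now names the owner
            entry["state"] = "Exclusive"
    return msgs

line = {"state": "Uncached", "sharers": set(), "value": 0}
print(directory_request(line, "ReadMiss", 1))  # [('DataValueReply', 1, 0)]; line is now Shared by {1}
```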


Example Directory Protocol (continued)

• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
  - Read miss: the owner processor is sent a data fetch message, which causes the state of the block in the owner's cache to transition to Shared and causes the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy).
  - Data write-back: the owner processor is replacing the block and hence must write it back. This makes the memory copy up-to-date (the home directory essentially becomes the owner), the block is now uncached, and the Sharers set is empty.
  - Write miss: the block has a new owner. A message is sent to the old owner, causing that cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block remains Exclusive.
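A matching sketch for the Exclusive case, again with invented message tuples; in a real implementation the data reply would be deferred until the fetched data actually arrives from the old owner:

```python
def directory_request_exclusive(entry, kind, proc, data=None):
    """Directory actions for a line whose state is Exclusive; the single member of
    entry['sharers'] is the current owner."""
    owner = next(iter(entry["sharers"]))
    msgs = []
    if kind == "ReadMiss":
        msgs.append(("Fetch", owner, None))          # owner writes data back, drops to Shared
        entry["sharers"].add(proc)                   # old owner keeps a readable copy
        entry["state"] = "Shared"
        msgs.append(("DataValueReply", proc, None))  # forwarded once the fetched data arrives
    elif kind == "DataWriteBack":
        entry["value"] = data                        # memory copy becomes up to date
        entry["sharers"] = set()                     # block is now uncached
        entry["state"] = "Uncached"
    elif kind == "WriteMiss":
        msgs.append(("FetchInvalidate", owner, None))  # old owner gives up its copy
        entry["sharers"] = {proc}                      # new owner; state stays Exclusive
        msgs.append(("DataValueReply", proc, None))
    return msgs

line = {"state": "Exclusive", "sharers": {1}, "value": 0}
print(directory_request_exclusive(line, "WriteMiss", 2))
# [('FetchInvalidate', 1, None), ('DataValueReply', 2, None)]; the owner is now P2
```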


Example

Trace table columns: Step | P1 (State, Addr, Value) | P2 (State, Addr, Value) | Bus (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)

Requests, in program order:
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2

A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1          | P2 | Bus           | Directory  | Memory
P1: Write 10 to A1  |             |    | WrMs P1 A1    | A1 Ex {P1} |
                    | Excl. A1 10 |    | DaRp P1 A1 0  |            |

Remaining requests: P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1          | P2 | Bus | Directory | Memory
P1: Read A1         | Excl. A1 10 |    |     |           |

(Read hit in P1's cache: no bus or directory action.)
Remaining requests: P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1          | P2          | Bus            | Directory        | Memory
P2: Read A1         |             | Shar. A1    | RdMs P2 A1     |                  |
                    | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                    |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10

Remaining requests: P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (cont.)

Step                | P1   | P2          | Bus           | Directory     | Memory
P2: Write 20 to A1  |      | Excl. A1 20 | WrMs P2 A1    |               | 10
                    | Inv. |             | Inval. P1 A1  | A1 Excl. {P2} | 10

Remaining request: P2: Write 40 to A2.
A1 and A2 map to the same cache block.


Example (complete trace)

Step                | P1          | P2          | Bus            | Directory        | Memory
P1: Write 10 to A1  |             |             | WrMs P1 A1     | A1 Ex {P1}       |
                    | Excl. A1 10 |             | DaRp P1 A1 0   |                  |
P1: Read A1         | Excl. A1 10 |             |                |                  |
P2: Read A1         |             | Shar. A1    | RdMs P2 A1     |                  |
                    | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                    |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1  |             | Excl. A1 20 | WrMs P2 A1     |                  | 10
                    | Inv.        |             | Inval. P1 A1   | A1 Excl. {P2}    | 10
P2: Write 40 to A2  |             |             | WrMs P2 A2     | A2 Excl. {P2}    | 0
                    |             |             | WrBk P2 A1 20  | A1 Unca. { }     | 20
                    |             | Excl. A2 40 | DaRp P2 A2 0   | A2 Excl. {P2}    | 0

A1 and A2 map to the same cache block.


Read miss, to uncached or shared line


[Figure: a CPU/cache node and a directory controller with its DRAM bank, connected through the interconnection network.]

1. Load request at head of CPU->Cache queue.
2. Load misses in cache.
3. Send ShReq message to directory.
4. Message received at directory controller.
5. Access state and directory for line. Line's state is R, with zero or more sharers.
6. Update directory by setting bit for new processor sharer.
7. Send ShRep message with contents of cache line.
8. ShRep arrives at cache.
9. Update cache tag and data and return load data to CPU.
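The same nine-step read-miss transaction can be sketched as message passing between a cache controller and a directory controller, with Python deques standing in for the interconnection network; the ShReq/ShRep names and the R state follow the slide, the rest is our scaffolding:

```python
from collections import deque

net_to_dir, net_to_cache = deque(), deque()          # stand-in for the interconnection network
directory = {0x100: {"state": "R", "sharers": set(), "data": 7}}
cache = {}                                           # addr -> {"state", "data"}

def cpu_load(cpu_id, addr):
    if addr in cache and cache[addr]["state"] != "Nothing":
        return cache[addr]["data"]                   # hit: no protocol activity
    net_to_dir.append(("ShReq", cpu_id, addr))       # steps 2-3: miss, send ShReq

def directory_step():
    kind, cpu_id, addr = net_to_dir.popleft()        # step 4: message arrives at directory
    entry = directory[addr]
    assert kind == "ShReq" and entry["state"] == "R" # step 5: line is R, zero or more sharers
    entry["sharers"].add(cpu_id)                     # step 6: set bit for the new sharer
    net_to_cache.append(("ShRep", addr, entry["data"]))  # step 7: reply with line contents

def cache_fill():
    kind, addr, data = net_to_cache.popleft()        # step 8: ShRep arrives at cache
    cache[addr] = {"state": "Sh", "data": data}      # step 9: update tag and data,
    return data                                      #          return load data to the CPU

cpu_load(cpu_id=1, addr=0x100)
directory_step()
print(cache_fill())                                  # 7
```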


Write miss, to read-shared line


[Figure: the requesting CPU/cache node, the directory controller with its DRAM bank, and multiple sharer CPU/cache nodes, all connected through the interconnection network.]

1. Store request at head of CPU->Cache queue.
2. Store misses in cache.
3. Send ExReq message to directory.
4. ExReq message received at directory controller.
5. Access state and directory for line. Line's state is R, with some set of sharers.
6. Send one InvReq message to each sharer.
7. InvReq arrives at a sharer's cache.
8. Invalidate cache line. Send InvRep to directory.
9. InvRep received. Clear down sharer bit.
10. When no more sharers, send ExRep to cache.
11. ExRep arrives at cache.
12. Update cache tag and data, then store data from CPU.

Steps 7-9 repeat for each of the multiple sharers.
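The subtle part of this flow is steps 6-10: the directory may send ExRep only after every sharer has answered with InvRep. A sketch of that acknowledgement bookkeeping (the message names follow the slide; the transient-state handling is our assumption, loosely matching the TR state defined earlier):

```python
def handle_exreq(entry, requester):
    """Steps 4-6: ExReq arrives for a line in state R with some set of sharers."""
    entry["pending_acks"] = set(entry["sharers"]) - {requester}
    entry["requester"] = requester
    entry["state"] = "TR"                          # transient: waiting for invalidation acks
    return [("InvReq", s) for s in entry["pending_acks"]]   # step 6: one InvReq per sharer

def handle_invrep(entry, sharer):
    """Step 9: one InvRep arrives; step 10: send ExRep only when the last ack is in."""
    entry["pending_acks"].discard(sharer)          # clear down this sharer's bit
    if entry["pending_acks"]:
        return []
    entry["sharers"] = {entry["requester"]}
    entry["state"] = "W"                           # requester is now the exclusive owner
    return [("ExRep", entry["requester"])]

line = {"state": "R", "sharers": {0, 2, 3}}
invalidations = handle_exreq(line, requester=1)    # InvReq to caches 0, 2, and 3
for _, sharer in invalidations:
    print(handle_invrep(line, sharer))             # [], [], then [('ExRep', 1)]
```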


Concurrency Management

§ The protocol would be easy to design if only one transaction were in flight across the entire system
§ But we want greater throughput and don't want to have to coordinate across the entire system
§ Great complexity in managing multiple outstanding concurrent transactions to cache lines
  - Can have multiple requests in flight to the same cache line!


Multithreading: Intro to MT and SMT


Multithreading

§ Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
§ Many workloads can make use of thread-level parallelism (TLP)
  - TLP from multiprogramming (run independent sequential jobs)
  - TLP from multithreaded applications (run one job faster using parallel threads)
§ Multithreading uses TLP to improve utilization of a single processor


Multithreading

How can we guarantee no dependencies between instructions in a pipeline?

One way is to interleave execution of instructions from different program threads on the same pipeline.


Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:

  T1: LD   x1, 0(x2)
  T2: ADD  x7, x1, x4
  T3: XORI x5, x4, 12
  T4: SD   0(x7), x5
  T1: LD   x5, 12(x1)

[Figure: pipeline diagram over cycles t0-t9; each instruction occupies F, D, X, M, W in successive cycles, with one thread fetching per cycle.]

The prior instruction in a thread always completes write-back before the next instruction in the same thread reads the register file (checked in the sketch below).
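A quick sanity check of that claim, under the slide's assumptions (4 threads, fixed round-robin fetch, non-bypassed 5-stage pipeline where registers are read in the decode stage and written in the write-back stage):

```python
THREADS = 4          # T1-T4, fixed round-robin fetch
STAGES = 5           # F D X M W

def regfile_read_cycle(fetch_cycle):
    return fetch_cycle + 1            # registers are read in D, one cycle after F

def writeback_cycle(fetch_cycle):
    return fetch_cycle + STAGES - 1   # registers are written in W

# An instruction of some thread fetches at cycle t; the same thread's next
# instruction fetches THREADS cycles later.
for t in range(20):
    assert regfile_read_cycle(t + THREADS) > writeback_cycle(t), \
        "next same-thread instruction would read the register file too early"
print("4-way interleave: every same-thread register read follows the prior write-back")
```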


CDC 6600 Peripheral Processors (Cray, 1964)

§ First multithreaded hardware
§ 10 "virtual" I/O processors
§ Fixed interleave on simple pipeline
§ Pipeline has 100 ns cycle time
§ Each virtual processor executes one instruction every 1000 ns
§ Accumulator-based instruction set to reduce processor state


Simple Multithreaded Pipeline

§ Have to carry the thread select down the pipeline to ensure the correct state bits are read/written at each pipe stage
§ Appears to software (including OS) as multiple, albeit slower, CPUs


[Figure: multithreaded pipeline datapath. A 2-bit thread select chooses among per-thread PCs (PC1-PC4) for fetch from the I$, and the same select, carried down the pipeline, chooses among per-thread register files (GPR1-GPR4) before the execute stages (X, Y) and the D$.]


Multithreading Costs

§ Each thread requires its own user state (sketched below)
  - PC
  - GPRs
§ Also needs its own system state
  - Virtual-memory page-table-base register
  - Exception-handling registers
§ Other overheads:
  - Additional cache/TLB conflicts from competing threads (or add larger cache/TLB capacity)
  - More OS overhead to schedule more threads (where do all these threads come from?)
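The replicated per-thread state can be summarized as a small record; this sketch uses generic names rather than any particular ISA's register set:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    # Per-thread user state
    pc: int = 0
    gprs: list = field(default_factory=lambda: [0] * 32)
    # Per-thread system state
    page_table_base: int = 0     # virtual-memory page-table-base register
    exception_pc: int = 0        # exception-handling registers
    exception_cause: int = 0

contexts = [ThreadContext() for _ in range(4)]   # e.g. a 4-way multithreaded core
```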


Thread Scheduling Policies


§ Fixed interleave (CDC 6600 PPUs, 1964)
  - Each of N threads executes one instruction every N cycles
  - If a thread is not ready to go in its slot, insert a pipeline bubble
§ Software-controlled interleave (TI ASC PPUs, 1971)
  - OS allocates S pipeline slots amongst N threads
  - Hardware performs fixed interleave over S slots, executing whichever thread is in that slot
§ Hardware-controlled thread scheduling (HEP, 1982)
  - Hardware keeps track of which threads are ready to go
  - Picks next thread to execute based on hardware priority scheme

(The first and third policies are contrasted in the sketch below.)
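A sketch contrasting the first and third policies: fixed interleave rotates through thread slots and inserts a bubble when the slot's thread is stalled, while hardware-controlled scheduling picks the highest-priority ready thread each cycle (the priority scheme here is an arbitrary example, not HEP's actual mechanism):

```python
def fixed_interleave(cycle, num_threads, ready):
    tid = cycle % num_threads              # each thread owns every Nth slot
    return tid if ready[tid] else None     # None means a pipeline bubble

def hw_priority_schedule(ready, priority):
    candidates = [tid for tid, ok in enumerate(ready) if ok]
    if not candidates:
        return None
    return max(candidates, key=lambda tid: priority[tid])   # highest-priority ready thread

ready    = [True, False, True, True]
priority = [1, 3, 2, 0]
print(fixed_interleave(cycle=1, num_threads=4, ready=ready))   # None: thread 1 stalled, bubble
print(hw_priority_schedule(ready, priority))                   # 2: highest-priority ready thread
```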


Issue Slots: Vertical vs. Horizontal Waste


“Simultaneous Multithreading: Maximizing On-Chip Parallelism”, D. M. Tullsen, S. J. Eggers, and H. M. Levy, University of Washington, ISCA 1995


Simultaneous Multithreading (SMT) for OoO Superscalars

§ Techniques presented so far have all been "vertical" multithreading, where each pipeline stage works on one thread at a time
§ SMT uses the fine-grain control already present inside an OoO superscalar to allow instructions from multiple threads to enter execution on the same clock cycle. This gives better utilization of machine resources (see the sketch below).
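The contrast with vertical multithreading fits in a few lines: rather than giving a whole cycle to a single thread, an SMT issue stage fills its issue slots from whichever threads have ready instructions. This is an idealized sketch with an assumed issue width of 4 and no structural-hazard modeling:

```python
def smt_issue(ready_insts_per_thread, issue_width=4):
    """ready_insts_per_thread: {tid: [inst, ...]}; returns the (tid, inst) pairs issued this cycle."""
    issued = []
    tids = sorted(ready_insts_per_thread)
    while len(issued) < issue_width:
        progressed = False
        for tid in tids:                       # round-robin across threads within one cycle
            if ready_insts_per_thread[tid] and len(issued) < issue_width:
                issued.append((tid, ready_insts_per_thread[tid].pop(0)))
                progressed = True
        if not progressed:
            break                              # no thread has anything ready: wasted slots
    return issued

print(smt_issue({0: ["ld", "add"], 1: ["mul"], 2: [], 3: ["sub", "xor"]}))
# [(0, 'ld'), (1, 'mul'), (3, 'sub'), (0, 'add')]
```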


For most apps, most execution units lie idle in an OoO superscalar


[Figure: breakdown of wasted issue slots per benchmark for an 8-way superscalar. From: Tullsen, Eggers, and Levy, "Simultaneous Multithreading: Maximizing On-chip Parallelism", ISCA 1995.]


Superscalar Machine Efficiency


[Figure: instruction issue slots (issue width 4) over time. A completely idle cycle is vertical waste; a partially filled cycle, i.e., IPC < 4, is horizontal waste. Both kinds of waste are quantified in the sketch below.]
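The two kinds of waste can be made precise with a small calculation over a per-cycle issue trace; this sketch assumes a 4-wide machine, matching the IPC < 4 note above:

```python
def issue_slot_waste(issued_per_cycle, issue_width=4):
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)      # fully idle cycles
    horizontal = sum(issue_width - n for n in issued_per_cycle if 0 < n < issue_width)
    used = sum(issued_per_cycle)
    total = issue_width * len(issued_per_cycle)
    return vertical, horizontal, used / total

# 6 cycles on a 4-wide machine: two completely idle cycles, several partially filled ones.
trace = [3, 0, 2, 4, 1, 0]
vert, horiz, util = issue_slot_waste(trace)
print(vert, horiz, f"{util:.0%}")   # 8 vertical-waste slots, 6 horizontal-waste slots, 42% utilization
```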


Vertical Multithreading


Cycle-by-cycle interleaving removes vertical waste, but leaves some horizontal waste.

[Figure: a second thread interleaved cycle-by-cycle across the issue slots over time; partially filled cycles (IPC < 4) remain as horizontal waste.]


Chip Multiprocessing (CMP)


§ What is the effect of splitting into multiple processors?
  - reduces horizontal waste,
  - leaves some vertical waste, and
  - puts an upper limit on the peak throughput of each thread.

[Figure: issue slots over time for two narrower cores.]


Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]


§ Interleave multiple threads to multiple issue slots with no restrictions

[Figure: issue slots over time with instructions from several threads filling each cycle.]


SMT Adaptation to Parallelism Type


For regions with high thread-level parallelism (TLP), the entire machine width is shared by all threads.

For regions with low thread-level parallelism (TLP), the entire machine width is available for instruction-level parallelism (ILP).

[Figure: issue slots over time in the two regimes.]


Multithreaded Design Discussion


§ We want to build a multithreaded processor: how should each component be changed, and what are the tradeoffs?
  - L1 caches (instruction and data)
  - L2 caches
  - Branch predictor
  - TLB
  - Physical register file


Summary: Multithreaded Categories


[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; colors distinguish threads 1-5 and idle slots.]


Acknowledgements

§ This course is partly inspired by previous MIT 6.823 and Berkeley CS252 computer architecture courses created by my collaborators and colleagues:
  - Krste Asanovic (UCB)
  - Arvind (MIT)
  - Joel Emer (Intel/MIT)
  - James Hoe (CMU)
  - John Kubiatowicz (UCB)
  - David Patterson (UCB)
