CS 61C: Great Ideas in Computer Architecture (Machine Structures)

Caches Part 2

Instructors: Krste Asanović & Randy H. Katz

http://inst.eecs.berkeley.edu/~cs61c/

10/16/17 Fall 2017 - Lecture #15


Outline

• Cache Organization and Principles
• Write Back vs. Write Through
• Cache Performance
• Cache Design Tradeoffs
• And in Conclusion…


Typical Memory Hierarchy

[Figure: On-Chip Components (Control, Datapath, RegFile, Instr Cache, Data Cache), then Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), and Secondary Memory (Disk or Flash)]

Speed (cycles): ½’s, 1’s, 10’s, 100’s-1000, 1,000,000’s
Size (bytes): 100’s, 10K’s, M’s, G’s, T’s
Cost/bit: highest to lowest

• Principle of locality + memory hierarchy presents programmer with ≈ as much memory as is available in the cheapest technology at the ≈ speed offered by the fastest technology

Adding Cache to Computer

[Figure: Processor (Control; Datapath with PC, Registers, Arithmetic & Logic Unit (ALU)) connected through a Cache to Memory (Program, Data, Bytes) over the Processor-Memory Interface (Enable?, Read/Write, Address, Write Data, Read Data); Input and Output attach through the I/O-Memory Interfaces]

• Processor organized around words and bytes
• Memory (including cache) organized around blocks, which are typically multiple words

Key Cache Concepts

• Principle of Locality
  – Temporal Locality and Spatial Locality
• Hierarchy of Memories (speed/size/cost per bit) to exploit locality
• Cache – copy of data in lower level of memory hierarchy
• Direct Mapped to find block in cache using Tag field and Valid bit for Hit
• Cache Design Organization Choices:
  – Fully Associative, Set-Associative, Direct-Mapped

Cache Organizations

• “Fully Associative”: Block placed anywhere in cache
  – First design last lecture
  – Note: No Index field, but one comparator/block
• “Direct Mapped”: Block goes only one place in cache
  – Note: Only one comparator
  – Number of sets = number of blocks
• “N-way Set Associative”: N places for block in cache
  – Number of sets = Number of Blocks / N
  – N comparators
  – Fully Associative: N = number of blocks
  – Direct Mapped: N = 1

Memory Block vs. Word Addressing

[Figure: the same 32 five-bit addresses (00000 through 11111) drawn three ways: as Byte addresses, as Word addresses (2 LSBs are 0), and as 8-Byte Block addresses (3 LSBs are 0). Each address splits into Block # (0 to 3) and Byte offset in block (0 to 7).]
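The relationship between byte, word, and block addresses can be sketched in a few lines of Python. This is only an illustration under the figure's parameters (4-byte words, 8-byte blocks); the function name `split_address` and the sample addresses are hypothetical, not from the lecture.

```python
WORD_BYTES = 4    # 32-bit words, so a word address has its 2 LSBs equal to 0
BLOCK_BYTES = 8   # 8-byte blocks, so a block address has its 3 LSBs equal to 0

def split_address(byte_addr):
    """Return (word_address, block_address, block_number, byte_offset_in_block)."""
    word_address = byte_addr & ~(WORD_BYTES - 1)     # clear the 2 low bits
    block_address = byte_addr & ~(BLOCK_BYTES - 1)   # clear the 3 low bits
    block_number = byte_addr // BLOCK_BYTES          # which block the byte is in
    byte_offset = byte_addr % BLOCK_BYTES            # position inside that block
    return word_address, block_address, block_number, byte_offset

# A few sample 5-bit byte addresses (chosen arbitrarily for illustration)
for addr in (0b00000, 0b01101, 0b11111):
    print(format(addr, "05b"), split_address(addr))
```

Byte address 01101 (13), for instance, lies in the word starting at 12 and in block 1 (starting at 8), at byte offset 5.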

Memory Block Number Aliasing

12-bit memory addresses, 16 Byte blocks

Address        Block #   Block # mod 8   Block # mod 2
010100100000   82        2               0
010100110000   83        3               1
010101000000   84        4               0
010101010000   85        5               1
010101100000   86        6               0
010101110000   87        7               1
010110000000   88        0               0
010110010000   89        1               1
010110100000   90        2               0
010110110000   91        3               1
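As a rough check of the aliasing table, the block number and its residues can be computed directly. The sketch below assumes the slide's 16-byte blocks; the helper names `block_number` and `aliasing_row` are made up for this example.

```python
BLOCK_BYTES = 16  # 16-byte blocks, as stated on the slide

def block_number(addr):
    """Block number = byte address with the 4 offset bits dropped."""
    return addr // BLOCK_BYTES

def aliasing_row(addr):
    """Return (block #, block # mod 8, block # mod 2) for one address."""
    n = block_number(addr)
    return n, n % 8, n % 2

first = 0b010100100000  # first address in the table (block 82)
rows = [aliasing_row(first + i * BLOCK_BYTES) for i in range(10)]
for row in rows:
    print(row)
```

Blocks 82 and 90 both give residue 2 modulo 8, so in a cache with 8 sets they compete for the same set; with only 2 sets, every other block collides.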

Processor Address Fields Used by Cache Controller

• Block Offset: Byte address within block
• Set Index: Selects which set
• Tag: Remaining portion of processor address

Processor Address (32 bits total): Tag | Set Index | Block offset

• Size of Index = log2(number of sets)
• Size of Tag = Address size – Size of Index – log2(number of bytes/block)
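One way to picture the three fields is to extract them with shifts and masks. This is a minimal sketch, assuming power-of-two set counts and block sizes; `split_fields` and the sample address are hypothetical.

```python
def split_fields(addr, index_bits, offset_bits):
    """Split a processor address into (tag, set index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)                 # lowest bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # middle bits
    tag = addr >> (offset_bits + index_bits)                 # remaining high bits
    return tag, index, offset

# Example geometry: 1024 sets (10 index bits) and 4-byte blocks (2 offset bits)
tag, index, offset = split_fields(0x12345678, index_bits=10, offset_bits=2)
print(hex(tag), index, offset)
```

With a 32-bit address this leaves 32 – 10 – 2 = 20 bits of tag, in line with the tag-size formula above.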

What Limits Number of Sets?

• For a given total number of blocks, we save comparators if we have more than two sets
• Limit: As Many Sets as Cache Blocks => only one block per set – only needs one comparator!
• Called “Direct-Mapped” Design

Address fields: Tag | Index | Block offset

Direct Mapped Cache Example: Mapping a 6-bit Memory Address

Address bits 5-4: Tag (Mem Block Within $ Block) | bits 3-2: Index (Block Within $) | bits 1-0: Byte Offset (Byte Within Block)

• In example, block size is 4 bytes / 1 word
• Memory and cache blocks always the same size, unit of transfer between memory and cache
• # Memory blocks >> # Cache blocks
  – 16 Memory blocks = 16 words = 64 bytes => 6 bits to address all bytes
  – 4 Cache blocks, 4 bytes (1 word) per block
  – 4 Memory blocks map to each cache block
• Memory block to cache block, aka index: middle two bits
• Which memory block is in a given cache block, aka tag: top two bits

One More Detail: Valid Bit

• When we start a new program, the cache does not have valid information for this program
• Need an indicator of whether this tag entry is valid for this program
• Add a “valid bit” to the cache tag entry
  – 0 => cache miss, even if by chance address = tag
  – 1 => cache hit, if processor address = tag

Cache Organization: Simple First Example

[Figure: a 4-block cache (Index 00, 01, 10, 11; each entry holds Valid, Tag, Data) beside a 16-word Main Memory with addresses 0000xx through 1111xx]

• One-word blocks. Two low-order bits (xx) define the byte in the block (32b words)

Q: Where in the cache is the mem block?
Use next 2 low-order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache)

Q: Is the memory block in cache?
Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache (provided valid bit is set)
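The two questions above (where is the block, and is it there) map onto a very small lookup routine. The sketch below models only valid bits and tags for the 4-block, one-word-block example; the class name `DirectMappedCache` and the access sequence are assumptions made for illustration, and no data is stored.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks valid bits and tags only."""

    def __init__(self, num_blocks=4, block_bytes=4):
        self.num_blocks = num_blocks
        self.block_bytes = block_bytes
        self.valid = [False] * num_blocks   # all entries start out invalid
        self.tags = [0] * num_blocks        # tag contents are meaningless until valid

    def access(self, addr):
        """Return True on a hit; on a miss, fill the entry and return False."""
        block_number = addr // self.block_bytes
        index = block_number % self.num_blocks   # "where in the cache?"
        tag = block_number // self.num_blocks    # "is it the right block?"
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            self.valid[index] = True
            self.tags[index] = tag
        return hit

cache = DirectMappedCache()
results = [cache.access(a) for a in (0b000000, 0b000010, 0b010000, 0b000000)]
print(results)
```

The first access misses even though the stored tag (0) happens to equal the address tag, because the valid bit is still clear.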

Example: Alternatives in an 8 Block Cache

• Direct Mapped: 8 blocks, 1 way, 1 tag comparator, 8 sets
• Fully Associative: 8 blocks, 8 ways, 8 tag comparators, 1 set
• 2 Way Set Associative: 8 blocks, 2 ways, 2 tag comparators, 4 sets
• 4 Way Set Associative: 8 blocks, 4 ways, 4 tag comparators, 2 sets

[Figure: blocks 0-7 arranged four ways: DM: 8 sets, 1 way; FA: 1 set, 8 ways; 2 Way SA: 4 sets (Set 0 to Set 3); 4 Way SA: 2 sets (Set 0, Set 1)]

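Direct-Mapped Cache

• One-word blocks, cache size = 1K words (or 4KB)

[Figure: 32-bit address split into a 20-bit Tag (bits 31-12), a 10-bit Index (bits 11-2), and a 2-bit Byte offset (bits 1-0). The Index selects one of 1024 entries (0 to 1023), each holding Valid, Tag (20 bits), and Data (32 bits). A Comparator checks the stored Tag against the address Tag to produce Hit.]

• Valid bit ensures something useful in cache for this index
• Compare Tag with upper part of Address to see if a Hit
• Read data from cache instead of memory if a Hit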

Peer Instruction

• For a cache with constant total capacity, if we increase the number of ways by a factor of two, which statement is false:
  A: The number of sets could be doubled
  B: The tag width could decrease
  C: The block size could stay the same
  D: The block size could be halved

Break!



Handling Stores with Write-Through

• Store instructions write to memory, changing values
• Need to make sure cache and memory have same values on writes: two policies

1) Write-Through Policy: write cache and write through the cache to memory
  – Every write eventually gets to memory
  – Too slow, so include Write Buffer to allow processor to continue once data in Buffer
  – Buffer updates memory in parallel to processor

Write-Through Cache

• Write both values in cache and in memory
• Write buffer stops CPU from stalling if memory cannot keep up
• Write buffer may have multiple entries to absorb bursts of writes
• What if store misses in cache?

[Figure: Processor, Cache, and Memory connected by 32-bit Address and 32-bit Data paths; a Write Buffer of Addr/Data entries sits between Cache and Memory]

Handling Stores with Write-Back

2) Write-Back Policy: write only to cache and then write cache block back to memory when evict block from cache
  – Writes collected in cache, only single write to memory per block
  – Include bit to see if wrote to block or not, and then only write back if bit is set
    • Called “Dirty” bit (writing makes it “dirty”)

Write-Back Cache

• Store/cache hit, write data in cache only and set dirty bit
  – Memory has stale value
• Store/cache miss, read data from memory, then update and set dirty bit
  – “Write-allocate” policy
• Load/cache hit, use value from cache
• On any miss, write back evicted block, only if dirty. Update cache with new block and clear dirty bit

[Figure: Processor, Cache, and Memory connected by 32-bit Address and 32-bit Data paths; each cache entry carries a Dirty Bit (D)]

Write-Through vs. Write-Back

• Write-Through:
  – Simpler control logic
  – More predictable timing simplifies processor control logic
  – Easier to make reliable, since memory always has copy of data (big idea: Redundancy!)
• Write-Back
  – More complex control logic
  – More variable timing (0, 1, 2 memory accesses per cache access)
  – Usually reduces write traffic
  – Harder to make reliable, sometimes cache has only copy of data
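To make the write-traffic difference concrete, here is a small counting sketch. It assumes a direct-mapped, write-allocate cache, considers stores only, stores the whole block number as the tag for simplicity, and flushes all dirty blocks at the end so the two policies are compared fairly. The function `memory_writes` and the store sequence are hypothetical, not taken from the lecture.

```python
def memory_writes(store_blocks, num_blocks, policy):
    """Count writes that reach memory for a sequence of stores to block numbers."""
    if policy == "write-through":
        return len(store_blocks)          # every store is also written to memory
    # Write-back: a block is written to memory only when a dirty copy is evicted.
    tags = [None] * num_blocks            # None models an invalid entry
    dirty = [False] * num_blocks
    writes = 0
    for block in store_blocks:
        index = block % num_blocks
        if tags[index] != block:          # store miss: evict, then allocate
            if dirty[index]:
                writes += 1               # write back the evicted dirty block
            tags[index] = block
        dirty[index] = True               # the store makes the block dirty
    return writes + sum(dirty)            # assumed final flush of dirty blocks

stores = [0, 0, 0, 1, 1, 4]               # block numbers written, in order
print(memory_writes(stores, 4, "write-through"))
print(memory_writes(stores, 4, "write-back"))
```

For this sequence, write-through sends six writes to memory while write-back sends three, because repeated stores to the same block are collected in the cache.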

Administrivia

• Midterm #2 2 weeks away! October 31!
  – In class! 8-9:30 AM
  – Synchronous digital design and Project 2 (processor design) included
  – Pipelines and Caches
  – ONE Double-sided Crib sheet
  – Review Session: Saturday, Oct 28 (Location TBA)
• 5-10 open drop-in seats for these tutoring sessions:
  – M 1-2 Soda 611
  – Th 3-4 Soda 380
  – F 5-6 Soda 651
• Guerrilla Session tonight 7-9 pm in Cory 293
• Project 2-1 Party tomorrow 7-9 pm Cory 293
• If you would like to change your partnership for Project 2, email your lab TA
  – We will send out a Google form to track all Project 2 partnerships


Cache (Performance) Terms

• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 – Hit rate
• Miss penalty: time to replace a block from lower level in memory hierarchy to cache
• Hit time: time to access cache memory (including tag comparison)
• Abbreviation: “$” = cache (a Berkeley innovation!)

Average Memory Access Time (AMAT)

• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache

AMAT = Time for a hit + Miss rate × Miss penalty
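The AMAT formula translates directly into a one-line function. The numbers in the example call (1-cycle hit, 5% miss rate, 20-cycle miss penalty) are made up for illustration and are not from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = Time for a hit + Miss rate x Miss penalty (all in the same time unit)."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical cache: 1-cycle hit time, 5% miss rate, 20-cycle miss penalty
example_cycles = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
print(example_cycles, "cycles")
```

Here the misses add 0.05 × 20 = 1 cycle on average, so the AMAT is about 2 cycles, twice the hit time even though 95% of accesses hit.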

Peer Instruction

AMAT = Time for a hit + Miss rate × Miss penalty

• Given a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache hit time of 1 clock cycle, what is AMAT?
  A: ≤ 200 psec
  B: 400 psec
  C: 600 psec
  D: 800 psec

Ping Pong Cache Example: Direct-Mapped Cache w/ 4 Single-Word Blocks, Worst-Case Reference String

• Consider the main memory address reference string of word numbers: 0 4 0 4 0 4 0 4
• Start with an empty cache - all blocks initially marked as not valid

Access   Result   Cache block 0 after access (tag, data)
0        miss     00 Mem(0)
4        miss     01 Mem(4)
0        miss     00 Mem(0)
4        miss     01 Mem(4)
0        miss     00 Mem(0)
4        miss     01 Mem(4)
0        miss     00 Mem(0)
4        miss     01 Mem(4)

• 8 requests, 8 misses
• Ping-pong effect due to conflict misses - two memory locations that map into the same cache block
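The ping-pong behavior can be reproduced with a short simulation. This sketch counts misses for a direct-mapped cache with one-word blocks, using word numbers as addresses; the function name `count_misses_direct_mapped` is hypothetical.

```python
def count_misses_direct_mapped(word_refs, num_blocks):
    """Count misses for a direct-mapped cache with one-word blocks."""
    tags = [None] * num_blocks        # None models a cleared valid bit
    misses = 0
    for word in word_refs:
        index = word % num_blocks     # which cache block the word maps to
        tag = word // num_blocks      # which memory block is stored there
        if tags[index] != tag:        # invalid entry or wrong tag: miss
            misses += 1
            tags[index] = tag         # replace whatever was in the block
    return misses

ping_pong = [0, 4, 0, 4, 0, 4, 0, 4]
print(count_misses_direct_mapped(ping_pong, num_blocks=4))
```

Words 0 and 4 both map to index 0 and keep evicting each other, so every one of the 8 references misses.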


Example: 2-Way Set Associative $ (4 words = 2 sets x 2 ways per set)

[Figure: a cache with 2 sets (Set 0, Set 1), each with 2 ways (Way 0, Way 1; each entry holds V, Tag, Data), beside a 16-word Main Memory with addresses 0000xx through 1111xx]

• One-word blocks. Two low-order bits define the byte in the word (32b words)

Q: How do we find it?
Use next 1 low-order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)

Q: Is it there?
Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache

Ping Pong Cache Example: 4-Word 2-Way SA $, Same Reference String

• Consider the main memory word reference string 0 4 0 4 0 4 0 4
• Start with an empty cache - all blocks initially marked as not valid

Access   Result   Set 0 after access (tag, data)
0        miss     000 Mem(0)
4        miss     000 Mem(0), 010 Mem(4)
0        hit      000 Mem(0), 010 Mem(4)
4        hit      000 Mem(0), 010 Mem(4)

• 8 requests, 2 misses
• Solves the ping-pong effect in a direct-mapped cache due to conflict misses since now two memory locations that map into the same cache set can co-exist!
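The same reference string can be run through a set-associative model to see the misses drop. The sketch below assumes one-word blocks and LRU replacement within each set; `count_misses_set_associative` is a hypothetical name, and each set is kept as a list ordered from least to most recently used.

```python
def count_misses_set_associative(word_refs, num_sets, ways):
    """Count misses for an N-way set-associative cache with LRU replacement."""
    sets = [[] for _ in range(num_sets)]   # each list holds tags, LRU first
    misses = 0
    for word in word_refs:
        index = word % num_sets            # selects the set
        tag = word // num_sets             # compared against every way in the set
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)              # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(lines) == ways:         # set is full: evict least recently used
                lines.pop(0)
        lines.append(tag)                  # mark this tag as most recently used
    return misses

ping_pong = [0, 4, 0, 4, 0, 4, 0, 4]
print(count_misses_set_associative(ping_pong, num_sets=2, ways=2))
```

With 2 sets of 2 ways, words 0 and 4 share set 0 but occupy different ways, so only the first two references miss. Calling the same function with `num_sets=4, ways=1` models the direct-mapped case and gives 8 misses.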

Four-Way Set-Associative Cache

• 2^8 = 256 sets each with four ways (each with one block)

[Figure: 32-bit address split into a 22-bit Tag, an 8-bit Index, and a 2-bit Byte offset. The Index selects entry 0 to 255 in each of Way 0, Way 1, Way 2, and Way 3 (each entry holds V, Tag, Data). The tag compares produce Hit and drive a 4x1 select that outputs the 32-bit Data.]

Break!


Range of Set-Associative Caches

• For a fixed-size cache and fixed block size, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets – decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

Address fields: Tag | Index | Word offset | Byte offset
  – Tag: used for tag compare
  – Index: selects the set
  – Word offset: selects the word in the block

• Decreasing associativity, lower way, more sets
  – Direct mapped (only one way): smaller tags, only a single comparator
• Increasing associativity, higher way, fewer sets
  – Fully associative (only one set): tag is all the bits except block and byte offset

Total Cache Capacity

Total Cache Capacity = Associativity × # of sets × block_size
Bytes = blocks/set × sets × Bytes/block

C = N × S × B

Address fields: Tag | Index | Byte Offset

address_size = tag_size + index_size + offset_size
             = tag_size + log2(S) + log2(B)

• Double the Associativity: Number of sets? tag_size? index_size? # comparators?
• Double the Sets: Associativity? tag_size? index_size? # comparators?
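The capacity and address-size relations can be turned into a small calculator. This is a sketch assuming power-of-two sizes and a 32-bit address by default; the function `cache_geometry` and its dictionary keys are hypothetical names chosen for this example.

```python
def cache_geometry(capacity_bytes, ways, block_bytes, address_bits=32):
    """Derive sets and field widths from C = N x S x B (power-of-two sizes assumed)."""
    sets = capacity_bytes // (ways * block_bytes)      # S = C / (N x B)
    index_bits = sets.bit_length() - 1                 # log2(S)
    offset_bits = block_bytes.bit_length() - 1         # log2(B)
    tag_bits = address_bits - index_bits - offset_bits
    return {"sets": sets, "index_bits": index_bits,
            "offset_bits": offset_bits, "tag_bits": tag_bits,
            "comparators": ways}

# 4 KB cache with 4-byte blocks at increasing associativity
for n in (1, 2, 4):
    print(n, cache_geometry(4096, ways=n, block_bytes=4))
```

At fixed capacity and block size, each doubling of the ways halves the sets, moving one bit from the index to the tag. The 1-way and 4-way results match the 20-bit-tag direct-mapped and 22-bit-tag four-way figures earlier in the lecture.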

Your Turn

• For a cache of 64 blocks, each block four bytes in size:
  1. The capacity of the cache is: ____ bytes.
  2. Given a 2-way Set Associative organization, there are ___ sets, each of __ blocks, and __ places a block from memory could be placed.
  3. Given a 4-way Set Associative organization, there are ____ sets each of __ blocks and __ places a block from memory could be placed.
  4. Given an 8-way Set Associative organization, there are ____ sets each of __ blocks and ___ places a block from memory could be placed.

Peer Instruction

• For S sets, N ways, B blocks, which statements hold?
  (i) The cache has B tags
  (ii) The cache needs N comparators
  (iii) B = N x S
  (iv) Size of Index = Log2(S)

  A: (i) only
  B: (i) and (ii) only
  C: (i), (ii), (iii) only
  D: All four statements are true


Costs of Set-Associative Caches

• N-way set-associative cache costs
  – N comparators (delay and area)
  – MUX delay (set selection) before data is available
  – Data available after set selection (and Hit/Miss decision). DM $: block is available before the Hit/Miss decision
    • In Set-Associative, not possible to just assume a hit and continue and recover later if it was a miss
• When miss occurs, which way’s block selected for replacement?
  – Least Recently Used (LRU): one that has been unused the longest (principle of temporal locality)
    • Must track when each way’s block was used relative to other blocks in the set
    • For 2-way SA $, one bit per set → set to 1 when a block is referenced; reset the other way’s bit (i.e., “last used”)
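For the 2-way case, the LRU bookkeeping is small enough to sketch directly. The model below is one possible encoding, assumed for illustration: a single bit per set records which way was used last, and the victim is the other way. The class name `TwoWayLRUBit` and the choice of way 0 as the initial "last used" way are assumptions, not details from the lecture.

```python
class TwoWayLRUBit:
    """One 'last used' bit per set for a 2-way set-associative cache."""

    def __init__(self, num_sets):
        # Assumption: before any access, way 0 is treated as last used.
        self.last_used = [0] * num_sets

    def touch(self, set_index, way):
        """Record that `way` (0 or 1) of the given set was just referenced."""
        self.last_used[set_index] = way

    def victim(self, set_index):
        """The least recently used way is the one that was not used last."""
        return 1 - self.last_used[set_index]

lru = TwoWayLRUBit(num_sets=2)
lru.touch(0, 1)          # way 1 of set 0 referenced
print(lru.victim(0))     # way 0 is now the replacement candidate
```

With more than two ways a single bit no longer captures the full ordering, which is one reason higher associativity makes exact LRU more costly to track.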

Cache Replacement Policies

• Random Replacement
  – Hardware randomly selects a cache entry to evict
• Least-Recently Used
  – Hardware keeps track of access history
  – Replace the entry that has not been used for the longest time
  – For 2-way set-associative cache, need one bit for LRU replacement
• Example of a Simple “Pseudo” LRU Implementation
  – Assume 64 Fully Associative entries
  – Hardware replacement pointer points to one cache entry
  – Whenever access is made to the entry the pointer points to:
    • Move the pointer to the next entry
  – Otherwise: do not move the pointer
  – (example of “not-most-recently used” replacement policy)

[Figure: Replacement Pointer pointing into a column of entries: Entry 0, Entry 1, ..., Entry 63]
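The replacement-pointer scheme can be modeled in a few lines. This sketch follows the rule on the slide (advance the pointer only when the pointed-to entry is accessed); wrapping from the last entry back to entry 0 is an assumption, and the class name `ReplacementPointer` is hypothetical.

```python
class ReplacementPointer:
    """'Not-most-recently-used' replacement pointer over fully associative entries."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.pointer = 0                 # entry that would be replaced next

    def on_access(self, entry):
        """Advance only if the accessed entry is the one the pointer points to."""
        if entry == self.pointer:
            # Assumption: the pointer wraps around after the last entry.
            self.pointer = (self.pointer + 1) % self.num_entries

    def victim(self):
        """Entry to evict on the next miss."""
        return self.pointer

rp = ReplacementPointer(num_entries=4)
rp.on_access(2)          # pointer stays at entry 0
rp.on_access(0)          # pointer moves on to entry 1
print(rp.victim())
```

The pointer never rests on the entry that was just used, so the victim is guaranteed not to be the most recently used entry, though it is not necessarily the least recently used one.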

Benefits of Set-Associative Caches

• Choice of DM $ versus SA $ depends on the cost of a miss versus the cost of implementation
• Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)


ChipPhotos


And in Conclusion…

• Name of the Game: Reduce AMAT
  – Reduce Hit Time
  – Reduce Miss Rate
  – Reduce Miss Penalty
• Balance cache parameters (Capacity, associativity, block size)