Cache Lab Implementaon and Blocking

Carnegie Mellon

1

CacheLabImplementa/onandBlocking

AdityaShah Recita-on7:Oct8th,2015

Carnegie Mellon

2

WelcometotheWorldofPointers!

Carnegie Mellon

3

Outline¢  Schedule¢  Memoryorganiza/on¢  Caching

§  Differenttypesoflocality§  Cacheorganiza-on

¢  Cachelab§  Part(a)BuildingCacheSimulator§  Part(b)EfficientMatrixTranspose§  Blocking

Carnegie Mellon

4

ClassSchedule¢  CacheLab

§  DuethisThursday,Oct15th.§  Startnow(ifyouhaven’talready!)

¢  ExamSoon!§  Startdoingprac-ceproblems.§  TheyhavebeenuploadedontotheCourseWebsite!

Carnegie Mellon

5

MemoryHierarchy

Registers

L1cache(SRAM)

Mainmemory(DRAM)

Localsecondarystorage

(localdisks)

Larger,slower,cheaperperbyte

Remotesecondarystorage(tapes,distributedfilesystems,Webservers)

Localdisksholdfilesretrievedfromdisksonremotenetworkservers

Mainmemoryholdsdiskblocksretrievedfromlocaldisks

L2cache(SRAM)

L1cacheholdscachelinesretrievedfromL2cache

CPUregistersholdwordsretrievedfromL1cache

L2cacheholdscachelinesretrievedfrommainmemory

L0:

L1:

L2:

L3:

L4:

L5:

Smaller,faster,costlierperbyte

Carnegie Mellon

6

MemoryHierarchy

We will discuss this interaction

¢  Registers

¢  SRAM

¢  DRAM

¢  Local Secondary storage

¢  Remote Secondary storage

Carnegie Mellon

7

SRAM vs DRAM tradeoff¢  SRAM (cache)

§  Faster (L1 cache: 1 CPU cycle)§  Smaller (Kilobytes (L1) or Megabytes (L2))§  More expensive and “energy-hungry”

¢  DRAM (main memory)§  Relatively slower (hundreds of CPU cycles)§  Larger (Gigabytes)§  Cheaper

Carnegie Mellon

8

Locality¢  Temporallocality

§  Recentlyreferenceditemsarelikelytobereferencedagaininthenearfuture

§  AWeraccessingaddressXinmemory,savethebytesincacheforfutureaccess

¢  Spa/allocality§  Itemswithnearbyaddressestend

tobereferencedclosetogetherin-me§  AWeraccessingaddressX,savetheblockofmemoryaroundXin

cacheforfutureaccess

Carnegie Mellon

9

MemoryAddress¢  64-bitonsharkmachines

¢  Blockoffset:bbits¢  Setindex:sbits¢  TagBits:(AddressSize–b–s)

Carnegie Mellon

10

Cache¢  A cache is a set of 2^s cache sets

¢  A cache set is a set of E cache lines§  E is called associativity§  If E=1, it is called “direct-mapped”

¢  Each cache line stores a block§  Each block has B = 2^b bytes

¢  Total Capacity = S*B*E

Carnegie Mellon

11

VisualCacheTerminologyE lines per set

S = 2s sets

0 1 2 B-1 tag v

valid bit B = 2b bytes per cache block (the data)

t bits s bits b bits Address of word:

tag set index

block offset

data begins at this offset

Carnegie Mellon

12

GeneralCacheConcepts

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

8 9 14 3Cache

MemoryLarger,slower,cheapermemoryviewedaspar//onedinto“blocks”

Dataiscopiedinblock-sizedtransferunits

Smaller,faster,moreexpensivememorycachesasubsetoftheblocks

4

4

4

10

10

10

Carnegie Mellon

13

GeneralCacheConcepts:Miss

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

8 9 14 3Cache

Memory

DatainblockbisneededRequest:12

Blockbisnotincache:Miss!

BlockbisfetchedfrommemoryRequest:12

12

12

12

Blockbisstoredincache• Placementpolicy:determineswherebgoes• Replacementpolicy:determineswhichblockgetsevicted(vic-m)

Carnegie Mellon

14

GeneralCachingConcepts:TypesofCacheMisses

¢  Cold(compulsory)miss§  Thefirstaccesstoablockhastobeamiss

¢  Conflictmiss§  Conflictmissesoccurwhenthelevelkcacheislargeenough,butmul-ple

dataobjectsallmaptothesamelevelkblock§  E.g.,Referencingblocks0,8,0,8,0,8,...wouldmissevery-me

¢  Capacitymiss§  Occurswhenthesetofac-vecacheblocks(workingset)islargerthan

thecache

Carnegie Mellon

15

CacheLab¢  Part(a)Buildingacachesimulator

¢  Part(b)Op/mizingmatrixtranspose

Carnegie Mellon

16

Part(a):Cachesimulator¢  AcachesimulatorisNOTacache!

§  MemorycontentsNOTstored§  BlockoffsetsareNOTused–thebbitsinyouraddressdon’t

mader.§  Simplycounthits,misses,andevic-ons

¢  Yourcachesimulatorneedstoworkfordifferents,b,E,givenatrun/me.

¢  UseLRU–LeastRecentlyUsedreplacementpolicy§  Evicttheleastrecentlyusedblockfromthecachetomakeroom

forthenextblock.§  Queues?TimeStamps?

Carnegie Mellon

17

Part(a):Hints¢  Acacheisjust2Darrayofcachelines:

§  structcache_linecache[S][E];§  S=2^s,isthenumberofsets§  Eisassocia-vity

¢  Eachcache_linehas:§  Validbit§  Tag§  LRUcounter(onlyifyouarenotusingaqueue)

Carnegie Mellon

18

Part(a):getopt

¢ getopt()automatesparsingelementsontheunixcommandlineIffunc/ondeclara/onismissing

§  Typicallycalledinalooptoretrievearguments§  Itsreturnvalueisstoredinalocalvariable§  Whengetopt()returns-1,therearenomoreop-ons

¢ Tousegetopt,yourprogrammustincludetheheaderfile #include<unistd.h>

¢ Ifnotrunningonthesharkmachinesthenyouwillneed#include<getopt.h>.

§ BederAdvice:RunonSharkMachines!

Carnegie Mellon

19

Part(a):getopt¢ Aswitchstatementisusedonthelocalvariableholdingthereturnvaluefromgetopt()

§  Eachcommandlineinputcasecanbetakencareofseparately§  “optarg”isanimportantvariable–itwillpointtothevalueofthe

op-onargument

¢  Thinkabouthowtohandleinvalidinputs

¢ Formoreinforma/on,§  lookatman3getopt§  hdp://www.gnu.org/soWware/libc/manual/html_node/

Getopt.html

Carnegie Mellon

20

Part(a):getoptExampleint main(int argc, char** argv){ int opt,x,y; /* looping over arguments */ while(-1 != (opt = getopt(argc, argv, “x:y:"))){ /* determine which argument it’s processing */ switch(opt) { case 'x': x = atoi(optarg); break; case ‘y': y = atoi(optarg); break; default: printf(“wrong argument\n"); break; } } }

¢ Supposetheprogramexecutablewascalled“foo”.Thenwewouldcall“./foo-x1–y3“topassthevalue1tovariablexand3toy.

Carnegie Mellon

21

Part(a):fscanf¢ Thefscanf()func/onisjustlikescanf()exceptitcanspecifyastreamtoreadfrom(scanfalwaysreadsfromstdin)

§  parameters:§  Astreampointer§  formatstringwithinforma-ononhowtoparsethefile§  therestarepointerstovariablestostoretheparseddata

§  Youtypicallywanttousethisfunc-oninaloop.Itreturns-1whenithitsEOForifthedatadoesn’tmatchtheformatstring

¢  Formoreinforma/on,§  manfscanf§  hdp://crasseux.com/books/ctutorial/fscanf.html

¢  fscanfwillbeusefulinreadinglinesfromthetracefiles.§  L10,1 §  M20,1

Carnegie Mellon

22

Part(a):fscanfexampleFILE*pFile;//pointertoFILEobjectpFile=fopen("tracefile.txt",“r");//openfileforreadingcharidentifier;unsignedaddress;intsize;//Readinglineslike"M20,1"or"L19,3"while(fscanf(pFile,“%c%x,%d”,&identifier,&address,&size)>0){

//Dostuff}fclose(pFile);//remembertoclosefilewhendone

Carnegie Mellon

23

Part(a):Malloc/free¢ Usemalloctoallocatememoryontheheap

¢ Alwaysfreewhatyoumalloc,otherwisemaygetmemoryleak

§  some_pointer_you_malloced=malloc(sizeof(int));§  Free(some_pointer_you_malloced);

¢ Don’tfreememoryyoudidn’tallocate

Carnegie Mellon

24

1 5 9 13

2 6 10 14

3 7 11 15

4 8 12 16

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Part(b)EfficientMatrixTranspose¢  Matrix Transpose (A -> B)

Matrix A Matrix B

¢  How do we optimize this operation using the cache?

Carnegie Mellon

25

Part(b):EfficientMatrixTranspose¢  SupposeBlocksizeis8bytes?

¢  AccessA[0][0]cachemiss Shouldwehandle3&4¢  AccessB[0][0]cachemiss nextor5&6?¢  AccessA[0][1]cachehit¢  AccessB[1][0]cachemiss

Carnegie Mellon

26

Part(b):Blocking

¢  Blocking:dividematrixintosub-matrices.

¢   Sizeofsub-matrixdependsoncacheblocksize,cachesize,inputmatrixsize.

¢   Trydifferentsub-matrixsizes.

Carnegie Mellon

27

Example:MatrixMul/plica/on

a b

i

j

*c

=

c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i++)

for (j = 0; j < n; j++) for (k = 0; k < n; k++)

c[i*n + j] += a[i*n + k] * b[k*n + j]; }

Carnegie Mellon

28

CacheMissAnalysis¢  Assume:

§  Matrixelementsaredoubles§  Cacheblock=8doubles§  CachesizeC<<n(muchsmallerthann)

¢  Firstitera/on:§  n/8+n=9n/8misses

§  AWerwardsincache:(schema-c)

*=

n

*=8wide

Carnegie Mellon

29


§  Matrixelementsaredoubles§  Cacheblock=8doubles§  CachesizeC<<n(muchsmallerthann)

¢  Seconditera/on:§  Again:

n/8+n=9n/8misses

¢  Totalmisses:§  9n/8*n2=(9/8)*n3

n

*=8wide

Carnegie Mellon

30

BlockedMatrixMul/plica/onc = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i+=B)

for (j = 0; j < n; j+=B) for (k = 0; k < n; k+=B)

/* B x B mini matrix multiplications */ for (i1 = i; i1 < i+B; i++) for (j1 = j; j1 < j+B; j++) for (k1 = k; k1 < k+B; k++)

c[i1*n+j1] += a[i1*n + k1]*b[k1*n + j1]; }

a b

i1

j1

*c

=c

+

BlocksizeBxB

Carnegie Mellon

31


§  Cacheblock=8doubles§  CachesizeC<<n(muchsmallerthann)§  Threeblocksfitintocache:3B2<C

¢  First(block)itera/on:§  B2/8missesforeachblock§  2n/B*B2/8=nB/4

(omiyngmatrixc)

§  AWerwardsincache(schema-c)

*=

*=

BlocksizeBxB

n/Bblocks

Carnegie Mellon

32


§  Cacheblock=8doubles§  CachesizeC<<n(muchsmallerthann)§  Threeblocksfitintocache:3B2<C

¢  Second(block)itera/on:§  Sameasfirstitera-on§  2n/B*B2/8=nB/4

¢  Totalmisses:§  nB/4*(n/B)2=n3/(4B)

*=

BlocksizeBxB

n/Bblocks

Carnegie Mellon

33

Part(b):BlockingSummary¢  Noblocking:(9/8)*n3

¢  Blocking:1/(4B)*n3

¢  SuggestlargestpossibleblocksizeB,butlimit3B2<C!

¢  Reasonfordrama/cdifference:§  Matrixmul-plica-onhasinherenttemporallocality:

§  Inputdata:3n2,computa-on2n3

§  EveryarrayelementsusedO(n)-mes!§  Butprogramhastobewridenproperly

¢  Foradetaileddiscussionofblocking:§  hdp://csapp.cs.cmu.edu/public/waside.html

Carnegie Mellon

34

Part(b):Specs¢  Cache:

§  You get 1 kilobytes of cache§  Directly mapped (E=1)§  Block size is 32 bytes (b=5)§  There are 32 sets (s=5)

¢  Test Matrices:§  32 by 32§  64 by 64§  61 by 67

Carnegie Mellon

35

Part(b)¢  Things you’ll need to know:

§  Warnings are errors§  Header files§  Eviction policies in the cache

Carnegie Mellon

36

WarningsareErrors¢  Strictcompila/onflags

¢  Reasons:§  Avoidpoten-alerrorsthatarehardtodebug§  Learngoodhabitsfromthebeginning

¢  Add“-Werror”toyourcompila/onflags

Carnegie Mellon

37

MissingHeaderFiles¢  Remembertoincludefilesthatwewillbeusingfunc/ons

from

¢  Iffunc/ondeclara/onismissing§  Findcorrespondingheaderfiles§  Use:man<func-on-name>

¢  Liveexample§  man3getopt

Carnegie Mellon

38

Evic/onpoliciesofCache¢  ThefirstrowofMatrixAevictsthefirstrowofMatrixB

§  Cachesarememoryaligned.§  MatrixAandBarestoredinmemoryataddressessuchthatboth

thefirstelementsaligntothesameplaceincache!§  Diagonalelementsevicteachother.

¢  Matricesarestoredinmemoryinarowmajororder.§  Iftheen-rematrixcan’tfitinthecache,thenaWerthecacheisfull

withalltheelementsitcanload.Thenextelementswillevicttheexis-ngelementsofthecache.

§  Example:-4x4Matrixofintegersanda32bytecache.§  Thethirdrowwillevictthefirstrow!

Carnegie Mellon

39

Style¢ Readthestyleguideline

§  ButIalreadyreadit!§  Good,readitagain.

¢  Startforminggoodhabitsnow!

Carnegie Mellon

40

Ques/ons?

Cache Lab Implementaon and Blocking

Documents

Transcript of Cache Lab Implementaon and Blocking