Everything You Always Wanted to Know About ... (csl.skku.edu/uploads/ECE5658S16/pr7.pdf)

Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Tudor David, Rachid Guerraoui, and Vasileios Trigonakis
Ecole Polytechnique Federale de Lausanne (EPFL)

Haksu Lim, Luis, Hwanjin Jeong

2016-04-18

Multi-Core

• Multi-core is used in many systems

• As the number of cores ↑, performance ↑?

NO

Synchronization is one of the biggest scalability bottlenecks

Synchronization

• Why do we use it?
▪ Concurrent access to shared data
▪ To ensure orderly execution

• Why is synchronization a bottleneck?
▪ Hardware
▪ Synchronization algorithm
▪ Application context
▪ Workload

Focusing on this

Cache Coherence

• Multi-core systems have a separate cache for each core
▪ Write operations break consistency among the caches

• Cache coherence
▪ Keeps the caches of a common memory resource consistent

Cache Coherence Protocols

• MSI protocol

[Figure: MSI state diagram with the states Modified, Shared, and Invalid; transitions are labeled Read, Write, Bus read, and Bus write]
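The state diagram did not survive extraction, so here is a minimal sketch of the textbook MSI transitions for one cache line in one cache. It is not taken from the slide: the event names ("read", "write", "bus_read", "bus_write") and the function name run are assumptions chosen to match the surviving edge labels, with "bus_write" standing for another core's write request.

```python
# Textbook MSI transitions for ONE cache line in ONE cache (a sketch).
# "read"/"write" come from the local core; "bus_read"/"bus_write" are
# requests observed from other cores. Event names are illustrative.
MSI = {
    ("I", "read"): "S",       # miss: fetch a shared copy
    ("I", "write"): "M",      # miss: fetch an exclusive, writable copy
    ("I", "bus_read"): "I",
    ("I", "bus_write"): "I",
    ("S", "read"): "S",
    ("S", "write"): "M",      # upgrade: other copies get invalidated
    ("S", "bus_read"): "S",
    ("S", "bus_write"): "I",  # another core writes: drop our copy
    ("M", "read"): "M",
    ("M", "write"): "M",
    ("M", "bus_read"): "S",   # supply the dirty data, keep a shared copy
    ("M", "bus_write"): "I",  # supply the dirty data, then invalidate
}


def run(events, state="I"):
    """Apply a sequence of events and return every state visited."""
    trace = [state]
    for event in events:
        state = MSI[(state, event)]
        trace.append(state)
    return trace


print(run(["read", "write", "bus_read", "bus_write"]))
# prints ['I', 'S', 'M', 'S', 'I']
```

The trace shows why writes to shared data are costly: every "write" by one core forces a "bus_write" transition to Invalid in every other cache that holds the line.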

Cache Coherence Protocols

• MESI protocol
▪ Adds an Exclusive state
− No other cache has a copy of this cache line
▪ Reduces expensive invalidate operations

• MOESI protocol
▪ Adds an Owned state
− This cache line has been modified, but there might be more shared copies on other cores
▪ Reduces expensive write operations to memory

Cache Coherence Example

• Acquiring a lock

[Figure: two processors, each executing Acq(lock), with per-core caches (State and Data columns; entries Modified with held=1, and Invalid) above shared memory, shown with held=0 and then held=1. Labeled steps: Read-Exclusive, Update, Invalidate.]

What to deal with

• Hardware processors
▪ Multi-sockets
– AMD Opteron: 4 x 6172, 48 cores
– Intel Xeon: 8 x E7-8867L, 80 cores
▪ Single-sockets
– Sun Niagara 2: 8 cores
– Tilera TILE-Gx36: 36 cores

• Synchronization layer
▪ Concurrent software
– Hash table, etc.
▪ Primitives
– Lock, etc.
▪ Atomic operations
– Compare & swap, etc.
▪ Cache coherence
– Load & store

Hardware-Level Analysis

Local Accesses

[Figure: core layouts of the multi-socket Opteron and Xeon]

• Opteron: within socket, 40 ns
• Xeon: within socket, 20–40 ns

Remote Accesses

[Figure: core layouts of the multi-socket Opteron and Xeon]

• Opteron: within socket, 40 ns; per hop, +40 ns
• Xeon: within socket, 20–40 ns; per hop, +50 ns
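To make the per-hop arithmetic concrete, the sketch below adds the quoted per-hop cost to the quoted within-socket latency. It assumes the cost grows linearly with the number of socket hops, which is how the "+40 ns" and "+50 ns" labels read; the table LATENCY_NS and the function access_latency_ns are illustrative names, not part of the study.

```python
# Latencies in nanoseconds, as quoted on the slide.
# within_socket is a (low, high) range; Opteron has a single value.
LATENCY_NS = {
    "Opteron": {"within_socket": (40, 40), "per_hop": 40},
    "Xeon": {"within_socket": (20, 40), "per_hop": 50},
}


def access_latency_ns(machine, hops):
    """Estimated (low, high) latency for an access that crosses `hops` sockets.

    Assumes each hop adds a fixed cost on top of the within-socket latency.
    """
    low, high = LATENCY_NS[machine]["within_socket"]
    extra = hops * LATENCY_NS[machine]["per_hop"]
    return (low + extra, high + extra)


for machine in LATENCY_NS:
    for hops in range(3):
        print(machine, hops, access_latency_ns(machine, hops))
```

Under this assumption, a two-hop access on the Xeon costs 120 to 140 ns, several times the within-socket figure, which is the effect the following slides quantify.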

Operation Latency – Multi Socket

[Figure: operation latencies on the multi-socket machines, annotated 7.5x and 3x]

Crossing sockets is a killer: up to 7.5x more expensive

Single-Socket Processors

[Figure: core layouts of Niagara and Tilera]

• Niagara: cores are equidistant from the cache; uniform, 23 ns
• Tilera: non-uniform; 1 hop, 40 ns; per hop, +2 ns

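Operation Latency – Single Socket

[Figure: operation latencies on the single-socket machines, annotated 0.5x]

The uniform design is expected to scale better; the non-uniform design is affected by both the distance and the number of involved cores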

Atomic Operations – Multi Sockets

• Very fast single-thread performance
▪ But it drops on two or more cores and decreases further when there is cross-socket communication

[Figure: atomic operation throughput on Opteron and Xeon]

Atomic Operations – Single Sockets

• Lower single-thread throughput
▪ But throughput scales up to a maximum value

[Figure: atomic operation throughput on Niagara and Tilera]

Software-Level Analysis

Analysis Scope

• 9 locks
▪ Spin locks
– Test-and-test-and-set lock (TTAS), ticket lock
▪ Queue-based locks
– Array-based lock, CLH lock, MCS lock
▪ Hierarchical locks
– Hierarchical CLH lock, hierarchical ticket lock
▪ Mutex

• Concurrent software
▪ Hash table

Ticket Lock

[Figure: the lock holds two counters, "Next ticket" and "Now serving". One thread has acquired the lock with ticket 0; the threads holding tickets 1, 2, 3, and 4 are acquiring and spin.]

Ticket Lock

[Figure: the holder releases the lock; the threads holding tickets 1, 2, 3, and 4 are shown waiting.]
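The two counters in the figure can be sketched in a few lines. This is a minimal illustration, not the implementation measured in the study: Python has no hardware fetch-and-increment, so a small threading.Lock stands in for that atomic step, and time.sleep(0) yields while spinning. The names TicketLock and demo are assumptions.

```python
import threading
import time


class TicketLock:
    """Sketch of a ticket lock: take a ticket, spin until it is served."""

    def __init__(self):
        self._next_ticket = 0
        self._now_serving = 0
        # Stand-in for an atomic fetch-and-increment instruction.
        self._atomic = threading.Lock()

    def acquire(self):
        with self._atomic:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != my_ticket:
            time.sleep(0)  # spin, yielding so other threads can run
        return my_ticket

    def release(self):
        # Only the current holder writes this field, so no atomic is needed.
        self._now_serving += 1


def demo(threads=4, iterations=50):
    """Return the order in which tickets entered the critical section."""
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(iterations):
            ticket = lock.acquire()
            served.append(ticket)
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return served


print(demo() == list(range(200)))  # tickets are served in FIFO order
```

Every waiter spins on the same "Now serving" field, so each release invalidates that cache line in all waiting cores; this shared spinning is what queue-based locks such as CLH avoid.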

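CLH Lock

[Figure sequence over five slides: the lock's tail points to a queue node whose flag is false. An acquiring thread enqueues its own node with the flag set to true, keeps a "prev" reference to the previous node, and acquires the lock because that node's flag is false. A second acquiring thread enqueues behind it and spins on the previous node. On unlock, the holder sets its flag to false, and the spinning thread acquires the lock.]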
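The queue in the figure can be sketched as follows. This is a minimal illustration of the standard CLH scheme under stated assumptions, not the code evaluated in the study: a threading.Lock stands in for the atomic swap on the tail pointer, threading.local holds each thread's node and "prev" reference, and the names Node, CLHLock, and demo are illustrative.

```python
import threading
import time


class Node:
    def __init__(self, locked=False):
        self.locked = locked  # True while the owner holds or wants the lock


class CLHLock:
    """Sketch of a CLH queue lock: each thread spins on its predecessor's node."""

    def __init__(self):
        self._tail = Node(locked=False)  # dummy node: the lock starts free
        self._swap = threading.Lock()    # stand-in for an atomic swap on tail
        self._local = threading.local()  # per-thread node and predecessor

    def acquire(self):
        node = getattr(self._local, "node", None) or Node()
        node.locked = True
        with self._swap:
            pred, self._tail = self._tail, node  # enqueue at the tail
        while pred.locked:
            time.sleep(0)  # spin on the predecessor's flag only
        self._local.node, self._local.pred = node, pred

    def release(self):
        node = self._local.node
        # Reuse the predecessor's node next time; nobody else references it.
        self._local.node = self._local.pred
        node.locked = False  # hands the lock to the successor, if any


def demo(threads=4, iterations=50):
    lock = CLHLock()
    counter = {"value": 0}

    def worker():
        for _ in range(iterations):
            lock.acquire()
            current = counter["value"]
            time.sleep(0)  # widen the race window on purpose
            counter["value"] = current + 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter["value"]


print(demo())  # prints 200
```

Unlike the ticket lock, each waiter spins on a different node, so a release touches only the cache line that the next thread in line is watching.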

Hierarchical Lock

[Figure: cores grouped by socket]

• NUMA-aware lock
▪ Uses the local cache for the lock

Locks Microbenchmark

• Initialize N locks and T threads

• Each thread repeatedly
▪ Chooses one lock out of N at random
▪ Acquires the lock
▪ Reads and writes the protected data
▪ Releases the lock

• Repeat with 9 different lock algorithms
▪ Spin locks, queue-based, hierarchical, mutex

• Report the best total throughput
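The loop described above can be sketched as a small harness. This is an illustration of the benchmark's structure only: it uses Python's threading.Lock (a mutex) as the single lock type, the function name lock_microbenchmark and its parameters are assumptions, and CPython's global interpreter lock means the numbers say nothing about hardware scalability.

```python
import random
import threading
import time


def lock_microbenchmark(num_locks, num_threads, duration_s=0.2, seed=0):
    """Run T threads over N locks; return (operations, ops per second, writes)."""
    locks = [threading.Lock() for _ in range(num_locks)]
    data = [0] * num_locks      # one protected word per lock
    ops = [0] * num_threads     # per-thread operation counts
    stop = threading.Event()

    def worker(tid):
        rng = random.Random(seed + tid)
        while not stop.is_set():
            i = rng.randrange(num_locks)  # choose one lock at random
            with locks[i]:                # acquire ... release
                data[i] += 1              # read and write the protected data
            ops[tid] += 1

    pool = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    start = time.perf_counter()
    for t in pool:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in pool:
        t.join()
    elapsed = time.perf_counter() - start
    total = sum(ops)
    return total, total / elapsed, sum(data)


# High contention (4 locks) versus low contention (128 locks).
for n in (4, 128):
    total, throughput, written = lock_microbenchmark(n, num_threads=4)
    print(n, total, round(throughput), written == total)
```

Swapping the lock construction for another lock class with the same acquire and release interface is how the harness would be repeated across lock algorithms.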


Locks on Multi Sockets

[Figure: panels for high contention (4 locks) and low contention (128 locks). Data labels read X:Y, where X is the scalability over the single-thread execution and Y is the best-performing lock.]

Multi-sockets provide limited scalability due to the higher latencies of remote accesses

Locks on Single Sockets

[Figure: panels for high contention (4 locks) and low contention (128 locks). Data labels read X:Y, where X is the scalability over the single-thread execution and Y is the best-performing lock.]

Complex locks are generally the best under extreme contention; simple locks perform better under low contention

Hash Table – Best Locks

[Figure: best-performing locks on the hash table; panels for high contention and low contention]

Simple locks are powerful: best in 25 of 32 data points

Conclusion

• Crossing sockets is a killer
▪ Up to 7.5x more expensive communication

• Intra-socket uniformity matters

• Simple locks are powerful
▪ Better in 25 out of 32 data points on a hash table

Extra Slides

Hardware-Level Analysis

• Multi-socket processor
▪ Local access latency
▪ Remote access latency

• Single-socket processor
▪ Intra-socket access latency

Key Observations

• Crossing sockets is a killer

• Intra-socket uniformity does matter

• Loads and stores can be as expensive as atomic operations

• Simple locks are powerful

High Contention

• Multi-socket, single lock

High Contention

• Single-socket, single lock

Low Contention

• Multi-socket, 512 locks

Low Contention

• Single-socket, 512 locks

Hash Table on Multi Sockets

[Figure: panels for high contention (12 buckets) and low contention (512 buckets)]

• Using 80% get, 10% put, and 10% remove

Hash Table on Single Sockets

[Figure: panels for high contention (12 buckets) and low contention (512 buckets)]

• Using 80% get, 10% put, and 10% remove

The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors

Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler†

MIT CSAIL and †Harvard University
SOSP 2013

Presented by Luis, Haksu, Hwanjin

Background

• Evaluating the scalability of multicore software:
• Focus effort on real issues.
• Different workloads?
• Higher core counts?
• Critical bottlenecks?
• Might not only be the implementation.
• Too late for design-level solutions?

[Figure: iterative cycle: Workload -> Test with more cores -> Analyze scalability -> Find bottlenecks -> Fix the bottlenecks]

Approach

• In a shared-memory multicore processor with MESI-like coherent caches, a core can scale reads and writes to cache lines it holds exclusively, and can scale reads of lines that are cached in shared mode.

• Operations scale if their implementations have conflict-free memory accesses.

• Consider scalability earlier in the process -> at the software interface.
▪ Before implementation.
▪ Before hardware.
▪ Find scalability problems earlier -> solve them earlier.

The Scalable Commutativity Rule

“Whenever interface operations commute, they can be implemented in a way that scales.”

Based on SIM commutativity

• State-dependent: depends on the context of the system, the operation arguments, and the concurrent operations. NOT all states will commute.

• Interface-based: independent of the implementation; only the same resulting state matters.

• Monotonic: the region stays commutative for every prefix of every reordering of its operations.

Formal explanation of the rule

• A system executes actions (invocations or responses).

• Invocation: a system call. Response: its result.

• A series of actions forms a history.

• The rule only considers well-formed histories (each thread has at most one outstanding invocation at any point, and each thread's history forms an invocation-response sequence).

Formal explanation of the rule

• A specification (a prefix-closed set of well-formed histories) determines whether a history is “correct”, defining the interface.

• E.g., UNIX getpid()

• Commutativity = the order of operations is irrelevant
• A set of actions is commutative when the specification is indifferent to the execution order of that set.
• For H =

Formal explanation of the rule

• SI-commutation (for Y):
• X puts the system into the desired state.
• Swapping Y for Y’ (a reordering of Y) requires that the return values of Y are valid regardless of the order.
• Z forces that the results of reordering Y do not affect future operations.
• However, this is non-monotonic: for some prefix of a reordering, the region might not be commutative.

Formal explanation of the rule

• SIM-commutation (for Y):
• When, for any prefix P of some reordering of Y,
• P SI-commutes in X || P.
• SIM commutativity is interface-based: it evaluates the consequences of execution order using only the specification.

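Designing commutative interfaces

• Applying the rule to POSIX results in insights:

• Decompose compound operations

[Figure: fork() creates a new process and copies the memory state, fds, and signals; exec() replaces the memory state, fds, and signals; posix_spawn() creates a new process and loads the image]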

Designing commutative interfaces

• Embrace specification non-determinism

[Figure: open() allocates an fd and returns the smallest fd]
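A toy model shows why the "smallest fd" rule gets in the way. The sketch below runs two open() calls in both orders and compares what each caller sees; it is a simplified check on one starting state, not the formal SIM-commutativity test. All names (FdTable, open_lowest, open_any, run, commutes) are hypothetical, and open_any is just one possible allocator that a relaxed "return any unused fd" specification would permit.

```python
NUM_CALLERS = 2


class FdTable:
    def __init__(self):
        self.used = set()


def open_lowest(table, caller):
    """POSIX-style rule: return the smallest unused fd."""
    fd = 0
    while fd in table.used:
        fd += 1
    table.used.add(fd)
    return fd


def open_any(table, caller):
    """Relaxed rule: any unused fd is valid, so each caller uses its own range."""
    fd = caller
    while fd in table.used:
        fd += NUM_CALLERS
    table.used.add(fd)
    return fd


def run(order, open_fn):
    """Execute one open() per caller in the given order."""
    table = FdTable()
    results = {caller: open_fn(table, caller) for caller in order}
    return results, table.used


def commutes(open_fn):
    """True if both orders give every caller the same result and the same final state."""
    return run([0, 1], open_fn) == run([1, 0], open_fn)


print(commutes(open_lowest))  # prints False
print(commutes(open_any))     # prints True
```

With the smallest-fd rule, whichever call runs first gets fd 0, so the results reveal the order and the calls cannot commute; once any unused fd is acceptable, an implementation is free to allocate without coordinating between callers.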

Designing commutative interfaces

• Weak ordering

[Figure: messages sent over a socket; pipe and SIGPIPE]

Designing commutative interfaces

• Release resources asynchronously.
▪ Operations have global effects that are visible upon return.
▪ Good for a usable interface, but strict for operations that return resources.
▪ Pipe writes do not commute with the last close() of a read fd, so the number of read fds must be tracked.

Analyzing interface design using COMMUTER

• Understanding the commutativity of a complex interface is not trivial.

• Developing an implementation that does not share when operations commute adds to the difficulty.

• Automated tool named COMMUTER

[Figure: COMMUTER pipeline: Python model -> ANALYZER -> commutativity conditions -> TESTGEN -> test cases -> MTRACE (together with the implementation) -> shared cache lines]


ANALYZER

• Input: a symbolic Python model of the interface.

• Finds the conditions under which the model commutes.

• Outputs commutativity conditions: arguments and states.

• The symbolic model enables a focus on external behavior.


TESTGEN

• Input: commutativity conditions.

• Converts them into test cases.

• Specifies concrete values for every symbolic variable in the model.

• Produces actual C test case code.

• Test case code: state setup + functions to run.

• Path coverage – code paths.

• Conflict coverage – access patterns.


MTRACE

• Runs the test cases on a real implementation.

• On a violation of the commutativity rule, it reports which variables were shared and the code that accessed them.

• Runs on QEMU and starts a log for each test case.

Implementation

• Prototype of COMMUTER
▪ ANALYZER and TESTGEN: 3,050 lines of Python.
▪ MTRACE: 1,594 lines of code changes in QEMU.
▪ 612 modified lines of code in Linux.
▪ 2,865 lines of C++ code for a program that processes the log files.

Finding scalability opportunities

• Modeled 18 POSIX file system and virtual memory system calls in COMMUTER.

• Evaluated the scalability of Linux kernel 3.8.

• Developed a scalable file system and virtual memory system.

• COMMUTER generated 13,664 test cases.

• Running the test cases: 8 minutes.

Comparing Scalability

• For the Linux kernel, 4,257 of the 13,664 test cases were not conflict-free.

• Common cases: shared reference counts, coarse-grained locks.

Comparing Scalability

Following the commutativity design principles, implemented on top of sv6:

• An in-memory file system called ScaleFS

• A virtual memory system called RadixVM

Comparing Scalability

COMMUTER pointed out

• Layer scalability: use data structures that satisfy the commutativity rule, such as radix arrays, hash tables, etc.

• Defer work: lazy resource release; batched reference count reconciliation.

• Precede pessimism with optimism: check first, then acquire the lock.

• Don’t read unless necessary.
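The "defer work" idea can be sketched with a reference count that keeps one delta per thread and folds the deltas into a global value only occasionally. This is a simplified illustration, not sv6's mechanism: the class DeferredRefCount and the function demo are hypothetical, and the sketch assumes reconcile() runs while no thread is updating its slot (a real implementation would need an atomic exchange per slot).

```python
import threading


class DeferredRefCount:
    """Sketch: per-slot deltas, reconciled into one global count in batches."""

    def __init__(self, num_slots):
        self._deltas = [0] * num_slots  # one slot per core/thread, written only by its owner
        self._reconciled = 0            # global value, touched only during reconciliation
        self._lock = threading.Lock()

    def inc(self, slot):
        self._deltas[slot] += 1  # no shared state touched on the fast path

    def dec(self, slot):
        self._deltas[slot] -= 1

    def reconcile(self):
        """Fold all deltas into the global count. Assumes slot owners are quiescent."""
        with self._lock:
            for slot, delta in enumerate(self._deltas):
                self._reconciled += delta
                self._deltas[slot] = 0
            return self._reconciled


def demo(threads=4, incs=100, decs=40):
    count = DeferredRefCount(threads)

    def worker(slot):
        for _ in range(incs):
            count.inc(slot)
        for _ in range(decs):
            count.dec(slot)

    pool = [threading.Thread(target=worker, args=(t,)) for t in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return count.reconcile()  # batch reconciliation after the workers finish


print(demo())  # prints 240
```

The trade-off is that the true count is only known after reconciliation, which fits resources that can be released lazily rather than at the exact moment the last reference disappears.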

Performance evaluation

• 80-core machine: eight 2.4 GHz 10-core Intel E7-8870 processors and 256 GB of RAM.

• Each socket's 30 MB L3 cache is shared by its 10 cores.

• No hardware prefetcher.

• Compare Linux 3.5.7 (Ubuntu Quantal) vs. sv6

• Single-core baseline

Microbenchmark: statbench

• Scalability of fstat.

• Creates a single file that n/2 cores fstat().

• The other n/2 cores link it to a new name and then unlink it.

Microbenchmark: openbench

• Scalability of open.

• N threads concurrently open and close per-thread files.

Microbenchmark: Mail server

• A more real-world workload.

• Separate communicating processes.

• Roughly like qmail.

• A mail client with n threads continuously delivers email by spawning and feeding mail-enqueue.

Conclusion

• The new rule enables designing for scalability.

• + scalable implementation == + performance (ALWAYS???)

• Case-specific: what to tune for?

• The tools give hints about whether a commutative implementation is feasible, but they won't clearly specify how to achieve it.
