Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask
Tudor David, Rachid Guerraoui, and Vasileios Trigonakis, Ecole Polytechnique Federale de Lausanne (EPFL)
Haksu Lim, Luis, Hwanjin Jeong
2016-04-18
Multi-Core
• Multi-core is used in many systems
• As the number of cores ↑, performance ↑?
NO
Synchronization is one of the biggest scalability bottlenecks
Synchronization
• Why do we use it?
▪ Concurrent access to shared data
▪ To ensure orderly execution
• Why is synchronization a bottleneck?
▪ Hardware
▪ Synchronization algorithm
▪ Application context
▪ Workload
Focusing on this
Cache Coherence
• A multi-core system has a separate cache for each core
▪ Write operations break consistency among caches
• Cache coherence
▪ Keeps the caches of a common memory resource consistent
Cache Coherence Protocols
• MSI protocol
[Figure: MSI state diagram with the states Modified, Shared, and Invalid; transitions are labeled Read, Write, Bus read, and Bus write]
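The state diagram for the MSI protocol did not survive as text, so the sketch below encodes the textbook MSI transitions as a lookup table. It is an illustration under that assumption, not a reconstruction of the slide's exact figure; the names `State`, `TRANSITIONS`, and `access` are made up for this example.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"

# (current state, event) -> (next state, bus message issued by this cache or None).
# "read"/"write" come from the local processor; "bus_read"/"bus_write" are
# requests snooped from other caches. Textbook MSI, assumed for illustration.
TRANSITIONS = {
    (State.INVALID, "read"): (State.SHARED, "bus_read"),
    (State.INVALID, "write"): (State.MODIFIED, "bus_write"),
    (State.INVALID, "bus_read"): (State.INVALID, None),
    (State.INVALID, "bus_write"): (State.INVALID, None),
    (State.SHARED, "read"): (State.SHARED, None),
    (State.SHARED, "write"): (State.MODIFIED, "bus_write"),
    (State.SHARED, "bus_read"): (State.SHARED, None),
    (State.SHARED, "bus_write"): (State.INVALID, None),
    (State.MODIFIED, "read"): (State.MODIFIED, None),
    (State.MODIFIED, "write"): (State.MODIFIED, None),
    (State.MODIFIED, "bus_read"): (State.SHARED, None),    # data is written back first
    (State.MODIFIED, "bus_write"): (State.INVALID, None),  # data is written back first
}

def access(caches, core, op):
    """Apply a local read or write on `core`; the other caches snoop the bus message."""
    next_state, bus_msg = TRANSITIONS[(caches[core], op)]
    caches[core] = next_state
    if bus_msg is not None:
        for other in range(len(caches)):
            if other != core:
                caches[other] = TRANSITIONS[(caches[other], bus_msg)][0]
    return caches

caches = [State.INVALID, State.INVALID]   # one cache line, two cores
access(caches, 0, "read")                 # core 0: I -> S
access(caches, 1, "read")                 # core 1: I -> S
access(caches, 0, "write")                # core 0: S -> M, core 1: S -> I
print([s.value for s in caches])          # ['M', 'I']
```

The last step is the expensive one: a write to a shared line forces an invalidation in every other cache that holds a copy.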
Cache Coherence Protocols
• MESI protocol
§ Adds the Exclusive state
− No other cache has a copy of this cache line
§ Reduces expensive invalidate operations
• MOESI protocol
§ Adds the Owned state
− This cache line has been modified, but there might be shared copies on other cores
§ Reduces expensive write operations to memory
Cache Coherence Example
• Process of acquiring a lock
[Figure: two processors, each with a cache holding a (State, Data) line, execute Acq(lock). Labels: Read-Exclusive, Update, Invalidate; cache line states Mod (Held=1) and Inval; shared memory changes from held=0 to held=1]
What to Deal With
• Hardware processors
§ Multi-sockets
– AMD Opteron
• 4 x 6172 (48 cores)
– Intel Xeon
• 8 x E7-8867L (80 cores)
§ Single-sockets
– Sun Niagara 2
• 8 cores
– Tilera TILE-Gx36
• 36 cores
• Synchronization layer
§ Concurrent software
– Hash table, etc.
§ Primitives
– Lock, etc.
§ Atomic operations
– Compare & swap, etc.
§ Cache coherence
– Load & store
Hardware-Level Analysis
Local Accesses
[Figure: core (C) layout of the Opteron and Xeon multi-socket processors]
Opteron
• Within socket: 40 ns
Xeon
• Within socket: 20 to 40 ns
Remote Accesses
[Figure: core (C) layout of the Opteron and Xeon multi-socket processors]
Opteron
• Within socket: 40 ns
• Per hop: +40 ns
Xeon
• Within socket: 20 to 40 ns
• Per hop: +50 ns
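As a rough worked example of these figures, the sketch below adds the per-hop cost to the within-socket latency. The purely additive model and the names `PLATFORMS` and `estimated_latency_ns` are assumptions made for illustration; only the nanosecond values come from the slide.

```python
# Latencies in nanoseconds from the slide; (low, high) covers the Xeon's 20 to 40 ns range.
PLATFORMS = {
    "Opteron": {"within_socket": (40, 40), "per_hop": 40},
    "Xeon": {"within_socket": (20, 40), "per_hop": 50},
}

def estimated_latency_ns(platform, hops):
    """Return a (low, high) estimate for an access that crosses `hops` socket hops.

    Assumes each hop simply adds the per-hop cost (a simplification).
    """
    low, high = PLATFORMS[platform]["within_socket"]
    extra = hops * PLATFORMS[platform]["per_hop"]
    return (low + extra, high + extra)

for name in PLATFORMS:
    for hops in range(3):
        print(name, "hops:", hops, "latency (ns):", estimated_latency_ns(name, hops))
```

Under this model, a two-hop access on the Opteron costs three times a local one (120 ns against 40 ns).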
Operation Latency – Multi Socket
[Figure: operation latencies on the multi-socket platforms; annotations: 7.5x and 3x]
Crossing sockets is a killer: up to 7.5x more expensive
Single-Socket Processors
[Figure: core (C) layout of the Niagara (8 cores) and the Tilera (6 x 6 grid of cores)]
Niagara
• Equidistant from the cache
• Uniform: 23 ns
Tilera
• Non-uniform
• 1 hop: 40 ns
• Per hop: +2 ns
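Operation Latency – Single Socket
[Figure: operation latencies on the single-socket platforms; annotation: 0.5x]
The uniform design is expected to scale better; the non-uniform design is affected by both distance and the number of involved cores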
Atomic Operations – Multi Sockets
• Very fast single-thread performance
▪ But throughput drops on two or more cores and decreases further when there is cross-socket communication
[Figure: two panels, Opteron and Xeon]
Atomic Operations – Single Sockets
• Lower single-thread throughput
▪ But scales up to a maximum value
[Figure: two panels, Niagara and Tilera]
Software-Level Analysis
Analysis Scope
• 9 locks
▪ Spin locks
– Test-and-test-and-set lock (TTAS), ticket lock
▪ Queue-based locks
– Array-based lock, CLH lock, MCS lock
▪ Hierarchical locks
– Hierarchical CLH lock, hierarchical ticket lock
▪ Mutex
• Concurrent software
▪ Hash table
Ticket Lock
[Figure: a lock with the fields "Next ticket" and "Now serving". One thread has acquired the lock with ticket 0; four threads holding tickets 1 to 4 are acquiring and spin]
Ticket Lock
[Figure: the same lock after Release. The threads holding tickets 1 to 4 are shown acquiring and spinning]
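The two ticket lock slides can be sketched as follows. This is a minimal Python illustration, not the implementation measured in the study: a small `threading.Lock` stands in for the atomic fetch-and-increment that hands out tickets, `time.sleep(0)` is used so spinning threads yield, and the names `TicketLock`, `next_ticket`, and `now_serving` mirror the slide labels.

```python
import threading
import time

class TicketLock:
    """Ticket lock sketch: take a ticket, then spin until that ticket is served."""

    def __init__(self):
        self.next_ticket = 0
        self.now_serving = 0
        self._ticket_guard = threading.Lock()  # stand-in for atomic fetch-and-increment

    def acquire(self):
        with self._ticket_guard:
            my_ticket = self.next_ticket
            self.next_ticket += 1
        while self.now_serving != my_ticket:   # every waiter spins on the same field
            time.sleep(0)                      # yield so other Python threads can run
        return my_ticket

    def release(self):
        self.now_serving += 1                  # only the current holder writes this

lock = TicketLock()
shared = {"counter": 0}

def worker():
    for _ in range(100):
        lock.acquire()
        shared["counter"] += 1                 # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])  # 400
```

Tickets are served in the order they were taken, so the lock is fair, but every release is observed by all spinning threads because they share the "now serving" field.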
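CLH Lock
[Figure sequence over five slides; each slide shows a tail pointer and queue nodes holding a boolean flag:
1. Acquiring: tail points to a node whose flag is false.
2. Acquired: the thread has added its own node (flag true) and keeps a prev reference to the earlier node.
3. A second thread is Acquiring: it adds a node behind the holder and spins.
4. Unlock: the holder's flag becomes false while the second thread is still shown spinning.
5. Acquired: the second thread holds the lock; its node's flag is true and its prev reference points to a node whose flag is false.]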
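The CLH sequence can be sketched as below, following the usual textbook formulation of the lock. It is an illustration only: a `threading.Lock` stands in for the atomic swap on `tail`, and the names `Node` and `CLHLock` are chosen for this example.

```python
import threading
import time

class Node:
    def __init__(self):
        self.locked = False      # True while the owner holds or waits for the lock

class CLHLock:
    def __init__(self):
        self._tail = Node()                    # dummy node with locked == False
        self._swap_guard = threading.Lock()    # stand-in for an atomic swap on tail
        self._local = threading.local()        # per-thread node and prev reference

    def acquire(self):
        node = getattr(self._local, "node", None) or Node()
        node.locked = True
        with self._swap_guard:                 # atomically: prev = tail; tail = node
            prev, self._tail = self._tail, node
        while prev.locked:                     # spin only on the predecessor's flag
            time.sleep(0)
        self._local.node, self._local.prev = node, prev

    def release(self):
        self._local.node.locked = False        # lets the successor stop spinning
        self._local.node = self._local.prev    # reuse the predecessor's node next time

lock = CLHLock()
shared = {"counter": 0}

def worker():
    for _ in range(100):
        lock.acquire()
        shared["counter"] += 1                 # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared["counter"])  # 400
```

Unlike the ticket lock, each waiter spins on a different node, so a release disturbs only the immediate successor.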
Hierarchical Lock
[Figure: eight cores (C)]
• NUMA-aware lock
▪ Uses the local cache for the lock
Locks Microbenchmark
• Initialize N locks & T threads
• Each thread repeatedly
▪ Chooses one lock out of N at random
▪ Acquires the lock
▪ Reads and writes the protected data
▪ Releases the lock
• Repeat with 9 different lock algorithms
▪ Spin locks, queue-based, hierarchical, mutex
• Report the best total throughput
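The benchmark loop described above can be sketched as a small harness. This is an assumption-laden illustration: Python's `threading.Lock` and `threading.RLock` stand in for the nine lock algorithms, `run_benchmark` is a hypothetical name, and the global interpreter lock means the numbers say nothing about real multi-core scalability.

```python
import random
import threading
import time

def run_benchmark(make_lock, n_locks, n_threads, duration_s=0.1, seed=0):
    """Return total operations per second across all threads."""
    locks = [make_lock() for _ in range(n_locks)]
    data = [0] * n_locks                  # one protected word per lock
    ops = [0] * n_threads                 # per-thread operation counts
    stop = threading.Event()

    def worker(tid):
        rng = random.Random(seed + tid)
        while not stop.is_set():
            i = rng.randrange(n_locks)    # choose one lock out of N at random
            with locks[i]:                # acquire, then release on exit
                data[i] = data[i] + 1     # read and write the protected data
            ops[tid] += 1

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    assert sum(data) == sum(ops)          # each operation updated protected data once
    return sum(ops) / elapsed

candidates = {"Lock": threading.Lock, "RLock": threading.RLock}  # stand-in algorithms
for n_locks in (4, 128):                  # high and low contention settings
    results = {name: run_benchmark(make, n_locks, n_threads=4)
               for name, make in candidates.items()}
    best = max(results, key=results.get)  # report the best total throughput
    print(n_locks, "locks, best:", best)
```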
Locks on Multi Sockets
[Figure: two panels, High contention (4 locks) and Low contention (128 locks). Data labels read X : Y, where X is the scalability over the single-thread execution and Y is the best-performing lock]
Multi-sockets provide limited scalability due to the higher latencies of remote accesses
Locks on Single Sockets
[Figure: two panels, High contention (4 locks) and Low contention (128 locks). Data labels read X : Y, where X is the scalability over the single-thread execution and Y is the best-performing lock]
Complex locks are generally the best under extreme contention; simple locks perform better under low contention
Hash Table – Best Locks
[Figure: two panels, High contention and Low contention]
Simple locks are powerful: best in 25 out of 32 data points
Conclusion
• Crossing sockets is a killer
▪ Up to 7.5x more expensive communication
• Intra-socket uniformity matters
• Simple locks are powerful
▪ Better in 25 out of 32 data points on a hash table
Extra Slides
Hardware-Level Analysis
• Multi-socket processor
▪ Local access latency
▪ Remote access latency
• Single-socket processor
▪ Intra-socket access latency
Key Observations
• Crossing sockets is a killer
• Intra-socket uniformity does matter
• Loads and stores can be as expensive as atomic operations
• Simple locks are powerful
High Contention
• Multi-socket, single lock
High Contention
• Single-socket, single lock
Low Contention
• Multi-socket, 512 locks
Low Contention
• Single-socket, 512 locks
Hash Table on Multi Sockets
[Figure: two panels, High contention (12 buckets) and Low contention (512 buckets)]
• Using 80% get, 10% put, and 10% remove
Hash Table on Single Sockets
[Figure: two panels, High contention (12 buckets) and Low contention (512 buckets)]
• Using 80% get, 10% put, and 10% remove
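The hash table workload can be sketched with one lock per bucket and the 80/10/10 operation mix from the slides. The class `LockedHashTable`, the function `run_workload`, and the key range are hypothetical choices for this illustration; the study's hash table and lock implementations are not reproduced here.

```python
import random
import threading

class LockedHashTable:
    """Hash table with one lock per bucket and a fixed bucket count."""

    def __init__(self, n_buckets):
        self._buckets = [dict() for _ in range(n_buckets)]
        self._locks = [threading.Lock() for _ in range(n_buckets)]

    def _index(self, key):
        return hash(key) % len(self._buckets)

    def get(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key)

    def put(self, key, value):
        i = self._index(key)
        with self._locks[i]:
            self._buckets[i][key] = value

    def remove(self, key):
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].pop(key, None)

def run_workload(table, n_ops, key_range, seed):
    """Issue roughly 80% get, 10% put, and 10% remove; return per-operation counts."""
    rng = random.Random(seed)
    counts = {"get": 0, "put": 0, "remove": 0}
    for _ in range(n_ops):
        key = rng.randrange(key_range)
        r = rng.random()
        if r < 0.8:
            table.get(key)
            counts["get"] += 1
        elif r < 0.9:
            table.put(key, key)
            counts["put"] += 1
        else:
            table.remove(key)
            counts["remove"] += 1
    return counts

for n_buckets in (12, 512):               # high and low contention settings
    table = LockedHashTable(n_buckets)
    threads = [threading.Thread(target=run_workload, args=(table, 1000, 1024, seed))
               for seed in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(n_buckets, "buckets done")
```

With 12 buckets, threads collide on the same bucket lock far more often than with 512, which is what the high and low contention panels contrast.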
The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors
Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Morris, and Eddie Kohler†
MIT CSAIL and †Harvard University
SOSP 2013
Presented by Luis, Haksu, Hwanjin
Background
• Evaluating scalability of multicore software:
• Focus effort on real issues.
• Different workloads?
• Higher core count?
• Critical bottlenecks?
• Might not only be the implementation.
• Too late for design-level solutions?
[Figure: workflow with the steps Workload, Test with more cores, Analyze scalability, Find bottlenecks, Fix the bottlenecks]
Approach
• In a shared-memory multicore processor with MESI-like coherent caches, a core can scale reads and writes to lines it has cached exclusively, and reads of lines that are cached in shared mode.
• Operations scale if their implementations have conflict-free memory accesses.
• Consider scalability earlier in the process: at the software interface.
▪ Before implementation.
▪ Before hardware.
▪ Find scalability problems earlier, solve them earlier.
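To make "conflict-free" concrete, the sketch below checks a trace of memory accesses and reports whether any cache line written by one core is also read or written by another. The function name `conflict_free` and the trace format are assumptions made for this illustration.

```python
def conflict_free(accesses):
    """accesses: iterable of (core, cache_line, kind), where kind is 'r' or 'w'.

    Returns True when no core writes a cache line that another core reads or writes.
    """
    cores_using = {}   # cache line -> set of cores that touched it
    written = set()    # cache lines written by at least one core
    for core, line, kind in accesses:
        cores_using.setdefault(line, set()).add(core)
        if kind == "w":
            written.add(line)
    return all(len(cores_using[line]) == 1 for line in written)

shared_counter = [(0, "count", "w"), (1, "count", "w")]
per_core_counters = [(0, "count0", "w"), (1, "count1", "w")]
shared_reads = [(0, "config", "r"), (1, "config", "r")]

print(conflict_free(shared_counter))     # False
print(conflict_free(per_core_counters))  # True
print(conflict_free(shared_reads))       # True
```

Reads of a line that nobody writes stay conflict-free, which matches the point that lines cached in shared mode scale for reads.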
The Scalable Commutativity Rule
“Whenever interface operations commute, they can be implemented in a way that scales.”
Based on SIM commutativity
• State-dependent: depends on the context of the system, the operation arguments, and concurrent operations. NOT all states will commute.
• Interface-based: independent of the implementation; only the resulting state must be the same.
• Monotonic: the region stays commutative for any reordering within a prefix of the sequence of operations.
Formal Explanation of the Rule
• A system executes actions (invocation or response).
• Invocation: system call. Response: result.
• A series of actions forms a history.
• The rule only considers well-formed histories (one outstanding invocation at any point per thread, and each thread's history forms an invocation-response sequence).
Formal Explanation of the Rule
• A specification (a prefix-closed set of well-formed histories) distinguishes whether a history is “correct”, defining the interface.
• E.g., UNIX getpid()
• Commutativity = the order of operations is irrelevant
• A set of actions is commutative when the specification is indifferent to the execution order of that set.
• For H = X || Y:
Formal Explanation of the Rule
• SI-commutation (for Y):
• X puts the system into the desired state.
• Switching Y for Y’ (a reordering of Y) requires that the return values of Y are valid regardless of order.
• Z forces the results of reordering Y to not affect future operations.
• However, this is non-monotonic: for some prefix of a reordering, the region might not be commutative.
Formal Explanation of the Rule
• SIM-commutation (for Y):
• Holds when, for any prefix P of some reordering of Y,
• P SI-commutes in X || P.
• SIM commutativity is interface-based: it evaluates the consequences of execution order using only the specification.
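The two notions described in words on these slides can be written compactly. The following is a sketch in the slides' notation, under the assumptions that the history is split as H = X || Y, that the symbol \mathcal{S} (introduced here) denotes the specification, and that || denotes concatenation of action sequences.

```latex
% SI-commutativity: reordering Y cannot be distinguished by any future Z
Y \text{ SI-commutes in } H = X \mathbin{\|} Y
\quad\text{when}\quad
\forall\, Y' \text{ reordering of } Y,\ \forall\, Z:\;
X \mathbin{\|} Y \mathbin{\|} Z \in \mathcal{S}
\iff
X \mathbin{\|} Y' \mathbin{\|} Z \in \mathcal{S}

% SIM-commutativity: the same must hold for every prefix of every reordering
Y \text{ SIM-commutes in } H = X \mathbin{\|} Y
\quad\text{when}\quad
\forall\, P \text{ prefix of some reordering of } Y \ (\text{including } P = Y):\;
P \text{ SI-commutes in } X \mathbin{\|} P
```

The second condition is what restores monotonicity: it rules out regions that commute as a whole but contain a prefix that does not.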
Designing Commutative Interfaces
• Applying the rule to POSIX results in insights:
• Decompose compound operations
[Figure: fork(): New P; Copy mem st, fd, signal. exec(): Replace mem st, fd, signal. posix_spawn(): New P; Load image]
Designing Commutative Interfaces
• Embrace specification non-determinism
[Figure: open(): Allocate fd; Return smallest fd]
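A small model shows why the "smallest fd" rule gets in the way. In the sketch below, `open_lowest` follows the lowest-available-descriptor rule, while `open_any` models a relaxed specification in which any unused descriptor is a valid result. The per-caller starting points (100 and 200) and all function names are hypothetical, chosen only to illustrate the idea.

```python
def open_lowest(fds):
    """Lowest-fd rule: return the smallest unused descriptor."""
    fd = 0
    while fd in fds:
        fd += 1
    fds.add(fd)
    return fd

def open_any(fds, start):
    """Relaxed rule: any unused descriptor is valid; search from a per-caller start."""
    fd = start
    while fd in fds:
        fd += 1
    fds.add(fd)
    return fd

def run(order, opener):
    """Execute one open per caller in the given order; return each caller's result."""
    fds = {0, 1, 2}                       # stdin, stdout, stderr already open
    results = {}
    for caller in order:
        results[caller] = opener(fds, caller)
    return results

def lowest(fds, caller):
    return open_lowest(fds)

def per_caller(fds, caller):
    return open_any(fds, {"A": 100, "B": 200}[caller])

print(run("AB", lowest), run("BA", lowest))          # results depend on the order
print(run("AB", per_caller), run("BA", per_caller))  # same results in both orders
```

Under the lowest-fd rule each caller's return value reveals who ran first, so the two opens do not commute; under the relaxed rule they do, and an implementation is free to avoid shared state.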
Designing Commutative Interfaces
• Weak ordering
[Figure: messages passing through a socket; Pipe - SIGPIPE]
Designing Commutative Interfaces
• Release resources asynchronously.
▪ Operations have global effects visible upon return.
▪ Good for a usable interface, but strict for operations that return resources.
▪ No commutativity with the last close() of a read fd. Must track the number of read fds.
Analyzing Interface Design Using COMMUTER
• Understanding the commutativity of a complex interface is not trivial.
• Developing an implementation that does not share when operations commute adds difficulty.
• Automated tool named COMMUTER
[Figure: COMMUTER pipeline. Python model → ANALYZER → commutativity conditions → TESTGEN → test cases → MTRACE (together with the implementation) → shared cache lines]
Analyzing Interface Design Using COMMUTER
ANALYZER
• Input: a Python symbolic model of the interface.
• Finds the conditions under which the model commutes.
• Outputs commutativity conditions: arguments and states.
• The symbolic model enables a focus on external behavior.
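The idea behind ANALYZER can be illustrated with a toy that enumerates concrete states instead of reasoning symbolically. This is not COMMUTER's code or API: the two-name directory model, the operations `create` and `unlink`, and the functions `commutes` and `analyze` are all hypothetical and only show what a commutativity condition over arguments and states looks like.

```python
from itertools import product

NAMES = ("a", "b")

def create(state, name):
    """Model of an exclusive create: fails if the name already exists."""
    if name in state:
        return state, "EEXIST"
    return state | {name}, "ok"

def unlink(state, name):
    """Model of unlink: fails if the name does not exist."""
    if name not in state:
        return state, "ENOENT"
    return state - {name}, "ok"

OPS = {"create": create, "unlink": unlink}

def commutes(state, call1, call2):
    """True if both orders give each call the same result and the same final state."""
    (op1, arg1), (op2, arg2) = call1, call2
    mid, r1 = OPS[op1](state, arg1)
    end12, r2 = OPS[op2](mid, arg2)
    mid, q2 = OPS[op2](state, arg2)
    end21, q1 = OPS[op1](mid, arg1)
    return (r1, r2, end12) == (q1, q2, end21)

def analyze():
    """Enumerate every small state and pair of calls; yield the cases that commute."""
    states = [frozenset(n for n, keep in zip(NAMES, bits) if keep)
              for bits in product((False, True), repeat=len(NAMES))]
    calls = list(product(OPS, NAMES))
    for state, c1, c2 in product(states, calls, calls):
        if commutes(state, c1, c2):
            yield state, c1, c2

# States in which two create("a") calls commute: only those where "a" already exists.
for state, c1, c2 in analyze():
    if c1 == ("create", "a") and c2 == ("create", "a"):
        print(sorted(state), c1, c2)
```

The output depends on both arguments and state, which is the state-dependent flavor of commutativity the rule relies on.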
Analyzing Interface Design Using COMMUTER
TESTGEN
• Input: commutativity conditions.
• Converts them into test cases.
• Specifies concrete values for every symbolic variable in the model.
• Produces actual C test case code.
• Test case code: state setup + functions to run.
• Path coverage: code paths.
• Conflict coverage: access patterns.
Analyzing Interface Design Using COMMUTER
MTRACE
• Runs the test cases on a real implementation.
• On a violation of the commutativity rule, it reports which variables were shared and the code that accessed them.
• Runs on QEMU and starts a log for each test case.
Implementation
• Prototype of COMMUTER
▪ ANALYZER and TESTGEN: 3,050 lines of Python.
▪ MTRACE: 1,594 lines of code changes in QEMU.
▪ 612 modified lines of code in Linux.
▪ 2,865 lines of C++ code for a program that processes the log files.
Finding Scalability Opportunities
• Modeled 18 POSIX file system and virtual memory system calls in COMMUTER.
• Evaluated the scalability of Linux kernel 3.8.
• Developed a scalable file system and virtual memory system.
• COMMUTER generated 13,664 test cases.
• Running the test cases: 8 minutes.
Comparing Scalability
• For the Linux kernel, 4,257 out of 13,664 test cases were not conflict-free.
• Common cases: shared reference counts, coarse-grained locks.
Comparing Scalability
Following the commutativity design principles, implemented on top of sv6:
• An in-memory file system called ScaleFS
• A virtual memory system called RadixVM
Comparing Scalability
COMMUTER pointed out
• Layer scalability: use data structures that satisfy the commutativity rule, such as radix arrays, hash tables, etc.
• Defer work: lazy resource release. Batch reference count reconciliation.
• Precede pessimism with optimism: check first, then acquire the lock.
• Don’t read unless necessary.
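The "defer work" item can be illustrated with a reference count that keeps per-thread deltas and folds them into a global value in batches. This is a simplified sketch of the general idea, not the mechanism used in sv6; the class `DeferredRefCount` and its methods are hypothetical, and `reconcile` assumes no `inc` or `dec` is running concurrently.

```python
import threading

class DeferredRefCount:
    """Reference count with per-thread deltas that are reconciled in batches."""

    def __init__(self):
        self._global = 0
        self._deltas = {}                  # thread id -> [unreconciled local delta]
        self._lock = threading.Lock()      # protects _global and the _deltas mapping

    def _slot(self):
        tid = threading.get_ident()
        slot = self._deltas.get(tid)
        if slot is None:                   # first use by this thread
            with self._lock:
                slot = self._deltas.setdefault(tid, [0])
        return slot

    def inc(self):
        self._slot()[0] += 1               # touches only this thread's slot

    def dec(self):
        self._slot()[0] -= 1

    def reconcile(self):
        """Fold all local deltas into the global count (call while inc/dec are quiescent)."""
        with self._lock:
            for slot in self._deltas.values():
                self._global += slot[0]
                slot[0] = 0
            return self._global

rc = DeferredRefCount()

def worker():
    for _ in range(1000):
        rc.inc()
    for _ in range(999):
        rc.dec()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(rc.reconcile())  # 4
```

The common path writes only thread-local state, so increments and decrements from different threads do not conflict; the price is that the exact count is known only after reconciliation.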
Performance Evaluation
• 80-core machine: eight 2.4 GHz 10-core Intel E7-8870 chips and 256 GB of RAM.
• Each socket's 30 MB L3 cache is shared by its 10 cores.
• No hardware prefetcher.
• Compares Linux 3.5.7 (Ubuntu Quantal) vs. sv6
• Single-core baseline
Microbenchmark: statbench
• Scalability of fstat.
• Creates a single file that n/2 cores fstat().
• The other n/2 cores link it to a new name, then unlink.
Microbenchmark: openbench
• Scalability of open.
• N threads concurrently open and close per-thread files
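The access pattern of openbench can be sketched with standard library calls. This is only an illustration of the workload's shape, not the benchmark from the evaluation: the function name `openbench`, the file names, and the iteration counts are assumptions, and Python threads will not show kernel scalability the way the original does.

```python
import os
import tempfile
import threading

def openbench(n_threads, iterations):
    """Each thread repeatedly opens and closes its own file; returns per-thread counts."""
    counts = [0] * n_threads
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for t in range(n_threads):
            path = os.path.join(tmp, f"file{t}")   # one file per thread
            with open(path, "w"):
                pass
            paths.append(path)

        def worker(t):
            for _ in range(iterations):
                fd = os.open(paths[t], os.O_RDONLY)
                os.close(fd)
                counts[t] += 1

        threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
    return counts

print(openbench(n_threads=4, iterations=100))  # [100, 100, 100, 100]
```

Because every thread works on its own file, the only state the opens could share is the descriptor table, which is where the lowest-fd rule matters.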
Microbenchmark: Mail Server
• More real-world workload.
• Separate communicating processes.
• Roughly like qmail.
• Mail client with n threads continuously delivers email by spawning and feeding mail-enqueue.
Conclusion
• The new rule enables designing for scalability.
• + scalable implementation == + performance (ALWAYS???)
• Case-specific: what to tune for?
• The tool gives hints about the feasibility of implementing the commutativity rule, but it won't clearly specify how to achieve this.