CS 61C: Great Ideas in Computer Architecture …cs61c/sp16/lec/29/2016Sp...Back in 2011 • Google...
Transcript of CS 61C: Great Ideas in Computer Architecture …cs61c/sp16/lec/29/2016Sp...Back in 2011 • Google...
CS61C:GreatIdeasinComputerArchitecture(MachineStructures)Warehouse-ScaleComputing
Instructors:NicholasWeaver&VladimirStojanovichttp://inst.eecs.berkeley.edu/~cs61c/
CoherencyTrackedbyCacheBlock
• Blockping-pongsbetweentwocacheseventhoughprocessorsareaccessingdisjointvariables
• Effectcalledfalsesharing• Howcanyoupreventit?
2
Review:UnderstandingCacheMisses:The3Cs
• Compulsory(coldstartorprocessmigration,1st reference):– Firstaccesstoblock,impossibletoavoid;smalleffectforlong-running
programs– Solution:increaseblocksize(increasesmisspenalty;verylargeblocks
couldincreasemissrate)• Capacity (notcompulsoryand…)
– Cachecannotcontainallblocksaccessedbytheprogramevenwithperfectreplacementpolicyinfullyassociativecache
– Solution:increasecachesize(mayincreaseaccesstime)• Conflict(notcompulsoryorcapacityand…):
– Multiplememorylocationsmaptothesamecachelocation– Solution1:increasecachesize– Solution2:increaseassociativity(mayincreaseaccesstime)– Solution3:improvereplacementpolicy,e.g..LRU
3
Fourth“C”ofCacheMisses:CoherenceMisses
• Missescausedbycoherencetrafficwithotherprocessor
• Alsoknownascommunicationmissesbecauserepresentsdatamovingbetweenprocessorsworkingtogetheronaparallelprogram
• Forsomeparallelprograms,coherencemissescandominatetotalmisses– Itgetsevenmorecomplicatedwithmultithreadedprocessors:YouwantseparatethreadsonthesameCPUtohavecommonworkingset,otherwiseyougetwhatcouldbedescribedasincoherencemisses
4
New-SchoolMachineStructures(It’sabitmorecomplicated!)
• ParallelRequestsAssigned tocomputere.g.,Search“cats”
• ParallelThreadsAssigned tocoree.g.,Lookup,Ads
• ParallelInstructions>[email protected].,5pipelined instructions
• ParallelData>1dataitem@one timee.g.,DeepLearningfor
imageclassification
• HardwaredescriptionsAllgates@onetime
• ProgrammingLanguages 5
SmartPhone
WarehouseScale
ComputerHarness
Parallelism&AchieveHighPerformance
LogicGates
Core Core…
Memory(Cache)
Input/Output
Computer
CacheMemory
Core
InstructionUnit(s) FunctionalUnit(s)
A3+B3A2+B2A1+B1A0+B0
SoftwareHardware
Backin2011• Googledisclosedthatitcontinuouslyusesenoughelectricitytopower200,000homes,butitsaysthatindoingso,italsomakestheplanetgreener.
• Averageenergyusepertypicaluser permonthissameasrunninga60-wattbulbfor3hours(180watt-hours).
6
Urs Hoelzle,Google SVPCo-authoroftoday’sreading
http://www.nytimes.com/2011/09/09/technology/google-details-and-defends-its-use-of-electricity.html
Google’sWSCs
74/8/16
Ex:InOregon
ContainersinWSCs
8
InsideWSC InsideContainer
Server,Rack,Array
9
GoogleServerInternals
10
GoogleServer
Warehouse-ScaleComputers• Datacenter– Collectionof10,000to100,000servers– Networksconnectingthemtogether
• Singlegiganticmachine• Verylargeapplications(Internetservice):
search,email,videosharing,socialnetworking• Veryhighavailability• “…WSCsarenolessworthyoftheexpertiseofcomputer
systemsarchitectsthananyotherclassofmachines”Barroso andHoelzle,2009
11
UniquetoWSCs• AmpleParallelism
– Request-levelParallelism:ex:Websearch– Data-levelParallelism:ex:Imageclassifier training
• ScaleanditsOpportunities/Problems– Scaleofeconomy:lowper-unitcost– Cloudcomputing:rentcomputingpowerwithlowcosts(ex:AWS)
• OperationCostCount– Longerlifetime(>10years)– Costofequipmentpurchases<<costofownership– Oftensemi-customorcustomhardware
• Butconsortiumsofhardwaredesignstosavecostthere• Designforfailure:
– Transientfailures– Hardfailures– High#offailures
ex:4disks/server,annualfailurerate:4%à WSCof50,000servers:1diskfail/hour
12
WSCArchitecture
13
1UServer:8-16cores,16GBDRAM,4x4TBdisk+diskpods
Rack:40-80severs,LocalEthernet(1-10Gbps)switch(30$/1Gbps/server)
Array(akacluster):16-32racksExpensiveswitch(10Xbandwidthà 100xcost)
WSCStorageHierarchy
14
1UServer:DRAM:16GB,100ns,20GB/sDisk:2TB,10ms,200MB/s
Rack(80severs):DRAM:1TB,300us,100MB/sDisk:160TB,11ms,100MB/s
Array(30racks):DRAM:30TB,500us,10MB/sDisk:4.80PB,12ms,10MB/s
LowerlatencytoDRAMinanotherserverthanlocaldiskHigherbandwidthtolocaldiskthantoDRAMinanotherserver
WorkloadVariation
• Onlineservice:Peakusage2Xoff-peak15
Midnight Noon Midnight
Workload
2X
ImpactonWSCsoftware• Latency,bandwidthà Performance– Independentdatasetwithinanarray– Localityofaccesswithinserverorrack
• Highfailurerateà Reliability,Availability– Preventingfailuresiseffectivelyimpossible atthisscale– Copewithfailuresgracefullybydesigningthesystemasawhole
• Varyingworkloadsà Availability– Scaleupanddowngracefully
• Morechallengingthansoftwareforsinglecomputers!
16
PowerUsageEffectiveness• Energyefficiency– PrimaryconcerninthedesignofWSC– Importantcomponentofthetotalcostofownership
• PowerUsageEffectiveness(PUE):
– ApowerefficiencymeasureforWSC– Notconsideringefficiencyofservers,networking– Perfection=1.0– GoogleWSC’sPUE=1.2
• GettingprettyclosetoAmdahl'slawlimit 17
TotalBuildingPowerITequipmentPower
PUEintheWild(2007)
18
19
LoadProfileofWSCs
• AverageCPUutilizationof5,000Googleservers,6monthperiod• Serversrarelyidleorfullyutilized,operatingmostofthetimeat
10%to50%oftheirmaximumutilization20
Energy-ProportionalComputing:DesignGoalofWSC
• Energy=PowerxTime,Efficiency=Computation/Energy• Desire:
– Consumealmostnopowerwhenidle(“Doingnothingwell”)– Graduallyconsumemorepowerastheactivitylevelincreases
21
CauseofPoorEnergyProportionality
22
• CPU:50%atpeek,30%atidle• DRAM,disks,networking:70%atidle!
– Becausetheyareneverreallyidleunlesstheyarepoweredoff!• Needtoimprovetheenergyefficiencyofperipherals
Clicker/PeerInstruction:WhichStatementisTrue
• A:Idleserversconsumealmostnopower.
• B:Diskswillfailoncein20years,sofailureisnotaproblemofWSC.
• C:Thesearchrequestsofthesamekeywordfromdifferentusersaredependent.
• D:MorethanhalfofthepowerofWSCsgoesintocooling.
• E:WSCscontainmanycopiesofdata.23
Administrivia• ReminderthatProject4isout…
24
Agenda
• WarehouseScaleComputing
• Administrivia &Clickers/PeerInstructions
• Request-levelParallelisme.g.Websearch
25
Request-LevelParallelism(RLP)• Hundredsofthousandsofrequestspersec.– PopularInternetserviceslikewebsearch,socialnetworking,…
– Suchrequestsarelargelyindependent• Ofteninvolveread-mostlydatabases• Rarelyinvolveread-writesharingorsynchronizationacrossrequests
• Computationeasilypartitionedacrossdifferentrequestsandevenwithinarequest
26
GoogleQuery-ServingArchitecture
27
AnatomyofaWebSearch
28
AnatomyofaWebSearch(1/3)• Google“cats”– Directrequestto“closest”GoogleWSC
• HandledbyDNS
– Front-endloadbalancerdirectsrequesttooneofmanyarrays(clusterofservers)withinWSC• Oneofpotentiallymanyloadbalancers
– Withinarray,selectoneofmanyGoggleWebServers(GWS)tohandletherequestandcomposetheresponsepages
– GWScommunicateswithIndexServerstofinddocumentsthatcontainsthesearchword,“cats”• IndexserverskeepindexinRAM,notondisk
– Returndocumentlistwithassociatedrelevancescore 29
AnatomyofaWebSearch(2/3)• Inparallel,– Adsystem:runadauctionforbiddersonsearchterms
• Yes,youarebeingboughtandsoldinarealtime auctionallovertheweb
• Pageadsareworsethansearchads
• Usedocids (DocumentIDs)toaccessindexeddocuments• Composethepage– Resultdocumentextracts(withkeywordincontext)orderedbyrelevancescore
– Sponsoredlinks(alongthetop)andadvertisements(alongthesides)
30
AnatomyofaWebSearch(3/3)• Implementationstrategy– Randomlydistributetheentries
– Makemanycopiesofdata(a.k.a.“replicas”)
– Loadbalancerequestsacrossreplicas
• Redundantcopiesofindicesanddocuments
– Breaksupsearchhotspots,e.g.“TaylorSwift”
– Increasesopportunitiesforrequest-levelparallelism
– Makesthesystemmoretolerantoffailures
31
Summary• WarehouseScaleComputers– Newclassofcomputers– Scalability,energyefficiency,highfailurerate
• Request-levelparallelisme.g.WebSearch
• Data-levelparallelismonalargedataset– AgazillionVMsfordifferentpeople
– MapReduce– Hadoop,Spark
32