CS 61C: Great Ideas in Computer Architecture …cs61c/sp16/lec/29/2016Sp...Back in 2011 • Google...

CS61C:GreatIdeasinComputerArchitecture(MachineStructures)Warehouse-ScaleComputing

Instructors:NicholasWeaver&VladimirStojanovichttp://inst.eecs.berkeley.edu/~cs61c/

CoherencyTrackedbyCacheBlock

• Blockping-pongsbetweentwocacheseventhoughprocessorsareaccessingdisjointvariables

• Effectcalledfalsesharing• Howcanyoupreventit?

2

Review:UnderstandingCacheMisses:The3Cs

• Compulsory(coldstartorprocessmigration,1st reference):– Firstaccesstoblock,impossibletoavoid;smalleffectforlong-running

programs– Solution:increaseblocksize(increasesmisspenalty;verylargeblocks

couldincreasemissrate)• Capacity (notcompulsoryand…)

– Cachecannotcontainallblocksaccessedbytheprogramevenwithperfectreplacementpolicyinfullyassociativecache

– Solution:increasecachesize(mayincreaseaccesstime)• Conflict(notcompulsoryorcapacityand…):

– Multiplememorylocationsmaptothesamecachelocation– Solution1:increasecachesize– Solution2:increaseassociativity(mayincreaseaccesstime)– Solution3:improvereplacementpolicy,e.g..LRU

3

Fourth“C”ofCacheMisses:CoherenceMisses

• Missescausedbycoherencetrafficwithotherprocessor

• Alsoknownascommunicationmissesbecauserepresentsdatamovingbetweenprocessorsworkingtogetheronaparallelprogram

• Forsomeparallelprograms,coherencemissescandominatetotalmisses– Itgetsevenmorecomplicatedwithmultithreadedprocessors:YouwantseparatethreadsonthesameCPUtohavecommonworkingset,otherwiseyougetwhatcouldbedescribedasincoherencemisses

4

New-SchoolMachineStructures(It’sabitmorecomplicated!)

• ParallelRequestsAssigned tocomputere.g.,Search“cats”

• ParallelThreadsAssigned tocoree.g.,Lookup,Ads

• ParallelInstructions>[email protected].,5pipelined instructions

• ParallelData>1dataitem@one timee.g.,DeepLearningfor

imageclassification

• HardwaredescriptionsAllgates@onetime

• ProgrammingLanguages 5

SmartPhone

WarehouseScale

ComputerHarness

Parallelism&AchieveHighPerformance

LogicGates

Core Core…

Memory(Cache)

Input/Output

Computer

CacheMemory

Core

InstructionUnit(s) FunctionalUnit(s)

A3+B3A2+B2A1+B1A0+B0

SoftwareHardware

Backin2011• Googledisclosedthatitcontinuouslyusesenoughelectricitytopower200,000homes,butitsaysthatindoingso,italsomakestheplanetgreener.

• Averageenergyusepertypicaluser permonthissameasrunninga60-wattbulbfor3hours(180watt-hours).

6

Urs Hoelzle,Google SVPCo-authoroftoday’sreading

http://www.nytimes.com/2011/09/09/technology/google-details-and-defends-its-use-of-electricity.html

Google’sWSCs

74/8/16

Ex:InOregon

ContainersinWSCs

8

InsideWSC InsideContainer

Server,Rack,Array

9

GoogleServerInternals

10

GoogleServer

Warehouse-ScaleComputers• Datacenter– Collectionof10,000to100,000servers– Networksconnectingthemtogether

• Singlegiganticmachine• Verylargeapplications(Internetservice):

search,email,videosharing,socialnetworking• Veryhighavailability• “…WSCsarenolessworthyoftheexpertiseofcomputer

systemsarchitectsthananyotherclassofmachines”Barroso andHoelzle,2009

11

UniquetoWSCs• AmpleParallelism

– Request-levelParallelism:ex:Websearch– Data-levelParallelism:ex:Imageclassifier training

• ScaleanditsOpportunities/Problems– Scaleofeconomy:lowper-unitcost– Cloudcomputing:rentcomputingpowerwithlowcosts(ex:AWS)

• OperationCostCount– Longerlifetime(>10years)– Costofequipmentpurchases<<costofownership– Oftensemi-customorcustomhardware

• Butconsortiumsofhardwaredesignstosavecostthere• Designforfailure:

– Transientfailures– Hardfailures– High#offailures

ex:4disks/server,annualfailurerate:4%à WSCof50,000servers:1diskfail/hour

12

WSCArchitecture

13

1UServer:8-16cores,16GBDRAM,4x4TBdisk+diskpods

Rack:40-80severs,LocalEthernet(1-10Gbps)switch(30$/1Gbps/server)

Array(akacluster):16-32racksExpensiveswitch(10Xbandwidthà 100xcost)

WSCStorageHierarchy

14

1UServer:DRAM:16GB,100ns,20GB/sDisk:2TB,10ms,200MB/s

Rack(80severs):DRAM:1TB,300us,100MB/sDisk:160TB,11ms,100MB/s

Array(30racks):DRAM:30TB,500us,10MB/sDisk:4.80PB,12ms,10MB/s

LowerlatencytoDRAMinanotherserverthanlocaldiskHigherbandwidthtolocaldiskthantoDRAMinanotherserver

WorkloadVariation

• Onlineservice:Peakusage2Xoff-peak15

Midnight Noon Midnight

Workload

2X

ImpactonWSCsoftware• Latency,bandwidthà Performance– Independentdatasetwithinanarray– Localityofaccesswithinserverorrack

• Highfailurerateà Reliability,Availability– Preventingfailuresiseffectivelyimpossible atthisscale– Copewithfailuresgracefullybydesigningthesystemasawhole

• Varyingworkloadsà Availability– Scaleupanddowngracefully

• Morechallengingthansoftwareforsinglecomputers!

16

PowerUsageEffectiveness• Energyefficiency– PrimaryconcerninthedesignofWSC– Importantcomponentofthetotalcostofownership

• PowerUsageEffectiveness(PUE):

– ApowerefficiencymeasureforWSC– Notconsideringefficiencyofservers,networking– Perfection=1.0– GoogleWSC’sPUE=1.2

• GettingprettyclosetoAmdahl'slawlimit 17

TotalBuildingPowerITequipmentPower

PUEintheWild(2007)

18

LoadProfileofWSCs

• AverageCPUutilizationof5,000Googleservers,6monthperiod• Serversrarelyidleorfullyutilized,operatingmostofthetimeat

10%to50%oftheirmaximumutilization20

Energy-ProportionalComputing:DesignGoalofWSC

• Energy=PowerxTime,Efficiency=Computation/Energy• Desire:

– Consumealmostnopowerwhenidle(“Doingnothingwell”)– Graduallyconsumemorepowerastheactivitylevelincreases

21

CauseofPoorEnergyProportionality

22

• CPU:50%atpeek,30%atidle• DRAM,disks,networking:70%atidle!

– Becausetheyareneverreallyidleunlesstheyarepoweredoff!• Needtoimprovetheenergyefficiencyofperipherals

Clicker/PeerInstruction:WhichStatementisTrue

• A:Idleserversconsumealmostnopower.

• B:Diskswillfailoncein20years,sofailureisnotaproblemofWSC.

• C:Thesearchrequestsofthesamekeywordfromdifferentusersaredependent.

• D:MorethanhalfofthepowerofWSCsgoesintocooling.

• E:WSCscontainmanycopiesofdata.23

Administrivia• ReminderthatProject4isout…

24

Agenda

• WarehouseScaleComputing

• Administrivia &Clickers/PeerInstructions

• Request-levelParallelisme.g.Websearch

25

Request-LevelParallelism(RLP)• Hundredsofthousandsofrequestspersec.– PopularInternetserviceslikewebsearch,socialnetworking,…

– Suchrequestsarelargelyindependent• Ofteninvolveread-mostlydatabases• Rarelyinvolveread-writesharingorsynchronizationacrossrequests

• Computationeasilypartitionedacrossdifferentrequestsandevenwithinarequest

26

GoogleQuery-ServingArchitecture

27

AnatomyofaWebSearch

28

AnatomyofaWebSearch(1/3)• Google“cats”– Directrequestto“closest”GoogleWSC

• HandledbyDNS

– Front-endloadbalancerdirectsrequesttooneofmanyarrays(clusterofservers)withinWSC• Oneofpotentiallymanyloadbalancers

– Withinarray,selectoneofmanyGoggleWebServers(GWS)tohandletherequestandcomposetheresponsepages

– GWScommunicateswithIndexServerstofinddocumentsthatcontainsthesearchword,“cats”• IndexserverskeepindexinRAM,notondisk

– Returndocumentlistwithassociatedrelevancescore 29

AnatomyofaWebSearch(2/3)• Inparallel,– Adsystem:runadauctionforbiddersonsearchterms

• Yes,youarebeingboughtandsoldinarealtime auctionallovertheweb

• Pageadsareworsethansearchads

• Usedocids (DocumentIDs)toaccessindexeddocuments• Composethepage– Resultdocumentextracts(withkeywordincontext)orderedbyrelevancescore

– Sponsoredlinks(alongthetop)andadvertisements(alongthesides)

30

AnatomyofaWebSearch(3/3)• Implementationstrategy– Randomlydistributetheentries

– Makemanycopiesofdata(a.k.a.“replicas”)

– Loadbalancerequestsacrossreplicas

• Redundantcopiesofindicesanddocuments

– Breaksupsearchhotspots,e.g.“TaylorSwift”

– Increasesopportunitiesforrequest-levelparallelism

– Makesthesystemmoretolerantoffailures

31

Summary• WarehouseScaleComputers– Newclassofcomputers– Scalability,energyefficiency,highfailurerate

• Request-levelparallelisme.g.WebSearch

• Data-levelparallelismonalargedataset– AgazillionVMsfordifferentpeople

– MapReduce– Hadoop,Spark

32

CS 61C: Great Ideas in Computer Architecture …cs61c/sp16/lec/29/2016Sp...Back in 2011 • Google...

Documents

Transcript of CS 61C: Great Ideas in Computer Architecture …cs61c/sp16/lec/29/2016Sp...Back in 2011 • Google...