What Can We Learn from Four Years of Data Center Hardware...
![Page 1: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/1.jpg)
What Can We Learn from Four Years of Data Center Hardware Failures?
Guosai Wang, Lifei Zhang, Wei Xu
![Page 2: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/2.jpg)
Motivation: Evolving Failure Model
• Failures in data centers are common and costly
- Violate service level agreements (SLAs) and cause loss of revenue
• Understand failures: reduce TCO
• Today’s data centers are different
- Better failure detection systems, experienced operators
- Adoption of less-reliable, commodity or custom-ordered hardware; more heterogeneous hardware and workload
- Result: more complex failure model
• Goal: comprehensive analysis of hardware failures in modern large-scale IDCs
![Page 3: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/3.jpg)
We Re-study Hardware Failures in IDCs
Our work:
- Large scale: hundreds of thousands of servers with 290,000 failure operation tickets
- Long-term: 2012-2016
- Multi-dimensional: components, time, space, product lines, operators’ response, etc.
- Reconfirm or extend previous findings + observe new patterns
Figure: analysis dimensions (time, space, components, product lines, operators’ response)
![Page 4: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/4.jpg)
Interesting Findings Overview
Common beliefs
• Failures are uniformly randomly distributed over time/space
• Failures happen independently
• HW unreliability shapes the software fault tolerance design
Our findings
• HW failures are not uniformly random
- at different time scales
- sometimes at different locations
• Correlated HW failures are common in IDCs
• It is also the other way around: software fault tolerance lets operators care less about HW dependability
![Page 5: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/5.jpg)
Failure Management Architecture
![Page 8: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/8.jpg)
Failure Management Architecture
• HMS agents detect failures on servers
• HMS collects failure records and stores them in a failure pool
• Operators/programs generate a FOT for each failure record
![Page 9: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/9.jpg)
Dataset: 290,000+ FOTs
• The failure operation tickets (FOTs) contain many fields
id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.
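As a minimal sketch, one ticket could be modeled as a small record type. The field names come from the slide; the field types, the timestamp layout `TIME_FORMAT`, and the helper `parse_fot` are assumptions made for illustration, not the authors' schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed timestamp layout; the slides do not specify one.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass(frozen=True)
class FailureOperationTicket:
    """One FOT, using the field names from the slide (types are assumed)."""
    id: str
    hostname: str
    hostidc: str
    errordevice: str
    errortype: str
    errortime: datetime
    errorposition: str
    optime: Optional[datetime]  # None while no operator has responded
    errordetail: str


def parse_fot(row: dict) -> FailureOperationTicket:
    """Build a ticket from a mapping of field name to raw string value."""
    raw_optime = row.get("optime") or None
    return FailureOperationTicket(
        id=row["id"],
        hostname=row["hostname"],
        hostidc=row["hostidc"],
        errordevice=row["errordevice"],
        errortype=row["errortype"],
        errortime=datetime.strptime(row["errortime"], TIME_FORMAT),
        errorposition=row.get("errorposition", ""),
        optime=datetime.strptime(raw_optime, TIME_FORMAT) if raw_optime else None,
        errordetail=row.get("errordetail", ""),
    )
```

Keeping `optime` optional lets unanswered tickets be represented without inventing a placeholder timestamp.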
![Page 11: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/11.jpg)
Multi-dimensional Analysis on the Dataset
• We study the failures on different dimensions based on different fields of FOTs
Time: errortime
Space: hostname, hostidc
Components: errordevice
Product lines: hostname
Operators’ response: errortime, optime
id, hostname, hostidc, errordevice, errortype, errortime, errorposition, optime, errordetail, etc.
![Page 12: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/12.jpg)
Failure Percentage Breakdown by Component
Device Proportion
Hard Disk Drive 81.84%
Miscellaneous* 10.20%
Memory 3.06%
Power 1.74%
RAID card 1.23%
Flash card 0.67%
Motherboard 0.57%
SSD 0.31%
Fan 0.19%
HDD backboard 0.14%
CPU 0.04%
* "Miscellaneous" are manually submitted or uncategorized failures
![Page 13: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/13.jpg)
Failure Types for Hard Disk Drive
• About half of HDD failures are related to SMART values or prediction error count
Failure Type Breakdown of HDD
Figure legend: SMARTFail, PredictErr, RaidPdPreErr, RaidPdFailed, Missing, NotReady, MediumErr, RaidPdMediaErr, BadSector, PendingLBA, TooMany, DStatus, Others
Figure annotations:
- Some HDD SMART value exceeds the threshold
- The prediction error count exceeds the threshold
- Other types
SMART = Self-Monitoring, Analysis and Reporting Technology
![Page 15: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/15.jpg)
Outline
• Dataset overview
➢ Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned
![Page 16: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/16.jpg)
FR is NOT Uniformly Random over Days of the Week
• Hypothesis 1. The average number of component failures is uniformly random over different days of the week.
• A chi-square test can reject the hypothesis at 0.01 significance level for all component classes.
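A minimal sketch of how a test like this can be run on per-weekday failure counts. It is a standard chi-square goodness-of-fit test against a uniform distribution, not necessarily the authors' exact procedure. The function names and sample counts are made up for illustration; 16.812 is the chi-square critical value for 6 degrees of freedom at the 0.01 level.

```python
# Critical value of the chi-square distribution, df = 7 - 1 = 6, alpha = 0.01.
CHI2_CRIT_DF6_ALPHA_001 = 16.812


def chi_square_uniform(counts):
    """Chi-square statistic of observed counts against a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def rejects_uniform_over_weekdays(counts):
    """True if uniformity over the 7 weekdays is rejected at the 0.01 level."""
    if len(counts) != 7:
        raise ValueError("expected one count per day of the week")
    return chi_square_uniform(counts) > CHI2_CRIT_DF6_ALPHA_001


# Hypothetical counts, Monday through Sunday (not data from the study).
weekday_counts = [150, 140, 130, 120, 110, 60, 50]
print(rejects_uniform_over_weekdays(weekday_counts))
```

The same statistic applies to the hours-of-the-day hypothesis with 24 bins, but the critical value then has to be taken for 23 degrees of freedom.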
![Page 17: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/17.jpg)
FR is NOT Uniformly Random over Hours of the Day
• Hypothesis 2. The average number of component failures is uniformly random during each hour of the day.
![Page 18: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/18.jpg)
FR is NOT Uniformly Random over Hours of the Day
• Possible reasons
- High workload results in more failures
- Human factors
- Components fail in large batches
![Page 22: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/22.jpg)
FR of each Component Changes During its Life Cycle
• Different component classes exhibit different FR patterns.
![Page 23: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/23.jpg)
FR of each Component Changes During its Life Cycle
• Infant mortalities:
![Page 24: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/24.jpg)
FR of each Component Changes During its Life Cycle
• Wear out
![Page 25: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/25.jpg)
Outline
• Dataset overview
• Temporal distribution of the failures
➢ Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned
![Page 26: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/26.jpg)
Physical Locations Might Affect the FR Distribution
• Hypothesis 3. The failure rate on each rack position is independent of the rack position.
• In general, at 0.05 significance level:
- cannot reject the hypothesis in 40% of the data centers
- can reject it in the other 60%
![Page 27: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/27.jpg)
FR Can be Affected by the Cooling Design
• FRs are higher at rack positions 22 and 35
• Possible reasons
- Design of IDC cooling and physical structure of the racks
Figure: a typical Scorpion rack, with labels "At the top", "Above the PSU", and "Cooling air"
![Page 28: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/28.jpg)
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
➢ Correlated failures
• Operators’ response to failures
• Lessons Learned
![Page 29: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/29.jpg)
Correlated Failures are Common
• Correlated failures: batch failures, correlated component failures, repeating synchronous failures
• Fact: 200+ HDD failures on each of 22.5% of the days
• Case study
- Nov. 16th and 17th, 2015
- 5,000+ servers, or 32% of all the servers of the product line, reporting hard drive SMARTFail failures
- 99% of these failures were detected between 21:00 on the 16th and 3:00 on the 17th.
- Operators replaced about 1,600 drives and decommissioned the remaining 4,000+ out-of-warranty drives
- Failure reason not clear yet
![Page 30: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/30.jpg)
Causes of Correlated Failures
All the following have happened before
- Environmental factors (e.g., humidity)
- Firmware bugs
- Single point of failure (e.g., power module failures)
- Human operator mistakes
- ...
![Page 31: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/31.jpg)
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
➢ Operators’ response to failures
• Lessons Learned
![Page 32: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/32.jpg)
Operators’ Response to Failures
• Response time: RT = op_time - err_time
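The RT definition can be sketched directly from the two ticket timestamps. This is a minimal illustration, assuming both times are available as `datetime` values; the function name and the sample tickets are hypothetical, not data from the study.

```python
from datetime import datetime
from statistics import median


def response_time_days(err_time: datetime, op_time: datetime) -> float:
    """RT = op_time - err_time, expressed in days."""
    return (op_time - err_time).total_seconds() / 86400.0


# Hypothetical tickets as (err_time, op_time) pairs.
tickets = [
    (datetime(2015, 1, 1, 0, 0), datetime(2015, 1, 2, 0, 0)),
    (datetime(2015, 1, 1, 0, 0), datetime(2015, 1, 7, 12, 0)),
    (datetime(2015, 1, 1, 0, 0), datetime(2015, 5, 21, 0, 0)),
]
rts = [response_time_days(err, op) for err, op in tickets]
print(f"median RT: {median(rts):.1f} days, mean RT: {sum(rts) / len(rts):.1f} days")
```

Reporting the median next to the mean matters for a heavy-tailed quantity like RT, since a few very slow responses pull the mean far above the typical case.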
![Page 33: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/33.jpg)
RT is Very High in General
• RT for D_fixing: Avg. 42.2 days, median 6.1 days
• 10% of the FOTs: RT > 140 days
- Is it because operators are busy dealing with a large number of failures?
- No!
![Page 34: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/34.jpg)
RT in Different Product Lines Varies
• Observation 1: Variation of RT in different product lines is large
• Observation 2: Operators respond to a large number of failures more quickly
Figure: Number of HDD Failures During Year 2015, with annotations "The REAL problems" and "Who cares?"
![Page 35: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/35.jpg)
OPs are Less Motivated to Respond to HW Failures
Possible reasons
• Software redundancy design
- Delayed responding, process failures in batches
• Many hardware failures are no longer urgent
- E.g., SMART failures may not be fatal
• Repair operation can be costly
- E.g., task migration
Figure labels: Operator, Resilient Software, Hardware Redundancy
![Page 36: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/36.jpg)
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
➢ Lessons Learned
![Page 37: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/37.jpg)
Lessons Learned I
• Much old wisdom still holds.
- More correlated failures → software design challenge
- Automatic hardware failure detection & handling
- Data center design: avoid "bad spot"
![Page 38: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/38.jpg)
Lessons Learned II
• Strike the right balance among software stack complexity, hardware dependability, and operation cost.
• Data center dependability needs joint optimization effort that crosses layers.
Figure labels: Operation Cost, Resilient Software Design, Dependable Hardware Infrastructure
![Page 39: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/39.jpg)
Lessons Learned III
• Stateful failure handling system
- Data mining tool: discover correlation among failures
- Provide operators with extra information
Figure labels: Hardware Failure, Server model, Workload, Environment, Failure history, Correlation with other failures
![Page 40: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/40.jpg)
Thank you! Q&A
Outline
• Dataset overview
• Temporal distribution of the failures
• Spatial distribution of the failures
• Correlated failures
• Operators’ response to failures
• Lessons Learned
![Page 41: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/41.jpg)
TBF Cannot be Well Fitted by Well-known Distributions
• Hypothesis 4. Time between failures (TBF) of all components follows an exponential distribution.
• Hypothesis 5. TBF of each individual component class follows an exponential distribution.
Figure: CDF of time between failures (min, log-scale axis from 10^0 to 10^2), comparing Exp, Weibull, Gamma, and LogNormal fits with the data. Annotation: large proportion of small values.
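One way to check an exponential hypothesis for TBF is sketched below, assuming failure times are given in minutes. It derives the gaps, fits the rate by maximum likelihood, and measures the Kolmogorov-Smirnov distance between the data and the fitted CDF. The function names and sample timestamps are illustrative; the slides do not say which goodness-of-fit procedure the authors used.

```python
import math


def time_between_failures(timestamps_min):
    """Gaps between consecutive failure times (minutes), after sorting."""
    ts = sorted(timestamps_min)
    return [b - a for a, b in zip(ts, ts[1:])]


def fit_exponential_rate(tbf):
    """Maximum-likelihood rate of an exponential distribution: 1 / mean."""
    total = sum(tbf)
    if total <= 0:
        raise ValueError("need at least one positive gap")
    return len(tbf) / total


def ks_distance_exponential(tbf, rate):
    """Largest gap between the empirical CDF and the fitted exponential CDF."""
    xs = sorted(tbf)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical failure timestamps in minutes (not data from the study).
gaps = time_between_failures([6, 0, 3, 1])
rate = fit_exponential_rate(gaps)
print(gaps, rate, ks_distance_exponential(gaps, rate))
```

A large distance relative to the sample size suggests a poor fit. Many near-zero gaps, as batch failures would produce, tend to inflate this distance for an exponential model.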
![Page 42: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/42.jpg)
Failure Operation Ticket (FOT)
• Categories of FOTs
• Fields: id, hostid, hostname, hostidc, errordevice, errortype, errortime, errorposition, errordetail
![Page 43: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/43.jpg)
FR of Misc. Failures During the Lifecycle
• Most manual detection and debugging efforts happen only at deployment time
• Less cost to repair (not many tasks to migrate)
![Page 44: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/44.jpg)
RT for Each Component Class
• Median RTs for SSD and misc. failures are the shortest (hours)
• Median RTs for HDD, fans, and memory are the longest (7-18 days)
• Standard deviation of the RT for HDD: 30.2 days
![Page 45: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/45.jpg)
Self-Monitoring, Analysis and Reporting Technology
• Fields: raw value, worst, threshold, status
• SMART attribute examples (failure related)
- Reallocated Sectors Count
- End-to-End error
- Uncorrectable Sector Count
- Reported Uncorrectable Errors
- Current Pending Sector Count
- Command Timeout
- ...
![Page 46: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/46.jpg)
Examples of Failure Types
![Page 47: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/47.jpg)
Repeating Failures
• Over 85% of the fixed components never repeat the same failure
• Repair can fail
• 2% of servers that ever failed contribute more than 99% of all failures
![Page 48: What Can We Learn from Four Years of Data Center Hardware ...people.iiis.tsinghua.edu.cn/~weixu/Krvdro9c/dsn17-wang-slides.pdf · •Software redundancy design-DelayedResponding,](https://reader034.fdocuments.us/reader034/viewer/2022050114/5f4b00a1306fd41a4754ed55/html5/thumbnails/48.jpg)
Batch Failure Frequency for Each Component
• r_N: a normalized count of the days, among the D days studied, on which more than N failures happen
• Normalized by the total time length D.