Understanding Real-World Timeout Problems in Cloud Server ... · Replica Node Timeout. Computer...
Transcript of Understanding Real-World Timeout Problems in Cloud Server ... · Replica Node Timeout. Computer...
-
Computer Science
Understanding Real-World Timeout Problems in Cloud Server Systems
TingDai,Jingzhu He,Xiaohui (Helen)Gu,ShanLu*
NCStateUniversity *UniversityofChicago
-
Computer Science
Real-world timeout problems
https://aws.amazon.com/cn/message/5467D2/
2
Storageservers
AmazonDynamoDBservicewasdownfor5 hours.
Metadataserver
Sendreq1
Timeout Sendreq2
Sendreq3…
Bug OverloadedTimeoutTimeout
Noproperlimitofretry.
-
Computer Science
•DataNode1•Balancer •DataNode2
Move datablockSend job req1
Sendjobreq2
HDFS-6166
3
1min
A Motivating Example
Timeout
Threadquotaexceedederror
Sendjobreq3
Timeout
Misconfiguredtimeoutvalue
-
Computer Science
•DataNode1•Balancer •DataNode2
Move datablockSend job req1
•HDFS-6166
4
1min
A Motivating Example
20min Sendjobresponse1
patch
-
Computer Science 5
What are timeout bugs?
Timeoutbugshappenwhentheserverapplicationslackproperconfiguration andhandling ofthetimeoutevents.
-
Computer Science 6
Why are timeout bugs areprevalent?
• Cloudserversystemshavebecomeincreasinglycomplex.
• Timeoutisoneofthecommonlyusedmechanismstohandleunexpectedfailuresindistributedcomputingenvironments.
-
Computer Science
Methodology
• Wesearchedtimeoutbugsin11 popularcloudserverapplicationsfromApacheJIRA.
• Weextensivelystudied156 bugs.
7
System #ofbugsCassandra 17
Flume 13HadoopCommon 15
HadoopMapreduce 15HadoopYarn 4
HDFS 26
System #ofbugsHBase 28
Phoenix 6Qpid 20Spark 4
Zookeeper 8Total 156
-
Computer Science
• Weclassifiedthe156 timeoutbugsinregardtothree characteristics:v rootcausesv impacttosystemsorapplicationsv diagnosability
Methodology
8
-
Computer Science
Root Cause
9
47%
31%
12%
5% 5%
Misused timeout value
Missing timeout checking
Improper handling
Unnecessary timeout
Clock drifting
Misusedtimeoutvalue&Missingtimeoutcheckingdominate.
-
Computer Science
• Misusedtimeoutvalue(65 bugs)v Misconfiguredtimeoutvalue(38 bugs)v Ignoredtimeoutvalue(10 bugs)v Incorrectlyreusedtimeoutvalue(8 bugs)v Inconsistenttimeoutvalue(4 bugs)v Staletimeoutvalue(3 bugs)v Impropertimeoutscope(2 bugs)
10
Root Cause
-
Computer Science
HBase-8581
11
An Ignored Timeout Value Example
Client RegionServer
ReadmetaTable
//HTable classoperationTimeout = isMetaTable(tableName) ? HConstants.DEFAULT_HBASE_CLIENT_OPERATION_TIMEOUT : config.getInt(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT...);
Theconfiguredtimeoutvalueisignored
1 msInteger.Maxms
Defau
lt
Configured
-
Computer Science
Misused timeout value bugs often occur when: v lack extensive testing on timeout configurations; v do not understand the system’s timeout mechanisms.
Setting proper timeout value is challenging.
12
Observation
-
Computer Science
• Missingtimeoutchecking(42 bugs)v Missingtimeoutfornetworkcommunication(26 bugs)v Missingtimeoutforsynchronization(16 bugs)
13
Root Cause
-
Computer Science
A Missing Timeout Example
•Zookeeper-2224
14
Client ZKLeader
ZKFollower
send4LetterWord
//FourLetterWordMain class74 sock = new Socket(host, port);
//Socket class425 connect(address);
ZKFollower
•Missingtimeoutfornetworkcommunication
-
Computer Science
•HBase-13971
15
Another Missing Timeout Example
RegionServer WALKey FSHLog
getSequenceId setLogSeqNum
logSeqNum
seqNumAssignedLatch.await();logSeqNum = sequence;seqNumAssignedLatch.countDown();
Missingtimeoutforsynchronization
-
Computer Science
Missing timeout bugs often occur when developers do not consider the system’s failover mechanisms.
16
Observation
-
Computer Science
• Impropertimeouthandling(16 bugs)v Insufficient/missingretries(8 bugs)v Excessiveretries(3 bugs)v Incorrectretry(2 bugs)v Incompleteabort(2 bugs)v Incorrectabort(1 bug)
17
Root Cause
-
Computer Science
•DataNode1DFSClient
Readfilesreq
Hadoop-3831
18
Insufficient/missing retries cause job failure
Timeout
Filecontents
TryotherDataNodes
8min
Jobfailure
-
Computer Science
It is challenging to implement proper timeout handling mechanisms, which requires developers to understand:v the tradeoffs between handling schemes (e.g.,
aborting v.s. retry); v each handling scheme’s impact to the systems and
applications.
19
Observation
-
Computer Science
• Unnecessarytimeoutprotection(7 bugs)
Thosebugsoccurwhendevelopersmistakenlyusetimeoutretrymechanismsoveroperationswhichrequirescontinuousorat-most-once-executionsemantics.
20
Root Cause
-
Computer Science
Clockdrifting(7 bugs)
Those bugs occur when the clocks are out-of-synchronization, the elapsed time is miscalculated, which generates a wrong timer value.
21
Root Cause
-
Computer Science
40%
33%
26%
2%
System unavailabilityJob failurePerformance degradationData loss
22
Impact
-
Computer Science
Unavailability caused by missing timeout
HDFS-4858
NameNodeSecondaryNameNode
Down
DataNodesmisstimeout.HDFSbecomesunavailable.
23
Reboot
DataNodes
TCPRST
Poweroutage
-
Computer Science
60% 29%
12%
No error message
Correct error message
Wrong error message
Only29% timeoutbugsreportthecorrecterrormessages.
24
Diagnosability
-
Computer Science
A Wrong Error Message Example
•Cassandra-3651
25
Client Coordinator
ReplicaNode
Truncatetable1
UnavailableException
try {...
} catch (TimeoutException e) {throw new UnavailableException();
}
ReplicaNode
Timeout
-
Computer Science
• Enhancedtimeoutdetectiontoolv Featureextractionv Semi-supervisedmachinelearningscheme
26
Future Work
-
Computer Science
State of the Art
• Generalbugstudies [Gunawi etal.SoCC’14, Huangetal.SoCC’15, etc]v Theyfoundtimeoutbugswidelyexistindistributedsystems.
• Specificbugstudies [Yinetal.SOSP’11, Wangetal.IC2E’15, etc ]v Misconfigurations;DataCorruption;Performance;
Concurrency.
• Performancebugdiagnosis [Deanetal.SoCC’14, etc ]v Existingtoolscannotdetect/diagnoseperformance
anomaliescausedbytimeoutbugs[ICAC’15].
• Concurrencybugdetection/fix [Jin etal.OSDI’12,PLDI’12,etc ]v Ourstudyrevealsunder-studiedtypesofrootcausesfor
concurrencybugs:missing,misused,andunnecessarytimeout.
27
-
Computer Science
• Weperformacharacteristicstudyof156 real-worldtimeoutbugsin11 popularopensourcecloudserversystems.
• 81% timeoutbugsarecausedbyeithermisusedtimeoutvaluesormissingtimeoutchecking.
• Timeoutproblemshaveseriousimpacttobothcloudserversystemsandapplications.
• Existingtimeoutissuesaredifficulttodiagnosewith71%bugsproducingnoerrormessageormisleadingerrormessages.
28
Conclusion
Thank you!