Computer Science Understanding Real-World Timeout Problems in Cloud Server Systems Ting Dai, Jingzhu He, Xiaohui (Helen) Gu, Shan Lu* NC State University *University of Chicago

  • Computer Science

    Understanding Real-World Timeout Problems in Cloud Server Systems

    Ting Dai, Jingzhu He, Xiaohui (Helen) Gu, Shan Lu*

    NC State University, *University of Chicago

  • Computer Science

    Real-world timeout problems

    Amazon DynamoDB service was down for 5 hours.

    https://aws.amazon.com/cn/message/5467D2/

    [Diagram: storage servers and a metadata server. Send req1, timeout; send req2, timeout; send req3, … The metadata server is overloaded.]

    Bug: no proper limit on retries.
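The unbounded resend pattern above can be contrasted with a capped retry loop. The following is a minimal sketch, not the fix applied to the service in the incident: `BoundedRetry`, `call`, `maxAttempts`, and `initialBackoffMs` are hypothetical names chosen for illustration, and exponential backoff is just one common way to space out attempts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: retries a timed-out call a bounded number of times. */
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Calls op up to maxAttempts times, doubling the wait between attempts.
     * Only TimeoutException triggers a retry; any other failure propagates at once.
     */
    static <T> T call(Callable<T> op, int maxAttempts, long initialBackoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        TimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (TimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs); // back off so an overloaded server can recover
                    backoffMs *= 2;
                }
            }
        }
        // Retry budget exhausted: surface the timeout instead of resending forever.
        throw last;
    }
}
```

A caller would wrap the request in a `Callable` and choose a small attempt count; once the budget is spent, the timeout reaches the caller instead of adding more load to the server.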

  • Computer Science

    A Motivating Example

    HDFS-6166

    [Diagram: Balancer, DataNode1, DataNode2. Move data block. Send job req1; timeout (1 min). Send job req2; thread quota exceeded error. Send job req3; timeout.]

    Misconfigured timeout value

  • Computer Science

    A Motivating Example

    HDFS-6166

    [Diagram: Balancer, DataNode1, DataNode2. Move data block. Send job req1; send job response1. Timeout labels: 1 min, 20 min (patch).]

  • Computer Science

    What are timeout bugs?

    Timeout bugs happen when the server applications lack proper configuration and handling of the timeout events.

  • Computer Science

    Why are timeout bugs prevalent?

    • Cloud server systems have become increasingly complex.

    • Timeout is one of the commonly used mechanisms to handle unexpected failures in distributed computing environments.

  • Computer Science

    Methodology

    • We searched timeout bugs in 11 popular cloud server applications from Apache JIRA.

    • We extensively studied 156 bugs.

    System              # of bugs
    Cassandra           17
    Flume               13
    Hadoop Common       15
    Hadoop MapReduce    15
    Hadoop Yarn         4
    HDFS                26
    HBase               28
    Phoenix             6
    Qpid                20
    Spark               4
    Zookeeper           8
    Total               156

  • Computer Science

    Methodology

    • We classified the 156 timeout bugs in regard to three characteristics:
        ◦ root causes
        ◦ impact to systems or applications
        ◦ diagnosability

  • Computer Science

    Root Cause

    [Pie chart]
    Misused timeout value: 47%
    Missing timeout checking: 31%
    Improper handling: 12%
    Unnecessary timeout: 5%
    Clock drifting: 5%

    Misused timeout value & Missing timeout checking dominate.

  • Computer Science

    Root Cause

    • Misused timeout value (65 bugs)
        ◦ Misconfigured timeout value (38 bugs)
        ◦ Ignored timeout value (10 bugs)
        ◦ Incorrectly reused timeout value (8 bugs)
        ◦ Inconsistent timeout value (4 bugs)
        ◦ Stale timeout value (3 bugs)
        ◦ Improper timeout scope (2 bugs)

  • Computer Science

    An Ignored Timeout Value Example

    HBase-8581

    [Diagram: Client and RegionServer. Read meta table. Timeout values shown: 1 ms and Integer.Max ms, labeled Default and Configured.]

    // HTable class
    operationTimeout = isMetaTable(tableName)
        ? HConstants.DEFAULT_HBASE_CLIENT_OPERATION_TIMEOUT
        : config.getInt(HConstants.HBASE_CLIENT_OPERATION_TIMEOUT...);

    The configured timeout value is ignored
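The ignored-value pattern in this example can be reduced to a few lines. The sketch below is illustrative only and does not use the HBase API: `TimeoutConfig`, the key `client.operation.timeout`, and both method names are hypothetical, the `Integer.MAX_VALUE` default is an assumption based on the "Integer.Max ms" label in the slide, and `honoringConfig` shows one possible fix rather than the patch actually applied to HBase-8581.

```java
import java.util.Map;

/** Hypothetical stand-in for a client configuration; not the HBase API. */
final class TimeoutConfig {
    static final String OPERATION_TIMEOUT_KEY = "client.operation.timeout"; // hypothetical key
    static final int DEFAULT_OPERATION_TIMEOUT_MS = Integer.MAX_VALUE;      // assumed default

    private TimeoutConfig() {}

    /** Buggy shape: the configured value is never read for the special (meta) table. */
    static int ignoringConfig(Map<String, Integer> conf, boolean isMetaTable) {
        return isMetaTable
                ? DEFAULT_OPERATION_TIMEOUT_MS
                : conf.getOrDefault(OPERATION_TIMEOUT_KEY, DEFAULT_OPERATION_TIMEOUT_MS);
    }

    /** One possible fix: always consult the configuration, falling back to the default. */
    static int honoringConfig(Map<String, Integer> conf) {
        return conf.getOrDefault(OPERATION_TIMEOUT_KEY, DEFAULT_OPERATION_TIMEOUT_MS);
    }
}
```

With a configured value of 1000 ms, `ignoringConfig(conf, true)` still yields the effectively infinite default, which is why such a bug can surface as a hang rather than as an error.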

  • Computer Science

    Observation

    Misused timeout value bugs often occur when developers:
        ◦ lack extensive testing on timeout configurations;
        ◦ do not understand the system’s timeout mechanisms.

    Setting a proper timeout value is challenging.

  • Computer Science

    Root Cause

    • Missing timeout checking (42 bugs)
        ◦ Missing timeout for network communication (26 bugs)
        ◦ Missing timeout for synchronization (16 bugs)

  • Computer Science

    A Missing Timeout Example

    Zookeeper-2224

    [Diagram: Client, ZK Leader, two ZK Followers. The client calls send4LetterWord.]

    // FourLetterWordMain class
    sock = new Socket(host, port);

    // Socket class
    connect(address);

    Missing timeout for network communication
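For comparison, a socket client can bound both the connect and the read. The sketch below uses only standard `java.net.Socket` calls (`connect(SocketAddress, int)` and `setSoTimeout(int)`); the class `TimedCommandClient`, its `send` method, and the single shared `timeoutMs` parameter are hypothetical and are not the code from the ZooKeeper patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical client: sends a short command with bounded connect and read waits. */
final class TimedCommandClient {
    private TimedCommandClient() {}

    static String send(String host, int port, String cmd, int timeoutMs) throws IOException {
        try (Socket sock = new Socket()) { // unconnected socket, so connect can take a timeout
            sock.connect(new InetSocketAddress(host, port), timeoutMs);
            sock.setSoTimeout(timeoutMs);  // a blocked read throws SocketTimeoutException
            OutputStream out = sock.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            sock.shutdownOutput();         // signal that the request is complete

            InputStream in = sock.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.US_ASCII);
        }
    }
}
```

If the peer accepts the connection but never answers, `send` fails with `java.net.SocketTimeoutException` after roughly `timeoutMs` instead of blocking indefinitely.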

  • Computer Science

    Another Missing Timeout Example

    HBase-13971

    [Diagram: RegionServer, WALKey, FSHLog. Calls: getSequenceId, setLogSeqNum. Field: logSeqNum.]

    // getSequenceId
    seqNumAssignedLatch.await();

    // setLogSeqNum
    logSeqNum = sequence;
    seqNumAssignedLatch.countDown();

    Missing timeout for synchronization
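A latch wait like the one above can be bounded with the timed overload `CountDownLatch.await(long, TimeUnit)`, which returns `false` when the time runs out. The sketch below is a simplified illustration, not the HBase code or its patch: `AssignedSequence`, `set`, and `get` are hypothetical names, and throwing `TimeoutException` is one possible way to report the expired wait.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical holder whose value is assigned later by another thread. */
final class AssignedSequence {
    private final CountDownLatch assigned = new CountDownLatch(1);
    private volatile long sequence;

    /** Publishes the value and releases any waiters. */
    void set(long value) {
        sequence = value;
        assigned.countDown();
    }

    /** Waits at most timeoutMs for the value instead of blocking indefinitely. */
    long get(long timeoutMs) throws InterruptedException, TimeoutException {
        if (!assigned.await(timeoutMs, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("sequence not assigned within " + timeoutMs + " ms");
        }
        return sequence;
    }
}
```

If the thread that should call `set` fails before doing so, the waiter gets an exception it can handle rather than hanging.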

  • Computer Science

    Observation

    Missing timeout bugs often occur when developers do not consider the system’s failover mechanisms.

  • Computer Science

    Root Cause

    • Improper timeout handling (16 bugs)
        ◦ Insufficient/missing retries (8 bugs)
        ◦ Excessive retries (3 bugs)
        ◦ Incorrect retry (2 bugs)
        ◦ Incomplete abort (2 bugs)
        ◦ Incorrect abort (1 bug)

  • Computer Science

    Insufficient/missing retries cause job failure

    Hadoop-3831

    [Diagram: DFSClient and DataNode1. Read files req; file contents; timeout (8 min); try other DataNodes; job failure.]

  • Computer Science

    Observation

    It is challenging to implement proper timeout handling mechanisms, which requires developers to understand:
        ◦ the tradeoffs between handling schemes (e.g., aborting vs. retrying);
        ◦ each handling scheme’s impact on the systems and applications.

  • Computer Science

    Root Cause

    • Unnecessary timeout protection (7 bugs)

    Those bugs occur when developers mistakenly use timeout retry mechanisms over operations that require continuous or at-most-once execution semantics.
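One way to keep retry logic away from at-most-once operations is to make the caller state whether an operation is safe to repeat. The sketch below assumes that approach; `IdempotentRetry`, `run`, and the `idempotent` flag are hypothetical names, and the studied systems may guard such operations differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical wrapper: retries on timeout only when the caller marks the operation idempotent. */
final class IdempotentRetry {
    private IdempotentRetry() {}

    static <T> T run(Callable<T> op, boolean idempotent, int maxAttempts) throws Exception {
        // At-most-once operations get exactly one attempt, whatever the retry budget says.
        int attempts = idempotent ? Math.max(1, maxAttempts) : 1;
        TimeoutException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.call();
            } catch (TimeoutException e) {
                last = e; // the request may still have executed remotely
            }
        }
        throw last;
    }
}
```

A timed-out non-idempotent call is reported to the caller after a single attempt, so it is never executed twice by the wrapper.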

  • Computer Science

    Root Cause

    • Clock drifting (7 bugs)

    Those bugs occur when clocks are out of synchronization: the elapsed time is miscalculated, which generates a wrong timer value.
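Within a single process, one common safeguard against miscalculated elapsed time is to measure it with a monotonic source such as `System.nanoTime()` rather than the wall clock. The sketch below shows that idea only; the `Deadline` class and its methods are hypothetical, and it does not address drift between the clocks of different machines.

```java
import java.util.function.LongSupplier;

/** Hypothetical deadline based on a monotonic tick source rather than wall-clock time. */
final class Deadline {
    private final LongSupplier nanoClock;
    private final long startNanos;
    private final long timeoutNanos;

    /** The tick source is injectable so the class can be exercised with a fake clock. */
    Deadline(long timeoutMs, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
        this.timeoutNanos = timeoutMs * 1_000_000L;
    }

    /** Uses System.nanoTime(), which is not affected by wall-clock adjustments. */
    static Deadline after(long timeoutMs) {
        return new Deadline(timeoutMs, System::nanoTime);
    }

    boolean expired() {
        // Subtract first, then compare, so the check stays valid if the counter wraps.
        return nanoClock.getAsLong() - startNanos >= timeoutNanos;
    }

    long remainingMs() {
        long left = timeoutNanos - (nanoClock.getAsLong() - startNanos);
        return Math.max(0L, left / 1_000_000L);
    }
}
```

Code that waits in a loop can call `remainingMs()` before each blocking step, so the total wait stays bounded even if the system time is reset in the meantime.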

  • Computer Science

    Impact

    [Pie chart]
    System unavailability: 40%
    Job failure: 33%
    Performance degradation: 26%
    Data loss: 2%

  • Computer Science

    Unavailability caused by missing timeout

    HDFS-4858

    [Diagram: NameNode, Secondary NameNode, DataNodes. Power outage; NameNode down; reboot; TCP RST.]

    DataNodes miss timeout. HDFS becomes unavailable.

  • Computer Science

    Diagnosability

    [Pie chart]
    No error message: 60%
    Correct error message: 29%
    Wrong error message: 12%

    Only 29% of timeout bugs report the correct error messages.

  • Computer Science

    A Wrong Error Message Example

    Cassandra-3651

    [Diagram: Client, Coordinator, two Replica Nodes. Truncate table1; timeout; UnavailableException.]

    try {
        ...
    } catch (TimeoutException e) {
        throw new UnavailableException();
    }
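The misleading message in this example comes from translating a timeout into an unrelated exception type. A minimal sketch of the alternative, reporting the timeout as a timeout and keeping the original exception as the cause, is shown below. `TruncateCoordinator`, `OperationTimedOutException`, and the `replicaCall` parameter are hypothetical and are not Cassandra classes or the actual fix.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical coordinator that reports a replica timeout as a timeout. */
final class TruncateCoordinator {
    /** Hypothetical exception type for this sketch; not Cassandra's class. */
    static final class OperationTimedOutException extends RuntimeException {
        OperationTimedOutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private TruncateCoordinator() {}

    static void truncate(String table, Callable<Void> replicaCall) {
        try {
            replicaCall.call();
        } catch (TimeoutException e) {
            // Keep the error category and the original exception for diagnosis.
            throw new OperationTimedOutException("truncate of " + table + " timed out", e);
        } catch (Exception e) {
            throw new IllegalStateException("truncate of " + table + " failed", e);
        }
    }
}
```

Keeping the cause chain intact lets an operator see from the stack trace that the replica was slow rather than unavailable.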

  • Computer Science

    Future Work

    • Enhanced timeout detection tool
        ◦ Feature extraction
        ◦ Semi-supervised machine learning scheme

  • Computer Science

    State of the Art

    • General bug studies [Gunawi et al. SoCC’14, Huang et al. SoCC’15, etc.]
        ◦ They found timeout bugs widely exist in distributed systems.

    • Specific bug studies [Yin et al. SOSP’11, Wang et al. IC2E’15, etc.]
        ◦ Misconfigurations; Data Corruption; Performance; Concurrency.

    • Performance bug diagnosis [Dean et al. SoCC’14, etc.]
        ◦ Existing tools cannot detect/diagnose performance anomalies caused by timeout bugs [ICAC’15].

    • Concurrency bug detection/fix [Jin et al. OSDI’12, PLDI’12, etc.]
        ◦ Our study reveals under-studied types of root causes for concurrency bugs: missing, misused, and unnecessary timeout.

  • Computer Science

    Conclusion

    • We perform a characteristic study of 156 real-world timeout bugs in 11 popular open source cloud server systems.

    • 81% of timeout bugs are caused by either misused timeout values or missing timeout checking.

    • Timeout problems have serious impact on both cloud server systems and applications.

    • Existing timeout issues are difficult to diagnose, with 71% of bugs producing no error message or misleading error messages.

    Thank you!