What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria

Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake

Thanh Do

First, let’s ask Google

Cloud era

No Deep Root Causes…

What does the reliability research community do?

• Bug studies
  1. A Study of Linux File System Evolution. In FAST ’13.
  2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
  3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
  …

Open-source cloud software

• Publicly accessible bug repositories

Study to solve…

• What bugs “live” in the cloud?
• Are there new classes of bugs unique to cloud systems?
• How should cloud dependability tools evolve in the near future?
• Many other questions…

Cloud Bug Study (CBS)

• 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume
• 11 people, 1-year study
• Issues in a 3-year window: Jan 2011 to Jan 2014
• ~21000 issues reviewed
• ~3600 “vital” issues studied in depth
• Cloud Bug Study (CBS) database

Classifications

• Aspects – Reliability, performance, availability, security, consistency, scalability, topology, QoS

• Hardware failures - types of hardware and types of hardware failures

• Software bug types – Logic, error handling, optimization, config, race, hang, space, load

• Implications – Failed operation, performance, component downtime, data loss, data staleness, data corruption

• ~25000 annotations in total, about 7 annotations per issue
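One way to picture these classifications is as a record per vital issue that carries one or more tags from each category. The sketch below is illustrative only: the `AnnotatedIssue` class, its field names, and the sample issue IDs are assumptions made for this example, not the actual CBS database schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AnnotatedIssue:
    """Hypothetical record for one vital issue (not the real CBS schema)."""
    issue_id: str                                   # made-up identifier
    system: str                                     # e.g. "HBase"
    aspects: list = field(default_factory=list)
    hardware: list = field(default_factory=list)
    software: list = field(default_factory=list)
    implications: list = field(default_factory=list)

    def annotation_count(self):
        """Total number of tags attached to this issue."""
        return (len(self.aspects) + len(self.hardware)
                + len(self.software) + len(self.implications))


def tally(issues, attribute):
    """Count how often each tag appears within one classification."""
    counts = Counter()
    for issue in issues:
        counts.update(getattr(issue, attribute))
    return counts


# Two invented issues, tagged with labels taken from the slide.
issues = [
    AnnotatedIssue("EXAMPLE-1", "HBase", aspects=["reliability"],
                   software=["race"], implications=["data loss"]),
    AnnotatedIssue("EXAMPLE-2", "HDFS",
                   aspects=["availability", "scalability"],
                   hardware=["node fail-stop"],
                   software=["error handling"],
                   implications=["component downtime"]),
]
total_annotations = sum(issue.annotation_count() for issue in issues)
print(tally(issues, "aspects"))
print(total_annotations)
```

Keeping each classification as a list lets one issue carry several tags per category, which is consistent with the slide's average of about 7 annotations per issue.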

Cloud Bug Study (CBS) database

• Open to public

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

Methodology

• 6 systems, 3-year span, 2011 to 2014
• 20~30 bugs a day! Protein yeah!
• 17% “vital” issues affecting real deployments
• 3655 vital issues
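The figures on this slide can be cross-checked with a little arithmetic. The sketch below assumes the "~21000 issues reviewed" figure from the study summary as the denominator and treats the 3-year window as 3 × 365 days; both are approximations, and reading "20~30 bugs a day" as the rate of newly reported issues across the six systems is also an assumption.

```python
reviewed = 21000      # "~21000 issues reviewed" (approximate)
vital = 3655          # "3655 vital issues"
days = 3 * 365        # Jan 2011 to Jan 2014, ignoring the leap day

vital_fraction = vital / reviewed
issues_per_day = reviewed / days

print(f"{vital_fraction:.1%}")   # prints 17.4%
print(f"{issues_per_day:.1f}")   # prints 19.2
```

Under these assumptions the vital fraction matches the 17% on the slide, and the daily rate lands near the low end of the quoted 20~30 range.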

Example issue

[Figure: an issue report annotated with its title, type and priority, description, time to resolve, and discussion]

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

Classifications for each vital issue

• Aspects
• Hardware types and failure modes
• Software bug types
• Implications
• Bug scopes

Overview of results

• Aspects
• Hardware faults vs. software faults
• Implications

Aspects

• CS = Cassandra
• FL = Flume
• HB = HBase
• HD = HDFS
• MR = MapReduce
• ZK = ZooKeeper

Aspects: Reliability

• Reliability (45%)
  – Operation and job failures/errors, data loss/corruption/staleness

Aspects: Performance

• Reliability
• Performance (22%)

Aspects: Availability

• Reliability
• Performance
• Availability (16%)
  – Node and cluster downtime

Aspects: Security

• Reliability
• Performance
• Availability
• Security (6%)

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. software faults
• Implications

Aspects: Data consistency

• Data consistency (5%)
  – Permanently inconsistent replicas
  – Various root causes:
    • Buggy operational protocol
    • Concurrency bugs and node failures

Cassandra cross-DC synchronization

[Figure: replica states (A, B, C and A’, B’, C’) before and after cross-DC synchronization, ending in permanent inconsistency]

Background operational protocols are often buggy!

Aspects: Scalability

• Data consistency
• Scalability (2%)
  – Small number does not mean not important!
  – Only found at scale
    • Large cluster size
    • Large data
    • Large load
    • Large failures

Large cluster

• In Cassandra

O(n³) calculation

Ring position changed.

100x

CPU explosion
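To see why a cubic-time calculation turns into a CPU explosion as a cluster grows, it helps to look at how the step count scales. The sketch below is pure arithmetic under the assumption that work grows as n³ in the number of nodes n; the cluster sizes are illustrative and are not taken from the Cassandra issue itself.

```python
def cubic_steps(nodes):
    """Step count for a calculation that is cubic in the number of nodes."""
    return nodes ** 3


small, large = 10, 1000          # illustrative cluster sizes (assumed)
growth = cubic_steps(large) // cubic_steps(small)
print(growth)   # prints 1000000: 100x more nodes, a million times more steps
```

A calculation that is unnoticeable on a test cluster can therefore dominate CPU time in a large deployment, which is why such bugs only surface at scale.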

Large data

In HBase

Tens of minutes

R1, R2, R3, …, R100K

Insufficient lookup operation

Large load

In HDFS 1000x small files in parallel

… Not expecting small files!

Large failure

Time cost: 7+ hours

AM managing 16,000 tasks fails

[Figure: task labels 1, 2, 3, 1K, 2K, 3K, 4K, 5K, 16K]

Un-optimized connection

From the above examples…

• Protocol algorithms must anticipate
  – Large cluster sizes
  – Large data
  – Large request load of various kinds
  – Large-scale failures

• The need for scalability bug detection tools
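One simple ingredient such a tool might use is an empirical growth check: run a protocol step at several small scales, record its cost, and estimate the exponent from the slope on a log-log plot. The sketch below is a minimal illustration under stated assumptions; `count_pairwise_steps` is a made-up stand-in for a real protocol step, and counting loop iterations replaces real timing so that the result is deterministic.

```python
import math


def growth_exponent(sizes, costs):
    """Least-squares slope of log(cost) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def count_pairwise_steps(n):
    """Hypothetical protocol step that compares every pair of nodes."""
    steps = 0
    for _ in range(n):
        for _ in range(n):
            steps += 1
    return steps


sizes = [8, 16, 32, 64]                      # small test-cluster sizes
costs = [count_pairwise_steps(n) for n in sizes]
exponent = growth_exponent(sizes, costs)
print(round(exponent, 2))   # prints 2.0
```

An exponent well above 1 measured on small clusters is a warning sign that the same step may become a bottleneck at production scale.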

Aspects: Topology

• Data consistency
• Scalability
• Topology (1%)
  – Systems have problems when deployed on some network topologies
    • Cross DC
    • Different racks
    • New layering architecture
  – Typically unseen in pre-deployment

Aspects: QoS

• Data consistency
• Scalability
• Topology
• QoS (1%)
  – Fundamental for multi-tenant systems
  – Two main points
    • Horizontal/intra-system QoS
    • Vertical/cross-system QoS

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. software faults
• Implications

HW faults vs. SW faults

“Hardware can fail, and reliability should come from software.”

HW faults and modes

• 299 issues involve improper handling of node fail-stop failures

• A memory card running at 25% of normal speed causes problems in an HBase deployment.

Hardware faults vs. Software faults

• Hardware failures, components and modes
• Software bug types

Software bug types: Logic

• Logic (29%)
  – Many domain-specific issues

Software bug types: Error handling

• Logic
• Error handling (18%)
  – Aspirator, Yuan et al. [OSDI ’14]
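A common class of error-handling mistake is a handler that silently swallows the failure. The sketch below shows how a simple static check could flag such handlers. It is a Python analogue written for illustration, using the standard `ast` module; it is not the Aspirator tool, and the sample functions `replicate_block`, `flush_log`, and `report` are hypothetical names that are parsed but never executed.

```python
import ast


def find_empty_handlers(source):
    """Return line numbers of except blocks whose body is only `pass`."""
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                flagged.append(node.lineno)
    return sorted(flagged)


# Sample code to analyze; the first handler ignores the error entirely.
sample = """
try:
    replicate_block()
except IOError:
    pass
try:
    flush_log()
except IOError as err:
    report(err)
"""
print(find_empty_handlers(sample))   # prints [4]
```

A check like this only finds the most trivial mistakes; handlers that log and continue when they should abort need deeper analysis.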

Software bug types: Optimization

• Logic
• Error handling
• Optimization (15%)

Software bug types: Configuration

• Logic
• Error handling
• Optimization
• Configuration (14%)

– Automating Configuration Troubleshooting. [OSDI ’10]

– Precomputing Possible Configuration Error Diagnoses. [ASE ’11]

– Do Not Blame Users for Misconfigurations. [SOSP ’13]

Software bug types: Race

• Race (12%)
  – < 50% local concurrency bugs
    • Buggy thread interleaving
    • Tons of work
  – > 50% distributed concurrency bugs
    • Reordering of messages, crashes, timeouts
    • More work is needed
  – SAMC [OSDI ’14]
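To make the message-reordering case concrete, the sketch below enumerates every delivery order of a few in-flight messages against a toy replica and counts the orders that end in a state the sender did not intend. It is a brute-force illustration under stated assumptions: the replica logic and message set are invented, and the exhaustive enumeration is not the SAMC algorithm.

```python
from itertools import permutations


def apply_messages(order):
    """Apply messages to a toy replica that naively keeps the last update."""
    value = None
    for kind, payload in order:
        if kind == "write":
            value = payload
        elif kind == "delete":
            value = None
    return value


# Three in-flight messages, listed in the order the sender intended.
messages = [("write", "v1"), ("delete", None), ("write", "v2")]
expected = apply_messages(messages)

# Every delivery order whose final state differs from the intended one.
bad_orders = [order for order in permutations(messages)
              if apply_messages(order) != expected]
print(expected, len(bad_orders))   # prints v2 4
```

Even with three messages, four of the six delivery orders diverge, and the number of orders grows factorially with the message count, which is why exhaustive exploration needs pruning in practice.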

Software bug types: Hang

• Hang (4%)
  – Classical deadlock
  – Un-served jobs, stalled operations, …
    • Root causes?
    • How to detect them?
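For the classical deadlock case, detection can be sketched as cycle finding in a wait-for graph, where an edge from T1 to T2 means T1 is waiting for something T2 holds. The example below is a minimal sketch with invented thread names; it does not address un-served jobs or stalled operations, whose detection the slide leaves as an open question.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {node: WHITE for node in wait_for}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in wait_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:         # back edge: nxt is already on the path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical threads: T1 waits on T2, T2 on T3, T3 on T1; T4 waits on T1.
deadlocked = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T1"]}
healthy = {"A": ["B"], "B": []}
print(find_cycle(deadlocked))   # prints ['T1', 'T2', 'T3', 'T1']
print(find_cycle(healthy))      # prints None
```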

Software bug types: Space

• Space (4%)
  – Big data + leak = Big leak
  – Clean-up operations must be flawless.

Software bug types: Load

• Load (4%)
  – Happen when systems face high request load
  – Relates to QoS and admission control
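Admission control can be sketched as a bounded queue that rejects new requests once too many are pending, rather than letting the backlog grow without limit. The class below is a minimal illustration; the `AdmissionController` name, its limit, and the request labels are assumptions for this example and do not describe any of the studied systems.

```python
from collections import deque


class AdmissionController:
    """Reject new requests once the pending queue reaches a fixed limit."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.rejected = 0

    def submit(self, request):
        """Queue the request if there is room; otherwise shed it."""
        if len(self.pending) >= self.max_pending:
            self.rejected += 1
            return False
        self.pending.append(request)
        return True

    def serve_one(self):
        """Remove and return the oldest pending request, if any."""
        return self.pending.popleft() if self.pending else None


controller = AdmissionController(max_pending=3)
accepted = [controller.submit(f"req-{i}") for i in range(5)]
print(accepted, controller.rejected)
# prints [True, True, True, False, False] 2
```

Rejecting early keeps latency bounded for the requests that are admitted, at the cost of pushing retries back to the client.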

Overview of results

• Aspects (classical)
• Aspects (unique)
  – Data consistency, scalability, topology, QoS
• Hardware faults vs. software faults
• Implications

Implications

• Failed operation (42%)
• Performance (23%)
• Downtimes (18%)
• Data loss (7%)
• Data corruption (5%)
• Data staleness (5%)

Root causes

Every implication can be caused by all kinds of hardware and software faults!

“Killer” bugs

• Bugs that simultaneously affect multiple nodes or even the entire cluster

• Single Point of Failure still exists in many forms
  – Positive feedback loop
  – Buggy failover
  – Repeated bugs after failover
  – …

Outline

• Introduction
• Methodology
• Overview of results
• Other CBS database use cases
• Conclusion

CBS database

• 50+ per-system and aggregate graphs from mining the CBS database over the last year

• Still more waiting to be studied…

Components with most issues

How should we enhance reliability for interactions among multiple cloud systems?

Cross-system issues are prevalent!

Most challenging types of issues

Top k% of most complicated issues

System evolution

Hadoop 2.0

Conclusion

• One of the largest bug studies for cloud systems

• Many interesting findings, but more questions can be raised from our analysis
  – What types of performance issues exist?
  – Root causes for hang issues?
  – …

• Cloud Bug Study (CBS) database