Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, George Candea

Page 7: Failure Sketching: A Technique for Automated Root Cause ...

Debugging In-Production Software Failures Today

2

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368
#2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930
#3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950
#4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563
#5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312
#6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293
#7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85
#8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211
#9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632
#10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946
#11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127
#12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312
#13 0x00007f51ab80d47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Understand root cause

Reproduce the failure

Page 8: Failure Sketching: A Technique for Automated Root Cause ...

Related Work

3

• Collaborative approaches
  • WER [SOSP’09], CBI [PLDI’05], CCI [OOPSLA’10]

• Identifying differences between failing and successful runs
  • Delta debugging [TSE’02], Symbiosis [PLDI’15]

• Record & replay, checkpointing
  • ODR [SOSP’09], Triage [SOSP’07]

• Hardware support
  • PBI [ASPLOS’13], LBRA/LCRA [ASPLOS’14]

Page 12: Failure Sketching: A Technique for Automated Root Cause ...

Contributions

4

Goal: automate the manual detective work of debugging

Failure sketching

• Complements in-house static analysis with in-production dynamic analysis

• Automatically and efficiently builds accurate failure sketches that show root causes of failures

Page 18: Failure Sketching: A Technique for Automated Root Cause ...

Failure Sketch

5

main() {
  queue* f = init(size);
  create_thread(cons, f);
  ...
  free(f->mut);
  ...
}

cons(queue* f) {
  ...
  mutex_unlock(f->mut);
}

[Figure: failure sketch with a time axis (steps 1 to 8) and one column each for Thread 1 and Thread 2; annotations: Root cause, Segfault]

Page 22: Failure Sketching: A Technique for Automated Root Cause ...

Failure Sketch Usage Model

6

Understand root cause

main() {
  queue* f = init(size);
  create_thread(cons, f);
  ...
  free(f->mut);
  ...
}

cons(queue* f) {
  ...
  mutex_unlock(f->mut);
}

[Figure: failure sketch with a time axis (steps 1 to 8) and one column each for Thread 1 and Thread 2]

Failure: segmentation fault

Runtime traces

Reproduce the failure

Page 25: Failure Sketching: A Technique for Automated Root Cause ...

Outline

• Challenges

• Design

• Evaluation

7

Page 27: Failure Sketching: A Technique for Automated Root Cause ...

Challenges of Building Failure Sketches

• Accuracy
  • Exclude all irrelevant information, preserve all relevant information

• Recurrence
  • Gather enough execution information from rare failures

• Latency
  • Achieve high accuracy after just a few recurrences

8

Page 29: Failure Sketching: A Technique for Automated Root Cause ...

Outline

9

• Challenges

• Design
  • Static Analysis
  • Control and Data Flow Tracking
  • Statistical Analysis
  • Failure sketch

• Evaluation

Page 37: Failure Sketching: A Technique for Automated Root Cause ...

Gist System Architecture

10

[Figure: Gist architecture, with a Client side and a Server side]

Inputs: Program P (source); Failure report (core dump, stack trace, etc.)

1. Static Analyzer
2. Instrumentation: tracking control and data flow
3. Refinement with runtime traces
4. Failure Sketch Computation Engine

Static slice

• queue* f = init(size);
• init_vars();
• free(f->mut);
• print("Done");
• f->mut = NULL;

Refined static slice

Failure Sketch

main() {
  queue* f = init(size);
  create_thread(cons, f);
  ...
  free(f->mut);
  f->mut = NULL;
  ...
}

cons(queue* f) {
  ...
  mutex_unlock(f->mut);
}

[Figure: failure sketch with a time axis (steps 1 to 8) and one column each for Thread 1 and Thread 2]

Failure: segmentation fault

Page 41: Failure Sketching: A Technique for Automated Root Cause ...

Static Analysis to Reduce the Overhead

12

• Computes backward slices
  • Includes statements that the failure depends on
  • Excludes all other statements

• Inter-procedural
  • Identifies dependencies across functions

Static analysis reduces the overhead of subsequent runtime tracking (20x)
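A backward slice can be pictured as reverse reachability over a dependence graph. The sketch below is a minimal illustration under stated assumptions, not Gist's analyzer: the statement numbering, the `DepGraph` type, and the hand-written edges in `example_graph` are all hypothetical, and a real inter-procedural analysis would derive the edges from the program.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical dependence graph: deps[s] lists the statements that
// statement s depends on (data or control dependencies).
using DepGraph = std::map<int, std::vector<int>>;

// Returns every statement the failing statement transitively depends on,
// including the failing statement itself.
std::set<int> backward_slice(const DepGraph& deps, int failing_stmt) {
    std::set<int> slice;
    std::vector<int> worklist{failing_stmt};
    while (!worklist.empty()) {
        int s = worklist.back();
        worklist.pop_back();
        if (!slice.insert(s).second) continue;  // already visited
        auto it = deps.find(s);
        if (it == deps.end()) continue;         // no recorded dependencies
        for (int d : it->second) worklist.push_back(d);
    }
    assert(slice.count(failing_stmt) == 1);
    return slice;
}

// Illustrative numbering (an assumption, not taken from the slides):
//   1: log("Func:cleanup")     2: if (verbose)
//   3: log("Cleaning up %p")   4: delete s
//   5: log("Func:display_size")
//   6: log("State: %u", s->size)   <- failing statement
DepGraph example_graph() {
    return {
        {3, {2}},  // the verbose log is control-dependent on the branch
        {6, {4}},  // the failing read of s->size depends on the delete
    };
}
```

Calling `backward_slice(example_graph(), 6)` keeps only statements 4 and 6, so the logging statements would not need to be tracked at runtime.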

Page 49: Failure Sketching: A Technique for Automated Root Cause ...

Example: Static Backward Slicing

15

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

Segfault

Page 53: Failure Sketching: A Technique for Automated Root Cause ...

Low-Overhead Control Flow Tracking

17

• Software-based tracking is expensive (up to 15x overhead)

• Hardware-based tracking is more efficient
  • Intel PT: new feature in Intel CPUs (~40% overhead)

• Gist combines static analysis and hardware-based control flow tracking
  • Low overhead (~2%)

Static analysis → Low-overhead control flow tracking

Page 54: Failure Sketching: A Technique for Automated Root Cause ...

Example: Control Flow Tracking (Step 1)

18

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

Segfault

Page 56: Failure Sketching: A Technique for Automated Root Cause ...

Example: Control Flow Tracking (Step 2)

19

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

Segfault

Static analysis + control flow tracking shorten the sketch
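One way to picture how control flow tracking shortens a slice is as an intersection: keep only the statically sliced statements that the trace shows were executed. The sketch below assumes hypothetical integer statement IDs and an already-decoded trace; real Intel PT output is a packet stream that must be decoded first, which is not modeled here.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Keeps only the statements of the static slice that appear in the
// executed trace. Statement IDs are hypothetical labels.
std::set<int> refine_slice(const std::set<int>& static_slice,
                           const std::vector<int>& executed_trace) {
    std::set<int> executed(executed_trace.begin(), executed_trace.end());
    std::set<int> refined;
    for (int s : static_slice) {
        if (executed.count(s)) refined.insert(s);
    }
    assert(refined.size() <= static_slice.size());
    return refined;
}
```

For instance, if a static slice contains statements 2, 3, 4 and 6 but the trace shows that statement 3 (say, a branch body that was not taken) never ran, the refined slice is 2, 4 and 6.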

Page 61: Failure Sketching: A Technique for Automated Root Cause ...

Data Flow Tracking to Increase Accuracy

20

• Data flow information
  • Variable values & total order of memory accesses

• Hardware watchpoints
  • Allow tracking reads and writes with low overhead
  • Allow tracking the total order of accesses
  • Monitoring is spread across multiple clients when watchpoints run out

Precise ordering information → High accuracy
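Spreading monitoring across clients can be sketched as chunking the addresses of interest into groups that fit one machine's watchpoints. The limit of four below matches the four x86 debug address registers (DR0 to DR3); the function name and the in-order chunking policy are assumptions for illustration, not necessarily how Gist assigns watchpoints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWatchpointsPerClient = 4;  // x86 DR0-DR3

// Splits the addresses to monitor into per-client groups of at most
// kWatchpointsPerClient entries each, in input order.
std::vector<std::vector<std::uintptr_t>> assign_watchpoints(
        const std::vector<std::uintptr_t>& addresses) {
    std::vector<std::vector<std::uintptr_t>> per_client;
    for (std::size_t i = 0; i < addresses.size(); ++i) {
        if (i % kWatchpointsPerClient == 0) per_client.emplace_back();
        per_client.back().push_back(addresses[i]);
        assert(per_client.back().size() <= kWatchpointsPerClient);
    }
    return per_client;
}
```

With six addresses to watch, this plan uses two clients: the first watches four addresses and the second watches the remaining two.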

Page 63: Failure Sketching: A Technique for Automated Root Cause ...

Example: Data Flow Tracking

21

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

Segfault

Watch &s

Page 66: Failure Sketching: A Technique for Automated Root Cause ...

Example: Data Flow Tracking

22

Thread 1, Thread 2

Failing run (cleanup, then display_size):

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

Failure

Successful run (display_size, then cleanup):

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

Success
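Once accesses are available in a total order, telling the failing interleaving from the successful one reduces to a scan of the log. The sketch below is an illustration under assumptions: the `Access` record, the thread IDs, and the address values are hypothetical, and modeling deallocation as its own `Free` event is a simplification, since hardware watchpoints themselves report reads and writes.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

enum class AccessKind { Read, Write, Free };

struct Access {
    int thread;           // hypothetical thread ID
    AccessKind kind;
    std::uintptr_t addr;  // watched address
};

// Scans a log that is already in total order and reports whether any
// address is read or written after it was freed.
bool has_use_after_free(const std::vector<Access>& ordered_log) {
    std::set<std::uintptr_t> freed;
    for (const Access& a : ordered_log) {
        if (a.kind == AccessKind::Free) {
            freed.insert(a.addr);
        } else if (freed.count(a.addr)) {
            return true;  // read or write of freed memory
        }
    }
    return false;
}
```

A log in which thread 1 frees an address before thread 2 reads it is flagged, while the opposite order is not, mirroring the failing and successful runs of the example.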

Page 72: Failure Sketching: A Technique for Automated Root Cause ...

Statistical Analysis

24

• Identification of failure predictors¹

• A good predictor portends a failure with high probability (e.g., data races, atomicity violations)

• Example: data races

[Figure: three conflicting access pairs on x: write x / write x, write x / read x, read x / write x]

Failure predictors across multiple executions

¹ Liblit, B. et al. Scalable statistical bug isolation. PLDI 2005
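Ranking predictors across executions can be sketched with precision and recall: how often a predictor's presence coincides with failure, and how many failures it covers. The F1 score used below is an assumption chosen for illustration and may differ from the exact metric the authors use; the struct, function names, and predictor labels are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct PredictorStats {
    std::string name;
    int failing_with;     // failing runs in which the predictor was observed
    int successful_with;  // successful runs in which it was observed
};

// Harmonic mean of precision and recall for one predictor.
double f1_score(const PredictorStats& p, int total_failing) {
    assert(p.failing_with <= total_failing);
    int observed = p.failing_with + p.successful_with;
    if (observed == 0 || total_failing == 0) return 0.0;
    double precision = static_cast<double>(p.failing_with) / observed;
    double recall = static_cast<double>(p.failing_with) / total_failing;
    if (precision + recall == 0.0) return 0.0;
    return 2.0 * precision * recall / (precision + recall);
}

// Returns the predictors sorted from best to worst F1 score.
std::vector<PredictorStats> rank_predictors(std::vector<PredictorStats> ps,
                                            int total_failing) {
    std::stable_sort(ps.begin(), ps.end(),
        [total_failing](const PredictorStats& a, const PredictorStats& b) {
            return f1_score(a, total_failing) > f1_score(b, total_failing);
        });
    return ps;
}
```

A predictor seen in both of two failing runs and in no successful run scores 1.0, while one seen in one failing run and three successful runs scores about 0.33 and ranks lower.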

Page 73: Failure Sketching: A Technique for Automated Root Cause ...

Example: Statistical Analysis

25

Failing run (cleanup, then display_size):

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

Failure

Successful run (display_size, then cleanup):

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose)
    log("Cleaning up %p", s);
  delete s;
}

Success

Page 77: Failure Sketching: A Technique for Automated Root Cause ...

Example: Statistical Analysis

void cleanup(State* s) {
  log("Func:cleanup");
  if (verbose) log("Cleaning up %p", s);
  delete s;
}

void display_size(State* s) {
  log("Func:display_size");
  log("State: %u", s->size);
}

[The slide repeats these two functions once per monitored execution and labels each copy with its outcome: 3 Failure, 7 Success. Two accesses are marked: write and read.]

Static analysis + cooperative dynamic analysis
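The outcome labels above suggest how a statistical step could rank candidate failure predictors. The following is a minimal sketch, not Gist's implementation. It assumes that the marked write is the `delete s` in `cleanup` and the marked read is `s->size` in `display_size`, and that each monitored run is reduced to one record. The names `Run`, `Score`, `score_predictor`, and `sample_runs` are hypothetical, and scoring by precision and recall is an illustrative choice rather than something stated on the slides.

```cpp
#include <cassert>
#include <vector>

// One monitored execution (hypothetical record layout).
struct Run {
    bool write_before_read;  // did `delete s` happen before the read of s->size?
    bool failed;             // outcome label of the run
};

struct Score {
    double precision;  // fraction of runs showing the pattern that failed
    double recall;     // fraction of failing runs that showed the pattern
};

// Scores the predicate "write happened before read" as a failure predictor.
Score score_predictor(const std::vector<Run>& runs) {
    assert(!runs.empty());
    int true_pos = 0, false_pos = 0, false_neg = 0;
    for (const Run& r : runs) {
        if (r.write_before_read && r.failed) ++true_pos;
        else if (r.write_before_read) ++false_pos;
        else if (r.failed) ++false_neg;
    }
    Score s{0.0, 0.0};
    if (true_pos + false_pos > 0)
        s.precision = static_cast<double>(true_pos) / (true_pos + false_pos);
    if (true_pos + false_neg > 0)
        s.recall = static_cast<double>(true_pos) / (true_pos + false_neg);
    return s;
}

// Ten runs with 3 failures and 7 successes, matching the label counts on
// the slide. Which runs saw the write first is invented for illustration.
std::vector<Run> sample_runs() {
    return {
        {true, true},   {true, true},   {true, true},    // failing runs
        {true, false},                                   // pattern seen, no failure
        {false, false}, {false, false}, {false, false},
        {false, false}, {false, false}, {false, false},
    };
}
```

For this invented data, `score_predictor(sample_runs())` gives a precision of 0.75 and a recall of 1.0: every failing run shows the write-before-read order, and one successful run shows it as well. A predictor with high values on both measures is a reasonable candidate to highlight in a sketch.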

Page 78: Failure Sketching: A Technique for Automated Root Cause ...

Outline

• Challenges

• Design
  • Static Analysis
  • Control and Data Flow Tracking
  • Statistical Analysis
  • Failure sketch

• Evaluation

Page 79: Failure Sketching: A Technique for Automated Root Cause ...

Outline

• Challenges

• Design

• Evaluation
  • Does Gist help developers do root cause diagnosis?
  • Is Gist efficient?
  • Is Gist accurate?

Page 80: Failure Sketching: A Technique for Automated Root Cause ...

Experimental Setup

• Client-side executions are analyzed in the lab

• Real-world server and desktop programs (for example, memcached)

Page 82: Failure Sketching: A Technique for Automated Root Cause ...

Do Failure Sketches Help Developers?

• We manually analyzed the usefulness of Gist for 11 failures

• Gist-identified failure predictors point to root causes
  • Developers eliminated those root causes to fix the bugs

• Average number of statements to look at: 7

Gist points developers to the root causes of failures

Page 84: Failure Sketching: A Technique for Automated Root Cause ...

Efficiency (Control & data flow tracking)

[Plot: performance overhead [%] (y-axis, 0 to 5) versus # of statements tracked (x-axis, 0 to 25).]

Gist has low average overhead (always below 5%)

Page 91: Failure Sketching: A Technique for Automated Root Cause ...

Accuracy

[Bar chart: accuracy [%] (y-axis, 0.0 to 100.0) for each failure: Apache-1, Apache-2, Apache-3, Apache-4, Cppcheck-1, Cppcheck-2, Curl, Transmission, SQLite, memcached, Pbzip2. Series: Static slicing, Control flow tracking, Data flow tracking.]

Average accuracy is 96%

Each technique is needed for accuracy

Page 93: Failure Sketching: A Technique for Automated Root Cause ...

Conclusion

• Failure sketching
  • Combination of static and dynamic program analysis

• Failure sketches are summaries explaining failure root causes

• Accurate, efficient, improves developer productivity

[Failure sketch: columns for Thread 1 and Thread 2 along a Time axis with steps 1 to 8.]

Thread 1:
main() {
  queue* f = init(size);
  create_thread(cons, f);
  ...
  free(f->mut);
  f->mut = NULL;
  ...
}

Thread 2:
cons(queue* f) {
  ...
  mutex_unlock(f->mut);
}

Failure: segmentation fault

Baris, Ben, Cristiano, Gilles, George

http://dslab.epfl.ch/proj/gist