THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY
Effective Methods for Debugging Concurrent Software
by
Shaoming HUANG
A Thesis Submitted to The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in the Department of Computer Science and Engineering
April 2013, Hong Kong
Authorization
I hereby declare that I am the sole author of the thesis.
I authorize the Hong Kong University of Science and Technology to lend this thesis to
other institutions or individuals for the purpose of scholarly research.
I further authorize the Hong Kong University of Science and Technology to reproduce
the thesis by photocopying or by other means, in total or in part, at the request of other
institutions or individuals for the purpose of scholarly research.
Shaoming HUANG
April 2013
EFFECTIVE METHODS FOR DEBUGGING CONCURRENT SOFTWARE
by
SHAOMING HUANG
This is to certify that I have examined the above PhD thesis
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by the thesis examination committee have been made.
Dr. Charles ZHANG (Thesis Supervisor)
Prof. Mounir HAMDI (Department Head)
Department of Computer Science and Engineering
April 2013
To My Beloved Parents and My Dearest Wife Kami
Acknowledgements
First and foremost, my thanks go to my advisor, Charles Zhang, who has spent a tremendous
amount of time and energy forming me into both a confident researcher and a nice person.
His exemplary guidance, his far reaching vision, and his unwavering optimism and patience
have been a constant source of encouragement that helped me explore and develop ideas and
overcome incredible challenges throughout my PhD. From him, I received the largest possible
freedom and unconditional support one can imagine during a four-and-a-half-year graduate
education. I can never repay Charles for what he has given to me. The best way for me to express
my gratitude towards him is to try to become what he has been to me: teacher, mentor, guide,
collaborator, and friend.
I am also very grateful to every member in my defense, proposal, and qualifying examination
committee: S.C. Cheung, Sung Kim, Lin Gu, Jiang Xu, Tom Ball, and Xueqing Zhang. S.C. has
always been so nice and willing to help all the way throughout my graduate study. Leading our big
software engineering group, his extraordinary enthusiasm and knowledge and his creation of a
passionate group culture have kept me constantly optimistic and inspired. I am deeply indebted
to Sung for his priceless guidance and innumerable suggestions in the various stages
of my research. I will also never forget his kindness and encouragement on all the other aspects
during my study at HKUST. Lin has been very supportive to me ever since our first meeting in a
group discussion and has provided invaluable advice on debugging concurrent and distributed
systems. His course on cloud computing systems is of particular interest to me and from which
I learned quite a lot on system programming and system research. I would also like to thank
Tom, Jiang, and Xueqing for serving on my thesis committee. I am grateful to Tom in particular
for his thorough review and constructive comments on my thesis.
I need to thank all the other members in the Prism group: Liu Peng, Xiao Xiao, Jinguo Zhou,
Xiang Gao, Wei Li, Yiqing Zhu, Jin Huang, and Meng Wang. I feel really lucky to have grown in such
a smart and energetic group, and I am grateful to have them as friends and colleagues. Most of
my research projects would not have been possible without the critical discussions with them.
Thanks also go to all the other members in our software engineering group, especially Zhifeng
Lai, Chang Xu, Xinming Wang, Yueqi Li, Qiaona Hong, Ning Chen, Yepang Liu, Yida Tao,
Wenmao Gong, Dongxiang Cai, Jaechang Nam, Rae Noh, Donggyun Han and Hyunmin Seo.
They have made the whole group feel like a warm family to me. I especially want to thank Zhifeng,
who always gives me helpful suggestions and keeps me positive. Discussions with him have also
greatly improved my understanding of concurrent program analysis and testing.
A big, big thank you goes to Can Yang, who is like a brother to me. I will always cherish his
words, “open mind” and “follow your heart”, and will never forget the happy time we
spent together. Equally, I would like to thank Chao Yang, Xiaowei Zhou, Tiangzhu Liang, Suijie
Wang, Tao Lu, Xiang Wan, Lingsing Yung, Wei Chen, Guangyuan Yang, and Wei Jiang, for
sharing with me unforgettable time in the past few years. My gratitude extends to my friends
at HKUST: Zhewei Wei, Lixing Wang, Tengfei Liu, Ang Li, Xiaoheng Xie, Yu Peng, Yincheng
Lin, Xiaofei Zhang, Xiangming Fang, Zhiqiang Ma, Dong Lin, Yu Zhang, Ning Ding, Haodi
Zhang, Li Li, and Shanchao Zhang.
Finally, I would like to thank my beloved parents, my brother Qiming, and my dearest wife
Kami. I thank God for bringing them to me. This work wouldn’t have been possible without
their amazing support, tolerance, understanding, and most importantly, love.
Contents

Authorization Page
Signature Page
Acknowledgements
Contents
List of Figures
List of Tables
Abstract
Abbreviations
1 Introduction
  1.1 Motivation
      Concurrency bugs are difficult to reproduce
      Concurrency bugs are difficult to detect
      Concurrency bugs are difficult to understand
      Concurrency bugs are difficult to fix
  1.2 Contributions
      1.2.1 Multiprocessor Deterministic Replay
      1.2.2 Predictive Trace Analysis
      1.2.3 Dynamic and Static Trace Simplification
      1.2.4 Data Sharing Reduction
  1.3 Outline

2 Background and Previous Work
  2.1 Concurrent Program Execution Modeling
  2.2 Basic Definitions
  2.3 Thread Interleaving Patterns for Concurrency Bugs
      Data race
      Atomicity Violations
      Atomic-set serializability violations
  2.4 Tackling Concurrency Problems
      2.4.1 Concurrency Bug Reproduction
            2.4.1.1 Deterministic Replay
            2.4.1.2 Offline Search and Deterministic Multithreading
      2.4.2 Concurrency Bug Detection
            2.4.2.1 Static and Dynamic Program Analyses
                    Active Testing
            2.4.2.2 Trace-based concurrent program analysis
      2.4.3 Surviving Concurrency Bugs

3 Multiprocessor Deterministic Replay
  3.1 Introduction
  3.2 LEAP: Local-Order Based Deterministic Replay
      3.2.1 LEAP Overview
      3.2.2 Locating Shared Variable Accesses
      3.2.3 Field-based Shared Variable Identification
      3.2.4 Unique Thread Identification
      3.2.5 Handling Early Replay Termination
  3.3 A Theorem of Local Ordering
  3.4 LEAP Implementation
      3.4.1 The LEAP Transformer
      3.4.2 The LEAP Recorder
      3.4.3 The LEAP Replayer
  3.5 Evaluation
      3.5.1 Evaluation methodology
            3.5.1.1 Micro-benchmarking
            3.5.1.2 Benchmarking with third-party systems
            3.5.1.3 Concurrency bug reproduction
            3.5.1.4 Random bug injection
            3.5.1.5 Real and benchmark concurrency bugs
      3.5.2 Discussion
  3.6 Summary

4 Persuasive Prediction of Concurrency Access Anomalies
  4.1 Introduction
  4.2 PECAN in a Nutshell
  4.3 Pattern Specification of Access Anomalies
  4.4 Graph Prediction Model
      4.4.1 Constraint Model
      4.4.2 The AA Prediction Problem
  4.5 Graph Pattern Search
      4.5.1 Compact Encoding of PTG
      4.5.2 Pattern-Directed Search
  4.6 Schedule Generation
      4.6.1 How to Generate a Feasible Schedule?
      4.6.2 What Can Our Algorithm Guarantee?
      4.6.3 Pruning False Warnings
  4.7 Evaluation
      4.7.1 Experimental Results
      4.7.2 Detected Real Bugs
      4.7.3 PECAN Limitations
  4.8 Summary

5 Scaling Predictive Trace Analysis by Removing Redundant Events
  5.1 Introduction
  5.2 General PTA algorithm
      Example
  5.3 Removing Trace Redundancy
      5.3.1 Modeling trace redundancy
            5.3.1.1 A theory of trace redundancy
            5.3.1.2 Concurrency context
            5.3.1.3 Two dimensions of redundancy
      5.3.2 Filtering redundant events
  5.4 Implementation
  5.5 Evaluation
      Benchmarks
      5.5.1 RQ1: Effectiveness
      5.5.2 RQ2: Efficiency
      5.5.3 RQ3: Correctness
  5.6 Summary

6 Dynamically Simplifying Concurrency Bug Reproduction
  6.1 Introduction
      Key Observation
      Contributions
  6.2 A Model of Trace Redundancy
  6.3 Automatic Redundance Removing
      6.3.1 Removing Whole-Thread Redundancy
      6.3.2 Removing Partial-Thread Redundancy
            6.3.2.1 Multithreaded dynamic slicing
            6.3.2.2 Repetition analysis
  6.4 Implementation
  6.5 A Case Study
      6.5.1 Description of Derby Bug #2861
      6.5.2 How LEAN Simplifies the Bug Reproduction
  6.6 Experiments
      Benchmarks
      6.6.1 RQ1: Effectiveness
      6.6.2 RQ2: Efficiency
      Summary
  6.7 Summary

7 Static Trace Simplification
  7.1 Introduction
  7.2 SimTrace: Efficient Static Trace Simplification
      7.2.1 General Trace Simplification Problem
      7.2.2 A Theorem of Trace Equivalence
      7.2.3 SimTrace Algorithm
            Dependence Graph Construction
            Simplifying Dependence Graph
  7.3 Implementation and Experiments
  7.4 Summary

8 Execution Privatization for Scheduler-Oblivious Concurrent Programs
  8.1 Introduction
  8.2 A Theorem of Privatizability for Scheduler-Oblivious Programs
  8.3 Overview
      8.3.1 Motivating Examples
      8.3.2 Privatization Challenges
  8.4 Execution Privatization
      8.4.1 Preliminaries
      8.4.2 Dynamic Trace Analysis
      8.4.3 Path and Context Sensitive Privatization
            8.4.3.1 Privatization Rules
            8.4.3.2 Path and Context Sensitive P-Path Clone
      8.4.4 Privatization Correctness
  8.5 Implementation
  8.6 Experiments
      8.6.1 Concurrency Bug Fixing
      8.6.2 Performance Improvement
      8.6.3 Pervasive Privatization Opportunities
      8.6.4 Program Maintenance
  8.7 Discussions
      8.7.1 Concurrent Program Testing and Debugging
      8.7.2 Privatization Scope
  8.8 Summary

9 Conclusion and Future Work
      Future Work

Bibliography
List of Figures

1.1 The same program exhibits different behaviors with different thread interleavings. The error manifests with the interleaving A (left) but not the interleaving B (right).
1.2 Overview of the work in this thesis for concurrent program debugging
2.1 Atomic-set serializability violation patterns [125]. Wu(l) and Ru(l) represent a write and a read, respectively, to a memory location l of a unit of work u. l1 and l2 belong to the same atomic set.
3.1 The instrumentation of SPE accesses
3.2 The overview of the LEAP infrastructure
3.3 The runtime characteristics of LEAP and other techniques on our microbenchmark with the number of SPEs ranging from 1 to 500. The microbenchmark starts 10 threads running on 8 processors.
3.4 The runtime characteristics of LEAP and other techniques on our microbenchmark with the number of threads ranging from 1 to 80 running on 8 processors. The number of SPEs is set to 1000.
4.1 General access anomaly patterns
4.2 Example of searching atomicity violations
4.3 An example of schedule generation
4.4 An example illustrating the difficulty of satisfying the lock constraint for schedule generation. The race pair (v3, v8) is a false warning, though it satisfies both the POR and the lockset condition.
4.5 A destructive race in OpenJMS
4.6 A predicted real bug in Jigsaw
5.1 Example code for illustrating the trace redundancy
5.2 Statements (10, 7, 10) form a real atomicity violation. However, the simple strategy of “dropping all re-references by the same thread to the same variable if there are no synchronization operations between them” would drop the second read of T2 at line 10, which causes PTA to miss this atomicity violation.
5.3 A trace corresponding to a serial execution of the example program in Figure 5.1.
5.4 Trie representation of local (left) and global (right) redundancy
6.1 A typical test case for stress testing an account function. A significant amount of computation in a buggy execution of this program may be redundant.
6.2 An example of a dynamic thread hierarchy graph (TH-Tree). When T1,3 are selected, all T1,3 and their descendants (gray color) are disabled.
6.3 The delta-debugging algorithm. The function validate returns true if the two conditions in the redundancy criterion are both satisfied. For conciseness, the input trace is omitted in the ddmin algorithm.
6.4 Some iterations of the code block demarcated by @rcb-begin and @rcb-end are specified as potentially redundant.
6.5 An overview of LEAN
6.6 A real concurrency bug #2861 in Derby. The thread interleaving following the solid arrow on the shared data referencedColumnMap crashed the program with NullPointerException.
6.7 A real-world test driver for triggering the concurrency bug in Figure 6.6. The statements inserted by LEAN to simplify the execution are shown in the gray areas.
6.8 Illustration of delta-debugging for removing the whole-thread redundancy. Ti denotes the ith test thread created by the main thread T0. After four rounds of simplification, threads T(2,3) remain and all the other threads are removed.
6.9 Illustration of delta-debugging for removing the redundant repetitions for the remaining threads T(2,3). Iij denotes the jth iteration of thread Ti, where i=2,3 and j=1,2,...,10. After ten rounds of simplification, the 7th iteration of T2 and the 4th iteration of T3 remain and all the other iterations are removed.
7.1 A greedy merge may produce a non-optimal result in (a). Unfortunately, the problem of producing the optimal result in (b) is NP-hard.
8.1 Top: a real bug #2861 in Apache Derby. The program crashes with NullPointerException when a thread references the shared data structure referencedColumnMap at line 11 after another thread sets it to null in the method setReferencedColumnMap. Bottom: the getObjectName method after privatization.
8.2 The benchmark contains 8 threads simultaneously decreasing the shared variable num. The privatized version (right) is 17.9% faster than the original version (left).
8.3 Privatization must be path-sensitive
8.4 An atomicity violation in the append method of the java.lang.StringBuffer class. The program throws StringIndexOutOfBoundsException when a thread at line 11 references the stale length of sb changed by another thread at line 8.
8.5 Privatization must preserve progressiveness
8.6 D-SAP and P-SAP are path-sensitive
8.7 Conceptual view of execution privatization. The privatization is tailored to the P-Path.
8.8 Privatization rules of D-SAP and P-SAP
8.9 The P-SAP and the D-SAP are at the same program location (line 3). Nevertheless, because their calling contexts are different (line 1 and line 2, respectively), they are still privatizable.
8.10 Intra-procedural privatization
8.11 Inter-procedural privatization
8.12 Architecture of Privateer
8.13 Privatization may not repair this bug
8.14 Frequent shared array accesses in RayTracer
8.15 The lock/unlock operations at lines 5/6 cannot be removed, though there is no code to execute between them.
9.1 The program crashes at line 9 following the interleaving 1-10-2-11-3-12-13-4-5-14-9. To reproduce the crash, LEAP [48] requires 12 synchronizations at runtime to record the thread access order information (right) on the shared variables.
9.2 A schedule different from the original one, but able to reproduce the bug. Moreover, this schedule has fewer (4) context switches than the original one (8).
List of Tables

3.1 The runtime overhead of LEAP and the state-of-the-art techniques.
3.2 LEAP - summary of the evaluated real bugs
3.3 LEAP - summary of the evaluated benchmark bugs
4.1 PECAN experimental results
5.1 TraceFilter experimental results - RQ1: Effectiveness
5.2 TraceFilter experimental results - RQ2: Efficiency
5.3 TraceFilter experimental results - RQ3: Correctness
6.1 LEAN evaluation benchmarks
6.2 LEAN experimental results - RQ1: Effectiveness
6.3 LEAN - decomposed effectiveness on trace size reduction
6.4 Comparison between LEAN and ER
6.5 LEAN experimental results - RQ2: Efficiency
7.1 SimTrace experimental results. Data are averaged over 50 runs for each subject.
8.1 Results of real concurrency bug fixing by privatization
8.2 Performance improvement by privatization
8.3 Statistics of the privatization results
8.4 Bytecode size increase after privatization
Effective Methods for Debugging Concurrent Software
by Shaoming HUANG
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Abstract
Multicore is here to stay. To keep up with the hardware innovation, software developers must
move from sequential programming to concurrent programming. However, this move is slow
and challenging due to the exponential complexity in reasoning about concurrency. In particular,
Heisenbugs such as data races, which are non-deterministic concurrency errors, pervasively
infect concurrent software, making concurrent program debugging notoriously difficult.
In this dissertation, we develop several effective methods for debugging concurrent programs
along four directions: multiprocessor deterministic replay, predictive trace analysis, trace sim-
plification, and data sharing reduction. First, we present LEAP, a lightweight record and replay
system that makes Heisenbugs reproducible on multi-core and multi-processors. Underpinned
by a new local-order based replay theorem, LEAP is fast, portable, and deterministic. As long as
a Heisenbug manifests once, LEAP is able to deterministically reproduce it in every subsequent
execution, and more importantly, with much lower overhead compared to previous approaches.
Second, we present PECAN and TraceFilter, a persuasive predictive trace analysis system that
predicts Heisenbugs from normal executions, and an efficient algorithm that significantly im-
proves the scalability of predictive analysis by removing the trace redundancy. The salient fea-
ture of PECAN is that, in addition to predicting Heisenbugs, it generates a concrete execution
that deterministically exposes and validates the predicted bugs. With PECAN, programmers are
provided with the full execution history and context information to understand the bug, which
dramatically expedites the debugging process.
Third, we present LEAN and SimTrace, a dynamic and a static technique for simplifying con-
currency bug reproduction through removing computational redundancy and validating trace
equivalence. A simplified execution with fewer threads, fewer thread interleavings, and faster
replay greatly reduces the debugging effort by reducing the number of places in the trace where
we need to look for the cause of the bug and by speeding up the bug reproduction process.
Finally, we present Privateer, an execution privatization technique that soundly privatizes a subset
of shared data accesses in a vast category of scheduler-oblivious concurrent programs. Under-
pinned by a privatization theorem, Privateer safely reduces the data sharing and isolates the
erroneous thread interleavings without introducing any additional synchronization. With Pri-
vateer, many Heisenbugs are fixed and a wide range of concurrency problems are alleviated
without impairing, but instead improving, the program performance.
Abbreviations
AA Access Anomaly
ASV Atomic-set Serializability Violations
MDR Multiprocessor Deterministic Replay
PTA Predictive Trace Analysis
Chapter 1
Introduction
We have entered a new era where our daily life is being dramatically changed by computing
technology. One of the greatest innovations in this era lies in the multicore hardware archi-
tecture, which brings our computers a new dimension of computational power. Even though
single-core frequency scaling has hit the power wall, the performance of our computers will continue to
increase, as multicore promises to deliver a continuous performance boost by packing more and
more computational cores onto each chip.
While it is obvious that a multicore computer has the potential for higher performance, actually
realizing this potential is difficult. Despite a decade of practice, developing good quality concur-
rent software that efficiently utilizes multicore hardware remains notoriously difficult. A main
challenge is the interleaving of actions from concurrent threads, which is essential for parallel
performance. Due to the interleaving, programmers can no longer reason in a sequential way
because threads sharing the same address space can interfere with each other through the shared
data following different access orders.
Moreover, the number of thread interleavings is astronomical: exponential in both the number
of threads and the number of instructions each thread executes. Facing this exponential complexity
of reasoning about concurrency, it is very difficult for programmers to write correct and efficient
concurrent programs. In addition, due to the huge interleaving space, software testing is often far
from sufficient to cover an adequate portion of the interleaving space, letting many concurrency
bugs slip into production and impact the end users.
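To make this growth concrete, a standard counting argument (our addition, not from the thesis text) counts the interleavings of $n$ threads, each executing $k$ instructions, as the number of ways to merge $n$ sequences of length $k$:

$$N(n,k) \;=\; \frac{(nk)!}{(k!)^{\,n}}$$

Already for $n = 2$ threads of $k = 10$ instructions each, this gives $\binom{20}{10} = 184{,}756$ interleavings; adding a third such thread pushes the count into the trillions.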
Even worse, the interleaving is non-deterministic, due to the thread scheduling non-determinism
and the timing differences between different cores. On a multicore computer, the same concur-
rent program running on the same machine with the same input can produce different outputs in
different runs. The non-determinism makes testing and debugging concurrent programs much
more challenging because a bug might “disappear” when programmers want to understand it.
As a consequence, concurrency bugs such as data races, atomicity violations, atomic-set seri-
alizability violations, and deadlocks widely infect concurrent software systems, causing severe
problems such as data corruption and program crashes, with huge economic cost [124] and even
real-world disasters [71].
Facing the numerous challenges above, we develop in this thesis a range of effective and scal-
able methods for dealing with concurrency bugs, aiming to improve the quality of concurrent
software in the multicore era.
1.1 Motivation
Concurrency bugs widely exist in today’s real-world concurrent software systems [74]. While
concurrent programs are inherently more difficult to reason about than sequential programs, several
other important factors also greatly affect the quality and reliability of concurrent programs.
Concurrency bugs are difficult to reproduce The exhibition of concurrency bugs is not
only dependent on the program input, but also on the thread interleaving. Since the interleaving is
non-deterministic due to choices made by the thread scheduler, the exhibition of concurrency
bugs is also non-deterministic. Consider the simple multithreaded example in Figure 1.1. In this
artificial program, there are two threads t1 and t2 accessing two different shared variables
x and y, and there is an error at line 4. Because these two threads can execute concurrently
on different cores, their execution order may follow different interleaving sequences. For
example, the execution may follow either interleaving A or B, represented by the statement line
numbers 1-5-2-6-7-3-4 and 1-2-3-5-6, respectively. If the program execution follows
interleaving A, the error at line 4 is triggered. However, if it follows interleaving B, the error
does not manifest.
This simple example illustrates the fact that the computation of concurrent programs is sensitive
to the thread interleaving. Even if we run the same program on the same machine with the
same program input, the error may or may not manifest in different runs. This phenomenon
makes debugging concurrent programs very hard. To reproduce a concurrency bug, not only
the same program input is required, but also the same thread interleaving. Unfortunately, it is
very challenging to capture the thread interleavings on multicore computers. Because recording
the thread interleavings at runtime inevitably hampers the execution parallelism, most runtime
techniques incur unacceptable program slowdown and are hard to deploy in production.
FIGURE 1.1: The same program exhibits different behaviors with different thread interleavings. The error manifests with the interleaving A (left) but not the interleaving B (right). [Recovered figure content - thread t1: 2: y=1; 3: if(x<0); 4: ERROR; thread t2: 6: if(y=1); 7: x=-1. Interleaving A: 1->5->2->6->7->3->4; interleaving B: 1->2->3->5->6.]
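For readers who prefer running code, here is a minimal Java sketch of the program in Figure 1.1. The class name and the zero-initialization of x and y are our assumptions; the figure leaves the contents of lines 1 and 5 implicit.

```java
// A minimal sketch of the Figure 1.1 program (initializations are assumed).
public class Heisenbug {
    static int x = 0, y = 0;                 // shared variables

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            y = 1;                           // line 2
            if (x < 0)                       // line 3
                throw new AssertionError("ERROR");   // line 4
        });
        Thread t2 = new Thread(() -> {
            if (y == 1)                      // line 6
                x = -1;                      // line 7
        });
        t1.start(); t2.start();              // both threads may run concurrently
        t1.join();  t2.join();
        // Interleaving A (2-6-7-3-4) triggers the error; interleaving B
        // (2-3-6) does not. The scheduler decides, so the failure is flaky.
    }
}
```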
Concurrency bugs are difficult to detect Due to the astronomical number of thread inter-
leavings, detecting concurrency bugs is also very challenging. Traditional program testing tech-
niques for sequential programs do not work well for concurrent programs because they do not
take the interleaving into account. Moreover, as the interleaving space is huge, testing is often
far from sufficient to cover the entire interleaving space. For this reason, traditional program
analysis techniques for bug detection do not work well on concurrent programs. It is hard for
static program analysis or model checking techniques to find concurrency bugs in large real-world
concurrent programs, because there are just too many thread interleavings to explore. Traditional
dynamic analyses do not work well either, because only a limited set of paths and schedules is
observed. Furthermore, due to the inherent complexity of concurrent programs, program analysis
techniques tend to report quite a large number of false warnings, further impeding
the debugging process.
Concurrency bugs are difficult to understand Typical executions of real world concurrent
programs often contain a large number of threads, thread interleavings, shared data accesses,
and thread synchronizations. Even if a concurrency bug can be reproduced deterministically, it is
still very challenging for programmers to locate and understand the cause of the bug. Moreover,
replay is often significantly slower than native execution. For long-running
programs, the bug reproduction process may take too long. Furthermore, the bug reasoning
process based on the trace often involves frequent context switches between the executions of
different threads. As most programmers are trained to think sequentially, they have to jump
from the context of one thread to another frequently to reason about the concurrency bug. These
frequent context switches significantly impair the effectiveness of concurrent program debug-
ging.
Concurrency bugs are difficult to fix After diagnosing the concurrency bug, fixing it is still
a challenging problem.

FIGURE 1.2: Overview of the work in this thesis for concurrent program debugging. [Stages shown: Reproduction - LEAP (multiprocessor deterministic replay); Detection - PECAN and TraceFilter (predictive trace analysis); Diagnosis - LEAN and SimTrace (trace simplification); Fixing - Privateer (data sharing reduction).]

A common way to fix a concurrency bug is to add synchronization that
prevents the erroneous thread interleavings. However, facing the huge interleaving space and the
large number of thread contexts, it is usually difficult to find the proper type of synchronization
and the proper location to place the synchronization. Improper placement of synchronization
can not only incur non-negligible program slowdown but might also introduce new bugs such as
deadlocks. Moreover, even if the proper synchronization is placed at the right location to rule out the
manifested erroneous interleavings, it does not necessarily guarantee the bug is fixed. Because
the interleaving space is enormous, it is possible that some other unmanifested interleavings
that can still trigger the bug are not forbidden by the added synchronization.
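As a hypothetical illustration (not an example from the thesis) of how an improper fix backfires, the sketch below adds two locks to serialize accesses, but the threads acquire them in opposite orders, so each can block forever holding the lock the other needs:

```java
// Hypothetical "fix" gone wrong: inconsistent lock ordering causes deadlock.
public class DeadlockFromFix {
    static final Object lockA = new Object(), lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {           // thread 1: A then B
                pause();
                synchronized (lockB) { /* critical section */ }
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {           // thread 2: B then A
                pause();
                synchronized (lockA) { /* critical section */ }
            }
        }).start();
        // With unlucky timing both threads hold one lock and wait for the
        // other forever: the "fix" traded an erroneous interleaving for a deadlock.
    }

    static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}
```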
1.2 Contributions
This thesis pursues four directions to address the debugging problem: multiprocessor deter-
ministic replay to reproduce concurrency bugs, predictive trace analysis to detect concurrency
bugs, static and dynamic trace simplification to help concurrency bug understanding, and data
sharing reduction to help fixing concurrency bugs without adding synchronization. Figure 1.2
shows an overview of the work done in this thesis. We next elaborate on the contributions of each
line of work.
1.2.1 Multiprocessor Deterministic Replay
Bug reproduction is often the first step in debugging. This thesis presents LEAP, a lightweight
record and replay system that makes concurrency bugs reproducible in general multicore and
multiprocessor environments. LEAP is fast, portable, and deterministic. As long as a Heisen-
bug manifests once, LEAP is able to deterministically reproduce it in every subsequent exe-
cution, and more importantly, with much lower overhead compared to previous approaches.
We describe the design and implementation of LEAP that uses static analysis and bytecode in-
strumentation to transparently provide the capability of deterministic replay for Java programs
without any user intervention. LEAP is the first publicly available deterministic replay system for
Java programs and has been used by several research groups worldwide.
1.2.2 Predictive Trace Analysis
Predictive trace analysis overcomes the limitation of static and dynamic analyses by combining
them. It records a trace of execution events, statically (often exhaustively) generates other per-
mutations of these events under certain scheduling constraints, and exposes concurrency bugs
unseen in the recorded execution. Predictive trace analysis is a powerful technique as, compared
to dynamic analysis, it is capable of exposing bugs in unexercised executions and, compared
to static analysis, it incurs far fewer false positives because its static analysis phase uses the
concrete execution history.
We present PECAN, a new predictive trace analysis system that predicts Heisenbugs from nor-
mal executions. The salient feature of PECAN is that, in addition to predicting Heisenbugs, it
generates concrete executions that deterministically expose the predicted bugs. With PECAN,
programmers are provided with the full execution history and context information to understand
the bug, which dramatically expedites the debugging process. PECAN has revealed several
serious and previously unknown bugs in large open source concurrent systems.
General predictive analysis for exposing Heisenbugs faces considerable challenges scaling to
large traces, due to the exponential explosion of the schedule exploration space. We further
present TraceFilter, an efficient algorithm that significantly improves the scalability of predic-
tive trace analysis. TraceFilter is based on a trace redundancy theorem which guarantees that
predictive trace analysis based on a redundancy-removed trace produces the same analysis result
as that on the original trace.
1.2.3 Dynamic and Static Trace Simplification
To address the difficulty of diagnosing concurrency bugs on a reproducible buggy trace, we
present LEAN and SimTrace, a dynamic and a static trace simplification technique that reduce
the size of the execution trace, the number of threads, and the number of thread context switches.
A simplified trace greatly lessens the debugging effort by reducing the number of places in the
trace where programmers need to look for the cause of the bug. More importantly, through
reasoning about the computational equivalence of the trace offline, SimTrace dramatically im-
proves the efficiency of trace simplification for reducing the thread context switches. SimTrace
scales well to traces with more than 1M events, making it attractive for practical use.
1.2.4 Data Sharing Reduction
We finally propose Privateer, an execution privatization technique that soundly privatizes a sub-
set of shared data accesses in a vast category of concurrent programs: scheduler-oblivious
programs, whose computation result is always deterministic regardless of the thread schedul-
ing. Underpinned by a privatization theorem, Privateer is able to reduce the data sharing in
scheduler-oblivious programs without introducing any additional program behavior. Moreover,
the non-deterministic thread interleavings on the privatized accesses are isolated without adding
any synchronization. With Privateer, many Heisenbugs are fixed and a wide range of concur-
rency problems are alleviated without impairing the execution parallelism; on the contrary,
program performance improves, because the heap accesses become local stack operations
after privatization.
1.3 Outline
The remainder of this thesis is organized as follows. Chapter 2 describes the background knowl-
edge and previous work on concurrent program debugging and the related concurrent program
execution modeling concepts. Chapter 3 presents our multiprocessor deterministic replay system
LEAP. Chapters 4 and 5 focus on our predictive trace analysis work on the scalable concurrency
bug detection, PECAN and TraceFilter. Chapters 6 and 7 present our dynamic and static trace
simplification techniques, LEAN and SimTrace. Chapter 8 presents Privateer, our execution
privatization technique for soundly reducing the data sharing in scheduler-oblivious concurrent
programs. Finally, Chapter 9 concludes this thesis and discusses future work.
The materials in some chapters have been published as conference and journal papers. The
materials in Chapter 3 have been presented in [47, 48]. The materials in Chapters 4 and 5 have
been presented in [50, 53]. The materials in Chapters 6 and 7 have been presented in [49, 52].
The materials in Chapter 8 have been presented in [51], and some materials in Chapter 8 have
been presented in [46].
Chapter 2
Background and Previous Work
This chapter introduces the background of concurrent program debugging and concurrency de-
fect analysis. Section 2.1 presents an execution model for concurrent programs. Section 2.2
presents the basic definitions used in this thesis. Section 2.3 presents the concurrency bug pat-
terns characterized by thread interleaving. Section 2.4 discusses existing techniques for tackling
the concurrency problems, including deterministic replay approaches for concurrency bug re-
production, concurrent program analysis techniques to detect concurrency bugs, and automatic
techniques for fixing and surviving concurrency bugs at runtime.
2.1 Concurrent Program Execution Modeling
In this section, we describe a general execution model of concurrent programs. This model is
a starting point to understand the difficulties in concurrent programming and to comprehend all
the program analysis techniques presented in this thesis for concurrent program debugging.
A concurrent program in our language consists of a set of concurrently executing threads T =
{t1, t2, ...} that communicate through a global store σ. The global store consists of a set of
variables S = {s1, s2, ...} that are shared among threads. Each thread also has its own local store
π, consisting of the local variables and the program counter to the thread. We use σ[s] to denote
the value of the shared variable s on the global store. Each thread executes by performing a
sequence of actions on the global store or the thread’s own local store. Let α refer to an action
and var(α) the variable accessed by α. If var(α) is a shared variable, we call α a global action,
otherwise it is a local action. Note that for any global action, it operates on only one variable
on the global store. This is also true for synchronization actions, though they are only enabled
when certain pre-conditions are met. For local actions, the number of accessed variables on the
local store is not important in our modeling.
A program execution is modeled as a sequence of transitions defined over the program state
Σ = (σ,Π), where σ is the global store and Π is a mapping from thread identifiers ti to the local
store πi of each thread. Since the program counter is included in the local store, each thread
is deterministic and the next action of ti is determined by ti’s current local store πi. Let αk be
the kth action in the global order of the program execution and Σk−1 be the program state just
before αk is performed (Σ0 is the initial state), the state transition sequence is:
$$\Sigma_0 \xrightarrow{\alpha_1} \Sigma_1 \xrightarrow{\alpha_2} \Sigma_2 \xrightarrow{\alpha_3} \cdots \tag{2.1}$$
Given a concurrent system described above, we next formally define the execution semantics
of action α. To give a precise definition, we first introduce some additional notations similar to
[34]:
• σ[s := v] is identical to σ except that it maps the variable s to the value v.
• Π[ti := πi] is identical to Π except that it maps the thread identifier ti to πi.
Let the relation $\sigma \xrightarrow{\alpha} \sigma'$ model the effect of performing an action α on the global store σ, and $\pi \xrightarrow{\alpha} \pi'$ model the effect of performing α on the local store π. The execution semantics of performing α are defined as follows.
Local action For the case of local actions, the execution semantics of performing α is simply
defined as:
$$\textsc{Local}:\quad \frac{var(\alpha) \notin S \qquad \Gamma(\alpha) = t_i \qquad \pi_i \xrightarrow{\alpha} \pi_i'}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\ \Pi[t_i := \pi_i'])} \tag{2.2}$$
The program state transition above means that when a local action is performed by a thread,
only the local store of that thread is changed to a new state determined by its current state. The
global store and the local stores of the other threads remain the same.
Global action The common part of the semantics of global actions is that when a global action
is performed by a thread ti on the shared variable s, only s and πi are changed to new states.
The states of all the other shared variables on the global store as well as the local stores of all
the other threads remain the same:
$$\textsc{Global}:\quad \frac{var(\alpha) = s \in S \qquad \Gamma(\alpha) = t_i \qquad \pi_i \xrightarrow{\alpha} \pi_i' \qquad \sigma[s] \xrightarrow{\alpha} \sigma'[s]}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s := \sigma'[s]],\ \Pi[t_i := \pi_i'])} \tag{2.3}$$
Let τ(α) denote the computation type of the global action α. To make the execution model
general to different programming languages, we consider the following types of global actions:
• READ - the thread ti reads the value of a shared variable in the global store into its local store:
$$\frac{\tau(\alpha) = \mathrm{READ} \quad var(\alpha) \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i'}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\ \Pi[t_i := \pi_i'])}$$

• WRITE - the thread ti assigns some value to a shared variable in the global store:
$$\frac{\tau(\alpha) = \mathrm{WRITE} \quad var(\alpha) = s \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad \sigma[s] \xrightarrow{\alpha} \sigma'[s]}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s := \sigma'[s]],\ \Pi[t_i := \pi_i'])}$$

• LOCK - the thread ti acquires a lock l (which is also a shared variable on the global store); the pre-condition l = 0 means that the lock is available and the post-condition l = i means that the lock l is now owned by the thread ti:
$$\frac{\tau(\alpha) = \mathrm{LOCK} \quad var(\alpha) = l \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad \sigma[l] = 0}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[l := i],\ \Pi[t_i := \pi_i'])}$$

• UNLOCK - the thread ti releases a lock l; the pre-condition l = i means l is now owned by the thread ti and the post-condition l = 0 means l is available:
$$\frac{\tau(\alpha) = \mathrm{UNLOCK} \quad var(\alpha) = l \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad \sigma[l] = i}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[l := 0],\ \Pi[t_i := \pi_i'])}$$

• FORK - the thread ti forks a new thread tj. Let the shared variable stj denote the existence of the thread tj in the program. The pre-conditions stj = NA and πj = NA mean that the thread tj is unavailable and its local store is undefined, and the post-conditions σ[stj := 1] and Π[tj := π0j] mean the thread tj is available now and its local store is initialized to π0j:
$$\frac{\tau(\alpha) = \mathrm{FORK} \quad var(\alpha) = s_{t_j} \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad s_{t_j} = \mathit{NA} \quad \pi_j = \mathit{NA}}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s_{t_j} := 1],\ \Pi[t_i := \pi_i',\ t_j := \pi_j^0])}$$

• JOIN - the thread ti joins the termination of the thread tj; the pre-condition stj = 0 means that the thread tj has already terminated:
$$\frac{\tau(\alpha) = \mathrm{JOIN} \quad var(\alpha) = s_{t_j} \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad s_{t_j} = 0}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\ \Pi[t_i := \pi_i'])}$$

• START - the first action in the action sequence of the thread ti. This is a dummy action indicating that the thread ti is ready to run. This action does not change any program state and it immediately follows the FORK action that forked the thread ti:
$$\frac{\tau(\alpha) = \mathrm{START} \quad var(\alpha) = s_{t_i} \quad \Gamma(\alpha) = t_i}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\Pi)}$$
• EXIT - the last action in the action sequence of the thread ti, indicating that ti has terminated. The value of the shared variable sti is set to 0 after this action:
$$\frac{\tau(\alpha) = \mathrm{EXIT} \quad var(\alpha) = s_{t_i} \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad s_{t_i} = 1}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s_{t_i} := 0],\ \Pi[t_i := \mathit{NA}])}$$

• SIGNAL - the thread ti sets the value of a conditional variable c to 1:
$$\frac{\tau(\alpha) = \mathrm{SIGNAL} \quad var(\alpha) = c \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i'}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[c := 1],\ \Pi[t_i := \pi_i'])}$$

• WAIT - the standard semantics of a wait(c, l) action contains a sequence of three actions UNLOCK-WAIT-LOCK: the thread ti first releases the lock l it is currently holding, then it waits for a conditional variable c to become 1 and resets it back to 0 after c becomes 1, and finally it re-acquires lock l. The following execution semantics model the second action:
$$\frac{\tau(\alpha) = \mathrm{WAIT} \quad var(\alpha) = c \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad c = 1}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[c := 0],\ \Pi[t_i := \pi_i'])}$$

• YIELD - the thread ti yields execution to another thread. This action does not change program state:
$$\frac{\tau(\alpha) = \mathrm{YIELD} \quad \Gamma(\alpha) = t_i}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\Pi)}$$
The execution semantics defined above conform to a general concurrent execution model with
deterministic input. Although dynamic thread creation and dynamic shared variable creation
are not explicitly supported by the semantics, they can be modeled within the semantics in a
straightforward way [34].
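To ground these semantics in a concrete language, the sketch below (our own illustration; the class and field names are assumptions) annotates ordinary Java constructs with the model's action types:

```java
// Illustrative mapping from the model's global actions to Java constructs.
public class ActionsInJava {
    static final Object lock = new Object();  // a lock variable l
    static int s = 0;                         // a shared variable on the global store

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {         // the child's first step ~ START
            synchronized (lock) {             // LOCK: blocks until sigma[l] = 0
                int local = s;                // READ: global store -> local store
                s = local + 1;                // WRITE: local store -> global store
                lock.notify();                // SIGNAL on a condition variable
            }                                 // UNLOCK: sets sigma[l] back to 0
            Thread.yield();                   // YIELD: no state change
        });                                   // run() returning ~ EXIT
        t.start();                            // FORK: makes t available
        t.join();                             // JOIN: waits until t has terminated
    }
}
```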
2.2 Basic Definitions
Definition 2.1. (Trace) A trace captures a multi-threaded program execution as a sequence of
events δ = ⟨ei⟩. We associate each event ei with the following attributes:
• i: the global order of ei in δ;
• t: the thread executing ei;
• m: the memory location accessed by ei;
• a: the access type of ei, where a ∈{READ, WRITE, LOCK, UNLOCK, WAIT, NOTIFY,
FORK, JOIN};
• l: the locks held by the thread executing ei when ei is executed;
• u: the atomic region to which ei belongs.
In our presentation, we use t(i), m(i), a(i), l(i), and u(i) to denote the attributes t, m, a, l, u
associated with the event ei respectively.
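For concreteness, the attributes of Definition 2.1 map naturally onto a small Java record; this is a hypothetical sketch, not a representation the thesis prescribes:

```java
import java.util.Set;

// One trace event carrying the attributes of Definition 2.1.
enum Access { READ, WRITE, LOCK, UNLOCK, WAIT, NOTIFY, FORK, JOIN }

record Event(
    long   i,        // global order of the event in the trace delta
    long   t,        // id of the executing thread
    long   m,        // id of the accessed memory location
    Access a,        // access type
    Set<Long> l,     // ids of the locks held when the event executes
    long   u         // id of the atomic region the event belongs to
) {}
```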
Definition 2.2. (Trace Equivalence) Two traces are equivalent if they drive the same initial
program state to the same final program state.
Definition 2.3. (Atomic Region) An atomic region is defined as a region of code fragments that
preserves certain consistency properties w.r.t. the program states. Similar to the work by Wang
et al. [133] and with no loss of generality, we consider every synchronized method and ev-
ery synchronized block as an atomic region. In addition, FORK/JOIN/WAIT/NOTIFY/YIELD
operations are considered to be region boundaries. In the case of nested regions, an event ei belongs to the outermost one.
Definition 2.4. (Partial Order Relation ≺) (POR) An important relation that is used by many
concurrent program analyses is the POR relation (also called happens-before relation) on the
events exhibited by a concurrent execution. Given a trace δ, the partial-order relation ≺ is the
smallest relation satisfying the following conditions:
• Intra-thread program order: If ei and ej are events from the same thread and ei comes
before ej in the trace, then ei ≺ ej .
• Inter-thread message order: If ei is an action that sends a message g and ej is an ac-
tion that receives g, then ei ≺ ej . In our model, such relations include FORK≺START,
EXIT≺JOIN, and NOTIFY≺WAIT. START and EXIT are two fake actions representing
the beginning and ending of a thread.
• ≺ is transitively closed.
The computation of ≺ is often done by maintaining a vector clock with every thread [81]. Note
that, slightly different from the classical happens-before in the Java memory model [77], the
lock order between UNLOCK and LOCK events is not included in the POR relation.
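A minimal sketch of that vector-clock computation (our own illustration; the thesis does not fix an implementation): each thread keeps a clock, ticks its own entry at every event, and merges clocks along the inter-thread edges (FORK≺START, EXIT≺JOIN, NOTIFY≺WAIT).

```java
import java.util.Arrays;

// Minimal vector clocks for the partial order of Definition 2.4 (a sketch).
final class VectorClock {
    final int[] c;                       // c[k] = events of thread k seen so far
    VectorClock(int numThreads) { c = new int[numThreads]; }

    void tick(int t) { c[t]++; }         // thread t performs its next event

    void mergeFrom(VectorClock sender) { // apply a FORK/EXIT/NOTIFY edge
        for (int k = 0; k < c.length; k++) c[k] = Math.max(c[k], sender.c[k]);
    }

    int[] stamp() { return Arrays.copyOf(c, c.length); }  // timestamp an event

    // e_i (stamped ci) precedes e_j (stamped cj) iff ci <= cj pointwise
    // with at least one strict inequality.
    static boolean happensBefore(int[] ci, int[] cj) {
        boolean strict = false;
        for (int k = 0; k < ci.length; k++) {
            if (ci[k] > cj[k]) return false;
            if (ci[k] < cj[k]) strict = true;
        }
        return strict;
    }
}
```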
Definition 2.5. (Dependence Relation →) is a strict relation that captures data and control
dependencies between events in the trace. The dependence relation ei → ej holds whenever ei occurs before ej and one of the following holds:
• Partial order - ei ≺ ej ;
• Lock order - ei and ej are consecutive UNLOCK and LOCK actions on the same lock,
respectively, by different threads such that ei releases the lock acquired by ej ;
• Conflicting order - ei and ej are consecutive conflicting actions by different threads on
the shared variable. There are three types of conflicting orders:
– WRITE→READ: ei is a WRITE action and ej is a READ action;
– READ→WRITE: ei is a READ action and ej is a WRITE action;
– WRITE→WRITE: both ei and ej are WRITE actions.
Given a dependence relation ei → ej, if ei and ej are from different threads, we say ei has a
remote outgoing dependence to ej , and similarly, ej has a remote incoming dependence from ei.
It is important to notice that the remote dependence relations in our model are between actions
accessing the same shared variable. Therefore, context switches between threads accessing
different variables in the trace are allowed to be reduced in our model.
Definition 2.6. (Memory model) A memory consistency model defines what value a READ
action will return. For example, the simplest but strictest model, sequential consistency
(SCMM) [65], requires that a READ always returns the value written by the most recent WRITE
on the same memory address. Various relaxed memory models [2, 77, 78] have been developed
to admit additional optimizations by imposing fewer constraints on the value returned from
READ operations. For simplicity, unless we emphasize the other memory models, by default
we consider SCMM in this thesis. Nevertheless, most techniques presented in this thesis also
generalize to relaxed memory models.
Definition 2.7. (Thread scheduling and interleaving) Under SCMM, in any execution, there
exists a global order among all the actions, and a READ action always returns the value writ-
ten by the most recent WRITE on the same variable in this global order. We call this global
order a schedule, denoted by ξ. ξ is non-deterministic; it may be different in different exe-
cutions. A thread interleaving occurs in ξ when an action from a certain thread is executed
between two successive actions from a different thread. A preemptive interleaving occurs when
the interleaved thread could have executed continuously without the interleaving. Preemptive
interleaving is non-deterministic, because it depends on the behavior of the thread scheduler and
the timing variations between threads [100]. If a schedule contains no preemptive interleaving,
we say it is sequential and, otherwise, non-sequential.
Definition 2.8. (Scheduler-obliviousness) A vast category of concurrent programs are scheduler-
oblivious. A scheduler-oblivious program requires that, given the same input, it always returns
the same output, regardless of the behavior of the underlying thread scheduler. More specifi-
cally, in our modeling, given the same initial state Σ0, for any schedule ξ, the computation of a
scheduler-oblivious program always reaches the same final state ΣN :
$$(\Sigma_0,\ \xi) \xrightarrow{\ \cdots\ } \Sigma_N \tag{2.4}$$
The definition of scheduler-obliviousness is semantically equivalent to determinacy [61]. A dif-
ference is that determinacy is a goal of parallel computation, whereas scheduler-obliviousness
is an expected property of the program.
Definition 2.9. (Blocking statement) A blocking statement is a statement that, when exe-
cuted, may enforce a thread interleaving or introduce an execution ordering between threads.
In our model, LOCK/WAIT/JOIN/YIELD/UNLOCK/NOTIFY/FORK are blocking statements,
and READ/WRITE are non-blocking. A LOCK statement is blocking because the acquiring thread may wait if the lock is unavailable. A WAIT statement always blocks first and then waits until another thread sets some condition to true. A JOIN statement must wait until the termination of another thread, and a YIELD statement always yields execution to another thread.
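In Java terms, these statement types map onto familiar constructs, as the following illustrative sketch (a hypothetical class) shows: monitor entry corresponds to LOCK, monitor exit to UNLOCK, Object.wait() to WAIT, Object.notify() to NOTIFY, Thread.join() to JOIN, and Thread.yield() to YIELD.

class BlockingDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    void consumer() throws InterruptedException {
        synchronized (lock) {          // LOCK: may wait if the lock is held
            while (!ready) {
                lock.wait();           // WAIT: blocks until notified
            }
        }                              // UNLOCK on monitor exit
    }

    void producer() {
        synchronized (lock) {
            ready = true;
            lock.notify();             // NOTIFY: wakes a waiting thread
        }
    }

    void coordinator(Thread worker) throws InterruptedException {
        worker.join();                 // JOIN: waits for worker to terminate
        Thread.yield();                // YIELD: yields execution to another thread
    }
}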
2.3 Thread Interleaving Patterns for Concurrency Bugs
Researchers have proposed various criteria for characterizing concurrency defects such as data
race [4, 109], atomicity [4, 34], causal atomicity [32], and conflict/view serializability [134].
A comprehensive study of concurrency-related bugs is given in [74]. We describe data race,
atomicity violation, and atomic-set serializability violations in this section. We omit deadlocks
and livelocks as they are not the focus of this thesis.
Data race Data races are one of the most common and subtle causes of pernicious concur-
rency bugs. A data race occurs when two threads are concurrently accessing the same data
without proper synchronization and at least one of these accesses is a write [109].
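As a concrete (hypothetical) Java illustration, the classic unsynchronized counter exhibits a data race:

class RacyCounter {
    static int count = 0;              // shared variable, no synchronization

    public static void main(String[] args) throws InterruptedException {
        Runnable body = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    count++;           // unsynchronized read-modify-write
                }
            }
        };
        Thread t1 = new Thread(body), t2 = new Thread(body);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // The concurrent accesses to count are unsynchronized and include
        // writes, so this is a data race; lost updates often make the
        // printed value smaller than the expected 200000.
        System.out.println(count);
    }
}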
Atomicity Violations Atomicity guarantees that the program’s behavior can be understood as
if each atomic region executes serially (without interleaved steps of other threads). An atomicity
violation happens when the desired serializability among multiple memory accesses on some
shared data is violated [34]. Suppose ei and ej are data accesses (write or read) from the same atomic region, ek is a data access from another atomic region, and ei, ej , ek access the same memory location. An atomicity violation occurs if, in some execution, ei happens before ek, ek happens before ej , and the access types of “ei-ek-ej” are of the form “write-read-write” or “x-write-x”, where x means either read or write.
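The following hypothetical Java sketch illustrates the “x-write-x” form: a check and an update that are intended to form one atomic region can be split by another thread's write.

class BankAccount {
    private int balance = 100;

    void withdraw(int amount) {
        if (balance >= amount) {        // e_i: READ of balance
            // e_k: another thread's WRITE to balance may interleave here,
            // yielding the problematic "read-write-write" interleaving
            balance = balance - amount; // e_j: WRITE based on a stale check
        }
    }
}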
Atomic-set serializability violations Atomic-set serializability is a criterion for characteriz-
ing concurrency defects proposed by Vaziri et al. [125]. Since it also considers the correlations
between memory locations, this criterion characterizes a wider range of concurrency bugs than
many previously proposed criteria such as data race and atomicity violation.
Id  Pattern
1   Wu(l1) Wu′(l) Wu′(L−l) Wu(l2)
2   Wu(l1) Wu′(l2) Wu(l2) Wu′(l1)
3   Wu(l1) Ru′(l) Ru′(L−l) Wu(l2)
4   Wu(l1) Ru′(l2) Wu(l2) Ru′(l1)
5   Ru(l1) Wu′(l) Wu′(L−l) Ru(l2)
6   Ru(l1) Wu′(l2) Ru(l2) Wu′(l1)
FIGURE 2.1: Atomic-set serializability violation patterns [125]. Wu(l) and Ru(l) represent a write and a read, respectively, to a memory location l by a unit of work u. l1 and l2 belong to the same atomic set.
In the definition of atomic-set serializability, the memory locations that share a consistency property with each
other are grouped into an atomic set, and code regions expected to preserve the consistency of
an atomic set are called units of work. Atomic-set serializability requires that the units of work
must be serializable for all the atomic sets that they operate on. Errors due to data races, high
level data races, and violations of standard notions of serializability can all be treated as vio-
lations of atomic-set serializability. Moreover, previous experience with this criterion shows that it can be more accurate in discerning real concurrency bugs than other existing criteria [43].
More importantly, Vaziri et al. [125] summarized a set of eleven problematic data access pat-
terns (six of which are shown in Figure 2.1) that violate atomic-set serializability (ASV) and proved that the set is com-
plete, provided that each unit of work that writes to an atomic set, writes all locations in that
set. For example, pattern 6 “Wu(l1)Wu′(l)Wu′(L − l)Wu(l2) (l ∈ l1, l2 = L)” shows an atomic-set
serializability violation that causes memory to be left in an inconsistent state. The two memory
locations l1 and l2 belong to the same atomic set. Because the two consecutive writes to l1 and
l2 of the unit of work u are interleaved by two writes to the two memory locations of another
unit of work u′, the consistency property between l1 and l2 is violated.
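The following hypothetical Java sketch instantiates this pattern: every individual access is synchronized, so there is no data race, yet the consistency property relating the two locations of the atomic set can still be broken.

class Pair {
    private int l1, l2;                 // atomic set {l1, l2}; invariant: l1 == l2

    synchronized void writeL1(int v) { l1 = v; }
    synchronized void writeL2(int v) { l2 = v; }

    // A unit of work u that writes the whole atomic set. Another thread's
    // update(w) may interleave between the two writes, matching pattern 1
    // of Figure 2.1 and leaving l1 != l2.
    void update(int v) {
        writeL1(v);                     // Wu(l1)
        // Wu'(l1) and Wu'(l2) of another unit of work u' may occur here
        writeL2(v);                     // Wu(l2)
    }
}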
2.4 Tackling Concurrency Problems
To address the difficulties in programming concurrent systems, existing research has focused on
three dimensions. The first dimension is to provide language and library support for easier and safer reasoning about concurrency. This dimension includes high-performance concurrency libraries
[54, 67], flexible synchronization mechanisms [145, 146], deterministic language semantics
[11, 14], and transactional memory [44, 114]. The second dimension is to provide deterministic
runtime enforcement for concurrent program execution [26, 97, 141]. This dimension usually
combats concurrency perils at the cost of program performance. The third dimension
targets the effective diagnosis of concurrency issues. Concurrency defect detection [35, 50, 79,
110], trace analysis [24, 41, 49, 55, 122], multiprocessor record/replay [45, 48, 83, 100] all
belong to this school. We next discuss previous research efforts related to concurrent program debugging.
2.4.1 Concurrency Bug Reproduction
2.4.1.1 Deterministic Replay
The technique of deterministic replay aims at faithfully reproducing earlier program executions.
It plays a substantial role in concurrent program debugging as it makes concurrency bugs repro-
ducible. We next discuss the representative deterministic replay techniques.
Software-only approaches Dejavu [23] is a software-only solution that uses a logical clock to provide deterministic replay of Java multi-threaded programs. It is developed as a JVM
extension that has two modes: record and replay. In the record mode, it records the thread
scheduling order at every critical event, including shared memory accesses and synchroniza-
tion operations. In the replay mode, it reproduces the execution behavior of the program by
enforcing the recorded logical thread schedule. However, since it has to trace every critical sec-
tion access, it only can support programs running on single-processor platforms. InstanceReplay
[68] is a record/replay technique that records the version number of shared objects accessed by
each thread for debugging parallel programs. It relies on a protocol called CREW that regu-
lates threads concurrent-read-exclusive-write on shared objects to reduce recording overhead.
To avoid the overhead of recording memory races, RecPlay [106] and Kendo [97] provide de-
terministic multi-threading of concurrent programs that are perfectly synchronized using locks. Un-
fortunately, most real world concurrent applications may contain benign or harmful data races,
making these approaches unattractive. Though RecPlay and Kendo both use a data race detector
during replay to ensure deterministic replay up until the first race, they suffer from the limitation
that they cannot replay past the data race. For instance, while debugging using a replayer, a pro-
grammer might want to understand the after effects of a benign data race, which is not possible
with RecPlay and Kendo. JRapture [119] is a capture/replay tool for observation-based testing.
It captures interactions between a Java program and the system, including GUI, file, and console
inputs, among other types, and on replay it presents each thread with exactly the same input
sequence it saw during capture. DoublePlay [127] and Chimera [69] are two recent techniques
that support low overhead full-program replay. DoublePlay intelligently offloads the recording
processes to extra cores, while Chimera combines static data race analysis with offline profiling
and dynamic checking to provide efficient online recording.
Hardware-assisted approaches Hardware approaches such as DMP [26] make inter-thread
communication fully deterministic by imposing a deterministic commit order among proces-
sors. PSet [141] eliminates untested thread interleavings by enforcing the runtime to follow a
tested interleaving via processor support. Because hardware approaches rely on non-standard
hardware support, they are limited to proprietary platforms. Though DMP [26] also proposes a
software-only algorithm, its overhead is more than 10x. FDR [138] and BugNet [92] are deterministic replay tools for program debugging based on checkpointing schemes and hardware-level
assistance. FDR employs additional hardware to track data races, program I/O, interrupts and
DMA accesses to enable deterministic replay of full system execution from the beginning of
a checkpoint. BugNet focuses on deterministically replaying the instructions executed in user
code and shared libraries by logging the register file content at some point in time and recording
the load values that occur after that point. Both of them require changes to the host operating
system and special hardware support. SMP-ReVirt [28] makes use of hardware page protec-
tion to detect shared memory accesses, aimed at replaying multi-processor virtual machines, but
its overhead can be up to 10x on multi-processors. Rerun [45] exploits episodic memory race
recording to achieve efficient logging (around 4B per 1000 instructions), while DeLorean [83]
promises much smaller log sizes and higher replay speeds by recording the total sequence of chunk commits.
2.4.1.2 Offline Search and Deterministic Multithreading
PRES [100] and ODR [3] are two replay solutions that use partial recording and offline search
for the reproduction of concurrency bugs. PRES proposes a novel technique that uses a feed-
back replayer to explore thread interleavings, which reduces the recording overhead at the price
of more replay attempts. ODR proposes a new concept, output-deterministic replay, that fo-
cuses on replaying the same program output, and relies on offline inference to help recording
less information online. ESD [143] further reduces runtime tracing overhead by symbolically
exploring the complete thread scheduling decisions via execution synthesis. Weeratunge et al.
[136] present an approach to generate a failure inducing schedule by comparing the core dumps
offline, leveraging an execution indexing technique [137].
There are also several research efforts to make concurrent programs data-race-free by construction and deterministic by default. In this direction, there have been language design approaches
[11, 14] as well as hardware ones [26, 141]. For example, languages such as DPJ [14] guar-
antee deterministic semantics by providing a type and effect system to perform compile-time
type checking. The problem with language level approaches is that they often require nontrivial
programmer annotations or have a limited class of concurrency semantics.
2.4.2 Concurrency Bug Detection
2.4.2.1 Static and Dynamic Program Analyses
Researchers have proposed a large body of dynamic or static techniques for concurrency defect
analysis. Eraser [109] first proposed the lockset-based approach for dynamic race detection.
Atomizer [34] uses Lipton’s reduction theory combined with the lockset algorithm to detect
atomicity violations dynamically. Lockset-based algorithms have also been extended by RacerX
[30] for static race and deadlock detection. Many techniques based on the happens-before re-
lation [66] have also been proposed for detecting concurrency defects. Farzan et al. [32] use happens-before to statically detect causal atomicity. O'Callahan and Choi [96] combine the lockset algorithm and the happens-before approach to dynamically detect races. Chord [88, 89]
uses a staged approach to statically detect data races. AVIO [75] detects atomicity violation
based on access interleaving invariants extracted at run time. MUVI [73] uses data mining tech-
niques to statically detect concurrency bugs based on multi-variable correlations. For detecting
ASVs, Hammer et al. [43] proposed a runtime monitoring technique based on a set of race
automata. The primary limitation of the dynamic techniques is that they can only detect the
defects manifested in a specific concrete execution. On the other hand, while static techniques
can potentially explore all paths to find possible concurrency defects, they typically report many false warnings.
Several hybrid techniques combining static and dynamic analysis also have been proposed for
concurrency defect analysis. CTrigger [99] uses a two phase approach to detect atomicity
violations by controlling program execution to actively exercise low-probability thread inter-
leavings. Velodrome [37] proposed a sound and complete approach for detecting conflict-
serializability violations based on the dependence information extracted from the execution
trace. Narayanasamy et al. [93] use replay analysis to automatically classify benign and
harmful races. The benefit of the hybrid approaches is that they may possess the merits of static
and dynamic analysis at the same time.
Active Testing [57, 59, 64, 98, 110] is a testing technique for concurrent programs proposed
by Sen et al. Given reports of potential concurrency-related defects obtained from
existing analysis tools, such as data races, atomicity violations and deadlocks, active testing
controls a defect-directed random scheduler to expose these defects in the program. Lai et
al. [64] develop AssetFuzzer that effectively exposes real ASVs by combining predictive trace
analysis with randomized active testing. A limitation of active testing is that it may still suffer
from non-determinism, because it utilizes only the partial information of the race pairs or ASV
tuples. To further improve effectiveness, PENELOPE [118] proposes a technique to expose
atomicity violations by re-executing the program under the full atomicity-violating schedules.
The atomicity-violating schedules in PENELOPE are generated using a cut-point based theo-
retical scheduling algorithm that addresses the single variable atomicity problem.
Type system and language based techniques, such as DPJ [14] and Guava [7], are also proposed
for detecting and eliminating concurrency defects offline. The problem with these approaches
is that they often require nontrivial programmer annotations.
Model checking [18, 62, 86, 113, 129] is an alternative way to find bugs in concurrent programs.
By exhaustively exploring the thread scheduling space, these techniques can also report counterexamples
for the detected concurrency defects. For example, CHESS dynamically explores the thread
scheduling decisions to expose concurrency bugs using a context-bounded approach. Shacham
et al. [113] also use a model checker to construct witnesses for data races reported by the
lockset algorithm. Unfortunately, due to the exponential size of the search space, it is hard for
them to scale to large programs without compromising the detection capability. PCT [19] and
PPCT [87] further improve the effectiveness of CHESS by exploring the schedules in a random
fashion with probabilistic guarantees of detecting concurrency bugs.
2.4.2.2 Trace-based concurrent program analysis
A large body of recent research focuses on predictive trace analysis (PTA) of concurrent programs.
Sen et al. [111] proposed a generalized predictive analysis technique for detecting violations of
safety properties. Wang et al. [134] proposed the reduction-based and block-based algorithms
for checking atomicity on the execution trace. Chen et al. [22] presented a framework for pre-
dictive analysis of concurrent Java programs. Lai et al. [64] combined PTA with randomized
active testing [110] to detect ASVs in a run. A common difficulty in these techniques is that
they do not scale as the size of executions increases.
To alleviate the scalability problem of PTA, Farzan et al. [33] developed a meta-analysis model
that produces an efficient algorithm for checking atomicity violations in programs that obey the
nested locking discipline. The algorithm works in time linear in the length of the runs, and
quadratic in the number of threads, and was also used in PENELOPE [118] for testing and
debugging atomicity violations.
Symbolic analysis Wang et al. [128, 130, 131] developed a symbolic analysis model for find-
ing concurrency errors, such as atomicity violations, based on the execution trace. The model
encodes the causal dependencies between events, the program control structure, and the prop-
erty of concurrency errors in a uniform way using symbolic constraints and calls a satisfiability
solver to verify the existence of property violations. This approach can statically check whether
a property holds in all feasible permutations of events in the given execution trace. However,
it still faces the inherent challenge of a huge search space and is hard to scale to large traces.
Moreover, although the symbolic model is able to exhaustively verify the feasibility of sched-
ules, it is not clear how to efficiently generate a witness that manifests the detected concurrency
errors using this approach.
2.4.3 Surviving Concurrency Bugs
Atomicity violation fixing A recent advance by Jin et al. [56] proposes an automated technique
that fixed six out of eight real atomicity violation bugs, using sophisticated static analysis com-
bined with dynamic monitoring to resolve deadlocks. Weeratunge et al. [135] also present a lock
based approach to effectively suppress concurrency errors by enforcing the atomicity property
observed from good executions. Synchronization is a general way to fix concurrency bugs; nevertheless, a drawback of adding synchronization is that it may incur high runtime overhead.
Runtime approaches A line of active research [25, 76, 103, 105, 126, 142] proposes detecting
and surviving concurrency bugs at runtime. ISOLATOR [103] makes the execution of a buggy
program more robust by isolating the well-behaved threads from ill-behaved ones. ToleRace
[105] detects and tolerates asymmetric races in lock-based programs through replication. Atom-
Aid [76] proposes a hardware architecture to reduce the possibility of atomicity violations. Yu
and Narayanasamy [142] uses hardware transaction to constrain the program execution to tested
interleavings. More recently, Veeraraghavan et al. propose a system called Frost [126] that sur-
vives data races by running multiple replicas with complementary schedules. Cui et al. develop
PEREGRINE [25], which generalizes reusable schedules to more inputs by computing path constraints.
Chapter 3
Multiprocessor Deterministic Replay
The technique of deterministic record and replay aims at faithfully reproducing an earlier pro-
gram execution. For concurrent programs, it is one of the most important techniques for program
understanding and debugging. State-of-the-art deterministic replay techniques face chal-
lenging efficiency problems in supporting multi-processor executions due to the unoptimized
treatment of shared memory accesses. We propose LEAP: a deterministic record and replay
technique that uses a new type of local order w.r.t. the shared memory locations and concurrent
threads. Compared to previous work, our technique records much less information without los-
ing replay determinism. The correctness of our technique is underpinned by formal models and a
replay theorem that we have developed. Through our evaluation using both benchmarks and real
world applications, we show that LEAP is more than 10x faster than conventional global-order
based approaches and, in most cases, 2x to 10x faster than other local-order based approaches.
Our recording overhead on the two large open source multi-threaded applications Tomcat and
Derby is less than 10%. Moreover, LEAP is able to deterministically reproduce 7 out of 8 real
bugs in Tomcat and Derby, 13 out of 16 benchmark bugs in the IBM ConTest benchmark suite, and 100% of randomly injected concurrency bugs.
3.1 Introduction
One of the most effective ways for combating concurrency bugs is the technique of record and
replay [3, 23, 28, 39, 45, 68, 83, 84, 91, 97, 100, 106, 108]. The record and replay technique aims
at fully reproducing the problematic execution of concurrent programs, thus giving programmers
both the context and the history information to dramatically expedite the debugging process.
A crucial design factor in record and replay solutions is the degree of recording fidelity, i.e.,
the amount of data to be recorded, for the sufficient reproduction of problematic program ex-
ecutions. Simply speaking, the degree of recording fidelity is proportional to the degree of
faithfulness in replay. This characteristic is less problematic for hardware-based record and
replay solutions [45, 83, 84, 92, 138], in which special chips share the cost of the recording
computation. For the software-only solutions [91, 108] on uni-processors, the replay of concur-
rent programs can be achieved with low overhead by capturing the thread scheduling decisions.
However, for software-only solutions on multi-processors, making the best trade-off between
how much to record and how faithful to replay is still a very challenging problem, drawing
intense research attention [3, 23, 39, 68, 97, 100, 106].
Our research is also concerned with software-only record and replay solutions. Our gen-
eral observation is that the state of the art does not achieve both recording efficiency and re-
play determinism. Conventional deterministic multi-processor replay techniques usually incur
a significant runtime overhead of 10x to 100x [23, 26, 28, 68], making them unattractive for
production use or even for testing purposes. For instance, Dejavu [23] is a global clock based
approach that is capable of deterministically replaying concurrent systems on multi-processors
by assigning a global order to all “critical events”, including both the synchronization points and
the shared memory accesses. As indicated by the authors, the enforcement of the global order on
variable accesses across multiple threads incurs a large runtime overhead on multi-processors.
The research of lightweight record and replay techniques [3, 39, 97, 100, 106] has success-
fully lowered the recording overhead, but at the cost of sacrificing determinism. JaRec [39] and
RecPlay [106] abolish the idea of global ordering and use Lamport clocks [66] to maintain partial thread access orders w.r.t. only the monitor entry and exit events, thus making the recording
process lightweight. However, without tracking the shared memory accesses, their approaches
cannot deterministically reproduce problematic runs because a large majority of shared memory
accesses are not synchronized, either due to programming errors or because they are harmless
[93].
As also pointed out in [107], to deterministically replay a concurrent system on multi-processors,
it is necessary to record the thread access orders of the shared memory locations, a method com-
monly believed to be too expensive to be practical [3, 39, 97, 100, 106]. In this work, we demon-
strate that it is possible to achieve efficiency in this approach by observing that, given the same
program input, it is sufficient to deterministically replay the program execution by recording
partial thread access information local to the individual shared variables. Based on this obser-
vation, we have designed and implemented LEAP, a replay tool that provides both recording
efficiency and replay determinism. The replay determinism is underpinned by a semantic model
and formal theorems. To achieve efficiency, we use a field-based approach to statically identify
shared variables, thus, avoiding the cost of runtime identification. In addition, we make exten-
sive use of static analysis to provide a close approximation of the necessary program locations
that need to be monitored and, thus, to prune away a large percentage of otherwise redundant
recording operations.
The idea of the local-order based recording can be traced back to InstantReplay [68], which
enables the deterministic replay by recording the access history of all the shared objects w.r.t. a
particular thread. This technique does not suit our design objectives of being both deterministic
and efficient. First, InstantReplay requires the unique identification of shared objects dynami-
cally, a task hard to efficiently and correctly implement in practice. Second, InstantReplay uses
a complex computation model based on the CREW protocol, making the recording process very
costly. Third, there are important soundness issues with the local-order based approaches that
must be formally proved. Another local-order based approach is the use of Lamport clocks to track the partial order of critical events that each thread sees [39, 106]. Our technique tracks the order of thread accesses that each shared variable sees, which is operationally simpler than the use of Lamport clocks.
We evaluate the runtime performance of LEAP by comparing to the related techniques in-
cluding global clock, InstantReplay, and Lamport clock. Our micro-benchmark shows that
LEAP is more than 10x faster than the global clock based approach, more than 5x faster than
InstantReplay, and at least 2x faster than the use of Lamport clock. On real world large open
source multi-threaded applications such as Tomcat and Derby, LEAP is 5x to 10x faster than
the related approaches. The average runtime overhead of LEAP is less than 10% on Tomcat and
Derby. Moreover, LEAP is able to deterministically reproduce 7 out of 8 real concurrency bugs
in Tomcat and Derby, 13 out of 16 benchmark bugs in IBM ConTest benchmark suite [31], and
100% of the randomly injected concurrency bugs.
The rest of this chapter is organized as follows: Section 3.2 presents the technical details of
LEAP; Section 3.3 presents the semantic model and proofs; Section 3.4 describes the imple-
mentation of LEAP; Section 3.5 evaluates LEAP; Section 3.6 summarizes this chapter.
3.2 LEAP: Local-Order Based Deterministic Replay
LEAP provides a general technique for deterministic replay of concurrent programs on multi-
processors. We define replay determinism as the faithful reenactment of all program state tran-
sitions experienced by a previous execution. A more complete and formal model is presented
in Section 3.3. The main idea of LEAP is that each shared variable tracks the order of thread
accesses it sees during execution.
3.2.1 LEAP Overview
We first use a simple example to show the main technique of LEAP and contrast it with the conventional global-order based approach to deterministic replay. In Figure
1.1 (left), we show a race condition that triggers an ERROR at line 4 following the interleaved
execution order <1,5,2,6,7,3,4>. The global-order based approaches record this schedule
and use it to re-execute the program at the cost of six global synchronization operations. Our
observation is that not all thread accesses to different shared variables need to be tracked. Instead
of enforcing a global order, we claim that it is sufficient to record the thread access order that
each shared variable sees. In our example, instead of the global order vector, we use two access
vectors (x.vec and y.vec) for the shared variables x and y and record <t1,t2,t1> and
<t2,t1,t2> respectively. We require zero global synchronization operations and two groups
of local synchronization operations executed in parallel. During replay, we associate x and y
with conditional variables to enforce that the access order of threads is identical to what was
recorded in their respective access vectors.
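The following minimal Java sketch (illustrative only; the names and the structure are simplified from LEAP's actual instrumentation) conveys the recording idea: each SPE owns its own access vector, so accesses to the same SPE are serialized while accesses to different SPEs proceed in parallel.

import java.util.ArrayList;
import java.util.List;

class AccessVector {
    private final List<Long> threadIds = new ArrayList<Long>();

    // Serializes accesses to this SPE only; the recorded sequence is
    // exactly the order of thread accesses this shared variable sees.
    synchronized void record(long threadId) {
        threadIds.add(threadId);
    }
}

class Recorder {
    static final int NUM_SPES = 2;     // e.g., x -> 0, y -> 1
    static final AccessVector[] vectors = new AccessVector[NUM_SPES];
    static {
        for (int i = 0; i < NUM_SPES; i++) {
            vectors[i] = new AccessVector();
        }
    }

    // Invoked before each access to the SPE with index speId.
    static void accessSPE(int speId) {
        vectors[speId].record(Thread.currentThread().getId());
    }
}

Note that no global synchronization is involved: two threads touching different SPEs never contend on the same lock.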
Although our technique can be easily illustrated, to ensure determinism and efficiency, there are
many tough challenges that we must tackle:
1. Static shared variable localization. How to effectively locate shared variables statically?
What will happen if we miss some shared variables, or some local variables are mistakenly
recognized as shared?
2. Consistent shared variable and thread identification across runs. How to match the identities
of shared variables and of threads between the recording run and the replay run? For example,
the deterministic replay would fail if the shared variable x at record is incorrectly recognized as
y at replay, or the thread t1 is mistakenly recognized as t2.
3. Non-unique global order. Keen readers may point out that, by only recording the thread
access orders each variable sees, LEAP will permit a global thread schedule that is different
from the recording run. For instance, in our example, LEAP also permits the global order
<5,1,2,6,7,3,4>. Will this affect the faithfulness of the replay?
In the rest of the section, we focus on discussing the first two issues. The soundness of our
approach associated with the third issue is fundamental to our technique. In Section 3.3, we
provide a formal semantic model and proofs to show this phenomenon does not affect the faith-
fulness of the replay.
3.2.2 Locating Shared Variable Accesses
Precisely locating shared variables is generally undecidable [15]. We therefore compute a complete over-approximation using a static escape analysis in the Soot framework (http://www.sable.mcgill.ca/soot) called ThreadLocalObjectAnalysis [42]. ThreadLocalObjectAnalysis provides on-demand answers to whether a variable can be accessed by multiple threads simultaneously or not.
class Account {
    int balance1;                      // SPE name: Account.balance1, index 1
    int balance2;                      // SPE name: Account.balance2, index 2

    getBalance1 {                      // original version
        tmp = balance1;
        return tmp;
    }

    setBalance2 {
        ...
        balance2 = value;
    }
}

getBalance1 {                          // transformed (record) version
    thread_id = getThreadId();
    get_lock(1);
    accessSPE(thread_id, 1);
    tmp = balance1;
    release_lock(1);
    return tmp;
}

FIGURE 3.1: The instrumentation of SPE accesses
However, there are a few important issues with this analysis. First, static analysis is inherently conservative, as local
variables might be reported as shared. We show in Section 3.3 (Corollary 3.3) that this type of
conservativeness does not affect the correctness of the deterministic replay. Second, Thread-
LocalObjectAnalysis does not distinguish between read and write accesses. Shared immutable
variables, whose values never change after initialization, need not be tracked, for they cannot
cause nondeterminism. Third, we discover that static variables are all conservatively reported
as escaped in ThreadLocalObjectAnalysis. Since static variables might also be accessed only by
one thread, we wish to analyze them in the same way as the instance variables, in order to obtain
a more precise result. Thus, we make two enhancements to the ThreadLocalObjectAnalysis: 1.
we further refine the analysis results of ThreadLocalObjectAnalysis so that we do not record
accesses to shared immutable variables; 2. we modify ThreadLocalObjectAnalysis to treat static
variables in the same way as instance variables.
3.2.3 Field-based Shared Variable Identification
For Java programs, since standard JVMs do not support consistent object identification
across runs, we cannot use the default object hash-code. We use a static field-based shared
variable identification scheme, applied to the following three categories of variables, which
are collectively referred to as the shared program elements (SPE): 1. variables that serve as
monitors; 2. class variables; 3. thread escaped instance variables. These SPEs include both Java
monitors and shared field variables that may cause nondeterminism. SPEs are uniquely named
as follows: for category 1, it is the name of the declaring type of the object variable; for category
2 and 3, it is the variable name, combined with the name of the class in which the variable is
declared.
After obtaining all the SPEs in the program, LEAP assigns offline to each SPE a numerical index
as its runtime identifier. For example, in Figure 3.1, suppose the two field variables balance1 and balance2 of the Account class are identified as shared; they are then mapped to the numerical
IDs 1 and 2.
The static field-based shared variable identification remains consistent across runs and does not
incur runtime overhead. Moreover, compared to the object level identification approaches [68],
this approach is more fine-grained as different fields of the same object are mapped to different
indices. Consequently, accesses to different fields of the same object do not need to be serialized
at runtime.
There are a few issues with our field-based shared variable identification. First, our approach
does not statically distinguish between different instances of the same type. As a result, accesses
to the same shared field variable of different instances of the same type would be serialized
and recorded into the same access vector. For this concern, we formally prove in Section 3.3
(Corollary 3.4) that the deterministic replay is also guaranteed, if the thread accesses to different
shared variables are recorded globally into a single access vector. Second, we cannot uniquely
identify scalar variables that are aliases of shared array variables. To deal with this issue, we
perform an alias analysis for all of the scalar array variables in the program and represent all
the aliases with the same SPE, ignoring the indexing operations. This treatment guarantees that
the nondeterminism caused by array aliases can be correctly tracked, however, at the cost of
reducing the degree of concurrency. Fortunately, in our experiment, we find very few such cases
in large Java multi-threaded applications. A good object-oriented program rarely manipulates
shared array data directly, so such arrays rarely escape.
3.2.4 Unique Thread Identification
Since thread identity is the only information recorded into the access vectors, we must make
sure that a thread at the recording phase is correctly recognized during replay. A naive way is to
keep a mapping between thread name and thread ID during recording and use the same mapping
for replay. However, different parent threads can race with each other when creating their child
threads. Therefore, the thread ID assignment is not fixed across runs.
We take a similar approach to that in JRapture [119] to identify threads and their children. The
key observation is that each thread should create its children threads in the same order, though
there may not exist a consistent global order among all threads. We therefore create a consistent
identification for all threads based on the parent-children order relationship. More specifically,
starting from the main thread (T0), each thread maintains a thread-local counter for recording
the number of children it has forked so far. Every time a new thread is forked, it is identified by its parent's thread ID combined with the counter value. For instance, if a thread ti forks its jth child thread, this child thread is identified as ti:j.
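A minimal Java sketch of this identification scheme (a hypothetical helper, not LEAP's actual implementation) is shown below; a wrapper Runnable created at fork time can carry the child's ID from the parent to the child.

class ThreadIdentifier {
    // Number of children this thread has forked so far.
    private static final ThreadLocal<Integer> childCount = new ThreadLocal<Integer>() {
        protected Integer initialValue() { return 0; }
    };
    // This thread's consistent identifier; the main thread is "0".
    private static final ThreadLocal<String> myId = new ThreadLocal<String>() {
        protected String initialValue() { return "0"; }
    };

    // Called by the parent at each fork: the j-th child of thread ti
    // receives the identifier "i:j", independent of scheduling.
    static String nextChildId() {
        int j = childCount.get() + 1;
        childCount.set(j);
        return myId.get() + ":" + j;
    }

    // Called on the child thread before it runs any user code.
    static void assignId(String id) {
        myId.set(id);
    }
}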
3.2.5 Handling Early Replay Termination
Our local-order based approach permits different global schedules for threads that do not affect
each other’s program states. One caveat of this approach is that it gives rise to the possibility
of early termination: a program crash action might occur earlier in the replay execution, thus,
making the replayed run not fully identical to the recording run in terms of its behavior. To faith-
fully replay all the thread execution actions, we ensure that every thread in the replay execution
performs the same number of SPE accesses as it does in the recording execution. Consequently,
we guarantee that the replay execution does not terminate until all the recorded actions in the
original execution are performed, thus making the final state of the replayed execution the same
as that of the original one.
3.3 A Theorem of Local Ordering
In this section we formally prove the soundness of our local-order based approach for determin-
istic replay. We also use two corollaries to show the soundness of the field-based shared variable
identification approach and the soundness of using an unsound but complete static escape anal-
ysis for deterministic replay.
Recall the execution model described in Section 2.1. The action sequence ⟨αk⟩ of a program
execution is called an execution schedule denoted by δ. Suppose there is an execution schedule
δ of size N that drives the program state to ΣN , our goal is to have another execution schedule
δ′ that is able to produce the same program state as ΣN . Obviously, this can be achieved if δ′ = δ
holds. However, this is too strong a condition. We show a relaxed and sufficient condition based
on the access vectors of all the shared variables. To state precisely, let δs be the sequence of
actions w.r.t. a shared variable s projected from δ, τs be the sequence of thread identifiers picked
out from δs, and τ be the mapping from s to τs for all s ∈ S (τ represents the access vectors of all the
shared variables), we prove:
Theorem 3.1. Under the execution semantics defined in Section 2.1, two execution schedules δ
and δ′ of the same concurrent program have the same final state ΣN = Σ′N if Σ0 = Σ′0 ∧ τ = τ′.
The core of the proof is to prove the following lemma:
Lemma 3.2. For any action α′k (k ≤ N ) in the replay execution δ′, suppose it is the pth action
on a shared variable s, then α′k is equal to the pth action on s in the original execution δ.
For two actions to be equal here, they need to read and write the same values, not just do the
same operation on the same shared variable. Next, we first define a notion of “happened-before”,
and then we prove Lemma 3.2 using this notion.
Consider the “happened-before” order of the original execution. The “happened-before” rela-
tion is defined as follows:
(a) If action αi immediately precedes action αj in the same thread, then αi happened-before
αj ;
(b) If action αi and action αj by different threads are consecutive actions on a shared variable
s, without any intervening actions on s, then αi happened-before αj ;
(c) The “happened-before” is reflexive and transitive.
More accurately, rules (a) and (b) define “happened-immediately-before” and “happened-before”
is the reflexive transitive closure of “happened-immediately-before”.
Proof. Let the “happened-before” tree of an action be the tree of all the actions that “happened-before” it. We prove Lemma 3.2 by induction on the depth of the “happened-before” tree.
Base case: Consider an action on the shared variable s, with a “happened-before” tree of depth 1.
This means that the current action does not depend on anything that happened-before it involving
shared variables. Because the first action on a shared variable is performed by the same thread
in both the original and the replay execution, and because that thread is deterministic, the replay
action should be identical to the one in the original execution.
Induction: Now assuming that Lemma 3.2 holds for all actions with happened-before depth
≤ n, we prove it for n + 1. Consider an action αi on a shared variable s, where αi has a
tree of happened-before depth n + 1. Let’s say αi is the pth action on s. The (p-1)th action
on s has a lower happened-before depth so it is an equal action in both the original and the
replay execution. Additionally, every action αj that “happened-immediately-before” αi has a
happened-before tree of depth n, therefore it is equal to a similarly numbered action in the
original execution (i.e., if αj is the kth action on a shared variable v, then αj is equal to the kth
action on v in the original execution). Now action αi only depends on all the αj actions. So,
since our approach enforces that the pth action on s is performed by the same thread in both
executions, and since the thread is deterministic and every value that αi can depend on is equal in the two executions, it follows that action αi is also equal in the original and replay executions.
Lemma 3.2 is proved. If we apply Lemma 3.2 to the last action α′N in the replay execution, we
can get Σ′N = ΣN . Thus, Theorem 3.1 is proved.
With Theorem 3.1, we have proved the soundness of local-order based approaches for the de-
terministic replay that is able to reach the same program state as the original execution, by only
recording the access vectors for all the shared variables.
While τ = τ ′ is a rather relaxed condition, we can surely add more information that also guar-
antees the deterministic replay. For example, if the local variable accesses are recorded, the
deterministic replay is still guaranteed as long as we do not miss any shared variable accesses.
In the following, we derive two corollaries:
Corollary 3.3. The deterministic replay holds as long as τ = τ ′, regardless of whether accesses
to local variables are recorded or not.
Corollary 3.4. Recording different shared variable accesses into a single access vector does
not affect the correctness of the deterministic replay.
As noted in Section 3.2.2, the static escape analysis is conservative such that local variables
might be mistakenly categorized as shared. Corollary 3.3 ensures that this conservativeness
does not affect the correctness of the deterministic replay as long as all the shared variables are
correctly identified. Corollary 3.4 is easy to understand as the thread access orders on different
shared variables can be considered as a global order on a single variable abstracted from these
shared variables. To be more clear, assuming all thread accesses are recorded into a global
access vector, it is a global order of the execution schedule; hence, the determinism must hold.
As noted in Section 3.2.3, Corollary 3.4 ensures the soundness of our field-based shared variable
identification.
3.4 LEAP Implementation
We have implemented LEAP using the Soot framework. Figure 3.2 shows the overview of the
LEAP infrastructure, consisting of the transformer, the recorder, and the replayer. The trans-
former takes the bytecode of an arbitrary Java program and produces two versions: the record
version and the replay version. Started by a record driver, LEAP collects the access vector for
each SPE during the execution of the record version. When the recording stops, LEAP saves
both the access vectors and the thread creation order information and generates a replay driver.
To replay, the LEAP replayer uses the generated replay driver as the entry point to run the replay
version of the program, together with recorded information. The replayer takes control of the
thread scheduling to enforce the correct execution order of the threads w.r.t. the SPEs. We now
introduce each of the components in turn.
[Figure: architecture diagram. The Transformer (SPE Locator, SPE Access Instrumentor, Record Version Generator, Replay Version Generator) takes the original program and produces the record and replay versions; the Recorder (SPE Access Recorder, Thread Creation Order Recorder) runs the record version and emits the access vectors, the thread creation order, and the replay driver; the Replayer (Trace Loader, Thread Scheduler) uses these to drive the replay version.]
FIGURE 3.2: The overview of the LEAP infrastructure
3.4.1 The LEAP Transformer
The LEAP transformer performs the instrumentation on Jimple, an intermediate representation
of Java bytecode in the three-address form. For the record version, after locating all the SPEs in
the program, the transformer visits each Jimple statement and performs the following tasks:
Instrumenting SPE accesses If the SPE is not a Java monitor object, we insert a LEAP moni-
toring API invocation before the Jimple statement to collect both the thread ID and the numeric
SPE ID. Both the API call and the SPE access are wrapped by a lock specific to the accessed
SPE to ensure that we collect the right thread access order seen by the SPE. If the SPE is a
Java monitor object, we insert the monitoring API call after the monitorenter and before the
monitorexit instructions. The API call is also inserted before notify/notifyAll/thread
start operations and after wait/thread join operations. Figure 3.1 shows a source-
code equivalent view of the instrumentation on the read/write accesses to the shared field vari-
ables. The box on the left shows the original method getBalance1, inside of which the
shared variable balance1 is read. The box on the right shows the transformed version of
getBalance1. For multiple shared variable accesses in a method, the thread ID needs to be obtained only once. Also, to remove unnecessary recording overhead, we do not instrument SPEs that are always protected by the same monitor.
Instrumenting recording end points To enable the deterministic replay, we insert the record-
ing end points to save the recorded runtime information and to generate the replay driver. Cur-
rently, LEAP supports three types of recording end points. First, we add a ShutDownHook to
the JVM Runtime in the record driver as a recording end point. When the program ends, the
ShutDownHook will be invoked to perform the saving operations. Second, we insert a try-
catch block into the main thread and the run method of each Java Runnable class. We then
add a method invocation in the catch block to capture the uncaught runtime exceptions as the
recording end points. Third, LEAP also supports the user specified recording end points by
allowing the annotation-based specification of end points. During the traversal of the program
statements, the transformer will replace the annotation with a method invocation, indicating the
end of recording.
To generate the replay version, the transforming process is largely identical to the record ver-
sion with a few differences: 1. since the order of synchronization operations on each SPE is
controlled by the LEAP replayer during replay, we need to insert the API call before the original
synchronization operations in the program, i.e., monitorenter and wait, to avoid deadlock;
2. the inserted API call is bound to a different implementation from the one used during the
recording phase; 3. since we need to ensure that the replay execution does not terminate until all
recorded actions in the original execution have been executed (See Section 3.2.5), we insert ex-
tra API invocations after each SPE access so that we can check whether a thread has performed
all its recorded actions in the original execution or not.
3.4.2 The LEAP Recorder
When executing the record version of the target program, the LEAP monitoring API will be
invoked on each critical event to record the ID of the executing thread into the access vec-
tor of the accessed SPE. To reduce the memory requirement, we use a compact representation
of the access vectors by replacing consecutive and identical thread IDs with a single thread
ID and a corresponding counter. For example, suppose the access vector of a SPE contains
<t1,t1,t2,t2,t2>, it is replaced by <t1,t2> and a corresponding counter <2,3>. This
compact representation produces much smaller log size compared to the related approaches in
our experiment. When a new thread is created, its ID is computed according to our consistent
thread idenfication method. Once a program end point is detected, the LEAP recorder will then
save the recorded data, i.e, the recorded access vectors, and the thread creation order list, and
generate the replay driver.
3.4.3 The LEAP Replayer
The LEAP replayer controls the scheduling of threads to enforce a deterministic replay using
both the access vectors and the thread identity information. To enable the user level thread
scheduling, the replayer associates each thread in replay with a semaphore maintained in a
global data structure, so that each thread can be suspended and resumed on demand.
To replay, the replay driver first loads the saved access vectors and starts executing the replay
version of the program. Before each SPE access, the threads use their semaphores to coordinate
with each other in order to obey the access order defined in the access vector of the SPE. Also, to
make sure that the replay execution does not terminate “early”, the thread also counts the total
number of SPE accesses it has performed so far after each SPE access. The thread suspends
itself if it finds that it has already executed all its SPE accesses in the original execution, as
recorded in the access vector, until all threads have finished their recorded actions. Since the
threads accessing different SPEs can execute in parallel, the replaying process is also faster than
that of a global-order scheduler, which can only execute one thread at a time.
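The following Java sketch conveys the per-SPE coordination (illustrative only: it uses a single monitor with wait/notifyAll instead of LEAP's per-thread semaphores, takes LEAP's consistent thread identifier as a parameter, and elides the end-of-vector bookkeeping):

class SpeReplayer {
    private final long[] accessVector;  // recorded thread IDs, in order
    private int next = 0;               // index of the next expected access

    SpeReplayer(long[] accessVector) {
        this.accessVector = accessVector;
    }

    // Called before each access to this SPE in the replay version: the
    // calling thread blocks until the access vector says it is its turn.
    synchronized void beforeAccess(long threadId) throws InterruptedException {
        while (next < accessVector.length && accessVector[next] != threadId) {
            wait();
        }
    }

    // Called after each access: advance the turn and wake waiting threads.
    synchronized void afterAccess() {
        next++;
        notifyAll();
    }
}

Because each SpeReplayer instance guards only one SPE, threads accessing different SPEs never block each other.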
3.5 Evaluation
3.5.1 Evaluation methodology
We assess the quality of LEAP by quantifying both its recording overhead and the correctness of
the deterministic replay. To properly compare our technique to the state of the art, we have also
implemented the following techniques: the Dejavu approach based on the global clock [23], the
technique presented by InstantReplay [68], and the JaRec approach based on the Lamport clock
[66]. Because none of these tools are publicly available, we faithfully implemented them ac-
cording to their representative publications. Since JaRec is not a deterministic replay technique,
we extended its capability to track shared memory races, in order to make it comparable to
our technique.
For the evaluation, we first design a micro-benchmark to conduct controlled experiments for
quantifying various runtime characteristics of the evaluated techniques. We then use real com-
plex Java server programs and third-party benchmarks to assess the recording overhead of LEAP
in comparison to the related approaches. We use bug reproducibility to verify if our technique
can faithfully and deterministically reproduce problematic concurrent runs. All experiments are
conducted on two 8-core 3.00GHz Intel Xeon machines with 16GB memory and Linux version
2.6.22. We now present these experiments in detail.
[Figure: plot of execution time (ms) versus number of SPEs for Base, LEAP, Lamport, Global, and Instant]
FIGURE 3.3: The runtime characteristics of LEAP and other techniques on our micro-benchmark, with the number of SPEs ranging from 1 to 500. The micro-benchmark starts 10 threads running on 8 processors.
3.5.1.1 Micro-benchmarking
We designed a micro-benchmark to quantify the runtime characteristics of LEAP and the related
record and replay techniques. The benchmark consists of concurrent threads that randomly
update shared variables in a loop. For each experiment, we can control the number of threads
and shared variables. In our experiments, we set the number of threads from 1 to 100, and the
number of shared variables from 1 to 1000. We then measure the time needed for all the threads
to finish a fixed total number of updating operations under different settings.
Figures 3.3 and 3.4 show the runtime characteristics of LEAP and the related techniques on
our micro-benchmark. In the figures, Base refers to the native execution. Global, Lamport
and Instant refer to the recorded execution using global clock, Lamport clock and InstantReplay
respectively. Figure 3.3 shows that the performance of the LEAP instrumented version is close to
the base version. Fixing the number of threads to 10, as the number of SPEs increases from 10
to 500, LEAP is more than 10x faster than global clock, more than 5x faster than InstantReplay,
and at least 2x faster than Lamport clock. Global clock is the slowest among the four techniques.
The main reason is that the use of global clock requires a global synchronization on every shared
variable access, which significantly affects the degree of concurrency. Figure 3.4 shows a similar
performance trend as the number of threads increases from 10 to 80 and the number of SPEs is
fixed to 1000.
[Figure: plot of execution time (ms) versus number of threads for Base, LEAP, Lamport, Global, and Instant]
FIGURE 3.4: The runtime characteristics of LEAP and other techniques on our micro-benchmark, with the number of threads ranging from 1 to 80 running on 8 processors. The number of SPEs is set to 1000.
TABLE 3.1: The runtime overhead of LEAP and the state-of-the-art techniques.
Application  LOC    Total  SPE          SPESize  Log     LogCmp  LEAP  Lamport  Instant  Global
Avrora       93K    16003  1725 (11%)   113      30623   796     626%  1697%    1821%    1036%
Lusearch     69K    11497  1140 (9.9%)  75       7485    632     74%   308%     379%     227%
Derby        1.51M  48356  1433 (3.0%)  264      18545   113     9.9%  68%      113%     52%
Tomcat       535K   23046  654 (2.6%)   163      15351   51      7.3%  39%      44%      34%
MolDyn       864    821    634 (77%)    66       110761  37760   64%   2776%    3567%    9960%
MonteCarlo   3128   427    104 (24%)    18       70384   1994    7.5%  7.9%     8.6%     9.1%
RayTracer    1431   442    223 (50%)    19       124239  35878   18%   39%      43%      94%
3.5.1.2 Benchmarking with third-party systems
To perform an unbiased evaluation, we first use LEAP on two widely used complex server
programs, Derby and Tomcat, with the PolePosition (http://polepos.sourceforge.net) database benchmark and the SPECWeb-2005 (http://www.spec.org/web2005) web workload benchmark. Each benchmark starts with 10 threads and we measure the time for finishing a total number of 10000 operations. We also selected a suite of third-party programs, among which Avrora and Lusearch are from the dacapo-9.12-bach benchmark suite (http://dacapobench.org), and MolDyn, MonteCarlo and RayTracer are from the Java Grande multi-thread benchmark suite.
Table 3.1 shows some of the relevant static attributes of the benchmarked programs as well as the
associated runtime overhead of the evaluated record and replay techniques. We report the total
number of field variable accesses in the program (Total), the total number of instrumented SPE
accesses (SPE), the number of SPEs (SPESize), the log size (KB/sec) of the related approaches
(Log), the log size of LEAP (LogCmp), and the runtime overhead (LEAP, Lamport, Instant
and Global). Overall, the percentage of SPE accesses over the total number of field variable
accesses varies from less than 3% on Derby and Tomcat to around 10% on Avrora and Lusearch.
As MolDyn (77%), MonteCarlo (24%) and RayTracer (50%) are relatively small applications
dedicated to multi-threaded benchmarking, the percentage of their SPE accesses is large.
Log size By using our compact representation of the access vectors, the log size of LEAP is
much smaller than the related approaches, from 3x in MolDyn to as large as 164x in Derby.
We recognize that the log size of LEAP is still considerable, ranging from 51 to 37760 KB/sec. With
the increasing disk capacity and disk write performance, as also observed by other researchers
[100], moderate log size does not pose a serious problem. For long running programs, we can
reset logs through the use of checkpoints.
Recording overhead LEAP is the fastest on all the evaluated applications. It is more than 150x
faster than global clock on MolDyn. For Derby and Tomcat, LEAP is 5x to 10x faster than all
the related approaches. The sheer runtime overhead of LEAP on Derby and Tomcat is less than
10% (9.9% and 7.3%, respectively). LEAP’s overhead is large on Avrora (626%); the reason is that several SPEs in Avrora are frequently accessed in hot loops.
3.5.1.3 Concurrency bug reproduction
One of the major motivating forces for the record and replay technique is to help reproduce so-called Heisenbugs. We believe that the ability to deterministically reproduce a concurrency-
related bug is a strong indicator of the replay correctness, because it requires the program state
to be correctly restored for the bug to be triggered. To compare the bug reproducibility, we
have also implemented JaRec. We first compare LEAP and JaRec for their
capabilities of reproducing real-world concurrency bugs in complex server systems as well as
a number of benchmark bugs widely used in concurrency testing. To properly quantify bug reproducibility, we have also designed a bug injection technique that injects atomic-set violations
into our micro-benchmark. We then assess how many of the violations can be deterministically
reproduced by LEAP and JaRec.
TABLE 3.2: LEAP - summary of the evaluated real bugs
Bug Id       Version     LOC    Exception Type
Derby230     Derby-10.1  1.34M  DuplicateDescriptor
Derby1573    Derby-10.2  1.52M  NullPointerException
Derby2861    Derby-10.3  1.51M  NullPointerException
Derby3260    Derby-10.2  1.52M  SQLException
Tomcat728    Tomcat-3.2  150K   NullPointerException
Tomcat4036   Tomcat-3.3  184K   NumberFormatException
Tomcat27315  Tomcat-4.1  361K   ConcurrentModification
Tomcat37458  Tomcat-5.5  535K   NullPointerException
3.5.1.4 Random bug injection
Our bug injection technique is based on the problematic thread interleaving patterns presented
in [125]. We introduce 10 dummy shared variables into the program and divide them into 5
groups, each group representing an atomic set as defined in [125]. During the recording phase,
on each critical event, the thread also randomly performs a write or read access on one of the
introduced variables. We use the same random seed for each thread across record and replay.
After each random access, if one of the problematic thread interleaving patterns occurs, the
program stops and the replay data are exported. Given the same program input, a deterministic
replay technique should be able to recreate the bug pattern that occurred.
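The injection logic can be sketched as follows (an illustrative reconstruction, not our exact
implementation; all names are hypothetical). Each thread draws from a deterministically seeded
generator, so the injected accesses repeat exactly across record and replay:

    import java.util.Random;

    class BugInjector {
        static final int[] dummy = new int[10];  // the 10 dummy shared variables
        // one generator per thread; the seed is fixed and identical across
        // record and replay, so the injected access sequence is reproducible
        static final ThreadLocal<Random> rng =
            ThreadLocal.withInitial(() -> new Random(42L));

        // invoked by the instrumentation on each critical event
        static void onCriticalEvent() {
            Random r = rng.get();
            int v = r.nextInt(dummy.length);     // pick one of the variables
            if (r.nextBoolean()) {
                dummy[v] = v;                    // injected write
            } else {
                int ignored = dummy[v];          // injected read
            }
            // a separate checker then tests whether one of the problematic
            // interleaving patterns of [125] has just been completed
        }
    }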
To compare the concurrency bug reproducibility between LEAP and JaRec, we use 100 different
random seeds to inject 100 concurrency bugs into our micro-benchmark. For each run, we
initialize 10 threads in the program. LEAP is able to deterministically reproduce 100% of these
bugs, while JaRec cannot deterministically reproduce any of them. The reason is that JaRec
does not record shared memory races, while all these bug patterns are generated on shared
memory accesses.
3.5.1.5 Real and benchmark concurrency bugs
Tables 3.2 and 3.3 describe the real concurrency bugs and the benchmark bugs used in our
experiments. All 8 real bugs in Table 3.2 were reported by users and are extracted from the
Derby and Tomcat bug repositories (https://issues.apache.org). The 16 benchmark bugs in
Table 3.3 are from the IBM ConTest benchmark suite [31] and cover the major types of
concurrency bugs, including data races, atomicity violations, order violations, and deadlocks.
We also ran both JaRec and LEAP on these buggy programs to compare their bug reproducibility.
TABLE 3.3: LEAP - summary of the evaluated benchmark bugs
Bug Name            LOC   Bug Description
BubbleSort          362   Not-atomic, Orphaned-Thread
AllocationVector    286   Weak-reality, two stage access
AirlineTickets      95    Not-atomic interleaving
PingPong            272   Not-atomic
BufferWriter        255   Wrong or no-Lock
RandomNumbers       359   Blocking-Critical-Section
Loader              130   Initialization-Sleep Pattern
Account             155   Wrong or no-Lock
LinkedList          416   Not-atomic
BoundedBuffer       536   Notify instead of notifyAll
MergeSort           375   Not-atomic
Critical            73    Not-atomic
Deadlock            135   Deadlock
DeadlockException   255   Deadlock
FileWriter          311   Not-atomic
Manager             236   Not-atomic
For the 8 real-world concurrency bugs, LEAP is able to deterministically reproduce 7 of them
(88%), missing only the bug tomcat4036, while JaRec reproduced none of them. For the 16
benchmark bugs, LEAP can reproduce 13 of them (81%), missing BufferWriter, Loader, and
DeadlockException, while JaRec can only reproduce one of them (Deadlock). The reason
LEAP misses tomcat4036 is that the bug is triggered by races on the internal data of the
underlying JDK library java.text.DateFormat, which LEAP does not instrument. Because all
these real bugs are related to shared memory races, JaRec is not able to reproduce any of
them. Of the three benchmark cases LEAP cannot reproduce, two are related to random numbers,
and the other makes LEAP run out of memory because too many threads (>5000) are involved in
loops.
3.5.2 Discussion
The evaluation results clearly demonstrate the superior runtime performance of LEAP as well
as its much higher concurrency bug reproducibility, compared to existing approaches. Through
our experiments with real-world large multi-threaded applications, we observed several
limitations of LEAP that we plan to address in our future work:
Input nondeterminism As LEAP only captures the nondeterminism brought by thread inter-
leavings, it may not reproduce executions containing input nondeterminism, e.g., programs with
nondeterministic I/O. The two benchmark bugs that LEAP cannot reproduce both contain random
number generators that use the current system time as the random seed. Since the random
numbers are unlikely to stay the same across record and replay unless they are saved, LEAP may
not reproduce executions that contain such nondeterminism. A way to overcome these issues is
to save the program state of key nondeterministic events, e.g., the values of random seeds.
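A minimal sketch of this remedy (illustrative only; not part of the current LEAP implementation,
and SeedLog is a hypothetical helper) is to log each seed at record time and re-install it at
replay time:

    import java.util.Random;

    interface SeedLog {
        void append(long seed);  // record a seed during the recorded run
        long next();             // fetch the next recorded seed during replay
    }

    class ReplayableRandom {
        static boolean replaying;  // set by the record/replay runtime
        static SeedLog seedLog;

        static Random create() {
            long seed;
            if (replaying) {
                seed = seedLog.next();              // reuse the recorded seed
            } else {
                seed = System.currentTimeMillis();  // the nondeterministic seed
                seedLog.append(seed);               // save it for replay
            }
            return new Random(seed);
        }
    }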
JDK library LEAP does not record shared variable accesses in the underlying JDK library.
If an execution contains races on the internal data of these APIs, LEAP might not be able to
reproduce it. The bug tomcat4036 is an example of this limitation. In principle, we could also
instrument the underlying Java runtime, but because the JDK library is used so frequently,
doing so would incur a large runtime overhead. An implementation of LEAP inside the JVM should
alleviate this issue, as the JVM environment enables efficient tracing of the internal data
of the JDK library.
Long running programs LEAP currently has to replay from the beginning of the program
execution. For long running programs, it might not be convenient to replay the whole execution,
given the long replay time and the large log size. A lightweight checkpoint scheme would be
helpful in such scenarios, as LEAP could then replay the program only from the last checkpoint
to the recording end point.
3.6 Summary
We have presented LEAP, a new local-order based approach that deterministically replays con-
current program executions on multi-processors with low overhead. Our basic idea is to capture
the thread access history of each shared variable, and we use theoretical models to guarantee
its correctness. We have implemented LEAP as an automatic program transformation tool that
provides deterministic replay support for arbitrary Java programs. To evaluate our technique,
we used both benchmarks and real-world concurrent applications, extensively quantifying the
runtime overhead of LEAP as well as the correctness of LEAP-based replay through reproducing
concurrency bugs. Our evaluation shows that, compared to the state of the art, LEAP incurs
lower runtime overhead and has a much superior capability of correctly reproducing concurrency
bugs. For the real-world applications we evaluated, the overhead of LEAP is under 10%,
exhibiting great potential for production use.
Chapter 4
Persuasive Prediction of Concurrency Access Anomalies
Predictive analysis is a powerful technique that exposes concurrency bugs in unexercised pro-
gram executions. However, current predictive analysis approaches lack the persuasiveness prop-
erty, as they offer little assistance in helping programmers fully understand the execution
history that triggers the predicted bugs. We present a persuasive bug prediction technique as
well as a prototype tool, PECAN, for detecting general access anomalies (AAs) in concurrent
programs. The main characteristic of PECAN is that, in addition to predicting AAs in a more
general way, it generates concrete executions that deterministically expose the predicted AAs.
The key ingredient of PECAN is an efficient offline schedule generation algorithm, with a proof
of soundness, that guarantees to generate a feasible schedule for every real AA in programs
that use locks in a nested way. We evaluate PECAN using twenty-two multi-threaded subjects,
including six large concurrent systems, and our experiments demonstrate that PECAN is able to
effectively predict and deterministically expose real AAs. Several serious and previously
unknown bugs in large open source concurrent systems were also revealed in our experiments.
4.1 Introduction
Access anomalies (AAs) are a class of concurrency bugs characterized by criteria such as data
races [109], atomicity violations [34], and atomic-set serializability violations (ASVs) [125].
Among the broad spectrum of concurrency bug detection techniques that have proliferated in
recent years [15, 17, 35, 37, 43, 58, 64, 79, 80, 86, 89, 98, 99, 110, 115, 118], the technique of
predictive trace analysis (PTA) has drawn significant research attention [22, 33, 128, 130, 131,
133].
Generally speaking, a PTA technique records a trace of execution events, then statically
(often exhaustively) generates other permutations of these events under certain scheduling
constraints, and exposes concurrency bugs unseen in the recorded execution. PTA is powerful
because, compared to dynamic analysis, it is capable of exposing bugs in unexercised executions
and, compared to static analysis, it incurs far fewer false positives, since its static
analysis phase uses the concrete execution history.
A bug detection technique is more useful if it is persuasive. This new criterion emphasizes
that a bug detection technique should not only localize the bug in the source code but also,
more importantly, help programmers fully understand how the bug occurred, so that they can
provide good fixes (a recent report [140] shows that as many as 39% of concurrency bug fixes
are bad fixes, either failing to fix a bug or creating new bugs). We characterize
persuasiveness by two key properties. First, a persuasive technique should report violations
with no false positives. Since it is non-trivial to manually verify false alarms in large
sophisticated concurrent systems, the perceived usefulness of the technique quickly
deteriorates with even a small number of false positives. Second, a persuasive technique
should also show programmers how the detected bugs or violations can occur, by accompanying
each violation with a concrete execution that deterministically exposes the bug. We believe
that allowing programmers to deterministically trigger the bug is one of the most effective
ways to achieve complete bug comprehension.
Assessed by the persuasiveness criterion, state-of-the-art PTA techniques [22, 33, 128, 130,
131, 133] are unsatisfactory in generally addressing access anomalies in real-life complex
concurrent programs. Although several recent works [128, 130, 131] have pointed out the
usefulness of persuasiveness, it is still not clear how to efficiently create a concrete
execution that can expose the predicted anomalies in real programs. In addition, despite their
much improved soundness compared to static analysis, current PTA techniques still report quite
a number of false positives, either due to the inadequacy of their prediction models or the
incompleteness of the collected traces. For example, because detecting data races in general
is NP-hard [95], many race detectors [29, 96, 102, 110] employ, for efficiency reasons, an
over-approximated prediction model that combines the lockset-based algorithms [109] with the
happens-before based approaches [66]. Moreover, for PTA techniques, a certain type of false
positive simply cannot be avoided when programmers use application-level synchronization
mechanisms, such as barrier and flag operations. These “non-standard” synchronization
mechanisms are difficult to discover automatically [123] and, in turn, result in incomplete
traces.
We present PECAN, a novel persuasive PTA technique that detects general access anomalies
(AAs) in concurrent programs. Unlike other PTA techniques [22, 33, 118] that cater to spe-
cific types of concurrency bugs, PECAN offers a general prediction model that addresses a
much broader class of concurrent access anomalies. Moreover, for each predicted AA, PECAN
generates “bug hatching clips” that deterministically instruct the input program to exercise the
predicted AAs. PECAN does not present false positives to programmers as we guarantee that
each clip represents a feasible concrete execution. Since all AAs reported are real and the pro-
grammers are given the full history and context information to understand the bug, we believe
PECAN can dramatically expedite the process of bug fixing.
The key technical challenge we face is how to statically generate a feasible thread execution
schedule that exposes the predicted AAs. We present an algorithm, with a proof of its
soundness, that guarantees to generate a feasible schedule for every real AA in programs that
use locks in a nested way, i.e., release locks in the reverse order of their acquisition.
Moreover, to
predict AAs in a general way, we present a general specification model of AAs and reduce
the AA prediction problem to a graph pattern search problem. With compact encoding of the
happens-before relationship between the events and the scheduling order of memory accesses in
the trace, the graph supports efficient pattern search of AAs, enabling PECAN to scale well to
large traces.
The salient property of persuasiveness is also highly valued and explored by other classes of
techniques such as active testing [59, 64, 98, 110] and model checking [18, 62, 86, 113, 129].
In particular, RaceFuzzer and AtomFuzzer [98, 110] explore dynamically and are thus capable of
creating concrete executions that expose real races, by actively controlling a race-directed,
randomized thread scheduler. Chess also systematically explores the thread scheduling space at
runtime to find concurrency bugs. As a PTA technique, the goal of PECAN is to provide
generalized support of persuasiveness for concurrency access anomalies.
We have implemented PECAN for Java programs and conducted extensive experiments to evaluate
it. Three common types of AAs are investigated: data races, atomicity violations, and ASVs.
Our evaluation results show that PECAN is able to effectively and efficiently predict and
deterministically create real AAs in all twenty-two evaluated subjects, including six large
multi-threaded applications. PECAN achieves a 100% success ratio in creating the predicted AAs
in more than half of the subjects. For the other subjects, the success ratio ranges from 0.25
to 0.93 (due to the reported false AAs). Several serious and previously unknown bugs were also
revealed by PECAN in large open source concurrent systems such as OpenJMS and Jigsaw.
Moreover, PECAN scales well: for instance, it can analyze a trace in Derby with more than
447K events in around 6 seconds. The PECAN prototype and the detected replayable bugs in our
experiments are publicly available at http://www.cse.ust.hk/prism/pecan/.
The rest of this chapter is organized as follows: Section 4.2 presents an overview of PECAN;
Section 4.3 presents the pattern specification of general access anomalies; Section 4.4
presents the graph-based prediction model; Section 4.5 presents the search algorithm based on
the graph model; Section 4.6 presents the schedule generation algorithm; Section 4.7 presents
the implementation and evaluation of PECAN; Section 4.8 summarizes this chapter.
4.2 PECAN in a Nutshell
To make our technique more comprehensible, we first use the simple example in Figure 1.1 to
illustrate the AA detection process of PECAN. Let us use the line number as the identifier of the
statement. There are three data races in the program. The races are between statements (2,5),
(2,6) and (3,7). Among the three real races, the race (3,7) is more important because it
might trigger the ERROR at line 4.
PECAN addresses the above problem using the following steps:
1. We first collect traces of interesting events during the program execution.
2. We extract from the trace a partial and temporal order graph (PTG) that encodes the
information about the happens-before relationship between the events, the atomic blocks, and
the scheduling order of memory accesses.
3. We perform a pattern-directed search on the PTG for matches of the general AA patterns
w.r.t. the program constraints.
4. Taking the original trace and the search results as the input, we statically generate a
thread schedule for each predicted AA.
5. We use a deterministic replayer [48] to re-execute the program and expose the predicted AAs
according to the generated schedules.
Coming back to our example, suppose the collected execution trace is <1,5,2,3,6,7>. In
Step 3, PECAN will detect that (3,7) is a possible race and then, in Step 4, PECAN is able
to generate the thread schedule <1,5,2,6,7,3> that deterministically directs the replayer to
expose this race and to trigger the ERROR in Step 5. From the user’s perspective, the whole pro-
cess is automatic and requires no additional user intervention. We note that, like other PTA
techniques, our analysis requires that the error-inducing events (3,7) appear in the input
trace, which might not always happen. In practice, we can use techniques such as RaceFuzzer
[110] to compensate for this deficiency (we come back to this issue in Section 4.7.3).
In the following sections, we go under the hood of our technique to discuss the pattern language
we use to specify the general AAs (Section 4.3), the graph prediction model (PTG) we use to
represent the AA prediction problem (Section 4.4), the pattern search algorithm for locating the
AAs on the graph model (Section 4.5), and the schedule generation algorithm for generating the
thread schedule for each predicted AA (Section 4.6).
4.3 Pattern Specification of Access Anomalies
The most commonly known AAs include data races, atomicity violations, and atomic-set seri-
alizability violations (ASVs). These anomalies are sequences of two to four events generated
by two different threads on one or two shared variables. In our prediction model, we
generalize the concept of an AA to allow an arbitrary number of events, threads, and shared
variables, and we describe each type of AA as an event sequence pattern.

FIGURE 4.1: General access anomaly patterns

                                  data race   atomicity violation   ASV
The event sequence (E):           e1-e2       e1-e2-e3              e1-e2-e3-e4
The thread scheduling (T):        t1-t2       t1-t2-t1              t1-t2-t2-t1
The shared variables (SV):        s1-s1       s1-s1-s1              s1-s1-s2-s2
The atomic regions (AR):          u1-u2       u1-u2-u1              u1-u2-u2-u1
The access types (AT):            r-w         r-w-r                 w-r-r-w
An AA pattern p consists of a group of equal-length sequences [E,T,SV,AR,AT] (a minimal code
representation is sketched after the list below). The meaning of each symbol is as follows:
• E: the event sequence defined by the pattern.
• T: the thread scheduling order corresponding to E, i.e., the event E[i] is by the thread
T[i].
• SV: the accessed shared variable sequence corresponding to E, i.e., the event E[i] ac-
cesses the shared variable SV[i].
• AR: the atomic region sequence corresponding to E, i.e., the event E[i] belongs to the
atomic region AR[i].
• AT: the access type sequence corresponding to E, i.e., the access type of E[i] is AT[i]
which is either a read or a write: AT[i] ∈ {r,w}.
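The following is a minimal sketch of one possible representation of such a pattern (the class
and field names are hypothetical, not PECAN's actual data structures):

    // Sketch: an AA pattern as five equal-length parallel arrays.
    class AAPattern {
        final int[]  events;       // E:  pattern event identifiers
        final int[]  threads;      // T:  threads[i] executes events[i]
        final int[]  variables;    // SV: events[i] accesses variables[i]
        final int[]  regions;      // AR: events[i] belongs to regions[i]
        final char[] accessTypes;  // AT: 'r' or 'w' for each event

        AAPattern(int[] e, int[] t, int[] sv, int[] ar, char[] at) {
            assert e.length == t.length && t.length == sv.length
                && sv.length == ar.length && ar.length == at.length;
            events = e; threads = t; variables = sv;
            regions = ar; accessTypes = at;
        }
    }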
Figure 4.1 shows example patterns of the three commonly known AAs. Clearly, the specification
of AA patterns above is general enough to describe all three of them. Moreover, this general
pattern model allows users to define their own AA patterns, which may contain much more
complex thread interleavings. Nevertheless, since all complex AA patterns can in fact be
composed from these three basic ones (as proved in [125], a set of eleven ASV patterns forms a
complete set of all the problematic thread interleaving scenarios w.r.t. atomic sets and units
of work), we focus on explaining them in this section.
We next discuss these three basic AAs and describe them using the general pattern
specification. Since they contain a dozen patterns in total, for brevity, we show only one
representative pattern for each of them. The other patterns are similar.
Data race A data race occurs when two threads concurrently access the same data without proper
synchronization and at least one of the accesses is a write. We can thus describe it as:
E=e1-e2, T=t1-t2, SV=s1-s1, AR=u1-u2, and AT=r-w, meaning that the first thread reads a shared
variable and immediately afterwards the second thread writes to it. Note that data race
patterns require the two events to happen consecutively, while this condition is unnecessary
for atomicity violations and ASVs.
Atomicity violation An atomicity violation happens when the desired serializability among
multiple memory accesses to a single memory location is violated. Suppose a memory location is
accessed by three consecutive events ei, ek, and ej, in this order, where ei and ej belong to
the same atomic region while ek belongs to another. An atomicity violation with the access
type sequence “write-read-write” can be written as E=e1-e2-e3, T=t1-t2-t1, SV=s1-s1-s1,
AR=u1-u2-u1, and AT=w-r-w.
ASV Atomic-set serializability is a criterion for enforcing the serializability of units of
work that deal with atomic sets. An atomic set is defined to be a set of memory locations that
together satisfy some consistency property. For example, let Wu(m) (Ru(m)) represent a write
(read) access to a memory location m by a unit of work u, and suppose m1 and m2 belong to the
same atomic set. The execution sequence “Wu(m1)-Ru′(m1)-Ru′(m2)-Wu(m2)” causes an ASV, as the
two consecutive writes to m1 and m2 by u are interleaved by two reads of these memory
locations by u′, another unit of work, resulting in inconsistent reads. We describe this
pattern as E=e1-e2-e3-e4, T=t1-t2-t2-t1, SV=s1-s1-s2-s2, AR=u1-u2-u2-u1, and AT=w-r-r-w. In
our implementation, we consider each atomic region as a unit of work, and all memory locations
accessed in the same atomic region belong to the same atomic set.
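Using the sketch representation given after the pattern definition above, this ASV pattern of
Figure 4.1 would, for instance, be instantiated as follows (illustrative only):

    // The four-event ASV pattern of Figure 4.1:
    // E=e1-e2-e3-e4, T=t1-t2-t2-t1, SV=s1-s1-s2-s2, AR=u1-u2-u2-u1, AT=w-r-r-w
    AAPattern asv = new AAPattern(
        new int[]  { 1, 2, 3, 4 },           // e1, e2, e3, e4
        new int[]  { 1, 2, 2, 1 },           // t1, t2, t2, t1
        new int[]  { 1, 1, 2, 2 },           // s1, s1, s2, s2
        new int[]  { 1, 2, 2, 1 },           // u1, u2, u2, u1
        new char[] { 'w', 'r', 'r', 'w' });  // w-r-r-w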
4.4 Graph Prediction Model
Our approach to the general AA prediction problem is to reduce it to a graph search problem.
We start by formalizing the permutation constraints. We then describe our formulation as a
graph mutation and pattern search problem.
4.4.1 Constraint Model
Precisely detecting access anomalies in general is computationally intractable [95]. To achieve
efficiency in predicting AAs, similar to many race detection techniques [29, 110], we use a
hybrid constraint model [96] that combines the lockset condition [109] and the happens-before
relation [66]. Specifically, the hybrid model defines that two events ei and ej are independent
iff:
1. they do not hold a common lock (l(i) ∩ l(j) == ∅);
2. they do not have a POR relation (recall Definition 2.4) between each other (ei ↛ ej and
ej ↛ ei).
Notice that the hybrid constraint model we use is a conservative approximation of the precise
model for checking independence between events [94]. It is therefore a possible source of
false warnings reported by PECAN during the pattern search. Nevertheless, these false warnings
can be automatically pruned during the re-execution phase (see Section 4.6.3) and hence do not
affect the final results delivered to the end user.
4.4.2 The AA Prediction Problem
The essential idea behind AA prediction is that the independent events in the trace can be
rearranged, simulating the effects of thread scheduling. Therefore, even if an AA is not
directly witnessed in the trace, as long as it can be manifested in some feasible permutation
of the trace, we can locate it and expose it with a concrete execution. This idea originates
in Lipton's theory of reduction [72] and has been exploited by many concurrency bug detection
approaches [34, 37, 133].
Our general objective is to search for all the AAs that satisfy some given patterns in an
execution trace or in any of its feasible permutations allowed by our constraint model defined
in Section 4.4.1. We model this problem as a graph pattern search and mutation problem. Before
giving a formal problem definition, let us first define the graph model:
Definition 4.1. The Temporal Order Relation (TOR) ei ⇢ ej holds if events ei and ej are
consecutive accesses on the same shared memory location and ei occurs before ej .
Definition 4.2. A Partial and Temporal Order Graph (PTG) is a graph G(V,E) where V is
a set of nodes and E is a set of edges. Each vi ∈ V corresponds to the event ei in the trace.
Each edge e is either solid (→) or dashed (⇢), corresponding to the POR and TOR between the
events, respectively.
The PTG can be mutated by interchanging the nodes connected by dashed edges w.r.t. the POR
and the lockset condition. For brevity, we call these two conditions the mutation condition,
and we refer to the mutated PTGs as vPTGs.
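A minimal sketch of one possible in-memory representation of the PTG (hypothetical names; it
ignores the compact encoding discussed in Section 4.5.1):

    import java.util.ArrayList;
    import java.util.List;

    class PTG {
        static class Node {
            final int eventId;                             // index in the trace
            final List<Node> porSucc = new ArrayList<>();  // solid out-edges (POR)
            final List<Node> torSucc = new ArrayList<>();  // dashed out-edges (TOR)
            Node(int eventId) { this.eventId = eventId; }
        }

        final List<Node> nodes = new ArrayList<>();        // one node per event

        void addPOR(Node from, Node to) { from.porSucc.add(to); }
        void addTOR(Node from, Node to) { from.torSucc.add(to); }
    }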
Based on the PTG, the AA patterns can be conveniently formulated as propositional formulas
between the nodes in the PTG. Our goal is to find all the AAs on the vPTGs that satisfy the
user-specified patterns.
4.5 Graph Pattern Search
Since the number of vPTGs is exponential and the trace could be very large, it is inefficient
to perform pattern search on every individual vPTG. We use two primary techniques to achieve
efficiency. First, we have developed a compact encoding of the PTG. Second, we perform
pattern-directed graph mutations on the fly based on the intermediate search results, which
avoids separate mutation steps.
4.5.1 Compact Encoding of PTG
We have two main techniques for compactly encoding the PTG. First, to facilitate efficient pat-
tern search, we build separate indices of events based on thread ID, memory location, access
type and atomic region. Second, to scale to large traces, we do not maintain the full POR but,
instead, maintain only the relations between the thread communication (TC) events, i.e., fork,
join, notify, and wait events. Since the TC events are the only sources of the POR between
events across different threads, we use them to compute the POR for all the other events on
demand. By this approach, we reduce the space cost from quadratic in the trace size to linear
in the trace size and quadratic in the number of TC events. The number of TC events is usually
much smaller than the entire trace size.
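Under this encoding, a cross-thread POR query can be answered on demand roughly as follows (a
sketch with hypothetical names; tcEventsOf and tcBefore stand for the per-thread TC event
index and the precomputed, transitively closed relation among TC events):

    abstract class PORQuery {
        static class Event {
            final int thread, index;  // index = position in its thread's order
            Event(int thread, int index) { this.thread = thread; this.index = index; }
        }

        abstract Iterable<Event> tcEventsOf(int thread);  // TC events of a thread
        abstract boolean tcBefore(Event a, Event b);      // closed TC relation

        boolean por(Event e1, Event e2) {
            if (e1.thread == e2.thread)
                return e1.index < e2.index;               // program order
            // A cross-thread POR must pass through TC events: e1 precedes some
            // TC event a of its own thread, a happens before some TC event b of
            // e2's thread, and b precedes e2.
            for (Event a : tcEventsOf(e1.thread))
                if (e1.index <= a.index)
                    for (Event b : tcEventsOf(e2.thread))
                        if (tcBefore(a, b) && b.index <= e2.index)
                            return true;
            return false;
        }
    }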
4.5.2 Pattern-Directed Search
In general, given a pattern described in the specification model in Section 4.3, our pattern search
algorithm first computes the number of threads, the number of shared variables, and the number
of events by each thread in the same atomic region on each shared variable. Our algorithm
then uses this information to search the indexed PTG and obtain a set of candidate AAs. A
candidate AA may not match the thread scheduling order T specified in the pattern, in which
case the mutation condition is applied to check whether there exists an allowed permutation
of nodes in the PTG that makes the matching possible. We next give detailed explanations for
the
data race, atomicity violation, and ASV patterns.
Data race Recall that each pattern of data race contains two events satisfying the conditions
defined in Section 4.3. We thus follow the dashed edges on the PTG and examine every candidate
node pair that could possibly satisfy the conditions. If a node pair (vi,vj) matches the temporal
order (i.e., the two nodes are connected by a dashed edge), we report it as a real AA. Otherwise,
we check if the PTG can be mutated for the node pair to match the temporal order. The function
canSatisfyByMutation(vi,vj) (Algorithm 1) is used to check this condition.
Algorithm 1 canSatisfyByMutation(vi, vj)
Ensure: i < j
 1: return (l(i) ∩ l(j) == ∅ && !POR(vi, vj))

Algorithm 2 canSatisfyByMutation(vi, vk, vj)
Ensure: i < j < k
 1: for all vx ∈ [vi+1, vi+2, ..., vj] do
 2:   if canSatisfyByMutation(vx, vk) then
 3:     return true
 4: return false
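In Java, the two checks amount to the following (a direct transliteration of Algorithms 1 and
2; Node, lockset, por, and nodesBetween are assumed helpers rather than PECAN's actual API):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    abstract class MutationCheck {
        static class Node { }                              // a PTG node

        abstract Set<Object> lockset(Node v);              // locks held at v
        abstract boolean por(Node v1, Node v2);            // POR between events
        abstract List<Node> nodesBetween(Node a, Node b);  // trace order a+1..b

        // Algorithm 1: the pair can match by mutation iff the two events hold
        // no common lock and are not ordered by the POR.
        boolean canSatisfyByMutation(Node vi, Node vj) {
            Set<Object> common = new HashSet<>(lockset(vi));
            common.retainAll(lockset(vj));
            return common.isEmpty() && !por(vi, vj);
        }

        // Algorithm 2: for a candidate triple (vi, vk, vj), check whether vk
        // can be moved before some node vx between vi and vj.
        boolean canSatisfyByMutation(Node vi, Node vk, Node vj) {
            for (Node vx : nodesBetween(vi, vj))
                if (canSatisfyByMutation(vx, vk))
                    return true;
            return false;
        }
    }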
Atomicity violation and ASV The search algorithms for atomicity violation and ASV patterns
are similar to that for data races, with the main difference lying in checking the mutation
condition.
Because each atomicity violation (ASV) pattern contains three (four) nodes, we need to check
the mutation condition for more pairs of nodes. Without loss of generality, we use the example
in Figure 4.2 to illustrate the mutation condition (Algorithm 2) for checking candidate atomicity
violations. Suppose we have already found the candidate triple (v2, v8, v5) by traversing the
events by the two threads on the shared memory location x. As the temporal order of this triple
does not directly satisfy the atomicity violation pattern, we next check if it can be satisfied in any
of the other vPTGs, i.e., if v8 can be placed in any position between v2 and v5 without violating
the POR and the lockset condition. Our algorithm thus tries to find a position between v2 and
v5, say vx, such that there is no POR between v8 and vx and they are not protected by a common
lock, i.e., the lockset condition. Finally we find vx = v4 and thus report this AA.
FIGURE 4.2: Example of searching atomicity violations

    thread1 (events 1-6 form one atomic region):    thread2:
    1. lock(l)
    2. read x
    3. unlock(l)
    ...
    4. lock(l)
    5. read x
    6. unlock(l)
                                                    7. lock(l)
                                                    8. write x
                                                    9. unlock(l)
4.6 Schedule Generation
For each predicted AA, PECAN statically generates a corresponding thread schedule that is used
to deterministically direct an execution for exposing the AA. This problem is highly nontrivial
and there are several challenges to be addressed:
1. Given an AA, regardless of whether it is real or false, how do we generate a schedule that
can manifest it?
2. For each AA, there might be multiple corresponding schedules. Which one should we generate?
3. For real AAs, how do we make sure the generated schedules are feasible, i.e., that they can
expose the real AAs?
In the following text, we first present our schedule generation algorithm and discuss how it
addresses the above challenges. Then we formally prove that, for programs using nested locks,
our algorithm guarantees to generate a feasible schedule for every real AA. For false AAs,
although our algorithm may also generate infeasible schedules, we show in Section 4.6.3 that
these false AAs can be automatically pruned away during the re-execution phase.
4.6.1 How to Generate a Feasible Schedule?
The basic idea of our schedule generation algorithm is to transform the original trace by chang-
ing the relative order of independent events, i.e., moving the related events to different positions
in the trace. The main challenge is that we need not only to make sure the transformed trace
can manifest the AA, but also to guarantee it is feasible (i.e., does not violate the program con-
straints). However, as there is an exponential number of ways to transform the trace, it is
very inefficient to exhaustively generate every possible schedule and verify its feasibility
by checking the constraints. Figure 4.3 shows a simple trace in which the nodes v1,v2,v3 and
v4,v5,v6 belong to two different threads, and the POR and TOR are represented by solid and
dashed edges, respectively. Suppose (v2,v5) is a real race pair. There are many possible
rearrangements of the nodes in which we can place v2 and v5 next to each other, but only some
of them are feasible schedules. For instance, if we naively move v2 to the position before v5,
we will get an infeasible schedule δ′, in which the relative order between v2 and v3 violates
the POR.
We have the following tactics to reduce the computational complexity of the schedule gener-
ation: First, although there might be many feasible schedules that manifest a real AA, it is
sufficient for us to generate one of them. Second, since the original trace is a feasible schedule
(i.e., satisfies the program constraints), when we permute the original trace (e.g., move a node
to a different position), we only need to make sure the changed portion does not violate the con-
straints w.r.t. the entire trace. Third, since it is sufficient for the resulting schedule to manifest
the violation, we can remove from the schedule the nodes that are placed beyond the violation
creation point.

FIGURE 4.3: An example of schedule generation (the original trace over nodes v1-v6, an
infeasible schedule δ′, and a feasible schedule δ″)
With these tactics, the whole schedule generation process becomes clear and straightforward.
The key problem is how to satisfy the program constraints when permuting the nodes. There
are basically three types of program constraints: the POR, the lock constraint, and the program
control constraint. The lock constraint requires that, at any time of the program execution,
a lock cannot be held by more than one thread. The program control constraint is related to
the execution order determined by the evaluation results of program control statements. For
real AAs, we can ignore the program control constraint as the evaluation results of program
control statements should be unchanged if we move the violation node to a correct position that
manifests the AA; otherwise, the AA is not real. We next discuss how our algorithm respects
the POR and the lock constraint.
Satisfying the POR is relatively simple. The key point is that we should not only move the
violation node to the correct position so that the violation pattern can be satisfied, but
also move the nodes that are dependent on, or have PORs with, the violation node to their
correct
positions. Returning to the example in Figure 4.3, we generate a correct schedule δ″ by first
moving v2 and v3 (because v3 is dependent on v2) to the position next to v5, and then removing
v3 and v6 from the schedule (because v3 and v6 are beyond the violation creation point).
Satisfying the lock constraint is much more complicated. We first use an example to illustrate
the challenge and then describe our approach for addressing it.
Example In Figure 4.4, the race pair (v3,v8) satisfies our relaxed mutation constraints, i.e.,
v3 and v8 are not protected by a common lock and there is also no POR between them. Therefore, it
would be reported as a possible race pair by our pattern search algorithm. However, it is a false
warning: it is impossible for v3 and v8 to happen next to each other in any feasible schedule, as
there is a POR between v2 and v5. For this false violation, if we only consider the POR in the
schedule generation, we would generate an infeasible schedule <v1,v2,v5,v6,v7,v3,v8> that
violates the lock constraint. This is acceptable, since the false violation can be pruned in
the re-execution phase. The problem, however, is that if we remove the partial order relation
from v2 to v5, so that (v3,v8) becomes a real race, this schedule is still infeasible.

FIGURE 4.4: An example illustrating the difficulty of satisfying the lock constraint in
schedule generation. The race pair (v3,v8) is a false warning, though it satisfies both the
POR and the lockset condition.

    thread1:          thread2:
    v1. lock(l)       v5. ...
    v2. ...           v6. lock(l)
    v3. read x        ...
    v4. unlock(l)     v7. unlock(l)
                      v8. write x
The root cause of the above problem is that, in moving the nodes dependent on the to-be-moved
violation node (v3 in Figure 4.4), we have moved an unlock node (v4) but not its corresponding
lock node (v1), causing the resulting schedule to violate the lock constraint. To address this
problem, whenever we move an unlock node, we should also make sure its corresponding lock node
is moved to a correct position. Thus, in addition to the steps illustrated in Figure 4.3, our
algorithm also looks for the outermost lock (OML) node protecting the to-be-moved violation
node, and moves all the nodes dependent on the OML node to their correct positions. For the
example in Figure 4.4, we first find v1 (the OML node) and move v1, v2, and v3 (the nodes
dependent on the OML node) to the positions before v8; then we move v4 to the position after
v8 and remove it afterwards. Finally, we get a feasible schedule <v5,v6,v7,v1,v2,v3,v8>.
Algorithm 3 ScheduleGeneration(vi, vj)
Require: i < j
 1: Let vl be the outermost lock node that is protecting vi
 2: Move all the nodes dependent on vi to the positions after vj
 3: if vl is not NULL then
 4:   Move vl and all the nodes from vl to vi that are dependent on vl to the positions before vj
 5: else
 6:   Move vi to the position immediately before vj
 7: Remove all nodes after vj
Algorithm 3 summarizes our schedule generation algorithm for data race patterns. The goal
is to generate a feasible schedule in which vi and vj are placed next to each other. Since all
it does is move a sequence of nodes to different positions, the worst-case time complexity of
this algorithm is linear in the length of the trace. The algorithms for the other AA patterns,
such as atomicity violation and ASV patterns, are in a similar style, though they may require
moving more nodes if the pattern contains three or more events. For example, Algorithm 4 shows
our algorithm for atomicity violation patterns, which contain an event triple (vi,vk,vj). The
goal of the algorithm is to generate a feasible schedule in which vk is placed between vi and
vj. Without loss of generality, let us consider the case i < j < k. Recall that, in reporting
every potential atomicity violation in the pattern search phase, we have found a node vx that
is between vi and vj and satisfies the mutation condition with vk. This means that in some
feasible schedule vk can be placed before vx. We thus generate such a feasible schedule in the
safest and simplest way: move all the nodes from vx to vj in the original trace that are
dependent on vx to the positions after vk. The movement of nodes simply follows the same rule
as that in the algorithm for data race patterns.
Algorithm 4 ScheduleGeneration(vi, vk, vj)
Require: i < j < k
 1: Find vx in canSatisfyByMutation(vi, vk, vj)
 2: Let vl be the outermost lock node that is protecting vx
 3: if vl is not NULL then
 4:   Move vl and all the nodes from vl to vx−1 that are dependent on vl to the positions before vk
 5: else
 6:   Move all the nodes from vx to vj that are dependent on vx to the positions after vk
 7: Remove all nodes after vj
4.6.2 What Can Our Algorithm Guarantee?
Theorem 4.3. For programs that use locks in a nested way, i.e., releasing locks reverse to the
acquisition order, our schedule generation algorithm will produce a feasible schedule for every
real AA.
Proof. Since the essential idea of the schedule generation is event permutation, i.e., moving
events or event sequences in the original trace from one place to another, to prove the
correctness in general (for any AA) it is sufficient to prove the correctness of the most
basic step: moving a single event. Now let us pick a race pair (vi,vj) with i < j for the
proof. Suppose (vi,vj) is a real race but the schedule generated by Algorithm 3 is infeasible.
In the following, we prove by contradiction that this is impossible.
Because the schedule is infeasible, it must have either violated the POR, the program control
constraint, or the lock constraint. For the POR, because Algorithm 3 only changed the temporal
order between vj and the nodes that were moved to the positions after vj , i.e., nodes dependent
on vi, the only possible POR the generated schedule may violate is between vj and the nodes
that are dependent on vi. However, for any such POR, say vx → vj, we must have vi → vx and
hence vi → vj, which contradicts the condition that there is no POR between vi and vj, a
condition that must be satisfied for our algorithm to report this AA. Besides, the schedule
cannot violate the program control constraint either; otherwise the race would be infeasible.
Thus, it is impossible for the generated schedule to violate the POR or the program control
constraint.
We next prove that it is also impossible to violate the lock constraint. If the schedule
violates the lock constraint, then there must exist an unmatched lock and unlock node pair,
i.e., a lock node and its corresponding unlock node that are interleaved by another lock or
unlock node. However,
because the original trace satisfies the lock constraint, there are only two possible reasons for
this result: (I) we incorrectly moved the interleaved lock or unlock node to a position between
the lock and the unlock node; (II) we incorrectly moved the unlock node to a position after the
interleaved node. Case I is impossible because it violates the lockset condition which should be
satisfied for our algorithm to report this AA. For case II, we show it is also impossible if there
are only nested locks in the original trace. First, because our algorithm only moves those nodes
that are dependent on the outermost lock (OML) node that is protecting the violation node, if
we had ever moved an unlock node, this unlock node should be dependent on the OML node.
Additionally, if there are only nested locks in the trace, the corresponding lock node of this
unlock node should also be dependent on the OML node, otherwise the OML node would not
be the outermost lock node. Thus, if we had ever moved an unlock node, we should have also
moved its corresponding lock node to a correct position. So case II is also impossible.
4.6.3 Pruning False Warnings
Note that our schedule generation algorithm is sound but incomplete, i.e., it may generate
infeasible schedules for false violations. Nevertheless, we are able to automatically prune
all the false AAs during the re-execution phase. Specifically, during the re-execution, we
control the thread scheduling to strictly follow the input generated schedule by matching the
events between the two schedules. When we observe that some thread has executed a new event
that does not match the corresponding event in the input schedule (meaning the thread has
taken a branch different from the originally observed execution), or when the re-execution
hangs due to a deadlock, we immediately stop the re-execution and report the AA as a false
violation. In this way, as we only report successful re-executions, we are able to prune all
the false violations.
4.7 Evaluation
We have implemented PECAN based on LEAP. To evaluate PECAN, we use a set of popular subjects
(Table 4.1) used in benchmarking concurrency defect analysis techniques [64, 98, 110], as well
as a number of large multi-threaded Java applications. In all our experiments, we collect a
normal execution trace for each program with a fixed configuration setting and program input.
To represent the trace, we maintain a vector that records a global order of all the events.
For all the
events, we record their access type, thread ID, and the accessed memory ID at runtime. The
lock set and the atomic region information are computed offline to save runtime cost. Like
[36], we process re-entrant locks internally in the trace collection phase and do not expose
them in the resultant trace.
For each generated schedule, we re-execute the program once to verify whether the correspond-
ing predicted AA is present. Because of concurrency bugs, some subjects may throw uncaught
exceptions in certain problematic schedules. It is clearly a highly desirable and useful charac-
teristic if a technique is able to predict these concurrency bugs from a normal execution trace,
and generate the corresponding schedules to cause the program to raise uncaught exceptions.
Thus, in our evaluation, we also report the number of re-executions in which the program raised
uncaught exceptions, out of all the schedules generated by PECAN, for each evaluated program.
To remove nondeterminism caused by random numbers, we replace all random seeds in the
evaluated programs with a constant. For open libraries, we use the drivers from [57] to close
them. All the experiments were conducted on an 8-core 3.00GHz Intel Xeon machine with 16GB of
memory running Linux 2.6.22. The VM is a standard Java HotSpot (TM) 64-Bit Server VM, version
1.6.0_10, with a 10GB heap, which is sufficient for all our experiments.
TABLE 4.1: PECAN experimental results. For each program, the table reports the lines of code
(LOC); the trace statistics: number of threads (Thread), shared variables (SV), events
(Event), and recording overhead (Overhead); the computation times: pattern search (Analysis)
and schedule generation (Transform); the predicted violations: data races (Race), atomicity
violations (AV), and ASVs (ASV); and the re-execution results: created real AAs (T),
re-executions raising uncaught exceptions (EX), and failed re-executions (F). The evaluated
programs are Account, BuggyPrg, Critical, Loader, Manager, MergeSort, Shop, StringBuf,
ArrayList, LinkedList, HashSet, TreeSet, Moldyn, RayTracer, MonteCarlo, Cache4j, SpecJBB-2005,
Hedc, Weblech-0.0.3, OpenJMS-0.7.7*, Jigsaw-2.2.6*, and Derby-10.3.2.1*.
4.7.1 Experimental Results
Table 4.1 summarizes the results of our experiments. For each program, Column 2 reports its
size in lines of source code (LOC), and Columns 3-5 report the number of threads (Thread), the
number of real shared memory locations that receive both read and write accesses from
different threads (SV), and the number of events in the analyzed trace (Event), respectively.
The thread number ranges from 2 in RayTracer to 24 in OpenJMS, the number of shared memory
locations ranges from 1 to 399, and the trace size ranges from 19 to 447,392.
Column 6 reports the runtime overhead (Overhead) of our trace collection, averaged over 10
runs for each subject. The runtime overhead ranges from 0.00x in Account and Manager to 7.84x
in Moldyn. Columns 7-8 report the
pattern search time (Analysis) and the average schedule generation time (Transform). The pat-
tern search time ranges from 3ms in StringBuf, with 86 events in the trace, to around 5 minutes
in OpenJMS with 180,887 events in the trace. The average schedule generation time ranges
from 2ms in Critical to 1.473s in Derby.
Columns 9-11 report the number of predicted data races (Race), atomicity violations (AV), and
ASVs (ASV), respectively, in each program. PECAN predicted a number of data races and
atomicity violations in almost all the traces we analyzed. The number of predicted ASVs is
often zero or very small except for Jigsaw, in which PECAN predicted 684 ASVs. Note that
each AA reported by PECAN is unique in terms of the source code line numbers on which the
violation events are triggered. We do not report duplicate AAs that have the same line number
combinations in the source.
Columns 12-14 report the number of created real AAs (T), the number of re-executions that
raise uncaught exceptions (EX), and the number of re-executions that fail (F). For the three
large programs (OpenJMS, Jigsaw, Derby) marked with ‘*’, because they contain too many pre-
dicted AAs (from 437 to 2,076), we only generate the schedules for 100 randomly selected AAs.
PECAN created real AAs for all the evaluated programs and, for most of them, PECAN caused
the program to throw uncaught exceptions, which is a strong symptom of real concurrency bugs.
PECAN also reported a number of failed re-executions in several subjects, especially the large
programs. We manually inspected those failures and found that the only reason PECAN failed to
create these AAs is that they are false violations, owing to the conservativeness of the
hybrid constraint model (recall Section 4.4.1) that we use for AA prediction.
Our experimental results clearly demonstrate the performance and effectiveness of PECAN.
First, PECAN predicted real AAs for all the evaluated subjects and achieves a 100% success
ratio in creating the predicted AAs in more than half of the subjects. For the other subjects,
the success ratio ranges from 0.25 to 0.93 (due to the reported false violations). Second, the
pattern search and
the schedule generation are both relatively fast. For Derby, which has more than 447K events
in the trace, PECAN predicted 463 AAs in around 6 seconds and generated the corresponding
schedule for each AA in around 1.5 seconds on average. For OpenJMS, the trace of which con-
tains more than 180K events, PECAN predicted 2,076 AAs in less than 5 minutes. For the other
cases with smaller trace size, such as ArrayList that contains several hundred events, the pattern
search time and the schedule generation time are only several milliseconds. These results clearly
demonstrate the efficiency of our pattern search and schedule generation algorithms. Moreover,
since we compute most of the event attributes offline, the runtime overhead of PECAN is rela-
tively small, with slowdown factors ranging from 0.00x to 7.84x.
4.7.2 Detected Real Bugs
We investigated the uncaught exceptions and real AAs that PECAN created and confirmed a
number of real concurrency bugs in almost all the subjects, and several previously unknown
bugs. We next describe a couple of previously unknown bugs in two large projects OpenJMS-
0.7.7 and Jigsaw-2.2.6.
Figure 4.5 shows a destructive data race predicted by PECAN in OpenJMS-0.7.7. The race
happens on the field _multiplexer of the class MultiplexedManagedConnection. When a thread
reads the shared field at line 2 before it is initialized by another thread at line 1, the
thread will throw a ResourceException that crashes the program.

FIGURE 4.5: A destructive race in OpenJMS (MultiplexedManagedConnection.java)

    setInvocationHandler(...) {
        ...
    1.  _multiplexer = createMultiplexer(...);
        ...
    }

    invoke(...) {
        synchronized (this) {
    2.      multiplexer = _multiplexer;
        }
        if (multiplexer != null)
            ...
        else
            throw new ResourceException(...);
    }
Figure 4.6 shows a predicted real bug in Jigsaw-2.2.6. In the method getNextEvent of the class
EventManager, a thread first checks in a while loop (line 1) until the event queue becomes
non-empty; the thread then gets the first item in the queue (line 2) and removes it from the
queue (line 3). This logic is correct in a single-threaded event manager. However, when
multiple threads execute inside the getNextEvent method simultaneously, a thread might try to
get an item from the queue that has already been removed by another thread, causing an
ArrayIndexOutOfBoundsException at line 2.

FIGURE 4.6: A predicted real bug in Jigsaw (EventManager.java)

    getNextEvent() {
    1.  while (queue.size() == 0) { ... }
    2.  Event e = queue.elementAt(0);
        ...
    3.  queue.removeElementAt(0);
    }
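One conceivable fix, sketched below for illustration (this is not the patch actually adopted
by Jigsaw), is to make the emptiness check and the removal atomic, so that no other thread can
drain the queue between the size test and elementAt:

    // Sketch of a fix: the check and the removal execute under one monitor,
    // and waiting replaces the busy loop; producers are assumed to call
    // notifyAll() on the same monitor after enqueueing an event.
    synchronized Event getNextEvent() throws InterruptedException {
        while (queue.size() == 0)
            wait();
        Event e = queue.elementAt(0);
        queue.removeElementAt(0);
        return e;
    }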
4.7.3 PECAN Limitations
Our experimental results clearly demonstrate the superior persuasive concurrency bug
prediction capability of PECAN compared to related approaches. Through our experiments with
real-world large multi-threaded applications, we also observed some limitations of PECAN that
we plan to address in future work.
Limited path exploration Because PECAN currently has only the information of a single trace,
it cannot predict access anomalies in execution paths absent from the collected traces. We
plan to enhance PECAN by combining it with approaches such as symbolic analysis [101, 129,
131] to systematically exercise more execution paths.
Sensitivity to the original trace Both the pattern search and the schedule generation phases
of PECAN depend on the original trace. For example, to create the race (3,7) in Figure 1.1,
PECAN needs statements 3 and 7 to both be exercised in the original trace. However, such a
schedule, e.g., <1,5,2,3,6,7> or <5,1,2,3,6,7>, could be difficult to manifest in either real
executions or test runs. Techniques such as RaceFuzzer are effective in generating
error-inducing traces by intelligently exploring thread schedules based on statically detected
race pairs. As future work, we plan to integrate PECAN with this school of techniques to
tackle the trace sensitivity issue and improve the bug detection capability of PECAN.
4.8 Summary
In summary, this work makes the following contributions:
• We present a persuasive PTA technique as well as a prototype tool PECAN for detecting
general access anomalies in concurrent Java programs. PECAN not only predicts access
anomalies, but also generates “bug hatching clips” that deterministically instruct the input
program to exercise the predicted AAs.
• We present a general specification model of access anomalies and a prediction model
that models the problem of access anomaly prediction as a graph pattern search problem.
The graph compactly encodes the happens-before relationship between the events and the
scheduling order of memory accesses in the trace, and supports efficient pattern search of
AAs to enable PECAN to scale well to large traces.
• We present an efficient static thread schedule generation algorithm, with a proof of sound-
ness, that generates a feasible schedule for every real AA in programs that use locks
in a nested way.
• We evaluated PECAN using twenty-two multi-threaded subjects including six large con-
current systems and our experiments demonstrate that PECAN is able to effectively pre-
dict and deterministically expose real AAs.
Chapter 5
Scaling Predictive Trace Analysis by Removing Redundant Events
Predictive trace analysis (PTA) of concurrent programs is powerful in finding concurrency bugs
unseen in past program executions. Unfortunately, existing PTA solutions face considerable
challenges in scaling to large traces. We identify that a large percentage of events in the trace
are redundant for presenting useful analysis results to the end user. Removing them from the
trace can significantly improve the scalability of PTA without affecting the quality of the results.
We present a trace redundancy theorem that specifies a redundancy criterion and provides a
soundness guarantee that the PTA results are preserved after the redundancy is removed. Based
on this criterion, we design and implement TraceFilter, an efficient algorithm that
automatically removes
redundant events from a trace for the PTA of general concurrency access anomalies. We eval-
uated TraceFilter on a set of popular concurrent benchmarks as well as real world large server
programs. Our experimental results show that TraceFilter is able to significantly improve the
scalability of PTA by orders of magnitude, without impairing the analysis results.
5.1 Introduction
PTA-based solutions often experience scalability problems with large traces because of exhaus-
tively checking all feasible permutations of the trace. The largest trace reported by recent PTA
techniques [130, 131] contains less than 10K events1 and one of the techniques [131] takes more
than two minutes to analyze a trace with only 1K events. It is important for PTA techniques to
scale as the trace of large complex concurrent programs can easily contain millions or even
billions of events [122].1This corresponds to a 0.01sec execution of a Bank benchmark in [130] with 135 lines of code.
We observe that existing research addressing the scalability of PTA techniques targets two
causes of computational complexity. The first cause is the well-recognized exponential
explosion of the schedule exploration space. An array of space reduction methods have been
proposed, such as partial order reduction [38], maximal causal models [112], and staged
analysis [116]. The second cause is the computational complexity inherent in the anomaly
checking algorithms themselves. For instance, in any particular schedule, the number of event
pairs to check for race conditions is O(N²) in the worst case, where N is the size of the
trace. This complexity becomes O(N⁴) when checking for atomic-set serializability violations
(ASVs) [64, 125]. Approaches such as the meta-analysis model [33] and the work by Kahlon et
al. [60] can effectively reduce this type of complexity by limiting the analysis to programs
that obey the nested locking discipline.
In this work, we identify that a third cause of computational complexity comes from the fact that
the trace often contains a large number of events that are mapped to the same lexical statements
in the source code. While increasing the size of the trace significantly, these events do not
reveal any additional information for fixing bugs when presented to the users of the PTA tools.
Therefore, we can dramatically improve the scalability of PTA techniques if we can remove this
redundancy, i.e., produce a smaller N, while preserving the quality of the results presented to the
end user. On the surface, it seems simple to remove the operations that are lexically identical
from the trace. Unfortunately, such an approach removes important dependency information and
causes the PTA techniques to work incorrectly. Let us further illustrate this through an example.
The program in Figure 5.1 consists of a parent thread (T0), executing line 1 to line 5, and three
child threads (T(1,2,3)), executing line 6 to line 14. Since T0 generates 3 writes to variable x
and T(1,2,3) generate 9 reads of x in total, a PTA technique for checking data races will need to
examine 3×9 = 27 event pairs. It is apparent that these 27 pairs of accesses to the variable x eventually map
to only two lines in the source code (line 3 and line 13). Therefore, only one racy pair of events
is sufficient to highlight the problem in this program. In modern day concurrent programs, this
type of redundancy is prevalent due to the single-process-multiple-data (SPMD) architectural
design. A straightforward way to combat this redundancy is to record only one instance of
each lexically distinctive statement. For instance, we can choose to record only the first write
at line 3 by T0 as well as the first read at line 13 by each of the other three threads. The obtained
trace, albeit much smaller in size (4 accesses of x instead of 12), is not that useful in finding
the race, because it tells us only that these reads are performed after the write, as the result of the
thread creation operation at line 4. Therefore, the data race cannot be detected. However, if
we also record the second write by T0, a PTA algorithm can correctly report the data race by
only analyzing 5 accesses of x. Even better, by observing that the event sequences of the three
threads T(1,2,3) are all identical, we can drop the events of any two of them, resulting in only 3
accesses of x to be analyzed for race detection.
Thread T0:                      Thread T(1,2,3):
 1: for(i=1;i<=3;i++)             6: lock l
 2: {                             7: m()
 3:   write x;                    8: unlock l
 4:   fork Thread Ti;             9: m()
 5: }                            10: m()

                                 11: m()
                                 12: {
                                 13:   read x;
                                 14: }

(a) Local redundancy            (b) Global redundancy

FIGURE 5.1: Example code for illustrating the trace redundancy
Through this example, we note that the identical lexical position of two recorded events is only a
necessary but not sufficient condition for them to be redundant, in terms of preserving the results
of the PTA techniques. We propose the concept of permutational redundancy, in conjunction
with lexical redundancy, to serve as the criteria of the safe removal of events from traces before
being analyzed by the PTA techniques. The permutational redundancy criterion states that two
events by the same thread are redundant to each other (called local redundancy) if, first, their
locksets contain no different locks and, second, their inter-thread happens-before relationships
with all the other events generated by the other threads are equivalent. In addition, we extend
this notion to characterize the redundant event sequences by different threads (called global
redundancy). Two event sequences by different threads are redundant if their corresponding
events are lexically redundant. Going back to our example in Figure 5.1, the fork Thread
statement implies that the first write at line 3 by T0 happens before the read of T1 at line 13.
However, this relationship does not hold between this read operation and the second write of T0.
Therefore, the first and the second writes of T0 are not permutationally redundant to each other
and neither of them can be removed. By the same reasoning, the second and the third writes
(reads) of T0 (T(1,2,3)) are in fact redundant and only one of them is needed for further analysis.
Moreover, because the event sequences of T(1,2,3) accessing x are lexically identical and their
corresponding events are equivalent, they are globally redundant to each other and only one of
them is needed for detecting the race.
To remove the redundancy above, an alternative, simpler strategy is to drop all re-references to the
same variable at the same program location by the same thread if there are no synchronization
operations between them. For instance, the third reads of T(1,2,3) in our example are removed
from the trace. However, this simple strategy is less preferable for two reasons. First, it is limited
to removing the redundant events within the same synchronization region; redundant thread
accesses across synchronization boundaries cannot be detected using this approach. More im-
portantly, this approach is unsound in addressing trace redundancy in the general PTA treatment
of access anomalies. It may incorrectly drop useful events that manifest access anomalies other
than data races. As illustrated in Figure 5.2, this simple strategy removes the second read of T2,
which results in missing a real atomicity violation formed by the statements (10,7,10).
T1:                      T2:
 1: m1()                  3: m2()
 2: m1()                  4: m2()

 5: m1():                 9: m2():
 6:   lock l;            10:   read x;
 7:   write x;
 8:   unlock l;

• Statement pair (4,9) forms a real data race.
• The second write of Thread t1 is redundant, however. The simple strategy of “dropping all
  re-references by the same thread to the same variable if there are no synchronization
  operations between them” does not work for this redundancy, because there are lock/unlock
  operations between the two writes of Thread t1.

FIGURE 5.2: Statements (10,7,10) form a real atomicity violation. However, the simple strategy
of “dropping all re-references by the same thread to the same variable if there are no
synchronization operations between them” would drop the second read of T2 at line 10, which
causes PTA to miss this atomicity violation.
Based on the above observation, we present TraceFilter, a technique that efficiently removes
redundant events from a trace and, at the same time, preserves the results of the PTA techniques.
We first propose a generalized model of the PTA algorithms for analyzing the access anomaly
bugs in concurrent programs. Using this model, we associate each event in the trace with a new
attribute, called concurrency context, in addition to its lexical location. The concurrency context
contains the synchronization histories of the thread, at the time when the event is triggered in
the trace. We show that our technique is sound: it does not mis-classify any useful event to be
redundant, as the concurrency context strictly preserves the permutability conditions of events.
Moreover, the prefix-sharing property of the concurrency context enables us to use a compact
Trie data structure to detect redundancy in a memory-friendly way and to efficiently filter out
redundant events.
To evaluate our technique, we have implemented a prototype tool for analyzing the trace of
concurrent Java programs and evaluated the tool on a set of popular concurrent benchmarks and
real world large server programs. We considered the PTA of all three common concurrency
access anomaly bugs including data races, atomicity violations, and ASVs. Our experimental
results show that: (1) redundant events are pervasive in concurrent programs and our technique
is very effective for detecting them. The overall percentage of redundant events detected by
our technique ranges from 7.9% to 99.9% in the trace, while for the real server programs, the
percentage of redundancy ranges from 34.7% to 85.5%. (2) our technique is able to significantly
improve the scalability of PTA. For a trace with more than 2M events (2,236,960) in Derby, the
PTA with our technique was able to finish in 177.5 seconds, whereas without our technique, the
same PTA does not finish in 2 hours. (3) our technique does not impair the analysis result for
the PTA of concurrency access anomaly bugs. By comparing the trace analysis results for all
the evaluated benchmarks, we empirically confirm that the analysis results reported by the trace
analysis algorithms with our technique are the same as without our technique.
The remainder of this chapter is organized as follows: Section 5.2 presents a description of
the general PTA algorithm; Section 5.3 presents our technique in detail; Section 5.4 describes our
implementation; Section 5.5 presents our empirical evaluation; Section 5.6 summarizes this
chapter.
5.2 General PTA algorithm
The essential idea of PTA for detecting concurrency access anomalies [22, 64] is based on a
permutability property between events that combines both the lockset condition and happens-
before condition. We describe a generalized PTA algorithm as follows.
A PTA algorithm A, given a trace δ and a pattern p, first decides, in the pattern, the number of
different threads, different shared variables, and different events on each shared variable by each
thread in the same atomic region. Then it uses this information to search in the trace to obtain a
set of candidate access anomalies. In order for the search to be efficient, the trace is often pre-
processed to build an index based on the thread ID, the shared variable, the access type, and the
atomic region. Each candidate access anomaly contains a sequence of events usually satisfying
one of the patterns in Figure 4.1 with respect to SV, AR, and AT, but not T, the thread scheduling
order. In this case, it continues to check whether there exists certain allowed permutations of
events that match T under the lockset and the happens-before constraints.
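To make this pre-processing step concrete, the following minimal sketch (in Java, with hypothetical names such as TraceIndex and Event; these are not PECAN's actual data structures) groups the recorded memory accesses by shared variable, thread, and access type so that candidate pairs can be enumerated without rescanning the whole trace:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the pre-processing step: memory-access events are indexed by
    // shared variable, and within each variable by thread ID, separately for
    // reads and writes. Event and its fields are hypothetical.
    final class TraceIndex {
        enum Access { READ, WRITE }

        static final class Event {
            final int id;            // position in the trace
            final long threadId;     // T
            final String variable;   // SV
            final Access access;     // AT
            final String location;   // lexical statement in the source
            Event(int id, long threadId, String variable, Access access, String location) {
                this.id = id; this.threadId = threadId; this.variable = variable;
                this.access = access; this.location = location;
            }
        }

        // variable -> (threadId -> events), split by access type
        private final Map<String, Map<Long, List<Event>>> reads = new HashMap<>();
        private final Map<String, Map<Long, List<Event>>> writes = new HashMap<>();

        void add(Event e) {
            Map<String, Map<Long, List<Event>>> index =
                (e.access == Access.READ) ? reads : writes;
            index.computeIfAbsent(e.variable, v -> new HashMap<>())
                 .computeIfAbsent(e.threadId, t -> new ArrayList<>())
                 .add(e);
        }

        Map<Long, List<Event>> readsOf(String v)  { return reads.getOrDefault(v, Map.of()); }
        Map<Long, List<Event>> writesOf(String v) { return writes.getOrDefault(v, Map.of()); }
    }

For a race pattern on a variable x, the candidate pairs are then simply the cross product of writesOf("x") and readsOf("x")/writesOf("x") across different threads.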
For a pair of events ei and ej , PTA checks the following two conditions:
I: lockset condition: Li ∩ Lj = ∅, where Li and Lj are the locks held by the corresponding
thread when the event occurs.
II: happens-before condition: ¬(ei ≺ ej) ∧ ¬(ej ≺ ei), where ≺ is the POR relation defined in
Definition 2.5.
The above conditions mean that the two events in the candidate access anomaly are permutable,
i.e., concurrent to each other (neither access happens-before the other). Consequently, the
PTA algorithm can conclude that multiple thread scheduling orders are possible for this pair of
events and report this pair as a data race bug.
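A direct transcription of the two checks might look as follows. This is a sketch under the definitions above; HappensBefore is a name of ours standing in for a vector-clock implementation of the POR relation, not an interface of PECAN:

    import java.util.Collections;
    import java.util.Set;

    // Hypothetical oracle for the POR relation of Definition 2.5.
    interface HappensBefore {
        boolean precedes(int ei, int ej); // true iff ei ≺ ej
    }

    final class Permutability {
        // Condition I: the two threads hold no common lock at the two events.
        static boolean locksetsDisjoint(Set<Long> li, Set<Long> lj) {
            return Collections.disjoint(li, lj);
        }

        // Condition II: neither event happens-before the other.
        static boolean concurrent(HappensBefore hb, int ei, int ej) {
            return !hb.precedes(ei, ej) && !hb.precedes(ej, ei);
        }

        static boolean permutable(HappensBefore hb, int ei, Set<Long> li,
                                  int ej, Set<Long> lj) {
            return locksetsDisjoint(li, lj) && concurrent(hb, ei, ej);
        }
    }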
Finally, for each access anomaly, PTA extracts the information contained in the events and
presents it to the programmer for debugging. There is no uniform rule on what information
is extracted from each event, as the users of PTA may require different levels of detail for
understanding the access anomaly bug. However, the basic information of access anomalies
should contain the lexical statements in the program on which the events are triggered.
Example Let us consider the trace in Figure 5.3. There are in total 43 events (e1-e43) in
the trace. The events e1-e6 are performed by T0, e7-e18 by T1, e19-e31 by T2, and e32-e43 by
T3. There are in total three write events, e(1,3,5), all by thread T0, and nine read events, among
which e(10,14,17) are by T1, e(22,26,30) by T2, and e(35,39,42) by T3. The locksets of the read
events e(10,22,35) contain a single lock l. The locksets of the other read/write events are empty.
Thread T0          Thread T1           Thread T2           Thread T3
e1: write x;       e7:  start T1       e19: start T2       e32: start T3
e2: fork T1;       e8:  lock l;        e20: lock l;        e33: lock l;
e3: write x;       e9:  enter m;       e21: enter m;       e34: enter m;
e4: fork T2;       e10: read x;        e22: read x;        e35: read x;
e5: write x;       e11: exit m;        e23: exit m;        e36: exit m;
e6: fork T3;       e12: unlock l;      e24: unlock l;      e37: unlock l;
                   e13: enter m;       e25: enter m;       e38: enter m;
                   e14: read x;        e26: read x;        e39: read x;
                   e15: exit m;        e27: exit m;        e40: exit m;
                   e16: enter m;       e29: enter m;       e41: enter m;
                   e17: read x;        e30: read x;        e42: read x;
                   e18: exit m;        e31: exit m;        e43: exit m;
FIGURE 5.3: A trace corresponding to a serial execution of the example program in Figure 5.1.
Since the fork Thread ti event must be executed before the start of thread ti, the happens-
before relations between the events are e1 ≺ e2 ≺ e7 ≺ . . . ≺ e18, e2 ≺ e3 ≺ e4 ≺ e19 ≺ . . . ≺ e31,
and e4 ≺ e5 ≺ e6 ≺ e32 ≺ . . . ≺ e43.
Given the above trace, PTA will first list all the read/write events on the shared variable x by
each thread. As e(1,3,5) are three write events by T0 and e(10,14,17,22,26,30,35,39,42) are nine read
events by T1,2,3, PTA will then check the 3*9=27 pairs of candidate races. By evaluating the
lockset and happens-before relation between the two events in each candidate race, PTA will
get nine real race pairs [(e(3), e(10,14,17)),(e(5), e(10,14,17,22,26,30))]. For the real race pairs, PTA
then reports the lexical statement pair contained in each of them, which finally produces the race
at lines (3,13) to the user.
5.3 Removing Trace Redundancy
This section presents our methodology for removing the trace redundancy. We start by giving
a formal modeling of the trace redundancy. We then present our algorithm for detecting
redundant events.
5.3.1 Modeling trace redundancy
Consider a PTA algorithmA that takes a trace δ as the input and produces a set of access anomaly
bugs as the output. We define the concept of redundancy as follows:
Definition 5.1. Given an algorithm A and an arbitrary input δ, a subsequence X of δ is redun-
dant iff A(δ) = A(δ/X), where δ/X denotes δ with the events of X removed.
Recall from Section 4.3 that an access anomaly is a sequence of events that can be specified by a meta
pattern that defines both the attribute values of these events and the order relation between them.
To facilitate our discussion, we first define a concept called candidate access anomaly (CAA)
that will be used in our modeling of trace redundancy:
Definition 5.2. A candidate access anomaly (CAA) corresponding to a pattern p is an event
sequence, of which the event attribute values satisfy the condition defined in p, but the order
relation between them might not satisfy the condition defined in p.
Note that a CAA should correspond to a certain pattern that is provided by the user of the PTA
algorithm. For different patterns, a CAA may contain different numbers of distinctive events.
For example, for a data race pattern, a CAA contains two events, while for atomicity violation
patterns, it contains three events. We refer to this property as the feasibility property of CAA,
used later in proving the trace redundancy theorem.
5.3.1.1 A theory of trace redundancy
Given a pattern and a trace, as described in Section 5.2, a PTA algorithm proceeds in two steps.
First, it analyzes the trace to find the sequences of events (i.e., access anomalies) that satisfy
the conditions specified in the pattern. Second, for each sequence of events, it extracts the
information contained in the events and reports it to the programmer for debugging. We can
decompose such PTA algorithms into two components: a rule R and a function f . The rule R
evaluates on a CAA, say s, and simply reports true or false. If R reports true, it means s is a real
access anomaly, and f will be applied on each event in s to generate an output.
Let us assume that the information generated by f for each event is its lexical location in the program
source, σ. Let s(e → e′) denote the result of replacing an event, e, in a CAA, s, with another
event, e′. Let R(s) denote that R reports true on s. And let ⊏ denote a relation where X ⊏ Y
means all events of the sequence X are also in Y . Based on the above assumption, we have the
following theorem for detecting redundancy in δ.
Theorem 5.3. Given an input trace δ, a pattern p and an algorithm A = (R,f) with rule R
and function f . An event e is redundant if there exists another event e′ in δ such that, for every CAA
s ⊏ δ with e ∈ s, the following three conditions hold:
• Lexical equivalence condition: f(s) = f(s(e→ e′));
• Permutational equivalence condition: R(s)⇒ R(s(e→ e′));
• CAA feasibility condition: s(e→ e′) is a CAA corresponding to p;
Proof. According to our definition of trace redundancy in Definition 5.1, if the above three
conditions are all satisfied, then for every CAA s ⊏ δ with e ∈ s ∧ R(s), there exists
s′ = s(e → e′) ⊏ δ/{e} such that R(s′) ∧ f(s) = f(s′). Thus, e is redundant.
Theorem 5.3 says that, for an event e in δ, if the three conditions are all satisfied, then the trace
δ with e or without e would always produce the same result for all the PTA algorithms in our
assumption. Therefore, we can detect redundancy in the trace by checking the three conditions
for each event e.
Let us first consider the lexical and the permutational equivalence conditions in Theorem 5.3.
Since f generates the lexical statement σ according to our assumption, the lexical equivalence
is easy to evaluate, i.e., it is satisfied iff e′ and e are triggered on the same lexical statement. For
the permutational equivalence, as we have shown in Section 5.2, the essential determinant is the
lockset and the happens-before relation between the events, which instantiate the rule R. More
specifically, we define that the events e and e′ satisfy the permutational equivalence condition
if :
• Lockset equivalence: their locksets contain no different locks, i.e., Le = Le′ ;
• Inter-thread happens-before equivalence: their happens-before relationships with all the
events by other threads are equivalent, i.e., ∀e″ with te″ ≠ te ∨ te″ ≠ te′:
¬(e ≺ e″) ∧ ¬(e″ ≺ e) ⟺ ¬(e′ ≺ e″) ∧ ¬(e″ ≺ e′).
Since the above two conditions are indeed conservative in satisfying the permutational equiva-
lence condition, we have the following theorem:
Theorem 5.4. Events e and e′ are permutationally equivalent to each other if the lockset con-
dition and the inter-thread happens-before condition are both satisfied.
In the following, we say the two events are fully equivalent to each other if they satisfy both
the lexical equivalence and the permutational equivalence conditions. By Theorem 5.4, we can
determine if e and e′ are fully equivalent to each other by checking three conditions in total:
the lexical equivalence, the lockset equivalence, and the happens-before equivalence. However,
note that neither e nor e′ is necessarily redundant even if they are fully equivalent. We have to also
consider the CAA feasibility condition in Theorem 5.3.
The CAA feasibility condition requires that s(e → e′) is a CAA corresponding to the pattern p.
Recall in Section 5.3.1 that no two events in the CAA should be the same. Therefore, to satisfy
this condition, e′ must not be in s. More specifically, to determine whether or not an event e
is redundant, we have to ensure that, for any CAA s, there always exists an event e′ that is not
in s, such that e and e′ are fully equivalent to each other. However, this condition in general
cannot be established without considering the pattern p that s corresponds to.
For example, consider an atomicity violation pattern, which specifies three events with two of
them, e1 and e2, from the same thread and the third one from another thread. Even if these two
events are fully equivalent to each other, the pattern requires both of them to be present to form
the bug condition. However, an event, e3, is truly redundant if it is also fully equivalent to e1and e2 because two events are sufficient according to the definition of the bug pattern.
Hence, to determine whether the CAA feasibility condition can be satisfied or not, we need
to consider the specific pattern that the CAA sequence, s, corresponds to. This leads to our
definition of norm with respect to each pattern as follows:
Definition 5.5. The norm of a pattern p, denoted as ∥p∥, is the maximum number of lexically
and permutationally equivalent events allowed in p. For example, the norm of a data race pattern
is 1, and the norm of an atomicity violation is 2.
Given the definition of pattern norm above, we have the following theorem:
Theorem 5.6. An event e is redundant if the number of fully equivalent events to e in the trace
is no less than the pattern norm ∥p∥.
Proof. Suppose there are ∥p∥ or more equivalent events to e in the trace. Let us put them into
a set S. As there are at most ∥p∥ fully equivalent events for any CAA that corresponds to the
pattern p, no matter what events the CAA, s, contains, there always exists at least one event in S
that is not in s but fully equivalent to e. Therefore, the CAA feasibility condition in Theorem 5.3
is satisfied. e is redundant, because both the lexical and the permutational equivalence conditions
are also satisfied.
Therefore, using Theorem 5.6, given a pattern and a trace, we can determine whether an event
e is redundant or not in the trace by counting the number of fully equivalent events to e. If the
number is no less than the pattern norm, we can classify that e is redundant and remove e from
the trace.
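The resulting test is easy to state operationally. The sketch below keeps at most ∥p∥ events per full-equivalence class and drops the rest; the class name and the string key are ours for illustration (the actual implementation uses the Trie described in Section 5.3.2):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the redundancy test implied by Theorem 5.6: at most ||p||
    // events are kept per full-equivalence class; further events in the class
    // are redundant and dropped. The equivalence key (thread ID + lexical
    // location + concurrency context) is hypothetical.
    final class NormFilter {
        private final int norm;                          // ||p|| of the checked pattern
        private final Map<String, Integer> keptPerClass = new HashMap<>();

        NormFilter(int norm) { this.norm = norm; }       // 1 for races, 2 for atomicity

        /** Returns true if the event should be kept, false if it is redundant. */
        boolean keep(String equivalenceKey) {
            int kept = keptPerClass.getOrDefault(equivalenceKey, 0);
            if (kept >= norm) return false;              // enough equivalents already kept
            keptPerClass.put(equivalenceKey, kept + 1);
            return true;
        }
    }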
5.3.1.2 Concurrency context
According to Theorem 5.3, to detect redundant events, we need to check lexical equivalence,
permutational equivalence, and the CAA feasibility conditions between events. While lexical
equivalence and the CAA feasibility conditions are straightforward to compute, we have to
properly model the lockset and happens-before relationships of each event in order to compute the
permutational equivalence condition.
Recall in Section 5.2 that the lockset of an event is the set of locks the thread is holding when
it triggers the event, and the POR relation is computed using vector clocks by considering the
internal events in each thread and the synchronization events across different threads. To support
the efficient checking of these two conditions for detecting redundant events, we introduce a
new attribute, concurrency context, for each event to encode the lockset and the happens-before
relation in a uniform way:
Definition 5.7. The concurrency context of each event includes both the LOCK/UNLOCK and
the message send/receive (FORK/JOIN/WAIT/NOTIFY) history of the thread at the time when
the event is triggered.
By defining the concurrency context in this way, we have the following theorem:
Theorem 5.8. Two events from the same thread with the same concurrency context are permu-
tationally equivalent to each other.
Proof. Since the concurrency context encodes the LOCK/UNLOCK history, the two events
must have the same lockset, which satisfies the lockset equivalence condition. In addition,
since the concurrency context encodes the message send/receive history, which determines the
happens-before relation between events across different threads (recall the happens-before
relation defined in Chapter 2, the second condition), these two events from the same thread must also have
the same happens-before relations with all the other events by the other threads, hence satisfying
the inter-thread happens-before equivalence condition.
In addition, as programmers may require more details besides the lexical location of the access
anomaly, we also include the runtime method call stack (ENT/EXT) of each event in its concur-
rency context, to give programmers the full calling context information for understanding the
bug. Note that our definition of the concurrency context naturally supports online computation.
This is important since, as mentioned earlier, large traces may not fit in memory.
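A minimal sketch of such an online concurrency context, in Java with hypothetical names (not PECAN's actual implementation), is shown below. ENT/LOCK open a scope, EXT/UNLOCK remove the most recent matching entry, and message events are appended permanently:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Sketch of the concurrency context of Definition 5.7, extended with the
    // call stack as described above. Names are hypothetical.
    final class ConcurrencyContext {
        private final Deque<String> scopes = new ArrayDeque<>();  // open ENT/LOCK scopes
        private final List<String> messages = new ArrayList<>();  // FORK/JOIN/WAIT/NOTIFY history

        void enterMethod(String m) { scopes.push("m:" + m); }
        void exitMethod(String m)  { scopes.removeFirstOccurrence("m:" + m); }
        void lock(String l)        { scopes.push("l:" + l); }
        void unlock(String l)      { scopes.removeFirstOccurrence("l:" + l); }
        void message(String g)     { messages.add("g:" + g); }

        /** Snapshot taken when an event triggers; by Theorem 5.8, two events of
         *  the same thread with equal snapshots are permutationally equivalent. */
        List<String> snapshot() {
            List<String> s = new ArrayList<>(messages);
            s.addAll(scopes);                            // most recent scope first
            return s;
        }
    }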
5.3.1.3 Two dimensions of redundancy
The model above describes a general way of determining redundancy in the context of a PTA
algorithm for concurrency access anomaly detection. According to Theorem 5.6, we know that
at most ∥p∥ fully equivalent events need to be kept, and all additional ones are considered to be
redundant. Conceptually, we can decompose redundancy into
two dimensions: the redundant events from the same thread and those from different threads.
According to Theorem 5.8, since full equivalence between two lexically equivalent events by
the same thread can be determined by comparing their associated concurrency contexts, an ad-
vantage of this decomposition is that it allows the separation of local and global reasoning of
redundancy with respect to each individual thread. We next show the decomposition in detail.
Local redundancy The first dimension of redundancy is called local redundancy, defined over
the events of each individual thread. Consider the set of fully equivalent events. If we further
divide it into subsets grouped by the thread ID, we are able to determine the redundancy locally
to each thread, without checking against all the events in the trace. More specifically, if the
size of some subset exceeds the pattern norm, the additional events in the subset are already
redundant regardless of the events in the other subsets. We refer to these additional events as
the locally redundant events and they can be safely removed from the trace. As an example,
consider detecting the data race on the trace in Figure 5.3. Since the second and third writes of
x (e(3,5)) by thread T0 are equivalent to each other, and the norm of a data race pattern is one,
we can safely remove either e3 or e5 from the trace.
Global redundancy The second dimension of redundancy is called global redundancy, which
is defined over the events across different threads. For general access anomaly patterns, it is
difficult to determine the equivalence between events from different threads. The reason is that
the permutational equivalence condition requires checking the happens-before relation between
the two events against all the other events in the trace. For two events from different threads,
their happens-before relationships with the events from the other thread would be different and,
thus, the permutational equivalence condition may never be satisfied. For example, the events
e30 and e42 by threads T2 and T3 (in Figure 5.3), respectively, are not equivalent to each other, as
their happens-before relationships with all the other events in T2 and T3 are different. Therefore,
to determine the redundancy across different threads, we need to examine the access anomaly
patterns in more detail.
Recall that an access anomaly pattern [E,T,SV,AR,AT] specifies a sequence of events by
different threads. Consider the element T which specifies the meta thread ID sequence in the
pattern. Our observation is that, only a limited number (nt) of different threads are required in
the formation of an access anomaly pattern. If there are more than nt threads in the trace that
contain lexically identical events with respect to the pattern, those additional threads are
redundant and all their corresponding events, which are referred to as the globally redundant
events, can be removed. For example, consider the threads T(1,2,3) in Figure 5.1, since their
event sequences are lexically identical to each other (because they execute the same code), we
only need to keep the events from two of them because the three common access anomalies all
require only two threads. The reason is that any access anomaly contributed by the events from
the redundant threads can be replaced by the events from the remaining threads in the trace, as
the access anomaly pattern does not require concrete but rather meta thread IDs. This is also
known as symmetry reduction in model checking techniques [117].
To generalize to any pattern that specifies nt different threads, we determine global redundancy
by comparing the entire event sequences between different threads. For the set of threads that
contain lexically identical event sequences, we only keep nt of them (if the size of the set is
larger than nt) and discard the events from the rest of them. We detect global redundancy after
processing local redundancy to reduce the computation effort.
5.3.2 Filtering redundant events
To efficiently encode and filter redundant events, we design two filters for dealing with both the
locally and the globally redundant events. Our filters use the Trie data structure to represent the
concurrency contexts. The reason for choosing a Trie is that, for any particular thread, the stream
of events exhibits strong temporal locality due to the stack-based computation model. Events
generated at the top level of the function stack share all their preceding events generated by the
entire stack. We leverage this phenomenon to make good use of the prefix sharing capability of
Trie and to perform the online analysis of the events.
More specifically, each node in the Trie represents an element in the concurrency context, e.g.,
a method entry or a lock acquisition operation, and each node is also associated with a bounded
stack whose capacity is set to the norm of the access anomaly pattern. During the fil-
tering, each incoming event is added to the stack of the Trie node that represents its concurrency
context; when the stack is full, the event is discarded and automatically removed from the trace,
as it is guaranteed to be redundant with respect to the analysis
result.
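The following sketch shows the essential shape of such a Trie node (hypothetical names; the event type is left generic), with a bounded per-node stack whose capacity equals the pattern norm:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a Trie node as used by the filters: each child edge is labeled
    // with one concurrency-context element, and each node carries a bounded
    // stack whose capacity equals the pattern norm.
    final class TrieNode<E> {
        private final Map<String, TrieNode<E>> children = new HashMap<>();
        private final Deque<E> stack = new ArrayDeque<>();

        /** Walks (and lazily creates) the path for a concurrency context. */
        TrieNode<E> find(List<String> context) {
            TrieNode<E> node = this;
            for (String element : context) {
                node = node.children.computeIfAbsent(element, k -> new TrieNode<>());
            }
            return node;
        }

        /** Tries to store an event; returns false if the stack is already full,
         *  i.e., the event is redundant and can be discarded. */
        boolean offer(E event, int norm) {
            if (stack.size() >= norm) return false;
            stack.push(event);
            return true;
        }
    }

Because consecutive events of a thread share long context prefixes, consecutive lookups mostly walk the same path, which is what makes the structure memory-friendly.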
Algorithm 5 TraceFilter(δ)
 1: Input: δ - a trace ⟨ei⟩
 2: cctxt ← empty concurrency context, for each thread t
 3: for i = 1 to |δ| do
 4:   switch ei
 5:     case MEM(σi, vi, ai, ti, Li):
 6:       DetectLocalRedundancy(ei, ti, σi, cctxti)
 7:     case ENT(mi, ti):
 8:       add mi to cctxti
 9:     case EXT(mi, ti):
10:       remove mi from cctxti
11:     case LOCK(li, ti):
12:       add li to cctxti
13:     case UNLOCK(li, ti):
14:       remove li from cctxti
15:     case WAIT/NOTIFY(gi, ti):
16:       add gi to cctxti
17: DetectGlobalRedundancy(δ)
Algorithm 5 shows our TraceFilter algorithm for removing redundant events in the trace. It
consists of two parts: an online algorithm (Algorithm 6 DetectLocalRedundancy) for detect-
ing local redundancy and an in-memory algorithm (Algorithm 7 DetectGlobalRedundancy)
for detecting global redundancy. The algorithm conducts a linear scan of the input trace and
maintains a concurrency context for each thread during the analysis. The concurrency context
is computed as follows. If the event is a method entry/lock acquisition (ENT/LOCK) event,
the method/lock ID (m/l) will be added to the thread’s concurrency context. If the event is a
method exit/lock release (EXT/UNLOCK) event, the most recent method/lock ID (m/l) will
be removed from the thread’s concurrency context. If the event is a message send/receive
(FORK/JOIN/WAIT/NOTIFY) event, the message ID (g) will be added to the thread’s con-
currency context. Otherwise, if the event is a shared variable read or write access (MEM),
it will call the algorithm DetectLocalRedundancy for checking redundancy with the current
concurrency context of the thread.
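In Java, this dispatch can be sketched as follows, reusing the hypothetical ConcurrencyContext from Section 5.3.1.2; TraceEvent and Kind are stand-ins of ours for PECAN's internal event representation:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the one-pass scan of Algorithm 5.
    final class TraceScan {
        enum Kind { MEM, ENT, EXT, LOCK, UNLOCK, FORK, JOIN, WAIT, NOTIFY }

        static final class TraceEvent {
            final Kind kind;
            final long threadId;
            final String name;   // method, lock, or message ID (σ for MEM events)
            TraceEvent(Kind kind, long threadId, String name) {
                this.kind = kind; this.threadId = threadId; this.name = name;
            }
        }

        private final Map<Long, ConcurrencyContext> contexts = new HashMap<>();

        void process(TraceEvent e) {
            ConcurrencyContext ctx =
                contexts.computeIfAbsent(e.threadId, t -> new ConcurrencyContext());
            switch (e.kind) {
                case ENT:    ctx.enterMethod(e.name); break;
                case EXT:    ctx.exitMethod(e.name);  break;
                case LOCK:   ctx.lock(e.name);        break;
                case UNLOCK: ctx.unlock(e.name);      break;
                case MEM:    detectLocalRedundancy(e, ctx); break;
                default:     ctx.message(e.name);     break; // FORK/JOIN/WAIT/NOTIFY
            }
        }

        private void detectLocalRedundancy(TraceEvent e, ConcurrencyContext ctx) {
            // Look up the Trie node for ctx.snapshot() and try to push e (Algorithm 6).
        }
    }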
Algorithm 6 DetectLocalRedundancy(e, t, σ, cctx)
 1: Input: e - an event in the trace
 2: Input: t - a thread ID
 3: Input: σ - a program location
 4: Input: cctx - a concurrency context
 5: local_trie_map (t → (σ → trie)): a map from a given t and σ to a trie
 6: trie ← local_trie_map(t, σ)
 7: stack ← trie.get(cctx) // get the corresponding stack of cctx
 8: if stack is full then
 9:   discard e
10: else
11:   add e to stack
Detecting local redundancy In our algorithm DetectLocalRedundancy, our local redundancy
filter checks each event that is associated with a shared variable access. We first find the node in
the Trie, given the concurrency context of the event. If the stack associated with the node is full,
the event is discarded from the trace and the algorithm continues to process the next event. The
algorithm terminates after the last event in the trace is analyzed. The worst case time complexity
of this algorithm is linear in the trace size multiplied by the maximum length of the concurrency
context, i.e., the number of events in the concurrency context.
Figure 5.4 (left) shows an example snapshot of the local filter, assuming the norm of the
detected access anomaly pattern is 2. The table in Figure 5.4 (left) lists eight events and their
associated concurrency contexts, which consist of locks l1 and l2 and methods m1 and m2. In
this Trie, each node contains a particular context element and is associated with a stack of size
2 for storing events. The events e7 and e8 are not stored because they hit the same node as e5
and e6 and the stack is full.
Local filter (left) - events and their concurrency contexts:

Event   Concurrency context
e1      <m1>
e2      <m1, l1>
e3      <m1, l1>
e4      <m2, l1>
e5      <m2, l1, l2>
e6      <m2, l1, l2>
e7      <m2, l1, l2>
e8      <m2, l1, l2>

Global filter (right) - threads, their event sequences, and lexical locations:

Thread  Events              Lexical locations
T1      <e1, e2, e3>        <A, B, C>
T2      <e4, e5, e6>        <A, B, C>
T3      <e7, e8, e9>        <A, B, D>
T4      <e10, e11, e12>     <A, B, C>
T5      <e13, e14>          <B, C>
T6      <e15, e16>          <B, C>
T7      <e17, e18, e19>     <A, B, D>
T8      <e20, e21>          <B, C>

FIGURE 5.4: Trie representation of local (left) and global (right) redundancy
Detecting global redundancy We invoke the algorithm, DetectGlobalRedundancy, to re-
move global redundancy across different threads after removing local redundancy in each thread.
We first categorize the events according to their thread IDs. Instead of populating the Trie with
the attributes of the concurrency context, we use the lexical locations of the events as keys and
store the thread IDs in the Trie. Our algorithm iterates through the set of all
threads and updates the global Trie according to the lexical locations of the events in the event
sequence of each thread. If the corresponding lexical locations of two event sequences by two
threads are identical, the two thread IDs will be placed in the stack associated with the same
node. If a stack is full, all the events from the newly arriving thread are discarded.
Figure 5.4 (right) shows an example snapshot of the global filter. The table shows the cate-
gorized events generated by eight threads, with the lexical location of each event shown alongside it.
The events from the threads T1, T2, and T4 have the same lexical location sequence <A,B,C>,
and the events from the threads T5, T6, and T8 have the same lexical location sequence <B,C>.
Following each sequence, the thread IDs are recorded by the filter. Supposing that the access anomaly
patterns in this example specify at most 2 different threads, the events from the threads T4 and
T8 are all dropped because the corresponding stacks in the global Trie are full: T4
is mapped to the same node as T1 and T2, and T8 is mapped to the same node as T5 and T6.
This operation uses the global information of the remaining event sequences of each thread;
therefore, the entire trace is required to be in memory. For a large raw trace, this requirement
is hard to satisfy. Fortunately, after removing local redundancy, the size of the raw trace is often
greatly reduced, so that our technique is able to handle large traces despite the fact that removing
global redundancy is not memory-friendly.
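Semantically, the global filter can be sketched as below (hypothetical names; a flat map keyed by the whole lexical-location sequence is used in place of the Trie, which is equivalent for the purpose of grouping):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the global filter: threads are grouped by the sequence of
    // lexical locations of their (locally filtered) events, and at most nt
    // thread IDs are kept per group; the event sequences of the remaining
    // threads are dropped.
    final class GlobalFilter {
        private final int nt;   // number of distinct threads the pattern requires
        private final Map<List<String>, List<Long>> keptThreads = new HashMap<>();

        GlobalFilter(int nt) { this.nt = nt; }

        /** Returns false if this thread's whole event sequence is globally redundant. */
        boolean keep(long threadId, List<String> lexicalSequence) {
            List<Long> group =
                keptThreads.computeIfAbsent(lexicalSequence, k -> new ArrayList<>());
            if (group.size() >= nt) return false;   // stack full: discard the thread
            group.add(threadId);
            return true;
        }
    }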
Algorithm 7 DetectGlobalRedundancy(δ)
 1: Input: δ - a trace ⟨ei⟩
 2: δt: the event sequence by thread t in δ
 3: trie: the global trie
 4: // iterate through the set of all threads
 5: for all t ∈ T do
 6:   trie ← UpdateGlobalTrie(δt)
 7:   stack ← trie.get(t) // get the corresponding stack of t
 8:   if stack is full then
 9:     discard δt
10:   else
11:     add δt to stack
5.4 Implementation
We implemented our technique on top of PECAN. To obtain a trace, PECAN first takes the
bytecode of an arbitrary Java program and outputs an instrumented version that collects the
events of interest during the program execution.
For detecting concurrency bugs using PTA, PECAN collects the following types of events in a
global order: READ/WRITE accesses to shared variables, method entry/exit, LOCK/UNLOCK,
FORK/JOIN, and WAIT/NOTIFY events. To support the recording of long running programs,
PECAN does not hold the entire trace in the main memory but saves it to a database. To reduce
the unnecessary recording of accesses on thread local variables, PECAN also performs a static
thread escape analysis [42] to identify all the possible shared variables in the program. Each
event in the trace is associated with a set of attributes: the access type, the memory address, the
thread ID, and the location in the program source. To reduce the runtime cost, the concurrency
context information of each event used by our technique for detecting redundant events is not
recorded during the trace collection. Instead, it is computed and maintained at the time when
the events in the trace are processed by our technique.
After applying our technique for removing the redundant events in the trace, the PTA engine
of PECAN takes the trace as input and reports detected access anomalies. Each of the reported
access anomalies is a pure event sequence satisfying the specification of the access anomaly
pattern. For two access anomalies that have the same lexical information but contain different
event sequences, PECAN can also be configured to report either both of them or only one of
them, by checking the redundancy between them.
TABLE 5.1: TraceFilter experimental results - RQ1: Effectiveness

Program       SLOC     Input/#Thread   #Events     #SV    Size      Local redundancy     Global redundancy
BuggyPro      348      33              10,075      5      424KB     5,876 (58.3%)        147 (1.5%)
Shop          220      100             15,560      3      654KB     6,684 (44.1%)        462 (2.9%)
Loader        139      100             34,788      2      1.5MB     12,094 (34.8%)       97 (0.3%)
ArrayList     5,979    451             40,558      696    1.7MB     3,208 (8.1%)         0 (0.0%)
LinkedList    5,866    451             53,173      2,266  2.2MB     4,020 (7.9%)         0 (0.0%)
RayTracer     1,924    SizeA/10        350,688     24     14.7MB    327,645 (93.4%)      20 (0.0%)
SpecJBB2005   17,245   8               484,841     113    20.4MB    281,338 (58.0%)      2 (0.0%)
Tsp           709      map4/4          1,048,433   260    44.1MB    1,042,293 (97.7%)    1,248 (0.1%)
Moldyn        1,352    SizeA/10        1,062,629   26     44.7MB    1,003,062 (94.4%)    196 (0.0%)
Sor           951      SizeA/4         5,545,200   6      233.2MB   5,544,954 (99.9%)    0 (0.0%)
OpenJMS       262,842  10              904,435     285    38.0MB    773,764 (85.5%)      0 (0.0%)
Tomcat        339,405  100             1,296,338   401    54.5MB    569,047 (43.9%)      663 (0.0%)
Jigsaw        381,348  10              479,105     407    20.1MB    166,338 (34.7%)      5 (0.0%)
Derby         665,733  bug#2861/100    2,236,960   199    94.1MB    1,449,550 (64.8%)    4,502 (0.2%)
5.5 Evaluation
The goal of our technique is to improve the scalability of the PTA of concurrency access anomalies
while guaranteeing the soundness of the analysis. Accordingly, our evaluation aims at answering the
following questions:
RQ1. Effectiveness - How much local redundancy as well as global redundancy can our ap-
proach remove from the trace?
RQ2. Efficiency - How efficient is our approach for removing trace redundancy? And how
much improvement on the scalability of PTA for concurrency access anomalies can our
approach contribute?
RQ3. Correctness - Does our approach indeed guarantee soundness empirically, i.e., does it
never remove any non-redundant events from the trace w.r.t. the PTA?
The remainder of this section presents our experimental results on the three questions. All our
experiments were conducted on an 8-core 3.00GHz Intel Xeon machine with 16GB of memory,
running Linux 2.6.22.
Benchmarks We consider a set of widely used third-party concurrency benchmarks. We
configure the program inputs to generate traces of different sizes and complexity. To understand
the performance of our technique on real applications in practice, we also include several large
server systems in our benchmarks. The first column in Table 5.1 shows the benchmarks used
in our experiments. The sizes of our evaluation benchmarks range from a few hundred lines to
over 600K lines of code.
5.5.1 RQ1: Effectiveness
The goal of our first research question is to investigate how much redundancy exists in the exe-
cution traces of real concurrent programs. To generate the data necessary for investigating this
question, we proceed as follows. For each benchmark, we first run it multiple times with differ-
ent inputs and numbers of threads, and use PECAN to collect the corresponding traces. For
each trace, we then apply our technique to produce a filtered trace with the redundancy removed.
We checked three types of patterns: data races, atomicity violations, and atomic-set serializabil-
ity violations. As our technique deals with two dimensions of redundancy (local redundancy
and global redundancy), we measured the percentage of redundant events with respect to local
and global redundancy, respectively.
Table 5.1 shows our experimental results. Column 3 (Input/#Thread) reports the input data (if
available) and the number of threads configured in the recorded execution of the benchmark.
Columns 4-6 (#Events, #SV, Size) report the number of events in the trace, the number of real
shared memory locations that contain both read and write accesses from different threads, and
the size of the trace on the disk, respectively. As the table shows, the number of events in
the trace ranges from more than 10K to 5M, with sizes from more than 400KB to 233MB on
disk. Compared to the traces evaluated in the other PTA techniques [22, 128, 130, 131], the
traces in our experiments are orders of magnitude larger. Columns 7-8 (Local,Global) report
the number of local and global redundant events, respectively, detected by our technique in the
corresponding trace. In the small benchmarks, the percentage of local redundancy ranges from
7.9% to 99.9%, and the percentage of global redundancy ranges from 0.0% to 2.9%. For the
real server programs, the percentage of local redundancy ranges from 34.7% to 85.5%, and the
percentage of global redundancy ranges from 0.0% to 0.2%.
The percentage of global redundancy is often very small compared to that of local redundancy.
The reason is that our TraceFilter algorithm first removes most of the events in
the category of local redundancy. Hence, no matter how much global redundancy there is, the
number of the remaining events in the trace after removing local redundancy is already much
smaller compared to the size of the original trace. If global redundancy is detected first, the
reported percentage of global redundancy would be much higher. However, in that case, the
entire trace should be loaded into the memory first, as detecting global redundancy requires
the entire trace. Nonetheless, the data in the table confirm our hypothesis that the redundancy
pervasively exists in concurrent programs. Although the percentage of redundancy in the real
large server programs is not as high as in the small benchmarks, it already accounts for more than
one third to a half of the entire trace.
TABLE 5.2: TraceFilter experimental results - RQ2: Efficiency

Program      Trace       TraceFilter             PTA
                         Local      Global       N (unfiltered)   Y (filtered)
BuggyPro     10,075      105ms      9ms          3.50s            1.4s
Shop         15,560      599ms      2ms          45.1s            2.6s
Loader       34,788      1.06s      5ms          456.0s           71.7s
ArrayList    40,558      14.9s      5ms          131.5s           115.6s
LinkedList   53,173      26.4s      15ms         100.5s           128.9s
RayTracer    350,688     1.07s      3ms          >2h              9.2s
SpecJBB      484,841     2.2s       12ms         112.6s           25.5s
Tsp          1,048,433   22.6s      10ms         >2h              402.5s
Moldyn       1,062,629   3.3s       4ms          >2h              27.4s
Sor          5,545,200   4.8s       1ms          >2h              33.7s
OpenJMS      904,435     9.7s       2ms          220.0s           17.2s
Tomcat       1,296,338   12.0s      5ms          1440.1s          29.7s
Jigsaw       479,105     19.5s      22ms         695.6s           35.5s
Derby        2,236,960   42.3s      16ms         >2h              177.5s
5.5.2 RQ2: Efficiency
The goal of our second research question is to assess if our approach is efficient in detecting
redundant events. Since our objective is to improve the overall scalability of PTA, the analysis
time of our technique should not contribute significantly to the overall analysis time. Hence, we
conduct experiments to evaluate the efficiency of our technique on various traces. To generate
the data necessary for investigating this question, we proceeded as follows. For both the original
trace and the filtered trace, we use PECAN to analyze the three common access anomalies on
them. During the analysis, we record the following measurements: the amount of time
needed by our technique to remove both local and global redundancy, and the time taken by the bug
detection of PTA using the filtered and the unfiltered traces. For large traces, it is possible that
PECAN is not able to load the trace into memory or finish processing the trace in a reasonable
amount of time. In such cases, we set a 2-hour time bound and terminate the analysis if it does
not finish within that bound; we report an out-of-memory error (OOM) if the analysis crashes
due to memory exhaustion.
Table 5.2 shows the experimental results. Columns 1-2 report the benchmark program and the
size of the corresponding trace. Each trace is the same as the one for evaluating the effectiveness
of our technique in Table 5.1. Columns 3-4 report the time our technique takes to detect local
redundancy and the global redundancy, respectively, in the trace. The time for removing the
local redundancy ranges from 105ms for small traces to 42.3s for large traces, while that of
detecting global redundancy is negligible (a few milliseconds), as the number of threads in the
trace is relatively small (from 4 to 100). We observe that the analysis time of our technique
depends heavily on the complexity of the trace, e.g., the number of shared variables and the depth
of the concurrency contexts of the events in the trace. For instance, for the trace with more than 5M
events in the Sor benchmark, our technique took less than 5 seconds, whereas
it took 26.4s to process the trace in the LinkedList benchmark, which contains only 53K events.
However, overall, these results show that our technique is very efficient for removing the trace
redundancy.
On the aspect of improving the PTA scalability, Columns 5-6 report the total amount of time
for the PTA to process the trace, without and with our technique for removing the redundant
events, respectively. The data show that, in most cases (except LinkedList and ArrayList), the
time needed for PTA using our technique is significantly reduced compared to the runs without
our technique. For example, for the trace with more than 1M (1,296,338) events in the Tomcat
benchmark, our technique reduced the original PTA time from 1440.1s to 29.7s. And for the
trace with 2,236,960 events in the Derby benchmark, the trace analysis with our technique was
able to finish in less than 177.5 seconds, whereas, for the unfiltered trace, the same analysis did
not finish in 2 hours. The only two exceptions were the traces in the LinkedList benchmark and
the ArrayList benchmark. The reason is that the percentages of redundancy in these two traces
are relatively small (7.9% and 8.1% respectively). Since there are not many reduction oppor-
tunities, the amount of time for the PTA to analyze the traces with and without our technique
is comparable. Nevertheless, as our technique is efficient, even for these two traces, the bug-
detection time saved by our technique for the PTA still almost offsets the cost incurred by the
redundancy removal. In summary, the results demonstrate that our approach is very effective in
removing the trace redundancy and thereby significantly improving the scalability of PTA for
detecting concurrency access anomalies in real world large traces.
5.5.3 RQ3: Correctness
The validity of the effectiveness and the efficiency evaluation is based on the assumption that our
technique does not affect the analysis results of PTA presented to the programmer. Although our
redundancy model in Section 5.3.1 shows that our technique is able to guarantee the soundness,
i.e., it does not misclassify any non-redundant event to be redundant, we would also like to see
whether the claim holds empirically in large traces in practice. It is important for us to confirm
the correctness of our technique with experiments.
For large traces, verifying the correctness of the PTA results is difficult because, in many cases, the
bug detection on the unfiltered trace does not finish in two hours. Therefore, we are unable to analyze the
benchmarks RayTracer, Moldyn, Tsp, Sor, and Derby. For the traces of the other benchmarks, we first
run them unaltered through PECAN and obtain the detected access anomalies that are poten-
tially duplicated with respect to the source code locations. From these results, we remove the
TABLE 5.3: TraceFilter experimental results - RQ3: Correctness

Program      Trace       Race          Atom          ASV
                         N      Y      N      Y      N      Y
BuggyPro     10,075      9      9      1      1      0      0
Shop         15,560      16     16     6      6      0      0
Loader       34,788      2      2      0      0      0      0
ArrayList    40,558      0      0      2      2      4      4
LinkedList   53,173      0      0      4      4      34     34
SpecJBB      484,841     24     24     1      1      0      0
OpenJMS      904,435     3      3      7      7      0      0
Tomcat       1,296,338   0      0      0      0      0      0
Jigsaw       479,105     121    121    209    209    443    443
duplicated reports and compare the remaining results to the ones reported by PECAN using the
filtered trace.
Table 5.3 shows the traces we selected and the number of distinct access anomalies for each type
of analysis. Columns labeled ‘N’ and ‘Y’ indicate whether the analysis is on the unfiltered or
the filtered trace. The results empirically support the correctness of our technique. For all these
traces, we found that the PTA using the filtered trace produced the same result as that of the
unfiltered trace. The reason that many cells in the table are zero is that PTA did not detect any
bug from the recorded trace.
5.6 Summary
We have presented a technique that automatically removes redundant events from the execution
trace, which significantly improves the scalability of predictive analysis techniques for detecting
concurrency access anomalies. In summary, we make the following contributions:
1. We define the concept of trace redundancy in the context of PTA for general access
anomalies and show that such redundancy pervasively exists in concurrent software systems.
2. We present a technique, TraceFilter, that filters out redundant events in a trace for improving
the scalability of PTA. The soundness of our technique is guaranteed by a theorem showing that
our technique does not impair the trace analysis result.
3. We evaluate our technique on a set of concurrency benchmarks as well as several large
multithreaded applications. The results show that our technique is very effective and efficient,
and can significantly improve the scalability of PTA.
Chapter 6
Dynamically Simplifying Concurrency Bug Reproduction
The technique of multiprocessor deterministic replay substantially assists debugging by mak-
ing the program execution reproducible. However, facing the huge replay traces and long re-
play time, the debugging task remains stunningly challenging for long running executions. We
present a new technique, LEAN, on top of replay, that significantly reduces the complexity of
the replay trace and the length of the replay time without losing the determinism in reproducing
concurrency bugs. The cornerstone of this work is a redundancy criterion that characterizes the
redundant computation in a buggy trace. Based on the redundancy criterion, we have developed
two novel techniques to automatically identify and remove redundant threads and redundant
instructions in the bug reproduction execution. Our evaluation results with several real world
concurrency bugs in large complex server programs demonstrate that LEAN is able to reduce
the size, the number of threads, and the number of thread context switches of the replay trace by
orders of magnitude, and accordingly greatly shorten the replay time.
6.1 Introduction
Multiprocessor deterministic replay (MDR) has been shown to be effective for concurrent program debug-
ging [3, 28, 45, 48, 70, 83, 84, 100, 127]. Several recent works [45, 48, 83, 127] have also demon-
strated that the future of low overhead MDR is positive, via special hardware designs [45, 83]
or even clever software-level approaches [48, 127]. However, MDR alone is often not suffi-
cient for debugging. Even with zero-recording-overhead MDR, the debugging task can remain
stunningly challenging for concurrent programs. We identify two main reasons. First, most
real world concurrent applications are large and complex. For any non-trivial execution, the
execution trace could be huge and complicated, containing millions (or even billions) of critical
events [122] and hundreds of thousands of thread context switches [49, 55]. It is very hard for
programmers to locate a bug by inspecting the huge amount of trace information. Moreover, the
performance of replay is often poor, and its duration is hard to predict. As replay typically requires enforcing the recorded
scheduling behavior, it is often significantly slower (5x-39000x [3, 100]) than native execution.
For long running executions, the replay phase may never end within a bounded time budget. It
is very frustrating for programmers to wait without knowing when the bug will be reproduced.
To make MDR more practical for supporting concurrent program debugging, we advocate the
simplification of the replay execution and the speeding up of the replaying process, so that pro-
grammers can locate and understand concurrency bugs more effectively using a simplified re-
producible buggy execution. To achieve this goal, we propose LEAN, a concurrency bug repro-
duction technique on top of MDR, that significantly reduces the complexity (size, number of
threads, and number of context switches) of the replay trace and shortens replay time without
losing the determinism.
Key Observation Our key observation is that most computations in a buggy execution are
often irrelevant to reproducing a concurrency bug. As shown by Vaziri, Tip and Dolby [125],
most concurrency bugs are exhibited by only two threads and one or two shared variables.
The rest of the threads and shared variable accesses, if not required to understand the bug,
are redundant and can be removed from the execution. This observation also is empirically
confirmed by a comprehensive study by Lu et al. [74] on real world concurrency bugs showing
that the manifestation of more than 96% of the examined concurrency bugs involves no more
than two threads, 66% of the non-deadlock concurrency bugs involve only one variable, and 97%
of the deadlock concurrency bugs involve at most two resources. This observation also reflects
the common wisdom demonstrated by years of industrial experience (IBM ConTest [31], Stress
testing [85] and Microsoft CHESS [86]) that most concurrency bugs in practice are triggered by
a few threads and a small number of context switches. For example, stress testing for exposing
concurrency bugs typically forks as many threads as possible to repeatedly execute the same
code. However, with the correct interleaving, a few threads and repetitions are often sufficient
to trigger the bug.
To further illustrate this observation, consider a simple, but common, test case for stress testing
an account function in Figure 6.1. The parent thread T0 forks a number (N) of child threads
Ti (i = 1, 2, . . . , N), each of which repeatedly validates two methods M times: increasing and
decreasing the account by a certain amount (i). There are three assertions (A, B, C) in the
program. When an assertion is violated, in the worst case, the buggy execution trace contains N
threads (excluding T0) and M × N iterations of increasing/decreasing operations on the account.
However, in the best case, only two threads and two iterations are needed to reproduce the bug.
For instance, the increment method may be non-atomic, and an erroneous interleaving may occur
between the 5th and 10th iterations of threads T(2,3), causing assertion A to be violated. To
reproduce the error, the 5th and 10th iterations of threads T(2,3) (plus the erroneous interleaving)
are sufficient. The rest of the computation is redundant and can be eliminated from the execution
without affecting the ability to reproduce the bug.

Thread T0:                         Thread Ti (i = 1..N):
   account.set(0);                    for j = 1:M
   for i = 1:N                        {
     fork Ti                            expected = account.get()+i
   for i = 1:N                          account.increment(i)
     join Ti                       A:   assert account.get()==expected
C: assert account.get()==0              expected = account.get()-i
                                        account.decrease(i)
                                   B:   assert account.get()==expected
                                      }

FIGURE 6.1: A typical test case for stress testing an account function. A significant amount
of computation in a buggy execution of this program may be redundant.
Contributions We propose a criterion to characterize redundant computation in a buggy trace.
The criterion ensures that, after removing a redundant computation, the resultant execution is
able to reproduce the same concurrency bug. Based on the criterion, LEAN simplifies the buggy
execution by iteratively identifying and removing redundant computation from the original ex-
ecution trace (skipping the computation by controlling the execution) and, at the same time,
enforcing the same schedule between threads in the reduced execution as that in the original
buggy execution. The final result produced by LEAN is a simplified execution with redundant
computation removed.
The key challenge we address is how to effectively identify redundant computation. We further
categorize redundant computation into two dimensions: whole-thread redundancy and partial-
thread redundancy. Whole-thread redundancy identifies threads whose entire computation is
redundant. For example, threads except T(0,2,3) in our example are redundant threads and all
their computation can be removed. Partial-thread redundancy characterizes redundant instruc-
tions as part of each individual thread. For example, all iterations (except the 5th and 10th) of
threads T(2,3) in our example are partial-thread redundant.
We develop two effective techniques based on delta-debugging [144] to identify whole-thread
redundancy and partial-thread redundancy, respectively. To reduce the search space of delta-
debugging, we utilize the parent-child relationship between threads to iteratively identify
whole-thread redundancy using the dynamic thread hierarchy graph. For partial-thread redun-
dancy, we combine an adapted multithreaded program slicing technique [121] and a repetition
analysis to remove irrelevant instructions and to identify the redundant iterations of computation.
To further improve effectiveness, we also provide an easy-to-use repetition analysis framework
that allows programmers to annotate repetitive code segments of which some execution itera-
tions are potentially redundant. All redundant iterations are then automatically validated and
filtered out by our technique.
Note that the redundancy criterion is black-box in nature. It does not rely on any data or control
dependency information, and is completely based on the bug reproduction property. This allows
us to explore more simplification opportunities than white-box approaches such as program
slicing [40, 63, 90].
We implemented LEAN on top of LEAP for Java programs. Our evaluation results on a set of
real concurrency bugs in popular multithreaded benchmarks as well as several large complex
concurrent systems demonstrate that LEAN is able to significantly reduce the complexity of
the buggy execution and shorten replay time without losing determinism. LEAN produces a
simplified execution typically within 20 iterations. LEAN is able to reduce the size of the replay
trace by as much as 324x, the number of threads and thread context switches by 99.3% and
99.6%, and shorten the replay time by more than 300x.
The remainder of this chapter is organized as follows: Section 6.2 presents a model of trace
redundancy; Section 6.3 presents our technique; Section 6.4 presents our implementation and
Section 6.5 presents a case study of simplifying the reproduction of a real concurrency bug;
Section 6.6 reports our experimental results and Section 6.7 summarizes this chapter.
6.2 A Model of Trace Redundancy
Starting from an initial state Σ0 and following a schedule ξ, the program can reach a final state
Σf . We say ξ exhibits a bug if Σf satisfies a predicate, say φ, that denotes the bug. The bug
predicate is defined as follows:
Definition 6.1. (Bug predicate) A bug predicate, φ, characterizes the exhibition of a bug in the
program execution over the final program state. The bug is exhibited in the execution iff φ(Σf )
evaluates to true.
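To make the definition concrete, the following sketch (illustrative only, not part of LEAN's implementation) encodes a bug predicate in Java; the map-based abstraction of the final state Σf and the variable name are our assumptions for illustration:

    import java.util.Map;
    import java.util.function.Predicate;

    // Sketch: a bug predicate φ over an abstraction of the final state Σ_f.
    // Here the state is summarized as a map from variable names to values;
    // a real replay system would expose its own view of the final state.
    public class BugPredicateExample {
        public static void main(String[] args) {
            // φ for the account example: the bug manifests iff the final
            // balance differs from the expected value 0 (assertion C).
            Predicate<Map<String, Integer>> phi =
                    state -> state.getOrDefault("account", 0) != 0;

            Map<String, Integer> finalState = Map.of("account", 5);
            System.out.println("bug exhibited: " + phi.test(finalState)); // true
        }
    }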
Following different schedules, however, Σf may be different and may or may not satisfy φ. We
call a bug a sequential bug if some sequential schedule is able to exhibit it, and a concurrency
bug if only a non-sequential schedule can exhibit it.
From a high level view, LEAN simplifies the concurrency bug reproduction by controlling the
program execution to skip instructions in the program that are redundant to reproducing the bug.
Generally speaking, an instruction (or a group of instructions) cannot be arbitrarily skipped,
as it may result in two possible negative consequences: the program malfunctions, or the bug
disappears. The program might malfunction if the skipped instruction is an indispensable part
of the program logic, while the bug might disappear if the skipped instruction is related to the
bug. Either consequence will make the reduced execution not useful for debugging.
We propose a redundancy criterion for the concurrency bug reproduction that ensures neither of
these two outcomes will occur if a redundant instruction is skipped. The basic idea is that, after
removing the redundancy, the same bug is reproduced. A subtle problem in defining the criterion
is that we may not have such a bug predicate φ as defined in Definition 6.1. In practice, we often
use assertions or rely on runtime exceptions to determine whether a bug is exhibited or not.
However, the assertions or exceptions may be insufficient to distinguish between the behavior of
the bug manifestation and the behavior of program malfunction, in which case the program is no
longer working properly as expected due to the removal of a necessary instruction. For example,
the assertion that characterizes the bug in the original execution may always be violated after
removing a certain instruction. Although the reduced execution manifests the violation of the
assertion, it is not useful for debugging because the assertion is not able to characterize the same
bug as that in the original execution.
We tackle this issue from the perspective of thread interleavings. For a concurrency bug, essen-
tially, it is some non-deterministic buggy interleavings that cause the bug (assuming the input is
deterministic). For debugging, programmers want to understand how the bug occurs with these
buggy interleavings. If the program executes sequentially and behaves correctly, the bug should
not manifest. On the other hand, if the program malfunctions after removing an instruction, ei-
ther the program cannot proceed to execute the buggy statement or the bug predicate φ is always
satisfied regardless of the buggy interleavings. Therefore, we define the redundancy criterion as
follows:
Definition 6.2. (Trace redundancy criterion) Consider a trace δ that exhibits a concurrency
bug (δ drives the program to a state satisfying the bug predicate φ) and a subset E of the events
in δ. Let δ/E denote the trace δ with the events in E removed. E is redundant if the following
two conditions are satisfied:
I. δ/E can still drive the program to a state that satisfies φ;
II. some sequential schedule of the reduced execution does not satisfy φ.
We assume φ characterizes a concurrency bug. The soundness of this criterion is easy to see.
First, Conditions I and II together ensure that the reproduced bug is a concurrency bug,
because φ is satisfied under the original buggy schedule (excluding the events in E), but not
under a sequential schedule. Second, consider Condition II: since φ is evaluated but not satisfied
(i.e., the bug does not manifest) under a sequential schedule (we do not need to check all
sequential schedules; checking any one of them is sufficient to validate whether the bug is still
a concurrency bug), the program does not malfunction after removing the events in E. Otherwise,
either φ would not be evaluated or φ would always be satisfied. Hence, the same concurrency bug
is reproduced under Conditions I and II.
It is worth noting that trace redundancy is not defined over a single event but a subset of events
in the trace, which correspond to a group of instructions in the program execution. The reason is
that redundant instructions are not independent but may be closely related to each other. A group
of instructions may be redundant but any single instruction may not. For example, suppose an
erroneous interleaving between the 5th and 10th iterations of threads T(2,3) manifests the bug
in Figure 6.1. The whole computation of thread T1 is redundant, but any single instruction of
T1 alone is not. Without any dependence information between the instructions, removing trace
redundancy is a combinatorial optimization problem, which is exponential in the number of
instructions in the original buggy execution.
To facilitate more effective simplification, we further characterize redundancy into two dimen-
sions:
• whole-thread redundancy - all computation of a thread is redundant;
• partial-thread redundancy - some instructions of an individual thread are redundant.
This categorization utilizes the thread identity relationship between the computations. In prac-
tice, threads are more likely to be independent from one another than are individual instructions.
We can skip all the computation of a redundant thread. Compared to whole-thread redundancy,
partial-thread redundancy examines the instructions local to each individual thread. If an in-
struction by a certain thread is redundant, we can skip it during the execution of that thread. In
our illustrating example, all the other threads except T(0,2,3) are redundant (whole-thread redun-
dancy), and most of the repetitions of threads T(2,3) are redundant (partial-thread redundancy).
6.3 Automatic Redundancy Removal
We propose two techniques to remove trace redundancy for simplifying concurrency bug re-
production. The first technique effectively validates and removes whole-thread redundancy by
adapting delta-debugging [144] using thread hierarchy information. Our technique produces
a 1-minimal set of threads [144] that are not redundant in the buggy execution.
T0
├── T1
│   ├── T1:1
│   └── T1:2
│       ├── T1:2:1
│       └── T1:2:2
├── T2
│   ├── T2:1
│   ├── T2:2
│   │   └── T2:2:1 …
│   └── T2:3
│       └── T2:3:1 …
├── T3
└── …

FIGURE 6.2: An example of a dynamic thread hierarchy graph (TH-Tree). When T(1,3) are
selected, all of T(1,3) and their descendants are disabled.
The second technique targets irrelevant instructions and repetitions. It combines a dynamic multithreaded
slicing technique and a static repetition analysis, as well as a simple annotation framework that
integrates programmers’ hints. The entire simplification process is deterministic. There is no
interleaving non-determinism during simplification as we control all thread scheduling during
replay.
6.3.1 Removing Whole-Thread Redundancy
Our general idea for whole-thread redundancy follows the approach of hierarchical delta-debugging
[82, 144]. We use a bisection method to pick candidate threads and test whether they can be re-
moved from the execution or not. More specifically, we control the program to disable the
selected candidate threads and validate the reduced execution for the two conditions defined in
our redundancy criterion in Section 6.2. Our technique for removing whole-thread redundancy
is fully automatic. It does not require any user intervention.
There are two main challenges. First, threads may not be arbitrarily removed. For example,
if a parent thread is removed, none of its descendants will execute. Second, after removing a
redundant thread, we must compute the schedule of the remaining threads (in order to deter-
ministically replay the reduced execution). We address these problems as follows. First, we
extract a dynamic thread hierarchy graph of the original buggy execution (TH-Tree) and per-
form delta-debugging based on the TH-Tree, to make sure that if a parent thread is disabled, all
its descendant threads are disabled. Figure 6.2 shows an example of the TH-Tree. For example,
if T1 and T3 are selected, all their descendants (shown in the gray boxes in Figure 6.2) are also
selected. Second, we compute the schedule for the remaining threads by projecting the trace on
thread ID without the IDs of the selected candidate threads and their descendants. The schedule
is enforced in the validation run to test whether the bug can still be reproduced or not.
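The projection itself is a single pass over the trace. The following sketch illustrates the idea, assuming a hypothetical Event record carrying a thread ID; LEAP's actual trace format differs:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Sketch: compute the schedule of the remaining threads by projecting
    // the trace on thread IDs, dropping the disabled candidate threads
    // (and, in general, their descendants).
    public class ScheduleProjection {
        record Event(int threadId, String op) {}  // hypothetical event shape

        // The global schedule is the sequence of thread IDs of the
        // remaining events, in the original trace order.
        static List<Integer> projectSchedule(List<Event> trace, Set<Integer> disabled) {
            return trace.stream()
                    .filter(e -> !disabled.contains(e.threadId()))
                    .map(Event::threadId)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Event> trace = List.of(
                    new Event(1, "w(x)"), new Event(2, "r(x)"),
                    new Event(3, "w(x)"), new Event(2, "w(y)"));
            // Disable thread 1.
            System.out.println(projectSchedule(trace, Set.of(1))); // [2, 3, 2]
        }
    }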
Let validate and cx be given such that validate(cx) = ✗ (fail). The algorithm computes
c′x = ddmin(cx) = ddmin2(cx, 2) such that c′x ⊆ cx, validate(c′x) = ✗, and c′x is 1-minimal.

ddmin2(c′x, n) =
    ddmin2(∆i, 2)                     if ∃i ∈ {1, …, n} . validate(∆i) = ✗
    ddmin2(∇i, max(n − 1, 2))         else if ∃i ∈ {1, …, n} . validate(∇i) = ✗
    ddmin2(c′x, min(|c′x|, 2n))       else if n < |c′x|
    c′x                               otherwise

where ∇i = c′x − ∆i, c′x = ∆1 ∪ ∆2 ∪ · · · ∪ ∆n, all ∆i are pairwise disjoint, and |∆i| ≈ |c′x|/n.

FIGURE 6.3: The delta-debugging algorithm. The function validate returns true if the two
conditions in the redundancy criterion are both satisfied. For conciseness, the input trace is
ignored in the ddmin algorithm.
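For reference, the ddmin recursion in Figure 6.3 can be rendered compactly in Java as below. This is our sketch of the standard algorithm, with the validate oracle abstracted as a predicate; in LEAN, validate replays the reduced trace and checks the two redundancy conditions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    // Sketch of the ddmin algorithm from Figure 6.3 over a candidate set.
    // validate(c) plays the role of the test function: it returns true
    // ("✗") when the bug is still reproduced after keeping only the
    // elements of c (e.g., the selected threads).
    public class DDMin {
        static <T> List<T> ddmin(List<T> c, Predicate<List<T>> validate) {
            return ddmin2(c, 2, validate);
        }

        static <T> List<T> ddmin2(List<T> c, int n, Predicate<List<T>> validate) {
            if (c.size() <= 1) return c;              // cannot be reduced further
            List<List<T>> deltas = partition(c, n);
            for (List<T> delta : deltas)              // "reduce to subset"
                if (!delta.isEmpty() && delta.size() < c.size()
                        && validate.test(delta))
                    return ddmin2(delta, 2, validate);
            for (List<T> delta : deltas) {            // "reduce to complement"
                if (delta.isEmpty()) continue;
                List<T> nabla = new ArrayList<>(c);
                nabla.removeAll(delta);
                if (validate.test(nabla))
                    return ddmin2(nabla, Math.max(n - 1, 2), validate);
            }
            if (n < c.size())                         // "increase granularity"
                return ddmin2(c, Math.min(c.size(), 2 * n), validate);
            return c;                                 // done: c is 1-minimal
        }

        // Split c into n roughly equal, pairwise disjoint sublists.
        static <T> List<List<T>> partition(List<T> c, int n) {
            List<List<T>> parts = new ArrayList<>();
            int size = c.size();
            for (int i = 0; i < n; i++)
                parts.add(c.subList(i * size / n, (i + 1) * size / n));
            return parts;
        }
    }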
Algorithm 8 summarizes our algorithm. Given the original buggy trace, the algorithm produces a
simplified trace (execution) containing only the 1-minimal set of threads that is able to reproduce
the bug. The 1-minimal property means that all remaining threads are necessary: removing any
one of them would cause the reduced execution to fail to reproduce the bug. Our algorithm starts
by iterating on the height of the TH-Tree. In each iteration, we pick the candidate threads with
the same height. Starting from the threads with height 1 (the main thread is of height 0), we
first select the candidate threads (thread sets) to be validated for the redundancy. If a thread
is selected, its descendants are all disabled. We then process the selected threads using a delta-
debugging algorithm, as shown in Figure 6.3. Each invocation of delta-debugging computes the
1-minimal set of threads (in the input threads denoted by cx) that are necessary to reproduce
the bug. The set cx in the ddmin algorithm corresponds to the selected threads. The validate
procedure (Algorithm 9) corresponds to the test function in delta-debugging. It tests whether the
two conditions in the redundancy criterion are both satisfied after disabling the selected threads:
(1) the bug is reproduced with the computed schedule of the remaining threads; (2) the bug is
not reproduced with a sequential schedule. If both conditions are true, it means that the selected
threads are redundant and they are removed from the execution. This process is repeated for all
levels of threads in the TH-Tree, until no new thread can be removed.
Algorithm 8 RemoveWholeThreadRedundancy(δ)
1: Input: δ – the original trace ⟨ei⟩
2: Output: δ′ – the simplified trace with all redundant threads removed
3: TH_Tree ← ExtractThreadHierarchyGraph(δ)
4: height ← the height of TH_Tree
5: for level ← 1 : height do
6:     thread_set ← get_threads(TH_Tree, level)
7:     minimal_threads ← DeltaDebugging(δ, thread_set)
8:     redundant_threads ← (thread_set ∖ minimal_threads) and their descendants
9:     remove redundant_threads from TH_Tree
10:    remove all events by redundant_threads in δ
11: return δ
Algorithm 9 Validate(δ,disabled threads)
1: Input: δ – a trace ⟨ei⟩
2: Input: disabled_threads – a set of disabled threads
3: δ′ ← remove all events by disabled_threads in δ
4: ξ ← get_schedule(δ′)
5: ξseq ← get_sequential_schedule(δ′)
6: if IsBugReproduced(δ′, ξ) then
7:     if IsBugNotReproduced(δ′, ξseq) then
8:         return true
9: return false
6.3.2 Removing Partial-Thread Redundancy
To identify partial-thread redundancy, we may directly apply delta-debugging at the level of
individual instructions. However, this naive approach is ineffective because enumerating and
validating every combination of instructions for each individual thread could be very expensive.
To improve efficiency, our technique combines multithreaded dynamic slicing with a repetition
analysis to identify the redundant computation local to each individual thread. Dynamic slicing
tracks the data and control dependencies between instructions in the execution trace and removes
those instructions that are irrelevant to the bug. Repetition analysis is a heuristic that aims at
removing the redundancy related to repetitions. To further improve the effectiveness of repe-
tition analysis, LEAN also provides a simple framework that allows programmers to annotate
repetitive code segments, which significantly reduces the search space.
6.3.2.1 Multithreaded dynamic slicing
The dynamic dependence graph (DDG) is the classical model for slicing single-threaded execu-
tions, which captures the dynamically exercised Read-After-Write (RAW) and control depen-
dencies. Each node in the DDG represents an execution instance of a statement (an instruction)
while edges represent the dependences. For multithreaded executions, Tallam et al. [121] pro-
pose a dynamic slicing model for data race detection. Their model extends the DDG to
consider the additional data dependencies on shared variable accesses.
Our slicing model for concurrency bug reproduction is similar to but more strict than the model
by Tallam et al. [121]. To guarantee deterministic bug reproduction, in addition to the shared
variable read/write dependencies, we also need to consider the dependencies on synchroniza-
tion operations. Specifically, given a buggy execution, we construct a multithreaded depen-
dence graph (MDG) that consists of the DDG for each individual thread as well as the depen-
dence relation → (recall Definition 2.5) between instructions by different threads. Note that the
WRITE→WRITE dependency must be included in the MDG, to ensure the correctness of MDR
[49]. Otherwise, a read in the replaying phase may return the value written by a different write
from that in the original buggy execution, which may cause the failure of MDR.
Algorithm 10 shows our dynamic slicing algorithm for removing the partial-thread redundancy.
We first construct the MDG that includes the DDG for each thread in the execution and the
synchronization and shared variable dependencies. Starting from the buggy instruction which
violates the bug predicate, we perform a backward analysis that keeps only the instructions with
a direct or a transitive dependency relation to the buggy instruction. All other instructions are
marked to be irrelevant to reproducing the bug and are skipped in the simplified execution.
Algorithm 10 DynamicMultithreadedSlicing(δ, αf)
1: Input: δ – the full execution trace after removing all redundant threads
2: Input: αf – the buggy instruction
3: Output: δ′ – the simplified trace
4: mdg ← ConstructMultithreadedDependencyGraph(δ)
5: mdg′ ← ReverseEdge(mdg)
6: relevant_instructions ← DepthFirstSearch(αf) on mdg′
7: δ′ ← remove the instructions from δ that are not in relevant_instructions
8: return δ′
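The backward pass in Algorithm 10 is a plain graph reachability computation. A sketch, assuming the MDG is given as reversed adjacency lists keyed by event index (our representation, not the actual implementation's):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the backward analysis: starting from the buggy event, keep
    // every event it (transitively) depends on. reversedMdg maps each event
    // to the events it depends on (RAW, control, synchronization, W->W).
    public class BackwardSlice {
        static Set<Integer> relevantEvents(Map<Integer, List<Integer>> reversedMdg,
                                           int buggyEvent) {
            Set<Integer> relevant = new HashSet<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(buggyEvent);
            while (!stack.isEmpty()) {
                int e = stack.pop();
                if (!relevant.add(e)) continue;   // already visited
                for (int dep : reversedMdg.getOrDefault(e, List.of()))
                    stack.push(dep);
            }
            return relevant;  // all other events are sliced away
        }
    }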
6.3.2.2 Repetition analysis
Redundancy is often caused by repetitions. Specifically, we observe that a large portion of
the redundant computation by each individual thread is rooted in repetitive code blocks (RCBs)
that contain repeated operations in loops. The operations inside an RCB are expected to execute
for a number of iterations governed by the loop condition, with no break operation. The loop
variable is often a primitive value (e.g., an integer) that is used as a counter for the number of
iterations so far. We propose a static repetition analysis to identify RCBs in the program. The RCBs
are used as a pool of potentially redundant computation that we may simplify. Each execution
iteration of an RCB is considered potentially redundant. After validating the redundancy of an
iteration using our redundancy criterion, we can remove all computation of this iteration from
the execution.
Our repetition analysis is based on a simple intra-procedural loop analysis. For each loop, we
consider two conditions to mark it as a potential RCB. First, the loop condition contains only
constants or primitive data, and the loop variable is only incremented or decremented once in
each iteration. Second, there is no break operation inside the loop (exceptions are allowed).
Despite its simplicity, our experiments show that this analysis is effective and efficient for
identifying redundant computation caused by RCBs.
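The following sketch shows the shape of this check; the Loop interface and its accessors are hypothetical stand-ins for whatever intermediate representation the underlying static analysis framework provides:

    // Sketch of the RCB heuristic over a hypothetical loop representation.
    public class RcbHeuristic {
        interface Loop {
            boolean conditionUsesOnlyConstantsAndPrimitives();
            int loopVariableUpdatesPerIteration(); // increments/decrements per iteration
            boolean containsBreak();               // exceptions are still allowed
        }

        // A loop is a candidate repetitive code block (RCB) if its trip
        // count is governed by a simple counter and it cannot exit early
        // via break.
        static boolean isPotentialRcb(Loop loop) {
            return loop.conditionUsesOnlyConstantsAndPrimitives()
                    && loop.loopVariableUpdatesPerIteration() == 1
                    && !loop.containsBreak();
        }
    }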
Algorithm 11 shows our algorithm for removing partial-thread redundancy caused by repeti-
tions. This algorithm is applied after slicing the buggy trace. We first identify the RCB that
Algorithm 11 RemoveRepetitionRedundancy(p, δ)
1: Input: p – the program
2: Input: δ – the trace after slicing
3: Output: δ′ – the final simplified trace
4: statements ← GetRepetitiveCodeBlocks(p)
5: threads ← get_threads(δ)
6: for t in threads do
7:     for σ in statements do
8:         all_iterations ← get_iterations(δ, t, σ)
9:         minimal_iterations ← DeltaDebugging(δ, all_iterations)
10:        remove (all_iterations ∖ minimal_iterations) in δ
11: return δ
Ti:

    for j = 1:M
    {
        @rcb-begin
        expected = account.get() + i
        account.increment(i)
    A:  assert account.get() == expected
        expected = account.get() - i
        account.decrease(i)
    B:  assert account.get() == expected
        @rcb-end
    }

FIGURE 6.4: Some iterations of the code block demarcated by @rcb-begin and @rcb-end
are specified as potentially redundant.
contains potentially redundant computation. We then perform delta-debugging on each iteration
of the RCB for each thread, to validate the redundancy of the computation corresponding to the
iteration.
A framework for repetition analysis LEAN also provides an option for programmers to
annotate RCBs, which can significantly improve the effectiveness of our automatic repetition
analysis. Our general observation is that programmers often know whether a code block is
repetitive or not (in particular, when writing test drivers). This piece of information is easy
for programmers to specify (e.g., using simple annotations), but very difficult to identify by
any automatic approach because of the absence of a general repetition criterion. More impor-
tantly, without any further intervention, we can help programmers automatically validate whether
some executions of the RCBs are redundant or not, and eliminate them from the buggy execution
if they are redundant.
Program + buggy trace
    → hierarchical delta-debugging (remove whole-thread redundancy)
    → dynamic slicing + repetition analysis (remove partial-thread redundancy)
    → simplified buggy trace

FIGURE 6.5: An overview of LEAN
Our framework is easy to use. Programmers simply mark the beginning and the end of the RCB
by @rcb-begin and @rcb-end, respectively. For example, programmers may mark the
RCB for thread Ti as shown in Figure 6.4. We then perform delta-debugging on each
iteration of the code, and filter out most of the redundant iterations. Also, this framework is flexible.
New annotations may be added after each round of simplification, when programmers get more
information about the bug from the intermediate simplified execution.
6.4 Implementation
To evaluate our technique, we have implemented a prototype of LEAN on top of LEAP. Figure
6.5 shows an overview of LEAN. Given the target concurrent program and the buggy execution
trace, LEAN first removes the whole-thread redundancy from the trace using Algorithm 8. It
then further simplifies the resultant execution by removing the partial-thread redundancy using
Algorithm 10 and Algorithm 11. The final output produced by LEAN is a simplified buggy
execution in which redundant computation is skipped in the replayed execution.
For delta-debugging, we faithfully implemented the algorithm described in Figure 6.3. Our
slicing implementation is based on the Indus framework [104], which we adapt for dynamic
multithreaded execution traces. In addition to the data dependencies across threads, slicing also
takes care of all the data and control dependencies internal to each individual thread in the
execution.
To disable an instruction, we instrument the program to insert control statements before the
statement which corresponds to the instruction. For example, to disable a thread, we insert
control instrumentation before Thread.start() and Thread.join() to make sure that the disabled
thread is not executed and joined by any other thread. We distinguish dynamic threads by
assigning a unique ID to each thread instance (explained in Section 6.3.1). For partial-thread
redundancy, we also maintain a thread-local counter for each annotated RCB, to denote the
iteration instance of each thread in executing the RCB.
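The following sketch illustrates the effect of this instrumentation, using the helper names from Figure 6.7 (shouldStartThread, shouldJoinThread, shouldExecuteIteration); the stub implementations here are placeholders for lookups into the delta-debugging driver:

    // Sketch of LEAN-style control instrumentation. The stubs below stand
    // in for queries to the delta-debugging driver, which decides which
    // dynamic threads and which RCB iterations are enabled in the current
    // validation run.
    public class ControlInstrumentation {
        static boolean shouldStartThread(Thread t) { return true; }      // stub
        static boolean shouldJoinThread(Thread t) { return true; }       // stub
        static boolean shouldExecuteIteration(int i) { return true; }    // stub

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
                // The loop body corresponds to an annotated RCB; a
                // thread-local counter identifies the iteration instance.
                for (int j = 0; j < 10; j++) {
                    if (!shouldExecuteIteration(j)) continue; // skip redundant iteration
                    // ... body of the repetitive code block ...
                }
            });
            if (shouldStartThread(worker)) worker.start(); // disabled threads never start
            if (shouldJoinThread(worker)) worker.join();   // ...and are never joined
        }
    }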
TableDescriptor {
    getObjectName() {
        if (referencedColumnMap == null) {
            …
        }
        else {
            for (int i = 0; i < …; i++) {
                referencedColumnMap.isSet(…)
            }
        }
    }

    setReferencedColumnMap(…) {
        referencedColumnMap = null;
    }
}

FIGURE 6.6: A real concurrency bug #2861 in Derby. The thread interleaving following
the solid arrow on the shared data referencedColumnMap crashed the program with
NullPointerException.
To control the thread schedule, we reuse the application-level scheduler of LEAP. The thread
IDs of all the events in the trace form a global schedule. After disabling a thread, we simply
remove the thread ID from the global schedule. To enforce a sequential schedule, we control the
execution of a thread until it terminates or cannot continue execution (i.e., it is waiting for a lock
or for the termination of another thread), and then randomly pick an enabled thread to pro-
ceed. For removing partial-thread redundancy, we also associate each event in the trace with its
corresponding statement in the program. User annotated RCBs are interpreted as special state-
ment blocks. To generate the remaining schedule after disabling a certain iteration of an RCB,
we first remove the corresponding events in the trace according to the RCB and the per-iteration
information, and then compute the schedule by performing a projection of the remaining trace
on the thread ID.
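A minimal sketch of such an application-level scheduler is shown below, assuming the global schedule is given as the sequence of thread IDs of the events in the trace; LEAP's actual scheduler handles many more details (blocking operations, thread termination, and so on):

    import java.util.List;

    // Sketch of an application-level scheduler that enforces a global
    // schedule given as a sequence of thread IDs. Instrumentation calls
    // waitForTurn() before each critical event, so events occur in trace
    // order. Disabling a thread amounts to deleting its ID from the
    // schedule before replay.
    public class GlobalScheduler {
        private final List<Long> schedule; // thread IDs, in trace order
        private int pos = 0;

        GlobalScheduler(List<Long> schedule) { this.schedule = schedule; }

        synchronized void waitForTurn() throws InterruptedException {
            long me = Thread.currentThread().getId();
            while (pos < schedule.size() && schedule.get(pos) != me)
                wait();                            // not my turn yet
            if (pos < schedule.size()) pos++;      // consume this slot
            notifyAll();                           // wake the owner of the next slot
        }
    }

To enforce a sequential schedule, the schedule would simply list all of one thread's slots consecutively before moving to the next enabled thread.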
6.5 A Case Study
In this section, we present a case study of reproducing a concurrency bug in Apache Derby
DBMS. We illustrate how LEAN simplifies the bug reproduction.
6.5.1 Description of Derby Bug #2861
Figure 6.6 shows the concurrency bug #2861 we study in the Apache bug database. The shared
data referencedColumnMap is checked for null at the top of the getObjectName
method and later dereferenced if it is not null. Due to an erroneous interleaving, another thread
can set referencedColumnMap to null in the setReferencedColumnMap method and cause the
program to crash by throwing a NullPointerException. Figure 6.7 shows a driver pro-
gram (also documented in the bug database) for triggering the bug. Ignore all the gray areas for
the moment; these are statements inserted by LEAN. The driver program starts N threads each
creating (lines 41-45) and then dropping (lines 48-51) a separate view against the same source
view, repeated M times. Because of non-determinism, the bug is very difficult to manifest with
small N and M. In our experiment with N=2 and M=2 on an eight-core Linux machine, we did
not observe a single failure in 1,000 runs. With a larger number of threads and repeti-
tions, the probability of triggering the bug is increased. When we set N=10 and M=10, we were
able to trigger the bug in three out of 1000 runs.
With the help of a MDR system such as LEAP, we are able to deterministically reproduce the
bug. The problem is that the bug reproduction run is too complicated, with too many threads
(11) and thread context switches (6,439). The size of the execution trace (which contains the
critical events only) is as large as 94.1M, and it took LEAP 466 seconds to reproduce the bug.
6.5.2 How LEAN Simplifies the Bug Reproduction
LEAN simplifies the reproduction of this bug by removing the redundant computation in the
reproducible buggy execution. Although there are ten testing threads, each of which repeats ten
times, in the buggy execution, we can observe that, in the best case, two testing threads, each
with one iteration, are sufficient to trigger the bug. The other eight threads and nine iterations are
redundant and can be removed from the bug reproduction run.
Taking the original buggy execution as the input, LEAN first identifies and removes the re-
dundant threads in the execution using Algorithm 8. Figure 6.8 illustrates the simplification
process. Because the dynamic thread hierarchy graph in the buggy execution contains one level
of threads, the entire simplification process invokes the delta-debugging procedure only once,
which is applied directly to threads T(1,2,...,10). To skip a thread, LEAN controls the execution
of the program by inserting a condition check before Thread.start() and Thread.join() (as
shown in the gray areas at lines 23 and 27 in Figure 6.7). A thread is not started or joined if it
is removed. After four rounds of simplification, threads T(2,3) remain in the reduced execution
and all the other threads are removed. This process took 1,841 seconds in our experiment. After
removing the redundant threads, 75.1M(79.8%) of the events in the original buggy trace were
removed and the size of the remaining trace was reduced to 19M.
After removing whole-thread redundancy, LEAN then further processes the reduced buggy ex-
ecution to remove partial-thread redundancy. It first performs dynamic slicing to remove ir-
relevant instructions using Algorithm 10. As slicing tracks all the dynamic data dependencies
across threads as well as all the intra-thread data and control dependencies in the remaining
buggy execution, it took LEAN 553 seconds to finish the slicing process in our experiment, and
an additional 6.2M(6.6%) of the events were removed from the trace. Similar to the control
of threads, we simply insert control statements before the irrelevant instructions to skip their
executions.
LEAN then continues to simplify the reduced buggy execution by removing the redundant repe-
titions using Algorithm 11. Our automatic repetition analysis successfully identified the RCB at
lines 42-53 in the test thread, as demarcated by @rcb-begin and @rcb-end at lines 41 and
54 in Figure 6.7. To control the execution of a certain iteration i of the RCB, we insert a control
statement before the RCB with i as the input parameter (as shown in the gray area at line 40),
determining whether the ith iteration is enabled or not. Figure 6.9 illustrates the simplification
process for LEAN to remove the redundant execution iterations of the RCB of threads T(2,3).
After ten rounds of simplification, the 7th iteration of T2 and the 4th iteration of T3 remain and
all the other iterations are removed. This process took around 200 seconds in our experiment.
An additional 11.6M (12.3%) of the events were removed and the size of the final buggy trace
was reduced to around 2.01M.
In total, it took LEAN 2,593 seconds to simplify the original buggy execution. The final simpli-
fied execution was able to reproduce the same bug and was significantly simpler than the original
buggy execution. The simplified trace size was reduced by 47x (from 94.1M to 2.01M), con-
taining only 3 threads (T(0,2,3)) and 433 thread context switches, and its replay time by LEAN
was shortened by 46x (from 446 to 10.2 seconds). Moreover, all the instrumentations and the
thread scheduler in LEAN are transparent to the programmers, such that the debugging task can
be performed on the simplified buggy execution in a normal debugging environment.
6.6 Experiments
The goal of our technique is to improve the effectiveness of the MDR support for debugging
concurrent programs by removing redundancy from the reproducible buggy trace. Accordingly,
our evaluation aims at answering the following two research questions:
RQ1. Effectiveness - Is LEAN effective in simplifying real buggy traces? How much reduction
of the replay time and the trace complexity (i.e., size, threads, and context switches) can
our approach achieve?
RQ2. Efficiency - How efficient is LEAN for identifying and removing the trace redundancy?
Benchmarks We quantify our technique using a set of widely used third-party concurrency
benchmarks with known bugs. We configure the program inputs to generate buggy traces of
different sizes and complexity. To understand the performance of our technique on real appli-
cations in practice, we also include several large concurrent server systems in our benchmarks.
TABLE 6.1: LEAN evaluation benchmarks

Program          SLOC     Input / #Threads / #Iterations
BuggyPro         348      race exception / 33 / –
Tsp              709      map4 / 4 / –
ArrayList        5,979    not-atomic bug / 450 / –
LinkedList       5,866    not-atomic bug / 450 / –
OpenJMS-0.7.7    262,842  order violation bug / 20 / 10
Tomcat-5.5       339,405  bug#37458 / 10 / 10
Jigsaw-2.2.6     381,348  NPE bug / 10 / 10
Derby-10.3.2.1   665,733  bug#2861 / 10 / 10
TABLE 6.2: LEAN experimental results - RQ1: Effectiveness

            ------------ Original Trace ------------   ------------------- Simplified Trace -------------------
Program     Size    #Thr  #CS    Replay               Size            #Thread     #CS             Replay
BuggyPro    460K    34    1,003  1.27s                13.2K (↓97.1%)  4 (↓88.2%)  28 (↓97.2%)     39ms (↓97%)
Tsp         44.1M   5     9,190  280s                 22.1M (↓49.9%)  3 (↓40.0%)  4,588 (↓50.0%)  115s (↓58.9%)
ArrayList   1.72M   451   2,381  6.5s                 6.4K (↓99.6%)   3 (↓99.3%)  10 (↓99.6%)     20ms (↓99.7%)
LinkedList  2.20M   451   2,564  7.2s                 6.8K (↓99.7%)   3 (↓99.3%)  10 (↓99.6%)     22ms (↓99.7%)
OpenJMS     128.9M  36    7,287  606s                 1.82M (↓98.5%)  7 (↓80.5%)  415 (↓94.3%)    16.3s (↓97.3%)
Tomcat      38.2M   13    3,543  206s                 1.26M (↓96.7%)  4 (↓69.2%)  111 (↓96.9%)    3.3s (↓98.4%)
Jigsaw      20.1M   11    2,322  154s                 416K (↓98.0%)   3 (↓72.7%)  64 (↓97.2%)     2.4s (↓98.4%)
Derby       94.1M   11    6,439  466s                 2.01M (↓97.8%)  3 (↓72.7%)  433 (↓92.5%)    10.2s (↓97.6%)
Table 6.1 shows the benchmarks used in our experiments. The total size of these benchmarks
is over 600K lines of code. Column 3 (Input/#Threads/#Iterations) reports the input data (the
bug, the number of threads, and the number of iterations, if available) configured in the recorded execu-
tion of the benchmark. All experiments were conducted on two eight-core 3.00GHz Intel Xeon
machines with 16GB memory and Linux 2.6.22 and JDK1.7.
6.6.1 RQ1: Effectiveness
The goal of our first research question is to evaluate how effective our technique is for simpli-
fying the buggy execution traces of real concurrent programs. To generate the data necessary
for investigating this question, we proceed as follows. For each benchmark, we first run it mul-
tiple times with random thread schedules until the bug manifests and use LEAN to collect the
corresponding buggy trace of each run. For each trace, we then apply our technique to produce
a simplified trace with the redundancy removed. During the simplification process, we first re-
move whole-thread redundancy and then partial-thread redundancy (consisting of both slicing
and repetition analysis). The whole process is fully automatic with no user intervention. We mea-
sure the percentage of trace size reduction with respect to the two dimensions of redundancy.
TABLE 6.3: LEAN - decomposed effectiveness on trace size reduction

                                ------ Partial Redundancy ------
Program     Whole Redundancy    Slicing         Repetition
BuggyPro    445K (96.9%)        1.8K (0.2%)     –
Tsp         21.7M (49.2%)       0.4M (0.7%)     –
ArrayList   1.71M (99.6%)       –               –
LinkedList  2.19M (99.7%)       –               –
OpenJMS     100.8M (78.2%)      7.3M (5.7%)     20.0M (15.5%)
Tomcat      23.6M (61.9%)       4.2M (11.0%)    9.1M (24.0%)
Jigsaw      16.0M (79.4%)       0.91M (4.5%)    2.7M (13.4%)
Derby       75.1M (79.8%)       6.2M (6.6%)     11.6M (12.3%)
We also quantify the final simplification results in terms of the reductions of the trace size, the
number of threads and the number of thread context switches, as well as the replay speedups.
To demonstrate the simplification effectiveness of our approach, we also compared LEAN with
an execution reduction technique ER [122] that uses the dependence graph for simplification.
Table 6.2 reports our final simplification results. Columns 2-5 (Size, #Thread, #CS, Replay Time)
report the size of the original trace, the number of threads, the number of thread context switches
(including both non-preemptive and preemptive ones) in the original trace, and the replay time
of the original trace, respectively, while Columns 6-9 report the corresponding statistics of the
simplified trace. As the table shows, the size of the original trace ranges from 460KB (Bug-
gyPro) to more than 128MB (OpenJMS) on disk, which takes from 1.27 seconds to more than
10 minutes to replay to reproduce the bug. The original trace is also of significant complexity
w.r.t. the number of threads and the number of context switches, ranging from 5 threads in Tsp
to 451 threads in ArrayList and LinkedList, and from 1,003 context switches in BuggyPro to
9,190 context switches in Tsp. LEAN was able to greatly reduce the trace complexity for all
the concurrency bugs in our experiments. The trace size is reduced by 49.9% (2x) in Tsp to as
large as 99.7% (324x) in LinkedList, the number of threads is reduced by 40% to 99.3%, and
the number of context switches is reduced by 50% to 99.6%. Moreover, the replay time is also
greatly shortened after simplification, ranging from 58.9% (2.4x) in Tsp to 99.7% (327x) in
LinkedList. In the four large server applications, the replay time is consistently shortened by
around 98% (64x).
Table 6.3 reports the simplification effectiveness w.r.t. each of the three components in terms of
the trace size reduction. Column 2 reports the percentages of whole-thread redundancy reduced
by the hierarchical delta-debugging (HDD), while Columns 3-4 report that of partial-thread re-
dundancy, reduced by slicing and repetition analysis, respectively. In the small benchmarks, the
percentage of whole thread redundancy ranges from 49.2% to 99.7%. LEAN did not identify
much partial thread redundancy in these small benchmarks. Slicing removes only 0.2% and
TABLE 6.4: Comparison between LEAN and ER

Program     ER      LEAN
BuggyPro    2.1%    97.1%
Tsp         0.0%    49.9%
ArrayList   2.9%    99.6%
LinkedList  3.0%    99.7%
OpenJMS     10.2%   98.5%
Tomcat      6.9%    96.7%
Jigsaw      4.6%    98.0%
Derby       2.5%    97.8%
0.7% redundancy, respectively, in BuggyPro and Tsp. For the real server programs, the percent-
age of whole-thread redundancy ranges from 61.9% to 79.8%. For partial-thread redundancy,
slicing and repetition analysis are both more effective than for the small benchmarks. Slic-
ing removes 4.5% to 11% redundant computation in the four large server programs, while the
percentage of redundancy removed by repetition analysis ranges from 12.3% to 15.5%. We note
that the amount of redundancy in the buggy traces is closely related to the number of threads
and the number of repetitions configured as input to the program. With more redundancy in the
buggy trace, LEAN would have a better simplification ratio. Nevertheless, we believe our result
is representative as our experimental setup reflects the typical concurrency testing scenarios in
the development cycle (such as the effective random testing in the IBM ConTest tool [31] and
the stress testing in CHESS [86]).
Comparison with ER [122] The execution reduction (ER) technique proposed by Tallam
et al. [122] also aims at reducing the trace size, to support the tracing of long-running
multithreaded programs. ER works by tracking a dynamic dependence graph of the execution
events. The events are grouped into regions and threads such that the size of the dependence
graph can be reduced. By analyzing the dependence graph, ER removes the regions of events or
threads that are irrelevant to the fault. As ER relies on the dynamic dependence graph, it cannot
remove redundant computation that has data/control dependencies to the fault. As LEAN relies
on the redundancy criterion and dynamic verification, it is able to leverage more simplification
opportunities.
We compared the simplification effectiveness on the trace size reduction between LEAN and
ER. Table 6.4 shows the result. For our evaluation benchmarks, LEAN is much more effec-
tive than ER. ER does not find many irrelevant events (the percentage of simplification ranges
from 0.0% to 10.2%), because almost all threads have data dependencies on one another via
shared variables, while LEAN can effectively remove the redundant threads and the repetitive
computation through the hierarchical delta-debugging and our repetition analysis.
TABLE 6.5: LEAN experimental results - RQ2: Efficiency

            ----- HDD -----   Slicing   -- Repetition --   -- RCB --
Program     #Rounds  Time     Time      #Rounds  Time      All  Real
BuggyPro    6        8s       155ms     –        –         4    0
Tsp         2        199s     12s       –        –         3    0
ArrayList   18       55s      2s        –        –         –    –
LinkedList  18       58s      2s        –        –         –    –
OpenJMS     13       4,265s   330s      11       152s      1    1
Tomcat      5        1,082s   308s      12       55s       1    1
Jigsaw      4        630s     210s      10       37s       1    1
Derby       4        1,841s   553s      10       200s      1    1
6.6.2 RQ2: Efficiency
The goal of our second research question is to assess if our approach is efficient in simplifying
the buggy trace. Since LEAN works in a black-box style (applying delta-debugging except for
the dynamic slicing part) to iteratively simplify the trace, it may take a long time (many rounds)
to produce the final simplification. As in each round it requires two replay runs to validate
redundancy (for the two redundancy conditions in our criterion), the efficiency of LEAN is an
important concern for the usefulness in practice. Hence, during the trace simplification, we also
record the number of delta-debugging rounds (for dealing with both whole-thread redundancy
and partial-thread redundancy) and measure the time needed for each of the three components of
LEAN to produce the final simplified trace. As we use repetition analysis to identify the RCBs,
we also report the statistics of the repetition analysis result to assess its usefulness in improving
the simplification effectiveness of LEAN.
Table 6.5 shows the experimental results for our research question RQ2. Columns 2-3 and 5-6
report the number of simplification rounds (including the failed runs) and the time taken by LEAN
to remove the whole-thread redundancy and the redundant repetitions, respectively, from the
original trace (the same trace as that in Table 6.2). Generally, the number of rounds is depen-
dent on the amount of redundancy, while the simplification time is dependent on the amount of
redundancy as well as the length of the original trace. For the small benchmarks, LEAN took 2
to 18 rounds for validating whole-thread redundancy, which took 8 to 199 seconds of the execu-
tion time. For the large systems, since their traces are much larger, LEAN took 4 to 13 rounds
and 630 to 4,265 seconds to remove whole-thread redundancy, and 10 to 12 rounds and 37 to
200 seconds to remove the redundant repetitions. Column 4 reports the time needed for slicing
the trace (including both the construction time of the multithreaded dependence graph (MDG)
and the analysis time for slicing the MDG). Because slicing considers all the instructions in the
buggy execution, it is more expensive for large server programs (which have longer and more
complex traces) than for the small benchmarks. The slicing time for the four large server
programs in our experiments ranges from 210 to 553 seconds.
Summary Compared to the original replay time, the simplification time is typically 4x-8x
longer (except Tsp, which is in fact shorter). However, considering the significant trace simpli-
fication ratio, we believe the time cost is acceptable (even for the large systems). Moreover, as
the simplification task is fully automatic (transparent to programmers) and can be easily paral-
lelized, programmers do not need to worry about the simplification procedure. For very long
running executions, programmers may also choose to set a time bound for the simplification.
When the simplification does not finish within the time bound, programmers can still have the
partially simplified trace (sharing the spirit of delta-debugging).
On the aspect of repetition analysis, Columns 7-8 report the total number of identified RCBs
and the number of real RCBs among them in each benchmark. For the small benchmarks, our
analysis identified 4 RCBs in BuggyPro and 3 in Tsp, but none of them are truly redundant. Our
analysis does not report any RCB in LinkedList and ArrayList. For the large systems, our anal-
ysis successfully identified all the RCBs in the test drivers. In testing real concurrent systems,
there is often a large number of repetitions (in order to increase the bug-finding probability). We
note that repetition analysis plays an important role in effectively reducing this kind of partial-
thread redundancy, though (as our result suggests) the precision of our repetition analysis is not
optimized.
6.7 Summary
Debugging concurrent programs has been a long-standing challenge. We have pre-
sented a novel technique LEAN to simplify the concurrency bug reproduction by removing the
redundant computation from the buggy trace with the replay-supported execution reduction. Our
experimental results show that LEAN is able to significantly reduce the complexity of the repro-
ducible buggy execution and shorten the replay time. With LEAN, we believe the effectiveness
of debugging concurrent programs can be greatly improved.
TestEmbeddedMultiThreading {
    main(String args[]) {
        int numThreads = Integer.parseInt(args[0]);
        int numIterations = Integer.parseInt(args[1]);
        //register the embedded driver and create the test database
        EmbeddedDriver driver = new EmbeddedDriver();
        conn = DriverManager.getConnection("jdbc:derby:DERBY2861");
        stmt = conn.createStatement();
        sql = "CREATE VIEW viewSource AS SELECT col1, col2 FROM
               schemamain.SOURCETABLE";
        stmt.execute(sql);
        stmt.close();
        //create test threads
        Thread[] threads = new Thread[numThreads];
        for (i = 0; i < numThreads; i++)
            threads[i] = new Thread(new ViewCreatorDropper(
                "schema1.VIEW" + i, "viewSource", "*", numIterations));
        //start test threads
        for (int i = 0; i < numThreads; i++)
            if (shouldStartThread(threads[i]))    // inserted by LEAN (line 23)
                threads[i].start();
        //wait for threads to terminate
        for (int i = 0; i < numThreads; i++)
            if (shouldJoinThread(threads[i]))     // inserted by LEAN (line 27)
                threads[i].join();
    }
}

ViewCreatorDropper implements Runnable {
    ViewCreatorDropper(String viewName, String sourceName,
                       String columns, int iterations) {
        m_viewName = viewName;
        m_sourceName = sourceName;
        m_columns = columns;
        m_iterations = iterations;
    }
    run(…) {
        for (i = 0; i < m_iterations; i++)
        {
            if (shouldExecuteIteration(i))        // inserted by LEAN (line 40)
            {
                @rcb-begin                        // inserted by LEAN (line 41)
                //create view
                stmt = conn.createStatement();
                sql = "CREATE VIEW " + m_viewName + " AS SELECT "
                      + m_columns + " FROM " + m_sourceName;
                stmt.execute(sql);
                stmt.close();
                //drop view
                stmt = conn.createStatement();
                sql = "DROP VIEW " + m_viewName;
                stmt.execute(sql);
                stmt.close();
                @rcb-end                          // inserted by LEAN (line 54)
            }
        }
    }
}

FIGURE 6.7: A real world test driver for triggering the concurrency bug in Figure 6.6. The
statements inserted by LEAN to simplify the execution are marked with "inserted by LEAN"
comments (shown as gray areas in the original figure); the line numbers referenced in the text
refer to the original listing.
[Table: for each of the four delta-debugging rounds, the test threads among T1–T10 that are
enabled (√) and the validation result — Round 1: 5 threads, Y; Round 2: 3 threads, Y;
Round 3: 2 threads, X; Round 4: 2 threads, Y.]

FIGURE 6.8: Illustration of delta-debugging for removing the whole-thread redundancy. Ti
denotes the ith test thread created by the main thread T0. After four rounds of simplification,
threads T(2,3) remain and all the other threads are removed.
[Table: for each of the ten delta-debugging rounds, the iterations among I21–I210 and
I31–I310 that are enabled (√) and the validation result — Round 1: 15 iterations, N;
Round 2: 15, Y; Round 3: 13, Y; Round 4: 12, N; Round 5: 12, Y; Round 6: 11, Y;
Round 7: 6, Y; Round 8: 4, N; Round 9: 3, Y; Round 10: 2, Y.]

FIGURE 6.9: Illustration of delta-debugging for removing the redundant repetitions for the re-
maining threads T(2,3). Iij denotes the jth iteration of thread Ti, where i = 2,3 and j = 1,2,...,10.
After ten rounds of simplification, the 7th iteration of T2 and the 4th iteration of T3 remain and
all the other iterations are removed.
Chapter 7
Static Trace Simplification
One of the major difficulties in debugging concurrent programs is that the programmer usually
experiences frequent thread context switches, which complicates the reasoning process. This
problem can be alleviated by trace simplification techniques, which produce the same computa-
tion process but with fewer context switches. The state-of-the-art trace simplification technique
takes a dynamic approach and does not scale well to large traces, hampering its practicality.
We present a static trace simplification approach, SimTrace, that dramatically improves the ef-
ficiency of trace simplification through reasoning about the computational equivalence of traces
offline. By constructing a dependence graph model of events, our trace simplification algorithm
scales linearly in the trace size and quadratically in the number of nodes in the dependence graph.
Underpinned by a trace equivalence theorem, we guarantee that the results generated by Sim-
Trace are sound and no dynamic program re-execution is required to validate trace equivalence.
Our experiments show that SimTrace scales well to traces with more than 1M events, making it
attractive for practical use.
7.1 Introduction
Jalbert and Sen [55] have recently proposed a dynamic trace simplification technique, Tiner-
tia, for reducing the number of thread interleavings in a buggy execution trace. From a high
level perspective, Tinertia iteratively transforms an input trace that satisfies a certain property
to another trace satisfying the same property but with fewer thread context switches. Tinertia
is valuable in improving the debugging efficiency of concurrent programs as it prolongs the se-
quential reasoning of concurrent program executions and reduces frequent “context switches”.
However, since Tinertia is a dynamic approach, it faces serious efficiency problems when used
in practice. To reduce every single context switch, Tinertia has to re-execute the program at
least once to validate the equivalence of the transformed trace. It is very hard for Tinertia to
scale to large traces as program re-execution typically requires controlling the thread scheduler
to follow the scheduling decisions in the transformed trace, which is often 5x to 100x slower
than the native execution [100]. The total running time of Tinertia is cubic in the trace size [55].
We present a static trace simplification technique, SimTrace, that dramatically improves the
efficiency of trace simplification through offline reasoning of the computational equivalence
of traces. The key idea of SimTrace is that we can statically guarantee trace equivalence by
leveraging the dependence relations between events in the trace. We prove a theorem of trace
equivalence that any rescheduling of the events in the trace respecting the dependence relation
is equivalent to the given trace. The trace equivalence is not limited to any specific property
but general to all properties that can be defined over the program state. Underpinned by the
trace equivalence theorem, SimTrace is able to perform trace simplification completely offline,
without any dynamic re-execution to validate the intermediate simplification result, which sig-
nificantly improves the efficiency of the trace simplification.
In our analysis, we first build a dependence graph that encodes all the dependence relations
between events in the trace. The dependence graph is a directed acyclic graph in which each
node in the graph represents a corresponding event or event sequence by the same thread in the
trace, and each edge represents a happens-before relation or a data dependence between two
events or event sequences. The dependence graph is sound in that it encodes a complete set of
dependence relations between the events. The trace equivalence theorem guarantees that any
topological sort of the dependence graph produces an equivalent trace to the original trace.
Taking advantage of the dependence graph, we reduce the trace simplification problem to
a graph merging problem, in which the objective is to minimize the size of the graph. The
algorithm performs a sequence of merging operations on the graph. Each merging operation is
applied on two consecutive nodes by the same thread in the graph, and it consolidates the two
nodes if a merging condition is satisfied. The merging condition is that the edge connecting
the two merged nodes is the only path connecting them in the graph, which can be efficiently
checked by computing the reachability relation between the two nodes.
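The check can be implemented as a depth-first search from the first node that ignores the direct edge between the two nodes; a sketch, with the graph represented as adjacency lists (our representation):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch: two consecutive nodes u, v of the same thread may be merged
    // iff the edge u->v is the only path from u to v in the dependence
    // graph, i.e., v is unreachable from u once that edge is ignored.
    public class MergeCheck {
        static boolean canMerge(Map<Integer, List<Integer>> adj, int u, int v) {
            Deque<Integer> stack = new ArrayDeque<>();
            Set<Integer> seen = new HashSet<>();
            for (int w : adj.getOrDefault(u, List.of()))
                if (w != v) stack.push(w);      // skip the direct edge u->v
            while (!stack.isEmpty()) {
                int w = stack.pop();
                if (w == v) return false;       // another path reaches v
                if (seen.add(w))
                    for (int x : adj.getOrDefault(w, List.of()))
                        stack.push(x);
            }
            return true;                        // u->v is the only path
        }
    }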
Finally, SimTrace performs a topological sort on the reduced dependence graph and generates
the simplified trace. The total running time of SimTrace is linear in the size of the trace and
quadratic in the number of nodes in the initial dependence graph. SimTrace is very efficient
in practice, since the size of the initial dependence graph is often much smaller than that of the
original trace. Moreover, SimTrace is completely offline and does not require any re-execution
of the program for validating the simplified trace.
The problem of generating equivalent traces with minimum context switches is NP-hard [55].
SimTrace does not guarantee the globally optimal simplification but a local optimum. However,
our evaluation results using a set of multithreaded programs show that SimTrace is able to signif-
icantly reduce the context switches in the trace. For instance, for the input trace of the Cache4j
subject with 1,225,167 events, SimTrace is able to reduce the number of context switches from
417 to 33 in 592 seconds. The overall reduction percentage of SimTrace ranges from 65% to
97% in our experiments.
Being an offline analysis technique, SimTrace is complementary to Tinertia. For the sake of
efficiency, our modeling of the dependence relation does not consider the runtime value de-
pendencies between events in the trace and hence may be too strict in preventing further trace
simplification. As Tinertia utilizes runtime verification regardless of the dependence relation, it
might be able to explore more simplification opportunities that are beyond the strict dependence
relation. A good match between SimTrace and Tinertia is to apply SimTrace as a front-end
and use Tinertia as a back end. By working together, we can achieve both trace simplification
efficiency and effectiveness at the same time.
The rest of the chapter is organized as follows: Section 7.2 presents our algorithm; Section 7.3
reports our evaluation results; Section 7.4 summarizes this chapter.
7.2 SimTrace: Efficient Static Trace Simplification
In this section, we first define the trace simplification problem. We then describe a theorem of
trace equivalence and offer a detailed proof. After that, we present the full SimTrace algorithm.
7.2.1 General Trace Simplification Problem
Definition 7.1. (Context switch) A context switch occurs when two consecutive actions in the
trace are performed by different threads. Let Γ(α) denote the owner thread of event α, let δ
denote a trace containing N events and δ[k] the kth event in δ, and let CS(δ) denote the number
of context switches in δ. Then

    CS(δ) = Σ_{k=1}^{N−1} u_k,   where u_k = 1 if Γ(δ[k]) ≠ Γ(δ[k+1]) and u_k = 0 otherwise.
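For illustration, CS(δ) can be computed in a single pass over the owner threads of the trace; the following sketch abstracts the trace as the sequence Γ(δ[1]), …, Γ(δ[N]):

    import java.util.List;

    // Sketch: counting context switches per Definition 7.1, with the trace
    // abstracted as the sequence of owner-thread IDs of its events.
    public class ContextSwitchCount {
        static int contextSwitches(List<Integer> ownerThreads) {
            int cs = 0;
            for (int k = 1; k < ownerThreads.size(); k++)
                if (!ownerThreads.get(k - 1).equals(ownerThreads.get(k)))
                    cs++;  // consecutive events by different threads
            return cs;
        }

        public static void main(String[] args) {
            // δ = e1..e6 with Γ = [1,1,2,1,2,2] has CS(δ) = 3.
            System.out.println(contextSwitches(List.of(1, 1, 2, 1, 2, 2)));
        }
    }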
Given a trace as the input, the general trace simplification problem is to produce an output trace
that is equivalent to the input trace and has the minimum number of context switches among all
equivalent traces. To state it more formally, suppose an input trace δ drives the program state to
ΣN , the general trace simplification problem is: given δ, output δ′ s.t. ΣN = Σ′N and CS(δ′) is
minimized. Notice that the program state here is not limited to any local store or the global store
but includes both the global store and the local stores of all the threads. In other words, the trace
simplification problem defined above is general to all properties defined over the program state.
The basic idea for reducing the context switches in a trace is to reschedule the actions in the
trace such that more actions by the same thread are placed next to each other. A naıve approach
is to exhaustively generate all permutations of the events in the trace and pick an equivalent one
with the smallest number of context switches. However, this naïve approach requires checking
N! permutations, which is highly inefficient. A better approach is to repeatedly move the inter-
leaving actions to some non-interleaving positions and then consolidate the neighboring actions
by the same thread. However, there are two major challenges in this approach. First, how to
ensure the rescheduled trace is feasible and also equivalent to the input trace? Second, how to
make sure the output trace is optimal, i.e., has the minimum number of context switches among
all equivalent traces?
We address the trace simplification problem by leveraging the dependence relationship between
events in the trace. For the first challenge, we show that the trace equivalence can be guaranteed
by respecting the dependence relation during the rescheduling process. For the second chal-
lenge, since Jalbert and Sen [55] have proved the problem NP-hard, we present an efficient algorithm,
SimTrace, that generates a locally optimal solution.
7.2.2 A Theorem of Trace Equivalence
Previous work has proposed many causal models [21, 66, 96, 112, 131] that characterize the
dependence relationship between actions in the trace. Among them, most models are developed
for checking concurrency properties such as data race and atomicity violations, and they are
tailored for a specific property. As we are dealing with all properties over program state, we
have to consider a general model that works for all such properties. We hence use a strict model
based on the dependence relation in Definition 2.5, and we have the following theorem of trace
equivalence:
Theorem 7.2. Any rescheduling of the actions in a trace respecting the dependence relation
generates an equivalent trace.
Proof. (Sketch) Let δ denote the input trace with size N and δ′ an arbitrary rescheduling of δ
respecting the dependence relation, and suppose δ and δ′ drive the program state from the same
initial state Σ0 = Σ′0 to ΣN and Σ′N, respectively. Our goal is to prove Σ′N = ΣN. The
main insight of the proof is that, by respecting the order defined by the dependence relation,
every action in the rescheduled trace reads or writes the same value on the program state as its
corresponding action in the input trace, and hence the rescheduled trace drives the program to
the same final state as that of the input trace. We provide the full detailed proof in the appendix
at the end of this chapter; readers may safely skip it for now.
Note that Theorem 7.2 is related to but different from the equivalence axiom of the Mazurkiewicz
traces [1] in trace theory, which provides an abstract model for reasoning about trace equiv-
alence based on the partial order relation between events. We prove Theorem 7.2 in the context
of concurrent program execution based on the concrete modeling of the action semantics and
the computation effect in the trace.
Theorem 7.2 forms the basis of static trace simplification as it guarantees every rescheduling
of the actions in the trace that respects the dependence relation produces a valid simplification
result, without the need of any runtime verification. In other words, as long as we do not violate
the order defined by the dependence relation, we can safely reschedule the events in the trace
without worrying about the correctness of the final result.
7.2.3 SimTrace Algorithm
Our algorithm starts by constructing from the input trace a dependence graph (see Definition
7.3), which encodes all the actions in the trace as well as the dependence relations between the
actions. We then simplify the dependence graph by repeatedly performing a “merging” operation
on two consecutive nodes by the same thread in the graph. When the dependence graph cannot
be further simplified, our approach applies a simple topological sort on the graph to produce the
final simplified trace.
Definition 7.3. A dependence graph G = (V,E), built upon a trace, is a directed acyclic graph
in which each v ∈ V corresponds to a sequence of consecutive actions by the same thread started
by a unique action that has remote incoming dependence. For each edge, there is a labeling
relation L ∶ E →{local, remote} such that each local edge connects neighboring nodes by the
same thread, and each remote edge connects nodes by different threads meaning that there are
dependence relations from some actions in one node to some actions in the other node.
Note that the dependence graph is a directed acyclic graph: a cycle would indicate cyclic
dependences between events in the trace, which is impossible according to our dependence re-
lation model. We next describe our algorithms for constructing and simplifying the dependence
graph in detail.
Dependence Graph Construction Algorithm 12 shows our algorithm for constructing the
dependence graph. Given an input trace, we first conduct a linear scan of all the actions in the
trace to build the smallest dependence relation between actions. We then visit each action once, in
its order of appearance in the trace, to construct the dependence graph according to Definition 7.3.
Our construction of the dependence graph leverages the observation that most of the dependence
relations in the trace are local dependencies within the same thread, while the number of remote
Algorithm 12 ConstructDependenceGraph(δ)
 1: input: δ (a trace)
 2: output: graph (the dependence graph built from δ)
 3: map_t2n ← empty map from a thread identifier to its current graph node
 4: t_old ← null
 5: for i ← 0 to |δ|−1 do
 6:   t_cur ← the thread identifier of the action δ[i]
 7:   node_cur ← map_t2n(t_cur)
 8:   if node_cur is null then
 9:     node_cur ← new node(δ[i])
10:     map_t2n(t_cur) ← node_cur
11:     add node node_cur to graph
12:   else
13:     if δ[i] has remote incoming dependence and t_cur ≠ t_old then
14:       node_old ← node_cur
15:       node_cur ← new node(δ[i]); map_t2n(t_cur) ← node_cur
16:       add node node_cur to graph
17:       add local edge node_old ⇢ node_cur to graph
18:       for each action a with remote outgoing dependence to δ[i] do
19:         node_a ← the node to which a belongs
20:         add remote edge node_a → node_cur to graph
21:     else
22:       add action δ[i] to node_cur
23:   t_old ← t_cur
dependence relations is comparatively much smaller. We can hence greatly reduce the size
of the initial dependence graph by shrinking consecutive actions with only local dependence
between them into a single node. The running time of Algorithm 12 is linear in the trace size.
Note that, in our dependence graph construction process, each node in the initial dependence
graph has at most two incoming edges: a local incoming edge and a remote incoming edge
(the root node of each thread lacks one or both). The number of edges in the graph is thus less
than twice the number of nodes
in the graph. Moreover, since each node in the dependence graph may represent a sequence of
actions in the trace, the number of nodes in the graph is much smaller than the original trace
size. As a result, performing a topological sort on the dependence graph is much more efficient
than that on the original trace.
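To illustrate this final step, here is a minimal Java sketch of a Kahn-style topological sort over the dependence graph; the Node interface is our assumption and stands in for SimTrace's actual graph data structure.

    import java.util.*;

    interface Node {
        List<Node> successors();  // targets of outgoing local and remote edges
        List<String> actions();   // consecutive same-thread actions in this node
    }

    final class Linearizer {
        // Emits an equivalent trace by visiting the nodes in a topological
        // order and concatenating their action sequences.
        static List<String> linearize(Collection<Node> graph) {
            Map<Node, Integer> indegree = new HashMap<>();
            for (Node n : graph) indegree.put(n, 0);
            for (Node n : graph)
                for (Node succ : n.successors())
                    indegree.merge(succ, 1, Integer::sum);
            Deque<Node> ready = new ArrayDeque<>();
            for (Node n : graph)
                if (indegree.get(n) == 0) ready.add(n);
            List<String> trace = new ArrayList<>();
            while (!ready.isEmpty()) {
                Node n = ready.poll();
                trace.addAll(n.actions());
                for (Node succ : n.successors())
                    if (indegree.merge(succ, -1, Integer::sum) == 0)
                        ready.add(succ);
            }
            return trace;
        }
    }

By Theorem 7.2, any order in which ready nodes are dequeued yields an equivalent trace; the merging step described next is what makes the resulting trace contain few context switches.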
Simplifying Dependence Graph Following Theorem 7.2, it is easy to see that any topological
sort of the initial dependence graph produces a correct answer to our problem, i.e., generates an
equivalent trace to the input trace. However, to make the resultant trace as simple as possible,
i.e., to minimize the context switches, we have to choose the next node wisely at each step of
the topological sort, a difficult problem with no known efficient solution or even a good
approximation algorithm.
We formulate this problem as an optimization problem on the number of nodes in the depen-
dence graph and use a graph merging algorithm to compute a locally optimal solution to it.
Before describing the formulation, let us first introduce a notion dual to the context switch:
Definition 7.4. A context continuation occurs when two consecutive actions in the trace are
performed by the same thread.
Let CC(δ) denote the number of context continuations in a trace δ, we have the following
lemma:
Lemma 7.5. Minimizing CS(δ) is equivalent to maximizing CC(δ).
Proof. Traversing the trace once, it is easy to see that each pair of consecutive actions increments
either CS(δ) or CC(δ). Thus, CS(δ) + CC(δ) = N − 1. Hence, CS(δ) is minimized when CC(δ) is
maximized.
Therefore, our goal becomes to maximize the number of context continuations in the simplified
trace. Now let us consider the action sequence represented by each node in the dependence
graph. Since all actions in the same action sequence are performed by the same thread, the
number of context continuations within each sequence is already maximal. The remaining possible
context continuations can only come from actions that are in different action sequences. Mapping this back to
the dependence graph and because nodes representing action sequences by the same thread are
connected by local edges, we have the following lemma:
Lemma 7.6. Minimizing CS(δ) is equivalent to maximizing the number of context continua-
tions contributed by local edges in the dependence graph.
Consider a local edge in the graph: if the action sequences represented by the two nodes con-
nected by this local edge are consolidated, the edge contributes one context continuation.
Let us call the consolidation of two nodes connected by a local edge in the dependence graph a
merging operation. As each merging operation eliminates a local edge and correspondingly
reduces one node in the dependence graph, it is easy for us to get the following theorem:
Theorem 7.7. Minimizing CS(δ) is equivalent to minimizing the number of nodes in the de-
pendence graph.
Following Theorem 7.7, our objective is to perform as many merging operations as possible so as
to minimize the number of nodes in the dependence graph. However, recall that the dependence
relation between actions in the trace must be respected. Therefore, we cannot perform the
merging operation arbitrarily; it must satisfy a pre-condition, the merging condition: the two
nodes to be merged are connected by the local edge only. Otherwise, the resultant graph
after the merging operation would become cyclic and violate the definition of dependence graph.
Mapping this back to the semantics of the dependence relation, the merging condition simply
requires that there be no other dependent action in the trace that interleaves the two
action sequences represented by the two nodes to be merged in the dependence graph. Checking
the merging condition is simple because it only requires testing the reachability relation between
the two merged nodes, which is linear in the number of nodes in the dependence graph
(theoretically, constant-time graph reachability algorithms also exist [132]).
Therefore, our dependence graph simplification algorithm (Algorithm 13) traverses each local
edge in the dependence graph, and performs the merging operation if the merging condition is
satisfied. This algorithm evaluates each local edge in the initial dependence graph once and
each evaluation computes the reachability relation between two nodes once. The worst case
time complexity is thus quadratic in the number of nodes in the initial dependence graph.
Algorithm 13 SimplifyDependenceGraph(graph)
1: input: graph (the dependence graph)
2: output: graph′ (the simplified dependence graph)
3: graph′ ← graph
4: for each local edge node_a ⇢ node_b in a random order do
5:   if node_b is not reachable from node_a except through the local edge then
6:     merge(node_a, node_b, graph′)
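The merging condition on line 5 can be checked with a plain depth-first search that ignores the local edge itself; the sketch below reuses the hypothetical Node interface from the earlier linearization sketch.

    import java.util.*;

    final class MergeCheck {
        // Returns true iff node b is NOT reachable from node a except through
        // the single local edge a ⇢ b, i.e., the merging condition holds.
        static boolean canMerge(Node a, Node b) {
            Deque<Node> stack = new ArrayDeque<>();
            Set<Node> visited = new HashSet<>();
            boolean skippedLocalEdge = false;
            for (Node succ : a.successors()) {
                if (succ == b && !skippedLocalEdge) {
                    skippedLocalEdge = true;  // skip the local edge a ⇢ b itself
                } else {
                    stack.push(succ);
                }
            }
            while (!stack.isEmpty()) {
                Node n = stack.pop();
                if (!visited.add(n)) continue;
                if (n == b) return false;     // a second path a → ... → b exists
                for (Node succ : n.successors()) stack.push(succ);
            }
            return true;
        }
    }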
Notice that in our merging algorithm, the evaluation order of the local edges may affect the
simplification result. Our algorithm does not guarantee a global optimum but produces a locally
optimal simplification given the chosen evaluation order. To illustrate this problem, let us take
the (incomplete) dependence graph in Figure 7.1 as an example. The graph contains 6 nodes,
3 local edges (denoted by dashed arrows ⇢), and 4 remote edges (denoted by solid arrows →):
a1 ⇢ a2, b1 ⇢ b2, c1 ⇢ c2, a1 → b2, c1 → b2, b1 → a2, and b1 → c2. If b1 and b2 are merged first,
as shown in Figure 7.1 (a), it would produce the trace <a1-c1-b1-b2-c2-a2> that contains 4
context switches. However, the optimal solution is to merge a1 and a2, and c1 and c2, which
produces the trace <b1-a1-a2-c1-c2-b2> that contains only 3 context switches. In fact, this
problem is NP-hard (proved by Jalbert and Sen [55]), and there does not seem to exist an efficient
algorithm for generating an optimal solution. Our algorithm thus picks a random order (or any
arbitrary order) for evaluating the local edges. Though it is not guaranteed to produce a global
optimum, it is easy to see that our algorithm always produces a local optimum specific to the
chosen evaluation order. That is, given the evaluation order of the local edges, our algorithm
produces a trace with the fewest thread context switches.
[Figure 7.1 depicts a dependence graph with nodes a1, a2, b1, b2, c1, c2; dashed arrows are local edges and solid arrows remote edges. Panel (a): the non-optimal merge of b1 and b2, #cs = 4 (a1-c1-b1-b2-c2-a2). Panel (b): the optimal merge of a1 with a2 and c1 with c2, #cs = 3 (b1-a1-a2-c1-c2-b2).]

FIGURE 7.1: A greedy merge may produce a non-optimal result in (a). Unfortunately, the problem of producing the optimal result in (b) is NP-hard.
7.3 Implementation and Experiments
We have implemented SimTrace as a prototype tool on top of LEAP. From the user’s perspec-
tive, our tool consists of three phases. It first obtains a trace of a buggy concurrent Java program
execution, which contains all the shared memory reads and writes as well as synchronization
operations performed by each thread in the program. Then our tool applies the SimTrace algo-
rithm on the trace and produces a simplified trace. In the third phase, it uses a replay engine
to re-execute the program according to the scheduling decisions in the simplified trace. Our
replayer is transparent to the programmers such that they can deterministically investigate the
simplified buggy trace in a normal debugging environment.
The goal of our experiments is to investigate whether our approach is effective and how efficient
it is in reducing the thread context switches in the trace. We chose eight widely used multi-
threaded Java benchmarks as the evaluation subjects (shown in the first column of Table 7.1).
Each subject has one or more known concurrency bugs. Similar to Tinertia [55], we use random
testing to generate the initial buggy trace for each subject. For each trace, we ran SimTrace mul-
tiple times with different evaluation orders of the local edges during our graph merging process
(Algorithm 13). To remove the non-determinism related to random numbers, we fix the seed
of random numbers to a constant in all the subjects. All experiments were conducted on an HP
EliteBook running Windows 7 with a 2.53GHz Intel Core 2 Duo processor and 4GB memory. Our
implementation is publicly available at http://www.cse.ust.hk/prism/simtrace.
Table 7.1 shows the experimental results. All data are averaged over 50 runs. The first five
columns show the statistics of the test cases, including the program name, the size of the pro-
gram in lines of source code, the number of threads, the number of real shared memory locations
that contain both read and write accesses from different threads in the given trace, and the length
TABLE 7.1: SimTrace experimental results. Data are averaged over 50 runs for each subject.

Program      LOC      Threads  SV   Length     Time   Old Ctxt  New Ctxt  Reduction
Philosopher  81       6        1    131        6ms    51        18        65%
Bubble       417      26       25   1,493      23ms   454       163       71%
Elevator     514      4        13   2,104      8ms    80        14        83%
TSP          709      5        234  636,499    149s   9,272     1,337     86%
Cache4j      3,897    4        5    1,225,167  592s   417       33        92%
Weblench     35,175   3        26   11,630     57ms   156       24        85%
OpenJMS      154,563  32       365  376,187    38s    96,643    11,402    88%
Jigsaw       381,348  10       126  19,074     130ms  2,396     65        97%
of the trace. The next four columns show the statistics of our trace simplification algorithm (all
on average), including the running time of our offline analysis, the number of context switches
in the original trace, the number of context switches in the simplified trace, and the reduction
due to our simplification. The results show that our approach is promising in terms of both trace
simplification efficiency and effectiveness. For the eight subjects, our approach is able to reduce
the number of context switches in the trace by 65% to 97% on average. This reduction percent-
age is close to that of Tinertia, which ranges from 32.1% to 97.0% in their experiments. More
importantly, our approach is able to scale to much larger traces compared to Tinertia. For a trace
with only 1505 events (which is the largest trace reported by Tinertia in their experiments), Tin-
ertia requires a total of 769.3s to finish the simplification, while our approach can analyze a trace
(the Cache4j subject) with more than 1M events within 600s. For a trace (the Bubble subject)
with 1,493 events, our approach requires only 23ms to simplify it. Although a direct comparison
between Tinertia and our approach is not possible, as the two approaches are implemented for
different programming languages (Tinertia is implemented for C/C++ programs) and have different
evaluation subjects, we believe the statistical data provides some evidence demonstrating the
value of our approach compared to the state of the art.
7.4 Summary
To sum up, the key contributions of this work are as follows:
• We present an efficient static trace simplification technique for reducing the number of
thread context switches in the trace.
• We show a theorem of trace equivalence that is general to all properties defined over the
program state. This theorem guarantees the correctness of static trace simplification
without any dynamic program re-execution to validate the intermediate simplification
results.
• We present a sound graph modeling of the dependence relation between events in the trace,
which allows us to develop efficient graph merging algorithms for the trace simplification
problem.
• We evaluate our approach on a number of multithreaded applications and the results
demonstrate the efficiency and the effectiveness of our approach.
Appendix: A Proof of Theorem 7.2
Proof. Let us say two actions are equal iff they perform the same operation on the same variable
and also read and write the same value. The core of the proof is to prove the following lemma:
Lemma 7.8. For any action α′ in δ′, suppose it is the nth action of thread ti, then α′ is equal to
the nth action of ti in δ.
If Lemma 7.8 holds, we can prove Theorem 7.2 by applying it to the last actions that write to
each variable in both δ and δ′. To prove Lemma 7.8, we first define a notion of version number
and show two lemmas related to it:
Definition 7.9. Every variable is associated with a version number such that it is (1) initialized
to be 0 and (2) incremented by 1 when the variable is written by an action.
Lemma 7.10. For any action α′ in δ′, suppose it is the kth action that writes to a variable s,
then α′ is also the kth action that writes to s in δ.
Proof. To prove Lemma 7.10, we only need to make sure the order of write actions on each vari-
able is unchanged during the rescheduling of the trace from δ to δ′. This holds because our modeling
of the dependence relation includes all synchronization orders and the WRITE→WRITE orders
on the same variable. ∎
Lemma 7.11. For any action α′ in δ′, suppose it reads the variable s with version number p,
then α′ also reads s with the same version number p in δ.
Proof. Similar to the proof of Lemma 7.10, since our model of the dependence relation includes
all the synchronization orders and the WRITE→READ and READ→WRITE orders on the same
variable, we guarantee every READ action in the rescheduled trace reads the value written by
the same WRITE action as that in the original trace. ∎
Let σ[s]p denote the value of variable s with version number p. We next prove Lemma 7.8 by
induction on the version number of each variable:
Consider the jth actions performed by ti, denoted by αi∶j and α′i∶j in δ and δ′ respectively.
To prove α′i∶j is equal to αi∶j , we need to satisfy two conditions. First, their actions should be
the same, i.e., they perform the same operation on the same variable. Second, suppose they
both operate on the variable s (which must be true if the first condition holds); the value of
s before α′i∶j is performed in δ′ should be the same as that in δ before αi∶j is performed. Let
πi∶j and π′i∶j denote the local store of ti after αi∶j is performed in δ and after α′i∶j is performed
in δ′, respectively. For the first condition, since under the execution semantics the
next action of any thread is determined by that thread's current local store, we need to ensure (I)
π′i∶j−1 = πi∶j−1. For the second condition, suppose αi∶j and α′i∶j operate on s with version numbers
p and p′, respectively; we need to ensure (II) σ′[s]p′ = σ[s]p.
Let us first assume Condition I holds and prove p′ = p in Condition II. If α′i∶j writes to s, i.e.,
α′i∶j is the p′th action that writes to s, then by Lemma 7.10, the corresponding action
of α′i∶j in δ is also the p′th action that writes to s. As Condition I holds, we know that αi∶j is
the corresponding action of α′i∶j in δ. Since αi∶j operates on s with version number p by our
assumption, we get p′ = p. Otherwise, if α′i∶j reads s, then by Lemma 7.11, α′i∶j's
corresponding action in δ also reads s with the same version number, and similarly, we get
p′ = p.
We next prove that both Condition I and Condition II hold. For Condition I, suppose αi∶j−1 and
α′i∶j−1 operate on the variable s1 with version number p1. To satisfy Condition I, we need again
to make sure (Ia) π′i∶j−2 = πi∶j−2 and (Ib) σ′[s1]p1 = σ[s1]p1. For Condition II, let αi1∶j1 and
α′i1′∶j1′ denote the actions that write σ[s]p and σ′[s]p, respectively. Since the current value of
a variable is determined by the action that last writes to it, to satisfy Condition II, we need to
make sure α′i1′∶j1′ is equal to αi1∶j1, which again requires (IIa) π′i1′∶j1′−1 = πi1∶j1−1 and (IIb)
σ′[s]p−1 = σ[s]p−1. Applying this reasoning inductively for all threads, we finally
reach the base case (i) ∀ti ∈ T, π′i∶0 = πi∶0 and (ii) ∀s ∈ S, σ′[s]0 = σ[s]0, which is satisfied
by the equivalence of the initial program states Σ′0 = Σ0. Hence, Lemma 7.8 is proved.
Therefore, Theorem 7.2 is proved.
Chapter 8
Execution Privatization for Scheduler-Oblivious Concurrent Programs
Making multithreaded execution less non-deterministic is a promising solution to address the
difficulty of concurrent programming. In fact, a vast category of concurrent programs are
scheduler-oblivious: their execution is deterministic, regardless of the scheduling behavior.
We present and formally prove a fundamental observation, the privatizability property of
scheduler-oblivious programs, which paves the way for privatizing shared data accesses on a path
segment. With privatization, the non-deterministic thread interleavings on the privatized ac-
cesses are eliminated and many concurrency problems are alleviated. We further present a path
and context sensitive privatization algorithm that safely privatizes the program without intro-
ducing any additional program behavior. Our evaluation results show that the privatization
opportunity pervasively exists in real-world, large, complex concurrent systems. Through pri-
vatization, several real concurrency bugs are fixed, and notable performance improvements are
also achieved on benchmarks.
8.1 Introduction
Despite decades of multicore practice, developing good quality concurrent software remains
notoriously difficult due to non-deterministic thread interleavings. In principle, concurrent pro-
grams are free to exhibit the non-deterministic behavior allowed by the scheduler, and it is the
responsibility of the programmers to prevent the non-determinism from impairing program cor-
rectness (using synchronization, for example). In practice, however, a vast category of real
world concurrent programs are deterministic-by-default or, more generally, scheduler-oblivious:
given the same input, they are always expected to produce the same output. As noted
by Bocchino Jr. et al. [13, 14], almost all scientific computing, encryption/decryption, sorting,
compiler and program analysis, and processor simulation algorithms exhibit scheduler-oblivious
behavior.
Scheduler-oblivious concurrent programs are much easier to reason about, because their exe-
cution is deterministic w.r.t. the program state transition: given the same initial state, they al-
ways reach the same final state, regardless of the thread scheduling (assuming a random but fair
scheduler) [14, 26]. Nevertheless, it is still challenging to write correct and efficient scheduler-
oblivious programs. Although significant research effort has been invested in language design
[11, 14], compiler [9], runtime environment [20, 25], operating system [6, 10], and hardware
[26, 27] to find practical solutions, all these approaches essentially limit execution parallelism
and incur a performance penalty. How to efficiently support the deterministic execution of
scheduler-oblivious programs remains an open problem.
We identify a fundamental property we call privatizability of scheduler-oblivious programs.
This property enables us to develop an execution privatization technique that makes scheduler-
oblivious programs more deterministic without compromising parallelism. The privatizability
property is closely related to but slightly different from the classical conflict and view serializ-
ability property [12, 134, 139]. Privatizability describes the view consistency over a subset of
shared data access scenarios: read-after-write and read-after-read. Under a certain condition, the
program can be soundly privatized to an equivalent program in which the two accesses are always
executed sequentially.
More specifically, consider a path segment, p, in a scheduler-oblivious program, with no block-
ing statement (e.g., thread synchronization), and with two successive accesses to the same data,
where the first access is a read or write, and the second is a read. Suppose in a correct execu-
tion of the program (given a certain input), these two accesses are executed sequentially without
interleaving by a third write to the same location. The privatizability property says that the
second read, which is a shared data access in the program, can actually be changed to a local
access, which always returns the local value stored by the first access. Let us call the shared
data accesses such as the second read privatizable accesses, the operation of changing a priva-
tizable shared access to be local privatization, and the modified program a privatized program.
The soundness of the privatizability property is easy to follow. Since p contains no blocking
statement, with no control of the thread scheduling behavior, the execution of p could always
continue without waiting for other threads. In other words, for any input, there always exists a
schedule in the original program such that the two accesses read or write the same value, making
it reach the same final state as that of the privatized program. And because the original program
is scheduler-oblivious, for all schedules, it will reach the same final state. Hence, both the pri-
vatized program and the original program will always reach the same final state given the same
input. We formally prove a theorem of the privatizability property in Section 8.2.
While guaranteeing program state equivalence, privatization brings a nice benefit to the pro-
gram: it isolates the effect of the thread interleaving on the privatized accesses without adding
any synchronization. The privatized program will no longer experience any non-determinism
caused by the potential erroneous interleaving on the privatized accesses and, at the same time,
no performance is lost. Moreover, as the original heap accesses become stack accesses after pri-
vatization, the program performance can also improve. In return, many concurrency problems
caused by non-deterministic thread interleavings, e.g., concurrent program testing and debug-
ging, can be greatly alleviated for scheduler-oblivious programs. We discuss the applications in
more detail in Section 8.7.1.
Taking advantage of this observation, we propose Privateer, an automatic privatization technique
for scheduler-oblivious programs. An important condition for applying privatization is that
the observed execution of the path segment p in the privatizability property (in which the two
accesses are executed without interleaving) should be correct. Otherwise, if it is buggy, every
execution of p would be wrong after privatization. To bias our results to correct executions,
our technique first conducts a dynamic analysis on a set of common correct executions to find
privatizable accesses. In this way, we guarantee that the privatization is only performed when
the privatizable accesses can be correctly privatized.
The privatization may be applied either at runtime or offline. The key technical challenge is
how to guarantee privatization correctness, i.e., that it does not introduce additional behavior beyond
what could be exhibited by the original program. We present an offline program transformation
approach including a path and context sensitive privatization algorithm that guarantees no new
program behavior is introduced compared to the original.
We have implemented Privateer for Java and evaluated it on a set of popular multithreaded
benchmarks as well as five real world large complex concurrent systems, including Apache
Derby, Tomcat, Jetty, OpenJMS and Jigsaw. Our experimental results show that: (1) Privatiza-
tion opportunities are common in concurrent programs. We found a total of 5,119 privatizable
accesses in the five large real systems. The overall percentage of privatizable accesses (the num-
ber of privatizable versus the total number of shared data access locations) ranges from 14.7%
to 30.7% in the privatized executions. (2) Our technique is effective in repairing two typical
classes of concurrency bugs. In our study of nine real world concurrency bugs, our privatization
technique is able to fix seven of them. (3) With our technique to automatically privatize the orig-
inal heap accesses, we are also able to improve the performance of the evaluated benchmarks by
4.3%-17.9%.
The remainder of this chapter is organized as follows: Section 8.2 presents and formally proves
the privatizability property; Section 8.3 presents an overview of execution privatization; Section
8.4 presents the technical details of Privateer; Section 8.5 presents our implementation and Sec-
tion 8.6 reports our experimental results; Section 8.7 further discusses the application scope of
privatization; Section 8.8 summarizes this chapter.
8.2 A Theorem of Privatizability for Scheduler-Oblivious Programs
The cornerstone of our work is the fundamental privatizability property of scheduler-oblivious
programs. In this section, we present and formally prove a theorem of this property. The the-
orem forms the foundation of privatization, which reduces the non-deterministic influence of
the thread interleavings for scheduler-oblivious programs, benefiting many concurrent program
testing and debugging tasks.
In a scheduler-oblivious program P , consider a path segment, p, with two successive global
actions, ai and aj , to the same shared variable, s, on the global store, where ai is a READ or
a WRITE, and aj is a READ. Let program P ′ be a privatized version of P on p, in which the
global action aj in P is changed to be a local action, aj′ , in P ′, such that aj′ stores the value
read or written by ai into a local variable in the thread’s local store. All the other actions in P
and P ′ are the same.
Consider the executions of P and P′ given the same input. Let vi denote the value read or
written by ai, and vj and vj′ the values returned by aj and aj′, respectively. Clearly, vi = vj′ always
holds, because in P′, aj′ always reads the same value as that read or written by ai. However, in P,
vi may not necessarily be equal
to vj , because ai and aj may be interleaved by a third WRITE to s from a different thread that
changes the value of s. Nevertheless, we have the following theorem of privatizability property
on the equivalence between P and P ′:
Theorem 8.1. If p contains no blocking statement, P is equivalent to P ′: given the same initial
state, P and P ′ always reach the same final state.
Proof. Let us consider an execution where p is only executed once with an arbitrary schedule
ξ. The proof is similar if p is executed multiple times. Recalling Rule (2.1) from Section 2.1, the
state transitions of P and P′ are as follows:

P:  (Σ0, ξ) → … --αi→ Σi → … --αj−1→ Σj−1 --αj→ Σj → … → ΣN
P′: (Σ0, ξ) → … --αi→ Σi → … --αj−1→ Σj−1 --αj′→ Σj′ → … → ΣN′

Since the only difference between P and P′ is on aj and aj′, to prove ΣN′ = ΣN, it is sufficient
to show Σj′ = Σj. Recall that aj only reads the value of s on the global store σj−1 and stores
it to the thread's local store, say m. According to Rule (2.2), for the global store, we have
σj′ = σj−1 = σj. According to Rule (2.3), for the local store, the only difference between Πj′
and Πj is the value of m. In Πj′, it is vj′, and in Πj, it is vj. Because vj′ = vi in P′, if we can
show vi = vj, then we must have Πj′ = Πj, and hence ΣN′ = ΣN can be proved.
Now let us consider the schedule ξ. If there is no WRITE action to s between ai and aj, then
clearly we have vi = vj, and hence ΣN′ = ΣN. Suppose such a schedule exists and let us call it
ξno. We have thus shown that P′ is equivalent to P at least for the schedule ξno. On the other hand,
if there is a WRITE action by another thread between ai and aj in ξ, then we may have vi ≠ vj.
Nevertheless, recalling Rule (2.4), by the definition of scheduler-obliviousness, we have
(Σ0, ξ) →* ΣN for any schedule. That is, even if ξ ≠ ξno and ξ makes vi ≠ vj, ξ still drives the
program to the same final state as ξno does. Therefore, ΣN′ = ΣN always holds as long as ξno exists.
We now prove the existence of ξno for any initial program state by contradiction. Suppose ξno
does not exist; this means that for all schedules, ai and aj must be interleaved by a third WRITE.
For a scheduler-oblivious program, there must then exist a blocking statement in p. The reason is
that the blocking behavior is the only way to enforce a thread interleaving under a fair but non-
deterministic thread scheduler. Without a blocking action in p, a thread may always continue
executing to the end of p if not preempted by the scheduler. Since we assume p does not
contain a blocking statement, ξno must exist. Therefore, Theorem 8.1 is
proved.
Theorem 8.1 paves the way for privatizing scheduler-oblivious concurrent programs. Since P ′
is equivalent to P, we can soundly privatize P to P′ for our purposes. With privatiza-
tion, the non-deterministic thread interleavings on the privatized data accesses (such as aj) are
isolated and, more importantly, the program performance is not impaired but rather improved
as the original heap accesses become stack accesses after privatization. We are now ready to
present our privatization technique for scheduler-oblivious programs.
8.3 Overview
The key concept of our work is the privatization of scheduler-oblivious programs. Essentially,
privatization changes the shared variable accesses in the original program to local ones in the
privatized program, under the condition that the behavior of the original program is not changed.
We have shown the condition and the soundness of privatization in Theorem 8.1. In this section,
we first use two motivating examples to illustrate the idea and the benefits of privatization on
concurrency bug fixing and on the program performance. We then present the challenges for
guaranteeing the privatization correctness, which we address in detail in the next section.
Top (original):

     1  public class TableDescriptor {
     2    FormatableBitSet referencedColumnMap;
     3    public String getObjectName() {
     4      if (referencedColumnMap == null) {
     5        ...
     6      }
     7      else {
     8        for (int i = 0; i < ...; i++)
     9        {
    10          ...
    11          referencedColumnMap.isSet(...)
    12        }
    13      }
    14    }
    15
    16    public void setReferencedColumnMap(...) {
    17      referencedColumnMap = null;
    18    }
        }

By turning the read of referencedColumnMap at line 11 into a local read, we can fix this bug. The transformation for this case is straightforward.

Bottom (privatized):

     1  public class TableDescriptor {
     2    FormatableBitSet referencedColumnMap;
     3    public String getObjectName_privatized() {
     4      FormatableBitSet referencedColumnMap_local = referencedColumnMap;
     5      if (referencedColumnMap_local == null) {
     6        ...
     7      }
     8      else {
     9        for (int i = 0; i < ...; i++)
    10        {
    11          referencedColumnMap_local.isSet(...)
    12        }
    13      }
    14    }
    15  }

FIGURE 8.1: Top: a real bug #2861 in Apache Derby. The program crashes with a NullPointerException when a thread dereferences the shared data structure referencedColumnMap at line 11 after another thread sets it to null in the method setReferencedColumnMap. Bottom: the getObjectName method after privatization.
8.3.1 Motivating Examples
Bug fixing The code snippet in Figure 8.1 (top) shows a real crash bug in the Apache Derby
database. When a thread calls the getObjectName method on a shared TableDescriptor,
it first checks whether the field referencedColumnMap is null or not (line 4). If
referencedColumnMap is not null, the thread enters a loop and dereferences it (line 11). There
is a potential interleaving between the two accesses to referencedColumnMap, where another
thread may set referencedColumnMap to null (line 17) between line 4 and line 11, causing
the first thread to throw a NullPointerException at line 11. Worse, due to the non-determinism of
Left (original):

    volatile num = 100,000,000;

    while(true){
      synchronized(lock)
      {
        num--;
        if(num == 0)
          System.exit(0);
      }
    }

Right (privatized):

    while(true){
      synchronized(lock)
      {
        num_local = num - 1;
        num = num_local;
        if(num_local == 0)
          System.exit(0);
      }
    }

FIGURE 8.2: The benchmark contains 8 threads simultaneously decreasing the shared variable num. The privatized version (right) is 17.9% faster than the original version (left).
this interleaving, this bug is difficult to reproduce and to fix. As reported in the bug repository
(https://issues.apache.org/jira/browse/DERBY-2861), it took as long as a year before this bug was
finally fixed by the developer.
To fix this bug, essentially, the effect of this erroneous interleaving on the program state must be
eliminated. One option is to add synchronizations (e.g., locks) to completely prohibit this inter-
leaving, but this limits the degree of parallelism. After a closer look at this program, we can see
that there is an intriguing characteristic with respect to the dereference to referencedColumnMap
at line 11 that we can leverage to eliminate the erroneous interleaving without using synchro-
nization. That is, in correct executions, the dereference to referencedColumnMap should
always dereference the same value as the preceding access to referencedColumnMap by
the same thread at line 4. This indicates that this shared data access is privatizable: we can pri-
vatize it to dereference a thread local variable referencedColumnMap local that stores
the value of the access to referencedColumnMap at line 4 by the same thread, as shown
in Figure 8.1 (bottom). In this way, the dereference to referencedColumnMap will always
dereference a non-null variable, regardless of the thread interleaving. The bug is fixed without
adding any synchronization.
Performance improvement To assess the effect of privatization on the program performance,
we design a micro-benchmark (Figure 8.2) to conduct controlled experiments for quantifying the
runtime characteristics of the privatization effect. The benchmark consists of concurrent threads
that repeatedly decrease a shared counter (a volatile integer) in a loop until its value reaches 0.
The counter decreasing and the termination checking operations are enclosed in a synchronized
block to ensure the correctness. We control the number of threads and the initial value of the
counter to measure the program execution time.
Figure 8.2 (left part) shows the original micro-benchmark. Since the second read of the counter
always returns the same value as its preceding write access, the second read can be
privatized to return the value of a local variable that stores the value written by the write access. The
privatized version is shown in Figure 8.2 (right part). In our experiments on an 8-core machine
with 8 threads and with the initial value of the counter set to 100,000,000, the privatized version
(40.2s) is 17.9% faster than the original version (49.0s).
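For readers who wish to reproduce the effect, below is a minimal runnable rendition of the privatized micro-benchmark; the class and variable names are ours, and the worker threads return (rather than call System.exit(0) as in Figure 8.2) so that the elapsed time can be printed.

    public class CounterBench {
        static final Object lock = new Object();
        static volatile int num = 100_000_000;

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[8];
            long start = System.nanoTime();
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread(() -> {
                    while (true) {
                        synchronized (lock) {
                            int numLocal = num - 1;    // privatized second read
                            num = numLocal;
                            if (numLocal <= 0) return; // Figure 8.2 uses System.exit(0)
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread w : workers) w.join();
            System.out.printf("elapsed: %.1fs%n", (System.nanoTime() - start) / 1e9);
        }
    }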
Left (original program):

    1  if (foo == null){
    2    foo = new Foo();
    3  }
    4  foo.m();

Top right (naïve privatization):

    foo_local = foo;
    if (foo_local == null){
      foo = new Foo();
    }
    foo_local.m();

Bottom right (path-sensitive privatization):

    foo_local = foo;
    if (foo_local == null){
      foo = new Foo();
      foo.m();
    }
    else
      foo_local.m();

FIGURE 8.3: Privatization must be path-sensitive.
8.3.2 Privatization Challenges
On the surface, privatization seems an easy problem. For instance, for the bug fixing example in
Figure 8.1, we may simply replace the shared read of referencedColumnMap at line 11 with a
local read of referencedColumnMap_local, which stores the same value read by the access
to referencedColumnMap at line 4. However, in practice, we have to address the following
tough challenges:
Path-sensitivity The privatization must be path-sensitive. A privatizable access is defined spe-
cific to a certain path segment. It might not be privatizable on a different path. To understand
this problem, let us consider a simple program in Figure 8.3 (left part). The program first checks
whether a shared variable foo is null or not at line 1. If foo is null, it is assigned to a new
Foo object at line 2. Then the program invokes the method m on foo at line 4. Suppose that, in
our collected execution traces of the program, we only observed the path through lines (1 → 4),
which is possible as foo might always be non-null initially. We would find that the second
read of foo at line 4 is privatizable (because it always returns the same value as the first read of
foo at line 1). However, if we naively privatize the second read in the same way as we did
for the Derby bug #2861 in Figure 8.1, the resulting program (shown at the top right of
Figure 8.3) would be incorrect, because if foo is initially null, the invocation of m would then
dereference a null variable. The correct privatization should consider the path containing the
second read and the first read, and perform the privatization specific to this path, as shown at the
bottom right of Figure 8.3.
Context-sensitivity Besides path-sensitivity, the privatization should also be context-sensitive.
Shared data accesses in different calling contexts may access different values, either written
by the same thread or possibly by a different thread. Therefore, an access that is privatiz-
able in one calling context might not be privatizable in another. This problem is illustrated
     1  public class StringBuffer {
     2    private int count;
     3    public synchronized StringBuffer append(StringBuffer sb) {
     4      int len = sb.length(); ... sb.getChars(..., len, ...); ... }
     5    public synchronized StringBuffer delete(int start, int end) {
     6      ...
     7      int len = end - start; ...
     8      count -= len; }
     9    public synchronized void getChars(...) {
    10      ...
    11      if (srcEnd > count) {
    12        throw new StringIndexOutOfBoundsException();
    13      } }
    14    public synchronized int length() {
    15      return count; }
        }

By turning the read of count at line 11 into a local read, we can fix this bug. The transformation for this case is inter-procedural.

FIGURE 8.4: An atomicity violation in the append method of the java.lang.StringBuffer class. The program throws a StringIndexOutOfBoundsException when a thread at line 11 references the stale length of sb changed by another thread at line 8.
by the StringBuffer bug in Figure 8.4. The two accesses to count at line 11 in the method
getChars and at line 15 in the method length, respectively, are invoked within the context
of the append method, which is inter-procedural and spans several method calls and control
branches. The access to count in the method getChars at line 11 is privatizable because,
in correct executions, it always reads the same value as the read access to count at line 15 in
the method length. However, this repeated read is only privatizable within the calling context
append. It might not be privatizable for all calling contexts in the program. For instance, it
is possible that the getChars method is called from an external method in which count is
written by a remote thread and then directly accessed in getChars. Therefore, we have to
consider the calling context specific to the privatizable access.
Progressiveness Privatization changes an originally shared variable access into a local one by
modifying it to return the local value stored by a preceding access (to the same shared data). If
the shared data is changed (by another thread) between the privatized access and its preceding
access, the modified access will not see the change. This is problematic when the change is ex-
pected by the program. Because we have observed in the correct execution that the change does
not happen (the privatized access returns the same value as its preceding access), the change
should not be expected on the observed path segment. However, there is an important pro-
gressiveness property we must preserve: the program must be able to continue execution after
privatization. For example, if there is a blocking operation somewhere between the privatized
access and its preceding access, the program may block forever until the shared data is changed.
Also, when the privatized access is inside a loop and the value of the access is related to the loop
condition, after privatization, the program may never escape from the loop.
Execution Privatization for Scheduler-Oblivious Concurrent Programs 121
while (shared){
…
}
local = shared
while (local){
…
}
FIGURE 8.5: Privatization must preserve progressiveness
Figure 8.5 illustrates this problem. The program implements a simple barrier with which
the thread cannot progress until the flag ‘shared’ is set to true by another thread.
Suppose the initial value of ‘shared’ is false. If we naively privatize the access to ‘shared’
to be ‘local’, the resulting program may never exit from the while loop. In the original
program, however, this situation only happens if the other thread is never scheduled to change
the value of ‘shared’, so the privatized program differs from the semantics of the original
program. Therefore, we must also consider progressiveness for the privatization correctness.
8.4 Execution Privatization
To address these challenges, we developed a path and context sensitive privatization algorithm,
to make sure the privatization only applies to the correct execution paths we have observed and
to guarantee that privatization does not introduce extra behavior.
Our technique consists of two phases: dynamic trace analysis and code privatization. The
dynamic trace analysis phase presents the privatizable accesses to the privatization phase, which
then performs the path-sensitive and context-sensitive privatization on the program source or
bytecode. In this section, we present our technique in detail. We also show the correctness of
privatization in Section 8.4.4.
8.4.1 Preliminaries
We first define a few basic concepts. We will use these concepts to describe our technique in the
rest of this section.
Definition 8.2. A basic block (BB) contains a sequence of program statements with only one
entry point and one exit point.
This definition refers to the standard notion of basic block in the control flow graph (CFG). In
our method, we give each BB in the program a unique ID.
Definition 8.3. A shared data access point (SAP) is a statement in some BB that reads or
writes shared data between threads at runtime.
Execution Privatization for Scheduler-Oblivious Concurrent Programs 122
Each SAP has a unique location in the program with the access type ∈ {WRITE,READ}. For
example, the simple program in Figure 8.6 has three SAPs, at lines 1, 2, 4, respectively, and
their access types are READ, WRITE, READ, respectively.
As a SAP is a static instruction that accesses shared data, a SAP may be executed multiple times at
runtime. Different execution instances may access different shared memory locations because
of possible pointer aliasing. In our method, we therefore also distinguish different execution
instances of a SAP at runtime.
Definition 8.4. A trace captures a multi-threaded program execution as a sequence of events
δ = ⟨ei⟩.
We consider the following four types of events:
• SAPE (t,s,m): a thread t executes a SAP s accessing a shared memory location m.
• BBI (t,b): a thread t enters a BB b.
• BBO (t,b): a thread t exits from a BB b.
• BLOCK (t): a thread t executes a blocking statement.
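A minimal Java rendering of this event model might look as follows (hypothetical names; records and sealed interfaces require Java 17+):

    // The four trace event types of Definition 8.4.
    sealed interface Event permits SapE, Bbi, Bbo, Block {}
    record SapE(int thread, int sap, long mem) implements Event {} // SAPE(t,s,m)
    record Bbi(int thread, int bb) implements Event {}             // BBI(t,b)
    record Bbo(int thread, int bb) implements Event {}             // BBO(t,b)
    record Block(int thread) implements Event {}                   // BLOCK(t)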
Definition 8.5. A privatizable SAP (P-SAP) is a READ SAPE in the trace that returns the value
read or written by the preceding SAPE of the same thread, with no BLOCK by that thread
in between. This preceding SAPE is called the dependent SAP (D-SAP) of the P-SAP.
Definition 8.6. A privatizable path (P-Path) is the path segment in the trace associated with a
P-SAP: it starts from the BB containing the D-SAP and ends at the BB containing the P-SAP,
both executed by the same thread.
P-Path is represented by the sequence of BBs executed between the P-SAP and the correspond-
ing D-SAP by the same thread. P-SAP and D-SAP are path-sensitive. For example, in Fig-
ure 8.6, there are two pairs (D-SAP1, P-SAP1) and (D-SAP2, P-SAP2) following the P-Paths
through lines (1→ 4) and (2→ 4), respectively.
Definition 8.7. The calling context of a P-SAP or a D-SAP is the sequence of active methods
and the method call sites on the stack, when the P-SAP or the D-SAP is executed.
The calling context defined here is similar to the standard definition [16, 120]. We will use it
to determine whether to perform the privatization on the P-SAP or not (Section 8.4.3.1). Note
that the calling context can be computed efficiently by analyzing the BBI and BBO events in the
trace, without any extra information at runtime.
    1  if (foo == null){      (D-SAP1)
    2    foo = new Foo();     (D-SAP2)
    3  }
    4  foo.m();               (P-SAP1, P-SAP2)

FIGURE 8.6: D-SAP and P-SAP are path-sensitive.
8.4.2 Dynamic Trace Analysis
The goal of our dynamic trace analysis is to find all the P-SAPs manifested in the observed
correct executions. Each reported P-SAP is also associated with the P-Path, which is used by
the second phase to perform the privatization.
Algorithm 14 shows our dynamic trace analysis algorithm. Our algorithm to extract the D-SAP
and P-SAP is similar to the work of AVIO [75] and CTrigger [99] in that the D-SAP is related
to the P-instruction and the P-SAP is related to the I-instruction. In contrast, the P-SAP in our work is
limited to READ accesses only, and our algorithm also needs to make sure that there is no blocking
operation between the D-SAP and the P-SAP by the same thread. Moreover, what we take as
input is a set of correct execution traces. Sharing the same essence with [135], our work does not
require the availability of erroneous executions to eliminate the erroneous thread interleavings.
Algorithm 14 DynamicTraceAnalysis(δ)
 1: Input: δ — a trace
 2: Let M denote all shared memory locations in δ
 3: δm ← the sequence of SAPEs in δ that access a shared memory location m
 4: δmt ← the sequence of SAPEs in δm that are performed by a thread t
 5: for each m ∈ M do
 6:   for each READ SAPE s ∈ δm do
 7:     sdef ← the most recent WRITE SAPE in δm before s
 8:     t ← the thread of s
 9:     s′ ← the most recent SAPE in δmt before s
10:     if s′ is a WRITE then
11:       s′def ← s′
12:     else
13:       s′def ← the most recent WRITE SAPE in δmt before s′
14:     if sdef == s′def then
15:       p-path p ← the sequence of BBs by t in δ from s′ to s
16:       if p does not contain a BLOCK statement then
17:         report p as privatizable
[Figure 8.7 depicts a P-Path p = bi-bi+1-bi+2-...-bj in the trace, running from the BB containing the D-SAP to the BB containing the P-SAP, and its privatized clone p′ = b′i-b′i+1-b′i+2-...-b′j containing D-SAP′ and P-SAP′.]

FIGURE 8.7: Conceptual view of execution privatization. The privatization is tailored to the P-Path.
To find a P-SAP, our algorithm iterates through the sequence of SAPEs on each shared memory
location by each thread. For each READ SAPE, s, that accesses the shared memory location, m,
by a thread, t, we first find the most recent SAPE on m before s that is performed by t, say it is
s′. To determine whether s is privatizable or not, we compare the most recent WRITE SAPEs
on m that is before s′ (including s′) and the most recent WRITE SAPEs on m that is before s. If
they are the same, we continue to check whether the path p from s′ to s by t in the trace contains
a BLOCK operation or not. If not, we report s as a P-SAP and p the corresponding P-Path.
The same procedure is applied for all threads and all shared memory locations in each trace.
Finally, we obtain a set of P-SAPs computed from all the traces. For a set of traces, the results
of P-SAPs are merged. Two P-SAPs are considered equivalent if their P-Paths are identical.
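To make the bookkeeping concrete, the following compact Java sketch (our names, Java 16+ records) performs the scan of Algorithm 14 for the SAPEs of a single shared memory location, given in trace order; the no-BLOCK test is approximated by recording, at each SAPE, how many BLOCK events its thread had executed so far.

    import java.util.*;

    final class PSapScan {
        // One SAPE on location m: its thread, access type, and the number of
        // BLOCK events its thread had executed when this SAPE occurred.
        record Sap(int thread, boolean isWrite, long blocksSoFar) {}

        // Returns (D-SAP index, P-SAP index) pairs among the given SAPEs.
        static List<int[]> findPrivatizable(List<Sap> saps) {
            List<int[]> pairs = new ArrayList<>();
            Integer lastWrite = null;                       // s_def candidate
            Map<Integer, Integer> lastByThread = new HashMap<>();
            Map<Integer, Integer> lastWriteByThread = new HashMap<>();
            for (int i = 0; i < saps.size(); i++) {
                Sap s = saps.get(i);
                if (!s.isWrite()) {
                    Integer sPrev = lastByThread.get(s.thread());      // s'
                    Integer sPrevDef = (sPrev != null && saps.get(sPrev).isWrite())
                            ? sPrev
                            : lastWriteByThread.get(s.thread());       // s'_def
                    boolean sameDef = sPrev != null
                            && Objects.equals(lastWrite, sPrevDef);    // s_def == s'_def
                    boolean noBlock = sPrev != null
                            && saps.get(sPrev).blocksSoFar() == s.blocksSoFar();
                    if (sameDef && noBlock) pairs.add(new int[] { sPrev, i });
                }
                lastByThread.put(s.thread(), i);
                if (s.isWrite()) {
                    lastWrite = i;
                    lastWriteByThread.put(s.thread(), i);
                }
            }
            return pairs;
        }
    }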
8.4.3 Path and Context Sensitive Privatization
The execution privatization is essentially a program transformation process that takes the P-
SAPs reported in the trace analysis phase and produces a privatized version of the program in
which the P-SAPs are all privatized. We iterate through the list of P-SAPs and perform the
privatization for each of them.
For each P-SAP, the privatization is tailored to the associated P-Path, as illustrated in Figure
8.7. Conceptually, we clone the P-Path for each P-SAP and attach it to the program. Most of the
cloned P-Path is the same as the original, with the main difference that the P-SAP is privatized to
access a thread local variable which contains the value accessed by the D-SAP. More formally,
consider a P-Path p = bi-bi+1-bi+2-...-bj, where the D-SAP and P-SAP are in the BBs bi and bj,
respectively. We clone p to p′ = b′i-b′i+1-b′i+2-...-b′j, where b′i = bi with the D-SAP replaced
by D-SAP′, b′i+1 = bi+1, ..., b′j−1 = bj−1, and b′j = bj with the P-SAP replaced by P-SAP′.
D-SAP′ and P-SAP′ are determined by the privatization rules. Moreover, to ensure soundness,
the P-Path clone must guarantee that p′ is executed in the privatized program iff p is executed
in the original program.
    D-SAP:   WRITE s   ⇒   D-SAP′:  WRITE s_local; s = s_local
    D-SAP:   READ s    ⇒   D-SAP′:  s_local = s; READ s_local
    P-SAP:   READ s    ⇒   P-SAP′:  READ s_local

FIGURE 8.8: Privatization rules of D-SAP and P-SAP.
    1  int local1 = getData();     (D-SAP's call site)
       ...
    2  int local2 = getData();     (P-SAP's call site)

       int getData(){
    3    return shared;            (both the D-SAP and the P-SAP)
       }

FIGURE 8.9: The P-SAP and the D-SAP are at the same program location (line 3). Nevertheless, because their calling contexts are different (line 1 and line 2, respectively), they are still privatizable.
Furthermore, recall that we must also consider progressiveness before performing any naive
privatization. The key to progressiveness is that any shared data access inside a loop should
be able to see the change to the shared data; otherwise, the program may never progress out of
the loop. To address this problem, after privatizing all the P-SAPs, we perform an additional
inter-procedural loop analysis to decide whether any privatized P-SAP is inside a loop.
If it is, we ensure that not all P-SAPs inside the loop are privatized. In this way, because at
least one P-SAP still accesses the shared data, any change to the shared data is guaranteed to be
visible to all the P-SAPs.
In the rest of this section, we first show the privatization rules in detail. Then we present our
path and context sensitive P-Path cloning algorithm.
8.4.3.1 Privatization Rules
Figure 8.8 shows the privatization rules for the D-SAP and P-SAP. The P-SAP is a READ access to
some shared variable s. Our privatization replaces it with a read of a local variable s_local instead.
The value of s_local is obtained from the privatization of the D-SAP. Depending on the
access type of the D-SAP, the treatments are slightly different. If the D-SAP is a WRITE
access, we first change it to store the written value into a local variable s_local and then insert
a new statement s = s_local after it, which stores the value in s_local back to s. If the D-SAP is
a READ access, we first insert a new statement that stores the value of s into s_local and then
change the D-SAP to read s_local instead of s. Clearly, in this way, when the P-SAP is exe-
cuted, instead of reading the original shared variable s, it will read the local variable s_local,
which stores the value of s.
Privatization scope Note that privatization is applicable to the whole program and is general
to all calling contexts in the trace. It is not limited to a single method or a single module. The
P-Path may span multiple modules and contain multiple method calls. Also, the P-SAP and
D-SAP may be at the same program location, as long as their calling contexts (Definition 8.7)
in the P-Path are different. We use the sequence of BBI events and their call sites to represent
the calling context. Scanning the trace from its beginning to the P-SAP (D-SAP), every BBI
event by the same thread is added to the calling context, and when a BBO event occurs, the
corresponding BBI event is removed from the context.
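This context computation can be sketched as a simple stack replay over the thread's events; Event and its fields below are illustrative names, assuming a simple trace event representation:

import java.util.ArrayDeque;
import java.util.Deque;

final class CallingContext {

    enum EventType { BBI, BBO, READ, WRITE, BLOCK }

    static final class Event {
        final EventType type;
        final int callSite;   // call site (or block ID) carried by a BBI event
        Event(EventType type, int callSite) { this.type = type; this.callSite = callSite; }
    }

    /** Replays one thread's events from the start of the trace up to the SAP
     *  of interest; the resulting stack of open BBI call sites is the calling
     *  context of that SAP. */
    static Deque<Integer> contextAt(Iterable<Event> eventsUpToSap) {
        Deque<Integer> context = new ArrayDeque<Integer>();
        for (Event e : eventsUpToSap) {
            if (e.type == EventType.BBI) {
                context.push(e.callSite);   // block entered: remember its call site
            } else if (e.type == EventType.BBO) {
                context.pop();              // block exited: drop the matching BBI
            }
        }
        return context;
    }
}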
Privatization transitivity An interesting property of the privatization is transitivity. The D-SAP
of a P-SAP itself might also be a P-SAP, which has its own D-SAP. This forms a loop of D-SAPs
and P-SAPs if every D-SAP in it is also a P-SAP, or a chain when there exists a D-SAP that
is not a P-SAP. When it forms a chain, let us call the unique D-SAP that is not a P-SAP the
ancestor. The ancestor has the nice property that its local value can be directly used by all the
other P-SAPs in the chain. This makes reuse of the local variable possible, freeing us from
creating a new local variable for each P-SAP.
Progressiveness guarantee However, we must be careful when the P-SAPs and their D-SAPs
form a loop. As noted in Section 8.3.2, we must make sure the privatization does not break the
progressiveness of the original program. If any of the P-SAPs and their D-SAPs form a loop,
after the privatization, all the P-SAPs in the loop are privatized and the change to the shared
data would not be seen by the privatized P-SAPs. When the shared data is related to the loop
condition, the program may be inside the loop forever. The key to addressing this problem is to
break the loop, ensuring that at least one P-SAP inside the loop should be able to see the change
to the shared data. We resolve this problem by performing a whole program loop analysis after
privatizing all the P-SAPs. For each privatized P-SAP, we check whether it is inside a loop of
P-SAPs or not. If it is, we simply unprivatize one of the P-SAPs in the loop. In this way, at least
one shared data access remains unprivatized and can see the change to the shared data, through
which the change propagates to the other P-SAPs. Therefore, the progressiveness of the original
program is preserved.
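The loop-breaking step itself is simple once the cycles of P-SAPs are known; the following is a minimal sketch, in which PSap, unprivatize, and the cycle computation are illustrative placeholders rather than our actual implementation:

import java.util.List;
import java.util.Set;

final class ProgressivenessGuard {

    interface PSap { void unprivatize(); }   // illustrative placeholder

    /** For every cycle of P-SAPs found by the whole-program loop analysis,
     *  undo one privatization, so at least one access in the loop still
     *  reads the real shared variable and observes its changes. */
    static void breakLoops(List<Set<PSap>> psapCycles) {
        for (Set<PSap> cycle : psapCycles) {
            PSap victim = cycle.iterator().next();  // any member works
            victim.unprivatize();                   // restore the shared read
        }
    }
}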
Variable visibility An additional problem we need to address is the visibility of the local vari-
able s_local when the D-SAP and the P-SAP are within different methods. Because s_local
is only visible in the method in which it is declared, the P-SAP cannot read it from a different
method. For such inter-procedural cases, we declare s_local as a thread-local static variable.
The variable is a static field of a singleton class added to the program, and it is unique for each
P-SAP. In this way, the P-SAP is able to read s_local directly.
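In Java, such a per-P-SAP thread-local static field can be realized with java.lang.ThreadLocal; the following is a hedged sketch in which the class and field names are illustrative:

final class PrivateerLocals {
    private PrivateerLocals() { }

    // One thread-local static field per P-SAP, so the value is both
    // private to each thread and visible across methods.
    static final ThreadLocal<Integer> s_local_psap17 = new ThreadLocal<Integer>() {
        @Override protected Integer initialValue() { return 0; }
    };
}

// At the privatized D-SAP (in one method):
//     PrivateerLocals.s_local_psap17.set(s);
// At the privatized P-SAP (in another method):
//     int v = PrivateerLocals.s_local_psap17.get();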
Algorithm 15 P-Path Clone (p)
 1: Input: p = b_i b_{i+1} ... b_j, the P-Path from the D-SAP to the P-SAP
 2: for k ← i+1 to j do
 3:     if b_k is an entry BB to a new method m then
 4:         clone m to m_privatized
 5:         update the call site in b_{k-1} to m_privatized
 6:     else
 7:         if b_k has more than one predecessor in the CFG then
 8:             clone b_k to b'_k
 9:             update the edge from b_{k-1} to b'_k
FIGURE 8.10: Intra-procedural privatization. The P-Path from b_i (containing the D-SAP) to b_j (containing the P-SAP) is cloned into b'_i ... b'_j (containing D-SAP' and P-SAP'); blocks such as b_k that are shared with other paths are cloned to b'_k, so all other paths remain unchanged.
8.4.3.2 Path and Context Sensitive P-Path Clone
Because there might be complicated control flows and a possibly infinite number of paths in the
program, the main challenge of the P-Path clone is to ensure that only the P-Path is cloned and
no other path. That is, all the other paths in the program except the P-Path remain
unchanged in the privatized program. To achieve this, our algorithm carefully clones the P-Path
by taking care of every BB and the context in the P-Path. Algorithm 15 shows our P-Path clone
algorithm. It traverses each BB in the P-Path from b_i to b_j, which contain the D-SAP and the
P-SAP, respectively. For each BB, it first checks whether the BB is an entry block to a new
method or not. If yes, it means that the path has an inter-procedural transition, and we hence
clone the new method and also update the corresponding invocation site in the preceding BB.
Otherwise, the BB goes through an intra-procedural cloning process. In the intra-procedural
phase, our algorithm checks whether the BB has multiple predecessors in the CFG or not. If
yes, it means that there are other paths different from the P-Path that pass through this BB. So
we clone this BB in the CFG and update the edge from the preceding BB to it correspondingly.
This procedure is repeated for every BB until all the BBs in the P-Path are processed. Finally,
the whole P-Path is cloned and all the BB transitions on the P-Path are correctly updated.
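Over an abstract CFG representation, the algorithm can be sketched as follows; BasicBlock, Method, and their operations are illustrative placeholders, and a full implementation would also redirect edges from previously cloned blocks rather than from the originals:

import java.util.List;

final class PPathCloner {

    interface Method { Method cloneAsPrivatized(); }

    interface BasicBlock {
        boolean isMethodEntry();
        Method enclosingMethod();
        int predecessorCount();
        BasicBlock cloneBlock();
        void retargetCall(Method newCallee);
        void redirectEdge(BasicBlock oldSucc, BasicBlock newSucc);
    }

    /** Clones the P-Path p = b_i ... b_j so that only this path is duplicated. */
    static void clonePPath(List<BasicBlock> p) {
        for (int k = 1; k < p.size(); k++) {
            BasicBlock bk = p.get(k);
            BasicBlock pred = p.get(k - 1);
            if (bk.isMethodEntry()) {
                // Inter-procedural transition: clone the callee and retarget
                // the invocation site in the preceding block.
                pred.retargetCall(bk.enclosingMethod().cloneAsPrivatized());
            } else if (bk.predecessorCount() > 1) {
                // Other paths flow through bk: clone it so they stay intact,
                // and reroute only the P-Path edge to the clone.
                pred.redirectEdge(bk, bk.cloneBlock());
            }
        }
    }
}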
FIGURE 8.11: Inter-procedural privatization. The P-Path spans methods m1 and m2, which contain the D-SAP (block b_i) and the P-SAP (block b_j), respectively. They are cloned to m1_privatize and m2_privatize (with D-SAP' in b'_i and P-SAP' in b'_j), and the call site m2(args2) on the cloned path becomes m2_privatize(args2).
Examples Figure 8.10 and Figure 8.11 illustrate the privatization of the intra-procedural and
inter-procedural cases, respectively. In the intra-procedural case, the P-Path is cloned and the D-
SAP and the P-SAP are updated to D-SAP’ and P-SAP’ respectively in the cloned P-Path, and all
the other paths remain the same. For the inter-procedural case, in addition to the intra-procedural
treatments, we also have to handle the method transitions. In the example, suppose the P-Path
spans the methods m1 and m2, inside which the D-SAP and the P-SAP are accessed, respectively.
In the privatized version, m1 and m2 are cloned to be m1_privatize and m2_privatize,
respectively, and their invocation sites in the paths are also updated correspondingly.
8.4.4 Privatization Correctness
An important property guaranteed by our approach is that, for any scheduler-oblivious program,
the privatization is safe: it does not introduce additional behavior beyond what could be exhib-
ited by the original program. In this section, we prove the following theorem:
Theorem 8.8. Our execution privatization is safe for all scheduler-oblivious programs.
Proof. The key requirement of a scheduler-oblivious program is that the program computation
is the same regardless of the underlying thread scheduling. Given the same input and the same
execution environment, even if the scheduling is different, it always returns the same output.
Since our privatization algorithm is tailored to the P-Path, which is a part (a segment) of an
observed correct execution, it is sufficient to prove the privatization correctness of the P-Path.
FIGURE 8.12: Architecture of Privateer. The instrumentor, built on Soot, transforms the program bytecode; the recorder logs execution traces while the instrumented program runs on the JVM; the analyzer computes the privatizable SAPs from the traces; and the privatizer produces the privatized program, which runs on the JVM.
Remember that in the P-Path, the D-SAP and P-SAP are two consecutive accesses to the same
shared data. Since our privatization only changes the P-SAP to read the same value as that
read or written by the D-SAP, and the P-Path does not contain a blocking statement, it satisfies
the conditions of privatization in Theorem 8.1. By the theorem of the privatizability property,
for any input, the privatized program is guaranteed to reach the same final state as that reached
by the original program. This proves the privatization correctness.
8.5 Implementation
We have implemented and evaluated Privateer for Java. Figure 8.12 shows the architecture. It
contains four main components: the instrumentor, the recorder, the analyzer, and the privatizer.
The instrumentor is a Soot bytecode transformation phase that prepares a program for use with
our execution privatization system. It instruments the shared variable accesses, blocking state-
ments, and the basic block entrances/exits, which are recorded for all threads in a global order
by the recorder at runtime. In Java, we consider Object.wait(), Thread.join(), Thread.yield(),
and the boundaries of synchronized blocks and methods as blocking statements. We chose Soot
as our instrumentation framework for its compatibility with the newest JDK 1.7 and its easy-to-
analyze intermediate representation (Jimple IR). However, our approach is general and should
apply beyond Java bytecode.
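For readers unfamiliar with Soot, such a phase is typically structured as a BodyTransformer; the following minimal sketch (not our actual instrumentor) shows the shape, here only identifying field accesses:

import java.util.Map;
import soot.Body;
import soot.BodyTransformer;
import soot.Unit;
import soot.jimple.Stmt;

public class SketchInstrumentor extends BodyTransformer {
    @Override
    protected void internalTransform(Body body, String phase, Map<String, String> options) {
        for (Unit u : body.getUnits()) {
            Stmt stmt = (Stmt) u;
            if (stmt.containsFieldRef()) {
                // A shared variable access: a real instrumentor would insert a
                // recorder call here, e.g., built via Jimple.v().newInvokeStmt(...).
            }
        }
    }
}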
The recorder is similar to existing systems that deterministically record executions [25, 48].
Our current recorder is implemented as a separate Java library invoked from the instrumented
program. When a program runs, the recorder saves the runtime traces into the database. Each
event in the trace is either a shared variable access, a blocking operation, or a basic block en-
trance/exit (BBI/BBO), containing the thread ID, the shared memory location at runtime or the
basic block ID, the access type (READ/WRITE/BLOCK/BBI/BBO), and the program location of
the event. The recorder does not record program input data, because our analysis does not need
this information.
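A trace event as just described can be sketched as follows; the field names are illustrative, not the recorder's actual schema:

final class TraceEvent {
    enum Type { READ, WRITE, BLOCK, BBI, BBO }

    final long threadId;    // thread performing the event
    final long locationId;  // runtime shared memory location, or basic block ID
    final Type type;        // access type
    final int programLoc;   // static program location of the event

    TraceEvent(long threadId, long locationId, Type type, int programLoc) {
        this.threadId = threadId;
        this.locationId = locationId;
        this.type = type;
        this.programLoc = programLoc;
    }
}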
The analyzer is a stand-alone program that reads the runtime traces from the database and com-
putes the P-SAPs for each program. To compute them, the analyzer first extracts a total order
of SAPs for each shared memory location and each thread from the execution trace. It then
extracts the P-SAPs using the ordered SAPs. To find the P-SAPs, the analyzer analyzes each
pair of two consecutive SAPs by the same thread for each shared data. If the latter SAP reads
the value written by the preceding SAP or they both read the value written by the same write,
then the latter SAP is a P-SAP, and the corresponding P-Path is reported.
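The core of this check can be sketched as a single scan over each per-thread, per-location access sequence; the Sap type and its fields are illustrative:

import java.util.ArrayList;
import java.util.List;

final class PSapDetector {

    static final class Sap {
        final long id;         // unique event ID
        final boolean isWrite;
        final long valueFrom;  // ID of the WRITE whose value this access observed
        Sap(long id, boolean isWrite, long valueFrom) {
            this.id = id; this.isWrite = isWrite; this.valueFrom = valueFrom;
        }
    }

    /** saps: the total order of one thread's accesses to one shared
     *  location, extracted from the trace. */
    static List<Sap> findPSaps(List<Sap> saps) {
        List<Sap> psaps = new ArrayList<Sap>();
        for (int i = 1; i < saps.size(); i++) {
            Sap prev = saps.get(i - 1), cur = saps.get(i);
            // Case 1: cur reads the value written by the preceding SAP.
            boolean readsPrevWrite = prev.isWrite && !cur.isWrite && cur.valueFrom == prev.id;
            // Case 2: both read the value written by the same write.
            boolean sameSourceReads = !prev.isWrite && !cur.isWrite
                    && cur.valueFrom == prev.valueFrom;
            if (readsPrevWrite || sameSourceReads) {
                psaps.add(cur);   // cur is privatizable (a P-SAP)
            }
        }
        return psaps;
    }
}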
The privatizer is the key component of our system. It is implemented as a whole program trans-
formation phase in Soot. Taking the P-SAPs and the program source (or the program bytecode
with the program location information) as the input, the privatizer privatizes the P-SAPs along
their associated P-Paths in the recorded executions. The core of privatization is to change the
P-SAP, which originally is a shared data read access, to a local access that instead reads the
value returned by its corresponding D-SAP. To ensure the privatization correctness, the priva-
tizer clones the P-Path and inserts it into the program according to Algorithm 15 and following
the rules in Section 8.4.3.1.
8.6 Experiments
Our evaluation aims at answering the following two research questions:
RQ1. Usefulness - What is the impact of privatization? How useful is it? How does it affect
program maintenance?
RQ2. Effectiveness - How much privatization opportunity is there in real world concurrent sys-
tems?
To evaluate usefulness, we use nine real concurrency bugs to assess the bug fixing capability of
the privatization, and three popular multithreaded benchmarks as well as a micro-benchmark to
understand the performance improvement brought by the privatization. To evaluate effective-
ness, we apply our system on five large complex real world concurrent server programs to see
how many privatizable accesses there are in these systems. We also report the program size
increase after privatization, which may affect program maintenance.
All experiments were conducted on two 8-core 3.00GHz Intel Xeon machines with 16GB mem-
ory, running Linux 2.6.22 and JDK 1.7.
TABLE 8.1: Results of real concurrency bug fixing by privatization

Bug ID        Application      Existing fix               Fix time (days)  Fixed by privatization?
StringBuffer  JDK 1.4.2        Documented thread unsafe   -                YES
Derby1573     Derby-10.2.1.6   privatization              365              YES
Derby2861     Derby-10.3.2.1   privatization              365              YES
Derby3260     Derby-10.3.1.4   synchronization            46               YES
Derby4018     Derby-10.4.2     synchronization            168              NO
Jetty-284     Jetty-6.1.2      synchronization            1                YES
Jetty-1269    Jetty-6.1.8      code structure change      33               YES
Jetty-425     Jetty-6.1.3      privatization              268              YES
Jetty-418     Jetty-5.x        synchronization            19               NO
8.6.1 Concurrency Bug Fixing
By isolating the potential erroneous preemptive interleavings, execution privatization has the
effect of fixing concurrency bugs. The salient feature of privatization is that, unlike the general
concurrency bug fixing techniques [56, 135] that often incur non-negligible program slowdown,
privatization does not result in any additional runtime overhead. Moreover, because privatiza-
tion does not introduce any extra synchronization into the program, it is completely free from
deadlock.
We have applied our system to nine real world crash bugs, one from the StringBuffer library in
JDK-1.4.2, four from Derby, and four from Jetty. Table 8.1 shows a summary of these bugs.
Most of these bugs are hard to fix. Some of them even lasted for as long as a year before they
were fixed, such as Derby #1573 and Derby #2861. Our experiments show that, among
the nine bugs, the privatization is able to fix seven of them (as shown in Column 5 of Table 8.1).
We conclude that privatization is applicable to fixing two classes of concurrency bugs: p(WRITE)-
r(WRITE)-c(READ) and p(READ)-r(WRITE)-c(READ), which belong to two of the five types of
atomicity violations [99]. In these two types of bugs, the c access is privatizable. For
scheduler-oblivious programs, privatization is expected to be the correct and most appropriate way
to fix these two types of bugs. For instance, three of the seven fixed bugs (Derby #1573,
Derby #2861 and Jetty #425) were indeed fixed by the developers using source code
level privatization.
A typical scenario where the privatization applies but may not fix the bug is illustrated in Figure
8.13. Both the bugs Derby #4018 and Jetty #418 that our privatization fails to fix belong
to this pattern. The two accesses to list should always return the same data, not only the list
reference, but also the whole list itself. Privatization makes the list reference private, but
not the whole content of the list. Hence, the list content can still be changed by other
for (int i = 0; i < list.size(); i++) {
    list.get(i);
}
// May throw IndexOutOfBoundsException
// if another thread modifies the
// content of the list

FIGURE 8.13: Privatization may not repair this bug
TABLE 8.2: Performance improvement by privatization

Program     Input             Time-original  Time-privatized
Microbench  100M / 8 threads  49.0s          40.2s (17.9%)
RayTracer   C / 100 threads   5.6s           4.9s (12.2%)
MonteCarlo  C / 100 threads   9.2s           8.8s (4.3%)
Moldyn      C / 100 threads   11.5s          10.7s (6.7%)
threads. To fix this bug, a synchronization mechanism is needed to protect the list content
from being modified.
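For this pattern, a fix needs to make the whole traversal atomic; the following is a hedged sketch of such a synchronization-based fix, assuming all writers also synchronize on list:

import java.util.List;

final class SafeTraversal {
    static void traverse(List<?> list) {
        synchronized (list) {                    // guards the whole traversal
            for (int i = 0; i < list.size(); i++) {
                list.get(i);                     // content can no longer change
            }                                    // between size() and get(i)
        }
    }
}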
8.6.2 Performance Improvement
An additional advantage of execution privatization is that, by privatizing the shared heap ac-
cesses into local stack accesses, it can help improve program performance. We first design
a micro-benchmark (Figure 8.2) to understand the range of this performance improvement ef-
fect. To further evaluate the performance impact, we also apply our technique to three popular
multithreaded benchmarks: RayTracer, MonteCarlo, and Moldyn. In all these
benchmarks, we start 100 threads with the input size C.
Table 8.2 shows the performance results. All data are averaged over 10 runs. With privatization,
all these subjects have nontrivial performance improvement. For our micro-benchmark, the
performance improvement is as large as 17.9%. For the other benchmarks, the performance
improvement ranges from 4.3% to 12.2%. In fact, all these benchmarks have a small number
of privatizable locations. The reason for the notable performance improvement is that these
privatizable locations are hot access points during the execution. Most of them are volatile
accesses that are frequently executed in loops. After privatization, they all become local accesses,
so program performance improves significantly. Figure 8.14 shows
such a typical case in the RayTracer benchmark. Programmers frequently access field array
variables directly; however, these variables are mostly read-only after initialization. Clearly,
writing code this way is easy, but it is not good practice for program performance.
volatile boolean[] IsDone;

public void DoBarrier(int myid) {
    boolean donevalue = !IsDone[myid];
    while (...) {
        for (...) {
            while (IsDone[...] != donevalue) {
                ...
            }
        }
    }
    IsDone[myid] = donevalue;
    while (IsDone[0] != donevalue) {
        ...
    }
}

FIGURE 8.14: Frequent shared array accesses in RayTracer
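To illustrate why such accesses benefit from privatization, consider the following hedged sketch (not the actual output of our privatizer): the reference stored in the volatile field IsDone is read-only after initialization, so consecutive reads of it are read-after-read P-SAPs and can be served from a local copy, avoiding a volatile field read on every loop iteration. Privateer applies such a transformation only along recorded P-Paths and subject to the progressiveness check.

class Barrier {
    volatile boolean[] IsDone;

    public void DoBarrierPrivatized(int myid) {
        boolean[] done = IsDone;          // a single shared (volatile) read
        boolean donevalue = !done[myid];
        // ... the loops then index the local reference `done` instead of
        // re-reading the volatile field IsDone on every iteration ...
        done[myid] = donevalue;
    }
}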
We also experimented with our approach on large real world systems. However, because the per-
formance effect of the privatizable accesses in these systems is relatively small (compared
with the other instructions), we did not observe a significant performance boost on them.
8.6.3 Pervasive Privatization Opportunities
To evaluate the effectiveness of execution privatization, we applied our system to a set of real
world applications, including five large complex server systems: Apache Derby, Tomcat, Jetty,
OpenJMS and Jigsaw. To maximize the usage of privatized executions, we first collect typical
good executions with different program inputs in the test suite under random schedules. For
each program, we collect the traces of 100 good runs with 10 different inputs and 10 random
schedules for each input.
Table 8.3 reports the privatization statistics. In these real world systems, we found a total of
5,119 privatizable accesses, which account for 23.6% of the total (21,733) shared data accesses
in them (accesses with the same program location are counted once). The overall percent-
age for each program ranges from 14.7% to 30.7%. The result clearly demonstrates that there
exist pervasive privatization opportunities in large, complex, real world concurrent systems. More
importantly, it strongly supports the effectiveness of applying execution privatization to
real world applications.
Through manual inspection of the large number of privatizable accesses, we have identified
several typical reasons for the pervasive privatization opportunities:
TABLE 8.3: Statistics of the privatization results

Application    LOC      #Shared accesses  #Privatizable accesses  #Intra-procedural  #Inter-procedural
Jetty-6.1.x    49,746   1,362             219 (16.1%)             175 (79.9%)        44 (20.1%)
OpenJMS-0.7.7  154,563  6,934             2,126 (30.7%)           1,997 (93.9%)      129 (6.1%)
Tomcat-6.0.33  339,405  8,543             1,260 (14.7%)           1,173 (93.1%)      87 (6.9%)
Jigsaw-2.2.6   381,348  1,699             510 (30.0%)             347 (68.0%)        163 (32.0%)
Derby-10.2-4   665,733  3,195             968 (30.3%)             840 (86.8%)        128 (13.2%)
TABLE 8.4: Bytecode size increase after privatization

Application    Size (bytes)  Size-privatized  Increase
Jetty-6.1.x    1,678,586     1,712,820        34,234 (2.03%)
OpenJMS-0.7.7  3,563,274     3,833,938        270,664 (7.60%)
Tomcat-6.0.33  7,434,520     7,791,321        356,801 (4.80%)
Jigsaw-2.2.6   8,665,258     8,900,182        234,924 (2.71%)
Derby-10.2-4   23,600,432    24,059,525       459,093 (1.95%)
Shared variable name reusing To access the same data at different program locations, a
common and convenient practice is to reuse the same identifier to access the data directly.
For example, the privatizable access at line 11 in the Derby bug
example (Figure 8.1) is manifested as a reuse of the identifier referencedColumnMap,
which is also used by the read access at line 4. In fact, all cases of privatizable accesses are
manifested by variable name reuse. Programmers tend to reason in a modularized way: they
frequently use the same variable to access the same shared data, without considering thread
interleavings.
Unexpected sharing Programmers are often unaware of concurrency when writing the code.
Since they do not expect sharing among multiple threads, they believe that in a sequential envi-
ronment the compiler would automatically help with the privatization. Unfortunately, in multi-
threaded circumstances, it is in general very hard for standard compilers to do such optimization
across threads. This often happens when sequential library code is used in a multithreaded
program in ways the library developer did not intend. For example, we found quite a few
privatizable accesses in the logging library log4j, which is used by both Tomcat and OpenJMS.
Complicated control flow and context Another typical reason we find through our study is
that privatizable accesses may span over complicated control flows or calling contexts, which is
difficult for programmers to reason about. For example, in the StringBuffer bug in Figure 8.4,
the two accesses to the shared data count at lines 11 and 15 span several method calls and
control branches. Facing the large number of calling contexts and control flows, it is usually
difficult for programmers to reason about privatizable accesses.
8.6.4 Program Maintenance
Despite the many benefits of privatization, a direct cost is that it may affect program mainte-
nance. As our technique uses basic block cloning to perform the privatization, it increases the
size of the program. Intuitively, our privatization might clone too much
when there is a long P-Path on the CFG between the D-SAP and the P-SAP. Nevertheless, this
problem seldom happens. In our case, a long P-Path means there is no intermediate access to the
same shared data on it, in which case we can easily promote the P-SAP to the same block
as the D-SAP without incurring any data or control flow change.
Table 8.4 reports the bytecode size increase by privatization in the real world large systems. The
overall size increase ranges from 1.95% in Derby to 7.60% in OpenJMS, which is relatively
small. In our studied systems, for most of the privatizable accesses, the D-SAP and the P-SAP
are within the same procedure (see Table 8.3, Columns 5-6) and their basic blocks are often
next to each other. For these cases, because we only need to clone the intra-procedural P-Path
rather than the entire method, the space increase is often much smaller than in the
inter-procedural cases.
On the other hand, since many field variable accesses become local ones through privatization,
the number of field variable accesses in the original program is reduced. We argue that our
technique is also good for program maintenance in some aspects. For example, when refactoring
a field name, there are fewer places to change in the program. To understand a program fault
related to a field reference, the size of the cause-effect chain to the privatized field accesses is
also reduced.
8.7 Discussions
Besides concurrency bug fixing, execution privatization has a wide range of applications in
concurrent program testing and debugging. We discuss a few of the applications in this section.
We also briefly discuss some caveats related to the application scope of the privatization.
8.7.1 Concurrent Program Testing and Debugging
Record/replay The record and replay technique [45, 48, 83] aims at fully reenacting an ear-
lier program execution. For concurrent programs, it is one of the most important techniques
for program understanding and debugging. In general, record/replay requires capturing and en-
forcing the thread interleavings at runtime, which often incurs significant program slowdown
that limits its applicability at the production site. With privatization, the portion of thread in-
terleavings on the privatized accesses no longer exists. Consequently, the overhead incurred by
capturing this portion of interleavings is completely eliminated, hence dramatically improving
the performance of record/replay.
Deterministic multithreading The key insight of deterministic multithreading (DMT) is that a
small set of schedules is often enough for good performance. By limiting the program to exer-
cise a small, well-tested set of schedules, DMT explores a good tradeoff between program
performance and reliability. To achieve this goal, existing techniques employ either static
type systems [11, 14] or runtime support [9, 10, 26]. With execution privatization, DMT tech-
niques can ignore the schedule enforcement on the privatized accesses. Ultimately, for the set
of executions that follow the same path as the privatized execution, the performance would also
be significantly improved.
Concurrency bug understanding Recent research [49, 55] has shown that concurrent program
execution traces often contain many thread context switches that perplex the bug reasoning pro-
cesses. A simplified trace with fewer context switches greatly reduces the debugging
effort by reducing the number of places in the trace where we need to look for the cause of the
bug. With execution privatization, future executions of the privatized program contain
fewer thread interleavings. The bug reasoning process based on the privatized execution trace
would also be simplified.
8.7.2 Privatization Scope
Although execution privatization brings many advantages, we note that its application scope is
also limited:
Bug repair An important note on concurrency bug repair is that privatization does not generalize
to all concurrency bugs but applies only to the two classes of atomicity violation bugs identified
in Section 8.6.1. As pointed out by Attiya et al. [5], expensive synchronizations
cannot be eliminated for the operations of read-after-write (RAW) to different shared variables
and atomic write-after-read (AWAR) to the same shared variable. For concurrency bugs such as
order violations that miss a happens-before relation across different threads, privatization is also
not applicable, and adding synchronization is necessary to bridge the happens-before depen-
dence. Hence, privatization should not be considered as a replacement for synchronization, but
is rather complementary to it. On the other hand, not all privatizable accesses are necessarily
related to concurrency bugs (though they often are).
Lock removal Another caveat is that privatization does not eliminate or change the original
synchronization operations in the program. Although it looks plausible that some lock/unlock
operations in the original program can be removed after the shared accesses inside them are priva-
tized, we note that doing so is in general dangerous, as it might change the program semantics.
Thread t1              Thread t2
1: lock l
2: fork t2
3: write x;
4: unlock l
                       5: lock l
                       6: unlock l
                       7: read x;

FIGURE 8.15: Removing locks could result in a semantic change. The lock/unlock operations at lines 5/6 cannot be removed, though there is no code to execute between them.
Take the program in Figure 8.15 as an example. The empty lock/unlock operation at lines 5/6
cannot be removed, because together with the fork operation at line 2 they form a happens-
before relation between thread t1 and thread t2. A data race on accessing the shared variable
x would occur if the empty lock/unlock operations were eliminated. Another reason is that, from
the perspective of the memory model, synchronizations have the effect of flushing the cache. Re-
moving synchronizations eliminates this effect, which breaks the semantics of programs that
rely on it to achieve certain behaviors.
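In Java terms, the example of Figure 8.15 corresponds roughly to the following sketch, in which the class and method names are illustrative:

class LockRemovalExample {
    static int x;
    static final Object l = new Object();

    void t1() {
        synchronized (l) {                  // line 1: lock l
            new Thread(this::t2).start();   // line 2: fork t2
            x = 1;                          // line 3: write x
        }                                   // line 4: unlock l
    }

    void t2() {
        synchronized (l) { }                // lines 5/6: "empty" lock/unlock;
                                            // t2 cannot pass until t1 releases l,
                                            // ordering the read after the write
        int r = x;                          // line 7: read x
    }
}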
8.8 Summary
We have presented a fundamental observation of the privatizability property that enables sound
privatization of scheduler-oblivious programs. We highlight our contributions as follows:
1. We present a fundamental observation of the privatizability property that enables privatizing
shared data accesses in scheduler-oblivious programs, which helps support their deterministic
execution without compromising parallelism.
2. We present a novel path and context sensitive execution privatization technique that safely
privatizes a program without introducing any extra program behavior.
3. We evaluate our technique on a set of large complex Java programs and the results show that
several real bugs are fixed without incurring any performance penalty, and notable performance
improvement is achieved on benchmarks.
Chapter 9
Conclusion and Future Work
This thesis makes contributions to concurrent program debugging along four directions: mul-
tiprocessor deterministic replay, predictive trace analysis, trace simplification, and data sharing
reduction.
Along the direction of multiprocessor deterministic replay, this thesis presents a new local-order
based recording approach that supports the deterministic replay but with much lower overhead
compared to previous approaches. We present the design and implementation of the first multi-
processor deterministic replay system, LEAP, for Java programs. By deterministically replaying
concurrent programs, LEAP substantially helps the debugging of concurrent programs by making
non-deterministic concurrency bugs reproducible. In addition, LEAP records much less information
compared to the classical global-order based approach. It is fast, portable, and determinis-
tic. LEAP is available in the public domain and has been used by several research institutions
worldwide.
Along the direction of predictive trace analysis, this thesis proposes the idea of persuasiveness in
the trace-based prediction of concurrency access anomalies. The introduction of persuasiveness
has two important contributions. First, it makes predictive trace analysis more useful for concur-
rency bug detection as it eliminates all the false warnings through runtime verification. Second,
it greatly improves debugging effectiveness as it provides a full execution history and context
information for the bug diagnosis. We also contribute the design and implementation of a fully
automatic persuasive predictive analysis tool, PECAN, for Java programs. PECAN is publicly
available and has revealed several serious bugs in large open source concurrent systems.
Our second main contribution in the direction of predictive trace analysis is the concept of re-
dundancy with respect to the detection of general concurrency access anomalies. We show a
trace redundancy theorem that specifies a redundancy criterion and the soundness guarantee for
reducing the size of the analyzed trace without impairing analysis results. This redundancy
theorem allows us to develop an efficient algorithm, TraceFilter, that automatically removes re-
dundant events from a trace for the predictive analysis of general concurrency access anomalies.
Empirical evidence on a set of popular concurrent benchmarks as well as large server applica-
tions shows that the scalability of PTA is improved by orders of magnitude. Our contribution
makes predictive trace analysis much more practical for real world concurrent systems.
Along the direction of trace simplification, this thesis contributes a redundancy criterion to char-
acterize redundant computations in the replay execution for reproducing bugs. This redundancy
criterion enables us to develop two effective techniques that remove the whole thread redun-
dancy and the partial redundancy, respectively, which significantly reduces the complexity of
the bug reproducing execution and shortens replay time. This thesis also contributes a theorem
of trace equivalence for the reduction of thread context switches in a reproducible buggy trace.
The theorem guides us to reason about the trace simplification problem completely offline. We
further contribute an efficient static algorithm, SimTrace, for trace simplification without any
dynamic re-execution to validate trace equivalence. SimTrace scales well to traces with more
than 1M events, making it attractive to practical use. We believe our contribution in the trace
simplification will greatly improve the effectiveness of concurrent program debugging based on
execution traces.
Finally, along the direction of data sharing reduction, we contribute a fundamental theorem
of privatizability of scheduler-oblivious programs, a vast category of concurrent programs that
always produce the same output given the same input. With the foundation of privatizability
property, we are able to reason about a subset of shared data accesses (i.e., read-after-write and
read-after-read) in scheduler-oblivious programs sequentially, benefiting many program debug-
ging and testing tasks. Moreover, those original shared accesses can be soundly privatized to
be local ones without changing program behavior. We further present the first known execution
privatization technique for scheduler-oblivious programs. Privatization brings two direct advan-
tages. First, scheduler-oblivious programs become more reliable because thread interleavings
on privatized accesses are eliminated. Second, performance improvement is achieved as the
original heap accesses become stack accesses after privatization.
Future Work In the multicore era, concurrent programs are destined to play a significant role
to fulfill the computational power promised by the hardware. With decades of practice, con-
current programs are becoming pervasive and much more complex. However, developing good
quality concurrent software remains highly challenging. In the future, we will focus on develop-
ing efficient and effective techniques for reducing the difficulty of concurrent programming and
improving the reliability of concurrent programs. We discuss two promising directions we are
currently working on.
Initially x == y == 0;

Thread T1                    Thread T2
 1: a = x                    10: b = y
 2: x = 1                    11: y = 2
 3: if (y > 0)               12: if (x > 0)
 4:   x = a + 1;             13:   x = b + 2;
 5:   y = a + 1;             14:   y = b + 2;
 6: else                     15: else
 7:   x = 0;                 16:   x = 1;
 8:   y = 0;                 17:   y = 1;
 9: assert(x == y);          18: assert(x == y);

Access vectors:
x: T1 T1 T2 T2 T1 T1
y: T2 T2 T1 T1 T2 T1

FIGURE 9.1: The program above crashes at line 9 following the interleaving 1-10-2-11-3-12-13-4-5-14-9. To reproduce the crash, LEAP [48] requires 12 synchronizations at runtime to record the thread access order information (the access vectors above) on the shared variables.
Relaxed Concurrency Bug Reproduction
Concurrency bug reproduction is critical but notoriously difficult due to nondeterminism. De-
terministic replay techniques faithfully capture and replay the shared memory dependencies
to enable the concurrency bug reproduction. However, for programs with heavy thread inter-
leavings and shared memory dependencies, large runtime overhead is still incurred due to the
cost of the required synchronizations on multicore processors. Relaxed concurrency
bug reproduction takes advantage of the observation that deterministic replay is a sufficient but
not necessary condition to reproduce the bug. For many concurrency bugs in practice, we do not
need a schedule faithful to the one that occurred in the bug exhibition run; a relaxed schedule is
often also able to reproduce the same bug.
We use a simple program in Figure 9.1 to illustrate the problem. There are two threads (T1
and T2) and two shared variables (x and y). Assuming a sequentially consistent memory model,
if the two threads execute concurrently following the interleaving represented by the line num-
bers 1-10-2-11-3-12-13-4-5-14-9, the assertion at line 9 will be violated. This bug is
difficult to reproduce because it may disappear nondeterministically following a different inter-
leaving. LEAP reproduces the bug by recording and enforcing the same thread access orders
local to the shared variables (shown in Figure 9.1). Because recording each access to x
or y requires one synchronization, and the failure schedule contains 12 accesses to x and y in
total, LEAP requires 12 synchronizations in the recording phase to reproduce this bug.
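To see where the 12 synchronizations come from, the local-order recording can be sketched as follows; this is a simplified rendering of the idea, not LEAP's actual implementation:

import java.util.ArrayList;
import java.util.List;

final class AccessVector {
    private final List<Long> order = new ArrayList<Long>();

    /** Called by the instrumented program at each access to the shared
     *  variable this vector belongs to; appending the thread ID must itself
     *  be synchronized, costing one synchronization per recorded access. */
    void record() {
        synchronized (this) {
            order.add(Thread.currentThread().getId());
        }
    }
}

// One AccessVector per shared variable; the failure run above performs
// 12 accesses to x and y in total, hence the 12 synchronizations.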
Now let us consider the simple program with the schedule 10-11-1-2-3-4-12-13-14-5-9
shown in Figure 9.2. Although this schedule is different from the original one, it is able to re-
produce the same bug. Moreover, it contains far fewer thread context switches
Initially x == y == 0;  (the same program as in Figure 9.1)

Schedule: 10->11->1->2->3->4->12->13->14->5->9

FIGURE 9.2: A schedule different from the original one, but able to reproduce the same bug. Moreover, this schedule has fewer (4) context switches than the original one (8).
compared to the original schedule, which makes it preferable for locating the cause of the bug.
We are investigating a new concurrency bug reproduction technique, CLAP, that does not record
any thread interleavings or checkpoint any program state at runtime, but rather computes
the schedule offline. All CLAP records at runtime is the thread execution path information.
No synchronization is needed. Since path profiling is lightweight (31% overhead [8]) regardless of the
thread interleavings, for real world programs with heavy shared memory dependencies, CLAP
is significantly more efficient than the state of the art shared memory dependency recorders.
Lightweight Deterministic Execution of Concurrent Programs
A major difficulty in concurrent programming is the non-determinism caused by either the
scheduling non-determinism or the timing issues on different cores. A promising solution to
this difficulty is to make concurrent programs less non-deterministic. In recent years, researchers
have pioneered the direction of deterministic multithreading [9, 10, 11, 14, 26] that aims to make
concurrent programs deterministic by default, by eliminating the sensitive thread interleavings
through runtime enforcement or type systems.
Deterministic multithreading has the nice property that given the same program input, every
execution always produces the same output. This property significantly alleviates the challenge
of writing and debugging concurrent programs. For example, bugs can be deterministically
reproduced. However, existing deterministic execution work faces a serious efficiency problem:
without special hardware, they incur substantial (up to 10x) runtime overhead. A lightweight
approach to deterministic execution is thus of significant importance.
We are aiming to improve the performance of existing techniques to make deterministic execu-
tion of concurrent programs more lightweight. Our main insight is that existing work suffers
from two drawbacks: (1) all thread communication points have to be serialized together, which
significantly reduces parallelism; (2) they require a user-specified execution time slice
(quantum) for each thread, and performance is sensitive to this specification. Quantum speci-
fication is currently purely heuristic, without a general optimization approach for all programs.
We can address these problems by allowing parallelism on different shared memory locations.
In addition, with static analysis, we can possibly optimize the quantum specification, enforcing
determinism without any synchronization specification.
Bibliography
[1] A. Mazurkiewicz. Trace theory. Advances in Petri Nets, 1987.
[2] Sarita V. Adve and Mark D. Hill. Weak ordering—a new definition. SIGARCH Comput.
Archit. News, 1990.
[3] Gautam Altekar and Ion Stoica. ODR: output deterministic replay for multicore debug-
ging. In SOSP, 2009.
[4] Cyrille Artho, Klaus Havelund, and Armin Biere. High-level data races. In NDDL/VVEIS,
2003.
[5] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael,
and Martin Vechev. Laws of order: expensive synchronization in concurrent algorithms
cannot be eliminated. In POPL, 2011.
[6] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Efficient system-enforced
deterministic parallelism. In OSDI, 2010.
[7] David F. Bacon, Robert E. Strom, and Ashis Tarafdar. Guava: a dialect of java without
data races. In OOPSLA, 2000.
[8] Thomas Ball and James R. Larus. Efficient path profiling. In MICRO, 1996.
[9] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. Coredet:
a compiler and runtime system for deterministic multithreaded execution. In ASPLOS,
2010.
[10] Tom Bergan, Nicholas Hunt, Luis Ceze, and Steven D. Gribble. Deterministic process
groups in dos. In OSDI, 2010.
[11] Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: safe multithreaded
programming for c/c++. In OOPSLA, 2009.
[12] Philip A. Bernstein and Nathan Goodman. Concurrency control in distributed database
systems. ACM Comput. Surv., 1981.
[13] Robert L. Bocchino, Jr., Vikram S. Adve, Sarita V. Adve, and Marc Snir. Parallel pro-
gramming must be deterministic by default. In HotPar, 2009.
[14] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann,
Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vak-
ilian. A type and effect system for deterministic parallel java. In OOPSLA, 2009.
[15] Eric Bodden and Klaus Havelund. Racer: effective race detection using aspectj. In ISSTA,
2008.
[16] Michael D. Bond, Graham Z. Baker, and Samuel Z. Guyer. Breadcrumbs: efficient con-
text sensitivity for dynamic bug detection analyses. In PLDI, 2010.
[17] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. Pacer: proportional
detection of data races. In PLDI, 2010.
[18] Sebastian Burckhardt, Rajeev Alur, and Milo M. K. Martin. Checkfence: checking con-
sistency of concurrent data types on relaxed memory models. In PLDI, 2007.
[19] Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, and Santosh Nagarakatte.
A randomized scheduler with probabilistic guarantees of finding bugs. In ASPLOS, 2010.
[20] Jacob Burnim and Koushik Sen. Asserting and checking determinism for multithreaded
programs. In ESEC/FSE, 2009.
[21] Feng Chen and Grigore Rosu. Parametric and sliced causality. In CAV, 2007.
[22] Feng Chen, Traian Florin Serbanuta, and Grigore Rosu. jpredictor: a predictive runtime
analysis tool for java. In ICSE, 2008.
[23] Jong-Deok Choi and Harini Srinivasan. Deterministic replay of java multithreaded appli-
cations. In SPDT, 1998.
[24] Jong-Deok Choi and Andreas Zeller. Isolating failure-inducing thread schedules. In
ISSTA, 2002.
[25] Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, and Junfeng Yang. Efficient
deterministic multithreading through schedule relaxation. In SOSP, 2011.
[26] Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin. Dmp: deterministic shared
memory multi-processing. In ASPLOS, 2009.
[27] Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, and Dan Grossman. Rcdc: a
relaxed consistency deterministic computer. In ASPLOS, 2011.
[28] George W. Dunlap, Dominic G. Lucchetti, Michael A. Fetterman, and Peter M. Chen.
Execution replay of multiprocessor virtual machines. In VEE, 2008.
[29] Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. Goldilocks: a race and transaction-aware
java runtime. In PLDI, 2007.
[30] Dawson Engler and Ken Ashcraft. Racerx: effective, static detection of race conditions
and deadlocks. In SOSP, 2003.
[31] Eitan Farchi, Yarden Nir, and Shmuel Ur. Concurrent bug patterns and how to test them.
IPDPS, 2003.
[32] Azadeh Farzan and P. Madhusudan. Causal atomicity. In CAV, 2006.
[33] Azadeh Farzan, P. Madhusudan, and Francesco Sorrentino. Meta-analysis for atomicity
violations under nested locking. In CAV, 2009.
[34] Cormac Flanagan and Stephen N Freund. Atomizer: a dynamic atomicity checker for
multithreaded programs. In POPL, 2004.
[35] Cormac Flanagan and Stephen N. Freund. Fasttrack: efficient and precise dynamic race
detection. In PLDI, 2009.
[36] Cormac Flanagan and Stephen N. Freund. Adversarial memory for detecting destructive
races. In PLDI, 2010.
[37] Cormac Flanagan, Stephen N. Freund, and Jaeheon Yi. Velodrome: a sound and complete
dynamic atomicity checker for multithreaded programs. In PLDI, 2008.
[38] Cormac Flanagan and Patrice Godefroid. Dynamic partial-order reduction for model
checking software. In POPL, 2005.
[39] A. Georges, M. Christiaens, M. Ronsse, and K. De Bosschere. Jarec: a portable record/re-
play environment for multi-threaded java applications. Software Practice and Experience,
2004.
[40] Dennis Giffhorn and Christian Hammer. Precise slicing of concurrent programs. Auto-
mated Software Engg., 2009.
[41] Alex Groce and Willem Visser. What went wrong: explaining counterexamples. In SPIN,
2003.
[42] Richard L. Halpert, Christopher J. F. Pickett, and Clark Verbrugge. Component-based
lock allocation. In PACT, 2007.
[43] Christian Hammer, Julian Dolby, Mandana Vaziri, and Frank Tip. Dynamic detection of
atomic-set-serializability violations. In ICSE, 2008.
[44] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for
lock-free data structures. In ISCA, 1993.
[45] Derek R. Hower and Mark D. Hill. Rerun: Exploiting episodes for lightweight memory
race recording. In ISCA, 2008.
[46] Jeff Huang. Lightweight concurrency crash reproduction without logging shared memory
dependencies and program states. In PLDI SRC, 2012.
[47] Jeff Huang, Peng Liu, and Charles Zhang. LEAP: A tool for lightweight deterministic
multi-processor replay of concurrent Java programs. In FSE Demo, 2010.
[48] Jeff Huang, Peng Liu, and Charles Zhang. LEAP: Lightweight deterministic multi-
processor replay of concurrent Java programs. In FSE, 2010.
[49] Jeff Huang and Charles Zhang. An efficient static trace simplification technique for de-
bugging concurrent programs. In SAS, 2011.
[50] Jeff Huang and Charles Zhang. PECAN: Persuasive Prediction of Concurrency Access
Anomalies. In ISSTA, 2011.
[51] Jeff Huang and Charles Zhang. Execution privatization of scheduler-oblivious concurrent
programs. In OOPSLA, 2012.
[52] Jeff Huang and Charles Zhang. Lean: Simplifying concurrency bug reproduction via
replay-supported execution reduction. In OOPSLA, 2012.
[53] Jeff Huang, Jinguo Zhou, and Charles Zhang. Scaling predictive analysis of concurrent
programs by removing trace redundancy. TOSEM, 22(1), 2012.
[54] Intel cilk plus language specification, 2010. http://software.intel.com/sites/products/cilk-
plus/cilk_plus_language_specification.pdf.
[55] Nicholas Jalbert and Koushik Sen. A trace simplification technique for effective debug-
ging of concurrent programs. In FSE, 2010.
[56] Guoliang Jin, Linhai Song, Wei Zhang, Shan Lu, and Ben Liblit. Automated atomicity-
violation fixing. In PLDI, 2011.
[57] Pallavi Joshi, Mayur Naik, Chang-Seo Park, and Koushik Sen. Calfuzzer: An extensible
active testing framework for concurrent programs. In CAV, 2009.
[58] Pallavi Joshi, Mayur Naik, Koushik Sen, and David Gay. An effective dynamic analysis
for detecting generalized deadlocks. In FSE, 2010.
[59] Pallavi Joshi, Chang-Seo Park, Koushik Sen, and Mayur Naik. A randomized dynamic
program analysis technique for detecting real deadlocks. In PLDI, 2009.
[60] Vineet Kahlon, Franjo Ivancic, and Aarti Gupta. Reasoning about threads communicating
via locks. In CAV, 2005.
[61] Richard M. Karp and Raymond E. Miller. Properties of a model for parallel computations:
Determinacy, termination, queueing. In SIAM Journal on Applied Mathematics, 1966.
[62] Nicholas Kidd, Thomas Reps, Julian Dolby, and Mandana Vaziri. Finding concurrency-
related bugs using random isolation. In VMCAI, 2009.
[63] Jens Krinke. Context-sensitive slicing of concurrent programs. In ESEC/FSE, 2003.
[64] Zhifeng Lai, S. C. Cheung, and W. K. Chan. Detecting atomic-set serializability viola-
tions in multithreaded programs through active randomized testing. In ICSE, 2010.
[65] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Trans. Comput., 1979.
[66] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. CACM,
1978.
[67] Doug Lea. The java.util.concurrent synchronizer framework. Sci. Comput. Program., 58,
2005.
[68] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant
replay. IEEE Transactions on Computers, 1987.
[69] Dongyoon Lee, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Chimera: hybrid
program analysis for determinism. In PLDI, 2012.
[70] Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Pe-
ter M. Chen, and Jason Flinn. Respec: efficient online multiprocessor replay via specula-
tion and external determinism. In ASPLOS, 2010.
[71] Joanne Lim. An engineering disaster: Therac-25. http://en.wikipedia.org/wiki/Therac-25,
1998.
[72] Richard J. Lipton. Reduction: a method of proving properties of parallel programs.
CACM, 1975.
[73] Shan Lu, Soyeon Park, Chongfeng Hu, Xiao Ma, Weihang Jiang, Zhenmin Li, Raluca A.
Popa, and Yuanyuan Zhou. Muvi: automatically inferring multi-variable access correla-
tions and detecting related semantic and concurrency bugs. In SOSP, 2007.
[74] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a
comprehensive study on real world concurrency bug characteristics. ASPLOS, 2008.
[75] Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. Avio: detecting atomicity viola-
tions via access interleaving invariants. In ASPLOS, 2006.
[76] Brandon Lucia, Joseph Devietti, Karin Strauss, and Luis Ceze. Atom-aid: Detecting and
surviving atomicity violations. In ISCA, 2008.
[77] Jeremy Manson, William Pugh, and Sarita V. Adve. The java memory model. In POPL,
2005.
[78] Dan Marino, Abhayendra Singh, Todd Millstein, Madan Musuvathi, and Satish
Narayanasamy. Drfx: A simple and efficient memory model for concurrent program-
ming languages. In PLDI, 2009.
[79] Daniel Marino, Madanlal Musuvathi, and Satish Narayanasamy. Literace: effective sam-
pling for lightweight data-race detection. In PLDI, 2009.
[80] Nicholas D. Matsakis and Thomas R. Gross. A time-aware type system for data-race
protection and guaranteed initialization. In OOPSLA, 2010.
[81] F. Mattern. Virtual time and global states of distributed systems. Workshop on Parallel
and Distributed Algorithms, 1988.
[82] Ghassan Misherghi and Zhendong Su. Hdd: hierarchical delta debugging. In ICSE, 2006.
[83] Pablo Montesinos, Luis Ceze, and Josep Torrellas. Delorean: Recording and determinis-
tically replaying shared-memory multi-processor execution efficiently. In ISCA, 2008.
[84] Pablo Montesinos, Matthew Hicks, Samuel T. King, and Josep Torrellas. Capo: a
software-hardware interface for practical deterministic multi-processor replay. In AS-
PLOS, 2009.
[85] Madan Musuvathi and Shaz Qadeer. Chess: systematic stress testing of concurrent soft-
ware. In Proceedings of the 16th international conference on Logic-based program syn-
thesis and transformation, 2007.
[86] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gerard Basler, Piramanayagam A.
Nainar, and Iulian Neamtiu. Finding and reproducing heisenbugs in concurrent programs.
In OSDI, 2008.
[87] Santosh Nagarakatte, Sebastian Burckhardt, Milo M.K. Martin, and Madanlal Musuvathi.
Multicore acceleration of priority-based schedulers for concurrency bug detection. In
PLDI, 2012.
[88] Mayur Naik and Alex Aiken. Conditional must not aliasing for static race detection. In
POPL, 2007.
[89] Mayur Naik, Alex Aiken, and John Whaley. Effective static race detection for java. In
PLDI, 2006.
[90] Mangala Gowri Nanda and S. Ramesh. Interprocedural slicing of multithreaded programs
with applications to java. ACM Trans. Program. Lang. Syst., 2006.
[91] Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder. Au-
tomatic logging of operating system effects to guide application-level architecture simu-
lation. In SIGMETRICS, 2006.
[92] Satish Narayanasamy, Gilles Pokam, and Brad Calder. Bugnet: Continuously recording
program execution for deterministic replay debugging. In ISCA, 2005.
[93] Satish Narayanasamy, Zhenghao Wang, Jordan Tigani, Andrew Edwards, and Brad
Calder. Automatically classifying benign and harmful data races using replay analy-
sis. In PLDI, 2007.
[94] R. H. B. Netzer and B. P. Miller. Improving the accuracy of data race detection. In
PPOPP, 1991.
[95] R. H. B. Netzer and B. P. Miller. What are race conditions: Some issues and formaliza-
tions. LOPLAS, 1992.
[96] Robert O’Callahan and Jong-Deok Choi. Hybrid dynamic data race detection. In PPoPP,
2003.
[97] Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: efficient deterministic
multithreading in software. In ASPLOS, 2009.
[98] Chang-Seo Park and Koushik Sen. Randomized active atomicity violation detection in
concurrent programs. In FSE, 2008.
[99] Soyeon Park, Shan Lu, and Yuanyuan Zhou. Ctrigger: exposing atomicity violation bugs
from their hiding places. In ASPLOS, 2009.
[100] Soyeon Park, Yuanyuan Zhou, Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu H. Lee,
and Shan Lu. PRES: probabilistic replay with execution sketching on multi-processors.
In SOSP, 2009.
[101] Suzette Person, Matthew B. Dwyer, Sebastian Elbaum, and Corina S. Pasareanu. Differ-
ential symbolic execution. In FSE, 2008.
[102] Eli Pozniansky and Assaf Schuster. Efficient on-the-fly data race detection in multi-
threaded c++ programs. In PPoPP, 2003.
[103] Sriram Rajamani, G. Ramalingam, Venkatesh Prasad Ranganath, and Kapil Vaswani. Iso-
lator: dynamically ensuring isolation in concurrent programs. In ASPLOS, 2009.
[104] Venkatesh Prasad Ranganath and John Hatcliff. Slicing concurrent java programs using
indus and kaveri. Int. J. Softw. Tools Technol. Transf., 2007.
[105] Paruj Ratanaworabhan, Martin Burtscher, Darko Kirovski, Benjamin Zorn, Rahul Nagpal,
and Karthik Pattabiraman. Detecting and tolerating asymmetric races. In PPoPP, 2009.
[106] Michiel Ronsse and Koen De Bosschere. Recplay: a fully integrated practical record/re-
play system. TOCS, 1999.
[107] Michiel Ronsse, Koen De Bosschere, Mark Christiaens, Jacques Chassin de Kergom-
meaux, and Dieter Kranzlmuller. Record/replay for nondeterministic program executions.
CACM, 2003.
[108] Mark Russinovich and Bryce Cogswell. Replay for concurrent non-deterministic shared-
memory applications. In PLDI, 1996.
[109] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Ander-
son. Eraser: A dynamic data race detector for multi-threaded programs. TOCS, 1997.
[110] Koushik Sen. Race directed random testing of concurrent programs. In PLDI, 2008.
[111] Koushik Sen and Gul Agha. Detecting errors in multithreaded programs by generalized
predictive analysis of executions. In FMOODS, 2005.
[112] Traian Florin Serbanuta, Feng Chen, and Grigore Rosu. Maximal causal models for
sequentially consistent multithreaded systems. Technical report, University of Illinois,
2010.
[113] Ohad Shacham, Mooly Sagiv, and Assaf Schuster. Scaling model checking of data races
using dynamic information. In PPoPP, 2005.
[114] Nir Shavit and Dan Touitou. Software transactional memory. In PODC, 1995.
[115] Y. Shi, S. Park, Z. Yin, S. Lu, Y. Zhou, W. Chen, and W. Zheng. Do i use the wrong
definition?: Defuse: definition-use invariants for detecting concurrency and sequential
bugs. In OOPSLA, 2010.
[116] Nishant Sinha and Chao Wang. Staged concurrent program analysis. In FSE, 2010.
[117] A. Prasad Sistla and Patrice Godefroid. Symmetry and reduced symmetry in model
checking. ACM Trans. Program. Lang. Syst., 26(4), July 2004.
[118] Francesco Sorrentino, Azadeh Farzan, and P. Madhusudan. Penelope: Weaving threads
to expose atomicity violations. In FSE, 2010.
[119] John Steven, Pravir Ch, Bob Fleck, and Andy Podgurski. jrapture: A capture/replay tool
for observation-based testing. In ISSTA, 2000.
[120] William N. Sumner, Yunhui Zheng, Dasarath Weeratunge, and Xiangyu Zhang. Precise
calling context encoding. In ICSE, 2010.
[121] Sriraman Tallam, Chen Tian, and Rajiv Gupta. Dynamic slicing of multithreaded pro-
grams for race detection. In ICSM, pages 97–106, 2008.
[122] Sriraman Tallam, Chen Tian, Rajiv Gupta, and Xiangyu Zhang. Enabling tracing of long-
running multithreaded programs via dynamic execution reduction. In ISSTA, 2007.
[123] Chen Tian, Vijay Nagarajan, Rajiv Gupta, and Sriraman Tallam. Dynamic recognition of
synchronization operations for improved data race detection. In ISSTA, 2008.
[124] Software Bug Contributed to Blackout. Securityfocus.
http://www.securityfocus.com/news/8016, 2004.
[125] Mandana Vaziri, Frank Tip, and Julian Dolby. Associating synchronization constraints
with data in an object-oriented language. In POPL, 2006.
[126] Kaushik Veeraraghavan, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Detect-
ing and surviving data races using complementary schedules. In SOSP, 2011.
[127] Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M.
Chen, Jason Flinn, and Satish Narayanasamy. Doubleplay: parallelizing sequential log-
ging and replay. In ASPLOS, 2011.
[128] Kahlon Vineet and Chao Wang. Universal causality graphs: A precise happens-before
model for detecting bugs in concurrent programs. In CAV, 2010.
[129] Willem Visser, Corina S. Pasareanu, and Sarfraz Khurshid. Test input generation with
java pathfinder. In ISSTA, 2004.
[130] Chao Wang, Sudipta Kundu, Malay K. Ganai, and Aarti Gupta. Symbolic predictive
analysis for concurrent programs. In FM, 2009.
[131] Chao Wang, Rhishikesh Limaye, Malay K. Ganai, and Aarti Gupta. Trace-based symbolic
analysis for atomicity violations. In TACAS, 2010.
[132] Haixun Wang, Hao He, Jun Yang, Philip S. Yu, and Jeffrey Xu Yu. Dual labeling:
Answering graph reachability queries in constant time. In ICDE, 2006.
[133] Liqiang Wang and Scott D. Stoller. Accurate and efficient runtime detection of atomicity
errors in concurrent programs. In PPoPP, 2006.
[134] Liqiang Wang and Scott D. Stoller. Runtime analysis of atomicity for multithreaded
programs. TSE, 2006.
[135] Dasarath Weeratunge, Xiangyu Zhang, and Suresh Jaganathan. Accentuating the positive:
Atomicity inference and enforcement using correct executions. In OOPSLA, 2011.
[136] Dasarath Weeratunge, Xiangyu Zhang, and Suresh Jagannathan. Analyzing multicore
dumps to facilitate concurrency bug reproduction. In ASPLOS, 2010.
[137] Bin Xin, William N. Sumner, and Xiangyu Zhang. Efficient program execution indexing.
In PLDI, 2008.
[138] Min Xu, Rastislav Bodik, and Mark D. Hill. A ”flight data recorder” for enabling full-
system multiprocessor deterministic replay. In ISCA, 2003.
[139] Min Xu, Rastislav Bodík, and Mark D. Hill. A serializability violation detector for shared-
memory server programs. In PLDI, 2005.
[140] Zuoning Yin, Ding Yuan, Yuanyuan Zhou, Shankar Pasupathy, and Lakshmi Bairavasun-
daram. How do fixes become bugs? In ESEC/FSE, 2011.
[141] Jie Yu and Satish Narayanasamy. A case for an interleaving constrained shared-memory
multi-processor. In ISCA, 2009.
[142] Jie Yu and Satish Narayanasamy. Tolerating concurrency bugs using transactions as life-
guards. In MICRO, 2010.
[143] Cristian Zamfir and George Candea. Execution synthesis: a technique for automated
software debugging. In EuroSys, 2010.
[144] Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input.
TSE, 2002.
[145] Charles Zhang. Flexsync: An aspect-oriented approach to java synchronization. In ICSE,
2009.
[146] Charles Zhang and Hans-Arno Jacobsen. Externalizing java server concurrency with cal.
In ECOOP, 2008.