THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY
Effective Methods for Debugging Concurrent Software
by
Shaoming HUANG
A Thesis Submitted to The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
in the Department of Computer Science and Engineering
April 2013, Hong Kong
Authorization
I hereby declare that I am the sole author of the thesis.
I authorize the Hong Kong University of Science and Technology to lend this thesis to
other institutions or individuals for the purpose of scholarly research.
I further authorize the Hong Kong University of Science and Technology to reproduce
the thesis by photocopying or by other means, in total or in part, at the request of other
institutions or individuals for the purpose of scholarly research.
Shaoming HUANG
April 2013
EFFECTIVE METHODS FOR DEBUGGING CONCURRENT SOFTWARE
by
SHAOMING HUANG
This is to certify that I have examined the above PhD thesis
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by the thesis examination committee have been made.
Dr. Charles ZHANG (Thesis Supervisor)
Prof. Mounir HAMDI (Department Head)
Department of Computer Science and Engineering
April 2013
To My Beloved Parents and My Dearest Wife Kami
Acknowledgements
First and foremost, my thanks go to my advisor, Charles Zhang, who has spent a tremendous
amount of time and energy forming me into both a confident researcher and a nice person.
His exemplary guidance, his far reaching vision, and his unwavering optimism and patience
have been a constant source of encouragement that helped me explore and develop ideas and
overcome incredible challenges throughout my PhD. From him, I received the largest possible
freedom and unconditional support one can imagine during a four-and-a-half-year graduate
education. I can never repay Charles for what he has given to me. The best way for me to express
my gratitude towards him is to try to become what he has been to me: teacher, mentor, guide,
collaborator, and friend.
I am also very grateful to every member in my defense, proposal, and qualifying examination
committee: S.C. Cheung, Sung Kim, Lin Gu, Jiang Xu, Tom Ball, and Xueqing Zhang. S.C. has
always been so nice and willing to help all the way throughout my graduate study. Leading our big
software engineering group, his extraordinary enthusiasm and knowledge and his creation of a
passionate group culture have kept me constantly optimistic and inspired. I am deeply indebted
to Sung for his priceless guidance and innumerable suggestions in the various stages
of my research. I will also never forget his kindness and encouragement on all the other aspects
during my study at HKUST. Lin has been very supportive to me ever since our first meeting in a
group discussion and has provided invaluable advice on debugging concurrent and distributed
systems. His course on cloud computing systems is of particular interest to me and from which
I learned quite a lot on system programming and system research. I would also like to thank
Tom, Jiang, and Xueqing for serving on my thesis committee. I am grateful to Tom in particular
for his thorough review and constructive comments on my thesis.
I need to thank all the other members in the Prism group: Liu Peng, Xiao Xiao, Jinguo Zhou,
Xiang Gao, Wei Li, Yiqing Zhu, Jin Huang, and Meng Wang. I feel really lucky to have grown in such
a smart and energetic group, and I am grateful to have them as friends and colleagues. Most of
my research projects would not have been possible without the critical discussions with them.
Thanks also go to all the other members in our software engineering group, especially Zhifeng
Lai, Chang Xu, Xinming Wang, Yueqi Li, Qiaona Hong, Ning Chen, Yepang Liu, Yida Tao,
Wenmao Gong, Dongxiang Cai, Jaechang Nam, Rae Noh, Donggyun Han and Hyunmin Seo.
They have made the whole group feel like a warm family to me. I especially want to thank Zhifeng,
who always gives me helpful suggestions and keeps me positive. Discussions with him have also
greatly improved my understanding of concurrent program analysis and testing.
A big, big thank you goes to Can Yang, who is like a brother to me. I will always cherish his
words, “open mind” and “follow your heart”, and will never forget the happy time we
spent together. Equally, I would like to thank Chao Yang, Xiaowei Zhou, Tiangzhu Liang, Suijie
Wang, Tao Lu, Xiang Wan, Lingsing Yung, Wei Chen, Guangyuan Yang, and Wei Jiang, for
sharing with me unforgettable time in the past few years. My gratitude extends to my friends
at HKUST: Zhewei Wei, Lixing Wang, Tengfei Liu, Ang Li, Xiaoheng Xie, Yu Peng, Yincheng
Lin, Xiaofei Zhang, Xiangming Fang, Zhiqiang Ma, Dong Lin, Yu Zhang, Ning Ding, Haodi
Zhang, Li Li, and Shanchao Zhang.
Finally, I would like to thank my beloved parents, my brother Qiming, and my dearest wife
Kami. I thank God for bringing them to me. This work wouldn’t have been possible without
their amazing support, tolerance, understanding, and most importantly, love.
Contents

Authorization Page
Signature Page
Acknowledgements
Contents
List of Figures
List of Tables
Abstract
Abbreviations
1 Introduction
  1.1 Motivation
      Concurrency bugs are difficult to reproduce
      Concurrency bugs are difficult to detect
      Concurrency bugs are difficult to understand
      Concurrency bugs are difficult to fix
  1.2 Contributions
      1.2.1 Multiprocessor Deterministic Replay
      1.2.2 Predictive Trace Analysis
      1.2.3 Dynamic and Static Trace Simplification
      1.2.4 Data Sharing Reduction
  1.3 Outline

2 Background and Previous Work
  2.1 Concurrent Program Execution Modeling
  2.2 Basic Definitions
  2.3 Thread Interleaving Patterns for Concurrency Bugs
      Data race
      Atomicity Violations
      Atomic-set serializability violations
  2.4 Tackling Concurrency Problems
      2.4.1 Concurrency Bug Reproduction
            2.4.1.1 Deterministic Replay
            2.4.1.2 Offline Search and Deterministic Multithreading
      2.4.2 Concurrency Bug Detection
            2.4.2.1 Static and Dynamic Program Analyses
                    Active Testing
            2.4.2.2 Trace-based concurrent program analysis
      2.4.3 Surviving Concurrency Bugs

3 Multiprocessor Deterministic Replay
  3.1 Introduction
  3.2 LEAP: Local-Order Based Deterministic Replay
      3.2.1 LEAP Overview
      3.2.2 Locating Shared Variable Accesses
      3.2.3 Field-based Shared Variable Identification
      3.2.4 Unique Thread Identification
      3.2.5 Handling Early Replay Termination
  3.3 A Theorem of Local Ordering
  3.4 LEAP Implementation
      3.4.1 The LEAP Transformer
      3.4.2 The LEAP Recorder
      3.4.3 The LEAP Replayer
  3.5 Evaluation
      3.5.1 Evaluation methodology
            3.5.1.1 Micro-benchmarking
            3.5.1.2 Benchmarking with third-party systems
            3.5.1.3 Concurrency bug reproduction
            3.5.1.4 Random bug injection
            3.5.1.5 Real and benchmark concurrency bugs
      3.5.2 Discussion
  3.6 Summary

4 Persuasive Prediction of Concurrency Access Anomalies
  4.1 Introduction
  4.2 PECAN in a Nutshell
  4.3 Pattern Specification of Access Anomalies
  4.4 Graph Prediction Model
      4.4.1 Constraint Model
      4.4.2 The AA Prediction Problem
  4.5 Graph Pattern Search
      4.5.1 Compact Encoding of PTG
      4.5.2 Pattern-Directed Search
  4.6 Schedule Generation
      4.6.1 How to Generate a Feasible Schedule?
      4.6.2 What Can Our Algorithm Guarantee?
      4.6.3 Pruning False Warnings
  4.7 Evaluation
      4.7.1 Experimental Results
      4.7.2 Detected Real Bugs
      4.7.3 PECAN Limitations
  4.8 Summary

5 Scaling Predictive Trace Analysis by Removing Redundant Events
  5.1 Introduction
  5.2 General PTA algorithm
      Example
  5.3 Removing Trace Redundancy
      5.3.1 Modeling trace redundancy
            5.3.1.1 A theory of trace redundancy
            5.3.1.2 Concurrency context
            5.3.1.3 Two dimensions of redundancy
      5.3.2 Filtering redundant events
  5.4 Implementation
  5.5 Evaluation
      Benchmarks
      5.5.1 RQ1: Effectiveness
      5.5.2 RQ2: Efficiency
      5.5.3 RQ3: Correctness
  5.6 Summary

6 Dynamically Simplifying Concurrency Bug Reproduction
  6.1 Introduction
      Key Observation
      Contributions
  6.2 A Model of Trace Redundancy
  6.3 Automatic Redundance Removing
      6.3.1 Removing Whole-Thread Redundancy
      6.3.2 Removing Partial-Thread Redundancy
            6.3.2.1 Multithreaded dynamic slicing
            6.3.2.2 Repetition analysis
  6.4 Implementation
  6.5 A Case Study
      6.5.1 Description of Derby Bug #2861
      6.5.2 How LEAN Simplifies the Bug Reproduction
  6.6 Experiments
      Benchmarks
      6.6.1 RQ1: Effectiveness
      6.6.2 RQ2: Efficiency
      Summary
  6.7 Summary

7 Static Trace Simplification
  7.1 Introduction
  7.2 SimTrace: Efficient Static Trace Simplification
      7.2.1 General Trace Simplification Problem
      7.2.2 A Theorem of Trace Equivalence
      7.2.3 SimTrace Algorithm
            Dependence Graph Construction
            Simplifying Dependence Graph
  7.3 Implementation and Experiments
  7.4 Summary

8 Execution Privatization for Scheduler-Oblivious Concurrent Programs
  8.1 Introduction
  8.2 A Theorem of Privatizability for Scheduler-Oblivious Programs
  8.3 Overview
      8.3.1 Motivating Examples
      8.3.2 Privatization Challenges
  8.4 Execution Privatization
      8.4.1 Preliminaries
      8.4.2 Dynamic Trace Analysis
      8.4.3 Path and Context Sensitive Privatization
            8.4.3.1 Privatization Rules
            8.4.3.2 Path and Context Sensitive P-Path Clone
      8.4.4 Privatization Correctness
  8.5 Implementation
  8.6 Experiments
      8.6.1 Concurrency Bug Fixing
      8.6.2 Performance Improvement
      8.6.3 Pervasive Privatization Opportunities
      8.6.4 Program Maintenance
  8.7 Discussions
      8.7.1 Concurrent Program Testing and Debugging
      8.7.2 Privatization Scope
  8.8 Summary

9 Conclusion and Future Work
      Future Work

Bibliography
List of Figures

1.1 The same program exhibits different behaviors with different thread interleavings. The error manifests with the interleaving A (left) but not the interleaving B (right).
1.2 Overview of the work in this thesis for concurrent program debugging
2.1 Atomic-set serializability violation patterns [125]. Wu(l) and Ru(l) represent a write and a read, respectively, to a memory location l of a unit of work u. l1 and l2 belong to the same atomic set.
3.1 The instrumentation of SPE accesses
3.2 The overview of the LEAP infrastructure
3.3 The runtime characteristics of LEAP and other techniques on our microbenchmark with the number of SPEs ranging from 1 to 500. The microbenchmark starts 10 threads running on 8 processors.
3.4 The runtime characteristics of LEAP and other techniques on our microbenchmark with the number of threads ranging from 1 to 80 running on 8 processors. The number of SPEs is set to 1000.
4.1 General access anomaly patterns
4.2 Example of searching atomicity violations
4.3 An example of schedule generation
4.4 An example illustrating the difficulty of satisfying the lock constraint for schedule generation. The race pair (v3, v8) is a false warning, though it satisfies both the POR and the lockset condition.
4.5 A destructive race in OpenJMS
4.6 A predicted real bug in Jigsaw
5.1 Example code for illustrating the trace redundancy
5.2 Statements (10, 7, 10) form a real atomicity violation. However, the simple strategy of “dropping all re-references by the same thread to the same variable if there are no synchronization operations between them” would drop the second read of T2 at line 10, which causes PTA to miss this atomicity violation.
5.3 A trace corresponding to a serial execution of the example program in Figure 5.1.
5.4 Trie representation of local (left) and global (right) redundancy
6.1 A typical test case for stress testing an account function. A significant amount of computation in a buggy execution of this program may be redundant.
6.2 An example of a dynamic thread hierarchy graph (TH-Tree). When T1,3 are selected, all T1,3 and their descendants (gray color) are disabled.
6.3 The delta-debugging algorithm. The function validate returns true if the two conditions in the redundancy criterion are both satisfied. For conciseness, the input trace is omitted in the ddmin algorithm.
6.4 Some iterations of the code block demarcated by @rcb-begin and @rcb-end are specified as potentially redundant.
6.5 An overview of LEAN
6.6 A real concurrency bug #2861 in Derby. The thread interleaving following the solid arrow on the shared data referencedColumnMap crashed the program with NullPointerException.
6.7 A real-world test driver for triggering the concurrency bug in Figure 6.6. The statements inserted by LEAN to simplify the execution are shown in the gray areas.
6.8 Illustration of delta-debugging for removing the whole-thread redundancy. Ti denotes the ith test thread created by the main thread T0. After four rounds of simplification, threads T(2,3) remain and all the other threads are removed.
6.9 Illustration of delta-debugging for removing the redundant repetitions for the remaining threads T(2,3). Iij denotes the jth iteration of thread Ti, where i=2,3 and j=1,2,...,10. After ten rounds of simplification, the 7th iteration of T2 and the 4th iteration of T3 remain and all the other iterations are removed.
7.1 A greedy merge may produce a non-optimal result in (a). Unfortunately, the problem of producing the optimal result in (b) is NP-hard.
8.1 Top: a real bug #2861 in Apache Derby. The program crashes with NullPointerException when a thread references the shared data structure referencedColumnMap at line 11 after another thread sets it to null in the method setReferencedColumnMap. Bottom: the getObjectName method after privatization.
8.2 The benchmark contains 8 threads simultaneously decreasing the shared variable num. The privatized version (right) is 17.9% faster than the original version (left).
8.3 Privatization must be path-sensitive
8.4 An atomicity violation in the append method of the java.lang.StringBuffer class. The program throws StringIndexOutOfBoundsException when a thread at line 11 references the stale length of sb changed by another thread at line 8.
8.5 Privatization must preserve progressiveness
8.6 D-SAP and P-SAP are path-sensitive
8.7 Conceptual view of execution privatization. The privatization is tailored to the P-Path.
8.8 Privatization rules of D-SAP and P-SAP
8.9 The P-SAP and the D-SAP are at the same program location (line 3). Nevertheless, because their calling contexts are different (line 1 and line 2, respectively), they are still privatizable.
8.10 Intra-procedural privatization
8.11 Inter-procedural privatization
8.12 Architecture of Privateer
8.13 Privatization may not repair this bug
8.14 Frequent shared array accesses in RayTracer
8.15 The lock/unlock operations at lines 5/6 cannot be removed, though there is no code to execute between them.
9.1 The program crashes at line 9 following the interleaving 1-10-2-11-3-12-13-4-5-14-9. To reproduce the crash, LEAP [48] requires 12 synchronizations at runtime to record the thread access order information (right) on the shared variables.
9.2 A schedule different from the original one, but able to reproduce the bug. Moreover, this schedule has fewer (4) context switches than the original one (8).
List of Tables

3.1 The runtime overhead of LEAP and the state-of-the-art techniques.
3.2 LEAP - summary of the evaluated real bugs
3.3 LEAP - summary of the evaluated benchmark bugs
4.1 PECAN experimental results
5.1 TraceFilter experimental results - RQ1: Effectiveness
5.2 TraceFilter experimental results - RQ2: Efficiency
5.3 TraceFilter experimental results - RQ3: Correctness
6.1 LEAN evaluation benchmarks
6.2 LEAN experimental results - RQ1: Effectiveness
6.3 LEAN - decomposed effectiveness on trace size reduction
6.4 Comparison between LEAN and ER
6.5 LEAN experimental results - RQ2: Efficiency
7.1 SimTrace experimental results. Data are averaged over 50 runs for each subject.
8.1 Results of real concurrency bug fixing by privatization
8.2 Performance improvement by privatization
8.3 Statistics of the privatization results
8.4 Bytecode size increase after privatization
Effective Methods for Debugging Concurrent Software
by Shaoming HUANG
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Abstract
Multicore is here to stay. To keep up with the hardware innovation, software developers must
move from sequential programming to concurrent programming. However, this move is slow
and challenging due to the exponential complexity in reasoning about concurrency. In particular,
Heisenbugs such as data races, which are non-deterministic concurrency errors, pervasively
infect concurrent software, making concurrent program debugging notoriously difficult.
In this dissertation, we develop several effective methods for debugging concurrent programs
along four directions: multiprocessor deterministic replay, predictive trace analysis, trace sim-
plification, and data sharing reduction. First, we present LEAP, a lightweight record and replay
system that makes Heisenbugs reproducible on multi-core and multi-processors. Underpinned
by a new local-order based replay theorem, LEAP is fast, portable, and deterministic. As long as
a Heisenbug manifests once, LEAP is able to deterministically reproduce it in every subsequent
execution, and more importantly, with much lower overhead compared to previous approaches.
Second, we present PECAN and TraceFilter, a persuasive predictive trace analysis system that
predicts Heisenbugs from normal executions, and an efficient algorithm that significantly im-
proves the scalability of predictive analysis by removing the trace redundancy. The salient fea-
ture of PECAN is that, in addition to predicting Heisenbugs, it generates a concrete execution
that deterministically exposes and validates the predicted bugs. With PECAN, programmers are
provided with the full execution history and context information to understand the bug, which
dramatically expedites the debugging process.
Third, we present LEAN and SimTrace, a dynamic and a static technique for simplifying con-
currency bug reproduction through removing computational redundancy and validating trace
equivalence. A simplified execution with fewer threads, fewer thread interleavings, and faster
replay greatly reduces the debugging effort by reducing the number of places in the trace where
we need to look for the cause of the bug and by speeding up the bug reproduction process.
Finally, we present Privateer, an execution privatization technique that soundly privatizes a subset
of shared data accesses in a vast category of scheduler-oblivious concurrent programs. Under-
pinned by a privatization theorem, Privateer safely reduces the data sharing and isolates the
erroneous thread interleavings without introducing any additional synchronization. With Pri-
vateer, many Heisenbugs are fixed and a wide range of concurrency problems are alleviated
without impairing, but instead improving, the program performance.
Abbreviations
AA Access Anomaly
ASV Atomic-set Serializability Violations
MDR Multiprocessor Deterministic Replay
PTA Predictive Trace Analysis
Chapter 1
Introduction
We have entered a new era where our daily life is being dramatically changed by computing
technology. One of the greatest innovations in this era lies in the multicore hardware archi-
tecture, which brings our computers a new dimension of computational power. Even though
single-core frequency scaling has hit the power wall, the performance of our computers will continue to
increase, as multicore promises to deliver a continuous performance boost by packing more and
more computational cores onto each chip.
While it is obvious that a multicore computer has the potential for higher performance, actually
realizing this potential is difficult. Despite a decade of practice, developing good quality concur-
rent software that efficiently utilizes multicore hardware remains notoriously difficult. A main
challenge is the interleaving of actions from concurrent threads, which is essential for parallel
performance. Due to the interleaving, programmers can no longer reason in a sequential way
because threads sharing the same address space can interfere with each other through the shared
data following different access orders.
Moreover, the number of thread interleavings is astronomical: exponential in both the number
of threads and the number of instructions each thread executes. Facing this exponential complexity
of reasoning about concurrency, it is very difficult for programmers to write correct and efficient
concurrent programs. In addition, due to the huge interleaving space, software testing is often far
from sufficient to cover an adequate portion of the interleaving space, letting many concurrency
bugs slip into production and impact the end users.
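To make this growth concrete, a standard counting argument (our addition, not from the thesis text) counts the interleavings of $n$ threads, each executing $k$ instructions, as the number of ways to merge $n$ sequences of length $k$:

$$N(n,k) \;=\; \frac{(nk)!}{(k!)^{\,n}}$$

Already for $n = 2$ threads of $k = 10$ instructions each, this gives $\binom{20}{10} = 184{,}756$ interleavings; adding a third such thread pushes the count into the trillions.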
Even worse, the interleaving is non-deterministic, due to the thread scheduling non-determinism
and the timing differences between different cores. On a multicore computer, the same concur-
rent program running on the same machine with the same input can produce different outputs in
different runs. The non-determinism makes testing and debugging concurrent programs much
more challenging because a bug might “disappear” when programmers want to understand it.
As a consequence, concurrency bugs such as data races, atomicity violations, atomic-set seri-
alizability violations, and deadlocks widely infect concurrent software systems, causing severe
problems such as data corruption and program crashes, with huge economic cost [124] and even
real-world disasters [71].
Facing the numerous challenges above, we develop in this thesis a range of effective and scal-
able methods for dealing with concurrency bugs, aiming to improve the quality of concurrent
software in the multicore era.
1.1 Motivation
Concurrency bugs widely exist in today’s real-world concurrent software systems [74]. While
concurrent programs are inherently more difficult to reason about than sequential programs, several
other important factors also greatly affect the quality and reliability of concurrent programs.
Concurrency bugs are difficult to reproduce The exhibition of concurrency bugs is not
only dependent on the program input, but also on the thread interleaving. Since the interleaving is
non-deterministic due to choices made by the thread scheduler, the exhibition of concurrency
bugs is also non-deterministic. Consider the simple multithreaded example in Figure 1.1. In this
artificial program, there are two threads t1 and t2 accessing two different shared variables
x and y, and there is an error at line 4. Because these two threads can execute concurrently
on different cores, their execution order may follow different interleaving sequences. For
example, the execution may follow either interleaving A or B, represented by the statement line
numbers 1-5-2-6-7-3-4 and 1-2-3-5-6, respectively. If the program execution follows
interleaving A, the error at line 4 is triggered. However, if it follows interleaving B, the error
does not manifest.
This simple example illustrates the fact that the computation of concurrent programs is sensitive
to the thread interleaving. Even if we run the same program on the same machine with the
same program input, the error may or may not manifest in different runs. This phenomenon
makes debugging concurrent programs very hard. To reproduce a concurrency bug, not only
the same program input is required, but also the same thread interleaving. Unfortunately, it is
very challenging to capture the thread interleavings on multicore computers. Because recording
the thread interleavings at runtime inevitably hampers the execution parallelism, most runtime
techniques incur unacceptable program slowdown and are hard to deploy in production.
FIGURE 1.1: The same program exhibits different behaviors with different thread interleavings. The error manifests with the interleaving A (left) but not the interleaving B (right). [Recovered figure content - thread t1: 2: y=1; 3: if(x<0); 4: ERROR; thread t2: 6: if(y=1); 7: x=-1. Interleaving A: 1->5->2->6->7->3->4; interleaving B: 1->2->3->5->6.]
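For readers who prefer running code, here is a minimal Java sketch of the program in Figure 1.1. The class name and the zero-initialization of x and y are our assumptions; the figure leaves the contents of lines 1 and 5 implicit.

```java
// A minimal sketch of the Figure 1.1 program (initializations are assumed).
public class Heisenbug {
    static int x = 0, y = 0;                 // shared variables

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> {
            y = 1;                           // line 2
            if (x < 0)                       // line 3
                throw new AssertionError("ERROR");   // line 4
        });
        Thread t2 = new Thread(() -> {
            if (y == 1)                      // line 6
                x = -1;                      // line 7
        });
        t1.start(); t2.start();              // both threads may run concurrently
        t1.join();  t2.join();
        // Interleaving A (2-6-7-3-4) triggers the error; interleaving B
        // (2-3-6) does not. The scheduler decides, so the failure is flaky.
    }
}
```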
Concurrency bugs are difficult to detect Due to the astronomical number of thread inter-
leavings, detecting concurrency bugs is also very challenging. Traditional program testing tech-
niques for sequential programs do not work well for concurrent programs because they do not
take the interleaving into account. Moreover, as the interleaving space is huge, testing is often
far from sufficient to cover the entire interleaving space. For this reason, traditional program
analysis techniques for bug detection do not work well on concurrent programs. It is hard for
static program analysis or model checking techniques to find concurrency bugs in large real-world
concurrent programs, because there are just too many thread interleavings to explore. Traditional
dynamic analyses do not work well either, because only a limited set of paths and schedules is
observed. Furthermore, due to the inherent complexity of concurrent programs, program analysis
techniques tend to report quite a large number of false warnings, further impeding
the debugging process.
Concurrency bugs are difficult to understand Typical executions of real world concurrent
programs often contain a large number of threads, thread interleavings, shared data accesses,
and thread synchronizations. Even if a concurrency bug can be reproduced deterministically, it is
still very challenging for programmers to locate and understand the cause of the bug. Moreover,
replay is often significantly slower than native execution. For long-running
programs, the bug reproduction process may take too long. Furthermore, the bug reasoning
process based on the trace often involves frequent context switches between the executions of
different threads. As most programmers are trained to think sequentially, they have to jump
from the context of one thread to another frequently to reason about the concurrency bug. These
frequent context switches significantly impair the effectiveness of concurrent program debug-
ging.
Concurrency bugs are difficult to fix After diagnosing the concurrency bug, fixing it is still
a challenging problem.

FIGURE 1.2: Overview of the work in this thesis for concurrent program debugging. [Stages shown: Reproduction - LEAP (multiprocessor deterministic replay); Detection - PECAN and TraceFilter (predictive trace analysis); Diagnosis - LEAN and SimTrace (trace simplification); Fixing - Privateer (data sharing reduction).]

A common way to fix a concurrency bug is to add synchronization that
prevents the erroneous thread interleavings. However, facing the huge interleaving space and the
large number of thread contexts, it is usually difficult to find the proper type of synchronization
and the proper location to place the synchronization. Improper placement of synchronization
can not only incur non-negligible program slowdown but might also introduce new bugs such as
deadlocks. Moreover, even if the proper synchronization is placed at the right location to rule out the
manifested erroneous interleavings, it does not necessarily guarantee the bug is fixed. Because
the interleaving space is enormous, it is possible that some other unmanifested interleavings
that can still trigger the bug are not forbidden by the added synchronization.
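As a hypothetical illustration (not an example from the thesis) of how an improper fix backfires, the sketch below adds two locks to serialize accesses, but the threads acquire them in opposite orders, so each can block forever holding the lock the other needs:

```java
// Hypothetical "fix" gone wrong: inconsistent lock ordering causes deadlock.
public class DeadlockFromFix {
    static final Object lockA = new Object(), lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {           // thread 1: A then B
                pause();
                synchronized (lockB) { /* critical section */ }
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {           // thread 2: B then A
                pause();
                synchronized (lockA) { /* critical section */ }
            }
        }).start();
        // With unlucky timing both threads hold one lock and wait for the
        // other forever: the "fix" traded an erroneous interleaving for a deadlock.
    }

    static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
}
```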
1.2 Contributions
This thesis pursues four directions to address the debugging problem: multiprocessor deter-
ministic replay to reproduce concurrency bugs, predictive trace analysis to detect concurrency
bugs, static and dynamic trace simplification to help concurrency bug understanding, and data
sharing reduction to help fixing concurrency bugs without adding synchronization. Figure 1.2
shows an overview of the work done in this thesis. We next elaborate on the contributions of each
line of work.
1.2.1 Multiprocessor Deterministic Replay
Bug reproduction is often the first step in debugging. This thesis presents LEAP, a lightweight
record and replay system that makes concurrency bugs reproducible in general multicore and
multiprocessor environments. LEAP is fast, portable, and deterministic. As long as a Heisen-
bug manifests once, LEAP is able to deterministically reproduce it in every subsequent exe-
cution, and more importantly, with much lower overhead compared to previous approaches.
We describe the design and implementation of LEAP that uses static analysis and bytecode in-
strumentation to transparently provide the capability of deterministic replay for Java programs
without any user intervention. LEAP is the first publicly available deterministic replay system for
Java programs and has been used by several research groups worldwide.
1.2.2 Predictive Trace Analysis
Predictive trace analysis overcomes the limitation of static and dynamic analyses by combining
them. It records a trace of execution events, statically (often exhaustively) generates other per-
mutations of these events under certain scheduling constraints, and exposes concurrency bugs
unseen in the recorded execution. Predictive trace analysis is a powerful technique as, compared
to dynamic analysis, it is capable of exposing bugs in unexercised executions and, compared
to static analysis, it incurs far fewer false positives because its static analysis phase uses the
concrete execution history.
We present PECAN, a new predictive trace analysis system that predicts Heisenbugs from nor-
mal executions. The salient feature of PECAN is that, in addition to predicting Heisenbugs, it
generates concrete executions that deterministically expose the predicted bugs. With PECAN,
programmers are provided with the full execution history and context information to understand
the bug, which dramatically expedites the debugging process. PECAN has revealed several
serious and previously unknown bugs in large open source concurrent systems.
General predictive analysis for exposing Heisenbugs faces considerable challenges scaling to
large traces, due to the exponential explosion of the schedule exploration space. We further
present TraceFilter, an efficient algorithm that significantly improves the scalability of predic-
tive trace analysis. TraceFilter is based on a trace redundancy theorem which guarantees that
predictive trace analysis based on a redundancy-removed trace produces the same analysis result
as that on the original trace.
1.2.3 Dynamic and Static Trace Simplification
To address the difficulty of diagnosing concurrency bugs on a reproducible buggy trace, we
present LEAN and SimTrace, a dynamic and a static trace simplification technique that reduce
the size of the execution trace, the number of threads, and the number of thread context switches.
A simplified trace greatly lessens the debugging effort by reducing the number of places in the
trace where programmers need to look for the cause of the bug. More importantly, through
reasoning about the computational equivalence of the trace offline, SimTrace dramatically im-
proves the efficiency of trace simplification for reducing the thread context switches. SimTrace
scales well to traces with more than 1M events, making it attractive for practical use.
1.2.4 Data Sharing Reduction
We finally propose Privateer, an execution privatization technique that soundly privatizes a sub-
set of shared data accesses in a vast category of concurrent programs: scheduler-oblivious
programs, whose computation result is always deterministic regardless of the thread schedul-
ing. Underpinned by a privatization theorem, Privateer is able to reduce the data sharing in
scheduler-oblivious programs without introducing any additional program behavior. Moreover,
the non-deterministic thread interleavings on the privatized accesses are isolated without adding
any synchronization. With Privateer, many Heisenbugs are fixed and a wide range of concur-
rency problems are alleviated without impairing the execution parallelism; on the contrary,
program performance improves, because the heap accesses become local stack operations
after privatization.
1.3 Outline
The remainder of this thesis is organized as follows. Chapter 2 describes the background knowl-
edge and previous work on concurrent program debugging and the related concurrent program
execution modeling concepts. Chapter 3 presents our multiprocessor deterministic replay system
LEAP. Chapters 4 and 5 focus on our predictive trace analysis work on the scalable concurrency
bug detection, PECAN and TraceFilter. Chapters 6 and 7 present our dynamic and static trace
simplification techniques, LEAN and SimTrace. Chapter 8 presents Privateer, our execution
privatization technique for soundly reducing the data sharing in scheduler-oblivious concurrent
programs. Finally, Chapter 9 concludes this thesis and discusses future work.
The materials in some chapters have been published as conference and journal papers. The
materials in Chapter 3 have been presented in [47, 48]. The materials in Chapters 4 and 5 have
been presented in [50, 53]. The materials in Chapters 6 and 7 have been presented in [49, 52].
The materials in Chapter 8 have been presented in [51], and some materials in Chapter 8 have
been presented in [46].
Chapter 2
Background and Previous Work
This chapter introduces the background of concurrent program debugging and concurrency de-
fect analysis. Section 2.1 presents an execution model for concurrent programs. Section 2.2
presents the basic definitions used in this thesis. Section 2.3 presents the concurrency bug pat-
terns characterized by thread interleaving. Section 2.4 discusses existing techniques for tackling
the concurrency problems, including deterministic replay approaches for concurrency bug re-
production, concurrent program analysis techniques to detect concurrency bugs, and automatic
techniques for fixing and surviving concurrency bugs at runtime.
2.1 Concurrent Program Execution Modeling
In this section, we describe a general execution model of concurrent programs. This model is
a starting point to understand the difficulties in concurrent programming and to comprehend all
the program analysis techniques presented in this thesis for concurrent program debugging.
A concurrent program in our language consists of a set of concurrently executing threads T =
{t1, t2, ...} that communicate through a global store σ. The global store consists of a set of
variables S = {s1, s2, ...} that are shared among threads. Each thread also has its own local store
π, consisting of the local variables and the program counter to the thread. We use σ[s] to denote
the value of the shared variable s on the global store. Each thread executes by performing a
sequence of actions on the global store or the thread’s own local store. Let α refer to an action
and var(α) the variable accessed by α. If var(α) is a shared variable, we call α a global action,
otherwise it is a local action. Note that for any global action, it operates on only one variable
on the global store. This is also true for synchronization actions, though they are only enabled
when certain pre-conditions are met. For local actions, the number of accessed variables on the
local store is not important in our modeling.
A program execution is modeled as a sequence of transitions defined over the program state
Σ = (σ,Π), where σ is the global store and Π is a mapping from thread identifiers ti to the local
store πi of each thread. Since the program counter is included in the local store, each thread
is deterministic and the next action of ti is determined by ti’s current local store πi. Let αk be
the kth action in the global order of the program execution and Σk−1 be the program state just
before αk is performed (Σ0 is the initial state), the state transition sequence is:
$$\Sigma_0 \xrightarrow{\alpha_1} \Sigma_1 \xrightarrow{\alpha_2} \Sigma_2 \xrightarrow{\alpha_3} \cdots \tag{2.1}$$
Given a concurrent system described above, we next formally define the execution semantics
of action α. To give a precise definition, we first introduce some additional notations similar to
[34]:
• σ[s := v] is identical to σ except that it maps the variable s to the value v.
• Π[ti := πi] is identical to Π except that it maps the thread identifier ti to πi.
Let the relation $\sigma \xrightarrow{\alpha} \sigma'$ model the effect of performing an action α on the global store σ, and $\pi \xrightarrow{\alpha} \pi'$ model the effect of performing α on the local store π. The execution semantics of performing α are defined as follows.
Local action For the case of local actions, the execution semantics of performing α is simply
defined as:
$$\textsc{Local}:\quad \frac{var(\alpha) \notin S \qquad \Gamma(\alpha) = t_i \qquad \pi_i \xrightarrow{\alpha} \pi_i'}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\ \Pi[t_i := \pi_i'])} \tag{2.2}$$
The program state transition above means that when a local action is performed by a thread,
only the local store of that thread is changed to a new state determined by its current state. The
global store and the local stores of the other threads remain the same.
Global action The common part of the semantics of global actions is that when a global action
is performed by a thread ti on the shared variable s, only s and πi are changed to new states.
The states of all the other shared variables on the global store as well as the local stores of all
the other threads remain the same:
$$\textsc{Global}:\quad \frac{var(\alpha) = s \in S \qquad \Gamma(\alpha) = t_i \qquad \pi_i \xrightarrow{\alpha} \pi_i' \qquad \sigma[s] \xrightarrow{\alpha} \sigma'[s]}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s := \sigma'[s]],\ \Pi[t_i := \pi_i'])} \tag{2.3}$$
Let τ(α) denote the computation type of the global action α. To make the execution model
general to different programming languages, we consider the following types of global actions:
• READ - the thread ti reads the value of a shared variable in the global store into its local store:
$$\frac{\tau(\alpha) = \mathrm{READ} \quad var(\alpha) \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i'}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\ \Pi[t_i := \pi_i'])}$$

• WRITE - the thread ti assigns some value to a shared variable in the global store:
$$\frac{\tau(\alpha) = \mathrm{WRITE} \quad var(\alpha) = s \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad \sigma[s] \xrightarrow{\alpha} \sigma'[s]}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s := \sigma'[s]],\ \Pi[t_i := \pi_i'])}$$

• LOCK - the thread ti acquires a lock l (which is also a shared variable on the global store); the pre-condition l = 0 means that the lock is available and the post-condition l = i means that the lock l is now owned by the thread ti:
$$\frac{\tau(\alpha) = \mathrm{LOCK} \quad var(\alpha) = l \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad \sigma[l] = 0}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[l := i],\ \Pi[t_i := \pi_i'])}$$

• UNLOCK - the thread ti releases a lock l; the pre-condition l = i means l is now owned by the thread ti and the post-condition l = 0 means l is available:
$$\frac{\tau(\alpha) = \mathrm{UNLOCK} \quad var(\alpha) = l \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad \sigma[l] = i}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[l := 0],\ \Pi[t_i := \pi_i'])}$$

• FORK - the thread ti forks a new thread tj. Let the shared variable stj denote the existence of the thread tj in the program. The pre-conditions stj = NA and πj = NA mean that the thread tj is unavailable and its local store is undefined, and the post-conditions σ[stj := 1] and Π[tj := π0j] mean the thread tj is available now and its local store is initialized to π0j:
$$\frac{\tau(\alpha) = \mathrm{FORK} \quad var(\alpha) = s_{t_j} \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad s_{t_j} = \mathit{NA} \quad \pi_j = \mathit{NA}}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s_{t_j} := 1],\ \Pi[t_i := \pi_i',\ t_j := \pi_j^0])}$$

• JOIN - the thread ti joins the termination of the thread tj; the pre-condition stj = 0 means that the thread tj has already terminated:
$$\frac{\tau(\alpha) = \mathrm{JOIN} \quad var(\alpha) = s_{t_j} \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad s_{t_j} = 0}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\ \Pi[t_i := \pi_i'])}$$

• START - the first action in the action sequence of the thread ti. This is a dummy action indicating that the thread ti is ready to run. This action does not change any program state and it immediately follows the FORK action that forked the thread ti:
$$\frac{\tau(\alpha) = \mathrm{START} \quad var(\alpha) = s_{t_i} \quad \Gamma(\alpha) = t_i}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\Pi)}$$
• EXIT - the last action in the action sequence of the thread ti, indicating that ti has terminated. The value of the shared variable sti is set to 0 after this action:
$$\frac{\tau(\alpha) = \mathrm{EXIT} \quad var(\alpha) = s_{t_i} \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad s_{t_i} = 1}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[s_{t_i} := 0],\ \Pi[t_i := \mathit{NA}])}$$

• SIGNAL - the thread ti sets the value of a conditional variable c to 1:
$$\frac{\tau(\alpha) = \mathrm{SIGNAL} \quad var(\alpha) = c \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i'}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[c := 1],\ \Pi[t_i := \pi_i'])}$$

• WAIT - the standard semantics of a wait(c, l) action contains a sequence of three actions UNLOCK-WAIT-LOCK: the thread ti first releases the lock l it is currently holding, then it waits for a conditional variable c to become 1 and resets it back to 0 after c becomes 1, and finally it re-acquires lock l. The following execution semantics model the second action:
$$\frac{\tau(\alpha) = \mathrm{WAIT} \quad var(\alpha) = c \in S \quad \Gamma(\alpha) = t_i \quad \pi_i \xrightarrow{\alpha} \pi_i' \quad c = 1}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma[c := 0],\ \Pi[t_i := \pi_i'])}$$

• YIELD - the thread ti yields execution to another thread. This action does not change program state:
$$\frac{\tau(\alpha) = \mathrm{YIELD} \quad \Gamma(\alpha) = t_i}{(\sigma,\Pi) \xrightarrow{\alpha} (\sigma,\Pi)}$$
The execution semantics defined above conform to a general concurrent execution model with
deterministic input. Although dynamic thread creation and dynamic shared variable creation
are not explicitly supported by the semantics, they can be modeled within the semantics in a
straightforward way [34].
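To ground these semantics in a concrete language, the sketch below (our own illustration; the class and field names are assumptions) annotates ordinary Java constructs with the model's action types:

```java
// Illustrative mapping from the model's global actions to Java constructs.
public class ActionsInJava {
    static final Object lock = new Object();  // a lock variable l
    static int s = 0;                         // a shared variable on the global store

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {         // the child's first step ~ START
            synchronized (lock) {             // LOCK: blocks until sigma[l] = 0
                int local = s;                // READ: global store -> local store
                s = local + 1;                // WRITE: local store -> global store
                lock.notify();                // SIGNAL on a condition variable
            }                                 // UNLOCK: sets sigma[l] back to 0
            Thread.yield();                   // YIELD: no state change
        });                                   // run() returning ~ EXIT
        t.start();                            // FORK: makes t available
        t.join();                             // JOIN: waits until t has terminated
    }
}
```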
2.2 Basic Definitions
Definition 2.1. (Trace) A trace captures a multi-threaded program execution as a sequence of
events δ = ⟨ei⟩. We associate each event ei with the following attributes:
• i: the global order of ei in δ;
• t: the thread executing ei;
• m: the memory location accessed by ei;
• a: the access type of ei, where a ∈{READ, WRITE, LOCK, UNLOCK, WAIT, NOTIFY,
FORK, JOIN};
• l: the locks held by the thread executing ei when ei is executed;
• u: the atomic region to which ei belongs.
In our presentation, we use t(i), m(i), a(i), l(i), and u(i) to denote the attributes t, m, a, l, u
associated with the event ei respectively.
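For concreteness, the attributes of Definition 2.1 map naturally onto a small Java record; this is a hypothetical sketch, not a representation the thesis prescribes:

```java
import java.util.Set;

// One trace event carrying the attributes of Definition 2.1.
enum Access { READ, WRITE, LOCK, UNLOCK, WAIT, NOTIFY, FORK, JOIN }

record Event(
    long   i,        // global order of the event in the trace delta
    long   t,        // id of the executing thread
    long   m,        // id of the accessed memory location
    Access a,        // access type
    Set<Long> l,     // ids of the locks held when the event executes
    long   u         // id of the atomic region the event belongs to
) {}
```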
Definition 2.2. (Trace Equivalence) Two traces are equivalent if they drive the same initial
program state to the same final program state.
Definition 2.3. (Atomic Region) An atomic region is defined as a region of code fragments that
preserves certain consistency properties w.r.t. the program states. Similar to the work by Wang
et al. [133] and with no loss of generality, we consider every synchronized method and ev-
ery synchronized block as an atomic region. In addition, FORK/JOIN/WAIT/NOTIFY/YIELD
operations are considered to be region boundaries. In the case of nested regions, an event ei belongs to the outermost one.
Definition 2.4. (Partial Order Relation ≺) (POR) An important relation that is used by many
concurrent program analyses is the POR relation (also called happens-before relation) on the
events exhibited by a concurrent execution. Given a trace δ, the partial-order relation ≺ is the
smallest relation satisfying the following conditions:
• Intra-thread program order: If ei and ej are events from the same thread and ei comes
before ej in the trace, then ei ≺ ej .
• Inter-thread message order: If ei is an action that sends a message g and ej is an ac-
tion that receives g, then ei ≺ ej . In our model, such relations include FORK≺START,
EXIT≺JOIN, and NOTIFY≺WAIT. START and EXIT are two fake actions representing
the beginning and ending of a thread.
• ≺ is transitively closed.
The computation of ≺ is often done by maintaining a vector clock with every thread [81]. Note
that, slightly different from the classical happens-before in the Java memory model [77], the
lock order between UNLOCK and LOCK events is not included in the POR relation.
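A minimal sketch of that vector-clock computation (our own illustration; the thesis does not fix an implementation): each thread keeps a clock, ticks its own entry at every event, and merges clocks along the inter-thread edges (FORK≺START, EXIT≺JOIN, NOTIFY≺WAIT).

```java
import java.util.Arrays;

// Minimal vector clocks for the partial order of Definition 2.4 (a sketch).
final class VectorClock {
    final int[] c;                       // c[k] = events of thread k seen so far
    VectorClock(int numThreads) { c = new int[numThreads]; }

    void tick(int t) { c[t]++; }         // thread t performs its next event

    void mergeFrom(VectorClock sender) { // apply a FORK/EXIT/NOTIFY edge
        for (int k = 0; k < c.length; k++) c[k] = Math.max(c[k], sender.c[k]);
    }

    int[] stamp() { return Arrays.copyOf(c, c.length); }  // timestamp an event

    // e_i (stamped ci) precedes e_j (stamped cj) iff ci <= cj pointwise
    // with at least one strict inequality.
    static boolean happensBefore(int[] ci, int[] cj) {
        boolean strict = false;
        for (int k = 0; k < ci.length; k++) {
            if (ci[k] > cj[k]) return false;
            if (ci[k] < cj[k]) strict = true;
        }
        return strict;
    }
}
```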
Definition 2.5. (Dependence Relation →) is a strict relation that captures data and control
dependencies between events in the trace. The dependence relation ei → ej holds whenever ei occurs before ej and one of the following holds:
• Partial order - ei ≺ ej ;
• Lock order - ei and ej are consecutive UNLOCK and LOCK actions on the same lock,
respectively, by different threads such that ei releases the lock acquired by ej ;
• Conflicting order - ei and ej are consecutive conflicting actions by different threads on
the shared variable. There are three types of conflicting orders:
– WRITE→READ: ei is a WRITE action and ej is a READ action;
– READ→WRITE: ei is a READ action and ej is a WRITE action;
– WRITE→WRITE: both ei and ej are WRITE actions.
Given a dependence relation ei → ej, if ei and ej are from different threads, we say ei has a
remote outgoing dependence to ej , and similarly, ej has a remote incoming dependence from ei.
It is important to notice that the remote dependence relations in our model are between actions
accessing the same shared variable. Therefore, context switches between threads accessing
different variables in the trace are allowed to be reduced in our model.
Definition 2.6. (Memory model) A memory consistency model defines what value a READ
action will return. For example, the simplest but strictest model, sequential consistency
(SCMM) [65], requires that a READ always returns the value written by the most recent WRITE
on the same memory address. Various relaxed memory models [2, 77, 78] have been developed
to admit additional optimizations by imposing fewer constraints on the value returned from
READ operations. For simplicity, unless we emphasize the other memory models, by default
we consider SCMM in this thesis. Nevertheless, most techniques presented in this thesis also
generalize to relaxed memory models.
Definition 2.7. (Thread scheduling and interleaving) Under SCMM, in any execution, there
exists a global order among all the actions, and a READ action always returns the value writ-
ten by the most recent WRITE on the same variable in this global order. We call this global
order a schedule, denoted by ξ. ξ is non-deterministic; it may be different in different exe-
cutions. A thread interleaving occurs in ξ when an action from a certain thread is executed
between two successive actions from a different thread. A preemptive interleaving occurs when
the interleaved thread could have executed continuously without the interleaving. Preemptive
interleaving is non-deterministic, because it depends on the behavior of the thread scheduler and
the timing variations between threads [100]. If a schedule contains no preemptive interleaving,
we say it is sequential and, otherwise, non-sequential.
Definition 2.8. (Scheduler-obliviousness) A vast category of concurrent programs are scheduler-
oblivious. A scheduler-oblivious program requires that, given the same input, it always returns
the same output, regardless of the behavior of the underlying thread scheduler. More specifi-
cally, in our modeling, given the same initial state Σ0, for any schedule ξ, the computation of a
scheduler-oblivious program always reaches the same final state ΣN :
$$(\Sigma_0,\ \xi) \xrightarrow{\ \cdots\ } \Sigma_N \tag{2.4}$$
The definition of scheduler-obliviousness is semantically equivalent to determinacy [61]. A dif-
ference is that determinacy is a goal of parallel computation, whereas scheduler-obliviousness
is an expected property of the program.
Definition 2.9. (Blocking statement) A blocking statement is a statement that, when exe-
cuted, may enforce a thread interleaving or introduce an execution ordering between threads.
In our model, LOCK/WAIT/JOIN/YIELD/UNLOCK/NOTIFY/FORK are blocking statements,
and READ/WRITE are non-blocking. A LOCK statement is blocking because the acquiring thread may wait if the lock is unavailable. A WAIT statement always blocks first and then waits until another thread sets some condition to true. A JOIN statement must wait until the termination of another thread, and a YIELD statement always yields execution to another thread.
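In Java terms, these statement types map onto familiar constructs, as the following illustrative sketch (a hypothetical class) shows: monitor entry corresponds to LOCK, monitor exit to UNLOCK, Object.wait() to WAIT, Object.notify() to NOTIFY, Thread.join() to JOIN, and Thread.yield() to YIELD.

class BlockingDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    void consumer() throws InterruptedException {
        synchronized (lock) {          // LOCK: may wait if the lock is held
            while (!ready) {
                lock.wait();           // WAIT: blocks until notified
            }
        }                              // UNLOCK on monitor exit
    }

    void producer() {
        synchronized (lock) {
            ready = true;
            lock.notify();             // NOTIFY: wakes a waiting thread
        }
    }

    void coordinator(Thread worker) throws InterruptedException {
        worker.join();                 // JOIN: waits for worker to terminate
        Thread.yield();                // YIELD: yields execution to another thread
    }
}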
2.3 Thread Interleaving Patterns for Concurrency Bugs
Researchers have proposed various criteria for characterizing concurrency defects such as data
race [4, 109], atomicity [4, 34], causal atomicity [32], and conflict/view serializability [134].
A comprehensive study of concurrency-related bugs is given in [74]. We describe data race,
atomicity violation, and atomic-set serializability violations in this section. We omit deadlocks
and livelocks as they are not the focus of this thesis.
Data race Data races are one of the most common and subtle causes of pernicious concur-
rency bugs. A data race occurs when two threads are concurrently accessing the same data
without proper synchronization and at least one of these accesses is a write [109].
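As a concrete (hypothetical) Java illustration, the classic unsynchronized counter exhibits a data race:

class RacyCounter {
    static int count = 0;              // shared variable, no synchronization

    public static void main(String[] args) throws InterruptedException {
        Runnable body = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    count++;           // unsynchronized read-modify-write
                }
            }
        };
        Thread t1 = new Thread(body), t2 = new Thread(body);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // The concurrent accesses to count are unsynchronized and include
        // writes, so this is a data race; lost updates often make the
        // printed value smaller than the expected 200000.
        System.out.println(count);
    }
}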
Atomicity Violations Atomicity guarantees that the program’s behavior can be understood as
if each atomic region executes serially (without interleaved steps of other threads). An atomicity
violation happens when the desired serializability among multiple memory accesses on some
shared data is violated [34]. Suppose ei and ej are data accesses (write or read) from the same atomic region, ek is a data access from another atomic region, and ei, ej , ek access the same memory location. An atomicity violation occurs if, in some execution, ei happens before ek, ek happens before ej , and the access types of “ei-ek-ej” are of the form “write-read-write” or “x-write-x”, where x means either read or write.
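The following hypothetical Java sketch illustrates the “x-write-x” form: a check and an update that are intended to form one atomic region can be split by another thread's write.

class BankAccount {
    private int balance = 100;

    void withdraw(int amount) {
        if (balance >= amount) {        // e_i: READ of balance
            // e_k: another thread's WRITE to balance may interleave here,
            // yielding the problematic "read-write-write" interleaving
            balance = balance - amount; // e_j: WRITE based on a stale check
        }
    }
}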
Atomic-set serializability violations Atomic-set serializability is a criterion for characteriz-
ing concurrency defects proposed by Vaziri et al. [125]. Since it also considers the correlations
between memory locations, this criterion characterizes a wider range of concurrency bugs than
many previously proposed criteria such as data race and atomicity violation.
Id  Pattern
1   Wu(l1) Wu′(l) Wu′(L−l) Wu(l2)
2   Wu(l1) Wu′(l2) Wu(l2) Wu′(l1)
3   Wu(l1) Ru′(l) Ru′(L−l) Wu(l2)
4   Wu(l1) Ru′(l2) Wu(l2) Ru′(l1)
5   Ru(l1) Wu′(l) Wu′(L−l) Ru(l2)
6   Ru(l1) Wu′(l2) Ru(l2) Wu′(l1)
FIGURE 2.1: Atomic-set serializability violation patterns [125]. Wu(l) and Ru(l) represent a write and a read, respectively, to a memory location l by a unit of work u. l1 and l2 belong to the same atomic set.
In the definition of atomic-set serializability, the memory locations that share a consistency property with each
other are grouped into an atomic set, and code regions expected to preserve the consistency of
an atomic set are called units of work. Atomic-set serializability requires that the units of work
must be serializable for all the atomic sets that they operate on. Errors due to data races, high
level data races, and violations of standard notions of serializability can all be treated as vio-
lations of atomic-set serializability. Moreover, previous experience with this criterion shows that it can be more accurate in discerning real concurrency bugs than other existing criteria [43].
More importantly, Vaziri et al. [125] summarized a set of eleven problematic data access pat-
terns (six of which are shown in Figure 2.1) that violate atomic-set serializability (ASV) and proved that the set is com-
plete, provided that each unit of work that writes to an atomic set, writes all locations in that
set. For example, pattern 6 “Wu(l1)Wu′(l)Wu′(L − l)Wu(l2) (l ∈ l1, l2 = L)” shows an atomic-set
serializability violation that causes memory to be left in an inconsistent state. The two memory
locations l1 and l2 belong to the same atomic set. Because the two consecutive writes to l1 and
l2 of the unit of work u are interleaved by two writes to the two memory locations of another
unit of work u′, the consistency property between l1 and l2 is violated.
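The following hypothetical Java sketch instantiates this pattern: every individual access is synchronized, so there is no data race, yet the consistency property relating the two locations of the atomic set can still be broken.

class Pair {
    private int l1, l2;                 // atomic set {l1, l2}; invariant: l1 == l2

    synchronized void writeL1(int v) { l1 = v; }
    synchronized void writeL2(int v) { l2 = v; }

    // A unit of work u that writes the whole atomic set. Another thread's
    // update(w) may interleave between the two writes, matching pattern 1
    // of Figure 2.1 and leaving l1 != l2.
    void update(int v) {
        writeL1(v);                     // Wu(l1)
        // Wu'(l1) and Wu'(l2) of another unit of work u' may occur here
        writeL2(v);                     // Wu(l2)
    }
}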
2.4 Tackling Concurrency Problems
To address the difficulties in programming concurrent systems, existing research has focused on
three dimensions. The first dimension is to provide language and library support for easier and safer reasoning about concurrency. This dimension includes high-performance concurrency libraries
[54, 67], flexible synchronization mechanisms [145, 146], deterministic language semantics
[11, 14], and transactional memory [44, 114]. The second dimension is to provide deterministic
runtime enforcement for concurrent program execution [26, 97, 141]. This dimension usually
combats concurrency perils at the cost of program performance. The third dimension
targets the effective diagnosis of concurrency issues. Concurrency defect detection [35, 50, 79,
110], trace analysis [24, 41, 49, 55, 122], multiprocessor record/replay [45, 48, 83, 100] all
belong to this school. We next discuss previous research efforts related to concurrent program debugging.
2.4.1 Concurrency Bug Reproduction
2.4.1.1 Deterministic Replay
The technique of deterministic replay aims at faithfully reproducing earlier program executions.
It plays a substantial role in concurrent program debugging as it makes concurrency bugs repro-
ducible. We next discuss the representative deterministic replay techniques.
Software-only approaches Dejavu [23] is a software-only solution that uses a logical clock to provide deterministic replay of Java multi-threaded programs. It is developed as a JVM
extension that has two modes: record and replay. In the record mode, it records the thread
scheduling order at every critical event, including shared memory accesses and synchroniza-
tion operations. In the replay mode, it reproduces the execution behavior of the program by
enforcing the recorded logical thread schedule. However, since it has to trace every critical sec-
tion access, it only can support programs running on single-processor platforms. InstanceReplay
[68] is a record/replay technique that records the version number of shared objects accessed by
each thread for debugging parallel programs. It relies on a protocol called CREW that regu-
lates threads concurrent-read-exclusive-write on shared objects to reduce recording overhead.
To avoid the overhead of recording memory races, RecPlay [106] and Kendo [97] provide de-
terministic multi-threading of concurrent programs that are perfectly synchronized using locks. Un-
fortunately, most real world concurrent applications may contain benign or harmful data races,
making these approaches unattractive. Though RecPlay and Kendo both use a data race detector
during replay to ensure deterministic replay up until the first race, they suffer from the limitation
that they cannot replay past the data race. For instance, while debugging using a replayer, a pro-
grammer might want to understand the after effects of a benign data race, which is not possible
with RecPlay and Kendo. JRapture [119] is a capture/replay tool for observation-based testing.
It captures interactions between a Java program and the system, including GUI, file, and console
inputs, among other types, and on replay it presents each thread with exactly the same input
sequence it saw during capture. DoublePlay [127] and Chimera [69] are two recent techniques
that support low overhead full-program replay. DoublePlay intelligently offloads the recording
processes to extra cores, while Chimera combines static data race analysis with offline profiling
and dynamic checking to provide efficient online recording.
Hardware-assisted approaches Hardware approaches such as DMP [26] make inter-thread
communication fully deterministic by imposing a deterministic commit order among proces-
sors. PSet [141] eliminates untested thread interleavings by enforcing the runtime to follow a
tested interleaving via processor support. Because hardware approaches rely on non-standard
hardware support, they are limited to proprietary platforms. Though DMP [26] also proposes a
software-only algorithm, its overhead is more than 10x. FDR [138] and BugNet [92] are deterministic replay tools for program debugging based on checkpointing schemes and hardware-level
assistance. FDR employs additional hardware to track data races, program I/O, interrupts and
DMA accesses to enable deterministic replay of full system execution from the beginning of
a checkpoint. BugNet focuses on deterministically replaying the instructions executed in user
code and shared libraries by logging the register file content at some point in time and recording
the load values that occur after that point. Both of them require changes to the host operating
system and special hardware support. SMP-ReVirt [28] makes use of hardware page protec-
tion to detect shared memory accesses, aimed at replaying multi-processor virtual machines, but
its overhead can be up to 10x on multi-processors. Rerun [45] exploits episodic memory race
recording to achieve efficient logging (around 4B per 1000 instructions), while DeLorean [83]
promises much smaller log sizes and higher replay speeds by recording the total sequence of chunk commits.
2.4.1.2 Offline Search and Deterministic Multithreading
PRES [100] and ODR [3] are two replay solutions that use partial recording and offline search
for the reproduction of concurrency bugs. PRES proposes a novel technique that uses a feed-
back replayer to explore thread interleavings, which reduces the recording overhead at the price
of more replay attempts. ODR proposes a new concept, output-deterministic replay, that fo-
cuses on replaying the same program output, and relies on offline inference to help recording
less information online. ESD [143] further reduces runtime tracing overhead by symbolically
exploring the complete thread scheduling decisions via execution synthesis. Weeratunge et al.
[136] present an approach to generate a failure inducing schedule by comparing the core dumps
offline, leveraging an execution indexing technique [137].
There are also several research efforts to make concurrent programs data-race-free by construction and deterministic by default. In this direction, there have been language design approaches
[11, 14] as well as hardware ones [26, 141]. For example, languages such as DPJ [14] guar-
antee deterministic semantics by providing a type and effect system to perform compile-time
type checking. The problem with language level approaches is that they often require nontrivial
programmer annotations or have a limited class of concurrency semantics.
2.4.2 Concurrency Bug Detection
2.4.2.1 Static and Dynamic Program Analyses
Researchers have proposed a large body of dynamic or static techniques for concurrency defect
analysis. Eraser [109] first proposed the lockset-based approach for dynamic race detection.
Atomizer [34] uses Lipton’s reduction theory combined with the lockset algorithm to detect
atomicity violations dynamically. Lockset-based algorithms have also been extended by RacerX
[30] for static race and deadlock detection. Many techniques based on the happens-before re-
lation [66] have also been proposed for detecting concurrency defects. Farzan et al. [32] use happens-before to statically detect causal atomicity. O'Callahan and Choi [96] combine the lockset algorithm and the happens-before approach to dynamically detect races. Chord [88, 89]
uses a staged approach to statically detect data races. AVIO [75] detects atomicity violation
based on access interleaving invariants extracted at run time. MUVI [73] uses data mining tech-
niques to statically detect concurrency bugs based on multi-variable correlations. For detecting
ASVs, Hammer et al. [43] proposed a runtime monitoring technique based on a set of race
automata. The primary limitation of the dynamic techniques is that they can only detect the
defects manifested in a specific concrete execution. On the other hand, while static techniques
can potentially explore all paths to find possible concurrency defects, they typically report many false warnings.
Several hybrid techniques combining static and dynamic analysis also have been proposed for
concurrency defect analysis. CTrigger [99] uses a two phase approach to detect atomicity
violations by controlling program execution to actively exercise low-probability thread inter-
leavings. Velodrome [37] proposed a sound and complete approach for detecting conflict-
serializability violations based on the dependence information extracted from the execution
trace. Narayanasamy et al. [93] use replay analysis to automatically classify benign and
harmful races. The benefit of the hybrid approaches is that they may possess the merits of static
and dynamic analysis at the same time.
Active Testing [57, 59, 64, 98, 110] is a testing technique for concurrent programs proposed
by Sen et al. Given reports of potential concurrency-related defects obtained from
existing analysis tools, such as data races, atomicity violations and deadlocks, active testing
controls a defect-directed random scheduler to expose these defects in the program. Lai et
al. [64] develop AssetFuzzer that effectively exposes real ASVs by combining predictive trace
analysis with randomized active testing. A limitation of active testing is that it may still suffer
from non-determinism, because it utilizes only the partial information of the race pairs or ASV
tuples. To further improve effectiveness, PENELOPE [118] proposes a technique to expose
atomicity violations by re-executing the program under the full atomicity-violating schedules.
The atomicity-violating schedules in PENELOPE are generated using a cut-point based theo-
retical scheduling algorithm that addresses the single variable atomicity problem.
Type system and language based techniques, such as DPJ [14] and Guava [7], are also proposed
for detecting and eliminating concurrency defects offline. The problem with these approaches
is that they often require nontrivial programmer annotations.
Model checking [18, 62, 86, 113, 129] is an alternative way to find bugs in concurrent programs.
By exhaustively exploring the thread scheduling space, these techniques can also report counterexamples
for the detected concurrency defects. For example, CHESS dynamically explores the thread
scheduling decisions to expose concurrency bugs using a context-bounded approach. Shacham
et al. [113] also use a model checker to construct witnesses for data races reported by the
lockset algorithm. Unfortunately, due to the exponential size of the search space, it is hard for
them to scale to large programs without compromising the detection capability. PCT [19] and
PPCT [87] further improve the effectiveness of CHESS by exploring the schedules in a random
fashion with probabilistic guarantees of detecting concurrency bugs.
2.4.2.2 Trace-based concurrent program analysis
A large body of recent research focuses on predictive trace analysis (PTA) of concurrent programs.
Sen et al. [111] proposed a generalized predictive analysis technique for detecting violations of
safety properties. Wang et al. [134] proposed the reduction-based and block-based algorithms
for checking atomicity on the execution trace. Chen et al. [22] presented a framework for pre-
dictive analysis of concurrent Java programs. Lai et al. [64] combined PTA with randomized
active testing [110] to detect ASVs in a run. A common difficulty in these techniques is that
they do not scale as the size of executions increases.
To alleviate the scalability problem of PTA, Farzan et al. [33] developed a meta-analysis model
that produces an efficient algorithm for checking atomicity violations in programs that obey the
nested locking discipline. The algorithm works in time linear in the length of the runs, and
quadratic in the number of threads, and was also used in PENELOPE [118] for testing and
debugging atomicity violations.
Symbolic analysis Wang et al. [128, 130, 131] developed a symbolic analysis model for find-
ing concurrency errors, such as atomicity violations, based on the execution trace. The model
encodes the causal dependencies between events, the program control structure, and the prop-
erty of concurrency errors in a uniform way using symbolic constraints and calls a satisfiability
solver to verify the existence of property violations. This approach can statically check whether
a property holds in all feasible permutations of events in the given execution trace. However,
it still faces the inherent challenge of a huge search space and is hard to scale to large traces.
Moreover, although the symbolic model is able to exhaustively verify the feasibility of sched-
ules, it is not clear how to efficiently generate a witness that manifests the detected concurrency
errors using this approach.
2.4.3 Surviving Concurrency Bugs
Atomicity violation fixing A recent advance by Jin et al. [56] proposes an automated technique
that fixed six out of eight real atomicity violation bugs, using sophisticated static analysis com-
bined with dynamic monitoring to resolve deadlocks. Weeratunge et al. [135] also present a lock
based approach to effectively suppress concurrency errors by enforcing the atomicity property
observed from good executions. Synchronization is a general way to fix concurrency bugs; nevertheless, a drawback of adding synchronization is that it may incur high runtime overhead.
Runtime approaches A line of active research [25, 76, 103, 105, 126, 142] proposes detecting
and surviving concurrency bugs at runtime. ISOLATOR [103] makes the execution of a buggy
program more robust by isolating the well-behaved threads from ill-behaved ones. ToleRace
[105] detects and tolerates asymmetric races in lock-based programs through replication. Atom-
Aid [76] proposes a hardware architecture to reduce the possibility of atomicity violations. Yu
and Narayanasamy [142] uses hardware transaction to constrain the program execution to tested
interleavings. More recently, Veeraraghavan et al. propose a system called Frost [126] that sur-
vives data races by running multiple replicas with complementary schedules. Cui et al. develop
PEREGRINE [25], which generalizes reusable schedules to more inputs by computing path constraints.
Chapter 3
Multiprocessor Deterministic Replay
The technique of deterministic record and replay aims at faithfully reproducing an earlier pro-
gram execution. For concurrent programs, it is one of the most important techniques for program
understanding and debugging. State-of-the-art deterministic replay techniques face chal-
lenging efficiency problems in supporting multi-processor executions due to the unoptimized
treatment of shared memory accesses. We propose LEAP: a deterministic record and replay
technique that uses a new type of local order w.r.t. the shared memory locations and concurrent
threads. Compared to previous work, our technique records much less information without los-
ing replay determinism. The correctness of our technique is underpinned by formal models and a
replay theorem that we have developed. Through our evaluation using both benchmarks and real
world applications, we show that LEAP is more than 10x faster than conventional global-order
based approaches and, in most cases, 2x to 10x faster than other local-order based approaches.
Our recording overhead on the two large open source multi-threaded applications Tomcat and
Derby is less than 10%. Moreover, LEAP is able to deterministically reproduce 7 out of 8 real
bugs in Tomcat and Derby, 13 out of 16 benchmark bugs in the IBM ConTest benchmark suite, and 100% of randomly injected concurrency bugs.
3.1 Introduction
One of the most effective ways for combating concurrency bugs is the technique of record and
replay [3, 23, 28, 39, 45, 68, 83, 84, 91, 97, 100, 106, 108]. The record and replay technique aims
at fully reproducing the problematic execution of concurrent programs, thus giving programmers
both the context and the history information to dramatically expedite the debugging process.
A crucial design factor in record and replay solutions is the degree of recording fidelity, i.e.,
the amount of data to be recorded, for the sufficient reproduction of problematic program ex-
ecutions. Simply speaking, the degree of recording fidelity is proportional to the degree of
faithfulness in replay. This characteristic is less problematic for hardware-based record and
replay solutions [45, 83, 84, 92, 138], in which special chips share the cost of the recording
computation. For the software-only solutions [91, 108] on uni-processors, the replay of concur-
rent programs can be achieved with low overhead by capturing the thread scheduling decisions.
However, for software-only solutions on multi-processors, making the best trade-off between
how much to record and how faithful to replay is still a very challenging problem, drawing
intense research attention [3, 23, 39, 68, 97, 100, 106].
Our research is also concerned with software-only record and replay solutions. Our gen-
eral observation is that the state of the art does not achieve both recording efficiency and re-
play determinism. Conventional deterministic multi-processor replay techniques usually incur
a significant runtime overhead of 10x to 100x [23, 26, 28, 68], making them unattractive for
production use or even for testing purposes. For instance, Dejavu [23] is a global clock based
approach that is capable of deterministically replaying concurrent systems on multi-processors
by assigning a global order to all “critical events”, including both the synchronization points and
the shared memory accesses. As indicated by the authors, the enforcement of the global order on
variable accesses across multiple threads incurs a large runtime overhead on multi-processors.
The research of lightweight record and replay techniques [3, 39, 97, 100, 106] has success-
fully lowered the recording overhead, but at the cost of sacrificing determinism. JaRec [39] and
RecPlay [106] abolish the idea of global ordering and use Lamport clocks [66] to maintain partial thread access orders w.r.t. only the monitor entry and exit events, thus making the recording
process lightweight. However, without tracking the shared memory accesses, their approaches
cannot deterministically reproduce problematic runs because a large majority of shared memory
accesses are not synchronized, either due to programming errors or because they are harmless
[93].
As also pointed out in [107], to deterministically replay a concurrent system on multi-processors,
it is necessary to record the thread access orders of the shared memory locations, a method com-
monly believed to be too expensive to be practical [3, 39, 97, 100, 106]. In this work, we demon-
strate that it is possible to achieve efficiency in this approach by observing that, given the same
program input, it is sufficient to deterministically replay the program execution by recording
partial thread access information local to the individual shared variables. Based on this obser-
vation, we have designed and implemented LEAP, a replay tool that provides both recording
efficiency and replay determinism. The replay determinism is underpinned by a semantic model
and formal theorems. To achieve efficiency, we use a field-based approach to statically identify
shared variables, thus, avoiding the cost of runtime identification. In addition, we make exten-
sive use of static analysis to provide a close approximation of the necessary program locations
that need to be monitored and, thus, to prune away a large percentage of otherwise redundant
recording operations.
The idea of the local-order based recording can be traced back to InstantReplay [68], which
enables the deterministic replay by recording the access history of all the shared objects w.r.t. a
particular thread. This technique does not suit our design objectives of being both deterministic
and efficient. First, InstantReplay requires the unique identification of shared objects dynami-
cally, a task hard to efficiently and correctly implement in practice. Second, InstantReplay uses
a complex computation model based on the CREW protocol, making the recording process very
costly. Third, there are important soundness issues with the local-order based approaches that
must be formally proved. Another local-order based approach is the use of Lamport clocks to track the partial order of critical events that each thread sees [39, 106]. Our technique tracks the order of thread accesses that each shared variable sees, which is operationally simpler than the use of Lamport clocks.
We evaluate the runtime performance of LEAP by comparing to the related techniques in-
cluding global clock, InstantReplay, and Lamport clock. Our micro-benchmark shows that
LEAP is more than 10x faster than the global clock based approach, more than 5x faster than
InstantReplay, and at least 2x faster than the use of Lamport clock. On real world large open
source multi-threaded applications such as Tomcat and Derby, LEAP is 5x to 10x faster than
the related approaches. The average runtime overhead of LEAP is less than 10% on Tomcat and
Derby. Moreover, LEAP is able to deterministically reproduce 7 out of 8 real concurrency bugs
in Tomcat and Derby, 13 out of 16 benchmark bugs in IBM ConTest benchmark suite [31], and
100% of the randomly injected concurrency bugs.
The rest of this chapter is organized as follows: Section 3.2 presents the technical details of
LEAP; Section 3.3 presents the semantic model and proofs; Section 3.4 describes the imple-
mentation of LEAP; Section 3.5 evaluates LEAP; Section 3.6 summarizes this chapter.
3.2 LEAP: Local-Order Based Deterministic Replay
LEAP provides a general technique for deterministic replay of concurrent programs on multi-
processors. We define replay determinism as the faithful reenactment of all program state tran-
sitions experienced by a previous execution. A more complete and formal model is presented
in Section 3.3. The main idea of LEAP is that each shared variable tracks the order of thread
accesses it sees during execution.
3.2.1 LEAP Overview
We first use a simple example to show the main technique of LEAP and contrast it with the conventional global-order based approach to deterministic replay. In Figure
1.1 (left), we show a race condition that triggers an ERROR at line 4 following the interleaved
execution order <1,5,2,6,7,3,4>. The global-order based approaches record this schedule
and use it to re-execute the program at the cost of six global synchronization operations. Our
observation is that not all thread accesses to different shared variables need to be tracked. Instead
of enforcing a global order, we claim that it is sufficient to record the thread access order that
each shared variable sees. In our example, instead of the global order vector, we use two access
vectors (x.vec and y.vec) for the shared variables x and y and record <t1,t2,t1> and
<t2,t1,t2> respectively. We require zero global synchronization operations and two groups
of local synchronization operations executed in parallel. During replay, we associate x and y
with conditional variables to enforce that the access order of threads is identical to what was
recorded in their respective access vectors.
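The following minimal Java sketch (illustrative only; the names and the structure are simplified from LEAP's actual instrumentation) conveys the recording idea: each SPE owns its own access vector, so accesses to the same SPE are serialized while accesses to different SPEs proceed in parallel.

import java.util.ArrayList;
import java.util.List;

class AccessVector {
    private final List<Long> threadIds = new ArrayList<Long>();

    // Serializes accesses to this SPE only; the recorded sequence is
    // exactly the order of thread accesses this shared variable sees.
    synchronized void record(long threadId) {
        threadIds.add(threadId);
    }
}

class Recorder {
    static final int NUM_SPES = 2;     // e.g., x -> 0, y -> 1
    static final AccessVector[] vectors = new AccessVector[NUM_SPES];
    static {
        for (int i = 0; i < NUM_SPES; i++) {
            vectors[i] = new AccessVector();
        }
    }

    // Invoked before each access to the SPE with index speId.
    static void accessSPE(int speId) {
        vectors[speId].record(Thread.currentThread().getId());
    }
}

Note that no global synchronization is involved: two threads touching different SPEs never contend on the same lock.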
Although our technique can be easily illustrated, to ensure determinism and efficiency, there are
many tough challenges that we must tackle:
1. Static shared variable localization. How to effectively locate shared variables statically?
What will happen if we miss some shared variables, or some local variables are mistakenly
recognized as shared?
2. Consistent shared variable and thread identification across runs. How to match the identities
of shared variables and of threads between the recording run and the replay run? For example,
the deterministic replay would fail if the shared variable x at record is incorrectly recognized as
y at replay, or the thread t1 is mistakenly recognized as t2.
3. Non-unique global order. Keen readers may point out that, by only recording the thread
access orders each variable sees, LEAP will permit a global thread schedule that is different
from the recording run. For instance, in our example, LEAP also permits the global order
<5,1,2,6,7,3,4>. Will this affect the faithfulness of the replay?
In the rest of the section, we focus on discussing the first two issues. The soundness of our
approach associated with the third issue is fundamental to our technique. In Section 3.3, we
provide a formal semantic model and proofs to show this phenomenon does not affect the faith-
fulness of the replay.
3.2.2 Locating Shared Variable Accesses
Precisely locating shared variables is generally undecidable [15]. We therefore compute a complete over-approximation using a static escape analysis in the Soot framework (http://www.sable.mcgill.ca/soot) called ThreadLocalObjectAnalysis [42]. ThreadLocalObjectAnalysis provides on-demand answers to whether a variable can be accessed by multiple threads simultaneously or not.
class Account {
    int balance1;                      // SPE name: Account.balance1, index 1
    int balance2;                      // SPE name: Account.balance2, index 2

    getBalance1 {                      // original version
        tmp = balance1;
        return tmp;
    }

    setBalance2 {
        ...
        balance2 = value;
    }
}

getBalance1 {                          // transformed (record) version
    thread_id = getThreadId();
    get_lock(1);
    accessSPE(thread_id, 1);
    tmp = balance1;
    release_lock(1);
    return tmp;
}

FIGURE 3.1: The instrumentation of SPE accesses
However, there are a few important issues with this analysis. First, static analysis is inherently conservative, as local
variables might be reported as shared. We show in Section 3.3 (Corollary 3.3) that this type of
conservativeness does not affect the correctness of the deterministic replay. Second, Thread-
LocalObjectAnalysis does not distinguish between read and write accesses. Shared immutable
variables, whose values never change after initialization, need not be tracked, for they cannot
cause nondeterminism. Third, we discover that static variables are all conservatively reported
as escaped in ThreadLocalObjectAnalysis. Since static variables might also be accessed only by
one thread, we wish to analyze them in the same way as the instance variables, in order to obtain
a more precise result. Thus, we make two enhancements to the ThreadLocalObjectAnalysis: 1.
we further refine the analysis results of ThreadLocalObjectAnalysis so that we do not record
accesses to shared immutable variables; 2. we modify ThreadLocalObjectAnalysis to treat static
variables in the same way as instance variables.
3.2.3 Field-based Shared Variable Identification
For Java programs, since standard JVMs do not support consistent object identification
across runs, we cannot use the default object hash-code. We use a static field-based shared
variable identification scheme, applied to the following three categories of variables, which
are collectively referred to as the shared program elements (SPE): 1. variables that serve as
monitors; 2. class variables; 3. thread escaped instance variables. These SPEs include both Java
monitors and shared field variables that may cause nondeterminism. SPEs are uniquely named
as follows: for category 1, it is the name of the declaring type of the object variable; for category
2 and 3, it is the variable name, combined with the name of the class in which the variable is
declared.
After obtaining all the SPEs in the program, LEAP assigns offline to each SPE a numerical index
as its runtime identifier. For example, in Figure 3.1, suppose the two field variables balance1 and balance2 of the Account class are identified as shared; they are then mapped to the numerical
IDs 1 and 2.
The static field-based shared variable identification remains consistent across runs and does not
incur runtime overhead. Moreover, compared to the object level identification approaches [68],
this approach is more fine-grained as different fields of the same object are mapped to different
indices. Consequently, accesses to different fields of the same object do not need to be serialized
at runtime.
There are a few issues with our field-based shared variable identification. First, our approach
does not statically distinguish between different instances of the same type. As a result, accesses
to the same shared field variable of different instances of the same type would be serialized
and recorded into the same access vector. For this concern, we formally prove in Section 3.3
(Corollary 3.4) that the deterministic replay is also guaranteed, if the thread accesses to different
shared variables are recorded globally into a single access vector. Second, we cannot uniquely
identify scalar variables that are aliases of shared array variables. To deal with this issue, we
perform an alias analysis for all of the scalar array variables in the program and represent all
the aliases with the same SPE, ignoring the indexing operations. This treatment guarantees that
the nondeterminism caused by array aliases can be correctly tracked, however, at the cost of
reducing the degree of concurrency. Fortunately, in our experiment, we find very few such cases
in large Java multi-threaded applications. A good object-oriented program rarely manipulates
shared array data directly, so such arrays rarely escape.
3.2.4 Unique Thread Identification
Since thread identity is the only information recorded into the access vectors, we must make
sure that a thread at the recording phase is correctly recognized during replay. A naive way is to
keep a mapping between thread name and thread ID during recording and use the same mapping
for replay. However, different parent threads can race with each other when creating their child
threads. Therefore, the thread ID assignment is not fixed across runs.
We take a similar approach to that in JRapture [119] to identify threads and their children. The
key observation is that each thread should create its children threads in the same order, though
there may not exist a consistent global order among all threads. We therefore create a consistent
identification for all threads based on the parent-children order relationship. More specifically,
starting from the main thread (T0), each thread maintains a thread-local counter for recording
the number of children it has forked so far. Every time a new thread is forked, it is identified by its parent's thread ID combined with the counter value. For instance, if a thread ti forks its jth child thread, this child thread is identified as ti:j.
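A minimal Java sketch of this identification scheme (a hypothetical helper, not LEAP's actual implementation) is shown below; a wrapper Runnable created at fork time can carry the child's ID from the parent to the child.

class ThreadIdentifier {
    // Number of children this thread has forked so far.
    private static final ThreadLocal<Integer> childCount = new ThreadLocal<Integer>() {
        protected Integer initialValue() { return 0; }
    };
    // This thread's consistent identifier; the main thread is "0".
    private static final ThreadLocal<String> myId = new ThreadLocal<String>() {
        protected String initialValue() { return "0"; }
    };

    // Called by the parent at each fork: the j-th child of thread ti
    // receives the identifier "i:j", independent of scheduling.
    static String nextChildId() {
        int j = childCount.get() + 1;
        childCount.set(j);
        return myId.get() + ":" + j;
    }

    // Called on the child thread before it runs any user code.
    static void assignId(String id) {
        myId.set(id);
    }
}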
3.2.5 Handling Early Replay Termination
Our local-order based approach permits different global schedules for threads that do not affect
each other’s program states. One caveat of this approach is that it gives rise to the possibility
of early termination: a program crash action might occur earlier in the replay execution, thus,
making the replayed run not fully identical to the recording run in terms of its behavior. To faith-
fully replay all the thread execution actions, we ensure that every thread in the replay execution
performs the same number of SPE accesses as it does in the recording execution. Consequently,
we guarantee that the replay execution does not terminate until all the recorded actions in the
original execution are performed, thus making the final state of the replayed execution the same
as that of the original one.
3.3 A Theorem of Local Ordering
In this section we formally prove the soundness of our local-order based approach for determin-
istic replay. We also use two corollaries to show the soundness of the field-based shared variable
identification approach and the soundness of using an unsound but complete static escape anal-
ysis for deterministic replay.
Recall the execution model described in Section 2.1. The action sequence ⟨αk⟩ of a program
execution is called an execution schedule denoted by δ. Suppose there is an execution schedule
δ of size N that drives the program state to ΣN , our goal is to have another execution schedule
δ′ that is able to produce the same program state as ΣN . Obviously, this can be achieved if δ′ = δ
holds. However, this is too strong a condition. We show a relaxed and sufficient condition based
on the access vectors of all the shared variables. To state precisely, let δs be the sequence of
actions w.r.t. a shared variable s projected from δ, τs be the sequence of thread identifiers picked
out from δs, and τ be the mapping from s to τs for all s ∈ S (τ represents the access vectors of all the
shared variables), we prove:
Theorem 3.1. Under the execution semantics defined in Section 2.1, two execution schedules δ
and δ′ of the same concurrent program have the same final state ΣN = Σ′N if Σ0 = Σ′0 ∧ τ = τ′.
The core of the proof is to prove the following lemma:
Lemma 3.2. For any action α′k (k ≤ N ) in the replay execution δ′, suppose it is the pth action
on a shared variable s, then α′k is equal to the pth action on s in the original execution δ.
For two actions to be equal here, they need to read and write the same values, not just do the
same operation on the same shared variable. Next, we first define a notion of “happened-before”,
and then we prove Lemma 3.2 using this notion.
Consider the “happened-before” order of the original execution. The “happened-before” rela-
tion is defined as follows:
(a) If action αi immediately precedes action αj in the same thread, then αi happened-before
αj ;
(b) If action αi and action αj by different threads are consecutive actions on a shared variable
s, without any intervening actions on s, then αi happened-before αj ;
(c) The “happened-before” is reflexive and transitive.
More accurately, rules (a) and (b) define “happened-immediately-before” and “happened-before”
is the reflexive transitive closure of “happened-immediately-before”.
Proof. Let the “happened-before” tree of an action be the tree of all the actions that “happened-before” it. We prove Lemma 3.2 by induction on the depth of the “happened-before” tree.
Base case: Consider an action on the shared variable s, with a “happened-before” tree of depth 1.
This means that the current action does not depend on anything that happened-before it involving
shared variables. Because the first action on a shared variable is performed by the same thread
in both the original and the replay execution, and because that thread is deterministic, the replay
action should be identical to the one in the original execution.
Induction: Now assuming that Lemma 3.2 holds for all actions with happened-before depth
≤ n, we prove it for n + 1. Consider an action αi on a shared variable s, where αi has a
tree of happened-before depth n + 1. Let’s say αi is the pth action on s. The (p-1)th action
on s has a lower happened-before depth so it is an equal action in both the original and the
replay execution. Additionally, every action αj that “happened-immediately-before” αi has a
happened-before tree of depth n, therefore it is equal to a similarly numbered action in the
original execution (i.e., if αj is the kth action on a shared variable v, then αj is equal to the kth
action on v in the original execution). Now action αi only depends on all the αj actions. So,
since our approach enforces that the pth action on s is performed by the same thread in both
executions, and since the thread is deterministic and every value that αi can depend on is equal in the two executions, it follows that action αi is also equal in the original and replay executions.
Lemma 3.2 is proved. If we apply Lemma 3.2 to the last action α′N in the replay execution, we
can get Σ′N = ΣN . Thus, Theorem 3.1 is proved.
With Theorem 3.1, we have proved the soundness of local-order based approaches for the de-
terministic replay that is able to reach the same program state as the original execution, by only
recording the access vectors for all the shared variables.
While τ = τ ′ is a rather relaxed condition, we can surely add more information that also guar-
antees the deterministic replay. For example, if the local variable accesses are recorded, the
deterministic replay is still guaranteed as long as we do not miss any shared variable accesses.
In the following, we derive two corollaries:
Corollary 3.3. The deterministic replay holds as long as τ = τ ′, regardless of whether accesses
to local variables are recorded or not.
Corollary 3.4. Recording different shared variable accesses into a single access vector does
not affect the correctness of the deterministic replay.
As noted in Section 3.2.2, the static escape analysis is conservative such that local variables
might be mistakenly categorized as shared. Corollary 3.3 ensures that this conservativeness
does not affect the correctness of the deterministic replay as long as all the shared variables are
correctly identified. Corollary 3.4 is easy to understand as the thread access orders on different
shared variables can be considered as a global order on a single variable abstracted from these
shared variables. To be more clear, assuming all thread accesses are recorded into a global
access vector, it is a global order of the execution schedule; hence, the determinism must hold.
As noted in Section 3.2.3, Corollary 3.4 ensures the soundness of our field-based shared variable
identification.
3.4 LEAP Implementation
We have implemented LEAP using the Soot framework. Figure 3.2 shows the overview of the
LEAP infrastructure, consisting of the transformer, the recorder, and the replayer. The trans-
former takes the bytecode of an arbitrary Java program and produces two versions: the record
version and the replay version. Started by a record driver, LEAP collects the access vector for
each SPE during the execution of the record version. When the recording stops, LEAP saves
both the access vectors and the thread creation order information and generates a replay driver.
To replay, the LEAP replayer uses the generated replay driver as the entry point to run the replay
version of the program, together with recorded information. The replayer takes control of the
thread scheduling to enforce the correct execution order of the threads w.r.t. the SPEs. We now
introduce each of the components in turn.
[Figure: architecture diagram. The Transformer (SPE Locator, SPE Access Instrumentor, Record Version Generator, Replay Version Generator) takes the original program and produces the record and replay versions; the Recorder (SPE Access Recorder, Thread Creation Order Recorder) runs the record version and emits the access vectors, the thread creation order, and the replay driver; the Replayer (Trace Loader, Thread Scheduler) uses these to drive the replay version.]
FIGURE 3.2: The overview of the LEAP infrastructure
3.4.1 The LEAP Transformer
The LEAP transformer performs the instrumentation on Jimple, an intermediate representation
of Java bytecode in the three-address form. For the record version, after locating all the SPEs in
the program, the transformer visits each Jimple statement and performs the following tasks:
Instrumenting SPE accesses If the SPE is not a Java monitor object, we insert a LEAP moni-
toring API invocation before the Jimple statement to collect both the thread ID and the numeric
SPE ID. Both the API call and the SPE access are wrapped by a lock specific to the accessed
SPE to ensure that we collect the right thread access order seen by the SPE. If the SPE is a
Java monitor object, we insert the monitoring API call after the monitorenter and before the
monitorexit instructions. The API call is also inserted before notify/notifyAll/thread
start operations and after wait/thread join operations. Figure 3.1 shows a source-
code equivalent view of the instrumentation on the read/write accesses to the shared field vari-
ables. The box on the left shows the original method getBalance1, inside of which the
shared variable balance1 is read. The box on the right shows the transformed version of
getBalance1. For multiple shared variable accesses in a method, the thread ID needs to be obtained only once. Also, to remove unnecessary recording overhead, we do not instrument SPEs that are always protected by the same monitor.
Instrumenting recording end points To enable the deterministic replay, we insert the record-
ing end points to save the recorded runtime information and to generate the replay driver. Cur-
rently, LEAP supports three types of recording end points. First, we add a ShutDownHook to
the JVM Runtime in the record driver as a recording end point. When the program ends, the
ShutDownHook will be invoked to perform the saving operations. Second, we insert a try-
catch block into the main thread and the run method of each Java Runnable class. We then
add a method invocation in the catch block to capture the uncaught runtime exceptions as the
recording end points. Third, LEAP also supports the user specified recording end points by
allowing the annotation-based specification of end points. During the traversal of the program
statements, the transformer will replace the annotation with a method invocation, indicating the
end of recording.
To generate the replay version, the transforming process is largely identical to the record ver-
sion with a few differences: 1. since the order of synchronization operations on each SPE is
controlled by the LEAP replayer during replay, we need to insert the API call before the original
synchronization operations in the program, i.e., monitorenter and wait, to avoid deadlock;
2. the inserted API call is bound to a different implementation from the one used during the
recording phase; 3. since we need to ensure that the replay execution does not terminate until all
recorded actions in the original execution have been executed (See Section 3.2.5), we insert ex-
tra API invocations after each SPE access so that we can check whether a thread has performed
all its recorded actions in the original execution or not.
3.4.2 The LEAP Recorder
When executing the record version of the target program, the LEAP monitoring API will be
invoked on each critical event to record the ID of the executing thread into the access vec-
tor of the accessed SPE. To reduce the memory requirement, we use a compact representation
of the access vectors by replacing consecutive and identical thread IDs with a single thread
ID and a corresponding counter. For example, suppose the access vector of a SPE contains
<t1,t1,t2,t2,t2>, it is replaced by <t1,t2> and a corresponding counter <2,3>. This
compact representation produces much smaller log size compared to the related approaches in
our experiment. When a new thread is created, its ID is computed according to our consistent
thread idenfication method. Once a program end point is detected, the LEAP recorder will then
save the recorded data, i.e, the recorded access vectors, and the thread creation order list, and
generate the replay driver.
3.4.3 The LEAP Replayer
The LEAP replayer controls the scheduling of threads to enforce a deterministic replay using
both the access vectors and the thread identity information. To enable the user level thread
scheduling, the replayer associates each thread in replay with a semaphore maintained in a
global data structure, so that each thread can be suspended and resumed on demand.
To replay, the replay driver first loads the saved access vectors and starts executing the replay
version of the program. Before each SPE access, the threads use their semaphores to coordinate
with each other in order to obey the access order defined in the access vector of the SPE. Also, to
make sure that the replay execution does not terminate “early”, the thread also counts the total
number of SPE accesses it has performed so far after each SPE access. The thread suspends
itself if it finds that it has already executed all its SPE accesses in the original execution, as
recorded in the access vector, until all threads have finished their recorded actions. Since the
threads accessing different SPEs can execute in parallel, the replaying process is also faster than
that of a global-order scheduler, which can only execute one thread at a time.
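The following Java sketch conveys the per-SPE coordination (illustrative only: it uses a single monitor with wait/notifyAll instead of LEAP's per-thread semaphores, takes LEAP's consistent thread identifier as a parameter, and elides the end-of-vector bookkeeping):

class SpeReplayer {
    private final long[] accessVector;  // recorded thread IDs, in order
    private int next = 0;               // index of the next expected access

    SpeReplayer(long[] accessVector) {
        this.accessVector = accessVector;
    }

    // Called before each access to this SPE in the replay version: the
    // calling thread blocks until the access vector says it is its turn.
    synchronized void beforeAccess(long threadId) throws InterruptedException {
        while (next < accessVector.length && accessVector[next] != threadId) {
            wait();
        }
    }

    // Called after each access: advance the turn and wake waiting threads.
    synchronized void afterAccess() {
        next++;
        notifyAll();
    }
}

Because each SpeReplayer instance guards only one SPE, threads accessing different SPEs never block each other.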
3.5 Evaluation
3.5.1 Evaluation methodology
We assess the quality of LEAP by quantifying both its recording overhead and the correctness of
the deterministic replay. To properly compare our technique to the state of the art, we have also
implemented the following techniques: the Dejavu approach based on the global clock [23], the
technique presented by InstantReplay [68], and the JaRec approach based on the Lamport clock
[66]. Because none of these tools are publicly available, we faithfully implemented them ac-
cording to their representative publications. Since JaRec is not a deterministic replay technique,
we extended its capability to track shared memory races, in order to make it comparable to
our technique.
For the evaluation, we first design a micro-benchmark to conduct controlled experiments for
quantifying various runtime characteristics of the evaluated techniques. We then use real com-
plex Java server programs and third-party benchmarks to assess the recording overhead of LEAP
in comparison to the related approaches. We use bug reproducibility to verify if our technique
can faithfully and deterministically reproduce problematic concurrent runs. All experiments are
conducted on two 8-core 3.00GHz Intel Xeon machines with 16GB memory and Linux version
2.6.22. We now present these experiments in detail.
[Figure: plot of execution time (ms) versus number of SPEs for Base, LEAP, Lamport, Global, and Instant]
FIGURE 3.3: The runtime characteristics of LEAP and other techniques on our micro-benchmark, with the number of SPEs ranging from 1 to 500. The micro-benchmark starts 10 threads running on 8 processors.
3.5.1.1 Micro-benchmarking
We designed a micro-benchmark to quantify the runtime characteristics of LEAP and the related
record and replay techniques. The benchmark consists of concurrent threads that randomly
update shared variables in a loop. For each experiment, we can control the number of threads
and shared variables. In our experiments, we set the number of threads from 1 to 100, and the
number of shared variables from 1 to 1000. We then measure the time needed for all the threads
to finish a fixed total number of updating operations under different settings.
Figures 3.3 and 3.4 show the runtime characteristics of LEAP and the related techniques on
our micro-benchmark. In the figures, Base refers to the native execution. Global, Lamport
and Instant refer to the recorded execution using global clock, Lamport clock and InstantReplay
respectively. Figure 3.3 shows that the performance of the LEAP instrumented version is close to
the base version. Fixing the number of threads to 10, as the number of SPEs increases from 10
to 500, LEAP is more than 10x faster than global clock, more than 5x faster than InstantReplay,
and at least 2x faster than Lamport clock. Global clock is the slowest among the four techniques.
The main reason is that the use of global clock requires a global synchronization on every shared
variable access, which significantly affects the degree of concurrency. Figure 3.4 shows a similar
performance trend as the number of threads increases from 10 to 80 and the number of SPEs is
fixed to 1000.
[Figure: plot of execution time (ms) versus number of threads for Base, LEAP, Lamport, Global, and Instant]
FIGURE 3.4: The runtime characteristics of LEAP and other techniques on our micro-benchmark, with the number of threads ranging from 1 to 80 running on 8 processors. The number of SPEs is set to 1000.
TABLE 3.1: The runtime overhead of LEAP and the state-of-the-art techniques.
Application  LOC    Total  SPE          SPESize  Log     LogCmp  LEAP  Lamport  Instant  Global
Avrora       93K    16003  1725 (11%)   113      30623   796     626%  1697%    1821%    1036%
Lusearch     69K    11497  1140 (9.9%)  75       7485    632     74%   308%     379%     227%
Derby        1.51M  48356  1433 (3.0%)  264      18545   113     9.9%  68%      113%     52%
Tomcat       535K   23046  654 (2.6%)   163      15351   51      7.3%  39%      44%      34%
MolDyn       864    821    634 (77%)    66       110761  37760   64%   2776%    3567%    9960%
MonteCarlo   3128   427    104 (24%)    18       70384   1994    7.5%  7.9%     8.6%     9.1%
RayTracer    1431   442    223 (50%)    19       124239  35878   18%   39%      43%      94%
3.5.1.2 Benchmarking with third-party systems
To perform an unbiased evaluation, we first use LEAP on two widely used complex server
programs, Derby and Tomcat, with the PolePosition (http://polepos.sourceforge.net) database benchmark and the SPECWeb-2005 (http://www.spec.org/web2005) web workload benchmark. Each benchmark starts with 10 threads and we measure the time for finishing a total number of 10000 operations. We also selected a suite of third-party programs, among which Avrora and Lusearch are from the dacapo-9.12-bach benchmark suite (http://dacapobench.org), and MolDyn, MonteCarlo and RayTracer are from the Java Grande multi-thread benchmark suite.
Table 3.1 shows some of the relevant static attributes of the benchmarked programs as well as the
associated runtime overhead of the evaluated record and replay techniques. We report the total
number of field variable accesses in the program (Total), the total number of instrumented SPE
accesses (SPE), the number of SPEs (SPESize), the log size (KB/sec) of the related approaches
(Log), the log size of LEAP (LogCmp), and the runtime overhead (LEAP, Lamport, Instant
and Global). Overall, the percentage of SPE accesses over the total number of field variable
accesses varies from less than 3% on Derby and Tomcat to around 10% on Avrora and Lusearch.
As MolDyn (77%), MonteCarlo (24%) and RayTracer (50%) are relatively small applications
dedicated to multi-threaded benchmarking, the percentage of their SPE accesses is large.
Log size By using our compact representation of the access vectors, the log size of LEAP is
much smaller than the related approaches, from 3x in MolDyn to as large as 164x in Derby.
We recognize that the log size of LEAP is still considerable, ranging from 51 to 37760 KB/sec. With
the increasing disk capacity and disk write performance, as also observed by other researchers
[100], moderate log size does not pose a serious problem. For long running programs, we can
reset logs through the use of checkpoints.
Recording overhead LEAP is the fastest on all the evaluated applications. It is more than 150x
faster than global clock on MolDyn. For Derby and Tomcat, LEAP is 5x to 10x faster than all
the related approaches. The sheer runtime overhead of LEAP on Derby and Tomcat is less than
10% (9.9% and 7.3%, respectively). LEAP’s overhead is large on Avrora (626%); the reason is that several SPEs in Avrora are frequently accessed in hot loops.
3.5.1.3 Concurrency bug reproduction
One of the major motivating forces for the record and replay technique is to help reproduce so-called Heisenbugs. We believe that the ability to deterministically reproduce a concurrency-
related bug is a strong indicator of the replay correctness, because it requires the program state
to be correctly restored for the bug to be triggered. To compare the bug reproducibility, we
have also implemented JaRec. We first compare LEAP and JaRec for their
capabilities of reproducing real-world concurrency bugs in complex server systems as well as
a number of benchmark bugs widely used in concurrency testing. To properly quantify bug reproducibility, we have also designed a bug injection technique that injects atomic-set violations
into our micro-benchmark. We then assess how many of the violations can be deterministically
reproduced by LEAP and JaRec.
TABLE 3.2: LEAP - summary of the evaluated real bugs
Bug Id       Version     LOC    Exception Type
Derby230     Derby-10.1  1.34M  DuplicateDescriptor
Derby1573    Derby-10.2  1.52M  NullPointerException
Derby2861    Derby-10.3  1.51M  NullPointerException
Derby3260    Derby-10.2  1.52M  SQLException
Tomcat728    Tomcat-3.2  150K   NullPointerException
Tomcat4036   Tomcat-3.3  184K   NumberFormatException
Tomcat27315  Tomcat-4.1  361K   ConcurrentModification
Tomcat37458  Tomcat-5.5  535K   NullPointerException
3.5.1.4 Random bug injection
Our bug injection technique is based on the problematic thread interleaving patterns presented
in [125]. We introduce 10 dummy shared variables into the program and divide them into 5
groups, each group representing an atomic set as defined in [125]. During the recording phase,
on each critical event, the thread also randomly performs a write or read access on one of the
introduced variables. We use the same random seed for each thread across record and replay.
After each random access, if one of the problematic thread interleaving patterns occurs, the
program stops and the replay data are exported. Given the same program input, a deterministic
replay technique should be able to recreate the bug pattern that occurred.
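The injection logic can be sketched as follows (an illustrative reconstruction, not our exact
implementation; all names are hypothetical). Each thread draws from a deterministically seeded
generator, so the injected accesses repeat exactly across record and replay:

    import java.util.Random;

    class BugInjector {
        static final int[] dummy = new int[10];  // the 10 dummy shared variables
        // one generator per thread; the seed is fixed and identical across
        // record and replay, so the injected access sequence is reproducible
        static final ThreadLocal<Random> rng =
            ThreadLocal.withInitial(() -> new Random(42L));

        // invoked by the instrumentation on each critical event
        static void onCriticalEvent() {
            Random r = rng.get();
            int v = r.nextInt(dummy.length);     // pick one of the variables
            if (r.nextBoolean()) {
                dummy[v] = v;                    // injected write
            } else {
                int ignored = dummy[v];          // injected read
            }
            // a separate checker then tests whether one of the problematic
            // interleaving patterns of [125] has just been completed
        }
    }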
To compare the concurrency bug reproducibility between LEAP and JaRec, we use 100 different
random seeds to inject 100 concurrency bugs into our micro-benchmark. For each run, we
initialize 10 threads in the program. LEAP is able to deterministically reproduce 100% of these
bugs, while JaRec cannot deterministically reproduce any of them. The reason is that JaRec
does not record shared memory races, while all these bug patterns are generated on shared
memory accesses.
3.5.1.5 Real and benchmark concurrency bugs
Tables 3.2 and 3.3 describe the real concurrency bugs and the benchmark bugs used in our
experiments. All 8 real bugs in Table 3.2 were reported by users and are extracted from the
Derby and Tomcat bug repositories (https://issues.apache.org). The 16 benchmark bugs in
Table 3.3 are from the IBM ConTest benchmark suite [31] and cover the major types of
concurrency bugs, including data races, atomicity violations, order violations, and deadlocks.
We also ran both JaRec and LEAP on these buggy programs to compare their bug reproducibility.
TABLE 3.3: LEAP - summary of the evaluated benchmark bugs
Bug Name            LOC   Bug Description
BubbleSort          362   Not-atomic, Orphaned-Thread
AllocationVector    286   Weak-reality, two stage access
AirlineTickets      95    Not-atomic interleaving
PingPong            272   Not-atomic
BufferWriter        255   Wrong or no-Lock
RandomNumbers       359   Blocking-Critical-Section
Loader              130   Initialization-Sleep Pattern
Account             155   Wrong or no-Lock
LinkedList          416   Not-atomic
BoundedBuffer       536   Notify instead of notifyAll
MergeSort           375   Not-atomic
Critical            73    Not-atomic
Deadlock            135   Deadlock
DeadlockException   255   Deadlock
FileWriter          311   Not-atomic
Manager             236   Not-atomic
For the 8 real-world concurrency bugs, LEAP is able to deterministically reproduce 7 of them
(88%), missing only the bug tomcat4036, while JaRec reproduced none of them. For the 16
benchmark bugs, LEAP can reproduce 13 of them (81%), missing BufferWriter, Loader, and
DeadlockException, while JaRec can only reproduce one of them (Deadlock). The reason
LEAP misses tomcat4036 is that the bug is triggered by races on the internal data of the
underlying JDK library java.text.DateFormat, which LEAP does not instrument. Because all
these real bugs are related to shared memory races, JaRec is not able to reproduce any of
them. Of the three benchmark cases LEAP cannot reproduce, two are related to random numbers,
and the other makes LEAP run out of memory because too many threads (>5000) are involved in
loops.
3.5.2 Discussion
The evaluation results clearly demonstrate the superior runtime performance of LEAP as well
as its much higher concurrency bug reproducibility, compared to existing approaches. Through
our experiments with real-world large multi-threaded applications, we observed several
limitations of LEAP that we plan to address in our future work:
Input nondeterminism As LEAP only captures the nondeterminism brought by thread inter-
leavings, it may not reproduce executions containing input nondeterminism, e.g., programs with
nondeterministic I/O. The two benchmark bugs that LEAP cannot reproduce both contain random
number generators that use the current system time as the random seed. Since the random
numbers are unlikely to stay the same across record and replay unless they are saved, LEAP may
not reproduce executions that contain such nondeterminism. A way to overcome these issues is
to save the program state of key nondeterministic events, e.g., the values of random seeds.
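A minimal sketch of this remedy (illustrative only; not part of the current LEAP implementation,
and SeedLog is a hypothetical helper) is to log each seed at record time and re-install it at
replay time:

    import java.util.Random;

    interface SeedLog {
        void append(long seed);  // record a seed during the recorded run
        long next();             // fetch the next recorded seed during replay
    }

    class ReplayableRandom {
        static boolean replaying;  // set by the record/replay runtime
        static SeedLog seedLog;

        static Random create() {
            long seed;
            if (replaying) {
                seed = seedLog.next();              // reuse the recorded seed
            } else {
                seed = System.currentTimeMillis();  // the nondeterministic seed
                seedLog.append(seed);               // save it for replay
            }
            return new Random(seed);
        }
    }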
JDK library LEAP does not record shared variable accesses in the underlying JDK library.
If an execution contains races on the internal data of these APIs, LEAP might not be able to
reproduce it. The bug tomcat4036 is an example of this limitation. In principle, we could also
instrument the underlying Java runtime, but because the JDK library is used so frequently,
doing so would incur a large runtime overhead. An implementation of LEAP inside the JVM should
alleviate this issue, as the JVM environment enables efficient tracing of the internal data
of the JDK library.
Long running programs LEAP currently has to replay from the beginning of the program
execution. For long running programs, it might not be convenient to replay the whole execution,
given the long replay time and the large log size. A lightweight checkpoint scheme would be
helpful in such scenarios, as LEAP could then replay the program only from the last checkpoint
to the recording end point.
3.6 Summary
We have presented LEAP, a new local-order based approach that deterministically replays con-
current program executions on multi-processors with low overhead. Our basic idea is to capture
the thread access history of each shared variable, and we use theoretical models to guarantee
its correctness. We have implemented LEAP as an automatic program transformation tool that
provides deterministic replay support for arbitrary Java programs. To evaluate our technique,
we used both benchmarks and real-world concurrent applications, extensively quantifying the
runtime overhead of LEAP as well as the correctness of LEAP-based replay through reproducing
concurrency bugs. Our evaluation shows that, compared to the state of the art, LEAP incurs
lower runtime overhead and has a much superior capability of correctly reproducing concurrency
bugs. For the real-world applications we evaluated, the overhead of LEAP is under 10%,
exhibiting great potential for production use.
Chapter 4
Persuasive Prediction of Concurrency Access Anomalies
Predictive analysis is a powerful technique that exposes concurrency bugs in unexercised pro-
gram executions. However, current predictive analysis approaches lack the persuasiveness prop-
erty, as they offer little assistance in helping programmers fully understand the execution
history that triggers the predicted bugs. We present a persuasive bug prediction technique as
well as a prototype tool, PECAN, for detecting general access anomalies (AAs) in concurrent
programs. The main characteristic of PECAN is that, in addition to predicting AAs in a more
general way, it generates concrete executions that deterministically expose the predicted AAs.
The key ingredient of PECAN is an efficient offline schedule generation algorithm, with a proof
of soundness, that guarantees to generate a feasible schedule for every real AA in programs
that use locks in a nested way. We evaluate PECAN using twenty-two multi-threaded subjects,
including six large concurrent systems, and our experiments demonstrate that PECAN is able to
effectively predict and deterministically expose real AAs. Several serious and previously
unknown bugs in large open source concurrent systems were also revealed in our experiments.
4.1 Introduction
Access anomalies (AAs) are a class of concurrency bugs characterized by criteria such as data
races [109], atomicity violations [34], and atomic-set serializability violations (ASVs) [125].
Among the broad spectrum of concurrency bug detection techniques that have proliferated in
recent years [15, 17, 35, 37, 43, 58, 64, 79, 80, 86, 89, 98, 99, 110, 115, 118], the technique of
predictive trace analysis (PTA) has drawn significant research attention [22, 33, 128, 130, 131,
133].
Generally speaking, a PTA technique records a trace of execution events, then statically
(often exhaustively) generates other permutations of these events under certain scheduling
constraints, and exposes concurrency bugs unseen in the recorded execution. PTA is powerful
because, compared to dynamic analysis, it is capable of exposing bugs in unexercised executions
and, compared to static analysis, it incurs far fewer false positives, since its static
analysis phase uses the concrete execution history.
A bug detection technique is more useful if it is persuasive. This new criterion emphasizes
that a bug detection technique should not only localize the bug in the source code but also,
more importantly, help programmers fully understand how the bug occurred, so that they can
provide good fixes (a recent report [140] shows that as many as 39% of concurrency bug fixes
are bad fixes, either failing to fix a bug or creating new bugs). We characterize
persuasiveness by two key properties. First, a persuasive technique should report violations
with no false positives. Since it is non-trivial to manually verify false alarms in large
sophisticated concurrent systems, the perceived usefulness of the technique quickly
deteriorates with even a small number of false positives. Second, a persuasive technique
should also show programmers how the detected bugs or violations can occur, by accompanying
each violation with a concrete execution that deterministically exposes the bug. We believe
that allowing programmers to deterministically trigger the bug is one of the most effective
ways to achieve complete bug comprehension.
Assessed by the persuasiveness criterion, state-of-the-art PTA techniques [22, 33, 128, 130,
131, 133] are unsatisfactory in generally addressing access anomalies in real-life complex
concurrent programs. Although several recent works [128, 130, 131] have pointed out the
usefulness of persuasiveness, it is still not clear how to efficiently create a concrete
execution that can expose the predicted anomalies in real programs. In addition, despite their
much improved soundness compared to static analysis, current PTA techniques still report quite
a number of false positives, either due to the inadequacy of their prediction models or the
incompleteness of the collected traces. For example, because detecting data races in general
is NP-hard [95], many race detectors [29, 96, 102, 110] employ, for efficiency reasons, an
over-approximated prediction model that combines the lockset-based algorithms [109] with the
happens-before based approaches [66]. Moreover, for PTA techniques, a certain type of false
positive simply cannot be avoided when programmers use application-level synchronization
mechanisms, such as barrier and flag operations. These “non-standard” synchronization
mechanisms are difficult to discover automatically [123] and, in turn, result in incomplete
traces.
We present PECAN, a novel persuasive PTA technique that detects general access anomalies
(AAs) in concurrent programs. Unlike other PTA techniques [22, 33, 118] that cater to spe-
cific types of concurrency bugs, PECAN offers a general prediction model that addresses a
much broader class of concurrent access anomalies. Moreover, for each predicted AA, PECAN
generates “bug hatching clips” that deterministically instruct the input program to exercise the
predicted AAs. PECAN does not present false positives to programmers as we guarantee that
each clip represents a feasible concrete execution. Since all AAs reported are real and the pro-
grammers are given the full history and context information to understand the bug, we believe
PECAN can dramatically expedite the process of bug fixing.
The key technical challenge we face is how to statically generate a feasible thread execution
schedule that exposes the predicted AAs. We present an algorithm, with a proof of its
soundness, that guarantees to generate a feasible schedule for every real AA in programs that
use locks in a nested way, i.e., release locks in the reverse order of their acquisition.
Moreover, to
predict AAs in a general way, we present a general specification model of AAs and reduce
the AA prediction problem to a graph pattern search problem. With compact encoding of the
happens-before relationship between the events and the scheduling order of memory accesses in
the trace, the graph supports efficient pattern search of AAs, enabling PECAN to scale well to
large traces.
The salient property of persuasiveness is also highly valued and explored by other classes of
techniques such as active testing [59, 64, 98, 110] and model checking [18, 62, 86, 113, 129].
In particular, RaceFuzzer and AtomFuzzer [98, 110] explore dynamically and are thus capable of
creating concrete executions that expose real races, by actively controlling a race-directed,
randomized thread scheduler. Chess also systematically explores the thread scheduling space at
runtime to find concurrency bugs. As a PTA technique, the goal of PECAN is to provide
generalized support of persuasiveness for concurrency access anomalies.
We have implemented PECAN for Java programs and conducted extensive experiments to evaluate
it. Three common types of AAs are investigated: data races, atomicity violations, and ASVs.
Our evaluation results show that PECAN is able to effectively and efficiently predict and
deterministically create real AAs in all twenty-two evaluated subjects, including six large
multi-threaded applications. PECAN achieves a 100% success ratio in creating the predicted AAs
in more than half of the subjects. For the other subjects, the success ratio ranges from 0.25
to 0.93 (due to the reported false AAs). Several serious and previously unknown bugs were also
revealed by PECAN in large open source concurrent systems such as OpenJMS and Jigsaw.
Moreover, PECAN scales well: for instance, it can analyze a trace in Derby with more than
447K events in around 6 seconds. The PECAN prototype and the detected replayable bugs in our
experiments are publicly available at http://www.cse.ust.hk/prism/pecan/.
The rest of this chapter is organized as follows: Section 4.2 presents an overview of PECAN;
Section 4.3 presents the pattern specification of general access anomalies; Section 4.4
presents the graph-based prediction model; Section 4.5 presents the search algorithm based on
the graph model; Section 4.6 presents the schedule generation algorithm; Section 4.7 presents
the implementation and evaluation of PECAN; Section 4.8 summarizes this chapter.
4.2 PECAN in a Nutshell
To make our technique more comprehensible, we first use the simple example in Figure 1.1 to
illustrate the AA detection process of PECAN. Let us use the line number as the identifier of the
statement. There are three data races in the program. The races are between statements (2,5),
(2,6) and (3,7). Among the three real races, the race (3,7) is more important because it
might trigger the ERROR at line 4.
PECAN addresses the above problem using the following steps:
1. We first collect traces of interesting events during the program execution.
2. We extract from the trace a partial and temporal order graph (PTG) that encodes the
information about the happens-before relationship between the events, the atomic blocks, and
the scheduling order of memory accesses.
3. We perform a pattern-directed search on the PTG for matches of the general AA patterns
w.r.t. the program constraints.
4. Taking the original trace and the search results as the input, we statically generate a
thread schedule for each predicted AA.
5. We use a deterministic replayer [48] to re-execute the program and expose the predicted AAs
according to the generated schedules.
Coming back to our example, suppose the collected execution trace is <1,5,2,3,6,7>. In
Step 3, PECAN will detect that (3,7) is a possible race and then, in Step 4, PECAN is able
to generate the thread schedule <1,5,2,6,7,3> that deterministically directs the replayer to
expose this race and to trigger the ERROR in Step 5. From the user’s perspective, the whole pro-
cess is automatic and requires no additional user intervention. We note that, like other PTA
techniques, our analysis requires that the error-inducing events (3,7) appear in the input
trace, which might not always happen. In practice, we can use techniques such as RaceFuzzer
[110] to compensate for this deficiency (we come back to this issue in Section 4.7.3).
In the following sections, we go under the hood of our technique to discuss the pattern language
we use to specify the general AAs (Section 4.3), the graph prediction model (PTG) we use to
represent the AA prediction problem (Section 4.4), the pattern search algorithm for locating the
AAs on the graph model (Section 4.5), and the schedule generation algorithm for generating the
thread schedule for each predicted AA (Section 4.6).
4.3 Pattern Specification of Access Anomalies
The most commonly known AAs include data races, atomicity violations, and atomic-set seri-
alizability violations (ASVs). These anomalies are sequences of two to four events generated
by two different threads on one or two shared variables. In our prediction model, we
generalize the concept of an AA to allow an arbitrary number of events, threads, and shared
variables, and we describe each type of AA as an event sequence pattern.

FIGURE 4.1: General access anomaly patterns

                                  data race   atomicity violation   ASV
The event sequence (E):           e1-e2       e1-e2-e3              e1-e2-e3-e4
The thread scheduling (T):        t1-t2       t1-t2-t1              t1-t2-t2-t1
The shared variables (SV):        s1-s1       s1-s1-s1              s1-s1-s2-s2
The atomic regions (AR):          u1-u2       u1-u2-u1              u1-u2-u2-u1
The access types (AT):            r-w         r-w-r                 w-r-r-w
An AA pattern p consists of a group of equal-length sequences [E,T,SV,AR,AT] (a minimal code
representation is sketched after the list below). The meaning of each symbol is as follows:
• E: the event sequence defined by the pattern.
• T: the thread scheduling order corresponding to E, i.e., the event E[i] is by the thread
T[i].
• SV: the accessed shared variable sequence corresponding to E, i.e., the event E[i] ac-
cesses the shared variable SV[i].
• AR: the atomic region sequence corresponding to E, i.e., the event E[i] belongs to the
atomic region AR[i].
• AT: the access type sequence corresponding to E, i.e., the access type of E[i] is AT[i]
which is either a read or a write: AT[i] ∈ {r,w}.
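The following is a minimal sketch of one possible representation of such a pattern (the class
and field names are hypothetical, not PECAN's actual data structures):

    // Sketch: an AA pattern as five equal-length parallel arrays.
    class AAPattern {
        final int[]  events;       // E:  pattern event identifiers
        final int[]  threads;      // T:  threads[i] executes events[i]
        final int[]  variables;    // SV: events[i] accesses variables[i]
        final int[]  regions;      // AR: events[i] belongs to regions[i]
        final char[] accessTypes;  // AT: 'r' or 'w' for each event

        AAPattern(int[] e, int[] t, int[] sv, int[] ar, char[] at) {
            assert e.length == t.length && t.length == sv.length
                && sv.length == ar.length && ar.length == at.length;
            events = e; threads = t; variables = sv;
            regions = ar; accessTypes = at;
        }
    }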
Figure 4.1 shows example patterns of the three commonly known AAs. Clearly, the specification
of AA patterns above is general enough to describe all three of them. Moreover, this general
pattern model allows users to define their own AA patterns, which may contain much more
complex thread interleavings. Nevertheless, since all complex AA patterns can in fact be
composed from these three basic ones (as proved in [125], a set of eleven ASV patterns forms a
complete set of all the problematic thread interleaving scenarios w.r.t. atomic sets and units
of work), we focus on explaining them in this section.
We next discuss these three basic AAs and describe them using the general pattern
specification. Since they contain a dozen patterns in total, for brevity, we show only one
representative pattern for each of them. The other patterns are similar.
Data race A data race occurs when two threads concurrently access the same data without proper
synchronization and at least one of the accesses is a write. We can thus describe it as:
E=e1-e2, T=t1-t2, SV=s1-s1, AR=u1-u2, and AT=r-w, meaning that the first thread reads a shared
variable and immediately afterwards the second thread writes to it. Note that data race
patterns require the two events to happen consecutively, while this condition is unnecessary
for atomicity violations and ASVs.
Atomicity violation An atomicity violation happens when the desired serializability among
multiple memory accesses to a single memory location is violated. Suppose a memory location is
accessed by three consecutive events ei, ek, and ej, in this order, where ei and ej belong to
the same atomic region while ek belongs to another. An atomicity violation with the access
type sequence “write-read-write” can be written as E=e1-e2-e3, T=t1-t2-t1, SV=s1-s1-s1,
AR=u1-u2-u1, and AT=w-r-w.
ASV Atomic-set serializability is a criterion for enforcing the serializability of units of
work that deal with atomic sets. An atomic set is defined to be a set of memory locations that
together satisfy some consistency property. For example, let Wu(m) (Ru(m)) represent a write
(read) access to a memory location m by a unit of work u, and suppose m1 and m2 belong to the
same atomic set. The execution sequence “Wu(m1)-Ru′(m1)-Ru′(m2)-Wu(m2)” causes an ASV, as the
two consecutive writes to m1 and m2 by u are interleaved by two reads of these memory
locations by u′, another unit of work, resulting in inconsistent reads. We describe this
pattern as E=e1-e2-e3-e4, T=t1-t2-t2-t1, SV=s1-s1-s2-s2, AR=u1-u2-u2-u1, and AT=w-r-r-w. In
our implementation, we consider each atomic region as a unit of work, and all memory locations
accessed in the same atomic region belong to the same atomic set.
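Using the sketch representation given after the pattern definition above, this ASV pattern of
Figure 4.1 would, for instance, be instantiated as follows (illustrative only):

    // The four-event ASV pattern of Figure 4.1:
    // E=e1-e2-e3-e4, T=t1-t2-t2-t1, SV=s1-s1-s2-s2, AR=u1-u2-u2-u1, AT=w-r-r-w
    AAPattern asv = new AAPattern(
        new int[]  { 1, 2, 3, 4 },           // e1, e2, e3, e4
        new int[]  { 1, 2, 2, 1 },           // t1, t2, t2, t1
        new int[]  { 1, 1, 2, 2 },           // s1, s1, s2, s2
        new int[]  { 1, 2, 2, 1 },           // u1, u2, u2, u1
        new char[] { 'w', 'r', 'r', 'w' });  // w-r-r-w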
4.4 Graph Prediction Model
Our approach to the general AA prediction problem is to reduce it to a graph search problem.
We start by formalizing the permutation constraints. We then describe our formulation as a
graph mutation and pattern search problem.
4.4.1 Constraint Model
Precisely detecting access anomalies in general is computationally intractable [95]. To achieve
efficiency in predicting AAs, similar to many race detection techniques [29, 110], we use a
hybrid constraint model [96] that combines the lockset condition [109] and the happens-before
relation [66]. Specifically, the hybrid model defines that two events ei and ej are independent
iff:
1. they do not hold a common lock (l(i) ∩ l(j) == ∅);
2. they do not have a POR relation (recall Definition 2.4) between each other (ei ↛ ej and
ej ↛ ei).
Notice that the hybrid constraint model we use is a conservative approximation of the precise
model for checking independence between events [94]. It is therefore a possible source of
false warnings reported by PECAN during the pattern search. Nevertheless, these false warnings
can be automatically pruned during the re-execution phase (see Section 4.6.3) and hence do not
affect the final results delivered to the end user.
4.4.2 The AA Prediction Problem
The essential idea behind AA prediction is that the independent events in the trace can be
rearranged, simulating the effects of thread scheduling. Therefore, even if an AA is not
directly witnessed in the trace, as long as it can be manifested in some feasible permutation
of the trace, we can locate it and expose it with a concrete execution. This idea originates
in Lipton's theory of reduction [72] and has been exploited by many concurrency bug detection
approaches [34, 37, 133].
Our general objective is to search for all the AAs that satisfy some given patterns in an
execution trace or in any of its feasible permutations allowed by our constraint model defined
in Section 4.4.1. We model this problem as a graph pattern search and mutation problem. Before
giving a formal problem definition, let us first define the graph model:
Definition 4.1. The Temporal Order Relation (TOR) ei ⇢ ej holds if events ei and ej are
consecutive accesses on the same shared memory location and ei occurs before ej .
Definition 4.2. A Partial and Temporal Order Graph (PTG) is a graph G(V,E) where V is
a set of nodes and E is a set of edges. Each vi ∈ V corresponds to the event ei in the trace.
Each edge e is either solid (→) or dashed (⇢), corresponding to the POR and TOR between the
events, respectively.
The PTG can be mutated by interchanging the nodes connected by dashed edges w.r.t. the POR
and the lockset condition. For brevity, we call these two conditions the mutation condition,
and we refer to the mutated PTGs as vPTGs.
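A minimal sketch of one possible in-memory representation of the PTG (hypothetical names; it
ignores the compact encoding discussed in Section 4.5.1):

    import java.util.ArrayList;
    import java.util.List;

    class PTG {
        static class Node {
            final int eventId;                             // index in the trace
            final List<Node> porSucc = new ArrayList<>();  // solid out-edges (POR)
            final List<Node> torSucc = new ArrayList<>();  // dashed out-edges (TOR)
            Node(int eventId) { this.eventId = eventId; }
        }

        final List<Node> nodes = new ArrayList<>();        // one node per event

        void addPOR(Node from, Node to) { from.porSucc.add(to); }
        void addTOR(Node from, Node to) { from.torSucc.add(to); }
    }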
Based on the PTG, the AA patterns can be conveniently formulated as propositional formulas
between the nodes in the PTG. Our goal is to find all the AAs on the vPTGs that satisfy the
user-specified patterns.
4.5 Graph Pattern Search
Since the number of vPTGs is exponential and the trace could be very large, it is inefficient
to perform pattern search on every individual vPTG. We use two primary techniques to achieve
efficiency. First, we have developed a compact encoding of the PTG. Second, we perform
pattern-directed graph mutations on the fly based on the intermediate search results, which
avoids separate mutation steps.
4.5.1 Compact Encoding of PTG
We have two main techniques for compactly encoding the PTG. First, to facilitate efficient pat-
tern search, we build separate indices of events based on thread ID, memory location, access
type and atomic region. Second, to scale to large traces, we do not maintain the full POR but,
instead, maintain only the relations between the thread communication (TC) events, i.e., fork,
join, notify, and wait events. Since the TC events are the only sources of the POR between
events across different threads, we use them to compute the POR for all the other events on
demand. By this approach, we reduce the space cost from quadratic in the trace size to linear
in the trace size and quadratic in the number of TC events. The number of TC events is usually
much smaller than the entire trace size.
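Under this encoding, a cross-thread POR query can be answered on demand roughly as follows (a
sketch with hypothetical names; tcEventsOf and tcBefore stand for the per-thread TC event
index and the precomputed, transitively closed relation among TC events):

    abstract class PORQuery {
        static class Event {
            final int thread, index;  // index = position in its thread's order
            Event(int thread, int index) { this.thread = thread; this.index = index; }
        }

        abstract Iterable<Event> tcEventsOf(int thread);  // TC events of a thread
        abstract boolean tcBefore(Event a, Event b);      // closed TC relation

        boolean por(Event e1, Event e2) {
            if (e1.thread == e2.thread)
                return e1.index < e2.index;               // program order
            // A cross-thread POR must pass through TC events: e1 precedes some
            // TC event a of its own thread, a happens before some TC event b of
            // e2's thread, and b precedes e2.
            for (Event a : tcEventsOf(e1.thread))
                if (e1.index <= a.index)
                    for (Event b : tcEventsOf(e2.thread))
                        if (tcBefore(a, b) && b.index <= e2.index)
                            return true;
            return false;
        }
    }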
4.5.2 Pattern-Directed Search
In general, given a pattern described in the specification model in Section 4.3, our pattern search
algorithm first computes the number of threads, the number of shared variables, and the number
of events by each thread in the same atomic region on each shared variable. Our algorithm
then uses this information to search the indexed PTG and obtain a set of candidate AAs. A
candidate AA may not match the thread scheduling order T specified in the pattern, in which
case the mutation condition is applied to check whether there exists an allowed permutation
of nodes in the PTG that makes the matching possible. We next give detailed explanations for
the
data race, atomicity violation, and ASV patterns.
Data race Recall that each pattern of data race contains two events satisfying the conditions
defined in Section 4.3. We thus follow the dashed edges on the PTG and examine every candidate
node pair that could possibly satisfy the conditions. If a node pair (vi,vj) matches the temporal
order (i.e., the two nodes are connected by a dashed edge), we report it as a real AA. Otherwise,
we check if the PTG can be mutated for the node pair to match the temporal order. The function
canSatisfyByMutation(vi,vj) (Algorithm 1) is used to check this condition.
Algorithm 1 canSatisfyByMutation(vi, vj)
Ensure: i < j
 1: return (l(i) ∩ l(j) == ∅ && !POR(vi, vj))

Algorithm 2 canSatisfyByMutation(vi, vk, vj)
Ensure: i < j < k
 1: for all vx ∈ [vi+1, vi+2, ..., vj] do
 2:   if canSatisfyByMutation(vx, vk) then
 3:     return true
 4: return false
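In Java, the two checks amount to the following (a direct transliteration of Algorithms 1 and
2; Node, lockset, por, and nodesBetween are assumed helpers rather than PECAN's actual API):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    abstract class MutationCheck {
        static class Node { }                              // a PTG node

        abstract Set<Object> lockset(Node v);              // locks held at v
        abstract boolean por(Node v1, Node v2);            // POR between events
        abstract List<Node> nodesBetween(Node a, Node b);  // trace order a+1..b

        // Algorithm 1: the pair can match by mutation iff the two events hold
        // no common lock and are not ordered by the POR.
        boolean canSatisfyByMutation(Node vi, Node vj) {
            Set<Object> common = new HashSet<>(lockset(vi));
            common.retainAll(lockset(vj));
            return common.isEmpty() && !por(vi, vj);
        }

        // Algorithm 2: for a candidate triple (vi, vk, vj), check whether vk
        // can be moved before some node vx between vi and vj.
        boolean canSatisfyByMutation(Node vi, Node vk, Node vj) {
            for (Node vx : nodesBetween(vi, vj))
                if (canSatisfyByMutation(vx, vk))
                    return true;
            return false;
        }
    }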
Atomicity violation and ASV The search algorithms for atomicity violation and ASV patterns
are similar to that for data races, with the main difference lying in checking the mutation
condition.
Because each atomicity violation (ASV) pattern contains three (four) nodes, we need to check
the mutation condition for more pairs of nodes. Without loss of generality, we use the example
in Figure 4.2 to illustrate the mutation condition (Algorithm 2) for checking candidate atomicity
violations. Suppose we have already found the candidate triple (v2, v8, v5) by traversing the
events by the two threads on the shared memory location x. As the temporal order of this triple
does not directly satisfy the atomicity violation pattern, we next check if it can be satisfied in any
of the other vPTGs, i.e., if v8 can be placed in any position between v2 and v5 without violating
the POR and the lockset condition. Our algorithm thus tries to find a position between v2 and
v5, say vx, such that there is no POR between v8 and vx and they are not protected by a common
lock, i.e., the lockset condition. Finally we find vx = v4 and thus report this AA.
FIGURE 4.2: Example of searching atomicity violations

    thread1 (events 1-6 form one atomic region):    thread2:
    1. lock(l)
    2. read x
    3. unlock(l)
    ...
    4. lock(l)
    5. read x
    6. unlock(l)
                                                    7. lock(l)
                                                    8. write x
                                                    9. unlock(l)
4.6 Schedule Generation
For each predicted AA, PECAN statically generates a corresponding thread schedule that is used
to deterministically direct an execution for exposing the AA. This problem is highly nontrivial
and there are several challenges to be addressed:
1. Given an AA, regardless of whether it is real or false, how do we generate a schedule that
can manifest it?
2. For each AA, there might be multiple corresponding schedules. Which one should we generate?
3. For real AAs, how do we make sure the generated schedules are feasible, i.e., that they can
expose the real AAs?
In the following text, we first present our schedule generation algorithm and discuss how it
addresses the above challenges. Then we formally prove that, for programs using nested locks,
our algorithm guarantees to generate a feasible schedule for every real AA. For false AAs,
although our algorithm may also generate infeasible schedules, we show in Section 4.6.3 that
these false AAs can be automatically pruned away during the re-execution phase.
4.6.1 How to Generate a Feasible Schedule?
The basic idea of our schedule generation algorithm is to transform the original trace by chang-
ing the relative order of independent events, i.e., moving the related events to different positions
in the trace. The main challenge is that we need not only to make sure the transformed trace
can manifest the AA, but also to guarantee it is feasible (i.e., does not violate the program con-
straints). However, as there is an exponential number of ways to transform the trace, it is
very inefficient to exhaustively generate every possible schedule and verify its feasibility
by checking the constraints. Figure 4.3 shows a simple trace in which the nodes v1,v2,v3 and
v4,v5,v6 belong to two different threads, and the POR and TOR are represented by solid and
dashed edges, respectively. Suppose (v2,v5) is a real race pair. There are many possible
rearrangements of the nodes in which we can place v2 and v5 next to each other, but only some
of them are feasible schedules. For instance, if we naively move v2 to the position before v5,
we will get an infeasible schedule δ′, in which the relative order between v2 and v3 violates
the POR.
We have the following tactics to reduce the computational complexity of the schedule gener-
ation: First, although there might be many feasible schedules that manifest a real AA, it is
sufficient for us to generate one of them. Second, since the original trace is a feasible schedule
(i.e., satisfies the program constraints), when we permute the original trace (e.g., move a node
to a different position), we only need to make sure the changed portion does not violate the con-
straints w.r.t. the entire trace. Third, since it is sufficient for the resulting schedule to manifest
the violation, we can remove from the schedule the nodes that are placed beyond the violation
creation point.

FIGURE 4.3: An example of schedule generation (the original trace over nodes v1-v6, an
infeasible schedule δ′, and a feasible schedule δ″)
With these tactics, the whole schedule generation process becomes clear and straightforward.
The key problem is how to satisfy the program constraints when permuting the nodes. There
are basically three types of program constraints: the POR, the lock constraint, and the program
control constraint. The lock constraint requires that, at any time of the program execution,
a lock cannot be held by more than one thread. The program control constraint is related to
the execution order determined by the evaluation results of program control statements. For
real AAs, we can ignore the program control constraint as the evaluation results of program
control statements should be unchanged if we move the violation node to a correct position that
manifests the AA; otherwise, the AA is not real. We next discuss how our algorithm respects
the POR and the lock constraint.
Satisfying the POR is relatively simple. The key point is that we should not only move the
violation node to the correct position so that the violation pattern can be satisfied, but
also move the nodes that are dependent on, or have PORs with, the violation node to their
correct
positions. Returning to the example in Figure 4.3, we generate a correct schedule δ″ by first
moving v2 and v3 (because v3 is dependent on v2) to the position next to v5, and then removing
v3 and v6 from the schedule (because v3 and v6 are beyond the violation creation point).
Satisfying the lock constraint is much more complicated. We first use an example to illustrate
the challenge and then describe our approach for addressing it.
Example In Figure 4.4, the race pair (v3,v8) satisfies our relaxed mutation constraints, i.e.,
v3 and v8 are not protected by a common lock and there is also no POR between them. Therefore, it
would be reported as a possible race pair by our pattern search algorithm. However, it is a false
warning: it is impossible for v3 and v8 to happen next to each other in any feasible schedule, as
there is a POR between v2 and v5. For this false violation, if we only consider the POR in the
schedule generation, we would generate an infeasible schedule <v1,v2,v5,v6,v7,v3,v8> that
violates the lock constraint. This is acceptable, since the false violation can be pruned in
the re-execution phase. The problem, however, is that if we remove the partial order relation
from v2 to v5, so that (v3,v8) becomes a real race, this schedule is still infeasible.

FIGURE 4.4: An example illustrating the difficulty of satisfying the lock constraint in
schedule generation. The race pair (v3,v8) is a false warning, though it satisfies both the
POR and the lockset condition.

    thread1:          thread2:
    v1. lock(l)       v5. ...
    v2. ...           v6. lock(l)
    v3. read x        ...
    v4. unlock(l)     v7. unlock(l)
                      v8. write x
The root cause of the above problem is that, in moving the nodes dependent on the to-be-moved
violation node (v3 in Figure 4.4), we have moved an unlock node (v4) but not its corresponding
lock node (v1), causing the resulting schedule to violate the lock constraint. To address this
problem, whenever we move an unlock node, we should also make sure its corresponding lock node
is moved to a correct position. Thus, in addition to the steps illustrated in Figure 4.3, our
algorithm also looks for the outermost lock (OML) node protecting the to-be-moved violation
node, and moves all the nodes dependent on the OML node to their correct positions. For the
example in Figure 4.4, we first find v1 (the OML node) and move v1, v2, and v3 (the nodes
dependent on the OML node) to the positions before v8; then we move v4 to the position after
v8 and remove it afterwards. Finally, we get a feasible schedule <v5,v6,v7,v1,v2,v3,v8>.
Algorithm 3 ScheduleGeneration(vi, vj)
Require: i < j
 1: Let vl be the outermost lock node that is protecting vi
 2: Move all the nodes dependent on vi to the positions after vj
 3: if vl is not NULL then
 4:   Move vl and all the nodes from vl to vi that are dependent on vl to the positions before vj
 5: else
 6:   Move vi to the position immediately before vj
 7: Remove all nodes after vj
Algorithm 3 summarizes our schedule generation algorithm for data race patterns. The goal
is to generate a feasible schedule in which vi and vj are placed next to each other. Since all
it does is move a sequence of nodes to different positions, the worst-case time complexity of
this algorithm is linear in the length of the trace. The algorithms for the other AA patterns,
such as atomicity violation and ASV patterns, are in a similar style, though they may require
moving more nodes if the pattern contains three or more events. For example, Algorithm 4 shows
our algorithm for atomicity violation patterns, which contain an event triple (vi,vk,vj). The
goal of the algorithm is to generate a feasible schedule in which vk is placed between vi and
vj. Without loss of generality, let us consider the case i < j < k. Recall that, in reporting
every potential atomicity violation in the pattern search phase, we have found a node vx that
is between vi and vj and satisfies the mutation condition with vk. This means that in some
feasible schedule vk can be placed before vx. We thus generate such a feasible schedule in the
safest and simplest way: move all the nodes from vx to vj in the original trace that are
dependent on vx to the positions after vk. The movement of nodes simply follows the same rule
as that in the algorithm for data race patterns.
Algorithm 4 ScheduleGeneration(vi, vk, vj)
Require: i < j < k
 1: Find vx in canSatisfyByMutation(vi, vk, vj)
 2: Let vl be the outermost lock node that is protecting vx
 3: if vl is not NULL then
 4:   Move vl and all the nodes from vl to vx−1 that are dependent on vl to the positions before vk
 5: else
 6:   Move all the nodes from vx to vj that are dependent on vx to the positions after vk
 7: Remove all nodes after vj
4.6.2 What Can Our Algorithm Guarantee?
Theorem 4.3. For programs that use locks in a nested way, i.e., releasing locks reverse to the
acquisition order, our schedule generation algorithm will produce a feasible schedule for every
real AA.
Proof. Since the essential idea of the schedule generation is event permutation, i.e., moving
events or event sequences in the original trace from one place to another, to prove the
correctness in general (for any AA) it is sufficient to prove the correctness of the most
basic step: moving a single event. Now let us pick a race pair (vi,vj) with i < j for the
proof. Suppose (vi,vj) is a real race but the schedule generated by Algorithm 3 is infeasible.
In the following, we prove by contradiction that this is impossible.
Because the schedule is infeasible, it must have either violated the POR, the program control
constraint, or the lock constraint. For the POR, because Algorithm 3 only changed the temporal
order between vj and the nodes that were moved to the positions after vj , i.e., nodes dependent
on vi, the only possible POR the generated schedule may violate is between vj and the nodes
that are dependent on vi. However, for any such POR, say vx → vj, we must have vi → vx and
hence vi → vj, which contradicts the condition that there is no POR between vi and vj, a
condition that must be satisfied for our algorithm to report this AA. Besides, the schedule
cannot violate the program control constraint either; otherwise the race would be infeasible.
Thus, it is impossible for the generated schedule to violate the POR or the program control
constraint.
We next prove that it is also impossible to violate the lock constraint. If the schedule
violates the lock constraint, then there must exist an unmatched lock and unlock node pair,
i.e., a lock node and its corresponding unlock node that are interleaved by another lock or
unlock node. However,
because the original trace satisfies the lock constraint, there are only two possible reasons for
this result: (I) we incorrectly moved the interleaved lock or unlock node to a position between
the lock and the unlock node; (II) we incorrectly moved the unlock node to a position after the
interleaved node. Case I is impossible because it violates the lockset condition which should be
satisfied for our algorithm to report this AA. For case II, we show it is also impossible if there
are only nested locks in the original trace. First, because our algorithm only moves those nodes
that are dependent on the outermost lock (OML) node that is protecting the violation node, if
we had ever moved an unlock node, this unlock node should be dependent on the OML node.
Additionally, if there are only nested locks in the trace, the corresponding lock node of this
unlock node should also be dependent on the OML node, otherwise the OML node would not
be the outermost lock node. Thus, if we had ever moved an unlock node, we should have also
moved its corresponding lock node to a correct position. So case II is also impossible.
4.6.3 Pruning False Warnings
Note that our schedule generation algorithm is sound but incomplete, i.e., it may generate
infeasible schedules for false violations. Nevertheless, we are able to automatically prune
all the false AAs during the re-execution phase. Specifically, during the re-execution, we
control the thread scheduling to strictly follow the input generated schedule by matching the
events between the two schedules. When we observe that some thread has executed a new event
that does not match the corresponding event in the input schedule (meaning the thread has
taken a branch different from the originally observed execution), or when the re-execution
hangs due to a deadlock, we immediately stop the re-execution and report the AA as a false
violation. In this way, as we only report successful re-executions, we are able to prune all
the false violations.
4.7 Evaluation
We have implemented PECAN based on LEAP. To evaluate PECAN, we use a set of popular subjects
(Table 4.1) used in benchmarking concurrency defect analysis techniques [64, 98, 110], as well
as a number of large multi-threaded Java applications. In all our experiments, we collect a
normal execution trace for each program with a fixed configuration setting and program input.
To represent the trace, we maintain a vector that records a global order of all the events.
For all the
events, we record their access type, thread ID, and the accessed memory ID at runtime. The
lock set and the atomic region information are computed offline to save runtime cost. Like
[36], we process re-entrant locks internally in the trace collection phase and do not expose
them in the resultant trace.
For each generated schedule, we re-execute the program once to verify whether the correspond-
ing predicted AA is present. Because of concurrency bugs, some subjects may throw uncaught
exceptions in certain problematic schedules. It is clearly a highly desirable and useful charac-
teristic if a technique is able to predict these concurrency bugs from a normal execution trace,
and generate the corresponding schedules to cause the program to raise uncaught exceptions.
Thus, in our evaluation, we also report the number of re-executions in which the program raised
uncaught exceptions, out of all the schedules generated by PECAN, for each evaluated program.
To remove nondeterminism caused by random numbers, we replace all random seeds in the
evaluated programs with a constant. For open libraries, we use the drivers from [57] to close
them. All the experiments were conducted on an 8-core 3.00GHz Intel Xeon machine with 16GB of
memory running Linux 2.6.22. The VM is a standard Java HotSpot (TM) 64-Bit Server VM, version
1.6.0_10, with a 10GB heap, which is sufficient for all our experiments.
TABLE 4.1: PECAN experimental results. For each program, the table reports the lines of code
(LOC); the trace statistics: number of threads (Thread), shared variables (SV), events
(Event), and recording overhead (Overhead); the computation times: pattern search (Analysis)
and schedule generation (Transform); the predicted violations: data races (Race), atomicity
violations (AV), and ASVs (ASV); and the re-execution results: created real AAs (T),
re-executions raising uncaught exceptions (EX), and failed re-executions (F). The evaluated
programs are Account, BuggyPrg, Critical, Loader, Manager, MergeSort, Shop, StringBuf,
ArrayList, LinkedList, HashSet, TreeSet, Moldyn, RayTracer, MonteCarlo, Cache4j, SpecJBB-2005,
Hedc, Weblech-0.0.3, OpenJMS-0.7.7*, Jigsaw-2.2.6*, and Derby-10.3.2.1*.
4.7.1 Experimental Results
Table 4.1 summarizes the results of our experiments. For each program, Column 2 reports its
size in lines of source code (LOC), and Columns 3-5 report the number of threads (Thread), the
number of real shared memory locations that receive both read and write accesses from
different threads (SV), and the number of events in the analyzed trace (Event), respectively.
The thread number ranges from 2 in RayTracer to 24 in OpenJMS, the number of shared memory
locations ranges from 1 to 399, and the trace size ranges from 19 to 447,392.
Column 6 reports the runtime overhead (Overhead) of our trace collection, averaged over 10
runs for each subject. The runtime overhead ranges from 0.00x in Account and Manager to 7.84x
in Moldyn. Columns 7-8 report the
pattern search time (Analysis) and the average schedule generation time (Transform). The pat-
tern search time ranges from 3ms in StringBuf, with 86 events in the trace, to around 5 minutes
in OpenJMS with 180,887 events in the trace. The average schedule generation time ranges
from 2ms in Critical to 1.473s in Derby.
Columns 9-11 report the number of predicted data races (Race), atomicity violations (AV), and
ASVs (ASV), respectively, in each program. PECAN predicted a number of data races and
atomicity violations in almost all the traces we analyzed. The number of predicted ASVs is
often zero or very small except for Jigsaw, in which PECAN predicted 684 ASVs. Note that
each AA reported by PECAN is unique in terms of the source code line numbers on which the
violation events are triggered. We do not report duplicate AAs that have the same line number
combinations in the source.
Columns 12-14 report the number of created real AAs (T), the number of re-executions that
raise uncaught exceptions (EX), and the number of re-executions that fail (F). For the three
large programs (OpenJMS, Jigsaw, Derby) marked with ‘*’, because they contain too many pre-
dicted AAs (from 437 to 2,076), we only generate the schedules for 100 randomly selected AAs.
PECAN created real AAs for all the evaluated programs and, for most of them, PECAN caused
the program to throw uncaught exceptions, which is a strong symptom of real concurrency bugs.
PECAN also reported a number of failed re-executions in several subjects, especially the large
programs. We manually inspected those failures and found that the only reason PECAN failed to
create these AAs is that they are false violations, owing to the conservativeness of the
hybrid constraint model (recall Section 4.4.1) that we use for AA prediction.
Our experimental results clearly demonstrate the performance and effectiveness of PECAN.
First, PECAN predicted real AAs for all the evaluated subjects and achieves a 100% success
ratio in creating the predicted AAs in more than half of the subjects. For the other subjects,
the success ratio ranges from 0.25 to 0.93 (due to the reported false violations). Second, the
pattern search and
the schedule generation are both relatively fast. For Derby, which has more than 447K events
in the trace, PECAN predicted 463 AAs in around 6 seconds and generated the corresponding
schedule for each AA in around 1.5 seconds on average. For OpenJMS, the trace of which con-
tains more than 180K events, PECAN predicted 2,076 AAs in less than 5 minutes. For the other
cases with smaller trace size, such as ArrayList that contains several hundred events, the pattern
search time and the schedule generation time are only several milliseconds. These results clearly
demonstrate the efficiency of our pattern search and schedule generation algorithms. Moreover,
since we compute most of the event attributes offline, the runtime overhead of PECAN is rela-
tively small, with slowdown factors ranging from 0.00x to 7.84x.
4.7.2 Detected Real Bugs
We investigated the uncaught exceptions and real AAs that PECAN created and confirmed a
number of real concurrency bugs in almost all the subjects, and several previously unknown
bugs. We next describe a couple of previously unknown bugs in two large projects OpenJMS-
0.7.7 and Jigsaw-2.2.6.
Figure 4.5 shows a destructive data race predicted by PECAN in OpenJMS-0.7.7. The race
happens on the field _multiplexer of the class MultiplexedManagedConnection. When a thread
reads the shared field at line 2 before it is initialized by another thread at line 1, the
thread will throw a ResourceException that crashes the program.

FIGURE 4.5: A destructive race in OpenJMS (MultiplexedManagedConnection.java)

    setInvocationHandler(...) {
        ...
    1.  _multiplexer = createMultiplexer(...);
        ...
    }

    invoke(...) {
        synchronized (this) {
    2.      multiplexer = _multiplexer;
        }
        if (multiplexer != null)
            ...
        else
            throw new ResourceException(...);
    }
Figure 4.6 shows a predicted real bug in Jigsaw-2.2.6. In the method getNextEvent of the class
EventManager, a thread first checks in a while loop (line 1) until the event queue becomes
non-empty; the thread then gets the first item in the queue (line 2) and removes it from the
queue (line 3). This logic is correct in a single-threaded event manager. However, when
multiple threads execute inside the getNextEvent method simultaneously, a thread might try to
get an item from the queue that has already been removed by another thread, causing an
ArrayIndexOutOfBoundsException at line 2.

FIGURE 4.6: A predicted real bug in Jigsaw (EventManager.java)

    getNextEvent() {
    1.  while (queue.size() == 0) { ... }
    2.  Event e = queue.elementAt(0);
        ...
    3.  queue.removeElementAt(0);
    }
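One conceivable fix, sketched below for illustration (this is not the patch actually adopted
by Jigsaw), is to make the emptiness check and the removal atomic, so that no other thread can
drain the queue between the size test and elementAt:

    // Sketch of a fix: the check and the removal execute under one monitor,
    // and waiting replaces the busy loop; producers are assumed to call
    // notifyAll() on the same monitor after enqueueing an event.
    synchronized Event getNextEvent() throws InterruptedException {
        while (queue.size() == 0)
            wait();
        Event e = queue.elementAt(0);
        queue.removeElementAt(0);
        return e;
    }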
4.7.3 PECAN Limitations
Our experimental results clearly demonstrate the superior persuasive concurrency bug
prediction capability of PECAN compared to related approaches. Through our experiments with
real-world large multi-threaded applications, we also observed some limitations of PECAN that
we plan to address in future work.
Limited path exploration Because PECAN currently has only the information of a single trace,
it cannot predict access anomalies in execution paths absent from the collected traces. We
plan to enhance PECAN by combining it with approaches such as symbolic analysis [101, 129,
131] to systematically exercise more execution paths.
Sensitivity to the original trace Both the pattern search and the schedule generation phases
of PECAN depend on the original trace. For example, to create the race (3,7) in Figure 1.1,
PECAN needs statements 3 and 7 to both be exercised in the original trace. However, such a
schedule, e.g., <1,5,2,3,6,7> or <5,1,2,3,6,7>, could be difficult to manifest in either real
executions or test runs. Techniques such as RaceFuzzer are effective in generating
error-inducing traces by intelligently exploring thread schedules based on statically detected
race pairs. As future work, we plan to integrate PECAN with this school of techniques to
tackle the trace sensitivity issue and improve the bug detection capability of PECAN.
4.8 Summary
In summary, this work makes the following contributions:
• We present a persuasive PTA technique as well as a prototype tool PECAN for detecting
general access anomalies in concurrent Java programs. PECAN not only predicts access
anomalies, but also generates “bug hatching clips” that deterministically instruct the input
program to exercise the predicted AAs.
• We present a general specification model of access anomalies and a prediction model
that models the problem of access anomaly prediction as a graph pattern search problem.
The graph compactly encodes the happens-before relationship between the events and the
scheduling order of memory accesses in the trace, and supports efficient pattern search of
AAs to enable PECAN to scale well to large traces.
• We present an efficient static thread schedule generation algorithm, with a proof of sound-
ness, that generates a feasible schedule for every real AA in programs that use locks
in a nested way.
• We evaluated PECAN using twenty-two multi-threaded subjects including six large con-
current systems and our experiments demonstrate that PECAN is able to effectively pre-
dict and deterministically expose real AAs.
Chapter 5
Scaling Predictive Trace Analysis by Removing Redundant Events
Predictive trace analysis (PTA) of concurrent programs is powerful in finding concurrency bugs
unseen in past program executions. Unfortunately, existing PTA solutions face considerable
challenges in scaling to large traces. We identify that a large percentage of events in the trace
are redundant for presenting useful analysis results to the end user. Removing them from the
trace can significantly improve the scalability of PTA without affecting the quality of the results.
We present a trace redundancy theorem that specifies a redundancy criterion and provides a
soundness guarantee that the PTA results are preserved after the redundancy is removed. Based
on this criterion, we design and implement TraceFilter, an efficient algorithm that
automatically removes
redundant events from a trace for the PTA of general concurrency access anomalies. We eval-
uated TraceFilter on a set of popular concurrent benchmarks as well as real world large server
programs. Our experimental results show that TraceFilter is able to significantly improve the
scalability of PTA by orders of magnitude, without impairing the analysis results.
5.1 Introduction
PTA-based solutions often experience scalability problems with large traces because of exhaus-
tively checking all feasible permutations of the trace. The largest trace reported by recent PTA
techniques [130, 131] contains less than 10K events1 and one of the techniques [131] takes more
than two minutes to analyze a trace with only 1K events. It is important for PTA techniques to
scale as the trace of large complex concurrent programs can easily contain millions or even
billions of events [122].1This corresponds to a 0.01sec execution of a Bank benchmark in [130] with 135 lines of code.
We observe that existing research addressing the scalability of PTA techniques targets two
causes of computational complexity. The first cause is the well-recognized exponential
explosion of the schedule exploration space. An array of space reduction methods have been
proposed, such as partial order reduction [38], maximal causal models [112], and staged
analysis [116]. The second cause is the computational complexity inherent in the anomaly
checking algorithms themselves. For instance, in any particular schedule, the number of event
pairs to check for race conditions is O(N²) in the worst case, where N is the size of the
trace. This complexity becomes O(N⁴) when checking for atomic-set serializability violations
(ASVs) [64, 125]. Approaches such as the meta-analysis model [33] and the work by Kahlon et
al. [60] can effectively reduce this type of complexity by limiting the analysis to programs
that obey the nested locking discipline.
In this work, we identify that a third cause of computational complexity comes from the fact that
the trace often contains a large number of events that are mapped to the same lexical statements
in the source code. While increasing the size of the trace significantly, these events do not
reveal any additional information for fixing bugs when presented to the users of the PTA tools.
Therefore, we can dramatically improve the scalability of PTA techniques if we can remove this
redundancy, i.e., produce a smaller N, while preserving the quality of the results presented to the
end user. On the surface, it seems simple to remove the operations that are lexically identical
from the trace. Unfortunately, such an approach removes important dependency information and
causes the PTA techniques to work incorrectly. Let us further illustrate this through an example.
The program in Figure 5.1 consists of a parent thread (T0), executing line 1 to line 5, and three
child threads (T(1,2,3)), executing line 6 to line 14. Since T0 generates 3 writes to variable x
and T(1,2,3) generate 9 reads of x in total, a PTA technique for checking data races will need to
examine 3×9 = 27 event pairs. It is apparent that these 27 pairs of accesses to the variable x eventually map
to only two lines in the source code (line 3 and line 13). Therefore, only one racy pair of events
is sufficient to highlight the problem in this program. In modern day concurrent programs, this
type of redundancy is prevalent due to the single-process-multiple-data (SPMD) architectural
design. A straightforward way to combat this redundancy is to record only one instance of
each lexically distinctive statement. For instance, we can choose to record only the first write
at line 3 by T0 as well as the first read at line 13 by each of the other three threads. The obtained
trace, albeit much smaller in size (4 accesses of x instead of 12), is not that useful in finding
the race, because it tells us only that these reads are performed after the write, as the result of the
thread creation operation at line 4. Therefore, the data race cannot be detected. However, if
we also record the second write by T0, a PTA algorithm can correctly report the data race by
only analyzing 5 accesses of x. Even better, by observing that the event sequences of the three
threads T(1,2,3) are all identical, we can drop the events of any two of them, resulting in only 3
accesses of x to be analyzed for race detection.
Thread T0:                      Thread T(1,2,3):
 1: for(i=1;i<=3;i++)             6: lock l
 2: {                             7: m()
 3:   write x;                    8: unlock l
 4:   fork Thread Ti;             9: m()
 5: }                            10: m()

                                 11: m()
                                 12: {
                                 13:   read x;
                                 14: }

(a) Local redundancy            (b) Global redundancy

FIGURE 5.1: Example code for illustrating the trace redundancy
Through this example, we note that the identical lexical position of two recorded events is only a
necessary but not sufficient condition for them to be redundant, in terms of preserving the results
of the PTA techniques. We propose the concept of permutational redundancy, in conjunction
with lexical redundancy, to serve as the criteria of the safe removal of events from traces before
being analyzed by the PTA techniques. The permutational redundancy criterion states that two
events by the same thread are redundant to each other (called local redundancy) if, first, their
locksets contain no different locks and, second, their inter-thread happens-before relationships
with all the other events generated by the other threads are equivalent. In addition, we extend
this notion to characterize the redundant event sequences by different threads (called global
redundancy). Two event sequences by different threads are redundant if their corresponding
events are lexically redundant. Going back to our example in Figure 5.1, the fork Thread
statement implies that the first write at line 3 by T0 happens before the read of T1 at line 13.
However, this relationship does not hold between this read operation and the second write of T0.
Therefore, the first and the second writes of T0 are not permutationally redundant to each other
and neither of them can be removed. By the same reasoning, the second and the third writes
(reads) of T0 (T(1,2,3)) are in fact redundant and only one of them is needed for further analysis.
Moreover, because the event sequences of T(1,2,3) accessing x are lexically identical and their
corresponding events are equivalent, they are globally redundant to each other and only one of
them is needed for detecting the race.
To remove the redundancy above, an alternative, simpler strategy is to drop all re-references to the
same variable at the same program location by the same thread if there are no synchronization
operations between them. For instance, the third reads of T(1,2,3) in our example are removed
from the trace. However, this simple strategy is less preferable for two reasons. First, it is limited
to removing the redundant events within the same synchronization region; redundant thread
accesses across synchronization boundaries cannot be detected using this approach. More im-
portantly, this approach is unsound in addressing trace redundancy in the general PTA treatment
of access anomalies. It may incorrectly drop useful events that manifest access anomalies other
than data races. As illustrated in Figure 5.2, this simple strategy removes the second read of T2,
which results in missing a real atomicity violation formed by the statements (10,7,10).
T1:                      T2:
 1: m1()                  3: m2()
 2: m1()                  4: m2()

 5: m1():                 9: m2():
 6:   lock l;            10:   read x;
 7:   write x;
 8:   unlock l;

• Statement pair (4,9) forms a real data race.
• The second write of Thread t1 is redundant, however. The simple strategy of “dropping all
  re-references by the same thread to the same variable if there are no synchronization
  operations between them” does not work for this redundancy, because there are lock/unlock
  operations between the two writes of Thread t1.

FIGURE 5.2: Statements (10,7,10) form a real atomicity violation. However, the simple strategy
of “dropping all re-references by the same thread to the same variable if there are no
synchronization operations between them” would drop the second read of T2 at line 10, which
causes PTA to miss this atomicity violation.
Based on the above observation, we present TraceFilter, a technique that efficiently removes
redundant events from a trace and, at the same time, preserves the results of the PTA techniques.
We first propose a generalized model of the PTA algorithms for analyzing the access anomaly
bugs in concurrent programs. Using this model, we associate each event in the trace with a new
attribute, called concurrency context, in addition to its lexical location. The concurrency context
contains the synchronization histories of the thread, at the time when the event is triggered in
the trace. We show that our technique is sound: it does not mis-classify any useful event to be
redundant, as the concurrency context strictly preserves the permutability conditions of events.
Moreover, the prefix-sharing property of the concurrency context enables us to use a compact
Trie data structure to detect redundancy in a memory-friendly way and to efficiently filter out
redundant events.
To evaluate our technique, we have implemented a prototype tool for analyzing the trace of
concurrent Java programs and evaluated the tool on a set of popular concurrent benchmarks and
real world large server programs. We considered the PTA of all three common concurrency
access anomaly bugs including data races, atomicity violations, and ASVs. Our experimental
results show that: (1) redundant events are pervasive in concurrent programs and our technique
is very effective for detecting them. The overall percentage of redundant events detected by
our technique ranges from 7.9% to 99.9% in the trace, while for the real server programs, the
percentage of redundancy ranges from 34.7% to 85.5%. (2) our technique is able to significantly
improve the scalability of PTA. For a trace with more than 2M events (2,236,960) in Derby, the
PTA with our technique was able to finish in 177.5 seconds, whereas without our technique, the
same PTA does not finish in 2 hours. (3) our technique does not impair the analysis result for
the PTA of concurrency access anomaly bugs. By comparing the trace analysis results for all
the evaluated benchmarks, we empirically confirm that the analysis results reported by the trace
analysis algorithms with our technique are the same as without our technique.
The remainder of this chapter is organized as follows: Section 5.2 presents a description of
the general PTA algorithm; Section 5.3 presents our technique in detail; Section 5.4 describes our
implementation; Section 5.5 presents our empirical evaluation; Section 5.6 summarizes this
chapter.
5.2 General PTA algorithm
The essential idea of PTA for detecting concurrency access anomalies [22, 64] is based on a
permutability property between events that combines both the lockset condition and happens-
before condition. We describe a generalized PTA algorithm as follows.
A PTA algorithm A, given a trace δ and a pattern p, first decides, in the pattern, the number of
different threads, different shared variables, and different events on each shared variable by each
thread in the same atomic region. Then it uses this information to search in the trace to obtain a
set of candidate access anomalies. In order for the search to be efficient, the trace is often pre-
processed to build an index based on the thread ID, the shared variable, the access type, and the
atomic region. Each candidate access anomaly contains a sequence of events usually satisfying
one of the patterns in Figure 4.1 with respect to SV, AR, and AT, but not T, the thread scheduling
order. In this case, it continues to check whether there exists certain allowed permutations of
events that match T under the lockset and the happens-before constraints.
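To make this pre-processing step concrete, the following minimal sketch (in Java, with hypothetical names such as TraceIndex and Event; these are not PECAN's actual data structures) groups the recorded memory accesses by shared variable, thread, and access type so that candidate pairs can be enumerated without rescanning the whole trace:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the pre-processing step: memory-access events are indexed by
    // shared variable, and within each variable by thread ID, separately for
    // reads and writes. Event and its fields are hypothetical.
    final class TraceIndex {
        enum Access { READ, WRITE }

        static final class Event {
            final int id;            // position in the trace
            final long threadId;     // T
            final String variable;   // SV
            final Access access;     // AT
            final String location;   // lexical statement in the source
            Event(int id, long threadId, String variable, Access access, String location) {
                this.id = id; this.threadId = threadId; this.variable = variable;
                this.access = access; this.location = location;
            }
        }

        // variable -> (threadId -> events), split by access type
        private final Map<String, Map<Long, List<Event>>> reads = new HashMap<>();
        private final Map<String, Map<Long, List<Event>>> writes = new HashMap<>();

        void add(Event e) {
            Map<String, Map<Long, List<Event>>> index =
                (e.access == Access.READ) ? reads : writes;
            index.computeIfAbsent(e.variable, v -> new HashMap<>())
                 .computeIfAbsent(e.threadId, t -> new ArrayList<>())
                 .add(e);
        }

        Map<Long, List<Event>> readsOf(String v)  { return reads.getOrDefault(v, Map.of()); }
        Map<Long, List<Event>> writesOf(String v) { return writes.getOrDefault(v, Map.of()); }
    }

For a race pattern on a variable x, the candidate pairs are then simply the cross product of writesOf("x") and readsOf("x")/writesOf("x") across different threads.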
For a pair of events ei and ej , PTA checks the following two conditions:
I: lockset condition: Li ∩ Lj = ∅, where Li and Lj are the locks held by the corresponding
thread when the event occurs.
II: happens-before condition: ¬(ei ≺ ej) ∧ ¬(ej ≺ ei), where ≺ is the POR relation defined in
Definition 2.5.
The above conditions mean that the two events in the candidate access anomaly are permutable,
i.e., concurrent to each other (neither access happens-before the other). Consequently, the
PTA algorithm can conclude that multiple thread scheduling orders are possible for this pair of
events and report this pair as a data race bug.
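A direct transcription of the two checks might look as follows. This is a sketch under the definitions above; HappensBefore is a name of ours standing in for a vector-clock implementation of the POR relation, not an interface of PECAN:

    import java.util.Collections;
    import java.util.Set;

    // Hypothetical oracle for the POR relation of Definition 2.5.
    interface HappensBefore {
        boolean precedes(int ei, int ej); // true iff ei ≺ ej
    }

    final class Permutability {
        // Condition I: the two threads hold no common lock at the two events.
        static boolean locksetsDisjoint(Set<Long> li, Set<Long> lj) {
            return Collections.disjoint(li, lj);
        }

        // Condition II: neither event happens-before the other.
        static boolean concurrent(HappensBefore hb, int ei, int ej) {
            return !hb.precedes(ei, ej) && !hb.precedes(ej, ei);
        }

        static boolean permutable(HappensBefore hb, int ei, Set<Long> li,
                                  int ej, Set<Long> lj) {
            return locksetsDisjoint(li, lj) && concurrent(hb, ei, ej);
        }
    }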
Finally, for each access anomaly, PTA extracts the information contained in the events and
presents it to the programmer for debugging. There is no uniform rule on what information
is extracted from each event, as the users of PTA may require different levels of detail for
understanding the access anomaly bug. However, the basic information of access anomalies
should contain the lexical statements in the program on which the events are triggered.
Example Let us consider the trace in Figure 5.3. There are in total 43 events (e1-e43) in
the trace. The events e1-e6 are performed by T0, e7-e18 by T1, e19-e31 by T2, and e32-e43 by
T3. There are in total three write events, e(1,3,5), all by thread T0, and nine read events, among
which e(10,14,17) are by T1, e(22,26,30) by T2, and e(35,39,42) by T3. The locksets of the read
events e(10,22,35) contain a single lock l. The locksets of the other read/write events are empty.
Thread T0          Thread T1           Thread T2           Thread T3
e1: write x;       e7:  start T1       e19: start T2       e32: start T3
e2: fork T1;       e8:  lock l;        e20: lock l;        e33: lock l;
e3: write x;       e9:  enter m;       e21: enter m;       e34: enter m;
e4: fork T2;       e10: read x;        e22: read x;        e35: read x;
e5: write x;       e11: exit m;        e23: exit m;        e36: exit m;
e6: fork T3;       e12: unlock l;      e24: unlock l;      e37: unlock l;
                   e13: enter m;       e25: enter m;       e38: enter m;
                   e14: read x;        e26: read x;        e39: read x;
                   e15: exit m;        e27: exit m;        e40: exit m;
                   e16: enter m;       e29: enter m;       e41: enter m;
                   e17: read x;        e30: read x;        e42: read x;
                   e18: exit m;        e31: exit m;        e43: exit m;
FIGURE 5.3: A trace corresponding to a serial execution of the example program in Figure 5.1.
Since the fork Thread ti event must be executed before the start of thread ti, the happens-
before relations between the events are e1 ≺ e2 ≺ e7 ≺ . . . ≺ e18, e2 ≺ e3 ≺ e4 ≺ e19 ≺ . . . ≺ e31,
and e4 ≺ e5 ≺ e6 ≺ e32 ≺ . . . ≺ e43.
Given the above trace, PTA will first list all the read/write events on the shared variable x by
each thread. As e(1,3,5) are three write events by T0 and e(10,14,17,22,26,30,35,39,42) are nine read
events by T1,2,3, PTA will then check the 3*9=27 pairs of candidate races. By evaluating the
lockset and happens-before relation between the two events in each candidate race, PTA will
get nine real race pairs [(e(3), e(10,14,17)),(e(5), e(10,14,17,22,26,30))]. For the real race pairs, PTA
then reports the lexical statement pair contained in each of them, which finally produces the race
at lines (3,13) to the user.
5.3 Removing Trace Redundancy
This section presents our methodology for removing the trace redundancy. We start by giving
a formal modeling of the trace redundancy. We then present our algorithm for detecting
redundant events.
5.3.1 Modeling trace redundancy
Consider a PTA algorithmA that takes a trace δ as the input and produces a set of access anomaly
bugs as the output. We define the concept of redundancy as follows:
Definition 5.1. Given an algorithm A and an arbitrary input δ, a subsequence X of δ is redun-
dant iff A(δ) = A(δ/X), where δ/X denotes δ with the events of X removed.
Recall from Section 4.3 that an access anomaly is a sequence of events that can be specified by a meta
pattern that defines both the attribute values of these events and the order relation between them.
To facilitate our discussion, we first define a concept called candidate access anomaly (CAA)
that will be used in our modeling of trace redundancy:
Definition 5.2. A candidate access anomaly (CAA) corresponding to a pattern p is an event
sequence, of which the event attribute values satisfy the condition defined in p, but the order
relation between them might not satisfy the condition defined in p.
Note that a CAA should correspond to a certain pattern that is provided by the user of the PTA
algorithm. For different patterns, a CAA may contain different numbers of distinctive events.
For example, for a data race pattern, a CAA contains two events, while for atomicity violation
patterns, it contains three events. We refer to this property as the feasibility property of CAA,
used later in proving the trace redundancy theorem.
5.3.1.1 A theory of trace redundancy
Given a pattern and a trace, as described in Section 5.2, a PTA algorithm proceeds in two steps.
First, it analyzes the trace to find the sequences of events (i.e., access anomalies) that satisfy
the conditions specified in the pattern. Second, for each sequence of events, it extracts the
information contained in the events and reports it to the programmer for debugging. We can
decompose such PTA algorithms into two components: a rule R and a function f . The rule R
evaluates on a CAA, say s, and simply reports true or false. If R reports true, it means s is a real
access anomaly, and f will be applied on each event in s to generate an output.
Let us assume that the information generated by f for each event is its lexical location in the program
source, σ. Let s(e → e′) denote the result of replacing an event, e, in a CAA, s, with another
event, e′. Let R(s) denote that R reports true on s. And let ⊏ denote a relation where X ⊏ Y
means all events of the sequence X are also in Y . Based on the above assumption, we have the
following theorem for detecting redundancy in δ.
Theorem 5.3. Given an input trace δ, a pattern p and an algorithm A = (R,f) with rule R
and function f . An event e is redundant if there exists another event e′ in δ such that, for every CAA
s ⊏ δ with e ∈ s, the following three conditions hold:
• Lexical equivalence condition: f(s) = f(s(e→ e′));
• Permutational equivalence condition: R(s)⇒ R(s(e→ e′));
• CAA feasibility condition: s(e→ e′) is a CAA corresponding to p;
Proof. According to our definition of trace redundancy in Definition 5.1, if the above three
conditions are all satisfied, then for every CAA s ⊏ δ with e ∈ s ∧ R(s), there exists
s′ = s(e → e′) ⊏ δ/{e} such that R(s′) ∧ f(s) = f(s′). Thus, e is redundant.
Theorem 5.3 says that, for an event e in δ, if the three conditions are all satisfied, then the trace
δ with e or without e would always produce the same result for all the PTA algorithms in our
assumption. Therefore, we can detect redundancy in the trace by checking the three conditions
for each event e.
Let us first consider the lexical and the permutational equivalence conditions in Theorem 5.3.
Since f generates the lexical statement σ according to our assumption, the lexical equivalence
is easy to evaluate, i.e., it is satisfied iff e′ and e are triggered on the same lexical statement. For
the permutational equivalence, as we have shown in Section 5.2, the essential determinant is the
lockset and the happens-before relation between the events, which instantiate the rule R. More
specifically, we define that the events e and e′ satisfy the permutational equivalence condition
if :
• Lockset equivalence: their locksets contain no different locks, i.e., Le = Le′ ;
• Inter-thread happens-before equivalence: their happens-before relationships with all the
events by other threads are equivalent, i.e., ∀e″ with te″ ≠ te ∨ te″ ≠ te′:
¬(e ≺ e″) ∧ ¬(e″ ≺ e) ⟺ ¬(e′ ≺ e″) ∧ ¬(e″ ≺ e′).
Since the above two conditions are indeed conservative in satisfying the permutational equiva-
lence condition, we have the following theorem:
Theorem 5.4. Events e and e′ are permutationally equivalent to each other if the lockset con-
dition and the inter-thread happens-before condition are both satisfied.
In the following, we say the two events are fully equivalent to each other if they satisfy both
the lexical equivalence and the permutational equivalence conditions. By Theorem 5.4, we can
determine if e and e′ are fully equivalent to each other by checking three conditions in total:
the lexical equivalence, the lockset equivalence, and the happens-before equivalence. However,
note that neither e nor e′ is necessarily redundant even if they are fully equivalent. We have to also
consider the CAA feasibility condition in Theorem 5.3.
The CAA feasibility condition requires that s(e → e′) is a CAA corresponding to the pattern p.
Recall in Section 5.3.1 that no two events in the CAA should be the same. Therefore, to satisfy
this condition, e′ must not be in s. More specifically, to determine whether or not an event e
is redundant, we have to ensure that, for any CAA s, there always exists an event e′ that is not
in s, such that e and e′ are fully equivalent to each other. However, this condition in general
cannot be established without considering the pattern p that s corresponds to.
For example, consider an atomicity violation pattern, which specifies three events with two of
them, e1 and e2, from the same thread and the third one from another thread. Even if these two
events are fully equivalent to each other, the pattern requires both of them to be present to form
the bug condition. However, an event, e3, is truly redundant if it is also fully equivalent to e1and e2 because two events are sufficient according to the definition of the bug pattern.
Hence, to determine whether the CAA feasibility condition can be satisfied or not, we need
to consider the specific pattern that the CAA sequence, s, corresponds to. This leads to our
definition of norm with respect to each pattern as follows:
Definition 5.5. The norm of a pattern p, denoted as ∥p∥, is the maximum number of lexically
and permutationally equivalent events allowed in p. For example, the norm of a data race pattern
is 1, and the norm of an atomicity violation is 2.
Given the definition of pattern norm above, we have the following theorem:
Theorem 5.6. An event e is redundant if the number of fully equivalent events to e in the trace
is no less than the pattern norm ∥p∥.
Proof. Suppose there are ∥p∥ or more equivalent events to e in the trace. Let us put them into
a set S. As there are at most ∥p∥ fully equivalent events for any CAA that corresponds to the
pattern p, no matter what events the CAA, s, contains, there always exists at least one event in S
that is not in s but fully equivalent to e. Therefore, the CAA feasibility condition in Theorem 5.3
is satisfied. e is redundant, because both the lexical and the permutational equivalence conditions
are also satisfied.
Therefore, using Theorem 5.6, given a pattern and a trace, we can determine whether an event
e is redundant or not in the trace by counting the number of fully equivalent events to e. If the
number is no less than the pattern norm, we can classify that e is redundant and remove e from
the trace.
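The resulting test is easy to state operationally. The sketch below keeps at most ∥p∥ events per full-equivalence class and drops the rest; the class name and the string key are ours for illustration (the actual implementation uses the Trie described in Section 5.3.2):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the redundancy test implied by Theorem 5.6: at most ||p||
    // events are kept per full-equivalence class; further events in the class
    // are redundant and dropped. The equivalence key (thread ID + lexical
    // location + concurrency context) is hypothetical.
    final class NormFilter {
        private final int norm;                          // ||p|| of the checked pattern
        private final Map<String, Integer> keptPerClass = new HashMap<>();

        NormFilter(int norm) { this.norm = norm; }       // 1 for races, 2 for atomicity

        /** Returns true if the event should be kept, false if it is redundant. */
        boolean keep(String equivalenceKey) {
            int kept = keptPerClass.getOrDefault(equivalenceKey, 0);
            if (kept >= norm) return false;              // enough equivalents already kept
            keptPerClass.put(equivalenceKey, kept + 1);
            return true;
        }
    }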
5.3.1.2 Concurrency context
According to Theorem 5.3, to detect redundant events, we need to check lexical equivalence,
permutational equivalence, and the CAA feasibility conditions between events. While lexical
equivalence and the CAA feasibility conditions are straightforward to compute, we have to
properly model the lockset and happens-before relationships of each event in order to compute the
permutational equivalence condition.
Recall in Section 5.2 that the lockset of an event is the set of locks the thread is holding when
it triggers the event, and the POR relation is computed using vector clocks by considering the
internal events in each thread and the synchronization events across different threads. To support
the efficient checking of these two conditions for detecting redundant events, we introduce a
new attribute, concurrency context, for each event to encode the lockset and the happens-before
relation in a uniform way:
Definition 5.7. The concurrency context of each event includes both the LOCK/UNLOCK and
the message send/receive (FORK/JOIN/WAIT/NOTIFY) history of the thread at the time when
the event is triggered.
By defining the concurrency context in this way, we have the following theorem:
Theorem 5.8. Two events from the same thread with the same concurrency context are permu-
tationally equivalent to each other.
Proof. Since the concurrency context encodes the LOCK/UNLOCK history, the two events
must have the same lockset, which satisfies the lockset equivalence condition. In addition,
since the concurrency context encodes the message send/receive history, which determines the
happens-before relation between events across different threads (recall the happens-before
relation defined in Chapter 2, the second condition), these two events from the same thread must also have
the same happens-before relations with all the other events by the other threads, hence satisfying
the inter-thread happens-before equivalence condition.
In addition, as programmers may require more details besides the lexical location of the access
anomaly, we also include the runtime method call stack (ENT/EXT) of each event in its concur-
rency context, to give programmers the full calling context information for understanding the
bug. Note that our definition of the concurrency context naturally supports online computation.
This is important since, as mentioned earlier, large traces may not fit in memory.
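A minimal sketch of such an online concurrency context, in Java with hypothetical names (not PECAN's actual implementation), is shown below. ENT/LOCK open a scope, EXT/UNLOCK remove the most recent matching entry, and message events are appended permanently:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Sketch of the concurrency context of Definition 5.7, extended with the
    // call stack as described above. Names are hypothetical.
    final class ConcurrencyContext {
        private final Deque<String> scopes = new ArrayDeque<>();  // open ENT/LOCK scopes
        private final List<String> messages = new ArrayList<>();  // FORK/JOIN/WAIT/NOTIFY history

        void enterMethod(String m) { scopes.push("m:" + m); }
        void exitMethod(String m)  { scopes.removeFirstOccurrence("m:" + m); }
        void lock(String l)        { scopes.push("l:" + l); }
        void unlock(String l)      { scopes.removeFirstOccurrence("l:" + l); }
        void message(String g)     { messages.add("g:" + g); }

        /** Snapshot taken when an event triggers; by Theorem 5.8, two events of
         *  the same thread with equal snapshots are permutationally equivalent. */
        List<String> snapshot() {
            List<String> s = new ArrayList<>(messages);
            s.addAll(scopes);                            // most recent scope first
            return s;
        }
    }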
5.3.1.3 Two dimensions of redundancy
The model above describes a general way of determining redundancy in the context of a PTA
algorithm for concurrency access anomaly detection. According to Theorem 5.6, we know that
at most ∥p∥ fully equivalent events need to be kept, and all additional ones are considered to be
redundant. Conceptually, we can decompose redundancy into
two dimensions: the redundant events from the same thread and those from different threads.
According to Theorem 5.8, since full equivalence between two lexically equivalent events by
the same thread can be determined by comparing their associated concurrency contexts, an ad-
vantage of this decomposition is that it allows the separation of local and global reasoning of
redundancy with respect to each individual thread. We next show the decomposition in detail.
Local redundancy The first dimension of redundancy is called local redundancy, defined over
the events of each individual thread. Consider the set of fully equivalent events. If we further
divide it into subsets grouped by the thread ID, we are able to determine the redundancy locally
to each thread, without checking against all the events in the trace. More specifically, if the
size of some subset exceeds the pattern norm, the additional events in the subset are already
redundant regardless of the events in the other subsets. We refer to these additional events as
the locally redundant events and they can be safely removed from the trace. As an example,
consider detecting the data race on the trace in Figure 5.3. Since the second and third writes of
x (e(3,5)) by thread T0 are equivalent to each other, and the norm of a data race pattern is one,
we can safely remove either e3 or e5 from the trace.
Global redundancy The second dimension of redundancy is called global redundancy, which
is defined over the events across different threads. For general access anomaly patterns, it is
difficult to determine the equivalence between events from different threads. The reason is that
the permutational equivalence condition requires checking the happens-before relation between
the two events against all the other events in the trace. For two events from different threads,
their happens-before relationships with the events from the other thread would be different and,
thus, the permutational equivalence condition may never be satisfied. For example, the events
e30 and e42 by threads T2 and T3 (in Figure 5.3), respectively, are not equivalent to each other, as
their happens-before relationships with all the other events in T2 and T3 are different. Therefore,
to determine the redundancy across different threads, we need to examine the access anomaly
patterns in more detail.
Recall that an access anomaly pattern [E,T,SV,AR,AT] specifies a sequence of events by
different threads. Consider the element T which specifies the meta thread ID sequence in the
pattern. Our observation is that, only a limited number (nt) of different threads are required in
the formation of an access anomaly pattern. If there are more than nt threads in the trace that
contain lexically identical events with respect to the pattern, those additional threads are
redundant and all their corresponding events, which are referred to as the globally redundant
events, can be removed. For example, consider the threads T(1,2,3) in Figure 5.1, since their
event sequences are lexically identical to each other (because they execute the same code), we
only need to keep the events from two of them because the three common access anomalies all
require only two threads. The reason is that any access anomaly contributed by the events from
the redundant threads can be replaced by the events from the remaining threads in the trace, as
the access anomaly pattern does not require concrete but rather meta thread IDs. This is also
known as symmetry reduction in model checking techniques [117].
To generalize to any pattern that specifies nt different threads, we determine global redundancy
by comparing the entire event sequences between different threads. For the set of threads that
contain lexically identical event sequences, we only keep nt of them (if the size of the set is
larger than nt) and discard the events from the rest of them. We detect global redundancy after
processing local redundancy to reduce the computation effort.
5.3.2 Filtering redundant events
To efficiently encode and filter redundant events, we design two filters for dealing with both the
locally and the globally redundant events. Our filters use the Trie data structure to represent the
concurrency contexts. The reason for choosing a Trie is that, for any particular thread, the stream
of events exhibits strong temporal locality due to the stack-based computation model. Events
generated at the top level of the function stack share all their preceding events generated by the
entire stack. We leverage this phenomenon to make good use of the prefix sharing capability of
Trie and to perform the online analysis of the events.
More specifically, each node in the Trie represents an element in the concurrency context, e.g.,
a method entry or a lock acquisition operation, and each node is also associated with a bounded
stack whose capacity is set to the norm of the access anomaly pattern. During the fil-
tering, each incoming event is added to the stack of the Trie node that represents its concurrency
context; when the stack is full, the event is discarded and automatically removed from the trace,
as it is guaranteed to be redundant with respect to the analysis
result.
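The following sketch shows the essential shape of such a Trie node (hypothetical names; the event type is left generic), with a bounded per-node stack whose capacity equals the pattern norm:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of a Trie node as used by the filters: each child edge is labeled
    // with one concurrency-context element, and each node carries a bounded
    // stack whose capacity equals the pattern norm.
    final class TrieNode<E> {
        private final Map<String, TrieNode<E>> children = new HashMap<>();
        private final Deque<E> stack = new ArrayDeque<>();

        /** Walks (and lazily creates) the path for a concurrency context. */
        TrieNode<E> find(List<String> context) {
            TrieNode<E> node = this;
            for (String element : context) {
                node = node.children.computeIfAbsent(element, k -> new TrieNode<>());
            }
            return node;
        }

        /** Tries to store an event; returns false if the stack is already full,
         *  i.e., the event is redundant and can be discarded. */
        boolean offer(E event, int norm) {
            if (stack.size() >= norm) return false;
            stack.push(event);
            return true;
        }
    }

Because consecutive events of a thread share long context prefixes, consecutive lookups mostly walk the same path, which is what makes the structure memory-friendly.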
Algorithm 5 TraceFilter(δ)
 1: Input: δ - a trace ⟨ei⟩
 2: cctxt ← empty concurrency context, for each thread t
 3: for i = 1 to |δ| do
 4:   switch ei
 5:     case MEM(σi, vi, ai, ti, Li):
 6:       DetectLocalRedundancy(ei, ti, σi, cctxti)
 7:     case ENT(mi, ti):
 8:       add mi to cctxti
 9:     case EXT(mi, ti):
10:       remove mi from cctxti
11:     case LOCK(li, ti):
12:       add li to cctxti
13:     case UNLOCK(li, ti):
14:       remove li from cctxti
15:     case WAIT/NOTIFY(gi, ti):
16:       add gi to cctxti
17: DetectGlobalRedundancy(δ)
Algorithm 5 shows our TraceFilter algorithm for removing redundant events in the trace. It
consists of two parts: an online algorithm (Algorithm 6 DetectLocalRedundancy) for detect-
ing local redundancy and an in-memory algorithm (Algorithm 7 DetectGlobalRedundancy)
for detecting global redundancy. The algorithm conducts a linear scan of the input trace and
maintains a concurrency context for each thread during the analysis. The concurrency context
is computed as follows. If the event is a method entry/lock acquisition (ENT/LOCK) event,
the method/lock ID (m/l) will be added to the thread’s concurrency context. If the event is a
method exit/lock release (EXT/UNLOCK) event, the most recent method/lock ID (m/l) will
be removed from the thread’s concurrency context. If the event is a message send/receive
(FORK/JOIN/WAIT/NOTIFY) event, the message ID (g) will be added to the thread’s con-
currency context. Otherwise, if the event is a shared variable read or write access (MEM),
it will call the algorithm DetectLocalRedundancy for checking redundancy with the current
concurrency context of the thread.
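In Java, this dispatch can be sketched as follows, reusing the hypothetical ConcurrencyContext from Section 5.3.1.2; TraceEvent and Kind are stand-ins of ours for PECAN's internal event representation:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the one-pass scan of Algorithm 5.
    final class TraceScan {
        enum Kind { MEM, ENT, EXT, LOCK, UNLOCK, FORK, JOIN, WAIT, NOTIFY }

        static final class TraceEvent {
            final Kind kind;
            final long threadId;
            final String name;   // method, lock, or message ID (σ for MEM events)
            TraceEvent(Kind kind, long threadId, String name) {
                this.kind = kind; this.threadId = threadId; this.name = name;
            }
        }

        private final Map<Long, ConcurrencyContext> contexts = new HashMap<>();

        void process(TraceEvent e) {
            ConcurrencyContext ctx =
                contexts.computeIfAbsent(e.threadId, t -> new ConcurrencyContext());
            switch (e.kind) {
                case ENT:    ctx.enterMethod(e.name); break;
                case EXT:    ctx.exitMethod(e.name);  break;
                case LOCK:   ctx.lock(e.name);        break;
                case UNLOCK: ctx.unlock(e.name);      break;
                case MEM:    detectLocalRedundancy(e, ctx); break;
                default:     ctx.message(e.name);     break; // FORK/JOIN/WAIT/NOTIFY
            }
        }

        private void detectLocalRedundancy(TraceEvent e, ConcurrencyContext ctx) {
            // Look up the Trie node for ctx.snapshot() and try to push e (Algorithm 6).
        }
    }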
Algorithm 6 DetectLocalRedundancy(e, t, σ, cctx)
 1: Input: e - an event in the trace
 2: Input: t - a thread ID
 3: Input: σ - a program location
 4: Input: cctx - a concurrency context
 5: local_trie_map (t → (σ → trie)): a map from a given t and σ to a trie
 6: trie ← local_trie_map(t, σ)
 7: stack ← trie.get(cctx) // get the corresponding stack of cctx
 8: if stack is full then
 9:   discard e
10: else
11:   add e to stack
Detecting local redundancy In our algorithm DetectLocalRedundancy, our local redundancy
filter checks each event that is associated with a shared variable access. We first find the node in
the Trie, given the concurrency context of the event. If the stack associated with the node is full,
the event is discarded from the trace and the algorithm continues to process the next event. The
algorithm terminates after the last event in the trace is analyzed. The worst case time complexity
of this algorithm is linear in the trace size multiplied by the maximum length of the concurrency
context, i.e., the number of events in the concurrency context.
Figure 5.4 (left) shows an example snapshot of the local filter, assuming the norm of the
detected access anomaly pattern is 2. The table in Figure 5.4 (left) lists eight events and their
associated concurrency contexts, which consist of locks l1 and l2 and methods m1 and m2. In
this Trie, each node contains a particular context element and is associated with a stack of size
2 for storing events. The events e7 and e8 are not stored because they hit the same node as e5
and e6 and the stack is full.
Local filter (left) - events and their concurrency contexts:

Event   Concurrency context
e1      <m1>
e2      <m1, l1>
e3      <m1, l1>
e4      <m2, l1>
e5      <m2, l1, l2>
e6      <m2, l1, l2>
e7      <m2, l1, l2>
e8      <m2, l1, l2>

Global filter (right) - threads, their event sequences, and lexical locations:

Thread  Events              Lexical locations
T1      <e1, e2, e3>        <A, B, C>
T2      <e4, e5, e6>        <A, B, C>
T3      <e7, e8, e9>        <A, B, D>
T4      <e10, e11, e12>     <A, B, C>
T5      <e13, e14>          <B, C>
T6      <e15, e16>          <B, C>
T7      <e17, e18, e19>     <A, B, D>
T8      <e20, e21>          <B, C>

FIGURE 5.4: Trie representation of local (left) and global (right) redundancy
Detecting global redundancy We invoke the algorithm, DetectGlobalRedundancy, to re-
move global redundancy across different threads after removing local redundancy in each thread.
We first categorize the events according to their thread IDs. Instead of populating the Trie with
the attributes of the concurrency context, we use the lexical locations of the events as keys and
store the thread IDs in the Trie. Our algorithm iterates through the set of all
threads and updates the global Trie according to the lexical locations of the events in the event
sequence of each thread. If the corresponding lexical locations of two event sequences by two
threads are identical, the two thread IDs will be placed in the stack associated with the same
node. If a stack is full, all the events from the newly arriving thread are discarded.
Figure 5.4 (right) shows an example snapshot of the global filter. The table shows the cate-
gorized events generated by eight threads, with the lexical location of each event shown alongside it.
The events from the threads T1, T2, and T4 have the same lexical location sequence <A,B,C>,
and the events from the threads T5, T6, and T8 have the same lexical location sequence <B,C>.
Following each sequence, the thread IDs are recorded by the filter. Supposing that the access anomaly
patterns in this example specify at most 2 different threads, the events from the threads T4 and
T8 are all dropped because the corresponding stacks in the global Trie are full: T4
is mapped to the same node as T1 and T2, and T8 is mapped to the same node as T5 and T6.
This operation uses the global information of the remaining event sequences of each thread;
therefore, the entire trace is required to be in memory. For a large raw trace, this requirement
is hard to satisfy. Fortunately, after removing local redundancy, the size of the raw trace is often
greatly reduced, so that our technique is able to handle large traces despite the fact that removing
global redundancy is not memory-friendly.
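Semantically, the global filter can be sketched as below (hypothetical names; a flat map keyed by the whole lexical-location sequence is used in place of the Trie, which is equivalent for the purpose of grouping):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the global filter: threads are grouped by the sequence of
    // lexical locations of their (locally filtered) events, and at most nt
    // thread IDs are kept per group; the event sequences of the remaining
    // threads are dropped.
    final class GlobalFilter {
        private final int nt;   // number of distinct threads the pattern requires
        private final Map<List<String>, List<Long>> keptThreads = new HashMap<>();

        GlobalFilter(int nt) { this.nt = nt; }

        /** Returns false if this thread's whole event sequence is globally redundant. */
        boolean keep(long threadId, List<String> lexicalSequence) {
            List<Long> group =
                keptThreads.computeIfAbsent(lexicalSequence, k -> new ArrayList<>());
            if (group.size() >= nt) return false;   // stack full: discard the thread
            group.add(threadId);
            return true;
        }
    }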
Algorithm 7 DetectGlobalRedundancy(δ)
 1: Input: δ - a trace ⟨ei⟩
 2: δt: the event sequence by thread t in δ
 3: trie: the global trie
 4: // iterate through the set of all threads
 5: for all t ∈ T do
 6:   trie ← UpdateGlobalTrie(δt)
 7:   stack ← trie.get(t) // get the corresponding stack of t
 8:   if stack is full then
 9:     discard δt
10:   else
11:     add δt to stack
5.4 Implementation
We implemented our technique on top of PECAN. To obtain a trace, PECAN first takes the
bytecode of an arbitrary Java program and outputs an instrumented version that collects the
events of interest during the program execution.
For detecting concurrency bugs using PTA, PECAN collects the following types of events in a
global order: READ/WRITE accesses to shared variables, method entry/exit, LOCK/UNLOCK,
FORK/JOIN, and WAIT/NOTIFY events. To support the recording of long running programs,
PECAN does not hold the entire trace in the main memory but saves it to a database. To reduce
the unnecessary recording of accesses on thread local variables, PECAN also performs a static
thread escape analysis [42] to identify all the possible shared variables in the program. Each
event in the trace is associated with a set of attributes: the access type, the memory address, the
thread ID, and the location in the program source. To reduce the runtime cost, the concurrency
context information of each event used by our technique for detecting redundant events is not
recorded during the trace collection. Instead, it is computed and maintained at the time when
the events in the trace are processed by our technique.
After applying our technique for removing the redundant events in the trace, the PTA engine
of PECAN takes the trace as input and reports detected access anomalies. Each of the reported
access anomalies is a pure event sequence satisfying the specification of the access anomaly
pattern. For two access anomalies that have the same lexical information but contain different
event sequences, PECAN can also be configured to report either both of them or only one of
them, by checking the redundancy between them.
TABLE 5.1: TraceFilter experimental results - RQ1: Effectiveness

Program       SLOC     Input/#Thread   #Events     #SV    Size      Local redundancy     Global redundancy
BuggyPro      348      33              10,075      5      424KB     5,876 (58.3%)        147 (1.5%)
Shop          220      100             15,560      3      654KB     6,684 (44.1%)        462 (2.9%)
Loader        139      100             34,788      2      1.5MB     12,094 (34.8%)       97 (0.3%)
ArrayList     5,979    451             40,558      696    1.7MB     3,208 (8.1%)         0 (0.0%)
LinkedList    5,866    451             53,173      2,266  2.2MB     4,020 (7.9%)         0 (0.0%)
RayTracer     1,924    SizeA/10        350,688     24     14.7MB    327,645 (93.4%)      20 (0.0%)
SpecJBB2005   17,245   8               484,841     113    20.4MB    281,338 (58.0%)      2 (0.0%)
Tsp           709      map4/4          1,048,433   260    44.1MB    1,042,293 (97.7%)    1,248 (0.1%)
Moldyn        1,352    SizeA/10        1,062,629   26     44.7MB    1,003,062 (94.4%)    196 (0.0%)
Sor           951      SizeA/4         5,545,200   6      233.2MB   5,544,954 (99.9%)    0 (0.0%)
OpenJMS       262,842  10              904,435     285    38.0MB    773,764 (85.5%)      0 (0.0%)
Tomcat        339,405  100             1,296,338   401    54.5MB    569,047 (43.9%)      663 (0.0%)
Jigsaw        381,348  10              479,105     407    20.1MB    166,338 (34.7%)      5 (0.0%)
Derby         665,733  bug#2861/100    2,236,960   199    94.1MB    1,449,550 (64.8%)    4,502 (0.2%)
5.5 Evaluation
The goal of our technique is to improve the scalability of the PTA of concurrency access anomalies
while guaranteeing the soundness of the analysis. Accordingly, our evaluation aims at answering the
following questions:
RQ1. Effectiveness - How much local redundancy as well as global redundancy can our ap-
proach remove from the trace?
RQ2. Efficiency - How efficient is our approach for removing trace redundancy? And how
much improvement on the scalability of PTA for concurrency access anomalies can our
approach contribute?
RQ3. Correctness - Does our approach indeed guarantee soundness empirically, i.e., does it
never remove any non-redundant events from the trace w.r.t. the PTA?
The remainder of this section presents our experimental results on the three questions. All our
experiments were conducted on an 8-core 3.00GHz Intel Xeon machine with 16GB of memory,
running Linux 2.6.22.
Benchmarks We consider a set of widely used third-party concurrency benchmarks. We
configure the program inputs to generate traces of different sizes and complexity. To understand
the performance of our technique on real applications in practice, we also include several large
server systems in our benchmarks. The first column in Table 5.1 shows the benchmarks used
in our experiments. The sizes of our evaluation benchmarks range from a few hundred lines to
over 600K lines of code.
5.5.1 RQ1: Effectiveness
The goal of our first research question is to investigate how much redundancy exists in the exe-
cution traces of real concurrent programs. To generate the data necessary for investigating this
question, we proceed as follows. For each benchmark, we first run it multiple times with differ-
ent inputs and numbers of threads, and use PECAN to collect the corresponding traces. For
each trace, we then apply our technique to produce a filtered trace with the redundancy removed.
We checked three types of patterns: data races, atomicity violations, and atomic-set serializabil-
ity violations. As our technique deals with two dimensions of redundancy (local redundancy
and global redundancy), we measured the percentage of redundant events with respect to local
and global redundancy, respectively.
Table 5.1 shows our experimental results. Column 3 (Input/#Thread) reports the input data (if
available) and the number of threads configured in the recorded execution of the benchmark.
Columns 4-6 (#Events, #SV, Size) report the number of events in the trace, the number of real
shared memory locations that contain both read and write accesses from different threads, and
the size of the trace on the disk, respectively. As the table shows, the number of events in
the trace ranges from more than 10K to 5M, with sizes from more than 400KB to 233MB on
disk. Compared to the traces evaluated in the other PTA techniques [22, 128, 130, 131], the
traces in our experiments are orders of magnitude larger. Columns 7-8 (Local,Global) report
the number of local and global redundant events, respectively, detected by our technique in the
corresponding trace. In the small benchmarks, the percentage of local redundancy ranges from
7.9% to 99.9%, and the percentage of global redundancy ranges from 0.0% to 2.9%. For the
real server programs, the percentage of local redundancy ranges from 34.7% to 85.5%, and the
percentage of global redundancy ranges from 0.0% to 0.2%.
The percentage of global redundancy is often very small compared to that of local redundancy.
The reason is that our TraceFilter algorithm first removes most of the events in
the category of local redundancy. Hence, no matter how much global redundancy there is, the
number of the remaining events in the trace after removing local redundancy is already much
smaller compared to the size of the original trace. If global redundancy is detected first, the
reported percentage of global redundancy would be much higher. However, in that case, the
entire trace should be loaded into the memory first, as detecting global redundancy requires
the entire trace. Nonetheless, the data in the table confirm our hypothesis that the redundancy
pervasively exists in concurrent programs. Although the percentage of redundancy in the real
large server programs is not as high as in the small benchmarks, it already accounts for more than
one third to a half of the entire trace.
TABLE 5.2: TraceFilter experimental results - RQ2: Efficiency

Program      Trace       TraceFilter             PTA
                         Local      Global       N (unfiltered)   Y (filtered)
BuggyPro     10,075      105ms      9ms          3.50s            1.4s
Shop         15,560      599ms      2ms          45.1s            2.6s
Loader       34,788      1.06s      5ms          456.0s           71.7s
ArrayList    40,558      14.9s      5ms          131.5s           115.6s
LinkedList   53,173      26.4s      15ms         100.5s           128.9s
RayTracer    350,688     1.07s      3ms          >2h              9.2s
SpecJBB      484,841     2.2s       12ms         112.6s           25.5s
Tsp          1,048,433   22.6s      10ms         >2h              402.5s
Moldyn       1,062,629   3.3s       4ms          >2h              27.4s
Sor          5,545,200   4.8s       1ms          >2h              33.7s
OpenJMS      904,435     9.7s       2ms          220.0s           17.2s
Tomcat       1,296,338   12.0s      5ms          1440.1s          29.7s
Jigsaw       479,105     19.5s      22ms         695.6s           35.5s
Derby        2,236,960   42.3s      16ms         >2h              177.5s
5.5.2 RQ2: Efficiency
The goal of our second research question is to assess if our approach is efficient in detecting
redundant events. Since our objective is to improve the overall scalability of PTA, the analysis
time of our technique should not contribute significantly to the overall analysis time. Hence, we
conduct experiments to evaluate the efficiency of our technique on various traces. To generate
the data necessary for investigating this question, we proceeded as follows. For both the original
trace and the filtered trace, we use PECAN to analyze the three common access anomalies on
them. During the analysis, we record the following measurements: the amount of time
needed by our technique to remove both local and global redundancy, and the time taken by the bug
detection of PTA using the filtered and the unfiltered traces. For large traces, it is possible that
PECAN is not able to load the trace into memory or finish processing the trace in a reasonable
amount of time. In such cases, we set a 2-hour time bound and terminate the analysis if it does
not finish within that bound; we report an out-of-memory error (OOM) if the analysis crashes
due to memory exhaustion.
Table 5.2 shows the experimental results. Columns 1-2 report the benchmark program and the
size of the corresponding trace. Each trace is the same as the one for evaluating the effectiveness
of our technique in Table 5.1. Columns 3-4 report the time our technique takes to detect local
redundancy and the global redundancy, respectively, in the trace. The time for removing the
local redundancy ranges from 105ms for small traces to 42.3s for large traces, while that of
detecting global redundancy is negligible (a few milliseconds), as the number of threads in the
trace is relatively small (from 4 to 100). We observe that the analysis time of our technique
depends heavily on the complexity of the trace, e.g., the number of shared variables and the depth
of the concurrency contexts of the events in the trace. For instance, for the trace with more than 5M
events in the Sor benchmark, our technique took less than 5 seconds, whereas
it took 26.4s to process the trace in the LinkedList benchmark, which contains only 53K events.
However, overall, these results show that our technique is very efficient for removing the trace
redundancy.
On the aspect of improving the PTA scalability, Columns 5-6 report the total amount of time
for the PTA to process the trace, without and with our technique for removing the redundant
events, respectively. The data show that, in most cases (except LinkedList and ArrayList), the
time needed for PTA using our technique is significantly reduced compared to the runs without
our technique. For example, for the trace with more than 1M (1,296,338) events in the Tomcat
benchmark, our technique reduced the original PTA time from 1440.1s to 29.7s. And for the
trace with 2,236,960 events in the Derby benchmark, the trace analysis with our technique was
able to finish in less than 177.5 seconds, whereas, for the unfiltered trace, the same analysis did
not finish in 2 hours. The only two exceptions were the traces in the LinkedList benchmark and
the ArrayList benchmark. The reason is that the percentages of redundancy in these two traces
are relatively small (7.9% and 8.1% respectively). Since there are not many reduction oppor-
tunities, the amount of time for the PTA to analyze the traces with and without our technique
is comparable. Nevertheless, as our technique is efficient, even for these two traces, the bug-
detection time saved by our technique for the PTA still almost offsets the cost incurred by the
redundancy removal. In summary, the results demonstrate that our approach is very effective in
removing the trace redundancy and thereby significantly improving the scalability of PTA for
detecting concurrency access anomalies in real world large traces.
5.5.3 RQ3: Correctness
The validity of the effectiveness and the efficiency evaluation is based on the assumption that our
technique does not affect the analysis results of PTA presented to the programmer. Although our
redundancy model in Section 5.3.1 shows that our technique is able to guarantee the soundness,
i.e., it does not misclassify any non-redundant event to be redundant, we would also like to see
whether the claim holds empirically in large traces in practice. It is important for us to confirm
the correctness of our technique with experiments.
For large traces, verifying the correctness of the PTA results is difficult because, in many cases, the
bug detection on the unfiltered trace does not finish in two hours. Therefore, we are unable to analyze the
benchmarks RayTracer, Moldyn, Tsp, Sor, and Derby. For the traces of the other benchmarks, we first
run them unaltered through PECAN and obtain the detected access anomalies that are poten-
tially duplicated with respect to the source code locations. From these results, we remove the
TABLE 5.3: TraceFilter experimental results - RQ3: Correctness

Program      Trace       Race          Atom          ASV
                         N      Y      N      Y      N      Y
BuggyPro     10,075      9      9      1      1      0      0
Shop         15,560      16     16     6      6      0      0
Loader       34,788      2      2      0      0      0      0
ArrayList    40,558      0      0      2      2      4      4
LinkedList   53,173      0      0      4      4      34     34
SpecJBB      484,841     24     24     1      1      0      0
OpenJMS      904,435     3      3      7      7      0      0
Tomcat       1,296,338   0      0      0      0      0      0
Jigsaw       479,105     121    121    209    209    443    443
duplicated reports and compare the remaining results to the ones reported by PECAN using the
filtered trace.
Table 5.3 shows the traces we selected and the number of distinct access anomalies for each type
of analysis. Columns labeled ‘N’ and ‘Y’ indicate whether the analysis is on the unfiltered or
the filtered trace. The results empirically support the correctness of our technique. For all these
traces, we found that the PTA using the filtered trace produced the same result as that of the
unfiltered trace. The reason that many cells in the table are zero is that PTA did not detect any
bug from the recorded trace.
5.6 Summary
We have presented a technique that automatically removes redundant events from the execution
trace, which significantly improves the scalability of predictive analysis techniques for detecting
concurrency access anomalies. In summary, we make the following contributions:
1. We define the concept of trace redundancy in the context of PTA for general access
anomalies and show that such redundancy pervasively exists in concurrent software systems.
2. We present a technique, TraceFilter, that filters out redundant events in a trace for improving
the scalability of PTA. The soundness of our technique is guaranteed by a theorem showing that
our technique does not impair the trace analysis result.
3. We evaluate our technique on a set of concurrency benchmarks as well as several large
multithreaded applications. The results show that our technique is very effective and efficient,
and can significantly improve the scalability of PTA.
Chapter 6
Dynamically Simplifying Concurrency Bug Reproduction
The technique of multiprocessor deterministic replay substantially assists debugging by mak-
ing the program execution reproducible. However, facing the huge replay traces and long re-
play time, the debugging task remains stunningly challenging for long running executions. We
present a new technique, LEAN, on top of replay, that significantly reduces the complexity of
the replay trace and the length of the replay time without losing the determinism in reproducing
concurrency bugs. The cornerstone of this work is a redundancy criterion that characterizes the
redundant computation in a buggy trace. Based on the redundancy criterion, we have developed
two novel techniques to automatically identify and remove redundant threads and redundant
instructions in the bug reproduction execution. Our evaluation results with several real world
concurrency bugs in large complex server programs demonstrate that LEAN is able to reduce
the size, the number of threads, and the number of thread context switches of the replay trace by
orders of magnitude, and accordingly greatly shorten the replay time.
6.1 Introduction
Multiprocessor deterministic replay (MDR) has been shown to be effective for concurrent program debug-
ging [3, 28, 45, 48, 70, 83, 84, 100, 127]. Several recent works [45, 48, 83, 127] have also demon-
strated that the future of low overhead MDR is positive, via special hardware designs [45, 83]
or even clever software-level approaches [48, 127]. However, MDR alone is often not suffi-
cient for debugging. Even with zero-recording-overhead MDR, the debugging task can remain
stunningly challenging for concurrent programs. We identify two main reasons. First, most
real world concurrent applications are large and complex. For any non-trivial execution, the
execution trace could be huge and complicated, containing millions (or even billions) of critical
events [122] and hundreds of thousands of thread context switches [49, 55]. It is very hard for
programmers to locate a bug by inspecting the huge amount of trace information. Moreover, the
performance of replay is often poor, and its duration is hard to predict. As replay typically requires enforcing the recorded
scheduling behavior, it is often significantly slower (5x-39000x [3, 100]) than native execution.
For long running executions, the replay phase may never end within a bounded time budget. It
is very frustrating for programmers to wait without knowing when the bug will be reproduced.
To make MDR more practical for supporting concurrent program debugging, we advocate the
simplification of the replay execution and the speeding up of the replaying process, so that pro-
grammers can locate and understand concurrency bugs more effectively using a simplified re-
producible buggy execution. To achieve this goal, we propose LEAN, a concurrency bug repro-
duction technique on top of MDR, that significantly reduces the complexity (size, number of
threads, and number of context switches) of the replay trace and shortens replay time without
losing the determinism.
Key Observation Our key observation is that most computations in a buggy execution are
often irrelevant to reproducing a concurrency bug. As shown by Vaziri, Tip and Dolby [125],
most concurrency bugs are exhibited by only two threads and one or two shared variables.
The rest of the threads and shared variable accesses, if not required to understand the bug,
are redundant and can be removed from the execution. This observation also is empirically
confirmed by a comprehensive study by Lu et al. [74] on real world concurrency bugs showing
that the manifestation of more than 96% of the examined concurrency bugs involves no more
than two threads, 66% of the non-deadlock concurrency bugs involve only one variable, and 97%
of the deadlock concurrency bugs involve at most two resources. This observation also reflects
the common wisdom demonstrated by years of industrial experience (IBM ConTest [31], Stress
testing [85] and Microsoft CHESS [86]) that most concurrency bugs in practice are triggered by
a few threads and a small number of context switches. For example, stress testing for exposing
concurrency bugs typically forks as many threads as possible to repeatedly execute the same
code. However, with the correct interleaving, a few threads and repetitions are often sufficient
to trigger the bug.
To further illustrate this observation, consider a simple, but common, test case for stress testing
an account function in Figure 6.1. The parent thread T0 forks a number (N) of child threads
Ti (i = 1, 2, . . . , N), each of which repeatedly validates two methods M times: increasing and
decreasing the account by a certain amount (i). There are three assertions (A, B, C) in the
program. When an assertion is violated, in the worst case, the buggy execution trace contains N
threads (excluding T0) and M × N iterations of increasing/decreasing operations on the account.
However, in the best case, only two threads and two iterations are needed to reproduce the bug.
For instance, the increment method may be non-atomic, and an erroneous interleaving may occur
between the 5th and 10th iterations of threads T(2,3), causing assertion A to be violated. To
reproduce the error, the 5th and 10th iterations of threads T(2,3) (plus the erroneous interleaving)
are sufficient. The rest of the computation is redundant and can be eliminated from the execution
without affecting the ability to reproduce the bug.

Thread T0:                         Thread Ti (i = 1..N):
   account.set(0);                    for j = 1:M
   for i = 1:N                        {
     fork Ti                            expected = account.get()+i
   for i = 1:N                          account.increment(i)
     join Ti                       A:   assert account.get()==expected
C: assert account.get()==0              expected = account.get()-i
                                        account.decrease(i)
                                   B:   assert account.get()==expected
                                      }

FIGURE 6.1: A typical test case for stress testing an account function. A significant amount
of computation in a buggy execution of this program may be redundant.
Contributions We propose a criterion to characterize redundant computation in a buggy trace.
The criterion ensures that, after removing a redundant computation, the resultant execution is
able to reproduce the same concurrency bug. Based on the criterion, LEAN simplifies the buggy
execution by iteratively identifying and removing redundant computation from the original ex-
ecution trace (skipping the computation by controlling the execution) and, at the same time,
enforcing the same schedule between threads in the reduced execution as that in the original
buggy execution. The final result produced by LEAN is a simplified execution with redundant
computation removed.
The key challenge we address is how to effectively identify redundant computation. We further
categorize redundant computation into two dimensions: whole-thread redundancy and partial-
thread redundancy. Whole-thread redundancy identifies threads whose entire computation is
redundant. For example, threads except T(0,2,3) in our example are redundant threads and all
their computation can be removed. Partial-thread redundancy characterizes redundant instruc-
tions as part of each individual thread. For example, all iterations (except the 5th and 10th) of
threads T(2,3) in our example are partial-thread redundant.
We develop two effective techniques based on delta-debugging [144] to identify whole-thread
redundancy and partial-thread redundancy, respectively. To reduce the search space of delta-
debugging, we utilize the parent-child relationship between threads to iteratively identify
whole-thread redundancy using the dynamic thread hierarchy graph. For partial-thread redun-
dancy, we combine an adapted multithreaded program slicing technique [121] and a repetition
analysis to remove irrelevant instructions and to identify the redundant iterations of computation.
To further improve effectiveness, we also provide an easy-to-use repetition analysis framework
that allows programmers to annotate repetitive code segments of which some execution itera-
tions are potentially redundant. All redundant iterations are then automatically validated and
filtered out by our technique.
Note that the redundancy criterion is black-box in nature. It does not rely on any data or control
dependency information, and is completely based on the bug reproduction property. This allows
us to explore more simplification opportunities than white-box approaches such as program
slicing [40, 63, 90].
We implemented LEAN on top of LEAP for Java programs. Our evaluation results on a set of
real concurrency bugs in popular multithreaded benchmarks as well as several large complex
concurrent systems demonstrate that LEAN is able to significantly reduce the complexity of
the buggy execution and shorten replay time without losing determinism. LEAN produces a
simplified execution typically within 20 iterations. LEAN is able to reduce the size of the replay
trace by as much as 324x, the number of threads and thread context switches by 99.3% and
99.6%, and shorten the replay time by more than 300x.
The remainder of this chapter is organized as follows: Section 6.2 presents a model of trace
redundancy; Section 6.3 presents our technique; Section 6.4 presents our implementation and
Section 6.5 presents a case study of simplifying the reproduction of a real concurrency bug;
Section 6.6 reports our experimental results and Section 6.7 summarizes this chapter.
6.2 A Model of Trace Redundancy
Starting from an initial state Σ0 and following a schedule ξ, the program can reach a final state
Σf . We say ξ exhibits a bug if Σf satisfies a predicate, say φ, that denotes the bug. The bug
predicate is defined as follows:
Definition 6.1. (Bug predicate) A bug predicate, φ, characterizes the exhibition of a bug in the
program execution over the final program state. The bug is exhibited in the execution iff φ(Σf )
evaluates to true.
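To make the definition concrete, the following sketch (illustrative only, not part of LEAN's implementation) encodes a bug predicate in Java; the map-based abstraction of the final state Σf and the variable name are our assumptions for illustration:

    import java.util.Map;
    import java.util.function.Predicate;

    // Sketch: a bug predicate φ over an abstraction of the final state Σ_f.
    // Here the state is summarized as a map from variable names to values;
    // a real replay system would expose its own view of the final state.
    public class BugPredicateExample {
        public static void main(String[] args) {
            // φ for the account example: the bug manifests iff the final
            // balance differs from the expected value 0 (assertion C).
            Predicate<Map<String, Integer>> phi =
                    state -> state.getOrDefault("account", 0) != 0;

            Map<String, Integer> finalState = Map.of("account", 5);
            System.out.println("bug exhibited: " + phi.test(finalState)); // true
        }
    }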
Following different schedules, however, Σf may be different and may or may not satisfy φ. We
call a bug a sequential bug if some sequential schedule is able to exhibit it, and a concurrency
bug if only a non-sequential schedule can exhibit it.
From a high level view, LEAN simplifies the concurrency bug reproduction by controlling the
program execution to skip instructions in the program that are redundant to reproducing the bug.
Generally speaking, an instruction (or a group of instructions) cannot be arbitrarily skipped,
as it may result in two possible negative consequences: the program malfunctions, or the bug
disappears. The program might malfunction if the skipped instruction is an indispensable part
of the program logic, while the bug might disappear if the skipped instruction is related to the
bug. Either consequence will make the reduced execution not useful for debugging.
We propose a redundancy criterion for the concurrency bug reproduction that ensures neither of
these two outcomes will occur if a redundant instruction is skipped. The basic idea is that, after
removing the redundancy, the same bug is reproduced. A subtle problem in defining the criterion
is that we may not have such a bug predicate φ as defined in Definition 6.1. In practice, we often
use assertions or rely on runtime exceptions to determine whether a bug is exhibited or not.
However, the assertions or exceptions may be insufficient to distinguish between the behavior of
the bug manifestation and the behavior of program malfunction, in which case the program is no
longer working properly as expected due to the removal of a necessary instruction. For example,
the assertion that characterizes the bug in the original execution may always be violated after
removing a certain instruction. Although the reduced execution manifests the violation of the
assertion, it is not useful for debugging because the assertion is not able to characterize the same
bug as that in the original execution.
We tackle this issue from the perspective of thread interleavings. For a concurrency bug, essen-
tially, it is some non-deterministic buggy interleavings that cause the bug (assuming the input is
deterministic). For debugging, programmers want to understand how the bug occurs with these
buggy interleavings. If the program executes sequentially and behaves correctly, the bug should
not manifest. On the other hand, if the program malfunctions after removing an instruction, ei-
ther the program cannot proceed to execute the buggy statement or the bug predicate φ is always
satisfied regardless of the buggy interleavings. Therefore, we define the redundancy criterion as
follows:
Definition 6.2. (Trace redundancy criterion) Consider a trace δ that exhibits a concurrency
bug (δ drives the program to a state satisfying the bug predicate φ) and a subset E of the events
in δ. Let δ/E denote the trace δ with the events in E removed. E is redundant if the following
two conditions are satisfied:
I. δ/E can still drive the program to a state that satisfies φ;
II. some sequential schedule of the reduced execution does not satisfy φ.
We assume φ characterizes a concurrency bug. The soundness of this criterion is easy to see.
First, Conditions I and II together ensure that the reproduced bug is a concurrency bug,
because φ is satisfied under the original buggy schedule (excluding the events in E), but not
under a sequential schedule. Second, consider Condition II: since φ is evaluated but not satisfied
(i.e., the bug does not manifest) under a sequential schedule (we do not need to check all
sequential schedules; checking any one of them is sufficient to validate whether the bug is still
a concurrency bug), the program does not malfunction after removing the events in E. Otherwise,
either φ would not be evaluated or φ would always be satisfied. Hence, the same concurrency bug
is reproduced under Conditions I and II.
It is worth noting that trace redundancy is not defined over a single event but a subset of events
in the trace, which correspond to a group of instructions in the program execution. The reason is
that redundant instructions are not independent but may be closely related to each other. A group
of instructions may be redundant but any single instruction may not. For example, suppose an
erroneous interleaving between the 5th and 10th iterations of threads T(2,3) manifests the bug
in Figure 6.1. The whole computation of thread T1 is redundant, but any single instruction of
T1 alone is not. Without any dependence information between the instructions, removing trace
redundancy is a combinatorial optimization problem, which is exponential in the number of
instructions in the original buggy execution.
To facilitate more effective simplification, we further characterize redundancy into two dimen-
sions:
• whole-thread redundancy - all computation of a thread is redundant;
• partial-thread redundancy - some instructions of an individual thread are redundant.
This categorization utilizes the thread identity relationship between the computations. In prac-
tice, threads are more likely to be independent from one another than are individual instructions.
We can skip all the computation of a redundant thread. Compared to whole-thread redundancy,
partial-thread redundancy examines the instructions local to each individual thread. If an in-
struction by a certain thread is redundant, we can skip it during the execution of that thread. In
our illustrating example, all the other threads except T(0,2,3) are redundant (whole-thread redun-
dancy), and most of the repetitions of threads T(2,3) are redundant (partial-thread redundancy).
6.3 Automatic Redundancy Removal
We propose two techniques to remove trace redundancy for simplifying concurrency bug re-
production. The first technique effectively validates and removes whole-thread redundancy by
adapting delta-debugging [144] using thread hierarchy information. Our technique produces
a 1-minimal set of threads [144] that are not redundant in the buggy execution.
T0
├── T1
│   ├── T1:1
│   └── T1:2
│       ├── T1:2:1
│       └── T1:2:2
├── T2
│   ├── T2:1
│   ├── T2:2
│   │   └── T2:2:1 …
│   └── T2:3
│       └── T2:3:1 …
├── T3
└── …

FIGURE 6.2: An example of a dynamic thread hierarchy graph (TH-Tree). When T(1,3) are
selected, all of T(1,3) and their descendants are disabled.
The second technique targets irrelevant instructions and repetitions. It combines a dynamic multithreaded
slicing technique and a static repetition analysis, as well as a simple annotation framework that
integrates programmers’ hints. The entire simplification process is deterministic. There is no
interleaving non-determinism during simplification as we control all thread scheduling during
replay.
6.3.1 Removing Whole-Thread Redundancy
Our general idea for whole-thread redundancy follows the approach of hierarchical delta-debugging
[82, 144]. We use a bisection method to pick candidate threads and test whether they can be re-
moved from the execution or not. More specifically, we control the program to disable the
selected candidate threads and validate the reduced execution for the two conditions defined in
our redundancy criterion in Section 6.2. Our technique for removing whole-thread redundancy
is fully automatic. It does not require any user intervention.
There are two main challenges. First, threads may not be arbitrarily removed. For example,
if a parent thread is removed, none of its descendants will execute. Second, after removing a
redundant thread, we must compute the schedule of the remaining threads (in order to deter-
ministically replay the reduced execution). We address these problems as follows. First, we
extract a dynamic thread hierarchy graph of the original buggy execution (TH-Tree) and per-
form delta-debugging based on the TH-Tree, to make sure that if a parent thread is disabled, all
its descendant threads are disabled. Figure 6.2 shows an example of the TH-Tree. For example,
if T1 and T3 are selected, all their descendants (shown in the gray boxes in Figure 6.2) are also
selected. Second, we compute the schedule for the remaining threads by projecting the trace on
thread ID without the IDs of the selected candidate threads and their descendants. The schedule
is enforced in the validation run to test whether the bug can still be reproduced or not.
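The projection itself is a single pass over the trace. The following sketch illustrates the idea, assuming a hypothetical Event record carrying a thread ID; LEAP's actual trace format differs:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Sketch: compute the schedule of the remaining threads by projecting
    // the trace on thread IDs, dropping the disabled candidate threads
    // (and, in general, their descendants).
    public class ScheduleProjection {
        record Event(int threadId, String op) {}  // hypothetical event shape

        // The global schedule is the sequence of thread IDs of the
        // remaining events, in the original trace order.
        static List<Integer> projectSchedule(List<Event> trace, Set<Integer> disabled) {
            return trace.stream()
                    .filter(e -> !disabled.contains(e.threadId()))
                    .map(Event::threadId)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<Event> trace = List.of(
                    new Event(1, "w(x)"), new Event(2, "r(x)"),
                    new Event(3, "w(x)"), new Event(2, "w(y)"));
            // Disable thread 1.
            System.out.println(projectSchedule(trace, Set.of(1))); // [2, 3, 2]
        }
    }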
Let validate and cx be given such that validate(cx) = ✗ (fail). The algorithm computes
c′x = ddmin(cx) = ddmin2(cx, 2) such that c′x ⊆ cx, validate(c′x) = ✗, and c′x is 1-minimal.

ddmin2(c′x, n) =
    ddmin2(∆i, 2)                     if ∃i ∈ {1, …, n} . validate(∆i) = ✗
    ddmin2(∇i, max(n − 1, 2))         else if ∃i ∈ {1, …, n} . validate(∇i) = ✗
    ddmin2(c′x, min(|c′x|, 2n))       else if n < |c′x|
    c′x                               otherwise

where ∇i = c′x − ∆i, c′x = ∆1 ∪ ∆2 ∪ · · · ∪ ∆n, all ∆i are pairwise disjoint, and |∆i| ≈ |c′x|/n.

FIGURE 6.3: The delta-debugging algorithm. The function validate returns true if the two
conditions in the redundancy criterion are both satisfied. For conciseness, the input trace is
ignored in the ddmin algorithm.
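For reference, the ddmin recursion in Figure 6.3 can be rendered compactly in Java as below. This is our sketch of the standard algorithm, with the validate oracle abstracted as a predicate; in LEAN, validate replays the reduced trace and checks the two redundancy conditions:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Predicate;

    // Sketch of the ddmin algorithm from Figure 6.3 over a candidate set.
    // validate(c) plays the role of the test function: it returns true
    // ("✗") when the bug is still reproduced after keeping only the
    // elements of c (e.g., the selected threads).
    public class DDMin {
        static <T> List<T> ddmin(List<T> c, Predicate<List<T>> validate) {
            return ddmin2(c, 2, validate);
        }

        static <T> List<T> ddmin2(List<T> c, int n, Predicate<List<T>> validate) {
            if (c.size() <= 1) return c;              // cannot be reduced further
            List<List<T>> deltas = partition(c, n);
            for (List<T> delta : deltas)              // "reduce to subset"
                if (!delta.isEmpty() && delta.size() < c.size()
                        && validate.test(delta))
                    return ddmin2(delta, 2, validate);
            for (List<T> delta : deltas) {            // "reduce to complement"
                if (delta.isEmpty()) continue;
                List<T> nabla = new ArrayList<>(c);
                nabla.removeAll(delta);
                if (validate.test(nabla))
                    return ddmin2(nabla, Math.max(n - 1, 2), validate);
            }
            if (n < c.size())                         // "increase granularity"
                return ddmin2(c, Math.min(c.size(), 2 * n), validate);
            return c;                                 // done: c is 1-minimal
        }

        // Split c into n roughly equal, pairwise disjoint sublists.
        static <T> List<List<T>> partition(List<T> c, int n) {
            List<List<T>> parts = new ArrayList<>();
            int size = c.size();
            for (int i = 0; i < n; i++)
                parts.add(c.subList(i * size / n, (i + 1) * size / n));
            return parts;
        }
    }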
Algorithm 8 summarizes our algorithm. Given the original buggy trace, the algorithm produces a
simplified trace (execution) containing only the 1-minimal set of threads that is able to reproduce
the bug. The 1-minimal property means that all remaining threads are necessary: removing any
one of them would cause the reduced execution to fail to reproduce the bug. Our algorithm starts
by iterating on the height of the TH-Tree. In each iteration, we pick the candidate threads with
the same height. Starting from the threads with height 1 (the main thread is of height 0), we
first select the candidate threads (thread sets) to be validated for the redundancy. If a thread
is selected, its descendants are all disabled. We then process the selected threads using a delta-
debugging algorithm, as shown in Figure 6.3. Each invocation of delta-debugging computes the
1-minimal set of threads (in the input threads denoted by cx) that are necessary to reproduce
the bug. The set cx in the ddmin algorithm corresponds to the selected threads. The validate
procedure (Algorithm 9) corresponds to the test function in delta-debugging. It tests whether the
two conditions in the redundancy criterion are both satisfied after disabling the selected threads:
(1) the bug is reproduced with the computed schedule of the remaining threads; (2) the bug is
not reproduced with a sequential schedule. If both conditions are true, it means that the selected
threads are redundant and they are removed from the execution. This process is repeated for all
levels of threads in the TH-Tree, until no new thread can be removed.
Algorithm 8 RemoveWholeThreadRedundancy(δ)
1: Input: δ – the original trace ⟨ei⟩
2: Output: δ′ – the simplified trace with all redundant threads removed
3: TH_Tree ← ExtractThreadHierarchyGraph(δ)
4: height ← the height of TH_Tree
5: for level ← 1 : height do
6:     thread_set ← get_threads(TH_Tree, level)
7:     minimal_threads ← DeltaDebugging(δ, thread_set)
8:     redundant_threads ← (thread_set ∖ minimal_threads) and their descendants
9:     remove redundant_threads from TH_Tree
10:    remove all events by redundant_threads in δ
11: return δ
Algorithm 9 Validate(δ,disabled threads)
1: Input: δ – a trace ⟨ei⟩
2: Input: disabled_threads – a set of disabled threads
3: δ′ ← remove all events by disabled_threads in δ
4: ξ ← get_schedule(δ′)
5: ξseq ← get_sequential_schedule(δ′)
6: if IsBugReproduced(δ′, ξ) then
7:     if IsBugNotReproduced(δ′, ξseq) then
8:         return true
9: return false
6.3.2 Removing Partial-Thread Redundancy
To identify partial-thread redundancy, we may directly apply delta-debugging at the level of
individual instructions. However, this naive approach is ineffective because enumerating and
validating every combination of instructions for each individual thread could be very expensive.
To improve efficiency, our technique combines multithreaded dynamic slicing with a repetition
analysis to identify the redundant computation local to each individual thread. Dynamic slicing
tracks the data and control dependencies between instructions in the execution trace and removes
those instructions that are irrelevant to the bug. Repetition analysis is a heuristic that aims at
removing the redundancy related to repetitions. To further improve the effectiveness of repe-
tition analysis, LEAN also provides a simple framework that allows programmers to annotate
repetitive code segments, which significantly reduces the search space.
6.3.2.1 Multithreaded dynamic slicing
The dynamic dependence graph (DDG) is the classical model for slicing single-threaded execu-
tions, which captures the dynamically exercised Read-After-Write (RAW) and control depen-
dencies. Each node in the DDG represents an execution instance of a statement (an instruction)
while edges represent the dependences. For multithreaded executions, Tallam et al. [121] pro-
pose a dynamic slicing model for data race detection. Their model extends the DDG to
consider the additional data dependencies on shared variable accesses.
Our slicing model for concurrency bug reproduction is similar to but more strict than the model
by Tallam et al. [121]. To guarantee deterministic bug reproduction, in addition to the shared
variable read/write dependencies, we also need to consider the dependencies on synchroniza-
tion operations. Specifically, given a buggy execution, we construct a multithreaded depen-
dence graph (MDG) that consists of the DDG for each individual thread as well as the depen-
dence relation → (recall Definition 2.5) between instructions by different threads. Note that the
WRITE→WRITE dependency must be included in the MDG, to ensure the correctness of MDR
[49]. Otherwise, a read in the replaying phase may return the value written by a different write
from that in the original buggy execution, which may cause the failure of MDR.
Algorithm 10 shows our dynamic slicing algorithm for removing the partial-thread redundancy.
We first construct the MDG that includes the DDG for each thread in the execution and the
synchronization and shared variable dependencies. Starting from the buggy instruction which
violates the bug predicate, we perform a backward analysis that keeps only the instructions with
a direct or a transitive dependency relation to the buggy instruction. All other instructions are
marked to be irrelevant to reproducing the bug and are skipped in the simplified execution.
Algorithm 10 DynamicMultithreadedSlicing(δ, αf)
1: Input: δ – the full execution trace after removing all redundant threads
2: Input: αf – the buggy instruction
3: Output: δ′ – the simplified trace
4: mdg ← ConstructMultithreadedDependencyGraph(δ)
5: mdg′ ← ReverseEdge(mdg)
6: relevant_instructions ← DepthFirstSearch(αf) on mdg′
7: δ′ ← remove the instructions from δ that are not in relevant_instructions
8: return δ′
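The backward pass in Algorithm 10 is a plain graph reachability computation. A sketch, assuming the MDG is given as reversed adjacency lists keyed by event index (our representation, not the actual implementation's):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the backward analysis: starting from the buggy event, keep
    // every event it (transitively) depends on. reversedMdg maps each event
    // to the events it depends on (RAW, control, synchronization, W->W).
    public class BackwardSlice {
        static Set<Integer> relevantEvents(Map<Integer, List<Integer>> reversedMdg,
                                           int buggyEvent) {
            Set<Integer> relevant = new HashSet<>();
            Deque<Integer> stack = new ArrayDeque<>();
            stack.push(buggyEvent);
            while (!stack.isEmpty()) {
                int e = stack.pop();
                if (!relevant.add(e)) continue;   // already visited
                for (int dep : reversedMdg.getOrDefault(e, List.of()))
                    stack.push(dep);
            }
            return relevant;  // all other events are sliced away
        }
    }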
6.3.2.2 Repetition analysis
Redundancy is often caused by repetitions. Specifically, we observe that a large portion of
the redundant computation by each individual thread is rooted in repetitive code blocks (RCBs)
that contain repeated operations in loops. The operations inside an RCB are expected to execute
for a number of iterations governed by the loop condition, with no break operation. The loop
variable is often a primitive value (e.g., an integer) that is used as a counter for the number of
iterations so far. We propose a static repetition analysis to identify RCBs in the program. The RCBs
are used as a pool of potentially redundant computation that we may simplify. Each execution
iteration of an RCB is considered potentially redundant. After validating the redundancy of an
iteration using our redundancy criterion, we can remove all computation of this iteration from
the execution.
Our repetition analysis is based on a simple intra-procedural loop analysis. For each loop, we
consider two conditions to mark it as a potential RCB. First, the loop condition contains only
constants or primitive data, and the loop variable is only incremented or decremented once in
each iteration. Second, there is no break operation inside the loop (exceptions are allowed).
Despite its simplicity, our experiments show that this analysis is effective and efficient for
identifying redundant computation caused by RCBs.
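The following sketch shows the shape of this check; the Loop interface and its accessors are hypothetical stand-ins for whatever intermediate representation the underlying static analysis framework provides:

    // Sketch of the RCB heuristic over a hypothetical loop representation.
    public class RcbHeuristic {
        interface Loop {
            boolean conditionUsesOnlyConstantsAndPrimitives();
            int loopVariableUpdatesPerIteration(); // increments/decrements per iteration
            boolean containsBreak();               // exceptions are still allowed
        }

        // A loop is a candidate repetitive code block (RCB) if its trip
        // count is governed by a simple counter and it cannot exit early
        // via break.
        static boolean isPotentialRcb(Loop loop) {
            return loop.conditionUsesOnlyConstantsAndPrimitives()
                    && loop.loopVariableUpdatesPerIteration() == 1
                    && !loop.containsBreak();
        }
    }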
Algorithm 11 shows our algorithm for removing partial-thread redundancy caused by repeti-
tions. This algorithm is applied after slicing the buggy trace. We first identify the RCB that
Algorithm 11 RemoveRepetitionRedundancy(p, δ)
1: Input: p – the program
2: Input: δ – the trace after slicing
3: Output: δ′ – the final simplified trace
4: statements ← GetRepetitiveCodeBlocks(p)
5: threads ← get_threads(δ)
6: for t in threads do
7:     for σ in statements do
8:         all_iterations ← get_iterations(δ, t, σ)
9:         minimal_iterations ← DeltaDebugging(δ, all_iterations)
10:        remove (all_iterations ∖ minimal_iterations) in δ
11: return δ
Ti:

    for j = 1:M
    {
        @rcb-begin
        expected = account.get() + i
        account.increment(i)
    A:  assert account.get() == expected
        expected = account.get() - i
        account.decrease(i)
    B:  assert account.get() == expected
        @rcb-end
    }

FIGURE 6.4: Some iterations of the code block demarcated by @rcb-begin and @rcb-end
are specified as potentially redundant.
contains potentially redundant computation. We then perform delta-debugging on each iteration
of the RCB for each thread, to validate the redundancy of the computation corresponding to the
iteration.
A framework for repetition analysis LEAN also provides an option for programmers to
annotate RCBs, which can significantly improve the effectiveness of our automatic repetition
analysis. Our general observation is that programmers often know whether a code block is
repetitive or not (in particular, when writing test drivers). This piece of information is easy
for programmers to specify (e.g., using simple annotations), but very difficult to identify by
any automatic approach because of the absence of a general repetition criterion. More impor-
tantly, without any further intervention, we can help programmers automatically validate whether
some executions of the RCBs are redundant or not, and eliminate them from the buggy execution
if they are redundant.
Program + buggy trace
    → hierarchical delta-debugging (remove whole-thread redundancy)
    → dynamic slicing + repetition analysis (remove partial-thread redundancy)
    → simplified buggy trace

FIGURE 6.5: An overview of LEAN
Our framework is easy to use. Programmers simply mark the beginning and the end of the RCB
by @rcb-begin and @rcb-end, respectively. For example, programmers may mark the
RCB for thread Ti as shown in Figure 6.4. We then perform delta-debugging on each
iteration of the code, and filter out most of the redundant iterations. Also, this framework is flexible.
New annotations may be added after each round of simplification, when programmers get more
information about the bug from the intermediate simplified execution.
6.4 Implementation
To evaluate our technique, we have implemented a prototype of LEAN on top of LEAP. Figure
6.5 shows an overview of LEAN. Given the target concurrent program and the buggy execution
trace, LEAN first removes the whole-thread redundancy from the trace using Algorithm 8. It
then further simplifies the resultant execution by removing the partial-thread redundancy using
Algorithm 10 and Algorithm 11. The final output produced by LEAN is a simplified buggy
execution in which redundant computation is skipped in the replayed execution.
For delta-debugging, we faithfully implemented the algorithm described in Figure 6.3. Our
slicing implementation is based on the Indus framework [104], which we adapt for dynamic
multithreaded execution traces. In addition to the data dependencies across threads, slicing also
takes care of all the data and control dependencies internal to each individual thread in the
execution.
To disable an instruction, we instrument the program to insert control statements before the
statement which corresponds to the instruction. For example, to disable a thread, we insert
control instrumentation before Thread.start() and Thread.join() to make sure that the disabled
thread is not executed and joined by any other thread. We distinguish dynamic threads by
assigning a unique ID to each thread instance (explained in Section 6.3.1). For partial-thread
redundancy, we also maintain a thread-local counter for each annotated RCB, to denote the
iteration instance of each thread in executing the RCB.
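The following sketch illustrates the effect of this instrumentation, using the helper names from Figure 6.7 (shouldStartThread, shouldJoinThread, shouldExecuteIteration); the stub implementations here are placeholders for lookups into the delta-debugging driver:

    // Sketch of LEAN-style control instrumentation. The stubs below stand
    // in for queries to the delta-debugging driver, which decides which
    // dynamic threads and which RCB iterations are enabled in the current
    // validation run.
    public class ControlInstrumentation {
        static boolean shouldStartThread(Thread t) { return true; }      // stub
        static boolean shouldJoinThread(Thread t) { return true; }       // stub
        static boolean shouldExecuteIteration(int i) { return true; }    // stub

        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(() -> {
                // The loop body corresponds to an annotated RCB; a
                // thread-local counter identifies the iteration instance.
                for (int j = 0; j < 10; j++) {
                    if (!shouldExecuteIteration(j)) continue; // skip redundant iteration
                    // ... body of the repetitive code block ...
                }
            });
            if (shouldStartThread(worker)) worker.start(); // disabled threads never start
            if (shouldJoinThread(worker)) worker.join();   // ...and are never joined
        }
    }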
TableDescriptor {
    getObjectName() {
        if (referencedColumnMap == null) {
            …
        }
        else {
            for (int i = 0; i < …; i++) {
                referencedColumnMap.isSet(…)
            }
        }
    }

    setReferencedColumnMap(…) {
        referencedColumnMap = null;
    }
}

FIGURE 6.6: A real concurrency bug #2861 in Derby. The thread interleaving following
the solid arrow on the shared data referencedColumnMap crashed the program with
NullPointerException.
To control the thread schedule, we reuse the application-level scheduler of LEAP. The thread
IDs of all the events in the trace form a global schedule. After disabling a thread, we simply
remove the thread ID from the global schedule. To enforce a sequential schedule, we control the
execution of a thread until it terminates or cannot continue execution (i.e., it is waiting for a lock
or for the termination of another thread), and then randomly pick an enabled thread to pro-
ceed. For removing partial-thread redundancy, we also associate each event in the trace with its
corresponding statement in the program. User annotated RCBs are interpreted as special state-
ment blocks. To generate the remaining schedule after disabling a certain iteration of an RCB,
we first remove the corresponding events in the trace according to the RCB and the per-iteration
information, and then compute the schedule by performing a projection of the remaining trace
on the thread ID.
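A minimal sketch of such an application-level scheduler is shown below, assuming the global schedule is given as the sequence of thread IDs of the events in the trace; LEAP's actual scheduler handles many more details (blocking operations, thread termination, and so on):

    import java.util.List;

    // Sketch of an application-level scheduler that enforces a global
    // schedule given as a sequence of thread IDs. Instrumentation calls
    // waitForTurn() before each critical event, so events occur in trace
    // order. Disabling a thread amounts to deleting its ID from the
    // schedule before replay.
    public class GlobalScheduler {
        private final List<Long> schedule; // thread IDs, in trace order
        private int pos = 0;

        GlobalScheduler(List<Long> schedule) { this.schedule = schedule; }

        synchronized void waitForTurn() throws InterruptedException {
            long me = Thread.currentThread().getId();
            while (pos < schedule.size() && schedule.get(pos) != me)
                wait();                            // not my turn yet
            if (pos < schedule.size()) pos++;      // consume this slot
            notifyAll();                           // wake the owner of the next slot
        }
    }

To enforce a sequential schedule, the schedule would simply list all of one thread's slots consecutively before moving to the next enabled thread.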
6.5 A Case Study
In this section, we present a case study of reproducing a concurrency bug in Apache Derby
DBMS. We illustrate how LEAN simplifies the bug reproduction.
6.5.1 Description of Derby Bug #2861
Figure 6.6 shows the concurrency bug #2861 we study in the Apache bug database. The shared
data referencedColumnMap is checked for null at the top of the getObjectName
method and later dereferenced if it is not null. Due to an erroneous interleaving, another thread
can set referencedColumnMap to null in the setReferencedColumnMap method and cause the
program to crash by throwing a NullPointerException. Figure 6.7 shows a driver pro-
gram (also documented in the bug database) for triggering the bug. Ignore all the gray areas for
the moment; these are statements inserted by LEAN. The driver program starts N threads each
creating (lines 41-45) and then dropping (lines 48-51) a separate view against the same source
view, repeated M times. Because of non-determinism, the bug is very difficult to manifest with
small N and M. In our experiment with N=2 and M=2 on an eight-core Linux machine, we did
not observe a single failure in 1,000 runs. With a larger number of threads and repeti-
tions, the probability of triggering the bug is increased. When we set N=10 and M=10, we were
able to trigger the bug in three out of 1000 runs.
With the help of a MDR system such as LEAP, we are able to deterministically reproduce the
bug. The problem is that the bug reproduction run is too complicated, with too many threads
(11) and thread context switches (6,439). The size of the execution trace (which contains the
critical events only) is as large as 94.1M, and it took LEAP 466 seconds to reproduce the bug.
6.5.2 How LEAN Simplifies the Bug Reproduction
LEAN simplifies the reproduction of this bug by removing the redundant computation in the
reproducible buggy execution. Although there are ten testing threads, each of which repeats ten
times, in the buggy execution, we can observe that, in the best case, two testing threads, each
with one iteration, are sufficient to trigger the bug. The other eight threads and nine iterations are
redundant and can be removed from the bug reproduction run.
Taking the original buggy execution as the input, LEAN first identifies and removes the re-
dundant threads in the execution using Algorithm 8. Figure 6.8 illustrates the simplification
process. Because the dynamic thread hierarchy graph in the buggy execution contains one level
of threads, the entire simplification process invokes the delta-debugging procedure only once,
which is applied directly to threads T(1,2,...,10). To skip a thread, LEAN controls the execution
of the program by inserting a condition check before Thread.start() and Thread.join() (as
shown in the gray areas at lines 23 and 27 in Figure 6.7). A thread is not started or joined if it
is removed. After four rounds of simplification, threads T(2,3) remain in the reduced execution
and all the other threads are removed. This process took 1,841 seconds in our experiment. After
removing the redundant threads, 75.1M(79.8%) of the events in the original buggy trace were
removed and the size of the remaining trace was reduced to 19M.
After removing whole-thread redundancy, LEAN then further processes the reduced buggy ex-
ecution to remove partial-thread redundancy. It first performs dynamic slicing to remove ir-
relevant instructions using Algorithm 10. As slicing tracks all the dynamic data dependencies
across threads as well as all the intra-thread data and control dependencies in the remaining
buggy execution, it took LEAN 553 seconds to finish the slicing process in our experiment, and
an additional 6.2M(6.6%) of the events were removed from the trace. Similar to the control
of threads, we simply insert control statements before the irrelevant instructions to skip their
executions.
LEAN then continues to simplify the reduced buggy execution by removing the redundant repe-
titions using Algorithm 11. Our automatic repetition analysis successfully identified the RCB at
lines 42-53 in the test thread, as demarcated by @rcb-begin and @rcb-end at lines 41 and
54 in Figure 6.7. To control the execution of a certain iteration i of the RCB, we insert a control
statement before the RCB with i as the input parameter (as shown in the gray area at line 40),
determining whether the ith iteration is enabled or not. Figure 6.9 illustrates the simplification
process for LEAN to remove the redundant execution iterations of the RCB of threads T(2,3).
After ten rounds of simplification, the 7th iteration of T2 and the 4th iteration of T3 remain and
all the other iterations are removed. This process took around 200 seconds in our experiment.
An additional 11.6M (12.3%) of the events were removed and the size of the final buggy trace
was reduced to around 2.01M.
In total, it took LEAN 2,593 seconds to simplify the original buggy execution. The final simpli-
fied execution was able to reproduce the same bug and was significantly simpler than the original
buggy execution. The simplified trace size was reduced by 47x (from 94.1M to 2.01M), con-
taining only 3 threads (T(0,2,3)) and 433 thread context switches, and its replay time by LEAN
was shortened by 46x (from 446 to 10.2 seconds). Moreover, all the instrumentations and the
thread scheduler in LEAN are transparent to the programmers, such that the debugging task can
be performed on the simplified buggy execution in a normal debugging environment.
6.6 Experiments
The goal of our technique is to improve the effectiveness of the MDR support for debugging
concurrent programs by removing redundancy from the reproducible buggy trace. Accordingly,
our evaluation aims at answering the following two research questions:
RQ1. Effectiveness - Is LEAN effective in simplifying real buggy traces? How much reduction
of the replay time and the trace complexity (i.e., size, threads, and context switches) can
our approach achieve?
RQ2. Efficiency - How efficient is LEAN for identifying and removing the trace redundancy?
Benchmarks We quantify our technique using a set of widely used third-party concurrency
benchmarks with known bugs. We configure the program inputs to generate buggy traces of
different sizes and complexity. To understand the performance of our technique on real appli-
cations in practice, we also include several large concurrent server systems in our benchmarks.
TABLE 6.1: LEAN evaluation benchmarks

Program          SLOC     Input / #Threads / #Iterations
BuggyPro         348      race exception / 33 / –
Tsp              709      map4 / 4 / –
ArrayList        5,979    not-atomic bug / 450 / –
LinkedList       5,866    not-atomic bug / 450 / –
OpenJMS-0.7.7    262,842  order violation bug / 20 / 10
Tomcat-5.5       339,405  bug#37458 / 10 / 10
Jigsaw-2.2.6     381,348  NPE bug / 10 / 10
Derby-10.3.2.1   665,733  bug#2861 / 10 / 10
TABLE 6.2: LEAN experimental results - RQ1: Effectiveness

            ------------ Original Trace ------------   ------------------- Simplified Trace -------------------
Program     Size    #Thr  #CS    Replay               Size            #Thread     #CS             Replay
BuggyPro    460K    34    1,003  1.27s                13.2K (↓97.1%)  4 (↓88.2%)  28 (↓97.2%)     39ms (↓97%)
Tsp         44.1M   5     9,190  280s                 22.1M (↓49.9%)  3 (↓40.0%)  4,588 (↓50.0%)  115s (↓58.9%)
ArrayList   1.72M   451   2,381  6.5s                 6.4K (↓99.6%)   3 (↓99.3%)  10 (↓99.6%)     20ms (↓99.7%)
LinkedList  2.20M   451   2,564  7.2s                 6.8K (↓99.7%)   3 (↓99.3%)  10 (↓99.6%)     22ms (↓99.7%)
OpenJMS     128.9M  36    7,287  606s                 1.82M (↓98.5%)  7 (↓80.5%)  415 (↓94.3%)    16.3s (↓97.3%)
Tomcat      38.2M   13    3,543  206s                 1.26M (↓96.7%)  4 (↓69.2%)  111 (↓96.9%)    3.3s (↓98.4%)
Jigsaw      20.1M   11    2,322  154s                 416K (↓98.0%)   3 (↓72.7%)  64 (↓97.2%)     2.4s (↓98.4%)
Derby       94.1M   11    6,439  466s                 2.01M (↓97.8%)  3 (↓72.7%)  433 (↓92.5%)    10.2s (↓97.6%)
Table 6.1 shows the benchmarks used in our experiments. The total size of these benchmarks
is over 600K lines of code. Column 3 (Input/#Threads/#Iterations) reports the input data (the
bug, the number of threads, and the number of iterations, if available) configured in the recorded execu-
tion of the benchmark. All experiments were conducted on two eight-core 3.00GHz Intel Xeon
machines with 16GB memory and Linux 2.6.22 and JDK1.7.
6.6.1 RQ1: Effectiveness
The goal of our first research question is to evaluate how effective our technique is for simpli-
fying the buggy execution traces of real concurrent programs. To generate the data necessary
for investigating this question, we proceed as follows. For each benchmark, we first run it mul-
tiple times with random thread schedules until the bug manifests and use LEAN to collect the
corresponding buggy trace of each run. For each trace, we then apply our technique to produce
a simplified trace with the redundancy removed. During the simplification process, we first re-
move whole-thread redundancy and then partial-thread redundancy (consisting of both slicing
and repetition analysis). The whole process is fully automatic with no user intervention. We mea-
sure the percentage of trace size reduction with respect to the two dimensions of redundancy.
TABLE 6.3: LEAN - decomposed effectiveness on trace size reduction

                                ------ Partial Redundancy ------
Program     Whole Redundancy    Slicing         Repetition
BuggyPro    445K (96.9%)        1.8K (0.2%)     –
Tsp         21.7M (49.2%)       0.4M (0.7%)     –
ArrayList   1.71M (99.6%)       –               –
LinkedList  2.19M (99.7%)       –               –
OpenJMS     100.8M (78.2%)      7.3M (5.7%)     20.0M (15.5%)
Tomcat      23.6M (61.9%)       4.2M (11.0%)    9.1M (24.0%)
Jigsaw      16.0M (79.4%)       0.91M (4.5%)    2.7M (13.4%)
Derby       75.1M (79.8%)       6.2M (6.6%)     11.6M (12.3%)
We also quantify the final simplification results in terms of the reductions of the trace size, the
number of threads and the number of thread context switches, as well as the replay speedups.
To demonstrate the simplification effectiveness of our approach, we also compared LEAN with
an execution reduction technique ER [122] that uses the dependence graph for simplification.
Table 6.2 reports our final simplification results. Columns 2-5 (Size, #Thread, #CS, Replay Time)
report the size of the original trace, the number of threads, the number of thread context switches
(including both non-preemptive and preemptive ones) in the original trace, and the replay time
of the original trace, respectively, while Columns 6-9 report the corresponding statistics of the
simplified trace. As the table shows, the size of the original trace ranges from 460KB (Bug-
gyPro) to more than 128MB (OpenJMS) on disk, which takes from 1.27 seconds to more than
10 minutes to replay to reproduce the bug. The original trace is also of significant complexity
w.r.t. the number of threads and the number of context switches, ranging from 5 threads in Tsp
to 451 threads in ArrayList and LinkedList, and from 1,003 context switches in BuggyPro to
9,190 context switches in Tsp. LEAN was able to greatly reduce the trace complexity for all
the concurrency bugs in our experiments. The trace size is reduced by 49.9% (2x) in Tsp to as
large as 99.7% (324x) in LinkedList, the number of threads is reduced by 40% to 99.3%, and
the number of context switches is reduced by 50% to 99.6%. Moreover, the replay time is also
greatly shortened after simplification, ranging from 58.9% (2.4x) in Tsp to 99.7% (327x) in
LinkedList. In the four large server applications, the replay time is consistently shortened by
around 98% (64x).
Table 6.3 reports the simplification effectiveness w.r.t. each of the three components in terms of
the trace size reduction. Column 2 reports the percentages of whole-thread redundancy reduced
by the hierarchical delta-debugging (HDD), while Columns 3-4 report that of partial-thread re-
dundancy, reduced by slicing and repetition analysis, respectively. In the small benchmarks, the
percentage of whole thread redundancy ranges from 49.2% to 99.7%. LEAN did not identify
much partial thread redundancy in these small benchmarks. Slicing removes only 0.2% and
TABLE 6.4: Comparison between LEAN and ER

Program     ER      LEAN
BuggyPro    2.1%    97.1%
Tsp         0.0%    49.9%
ArrayList   2.9%    99.6%
LinkedList  3.0%    99.7%
OpenJMS     10.2%   98.5%
Tomcat      6.9%    96.7%
Jigsaw      4.6%    98.0%
Derby       2.5%    97.8%
0.7% redundancy, respectively, in BuggyPro and Tsp. For the real server programs, the percent-
age of whole-thread redundancy ranges from 61.9% to 79.8%. For partial-thread redundancy,
slicing and repetition analysis are both more effective than for the small benchmarks. Slic-
ing removes 4.5% to 11% redundant computation in the four large server programs, while the
percentage of redundancy removed by repetition analysis ranges from 12.3% to 15.5%. We note
that the amount of redundancy in the buggy traces is closely related to the number of threads
and the number of repetitions configured as input to the program. With more redundancy in the
buggy trace, LEAN would have a better simplification ratio. Nevertheless, we believe our result
is representative as our experimental setup reflects the typical concurrency testing scenarios in
the development cycle (such as the effective random testing in the IBM ConTest tool [31] and
the stress testing in CHESS [86]).
Comparison with ER [122] The execution reduction (ER) technique proposed by Tallam
et al. [122] also aims at reducing the trace size, to support the tracing of long-running
multithreaded programs. ER works by tracking a dynamic dependence graph of the execution
events. The events are grouped into regions and threads such that the size of the dependence
graph can be reduced. By analyzing the dependence graph, ER removes the regions of events or
threads that are irrelevant to the fault. As ER relies on the dynamic dependence graph, it cannot
remove redundant computation that has data/control dependencies to the fault. As LEAN relies
on the redundancy criterion and dynamic verification, it is able to leverage more simplification
opportunities.
We compared the simplification effectiveness on the trace size reduction between LEAN and
ER. Table 6.4 shows the result. For our evaluation benchmarks, LEAN is much more effec-
tive than ER. ER does not find many irrelevant events (the percentage of simplification ranges
from 0.0% to 10.2%), because almost all threads have data dependencies on one another via
shared variables, while LEAN can effectively remove the redundant threads and the repetitive
computation through the hierarchical delta-debugging and our repetition analysis.
TABLE 6.5: LEAN experimental results - RQ2: Efficiency

            ----- HDD -----   Slicing   -- Repetition --   -- RCB --
Program     #Rounds  Time     Time      #Rounds  Time      All  Real
BuggyPro    6        8s       155ms     –        –         4    0
Tsp         2        199s     12s       –        –         3    0
ArrayList   18       55s      2s        –        –         –    –
LinkedList  18       58s      2s        –        –         –    –
OpenJMS     13       4,265s   330s      11       152s      1    1
Tomcat      5        1,082s   308s      12       55s       1    1
Jigsaw      4        630s     210s      10       37s       1    1
Derby       4        1,841s   553s      10       200s      1    1
6.6.2 RQ2: Efficiency
The goal of our second research question is to assess if our approach is efficient in simplifying
the buggy trace. Since LEAN works in a black-box style (applying delta-debugging except for
the dynamic slicing part) to iteratively simplify the trace, it may take a long time (many rounds)
to produce the final simplification. As in each round it requires two replay runs to validate
redundancy (for the two redundancy conditions in our criterion), the efficiency of LEAN is an
important concern for the usefulness in practice. Hence, during the trace simplification, we also
record the number of delta-debugging rounds (for dealing with both whole-thread redundancy
and partial-thread redundancy) and measure the time needed for each of the three components of
LEAN to produce the final simplified trace. As we use repetition analysis to identify the RCBs,
we also report the statistics of the repetition analysis result to assess its usefulness in improving
the simplification effectiveness of LEAN.
Table 6.5 shows the experimental results for our research question RQ2. Columns 2-3 and 5-6
report the number of simplification rounds (including the failed runs) and the time taken by LEAN
to remove the whole-thread redundancy and the redundant repetitions, respectively, from the
original trace (the same trace as that in Table 6.2). Generally, the number of rounds is depen-
dent on the amount of redundancy, while the simplification time is dependent on the amount of
redundancy as well as the length of the original trace. For the small benchmarks, LEAN took 2
to 18 rounds for validating whole-thread redundancy, which took 8 to 199 seconds of the execu-
tion time. For the large systems, since their traces are much larger, LEAN took 4 to 13 rounds
and 630 to 4,265 seconds to remove whole-thread redundancy, and 10 to 12 rounds and 37 to
200 seconds to remove the redundant repetitions. Column 4 reports the time needed for slicing
the trace (including both the construction time of the multithreaded dependence graph (MDG)
and the analysis time for slicing the MDG). Because slicing considers all the instructions in the
buggy execution, it is more expensive for large server programs (which have longer and more
complex traces) than for the small benchmarks. The slicing time for the four large server
programs in our experiments ranges from 210 to 553 seconds.
Summary Compared to the original replay time, the simplification time is typically 4x-8x
longer (except Tsp, which is in fact shorter). However, considering the significant trace simpli-
fication ratio, we believe the time cost is acceptable (even for the large systems). Moreover, as
the simplification task is fully automatic (transparent to programmers) and can be easily paral-
lelized, programmers do not need to worry about the simplification procedure. For very long
running executions, programmers may also choose to set a time bound for the simplification.
When the simplification does not finish within the time bound, programmers can still have the
partially simplified trace (sharing the spirit of delta-debugging).
On the aspect of repetition analysis, Columns 7-8 report the total number of identified RCBs
and the number of real RCBs among them in each benchmark. For the small benchmarks, our
analysis identified 4 RCBs in BuggyPro and 3 in Tsp, but none of them are truly redundant. Our
analysis does not report any RCB in LinkedList and ArrayList. For the large systems, our anal-
ysis successfully identified all the RCBs in the test drivers. In testing real concurrent systems,
there is often a large number of repetitions (in order to increase the bug-finding probability). We
note that repetition analysis plays an important role in effectively reducing this kind of partial-
thread redundancy, though (as our result suggests) the precision of our repetition analysis is not
optimized.
6.7 Summary
Debugging concurrent programs has been a long-standing challenge. We have pre-
sented a novel technique LEAN to simplify the concurrency bug reproduction by removing the
redundant computation from the buggy trace with the replay-supported execution reduction. Our
experimental results show that LEAN is able to significantly reduce the complexity of the repro-
ducible buggy execution and shorten the replay time. With LEAN, we believe the effectiveness
of debugging concurrent programs can be greatly improved.
TestEmbeddedMultiThreading {
    main(String args[]) {
        int numThreads = Integer.parseInt(args[0]);
        int numIterations = Integer.parseInt(args[1]);
        //register the embedded driver and create the test database
        EmbeddedDriver driver = new EmbeddedDriver();
        conn = DriverManager.getConnection("jdbc:derby:DERBY2861");
        stmt = conn.createStatement();
        sql = "CREATE VIEW viewSource AS SELECT col1, col2 FROM
               schemamain.SOURCETABLE";
        stmt.execute(sql);
        stmt.close();
        //create test threads
        Thread[] threads = new Thread[numThreads];
        for (i = 0; i < numThreads; i++)
            threads[i] = new Thread(new ViewCreatorDropper(
                "schema1.VIEW" + i, "viewSource", "*", numIterations));
        //start test threads
        for (int i = 0; i < numThreads; i++)
            if (shouldStartThread(threads[i]))    // inserted by LEAN (line 23)
                threads[i].start();
        //wait for threads to terminate
        for (int i = 0; i < numThreads; i++)
            if (shouldJoinThread(threads[i]))     // inserted by LEAN (line 27)
                threads[i].join();
    }
}

ViewCreatorDropper implements Runnable {
    ViewCreatorDropper(String viewName, String sourceName,
                       String columns, int iterations) {
        m_viewName = viewName;
        m_sourceName = sourceName;
        m_columns = columns;
        m_iterations = iterations;
    }
    run(…) {
        for (i = 0; i < m_iterations; i++)
        {
            if (shouldExecuteIteration(i))        // inserted by LEAN (line 40)
            {
                @rcb-begin                        // inserted by LEAN (line 41)
                //create view
                stmt = conn.createStatement();
                sql = "CREATE VIEW " + m_viewName + " AS SELECT "
                      + m_columns + " FROM " + m_sourceName;
                stmt.execute(sql);
                stmt.close();
                //drop view
                stmt = conn.createStatement();
                sql = "DROP VIEW " + m_viewName;
                stmt.execute(sql);
                stmt.close();
                @rcb-end                          // inserted by LEAN (line 54)
            }
        }
    }
}

FIGURE 6.7: A real world test driver for triggering the concurrency bug in Figure 6.6. The
statements inserted by LEAN to simplify the execution are marked with "inserted by LEAN"
comments (shown as gray areas in the original figure); the line numbers referenced in the text
refer to the original listing.
[Table: for each of the four delta-debugging rounds, the test threads among T1–T10 that are
enabled (√) and the validation result — Round 1: 5 threads, Y; Round 2: 3 threads, Y;
Round 3: 2 threads, X; Round 4: 2 threads, Y.]

FIGURE 6.8: Illustration of delta-debugging for removing the whole-thread redundancy. Ti
denotes the ith test thread created by the main thread T0. After four rounds of simplification,
threads T(2,3) remain and all the other threads are removed.
[Table: for each of the ten delta-debugging rounds, the iterations among I21–I210 and
I31–I310 that are enabled (√) and the validation result — Round 1: 15 iterations, N;
Round 2: 15, Y; Round 3: 13, Y; Round 4: 12, N; Round 5: 12, Y; Round 6: 11, Y;
Round 7: 6, Y; Round 8: 4, N; Round 9: 3, Y; Round 10: 2, Y.]

FIGURE 6.9: Illustration of delta-debugging for removing the redundant repetitions for the re-
maining threads T(2,3). Iij denotes the jth iteration of thread Ti, where i = 2,3 and j = 1,2,...,10.
After ten rounds of simplification, the 7th iteration of T2 and the 4th iteration of T3 remain and
all the other iterations are removed.
Chapter 7
Static Trace Simplification
One of the major difficulties in debugging concurrent programs is that the programmer usually
experiences frequent thread context switches, which complicates the reasoning process. This
problem can be alleviated by trace simplification techniques, which produce the same computa-
tion process but with fewer context switches. The state-of-the-art trace simplification technique
takes a dynamic approach and does not scale well to large traces, hampering its practicality.
We present a static trace simplification approach, SimTrace, that dramatically improves the ef-
ficiency of trace simplification through reasoning about the computational equivalence of traces
offline. By constructing a dependence graph model of events, our trace simplification algorithm
scales linearly in the trace size and quadratically in the number of nodes in the dependence graph.
Underpinned by a trace equivalence theorem, we guarantee that the results generated by Sim-
Trace are sound and no dynamic program re-execution is required to validate trace equivalence.
Our experiments show that SimTrace scales well to traces with more than 1M events, making it
attractive for practical use.
7.1 Introduction
Jalbert and Sen [55] have recently proposed a dynamic trace simplification technique, Tiner-
tia, for reducing the number of thread interleavings in a buggy execution trace. From a high
level perspective, Tinertia iteratively transforms an input trace that satisfies a certain property
to another trace satisfying the same property but with fewer thread context switches. Tinertia
is valuable in improving the debugging efficiency of concurrent programs as it prolongs the se-
quential reasoning of concurrent program executions and reduces frequent “context switches”.
However, since Tinertia is a dynamic approach, it faces serious efficiency problems when used
in practice. To reduce every single context switch, Tinertia has to re-execute the program at
least once to validate the equivalence of the transformed trace. It is very hard for Tinertia to
scale to large traces as program re-execution typically requires controlling the thread scheduler
to follow the scheduling decisions in the transformed trace, which is often 5x to 100x slower
than the native execution [100]. The total running time of Tinertia is cubic in the trace size [55].
We present a static trace simplification technique, SimTrace, that dramatically improves the
efficiency of trace simplification through offline reasoning of the computational equivalence
of traces. The key idea of SimTrace is that we can statically guarantee trace equivalence by
leveraging the dependence relations between events in the trace. We prove a theorem of trace
equivalence that any rescheduling of the events in the trace respecting the dependence relation
is equivalent to the given trace. The trace equivalence is not limited to any specific property
but general to all properties that can be defined over the program state. Underpinned by the
trace equivalence theorem, SimTrace is able to perform trace simplification completely offline,
without any dynamic re-execution to validate the intermediate simplification result, which sig-
nificantly improves the efficiency of the trace simplification.
In our analysis, we first build a dependence graph that encodes all the dependence relations
between events in the trace. The dependence graph is a directed acyclic graph in which each
node in the graph represents a corresponding event or event sequence by the same thread in the
trace, and each edge represents a happens-before relation or a data dependence between two
events or event sequences. The dependence graph is sound in that it encodes a complete set of
dependence relations between the events. The trace equivalence theorem guarantees that any
topological sort of the dependence graph produces an equivalent trace to the original trace.
Taking advantage of the dependence graph, we reduce the trace simplification problem to
a graph merging problem, in which the objective is to minimize the size of the graph. The
algorithm performs a sequence of merging operations on the graph. Each merging operation is
applied on two consecutive nodes by the same thread in the graph, and it consolidates the two
nodes if a merging condition is satisfied. The merging condition is that the edge connecting
the two merged nodes is the only path connecting them in the graph, which can be efficiently
checked by computing the reachability relation between the two nodes.
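The check can be implemented as a depth-first search from the first node that ignores the direct edge between the two nodes; a sketch, with the graph represented as adjacency lists (our representation):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch: two consecutive nodes u, v of the same thread may be merged
    // iff the edge u->v is the only path from u to v in the dependence
    // graph, i.e., v is unreachable from u once that edge is ignored.
    public class MergeCheck {
        static boolean canMerge(Map<Integer, List<Integer>> adj, int u, int v) {
            Deque<Integer> stack = new ArrayDeque<>();
            Set<Integer> seen = new HashSet<>();
            for (int w : adj.getOrDefault(u, List.of()))
                if (w != v) stack.push(w);      // skip the direct edge u->v
            while (!stack.isEmpty()) {
                int w = stack.pop();
                if (w == v) return false;       // another path reaches v
                if (seen.add(w))
                    for (int x : adj.getOrDefault(w, List.of()))
                        stack.push(x);
            }
            return true;                        // u->v is the only path
        }
    }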
Finally, SimTrace performs a topological sort on the reduced dependence graph and generates
the simplified trace. The total running time of SimTrace is linear in the size of the trace and
quadratic in the number of nodes in the initial dependence graph. SimTrace is very efficient
in practice, since the size of the initial dependence graph is often much smaller than that of the
original trace. Moreover, SimTrace is completely offline and does not require any re-execution
of the program for validating the simplified trace.
The problem of generating equivalent traces with minimum context switches is NP-hard [55].
SimTrace does not guarantee the globally optimal simplification but a local optimum. However,
our evaluation results using a set of multithreaded programs show that SimTrace is able to signif-
icantly reduce the context switches in the trace. For instance, for the input trace of the Cache4j
subject with 1,225,167 events, SimTrace is able to reduce the number of context switches from
417 to 33 in 592 seconds. The overall reduction percentage of SimTrace ranges from 65% to
97% in our experiments.
Being an offline analysis technique, SimTrace is complementary to Tinertia. For the sake of
efficiency, our modeling of the dependence relation does not consider the runtime value de-
pendencies between events in the trace and hence may be too strict in preventing further trace
simplification. As Tinertia utilizes runtime verification regardless of the dependence relation, it
might be able to explore more simplification opportunities that are beyond the strict dependence
relation. A good match between SimTrace and Tinertia is to apply SimTrace as a front-end
and use Tinertia as a back end. By working together, we can achieve both trace simplification
efficiency and effectiveness at the same time.
The rest of the chapter is organized as follows: Section 7.2 presents our algorithm; Section 7.3
reports our evaluation results; Section 7.4 summarizes this chapter.
7.2 SimTrace: Efficient Static Trace Simplification
In this section, we first define the trace simplification problem. We then describe a theorem of
trace equivalence and offer a detailed proof. After that, we present the full SimTrace algorithm.
7.2.1 General Trace Simplification Problem
Definition 7.1. (Context switch) A context switch occurs when two consecutive actions in the
trace are performed by different threads. Let Γ(α) denote the owner thread of event α, let δ
denote a trace containing N events and δ[k] the kth event in δ, and let CS(δ) denote the number
of context switches in δ. Then

    CS(δ) = Σ_{k=1}^{N−1} u_k,   where u_k = 1 if Γ(δ[k]) ≠ Γ(δ[k+1]) and u_k = 0 otherwise.
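For illustration, CS(δ) can be computed in a single pass over the owner threads of the trace; the following sketch abstracts the trace as the sequence Γ(δ[1]), …, Γ(δ[N]):

    import java.util.List;

    // Sketch: counting context switches per Definition 7.1, with the trace
    // abstracted as the sequence of owner-thread IDs of its events.
    public class ContextSwitchCount {
        static int contextSwitches(List<Integer> ownerThreads) {
            int cs = 0;
            for (int k = 1; k < ownerThreads.size(); k++)
                if (!ownerThreads.get(k - 1).equals(ownerThreads.get(k)))
                    cs++;  // consecutive events by different threads
            return cs;
        }

        public static void main(String[] args) {
            // δ = e1..e6 with Γ = [1,1,2,1,2,2] has CS(δ) = 3.
            System.out.println(contextSwitches(List.of(1, 1, 2, 1, 2, 2)));
        }
    }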
Given a trace as the input, the general trace simplification problem is to produce an output trace
that is equivalent to the input trace and has the minimum number of context switches among all
equivalent traces. To state it more formally, suppose an input trace δ drives the program state to
ΣN , the general trace simplification problem is: given δ, output δ′ s.t. ΣN = Σ′N and CS(δ′) is
minimized. Notice that the program state here is not limited to any local store or the global store
but includes both the global store and the local stores of all the threads. In other words, the trace
simplification problem defined above is general to all properties defined over the program state.
The basic idea for reducing the context switches in a trace is to reschedule the actions in the
trace such that more actions by the same thread are placed next to each other. A naıve approach
is to exhaustively generate all permutations of the events in the trace and pick an equivalent one
with the smallest number of context switches. However, this naïve approach requires checking
N! permutations, which is highly inefficient. A better approach is to repeatedly move the inter-
leaving actions to some non-interleaving positions and then consolidate the neighboring actions
by the same thread. However, there are two major challenges in this approach. First, how to
ensure the rescheduled trace is feasible and also equivalent to the input trace? Second, how to
make sure the output trace is optimal, i.e., has the minimum number of context switches among
all equivalent traces?
We address the trace simplification problem by leveraging the dependence relationship between
events in the trace. For the first challenge, we show that the trace equivalence can be guaranteed
by respecting the dependence relation during the rescheduling process. For the second chal-
lenge, since Jalbert and Sen [55] have proved the problem NP-hard, we present an efficient algorithm,
SimTrace, that generates a locally optimal solution.
7.2.2 A Theorem of Trace Equivalence
Previous work has proposed many causal models [21, 66, 96, 112, 131] that characterize the
dependence relationship between actions in the trace. Among them, most models are developed
for checking concurrency properties such as data race and atomicity violations, and they are
tailored for a specific property. As we are dealing with all properties over program state, we
have to consider a general model that works for all such properties. We hence use a strict model
based on the dependence relation in Definition 2.5, and we have the following theorem of trace
equivalence:
Theorem 7.2. Any rescheduling of the actions in a trace respecting the dependence relation
generates an equivalent trace.
Proof. (Sketch) Let δ denote the input trace with size N and δ′ an arbitrary rescheduling of δ
respecting the dependence relation, and suppose δ and δ′ drive the program state from the same
initial state Σ0 = Σ′0 to ΣN and Σ′N, respectively. Our goal is to prove Σ′N = ΣN. The
main insight of the proof is that, by respecting the order defined by the dependence relation,
every action in the rescheduled trace reads or writes the same value on the program state as its
corresponding action in the input trace, and hence the rescheduled trace drives the program to
the same final state as that of the input trace. We provide the full detailed proof in the appendix
at the end of this chapter; readers may safely skip it for now.
Note that Theorem 7.2 is related to but different from the equivalence axiom of the Mazurkiewicz
traces [1] in trace theory, which provides an abstract model for reasoning about trace equiv-
alence based on the partial order relation between events. We prove Theorem 7.2 in the context
of concurrent program execution based on the concrete modeling of the action semantics and
the computation effect in the trace.
Theorem 7.2 forms the basis of static trace simplification as it guarantees every rescheduling
of the actions in the trace that respects the dependence relation produces a valid simplification
result, without the need of any runtime verification. In other words, as long as we do not violate
the order defined by the dependence relation, we can safely reschedule the events in the trace
without worrying about the correctness of the final result.
7.2.3 SimTrace Algorithm
Our algorithm starts by constructing from the input trace a dependence graph (see Definition
7.3), which encodes all the actions in the trace as well as the dependence relations between the
actions. We then simplify the dependence graph by repeatedly performing a “merging” operation
on two consecutive nodes by the same thread in the graph. When the dependence graph cannot
be further simplified, our approach applies a simple topological sort on the graph to produce the
final simplified trace.
Definition 7.3. A dependence graph G = (V,E), built upon a trace, is a directed acyclic graph
in which each v ∈ V corresponds to a sequence of consecutive actions by the same thread started
by a unique action that has remote incoming dependence. For each edge, there is a labeling
relation L ∶ E →{local, remote} such that each local edge connects neighboring nodes by the
same thread, and each remote edge connects nodes by different threads meaning that there are
dependence relations from some actions in one node to some actions in the other node.
Note that the dependence graph is a directed acyclic graph: a cycle would indicate cyclic
dependences between events in the trace, which is impossible according to our dependence re-
lation model. We next describe our algorithms for constructing and simplifying the dependence
graph in detail.
Dependence Graph Construction Algorithm 12 shows our algorithm for constructing the
dependence graph. Given an input trace, we first conduct a linear scan of all the actions in the
trace to build the smallest dependence relation between actions. We then visit each action once, in
its order of appearance in the trace, to construct the dependence graph according to Definition 7.3.
Our construction of the dependence graph leverages the observation that most of the dependence
relations in the trace are local dependencies within the same thread, while the number of remote
Algorithm 12 ConstructDependenceGraph(δ)
 1: input: δ (a trace)
 2: output: graph (the dependence graph built from δ)
 3: map_t2n ← empty map from a thread identifier to its current graph node
 4: t_old ← null
 5: for i ← 0 to |δ|−1 do
 6:   t_cur ← the thread identifier of the action δ[i]
 7:   node_cur ← map_t2n(t_cur)
 8:   if node_cur is null then
 9:     node_cur ← new node(δ[i])
10:     map_t2n(t_cur) ← node_cur
11:     add node node_cur to graph
12:   else
13:     if δ[i] has remote incoming dependence and t_cur ≠ t_old then
14:       node_old ← node_cur
15:       node_cur ← new node(δ[i]); map_t2n(t_cur) ← node_cur
16:       add node node_cur to graph
17:       add local edge node_old ⇢ node_cur to graph
18:       for each action a with remote outgoing dependence to δ[i] do
19:         node_a ← the node to which a belongs
20:         add remote edge node_a → node_cur to graph
21:     else
22:       add action δ[i] to node_cur
23:   t_old ← t_cur
dependence relations is comparatively much smaller. We can hence greatly reduce the size
of the initial dependence graph by shrinking consecutive actions with only local dependence
between them into a single node. The running time of Algorithm 12 is linear in the trace size.
Note that, in our dependence graph construction process, each node in the initial dependence
graph has at most two incoming edges: a local incoming edge and a remote incoming edge
(the root node of each thread lacks one or both). The number of edges in the graph is thus less
than twice the number of nodes
in the graph. Moreover, since each node in the dependence graph may represent a sequence of
actions in the trace, the number of nodes in the graph is much smaller than the original trace
size. As a result, performing a topological sort on the dependence graph is much more efficient
than that on the original trace.
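To illustrate this final step, here is a minimal Java sketch of a Kahn-style topological sort over the dependence graph; the Node interface is our assumption and stands in for SimTrace's actual graph data structure.

    import java.util.*;

    interface Node {
        List<Node> successors();  // targets of outgoing local and remote edges
        List<String> actions();   // consecutive same-thread actions in this node
    }

    final class Linearizer {
        // Emits an equivalent trace by visiting the nodes in a topological
        // order and concatenating their action sequences.
        static List<String> linearize(Collection<Node> graph) {
            Map<Node, Integer> indegree = new HashMap<>();
            for (Node n : graph) indegree.put(n, 0);
            for (Node n : graph)
                for (Node succ : n.successors())
                    indegree.merge(succ, 1, Integer::sum);
            Deque<Node> ready = new ArrayDeque<>();
            for (Node n : graph)
                if (indegree.get(n) == 0) ready.add(n);
            List<String> trace = new ArrayList<>();
            while (!ready.isEmpty()) {
                Node n = ready.poll();
                trace.addAll(n.actions());
                for (Node succ : n.successors())
                    if (indegree.merge(succ, -1, Integer::sum) == 0)
                        ready.add(succ);
            }
            return trace;
        }
    }

By Theorem 7.2, any order in which ready nodes are dequeued yields an equivalent trace; the merging step described next is what makes the resulting trace contain few context switches.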
Simplifying Dependence Graph Following Theorem 7.2, it is easy to see that any topological
sort of the initial dependence graph produces a correct answer to our problem, i.e., generates an
equivalent trace to the input trace. However, to make the resultant trace as simple as possible,
i.e., to minimize the context switches, we have to choose the next node wisely at each step of
the topological sort, a difficult problem with no known efficient solution or even a good
approximation algorithm.
We formulate this problem as an optimization problem on the number of nodes in the depen-
dence graph and use a graph merging algorithm to compute a locally optimal solution to it.
Before describing the formulation, let us first introduce a notion dual to the context switch:
Definition 7.4. A context continuation occurs when two consecutive actions in the trace are
performed by the same thread.
Let CC(δ) denote the number of context continuations in a trace δ, we have the following
lemma:
Lemma 7.5. Minimizing CS(δ) is equivalent to maximizing CC(δ).
Proof. Traversing the trace once, it is easy to see that each pair of consecutive actions increments
either CS(δ) or CC(δ). Thus, CS(δ) + CC(δ) = N − 1. Hence, CS(δ) is minimized when CC(δ) is
maximized.
Therefore, our goal becomes to maximize the number of context continuations in the simplified
trace. Now let us consider the action sequence represented by each node in the dependence
graph. Since all actions in the same action sequence are performed by the same thread, the
number of context continuations within each sequence is already maximal. The remaining possible
context continuations can only come from actions that are in different action sequences. Mapping this back to
the dependence graph and because nodes representing action sequences by the same thread are
connected by local edges, we have the following lemma:
Lemma 7.6. Minimizing CS(δ) is equivalent to maximizing the number of context continua-
tions contributed by local edges in the dependence graph.
Consider a local edge in the graph: if the action sequences represented by the two nodes con-
nected by this local edge are consolidated, the edge contributes one context continuation.
Let us call the consolidation of two nodes connected by a local edge in the dependence graph a
merging operation. As each merging operation eliminates a local edge and correspondingly
reduces one node in the dependence graph, it is easy for us to get the following theorem:
Theorem 7.7. Minimizing CS(δ) is equivalent to minimizing the number of nodes in the de-
pendence graph.
Following Theorem 7.7, our objective is to perform as many merging operations as possible so as
to minimize the number of nodes in the dependence graph. However, recall that the dependence
relation between actions in the trace must be respected. Therefore, we cannot perform the
merging operation arbitrarily; it must satisfy a pre-condition, the merging condition: the two
nodes to be merged are connected by the local edge only. Otherwise, the resultant graph
after the merging operation would become cyclic and violate the definition of dependence graph.
Mapping this back to the semantics of the dependence relation, the merging condition simply
requires that there be no other dependent action in the trace that interleaves the two
action sequences represented by the two nodes to be merged in the dependence graph. Checking
the merging condition is simple because it only requires testing the reachability relation between
the two merged nodes, which is linear in the number of nodes in the dependence graph
(theoretically, constant-time graph reachability algorithms also exist [132]).
Therefore, our dependence graph simplification algorithm (Algorithm 13) traverses each local
edge in the dependence graph, and performs the merging operation if the merging condition is
satisfied. This algorithm evaluates each local edge in the initial dependence graph once and
each evaluation computes the reachability relation between two nodes once. The worst case
time complexity is thus quadratic in the number of nodes in the initial dependence graph.
Algorithm 13 SimplifyDependenceGraph(graph)
1: input: graph (the dependence graph)
2: output: graph′ (the simplified dependence graph)
3: graph′ ← graph
4: for each local edge node_a ⇢ node_b in a random order do
5:   if node_b is not reachable from node_a except through the local edge then
6:     merge(node_a, node_b, graph′)
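The merging condition on line 5 can be checked with a plain depth-first search that ignores the local edge itself; the sketch below reuses the hypothetical Node interface from the earlier linearization sketch.

    import java.util.*;

    final class MergeCheck {
        // Returns true iff node b is NOT reachable from node a except through
        // the single local edge a ⇢ b, i.e., the merging condition holds.
        static boolean canMerge(Node a, Node b) {
            Deque<Node> stack = new ArrayDeque<>();
            Set<Node> visited = new HashSet<>();
            boolean skippedLocalEdge = false;
            for (Node succ : a.successors()) {
                if (succ == b && !skippedLocalEdge) {
                    skippedLocalEdge = true;  // skip the local edge a ⇢ b itself
                } else {
                    stack.push(succ);
                }
            }
            while (!stack.isEmpty()) {
                Node n = stack.pop();
                if (!visited.add(n)) continue;
                if (n == b) return false;     // a second path a → ... → b exists
                for (Node succ : n.successors()) stack.push(succ);
            }
            return true;
        }
    }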
Notice that in our merging algorithm, the evaluation order of the local edges may affect the
simplification result. Our algorithm does not guarantee a global optimum but produces a locally
optimal simplification given the chosen evaluation order. To illustrate this problem, let us take
the (incomplete) dependence graph in Figure 7.1 as an example. The graph contains 6 nodes,
3 local edges (denoted by dashed arrows ⇢), and 4 remote edges (denoted by solid arrows →):
a1 ⇢ a2, b1 ⇢ b2, c1 ⇢ c2, a1 → b2, c1 → b2, b1 → a2, and b1 → c2. If b1 and b2 are merged first,
as shown in Figure 7.1 (a), it would produce the trace <a1-c1-b1-b2-c2-a2> that contains 4
context switches. However, the optimal solution is to merge a1 and a2, and c1 and c2, which
produces the trace <b1-a1-a2-c1-c2-b2> that contains only 3 context switches. In fact, this
problem is NP-hard (proved by Jalbert and Sen [55]), and there does not seem to exist an efficient
algorithm for generating an optimal solution. Our algorithm thus picks a random order (or any
arbitrary order) for evaluating the local edges. Though it is not guaranteed to produce a global
optimum, it is easy to see that our algorithm always produces a local optimum specific to the
chosen evaluation order. That is, given the evaluation order of the local edges, our algorithm
produces a trace with the fewest thread context switches.
[Figure 7.1 depicts a dependence graph with nodes a1, a2, b1, b2, c1, c2; dashed arrows are local edges and solid arrows remote edges. Panel (a): the non-optimal merge of b1 and b2, #cs = 4 (a1-c1-b1-b2-c2-a2). Panel (b): the optimal merge of a1 with a2 and c1 with c2, #cs = 3 (b1-a1-a2-c1-c2-b2).]

FIGURE 7.1: A greedy merge may produce a non-optimal result in (a). Unfortunately, the problem of producing the optimal result in (b) is NP-hard.
7.3 Implementation and Experiments
We have implemented SimTrace as a prototype tool on top of LEAP. From the user’s perspec-
tive, our tool consists of three phases. It first obtains a trace of a buggy concurrent Java program
execution, which contains all the shared memory reads and writes as well as synchronization
operations performed by each thread in the program. Then our tool applies the SimTrace algo-
rithm on the trace and produces a simplified trace. In the third phase, it uses a replay engine
to re-execute the program according to the scheduling decisions in the simplified trace. Our
replayer is transparent to the programmers such that they can deterministically investigate the
simplified buggy trace in a normal debugging environment.
The goal of our experiments is to investigate whether our approach is effective and how efficient
it is in reducing the thread context switches in the trace. We chose eight widely used multi-
threaded Java benchmarks as the evaluation subjects (shown in the first column of Table 7.1).
Each subject has one or more known concurrency bugs. Similar to Tinertia [55], we use random
testing to generate the initial buggy trace for each subject. For each trace, we ran SimTrace mul-
tiple times with different evaluation orders of the local edges during our graph merging process
(Algorithm 13). To remove the non-determinism related to random numbers, we fix the seed
of random numbers to a constant in all the subjects. All experiments were conducted on an HP
EliteBook running Windows 7 with a 2.53GHz Intel Core 2 Duo processor and 4GB memory. Our
implementation is publicly available at http://www.cse.ust.hk/prism/simtrace.
Table 7.1 shows the experimental results. All data are averaged over 50 runs. The first five
columns show the statistics of the test cases, including the program name, the size of the pro-
gram in lines of source code, the number of threads, the number of real shared memory locations
that contain both read and write accesses from different threads in the given trace, and the length
TABLE 7.1: SimTrace experimental results. Data are averaged over 50 runs for each subject.

Program      LOC      Threads  SV   Length     Time   Old Ctxt  New Ctxt  Reduction
Philosopher  81       6        1    131        6ms    51        18        65%
Bubble       417      26       25   1,493      23ms   454       163       71%
Elevator     514      4        13   2,104      8ms    80        14        83%
TSP          709      5        234  636,499    149s   9,272     1,337     86%
Cache4j      3,897    4        5    1,225,167  592s   417       33        92%
Weblench     35,175   3        26   11,630     57ms   156       24        85%
OpenJMS      154,563  32       365  376,187    38s    96,643    11,402    88%
Jigsaw       381,348  10       126  19,074     130ms  2,396     65        97%
of the trace. The next four columns show the statistics of our trace simplification algorithm (all
on average), including the running time of our offline analysis, the number of context switches
in the original trace, the number of context switches in the simplified trace, and the reduction
due to our simplification. The results show that our approach is promising in terms of both trace
simplification efficiency and effectiveness. For the eight subjects, our approach is able to reduce
the number of context switches in the trace by 65% to 97% on average. This reduction percent-
age is close to that of Tinertia, which ranges from 32.1% to 97.0% in their experiments. More
importantly, our approach is able to scale to much larger traces compared to Tinertia. For a trace
with only 1505 events (which is the largest trace reported by Tinertia in their experiments), Tin-
ertia requires a total of 769.3s to finish the simplification, while our approach can analyze a trace
(the Cache4j subject) with more than 1M events within 600s. For a trace (the Bubble subject)
with 1,493 events, our approach requires only 23ms to simplify it. Although a direct comparison
between Tinertia and our approach is not possible, as the two approaches are implemented for
different programming languages (Tinertia is implemented for C/C++ programs) and have different
evaluation subjects, we believe the statistical data provides some evidence demonstrating the
value of our approach compared to the state of the art.
7.4 Summary
To sum up, the key contributions of this work are as follows:
• We present an efficient static trace simplification technique for reducing the number of
thread context switches in the trace.
• We show a theorem of trace equivalence that is general to all properties defined over the
program state. This theorem guarantees the correctness of static trace simplification
without any dynamic program re-execution to validate the intermediate simplification
results.
• We present a sound graph modeling of the dependence relation between events in the trace,
which allows us to develop efficient graph merging algorithms for the trace simplification
problem.
• We evaluate our approach on a number of multithreaded applications and the results
demonstrate the efficiency and the effectiveness of our approach.
Appendix: A Proof of Theorem 7.2
Proof. Let us say two actions are equal iff they perform the same operation on the same variable
and also read and write the same value. The core of the proof is to prove the following lemma:
Lemma 7.8. For any action α′ in δ′, suppose it is the nth action of thread ti, then α′ is equal to
the nth action of ti in δ.
If Lemma 7.8 holds, we can prove Theorem 7.2 by applying it to the last actions that write to
each variable in both δ and δ′. To prove Lemma 7.8, we first define a notion of version number
and show two lemmas related to it:
Definition 7.9. Every variable is associated with a version number such that it is (1) initialized
to be 0 and (2) incremented by 1 when the variable is written by an action.
Lemma 7.10. For any action α′ in δ′, suppose it is the kth action that writes to a variable s,
then α′ is also the kth action that writes to s in δ.
Proof. To prove Lemma 7.10, we only need to make sure the order of write actions on each vari-
able is unchanged during the rescheduling of the trace from δ to δ′. This holds because our modeling
of the dependence relation includes all synchronization orders and the WRITE→WRITE orders
on the same variable. ∎
Lemma 7.11. For any action α′ in δ′, suppose it reads the variable s with version number p,
then α′ also reads s with the same version number p in δ.
Proof. Similar to the proof of Lemma 7.10, since our model of the dependence relation includes
all the synchronization orders and the WRITE→READ and READ→WRITE orders on the same
variable, we guarantee every READ action in the rescheduled trace reads the value written by
the same WRITE action as that in the original trace. ∎
Let σ[s]p denote the value of variable s with version number p. We next prove Lemma 7.8 by
induction on the version number of each variable:
Consider the jth actions performed by ti, denoted by αi∶j and α′i∶j in δ and δ′ respectively.
To prove α′i∶j is equal to αi∶j , we need to satisfy two conditions. First, their actions should be
the same, i.e., they perform the same operation on the same variable. Second, suppose they
both operate on the variable s (which must be true if the first condition holds); the value of
s before α′i∶j is performed in δ′ should be the same as that in δ before αi∶j is performed. Let
πi∶j and π′i∶j denote the local store of ti after αi∶j is performed in δ and after α′i∶j is performed
in δ′, respectively. For the first condition, since under the execution semantics the
next action of any thread is determined by that thread's current local store, we need to ensure (I)
π′i∶j−1 = πi∶j−1. For the second condition, suppose αi∶j and α′i∶j operate on s with version numbers
p and p′, respectively; we need to ensure (II) σ′[s]p′ = σ[s]p.
Let us first assume Condition I holds and prove p′ = p in Condition II. If α′i∶j writes to s, i.e.,
α′i∶j is the p′th action that writes to s, then by Lemma 7.10, the corresponding action
of α′i∶j in δ is also the p′th action that writes to s. As Condition I holds, we know that αi∶j is
the corresponding action of α′i∶j in δ. Since αi∶j operates on s with version number p by our
assumption, we get p′ = p. Otherwise, if α′i∶j reads s, then by Lemma 7.11, α′i∶j's
corresponding action in δ also reads s with the same version number, and similarly, we get
p′ = p.
We next prove that both Condition I and Condition II hold. For Condition I, suppose αi∶j−1 and
α′i∶j−1 operate on the variable s1 with version number p1. To satisfy Condition I, we need again
to make sure (Ia) π′i∶j−2 = πi∶j−2 and (Ib) σ′[s1]p1 = σ[s1]p1. For Condition II, let αi1∶j1 and
α′i1′∶j1′ denote the actions that write σ[s]p and σ′[s]p, respectively. Since the current value of
a variable is determined by the action that last writes to it, to satisfy Condition II, we need to
make sure α′i1′∶j1′ is equal to αi1∶j1, which again requires (IIa) π′i1′∶j1′−1 = πi1∶j1−1 and (IIb)
σ′[s]p−1 = σ[s]p−1. Applying this reasoning inductively for all threads, we finally
reach the base case (i) ∀ti ∈ T, π′i∶0 = πi∶0 and (ii) ∀s ∈ S, σ′[s]0 = σ[s]0, which is satisfied
by the equivalence of the initial program states Σ′0 = Σ0. Hence, Lemma 7.8 is proved.
Therefore, Theorem 7.2 is proved.
Chapter 8
Execution Privatization for Scheduler-Oblivious Concurrent Programs
Making multithreaded execution less non-deterministic is a promising solution to address the
difficulty of concurrent programming. In fact, a vast category of concurrent programs are
scheduler-oblivious: their execution is deterministic, regardless of the scheduling behavior.
We present and formally prove a fundamental observation, the privatizability property of
scheduler-oblivious programs, which paves the way for privatizing shared data accesses on a path
segment. With privatization, the non-deterministic thread interleavings on the privatized ac-
cesses are eliminated and many concurrency problems are alleviated. We further present a path
and context sensitive privatization algorithm that safely privatizes the program without intro-
ducing any additional program behavior. Our evaluation results show that the privatization
opportunity pervasively exists in real-world, large, complex concurrent systems. Through pri-
vatization, several real concurrency bugs are fixed, and notable performance improvements are
also achieved on benchmarks.
8.1 Introduction
Despite decades of multicore practice, developing good quality concurrent software remains
notoriously difficult due to non-deterministic thread interleavings. In principle, concurrent pro-
grams are free to exhibit the non-deterministic behavior allowed by the scheduler, and it is the
responsibility of the programmers to prevent the non-determinism from impairing program cor-
rectness (using synchronization, for example). In practice, however, a vast category of real
world concurrent programs are deterministic-by-default or, more generally, scheduler-oblivious:
given the same input, they are always expected to produce the same output. As noted
by Bocchino Jr. et al. [13, 14], almost all scientific computing, encryption/decryption, sorting,
compiler and program analysis, and processor simulation algorithms exhibit scheduler-oblivious
behavior.
Scheduler-oblivious concurrent programs are much easier to reason about, because their exe-
cution is deterministic w.r.t. the program state transition: given the same initial state, they al-
ways reach the same final state, regardless of the thread scheduling (assuming a random but fair
scheduler) [14, 26]. Nevertheless, it is still challenging to write correct and efficient scheduler-
oblivious programs. Although significant research effort has been invested in language design
[11, 14], compiler [9], runtime environment [20, 25], operating system [6, 10], and hardware
[26, 27] to find practical solutions, all these approaches essentially limit execution parallelism
and incur a performance penalty. How to efficiently support the deterministic execution of
scheduler-oblivious programs remains an open problem.
We identify a fundamental property we call privatizability of scheduler-oblivious programs.
This property enables us to develop an execution privatization technique that makes scheduler-
oblivious programs more deterministic without compromising parallelism. The privatizability
property is closely related to but slightly different from the classical conflict and view serializ-
ability property [12, 134, 139]. Privatizability describes the view consistency over a subset of
shared data access scenarios: read-after-write and read-after-read. Under a certain condition, the
program can be soundly privatized to an equivalent program in which the two accesses are always
executed sequentially.
More specifically, consider a path segment, p, in a scheduler-oblivious program, with no block-
ing statement (e.g., thread synchronization), and with two successive accesses to the same data,
where the first access is a read or write, and the second is a read. Suppose in a correct execu-
tion of the program (given a certain input), these two accesses are executed sequentially without
interleaving by a third write to the same location. The privatizability property says that the
second read, which is a shared data access in the program, can actually be changed to a local
access, which always returns the local value stored by the first access. Let us call the shared
data accesses such as the second read privatizable accesses, the operation of changing a priva-
tizable shared access to be local privatization, and the modified program a privatized program.
The soundness of the privatizability property is easy to follow. Since p contains no blocking
statement, with no control of the thread scheduling behavior, the execution of p could always
continue without waiting for other threads. In other words, for any input, there always exists a
schedule in the original program such that the two accesses read or write the same value, making
it reach the same final state as that of the privatized program. And because the original program
is scheduler-oblivious, for all schedules, it will reach the same final state. Hence, both the pri-
vatized program and the original program will always reach the same final state given the same
input. We formally prove a theorem of the privatizability property in Section 8.2.
While guaranteeing program state equivalence, privatization brings a nice benefit to the pro-
gram: it isolates the effect of the thread interleaving on the privatized accesses without adding
any synchronization. The privatized program will no longer experience any non-determinism
caused by the potential erroneous interleaving on the privatized accesses and, at the same time,
no performance is lost. Moreover, as the original heap accesses become stack accesses after pri-
vatization, the program performance can also improve. In return, many concurrency problems
caused by non-deterministic thread interleavings, e.g., concurrent program testing and debug-
ging, can be greatly alleviated for scheduler-oblivious programs. We discuss the applications in
more detail in Section 8.7.1.
Taking advantage of this observation, we propose Privateer, an automatic privatization technique
for scheduler-oblivious programs. An important condition for applying privatization is that
the observed execution of the path segment p in the privatizability property (in which the two
accesses are executed without interleaving) should be correct. Otherwise, if it is buggy, every
execution of p would be wrong after privatization. To bias our results to correct executions,
our technique first conducts a dynamic analysis on a set of common correct executions to find
privatizable accesses. In this way, we guarantee that the privatization is only performed when
the privatizable accesses can be correctly privatized.
The privatization may be applied either at runtime or offline. The key technical challenge is
how to guarantee privatization correctness, i.e., that it does not introduce additional behavior beyond
what could be exhibited by the original program. We present an offline program transformation
approach including a path and context sensitive privatization algorithm that guarantees no new
program behavior is introduced compared to the original.
We have implemented Privateer for Java and evaluated it on a set of popular multithreaded
benchmarks as well as five real world large complex concurrent systems, including Apache
Derby, Tomcat, Jetty, OpenJMS and Jigsaw. Our experimental results show that: (1) Privatiza-
tion opportunities are common in concurrent programs. We found a total of 5,119 privatizable
accesses in the five large real systems. The overall percentage of privatizable accesses (the num-
ber of privatizable versus the total number of shared data access locations) ranges from 14.7%
to 30.7% in the privatized executions. (2) Our technique is effective in repairing two typical
classes of concurrency bugs. In our study of nine real world concurrency bugs, our privatization
technique is able to fix seven of them. (3) With our technique to automatically privatize the orig-
inal heap accesses, we are also able to improve the performance of the evaluated benchmarks by
4.3%-17.9%.
The remainder of this chapter is organized as follows: Section 8.2 presents and formally proves
the privatizability property; Section 8.3 presents an overview of execution privatization; Section
8.4 presents the technical details of Privateer; Section 8.5 presents our implementation and Sec-
tion 8.6 reports our experimental results; Section 8.7 further discusses the application scope of
privatization; Section 8.8 summarizes this chapter.
8.2 A Theorem of Privatizability for Scheduler-Oblivious Programs
The cornerstone of our work is the fundamental privatizability property of scheduler-oblivious
programs. In this section, we present and formally prove a theorem of this property. The the-
orem forms the foundation of privatization, which reduces the non-deterministic influence of
the thread interleavings for scheduler-oblivious programs, benefiting many concurrent program
testing and debugging tasks.
In a scheduler-oblivious program P , consider a path segment, p, with two successive global
actions, ai and aj , to the same shared variable, s, on the global store, where ai is a READ or
a WRITE, and aj is a READ. Let program P ′ be a privatized version of P on p, in which the
global action aj in P is changed to be a local action, aj′ , in P ′, such that aj′ stores the value
read or written by ai into a local variable in the thread’s local store. All the other actions in P
and P ′ are the same.
Consider the executions of P and P′ given the same input. Let vi denote the value read or
written by ai, and vj and vj′ the values returned by aj and aj′, respectively. Clearly, vi = vj′ always
holds, because in P′, aj′ always reads the same value as that read or written by ai. However, in P,
vi may not necessarily be equal
to vj , because ai and aj may be interleaved by a third WRITE to s from a different thread that
changes the value of s. Nevertheless, we have the following theorem of privatizability property
on the equivalence between P and P ′:
Theorem 8.1. If p contains no blocking statement, P is equivalent to P ′: given the same initial
state, P and P ′ always reach the same final state.
Proof. Let us consider an execution where p is only executed once with an arbitrary schedule
ξ. The proof is similar if p is executed multiple times. Recalling Rule (2.1) from Section 2.1, the
state transitions of P and P′ are as follows:

P:  (Σ0, ξ) → … --αi→ Σi → … --αj−1→ Σj−1 --αj→ Σj → … → ΣN
P′: (Σ0, ξ) → … --αi→ Σi → … --αj−1→ Σj−1 --αj′→ Σj′ → … → ΣN′

Since the only difference between P and P′ is on aj and aj′, to prove ΣN′ = ΣN, it is sufficient
to show Σj′ = Σj. Recall that aj only reads the value of s on the global store σj−1 and stores
it to the thread's local store, say m. According to Rule (2.2), for the global store, we have
σj′ = σj−1 = σj. According to Rule (2.3), for the local store, the only difference between Πj′
and Πj is the value of m. In Πj′, it is vj′, and in Πj, it is vj. Because vj′ = vi in P′, if we can
show vi = vj, then we must have Πj′ = Πj, and hence ΣN′ = ΣN can be proved.
Now let us consider the schedule ξ. If there is no WRITE action to s between ai and aj, then
clearly we have vi = vj, and hence ΣN′ = ΣN. Suppose such a schedule exists and let us call it
ξno. We have thus shown that P′ is equivalent to P at least for the schedule ξno. On the other hand,
if there is a WRITE action by another thread between ai and aj in ξ, then we may have vi ≠ vj.
Nevertheless, recalling Rule (2.4), by the definition of scheduler-obliviousness, we have
(Σ0, ξ) →* ΣN for any schedule. That is, even if ξ ≠ ξno and ξ makes vi ≠ vj, ξ still drives the
program to the same final state as ξno does. Therefore, ΣN′ = ΣN always holds as long as ξno exists.
We now prove the existence of ξno for any initial program state by contradiction. Suppose ξno
does not exist; this means that for all schedules, ai and aj must be interleaved by a third WRITE.
For a scheduler-oblivious program, there must then exist a blocking statement in p. The reason is
that the blocking behavior is the only way to enforce a thread interleaving under a fair but non-
deterministic thread scheduler. Without a blocking action in p, a thread may always continue
executing to the end of p if not preempted by the scheduler. Since we assume p does not
contain a blocking statement, ξno must exist. Therefore, Theorem 8.1 is
proved.
Theorem 8.1 paves the way for privatizing scheduler-oblivious concurrent programs. Since P ′
is equivalent to P, we can soundly privatize P to P′ for our purposes. With privatiza-
tion, the non-deterministic thread interleavings on the privatized data accesses (such as aj) are
isolated and, more importantly, the program performance is not impaired but rather improved
as the original heap accesses become stack accesses after privatization. We are now ready to
present our privatization technique for scheduler-oblivious programs.
8.3 Overview
The key concept of our work is the privatization of scheduler-oblivious programs. Essentially,
privatization changes the shared variable accesses in the original program to local ones in the
privatized program, under the condition that the behavior of the original program is not changed.
We have shown the condition and the soundness of privatization in Theorem 8.1. In this section,
we first use two motivating examples to illustrate the idea and the benefits of privatization on
concurrency bug fixing and on the program performance. We then present the challenges for
guaranteeing the privatization correctness, which we address in detail in the next section.
Top (original):

     1  public class TableDescriptor {
     2    FormatableBitSet referencedColumnMap;
     3    public String getObjectName() {
     4      if (referencedColumnMap == null) {
     5        ...
     6      }
     7      else {
     8        for (int i = 0; i < ...; i++)
     9        {
    10          ...
    11          referencedColumnMap.isSet(...)
    12        }
    13      }
    14    }
    15
    16    public void setReferencedColumnMap(...) {
    17      referencedColumnMap = null;
    18    }
        }

By turning the read of referencedColumnMap at line 11 into a local read, we can fix this bug. The transformation for this case is straightforward.

Bottom (privatized):

     1  public class TableDescriptor {
     2    FormatableBitSet referencedColumnMap;
     3    public String getObjectName_privatized() {
     4      FormatableBitSet referencedColumnMap_local = referencedColumnMap;
     5      if (referencedColumnMap_local == null) {
     6        ...
     7      }
     8      else {
     9        for (int i = 0; i < ...; i++)
    10        {
    11          referencedColumnMap_local.isSet(...)
    12        }
    13      }
    14    }
    15  }

FIGURE 8.1: Top: a real bug #2861 in Apache Derby. The program crashes with a NullPointerException when a thread dereferences the shared data structure referencedColumnMap at line 11 after another thread sets it to null in the method setReferencedColumnMap. Bottom: the getObjectName method after privatization.
8.3.1 Motivating Examples
Bug fixing The code snippet in Figure 8.1 (top) shows a real crash bug in the Apache Derby
database. When a thread calls the getObjectName method on a shared TableDescriptor,
it first checks whether the field referencedColumnMap is null or not (line 4). If
referencedColumnMap is not null, the thread enters a loop and dereferences it (line 11). There
is a potential interleaving between the two accesses to referencedColumnMap, where another
thread may set referencedColumnMap to null (line 17) between line 4 and line 11, causing
the first thread to throw a NullPointerException at line 11. Worse, due to the non-determinism of
Left (original):

    volatile num = 100,000,000;

    while(true){
      synchronized(lock)
      {
        num--;
        if(num == 0)
          System.exit(0);
      }
    }

Right (privatized):

    while(true){
      synchronized(lock)
      {
        num_local = num - 1;
        num = num_local;
        if(num_local == 0)
          System.exit(0);
      }
    }

FIGURE 8.2: The benchmark contains 8 threads simultaneously decreasing the shared variable num. The privatized version (right) is 17.9% faster than the original version (left).
this interleaving, this bug is difficult to reproduce and to fix. As reported in the bug repository
(https://issues.apache.org/jira/browse/DERBY-2861), it took as long as a year before this bug was
finally fixed by the developer.
To fix this bug, essentially, the effect of this erroneous interleaving on the program state must be
eliminated. One option is to add synchronizations (e.g., locks) to completely prohibit this inter-
leaving, but this limits the degree of parallelism. After a closer look at this program, we can see
that there is an intriguing characteristic with respect to the dereference to referencedColumnMap
at line 11 that we can leverage to eliminate the erroneous interleaving without using synchro-
nization. That is, in correct executions, the dereference to referencedColumnMap should
always dereference the same value as the preceding access to referencedColumnMap by
the same thread at line 4. This indicates that this shared data access is privatizable: we can pri-
vatize it to dereference a thread local variable referencedColumnMap local that stores
the value of the access to referencedColumnMap at line 4 by the same thread, as shown
in Figure 8.1 (bottom). In this way, the dereference to referencedColumnMap will always
dereference a non-null variable, regardless of the thread interleaving. The bug is fixed without
adding any synchronization.
Performance improvement To assess the effect of privatization on the program performance,
we design a micro-benchmark (Figure 8.2) to conduct controlled experiments for quantifying the
runtime characteristics of the privatization effect. The benchmark consists of concurrent threads
that repeatedly decrease a shared counter (a volatile integer) in a loop until its value reaches 0.
The counter decreasing and the termination checking operations are enclosed in a synchronized
block to ensure the correctness. We control the number of threads and the initial value of the
counter to measure the program execution time.
Figure 8.2 (left part) shows the original micro-benchmark. Since the second read of the counter
always returns the same value as its preceding write access, the second read can be
privatized to return the value of a local variable that stores the value written by the write access. The
privatized version is shown in Figure 8.2 (right part). In our experiments on an 8-core machine
with 8 threads and with the initial value of the counter set to 100,000,000, the privatized version
(40.2s) is 17.9% faster than the original version (49.0s).
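For readers who wish to reproduce the effect, below is a minimal runnable rendition of the privatized micro-benchmark; the class and variable names are ours, and the worker threads return (rather than call System.exit(0) as in Figure 8.2) so that the elapsed time can be printed.

    public class CounterBench {
        static final Object lock = new Object();
        static volatile int num = 100_000_000;

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[8];
            long start = System.nanoTime();
            for (int i = 0; i < workers.length; i++) {
                workers[i] = new Thread(() -> {
                    while (true) {
                        synchronized (lock) {
                            int numLocal = num - 1;    // privatized second read
                            num = numLocal;
                            if (numLocal <= 0) return; // Figure 8.2 uses System.exit(0)
                        }
                    }
                });
                workers[i].start();
            }
            for (Thread w : workers) w.join();
            System.out.printf("elapsed: %.1fs%n", (System.nanoTime() - start) / 1e9);
        }
    }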
Left (original program):

    1  if (foo == null){
    2    foo = new Foo();
    3  }
    4  foo.m();

Top right (naïve privatization):

    foo_local = foo;
    if (foo_local == null){
      foo = new Foo();
    }
    foo_local.m();

Bottom right (path-sensitive privatization):

    foo_local = foo;
    if (foo_local == null){
      foo = new Foo();
      foo.m();
    }
    else
      foo_local.m();

FIGURE 8.3: Privatization must be path-sensitive.
8.3.2 Privatization Challenges
On the surface, privatization seems an easy problem. For instance, for the bug fixing example in
Figure 8.1, we may simply replace the shared read of referencedColumnMap at line 11 with a
local read of referencedColumnMap_local, which stores the same value read by the access
to referencedColumnMap at line 4. However, in practice, we have to address the following
tough challenges:
Path-sensitivity The privatization must be path-sensitive. A privatizable access is defined spe-
cific to a certain path segment. It might not be privatizable on a different path. To understand
this problem, let us consider a simple program in Figure 8.3 (left part). The program first checks
whether a shared variable foo is null or not at line 1. If foo is null, it is assigned to a new
Foo object at line 2. Then the program invokes the method m on foo at line 4. Suppose that, in
our collected execution traces of the program, we only observed the path through lines (1 → 4),
which is possible as foo might always be non-null initially. We would find that the second
read of foo at line 4 is privatizable (because it always returns the same value as the first read of
foo at line 1). However, if we naively privatize the second read in the same way as we did
for the Derby bug #2861 in Figure 8.1, the resulting program (shown at the top right of
Figure 8.3) would be incorrect, because if foo is initially null, the invocation of m would then
dereference a null variable. The correct privatization should consider the path containing the
second read and the first read, and perform the privatization specific to this path, as shown at the
bottom right of Figure 8.3.
Context-sensitivity Besides path-sensitivity, the privatization should also be context-sensitive.
Shared data accesses in different calling contexts may access different values, either written
by the same thread or possibly by a different thread. Therefore, an access that is privatiz-
able in one calling context might not be privatizable in another. This problem is illustrated
     1  public class StringBuffer {
     2    private int count;
     3    public synchronized StringBuffer append(StringBuffer sb) {
     4      int len = sb.length(); ... sb.getChars(..., len, ...); ... }
     5    public synchronized StringBuffer delete(int start, int end) {
     6      ...
     7      int len = end - start; ...
     8      count -= len; }
     9    public synchronized void getChars(...) {
    10      ...
    11      if (srcEnd > count) {
    12        throw new StringIndexOutOfBoundsException();
    13      } }
    14    public synchronized int length() {
    15      return count; }
        }

By turning the read of count at line 11 into a local read, we can fix this bug. The transformation for this case is inter-procedural.

FIGURE 8.4: An atomicity violation in the append method of the java.lang.StringBuffer class. The program throws a StringIndexOutOfBoundsException when a thread at line 11 references the stale length of sb changed by another thread at line 8.
by the StringBuffer bug in Figure 8.4. The two accesses to count at line 11 in the method
getChars and at line 15 in the method length, respectively, are invoked within the context
of the append method, which is inter-procedural and spans several method calls and control
branches. The access to count in the method getChars at line 11 is privatizable because,
in correct executions, it always reads the same value as the read access to count at line 15 in
the method length. However, this repeated read is only privatizable within the calling context
append. It might not be privatizable for all calling contexts in the program. For instance, it
is possible that the getChars method is called from an external method in which count is
written by a remote thread and then directly accessed in getChars. Therefore, we have to
consider the calling context specific to the privatizable access.
Progressiveness Privatization changes an originally shared variable access into a local one by
modifying it to return the local value stored by a preceding access (to the same shared data). If
the shared data is changed (by another thread) between the privatized access and its preceding
access, the modified access will not see the change. This is problematic when the change is ex-
pected by the program. Because we have observed in the correct execution that the change does
not happen (the privatized access returns the same value as its preceding access), the change
should not be expected on the observed path segment. However, there is an important pro-
gressiveness property we must preserve: the program must be able to continue execution after
privatization. For example, if there is a blocking operation somewhere between the privatized
access and its preceding access, the program may block forever until the shared data is changed.
Also, when the privatized access is inside a loop and the value of the access is related to the loop
condition, after privatization, the program may never escape from the loop.
Execution Privatization for Scheduler-Oblivious Concurrent Programs 121
while (shared){
…
}
local = shared
while (local){
…
}
FIGURE 8.5: Privatization must preserve progressiveness
Figure 8.5 illustrates this problem. The program implements a simple barrier with which
the thread cannot progress until the flag ‘shared’ is set to true by another thread.
Suppose the initial value of ‘shared’ is false. If we naively privatize the access to ‘shared’
to be ‘local’, the resulting program may never exit from the while loop. In the original
program, however, this situation only happens if the other thread is never scheduled to change
the value of ‘shared’, so the privatized program differs from the semantics of the original
program. Therefore, we must also consider progressiveness for the privatization correctness.
8.4 Execution Privatization
To address these challenges, we developed a path and context sensitive privatization algorithm,
to make sure the privatization only applies to the correct execution paths we have observed and
to guarantee that privatization does not introduce extra behavior.
Our technique consists of two phases: dynamic trace analysis and code privatization. The
dynamic trace analysis phase presents the privatizable accesses to the privatization phase, which
then performs the path-sensitive and context-sensitive privatization on the program source or
bytecode. In this section, we present our technique in detail. We also show the correctness of
privatization in Section 8.4.4.
8.4.1 Preliminaries
We first define a few basic concepts. We will use these concepts to describe our technique in the
rest of this section.
Definition 8.2. A basic block (BB) contains a sequence of program statements with only one
entry point and one exit point.
This definition refers to the standard notion of basic block in the control flow graph (CFG). In
our method, we give each BB in the program a unique ID.
Definition 8.3. A shared data access point (SAP) is a statement in some BB that reads or
writes shared data between threads at runtime.
Execution Privatization for Scheduler-Oblivious Concurrent Programs 122
Each SAP has a unique location in the program with the access type ∈ {WRITE,READ}. For
example, the simple program in Figure 8.6 has three SAPs, at lines 1, 2, 4, respectively, and
their access types are READ, WRITE, READ, respectively.
As a SAP is a static instruction that accesses shared data, a SAP may be executed multiple times at
runtime. Different execution instances may access different shared memory locations because
of possible pointer aliasing. In our method, we therefore also distinguish different execution
instances of a SAP at runtime.
Definition 8.4. A trace captures a multi-threaded program execution as a sequence of events
δ = ⟨ei⟩.
We consider the following four types of events:
• SAPE (t,s,m): a thread t executes a SAP s accessing a shared memory location m.
• BBI (t,b): a thread t enters a BB b.
• BBO (t,b): a thread t exits from a BB b.
• BLOCK (t): a thread t executes a blocking statement.
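A minimal Java rendering of this event model might look as follows (hypothetical names; records and sealed interfaces require Java 17+):

    // The four trace event types of Definition 8.4.
    sealed interface Event permits SapE, Bbi, Bbo, Block {}
    record SapE(int thread, int sap, long mem) implements Event {} // SAPE(t,s,m)
    record Bbi(int thread, int bb) implements Event {}             // BBI(t,b)
    record Bbo(int thread, int bb) implements Event {}             // BBO(t,b)
    record Block(int thread) implements Event {}                   // BLOCK(t)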
Definition 8.5. A privatizable SAP (P-SAP) is a READ SAPE in the trace that returns the value
read or written by the preceding SAPE of the same thread, with no BLOCK by that thread
in between. This preceding SAPE is called the dependent SAP (D-SAP) of the P-SAP.
Definition 8.6. A privatizable path (P-Path) is the path segment in the trace associated with a
P-SAP: it starts from the BB containing the D-SAP and ends at the BB containing the P-SAP,
both executed by the same thread.
P-Path is represented by the sequence of BBs executed between the P-SAP and the correspond-
ing D-SAP by the same thread. P-SAP and D-SAP are path-sensitive. For example, in Fig-
ure 8.6, there are two pairs (D-SAP1, P-SAP1) and (D-SAP2, P-SAP2) following the P-Paths
through lines (1→ 4) and (2→ 4), respectively.
Definition 8.7. The calling context of a P-SAP or a D-SAP is the sequence of active methods
and the method call sites on the stack, when the P-SAP or the D-SAP is executed.
The calling context defined here is similar to the standard definition [16, 120]. We will use it
to determine whether to perform the privatization on the P-SAP or not (Section 8.4.3.1). Note
that the calling context can be computed efficiently by analyzing the BBI and BBO events in the
trace, without any extra information at runtime.
    1  if (foo == null){      (D-SAP1)
    2    foo = new Foo();     (D-SAP2)
    3  }
    4  foo.m();               (P-SAP1, P-SAP2)

FIGURE 8.6: D-SAP and P-SAP are path-sensitive.
8.4.2 Dynamic Trace Analysis
The goal of our dynamic trace analysis is to find all the P-SAPs manifested in the observed
correct executions. Each reported P-SAP is also associated with the P-Path, which is used by
the second phase to perform the privatization.
Algorithm 14 shows our dynamic trace analysis algorithm. Our algorithm to extract the D-SAP
and P-SAP is similar to the work of AVIO [75] and CTrigger [99] in that the D-SAP is related
to the P-instruction and the P-SAP is related to the I-instruction. In contrast, the P-SAP in our work is
limited to READ accesses only, and our algorithm also needs to make sure that there is no blocking
operation between the D-SAP and the P-SAP by the same thread. Moreover, what we take as
input is a set of correct execution traces. Sharing the same essence with [135], our work does not
require the availability of erroneous executions to eliminate the erroneous thread interleavings.
Algorithm 14 DynamicTraceAnalysis(δ)
 1: Input: δ — a trace
 2: Let M denote all shared memory locations in δ
 3: δm ← the sequence of SAPEs in δ that access a shared memory location m
 4: δmt ← the sequence of SAPEs in δm that are performed by a thread t
 5: for each m ∈ M do
 6:   for each READ SAPE s ∈ δm do
 7:     sdef ← the most recent WRITE SAPE in δm before s
 8:     t ← the thread of s
 9:     s′ ← the most recent SAPE in δmt before s
10:     if s′ is a WRITE then
11:       s′def ← s′
12:     else
13:       s′def ← the most recent WRITE SAPE in δmt before s′
14:     if sdef == s′def then
15:       p-path p ← the sequence of BBs by t in δ from s′ to s
16:       if p does not contain a BLOCK statement then
17:         report p as privatizable
[Figure 8.7 depicts a P-Path p = bi-bi+1-bi+2-...-bj in the trace, running from the BB containing the D-SAP to the BB containing the P-SAP, and its privatized clone p′ = b′i-b′i+1-b′i+2-...-b′j containing D-SAP′ and P-SAP′.]

FIGURE 8.7: Conceptual view of execution privatization. The privatization is tailored to the P-Path.
To find a P-SAP, our algorithm iterates through the sequence of SAPEs on each shared memory
location by each thread. For each READ SAPE, s, that accesses the shared memory location, m,
by a thread, t, we first find the most recent SAPE on m before s that is performed by t, say it is
s′. To determine whether s is privatizable or not, we compare the most recent WRITE SAPEs
on m that is before s′ (including s′) and the most recent WRITE SAPEs on m that is before s. If
they are the same, we continue to check whether the path p from s′ to s by t in the trace contains
a BLOCK operation or not. If not, we report s as a P-SAP and p the corresponding P-Path.
The same procedure is applied for all threads and all shared memory locations in each trace.
Finally, we obtain a set of P-SAPs computed from all the traces. For a set of traces, the results
of P-SAPs are merged. Two P-SAPs are considered equivalent if their P-Paths are identical.
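To make the bookkeeping concrete, the following compact Java sketch (our names, Java 16+ records) performs the scan of Algorithm 14 for the SAPEs of a single shared memory location, given in trace order; the no-BLOCK test is approximated by recording, at each SAPE, how many BLOCK events its thread had executed so far.

    import java.util.*;

    final class PSapScan {
        // One SAPE on location m: its thread, access type, and the number of
        // BLOCK events its thread had executed when this SAPE occurred.
        record Sap(int thread, boolean isWrite, long blocksSoFar) {}

        // Returns (D-SAP index, P-SAP index) pairs among the given SAPEs.
        static List<int[]> findPrivatizable(List<Sap> saps) {
            List<int[]> pairs = new ArrayList<>();
            Integer lastWrite = null;                       // s_def candidate
            Map<Integer, Integer> lastByThread = new HashMap<>();
            Map<Integer, Integer> lastWriteByThread = new HashMap<>();
            for (int i = 0; i < saps.size(); i++) {
                Sap s = saps.get(i);
                if (!s.isWrite()) {
                    Integer sPrev = lastByThread.get(s.thread());      // s'
                    Integer sPrevDef = (sPrev != null && saps.get(sPrev).isWrite())
                            ? sPrev
                            : lastWriteByThread.get(s.thread());       // s'_def
                    boolean sameDef = sPrev != null
                            && Objects.equals(lastWrite, sPrevDef);    // s_def == s'_def
                    boolean noBlock = sPrev != null
                            && saps.get(sPrev).blocksSoFar() == s.blocksSoFar();
                    if (sameDef && noBlock) pairs.add(new int[] { sPrev, i });
                }
                lastByThread.put(s.thread(), i);
                if (s.isWrite()) {
                    lastWrite = i;
                    lastWriteByThread.put(s.thread(), i);
                }
            }
            return pairs;
        }
    }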
8.4.3 Path and Context Sensitive Privatization
The execution privatization is essentially a program transformation process that takes the P-
SAPs reported in the trace analysis phase and produces a privatized version of the program in
which the P-SAPs are all privatized. We iterate through the list of P-SAPs and perform the
privatization for each of them.
For each P-SAP, the privatization is tailored to the associated P-Path, as illustrated in Figure
8.7. Conceptually, we clone the P-Path for each P-SAP and attach it to the program. Most of the
cloned P-Path is the same as the original, with the main difference that the P-SAP is privatized to
access a thread local variable which contains the value accessed by the D-SAP. More formally,
consider a P-Path p = bi-bi+1-bi+2-...-bj, where the D-SAP and P-SAP are in the BBs bi and bj,
respectively. We clone p to p′ = b′i-b′i+1-b′i+2-...-b′j, where b′i = bi with the D-SAP replaced
by D-SAP′, b′i+1 = bi+1, ..., b′j−1 = bj−1, and b′j = bj with the P-SAP replaced by P-SAP′.
D-SAP′ and P-SAP′ are determined by the privatization rules. Moreover, to ensure soundness,
the P-Path clone must guarantee that p′ is executed in the privatized program iff p is executed
in the original program.
    D-SAP:   WRITE s   ⇒   D-SAP′:  WRITE s_local; s = s_local
    D-SAP:   READ s    ⇒   D-SAP′:  s_local = s; READ s_local
    P-SAP:   READ s    ⇒   P-SAP′:  READ s_local

FIGURE 8.8: Privatization rules of D-SAP and P-SAP.
    1  int local1 = getData();     (D-SAP's call site)
       ...
    2  int local2 = getData();     (P-SAP's call site)

       int getData(){
    3    return shared;            (both the D-SAP and the P-SAP)
       }

FIGURE 8.9: The P-SAP and the D-SAP are at the same program location (line 3). Nevertheless, because their calling contexts are different (line 1 and line 2, respectively), they are still privatizable.
Furthermore, recall that we must also consider progressiveness before performing any naive
privatization. The key to progressiveness is that any shared data access inside a loop should
be able to see the change to the shared data; otherwise, the program may never progress out of
the loop. To address this problem, after privatizing all the P-SAPs, we perform an additional
inter-procedural loop analysis to decide whether any privatized P-SAP is inside a loop.
If it is, we ensure that not all P-SAPs inside the loop are privatized. In this way, because at
least one P-SAP still accesses the shared data, any change to the shared data is guaranteed to be
visible to all the P-SAPs.
In the rest of this section, we first show the privatization rules in detail. Then we present our
path and context sensitive P-Path cloning algorithm.
8.4.3.1 Privatization Rules
Figure 8.8 shows the privatization rules for the D-SAP and P-SAP. The P-SAP is a READ access to
some shared variable s. Our privatization replaces it with a read of a local variable s_local instead.
The value of s_local is obtained from the privatization of the D-SAP. Depending on the
access type of the D-SAP, the treatments are slightly different. If the D-SAP is a WRITE
access, we first change it to store the written value into a local variable s_local and then insert
a new statement s = s_local after it, which stores the value in s_local back to s. If the D-SAP is
a READ access, we first insert a new statement that stores the value of s into s_local and then
change the D-SAP to read s_local instead of s. Clearly, in this way, when the P-SAP is exe-
cuted, instead of reading the original shared variable s, it will read the local variable s_local,
which stores the value of s.
Privatization scope Note that privatization is applicable to the whole program and is general
to all calling contexts in the trace. It is not limited to a single method or a single module. The
P-Path may span multiple modules and contain multiple method calls. Also, the P-SAP and
D-SAP may be at the same program location, as long as their calling contexts (Definition 8.7)
in the P-Path are different. We use the sequence of BBI events and their call sites to represent
the calling context. Scanning the trace from its beginning to the P-SAP (D-SAP), every BBI
event by the same thread is added to the calling context, and when a BBO event occurs, the
corresponding BBI event is removed from the context.
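This context computation can be sketched as a simple stack replay over the thread's events; Event and its fields below are illustrative names, assuming a simple trace event representation:

import java.util.ArrayDeque;
import java.util.Deque;

final class CallingContext {

    enum EventType { BBI, BBO, READ, WRITE, BLOCK }

    static final class Event {
        final EventType type;
        final int callSite;   // call site (or block ID) carried by a BBI event
        Event(EventType type, int callSite) { this.type = type; this.callSite = callSite; }
    }

    /** Replays one thread's events from the start of the trace up to the SAP
     *  of interest; the resulting stack of open BBI call sites is the calling
     *  context of that SAP. */
    static Deque<Integer> contextAt(Iterable<Event> eventsUpToSap) {
        Deque<Integer> context = new ArrayDeque<Integer>();
        for (Event e : eventsUpToSap) {
            if (e.type == EventType.BBI) {
                context.push(e.callSite);   // block entered: remember its call site
            } else if (e.type == EventType.BBO) {
                context.pop();              // block exited: drop the matching BBI
            }
        }
        return context;
    }
}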
Privatization transitivity An interesting property of the privatization is transitivity. The D-SAP
of a P-SAP itself might also be a P-SAP, which has its own D-SAP. This forms a loop of D-SAPs
and P-SAPs if every D-SAP in it is also a P-SAP, or a chain when there exists a D-SAP that
is not a P-SAP. When it forms a chain, let us call the unique D-SAP that is not a P-SAP the
ancestor. The ancestor has the nice property that its local value can be directly used by all the
other P-SAPs in the chain. This makes reuse of the local variable possible, freeing us from
creating a new local variable for each P-SAP.
Progressiveness guarantee However, we must be careful when the P-SAPs and their D-SAPs
form a loop. As noted in Section 8.3.2, we must make sure the privatization does not break the
progressiveness of the original program. If any of the P-SAPs and their D-SAPs form a loop,
after the privatization, all the P-SAPs in the loop are privatized and the change to the shared
data would not be seen by the privatized P-SAPs. When the shared data is related to the loop
condition, the program may be inside the loop forever. The key to addressing this problem is to
break the loop, ensuring that at least one P-SAP inside the loop should be able to see the change
to the shared data. We resolve this problem by performing a whole program loop analysis after
privatizing all the P-SAPs. For each privatized P-SAP, we check whether it is inside a loop of
P-SAPs or not. If it is, we simply unprivatize one of the P-SAPs in the loop. In this way, at least
one shared data access remains unprivatized and can see the change to the shared data, through
which the change propagates to the other P-SAPs. Therefore, the progressiveness of the original
program is preserved.
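The loop-breaking step itself is simple once the cycles of P-SAPs are known; the following is a minimal sketch, in which PSap, unprivatize, and the cycle computation are illustrative placeholders rather than our actual implementation:

import java.util.List;
import java.util.Set;

final class ProgressivenessGuard {

    interface PSap { void unprivatize(); }   // illustrative placeholder

    /** For every cycle of P-SAPs found by the whole-program loop analysis,
     *  undo one privatization, so at least one access in the loop still
     *  reads the real shared variable and observes its changes. */
    static void breakLoops(List<Set<PSap>> psapCycles) {
        for (Set<PSap> cycle : psapCycles) {
            PSap victim = cycle.iterator().next();  // any member works
            victim.unprivatize();                   // restore the shared read
        }
    }
}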
Variable visibility An additional problem we need to address is the visibility of the local vari-
able s_local when the D-SAP and the P-SAP are within different methods. Because s_local
is only visible in the method in which it is declared, the P-SAP cannot read it from a different
method. For such inter-procedural cases, we declare s_local as a thread-local static variable.
The variable is a static field of a singleton class added to the program, and it is unique for each
P-SAP. In this way, the P-SAP is able to read s_local directly.
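In Java, such a per-P-SAP thread-local static field can be realized with java.lang.ThreadLocal; the following is a hedged sketch in which the class and field names are illustrative:

final class PrivateerLocals {
    private PrivateerLocals() { }

    // One thread-local static field per P-SAP, so the value is both
    // private to each thread and visible across methods.
    static final ThreadLocal<Integer> s_local_psap17 = new ThreadLocal<Integer>() {
        @Override protected Integer initialValue() { return 0; }
    };
}

// At the privatized D-SAP (in one method):
//     PrivateerLocals.s_local_psap17.set(s);
// At the privatized P-SAP (in another method):
//     int v = PrivateerLocals.s_local_psap17.get();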
Algorithm 15 P-Path Clone (p)
 1: Input: p = b_i b_{i+1} ... b_j, the P-Path from the D-SAP to the P-SAP
 2: for k ← i+1 to j do
 3:     if b_k is an entry BB to a new method m then
 4:         clone m to m_privatized
 5:         update the call site in b_{k-1} to m_privatized
 6:     else
 7:         if b_k has more than one predecessor in the CFG then
 8:             clone b_k to b'_k
 9:             update the edge from b_{k-1} to b'_k
FIGURE 8.10: Intra-procedural privatization. The P-Path from b_i (containing the D-SAP) to b_j (containing the P-SAP) is cloned into b'_i ... b'_j (containing D-SAP' and P-SAP'); blocks such as b_k that are shared with other paths are cloned to b'_k, so all other paths remain unchanged.
8.4.3.2 Path and Context Sensitive P-Path Clone
Because there might be complicated control flows and a possibly infinite number of paths in the
program, the main challenge of the P-Path clone is to ensure that only the P-Path is cloned and
no other path. That is, all the other paths in the program except the P-Path remain
unchanged in the privatized program. To achieve this, our algorithm carefully clones the P-Path
by taking care of every BB and the context in the P-Path. Algorithm 15 shows our P-Path clone
algorithm. It traverses each BB in the P-Path from b_i to b_j, which contain the D-SAP and the
P-SAP, respectively. For each BB, it first checks whether the BB is an entry block to a new
method or not. If yes, it means that the path has an inter-procedural transition, and we hence
clone the new method and also update the corresponding invocation site in the preceding BB.
Otherwise, the BB goes through an intra-procedural cloning process. In the intra-procedural
phase, our algorithm checks whether the BB has multiple predecessors in the CFG or not. If
yes, it means that there are other paths different from the P-Path that pass through this BB. So
we clone this BB in the CFG and update the edge from the preceding BB to it correspondingly.
This procedure is repeated for every BB until all the BBs in the P-Path are processed. Finally,
the whole P-Path is cloned and all the BB transitions on the P-Path are correctly updated.
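Over an abstract CFG representation, the algorithm can be sketched as follows; BasicBlock, Method, and their operations are illustrative placeholders, and a full implementation would also redirect edges from previously cloned blocks rather than from the originals:

import java.util.List;

final class PPathCloner {

    interface Method { Method cloneAsPrivatized(); }

    interface BasicBlock {
        boolean isMethodEntry();
        Method enclosingMethod();
        int predecessorCount();
        BasicBlock cloneBlock();
        void retargetCall(Method newCallee);
        void redirectEdge(BasicBlock oldSucc, BasicBlock newSucc);
    }

    /** Clones the P-Path p = b_i ... b_j so that only this path is duplicated. */
    static void clonePPath(List<BasicBlock> p) {
        for (int k = 1; k < p.size(); k++) {
            BasicBlock bk = p.get(k);
            BasicBlock pred = p.get(k - 1);
            if (bk.isMethodEntry()) {
                // Inter-procedural transition: clone the callee and retarget
                // the invocation site in the preceding block.
                pred.retargetCall(bk.enclosingMethod().cloneAsPrivatized());
            } else if (bk.predecessorCount() > 1) {
                // Other paths flow through bk: clone it so they stay intact,
                // and reroute only the P-Path edge to the clone.
                pred.redirectEdge(bk, bk.cloneBlock());
            }
        }
    }
}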
FIGURE 8.11: Inter-procedural privatization. The P-Path spans methods m1 and m2, which contain the D-SAP (block b_i) and the P-SAP (block b_j), respectively. They are cloned to m1_privatize and m2_privatize (with D-SAP' in b'_i and P-SAP' in b'_j), and the call site m2(args2) on the cloned path becomes m2_privatize(args2).
Examples Figure 8.10 and Figure 8.11 illustrate the privatization of the intra-procedural and
inter-procedural cases, respectively. In the intra-procedural case, the P-Path is cloned and the D-
SAP and the P-SAP are updated to D-SAP’ and P-SAP’ respectively in the cloned P-Path, and all
the other paths remain the same. For the inter-procedural case, in addition to the intra-procedural
treatments, we also have to handle the method transitions. In the example, suppose the P-Path
spans the methods m1 and m2, inside which the D-SAP and the P-SAP are accessed, respectively.
In the privatized version, m1 and m2 are cloned to be m1_privatize and m2_privatize,
respectively, and their invocation sites in the paths are also updated correspondingly.
8.4.4 Privatization Correctness
An important property guaranteed by our approach is that, for any scheduler-oblivious program,
the privatization is safe: it does not introduce additional behavior beyond what could be exhib-
ited by the original program. In this section, we prove the following theorem:
Theorem 8.8. Our execution privatization is safe for all scheduler-oblivious programs.
Proof. The key requirement of a scheduler-oblivious program is that the program computation
is the same regardless of the underlying thread scheduling. Given the same input and the same
execution environment, even if the scheduling is different, it always returns the same output.
Since our privatization algorithm is tailored to the P-Path, which is a part (a segment) of an
observed correct execution, it is sufficient to prove the privatization correctness of the P-Path.
FIGURE 8.12: Architecture of Privateer. The instrumentor, built on Soot, transforms the program bytecode; the recorder logs execution traces while the instrumented program runs on the JVM; the analyzer computes the privatizable SAPs from the traces; and the privatizer produces the privatized program, which runs on the JVM.
Remember that in the P-Path, the D-SAP and P-SAP are two consecutive accesses to the same
shared data. Since our privatization only changes the P-SAP to read the same value as that
read or written by the D-SAP, and the P-Path does not contain a blocking statement, it satisfies
the conditions of privatization in Theorem 8.1. By the theorem of the privatizability property,
for any input, the privatized program is guaranteed to reach the same final state as that reached
by the original program. This proves the privatization correctness.
8.5 Implementation
We have implemented and evaluated Privateer for Java. Figure 8.12 shows the architecture. It
contains four main components: the instrumentor, the recorder, the analyzer, and the privatizer.
The instrumentor is a Soot bytecode transformation phase that prepares a program for use with
our execution privatization system. It instruments the shared variable accesses, blocking state-
ments, and the basic block entrances/exits, which are recorded for all threads in a global order
by the recorder at runtime. In Java, we consider Object.wait(), Thread.join(), Thread.yield(),
and the boundaries of synchronized blocks and methods as blocking statements. We chose Soot
as our instrumentation framework for its compatibility with the newest JDK 1.7 and its easy-to-
analyze intermediate representation (Jimple IR). However, our approach is general and should
apply beyond Java bytecode.
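For readers unfamiliar with Soot, such a phase is typically structured as a BodyTransformer; the following minimal sketch (not our actual instrumentor) shows the shape, here only identifying field accesses:

import java.util.Map;
import soot.Body;
import soot.BodyTransformer;
import soot.Unit;
import soot.jimple.Stmt;

public class SketchInstrumentor extends BodyTransformer {
    @Override
    protected void internalTransform(Body body, String phase, Map<String, String> options) {
        for (Unit u : body.getUnits()) {
            Stmt stmt = (Stmt) u;
            if (stmt.containsFieldRef()) {
                // A shared variable access: a real instrumentor would insert a
                // recorder call here, e.g., built via Jimple.v().newInvokeStmt(...).
            }
        }
    }
}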
The recorder is similar to existing systems that deterministically record executions [25, 48].
Our current recorder is implemented as a separate Java library invoked from the instrumented
program. When a program runs, the recorder saves the runtime traces into the database. Each
event in the trace is either a shared variable access, a blocking operation, or a basic block en-
trance/exit (BBI/BBO), containing the thread ID, the shared memory location at runtime or the
basic block ID, the access type (READ/WRITE/BLOCK/BBI/BBO), and the program location of
the event. The recorder does not record program input data, because our analysis does not need
this information.
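A trace event as just described can be sketched as follows; the field names are illustrative, not the recorder's actual schema:

final class TraceEvent {
    enum Type { READ, WRITE, BLOCK, BBI, BBO }

    final long threadId;    // thread performing the event
    final long locationId;  // runtime shared memory location, or basic block ID
    final Type type;        // access type
    final int programLoc;   // static program location of the event

    TraceEvent(long threadId, long locationId, Type type, int programLoc) {
        this.threadId = threadId;
        this.locationId = locationId;
        this.type = type;
        this.programLoc = programLoc;
    }
}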
The analyzer is a stand-alone program that reads the runtime traces from the database and com-
putes the P-SAPs for each program. To compute them, the analyzer first extracts a total order
of SAPs for each shared memory location and each thread from the execution trace. It then
extracts the P-SAPs using the ordered SAPs. To find the P-SAPs, the analyzer analyzes each
pair of two consecutive SAPs by the same thread for each shared data. If the latter SAP reads
the value written by the preceding SAP or they both read the value written by the same write,
then the latter SAP is a P-SAP, and the corresponding P-Path is reported.
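The core of this check can be sketched as a single scan over each per-thread, per-location access sequence; the Sap type and its fields are illustrative:

import java.util.ArrayList;
import java.util.List;

final class PSapDetector {

    static final class Sap {
        final long id;         // unique event ID
        final boolean isWrite;
        final long valueFrom;  // ID of the WRITE whose value this access observed
        Sap(long id, boolean isWrite, long valueFrom) {
            this.id = id; this.isWrite = isWrite; this.valueFrom = valueFrom;
        }
    }

    /** saps: the total order of one thread's accesses to one shared
     *  location, extracted from the trace. */
    static List<Sap> findPSaps(List<Sap> saps) {
        List<Sap> psaps = new ArrayList<Sap>();
        for (int i = 1; i < saps.size(); i++) {
            Sap prev = saps.get(i - 1), cur = saps.get(i);
            // Case 1: cur reads the value written by the preceding SAP.
            boolean readsPrevWrite = prev.isWrite && !cur.isWrite && cur.valueFrom == prev.id;
            // Case 2: both read the value written by the same write.
            boolean sameSourceReads = !prev.isWrite && !cur.isWrite
                    && cur.valueFrom == prev.valueFrom;
            if (readsPrevWrite || sameSourceReads) {
                psaps.add(cur);   // cur is privatizable (a P-SAP)
            }
        }
        return psaps;
    }
}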
The privatizer is the key component of our system. It is implemented as a whole program trans-
formation phase in Soot. Taking the P-SAPs and the program source (or the program bytecode
with the program location information) as the input, the privatizer privatizes the P-SAPs along
their associated P-Paths in the recorded executions. The core of privatization is to change the
P-SAP, which originally is a shared data read access, to a local access that instead reads the
value returned by its corresponding D-SAP. To ensure the privatization correctness, the priva-
tizer clones the P-Path and inserts it into the program according to Algorithm 15 and following
the rules in Section 8.4.3.1.
8.6 Experiments
Our evaluation aims at answering the following two research questions:
RQ1. Usefulness - What is the impact of privatization? How useful is it? How does it affect
program maintenance?
RQ2. Effectiveness - How much privatization opportunity is there in real world concurrent sys-
tems?
To evaluate usefulness, we use nine real concurrency bugs to assess the bug fixing capability of
the privatization, and three popular multithreaded benchmarks as well as a micro-benchmark to
understand the performance improvement brought by the privatization. To evaluate effective-
ness, we apply our system on five large complex real world concurrent server programs to see
how many privatizable accesses there are in these systems. We also report the program size
increase after privatization, which may affect program maintenance.
All experiments were conducted on two 8-core 3.00GHz Intel Xeon machines with 16GB mem-
ory, running Linux 2.6.22 and JDK 1.7.
TABLE 8.1: Results of real concurrency bug fixing by privatization

Bug ID        Application      Existing fix               Fix time (days)  Fixed by privatization?
StringBuffer  JDK 1.4.2        Documented thread unsafe   -                YES
Derby1573     Derby-10.2.1.6   privatization              365              YES
Derby2861     Derby-10.3.2.1   privatization              365              YES
Derby3260     Derby-10.3.1.4   synchronization            46               YES
Derby4018     Derby-10.4.2     synchronization            168              NO
Jetty-284     Jetty-6.1.2      synchronization            1                YES
Jetty-1269    Jetty-6.1.8      code structure change      33               YES
Jetty-425     Jetty-6.1.3      privatization              268              YES
Jetty-418     Jetty-5.x        synchronization            19               NO
8.6.1 Concurrency Bug Fixing
By isolating the potential erroneous preemptive interleavings, execution privatization has the
effect of fixing concurrency bugs. The salient feature of privatization is that, unlike the general
concurrency bug fixing techniques [56, 135] that often incur non-negligible program slowdown,
privatization does not result in any additional runtime overhead. Moreover, because privatiza-
tion does not introduce any extra synchronization into the program, it is completely free from
deadlock.
We have applied our system to nine real world crash bugs, one from the StringBuffer library in
JDK-1.4.2, four from Derby, and four from Jetty. Table 8.1 shows a summary of these bugs.
Most of these bugs are hard to fix. Some of them even lasted for as long as a year before they
were fixed, such as Derby #1573 and Derby #2861. Our experiments show that, among
the nine bugs, the privatization is able to fix seven of them (as shown in Column 5 of Table 8.1).
We conclude that privatization is applicable to fixing two classes of concurrency bugs: p(WRITE)-
r(WRITE)-c(READ) and p(READ)-r(WRITE)-c(READ), which belong to two of the five types of
atomicity violations [99]. In these two types of bugs, the c access is privatizable. For
scheduler-oblivious programs, privatization is expected to be the correct and most appropriate way
to fix these two types of bugs. For instance, three of the seven fixed bugs (Derby #1573,
Derby #2861 and Jetty #425) were indeed fixed by the developers using source code
level privatization.
A typical scenario where the privatization applies but may not fix the bug is illustrated in Figure
8.13. Both the bugs Derby #4018 and Jetty #418 that our privatization fails to fix belong
to this pattern. The two accesses to list should always return the same data, not only the list
reference, but also the whole list itself. Privatization makes the list reference private, but
not the whole content of the list. Hence, the list content can still be changed by other
for (int i = 0; i < list.size(); i++) {
    list.get(i);
}
// May throw IndexOutOfBoundsException
// if another thread modifies the
// content of the list

FIGURE 8.13: Privatization may not repair this bug
TABLE 8.2: Performance improvement by privatization

Program     Input             Time-original  Time-privatized
Microbench  100M / 8 threads  49.0s          40.2s (17.9%)
RayTracer   C / 100 threads   5.6s           4.9s (12.2%)
MonteCarlo  C / 100 threads   9.2s           8.8s (4.3%)
Moldyn      C / 100 threads   11.5s          10.7s (6.7%)
threads. To fix this bug, a synchronization mechanism is needed to protect the list content
from being modified.
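For this pattern, a fix needs to make the whole traversal atomic; the following is a hedged sketch of such a synchronization-based fix, assuming all writers also synchronize on list:

import java.util.List;

final class SafeTraversal {
    static void traverse(List<?> list) {
        synchronized (list) {                    // guards the whole traversal
            for (int i = 0; i < list.size(); i++) {
                list.get(i);                     // content can no longer change
            }                                    // between size() and get(i)
        }
    }
}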
8.6.2 Performance Improvement
An additional advantage of execution privatization is that, by privatizing the shared heap ac-
cesses into local stack accesses, it can help improve program performance. We first design
a micro-benchmark (Figure 8.2) to understand the range of this performance improvement ef-
fect. To further evaluate the performance impact, we also apply our technique to three popular
multithreaded benchmarks: RayTracer, MonteCarlo, and Moldyn. In all these
benchmarks, we start 100 threads with the input size C.
Table 8.2 shows the performance results. All data are averaged over 10 runs. With privatization,
all these subjects have nontrivial performance improvement. For our micro-benchmark, the
performance improvement is as large as 17.9%. For the other benchmarks, the performance
improvement ranges from 4.3% to 12.2%. In fact, all these benchmarks have a small number
of privatizable locations. The reason for the notable performance improvement is that these
privatizable locations are hot access points during the execution. Most of them are volatile
accesses that are frequently executed in loops. After privatization, they all become local accesses,
so program performance improves significantly. Figure 8.14 shows
such a typical case in the RayTracer benchmark. Programmers frequently access field array
variables directly; however, these variables are mostly read-only after initialization. Clearly,
writing code this way is easy, but it is not good practice for program performance.
volatile boolean[] IsDone;

public void DoBarrier(int myid) {
    boolean donevalue = !IsDone[myid];
    while (...) {
        for (...) {
            while (IsDone[...] != donevalue) {
                ...
            }
        }
    }
    IsDone[myid] = donevalue;
    while (IsDone[0] != donevalue) {
        ...
    }
}

FIGURE 8.14: Frequent shared array accesses in RayTracer
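To illustrate why such accesses benefit from privatization, consider the following hedged sketch (not the actual output of our privatizer): the reference stored in the volatile field IsDone is read-only after initialization, so consecutive reads of it are read-after-read P-SAPs and can be served from a local copy, avoiding a volatile field read on every loop iteration. Privateer applies such a transformation only along recorded P-Paths and subject to the progressiveness check.

class Barrier {
    volatile boolean[] IsDone;

    public void DoBarrierPrivatized(int myid) {
        boolean[] done = IsDone;          // a single shared (volatile) read
        boolean donevalue = !done[myid];
        // ... the loops then index the local reference `done` instead of
        // re-reading the volatile field IsDone on every iteration ...
        done[myid] = donevalue;
    }
}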
We also experimented with our approach on large real world systems. However, because the per-
formance effect of the privatizable accesses in these systems is relatively small (compared
with the other instructions), we did not observe a significant performance boost on them.
8.6.3 Pervasive Privatization Opportunities
To evaluate the effectiveness of execution privatization, we applied our system to a set of real
world applications, including five large complex server systems: Apache Derby, Tomcat, Jetty,
OpenJMS and Jigsaw. To maximize the usage of privatized executions, we first collect typical
good executions with different program inputs in the test suite under random schedules. For
each program, we collect the traces of 100 good runs with 10 different inputs and 10 random
schedules for each input.
Table 8.3 reports the privatization statistics. In these real world systems, we found a total of
5,119 privatizable accesses, which account for 23.6% of the total (21,733) shared data accesses
in them (accesses with the same program location are counted once). The overall percent-
age for each program ranges from 14.7% to 30.7%. The result clearly demonstrates that there
exist pervasive privatization opportunities in large, complex, real world concurrent systems. More
importantly, it strongly supports the effectiveness of applying execution privatization to
real world applications.
Through manual inspection of the large number of privatizable accesses, we have identified
several typical reasons for the pervasive privatization opportunities:
TABLE 8.3: Statistics of the privatization results

Application    LOC      #Shared accesses  #Privatizable accesses  #Intra-procedural  #Inter-procedural
Jetty-6.1.x    49,746   1,362             219 (16.1%)             175 (79.9%)        44 (20.1%)
OpenJMS-0.7.7  154,563  6,934             2,126 (30.7%)           1,997 (93.9%)      129 (6.1%)
Tomcat-6.0.33  339,405  8,543             1,260 (14.7%)           1,173 (93.1%)      87 (6.9%)
Jigsaw-2.2.6   381,348  1,699             510 (30.0%)             347 (68.0%)        163 (32.0%)
Derby-10.2-4   665,733  3,195             968 (30.3%)             840 (86.8%)        128 (13.2%)
TABLE 8.4: Bytecode size increase after privatization

Application    Size (bytes)  Size-privatized  Increase
Jetty-6.1.x    1,678,586     1,712,820        34,234 (2.03%)
OpenJMS-0.7.7  3,563,274     3,833,938        270,664 (7.60%)
Tomcat-6.0.33  7,434,520     7,791,321        356,801 (4.80%)
Jigsaw-2.2.6   8,665,258     8,900,182        234,924 (2.71%)
Derby-10.2-4   23,600,432    24,059,525       459,093 (1.95%)
Shared variable name reusing To access the same data at different program locations, a
common and convenient practice is to reuse the same identifier to access the data directly.
For example, the privatizable access at line 11 in the Derby bug
example (Figure 8.1) is manifested as a reuse of the identifier referencedColumnMap,
which is also used by the read access at line 4. In fact, all cases of privatizable accesses are
manifested by variable name reuse. Programmers tend to reason in a modularized way: they
frequently use the same variable to access the same shared data, without considering thread
interleavings.
Unexpected sharing Programmers are often unaware of concurrency when writing the code.
Since they do not expect sharing among multiple threads, they believe that in a sequential envi-
ronment the compiler would automatically help with the privatization. Unfortunately, in multi-
threaded circumstances, it is in general very hard for standard compilers to do such optimization
across threads. This often happens when sequential library code is used in a multithreaded
program in ways the library developer did not intend. For example, we found quite a few
privatizable accesses in the logging library log4j, which is used by both Tomcat and OpenJMS.
Complicated control flow and context Another typical reason we find through our study is
that privatizable accesses may span over complicated control flows or calling contexts, which is
difficult for programmers to reason about. For example, in the StringBuffer bug in Figure 8.4,
the two accesses to the shared data count at lines 11 and 15 span several method calls and
control branches. Facing the large number of calling contexts and control flows, it is usually
difficult for programmers to reason about privatizable accesses.
8.6.4 Program Maintenance
Despite the many benefits of privatization, a direct cost is that it may affect program mainte-
nance. As our technique uses basic block cloning to perform the privatization, it increases the
size of the program. Intuitively, our privatization might clone too much
when there is a long P-Path on the CFG between the D-SAP and the P-SAP. Nevertheless, this
problem seldom happens. In our case, a long P-Path means there is no intermediate access to the
same shared data on it, in which case we can easily promote the P-SAP to the same block
as the D-SAP without incurring any data or control flow change.
Table 8.4 reports the bytecode size increase by privatization in the real world large systems. The
overall size increase ranges from 1.95% in Derby to 7.60% in OpenJMS, which is relatively
small. In our studied systems, for most of the privatizable accesses, the D-SAP and the P-SAP
are within the same procedure (see Table 8.3, Columns 5-6) and their basic blocks are often
next to each other. For these cases, because we only need to clone the intra-procedural P-Path
rather than the entire method, the space increase is often much smaller than in the
inter-procedural cases.
On the other hand, since many field variable accesses become local ones through privatization,
the number of field variable accesses in the original program is reduced. We argue that our
technique is also good for program maintenance in some aspects. For example, when refactoring
a field name, there are fewer places to change in the program. To understand a program fault
related to a field reference, the size of the cause-effect chain to the privatized field accesses is
also reduced.
8.7 Discussions
Besides concurrency bug fixing, execution privatization has a wide range of applications in
concurrent program testing and debugging. We discuss a few of the applications in this section.
We also briefly discuss some caveats related to the application scope of the privatization.
8.7.1 Concurrent Program Testing and Debugging
Record/replay The record and replay technique [45, 48, 83] aims at fully reenacting an ear-
lier program execution. For concurrent programs, it is one of the most important techniques
for program understanding and debugging. In general, record/replay requires capturing and en-
forcing the thread interleavings at runtime, which often incurs significant program slowdown
that limits its applicability at the production site. With privatization, the portion of thread in-
terleavings on the privatized accesses no longer exists. Consequently, the overhead incurred by
capturing this portion of interleavings is completely eliminated, hence dramatically improving
the performance of record/replay.
Deterministic multithreading The key insight of deterministic multithreading (DMT) is that a
small set of schedules is often enough for good performance. By limiting the program to exer-
cise a small, well-tested set of schedules, DMT explores a good tradeoff between program
performance and reliability. To achieve this goal, existing techniques employ either static
type systems [11, 14] or runtime support [9, 10, 26]. With execution privatization, DMT tech-
niques can ignore the schedule enforcement on the privatized accesses. Ultimately, for the set
of executions that follow the same path as the privatized execution, the performance would also
be significantly improved.
Concurrency bug understanding Recent research [49, 55] has shown that concurrent program
execution traces often contain many thread context switches that perplex the bug reasoning pro-
cesses. A simplified trace with fewer context switches greatly reduces the debugging
effort by reducing the number of places in the trace where we need to look for the cause of the
bug. With execution privatization, future executions of the privatized program contain
fewer thread interleavings. The bug reasoning process based on the privatized execution trace
would also be simplified.
8.7.2 Privatization Scope
Although execution privatization brings many advantages, we note that its application scope is
also limited:
Bug repair An important note on concurrency bug repair is that privatization does not generalize
to all concurrency bugs but applies only to the two classes of atomicity violation bugs identified
in Section 8.6.1. As pointed out by Attiya et al. [5], expensive synchronizations
cannot be eliminated for the operations of read-after-write (RAW) to different shared variables
and atomic write-after-read (AWAR) to the same shared variable. For concurrency bugs such as
order violations that miss a happens-before relation across different threads, privatization is also
not applicable, and adding synchronization is necessary to bridge the happens-before depen-
dence. Hence, privatization should not be considered as a replacement for synchronization, but
is rather complementary to it. On the other hand, not all privatizable accesses are necessarily
related to concurrency bugs (though they often are).
Lock removal Another caveat is that privatization does not eliminate or change the original
synchronization operations in the program. Although it looks plausible that some lock/unlock
operations in the original program can be removed after the shared accesses inside them are priva-
tized, we note that doing so is in general dangerous, as it might change the program semantics.
Thread t1              Thread t2
1: lock l
2: fork t2
3: write x;
4: unlock l
                       5: lock l
                       6: unlock l
                       7: read x;

FIGURE 8.15: Removing locks could result in a semantic change. The lock/unlock operations at lines 5/6 cannot be removed, though there is no code to execute between them.
Take the program in Figure 8.15 as an example. The empty lock/unlock operation at lines 5/6
cannot be removed, because together with the fork operation at line 2 they form a happens-
before relation between thread t1 and thread t2. A data race on accessing the shared variable
x would occur if the empty lock/unlock operations were eliminated. Another reason is that, from
the perspective of the memory model, synchronizations have the effect of flushing the cache. Re-
moving synchronizations eliminates this effect, which breaks the semantics of programs that
rely on it to achieve certain behaviors.
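In Java terms, the example of Figure 8.15 corresponds roughly to the following sketch, in which the class and method names are illustrative:

class LockRemovalExample {
    static int x;
    static final Object l = new Object();

    void t1() {
        synchronized (l) {                  // line 1: lock l
            new Thread(this::t2).start();   // line 2: fork t2
            x = 1;                          // line 3: write x
        }                                   // line 4: unlock l
    }

    void t2() {
        synchronized (l) { }                // lines 5/6: "empty" lock/unlock;
                                            // t2 cannot pass until t1 releases l,
                                            // ordering the read after the write
        int r = x;                          // line 7: read x
    }
}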
8.8 Summary
We have presented a fundamental observation of the privatizability property that enables sound
privatization of scheduler-oblivious programs. We highlight our contributions as follows:
1. We present a fundamental observation of the privatizability property that enables privatizing
shared data accesses in scheduler-oblivious programs, which helps support their deterministic
execution without compromising parallelism.
2. We present a novel path and context sensitive execution privatization technique that safely
privatizes a program without introducing any extra program behavior.
3. We evaluate our technique on a set of large complex Java programs and the results show that
several real bugs are fixed without incurring any performance penalty, and notable performance
improvement is achieved on benchmarks.
Chapter 9
Conclusion and Future Work
This thesis makes contributions to concurrent program debugging along four directions: mul-
tiprocessor deterministic replay, predictive trace analysis, trace simplification, and data sharing
reduction.
Along the direction of multiprocessor deterministic replay, this thesis presents a new local-order
based recording approach that supports the deterministic replay but with much lower overhead
compared to previous approaches. We present the design and implementation of the first multi-
processor deterministic replay system, LEAP, for Java programs. By deterministically replaying
concurrent programs, LEAP substantially helps the debugging of concurrent programs by making
non-deterministic concurrency bugs reproducible. In addition, LEAP records much less information
compared to the classical global-order based approach. It is fast, portable, and determinis-
tic. LEAP is available in the public domain and has been used by several research institutions
worldwide.
Along the direction of predictive trace analysis, this thesis proposes the idea of persuasiveness in
the trace-based prediction of concurrency access anomalies. The introduction of persuasiveness
has two important contributions. First, it makes predictive trace analysis more useful for concur-
rency bug detection as it eliminates all the false warnings through runtime verification. Second,
it greatly improves debugging effectiveness as it provides a full execution history and context
information for the bug diagnosis. We also contribute the design and implementation of a fully
automatic persuasive predictive analysis tool, PECAN, for Java programs. PECAN is publicly
available and has revealed several serious bugs in large open source concurrent systems.
Our second main contribution in the direction of predictive trace analysis is the concept of re-
dundancy with respect to the detection of general concurrency access anomalies. We show a
trace redundancy theorem that specifies a redundancy criterion and the soundness guarantee for
reducing the size of the analyzed trace without impairing analysis results. This redundancy
theorem allows us to develop an efficient algorithm, TraceFilter, that automatically removes re-
dundant events from a trace for the predictive analysis of general concurrency access anomalies.
Empirical evidence on a set of popular concurrent benchmarks as well as large server applica-
tions shows that the scalability of PTA is improved by orders of magnitude. Our contribution
makes predictive trace analysis much more practical for real world concurrent systems.
Along the direction of trace simplification, this thesis contributes a redundancy criterion to char-
acterize redundant computations in the replay execution for reproducing bugs. This redundancy
criterion enables us to develop two effective techniques that remove the whole thread redun-
dancy and the partial redundancy, respectively, which significantly reduces the complexity of
the bug reproducing execution and shortens replay time. This thesis also contributes a theorem
of trace equivalence for the reduction of thread context switches in a reproducible buggy trace.
The theorem guides us to reason about the trace simplification problem completely offline. We
further contribute an efficient static algorithm, SimTrace, for trace simplification without any
dynamic re-execution to validate trace equivalence. SimTrace scales well to traces with more
than 1M events, making it attractive to practical use. We believe our contribution in the trace
simplification will greatly improve the effectiveness of concurrent program debugging based on
execution traces.
Finally, along the direction of data sharing reduction, we contribute a fundamental theorem
of privatizability of scheduler-oblivious programs, a vast category of concurrent programs that
always produce the same output given the same input. With the foundation of privatizability
property, we are able to reason about a subset of shared data accesses (i.e., read-after-write and
read-after-read) in scheduler-oblivious programs sequentially, benefiting many program debug-
ging and testing tasks. Moreover, those original shared accesses can be soundly privatized to
be local ones without changing program behavior. We further present the first known execution
privatization technique for scheduler-oblivious programs. Privatization brings two direct advan-
tages. First, scheduler-oblivious programs become more reliable because thread interleavings
on privatized accesses are eliminated. Second, performance improvement is achieved as the
original heap accesses become stack accesses after privatization.
Future Work In the multicore era, concurrent programs are destined to play a significant role
to fulfill the computational power promised by the hardware. With decades of practice, con-
current programs are becoming pervasive and much more complex. However, developing good
quality concurrent software remains highly challenging. In the future, we will focus on develop-
ing efficient and effective techniques for reducing the difficulty of concurrent programming and
improving the reliability of concurrent programs. We discuss two promising directions we are
currently working on.
Initially x == y == 0;

Thread T1                    Thread T2
 1: a = x                    10: b = y
 2: x = 1                    11: y = 2
 3: if (y > 0)               12: if (x > 0)
 4:   x = a + 1;             13:   x = b + 2;
 5:   y = a + 1;             14:   y = b + 2;
 6: else                     15: else
 7:   x = 0;                 16:   x = 1;
 8:   y = 0;                 17:   y = 1;
 9: assert(x == y);          18: assert(x == y);

Access vectors:
x: T1 T1 T2 T2 T1 T1
y: T2 T2 T1 T1 T2 T1

FIGURE 9.1: The program above crashes at line 9 following the interleaving 1-10-2-11-3-12-13-4-5-14-9. To reproduce the crash, LEAP [48] requires 12 synchronizations at runtime to record the thread access order information (the access vectors above) on the shared variables.
Relaxed Concurrency Bug Reproduction
Concurrency bug reproduction is critical but notoriously difficult due to nondeterminism. De-
terministic replay techniques faithfully capture and replay the shared memory dependencies
to enable the concurrency bug reproduction. However, for programs with heavy thread inter-
leavings and shared memory dependencies, large runtime overhead is still incurred due to the
cost of the required synchronizations on multicore processors. Relaxed concurrency
bug reproduction takes advantage of the observation that deterministic replay is a sufficient but
not necessary condition to reproduce the bug. For many concurrency bugs in practice, we do not
need a schedule faithful to the one that occurred in the bug exhibition run; a relaxed schedule is
often also able to reproduce the same bug.
We use a simple program in Figure 9.1 to illustrate the problem. There are two threads (T1
and T2) and two shared variables (x and y). Assuming a sequentially consistent memory model,
if the two threads execute concurrently following the interleaving represented by the line num-
bers 1-10-2-11-3-12-13-4-5-14-9, the assertion at line 9 will be violated. This bug is
difficult to reproduce because it may disappear nondeterministically following a different inter-
leaving. LEAP reproduces the bug by recording and enforcing the same thread access orders
local to the shared variables (shown in Figure 9.1). Because recording each access to x
or y requires one synchronization, and the failure schedule contains 12 accesses to x and y in
total, LEAP requires 12 synchronizations in the recording phase to reproduce this bug.
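To see where the 12 synchronizations come from, the local-order recording can be sketched as follows; this is a simplified rendering of the idea, not LEAP's actual implementation:

import java.util.ArrayList;
import java.util.List;

final class AccessVector {
    private final List<Long> order = new ArrayList<Long>();

    /** Called by the instrumented program at each access to the shared
     *  variable this vector belongs to; appending the thread ID must itself
     *  be synchronized, costing one synchronization per recorded access. */
    void record() {
        synchronized (this) {
            order.add(Thread.currentThread().getId());
        }
    }
}

// One AccessVector per shared variable; the failure run above performs
// 12 accesses to x and y in total, hence the 12 synchronizations.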
Now let us consider the simple program with the schedule 10-11-1-2-3-4-12-13-14-5-9
shown in Figure 9.2. Although this schedule is different from the original one, it is able to re-
produce the same bug. Moreover, it contains far fewer thread context switches
Initially x == y == 0;  (the same program as in Figure 9.1)

Schedule: 10->11->1->2->3->4->12->13->14->5->9

FIGURE 9.2: A schedule different from the original one, but able to reproduce the same bug. Moreover, this schedule has fewer (4) context switches than the original one (8).
compared to the original schedule, which makes it preferable for locating the cause of the bug.
We are investigating a new concurrency bug reproduction technique, CLAP, that does not record
any thread interleavings or checkpoint any program state at runtime, but rather computes
the schedule offline. All CLAP records at runtime is the thread execution path information.
No synchronization is needed. Since path profiling is lightweight (31% overhead [8]) regardless of the
thread interleavings, for real world programs with heavy shared memory dependencies, CLAP
is significantly more efficient than the state of the art shared memory dependency recorders.
Lightweight Deterministic Execution of Concurrent Programs
A major difficulty in concurrent programming is the non-determinism caused by either the
scheduling non-determinism or the timing issues on different cores. A promising solution to
this difficulty is to make concurrent programs less non-deterministic. In recent years, researchers
have pioneered the direction of deterministic multithreading [9, 10, 11, 14, 26] that aims to make
concurrent programs deterministic by default, by eliminating the sensitive thread interleavings
through runtime enforcement or type systems.
Deterministic multithreading has the nice property that given the same program input, every
execution always produces the same output. This property significantly alleviates the challenge
of writing and debugging concurrent programs. For example, bugs can be deterministically
reproduced. However, existing deterministic execution work faces a serious efficiency problem:
without special hardware, they incur substantial (up to 10x) runtime overhead. A lightweight
approach to deterministic execution is thus of significant importance.
We are aiming to improve the performance of existing techniques to make deterministic execu-
tion of concurrent programs more lightweight. Our main insight is that existing work suffers
from two drawbacks: (1) all thread communication points have to be serialized together, which
significantly reduces parallelism; (2) they require a user-specified execution time slice
(quantum) for each thread, and performance is sensitive to this specification. Quantum speci-
fication is currently purely heuristic, without a general optimization approach for all programs.
We can address these problems by allowing parallelism on different shared memory locations.
In addition, with static analysis, we can possibly optimize the quantum specification, enforcing
determinism without any synchronization specification.
Bibliography
[1] A. Mazurkiewicz. Trace theory. Advances in Petri Nets, 1987.
[2] Sarita V. Adve and Mark D. Hill. Weak ordering—a new definition. SIGARCH Comput.
Archit. News, 1990.
[3] Gautam Altekar and Ion Stoica. ODR: output deterministic replay for multicore debug-
ging. In SOSP, 2009.
[4] Cyrille Artho, Klaus Havelund, and Armin Biere. High-level data races. In NDDL/VVEIS,
2003.
[5] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael,
and Martin Vechev. Laws of order: expensive synchronization in concurrent algorithms
cannot be eliminated. In POPL, 2011.
[6] Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford. Efficient system-enforced
deterministic parallelism. In OSDI, 2010.
[7] David F. Bacon, Robert E. Strom, and Ashis Tarafdar. Guava: a dialect of java without
data races. In OOPSLA, 2000.
[8] Thomas Ball and James R. Larus. Efficient path profiling. In MICRO, 1996.
[9] Tom Bergan, Owen Anderson, Joseph Devietti, Luis Ceze, and Dan Grossman. Coredet:
a compiler and runtime system for deterministic multithreaded execution. In ASPLOS,
2010.
[10] Tom Bergan, Nicholas Hunt, Luis Ceze, and Steven D. Gribble. Deterministic process
groups in dos. In OSDI, 2010.
[11] Emery D. Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: safe multithreaded
programming for c/c++. In OOPSLA, 2009.
[12] Philip A. Bernstein and Nathan Goodman. Concurrency control in distributed database
systems. ACM Comput. Surv., 1981.
[13] Robert L. Bocchino, Jr., Vikram S. Adve, Sarita V. Adve, and Marc Snir. Parallel pro-
gramming must be deterministic by default. In HotPar, 2009.
[14] Robert L. Bocchino, Jr., Vikram S. Adve, Danny Dig, Sarita V. Adve, Stephen Heumann,
Rakesh Komuravelli, Jeffrey Overbey, Patrick Simmons, Hyojin Sung, and Mohsen Vak-
ilian. A type and effect system for deterministic parallel java. In OOPSLA, 2009.
[15] Eric Bodden and Klaus Havelund. Racer: effective race detection using aspectj. In ISSTA,
2008.
[16] Michael D. Bond, Graham Z. Baker, and Samuel Z. Guyer. Breadcrumbs: efficient con-
text sensitivity for dynamic bug detection analyses. In PLDI, 2010.
[17] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. Pacer: proportional
detection of data races. In PLDI, 2010.
[18] Sebastian Burckhardt, Rajeev Alur, and Milo M. K. Martin. Checkfence: checking con-
sistency of concurrent data types on relaxed memory models. In PLDI, 2007.
[19] Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, and Santosh Nagarakatte.
A randomized scheduler with probabilistic guarantees of finding bugs. In ASPLOS, 2010.
[20] Jacob Burnim and Koushik Sen. Asserting and checking determinism for multithreaded
programs. In ESEC/FSE, 2009.
[21] Feng Chen and Grigore Rosu. Parametric and sliced causality. In CAV, 2007.
[22] Feng Chen, Traian Florin Serbanuta, and Grigore Rosu. jpredictor: a predictive runtime
analysis tool for java. In ICSE, 2008.
[23] Jong-Deok Choi and Harini Srinivasan. Deterministic replay of java multithreaded appli-
cations. In SPDT, 1998.
[24] Jong-Deok Choi and Andreas Zeller. Isolating failure-inducing thread schedules. In
ISSTA, 2002.
[25] Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, and Junfeng Yang. Efficient
deterministic multithreading through schedule relaxation. In SOSP, 2011.
[26] Joseph Devietti, Brandon Lucia, Luis Ceze, and Mark Oskin. Dmp: deterministic shared
memory multi-processing. In ASPLOS, 2009.
[27] Joseph Devietti, Jacob Nelson, Tom Bergan, Luis Ceze, and Dan Grossman. Rcdc: a
relaxed consistency deterministic computer. In ASPLOS, 2011.
[28] George W. Dunlap, Dominic G. Lucchetti, Michael A. Fetterman, and Peter M. Chen.
Execution replay of multiprocessor virtual machines. In VEE, 2008.
[29] Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. Goldilocks: a race and transaction-aware
java runtime. In PLDI, 2007.
[30] Dawson Engler and Ken Ashcraft. Racerx: effective, static detection of race conditions
and deadlocks. In SOSP, 2003.
[31] Eitan Farchi, Yarden Nir, and Shmuel Ur. Concurrent bug patterns and how to test them.
IPDPS, 2003.
[32] Azadeh Farzan and P. Madhusudan. Causal atomicity. In CAV, 2006.
[33] Azadeh Farzan, P. Madhusudan, and Francesco Sorrentino. Meta-analysis for atomicity
violations under nested locking. In CAV, 2009.
[34] Cormac Flanagan and Stephen N Freund. Atomizer: a dynamic atomicity checker for
multithreaded programs. In POPL, 2004.
[35] Cormac Flanagan and Stephen N. Freund. Fasttrack: efficient and precise dynamic race
detection. In PLDI, 2009.
[36] Cormac Flanagan and Stephen N. Freund. Adversarial memory for detecting destructive
races. In PLDI, 2010.
[37] Cormac Flanagan, Stephen N. Freund, and Jaeheon Yi. Velodrome: a sound and complete
dynamic atomicity checker for multithreaded programs. In PLDI, 2008.
[38] Cormac Flanagan and Patrice Godefroid. Dynamic partial-order reduction for model
checking software. In POPL, 2005.
[39] A. Georges, M. Christiaens, M. Ronsse, and K. De Bosschere. Jarec: a portable record/re-
play environment for multi-threaded java applications. Software Practice and Experience,
2004.
[40] Dennis Giffhorn and Christian Hammer. Precise slicing of concurrent programs. Auto-
mated Software Engg., 2009.
[41] Alex Groce and Willem Visser. What went wrong: explaining counterexamples. In SPIN,
2003.
[42] Richard L. Halpert, Christopher J. F. Pickett, and Clark Verbrugge. Component-based
lock allocation. In PACT, 2007.
[43] Christian Hammer, Julian Dolby, Mandana Vaziri, and Frank Tip. Dynamic detection of
atomic-set-serializability violations. In ICSE, 2008.
[44] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for
lock-free data structures. In ISCA, 1993.
[45] Derek R. Hower and Mark D. Hill. Rerun: Exploiting episodes for lightweight memory
race recording. In ISCA, 2008.
[46] Jeff Huang. Lightweight concurrency crash reproduction without logging shared memory
dependencies and program states. In PLDI SRC, 2012.
[47] Jeff Huang, Peng Liu, and Charles Zhang. LEAP: A tool for lightweight deterministic
multi-processor replay of concurrent Java programs. In FSE Demo, 2010.
[48] Jeff Huang, Peng Liu, and Charles Zhang. LEAP: Lightweight deterministic multi-
processor replay of concurrent Java programs. In FSE, 2010.
[49] Jeff Huang and Charles Zhang. An efficient static trace simplification technique for de-
bugging concurrent programs. In SAS, 2011.
[50] Jeff Huang and Charles Zhang. PECAN: Persuasive Prediction of Concurrency Access
Anomalies. In ISSTA, 2011.
[51] Jeff Huang and Charles Zhang. Execution privatization of scheduler-oblivious concurrent
programs. In OOPSLA, 2012.
[52] Jeff Huang and Charles Zhang. Lean: Simplifying concurrency bug reproduction via
replay-supported execution reduction. In OOPSLA, 2012.
[53] Jeff Huang, Jinguo Zhou, and Charles Zhang. Scaling predictive analysis of concurrent
programs by removing trace redundancy. TOSEM, 22(1), 2012.
[54] Intel cilk plus language specification, 2010. http://software.intel.com/sites/products/cilk-
plus/cilk_plus_language_specification.pdf.
[55] Nicholas Jalbert and Koushik Sen. A trace simplification technique for effective debug-
ging of concurrent programs. In FSE, 2010.
[56] Guoliang Jin, Linhai Song, Wei Zhang, Shan Lu, and Ben Liblit. Automated atomicity-
violation fixing. In PLDI, 2011.
[57] Pallavi Joshi, Mayur Naik, Chang-Seo Park, and Koushik Sen. Calfuzzer: An extensible
active testing framework for concurrent programs. In CAV, 2009.
[58] Pallavi Joshi, Mayur Naik, Koushik Sen, and David Gay. An effective dynamic analysis
for detecting generalized deadlocks. In FSE, 2010.
[59] Pallavi Joshi, Chang-Seo Park, Koushik Sen, and Mayur Naik. A randomized dynamic
program analysis technique for detecting real deadlocks. In PLDI, 2009.
[60] Vineet Kahlon, Franjo Ivancic, and Aarti Gupta. Reasoning about threads communicating
via locks. In CAV, 2005.
[61] Richard M. Karp and Raymond E. Miller. Properties of a model for parallel computations:
Determinacy, termination, queueing. In SIAM Journal on Applied Mathematics, 1966.
[62] Nicholas Kidd, Thomas Reps, Julian Dolby, and Mandana Vaziri. Finding concurrency-
related bugs using random isolation. In VMCAI, 2009.
[63] Jens Krinke. Context-sensitive slicing of concurrent programs. In ESEC/FSE, 2003.
[64] Zhifeng Lai, S. C. Cheung, and W. K. Chan. Detecting atomic-set serializability viola-
tions in multithreaded programs through active randomized testing. In ICSE, 2010.
[65] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Trans. Comput., 1979.
[66] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. CACM,
1978.
[67] Doug Lea. The java.util.concurrent synchronizer framework. Sci. Comput. Program., 58,
2005.
[68] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant
replay. IEEE Transactions on Computers, 1987.
[69] Dongyoon Lee, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Chimera: hybrid
program analysis for determinism. In PLDI, 2012.
[70] Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Pe-
ter M. Chen, and Jason Flinn. Respec: efficient online multiprocessor replay via specula-
tion and external determinism. In ASPLOS, 2010.
[71] Joanne Lim. An engineering disaster: Therac-25. http://en.wikipedia.org/wiki/Therac-25,
1998.
[72] Richard J. Lipton. Reduction: a method of proving properties of parallel programs.
CACM, 1975.
[73] Shan Lu, Soyeon Park, Chongfeng Hu, Xiao Ma, Weihang Jiang, Zhenmin Li, Raluca A.
Popa, and Yuanyuan Zhou. Muvi: automatically inferring multi-variable access correla-
tions and detecting related semantic and concurrency bugs. In SOSP, 2007.
[74] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes: a
comprehensive study on real world concurrency bug characteristics. ASPLOS, 2008.
[75] Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. Avio: detecting atomicity viola-
tions via access interleaving invariants. In ASPLOS, 2006.
[76] Brandon Lucia, Joseph Devietti, Karin Strauss, and Luis Ceze. Atom-aid: Detecting and
surviving atomicity violations. In ISCA, 2008.
[77] Jeremy Manson, William Pugh, and Sarita V. Adve. The java memory model. In POPL,
2005.
[78] Dan Marino, Abhayendra Singh, Todd Millstein, Madan Musuvathi, and Satish
Narayanasamy. Drfx: A simple and efficient memory model for concurrent program-
ming languages. In PLDI, 2009.
[79] Daniel Marino, Madanlal Musuvathi, and Satish Narayanasamy. Literace: effective sam-
pling for lightweight data-race detection. In PLDI, 2009.
[80] Nicholas D. Matsakis and Thomas R. Gross. A time-aware type system for data-race
protection and guaranteed initialization. In OOPSLA, 2010.
[81] F. Mattern. Virtual time and global states of distributed systems. Workshop on Parallel
and Distributed Algorithms, 1988.
[82] Ghassan Misherghi and Zhendong Su. Hdd: hierarchical delta debugging. In ICSE, 2006.
[83] Pablo Montesinos, Luis Ceze, and Josep Torrellas. Delorean: Recording and determinis-
tically replaying shared-memory multi-processor execution efficiently. In ISCA, 2008.
[84] Pablo Montesinos, Matthew Hicks, Samuel T. King, and Josep Torrellas. Capo: a
software-hardware interface for practical deterministic multi-processor replay. In AS-
PLOS, 2009.
[85] Madan Musuvathi and Shaz Qadeer. Chess: systematic stress testing of concurrent soft-
ware. In Proceedings of the 16th international conference on Logic-based program syn-
thesis and transformation, 2007.
[86] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gerard Basler, Piramanayagam A.
Nainar, and Iulian Neamtiu. Finding and reproducing heisenbugs in concurrent programs.
In OSDI, 2008.
[87] Santosh Nagarakatte, Sebastian Burckhardt, Milo M.K. Martin, and Madanlal Musuvathi.
Multicore acceleration of priority-based schedulers for concurrency bug detection. In
PLDI, 2012.
[88] Mayur Naik and Alex Aiken. Conditional must not aliasing for static race detection. In
POPL, 2007.
[89] Mayur Naik, Alex Aiken, and John Whaley. Effective static race detection for java. In
PLDI, 2006.
[90] Mangala Gowri Nanda and S. Ramesh. Interprocedural slicing of multithreaded programs
with applications to java. ACM Trans. Program. Lang. Syst., 2006.
[91] Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder. Au-
tomatic logging of operating system effects to guide application-level architecture simu-
lation. In SIGMETRICS, 2006.
[92] Satish Narayanasamy, Gilles Pokam, and Brad Calder. Bugnet: Continuously recording
program execution for deterministic replay debugging. In ISCA, 2005.
[93] Satish Narayanasamy, Zhenghao Wang, Jordan Tigani, Andrew Edwards, and Brad
Calder. Automatically classifying benign and harmful data races using replay analy-
sis. In PLDI, 2007.
[94] R. H. B. Netzer and B. P. Miller. Improving the accuracy of data race detection. In
PPOPP, 1991.
[95] R. H. B. Netzer and B. P. Miller. What are race conditions: Some issues and formaliza-
tions. LOPLAS, 1992.
[96] Robert O’Callahan and Jong-Deok Choi. Hybrid dynamic data race detection. In PPoPP,
2003.
[97] Marek Olszewski, Jason Ansel, and Saman Amarasinghe. Kendo: efficient deterministic
multithreading in software. In ASPLOS, 2009.
[98] Chang-Seo Park and Koushik Sen. Randomized active atomicity violation detection in
concurrent programs. In FSE, 2008.
[99] Soyeon Park, Shan Lu, and Yuanyuan Zhou. Ctrigger: exposing atomicity violation bugs
from their hiding places. In ASPLOS, 2009.
[100] Soyeon Park, Yuanyuan Zhou, Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu H. Lee,
and Shan Lu. PRES: probabilistic replay with execution sketching on multi-processors.
In SOSP, 2009.
[101] Suzette Person, Matthew B. Dwyer, Sebastian Elbaum, and Corina S. Pasareanu. Differ-
ential symbolic execution. In FSE, 2008.
[102] Eli Pozniansky and Assaf Schuster. Efficient on-the-fly data race detection in multi-
threaded c++ programs. In PPoPP, 2003.
[103] Sriram Rajamani, G. Ramalingam, Venkatesh Prasad Ranganath, and Kapil Vaswani. Iso-
lator: dynamically ensuring isolation in concurrent programs. In ASPLOS, 2009.
[104] Venkatesh Prasad Ranganath and John Hatcliff. Slicing concurrent java programs using
indus and kaveri. Int. J. Softw. Tools Technol. Transf., 2007.
[105] Paruj Ratanaworabhan, Martin Burtscher, Darko Kirovski, Benjamin Zorn, Rahul Nagpal,
and Karthik Pattabiraman. Detecting and tolerating asymmetric races. In PPoPP, 2009.
[106] Michiel Ronsse and Koen De Bosschere. Recplay: a fully integrated practical record/re-
play system. TOCS, 1999.
[107] Michiel Ronsse, Koen De Bosschere, Mark Christiaens, Jacques Chassin de Kergom-
meaux, and Dieter Kranzlmuller. Record/replay for nondeterministic program executions.
CACM, 2003.
[108] Mark Russinovich and Bryce Cogswell. Replay for concurrent non-deterministic shared-
memory applications. In PLDI, 1996.
[109] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Ander-
son. Eraser: A dynamic data race detector for multi-threaded programs. TOCS, 1997.
[110] Koushik Sen. Race directed random testing of concurrent programs. In PLDI, 2008.
[111] Koushik Sen and Gul Agha. Detecting errors in multithreaded programs by generalized
predictive analysis of executions. In FMOODS, 2005.
[112] Traian Florin Serbanuta, Feng Chen, and Grigore Rosu. Maximal causal models for
sequentially consistent multithreaded systems. Technical report, University of Illinois,
2010.
[113] Ohad Shacham, Mooly Sagiv, and Assaf Schuster. Scaling model checking of data races
using dynamic information. In PPoPP, 2005.
[114] Nir Shavit and Dan Touitou. Software transactional memory. In PODC, 1995.
[115] Y. Shi, S. Park, Z. Yin, S. Lu, Y. Zhou, W. Chen, and W. Zheng. Do i use the wrong
definition?: Defuse: definition-use invariants for detecting concurrency and sequential
bugs. In OOPSLA, 2010.
[116] Nishant Sinha and Chao Wang. Staged concurrent program analysis. In FSE, 2010.
[117] A. Prasad Sistla and Patrice Godefroid. Symmetry and reduced symmetry in model
checking. ACM Trans. Program. Lang. Syst., 26(4), July 2004.
[118] Francesco Sorrentino, Azadeh Farzan, and P. Madhusudan. Penelope: Weaving threads
to expose atomicity violations. In FSE, 2010.
[119] John Steven, Pravir Ch, Bob Fleck, and Andy Podgurski. jrapture: A capture/replay tool
for observation-based testing. In ISSTA, 2000.
[120] William N. Sumner, Yunhui Zheng, Dasarath Weeratunge, and Xiangyu Zhang. Precise
calling context encoding. In ICSE, 2010.
[121] Sriraman Tallam, Chen Tian, and Rajiv Gupta. Dynamic slicing of multithreaded pro-
grams for race detection. In ICSM, pages 97–106, 2008.
[122] Sriraman Tallam, Chen Tian, Rajiv Gupta, and Xiangyu Zhang. Enabling tracing of long-
running multithreaded programs via dynamic execution reduction. In ISSTA, 2007.
[123] Chen Tian, Vijay Nagarajan, Rajiv Gupta, and Sriraman Tallam. Dynamic recognition of
synchronization operations for improved data race detection. In ISSTA, 2008.
[124] Software Bug Contributed to Blackout. Securityfocus.
http://www.securityfocus.com/news/8016, 2004.
[125] Mandana Vaziri, Frank Tip, and Julian Dolby. Associating synchronization constraints
with data in an object-oriented language. In POPL, 2006.
[126] Kaushik Veeraraghavan, Peter M. Chen, Jason Flinn, and Satish Narayanasamy. Detect-
ing and surviving data races using complementary schedules. In SOSP, 2011.
[127] Kaushik Veeraraghavan, Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M.
Chen, Jason Flinn, and Satish Narayanasamy. Doubleplay: parallelizing sequential log-
ging and replay. In ASPLOS, 2011.
[128] Kahlon Vineet and Chao Wang. Universal causality graphs: A precise happens-before
model for detecting bugs in concurrent programs. In CAV, 2010.
[129] Willem Visser, Corina S. Pasareanu, and Sarfraz Khurshid. Test input generation with
java pathfinder. In ISSTA, 2004.
[130] Chao Wang, Sudipta Kundu, Malay K. Ganai, and Aarti Gupta. Symbolic predictive
analysis for concurrent programs. In FM, 2009.
[131] Chao Wang, Rhishikesh Limaye, Malay K. Ganai, and Aarti Gupta. Trace-based symbolic
analysis for atomicity violations. In TACAS, 2010.
[132] Haixun Wang, Hao He, Jun Yang, Philip S. Yu, and Jeffrey Xu Yu. Dual labeling:
Answering graph reachability queries in constant time. In ICDE, 2006.
[133] Liqiang Wang and Scott D. Stoller. Accurate and efficient runtime detection of atomicity
errors in concurrent programs. In PPoPP, 2006.
[134] Liqiang Wang and Scott D. Stoller. Runtime analysis of atomicity for multithreaded
programs. TSE, 2006.
[135] Dasarath Weeratunge, Xiangyu Zhang, and Suresh Jaganathan. Accentuating the positive:
Atomicity inference and enforcement using correct executions. In OOPSLA, 2011.
[136] Dasarath Weeratunge, Xiangyu Zhang, and Suresh Jagannathan. Analyzing multicore
dumps to facilitate concurrency bug reproduction. In ASPLOS, 2010.
[137] Bin Xin, William N. Sumner, and Xiangyu Zhang. Efficient program execution indexing.
In PLDI, 2008.
[138] Min Xu, Rastislav Bodik, and Mark D. Hill. A ”flight data recorder” for enabling full-
system multiprocessor deterministic replay. In ISCA, 2003.
[139] Min Xu, Rastislav Bodík, and Mark D. Hill. A serializability violation detector for shared-
memory server programs. In PLDI, 2005.
[140] Zuoning Yin, Ding Yuan, Yuanyuan Zhou, Shankar Pasupathy, and Lakshmi Bairavasun-
daram. How do fixes become bugs? In ESEC/FSE, 2011.
[141] Jie Yu and Satish Narayanasamy. A case for an interleaving constrained shared-memory
multi-processor. In ISCA, 2009.
[142] Jie Yu and Satish Narayanasamy. Tolerating concurrency bugs using transactions as life-
guards. In MICRO, 2010.
[143] Cristian Zamfir and George Candea. Execution synthesis: a technique for automated
software debugging. In EuroSys, 2010.
[144] Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input.
TSE, 2002.
[145] Charles Zhang. Flexsync: An aspect-oriented approach to java synchronization. In ICSE,
2009.
[146] Charles Zhang and Hans-Arno Jacobsen. Externalizing java server concurrency with cal.
In ECOOP, 2008.