A Human Study of Patch Maintainability
A Human Study of Patch Maintainability
Zachary P. Fry, Bryan Landau, Westley Weimer
University of Virginia
{zpf5a,bal2ag,weimer}@virginia.edu
Bug Fixing

Fixing bugs manually is difficult and costly. Recent techniques explore automated patches:
- Evolutionary techniques – GenProg
- Dynamic modification – ClearView
- Enforcement of pre/post-conditions – AutoFix-E
- Program transformation via static analysis – AFix

While these techniques save developers time, there is some concern as to whether the patches produced are human-understandable and maintainable in the long run.
Questions Moving Forward

- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?
Measuring Quality and Maintainability

- Functional Quality (✓) – Does the implementation match the specification? Does the code execute "correctly"?
- Non-functional Quality (?) – Is the code understandable to humans? How difficult is it to understand and alter the code in the future?
Software Functional Quality

- Perfect: implementation matches specification
- Direct software quality metrics: testing, defect density, mean time to failure
- Indirect software quality metrics: cyclomatic complexity, coupling and cohesion (CK metrics), software readability (see the sketch below)
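A toy illustration (not from the talk) of one such indirect metric: cyclomatic complexity counts the linearly independent paths through a function, computed as the number of decision points plus one.

    /* Toy example: two decision points (the loop condition and the if),
     * so the cyclomatic complexity of this function is 2 + 1 = 3. */
    int count_positive(const int *a, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++) {  /* decision point 1 */
            if (a[i] > 0) {            /* decision point 2 */
                count++;
            }
        }
        return count;
    }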
Software Non-functional Quality

- Maintainability: human-centric factors affecting the ease with which bugs can be fixed and features can be added
- Broadly related to the "understandability" of code
- Unlike functional correctness, not easy to measure concretely with heuristics

Automatically-generated patches have been shown to be of high quality functionally – what about non-functionally?
Patch Maintainability Defined

Rather than using an approximation to measure understandability, we directly measure humans' abilities to perform maintenance tasks.

Task: ask human participants questions that require them to read and understand a piece of code, and measure the effort required to provide correct answers.

The goal is to simulate the maintenance process as closely as possible.
PHP Bug #54454

Title: "substr_compare incorrectly reports equality in some cases"

Bug description: "if main_str is shorter than str, substr_compare [mistakenly] checks only up to the length of main_str"

substr_compare("cat", "catapult") = true
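The defect can be sketched in a few lines (an illustrative reimplementation, not the actual PHP source): capping the comparison length at the shorter main_str makes "cat" compare equal to the matching prefix of "catapult".

    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch of the reported bug, not PHP's implementation:
     * the comparison length is capped by main_str alone. */
    static int buggy_substr_compare(const char *main_str, const char *str)
    {
        size_t cmp_len = strlen(main_str);            /* bug: ignores strlen(str) */
        return strncmp(main_str, str, cmp_len) == 0;  /* compares only "cat" */
    }

    int main(void)
    {
        printf("%d\n", buggy_substr_compare("cat", "catapult")); /* prints 1 (true) */
        return 0;
    }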
Motivating Example

if (offset >= s1_len) {
    php_error_docref(NULL TSRMLS_CC, E_WARNING,
        "The start position cannot exceed string length");
    RETURN_FALSE;
}
if (len > s1_len - offset) {
    len = s1_len - offset;
}
cmp_len = (uint) (len ? len : MAX(s2_len, (s1_len - offset)));
Motivating Example

len--;
if (mode & 2) {
    for (i = len - 1; i >= 0; i--) {
        if (mask[(unsigned char)c[i]]) {
            len--;
        } else {
            break;
        }
    }
}
if (return_value) {
    RETVAL_STRINGL(c, len, 1);
} else {
…
Automatic Documentation

- Intuition suggests that patches augmented with documentation are more maintainable
- Human patches can contain comments with hints as to the developer's intention when changing code
- Automatic approaches cannot easily reason about why a change is made, but they can describe what was changed

Automatically synthesized documentation: DeltaDoc (Buse et al., ASE 2010)
- Measures semantic program changes
- Outputs natural-language descriptions of changes
Automatic Documentation

if (!con->conditional_is_valid[dc->comp]) {
    if (con->conf.log_condition_handling) {
        TRACE("cond[%d] is valid: %d", dc->comp,
              con->conditional_is_valid[dc->comp]);
    }
    /* If not con->conditional_is_valid[dc->comp]
       No longer return COND_RESULT_UNSET; */
    return COND_RESULT_UNSET;
}
/* pass the rules */
switch (dc->comp) {
case COMP_HTTP_HOST: {
    char *ck_colon = NULL, *val_colon = NULL;
…
Questions Moving Forward

- How can we concretely measure these notions of human understandability and future maintainability?
- Can we automatically augment machine-generated patches to improve maintainability?
- In practice, are machine-generated patches as maintainable as human-generated patches?
Evaluation

Focused research questions to answer:
1) How do different types of patches affect maintainability?
2) Which source code characteristics are predictive of our maintainability measurements?
3) Do participants' intuitions about maintainability and its causes agree with measured maintainability?

To answer these questions directly, we performed a human study using over 150 participants with real patches from existing systems.
Experiment – Subject Patches

We used patches from six benchmarks over a variety of subject domains:

Program     LOC        Defects   Patches
gzip        491,083    1         2
libtiff     77,258     7         14
lighttpd    61,528     3         4
php         1,046,421  9         17
python      407,917    1         2
wireshark   2,812,340  11        11
Total       4,896,547  32        50
Experiment – Subject Patches

- Original – the defective, un-patched code, used as a baseline for measuring relative changes
- Human-Accepted – human-created patches that have not been reverted to date
- Human-Reverted – human-created patches that were later reverted
- Machine – automatically-generated patches created by the GenProg tool
- Machine+Doc – the same patches as above, but augmented with automatically synthesized documentation
Experiment – Maintenance Task

Sillito et al. – "Questions programmers ask during software evolution tasks" – recorded and categorized the questions developers actually asked while performing real maintenance tasks.

We used questions of the form: "What is the value of the variable 'y' on line X?"

Not: "Does this type have any siblings in the type hierarchy?"
Human Study

…
15 if (dc->prev) {
16     if (con->conf.log_condition_handling) {
17         log_error_write(srv, __FILE__, __LINE__, "sb", "go prev", dc->prev->key);
18     }
19     /* make sure prev is checked first */
20     config_check_cond_cached(srv, con, dc->prev);
21     /* one of prev set me to FALSE */
22     if (COND_RESULT_FALSE == con->cond_cache[dc->context_ndx].result) {
23         return COND_RESULT_FALSE;
24     }
25
26 }
27
28 if (!con->conditional_is_valid[dc->comp]) {
29     if (con->conf.log_condition_handling) {
30         TRACE("cond[%d] is valid: %d", dc->comp, con->conditional_is_valid[dc->comp]);
31     }
32
33     return COND_RESULT_UNSET;
34 }
…
Human Study – Question Presentation

Question: What is the value of the variable "con->conditional_is_valid[dc->comp]" on line 33? (Recall, you can use inequality symbols in your answer.)

Answer to the Question Above:
The same excerpt and question are then shown with a sample correct answer filled in:

Answer to the Question Above: False

(Line 33 is reached only when the guard on line 28, !con->conditional_is_valid[dc->comp], is satisfied, so the variable must be false at that point.)
Evaluation Metrics

- Correctness – is the right answer reported?
- Time – what is the "maintenance effort" associated with understanding this code?

We favor correctness over time: participants were instructed to spend as much time as they deemed necessary to answer the questions correctly.

The percentages of correct answers did not differ across patch types in a statistically significant way. We therefore focus on time, as it is an analog for the software engineering effort associated with program understanding.
Type of Patch vs. Maintainability

Effort = the average number of minutes it took participants to report a correct answer, over all patches of a given type, relative to the original code.
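One plausible reading of this definition as a formula (an assumption; the slide does not specify whether "relative to the original code" means a ratio or a difference):

$$\mathrm{Effort}(T) = \frac{\bar{t}_T}{\bar{t}_{\mathrm{Original}}}, \qquad \bar{t}_T = \frac{1}{|Q_T|} \sum_{q \in Q_T} t_q$$

where $Q_T$ is the set of correctly answered questions over patches of type $T$ and $t_q$ is the number of minutes spent on question $q$.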
Characteristics of Maintainability

- We measured various code features for all patches used in the human study
- Using a logistic regression model, we can predict human accuracy when answering the questions in the study 73.16% of the time
- A Principal Component Analysis shows that 17 features account for 90% of the variance in the data
- Modeling maintainability is a complex problem
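A hedged sketch of how such a fitted logistic model turns measured code features into a predicted probability of a correct answer (the weights, bias, and feature vector here are placeholders, not the study's fitted values):

    #include <math.h>

    /* Logistic regression scoring: a linear combination of feature values,
     * passed through the sigmoid, yields a probability in (0, 1).
     * Weights and bias are placeholders, not the study's fitted model. */
    double predict_correct(const double *features, const double *weights,
                           double bias, int n_features)
    {
        double z = bias;
        for (int i = 0; i < n_features; i++)
            z += weights[i] * features[i];
        return 1.0 / (1.0 + exp(-z));
    }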
Characteristics of Maintainability

Code Feature                                            Predictive Power
Ratio of variable uses per assignment                   0.178
Code readability                                        0.157
Ratio of variables declared out of scope vs. in scope   0.146
Number of total tokens                                  0.097
Number of non-whitespace characters                     0.090
Number of macro uses                                    0.080
Average token length                                    0.078
Average line length                                     0.072
Number of conditionals                                  0.070
Number of variable declarations or assignments          0.056
Maximum conditional clauses on any path                 0.055
Number of blank lines                                   0.054
Human Intuition vs. Measurement

After completing the study, participants were asked to report which code features they thought increased maintainability the most:

Human-Reported Feature                      Votes   Predictive Power
Descriptive variable names                  35      *0.000
Clear whitespace and indentation            25      *0.003
Presence of comments                        25      0.022
Shorter function                            8       *0.000
Presence of nested conditionals             8       0.033
Presence of compiler directives / macros    7       0.080
Presence of global variables                5       0.146
Use of goto statements                      5       *0.000
Lack of conditional complexity              5       0.055
Uniform use and format of curly braces      5       0.014
Conclusions

From a human study involving over 150 participants and patches fixing high-priority defects in real systems, we conclude:

- Humans take less time, on average, to answer questions about machine-generated patches augmented with automated documentation than about human-created patches; this validates the possibility of using automatic patch generation techniques in practice.
- There is a strong disparity between human intuitions about maintainability and our measurements; further study in this area is merited.
Questions?
Modified DeltaDoc

We modify DeltaDoc in the following ways:
- Include all changes, regardless of the length of the output
- Ignore all internal optimizations that lead to loss of information (e.g., ignore suspected unrelated statements)
- Include all relevant programmatic information (e.g., function arguments)
- Ignore all high-level output optimizations
- Favor comprehensive explanations over brevity
- Insert the output directly above patches, as comments
Experiment – Participants

Over 150 participants:
- 27 fourth-year undergraduate CS students
- 14 CS graduate students
- 116 Mechanical Turk internet participants

Accuracy cutoff imposed: ensuring that people don't try to "game the system" requires special consideration. Any participant who failed to answer all questions, or who scored more than one standard deviation below the average undergraduate student's score, was removed.
Experiment – Questions

- What conditions must hold to always reach line X during normal execution?
- What is the value of the variable "y" on line X?
- What conditions must be true for the function "z()" to be called on line X?
- At line X, which variables must be in scope?
- Given the following values for relevant variables, what lines are executed beginning at line X? (e.g., Y=5 && Z=True; illustrated below)
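A hypothetical snippet (invented for illustration, not taken from the study materials) showing that last question type: with y = 5 and z true, the condition holds, so only the then-branch executes.

    #include <stdio.h>

    int main(void)
    {
        int y = 5, z = 1;          /* the "relevant variable" values from the question */
        if (y > 3 && z) {
            puts("common case");   /* executed: 5 > 3 and z is true */
        } else {
            puts("rare case");     /* skipped under these values */
        }
        return 0;
    }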