Post on 14-Apr-2017
Aspiring Minds
www.aspiringminds.com
Grading Programs using Machine Learning
Varun Aggarwal
Presented at KDD, 2014
Programming Assessments: Existing solutions
• Manual evaluation: Can’t scale; not standardized
• Test-case based evaluation:
• High false-positives – hard code, inadvertent errors
• High false-negatives – correct code but not efficient
• Similarity metric between control flow graphs, syntax trees:
• Need to handle multiple correct implementations – theoretically doesn’t fit in
• No mapping of metric to an objective feedback
Automatic grading of programs– Why?
• Widely performed - will help professors and TAs save a lot of time.
• Companies can recruit efficiently
• MOOCs - need automated open response assessments to really make it effective. True scaling of such system currently not achieved.
A model to predict the logical correctness of a program, given the control and data
dependencies it possesses.
Our Approach
Automata – Automatic program evaluation engine
Machine Learning based scoring engine
Evaluation of programming best practices
Asymptotic complexity evaluation
Lint-styled rule-based system to detect programs not following programming best
practices.
Measures the run-time of the code for various input sizes
and empirically derives the complexity.
Why programming modules give a better test-shortlist rate ?
• Programming has more predictive power in identifying good performers than Logical ability.
• Due to lower predictive power of Logical, a higher cut-off has to be applied to it as compared to Programming to get the same organizational efficiency.
• Higher the Programming capability of the person, requirement on Logical score is lesser.
• Given the person is lower than a given score on Programming, even having a higher logical ability does not help.
Evaluation Rubric
ML based scoring
Understanding the human process
Problem and Language independent
Features
Machine learning model
Ungraded programs
Graded programs
Predicted grades
5
Evaluation Rubric
Our Approach
Understanding the human process
Problem and Language independent
Features
Machine learning model
Ungraded programs
Graded programs
Predicted grades
1
Evaluation Rubric
Our Approach
Understanding the human process
Problem and Language independent
Features
Machine learning model
Ungraded programs
Graded programs
Predicted grades
2
Evaluation Rubric
Our Approach
Understanding the human process
Problem and Language independent
Features
Machine learning model
Ungraded programs
Graded programs
Predicted grades
3
void print_1(int N){ for(i =1 ; i<=N; i++){ print newline;
count = i; for(j=0; j<i; j++) print count;
count++; }}
12 33 4 54 5 6 7
OBJECTIVE To print the pattern of integers
An implementation
1. Are there loops? Are there print statements?
3. Is the conditional in the inner loop dependent on - a variable modified in the outer loop? - a variable used in the conditional of the outer
loop?
What does a grader look for?
2. Is there a nested-loop structure?
Grammar for expressing features• Simple features
• Keywords and Tokens (Counts):
• Tokens like for, if, return, break; function calls like printf, strrev, strcat; declarations like int, char• Operators like various arithmetic, logical, relational operators used• Character constants like ‘\0’, ‘ ’, ‘65’, ‘96’
• Capturing logical constructs (Interactions)
• Control flow structure
• Data-dependencies
• Data-dependencies in context of control-flow
CONTROL FEATURES – COUNTS
Counts of control-related keywords/tokens Ex. count(for) = 2
count(for-in-for) = 1 count(while) = 0
Control-context of these keywords- The Print command as loop(loop(print)))
for(i =1 ; i<=N; i++){
print newline;
count = i;
CONTROL FLOW GRAPH
i = 1
i <= N
i++
j < i
count = ij = 0
print(count)count++
j++
END
Loop 1
Loop 2
Parent scope
for(j=0; j<i; j++)
print count; count++;
void print(int N){
}
}
TARGET PROGRAM
DATA OPERATION FEATURES IN CONTROL-CONTEXT
Counts of data-related tokens in context of the control structureEx. count(block1 :loop(loop(++))) = 2
count(block1 :loop(loop_cond(<))) = 1
Capture control-context of data-dependencies in groups of expressions
i++ j < i : var (i) related to var (j) : appearing in a loop(loop_cond) previously incremented : appearing in a loop The relation and the increment happen in the same block
Loop 1
Loop 2
Loop 1
Loop 1
Loop 2
Loop 1
Loop 2
Loop 1
i = 1
i <= N
print(count)
count = ii++
j < i
count = 0
count++
j= 0
j++
Parent scope Parent scope
Loop 1
Loop 1
Loop 2
CONTROL FLOW INFORMATION ANNOTATED IN A-D-D GRAPH
Loop 1
• Deployed Automata in a major product-based company’s recruitment
• Analyzed the performance improvement in using Automata over test-case pass based selection criterion
• 22.6% candidates who were not being shortlisted through test-case pass were now shortlisted using Automata.
Case study
Experimental Results
Sort Problem
Doing it the one-class way!
PROBLEM All features Basic featuresMean Min25 Mean Min25
1 0.57 0.61 0.52 0.562 0.80 0.83 0.72 0.75
3 0.75 0.81 0.59 0.734 0.81 0.81 0.75 0.755 0.68 0.69 0.55 0.61
Betters test-case in all, but one
How good is the final ML-based score?
Validation Correlation >= 0.79Matches Inter-rater Correlation between two human raters
PROBLEM # of features Cross-val correl Train correl Validation correl Test Case Score
1 80 0.61 0.85 0.79 0.54
2 68 0.77 0.93 0.91 0.80
3 193 0.91 0.98 0.90 0.64
4 66 0.90 0.94 0.90 0.80
5 87 0.81 0.92 0.84 0.84
Can we get insight? • The most contributing feature for Find Digit problem -
int findDigit(int N, int digit){
…
…
LOOP (N != <constant value>){
…
N = N / <constant value>
…
}
}
Features for FindDigit problem analyzed. Given a multi-digit number and a digit, one has to find the number of times the digit appears in the number
Yes, we can! • The most contributing feature for Find Digit problem -
int findDigit(int N, int digit){
…
LOOP (N != <constant value>){
…
N = N / <constant value>
…
}
…
}
int findDigit(int N, int digit){
...
while(N != 0){
d = N%10;
if(d == digit)
...
N = N / 10;
}
}
Evaluation Rubric
Score Interpretation5 Completely correct and efficient
An efficient implementation of the problem using right control structures and data-dependencies.
4 Correct with some silly errorsCorrect control structures and closely matching data-dependencies. Some silly mistakes fail the code to pass test-cases.
3 Inconsistent logical structuresRight control structures start exist with few correct data dependencies
2 Emerging basic structuresAppropriate keywords and tokens present, showing some understanding of the problem
1 Gibberish codeSeemingly unrelated to problem at hand.
Automata – Sample report
Candidate’s source codeFeedback on programming best practices
Asymptotic complexity of the candidate’s solution
Test case pass/fail information
Problem summary
Do our fancy features help?
Control and Data dependency features add around 0.15 correlation points above token information.
PROBLEM Type of feature # of features Cross-val correl Train correl Validation correl
1All, w/o test case 35 0.57 0.72 0.56
Basic 60 0.62 0.87 0.41
2All, w/o test case 80 0.81 0.99 0.80
Basic 26 0.59 0.72 0.67
3All, w/o test case 190 0.87 0.97 0.90
Basic 26 0.74 0.89 0.74
4All, w/o test case 134 0.85 0.91 0.82
Basic 35 0.83 0.88 0.69
5All, w/o test case 166 0.66 0.81 0.64
Basic 40 0.61 0.78 0.61
Conclusion
• We propose the first machine learning based approach to automatically grade programs
• An innovative feature grammar is proposed which matches human intuition of grading programs.
• Models built for sample problems show promising results.
• We propose and demonstrate machine learning techniques to lower the need of human-graded data to build models.