Mike Joy 23 November 2010 Approaches to Detection of Source Code Plagiarism amongst Students.
Mike Joy
23 November 2010
Approaches to Detection of Source Code Plagiarism
amongst Students
Overview of Talk
1) Process for detecting plagiarism
2) Technologies
3) Establishing the facts
Part 1 – Process
Four Stages
Collection
Detection
Confirmation
Investigation
From Culwin and Lancaster (2002)
Stage 1: Collection
• Get all documents together online
– so they can be processed
– formats?
– security?
• BOSS (Warwick)
• Coursemaster (Nottingham)
• Managed Learning Environment
Stage 2: Detection
• Compare with other submissions
• Compare with external documents
– essay-based assignments?
• We’ll come back to this later
– Technology
Stage 3: Confirmation
Software tool says “A and B similar”
– Are they?
Never rely on a computer program!
– Requires expert human judgement
– Evidence must be compelling
– Might go to court
Stage 4: Investigation
A from B, or B from A, or joint work?
If A from B, did B know?
– open networked file / printer output
Did the culprit/s understand?
Was code written externally?
University processes must be followed
Establishing the facts
Why is this Interesting?
How do you compare two programs?
– This is an algorithm question
– Stages 2 and 3: detection and confirmation
How do you use the results (of a comparison) to educate students?
– This is a pedagogic question
– Stage 4, and before stage 1!
Digression: Essays
Plagiarism in essays is easier to detect
Lots of “tricks” a lecturer can use!
– Google search on phrases
– Abnormal style
– ... etc.
Software tools
– Let's have a look ...
Turnitin
Part 2 – Technologies
Collection
Detection
Confirmation
Investigation
Why not use Turnitin?
It won’t work!
– String matching algorithm inappropriate
– Database does not contain code
Commercial involvement
– E.g. Black Duck Software
/* Program A */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */
    public static void main(String[] args) { calculate(); }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    }
    public static void calculate() { /* ... */ }
    /* ... */
}
/* Program B */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    }
    public static void findSunsetTime() { /* ... */ }
    /* ... */
}
Is This Plagiarism?
• Is Program B derived from Program A in a manner which is “plagiarism”?
• Maybe
– Structure is similar, with cosmetic changes
– But the algorithm is public domain
– Maybe Program B derived from Program A, maybe the other way round
History (1)
Attribute counting systems (Halstead, 1972; Ottenstein, 1976):
• Numbers of unique operators
• Numbers of unique operands
• Total numbers of operator occurrences
• Total numbers of operand occurrences
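The four counts above can be illustrated with a minimal sketch. It is not Halstead's or Ottenstein's actual system: the class name `AttributeCounter`, the method `profile`, and the crude regex lexer are assumptions made for illustration. In particular, every word or number is treated as an operand (keywords included) and every run of arithmetic or comparison symbols as an operator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeCounter {
    // Hypothetical, simplified lexer: word or number = operand, symbol run = operator.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_][A-Za-z_0-9]*|\\d+(?:\\.\\d+)?|[-+*/=<>!&|]+");

    /** Returns {unique operators, unique operands, total operators, total operands}. */
    public static int[] profile(String source) {
        Set<String> operators = new HashSet<>();
        Set<String> operands = new HashSet<>();
        int totalOperators = 0;
        int totalOperands = 0;
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (Character.isLetterOrDigit(first) || first == '_') {
                operands.add(t);
                totalOperands++;
            } else {
                operators.add(t);
                totalOperators++;
            }
        }
        return new int[] {operators.size(), operands.size(), totalOperators, totalOperands};
    }

    public static void main(String[] args) {
        // Two statements that differ only in identifier names.
        int[] a = profile("ans = ans * j + 1;");
        int[] b = profile("result = result * f + 1;");
        System.out.println(Arrays.toString(a) + " vs " + Arrays.toString(b));
    }
}
```

Renaming identifiers leaves the profile unchanged, which is why the two statements in `main` get identical counts. The same property also means that unrelated programs of similar size can produce similar profiles.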
History (2)
Structure-based systems:
– Each program is converted into token strings (or something similar)
– Token streams are compared for determining similar source-code fragments
– Tools: YAP3, JPlag, Plague, MOSS, and Sherlock
Example (tokenwise equivalent)
int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) { ans *= j; }
    return ans;
}
Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--) result *= f;
    return result;
}
Example (tokenised)
type name(type name) start
type name=number
loop (type name=number name compare number operation name) start
name operation name
end
return name
end
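A tokeniser of this kind might be sketched as follows. This is an illustration only, not the tokeniser of any tool named in this talk: the class `Tokeniser`, the small fixed list of type names, and the choice to drop semicolons are assumptions. The two inputs in `main` are modelled on the example above, with braces added around the second loop body so that the two token streams match exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokeniser {
    // Assumption: a small fixed set of type names is enough for the example.
    private static final Set<String> TYPES = new HashSet<>(
        Arrays.asList("int", "float", "double", "long", "String", "Integer"));

    // Words, numbers, two-character operators first, then single symbols.
    private static final Pattern LEXEME = Pattern.compile(
        "[A-Za-z_][A-Za-z_0-9]*|\\d+(?:\\.\\d+)?|<=|>=|==|!=|\\+\\+|--|[-+*/]=|[-+*/<>=(){}]");

    /** Maps source text to abstract tokens; identifier names and literal values are discarded. */
    public static List<String> tokenise(String source) {
        List<String> out = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String t = m.group();
            char first = t.charAt(0);
            if (TYPES.contains(t)) out.add("type");
            else if (t.equals("for") || t.equals("while")) out.add("loop");
            else if (t.equals("return")) out.add("return");
            else if (Character.isLetter(first) || first == '_') out.add("name");
            else if (Character.isDigit(first)) out.add("number");
            else if (t.equals("{")) out.add("start");
            else if (t.equals("}")) out.add("end");
            else if (t.matches("<=|>=|==|!=|<|>")) out.add("compare");
            else if (t.equals("=") || t.equals("(") || t.equals(")")) out.add(t);
            else out.add("operation");
        }
        return out;
    }

    public static void main(String[] args) {
        String a = "int calculate(String arg) { int ans=0; "
                 + "for (int j=1; j<=100; j++) { ans *= j; } return ans; }";
        String b = "Integer doit(String v) { float result=0.0; "
                 + "for (float f=100.0; f > 0.0; f--) { result *= f; } return result; }";
        System.out.println(String.join(" ", tokenise(a)));
        System.out.println(tokenise(a).equals(tokenise(b)));
    }
}
```

Because names, literal values, and the specific operator used are all discarded, renaming variables or swapping `++` for `--` does not change the token stream.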
Detectors
MOSS (Alex Aiken: Berkeley/Stanford, USA, 1994)
JPlag (Guido Malpohl: Karlsruhe, Germany)
– Java only
– Programs must compile?
Sherlock (Warwick, UK) (Joy and Luck, 1999)
MOSS
MOSS determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs
Web-based: theory.stanford.edu/~aiken/moss/
“Winnowing” (Schleimer et al., 2003)
– Local document fingerprinting algorithm
– Efficiency proven (within 33% of the lower bound)
– Guarantees detection of matches longer than a certain threshold
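The idea behind winnowing can be sketched as below. This is a simplified illustration and not MOSS's implementation: the class `Winnowing`, the use of `String.hashCode()` in place of a rolling hash, and the parameter values are assumptions. Every k-gram of the text is hashed, a window of w consecutive hashes slides over the sequence, and the minimum hash of each window (rightmost on ties) is kept as a fingerprint. Any shared substring of at least w + k − 1 characters then contributes at least one common fingerprint.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

public class Winnowing {
    /** Returns the selected fingerprints as (k-gram position -> k-gram hash). */
    public static SortedMap<Integer, Integer> fingerprints(String text, int k, int w) {
        SortedMap<Integer, Integer> selected = new TreeMap<>();
        int n = text.length() - k + 1;               // number of k-grams
        if (n <= 0) return selected;
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            // Assumption: a plain hash per k-gram; real tools use a rolling hash.
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        int windows = Math.max(1, n - w + 1);        // short texts form a single window
        for (int start = 0; start < windows; start++) {
            int end = Math.min(start + w, n);
            int min = start;
            for (int j = start + 1; j < end; j++) {
                if (hashes[j] <= hashes[min]) min = j;   // ties: take the rightmost
            }
            selected.put(min, hashes[min]);          // the same position is recorded only once
        }
        return selected;
    }

    /** Hash values selected as fingerprints in both texts. */
    public static Set<Integer> shared(String a, String b, int k, int w) {
        Set<Integer> common = new HashSet<>(fingerprints(a, k, w).values());
        common.retainAll(fingerprints(b, k, w).values());
        return common;
    }

    public static void main(String[] args) {
        String common = "for(inti=0;i<n;i++)sum+=a[i];";
        String a = "QQQQ" + common + "ZZZZ";
        String b = "1234567" + common + "89";
        System.out.println(shared(a, b, 5, 4).size() + " shared fingerprint(s)");
    }
}
```

In practice the input would first be normalised (for example, whitespace removed), so that formatting changes do not alter the k-grams.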
JPlag
JPlag currently supports Java, C#, C, C++, Scheme, and natural language text
Web-based: www.ipd.uni-karlsruhe.de/jplag
Algorithm: Parse programs and tokenise, then pairwise compare using “Greedy String Tiling” (Prechelt et al., 2002)
– maximises percentage of common token strings
– worst case Θ(n³), average case linear
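A minimal sketch of the basic Greedy String Tiling loop follows. It is not JPlag's code: the class `GreedyStringTiling`, the method names, and the similarity formula (twice the covered tokens divided by the combined length) are assumptions for illustration, and the hashing optimisations that give the linear average case are omitted. Each round finds the longest matches between unmarked tokens, marks them as tiles, and repeats until no match longer than the minimum length remains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GreedyStringTiling {
    /** Number of tokens covered by non-overlapping tiles of length >= minMatch (minMatch >= 1). */
    public static int coverage(List<String> a, List<String> b, int minMatch) {
        boolean[] markedA = new boolean[a.size()];
        boolean[] markedB = new boolean[b.size()];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>();       // {startA, startB, length}
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int len = 0;
                    while (i + len < a.size() && j + len < b.size()
                            && !markedA[i + len] && !markedB[j + len]
                            && a.get(i + len).equals(b.get(j + len))) {
                        len++;
                    }
                    if (len > maxMatch) {                  // longer match: discard shorter ones
                        matches.clear();
                        maxMatch = len;
                    }
                    if (len == maxMatch) matches.add(new int[] {i, j, len});
                }
            }
            for (int[] m : matches) {
                boolean free = true;                       // skip matches occluded by a new tile
                for (int t = 0; t < m[2]; t++) {
                    if (markedA[m[0] + t] || markedB[m[1] + t]) { free = false; break; }
                }
                if (!free) continue;
                for (int t = 0; t < m[2]; t++) {
                    markedA[m[0] + t] = true;
                    markedB[m[1] + t] = true;
                }
                covered += m[2];
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    /** Assumed similarity measure: fraction of all tokens that lie inside tiles. */
    public static double similarity(List<String> a, List<String> b, int minMatch) {
        if (a.isEmpty() && b.isEmpty()) return 0.0;
        return 2.0 * coverage(a, b, minMatch) / (a.size() + b.size());
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("a", "b", "c", "d");
        List<String> b = Arrays.asList("c", "d", "a", "b");   // same blocks, reordered
        System.out.println(similarity(a, b, 2));
    }
}
```

Because tiles may be matched in any order, moving blocks of code around does not reduce the coverage, as the reordered input in `main` shows.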
Sherlock
Developed at the University of Warwick Department of Computer Science
Open-Source application coded in Java
Sherlock detects plagiarism on source-code and natural language assignments
BOSS home page: www.boss.org.uk
Preprocesses code (not a full parse!) then simple string comparison
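The "preprocess, then compare strings" approach can be illustrated with a rough sketch. This is not Sherlock's source code: the class `PreprocessCompare`, the regex-based comment stripping, and the shared-line measure are assumptions chosen to keep the example short. Comments and whitespace are removed without parsing the program, and the surviving lines are compared as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PreprocessCompare {
    /** Strips comments and all whitespace; returns the non-empty lines that remain. */
    public static List<String> normalise(String source) {
        // No parsing: comment markers inside string literals are mishandled (a known limitation).
        String noComments = source
            .replaceAll("(?s)/\\*.*?\\*/", " ")    // block comments
            .replaceAll("//[^\\n]*", " ");         // line comments
        List<String> lines = new ArrayList<>();
        for (String line : noComments.split("\\R")) {
            String squeezed = line.replaceAll("\\s+", "");
            if (!squeezed.isEmpty()) lines.add(squeezed);
        }
        return lines;
    }

    /** Fraction of the normalised lines of a that also occur in b. */
    public static double sharedLineFraction(String a, String b) {
        List<String> linesA = normalise(a);
        Set<String> linesB = new HashSet<>(normalise(b));
        if (linesA.isEmpty()) return 0.0;
        int shared = 0;
        for (String line : linesA) {
            if (linesB.contains(line)) shared++;
        }
        return (double) shared / linesA.size();
    }

    public static void main(String[] args) {
        String a = "int x = 1; // set x\nreturn x;";
        String b = "/* header */\nint  x=1;\n\nreturn   x;";
        System.out.println(sharedLineFraction(a, b));
    }
}
```

Skipping the full parse means faulty or incomplete submissions can still be compared, at the cost of being fooled by changes a tokeniser would see through, such as renamed identifiers.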
CodeMatch
Commercial product
– www.safe-corp.biz
– exact algorithm not published
– patent pending?
Free academic use for small data sets
CodeMatch – Algorithm
1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions
2) Extract comments, and compare
3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar”
4) Combine (1), (2) and (3) to give correlation score
Example of Identical “Instruction Sequences”
/* File 1*/
for (int i=1; i<10; i++) { if (a==10) print("done"); else a++; }
/* File 2*/
for (int x=100; x > 0; x--) { if (z99 > -10) print("ans is " + z99); else { abc += 65; } }
Latent Semantic Analysis
Documents as “bags of words”
Known technique in IR
Handles synonymy and polysemy
Maths is nasty
Results reported in (Cosma and Joy, 2010)
Heuristics
Comments
– Spelling mistakes
– Unusual English (Thai, German, …)
Use of Search Engines
Unusual style
Code errors
Technical Issues
Data protection (e.g. MOSS is in USA)
Accuracy
Faulty code may not be accepted
Results returned by different tools are similar (but not identical)
User interface
Availability of sets of test data
Part 3 – Establishing the Facts
Collection
Detection
Confirmation
Investigation
What is “Similarity”?
What do we actually mean by “similar”?
This is where the problems start …
Evidence … ?
(1) Staff Survey
We carried out a survey in order to:
– gather the perceptions of academics on what constitutes source-code plagiarism (Cosma and Joy, 2006), and
– create a structured description of what constitutes source-code plagiarism from a UK academic perspective (Cosma and Joy, 2008)
Data Source
• On-line questionnaire distributed to 120 academics
– Questions were in the form of small scenarios
– Mostly multiple-choice responses
– Comments box below each question
– Anonymous, with an option for providing details
• Received 59 responses, from more than 34 different institutions
• Responses were analysed and collated to create a universally acceptable source-code plagiarism description.
Results
Plagiaristic activities:
– Source-code reuse and self-plagiarism
– Use of (O-O) templates
– Converting source to another language
– Inappropriate collusion/collaboration
– Using code-generator software
– Obtaining source-code written by other authors
– False and “pretend” references
Copying with adaptation: minimal, moderate, extreme
– How to decide?
(2) Student Survey
We carried out a survey (Joy et al., 2010) in order to:
– gather the perceptions of students on what (source code) plagiarism means
– identify types of plagiarism which are poorly understood
– identify categories of student who perceive the issue differently to others
Data Source
• Online questionnaire answered by 770 students from computing departments across the UK
• Anonymised, but brief demographic information included
• Used 15 “scenarios”, each of which may describe a plagiaristic activity
Results (1)
No significant difference in perspectives in terms of
– university
– degree programme
– level of study (BS, MS, PhD)
Results (2)
Issues which students misunderstood:
– open source code
– translating between languages
– re-use of code from previous assignments
– placing references within technical documentation
Summary ... Big Issues
Making policy clear to students
Identifying external contributors
• web sites with code to download
• enthusiasts' forums, Wikis, etc.
Cheat sites
• “Rent-A-Coder” (etc.)
References (1)
F. Culwin and T. Lancaster, “Plagiarism, prevention, deterrence and detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)
G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis” IEEE Transactions on Computers, to appear (2010)
G. Cosma and M.S. Joy, “Towards a Definition on Source-Code Plagiarism”, IEEE Transactions on Education 51(2) pp. 195-200 (2008)
References (2)
G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)
M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2) pp. 19-26 (1972)
M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair, “Source Code Plagiarism – a Student Perspective”, IEEE Transactions on Education (to appear) (2010)
M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)
References (3)
K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4) pp. 30-41 (1976)
L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11) pp. 1016-1038 (2002)
S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)