Mike Joy 23 November 2010 Approaches to Detection of Source Code Plagiarism amongst Students.


Mike Joy

23 November 2010

Approaches to Detection of Source Code Plagiarism amongst Students


Overview of Talk

1) Process for detecting plagiarism

2) Technologies

3) Establishing the facts


Part 1 – Process

Four Stages

Collection

Detection

Confirmation

Investigation

From Culwin and Lancaster (2002)


Stage 1: Collection

• Get all documents together online
– so they can be processed
– formats?
– security?

• BOSS (Warwick)

• Coursemaster (Nottingham)

• Managed Learning Environment


Stage 2: Detection

• Compare with other submissions

• Compare with external documents
– essay-based assignments?

• We’ll come back to this later
– Technology


Stage 3: Confirmation

Software tool says “A and B similar”
– Are they?

Never rely on a computer program!
– Requires expert human judgement
– Evidence must be compelling
– Might go to court


Stage 4: Investigation

A from B, or B from A, or joint work?

If A from B, did B know?
– open networked file / printer output

Did the culprit/s understand?

Was code written externally?

University processes must be followed

Establishing the facts


Why is this Interesting?

How do you compare two programs?
– This is an algorithm question
– Stages 2 and 3: detection and confirmation

How do you use the results (of a comparison) to educate students?
– This is a pedagogic question
– Stage 4, and before stage 1!


Digression: Essays

Plagiarism in essays is easier to detect

Lots of “tricks” a lecturer can use!
– Google search on phrases
– Abnormal style
– ... etc.

Software tools
– Let's have a look ...


Turnitin


Part 2 – Technologies

Collection

Detection

Confirmation

Investigation


Why not use Turnitin?

It won’t work!
– String matching algorithm inappropriate
– Database does not contain code

Commercial involvement
– E.g. Black Duck Software


/* Program A */
public class Sun {
    static final double latitude=52.4;
    static final double longitude=-1.5;
    static final double tpi = 2.0*pi;
    /* ... */
    public static void main(String[] args) { calculate(); }
    public static double FNrange(double x) {
        double b = x / tpi;
        double a = tpi * (b - (long)(b));
        if (a < 0) a = tpi + a;
        return a;
    };
    public static void calculate() { /* ... */ }
    /* ... */

/* Program B */
public class SunsetCalculator {
    static float latitude=52.4;
    static float longitude=-1.5;
    /* ... */
    public static void main(String[] args) { findSunsetTime(); }
    public static double rangeCalc(float arg) {
        float x = arg / tpi;
        float y = 2*3.14159 * (x - (int)(x));
        if (y < 0) y = 2*3.14159 + y;
        return y;
    };
    public static void findSunsetTime() { /* ... */ }
    /* ... */


Is This Plagiarism?

• Is Program B derived from Program A in a manner which is “plagiarism”?

• Maybe
– Structure is similar – cosmetic changes
– But the algorithm is public domain
– Maybe B derived from A, maybe the other way round


History (1)

Attribute counting systems (Halstead, 1972; Ottenstein, 1976):

• Numbers of unique operators
• Numbers of unique operands
• Total numbers of operator occurrences
• Total numbers of operand occurrences
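The four counts listed above can be illustrated with a small Java sketch. This is not code from any of the cited systems: the class name `AttributeCounter`, the tiny keyword list and the regular-expression lexer are all assumptions made for illustration, and keywords and punctuation are simply counted as operators.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeCounter {
    // Assumption: a deliberately tiny keyword list; keywords are counted as operators.
    private static final Set<String> KEYWORDS =
            Set.of("int", "float", "double", "for", "while", "if", "else", "return");

    // Identifiers, numbers, runs of operator characters, or single punctuation marks.
    private static final Pattern LEXEME =
            Pattern.compile("[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|[-+*/%=<>!]+|[(){};,]");

    /** Returns {unique operators, unique operands, total operators, total operands}. */
    static int[] profile(String source) {
        Set<String> operators = new HashSet<>();
        Set<String> operands = new HashSet<>();
        int totalOperators = 0;
        int totalOperands = 0;
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            String lexeme = m.group();
            char first = lexeme.charAt(0);
            boolean wordOrNumber = Character.isLetterOrDigit(first) || first == '_';
            if (wordOrNumber && !KEYWORDS.contains(lexeme)) {
                // Identifiers and numeric literals are operands.
                operands.add(lexeme);
                totalOperands++;
            } else {
                // Everything else (keywords, operators, punctuation) is an operator.
                operators.add(lexeme);
                totalOperators++;
            }
        }
        return new int[] {operators.size(), operands.size(), totalOperators, totalOperands};
    }
}
```

Under these assumptions, renaming every identifier in a fragment leaves all four counts unchanged, so two submissions that differ only in naming receive the same profile.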


History (2)

Structure-based systems:

– Each program is converted into token strings (or something similar)

– Token streams are compared to determine similar source-code fragments

– Tools: YAP3, JPlag, Plague, MOSS, and Sherlock


Example (tokenwise equivalent)

int calculate(String arg) {
    int ans=0;
    for (int j=1; j<=100; j++) {
        ans *= j;
    }
    return ans;
}

Integer doit(String v) {
    float result=0.0;
    for (float f=100.0; f > 0.0; f--)
        result *= f;
    return result;
}


Example (tokenised)

type name(type name) start
    type name=number
    loop (type name=number name compare number operation name) start
        name operation name
    end
    return name
end
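A tokeniser of this kind can be sketched in a few lines of Java. The sketch below is an illustration only, not the lexer of any tool named in this talk: the class `Tokeniser`, its token names (borrowed from the slide) and its short list of type names are assumptions. It drops semicolons and does not normalise optional braces, so two fragments produce the same stream only if both bracket the loop body.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Tokeniser {
    // Assumption: only the type names that appear in the slide's example.
    private static final Set<String> TYPES = Set.of("int", "float", "String", "Integer");
    private static final Set<String> COMPARISONS = Set.of("<", "<=", ">", ">=", "==", "!=");

    // Longer operators are listed before their prefixes so that "<=" is not split into "<" and "=".
    private static final Pattern LEXEME = Pattern.compile(
            "[A-Za-z_]\\w*|\\d+(?:\\.\\d+)?|<=|>=|==|!=|[<>]|\\+\\+|--|[-+*/%]=?|=|[(){}]");

    /** Converts source text into a list of abstract tokens; unmatched characters (e.g. ';') are skipped. */
    static List<String> tokenise(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXEME.matcher(source);
        while (m.find()) {
            tokens.add(classify(m.group()));
        }
        return tokens;
    }

    private static String classify(String lexeme) {
        char first = lexeme.charAt(0);
        if (Character.isDigit(first)) return "number";
        if (Character.isLetter(first) || first == '_') {
            if (TYPES.contains(lexeme)) return "type";
            if (lexeme.equals("for")) return "loop";
            if (lexeme.equals("return")) return "return";
            return "name";
        }
        if (lexeme.equals("{")) return "start";
        if (lexeme.equals("}")) return "end";
        if (COMPARISONS.contains(lexeme)) return "compare";
        if (lexeme.equals("(") || lexeme.equals(")") || lexeme.equals("=")) return lexeme;
        return "operation";
    }
}
```

With this sketch, changing identifier names, types, literal values and the direction of the loop leaves the token list unchanged, which is the property structure-based detectors rely on.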


Detectors

MOSS (Alex Aiken: Berkeley/Stanford, USA, 1994)

JPlag (Guido Malpohl: Karlsruhe, Germany)
– Java only
– Programs must compile?

Sherlock (Warwick, UK) (Joy and Luck, 1999)


MOSS

MOSS determines the similarity of C, C++, Java, Pascal, Ada, ML, Lisp, or Scheme programs

Web-based: theory.stanford.edu/~aiken/moss/

“Winnowing” (Schleimer et al., 2003)
– Local document fingerprinting algorithm
– Efficiency proven (within 33% of lower bound)
– Guarantees detection of matches longer than a certain threshold
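The idea behind winnowing can be sketched as follows: hash every substring of length k (a "k-gram"), slide a window of w consecutive hashes over the result, and keep the minimum hash of each window as a fingerprint. Any substring of at least w + k − 1 characters shared by two documents then contributes at least one common fingerprint. The Java below is a minimal sketch under stated assumptions, not the MOSS implementation: the class `Winnowing` and its method names are hypothetical, `String.hashCode` stands in for a rolling hash, and the input is assumed to be already stripped of whitespace.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

class Winnowing {
    /** Hashes every k-gram of the text. String.hashCode is an illustrative stand-in for a rolling hash. */
    static int[] kgramHashes(String text, int k) {
        int n = Math.max(0, text.length() - k + 1);
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            hashes[i] = text.substring(i, i + k).hashCode();
        }
        return hashes;
    }

    /** Picks the minimum hash of every window of w hashes (rightmost on ties); returns position -> hash. */
    static SortedMap<Integer, Integer> fingerprint(String text, int k, int w) {
        int[] hashes = kgramHashes(text, k);
        SortedMap<Integer, Integer> picked = new TreeMap<>();
        for (int start = 0; start + w <= hashes.length; start++) {
            int min = start;
            for (int i = start + 1; i < start + w; i++) {
                if (hashes[i] <= hashes[min]) min = i;
            }
            // The same position is often the minimum of several windows; the map stores it once.
            picked.put(min, hashes[min]);
        }
        return picked;
    }

    /** Fingerprint hashes that two texts have in common. */
    static Set<Integer> sharedHashes(String a, String b, int k, int w) {
        Set<Integer> shared = new HashSet<>(fingerprint(a, k, w).values());
        shared.retainAll(new HashSet<>(fingerprint(b, k, w).values()));
        return shared;
    }
}
```

For example, with k = 5 and w = 4 any shared run of 8 or more characters is guaranteed to yield a common fingerprint, while a document shorter than k + w − 1 characters yields no fingerprints at all in this simplified version.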


JPlag

JPlag currently supports Java, C#, C, C++, Scheme, and natural language text

Web-based: www.ipd.uni-karlsruhe.de/jplag

Algorithm: Parse programs and tokenise, then pairwise compare using “Greedy String Tiling” (Prechelt et al., 2002)
– maximises percentage of common token strings
– worst case Θ(n³), average case linear
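A basic form of Greedy String Tiling can be sketched as below. In each round it finds the longest runs of matching, still-unmarked tokens, marks them as "tiles", and repeats until no run longer than a minimum match length remains. This is the plain cubic-time variant, without the hashing optimisation that gives the near-linear average case, and it is not JPlag's source code: the class `GreedyStringTiling`, its method names and the `minMatch` parameter are assumptions. The similarity formula, twice the covered tokens divided by the combined length, follows the definition I understand Prechelt et al. to use.

```java
import java.util.ArrayList;
import java.util.List;

class GreedyStringTiling {
    /** Total number of tokens covered by tiles of length >= minMatch (minMatch should be >= 1). */
    static int coveredTokens(List<String> a, List<String> b, int minMatch) {
        boolean[] markedA = new boolean[a.size()];
        boolean[] markedB = new boolean[b.size()];
        int covered = 0;
        int maxMatch;
        do {
            maxMatch = minMatch;
            List<int[]> matches = new ArrayList<>();
            // Phase 1: collect all maximal runs of unmarked matching tokens of the greatest length.
            for (int i = 0; i < a.size(); i++) {
                for (int j = 0; j < b.size(); j++) {
                    int len = 0;
                    while (i + len < a.size() && j + len < b.size()
                            && !markedA[i + len] && !markedB[j + len]
                            && a.get(i + len).equals(b.get(j + len))) {
                        len++;
                    }
                    if (len > maxMatch) {
                        matches.clear();
                        maxMatch = len;
                    }
                    if (len == maxMatch) matches.add(new int[] {i, j});
                }
            }
            // Phase 2: turn each match into a tile unless an earlier tile of this round overlaps it.
            for (int[] m : matches) {
                if (isFree(markedA, m[0], maxMatch) && isFree(markedB, m[1], maxMatch)) {
                    for (int k = 0; k < maxMatch; k++) {
                        markedA[m[0] + k] = true;
                        markedB[m[1] + k] = true;
                    }
                    covered += maxMatch;
                }
            }
        } while (maxMatch > minMatch);
        return covered;
    }

    private static boolean isFree(boolean[] marked, int from, int length) {
        for (int k = 0; k < length; k++) {
            if (marked[from + k]) return false;
        }
        return true;
    }

    /** Similarity in [0, 1]: twice the covered tokens over the combined length. */
    static double similarity(List<String> a, List<String> b, int minMatch) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * coveredTokens(a, b, minMatch) / total;
    }
}
```

Because tiles may be matched in any order, moving a block of code to a different place in the file does not reduce the covered-token count, only splitting it into pieces shorter than `minMatch` does.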


Sherlock

Developed at the University of Warwick Department of Computer Science

Open-Source application coded in Java

Sherlock detects plagiarism on source-code and natural language assignments

BOSS home page: www.boss.org.uk

Preprocesses code (not a full parse!) then simple string comparison


CodeMatch

Commercial product
– www.safe-corp.biz
– exact algorithm not published
– patent pending?

Free academic use for small data sets


CodeMatch – Algorithm

1) Remove comments, whitespace and lines containing only keywords/syntax; compare sequences of instructions

2) Extract comments, and compare

3) Extract identifiers, and count similar; x, xxx, xx12345 are “similar”

4) Combine (1), (2) and (3) to give correlation score


Example of Identical “Instruction Sequences”

/* File 1 */
for (int i=1; i<10; i++) {
    if (a==10) print("done"); else a++;
}

/* File 2 */
for (int x=100; x > 0; x--) {
    if (z99 > -10) print("ans is " + z99); else { abc += 65; }
}


Latent Semantic Analysis

Documents as “bags of words”

Known technique in IR

Handles synonymy and polysemy

Maths is nasty

Results reported in (Cosma and Joy, 2010)
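The “bag of words” starting point can be illustrated with a short Java sketch that counts terms in each document and compares the count vectors by cosine similarity. This is only the representation step, under stated assumptions: it omits the singular value decomposition that makes the analysis “latent” (and lets it handle synonymy and polysemy), it is not the method of Cosma and Joy, and the class `BagOfWords` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class BagOfWords {
    /** Counts how often each lower-cased word occurs; word order is discarded. */
    static Map<String, Integer> termCounts(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : document.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between two term-count vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> x, Map<String, Integer> y) {
        double dot = 0;
        double normX = 0;
        double normY = 0;
        for (Map.Entry<String, Integer> e : x.entrySet()) {
            normX += e.getValue() * e.getValue();
            dot += e.getValue() * y.getOrDefault(e.getKey(), 0);
        }
        for (int count : y.values()) {
            normY += count * count;
        }
        return (normX == 0 || normY == 0) ? 0.0 : dot / Math.sqrt(normX * normY);
    }
}
```

Two documents containing the same words in a different order score 1.0 here, and documents with no words in common score 0.0. Plain term counts cannot relate two different identifiers that mean the same thing, which is the gap the latent step is meant to close.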


Heuristics

Comments
– Spelling mistakes
– Unusual English (Thai, German, …)

Use of Search Engines

Unusual style

Code errors


Technical Issues

Data protection (e.g. MOSS is in USA)

Accuracy

Faulty code may not be accepted

Results returned by different tools are similar (but not identical)

User interface

Availability of sets of test data


Part 3 – Establishing the Facts

Collection

Detection

Confirmation

Investigation


What is “Similarity”?

What do we actually mean by “similar”?

This is where the problems start …

Evidence … ?


(1) Staff Survey

We carried out a survey in order to:
– gather the perceptions of academics on what constitutes source-code plagiarism (Cosma and Joy, 2006), and
– create a structured description of what constitutes source-code plagiarism from a UK academic perspective (Cosma and Joy, 2008)


Data Source

• On-line questionnaire distributed to 120 academics
– Questions were in the form of small scenarios
– Mostly multiple-choice responses
– Comments box below each question
– Anonymous – option for providing details

• Received 59 responses, from more than 34 different institutions

• Responses were analysed and collated to create a universally acceptable source-code plagiarism description.


Results

Plagiaristic activities:
– Source-code reuse and self-plagiarism
– Use of (O-O) templates
– Converting source to another language
– Inappropriate collusion/collaboration
– Using code-generator software
– Obtaining source-code written by other authors
– False and “pretend” references

Copying with adaptation: minimal, moderate, extreme
– How to decide?


(2) Student Survey

We carried out a survey (Joy et al., 2010) in order to:
– gather the perceptions of students on what (source code) plagiarism means
– identify types of plagiarism which are poorly understood
– identify categories of student who perceive the issue differently to others


Data Source

• Online questionnaire answered by 770 students from computing departments across the UK

• Anonymised, but brief demographic information included

• Used 15 “scenarios”, each of which may describe a plagiaristic activity


Results (1)

No significant difference in perspectives in terms of

– university
– degree programme
– level of study (BS, MS, PhD)


Results (2)

Issues which students misunderstood:

– open source code
– translating between languages
– re-use of code from previous assignments
– placing references within technical documentation


Summary ... Big Issues

Making policy clear to students

Identifying external contributors
• web sites with code to download
• enthusiasts' forums, Wikis, etc.

Cheat sites
• “Rent-A-Coder” (etc.)


References (1)

F. Culwin and T. Lancaster, “Plagiarism, prevention, deterrence and detection”, [online] available from: www.heacademy.ac.uk/assets/York/documents/resources/resourcedatabase/id426_plagiarism_prevention_deterrence_detection.pdf (2002)

G. Cosma and M.S. Joy, “An Approach to Source-Code Plagiarism Detection and Investigation using Latent Semantic Analysis” IEEE Transactions on Computers, to appear (2010)

G. Cosma and M.S. Joy, “Towards a Definition of Source-Code Plagiarism”, IEEE Transactions on Education 51(2) pp. 195-200 (2008)


References (2)

G. Cosma and M.S. Joy, “Source-code Plagiarism: a UK Academic Perspective”, Proceedings of the 7th Annual Conference of the HEA Network for Information and Computer Sciences (2006)

M. Halstead, “Natural Laws Controlling Algorithm Structure”, ACM SIGPLAN Notices 7(2) pp. 19-26 (1972)

M.S. Joy, G. Cosma, J.Y-K. Yau and J.E. Sinclair, “Source Code Plagiarism – a Student Perspective”, IEEE Transactions on Education (to appear) (2010)

M.S. Joy and M. Luck, “Plagiarism in Programming Assignments”, IEEE Transactions on Education 42(2), pp. 129-133 (1999)


References (3)

K. Ottenstein, “An Algorithmic Approach to the Detection and Prevention of Plagiarism”, ACM SIGCSE Bulletin 8(4) pp. 30-41 (1976)

L. Prechelt, G. Malpohl and M. Philippsen, “Finding Plagiarisms among a Set of Programs with JPlag”, Journal of Universal Computer Science 8(11) pp. 1016-1038 (2002)

S. Schleimer, D.S. Wilkerson and A. Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85 (2003)