
Automatic Plagiarism Detection in Text (Detección automática de plagio en texto)

L. Alberto Barrón Cedeño

Master's in Artificial Intelligence, Pattern Recognition and Digital Imaging

Advisor: Paolo Rosso

Natural Language Engineering Lab, ELiRF

Universidad Politécnica de Valencia

IV Jornadas MAVIR, November 19th, 2009

Detección automática de plagio en texto IARFID-NLEL, UPV 1/54


Outline

Introduction

State of the Art

Monolingual Plagiarism Detection

Cross-Language Plagiarism Detection

Contributions

Corpora, Competition and More...


What is Plagiarism?

• Copying words or ideas from someone else without giving credit.

• Changing words but copying the sentence structure of a source without giving credit.

• Copying so many words or ideas from a source that it makes up the majority of your work, whether you give credit or not.

www.plagiarism.org

Detention vs. Detection

“En un lugar de la estancia, de cuyo nombre no quiero acordarme...” (“In a place in the estancia, whose name I do not wish to remember...”)


In the news

Some available (online) systems:

• Pl@giarism
• turnitin
• wCopyFind
• doccop.com
• Ferret
• Plagiarismdetect

http://www.time.com (Oct. 20, 2009)


Objectives

• Study of some of the current (statistical) approaches to plagiarism detection.

• Analysis and adaptation of available lexical and statistical resources.

• Becoming the seed of a broader (PhD) research on monolingual and cross-lingual plagiarism detection.


Terminology

A: author
dq: suspicious (potentially plagiarised) document
d: potential source document
D / Dq: set of source / suspicious documents
s: text fragment in a document


Relevant Factors for Plagiarism Detection

Intrinsic analysis

• Use of vocabulary
• Changes of vocabulary
• Punctuation
• Readability of text

External analysis

• Amount of similarity between texts
• Distribution of words

[Clough, 2000]


“Search and classify”

dq: Determine whether dq was written by a single author. If not, identify the sections written by a different author.

dq to d: Retrieve documents d ∈ D such that d may be the source of the potentially plagiarised dq.

s ∈ dq to d: For (some) section s ∈ dq, retrieve documents d ∈ D such that d may be the source of potentially plagiarised sections in dq.

s ∈ dq to s ∈ d: For (some) section s ∈ dq, retrieve sections s ∈ d ∈ D such that s ∈ d may be the source of the potentially plagiarised section s ∈ dq.


Outline

Introduction

State of the Art

Monolingual Plagiarism Detection

Cross-Language Plagiarism Detection

Contributions

Corpora, Competition and More...


Intrinsic Analysis

If a person is able to detect a plagiarism case by reading a text…

• Word average frequency class

• Average [sentence, word] length

• Stopwords average

• Complexity measures

[Meyer zu Eißen and Stein, 2006, Stamatatos, 2009]
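A minimal sketch of how some of the stylistic features listed above could be computed for one chunk of text. The function name `style_features`, the regular expressions, and the tiny stopword list are assumptions made for illustration; they are not taken from the cited works. An intrinsic detector would compare these values across chunks of dq and flag chunks that deviate from the rest.

```python
import re

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def style_features(text):
    """Return average sentence length, average word length and stopword ratio."""
    # Naive sentence splitting on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Alphabetic tokens only, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0, "stopword_ratio": 0.0}
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "stopword_ratio": sum(w in STOPWORDS for w in words) / len(words),
    }


if __name__ == "__main__":
    print(style_features("The cat sat. It is a cat."))
```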


Overview of External Analysis

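[Figure: external plagiarism detection pipeline. Elements shown: suspicious document d_q, reference collection D, heuristic retrieval, candidate documents, detailed analysis (decomposition, vector space model comparison, fingerprint-based comparison), knowledge-based post-processing, suspicious passages. Stage labels: preprocessing, plagiarism detection.]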

(adapted from [Potthast et al., 2009a, LRE, submitted])


External Analysis

n-gram based comparison

Given N(x), the set of n-grams in x

Resemblance:

\[ R(d_q \mid d) = \frac{|N(d_q) \cap N(d)|}{|N(d_q) \cup N(d)|} \]

Containment:

\[ C(s \in d_q \mid d) = \frac{|N(s) \cap N(d)|}{|N(s)|} \]

[Broder, 1997, Lyon et al., 2001]
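The two measures can be sketched in a few lines. This is an illustrative implementation under simple assumptions: tokens are obtained by lowercasing and whitespace splitting, and the function names `ngrams`, `resemblance` and `containment` are chosen here, not taken from the cited papers.

```python
def ngrams(text, n):
    """N(x): the set of word n-grams in a text (whitespace tokenisation, lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(dq, d, n=3):
    """|N(dq) ∩ N(d)| / |N(dq) ∪ N(d)|; symmetric, suited to whole documents."""
    a, b = ngrams(dq, n), ngrams(d, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(s, d, n=3):
    """|N(s) ∩ N(d)| / |N(s)|; asymmetric, suited to a short fragment s against d."""
    a, b = ngrams(s, n), ngrams(d, n)
    return len(a & b) / len(a) if a else 0.0


if __name__ == "__main__":
    d = "the quick brown fox jumps over the lazy dog"
    s = "quick brown fox jumps"
    print(containment(s, d, n=2), resemblance(s, d, n=2))
```

Containment normalises by the fragment only, so a short copied passage still scores high against a long source, whereas resemblance is diluted by the size of the union.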


External Analysis

Why do n-grams work? (1)

Query                                                            |web pages|
Bienvenido                                                       57,800,000
Bienvenido a                                                     30,000,000
Bienvenido a la                                                  10,700,000
Bienvenido a la página web                                          572,000
Bienvenido a la página web del                                      265,000
Bienvenido a la página web del posgrado                                   5
Bienvenido a la página web del posgrado en informática, desde             4

Sentence splitting does not work (cf. www.plagiarismdetect.com/)

http://www.popinformatica.upv.es/ (1-XII-08)

(Example adapted from P. Clough)


External Analysis

Why do n-grams work? (2)

• Consider 4 documents d1...4

• d1...4 were authored by A

• d1...4 have a common topic

• Avg. document size: 3,728 words

|d|   1-grams   2-grams   3-grams   4-grams
2     0.1692    0.1125    0.0574    0.0312
3     0.0720    0.0302    0.0093    0.0027
4     0.0739    0.0166    0.0031    0.0004


External Analysis

Fingerprinting models

• D is compiled into a fingerprint index

• fingerprints of dq and d are compared

• COPS [Brin et al., 1995]

• Winnowing [Schleimer et al., 2003]

• Fuzzy fingerprinting [Stein, 2007]
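As an illustration of the fingerprinting idea, the following is a rough sketch in the spirit of winnowing: hash every character k-gram, slide a window of w consecutive hashes, and keep the minimum hash of each window (the rightmost one on ties). The parameter values, the use of `zlib.crc32` as the hash, and the helper names are assumptions for this example, not details from the cited systems.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every character k-gram of the lowercased text with whitespace removed."""
    s = "".join(text.lower().split())
    return [zlib.crc32(s[i:i + k].encode("utf-8")) for i in range(len(s) - k + 1)]


def winnow(text, k=5, w=4):
    """Return the set of (hash, position) fingerprints chosen by the window minimum rule."""
    hashes = kgram_hashes(text, k)
    selected = set()
    # Texts with fewer than w k-grams yield no window and hence no fingerprint here.
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        smallest = min(window)
        # Pick the rightmost occurrence of the minimum within the window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected


def fingerprint_overlap(a, b, k=5, w=4):
    """Fraction of a's fingerprint hashes that also occur in b's fingerprint."""
    fa = {h for h, _ in winnow(a, k, w)}
    fb = {h for h, _ in winnow(b, k, w)}
    return len(fa & fb) / len(fa) if fa else 0.0


if __name__ == "__main__":
    print(fingerprint_overlap("a do run run run, a do run run",
                              "xxxx a do run run run yyyy"))
```

With this selection rule, any substring of at least w + k - 1 characters shared by both texts contributes at least one common fingerprint, which is why only the selected hashes of D need to be indexed.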


Outline

Introduction

State of the Art

Monolingual Plagiarism Detection

Cross-Language Plagiarism Detection

Contributions

Corpora, Competition and More...


Language Models

\[ P(w_1, \ldots, w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \ldots, w_{i-1}) \]

• Is it possible to characterize the writing style with a LM?

• Perplexity (PP) determines how well a LM predicts a text

• Compute a language model LM from DA.

• Let d1 and d2 be two documents such that d1 ∈ A and d2 ∉ A

• We expect that PP(d1) ≪ PP(d2).

[Barrón-Cedeño and Rosso, 2008, PAN-lm]
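A toy sketch of the perplexity comparison, assuming a word bigram model with add-one smoothing trained on a handful of sentences standing in for DA. The model order, the smoothing, and the function names are assumptions for illustration and may differ from the setup used in the cited work.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count bigrams and their contexts; returns (context counts, bigram counts, vocab size)."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sent in sentences:
        words = sent.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # One extra vocabulary slot is reserved for unseen words.
    return contexts, bigrams, len(vocab) + 1


def perplexity(sentence, lm):
    """Per-token perplexity of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, v = lm
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


if __name__ == "__main__":
    lm = train_bigram_lm(["the cat sat on the mat", "the dog sat on the mat"])
    print(perplexity("the cat sat on the mat", lm))
    print(perplexity("quantum flux destabilises reactors", lm))
```

Under this sketch, text close to the training material receives a lower perplexity than unrelated text, which is the expectation PP(d1) ≪ PP(d2) stated above.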


Language Models

[Figure: perplexity (y axis) per sentence (x axis). Labels: POS; literature example, n=3; plagiarised; µ=9.]


Language Models

What is wrong?

• In fact, this approach is closer to intrinsic analysis

• Significant text fragments contain more than 100 words

• Vectors (or LM) should be computed at character level

[Stamatatos, 2009]


Detailed Analysis

[Figure: external plagiarism detection pipeline, from reference collection D and d_q through heuristic retrieval, candidate documents, detailed analysis and knowledge-based post-processing to suspicious passages.]


n-grams Comparison

• METER corpus on journalistic text reuse [Clough et al., 2002]

• Containment approach (s ∈ dq to d)

\[ C(s_i \mid d) = \frac{|N(s_i) \cap N(d)|}{|N(s_i)|} \]

[Barrón-Cedeño and Rosso, 2009a, ECIR]


n-grams Comparison

How long should n be?

[Figure: precision, recall and F-measure as a function of the containment threshold t, with curves for n=1 to n=5. Annotated points: P=0.736, R=0.641, F=0.685 at t=0.34; P=0.740, R=0.604, F=0.665 at t=0.17.]
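As a consistency check on the annotated values, assuming the F-measure reported here is the balanced harmonic mean of precision and recall (the slide does not state the weighting), both annotated points reproduce:

```latex
F = \frac{2PR}{P + R}, \qquad
\frac{2 \cdot 0.736 \cdot 0.641}{0.736 + 0.641} \approx 0.685, \qquad
\frac{2 \cdot 0.740 \cdot 0.604}{0.740 + 0.604} \approx 0.665
```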


“Heuristic Retrieval”

[Figure: external plagiarism detection pipeline, from reference collection D and d_q through heuristic retrieval, candidate documents, detailed analysis and knowledge-based post-processing to suspicious passages.]


Previous selection of (a good) D

• Some methods work properly. However... what about the size of D (database, digital library, Internet)?


Previous selection of (a good) D

Based on the Kullback-Leibler distance

\[ KL_\delta(P \,\|\, Q) = \sum_{x \in X} \bigl(P(x) - Q(x)\bigr) \log \frac{P(x)}{Q(x)} \]

[Kullback and Leibler, 1951, Bigi, 2003]

• Variation of n-gram levels n = {1, 2, 3}

• Keywords ranking on the basis of tf, tfidf and tp

• The top 20% of the keywords ranking represents the document

• P_d(t_i) = tf_{d,t_i} ; Q_{d_q}(t_i) = (tf_{d_q,t_i} | P_d(t_i))

[Barrón-Cedeño et al., 2009b, CICLing]
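The symmetric Kullback-Leibler distance above can be sketched as follows. The sum runs over the terms of P (the keywords representing the reference document), and a term missing from Q is backed off to a small constant `epsilon`; both choices, as well as the helper names, are assumptions made for this illustration and may differ from the smoothing used in the cited work.

```python
import math
from collections import Counter


def tf_distribution(tokens, vocabulary):
    """Relative term frequencies of the tokens that belong to the given vocabulary."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def symmetric_kl(p, q, epsilon=1e-6):
    """Sum over x in P of (P(x) - Q(x)) * log(P(x) / Q(x)); unseen Q(x) backs off to epsilon."""
    total = 0.0
    for x, px in p.items():
        qx = q.get(x, epsilon)
        total += (px - qx) * math.log(px / qx)
    return total


if __name__ == "__main__":
    p = {"a": 0.5, "b": 0.5}
    print(symmetric_kl(p, {"a": 0.6, "b": 0.4}))
    print(symmetric_kl(p, {"a": 0.9, "b": 0.1}))
```

Each summand is non-negative, so the distance is zero only when the two distributions agree on every term of P; the documents d with the smallest distance to dq would be kept as the reduced reference set.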


Previous selection of (a good) D

Results

• Best option: tfidf on 1-grams

• Exhaustive comparison: Containment on 3-grams

Selection of D   threshold   P      R      F      t
NO               0.34        0.73   0.63   0.68   2.32
YES              0.25        0.77   0.74   0.75   0.19

[Barrón-Cedeño and Rosso, 2009b, SEPLN]


Outline

Introduction

State of the Art

Monolingual Plagiarism Detection

Cross-Language Plagiarism Detection

Contributions

Corpora, Competition and More...


Cross-Language plagiarism detection

What to do if dq and d are written in different languages?

• Scholars from non-English speaking countries write texts in their native languages

• Current scientific discourse to refer to is often published in English

Issue: The syntactical similarity between passages in dq and d is lost across languages.


Cross-Language Plagiarism Detection

[Figure: external plagiarism detection pipeline, from reference collection D and d_q through heuristic retrieval, candidate documents, detailed analysis and knowledge-based post-processing to suspicious passages.]


Cross-Language Plagiarism Detection

Some options:

1 EUROVOC Thesaurus-based [Pouliquen et al., 2003]

2 CL-ESA, Wikipedia-based [Potthast et al., 2008]

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case

Plagiarism, the unacknowledged use of another author’s ori

is nowadays considered as one of the biggest problems in p

science, and education. Although texts and other works of a

plagiarized all times, text plagiarism is observed at an unprecedented

scale with the advent of the World Wide Web.

This observation is not surprising since the Web makes billions of texts,

code sources, images, sounds, and videos easily accessible, that is to say,

copyable.

Plagiarism detection, the automatic identification of plagiarism and the

retrieval of the original sources, is researched and developed as a possible

countermeasure to plagiarism. Although humans can relatively easy

identify cases of plagiarism in their areas of expertise, it requires much

effort to be aware of all potential sources on a given topic and to provide

strong evidence against an offender.

The manual analysis of text with respect to plagiarism becomes infeasible

on a large scale, so that automatic plagiarism detection attracts considerable

attention.

The paper in hand investigates a particular kind of text plagiarism, namely

the detection of plagiarism across languages. The different kinds of text

plagiarism are organized in Figure 1. Cross−language plagiarism, shown

encircled, refers to cases where an author translates text from another

language and integrates the translation into his/her own writing. It is

reasonable to assume that plagiarism does not stop at language barriers

since, for instance, scholars from non−English speaking countries often

write assignments, seminars, theses, or papers in their native languages

whereas current scientific discourse to refer to is often published in English.

There are no studies which assess the amount of cross−language plagiarism

directly, but in [2] a broader study among 18,000 students revealed that

almost 40% of them admittedly plagiarized at least once, which may also

include the cross−lingual case


[Figure: sample documents labeled dL and dL′, together with the labels DL and DL′]


Cross-Language Plagiarism Detection

CL-ASA: CL-Alignment-based Similarity Analysis

• How probable is it that dq is a valid translation of d′?

• Combination of a two-step probabilistic translation and similarity analysis

• Adaptation of the basic principles of IBM Model 1 (IBM M1) from statistical machine translation [Brown et al., 1990]

[Barrón-Cedeño et al., 2008, PAN-cl], [Pinto et al., 2009, Algorithms]


Cross-Language Plagiarism Detection

Bayes’s rule for statistical Machine Translation [Brown et al., 1993]

$$p(d' \mid d_q) = \frac{p(d')\, p(d_q \mid d')}{p(d_q)}$$

• p(dq) does not depend on d′ and is therefore neglected

• p(dq | d′) is a translation model probability (statistical bilingual dictionary)

es          en           p(es, en)
certifica   certifies    0.4203
certifica   certify      0.1644
certifica   certified    0.1096
certifica   certifying   0.0913
certifica   hereby       0.0548
…
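To make the role of the bilingual dictionary concrete, the following is a minimal sketch of an IBM Model 1-style score for p(dq | d′), not the authors' implementation. It assumes that dq is the Spanish text and d′ the English one, omits the NULL word and the constant factor of the original model, and uses a hypothetical `FLOOR` probability for word pairs missing from the dictionary. The function name is illustrative; the dictionary entries are the ones listed in the table.

```python
import math

# Entries copied from the example table; a real dictionary would be
# estimated from a parallel corpus.
DICTIONARY = {
    ("certifica", "certifies"): 0.4203,
    ("certifica", "certify"): 0.1644,
    ("certifica", "certified"): 0.1096,
    ("certifica", "certifying"): 0.0913,
    ("certifica", "hereby"): 0.0548,
}

FLOOR = 1e-6  # assumption: probability given to pairs absent from the dictionary


def log_translation_score(dq_tokens, d_prime_tokens, dictionary=DICTIONARY):
    """IBM Model 1-style log score of dq given d' (simplified sketch)."""
    if not dq_tokens or not d_prime_tokens:
        raise ValueError("both documents must contain at least one token")
    total = 0.0
    for x in dq_tokens:
        # Average the dictionary probability of x over every word of d'.
        mass = sum(dictionary.get((x, y), FLOOR) for y in d_prime_tokens)
        total += math.log(mass / len(d_prime_tokens))
    return total


if __name__ == "__main__":
    print(log_translation_score(["certifica"], ["certifies", "hereby"]))
```

Working in log space avoids numerical underflow when the per-word factors are multiplied over long documents.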


Cross-Language Plagiarism Detection

• p(d′) is the language model probability

• Length model, inspired by [Pouliquen et al., 2003]

[Figure: Probable lengths of translations of d, with |d| = 30000. Horizontal axis: 0 to 100000; vertical axis: probability, 0 to 1; one curve per language: de, es, fr, nl, pl.]

The tests on a toy corpus were very promising
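One common way to express such a length model is a Gaussian-shaped function of the length ratio |d′| / |d|, which peaks at 1 when the ratio equals the expected value for the language pair. The sketch below assumes that form; whether it matches the exact model used in the talk is an assumption, and the `mu` and `sigma` values are hypothetical placeholders rather than estimated parameters.

```python
import math


def length_factor(len_d, len_d_prime, mu, sigma):
    """Gaussian-shaped plausibility that a text of length len_d_prime
    is a translation of a text of length len_d.

    mu is the expected length ratio for the language pair and sigma its
    standard deviation (both must be estimated from parallel data).
    """
    if len_d <= 0 or sigma <= 0:
        raise ValueError("len_d and sigma must be positive")
    ratio = len_d_prime / len_d
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    mu, sigma = 1.1, 0.3
    for candidate_length in (20000, 33000, 60000):
        print(candidate_length, length_factor(30000, candidate_length, mu, sigma))
```

A factor of this kind can stand in for p(d′) when the goal is to rank candidate sources rather than to produce fluent output.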


Contributions

[Figure: plagiarism detection process. Labels: reference collection D, d_q, heuristic retrieval, candidate documents, detailed analysis (decomposition, vector space model comparison, fingerprint-based comparison), suspicious passages, knowledge-based post-processing. Stage labels: preprocessing, plagiarism detection.]


Contributions

Search space reduction

• Retrieval of documents using whole documents as queries

• Document-based and keyword-based retrieval are not the same

• Most papers on this topic assume that this stage is already solved

• It is often called “heuristic source documents retrieval”

[Barrón-Cedeño et al., 2009b, CICLing]
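The title of the cited CICLing paper points to the Kullback-Leibler distance as the basis for this reduction. The sketch below illustrates the general idea under stated assumptions: documents are compared through their term distributions, a symmetric form of the distance is used, and only the closest reference documents are kept for detailed analysis. The function names, the `EPSILON` back-off value, and the use of plain term frequencies are illustrative choices, not details taken from the paper.

```python
import math
from collections import Counter

EPSILON = 1e-9  # assumption: back-off probability for terms a document lacks


def term_distribution(tokens):
    """Relative frequency of each term in a tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def symmetric_kl(p, q, epsilon=EPSILON):
    """Symmetric Kullback-Leibler distance: sum of (p - q) * log(p / q).

    Simplification: the distributions are not renormalized after back-off.
    """
    distance = 0.0
    for term in set(p) | set(q):
        px = p.get(term, epsilon)
        qx = q.get(term, epsilon)
        distance += (px - qx) * math.log(px / qx)
    return distance


def closest_documents(query_tokens, collection, k=10):
    """Return the ids of the k reference documents closest to the query."""
    p = term_distribution(query_tokens)
    scored = sorted(
        (symmetric_kl(p, term_distribution(tokens)), doc_id)
        for doc_id, tokens in collection.items()
    )
    return [doc_id for _, doc_id in scored[:k]]


if __name__ == "__main__":
    collection = {
        "a": "the cat sat on the mat".split(),
        "b": "stock markets fell sharply today".split(),
    }
    print(closest_documents("the cat sat".split(), collection, k=1))
```

Only the documents returned by `closest_documents` would then go through the more expensive detailed comparison.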


Contributions

Cross-Language Approach

• Numerous tools exist, but they do not pay attention to the cross-language case

• Proposal of a method based on statistical machine translation

• Preliminary experiments

• Maybe one of the first papers on this topic [Barrón-Cedeño et al., 2008, PAN cl] (also [Pinto et al., 2009, Algorithms])


Corpora development

• Corpora of real cases of plagiarism have not been published for ethical reasons

• Nobody wants to be exposed!

• We need to compile/generate corpora


Wikipedia co-derivatives corpus

• Texts written in: en, de, es, hi

• 500 most frequently accessed articles in each language

• For each article 10 revisions are included

http://users.dsic.upv.es/grupos/nle/downloads.html


Monolingual Text Similarity Measures

Comparison of models including:

• Vector space models

• Probabilistic models

• Fingerprinting models

(the simplest seems to be the best: the Jaccard coefficient)

[Barrón-Cedeño et al., 2009a, ICON, in press]
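For reference, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to sets of word n-grams; the choice of word trigrams and whitespace tokenization is an assumption made for illustration, not necessarily the setup evaluated in the cited comparison.

```python
def word_ngrams(text, n=3):
    """Set of word n-grams of a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    first = word_ngrams("the quick brown fox jumps")
    second = word_ngrams("the quick brown fox sleeps")
    print(jaccard(first, second))
```

The measure needs no term weighting or training, which is part of why it is such a simple baseline.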


Cross-Language Text Reuse corpus

• It includes parallel texts from the JRC-Acquis corpus

• Comparable texts from Wikipedia are included as well

• Simulation of cross-language text reuse (plagiarism) based on translation and on translation plus post-editing

http://users.dsic.upv.es/grupos/nle/downloads.html


Cross-Language Similarity Measures

• CL Explicit Semantic Analysis (CL-ESA) [Potthast et al., 2008]

• CL Character n-grams-based Comparison (CL-CnC) [McNamee and Mayfield, 2004]

• CL Alignment-based Similarity Analysis (CL-ASA) [Barrón-Cedeño et al., 2008, Pinto et al., 2009]

• CL-CnC is the best option when the languages are related

• CL-ESA is better on Wikipedia (but Wikipedia pages are far from being plagiarism!)

• CL-ASA is better on exact translations

• CL-ASA can be applied to any pair of languages (related or not)

[Potthast et al., 2009a, LRE, in press]
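The intuition behind the character n-gram comparison is that related languages share many short character sequences, so texts can be compared without translating them. The sketch below illustrates this with character trigrams and cosine similarity; the normalization steps, the value n = 3, and the raw-count weighting are assumptions made for illustration, not the configuration from the cited work.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-gram counts after lowercasing and dropping punctuation."""
    kept = "".join(ch.lower() for ch in text if ch.isalnum() or ch.isspace())
    normalized = " ".join(kept.split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(u, v):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(weight * v.get(gram, 0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    spanish = char_ngrams("detección de plagio")
    english = char_ngrams("detection of plagiarism")
    unrelated = char_ngrams("weather forecast")
    print(cosine(spanish, english), cosine(spanish, unrelated))
```

Shared n-grams such as "det" or "pla" drive the score here, which is consistent with the observation that this kind of measure works best when the languages are related.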


Development of the PAN-PC-09 corpus

• 41,223 documents including 94,202 cases of artificial plagiarism

• Plagiarism Languages. 90% of the cases are monolingual English plagiarism. The remainder are cross-lingual (from German and Spanish into English).

• Plagiarism Obfuscation.

http://users.dsic.upv.es/grupos/nle/downloads.html

[Potthast et al., 2009b, PAN]


PAN-09@SEPLN: Workshop & Competition

PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse

http://www.webis.de/pan-09


Lab@CLEF-10: on Plagiarism Detection

Conference on Multilingual and Multimodal Information Access Evaluation


Contact

Master en Inteligencia Artificial, Reconocimiento de Formas e Imagen Digital

http://www.popinformatica.upv.es/iarfid.html

Natural Language Engineering Lab

http://www.dsic.upv.es/grupos/nle/

Alberto Barrón Cedeño

http://www.dsic.upv.es/~lbarron

Paolo Rosso

http://www.dsic.upv.es/~prosso


CONACyT

The research work of Alberto Barrón Cedeño is possible thanks to the support of CONACyT-Mexico.


References I

Barrón-Cedeño, A., Eiselt, A., and Rosso, P. (2009a).

Monolingual Text Similarity Measures: A comparison of Models over Wikipedia Articles Revisions. In Proceedings of the ICON 2009.

Barrón-Cedeño, A., Pinto, D., Rosso, P., and Juan, A. (2008).

On Cross-lingual Plagiarism Analysis using a Statistical Model. In Stein, Stamatatos, and Koppel, editors, Proceedings of the ECAI’08 PAN Workshop: Uncovering Plagiarism, Authorship and Social Software Misuse, pages 9–14, Patras, Greece.

Barrón-Cedeño, A. and Rosso, P. (2008).

Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference. In Stein, Stamatatos, and Koppel, editors, Proceedings of the ECAI’08 PAN Workshop: Uncovering Plagiarism, Authorship and Social Software Misuse, pages 15–19, Patras, Greece.

Barrón-Cedeño, A. and Rosso, P. (2009a).

On Automatic Plagiarism Detection based on n-grams Comparison. Advances in Information Retrieval. Proceedings of the 31st European Conference on IR Research, LNCS (5478):696–700.

Barrón-Cedeño, A. and Rosso, P. (2009b).

On the Relevance of Search space Reduction in Automatic Plagiarism Detection. Procesamiento del Lenguaje Natural, 43:141–149.

Barrón-Cedeño, A., Rosso, P., and Benedí, J. (2009b).

Reducing the Plagiarism Detection Search Space on the Basis of the Kullback-Leibler Distance. Computational Linguistics and Intelligent Text Processing. Proceedings of the CICLing 2009, LNCS (5449):523–534.


References II

Bigi, B. (2003).

Using Kullback-Leibler distance for text categorization. Advances in Information Retrieval: Proceedings of the 25th European Conference on IR Research (ECIR 2003), LNCS (2633):305–319.

Brin, S., Davis, J., and Garcia-Molina, H. (1995).

Copy Detection Mechanisms for Digital Documents. In ACM International Conference on Management of Data (SIGMOD 1995).

Broder, A. (1997).

On the Resemblance and Containment of Documents. In SEQUENCES ’97: Proceedings of the Compression and Complexity of Sequences 1997, page 21, Washington, DC. IEEE Computer Society.

Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Lafferty, J., Mercer, R., and Roossin, P. (1990).

A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79–85.

Brown, P., Della Pietra, S., Della Pietra, V., and Mercer, R. (1993).

The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Clough, P. (2000).

Plagiarism in Natural and Programming Languages: an Overview of Current Tools and Technologies. Research Memoranda: CS-00-05, Department of Computer Science. University of Sheffield, UK.


References III

Clough, P., Gaizauskas, R., and Piao, S. (2002).

Building and Annotating a Corpus for the Study of Journalistic Text Reuse.In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), volume V,pages 1678–1691, Las Palmas, Spain.

Kullback, S. and Leibler, R. (1951).

On Information and Sufficiency. Annals of Mathematical Statistics, 22(1):79–86.

Lyon, C., Malcolm, J., and Dickerson, B. (2001).

Detecting short passages of similar text in large document collections. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pages 118–125, Pennsylvania.

McNamee, P. and Mayfield, J. (2004).

Character n-gram tokenization for European language text retrieval. Information Retrieval, 7(1–2):73–97.

Meyer zu Eißen, S. and Stein, B. (2006).

Intrinsic plagiarism detection. Advances in Information Retrieval: Proceedings of the 28th European Conference on IR Research (ECIR 2006), LNCS (3936):565–569.

Pinto, D., Civera, J., Barrón-Cedeño, A., Juan, A., and Rosso, P. (2009).

A Statistical Approach to Crosslingual Natural Language Tasks. Journal of Algorithms, 64(1):51–60.


References IV

Potthast, M., Barrón-Cedeño, A., Stein, B., and Rosso, P. (2009a).

Cross-Language Plagiarism Detection. Language Resources and Evaluation, Special Issue on Plagiarism and Authorship Analysis.

Potthast, M., Stein, B., and Anderka, M. (2008).

A Wikipedia-Based Multilingual Retrieval Model. Proceedings of the 30th European Conf. on IR Research (ECIR 2008), LNCS (4956):522–530.

Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., and Rosso, P. (2009b).

Overview of the 1st International Competition on Plagiarism Detection. In Stein, Rosso, Stamatatos, Koppel, and Agirre, editors, SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN 09), pages 1–9. CEUR-WS.org.

Pouliquen, B., Steinberger, R., and Ignat, C. (2003).

Automatic Identification of Document Translations in Large Multilingual Document Collections. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-2003), pages 401–408, Borovets, Bulgaria.

Schleimer, S., Wilkerson, D., and Aiken, A. (2003).

Winnowing: Local Algorithms for Document Fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, New York, NY. ACM.

Stamatatos, E. (2009).

A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 60(3):538–556.


References V

Stein, B. (2007).

Principles of hash-based text retrieval. In Clarke, C., Fuhr, N., Kando, N., Kraaij, W., and de Vries, A., editors, 30th Annual International ACM SIGIR Conference, pages 527–534, Amsterdam, Netherlands. ACM.

