N-gram Overlap in Automatic Detection of Document Derivation

Post on 09-Feb-2016


Siniša Bosanac, Vanja Štefanec
{sbosanac;vstefane}@ffzg.hr

Department of Information Sciences
Faculty of Humanities and Social Sciences

University of Zagreb

Introduction

• problems of originality and authenticity
• increased usage of ICT has intensified the problem
• Derivation
  – relationship between two documents in which the source document was used in creating the target document
• Text reuse
  – process by which content from a source document is reused in the creation of a target document
    • word for word
    • paraphrase

Examples of content derivation

• desirable:
  1) quoting
  2) document updating
  3) relaying the sponsored content in news media
  4) automatic and manual summarization
• undesirable:
  1) plagiarism
  2) non-critical relaying of sponsored content in news media
  3) journalistic theft

N-gram overlap

• overlapping of n successive linguistic units

• overlapping in longer n-grams can be an indicator of derivation

• representative n-gram length is language-specific

• fast, robust, low-complexity method
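As a rough illustration of "n successive linguistic units", the sketch below extracts word n-grams from a string. It assumes that the unit is a whitespace-separated, lowercased word; the slides do not specify the tokenization, and the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, n):
    """Return all n-grams (tuples of n successive words) as a list of tokens.

    Assumption: words are lowercased and split on whitespace only.
    """
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "the source document was used in creating the target document"
trigrams = word_ngrams(sample, 3)
# A text shorter than n words yields no n-grams at all.
too_short = word_ngrams("two words", 3)
```

Longer n-grams are less likely to coincide by chance, which is why overlap at higher n is a stronger hint of derivation.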

Measure

• resemblance
• Jaccard similarity coefficient
• n-gram types, not tokens

$$r(A, B) = \frac{|F(A) \cap F(B)|}{|F(A) \cup F(B)|}$$
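A minimal sketch of the resemblance measure as a Jaccard coefficient over n-gram types: each distinct n-gram counts once per document, and the score is the size of the intersection divided by the size of the union. Whitespace tokenization, lowercasing, and the names `ngram_types` and `resemblance` are assumptions made for the example, not details taken from the slides.

```python
def ngram_types(text, n):
    """Set of distinct word n-grams (types, not tokens) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(text_a, text_b, n):
    """Jaccard coefficient of the two n-gram type sets, in the range 0..1."""
    types_a = ngram_types(text_a, n)
    types_b = ngram_types(text_b, n)
    union = types_a | types_b
    if not union:  # both texts are shorter than n words
        return 0.0
    return len(types_a & types_b) / len(union)


doc_a = "a b c d e"
doc_b = "a b c d f"
# Trigram types: 2 shared out of 4 distinct ones overall.
score = resemblance(doc_a, doc_b, 3)
```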

Methodology

• building the text collection
• performing measurements
• manual classification of compared pairs into derived and non-derived
• ranking of pairs based on resemblance score

Text collection

• total number of documents: 236
• sources:
  – digital repository of the Library of the Faculty of Humanities and Social Sciences
  – Web news sites
  – other Web sources
• document size: 69 – 34,397 tokens

Text collection

• types of documents:
  – 39 diploma papers
  – 42 scientific articles
  – 61 news articles
  – 61 literary columns
  – 35 documents classified as “other”
• topic:
  – library science
  – psychology

Text classification

• topic determined according to:
  – source
  – document title
  – keywords
  – content
• functional style determined according to:
  – classification at source
  – type of document

Measurements

• using a purpose-built program module

• comparison of each document against every other

• comparison according to n-gram length from 1 to 10

• calculating the resemblance score

• No. of derivation pairs: 28,203
  – derived: 256
  – non-derived: 27,938
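The measurement step could be sketched as follows: every document is compared against every other, and a resemblance score is computed for each n-gram length from 1 to 10. This is not the purpose-built module mentioned on the slide; `compare_all`, `pair_count`, and the whitespace tokenization are assumptions made for illustration.

```python
from itertools import combinations


def ngram_types(words, n):
    """Set of distinct n-grams over an already tokenized document."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def compare_all(docs, max_n=10):
    """Score every unordered document pair for n = 1..max_n.

    docs maps a document name to its text; the result maps
    (name_a, name_b) to a dict {n: resemblance}.
    """
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    results = {}
    for a, b in combinations(sorted(tokenized), 2):
        scores = {}
        for n in range(1, max_n + 1):
            types_a = ngram_types(tokenized[a], n)
            types_b = ngram_types(tokenized[b], n)
            union = types_a | types_b
            scores[n] = len(types_a & types_b) / len(union) if union else 0.0
        results[(a, b)] = scores
    return results


def pair_count(num_docs):
    """Number of unordered pairs among num_docs documents."""
    return num_docs * (num_docs - 1) // 2


toy_docs = {
    "d1": "text reuse in news media",
    "d2": "text reuse in scientific articles",
    "d3": "a completely unrelated sentence",
}
toy_results = compare_all(toy_docs)
```

With k documents there are k(k-1)/2 unordered pairs; 28,203 is the value for k = 238.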

Choosing the most representative n-gram

• finding the F1 maximum on the level of each n-gram

• comparing the F1 measure maxima across all n-gram levels

• determining the resemblance threshold
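One way to read the procedure above: for a given n-gram length, try a range of resemblance thresholds, compute F1 against the manual classification at each one, and keep the threshold where F1 peaks. The sketch assumes a pair is predicted as derived when its score is at or above the threshold; that rule, the toy data, and the function names are illustrative assumptions.

```python
def precision_recall_f1(scores, labels, threshold):
    """Evaluate one threshold.

    scores: resemblance score per pair.
    labels: True where the pair was manually classified as derived.
    Assumption: a pair is predicted as derived when score >= threshold.
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(not p and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, candidates):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Toy data, not taken from the study.
toy_scores = [0.9, 0.6, 0.4, 0.3, 0.1]
toy_labels = [True, True, False, True, False]
chosen = best_threshold(toy_scores, toy_labels, [0.2, 0.35, 0.5])
```

Repeating this for each n-gram length and comparing the resulting F1 maxima gives the most representative n.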

Derived pairs – trigrams

[Figure: resemblance of derived pairs, trigrams. x-axis: derivation pair No; y-axis: resemblance [%], ticks from 0 to 0.5]

Non-derived pairs – trigrams

[Figure: resemblance of non-derived pairs, trigrams. x-axis: derivation pair No; y-axis: resemblance [%], ticks from 0 to 0.5]

Trigrams – F1, precision, recall

resemblance   0.25   0.3    0.35   0.4    0.45   0.5
F1-measure    0.417  0.517  0.583  0.633  0.656  0.651
precision     0.272  0.370  0.457  0.535  0.602  0.634
recall        0.890  0.852  0.803  0.773  0.720  0.667

[Figure: F1-measure, precision and recall plotted against resemblance [%]]
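The F1 row of the trigram table can be cross-checked from the precision and recall rows, since F1 is the harmonic mean of the two. The sketch below recomputes it from the rounded values on the slide, so small differences in the third decimal are expected; the variable names are hypothetical.

```python
# (resemblance threshold, precision, recall, reported F1), copied from the slide.
TRIGRAM_ROWS = [
    (0.25, 0.272, 0.890, 0.417),
    (0.30, 0.370, 0.852, 0.517),
    (0.35, 0.457, 0.803, 0.583),
    (0.40, 0.535, 0.773, 0.633),
    (0.45, 0.602, 0.720, 0.656),
    (0.50, 0.634, 0.667, 0.651),
]


def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Recomputed F1 per threshold, for comparison with the reported column.
recomputed = {t: f1_measure(p, r) for t, p, r, _ in TRIGRAM_ROWS}

# Threshold at which the reported F1 peaks.
best_trigram_threshold = max(TRIGRAM_ROWS, key=lambda row: row[3])[0]
```

Precision rises and recall falls as the threshold increases, and the reported F1 peaks at the 0.45 column.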

General results

n-gram      1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
F1-measure  0.465   0.524   0.656   0.735   0.771   0.820   0.814   0.807   0.777   0.718

[Figure: F1-measure by n-gram length, 1-gram to 10-gram]

Function words removed

n-gram      1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
F1-measure  0.487   0.611   0.704   0.754   0.801   0.803   0.748   0.712   0.661   0.630

[Figure: F1-measure by n-gram length, 1-gram to 10-gram; series: general, no function words]
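A sketch of how function words might be removed before the n-grams are formed. The slides do not say whether filtering happens before or after n-gram extraction, nor which word list was used; the short English list below is a hypothetical stand-in (the study itself used Croatian texts), and `content_ngram_types` is an illustrative name.

```python
# Hypothetical function-word list, for illustration only.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "by", "is"}


def content_ngram_types(text, n, function_words=FUNCTION_WORDS):
    """Distinct n-grams over the words that remain after filtering.

    Assumption: function words are dropped first, so the remaining
    content words become neighbours when the n-grams are built.
    """
    words = [w for w in text.lower().split() if w not in function_words]
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


sample = "the source document was used in creating the target document"
content_bigrams = content_ngram_types(sample, 2)
```

Under this assumption, filtering shortens the token sequence, so the same n spans more of the original text than it would without filtering.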

Functional styles

• five functional styles in Croatian
  – scientific
  – official
  – newspaper and publicistic
  – literary
  – colloquial

• need to differentiate between functional styles?

Differentiating functional styles

n-gram                     1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
scientific                 0.545   0.589   0.762   0.840   0.846   0.855   0.854   0.846   0.835   0.787
publicistic and newspaper  0.622   0.667   0.696   0.691   0.727   0.810   0.754   0.719   0.536   0.423

[Figure: F1-measure by n-gram length, 1-gram to 10-gram; series: scientific, publicistic and newspaper]
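Comparing the F1 maxima across n-gram lengths can be done mechanically. The sketch below takes the F1 rows reported on the results slides (general, function words removed, and the two functional styles) and picks the n with the highest value in each; the dictionary and function names are hypothetical.

```python
# F1 maxima for 1-grams through 10-grams, copied from the result tables.
F1_BY_N = {
    "general": [0.465, 0.524, 0.656, 0.735, 0.771, 0.820, 0.814, 0.807, 0.777, 0.718],
    "no function words": [0.487, 0.611, 0.704, 0.754, 0.801, 0.803, 0.748, 0.712, 0.661, 0.63],
    "scientific": [0.545, 0.589, 0.762, 0.840, 0.846, 0.855, 0.854, 0.846, 0.835, 0.787],
    "publicistic and newspaper": [0.622, 0.667, 0.696, 0.691, 0.727, 0.810, 0.754, 0.719, 0.536, 0.423],
}


def best_n(f1_values):
    """n-gram length (1-based) with the highest F1; the first wins on ties."""
    return max(range(1, len(f1_values) + 1), key=lambda n: f1_values[n - 1])


best_by_series = {name: best_n(values) for name, values in F1_BY_N.items()}
```

In all four series the maximum falls at n = 6, although the curve for publicistic and newspaper texts drops off much more sharply at 9-grams and 10-grams than the scientific one.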

Conclusion

• the first research of this kind performed on texts in the Croatian language

• 6-grams were shown to be the most representative

• final parameters depend on intended application

Further research

• enlarge and refine the text collection

• experiment with different kinds of text editing

  – POS tagging
  – extracting hapax legomena, stop words, labels, direct quotes
• focus on a different level of text
  – characters
  – sentences
  – paragraphs

Possible applications

• determining document originality
  – protection of intellectual property
  – plagiarism detection
• informetric research
  – information dissemination analysis
  – citation analysis

Thank you for your attention!