N-gram Overlap in Automatic Detection of Document Derivation
Siniša Bosanac, Vanja Štefanec
{sbosanac;vstefane}@ffzg.hr
Department of Information Sciences
Faculty of Humanities and Social Sciences
University of Zagreb
Introduction
• problems of originality and authenticity
• increased usage of ICT has intensified the problem
• Derivation
  – relationship between two documents in which the source document was used in creating the target document
• Text reuse
  – process by which content from a source document is reused in the creation of a target document
    • word for word
    • paraphrase
Examples of content derivation
• desirable:
  1) quoting
  2) document updating
  3) relaying sponsored content in news media
  4) automatic and manual summarization
• undesirable:
  1) plagiarism
  2) non-critical relaying of sponsored content in news media
  3) journalistic theft
N-gram overlap
• overlapping of n successive linguistic units
• overlapping in longer n-grams can be an indicator of derivation
• representative n-gram length is language-specific
• fast, robust, low-complexity method
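The idea of "n successive linguistic units" can be sketched as follows. This is a minimal illustration, assuming that the units are whitespace-separated word tokens and that only distinct n-grams (types) are kept; the function name `ngram_types` and the example sentence are hypothetical and do not come from the authors' program.

```python
from typing import List, Set, Tuple


def ngram_types(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-grams (types) of n successive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A document shorter than n tokens yields an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Hypothetical example sentence, tokenized by whitespace.
tokens = "the source document was used in creating the target document".split()
trigrams = ngram_types(tokens, 3)
print(sorted(trigrams))
```

Longer n-grams are less likely to coincide by chance, which is why overlap at higher n is the stronger signal of derivation.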
Measure
• resemblance
• Jaccard similarity coefficient
• n-gram types, not tokens

\[
r(A, B) = \frac{|F(A) \cap F(B)|}{|F(A) \cup F(B)|}
\]

with F(X) the set of n-gram types of document X
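The resemblance measure on this slide can be sketched in a few lines. The sketch assumes that each document has already been reduced to a set of n-gram types; the function name `resemblance`, the handling of two empty sets, and the sample bigrams are assumptions made for illustration.

```python
from typing import Hashable, Set


def resemblance(features_a: Set[Hashable], features_b: Set[Hashable]) -> float:
    """Jaccard coefficient: shared types divided by all distinct types."""
    union = features_a | features_b
    if not union:
        return 0.0  # assumption: two empty documents are treated as unrelated
    return len(features_a & features_b) / len(union)


# Hypothetical bigram types of two short documents.
a = {("text", "reuse"), ("reuse", "is"), ("is", "common")}
b = {("text", "reuse"), ("reuse", "is"), ("is", "rare")}
print(resemblance(a, b))  # 2 shared types out of 4 distinct types -> 0.5
```

Because sets are used, an n-gram repeated many times in one document counts once, which matches the "types, not tokens" choice.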
Methodology
• building the text collection
• performing measurements
• manual classification of compared pairs into derived and non-derived
• ranking of pairs based on resemblance score
Text collection
• total number of documents: 236
• sources:
  – digital repository of the Library of the Faculty of Humanities and Social Sciences
  – Web news sites
  – other Web sources
• document size: 69 – 34,397 tokens
Text collection
• types of documents:
  – 39 diploma papers
  – 42 scientific articles
  – 61 news articles
  – 61 literary columns
  – 35 documents classified as “other”
• topic:
  – library science
  – psychology
Text classification
• topic determined according to:
  – source
  – document title
  – keywords
  – content
• functional style determined according to:
  – classification at source
  – type of document
Measurements
• using a purpose-built program module
• comparison of each document against every other
• comparison according to n-gram length from 1 to 10
• calculating the resemblance score
• number of derivation pairs: 28,203
  – derived: 256
  – non-derived: 27,938
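The measurement loop described above might look roughly like this. It is a sketch under stated assumptions, not the authors' purpose-built module: documents are given as token lists, each unordered pair is scored once (the Jaccard score is symmetric), and the names `compare_all`, `ngram_types` and `jaccard` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def ngram_types(tokens, n):
    """Set of distinct n-grams of n successive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Shared types divided by all distinct types; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_all(docs: Dict[str, List[str]], max_n: int = 10) -> Dict[Tuple[str, str, int], float]:
    """Score every unordered document pair for each n-gram length 1..max_n."""
    scores = {}
    for n in range(1, max_n + 1):
        types = {name: ngram_types(toks, n) for name, toks in docs.items()}
        for left, right in combinations(sorted(docs), 2):
            scores[(left, right, n)] = jaccard(types[left], types[right])
    return scores


# Three hypothetical toy documents.
docs = {
    "a": "one two three four".split(),
    "b": "one two three five".split(),
    "c": "six seven eight nine".split(),
}
scores = compare_all(docs, max_n=3)

# Rank the pairs by bigram resemblance, highest first.
bigram_ranking = sorted(
    ((score, key[:2]) for key, score in scores.items() if key[2] == 2),
    reverse=True,
)
print(bigram_ranking[0])
```

Building the n-gram sets once per document and per n keeps the pairwise step to plain set operations.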
Choosing the most representative n-gram
• finding the F1 maximum on the level of each n-gram
• comparing the F1 measure maxima across all n-gram levels
• determining the resemblance threshold
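Finding the F1 maximum for a single n-gram length can be sketched as a sweep over candidate thresholds. The sketch assumes that a pair is predicted as derived when its resemblance score reaches the threshold and that manual labels are available as booleans; the function names, scores, labels and candidate thresholds are all hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], derived: Sequence[bool], threshold: float) -> float:
    """F1 when every pair scoring at or above the threshold is predicted as derived."""
    tp = sum(1 for s, d in zip(scores, derived) if s >= threshold and d)
    fp = sum(1 for s, d in zip(scores, derived) if s >= threshold and not d)
    fn = sum(1 for s, d in zip(scores, derived) if s < threshold and d)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], derived: Sequence[bool],
                   candidates: List[float]) -> Tuple[float, float]:
    """Return (threshold, F1) with the highest F1 among the candidates."""
    return max(((t, f1_at_threshold(scores, derived, t)) for t in candidates),
               key=lambda item: item[1])


# Hypothetical resemblance scores and manual labels for six pairs.
scores = [0.62, 0.48, 0.41, 0.25, 0.12, 0.05]
derived = [True, True, False, True, False, False]
threshold, f1 = best_threshold(scores, derived, [0.1, 0.2, 0.3, 0.4, 0.5])
print(threshold, round(f1, 3))
```

Repeating the sweep for each n-gram length and comparing the resulting maxima gives both the most representative length and its resemblance threshold.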
Derived pairs – trigrams
[Figure: resemblance [%] (y-axis) by derivation pair No (x-axis), derived pairs, trigrams]
Non-derived pairs – trigrams
[Figure: resemblance [%] (y-axis) by derivation pair No (x-axis), non-derived pairs, trigrams]
Trigrams – F1, precision, recall
resemblance   0.25   0.3    0.35   0.4    0.45   0.5
F1-measure    0.417  0.517  0.583  0.633  0.656  0.651
precision     0.272  0.370  0.457  0.535  0.602  0.634
recall        0.890  0.852  0.803  0.773  0.720  0.667
[Figure: F1-measure, precision and recall (y-axis) by resemblance [%] (x-axis), trigrams]
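The trigram figures can be cross-checked with a short calculation, assuming the standard definition of F1 as the harmonic mean of precision and recall (the slides do not spell the formula out). The values are copied from the trigram table; the variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Trigram values as reported in the table on this slide.
thresholds = [0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
precision = [0.272, 0.370, 0.457, 0.535, 0.602, 0.634]
recall = [0.890, 0.852, 0.803, 0.773, 0.720, 0.667]
reported_f1 = [0.417, 0.517, 0.583, 0.633, 0.656, 0.651]

recomputed = [f1(p, r) for p, r in zip(precision, recall)]
for t, old, new in zip(thresholds, reported_f1, recomputed):
    print(f"{t:.2f}  reported {old:.3f}  recomputed {new:.3f}")

# Threshold with the highest reported F1 for trigrams.
best = max(zip(thresholds, reported_f1), key=lambda item: item[1])
print(best)
```

The recomputed values differ from the reported ones only in the third decimal place, which is consistent with precision and recall having been rounded before publication. Raising the threshold trades recall for precision, and F1 peaks where the two are best balanced.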
General results
n-gram      1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
F1-measure  0.465   0.524   0.656   0.735   0.771   0.820   0.814   0.807   0.777   0.718
[Figure: F1-measure (y-axis) by n-gram length, 1-gram to 10-gram (x-axis)]
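Comparing the F1 maxima across n-gram lengths amounts to picking the largest value in the general results. A minimal sketch, using the reported figures and a hypothetical dictionary name `f1_by_n`:

```python
# Maximum F1 per n-gram length, as reported in the general results.
f1_by_n = {1: 0.465, 2: 0.524, 3: 0.656, 4: 0.735, 5: 0.771,
           6: 0.820, 7: 0.814, 8: 0.807, 9: 0.777, 10: 0.718}

best_n = max(f1_by_n, key=f1_by_n.get)
print(best_n, f1_by_n[best_n])
```

The margin over 7-grams and 8-grams is small, so the choice of n may shift with the collection and the intended application.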
Function words removed
n-gram      1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
F1-measure  0.487   0.611   0.704   0.754   0.801   0.803   0.748   0.712   0.661   0.63
[Figure: F1-measure (y-axis) by n-gram length, 1-gram to 10-gram (x-axis); series: general, no function words; data labels of the no-function-words series: 0.488, 0.611, 0.705, 0.755, 0.801, 0.804, 0.749, 0.713, 0.662, 0.630]
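Removing function words is a preprocessing step applied before the n-grams are built. The sketch below assumes a simple case-insensitive filter; the word list is a tiny illustrative sample of Croatian function words, not the list used in the study, and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Tiny illustrative sample of Croatian function words; not the authors' list.
FUNCTION_WORDS: Set[str] = {"i", "u", "na", "je", "se", "da", "za", "od"}


def remove_function_words(tokens: Iterable[str], stop: Set[str] = FUNCTION_WORDS) -> List[str]:
    """Drop function words (case-insensitively) before n-grams are built."""
    return [t for t in tokens if t.lower() not in stop]


def ngram_types(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of distinct n-grams of n successive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


# Hypothetical example sentence.
tokens = "dokument je nastao na temelju izvora".split()
content = remove_function_words(tokens)
print(content)
print(sorted(ngram_types(content, 2)))
```

After filtering, an n-gram spans content words that were not adjacent in the original text, so an n-gram of a given length covers a longer stretch of the source.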
Functional styles
• five functional styles in Croatian
  – scientific
  – official
  – newspaper and publicistic
  – literary
  – colloquial
• need to differentiate between functional styles?
Differentiating functional styles
n-gram                     1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
scientific                 0.545   0.589   0.762   0.840   0.846   0.855   0.854   0.846   0.835   0.787
publicistic and newspaper  0.622   0.667   0.696   0.691   0.727   0.810   0.754   0.719   0.536   0.423
[Figure: F1-measure (y-axis) by n-gram length, 1-gram to 10-gram (x-axis); series: scientific, publicistic and newspaper]
Conclusion
• the first research of this kind performed on texts in the Croatian language
• 6-grams were shown to be the most representative
• final parameters depend on intended application
Further research
• enlarge and refine the text collection
• experiment with different kinds of text editing
  – POS tagging
  – extracting hapax legomena, stop words, labels, direct quotes
• focus on a different level of text
  – characters
  – sentences
  – paragraphs
Possible applications
• determining document originality
  – protection of intellectual property
  – plagiarism detection
• informetric research
  – information dissemination analysis
  – citation analysis
Thank you for your attention!