A Pipeline for Scalable Text Milad Alshomary Reuse Analysis · A Pipeline for Scalable Text Reuse...

Post on 15-Jun-2020

7 views 0 download

Transcript of A Pipeline for Scalable Text Milad Alshomary Reuse Analysis · A Pipeline for Scalable Text Reuse...

05.07.2018Pipeline for TR extractionMilad Alshomary

A Pipeline for Scalable Text Reuse Analysis

Milad Alshomary

05.07.2018

Bauhaus Universität

1

05.07.2018Pipeline for TR extractionMilad Alshomary

Overview

2

● Motivation

● A Pipeline for Scalable Text Reuse Extraction

● Application on Wikipedia

● Application on Wikipedia and Common Crawl

● Conclusion

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse (TR)Motivation

3

● Quoting

● Verbatim

● Paraphrasing

● Translation

● Summarization

05.07.2018Pipeline for TR extractionMilad Alshomary

TR Detection ApplicationsMotivation

4

METER project(Measuring Text Reuse)

Plagiarism detection

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

Plagiarism detection

5

METER projet(Measuring Text Reuse)

TR Detection Applications

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

6

Plagiarism detectionMETER projet

(Measuring Text Reuse)

TR Detection Applications

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The WorldMotivation

● Digital Encyclopedia

● Collaborative environment

● Giant public source of

information

● Free to use

7

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The WorldMotivation

● Digital Encyclopedia

● Collaborative environment

● Giant public source of

information

● Free to use

8

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The WorldMotivation

● Digital Encyclopedia

● Collaborative environment

● Giant public source of

information

● Free to use

9

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The WorldMotivation

● Digital Encyclopedia

● Collaborative environment

● Giant public source of

information

● Free to use

10

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

Wikipedia vs The World

Quality Flaws

11

Scientific community

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

Wikipedia vs The World

- Web pages = Wikipedia text + advertisements

12

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

Research Questions

➔ What kinds of text reuse occur within Wikipedia?

➔ How much of the web is a copy of Wikipedia content?

➔ How much revenue does this content generate?

13

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

14

➔ What kinds of text reuse occur within Wikipedia?

➔ How much of the web is a copy of Wikipedia content?

➔ How much revenue does this content generate?

Research Questions

05.07.2018Pipeline for TR extractionMilad Alshomary

Motivation

15

➔ What kinds of text reuse occur within Wikipedia?

➔ How much of the web is a copy of Wikipedia content?

➔ How much revenue does this content generate?

Research Questions

05.07.2018Pipeline for TR extractionMilad Alshomary

A Pipeline for Scalable Text Reuse Extraction

16

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse Pipeline

TR PipelineD1

D2

➔ Input: Two datasets

➔ Output: Text reuse

cases

17

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse Pipeline

TR PipelineD1

D2

➔ Input: Two datasets

➔ Output: Text reuse

cases

18

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse Pipeline

Text Preprocessing

Candidate Elimination

Text Alignment

19

➔ Content extraction

➔ Chunking

➔ Feature extraction

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse Pipeline

Text Preprocessing

Candidate Elimination

Text Alignment

20

➔ Content extraction

➔ Chunking

➔ Feature extraction

➔ Pairwise scan

➔ Text Reuse heuristics

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse Pipeline

Text Preprocessing

Candidate Elimination

Text Alignment

➔ Content extraction

➔ Chunking

➔ Feature extraction

➔ Pairwise scan

➔ Text Reuse heuristics

➔ Detailed scan of text

reuse

➔ Picapica framework

21

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Text Preprocessing

Candidate Elimination

Text Alignment

Keys for scaling-up:

➔ Cluster computing

➔ Heuristics based candidate elimination

algorithms

22

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Text Preprocessing

Candidate Elimination

Text Alignment

Keys for scaling-up:

➔ Cluster computing

➔ Heuristics based candidate elimination

algorithms

23

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

For a candidacy function we proposed

the following methods:

- Cosine similarity of TF-IDF (semantic)

- Paragraph embedding (semantic)

- Stopwords N-grams (structure)

- Weighted average of Stopwords Ngrams and

Paragraph embedding (semantic + structure)

24

d2n

D1 D2

candidacy(d11, d21) → [0, 1]

d11 d21

d22d12

d1n

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Wikipedia Document Sample

Text alignment using picapica framework

TR sample

Sample 1k documents

Generate TR Sample from Wikipedia:

- Sample 1k documents from

Wikipedia

- Using Picapica framework to find

TR cases

25

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Wikipedia Document Sample

Text alignment using picapica framework

TR sample

Sample 1k documents

Generate TR Sample from Wikipedia:

- Sample 1k documents from

Wikipedia

- Using Picapica framework to find

TR cases

26

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Wikipedia Document Sample

Text alignment using picapica framework

TR sample

Sample 1k documents

Generate TR Sample from Wikipedia:

- Sample 1k documents from

Wikipedia

- Using Picapica framework to find

TR cases

27

- 232 documents

- ~ 90% have < 10 alignements (TR case)

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

TR sampleEvaluation of “candidacy” function:

- For each document in TR sample:

- Sort all Wikipedia articles

according to the proposed

“candidacy” .

- Precision/Recall on

Thresholds of [1, 101,..,100k]

- A True Positive (TP) is a pair of

documents that have TR.

28

T1 T2 T3

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

TR sampleEvaluation of “candidacy” function:

29

T1 T2 T3

r1r2

p1

p2

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Semantic hashing function:

- Hashes documents into binary

hashes.

- Similar documents get similar or

exact binary hash.

30

011001 011001

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Semantic hashing function:

- Hashing all documents.

- Inverted index.

- Hash document’s chunks.

- Apply candidacy function only on

documents that intersect in one

hash at least.001001

011001

001000

Inverted index

011001

011001

D1D2

31

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

001001

011001

001000

Inverted index

011001

011001

D1D2

32

Semantic hashing function:

- Hashing all documents.

- Inverted index.

- Hash document’s chunks.

- Apply candidacy function only on

documents that intersect in one

hash at least.

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

001001

011001

001000

Inverted index

011001

011001

D1D2

33

Semantic hashing function:

- Hashing all documents.

- Inverted index.

- Hash document’s chunks.

- Apply candidacy function only on

documents that intersect in one

hash at least.

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

001001

011001

001000

Inverted index

011001

011001

D1D2

34

Semantic hashing function:

- Hashing all documents.

- Inverted index.

- Hash document’s chunks.

- Apply candidacy function only on

documents that intersect in one

hash at least.

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Proposed semantic hashing methods:

- Random Projection (data

independent)

- Variational Deep Semantic

Hashing (data dependent)

35

di

dj

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Proposed semantic hashing methods:

- Random Projection (data

independent)

- Variational Deep Semantic

Hashing (data dependent)

36

di001

100

dj

001

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Proposed semantic hashing methods:

- Random Projection (data

independent)

- Variational Deep Semantic

Hashing (data dependent)

37

Learning

VDSH

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Transform

011001

38

Learning

VDSH VDSH

Proposed semantic hashing methods:

- Random Projection (data

independent)

- Variational Deep Semantic

Hashing (data dependent)

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

39

Hashing methods evaluation:

- Using same TR sample for

evaluation.

- Hashing all documents using the

proposed hashing function.

- Compute precision and recall.

TR sample

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

40

Hashing methods evaluation:

- Using same TR sample for

evaluation.

- Hashing all documents using the

proposed hashing function.

- Compute precision and recall.

TR sample

101

001

111

101 101 101

110000

100

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

41

Hashing methods evaluation:

- Using same TR sample for

evaluation.

- Hashing all documents using the

proposed hashing function.

- Compute precision and recall.

TR sample

101

001

111

101 101 101

110000

100

Precision = 2/3 Recall = 1.0

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

Random projection

bits precision recall

…. 8 3.1 x 10-4 0.8741

…. 16 9.9 x 10-4 0.324

VDSH bits precision recall

…. 8 2.8 x 10-4 0.88

…. 16 4.5 x 10-3 0.73

42

Hashing methods evaluation

- Using same TR sample for

evaluation.

- Hashing all documents using the

proposed hashing function.

- Compute precision and recall.

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Candidate Elimination

VDSH bits precision recall

8 2.8 x 10-4 0.88

16 4.5 x 10-3 0.73

● Retains 73% of the recall● By experiment:

○ Reduces the computations needed by 3 order of magnitude

43

Hashing methods evaluation

- Using same TR sample for

evaluation.

- Hashing all documents using the

proposed hashing function.

- Compute precision and recall.

A Pipeline for Scalable Text Reuse Extraction

05.07.2018Pipeline for TR extractionMilad Alshomary

Application on Wikipedia

44

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse In WikipediaApplication on Wikipedia

➔ What kinds of text reuse occur within Wikipedia?

➔ How much of the web is a copy of Wikipedia content?

➔ How much revenue does this content generate?

45

05.07.2018Pipeline for TR extractionMilad Alshomary

Text Reuse In WikipediaApplication on Wikipedia

100 million text reuse

TR Pipeline

Wikipedia

Wikipedia

46

Wikipedia Articles

360k Wikipedia Article

05.07.2018Pipeline for TR extractionMilad Alshomary

What kinds of text reuse occur in

Wikipedia?

- Reasons behind text reuse:

(1) Two texts describe the same

topic.

(2) Two texts describe two

different topics, that share similar

characteristics

47

Text Reuse In WikipediaApplication on Wikipedia

05.07.2018Pipeline for TR extractionMilad Alshomary

What kinds of text reuse occur in

Wikipedia?

- Reasons behind text reuse:

(1) Two texts describe the same

topic.

Text Reuse

Structure Text Reuse

Content Text Reuse

48

Text Reuse In WikipediaApplication on Wikipedia

05.07.2018Pipeline for TR extractionMilad Alshomary

What kinds of text reuse occur in

Wikipedia?

- Reasons behind text reuse:

- Tow texts describing same

topic.

Text Reuse

Structure Text Reuse

Content Text Reuse

49

Text Reuse In WikipediaApplication on Wikipedia

05.07.2018Pipeline for TR extractionMilad Alshomary

What kinds of text reuse occur in

Wikipedia?

- Reasons behind text reuse:

(2) Two texts describe two

different topics, that share similar

characteristics

Text Reuse

Structure Text Reuse

Content Text Reuse

50

Text Reuse In WikipediaApplication on Wikipedia

05.07.2018Pipeline for TR extractionMilad Alshomary

What kinds of text reuse occur in

Wikipedia?

- Reasons behind text reuse:

(2) Two texts describe two

different topics, that share similar

characteristics

Text Reuse

Structure Text Reuse

Content Text Reuse

51

Text Reuse In WikipediaApplication on Wikipedia

05.07.2018Pipeline for TR extractionMilad Alshomary 52

Text Reuse In WikipediaApplication on Wikipedia

- Vertical alignment → Content TR

- Horizontal alignment → Structure TR

Vertical relation Horizontal relation

05.07.2018Pipeline for TR extractionMilad Alshomary 53

Text Reuse In WikipediaApplication on Wikipedia

- Vertical alignment → Content TR

- Horizontal alignment → Structure TR

05.07.2018Pipeline for TR extractionMilad Alshomary 54

Text Reuse In WikipediaApplication on Wikipedia

- Vertical alignment → Content TR

- Horizontal alignment → Structure TR

05.07.2018Pipeline for TR extractionMilad Alshomary

Application on Wikipedia and Common Crawl

55

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The WebApplication on Wikipedia and Common Crawl

56

➔ What kinds of text reuse occur within Wikipedia?

➔ How much of the web is a copy of Wikipedia content?

➔ How much revenue does this content generate?

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

WWW

- Crawling

Extracted web content

- Content extraction

- Keeping only english

pages

Web Sample

10% random

sample

57

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

WWW

- Crawling

Extracted web content

- Content extraction

- Keeping only english

pages

Web Sample

10% random

sample

- 59 million web pages.

- 1.4 million websites.

- 70% of these websites

contains less than 10 web

pages

Number of web pages

Num

ber

of w

ebsi

tes

58

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

TR Pipeline

Web Sample

Wikipedia

- 1.6 million text reuse cases.

- 15k pages reuse Wikipedia text.

- 4.8k websites reuse Wikipedia text.

59

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Monthly revenue estimation:

- Rough estimate of Ads revenue

- Based on CPM (Cost Per Millie)

- Sampled 100 webpages and

manually checked the existence of

Advertisements.

60

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web pagewebsite Monthly

revenue Percentage of reuse Monthly Wikipedia value

pdxretro.com $195 0.012 $2.5

seqrchquarry.com $8,850 0.096 $850

asiatees.com $36,000 0.017 $613

…. ….. ….. ….

Total $1.2 million

61

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web pagewebsite Monthly

revenue Percentage of reuse Monthly Wikipedia value

pdxretro.com $195 0.012 $2.5

seqrchquarry.com $8,850 0.096 $850

asiatees.com $36,000 0.017 $613

…. ….. ….. ….

Total $1.2 million

62

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web pagewebsite Monthly

revenue Percentage of reuse Monthly Wikipedia value

pdxretro.com $195 0.012 $2.5

seqrchquarry.com $8,850 0.096 $850

asiatees.com $36,000 0.017 $613

…. ….. ….. ….

Total $1.2 million

63

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web pagewebsite Monthly

revenue Percentage of reuse Monthly Wikipedia value

pdxretro.com $195 0.012 $2.5

seqrchquarry.com $8,850 0.096 $850

asiatees.com $36,000 0.017 $613

…. ….. ….. ….

Total $1.2 million

64

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web pagewebsite Monthly

revenue Percentage of reuse Monthly Wikipedia value

pdxretro.com $195 0.012 $2.5

seqrchquarry.com $8,850 0.096 $850

asiatees.com $36,000 0.017 $613

…. ….. ….. ….

Total $1.2 million

The rough estimate of monthly revenue of Wikipedia content

65

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page

- Percentage of pages reusing Wikipedia >= 0.5

- 87 websites.

- Estimated monthly revenue: $15k

66

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Extracted from Wikipedia API

67

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page Reused Wikipedia page

Average page views Average CPM Average monthly

revenue

Nuclear renaissance 645 $2.8 $1.806

Second Chechen War 34655 $2.8 $97

Enumerated powers 12858 $2.8 $36

…. ….. ….. ….

Total $900k

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Estimated from marketing reports

68

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page Reused Wikipedia page

Average page views Average CPM Average monthly

revenue

Nuclear renaissance 645 $2.8 $1.806

Second Chechen War 34655 $2.8 $97

Enumerated powers 12858 $2.8 $36

…. ….. ….. ….

Total $900k

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

69

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page Reused Wikipedia page

Average page views Average CPM Average monthly

revenue

Nuclear renaissance 645 $2.8 $1.806

Second Chechen War 34655 $2.8 $97

Enumerated powers 12858 $2.8 $36

…. ….. ….. ….

Total $900k

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Reused Wikipedia page

Average page views Average CPM Average monthly

revenue

Nuclear renaissance 645 $2.8 $1.806

Second Chechen War 34655 $2.8 $97

Enumerated powers 12858 $2.8 $36

…. ….. ….. ….

Total $900k

70

Revenue estimation:

- Per website (all websites)

- Per website (highly reusing)

- Per Wikipedia web page

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

71

Monthly revenue:

Application on Wikipedia and Common Crawl

Per Web sample Number of reusing web pages Revenue(per webpage)

59 million 15k $900k

590 million 150k $9 million

05.07.2018Pipeline for TR extractionMilad Alshomary

Wikipedia vs The Web

Per Web sample Number of reusing web pages Revenue(per webpage)

59 million 15k $900k

590 million 150k $9 million

72

Monthly revenue:

Application on Wikipedia and Common Crawl

05.07.2018Pipeline for TR extractionMilad Alshomary

Conclusion

73

05.07.2018Pipeline for TR extractionMilad Alshomary

Summaryconclusion

74

● Pipeline for TR Extraction

● Text Reuse in Wikipedia

● Text Reuse between

Wikipedia and the Web

Text Preprocessing

Candidate Elimination Text Alignment

Text Reuse

Structure Text Reuse

Content Text Reuse

Per website (all websites)

Per website (highly reuse)

Per Webpage

$1.2 million $15k $900k

05.07.2018Pipeline for TR extractionMilad Alshomary

Future Workconclusion

TR Pipeline

Wikipedia

?

75

● Using the pipeline to extract and analyze

TR between Wikipedia and the scientific

community.

● Experiments on the Text Alignment

subtask.

● Further analysis of the extracted Text

Reuse cases.

● More accurate estimation on the

monthly revenue generated by

Wikipedia content.

05.07.2018Pipeline for TR extractionMilad Alshomary

ConclusionFuture Work

TR Pipeline

Wikipedia

?● Using the pipeline to extract and analyze

TR between Wikipedia and the scientific

community.

● Experiments on the Text Alignment

subtask.

● Further analysis of the extracted Text

Reuse cases.

● More accurate estimation on the

monthly revenue generated by

Wikipedia content.

Text Preprocessing

Candidate Elimination Text Alignment

76

05.07.2018Pipeline for TR extractionMilad Alshomary

ConclusionFuture Work

TR Pipeline

Wikipedia

?

Text Preprocessing

Candidate Elimination Text Alignment

77

● Using the pipeline to extract and analyze

TR between Wikipedia and the scientific

community.

● Experiments on the Text Alignment

subtask.

● Further analysis of the extracted Text

Reuse cases.

● More accurate estimation on the

monthly revenue generated by

Wikipedia content.

05.07.2018Pipeline for TR extractionMilad Alshomary

ConclusionFuture Work

TR Pipeline

Wikipedia

?

Text Preprocessing

Candidate Elimination Text Alignment

78

● Using the pipeline to extract and analyze

TR between Wikipedia and the scientific

community.

● Experiments on the Text Alignment

subtask.

● Further analysis of the extracted Text

Reuse cases.

● More accurate estimation on the

monthly revenue generated by

Wikipedia content.

05.07.2018Pipeline for TR extractionMilad Alshomary

Backup Slides

79

05.07.2018Pipeline for TR extractionMilad Alshomary 80

- Candidate Elimination functions:

05.07.2018Pipeline for TR extractionMilad Alshomary 81

- Stopwords N-grams procedure:

Wiki paragraphs stopwords stopword

ngrams

Extract stop words generate n-grams filtered stopword ngrams

Top 50 frequent stopwords:

the, of, and, a, in, to,is, was, it, for, with, he,

be, on, i, that, by, at, you, 's, are, not,his,

this, from, but, had, which, she, they, or, an,

were, we, their, been, has, have, will, would,

her, there, can, all,as, if, who, what, said

filter n-grams

- Let C = {the, of, and, a, in, to, ’s} stopwords that increases false positive.

- X is accepted n-gram if:- It doesn’t contain more than n-1

stopwords from C- The maximal sequence of stopwords

belonging to C is less than n-2

binary count vector

- Binary count vector ignores the frequency in which a specific n-gram happened in a paragraph.

- We apply the scoring function on the binary count vector

05.07.2018Pipeline for TR extractionMilad Alshomary 82

- VDSH explained:

VDSH USAGE

05.07.2018Pipeline for TR extractionMilad Alshomary 83

- Performance of candidacy functions on different thresholds:

Documents from sample who have number of aligned docs <= 10

Documents from sample who have number of aligned docs > 10

Thresholds between (1 to 1000 and step of 5)

RECALLRECALL

Prec

isio

n

Prec

isio

n

05.07.2018Pipeline for TR extractionMilad Alshomary 84

- Performance of candidacy functions on different thresholds:

Documents from sample who have number of aligned docs <= 10

Documents from sample who have number of aligned docs > 10

Thresholds between (1 to 1000 and step of 5)

RECALLRECALL

Prec

isio

n

Prec

isio

n

05.07.2018Pipeline for TR extractionMilad Alshomary 85

- Candidate Elimination procedure over the cluster:

05.07.2018Pipeline for TR extractionMilad Alshomary 86

- Hash based Candidate Elimination procedure over the cluster:

05.07.2018Pipeline for TR extractionMilad Alshomary 87

- Hash based Candidate Elimination procedure over the cluster:

05.07.2018Pipeline for TR extractionMilad Alshomary 88

- Heuristics:

- H1: ne_sim ∈ (0.5, 1.0] AND 10grams_sim > 0.5 AND (s_percent_reused < 0.5 or

t_percent_reused < 0.5) => content reuse otherwise structure reuse

- 6700 content reuse cases only

- Validation on two random samples of size 100:

Structure reuse Content reuse

Sample1 100% 58%

Sample2 (Text1 or Text2 > 200)

100% 73%