Crawling and Aligning Scholarly Presentations and Documents from the Web By SARAVANAN.S 09/09/2011...


Crawling and Aligning Scholarly Presentations and Documents from the Web

By SARAVANAN.S
09/09/2011

Under the guidance of A/P Min-Yen Kan


Motivation

• More articles, more users
• Searching for documents is difficult
• Aim: Find pairs of presentations and documents automatically


System Architecture

• Query: a "file type" operator (PDF, PPT or PS) is added to the user query before it is sent to Google.
• Search Engine Wrapper: Yee Fan’s Search Engine Wrapper is used, with only its Google subsystem.
• Re-Ranking: applied to the top results from Google.
• Output (3-way):
1. Exact URL
2. Message for non-free files
3. No result
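The query step can be sketched as follows. This is a minimal illustration, assuming Google's `filetype:` search operator is what the slide means by the "file type" operator; the function name `build_queries` is hypothetical and the original system sent its queries through the search engine wrapper rather than building strings like this.

```python
# Hypothetical helper: append a file-type restriction to the user query.
FILE_TYPES = ("pdf", "ppt", "ps")  # the three types named on the slide


def build_queries(user_query, file_types=FILE_TYPES):
    """Return one query string per file type, e.g. 'some title filetype:pdf'."""
    return [f"{user_query} filetype:{file_type}" for file_type in file_types]


for query in build_queries("finding related pages in the world wide web"):
    print(query)
```

Issuing one query per file type keeps presentation and document results separate, which may make the later pairing step simpler.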


Methodology (1)

Re-Ranking
• The similarity between the user query and each retrieved document is computed and used for re-ranking.
• Similarity is computed with the Jaccard coefficient and Bilingual Evaluation Understudy (BLEU).
• A threshold value keeps the system from considering documents with low similarity scores.

Pipeline: Google’s top results → similarity score computation → re-ranking of results based on similarity score

Similarity is computed between the query title and each Google result’s title, snippet, and URL.
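The re-ranking step might look roughly like the sketch below. The result dictionaries, the `rerank` function and the placeholder `token_overlap` similarity are all assumptions made for illustration; in the presentation the similarity is Jaccard or BLEU.

```python
def token_overlap(query, candidate):
    """Placeholder similarity: fraction of query words found in the candidate."""
    query_words = set(query.lower().split())
    candidate_words = set(candidate.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & candidate_words) / len(query_words)


def rerank(query_title, results, similarity, threshold=0.5, field="title"):
    """Score each result, drop those below the threshold, sort best first."""
    scored = []
    for result in results:
        score = similarity(query_title, result[field])
        if score >= threshold:
            scored.append((score, result))
    # Sort on the score only, so result dictionaries are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


results = [
    {"title": "related pages on the web", "url": "b"},
    {"title": "cooking recipes", "url": "c"},
    {"title": "finding related pages in the world wide web", "url": "a"},
]
query = "finding related pages in the world wide web"
for score, result in rerank(query, results, token_overlap):
    print(round(score, 2), result["url"])
```

The `field` argument reflects the slide's choice of comparing against the title, snippet, or URL of each result.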


Jaccard Measure

• The Jaccard measure is used to compute similarity between the query title and the Google result’s title, snippet, and URL.
• Simple word-by-word matching.

Problems are:
• Snippets have more words than titles.
• The union in Jaccard increases while the intersection remains the same.

Sentence 1: Finding related pages in the world wide web.

Sentence 2: Finding Related pages using the Link structure of the WWW.
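For the two sentences above, the Jaccard coefficient can be computed as in this minimal sketch. The tokenization (lowercasing and keeping only letters and digits) is an assumption; the slides do not say how words were split.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard coefficient: size of intersection over size of union."""
    words_a, words_b = tokens(a), tokens(b)
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


s1 = "Finding related pages in the world wide web."
s2 = "Finding Related pages using the Link structure of the WWW."
print(round(jaccard(s1, s2), 4))  # 4 shared words out of 13 distinct words
```

Under this tokenization the shared words are "finding", "related", "pages" and "the", giving 4/13. Comparing the same query against a long snippet adds words to the union without adding to the intersection, which is the problem the slide describes.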


BLEU metric

Why BLEU?
• n-gram similarity of words.
• Helps in assessing the sequential order of the words when finding similarity between two sets.
• Sequential order of words matters with snippets, where query terms may appear in random positions.
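To show why n-grams capture word order, here is a simplified BLEU-style score: the geometric mean of clipped unigram and bigram precision, times the usual brevity penalty. It is a sketch under assumptions, not the implementation used in the presentation; in particular, limiting n to 2 and treating the query title as the reference are choices made here for illustration.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count the n-grams (as tuples) in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified BLEU with a single reference and n-grams up to max_n."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_sum / max_n)


print(bleu("finding related pages", "finding related pages"))
print(bleu("pages related finding", "finding related pages"))
```

The second call contains the same words in reverse order: a word-set measure such as Jaccard would score it as a perfect match, while no bigram matches here, so the score drops to zero.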


Rules

Special rules are used for better matching:

Rule 1: Removing special symbols (On/Off)
Rule 2: Stop-word removal (On/Off)
Rule 3: URL filter by .edu (On/Off)
Rule 4: Stemming (Porter stemming algorithm) (On/Off)

All these rules are used with both methodologies.
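The four On/Off rules could be wired up as switches, roughly as sketched here. The stop-word list is a tiny illustrative sample, the function names are hypothetical, and stemming is left as an optional callable so the block stays dependency-free; a Porter stemmer such as NLTK's `PorterStemmer().stem` could be passed in if that library is available.

```python
import re
from urllib.parse import urlparse

# Illustrative sample only; the stop-word list actually used is not given.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to"}


def preprocess(text, remove_symbols=False, remove_stop_words=False, stemmer=None):
    """Apply Rule 1, Rule 2 and Rule 4 to a title or snippet; return its words."""
    text = text.lower()
    if remove_symbols:  # Rule 1: strip special symbols
        text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = text.split()
    if remove_stop_words:  # Rule 2: drop stop words
        words = [w for w in words if w not in STOP_WORDS]
    if stemmer is not None:  # Rule 4: stem each word with the given callable
        words = [stemmer(w) for w in words]
    return words


def is_edu(url):
    """Rule 3: keep only URLs whose host name ends in .edu."""
    host = urlparse(url).hostname or ""
    return host.endswith(".edu")


print(preprocess("Finding pages in the Web!", remove_symbols=True, remove_stop_words=True))
print(is_edu("https://www.cs.cornell.edu/paper.pdf"))
```

Keeping each rule behind its own flag makes it straightforward to run all sixteen On/Off combinations.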


Methodology

MIME-types:
• To differentiate free PDFs from subscription-only ones, I used the MIME type, i.e. the content type returned for the URL.

Dataset collection:
Queries from:
• Computer science
• Medical science
• Architecture
• Mathematics
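The MIME-type check described above might be implemented along these lines. This is a sketch under assumptions: it supposes that a subscription-only link answers with an HTML landing page rather than the file itself, and the set of accepted media types and all function names are chosen here for illustration.

```python
from urllib.request import Request, urlopen

# Media types treated as "the file itself" (an assumption for this sketch).
DOCUMENT_TYPES = {
    "application/pdf",
    "application/postscript",
    "application/vnd.ms-powerpoint",
}


def is_document_type(content_type):
    """True if a Content-Type header value names one of the document types."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in DOCUMENT_TYPES


def fetch_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header ('' if absent)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


def is_free_file(url):
    """Treat network errors as 'not freely available'."""
    try:
        return is_document_type(fetch_content_type(url))
    except OSError:
        return False


print(is_document_type("application/pdf"))
print(is_document_type("text/html; charset=UTF-8"))
```

A HEAD request avoids downloading the file body, although some servers may not answer it the same way as a GET.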


Experiment

• Experiments on
– Jaccard measure (all special rules are tested with On/Off).
– BLEU measure (all special rules are tested with On/Off).
– Query set with about 50 queries.
– Threshold is varied from 0.1 to 1.0 for all experiments.
– The highest recall at a high threshold is considered.

• Experiment results
– Jaccard similarity.
– BLEU similarity.
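A threshold sweep of the kind listed above can be sketched as follows. The data layout (a dictionary of similarity scores and a set of correct items) and the tie-breaking rule in `best_threshold` are assumptions; the slides only state that the highest recall at a high threshold is preferred.

```python
def evaluate(scores, relevant, threshold):
    """Return (precision, recall, F-score) for items scoring >= threshold."""
    retrieved = {item for item, score in scores.items() if score >= threshold}
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def sweep(scores, relevant):
    """Evaluate every threshold from 0.1 to 1.0 in steps of 0.1."""
    return {t / 10: evaluate(scores, relevant, t / 10) for t in range(1, 11)}


def best_threshold(results):
    """Pick the highest recall; among ties, prefer the higher threshold."""
    return max(results, key=lambda t: (results[t][1], t))


# Toy data for illustration only, not from the experiments.
scores = {"a": 0.9, "b": 0.65, "c": 0.35, "d": 0.25}
relevant = {"a", "b"}
results = sweep(scores, relevant)
print(best_threshold(results), results[best_threshold(results)])
```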


Experiment result of Jaccard

Rule 1  Rule 2  Rule 3  Rule 4  Google Target  Threshold  Precision  Recall  F-Score
On      Off     Off     Off     Title          0.7        0.9818     0.9152  0.9473
Off     On      Off     Off     Title          0.6        0.7857     0.3728  0.5057
Off     Off     On      Off     Title          0.5        0.9272     0.8644  0.8947
Off     Off     Off     On      Title          0.5        0.9107     0.8644  0.8869
On      On      Off     Off     Title          0.6        0.7857     0.3728  0.5057
Off     On      On      Off     Title          0.5        0.6641     0.3728  0.5057
Off     Off     On      On      Title          0.4        0.9107     0.8644  0.8869
On      Off     Off     On      Title          0.5        0.9107     0.8644  0.8869
On      On      On      Off     Title          0.5        0.7857     0.3728  0.5057
Off     On      On      On      Title          0.5        0.7667     0.3898  0.5168
On      Off     On      On      Title          0.4        0.9107     0.8644  0.8869
On      On      Off     On      Title          0.7        0.8214     0.3898  0.5287
On      On      On      On      Title          0.5        0.9310     0.4576  0.6136
Off     Off     Off     Off     Title          0.7        0.9259     0.8474  0.8849
On      Off     On      Off     Title          0.5        0.9272     0.8644  0.8947
Off     On      Off     On      Title          0.3        0.9107     0.8644  0.8869

Best F-score achieved: 0.9473 (Rule 1 On only, threshold 0.7)
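As a worked check of the best Jaccard row, and assuming the F-Score column is the balanced F1 (the harmonic mean of precision P and recall R), the reported value follows from the precision and recall in that row:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.9818 \times 0.9152}{0.9818 + 0.9152}
  = \frac{1.7971}{1.8970}
  \approx 0.9473
```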


Experiment result of BLEU

#   Rule 1  Rule 2  Rule 3  Rule 4  Google Target  Threshold  Precision  Recall  F-Score
1   On      Off     Off     Off     Title          0.5        0.9272     0.8644  0.8947
2   Off     On      Off     Off     Title          0.3        0.7667     0.3898  0.5168
3   Off     Off     On      Off     Title          0.6        0.9412     0.8136  0.8727
4   Off     Off     Off     On      Title          0.5        0.9130     0.7118  0.8
5   On      On      Off     Off     Title          0.8        0.8148     0.3729  0.5116
6   Off     On      On      Off     Title          0.5        0.7586     0.3729  0.5
7   Off     Off     On      On      Title          0.5        0.9130     0.7119  0.8
8   On      Off     Off     On      Title          0.8        0.9074     0.8305  0.8672
9   On      On      On      Off     Title          0.8        0.8148     0.3728  0.5116
10  Off     On      On      On      Title          0.2        0.7586     0.3729  0.5
11  On      Off     On      On      Title          0.7        0.9074     0.8305  0.8672
12  On      On      Off     On      Title          0.8        0.8148     0.3728  0.5116
13  On      On      On      On      Title          0.6        0.8076     0.3559  0.4941
14  Off     Off     Off     Off     Title          0.5        0.9090     0.8478  0.8772
15  On      Off     On      Off     Title          0.5        0.9107     0.8644  0.8899
16  Off     On      Off     On      Title          0.5        0.76       0.3220  0.4523

Best F-score achieved: 0.8947 (row 1, Rule 1 On only, threshold 0.5)


Related Work

Base Reference:
– SlideSeer: a digital library of aligned document and presentation pairs [Kan, JCDL’07].
– Learning to Rank for Information Retrieval [Liu et al., WWW’09].
– Kairos: Proactive Harvesting of Research Paper Metadata from Scientific Conference Web Sites [Hänse, ICADL’09].

Approaches to Similarity Computation:
– BLEU: a Method for Automatic Evaluation of Machine Translation [Papineni et al., ACL July’02].
– BLEU algorithm for evaluation machine translations implementation [Payson et al.].


Conclusion

• Matching documents based on similarity score
• Jaccard measure
-- Jaccard similarity computed over the query title and document title, with the special-symbol removal rule, retrieves the best articles.
-- Threshold: 0.7
-- F-score: 0.9473
• BLEU metric
-- BLEU similarity computed over the query title and document title, with the special-symbol removal rule, retrieves the best articles.
-- Threshold: 0.5
-- F-score: 0.8947


Thank you

Comments are welcome
