A Method for Focused Crawling Using Combination of Link Structure and Content Similarity SeyedMohsen...

  • date post

    20-Dec-2015

Transcript of A Method for Focused Crawling Using Combination of Link Structure and Content Similarity SeyedMohsen...

Page 1: A Method for Focused Crawling Using Combination of Link Structure and Content Similarity SeyedMohsen (Mohsen) Jamali m_jamali@ce.sharif.edu.

A Method for Focused Crawling Using Combination of Link Structure and Content Similarity

SeyedMohsen (Mohsen) Jamali

m_jamali@ce.sharif.edu

Page 2:

Introduction

The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines.

A focused crawler aims to selectively seek out pages that are relevant to a pre-defined set of topics.

A focused crawler entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate.
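To make the idea of selectively seeking out relevant pages concrete, here is a minimal best-first sketch in Python. It is not the method presented in these slides: the toy `graph`, the `relevance` scores, and the function name `focused_crawl` are all hypothetical, and the scores are given in advance for simplicity, whereas a real crawler would have to estimate the relevance of a link before fetching it.

```python
import heapq


def focused_crawl(graph, relevance, seed, budget):
    """Visit up to `budget` pages, expanding the most relevant known page first."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-relevance[seed], seed)]
    seen = {seed}
    order = []
    while frontier and len(order) < budget:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance[link], link))
    return order


# Toy web: page -> outgoing links (hypothetical data).
graph = {
    "seed": ["a", "b", "c"],
    "a": ["d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": [],
}
# Hypothetical topic-relevance scores in [0, 1].
relevance = {"seed": 0.5, "a": 0.9, "b": 0.1, "c": 0.4, "d": 0.8, "e": 0.2}

crawl_order = focused_crawl(graph, relevance, "seed", budget=4)
print(crawl_order)
```

With a budget of four pages, the low-scoring pages `b` and `e` are never fetched, which is the saving in network resources that the slide describes.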

Page 3:

Crawler Architecture

Page 4:

Crawler Architecture (cont)

Page 5:

Content Similarity Measure

Page 6:

Evaluations

1. Precision

2. Recall

We ran the algorithm twice, once with a good hub for the topic and once with a general page.

We compared both results with an ordinary BFS crawler.
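The two evaluation measures can be sketched as follows, using the standard information-retrieval definitions (the slide does not spell them out, so this is an assumption): precision is the fraction of crawled pages that are relevant, and recall is the fraction of relevant pages that were crawled. The page identifiers and the helper name `precision_recall` are hypothetical.

```python
def precision_recall(crawled, relevant):
    """Return (precision, recall) for a crawl against a known relevant set."""
    crawled, relevant = set(crawled), set(relevant)
    hits = crawled & relevant
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(crawled) if crawled else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four crawled pages, three truly relevant pages.
crawled = ["p1", "p2", "p3", "p4"]
relevant = ["p1", "p3", "p5"]

precision, recall = precision_recall(crawled, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same function could be applied to the output of the focused crawler and of the BFS baseline to compare them on equal terms.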

Page 7:

Experimental Results

Page 8:

Experimental Results (cont)

Page 9:

Experimental Results (cont)

Page 10:

Experimental Results (cont)

TCP: Total Crawled Pages, RPC: Related Pages' Count
RCT: Relative Crawling Time, AHR: Average Harvest Rate

AHR: the mean of harvest rates in each segment
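As an illustration of the AHR definition, the sketch below assumes that the harvest rate of a segment is the fraction of its pages that are related to the topic, and that a segment is a fixed-size run of consecutively crawled pages. Neither assumption is stated on the slide, and the flag data and function names are hypothetical.

```python
def harvest_rates(related_flags, segment_size):
    """Harvest rate (share of related pages) for each consecutive segment."""
    rates = []
    for start in range(0, len(related_flags), segment_size):
        segment = related_flags[start:start + segment_size]
        rates.append(sum(segment) / len(segment))
    return rates


def average_harvest_rate(related_flags, segment_size):
    """AHR: the mean of the per-segment harvest rates."""
    rates = harvest_rates(related_flags, segment_size)
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical crawl log in crawl order: 1 = related page, 0 = unrelated page.
flags = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]

tcp = len(flags)   # Total Crawled Pages
rpc = sum(flags)   # Related Pages' Count
rates = harvest_rates(flags, segment_size=4)
ahr = average_harvest_rate(flags, segment_size=4)
print(tcp, rpc, rates, round(ahr, 3))
```

When all segments have the same size, AHR equals RPC / TCP; the two differ only if the last segment is shorter than the others.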

Page 11:

Experimental Results (cont)

Page 12:

THE END