A Method for Focused Crawling Using Combination of Link Structure and Content Similarity SeyedMohsen...
Date posted: 20-Dec-2015
A Method for Focused Crawling Using Combination of Link Structure and Content Similarity
SeyedMohsen (Mohsen) Jamali
Introduction
The rapid growth of the World Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines.
A focused crawler aims to selectively seek out pages that are relevant to a predefined set of topics.
A focused crawler entails a very small investment in hardware and network resources, yet achieves respectable coverage at a rapid rate.
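To make the idea of selectively seeking out relevant pages concrete, here is a minimal sketch of a best-first crawl over a toy in-memory link graph. It is not the method presented in these slides: the `relevance` score (fraction of topic terms found in the page text), the `pages` and `links` dictionaries, and the rule that a link inherits the score of the page it was found on are all assumptions made for illustration.

```python
import heapq

def relevance(text, topic_terms):
    """Hypothetical score: fraction of topic terms present in the page text.
    Assumes topic_terms is non-empty."""
    words = set(text.lower().split())
    return sum(1 for t in topic_terms if t in words) / len(topic_terms)

def focused_crawl(pages, links, seed, topic_terms, budget):
    """Best-first crawl: always expand the most promising known URL next.

    pages: url -> page text, links: url -> outgoing urls (toy stand-ins
    for real fetching). A link's priority is the relevance of the page
    on which it was discovered (an assumption of this sketch).
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(pages[url], topic_terms)
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                heapq.heappush(frontier, (-score, out))
    return visited

# Toy data, invented for this example.
pages = {
    "seed": "sports news and music portal",
    "a": "football league results",
    "b": "pop music charts",
    "c": "football transfer rumours",
    "d": "concert tickets",
}
links = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
topic = ["football", "league", "transfer"]

order = focused_crawl(pages, links, "seed", topic, budget=3)
```

With a budget of three pages, this sketch visits `seed`, `a`, then `c`, whereas a breadth-first crawl from the same seed would spend its third fetch on the off-topic page `b`.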
Crawler Architecture
Crawler Architecture (cont)
Content Similarity Measure
Evaluations
1. Precision
2. Recall
We ran the algorithm twice: once starting from a good hub for the topic and once starting from a general page.
We compared both results with those of an ordinary BFS crawler.
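As a reminder of how the two measures are usually computed for a crawl, the sketch below uses the standard definitions: precision is the share of crawled pages that are relevant, and recall is the share of relevant pages that were crawled. The function name and the sample URL sets are hypothetical; the slides do not state how the set of relevant pages was determined.

```python
def precision_recall(crawled, relevant):
    """Return (precision, recall) for a set of crawled pages
    against a known set of relevant pages."""
    crawled, relevant = set(crawled), set(relevant)
    hits = crawled & relevant
    precision = len(hits) / len(crawled) if crawled else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: 4 pages crawled, 3 pages relevant, 2 in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
```

Here two of the four crawled pages are relevant (precision 0.5) and two of the three relevant pages were found (recall 2/3).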
Experimental Results
Experimental Results (cont)
Experimental Results (cont)
Experimental Results (cont)
TCP: Total Crawled Pages, RPC: Related Pages' Count, RCT: Relative Crawling Time, AHR: Average Harvest Rate
AHR: the mean of the harvest rates in each segment
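The AHR definition can be illustrated with a short sketch. It assumes that a segment is a fixed-size run of consecutively crawled pages and that the harvest rate of a segment is the fraction of its pages judged relevant; the slides do not specify the segment size, so `segment_size` and the sample flags are hypothetical.

```python
def average_harvest_rate(relevance_flags, segment_size):
    """Mean of per-segment harvest rates.

    relevance_flags: one boolean per crawled page, in crawl order.
    segment_size: pages per segment (an assumption of this sketch).
    """
    rates = []
    for start in range(0, len(relevance_flags), segment_size):
        segment = relevance_flags[start:start + segment_size]
        rates.append(sum(segment) / len(segment))  # True counts as 1
    return sum(rates) / len(rates) if rates else 0.0

# Invented crawl of 8 pages, split into two segments of 4.
flags = [True, True, False, True, False, False, True, False]
ahr = average_harvest_rate(flags, segment_size=4)
```

The first segment has a harvest rate of 0.75 and the second 0.25, so the AHR for this invented crawl is 0.5.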
Experimental Results (cont)
THE END