Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
Soumen Chakrabarti
Martin van den Berg
Byron Dom
WWW 1999 2
Portals and portholes
Popular search portals and directories
  Useful for generic needs
  Difficult to do serious research
Information needs of net-savvy users are getting very sophisticated
Relatively little business incentive
  Need handmade specialty sites: portholes
  Resource discovery must be personalized
Quote
The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals.
Jim Hake (Founder, Global Information Infrastructure Awards)
Scenario
Disk drive research group wants to track magnetic surface technologies
Compiler research group wants to trawl the web for graduate student resumés
____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
Virtual libraries like the Open Directory Project and the Mining Co.
Structured web queries
How many links were found from an environment protection agency site to a site about oil and natural gas in the last year?
Apart from cycling, what is the most common topic cited by pages on cycling?
Find Web research pages which are widely cited by Hawaiian vacation pages
Goal
Automatically construct a focused portal (porthole) containing resources that are
  Relevant to the user’s focus of interest
  Of high influence and quality
  Collectively comprehensive
Answer structured web queries by selectively exploring the topics involved in the query
Tools at hand
Keyword search engines
  Synonymy, polysemy
  Abundance, lack of quality
Hand-compiled topic directories
  Labor intensive, subjective judgements
Resources automatically located using keyword search and link graph distillation
  Dependence on large crawls and indices
Estimating popularity
Extensive research on social network theory
  Wasserman and Faust
Hyperlink based
  Large in-degree indicates popularity/authority
  Not all votes are worth the same
Several similar ideas and refinements
  Google (Page and Brin) and HITS (Kleinberg)
  Resource compilation (Chakrabarti et al.)
  Topic distillation (Bharat and Henzinger)
Topic distillation overview
Given web graph and query
Search engine selects sub-graph
Expansion, pruning and edge weights
Nodes iteratively transfer authority to cited neighbors
[Figure: a query goes to a search engine, which selects a subgraph of the Web]
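The iterative transfer of authority can be sketched as a basic hub/authority iteration in the style of HITS. This is a minimal illustration, not the distiller used in the talk: it omits expansion, pruning, and edge weights, and the function name `hits` and the toy edge list are assumptions made for the example.

```python
import math

def hits(edges, iterations=50):
    """Basic hub/authority iteration over a directed edge list.

    `edges` holds (source, target) pairs for the selected subgraph.
    Returns (hub, authority) score dictionaries keyed by node.
    """
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that cite this node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages this node cites.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        # Normalize to unit length so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return hub, auth

# Toy subgraph (hypothetical): three hubs citing two candidate authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
```

In this toy graph the page cited by all three hubs ends up with the highest authority score, and the hub that cites both authorities ends up with the highest hub score.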
Preliminary distillation-based approach
Design a keyword query to represent a topic
Run topic distillation periodically
Refine query through trial-and-error
Works well if answer is partially known, e.g., European airlines +swissair +iberia +klm
Problems with preliminary approach
Dependence on large web crawl and index
  System = crawler + index + distiller
Unreliability of keyword match
  Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  Narrow, arbitrary view of relevant subgraph
  Topic model does not improve over time
Difficulty of query construction
Lack of output sensitivity
Query construction
+“power suppl*”
“switch* mode” smps
-multiprocessor*
“uninterrupt* power suppl*” ups
-parcel*
/Companies/Electronics/Power_Supply
Query complexity
Complex queries (966 trials)
  Average words: 7.03
  Average operators (+*–"): 4.34
Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  Average query words: 2.35
  Average operators (+*–"): 0.41
Forcibly adding a hub or authority node helped in 86% of the queries
Query complexity
Complex queries needed for distillation
Typical Alta Vista queries are much simpler (Silverstein, Henzinger, Marais and Moricz)
Forcing a hub or authority helps 86% of the time
[Figure: bar chart of average words and operators per query. Distillation: 7.03 words, 4.34 operators. Alta Vista: 2.35 words, 0.41 operators.]
Output sensitivity
Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
Ideally effort should scale with size of the result
Time spent crawling and indexing sites unrelated to the topic is wasted
Likewise, time that does not improve comprehensiveness is wasted
Proposed solution
Resource discovery system that can be customized to crawl for any topic by giving examples
Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their centrality
Crawler has guidance hooks controlled by these two scores: relevance and centrality
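One way to picture a relevance-guided crawler is a priority-queue frontier in which each unvisited link inherits the relevance score of the page that cited it. The sketch below is an illustration under stated assumptions, not the system described in the talk: `fetch_links` and `relevance` are hypothetical callbacks, the toy web is an in-memory dictionary, and the centrality score is left out.

```python
import heapq

def focused_crawl(seeds, fetch_links, relevance, budget):
    """Visit at most `budget` pages, most promising links first."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        score = relevance(url)
        visited.append((url, score))
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # An out-link inherits the relevance of the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical toy web: one relevant branch and one irrelevant branch.
TOY_WEB = {
    "seed": ["bike_hub", "news_hub"],
    "bike_hub": ["bike2", "bike3"],
    "news_hub": ["news2", "news3"],
}
TOY_RELEVANCE = {"seed": 0.9, "bike_hub": 0.9, "bike2": 0.8, "bike3": 0.7,
                 "news_hub": 0.1, "news2": 0.1, "news3": 0.1}

crawl = focused_crawl(
    ["seed"],
    fetch_links=lambda url: TOY_WEB.get(url, []),
    relevance=lambda url: TOY_RELEVANCE.get(url, 0.0),
    budget=5,
)
crawled_urls = [url for url, _ in crawl]
```

With a budget of five pages, the links found on the irrelevant hub sit at the back of the frontier and are never fetched, which is the output-sensitive behavior the proposal aims for.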
Administration scenario
[Figure: administration interface with a taxonomy editor, current examples, and suggested additional examples that can be dragged in]
Relevance
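[Figure: topic taxonomy rooted at All, with nodes including Arts, Bus&Econ, Recreation, Companies, Cycling, Bike Shops, Mt. Biking, and Clubs; nodes are labeled as path nodes, good nodes, or subsumed nodes]
$$\Pr[d \text{ is good}] = \sum_{\mathrm{good}(c)} \Pr[c \mid d]$$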
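Relevance, read as the probability that a document belongs to some good topic, amounts to summing the classifier's posterior over the good classes. The snippet below is a small worked example with made-up numbers; the class names, the posterior values, and the choice of which classes count as good are all assumptions for illustration.

```python
def relevance(posterior, good_classes):
    """Sum Pr[c | d] over the classes marked good."""
    return sum(p for c, p in posterior.items() if c in good_classes)

# Hypothetical posterior Pr[c | d] for one document over four leaf topics.
posterior = {
    "Cycling/Clubs": 0.25,
    "Cycling/Mt.Biking": 0.5,
    "Companies/Bike Shops": 0.125,
    "Arts": 0.125,
}
good = {"Cycling/Clubs", "Cycling/Mt.Biking"}
score = relevance(posterior, good)  # 0.25 + 0.5
```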
Classification
How relevant is a document w.r.t. a class?
  Supervised learning, filtering, classification, categorization
Many types of classifiers
  Bayesian, nearest neighbor, rule-based
Hypertext
  Both text and links are class-dependent clues
  How to model link-based features?
The “bag-of-words” document model
Decide topic; topic c is picked with prior probability π(c); Σ_c π(c) = 1
Each c has parameters θ(c,t) for terms t
  Coin with face probabilities θ(c,t); Σ_t θ(c,t) = 1
Fix document length n(d) and keep tossing coin
Given c, probability of document is
$$\Pr[d \mid c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$
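The multinomial bag-of-words model can be made concrete with a short computation of the document likelihood and the class posterior obtained from it by Bayes' rule. This is a minimal sketch: the names `prior` and `theta`, the two classes, and the three-word vocabulary are assumptions for illustration, and every term is given a nonzero probability so that no smoothing is needed.

```python
import math
from collections import Counter

def log_likelihood(tokens, theta_c):
    """log Pr[d | c] under the multinomial bag-of-words model."""
    counts = Counter(tokens)
    n = sum(counts.values())
    # Multinomial coefficient n! / prod_t n_t!, computed with log-gamma.
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts.values())
    return log_coeff + sum(k * math.log(theta_c[t]) for t, k in counts.items())

def posterior(tokens, prior, theta):
    """Pr[c | d] by Bayes' rule, normalized over all classes."""
    log_joint = {c: math.log(prior[c]) + log_likelihood(tokens, theta[c])
                 for c in prior}
    top = max(log_joint.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(v - top) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

# Hypothetical parameters: class priors and per-class term probabilities.
prior = {"cycling": 0.5, "finance": 0.5}
theta = {
    "cycling": {"bike": 0.5, "race": 0.25, "fund": 0.25},
    "finance": {"bike": 0.25, "race": 0.25, "fund": 0.5},
}
doc = ["bike", "bike", "race"]
post = posterior(doc, prior, theta)
```

For this document the likelihood is 3 × 0.5² × 0.25 = 0.1875 under the cycling class and 3 × 0.25² × 0.25 = 0.046875 under the finance class, so with equal priors the posterior for cycling is 0.8.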
Exploiting link features
c=class, t=text, N=neighbors
Text-only model: Pr[t|c]
Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
Better model: Pr[t, c(N) | c]
Non-linear relaxation
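The idea of using neighbors' classes, which are themselves uncertain, can be illustrated with a simple iterative relaxation: each node's class belief is recomputed from its own text likelihood and the current beliefs of its neighbors, and the process repeats. The sketch below is a deliberately simplified stand-in and should not be read as the model from the talk; the compatibility table `compat`, the toy graph, and all numbers are assumptions.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {c: v / total for c, v in scores.items()}

def relax(text_like, neighbors, compat, rounds=10):
    """Iteratively update class beliefs from text and neighbor beliefs.

    text_like[n][c] : Pr[t | c] for node n's own text
    compat[c][c2]   : assumed Pr[a neighbor has class c2 | node has class c]
    """
    classes = list(compat)
    # Start from the text-only estimate.
    belief = {n: normalize(text_like[n]) for n in text_like}
    for _ in range(rounds):
        updated = {}
        for n in text_like:
            scores = {}
            for c in classes:
                s = text_like[n][c]
                for m in neighbors.get(n, []):
                    # Expected compatibility with neighbor m's current belief.
                    s *= sum(compat[c][c2] * belief[m][c2] for c2 in classes)
                scores[c] = s
            updated[n] = normalize(scores)
        belief = updated
    return belief

# Toy example: page "p" has ambiguous text but two clearly on-topic neighbors.
text_like = {
    "p": {"cycling": 0.5, "finance": 0.5},
    "q": {"cycling": 0.9, "finance": 0.1},
    "r": {"cycling": 0.9, "finance": 0.1},
}
neighbors = {"p": ["q", "r"], "q": ["p"], "r": ["p"]}
compat = {
    "cycling": {"cycling": 0.8, "finance": 0.2},
    "finance": {"cycling": 0.2, "finance": 0.8},
}
beliefs = relax(text_like, neighbors, compat)
```

Text alone leaves "p" undecided, while the relaxed belief leans toward the class of its neighbors.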
Improvement using link features
9600 patents from 12 classes marked by USPTO
Patents have text and cite other patents
Expand test patent to include neighborhood
‘Forget’ fraction of neighbors’ classes
[Figure: classification error (%, 0 to 40) against percentage of the neighborhood with known classes (0 to 100), for Text, Link, and Text+Link classifiers]
Putting it together
[Figure: system architecture with these components: Taxonomy Database, Taxonomy Editor, Example Browser, Crawl Database, Hypertext Classifier (Learn), Topic Models, Hypertext Classifier (Apply), Scheduler, Workers, Topic Distiller, and a Feedback path]
Monitoring the crawler
[Figure: relevance plotted against time during a crawl; each point is one URL, with a moving average overlaid]
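A moving average over per-URL relevance scores is a simple way to turn the noisy point cloud into a trend line for monitoring. The helper below is a minimal sketch; the function name, the window size, and the sample scores are assumptions chosen for the example.

```python
from collections import deque

def moving_average(values, window):
    """Average of the last `window` values seen so far, for each position."""
    recent = deque(maxlen=window)  # old values fall off automatically
    averages = []
    for v in values:
        recent.append(v)
        averages.append(sum(recent) / len(recent))
    return averages

# Hypothetical relevance scores in fetch order.
smoothed = moving_average([1.0, 0.0, 1.0, 1.0], window=2)
```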
Measures of success
Harvest rate
  What fraction of crawled pages are relevant
Robustness across seed sets
  Separate crawls with random disjoint samples
  Measure overlap in URLs and servers crawled
  Measure agreement in best-rated resources
Evidence of non-trivial work
  #Links from start set to the best resources
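The first two measures are easy to compute from crawl logs. The sketch below assumes one plausible definition of overlap, the fraction of one crawl's items that also appear in the other, which may differ from the definition used for the plots in the talk; the URLs use the reserved `.example` domain and are made up.

```python
from urllib.parse import urlparse

def harvest_rate(relevance_scores):
    """Average relevance of the fetched pages."""
    return sum(relevance_scores) / len(relevance_scores)

def overlap(a, b):
    """Fraction of the items in `a` that also appear in `b` (assumed definition)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0

def servers(urls):
    """Set of host names appearing in a list of URLs."""
    return {urlparse(u).netloc for u in urls}

# Two hypothetical crawls started from disjoint seed sets.
crawl_1 = ["http://a.example/x", "http://a.example/y",
           "http://b.example/z", "http://c.example/w"]
crawl_2 = ["http://a.example/x", "http://b.example/q"]

url_overlap = overlap(crawl_1, crawl_2)
server_overlap = overlap(servers(crawl_1), servers(crawl_2))
rate = harvest_rate([0.5, 1.0, 0.0, 0.5])
```

Server overlap is typically higher than URL overlap, since two crawls can reach the same site through different pages.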
Harvest rate
[Figure: two panels plotting average relevance (0 to 1) against #URLs fetched. Unfocused: Harvest Rate (Cycling, Unfocused), 0 to 10000 URLs, averaged over 100. Focused: Harvest Rate (Cycling, Soft Focus), 0 to 6000 URLs, averaged over 100 and over 1000.]
Crawl robustness
[Figure: two panels titled Crawl Robustness (Cycling), comparing Crawl 1 and Crawl 2 over #URLs crawled (0 to 3000). First panel: URL overlap. Second panel: server overlap, with series Overlap1 and Overlap2.]
Top resources after one hour
Recreational and competitive cycling
  http://www.truesport.com/Bike/links.htm
  http://reality.sgi.com/billh_hampton/jrvs/links.html
  http://www.acs.ucalgary.ca/~bentley/mark_links.html
HIV/AIDS research and treatment
  http://www.stopaids.org/Otherorgs.html
  http://www.iohk.com/UserPages/mlau/aidsinfo.html
  http://www.ahandyguide.com/cat1/a/a66.htm
Purer and better than root set
Distance to best resources
[Figure: two histograms of #servers in the top 100 by minimum distance from the crawl seed (#links). Resource Distance (Mutual Funds): distances 1 to 15. Resource Distance (Cycling): distances 1 to 12.]
Cycling: cooperative
Mutual funds: competitive
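The minimum link distance from the crawl seeds to a resource can be obtained with a breadth-first search over the crawled link graph. This is a minimal sketch; the function name and the toy graph are assumptions for illustration.

```python
from collections import deque

def min_link_distance(graph, seeds):
    """Shortest number of links from any seed to each reachable page."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt not in dist:  # first visit is the shortest path in BFS
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

# Hypothetical crawl graph: the best resource sits three links from the seed.
graph = {"seed": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["best"]}
distances = min_link_distance(graph, ["seed"])
```

Counting how many top-rated servers fall at each distance gives a histogram like the ones on this slide.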
Robustness of resource discovery
Sample disjoint sets of starting URLs
Two separate crawls
Find best authorities
Order by rank
Find overlap in the top-rated resources
[Figure: Resource Robustness (Cycling): server overlap (0 to 1) against #top resources (0 to 25), with series Overlap1 and Overlap2]
Related work
WebWatcher, HotList and ColdList
  Filtering as post-processing, not acquisition
ReferralWeb
  Social network on the Web
Ahoy!, Cora
  Hand-crafted to find home pages and papers
WebCrawler, Fish, Shark, Fetuccino, agents
  Crawler guided by query keyword matches
Comparison with agents
Agents usually look for keywords and hand-crafted patterns
Cannot learn new vocabulary dynamically
Do not use distance-2 centrality information
Client-side assistant
We use taxonomy with statistical topic models
Models can evolve as crawl proceeds
Combine relevance and centrality
Broader scope: inter-community linkage analysis and querying
Conclusion
New architecture for example-driven topic-specific web resource discovery
No dependence on full web crawl and index
Modest desktop hardware adequate
Variable radius goal-directed crawling
High harvest rate
High quality resources found far from keyword query response nodes