Topical Crawling for Business Intelligence Gautam Pant * and Filippo Menczer ** * Department of...

Topical Crawling for Business IntelligenceGautam Pant* and Filippo Menczer**

*Department of Management SciencesThe University of Iowa, Iowa City, IA 52246

**School of InformaticsIndiana University, Bloomington, IN 47408

Overview Topical Crawling The Business Intelligence Problem Test Bed Crawling Algorithms Results Finding Better Seeds

Crawling as Graph Search

History

Frontier

Seeds

Node expansion – Downloading and parsing a page

Open list - Frontier Closed list – History Expansion order – Crawl path

Exhaustive vs. Preferential Crawling Exhaustive - blind expansion order (e.g.

Breadth First ) Preferential - heuristic-based expansion order

(e.g. Best First) Topical Crawling: the guiding heuristic is based

on a topic or a set of topics

Business Intelligence Problem Web based information about related business

entities Related through the area of competence,

research thrust etc. Topical crawlers can help in creating a small

but focused collection of Web pages that is rich in information about related business entities

Business Intelligence Problem A list of business entities is available We create a focused document collection that

can be further explored with ranking, indexing and text-mining tools

We investigate the crawling techniques for the task

Finding paths in a competitive community

.com

.edu, .org, .gov.com

.com

.com.com

.com

.com

Test Bed DMOZ Categories – “Companies”,

“Consultants”, “Manufacturers” 159 topics seeds, targets, keywords and description Each crawler crawl up-to 10,000 pages for

each topic

Sample Topic

Crawling Infrastructure

Crawling Algorithms Breadth First

Naïve Best First

Crawling Algorithms – DOM Crawler

scorecontextscorelinkscorelink bfsdom _)1(__

Hub-Seeking Crawler

0,1

)1(_

2

nn

nnscorehub

n – number of seed hosts

domhub scorelinkscorehubscorelink _,_max_

Performance

Improving the Seed Set Top 10 hubs based on back-

links from Google Avoiding mirrors of DMOZ Augmented seed set

Performance

Related work Chakrabarti et. al. [1998]

Use of Hubs Menczer et. al. [2001]

Framework for evaluating topical crawlers Chakrabarti et. al. [2002]

Use of DOM

Conclusion Investigated the problem of creating a small collection

through topical crawling for locating related business entities Hub Seeking crawler that seeks hubs at crawl time and

exploits the tag tree structure of Web pages outperforms Naïve Best-First

Positive effects of identifying hubs before and during the crawl process

Future Work – Find optimal aggregation node Compare the benefits of identifying hubs in competitive vs.

collaborative communities

Thank [email protected]

Acknowledgements:

Robin McEntire (GlaxoSmithKline R&D)

Valdis A. Dzelzkalns (GlaxoSmithKline R&D)

Paul Stead (GlaxoSmithKline R&D)

NSF grant to FM

Topical Crawling for Business Intelligence Gautam Pant * and Filippo Menczer ** * Department of...

Documents

Transcript of Topical Crawling for Business Intelligence Gautam Pant * and Filippo Menczer ** * Department of...