CROSSMARC Web Pages Collection: Crawling and Spidering Components
Vangelis Karkaletsis
Institute of Informatics & Telecommunications, NCSR “Demokritos”
Final Project Review
Luxembourg, October 31, 2003
Final Review “Crawling and Spidering” Luxembourg, 31 October 2003
Web Pages Collection: Focused Crawler
• Identifies Web sites that are relevant to a particular domain. It combines:
  – a crawler that exploits the topic-based Web site hierarchies used by various search engines
  – a crawler that submits queries built from the CROSSMARC domain ontologies and lexicons to a search engine
  – a crawler that takes a set of ‘seed’ pages and conducts a ‘similar pages’ search on advanced search engines
• The list of Web sites produced is then filtered
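The combination step above can be pictured as merging the candidate lists produced by the three crawlers and keeping only the sites that pass the relevance filter. This is an illustrative sketch, not the CROSSMARC implementation; the function and parameter names are assumptions.

```python
# Illustrative sketch: merge the candidate site lists produced by the three
# crawling strategies, then filter the union with a relevance predicate.
def merge_candidate_sites(hierarchy_hits, query_hits, similar_page_hits, is_relevant):
    """Combine the three crawlers' results and keep only relevant sites."""
    candidates = set(hierarchy_hits) | set(query_hits) | set(similar_page_hits)
    return sorted(site for site in candidates if is_relevant(site))
```

The set union removes sites discovered by more than one crawler, so the filter runs once per candidate site.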
Web Pages Collection: crawler customization
– changing the settings in the crawler configuration files
– experimentation and evaluation to find the optimal settings for each version, as well as their optimal combination
– training the light spidering module that filters the crawler results
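The "experimentation and evaluation to find the optimal settings" step can be pictured as a simple grid search over configuration values. This is a hypothetical sketch, not the CROSSMARC tooling; the setting names and the evaluation function are assumptions.

```python
# Hypothetical sketch of the settings-experimentation loop: try every
# combination of crawler settings and keep the best-scoring one.
from itertools import product

def grid_search(setting_grid, evaluate):
    """setting_grid maps a setting name to a list of candidate values;
    evaluate scores one complete settings dict (e.g. crawler precision)."""
    best_score, best_settings = float("-inf"), None
    keys = list(setting_grid)
    for values in product(*(setting_grid[k] for k in keys)):
        settings = dict(zip(keys, values))
        score = evaluate(settings)
        if score > best_score:
            best_score, best_settings = score, settings
    return best_settings, best_score
```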
Web Pages Collection: Crawler Evaluation
• more than one experimentation cycle may be needed depending on the domain and language
• our evaluation methodology provides a good way of comparing different initial settings of the crawler
Language   1st Domain Precision (%)   2nd Domain Precision (%)
English    45.2                       87.5
Italian    25.6                       41.7
Greek      26.0                       53.2
French     57.1                       30.8
Web Pages Collection: Web sites spider

• Site navigation: traverses a Web site, collecting information from each page visited and forwarding it to the “Page-Filtering” and “Link-Scoring” modules
• Page filtering: decides whether a page is an interesting one that should be stored
  – before storing a page, its language is identified
  – the page is also converted to XHTML
• Link scoring: validates the links to be followed; only links with a score above a certain threshold are followed
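The interaction of the three modules can be sketched as a breadth-first traversal loop. This is an illustrative sketch, not the actual CROSSMARC spider; the callback names (fetch, extract_links, page_filter, link_score) are assumptions standing in for the real modules.

```python
# Illustrative sketch of the spider's navigation loop, tying together
# page filtering and link scoring as described above.
from collections import deque

def spider(start_url, fetch, extract_links, page_filter, link_score, threshold):
    """Traverse a site breadth-first, storing interesting pages and
    following only links whose score exceeds the threshold."""
    queue, visited, stored = deque([start_url]), set(), []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page_filter(page):          # "Page-Filtering" module
            stored.append(url)
        for link in extract_links(page):
            if link not in visited and link_score(link) > threshold:  # "Link-Scoring"
                queue.append(link)
    return stored
```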
Web Pages Collection: Web sites spider – Navigation

• The following types of URLs are supported, in order to discover and extract more URLs from each Web page:
  – frame links, text links, image links, image maps
  – JavaScript cases, HTML forms
• Each URL is checked for whether it:
  – redirects to another site
  – points to a non-HTML file
  – is already in the queue of visited URLs
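The URL checks above can be sketched as a single predicate. This is a hypothetical illustration, not the CROSSMARC implementation; the list of non-HTML suffixes and the same-host rule are assumptions.

```python
# Hypothetical sketch of the per-URL checks described above.
from urllib.parse import urlparse

NON_HTML_SUFFIXES = (".pdf", ".jpg", ".png", ".zip", ".exe")

def should_follow(url, site_host, visited):
    """Reject URLs that leave the site, point to non-HTML files,
    or have already been queued/visited."""
    parsed = urlparse(url)
    if parsed.netloc and parsed.netloc != site_host:
        return False            # points to another site
    if parsed.path.lower().endswith(NON_HTML_SUFFIXES):
        return False            # non-HTML file
    if url in visited:
        return False            # already in the queue of visited URLs
    return True
```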
Web Pages Collection: Web sites spider – Page Filtering

• Two approaches were investigated:
  – Machine learning: the WebPageClassifier tool was developed, which
    • reads a corpus of positive and negative Web pages,
    • translates it into a feature-vector format, and
    • uses learning algorithms to construct the Web page classifier.
  – Heuristics: the heuristics-based filter
    • accepts as input the Web page, in the form of a token sequence,
    • compares each token to a list of regular expressions from the domain lexicon in use.
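The heuristics-based filter can be sketched as matching each page token against the lexicon's regular expressions. This is an illustrative sketch, not the CROSSMARC filter; the acceptance rule (a minimum number of matching tokens) is an assumption.

```python
# Illustrative sketch of the heuristics-based page filter: each token of the
# page is matched against domain-lexicon regular expressions; the min_matches
# acceptance threshold is an assumption.
import re

def heuristic_page_filter(tokens, lexicon_patterns, min_matches=3):
    """Accept a page if enough of its tokens match the domain lexicon."""
    patterns = [re.compile(p, re.IGNORECASE) for p in lexicon_patterns]
    matches = sum(1 for tok in tokens if any(p.fullmatch(tok) for p in patterns))
    return matches >= min_matches
```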
Web Pages Collection: Web sites spider – Link scoring

• Two approaches were investigated:
  – Machine learning: the training system for link scoring takes as input
    • a collection of domain-specific Web sites,
    • the positive Web pages within these Web sites,
    • the domain ontology and one or more domain lexicon files,
    from which it creates the training data set.
  – Heuristics: the heuristics-based link scorer
    • takes as input the link’s text content as well as its context (left and right),
    • parses the three strings looking for domain-relevant information based on a score table,
    • combines the scores of the three strings using a weighted function.
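The heuristic scorer's weighted combination can be sketched as follows. This is an illustrative sketch, not the CROSSMARC implementation; the score table, the per-token scoring rule, and the weight values (giving the link text itself the highest weight) are all assumptions.

```python
# Illustrative sketch of the heuristic link scorer: score the link text and
# its left/right context against a score table, then combine with weights.
def score_string(text, score_table):
    """Sum the score-table values of the domain-relevant tokens in a string."""
    return sum(score_table.get(tok.lower(), 0.0) for tok in text.split())

def score_link(left_context, link_text, right_context, score_table,
               weights=(0.25, 0.5, 0.25)):
    """Combine the scores of the three strings with a weighted function;
    the weights here are assumptions, not the CROSSMARC values."""
    w_left, w_text, w_right = weights
    return (w_left * score_string(left_context, score_table)
            + w_text * score_string(link_text, score_table)
            + w_right * score_string(right_context, score_table))
```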
Web Pages Collection: spider customization

• Use the same navigation mechanism
• Use the machine-learning-based page filtering, which requires:
  – the domain ontology and lexicons
  – the creation of a representative training corpus (CROSSMARC provides the Corpus Formation tool)
  – the use of the WebPageClassifier tool to construct the domain-specific classifier
• Use the rule-based approach suggested for link scoring, which requires:
  – the specification of new settings in the configuration file of the link-scoring module
  – experimenting with each specification until the optimal setting is found
Web Pages Collection: Web sites spider Evaluation
• Page Filtering
Language   1st Domain F-measure (%)   2nd Domain F-measure (%)
English    96.9                       83.2
Italian    93.7                       73.7
Greek      92.7                       87.9
French     96.9                       82.3
– we are able to identify with a high degree of confidence whether a page is interesting or not according to the domain
– results can be improved further
• so far only ontology-based features are used
• combination with statistically selected ones is a promising research direction
• Link scoring:
– The results of both methods are rather poor
– An issue that could be investigated is the combination of the two methods to improve recall results
– Concluding, the task of scoring links without visiting them
• remains a very challenging one and
• is becoming more important in the general setting of topic-specific search engines and portals
Web Pages Collection: Web sites spider Evaluation
Concluding Remarks
• Crawler
  – Applied in both domains of the project
  – Customization instructions are provided
  – The tool and the corpora used in both domains and four languages will be available for research purposes
• Spider
  – Applied in three domains
  – Customization methodology and tools are provided
  – The corpora collected for page filtering and link scoring will be available for research purposes