web page classification
with naïve bayes classifiers
nabeelah ali
27 november 2013
outline
• what is web page classification
• motivation
• literature review
• project design
• experiments
• evaluation
description & motivation
what is classification?
web page classification
web page classification can be seen as a type of document classification
documents vs web pages
• web pages have structure
• HTML indicates headings, paragraphs, meta-information
• web pages are interconnected
• they contain hyperlinks to other pages
• they have locations (URLs)
why?
web directories
why?
improving search results
why?
• user profile mining
• information filtering
• creation of domain-specific search engines
literature review
bag of words
text is represented as an unordered list of words
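A minimal sketch of the bag-of-words idea, assuming simple lowercasing and alphanumeric tokenisation (the slides do not say how the text was tokenised). The function name `bag_of_words` is illustrative only.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text; word order is discarded."""
    # Assumption: a token is any run of letters or digits, lowercased.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

bag = bag_of_words("Research assistant wanted: research experience required")
same_words_reordered = bag_of_words("required experience research wanted assistant research")
```

Because only counts are kept, the two example texts above produce the same bag even though their word order differs.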
n-gram representation
• the document is represented by a vector of features
• concepts expressed by phrases can be captured (e.g. “New York” vs “new” and “york”)
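The “New York” example can be sketched as follows. This is an illustrative helper (the name `ngrams` is an assumption, not something from the slides) that joins every run of n consecutive tokens into one feature.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is a city".split()
unigrams = ngrams(tokens, 1)  # single words: "new" and "york" are separate
bigrams = ngrams(tokens, 2)   # two-word phrases: "new york" is one feature
```

With n = 2 the phrase survives as a single feature, whereas with n = 1 it is split into two unrelated words.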
using HTML structure
• assign weight depending on HTML tags, and make the feature a linear combination of these
• e.g. headings would have a greater weight
• four main elements are considered: title, headings, metadata and main text
Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
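One way to read "a linear combination" is that a term's score is the sum, over the four element groups, of the group weight times the term's count in that group. The sketch below follows that reading using Python's standard `html.parser`. The weight values, the class name `WeightedTermParser` and the helper `term_weights` are assumptions made for illustration; they are not taken from the cited paper.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, one per element group (not values from the paper).
WEIGHTS = {"title": 4.0, "headings": 3.0, "metadata": 2.0, "text": 1.0}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}

class WeightedTermParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.scores = Counter()
        # Nesting depth of the element groups we are currently inside.
        self._depth = {"title": 0, "headings": 0, "skip": 0}

    def _group(self, tag):
        if tag == "title":
            return "title"
        if tag in HEADING_TAGS:
            return "headings"
        if tag in SKIPPED_TAGS:
            return "skip"
        return None

    def _add(self, text, group):
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.scores[token] += WEIGHTS[group]

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # Only description/keywords metadata is counted in this sketch.
            if (attrs.get("name") or "").lower() in {"description", "keywords"}:
                self._add(attrs.get("content") or "", "metadata")
            return
        group = self._group(tag)
        if group:
            self._depth[group] += 1

    def handle_endtag(self, tag):
        group = self._group(tag)
        if group and self._depth[group] > 0:
            self._depth[group] -= 1

    def handle_data(self, data):
        if self._depth["skip"]:
            return
        if self._depth["title"]:
            self._add(data, "title")
        elif self._depth["headings"]:
            self._add(data, "headings")
        else:
            self._add(data, "text")

def term_weights(html):
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return parser.scores

page = """<html><head><title>CS220 course</title>
<meta name="keywords" content="course syllabus"></head>
<body><h1>Course outline</h1><p>The course meets in room 12.</p></body></html>"""
scores = term_weights(page)
```

In this toy page the word "course" appears once in each of the four groups, so its score is the sum of all four weights, while "room" appears only in the main text and receives the lowest weight.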
visual analysis
• the visual representation rendered by a web browser is important
• each web page is visualised as an adjacency multigraph, with each section representing a different kind of content
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
URL features
• pages do not need to be fetched or analysed
• fast!
• derives tokens from the URL and uses these tokens as features
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
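A simplified sketch of deriving tokens from a URL without fetching the page. It only splits the host, path and query on non-alphanumeric characters, which is much cruder than the method in the cited paper. The function name `url_tokens` and the example URL are hypothetical.

```python
import re
from urllib.parse import urlsplit

def url_tokens(url):
    """Split the host, path and query of a URL into lowercase tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    # Any run of characters other than letters and digits is a separator.
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]

# Hypothetical URL, used only to show the shape of the output.
tokens = url_tokens("http://www.cs.cornell.edu/courses/cs220/index.html")
```

The resulting tokens can then be used as features in the same way as the words of a document.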
web page classification
project design
dataset
• 4 universities dataset (cornell, texas, washington, wisconsin)
• each page must be classified into a category: course, department, faculty, project, staff, student, other
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
document classification
single label classification: one and only one class label is assigned to each instance
hard classification: an instance can either be or not be in a particular class, with no intermediate state
multi-class classification: instances that can be divided into more than two categories
details of the dataset
experiment #1
bag of words
use the words, unweighted, as features
example words: intern, Professor, CS220, room, admission, Dr, assistant, research
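The deck's title names naïve Bayes as the classifier but the transcript gives no implementation details, so the following is only a sketch under stated assumptions: a multinomial naïve Bayes model with add-one (Laplace) smoothing over unweighted word counts. The class name `NaiveBayes` and the four tiny training documents are invented for illustration and are not from the dataset.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of log likelihoods of each word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for this example.
docs = [
    ["professor", "research", "room"],
    ["cs220", "course", "outline"],
    ["professor", "dr", "research"],
    ["course", "cs220", "admission"],
]
labels = ["faculty", "course", "faculty", "course"]
model = NaiveBayes().fit(docs, labels)
```

Calling `model.predict(["research", "professor"])` selects the class whose training pages used those words most often, here "faculty".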
experiment #2
HTML tag weighting
use words weighted by the HTML tags (e.g. words in <h1> tags will be weighted more heavily than those in <p> tags)
example words: intern, Professor, CS220, room, admission, Dr, assistant, research
experiment #3
n-gram
use phrases instead of single words as features
example phrases: research assistant, course outline, contact information, program description
evaluation
k-fold cross validation
source: http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/
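A minimal sketch of how k-fold cross validation partitions the data: each item is used for testing in exactly one fold and for training in the other k - 1 folds. The function name `k_fold_splits` is illustrative, and the sketch does not shuffle the items, which a real experiment would normally do first.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 3))
```

The classifier is trained and scored once per fold, and the k scores are then averaged.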
evaluation
confusion matrix
source: http://en.wikipedia.org/wiki/Confusion_matrix
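A small sketch of building a confusion matrix for a multi-class problem, with rows for the actual class and columns for the predicted class. The helper names and the six toy predictions are invented for illustration; they are not results from the experiments.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Toy labels, invented for this example.
labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student", "faculty"]
predicted = ["course", "faculty", "faculty", "student", "course", "faculty"]
matrix = confusion_matrix(actual, predicted, labels)
```

Off-diagonal cells show which categories are confused with each other, which a single accuracy figure hides.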
bibliography
Choi, B., and Z. Yao. "Web page classification." StudFuzz 180 (2005): 221-274.
Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.
Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.
questions?