web page classification

web page classification with naïve bayes classifiers, nabeelah ali, 27 november 2013

Transcript of web page classification

Page 1: web page classification

web page classification

nabeelah ali

27 november 2013

with naïve bayes classifiers

Page 2: web page classification

outline

• what is web page classification

• motivation

• literature review

• project design

• experiments

• evaluation

Page 3: web page classification

description & motivation

Page 4: web page classification

what is classification?

Page 5: web page classification

web page classification

web page classification can be seen as a type of document classification

Page 6: web page classification

documents vs web pages

• web pages have structure

• HTML indicates headings, paragraphs, meta-information

• web pages are interconnected

• they contain hyperlinks to other pages

• they have locations (URLs)

Page 7: web page classification

why?

web directories

Page 8: web page classification

why?

improving search results

Page 9: web page classification

why?

• user profile mining

• information filtering

• creation of domain-specific search engines

Page 10: web page classification

literature review

Page 11: web page classification

bag of words

text is represented as an unordered list of words
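A minimal sketch of the idea, assuming a simple lowercase alphanumeric tokeniser (the slides do not say how text was tokenised). Two texts with the same words in a different order end up with the same representation.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    # Assumed tokeniser: lowercase, then keep runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Same words, different order: the two bags are identical.
a = bag_of_words("research assistant in the research lab")
b = bag_of_words("lab research in the assistant research")
```

Here `a == b` holds because only the counts are kept, not the positions of the words.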

Page 12: web page classification

n-gram representation

• document is represented by a vector of features

• concepts expressed by phrases can be captured (e.g. “New York” vs “new” and “york”)
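A short sketch of n-gram extraction, assuming whitespace tokenisation and a made-up example sentence. It shows how the phrase "new york" survives as a single bigram feature while the unigrams split it.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, not taken from the dataset.
tokens = "flights to new york".split()
unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # two-word phrases
```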

Page 13: web page classification

using html structure

• assign weight depending on HTML tags, and make the feature a linear combination of these

• e.g. headings would have a greater weight

• four main elements are considered: title, headings, metadata and main text

Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.
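One way to write the linear combination down, as a sketch: the symbols below are assumptions introduced here for illustration and are not notation from the cited paper. Let c_e(w) be the number of times word w occurs in element e, and let λ_e be the weight given to that element.

```latex
\[
f(w) = \sum_{e \in E} \lambda_e \, c_e(w),
\qquad
E = \{\text{title}, \text{headings}, \text{metadata}, \text{main text}\}
\]
```

Giving headings a greater weight then simply means choosing a larger λ for headings than for main text.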

Page 14: web page classification

visual analysis

• visual representation by the web browser is important

• each web page is visualised as an adjacency multigraph, with each section representing a different kind of content

Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.

Page 15: web page classification

URL features

• pages do not need to be fetched or analysed

• fast!

• derives tokens from the URL and uses these tokens as features

Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.
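A rough sketch of deriving tokens from a URL without fetching the page. The splitting rule (break the host and path on anything that is not a letter or digit) and the example URL are assumptions for illustration; the cited paper uses more elaborate segmentation.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split the host and path of a URL into lowercase tokens."""
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path
    # Assumed rule: any run of non-alphanumeric characters is a separator.
    return [t for t in re.split(r"[^a-z0-9]+", raw.lower()) if t]

# Hypothetical URL, not taken from the dataset.
tokens = url_tokens("http://www.cs.example.edu/courses/cs220/index.html")
```

Tokens such as `courses` and `cs220` can then be used as features in the same way as words from the page body.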

Page 16: web page classification

web page classification

project design

Page 17: web page classification

dataset

• 4 universities dataset (cornell, texas, washington, wisconsin)

• each page must be classified into a category: course, department, faculty, project, staff, student, other

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

Page 18: web page classification

document classification

single-label classification: one and only one class label is assigned to each instance

hard classification: an instance can either be or not be in a particular class, with no intermediate state

multi-class classification: instances can be divided into more than two categories

Page 19: web page classification

details of the dataset

Page 20: web page classification

experiment #1

bag of words

use the words, unweighted, as features

intern

Professor

CS220

room

admission

Dr

assistant

research
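A minimal sketch of a naïve Bayes classifier over unweighted word features. The slides name naïve Bayes but not the variant, so the multinomial model with Laplace smoothing used here is an assumption, and the four tiny training pages are invented from the example words above.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (assumed variant)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # words unseen in training are skipped
                    score += math.log((counts[w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training set, for illustration only.
train_docs = [
    ["professor", "research", "dr"],
    ["research", "professor", "room"],
    ["cs220", "admission", "room"],
    ["cs220", "assistant", "intern"],
]
train_labels = ["faculty", "faculty", "course", "course"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["professor", "research"])
```

Scores are summed in log space so that multiplying many small probabilities does not underflow.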

Page 21: web page classification

experiment #2

HTML tag weighting

use words weighted by the HTML tags (e.g. words in <h1> tags will be weighted more heavily than those in <p> tags)

intern

Professor

CS220

room

admission

Dr

assistant

research
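A sketch of tag-weighted word counts using the standard-library HTML parser. The weight values and the default weight of 1.0 for text outside the listed tags are assumptions chosen for illustration; the slides do not give the actual weights.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed weights: heavier tags contribute more to a word's feature value.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "h2": 2.0, "p": 1.0}
DEFAULT_WEIGHT = 1.0

class WeightedWords(HTMLParser):
    """Accumulate word counts, scaled by the innermost weighted tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.features = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = next(
            (TAG_WEIGHTS[t] for t in reversed(self.open_tags) if t in TAG_WEIGHTS),
            DEFAULT_WEIGHT,
        )
        for word in data.lower().split():
            self.features[word] += weight

parser = WeightedWords()
parser.feed("<html><body><h1>Professor</h1><p>professor of research</p></body></html>")
features = parser.features
```

In this example "professor" scores 3.0 from the heading plus 1.0 from the paragraph, while "research" scores only 1.0.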

Page 22: web page classification

experiment #3

n-gram

use phrases instead of single words as features

research assistant

course outline

contact information

program description

Page 23: web page classification

evaluation

From http://chrisjmccormick.wordpress.com/2013/07/31/k-fold-cross-validation-with-matlab-code/

k-fold cross validation
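A small sketch of how k-fold cross validation splits the data: each fold is held out once for testing while the remaining folds are used for training. The function name is hypothetical, and the indices are left unshuffled to keep the example deterministic; in practice they would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
```

The classifier is trained and scored once per fold, and the k scores are averaged.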

Page 24: web page classification

evaluation

confusion matrix

http://en.wikipedia.org/wiki/Confusion_matrix
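A sketch of building a confusion matrix for a multi-class problem, with rows for the actual class and columns for the predicted class. The labels reuse three of the dataset's categories, but the actual and predicted values are invented for illustration and are not experimental results.

```python
def confusion_matrix(actual, predicted, labels):
    """Return a matrix where entry [i][j] counts actual i predicted as j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Invented example data, not results from the experiments.
labels = ["course", "faculty", "student"]
actual = ["course", "course", "faculty", "student", "student", "faculty"]
predicted = ["course", "faculty", "faculty", "student", "course", "faculty"]
matrix = confusion_matrix(actual, predicted, labels)

# Correct predictions lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
```

Off-diagonal entries show which categories are confused with each other, which a single accuracy figure hides.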

Page 25: web page classification

bibliography

Choi, B., and Z. Yao. "Web page classification." StudFuzz 180 (2005): 221–274.

Qi, Xiaoguang, and Brian D. Davison. "Web page classification: Features and algorithms." ACM Computing Surveys (CSUR) 41.2 (2009): 12.

Golub, Koraljka, and Anders Ardö. "Importance of HTML structural elements and metadata in automated subject classification." Research and Advanced Technology for Digital Libraries. Springer Berlin Heidelberg, 2005. 368-378.

Kan, Min-Yen, and Hoang Oanh Nguyen Thi. "Fast webpage classification using URL features." Proceedings of the 14th ACM international conference on Information and knowledge management. ACM, 2005.

Kovacevic, Milos, et al. "Visual adjacency multigraphs—a novel approach for a Web page classification." Proceedings of SAWM04 workshop, ECML2004. 2004.

Page 26: web page classification

questions?