Web Page Classification

WEB PAGE CLASSIFICATION Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender


Page 1: Web Page Classification

WEB PAGE CLASSIFICATION: Features and Algorithms

Paper by: XIAOGUANG QI and BRIAN D. DAVISON
Presentation by: Jason Bender

Page 2: Web Page Classification

Outline
Introduction to Classification
Background
○ Classification Types
○ Classification Methods
Applications
Features
Algorithms
Evolution of Websites

Page 3: Web Page Classification

What is web page classification?
The process of assigning a web page to one or more predefined category labels (e.g., news, sports, business…)

Classification is generally posed as a supervised learning problem
○ A set of labeled data is used to train a classifier, which is then applied to label future examples
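A minimal sketch of the train-then-label loop described above. The slides do not name a specific learner; a small word-count Naive Bayes model with add-one smoothing is assumed here purely for illustration, and the training pages and labels are made up.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """Build per-label word counts from (text, label) training pairs."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of training pages
    vocab = set()
    for text, label in labeled_pages:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts, vocab = model
    total_pages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data (illustrative only).
training = [
    ("team wins the match with a late goal", "sports"),
    ("coach praises players after the game", "sports"),
    ("stocks fall as the market reacts to earnings", "business"),
    ("company reports record quarterly profit", "business"),
]
model = train(training)
print(classify(model, "players score a goal in the game"))
```

The labeled pages play the role of the training set; `classify` is then applied to a page the model has not seen.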

Page 4: Web Page Classification

Background - Classification Types
Supervised learning problem broken into sub-problems:
○ Subject Classification
○ Functional Classification
○ Sentiment Classification
○ Other types of Classification

Page 5: Web Page Classification

Subject Classification
Concerned with the subject or topic of the web page
○ Judging whether a page is about arts, business, sports, etc.

Functional Classification
Concerned with the role that the page is playing
○ Deciding whether a page is a personal homepage, course page, admissions page, etc.

Page 6: Web Page Classification

Sentiment Classification
Focuses on the opinion that is presented in a web page

Other types of Classification
Such as genre classification and search engine spam classification

Page 7: Web Page Classification

Background - Classification Methods
○ Binary vs. Multiclass
○ Single-Label vs. Multi-Label
○ Soft vs. Hard
○ Flat vs. Hierarchical

Page 8: Web Page Classification

Binary vs. Multiclass Classification

Page 9: Web Page Classification

Single-Label vs. Multi-Label Classification

Page 10: Web Page Classification

Soft vs. Hard Classification

Page 11: Web Page Classification

Flat vs. Hierarchical Classification

Page 12: Web Page Classification

Applications
Why is classification important and how can we use it efficiently?

Page 13: Web Page Classification

Constructing, maintaining, or expanding web directories
Web directories provide an efficient way to browse for information within a predefined set of categories
Example: Open Directory Project
○ Currently constructed by human effort
○ 78,940 editors of ODP

Page 14: Web Page Classification

Improving the quality of search results
A big problem with search results is search ambiguity

Page 15: Web Page Classification

Helping question answering systems
Classification systems can be used to help improve the quality of answers
Example: Wolfram Alpha

Other applications
Contextual advertising

Page 16: Web Page Classification
Page 17: Web Page Classification

Features
What features can we extract from a web page to help classify it?

Page 18: Web Page Classification

Features - Introduction
Because of features such as the hyperlink <a> … </a>, webpage classification is vastly different from other forms of classification such as plaintext classification.

Features are organized into two groups:
○ On-page features – directly located on the page
○ Neighbor features – found on related pages

Page 19: Web Page Classification

On-Page Features
Textual Contents & Tags
○ Bag-of-words
○ N-gram feature
Rather than analyzing individual words, group them into sequences of n consecutive words.
- Ex: "New York" vs. "new … York"
Yahoo! has used a 5-gram feature
○ HTML tags – title, heading, metadata, main text
○ URL
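A small sketch of the bag-of-words and n-gram features listed above, assuming simple whitespace tokenization and lowercasing (the slides do not specify a tokenizer). The function names are illustrative.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return all runs of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_features(text, max_n=2):
    """Count every 1-gram up to max_n-gram; n=1 is plain bag-of-words."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(text, n))
    return features

# "new york" is a feature only when the two words are adjacent.
adjacent = ngram_features("I love New York")
separated = ngram_features("a new hotel in York")
print("new york" in adjacent, "new york" in separated)
```

With single words alone, both example sentences contain "new" and "york"; the 2-gram feature is what distinguishes the city name from the separated words.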

Page 20: Web Page Classification

On-Page Features
Visual Analysis
Each page has two representations
○ Text via HTML
○ Visual via the browser
Each page can be represented as a visual adjacency multigraph

Page 21: Web Page Classification

Features of Neighbors
What happens when a page’s features are missing or are unrecognizable?

Page 22: Web Page Classification

Features of Neighbors
Assumptions
○ If page1 is in the neighborhood of many “sports” pages, then there is an increased probability that page1 is also a “sports” page.
○ Linked pages are more likely to have terms in common
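The first assumption can be sketched as a label count over directly linked pages. This is only an illustration: the link list, the page names, and the idea of turning raw counts into a distribution are assumptions, not a method taken from the slides.

```python
from collections import Counter

def neighbor_label_distribution(page, links, labels):
    """Estimate label probabilities for `page` from its labeled neighbors.

    links  : iterable of (source, destination) hyperlink pairs
    labels : dict mapping already-classified pages to their label
    """
    neighbors = set()
    for src, dst in links:
        if src == page:
            neighbors.add(dst)
        elif dst == page:
            neighbors.add(src)
    neighbors.discard(page)  # ignore self-links
    counts = Counter(labels[n] for n in neighbors if n in labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no labeled neighbors, so no evidence
    return {label: count / total for label, count in counts.items()}

# Hypothetical link graph and known labels.
links = [("a", "target"), ("b", "target"), ("target", "c"), ("d", "e")]
labels = {"a": "sports", "b": "sports", "c": "business", "e": "arts"}
dist = neighbor_label_distribution("target", links, labels)
print(max(dist, key=dist.get))
```

The more "sports" neighbors a page has, the larger the share of "sports" in the returned distribution.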

Page 23: Web Page Classification

Features of Neighbors Neighbor Selection

Focus on pages within 2 steps of the target
6 types: parent, child, sibling, spouse, grandparent, and grandchild
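A sketch of how the six neighbor types might be collected from a link graph. The slides only name the types; the definitions used here are assumptions (parent links to the target, child is linked from the target, sibling shares a parent, spouse shares a child, grandparent and grandchild are two links away in one direction).

```python
from collections import defaultdict

def neighbor_sets(target, links):
    """Group pages within two link steps of `target` by relationship."""
    out_links = defaultdict(set)  # page -> pages it links to
    in_links = defaultdict(set)   # page -> pages linking to it
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)

    def step(pages, table):
        """Follow one more link from every page in `pages`."""
        result = set()
        for page in pages:
            result |= table.get(page, set())
        return result

    parents = set(in_links.get(target, set()))
    children = set(out_links.get(target, set()))
    groups = {
        "parent": parents,
        "child": children,
        "sibling": step(parents, out_links),     # other children of parents
        "spouse": step(children, in_links),      # other parents of children
        "grandparent": step(parents, in_links),
        "grandchild": step(children, out_links),
    }
    # The target itself is never its own neighbor.
    return {name: pages - {target} for name, pages in groups.items()}

# Hypothetical link graph.
links = [("gp", "p"), ("p", "t"), ("p", "s"),
         ("t", "c"), ("x", "c"), ("c", "gc")]
print(neighbor_sets("t", links))
```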

Page 24: Web Page Classification

Features of Neighbors
○ Labels
○ Anchor Text
○ Surrounding Anchor Text

By using the anchor text, surrounding text, and page title of a parent page in combination with text from the target page, classification can be improved.
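A sketch of pulling anchor text out of a parent page with the standard-library `html.parser` module. The class name, the exact-match test on `href`, and the sample HTML are assumptions made for illustration; real pages would also need URL normalization.

```python
from html.parser import HTMLParser

class AnchorTextCollector(HTMLParser):
    """Collect the text of every <a> element that points at target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.inside_target_link = False
        self.anchor_texts = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self.inside_target_link = True
            self._buffer = []

    def handle_data(self, data):
        if self.inside_target_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.inside_target_link:
            self.anchor_texts.append("".join(self._buffer).strip())
            self.inside_target_link = False

def anchor_text_for(parent_html, target_url):
    """Return the anchor texts a parent page uses for target_url."""
    collector = AnchorTextCollector(target_url)
    collector.feed(parent_html)
    collector.close()
    return collector.anchor_texts

# Hypothetical parent page.
parent_html = (
    '<p>Read the <a href="http://example.com/scores">latest football '
    'scores</a> or <a href="http://example.com/other">other news</a>.</p>'
)
print(anchor_text_for(parent_html, "http://example.com/scores"))
```

The returned strings could then be appended to the target page's own text before feature extraction.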

Page 25: Web Page Classification

Features of Neighbors
Implicit Links
Connections between pages that appear in the results of the same query and are both clicked by users
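A sketch of deriving implicit links from a click log. The log format, a list of (query, clicked URL) pairs, is an assumption, and the sketch does not check whether the clicks came from the same user since the slide does not say.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Link every pair of pages clicked for the same query.

    click_log: iterable of (query, clicked_url) pairs.
    Returns a set of (url_a, url_b) pairs with url_a < url_b.
    """
    clicked = defaultdict(set)  # query -> URLs clicked for that query
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links

# Hypothetical click log.
log = [
    ("jaguar", "cars.example/jaguar"),
    ("jaguar", "zoo.example/jaguar"),
    ("python", "python.example"),
    ("jaguar", "cars.example/jaguar"),
]
print(implicit_links(log))
```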

Page 26: Web Page Classification

Algorithms
What are the algorithmic approaches to webpage classification?
○ Dimension reduction
○ Relational learning
○ Hierarchical classification
○ Information combination

Page 27: Web Page Classification

Dimension Reduction
Boost classification by emphasizing certain features that are more useful in classification
Feature Weighting
○ Reduces the dimensions of the feature space
○ Reduces computational complexity
○ Classification is more accurate as a result of the reduced space

Page 28: Web Page Classification

Dimension Reduction
Methods
Use first fragment
K-nearest neighbor algorithm
○ Weighted features
○ Weighted HTML tags
○ Metrics
- Expected mutual information
- Mutual information
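A sketch of scoring one term against one category with expected mutual information, using the standard definition (sum over term present/absent and page in/out of the category of p(t,c) log2 [p(t,c) / (p(t) p(c))]). Whether the surveyed systems compute it exactly this way is an assumption, and the sample pages are made up.

```python
import math

def expected_mutual_information(docs, term, label):
    """Expected mutual information (in bits) between a term and a label.

    docs: list of (set_of_words, label) pairs.
    """
    n = len(docs)
    score = 0.0
    for term_present in (True, False):
        for in_class in (True, False):
            joint = sum(1 for words, l in docs
                        if (term in words) == term_present
                        and (l == label) == in_class)
            term_total = sum(1 for words, _ in docs
                             if (term in words) == term_present)
            class_total = sum(1 for _, l in docs
                              if (l == label) == in_class)
            if joint:  # a zero joint count contributes nothing
                score += (joint / n) * math.log2(
                    joint * n / (term_total * class_total))
    return score

# Hypothetical training pages as word sets.
docs = [
    ({"the", "goal", "team"}, "sports"),
    ({"the", "goal", "coach"}, "sports"),
    ({"the", "market", "profit"}, "business"),
    ({"the", "stocks", "fall"}, "business"),
]
print(expected_mutual_information(docs, "goal", "sports"))
print(expected_mutual_information(docs, "the", "sports"))
```

Terms that score near zero carry little information about the category and are candidates for removal from the feature space.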

Page 29: Web Page Classification

Relational Learning
Relaxation Labeling

Page 30: Web Page Classification

Hierarchical Classification
Based on “divide and conquer”
○ Classification problems are split into a hierarchical set of sub-problems.

Error Minimization
○ When a lower-level category is uncertain whether a page belongs to it, shift the assignment one level up.
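A sketch of top-down hierarchical classification with the back-off rule above. The category tree, the keyword lists, the keyword-overlap confidence score, and the 0.3 threshold are all hypothetical stand-ins for trained per-node classifiers.

```python
# Hypothetical category tree: each category maps to its sub-categories.
HIERARCHY = {
    "root": ["sports", "business"],
    "sports": ["soccer", "tennis"],
    "business": [],
    "soccer": [],
    "tennis": [],
}

# Hypothetical keyword lists standing in for trained classifiers.
KEYWORDS = {
    "sports": {"match", "player", "team", "goal", "racket"},
    "business": {"market", "profit", "stocks"},
    "soccer": {"goal", "striker"},
    "tennis": {"racket", "serve"},
}

def confidence(category, words):
    """Fraction of the category's keywords found on the page."""
    keywords = KEYWORDS[category]
    return len(keywords & words) / len(keywords)

def classify_hierarchically(text, threshold=0.3):
    """Descend the tree, solving one small sub-problem per level."""
    words = set(text.lower().split())
    current = "root"
    while HIERARCHY[current]:
        best = max(HIERARCHY[current], key=lambda c: confidence(c, words))
        if confidence(best, words) < threshold:
            break  # uncertain: keep the assignment one level up
        current = best
    return current

print(classify_hierarchically("the team scored a late goal from the striker"))
print(classify_hierarchically("the team praised every player after the match"))
```

The second page is confidently about sports but gives no evidence for a specific sport, so it stays at the "sports" level rather than being forced into a sub-category.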

Page 31: Web Page Classification

Information Combination
Combine several methods into one
○ Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
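A sketch of one way to combine classifiers: a simple majority vote. The slides do not say how the decisions are merged, so voting is an assumption, and the three rule-based classifiers (page text, URL, anchor text) are hypothetical placeholders for trained models.

```python
from collections import Counter

def combine_by_vote(classifiers, page):
    """Return the label chosen by the most classifiers."""
    votes = Counter(classifier(page) for classifier in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical classifiers, each looking at a different information source.
def by_text(page):
    return "sports" if "goal" in page["text"].lower() else "business"

def by_url(page):
    return "sports" if "/sports/" in page["url"] else "business"

def by_anchor_text(page):
    hit = any("score" in anchor.lower() for anchor in page["anchors"])
    return "sports" if hit else "business"

page = {
    "text": "Quarterly review",
    "url": "http://example.com/sports/finals",
    "anchors": ["latest scores"],
}
print(combine_by_vote([by_text, by_url, by_anchor_text], page))
```

Here the text-based classifier is outvoted by the URL and anchor-text classifiers, which shows how link information can correct a weak on-page signal.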

Page 32: Web Page Classification

Conclusion
Webpage classification is a type of supervised learning problem that aims to assign a webpage to one or more of a predefined set of categories.

In the future, efforts will most likely be focused on effectively combining content and link information to build a more accurate classifier.

Page 33: Web Page Classification

Evolution of Websites Apple in 1998

Page 34: Web Page Classification

Evolution of Websites Apple in 2008

Page 35: Web Page Classification

Evolution of Websites Nike in 2000

Page 36: Web Page Classification

Evolution of Websites Nike in 2008

Page 37: Web Page Classification

Evolution of Websites Yahoo in 1996

Page 38: Web Page Classification

Evolution of Websites Yahoo in 2008

Page 39: Web Page Classification

Evolution of Websites Microsoft in 1998

Page 40: Web Page Classification

Evolution of Websites Microsoft in 2008

Page 41: Web Page Classification

Evolution of Websites MTV in 1998

Page 42: Web Page Classification

Evolution of Websites MTV in 2008

Page 43: Web Page Classification

Sources
Web Page Classification: Features and Algorithms, by Xiaoguang Qi & Brian D. Davison

Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification, by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic

The Evolution of Websites, http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx