Web Page Classification

Post on 25-Feb-2016



WEB PAGE CLASSIFICATION: Features and Algorithms

Paper by: XIAOGUANG QI and BRIAN D. DAVISON
Presentation by: Jason Bender

Outline
Introduction to Classification
Background
○ Classification Types
○ Classification Methods
Applications
Features
Algorithms
Evolution of Websites

What is web page classification?
The process of assigning a web page to one or more predefined category labels (e.g., news, sports, business).
Classification is generally posed as a supervised learning problem.
A set of labeled data is used to train a classifier, which is then applied to label future examples.
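The train-then-label loop described above can be sketched with a small naive Bayes classifier over page text. This is only an illustration, not the method from the paper: the tokenizer, the function names `train` and `classify`, and the four training sentences are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def train(labeled_pages):
    """Build per-category word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    page_counts = Counter()              # label -> number of training pages
    vocabulary = set()
    for text, label in labeled_pages:
        page_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, page_counts, vocabulary

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, page_counts, vocabulary = model
    total_pages = sum(page_counts.values())
    best_label, best_score = None, float("-inf")
    for label in page_counts:
        # Log prior: fraction of training pages carrying this label.
        score = math.log(page_counts[label] / total_pages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data standing in for a real training set.
training_pages = [
    ("the team won the match last night", "sports"),
    ("the striker scored a late goal", "sports"),
    ("shares fell as the market closed", "business"),
    ("the company reported quarterly profit", "business"),
]
model = train(training_pages)
print(classify("the goal decided the match", model))
```

A real system would train on many labeled pages and use richer features than raw words, but the two phases (fit on labeled data, then label unseen pages) stay the same.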

Background - Classification Types
The supervised learning problem can be broken into sub-problems:
Subject Classification
Functional Classification
Sentiment Classification
Other types of Classification

Subject Classification
Concerned with the subject or topic of the web page.
Judging whether a page is about arts, business, sports, etc.

Functional Classification
Concerned with the role that the page plays.
Deciding whether a page is a personal homepage, course page, admissions page, etc.

Sentiment Classification
Focuses on the opinion that is presented in a web page.

Other types of Classification
Such as genre classification and search engine spam classification.

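Background - Classification Methods
Binary vs. Multiclass Classification: two candidate classes vs. more than two.
Single-Label vs. Multi-Label Classification: exactly one label per page vs. possibly several.
Soft vs. Hard Classification: a likelihood of belonging to each class vs. a definite in-or-out decision.
Flat vs. Hierarchical Classification: categories treated as parallel vs. categories organized in a tree.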
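A short sketch may help separate the soft/hard and single-label/multi-label distinctions. It assumes a classifier has already produced a per-category score between 0 and 1 for one page; the scores, the 0.5 threshold, and both function names are hypothetical.

```python
def hard_single_label(scores):
    """Hard, single-label decision: keep only the top-scoring category."""
    return max(scores, key=scores.get)

def hard_multi_label(scores, threshold=0.5):
    """Hard, multi-label decision: keep every category at or above the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)

# Soft classification output: one likelihood per category for a single page.
soft_scores = {"news": 0.55, "sports": 0.70, "business": 0.10}

print(hard_single_label(soft_scores))
print(hard_multi_label(soft_scores))
```

The soft result is the score dictionary itself; the two functions show different ways of turning it into hard decisions.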

Applications
Why is classification important, and how can we use it efficiently?

Constructing, maintaining, or expanding web directories
Web directories provide an efficient way to browse for information within a predefined set of categories.
Example: Open Directory Project (ODP)
Currently constructed by human effort: 78,940 ODP editors.

Improving the quality of search results
A big problem with search results is search ambiguity.

Helping question answering systems
Classification systems can be used to help improve the quality of answers.
Example: Wolfram Alpha

Other applications
Contextual advertising

Features
What features can we extract from a web page to help classify it?

Features - Introduction
Because of features such as the hyperlink <a> … </a>, web page classification differs substantially from other forms of classification, such as plain-text classification.
Features are organized into two groups:
○ On-page features – located directly on the page
○ Neighbor features – found on related pages

On-Page Features - Textual Contents & Tags
Bag-of-words
○ N-gram feature
Rather than analyzing individual words, group them into sequences of n consecutive words.
- Ex: “New York” vs. “new … … York”
Yahoo! has used a 5-gram feature.
HTML tags – title, heading, metadata, main text
URL
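The n-gram idea can be illustrated with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and lowercasing; the function name `word_ngrams` and the sample sentences are made up for the example.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in the text, lowercased."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The bigram "new york" appears only when the two words are adjacent.
print(word_ngrams("flights to New York", 2))
print(word_ngrams("the new mayor of York", 2))
```

A plain bag-of-words model would treat both sentences as containing "new" and "york"; the bigram feature keeps them apart.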

On-Page Features - Visual Analysis
Each page has two representations:
○ Text via HTML
○ Visual via the browser
Each page can be represented as a visual adjacency multigraph.

Features of Neighbors
What happens when a page’s features are missing or unrecognizable?

Features of Neighbors - Assumptions
If page1 is in the neighborhood of many “sports” pages, then the probability that page1 is also a “sports” page increases.
Linked pages are more likely to have terms in common.

Features of Neighbors - Neighbor Selection
Focus on pages within 2 steps of the target.
6 types: parent, child, sibling, spouse, grandparent, and grandchild
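The six neighbor types can be read off a link graph. The sketch below assumes the usual definitions (a parent links to the target, a child is linked from it, siblings share a parent, spouses share a child) and a hypothetical graph stored as a dictionary from each page to the set of pages it links to; the function name and page names are illustrative only.

```python
def two_step_neighbors(links, target):
    """Group the pages within two link steps of target by relationship type."""
    children = set(links.get(target, ()))
    parents = {page for page, outs in links.items() if target in outs}
    # Siblings share a parent with the target; spouses share a child.
    siblings = {c for p in parents for c in links.get(p, ())} - {target}
    spouses = {page for page, outs in links.items() if outs & children} - {target}
    grandparents = {page for page, outs in links.items() if outs & parents}
    grandchildren = {g for c in children for g in links.get(c, ())}
    return {
        "parent": parents,
        "child": children,
        "sibling": siblings,
        "spouse": spouses,
        "grandparent": grandparents,
        "grandchild": grandchildren,
    }

# Hypothetical link graph: page -> set of pages it links to.
links = {
    "root": {"hub"},
    "hub": {"target", "sib"},
    "target": {"kid"},
    "other": {"kid"},
    "kid": {"leaf"},
}
print(two_step_neighbors(links, "target"))
```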

Features of Neighbors
Labels
Anchor Text
Surrounding Anchor Text
By using the anchor text, surrounding text, and page title of a parent page in combination with text from the target page, classification can be improved.

Features of Neighbors - Implicit Links
Connections between pages that appear in the results of the same query and are both clicked by users.
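As a rough sketch, implicit links could be derived from a search click log by pairing up pages that were clicked for the same query. The log format (query, clicked URL), the function name, and the sample entries are assumptions for illustration; a real log would also need session handling and noise filtering.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """Pair up pages that were clicked for the same query."""
    clicked = defaultdict(set)           # query -> set of clicked URLs
    for query, url in click_log:
        clicked[query].add(url)
    links = set()
    for urls in clicked.values():
        # Sorting gives each unordered pair one canonical (a, b) form.
        for a, b in combinations(sorted(urls), 2):
            links.add((a, b))
    return links

# Hypothetical click log entries.
click_log = [
    ("jaguar", "a.example"),
    ("jaguar", "b.example"),
    ("python", "c.example"),
]
print(implicit_links(click_log))
```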

Algorithms
What are the algorithmic approaches to web page classification?
Dimension reduction
Relational learning
Hierarchical classification
Information combination

Dimension Reduction
Boost classification by emphasizing the features that are more useful for classification.
Feature Weighting
○ Reduces the dimensions of the feature space
○ Reduces computational complexity
○ Classification is more accurate as a result of the reduced space

Dimension Reduction - Methods
Use first fragment
K-nearest neighbor algorithm
○ Weighted features
○ Weighted HTML tags
○ Metrics: expected mutual information, mutual information
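The two metrics can be computed from simple document counts. The sketch below uses one common formulation, which may differ in detail from the one the paper has in mind: `n_tc` is the number of pages in the category that contain the term, `n_t` the pages containing the term, `n_c` the pages in the category, and `n` the total number of pages. All names are hypothetical.

```python
import math

def mutual_information(n_tc, n_t, n_c, n):
    """Pointwise MI between 'term present' and 'page in category'.

    Assumes n_tc > 0; the value is undefined when the two never co-occur.
    """
    return math.log2((n_tc * n) / (n_t * n_c))

def expected_mutual_information(n_tc, n_t, n_c, n):
    """MI averaged over all four term present/absent x in/out-of-category cells."""
    cells = [
        (n_tc, n_t, n_c),                          # term present, in category
        (n_t - n_tc, n_t, n - n_c),                # term present, other category
        (n_c - n_tc, n - n_t, n_c),                # term absent, in category
        (n - n_t - n_c + n_tc, n - n_t, n - n_c),  # term absent, other category
    ]
    total = 0.0
    for joint, term_margin, category_margin in cells:
        if joint > 0:  # empty cells contribute nothing
            total += (joint / n) * math.log2((joint * n) / (term_margin * category_margin))
    return total

# A term that appears in exactly the pages of one category (50 of 100 pages).
print(expected_mutual_information(50, 50, 50, 100))
# A term spread evenly across categories carries no information.
print(expected_mutual_information(25, 50, 50, 100))
```

Terms would then be ranked by score and only the top-scoring ones kept as features.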

Relational Learning
Relaxation Labeling
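The slide only names relaxation labeling, so the following is a heavily simplified sketch of the general idea rather than the algorithm as published: each page starts from a content-based class distribution and is repeatedly re-estimated from its own distribution and the current estimates of its linked neighbors. The mixing weight `alpha`, the number of rounds, and all names are assumptions.

```python
def relaxation_labeling(content_probs, neighbors, rounds=10, alpha=0.5):
    """Iteratively blend each page's content estimate with its neighbors' estimates.

    content_probs: page -> {label: probability} from a content-only classifier.
    neighbors: page -> list of linked pages.
    """
    current = {page: dict(probs) for page, probs in content_probs.items()}
    for _ in range(rounds):
        updated = {}
        for page, own in content_probs.items():
            linked = [current[n] for n in neighbors.get(page, ()) if n in current]
            if not linked:
                # No usable neighbors: keep the content-based estimate.
                updated[page] = dict(own)
                continue
            mixed = {}
            for label in own:
                neighbor_avg = sum(d.get(label, 0.0) for d in linked) / len(linked)
                mixed[label] = alpha * own[label] + (1 - alpha) * neighbor_avg
            total = sum(mixed.values())
            updated[page] = {label: value / total for label, value in mixed.items()}
        current = updated
    return current

# Page "a" is ambiguous on content alone but links to two confident sports pages.
content_probs = {
    "a": {"sports": 0.5, "business": 0.5},
    "b": {"sports": 0.9, "business": 0.1},
    "c": {"sports": 0.9, "business": 0.1},
}
neighbors = {"a": ["b", "c"]}
print(relaxation_labeling(content_probs, neighbors)["a"])
```

In this toy graph the neighbors pull page "a" toward the sports label, which is the effect relational methods aim for.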

Hierarchical Classification
Based on “divide and conquer”
Classification problems are split into a hierarchical set of sub-problems.
Error Minimization
When a lower-level category is uncertain whether a page belongs to it, shift the assignment one level up.
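A top-down classifier with this back-off rule might look like the sketch below. The category tree, the keyword-based scorers, and the 0.6 confidence threshold are all hypothetical stand-ins for trained per-node classifiers.

```python
def keyword_score(keywords):
    """Build a toy scorer: the fraction of the given keywords found in the page."""
    def score(page):
        words = set(page.lower().split())
        return len(keywords & words) / len(keywords)
    return score

def classify_top_down(page, node, threshold=0.6):
    """Descend the tree; stay at the current node when no child is confident."""
    while node["children"]:
        scored = [(child["score"](page), child) for child in node["children"]]
        best_score, best_child = max(scored, key=lambda pair: pair[0])
        if best_score < threshold:
            break  # uncertain at the lower level: keep the assignment one level up
        node = best_child
    return node["name"]

# Hypothetical two-level category tree.
tree = {
    "name": "Top",
    "children": [
        {
            "name": "Sports",
            "score": keyword_score({"match", "team", "goal"}),
            "children": [
                {"name": "Soccer", "score": keyword_score({"striker", "penalty"}), "children": []},
                {"name": "Tennis", "score": keyword_score({"serve", "racket"}), "children": []},
            ],
        },
        {"name": "Business", "score": keyword_score({"market", "shares", "profit"}), "children": []},
    ],
}

print(classify_top_down("the team won the match with a late goal", tree))
print(classify_top_down("the striker missed a penalty as the team chased a goal in the match", tree))
```

The first page is clearly about sports but matches neither sub-category, so it stays at the Sports level; the second descends one level further.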

Information Combination
Combine several methods into one.
Information from different sources is used to train multiple classifiers, and the collective work of those classifiers makes the final decision.
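One simple way to let several classifiers reach a collective decision is majority voting, sketched below. Voting is only one possible combination scheme, and the three hand-written rules stand in for classifiers that would really be trained on page text, URLs, and anchor text; all names and the sample page are hypothetical.

```python
from collections import Counter

def by_text(page):
    """Stand-in for a classifier trained on the page's own text."""
    return "sports" if "match" in page["text"].lower() else "business"

def by_url(page):
    """Stand-in for a classifier trained on URL features."""
    return "sports" if "/sports/" in page["url"] else "business"

def by_anchor_text(page):
    """Stand-in for a classifier trained on anchor text from parent pages."""
    return "sports" if "score" in page["anchor_text"].lower() else "business"

def combine_by_vote(classifiers, page):
    """Return the label chosen by the most classifiers.

    On a tie, the label that was voted for first wins.
    """
    votes = Counter(classifier(page) for classifier in classifiers)
    return votes.most_common(1)[0][0]

page = {
    "text": "Quarterly results beat the market",
    "url": "http://example.com/sports/club-finances",
    "anchor_text": "club finances report",
}
print(combine_by_vote([by_text, by_url, by_anchor_text], page))
```

Here the URL alone would mislead, but the text and anchor-text classifiers outvote it.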

Conclusion
Web page classification is a type of supervised learning problem that aims to categorize a web page into a predefined set of categories.
In the future, efforts will most likely focus on effectively combining content and link information to build a more accurate classifier.

Evolution of Websites (screenshot slides)
Apple in 1998 and 2008
Nike in 2000 and 2008
Yahoo in 1996 and 2008
Microsoft in 1998 and 2008
MTV in 1998 and 2008

Sources
Web Page Classification: Features and Algorithms, by Xiaoguang Qi & Brian D. Davison
Visual Adjacency Multigraphs – A Novel Approach for a Web Page Classification, by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic
The Evolution of Websites: http://www.wakeuplater.com/website-building/evolution-of-websites-10-popular-websites.aspx