Web Mining Research: A Survey

Authors: Raymond Kosala & Hendrik Blockeel

Presenter: Ryan Patterson

April 23rd 2014 CS332 Data Mining

pg 01

outline

• Introduction

• Web Mining

• Web Content Mining

• Web Structure Mining

• Web Usage Mining

• Review

• Exam Questions

pg 02

Introduction

“The Web is huge, diverse, and dynamic . . . we are currently drowning in information and facing information overload.”

Web users encounter problems:

• Finding relevant information

• Creating new knowledge out of the information available on the Web

• Personalization of the information

• Learning about consumers or individual users

pg 04

Web Mining

“Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.”

Web mining subtasks:

1. Resource finding

2. Information selection and pre-processing

3. Generalization

4. Analysis

pg 06

Web Mining

Information Retrieval & Information Extraction

• Information Retrieval (IR)

o the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant as possible

• Information Extraction (IE)

o transforming a collection of documents into information that is more readily digested and analyzed
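The IR goal above (all relevant documents, as few non-relevant ones as possible) is commonly quantified with recall and precision. Those two terms are not on the slide; the sketch below assumes them, and the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # precision: share of retrieved documents that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # recall: share of relevant documents that were retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result and ground truth
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Retrieving everything drives recall to 1 but hurts precision, which is why the definition asks for both properties at once.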

pg 07

Live demo

pg 08

Web Content Mining

Information Retrieval View

Unstructured Documents

• Most approaches use a “bag of words” representation to generate document features

o ignores the sequence in which the words occur

• Document features can be reduced with feature selection algorithms

o e.g., information gain

• Possible alternative document feature representations:

o word positions in the document

o phrases/terms (e.g., “annual interest rate”)
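A minimal sketch of the bag-of-words representation and the phrase alternative, using only the Python standard library. The tokenizer (lowercase alphanumeric runs) and the function names are assumptions made for illustration, not something the survey prescribes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Map each word to its count; word order is discarded."""
    return Counter(tokenize(text))


def ngrams(text, n):
    """Count contiguous n-word phrases, which keeps local word order."""
    tokens = tokenize(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = bag_of_words("the annual interest rate rose")
b = bag_of_words("rose the rate interest annual")
print(a == b)  # the two bags are identical although the word order differs
print(ngrams("the annual interest rate rose", 3)["annual interest rate"])
```

The two sentences share one bag of words, while only the first contains the phrase “annual interest rate”, which is the information a phrase-based representation recovers.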

Semi-Structured Documents

• Utilize additional structural information gleaned from the document

o HTML markup (intra-document structure)

o HTML links (inter-document structure)
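As a rough illustration of the two kinds of structure, the sketch below uses Python's standard `html.parser` to pull the title text (markup inside the document) and the link targets (structure between documents) out of a page. The class name and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collect title words (intra-document) and link targets (inter-document)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title_words = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_words.extend(data.split())


# Hypothetical page used only for illustration
page = (
    "<html><head><title>Interest Rates</title></head><body>"
    '<p>See <a href="http://example.com/rates">rates</a> '
    'and <a href="/faq">FAQ</a>.</p></body></html>'
)
parser = StructureExtractor()
parser.feed(page)
print(parser.title_words, parser.links)
```

A content miner could weight the title words more heavily than body text, while the collected links become the edges used in structure mining.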

pg 10

Web content mining, IR unstructured documents

pg 11

Web content mining, IR semi structured documents

pg 12

Web Content Mining

Database View

“the Database view tries . . . to transform a Web site to become a database so that . . . querying on the Web become[s] possible.”

• Uses Object Exchange Model (OEM)

o represents semi-structured data by a labeled graph

• Database view algorithms typically start from manually selected Web sites

o site-specific parsers

• Database view algorithms produce:

o document-level schemas or DataGuides (a structural summary of semi-structured data)

o frequent substructures (sub-schemas)

o a multi-layered database (each layer is obtained by generalizations on lower layers)
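To make the idea of a structural summary concrete, the sketch below stores semi-structured data as nested Python dictionaries and lists (a tree-shaped stand-in for an OEM labeled graph) and collects every distinct label path from the root. This is a simplified approximation of what a DataGuide summarizes, not the published algorithm; the data and the function name are hypothetical.

```python
def label_paths(node, prefix=()):
    """Collect every distinct path of edge labels reachable from the root."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # several children under the same label share one summary path
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


# Hypothetical semi-structured data: records need not share the same fields
site = {
    "restaurant": [
        {"name": "Example A", "address": {"city": "Springfield"}},
        {"name": "Example B", "phone": "555-0100"},
    ]
}
summary = label_paths(site)
for path in sorted(summary):
    print(".".join(path))
```

Each path appears once in the summary no matter how many records contain it, which is what lets the summary act as a schema for data that has no fixed schema.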

pg 13

Web content mining, Database view

pg 14

Web Structure Mining

“. . . we are interested in the structure of the hyperlinks within the Web itself”

• Inspired by the study of social networks and citation analysis

o based on incoming & outgoing links we could discover specific types of pages (such as hubs, authorities, etc.)

• Some algorithms calculate the quality/relevancy of each Web page

o e.g., PageRank

• Others measure the completeness of a Web site

o measuring frequency of local links on the same server

o interpreting the nature of hierarchy of hyperlinks on one domain
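A minimal sketch of link-based page scoring in the style of PageRank, computed by power iteration over a small hypothetical link graph. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are common choices assumed here, not details taken from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages by power iteration; links maps each page to its outgoing links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page receives a small baseline share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # page without outgoing links: spread its rank over all pages
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page site
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page "c" collects links from three pages and ends up with the highest score, while "d", which nothing links to, keeps only the baseline share.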

pg 16

Web Usage Mining

“. . . focuses on techniques that could predict user behavior while the user interacts with the Web.”

• Web usage is mined by parsing Web server logs

o mapped into relational tables → data mining techniques applied

o log data utilized directly

• Users connecting through proxy servers, and users or ISPs caching Web data, decrease server log accuracy

• Two applications:

o personalized - user profile or user modeling in adaptive interfaces

o impersonalized - learning user navigation patterns
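The first approach (logs mapped into relational tables) can be sketched as follows, assuming the server writes the Common Log Format. The regular expression, the field names, and the sample lines are illustrative assumptions.

```python
import re
from collections import Counter

# Assumed Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log(lines):
    """Turn raw log lines into table-like rows, skipping lines that do not match."""
    rows = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            rows.append(match.groupdict())
    return rows


# Hypothetical log lines
sample = [
    '192.0.2.1 - - [23/Apr/2014:10:00:00 -0500] "GET /index.html HTTP/1.1" 200 1024',
    '192.0.2.1 - - [23/Apr/2014:10:00:05 -0500] "GET /courses.html HTTP/1.1" 200 2048',
    '192.0.2.7 - - [23/Apr/2014:10:01:00 -0500] "GET /index.html HTTP/1.1" 200 1024',
    'malformed line',
]
rows = parse_log(sample)
hits = Counter(row["path"] for row in rows)
print(hits.most_common())
```

Once the log is in row form, standard data mining techniques can be applied to it. Note that requests served from a proxy or cache never reach the server, so they never appear as rows.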

pg 18

Review

• Web mining

o 4 subtasks

o IR & IE

• Web content mining

o primarily intra-page analysis

o IR view vs DB view

• Web structure mining

o primarily inter-page analysis

• Web usage mining

o primarily analysis of server activity logs

pg 20

Web mining categories

• Web Content Mining, IR View

o View of data: unstructured; semi-structured

o Main data: text documents; hypertext documents

o Representation: bag of words, n-grams; terms, phrases; concepts of ontology; relational

o Method: TFIDF and variants; machine learning; statistical (incl. NLP)

o Application categories: categorization; clustering; finding extraction rules; finding patterns in text; user modeling

• Web Content Mining, DB View

o View of data: semi-structured; Web site as DB

o Main data: hypertext documents

o Representation: edge-labeled graph (OEM); relational

o Method: proprietary algorithms; ILP; (modified) association rules

o Application categories: finding frequent sub-structures; Web site schema discovery

• Web Structure Mining

o View of data: links structure

o Main data: links structure

o Representation: graph

o Method: proprietary algorithms

o Application categories: categorization; clustering

• Web Usage Mining

o View of data: interactivity

o Main data: server logs; browser logs

o Representation: relational table; graphs

o Method: machine learning; statistical; (modified) association rules

o Application categories: site construction, adaptation, and management; marketing; user modeling

pg 21

Exam Question 1

Q: Of the following Web mining paradigms:

• Information Retrieval

• Information Extraction

Which does a traditional Web search engine (google.com, bing.com, etc.) attempt to accomplish? Briefly support your answer.

A: Information Retrieval. The search engine attempts to provide a list of documents ranked by their relevance to the search query.

pg 24

Exam Question 2

Q: State one common problem hampering accurate Web usage mining. Briefly support your answer.

A:

• Users connecting to a Web site through a proxy server,

• Users (or their ISPs) utilizing Web data caching,

will result in decreased server log accuracy. Accurate server logs are required for accurate Web usage mining.

pg 26

Exam Question 3

Q: What is the phrase associated with the most popular method for Web content mining algorithms to generate document features from unstructured documents?

A: “Bag of words” representation.

pg 28