DSCI 5240 Graduate Presentation Xxxxxx

DSCI 5240 Graduate Presentation Xxxxxx Research paper: Web Mining Research: A survey SIGKDD Explorations, June 2000. Volume 2, Issue 1 Author: R. Kosala and H. Blockeel


Transcript of DSCI 5240 Graduate Presentation Xxxxxx

Page 1: DSCI 5240 Graduate Presentation Xxxxxx

DSCI 5240 Graduate Presentation Xxxxxx

Research paper: Web Mining Research: A Survey. SIGKDD Explorations, June 2000, Volume 2, Issue 1

Authors: R. Kosala and H. Blockeel

Page 2: DSCI 5240 Graduate Presentation Xxxxxx

Introduction
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Conclusion

Outline

Page 3: DSCI 5240 Graduate Presentation Xxxxxx

The World Wide Web is a popular and interactive medium for disseminating information

Information users may encounter four problems:
1. Finding relevant information
   a. low precision
   b. low recall
2. Creating new knowledge out of the information available on the web (a data-triggered process)
3. Personalizing the information: people differ in the content and presentation of information they prefer
4. Learning about consumers or individual users: mass customization or even personalization

Introduction
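The "low precision" and "low recall" problems above can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: the page identifiers and the `precision_recall` helper are hypothetical, and it assumes the set of truly relevant pages is known for the query.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = fraction of retrieved pages that are relevant
    recall    = fraction of relevant pages that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 pages returned, 2 of them relevant,
# out of 5 relevant pages that exist on the web.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = ["p2", "p4", "p5", "p6", "p7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Low precision means many returned pages are irrelevant; low recall means many relevant pages are never returned.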

Page 4: DSCI 5240 Graduate Presentation Xxxxxx

Definition: web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data

Four subtasks:
1. Resource finding: retrieving intended web documents
2. Information selection and pre-processing: selecting and pre-processing specific information
3. Generalization: discovering general patterns
4. Analysis: validation and/or interpretation of mined patterns

Web Mining

Page 5: DSCI 5240 Graduate Presentation Xxxxxx

Web Mining and Information Retrieval
Definition: IR is the automatic retrieval of all relevant documents while at the same time retrieving as few of the non-relevant documents as possible.
Goal: indexing and searching for useful documents

Web Mining and Information Extraction
IE has the goal of transforming a collection of documents into information that is more readily digested and analyzed.

Compare IR and IE
a. aims
b. fields

Web Mining
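The IR goal of "indexing and searching for useful documents" can be illustrated with a toy inverted index. This is a minimal sketch under simple assumptions (whitespace tokenization, exact word match, all query words required); the document ids, texts, and function names are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document collection.
docs = {
    "d1": "web mining survey",
    "d2": "web usage mining",
    "d3": "database systems",
}
index = build_index(docs)
print(search(index, "web mining"))
```

IR in this sense returns whole documents; IE would instead pull structured facts out of the documents that were found.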

Page 6: DSCI 5240 Graduate Presentation Xxxxxx

Web Mining and the Agent Paradigm
Web mining is often viewed from, or implemented within, an agent paradigm:
1. User interface agents
2. Distributed agents
3. Mobile agents

Two approaches used to develop intelligent agents:
1. Content-based approach
2. Collaborative approach

Web Mining

Page 7: DSCI 5240 Graduate Presentation Xxxxxx

Definition: discovering useful information from web page contents, data, and documents

Several types of data: text, image, audio, video, hyperlinks

Types of data structure:
1. Unstructured: free text
2. Semi-structured: HTML
3. More structured: data in tables or database-generated HTML pages

Web Content Mining

Page 8: DSCI 5240 Graduate Presentation Xxxxxx

IR view:
Unstructured Documents
a. Bag of words to represent unstructured documents
b. Features: Boolean, frequency-based
c. Variations of the feature selection
d. Features can be reduced using different feature selection techniques
Semi-Structured Documents
a. Uses richer representations for features
b. Uses common data mining methods

Web Content Mining
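The bag-of-words representation with Boolean and frequency-based features can be sketched as follows. This is an illustrative example only: the two sample documents, the whitespace tokenizer, and the function names are assumptions, and no feature selection is applied.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(docs):
    """Sorted list of every distinct word in the collection."""
    return sorted({word for doc in docs for word in tokenize(doc)})


def frequency_vector(doc, vocab):
    """Frequency-based feature: how often each vocabulary word occurs."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


def boolean_vector(doc, vocab):
    """Boolean feature: 1 if the word occurs in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocab]


# Hypothetical two-document collection.
docs = ["web mining mines the web", "usage mining"]
vocab = build_vocabulary(docs)
freq = [frequency_vector(d, vocab) for d in docs]
boolean = [boolean_vector(d, vocab) for d in docs]
print(vocab)
print(freq)
print(boolean)
```

Word order is discarded in both variants, which is what makes the representation a "bag"; feature selection would then shrink `vocab` before the vectors are built.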

Page 9: DSCI 5240 Graduate Presentation Xxxxxx

DB view:
The DB view tries to infer the structure of a web site or to transform a web site into a database.
Methods:
a. Finding the schema of web documents
b. Building a web warehouse
c. Building a web knowledge base
d. Building a virtual database

Web Content Mining

Page 10: DSCI 5240 Graduate Presentation Xxxxxx

Interested in the structure of the hyperlinks within the web

Inspired by the study of social networks and citation analysis

Discover specific types of pages based on the incoming and outgoing links

Applications:
a. discovering micro-communities in the web
b. measuring the completeness of a web site

Web Structure Mining
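Characterizing pages by their incoming and outgoing links can be sketched with simple degree counts. This is a deliberately simplistic illustration, not a specific algorithm from the surveyed paper; the link list and the `link_degrees` helper are hypothetical.

```python
from collections import defaultdict


def link_degrees(links):
    """Return {page: (in_degree, out_degree)} for a list of (source, target) links."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in links:
        out_deg[src] += 1
        in_deg[dst] += 1
    pages = set(in_deg) | set(out_deg)
    return {page: (in_deg[page], out_deg[page]) for page in pages}


# Hypothetical hyperlink graph: each pair is (linking page, linked page).
links = [("a", "b"), ("a", "c"), ("d", "b"), ("c", "b")]
degrees = link_degrees(links)

# A page with many incoming links is a candidate "important" page;
# a page with many outgoing links is a candidate directory-like page.
most_linked = max(degrees, key=lambda page: degrees[page][0])
print(degrees)
print(most_linked)
```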

Page 11: DSCI 5240 Graduate Presentation Xxxxxx

Tries to predict user behavior from interaction with the web

Wide range of data

Two commonly used approaches:
a. Maps the usage data of the web server into relational tables before an adapted data mining technique is performed
b. Uses the log data directly by utilizing special pre-processing techniques

Problems:
a. Distinguishing among unique users, server sessions, and episodes in the presence of caching and proxy servers
b. Usage mining often uses some background or domain knowledge

Applications

Web Usage Mining
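The pre-processing step of turning raw log records into server sessions can be sketched as below. This is a minimal sketch under loud assumptions: each record is already parsed into a hypothetical `(user, timestamp_in_seconds, url)` tuple, users are identified by IP address, and a new session starts after 30 minutes of inactivity. As the slide notes, caching and proxy servers make IP-based identification unreliable in practice.

```python
def sessionize(records, timeout=1800):
    """Group (user, timestamp, url) records into sessions.

    A new session starts when the same user has been idle for more
    than `timeout` seconds (1800 s is an assumed, not prescribed, value).
    """
    sessions = []
    last_seen = {}  # user -> (timestamp of last request, index of current session)
    for user, ts, url in sorted(records, key=lambda r: r[1]):
        if user in last_seen and ts - last_seen[user][0] <= timeout:
            idx = last_seen[user][1]
        else:
            sessions.append({"user": user, "pages": []})
            idx = len(sessions) - 1
        sessions[idx]["pages"].append(url)
        last_seen[user] = (ts, idx)
    return sessions


# Hypothetical, already-parsed log records.
records = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.2", 30, "/index.html"),
    ("10.0.0.1", 60, "/products.html"),
    ("10.0.0.1", 4000, "/index.html"),
]
sessions = sessionize(records)
for session in sessions:
    print(session)
```

Each resulting session is effectively one row of a relational table (user, ordered page list), which corresponds to the first of the two approaches listed above.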

Page 12: DSCI 5240 Graduate Presentation Xxxxxx

Survey of research in the area of web mining

Three web mining categories: content, structure, and usage mining

Connections between the web mining categories and the related agent paradigm

Conclusions
