Overview of Web Data Mining and Applications Part II Bamshad Mobasher DePaul University


What is Web Mining?

Web Mining Definition: application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

Types of Web Mining

Web Mining
• Web Content Mining
• Web Structure Mining
• Web Usage Mining

Web Usage Mining: Extracting interesting patterns from user interactions with resources on one or more Web sites

Applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• Personalization

Data Mining and Personalization

• Personalization: “Killer App” for big data analytics
• Tangible successes in both research and industrial applications
  - recommender systems
  - personalized Web agents
  - user adaptive systems
  - Web marketing & targeted advertising
  - personalized search
• Sophisticated modeling approaches based on both predictive and unsupervised DM techniques

Web Usage Mining: data sources

• Typical Sources of Data:
  - automatically generated Web/application server access logs
  - e-commerce and product-oriented user events (e.g., shopping cart changes, product clickthroughs, etc.)
  - user profiles and/or user ratings
  - meta-data, page content, site structure
• User Transactions
  - sets or sequences of pageviews, possibly with associated weights
  - a pageview is a set of page files and associated objects that contribute to a single display in a Web browser

What’s in a Typical Server Log?

1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/

2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://maya.cs.depaul.edu/~classes/cs589/papers.html

3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200 318814 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey

4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/

5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html

6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html

Typical Fields in a Log File Entry

client IP address    1.2.3.4
base url             maya.cs.depaul.edu
date/time            2006-02-01 00:08:43
http method          GET
file accessed        /classes/cs589/papers.html
protocol version     HTTP/1.1
status code          200 (successful access)
bytes transferred    9221
referrer page        http://dataminingresources.blogspot.com/
user agent           Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)

In addition, there may be fields corresponding to
• login information
• client-side cookies (unique keys, issued to clients in order to identify a repeat visitor)
• session ids issued by the Web or application servers
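As a rough illustration of how the fields listed above can be pulled out of a raw entry, the sketch below splits one space-delimited log line into named fields. The `LogEntry` type and `parse_log_line` function are hypothetical names introduced here, and the meaning of the two "-" placeholders (assumed to be the user name and query string) is a guess based on the sample entries, not something the slides state.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    client_ip: str
    method: str
    path: str
    status: int
    bytes_sent: int
    protocol: str
    host: str
    user_agent: str
    referrer: str


def parse_log_line(line: str) -> LogEntry:
    """Split one space-delimited log entry into named fields."""
    parts = line.split()
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields, got {len(parts)}")
    # Assumption: the two "-" placeholders are user name and query string.
    (date, time, ip, _user, method, path, _query,
     status, size, protocol, host, agent, referrer) = parts
    return LogEntry(
        timestamp=datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        client_ip=ip,
        method=method,
        path=path,
        status=int(status),
        bytes_sent=int(size),
        protocol=protocol,
        host=host,
        user_agent=agent.replace("+", " "),  # the log encodes spaces as "+"
        referrer=referrer,
    )


sample = (
    "2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 "
    "HTTP/1.1 maya.cs.depaul.edu "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) "
    "http://dataminingresources.blogspot.com/"
)
entry = parse_log_line(sample)
print(entry.client_ip, entry.method, entry.path, entry.status, entry.bytes_sent)
```

Real server logs vary in field order and may quote fields that contain spaces, so a production parser would read the field layout from the log's own header or configuration rather than hard-coding it.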

Basic Entities in Web Usage Mining

• User (Visitor) - Single individual that is accessing files from one or more Web servers through a Browser
• Page File - File that is served through HTTP protocol
• Pageview - Set of Page Files that contribute to a single display in a Web Browser
• User Session - Set of Pageviews served due to a series of HTTP requests from a single User across the entire Web
• Server Session - Set of Pageviews served due to a series of HTTP requests from a single User to a single site
• Transaction (Episode) - Subset of Pageviews from a single User or Server Session

Main Challenges in Data Collection and Preprocessing

• Main Questions:
  - what data to collect and how to collect it; what to exclude
  - how to identify requests associated with a unique user session (HTTP is “stateless”)
  - how to identify/define user transactions
  - how to identify the basic unit of analysis (e.g., pageviews, items purchased, user ratings, etc.)
  - how to integrate data across channels: e-commerce data, clickstream data, user profiles, social media data, product meta data, etc.

Usage Data Preparation Tasks

• Data cleaning
  - remove irrelevant references and fields in server logs
  - remove references due to spider navigation
  - add missing references due to client-side caching
• Data integration
  - synchronize data from multiple server logs
  - integrate e-commerce and application server data
  - integrate meta-data
• Data Transformation
  - pageview identification
  - identification of product-oriented events
  - identification of unique users
  - sessionization – partitioning each user’s record into multiple sessions or transactions (usually representing different visits)
  - integrating meta-data and user profile data with user sessions
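Two of the preparation tasks above, data cleaning and sessionization, can be sketched in a few lines. This is only one possible heuristic, not the method prescribed by the slides: the list of ignored file suffixes, the use of (IP address, user agent) as a stand-in for a unique user, and the 30-minute inactivity timeout are all assumptions, and the 09:00:00 request is a made-up record added to show a session split.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: requests for these embedded objects are not pageviews of interest.
IGNORED_SUFFIXES = (".css", ".gif", ".jpg", ".png", ".js", ".ico")
# Assumption: a gap longer than 30 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(requests):
    """Drop references to style sheets, images, and similar objects."""
    return [r for r in requests
            if not r["path"].lower().endswith(IGNORED_SUFFIXES)]


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group requests per (ip, agent) and split them on long inactivity gaps."""
    by_user = defaultdict(list)
    for r in requests:
        by_user[(r["ip"], r["agent"])].append(r)

    sessions = []
    for user_requests in by_user.values():
        user_requests.sort(key=lambda r: r["time"])
        current = [user_requests[0]]
        for prev, nxt in zip(user_requests, user_requests[1:]):
            if nxt["time"] - prev["time"] > timeout:
                sessions.append(current)
                current = []
            current.append(nxt)
        sessions.append(current)
    return sessions


def req(ip, agent, stamp, path):
    return {"ip": ip, "agent": agent, "path": path,
            "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")}


# Agent strings are shortened placeholders; the last record is hypothetical.
raw = [
    req("1.2.3.4", "agent-a", "2006-02-01 00:08:43", "/classes/cs589/papers.html"),
    req("1.2.3.4", "agent-a", "2006-02-01 00:08:46", "/classes/cs589/papers/cms-tai.pdf"),
    req("2.3.4.5", "agent-b", "2006-02-01 08:01:28", "/classes/ds575/papers/hyperlink.pdf"),
    req("3.4.5.6", "agent-c", "2006-02-02 19:34:45", "/classes/cs480/announce.html"),
    req("3.4.5.6", "agent-c", "2006-02-02 19:34:45", "/classes/cs480/styles2.css"),
    req("3.4.5.6", "agent-c", "2006-02-02 19:34:45", "/classes/cs480/header.gif"),
    req("1.2.3.4", "agent-a", "2006-02-01 09:00:00", "/classes/cs589/"),
]
sessions = sessionize(clean(raw))
for s in sessions:
    print([r["path"] for r in s])
```

Spider removal and recovery of cached pageviews are not covered here; both usually need extra information such as known crawler agent strings or the site's link structure.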

Conceptual Representation of User Transactions or Sessions

Rows: sessions/user transactions. Columns: pageviews/objects.

         A     B     C     D     E     F
user0    15    5     0     0     0     185
user1    0     0     32    4     0     0
user2    12    0     0     56    236   0
user3    9     47    0     0     0     134
user4    0     0     23    15    0     0
user5    17    0     0     157   69    0
user6    24    89    0     0     0     354
user7    0     0     78    27    0     0
user8    7     0     45    20    127   0
user9    0     38    57    0     0     15

This is the typical representation of the data, after preprocessing, that is used as input to data mining algorithms. Raw weights may be binary, based on time spent on a page, or based on other measures of user interest in an item. In practice, this data needs to be normalized or standardized.
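The normalization step mentioned above can take several forms. The sketch below shows one simple option, scaling each session's weights by that session's largest weight so that every row falls in the range 0 to 1, along with the binary variant. The choice of max-scaling and the function names are assumptions made for illustration; the slides do not say which scheme is intended.

```python
PAGEVIEWS = ["A", "B", "C", "D", "E", "F"]

# Session-by-pageview weights copied from the table above.
SESSIONS = {
    "user0": [15, 5, 0, 0, 0, 185],
    "user1": [0, 0, 32, 4, 0, 0],
    "user2": [12, 0, 0, 56, 236, 0],
    "user3": [9, 47, 0, 0, 0, 134],
    "user4": [0, 0, 23, 15, 0, 0],
    "user5": [17, 0, 0, 157, 69, 0],
    "user6": [24, 89, 0, 0, 0, 354],
    "user7": [0, 0, 78, 27, 0, 0],
    "user8": [7, 0, 45, 20, 127, 0],
    "user9": [0, 38, 57, 0, 0, 15],
}


def normalize_rows(matrix):
    """Divide each row by its maximum so weights fall in [0, 1]."""
    normalized = {}
    for user, weights in matrix.items():
        peak = max(weights)
        normalized[user] = [w / peak if peak else 0.0 for w in weights]
    return normalized


def binarize(matrix):
    """Binary variant: 1 if the pageview occurred in the session, else 0."""
    return {user: [1 if w > 0 else 0 for w in weights]
            for user, weights in matrix.items()}


normalized = normalize_rows(SESSIONS)
binary = binarize(SESSIONS)
print([round(w, 3) for w in normalized["user0"]])
print(binary["user0"])
```

Row-wise scaling keeps a long visit from dominating distance computations simply because every page in it was viewed longer; column-wise standardization is an alternative when pages differ greatly in typical viewing time.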

Web Usage Mining as a Process

E-Commerce Data

• Integrating E-Commerce and Usage Data
  - Needed for analyzing relationships between navigational patterns of visitors and business questions such as profitability, customer value, product placement, etc.
  - E-business / Web Analytics
  - E.g., tracking and analyzing conversion of browsers to buyers
• E-Commerce v. Simple Usage Data
  - E-commerce data is product oriented while usage data is pageview oriented
  - Usage events (pageviews) are well defined and have consistent meaning across all Web sites
  - E-commerce events are often only applicable to specific domains, and the definition of certain events can vary from site to site
  - Major difficulty for Usage events is getting accurate preprocessed data
  - Major difficulty for E-commerce events is defining and implementing the events for a particular site

Why We Need Web Analytics

• Are we attracting new people to our site?
• Is our site ‘sticky’? Which regions in it are not?
• What is the health of our lead qualification process?
• How adept is our conversion of browsers to buyers?
• What behavior indicates purchase propensity?
• What site navigation do we wish to encourage?
• How can profiling help us cross-sell and up-sell?
• How do customer segments differ?
• What attributes describe our best customers?
• Can we target other prospects like them?
• What makes customers loyal?
• How do we measure loyalty?

Three Skill Sets Required

• Technology
  - How do we get the data? Are we collecting the right data?
• Analytics
  - How do we turn the data into insightful information?
• Business Management
  - What action do we take? How do we measure the impact of that action?

Data Collection / Preprocessing / Integration
Analysis Tools, OLAP, Data Mining
E-Metrics

Using Analytics for E-Business Management

• Navigation Calibration
• Calculating Content
  - Popularity
  - Freshness
  - Stickiness / Slipperiness / Leakage
  - Stimulus - Inducement
• Conversion Quotient
• Interaction Computation
• Customer Service Assessment
• Customer Experience Evaluation
• Branding

Refresh rate / Visit Frequency < 1 ?

Web Usage and E-Business Analytics

Different Levels of Analysis

• Session Analysis
• Static Aggregation and Statistics
• OLAP
• Data Mining

Session Analysis

• Simplest form of analysis: examine individual or groups of server sessions and e-commerce data.
• Advantages:
  - Gain insight into typical customer behaviors.
  - Trace specific problems with the site.
• Drawbacks:
  - LOTS of data.
  - Difficult to generalize.

Static Aggregation (Reports)

• Most common form of analysis.
• Data is aggregated by predetermined units such as days or sessions.
• Generally gives most “bang for the buck.”
• Advantages:
  - Gives quick overview of how a site is being used.
  - Minimal disk space or processing power required.
• Drawbacks:
  - No ability to “dig deeper” into the data.

Page View           Number of Sessions    Average View Count per Session
Home Page           50,000                1.5
Catalog Ordering    500                   1.1
Shopping Cart       9000                  2.3
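A report like the one above can be produced with a single pass over the sessions. The sketch below is a minimal illustration under the assumption that each session is available as a list of page names in visit order; the sample sessions are invented and do not reproduce the figures in the table.

```python
from collections import Counter


def page_report(sessions):
    """Per page: (number of sessions containing it, average views per such session)."""
    session_counts = Counter()
    view_counts = Counter()
    for session in sessions:
        for page, views in Counter(session).items():
            session_counts[page] += 1
            view_counts[page] += views
    return {page: (session_counts[page], view_counts[page] / session_counts[page])
            for page in session_counts}


# Hypothetical sessions, each a list of page names in visit order.
sample_sessions = [
    ["Home Page", "Shopping Cart", "Home Page"],
    ["Home Page", "Catalog Ordering"],
    ["Shopping Cart", "Shopping Cart", "Shopping Cart"],
    ["Home Page"],
]
report = page_report(sample_sessions)
for page, (n_sessions, avg_views) in report.items():
    print(f"{page:18} {n_sessions:3d} {avg_views:.1f}")
```

Because the grouping unit is fixed in advance, the report is cheap to compute, which matches the stated advantage; the same fixed grouping is also why it cannot answer follow-up questions without being rerun.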

Online Analytical Processing (OLAP)

• Allows changes to aggregation level for multiple dimensions.
• Generally associated with a Data Warehouse.
• Advantages & Drawbacks
  - Very flexible
  - Requires significantly more resources than static reporting.

Page View                 Number of Sessions    Average View Count per Session
Kid's Stuff Products      2,000                 5.9

Page View                 Number of Sessions    Average View Count per Session
Kid's Stuff Products
  Electronics
    Educational           63                    2.3
    Radio-Controlled      93                    2.5
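The change of aggregation level shown in the two tables above can be imitated on a small scale by grouping the same fact rows at different depths of a category hierarchy. The sketch below is a toy stand-in for an OLAP engine, not a description of one: the fact rows, the "Toys" and "Puzzles" categories, and the `aggregate` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (session id, category path, pageviews in that category).
FACTS = [
    (1, ("Kid's Stuff", "Electronics", "Educational"), 2),
    (1, ("Kid's Stuff", "Electronics", "Radio-Controlled"), 3),
    (2, ("Kid's Stuff", "Electronics", "Educational"), 1),
    (3, ("Kid's Stuff", "Toys", "Puzzles"), 4),
]


def aggregate(facts, level):
    """Group by the first `level` parts of the category path.

    Returns {path prefix: (number of sessions, average views per session)}.
    """
    views = defaultdict(lambda: defaultdict(int))  # prefix -> session id -> views
    for session_id, path, count in facts:
        views[path[:level]][session_id] += count
    return {prefix: (len(per_session),
                     sum(per_session.values()) / len(per_session))
            for prefix, per_session in views.items()}


rolled_up = aggregate(FACTS, 1)     # one row per top-level category
drilled_down = aggregate(FACTS, 3)  # one row per leaf category
print(rolled_up)
print(drilled_down)
```

Session counts are recomputed from the fact rows at each level rather than summed from the finer level, since one session can touch several subcategories and would otherwise be counted more than once.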

Data Mining: Going Deeper

• Frequent Itemsets and Association Rules
  - The “Donkey Kong Video Game” and “Stainless Steel Flatware Set” product pages are accessed together in 1.2% of the sessions.
  - When the “Shopping Cart Page” is accessed in a session, “Home Page” is also accessed 90% of the time.
  - When the “Stainless Steel Flatware Set” product page is accessed in a session, the “Donkey Kong Video Game” page is also accessed 5% of the time.
  - 30% of clients who accessed /special-offer.html placed an online order in /products/software/
• Sequential Patterns
  - Add an extra dimension to frequent itemsets and association rules: time
    * “x% of the time, when AB appears in a transaction, C appears within z transactions”
  - 40% of people who bought the book “How to cheat IRS” booked a flight to South America 6 months later
  - The “Video Game Caddy” page view is accessed after the “Donkey Kong Video Game” page view 50% of the time. This occurs in 1% of the sessions.
  - 15% of visitors followed the path home > * > software > * > shopping cart > checkout
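The percentages quoted in the association-rule examples above correspond to two standard measures: support (the fraction of sessions containing all the listed pages) and confidence (among sessions containing the antecedent, the fraction that also contain the consequent). The sketch below computes both on a handful of invented sessions, so its numbers do not match the figures on the slide.

```python
def support(sessions, items):
    """Fraction of sessions that contain every page in `items`."""
    items = set(items)
    return sum(items <= session for session in sessions) / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Among sessions containing the antecedent, the share also containing the consequent."""
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / support(sessions, antecedent)


# Hypothetical sessions, each reduced to the set of pages accessed.
sample_sessions = [
    {"Home Page", "Shopping Cart Page"},
    {"Home Page", "Donkey Kong Video Game", "Video Game Caddy"},
    {"Home Page", "Shopping Cart Page", "Stainless Steel Flatware Set"},
    {"Donkey Kong Video Game", "Stainless Steel Flatware Set"},
    {"Home Page"},
]

pair_support = support(
    sample_sessions, {"Donkey Kong Video Game", "Stainless Steel Flatware Set"})
cart_to_home = confidence(
    sample_sessions, {"Shopping Cart Page"}, {"Home Page"})
flatware_to_kong = confidence(
    sample_sessions, {"Stainless Steel Flatware Set"}, {"Donkey Kong Video Game"})
print(pair_support, cart_to_home, flatware_to_kong)
```

Sets ignore the order of page accesses, which is exactly the information that sequential patterns add back; a sequential version would keep each session as an ordered list and test for ordered subsequences instead of subsets.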

Data Mining: Going Deeper

• Clustering: Content-Based or Usage-Based
  - Customer/visitor segmentation
  - Categorization of pages and products
• Classification
  - Classifying users into behavioral groups (browser, likely to purchase, loyal customer, etc.)
  - Examples:
    * Customers who access Video Game Product pages, have income of 50K+, and have 1 or more children, should get a banner ad for Xbox in their next visit.
    * Customers who make at least 4 purchases in one year should be categorized as “loyal”
    * Loan applicants in 45K-60K income range, low debt, and good-excellent credit should be approved for a new mortgage.
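The first two classification examples above can be written directly as predicates over a customer record. In practice such rules would typically be learned from labeled data (for instance by a decision tree) rather than written by hand; the sketch below simply hard-codes them, and the `Customer` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    purchases_last_year: int
    income: int                    # annual income in dollars
    children: int
    viewed_video_game_pages: bool


def is_loyal(customer: Customer) -> bool:
    """Rule from the slide: at least 4 purchases in one year."""
    return customer.purchases_last_year >= 4


def should_show_xbox_banner(customer: Customer) -> bool:
    """Rule from the slide: video game pages, income of 50K+, 1 or more children."""
    return (customer.viewed_video_game_pages
            and customer.income >= 50_000
            and customer.children >= 1)


alice = Customer(purchases_last_year=5, income=62_000, children=2,
                 viewed_video_game_pages=True)
bob = Customer(purchases_last_year=1, income=48_000, children=0,
               viewed_video_game_pages=True)
print(is_loyal(alice), should_show_xbox_banner(alice))
print(is_loyal(bob), should_show_xbox_banner(bob))
```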

Example: Path Analysis for Ecommerce

[Figure: path analysis tree. A Visit splits into Search (64% successful) and No Search, with branch shares 10% and 90%. Avg sale per visit: 2.2X for Search, $X for No Search. Search splits into Last Search Succeeded and Last Search Failed, with branch shares 70% and 30%. Avg sale per visit: 2.8X for Last Search Succeeded, 0.9X for Last Search Failed.]
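The numbers in this example can be cross-checked with a weighted average. The sketch below assumes, based only on the order in which the figures appear, that 10% of visits use search and 90% do not, and that 70% of search visits end with a successful last search and 30% with a failed one. Under that assumed assignment the two lower branches combine to roughly the 2.2X quoted for search visits.

```python
# Assumed assignment of percentages to branches; the figure layout is lost.
SEARCH_SHARE = 0.10
NO_SEARCH_SHARE = 0.90
SUCCEEDED_SHARE = 0.70
FAILED_SHARE = 0.30

# Average sale per visit, expressed as multiples of X (the no-search average).
SALE_SUCCEEDED = 2.8
SALE_FAILED = 0.9
SALE_NO_SEARCH = 1.0

# Average over search visits, weighting each outcome by its share.
search_avg = SUCCEEDED_SHARE * SALE_SUCCEEDED + FAILED_SHARE * SALE_FAILED

# Average over all visits, weighting search and no-search visits.
overall_avg = SEARCH_SHARE * search_avg + NO_SEARCH_SHARE * SALE_NO_SEARCH

print(f"search visits: {search_avg:.2f}X")
print(f"all visits:    {overall_avg:.3f}X")
```

The check supports the 70%/30% reading of the lower split; it says nothing about whether 10% or 90% belongs to the search branch, so that part remains an assumption.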

Example: Association Analysis for Ecommerce

• Confidence: 41% of people who purchased Fully Reversible Mats also purchased Egyptian Cotton Towels
• Lift: People who purchased Fully Reversible Mats were 456 times more likely to purchase the Egyptian Cotton Towels compared to the general population

Product                     Association               Lift    Confidence    Website Recommended Products
Fully Reversible Mats       Egyptian Cotton Towels    456     41%           J Jasper Towels (Confidence 1.4%)
White Cotton T-Shirt Bra    Plunge T-Shirt Bra        246     25%           Black embroidered underwired bra (Confidence 1%)
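Lift, as described above, compares the confidence of a rule with how often the consequent is purchased in the general population: lift(A → B) = confidence(A → B) / support(B). The sketch below computes it on a few invented orders (the product names and counts are hypothetical) and then applies the same relation to the slide's figures to show what they imply about the baseline purchase rate of the towels.

```python
def lift(orders, antecedent, consequent):
    """confidence(antecedent -> consequent) divided by support(consequent)."""
    with_antecedent = [o for o in orders if antecedent in o]
    conf = sum(consequent in o for o in with_antecedent) / len(with_antecedent)
    baseline = sum(consequent in o for o in orders) / len(orders)
    return conf / baseline


# Hypothetical orders: mats in 2 of 10, towels in 2 of 10, both in 1.
orders = [{"mats", "towels"}, {"mats"}, {"towels"}] + [{"soap"}] * 7
toy_lift = lift(orders, "mats", "towels")
print(toy_lift)

# Rearranging the definition with the slide's numbers:
# support(towels) = confidence / lift = 0.41 / 456
implied_towel_support = 0.41 / 456
print(f"{implied_towel_support:.4%}")
```

A lift far above 1 combined with a tiny baseline rate, as here, is typical of rare items: the rule is strong relative to chance, but it applies to very few orders overall.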

Web Usage Mining: clustering example

• Transaction Clusters:
  - Clustering similar user transactions and using the centroid of each cluster as a usage profile (representative for a user segment)

Support    URL                                                          Pageview Description
1.00       /courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290    SE 450 Object-Oriented Development class syllabus
0.97       /people/facultyinfo.asp?id=290                               Web page of a lecturer who taught the above course
0.88       /programs/                                                   Current Degree Descriptions 2002
0.85       /programs/courses.asp?depcode=96&deptmne=se&courseid=450     SE 450 course description in SE program
0.82       /programs/2002/gradds2002.asp                                M.S. in Distributed Systems program description

Sample cluster centroid from dept. Web site (cluster size = 330)
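One way to read the Support column above is as the centroid of the cluster when each transaction is a binary vector over pageviews: the weight of a URL is then the fraction of transactions in the cluster that contain it. The sketch below computes such a profile for a tiny invented cluster. The `min_weight` cutoff of 0.5, the `usage_profile` name, and the "/news/" URL are assumptions added for illustration.

```python
from collections import Counter


def usage_profile(cluster, min_weight=0.5):
    """Centroid of binary transaction vectors, keeping only the stronger URLs.

    cluster: list of transactions, each an iterable of URLs.
    Returns {url: fraction of transactions containing it}, highest weight first.
    """
    counts = Counter(url for transaction in cluster for url in set(transaction))
    size = len(cluster)
    weights = {url: n / size for url, n in counts.items() if n / size >= min_weight}
    return dict(sorted(weights.items(), key=lambda item: -item[1]))


# Hypothetical cluster of four transactions.
cluster = [
    ["/people/facultyinfo.asp?id=290", "/programs/"],
    ["/people/facultyinfo.asp?id=290", "/programs/", "/news/"],
    ["/people/facultyinfo.asp?id=290"],
    ["/people/facultyinfo.asp?id=290", "/programs/"],
]
profile = usage_profile(cluster)
for url, weight in profile.items():
    print(f"{weight:.2f}  {url}")
```

With weighted rather than binary transactions, the centroid would be the mean weight per URL instead of a fraction, and the cutoff would need to be chosen on that scale.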

Basic Framework for E-Commerce Data Analysis

[Figure: architecture diagram. Inputs: Web/Application Server Logs, Site Content, Site Map, Site Dictionary, and an Operational Database (customers, orders, products). Processing modules: Content Analysis Module, Data Cleaning / Sessionization Module, Data Integration Module. Stores: Integrated Sessionized Data, E-Commerce Data Mart, Data Cube. Analysis components: Data Mining Engine, OLAP Tools. Outputs: Usage Analysis, Pattern Analysis, OLAP Analysis.]