Post on 05-Feb-2016
Advanced Topics in Data Mining:
Web Mining
Web Mining
• Applications are being ported to the Web at a rapid pace
• On-line services, such as America Online (AOL) and CompuServe (merged into AOL), are anxious to know user access patterns, not just “search” in the Web
• How does Amazon do it?
• Understanding Web user behavior is important
– It can improve Web page organization
– It can increase Web server performance
– It can exploit Web advertising
– It can increase business opportunity
Amazon Web Page
Association Rules
More Information Desired
• Collect statistical information (page hits) only, which is insufficient since:
– The hit frequency of a page depends not only on its content but also on its location
– The number of users accessing a page is not available
– Information on which pages are accessed together is not available
• Data mining in the Web (Web Mining)
– Web Access Pattern Collection
– Web User Pattern Mining
Web Access Pattern Collection
• Server-Based Data Collection
– Who is visiting a given Web site and what are they doing?
• Agent-Based Data Collection
– Which Web sites has a particular user visited?
Server-Based Data Collection
• Examine the logs collected by HTTPd
– Access Log (IP, Time, Access Data), Referred Log (AB), Error Log, …
– We can combine some of them for our use if necessary
• Problems
– The use of proxy servers
– The effect of caching
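As a rough illustration of server-based collection, the sketch below parses one access-log line into the three fields named above (IP, Time, Access Data). It assumes the NCSA Common Log Format; the sample line and the name `parse_access_line` are made up for illustration and are not from the slides.

```python
import re
from datetime import datetime

# Assumption: NCSA Common Log Format:
#   host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_line(line):
    """Return (ip, time, access_data) for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), when, match.group("request")

# Hypothetical sample line (documentation IP address).
sample = '192.0.2.7 - - [05/Feb/2016:10:15:32 +0800] "GET /index.html HTTP/1.0" 200 2326'
record = parse_access_line(sample)
```

Note that a proxy or a browser cache sits in front of this log, so the IP may stand for many users and cached page views never reach the server at all.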
Server-Based Data Collection
Access Log: IP/Domain Name, Time, Access Data
Referred Log
The caching problem is not considered
Server-Based Data Collection
• Has to be done in accordance with technology advances
– The use of Active Server Pages (Session ID available)
• The use of proxy servers
• The effect of caching
– HTTPd 1.1
• Limitation
– Can only capture user behavior while users are within this site
Agent-Based Data Collection
• Understanding individual Web behavior needs client-based data collection
• Results are useful
– Better Personalized Service
– Improved Web Page Organization
– Better Pricing Policies
• Methods
– Applets can only read/write files in their source servers
• a big security constraint
– Using Active Components (ActiveX Control) and PlugIns
• APCS (Access Pattern Collection Server)
APCS
Agent-Based Data Collection
• Very difficult to do for non-registered users in the current Web environment
– It has to be conducted with users’ consent
• Very dependent upon available Web technologies
Web User Pattern Mining
• Web user pattern mining discovers user access patterns in Web servers
• Pattern discovery and analysis tools
– Some existing Web tools provide mechanisms for reporting user activity in the servers
– Web Trends (http://www.webtrends.com.tw/)
– Open Market (http://www.openmarket.com/)
– Net.Genesis (http://www.netgen.com/)
Path Traversal Patterns Mining
• Mining path traversal patterns in a distributed information-providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access
• Solution procedure consists of three steps:
– Convert the original sequence of log data into a set of maximal forward references (MF)
• Filter out the effect of some backward references
– Backward references are mainly made for ease of traveling; filtering them lets us concentrate on mining meaningful user access sequences
– Some objects are visited because of their locations rather than their content
– Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained
– Determine the maximal reference sequences from large reference sequences (Trivial)
Step1: MF References
• Suppose the traversal log contains the following traversal path for a user:
– A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V
The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}
When a backward reference occurs, a forward reference path terminates.
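The conversion in Step 1 can be sketched as follows. This is a minimal illustration, assuming each page is a single-character name and the log of one user is already in visit order; the function name `maximal_forward_references` is ours, not from the slides.

```python
def maximal_forward_references(traversal):
    """Split one user's traversal into maximal forward references."""
    path = []                # current forward path
    moving_forward = False   # True while the last move was a forward reference
    result = []
    for page in traversal:
        if page in path:
            # Backward reference: the forward path ends here (emit it once),
            # then back up to the revisited page.
            if moving_forward:
                result.append("".join(path))
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward:
        result.append("".join(path))
    return result

log = list("ABCDCBEGHGWAOUOV")
mf = maximal_forward_references(log)
print(mf)
```

On the traversal path above this reproduces the five maximal forward references listed on the slide.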
Step1: Another Example
Step1: Arrange Database
Encoding
Step1: Database Reduction
Database Reduction
Step2: Find Frequent Reference Sequences
• Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences)
– Full-Scan (FS) Algorithm
• FS utilizes key ideas of the DHP algorithm
– Selective-Scan (SS) Algorithm
• SS reduces the number of database scans
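To make "frequent consecutive subsequences" concrete, here is a naive level-wise sketch. It is not FS or SS themselves: the hashing and database reduction of FS and the scan skipping of SS are left out, and every level rescans all references. Single-character page names, the input list and the threshold of 2 are assumptions for illustration.

```python
def frequent_reference_sequences(references, min_count):
    """Level-wise search for consecutive subsequences found in >= min_count references."""
    def support(seq):
        # A reference supports seq if seq occurs in it as a consecutive run.
        return sum(seq in ref for ref in references)

    pages = sorted({page for ref in references for page in ref})
    level = [p for p in pages if support(p) >= min_count]
    frequent = {}
    while level:
        frequent.update({seq: support(seq) for seq in level})
        # Join two k-sequences that overlap in k-1 pages to form a (k+1)-candidate.
        candidates = sorted({a + b[-1] for a in level for b in level if a[1:] == b[:-1]})
        level = [c for c in candidates if support(c) >= min_count]
    return frequent

# Illustrative input: the maximal forward references of the Step 1 example.
references = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
frequent = frequent_reference_sequences(references, min_count=2)
```

With these inputs the longest frequent sequence is ABEG, which is supported by ABEGH and ABEGW.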
Full-Scan (FS) Algorithm
[Figure sequence: ScanDB-1: Generate L1 & Hash Table, then Generate C2; ScanDB-2: Generate L2 & Reduce DB; ScanDB-3: Generate C3, L3 & Reduce DB; ScanDB-4: Generate C4, L4 & Reduce DB]
Hash function used in ScanDB-1: h(x,y) = [ ( order of x ) * 23 + ( order of y ) ] mod 17
Selective-Scan (SS) Algorithm
ScanDB-3
Step 3: Generate Frequent Traversal Patterns
Maximal Reference Sequences
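Step 3 keeps only the large reference sequences that are not contained in a longer one. A minimal sketch, assuming sequences are strings of single-character page names; the input list is an illustrative set of large reference sequences, not data from the slides.

```python
def maximal_reference_sequences(large_sequences):
    """Keep sequences that are not a consecutive part of any other large sequence."""
    return sorted(
        s for s in large_sequences
        if not any(s != t and s in t for t in large_sequences)
    )

# Hypothetical large reference sequences (minimum support 2 over the Step 1 references).
large = ["A", "B", "E", "G", "O", "AB", "AO", "BE", "EG", "ABE", "BEG", "ABEG"]
maximal = maximal_reference_sequences(large)
print(maximal)
```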
WAP-Mine Algorithm
• The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure
• Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS
User ID Web Access Sequence
100 abdac
200 eaebcac
300 babfaec
400 afbacfc
WAS
WAP-Mine Algorithm
(1) Scan WAS once, find all frequent-1 events
(2) Scan WAS again, construct a WAP-tree
(3) Recursively mine the WAP-tree using conditional search
Access patterns
Find All Frequent-1 Events
User ID Web Access Sequence
100 abdac
200 eaebcac
300 babfaec
400 afbacfc
Item Support Frequency
a 4
b 4
c 4
d 1
e 2
f 2
Min_Sup=75%
User ID Web Access Sequence Frequent Subsequence
100 abdac abac
200 eaebcac abcac
300 babfaec babac
400 afbacfc abacc
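The first scan can be reproduced with a few lines. This is a sketch under the assumption that the support of an event is the number of access sequences containing it at least once, which matches the counts in the table above.

```python
from collections import Counter
from math import ceil

was = {100: "abdac", 200: "eaebcac", 300: "babfaec", 400: "afbacfc"}
min_sup = 0.75

# Count each event once per sequence.
support = Counter(event for seq in was.values() for event in set(seq))
min_count = ceil(min_sup * len(was))          # 75% of 4 sequences = 3
frequent = {event for event, n in support.items() if n >= min_count}

# Drop non-frequent events; the order of the remaining events is preserved.
frequent_subsequences = {
    uid: "".join(event for event in seq if event in frequent)
    for uid, seq in was.items()
}
```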
WAP-Tree Construction
• Using frequent events to register all count information for further mining
User ID Frequent Subsequence
100 abac
200 abcac
300 babac
400 abacc
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on c
Sequence Count
aba 2
ab 1
abca 1
ab -1
baba 1
abac 1
aba -1
Sequence Count
aba 1
abca 1
baba 1
abac 1
Item Sup Frequency
a 4
b 4
c 2
Generate Web Access Patterns: ac, bc
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on ac
Sequence Count
ab 3
b 1
bab 1
b -1
Sequence Count
ab 3
bab 1
Item Sup Frequency
a 4
b 4
Generate Web Access Patterns: aac, bac
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on bac
Sequence Count
a 3
ba 1
Item Sup Frequency
a 4
b 1
Generate Web Access Patterns: abac
Mining Web Access Patterns from WAP-Tree
Conditional Sequence Based on abac
No Web Access Patterns are Generated
Sequence Count
a 4
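The conditional search walked through above can be imitated without building the tree. The sketch below is a tree-free stand-in, not WAP-mine itself: instead of following WAP-tree links and adding -1 counts for sub-prefixes, it takes, for each sequence, the prefix before the last occurrence of the suffix event, which gives the same merged counts. Function names and the string encoding of sequences are our assumptions.

```python
from collections import Counter

def conditional_base(sequences, event):
    """Prefix before the LAST occurrence of event in each sequence, with counts."""
    base = Counter()
    for seq, count in sequences.items():
        pos = seq.rfind(event)
        if pos >= 0:
            base[seq[:pos]] += count
    return base

def frequent_events(sequences, min_count):
    """Events contained in sequences whose counts sum to at least min_count."""
    support = Counter()
    for seq, count in sequences.items():
        for event in set(seq):
            support[event] += count
    return {event for event, n in support.items() if n >= min_count}

def mine(sequences, min_count, suffix=""):
    """Recursively grow patterns to the left of the given suffix."""
    keep = frequent_events(sequences, min_count)
    filtered = Counter()
    for seq, count in sequences.items():
        filtered["".join(e for e in seq if e in keep)] += count
    patterns = []
    for event in sorted(keep):
        pattern = event + suffix
        patterns.append(pattern)
        patterns.extend(mine(conditional_base(filtered, event), min_count, pattern))
    return patterns

was = Counter(["abdac", "eaebcac", "babfaec", "afbacfc"])
patterns = mine(was, min_count=3)
```

For suffix c this yields ac and bc, then aac and bac, then abac, in line with the slides; the full run also returns the patterns ending in a and b.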
Mining for Web Transactions
• To capture Web customer buying behavior
– It is not just a market basket transaction for the set of items bought by a customer in a single purchase (Association Rules)
– It is not just Web user travel patterns (Path Traversal Patterns)
– It is an extension of path traversal patterns
• Exploring the relationship between traveling and buying
Mining for Web Transactions
[Figure: processing flow] Web Transaction → Algorithm WR (Web-transaction-Record) → Web Transaction Records <Path: a Set of Purchases> → Algorithm WTM, MTSPJ, MTSPC → Frequent Transaction Patterns → Web Transaction Association Rules
Mining for Web Transactions
• Web-transaction-Record (WR) Algorithm
– Extract meaningful Web transaction records from the given Web transaction
• WTM (Web Transaction Mining) Algorithm
– Mining Web Transaction Patterns
• MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM
WTM Algorithm
• It joins the purchased itemsets to generate candidate transaction patterns
• WTM employs a two-level hash tree, called the Web transaction tree, to store candidate transaction patterns
– WTM hashes not only each item but also each purchase in the path
WTM Algorithm
Web Transaction DATABASE
WT_ID Path Purchase
100
ABCE B{i1}, C{i2}, E{i4}
ABFGH B{i1}, H{i6}
ASJL S{i7}, L{i9}
200
ABCE B{i1}, C{i2}, E{i4}
ASJLQ S{i7}, Q{i10}
300
ABCE B{i1}, E{i4}
ABFG B{i1}, G{i5}
ASJL S{i7}, J{i8}, L{i9}
400
ABD D{i3}
ABFG G{i5}
ASJLQ S{i7}, J{i8}, Q{i10}
Support Count (example using WT_ID 100 and 200)
Path Purchase Support Count
AB B{i1} 2
ABC C{i2} 2
WTM Algorithm
C1
Path Purchase Sup.
AB B{i1} 3
ABC C{i2} 2
ABD D{i3} 1
ABCE E{i4} 3
ABFG G{i5} 2
ABFGH H{i6} 1
AS S{i7} 4
ASJ J{i8} 2
ASJL L{i9} 2
ASJLQ Q{i10} 2
T1 (Support Count >= 2)
Path Purchase Sup.
AB B{i1} 3
ABC C{i2} 2
ABCE E{i4} 3
ABFG G{i5} 2
AS S{i7} 4
ASJ J{i8} 2
ASJL L{i9} 2
ASJLQ Q{i10} 2
WTM Algorithm
C2 (excerpt; 28 candidates in total)
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
AS B{i1} S{i7} 0
ASJ B{i1} J{i8} 0
ASJLQ J{i8} Q{i10} 1
ASJLQ L{i9} Q{i10} 0
T2 (Support Count >= 2)
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJLQ S{i7} Q{i10} 2
WTM Algorithm
C3
Path Purchase Sup.
ABCE B{i1} C{i2} E{i4} 2
T3 (Support Count >= 2)
Path Purchase Sup.
ABCE B{i1} C{i2} E{i4} 2
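The level-wise counting of the WTM example can be re-run with the sketch below. It is a simplification, not the WTM implementation: the two-level hash tree and the path bookkeeping are omitted, and a pattern is counted for a WT_ID when a single path record of that transaction contains all of its purchases (an assumption inferred from the support counts in the tables).

```python
from itertools import combinations

# (WT_ID, path, purchases) records of the example database.
RECORDS = [
    (100, "ABCE", {"B{i1}", "C{i2}", "E{i4}"}),
    (100, "ABFGH", {"B{i1}", "H{i6}"}),
    (100, "ASJL", {"S{i7}", "L{i9}"}),
    (200, "ABCE", {"B{i1}", "C{i2}", "E{i4}"}),
    (200, "ASJLQ", {"S{i7}", "Q{i10}"}),
    (300, "ABCE", {"B{i1}", "E{i4}"}),
    (300, "ABFG", {"B{i1}", "G{i5}"}),
    (300, "ASJL", {"S{i7}", "J{i8}", "L{i9}"}),
    (400, "ABD", {"D{i3}"}),
    (400, "ABFG", {"G{i5}"}),
    (400, "ASJLQ", {"S{i7}", "J{i8}", "Q{i10}"}),
]

def support(pattern):
    """Number of WT_IDs with one record containing every purchase of the pattern."""
    return len({wt for wt, _, bought in RECORDS if set(pattern) <= bought})

def mine(min_count):
    """Return [(Ck, Tk), ...] for k = 1, 2, ... until no candidates remain."""
    items = sorted({p for _, _, bought in RECORDS for p in bought})
    candidates = [(p,) for p in items]                      # C1
    levels = []
    while candidates:
        frequent = [c for c in candidates if support(c) >= min_count]   # Tk
        levels.append((candidates, frequent))
        k = len(candidates[0])
        frequent_set = set(frequent)
        large_items = sorted({p for f in frequent for p in f})
        # Join: keep (k+1)-combinations whose k-subsets are all frequent.
        candidates = [
            c for c in combinations(large_items, k + 1)
            if all(s in frequent_set for s in combinations(c, k))
        ]
    return levels

levels = mine(min_count=2)
```

With a support count of 2 this gives 10 candidates in C1, 28 in C2 and a single pattern in C3, the same sizes as in the example.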
WTM Disadvantages
• WTM may generate a lot of unqualified candidate transaction patterns without utilizing the paths of frequent transaction patterns
• This will degrade the performance
MTSPJ Algorithm
• Algorithm MTSPJ uses maximal transaction segments, each containing the frequent transaction patterns along a maximal path, to solve the unqualified candidate transaction pattern problem
• MTSPJ generates candidate transaction patterns only when a leaf node of the Web transaction tree is reached
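The effect of joining only inside a maximal transaction segment can be sketched as below. This is our reading of the idea rather than the published MTSPJ procedure: a maximal path is taken to be a frequent path that is not a prefix of another, and its segment collects the frequent purchases made along it. The T1 contents are the frequent 1-patterns of the running example; the function name is hypothetical.

```python
from itertools import combinations

# T1 of the running example: path -> frequent purchase made at the end of that path.
T1 = {
    "AB": "B{i1}", "ABC": "C{i2}", "ABCE": "E{i4}", "ABFG": "G{i5}",
    "AS": "S{i7}", "ASJ": "J{i8}", "ASJL": "L{i9}", "ASJLQ": "Q{i10}",
}

def maximal_segments(t1):
    """Map each maximal path to the frequent purchases along it, from root to leaf."""
    paths = list(t1)
    maximal = [p for p in paths if not any(q != p and q.startswith(p) for q in paths)]
    return {
        m: [t1[p] for p in sorted(paths, key=len) if m.startswith(p)]
        for m in maximal
    }

segments = maximal_segments(T1)
# Candidate pairs are formed only within one segment.
c2 = [pair for items in segments.values() for pair in combinations(items, 2)]
```

This produces 10 candidate pairs, compared with the 28 pairs obtained by joining all eight frequent purchases with each other.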
MTSPJ Algorithm
[Web Transaction DATABASE: the same records as in the WTM Algorithm example, WT_ID 100 to 400]
[Figure: Web transaction tree with root A and nodes B, C, D, E, F, G, H, S, J, L, Q]
MTSPJ Algorithm
C1
Path Purchase Sup.
AB B{i1} 3
ABC C{i2} 2
ABD D{i3} 1
ABCE E{i4} 3
ABFG G{i5} 2
ABFGH H{i6} 1
AS S{i7} 4
ASJ J{i8} 2
ASJL L{i9} 2
ASJLQ Q{i10} 2
T1 (Support Count >= 2)
Path Purchase Sup.
AB B{i1} 3
ABC C{i2} 2
ABCE E{i4} 3
ABFG G{i5} 2
AS S{i7} 4
ASJ J{i8} 2
ASJL L{i9} 2
ASJLQ Q{i10} 2
[Figure: Web transaction tree with nodes A, B, C, E, F, G, S, J, L, Q]
MTSPJ Algorithm
Maximal Transaction Segment: <ABCE : B{i1} C{i2} E{i4}>
C2
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
Maximal Transaction Segment: <ABFG : B{i1} G{i5}>
C2
Path Purchase Sup.
ABFG B{i1} G{i5} 1
Maximal Transaction Segment: <ASJLQ : S{i7} J{i8} L{i9} Q{i10}>
C2
Path Purchase Sup.
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJL J{i8} L{i9} 1
ASJLQ S{i7} Q{i10} 2
ASJLQ J{i8} Q{i10} 1
ASJLQ L{i9} Q{i10} 0
[Figure: Web transaction tree with nodes A, B, C, E, F, G, S, J, L, Q]
MTSPJ Algorithm
C2
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
ABFG B{i1} G{i5} 1
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJL J{i8} L{i9} 1
ASJLQ S{i7} Q{i10} 2
ASJLQ J{i8} Q{i10} 1
ASJLQ L{i9} Q{i10} 0
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJLQ S{i7} Q{i10} 2
T2
MTSPJ Algorithm
[Figure: Web transaction tree with nodes A, B, C, E, S, J, L, Q]
Maximal Transaction Segment: <ABCE : B{i1} C{i2} E{i4}>
C3
Path Purchase Sup.
ABCE B{i1} C{i2} E{i4} 2
MTSPC Algorithm
C1
Path Purchase Sup.
AB B{i1} 3
ABC C{i2} 2
ABD D{i3} 1
ABCE E{i4} 3
ABFG G{i5} 2
ABFGH H{i6} 1
AS S{i7} 4
ASJ J{i8} 2
ASJL L{i9} 2
ASJLQ Q{i10} 2
T1 (Support Count >= 2)
Path Purchase Sup.
AB B{i1} 3
ABC C{i2} 2
ABCE E{i4} 3
ABFG G{i5} 2
AS S{i7} 4
ASJ J{i8} 2
ASJL L{i9} 2
ASJLQ Q{i10} 2
[Figure: Web transaction tree with nodes A, B, C, E, F, G, S, J, L, Q]
MTSPC utilizes the LC (Large Count) to Filter Candidates
MTSPC Algorithm
[Figure: Web transaction tree with nodes A, B, C, E, F, G, S, J, L, Q]
K=1
Maximal Transaction Segment
Maximal Path Item LC
ABCE
B{i1} 1
C{i2} 1
E{i4} 1
|I| = 3 > 1 (K-1)
C2
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
Maximal Transaction Segment
Maximal Path Item LC
ASJLQ
S{i7} 1
J{i8} 1
L{i9} 1
Q{i10} 1
|I| = 4 > 1
C2
Path Purchase Sup.
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJL J{i8} L{i9} 1
ASJLQ S{i7} Q{i10} 2
ASJLQ J{i8} Q{i10} 1
ASJLQ L{i9} Q{i10} 0
Maximal Transaction Segment
Maximal Path Item LC
ABFG
B{i1} 1
G{i5} 1
|I| = 2 > 1
C2
Path Purchase Sup.
ABFG B{i1} G{i5} 1
MTSPC Algorithm
C2
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
ABFG B{i1} G{i5} 1
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJL J{i8} L{i9} 1
ASJLQ S{i7} Q{i10} 2
ASJLQ J{i8} Q{i10} 1
ASJLQ L{i9} Q{i10} 0
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJLQ S{i7} Q{i10} 2
T2
MTSPC Algorithm
K=2
Maximal Transaction Segment
Maximal Path Item LC
ABCE
B{i1} 2
C{i2} 2
E{i4} 2
|I| = 3 > 2
C3
Path Purchase
ABCE B{i1} C{i2} E{i4}
Maximal Transaction Segment
Maximal Path Item LC
ASJLQ
S{i7} 3
J{i8} 1
L{i9} 1
Q{i10} 1
|I| = 1 < 2
No Generations
[Figure: Web transaction tree with nodes A, B, C, E, S, J, L, Q]
Path Purchase Sup.
ABC B{i1} C{i2} 2
ABCE B{i1} E{i4} 3
ABCE C{i2} E{i4} 2
ASJ S{i7} J{i8} 2
ASJL S{i7} L{i9} 2
ASJLQ S{i7} Q{i10} 2
T2
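The LC filter can be sketched as follows. The rule used here is inferred from the numbers in the example, not taken from a definition in the slides: the LC of an item is the number of frequent k-patterns containing it, I is the set of segment items with LC >= k, and (k+1)-candidates are generated from a segment only if |I| > k. Function names are hypothetical.

```python
from itertools import combinations

# T2 of the running example (purchases only; paths omitted).
T2 = [
    ("B{i1}", "C{i2}"), ("B{i1}", "E{i4}"), ("C{i2}", "E{i4}"),
    ("S{i7}", "J{i8}"), ("S{i7}", "L{i9}"), ("S{i7}", "Q{i10}"),
]

def large_count(items, frequent_patterns):
    """LC of each item: how many frequent patterns contain it."""
    return {item: sum(item in pattern for pattern in frequent_patterns) for item in items}

def next_candidates(items, frequent_patterns, k):
    """Generate (k+1)-candidates from one segment, or nothing if the LC test fails."""
    lc = large_count(items, frequent_patterns)
    qualified = [item for item in items if lc[item] >= k]
    if len(qualified) <= k:
        return []          # "No Generations"
    return list(combinations(qualified, k + 1))

abce = ["B{i1}", "C{i2}", "E{i4}"]
asjlq = ["S{i7}", "J{i8}", "L{i9}", "Q{i10}"]
c3_abce = next_candidates(abce, T2, k=2)
c3_asjlq = next_candidates(asjlq, T2, k=2)
```

For segment ABCE all three items have LC 2, so one 3-candidate is produced; for ASJLQ only S{i7} qualifies, so the segment is skipped without counting any candidate.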
Mining for Web Transactions
• <ABCE : B{1}, E{4}> = 2
• <AB : B{1}> = 3
• We can derive <ABCE : B{1} => E{4}>
– support_count(<ABCE : B{1} => E{4}>) = 2
– confidence(<ABCE : B{1} => E{4}>) = support_count(<ABCE : B{1}, E{4}>) / support_count(<AB : B{1}>) = 2/3
Summary
• Data mining in the Web is an area of growing importance
– In particular, with the emergence of EC
– More and more applications will benefit from the knowledge from data mining
• Web Mining = Web Data Collection + Traditional Data Mining?
• Important Issues
– Incremental Web Mining