Data e Web Mining 825368 Paolo Gobbo Smart Miner: A New Framework for Mining Large Scale Web Usage...
Data e Web Mining
825368 Paolo Gobbo
Smart Miner: A New Framework for Mining Large Scale Web Usage Data
Bayir – Toroslu – Cosar - Fidan
Data Mining on Web
Web Mining: discover and retrieve useful and interesting patterns from large web datasets
• web content mining: text and multimedia documents (the real data in web pages)
• web structure mining: hyperlink structure (data describing the organization of the content)
• web usage mining: web log records (data describing the pattern of usage of web pages)
PreProcessing
[Figure: preprocessing pipeline. Input: Site Files, Access Log, Referrer Log, Agent Log, Registration data. Preprocessing steps: Site Crawler, Data Cleaning, Path Completion, User Identification, Session Identification, Transaction Identification. Products: Site Topology, User Session File, Transaction File, SQL Query.]
Session Identification
Partitioning each user's activities into sequences (sessions) of entries from the web request logs
• time oriented heuristics: temporal boundaries
  session length: $T(P_n) - T(P_1) \le$ maximum session length
  page-stay: $\forall i,\ 1 \le i < n:\ T(P_{i+1}) - T(P_i) \le$ maximum page-stay time
• navigation oriented heuristics: link between web pages
  $\forall i > 1,\ \exists j,\ 0 < j < i:\ Link(P_j, P_i) = true$
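The time oriented heuristics above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `split_sessions`, the sample log and the 10-minute page-stay limit are assumptions made for the example (the 30-minute session limit mirrors the time-out quoted later for the real data set).

```python
from typing import List, Tuple

Request = Tuple[str, float]  # (page, timestamp in minutes); hypothetical record layout


def split_sessions(requests: List[Request],
                   max_session_length: float = 30.0,
                   max_page_stay: float = 10.0) -> List[List[Request]]:
    """Split one user's time-ordered requests into sessions.

    A new session starts when either the total session length or the
    gap from the previous request (page-stay) exceeds its threshold.
    """
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for page, ts in requests:
        if current:
            too_long = ts - current[0][1] > max_session_length
            idle = ts - current[-1][1] > max_page_stay
            if too_long or idle:
                sessions.append(current)
                current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions


# Hypothetical log of a single user.
log = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 25), ("P34", 27), ("P23", 45)]
sessions = split_sessions(log)
pages_per_session = [[page for page, _ in s] for s in sessions]
```

With this log the user is idle for more than 10 minutes before P49 and before P23, so three sessions are produced.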
Sequential Mining
Association Mining with the order of transactions
• items: $I = \{i_1, i_2, \ldots, i_m\}$
• itemset/element: $X = \{x_1, x_2, \ldots, x_k\}$, with $X \neq \emptyset$ and $x_i \in I$
• sequence: $s = \langle a_1, a_2, \ldots, a_r \rangle$, where each $a_i$ is an itemset
• sequence size: number of itemsets/elements
• sequence length: number of items
• subsequence: $s_1 = \langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of $s_2 = \langle b_1, b_2, \ldots, b_m \rangle$ if $\exists\, i_1 < i_2 < \ldots < i_n$: $a_1 \subseteq b_{i_1},\ a_2 \subseteq b_{i_2},\ \ldots,\ a_n \subseteq b_{i_n}$
Given a set of data sequences, find all sequences with a user-specified minimum support
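The subsequence definition can be made concrete with a small Python sketch. The helper names (`seq`, `is_subsequence`, `sequence_support`) are illustrative assumptions; the database is the set of five customer sequences used in the Sequential Mining Algorithm example at the end of these slides.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[int]


def seq(*itemsets) -> List[Itemset]:
    """Build a sequence (list of itemsets) from plain sets."""
    return [frozenset(x) for x in itemsets]


def is_subsequence(s1: Sequence[Itemset], s2: Sequence[Itemset]) -> bool:
    """True if every itemset of s1 is contained, in order, in a distinct itemset of s2."""
    pos = 0
    for a in s1:
        # Greedily take the earliest itemset of s2 that contains a.
        while pos < len(s2) and not a <= s2[pos]:
            pos += 1
        if pos == len(s2):
            return False
        pos += 1
    return True


def sequence_support(pattern: Sequence[Itemset], database) -> float:
    """Fraction of data sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)


database = [
    seq({30}, {90}),
    seq({10, 20}, {30}, {40, 60, 70}),
    seq({30, 50, 70}),
    seq({30}, {40, 70}, {90}),
    seq({90}),
]
support_30_90 = sequence_support(seq({30}, {90}), database)
support_30_4070 = sequence_support(seq({30}, {40, 70}), database)
```

Both patterns are contained in two of the five customer sequences, so each has support 0.4.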
Sequential Mining algorithms
GSP, APrioriAll, APrioriSome
• Sort Phase: transforms customer transactions into customer sequences
• LargeItemSet Phase: generates the set of large itemsets
• Transformation Phase: represents customer sequences based on large itemsets
• Sequence Phase: derives large k-sequences based on large (k-1)-sequences
• Maximal Phase: prunes non-maximal sequences
Smart-SRA session
$S = \{S_1, S_2, \ldots, S_m\}$ (set of Smart-SRA sessions)
$S_x = [P_1, P_2, \ldots, P_k, P_{k+1}, \ldots, P_n]$ (session: a path in the web site)
• timestamp ordering (time oriented) rule:
  $\forall i,\ 1 \le i < n:\ T(P_i) < T(P_{i+1})$
  $\forall i,\ 1 \le i < n:\ T(P_{i+1}) - T(P_i) \le \sigma$ (page-stay)
  $T(P_n) - T(P_1) \le$ maximum session length
• topology (navigation oriented) rule:
  $\forall i,\ 1 \le i < n:\ Link(P_i, P_{i+1}) = true$
• maximality rule:
  $\forall S_x \in S,\ \nexists S_y \in S:\ S_x \subset S_y$
Smart Miner
[Figure: Smart Miner framework. DATA STREAM → SMART-SRA SESSION CONSTRUCTION (Candidate Session, Smart Session) → SEQUENTIAL MINING (Sequential AprioriAll) → FREQUENT ACCESS PATTERN]
Smart Miner: First Phase Smart SRA
Candidate session construction
• time oriented heuristics: session length, page-stay
• no backward movement
[Figure: Web Site Graph with pages P1, P13, P20, P49, P34, P23]
Candidate Session
Page        P1   P20  P13  P49  P34  P23
TimeStamp   0    6    9    12   14   15
Page        P13  P20  P23  P49
TimeStamp   0    5    9    10
Smart Miner: Second Phase Smart SRA
Smart session construction
• time oriented heuristics: inherited session length, re-check page-stay
• no backward movement
• topology rule
• maximality
[Figure: Web Site Graph with pages P1, P13, P20, P49, P34, P23]
Candidate Session
Page        P1   P20  P13  P49  P34  P23
TimeStamp   0    6    9    12   14   15
Smart Session
[P1, P13, P34, P23]
[P1, P13, P49, P23]
[P1, P20, P23]
Smart Miner: Second Phase Smart SRA
SMART SESSION RECONSTRUCTION
foreach CandSession in CandSessionSet
    NewSessionSet = {}
    while CandSession ≠ Ø
        TSessionSet = {}; TPageSet = {}
        // find the pages with no incoming link
        foreach Pagei in CandSession
            StartPageFlag = TRUE
            foreach Pagej in CandSession with j < i
                if (Link[Pagej, Pagei] and TimeDiff(Pagei, Pagej) ≤ σ) then
                    StartPageFlag = FALSE
            endfor
            if StartPageFlag then TPageSet = TPageSet U {Pagei}
        endfor
        CandSession = CandSession − TPageSet
        if NewSessionSet = {} then
            // session set construction
            foreach Pagei in TPageSet
                TSessionSet = TSessionSet U {[Pagei]}
        else
            // session set extension
            foreach Pagei in TPageSet
                foreach Sessionj in NewSessionSet
                    if (Link[Last(Sessionj), Pagei] and TimeDiff(Last(Sessionj), Pagei) ≤ σ) then
                        TSession = Sessionj
                        TSession.mark = UNEXTENDED
                        TSession = TSession • Pagei
                        TSessionSet = TSessionSet U {TSession}
                        Sessionj.mark = EXTENDED
                    endif
                endfor
            endfor
        endif
        // keep the sessions that were not extended
        foreach Sessionj in NewSessionSet
            if Sessionj.mark ≠ EXTENDED then
                TSessionSet = TSessionSet U {Sessionj}
        endfor
        NewSessionSet = TSessionSet
    endwhile
endfor
Session Construction Example
Iteration   CandidateSession                   TPageSet       NewSessionSet
1           [P1, P20, P13, P49, P34, P23]      {P1}           [P1]
2           [P20, P13, P49, P34, P23]          {P20, P13}     [P1, P20] [P1, P13]
3           [P49, P34, P23]                    {P49, P34}     [P1, P13, P34] [P1, P13, P49] [P1, P20]
4           [P23]                              {P23}          [P1, P13, P34, P23] [P1, P13, P49, P23] [P1, P20, P23]
[Figure: Web Site Graph with pages P1, P13, P20, P49, P34, P23]
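The worked example can be replayed with a compact Python sketch of the second phase. Several things here are assumptions made for illustration: the function name `smart_sra`, the page-stay limit σ = 10, the restriction to candidate sessions in which each page occurs once, and the link set, which is inferred from the sessions in the example rather than read from the web site graph.

```python
from typing import Dict, List, Set, Tuple


def smart_sra(candidate: List[str], ts: Dict[str, float],
              links: Set[Tuple[str, str]], sigma: float) -> List[List[str]]:
    """Rebuild maximal link-consistent sessions from one candidate session."""
    remaining = list(candidate)
    sessions: List[List[str]] = []
    while remaining:
        # Pages with no admissible incoming link from an earlier remaining page.
        starts = [p for i, p in enumerate(remaining)
                  if not any((q, p) in links and ts[p] - ts[q] <= sigma
                             for q in remaining[:i])]
        remaining = [p for p in remaining if p not in starts]
        if not sessions:
            new_sessions = [[p] for p in starts]
        else:
            new_sessions = []
            extended = set()
            for p in starts:
                for idx, s in enumerate(sessions):
                    last = s[-1]
                    if (last, p) in links and ts[p] - ts[last] <= sigma:
                        new_sessions.append(s + [p])
                        extended.add(idx)
            # Sessions that could not be extended are carried over unchanged.
            new_sessions += [s for idx, s in enumerate(sessions)
                             if idx not in extended]
        sessions = new_sessions
    return sessions


candidate = ["P1", "P20", "P13", "P49", "P34", "P23"]
timestamps = {"P1": 0, "P20": 6, "P13": 9, "P49": 12, "P34": 14, "P23": 15}
# Link set inferred from the example sessions (assumption).
links = {("P1", "P13"), ("P1", "P20"), ("P13", "P34"), ("P13", "P49"),
         ("P20", "P23"), ("P34", "P23"), ("P49", "P23")}
smart_sessions = smart_sra(candidate, timestamps, links, sigma=10)
```

Under these assumptions the sketch goes through the same four iterations as the table and ends with the three sessions [P1, P13, P34, P23], [P1, P13, P49, P23] and [P1, P20, P23].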
Sequential APrioriAll
Pruning
• applied during candidate sequence generation, before calculating the support of the candidates
• topological constraint: for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter
• string matching constraint: a session S supports a pattern P if and only if P is a subsequence of S that does not violate string matching
  <1,2,3> supports <1,2>
  <1,2,3> does not support <1,3>
Support
I : pattern
S : user reconstructed sessions
$Support(I, S) = \dfrac{|\{S_i \mid S_i \in S,\ I \text{ is a substring of } S_i\}|}{|S|}$
• one scan through the transaction database, keeping the candidate sessions in a hashmap
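A minimal Python sketch of this support computation follows, with the candidates held in a dictionary (hash map) so that every session is scanned once. The function name `count_supports` and the choice of candidates are assumptions; the sessions are the three smart sessions from the construction example.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def count_supports(candidates: Iterable[Sequence[str]],
                   sessions: List[List[str]]) -> Dict[Pattern, float]:
    """Support of each candidate: share of sessions containing it as a contiguous substring."""
    counts: Dict[Pattern, int] = {tuple(c): 0 for c in candidates}
    lengths = {len(c) for c in counts}
    for session in sessions:
        seen = set()  # candidates found in this session, counted once each
        for n in lengths:
            for i in range(len(session) - n + 1):
                window = tuple(session[i:i + n])
                if window in counts:
                    seen.add(window)
        for pattern in seen:
            counts[pattern] += 1
    return {pattern: hits / len(sessions) for pattern, hits in counts.items()}


sessions = [
    ["P1", "P13", "P34", "P23"],
    ["P1", "P13", "P49", "P23"],
    ["P1", "P20", "P23"],
]
supports = count_supports([("P1", "P13"), ("P1", "P23"), ("P23",)], sessions)
```

Here [P1, P13] is a substring of two of the three sessions, [P23] of all three, and [P1, P23] of none, because P1 and P23 are never adjacent.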
Sequential Apriori Algorithm
SEQUENTIAL APRIORI
INPUT:  minimum support frequency : δ
        reconstructed sessions : S
        topology information : Link
        set of all web pages : P
OUTPUT: set of maximal frequent patterns : Max

// length-1 candidate pattern generation
L1 = {}
for i = 1 to |P| do
    if Support([Pi], S) > δ then L1 = L1 U {[Pi]}
for k = 1 to N-1 do
    if Lk = Ø then
        Halt                                      // no further generation
    else
        // length-(k+1) candidate pattern generation
        Lk+1 = {}
        foreach Ii in Lk
            foreach Pj in P
                if Link[Last(Ii), Pj] then        // joining step; pruning step: topological rule
                    T = Ii • Pj                   // append page
                    if Support(T, S) > δ then     // support rule
                        T.maximal = true          // maximality rule
                        Ii.maximal = false
                        V = [T2, T3, …, T|T|]
                        if V in Lk then V.maximal = false
                        Lk+1 = Lk+1 U {T}
                    endif
                endif
            endfor
        endfor
    endif
endfor
// union of the sets of maximal patterns
Max = {}
for k = 1 to N-1 do
    Max = Max U {I | I in Lk and I.maximal = true}
endfor
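The level-wise procedure can be sketched in Python as below. This is a simplified illustration, not the authors' implementation: the name `sequential_apriori`, the use of a set instead of per-pattern `maximal` flags, and the link set (inferred from the session construction example) are assumptions.

```python
from typing import List, Set, Tuple

Pattern = Tuple[str, ...]


def sequential_apriori(sessions: List[List[str]], pages: List[str],
                       links: Set[Tuple[str, str]],
                       min_support: float) -> Set[Pattern]:
    """Return the maximal frequent patterns (contiguous, link-consistent)."""

    def support(pattern: Pattern) -> float:
        n = len(pattern)
        hits = sum(any(tuple(s[i:i + n]) == pattern
                       for i in range(len(s) - n + 1))
                   for s in sessions)
        return hits / len(sessions)

    # Length-1 patterns.
    level = {(p,) for p in pages if support((p,)) > min_support}
    maximal = set(level)
    while level:
        next_level = set()
        for pattern in level:
            for page in pages:
                if (pattern[-1], page) not in links:
                    continue                   # topological rule
                cand = pattern + (page,)       # append page
                if support(cand) > min_support:
                    next_level.add(cand)
                    maximal.discard(pattern)   # prefix is no longer maximal
                    maximal.discard(cand[1:])  # suffix is no longer maximal
        maximal |= next_level
        level = next_level
    return maximal


sessions = [
    ["P1", "P13", "P34", "P23"],
    ["P1", "P13", "P49", "P23"],
    ["P1", "P20", "P23"],
]
pages = ["P1", "P13", "P20", "P23", "P34", "P49"]
links = {("P1", "P13"), ("P1", "P20"), ("P13", "P34"), ("P13", "P49"),
         ("P20", "P23"), ("P34", "P23"), ("P49", "P23")}
frequent_half = sequential_apriori(sessions, pages, links, min_support=0.5)
frequent_low = sequential_apriori(sessions, pages, links, min_support=0.3)
```

With a support threshold of 0.5 only [P1, P13] and [P23] survive as maximal patterns; lowering the threshold to 0.3 makes the three full sessions the maximal patterns.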
Accuracy Metric
$MP_A$ : frequent maximal patterns of the agent simulator
$MP_H$ : frequent maximal patterns of the heuristic
recall: $REC_H = \dfrac{|MP_H \cap MP_A|}{|MP_A|}$
precision: $PRE_H = \dfrac{|MP_H \cap MP_A|}{|MP_H|}$
accuracy: $A_H = REC_H * PRE_H$
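As a small numeric illustration of these three measures, the following Python sketch computes them for two pattern sets. The function name `accuracy_metrics` and the pattern sets are made up for the example and are not results from the slides.

```python
from typing import Set, Tuple

Pattern = Tuple[str, ...]


def accuracy_metrics(mp_heuristic: Set[Pattern],
                     mp_agent: Set[Pattern]) -> Tuple[float, float, float]:
    """Return (recall, precision, accuracy) with accuracy = recall * precision."""
    common = mp_heuristic & mp_agent
    recall = len(common) / len(mp_agent) if mp_agent else 0.0
    precision = len(common) / len(mp_heuristic) if mp_heuristic else 0.0
    return recall, precision, recall * precision


# Hypothetical pattern sets, for illustration only.
mp_agent = {("P1", "P13"), ("P1", "P20"), ("P13", "P49"), ("P20", "P23")}
mp_heuristic = {("P1", "P13"), ("P1", "P20"), ("P34", "P23")}
recall, precision, accuracy = accuracy_metrics(mp_heuristic, mp_agent)
```

Two of the four agent patterns are recovered (recall 0.5) and two of the three heuristic patterns are correct (precision 2/3), which gives an accuracy of 1/3.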
Agent Simulator
Agent Simulator Parameters
• STP (Session Termination Probability): probability of terminating the session
• LPP (Link from Previous page Probability): probability of referring the next page from one of the previously accessed pages, except the most recently accessed one
• LPC (Link from Current page Probability): probability of referring the next page from the most recently visited page
• NIP (New Initial page Probability): probability of selecting one of the starting pages of a web site during the navigation
Simulated Data
Web topology
• number of web pages from 10 to 1000
• number of users from 1000 to 10000
Agent simulator parameters
• NIP/STP 0.1 , 0.2 , 0.5 , 1.0 , 2.0 , 5.0 , 10.0
• LPC/LPP 0.1 , 0.2 , 0.5 , 1.0 , 2.0 , 5.0 , 10.0
• 49 different cases
Support parameter
• Values 0.001 , 0.0025 , 0.005 , 0.0075 , 0.01
Runs of agent simulator
• 10 different random runs
Results on Simulated Data
[Figure: results on simulated data, two panels with axes labeled NIP (New Initial Page Probability) and STP (Session Termination Probability). Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA]
Results on Simulated Data
[Figure: results on simulated data. Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA]
Real Data
AGMLAB’s company web site
• 4 months user activity
• 3801 users
• 30 minutes session time-out
• 10 web pages
• link graph densely connected
User Activity
• action tracking program
• cookies
• cookie information recorded to a server log file
Results on Real Data
[Figure: results on real data. Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA]
Scalability
[Figures: Performance on 100 GB Data; Performance with 50 nodes]
MAP/REDUCE paradigm
Each node processes a block of the session database, computing the local frequency of each candidate pattern
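The map/reduce idea can be imitated on a single machine with a short Python sketch: each "node" counts candidate patterns in its own block of sessions, and the local counts are then summed. The names `map_block` and `reduce_counts`, the two-block split and the candidates are assumptions for illustration; a real deployment would run the map step on separate nodes.

```python
from collections import Counter
from functools import reduce
from typing import List, Tuple

Pattern = Tuple[str, ...]


def map_block(block: List[List[str]], candidates: List[Pattern]) -> Counter:
    """Map step: local frequency of each candidate within one block of sessions."""
    local: Counter = Counter()
    for session in block:
        for cand in candidates:
            n = len(cand)
            if any(tuple(session[i:i + n]) == cand
                   for i in range(len(session) - n + 1)):
                local[cand] += 1
    return local


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge two local frequency tables."""
    return a + b


candidates = [("P1", "P13"), ("P20", "P23"), ("P1", "P23")]
blocks = [
    [["P1", "P13", "P34", "P23"], ["P1", "P13", "P49", "P23"]],  # block on node 1
    [["P1", "P20", "P23"]],                                       # block on node 2
]
total = reduce(reduce_counts, (map_block(b, candidates) for b in blocks), Counter())
```

Because the blocks are disjoint, the summed local counts equal the global counts that a single scan over the whole session database would give.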
Web references/Bibliography
M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data, 2009
R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, 1999
J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, 2000
M.G. Da Costa Jr., Z. Gong, Web Structure Mining: An Introduction, 2005
J.J. Jung, Semantic PreProcessing of Web Request Streams for Web Usage Mining, 2005
R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995
GSP
GSP – GENERALIZED SEQUENTIAL PATTERN
C1 = Init_Pass
L1 = {<{f}> | f in C1, with minimum support}
for (k = 2; Lk-1 ≠ Ø; k++) do begin
    Ck = Candidate-gen-SPM(Lk-1)
    foreach sequence s in the database D do
        foreach candidate c in Ck
            if (c in s) then update the count of candidate c
    Lk = candidates c in Ck with minimum support
end
result = Uk(Lk)

CANDIDATE-GEN-SPM
// join step
foreach p in Lk-1
    foreach q in Lk-1
        if ($\forall n,\ 2 \le n \le k-1:\ p_n = q_{n-1}$)
            then Ck = Ck U {<p1, …, pk-1, qk-1>}
// prune step
foreach s in Ck
    if exists(r | $r \subset s \wedge r \notin L_{k-1}$)
        then Ck = Ck - {s}
GSP Example
L3-sequences:
<{1,2},{4}>
<{1,2},{5}>
<{1},{4,5}>
<{1,4},{6}>
<{2},{4,5}>
<{2},{4},{6}>
Candidate 4-sequences (join step):
<{1,2},{4,5}>
<{1,2},{4},{6}>
Candidate 4-sequences (prune step):
<{1,2},{4,5}>
(<{1,2},{4},{6}> is pruned because its subsequence <{1},{4},{6}> is not in L3)
APrioriAll
APRIORIALL
L1 = {large 1-sequences}
for (k = 2; Lk-1 ≠ Ø; k++) do begin
    Ck = Apriori-generate(Lk-1)
    foreach sequence c in the database D do
        update candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
result = maximal sequences in Uk(Lk)

APRIORI-GENERATE
// join step
foreach p in Lk-1
    foreach q in Lk-1
        if (p.x1 = q.x1) ˄ (p.x2 = q.x2) ˄ … ˄ (p.xk-2 = q.xk-2)
            then Ck = Ck U {<p.x1, …, p.xk-1, q.xk-1>}
// prune step
foreach s in Ck
    if exists(r | $r \subset s \wedge r \notin L_{k-1}$)
        then Ck = Ck - {s}
APrioriAll Example
L3-sequences:
<1,2,3>
<1,2,4>
<1,3,4>
<1,3,5>
<2,3,4>
Candidate 4-sequences (join step):
<1,2,3,4>
<1,2,4,3>
<1,3,4,5>
<1,3,5,4>
Candidate 4-sequences (prune step):
<1,2,3,4>
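The join and prune steps of this example can be reproduced with a brief Python sketch. The function names `join_step` and `prune_step` are assumptions, and the join here only pairs two different sequences, which is what the example shows.

```python
from typing import Set, Tuple

Sequence_ = Tuple[int, ...]


def join_step(large_prev: Set[Sequence_]) -> Set[Sequence_]:
    """Join two different (k-1)-sequences that agree on their first k-2 elements."""
    return {p + (q[-1],)
            for p in large_prev for q in large_prev
            if p != q and p[:-1] == q[:-1]}


def prune_step(candidates: Set[Sequence_],
               large_prev: Set[Sequence_]) -> Set[Sequence_]:
    """Keep a candidate only if every (k-1)-subsequence is large."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))}


l3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join_step(l3)
pruned = prune_step(joined, l3)
```

The join yields the four candidates listed above, and only <1,2,3,4> survives pruning: for instance <1,2,4,3> is dropped because <2,4,3> is not in L3.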
APrioriSome
APRIORISOME
// Forward Phase
L1 = {large 1-sequences}; C1 = L1; last = 1
for (k = 2; Ck-1 ≠ Ø; k++) do begin
    if (Lk-1 known) then Ck = Apriori-generate(Lk-1)
    else Ck = Apriori-generate(Ck-1)
    if (k = next(last)) then
        foreach sequence c in the database D do
            update candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support; last = k
end
// Backward Phase
for (k--; k >= 1; k--) do begin
    if (Lk not found) then
        delete all sequences in Ck contained in some Li, i > k
        foreach sequence c in the database D do
            update candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
    else
        delete all sequences in Lk contained in some Li, i > k
end
result = maximal sequences in Uk(Lk)
Sequential Mining Algorithm
Customer ID   Transaction Time   Items
1             June 25 '93        30
1             June 25 '93        90
2             June 10 '93        10, 20
2             June 15 '93        30
2             June 20 '93        40, 60, 70
3             June 25 '93        30, 50, 70
4             June 25 '93        30
4             June 30 '93        40, 70
4             July 25 '93        90
5             June 12 '93        90

Customer ID   Customer Sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

Large itemset   Mapped to
(30)            1
(40)            2
(70)            3
(40 70)         4
(90)            5

Customer ID   Customer Sequence (transformed)
1             <{1} {5}>
2             <{1} {2, 3, 4}>
3             <{1, 3}>
4             <{1} {2, 3, 4} {5}>
5             <{5}>
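The transformation from customer sequences to mapped sequences can be checked with a few lines of Python. The function name `transform` and the data layout are assumptions; the large itemsets and their numbers are the ones from the mapping table.

```python
from typing import Dict, FrozenSet, List, Set

# Large itemsets and the integers they are mapped to.
LITEMSETS: Dict[FrozenSet[int], int] = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(sequence: List[Set[int]],
              litemsets: Dict[FrozenSet[int], int]) -> List[Set[int]]:
    """Replace each transaction by the ids of the large itemsets it contains.

    Transactions that contain no large itemset are dropped.
    """
    out: List[Set[int]] = []
    for transaction in sequence:
        ids = {ident for items, ident in litemsets.items() if items <= transaction}
        if ids:
            out.append(ids)
    return out


customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
transformed = {cid: transform(s, LITEMSETS) for cid, s in customer_sequences.items()}
```

For customer 2 the transaction (10 20) contains no large itemset and disappears, while (40 60 70) becomes {2, 3, 4}, which matches the last table.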