IT444: Web Intelligence
1
IT444: Web Intelligence
Revision: Apriori and HITS algorithms
2
APRIORI ALGORITHM
Association Rules
3
Pass 1
• Generate the candidate itemsets in C1
• Save the frequent itemsets in L1
Pass k
1. Generate the candidate itemsets in C_k from the frequent itemsets in L_{k-1}
   – Join L_{k-1} p with L_{k-1} q, as follows:
       insert into C_k
       select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
       from L_{k-1} p, L_{k-1} q
       where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
   – Generate all (k-1)-subsets from the candidate itemsets in C_k
   – Prune all candidate itemsets from C_k where some (k-1)-subset of the candidate itemset is not in the frequent itemset L_{k-1}
2. Scan the transaction database to determine the support for each candidate itemset in C_k
3. Save the frequent itemsets in L_k
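The join and prune steps of pass k can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `apriori_gen`, the representation of itemsets as sorted tuples, and the small input set are all assumptions made for the example.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build the candidates C_k from the frequent (k-1)-itemsets L_{k-1}.

    Itemsets are assumed to be sorted tuples, e.g. ("A", "B").
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: the first k-2 items match and p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # Prune step: keep cand only if every (k-1)-subset is frequent.
                if all(s in prev_set for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

# Hypothetical L2, chosen so that one joined candidate survives and one is pruned.
toy_l2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
toy_c3 = apriori_gen(toy_l2)  # [("A", "B", "C")]
```

Here AB joins with AC to give ABC, which is kept because AB, AC and BC are all frequent. BC joins with BD to give BCD, which is pruned because CD is missing from the input.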
4
Example
• Assume the user-specified minimum support is 40%, then generate all frequent itemsets.
• Given: the transaction database shown below:
TID   Items
T1    A, B, C
T2    A, B, C, D, E
T3    A, C, D
T4    A, C, D, E
T5    A, B, C, D
5
Pass-1
C1
Itemset X   Support(X)
A           ?
B           ?
C           ?
D           ?
E           ?
L1
Itemset X   Support(X)
A           100%
B           60%
C           100%
D           80%
E           40%
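The support values in L1 can be checked with a short Python sketch. The dictionary layout and the helper name `support` are assumptions made for illustration; the transactions and the 40% threshold are the ones given in the example.

```python
# Transaction database from the example slide.
transactions = {
    "T1": {"A", "B", "C"},
    "T2": {"A", "B", "C", "D", "E"},
    "T3": {"A", "C", "D"},
    "T4": {"A", "C", "D", "E"},
    "T5": {"A", "B", "C", "D"},
}
MIN_SUPPORT = 0.4  # user-specified minimum support (40%)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if set(itemset) <= items)
    return hits / len(transactions)

# C1: every distinct item with its support; L1: the items meeting the threshold.
all_items = sorted(set().union(*transactions.values()))
c1 = {item: support({item}, transactions) for item in all_items}
l1 = [item for item, s in c1.items() if s >= MIN_SUPPORT]
```

All five items reach at least 40%, so C1 and L1 contain the same itemsets.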
6
Pass-2
C2
Itemset X   Support(X)
A,B         ?
A,C         ?
A,D         ?
A,E         ?
B,C         ?
B,D         ?
B,E         ?
C,D         ?
C,E         ?
D,E         ?
Before computing support, check for pruning. Nothing is pruned, since all subsets of these itemsets are frequent.
C2
Itemset X   Support(X)
A,B         60%
A,C         100%
A,D         80%
A,E         40%
B,C         60%
B,D         40%
B,E         20%
C,D         80%
C,E         40%
D,E         40%
7
L2 (after saving only the frequent itemsets from C2; B,E at 20% is below the 40% minimum support and is dropped)
Itemset X   Support(X)
A,B         60%
A,C         100%
A,D         80%
A,E         40%
B,C         60%
B,D         40%
C,D         80%
C,E         40%
D,E         40%
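Pass 2 can be reproduced with the sketch below. Because every 1-itemset is frequent, the candidates are simply all pairs of items from L1. The variable names are assumptions for illustration; the data comes from the example.

```python
from itertools import combinations

transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D", "E"},
    {"A", "C", "D"},
    {"A", "C", "D", "E"},
    {"A", "B", "C", "D"},
]
MIN_SUPPORT = 0.4
l1 = ["A", "B", "C", "D", "E"]

# C2: all pairs drawn from L1, each with its support from one database scan.
c2 = {}
for pair in combinations(l1, 2):
    count = sum(1 for t in transactions if set(pair) <= t)
    c2[pair] = count / len(transactions)

# L2: only the pairs that reach the minimum support.
l2 = {pair: s for pair, s in c2.items() if s >= MIN_SUPPORT}
dropped = sorted(set(c2) - set(l2))  # [("B", "E")]
```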
8
Pass-3
• To create C3, join only itemsets in L2 that share the same first item (in pass k, the first k - 2 items must match)
Join            Itemset X   Support(X)
AB with AC      A,B,C       ?
AB with AD      A,B,D       ?
AB with AE      A,B,E       ?
AC with AD      A,C,D       ?
AC with AE      A,C,E       ?
AD with AE      A,D,E       ?
BC with BD      B,C,D       ?
CD with CE      C,D,E       ?
C3
9
Pruning
• Prune a candidate itemset if some (k-1)-subset of it is not in the frequent itemsets L_{k-1}.
• In pass 3: find all 2-item subsets of each candidate in C3, and check whether they are in the frequent itemsets L2.
10
C3 after pruning
Itemset X   Support(X)
A,B,C       ?
A,B,D       ?
A,C,D       ?
A,C,E       ?
A,D,E       ?
B,C,D       ?
C,D,E       ?
Pruning eliminates ABE since BE is not frequent.
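The pruning check for pass 3 can be made explicit by listing, for each candidate, the 2-item subsets that are missing from L2. This is an illustrative sketch; the helper name `missing_subsets` is an assumption, while the itemsets are taken from the C3 and L2 tables.

```python
from itertools import combinations

# Frequent 2-itemsets (L2) and the joined 3-item candidates (C3) from the example.
l2 = {("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"),
      ("B", "D"), ("C", "D"), ("C", "E"), ("D", "E")}
c3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "B", "E"), ("A", "C", "D"),
      ("A", "C", "E"), ("A", "D", "E"), ("B", "C", "D"), ("C", "D", "E")]

def missing_subsets(candidate, frequent):
    """Return the (k-1)-subsets of candidate that are not frequent."""
    size = len(candidate) - 1
    return [s for s in combinations(candidate, size) if s not in frequent]

# A candidate is pruned as soon as it has at least one infrequent subset.
pruned = {c: missing_subsets(c, l2) for c in c3 if missing_subsets(c, l2)}
kept = [c for c in c3 if c not in pruned]
```

Only A,B,E is pruned, and the subset responsible is B,E.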
11
• Scan transactions in the database and compute support
Itemset X   Support(X)
A,B,C       60%
A,B,D       40%
A,C,D       80%
A,C,E       40%
A,D,E       40%
B,C,D       40%
C,D,E       40%
L3
12
Pass-4
• First k - 2 = 2 items must match in pass k = 4
Join                    Itemset X   Support(X)
combine ABC with ABD    A,B,C,D     ?
combine ACD with ACE    A,C,D,E     ?
13
Pruning
• Pruning: For ABCD we check whether ABC, ABD, ACD, BCD are frequent. They are in all cases, so we do not prune ABCD.
• For ACDE we check whether ACD, ACE, ADE, CDE are frequent. Yes, in all cases, so we do not prune ACDE
• Both are frequent:
Itemset X   Support(X)
A,B,C,D     40%
A,C,D,E     40%
L4
14
Pass-5
• For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.
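The whole walkthrough, from pass 1 until no candidates remain, can be replayed with the compact sketch below. It is an illustrative implementation under simple assumptions (itemsets as sorted tuples, supports recomputed by scanning a list of sets); the function names are not from the slides.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B", "C", "D", "E"},
    {"A", "C", "D"},
    {"A", "C", "D", "E"},
    {"A", "B", "C", "D"},
]
MIN_SUPPORT = 0.4

def support(itemset, transactions):
    """Fraction of transactions containing all items of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def apriori(transactions, min_support):
    """Return [L1, L2, ...], each a dict mapping itemset tuple -> support."""
    items = sorted(set().union(*transactions))
    candidates = [(item,) for item in items]
    levels = []
    while candidates:
        # Scan the database and keep the candidates that are frequent.
        frequent = {}
        for cand in candidates:
            s = support(cand, transactions)
            if s >= min_support:
                frequent[cand] = s
        if not frequent:
            break
        levels.append(frequent)
        # Join itemsets sharing their first k-2 items, then prune.
        prev = sorted(frequent)
        candidates = []
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:
                if p[:-1] == q[:-1]:
                    cand = p + (q[-1],)
                    if all(s in frequent for s in combinations(cand, len(p))):
                        candidates.append(cand)
    return levels

levels = apriori(TRANSACTIONS, MIN_SUPPORT)
sizes = [len(level) for level in levels]  # [5, 9, 7, 2]
```

The loop stops after L4 because A,B,C,D and A,C,D,E do not share their first three items, so no 5-item candidate can be formed.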
15
Association Rules
• Take the frequent itemset {A, B, C}
• Non-empty proper subsets: {A}, {B}, {C}, {A,B}, {A,C}, {B,C}
• Assume min confidence 70%
• Compute confidence for each rule
Itemset X   Support(X)
A,B,C       60%
A,B,D       40%
A,C,D       80%
A,C,E       40%
A,D,E       40%
B,C,D       40%
C,D,E       40%
16
Rules
• R1: A, B → C
• Confidence = support{A, B, C} / support{A, B} = 0.6 / 0.6 = 1 => 100%
• Compute the confidence of R2
• R2: A, C → B
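The confidence calculation generalizes to every rule that can be formed from {A, B, C}: confidence(X → Y) = support(X ∪ Y) / support(X). The sketch below enumerates those rules. The support values are copied from the tables in the Apriori example; the function name `rules_from` and the tuple-based rule representation are assumptions for illustration.

```python
from itertools import combinations

# Supports taken from L1, L2 and L3 of the worked example.
support = {
    ("A",): 1.0, ("B",): 0.6, ("C",): 1.0,
    ("A", "B"): 0.6, ("A", "C"): 1.0, ("B", "C"): 0.6,
    ("A", "B", "C"): 0.6,
}
MIN_CONFIDENCE = 0.7

def rules_from(itemset, support):
    """Map each rule (lhs, rhs) built from itemset to its confidence."""
    rules = {}
    for size in range(1, len(itemset)):
        for lhs in combinations(itemset, size):
            rhs = tuple(item for item in itemset if item not in lhs)
            # confidence(lhs -> rhs) = support(itemset) / support(lhs)
            rules[(lhs, rhs)] = support[itemset] / support[lhs]
    return rules

rules = rules_from(("A", "B", "C"), support)
strong = {rule: conf for rule, conf in rules.items() if conf >= MIN_CONFIDENCE}
```

With these supports, three of the six rules reach the 70% threshold: B → A, C; A, B → C; and B, C → A, each with confidence 100%.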
17
HITS ALGORITHM
18
19
Example-1
• Apply the HITS algorithm on the following web graph:
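[Figure: web graph with three nodes, 1, 2 and 3. The calculations on the following slides imply that node 1 links to nodes 2 and 3.]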
20
Initialize HUB and AUTH values
• Node 1: HUB = 1, AUTH = 1
• Node 2: HUB = 1, AUTH = 1
• Node 3: HUB = 1, AUTH = 1
21
Normalization
Normalized HUB(1) = HUB(1) / SQRT[HUB(1)^2 + HUB(2)^2 + HUB(3)^2]
Normalized AUTH(1) = AUTH(1) / SQRT[AUTH(1)^2 + AUTH(2)^2 + AUTH(3)^2]
We do this for all pages in the graph.
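The normalization divides each score by the square root of the sum of squared scores. A minimal Python sketch, with the helper name `normalize` and the dictionary representation assumed for illustration:

```python
import math

def normalize(scores):
    """Divide every score by the Euclidean length of the score vector."""
    length = math.sqrt(sum(value * value for value in scores.values()))
    if length == 0:
        # All scores are zero; leave them at zero rather than divide by zero.
        return {node: 0.0 for node in scores}
    return {node: value / length for node, value in scores.items()}

# Initial values from the example: every page starts with HUB = AUTH = 1.
initial = {1: 1.0, 2: 1.0, 3: 1.0}
normalized = normalize(initial)
rounded = {node: round(value, 2) for node, value in normalized.items()}
```

For the initial values, each score becomes 1 / SQRT(3), which rounds to 0.58.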
22
Normalized values
• HUB(1) = 0.58, AUTH(1) = 0.58
• HUB(2) = 0.58, AUTH(2) = 0.58
• HUB(3) = 0.58, AUTH(3) = 0.58
23
Compute new HUB and AUTH values
Node (1)
• HUB(1) = sum of the authority values of the nodes pointed to by node (1) = AUTH(2) + AUTH(3) = 0.58 + 0.58 = 1.16
• AUTH(1) = sum of the hub values of the nodes pointing to node (1) = 0
24
Node (2)
• HUB(2) = sum of the authority values of the nodes pointed to by node (2) = 0
• AUTH(2) = sum of the hub values of the nodes pointing to node (2) = HUB(1) = 0.58
25
Node (3)
• HUB(3) = sum of the authority values of the nodes pointed to by node (3) = 0
• AUTH(3) = sum of the hub values of the nodes pointing to node (3) = HUB(1) = 0.58
26
After Normalization
• HUB(1) = 1.16 / SQRT[(1.16)^2 + 0^2 + 0^2] = 1.16 / SQRT(1.3456) = 1.16 / 1.16 = 1
• AUTH(1) = 0
• HUB(2) = 0, AUTH(2) = 0.71
• HUB(3) = 0, AUTH(3) = 0.71
27
Recalculating HUB and AUTH
• HUB(1) = AUTH(2) + AUTH(3) = 0.71 + 0.71 = 1.42
• AUTH(1) = 0
Normalizing: HUB(1) = 1.42 / SQRT[(1.42)^2 + 0^2 + 0^2] = 1.42 / SQRT(2.0164) = 1.42 / 1.42 = 1
28
Recalculations
• HUB(2) = 0
• AUTH(2) = 0.71
• HUB(3) = 0
• AUTH(3) = 0.71
• Because the values are unchanged, we stop here.
• Page 1 is clearly the hub, and pages 2 and 3 share the honor of being authorities.
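The full iteration can be sketched in Python. The edge list is an assumption inferred from the calculations in the example (node 1 links to nodes 2 and 3), since the original graph figure is not reproduced here. The function names, the tolerance and the iteration cap are also illustrative choices. As in the example, both scores in each round are computed from the previous round's values and then normalized.

```python
import math

def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {n: (v / length if length else 0.0) for n, v in scores.items()}

def hits(nodes, edges, max_iter=100, tol=1e-9):
    """Iterate hub/authority updates until the scores stop changing."""
    hub = normalize({n: 1.0 for n in nodes})
    auth = normalize({n: 1.0 for n in nodes})
    for _ in range(max_iter):
        # Hub: sum of the authority values of the nodes this node points to.
        new_hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        # Authority: sum of the hub values of the nodes pointing to this node.
        new_auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        new_hub, new_auth = normalize(new_hub), normalize(new_auth)
        unchanged = all(
            abs(new_hub[n] - hub[n]) < tol and abs(new_auth[n] - auth[n]) < tol
            for n in nodes
        )
        hub, auth = new_hub, new_auth
        if unchanged:
            break
    return hub, auth

# Assumed graph: node 1 links to nodes 2 and 3.
hub, auth = hits(nodes=[1, 2, 3], edges=[(1, 2), (1, 3)])
hub_rounded = {n: round(v, 2) for n, v in hub.items()}
auth_rounded = {n: round(v, 2) for n, v in auth.items()}
```

Under this assumed graph the scores settle at HUB(1) = 1 and AUTH(2) = AUTH(3) = 0.71, with all other values 0, matching the hand calculation.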