IT444: Web Intelligence



Revision: Apriori and HITS algorithm


APRIORI ALGORITHM: Association Rules


Pass 1
• Generate the candidate itemsets in C_1.
• Save the frequent itemsets in L_1.

Pass k
1. Generate the candidate itemsets in C_k from the frequent itemsets in L_{k-1}:
   – Join L_{k-1} (alias p) with L_{k-1} (alias q), as follows:

       insert into C_k
       select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
       from L_{k-1} p, L_{k-1} q
       where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2},
             p.item_{k-1} < q.item_{k-1}

   – Generate all (k-1)-subsets of each candidate itemset in C_k.
   – Prune every candidate itemset in C_k for which some (k-1)-subset is not in the frequent itemsets L_{k-1}.
2. Scan the transaction database to determine the support of each candidate itemset in C_k.
3. Save the frequent itemsets in L_k.
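The join and prune steps of pass k can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `generate_candidates` and the choice to represent each itemset as a sorted tuple are assumptions made for the sketch.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1}: join step followed by prune step.

    prev_frequent is a set of sorted tuples, each of length k-1
    (hypothetical representation chosen for this sketch).
    """
    prev = sorted(prev_frequent)
    candidates = set()
    for p in prev:
        for q in prev:
            # Join: the first k-2 items match and p's last item < q's last item.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune: drop any candidate that has a (k-1)-subset outside L_{k-1}.
    return {
        c for c in candidates
        if all(s in prev_frequent for s in combinations(c, k - 1))
    }
```

Keeping every itemset sorted is what lets the join compare prefixes directly, and the `<` condition ensures each candidate is produced only once.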


Example

• Assume the user-specified minimum support is 40% (at least 2 of the 5 transactions), then generate all frequent itemsets.

• Given: the transaction database shown below:

TID   Items
T1    A, B, C
T2    A, B, C, D, E
T3    A, C, D
T4    A, C, D, E
T5    A, B, C, D
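As a rough sketch of how support is measured on this database, the snippet below counts the fraction of transactions that contain a given itemset. The names `transactions`, `support` and `MIN_SUPPORT` are illustrative assumptions, not part of the slides.

```python
# The five transactions from the example database.
transactions = {
    "T1": {"A", "B", "C"},
    "T2": {"A", "B", "C", "D", "E"},
    "T3": {"A", "C", "D"},
    "T4": {"A", "C", "D", "E"},
    "T5": {"A", "B", "C", "D"},
}

MIN_SUPPORT = 0.40  # user-specified minimum support


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


# Pass 1: support of every single item.
for item in "ABCDE":
    s = support({item}, transactions)
    label = "frequent" if s >= MIN_SUPPORT else "infrequent"
    print(item, f"{s:.0%}", label)
```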


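Pass-1

C1 (candidates, support not yet computed):

Itemset X   Support(X)
A           ?
B           ?
C           ?
D           ?
E           ?

L1 (after the scan; every item meets the 40% minimum):

Itemset X   Support(X)
A           100%
B           60%
C           100%
D           80%
E           40%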


Pass-2

C2 (candidates, support not yet computed):

Itemset X   Support(X)
A,B         ?
A,C         ?
A,D         ?
A,E         ?
B,C         ?
B,D         ?
B,E         ?
C,D         ?
C,E         ?
D,E         ?

Before computing support, check for pruning. Nothing is pruned, since all subsets of these itemsets are frequent.

C2 (after scanning the database):

Itemset X   Support(X)
A,B         60%
A,C         100%
A,D         80%
A,E         40%
B,C         60%
B,D         40%
B,E         20%
C,D         80%
C,E         40%
D,E         40%

L2 (after saving only the frequent itemsets; B,E is dropped at 20%):

Itemset X   Support(X)
A,B         60%
A,C         100%
A,D         80%
A,E         40%
B,C         60%
B,D         40%
C,D         80%
C,E         40%
D,E         40%


Pass-3

• To create C3, join only itemsets that share the same first item (in pass k, the first k - 2 items must match).

C3:

Join          Itemset X   Support(X)
AB with AC    A,B,C       ?
AB with AD    A,B,D       ?
AB with AE    A,B,E       ?
AC with AD    A,C,D       ?
AC with AE    A,C,E       ?
AD with AE    A,D,E       ?
BC with BD    B,C,D       ?
CD with CE    C,D,E       ?


Pruning

• Prune a candidate itemset from C_k if some (k-1)-subset of it is not in the frequent itemsets L_{k-1}.

• In pass 3: find all 2-item subsets of each candidate in C3 and check whether they are in the frequent itemsets L2.


C3 after pruning

Itemset X   Support(X)
A,B,C       ?
A,B,D       ?
A,C,D       ?
A,C,E       ?
A,D,E       ?
B,C,D       ?
C,D,E       ?

Pruning eliminates A,B,E because its subset B,E is not frequent.


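• Scan the transactions in the database and compute the support of each candidate. All seven candidates reach the 40% minimum, so all of them are saved in L3.

L3:

Itemset X   Support(X)
A,B,C       60%
A,B,D       40%
A,C,D       80%
A,C,E       40%
A,D,E       40%
B,C,D       40%
C,D,E       40%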


Pass-4

• In pass k = 4, the first k - 2 = 2 items must match.

C4:

Join            Itemset X   Support(X)
ABC with ABD    A,B,C,D     ?
ACD with ACE    A,C,D,E     ?


Pruning

• For ABCD we check whether ABC, ABD, ACD, and BCD are frequent. They all are, so we do not prune ABCD.

• For ACDE we check whether ACD, ACE, ADE, and CDE are frequent. They all are, so we do not prune ACDE.

• After the database scan, both candidates turn out to be frequent.

L4:

Itemset X   Support(X)
A,B,C,D     40%
A,C,D,E     40%


Pass-5

• For pass 5 we can't form any candidates because there aren't two frequent 4-itemsets beginning with the same 3 items.
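The whole walk-through, from pass 1 until no candidates remain, can be reproduced with a short Python sketch. It assumes itemsets are stored as sorted tuples and compares integer counts (2 of 5 transactions for the 40% threshold) rather than percentages; the names `apriori`, `TRANSACTIONS` and `MIN_COUNT` are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "B", "C", "D", "E"},
    {"A", "C", "D"},
    {"A", "C", "D", "E"},
    {"A", "B", "C", "D"},
]
MIN_COUNT = 2  # 40% of 5 transactions


def apriori(transactions, min_count):
    """Return [L1, L2, ...]; each level maps an itemset tuple to its count."""
    items = sorted(set().union(*transactions))
    candidates = [(i,) for i in items]
    levels = []
    k = 1
    while candidates:
        # Scan the database to count each candidate.
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        prev = sorted(frequent)
        # Join: first k-2 items match, last item of p < last item of q.
        joined = [p + (q[-1],) for p in prev for q in prev
                  if p[:-1] == q[:-1] and p[-1] < q[-1]]
        # Prune: every (k-1)-subset must itself be frequent.
        candidates = [c for c in joined
                      if all(s in frequent for s in combinations(c, k - 1))]
    return levels


levels = apriori(TRANSACTIONS, MIN_COUNT)
for k, level in enumerate(levels, start=1):
    shown = {"".join(c): f"{n / len(TRANSACTIONS):.0%}" for c, n in level.items()}
    print(f"L{k}:", shown)
```

On this database the loop stops after L4, because the two frequent 4-itemsets do not share their first three items and the join produces nothing.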


Association Rules

• Take the frequent itemset {A, B, C}.
• Its non-empty proper subsets: {A}, {B}, {C}, {A,B}, {A,C}, {B,C}.
• Assume a minimum confidence of 70%.
• Compute the confidence of each rule.

L3 (for reference):

Itemset X   Support(X)
A,B,C       60%
A,B,D       40%
A,C,D       80%
A,C,E       40%
A,D,E       40%
B,C,D       40%
C,D,E       40%


Rules

• R1: A, B → C
• Confidence = support{A, B, C} / support{A, B} = 0.6 / 0.6 = 1, i.e. 100%
• Exercise: compute the confidence of R2: A, C → B
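The confidence computation can be checked for every rule derived from {A, B, C} with a small Python sketch. The support values are copied from the tables in the example; the helper name `rules_from` and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

# Supports taken from the L1, L2 and L3 tables of the example.
support = {
    ("A",): 1.0, ("B",): 0.6, ("C",): 1.0,
    ("A", "B"): 0.6, ("A", "C"): 1.0, ("B", "C"): 0.6,
    ("A", "B", "C"): 0.6,
}
MIN_CONFIDENCE = 0.70


def rules_from(itemset, support):
    """Yield (antecedent, consequent, confidence) for each non-empty proper subset."""
    for size in range(1, len(itemset)):
        for antecedent in combinations(itemset, size):
            consequent = tuple(i for i in itemset if i not in antecedent)
            # confidence(X -> Y) = support(X union Y) / support(X)
            yield antecedent, consequent, support[itemset] / support[antecedent]


for lhs, rhs, conf in rules_from(("A", "B", "C"), support):
    verdict = "kept" if conf >= MIN_CONFIDENCE else "dropped"
    print(f"{','.join(lhs)} -> {','.join(rhs)}: {conf:.0%} ({verdict})")
```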


HITS ALGORITHM


Example-1

• Apply the HITS algorithm on the following web graph:

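[Figure: web graph with three nodes, 1, 2 and 3. The computations that follow use the links 1 → 2 and 1 → 3.]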


Initialize HUB and AUTH values

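[Figure: the same three-node graph, with every node labeled with its initial values.]

HUB(1) = 1, AUTH(1) = 1
HUB(2) = 1, AUTH(2) = 1
HUB(3) = 1, AUTH(3) = 1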


Normalization

Normalized HUB(1) = HUB(1) / SQRT[HUB(1)^2 + HUB(2)^2 + HUB(3)^2]

Normalized AUTH(1) = AUTH(1) / SQRT[AUTH(1)^2 + AUTH(2)^2 + AUTH(3)^2]

We do this for all pages in the graph.
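A minimal Python sketch of this normalization step, assuming scores are kept in a dictionary keyed by page number (the function name `normalize` is illustrative):

```python
import math


def normalize(scores):
    """Divide every score by the square root of the sum of squared scores."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)  # all-zero vector: nothing to scale
    return {page: v / norm for page, v in scores.items()}


hub = {1: 1.0, 2: 1.0, 3: 1.0}
print({page: round(v, 2) for page, v in normalize(hub).items()})
# prints {1: 0.58, 2: 0.58, 3: 0.58}
```

The guard for an all-zero vector is an addition of the sketch; it avoids a division by zero and is not needed for the initial values used here.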


Normalized values

• HUB(1) = 0.58, AUTH(1) = 0.58
• HUB(2) = 0.58, AUTH(2) = 0.58
• HUB(3) = 0.58, AUTH(3) = 0.58


Compute new HUB and AUTH values

Node (1)

• HUB(1) = sum of the authority values of the nodes pointed to by node (1) = AUTH(2) + AUTH(3) = 0.58 + 0.58 = 1.16

• AUTH(1) = sum of the hub values of the nodes pointing to node (1) = 0 (no node points to node (1))


Node (2)

• HUB(2) = sum of the authority values of the nodes pointed to by node (2) = 0 (node (2) points to no other node)

• AUTH(2) = sum of the hub values of the nodes pointing to node (2) = HUB(1) = 0.58


Node (3)

• HUB(3) = sum of the authority values of the nodes pointed to by node (3) = 0 (node (3) points to no other node)

• AUTH(3) = sum of the hub values of the nodes pointing to node (3) = HUB(1) = 0.58


After Normalization

• HUB(1) = 1.16 / SQRT[(1.16)^2 + 0^2 + 0^2] = 1.16 / SQRT(1.3456) = 1.16 / 1.16 = 1
• AUTH(1) = 0
• HUB(2) = 0, AUTH(2) = 0.58 / SQRT[0^2 + (0.58)^2 + (0.58)^2] = 0.71
• HUB(3) = 0, AUTH(3) = 0.71


Recalculating HUB and AUTH

• HUB(1) = AUTH(2) + AUTH(3) = 0.71 + 0.71 = 1.42
• AUTH(1) = 0
• Normalizing: HUB(1) = 1.42 / SQRT[(1.42)^2 + 0^2 + 0^2] = 1.42 / SQRT(2.0164) = 1.42 / 1.42 = 1


Recalculations

• HUB(2) = 0
• AUTH(2) = 0.71
• HUB(3) = 0
• AUTH(3) = 0.71
• Because the values are unchanged, we stop here.
• Page 1 is clearly the hub, and pages 2 and 3 share the honor of being authorities.
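The iteration in this example can be reproduced with a short Python sketch. The edge list (1 → 2 and 1 → 3) is an assumption inferred from the worked computations, since the graph figure is not available in the text; the names `LINKS`, `normalize` and `hits`, and the stopping tolerance, are likewise illustrative.

```python
import math

# Assumed edges, inferred from the computations: page 1 links to pages 2 and 3.
LINKS = {1: [2, 3], 2: [], 3: []}


def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, tol=1e-6, max_iter=100):
    """Iterate hub/authority updates until the values stop changing."""
    hub = normalize({p: 1.0 for p in links})
    auth = normalize({p: 1.0 for p in links})
    for _ in range(max_iter):
        # Both updates read the values of the previous iteration.
        new_hub = {p: sum(auth[q] for q in links[p]) for p in links}
        new_auth = {p: sum(hub[q] for q in links if p in links[q])
                    for p in links}
        new_hub, new_auth = normalize(new_hub), normalize(new_auth)
        change = max(abs(new_hub[p] - hub[p]) + abs(new_auth[p] - auth[p])
                     for p in links)
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


hub, auth = hits(LINKS)
for page in sorted(LINKS):
    print(page, "HUB =", round(hub[page], 2), "AUTH =", round(auth[page], 2))
```

Under these assumptions the sketch settles on a hub value of 1 for page 1 and an authority value of about 0.71 for pages 2 and 3, matching the hand computation.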