DBSocial 2013, New York

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions
DBSocial 2013, New York
Foteini Alvanaki, Sebastian Michel
Excellence Cluster on Multimodal Computing and Interaction (MMCI)


Page 1: DBSocial  2013, New York

Scalable, Continuous Tracking of Tag Co-Occurrences between Short Sets using (Almost) Disjoint Tag Partitions

DBSocial 2013, New York

Foteini Alvanaki Sebastian Michel

Excellence Cluster on Multimodal Computing and Interaction (MMCI)

Page 2: DBSocial  2013, New York

Motivation: enBlogue (1)

[Figure: a stream of incoming documents shown as tag sets, e.g. {#flood, #Lourdes}, {#Algeria, Stratfor}, {#Asanz, #Wikileaks}, {#obamainBerlin, #Merkel}, {#Twisted, ABCFamily}, {#NSA, #Orwell}, {#Kim, #Kanye}, {#Kim, #Baby}, {#Bieber, #NBAFinals}, {#Rihanna, #Bieber, #Youtube}, {#HeatNation, #NBAFinals}, {#obama, #berlin}; some sets, such as {#Kim, #Baby} and {#flood, #Lourdes}, recur several times]

• enBlogue: Identifies emergent topics
• Input: A stream of documents annotated with hash-tags (e.g. Tweets)
• Restricts the focus to the more recent documents using a sliding time window

Page 3: DBSocial  2013, New York

Motivation: enBlogue (2)

[Figure: the stream of tagged documents from the previous slide, e.g. {#Kim, #Baby}, {#flood, #Lourdes}, {#obamainBerlin, #Merkel}, {#Bieber, #NBAFinals}]

• Tracks the correlation of co-occurring hash-tags over time
• Reports on unexpected changes in the correlation

[Figure: correlation of {#Kim, #Baby} plotted over time]

Page 4: DBSocial  2013, New York


Jaccard Coefficient

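• $T_i$: the set of ids of the documents annotated with tag $t_i$

• Pair of tags $\{t_1, t_2\}$:

$$J(t_1, t_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

• Set of n tags $\{t_1, t_2, \ldots, t_n\}$:

$$J(t_1, \ldots, t_n) = \frac{\left|\bigcap_{i=1}^{n} T_i\right|}{\left|\bigcup_{i=1}^{n} T_i\right|}$$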
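As a minimal sketch of the definition above, the coefficient can be computed directly from the document-id sets. The function name `jaccard`, the sample ids, and the choice to return 0.0 for tags without documents are assumptions made for illustration, not part of the slides.

```python
def jaccard(*doc_sets):
    """Jaccard coefficient of n tags, given the document-id set T_i of each tag."""
    if not doc_sets:
        raise ValueError("at least one tag is required")
    union = set().union(*doc_sets)
    if not union:
        return 0.0  # convention chosen here for tags that annotate no document
    intersection = set(doc_sets[0]).intersection(*doc_sets[1:])
    return len(intersection) / len(union)


# Hypothetical document ids for two tags.
T_kim = {1, 2, 3, 5}
T_baby = {2, 3, 4}
print(jaccard(T_kim, T_baby))  # 2 shared ids out of 5 distinct ids
```

Computing the coefficient this way requires keeping the full id sets, which is what the counter-based approach on the following slides avoids.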

Page 5: DBSocial  2013, New York

Jaccard Coefficient Computation

• Maintain counters for all subsets of co-occurring tags

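Documents and their co-occurring tag subsets:

{a, b, c}: {a, b}, {a, c}, {b, c}, {a, b, c}
{b, c, d}: {c, d}, {b, d}, {b, c, d} ({b, c} is shared with the first document)

Counters maintained (union and intersection per subset):

$|A \cup B|$, $|A \cap B|$
$|A \cup C|$, $|A \cap C|$
$|B \cup C|$, $|B \cap C|$
$|C \cup D|$, $|C \cap D|$
$|B \cup D|$, $|B \cap D|$
$|A \cup B \cup C|$, $|A \cap B \cap C|$
$|B \cup C \cup D|$, $|B \cap C \cap D|$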

Page 6: DBSocial  2013, New York


Inclusion – Exclusion Principle

• Compute the cardinality of the union of n sets using the cardinalities of the intersections of all its subsets:

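$$|X \cup Z| = |X| + |Z| - |X \cap Z|$$

$$\left|\bigcup_{i=1}^{n} T_i\right| = \sum_{k=1}^{n} (-1)^{k+1} \left( \sum_{1 \le i_1 < \cdots < i_k \le n} \left|T_{i_1} \cap \cdots \cap T_{i_k}\right| \right)$$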
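The principle can be sketched in a few lines: given only intersection counters, the union size (and therefore the Jaccard coefficient) follows from the alternating sum. The dictionary layout (`frozenset` of tags mapped to a count) and the function names are assumptions for illustration; the counter values are made up.

```python
from itertools import combinations


def union_size(tags, inter):
    """|T_1 ∪ ... ∪ T_n| via inclusion-exclusion.

    `inter` maps a frozenset of tags to the size of the intersection of
    their document sets; a single tag maps to |T_i|.
    """
    total = 0
    for k in range(1, len(tags) + 1):
        sign = 1 if k % 2 == 1 else -1  # (-1)^(k+1)
        for subset in combinations(tags, k):
            total += sign * inter.get(frozenset(subset), 0)
    return total


def jaccard_from_counters(tags, inter):
    """Jaccard coefficient using intersection counters only."""
    union = union_size(tags, inter)
    return inter.get(frozenset(tags), 0) / union if union else 0.0


# Hypothetical counters for T_a = {1,2,3}, T_c = {2,3,4}, T_d = {3,4,5}.
counters = {
    frozenset("a"): 3, frozenset("c"): 3, frozenset("d"): 3,
    frozenset("ac"): 2, frozenset("ad"): 1, frozenset("cd"): 2,
    frozenset("acd"): 1,
}
print(union_size("acd", counters))             # 9 - 5 + 1
print(jaccard_from_counters("acd", counters))
```

The alternating sum touches every non-empty subset of the tag set, so it is practical only because the tag sets are short.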

Page 7: DBSocial  2013, New York

Inclusion – Exclusion Principle: Advantages

• Needs to maintain fewer counters

• Adapts more easily to changes in the load

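[Figure: for the documents {a, b, c} and {b, c, d}, the union counters $|A \cup B|$, $|A \cup C|$, $|B \cup C|$, $|C \cup D|$, $|B \cup D|$, $|A \cup B \cup C|$, $|B \cup C \cup D|$ are replaced by the four single-tag counters $|A|$, $|B|$, $|C|$, $|D|$; the intersection counters are kept. A new document d' = {a, d} only adds the counter $|A \cap D|$.]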
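To make the saving concrete, the following sketch counts the counters needed with and without the inclusion-exclusion principle for the example documents. It assumes, as on the earlier slide, that the direct approach keeps one union and one intersection counter per co-occurring subset of size two or more; the function names are hypothetical.

```python
from itertools import combinations


def co_occurring_subsets(documents):
    """All subsets of size >= 2 of the tag sets seen in the documents."""
    subsets = set()
    for tags in documents:
        for size in range(2, len(tags) + 1):
            subsets.update(frozenset(c) for c in combinations(sorted(tags), size))
    return subsets


def counters_needed(documents):
    """Return (direct, inclusion_exclusion) counter totals."""
    subsets = co_occurring_subsets(documents)
    single_tags = set().union(*documents)
    direct = 2 * len(subsets)                     # one union + one intersection each
    incl_excl = len(subsets) + len(single_tags)   # intersections + one |T_i| per tag
    return direct, incl_excl


docs = [{"a", "b", "c"}, {"b", "c", "d"}]
print(counters_needed(docs))                  # (14, 11)
print(counters_needed(docs + [{"a", "d"}]))   # (16, 12)
```

Adding d' = {a, d} costs two counters in the direct scheme but only one with inclusion-exclusion, because $|A|$ and $|D|$ already exist.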

Page 8: DBSocial  2013, New York


Problem

• For each set of co-occurring tags $\{t_1, t_2, \ldots, t_n\}$ we need:
– The number of documents annotated with each tag, $|T_i|$
– The number of documents annotated with all tags, $\left|\bigcap_{i=1}^{n} T_i\right|$

• The number of co-occurring tag sets is large
• New documents arrive fast and keep changing these numbers

Solution: Let multiple nodes compute the Jaccard coefficient for different tag sets

Page 9: DBSocial  2013, New York

Outline

• Motivation
– enBlogue
– Jaccard Coefficient
– Inclusion – Exclusion Principle
– Problem

• Idea
– Architecture
– Partition Tags
– Updating Counters

• Results
– Theoretical Results
– Experimental Results

• Conclusion

Page 10: DBSocial  2013, New York

Architecture

[Figure: architecture diagram with two groups of nodes: nodes computing the partitions and nodes computing the Jaccard coefficients]

Page 11: DBSocial  2013, New York

Partition Tags: Requisites

1. Treat tag-sets as inseparable units

2. Minimise the overlap of single tags tracked by different nodes

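Example tag sets: {a, b}, {c, d}, {a, c, d}

N1: {a, b}, counters $|A|$, $|B|$, $|A \cap B|$
N2: {c, d}, counters $|C|$, $|D|$, $|C \cap D|$

Counters for {a, c, d} that N1 does not hold yet: $|C|$, $|D|$, $|A \cap C|$, $|A \cap D|$, $|C \cap D|$
Counters for {a, c, d} that N2 does not hold yet: $|A|$, $|A \cap C|$, $|A \cap D|$

$$J(a, b) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

$$J(c, d) = \frac{|C \cap D|}{|C| + |D| - |C \cap D|}$$

$$J(a, c, d) = \frac{|A \cap C \cap D|}{|A| + |C| + |D| - |A \cap C| - |A \cap D| - |C \cap D| + |A \cap C \cap D|}$$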

Page 12: DBSocial  2013, New York

Partition Tags: Algorithm

Phase 1: Create an initial assignment of the tags to the nodes
– Max-k cover: selects k out of n sets that cover the maximum number of elements

Phase 2: Make sure all sets of tags are assigned to some node

Page 13: DBSocial  2013, New York

Partition Tags: Example

d1 = {a, b, c}

d2 = {b, c}

d3 = {a, b, f }

d4 = {d, e, g}

d5 = {a, d, e}

PHASE 1: MAX-2 COVER

Selected sets: {a, b, c} and {a, d, e}

PHASE 2: ASSIGNING REMAINING SETS

Remaining sets: {a, b, f} and {d, e, g}

Node 1: {a, b, c}, {a, b, f} ⇒ tags {a, b, c, f}
Node 2: {d, e, g}, {a, d, e} ⇒ tags {a, d, e, g}
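The two phases can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: Phase 1 uses the standard greedy heuristic for max-k cover (ties broken by input order), and Phase 2 assigns each remaining tag set, as a whole, to the node it shares the most tags with, a rule the slides do not specify. The name `partition_tag_sets` is hypothetical.

```python
def partition_tag_sets(tag_sets, k):
    """Two-phase sketch: greedy max-k cover, then assign leftovers by overlap."""
    sets = [frozenset(s) for s in tag_sets]
    covered = set()
    seeds = []
    remaining = list(range(len(sets)))

    # Phase 1: greedily pick k seed sets, each adding the most uncovered tags.
    for _ in range(min(k, len(sets))):
        best = max(remaining, key=lambda i: len(sets[i] - covered))
        seeds.append(best)
        remaining.remove(best)
        covered |= sets[best]

    nodes = [set(sets[i]) for i in seeds]
    assignment = {sets[i]: n for n, i in enumerate(seeds)}

    # Phase 2: each remaining set stays intact and joins the node sharing most tags.
    for i in remaining:
        target = max(range(len(nodes)), key=lambda n: len(sets[i] & nodes[n]))
        nodes[target] |= sets[i]
        assignment[sets[i]] = target
    return nodes, assignment


docs = [{"a", "b", "c"}, {"b", "c"}, {"a", "b", "f"}, {"d", "e", "g"}, {"a", "d", "e"}]
nodes, assignment = partition_tag_sets(docs, k=2)
print(nodes)
```

On these five documents the greedy step seeds the nodes with {a, b, c} and {d, e, g} rather than the two sets shown on the slide, yet the final partitions come out the same: {a, b, c, f} and {a, d, e, g}, with only tag a replicated.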

Page 14: DBSocial  2013, New York


Update Counters

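N1: {a, b, c, d}, counters $|A|$, $|B|$, $|A \cap B|$, $|C|$, $|D|$, $|C \cap D|$
N2: {b, e, f}, counters $|B|$, $|E|$, $|F|$, $|B \cap E|$, $|B \cap F|$, $|E \cap F|$, $|B \cap E \cap F|$

d4 = {c, d}
– N1: |C|++, |D|++, |C ∩ D|++

d5 = {b, f}
– N1: |B|++
– N2: |B|++, |F|++, |B ∩ F|++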
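A minimal sketch of the update step, assuming each node increments the counter of every non-empty subset of the document's tags that it tracks, and creates counters lazily on first use. The class name `Node` and the tuple-keyed `Counter` are illustrative choices, not taken from the slides.

```python
from collections import Counter
from itertools import combinations


class Node:
    """A node that tracks a partition of tags and their intersection counters."""

    def __init__(self, tags):
        self.tags = frozenset(tags)
        self.counters = Counter()  # key: sorted tuple of tags, value: document count

    def update(self, doc_tags):
        # Only the tags this node is responsible for are relevant here.
        local = sorted(self.tags & set(doc_tags))
        for size in range(1, len(local) + 1):
            for subset in combinations(local, size):
                self.counters[subset] += 1


n1 = Node("abcd")
n2 = Node("bef")
for doc in [{"c", "d"}, {"b", "f"}]:  # d4 and d5
    n1.update(doc)
    n2.update(doc)
print(dict(n1.counters))
print(dict(n2.counters))
```

Note that d5 = {b, f} updates $|B|$ on both nodes: a tag replicated across partitions must be counted wherever it is tracked, which is why the partitioning tries to keep the overlap small.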

Page 15: DBSocial  2013, New York


Finding nodes

d2000 = {a, c}

Inverted Index (tag → nodes tracking it):

a: {N1, N2}
b: {N1}
c: {N1}
d: {N2}
e: {N2}
f: {N1}
g: {N2}

⇒ Nodes for d2000: {N1, N2} ∪ {N1} = {N1, N2}
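The lookup can be sketched as a union over the index entries of the document's tags. The partitions below are the ones from the partitioning example; the function names are hypothetical, and ignoring tags that no node tracks is an assumption made here.

```python
from collections import defaultdict


def build_inverted_index(partitions):
    """Map each tag to the set of nodes whose partition contains it."""
    index = defaultdict(set)
    for node, tags in partitions.items():
        for tag in tags:
            index[tag].add(node)
    return index


def nodes_for(doc_tags, index):
    """Union of the node sets of all tags in the document."""
    nodes = set()
    for tag in doc_tags:
        nodes |= index.get(tag, set())  # unknown tags contribute no node
    return nodes


partitions = {"N1": {"a", "b", "c", "f"}, "N2": {"a", "d", "e", "g"}}
index = build_inverted_index(partitions)
print(nodes_for({"a", "c"}, index))
```

A document whose tags all live on one node, such as {b, c}, is routed to that node alone.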

Page 16: DBSocial  2013, New York

Outline

• Motivation
– enBlogue
– Jaccard Coefficient
– Inclusion – Exclusion Principle
– Problem

• Idea
– Architecture
– Distributing Tags
– Updating Counters

• Results
– Theoretical Results
– Experimental Results

• Conclusion

Page 17: DBSocial  2013, New York


Theoretic expectation

$$E[\text{affected nodes}] = k \cdot \left[ 1 - \left( \frac{\binom{v-m}{m}}{\binom{v}{m}} \right)^{\lfloor n/k \rfloor} \right]$$

• k partitions
• v total tags (vocabulary)
• m randomly selected tags per set
• n total tag-sets
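As a rough numerical sketch, the expectation can be evaluated under the assumption that it has the form k · (1 − p^⌊n/k⌋), where p = C(v−m, m) / C(v, m) is the probability that a new set of m random tags shares no tag with one existing set. That reading of the formula, the function name, and the values of m and n below are assumptions; only k = 10 and v = 1,000,000 appear in the slides.

```python
from math import comb


def expected_affected_nodes(k, v, m, n):
    """Expected number of partitions touched by a new set of m random tags."""
    # Probability that the new set is disjoint from one existing m-tag set.
    no_overlap = comb(v - m, m) / comb(v, m)
    # Each partition holds about n // k tag sets; it is unaffected only if
    # the new set is disjoint from all of them.
    return k * (1 - no_overlap ** (n // k))


# k and v as in the theoretical results; m and n are illustrative values.
for n in (10_000, 100_000, 1_000_000):
    print(n, expected_affected_nodes(k=10, v=1_000_000, m=3, n=n))
```

The value grows with n and is bounded by k, since at most every partition can be affected.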

Page 18: DBSocial  2013, New York

Theoretical Results

• Partitions: 10
• Vocabulary size: 1,000,000

Page 19: DBSocial  2013, New York

Real Data Experiments

• Dataset: Tweets of 15th March 2013
• Partitions: 10

Page 20: DBSocial  2013, New York

Outline

• Motivation
– enBlogue
– Jaccard Coefficient
– Inclusion – Exclusion Principle
– Problem

• Idea
– Architecture
– Distributing Tags
– Updating Counters

• Results
– Theoretical Results
– Experimental Results

• Conclusion

Page 21: DBSocial  2013, New York


Conclusion

• An algorithm to compute the Jaccard coefficient for tag-sets in a massive data stream.

• Applicable to all measures using intersections and/or unions of sets (e.g. Dice)

• Results show small replication

• Load is distributed equally across the nodes
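To illustrate the applicability claim, the sketch below derives both the Jaccard and the Dice coefficient for a pair of tags from the same three counters. The function names and counter values are made up for illustration; the Dice coefficient is taken in its usual form, 2|A ∩ B| / (|A| + |B|).

```python
def jaccard(size_a, size_b, size_ab):
    """Jaccard from |A|, |B| and |A ∩ B|, using |A ∪ B| = |A| + |B| - |A ∩ B|."""
    union = size_a + size_b - size_ab
    return size_ab / union if union else 0.0


def dice(size_a, size_b, size_ab):
    """Dice coefficient from the same counters: 2|A ∩ B| / (|A| + |B|)."""
    total = size_a + size_b
    return 2 * size_ab / total if total else 0.0


# Hypothetical counters: |A| = 4, |B| = 3, |A ∩ B| = 2.
print(jaccard(4, 3, 2))
print(dice(4, 3, 2))
```

No additional state is needed to switch measures, since both are functions of the single-tag and intersection counters the nodes already maintain.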

Page 22: DBSocial  2013, New York


Thank you!