CS246: Mining Massive Datasets Jure Leskovec,...

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

2

Nodes: Football Teams Edges: Games played

Can we identify node groups?

(communities, modules, clusters)

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

3

NCAA conferences

Nodes: Football Teams Edges: Games played


4

Can we identify functional modules?

Nodes: Proteins Edges: Physical interactions


5

Functional modules

Nodes: Proteins Edges: Physical interactions


6

Can we identify social communities?

Nodes: Facebook Users Edges: Friendships


7

High school Summer internship

Stanford (Squash) Stanford (Basketball)

Social communities

Nodes: Facebook Users Edges: Friendships


Non-overlapping vs. overlapping communities

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 8

9

Network Adjacency matrix

Nodes

Nod

es


What is the structure of community overlaps: Edge density in the overlaps is higher!

10

Communities as “tiles” 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

11

This is what we want! Communities in a network


1) Given a model, we generate the network:

2) Given a network, find the “best” model


C

A

B

D E

H

F

G

C

A

B

D E

H

F

G

Generative model for networks


Goal: Define a model that can generate networks The model will have a set of “parameters” that we

will later want to estimate (and detect communities)

Q: Given a set of nodes, how do communities “generate” edges of the network?


C

A

B

D E

H

F

G


Generative model B(V, C, M, {pc}) for graphs: Nodes V, Communities C, Memberships M Each community c has a single probability pc

Later we fit the model to networks to detect communities

14

Model

Network

Communities, C

Nodes, V

Model

pA pB

Memberships, M


AGM generates the links: For each For each pair of nodes in community 𝑨𝑨,

we connect them with prob. 𝒑𝒑𝑨𝑨 The overall edge probability is:

15

Model

∏∩∈

−−=vu MMc

cpvuP )1(1),(

Network

Communities, C

Nodes, V Community Affiliations

pA pB

Memberships, M

If 𝒖𝒖,𝒗𝒗 share no communities: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝜺𝜺 Think of this as an “OR” function: If at least 1 community says “YES” we create an edge

𝑴𝑴𝒖𝒖 … set of communities node 𝒖𝒖 belongs to


16

Model

Network 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

17 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Detecting communities with AGM:

18

C

A

B

D E

H

F

G

Given a Graph, find the Model 1) Affiliation graph M 2) Number of communities C 3) Parameters pc


Maximum Likelihood Principle (MLE): Given: Data 𝑿𝑿 Assumption: Data is generated by some model 𝒇𝒇(𝚯𝚯) 𝒇𝒇 … model 𝚯𝚯 … model parameters

Want to estimate 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯): The probability that our model 𝒇𝒇 (with parameters 𝜣𝜣)

generated the data

Now let’s find the most likely model that could have generated the data: arg max

Θ 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯)


Imagine we are given a set of coin flips Task: Figure the bias of a coin! Data: Sequence of coin flips: 𝑿𝑿 = [𝟏𝟏,𝟎𝟎,𝟎𝟎,𝟎𝟎,𝟏𝟏,𝟎𝟎,𝟎𝟎,𝟏𝟏] Model: 𝒇𝒇 𝚯𝚯 = return 1 with prob. Θ, else return 0 What is 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 ? Assuming coin flips are independent So, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ∗ 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 … ∗ 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 What is 𝑷𝑷𝒇𝒇 𝟏𝟏 𝚯𝚯 ? Simple, 𝑷𝑷𝒇𝒇 𝟎𝟎 𝚯𝚯 = 𝚯𝚯

Then, 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝚯𝚯𝟑𝟑 𝟏𝟏 − 𝚯𝚯 𝟓𝟓 For example: 𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟎𝟎.𝟓𝟓 = 𝟎𝟎.𝟎𝟎𝟎𝟎𝟑𝟑𝟎𝟎𝟎𝟎𝟎𝟎

𝑷𝑷𝒇𝒇 𝑿𝑿 𝚯𝚯 = 𝟑𝟑𝟖𝟖

= 𝟎𝟎.𝟎𝟎𝟎𝟎𝟓𝟓𝟎𝟎𝟎𝟎𝟎𝟎

What did we learn? Our data was most likely generated by coin with bias 𝚯𝚯 = 𝟑𝟑/𝟖𝟖

2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 20 𝑷𝑷𝒇𝒇𝑿𝑿𝚯𝚯

𝚯𝚯

𝚯𝚯∗ = 𝟑𝟑/𝟖𝟖

How do we do MLE for graphs? Model generates a probabilistic adjacency matrix We then flip all the entries of the probabilistic

matrix to obtain the binary adjacency matrix

The likelihood of AGM generating graph G:


0.25 0.10 0.10 0.04 0.05 0.15 0.02 0.06 0.05 0.02 0.15 0.06 0.01 0.03 0.03 0.09

1 1 0 0 0 1 0 1 1 0 1 1 0 1 0 1

For every pair of nodes 𝒖𝒖,𝒗𝒗 AGM gives the prob. 𝒑𝒑𝒖𝒖𝒗𝒗 of them being linked

Flip biased coins

)),(1(),()|(),(),(

vuPvuPGPGvuGvu

−ΠΠ=Θ∉∈

Given graph G and Θ we calculate likelihood that Θ generated G: P(G|Θ)

0.25 0.10 0.10 0.04

0.05 0.15 0.02 0.06

0.05 0.02 0.15 0.06

0.01 0.03 0.03 0.09 Θ=B(V, C, M, {pc})

1 0 1 1

0 1 0 1

1 0 1 1

1 1 1 1

G

P(G|Θ)

)),(1(),()|(),(),(

vuPvuPGPGvuGvu

−ΠΠ=Θ∉∈

G

2/13/2014 22 Jure Leskovec, Stanford C246: Mining Massive Datasets

Our goal: Find 𝚯𝚯 = 𝑩𝑩(𝑽𝑽,𝑪𝑪,𝑴𝑴, 𝒑𝒑𝑪𝑪 ) such that:

How do we find 𝑩𝑩(𝑽𝑽,𝑪𝑪,𝑴𝑴, 𝒑𝒑𝑪𝑪 ) that maximizes the likelihood? No nice way, have to search over

all affiliation graphs But, don’t despair…


ΘP( | ) AGM

arg max Θ

𝑮𝑮

Relaxation: Memberships have strengths

𝑭𝑭𝒖𝒖𝑨𝑨: The membership strength of node 𝒖𝒖 to community 𝑨𝑨 (𝑭𝑭𝒖𝒖𝑨𝑨 = 𝟎𝟎: no membership) Each community 𝑨𝑨 links nodes independently:

𝑷𝑷𝑨𝑨 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨)

24

𝑭𝑭𝒖𝒖𝑨𝑨

u v


Community membership strength matrix 𝑭𝑭


𝑭𝑭 =

j

Communities

Nod

es

𝑭𝑭𝒖𝒖𝑨𝑨 … strength of 𝒖𝒖’s membership to 𝑨𝑨

𝑭𝑭𝒗𝒗 … vector of community membership strengths of 𝒖𝒖

𝑷𝑷𝑨𝑨 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨) Probability of connection is

proportional to the product of strengths

Prob. that at least one common community 𝑪𝑪 links the nodes: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − ∏ 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖,𝒗𝒗𝑪𝑪

Community 𝑨𝑨 links nodes 𝒖𝒖,𝒗𝒗 independently: 𝑷𝑷𝑨𝑨 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨)

Then prob. at least one common 𝑪𝑪 links them: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − ∏ 𝟏𝟏 − 𝑷𝑷𝑪𝑪 𝒖𝒖,𝒗𝒗𝑪𝑪 = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−∑ 𝑭𝑭𝒖𝒖𝑪𝑪 ⋅ 𝑭𝑭𝒗𝒗𝑪𝑪𝑪𝑪 )

= 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝑻𝑻) For example:

26

𝑭𝑭𝒖𝒖:

𝑭𝑭𝒗𝒗: Then: 𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝑻𝑻 = 𝟎𝟎.𝟏𝟏𝟎𝟎 And: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒑𝒑 −𝟎𝟎.𝟏𝟏𝟎𝟎 = 𝟎𝟎.𝟏𝟏𝟏𝟏 But: 𝑷𝑷 𝒖𝒖,𝒘𝒘 = 𝟎𝟎.𝟖𝟖𝟖𝟖 𝑷𝑷 𝒗𝒗,𝒘𝒘 = 𝟎𝟎

𝑭𝑭𝒘𝒘: Node community

membership strengths 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

0 1.2 0 0.2

0.5 0 0 0.8

0 1.8 1 0

Another way to think about this is: Think of a weighted graph (but we don’t see the

weights) 𝒖𝒖,𝒗𝒗 interact with strength 𝑿𝑿𝒖𝒖𝒗𝒗𝑨𝑨 ~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(𝑭𝑭𝒖𝒖𝑨𝑨 ⋅ 𝑭𝑭𝒗𝒗𝑨𝑨) Total interaction strength is the sum: 𝑿𝑿𝒖𝒖𝒗𝒗 = ∑ 𝑿𝑿𝒖𝒖𝒗𝒗𝑪𝑪𝑪𝑪 Then: 𝑿𝑿𝒖𝒖𝒗𝒗~𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃(∑ 𝑭𝑭𝒖𝒖𝑪𝑪 ⋅ 𝑭𝑭𝒗𝒗𝑪𝑪𝐶𝐶 ) We observe an edge if there is non-zero interaction: 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 > 0 = 1 − 𝑃𝑃 𝑋𝑋𝑢𝑢𝑢𝑢 = 0 = = 𝟏𝟏 − 𝐞𝐞𝐞𝐞𝐞𝐞 (−𝑭𝑭𝒖𝒖 ⋅ 𝑭𝑭𝒗𝒗𝑻𝑻)


Poisson PMF:

𝑷𝑷(𝑿𝑿 = 𝒆𝒆) =𝝀𝝀𝒌𝒌

𝒌𝒌!𝒆𝒆−𝝀𝝀

Task: Given a network, estimate 𝑭𝑭 Find 𝑭𝑭 that maximizes the log-likelihood Many times we take the logarithm of the likelihood and

call it log-likelihood: 𝑙𝑙 𝐹𝐹 = log 𝑃𝑃(𝐺𝐺|𝐹𝐹)

28

Objective function is biconvex Non-negative matrix factorization: Update 𝑭𝑭𝒖𝒖𝑪𝑪 for node 𝒖𝒖 while fixing the

memberships of all other nodes

[wsdm ‘13]


Task: Given a network, estimate 𝑭𝑭 Find 𝑭𝑭 that maximizes the log-likelihood Many times we take the logarithm of the likelihood and

call it log-likelihood: 𝒍𝒍 𝑭𝑭 = 𝐥𝐥𝐥𝐥𝐥𝐥 𝑷𝑷(𝑮𝑮|𝑭𝑭)

29

Connection: Non-negative matrix factorization We want to “factor” the adjacency matrix G

F FT Matrix of

edge probs.

𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒑𝒑 (− . ) = 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets

Graph G

Non-negative matrix factorization: Use a gradient based approach! But updating the whole 𝑭𝑭 takes too long Computing the gradient takes quadratic time! Our network are far too big to allow for that: Network with 1M nodes, can have 1012 different edges

Instead, update 𝑭𝑭 in small “units”: Update 𝑭𝑭𝒖𝒖𝑪𝑪 for node 𝒖𝒖 while fixing the memberships of

all other nodes


Compute gradient of a single row 𝑭𝑭𝒖𝒖 of 𝑭𝑭:

Coordinate gradient ascent: Iterate over the rows of 𝑭𝑭: Compute gradient 𝜵𝜵𝒍𝒍 𝑭𝑭𝒖𝒖 of row 𝒖𝒖 (while keeping others fixed) Update the row 𝑭𝑭𝒖𝒖: 𝑭𝑭𝒖𝒖 ← 𝑭𝑭𝒖𝒖 + 𝜼𝜼 𝛁𝛁𝒍𝒍(𝑭𝑭𝒖𝒖) Project 𝑭𝑭𝒖𝒖 back to a non-negative vector: If 𝑭𝑭𝒖𝒖𝑪𝑪 < 𝟎𝟎: 𝑭𝑭𝒖𝒖𝑪𝑪 = 𝟎𝟎

This is slow! Computing 𝜵𝜵𝒍𝒍 𝑭𝑭𝒖𝒖 takes linear time!


However, we notice:

We can cache ∑ 𝑭𝑭𝒗𝒗𝒗𝒗 So, computing ∑ 𝑭𝑭𝒗𝒗𝒗𝒗∉𝓝𝓝(𝒖𝒖) now takes linear time

in the degree 𝓝𝓝(𝒖𝒖) of 𝒖𝒖 In networks degree of a node is much smaller to the total

number of nodes in the network, so this is a significant speedup!


BigCLAM takes 5 minutes for 300k node nets Other methods take 10 days

Can process networks with 100M edges! 33

Extension: Make community membership edges directed: Outgoing membership: Nodes “sends” edges Incoming membership: Node “receives” edges


Everything is almost the same except now we have 2 matrices: 𝑭𝑭 and 𝑯𝑯 𝑭𝑭… out-going community memberships 𝑯𝑯… in-coming community memberships

Edge prob.: 𝑷𝑷 𝒖𝒖,𝒗𝒗 = 𝟏𝟏 − 𝒆𝒆𝒆𝒆𝒑𝒑(−𝑭𝑭𝒖𝒖𝑯𝑯𝒗𝒗𝑻𝑻)

Network log-likelihood:

which we optimize the same way as before 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 37

𝑭𝑭 𝑯𝑯

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.

Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.

Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.


http://cs.stanford.edu/people/jure/pubs/bigclam-wsdm13.pdf

http://cs.stanford.edu/people/jure/pubs/bigclam-wsdm13.pdf

http://cs.stanford.edu/people/jure/pubs/coda-wsdm14.pdf



http://cs.stanford.edu/people/jure/pubs/cesna-icdm13.pdf

CS246: Mining Massive Datasets Jure Leskovec,...

Documents

Transcript of CS246: Mining Massive Datasets Jure Leskovec,...