CS246: Mining Massive Datasets Jure Leskovec,...
Transcript of CS246: Mining Massive Datasets Jure Leskovec,...
CS246: Mining Massive Datasets Jure Leskovec, Stanford University
http://cs246.stanford.edu
2
Nodes: Football Teams Edges: Games played
Can we identify node groups?
(communities, modules, clusters)
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
3
NCAA conferences
Nodes: Football Teams Edges: Games played
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
4
Can we identify functional modules?
Nodes: Proteins Edges: Physical interactions
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
5
Functional modules
Nodes: Proteins Edges: Physical interactions
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
6
Can we identify social communities?
Nodes: Facebook Users Edges: Friendships
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
7
High school Summer internship
Stanford (Squash) Stanford (Basketball)
Social communities
Nodes: Facebook Users Edges: Friendships
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Non-overlapping vs. overlapping communities
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 8
9
Network Adjacency matrix
Nodes
Nod
es
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
What is the structure of community overlaps: Edge density in the overlaps is higher!
10
Communities as โtilesโ 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
11
This is what we want! Communities in a network
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
1) Given a model, we generate the network:
2) Given a network, find the โbestโ model
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 12
C
A
B
D E
H
F
G
C
A
B
D E
H
F
G
Generative model for networks
Generative model for networks
Goal: Define a model that can generate networks The model will have a set of โparametersโ that we
will later want to estimate (and detect communities)
Q: Given a set of nodes, how do communities โgenerateโ edges of the network?
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 13
C
A
B
D E
H
F
G
Generative model for networks
Generative model B(V, C, M, {pc}) for graphs: Nodes V, Communities C, Memberships M Each community c has a single probability pc
Later we fit the model to networks to detect communities
14
Model
Network
Communities, C
Nodes, V
Model
pA pB
Memberships, M
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
AGM generates the links: For each For each pair of nodes in community ๐จ๐จ,
we connect them with prob. ๐๐๐จ๐จ The overall edge probability is:
15
Model
โโฉโ
โโ=vu MMc
cpvuP )1(1),(
Network
Communities, C
Nodes, V Community Affiliations
pA pB
Memberships, M
If ๐๐,๐๐ share no communities: ๐ท๐ท ๐๐,๐๐ = ๐บ๐บ Think of this as an โORโ function: If at least 1 community says โYESโ we create an edge
๐ด๐ด๐๐ โฆ set of communities node ๐๐ belongs to
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
16
Model
Network 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested
17 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Detecting communities with AGM:
18
C
A
B
D E
H
F
G
Given a Graph, find the Model 1) Affiliation graph M 2) Number of communities C 3) Parameters pc
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Maximum Likelihood Principle (MLE): Given: Data ๐ฟ๐ฟ Assumption: Data is generated by some model ๐๐(๐ฏ๐ฏ) ๐๐ โฆ model ๐ฏ๐ฏ โฆ model parameters
Want to estimate ๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ): The probability that our model ๐๐ (with parameters ๐ฃ๐ฃ)
generated the data
Now letโs find the most likely model that could have generated the data: arg max
ฮ ๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ)
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 19
Imagine we are given a set of coin flips Task: Figure the bias of a coin! Data: Sequence of coin flips: ๐ฟ๐ฟ = [๐๐,๐๐,๐๐,๐๐,๐๐,๐๐,๐๐,๐๐] Model: ๐๐ ๐ฏ๐ฏ = return 1 with prob. ฮ, else return 0 What is ๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ ? Assuming coin flips are independent So, ๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ = ๐ท๐ท๐๐ ๐๐ ๐ฏ๐ฏ โ ๐ท๐ท๐๐ ๐๐ ๐ฏ๐ฏ โ ๐ท๐ท๐๐ ๐๐ ๐ฏ๐ฏ โฆ โ ๐ท๐ท๐๐ ๐๐ ๐ฏ๐ฏ What is ๐ท๐ท๐๐ ๐๐ ๐ฏ๐ฏ ? Simple, ๐ท๐ท๐๐ ๐๐ ๐ฏ๐ฏ = ๐ฏ๐ฏ
Then, ๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ = ๐ฏ๐ฏ๐๐ ๐๐ โ ๐ฏ๐ฏ ๐๐ For example: ๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ = ๐๐.๐๐ = ๐๐.๐๐๐๐๐๐๐๐๐๐๐๐
๐ท๐ท๐๐ ๐ฟ๐ฟ ๐ฏ๐ฏ = ๐๐๐๐
= ๐๐.๐๐๐๐๐๐๐๐๐๐๐๐
What did we learn? Our data was most likely generated by coin with bias ๐ฏ๐ฏ = ๐๐/๐๐
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 20 ๐ท๐ท๐๐๐ฟ๐ฟ๐ฏ๐ฏ
๐ฏ๐ฏ
๐ฏ๐ฏโ = ๐๐/๐๐
How do we do MLE for graphs? Model generates a probabilistic adjacency matrix We then flip all the entries of the probabilistic
matrix to obtain the binary adjacency matrix
The likelihood of AGM generating graph G:
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 21
0.25 0.10 0.10 0.04 0.05 0.15 0.02 0.06 0.05 0.02 0.15 0.06 0.01 0.03 0.03 0.09
1 1 0 0 0 1 0 1 1 0 1 1 0 1 0 1
For every pair of nodes ๐๐,๐๐ AGM gives the prob. ๐๐๐๐๐๐ of them being linked
Flip biased coins
)),(1(),()|(),(),(
vuPvuPGPGvuGvu
โฮ ฮ =ฮโโ
Given graph G and ฮ we calculate likelihood that ฮ generated G: P(G|ฮ)
0.25 0.10 0.10 0.04
0.05 0.15 0.02 0.06
0.05 0.02 0.15 0.06
0.01 0.03 0.03 0.09 ฮ=B(V, C, M, {pc})
1 0 1 1
0 1 0 1
1 0 1 1
1 1 1 1
G
P(G|ฮ)
)),(1(),()|(),(),(
vuPvuPGPGvuGvu
โฮ ฮ =ฮโโ
G
2/13/2014 22 Jure Leskovec, Stanford C246: Mining Massive Datasets
Our goal: Find ๐ฏ๐ฏ = ๐ฉ๐ฉ(๐ฝ๐ฝ,๐ช๐ช,๐ด๐ด, ๐๐๐ช๐ช ) such that:
How do we find ๐ฉ๐ฉ(๐ฝ๐ฝ,๐ช๐ช,๐ด๐ด, ๐๐๐ช๐ช ) that maximizes the likelihood? No nice way, have to search over
all affiliation graphs But, donโt despairโฆ
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 23
ฮP( | ) AGM
arg max ฮ
๐ฎ๐ฎ
Relaxation: Memberships have strengths
๐ญ๐ญ๐๐๐จ๐จ: The membership strength of node ๐๐ to community ๐จ๐จ (๐ญ๐ญ๐๐๐จ๐จ = ๐๐: no membership) Each community ๐จ๐จ links nodes independently:
๐ท๐ท๐จ๐จ ๐๐,๐๐ = ๐๐ โ ๐๐๐๐๐๐ (โ๐ญ๐ญ๐๐๐จ๐จ โ ๐ญ๐ญ๐๐๐จ๐จ)
24
๐ญ๐ญ๐๐๐จ๐จ
u v
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Community membership strength matrix ๐ญ๐ญ
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 25
๐ญ๐ญ =
j
Communities
Nod
es
๐ญ๐ญ๐๐๐จ๐จ โฆ strength of ๐๐โs membership to ๐จ๐จ
๐ญ๐ญ๐๐ โฆ vector of community membership strengths of ๐๐
๐ท๐ท๐จ๐จ ๐๐,๐๐ = ๐๐ โ ๐๐๐๐๐๐ (โ๐ญ๐ญ๐๐๐จ๐จ โ ๐ญ๐ญ๐๐๐จ๐จ) Probability of connection is
proportional to the product of strengths
Prob. that at least one common community ๐ช๐ช links the nodes: ๐ท๐ท ๐๐,๐๐ = ๐๐ โ โ ๐๐ โ ๐ท๐ท๐ช๐ช ๐๐,๐๐๐ช๐ช
Community ๐จ๐จ links nodes ๐๐,๐๐ independently: ๐ท๐ท๐จ๐จ ๐๐,๐๐ = ๐๐ โ ๐๐๐๐๐๐ (โ๐ญ๐ญ๐๐๐จ๐จ โ ๐ญ๐ญ๐๐๐จ๐จ)
Then prob. at least one common ๐ช๐ช links them: ๐ท๐ท ๐๐,๐๐ = ๐๐ โ โ ๐๐ โ ๐ท๐ท๐ช๐ช ๐๐,๐๐๐ช๐ช = ๐๐ โ ๐๐๐๐๐๐ (โโ ๐ญ๐ญ๐๐๐ช๐ช โ ๐ญ๐ญ๐๐๐ช๐ช๐ช๐ช )
= ๐๐ โ ๐๐๐๐๐๐ (โ๐ญ๐ญ๐๐ โ ๐ญ๐ญ๐๐๐ป๐ป) For example:
26
๐ญ๐ญ๐๐:
๐ญ๐ญ๐๐: Then: ๐ญ๐ญ๐๐ โ ๐ญ๐ญ๐๐๐ป๐ป = ๐๐.๐๐๐๐ And: ๐ท๐ท ๐๐,๐๐ = ๐๐ โ ๐๐๐๐๐๐ โ๐๐.๐๐๐๐ = ๐๐.๐๐๐๐ But: ๐ท๐ท ๐๐,๐๐ = ๐๐.๐๐๐๐ ๐ท๐ท ๐๐,๐๐ = ๐๐
๐ญ๐ญ๐๐: Node community
membership strengths 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
0 1.2 0 0.2
0.5 0 0 0.8
0 1.8 1 0
Another way to think about this is: Think of a weighted graph (but we donโt see the
weights) ๐๐,๐๐ interact with strength ๐ฟ๐ฟ๐๐๐๐๐จ๐จ ~๐๐๐๐๐๐๐๐(๐ญ๐ญ๐๐๐จ๐จ โ ๐ญ๐ญ๐๐๐จ๐จ) Total interaction strength is the sum: ๐ฟ๐ฟ๐๐๐๐ = โ ๐ฟ๐ฟ๐๐๐๐๐ช๐ช๐ช๐ช Then: ๐ฟ๐ฟ๐๐๐๐~๐๐๐๐๐๐๐๐(โ ๐ญ๐ญ๐๐๐ช๐ช โ ๐ญ๐ญ๐๐๐ช๐ช๐ถ๐ถ ) We observe an edge if there is non-zero interaction: ๐๐ ๐๐๐ข๐ข๐ข๐ข > 0 = 1 โ ๐๐ ๐๐๐ข๐ข๐ข๐ข = 0 = = ๐๐ โ ๐๐๐๐๐๐ (โ๐ญ๐ญ๐๐ โ ๐ญ๐ญ๐๐๐ป๐ป)
27 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Poisson PMF:
๐ท๐ท(๐ฟ๐ฟ = ๐๐) =๐๐๐๐
๐๐!๐๐โ๐๐
Task: Given a network, estimate ๐ญ๐ญ Find ๐ญ๐ญ that maximizes the log-likelihood Many times we take the logarithm of the likelihood and
call it log-likelihood: ๐๐ ๐น๐น = log ๐๐(๐บ๐บ|๐น๐น)
28
Objective function is biconvex Non-negative matrix factorization: Update ๐ญ๐ญ๐๐๐ช๐ช for node ๐๐ while fixing the
memberships of all other nodes
[wsdm โ13]
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Task: Given a network, estimate ๐ญ๐ญ Find ๐ญ๐ญ that maximizes the log-likelihood Many times we take the logarithm of the likelihood and
call it log-likelihood: ๐๐ ๐ญ๐ญ = ๐ฅ๐ฅ๐ฅ๐ฅ๐ฅ๐ฅ ๐ท๐ท(๐ฎ๐ฎ|๐ญ๐ญ)
29
Connection: Non-negative matrix factorization We want to โfactorโ the adjacency matrix G
F FT Matrix of
edge probs.
๐๐ โ ๐๐๐๐๐๐ (โ . ) = 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Graph G
Non-negative matrix factorization: Use a gradient based approach! But updating the whole ๐ญ๐ญ takes too long Computing the gradient takes quadratic time! Our network are far too big to allow for that: Network with 1M nodes, can have 1012 different edges
Instead, update ๐ญ๐ญ in small โunitsโ: Update ๐ญ๐ญ๐๐๐ช๐ช for node ๐๐ while fixing the memberships of
all other nodes
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 30
Compute gradient of a single row ๐ญ๐ญ๐๐ of ๐ญ๐ญ:
Coordinate gradient ascent: Iterate over the rows of ๐ญ๐ญ: Compute gradient ๐ต๐ต๐๐ ๐ญ๐ญ๐๐ of row ๐๐ (while keeping others fixed) Update the row ๐ญ๐ญ๐๐: ๐ญ๐ญ๐๐ โ ๐ญ๐ญ๐๐ + ๐ผ๐ผ ๐๐๐๐(๐ญ๐ญ๐๐) Project ๐ญ๐ญ๐๐ back to a non-negative vector: If ๐ญ๐ญ๐๐๐ช๐ช < ๐๐: ๐ญ๐ญ๐๐๐ช๐ช = ๐๐
This is slow! Computing ๐ต๐ต๐๐ ๐ญ๐ญ๐๐ takes linear time!
31 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
However, we notice:
We can cache โ ๐ญ๐ญ๐๐๐๐ So, computing โ ๐ญ๐ญ๐๐๐๐โ๐๐(๐๐) now takes linear time
in the degree ๐๐(๐๐) of ๐๐ In networks degree of a node is much smaller to the total
number of nodes in the network, so this is a significant speedup!
32 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
BigCLAM takes 5 minutes for 300k node nets Other methods take 10 days
Can process networks with 100M edges! 33
34 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Extension: Make community membership edges directed: Outgoing membership: Nodes โsendsโ edges Incoming membership: Node โreceivesโ edges
35 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 36
Everything is almost the same except now we have 2 matrices: ๐ญ๐ญ and ๐ฏ๐ฏ ๐ญ๐ญโฆ out-going community memberships ๐ฏ๐ฏโฆ in-coming community memberships
Edge prob.: ๐ท๐ท ๐๐,๐๐ = ๐๐ โ ๐๐๐๐๐๐(โ๐ญ๐ญ๐๐๐ฏ๐ฏ๐๐๐ป๐ป)
Network log-likelihood:
which we optimize the same way as before 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 37
๐ญ๐ญ ๐ฏ๐ฏ
38 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets
Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.
Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.
Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.
2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 39