Methods and Applications for Detecting Structure in ...€¦ · structure, community structure, in...

Post on 25-Jul-2020

4 views 0 download

Transcript of Methods and Applications for Detecting Structure in ...€¦ · structure, community structure, in...

Methods and Applications for Detecting Structure in Complex Networks

Elizabeth A. Leicht

Introduction

http://www-personal.umich.edu/~mejn/networks/attweb.gif

Yeast proteins: Maslov & Sneppen, Science 296, 910-913 (2002).

Web site: M. E. J. Newman & M. Girvan, Physical Review E 69, 026113 (2004).

7th Grade Students: adapted from Moreno, Who Shall Survive, 1934.

Visualizing a Network

Large Networks

Talk Outline

• Introduction to networks and their structure.

• Network structure detection method 1: linear-algebra and optimization techniques.

• Network structure detection method 2: probabilistic mixture models and the expectation-maximization algorithm.

• Conclusions.

Detecting Communities in Directed Networks

- In many networks, the vertices divide naturally into communities, also known as groups or modules.

- Previous methods for detecting communities in networks have mostly focused just on undirected networks.

- One established method for partitioning a network relies on identifying statistically surprising configurations of edges by measuring modularity (Q).

Q =

(fraction of edges within communities)- (expected fraction of such edges)

Calculating Modularity

Rewrite modularity using the adjacency matrix.

Q =1

m

n!

ij=1

[Aij ! Pij ] !ci,cj

How do we determine the “expected” number of edges between two vertices?

Aij =

!

1, if there is an edge from j to i

0, otherwise .

ci = the community to which i belongs.Pij = the expected number of edges from j to i.

Pij =

kini koutj

m

kini

koutj

Division of a Network into Two Communitiessj = !1si = +1

Q =1

2m

n!

ij=1

!

Aij !kini koutj

m

"

(sisj + 1)

!ij =1

2(sisj + 1)

Let Bij = Aij !

kini koutj

mbe an element of the modularity matrix.

Let v(1) be the eigenvectorcorresponding to the mostpositive eigenvalue of thesymmetric matrix B + BT

.

si =

!

+1, if v(1)i

> 0

!1, if v(1)i

< 0

Q =1

2msTBs =

1

2msTB

Ts =

1

4msT

!

B + BT"

s

Two Communities and More

- A network can be divided into more than two communities by repeated bisection.

- We can calculate the additional contribution to the modularity, ∆ Q, from the each additional division and halt the divisions

when no there is no additional increase in the modularity.

Communities with Edge Direction Bias

Communities with Edge Direction Bias

Hubs and Authorities Example

Hubs and Authorities Example

Edge Direction Bias in Real Networks

Minnesota

Indiana

Michigan

Iowa Wisconsin

Michigan State

Northwestern Ohio State

Pennsylvania State

Purdue

Illinois

Edge Direction Bias in Real Networks

Purdue IndianaIowa

Michigan

Wisconsin

Michigan State

Minnesota

Northwestern Ohio State

Pennsylvania State

Illinois

Exploratory Analysis of Structure in Networks

- Previously we identified communities in networks because we specifically sought a method to detect modules in networks.

- Reliance on specific measures of network structure where we are required to know the type of structure, for which we are looking, in advance can be limiting.

- We turn to probabilistic techniques and the Expectation Maximization (EM) Algorithm to identify general patterns of connection between vertices.

The Method

There are three types of quantities in this method of approach.

1. Observed data: the actual edges falling between pairs of vertices in a network ( ).

2. Missing data: we assume that the vertices divide into c groups. We denote the group to which vertex belongs as and set of all missing data as

3. Model parameters: they describe the patterns of vertices in different groups (θ, π).

A

i

gi g.

!ri =

the probability that there exists an edgefrom a vertex in group r to a vertex i

!r = the probability of a vertex belonging to group r

c!

r=1

!r = 1

n!

i=1

!ri = 1

The likelihood of the data given the model is

where

Frequently, one works not with the likelihood itself, but with the log-likelihood.

A Likelihood Problem

Pr(A|g,!, ") =!

ij

"Aij

gj,iand Pr(g|!, ") =

!

j

!gj.

Pr(A,g|!, ") = Pr(A|g,!, ")Pr(g|!, "),

L = ln Pr(A,g|!, ") =!

j

!

ln!gj+

"

i

"Aij

gj,i

"

.

We cannot directly observe g.

It is, however, possible to calculate an expected value for the log-likelihood over all possible values of g.

where

Dealing with the “Missing” Data

L =

c!

r=1

n!

j=1

qrj

!

ln!r +

n!

i=1

Aij ln "ri

"

qjr = Pr(gj = r|A,!, ") =Pr(A,gj = r|!, ")

Pr(A|!, ")=

!r

∏i "

Aij

ri∑

s !s

∏i "

Aij

si

.

L =

c!

g1=1

· · ·

c!

gn=1

Pr(g|A, !,")

n!

j=1

!

ln"gj+

n!

i=1

Aij ln!gj,i

"

The EM Algorithm

- Initialize model parameters (θ, π) with random values.

- Find the probability a given vertex is a member of group (E-step).

- Maximize the model parameters (M-step)

- Iterate until convergence.

qjr =

!r

!i "

Aij

ri"

s !s

!i "

Aij

si

.

!r =

1

n

!

i

qir !ri =

!j Aijqjr

!j kjqjr

.

rj

The AddHealth Network

Example: AddHealth

Example: A “Keystone” Network

Example: A “Keystone” Network

“Big Ten” Results with the EM Algorithm

0 0.5 1

qj1

Vertex shading based on probability of being assigned to group 1.

Vertex size based on probability of being beaten by teams assigned to group 1.

Example: The “Big Ten” Conference

Illinois

Purdue

Indiana

WisconsinIowa

Pennsylvania StateMinnesota

Michigan StateNorthwestern

Ohio StateMichigan

Conclusions

- There are many existing methods for detecting structure in complex networks, but this work takes two new and interesting approaches.

- In this dissertation a method for detecting one specific type of structure, community structure, in directed networks is presented.

- Additionally, a method for detecting more general types of structure by turning to probabilistic mixture models and the Expectation-Maximization Algorithm is given.

- These two topics are not the only ones covered in my dissertation.

- The dissertation also introduces method for tackling the specific issue of identifying similar vertices.

- A method for probing the structure of time evolving citation networks was also developed.