Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3...

34
Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 1 CSE 6240: Web Search and Text Mining. Spring 2020 Random Graph Models Prof. Srijan Kumar

Transcript of Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3...

Page 1: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 1

CSE 6240: Web Search and Text Mining. Spring 2020

Random Graph Models

Prof. Srijan Kumar

Page 2: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 2

Today’s Lecture: Networks• Networks introduction • Web as a network • Networks properties• Random graph model: Erdos-Renyi Random Graph Model• Random graph model: Small-world Random Graph Model

Some slides are inspired by Prof. Jure Leskovec’s slides

Page 3: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3

Simplest Model of Graphs¡ Erdös-Renyi Random Graphs [Erdös-Renyi, 1960]• Two variants:

– Gn,p: undirected graph on n nodes and each edge (u,v) appears i.i.d. with probability p

– Gn,m: undirected graph with n nodes and m edges, where edges are picked uniformly at random

• What kind of networks do such models produce?

Page 4: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 4

Random Graph Models: Intuition

• n and p do not uniquely determine the graph!– The graph is a result of a random process

• We can have many different realizations given the same n and p

n = 10 p= 1/6

Page 5: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 5

Random Graph Model: Edges

• How likely is a graph on E edges?• P(E): the probability that a given Gnp generates a graph on

exactly E edges:

where Emax=n(n-1)/2 is the maximum possible number of edges in an undirected graph of n nodes

• P(E) is a Binomial distribution: NumberofsuccessesinasequenceofEmax independentyes/noexperiments

EEE ppEE

EP --÷÷ø

öççè

æ= max)1()( max

Page 6: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 6

Node Degrees in a Random Graph

• What is expected degree of a node?

• Probability of node u linking to node v is p• u can link (flips a coin) to all other (n-1) nodes• Thus, the expected degree of node u is: p(n-1)

E[Xv ]= E[Xvu ]= (n−1)pu=1

n−1

Page 7: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining

Key Network Properties

7

• Degree distribution: P(k)

• Clustering coefficient: C

• Path length: h

What are the values of these properties for Gnp?

Page 8: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 8

Degree Distribution• Degree distribution of Gnp is binomial• Let P(k) denote the fraction of nodes with degree k:

• Mean and variance of a binomial distribution

knk ppkn

kP ---÷÷ø

öççè

æ -= 1)1(

1)(

Select k nodes out of n-1

Probability of having k edges

Probability of missing the rest of the n-1-k edges

σ 2 = p(1− p)(n−1)

)1( -= npk

Page 9: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 9

Degree Distribution• As the network size increases, the distribution becomes

increasingly narrow—we are increasingly confident that the degree of a node is in the vicinity of k.

σk=1− pp

1(n−1)

"

#$

%

&'

1/2

≈1

(n−1)1/2

P(k)

k

Page 10: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 10

Clustering Coefficient of Gnp

• Clustering coefficient – Where ei is the number of edges between i’s neighbors

• So,

• Clustering coefficient of a random graph is small– Bigger graphs with the same average degree k have lower

clustering coefficient

nk

nkp

kkkkpCii

ii »-

==--×

=1)1(

)1(

)1(2-

=ii

ii kk

eC

ei = pki (ki −1)2

Number of distinct pairs of neighbors of node i of degree ki

Each pair is connected with prob. p

Page 11: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining

Key Network Properties

11

• Degree distribution:

• Clustering coefficient: C=p=k/n

• Path length: h

What are the values of these properties for Gnp?

knk ppkn

kP ---÷÷ø

öççè

æ -= 1)1(

1)(

Page 12: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 12

Average Shortest Path• Average path length = O(log n)• Erdös-Renyi networks can grow to be very large but nodes

will be just a few hops apart

0 200000 400000 600000 800000 1000000

05

1015

20

num nodes

aver

age

shor

test

pat

h

Page 13: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining

MSN Network Properties vs. Gnp Properties

13

Degree distribution:

Path length: 6.6 O(log n) ~ 8.2

Clustering coefficient: 0.11 k / n ≈ 8·10-8

MSN Gnp

Page 14: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 14

Clustering Implies Edge Locality• MSN network has 7 orders of magnitude larger clustering

than the corresponding Gnp!• Other examples:

– Actor Collaborations (IMDB): N = 225,226 nodes, avg. degree k = 61– Electrical power grid: N = 4,941 nodes, k = 2.67– Network of neurons: N = 282 nodes, k = 14

Network hactual hrandom Cactual Crandom

Film actors 3.65 2.99 0.00027

Power Grid 18.70 12.40 0.005

C. elegans 2.65 2.25 0.05

Page 15: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 15

Gnp Simulation Experiment: Giant Component

• n=100,000, k=p(n-1) = 0.5 … 3• Emergence of a giant component: average degree k=2E/n

or p=k/(n-1)– When k=1-ε: all

components are of size Ω(log n)

– k=1+ε: 1 component of size Ω(n), others have size Ω(log n)

Fraction of nodes in the largest component

p*(n-1)=1

Page 16: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 16

Real Networks vs. Gnp

• Are real networks like random graphs?– Giant connected component: YES – Average path length: YES – Clustering Coefficient: NO– Degree Distribution: NO

Page 17: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 17

Real Networks vs. Gnp

• Problems with the random networks model:– Degree distribution differs from that of real networks– Giant component in most real networks does NOT emerge

through a phase transition– No local structure – clustering coefficient is too low

• Most important: Are real networks random?– The answer is simply: NO!

Page 18: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 18

Real Networks vs. Gnp

• If Gnp is wrong, why did we spend time on it?– It is the reference model for the rest of the class. – It will help us calculate many quantities, that can then be

compared to the real data– It will help us understand to what degree is a particular

property the result of some random process• While Gnp is not realistic, it is useful

Page 19: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 19

Problem with the ER Model• Gnp model has short paths: O(log n)

– This is the smallest diameter we can get if we have a constant degree.

– But clustering is low!• But real networks have “local” structure

– Triadic closure: Friend of a friend is my friend– High clustering but diameter is also high

• Can we generate graphs with high clustering coefficient while having short paths (low diameter)?

• Solution: Small-World Model

Low diameterLow clustering coefficient

High clustering coefficientHigh diameter

Page 20: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 20

Today’s Lecture: Networks• Networks introduction • Web as a network • Networks properties• Random graph model: Erdos-Renyi Random Graph Model• Random graph model: Small-world Random Graph Model

Some slides are inspired by Prof. Jure Leskovec’s slides

Page 21: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 21

Six Degrees of Kevin Bacon

Origins of a small-world idea:• The Bacon number:

– Create a network of Hollywood actors– Connect two actors if they co-appeared in

the movie– Bacon number: number of steps to Kevin

Bacon• As of Dec 2007, the highest Bacon

number reported is 8• Only approx. 12% of all actors cannot be

linked to Bacon

Page 22: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 22

Erdos Number

• Erdos Number: number of hops in scientific co-author graph to reach Paul Erdos

• Srijan’ Erdos number is 4.

• Find out your Erdos number: http://www.ams.org/mathscinet/collaborationDistance.html

Page 23: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 23

The Small-World Experiment• What is the typical shortest path length

between any two people?– Experiment on the global friendship network

• Can’t measure, need to probe explicitly • Small-world experiment [Milgram ’67]

– Picked 300 people in Omaha, Nebraska and Wichita, Kansas

– Ask them to get a letter to a stock-broker in Boston by passing it through friends only

• How many steps do you think it took?

Page 24: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 24

The Small-World Experiment• 64 chains completed (letters reached)

– It took 6.2 steps on the average, thus “6 degrees of separation”

• Further observations:– People who owned stock

had shorter paths to the stockbroker than random people: 5.4 vs. 6.7

– People from the Boston area have even closer paths: 4.4

• On average, you are 6 hops away from anyone in the world!

Milgram’s small world experiment

Page 25: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 25

The Small-World Experiment #2: Columbia• In 2003 Dodds, Muhamad and Watts performed similar

experiments using e-mail:– 18 targets of various backgrounds– 24,000 first steps (~1,500 per target)– 65% dropout per step– 384 chains completed (1.5%)

Average chain length = 4.01Problem: People stop participating

Path length, h

n(h)

Page 26: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 26

6-Degrees: Should We Be Surprised?• Assume each human is connected to 100 other people,

then:– Step 1: reach 100 people– Step 2: reach 100*100 = 10,000 people– Step 3: reach 100*100*100 = 1,000,000 people– Step 4: reach 100*100*100*100 = 100M people– In 5 steps we can reach 10 billion people

Page 27: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 27

Small-World: How?• Could a network with high clustering be at the same

time a small world?– How can we at the same time have high clustering and small

diameter?

• Intuition:– Clustering implies edge “locality”– Randomness enables “shortcuts”

High clusteringHigh diameter

Low clusteringLow diameter

Page 28: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 28

The Small-World Model [Watts-Strogatz ‘98]

Two components to the model:• (1) Start with a low-dimensional regular lattice

– In our case, we are using a ring as a lattice– Has high clustering coefficient, but has high diameter

• (2) Rewire: introduce randomness (“shortcuts”)– Add/remove edges to create shortcuts to join remote parts

of the lattice– For each edge with probability p move

the other end to a random node– Reduces the diameter by adding shortcuts

Page 29: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 29

Tuning Randomness in Small-World Model• Rewiring allows us to “interpolate” between a regular lattice

and a random graph

High clusteringHigh diameter

High clusteringLow diameter

Low clusteringLow diameter

43

2== C

kNh

NkCNh ==

loglog

aC = ½

Page 30: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 30

Randomness vs Clustering• Intuition: It takes a lot of randomness to ruin the clustering,

but a very small amount to create shortcuts.

Clustering coefficientC = 1/n ∑ Ci

Parameter region of high clustering and low path length

Probability of rewiring, p

Clustering Coefficient

Average path length

Page 31: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 31

• Alternative formulation of the model:1. Start with a square grid2. Add 1 random long-range edge per node

• Each node has 1 spoke. Then randomly connect them.

• Each node has 8 + 1 = 9 edges• Each node’s neighbors have 12 edges • Clustering

• Diameter: O(log(n)) – Why?

Ci =2 ⋅ei

ki (ki −1)=2 ⋅129 ⋅8

≥ 0.33

Diameter of the Watts-Strogatz Model

Page 32: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 32

Proof: O(log n) Diameter of Small World Model• Convert 2x2 subgraphs into supernodes:

– Each supernode has 4 long-range edges sticking out: a 4-regular random graph!• Ignore the edges between neighboring supernodes

– Recall Gnp: short paths between super nodesÞ Path in the original graph = add at most 2 steps

per long range edge (by traversing within supernodes)

Þ Diameter of the model is O(2 + log n) = O (log n)• Edges between neighboring supernodes: these edges

will reduce the diameter further. 4-regular randomgraph on supernodes

Super node

Page 33: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 33

Small-World: Summary

• Could a network with high clustering be at the same time a small world?– Yes! You don’t need more than a few random links

• The Watts-Strogatz/Small-World Model:– Provides insight on the interplay between clustering and the

small-world – Captures the structure of many realistic networks– Accounts for the high clustering of real networks– Does not lead to the correct degree distribution

Page 34: Random Graph ModelsSrijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 3 Simplest Model of Graphs ¡Erdös-RenyiRandom Graphs [Erdös-Renyi, 1960] • Two variants:

Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining 34

Today’s Lecture: Networks• Networks introduction • Web as a network • Networks properties• Random graph model: Erdos-Renyi Random Graph Model• Random graph model: Small-world Random Graph Model

Some slides are inspired by Prof. Jure Leskovec’s slides