Estimating Clustering Coefficients and Size of …lirank/pubs/2013-...Global CC Algorithm 1. Ψ𝑔...

Estimating Clustering Coefficients and Size of Social Networks via Random Walk Stephen J. Hardiman* Capital Fund Management France Liran Katzir Advanced Technology Labs Microsoft Research, Israel *Research was conducted while the author was unaffiliated


Page 1:

Estimating Clustering Coefficients and Size of Social Networks via Random Walk

Stephen J. Hardiman*
Capital Fund Management, France

Liran Katzir
Advanced Technology Labs, Microsoft Research, Israel

*Research was conducted while the author was unaffiliated.

Page 2:

Motivation: Social Networks

Facebook, Twitter, Qzone, Google+, Sina Weibo, Habbo, Renren, LinkedIn, Vkontakte, Bebo, Tagged, Orkut, Netlog, Friendster, hi5, Flixster, MyLife, Classmates.com, Sonico.com, Plaxo

Page 3:

Motivation: External access

[Figure: example graph on nodes v1 to v9. Figure labels: Social Analytics; The online social network; Disk Space; Communication; Privacy.]

Page 4:

Task: Estimate parameters

Business development, advertisement, market size.

Predicting social products’ potential.

Parameters:

• Global Clustering Coefficient
• Network Average CC
• Number of Registered Users

Page 5:

Global Clustering Coefficient

$\text{Global CC} = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$

[Figure: example graph on nodes v1 to v9, highlighting a triangle and a connected triplet]
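As a minimal sketch of the definition above, the following Python computes the global CC of a small graph exactly. The graph `EDGES` and the helper names `build_adjacency` and `global_cc` are hypothetical, chosen for illustration; this is not the example graph from the slides.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 5-node graph (not the slide's example): one triangle 2-3-4.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_cc(adj):
    """Global CC = 3 * (number of triangles) / (number of connected triplets)."""
    # Each triangle is found once per corner, so this sum equals 3 * #triangles.
    closed = sum(
        1
        for nbrs in adj.values()
        for a, b in combinations(sorted(nbrs), 2)
        if b in adj[a]
    )
    # A connected triplet is a center node plus an unordered pair of its neighbors.
    triplets = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    return Fraction(closed, triplets)


print(global_cc(build_adjacency(EDGES)))  # 3/7
```

The exact computation touches every node and edge, which is what the external-access setting of the talk tries to avoid.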

Page 6:

Global Clustering Coefficient

Exact: [Alon et al, 1997]

Estimation – input is read at least once:

• Random Access: [Avron, 2010]

• Streaming Model: [Buriol et al, 2006]

Estimation – sampling:

• Random Access: [Schank et al, 2005]

• External Access: This work.

Page 7:

Local Clustering Coefficient

$C_i = \frac{\#\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$

$d_i$ – degree of node $v_i$

[Figure: example graph on nodes v1 to v9]

Example: $d_1 = 1$, $d_9 = 2$, $d_2 = 3$, $C_2 = 1/3$

Network Average CC = average local CC
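A small sketch of the local CC and its network average, computed exactly on a hypothetical graph (`EDGES` is not the slide's graph). Treating nodes of degree below 2 as having local CC 0 is an assumption of this sketch; the slides do not state a convention.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 5-node graph (not the slide's example): one triangle 2-3-4.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_cc(adj, v):
    """C_v = (#connections between v's neighbors) / (d_v (d_v - 1) / 2)."""
    d = len(adj[v])
    if d < 2:
        return Fraction(0)  # assumed convention for degree 0 or 1
    links = sum(1 for a, b in combinations(sorted(adj[v]), 2) if b in adj[a])
    return Fraction(links, d * (d - 1) // 2)


def network_average_cc(adj):
    """Average of the local CC over all nodes."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)


adj = build_adjacency(EDGES)
print(local_cc(adj, 2))         # 1/3
print(network_average_cc(adj))  # 1/3
```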

Page 8:

Network Average CC

Exact: Naïve.

Estimation – input is read at least once:

• Streaming Model: [Becchetti et al, 2010]

Estimation – sampling:

• Random Access: [Schank et al, 2005]

• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work – Improved accuracy.

Page 9:

Number of Registered Users

Exact: trivial

Estimation – sampling:

• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work – Improved accuracy.

Page 10:

Random Walk

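[Figure: random walk on the example graph with nodes v1 to v9. Node labels give the stationary probabilities 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22.]

Sampled Nodes: v1 v2 v3 v4 v5

Stationary distribution of node $v_i$: $\frac{d_i}{\sum_{j} d_j}$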
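To illustrate the stationary distribution, here is a sketch that runs a simple random walk on a hypothetical graph and compares visit frequencies with $d_i / \sum_j d_j$. The graph, the seed, and the function names are assumptions made for this example.

```python
import random
from collections import Counter

# Hypothetical connected graph (not the slide's example).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> list of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Return `steps` nodes; each move goes to a uniformly random neighbor."""
    walk = [start]
    while len(walk) < steps:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def stationary_distribution(adj):
    """Probability of node v is degree(v) divided by the sum of all degrees."""
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total_degree for v, nbrs in adj.items()}


adj = build_adjacency(EDGES)
walk = random_walk(adj, start=1, steps=100_000, rng=random.Random(0))
visits = Counter(walk)
for v, p in sorted(stationary_distribution(adj).items()):
    print(v, "empirical:", round(visits[v] / len(walk), 3), "stationary:", p)
```

Each step only needs the neighbor list of the current node, which matches the external-access model.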

Page 11:

Random Walk - Summary

[Figure: example graph on nodes v1 to v9. Legend: Visible Nodes, Invisible Nodes, Sampled Nodes, Visible Edges, Invisible Edges.]

Page 12:

Global CC Algorithm

1. $\Psi_g$ – the average of (degree − 1) over the sampled nodes.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, and 0 otherwise.

2. $\Phi_g$ – the average of $\phi_k d_k$ over the sampled nodes.

The estimated global clustering coefficient:

$\hat{c}_g = \frac{\Phi_g}{\Psi_g}$

$\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle.
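The two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: `HYPOTHETICAL_EDGES` is a made-up graph whose degrees and $\phi$ values along the walk v1, v2, v3, v4, v5 agree with the numbers on the Global CC Example slide (the slide's actual graph cannot be recovered from the transcript), and the $\phi_k d_k$ average is taken over the interior positions of the walk, as in that example.

```python
import random

# Hypothetical graph, consistent with the degrees 1, 3, 2, 4, 2 of v1..v5.
HYPOTHETICAL_EDGES = [
    ("v1", "v2"), ("v2", "v3"), ("v2", "v4"), ("v3", "v4"),
    ("v4", "v5"), ("v4", "v6"), ("v5", "v6"),
]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> list of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Return `steps` nodes; each move goes to a uniformly random neighbor."""
    walk = [start]
    while len(walk) < steps:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def estimate_global_cc(adj, walk):
    """Estimate the global CC as Phi_g / Psi_g from a walk of at least 3 nodes."""
    # Psi_g: average of (degree - 1) over all sampled nodes.
    psi_g = sum(len(adj[v]) - 1 for v in walk) / len(walk)
    # Phi_g: average of phi_k * d_k over positions with both walk neighbors.
    weighted = 0
    for k in range(1, len(walk) - 1):
        phi_k = 1 if walk[k + 1] in adj[walk[k - 1]] else 0
        weighted += phi_k * len(adj[walk[k]])
    phi_g = weighted / (len(walk) - 2)
    return phi_g / psi_g


adj = build_adjacency(HYPOTHETICAL_EDGES)
print(estimate_global_cc(adj, ["v1", "v2", "v3", "v4", "v5"]))  # (2/3) / (7/5)
long_walk = random_walk(adj, "v1", 50_000, random.Random(0))
print(estimate_global_cc(adj, long_walk))
```

The membership test `walk[k + 1] in adj[walk[k - 1]]` uses only the neighbor list of an already visited node, so no extra queries to the network are needed.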

Page 13:

Global CC Example

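[Figure: example graph on nodes v1 to v7]

$\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$

$\Phi_g = \frac{1}{3}(0 + 2 + 0) = \frac{2}{3}$

$\Psi_g = \frac{1}{5}(0 + 2 + 1 + 3 + 1) = \frac{7}{5}$

$\hat{c}_g = \frac{2}{3} \cdot \frac{5}{7} = \frac{10}{21} \approx 0.476$

$c_g = \frac{9}{23} \approx 0.39$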

Page 14:

Expectation of 𝝓𝒌

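$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D} E[\phi_k d_k \mid x_k = v_i]$ (total expectation)

$= \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i d_i} \cdot d_i$ (of the $d_i d_i$ combinations of previous and next node, $2 l_i$ yield $\phi_k = 1$)

$= \sum_{i=1}^{n} \frac{2 l_i}{D}$

$l_i$ – the number of triangles that contain $v_i$.

$d_i$ – the degree of node $v_i$.

$n$ – the number of nodes.

$D = \sum_{i=1}^{n} d_i$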
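The identity $E[\phi_k d_k] = \frac{2}{D}\sum_i l_i$ can be checked numerically by enumerating all (previous, current, next) combinations. The sketch below does so on a hypothetical graph; it assumes the walk is at its stationary distribution, so that, given the current node, the previous and next nodes are independent uniform neighbors. All names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 5-node graph (not the slide's example): one triangle 2-3-4.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def expected_phi_times_degree(adj):
    """E[phi_k d_k] by total expectation over the current node."""
    D = sum(len(nbrs) for nbrs in adj.values())
    total = Fraction(0)
    for nbrs in adj.values():
        d = len(nbrs)
        # Ordered (previous, next) neighbor pairs joined by an edge: 2 * l_i.
        closing = sum(1 for a in nbrs for b in nbrs if b in adj[a])
        total += Fraction(d, D) * Fraction(closing, d * d) * d
    return total


def count_triangles(adj):
    """Count triangles directly from all 3-node subsets."""
    return sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )


adj = build_adjacency(EDGES)
D = sum(len(nbrs) for nbrs in adj.values())
sum_l = 3 * count_triangles(adj)  # every triangle contains three nodes
print(expected_phi_times_degree(adj))  # 3/5
print(Fraction(2 * sum_l, D))          # 3/5
```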

Page 15:

Global CC Proof

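$D = \sum_{i=1}^{n} d_i$

$l_i$ – the number of triangles that contain $v_i$.

$d_i$ – the degree of node $v_i$.

$n$ – the number of nodes.

$E[\Phi_g] = E[\phi_k d_k] = \frac{2}{D} \sum_{i=1}^{n} l_i$

$E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$

$\hat{c}_g = \frac{\Phi_g}{\Psi_g} \cong \frac{E[\Phi_g]}{E[\Psi_g]} = \frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$ (the approximation follows from concentration bounds on $\Phi_g$ and $\Psi_g$)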

Page 16:

Guarantees

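For any $\epsilon \le \frac{1}{8}$ and $\delta \le 1$, we have

$\mathrm{Prob}\left[(1 - \epsilon)\, c_g \le \hat{c}_g \le (1 + \epsilon)\, c_g\right] \ge 1 - \delta$

when the number of samples, $r$, satisfies

$r \ge r_g = O(\text{mixing time}(\epsilon))$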

Page 17:

Network Average CC Algorithm

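1. $\Psi_l$ – the average of 1/degree over the sampled nodes.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, and 0 otherwise.

2. $\Phi_l$ – the average of $\phi_k \frac{1}{d_k - 1}$ over the sampled nodes.

The estimated network average CC:

$\hat{c}_l = \frac{\Phi_l}{\Psi_l}$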
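A sketch of this estimator in Python, under assumptions of this example: the graph `EDGES` is hypothetical (its exact network average CC is 1/3), terms with $d_k = 1$ are counted as 0 (there $\phi_k$ is necessarily 0 because the walk must backtrack), and the $\Phi_l$ average runs over the interior positions of the walk.

```python
import random

# Hypothetical 5-node graph (not from the slides); exact network average CC is 1/3.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> list of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Return `steps` nodes; each move goes to a uniformly random neighbor."""
    walk = [start]
    while len(walk) < steps:
        walk.append(rng.choice(adj[walk[-1]]))
    return walk


def estimate_network_average_cc(adj, walk):
    """Estimate the network average CC as Phi_l / Psi_l."""
    # Psi_l: average of 1/degree over all sampled nodes.
    psi_l = sum(1 / len(adj[v]) for v in walk) / len(walk)
    # Phi_l: average of phi_k / (d_k - 1) over interior positions.
    weighted = 0.0
    for k in range(1, len(walk) - 1):
        d_k = len(adj[walk[k]])
        if d_k > 1 and walk[k + 1] in adj[walk[k - 1]]:
            weighted += 1 / (d_k - 1)
    phi_l = weighted / (len(walk) - 2)
    return phi_l / psi_l


adj = build_adjacency(EDGES)
walk = random_walk(adj, start=1, steps=200_000, rng=random.Random(0))
print(estimate_network_average_cc(adj, walk))
```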

Page 18:

Evaluations

Network        n (size)    D/n      cl       cg
DBLP             977,987    8.457   0.7231   0.1868
Orkut          3,072,448   76.28    0.1704   0.0413
Flickr         2,173,370   20.92    0.3616   0.1076
Live Journal   4,843,953   17.69    0.3508   0.1179

DBLP facts: the paper with the most co-authors has 119 listed authors; the most prolific author is Vincent Poor, with 798 entries.

Page 19:

Global CC

Relative improvement ranges between 300% and 500% depending on the network.

[Figure: DBLP Network. Relative estimation value (y-axis, 0 to 3.5) versus percentage of mined nodes (x-axis, 0 to 2). Series: Gjoka et al*, Ribeiro et al*, This work.]

Page 20:

Network Average CC

Relative improvement ranges between 50% and 400% depending on the network.

[Figure: Orkut Network. Relative estimation value (y-axis, 0 to 2.5) versus percentage of mined nodes (x-axis, 0 to 2). Series: Ribeiro et al, Gjoka et al, Random walk.]

Page 21:

Conclusions

1. New external access estimator for the Global Clustering Coefficient.

2. Improved estimator for Network Average Clustering Coefficient.

3. Improved estimator for number of registered users.

Page 22:

Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir

Yahoo! Labs, Haifa, Israel

Edo Liberty

Yahoo! Labs, Haifa, Israel

Oren Somekh

Yahoo! Labs, Haifa, Israel

Page 23:

The Birthday “Paradox”

The expected number of collisions in a list of $r$ i.i.d. uniform samples from a set of $n$ elements is $\frac{r(r-1)}{2n}$.

A collision is a pair of identical samples.

Example: Samples: X = (d, b, b, a, b, e). Total of 3 collisions: (x2, x3), (x2, x5), and (x3, x5).
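A short sketch of the collision count and the expected-value formula, using the sample list from the example above. The function names are illustrative.

```python
from itertools import combinations


def count_collisions(samples):
    """Count unordered pairs of positions that hold identical samples."""
    return sum(1 for a, b in combinations(samples, 2) if a == b)


def expected_collisions(r, n):
    """Expected collisions among r i.i.d. uniform samples from n elements."""
    return r * (r - 1) / (2 * n)


print(count_collisions(["d", "b", "b", "a", "b", "e"]))  # 3
print(expected_collisions(r=23, n=365))
```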

Page 24:

Cardinality estimation uniform

Needs $r = O(\sqrt{n})$ samples to converge. Used by [Ye et al, 2010] to estimate the size.

When $C$ collisions are observed:

$n \cong \frac{r(r-1)}{2C}$
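Inverting the birthday formula gives a size estimate from uniform samples. The sketch below is an illustration with hypothetical names; the population size of 1000 and the seed are arbitrary choices for the demonstration.

```python
import random
from collections import Counter


def count_collisions(samples):
    """A value seen c times contributes c * (c - 1) / 2 colliding pairs."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def estimate_size_uniform(samples):
    """Estimate the population size as r (r - 1) / (2 C)."""
    r = len(samples)
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    return r * (r - 1) / (2 * collisions)


rng = random.Random(0)
samples = [rng.randrange(1000) for _ in range(2000)]  # true size is 1000
print(estimate_size_uniform(samples))
```

With no collisions the estimate is undefined, which is why the number of samples has to grow with the population size.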

Page 25:

Stationary distribution sampling

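[Figure: the example graph with nodes v1 to v9. Node labels give the stationary probabilities 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22.]

Sampled Nodes: v5 v2 v5 v4 v2

Stationary distribution of node $v_i$: $\frac{d_i}{\sum_{j} d_j}$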

Page 26:

Cardinality estimation stationary

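Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \mathrm{zipf}(n, 2)$.

When $C$ collisions are observed:

$n \cong \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C}$, where the sums run over the sampled nodes $x$.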

Page 27:

Example:

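[Figure: example graph on nodes v1 to v9]

Sampled Nodes: v5 v2 v5 v4 v2

$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14$

$\sum \frac{1}{d_x} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$

$n \cong \frac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$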
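The arithmetic of this example can be reproduced with a few lines of Python. The sample sequence and the degrees (2, 3, 4 for v5, v2, v4) are taken from the example; the function name and the use of exact fractions are choices of this sketch.

```python
from collections import Counter
from fractions import Fraction


def estimate_size_stationary(samples, degree):
    """Estimate n as (sum of d_x) * (sum of 1/d_x) / (2 C)."""
    # A node sampled c times contributes c * (c - 1) / 2 collisions.
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    sum_d = sum(degree[x] for x in samples)
    sum_inv_d = sum(Fraction(1, degree[x]) for x in samples)
    return sum_d * sum_inv_d / (2 * collisions)


samples = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v5": 2, "v2": 3, "v4": 4}
estimate = estimate_size_stationary(samples, degree)
print(estimate)  # 161/24
print(round(float(estimate), 1))  # 6.7
```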

Page 28:

Size Estimation Proof

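$D = \sum_{i=1}^{n} d_i$

$d_i$ – the degree of node $v_i$.

$n$ – the number of nodes.

$E[d_x] = \sum_{i=1}^{n} \frac{d_i}{D} d_i$

$E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{1}{d_i} = \frac{n}{D}$

$E[C] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}$ (for a single pair of samples)

With $r$ samples there are $\binom{r}{2}$ pairs, so by concentration bounds on the numerator and on $C$:

$\hat{n} = \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C} \cong \frac{r E[d_x] \cdot r E\left[\frac{1}{d_x}\right]}{2 \binom{r}{2} E[C]} \cong \frac{\left(\sum_{i=1}^{n} \frac{d_i}{D} d_i\right) \frac{n}{D}}{\sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}} = n$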

Page 29:

Improvements

1. Using all samples (Hardiman et al 2009).

2. Using Conditional Monte Carlo (This work).

Page 30:

All Samples

Restrict computation to indexes at least $m$ steps apart: $I = \{(k, l) \mid |k - l| \ge m\}$

A collision is only considered within $I$: $\Phi = \left|\{(k, l) \in I \mid x_k = x_l\}\right|$

The ratio of degrees is similarly defined: $\Psi = \sum_{(k, l) \in I} \frac{d_{x_k}}{d_{x_l}}$
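A sketch of the two sums over the index set $I$, counting ordered pairs $(k, l)$. The slide does not spell out the final estimator; taking $\Psi / \Phi$ is an assumption of this example, by analogy with $\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right) / (2C)$, where every collision is counted once per ordering. The sample sequence and degrees reuse the earlier stationary-sampling example.

```python
from fractions import Fraction


def all_samples_sums(samples, degree, m):
    """Return (Phi, Psi) over ordered index pairs (k, l) with |k - l| >= m."""
    phi = 0
    psi = Fraction(0)
    r = len(samples)
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                if samples[k] == samples[l]:
                    phi += 1  # collision between positions k and l
                psi += Fraction(degree[samples[k]], degree[samples[l]])
    return phi, psi


samples = ["v5", "v2", "v5", "v4", "v2"]
degree = {"v5": 2, "v2": 3, "v4": 4}
phi, psi = all_samples_sums(samples, degree, m=1)
print(phi, psi)   # 4 131/6
print(psi / phi)  # assumed estimator Psi / Phi: 131/24
```

Requiring a gap of $m$ steps excludes nearby positions of the walk, which are strongly correlated; the double loop is quadratic in the number of samples and is meant only as an illustration.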

Page 31:

Conditional Monte Carlo

A collision between $x_k$ and $x_l$ is replaced by the conditional collision in steps $k+1$ and $l+1$, respectively.

$E\left[1_{x_{k+1} = x_{l+1}} \mid x_k, x_l\right] = \frac{\#\text{common neighbors of } x_k \text{ and } x_l}{d_{x_k} d_{x_l}}$

Page 32:

Conditional Monte Carlo

• The pair $v_4, v_7$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.

[Figure: example graph on nodes v1 to v9]
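The conditional collision probability depends only on the neighbor lists of the two sampled nodes. The sketch below uses a hypothetical graph in which a node of degree 4 and a node of degree 3 share exactly one neighbor, which gives the same value of 1/12 as on the slide; the slide's actual graph is not recoverable from the transcript, and the names are illustrative.

```python
from fractions import Fraction

# Hypothetical graph: "a" has degree 4, "b" has degree 3, "c" is their only common neighbor.
EDGES = [("a", "c"), ("a", "p"), ("a", "q"), ("a", "r"),
         ("b", "c"), ("b", "s"), ("b", "t")]


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conditional_collision(adj, u, v):
    """Probability that independent uniform steps from u and v reach the same node."""
    common = len(adj[u] & adj[v])
    return Fraction(common, len(adj[u]) * len(adj[v]))


adj = build_adjacency(EDGES)
print(conditional_collision(adj, "a", "b"))  # 1/12
```

Replacing the 0/1 collision indicator by this conditional expectation lets pairs that are not collisions still contribute a fractional amount to the counter.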

Page 33:

Size Estimation

[Figure: DBLP Network. Relative estimation value (y-axis, 0 to 2.5) versus percentage of mined nodes (x-axis, 0.5 to 2.5). Series: Prior art, This work.]

Page 34:

Thanks