Sampling Techniques for Large, Dynamic Graphs Daniel Stutzbach – University of Oregon Reza Rejaie – University of Oregon Nick Duffield – AT&T Labs—Research Subhabrata Sen – AT&T Labs—Research Walter Willinger – AT&T Labs— Research Global Internet Symposium Barcelona, Spain April 28 th , 2006



Sampling Techniques for Large, Dynamic Graphs

Daniel Stutzbach – University of Oregon
Reza Rejaie – University of Oregon
Nick Duffield – AT&T Labs—Research
Subhabrata Sen – AT&T Labs—Research
Walter Willinger – AT&T Labs—Research

Global Internet Symposium

Barcelona, Spain

April 28th, 2006


Motivation

P2P systems are very popular in practice: several million simultaneous users, collectively accounting for 60% of all Internet traffic [CacheLogic Research 2005].

Measurement studies aid in understanding existing systems and user behavior.

Capturing global state is often infeasible. P2P systems are large and rapidly changing.

Sampling is therefore a natural approach, and has been used in several earlier measurement studies.

But how do we know the samples are representative?


The Problem

We focus on sampling peer properties: peer degree, link bandwidth, number of shared files, and remaining uptime.

Sampling peer properties occurs in two steps: (1) discover and select peers; (2) collect the measurements.

Selecting peers uniformly at random is hard: peer dynamics can introduce bias, and the graph topology can introduce bias.

We examine these two problems separately.


Temporal Causes of Bias

Define $V_t$ as the set of peers present at time $t$.

We gather samples over a measurement window of length Δ.

The most common approach is to gather peers from the set present during the window:

$$v \in \bigcup_{t=t_0}^{t_0+\Delta} V_t$$


Example of Bias towards Short-Lived Peers

[Figure: timeline (horizontal axis: time) contrasting a sequence of short-lived peers with a single long-lived peer]

Consider a simple two-peer system containing one long-lived peer and one rapidly changing short-lived peer.

The common approach over-selects short-lived peers: over a given measurement window, the long-lived peer appears once in the set, while many distinct short-lived peers appear.
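The two-peer example can be made concrete with a short Python sketch. The toy model here is an assumption made for illustration (the names `peers_present` and `union_over_window`, and the rule that the short-lived peer is replaced at every integer time step, are not from the slides). At any instant exactly half of the peers present are long-lived, yet the union over the window is dominated by short-lived peers.

```python
def peers_present(t):
    """Return V_t for a toy system: one long-lived peer plus one
    short-lived peer that is replaced at every integer time step."""
    return {"long", f"short-{t}"}


def union_over_window(t0, delta):
    """Common approach: all peers seen at any time in [t0, t0 + delta]."""
    seen = set()
    for t in range(t0, t0 + delta + 1):
        seen |= peers_present(t)
    return seen


window_peers = union_over_window(t0=0, delta=99)
long_share = sum(p == "long" for p in window_peers) / len(window_peers)
print(f"{len(window_peers)} distinct peers, long-lived share = {long_share:.3f}")
```

Selecting uniformly from this union would pick the long-lived peer about 1% of the time, even though it makes up 50% of the system at every moment.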


Handling Temporal Causes of Bias

The common approach is intuitive but incorrect.

Sampling peers is the wrong goal. We want to sample peer properties. Therefore, $v_{i,t}$ and $v_{i,t'}$ are distinct, even though they come from the same peer. Allow sampling the same peer more than once, at different points in time.

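Common approach:

$$v_i \in \bigcup_{t=t_0}^{t_0+\Delta} V_t$$

Corrected approach:

$$v_{i,t} : t \in [t_0, t_0+\Delta],\ v_i \in V_t$$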


Example of avoiding bias towards Short-Lived Peers

[Figure: timeline (horizontal axis: time) contrasting a sequence of short-lived peers with a single long-lived peer]

Allowing re-selecting a peer solves the problem. The long-lived peer will be selected half the time, reflecting the actual state of the system.

The remaining problem: how do we select a peer uniformly at random at a particular moment?
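A minimal Python sketch of sampling (peer, time) pairs in the toy two-peer system follows. It assumes an oracle that can pick uniformly among the peers present at a chosen instant, which is exactly the open problem stated above, so it isolates the temporal part only. The names `peers_present` and `sample_peer_at_time` and the one-step lifetime of short-lived peers are illustrative assumptions.

```python
import random


def peers_present(t):
    """Return V_t for a toy system: one long-lived peer plus one
    short-lived peer that is replaced at every integer time step."""
    return ["long", f"short-{t}"]


def sample_peer_at_time(rng, t0, delta):
    """Pick a time in [t0, t0 + delta], then pick uniformly from V_t.
    The same peer may be returned again at a different time."""
    t = rng.randint(t0, t0 + delta)      # randint includes both endpoints
    peer = rng.choice(peers_present(t))  # oracle: uniform choice within V_t
    return peer, t


rng = random.Random(2006)                # fixed seed for repeatability
samples = [sample_peer_at_time(rng, 0, 99) for _ in range(10_000)]
long_share = sum(peer == "long" for peer, _ in samples) / len(samples)
print(f"long-lived share of samples = {long_share:.3f}")
```

Because each sample is tied to an instant, the long-lived peer accounts for roughly half of the samples, matching its share of the system at every moment.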


Topological Causes of Bias

Goal: select a peer uniformly at random at time t.

Begin with one peer and query peers to discover their neighbors.

Prior work uses classic graph-discovery algorithms: Breadth-First Search (BFS) and Depth-First Search (DFS).

Problems with these techniques: peers are correlated by their neighbor relationship; peers with higher degree are more likely to be discovered; a peer can only be selected once.

Random walks are a promising alternative.

[Animation placeholder: the discovery process using breadth-first search]


Random Walks

Basic idea of the random walk: select a neighbor at random, explore that neighbor, and “forget” the previous peer.

Only two pieces of state are maintained: the current peer and the length of the walk.

A subset of the visited peers is selected for sampling. The basic random walk selects a peer every r steps.

Graph theory suggests r ≥ log(|V|). Walking r steps between samples eliminates correlations.

Peers are selected with probability proportional to degree. Peers can be selected more than once.
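The basic random walk can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' simulator: the toy graph, the function name `random_walk_samples`, and the use of the natural logarithm for r are all hypothetical choices. The sketch shows the degree bias: the hub holds 4 of the 10 edge endpoints, so it should receive about 40% of the samples rather than the 20% a uniform sampler would give.

```python
import math
import random
from collections import Counter


def random_walk_samples(graph, start, r, num_samples, rng):
    """Walk the graph keeping only the current peer and the step count,
    and select the current peer after every r steps."""
    current, steps, samples = start, 0, []
    while len(samples) < num_samples:
        current = rng.choice(graph[current])  # move on, forgetting the previous peer
        steps += 1
        if steps % r == 0:
            samples.append(current)
    return samples


# Hypothetical toy overlay as adjacency lists (not from the slides).
graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}

r = math.ceil(math.log(len(graph)))  # assumption: natural log in r >= log(|V|)
rng = random.Random(2006)
samples = random_walk_samples(graph, "a", r, 20_000, rng)
freq = {peer: n / len(samples) for peer, n in Counter(samples).items()}
for peer in graph:
    print(f"{peer}: degree {len(graph[peer])}, selected {freq[peer]:.3f}")
```

The walk revisits peers freely, so a peer can be selected more than once, which is the property the temporal analysis requires.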


Variations on the Random Walk

Fixing the degree bias (“Degree Correction”): select a candidate peer with probability $1/\text{degree}$.
Pro: should result in uniform selection of peers. Con: decreases efficiency.

Improving efficiency (“Random Stroll”): after the first r steps, select every peer instead of every r-th peer.
Pro: increases efficiency. Con: introduces slight correlations.
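Both variations can be expressed as options on one walk, sketched below in Python. The function name `walk_samples`, its flags, and the toy graph are hypothetical, and reading the acceptance probability as 1/degree is an interpretation of the slide. A walk visits a peer in proportion to its degree, so accepting a candidate with probability 1/degree cancels that bias. The stroll treats every visited peer after the first r steps as a candidate.

```python
import random
from collections import Counter


def walk_samples(graph, start, r, num_samples, rng,
                 degree_correction=True, stroll=False):
    """Random walk with optional degree correction and random stroll."""
    current, steps, samples = start, 0, []
    while len(samples) < num_samples:
        current = rng.choice(graph[current])
        steps += 1
        if stroll:
            is_candidate = steps >= r      # stroll: every peer after r steps
        else:
            is_candidate = steps % r == 0  # walk: every r-th peer
        if not is_candidate:
            continue
        # Degree correction: keep the candidate with probability 1/degree.
        if degree_correction and rng.random() >= 1.0 / len(graph[current]):
            continue
        samples.append(current)
    return samples


# Hypothetical toy overlay as adjacency lists (not from the slides).
graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
    "d": ["hub"],
}

rng = random.Random(2006)
rsdc = walk_samples(graph, "a", r=2, num_samples=20_000, rng=rng, stroll=True)
freq = {peer: n / len(rsdc) for peer, n in Counter(rsdc).items()}
for peer in graph:
    print(f"{peer}: selected {freq[peer]:.3f}")
```

On this five-peer graph each peer should be selected about 20% of the time. Rejected candidates still cost a query, which is why degree correction lowers efficiency.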


Evaluations

We simulated different techniques over two types of graphs: a snapshot of the Gnutella ultrapeer topology [Stutzbach 05 IMC], and random graphs with the same number of vertices and edges as the Gnutella topology.

Metrics:
Bias: Is peer A more likely to be selected than peer B?
Correlation: If we select peer A, are we more likely to select peer B?
Efficiency: How easily can we collect a sample?

Techniques: Oracle (uniformly random), Breadth-First Search (BFS), Random Walk (RW), Random Walk with Degree Correction (RWDC), Random Stroll (RS), and Random Stroll with Degree Correction (RSDC).


Bias

Collect k|V| samples and compare with Oracle. Most peers should be selected around k times. RSDC appears unbiased in both cases. RWDC performs well, but exhibits slight bias on Gnutella. BFS, RS, and RW are heavily biased.

Figures 1(a) and 1(b) go here


Correlation

Even if unbiased, a technique may exhibit correlations. We define a sampling session as 1,000 consecutive samples. For pair (A, B), if A is selected, how often is B also selected? A long tail indicates correlation. RWDC and RSDC appear uncorrelated. RW and RS exhibit slight correlations. BFS exhibits strong correlation.

Figures 2(a) and 2(b) go here


Efficiency

The basic operation is the neighbors-query.

Efficiency is:

$$\text{efficiency} = \frac{\text{number of samples produced}}{\text{number of peers queried}}$$

BFS and RS are close to 100% efficient. Unfortunately, they are also heavily biased.

RW, RWDC, and RSDC are 2% to 8% efficient. RSDC is twice as efficient as RWDC (4% vs. 2%). However, even the inefficient techniques need only O(log |V|) queries per sample.


Summary of Results and Lessons Learned

Addressing temporal causes of bias: avoid gathering a set of peers and collecting measurements in separate passes. Select a peer, then collect the measurement. Repeat, and allow re-selecting the same peer.

Addressing topological causes of bias: be careful to avoid bias towards high-degree peers. Consider using a random walk or random stroll with degree correction.


Ongoing Work

This work is preliminary.

Additional types of random walks: weighting the selection of the next hop.

Additional types of graphs: power-law and small-world.

We have examined temporal and topological causes of bias separately. To examine them concurrently, we are creating a dynamic overlay simulator.