Sampling Techniques for Large, Dynamic Graphs
Daniel Stutzbach – University of Oregon; Reza Rejaie – University of Oregon; Nick Duffield – AT&T Labs-Research; Subhabrata Sen – AT&T Labs-Research; Walter Willinger – AT&T Labs-Research
Global Internet Symposium
Barcelona, Spain
April 28th, 2006
Motivation
P2P systems are very popular in practice: several million simultaneous users, collectively accounting for 60% of all Internet traffic [CacheLogic Research 2005].
Measurement studies aid in understanding existing systems and user behavior.
Capturing global state is often infeasible. P2P systems are large and rapidly changing.
Sampling is therefore a natural approach, and has been used in several earlier measurement studies.
But how do we know the samples are representative?
The Problem
We focus on sampling peer properties: peer degree, link bandwidth, number of shared files, remaining uptime.
Sampling peer properties occurs in two steps: (1) discover and select peers; (2) collect the measurements.
Selecting peers uniformly at random is hard. Peer dynamics can introduce bias. The graph topology can introduce bias.
We examine these two problems separately.
Temporal Causes of Bias
Define $V_t$ as the set of peers present at time $t$.
We gather samples over a measurement window of length Δ.
The most common approach is to gather peers from the set present during the window:
$$v \in \bigcup_{t=t_0}^{t_0+\Delta} V_t$$
Example of Bias towards Short-Lived Peers
[Figure: timeline (axis: Time) showing a succession of short-lived peers alongside one long-lived peer]
Consider a simple two-peer system containing one long-lived peer and one rapidly changing short-lived peer.
The common approach over-selects short-lived peers: over a measurement window, many distinct short-lived peers appear but only one long-lived peer, so the set gathered during the window is dominated by short-lived peers.
Handling Temporal Causes of Bias
The common approach is intuitive but incorrect.
Sampling peers is the wrong goal. We want to sample peer properties.
Therefore, $v_{i,t}$ and $v_{i,t'}$ are distinct, even though they come from the same peer.
Allow sampling the same peer more than once, at different points in time.
Instead of $v_i \in \bigcup_{t=t_0}^{t_0+\Delta} V_t$, sample $v_{i,t} \in V_t,\ t \in [t_0, t_0+\Delta]$.
Example of Avoiding Bias towards Short-Lived Peers
[Figure: timeline (axis: Time) showing a succession of short-lived peers alongside one long-lived peer]
Allowing a peer to be re-selected solves the problem. The long-lived peer will be selected half the time, reflecting the actual state of the system.
The remaining problem: how do we select a peer uniformly at random at a particular moment?
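The two-peer example can be checked with a small simulation. This is a minimal sketch under stated assumptions, not the authors' simulator: time is discretized into steps, a different short-lived peer occupies the second slot at every step, and the names `peers_at`, `common_approach`, and `per_instant_approach` are hypothetical.

```python
import random

def peers_at(t):
    # Hypothetical two-peer system: one long-lived peer plus a slot
    # occupied by a different short-lived peer at every time step.
    return ["long", f"short-{t}"]

def common_approach(window, n, rng):
    # Pool every peer seen during the window, then pick from the pool.
    pool = sorted({p for t in range(window) for p in peers_at(t)})
    return [rng.choice(pool) for _ in range(n)]

def per_instant_approach(window, n, rng):
    # Pick a time uniformly, then pick a peer present at that time.
    # The same peer may be selected again at a different time.
    return [rng.choice(peers_at(rng.randrange(window))) for _ in range(n)]

def long_fraction(samples):
    # Fraction of samples that landed on the long-lived peer.
    return sum(s == "long" for s in samples) / len(samples)

rng = random.Random(0)
biased = long_fraction(common_approach(10, 10_000, rng))
fair = long_fraction(per_instant_approach(10, 10_000, rng))
```

With a window of 10 steps, the pooled set holds 11 peers, so the long-lived peer should be chosen about 1/11 of the time under the common approach and about half the time when sampling per instant.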
Topological Causes of Bias
Goal: select a peer uniformly at random at time t.
Begin with one peer and query peers to discover their neighbors.
Prior work uses classic graph-discovery algorithms: Breadth-First Search (BFS) and Depth-First Search (DFS).
Problems with these techniques: peers are correlated by their neighbor relationship; peers with higher degree are more likely to be discovered; a peer can only be selected once.
Random walks are a promising alternative.
Random Walks
Basic idea of the random walk: select a neighbor at random to explore, then explore that neighbor and "forget" the previous peer.
Only two pieces of state are maintained: the current peer and the length of the walk.
A subset of the visited peers is selected for sampling. The basic random walk selects a peer every r steps.
Graph theory suggests r ≥ log(|V|). Walking r steps between samples eliminates correlations.
Peers are selected with probability proportional to degree.
Peers can be selected more than once.
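The basic random walk described above can be sketched as follows. This is an illustrative sketch only: the toy `graph` and the function name `random_walk_samples` are assumptions, and a real crawler would issue a neighbors-query instead of reading an in-memory adjacency list.

```python
import random
from collections import Counter

# Hypothetical toy overlay as adjacency lists (not from the slides).
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
}

def random_walk_samples(graph, start, r, n, rng):
    """Select n peers, one every r steps of a simple random walk."""
    # Only the current peer and the walk length are kept as state.
    current, steps, samples = start, 0, []
    while len(samples) < n:
        current = rng.choice(graph[current])  # move on, forget the previous peer
        steps += 1
        if steps % r == 0:
            samples.append(current)
    return samples

rng = random.Random(1)
samples = random_walk_samples(graph, "hub", r=3, n=20_000, rng=rng)
freq = {p: c / len(samples) for p, c in Counter(samples).items()}
```

On this toy graph the degrees are 3, 2, 2, and 1 (sum 8), so the selection frequencies should approach 3/8, 2/8, 2/8, and 1/8, which illustrates the bias towards high-degree peers.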
Variations on the Random Walk
Fixing the degree bias ("Degree Correction"): select a candidate peer with probability $\frac{1}{\text{degree}}$.
Pro: should result in uniform selection of peers. Con: decreases efficiency.
Improving efficiency ("Random Stroll"): after the first r steps, select every peer instead of every r-th peer.
Pro: increases efficiency. Con: introduces slight correlations.
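Both variations can be sketched in one function. This is one plausible reading of the slide rather than the authors' implementation: a visited peer is a candidate every r steps (walk) or at every step after the first r (stroll), and a candidate is accepted with probability 1/degree. The toy `graph` and the name `corrected_samples` are assumptions.

```python
import random
from collections import Counter

# Hypothetical toy overlay as adjacency lists (not from the slides).
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
}

def corrected_samples(graph, start, r, n, rng, stroll=False):
    """Random walk (or stroll) with degree correction."""
    current, steps, samples = start, 0, []
    while len(samples) < n:
        current = rng.choice(graph[current])
        steps += 1
        # Stroll: every peer after the first r steps is a candidate.
        # Walk: only every r-th peer is a candidate.
        candidate = steps >= r if stroll else steps % r == 0
        # Degree correction: accept a candidate with probability 1/degree.
        if candidate and rng.random() < 1 / len(graph[current]):
            samples.append(current)
    return samples

rng = random.Random(2)
n = 20_000
rwdc = Counter(corrected_samples(graph, "hub", 3, n, rng))
rsdc = Counter(corrected_samples(graph, "hub", 3, n, rng, stroll=True))
```

Because the walk visits a peer in proportion to its degree and accepts it in inverse proportion, each of the four peers here should end up with roughly a quarter of the samples under both variants.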
Evaluations
We simulated different techniques over two types of graphs: a snapshot of the Gnutella ultrapeer topology [Stutzbach 05 IMC], and random graphs with the same number of vertices and edges as the Gnutella topology.
Metrics:
Bias: Is peer A more likely to be selected than peer B?
Correlation: If we select peer A, are we more likely to select peer B?
Efficiency: How easily can we collect a sample?
Techniques: Oracle (uniformly random), Breadth-First Search (BFS), Random Walk (RW), Random Walk with Degree Correction (RWDC), Random Stroll (RS), Random Stroll with Degree Correction (RSDC).
Bias
Collect k|V| samples and compare with Oracle. Most peers should be selected around k times. RSDC appears unbiased on both graph types. RWDC performs well, but exhibits slight bias on Gnutella. BFS, RS, and RW are heavily biased.
Figures 1(a) and 1(b) go here
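The bias check can be sketched as a per-peer selection count. This is a hedged illustration, not the evaluation code behind the figures: the peer set, the `degrees` list, and both samplers are hypothetical stand-ins, with the degree-weighted sampler only mimicking the bias of an uncorrected walk.

```python
import random
from collections import Counter

def selection_counts(peers, sampler, k):
    """Draw k*|V| samples and count how often each peer was selected."""
    counts = Counter(sampler() for _ in range(k * len(peers)))
    return [counts[p] for p in peers]

rng = random.Random(0)
peers = list(range(100))
degrees = [1 + p % 10 for p in peers]  # hypothetical degrees, 1 through 10

# Oracle: uniformly random selection.
oracle = lambda: rng.choice(peers)
# Stand-in for a degree-biased technique: selection proportional to degree.
degree_biased = lambda: rng.choices(peers, weights=degrees)[0]

k = 50
oracle_counts = selection_counts(peers, oracle, k)
biased_counts = selection_counts(peers, degree_biased, k)
```

Under the oracle, the counts cluster around k; under the degree-weighted sampler, high-degree peers are selected far more often than k and low-degree peers far less.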
Correlation
Even if unbiased, a technique may exhibit correlations. We define a sampling session as 1,000 consecutive samples. For pair (A, B), if A is selected, how often is B also selected? A long tail indicates correlation. RWDC and RSDC appear uncorrelated. RW and RS exhibit slight correlations. BFS exhibits strong correlation.
Figures 2(a) and 2(b) go here
Efficiency
The basic operation is the neighbors-query.
Efficiency is:
$$\text{efficiency} = \frac{\text{number of samples produced}}{\text{number of peers queried}}$$
BFS and RS are close to 100% efficient. Unfortunately, they are also heavily biased.
RW, RWDC, and RSDC are 2% to 8% efficient. RSDC is twice as efficient as RWDC (4% vs. 2%). However, even the inefficient techniques cost only O(log |V|) queries per sample.
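As a worked reading of these figures, the number of neighbors-queries needed per sample is the reciprocal of the efficiency. Using the 2% and 4% values quoted for RWDC and RSDC:

```latex
\text{queries per sample} = \frac{1}{\text{efficiency}}, \qquad
\frac{1}{0.02} = 50 \ \text{(RWDC)}, \qquad
\frac{1}{0.04} = 25 \ \text{(RSDC)}
```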
Summary of Results and Lessons Learned
Addressing temporal causes of bias:
Avoid gathering a set of peers and collecting measurements in separate passes.
Select a peer, then collect the measurement.
Repeat, and allow re-selecting the same peer.
Addressing topological causes of bias:
Be careful to avoid bias towards high-degree peers.
Consider using a random walk or random stroll with degree correction.
Ongoing Work
This work is preliminary.
Additional types of random walks: weighting the selection of the next hop.
Additional types of graphs: power-law, small world.
We have examined temporal and topological causes of bias separately. To examine them concurrently, we are creating a dynamic overlay simulator.