Are P2P Data-Dissemination Techniques Viable in Today's Data-Intensive Scientific Collaborations?
Samer Al-Kiswany – University of British Columbia
joint work with
Matei Ripeanu – University of British Columbia
Adriana Iamnitchi - University of South Florida
Sudharshan Vazhkudai - Oak Ridge National Laboratory
2
Introduction
Data-intensive science: large-scale simulations and new scientific instruments generate huge volumes of data (petabytes).
User communities: large and geographically dispersed.
Requirement: efficient data dissemination tools.
Samer Al-Kiswany EuroPar ‘07 /26
3
Introduction - Example
4
Question
What data dissemination strategies perform best in today's Grid deployments?
Data dissemination solutions: IP-Multicast, Bullet, BitTorrent, SPIDER, OMNI, ALMI, Logistical-Multicast, Narada, Scribe, GridoGrido, FastReplica… and many others.
5
Roadmap
Question: What data dissemination strategies perform best in today's Grid deployments?
Workload characteristics
Deployment platform characteristics
Data dissemination proposed solutions
Evaluation
Recommendations
6
Workload and Deployment Platform
Data-intensive scientific collaboration characteristics:
Scale of data: massive data collections (terabytes).
Data usage: uniform popularity distributions, and co-usage.
Deployment platform characteristics:
Resource availability: low churn rate, high node availability, well-provisioned networks.
Collaborative environments: no freeriding, thus less effort is needed to control fair resource sharing.
8
Classification of Approaches
Technique | Protocol
Tree-based techniques | ALM and SPIDER
Swarming | Bullet and BitTorrent
Techniques employing intermediate storage capabilities | Logistical Multicasting
Base cases:
• IP-Multicast.
• Parallel transfers: separate data channels from the source to each destination.
9
Separate Transfer from the Source to every Destination
Drawbacks:
• Overwhelms the source – does not scale
• Generates high duplicate traffic at the links around the source
• Does not exploit all available transport capacity.
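The scaling drawback can be made concrete with a minimal sketch. It assumes, for illustration only, that the source uplink is the sole bottleneck and is shared equally among the parallel streams; the function name and the numbers are hypothetical and do not come from the talk.

```python
def separate_transfer_time(file_size_mb, source_uplink_mbps, n_destinations):
    """Seconds until all destinations finish, under the stated assumptions.

    Assumption: the source uplink is the only bottleneck and is split
    equally among n_destinations parallel streams.
    """
    per_stream_mbps = source_uplink_mbps / n_destinations
    return file_size_mb * 8 / per_stream_mbps  # megabytes -> megabits


# Hypothetical numbers: a 1000 MB file over a 1000 Mbit/s source uplink.
one_destination = separate_transfer_time(1000, 1000, 1)
twenty_destinations = separate_transfer_time(1000, 1000, 20)
```

Under this model the completion time grows linearly with the number of destinations, which is the sense in which the source is overwhelmed.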
10
IP Multicasting
[Figure: IP multicast on an example network; links are labeled 10, except for one link labeled 5.]
11
IP Multicast
Drawbacks:
• Limited deployment
• Vulnerability to node failures
• Does not exploit all available transport capacity.
• Throughput limited by bottleneck link
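The bottleneck drawback can be sketched as follows, assuming a single-rate multicast in which the source sends at one rate to every receiver. The tree shape and node names are hypothetical; the capacities of 10 and 5 only echo the link labels on the slide, whose units are not stated.

```python
def multicast_rate(link_capacity):
    """Rate of a single-rate multicast tree: bounded by its slowest link."""
    return min(link_capacity.values())


def path_bottleneck(parent, link_capacity, node):
    """Smallest capacity on the tree path from the root down to `node`."""
    rate = float("inf")
    while node in parent:
        rate = min(rate, link_capacity[(parent[node], node)])
        node = parent[node]
    return rate


# Hypothetical tree: every link has capacity 10 except one link of 5.
parent = {"r1": "source", "r2": "source", "a": "r1", "b": "r1", "c": "r2"}
link_capacity = {
    ("source", "r1"): 10,
    ("source", "r2"): 10,
    ("r1", "a"): 10,
    ("r1", "b"): 5,
    ("r2", "c"): 10,
}

tree_rate = multicast_rate(link_capacity)
rate_to_a = path_bottleneck(parent, link_capacity, "a")
```

In this sketch receiver `a` could be served at 10 on its own path, yet the whole tree runs at 5 because of the single slow link towards `b`.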
12
Tree Based Techniques: Application Level Multicast (ALM)
[Figure: a source and nodes 1 to 6 in the underlying network, next to the ALM tree built over the same nodes.]
Drawbacks:
• Vulnerability to node failures
• Does not exploit all possible routes in the network.
14
Swarming Techniques: BitTorrent and Bullet
[Figure: a complete file split into blocks 1 to 4, with individual blocks held by different peers.]
Drawbacks:
• Generates high duplicate traffic.
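The block-exchange idea behind swarming can be illustrated with a toy round-based model. This is a sketch under strong assumptions, not the peer or piece selection of BitTorrent or Bullet: every node uploads at most one block per round, every peer downloads at most one block per round, and a sender always offers its lowest-numbered block that the receiver lacks. The function name is hypothetical.

```python
def swarm_rounds(n_peers, n_blocks):
    """Rounds until every peer holds every block in the toy model."""
    # Node 0 is the source and starts with the complete file.
    have = [set(range(n_blocks))] + [set() for _ in range(n_peers)]
    rounds = 0
    while any(len(blocks) < n_blocks for blocks in have[1:]):
        # Nodes can only upload blocks they held when the round started.
        snapshot = [set(blocks) for blocks in have]
        receiving = set()  # peers that already got a block this round
        for sender in range(len(have)):
            for receiver in range(1, len(have)):
                if receiver == sender or receiver in receiving:
                    continue
                missing = snapshot[sender] - have[receiver]
                if missing:
                    have[receiver].add(min(missing))
                    receiving.add(receiver)
                    break  # one upload per sender per round
        rounds += 1
    return rounds


# Three peers and a file of four blocks, as in the slide's figure.
swarm = swarm_rounds(3, 4)
# Separate transfers in the same model: the source alone uploads every block.
separate = 3 * 4
```

Because peers forward blocks to each other while the source is still sending, the swarm finishes in fewer rounds than the source-only baseline. The model does not track which physical links the blocks cross, so it says nothing about the duplicate traffic listed as a drawback.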
17
Logistical Multicasting
18
Roadmap
Question: What data dissemination strategies perform best in today's Grid deployments?
Workload characteristics
Deployment platform characteristics
Data dissemination proposed solutions
Evaluation
Recommendations
Evaluation approaches: analytical modeling, implementation, simulation.
19
Methodology
Simulator design:
• Block-level simulation.
• Simulates physical-layer link contention.
Inputs:
- Real topologies of three deployed Grid testbeds: LCG, GridPP, EGEE.
- Generated topologies: 100 (using BRITE).
20
Methodology
Success criteria | Metrics
Dissemination time | Transfer time
Overhead | MB x hop
Load balancing | Volume of in/out data
Fairness | Link stress
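Two of these metrics can be sketched from a list of transfers. The definitions below are assumptions made for illustration: MB x hop is taken as the sum over transfers of volume times the number of links crossed, and link stress as the number of flows crossing a link (the fairness plot's axis is "number of flows"). The data layout and function names are hypothetical.

```python
from collections import Counter


def mb_hop(transfers):
    """Sum of (megabytes sent) x (links crossed) over all transfers."""
    return sum(size_mb * len(path) for path, size_mb in transfers)


def link_stress(transfers):
    """Number of flows crossing each link."""
    stress = Counter()
    for path, _size_mb in transfers:
        for link in path:
            stress[link] += 1
    return stress


# Hypothetical example: the source S sends 100 MB to A and to B
# over paths that share the link (S, R1).
transfers = [
    ([("S", "R1"), ("R1", "A")], 100),
    ([("S", "R1"), ("R1", "B")], 100),
]

overhead = mb_hop(transfers)
stress = link_stress(transfers)
```

Here the shared link (S, R1) carries two flows while the other links carry one each, which is the kind of imbalance the link-stress metric exposes.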
21
Transfer Time
Number of destinations that have completed the file transfer for the original EGEE topology.
[Plot: number of completed transfers (0 to 20) versus time (0 to 19, in units of 10 s). Series: Bullet, separate transfers, ALM, IP-Multicast, Logistical MT, BitTorrent.]
22
Transfer Time – With reduced core-link bandwidth
Number of destinations that have completed the file transfer: EGEE topology with core bandwidth reduced to 1/8 of the original.
[Plot: number of completed transfers (0 to 20) versus time (0 to 30, in units of 10 s). Series: Bullet, separate transfers, ALM, IP-Multicast, Logistical MT, BitTorrent.]
Conclusions:
• On well-provisioned topologies, even naïve algorithms perform well.
• On constrained topologies, application-level techniques perform uniformly well: they are among the first to finish the transfer, with good intermediate progress.
23
Protocol Overhead – Metric Definition
[Figure: example transfer with traffic labeled as useful or duplicate.]
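The transcript does not preserve the exact definition, so the following is only a sketch under an assumed one: the first time a given block crosses a given link counts as useful traffic, and any later crossing of the same link by the same block counts as duplicate. The function name and data layout are hypothetical.

```python
def split_useful_duplicate(deliveries):
    """Return (useful, duplicate) traffic in MB x hop.

    deliveries: iterable of (block_id, size_mb, path), where path is a
    list of links. Assumption: only the first crossing of a link by a
    block is useful; repeated crossings are duplicate.
    """
    seen = set()
    useful = 0
    duplicate = 0
    for block_id, size_mb, path in deliveries:
        for link in path:
            if (block_id, link) in seen:
                duplicate += size_mb
            else:
                seen.add((block_id, link))
                useful += size_mb
    return useful, duplicate


# Hypothetical example: a 1 MB block sent separately from S to A and to B;
# both paths cross the link (S, R).
deliveries = [
    (0, 1, [("S", "R"), ("R", "A")]),
    (0, 1, [("S", "R"), ("R", "B")]),
]
useful, duplicate = split_useful_duplicate(deliveries)
```

Under this assumed definition, a scheme that sends each block at most once over each link, as an ideal multicast tree would, produces no duplicate traffic.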
24
Protocol Overhead
Overhead of each protocol on EGEE Topology.
[Bar chart: total traffic volume (GB, 0 to 100), split into useful and duplicate traffic, for Bullet, BitTorrent, IP-Multicast, ALM, and separate transfers.]
Conclusion:
Application-level techniques generate significant overhead: up to 4 times more than IP-layer solutions.
Reasons:
Dissemination decisions are based on application-level metrics.
The nodes' location in the topology is ignored.
25
Fairness
Link stress distribution for the EGEE topology. For BitTorrent and Bullet the plot presents maximum link stress.
[Plot: number of flows (0 to 30) per link, with links ranked by maximum number of flows (rank 0 to 60). Series: Bullet max, BitTorrent max, ALM.]
Conclusion:
Application‑level solutions have a considerable impact on competing traffic.
26
Summary
Motivating question: What data dissemination strategies perform best in today's Grid deployments?
In this project, we simulated representative solutions, taking into account the characteristics of the workload and of the deployed platforms.
Our results provide guidelines for selecting the data dissemination technique, depending on the:
Target environment.
Overall system workload characteristics.
Success Criteria.
27
Thank you
www.ece.ubc.ca/~samera