Are P2P Data-Dissemination Techniques Viable in Today's Data-Intensive Scientific Collaborations?


Samer Al-Kiswany – University of British Columbia

joint work with

Matei Ripeanu – University of British Columbia

Adriana Iamnitchi - University of South Florida

Sudharshan Vazhkudai - Oak Ridge National Laboratory


Introduction

Data-intensive science: large-scale simulations and new scientific instruments generate huge volumes of data (petabytes).

User communities: large and geographically dispersed.

Requirement: efficient data dissemination tools.

Samer Al-Kiswany EuroPar ‘07 /26


Introduction - Example


Question

What data dissemination strategies perform best in today's Grid deployments?

Data dissemination solutions: IP-Multicast, Bullet, BitTorrent, SPIDER, OMNI, ALMI, Logistical-Multicast, Narada, Scribe, Grido, FastReplica, and many others.

Roadmap

Question: What data dissemination strategies perform best in today's Grid deployments?

• Workload characteristics
• Deployment platform characteristics
• Proposed data dissemination solutions
• Evaluation
• Recommendations

Workload and Deployment Platform

Data-intensive scientific collaboration characteristics:
• Scale of data: massive data collections (terabytes).
• Data usage: uniform popularity distributions and co-usage.

Deployment platform characteristics:
• Resource availability: low churn rate, high node availability, well-provisioned networks.
• Collaborative environments: no freeriding, so less effort is needed to enforce fair resource sharing.


Classification of Approaches

Techniques and representative protocols:
• Tree-based techniques: ALM and SPIDER
• Swarming: Bullet and BitTorrent
• Techniques employing intermediate storage capabilities: Logistical Multicasting

Base cases:
• IP-Multicast.
• Parallel transfers: separate data channels from the source to each destination.


Separate Transfer from the Source to every Destination


Drawbacks:

• Overwhelms the source – does not scale

• Generates high duplicate traffic at the links around the source

• Does not exploit all available transport capacity.
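
The scaling drawback can be made concrete with a back-of-the-envelope sketch. It assumes the source uplink is the only bottleneck and is shared equally among all destinations; the function names and the example numbers are illustrative and do not come from the talk.

```python
def separate_transfer_time(file_size_mb, source_uplink_mbps, n_destinations):
    """Seconds until all destinations finish, assuming the source uplink
    is the only bottleneck and is split equally among the destinations."""
    if n_destinations < 1:
        raise ValueError("need at least one destination")
    per_flow_mbps = source_uplink_mbps / n_destinations
    return file_size_mb * 8 / per_flow_mbps  # MB -> Mbit, then divide by rate


def source_traffic_mb(file_size_mb, n_destinations):
    """Data the source must push out: one full copy per destination."""
    return file_size_mb * n_destinations


# Hypothetical numbers: a 1000 MB file and a 1000 Mbit/s source uplink.
one_destination = separate_transfer_time(1000, 1000, 1)
twenty_destinations = separate_transfer_time(1000, 1000, 20)
```

Under these assumptions the completion time grows linearly with the number of destinations, and every copy crosses the links next to the source.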

IP Multicast

[Figure: example network topology with links labeled 10 and 5]

Drawbacks:

• Limited deployment

• Vulnerability to node failures

• Does not exploit all available transport capacity.

• Throughput limited by bottleneck link
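
A small sketch of the bottleneck effect, assuming a single-rate multicast session in which the whole group proceeds at the rate of its slowest receiver. The tree below is hypothetical; it only borrows the link labels 10 and 5 from the figure.

```python
def path_bottleneck(tree, node):
    """Smallest link capacity on the path from the root down to `node`.

    `tree` maps each non-root node to (parent, capacity of the link to parent).
    """
    rate = float("inf")
    while node in tree:
        parent, capacity = tree[node]
        rate = min(rate, capacity)
        node = parent
    return rate


def single_rate_multicast(tree, receivers):
    """Group rate when every receiver is served at the slowest receiver's rate."""
    return min(path_bottleneck(tree, r) for r in receivers)


# Hypothetical multicast tree: routers r1, r2 and receivers a, b, c.
tree = {
    "r1": ("source", 10),
    "r2": ("source", 10),
    "a": ("r1", 10),
    "b": ("r1", 5),   # the one slow link
    "c": ("r2", 10),
}
group_rate = single_rate_multicast(tree, ["a", "b", "c"])
```

Here receivers `a` and `c` could each be served at 10, but the single link of capacity 5 towards `b` sets the rate for the whole group.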


Tree Based Techniques: Application Level Multicast (ALM)

[Figure: a network with the source and nodes 1 to 6, and the ALM tree built over these nodes]

Drawbacks:

• Vulnerability to node failures

• Does not exploit all possible routes in the network.
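
The failure drawback can be illustrated with a minimal sketch: in a tree, every node below a failed interior node stops receiving data. The tree shape below is an assumption made for illustration (the exact tree from the figure is not recoverable), and the helper names are hypothetical.

```python
from collections import deque


def reachable_from(tree, root, failed=frozenset()):
    """Nodes that still receive data when the nodes in `failed` stop forwarding."""
    if root in failed:
        return set()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in tree.get(node, []):
            if child not in failed and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def cut_off(tree, root, failed):
    """Healthy nodes that no longer receive data because an ancestor failed."""
    everyone = {root} | set(tree) | {c for kids in tree.values() for c in kids}
    alive = everyone - set(failed)
    return alive - reachable_from(tree, root, set(failed))


# Hypothetical ALM tree: parent -> children.
alm_tree = {"source": [1, 5], 1: [6, 3], 5: [2, 4]}
```

With this tree, `cut_off(alm_tree, "source", {1})` reports nodes 6 and 3 as disconnected, whereas the failure of a leaf such as node 4 affects nobody else.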

Swarming Techniques: BitTorrent and Bullet

[Figure: the complete file is split into blocks 1 to 4, which are spread across the peers]

Drawbacks:

• Generates high duplicate traffic.
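
The idea behind swarming can be sketched as a toy, round-based model: the file is split into blocks, and any node that already holds a block can upload it to a peer that lacks it. This is not the BitTorrent or Bullet protocol; the one-upload and one-download per round limits and the greedy lowest-numbered-block choice are simplifying assumptions.

```python
def swarm_rounds(n_peers, n_blocks):
    """Rounds until every peer holds every block.

    Toy model: per round each node uploads at most one block and each peer
    downloads at most one block; senders use what they held at round start.
    """
    have = {"source": set(range(n_blocks))}
    for peer in range(n_peers):
        have[peer] = set()

    rounds = 0
    while any(len(have[p]) < n_blocks for p in range(n_peers)):
        rounds += 1
        snapshot = {node: set(blocks) for node, blocks in have.items()}
        receiving = set()  # peers that already got a block this round
        for sender in snapshot:
            for receiver in range(n_peers):
                if receiver == sender or receiver in receiving:
                    continue
                missing = snapshot[sender] - have[receiver]
                if missing:
                    have[receiver].add(min(missing))
                    receiving.add(receiver)
                    break  # this sender has used its upload slot
    return rounds


def source_only_rounds(n_peers, n_blocks):
    """Same model, but only the source uploads (separate transfers)."""
    return n_peers * n_blocks
```

Because peers start uploading as soon as they hold a block, the swarm finishes in fewer rounds than the source-only case for the same number of peers and blocks.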


Logistical Multicasting


Roadmap

Question: What data dissemination strategies perform best in today's Grid deployments?

• Workload characteristics
• Deployment platform characteristics
• Proposed data dissemination solutions
• Evaluation
• Recommendations

Evaluation approaches: analytical modeling, implementation, simulation.


Methodology

Simulator design:
• Block-level simulation.
• Simulates link contention at the physical layer.

Inputs:
• Real topologies of three deployed Grid testbeds: LCG, GridPP, and EGEE.
• 100 generated topologies (using BRITE).
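
One way to picture link contention is an equal-share approximation: each physical link divides its capacity evenly among the flows crossing it, and a flow runs at the rate of its most constrained link. This is a sketch under that assumption, not the simulator used in the study; the function name, link names, and capacities are hypothetical.

```python
from collections import Counter


def flow_rates(flows, capacity):
    """Rate of each flow under equal sharing of every link.

    flows:    {flow_id: [link, ...]}  physical links each flow crosses
    capacity: {link: capacity}        in arbitrary bandwidth units
    """
    load = Counter(link for path in flows.values() for link in path)
    return {
        flow_id: min(capacity[link] / load[link] for link in path)
        for flow_id, path in flows.items()
    }


# Hypothetical topology: two flows share a core link, then split.
capacity = {"core": 10, "edgeA": 10, "edgeB": 5}
shared = flow_rates({"f1": ["core", "edgeA"], "f2": ["core", "edgeB"]}, capacity)
alone = flow_rates({"f1": ["core", "edgeA"]}, capacity)
```

In this example flow `f1` gets the full 10 units when it is alone but only 5 once `f2` contends for the core link.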

Success criteria and metrics:
• Dissemination time: transfer time.
• Overhead: MB × hop.
• Load balancing: volume of in/out data.
• Fairness: link stress.
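
Two of these metrics can be sketched in a few lines. The sketch assumes that overhead sums megabytes sent times physical hops crossed, and that link stress is the number of flows crossing a physical link; the data layout and the example transfers are illustrative, not taken from the study.

```python
from collections import Counter


def overhead_mb_hops(transfers):
    """Total traffic cost: megabytes sent times physical links crossed."""
    return sum(size_mb * len(path) for size_mb, path in transfers)


def link_stress(transfers):
    """Number of flows crossing each (directed) physical link."""
    stress = Counter()
    for _size_mb, path in transfers:
        stress.update(path)
    return stress


# Hypothetical transfers: (size in MB, list of directed physical links).
transfers = [
    (100, [("s", "r1"), ("r1", "a")]),   # source -> a
    (100, [("s", "r1"), ("r1", "b")]),   # source -> b
    (100, [("a", "r1"), ("r1", "b")]),   # a -> b, an overlay hop
]
total_overhead = overhead_mb_hops(transfers)
ranked_stress = sorted(link_stress(transfers).values(), reverse=True)
```

Sorting the per-link counts in decreasing order gives the kind of ranked link-stress curve used to compare protocols on fairness.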

Transfer Time

[Figure: number of destinations that have completed the file transfer for the original EGEE topology. x-axis: Time (10s); y-axis: # of completed transfers. Curves: Bullet, Separate transfers, ALM, IP-Multicast, Logistical MT, BitTorrent.]

Transfer Time – With reduced core-link bandwidth

[Figure: number of destinations that have completed the file transfer for the EGEE topology with core bandwidth reduced to 1/8 of the original. x-axis: Time (10s); y-axis: # of completed transfers. Curves: Bullet, Separate transfers, ALM, IP-Multicast, Logistical MT, BitTorrent.]

Conclusions:
• On well-provisioned topologies, even naïve algorithms perform well.
• On constrained topologies, application-level techniques perform uniformly well: they are among the first to finish the transfer, with good intermediate progress.


Protocol Overhead – Metric Definition

[Figure: example distinguishing useful traffic from duplicate traffic]

Protocol Overhead

[Figure: overhead of each protocol on the EGEE topology. Bars show total traffic volume (GB), split into useful and duplicate traffic, for Bullet, BitTorrent, IP-Multicast, ALM, and separate transfers.]

Conclusion: Application-level techniques generate significant overhead, up to 4 times more than IP-layer solutions.

Reasons:
• Dissemination decisions are based on application-level metrics.
• They ignore where nodes are located in the topology.


Fairness

[Figure: link stress distribution for the EGEE topology. For BitTorrent and Bullet the plot presents maximum link stress. x-axis: rank (links ranked by max. # of flows); y-axis: number of flows. Curves: Bullet Max, BitTorrent Max, ALM.]

Conclusion: Application-level solutions have a considerable impact on competing traffic.


Summary


Motivating question: What data dissemination strategies perform best in today's Grid deployments?

In this project, we simulated representative solutions, taking into account the characteristics of the workload and of the deployed platforms.

Our results provide guidelines for selecting a data dissemination technique, depending on:
• the target environment,
• the overall system workload characteristics,
• the success criteria.


Thank you

www.ece.ubc.ca/~samera