
Transcript of people.cs.vt.edu/~butta/docs/sc03condorSlides.pdf

Page 1:

A Self-Organizing Flock of Condors

Ali Raza Butt, Rongmei Zhang, Y. Charlie Hu

{butta,rongmei,ychu}@purdue.edu

Page 2:

The need for sharing compute-cycles

• Scientific applications
  – Complex, large data sets

• Specialized hardware
  – Expensive

• Modern workstation
  – Powerful resource
  – Available in large numbers
  – Underutilized

Harness the idle cycles of a network of workstations

Page 3:

Condor: High throughput computing

• Cost-effective idle-cycle sharing

• Job management facilities
  – Scheduling, checkpointing, migration

• Resource management
  – Policy specification/enforcement

• Solves real problems world-wide
  – 1200+ machine Condor pools, 100+ researchers @Purdue

Page 4:

Sharing across pools: Flocking

[Figure: Condor pools, each with a central manager, sharing resources through pre-configured flocking]

Page 5:

Flocking

• Static flocking requires
  – Pre-configuration
  – A priori knowledge of all remote pools

• Does not support dynamic resources

Page 6:

Our contribution: Peer-to-peer based dynamic flocking

• Automated remote Condor pool discovery

• Dynamic resource management
  – Support dynamic membership
  – Support changing local policies

Page 7:

Agenda

• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions

Page 8:

Overlay Networks

P2P networks are self-organizing overlay networks without central control

[Figure: overlay nodes (N) at Sites 1–4, connected across ISP1, ISP2, and ISP3]

Page 9:

Advantages of structured p2p networks

• Scalable
• Self-organizing
• Fault-tolerant
• Locality-aware
• Simple to deploy

• Many implementations available
  – e.g., Pastry, Tapestry, Chord, CAN, …

Page 10:

Pastry: locality-aware p2p substrate

• 128-bit circular identifier space
  – Unique random nodeIds
  – Message keys

• Routing: a message is routed reliably to the node with the nodeId numerically closest to the key

• Routing in the overlay < 2 × routing in IP

[Figure: circular identifier space]
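To make the routing invariant concrete, here is a minimal sketch in plain Java (not the FreePastry API; the class and method names are illustrative): a message addressed to a key is delivered to the node whose nodeId is numerically closest to that key on the 128-bit circular identifier space.

```java
import java.math.BigInteger;
import java.util.List;

// Conceptual sketch of Pastry's delivery rule: among the known nodeIds,
// find the one numerically closest to a message key on the circular
// 128-bit identifier space.
class IdSpace {
    static final BigInteger RING = BigInteger.ONE.shiftLeft(128); // 2^128 identifiers

    // circular distance between two identifiers
    static BigInteger distance(BigInteger a, BigInteger b) {
        BigInteger d = a.subtract(b).mod(RING);
        return d.min(RING.subtract(d));
    }

    // the node that ultimately receives a message routed with this key
    static BigInteger closestNode(BigInteger key, List<BigInteger> nodeIds) {
        BigInteger best = nodeIds.get(0);
        for (BigInteger id : nodeIds)
            if (distance(id, key).compareTo(distance(best, key)) < 0) best = id;
        return best;
    }
}
```

Pastry reaches that node in O(log N) overlay hops by forwarding towards nodes whose nodeIds share progressively longer prefixes with the key; the sketch only shows the destination rule, not the hop-by-hop routing.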

Page 11:

Agenda

• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions

Page 12:

Step 1: P2p organization of Condor pools

• Participating central managers join an overlay
  – Just need to know a single remote pool

• P2p provides self-organization
  – Pools can reach each other through the overlay
  – Pools can join/leave at any time

Page 13:

P2p organized central managers

[Figure: central managers joined in a p2p overlay, each managing its own pool's resources]

Page 14:

Step 2: Disseminating resource information

• Announcements to nearby pools
  – Contain pool status information
  – Leverage the locality-aware routing table
    • Routing table has O(log N) entries matching increasingly long prefixes of the local nodeId
  – Soft state
    • Periodically refreshed
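A hedged sketch of this step in Java (class and parameter names are illustrative, not poolD's actual code): each central manager periodically sends its pool status to the pools in its locality-aware routing table, and expires announcements that stop being refreshed, which is what makes the state "soft".

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of periodic, soft-state resource announcements.
class AnnouncementManager {
    static final long REFRESH_MS = 60_000;           // assumed refresh period
    static final long EXPIRY_MS  = 3 * REFRESH_MS;   // assumed soft-state lifetime

    record PoolStatus(String poolId, int idleMachines, long receivedAt) {}

    private final Map<String, PoolStatus> knownPools = new ConcurrentHashMap<>();

    void start(Overlay overlay, String myPoolId) {
        var timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            // 1. announce our status to routing-table neighbors (nearby pools)
            for (String neighbor : overlay.routingTableEntries())
                overlay.send(neighbor, new PoolStatus(
                        myPoolId, overlay.localIdleMachines(), System.currentTimeMillis()));
            // 2. drop announcements that were not refreshed in time
            long now = System.currentTimeMillis();
            knownPools.values().removeIf(s -> now - s.receivedAt() > EXPIRY_MS);
        }, 0, REFRESH_MS, TimeUnit.MILLISECONDS);
    }

    void onAnnouncement(PoolStatus s) { knownPools.put(s.poolId(), s); }

    Collection<PoolStatus> nearbyPools() { return knownPools.values(); }

    // minimal stand-in for the p2p substrate used above
    interface Overlay {
        List<String> routingTableEntries();
        int localIdleMachines();
        void send(String poolId, PoolStatus status);
    }
}
```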

Page 15:

Resource announcements

[Figure: announcements propagate to the pools in the routing table, which are physically close to the announcing pool]

Page 16:

Step 3: Enable dynamic flocking

• Central managers flock with nearby pools
  – Use knowledge gained from resource announcements
  – Implement local policies
  – Support dynamic reconfiguration
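A hedged sketch of the flocking decision (illustrative names, not the actual poolD classes): from the pools learned through announcements, keep only those the local policy allows, prefer the closest ones, and reconfigure the local Condor pool whenever the chosen flock list changes.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of locality-aware, policy-driven flock selection.
class FlockingManager {
    record Candidate(String poolId, int idleMachines, double rttMillis) {}

    interface LocalPolicy { boolean allows(Candidate c); }   // site-specific rules

    private List<String> currentFlockList = List.of();

    void update(Collection<Candidate> announcedPools, LocalPolicy policy, int maxPools) {
        List<String> selected = announcedPools.stream()
                .filter(policy::allows)                        // enforce the local policy
                .filter(c -> c.idleMachines() > 0)             // only pools with spare cycles
                .sorted(Comparator.comparingDouble(Candidate::rttMillis)) // nearest first
                .limit(maxPools)
                .map(Candidate::poolId)
                .collect(Collectors.toList());

        if (!selected.equals(currentFlockList)) {              // dynamic reconfiguration
            currentFlockList = selected;
            reconfigureCondor(selected);
        }
    }

    void reconfigureCondor(List<String> pools) {
        /* e.g., rewrite the flocking entries in the local Condor
           configuration and signal the daemons to re-read it */
    }
}
```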

Page 17:

Interactions between central managers

[Figure: locality-aware flocking between central managers and their pools' resources]

Page 18:

Matchmaking

• Orthogonal to flocking
• Condor matchmaking within a pool

• P2p approach affects the flocking decisions only

Page 19:

Are we discovering enough pools?

• Only a subset of nearby pools is reached using the Pastry routing table

• Multi-hop, TTL-based announcement forwarding
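A hedged sketch of the TTL-based forwarding (illustrative names): an announcement carries a hop budget; each receiving pool records the originator's status and, while the budget lasts, forwards the announcement to its own routing-table neighbors, so pools beyond the originator's routing table are also discovered.

```java
import java.util.List;

// Illustrative sketch of multi-hop, TTL-limited announcement forwarding.
// Duplicate suppression is omitted for brevity.
class TtlForwarding {
    record Announcement(String originPool, int idleMachines, int ttl) {}

    interface Neighbors {
        List<String> routingTableEntries();
        void send(String poolId, Announcement a);
    }

    void onReceive(Announcement a, Neighbors overlay, String myPoolId) {
        recordStatus(a);                              // learn about the originating pool
        if (a.ttl() <= 0) return;                     // hop budget exhausted
        Announcement next =
                new Announcement(a.originPool(), a.idleMachines(), a.ttl() - 1);
        for (String neighbor : overlay.routingTableEntries())
            if (!neighbor.equals(a.originPool()) && !neighbor.equals(myPoolId))
                overlay.send(neighbor, next);         // forward with a decremented TTL
    }

    void recordStatus(Announcement a) { /* update the soft-state table of known pools */ }
}
```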

Page 20:

Agenda

• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions

Page 21:

Software

• Implemented as a daemon: poolD
  – Leverages FreePastry 1.3 from Rice
  – Runs on central managers
  – Manages self-organized Condor pools

• Condor version 6.4.7

• Interfaced to Condor configuration control
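One plausible way to drive Condor's configuration is sketched below. FLOCK_TO and condor_reconfig are standard Condor mechanisms for flocking and for re-reading configuration, but the file path, class name, and the exact hooks the real poolD uses are assumptions. In static flocking the FLOCK_TO list is written by hand ahead of time; here it would be rewritten automatically as nearby pools come and go.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Illustrative sketch of interfacing with Condor configuration control:
// rewrite the flocking list in a local config file and ask the running
// Condor daemons to re-read it.
class CondorConfigControl {
    static final Path FLOCK_CONFIG =
            Paths.get("/etc/condor/config.d/flock.conf");   // assumed path

    static void setFlockTargets(List<String> centralManagers)
            throws IOException, InterruptedException {
        // e.g. "FLOCK_TO = cm1.siteA.edu, cm2.siteB.edu"
        String line = "FLOCK_TO = " + String.join(", ", centralManagers) + "\n";
        Files.writeString(FLOCK_CONFIG, line);

        // make the running Condor daemons pick up the new flocking list
        new ProcessBuilder("condor_reconfig").inheritIO().start().waitFor();
    }
}
```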

Page 22:

Software architecture

[Figure: the Condor p2p extension sits between the p2p network and Condor: a p2p module (Announcement Manager, query services) and a Condor module (Policy Manager, Flocking Manager) that drives Condor configuration]

Page 23:

Agenda

• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions

Page 24:

Evaluation

• Measured results
  – Effect of flocking on job throughput
    • Time spent in queue
  – Four pools, three compute machines each
  – Synthetic job trace

Page 25:

Job trace

• Sequence
  – 100 (issue time T, job length L) pairs
  – Interval (Tn − Tn−1) and L drawn from a uniform distribution on [1, 17]
  – Designed to keep a single machine busy
  – Random overload/idle periods

• Trace
  – One or more job sequences merged together
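A hedged sketch of a generator matching this description (the slides do not show the original generator; class names and time units are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch of the synthetic workload: a sequence is 100
// (issueTime, jobLength) pairs whose inter-arrival gap and length are
// uniform on [1, 17], so on average one machine stays busy while random
// stretches overload it or leave it idle; a trace merges several sequences.
class JobTrace {
    record Job(double issueTime, double length) {}

    static List<Job> sequence(Random rng) {
        List<Job> jobs = new ArrayList<>();
        double t = 0;
        for (int i = 0; i < 100; i++) {
            t += 1 + rng.nextDouble() * 16;                  // gap uniform on [1, 17]
            jobs.add(new Job(t, 1 + rng.nextDouble() * 16)); // length uniform on [1, 17]
        }
        return jobs;
    }

    // a trace submitted to one pool: k sequences merged by issue time
    static List<Job> trace(int k, Random rng) {
        List<Job> all = new ArrayList<>();
        for (int i = 0; i < k; i++) all.addAll(sequence(rng));
        all.sort((a, b) -> Double.compare(a.issueTime(), b.issueTime()));
        return all;
    }
}
```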

Page 26:

PlanetLab experimental setup

[Figure: four Condor pools (A, B, C, D) on PlanetLab nodes at U.C. Berkeley, Interxion (Germany), Rice, and Columbia, with dynamic flocking between them]

Page 27:

Time spent in queue

Pool      No. of sequences   Without flocking            With flocking
          in trace           mean     min    max         mean    min    max
A         2                  1.76     0.03   14.32       20.15   0.03   72.10
B         2                  3.30     0.08   19.85       32.68   0.13   63.70
C         3                  46.58    0.03   97.17       38.68   0.10   64.48
D         5                  284.91   0.25   557.55      28.37   0.10   58.38
Overall   12                 131.20   0.03   557.55      30.30   0.03   72.10

Page 28:

Simulations

• 1000 Condor pools

• GT-ITM transit-stub model
  – 50 transit domains
  – 1000 stub domains

• Size of pool: uniform distribution [25, 225]

• Number of sequences in trace: uniform distribution [25, 225]
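For concreteness, a hedged sketch of how such a population could be drawn (the actual simulator code is not shown in the slides; the seed, class name, and loop structure are assumptions):

```java
import java.util.Random;

// Illustrative sketch of the simulated population: 1000 pools, each with a
// size and a workload intensity drawn uniformly from [25, 225].
class SimSetup {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int numPools = 1000;
        for (int p = 0; p < numPools; p++) {
            int poolSize     = 25 + rng.nextInt(201); // uniform on [25, 225]
            int numSequences = 25 + rng.nextInt(201); // uniform on [25, 225]
            // ... place pool p on a GT-ITM transit-stub topology node and
            //     drive it with a trace of numSequences job sequences
        }
    }
}
```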

Page 29:

Cumulative distribution of locality

Page 30:

Total job completion time: without flocking

Page 31:

Total job completion time: with flocking

Page 32:

Agenda

• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions

Page 33:

Conclusions

• Design and implementation of a self-organizing flock of Condors
  – Scalability
  – Fault-tolerance
  – Locality-awareness, which yields flocking with nearby resources
  – Local sharing policy enforced

• P2p mechanisms provide an effective substrate for discovery and management of dynamic resources over the wide-area network

Page 34:

Questions?

Page 35:

What about security?

• Authenticated pools / users
  – Enforced by policy manager
  – Accountability

• Restricted access
  – Limited privileges, e.g. UNIX user nobody
  – Condor libraries

• Controlled execution environment
  – Sandboxing
  – Process cleanups on job completion

• Intrusion detection