Decentralizing Grids
Jon Weissman, University of Minnesota
E-Science Institute, Nov. 8, 2007
Roadmap
• Background
• The problem space
• Some early solutions
• Research frontier/opportunities
• Wrap-up
Background
• Grids are distributed … but also centralized
– Condor, Globus, BOINC, Grid Services, VOs
– Why? client-server based
• Centralization pros
– Security, policy, global resource management
• Decentralization pros
– Reliability, dynamic, flexible, scalable
– Fertile CS research frontier
Challenges
• May have to live within the Grid ecosystem
– Condor, Globus, Grid services, VOs, etc.
– First-principles approaches are risky (Legion)
• 50K-foot view
– How to decentralize Grids yet retain their existing features?
– High performance, workflows, performance prediction, etc.
Decentralized Grid platform
• Minimal assumptions about each "node"
• Nodes have associated "assets" (A)
– basic: CPU, memory, disk, etc.
– complex: application services
– exposed interface to assets: OS, Condor, BOINC, Web service
• Nodes may go up or down
• Node trust is not a given (asked to do X, a node may do Y instead)
• Nodes may or may not connect to other nodes
• Nodes may be aggregates
• The Grid may be large (> 100K nodes), so scalability is key
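A minimal sketch (not from the talk) of what such a node abstraction might look like; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    """One overlay participant, with minimal assumptions about what it offers."""
    node_id: str
    assets: Dict[str, object] = field(default_factory=dict)  # basic: cpu/mem/disk; complex: app services
    interface: str = "OS"          # how assets are exposed: "OS", "Condor", "BOINC", "WebService"
    up: bool = True                # nodes may go up or down
    trust: float = 0.5             # trust is not a given; start from a neutral rating
    neighbors: List[str] = field(default_factory=list)       # overlay links (may be empty)
    members: Optional[List["Node"]] = None                   # set when this node is an aggregate

node = Node("umn-17",
            assets={"cpu_ghz": 2.4, "mem_gb": 4, "blast_service": True},
            interface="Condor", neighbors=["umn-03", "ucl-22"])
```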
Grid Overlay
[Diagram: an overlay connecting heterogeneous nodes – a Grid service, raw OS services, a Condor network, a BOINC network]

Grid Overlay - Join
[Diagram: a new node joining the overlay]

Grid Overlay - Departure
[Diagram: a node leaving the overlay]
Routing = Discovery
• The query contains sufficient information to locate a node: RSL, ClassAd, etc.
• Exact match or semantic match
[Animation: a "discover A" query is routed through the overlay until a matching node is found ("bingo!")]
• The discovered node returns a handle sufficient for the "client" to interact with it – perform service invocation, job/data transmission, etc.
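A minimal sketch of such discovery as a bounded walk over an unstructured overlay; the node/query representation, the matches predicate, and the handle format are assumptions of the sketch, not the talk's protocol:

```python
import random

def discover(start, query, matches, ttl=8):
    """Route a discovery query through the overlay until some node's assets
    satisfy it, or the TTL expires. Each node is a dict with "id", "assets",
    and "neighbors" (a list of node dicts). Returns a handle the client can
    use to reach the discovered node, or None."""
    node, visited = start, set()
    for _ in range(ttl):
        if matches(node["assets"], query):               # exact or semantic match
            return {"node": node["id"]}                  # handle: enough to invoke / transfer
        visited.add(node["id"])
        next_hops = [n for n in node["neighbors"] if n["id"] not in visited]
        if not next_hops:
            return None
        node = random.choice(next_hops)                  # forward the query onward
    return None

# Example: "discover A" where A is a CPU with at least 2 GB of memory
n2 = {"id": "n2", "assets": {"cpu": 1, "mem_gb": 4}, "neighbors": []}
n1 = {"id": "n1", "assets": {"cpu": 1, "mem_gb": 1}, "neighbors": [n2]}
meets = lambda assets, q: all(assets.get(k, 0) >= v for k, v in q.items())
print(discover(n1, {"cpu": 1, "mem_gb": 2}, meets))      # -> {'node': 'n2'}
```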
Routing = Discovery
• Three parties
– initiator of discovery events for A
– client: invocation, health of A
– node offering A
• Often the initiator and client will be the same
• Other times the client will be determined dynamically
– if W is a web service and results are returned to a calling client, we want to locate CW near W
– => discover W, then CW!
Routing = Discovery
[Animation: a "discover A" query hits a failed node (X), is re-routed, and eventually matches ("bingo!"); an outside client can also issue queries, e.g. "discover A's" for multiple assets]
Grid Overlay
• This generalizes …
– Resource query (the query contains job requirements)
– Looks like decentralized "matchmaking"
• These are the easy cases …
– independent simple queries
• find a CPU with characteristics x, y, z
• find 100 CPUs each with x, y, z
– suppose queries are complex or related?
• find N CPUs with aggregate power = G Gflops
• locate an asset near a previously discovered asset
Grid Scenarios
• Grid applications are more challenging
– An application has a more complex structure: multi-task, parallel/distributed, control/data dependencies
• an individual job/task needs a resource near a data source
• workflows
• queries are not independent
– Metrics are collective
• not simply raw throughput
• makespan
• response time
• QoS
Related Work
• Maryland/Purdue – matchmaking
• Oregon CCOF – time-zone
• CAN
Related Work (cont’d)
• None of these approaches addresses the Grid scenarios (in a decentralized manner)
– Complex multi-task data/control dependencies
– Collective metrics
50K-Ft Research Issues
• Overlay architecture
– structured, unstructured, hybrid
– what is the right architecture?
• Decentralized control/data dependencies
– how to do it?
• Reliability
– how to achieve it?
• Collective metrics
– how to achieve them?
Context: Application Model
[Diagram: an application graph producing an "answer"; legend: component / service request / job / task …, and data source]
Context: Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Context: Environment
• RIDGE project – ridge.cs.umn.edu
– reliable infrastructure for donation grid environments
• Live deployment on PlanetLab – planet-lab.org
– 700 nodes spanning 335 sites and 35 countries
– emulators and simulators
• Applications
– BLAST
– Traffic planning
– Image comparison
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Reliability Example
[Animation: workflow graph with components B, C, D, E, G]
• A client node CG is responsible for G's health
• The token G, loc(CG) is propagated (could also discover G and then CG)
• When the node running G fails (X), CG discovers a new node and re-launches G
Client Replication
[Animation: workflow graph with components B, C, D, E, G]
• Two client nodes, CG1 and CG2, monitor G
• loc(G), loc(CG1), loc(CG2) are propagated
• When one client fails (X), the client "hand-off" depends on the nature of G and the interaction
Component Replication
[Animation: workflow graph with components B, C, D, E, G]
• CG manages replicas of G, G1 and G2
Replication Research
• Nodes are unreliable – crash, hacked, churn, malicious, slow, etc.
• How many replicas?
– too many: a waste of resources
– too few: the application suffers
System Model
[Diagram: nodes with reputation ratings 0.9, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.4, 0.4, 0.3, organized into replica groups]
• Reputation rating ri – degree of node reliability
• Dynamically size the redundancy based on ri
• Nodes are not connected to one another and check in to a central server
• Note: variable-sized groups
Reputation-based Scheduling
• Reputation rating
– techniques for estimating reliability based on past interactions
• Reputation-based scheduling algorithms
– use reliabilities for allocating work
– rely on a success threshold parameter
Algorithm Space
• How many replicas?
– first-fit, best-fit, random, fixed, …
– algorithms compute how many replicas are needed to meet a success threshold
• How to reach consensus?
– M-first (better for timeliness)
– Majority (better for Byzantine threats)
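A minimal sketch of first-fit replica sizing against a success threshold, assuming each reputation rating ri is treated as an independent probability of a correct result and that a majority vote decides the outcome; the function names and success model are illustrative, not the RIDGE algorithms:

```python
from itertools import combinations

def group_success_prob(ratings):
    """Probability that a majority of the replicas return the correct result,
    treating each rating as an independent success probability."""
    n = len(ratings)
    need = n // 2 + 1
    total = 0.0
    for k in range(need, n + 1):
        for correct in combinations(range(n), k):
            p = 1.0
            for i in range(n):
                p *= ratings[i] if i in correct else (1.0 - ratings[i])
            total += p
    return total

def first_fit_group(nodes, threshold):
    """Add replicas (first-fit, in the order nodes are offered) until the
    estimated probability of a correct majority meets the success threshold."""
    group = []
    for node_id, rating in nodes:
        group.append((node_id, rating))
        if group_success_prob([r for _, r in group]) >= threshold:
            return group
    return group  # threshold not reachable with the available nodes

# Example: nodes with ratings like those in the system-model figure
nodes = [("n1", 0.9), ("n2", 0.8), ("n3", 0.8), ("n4", 0.7), ("n5", 0.4)]
print(first_fit_group(nodes, threshold=0.95))
```

With M-first consensus one would instead stop collecting results once M matching answers arrive, trading Byzantine resilience for timeliness.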
Experimental Results: correctness
A simulation based on Byzantine behavior … majority voting
Experimental Results: timeliness
M-first (M=1), best BOINC (BOINC*), conservative (BOINC-) vs. RIDGE
Next steps
• Nodes are decentralized, but trust management is not!
• Need a peer-based trust exchange framework
– Stanford's EigenTrust project: local exchange until the network converges to a global state
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Collective Metrics
[Example application: BLAST]
• Throughput is not always the best metric
• Response, completion time, application-centric metrics
– makespan, response time
Communication Makespan
• Nodes download data from replicated data nodes
– Nodes choose "data servers" independently (decentralized)
– Minimize the maximum download time across all worker nodes (communication makespan)
• Data download dominates
Data node selection
• Several possible factors
– Proximity (RTT)
– Network bandwidth
– Server capacity
[Download Time vs. RTT - linear]
[Download Time vs. Bandwidth - exp]
[Chart: Mean Download Time / RTT – time (msec), 0–1000, vs. concurrency (1, 3, 5, 7, 10) for PlanetLab nodes flux, tamu, venus, ksu, ubc, wroc]
Heuristic Ranking Function
• Query to get candidates, then RTT/bandwidth probes
• For worker node i and data server node j:
– Cost function: cost(i,j) = rtt_{i,j} * exp(k_j / bw_{i,j}), where k_j reflects server load/capacity
• The least-cost data node is selected independently by each worker
• Three server selection heuristics that use k_j:
– BW-ONLY: k_j = 1
– BW-LOAD: k_j = n-minute average load (past)
– BW-CAND: k_j = number of candidate responses in the last m seconds (~ future load)
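A minimal sketch of the cost-based selection, assuming RTT, bandwidth, and the load term k_j are already known for each candidate data server; the data layout and values are illustrative:

```python
import math

def cost(rtt, bw, k):
    """Ranking function: rtt * exp(k / bw). Lower is better."""
    return rtt * math.exp(k / bw)

def select_data_server(candidates):
    """Each worker independently picks the least-cost data server.
    candidates: list of dicts with 'id', 'rtt', 'bw', and 'k' (load term)."""
    return min(candidates, key=lambda c: cost(c["rtt"], c["bw"], c["k"]))

# Example: three candidate data servers seen by one worker
candidates = [
    {"id": "serverA", "rtt": 0.040, "bw": 5.0,  "k": 1},  # BW-ONLY would use k = 1
    {"id": "serverB", "rtt": 0.120, "bw": 20.0, "k": 3},  # k from recent load
    {"id": "serverC", "rtt": 0.080, "bw": 2.0,  "k": 1},
]
print(select_data_server(candidates)["id"])  # -> serverA
```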
Performance Comparison
Computational Makespan
• Compute dominates: BLAST
Computational Makespan
[Chart: variable-sized vs. equal-sized*]
Next Steps
• Other makespan scenarios
• Eliminate probes for bandwidth and RTT -> estimation
• Richer collective metrics
– deadlines: user-in-the-loop
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Data Dependence
• A data-dependent component needs access to one or more data sources
– the data may be large
[Diagram: a "discover A" query issued for a component that reads from data sources]

Data Dependence (cont'd)
[Diagram: "discover A" – where to run it?]
The Problem
• Where to run a data-dependent component?
– determine a candidate set
– select a candidate
• A candidate is unlikely to know its downstream bandwidth from particular data nodes
• Idea: infer bandwidth from neighbor observations with respect to data nodes!
Estimation Technique
• A candidate C1 may have had little past interaction with the data source – … but its neighbors may have
• For each neighbor, generate a download estimate:
– DT: the neighbor's prior download time to the data source
– RTT: from the candidate and from the neighbor to the data source, respectively
– DP: an average weighted measure of prior download times from any node to any data source
[Diagram: candidates C1 and C2, their neighbors, and the data source]
Estimation Technique (cont'd)
• Download Power (DP) characterizes the download capability of a node
– DP = average (DT * RTT)
– DT alone is not enough (a far-away vs. a nearby data source)
• Estimate associated with each neighbor ni:
– ElapsedEst[ni] = α ∙ β ∙ DT
• α: my_RTT / neighbor_RTT (to the data source)
• β: neighbor_DP / my_DP
• No active probes: historical data, RTT inference
• Combining neighbor estimates
– mean, median, min, …
– median worked the best
• Take the minimum over all candidate estimates
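A minimal sketch of the neighbor-based estimate, assuming each neighbor reports its prior download time (DT) to the data source, its RTT to the source, and its download power (DP); the field names and layout are illustrative:

```python
from statistics import median

def download_power(history):
    """DP = average(DT * RTT) over a node's prior downloads (any data source)."""
    return sum(dt * rtt for dt, rtt in history) / len(history)

def elapsed_estimate(my_rtt, my_dp, neighbors):
    """ElapsedEst[ni] = alpha * beta * DT, combined across neighbors by median.
    neighbors: list of dicts with the neighbor's 'rtt' (to the data source),
    'dp', and 'dt' (prior download time to the data source)."""
    estimates = []
    for n in neighbors:
        alpha = my_rtt / n["rtt"]   # adjust for my distance vs. the neighbor's
        beta = n["dp"] / my_dp      # adjust for relative download capability
        estimates.append(alpha * beta * n["dt"])
    return median(estimates)

def pick_candidate(candidates):
    """Select the candidate with the smallest estimated elapsed download time.
    candidates: list of (name, my_rtt, my_dp, neighbors) tuples."""
    return min(candidates,
               key=lambda c: elapsed_estimate(c[1], c[2], c[3]))[0]
```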
Comparison of Candidate Selection Heuristics
[Chart: Impact of Neighbor Size (Mix, N=8, Trial=50k) – mean elapsed time (sec) vs. neighbor size (2–32) for OMNI, RANDOM, PROXIM, SELF, NEIGHBOR; SELF uses direct observations]
Take Away
• Next steps
– routing to the best candidates
• Locality between a data source and a component
– scalable, no probing needed
– many uses
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
The Problem
• How to enable decentralized control?
– propagate downstream graph stages
– perform distributed synchronization
• Idea:
– distributed dataflow – token matching
– graph forwarding, futures (Mentat project)
Control Example
[Animation: workflow graph with components B, C, D, E, G; a control node performs token matching]
• Tokens are forwarded with the graph, e.g. {C, G}, {D, G}, and {E, B*C*D} – the last meaning that E fires once the outputs of B, C, and D have all been matched
• Tokens also carry output locations, e.g. {E, B*C*D, loc(SB)}, {E, B*C*D, loc(SC)}, {E, B*C*D, loc(SD)}
– output is stored at loc(…): where the component is run, or at the client, or at a storage node
• Once the matching control node has collected the B, C, and D tokens, it launches E
• How to color and route tokens so that they arrive at the same control node?
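A minimal sketch of the token-matching idea, assuming a control node collects tokens of the form (target, required inputs, producer, output location) and fires the target stage once all required inputs have arrived; the representation is illustrative, not the talk's implementation:

```python
from collections import defaultdict

class TokenMatcher:
    """Control node: collects tokens and fires a stage when all inputs arrive."""
    def __init__(self):
        # target stage -> {producing stage: location of its stored output}
        self.pending = defaultdict(dict)

    def receive(self, target, required, produced_by, location):
        """Token like {E, B*C*D, loc(SB)}: an output of 'produced_by' destined
        for 'target', which requires every stage in 'required' before it runs."""
        self.pending[target][produced_by] = location
        if set(required) <= set(self.pending[target]):
            self.fire(target, self.pending.pop(target))

    def fire(self, target, inputs):
        # In a real system: discover a node for 'target' and hand it the
        # locations of its inputs. Here we just print.
        print(f"launch {target} with inputs {inputs}")

matcher = TokenMatcher()
matcher.receive("E", ["B", "C", "D"], "B", "loc(SB)")
matcher.receive("E", ["B", "C", "D"], "C", "loc(SC)")
matcher.receive("E", ["B", "C", "D"], "D", "loc(SD)")  # all present -> launch E
```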
Open Problems
• Support for global operations
– troubleshooting: what happened?
– monitoring: application progress?
– cleanup: the application died, clean up its state
• Load balance across different applications
– routing to guarantee dispersion
Summary
• Decentralizing Grids is a challenging problem
• Re-think systems, algorithms, protocols, and middleware => fertile research
• Keep our "eye on the ball"
– reliability, scalability, and maintaining performance
• Some preliminary progress on “point solutions”
My visit
• Looking to apply some of these ideas to existing UK projects via collaboration
• Current and potential projects
– Decentralized dataflow (Adam Barker)
– Decentralized applications: haplotype analysis (Andrea Christoforou, Mike Baker)
– Decentralized control: openKnowledge (Dave Robertson)
• Goal – improve reliability and scalability of applications and/or infrastructures
Questions
EXTRAS
Non-stationarity
• Nodes may suddenly shift gears
– deliberately malicious, virus, detach/rejoin
– the underlying reliability distribution changes
• Solution
– window-based rating – adapt/learn the target
• Experiment: blackout at round 300 (30% affected)
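A minimal sketch of a window-based rating, assuming the rating is simply the success fraction over the last W interactions so that a sudden shift (e.g. the blackout at round 300) is learned quickly; the window size and update rule are illustrative:

```python
from collections import deque

class WindowedRating:
    """Reputation rating = success fraction over the last `window` interactions."""
    def __init__(self, window=50):
        self.history = deque(maxlen=window)  # 1 = correct/on-time, 0 = failed

    def record(self, success):
        self.history.append(1 if success else 0)

    def rating(self):
        if not self.history:
            return 0.5  # neutral prior for a node with no history
        return sum(self.history) / len(self.history)

# A node that is reliable until a "blackout", then starts failing
r = WindowedRating(window=50)
for round_no in range(600):
    r.record(success=(round_no < 300))
print(round(r.rating(), 2))  # drops to 0.0 once the window passes the blackout
```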
Adapting …
Adaptive Algorithm
[Charts: success rate and throughput]

Scheduling Algorithms
Estimation Accuracy
• Download Elapsed Time Ratio (x-axis) is the ratio of the estimate to the real measured time
– '1' means a perfect estimate
• Accept if the estimate is within the range measured ± (measured * error)
– error = 0.33: 67% of the total are accepted
– error = 0.50: 83% of the total are accepted
• Objects: 27 (0.5 MB – 2 MB)
• Nodes: 130 on PlanetLab
• Download: 15,000 times from a randomly chosen node
Impact of Churn
• Jinoh – mean over what?
[Chart: Comparison of Elapsed Time (Candidate=8, Neighbor=8) – ratio to omniscient vs. number of queries (up to 5×10^4) for Without Churn, Churn 0.1%, Churn 0.5%, Churn 1.0%, with Global(Prox) mean and Random mean reference lines]
Estimating RTT
• We use distance = √(RTT+1)
• Simple RTT inference technique based on the triangle inequality
• Triangle inequality: Latency(a,c) ≤ Latency(a,b) + Latency(b,c)
– |Latency(a,b) - Latency(b,c)| ≤ Latency(a,c) ≤ Latency(a,b) + Latency(b,c)
• Pick the intersected area as the range, and take the mean
[Diagram: RTT bounds inferred via neighbors A, B, and C – each gives a lower and an upper bound; the final inference uses the intersected range]
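A minimal sketch of the triangle-inequality inference, assuming we know our RTT to each neighbor and each neighbor's RTT to the target; it intersects the per-neighbor bounds and returns the midpoint, as on the slides (the data layout is illustrative):

```python
def infer_rtt(neighbor_rtts):
    """Infer RTT(me, target) from neighbors.
    neighbor_rtts: list of (rtt_me_to_neighbor, rtt_neighbor_to_target) pairs.
    For each neighbor n: |RTT(me,n) - RTT(n,t)| <= RTT(me,t) <= RTT(me,n) + RTT(n,t).
    Intersect the ranges (max of lower bounds, min of upper bounds), return the midpoint."""
    lowers = [abs(a - b) for a, b in neighbor_rtts]
    uppers = [a + b for a, b in neighbor_rtts]
    low, high = max(lowers), min(uppers)
    if low > high:            # inconsistent measurements: fall back to the overlap-free midpoint
        low, high = high, low
    return (low + high) / 2.0

# Example: three neighbors, RTTs in milliseconds
print(infer_rtt([(20, 70), (35, 60), (10, 55)]))  # bounds intersect at [50, 65] ms -> 57.5
```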
RTT Inference Result
• More neighbors, greater accuracy
• With 5 neighbors, 85% of the total have < 16% error
[Chart: Inferred Latency Difference CDF – F(x) vs. latency difference (= |Inferred - Measured|, ms) for N = 2, 3, 5, 10 neighbors]
Other Constraints
[Diagram: workflow graph with components A, B, C, D, E; tokens {E, B*C*D}, {C, A, dep-CD}, {D, A, dep-CD}]
• C and D interact and should be co-allocated, nearby …
• The tokens in bold ({C, A, dep-CD} and {D, A, dep-CD}) should route to the same control point so a collective query for C and D can be issued
Support for Global Operations
• Troubleshooting – what happened?
• Monitoring – application progress?
• Cleanup – the application died, clean up its state
• Solution mechanism: propagate control node IPs back to the origin (=> the origin IP is piggybacked)
• Control nodes and matcher nodes report progress (or lack thereof, via timeouts) to the origin
• Load balance across different applications
Other Constraints
[Diagram: workflow graph with components A, B, C, D, E; tokens {E, B*C*D}, {C, A}, {D, A}]
• C and D interact and should be co-allocated, nearby …
Combining Neighbors’ Estimation
• MEDIAN shows the best results – using 3 neighbors, 88% of the time the error is within 50% (variation in download times is a factor of 10-20)
• 3 neighbors give the greatest bang
[Chart: Accepted Rate (acceptance 50%) vs. neighbor size (0–30) for RANDOM, CLOSEST, MEAN, MEDIAN, RANK, WMEAN, TRMEAN]
Effect of Candidate Size
[Chart: Impact of Candidate Size (Mix, N=8, Trial=25k) – mean elapsed time (sec) vs. candidate size (2, 4, 8, 16, 32, ALL) for OMNI, RANDOM, PROXIM, SELF, NEIGHBOR]
Performance Comparison
Parameters: data size 2 MB, replication 10, candidates 5
Computation Makespan (cont’d)
• Now bring in reliability … makespan improvement scales well
[Chart: makespan improvement vs. # components]
Token loss
• Tokens can be lost between B and the matcher, or between the matcher and the next stage
– the matcher must notify CB when the token arrives (pass loc(CB) with B's token)
– the destination (E) must notify CB when the token arrives (pass loc(CB) with B's token)
[Diagram: B's token flowing through the matcher to E, with client CB monitoring]
RTT Inference
• >= 90-95% of Internet paths obey the triangle inequality
– RTT(a, c) <= RTT(a, b) + RTT(b, c)
– upper bound: RTT(server, c) <= RTT(server, ni) + RTT(ni, c)
– lower bound: |RTT(server, ni) - RTT(ni, c)|
• Iterate over all neighbors to get max L (lower bound) and min U (upper bound)
• Return the mid-point