Decentralizing Grids
Jon Weissman, University of Minnesota
E-Science Institute, Nov. 8, 2007
Roadmap
• Background
• The problem space
• Some early solutions
• Research frontier/opportunities
• Wrap-up
Background
• Grids are distributed … but also centralized
– Condor, Globus, BOINC, Grid Services, VOs
– Why? client-server based
• Centralization pros
– Security, policy, global resource management
• Decentralization pros
– Reliability, dynamic, flexible, scalable
– Fertile CS research frontier
Challenges
• May have to live within the Grid ecosystem
– Condor, Globus, Grid services, VOs, etc.
– First-principles approaches are risky (Legion)
• 50K-foot view
– How to decentralize Grids yet retain their existing features?
– High performance, workflows, performance prediction, etc.
Decentralized Grid platform
• Minimal assumptions about each "node"
• Nodes have associated "assets" (A)
– basic: CPU, memory, disk, etc.
– complex: application services
– exposed interface to assets: OS, Condor, BOINC, Web service
• Nodes may go up or down
• Node trust is not a given (asked to do X, a node may do Y instead)
• Nodes may or may not connect to other nodes
• Nodes may be aggregates
• The Grid may be large (> 100K nodes), so scalability is key
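A minimal sketch (not from the talk) of what such a node abstraction might look like; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    """One overlay participant, with minimal assumptions about what it offers."""
    node_id: str
    assets: Dict[str, object] = field(default_factory=dict)  # basic: cpu/mem/disk; complex: app services
    interface: str = "OS"          # how assets are exposed: "OS", "Condor", "BOINC", "WebService"
    up: bool = True                # nodes may go up or down
    trust: float = 0.5             # trust is not a given; start from a neutral rating
    neighbors: List[str] = field(default_factory=list)       # overlay links (may be empty)
    members: Optional[List["Node"]] = None                   # set when this node is an aggregate

node = Node("umn-17",
            assets={"cpu_ghz": 2.4, "mem_gb": 4, "blast_service": True},
            interface="Condor", neighbors=["umn-03", "ucl-22"])
```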
Grid Overlay
[Diagram: an overlay connecting heterogeneous nodes – a Grid service, raw OS services, a Condor network, a BOINC network]

Grid Overlay - Join
[Diagram: a new node joining the overlay]

Grid Overlay - Departure
[Diagram: a node leaving the overlay]
Routing = Discovery
• The query contains sufficient information to locate a node: RSL, ClassAd, etc.
• Exact match or semantic match
[Animation: a "discover A" query is routed through the overlay until a matching node is found ("bingo!")]
• The discovered node returns a handle sufficient for the "client" to interact with it – perform service invocation, job/data transmission, etc.
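A minimal sketch of such discovery as a bounded walk over an unstructured overlay; the node/query representation, the matches predicate, and the handle format are assumptions of the sketch, not the talk's protocol:

```python
import random

def discover(start, query, matches, ttl=8):
    """Route a discovery query through the overlay until some node's assets
    satisfy it, or the TTL expires. Each node is a dict with "id", "assets",
    and "neighbors" (a list of node dicts). Returns a handle the client can
    use to reach the discovered node, or None."""
    node, visited = start, set()
    for _ in range(ttl):
        if matches(node["assets"], query):               # exact or semantic match
            return {"node": node["id"]}                  # handle: enough to invoke / transfer
        visited.add(node["id"])
        next_hops = [n for n in node["neighbors"] if n["id"] not in visited]
        if not next_hops:
            return None
        node = random.choice(next_hops)                  # forward the query onward
    return None

# Example: "discover A" where A is a CPU with at least 2 GB of memory
n2 = {"id": "n2", "assets": {"cpu": 1, "mem_gb": 4}, "neighbors": []}
n1 = {"id": "n1", "assets": {"cpu": 1, "mem_gb": 1}, "neighbors": [n2]}
meets = lambda assets, q: all(assets.get(k, 0) >= v for k, v in q.items())
print(discover(n1, {"cpu": 1, "mem_gb": 2}, meets))      # -> {'node': 'n2'}
```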
Routing = Discovery
• Three parties
– initiator of discovery events for A
– client: invocation, health of A
– node offering A
• Often the initiator and client will be the same
• Other times the client will be determined dynamically
– if W is a web service and results are returned to a calling client, we want to locate CW near W
– => discover W, then CW!
Routing = Discovery
[Animation: a "discover A" query hits a failed node (X), is re-routed, and eventually matches ("bingo!"); an outside client can also issue queries, e.g. "discover A's" for multiple assets]
Grid Overlay
• This generalizes …
– Resource query (the query contains job requirements)
– Looks like decentralized "matchmaking"
• These are the easy cases …
– independent simple queries
• find a CPU with characteristics x, y, z
• find 100 CPUs each with x, y, z
– suppose queries are complex or related?
• find N CPUs with aggregate power = G Gflops
• locate an asset near a previously discovered asset
Grid Scenarios
• Grid applications are more challenging
– An application has a more complex structure: multi-task, parallel/distributed, control/data dependencies
• an individual job/task needs a resource near a data source
• workflows
• queries are not independent
– Metrics are collective
• not simply raw throughput
• makespan
• response time
• QoS
Related Work
• Maryland/Purdue – matchmaking
• Oregon CCOF – time-zone
• CAN
Related Work (cont’d)
• None of these approaches addresses the Grid scenarios (in a decentralized manner)
– Complex multi-task data/control dependencies
– Collective metrics
50K-Ft Research Issues
• Overlay architecture
– structured, unstructured, hybrid
– what is the right architecture?
• Decentralized control/data dependencies
– how to do it?
• Reliability
– how to achieve it?
• Collective metrics
– how to achieve them?
Context: Application Model
[Diagram: an application graph producing an "answer"; legend: component / service request / job / task …, and data source]
Context: Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Context: Environment
• RIDGE project – ridge.cs.umn.edu
– reliable infrastructure for donation grid environments
• Live deployment on PlanetLab – planet-lab.org
– 700 nodes spanning 335 sites and 35 countries
– emulators and simulators
• Applications
– BLAST
– Traffic planning
– Image comparison
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Reliability Example
[Animation: workflow graph with components B, C, D, E, G]
• A client node CG is responsible for G's health
• The token G, loc(CG) is propagated (could also discover G and then CG)
• When the node running G fails (X), CG discovers a new node and re-launches G
Client Replication
[Animation: workflow graph with components B, C, D, E, G]
• Two client nodes, CG1 and CG2, monitor G
• loc(G), loc(CG1), loc(CG2) are propagated
• When one client fails (X), the client "hand-off" depends on the nature of G and the interaction
Component Replication
[Animation: workflow graph with components B, C, D, E, G]
• CG manages replicas of G, G1 and G2
Replication Research
• Nodes are unreliable – crash, hacked, churn, malicious, slow, etc.
• How many replicas?
– too many: a waste of resources
– too few: the application suffers
System Model
[Diagram: nodes with reputation ratings 0.9, 0.8, 0.8, 0.8, 0.8, 0.7, 0.7, 0.4, 0.4, 0.3, organized into replica groups]
• Reputation rating ri – degree of node reliability
• Dynamically size the redundancy based on ri
• Nodes are not connected to one another and check in to a central server
• Note: variable-sized groups
Reputation-based Scheduling
• Reputation rating
– techniques for estimating reliability based on past interactions
• Reputation-based scheduling algorithms
– use reliabilities for allocating work
– rely on a success threshold parameter
Algorithm Space
• How many replicas?
– first-fit, best-fit, random, fixed, …
– algorithms compute how many replicas are needed to meet a success threshold
• How to reach consensus?
– M-first (better for timeliness)
– Majority (better for Byzantine threats)
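A minimal sketch of first-fit replica sizing against a success threshold, assuming each reputation rating ri is treated as an independent probability of a correct result and that a majority vote decides the outcome; the function names and success model are illustrative, not the RIDGE algorithms:

```python
from itertools import combinations

def group_success_prob(ratings):
    """Probability that a majority of the replicas return the correct result,
    treating each rating as an independent success probability."""
    n = len(ratings)
    need = n // 2 + 1
    total = 0.0
    for k in range(need, n + 1):
        for correct in combinations(range(n), k):
            p = 1.0
            for i in range(n):
                p *= ratings[i] if i in correct else (1.0 - ratings[i])
            total += p
    return total

def first_fit_group(nodes, threshold):
    """Add replicas (first-fit, in the order nodes are offered) until the
    estimated probability of a correct majority meets the success threshold."""
    group = []
    for node_id, rating in nodes:
        group.append((node_id, rating))
        if group_success_prob([r for _, r in group]) >= threshold:
            return group
    return group  # threshold not reachable with the available nodes

# Example: nodes with ratings like those in the system-model figure
nodes = [("n1", 0.9), ("n2", 0.8), ("n3", 0.8), ("n4", 0.7), ("n5", 0.4)]
print(first_fit_group(nodes, threshold=0.95))
```

With M-first consensus one would instead stop collecting results once M matching answers arrive, trading Byzantine resilience for timeliness.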
Experimental Results: correctness
A simulation based on Byzantine behavior … majority voting
Experimental Results: timeliness
M-first (M=1), best BOINC (BOINC*), conservative (BOINC-) vs. RIDGE
Next steps
• Nodes are decentralized, but trust management is not!
• Need a peer-based trust exchange framework
– Stanford's EigenTrust project: local exchange until the network converges to a global state
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Collective Metrics
[Example application: BLAST]
• Throughput is not always the best metric
• Response, completion time, application-centric metrics
– makespan, response time
Communication Makespan
• Nodes download data from replicated data nodes
– Nodes choose "data servers" independently (decentralized)
– Minimize the maximum download time across all worker nodes (communication makespan)
• Data download dominates
Data node selection
• Several possible factors
– Proximity (RTT)
– Network bandwidth
– Server capacity
[Download Time vs. RTT - linear]
[Download Time vs. Bandwidth - exp]
[Chart: Mean Download Time / RTT – time (msec), 0–1000, vs. concurrency (1, 3, 5, 7, 10) for PlanetLab nodes flux, tamu, venus, ksu, ubc, wroc]
Heuristic Ranking Function
• Query to get candidates, then RTT/bandwidth probes
• For worker node i and data server node j:
– Cost function: cost(i,j) = rtt_{i,j} * exp(k_j / bw_{i,j}), where k_j reflects server load/capacity
• The least-cost data node is selected independently by each worker
• Three server selection heuristics that use k_j:
– BW-ONLY: k_j = 1
– BW-LOAD: k_j = n-minute average load (past)
– BW-CAND: k_j = number of candidate responses in the last m seconds (~ future load)
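A minimal sketch of the cost-based selection, assuming RTT, bandwidth, and the load term k_j are already known for each candidate data server; the data layout and values are illustrative:

```python
import math

def cost(rtt, bw, k):
    """Ranking function: rtt * exp(k / bw). Lower is better."""
    return rtt * math.exp(k / bw)

def select_data_server(candidates):
    """Each worker independently picks the least-cost data server.
    candidates: list of dicts with 'id', 'rtt', 'bw', and 'k' (load term)."""
    return min(candidates, key=lambda c: cost(c["rtt"], c["bw"], c["k"]))

# Example: three candidate data servers seen by one worker
candidates = [
    {"id": "serverA", "rtt": 0.040, "bw": 5.0,  "k": 1},  # BW-ONLY would use k = 1
    {"id": "serverB", "rtt": 0.120, "bw": 20.0, "k": 3},  # k from recent load
    {"id": "serverC", "rtt": 0.080, "bw": 2.0,  "k": 1},
]
print(select_data_server(candidates)["id"])  # -> serverA
```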
Performance Comparison
Computational Makespan
• Compute dominates: BLAST
Computational Makespan
[Chart: variable-sized vs. equal-sized*]
Next Steps
• Other makespan scenarios
• Eliminate probes for bandwidth and RTT -> estimation
• Richer collective metrics
– deadlines: user-in-the-loop
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
Data Dependence
• A data-dependent component needs access to one or more data sources
– the data may be large
[Diagram: a "discover A" query issued for a component that reads from data sources]

Data Dependence (cont'd)
[Diagram: "discover A" – where to run it?]
The Problem
• Where to run a data-dependent component?
– determine a candidate set
– select a candidate
• A candidate is unlikely to know its downstream bandwidth from particular data nodes
• Idea: infer bandwidth from neighbor observations with respect to data nodes!
Estimation Technique
• A candidate C1 may have had little past interaction with the data source – … but its neighbors may have
• For each neighbor, generate a download estimate:
– DT: the neighbor's prior download time to the data source
– RTT: from the candidate and from the neighbor to the data source, respectively
– DP: an average weighted measure of prior download times from any node to any data source
[Diagram: candidates C1 and C2, their neighbors, and the data source]
Estimation Technique (cont'd)
• Download Power (DP) characterizes the download capability of a node
– DP = average (DT * RTT)
– DT alone is not enough (a far-away vs. a nearby data source)
• Estimate associated with each neighbor ni:
– ElapsedEst[ni] = α ∙ β ∙ DT
• α: my_RTT / neighbor_RTT (to the data source)
• β: neighbor_DP / my_DP
• No active probes: historical data, RTT inference
• Combining neighbor estimates
– mean, median, min, …
– median worked the best
• Take the minimum over all candidate estimates
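A minimal sketch of the neighbor-based estimate, assuming each neighbor reports its prior download time (DT) to the data source, its RTT to the source, and its download power (DP); the field names and layout are illustrative:

```python
from statistics import median

def download_power(history):
    """DP = average(DT * RTT) over a node's prior downloads (any data source)."""
    return sum(dt * rtt for dt, rtt in history) / len(history)

def elapsed_estimate(my_rtt, my_dp, neighbors):
    """ElapsedEst[ni] = alpha * beta * DT, combined across neighbors by median.
    neighbors: list of dicts with the neighbor's 'rtt' (to the data source),
    'dp', and 'dt' (prior download time to the data source)."""
    estimates = []
    for n in neighbors:
        alpha = my_rtt / n["rtt"]   # adjust for my distance vs. the neighbor's
        beta = n["dp"] / my_dp      # adjust for relative download capability
        estimates.append(alpha * beta * n["dt"])
    return median(estimates)

def pick_candidate(candidates):
    """Select the candidate with the smallest estimated elapsed download time.
    candidates: list of (name, my_rtt, my_dp, neighbors) tuples."""
    return min(candidates,
               key=lambda c: elapsed_estimate(c[1], c[2], c[3]))[0]
```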
Comparison of Candidate Selection Heuristics
[Chart: Impact of Neighbor Size (Mix, N=8, Trial=50k) – mean elapsed time (sec) vs. neighbor size (2–32) for OMNI, RANDOM, PROXIM, SELF, NEIGHBOR; SELF uses direct observations]
Take Away
• Next steps
– routing to the best candidates
• Locality between a data source and a component
– scalable, no probing needed
– many uses
Application Models
• Reliability
• Collective metrics
• Data dependence
• Control dependence
The Problem
• How to enable decentralized control?
– propagate downstream graph stages
– perform distributed synchronization
• Idea:
– distributed dataflow – token matching
– graph forwarding, futures (Mentat project)
Control Example
[Animation: workflow graph with components B, C, D, E, G; a control node performs token matching]
• Tokens are forwarded with the graph, e.g. {C, G}, {D, G}, and {E, B*C*D} – the last meaning that E fires once the outputs of B, C, and D have all been matched
• Tokens also carry output locations, e.g. {E, B*C*D, loc(SB)}, {E, B*C*D, loc(SC)}, {E, B*C*D, loc(SD)}
– output is stored at loc(…): where the component is run, or at the client, or at a storage node
• Once the matching control node has collected the B, C, and D tokens, it launches E
• How to color and route tokens so that they arrive at the same control node?
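A minimal sketch of the token-matching idea, assuming a control node collects tokens of the form (target, required inputs, producer, output location) and fires the target stage once all required inputs have arrived; the representation is illustrative, not the talk's implementation:

```python
from collections import defaultdict

class TokenMatcher:
    """Control node: collects tokens and fires a stage when all inputs arrive."""
    def __init__(self):
        # target stage -> {producing stage: location of its stored output}
        self.pending = defaultdict(dict)

    def receive(self, target, required, produced_by, location):
        """Token like {E, B*C*D, loc(SB)}: an output of 'produced_by' destined
        for 'target', which requires every stage in 'required' before it runs."""
        self.pending[target][produced_by] = location
        if set(required) <= set(self.pending[target]):
            self.fire(target, self.pending.pop(target))

    def fire(self, target, inputs):
        # In a real system: discover a node for 'target' and hand it the
        # locations of its inputs. Here we just print.
        print(f"launch {target} with inputs {inputs}")

matcher = TokenMatcher()
matcher.receive("E", ["B", "C", "D"], "B", "loc(SB)")
matcher.receive("E", ["B", "C", "D"], "C", "loc(SC)")
matcher.receive("E", ["B", "C", "D"], "D", "loc(SD)")  # all present -> launch E
```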
Open Problems
• Support for global operations
– troubleshooting: what happened?
– monitoring: application progress?
– cleanup: the application died, clean up its state
• Load balance across different applications
– routing to guarantee dispersion
Summary
• Decentralizing Grids is a challenging problem
• Re-think systems, algorithms, protocols, and middleware => fertile research
• Keep our "eye on the ball"
– reliability, scalability, and maintaining performance
• Some preliminary progress on “point solutions”
My visit
• Looking to apply some of these ideas to existing UK projects via collaboration
• Current and potential projects
– Decentralized dataflow (Adam Barker)
– Decentralized applications: haplotype analysis (Andrea Christoforou, Mike Baker)
– Decentralized control: openKnowledge (Dave Robertson)
• Goal – improve reliability and scalability of applications and/or infrastructures
Questions
EXTRAS
Non-stationarity
• Nodes may suddenly shift gears
– deliberately malicious, virus, detach/rejoin
– the underlying reliability distribution changes
• Solution
– window-based rating – adapt/learn the target
• Experiment: blackout at round 300 (30% affected)
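A minimal sketch of a window-based rating, assuming the rating is simply the success fraction over the last W interactions so that a sudden shift (e.g. the blackout at round 300) is learned quickly; the window size and update rule are illustrative:

```python
from collections import deque

class WindowedRating:
    """Reputation rating = success fraction over the last `window` interactions."""
    def __init__(self, window=50):
        self.history = deque(maxlen=window)  # 1 = correct/on-time, 0 = failed

    def record(self, success):
        self.history.append(1 if success else 0)

    def rating(self):
        if not self.history:
            return 0.5  # neutral prior for a node with no history
        return sum(self.history) / len(self.history)

# A node that is reliable until a "blackout", then starts failing
r = WindowedRating(window=50)
for round_no in range(600):
    r.record(success=(round_no < 300))
print(round(r.rating(), 2))  # drops to 0.0 once the window passes the blackout
```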
Adapting …
Adaptive Algorithm
[Charts: success rate and throughput]

Scheduling Algorithms
Estimation Accuracy
• Download Elapsed Time Ratio (x-axis) is the ratio of the estimate to the real measured time
– '1' means a perfect estimate
• Accept if the estimate is within the range measured ± (measured * error)
– error = 0.33: 67% of the total are accepted
– error = 0.50: 83% of the total are accepted
• Objects: 27 (0.5 MB – 2 MB)
• Nodes: 130 on PlanetLab
• Download: 15,000 times from a randomly chosen node
Impact of Churn
• Jinoh – mean over what?
[Chart: Comparison of Elapsed Time (Candidate=8, Neighbor=8) – ratio to omniscient vs. number of queries (up to 5×10^4) for Without Churn, Churn 0.1%, Churn 0.5%, Churn 1.0%, with Global(Prox) mean and Random mean reference lines]
Estimating RTT
• We use distance = √(RTT+1)
• Simple RTT inference technique based on the triangle inequality
• Triangle inequality: Latency(a,c) ≤ Latency(a,b) + Latency(b,c)
– |Latency(a,b) - Latency(b,c)| ≤ Latency(a,c) ≤ Latency(a,b) + Latency(b,c)
• Pick the intersected area as the range, and take the mean
[Diagram: RTT bounds inferred via neighbors A, B, and C – each gives a lower and an upper bound; the final inference uses the intersected range]
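A minimal sketch of the triangle-inequality inference, assuming we know our RTT to each neighbor and each neighbor's RTT to the target; it intersects the per-neighbor bounds and returns the midpoint, as on the slides (the data layout is illustrative):

```python
def infer_rtt(neighbor_rtts):
    """Infer RTT(me, target) from neighbors.
    neighbor_rtts: list of (rtt_me_to_neighbor, rtt_neighbor_to_target) pairs.
    For each neighbor n: |RTT(me,n) - RTT(n,t)| <= RTT(me,t) <= RTT(me,n) + RTT(n,t).
    Intersect the ranges (max of lower bounds, min of upper bounds), return the midpoint."""
    lowers = [abs(a - b) for a, b in neighbor_rtts]
    uppers = [a + b for a, b in neighbor_rtts]
    low, high = max(lowers), min(uppers)
    if low > high:            # inconsistent measurements: fall back to the overlap-free midpoint
        low, high = high, low
    return (low + high) / 2.0

# Example: three neighbors, RTTs in milliseconds
print(infer_rtt([(20, 70), (35, 60), (10, 55)]))  # bounds intersect at [50, 65] ms -> 57.5
```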
RTT Inference Result
• More neighbors, greater accuracy
• With 5 neighbors, 85% of the total have < 16% error
[Chart: Inferred Latency Difference CDF – F(x) vs. latency difference (= |Inferred - Measured|, ms) for N = 2, 3, 5, 10 neighbors]
Other Constraints
[Diagram: workflow graph with components A, B, C, D, E; tokens {E, B*C*D}, {C, A, dep-CD}, {D, A, dep-CD}]
• C and D interact and should be co-allocated, nearby …
• The tokens in bold ({C, A, dep-CD} and {D, A, dep-CD}) should route to the same control point so a collective query for C and D can be issued
Support for Global Operations
• Troubleshooting – what happened?
• Monitoring – application progress?
• Cleanup – the application died, clean up its state
• Solution mechanism: propagate control node IPs back to the origin (=> the origin IP is piggybacked)
• Control nodes and matcher nodes report progress (or lack thereof, via timeouts) to the origin
• Load balance across different applications
Other Constraints
[Diagram: workflow graph with components A, B, C, D, E; tokens {E, B*C*D}, {C, A}, {D, A}]
• C and D interact and should be co-allocated, nearby …
Combining Neighbors’ Estimation
• MEDIAN shows the best results – using 3 neighbors, 88% of the time the error is within 50% (variation in download times is a factor of 10-20)
• 3 neighbors give the greatest bang
[Chart: Accepted Rate (acceptance 50%) vs. neighbor size (0–30) for RANDOM, CLOSEST, MEAN, MEDIAN, RANK, WMEAN, TRMEAN]
Effect of Candidate Size
[Chart: Impact of Candidate Size (Mix, N=8, Trial=25k) – mean elapsed time (sec) vs. candidate size (2, 4, 8, 16, 32, ALL) for OMNI, RANDOM, PROXIM, SELF, NEIGHBOR]
Performance Comparison
Parameters: data size 2 MB, replication 10, candidates 5
Computation Makespan (cont’d)
• Now bring in reliability … makespan improvement scales well
[Chart: makespan improvement vs. # components]
Token loss
• Tokens can be lost between B and the matcher, or between the matcher and the next stage
– the matcher must notify CB when the token arrives (pass loc(CB) with B's token)
– the destination (E) must notify CB when the token arrives (pass loc(CB) with B's token)
[Diagram: B's token flowing through the matcher to E, with client CB monitoring]
RTT Inference
• >= 90-95% of Internet paths obey the triangle inequality
– RTT(a, c) <= RTT(a, b) + RTT(b, c)
– upper bound: RTT(server, c) <= RTT(server, ni) + RTT(ni, c)
– lower bound: |RTT(server, ni) - RTT(ni, c)|
• Iterate over all neighbors to get max L (lower bound) and min U (upper bound)
• Return the mid-point