MOAT: A Multi-Object Assignment Toolkit
Haifeng Yu
Intel Research Pittsburgh / CMU
Joint work with:
Phillip B. Gibbons
Intel Research Pittsburgh
Haifeng Yu, Intel Research Pittsburgh / CMU
Background
Availability has become a principal design goal:
A 0.1% improvement is worth $2M / year for Amazon and eBay [internetweek.com]
One major focus of 8 OSDI’04 papers (out of 27)
Two orthogonal efforts:
Robustness of lower-level system components
Example: disk, individual machine, Internet routing
Higher-level redundancy
Example: data replication
This talk focuses on higher-level redundancy
High Availability via Replication
Large amounts of data accessed by many users:
Distributed file systems
Network monitoring (PIER, SDIMS, IRISLOG)
Index databases for search engine (Google, p2p)
Scientific / medical databases
Data replicated across multiple machines
Object: the unit of replication
File, file block, database table, database tuple, inverted index for a certain keyword
Multi-object Accesses
Many accesses request multiple objects:
Compiling a project
Writing a paper in LaTeX
Asking for aggregates of network conditions
Searching for web pages containing multiple keywords
Availability of a single object can be misleading:
An access requesting 1,000 objects can observe up to 1,000 times higher unavailability
There is more subtlety...
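The 1,000-times figure is a union bound. As a sketch, write u for the unavailability of a single object and n for the number of objects requested (both symbols are introduced here for illustration and do not appear on the slide):

```latex
\Pr[\text{some requested object is unavailable}]
  \;\le\; \sum_{i=1}^{n} \Pr[\text{object } i \text{ is unavailable}]
  \;=\; n\,u ,
\qquad
\text{and, if objects failed independently,}\quad
1 - (1-u)^{n} \;\approx\; n\,u \ \ \text{for } n\,u \ll 1 .
```

With n = 1,000 the bound is 1,000 u. How close an access gets to it depends on how the object failures are correlated.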
A Simple Example
Compile a small project with four files, each file has two replicas: A, A, B, B, C, C, D, D
Four machines, each holding two files, fail independently with the same probability
Which assignment gives better availability?
[Figure: two candidate assignments of the eight replicas to the four machines, (A B C D / A B C D) or (A B C D / A C B D); one is marked "Better"]
Assignment matters because the objects' availabilities are now correlated
A Simple Example - Continued
Suppose user is happy even if only three objects are available (e.g., when computing average)
[Figure: the same two candidate assignments, (A B C D / A B C D) or (A B C D / A C B D); one is marked "Better"]
Assignment makes a difference even if we are using the same machines (same amount of redundancy / resources)
The difference can easily be multiple nines
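The four-machine example can be checked by brute force. The sketch below assumes one concrete reading of the lost diagram. The layouts `ptn_like` and `mixed`, the function name, and the value p = 0.1 are all assumptions for illustration, not taken from the slides. In `ptn_like`, machines 1 and 3 both hold {A, B} and machines 2 and 4 both hold {C, D}. In `mixed`, no two machines hold the same pair.

```python
from itertools import product

def unavailability(machines, t, p):
    """Exact probability that fewer than t distinct files are reachable.

    machines: list of sets; machines[i] is the set of files on machine i.
    p: independent failure probability of each machine.
    """
    total = 0.0
    # Enumerate every up/down pattern of the machines.
    for pattern in product([True, False], repeat=len(machines)):
        prob = 1.0
        reachable = set()
        for is_up, held in zip(pattern, machines):
            prob *= (1 - p) if is_up else p
            if is_up:
                reachable |= held
        if len(reachable) < t:
            total += prob
    return total

# Hypothetical layouts (an assumed reading of the diagram).
ptn_like = [{"A", "B"}, {"C", "D"}, {"A", "B"}, {"C", "D"}]
mixed = [{"A", "B"}, {"C", "D"}, {"A", "C"}, {"B", "D"}]

p = 0.1  # assumed failure probability
for t in (4, 3):
    print(t, unavailability(ptn_like, t, p), unavailability(mixed, t, p))
```

Under these assumptions the partitioned layout has lower unavailability when all four files are needed (t = 4), and the mixed layout has lower unavailability when three files suffice (t = 3).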
Goal and Contributions
MOAT (Multi-Object Assignment Toolkit):
Goal: High availability for multi-object accesses
Key issue: Replica assignment
Contributions:
First to observe the importance of replica assignment
Strong theoretical results regarding best and worst assignments
Practical designs to approximate optimal assignments
MOAT toolkit implementation for replica assignments
Outline
Motivation and MOAT contributions
System model and case studies of existing systems
Theoretical results
Designs for approximating optimal assignments
Designs for mixed accesses
Conclusions
Assumptions for This Talk
Assume:
Replication (no erasure coding)
Crash failures (no Byzantine failures)
Eventual consistency (no quorum or voting)
Most of our results hold without these assumptions
Assume same replication degree for all objects
We have results for different replication degrees as well
Talk to me if interested in the more complete story...
MOAT Architecture Overview
[Figure: architecture diagram. Apps (file system, network monitoring, p2p DB, search engine) sit above the storage system. MOAT handles replication / repair / load balancing / naming / assignment over raw data on distributed machines or disks. Data API: obj create / delete / read / write. Control API: assignment policy.]
System Model
Basic system model:
N objects, each with k replicas
Load balancing among all machines
Machines fail independently with the same probability
An assignment is a mapping replica → machine, for all N × k replicas
[Figure: example assignment of the replicas of A, B, C, D]
Some Simple Assignments
PTN: partition assignment
Used in most practical deployments of Coda [Satyanarayanan et al.’90]
[Figure: PTN example for k = 2 with objects A B C D E F]
RAND: pick a random machine for each replica
Similar to the assignment in Google File System [Ghemawat et al.’03]
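A minimal sketch of the two assignments, assuming objects and machines are numbered from 0 and an assignment is a dict from object to its list of machines. The function names and the round-robin grouping are assumptions for illustration. This RAND sketch balances load only in expectation, while the system model assumes balanced load.

```python
import random

def ptn_assignment(num_objects, k, num_machines):
    """PTN: split machines into groups of k; all k replicas of an object
    go to the k machines of one group."""
    if num_machines % k != 0:
        raise ValueError("num_machines must be a multiple of k")
    num_groups = num_machines // k
    assignment = {}
    for obj in range(num_objects):
        g = obj % num_groups  # round-robin keeps the groups balanced
        assignment[obj] = [g * k + j for j in range(k)]
    return assignment

def rand_assignment(num_objects, k, num_machines, seed=0):
    """RAND: each object gets k distinct machines chosen uniformly at random."""
    rng = random.Random(seed)
    return {obj: rng.sample(range(num_machines), k)
            for obj in range(num_objects)}

print(ptn_assignment(6, 2, 6))
print(rand_assignment(6, 2, 6))
```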
Assignment in Chord [Stoica et al.’01]
DHTs: hash the machine's IP address to get the machine id
Assignment in Chord: sliding window
Neither PTN nor RAND
[Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, and C placed on the ring]
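The sliding-window placement can be sketched as follows, assuming the window starts at the first machine whose id is at or after the object's hash and then continues clockwise. The function name and the choice k = 2 are assumptions. The machine ids and hash(A) = 95 come from the slide.

```python
from bisect import bisect_left

def successor_window(machine_ids, key, k):
    """Return the k machines that follow `key` clockwise on the ring."""
    ring = sorted(machine_ids)
    # Index of the first id >= key; wraps to 0 past the largest id.
    start = bisect_left(ring, key) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(k)]

ring = [80, 90, 98, 101, 104, 120]
print(successor_window(ring, 95, 2))  # [98, 101]
```

Objects whose hashes fall on neighboring arcs get overlapping but not identical machine sets, which is why this placement is neither PTN nor RAND.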
Assignment in CAN [Ratnasamy et al.’01]
Hash the object k times; CAN uses a similar approach
Similar to RAND, but machines may hold slightly different numbers of objects
[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; hash1(A) = 95 and hash2(A) = 119 place the replicas of A; hash1(B) = 84 and hash2(B) = 100 place the replicas of B]
Which assignment should we use?
MOAT goal: improve availability of multi-object accesses
If an access requests n (n ≤ N) objects, what if only x are available?
Threshold-based success definition:
If x ≥ t, user happy → Available
If x < t, too low confidence → Unavailable
Availability for an access is defined as: Prob[at least t objects available out of n requested objects]
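The threshold definition can be sketched in code. The sketch assumes an assignment stored as a dict from object to its machines and a set of failed machines. The function names, the trial count, and the Monte Carlo estimator are assumptions for illustration, not part of MOAT.

```python
import random

def access_succeeds(requested, t, assignment, failed):
    """True if at least t of the requested objects have a live replica."""
    live = sum(
        1 for obj in requested
        if any(m not in failed for m in assignment[obj])
    )
    return live >= t

def estimate_availability(requested, t, assignment, machines, p,
                          trials=10000, seed=0):
    """Monte Carlo estimate of Prob[at least t requested objects available],
    with each machine failing independently with probability p."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        failed = {m for m in machines if rng.random() < p}
        ok += access_succeeds(requested, t, assignment, failed)
    return ok / trials

assignment = {"A": [0, 1], "B": [0, 1], "C": [2, 3], "D": [2, 3]}
print(estimate_availability(["A", "B", "C", "D"], 4, assignment, range(4), 0.2))
```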
Examples of t
t = n:
File systems
Search for terrorist images in image database
t close to n:
Query for top-10 most-loaded machines on PlanetLab
t not close to n:
Sample with confidence
Outline
Motivation and MOAT contributions
System model and case studies of existing systems
Theoretical results
Designs for approximating optimal assignments
Designs for mixed accesses
Conclusions
Formal Results
For an access requesting N objects
Theorem: Among all assignments, when t = N:
PTN is best (within constant)
RAND is worst (within constant)
The difference is about c-fold (c is the number of objects per machine)
Theorem: Among all assignments, when t = c+1 < N:
PTN is worst
RAND is best (within constant)
The difference is even larger
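A rough first-order argument for the c-fold gap at t = N, offered as a sketch rather than the proof behind the theorem. M (number of machines) and U (unavailability) are symbols introduced here. The argument assumes every PTN group holds at least one object and that N p^k is much smaller than 1. Under PTN the machines form M/k disjoint groups, and the access fails exactly when some group is entirely down. Under RAND each object is down with probability p^k, and a union bound over the N objects gives an estimate that is roughly tight for small p.

```latex
U_{\mathrm{PTN}} \;=\; 1 - \left(1 - p^{k}\right)^{M/k} \;\approx\; \frac{M}{k}\,p^{k},
\qquad
U_{\mathrm{RAND}} \;\le\; N\,p^{k},
\qquad
\frac{N\,p^{k}}{(M/k)\,p^{k}} \;=\; \frac{N k}{M} \;=\; c .
```

For example, with M = 400, k = 4, and p = 0.2, the exact PTN expression evaluates to about 0.148, while a single object's unavailability p^k is 0.0016.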
Numerical Examples (from Simulation)
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability vs. threshold for RAND (CAN), PTN, and Chord; annotations: "c times difference if p is small, where c is # obj/machine" and "unavail of single obj"]
A Spectrum of Assignments
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability vs. threshold, with curves labeled RAND (CAN) and PTN]
More Formal Arguments
Tradeoff is fundamental: impossible to achieve the best of both RAND and PTN
Previous results are only for an access requesting N objects
Similar results hold for accesses requesting n (n ≤ N) objects
But each machine may not be filled to capacity:
For PTN, use as few machines as possible
For RAND, use as many machines as possible
There is more; talk to me if you are interested
Access Requesting 500 Objects
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability vs. threshold for RAND (CAN), PTN, and Chord]
Outline
Motivation and MOAT contributions
System model and case studies of existing systems
Theoretical results
Designs for approximating optimal assignments
Designs for mixed accesses
Conclusions
Design of Replica Assignment
Trivial in a static / centralized environment
Challenging in a dynamic environment:
We may not have global knowledge when there are many objects and many machines
Basic solution: consistent hashing
But some redesign is necessary
Approximating RAND
Multi-hash DHT: hash the object k times, as in CAN
[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; replicas of A already placed; hash1(B) = 84 and hash2(B) = 100 place the two replicas of B]
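A minimal sketch of multi-hash placement on a ring, assuming each hash value maps to the first machine id at or after it. The salted SHA-256 hash, the ring size of 128, the function names, and the retry when two hashes land on the same machine are all assumptions. The slides do not specify them.

```python
import hashlib
from bisect import bisect_left

def ring_hash(name, salt, ring_size=128):
    """Hypothetical i-th hash function: SHA-256 of 'salt:name', reduced mod ring_size."""
    digest = hashlib.sha256(f"{salt}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % ring_size

def successor(ring, key):
    """First machine id >= key on a sorted ring, wrapping around."""
    return ring[bisect_left(ring, key) % len(ring)]

def multi_hash_placement(name, k, machine_ids, ring_size=128):
    """Hash the object repeatedly until k distinct machines are found."""
    ring = sorted(machine_ids)
    if k > len(ring):
        raise ValueError("k exceeds the number of machines")
    placed, salt = [], 0
    while len(placed) < k:
        m = successor(ring, ring_hash(name, salt, ring_size))
        if m not in placed:  # assumed policy: rehash on collision
            placed.append(m)
        salt += 1
    return placed

ring = [80, 90, 98, 101, 104, 120]
print(successor(ring, 84), successor(ring, 100))  # 90 101
print(multi_hash_placement("B", 2, ring))
```

With the slide's values, hash1(B) = 84 and hash2(B) = 100 would map to machines 090 and 101 under this successor rule.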
Approximating PTN
Chord does not achieve PTN
[Figure: Chord ring with machine ids 080, 090, 098, 101, 104, 120; hash(A) = 95; replicas of A, B, and C spread over neighboring machines]
Approximating PTN
Chord does not achieve PTN
Group DHT: (arbitrarily) group machines into groups of size k
[Figure: ring of machine groups with ids 090, 101, 120; hash(A) = 95; objects A, B, C assigned to groups]
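A sketch of group-based placement, assuming each group has a single ring id and an object goes to the first group whose id is at or after the object's hash. The machine names `m1` to `m6` and the function name are hypothetical. The group ids and hash(A) = 95 come from the slide.

```python
from bisect import bisect_left

def group_dht_placement(key, groups):
    """groups: dict mapping a group's ring id to its k member machines.
    All k replicas of the object go to the members of one group."""
    ring = sorted(groups)
    gid = ring[bisect_left(ring, key) % len(ring)]
    return groups[gid]

# Hypothetical groups of size k = 2.
groups = {90: ["m1", "m2"], 101: ["m3", "m4"], 120: ["m5", "m6"]}
print(group_dht_placement(95, groups))  # ['m3', 'm4']
```

Because every object in a group shares exactly the same k machines, the resulting assignment has the PTN structure.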
Node Join and Leave in Group DHT
Maintain r rendezvous points in the DHT
Diminished Chord [Karger et al.’04] / ReDiR [Karp et al.’04]
New node reports to a random rendezvous point
If a group can be formed, it joins the DHT
Two options upon node leave:
Dismiss the group and delete it from the DHT
The group waits to recruit a new node
Groups use the rendezvous point to decide
Complexity Analysis
Metric | Standard DHT | Group DHT
Routing state | log N | log N/k
Routing hops | log N | log N/k
Messages / Join | (log N)^2 | log N/k + (log N/k)^2 / k
Messages / Leave | (log N)^2 | log N/k + (log N/k)^2 / k
Obj moves / Join | k/N | k/N
Obj moves / Leave | k/N | 2k/N
Outline
Motivation and MOAT contributions
System model and case studies of existing systems
Theoretical results
Designs for approximating optimal assignments
Designs for mixed accesses
Conclusions
Mixture of Queries
Previous designs only cover a single access requesting all N objects:
PTN if t close to N
RAND if t far from N
But there are other accesses:
Requests for n (n < N) objects with threshold t
How does t change with n? Infinite possibilities
We focus on 4 large categories
Four Application Scenarios
Scenario | small accesses (small n) | large accesses (large n)
File system | strict | strict
Computing aggregates | loose | loose
Network monitoring | strict (pinpoint problems) | loose (overview query)
Image database search | loose (resource retrieval of frequent objects, e.g., find clip art for slide) | strict (non-existence test, e.g., exhaustive search of terrorist)
Strict accesses: t ≈ n
Loose accesses: t < n
Loose for both small and large n
Goal: Approach RAND for both small and large n
Design: Multi-hash DHT
[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; replicas of A placed; hash1(B) = 84 and hash2(B) = 100 place the replicas of B]
Loose for small n; Strict for large n
Goal:
Approach RAND for small n
Approach PTN for large n
Design: Group DHT
[Figure: ring of machine groups with ids 090, 101, 120 holding objects A, B, C]
Strict for both small and large n
Goal: Approach PTN for both small and large n
Assume accesses are tree accesses
Design: Group DHT with item-balancing [Karger et al.’04]
[Figure: ring of machine groups with ids 090, 101, 120; A = 95; objects A, B, C assigned to groups]
Strict for small n; Loose for large n
Goal:
Approach PTN for n < R
Approach RAND for n >> R
Design: Multi-hash DHT, but cluster objects into clusters of constant size R
[Figure: ring with machine ids 080, 090, 098, 101, 104, 120; objects A and B form one cluster; hash1(AB) = 84 and hash2(AB) = 100 place both replicas of A and B together]
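A sketch of clustered multi-hash placement. It assumes objects are numbered and that consecutive indices form a cluster of size R, a clustering rule the slides do not specify. The salted SHA-256 hash, ring size, and function names are also assumptions. The cluster, not the individual object, is hashed k times, so all objects in a cluster share the same machines.

```python
import hashlib
from bisect import bisect_left

def cluster_of(obj_index, R):
    """Assumed clustering rule: R consecutive object indices per cluster."""
    return obj_index // R

def clustered_placement(obj_index, R, k, machine_ids, ring_size=128):
    """Place an object on the k machines chosen by hashing its cluster id."""
    ring = sorted(machine_ids)
    if k > len(ring):
        raise ValueError("k exceeds the number of machines")
    cluster = cluster_of(obj_index, R)
    placed, salt = [], 0
    while len(placed) < k:
        digest = hashlib.sha256(f"{salt}:cluster-{cluster}".encode()).digest()
        key = int.from_bytes(digest[:8], "big") % ring_size
        m = ring[bisect_left(ring, key) % len(ring)]
        if m not in placed:  # assumed policy: rehash on collision
            placed.append(m)
        salt += 1
    return placed

ring = [80, 90, 98, 101, 104, 120]
print(clustered_placement(0, 2, 2, ring), clustered_placement(1, 2, 2, ring))
```

An access that stays within one cluster sees a PTN-like layout, while an access spanning many clusters sees clusters scattered as in RAND.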
Simulation Results for Strict Accesses
400 machines, fail prob = 0.2, 40,000 obj, 4 replicas / obj
Here an access needs all n objects to be successful
[Figure: unavailability vs. number (n) of objects requested by an access]
Simulation Results for Loose Accesses
400 machines, fail prob = 0.2, 40,000 obj, 4 replicas / obj
Here an access needs only t = n - 150 objects to be successful
[Figure: unavailability vs. number (n) of objects requested by an access]
Current Status
Waiting for paper deadlines
Finishing implementing MOAT
Evaluation on IrisLog trace and file system traces
Related Work
Multi-object accesses rarely addressed:
CFS [Dabek et al.’01] focuses on individual file blocks
Chain replication [Renesse et al.’04] considers single data object
A long list ...
Replica assignment largely ignored:
Different DHTs (e.g., Chord, Pastry, CAN) use dramatically different replica assignments
Effects not understood / studied
Replica placement [Douceur et al.’01, Li et al.’99, Qiu et al.’01, Venkataramani et al.’01, Yu et al.’04] well studied:
Typically for machines in different locations in the network
Machines are heterogeneous
Approaches do not apply to replica assignment
Conclusions
Availability becoming key design goal
Multi-object access availability dramatically different from single-object availability
MOAT Contributions:
First to observe the importance of replica assignment
Strong theoretical results regarding the best and worst assignments
Practical designs to approximate optimal assignments
MOAT toolkit implementation
My Other Recent Work
Om [NSDI’04]: Consistent and automatic replica regeneration
Regenerate from any single replica rather than a majority
Signed quorum systems [PODC’04]: Constant quorum size at the cost of small prob of inconsistency
Node failure characteristics in WAN [WORLDS’04]: Answer subtle questions regarding real-world failure properties
Erasure Coding
Encode the object into k fragments such that any m (m < k) of the k fragments can reconstruct the object
RAID techniques are special cases
Replication is a special case where m = 1
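A short sketch of single-object availability under m-out-of-k coding, assuming each fragment sits on a distinct machine and machines fail independently with probability p. The function name and the example value p = 0.2 are assumptions for illustration.

```python
from math import comb

def object_availability(k, m, p):
    """Probability that at least m of the k fragments are on live machines."""
    return sum(comb(k, i) * (1 - p) ** i * p ** (k - i)
               for i in range(m, k + 1))

# Replication is the m = 1 case: availability is 1 - p**k.
print(object_availability(4, 1, 0.2))
# An erasure-coded object with the same number of fragments but m = 2.
print(object_availability(4, 2, 0.2))
```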
Example Revisited
Need four files to compile:
[Figure: the two candidate assignments from the simple example, (A B C D / A B C D) or (A B C D / A C B D); one is marked "Better"]
Can we treat A, B, C, D as a single obj and use erasure coding?
So that all files can be reconstructed from any 4 out of 8 fragments
Erasure coding is hard to apply across a large amount of data:
Updating any portion of the data requires updating k - m + 1 fragments (the size of original data)
We cannot use erasure coding across 1,000 files
Threshold Semantics and Erasure Coding
Threshold Semantics | Erasure Coding
need t out of n objects to answer query | need m out of k fragments to reconstruct object
t determined by app semantics | m determined at coding time
result dependent on which t objects | same result regardless of which m fragments
may update single object by itself | modification to any portion of the object needs to update k-m+1 fragments
In short, they are different, orthogonal concepts
Numerical Examples (from Simulation)
40,000 objects, 4 replicas each, 400 machines, fail prob = 0.2
[Figure: unavailability vs. threshold for RAND (CAN), CRAND (10), CRAND (100), PTN, and Chord; annotation: "c times difference if p is small, where c is # obj/machine"]