Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store
Wei Xie, TTU CS Department Seminar, 3/7/2017
Outline
❖ General introduction
❖ Study 1: Elastic Consistent Hashing based Store
❖ Motivation and related work
❖ Design
❖ Evaluation
❖ Study 2: Reducing Failure-recovery Cost in CH based Store
❖ Motivation and related work
❖ Design
❖ Evaluation
❖ Conclusion
Big Data Storage
❖ Growing data-intensive (big data) applications
❖ Large data volumes (hundreds of TBs, PBs, even EBs), thousands of CPUs accessing data
❖ Cluster computers (supercomputers, data centers, cloud infrastructure)
[Figure: a cluster computer]
1 PB = 1,000,000,000 MB; 1 EB = 1,000,000,000,000 MB (i.e., 10^12 MB)
Big Data Examples
❖ Science: the Large Hadron Collider (LHC)
❖ 1 PB of data per second, 15 PB of filtered data per year, 160 PB of disk
❖ Search engine: Yahoo used 1,500 nodes for 5 PB of data (2008)
Scalability of Storage
❖ To store large volumes of data, the scalability of the data store software is critical
❖ Scalability means: “performance improvement achieved by increasing the number of servers”
❖ Popular systems like Hadoop Distributed File System (HDFS) scale to 10,000 nodes
❖ Performance hits a bottleneck at the metadata servers
Metadata Server Bottleneck
❖ With many data nodes (DNs), HDFS has a performance bottleneck at the name-node
❖ The name-node needs very large memory capacity to store metadata
❖ Querying/updating the name-node from many concurrent clients degrades performance
Getting Rid of the Metadata Server
❖ Consistent hashing
❖ Uses a hash function to map data to DNs
❖ No need to update metadata server
❖ Much smaller memory footprint
❖ 10X increase in scale (Ceph)
[Figure: a hash function maps data ID=1 directly to node ID=101 among the data nodes, with no metadata server lookup]
Consistent Hashing
[Figure: keys D1, D2, D3 and servers 1, 2, 3 are hashed onto a ring covering the hash space 0 to 2^160, with keys assigned to the next server clockwise. Left: server 1 holds D1, server 2 holds D2, server 3 holds D3. Right: after server 4 joins, server 4 holds D1, servers 2 and 3 are unaffected, and server 1 holds nothing.]
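To make the lookup concrete, here is a minimal Python sketch of consistent hashing, assuming a 160-bit SHA-1 ring as on the slide; the class and method names are illustrative, not the implementation described in the talk.

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Illustrative consistent hashing ring (names are assumptions)."""
    def __init__(self, nodes):
        # Place each node on the ring at hash(node_id).
        self.ring = sorted(((self._hash(str(n)), n) for n in nodes),
                           key=lambda t: t[0])
        self.positions = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        # 160-bit SHA-1 gives the 0..2^160 hash space shown on the slide.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def lookup(self, data_id):
        # Walk clockwise: the first node at or after hash(data_id) owns it.
        idx = bisect_right(self.positions, self._hash(str(data_id)))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing([1, 2, 3])
print(ring.lookup("D1"))       # current owner of D1
ring = HashRing([1, 2, 3, 4])  # adding node 4 remaps only keys near it
```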
Challenges with CH
❖ Modern large-scale data store challenges
❖ Scalability
❖ Manageability
❖ Performance
❖ Power consumption
❖ Fault tolerance
❖ We observe and investigate two problems with CH, in terms of power consumption and fault tolerance
Outline
❖ General introduction
❖ Study 1: Elastic Consistent Hashing based Store
❖ Motivation and related work
❖ Design
❖ Evaluation
❖ Study 2: Reducing Failure-recovery Cost in CH based Store
❖ Motivation and related work
❖ Design
❖ Evaluation
❖ Conclusion
Background: Elastic Data Store for Power Saving
❖ Elasticity: the ability to resize the storage cluster as workload varies (more servers means better performance but higher power consumption)
❖ Benefits
❖ Re-use storage nodes for other purposes
❖ Save machine hours (operating cost)
❖ Most distributed storage systems are not elastic
❖ GFS and HDFS
❖ Deactivating servers may make data unavailable
Agility is Important
❖ Agility determines how many machine hours can be saved
Non-elastic Data Layout
• A typical pseudo-random data layout, as seen in most CH-based distributed file systems
• Almost all servers must be “on” to ensure 100% availability
• No elastic resizing capability
Elastic Data Layout
❖ General rule
❖ Take advantage of replication
❖ Always keep the first (primary) replicas “on”
❖ The other replicas can be activated on demand
Primary Server Layout
❖ Peak write performance: N/3 (same as non-elastic)
❖ Limited scaling to N/3 only
Equal-work Data Layout
Primary-server Layout with CH
❖ Modifies the data placement of original CH so that one replica is always placed on a primary server
❖ To achieve the equal-work layout, the cluster must be configured accordingly
[Figure: a CH ring with servers 1-10 and data objects D1, D2. Legend: primary server (always active), secondary server (active), secondary server (inactive). Placement of the first replica skips secondary servers; placement of the other replicas skips primary and inactive servers.]
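A minimal sketch of the skip rule in the figure, assuming each server record carries an is_primary flag (the Server type and function names here are illustrative, not Sheepdog's code): the first replica skips secondary servers so it always lands on a primary, while later replicas skip primaries.

```python
import hashlib
from bisect import bisect_right
from dataclasses import dataclass

@dataclass(frozen=True)
class Server:
    sid: int
    is_primary: bool

def _hash(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

def place_replicas(servers, data_id, num_replicas=3):
    # Build the ring; assumes enough primaries and secondaries exist.
    ring = sorted(((_hash(str(s.sid)), s) for s in servers),
                  key=lambda t: t[0])
    positions = [p for p, _ in ring]
    i = bisect_right(positions, _hash(str(data_id)))
    replicas = []
    while len(replicas) < num_replicas:
        s = ring[i % len(ring)][1]
        want_primary = (len(replicas) == 0)  # first replica -> primary
        if s.is_primary == want_primary and s not in replicas:
            replicas.append(s)
        i += 1  # otherwise skip this server and keep walking clockwise
    return replicas

servers = [Server(n, is_primary=(n <= 3)) for n in range(1, 11)]
print(place_replicas(servers, "D1"))
```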
Equal-work Data Layout
❖ Number of data chunks on a primary server: $v_{\text{primary}} = B/p$
❖ Number of data chunks on the secondary server of rank $i$: $v_{\text{secondary},i} = B/i$
❖ (B: total number of data blocks; p: number of primary servers)
[Figure: data distribution, number of data blocks (×10^4) vs. server rank 1-10, for Version 1 (10 active), Version 2 (8 active), Version 3 (10 active), and the data to migrate between versions]
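A tiny sketch of these block counts, under the assumptions stated above (B total blocks, p primary servers, servers ranked 1..N); a reading of the equal-work layout, not the exact configuration used in the talk.

```python
def equal_work_layout(B, p, N):
    """Blocks per server under the equal-work layout sketched above."""
    counts = {}
    for rank in range(1, N + 1):
        # Primaries each hold B/p blocks; the secondary at rank i holds
        # B/i blocks, so any active prefix of servers does equal work.
        counts[rank] = B / p if rank <= p else B / rank
    return counts

# Example: 100,000 blocks, 2 primaries, 10 servers total.
print(equal_work_layout(B=100_000, p=2, N=10))
```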
Contribution Summary
❖ Primary data placement/replication scheme with consistent hashing
❖ Achieves the primary-secondary data layout for elasticity
❖ Slight modification to existing consistent hashing
❖ Preserves the properties of consistent hashing
Data Re-integration
❖ After a node is turned off, no data is written to it. When the node joins again, newly created or modified data may need to be re-integrated to it.
❖ However, the data store does not know which data was modified or newly created, so it has to transfer all data that should be placed on the newly joined node.
Data Re-integration
❖ Data re-integration incurs many I/O operations and degrades performance when scaling up
❖ 3-phase workload: high load -> low load -> high load
❖ No resizing: 10 servers always on
❖ With resizing: 10 servers -> 2 servers -> 10 servers
[Figure: I/O throughput (MB/s) over time (seconds) under original consistent hashing, comparing "with resizing" and "no resizing"; markers show where Phase 1 and Phase 2 end]
Our Contribution
❖ Selective background re-integration
❖ A dirty table tracks all OIDs that are dirty
❖ When re-integration finishes, the OID is removed from the table
❖ The rate of re-integration is controlled
[Figure: a resizing example across membership versions 9, 10, and 11 on a ring of nodes 1-10. A membership table tracks each node's on/off state; a dirty table records the OID and version of objects written while their secondary servers were off. When nodes rejoin after resizing, dirty objects are re-integrated in dirty-table order and then removed from the table; by version 11 all dirty data in the table has been re-integrated and marked clean.]
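Since the next slide notes that dirty-data tracking is implemented with Redis, here is a hedged sketch of what that tracking could look like using redis-py; the key name, schema, and function names are assumptions, not the actual Sheepdog integration.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance

def mark_dirty(oid: int, version: int):
    # Record the OID and the membership version at which it was written
    # while its secondary copy sat on an inactive node.
    r.hset("dirty_table", oid, version)

def reintegrate(transfer):
    # Background re-integration: copy each dirty object to its rejoined
    # node, then drop it from the table (rate limiting omitted for brevity).
    for oid, version in r.hgetall("dirty_table").items():
        transfer(int(oid), int(version))
        r.hdel("dirty_table", oid)
```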
Implementation
❖ Primary-secondary data placement/replication implemented in Sheepdog
❖ Dirty data tracking implemented using Redis
Evaluation
❖ 3-phase workload test
❖ T: deadline for background re-integration
❖ Rate: data transfer rate for background re-integration
❖ Performance is significantly improved with selective background re-integration
[Figure: I/O throughput (MB/s) over time (seconds) for Sel+backg (T=2, Rate=200), Sel+backg (T=4, Rate=200), Sel+backg (T=6, Rate=200), Selective, Original CH, and No-resizing; a high rate delays resizing, as marked between the ends of Phase 1 and Phase 2]
Large-scale Trace Analysis
❖ Use the Cloudera trace
❖ Apply our policy and analyze the effect of resizing
[Figure: number of servers over time (minutes) for the CC-a and CC-b traces, comparing Ideal, Original CH, Primary+aggressive, and Primary+background]
Summary
❖ We propose a primary-secondary data placement/replication scheme to provide better elasticity in a consistent hashing based data store
❖ We use a selective background data re-integration technique to reduce the I/O footprint when re-integrating nodes into a cluster
❖ This is the first work studying elasticity for power saving in a consistent hashing based store
Outline
❖ General introduction
❖ Study 1: Elastic Consistent Hashing based Store
❖ Motivation and related work
❖ Design
❖ Evaluation
❖ Study 2: Reducing Failure-recovery Cost in CH based Store
❖ Motivation and related work
❖ Design
❖ Evaluation
❖ Conclusion
Fault-tolerance and Self-healing
❖ Replication for tolerating failures
❖ When a node fails, a self-healing system can recover the lost data by itself, without administrator intervention
[Figure: keys and servers hashed onto the ring (0 to 2^160); when server 2 fails, D2's second replica is migrated to server 3 automatically]
Motivation
❖ Even though CH is able to self-heal from failures, the cost of recovery is large (data transfers)
❖ If self-healing is simply delayed, the risk of data loss can be large
❖ Use a different data layout to delay healing as much as possible
❖ Determine when it is OK to delay self-healing and when it is not
Motivation
❖ Pseudo-random replication has low tolerance for multiple concurrent failures
❖ Losing even one server can put data in danger
Primary Replication
❖ Same as the one used in Elastic Consistent Hashing
❖ As long as the primary replicas are available, there is no danger of losing data
Data Recovery Strategy
❖ Aggressive recovery: as soon as a node fails, recovery starts transferring data
❖ Lazy recovery: as long as a node failure does not incur much risk of data loss, the data transfer is delayed
❖ Need a metric to quantify the risk of losing data
Determine Recovery Strategy
❖ Minimum Replication Level (MRL)
❖ The smallest number of replicas that any data object still has
❖ A larger MRL means more failures can be tolerated
❖ Set a threshold on MRL; when MRL drops below the threshold, aggressive recovery is used (see the sketch below)
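A minimal sketch of the MRL-driven decision, assuming a helper that counts each object's surviving replicas (all names here are illustrative, not the paper's code):

```python
def minimum_replication_level(objects, live_replicas):
    # MRL = the smallest number of live replicas any object still has.
    return min(live_replicas(obj) for obj in objects)

def choose_recovery(objects, live_replicas, threshold=2):
    # Below the threshold the risk of data loss is high: recover now.
    mrl = minimum_replication_level(objects, live_replicas)
    return "aggressive" if mrl < threshold else "lazy"
```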
Measuring MRL in CH
❖ MRL can be easily calculated in a consistent hashing based data store
[Figure: measuring MRL on the ring (servers 1-10; legend: primary server, secondary server, data object, failed primary server, failed secondary server, committed fail node (c), uncommitted fail node (u)). Case (1): primary server 3 fails, 3 replicas active, MRL=3, aggressive. Case (2): servers 5, 6, and 10 fail, 2 replicas active, MRL=2, lazy. Case (3): servers 4, 6, and 10 fail; with the failures uncommitted, 1 replica active, MRL=1, aggressive; with the failures committed, MRL=3, lazy.]
Analysis with MSR Trace
❖ MSR trace: a 1-week I/O trace from Microsoft Research servers
❖ Insert recovery periods into the trace under the two recovery strategies
[Figure: IOPS over time (hours) for the MSR trace with recovery periods marked, under aggressive recovery (left) and lazy recovery (right)]
Evaluation
❖ Primary-secondary replication and lazy recovery are simulated within libch-placement, a consistent hashing library
❖ Failures are generated using a Weibull distribution
❖ The simulated failure and recovery data is inserted into the MSR trace and replayed on a Sheepdog client
❖ The primary + lazy recovery strategy improves I/O performance when a failure occurs
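For illustration, failure inter-arrival times could be drawn from a Weibull distribution like this; the shape and scale values below are assumptions, not the parameters fitted in the evaluation.

```python
import numpy as np

rng = np.random.default_rng(42)
shape, scale_hours = 0.7, 1000.0      # assumed parameters
# NumPy samples the standard Weibull; multiply by the scale parameter.
gaps = scale_hours * rng.weibull(shape, size=100)
failure_times = np.cumsum(gaps)       # hours at which node failures occur
```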
[Figure: I/O rate (MB/s) over time (hours) replaying the MSR trace, comparing primary-secondary and random replication around two failure events (hours 111-117 and 124-130)]
Summary
❖ We leverage the primary-secondary replication scheme, in place of random replication, to tolerate multiple concurrent failures
❖ We use the MRL metric to determine the risk of data loss and choose the data recovery strategy
❖ With our replication scheme and recovery strategy, the I/O footprint after a node failure is significantly reduced
Conclusion
❖ Consistent hashing based stores are promising but have limited functionality
❖ We provide some initial insight into how to enhance consistent hashing to offer functionalities that are important in modern data stores, such as fault tolerance and elasticity
❖ There is much more to be explored
Questions!
Welcome to visit our website for more details.
DISCL lab: http://discl.cs.ttu.edu/
Personal site: https://sites.google.com/site/harvesonxie/