Transcript of "Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store", Wei Xie, TTU CS Department Seminar, 3/7/2017

Page 1:

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store

Wei Xie, TTU CS Department Seminar, 3/7/2017

1

Page 2:

Outline

❖ General introduction

❖ Study 1: Elastic Consistent Hashing based Store

❖ Motivation and related work

❖ Design

❖ Evaluation

❖ Study 2: Reducing Failure-recovery Cost in CH based Store

❖ Motivation and related work

❖ Design

❖ Evaluation

❖ Conclusion

2

Page 3:

Big Data Storage

❖ Growing data-intensive (big data) applications

❖ Large data volumes (hundreds of TBs, PBs, even EBs); thousands of CPUs accessing the data

❖ Cluster computers (supercomputers, data centers, cloud infrastructure)

A Cluster Computer

3

1 PB = 1,000,000,000 MB (10^9 MB); 1 EB = 1,000,000,000,000 MB (10^12 MB)

Page 4:

Big Data Examples

❖ Science: the Large Hadron Collider (LHC)

❖ 1PB of data per second, 15PB of filtered data per year, 160PB of disk

❖ Search engines: Yahoo used 1,500 nodes for 5PB of data as of 2008

4

Page 5:

Scalability of Storage

❖ To store large volumes of data, the scalability of the data store software is critical

❖ Scalability means: "performance improvement achieved by increasing the number of servers"

❖ Popular systems like the Hadoop Distributed File System (HDFS) scale to 10,000 nodes

❖ Performance hits a bottleneck at the metadata servers

5

Page 6:

Metadata Server Bottleneck

❖ With many data nodes (DNs), HDFS has a performance bottleneck at the name-node

❖ It needs very large capacity to store the metadata

❖ Querying/updating the name-node with many concurrent clients degrades performance

6

Page 7:

Getting Rid of the Metadata Server

❖ Consistent hashing

❖ Use a hash function to map data to DNs

❖ No need to update a metadata server

❖ Much smaller memory footprint

❖ 10X increase in scale (Ceph)

[Diagram: a hash function maps data ID=1 directly to node ID=101 among the data nodes]

7
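The idea on this slide can be sketched with a naive hash-based lookup. The node names and data ID below are hypothetical; this simple modulo scheme already removes the metadata server from the lookup path, and the consistent hashing introduced on the next slide fixes its main flaw (most data remaps when the node list changes):

```python
import hashlib

def node_for(data_id: str, nodes: list[str]) -> str:
    """Map a data ID to a node purely by hashing: any client can compute
    the owner locally, so no metadata server has to be queried."""
    digest = int(hashlib.sha1(data_id.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Hypothetical node names; every client computes the same owner for data ID "1".
nodes = ["node-0", "node-11", "node-101", "node-304"]
owner = node_for("1", nodes)
```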

Page 8:

Consistent Hashing

[Diagram: keys D1, D2, D3 and servers 1, 2, 3 are hashed onto the same ring covering [0, 2^160); each server holds the keys in its partition: 1 holds D1, 2 holds D2, 3 holds D3. After server 4 joins the ring, only D1 moves: 4 holds D1, 2 holds D2, 3 holds D3, and 1 holds nothing.]

8
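A minimal sketch of the ring on this slide (SHA-1 stands in for the 160-bit hash space; server and key names are illustrative). The property the figure shows, that adding server 4 moves only the keys that fall into its new partition while every other key keeps its owner, holds for any key:

```python
import bisect
import hashlib

def h(key: str) -> int:
    # Hash into [0, 2^160), the ring range shown on the slide.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent hashing ring: each key is owned by the first
    server clockwise from the key's hash position."""
    def __init__(self, servers):
        self.points = sorted((h(s), s) for s in servers)

    def add(self, server):
        bisect.insort(self.points, (h(server), server))

    def lookup(self, key):
        positions = [pos for pos, _ in self.points]
        i = bisect.bisect_right(positions, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["server-1", "server-2", "server-3"])
before = {k: ring.lookup(k) for k in ("D1", "D2", "D3")}
ring.add("server-4")
after = {k: ring.lookup(k) for k in ("D1", "D2", "D3")}
# Any key either keeps its old owner or moves to the new server -- nothing else.
moved = {k for k in before if after[k] != before[k]}
```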

Page 9:

Challenges with CH

❖ Modern large-scale data store challenges

❖ Scalability

❖ Manageability

❖ Performance

❖ Power consumption

❖ Fault tolerance

❖ We observe and investigate two problems with CH, in terms of power consumption and fault tolerance

9

Page 10:

Outline

❖ General introduction

❖ Study 1: Elastic Consistent Hashing based Store

❖ Motivation and related work

❖ Design

❖ Evaluation

❖ Study 2: Reducing Failure-recovery Cost in CH based Store

❖ Motivation and related work

❖ Design

❖ Evaluation

❖ Conclusion

10

Page 11:

Background: Elastic Data Store for Power Saving

❖ Elasticity: the ability to resize the storage cluster as workload varies (more servers means better performance but higher power consumption)

❖ Benefits

❖ Re-use storage nodes for other purposes

❖ Save machine hours (operating cost)

❖ Most distributed storage systems are not elastic

❖ GFS and HDFS

❖ Deactivating servers may make data unavailable

11

Page 12:

Agility is Important

❖ Agility determines how many machine hours can be saved

12

Page 13:

Non-elastic Data Layout

• A typical pseudo-random data layout, as seen in most CH-based distributed file systems

• Almost all servers must be "on" to ensure 100% availability

• No elastic resizing capability

13

Page 14:

Elastic Data Layout

❖ General rule

❖ Take advantage of replication

❖ Always keep the first (primary) replicas “on”

❖ The other replicas can be activated on demand

14

Page 15:

Primary Server Layout

❖ Peak write performance: N/3 (same as non-elastic)

❖ Scaling is limited to N/3 servers

15

Page 16:

Equal-work Data Layout

16

Page 17:

Primary-server Layout with CH

❖ Modifies the data placement of original CH so that one replica is always placed on a primary server

❖ To achieve the equal-work layout, the cluster must be configured accordingly

[Diagram: a ring of 10 servers placing objects D1 and D2. Legend: primary server (always active), secondary server (active), secondary server (inactive), data object. The placement walk skips inactive servers, skips secondary servers when placing the primary replica, and skips primary servers when placing secondary replicas.]

17
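The skip rules in this figure can be sketched under simplifying assumptions (a sorted list stands in for the ring, one clockwise pass, names and positions hypothetical): the first replica lands on the next primary server on the walk, and the remaining replicas land on the next active secondaries.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    pos: int        # position on the hash ring
    primary: bool
    active: bool

def place(key_pos: int, servers: list[Server], n_replicas: int = 3) -> list[str]:
    """Walk clockwise from the key's ring position.  The first replica
    goes to the next primary server (secondaries are skipped); the
    remaining replicas go to the next active secondary servers
    (primaries and inactive servers are skipped)."""
    ring = sorted(servers, key=lambda s: s.pos)
    start = next((i for i, s in enumerate(ring) if s.pos >= key_pos), 0)
    walk = ring[start:] + ring[:start]                 # one clockwise pass
    first = next(s.name for s in walk if s.primary)
    rest = [s.name for s in walk if not s.primary and s.active]
    return [first] + rest[: n_replicas - 1]

servers = [
    Server("1", 10, primary=True,  active=True),
    Server("2", 20, primary=False, active=True),
    Server("3", 30, primary=False, active=False),      # inactive: skipped
    Server("4", 40, primary=True,  active=True),
    Server("5", 50, primary=False, active=True),
]
replicas = place(key_pos=15, servers=servers)          # ['4', '2', '5']
```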

Page 18:

Equal-work Data Layout

❖ Number of data chunks on each primary server: v_primary = B/p

❖ Number of data chunks on the secondary server ranked i: v_secondary,i = B/i

[Plot: "Data Distribution", number of data blocks (x10^4) vs. rank of server (1-10), for Version 1 (10 active), Version 2 (8 active), and Version 3 (10 active), with the data to migrate highlighted.]

18
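The two formulas on this slide (v_primary = B/p for each of the p primary servers, v_secondary,i = B/i for the secondary ranked i) can be checked with a small sketch; B and p below are hypothetical values:

```python
B = 100_000   # total number of data blocks (hypothetical)
p = 3         # number of primary servers (hypothetical)

def blocks_on(rank: int) -> float:
    """Equal-work layout: each of the p primaries stores B/p blocks; the
    secondary ranked i (i > p) stores B/i blocks, so block counts taper
    off with rank as in the slide's distribution plot."""
    return B / p if rank <= p else B / rank

layout = [blocks_on(r) for r in range(1, 11)]   # ranks 1..10
```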

Page 19:

Contribution Summary

❖ A primary data placement/replication scheme with consistent hashing

❖ Achieves a primary-secondary data layout for elasticity

❖ Only a slight modification to existing consistent hashing

❖ Preserves the properties of consistent hashing

19

Page 20:

Data Re-integration

❖ After a node is turned off, no data is written to it. When the node joins again, newly created or modified data may need to be re-integrated to it.

❖ However, the data store does not know which data was modified or newly created; it has to transfer all data that should be placed on the newly joined node.

20

Page 21:

Data Re-integration

❖ Data re-integration incurs many I/O operations and degrades performance when scaling up

❖ 3-phase workload: high load -> low load -> high load

❖ No resizing: 10 servers always on

❖ With resizing: 10 servers -> 2 servers -> 10 servers

[Plot: "Original Consistent Hashing", IO throughput (MB/s) over time (seconds) for the with-resizing and no-resizing cases; the ends of phase 1 and phase 2 are marked.]

21

Page 22:

Our Contribution

❖ Selective background re-integration

❖ A dirty table tracks all OIDs that are dirty

❖ When re-integration finishes, the OID is removed from the table

❖ The rate of re-integration is controlled

[Diagram: resizing from membership version 9 (nodes 9 and 10 off) through version 10 (node 9 back on) to version 11 (all nodes on). A membership table records each node's on/off state; writes to objects with a replica on an off node (e.g., obj 10010) are marked dirty in the dirty table together with the membership version of the write. Dirty objects are re-integrated in order; once all dirty data up to OID 10010 is re-integrated to version 10, those entries are removed from the table.]

22
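A minimal sketch of the dirty-table bookkeeping on this slide (the talk's implementation uses Redis; a plain dict and the OIDs below are illustrative):

```python
class DirtyTable:
    """Track objects written while some replica node was off, keyed by OID,
    so that only those objects are transferred when the node rejoins."""

    def __init__(self):
        self.dirty = {}                  # OID -> membership version of the write

    def record_write(self, oid: int, version: int):
        self.dirty[oid] = version

    def reintegrate(self, up_to_version: int, transfer) -> list[int]:
        """Re-integrate every dirty object written at or before
        `up_to_version` (rate control elided); finished OIDs are removed
        from the table."""
        done = sorted(oid for oid, v in self.dirty.items() if v <= up_to_version)
        for oid in done:
            transfer(oid)                # copy the object to the rejoined node
            del self.dirty[oid]
        return done

table = DirtyTable()
table.record_write(10010, version=9)
table.record_write(10205, version=10)
transferred = []
table.reintegrate(9, transferred.append)   # only the version-9 write moves
```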

Page 23:

Implementation

❖ Primary-secondary data placement/replication implemented in Sheepdog

❖ Dirty data tracking implemented using Redis

23

Page 24:

Evaluation

❖ 3-phase workload test

❖ T: deadline for background re-integration

❖ Rate: data transfer rate for background re-integration

❖ Performance is significantly improved with selective background re-integration

[Plot: IO throughput (MB/s) over time (seconds) for Sel+backg (T=2/4/6, Rate=200), Selective, Original CH, and No-resizing; the ends of phase 1 and phase 2 are marked, and a high rate delays resizing.]

24

Page 25:

Large-scale Trace Analysis

❖ Use the Cloudera trace

❖ Apply our policy and analyze the effect of resizing

[Plots: number of servers over time (minutes) for the CC-a and CC-b traces, comparing Ideal, Original CH, Primary+aggressive, and Primary+background.]

25

Page 26:

Summary

❖ We propose a primary-secondary data placement/replication scheme to provide better elasticity in a consistent hashing based data store

❖ We use a selective background data re-integration technique to reduce the I/O footprint when re-integrating nodes into a cluster

❖ This is the first work studying elasticity for saving power in a consistent hashing based store

26

Page 27:

Outline

❖ General introduction

❖ Study 1: Elastic Consistent Hashing based Store

❖ Motivation and related work

❖ Design

❖ Evaluation

❖ Study 2: Reducing Failure-recovery Cost in CH based Store

❖ Motivation and related work

❖ Design

❖ Evaluation

❖ Conclusion

27

Page 28:

Fault-tolerance and Self-healing

❖ Replication for tolerating failures

❖ When a node fails, a self-healing system can recover the lost data by itself, without administrator intervention

[Diagram: keys D1-D3 and servers 1-6 hashed onto the ring covering [0, 2^160); when server 2 fails, D2's second replica is migrated to server 3 automatically.]

28

Page 29:

Motivation

❖ Even though CH can self-heal from failures, the cost of recovery is large (data transfers)

❖ Simply delaying self-healing can create a large risk of data loss

❖ Use a different data layout to delay healing as much as possible

❖ Determine when it is OK to delay self-healing and when it is not

29

Page 30:

Motivation

❖ Pseudo-random replication has low tolerance of multiple concurrent failures

❖ Losing one server puts data in danger

30

Page 31:

Primary Replication

❖ The same scheme as the one used in Elastic Consistent Hashing

❖ As long as the primary replicas are available, there is no worry about losing data

31

Page 32:

Data Recovery Strategy

❖ Aggressive recovery: as soon as a node fails, recovery starts transferring data

❖ Lazy recovery: as long as a node failure does not incur much risk of data loss, the data transfer is delayed

❖ Need a metric to quantify the risk of losing data

32

Page 33:

Determining the Recovery Strategy

❖ Minimum Replication Level (MRL)

❖ The smallest number of replicas that any data object may have

❖ A larger MRL means more failures can be tolerated

❖ Set a threshold on MRL; when MRL drops below the threshold, aggressive recovery is used

33

Page 34:

Measuring MRL in CH

❖ MRL can be easily calculated in a consistent hashing based data store

[Diagram: failure cases on a ring of 10 servers (legend: primary server, secondary server, failed primary server, failed secondary server, committed/uncommitted failed node, data object). Case (1): server 3 failed, MRL=3, aggressive. Case (2): servers 5, 6 and 10 failed, MRL=2, lazy. Case (3): servers 4, 6 and 10 failed, MRL=1, aggressive; with the failed nodes committed, servers 4, 6 and 10 failed, MRL=3, lazy.]

34
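The MRL test on these two slides can be sketched directly; the object-to-replica map and the threshold below are hypothetical:

```python
def mrl(replica_map: dict[str, list[str]], alive: set[str]) -> int:
    """Minimum Replication Level: the smallest number of surviving
    replicas over all objects."""
    return min(sum(s in alive for s in servers) for servers in replica_map.values())

def recovery_strategy(replica_map, alive, threshold: int = 2) -> str:
    """Aggressive recovery once MRL drops below the threshold; lazy otherwise."""
    return "aggressive" if mrl(replica_map, alive) < threshold else "lazy"

# Hypothetical 3-way replicated objects on servers "1".."4".
replica_map = {"D1": ["1", "2", "3"], "D2": ["2", "3", "4"]}
print(recovery_strategy(replica_map, alive={"1", "2", "3", "4"}))  # lazy (MRL=3)
print(recovery_strategy(replica_map, alive={"1", "4"}))            # aggressive (MRL=1)
```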

Page 35:

Analysis with the MSR Trace

❖ MSR trace: a 1-week I/O trace from a Microsoft Research server

❖ Insert recovery periods into the trace under the two recovery strategies

[Plots: IOPS over time (hours) for the MSR trace with recovery periods marked, under aggressive recovery (left) and lazy recovery (right).]

35

Page 36:

Evaluation

❖ Simulate primary-secondary replication and lazy recovery within libch-placement, a consistent hashing library

❖ Failures are generated using a Weibull distribution

❖ The simulated failure and recovery data is inserted into the MSR trace and replayed on a Sheepdog client

❖ The primary + lazy recovery strategy improves I/O performance when a failure occurs

[Plots: I/O rate (MB/s) over time (hours) for the MSR trace around two failure points, comparing primary-secondary and random replication.]

36

Page 37:

Summary

❖ We leverage the primary-secondary replication scheme, in place of the random replication scheme, to tolerate multiple concurrent failures

❖ We use the MRL metric to determine the risk of data loss and choose the data recovery strategy

❖ With our replication scheme and recovery strategy, the I/O footprint after a node failure is significantly reduced

37

Page 38:

Conclusion

❖ Consistent hashing based stores are promising but have limited functionality

❖ We provide some initial insight into how to enhance consistent hashing to offer functionalities that are important in a modern data store, such as fault tolerance and elasticity

❖ There is much more to be explored

38

Page 39:

Questions!

Please visit our websites for more details.

DISCL lab: http://discl.cs.ttu.edu/

Personal site: https://sites.google.com/site/harvesonxie/

39