Storage Fabric CS6453
Page 1:

Storage Fabric CS6453

Page 2:

Summary

Last week: NVRAM is going to change the way we think about storage.

Today: Challenges of storage layers (SSDs, HDs) that handle massive amounts of data.

Slowdowns in HDs and SSDs.

Enforcing policies for IO operations in Cloud architectures.

Page 3:

Background: Storage for Big Data

One disk is not enough to handle massive amounts of data.

Last time: Efficient datacenter networks using a large number of cheap commodity switches.

Solution: Efficient IO performance using a large number of commodity storage devices.

Page 4:

Background: RAIDs

Achieves Nx performance, where N is the number of Disks.

Is this for free?

When N becomes large, the probability of Disk failures becomes large as well.

RAID 0 does not tolerate failures.

Page 5:

Background: RAIDs

Achieves (K-1)-fault tolerance with Kx Disks.

Is this for free?

There are Kx more disks (e.g. if you want to tolerate 1 failure you need 2x more Disks than RAID 0).

RAID 1 does not utilize resources in an efficient way.

Page 6:

Background: Erasure Code

Achieves K-fault tolerance with N+K Disks.

Efficient utilization of Disks (not as great as RAID 0).

Fault tolerance (not as great as RAID 1).

Is this for free?

Reconstruction Cost: # of Disks that need to be read in case of failure(s).

RAID 6 has a Reconstruction Cost of 3.
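
To make the Reconstruction Cost idea concrete, here is a minimal sketch assuming the simplest erasure code: a single XOR parity over N data blocks (K = 1). Recovering one lost block requires reading all N surviving blocks, so the reconstruction cost equals N.

```python
# Minimal sketch, assuming a single-parity (K = 1) code over N = 3 data blocks.
# Recovering one lost block requires reading the N surviving blocks,
# so the Reconstruction Cost here is N = 3.

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

data = [b"AAAA", b"BBBB", b"CCCC"]      # N = 3 data blocks on 3 disks
parity = xor_blocks(data)               # K = 1 parity block on a 4th disk

# The disk holding data[1] fails: read the other 3 blocks and XOR them together.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
assert recovered == data[1]
```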

Page 7:

Modern Erasure Code Techniques

Erasure Coding in Windows Azure Storage [Huang, 2012]

Exploit Point: Prob[1 failure] ≫ Prob[2 or more failures]

Solution: Construct an Erasure Code Technique that has low reconstruction cost for 1 failure.

1.33x storage overhead (relatively low).

Tolerates up to 3 failures in 16 storage devices.

Reconstruction cost of 6 for 1 failure and 12 for 2+ failures.
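
A hedged sketch of how these numbers fit together, assuming the LRC layout described in [Huang, 2012]: 12 data fragments split into two local groups of 6, one local parity per group, plus 2 global parities. The fragment names (x0..x5, y0..y5, px, py, p0, p1) are illustrative.

```python
# Sketch of an LRC(12, 2, 2)-style layout: counts only, no actual coding math.

DATA = [f"x{i}" for i in range(6)] + [f"y{i}" for i in range(6)]  # 12 data fragments
LOCAL_PARITY = {"px": DATA[:6], "py": DATA[6:]}   # each protects one local group of 6
GLOBAL_PARITY = ["p0", "p1"]                      # protect all 12 data fragments

def single_failure_read_set(failed):
    """Fragments to read when exactly one data fragment is lost."""
    for parity, group in LOCAL_PARITY.items():
        if failed in group:
            # read the 5 surviving group members plus the local parity
            return [f for f in group if f != failed] + [parity]
    raise ValueError("not a data fragment")

print(len(DATA) + len(LOCAL_PARITY) + len(GLOBAL_PARITY))  # 16 storage devices
print((len(DATA) + 4) / len(DATA))                         # ~1.33x storage overhead
print(len(single_failure_read_set("x3")))                  # reconstruction cost: 6
```

A single data-fragment failure is repaired from its local group only (6 reads), while multi-fragment failures fall back to the global parities and a larger read set, which is exactly the trade-off the exploit point above justifies.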

Page 8:

The Tail at Store: Problem

We have seen how we treat failures with reconstruction. What about slowdowns in HDs (or SSDs)?

A slowdown of a disk (with no failures) might have a significant impact on overall performance.

Questions:

Do HDs or SSDs exhibit transient slowdowns?

Are slowdowns of disks frequent enough to affect the overall performance?

What causes slowdowns?

How do we deal with slowdowns?

Page 9:

The Tail at Store: Study

[Diagram: a RAID group with data drives D ... D and parity drives P, Q]

                         Disk          SSD
#RAID groups             38,029        572
#Data drives per group   3-26          3-22
#Data drives             458,482       4,069
Total drive hours        857,183,442   7,481,055
Total RAID hours         72,046,373    1,072,690

Page 10:

[Figure: CDF of Slowdown (Disk); x-axis: Slowdown (1x to 8x), y-axis: CDF (0.9 to 1); curves for Si and T]

The Tail at Store: Slowdowns?

Hourly average I/O latency per drive: L

Slowdown: S = L / L_median

Tail: T = S_max (the maximum slowdown across the drives of a RAID group in that hour)

Slow Disks: S ≥ 2

S ≥ 2 at the 99.8th percentile
S ≥ 1.5 at the 99.3rd percentile
T ≥ 2 at the 97.8th percentile
T ≥ 1.5 at the 95.2nd percentile

SSDs exhibit even more slowdowns
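
A minimal sketch of the metric above (the per-drive data layout here is hypothetical): S is computed per drive-hour from hourly average latencies, and T per hour as the maximum S across the drives of the RAID group.

```python
import statistics

def slowdowns(hourly_latency_by_drive):
    """hourly_latency_by_drive: {drive_id: [hourly avg latency in ms, ...]}."""
    S = {}
    for drive, latencies in hourly_latency_by_drive.items():
        median = statistics.median(latencies)
        S[drive] = [lat / median for lat in latencies]   # S = L / L_median
    return S

def tail(S):
    """T for each hour = max slowdown across the drives of the RAID group."""
    return [max(hour) for hour in zip(*S.values())]

raid = {"d0": [5.0, 5.2, 5.1, 14.8], "d1": [5.1, 5.0, 5.2, 5.1]}
S = slowdowns(raid)
T = tail(S)
print(S["d0"][-1] >= 2, T[-1] >= 2)   # drive d0 is "slow" in the last hour: True True
```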

Page 11:

[Figure: CDF of Slowdown Interval; x-axis: Slowdown Interval in hours (1 to 256), y-axis: CDF (0 to 1); curves for Disk and SSD]

The Tail at Store: Duration?

Slowdowns are transient.

40% of HD slowdowns last ≥ 2 hours.

12% of HD slowdowns last ≥ 10 hours.

Many slowdowns happen in consecutive hours (and therefore last longer).

Page 12:

[Figure: CDF of Slowdown Inter-Arrival Period; x-axis: Inter-Arrival between Slowdowns in hours (0 to 35), y-axis: CDF (0.5 to 1); curves for Disk and SSD]

The Tail at Store: Correlation between slowdowns in the same storage?

90% of Disk slowdowns are within 24 hours of another slowdown of the same Disk.

> 80% of SSD slowdowns are within 24 hours of another slowdown of the same SSD.

Slowdowns of the same Disk happen relatively close to each other in time.

Page 13:

[Figure: CDF of RI within hours with Si ≥ 2; x-axis: Rate Imbalance (0.5x to 4x), y-axis: CDF (0 to 1); curves for Disk and SSD]

The Tail at Store: Causes?

RI = IORate / IORate_median

Rate imbalance does not seem to be the main cause of slowdowns for slow Disks.

Page 14:

[Figure: CDF of SI (size imbalance) within hours with Si ≥ 2; x-axis: Size Imbalance (0.5x to 4x), y-axis: CDF (0 to 1); curves for Disk and SSD]

The Tail at Store: Causes?

SI = IOSize / IOSize_median

Size imbalance does not seem to be the main cause of slowdowns for slow Disks.

Page 15:

[Figure: CDF of Slowdown vs. Drive Age (Disk); x-axis: Slowdown (1x to 5x), y-axis: CDF (0.95 to 1); one curve per drive age]

The Tail at Store: Causes?

Drive age seems to have some correlation with slowdowns, but the correlation is not strong.

Page 16:

The Tail at Store: Causes?

No correlation of slowdowns to the time of day (0:00 to 24:00)

No explicit drive events around slow hours

Unplugging disks and plugging them back in does not particularly help

SSD vendors show significant differences between them

Page 17:

The Tail at Store: Solutions

Create Tail-Tolerant RAIDs.

Treat slow disks as failed disks.

Reactive

Detect slow Disks: those that take a long time to answer (> 2x compared to the other Disks).

If a Disk is slow, reconstruct the answer from the other disks using RAID redundancy.

In the best case, latency is around 3x that of a read from an average Disk.

Proactive

Always use RAID redundancy to issue an additional read.

Take the fastest answer.

Uses much more I/O bandwidth.

Adaptive

Combination of both approaches, taking the findings into account (sketched below).

Use the reactive approach until a slowdown is detected.

After that, use the proactive approach, since slowdowns are repetitive and last many hours.
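
A hedged sketch of the adaptive idea (not the paper's implementation; read_primary and read_reconstructed are caller-supplied placeholders): reads take the reactive path until a drive times out at about 2x its median latency, after which the drive is remembered as slow for a while and its reads go straight to reconstruction.

```python
import time

SLOW_THRESHOLD = 2.0      # a drive answering slower than 2x its median is "slow"
SLOW_MEMORY_S = 3600.0    # assumption: keep treating the drive as slow for ~an hour

class AdaptiveRaidReader:
    def __init__(self, read_primary, read_reconstructed, median_latency):
        self.read_primary = read_primary              # read(drive, block, timeout) -> data or None
        self.read_reconstructed = read_reconstructed  # rebuild the block from the other drives
        self.median = median_latency                  # {drive: median latency in seconds}
        self.slow_until = {}                          # {drive: time until treated as slow}

    def read(self, drive, block):
        if time.monotonic() < self.slow_until.get(drive, 0.0):
            # Proactive phase: the drive was recently slow; slowdowns repeat and
            # last hours, so skip the wait and reconstruct from redundancy directly.
            return self.read_reconstructed(drive, block)
        # Reactive phase: give the primary read up to ~2x the drive's median latency.
        data = self.read_primary(drive, block, timeout=SLOW_THRESHOLD * self.median[drive])
        if data is None:                              # timed out: flag the drive as slow
            self.slow_until[drive] = time.monotonic() + SLOW_MEMORY_S
            return self.read_reconstructed(drive, block)
        return data
```

The roughly 3x worst-case latency of the purely reactive path corresponds to the ~2x timeout plus the reconstruction read; the proactive phase avoids paying that wait repeatedly, while healthy drives never trigger redundant reads.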

Page 18:

The Tail at Store: Conclusions

More research on possible causes of Disk and SSD slowdowns is required.

Need Tail-Tolerant RAIDs to reduce the overhead from slowdowns.

Since reconstruction of data is the way to deal with slowdowns, and since Prob[1 slowdown] ≫ Prob[2 or more slowdowns], the Azure paper [Huang, 2012] becomes more relevant.

Page 19:

Background: Cloud Storage

General Purpose Applications

Separate VM-VM connections from VM-Storage connections

Storage is virtualized

Many layers from application to actual storage

Resources are shared across multiple tenants

Page 20:

IOFlow: Problem

Cannot support end-to-end policies (e.g. minimum IO bandwidth from application to storage)

Applications do not have any way of expressing their storage policies

Shared infrastructure, where aggressive applications tend to get more IO bandwidth

Page 21:

IOFlow: Challenges

No existing enforcement mechanism for controlling IO rates

Aggregate performance policies

Non-performance policies

Admission control

Dynamic enforcement

Support for unmodified applications and VMs

Page 22:

IOFlow: Do it like SDNs

Page 23:

IOFlow: Supported policies

<VM, Destination> -> Bandwidth (static, compute side)

<VM, Destination> -> Min Bandwidth (dynamic, compute side)

<VM, Destination> -> Sanitize (static, compute or storage side)

<VM, Destination> -> Priority Level (static, compute and storage side)

<Set of VMs, Set of Destinations> -> Bandwidth (dynamic, compute side)
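
Illustrative only (PolicyRule and its field names are hypothetical, not the IOFlow API): one way to represent the policy types above is as rules mapping a <set of VMs, set of destinations> flow to an action, a value, and the enforcement point.

```python
from dataclasses import dataclass

@dataclass
class PolicyRule:
    vms: set            # e.g. {"VM1"}, or a set of VMs for aggregate policies
    destinations: set   # e.g. {"ServerX"}
    action: str         # "bandwidth" | "min_bandwidth" | "sanitize" | "priority"
    value: object       # Mbps, priority level, etc.
    dynamic: bool       # static vs. dynamically adjusted by the controller
    side: str           # "compute", "storage", or "both"

rules = [
    PolicyRule({"VM1"}, {"ServerX"}, "min_bandwidth", 800, dynamic=True, side="compute"),
    PolicyRule({"VM2"}, {"ServerX"}, "priority", "high", dynamic=False, side="both"),
]
```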

Page 24:

Example 1: Interface

Policies:

<VM1,Server X> -> B1

<VM2,Server X> -> B2

Controller to SMBc of physical server containing VM1 and VM2

createQueueRule(<VM1,Server X>,Q1)

createQueueRule(<VM2,Server X>,Q2)

createQueueRule(<*,*>,Q0)

configureQueueService(Q1, <B1, low, S>), where S is the size of the queue

configureQueueService(Q2, <B2, low, S>)

configureQueueService(Q0, <C-B1-B2, low, S>), where C is the Capacity of Server X.
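
The queues created above need a data-plane mechanism that paces each flow to its configured bandwidth (B1, B2, and C-B1-B2). IOFlow enforces this inside the Windows IO stack; the token-bucket sketch below (hypothetical names, not the IOFlow code) only illustrates the idea.

```python
import time
from collections import deque

class TokenBucketQueue:
    """Pace queued IOs so a flow does not exceed its configured bandwidth."""

    def __init__(self, rate_bytes_per_s, capacity_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last = time.monotonic()
        self.pending = deque()          # queued (io_size, issue_callback) pairs

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def submit(self, io_size_bytes, issue):
        """Issue the IO immediately if tokens allow, otherwise queue it."""
        self._refill()
        if not self.pending and self.tokens >= io_size_bytes:
            self.tokens -= io_size_bytes
            issue()
        else:
            self.pending.append((io_size_bytes, issue))

    def drain(self):
        """Called periodically: issue queued IOs as tokens become available."""
        self._refill()
        while self.pending and self.tokens >= self.pending[0][0]:
            size, issue = self.pending.popleft()
            self.tokens -= size
            issue()
```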

Page 25:

Example 2: Max-Min Fairness

Policies:

<VM1-VM3,Server X> -> 900 Mbps

Demand:

VM1 -> 600 Mbps

VM2 -> 400 Mbps

VM3 -> 200 Mbps

Result:

VM1 -> 350 Mbps

VM2 -> 350 Mbps

VM3 -> 200 Mbps
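
The result follows from max-min fairness: the equal share of 900 Mbps across three VMs is 300 Mbps; VM3 demands only 200, so its leftover 100 is split between VM1 and VM2, giving them 350 each. A small water-filling sketch (the controller's actual algorithm may differ) reproduces this.

```python
def max_min_fair(capacity, demands):
    """demands: {vm: demanded Mbps}; returns {vm: allocated Mbps}."""
    alloc = {}
    remaining = dict(demands)
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        # VMs demanding no more than the fair share are fully satisfied ...
        satisfied = {vm: d for vm, d in remaining.items() if d <= share}
        if not satisfied:
            # ... otherwise everyone left gets an equal share of what remains.
            for vm in remaining:
                alloc[vm] = share
            return alloc
        for vm, d in satisfied.items():
            alloc[vm] = d
            cap -= d
            del remaining[vm]
    return alloc

print(max_min_fair(900, {"VM1": 600, "VM2": 400, "VM3": 200}))
# VM1 and VM2 get 350 Mbps each, VM3 gets its full 200 Mbps
```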

Page 26:

IOFlow: Evaluation of Policy Enforcement

Windows-based IO stack

10 hypervisors with 12 VMs each (120 VMs total)

4 tenants using 30 VMs each (3 VMs per hypervisor for each tenant)

1 Storage Server

6.4 Gbps IO Bandwidth

1 Controller

1s interval between dynamic enforcements of policies

Page 27:

IOFlow: Evaluation of Policy Enforcement

Tenant    Policy
Index     {VM 1-30, X}   -> Min 800 Mbps
Data      {VM 31-60, X}  -> Min 800 Mbps
Message   {VM 61-90, X}  -> Min 2500 Mbps
Log       {VM 91-120, X} -> Min 1500 Mbps

Page 28:

IOFlow: Evaluation of Policy Enforcement

Page 29:

IOFlow: Evaluation of Overhead

Page 30:

IOFlow: Conclusions

Contributions

First Software Defined Storage approach

Fine-grained control over IO operations in the Cloud

Limitations

Network or other resources might be the bottleneck

Need to take care to locate VMs close to their data (spatial locality)

Flat Datacenter Storage [Nightingale, 2012] provides solutions for this problem

Guaranteed latencies cannot be expressed by the current policies

Best-effort approach by setting a priority level

Page 31:

Specialized Storage Architectures

HDFS [Shvachko, 2009] and GFS [Ghemawat, 2003] work well for Hadoop MapReduce applications.

Facebook's Photo Storage [Beaver, 2010] exploits workload characteristics to design and implement a better storage system.