Large Scale Sharing


Page 1: Large Scale Sharing

Large Scale Sharing

The Google File System

PAST: Storage Management & Caching

– Presented by Chi H. Ho

Page 2: Large Scale Sharing

Introduction

A next step from network file systems. How large?

GFS: > 1,000 storage nodes, > 300 TB of disk storage, hundreds of client machines

PAST: Internet-scale

Page 3: Large Scale Sharing

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Page 4: Large Scale Sharing

Goals

Performance, scalability, reliability, availability.

Highly tuned for:
Google's back-end file service.
Workloads: multiple-producer/single-consumer, many-way merging.

Page 5: Large Scale Sharing

Assumptions

H/W: inexpensive components that often fail.
Files: a modest number of large files.
Reads/Writes: two kinds:
Large streaming: the common case, so it is optimized.
Small random: supported but need not be efficient.
Concurrency: hundreds of concurrent appends.
Performance: high sustained bandwidth is more important than low latency.

Page 6: Large Scale Sharing

Interface

Usual operations: create, delete, open, close, read, and write.

GFS-specific operations:
snapshot: creates a copy of a file or a directory tree at low cost.
record append: allows concurrent appends to be performed atomically.

Page 7: Large Scale Sharing

Architecture

Page 8: Large Scale Sharing

Architecture

(Architecture diagram: the GFS master, the chunkservers, and the clients each run as a user-level process.)

Page 9: Large Scale Sharing

Architecture (Files)

Files are divided into fixed-size chunks, each replicated on multiple (by default 3) chunkservers as a plain Linux file.

Each chunk is identified by an immutable and globally unique chunk handle, assigned by the master at chunk creation time.

Reads and writes address chunk data by <chunk handle, byte range>.

Page 10: Large Scale Sharing

Architecture (Master)

Maintains metadata:
• Namespace
• Access control
• Mapping from files to chunks
• Locations of chunk replicas

Controls system-wide activities:
• Chunk lease management
• Garbage collection
• Chunk migration
• Heartbeat messages to chunkservers

Page 11: Large Scale Sharing

Architecture (Client)

Interacts with the master for metadata.

Communicates directly with chunkservers for data (read path sketched below).
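
To make the control/data split on this slide concrete, here is a minimal read-path sketch in Python. It assumes hypothetical `master.lookup()` and chunkserver `read()` RPC stubs (not the real GFS client API): metadata goes through the master, data comes straight from a chunkserver, and cached locations keep the master off the data path for repeated reads.

```python
CHUNK_SIZE = 64 * 2**20  # 64 MB chunks (the GFS default)

class Client:
    def __init__(self, master):
        self.master = master                 # assumed RPC stub exposing lookup()
        self.location_cache = {}             # (filename, chunk_index) -> (handle, replicas)

    def read(self, filename, offset, length):
        """Read `length` bytes starting at `offset` (single-chunk case, for brevity)."""
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.location_cache:
            # Metadata goes through the master ...
            handle, replicas = self.master.lookup(filename, chunk_index)
            self.location_cache[key] = (handle, replicas)
        handle, replicas = self.location_cache[key]
        # ... but data is fetched directly from a chunkserver.
        start = offset % CHUNK_SIZE
        return replicas[0].read(handle, start, length)
```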

Page 12: Large Scale Sharing

Architecture (Notes)

No data cache is needed: Why?

• Client: ???

• Chunkservers: ???

Page 13: Large Scale Sharing

Architecture (Notes)

No data cache is needed: Why?

• Client: most applications stream through huge files or have working sets too large to be cached.

• Chunkservers: already have Linux cache.

Page 14: Large Scale Sharing

Single Master

Bottleneck?

Single point of failure?

Page 15: Large Scale Sharing

Single Master

Bottleneck? No:
Clients never read or write file data through the master.
Clients only ask the master for chunk locations.
The master can return locations for several chunks at once (prefetching).
Clients cache chunk locations.

Single point of failure? No:
The master's state is replicated on multiple machines.
Mutations of the master's state are committed atomically.
"Shadow" masters temporarily serve reads if the master is down.

Page 16: Large Scale Sharing

Chunk Size

Large: 64 MB.

Advantages:
Reduces client-master interaction (see the chunk-count arithmetic below).
Reduces network overhead (clients can keep persistent TCP connections to chunkservers).
Reduces the size of metadata, so it can be kept in the master's memory.

Disadvantages:
Small files (with few chunks) may become hot spots.

Solutions:
Store small files with more replicas.
Allow clients to read from other clients.
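
A rough back-of-the-envelope sketch (my own illustration, not from the slides) of why 64 MB chunks help: the chunk size directly sets how many chunk entries the master keeps per file and how many location lookups a sequential reader must issue.

```python
import math

def chunk_count(file_size_bytes, chunk_size_bytes):
    """Number of chunks, and hence master-side entries and location lookups."""
    return math.ceil(file_size_bytes / chunk_size_bytes)

GB = 2**30
MB = 2**20

# A 1 GB file: 16 chunks at 64 MB, versus 16,384 chunks at a 64 KB block size.
print(chunk_count(1 * GB, 64 * MB))    # 16
print(chunk_count(1 * GB, 64 * 1024))  # 16384
```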

Page 17: Large Scale Sharing

Metadata

Three major types, all kept in the master's memory (sketched below):
file and chunk namespaces,
file-to-chunk mapping,
locations of each chunk's replicas.

Persistence:
Namespaces and the mapping: an operation log, stored on multiple machines.
Chunk locations: polled when the master starts and when chunkservers join, then updated via heartbeat messages.
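
A minimal sketch of the three metadata maps, using plain Python dictionaries as stand-ins for the master's in-memory structures (the field and method names are mine):

```python
class MasterState:
    """In-memory metadata; only the first two maps are made persistent via the operation log."""
    def __init__(self):
        self.namespace = set()       # full path names of files and directories
        self.file_to_chunks = {}     # path -> ordered list of chunk handles
        self.chunk_locations = {}    # chunk handle -> set of chunkserver addresses

    def on_heartbeat(self, chunkserver, held_handles):
        # Chunk locations are never logged: they are rebuilt from chunkserver reports.
        for handle in held_handles:
            self.chunk_locations.setdefault(handle, set()).add(chunkserver)
```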

Page 18: Large Scale Sharing

Operation Log

At the heart of GFS:
the only persistent record of metadata,
the logical timeline that orders concurrent operations.

Operations are committed atomically.
The master's state is recovered by replaying the operations in the log (illustrated below).
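
Recovery by log replay could look roughly like the sketch below, reusing the MasterState stand-in from the previous slide and assuming log records are simple (op, args) tuples; the real log format is not described on the slides.

```python
def replay(log_records, state):
    """Rebuild the namespace and file-to-chunk mapping by re-applying logged mutations in order."""
    for op, args in log_records:
        if op == "create":
            path, = args
            state.namespace.add(path)
            state.file_to_chunks[path] = []
        elif op == "add_chunk":
            path, handle = args
            state.file_to_chunks[path].append(handle)
        elif op == "delete":
            path, = args
            state.namespace.discard(path)
            state.file_to_chunks.pop(path, None)
    return state
```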

Page 19: Large Scale Sharing

Consistency

Metadata: solely controlled by the master.

Data: consistent after successful mutations:
The same order of mutations is applied on all replicas.
Stale replicas (those missing some mutations) are detected and eliminated.

Defined: consistent, and clients see what the mutation writes in its entirety.
Consistent: all clients see the same data regardless of which replica they read from.

Page 20: Large Scale Sharing

Leases and Mutation Order

Lease: high-level chunk-based access control mechanism, granted by the master.

Global mutation order = lease grant order + serial number within a lease, chosen by the primary (lease holder).

Illustration of a mutation (write control flow):

1. The client asks the master which chunkserver holds the lease for the chunk and for the locations of the primary and secondary replicas.

2. The master locates the lease, or grants one if none exists, and replies; the client caches the locations.

3. The client pushes the data to all replicas; each replica stores the data in an LRU buffer and acknowledges.

4. After waiting for all replicas to acknowledge, the client sends a write request to the primary; the primary assigns a serial number to the request (sketched in code below).

5. The primary forwards the write request to the secondary replicas, which apply it in serial-number order.

6. The secondaries report to the primary that the request completed.

7. The primary replies to the client (possibly with errors).
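
A toy sketch of steps 4 and 5: the primary, holding the lease, assigns consecutive serial numbers and has every replica apply mutations in that order. The class names and return values are hypothetical; lease expiry, version checks, and error recovery are omitted.

```python
import itertools

class Replica:
    def __init__(self):
        self.applied = []                  # (serial_no, mutation), in application order

    def apply(self, serial_no, mutation):
        self.applied.append((serial_no, mutation))
        return "ok"                        # a real replica could report an error here

class Primary(Replica):
    """Holds the chunk lease; decides the order of concurrent mutations."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.serial = itertools.count(1)   # consecutive serial numbers within this lease

    def write(self, mutation):
        # The data was already pushed to and buffered at all replicas (step 3);
        # here the primary only fixes the order and tells every replica to apply it.
        serial_no = next(self.serial)
        self.apply(serial_no, mutation)
        statuses = [s.apply(serial_no, mutation) for s in self.secondaries]
        return serial_no, statuses         # the client is told about any replica errors
```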

Page 21: Large Scale Sharing

Special Ops Revisited

Atomic Record Appends:
The primary chooses the offset.
Upon failure: pad the failed replica(s), then retry (retry sketch below).
Guarantee: the record is appended to the file at least once atomically.

Snapshot:
Copy-on-write.
Used to make a copy of a file or directory tree quickly.
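
A client-side sketch of the at-least-once guarantee: on failure the append is simply retried at a fresh offset, so a record may appear more than once and consumers must skip padding and duplicates. `primary.try_append()` is a hypothetical call, not part of the published interface.

```python
def record_append(primary, record, max_retries=5):
    """Retry until one attempt succeeds atomically on all replicas; return the offset used."""
    for _ in range(max_retries):
        ok, offset = primary.try_append(record)   # hypothetical call: the primary picks the offset
        if ok:
            return offset                          # appended at least once at this offset
        # On failure some replicas may hold a partial or padded region at the failed offset;
        # the next attempt uses a fresh offset, so readers must tolerate padding/duplicates.
    raise IOError("record append failed after retries")
```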

Page 22: Large Scale Sharing

Master Operations

Namespace management and locking: to support concurrent master operations.

Replica placement: to avoid correlated failures and to exploit network bandwidth.

Creation, re-replication, rebalancing: for better disk utilization, load balancing, and fault tolerance.

Garbage collection: lazy deletion is simple, efficient, and supports undelete.

Stale replica detection: obsolete replicas are detected and then garbage collected.

Page 23: Large Scale Sharing

Fault Tolerance Sum Up

Master fails? Chunkservers fail? Disks corrupted? Network noise?

Page 24: Large Scale Sharing

Micro-benchmarks

Configuration: 1 master (with 2 master replicas), 16 chunkservers, 16 clients.

Each machine: dual 1.4 GHz PIII CPUs, 2 GB memory, two 80 GB 5,400 rpm disks, a full-duplex 100 Mbps NIC.

The clients hang off one switch, the master and chunkservers off another; the two switches are connected by a 1 Gbps link.

Page 25: Large Scale Sharing

Micro-benchmark Tests and Results

Reads: N clients read simultaneously, at random, from a 320 GB file set; each client reads 1 GB, 4 MB per read.

Writes: N clients write simultaneously to N distinct files; each client writes 1 GB, 1 MB per write.

Record appends: N clients append simultaneously to one file.

Page 26: Large Scale Sharing

Real World Clusters

Cluster A: research and development for over 100 engineers.
Typical task: initiated by a human user and runs for up to several hours; reads MBs to TBs of data, processes it, and writes the results back.

Cluster B: production data processing.
Typical tasks: long-lasting; continuously generate and process multi-TB data sets; only occasional human intervention.

Page 27: Large Scale Sharing

Real World Measurements

The table shows:
Sustained high throughput.
Light workload on the master.

Besides throughput: recovery.
A full recovery of one failed chunkserver takes 23.2 minutes.
Prioritized recovery, to a state that can tolerate one more failure, takes 2 minutes.

Page 28: Large Scale Sharing

Workload Breakdown

Page 29: Large Scale Sharing

Conclusion

The design is tailored, perhaps too narrowly, to Google's applications.
Most of the challenges are in implementation: more a development effort than a research contribution.
However, GFS is a complete, deployed solution.

Any opinions/comments?

Page 30: Large Scale Sharing

Storage management and caching in PAST, a large-scale, persistent

peer-to-peer storage utility

Antony Rowstron, Peter Druschel

Page 31: Large Scale Sharing

What is PAST?

An Internet-based, P2P global storage utility.
An archival storage and content distribution utility, not a general-purpose file system.
Nodes form a self-organizing overlay network.
Nodes may contribute storage.
Files are inserted and retrieved by fileID (and possibly a key); files are immutable.
PAST does not itself provide a lookup service; it is built on top of one, such as Pastry.

Page 32: Large Scale Sharing

Goals

Strong persistence, High availability, Scalability, Security.

Page 33: Large Scale Sharing

Background – Pastry

A P2P routing substrate.

Given (fileID, msg), route msg to the node whose nodeID is numerically closest to fileID.

Routing cost: ceil(log_{2^b} N) steps.

Eventual delivery is guaranteed unless floor(l/2) nodes with adjacent nodeIDs fail simultaneously.

Per-node routing state: (2^b - 1) * ceil(log_{2^b} N) + 2l entries, each mapping a nodeID to an IP address (computed below for the experimental parameters).

Recovering a failed node's state takes O(log_{2^b} N) messages.
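
To make the cost and state formulas concrete, a small calculation sketch (my own, plugging in the b = 4, l = 32, N = 2250 values used later in the PAST experiments):

```python
import math

def routing_steps(N, b):
    """ceil(log_{2^b} N): expected number of routing hops."""
    return math.ceil(math.log(N, 2 ** b))

def routing_state_entries(N, b, l):
    # (2^b - 1) * ceil(log_{2^b} N) routing-table entries plus 2*l leaf-set entries.
    return (2 ** b - 1) * routing_steps(N, b) + 2 * l

print(routing_steps(2250, b=4))            # 3 hops for the 2250-node PAST experiments
print(routing_state_entries(2250, 4, 32))  # 109 nodeID -> IP entries per node
```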

Page 34: Large Scale Sharing

Pastry – A closer look…

Routing: forward the message carrying fileID to a node whose nodeID shares a longer prefix (more digits) with fileID than the current node's nodeID does.

If no such node is known, forward to a node whose nodeID shares an equally long prefix but is numerically closer to fileID (a toy implementation follows).

Other nice properties: fault resilient, self-organizing, scalable, efficient.

(The routing example in the figure uses b = 2, l = 8.)
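
A toy sketch of the prefix-matching rule, assuming nodeIDs and fileIDs are hex strings (i.e. b = 4) and that `known_nodes` collapses the routing table and leaf set into one list; failure handling and the real table lookup are omitted.

```python
def shared_prefix_len(a, b):
    """Number of leading digits two ID strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(current_id, file_id, known_nodes):
    """Prefer a node sharing a longer prefix with file_id; otherwise an equal-prefix
    node that is numerically closer to file_id than the current node."""
    here = shared_prefix_len(current_id, file_id)
    better = [n for n in known_nodes if shared_prefix_len(n, file_id) > here]
    if better:
        return max(better, key=lambda n: shared_prefix_len(n, file_id))
    closer = [n for n in known_nodes
              if shared_prefix_len(n, file_id) == here
              and abs(int(n, 16) - int(file_id, 16)) < abs(int(current_id, 16) - int(file_id, 16))]
    return min(closer, key=lambda n: abs(int(n, 16) - int(file_id, 16))) if closer else None

# e.g. next_hop("65a1fc", "d46a1c", ["d13da3", "6fa0ba"]) -> "d13da3" (shares the leading 'd')
```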

Page 35: Large Scale Sharing

PAST Operations

Insert:
fileID := SHA-1(filename, owner's public key, salt), unique with high probability (computation sketched below).
A file certificate is issued.
The client's storage quota is charged.

Lookup:
Based on fileID; a node storing the file returns its contents and certificate.

Reclaim:
The client issues a reclaim certificate for authentication.
The client's quota is credited, double-checked against a reclaim receipt.
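
A sketch of the fileID computation. The slide only names the three inputs, so the byte encoding below (simple concatenation) and the example values are assumptions.

```python
import hashlib

def compute_file_id(filename: str, owner_public_key: bytes, salt: bytes) -> bytes:
    """fileID := SHA-1(filename, owner's public key, salt) -> a 160-bit identifier."""
    h = hashlib.sha1()
    h.update(filename.encode("utf-8"))
    h.update(owner_public_key)
    h.update(salt)
    return h.digest()  # 20 bytes; the file is routed to the k nodes with the closest nodeIDs

file_id = compute_file_id("report.pdf", b"-----BEGIN PUBLIC KEY-----...", b"\x01\x02")
print(file_id.hex())
```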

Page 36: Large Scale Sharing

Security Overview

Each node and each user holds a smartcard.

Security model:
It is infeasible to break the cryptosystems.
Most nodes are well-behaved.
Smartcards cannot be controlled by an attacker.

The smartcards generate various certificates and receipts that ensure security: file certificates, reclaim certificates, reclaim receipts, etc.

Page 37: Large Scale Sharing

Storage Management

Assumptions:
Storage capacities of nodes differ by no more than two orders of magnitude.
Advertised capacity is the basis for admitting a node.

Two conflicting responsibilities:
Balance the free storage across nodes as utilization grows,
Keep k copies of each file at the k nodes whose nodeIDs are closest to its fileID.

Page 38: Large Scale Sharing

I) Load Balancing

What causes load imbalance? Differences in:
the number of files per node (due to the distribution of nodeIDs and fileIDs),
the size distribution of inserted files,
the storage capacity of nodes.

What does the solution aim for? Blur the differences by redistributing data:
Replica diversion: local scale (relocate a single replica among a node's leaf set).
File diversion: global scale (relocate all replicas by re-inserting the file under a different fileID).

Page 39: Large Scale Sharing

Replica and file diversion at a node N receiving file D (flowchart), where SD = size of file D, FN = free space of N, FN' = free space of N', t_pri = primary threshold, t_div = diversion threshold; the decision logic is also sketched in code below:

1. If SD / FN <= t_pri, N stores D, issues a store receipt, and forwards D to the other k - 1 closest nodes.

2. Otherwise N attempts replica diversion: it chooses the diversion node N' with the most free storage among the nodes in its leaf set that are not themselves among the k nodes closest to the fileID and that do not already hold a diverted replica.

3. If such an N' exists and SD / FN' <= t_div, D is stored at N', and N (as well as the (k+1)-st closest node) keeps a pointer to N'.

4. Otherwise (no suitable N' exists, or SD / FN' > t_div) the insertion at N fails and the file is diverted: all replicas are re-inserted under a different fileID (file diversion).
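
A condensed sketch of the decision logic above. The node objects and their `free_space` / `holds_diverted_replica` attributes are hypothetical, and the pointer installation, certificates, and receipts are left out.

```python
T_PRI = 0.1   # primary threshold (the value found best in the experiments)
T_DIV = 0.05  # diversion threshold

def try_store(node, file_size, leaf_set, k_closest):
    """Decide how node N handles a replica of size file_size."""
    if file_size / node.free_space <= T_PRI:
        return "store locally"                       # normal case: accept as primary replica
    # Replica diversion: most free space, not in the k closest, no diverted replica yet.
    candidates = [n for n in leaf_set
                  if n not in k_closest and not n.holds_diverted_replica]
    if candidates:
        n_prime = max(candidates, key=lambda n: n.free_space)
        if file_size / n_prime.free_space <= T_DIV:
            return f"divert replica to {n_prime}"    # N keeps a pointer to N'
    return "reject -> file diversion (re-insert with a new fileID)"
```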

Page 40: Large Scale Sharing

II) Maintaining k Replicas

Problem: nodes join and leave.

On joining:
The new node adds a pointer to the replaced node (similar to replica diversion).
Replicas are then gradually migrated to it as a background job.

On leaving:
Each affected node picks a new k-th closest node, updates its leaf set, and forwards replicas to it.

Notes:
Extreme condition: "expand" the leaf set to 2l.
It is impossible to maintain k replicas if the total storage keeps decreasing.

Page 41: Large Scale Sharing

Optimizations

Storage: file encoding.
E.g., Reed-Solomon encoding: instead of m full replicas of each file, keep m checksum blocks for every n data blocks.

Performance: caching.
Goals: reduce client access latency, maximize query throughput, and balance the query load.
Algorithm: GreedyDual-Size (GD-S), sketched below:
Upon a hit for file d: H_d = c(d) / s(d).
Eviction: evict the file v whose H_v is minimal, and subtract H_v from the remaining H values.
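
A compact sketch of GD-S exactly as stated on the slide, with the cost c(d) assumed to be 1 (pure hit-rate weighting) and the aging done by literally subtracting the evicted H value from the remaining entries.

```python
class GreedyDualSizeCache:
    """GD-S: H(d) = c(d)/s(d) on insert or hit; evict the minimum H and age the rest."""
    def __init__(self, capacity_bytes, cost=lambda d: 1.0):
        self.capacity = capacity_bytes
        self.cost = cost
        self.used = 0
        self.entries = {}  # file_id -> (H value, size)

    def access(self, file_id, size):
        """Return True on a cache hit, False on a miss (after possibly caching the file)."""
        if file_id in self.entries:                     # hit: refresh H
            self.entries[file_id] = (self.cost(file_id) / size, size)
            return True
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda f: self.entries[f][0])
            h_v, s_v = self.entries.pop(victim)
            self.used -= s_v
            # Age the remaining entries by subtracting the evicted H value.
            self.entries = {f: (h - h_v, s) for f, (h, s) in self.entries.items()}
        if size <= self.capacity:                       # cache the new file if it fits
            self.entries[file_id] = (self.cost(file_id) / size, size)
            self.used += size
        return False
```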

Page 42: Large Scale Sharing

Experiments – Setup

Workload 1:
8 web proxy logs from NLANR: 4 million entries referencing 1,863,055 unique URLs, 18.7 GB of content.
Mean = 10,517 bytes, median = 1,312 bytes, max = 138 MB, min = 0 bytes.

Workload 2:
File name and size information combined from several file systems: 2,027,908 files, 166.6 GB.
Mean = 88,233 bytes, median = 4,578 bytes, max = 2.7 GB, min = 0 bytes.

System: k = 5, b = 4, N = 2250 nodes.
Space contribution: drawn from 4 normal distributions (see figure).

Page 43: Large Scale Sharing

Experiment 0

Disable replica and file diversions: t_pri = 1, t_div = 0; reject upon the first failure.

Results: 51.1% of file insertions failed; storage utilization = 60.8%.

Page 44: Large Scale Sharing

Storage Contribution & Leaf Set Size

Experiment: Workload 1, t_pri = 0.1, t_div = 0.05.

Results (see figures): insertion failures and storage utilization under different storage-contribution distributions and leaf-set sizes. More leaves => better; d2 best.

Page 45: Large Scale Sharing

Sensitivity of the Replica Diversion Parameter t_pri

Experiment: Workload 1, l = 32, t_div = 0.05, t_pri varies.

Results (see figure): how the fraction of successful insertions and the storage utilization change as t_pri varies.

Page 46: Large Scale Sharing

Sensitivity of the File Diversion Parameter t_div

Experiment: Workload 1, l = 32, t_pri = 0.1, t_div varies.

Results (see figure): how the fraction of successful insertions and the storage utilization change as t_div varies.

t_pri = 0.1 and t_div = 0.05 yield the best results.

Page 47: Large Scale Sharing

Diversions

File diversions are negligible as long as storage utilization stays below 83%.

The diversion overhead is acceptable.

Page 48: Large Scale Sharing

Insertion Failures w/ Respect to File Size

Workload 1: t_pri = 0.1, t_div = 0.05.
Workload 2: t_pri = 0.1, t_div = 0.05.
(See figures for the distribution of insertion failures by file size.)

Page 49: Large Scale Sharing

Experiments – Caching

Results (see figure): replica diversions increase with storage utilization; even at 99% utilization, caching remains effective because most files are small.

Page 50: Large Scale Sharing

Conclusion

PAST achieves its goals.

But:
It is application-specific.
It is hard to deploy: what is the incentive for nodes to contribute storage?

Additional comments?