Replica Control for Peer-to-Peer Storage Systems

Transcript
Page 1: Replica Control for Peer-to-Peer Storage Systems

Replica Control for Peer-to-Peer Storage Systems

Page 2: Replica Control for Peer-to-Peer Storage Systems

P2P

• Peer-to-peer (P2P) has emerged as an important paradigm for sharing resources at the edges of the Internet.

• The most widely exploited resource is storage, as typified by P2P music file sharing
– Napster
– Gnutella

• Following the great success of P2P file sharing, a natural next step is to develop wide-area, P2P storage systems to aggregate the storage across the Internet.

Page 3: Replica Control for Peer-to-Peer Storage Systems

Replica Control Protocol

• Replication
– to maintain multiple copies of some critical data to increase availability

• Replica Control Protocol
– to guarantee a consistent view of the replicated data

Page 4: Replica Control for Peer-to-Peer Storage Systems

Replica Control Methods

• Optimistic
– Proceed optimistically with computation on the available subgroup and reconcile later for consistency
– Approaches
• Log, version vector, etc.

• Pessimistic
– Restrict computations with worst-case assumptions
– Approaches
• Primary site, voting, etc.

Page 5: Replica Control for Peer-to-Peer Storage Systems


Write-ahead Log

• Files are actually modified in place, but before any file block is changed, a record is written to a log telling which node is making the change, which file block is being changed, and what the old and new values are

• Only after the log has been written successfully is the change made to the file

• Write-ahead log can be used for undo (rollback) and redo
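
Sketched below is a minimal write-ahead-log example in Python; the record layout and the in-memory "file" are illustrative assumptions, not any particular system's format.

# Minimal write-ahead log sketch: append a record describing the change
# (node, block, old and new values) before applying it, so the change
# can later be undone (rollback) or redone.
log = []                 # the append-only write-ahead log
blocks = {0: "old"}      # the "file", as block number -> contents

def write_block(node, blk, new_value):
    # 1. Write the log record first ...
    log.append({"node": node, "block": blk,
                "old": blocks.get(blk), "new": new_value})
    # 2. ... and only then modify the file block in place.
    blocks[blk] = new_value

def undo_last():
    rec = log.pop()
    blocks[rec["block"]] = rec["old"]   # rollback using the logged old value

write_block("n1", 0, "new")
undo_last()
assert blocks[0] == "old"               # the change was rolled back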

Page 6: Replica Control for Peer-to-Peer Storage Systems

Version Vector

• Version vector for file f
– An N-element vector, where N is the number of nodes on which f is stored
– The i-th element represents the number of updates done by node i

• A vector V dominates V′ if
– every element in V >= the corresponding element in V′

• V and V′ conflict if neither dominates the other

Page 7: Replica Control for Peer-to-Peer Storage Systems

Version Vector

• Consistency resolution
– If V dominates V′, the replicas are inconsistent; this can be resolved by copying V's replica over V′'s
– If V and V′ conflict, an inconsistency is detected

• Version vectors can detect only update (write-write) conflicts; they cannot detect read-write conflicts
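
The dominance test and conflict check can be sketched in a few lines of Python (function names are illustrative):

# A version vector has one update counter per node. V dominates V' when
# every element of V is >= the corresponding element of V'.
def dominates(v, w):
    return all(a >= b for a, b in zip(v, w))

def compare(v, w):
    dv, dw = dominates(v, w), dominates(w, v)
    if dv and dw:
        return "identical"
    if dv or dw:
        return "one dominates: copy its data over the stale replica"
    return "conflict: concurrent updates detected, neither dominates"

print(compare([2, 1, 0], [1, 1, 0]))  # one dominates
print(compare([2, 0, 0], [1, 1, 0]))  # conflict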

Page 8: Replica Control for Peer-to-Peer Storage Systems

Primary Site Approach

• Data are replicated on at least k+1 nodes (for k-resilience)

• One node acts as the primary site (PS)
– Any read request is served by the PS
– Any write request is copied to all other back-up sites
– Any write request arriving at a back-up site is forwarded to the PS
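
A toy sketch of these routing rules in Python (the classes and method names are assumptions made for illustration):

# Primary-site replica control: reads are served by the primary; a write
# reaching a back-up is forwarded to the primary, which copies it to all
# back-ups.
class PrimarySite:
    def __init__(self, backups):
        self.data, self.backups = {}, backups   # k back-ups -> k-resilient

    def read(self, key):
        return self.data.get(key)               # all reads served by the PS

    def write(self, key, value):
        self.data[key] = value
        for b in self.backups:                  # copy the write to back-ups
            b.data[key] = value

class BackupSite:
    def __init__(self):
        self.data = {}

    def write(self, primary, key, value):
        primary.write(key, value)               # forward writes to the PS

backups = [BackupSite(), BackupSite()]
ps = PrimarySite(backups)
backups[0].write(ps, "x", 42)                   # write via a back-up
assert ps.read("x") == 42 and all(b.data["x"] == 42 for b in backups)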

Page 9: Replica Control for Peer-to-Peer Storage Systems

PS Failure Handling

• If a back-up fails, there is no interruption in service
• If the PS fails, there are two possibilities

– If the network is not partitioned
• Choose a back-up node as the new primary
• If checkpointing has been active, a restart is needed only from the previous checkpoint

– If the network is partitioned
• Only the partition containing the PS can make progress
• The other partitions stop updates on the data
• It is necessary to distinguish between site failures and network partitions

Page 10: Replica Control for Peer-to-Peer Storage Systems

Voting Approach

• V votes are distributed among n replicas such that
– Vr + Vw > V
– Vw + Vw > V

• Obtain Vr or more votes to read

• Obtain Vw or more votes to write

• Quorum systems are more general than voting
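
A small sketch of the vote-assignment constraints in Python (the weights and thresholds are illustrative):

# Weighted voting: V votes over n replicas; Vr + Vw > V makes every read
# quorum intersect every write quorum, and Vw + Vw > V makes any two
# write quorums intersect.
votes = {"r1": 2, "r2": 1, "r3": 1}    # V = 4 votes over n = 3 replicas
V = sum(votes.values())
Vr, Vw = 2, 3
assert Vr + Vw > V and Vw + Vw > V     # the two intersection constraints

def has_quorum(replicas, threshold):
    # A replica set forms a quorum if its votes reach the threshold.
    return sum(votes[r] for r in replicas) >= threshold

print(has_quorum({"r1"}, Vr))          # True: r1 alone holds Vr = 2 votes
print(has_quorum({"r2", "r3"}, Vw))    # False: 2 votes < Vw = 3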

Page 11: Replica Control for Peer-to-Peer Storage Systems

Quorum Systems

• Trees

• Grid-based (array-based)

• Torus

• Hierarchical

• Multi-column

and so on…

Page 12: Replica Control for Peer-to-Peer Storage Systems


Quorum-Based Schemes (1/2)

• n replicas, each with a version number

• Read operation
– Read-lock and access a read quorum
– Return a replica with the largest version number

• Write operation
– Write-lock and access a write quorum
– Update all replicas in the quorum with the new version number: the largest + 1
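
A sketch of the read and write operations in Python (locking is omitted; the replica layout is an assumption):

# Quorum read/write with version numbers: a read returns the value of a
# largest-version replica in its quorum; a write installs (largest + 1)
# on every replica in its quorum.
replicas = [{"version": 0, "value": None} for _ in range(5)]   # n = 5

def quorum_read(quorum):
    newest = max((replicas[i] for i in quorum), key=lambda r: r["version"])
    return newest["value"], newest["version"]

def quorum_write(quorum, value):
    _, v = quorum_read(quorum)          # largest version in the write quorum
    for i in quorum:                    # install value with version v + 1
        replicas[i] = {"version": v + 1, "value": value}

quorum_write({0, 1, 2}, "a")            # write quorum of size 3
print(quorum_read({2, 3, 4}))           # ('a', 1): quorums intersect at 2

With n = 5, quorums of size 3 always share at least one replica, so every read sees the latest write.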

Page 13: Replica Control for Peer-to-Peer Storage Systems


Quorum-Based Schemes (2/2)

• One-copy equivalence is guaranteed if we enforce
– Write-write and write-read lock exclusion
– The Intersection Property: a non-empty intersection between any pair of
• a read quorum and a write quorum
• two write quorums

The set of replicas must behave as if there were only a single copy. This is the strictest consistency criterion.

Page 14: Replica Control for Peer-to-Peer Storage Systems

Witnesses

Witness: a small entity that maintains enough information to identify the replicas that contain the most recent version of the data

– The information could be a timestamp recording the time of the latest update

– The information could also be a version number: an integer incremented each time the data are updated

Page 15: Replica Control for Peer-to-Peer Storage Systems

Classification of P2P Storage Sys.

• Unstructured
– “Replication Strategies for Highly Available Peer-to-peer Storage”
– “Replication Strategies in Unstructured Peer-to-peer Networks”

• Structured
– Read-only: CFS, PAST
– Read/Write (mutable): LAR, Ivy, Oasis, Om, Eliot
– Sigma (for a mutual exclusion primitive)

Page 16: Replica Control for Peer-to-Peer Storage Systems

Ivy

• Ivy stores a set of logs with the aid of a distributed hash table.

• Ivy keeps, for each participant, a log recording all of that participant's updates, and maintains data consistency optimistically by performing conflict resolution among the logs (i.e., consistency is maintained in a best-effort manner).

• The logs must be kept indefinitely, and a participant must scan all the logs related to a file to find the up-to-date file data. Thus, Ivy is suitable only for small groups of participants.

Page 17: Replica Control for Peer-to-Peer Storage Systems

Solution: Log Based

• Update: Each participant maintains a log of changes to the file system

• Lookup: Each participant scans all logs
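
A toy sketch of this log-based scheme in Python (the record format and the logical clock are assumptions for illustration):

# Log-based update/lookup: each participant appends updates to its own
# log; a lookup scans all logs and takes the newest record for the file.
import itertools

logs = {"alice": [], "bob": []}    # one append-only log per participant
clock = itertools.count()          # toy logical clock to order records

def update(participant, path, data):
    logs[participant].append({"ts": next(clock), "path": path, "data": data})

def lookup(path):
    records = [r for log in logs.values() for r in log if r["path"] == path]
    return max(records, key=lambda r: r["ts"])["data"] if records else None

update("alice", "/f", "v1")
update("bob", "/f", "v2")
print(lookup("/f"))                # 'v2', found only by scanning every log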

Page 18: Replica Control for Peer-to-Peer Storage Systems

Eliot

• Eliot relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) to store mutable metadata.

• It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high under these semantics.

• It also supports AFS open-close consistency semantics; however, these semantics may cause lost updates.

• The MS is provided by a conventional replicated database, which may not be a good fit for dynamic P2P environments.

Page 19: Replica Control for Peer-to-Peer Storage Systems

Oasis

• Oasis is based on Gifford's weighted-voting quorum concept and allows dynamic quorum membership.

• It spreads versioned metadata, along with data replicas, over the P2P network.

• To perform an operation on a data object, a client must first find the metadata for the object and determine the total number of votes, the votes required for read/write operations, the replica list, and so on, in order to form a quorum accordingly.

• One drawback of Oasis is that if a node happens to use stale metadata, data consistency may be violated.
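
A sketch of how a client might use such metadata to form a read quorum (the metadata fields mirror the description above; all names and values are assumptions):

# Oasis-style quorum formation: the client first locates the object's
# versioned metadata, then gathers enough votes from the replica list.
metadata = {
    "version": 7,
    "total_votes": 4,
    "read_votes": 2,                            # votes needed to read
    "write_votes": 3,                           # votes needed to write
    "replicas": {"r1": 2, "r2": 1, "r3": 1},    # replica -> vote weight
}

def form_read_quorum(meta):
    gathered, quorum = 0, []
    for replica, weight in meta["replicas"].items():
        quorum.append(replica)
        gathered += weight
        if gathered >= meta["read_votes"]:
            return quorum                       # enough votes collected
    return None

print(form_read_quorum(metadata))               # ['r1']
# If this metadata is stale (e.g., thresholds or the replica list have
# changed), the quorum formed here may violate consistency -- the
# drawback noted above.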

Page 20: Replica Control for Peer-to-Peer Storage Systems

Om

• Om is based on the concepts of automatic replica regeneration and replica membership reconfiguration.

• Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-modeled quorum system for reconfiguration.

• Om allows replica regeneration from a single replica. However, a write in Om is always first forwarded to the primary copy, which serializes all writes and uses a two-phase procedure to propagate each write to all secondary replicas.

• The drawbacks of Om are that (1) the primary replica may become a bottleneck, (2) the overhead incurred by the two-phase procedure may be too high, and (3) reconfiguration by the witness model has some probability of violating consistency.
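
A rough sketch of the write path described above (primary-serialized writes propagated with a prepare/commit exchange; the interfaces are illustrative assumptions, not Om's actual ones):

# Om-style write path: every write goes through the primary, which
# serializes it and pushes it to the secondaries in two phases.
class Replica:
    def __init__(self):
        self.committed, self.pending = {}, {}

    def prepare(self, key, value):     # phase 1: stage the update
        self.pending[key] = value
        return True                    # acknowledge the prepare

    def commit(self, key):             # phase 2: make the update visible
        self.committed[key] = self.pending.pop(key)

def write(primary, secondaries, key, value):
    replicas = [primary] + secondaries
    if all(r.prepare(key, value) for r in replicas):    # phase 1
        for r in replicas:                              # phase 2
            r.commit(key)

primary, secondaries = Replica(), [Replica(), Replica()]
write(primary, secondaries, "x", 1)
assert all(r.committed["x"] == 1 for r in [primary] + secondaries)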

Page 21: Replica Control for Peer-to-Peer Storage Systems

Witness Model

Page 22: Replica Control for Peer-to-Peer Storage Systems

Sigma System Model

Page 23: Replica Control for Peer-to-Peer Storage Systems

Sigma System Model

• Replicas are always available, but their internal states may be randomly reset (a failure-recovery model).

• The number of clients is unpredictable. Clients are not malicious and are fail-stop.

• Clients and replicas communicate via messages, which may be duplicated or lost, but are never forged.

Page 24: Replica Control for Peer-to-Peer Storage Systems

Sigma

• The Sigma protocol collects state from all replicas to achieve mutual exclusion.

• The basic idea of the Sigma protocol is as follows. A node u wishing to win the mutual exclusion sends a timestamped request to each of the n (n = 3k+1) replicas and waits for replies. On receiving a request from u, a replica v inserts u's request into a local queue ordered by timestamp, regards the node whose request is at the front of the queue as the current winner, and replies to u with that winner's ID.

Page 25: Replica Control for Peer-to-Peer Storage Systems

Sigma

• When the number of replies received by u reaches m (m = 2k+1), u acts according to the following conditions: (1) if at least m replies name u as the winner, then u is the winner; (2) if at least m replies name some other node w (w ≠ u) as the winner, then w is the winner and u just keeps waiting; (3) if no node is named the winner by at least m replies, then u sends a YIELD message to cancel its request temporarily and then re-inserts its request after a random backoff.

• In this manner, one node can eventually be elected the winner even when the communication delay variance is large.

• A drawback of the Sigma protocol is that a node must send requests to all replicas and obtain supporting replies from a large portion (2/3) of them to win the mutual exclusion, which incurs large overhead. Moreover, the overhead becomes even larger under high contention.
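
The reply-counting rule on the requesting node's side can be sketched as follows (message handling and the per-replica queues are omitted; names are assumptions):

# Sigma-style winner determination: with n = 3k+1 replicas, node u waits
# for m = 2k+1 replies, each naming the requester at the front of that
# replica's timestamp-ordered queue.
from collections import Counter

def decide(u, replies, k):
    m = 2 * k + 1
    if len(replies) < m:
        return "wait for more replies"
    winner, support = Counter(replies).most_common(1)[0]
    if support >= m:                    # some node named by >= m replies
        return "u wins" if winner == u else f"{winner} wins; u keeps waiting"
    # No node reaches m supporting replies: cancel and retry.
    return "send YIELD, re-insert the request after a random backoff"

k = 1                                   # n = 4 replicas, m = 3 replies
print(decide("u", ["u", "u", "u"], k))  # u wins
print(decide("u", ["w", "w", "w"], k))  # w wins; u keeps waiting
print(decide("u", ["u", "w", "v"], k))  # split vote -> YIELD and back off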

Page 26: Replica Control for Peer-to-Peer Storage Systems

References

• Ivy: A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, “Ivy: A Read/Write Peer-to-Peer File System,” in Proc. of the Symposium on Operating Systems Design and Implementation (OSDI), 2002.

• Eliot: C. Stein, M. Tucker, and M. Seltzer, “Building a Reliable Mutable File System on Peer-to-Peer Storage,” in Proc. of the Workshop on Reliable Peer-to-Peer Distributed Systems (WRP2PDS, with the 21st IEEE SRDS), 2002.

• Oasis: M. Rodrig and A. LaMarca, “Decentralized Weighted Voting for P2P Data Management,” in Proc. of the 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 85–92, 2003.

• Om: H. Yu and A. Vahdat, “Consistent and Automatic Replica Regeneration,” in Proc. of the 1st Symposium on Networked Systems Design and Implementation (NSDI ’04), 2004.

• Sigma: S. Lin, Q. Lian, M. Chen, and Z. Zhang, “A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-Peer Systems,” in Proc. of the 3rd International Workshop on Peer-to-Peer Systems (IPTPS ’04), 2004.

Page 27: Replica Control for Peer-to-Peer Storage Systems

Q&A