D3S: Debugging Deployed Distributed Systems
Xuezheng Liu et al., Microsoft Research, NSDI 2008
Presenter: Shuo Tang, CS525@UIUC
Debugging distributed systems is difficult
• Bugs are difficult to reproduce
  – Many machines executing concurrently
  – Machines/network may fail
• Consistent snapshots are not easy to get
• Current approaches
  – Multi-threaded debugging
  – Model checking
  – Runtime checking
State of the Art
• Example
  – Distributed reader-writer locks
• Log-based debugging
  – Step 1: add logs

      void ClientNode::OnLockAcquired(…) {
        …
        print_log(m_NodeID, lock, mode);
      }

  – Step 2: collect logs
  – Step 3: write checking scripts (sketched below)
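To make step 3 concrete, the sketch below shows what such a checking script could look like in C++, assuming each collected log line has already been parsed into an acquire/release event; the event layout and function names are illustrative, not from the paper.

  #include <cstdio>
  #include <map>
  #include <set>
  #include <utility>
  #include <vector>

  // Illustrative event record parsed from the collected logs.
  struct LogEvent {
      int  node;       // m_NodeID printed by the instrumented client
      int  lock;       // lock id
      bool exclusive;  // lock mode
      bool acquire;    // true = OnLockAcquired, false = OnLockReleased
  };

  // Scan the merged, time-ordered log and flag conflicting lock holders.
  void CheckLockConflicts(const std::vector<LogEvent>& events) {
      // lock id -> set of (node, exclusive) currently holding the lock
      std::map<int, std::set<std::pair<int, bool>>> holders;
      for (const LogEvent& e : events) {
          auto& h = holders[e.lock];
          if (!e.acquire) { h.erase({e.node, e.exclusive}); continue; }
          for (const auto& other : h)
              if (e.exclusive || other.second)  // any exclusive holder conflicts
                  std::printf("conflict on lock %d: nodes %d and %d\n",
                              e.lock, e.node, other.first);
          h.insert({e.node, e.exclusive});
      }
  }

Note that this only works if the logs from all machines can be merged into one consistent order, which is exactly what is hard to guarantee and what motivates D3S.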
Problems
• Too much manual effort
• Difficult to anticipate what to log
  – Too much?
  – Too little?
• Checking a large system is challenging
  – A central checker cannot keep up
  – Snapshots must be consistent
D3S Contribution
• A simple language for writing distributed predicates
• Programmers can change what is being checked on-the-fly
• Failure tolerant consistent snapshot for predicate checking
• Evaluation with five real-world applications
D3S Workflow
[Workflow diagram: application processes expose their states (lock holders); checkers evaluate the predicate "no conflicting locks" over the exposed states and report a violation when a conflict is found.]
Glance at D3S Predicate

V0: exposer { (client: ClientID, lock: LockID, mode: LockMode) }
V1: V0 { (conflict: LockID) } as final
after (ClientNode::OnLockAcquired) addtuple ($0->m_NodeID, $1, $2)
after (ClientNode::OnLockReleased) deltuple ($0->m_NodeID, $1, $2)
class MyChecker : vertex<V1> {
  virtual void Execute(const V0::Snapshot & snapshot) {
    …  // Invariant logic, written in sequential style
  }
  static int64 Mapping(const V0::tuple & t);  // guidance for partitioning
};
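A minimal sketch of how the invariant logic and the partitioning hint might be filled in for the lock-conflict predicate; the tuple field names, EXCLUSIVE, and Output() are assumed interfaces for illustration, not the verified D3S API.

  #include <map>

  // Sketch only: assumes a Snapshot can be iterated as (client, lock, mode)
  // tuples and that Output() emits the (conflict: LockID) violation tuple.
  class MyChecker : public vertex<V1> {
      virtual void Execute(const V0::Snapshot& snapshot) {
          std::map<LockID, int> exclusive, shared;   // holders per lock
          for (const V0::tuple& t : snapshot) {
              if (t.mode == EXCLUSIVE) ++exclusive[t.lock];
              else                     ++shared[t.lock];
          }
          for (const auto& kv : exclusive)
              // an exclusive holder conflicts with any other holder of the lock
              if (kv.second > 1 || shared[kv.first] > 0)
                  Output(kv.first);                  // emit (conflict: LockID)
      }
      // Partition tuples across checkers by lock id, so all holders of the
      // same lock are examined by the same checker.
      static int64 Mapping(const V0::tuple& t) { return t.lock; }
  };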
D3S Parallel Predicate Checker

[Diagram: lock clients expose states individually, e.g. (C1, L1, E), (C2, L3, S), (C5, L1, S). The runtime reconstructs snapshots SN1, SN2, … and partitions the exposed states across checkers by the key LockID, so (C1, L1, E) and (C5, L1, S) go to one checker and (C2, L3, S) to another.]
Summary of Checking Language
• Predicate
  – Any property calculated from a finite number of consecutive state snapshots
• Highlights
  – Sequential programs (w/ mapping)
  – Reuse of app types in the script and C++ code
• Binary instrumentation
  – Support for reducing the overhead (in the paper)
  – Incremental checking
  – Sampling in time or over snapshots
Snapshots
• Use Lamport clocks
  – Instrument the network library (see the sketch below)
  – 1000 logical clocks per second
• Problem: how does the checker know whether it receives all necessary states for a snapshot?
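For reference, a minimal sketch of the Lamport-clock bookkeeping such an instrumented network library performs; the struct and method names are illustrative.

  #include <algorithm>
  #include <cstdint>

  // Illustrative Lamport clock kept per process; every exposed state and
  // every outgoing message is stamped with the current value.
  struct LamportClock {
      uint64_t t = 0;

      uint64_t OnLocalEvent() { return ++t; }    // e.g. exposing a state
      uint64_t OnSend()       { return ++t; }    // stamp piggybacked on message
      uint64_t OnReceive(uint64_t msg_ts) {      // merge the sender's clock
          t = std::max(t, msg_ts) + 1;
          return t;
      }
  };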
Consistent Snapshot
• Membership
• What if a process does not have state to expose for a long time?
• What if a checker fails?
[Timeline example: processes A and B expose states to the checker. A exposes {(A, L0, S)} at ts=2, an empty set at ts=10, and {(A, L1, E)} at ts=16; B exposes {(B, L1, E)} at ts=6 and then fails around ts=12. The checker tracks membership: with M(2)={A,B} it still needs SB(2); with M(6)={A,B} it needs SA(6). Once A reports at ts=10 with nothing new, the checker infers SA(6)=SA(2) and runs check(6). After B's failure is detected, SB(10)=SB(6) and check(10) runs; with M(16)={A}, check(16) runs on A's state alone.]
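The example suggests a simple completeness test before running check(t): every process that is a member at time t must have reported something (a state or an empty heartbeat) timestamped at or after t, or have been removed from the membership. A sketch, with illustrative names:

  #include <cstdint>
  #include <map>
  #include <set>

  // Illustrative: returns true once no state for logical time t can still be
  // missing. Failed processes are assumed to have been removed from membership.
  bool SnapshotComplete(uint64_t t,
                        const std::set<int>& membership_at_t,
                        const std::map<int, uint64_t>& latest_ts_from) {
      for (int process : membership_at_t) {
          auto it = latest_ts_from.find(process);
          if (it == latest_ts_from.end() || it->second < t)
              return false;   // still waiting on this process
      }
      return true;
  }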
Experimental Method
• Debugging five real systems
  – Can D3S help developers find bugs?
  – Are predicates simple to write?
  – Is the checking overhead acceptable?
• Case: Chord implementation (i3)
  – Uses predecessor and successor lists to stabilize
  – "Holes" and overlaps
Chord Overlay
Perfect ring:
• No overlap, no holes
• Aggregated key coverage is 100%

[Plot: key range coverage ratio (0%-200%) vs. time (0-80,000 seconds), for 3 predecessors and 8 predecessors.]
Consistency vs. availability: cannot get both
• Global measure of the factors
• See the tradeoff quantitatively for performance tuning
• Capable of checking detailed key coverage
[Plot: number of Chord nodes holding each key (0-4) vs. key serial (0-256), for 3 predecessors and 8 predecessors.]
Summary of Results

Application | LoC | Predicates | LoP | Results
PacificA (structured data storage) | 67,263 | membership consistency; leader election; consistency among replicas | 118 | 3 correctness bugs
Paxos implementation | 6,993 | consistency in consensus outputs; leader election | 50 | 2 correctness bugs
Web search engine | 26,036 | unbalanced response time of indexing servers | 81 | 1 performance problem
Chord (DHT) | 7,640 | aggregate key range coverage; conflicting key holders | 72 | tradeoff between availability & consistency
BitTorrent client | 36,117 | health of neighbor set; distribution of downloaded pieces; peer contribution rank | 210 | 2 performance bugs; free riders

(LoC = lines of application code; LoP = lines of predicate code. PacificA, Paxos, and the web search engine are data-center apps; Chord and BitTorrent are wide-area apps.)
Overhead (PacificA)
[Plot: time to complete (seconds) vs. number of clients (2-10, each sending 10,000 requests), with and without D3S checking; the measured overhead per point is 7.21%, 4.38%, 3.94%, 4.20%, and 7.24%.]
• Less than 8%, in most cases less than 4%
• I/O overhead < 0.5%
• Overhead is negligible in other checked systems
Related Work
• Log analysis
  – Magpie [OSDI'04], Pip [NSDI'06], X-Trace [NSDI'07]
• Predicate checking at replay time
  – WiDS Checker [NSDI'07], Friday [NSDI'07]
• P2-based online monitoring
  – P2-monitor [EuroSys'06]
• Model checking
  – MaceMC [NSDI'07], CMC [OSDI'04]
Conclusion
• Predicate checking is effective for debugging deployed & large-scale distributed systems
• D3S enables:
  – Changing what is monitored on-the-fly
  – Checking with multiple checkers
  – Specifying predicates in a sequential & centralized manner
Thank You
• Thanks to the authors for providing some of the slides
PNUTS: Yahoo!'s Hosted Data Serving Platform
Brian F. Cooper et al. @ Yahoo! Research
Presented by Ying-Yi Liang
* Some slides come from the authors' version
What is the Problem
• The web era: web applications
• Users are picky – low latency, high availability
• Enterprises are greedy – high scalability
• Things move fast – new ideas expire very soon
• Two ways of developing a cool web application:
  – Making your own fire: quick, cool, but tiring and error prone
  – Using huge "powerful" building blocks: wonderful, but the market will have shifted away by the time you are done
• Neither way scales very well…
• Something is missing – an infrastructure specially tailored for web applications!
Web Application Model
• Object sharing: blogs, Flickr, Web Picasa, YouTube, …
• Social: Facebook, Twitter, …
• Listing: Yahoo! Shopping, del.icio.us, news
• They require:
  – High scalability, availability, and fault tolerance
  – Acceptable latency for geographically distributed requests
  – A simplified query API
  – Some consistency (weaker than SC)
PNUTS – DB in the Cloud
[Diagram: PNUTS as a database in the cloud – a parallel database with geographic replication, indexes and views, a structured yet flexible schema, and a hosted, managed infrastructure; an example table with rows A-F is shown replicated across sites. Example schema:]

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)
Basic Concepts

[Example table: rows keyed by a primary key (Grape, Lime, Apple, Strawberry, Orange, Avocado, Lemon, Tomato, Banana, Kiwi), each with a value field (e.g. "Grapes are good to eat", "Limes are Green", "Apple is wisdom", …). Each row is a record, each column is a field, and a group of records forms a tablet.]
A view from 10,000 ft

PNUTS Storage Architecture
[Diagram: within a region, clients issue requests through a REST API to routers; a tablet controller tells the routers where tablets live; requests are served by storage units, and updates flow through the message broker.]

Geographic Replication
[Diagram: the same per-region architecture (clients, REST API, routers, tablet controller, storage units) is replicated across Region 1, Region 2, and Region 3, tied together by the message broker.]

In-region Load Balance
[Diagram: tablets are moved between storage units within a region to balance load.]
Data and Query Models
• Simplified relational data model: tables of records
• Typed columns; typical data types plus the blob type
• Does not enforce inter-table relationships
• Operations: selection, projection (no joins, aggregation, …)
• Options: point access, range query, multiget (illustrated below)
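To make the operation set concrete, here is a hypothetical client interface of roughly this shape; none of these names are the actual PNUTS API, they only illustrate point access, multiget, and range scans with projection and no joins:

  #include <map>
  #include <string>
  #include <vector>

  // Hypothetical interface for illustration only: one table, point access,
  // multiget, and range scans, with projection (field selection) but no
  // joins or aggregation.
  using Record = std::map<std::string, std::string>;   // field name -> value

  class TableClient {
  public:
      // Point access by primary key, optionally projecting a subset of fields.
      Record Get(const std::string& key,
                 const std::vector<std::string>& fields = {});

      // Multiget: fetch several keys in one call.
      std::vector<Record> MultiGet(const std::vector<std::string>& keys);

      // Range query over an ordered table: keys in [begin_key, end_key).
      std::vector<Record> Scan(const std::string& begin_key,
                               const std::string& end_key,
                               const std::vector<std::string>& fields = {});
  };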
Record Assignment

[Diagram: the router assigns records (Apple, Avocado, Banana, Blueberry, Canteloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon) to tablets by key interval and maps each tablet to a storage unit: MIN-Canteloupe -> SU1, Canteloupe-Lime -> SU3, Lime-Strawberry -> SU2, Strawberry-MAX -> SU1.]
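A minimal sketch of how a router could resolve a key against that interval table; keying an ordered map by each tablet's upper bound is just one possible implementation choice, not necessarily what PNUTS does:

  #include <iostream>
  #include <map>
  #include <string>

  // Tablet intervals keyed by their (exclusive) upper bound:
  //   MIN-Canteloupe -> SU1, Canteloupe-Lime -> SU3,
  //   Lime-Strawberry -> SU2, Strawberry-MAX -> SU1
  std::map<std::string, std::string> tablet_by_upper_bound = {
      {"Canteloupe", "SU1"},
      {"Lime",       "SU3"},
      {"Strawberry", "SU2"},
      {"~",          "SU1"},   // '~' sorts after letters; stands in for MAX
  };

  // Route a key to the storage unit owning its tablet.
  std::string RouteKey(const std::string& key) {
      auto it = tablet_by_upper_bound.upper_bound(key);  // first bound > key
      return it->second;
  }

  int main() {
      std::cout << RouteKey("Grape")  << "\n";  // Canteloupe-Lime -> SU3
      std::cout << RouteKey("Tomato") << "\n";  // Strawberry-MAX  -> SU1
  }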
Single Point Update

[Diagram: message flow for updating a single key k through the routers, the record's master storage unit, and the message brokers: (1)-(2) "write key k" travels from the client via a router to the master storage unit, (3)-(5) the write passes through the message broker and SUCCESS is returned, (6) the write is propagated onward, and (7)-(8) replica storage units apply it using the sequence number for key k.]
Range Query
[Diagram: a range query such as "Grapefruit…Pear?" is split by the router along tablet boundaries: "Grapefruit…Lime?" is served by SU3 (Canteloupe-Lime) and "Lime…Pear?" by SU2 (Lime-Strawberry). Interval map: MIN-Canteloupe -> SU1, Canteloupe-Lime -> SU3, Lime-Strawberry -> SU2, Strawberry-MAX -> SU1.]
Relaxed Consistency
• ACID transactions / sequential consistency: too strong
  – Non-trivial overhead for asynchronous settings
• Users can tolerate stale data in many cases
• Go hybrid: eventual consistency + a mechanism for SC
• Use versioning to cope with asynchrony
[Timeline: a record is inserted, updated repeatedly, and finally deleted, producing versions v.1 through v.8 within generation 1.]
Relaxed Consistency
[The same version timeline (v.1 … v.8, generation 1) illustrates the per-record API calls:
• read_any() – may return a stale version
• read_latest() – returns the current version
• read_critical("v.6") – returns a version at least as fresh as v.6
• write() – installs a new current version
• test_and_set_write(v.7) – applies the write only if the current version is v.7; otherwise it returns an ERROR]
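To show how an application might combine these calls, here is a sketch of a retried read-modify-write built on read_latest and test_and_set_write; the client wrapper and its exact signatures are hypothetical:

  #include <string>

  // Hypothetical record handle: current value plus the version it was read at.
  struct VersionedRecord {
      std::string value;
      int         version;   // e.g. 7 for "v.7"
  };

  // Retry until the conditional write succeeds: test_and_set_write applies
  // the update only if the record is still at the version we read, so a
  // concurrent writer forces a re-read instead of silently losing an update.
  template <typename Client, typename Fn>
  void ReadModifyWrite(Client& client, const std::string& key, Fn modify) {
      for (;;) {
          VersionedRecord r = client.read_latest(key);
          std::string updated = modify(r.value);
          if (client.test_and_set_write(key, updated, r.version))
              return;          // installed as the new current version
          // version mismatch (the ERROR case above): someone wrote first
      }
  }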
Membership Management
• Record timelines should be coherent at each replica
• Updates must be applied to the latest version
• Use mastership
  – Per-record basis
  – Only one replica has mastership at any time
  – All update requests are sent to the master to get ordered
  – Routers & YMB maintain mastership information
  – The replica receiving frequent write requests gets the mastership
• Leader election service provided by ZooKeeper
ZooKeeper
• A distributed system is like a zoo; someone needs to be in charge of it
• ZooKeeper is a highly available, scalable coordination service
• ZooKeeper plays two roles in PNUTS:
  – Coordination service
  – Publish/subscribe service
• Guarantees: sequential consistency, single system image, atomicity (as in ACID), durability, timeliness
• A tiny kernel for upper-level building blocks
ZooKeeper: High Availability
• High availability via replication
• A fault-tolerant persistent store
• Provides sequential consistency
ZooKeeper: Services
• Publish/subscribe service
  – Contents stored in ZooKeeper are organized as directory trees of znodes
  – Publish: write to a specific znode
  – Subscribe: read a specific znode
• Coordination via automatic name resolution
  – By appending a sequence number to names: CREATE("/…/x-", host, EPHEMERAL | SEQUENCE) yields "/…/x-1", "/…/x-2", …
  – Ephemeral nodes: znodes that live only as long as the session that created them
ZooKeeper Example: Lock
1) id = create("…/locks/x-", SEQUENCE | EPHEMERAL);
2) children = getChildren("…/locks", false);
3) if (children.head == id) exit();   // lowest sequence number holds the lock
4) test = exists(name of last child before id, true);   // sets a watch
5) if (test == false) goto 2);
6) wait for modification to "…/locks";
7) goto 2);
ZooKeeper Is Powerful
• Many core services in distributed systems are built on ZooKeeper:
  – Consensus
  – Distributed locks (exclusive, shared)
  – Membership
  – Leader election
  – Job tracker binding
  – …
More information at http://hadoop.apache.org/zookeeper/
Experimental Setup
• Production PNUTS code, enhanced with an ordered table type
• Three PNUTS regions
  – 2 west coast, 1 east coast
  – 5 storage units, 2 message brokers, 1 router
  – West: dual 2.8 GHz Xeon, 4 GB RAM, 6-disk RAID 5 array
  – East: quad 2.13 GHz Xeon, 4 GB RAM, 1 SATA disk
• Workload
  – 1200-3600 requests/second
  – 0-50% writes
  – 80% locality
Scalability
[Plot: average latency (ms, 0-160) vs. number of storage units (2-5), for hash table and ordered table.]
Sensitivity to R/W Ratio
[Plot: average latency (ms, 0-140) vs. write percentage (0-50%), for hash table and ordered table.]
Sensitivity to Request Dist.
[Plot: average latency (ms, 0-100) vs. Zipf factor (0-1), for hash table and ordered table.]
Related Work
• Google BigTable/GFS
  – Fault tolerance and consistency via Chubby
  – Strong consistency, but Chubby is not scalable
  – Lacks geographic replication support
  – Targets analytical workloads
• Amazon Dynamo
  – Unstructured data
  – Peer-to-peer style solution
  – Eventual consistency
• Facebook Cassandra (still kind of a secret)
  – Structured storage over a peer-to-peer network
  – Eventual consistency
  – Always-writable property: writes succeed even in the face of failures
Discussion
• Can all web applications tolerate stale data?
• Is doing replication completely across the WAN a good idea?
• Single-level router vs. B+-tree-style router hierarchy
• Tiny service kernel vs. standalone services
• Is relaxed consistency just right, or too weak?
• Is exposing record versions to applications a good idea?
• Should security be integrated into PNUTS?
• Using the pub/sub service as undo logs