Effectively Model Checking Real-World Distributed Systems

Junfeng Yang

Joint work with Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Lintao Zhang, Heming Cui, Jingyue Wu, Chia-che Tsai, John Gallagher


One-slide Summary

• Distributed systems: important, but hard to get right
• Model checking: finds serious bugs, but is slow
• Dynamic Interface Reduction: the first new type of state-space reduction technique in 25 years [DeMeter SOSP '11]
  – Exponentially speeds up model checking
  – One data point: 34 years → 18 hours
• Stable Multithreading: a radically new approach [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
  – What-you-check-is-what-you-run
  – Billions of years → 7 hours
  – https://github.com/columbia/smt-mc

Distributed Systems: Pervasive and Critical


Distributed Systems: Hard to Get Right

• No node has a centralized view of the entire system
• Code must correctly handle many kinds of failures
  – Link failures, network partitions
  – Message loss, delay, or reordering
  – Machine crashes
• Worse: as systems become geo-distributed and larger, weird failures grow more likely

Complex protocols → more complex code → bugs


Model Checking Distributed Systems Implementations

• Choices of actions
  – Send message
  – Recv message
  – Run thread
  – Delay message
  – Fail link
  – Crash machine
  – …
• Run checkers on states
  – E.g., assertions

[Figure: state-space exploration tree whose branches are actions such as send, fail link, run thread, and crash.]
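To make the action-based exploration above concrete, here is a minimal sketch of a depth-limited, checkpoint-and-branch exploration loop. All names here (State, Action, the toy link/counter state) are made up for illustration; this is not MoDist's or dBug's API.

    #include <cassert>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    // Toy system state: one sender/receiver pair and a link flag.
    struct State { int sent = 0, recvd = 0; bool linkUp = true; };

    // An action has a name, an enabledness guard, and an effect.
    struct Action {
        std::string name;
        std::function<bool(const State&)> enabled;
        std::function<void(State&)> apply;
    };

    // Depth-first exploration: at every state, branch on each enabled action.
    void explore(State s, const std::vector<Action>& actions,
                 const std::function<void(const State&)>& checker, int depth) {
        checker(s);                      // run checkers on each state reached
        if (depth == 0) return;
        for (const Action& a : actions) {
            if (!a.enabled(s)) continue;
            State next = s;              // "checkpoint" by copying the state
            a.apply(next);
            explore(next, actions, checker, depth - 1);
        }
    }

    int main() {
        std::vector<Action> actions = {
            {"send msg",  [](const State& s){ return s.linkUp; },
                          [](State& s){ s.sent++; }},
            {"recv msg",  [](const State& s){ return s.linkUp && s.recvd < s.sent; },
                          [](State& s){ s.recvd++; }},
            {"fail link", [](const State& s){ return s.linkUp; },
                          [](State& s){ s.linkUp = false; }},
        };
        // Checker: an assertion over states, as on the slide.
        explore(State{}, actions,
                [](const State& s){ assert(s.recvd <= s.sent); }, /*depth=*/4);
        std::cout << "explored all interleavings up to depth 4\n";
    }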


Good Error Detection Results

• E.g., [MoDist NSDI '09] [dBug SSV '10]
  – Easy: check unmodified, real code in its native environment ("in-situ" [eXplode OSDI '06])
  – Comprehensive: check many corner cases
  – Deterministic: detected errors can be replayed
• MoDist results
  – Checked Berkeley DB replication, MPS (a Microsoft production system), PacificA
  – Found 35 bugs
    • 10 protocol flaws, with flaws found in every system checked
  – Transferred to Microsoft product groups


But, the State Explosion Problem

• Real-world distributed systems have too many states to explore completely
  – Even for conceptually small state spaces
  – 3-node MPS: 34 years for MoDist!
• Incompleteness → low assurance
• Prior model checkers explored many redundant states


This Talk: Two Techniques to Effectively Reduce the State Space

• Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP '11]
  – 34 years → 18 hours, a 10^5 reduction
• Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13] to make what-you-check what-you-run (ongoing)


Dynamic Interface Reduction (DIR)

• Insight: system builders decompose a system into components with narrow interfaces
  – cf. compositional model checking, e.g., [Clarke, Long, McMillan 87] [Laster, Grumberg 98]
• Distinguish global actions from local actions
• Check local actions via a conceptually local fork()

Example component (two threads):
    // main            // ckpt
    n = recv()         Log(total)
    total += n
    Send(n)
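The "conceptually local fork()" can be made literal: a choice point can fork the process so that each branch continues from the identical state. A minimal sketch using POSIX fork(); Toss here is the slide's hypothetical choice primitive, not a real DeMeter call.

    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    // Explore every outcome 0..n-1 of a choice by forking once per branch.
    int Toss(int n) {
        for (int i = 0; i + 1 < n; ++i)
            if (fork() == 0) return i;   // child continues with outcome i
        return n - 1;                    // original process takes the last one
    }

    int main() {
        if (Toss(2) == 0) std::printf("branch 0: Send(P,1); Send(P,2)\n");
        else              std::printf("branch 1: Send(P,1); Send(P,3)\n");
        while (wait(nullptr) > 0) {}     // reap any forked children
        return 0;
    }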


Reduction Analysis

• N components, each having M local actions
  – w/o DIR: M * M * … * M = M^N
  – w/ DIR:  M + M + … + M = M * N

→ Exponential reduction
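A quick computation of the two formulas above, with made-up numbers M = 1,000 and N = 3:

    #include <cmath>
    #include <cstdio>

    int main() {
        const double M = 1000, N = 3;   // local actions per component, components
        std::printf("w/o DIR: %.0f states\n", std::pow(M, N));  // 1,000,000,000
        std::printf("w/  DIR: %.0f states\n", M * N);           // 3,000
    }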


Challenge in Implementing DIR

• How do we automatically compute interfaces from real code without causing false positives or missing bugs?
• Manual specifications: tedious, costly, error-prone
  – Required by prior compositional or modular model checking work
• Made-up interfaces: difficult-to-diagnose false positives [Guerraoui and Yabandeh, NSDI '11]


Automatically Discover Interfaces by Running Code

• Insight: message traces collectively define interfaces

[Figure: a global explorer explores global actions; one local explorer per node explores that node's local actions; message traces flow back and forth between the global explorer and the local explorers.]
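The figure's feedback loop can be summarized as a fixed-point computation over message traces. The sketch below uses stub explorers and string-based traces purely to show the shape of the loop; none of these names are DeMeter's real components.

    #include <set>
    #include <string>
    #include <vector>

    using MsgTrace = std::vector<std::string>;  // e.g. {"P.Recv(C,1)", "P.Send(S,1)"}

    struct GlobalExplorer {
        // Compose known message traces into global traces, explore global
        // actions, and project the results back onto each node.
        std::set<MsgTrace> composeAndProject(const std::set<MsgTrace>& known) {
            return known;                       // stub
        }
    };

    struct LocalExplorer {
        // Replay one node against a message trace, exploring all of its local
        // actions; report any newly discovered message traces (new interface).
        std::set<MsgTrace> explore(const MsgTrace& t) { return {t}; }  // stub
    };

    int main() {
        GlobalExplorer global;
        std::vector<LocalExplorer> locals(3);            // one per node: C, P, S
        std::set<MsgTrace> traces = {{"C.Send(P,1)"}};   // seeded by an initial run
        for (bool changed = true; changed; ) {           // iterate to a fixed point
            changed = false;
            for (const MsgTrace& t : global.composeAndProject(traces))
                for (LocalExplorer& le : locals)
                    for (const MsgTrace& fresh : le.explore(t))
                        changed |= traces.insert(fresh).second;
        }
    }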


Example

Client C:
    if (Toss(2) == 0) { Send(P, 1); Send(P, 2); }
    else              { Send(P, 1); Send(P, 3); }

Primary P (two threads):
    // main              // ckpt
    while (n = recv()) { Log(total)
        total += n;
        Send(S, n);
    }

Secondary S (two threads):
    // main              // ckpt
    while (n = recv()) { Log(total)
        total += n;
    }

Global Explorer: Compute Initial Global Trace

The global explorer first computes an initial global trace (here, with C.Toss(2) = 0):
    C.Toss(2) = 0
    C.Send(P, 1)
    P.Recv(C, 1)
    P.Log
    P.total += 1
    P.Send(S, 1)
    S.Recv(P, 1)
    S.Log
    S.total += 1
    C.Send(P, 2)
    P.Recv(C, 2)
    P.total += 2
    P.Send(S, 2)
    S.Recv(P, 2)
    S.total += 2

Global Explorer: Project Message Traces

The global explorer projects the global trace onto each node, keeping only that node's sends and receives:
    Primary P:   P.Recv(C, 1), P.Send(S, 1), P.Recv(C, 2), P.Send(S, 2)
    Secondary S: S.Recv(P, 1), S.Recv(P, 2)
    Client C:    C.Send(P, 1), C.Send(P, 2)
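Projection itself is just filtering the global trace down to one node's message actions. A minimal sketch, assuming the tagged-string action encoding used above (not DeMeter's actual representation):

    #include <iostream>
    #include <string>
    #include <vector>

    // Keep only `node`'s message actions, e.g. project(g, "P").
    std::vector<std::string> project(const std::vector<std::string>& global,
                                     const std::string& node) {
        std::vector<std::string> out;
        for (const std::string& a : global) {
            bool onNode  = a.rfind(node + ".", 0) == 0;        // prefix test
            bool message = a.find(".Send") != std::string::npos ||
                           a.find(".Recv") != std::string::npos;
            if (onNode && message) out.push_back(a);
        }
        return out;
    }

    int main() {
        std::vector<std::string> global = {
            "C.Toss(2)=0", "C.Send(P,1)", "P.Recv(C,1)", "P.Log",
            "P.total+=1",  "P.Send(S,1)", "S.Recv(P,1)",
        };
        for (const std::string& a : project(global, "P"))
            std::cout << a << "\n";      // prints P.Recv(C,1) then P.Send(S,1)
    }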

Local Explorers: Explore Local Actions Using Message Traces

Each local explorer replays its node against the node's projected message trace and systematically explores the local actions in between.

Local Explorer of Primary: Explore Local Trace 1

Against P's fixed message trace, the ckpt thread's P.Log can interleave with the main thread's increments. Local trace 1 logs before either increment (logging total = 0), matching the initial global trace:
    P.Recv(C, 1), P.Log, P.total += 1, P.Send(S, 1), P.Recv(C, 2), P.total += 2, P.Send(S, 2)

Local Explorer of Primary: Explore Local Trace 2

Local trace 2 logs between the two increments (logging total = 1):
    P.Recv(C, 1), P.total += 1, P.Log, P.Send(S, 1), P.Recv(C, 2), P.total += 2, P.Send(S, 2)

Local Explorer of Primary: Explore Local Trace 3

Local trace 3 logs after both increments (logging total = 3):
    P.Recv(C, 1), P.total += 1, P.Send(S, 1), P.Recv(C, 2), P.total += 2, P.Log, P.Send(S, 2)
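The three local traces above are exactly the interesting placements of the single ckpt action among the main thread's actions. A toy enumeration of those placements (illustration only, not DeMeter's local explorer):

    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // P's main-thread actions; the message trace pins the Recv/Send order,
        // so only the ckpt thread's P.Log placement remains to explore.
        std::vector<std::string> mainThread = {
            "P.Recv(C,1)", "P.total+=1", "P.Send(S,1)",
            "P.Recv(C,2)", "P.total+=2", "P.Send(S,2)",
        };
        for (std::size_t pos = 0; pos <= mainThread.size(); ++pos) {
            std::vector<std::string> trace = mainThread;
            trace.insert(trace.begin() + pos, "P.Log");
            for (const std::string& a : trace) std::cout << a << " ";
            std::cout << "\n";
        }
        // A real explorer prunes equivalent placements (P.Log commutes with
        // Recv/Send), leaving the three distinct traces shown on the slides.
    }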

Local Explorer of Client

The client's local explorer replays C against its message trace, C.Send(P, 1) followed by C.Send(P, 2), and reaches the local choice that produced it: C.Toss(2) = 0.

Local Explorer of Client: Found New Message Trace

Exploring the other outcome of the local choice, C.Toss(2) = 1, yields a new message trace:
    C.Send(P, 1), C.Send(P, 3)

Global Explorer: Composition

The client's local explorer reports the new message trace back to the global explorer, which composes it with the message traces of the other nodes.

Global Explorer: New Global Trace

Composition yields a new global trace that covers the C.Toss(2) = 1 branch:
    C.Toss(2) = 1
    C.Send(P, 1)
    P.Recv(C, 1)
    P.Log
    P.total += 1
    P.Send(S, 1)
    S.Recv(P, 1)
    S.Log
    S.total += 1
    C.Send(P, 3)

Implementation

• 7,279 lines of C++
• Integrated DIR with:
  – MoDist [MoDist NSDI '09]: 757 lines
  – MaceMC [MaceMC NSDI '07]: 1,114 lines
  – Easy
• Orthogonal to partial order reduction, combined via vector clock tricks

Verification/Reduction Results

• MPS: a Microsoft production system
• BDB: Berkeley DB replication
• Chord: a Chord implementation in Mace
• *-n: n nodes
• Results for other benchmarks are in [DeMeter SOSP '11]

    App        MPS-2  MPS-3    BDB-2  BDB-3    Chord-2  Chord-3
    Reduction  488    542,944  277    278,481  19       1,587
    Speedup    153    217,178  50     44,203   7        547

(MPS and BDB were checked with DIR-MoDist; Chord with DIR-MaceMC.)

DIR Summary

• Proven sound (introduces no false positives) and complete (introduces no false negatives)
• Fully automatic; real, exponential reduction
• Works seamlessly with existing model checkers
  – Integrated into MoDist and MaceMC; easy
• Results
  – Verified instances of real-world systems
  – Empirically observed large reductions
    • 34 years → 18 hours (10^5) on MPS

This Talk: Two Techniques to Effectively Reduce the State Space

• Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP '11]
  – 34 years → 18 hours, a 10^5 reduction
• Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13] to make what-you-check what-you-run (ongoing)

Threads: Difficult to Model Check

• Many thread interleavings, or schedules
  – To verify, a local explorer must explore all schedules
• Wide interfaces between threads
  – Any shared-memory load/store
  – Tracing loads and stores is costly
  – DIR may not work well

What-You-Check Is What-You-Run

• Coverage = C / R, where R is all possible runtime schedules and C is the model-checked schedules
• Reduction: enlarge C by exploiting equivalence
• But equivalence is rare and hard to find!
  – DIR took us 2-3 years
• Can we increase coverage without equivalence?
• Shrink R with Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]

Stable Multithreading

• Reuse well-checked schedules on different inputs
• How does it work? See the papers [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
• So much easier that it feels like cheating

[Figure: spectrum of multithreading approaches, from Nondeterministic to Stable to Deterministic.]
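To give the flavor of what "reuse well-checked schedules" means at runtime, the sketch below forces two threads to follow one memoized turn order no matter how the OS schedules them. This is a toy enforcement loop under assumed types (Enforcer, the {0,1,0,1} schedule), not the Tern/Peregrine/Parrot runtime.

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    class Enforcer {                       // replays a memoized schedule
        std::mutex m;
        std::condition_variable cv;
        std::vector<int> order;            // well-checked turn order, by thread id
        std::size_t next = 0;
    public:
        explicit Enforcer(std::vector<int> o) : order(std::move(o)) {}
        void turn(int tid) {               // block until it is tid's turn; free-run
            std::unique_lock<std::mutex> lk(m);  // once the schedule is exhausted
            cv.wait(lk, [&]{ return next >= order.size() || order[next] == tid; });
        }
        void done() {
            std::lock_guard<std::mutex> lk(m);
            ++next;
            cv.notify_all();
        }
    };

    int main() {
        Enforcer sched({0, 1, 0, 1});      // the schedule the checker verified
        auto worker = [&](int tid) {
            for (int step = 0; step < 2; ++step) {
                sched.turn(tid);           // same interleaving on every run
                std::cout << "thread " << tid << " step " << step << "\n";
                sched.done();
            }
        };
        std::thread t0(worker, 0), t1(worker, 1);
        t0.join(); t1.join();
    }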


Conclusion

• Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP '11]
  – Automatic, real, exponential reduction
  – Proven sound and complete
  – 34 years → 18 hours, a 10^5 reduction
• Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] to make what-you-check what-you-run (ongoing)

Key Challenge

• Make stable multithreading work with real-world distributed systems
  – Physical time?
  – Message passing?
  – Dynamic load balancing?
  – Overhead?