Building and Programming the Cloud, Mysore, Jan 2010

Accountable distributed systems and the accountable cloud

Peter Druschel

joint work with Andreas Haeberlen (1), Petr Kuznetsov (2), Rodrigo Rodrigues

(1) University of Pennsylvania

(2) TU Berlin/Deutsche Telekom Labs


Outline

Why accountability?
A definition
A practical implementation: PeerReview
Accountability in the Cloud
Technical Challenges
Conclusion


What is the problem?


Multiple administrative domains (federated, p2p)
Multiple stakeholders (hosting, Web)
  different actors, somewhat different interests
  lack of global visibility, control
Complex faults
  software faults, mis-configuration, negligence, disgruntled employees, outside attacks, manipulation
Lack of transparency


Learning from the 'offline' world

Relies heavily on accountability to deal with faults, misbehavior
Example: Banking

Requirement              Solution
Commitment               Signed receipts
Tamper-evident record    Double-entry bookkeeping
Inspections              Audits

Record can be used to
  (manually) detect problems
  identify the responsible party
  convince others that a problem does (not) exist



What does accountability mean in distributed systems?

1. Tamper-evident record of each node's actions

2. (Automated) audit for fault detection, localization

3. Evidence to convince a third party that a fault has (not) occurred

Accountability provides
  transparency
  trust
  incentives to avoid faults
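
One common way to make a record tamper-evident is a hash chain, in which every entry commits to the hash of the entry before it. The sketch below is an illustration under that assumption, not the scheme used by the authors; the names `entry_hash`, `append`, `verify` and the sample banking actions are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hash value that precedes the first entry (assumed convention)


def entry_hash(prev_hash, seq, action):
    """Hash one entry together with the hash of its predecessor."""
    payload = json.dumps([prev_hash, seq, action]).encode()
    return hashlib.sha256(payload).hexdigest()


def append(log, action):
    """Append an action; the new entry commits to the whole history before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    seq = len(log)
    log.append({"seq": seq, "action": action,
                "hash": entry_hash(prev_hash, seq, action)})


def verify(log):
    """Recompute the chain; return the index of the first bad entry, or None."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        expected = entry_hash(prev_hash, i, entry["action"])
        if entry["seq"] != i or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return None


log = []
for action in ["recv deposit 100", "send receipt 17", "recv withdraw 30"]:
    append(log, action)
```

Changing any recorded action makes `verify` report the position of the altered entry. A chain on its own does not stop a node from rewriting its entire log, so a real system would also need the node to sign and hand out its latest hash to others.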



Ideal accountability

Fault := Node deviates from expected behavior

Our goal is to automatically
  detect faults
  identify the faulty nodes
  convince others that a node is (or is not) faulty

Can we build a system that provides the following guarantee?
  Whenever a node is faulty, the system generates a proof of misbehavior against that node



Can we detect all faults?

Problem: Faults that affect only a node's internal state
  Would require online trusted probes at each node
Focus on observable faults: Faults that affect a correct node
  Can detect observable faults without requiring trusted components

[Figure: nodes A and C, message X, and a bit string]

Can we always get a proof?

Problem: He-said-she-said
Three possible causes:
  A never sent X
  B refuses to acknowledge X
  X was delayed by the network
Cannot get proof of misbehavior!
Generalize to verifiable evidence:
  a proof of misbehavior, or
  a challenge that a faulty node cannot answer
What if the challenged node does not respond?
  Does not prove a fault, but node is suspected until it responds

[Figure: nodes A, B and C; A says "I sent X!", B says "I never received X!"]
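
The difference between a proof of misbehavior and an unanswered challenge can be sketched as a small bookkeeping class kept by an observing node. This is an illustrative assumption, not the authors' code: the class name `Witness`, the status labels and the method names are hypothetical, and checking that a proof or a response is actually valid is elided.

```python
from dataclasses import dataclass, field

TRUSTED, SUSPECTED, EXPOSED = "trusted", "suspected", "exposed"


@dataclass
class Witness:
    """Tracks what one observer currently believes about other nodes."""
    open_challenges: dict = field(default_factory=dict)  # node -> set of challenge ids
    exposed: set = field(default_factory=set)            # nodes with a proof against them

    def on_proof(self, node):
        # A proof of misbehavior is self-contained, so the verdict is permanent.
        self.exposed.add(node)

    def on_challenge(self, node, challenge_id):
        # An unanswered challenge does not prove a fault; it only raises suspicion.
        self.open_challenges.setdefault(node, set()).add(challenge_id)

    def on_response(self, node, challenge_id):
        # A valid response clears this challenge (validation is not shown).
        self.open_challenges.get(node, set()).discard(challenge_id)

    def status(self, node):
        if node in self.exposed:
            return EXPOSED
        if self.open_challenges.get(node):
            return SUSPECTED
        return TRUSTED
```

In the he-said-she-said case, C would challenge B to acknowledge X. B stays suspected only while the challenge is open, so a correct node that was merely slow can clear itself by responding.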



Practical accountability

Requirement for an accountable distributed system:
  Whenever a fault is observed by a correct node, the system eventually generates verifiable evidence against a faulty node

This is useful
  Any (!) fault that affects a correct node is eventually detected and linked to a faulty node

It can be implemented in practice



PeerReview

Adds accountability to a given system
  Implemented as a library
  Provides tamper-evident record
  Detects faults via state-machine replay

Assumptions:

1. Nodes can be modeled as deterministic state machines

2. There is a trusted reference implementation of the state machines

3. Correct nodes can eventually communicate

4. Nodes can sign messages
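
Fault detection via state-machine replay can be sketched as follows: an auditor feeds the logged inputs to the trusted reference implementation and compares each output it produces with the output the node logged. The sketch assumes a toy counter service as the reference state machine; `reference_step`, `audit` and the log layout are hypothetical and chosen only for illustration.

```python
def reference_step(state, message):
    """Hypothetical reference implementation: a deterministic counter service."""
    kind, value = message
    if kind == "add":
        state = state + value
    return state, ("total", state)


def audit(log, initial_state=0):
    """Replay logged inputs; return the index of the first divergent output, or None."""
    state = initial_state
    for i, (logged_input, logged_output) in enumerate(log):
        state, expected_output = reference_step(state, logged_input)
        if logged_output != expected_output:
            return i  # the node's recorded behavior deviates from the reference
    return None


# Each log entry pairs an input message with the output the node claims it sent.
honest_log = [(("add", 5), ("total", 5)), (("add", 2), ("total", 7))]
faulty_log = [(("add", 5), ("total", 5)), (("add", 2), ("total", 9))]
```

Replay only works because of assumption 1: if the node were nondeterministic, a correct node's outputs could legitimately differ from what the reference implementation produces.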


PeerReview is widely applicable

App #1: NFS server in the Linux kernel
  Many small, latency-sensitive requests
  Faults: tampering with files, lost updates, metadata corruption, incorrect access control

App #2: Overlay multicast
  Transfers large volume of data
  Faults: freeloading, tampering with content

App #3: P2P email
  Complex, large, decentralized
  Faults: denial of service, attacks on DHT routing, censorship

Details in [Haeberlen et al., SOSP’07]
NetReview [Haeberlen et al. NSDI’08]



How much does PeerReview cost?

Log storage
  10 – 100 GByte per month, depending on application

Message signatures
  Message latency (e.g. 1.5ms RTT with RSA-1024)
  CPU overhead (embarrassingly parallel)

Log/authenticator transfer, replay overhead
  Depends on # witnesses
  Can be deferred to exploit bursty/diurnal load patterns




Split administration in the Cloud

What if there is a problem?
  Bug in Alice's software
  Subtle differences between Alice and Bob's environments
  ...
  Bug in Bob's software
  Insufficient resource allocation
  Hacker attack
  ...

[Figure: Alice, Bob, Alice's customers]


Split administration: Alice's perspective

[Figure: Alice, Alice's customers, Bob]

If something is wrong, how will I know?

How can I tell if it's my software or the cloud?

If it's the cloud, how can I convince Bob?

Split administration: Bob's perspective

[Figure: Alice, Bob, Alice's customers]

If something is wrong, how will I know?

How can I tell if it's the cloud or Alice's software?

If it's Alice's software, how can I convince Alice?

An idealized solution

What if we had an oracle that Alice and Bob could ask about problems?

Completeness: If the cloud is faulty, the oracle will say so

Accuracy: If the cloud is not faulty, the oracle will say so

Verifiability: The oracle produces evidence that would convince a third party


[Figure: Alice, Bob, Alice's customers, Oracle]

The accountable cloud

Idea: Make cloud accountable
  Cloud records its actions in a tamper-evident log
  Alice can audit the log and check for faults
  Use log to construct evidence that a fault does (not) exist

Should work even if one party was compromised!

[Figure: Alice, Bob, Alice's customers, tamper-evident log]

Discussion

Is this too pessimistic? Cloud isn't malicious!

Hacker attacks, software bugs, operator error, malicious client, …

Difficult to come up with a more restrictive fault model

Without provable properties, evidence has little value

Why would a provider want to deploy this?

Attractive to prospective customers (peace of mind)

Helps handle customer complaints and resolve disputes




Is the technology ready?

Cloud accountability should
  Have provable guarantees
  Work for most cloud applications
  Require no changes to application code
  Cover a wide spectrum of properties
  Have reasonable overhead

Can existing techniques deliver this?
  CATS, Repeat&Compare, AIP, PeerReview, NetReview, AudIt, ...

More work is needed!

Work in progress: AVM

Goal: Provide accountability for arbitrary binary executables

Idea: Accountable virtual machine (AVM)
  Cloud records enough data to enable deterministic replay
  Alice can replay log against a reference implementation
  Can audit any part of the hosted execution

[Figure: Alice, Bob, virtual machine]
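
Auditing only part of a hosted execution can be sketched with periodic snapshots: the auditor starts from one snapshot, replays the logged events up to the next one, and compares the result. Snapshots are an assumption made for this sketch and are not stated in the slides; `step`, `record`, `spot_check` and the event names are hypothetical, and a hash stands in for the virtual machine state.

```python
import hashlib


def step(state, event):
    """Hypothetical deterministic reference execution: fold one event into the state."""
    return hashlib.sha256((state + "|" + event).encode()).hexdigest()


def record(events, snapshot_every):
    """Provider side: run the workload and keep a snapshot every few events."""
    state = "boot"
    snapshots = {0: state}
    for i, event in enumerate(events, start=1):
        state = step(state, event)
        if i % snapshot_every == 0:
            snapshots[i] = state
    return snapshots


def spot_check(events, snapshots, start, end):
    """Auditor side: replay only events[start:end] and compare with the snapshot at end."""
    state = snapshots[start]
    for event in events[start:end]:
        state = step(state, event)
    return state == snapshots[end]


events = [f"packet-{n}" for n in range(12)]
snapshots = record(events, snapshot_every=4)
```

A check such as `spot_check(events, snapshots, 4, 8)` replays four events instead of all twelve, which is the kind of saving that makes spot checks attractive when complete replay is too expensive.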

Challenges

Complete state-machine replay expensive
  limit to spot checks, investigation of suspected faults
  multi-core replay is hard
  replay log against an abstract model?

Checking performance properties

Checking information flow

Lots of research opportunities


Summary

Accountability is a useful capability in distributed systems
  tamper-evident record
  fault detection and localization
  evidence

Proposal: the accountable cloud
  Can verify correct operation, produce evidence
  Provable guarantees: solid foundation for both players

Challenges remain


Questions?

