
Improving Client Web Availability with MONET

David G. Andersen, CMU

Hari Balakrishnan, M. Frans Kaashoek, Rohit Rao, MIT

http://nms.csail.mit.edu/ron/ronweb/

Availability We Want

• Carrier Airlines (2002 FAA Fact Book)

– 41 accidents, 6.7M departures

✔ 99.9993% availability

• 911 Phone service (1993 NRIC report +)

– 29 minutes per year per line

✔ 99.994% availability

• Std. Phone service (various sources)

– 53+ minutes per line per year

✔ 99.99+% availability
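
The percentages on this slide follow from simple ratios. As a rough sketch (the function names are illustrative and not from the talk), downtime per year or failures per attempt can be converted into an availability figure and a count of "nines":

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_from_downtime(minutes_down_per_year: float) -> float:
    """Fraction of the year a service is up, given minutes of downtime."""
    return 1.0 - minutes_down_per_year / MINUTES_PER_YEAR


def availability_from_failures(failures: int, attempts: int) -> float:
    """Fraction of attempts that succeed."""
    return 1.0 - failures / attempts


def nines(availability: float) -> float:
    """Number of nines, e.g. 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    # Inputs taken from the slide: 29 and 53 minutes per line per year,
    # 41 accidents in 6.7M departures.
    for label, a in [
        ("911 phone service", availability_from_downtime(29)),
        ("standard phone service", availability_from_downtime(53)),
        ("carrier airlines", availability_from_failures(41, 6_700_000)),
    ]:
        print(f"{label}: {a:.6%} ({nines(a):.1f} nines)")
```

The results land close to the slide's figures; small differences come from rounding in the published numbers.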

The Internet Has Only Two Nines

✘ End-to-End Internet Availability: 95% - 99.6%

[Paxson, Dahlin, Labovitz, Andersen]

Insufficient substrate for:

• New / critical apps:

– Medical collaboration

– Financial transactions

– Telephony, real-time services, ...

• Users leave if page slower than 4-8 seconds

[Forrester Research, Zona Research]

MONET: Goals

• Mask Internet failures

– Total outages

– Extended high loss periods

• Reduce exceptional delays

– Look like failures to user

– Save seconds, not milliseconds

MONET achieves 99.9 - 99.99% availability

(Not enough, but a good step!)

A fatal exception 0E has occurred at 0028:C00068F8 in PPT.EXE<01> +000059F8. The current application will be terminated.

* Press any key to terminate the application.

* Press CTRL+ALT+DEL to restart your computer. You will lose any unsaved information in all applications.

Press any key to continue

Windows

Not about client failures...

Nor fixing server failures (but understand)

There’s another nine hidden in here, but today...

“It’s about the network!”

End-to-End Availability: Challenges

• Internet services depend on many components:

Access networks, routing, DNS, servers, ...

• End-to-end failures persist despite availability mechanisms for each component.

• Failures unannounced, unpredictable, silent

• Many different causes of failures:

– Misconfiguration, deliberate attacks, hardware/software failures, persistent congestion, routing convergence

Our Approach

• Expose multiple paths to end system

– How to get access to them?

• End-systems determine if path works via probing/measurement

– How to do this probing?

• Let host choose a good end-to-end path

[Diagram: Client → MONET Web Proxy → Server]

Contributions

• MONET Web Proxy design and implementation

• Waypoint Selection algorithm explores paths with low overhead

• Evaluation of deployed system with live user traces; roughly order of magnitude availability improvement

MONET: Bypassing Web Failures

[Diagram: Clients → Lab Proxy → "Internet", with links labeled Cogent, Internet2, Genuity, and MIT.]

• A Web-proxy based system to improve availability

• Three ways to obtain paths

MONET: Obtaining Paths

[Diagram: Clients → Lab Proxy → "Internet", with links labeled MIT, Cogent, DSL, Internet2, and Genuity, plus a Peer Proxy.]

• 10-50% of failures at client access link ➔ Multihome the proxy (no routing needed)

• Many failures at server access link ➔ Contact multiple servers

• 40-60% of failures “in network” ➔ Overlay paths

Parallel Connections Validate Paths

Near-concurrent TCP, peer proxy, and DNS queries.

[Timeline diagram with three columns: Local Proxy, Peer Proxy, Web Server]

1 Request starts

2 Local DNS resolution

3 Peer query (the peer proxy does its own DNS lookup and SYN, then returns a peer response)

4 Local TCP conns (SYNs out, SYN/ACK back)

5 Fetch via 1st

6 Close others
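
The "fetch via first, close others" pattern can be sketched with asyncio. This is only an illustration under stated assumptions: `attempt` is a hypothetical stand-in that sleeps instead of doing real DNS and TCP work, and the path names and delays are invented for the demo. It is not the proxy's actual code.

```python
import asyncio


async def attempt(path: str, delay: float, ok: bool = True) -> str:
    """Stand-in for 'resolve DNS and complete a TCP handshake via this path'."""
    await asyncio.sleep(delay)
    if not ok:
        raise ConnectionError(f"{path} failed")
    return path


async def first_working_path(candidates):
    """Start every attempt at once; return the first success, cancel the rest."""
    tasks = [asyncio.create_task(attempt(*c)) for c in candidates]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished      # step 5: fetch via the first responder
            except ConnectionError:
                continue                   # that path failed; wait for the next
        raise ConnectionError("no path reached the server")
    finally:
        for t in tasks:
            t.cancel()                     # step 6: close the others


def demo() -> str:
    # (path, simulated seconds until an answer, whether it succeeds)
    candidates = [
        ("MIT", 0.05, False),
        ("Cogent", 0.03, True),
        ("peer:NYU", 0.01, False),
    ]
    return asyncio.run(first_working_path(candidates))


if __name__ == "__main__":
    print(demo())
```

In the demo the peer attempt fails first, so the winner is the earliest successful path rather than the earliest reply.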

A More Practical MONET

Evaluated MONET tries all combinations:

– l local interfaces

– p peers

– s servers

– ls + lps paths

– l = 3, p = 3, s = 1–8

– Paths = 12 – 96

• Waypoint Selection chooses the right subset

– What order to try interfaces?

– How long to wait between tries?
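
The path count above can be checked by enumeration. A small sketch, where the tuple layout `(interface, peer_or_None, server)` is an assumption made for illustration:

```python
from itertools import product


def enumerate_paths(interfaces, peers, servers):
    """List direct paths (interface, None, server) and overlay paths
    (interface, peer, server)."""
    direct = [(i, None, s) for i, s in product(interfaces, servers)]
    via_peer = [(i, p, s) for i, p, s in product(interfaces, peers, servers)]
    return direct + via_peer


def path_count(l: int, p: int, s: int) -> int:
    """Closed form from the slide: ls direct paths plus lps overlay paths."""
    return l * s + l * p * s


if __name__ == "__main__":
    interfaces = ["MIT", "Cogent", "DSL"]
    peers = ["NYU", "Utah", "Aros"]
    for s in (1, 8):
        servers = [f"S{k}" for k in range(1, s + 1)]
        assert len(enumerate_paths(interfaces, peers, servers)) == path_count(3, 3, s)
        print(s, "server(s):", path_count(3, 3, s), "paths")
```

With l = 3 and p = 3 this gives 12 paths for one server and 96 for eight, matching the range on the slide.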

Waypoint Selection Problem

[Diagram: client C reaches servers S1, ..., Ss through paths P1, P2, ..., PN.]

Client C, paths P1, ..., PN, servers S1, ..., Ss

➔ Find a good order of the s ∗ N (Px, Sy) pairs.

➔ Find delay between each pair.

Waypoint Selection

[Diagram: Server Selection and Waypoint Selection side by side, each showing client C and servers S, S2, S3, S4; the Waypoint Selection side is labeled "Shared learning".]

• History teaches about paths, not just servers

➔ Better initial guess (ephemeral...)

Using Waypoint Results to Probe

• DNS: Current best + random interface

• TCP: Current best path (int or peer)

• 2nd TCP w/ 5% chance via random path

• Pass results back to waypoint algorithm

• While no response within thresh

– connect via next best

– increase thresh

➔ What information affects thresh?
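
The probing loop above can be simulated offline. The sketch below is a simplification under assumptions that are not from the talk: response times are known in advance (`None` means the path never answers), `thresh` grows by a fixed multiplicative factor, and the function and parameter names are hypothetical.

```python
import random


def probe(ranked_paths, response_time, thresh=0.15, growth=2.0,
          explore=0.05, rng=None):
    """Simulate staggered connection attempts.

    ranked_paths: paths ordered best-first by the waypoint algorithm.
    response_time: dict path -> seconds until SYN/ACK, or None if it fails.
    Returns ((finish_time, winner) or None, {path: launch_time}).
    """
    rng = rng or random.Random()
    launches = {ranked_paths[0]: 0.0}              # current best path first
    if len(ranked_paths) > 1 and rng.random() < explore:
        launches[rng.choice(ranked_paths[1:])] = 0.0   # occasional 2nd probe

    def first_response():
        done = [(t0 + response_time[p], p)
                for p, t0 in launches.items() if response_time[p] is not None]
        return min(done) if done else None

    now = 0.0
    for path in ranked_paths[1:]:
        if path in launches:
            continue
        now += thresh                              # wait for a response
        best = first_response()
        if best is not None and best[0] <= now:
            break                                  # someone answered in time
        launches[path] = now                       # connect via next best
        thresh *= growth                           # and wait longer next time
    return first_response(), launches


if __name__ == "__main__":
    times = {"MIT": None, "Cogent": 0.05, "DSL": 0.2}
    print(probe(["MIT", "Cogent", "DSL"], times, explore=0.0))
```

When the best path is healthy only one connection is opened; extra SYNs are sent only after `thresh` expires, which is where the low overhead comes from.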

TCP Response Time Knee

[Figure: cumulative fraction of requests (y-axis) vs. response time in seconds (x-axis, 0 to 0.5). Curves: DNS-DSL, TCP-Cogent, TCP-MIT, TCP-DSL. The TCP knee is marked at 105 ms for MIT and ~145 ms for DSL.]

• When to probe: right after the knee

• Small extra latency ➔ much less overhead

Two ways to approximate the knee are given in the paper.

Implementation

[Diagram: Clients → Proxy Machine (Ad-blocking Squid and Normal Squid front ends, MONET Squid back end) → Cogent, DSL, and MIT links.]

• Squid Web proxy + parallel DNS resolver

• Front-end squids mask back-end failures (Ad-blocking squid as bribe)

• Choose outbound link with FreeBSD / MacOS X ipfw or Linux policy routing
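
One common way for an application to pick an uplink is to bind the socket to that link's local address before connecting. The sketch below uses only the standard `socket` module; it assumes, and this is an assumption rather than something the slide spells out, that the host's ipfw or policy-routing rules forward packets according to their source address. `connect_via` is an illustrative name.

```python
import socket


def connect_via(local_ip: str, server: tuple, timeout: float = 2.0) -> socket.socket:
    """Open a TCP connection whose source address is local_ip.

    Whether this actually selects a given uplink depends on the host's
    source-based routing rules, which are configured outside this code.
    """
    return socket.create_connection(
        server, timeout=timeout, source_address=(local_ip, 0)
    )


if __name__ == "__main__":
    # Loopback demo so the example runs anywhere.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    conn = connect_via("127.0.0.1", listener.getsockname())
    print("connected from", conn.getsockname())
    conn.close()
    listener.close()
```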

6-site MONET Deployment

[Diagram of the deployment. Proxies: Lab, Aros, Utah, NYU, Mazu. Link and network labels: Cogent, DSL, Internet2, Genuity, UUNET, ELI, Aros, Wi-ISP, MIT, NYU, Utah. Also shown: Clients, Saved Traces, "Internet".]

• Two years, ∼50 users/week

• Primary traces at MIT, replay at Mazu

• Three peer proxies: NYU, Utah, Aros

• Focus on 1 Dec 2003 – 27 Jan 2004

• Record everything

Measurement Challenges

• Invalid DNS responses (packet traces)

• Invalid IPs (0.0.0.0, 127.0.0.1, ...)

• Anomalous servers - discard 90% SYNs, etc.

• Implementation and design flaws

– Network anomalies hit corner cases (Must avoid correlated measurement & network failures!)

• Identify, automate detection, iterate...

Excluded consistently anomalous services.

MIT Trace Statistics

Request type          Count
Client object fetch   2.1M
Cache misses          1.3M
Data fetch size       28.5 Gb
Cache hit size        1 Gb
TCP Connections       616,536
DNS lookups           82,957

137,341 sessions: first request to a server after 60+ idle seconds (avoids bias)
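
The session definition above can be written down directly. A minimal sketch, assuming a time-sorted list of `(timestamp_seconds, server)` pairs and assuming that the very first request to a server also counts as a session (the slide does not say):

```python
def count_sessions(requests, idle_gap: float = 60.0) -> int:
    """Count requests that start a session: the first request to a server
    after at least idle_gap seconds without any request to that server."""
    last_seen = {}
    sessions = 0
    for ts, server in requests:
        prev = last_seen.get(server)
        if prev is None or ts - prev >= idle_gap:
            sessions += 1
        last_seen[server] = ts      # any request resets the idle timer
    return sessions


if __name__ == "__main__":
    trace = [(0, "a"), (10, "a"), (70, "a"), (75, "b"), (200, "a")]
    print(count_sessions(trace))
```

Counting only session-opening requests keeps a long burst of fetches from one page from dominating the availability statistics.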

Characterizing Failures

Failure classes: DNS, Server unreach, Server RST, Client access, Wide-area

[Diagram: Local Interfaces (MIT, Cogent, DSL), Peer Proxies, and Server; an X marks the failure location.]

Server unreach: 2+ peers reachable, but no peer or link could reach the server

(40% unreachable during post-analysis)

Failure Breakdown

MIT: 137,612 sessions

Failure Type     Srv    MIT    Cog    DSL
DNS                1
Srv. Unreach     173
Srv. RST          50
Client Access           152     14   2016
Wide-area               201    238   1828
Availability          99.6%  99.7%    97%

Factor out server failures, until they use MONET!

Single Link Availability

[Figure: fraction of successful connects (y-axis, log scale, .95 to .9999) vs. dns+connect() time in seconds (x-axis, log scale, 0.1 to 10). Curves and final values: DSL 0.972, MIT 0.9974, Cogent 0.9977, Cog+MIT+DSL 0.9995.]

• 97% of MIT connections established within 1 s

• DNS retransmissions at 2 seconds

• TCP SYN retransmissions at 3, 6, 9, ... seconds

Combined Link Availability

[Figure: fraction of successful connects (y-axis, log scale, .95 to .9999) vs. dns+connect() time in seconds (x-axis, log scale, 0.1 to 10). Curves and final values: DSL 0.972, MIT 0.9974, Cogent 0.9977, Cog+DSL 0.9992, Cog+MIT+DSL 0.9995.]

• Cheap DSL augments 100Mbit link
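
As a rough cross-check on the combined figure, the single-link values can be plugged into the textbook formula for independent failures. The independence assumption and the variable names are mine, not the talk's; the availability values are the ones quoted in the figure.

```python
def independent_availability(link_availabilities) -> float:
    """Availability of 'at least one link works' if link failures were
    statistically independent: 1 - product of per-link failure rates."""
    p_all_fail = 1.0
    for a in link_availabilities:
        p_all_fail *= 1.0 - a
    return 1.0 - p_all_fail


measured = {"DSL": 0.972, "MIT": 0.9974, "Cogent": 0.9977}
predicted = independent_availability(measured.values())
observed = 0.9995  # Cog+MIT+DSL value from the figure

if __name__ == "__main__":
    print(f"predicted if independent: {predicted:.8f}")
    print(f"observed:                 {observed:.4f}")
```

The independent model predicts far fewer failures than were observed. One plausible reading is that some failures hit every local link at once (for example, problems near the server), which is the gap that peers and multiple servers are meant to close.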

MONET Achieves 4 Nines

[Figure: fraction of successful connects (y-axis, log scale, .95 to .9999) vs. dns+connect() time in seconds (x-axis, log scale, 0.1 to 10). Curves and final values: DSL 0.972, DSL+Peers 0.974, MIT 0.9974, Cogent 0.9977, Cog+DSL 0.9992, MIT+Peers 0.9992, Cog+MIT+DSL 0.9995, Cog+Peers 0.9997, All 0.9999.]

• Cheap DSL augments 100Mbit link

• Overlays + reliable link very good

MONET with Low Overhead

How do the practical MONETs compare?

• Optimal, Liveness, Random

• Post-best:

– Analyze trace, determine single “best” interface to always use first

– While no response within thresh

∗ connect via random interface or peer

∗ increase thresh

(Requires omniscience, but quasi-realistic).

Achievable Resilience

[Figure: fraction of successful connects (y-axis, log scale, .9 to .9999) vs. dns+connect() time in seconds (x-axis ticks at 0.2, 0.5, 1, 2, 3, 6, 9, 15). Curves: Optimal, Liveness, Post Best, Cogent, Random, DSL.]

• 10% more SYNs (< 1% packets), near optimal

What we didn’t talk about

• Discounted server failures: Some servers really bad.

• Paper: MONET + Replicated services

– A more reliable subset of servers

– Presumably, operators care more...

✔ 8x better availability including server failures.

Related Work

• SOSR (OSDI’04) - single-hop NAT-based overlay routing.

– Probing-based study

• Akella et al. multihoming

– Akamai-based study

➔ Similar underlying network performance.

• Commercial products (Stonesoft, Sockeye, ...)

– Tactics, performance, formalize problem

• Content Delivery Networks

– MONET improves availability

Summary

• Expose multiple paths to end-system

– Choose one that works end-to-end

• Necessary location for availability engineering

• Multihoming without routing support

• Resilience achievable with low overhead

• Experience w/ 2 year deployment and 100s of users: Avoids 90% of failures to reliable sites

http://nms.lcs.mit.edu/ron/ronweb/

Bulk Transfers

• Use application knowledge

– Static objects only

– HTTP parallel transfers (“Paraloaders”)

• Dykes et al. server selection + our tests

– First-response SYN effective

• Mid-stream failover

– SCTP, Migrate, Host ID schemes, others..

– Range requests / app-specific tactics

TCP CONTROL DEFER socket option

• Switch to new server if SYN lost

Still works if SYN delayed > 3 seconds

• Avoid 3-way handshake completion for all but one connection

Time   Source        Dest          Type
54:31  client.3430 > server-A.80   SYN
54:34  client.3430 > server-A.80   SYN
· · ·
55:05  client.3430 > server-A.80   SYN
55:17  client.3432 > server-B.80   SYN

Characterizing Failures

[Diagram on each slide: Local Interfaces (MIT, Cogent, DSL), Peer Proxies, and Server; an X marks where the failure occurred.]

• DNS: peers reachable, but no peer or interface could resolve DNS.

• Server unreach: 2+ peers reachable, but no peer or link could reach the server (40% unreachable during post-analysis).

• Server RST: server refused TCP connections. Network OK end-to-end.

• Client access: no peers, DNS, or server reachable via one link. Peers and server working via other links.

• Wide-area: server not reachable via one link. That link can reach peers. Server reachable via peer or other link.
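
The five definitions above read like a decision procedure, so they can be sketched as one. The `Observation` fields and the order of the checks are assumptions made for illustration; the talk does not describe how the classification was implemented.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What was seen for one failed or successful attempt on one local link."""
    link_reached_server: bool = False    # this link completed a TCP connect
    link_reached_peers: bool = False     # this link could reach peer proxies
    link_resolved_dns: bool = False      # this link got a DNS answer
    anyone_resolved_dns: bool = True     # any interface or peer resolved DNS
    anyone_reached_server: bool = False  # any other link or peer got through
    server_sent_rst: bool = False        # server refused the TCP connection
    peers_reachable: int = 0             # peers reachable over any link


def classify(o: Observation) -> str:
    if o.link_reached_server:
        return "ok"
    if o.server_sent_rst:
        return "server RST"              # network fine end-to-end
    if not o.anyone_resolved_dns and o.peers_reachable >= 1:
        return "DNS"
    if not o.anyone_reached_server and o.peers_reachable >= 2:
        return "server unreachable"
    if o.anyone_reached_server:
        if not o.link_reached_peers and not o.link_resolved_dns:
            return "client access"       # this link reaches nothing at all
        if o.link_reached_peers:
            return "wide-area"           # link works, path to server does not
    return "unclassified"


if __name__ == "__main__":
    print(classify(Observation(link_reached_peers=True, link_resolved_dns=True,
                               anyone_reached_server=True, peers_reachable=3)))
```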

Measurement

Packet-level traces at each node:

• TCP to server, all DNS lookups

• UDP overlay queries

Application traces:

• Proxy request parameters, TCP sessions, DNS queries, overlay queries

• DNS server query log

Sliding-window join links application logs to local and remote packet logs.
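
A sliding-window join of two time-sorted logs might look like the sketch below. The record layout `(timestamp, key, payload)`, the choice of join key, and the one-second window are all assumptions for illustration; the slide does not give these details.

```python
from collections import deque


def sliding_window_join(app_events, packet_events, window: float = 1.0):
    """Pair each application event with packet events that share its key
    and lie within +/- window seconds. Both inputs must be sorted by time."""
    recent = deque()   # packet events still inside the current window
    pairs = []
    j = 0
    for app in app_events:
        t, key, _ = app
        # Admit packets up to the right edge of the window.
        while j < len(packet_events) and packet_events[j][0] <= t + window:
            recent.append(packet_events[j])
            j += 1
        # Evict packets that fell off the left edge.
        while recent and recent[0][0] < t - window:
            recent.popleft()
        pairs.extend((app, pkt) for pkt in recent if pkt[1] == key)
    return pairs


if __name__ == "__main__":
    app = [(10.0, "c1", "GET /"), (20.0, "c2", "GET /x")]
    pkts = [(9.8, "c1", "SYN"), (10.1, "c1", "SYN/ACK"),
            (15.0, "c2", "SYN"), (20.5, "c2", "SYN")]
    for a, p in sliding_window_join(app, pkts):
        print(a, "<->", p)
```

Because both logs are consumed in time order, memory is bounded by the number of packets inside one window rather than by the size of the trace.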

When to probe: Practical Solution

Conservative estimator from aggregate connection behavior:

• rtt_est - expected connect() time

rtt_est ← q ∗ rtt_est + (1 − q) ∗ rtt

• rtt_dev - average linear deviation (> σ)

• thresh = rtt_est + 4 ∗ rtt_dev

✔ Easily computed, little state, effective
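
The estimator above needs only two running values. In this sketch the update for the deviation term uses the same smoothing weight as the mean, in the style of TCP's retransmission timer; that choice, the default q, and the initial value are assumptions, since the slide only gives the update for the mean and the final threshold formula.

```python
class ConnectThreshold:
    """Running estimate of connect() time and a probe threshold."""

    def __init__(self, q: float = 0.875, initial_rtt: float = 0.1):
        self.q = q                  # weight on history (assumed default)
        self.rtt_est = initial_rtt  # expected connect() time, in seconds
        self.rtt_dev = 0.0          # smoothed absolute deviation

    def observe(self, rtt: float) -> None:
        """Fold one measured connect() time into the estimates."""
        err = abs(rtt - self.rtt_est)
        self.rtt_dev = self.q * self.rtt_dev + (1 - self.q) * err  # assumed form
        self.rtt_est = self.q * self.rtt_est + (1 - self.q) * rtt  # from the slide

    @property
    def thresh(self) -> float:
        """How long to wait before trying the next path."""
        return self.rtt_est + 4 * self.rtt_dev


if __name__ == "__main__":
    est = ConnectThreshold()
    for sample in (0.09, 0.11, 0.10, 0.35, 0.10):
        est.observe(sample)
        print(f"rtt={sample:.2f}s  thresh={est.thresh:.3f}s")
```

A single slow sample raises the threshold quickly through the deviation term, which then decays as normal samples return.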