Don’t Lose Sleep Over Availability: The GreenUp Decentralized Wakeup Service - USENIX

Page 1: Don’t Lose Sleep Over Availability: The GreenUp Decentralized Wakeup Service

Siddhartha Sen, Princeton University

Jacob R. Lorch, Richard Hughes, Carlos G. J. Suarez, Brian Zill, Weverton Cordeiro, and Jitendra Padhye

Pages 2-4: Enterprise networks

[Figure: enterprise network reached over a WAN by users and IT admins]

Despite the cloud, this is a common scenario

Pages 5-6: Enterprise networks

• Energy savings: Stay Green!
• Availability: Stay Up!

GreenUp!: Energy savings + Availability

Pages 7-22: Sleep proxy

[Figure sequence: a machine and a sleep proxy on one subnet, reachable over the WAN]

1. The machine is active; a sleep proxy sits on the same subnet.
2. The machine goes to sleep. The sleep proxy tells the network "Send traffic to me!" so that it receives traffic addressed to the sleeping machine.
3. A remote request (TCP SYN) arrives over the WAN and reaches the sleep proxy.
4. The sleep proxy sends "Wake up!" (WoL) to the sleeping machine.
5. The machine is active again. The remote request (TCP SYN) is sent again and now reaches the machine.
6. The machine sends its response.

Pros:
• No special hardware
• No environment changes
• No app changes

Cons:
• Dedicated server per subnet
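The "Wake up! (WoL)" step in the sleep proxy slides refers to Wake-on-LAN. As a rough illustration, the sketch below builds the standard WoL magic packet (six 0xFF bytes followed by the target MAC address repeated 16 times) and broadcasts it over UDP. The function names, the default broadcast address, and the use of UDP port 9 are assumptions made for this example; the slides do not show how GreenUp itself sends the packet.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Magic packet layout: 6 bytes of 0xFF, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast_ip: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the local subnet (hypothetical helper)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Broadcasting requires SO_BROADCAST to be enabled on the socket.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast_ip, port))
```

A proxy in this sketch would call `send_wol("00:11:22:33:44:55")` when it sees a TCP SYN for a sleeping machine. The packet only reaches machines on the same broadcast domain, which fits the per-subnet proxy shown in the slides.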

Pages 23-24: Dedicated servers are a problem

• High deployment and management cost
• Single point of failure
• High availability becomes expensive!

Pages 25-26: GreenUp: A decentralized, minimal software-only sleep proxy

Any machine can act as a proxy (manager) for sleeping machines on the subnet

Pages 27-29: Outline

1. How does GreenUp work?
2. What can I learn from GreenUp?
   • Distributed management
   • Subnet state coordination
   • Guardians
3. How effective is GreenUp?
   – Evaluation on 100 user machines, currently deployed on 1,100 machines

Page 30: GreenUp’s environment

• Subnet domains
• Load-sensitive, unreliable machines
• Single administrative domain
• Availability most important

Pages 31-34: Running example (not to scale)

[Figure: a subnet of nine machines, M1 through M9. Legend: awake; manager; asleep + managed; asleep + unmanaged. M1 is labeled as manager, M8 as asleep + managed, and M9 as asleep + unmanaged.]

Pages 35-38: Distributed management: Who manages M9?

• Wait for notification?
  – No guarantees before sleep
  – M1 failure abandons M8
• Probe randomly, repeat since machines unreliable
• Load-sensitive machines, so distribute probing
  – Robust to manager issues

Pages 39-45: Distributed management: Who manages M9?

Number of machines probed by machine i:

$$\frac{n - m_i}{\#\text{ awake machines}} \cdot \ln\frac{1}{1 - p}$$

where n = total # machines, m_i = # managed by i, and p = Pr(machine probed)

• Coupon collector analysis
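The coupon collector argument behind the probing formula can be checked with a small simulation. The sketch below reads the slide's expression as a per-machine probe count, (n − m_i) / (# awake machines) · ln(1 / (1 − p)), and assumes each awake machine picks its probe targets uniformly at random with replacement. Both that reading and the sampling model are assumptions of this example, and the function names are hypothetical.

```python
import math
import random


def probes_per_machine(n_total, n_awake, p, managed_by_me=0):
    """Probe count suggested by the coupon-collector style bound (assumed reading)."""
    return math.ceil((n_total - managed_by_me) / n_awake * math.log(1.0 / (1.0 - p)))


def fraction_probed(n_total, n_awake, p, rng):
    """Simulate one probing round and return the fraction of machines probed."""
    count = probes_per_machine(n_total, n_awake, p)
    probed = set()
    for _ in range(n_awake):          # every awake machine sends probes
        for _ in range(count):        # each probe picks a uniform random target
            probed.add(rng.randrange(n_total))
    return len(probed) / n_total


def average_fraction_probed(n_total, n_awake, p, trials=200, seed=0):
    """Average the per-round coverage over several independent rounds."""
    rng = random.Random(seed)
    return sum(fraction_probed(n_total, n_awake, p, rng) for _ in range(trials)) / trials
```

With 200 machines, 100 of them awake, and a target of p = 0.9, each awake machine sends 5 probes in this model, and the simulated coverage lands slightly above the target because the probe count is rounded up.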

Pages 46-50: Multiple managers

[Figure sequence: M9 appears next to more than one awake machine, then next to only one]

• Availability most important
• Simple resolution protocol

Pages 51-54: Load balance

[Figure: managed machines M8, M9, and M1 shown next to awake machines M2, M3, and M4]

• Induction analysis: equivalent to balls-in-bins!

Max load after n/2 sleeps:

$$\frac{\ln(n/2)}{\ln\ln(n/2)}$$

Distributed management elects leaders in a robust and load-balanced way, assuming temporary conflicts are tolerable.
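The balls-in-bins comparison can be explored numerically. In the sketch below, each sleeping machine is a ball and each awake machine is a bin chosen uniformly at random, which is an assumed simplification of how managers end up being selected. The estimate ln(m) / ln ln(m) is the classical leading-order term for the maximum load when m balls are thrown into m bins, so for small m the simulated maximum is usually somewhat larger. Function names are hypothetical.

```python
import math
import random


def simulated_max_load(num_sleeping, num_awake, rng):
    """Assign each sleeping machine to a random awake manager; return the max load."""
    loads = [0] * num_awake
    for _ in range(num_sleeping):
        loads[rng.randrange(num_awake)] += 1
    return max(loads)


def balls_in_bins_estimate(m):
    """Leading-order max-load estimate for m balls in m bins (needs m >= 3)."""
    return math.log(m) / math.log(math.log(m))


if __name__ == "__main__":
    n = 1000                      # total machines; n/2 asleep, n/2 awake
    rng = random.Random(1)
    observed = simulated_max_load(n // 2, n // 2, rng)
    print("observed max load:", observed)
    print("ln(n/2) / ln ln(n/2):", round(balls_in_bins_estimate(n // 2), 2))
```

The point of the comparison is that the maximum load grows very slowly with subnet size, so no single manager is expected to carry a large share of the sleeping machines.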

Pages 55-56: Subnet state coordination

• Distributed management relies on global state
  – Who to probe?
  – How to manage?
• IP address, MAC address
• TCP listen ports

Machine | State
M8      | …
M9      | …
M5      | …
…       | …

Pages 57-59: Subnet state coordination

• Replicated state machine?
  – Unreliable machines, correlated behavior
  – Strong consistency overkill
• External database?
  – Lose instant deployability
• Exploit subnet and weaker consistency

Pages 60-70: Subnet state coordination

1. Periodic broadcast while awake
2. Rebroadcast by managers while asleep
3. Daily roll call to garbage-collect state

[Figure sequence: "M9 state", "M8 state", and "M8’ state" messages exchanged among machines on the subnet]

Subnet state coordination distributes per-machine state on a subnet when strong consistency is not required.
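The three coordination steps (broadcast while awake, rebroadcast by managers, daily roll call) can be pictured as operations on a per-machine state table. The sketch below is a simplified, hypothetical model: the class and field names, the timestamp-based "newest wins" merge, and the age-based roll call are all assumptions made for illustration, and the slides do not specify GreenUp's actual message format or garbage-collection rule. The stored fields mirror the state the slides list: IP address, MAC address, and TCP listen ports.

```python
from dataclasses import dataclass


@dataclass
class MachineState:
    ip: str
    mac: str
    tcp_listen_ports: frozenset
    last_heard: float  # time of the latest broadcast or rebroadcast for this machine


class SubnetStateTable:
    """Weakly consistent view of per-machine state kept by each participant."""

    def __init__(self):
        self.entries = {}  # machine name -> MachineState

    def on_broadcast(self, name, state):
        """Steps 1 and 2: accept a broadcast from the machine or from its manager."""
        current = self.entries.get(name)
        # Keep whichever copy is newer; stale rebroadcasts are ignored.
        if current is None or state.last_heard >= current.last_heard:
            self.entries[name] = state

    def roll_call(self, now, max_age):
        """Step 3: drop entries nobody has refreshed within max_age seconds."""
        stale = [n for n, s in self.entries.items() if now - s.last_heard > max_age]
        for name in stale:
            del self.entries[name]
        return stale
```

Because every participant applies the same merge rule to whatever broadcasts it happens to hear, tables can disagree briefly and still converge, which is the weaker consistency the slides argue is sufficient here.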

Pages 71-73: Outline

1. How does GreenUp work?
2. What can I learn from GreenUp?
   • Distributed management
   • Subnet state coordination
   • Guardians
     – Protects against simultaneous sleep
     – Caps the max load
3. How effective is GreenUp?
   – Evaluation on 100 user machines, currently deployed on 1,100 machines

Page 74: Deployment in Microsoft

• C# code
  – Interfaces with packet sniffer/network driver
• Client GUI for users and easy deployment
• Pilot on 1,100 machines

Page 75: Evaluation

• Logs from 101 Windows 7 machines, Feb.–Sep. 2011
• Questions:
  – Does GreenUp consistently wake machines when accessed?
  – Does it do so in time to meet user patience?
  – Can GreenUp scale to large subnets?

Pages 76-78: GreenUp wakes machines reliably

• Connect to machines using Samba (TCP port 139)
• 11 different days (weekends, evenings):
  – 496 already awake, 278 woken, 5 unwakeable
  – Most failures due to WoL
• 99.4% success rate

WoL is availability bottleneck!
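The 99.4% figure is consistent with counting both already-awake and woken machines as successes out of all connection attempts. This is an assumption about how the rate was computed; the slide gives only the three counts and the final percentage.

$$\frac{496 + 278}{496 + 278 + 5} = \frac{774}{779} \approx 0.994$$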

Pages 79-84: GreenUp wakes machines quickly

• GreenUp relies on some user patience
  – Wakeup delay
  – User retry logic
• Side-effect of WoL failure: manager logs how long user waits
  – 48 events

[Figures: distributions of wakeup delay and of how long users wait]

• 87% of wakeups take < 9 sec
• 13% of users give up after 3 sec (port scanners?)
• Convolving: GreenUp wakes machines before user gives up 85% of the time
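One way to read "convolving" is as combining the wakeup-delay distribution with the user-patience distribution to get the probability that a machine wakes before the user gives up. The sketch below does this for two discrete distributions, assuming delay and patience are independent. The numbers in the example are made up for illustration and are not the measured data behind the slides; the function name is hypothetical.

```python
def prob_woken_in_time(wake_delay_pmf, patience_pmf):
    """P(wakeup delay < user patience) for independent discrete distributions.

    Both arguments map a time in seconds to its probability.
    """
    total = 0.0
    for delay, p_delay in wake_delay_pmf.items():
        # Probability the user is still waiting when the machine comes up.
        p_still_waiting = sum(p for patience, p in patience_pmf.items() if patience > delay)
        total += p_delay * p_still_waiting
    return total


# Hypothetical distributions, NOT taken from the GreenUp evaluation.
example_wake_delay = {3: 0.2, 6: 0.5, 9: 0.2, 20: 0.1}
example_patience = {3: 0.1, 10: 0.3, 30: 0.6}

if __name__ == "__main__":
    print(round(prob_woken_in_time(example_wake_delay, example_patience), 2))
```

With real logs, the two dictionaries would be replaced by empirical histograms of the measured wakeup delays and of the logged user wait times.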

Page 85: DonDon t’t Lose Sleep Over Availability - USENIX · DonDon t’t Lose Sleep Over Availability: ... M5 – M1 failure abandons ... M5 M1 M8 p= Pr(machine probed) M9 ...

GreenUp scales to large subnetsGreenUp scales to large subnets

• Sources of manager loadSources of manager load– Intercept traffic for asleep machinesBroadcast state– Broadcast state

– Probe/respond to probes

Page 89:

GreenUp scales to large subnets

Good load balance + enough awake machines

few managed machines!

Page 90:

GreenUp scales to large subnets

• Simulated probing load on 2.4-GHz, dual-core Windows 7 machine w/ 4 GB memory and 1 Gb/s NIC:

# of managed machines    CPU utilization
100                      12%
200                      21%
300                      29%

• Guardians ensure max load is 100
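
To read the three measured points as a rough cost curve, one can interpolate between them. The sketch below is only an illustration: linear interpolation between measurements is an assumption, not something the talk claims, and the names `MEASURED` and `estimate_cpu_percent` are hypothetical. It refuses to extrapolate beyond the measured range.

```python
# (managed machines, CPU utilization in percent), as listed on the slide.
MEASURED = [(100, 12.0), (200, 21.0), (300, 29.0)]


def estimate_cpu_percent(managed):
    """Linearly interpolate CPU utilization between measured points.

    Linear behaviour between points is an assumption for illustration.
    """
    if not MEASURED[0][0] <= managed <= MEASURED[-1][0]:
        raise ValueError("outside the measured range; no extrapolation attempted")
    for (x0, y0), (x1, y1) in zip(MEASURED, MEASURED[1:]):
        if x0 <= managed <= x1:
            return y0 + (y1 - y0) * (managed - x0) / (x1 - x0)


print(estimate_cpu_percent(150))
```

Under this reading, the guardians' cap of 100 managed machines keeps a manager at the lowest measured point of the table.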

Page 92:

Does GreenUp save more energy?

• Energy savings depend on sleep time

Page 94:

Does GreenUp save more energy?

• Energy savings depend on sleep time

Average 31% $17.50/machine/year
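
The dependence of savings on sleep time can be made concrete with a back-of-the-envelope calculation. This is a sketch under stated assumptions, not the model used in the talk: the function name and every numeric input below (power draw awake and asleep, electricity price, sleep fraction) are hypothetical placeholders and are not meant to reproduce the figures on the slide.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_savings_dollars(sleep_fraction, awake_watts, asleep_watts,
                           dollars_per_kwh):
    """Estimate yearly savings for one machine.

    Savings = hours asleep * power saved while asleep * electricity price.
    All inputs are assumptions supplied by the caller.
    """
    if not 0.0 <= sleep_fraction <= 1.0:
        raise ValueError("sleep_fraction must be between 0 and 1")
    hours_asleep = sleep_fraction * HOURS_PER_YEAR
    saved_kwh = hours_asleep * (awake_watts - asleep_watts) / 1000.0
    return saved_kwh * dollars_per_kwh


# Hypothetical inputs: asleep half the time, 100 W awake, 5 W asleep,
# $0.10 per kWh.
print(yearly_savings_dollars(0.5, 100.0, 5.0, 0.10))
```

The result scales linearly with the sleep fraction, which is why the savings cannot be pinned down without knowing how long machines actually sleep.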

Page 95:

Does GreenUp save more energy?

• Energy savings depend on sleep time

• IT enforces sleep policy at Microsoft, so hard to tell

Page 96:

Extension: Higher availability via explicit load hand-off

[Diagram animated across Pages 96-100, showing machines M1-M9; layout not recoverable]

Page 101:

Extension: Higher availability via explicit load hand-off

• Theorem. Expected max load = (n/d) · H_d, where d = # awake machines and H_d is the d-th harmonic number

[Diagram: machines M1-M9]
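
The bound in the theorem is easy to evaluate numerically. The sketch below is only an illustration of the formula: it assumes that n is the number of machines that must be managed (the slide does not define n) and takes d as the number of awake machines. The function names are hypothetical.

```python
def harmonic(d):
    """Return the d-th harmonic number H_d = 1 + 1/2 + ... + 1/d."""
    return sum(1.0 / k for k in range(1, d + 1))


def expected_max_load(n, d):
    """Evaluate (n / d) * H_d from the theorem on the slide.

    n: machines to be managed (assumed meaning; not defined on the slide).
    d: number of awake machines.
    """
    if d <= 0:
        raise ValueError("need at least one awake machine")
    return (n / d) * harmonic(d)


# Hypothetical subnet: 100 machines to manage, 4 awake machines.
print(expected_max_load(100, 4))
```

Because H_d grows only logarithmically in d, the expected maximum stays within a logarithmic factor of the perfectly even share n/d.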

Page 102:

Other solutions

• Sleep proxy idea: Christensen & Gulledge ’98
• Recently:

System                                               Technique
Somniloquy, NSDI ’09                                 augmented NICs
LiteGreen, ATC ’10; Jettison, EuroSys ’12            VM migration
SleepServer, ATC ’10                                 application stubs
Nedevschi et al., NSDI ’08; Reich et al., ATC ’10    dedicated servers

Page 103:

Other solutions

• Sleep proxy idea: Christensen & Gulledge ’98
• Recently:

Barriers to deployment

System                                               Technique
Somniloquy, NSDI ’09                                 augmented NICs
LiteGreen, ATC ’10; Jettison, EuroSys ’12            VM migration
SleepServer, ATC ’10                                 application stubs
Nedevschi et al., NSDI ’08; Reich et al., ATC ’10    dedicated servers

Page 104:

GreenUp

• Completely decentralized, software-only sleep proxy

• Useful distributed systems techniques

• High availability at low cost, even as machines sleep!

http://research.microsoft.com/en-us/projects/greenup/