MyOps: An Operational Framework for PlanetLab Deployments

Outline

o Objective of MyOps
o Current status
o Future ideas

o Questions at any time

Example of Feedback

Objective: Close the Operational Cycle

• System - Provides service (slice)
• Monitoring - Feedback from running system
• Operator - Interprets feedback into tasks
• Management - Controls running system

Challenges: Break-down

• System may not deliver service

• Monitoring may not observe useful metrics

• Operator may not know
  o how to interpret observations
  o how to control the system
  o what the service goals are

• Management may not control system

Requirements for Operational Systems

• Satisfy Minimal Conditions
  1. Physical Integrity
  2. Interconnectivity
  3. Controllable
  4. Provide a Service

• Two requirements
  o Reliably reach the final condition
  o When failures occur, repair or report automatically

• Two approaches in MyOps
  1. Precise bootstrap stages (not discussed)
  2. Operational monitoring & management in platform

System: PlanetLab Slices

Monitoring Types

Open-loop monitoring
• Identify the unknown
• More information, fine-grained

Operational monitoring (closed-loop)
• Correctness
• Less information, coarse-grained
• Actionable

Management Types

Open-loop management
• Bootstrap/Deploy from the ground up
• Inefficient, coarse-grained
• No feedback

Operational management (closed-loop)
• Tweak the system to correct behavior
• More efficient, fine-grained

Example

• Observe: Node is Off-line
• Control: Attempt to Power-on
• Observe: Node is On-line but Failed to boot
• Observe: Failed to boot Error
• Control: Create ticket & Send email to local contact

• Time passes

• Control: Disable slice creation
• Observe: Local contact responds
• Observe: Node is Powered-on and Running
• Control: Re-enable slice creation
• Control: Close ticket
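
The sequence above can be read as an observe/control loop in which each observation may trigger zero or more control actions. The following is a minimal sketch of that reading, not MyOps code: the observation names, the action names, and the `RULES` table are all hypothetical and chosen only to mirror the steps on this slide.

```python
# Hypothetical observation -> control mapping that mirrors the example slide.
# None of these names come from the MyOps code base.
RULES = {
    "node_offline": ["power_on"],
    "boot_failed": ["create_ticket", "email_local_contact"],
    "still_down_after_wait": ["disable_slice_creation"],
    "node_running": ["enable_slice_creation", "close_ticket"],
}


def control_actions(observation):
    """Return the control actions for one observation (empty if none apply)."""
    return RULES.get(observation, [])


def replay(observations):
    """Interleave observations with the control actions they trigger."""
    log = []
    for obs in observations:
        log.append(("observe", obs))
        for action in control_actions(obs):
            log.append(("control", action))
    return log


trace = replay([
    "node_offline",
    "boot_failed",
    "still_down_after_wait",
    "contact_responded",   # observed, but no automated control follows
    "node_running",
])
```

Keeping the mapping in a table rather than in branching code makes it easy to see which observations are actionable and which are only recorded.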

History of PlanetLab Operations

Open-loop Monitoring with Open-loop Management
• Collect fine-grained statistics using CoMon
• Act with coarse-grained operations (e.g. Reinstall)
• Manual bridge between the two

Moving towards Closed-loop Operations
• Collect targeted metrics
• Take directed, problem-specific actions
• Automate actions based on policy

PlanetLab Operations

• Close the monitor/management cycle
• Direct automation of common operations
• Indirect through remote contacts and incentives

MyOps Architecture

• Collection from Node
• Translated by policy to Automated action

MyOps Architecture

• Collection from Node
• Send notice to Local contact to take action

MyOps Architecture

• When there is no response
• Indirect influence with incentives

Collection

• Operational monitoring of specific targets, such as:
  o Boot status, Filesystem status
  o DNS - internal and external
  o RPMs
  o System services, etc.

• Periodic collection
  o Coarse-grained collection at a human timescale
  o Time-series of events and status
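
One way to picture "time-series of events and status" is to reduce periodic samples to the points where a node's observed status changes. The sketch below assumes a hypothetical `Sample` record with an hourly timestamp, a boot status label, and a DNS flag; the field names and status labels are illustrative and are not the MyOps schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One coarse-grained observation of a node (hypothetical fields)."""
    node: str
    hour: int          # collection time, in hours since collection began
    boot_status: str   # illustrative labels such as "boot" or "down"
    dns_ok: bool


def status_changes(samples):
    """Reduce periodic samples to events: one per change in observed status."""
    events = []
    last = {}
    for s in sorted(samples, key=lambda s: (s.node, s.hour)):
        state = (s.boot_status, s.dns_ok)
        if last.get(s.node) != state:
            events.append((s.hour, s.node, state))
            last[s.node] = state
    return events


samples = [
    Sample("node1", 0, "boot", True),
    Sample("node1", 1, "boot", True),   # unchanged, so no event
    Sample("node1", 2, "down", True),
]
events = status_changes(samples)
```

Storing only the changes keeps the series small enough to reason about at the same human timescale at which it is collected.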

Policy

• Constraints over a time-series of events

• To satisfy a constraint
  o Automated action
  o Send notice
  o Apply incentive

• Policy defines
  o Preferred status of system
  o Frequency of actions
  o Magnitude of incentives
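
A policy of this kind can be sketched as a function from a status history to one of the three responses listed above. The example below is only an illustration under stated assumptions: the `Policy` fields, the thresholds, and the rule that automation is tried before notices and incentives are hypothetical, not taken from MyOps.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Hypothetical policy parameters, one field per item the slide names."""
    preferred_status: str   # preferred status of the system
    notice_every: int       # frequency of actions: one notice per N samples in violation
    incentive_after: int    # samples in violation before an incentive is applied


def violation_length(history, preferred):
    """Count the trailing samples that differ from the preferred status."""
    count = 0
    for status in reversed(history):
        if status == preferred:
            break
        count += 1
    return count


def decide(history, policy, can_auto_repair=False):
    """Pick the response for the current state of the time series."""
    n = violation_length(history, policy.preferred_status)
    if n == 0:
        return "none"
    if can_auto_repair:
        return "automated_action"
    if n >= policy.incentive_after:
        return "apply_incentive"
    if (n - 1) % policy.notice_every == 0:
        return "send_notice"
    return "wait"


policy = Policy(preferred_status="boot", notice_every=7, incentive_after=14)
```

With these illustrative numbers and one sample per day, a node that has just gone down gets a notice, a second notice follows a week later, and an incentive applies once the violation reaches fourteen samples.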

Automation

• Automatic correction of common bootstrap problems
  o Communication errors with MyPLC
  o Corrupt filesystem repair
  o Retry when state is unknown
  o PCU Reboot
  o Reinstall

• Automation Notices
  o Bad disk
  o Minimal hardware
  o Bad DNS
  o Bad node configuration
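
The split between automatic corrections and notices can be illustrated as a repair ladder: try repair steps in turn, and fall back to a notice when none of them restores the node. This is a sketch under assumptions, not the MyOps implementation; the ordering of steps, the `repair` function, and the dictionary standing in for a node are all hypothetical.

```python
def repair(node, steps, is_healthy):
    """Try each repair step in order; stop at the first that leaves the node healthy.

    Returns ("repaired", attempted) on success, or ("send_notice", attempted)
    when every step has been tried without restoring the node.
    """
    attempted = []
    for name, step in steps:
        step(node)
        attempted.append(name)
        if is_healthy(node):
            return ("repaired", attempted)
    return ("send_notice", attempted)


# Stand-in for a real node: a dict with one fault flag (illustrative only).
node = {"fs_corrupt": True}

# Assumed ordering from least to most disruptive; the slide does not give one.
steps = [
    ("retry", lambda n: None),
    ("repair_filesystem", lambda n: n.update(fs_corrupt=False)),
    ("pcu_reboot", lambda n: None),
    ("reinstall", lambda n: n.update(fs_corrupt=False)),
]

outcome = repair(node, steps, lambda n: not n["fs_corrupt"])
```

A fault that no step can clear, such as a bad disk, runs through the whole ladder and ends in a notice to the local contact.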

Notices & Incentives

• Notices are indirect paths to node management
  o Node down / online / specific problem (e.g. DNS, disk)
  o Site down / online
  o Privilege reduced / restored
  o PCU errors

• The incentives on MyPLC
  o Sites 10 slices
  o Disable slice creation
  o Disable running slices

Validation of Notices & Incentives

[Figure: chart divided into periods A, B, C, D, E; annotations: Notice Bug Fix, Kernel Bug Fix, Fix2]

Time to Restore Down Node (all issues)

Future Ideas

• Generalize Configuration
• Collect from multiple sources
• Expose policy
• Act on multiple targets

• Self-monitoring

• Positive Incentives
  o Special access to services
  o Additional resources (Slices, Bandwidth, CPU, etc.)

Time to Reply (when there is a reply)
