OOI CI LCA REVIEW August 2010 Ocean Observatories Initiative Common Execution Infrastructure Kate Keahey, Tim Freeman, Alex Clemesha, David LaBissoniere, John Bresnahan Life Cycle Architecture Review La Jolla, CA

Page 1:

OOI CI LCA REVIEW August 2010

Ocean Observatories Initiative

Common Execution Infrastructure

Kate Keahey, Tim Freeman, Alex Clemesha, David LaBissoniere, John Bresnahan

Life Cycle Architecture Review, La Jolla, CA

Page 2:

Common Execution Infrastructure Purpose

•Basic capabilities in resource provisioning on IaaS clouds

•Commercial

•National infrastructure

•Highly Available (HA) services

•Allow OOI computations to scale to demand by leveraging elastically provisioned resources

Page 3:

R1 Use Cases

ID        Title                  Description
UC.R1.14  Use Service Anywhere   Messages go to services wherever they are
UC.R1.15  Put Services Anywhere  Allocate services where need is greatest
UC.R1.16  Scale the Processing   Increase processing quickly to meet demand
UC.R1.17  Replicate Service      Configure service once, deploy many times
UC.R1.20  Command A Resource     Send typical commands to specific resource
UC.R1.25  Assure Reliability     Computer fails, messages resent, work resumes
UC.R1.26  Virtualize Everything  Virtual processes embody all system services
UC.R1.28  Operate System         Configure system and respond to requests
UC.R1.30  Troubleshoot System    Diagnose issues using logs, feeds, tools

Page 4:

User’s View of the Architecture

[Figure labels: HA Service (OOI Application); EPU; EPU Worker (Operational Unit); VM (Deployable Unit); Application Software (Deployable Type)]

Page 5:

Overall Architecture

[Figure labels: Client; Exchange Point; HA App-v1; VM]

…and then a miracle occurs…

Page 6:

Overall Architecture

[Figure labels: Client; Exchange Point; HA App-v1; EPU Worker; VM; Capability Container; App-v1; cc-agent; ctx-agent; IaaS; Context Broker; HA-P (Provisioner-0, Provisioner-2); Deployable Type Registry Service; EPU Controller (App-v1); DE (Planner); Sensor Aggregator (App-v1)]

[Arrow labels: updates; queue length; uses; queries; contextualization; launches VM; health report; per-node status]
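
One way to read this diagram is as a control loop: sensors report queue length, the EPU Controller's decision engine turns that into a desired VM count, and a provisioner launches VMs on the IaaS. The sketch below is a hypothetical, in-memory illustration of that reading only. Every name in it (`FakeIaaS`, `SensorAggregator`, `EPUController`, `launch`, `decide`, `step`) is invented for this example and is not the CEI code or an ION API, and the one-worker-per-queued-message policy is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FakeIaaS:
    """Stand-in for an IaaS cloud; it only records which VMs are running."""
    running: list = field(default_factory=list)
    _next_id: int = 0

    def launch(self, deployable_type: str) -> str:
        vm_id = f"{deployable_type}-vm-{self._next_id}"
        self._next_id += 1
        self.running.append(vm_id)
        return vm_id


@dataclass
class SensorAggregator:
    """Holds the latest queue-length reading reported by a sensor."""
    queue_length: int = 0

    def report_queue_length(self, n: int) -> None:
        self.queue_length = n


@dataclass
class EPUController:
    """Asks a decision function how many workers are wanted, then provisions."""
    deployable_type: str
    sensors: SensorAggregator
    iaas: FakeIaaS

    def decide(self) -> int:
        # Hypothetical policy: one worker per queued message.
        return self.sensors.queue_length

    def step(self) -> list:
        """Launch as many VMs as are missing; return the new VM ids."""
        missing = self.decide() - len(self.iaas.running)
        return [self.iaas.launch(self.deployable_type)
                for _ in range(max(0, missing))]


sensors = SensorAggregator()
controller = EPUController("app-v1", sensors, FakeIaaS())
sensors.report_queue_length(3)
launched = controller.step()   # three VMs are launched on the first pass
```

A second call to `controller.step()` with the same reading launches nothing, because the running count already matches the desired count. This sketch never terminates VMs; scale-down is left out to keep it short.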

Page 7:

HA Provisioner

[Figure labels: Capability Container; One VM; IaaS; Context Broker; HA-P (Provisioner-0, Provisioner-2); Provisioner (Provisioner); Controller (HA-Provisioner); Sensor Aggregator (HA-Provisioner); Base CEI Instance; All other EPU controllers; Bottom Turtle: Operations]

[Arrow labels: per-node status; queue length; monitors and restarts]

Page 8:

Bootstrapping and Monitoring

[Figure labels: Base CEI Instance; Core Services; Context Broker; Messaging Service; Provisioner (Provisioner); Controller (HA-Provisioner); Sensor Aggregator (HA-Provisioner); HA-P (Provisioner-0, Provisioner-2); epu_control; launch, test, monitor (repeated four times)]

[Arrow labels: daemonize and monitor; service launches]
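
The launch-then-monitor pattern named on this slide can be sketched as a small supervisor that starts each service once and relaunches any service whose health check fails. This is a minimal illustration under stated assumptions, not the CEI bootstrap code: `Supervisor`, `FakeService`, `add`, and `check_once` are hypothetical names, and a real deployment would run the check on a timer against real processes.

```python
from typing import Callable, Dict, List


class Supervisor:
    """Tracks services by name and relaunches any whose health check fails."""

    def __init__(self) -> None:
        self._launch: Dict[str, Callable[[], None]] = {}
        self._is_alive: Dict[str, Callable[[], bool]] = {}

    def add(self, name: str, launch: Callable[[], None],
            is_alive: Callable[[], bool]) -> None:
        self._launch[name] = launch
        self._is_alive[name] = is_alive
        launch()  # bootstrap: start the service when it is registered

    def check_once(self) -> List[str]:
        """Run one monitoring pass; return the names that were restarted."""
        restarted = []
        for name, is_alive in self._is_alive.items():
            if not is_alive():
                self._launch[name]()
                restarted.append(name)
        return restarted


class FakeService:
    """Hypothetical service used only to exercise the supervisor."""

    def __init__(self) -> None:
        self.alive = False
        self.launches = 0

    def launch(self) -> None:
        self.alive = True
        self.launches += 1

    def is_alive(self) -> bool:
        return self.alive


supervisor = Supervisor()
provisioner = FakeService()
supervisor.add("provisioner", provisioner.launch, provisioner.is_alive)

provisioner.alive = False              # simulate a crash
restarted = supervisor.check_once()    # the supervisor relaunches it
```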

Page 9:

Summary of Implementation Status

•Detailed design and implementation documents

•All major components implemented:

• Provisioner, EPU Controller, Decision Engine and Planner, Sensor Aggregator, DTRS

• Integrated with ION

•Some components needing refinement:

• bootstrap process, draft user and administrator process, image building and management

•Tested on infrastructure from Magellan to EC2

Page 10:

Technology Choices

ION: Integrated Observatory Network

boto

txrabbitmq

Twotp

Nimboss

Context Broker

Fabric

Page 11:

The Testfest

•Objectives:

•Test a fully experimental system

•queue_length excepted to make progress

• Identify areas needing potential redesign

•Test the “muscle” of the system: no optimizations, no policies, no fancy improvements

•Scalability target: up to 1000 VMs

•237 achieved so far

Page 12:

R1 Use Cases Demonstrated

•UC.R1.16: Scale the processing

• A load is put on the system

• Additional demand is recognized via different sensors

• Message queue length, CPU loads, disk usage

• System scales up to meet increased demand

• System scales down when demand goes away

•UC.R1.25: Assure reliability

• Failures happen

• Remedial actions happen

• No significant impact on observatory operation
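
The scale-up and scale-down behavior described for UC.R1.16 can be illustrated with a small decision function that combines the three sensor types the slide names. This is a hedged sketch, not the CEI decision engine: `Reading`, `desired_workers`, the 0.8, 0.9, and 0.2 thresholds, and the default cap of 20 workers are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    queue_length: int   # messages waiting in the work queue
    cpu_load: float     # average CPU load across workers, 0.0 to 1.0
    disk_usage: float   # fraction of disk used, 0.0 to 1.0


def desired_workers(current: int, r: Reading, max_workers: int = 20) -> int:
    """Return the worker count to aim for, given the latest sensor reading."""
    # Thresholds below are illustrative assumptions, not measured values.
    overloaded = (r.queue_length > current
                  or r.cpu_load > 0.8
                  or r.disk_usage > 0.9)
    idle = r.queue_length == 0 and r.cpu_load < 0.2
    if overloaded:
        # Scale up: at least one more worker, or enough to cover the queue.
        target = max(current + 1, r.queue_length)
    elif idle:
        # Scale down one worker at a time when demand goes away.
        target = current - 1
    else:
        target = current
    return max(0, min(max_workers, target))


scale_up = desired_workers(2, Reading(queue_length=10, cpu_load=0.5, disk_usage=0.1))
scale_down = desired_workers(10, Reading(queue_length=0, cpu_load=0.05, disk_usage=0.1))
```

Scaling down one worker per pass while scaling up in larger steps is a deliberate asymmetry in this sketch: it reacts quickly to demand and releases capacity cautiously.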

Page 13:

Testing Environment

[Figure labels: Client; Exchange Point; HA App-v1; EPU Worker; VM; Capability Container; App-v1; cc-agent; ctx-agent; IaaS; Context Broker; HA-P (Provisioner-0, Provisioner-2); Deployable Type Registry Service; EPU Controller (App-v1); DE (Planner); Sensor Aggregator (App-v1)]

[Arrow labels: updates; queue length; uses; queries; contextualization; launches VM; health report; per-node status]

[Instance type labels: EC2 small; EC2 High-CPU XL; EC2 Small; UC EC2 small]

Page 14:

Scale the Processing

• Average load scenario

• 70 jobs, infinitely long

• One job per VM

• Submitted over 28 minutes, 5 jobs every 2 minutes

• Worst-case scenario

• 70 jobs, infinitely long

• One job per VM

• Saturating the system
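
The average-load numbers are consistent with each other: 70 jobs in batches of 5 gives 14 batches, and 14 batches at 2-minute intervals span 28 minutes. The sketch below reproduces that arithmetic. `submission_schedule` is a hypothetical helper, and it assumes each batch is counted at the end of its 2-minute interval; if the first batch were instead submitted at minute 0, the last would land at minute 26.

```python
def submission_schedule(total_jobs: int, batch_size: int, interval_min: int):
    """Return (minute, jobs_submitted_so_far) after each batch."""
    schedule = []
    submitted = 0
    minute = 0
    while submitted < total_jobs:
        submitted += min(batch_size, total_jobs - submitted)
        minute += interval_min   # assumption: batch counted at end of its slot
        schedule.append((minute, submitted))
    return schedule


schedule = submission_schedule(total_jobs=70, batch_size=5, interval_min=2)
batches = len(schedule)
last_minute, total_submitted = schedule[-1]
```

With one job per VM and jobs that never finish, the cumulative job count at each step is also the number of VMs the system is being asked to provide at that time.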

Page 15:

Assure Reliability

• How does the system react to failure?

• Saturate the system with 10s jobs

• Bounded policy: 20 VMs

• Kill 2 VMs every 5 minutes
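
The failure test above can be modeled as repeated kill-and-recover rounds against a fixed bound. The following is a simplified simulation written for illustration only. `run_failure_test` is a hypothetical name, each loop iteration stands for one 5-minute round, the victims are chosen deterministically (lowest ids first) purely so the run is repeatable, and recovery is treated as instantaneous, which a real system would not be.

```python
import itertools


def run_failure_test(bound: int = 20, kills_per_round: int = 2, rounds: int = 3):
    """Simulate kill-and-recover rounds under a bounded policy.

    Returns the final set of running VM ids and a log with one
    (victims, replacements_launched, running_after_recovery) tuple per round.
    """
    ids = itertools.count()
    running = {next(ids) for _ in range(bound)}
    log = []
    for _ in range(rounds):
        # Failure injection: kill the lowest-numbered VMs (arbitrary choice).
        victims = sorted(running)[:kills_per_round]
        running.difference_update(victims)
        # Recovery: launch replacements until the bound is met again.
        replaced = bound - len(running)
        running.update(next(ids) for _ in range(replaced))
        log.append((victims, replaced, len(running)))
    return running, log


final_running, log = run_failure_test()
```

In this model, three rounds kill six VMs and launch six replacements, so 26 VMs are launched in total while the running count returns to 20 after every round.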

Page 16:

Lessons Learned

•Many, MANY, tractable small issues and lessons learned

•a.k.a., “an endless stream of simple bugs” ;-)

•Most significant unresolved issues:

• Messaging system connections close unexpectedly

• Currently prevents us from running at scale, need for scalability testing in COI

• Inspecting message queue remotely needs to be rethought

• Need for concurrency in the container

• Unresolved issue in “pulling work”

•Lots of work to do!

Page 17:

Risk Assessment - CEI

Use Cases

ID        Name                   Description                                    Risk of Non-Availability  Level of Maturity  Target Use
UC.R1.15  Put Services Anywhere  Allocate services where need is greatest       Low                       Expected           Developer
UC.R1.16  Scale the Processing   Increase processing quickly to meet demand     Low                       Expected           Developer
UC.R1.17  Replicate Service      Configure service once, deploy many times      Low                       Expected           Developer
UC.R1.26  Virtualize Everything  Virtual processes embody all system services   Low                       Expected           Developer
UC.R1.25  Assure Reliability     Computer fails, messages resent, work resumes  Medium                    Necessary          Developer
UC.R1.28  Operate System         Configure system and respond to requests       Medium                    Necessary          Operator

Services

Name                          Risk of Non-Availability  Level of Maturity  Target Use
Elastic Computing             Low                       Expected           Developer
Exec Engine Repository        Low                       Expected           Developer
Resource Management Services  Medium                    Necessary          Developer

Page 18:

Roadmap

Iteration 1: Finalize components and interactions

- Continue stress testing

- Refine Deployable Type Creation and Management

- Integration with Data Management

- Bootstrapping

Iteration 2: Prepare an Internal Release

- Refine the policy engine

- Continue testing

- Build & test harness

- Preliminary documentation

Iteration 3: Prepare an External Release

- Testing and robustness

- User and admin process

- Improve quality and documentation

Page 19:

Questions?

Page 20:

Use Cases at (Medium) Risk for Release 1

Type      Title                Impact
UC.R1.16  Scale Processing     Potential known obstacles to scalability
UC.R1.25  Assure Reliability   Potential known unreliable scenarios
UC.R1.28  Operate System       Scaled down functionality, ease of use
UC.R1.30  Troubleshoot System  Scaled down functionality, ease of use