OOI CI LCA REVIEW August 2010
Ocean Observatories Initiative
Common Execution Infrastructure
Kate Keahey, Tim Freeman, Alex Clemesha, David LaBissoniere, John Bresnahan
Life Cycle Architecture Review, La Jolla, CA
Common Execution Infrastructure Purpose
•Basic capabilities in resource provisioning on IaaS clouds
•Commercial
•National infrastructure
•Highly Available (HA) services
•Allow OOI computations to scale to demand by leveraging elastically provisioned resources
R1 Use Cases
ID        Title                  Description
UC.R1.14  Use Service Anywhere   Messages go to services wherever they are
UC.R1.15  Put Services Anywhere  Allocate services where need is greatest
UC.R1.16  Scale the Processing   Increase processing quickly to meet demand
UC.R1.17  Replicate Service      Configure service once, deploy many times
UC.R1.20  Command A Resource     Send typical commands to specific resource
UC.R1.25  Assure Reliability     Computer fails, messages resent, work resumes
UC.R1.26  Virtualize Everything  Virtual processes embody all system services
UC.R1.28  Operate System         Configure system and respond to requests
UC.R1.30  Troubleshoot System    Diagnose issues using logs, feeds, tools
User’s View of the Architecture
[Figure: diagram relating the user-facing terms. Labels: HA Service (OOI Application); EPU; EPU Worker (Operational Unit); VM (Deployable Unit); Application Software (Deployable Type).]
Overall Architecture
[Figure: high-level diagram. Labels: Client; Exchange Point; HA App-v1; VM. Annotation: “…and then a miracle occurs…”]
Overall Architecture
[Figure: detailed architecture diagram. Component labels: Client; Exchange Point; HA App-v1; VM; EPU Worker; Capability Container; App-v1; cc-agent; ctx-agent; EPU Controller (App-v1); DE (Planner); Sensor Aggregator (App-v1); Deployable Type Registry Service; HA-P; Provisioner-0; Provisioner-2; IaaS; Context Broker. Arrow labels: updates; queue length; uses; queries; contextualization; launches VM; health report; per-node status.]
HA Provisioner
[Figure: HA Provisioner diagram. Component labels: Capability Container; One VM; HA-P; Provisioner-0; Provisioner-2; Provisioner (Provisioner); Controller (HA-Provisioner); Sensor Aggregator (HA-Provisioner); IaaS; Context Broker; Base CEI Instance; All other EPU controllers; Operations. Annotations: per-node status; queue length; monitors and restarts; “Bottom Turtle: Operations”.]
Bootstrapping and Monitoring
[Figure: bootstrapping diagram. Component labels: epu_control; Core Services; Context Broker; Messaging Service; Base CEI Instance; Provisioner (Provisioner); Controller (HA-Provisioner); Sensor Aggregator (HA-Provisioner); HA-P; Provisioner-0; Provisioner-2. Annotations: launch, test, monitor (repeated at each stage); daemonize and monitor; service launches.]
Summary of Implementation Status
•Detailed design and implementation documents
•All major components implemented:
• Provisioner, EPU Controller, Decision Engine and Planner, Sensor Aggregator, DTRS (Deployable Type Registry Service)
• Integrated with ION
•Some components needing refinement:
• bootstrap process, draft user and administrator process, image building and management
•Tested on infrastructure from Magellan to EC2
Technology Choices
ION: Integrated Observatory Network
boto
txrabbitmq
Twotp
Nimboss
Context Broker
Fabric
The Testfest
•Objectives:
•Test a fully experimental system
•queue_length excepted to make progress
• Identify areas needing potential redesign
•Test the “muscle” of the system: no optimizations, no policies, no fancy improvements
•Scalability target: up to 1000 VMs
•237 achieved so far
R1 Use Cases Demonstrated
•UC.R1.16: Scale the processing
• A load is put on the system
• Additional demand is recognized via different sensors
• Message queue length, CPU loads, disk usage
• System scales up to meet increased demand
• System scales down when demand goes away
•UC.R1.25: Assure reliability
• Failures happen
• Remedial actions happen
• No significant impact on observatory operation
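The scale-up and scale-down behaviour listed under UC.R1.16 can be illustrated with a minimal sketch of a sizing policy. This is not the project's Decision Engine: the names `SensorReading` and `desired_workers`, the thresholds, and the clamping bounds are all hypothetical, and only two of the sensors named above (message queue length and CPU load) are modelled.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    """Hypothetical snapshot of two of the sensors named on the slide."""
    queue_length: int   # messages waiting in the work queue
    cpu_load: float     # mean CPU load across workers, 0.0 to 1.0


def desired_workers(reading, current, min_workers=1, max_workers=20,
                    jobs_per_worker=1, cpu_high=0.9):
    """Return the worker count a simple, illustrative policy would request."""
    # Size the pool to the backlog: ceil(queue_length / jobs_per_worker).
    target = -(-reading.queue_length // jobs_per_worker)
    # If existing workers are CPU-saturated, ask for at least one more.
    if reading.cpu_load >= cpu_high:
        target = max(target, current + 1)
    # Clamp to the configured bounds (assumed values, not from the slides).
    return max(min_workers, min(max_workers, target))


if __name__ == "__main__":
    # Demand rises: the request grows up to the upper bound.
    print(desired_workers(SensorReading(queue_length=70, cpu_load=0.5), current=1))
    # Demand goes away: the request shrinks to the lower bound.
    print(desired_workers(SensorReading(queue_length=0, cpu_load=0.1), current=5))
```

Keeping the policy a pure function of the sensor readings makes it easy to test in isolation from the provisioning side.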
Testing Environment
[Figure: the detailed architecture diagram annotated with the instance types used in testing. Component labels: Client; Exchange Point; HA App-v1; VM; EPU Worker; Capability Container; App-v1; cc-agent; ctx-agent; EPU Controller (App-v1); DE (Planner); Sensor Aggregator (App-v1); Deployable Type Registry Service; HA-P; Provisioner-0; Provisioner-2; IaaS; Context Broker. Instance labels: EC2 small; EC2 High-CPU XL; EC2 Small; UC EC2 small.]
Scale the Processing
• Average load scenario
• 70 jobs, infinitely long
• One job per VM
• Submitted over 28 minutes, 5 jobs every 2 minutes
• Worst-case scenario
• 70 jobs, infinitely long
• One job per VM
• Saturating the system
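The average-load numbers are self-consistent: 5 jobs every 2 minutes over 28 minutes is 14 batches, or 70 jobs. The sketch below reproduces that schedule. The function name is hypothetical, and it assumes each batch lands at the end of its 2-minute interval, which the slide does not specify.

```python
def submission_schedule(total_jobs=70, batch_size=5, interval_min=2):
    """Return (minute, jobs_submitted_so_far) pairs, one per batch.

    Assumption: a batch is submitted at the end of each interval, so the
    first batch arrives at minute 2 and the last at minute 28.
    """
    schedule = []
    submitted = 0
    minute = 0
    while submitted < total_jobs:
        minute += interval_min
        submitted += min(batch_size, total_jobs - submitted)
        schedule.append((minute, submitted))
    return schedule


if __name__ == "__main__":
    for minute, submitted in submission_schedule():
        # With one job per VM, jobs submitted equals VMs demanded.
        print(f"t={minute:2d} min  jobs submitted={submitted}")
```

With one job per VM, the cumulative job count is also the number of VMs the system is asked to reach at each step.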
Assure Reliability
• How does the system react to failure?
• Saturate the system with 10s jobs
• Bounded policy: 20 VMs
• Kill 2 VMs every 5 minutes
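A toy simulation of this experiment, assuming a bounded policy that replaces every killed VM and never exceeds the bound. The function name and the 2-minute replacement delay (`recovery_min`) are hypothetical. The slides do not state how long recovery took, and no measured results are reproduced here.

```python
def simulate_failures(bound=20, kill_count=2, kill_every_min=5,
                      recovery_min=2, duration_min=30):
    """Return (minute, running_vms) samples, one per simulated minute."""
    running = bound
    pending = []   # minutes at which replacement VMs become available
    timeline = []
    for minute in range(duration_min + 1):
        # Replacements that have finished booting rejoin the pool.
        ready = [t for t in pending if t <= minute]
        pending = [t for t in pending if t > minute]
        running += len(ready)
        # Inject failures on schedule (none at minute 0).
        if minute > 0 and minute % kill_every_min == 0:
            killed = min(kill_count, running)
            running -= killed
            # Bounded policy: request just enough replacements to
            # return to the bound, never more.
            pending.extend([minute + recovery_min] * killed)
        timeline.append((minute, running))
    return timeline


if __name__ == "__main__":
    for minute, running in simulate_failures():
        print(f"t={minute:2d} min  running VMs={running}")
```

In this model the pool dips to 18 VMs after each kill and returns to 20 once the replacements arrive, so running plus pending VMs always equals the bound.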
Lessons Learned
•Many, MANY, tractable small issues and lessons learned
•a.k.a., “an endless stream of simple bugs” ;-)
•Most significant unresolved issues:
• Messaging system connections close unexpectedly
• Currently prevents us from running at scale, need for scalability testing in COI
• Inspecting message queue remotely needs to be rethought
• Need for concurrency in the container
• Unresolved issue in “pulling work”
•Lots of work to do!
Risk Assessment - CEI
Use Cases
ID        Name                   Description                                    Risk of Not Availability  Level of Maturity  Target Use
UC.R1.15  Put Services Anywhere  Allocate services where need is greatest       Low                       Expected           Developer
UC.R1.16  Scale the Processing   Increase processing quickly to meet demand     Low                       Expected           Developer
UC.R1.17  Replicate Service      Configure service once, deploy many times      Low                       Expected           Developer
UC.R1.26  Virtualize Everything  Virtual processes embody all system services   Low                       Expected           Developer
UC.R1.25  Assure Reliability     Computer fails, messages resent, work resumes  Medium                    Necessary          Developer
UC.R1.28  Operate System         Configure system and respond to requests       Medium                    Necessary          Operator
Services
Name                          Risk of Not Availability  Level of Maturity  Target Use
Elastic Computing             Low                       Expected           Developer
Exec Engine Repository        Low                       Expected           Developer
Resource Management Services  Medium                    Necessary          Developer
Roadmap
Iteration 1: Finalize components and interactions
- Continue stress testing
- Refine Deployable Type Creation and Management
- Integration with Data Management
- Bootstrapping
Iteration 2: Prepare an Internal Release
- Refine the policy engine
- Continue testing
- Build & test harness
- Preliminary documentation
Iteration 3: Prepare an External Release
- Testing and robustness
- User and admin process
- Improve quality and documentation
Questions?
Use Cases at (Medium) Risk for Release 1
Type      Title                Impact
UC.R1.16  Scale Processing     Potential known obstacles to scalability
UC.R1.25  Assure Reliability   Potential known unreliable scenarios
UC.R1.28  Operate System       Scaled down functionality, ease of use
UC.R1.30  Troubleshoot System  Scaled down functionality, ease of use