Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack
Transcript of Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack
![Page 1: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/1.jpg)
Embracing Failure: Self-Healing, Decentralized Resource Management for Apache CloudStack
John Burwell, Vice President, Software Engineering
[email protected] | @john_burwell
![Page 2: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/2.jpg)
@shapeblue #ccceu
VP of Software Engineering @ ShapeBlue
Member, Apache CloudStack PMC (June 2013)
Ran operations and designed automated provisioning for analytic/virtualization clouds
Led architectural design and server-side development of a SaaS physical security platform
About Me
![Page 3: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/3.jpg)
“ShapeBlue are expert builders of public &
private clouds. They are the leading global
Apache CloudStack integrator & consultancy”
…and we’re hiring!
About ShapeBlue
![Page 4: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/4.jpg)
Bang ups and Hang Ups Can Happen to You
Derive the normative operation design from failure recovery
![Page 5: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/5.jpg)
What is a Resource?
Control Plane (Desired State)
Devices (Actual State)
Resource (Converges Desired with Actual State)
Eventually, the desired and actual states will be consistent
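The convergence idea on this slide can be illustrated with a small reconciliation loop. This is only a sketch under stated assumptions: the `Resource` class, its dictionary-based state, and the `flaky_device` stand-in are hypothetical names invented for illustration and are not part of CloudStack.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource holding desired and actual state as plain dicts."""
    desired: dict
    actual: dict = field(default_factory=dict)

    def diff(self) -> dict:
        """Return the settings whose actual value differs from the desired one."""
        return {k: v for k, v in self.desired.items() if self.actual.get(k) != v}

    def converge(self, apply) -> int:
        """Try to apply each differing setting; return how many were attempted."""
        changes = self.diff()
        for key, value in changes.items():
            if apply(key, value):  # the device call may fail
                self.actual[key] = value
        return len(changes)


def flaky_device():
    """Stand-in device call that fails on the first attempt for each key."""
    seen = set()

    def apply(key, value):
        if key in seen:
            return True
        seen.add(key)
        return False

    return apply


vm = Resource(desired={"cpus": 4, "memory_mb": 8192}, actual={"cpus": 2})
apply = flaky_device()
passes = 0
while vm.diff():  # keep reconciling until the states agree
    vm.converge(apply)
    passes += 1
print(passes, vm.actual)
```

A failed device call simply leaves the difference in place, so the next pass retries it. That is the sense in which the two states become consistent eventually rather than immediately.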
![Page 6: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/6.jpg)
CloudStack partitions resources into zones, clusters, and pods
![Page 7: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/7.jpg)
Resource status information is stale or lost
Resource definitions conflict with device state
Entropy
Failure Modes
![Page 8: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/8.jpg)
![Page 9: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/9.jpg)
Consistency
Availability
Partition Tolerance
Pick 2
![Page 10: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/10.jpg)
Orchestration operations are available and eventually consistent
... but device modifications must be consistent.
![Page 11: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/11.jpg)
![Page 12: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/12.jpg)
Orchestration Tier: AP
Automation Control Tier: CP
![Page 13: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/13.jpg)
Desired Resource State: AP
Actual Resource State: CP
![Page 14: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/14.jpg)
Scheduling: AP
State Convergence: CP
Resource Offers
Resource Status
State Transitions
Hoke
![Page 15: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/15.jpg)
Simple
Self-contained
Locality
Non-persistent
Hoke Design Goals
![Page 16: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/16.jpg)
Runtime Resource View
[Diagram components: Resource FSM, Management Process, Monitor Process, Device, Queue, State Transition, ResourceOffer, ResourceStatus]
![Page 17: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/17.jpg)
An actor represents state and behavior
Actors communicate by message passing; each actor has a dedicated queue or mailbox
Each actor is allocated a lightweight thread, which acts as an implicit lock
Actor Model
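A minimal sketch of the actor idea, assuming an ordinary OS thread as a stand-in for the lightweight thread mentioned above. `CounterActor` and its message names are hypothetical and chosen only for illustration.

```python
import queue
import threading


class CounterActor:
    """Hypothetical actor: private state, one mailbox, one worker thread."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Deliver a message to the mailbox; never touches state directly."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        # Messages are handled one at a time, so no explicit lock is needed.
        while True:
            message, reply_to = self._mailbox.get()
            if message == "stop":
                break
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)

    def join(self):
        self._thread.join()


def send_many(actor, n):
    for _ in range(n):
        actor.send("increment")


actor = CounterActor()
senders = [threading.Thread(target=send_many, args=(actor, 100)) for _ in range(4)]
for t in senders:
    t.start()
for t in senders:
    t.join()

reply = queue.Queue()
actor.send("get", reply)
total = reply.get(timeout=5)
actor.send("stop")
actor.join()
print(total)
```

Four threads send concurrently, yet the counter is never guarded by a lock: serializing all access through the mailbox is what the slide calls the implicit lock.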
![Page 18: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/18.jpg)
All resources are represented in a directed acyclic graph
The root node of the graph is the region, and partitions are organized as follows: region -> zone -> pod -> cluster
Each resource is a child of the partition node that owns it
Resource Graph
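The partition hierarchy can be sketched as a simple tree, which is one special case of a directed acyclic graph. The `Node` class and the names `eu`, `zone1`, and so on are hypothetical, used only to illustrate the region -> zone -> pod -> cluster ordering.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical graph node: a partition or a resource owned by one."""
    name: str
    kind: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, name: str, kind: str) -> "Node":
        """Create a child node owned by this node."""
        child = Node(name, kind, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the names from the root down to this node."""
        node, parts = self, []
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))

    def walk(self):
        """Yield this node and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


region = Node("eu", "region")
zone = region.add("zone1", "zone")
pod = zone.add("pod1", "pod")
cluster = pod.add("cluster1", "cluster")
host = cluster.add("host1", "host")

print(host.path())
print([n.kind for n in region.walk()])
```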
![Page 19: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/19.jpg)
Google’s resource scheduler
Transactional shared-state model enabling sophisticated, global decision making
Supports both high-churn and low-churn workloads
Multiple, pluggable schedulers working in parallel
Inspiration from Omega
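The transactional shared-state model can be sketched with optimistic concurrency, the approach described in the Omega paper: each scheduler plans against a private snapshot and its commit succeeds only if the shared state has not changed in the meantime. Everything below is a simplified assumption. `CellState`, `schedule`, and the single whole-state version counter are hypothetical, and the counter is much coarser than the conflict detection the paper describes.

```python
import threading


class CellState:
    """Hypothetical shared state: free CPUs per host plus a version counter."""

    def __init__(self, free):
        self._free = dict(free)
        self._version = 0
        self._lock = threading.Lock()  # guards only the brief commit step

    def snapshot(self):
        """Return the current version and a private copy of the state."""
        with self._lock:
            return self._version, dict(self._free)

    def commit(self, seen_version, host, cpus):
        """Claim cpus on host if nothing changed since seen_version."""
        with self._lock:
            if seen_version != self._version or self._free[host] < cpus:
                return False  # conflict: the caller must re-plan
            self._free[host] -= cpus
            self._version += 1
            return True


def schedule(state, cpus, max_attempts=5):
    """Plan against a snapshot, then try to commit; retry on conflict."""
    for _ in range(max_attempts):
        version, free = state.snapshot()
        candidates = [h for h, c in sorted(free.items()) if c >= cpus]
        if not candidates:
            return None
        if state.commit(version, candidates[0], cpus):
            return candidates[0]
    return None


state = CellState({"h1": 4, "h2": 8})
stale_version, _ = state.snapshot()
first = schedule(state, 4)                     # claims h1
stale = state.commit(stale_version, "h2", 2)   # rejected: state moved on
second = schedule(state, 6)                    # claims h2
third = schedule(state, 6)                     # nothing large enough remains
print(first, stale, second, third)
```

Because each scheduler carries its own placement logic and only the commit is shared, several different schedulers can run in parallel against the same state.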
![Page 20: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/20.jpg)
Two-level scheduler
Resource offers
Pessimistic locking
Pluggable
Geared towards high-churn workloads
Inspiration from Mesos
![Page 21: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/21.jpg)
Best-effort shared-state scheduler
Multiple parallel schedulers distributed by partition
Combines allocators and planners
Pluggable
Hybrid Scheduler
![Page 22: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/22.jpg)
Partition controllers spawn system VMs for their child partitions as needed to address scheduler business and reliability guarantees
Parent partition controllers monitor the health of their child partition controllers and re-spawn as necessary
Auto Scaling, Self Healing
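The parent-monitors-child relationship can be sketched as a small supervision routine. This is an illustrative assumption only: the `Controller` class and its `healthy` flag are hypothetical, and a real controller would probe a system VM rather than read a boolean.

```python
class Controller:
    """Hypothetical partition controller that supervises its children."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.children = {}
        self.respawns = 0

    def spawn(self, name):
        """Start (or replace) the child controller registered under name."""
        self.children[name] = Controller(name)
        return self.children[name]

    def check_children(self):
        """Replace every unhealthy child; return the names re-spawned."""
        replaced = []
        for name, child in list(self.children.items()):
            if not child.healthy:
                self.spawn(name)
                self.respawns += 1
                replaced.append(name)
        return replaced


zone = Controller("zone1")
pod1 = zone.spawn("pod1")
zone.spawn("pod2")

pod1.healthy = False            # simulate a failed child controller
print(zone.check_children())    # the failed child is replaced
print(zone.check_children())    # nothing left to repair
```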
![Page 23: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/23.jpg)
Evaluate implementing the concepts in the Orleans paper to reduce the number of active actors required
Determine the best approach to causality tracking for state transitions (e.g. version vectors)
Create a library implementing these concepts to demonstrate viability, separate concerns, and enable performance testing
Next Steps
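For readers unfamiliar with version vectors, the following sketch shows how they order two state transitions or flag them as concurrent. The function names and the `pod1`/`pod2` node identifiers are hypothetical, and nothing here implies that the design has settled on this technique.

```python
def increment(vv, node):
    """Return a copy of vv with node's counter advanced by one."""
    out = dict(vv)
    out[node] = out.get(node, 0) + 1
    return out


def compare(a, b):
    """Classify a relative to b: equal, before, after, or concurrent."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Return the element-wise maximum, which dominates both inputs."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


base = increment({}, "pod1")
left = increment(base, "pod1")    # pod1 applies a second transition
right = increment(base, "pod2")   # pod2 applies one independently

print(compare(base, left))
print(compare(left, right))
print(compare(left, merge(left, right)))
```

Two transitions that compare as concurrent were made without knowledge of each other, which is the case a convergence process has to detect and resolve.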
![Page 24: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/24.jpg)
Gilbert, Seth; Lynch, Nancy. Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. 2002.
Schwarzkopf, Malte; Konwinski, Andy; et al. Omega: Flexible, Scalable Schedulers for Large Compute Clusters. 2013.
References
![Page 25: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/25.jpg)
Hindman, Benjamin; Konwinski, Andy; et al. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. 2011.
Bernstein, Philip; Bykov, Sergey; et al. Orleans: Distributed Virtual Actors for Programmability and Scalability. 2014.
References
![Page 26: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/26.jpg)
Questions
Comments
![Page 27: Embracing Failure: Self-healing, Decentralized Resource Management for Apache CloudStack](https://reader036.fdocuments.us/reader036/viewer/2022062523/586f715b1a28ab10258b4efb/html5/thumbnails/27.jpg)
Thank you