Learning to Scale OpenStack: An Update from the Rackspace Public Cloud
Uploaded by jesse-keating
Transcript of Learning to Scale OpenStack: An Update from the Rackspace Public Cloud
Openstack_ATL
An update from the Rackspace Public Cloud
Learning to Scale OpenStack
Rainya Mosher and Jesse Keating, Deployment Engineering (@rainyamosher, @iamjkeating)
Introductions, welcome to the talk.
The Rackspace Public Cloud
6 Public Regions
3 Pre-Production Regions
10s of Thousands of nodes
Growing continually
Frequent deployments
Staying aligned with upstream
#rackstackatl
A review of the Rackspace Public Cloud sets the context for the conversation.
We could not deploy code in a reasonable window of time
We did not have confidence in the code we were deploying
We could not keep up with upstream
Our Old Challenges
This is our third summit presenting on this topic. Here is a brief review of some of the scale issues we were facing back at the Havana Summit in Portland.
Our window is 30 minutes of perceived downtime within 4-hour deploy windows.
Code coverage wasn't great, and many errors were discovered in production.
Upstream moved very fast, and we couldn't keep up with all the testing downstream.
Deploys taking 6+ hours
Deploys often failed the first time
Migrations were an unknown factor
Deploys roughly 2 months behind upstream
Old Challenges Met
Deploys take an hour, as short as 10 minutes
Deploys rarely fail the first time
Migrations tested upstream and timed downstream
Still up to 2 months behind
Here is a comparison of how we met some of our challenges
Our deploys are much faster: some take as little as 10 minutes total in our largest environment, with 3 minutes of API interruption.
Deploys are now more reliable
Migration data is known ahead of time (and bad ones blocked upstream)
We still haven't solved keeping up with upstream. Many factors there.
It is by riding a bicycle that you learn the contours of a
country best, since you have to sweat up the hills and coast down
them.
~ Ernest Hemingway
We are also learning the contours of OpenStack by being the largest public cloud operator. We get to sweat up the hills and coast back down.
Scaling Services
Scaling Deployments
Scaling Frequency
Our New Challenges
Some of our new challenges are about scaling, not just deploying bits on nodes as fast as we can.
Scaling Services, Scaling Deployments, Scaling Frequency.
While we are trying to be a thought leader and front runner, collaboration is the key to success. The developer, operator, and testing communities need to be aware of these scaling challenges.
Scaling Services
As our cloud grows in size and in features, the services it uses need to scale along with it. Here we will walk through two scaling scenarios that highlight the challenge.
Scaling Glance
Scheduled Images feature went live
Glance saw much more usage
Glance servers became saturated
Builds and snapshots slowed down, eventually piling up faster than could be consumed
Resolved by:
Scaling number of glance-api nodes
Scaling size of glance-api nodes
Scaling use of glance-bypass feature
Glance is an interesting case. Our Glance acts as a middleman between the hypervisors (HVs) and Swift. As Glance got used more, the bottleneck emerged, partly due to our own configuration and partly due to the nature of Glance.
Once we resolve the Glance issues, Swift could be the next bottleneck. Care will be needed to make sure we don't just kick performance problems down the line to the next group.
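The "piling up faster than could be consumed" behavior above can be sketched with simple rate arithmetic. This is a minimal illustration, not how Rackspace sized its glance-api tier: the function names, the linear capacity model, and every number are assumptions made for the example.

```python
import math


def backlog_after(arrival_rate, per_node_rate, nodes, seconds, start=0.0):
    """Requests still waiting after `seconds`, assuming constant rates.

    arrival_rate and per_node_rate are in requests per second.
    The backlog can drain to zero but never goes negative.
    """
    net = arrival_rate - per_node_rate * nodes
    return max(0.0, start + net * seconds)


def nodes_needed(arrival_rate, per_node_rate, headroom=0.2):
    """Smallest node count whose capacity covers arrivals plus headroom."""
    return math.ceil(arrival_rate * (1 + headroom) / per_node_rate)


# Illustrative numbers only, not measurements from the talk.
print(backlog_after(arrival_rate=50, per_node_rate=10, nodes=4, seconds=600))
print(nodes_needed(arrival_rate=50, per_node_rate=10, headroom=0.5))
```

Adding nodes only helps while the next hop (Swift, in the notes above) has spare capacity, which is why the same arithmetic has to be repeated for each service in the chain.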
Scaling Nova Cells
Performance Cells went live
More and more cells added to regions
Nova cells service became a single funnel, slowing down the exchange of data
Eventually our single nova-cells service could not consume messages faster than they were being produced
Resolved by:
Scaling number of nova-cells services
Optimizing instance healing calls
Optimizing database usage from cells service
Nova cells is responsible for the interaction between the global cell and all the child cells. Doing this with just a single instance was never going to scale; we just ran out of runway before the pain hit.
Through collaboration with upstream, we are now more able to scale out nova-cells as our cell counts grow.
How do we anticipate where our growth will hurt and proactively scale to match?
These challenges will repeat. New bottlenecks will be found and new resource limits will be discovered. Staying ahead of the pain is key. We will not be the only ones to experience this, we are looking for collaboration on how best to manage this kind of scale.
Scaling Deployments
Our next scale challenge involves deployments.
We made great strides around Havana, what have we been doing since?
Higher Form Orchestration
Pre-staging content outside of deploy window
Increased tolerance of downed hosts
Targeted bring up of services: API first, then computes
More deployment options: Factonly, Cellonly, No migrations
Reduced complexity: single entry point (bin/deploy), single orchestration system (Ansible)
Orchestration has been our theme around deployments. We continue to iterate on the parts of the deployment causing the most pain, always making improvements for the next time.
Walk through each block and explain why the change was made
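Two of the ideas above, targeted bring-up (API first, then computes) and tolerance of downed hosts, can be sketched in a few lines. This is plain Python for illustration and not the Ansible-based bin/deploy described in the slides. The `deploy` function, the `push` callback, the role names, and the 5% threshold are all hypothetical.

```python
def deploy(hosts_by_role, push, order=("api", "compute"),
           max_failed_fraction=0.05):
    """Deploy role by role, in `order`, tolerating a few downed hosts.

    `push(host)` is a caller-supplied function returning True on success.
    Returns the list of hosts that failed; raises if any single role
    loses more than `max_failed_fraction` of its hosts.
    """
    failed = []
    for role in order:
        hosts = hosts_by_role.get(role, [])
        role_failed = [h for h in hosts if not push(h)]
        failed.extend(role_failed)
        if hosts and len(role_failed) / len(hosts) > max_failed_fraction:
            raise RuntimeError(
                f"too many failures in role {role!r}: {role_failed}")
    return failed


# Hypothetical fleet: two API nodes and forty computes, one of them down.
fleet = {
    "api": ["api-1", "api-2"],
    "compute": [f"compute-{i}" for i in range(40)],
}
print(deploy(fleet, push=lambda host: host != "compute-7"))
```

One downed compute out of forty stays under the threshold, so the run finishes and reports the host for follow-up instead of failing the whole deploy.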
We still treat OpenStack as a legacy software deployment. As a community we need to treat it more like a cloud application, but that requires collaboration!
Even with the improvements, we still treat OpenStack like a legacy application: upgrading in place, not utilizing load balancers, stopping everything to migrate databases, preventing mixed versions, etc. Many things are preventing us from getting to zero downtime, and that's where we can all work together!
Scaling Frequency
A third scale challenge is frequency. This is the scale of doing things much more often.
It never gets easier, you just go faster.
~ Greg LeMond
A very relevant quote. Unlike bicycling, though, when you do something more often in the DevOps world it does tend to get easier, but there are challenges to going faster!
Scaling Change
New features coming
New configurations coming
Accommodate without interrupting customer experience
Change faster, change frequently, on an ever growing fleet of systems
Resolved by:
Understanding change before it happens
Scheduling changes to not conflict
Dedicating release iterations to risky change on top of known good code
Custom deploy modes per change type
Change comes from many sources. These changes need to be distributed to the environments, but with as little customer impact as possible. If we can't deploy changes often enough, we fall behind upstream, we fall behind our features, and we have larger deployments to consume. A snowball effect.
Our work on creating multiple new release pipelines, improving our deployment methods, and moving our tests upstream has enabled us to move faster, but not fast enough.
Customer Experience is our most important measurement of how fast we can scale.
This is our limit. We absolutely have to make this better. This is a global need, throughout the community of developers, operators, and testers.
The Next Iteration
A quick look at what we've got cooking for the Juno cycle
Leverage object model in Icehouse for mixed-version services
Implement Nova conductor service
Investigate read-only states
Zero Perceived Downtime
In Icehouse, Nova made great strides toward live upgrades with the object model and conductor, which give us the ability to run multiple versions of OpenStack at the same time. Notably, we could run a newer nova-api against an older version in the rest of the environment and shield nova-compute from migrations. This could allow us to roll the update through without API downtime and with less interruption to the computes.
Investigate putting API nodes in read-only during migrations to satisfy some requests and queue others
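The read-only idea under investigation could look roughly like the sketch below: while a migration runs, read requests are served and everything else is queued for replay. This is a thought experiment and not Nova code. The `ReadOnlyGate` class, the `backend` callable, and the "202 queued" response are hypothetical, and the sketch ignores the harder question of whether reads are safe while a schema is changing.

```python
from collections import deque


class ReadOnlyGate:
    """Serve reads and queue writes while `read_only` is set."""

    READ_METHODS = {"GET", "HEAD"}

    def __init__(self, backend):
        self.backend = backend      # callable(method, path) -> response
        self.read_only = False
        self.pending = deque()      # writes held back during a migration

    def handle(self, method, path):
        if self.read_only and method not in self.READ_METHODS:
            self.pending.append((method, path))
            return "202 queued"
        return self.backend(method, path)

    def resume(self):
        """Leave read-only mode and replay queued writes in arrival order."""
        self.read_only = False
        results = []
        while self.pending:
            results.append(self.backend(*self.pending.popleft()))
        return results


gate = ReadOnlyGate(backend=lambda method, path: f"200 {method} {path}")
gate.read_only = True                      # migration starts
print(gate.handle("GET", "/servers"))      # served immediately
print(gate.handle("POST", "/servers"))     # held until the migration ends
print(gate.resume())                       # queued write replayed
```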
Can we give Glance its own pipeline and deployment capability, independent of Nova or other services?
How do we combat the exponential growth of service version combinations?
Does this actually make the whole pipeline any faster?
Individual Service Deployment Pipelines
This is an ongoing conversation. If we allow each service to work independently, what does that do to the version test matrix? Can we reliably validate anything? While individual projects/services might go faster, does that allow the entire pipeline to go faster? This ties into the discussions happening now at the design summit about cross project interactions.
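The growth of the version test matrix can be made concrete by enumerating every combination. The sketch below assumes a hypothetical set of services and version labels; with n services at k candidate versions each, the matrix has k to the power n entries.

```python
from itertools import product


def version_matrix(versions_by_service):
    """Every combination of one version per service, as a list of dicts."""
    names = sorted(versions_by_service)
    combos = product(*(versions_by_service[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical example: three services, each allowed to run one of two releases.
services = {
    "nova": ["havana", "icehouse"],
    "glance": ["havana", "icehouse"],
    "neutron": ["havana", "icehouse"],
}
matrix = version_matrix(services)
print(len(matrix))
print(matrix[0])
```

Allowing a third version per service, or adding a fourth service, multiplies the count again, which is why independent pipelines need an agreed limit on which combinations are supported and tested.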
Creating not just ephemeral environments, but production ones as well
Upgrades are easy, initial setups are a lot harder
Validation is critical
Developers and Operators need to collaborate on this use case when services are being designed
Fully Automated Environments
Yeah, we need them. Setting them up is hard; let's work together to make them easier.
The ops meetups are great for collaborating on the issues at hand.
I have always struggled to achieve excellence. One thing that
cycling has taught me is that if you can achieve something without
a struggle it's not going to be satisfying.
~ Greg LeMond
We do a lot of things that are hard, but if it wasn't hard, it wouldn't be as satisfying. That's what keeps us coming back.
Scaling is more than just tossing code on nodes. There are a lot more considerations to take into account.
The development, operator, and tester communities need to collaborate more on where the painful parts are, particularly at scale, and work together on solutions.
RACKSPACE HOSTING | 5000 WALZEM ROAD | SAN ANTONIO, TX 78218
US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM
RACKSPACE HOSTING | RACKSPACE US, INC. | RACKSPACE AND FANATICAL SUPPORT ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM