‘fsck’ for Openstack
© 2015 PayPal Inc. All rights reserved. Confidential and proprietary.
‘fsck’ for Openstack
Wei Tian -- Cloud Performance Lead at PayPal
Zhenhua Feng -- Staff Software Engineer
10/27/2015
Detect Resource Leaking and Keep the Cloud Consistent
Agenda
• Some numbers about the PayPal cloud
• What makes our cloud inconsistent
• Our solutions to keep our cloud consistent
About PayPal Cloud
• Background
– Started in July 2012 with 1 engineer and 16 decommissioned servers
– Today, one of the world’s largest OpenStack private clouds
– Number of VMs: 82,000
– Number of physical servers: 8,064
– Number of racks: 84
– Total cores: 386,000
– Block storage: 2 petabytes
– Largest AZ with 2,500+ hypervisors
• Business Goals
– Hosting ~100% of PayPal’s production traffic (except databases and messaging)
– Powers 100% of PaaS, Dev/QA and M&As
– First production workload on SDN in 2013
What Makes the Cloud Inconsistent?
Misconfiguration
• VPC
• Flavor
• Image properties
• Host aggregate metadata
• Default security group
• Networks
• Volumes
Resource Leaking
• VM sprawl
• Inconsistent cinder volume states
• Orphaned ports
• Inconsistent DNS entries
• Inconsistent states between neutron and NSX
• Inconsistent states caused by RPC timeout
• Inconsistent DB states between API and compute cells
Misconfiguration
• In the PayPal cloud, an administrator performs the initial resource allocation and configuration.
• The resources set up include VPCs, flavors, images, networks, host aggregates, etc.
• The administrator uses the OpenStack CLI to create all of these resources and must make sure they match each other.
• As long as we are human, we are bound to make mistakes.
Scenarios prone to misconfiguration (1)
Scenarios prone to misconfiguration (2)
Resource Leaking
• When running OpenStack, the state of a resource (volume, instance, port, etc.) can sometimes become inconsistent across the cluster.
• Sometimes the state cannot be corrected through the REST API alone.
• You may need to manually edit the database or run a shell script on the hypervisor to correct the state.
• Note that it is important to find and fix the underlying issue; editing the database or running a shell script is just a quick hack.
• However, as a service operator, you also need to fix the issue right away to meet the SLA, before engineering fixes the code.
Resource Leaking (1)
• VM sprawl can cause major performance and capacity problems. The leaked resources include zombie VMs and orphaned disk files.
• The state of a volume or an instance can be inconsistent. A volume shows as attached in nova but not in cinder, or vice versa.
• Sometimes a volume deletion hangs, or a detach does not work.
• Orphaned neutron ports: ports that were not deleted when their VM was deleted, or ports without a device_id. Leaked ports in turn leak IP addresses and DNS entries.
• Inconsistent state between neutron and the NSX controller: a port is deleted from neutron but still exists in NSX.
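The two orphaned-port cases above (a port whose VM is gone, and a port with no device_id) can be sketched as a simple filter. This is a minimal illustration, not the tool from the talk: it assumes the port list and the set of live instance IDs have already been fetched, and it works on plain dictionaries that carry the neutron port fields `device_id` and `device_owner`. The function name and sample data are hypothetical.

```python
from __future__ import annotations


def find_orphaned_ports(ports: list[dict], live_instance_ids: set[str]) -> list[dict]:
    """Return ports with no device_id, or compute ports whose VM no longer exists."""
    orphaned = []
    for port in ports:
        device_id = port.get("device_id") or ""
        owner = port.get("device_owner") or ""
        if not device_id:
            # Case 1: port without a device_id.
            orphaned.append(port)
        elif owner.startswith("compute:") and device_id not in live_instance_ids:
            # Case 2: the VM was deleted but its port was left behind.
            orphaned.append(port)
    return orphaned


# Hypothetical sample data standing in for real API responses.
sample_ports = [
    {"id": "p1", "device_id": "vm-1", "device_owner": "compute:nova"},
    {"id": "p2", "device_id": "vm-9", "device_owner": "compute:nova"},
    {"id": "p3", "device_id": "", "device_owner": ""},
    {"id": "p4", "device_id": "dhcp-1", "device_owner": "network:dhcp"},
]
orphans = find_orphaned_ports(sample_ports, live_instance_ids={"vm-1"})
print([p["id"] for p in orphans])
```

Restricting the second check to `compute:` owners keeps DHCP and router ports out of the result, since their device_id is not an instance ID.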
Resource Leaking (2)
• Inconsistent DNS entries: one IP with multiple DNS entries, multiple IPs with the same DNS entry, or a failure to create or delete a DNS entry.
• Inconsistent states caused by RPC timeouts: the caller reports an RPC timeout, but the handler actually did the job and only failed to reply.
• Inconsistent states between the API cell and compute cell DBs.
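The first two DNS inconsistencies listed above (one IP with several names, one name with several IPs) can be detected by grouping records in both directions. The sketch below assumes the DNS records are already available as `(hostname, ip)` pairs; how they are exported from the DNS system is not covered in the slides, and the function name and sample records are made up for illustration.

```python
from collections import defaultdict


def find_dns_conflicts(records):
    """Group (hostname, ip) pairs and report one-to-many mappings in either direction."""
    names_by_ip = defaultdict(set)
    ips_by_name = defaultdict(set)
    for hostname, ip in records:
        names_by_ip[ip].add(hostname)
        ips_by_name[hostname].add(ip)
    # One IP that resolves from more than one DNS name.
    ip_conflicts = {ip: sorted(names) for ip, names in names_by_ip.items() if len(names) > 1}
    # One DNS name that points at more than one IP.
    name_conflicts = {name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1}
    return ip_conflicts, name_conflicts


# Hypothetical records.
sample_records = [
    ("web-1.example.internal", "10.0.0.5"),
    ("web-old.example.internal", "10.0.0.5"),
    ("db-1.example.internal", "10.0.0.7"),
    ("db-1.example.internal", "10.0.0.8"),
    ("app-1.example.internal", "10.0.0.9"),
]
ip_conflicts, name_conflicts = find_dns_conflicts(sample_records)
print(ip_conflicts)
print(name_conflicts)
```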
Introducing CloudKeeper
CloudBuilder
• Resolves misconfiguration.
• Eliminates manual steps in setting up the OpenStack cloud.
• Automates the entire setup process to avoid human errors.
• Declarative instead of imperative: all settings are described in a set of config files called the Blueprint.
• Like Puppet, CloudBuilder continuously pushes changes from the Blueprint to the OpenStack cloud and keeps them in sync.
CloudSweeper
• Resolves resource leaking.
• Has a task manager that triggers all plugin tools periodically.
• Logs the results of each cleaning tool and reports them to a dashboard for statistics and troubleshooting.
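The CloudSweeper design described above (a task manager that runs every plugin on a schedule and records each result) could look roughly like the following. This is only a sketch under assumptions: the class name, method names, and result fields are invented here, plugins are modelled as plain callables that return a count of cleaned resources, and the "dashboard" is just an in-memory list.

```python
import logging
import time

log = logging.getLogger("cloudsweeper_sketch")


class TaskManager:
    """Hypothetical task manager: runs registered cleaner plugins and records results."""

    def __init__(self):
        self._plugins = {}
        self.results = []  # stand-in for the dashboard feed

    def register(self, name, plugin):
        self._plugins[name] = plugin

    def run_once(self):
        for name, plugin in self._plugins.items():
            try:
                outcome = {"plugin": name, "ok": True, "cleaned": plugin()}
            except Exception as exc:
                # One failing cleaner must not stop the others.
                log.exception("plugin %s failed", name)
                outcome = {"plugin": name, "ok": False, "error": str(exc)}
            self.results.append(outcome)

    def run_periodically(self, interval_seconds, cycles=None, sleep=time.sleep):
        """Run all plugins every interval; cycles=None means run until interrupted."""
        done = 0
        while cycles is None or done < cycles:
            self.run_once()
            done += 1
            sleep(interval_seconds)
```

Passing `sleep` as a parameter makes the loop easy to exercise without waiting, for example `manager.run_periodically(60, cycles=2, sleep=lambda s: None)`.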
CloudBuilder -- Blueprint
Everything Data-Driven
We define the initial setup for the cloud in a set of JSON files. CloudBuilder creates all the resources based on those JSON files:
• VPC metadata
• Flavor class
• VPC networks
• VPC host-aggregate
• VPC images
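To illustrate the data-driven idea, the sketch below compares a desired state loaded from JSON with the state that currently exists and produces a plan of what to create or update. The JSON layout, the flavor names, and the `plan_changes` helper are all assumptions made for this example; the actual Blueprint format is not reproduced in the transcript.

```python
import json

# Hypothetical Blueprint fragment; the real file layout is not shown in the slides.
BLUEPRINT_JSON = """
{
  "flavors": {
    "small":  {"vcpus": 2, "ram_mb": 4096},
    "medium": {"vcpus": 4, "ram_mb": 8192}
  }
}
"""


def plan_changes(desired, actual):
    """Compare desired resources with what exists and return the work to do."""
    create = sorted(name for name in desired if name not in actual)
    update = sorted(
        name for name in desired
        if name in actual and actual[name] != desired[name]
    )
    # Resources present in the cloud but absent from the Blueprint are only reported.
    unmanaged = sorted(name for name in actual if name not in desired)
    return {"create": create, "update": update, "unmanaged": unmanaged}


desired_flavors = json.loads(BLUEPRINT_JSON)["flavors"]
# Stand-in for what a query against the cloud would return.
actual_flavors = {
    "small": {"vcpus": 2, "ram_mb": 2048},
    "legacy": {"vcpus": 1, "ram_mb": 1024},
}
plan = plan_changes(desired_flavors, actual_flavors)
print(plan)
```

Running the same comparison repeatedly is what makes the approach declarative: the files say what should exist, and each pass only acts on the difference.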
CloudBuilder – Blueprint – VPC metadata
CloudBuilder – Blueprint – VPC Resources
CloudBuilder – Add Hypervisor to Host Aggregate
A new hypervisor can be automatically added to the right host aggregates based on its characteristics.
The hypervisor asset information can be retrieved from the CMS (Configuration Management System).
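One way to picture the matching step is a rule table that maps each host aggregate to the characteristics a hypervisor must have. The sketch below is an assumption-heavy illustration: the fact keys (`vpc`, `disk`), the aggregate names, and the exact-match rule are invented here, and the facts dictionary stands in for whatever the CMS actually returns.

```python
def matching_aggregates(host_facts, aggregate_rules):
    """Return the aggregates whose required characteristics the host satisfies."""
    return sorted(
        name
        for name, required in aggregate_rules.items()
        if all(host_facts.get(key) == value for key, value in required.items())
    )


# Hypothetical rules: aggregate name -> characteristics a member must have.
rules = {
    "vpc-prod-ssd": {"vpc": "prod", "disk": "ssd"},
    "vpc-prod-any": {"vpc": "prod"},
    "vpc-qa": {"vpc": "qa"},
}
# Hypothetical asset record for a newly racked hypervisor.
facts = {"hostname": "hv-0001", "vpc": "prod", "disk": "ssd", "cores": 48}
print(matching_aggregates(facts, rules))
```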
CloudSweeper
CloudSweeper – Neutron Port Cleaner
CloudSweeper – Volume Cleaner
Symptom
• Mismatched volume state in nova and cinder
• Volume stuck in deleting state
• Missing connection_info in the block_device_mapping table
Fix
• Find the real state of the volume from the hypervisor.
• Modify the nova and cinder DBs to reset the state.
• For a volume stuck in the deleting state, re-run “nova volume-delete” after cleaning the state in the DB.
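The fix procedure treats the hypervisor as the source of truth and resets nova and cinder to agree with it. A minimal sketch of that decision logic follows. It only produces a list of intended actions and does not touch any database; the function name and action strings are hypothetical, and it assumes the caller has already decided that a volume in `deleting` is genuinely stuck rather than still in progress.

```python
def plan_volume_fix(nova_attached, cinder_status, attached_on_hypervisor):
    """Return the corrective actions needed so nova and cinder match the hypervisor."""
    actions = []
    if nova_attached != attached_on_hypervisor:
        actions.append(f"set_nova_attached:{attached_on_hypervisor}")
    if cinder_status == "deleting":
        if attached_on_hypervisor:
            # Still in use on the host: do not retry the delete automatically.
            actions.append("manual_review")
        else:
            # Clean the stuck state first, then retry the delete.
            actions.append("set_cinder_status:available")
            actions.append("retry_delete")
    else:
        expected = "in-use" if attached_on_hypervisor else "available"
        if cinder_status != expected:
            actions.append(f"set_cinder_status:{expected}")
    return actions


print(plan_volume_fix(True, "in-use", False))
print(plan_volume_fix(False, "deleting", False))
```

Separating the plan from its execution makes it possible to log or review the intended changes before any state is edited.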
Questions?