‘fsck’ for Openstack

© 2015 PayPal Inc. All rights reserved. Confidential and proprietary. ‘fsck’ for Openstack Wei Tian -- Cloud Performance Lead at Paypal Zhenhua Feng -- Staff Software Engineer 10/ 27 / 2015 Detect Resource Leaking and Keep the Cloud Consistent

Transcript of ‘fsck’ for Openstack

Page 1: ‘fsck’ for Openstack


‘fsck’ for Openstack

Wei Tian -- Cloud Performance Lead at PayPal

Zhenhua Feng -- Staff Software Engineer

10/27/2015

Detect Resource Leaking and Keep the Cloud Consistent

Page 2: ‘fsck’ for Openstack


Agenda


• Some numbers about PayPal Cloud
• What makes our cloud inconsistent
• Our solutions to keep our cloud consistent

Page 3: ‘fsck’ for Openstack


About PayPal Cloud


• Background
– Started in July 2012 with 1 engineer and 16 decommissioned servers
– Today, one of the world’s largest OpenStack private clouds
– Number of VMs: 82,000
– Number of physical servers: 8,064
– Number of racks: 84
– Total cores: 386,000
– Block storage: 2 petabytes
– Largest AZ with 2,500+ hypervisors

• Business Goals
– Hosting ~100% of PayPal’s production traffic (except databases and messaging)
– Powers 100% of PaaS, Dev/QA and M&As
– First production workload on SDN in 2013

Page 4: ‘fsck’ for Openstack


What Makes the Cloud Inconsistent?


Misconfiguration
• VPC
• Flavor
• Image properties
• Host aggregate metadata
• Default security group
• Networks
• Volumes

Resource Leaking
• VM sprawl
• Inconsistent cinder volume states
• Orphaned ports
• Inconsistent DNS entries
• Inconsistent states between neutron and NSX
• Inconsistent states caused by RPC timeout
• Inconsistent DB states between API and compute cells

Page 5: ‘fsck’ for Openstack


Misconfiguration

• In the PayPal cloud, an administrator performs the initial resource allocation and configuration.

• The resources set up include VPCs, flavors, images, networks, host aggregates, etc.

• The administrator uses the OpenStack CLI to create all of these resources and must make sure they match one another.

• As long as we are human, we are bound to make mistakes.


Page 6: ‘fsck’ for Openstack


Scenarios that may have misconfiguration (1)


Page 7: ‘fsck’ for Openstack


Scenarios that may have misconfiguration (2)


Page 8: ‘fsck’ for Openstack


Resource Leaking


• When running OpenStack, the state of a resource (volume, instance, port, etc.) can sometimes become inconsistent across the cluster.

• Sometimes the state cannot be corrected through the REST API alone.

• You may need to manually edit the database or run a shell script on the hypervisor to correct the state.

• Note that it is important to find and fix the underlying issue; editing the database or running a shell script is just a quick hack.

• However, as a service operator, you also need to fix the issue right away to meet the SLA, before engineering fixes the code.

Page 9: ‘fsck’ for Openstack


Resource Leaking (1)


• VM sprawl can cause major performance and capacity problems. The leaked resources include zombie VMs and orphaned disk files.
• The state of a volume or an instance can be inconsistent: a volume shows as attached in nova but not in cinder, or vice versa. Sometimes a volume deletion hangs, or a detach does not work.
• Neutron orphaned ports: ports that are not deleted when their VM is deleted, or ports without a device_id. Leaked ports in turn leak IP addresses and DNS entries.
• Inconsistent state between neutron and the NSX controller: a port is deleted from neutron but still exists in NSX.
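The orphaned-port check can be sketched in a few lines. This is only an illustration under stated assumptions, not the tool from the talk: the `Port` record and the `find_orphaned_ports` function are hypothetical names, and the port list and the set of live VM IDs are assumed to have already been fetched from neutron and nova.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Port:
    """Hypothetical, simplified view of a neutron port (not the real API object)."""
    port_id: str
    device_id: str   # ID of the VM that owns the port; empty if unset
    ip_address: str


def find_orphaned_ports(ports: Iterable[Port], live_vm_ids: Set[str]) -> List[Port]:
    """Return ports that have no device_id or whose VM no longer exists."""
    return [p for p in ports if not p.device_id or p.device_id not in live_vm_ids]


ports = [
    Port("port-1", "vm-a", "10.0.0.5"),
    Port("port-2", "", "10.0.0.6"),          # never bound to a device
    Port("port-3", "vm-gone", "10.0.0.7"),   # VM deleted, port left behind
]
orphans = find_orphaned_ports(ports, live_vm_ids={"vm-a"})
print([p.port_id for p in orphans])
```

A real check would also need to skip ports that legitimately belong to something other than a VM, such as router and DHCP ports, before treating a port as leaked.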

Page 10: ‘fsck’ for Openstack


Resource Leaking (2)


• Inconsistent DNS entries: one IP with multiple DNS entries, multiple IPs with the same DNS entry, or a failure to create or delete a DNS entry.
• Inconsistent states caused by RPC timeout: the caller reports an RPC timeout, but the handler has actually done the job and only failed to reply.
• Inconsistent states between the API cell and compute cell DBs.
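The first two DNS inconsistencies are easy to detect once the records are in hand. The sketch below assumes the DNS entries have been exported as (hostname, IP) pairs; the function name `find_dns_conflicts` and the sample records are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_dns_conflicts(
    records: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Given (hostname, ip) pairs, return (IPs with several names, names with several IPs)."""
    names_by_ip = defaultdict(set)
    ips_by_name = defaultdict(set)
    for name, ip in records:
        names_by_ip[ip].add(name)
        ips_by_name[name].add(ip)
    multi_name_ips = {ip: sorted(n) for ip, n in names_by_ip.items() if len(n) > 1}
    multi_ip_names = {name: sorted(i) for name, i in ips_by_name.items() if len(i) > 1}
    return multi_name_ips, multi_ip_names


# Hypothetical sample data.
records = [
    ("web1.example.com", "10.0.0.5"),
    ("old-web1.example.com", "10.0.0.5"),   # stale name still pointing at the IP
    ("db1.example.com", "10.0.0.8"),
    ("db1.example.com", "10.0.0.9"),        # same name registered for two IPs
    ("ok.example.com", "10.0.0.10"),
]
multi_name_ips, multi_ip_names = find_dns_conflicts(records)
print(multi_name_ips)
print(multi_ip_names)
```

Whether a duplicate is really an error depends on the environment; this sketch only flags candidates for a cleaner or an operator to review.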

Page 11: ‘fsck’ for Openstack


Introduce CloudKeeper


CloudBuilder
• Resolves misconfiguration.
• Eliminates manual steps in setting up the OpenStack cloud.
• Automates the entire setup process to avoid human errors.
• Declarative instead of imperative: all settings are described in a set of config files called the Blueprint.
• Like Puppet, CloudBuilder continuously pushes changes from the Blueprint to the OpenStack cloud and keeps them in sync.

CloudSweeper
• Resolves resource leaking.
• Has a task manager that triggers all plugin tools periodically.
• Logs the results of each cleaning tool and reports them to a dashboard for statistics and troubleshooting.
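A task manager that runs cleaner plugins on a schedule and records their results could look roughly like the following. Everything here is an assumption made for illustration: the `TaskManager` class, the plugin signature (a callable returning the number of items cleaned), and the in-memory `results` list standing in for the dashboard feed.

```python
import logging
import time
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sweeper")

# Hypothetical plugin signature: each cleaner returns how many items it fixed.
Cleaner = Callable[[], int]


class TaskManager:
    def __init__(self) -> None:
        self._plugins: Dict[str, Cleaner] = {}
        self.results: List[dict] = []  # stand-in for the dashboard feed

    def register(self, name: str, cleaner: Cleaner) -> None:
        self._plugins[name] = cleaner

    def run_once(self) -> None:
        """Run every plugin once; a failing plugin must not stop the others."""
        for name, cleaner in self._plugins.items():
            try:
                record = {"plugin": name, "ok": True, "cleaned": cleaner()}
            except Exception as exc:
                record = {"plugin": name, "ok": False, "error": str(exc)}
            log.info("sweep result: %s", record)
            self.results.append(record)

    def run_periodically(self, interval_s: float, cycles: int) -> None:
        """Run all plugins `cycles` times, sleeping `interval_s` between runs."""
        for i in range(cycles):
            self.run_once()
            if i < cycles - 1:
                time.sleep(interval_s)


def port_cleaner() -> int:
    return 3  # pretend three orphaned ports were removed


def volume_cleaner() -> int:
    raise RuntimeError("cinder API unreachable")  # simulated failure


manager = TaskManager()
manager.register("port_cleaner", port_cleaner)
manager.register("volume_cleaner", volume_cleaner)
manager.run_periodically(interval_s=0.01, cycles=2)
```

Isolating each plugin in its own try/except keeps one broken cleaner from blocking the rest, and the failure itself becomes a record that can be reported for troubleshooting.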

Page 12: ‘fsck’ for Openstack


CloudBuilder -- Blueprint


Everything Data-Driven

We define the initial setup for the cloud in a set of JSON files. CloudBuilder creates all the resources based on these JSON files:
• VPC metadata
• Flavor class
• VPC networks
• VPC host-aggregate
• VPC images
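The data-driven idea can be illustrated with a small reconciliation step: load the desired state from JSON, compare it with what exists, and derive the actions needed. The Blueprint schema is not shown in this transcript, so the JSON layout, the `plan_changes` function, and the decision to delete undeclared resources are all assumptions made for this sketch.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical blueprint fragment; the real Blueprint format is not shown here.
BLUEPRINT_JSON = """
{
  "flavors": {
    "small":  {"vcpus": 1, "ram_mb": 2048},
    "medium": {"vcpus": 2, "ram_mb": 4096}
  }
}
"""


def plan_changes(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return (action, name) pairs needed to make `actual` match `desired`."""
    plan = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            plan.append(("create", name))
        elif actual[name] != spec:
            plan.append(("update", name))
    # Assumption: resources absent from the blueprint are scheduled for deletion.
    for name in sorted(set(actual) - set(desired)):
        plan.append(("delete", name))
    return plan


desired = json.loads(BLUEPRINT_JSON)["flavors"]
actual = {
    "small": {"vcpus": 1, "ram_mb": 1024},     # drifted from the blueprint
    "legacy": {"vcpus": 8, "ram_mb": 16384},   # not declared anywhere
}
plan = plan_changes(desired, actual)
print(plan)
```

Running the same comparison repeatedly is what keeps the cloud in sync with the declared state: when nothing has drifted, the plan is empty and nothing is changed.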

Page 13: ‘fsck’ for Openstack


CloudBuilder – Blueprint – VPC metadata

Page 14: ‘fsck’ for Openstack


CloudBuilder – Blueprint – VPC Resources

Page 15: ‘fsck’ for Openstack


CloudBuilder – Add Hypervisor to Host Aggregate

A new hypervisor can be automatically added to the right host aggregates based on its characteristics.

The hypervisor asset information can be retrieved from the CMS (Configuration Management System).
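One simple way to express "the right host aggregates based on its characteristics" is a rule table matched against the asset record. The rule format, the aggregate names, and the asset fields below are hypothetical; the slide does not say how the matching is actually defined or what the CMS returns.

```python
from typing import Dict, List

# Hypothetical rules: aggregate name -> attributes a hypervisor must have.
AGGREGATE_RULES: Dict[str, Dict[str, str]] = {
    "ssd-hosts": {"disk_type": "ssd"},
    "az1-general": {"zone": "az1"},
    "az1-ssd": {"zone": "az1", "disk_type": "ssd"},
}


def match_aggregates(
    asset: Dict[str, str],
    rules: Dict[str, Dict[str, str]] = AGGREGATE_RULES,
) -> List[str]:
    """Return the aggregates whose required attributes all match the asset record."""
    return sorted(
        name
        for name, required in rules.items()
        if all(asset.get(key) == value for key, value in required.items())
    )


# Hypothetical asset record, standing in for what a CMS lookup might return.
asset = {"hostname": "hv-0421", "zone": "az1", "disk_type": "ssd", "cpu_model": "haswell"}
print(match_aggregates(asset))
```

Keeping the rules as data fits the Blueprint approach: adding a new aggregate means adding a rule, not changing code.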

Page 16: ‘fsck’ for Openstack


CloudSweeper

Page 17: ‘fsck’ for Openstack


CloudSweeper – Neutron Port Cleaner

Page 18: ‘fsck’ for Openstack


CloudSweeper – Volume Cleaner

Symptom
• Mismatched volume state in nova and cinder
• Volume stuck in deleting state
• Missing connection_info in the block_device_mapping table

Fix
• Find the REAL state of the volume from the hypervisor.
• Modify the nova and cinder DBs to reset the state.
• For a volume stuck in the deleting state, re-run “nova volume-delete” after cleaning the state in the DB.
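The decision logic behind the fix can be sketched as a pure function that treats the hypervisor as the source of truth. This is an illustrative assumption, not the cleaner itself: `decide_volume_fix` and its action strings are hypothetical, and the sketch only chooses an action rather than touching any database.

```python
from typing import Optional


def decide_volume_fix(
    nova_attached: bool,
    cinder_status: str,
    attached_on_hypervisor: bool,
) -> Optional[str]:
    """Pick a corrective action, trusting the hypervisor over nova and cinder."""
    cinder_attached = cinder_status == "in-use"
    if cinder_status == "deleting" and not attached_on_hypervisor:
        # Stuck deletion: clean the state first, then issue the delete again.
        return "reset state, then retry delete"
    if attached_on_hypervisor and not (nova_attached and cinder_attached):
        return "mark attached in nova and cinder"
    if not attached_on_hypervisor and (nova_attached or cinder_attached):
        return "mark detached in nova and cinder"
    return None  # all three views agree; nothing to do


# nova says attached, cinder says available, hypervisor confirms the attachment
print(decide_volume_fix(True, "available", True))
# nova and cinder say attached, but the hypervisor has no such device
print(decide_volume_fix(True, "in-use", False))
# volume stuck in deleting and not attached anywhere
print(decide_volume_fix(False, "deleting", False))
```

Separating the decision from the database change makes it possible to log or dry-run the proposed action before anything is modified.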

Page 19: ‘fsck’ for Openstack


Questions?