NTTs Journey with Openstack-final

© 2015 NTT Software Innovation Center. NTT’s Journey with OpenStack. Shintaro Mizuno and Takashi Natsume, NTT Software Innovation Center. OpenStack Summit Tokyo 2015.

Transcript of NTTs Journey with Openstack-final

Page 1: NTTs Journey with Openstack-final

© 2015 NTT Software Innovation Center

NTT’s Journey with OpenStack

Shintaro Mizuno

Takashi Natsume

NTT Software Innovation Center

OpenStack Summit Tokyo 2015

Page 2: NTTs Journey with Openstack-final

2 Copyright©2015 NTT corp. All Rights Reserved.

Outline

1. Introduction

2. How we did it before

3. How we do it now

4. Giving back to the community

5. Next steps

Page 3: NTTs Journey with Openstack-final

Introducing NTT Group

Other Businesses

R&D

Page 4: NTTs Journey with Openstack-final

OpenStack in production

Other Businesses

R&D

R&D Cloud since 2013

Multiple customer environments

E-mail servers

since 2014

Public cloud service

since 2013

Web service at NTT Resonant

since 2014

R&D Dev environment

Field trial with a customer

since 2014

Page 5: NTTs Journey with Openstack-final

Community contribution

• Total commits: 1107 (ranked 18th of 263)

• Total LOC: 127,575 (ranked 25th of 267)

• Reviews: 5937 (ranked 16th of 212)

• Draft Blueprints: 103 (ranked 16th of 212)

• Completed Blueprints: 35 (ranked 18th of 138)

• Filed Bugs: 797 (ranked 14th of 237)

• Resolved Bugs: 439 (ranked 14th of 204)

• Total 67 contributors from all NTT Group

Source: www.stackalytics.com as of 10th Sep 2015

Page 6: NTTs Journey with Openstack-final

Behind the scenes of

R&D cloud and Public cloud development

Page 7: NTTs Journey with Openstack-final

Timeline

Years: 2011, 2012, 2013, 2014, 2015

Releases: Cactus, Diablo, Essex, Folsom, Grizzly, Havana, Icehouse, Juno, Kilo, Liberty

Milestones: 1st production development; Current development; Joined the Community

Page 8: NTTs Journey with Openstack-final

How we did it in the 1st development

In 2012 (Folsom era), when people were still skeptical about the OpenStack hype, we focused on QA tests, including:

- Full-API function test (incl. parameter boundary tests)

- Non-API function test

- Full state transition test

- External-system failure test

- API race conditions/multiple requests

- Long-term stability test (scenario test)
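
The parameter boundary tests listed above could be sketched roughly as follows. This is only an illustration, not the actual test suite: the `validate_volume_size` function and the 1 to 1024 GB range are hypothetical stand-ins for an API parameter check.

```python
def boundary_values(minimum, maximum):
    """Return the classic boundary-value cases for an integer range."""
    return [minimum - 1, minimum, minimum + 1,
            maximum - 1, maximum, maximum + 1]


def validate_volume_size(size_gb, minimum=1, maximum=1024):
    """Hypothetical validator: accept only integers within [minimum, maximum]."""
    if isinstance(size_gb, bool) or not isinstance(size_gb, int):
        return False
    return minimum <= size_gb <= maximum


def run_boundary_test(validator, minimum, maximum):
    """Map each boundary value to (expected, actual) acceptance."""
    results = {}
    for value in boundary_values(minimum, maximum):
        expected = minimum <= value <= maximum
        results[value] = (expected, validator(value))
    return results


results = run_boundary_test(validate_volume_size, 1, 1024)
# Any entry where expected != actual is a validation bug.
mismatches = {v: r for v, r in results.items() if r[0] != r[1]}
print(mismatches)
```

An empty `mismatches` dictionary means the validator agrees with the expected result at every boundary.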

Page 9: NTTs Journey with Openstack-final

Network QA tests to understand the limits

- Function

- Max MAC address learning, learning speed, MTU, fragment

- Capacity

- Max routers per tenant/region, static routes per router, ports per network, networks per region, DHCP servers per network node…

- Performance

- Throughput for: VM to VM, VM to external network via router

- Multiple tenants, multiple networks, multiple routers, short packet, long packet, noisy neighbor, DoS simulation…

- API request processing speed, time to apply changes

- Availability

- Network node switchover time, packet loss, high network load, with numbers of routers, with floating IP

Page 10: NTTs Journey with Openstack-final

Quality level found

Major issues/weaknesses found in Folsom

- API race conditions, especially in Quantum

  - Lacking an appropriate locking mechanism

  - E.g. create port + create port = error

- Internal error handling

  - Lacking exception handling in many cases

  - Resources fell into “ERROR” state so easily

  - Need to clean up orphan resources, e.g. vifs, ports, instances, etc.

- State transition

  - No workflow management

  - No rollback mechanism (e.g. migration, resize)

- API parameter validation

- HA feature (switchover time)
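
The "create port + create port = error" issue is a check-then-update race. A minimal sketch of the kind of locking that prevents it is shown below. The `PortAllocator` class is a toy in-memory stand-in invented for illustration, not Quantum code; a real multi-process service would need the equivalent protection at the database or distributed-lock level.

```python
import threading


class PortAllocator:
    """Toy in-memory stand-in for a port/IP allocator (hypothetical)."""

    def __init__(self, addresses):
        self._free = list(addresses)
        self._allocated = {}
        self._lock = threading.Lock()

    def create_port(self, port_id):
        # Picking a free address and recording it must be one atomic step;
        # without the lock, two concurrent callers could interleave here.
        with self._lock:
            if not self._free:
                raise RuntimeError("no free addresses")
            address = self._free.pop(0)
            self._allocated[port_id] = address
            return address


def create_ports_concurrently(allocator, count):
    """Issue `count` create_port calls from separate threads."""
    results = []
    results_lock = threading.Lock()

    def worker(index):
        address = allocator.create_port(f"port-{index}")
        with results_lock:
            results.append(address)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


allocator = PortAllocator(f"10.0.0.{i}" for i in range(2, 52))
addresses = create_ports_concurrently(allocator, 50)
# Every request succeeded and no address was handed out twice.
print(len(addresses), len(set(addresses)))
```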

Page 11: NTTs Journey with Openstack-final

Our answer in 2012

“Folsom has good features!”

“…but it’s too fragile for public clouds”

Page 12: NTTs Journey with Openstack-final

Our first "Folsom-based" system

GUI/CLI/API

Resource Mgmt

Transaction Mgmt

Host Mgmt

User Mgmt

DB

Nova Cinder Glance Quantum (Neutron) Keystone

End user/operator

We built a proprietary system to be “gentle” to OpenStack

Driver

Folsom

Workflow engine

patch patch patch patch patch

Page 13: NTTs Journey with Openstack-final

What we added

- Proprietary GUI for end-users

  - Provide “business view” of resources and don’t let users touch OpenStack resources/features directly

- Proprietary operation GUI

  - Host management, monitoring, resource/user management

- Transaction Management

  - API workflow management using Request-id tracking/notification

  - Add “purge” feature for rollback/roll forward/clean-up after API failure

- Workflow engine

  - Execute certain scenarios composed of multiple API calls (like what Heat does)

- API parameter validation check

  - Strict parameter validation before handing over to OpenStack API

- Cinder Driver for EMC VNX

  - There wasn’t one from EMC!
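
The idea of a workflow engine that runs a scenario of several API calls and cleans up after a failure can be sketched as below. This is an assumed, simplified design for illustration only: the `Step` class, `run_workflow` function and step names are hypothetical, and the lambdas stand in for real API calls.

```python
class Step:
    """One API call plus the compensating call that undoes it."""

    def __init__(self, name, execute, revert):
        self.name = name
        self.execute = execute
        self.revert = revert


def run_workflow(steps, log):
    """Run steps in order; on failure, revert completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.execute()
        except Exception as exc:
            log.append(f"failed:{step.name}:{exc}")
            for done in reversed(completed):
                done.revert()
                log.append(f"reverted:{done.name}")
            return False
        log.append(f"done:{step.name}")
        completed.append(step)
    return True


def fail_attach():
    raise RuntimeError("attach failed")


log = []
steps = [
    Step("create_volume", lambda: None, lambda: None),
    Step("boot_server", lambda: None, lambda: None),
    Step("attach_volume", fail_attach, lambda: None),
]
ok = run_workflow(steps, log)
print(ok)
print(log)
```

Reverting in reverse order matters because later resources usually depend on earlier ones, so they have to be removed first.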

Page 14: NTTs Journey with Openstack-final

Convincing business people

Question to answer:

“Why should we use OpenStack when we already have vCenter and CloudStack?”

Page 15: NTTs Journey with Openstack-final

What we discussed

- Cost comparison

- Compute feature comparison with vCenter

- Network feature comparison

- Future growth expectations

Page 16: NTTs Journey with Openstack-final

How we dealt with 150 OpenStack bugs

• Patches

• Live migration bug (Nova, about 13%)

• Input check improvement (about 9%)

• Log output improvement (about 7%)

• Unnecessary ‘things’ remaining (about 6%)

• Add timeout parameter (about 4%)

• API response improvement (about 4%)

  • Change HTTP status code

• Volume boot bug (Nova, about 3%)

• Security (about 3%)

• Race condition (about 3%)

We upstreamed our patches together with Canonical because there were so many of them!

Page 17: NTTs Journey with Openstack-final

How we dealt with 150 OpenStack bugs (contd.)

• Merged (18 patches)

  • Tests (about 27%)

  • Race condition bugs (about 17%)

  • Unnecessary ‘things’ remaining (about 11%)

  • Add timeout parameter (about 11%)

• Rejected

  • Multiplicity control function

  • Input parameter check (to be done in the next major API version)

• Already merged by other companies (about 60 patches)

  • Input parameter check

  • Delete namespaces when they are no longer needed

  • Multiple regions support for Quantum in nova-compute

• No need to upstream (about 50 patches)

  • The bug cannot be reproduced, etc.

Page 18: NTTs Journey with Openstack-final

Upstreaming proprietary functions

• Transaction Management and Workflow engine

  • Log-request-id-mapping

    • Enables us to analyze API calls between components by mapping each request ID

    • Our proprietary function used a common request ID, which enabled us to analyze API calls between components by tracking a single request ID.

    • The spec has been approved in openstack-specs. We will implement it.

  • TaskFlow

    • Needed for our retry, rollback and API trace (checking the progress of an API process) functions

    • Work in progress

  • A lot of things to do...

    • Force delete for ‘rollback’

    • Optimization of error handling

• EMC driver

  • Use the driver provided to the community by EMC Corporation (we do not upstream ours)
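
How request-ID mappings help trace one API call across components can be sketched as follows. The log line format and the request IDs here are invented for illustration; real OpenStack log formats differ, and this is not the implementation described in the approved spec.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<caller request ID> -> <called service> <its request ID>"
MAPPING_RE = re.compile(
    r"request-id mapping: (?P<caller>req-[\w-]+) -> "
    r"(?P<service>\w+) (?P<callee>req-[\w-]+)"
)


def build_mapping(lines):
    """Collect caller -> [(service, callee request ID)] from log lines."""
    children = defaultdict(list)
    for line in lines:
        match = MAPPING_RE.search(line)
        if match:
            children[match["caller"]].append((match["service"], match["callee"]))
    return children


def trace(children, root):
    """Return every (service, request ID) reachable from the root request."""
    found = []
    seen = {root}
    stack = [root]
    while stack:
        current = stack.pop()
        for service, callee in children.get(current, []):
            if callee not in seen:
                seen.add(callee)
                found.append((service, callee))
                stack.append(callee)
    return found


LOG_LINES = [
    "nova-api INFO request-id mapping: req-aaa -> cinder req-bbb",
    "nova-api INFO request-id mapping: req-aaa -> neutron req-ccc",
    "cinder-api INFO request-id mapping: req-bbb -> glance req-ddd",
    "nova-api INFO unrelated message",
]

calls = trace(build_mapping(LOG_LINES), "req-aaa")
print(calls)
```

Starting from the request ID returned to the user, the operator can collect the IDs that each downstream component logged and then search those components' logs directly.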

Page 19: NTTs Journey with Openstack-final

What we learned from the first release

• “Upstream-first” is very important

• Development and bug-fix work is wasted when the same problems have already been fixed by other companies in the community code.

• Our proprietary functions/tools have to be modified when the functions they depend on cannot be merged.

• Upstreaming our proprietary functions takes a long time because it requires coordination and persuasion in the community.

Page 20: NTTs Journey with Openstack-final

Timeline

Years: 2011, 2012, 2013, 2014, 2015

Releases: Cactus, Diablo, Essex, Folsom, Grizzly, Havana, Icehouse, Juno, Kilo, Liberty

Milestones: 1st production development; Current development; Joined the Community

Page 21: NTTs Journey with Openstack-final

How we do it now…

We had to change our mindset: “Don’t be greedy. Find a way to live with the community code.”

Page 22: NTTs Journey with Openstack-final

How we do it now

Features:

1. Try to be satisfied with what you have, or work out a solution with what you can get

2. Try to write a spec/RFE to realize your ideas (it’ll take quite some time, though)

3. (If upstream doesn’t work) and (if you really really need it) and (if you can afford it), then think of building it “outside”

Page 23: NTTs Journey with Openstack-final

How we do it now

Bugs:

1. Report the bug and wait

2. If you need it quickly, pick up the bug and fix it

3. If the community won’t fix it, or if the community says “it’s a spec”, try to live with it by “writing documents”:

   1. Workarounds and recovery manuals for operators

   2. FAQs for users

4. If the bug may cause a critical system failure, consider closing the relevant APIs until it gets fixed.

5. If none of the above works, create an in-house patch, but “keep it minimal” and maintain it.

Page 24: NTTs Journey with Openstack-final

What we did and didn't do

In response to requirements from the service/operation side:

We dropped everything that needed to change OpenStack specs:

- Features that will change current API behavior/specs

- “Do it like CloudStack/vCenter does” requests

  - Created workarounds or leveraged equivalent OpenStack features

We did what was mandatory for operation without changing OpenStack:

- Added an API filter to hide immature APIs (Apache proxy)

- Added a notification/API-log collection tool (external tool)

- Built a cascaded domain/tenant/user model using existing Keystone APIs (manual)

- Developed high availability for virtual machines (open sourced)
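
The logic of an API filter that hides immature APIs can be sketched as an allowlist of HTTP method and path patterns per role. The slides say this was done with an Apache proxy; the Python below only illustrates the matching logic, and the roles, rules and paths are hypothetical examples, not the rules actually deployed.

```python
import re

# Hypothetical allowlist: (HTTP method, path regex) per role.
FILTER_RULES = {
    "end_user": [
        ("GET", r"^/v2/[^/]+/servers(/[^/]+)?$"),
        ("POST", r"^/v2/[^/]+/servers$"),
        ("DELETE", r"^/v2/[^/]+/servers/[^/]+$"),
    ],
    "operator": [
        ("GET", r"^/v2/.*$"),
        ("POST", r"^/v2/.*$"),
        ("DELETE", r"^/v2/.*$"),
    ],
}


def is_allowed(role, method, path):
    """Return True only if some rule for the role matches method and path."""
    for allowed_method, pattern in FILTER_RULES.get(role, []):
        if method == allowed_method and re.match(pattern, path):
            return True
    return False


print(is_allowed("end_user", "GET", "/v2/tenant1/servers"))
print(is_allowed("end_user", "POST", "/v2/tenant1/servers/abc/action"))
print(is_allowed("operator", "POST", "/v2/tenant1/servers/abc/action"))
```

Using an allowlist rather than a blocklist means that any API not explicitly listed stays hidden, including APIs added by a later upgrade.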

Page 25: NTTs Journey with Openstack-final

Our current system overview

[Figure: current system overview]

Components: Nova, Cinder, Glance, Neutron, Keystone (pure Juno/Kilo); Reverse proxy (Apache) with filter rules for end users and filter rules for operators; Virtual Machine High Availability (Masakari); Notification/API log collection; Operation tools; Monitoring agents on compute nodes; End users/operators

Flows: OpenStack API; OpenStack API (subset); Notification; API log; Event from agents; VM recovery

Page 26: NTTs Journey with Openstack-final

Our current OpenStack configuration(figure)

Controller Node(2)

pacemaker(1Act-1Sby) •VIP(neutron-sv, haproxy) •neutron-server •nova-consoleauth

keystone-all nova-api nova-conductor nova-novncproxy nova-scheduler cinder-api cinder-scheduler Apache(keystone) haproxy

Network Node(4)

OVS

Compute Node(4)

nova-compute OVS

Backend Node(3)

mysql-pxc(3Act) RabbitMQ(2Act)

pacemaker(nAct-1Sby) • neutron-linuxbridge-agent • neutron-dhcp-agent • neutron-l3-agent

pacemaker(nAct)

Storage Node(2)

glance-api glance-registry

pacemaker(nAct-1Sby) •cinder-volume(NFS, iSCSI)

pacemaker(3Act) •VIP(MQ, PXC)

Active-Active

Legend:

DMZ Load Balancer(2)

haproxy

pacemaker(1Act-1Sby) •VIP(api & novncproxy endpoint)

Page 27: NTTs Journey with Openstack-final

Our current OpenStack configuration

• stable/kilo(2015.1.0) and Ubuntu 14.04 LTS

• Host aggregates for VM scheduling

• OS type(3 types) and memory capacity of nova flavors (2 types)

• Full HA architecture

• HA on each node

• Multiple data center architecture

• Support HA configuration between multiple data centers
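
Scheduling VMs with host aggregates can be pictured as matching flavor properties against aggregate metadata. The sketch below is a simplified illustration of that idea, not Nova's scheduler code; the aggregate names, metadata keys (`os_type`, `memory_class`), values and host names are all hypothetical.

```python
# Hypothetical aggregates grouping compute hosts by OS type and memory class.
AGGREGATES = [
    {
        "name": "linux-standard",
        "metadata": {"os_type": "linux", "memory_class": "standard"},
        "hosts": ["compute1", "compute2"],
    },
    {
        "name": "linux-large",
        "metadata": {"os_type": "linux", "memory_class": "large"},
        "hosts": ["compute3"],
    },
    {
        "name": "windows-standard",
        "metadata": {"os_type": "windows", "memory_class": "standard"},
        "hosts": ["compute4"],
    },
]


def hosts_for_flavor(flavor_specs, aggregates):
    """Return hosts in aggregates whose metadata matches every flavor spec."""
    matching = set()
    for aggregate in aggregates:
        metadata = aggregate["metadata"]
        if all(metadata.get(key) == value for key, value in flavor_specs.items()):
            matching.update(aggregate["hosts"])
    return sorted(matching)


large_linux = {"os_type": "linux", "memory_class": "large"}
print(hosts_for_flavor(large_linux, AGGREGATES))
print(hosts_for_flavor({"os_type": "linux"}, AGGREGATES))
```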

Page 28: NTTs Journey with Openstack-final

Contributing to the community

• Cinder

  • Restrict users from uploading volume to image based on Glance protected properties

• Glance

  • Restrict users from downloading image based on policy

  • Add multifilesystem store to support NFS servers as backend

  • Reload configuration files on SIGHUP signal

• Neutron

  • Add enable_new_agents to neutron server

  • Agent terminates services when turning admin_state_up False

Page 29: NTTs Journey with Openstack-final

Where OpenStack fits and still doesn't fit

Best fit in

- Private cloud hosting web services

  - Lower entrance barrier for the cattle model

Still hard but is running in production

- Public cloud for enterprise

  - Customers’ cattle are our precious pets

Maybe OpenStack is not the one (at least for some time)

- Core network function virtualization

- Virtualization of legacy silo applications

Page 30: NTTs Journey with Openstack-final

Next steps

• Practical use of applications in the upper layers (PaaS, etc.)

• Practical use of OpenStack in NFV

• We are now trying to upstream the following functions

• Nova

• Improve unshelve performance

• Neutron

• AZ support

• Congress

• Congress for OPNFV doctor use case

• Cross project

• Log request-id mappings

• Other

• VM/HA(Masakari)

Page 31: NTTs Journey with Openstack-final

Sessions from/about NTT Group

• From NTT Group

  • Korejanai Story: How To Integrate OpenStack Into Your Business Strategy (October 29, 3:30pm - 4:10pm)

  • Gohan: An Open-source Service Development Engine for SDN/NFV Orchestration (October 29, 4:30pm - 5:10pm)

• About NTT Group

  • Telco OpenStack Roadmap Panel (October 29, 1:50pm - 2:30pm)

Page 32: NTTs Journey with Openstack-final

Questions?

Masakari wo nageru (Japanese, literally “throw a hatchet”)