L3 Agent HA - object-storage-ca-ymq-1.vexxhost.net · Neutron L3 Agent HA Or: How I Learned to Stop...

Post on 30-May-2020

Neutron L3 Agent HA
Or: How I Learned to Stop Worrying and Love the API

Kevin Bringard // OpenStack Juno Summit // May 2014

• There is no “one right way”
• The goal is to move L3 resources to a new L2 resource as quickly and seamlessly as possible
• This is a really difficult, but important, problem to solve

Layer 3
Internet Happens

[Diagram: three L3 agents hosting router1 through router6 for VM1 through VM7, below a core router]

[Diagram: the same topology with two L3 agents]

Layer 2
The ARPing is the hardest part

• One L3 resource may only be tied to one L2 resource at a time
• Many technologies exist to sort of work around this: HSRP, VRRP, CARP
• Work is being done to implement VRRP-like functionality in Juno: https://blueprints.launchpad.net/neutron/+spec/l3-high-availability
• Nothing is currently integrated into OpenStack

Pacemaker
http://docs.openstack.org/high-availability-guide/content/_highly_available_neutron_l3_agent.html

• False positives caused more downtime than actual outages
• Split brain possibilities
• Assumes control of L3 agent start/stop functions
• Limited Horizontal Scale
• More difficult to run multiple Active L3 agents
• Failover requires starting and stopping entire services
• Active/Passive Model Requires More Hardware
• Works on a “per agent” level
• Akin to RAID1

[Diagram: three L3 agents hosting router1 through router6 for VM1 through VM7, below a core router]

[Diagram: the same topology with two L3 agents]

[Diagram: the same topology with three L3 agents again]

Neutron HA Tool
https://raw.githubusercontent.com/stackforge/cookbook-openstack-network/master/files/default/neutron-ha-tool.py

• API Driven
• Uses native API calls to perform all functions
• Can be run externally from infrastructure or cross site
• Supports any operation the neutron client libraries support
• Easily Extendable
• Written in Python
• Leverages standard OpenStack libraries
• Works on a “per resource” level
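As a rough illustration of what "API driven" and "per resource" can mean in practice, the sketch below plans and applies router moves off dead L3 agents. It is not the code of neutron-ha-tool.py. The helper names `plan_router_moves` and `migrate` and the round-robin placement are assumptions made for this example. The `client` argument is expected to behave like `neutronclient.v2_0.client.Client`, which provides `list_agents`, `list_routers_on_l3_agent`, `remove_router_from_l3_agent` and `add_router_to_l3_agent`.

```python
import itertools


def plan_router_moves(client):
    """Return (router_id, dead_agent_id, target_agent_id) tuples.

    Hypothetical helper: routers on dead L3 agents are spread
    round-robin over the live ones. The placement policy is an
    assumption, not taken from the talk.
    """
    agents = [a for a in client.list_agents()["agents"]
              if a["agent_type"] == "L3 agent"]
    alive = [a for a in agents if a["alive"] and a["admin_state_up"]]
    dead = [a for a in agents if not a["alive"]]
    if not alive:
        return []  # nowhere to move anything

    targets = itertools.cycle(alive)
    moves = []
    for agent in dead:
        routers = client.list_routers_on_l3_agent(agent["id"])["routers"]
        for router in routers:
            moves.append((router["id"], agent["id"], next(targets)["id"]))
    return moves


def migrate(client, moves):
    """Apply the plan one router at a time (the "per resource" level)."""
    for router_id, source_id, target_id in moves:
        client.remove_router_from_l3_agent(source_id, router_id)
        client.add_router_to_l3_agent(target_id, {"router_id": router_id})
```

Because everything goes through the API, a script like this can run from any host that can reach the Neutron endpoint, which is the point the slide makes about running externally or cross site.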

[Diagram: three L3 agents hosting router1 through router6 for VM1 through VM7, below a core router]

[Diagram: the same topology with two L3 agents]

[Diagram: two L3 agents with router1 through router6 regrouped across them]

• Only routers/IPs on the affected L3 agent are impacted

• Recovery time depends on the number of routers which need to be migrated and the number of IPs on each router

• Migration happens quickly, but every IP on the routers must re-ARP to the upstream switch

• Meta-data proxies migrate with the routers
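To make the re-ARP step concrete, here is a minimal sketch of the kind of frame involved: a gratuitous ARP request that announces an IP address from a new MAC address so the upstream switch can update its tables. It only builds the bytes and does not send anything. The function name `gratuitous_arp` and the example addresses are made up for illustration, and this is not presented as how the L3 agent or the HA tool emits its ARPs.

```python
import socket
import struct


def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    broadcast = b"\xff" * 6

    # Ethernet header: broadcast destination, our source, EtherType ARP.
    ethernet = broadcast + hw + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, request (1).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr            # sender MAC and sender IP
    arp += b"\x00" * 6 + addr   # target MAC unset, target IP equals sender IP
    return ethernet + arp


frame = gratuitous_arp("fa:16:3e:00:00:01", "203.0.113.10")
print(len(frame), frame.hex())
```

One such announcement is needed per IP on each migrated router, which is why recovery time grows with the number of IPs.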

OK, so what’s the catch?

• Not seamless
• The ARP processes happen in parallel, but generally take 60-90 seconds for all IPs to complete
• Various *aaS offerings further complicate things
• Currently only accounts for “l3-agent” controlled services
• No coordination between HA tools
• How do you HA the HA?
• Currently not daemonized, runs from cron
• Adds 60 seconds to total recovery time
• Jitter protection adds additional total recovery time
• No mechanism by which to ensure resources actually come up/work
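The timing figures above can be combined into a back-of-the-envelope recovery window. The sketch below is only arithmetic on the numbers from the slide. It assumes the cron wait ranges from 0 to 60 seconds depending on when the failure lands in the cycle, takes the jitter delay as a parameter because the talk gives no value for it, and ignores the time Neutron needs to mark an agent as dead. The function name `recovery_window` is hypothetical.

```python
def recovery_window(jitter=0, cron_interval=60, arp_min=60, arp_max=90):
    """Return (best, worst) recovery time in seconds.

    best:  the cron job fires right after the failure, ARP settles quickly
    worst: the failure happens just after a cron run, ARP settles slowly
    """
    best = jitter + arp_min
    worst = cron_interval + jitter + arp_max
    return best, worst


# With a hypothetical 30 second jitter delay:
print(recovery_window(jitter=30))
```

Even with no jitter delay the worst case under these assumptions is 150 seconds, which is the sense in which the approach is "not seamless".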

What about DHCP?

• Multiple DHCP agents may be run Active/Active
• DHCP agents per subnet may be specified in your agent config file
• Each agent requires an IP in the tenant’s subnet
• DHCP requests are broadcast
• All agents have the same lease file
• The first one to reply binds to the VM

• Any DHCP agent may reply to a DNS request and resolve all known leases

• By default, each DHCP agent hands out a list of every agent as available resolvers

• HA tool has an option to replicate DHCP to all agents
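A possible shape for the "replicate DHCP to all agents" option is sketched below. It is not the tool's actual implementation. The helper name `replicate_dhcp` is an assumption, and `client` is expected to behave like `neutronclient.v2_0.client.Client`, which provides `list_agents`, `list_networks`, `list_networks_on_dhcp_agent` and `add_network_to_dhcp_agent`.

```python
def replicate_dhcp(client):
    """Schedule every network onto every live DHCP agent.

    Returns the (agent_id, network_id) pairs that were added, so the
    caller can log what changed.
    """
    agents = [a for a in client.list_agents()["agents"]
              if a["agent_type"] == "DHCP agent" and a["alive"]]
    networks = client.list_networks()["networks"]

    added = []
    for agent in agents:
        hosted = {n["id"] for n in
                  client.list_networks_on_dhcp_agent(agent["id"])["networks"]}
        for net in networks:
            if net["id"] not in hosted:
                client.add_network_to_dhcp_agent(
                    agent["id"], {"network_id": net["id"]})
                added.append((agent["id"], net["id"]))
    return added
```

Running it twice is harmless: the second pass finds every network already hosted and adds nothing.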

Moving Forward

• VRRP-like functionality
• Specify number of Active L3 agents per subnet
• Leverage conntrackd/keepalived
• Point of diminishing returns for HA tool?
• The beauty of open source:
• There is no “one right way”
• Think outside the box
• Do cool things

Questions?