A PACKET DESIGN E-BOOK

BGP IN THE DATA CENTER

Contents

BGP the Savior

Traditional Data Center Architecture

• Traffic Flows

• Scalability

• Spanning Tree Protocol (STP)

CLOS Architecture to the Rescue

• L2 or L3 in CLOS?

• BGP in CLOS

• CLOS with eBGP Design

• ECMP

• Convergence

• Number of ASes to be used

• Evolving with BGP

In this E-book, we’ll cover why traditional data center architectures don’t meet the needs of today’s service providers and how to overcome their limitations. This includes using a decades-old architecture that has been given new life, Layer 3 routing, and Border Gateway Protocol (BGP).

BGP the Savior

BGP is a well-known protocol that service providers and enterprises have used for decades to manage routing throughout the internet. Considering the rapid growth of the full Internet routing table (roughly 10 percent year-over-year since 2009, now about 664K routes in total), BGP is the protocol best suited to handle routes at this scale.

That being said, service providers and content providers require more functionality to control inbound and outbound traffic flows, particularly at the edge of their networks. BGP is called a path vector routing protocol, which is a fancy way of saying it is a distance vector routing protocol with several additional attributes. These additional attributes allow users to gain more flexibility in order to manipulate routing decisions, which means they also have higher flexibility in controlling network traffic.

Over time, BGP started to be used in different network segments for different purposes. Do you remember the address families such as vpnv4 used for MPLS L3VPN and labelled unicast for seamless MPLS and 6PE? It seems whenever we are up against the wall with new protocols or architectural needs, we call BGP to the rescue. This is what happens in today’s data centers as well, as there are many challenges with traditional architectures.

Traditional Data Center Architecture

In this architecture (Figure 1), the topology is composed of three layers connected to each other via L2 links, so traffic flow is controlled mostly by L2 protocols. Here are the drawbacks of this architecture and why it no longer fits current data center requirements.

Traffic Flows:

The classic data center architecture was developed on the assumption that most of the traffic flows in a north-south (user-server) direction. The obvious inference from this is that north-south traffic is supposed to be greater than east-west (server-server) traffic.

This architecture is still valid for some service provider data centers, but with new Content Delivery Networks (CDNs), almost 80 percent of total traffic is in an east-west direction. Content itself is becoming more critical and valuable day by day. Even service providers are providing more cloud services and are acquiring and serving more and more content.

For that reason, service provider data center requirements will likely evolve to the same ones as in CDNs. Server to server (e.g., App-Database, App-Web, VM migration, Data Replication) communication has been increasing significantly.

Figure 1: Traditional Data Center Architecture

When server A wants to reach server D, inter-VLAN traffic takes the path up to one of the core switches and comes back down to server D, passing over all layers, whereas intra-VLAN traffic can be handled by the distribution layer. This means the number of hops and the latency vary based on the type of communication. In data center networks, the consistency of these two related parameters has become more critical than before, and the classic tree-based architecture does not provide it.

Scalability:

When the data center grows, this architecture may not be able to scale due to port/card/device/bandwidth limitations. Adding new devices to the distribution layer will result in adding new devices to the core at some point, because the core layer has to be adjusted based on the lower layers’ increased bandwidth requirements. This means the data center has to scale vertically, since it was developed based on north-south traffic considerations.

Spanning Tree Protocol (STP):

STP is designed to prevent loops being created when there are redundant paths in the network. You are probably familiar with related terms, like Portfast, BPDU Guard, Root Guard, Loop Guard, UDLD, TCN, etc. STP is not perfect and, despite continuing enhancements to it, there should be an easier and better way to overcome the challenges it is intended to address.

Fortunately, vendors recognized STP’s limitations and came up with alternatives, such as vPC, QFabric, FabricPath, MLAG, and TRILL. By using a combination of these protocols instead of STP, users can employ most of their L2 links and create a loop-free structure. For example, it is possible to bond switches/links and let them act as one switch/link. “L2 routing” can be enabled on the network as well.

However, even when using these technologies, scalability can still be an issue. Bonding more than two switches is generally not possible, or is supported only with various limitations. Vendor lock-in is another disadvantage, as most of these protocols are proprietary.

CLOS Architecture to the Rescue

The whole story begins with the “new” architecture called CLOS. New in this context means new in data centers. Telephony network engineer Charles Clos developed this architecture in the 1950s to meet similar scalability requirements. It’s being used today to improve performance and resiliency. Let us take a closer look at CLOS Architecture in the data center.

A CLOS topology (Figure 2) is composed of spine and leaf layers. Servers are connected to leaf switches (Top of Rack - TOR) and each leaf is connected to every spine. There are no direct leaf-to-leaf or spine-to-spine connections. Here are the architectural advantages of this topology:

• Regardless of whether they are in the same VLAN, any two servers are three hops away from each other. That’s why this is called a 3-stage CLOS topology. It can be expanded to 5-stage CLOS by dividing the topology into clusters and adding another top-spine layer (also known as a super-spine layer). No matter how many stages there are, total hop count will be the same between any two servers. Thus, consistent latency can be maintained throughout the data center (illustrated in the sketch after Figure 2).

• Multi-Chassis Link Aggregation Group (MLAG or MCLAG) is still available on the server side. Servers can be connected to two different leaf or TOR switches in order to have redundancy and load balancing capability. Moreover, because the connectivity matrix is so rich in this topology, failures can be handled gracefully. Even if two spine switches go down at the same time, the connectivity between servers will remain.

• The CLOS topology scales horizontally, which is very cost effective. The bandwidth capacity between servers can be increased by adding more spine-leaf links as well as adding more spine switches. As newly added spine switches will be connected to each leaf, server to server bandwidth/throughput will increase significantly.

Figure 2. CLOS Data Center Topology
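
The hop-count consistency and horizontal scaling described above can be illustrated with a short Python sketch. It is a simplified model, not a configuration: the switch counts are arbitrary assumptions, each server is single-homed, and the point is only to show that every server-to-server path crosses exactly three switches while each added spine contributes one more equal-cost path.

from collections import deque
from itertools import product

# Assumed fabric size, for illustration only.
NUM_SPINES = 4
NUM_LEAVES = 8

# Build the 3-stage Clos adjacency: every leaf connects to every spine,
# and (in this simplified model) one single-homed server hangs off each leaf.
adj = {}
def link(a, b):
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

for leaf, spine in product(range(NUM_LEAVES), range(NUM_SPINES)):
    link(f"leaf{leaf}", f"spine{spine}")
for leaf in range(NUM_LEAVES):
    link(f"server{leaf}", f"leaf{leaf}")

def shortest_path(src, dst):
    # Plain BFS returning the list of nodes from src to dst.
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

# Every server pair on different leaves crosses exactly three switches
# (leaf -> spine -> leaf), so latency stays consistent across the fabric.
for a in range(NUM_LEAVES):
    for b in range(NUM_LEAVES):
        if a != b:
            path = shortest_path(f"server{a}", f"server{b}")
            switches = [n for n in path if n.startswith(("leaf", "spine"))]
            assert len(switches) == 3, path

# Horizontal scaling: each additional spine adds one more equal-cost
# leaf-to-leaf path without changing the hop count.
print("equal-cost paths between any two leaves:", NUM_SPINES)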

How could this be cost effective compared to the traditional design? The key point is that spine switches do not have to be as big and expensive as the core switches in the traditional design. If there are so many TORs/leaves that users hit a port limitation, they can select switches with a high port count (up to 2,300 10G ports) for the spine layer. The better approach in that case, though, would be to build a 5-stage, 7-stage, or multi-pod architecture.

L2 or L3 in CLOS?

It would be a huge mistake to design a data center based on the CLOS architecture and use only L2 switching instead of L3 routing. When compared to L2 switching, L3 routing has lots of benefits, not only for scalability and resiliency, but also for visibility, which is quite important to the planning and operation teams.

But how should L3 routing be integrated into the design? Partially or fully? Which protocol should be used? Is it possible to get rid of L2 switching? First, here’s a design that separates TOR switches from the leaf layer and makes the spine to leaf connections L3 (Figure 3).

In this design, TOR instances are connected to leaf instances with L2 links and spine-leaf connections are L3. MLAG enables operators to utilize all the links and create bandwidth aggregation to leaf switches.

Figure 3. CLOS Topology using Layer 3 Routing for Spine to Leaf Connections

Is it possible to eliminate the use of STP and MLAG? In general, the size of L2 domains should be limited due to the difficulty in troubleshooting L2 environments. There is also the risk of broadcast storms in large L2 domains, and MLAG is vendor dependent. This introduces limitations such as being able to use only two leaves in one instance.

What about expanding L3 routing all the way down to the TORs? Here’s how the topology looks in that case (Figure 4).

Which routing protocol should be used in such an architecture? How about RIP? Just kidding! Here are the pros and cons of using IGP (OSPF/ISIS) and BGP.

Let’s consider OSPF and ISIS together to avoid an argument similar to comparing the relative merits of a Lamborghini and a Ferrari.

At first look, the choice may seem to be highly dependent on the size of the data center. IGP could easily scale if there are only a few thousand prefixes in total. This isn’t wrong, but there are more considerations than sizing.

Figure 4. CLOS Topology using only Layer 3 Routing

• From a configuration perspective, IGP is easier to deploy than BGP, especially considering the number of peers to be configured in BGP. Therefore, automation is a must in BGP deployments (see the configuration sketch after this list).

• IGP is more likely to be supported by TOR switches. In BGP deployments, this limitation could result in aggregating the TOR and leaf layers.

• As stated above, BGP is better when dealing with a high number of prefixes, and it has a limited event propagation scope. IGP, on the other hand, will propagate link-state changes throughout such a complex connectivity matrix (which grows with the number of switches), even to nodes that are not impacted, and SPF calculations will take place on each node. Mechanisms such as Incremental SPF or Partial SPF could avoid this but may add more complexity.

• If IGP is used, BGP will remain in the data center, most likely at the edge. This means there will be two routing protocols in play and redistribution will be necessary. That being said, BGP can be the only routing protocol, if chosen.

• With its attributes and filters, BGP provides much better flexibility for controlling the traffic and it provides per-hop traffic engineering capability. BGP AS path visibility also helps operators troubleshoot problems more easily.

• BGP can be extended across data centers and used as a control plane protocol (EVPN) for VXLAN overlays. This provides many benefits such as active/active multi-homing.

• By default, IGP handles ECMP more readily, whereas BGP configuration has to be adjusted to meet load-balancing requirements.
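
As noted in the first bullet above, the number of peerings makes automation a practical necessity in eBGP fabrics. The sketch below is one possible shape for such automation, with assumed ASNs, device counts, and a made-up /31 addressing plan; the rendered stanzas follow a generic, FRR-like syntax rather than any particular vendor's CLI.

# Illustrative only: assumed addressing, ASNs, and device counts.
NUM_SPINES = 4
NUM_LEAVES = 8
SPINE_AS = 64512                 # all spines share one AS (see Figure 5)
LEAF_AS_BASE = 64600             # each leaf/TOR gets its own private AS

def leaf_bgp_config(leaf_id: int) -> str:
    """Render a generic, FRR-like BGP stanza for one leaf switch."""
    leaf_as = LEAF_AS_BASE + leaf_id
    lines = [f"router bgp {leaf_as}"]
    for spine_id in range(NUM_SPINES):
        # Assumed point-to-point addressing plan: one /31 per leaf-spine link.
        peer_ip = f"10.0.{spine_id}.{leaf_id * 2 + 1}"
        lines.append(f" neighbor {peer_ip} remote-as {SPINE_AS}")
    lines += [
        " bgp bestpath as-path multipath-relax",  # allow ECMP across spine paths
        f" maximum-paths {NUM_SPINES}",
    ]
    return "\n".join(lines)

if __name__ == "__main__":
    for leaf_id in range(NUM_LEAVES):
        print(leaf_bgp_config(leaf_id), end="\n\n")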

BGP in CLOS

When evaluating BGP in data center networks, whether to use iBGP or eBGP is a good question. There are similar challenges in both. However, eBGP-based design is seen more commonly because iBGP can be tricky to deploy, especially in large-scale data centers.

One of the issues with iBGP is route reflection between nodes. Spine switches can be declared as route reflectors, but this causes another issue: route reflectors reflect only the best paths, so nodes won’t be able to use ECMP paths. The BGP add-path feature has to be enabled to push all routes to the leaves. In eBGP, “AS_Path Multipath Relax (Multi-AS Pathing)” needs to be enabled instead. Still, eBGP has the advantage of not requiring route reflectors to be maintained in the data center.

As stated before, BGP has extensive capabilities for per-hop traffic engineering. With iBGP, only part of this capability can be used. eBGP also provides better visibility, for example by directly comparing the BGP Local-RIB with Adj-RIB-In and Adj-RIB-Out, and is therefore more advantageous than iBGP for traffic engineering and troubleshooting.
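
The route-reflection limitation described here can be made concrete with a small, simplified model (the prefix, next hops, and attribute values below are made up). Without additional paths, a reflector advertises only its single best path, so a leaf learns one next hop; with add-path behavior, the leaf sees both next hops and can program ECMP.

# Simplified model: two equal-cost paths to the same prefix, as learned by
# a spine acting as an iBGP route reflector. All values are made up.
PREFIX = "10.1.1.0/24"
paths = [
    {"next_hop": "leaf3", "local_pref": 100, "as_path_len": 0, "router_id": "1.1.1.3"},
    {"next_hop": "leaf4", "local_pref": 100, "as_path_len": 0, "router_id": "1.1.1.4"},
]

def best_path(candidates):
    """Crude stand-in for BGP best-path selection: higher local-pref,
    shorter AS path, then lowest originator router ID as the tie-breaker."""
    return sorted(candidates,
                  key=lambda p: (-p["local_pref"], p["as_path_len"], p["router_id"]))[0]

# Plain route reflection: the reflector advertises only its single best path.
reflected = [best_path(paths)]
print(f"without add-path, next hops for {PREFIX}:", {p["next_hop"] for p in reflected})
# -> only one next hop, so no ECMP on the leaf

# With add-path, the reflector advertises all (or several) paths for the prefix.
print(f"with add-path, next hops for {PREFIX}:", {p["next_hop"] for p in paths})
# -> both next hops, so the leaf can program ECMP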

CLOS with eBGP Design

As shown in Figure 5, spine switches share one common AS. Each leaf instance (pod) and each TOR switch has its own AS. Why are spines located in one AS while each TOR switch has its own? This is to avoid path hunting issues in BGP. Here are some concerns about this architecture.

ECMP:

ECMP is one of the most critical points: without it, it would not be possible to use all links. This is indeed one of the reasons for avoiding STP. As stated earlier, IGP can easily deal with this requirement, whereas BGP needs some adjustments.

The topology in Figure 5 is a multiple pod/instance design, which reduces the number of links between leaf and spine layer. It also allows operators to put more than one leaf node in each AS. This type of setup is quite common, especially in large data center networks. Looking up the possible paths from server A to server H, there are four different paths, and each of them has the same AS_Path attribute (64515 64513 64512 64514 64518). If the other attributes are also the same, BGP can utilize all ECMP paths by enabling multi-path functionality.

It seems quite straightforward in this topology, but what if each leaf switch has its own AS, or servers are connected to more than one TOR switch for the sake of redundancy? Then the AS Path attributes will all have the same length, which by itself does not suffice for using multiple paths in BGP.

Figure 5. CLOS Topology using eBGP for Spine, Leaf and TOR Switches

Even if the length is identical, in order for BGP to route traffic over multiple paths, all attributes have to be the same, including the content of the AS Path attribute. Fortunately, there is a way to meet this requirement: the “AS Path multi-path relax” feature lets BGP ignore the content of the AS Path attribute and install multiple best routes, as long as the length is identical.
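
A small sketch may help illustrate the difference between the default multipath check and the relaxed one. The AS numbers below are illustrative, and the function assumes all other attributes are already equal, as the text above requires.

# Two candidate routes to the same prefix, learned via different leaf ASes.
# AS numbers and prefix are illustrative only.
route_a = {"prefix": "10.2.0.0/24", "as_path": (64515, 64513, 64512, 64514, 64518)}
route_b = {"prefix": "10.2.0.0/24", "as_path": (64515, 64516, 64512, 64514, 64518)}

def multipath_eligible(r1, r2, as_path_multipath_relax=False):
    """Simplified multipath check: assumes all other attributes
    (local-pref, origin, MED, etc.) are already equal."""
    if as_path_multipath_relax:
        # Relaxed: only the AS_PATH *length* has to match.
        return len(r1["as_path"]) == len(r2["as_path"])
    # Default: the AS_PATH *content* has to match as well.
    return r1["as_path"] == r2["as_path"]

print(multipath_eligible(route_a, route_b))                                 # False
print(multipath_eligible(route_a, route_b, as_path_multipath_relax=True))   # True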

Convergence:

Due to its nature and area of use, convergence time was initially not a primary concern for BGP; stability was prioritized over fast convergence. However, several fast convergence enhancements have since been introduced:

• BGP neighbor fall-over (iBGP) and/or BGP fast external fall-over (eBGP) should be enabled.

• BFD can also be used, but until recently there was a limitation that BFD did not take any action in case of a single link failure on a LAG. This still needs to be checked with vendors.

• The advertisement interval in eBGP peering should be set to zero, which is the default value in iBGP peering.

• The keepalive timer should be set to no more than five seconds and the hold time should be set to 15 seconds (see the rough comparison after this list).
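
To put the timer recommendations in perspective, here is a rough comparison of worst-case failure detection times. The 180-second hold time is a common BGP default, and the BFD interval and multiplier are assumed example values, not recommendations from this e-book.

# Worst-case time to declare a dead peer, ignoring propagation and
# best-path recomputation. All values in seconds; BFD numbers are assumed.
default_hold_time = 180          # common BGP default (3 x 60 s keepalive)
tuned_hold_time = 15             # recommendation above: keepalive 5 s, hold 15 s
bfd_interval = 0.3               # assumed 300 ms BFD tx/rx interval
bfd_multiplier = 3               # assumed detection multiplier

print("default BGP hold timer :", default_hold_time, "s")
print("tuned BGP hold timer   :", tuned_hold_time, "s")
print("BFD detection          :", bfd_interval * bfd_multiplier, "s")
# Tuned keepalive/hold timers bring detection from minutes down to seconds;
# BFD (where it can be trusted on LAG members) brings it down to sub-second.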

Number of ASes to be used:

Since each TOR switch is located in its own AS, how are operators able to scale the number of ASes, especially considering the number of TOR switches in a large data center?

• There are 1,023 private ASNs (64512-65534). If this is not sufficient, one option is using four-byte ASNs, which provides millions of additional private ASNs (see the quick calculation after this list).

• TOR ASNs can be used more than once. In this case, the BGP Allow-AS-In feature needs to be enabled on TOR switches. This will turn off one of BGP’s main loop avoidance mechanisms, and TOR switches will accept the routes even though they see their own ASN in received updates.

• If instances/pods are used in the topology where leaf switches share ASes, summarization might create black holes when a specific prefix is withdrawn on the TOR side. To avoid these kinds of issues, the more-specific prefixes should not be hidden by the summary.
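
A quick calculation shows why ASN exhaustion is rarely the real constraint. The ranges below are the standard private-use ASN ranges from RFC 6996; the pod sizes are assumed purely for illustration.

# Private-use ASN ranges (RFC 6996).
two_byte_private = range(64512, 65534 + 1)
four_byte_private = range(4200000000, 4294967294 + 1)

print("2-byte private ASNs :", len(two_byte_private))       # 1023
print("4-byte private ASNs :", len(four_byte_private))      # 94967295

# If ASNs are reused across pods (with allowas-in on the TORs), the number
# of distinct TOR ASNs needed is only the size of the largest pod.
tors_per_pod = 32          # assumed pod size, for illustration
pods = 100                 # assumed number of pods
distinct_asns_needed = tors_per_pod   # reuse the same block in every pod
print("TOR switches        :", tors_per_pod * pods)
print("distinct TOR ASNs   :", distinct_asns_needed)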

Evolving with BGP

BGP is a protocol that meets several requirements and architectural needs in various segments of the network. In this E-book, we covered the challenges in traditional data center architectures, walked through the CLOS topology as an answer, explained why L3 routing is preferable to L2 switching, and assessed IGP vs. BGP. The BGP-based architecture will likely become more common in data centers. Regardless of whether BGP or IGP is selected, moving the whole data center network from L2 to L3 is very beneficial in many aspects.

Packet Design’s Explorer Suite provides extensive network visibility by correlating L3 control plane protocols (OSPF, ISIS, BGP, MP-BGP, 6PE, etc.) with services, traffic, and performance data.

To learn more about Packet Design and the Explorer Suite, please visit www.packetdesign.com

Copyright © 2018, Packet Design, LLC