Networking Challenges for the Next Decade
Amin Vahdat
On behalf of Google Technical Infrastructure and Google Cloud Platform

APRIL 4, 2017

Google Network
More than a collection of data centers

[Map legend: Google Global Cache edge nodes; points of presence (>100); network fiber]
Submarine cables: Unity (US, JP) 2010; SJC (JP, HK, SG) 2013; FASTER (US, JP, TW) 2016

Google Cloud Regions
Adding 11 new regions

[Map legend: current regions and number of zones; future regions and number of zones]
Regions shown: Oregon, Iowa, S Carolina, N Virginia, California, Montreal, São Paulo, Belgium, London, Frankfurt, Netherlands, Finland, Taiwan, Mumbai, Singapore, Sydney, Tokyo

Ubiquitous Cloud...10x Scaling

Datacenter (10x): Next-gen disaggregation of storage, memory and compute
Campus & Metro (10x): Cloud regions and campus expansion driving DC interconnect
WAN (10x): Cloud replication and bandwidth-intensive cloud services (e.g., turnkey video, IoT)

Step Function Disruptions: Bandwidth, Latency, Availability, Predictability

The Pillars of SDN @ Google
• B4: WAN Interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter Networking

The Pillars of SDN @ Google
• B4: WAN Interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter Networking
• Espresso: SDN for public Internet

B4: Google's Software Defined WAN

B4: [Jain et al., SIGCOMM 13]; BwE: [Kumar et al., SIGCOMM 15]

B4: From Copy Network to Business Critical

[Chart: B4 traffic, 2012 to 2016]

Andromeda

[Diagram labels: virtual networks (VNET: 5.4/16, VNET: 192.168.32/24, VNET: 10.1.1/24); NFV functions (Load Balancing, DoS, ACLs, VPN); Internal Network; Google Infrastructure Services; ToR switches serving 10.1.1/24, 10.1.2/24, 10.1.3/24, and 10.1.4/24]

Google Datacenter Network Innovation
And hardware scale that we could not buy

[Chart: capacity over time. Generations shown: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
1.3 Pb/s clusters in 2013

The Pillars of SDN @ Google
• B4: WAN Interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter Networking
• Public Internet?

The Pillars of SDN @ Google
• B4: WAN Interconnect
• Andromeda: NFV and network virtualization
• Jupiter: Datacenter Networking
• Espresso: SDN for public Internet

Espresso in Context

[Diagram labels: User; Internet; Espresso; Peering Metro (Google); B2; B4; Jupiter Data Center (Google)]

Espresso: Before and After

Before: Router-Centric Protocols
• Local view
• Connectivity first
• Coarse fault recovery

After: Espresso SDN Peering
• Per-metro and global view
• Application signals
• Real-time optimization

Espresso Architecture Overview

[Diagram labels: Espresso Metro; Global Controller; Local Control; Application Signals; Hosts, each with a Packet Processor; Label-switched Fabric; Peering Fabric; BGP speaker; eBGP Peering; External Peer]
Labeled packets specify egress
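The note "Labeled packets specify egress" can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the slides: the class names `HostPacketProcessor` and `PeeringFabric`, the integer labels, the peer names, and the idea that a controller installs prefix-to-label entries by calling `program()`. The sketch only shows the division of labor: the host picks the egress by attaching a label, and the fabric forwards on the label alone.

```python
import ipaddress


class HostPacketProcessor:
    """Hypothetical host-side table mapping destination prefixes to egress labels."""

    def __init__(self):
        self._routes = []  # list of (network, label) pairs

    def program(self, prefix, label):
        # Assumed to be called by a controller; the real control path is not
        # described in the slides.
        self._routes.append((ipaddress.ip_network(prefix), label))

    def label_for(self, dst):
        # Longest-prefix match over the installed entries.
        addr = ipaddress.ip_address(dst)
        matches = [(net.prefixlen, label) for net, label in self._routes if addr in net]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0])[1]


class PeeringFabric:
    """Hypothetical label-switched fabric: forwards on the label, not on the IP header."""

    def __init__(self, label_to_peer):
        self._label_to_peer = dict(label_to_peer)

    def forward(self, label):
        return self._label_to_peer[label]


host = HostPacketProcessor()
host.program("203.0.113.0/24", 100)
host.program("203.0.113.128/25", 200)
fabric = PeeringFabric({100: "peer-a", 200: "peer-b"})

for dst in ("203.0.113.5", "203.0.113.200"):
    print(dst, "->", fabric.forward(host.label_for(dst)))
```

In this sketch the fabric holds no routing state beyond the label table, so changing which peer a prefix exits through only requires reprogramming the hosts.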

Next Decade Challenges in Networking

The next wave in computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing

It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity

Last Decade
Cloud 1.0
• Virtualization delivers capex savings to enterprise DCs

Now
Cloud 2.0: HW on Demand
• Public cloud frees enterprise from private HW infrastructure
• Scheduling, load balancing primitives, “big data” query processing

The Third Wave of Cloud Computing
Cloud 3.0: Compute, not servers
• Serverless compute, real-time intelligence, and machine learning
• Not data placement, load balancing, OS configuration and patching

Networking should be aiming for Cloud 3.0

Networking and Cloud 3.0
• Storage disaggregation: the datacenter is the storage appliance
• Seamless telemetry and scale up/down
• Transparent live migration
• Open Marketplace of services, securely placed and accessed

Networking and Cloud 3.0
• Applications + Functions, not VMs
• Policy, not middleboxes
• Actionable Intelligence, not data processing
• SLOs, not placement/load balancing/scheduling

Next Decade Challenges in Networking
• The network will enable next-generation compute infrastructure
• The network can define next-generation storage infrastructure
• The right network infrastructure can deliver fundamental new capability

How we Prioritize Infrastructure Work

Availability

Manageability

Velocity

Stranding

Performance

Availability is Paramount
• First things first: an insecure infrastructure is an unavailable infrastructure
• Stability is more important than efficiency
• Network management is critical
• Configuration is hard
• Automation matters but can be counter to availability

“Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure.” SIGCOMM 2016.

Build for Velocity
• Velocity is the speed of iteration
• Retrospective on “Tussle in Cyberspace: Defining Tomorrow’s Internet”
• Build for hitless upgrades and self-validation
• Debugging and tracing matter
  ○ Without visibility, performance does not matter
• Network fabrics built for expansion and evolution
• Launch and Iterate

Isolation is Critical; Stranding is Terrible

Isolation with reservations is easy but leads to huge resource stranding
● General-purpose, shared infrastructure to approximate custom-built and reserved

Isolation has many components
● Latency, bandwidth, but also the control plane
● Accounting and chargeback are big missing pieces

Congestion Control is still really hard
● Rationalizing multiple control loops: flow, endpoint, flow group, Traffic Engineering
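A rough way to see why reservation-based isolation strands resources is to compare what is reserved with what is typically used. The sketch below is illustrative only: the function name `stranded_fraction` and the tenant numbers are made up for the example and do not come from the talk.

```python
def stranded_fraction(reserved, used):
    """Return the fraction of reserved capacity that sits idle.

    `reserved` and `used` are per-tenant capacities in the same unit
    (for example Gb/s). Usage above a reservation is not counted as idle.
    """
    if len(reserved) != len(used):
        raise ValueError("reserved and used must have the same length")
    total_reserved = sum(reserved)
    if total_reserved <= 0:
        raise ValueError("total reserved capacity must be positive")
    idle = sum(max(r - u, 0) for r, u in zip(reserved, used))
    return idle / total_reserved


# Hypothetical tenants: each reserves for its peak but typically uses far less.
reserved_gbps = [10, 10, 10]
typical_gbps = [2, 3, 1]
print(f"stranded: {stranded_fraction(reserved_gbps, typical_gbps):.0%}")
```

With these made-up numbers most of the reserved capacity is idle at typical load, which is the gap that shared, general-purpose infrastructure tries to close while still approximating the isolation of a reservation.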

Performance only Matters if End to End

Amdahl’s law applies and so an incredible, localized optimization that takes any effort to adopt will be ignored

1. Scale
2. Jitter
3. Storage Disaggregation

Must optimize from the application all the way to the end user
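The Amdahl's law point can be made concrete with the standard formula: if a fraction p of end-to-end time is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below uses illustrative numbers that are assumptions, not figures from the talk.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of total time is accelerated by factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Hypothetical case: a 100x improvement to a component that accounts for
# only 5% of end-to-end time.
print(f"end-to-end speedup: {amdahl_speedup(0.05, 100):.3f}x")
```

Under these assumptions a 100x local improvement yields only about a 5% end-to-end gain, which is why an optimization that costs adoption effort has to be judged from the application all the way to the end user.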

How we Prioritize Infrastructure Work

Availability

Manageability

Velocity

Stranding

Performance

Next Decade Challenges in Networking

The next wave of computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing

It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity

Thank You!

Open Source

Google MapReduce, Google Bigtable, Google Borg, Google Dremel

Open Source

TCP BBR, gRPC, OpenConfig, QUIC ...