Networking Challenges for the Next Decade
Amin Vahdat
On behalf of Google Technical Infrastructure and Google Cloud Platform
Source: events17.linuxfoundation.org/sites/events/files/slides/ONS...
APRIL 4, 2017
Google Network: More than a collection of data centers
[Map legend]
Network fiber
Points of presence: >100
Google Global Cache edge nodes
Cable systems: Unity (US, JP) 2010; SJC (JP, HK, SG) 2013; FASTER (US, JP, TW) 2016
Google Cloud Regions: Adding 11 new regions
[Map: current regions and future regions, each marked with its number of zones]
Regions shown: Oregon, California, Iowa, S Carolina, N Virginia, Montreal, São Paulo, London, Belgium, Netherlands, Frankfurt, Finland, Mumbai, Singapore, Taiwan, Tokyo, Sydney
Ubiquitous Cloud... 10x Scaling
Datacenter (10x): Next-gen disaggregation of storage, memory and compute
Campus & Metro (10x): Cloud regions and campus expansion driving DC interconnect
WAN (10x): Cloud replication and bandwidth-intensive cloud services (e.g., turnkey video, IoT)
Step Function Disruptions: Bandwidth, Latency, Availability, Predictability
The Pillars of SDN @ Google
B4: WAN Interconnect
Andromeda: NFV and network virtualization
Jupiter: Datacenter Networking
Espresso: SDN for public Internet
B4: Google's Software Defined WAN
B4: [Jain et al, SIGCOMM 13] BwE: [Jain et al, SIGCOMM 15]
B4: From Copy Network to Business Critical
[Chart: B4 traffic, 2012 to 2016]
Andromeda
[Diagram labels]
Virtual networks: VNET: 10.1.1/24, VNET: 192.168.32/24, VNET: 5.4/16
NFV: Load Balancing, DoS, ACLs, VPN
Internal Network: ToR 10.1.1/24, ToR 10.1.2/24, ToR 10.1.3/24, ToR 10.1.4/24
Google Infrastructure Services
Google Datacenter Network Innovation: And hardware scale that we could not buy
[Chart: capacity over time across fabric generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
1.3 Pb/s clusters in 2013
The Pillars of SDN @ Google
Public Internet?
Espresso: SDN for public Internet
Espresso in Context
[Diagram labels: User, Internet, Peering Metro, Espresso, B2, B4, Jupiter Data Center, Google]
Espresso: Before and After
Before (Router Centric Protocols): Local view; Connectivity first; Coarse fault recovery
After (Espresso SDN Peering): Per-metro and global view; Application signals; Real-time optimization
Espresso Architecture Overview
[Diagram labels: Espresso Metro; External Peer; eBGP Peering; BGP speaker; Peering Fabric; Label-switched Fabric; Hosts with Packet Processor; Local Control; Global Controller; Application Signals]
Labeled packets specify egress
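The note that labeled packets specify egress can be illustrated with a minimal Python sketch. Everything named here (the EgressTable class, the label numbers, the peer port names, the example prefixes) is hypothetical and chosen for illustration; the slides do not describe Espresso's data structures or encapsulation. The sketch assumes a controller programs each host's packet processor with a prefix-to-label map, and that the fabric forwards on the label alone.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class EgressTable:
    """Hypothetical host-side table: destination prefix -> egress label."""
    routes: dict = field(default_factory=dict)

    def program(self, prefix: str, label: int) -> None:
        # In this sketch a controller calls this to install or update a route.
        self.routes[ipaddress.ip_network(prefix)] = label

    def lookup(self, dst: str) -> int:
        # Longest-prefix match over the programmed prefixes.
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            raise KeyError(f"no route for {dst}")
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Hypothetical fabric state: label -> egress port facing an external peer.
FABRIC_LABELS = {100: "peer-a/port-1", 200: "peer-b/port-7"}


def host_encapsulate(table: EgressTable, dst: str, payload: bytes) -> dict:
    # The host's packet processor chooses the egress by attaching a label.
    return {"label": table.lookup(dst), "dst": dst, "payload": payload}


def fabric_forward(packet: dict) -> str:
    # In this sketch the fabric looks only at the label, not the destination.
    return FABRIC_LABELS[packet["label"]]


table = EgressTable()
table.program("203.0.113.0/24", 100)    # documentation prefix (RFC 5737)
table.program("203.0.113.128/25", 200)  # more specific route, different peer

pkt = host_encapsulate(table, "203.0.113.200", b"hello")
egress_port = fabric_forward(pkt)
```

In this toy setup the egress decision lives entirely in the table that the controller programs on the host, so changing which peer carries a prefix means reprogramming a label rather than touching the fabric.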
Next Decade Challenges in Networking
The next wave in computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
The Third Wave of Cloud Computing
Cloud 1.0 (Last Decade)
• Virtualization delivers capex savings to enterprise DCs
Cloud 2.0 (Now): HW on Demand
• Public cloud frees enterprise from private HW infrastructure
• Scheduling, load balancing primitives, “big data” query processing
Cloud 3.0: Compute, not servers
• Serverless compute, real-time intelligence, and machine learning
• Not data placement, load balancing, OS configuration and patching
Networking should be aiming for Cloud 3.0
Networking and Cloud 3.0
• Storage disaggregation: the datacenter is the storage appliance
• Seamless telemetry and scale up/down
• Transparent live migration
• Open Marketplace of services, securely placed and accessed
• Applications + Functions, not VMs
• Policy, not middleboxes
• Actionable Intelligence, not data processing
• SLOs, not placement/load balancing/scheduling
Next Decade Challenges in Networking
The network will enable next-generation compute infrastructure
The network can define next-generation storage infrastructure
The right network infrastructure can deliver fundamental new capability
How we Prioritize Infrastructure Work
Availability
Manageability
Velocity
Stranding
Performance
Availability is Paramount
• First things first: an insecure infrastructure is an unavailable infrastructure
• Stability is more important than efficiency
• Network management is critical
• Configuration is hard
• Automation matters but can be counter to availability
“Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure.” SIGCOMM 2016.
Build for Velocity
• Velocity is the speed of iteration
• Retrospective on “Tussle in Cyberspace: Defining Tomorrow’s Internet”
• Build for hitless upgrades and self-validation
• Debugging and tracing matter
○ Without visibility, performance does not matter
• Network fabrics built for expansion and evolution
• Launch and Iterate
Isolation is Critical; Stranding is Terrible
Isolation with reservations is easy but leads to huge resource stranding
● General-purpose, shared infrastructure to approximate custom-built and reserved
Isolation has many components
● Latency, bandwidth, but also the control plane
● Accounting and chargeback are big missing pieces
Congestion Control is still really hard
● Rationalizing multiple control loops: flow, endpoint, flow group, Traffic Engineering
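The trade-off between reservations and stranding can be made concrete with a small Python calculation. The tenant names, reservation sizes, and usage figures are invented for illustration and are not taken from the talk. Stranding is defined here, as an assumption, as capacity that is reserved but idle on average.

```python
# Hypothetical link capacity and per-tenant numbers, in Gb/s.
LINK_CAPACITY = 100.0

tenants = {
    # name: (reserved peak, average actual usage)
    "batch": (40.0, 10.0),
    "serving": (40.0, 15.0),
    "logs": (20.0, 5.0),
}


def stranded_fraction(tenants: dict, capacity: float) -> float:
    """Fraction of capacity that is reserved but idle on average."""
    reserved = sum(r for r, _ in tenants.values())
    used = sum(u for _, u in tenants.values())
    if reserved > capacity:
        raise ValueError("reservations exceed capacity")
    return (reserved - used) / capacity


def shared_headroom(tenants: dict, capacity: float) -> float:
    """Average capacity left for other work if the link is shared instead."""
    used = sum(u for _, u in tenants.values())
    return capacity - used


stranded = stranded_fraction(tenants, LINK_CAPACITY)
headroom = shared_headroom(tenants, LINK_CAPACITY)
```

With these made-up numbers, 70% of the link is reserved but idle on average. Sharing recovers that capacity, but each tenant then depends on the isolation components the slide lists (latency, bandwidth, control plane) to be protected during bursts.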
Performance only Matters if End to End
Amdahl’s law applies and so an incredible, localized optimization that takes any effort to adopt will be ignored
1. Scale
2. Jitter
3. Storage Disaggregation
Must optimize from the application all the way to the end user
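Amdahl's law gives the overall speedup when only part of a workload is improved: speedup = 1 / ((1 - p) + p / s), where p is the fraction of end-to-end time affected and s is the local speedup. A minimal Python sketch follows; the fractions and speedups are illustrative values, not measurements from the talk.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the time is sped up by factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# A 10x improvement to a component that is 5% of end-to-end time.
local_win = amdahl_speedup(p=0.05, s=10.0)

# A 2x improvement across 80% of end-to-end time.
broad_win = amdahl_speedup(p=0.80, s=2.0)
```

Here the 10x local optimization improves the end-to-end figure by under 5%, while the modest but broad one yields roughly 1.67x, which is the arithmetic behind optimizing from the application all the way to the end user.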
Next Decade Challenges in Networking
The next wave of computing
• Serverless compute in Cloud 3.0
• IoT
• Tightly coupled, general purpose distributed computing
It’s time to put it all together
• Agile Scale
• Jitter
• Isolation
• Performance is great, but only meaningful with availability, manageability, and velocity
Thank You!
Open Source
Google MapReduce, Google Bigtable, Google Borg, Google Dremel
TCP BBR, gRPC, OpenConfig, QUIC ...