Stratus Fault-Tolerant Cloud Infrastructure Software for NFV using OpenStack
Transcript of Stratus Fault-Tolerant Cloud Infrastructure Software for NFV using OpenStack
ACHIEVING AVAILABILITY AND RESILIENCY IN OPENSTACK FOR NFV
Stratus Webinar
May 26, 2015
Ali Kafel | Senior Director, Business Development | [email protected] Twitter: @akafel
Steve Hauser | CTO | [email protected]
Agenda
• NFV Overview
• Defining Availability, Reliability and Resiliency
• Achieving Resiliency in Applications vs. Infrastructure
• Software Defined Availability (SDA)
  – Seamless service continuity, with no required code changes
  – Selectable levels of availability for different control and forwarding applications
  – Increasing traditional 45% utilization toward 80% to 90% utilization
Stratus Technologies: 35 Years of Mission-Critical Computing Leadership
• 1980 – Present: VOS & Continuum on proprietary platforms – hardware fault tolerance
• 2008 – Present: ftServer on Intel platforms (hardware fault tolerance) and everRun Enterprise (software fault tolerance, 12,000+ installed)
• 2015: Stratus Cloud Technologies – Software Defined Availability
Network Functions Virtualization: What Exactly Is It?
(Diagram: traditional networks are monolithic, vertically integrated appliances – RAN, backhaul, GPRS/1X, MSC, HLR, SMSC on the mobile side; CPE, L2/L3 switches, firewalls, load balancers, NAT and SBCs elsewhere – each tied to a single vendor (Vendor A, B, C…). Decoupling with NFV proceeds by delamination, virtualization and orchestration: the network functions (RAN, firewall, EPC, PCEF, Diameter Core, MME, OCS/OFCS, HSS, PCRF, IMS, …) become software on Linux running on commodity hyper-scale COTS computing, supplied by many vendors.)
Network Functions Virtualization
(Build on the previous diagram: the decoupled, virtualized functions now draw on a liquid pool of dynamically allocated resources, with automation and orchestration layered on top.)
Network Functions Virtualization with Software Defined Networks
(Diagram: the full picture pairs NFV with SDN, which separates control from forwarding. Virtualized network functions (EPC, PCRF, HSS, IMS, …), virtualized OSS/BSS (billing, customer care, NOC) and virtualized SDN control planes (L2 switching, L3 routing, optical transport) all run on Linux over commodity hyper-scale COTS computing, commodity high-volume storage and commodity high-volume networking, under common orchestration.)
Defining Availability, Reliability and Resiliency
Availability
• The percentage of time a system is in an operable state, i.e. able to access information or resources
• Availability = Uptime / Total time
Reliability
• How long a system performs its intended function
• MTBF = total time in service / number of failures
Resiliency
• The ability to recover quickly from failures and return to the original form and state (just before the failure)
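The two formulas above are simple enough to compute directly. A minimal sketch, with made-up numbers, that also shows why "five nines" (99.999%) availability is so demanding:

```python
def availability(uptime_hours: float, total_hours: float) -> float:
    """Availability = uptime / total time."""
    return uptime_hours / total_hours

def mtbf(total_service_hours: float, failures: int) -> float:
    """Mean Time Between Failures = total time in service / number of failures."""
    return total_service_hours / failures

year_hours = 365 * 24  # 8760

# Five nines allows only about 5.26 minutes of downtime per year.
downtime_minutes = (1 - 0.99999) * year_hours * 60
print(f"Five-nines downtime budget: {downtime_minutes:.2f} min/year")

# A system up 8755 of 8760 hours with 2 failures:
print(f"Availability: {availability(8755, year_hours):.4%}")
print(f"MTBF: {mtbf(year_hours, 2):.0f} hours")
```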
Defining Availability, Reliability and Resiliency
Therefore, a Highly Available (HA) system may not be Highly Reliable (HRel) or Highly Resilient (HRes). A Fault Tolerant (FT) system is Highly Available, Highly Reliable and Highly Resilient (state is preserved).
Fault-tolerant systems never stop. When seconds count, downtime means loss of revenue, reputation, safety, even life: lost transactions, lost reputation, lost revenue, lost customers.
Stateful Fault Tolerance = HA + HRel + HRes
• High Availability: recovery from a failure may take seconds, minutes or hours, and the original state is lost.
• Stateful Fault Tolerance: the original state is preserved.
Three Ways to Provide Stateful FT in VNFs
1. In the Hardware (applications/VNFs on an operating environment over fault-tolerant hardware)
   • Pros: transparent – no code changes; fast, simple deployment; no special application software
   • Cons: very expensive; inefficient utilization; special hardware; rigid
2. In the Applications (fault tolerance built into each application)
   • Pros: application-specific state can be customized
   • Cons: every application must be modified; longer time to deploy; complex; rigid
3. In the Software Infrastructure (operating environment with a resilience layer)
   • Pros: transparent – no code changes; fast, simple deployment; no special application software – deploy any; no special hardware – use commodity; multiple levels of resiliency supported; higher efficiency of redundancy (N+k)
   • Cons: higher efficiency may not be possible for very large monolithic applications
But fault tolerance is more than just state protection; it is the complete fault-management cycle, with multiple levels of resiliency: Detection, Localization, Isolation, Recovery (state protection) and Repair (restoring redundancy).
We call this Software Defined Availability (SDA), and it has four characteristics:
1. Selectable Resiliency for each VNF
2. Seamless Protection for all VNFs
3. Agility with 3rd party ecosystem
4. Efficiency of Redundancy
Stratus' Software Defined Availability (SDA) solution provides a highly resilient cloud and NFVI:
1. Seamless Protection for all VNFs – software-defined, transparent service continuity, performed automatically by the infrastructure, without application code changes
2. Selectable Resiliency for each VNF – deploy each VNF with selectable levels of resiliency, including High Availability and stateful Fault Tolerance (state protection), with geo-redundancy, without application awareness
3. Agility with a 3rd-party ecosystem and any VNF – protect all VNFs in any KVM/OpenStack environment seamlessly, with no complex code development, testing or support, for an optimal partner ecosystem
4. Efficiency of Redundancy – unlike traditional fault-tolerance approaches, which limit utilization to below 50%, get a dramatic increase in efficiency of redundancy, at 80% to 90% utilization
1 | Selectable Resiliency for each VNF: Software Defined Availability with selectable levels of resiliency
Deliver availability as an infrastructure service to virtual and cloud ecosystems: any application (firewall, MME, IMS, web server) with any availability need, with full application transparency.
The right level of resiliency for each component: a componentized VNF can run stateless fast-path forwarding elements with SR-IOV-enabled, high-performance, low-latency I/O, while its stateful control element is FT-protected. Monolithic VNFs are protected as a whole.
2 | Seamless Protection: when needed, application state is protected without application awareness, at the VM level, via statepointing
• VM instances are paired between hosts in the cloud infrastructure
• The state of the primary is captured regularly and applied to a secondary standby
• If a fault occurs on the primary, the secondary takes over from the most recent statepoint without data loss
• The infrastructure controls when information (network, storage I/O) is allowed to leave the guest
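The statepointing cycle above can be sketched in a few lines. This is an illustrative model only, not a real Stratus API: the primary runs in epochs, captures a statepoint (SP) after each one, ships it to the secondary, and on a fault the secondary resumes from the last applied SP.

```python
import copy

class StatepointPair:
    """Toy model of a primary/secondary VM pair protected by statepointing."""

    def __init__(self):
        self.primary_state = {"epoch": 0, "data": []}
        self.secondary_state = None  # last statepoint applied to the standby

    def run_epoch(self, work):
        """Primary executes one guest-run epoch, then captures a statepoint."""
        self.primary_state["epoch"] += 1
        self.primary_state["data"].append(work)
        # Pause-Capture-Resume: snapshot shipped to the secondary host.
        self.secondary_state = copy.deepcopy(self.primary_state)

    def fail_over(self):
        """On a primary fault, the secondary takes over from the latest SP."""
        assert self.secondary_state is not None, "no statepoint captured yet"
        return self.secondary_state

pair = StatepointPair()
for item in ["pkt-1", "pkt-2", "pkt-3"]:
    pair.run_epoch(item)

recovered = pair.fail_over()
print(recovered["epoch"], recovered["data"])  # 3 ['pkt-1', 'pkt-2', 'pkt-3']
```

Because a statepoint follows every epoch, the state recovered after failover is exactly the state just before the failure, which is the "no data loss" property the slide claims.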
(Diagram: the primary host alternates guest-run epochs N-1, N, N+1, … with statepoints SP N-1, SP N, … shipped to the secondary host. On a fault at the primary, the secondary continues from the latest statepoint; a third host, created post-primary-failure, is seeded from an image and joins the pairing at SP N+X.)
Active-Standby Statepoint Process and Egress Network Barrier
(Diagram: the active QEMU runs guest epochs …, n-1, n, n+1; each statepoint passes through Pause, Capture, Resume (PCR) phases, during which VM execution is suspended. Egress packets P1…P5 generated during an epoch are enqueued behind an egress network queue barrier, which prevents transmission of queued egress packets until the barrier is removed – that is, until the corresponding statepoint snapshot has been applied on the standby QEMU. For simplicity, n-2 interactions are not shown.)
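The egress barrier is the key correctness mechanism here: no packet produced during epoch n may reach the outside world before statepoint n is safely on the standby, so externally visible output can never outrun recoverable state. A minimal sketch of that queue discipline (names are illustrative, not from QEMU or Stratus):

```python
from collections import deque

class EgressBarrierQueue:
    """Holds egress packets behind their epoch's barrier until the
    matching statepoint is acknowledged by the secondary."""

    def __init__(self):
        self.queue = deque()      # FIFO of (epoch, packet)
        self.committed_epoch = 0  # highest SP acknowledged by the secondary

    def enqueue(self, epoch, packet):
        self.queue.append((epoch, packet))

    def commit_statepoint(self, epoch):
        """SP `epoch` is safe on the standby; release packets up to it."""
        self.committed_epoch = epoch
        released = []
        while self.queue and self.queue[0][0] <= self.committed_epoch:
            released.append(self.queue.popleft()[1])
        return released

q = EgressBarrierQueue()
q.enqueue(1, "p1"); q.enqueue(1, "p2"); q.enqueue(2, "p3")
print(q.commit_statepoint(1))  # ['p1', 'p2'] -- the epoch-2 packet is still held
print(q.commit_statepoint(2))  # ['p3']
```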
3 | Agility with a 3rd-party ecosystem and any VNF
NFV and SDN allow low-cost commodity hardware, but when failures happen, service continuity can be affected: the commodity NFV/SDN stack (virtualized network functions, virtualized SDN control planes and virtualized OSS/BSS on COTS computing, storage and networking) does not by itself provide five-nines (99.999%) reliability.
We solved this by inserting a virtualized cloud resilience layer for NFV and SDN: the Stratus Automated Virtualized Resilience Layer sits in the stack between the virtualization layer and the virtualized network functions, SDN control planes and OSS/BSS.
Stratus provides a virtualized cloud resilience layer for NFV and SDN. (This slide repeats the previous architecture diagram with the Stratus Automated Virtualized Resilience Layer in place.)
4 | Efficiency of Redundancy: shadow secondary VMs are deployed under anti-affinity rules (on different hosts) and take up far fewer resources than their primaries, yielding high utilization and low additional reserve capacity.
(Diagram: primary VMs A, B, C, D with their shadow secondaries A1, B1, C1, D1 spread across hosts.)
But before we get into details, let's look at how traditional fault tolerance is achieved: by full hardware redundancy.
• Cloud computing environments typically use racks of high-density commodity servers.
• Server workloads that need fault tolerance run another copy in lockstep, which has typically required twice the hardware: racks of redundant servers holding just-in-case workload capacity, arranged rigidly in mated pairs.
• When a failure happens, a backup takes over until the original is replaced, preserving service continuity. However, backup replacement can take days and a great deal of human intervention, during which another failure would be disastrous.
• The result: resource utilization is 50% at best.
"Traditional telecom networks operate great at 45% utilization, but as AT&T becomes a software company, a reasonable goal could be 80% to 90% utilization." – John Donovan, Senior EVP, AT&T
Problem: 45% utilization, 55% unutilized backup capacity.
Stratus Resilient Cloud Technology provides fully stateful fault tolerance at up to 80% utilization.
• Problem: 45% utilization, 55% unutilized backup capacity. Solution: 80% utilization, 20% unutilized backup capacity – a 37.5% resource saving.
• Stratus Virtualized Resilience requires much less backup capacity for fully stateful fault tolerance.
• Alternatively, Stratus Virtualized Resilience could provide 77.8% more actively utilized capacity using the same resources (80% vs. 45% utilization).
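The capacity figure is quick to verify: raising utilization from 45% to 80% on the same hardware is an 80/45 − 1 ≈ 77.8% gain, or equivalently the same workload needs only 45/80 = 56.25% of the original fleet. (The deck's 37.5% savings figure is computed on a different basis; the sketch below only checks the arithmetic that follows directly from the two utilization numbers.)

```python
traditional = 0.45  # utilization with 1+1 mated-pair redundancy
sda = 0.80          # utilization with Software Defined Availability

# Same hardware, more usable capacity:
more_capacity = sda / traditional - 1
print(f"More usable capacity on the same hardware: {more_capacity:.1%}")

# Same workload, fewer servers:
servers_needed = traditional / sda
print(f"Fraction of the fleet needed for the same workload: {servers_needed:.2%}")
```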
Instead of the traditional 1+1 approach, Stratus Resilient Cloud Technology uses Software Defined Availability (SDA), which increases utilization and the efficiency of resiliency while decreasing cost.
It is based on an n+k de-clustered redundancy approach, in which shadow secondary VMs are deployed under anti-affinity rules (on different hosts) and take up far fewer resources than their primaries.
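The core of the anti-affinity rule is simple: each primary's shadow secondary must land on some host other than the primary's, spread around the pool rather than in rigid mated pairs. A minimal placement sketch, with hypothetical host and VM names (not an OpenStack or Stratus API):

```python
from itertools import cycle

def place_secondaries(primaries: dict, hosts: list) -> dict:
    """Map each VM to a secondary host that differs from its primary host,
    rotating around the pool so secondaries are de-clustered."""
    placement = {}
    ring = cycle(hosts)
    for vm, primary_host in primaries.items():
        for _ in range(len(hosts)):
            candidate = next(ring)
            if candidate != primary_host:  # the anti-affinity rule
                placement[vm] = candidate
                break
    return placement

primaries = {"A": "host1", "B": "host1", "C": "host2", "D": "host3"}
print(place_secondaries(primaries, ["host1", "host2", "host3"]))
```

In a real deployment the scheduler would also weigh load and reserve capacity; the sketch only enforces the "different host" constraint that makes single-node failures survivable.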
Software Defined Availability increases utilization and the efficiency of resiliency, and decreases cost.
(Chart, shown on two build slides: the horizontal axis is agility, running from monolithic hardware with coupled forwarding + control – where most traditional telco systems sit – to software-virtualized, de-coupled forwarding + control. The vertical axis is efficiency of redundancy, rising from simple 1+1 through N+1 and separate control/forwarding redundancy (C+C, F+F) to sophisticated Software Defined Availability, where forwarding elements are protected F+k with k << F (SR-IOV enabled) and control elements carry only a small fractional overhead.)
Asymmetric StateSync™ Redundancy
(Diagram: the primary alternates compute with StatePoint™ capture; processor activity on the secondary, which only applies statepoints received over the StatePoint™ Sync Link, is just 6%-10% of the primary's. Coordinated VM interleave improves performance on high-latency links.)
N+k De-Clustered Redundancy
(Diagram sequence, one build per server: the VMs A, B, C, D on each of the five servers are backed up as shadow VMs A1, B1, C1, D1 on separate servers, which could be anywhere in the pool – shown on each other in this example. The final view shows primaries, secondary shadow VMs, and reserve capacity; secondary stand-up can happen on other machines with lower-priority pre-emption.)
Upon node failure, secondaries are activated with no loss of state: the failed node's VMs A, B, C, D continue as their shadow secondaries A1, B1, C1, D1 on other hosts. Stand-up can happen on other machines with lower-priority pre-emption, drawing on reserve capacity.
One of the "k" reserve servers is then activated while the failed node is logically removed and recycled back into the cloud server resource pool.
Primaries are then live-migrated to the new server to balance the load, yielding high utilization and low additional reserve capacity (the slide shows, for example, 73% utilization with a 27% resiliency reserve and 87% utilization with a 13% resiliency reserve).
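The recovery sequence just described can be sketched end to end: secondaries take over, a reserve server replaces the failed node in the pool, and load is rebalanced onto it by live migration. All names are illustrative.

```python
class Pool:
    """Toy model of an n+k pool: active hosts plus k idle reserve hosts."""

    def __init__(self, active, reserve):
        self.active = set(active)     # hosts running primary VMs
        self.reserve = list(reserve)  # the "k" spare hosts

    def handle_failure(self, failed_host):
        # 1. Shadow secondaries on surviving hosts take over (state preserved).
        self.active.discard(failed_host)
        # 2. One reserve server joins the pool; the failed node is logically
        #    removed and recycled.
        replacement = self.reserve.pop(0)
        self.active.add(replacement)
        # 3. Primaries are live-migrated onto the replacement to rebalance.
        return replacement

pool = Pool(active=["h1", "h2", "h3"], reserve=["spare1", "spare2"])
new_host = pool.handle_failure("h2")
print(new_host, sorted(pool.active))  # spare1 ['h1', 'h3', 'spare1']
```

Because the secondaries carry the state forward (step 1), steps 2 and 3 restore redundancy and balance without any service interruption, which is the "Repair" phase of the fault-management cycle.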
Stratus Resilient Cloud Technology dramatically improves the efficiency of redundancy. Relative to the 45%-utilization baseline with 55% unutilized backup capacity, it enables either:
• up to 37.5% resource savings to provide redundancy (80% utilization, 20% unutilized backup capacity), or
• up to 77% more capacity for protected, redundant workloads,
or a combination of the two.
Beyond the Virtualized Resilience Layer, the Resilience Management Layer enables automation.
(Architecture diagram: on a standard commodity-off-the-shelf (COTS) server platform, the NFVI compute domain runs Linux/KVM + QEMU, OpenStack, and OVS with Availability Services; VNFCs are instantiated as VMs running any guest OS on the Linux host OS, above the Virtualized Resilience Layer and a vSwitch, with the NFVI network domain under an SDN controller. The Resilience Management Layer sits between core OpenStack and the orchestrator(s)/OSS/BSS, providing a discovery-and-tagging tool, an authoring tool/service catalog that produces VNF service templates and Heat templates, and resiliency workload management; it connects to MANO/VIM and the VNFMs through the Heat orchestration API.)
Stratus Cloud Solutions – Two Technologies
Availability Services
• Continuous availability, including stateful fault tolerance
• Based upon Linux technology and KVM; available on multiple distributions
• Based on Stratus everRun technology, which is field-proven with 12,000+ licenses deployed
Workload Services (Resilience Management)
• Deployment of workloads
• Automation of availability events
• Layers between orchestrators and OpenStack distributions
In Summary: the Stratus Cloud Solution for telcos and communications infrastructures offers:
1. Seamless Protection for all VNFs – software-defined, transparent service continuity, performed automatically by the infrastructure, without application code changes
2. Selectable Resiliency for each VNF – deploy each VNF with selectable levels of resiliency, including High Availability and stateful Fault Tolerance (state protection), with geo-redundancy, without application awareness
3. Agility with a 3rd-party ecosystem and any VNF – protect all VNFs in any KVM/OpenStack environment seamlessly, with no complex code development, testing or support, for an optimal partner ecosystem
4. Efficiency of Redundancy – unlike traditional fault-tolerance approaches, which limit utilization to below 50%, get a dramatic increase in efficiency of redundancy, at 80% to 90% utilization
Seeing is believing: ETSI PoC #35 – Availability Management with Stateful Fault Tolerance. Participating telcos include AT&T, NTT and iBasis.
Contact us to:
1. See this demo and learn more about seamless software-based fault tolerance in VNFs and other cloud applications
2. Get a copy of the slides or ask further questions
[email protected] | Twitter: @akafel