AI Driven Day2 Operation
Lai Kwai Seng
Technical Solution Architect, Cisco Systems
Agenda
© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public
• Introduction to Data Center Telemetry
• Data Center Telemetry Use Cases
• Operationalizing Telemetry
• Network Insights Resources
• Network Insights Advisor
• Network Assurance
• Key Takeaways
syslog
SNMP
CLI
Hard to Operationalize
Incomplete
Unstructured
Device-Specific
Slow
How to Manage the Network?
Network Telemetry Frees the Data
As Much Useful Data As Efficiently as Possible
Where Data Is Created: Sensing & measurement
Where Data Is Useful: Storage & analysis
Key Telemetry Characteristics
Efficient Delivery
Tool-Chain consumption and Integration
Structure and Automation
Data-model Driven
Consistent format
Push not Pull
Analytics-ready Data
UDP
Use Cases
• Network Health
• Anomaly detection
• Troubleshooting / Remediation
• SLAs, Performance Tuning
• Capacity Planning
• Security
Trends
• Real time statistics
• Centralized / Software-defined
• Speed
• Scale
Why This Matters Now
What hasn’t changed
What has changed
Capabilities
Data Center Visibility Use Cases
Network Health
• CPU and memory utilization
• Forwarding table utilization
• Protocol state and events
• Environmental data
Path and Latency Measurement
• End-to-end visibility
• Path tracing over time
• Flow latency monitoring
Network Performance
• Interface utilization
• Buffer monitoring
• Microburst detection
• Drop event correlation
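Microburst detection from streamed buffer samples can be sketched roughly as below. This is a minimal illustration, not the method any Cisco product uses: the sample format (timestamp, occupancy fraction), the 80% threshold, the 10 ms duration limit, and the function name are all assumptions.

```python
from typing import List, Tuple

def find_microbursts(samples: List[Tuple[float, float]],
                     threshold: float = 0.8,
                     max_duration: float = 0.01) -> List[Tuple[float, float]]:
    """Return (start, end) windows where buffer occupancy stayed at or above
    `threshold` for no longer than `max_duration` seconds.

    `samples` is a time-ordered list of (timestamp_seconds, occupancy_fraction).
    Both the format and the default values are hypothetical.
    """
    bursts = []
    start = None      # timestamp of the first high sample in the current run
    last_high = None  # timestamp of the most recent high sample
    for ts, occ in samples:
        if occ >= threshold:
            if start is None:
                start = ts
            last_high = ts
        elif start is not None:
            # The run ended; keep it only if it was short enough to be a burst.
            if last_high - start <= max_duration:
                bursts.append((start, last_high))
            start = None
    # Close a run that is still open at the end of the sample window.
    if start is not None and last_high - start <= max_duration:
        bursts.append((start, last_high))
    return bursts
```

Runs that exceed `max_duration` are deliberately dropped, since they look like sustained congestion rather than a microburst.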
Memory
Power
Temperature
CPU
TCAM
System Info and Environmentals
Are my switches healthy?
Alert: Neighbor Lost!
OSPF Routes over Time
Protocol State and Events
OSPF Process State
Process ID 10
Router ID 10.1.1.1
Area 0.0.0.0
OSPF Interfaces
105
Hypervisor Hypervisor
Is routing working as expected?
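A "Neighbor Lost" alert of the kind shown on this slide could be derived from streamed protocol state events along these lines. This is a minimal sketch: the event tuple layout and the function name are assumptions, and only the OSPF state name FULL is taken from the protocol itself.

```python
from typing import List, Tuple

Event = Tuple[float, str, str]  # (timestamp, neighbor router ID, OSPF neighbor state)

def neighbor_lost_alerts(events: List[Event]) -> List[Event]:
    """Return every event where a neighbor leaves the FULL state.

    `events` must be time-ordered. The event format is hypothetical.
    """
    last_state = {}
    alerts = []
    for ts, neighbor, state in events:
        # Alert only on the transition out of FULL, not on every non-FULL event.
        if last_state.get(neighbor) == "FULL" and state != "FULL":
            alerts.append((ts, neighbor, state))
        last_state[neighbor] = state
    return alerts
```

Tracking the previous state per neighbor keeps the alert tied to the moment the adjacency drops, so later events for the same neighbor do not raise duplicates.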
Monitoring Buffer Utilization and Drops
Incast or other oversubscription
Packet drops!
I see queue drops – but who’s affected?!
Network Path and Latency Measurement
Application performance is slow between Server A & Server B!
Server A Server B
Network Insights Resources - Customer Benefits
Network
Insights
Resources
Resource Utilization
Fabric-Wide Capacity Planning, Trend Monitoring
Troubleshoot Application Latency
Identify Traffic/Protocol behavior
Identify/Predict Failing Devices
Operations
Event Analytics
Endpoint Analytics
Avoid Environmental (CPU, Power, Memory, Fan, Storage) Related Failures
Identify Subtle Path-Related issues
Track endpoint details and moves
Statistics
Environmental Monitoring
Flow Analytics
NIR Architecture
Data Lake
Data Lake Connector
Telemetry Sources
ACI/NX-OS
Hardware & Software
Message Bus (Kafka)
REST APIs
Anomaly & Correlation Engines
Telemetry Collectors
REST Client
NIR GUI
NIR
Correlation Engine
Correlate normalized telemetry data streams from Transformation Receiver
LLDP
Buffer and Queue stats
Flow details
End-to-end Flow Path
End-to-end Path Latency
Buffer Occupancy and drops along Flow Path
Correlation based on timestamp and matching 5-tuple
Pipelines
Configs
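Correlation on timestamp and matching 5-tuple, as described on this slide, might look like the sketch below. It assumes each switch exports one record per flow with a timestamp and that switch clocks are synchronized; the record type, field names, and function name are hypothetical and do not describe NIR's internal data model.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

# src IP, dst IP, src port, dst port, protocol
FiveTuple = Tuple[str, str, int, int, str]

class FlowRecord(NamedTuple):
    switch: str        # hypothetical name of the switch reporting the flow
    timestamp: float   # seconds; assumes synchronized switch clocks
    flow: FiveTuple

def correlate(records: List[FlowRecord]) -> Dict[FiveTuple, Tuple[List[str], float]]:
    """Group records by 5-tuple, order each group by timestamp, and return
    the switch path plus the first-to-last-hop latency for every flow."""
    by_flow: Dict[FiveTuple, List[FlowRecord]] = defaultdict(list)
    for rec in records:
        by_flow[rec.flow].append(rec)

    result = {}
    for flow, recs in by_flow.items():
        recs.sort(key=lambda r: r.timestamp)
        path = [r.switch for r in recs]
        latency = recs[-1].timestamp - recs[0].timestamp
        result[flow] = (path, latency)
    return result
```

Buffer occupancy and drop counters could be joined onto the same path by keying them on switch name and a nearby timestamp, which is left out here to keep the sketch short.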
Operational Intelligence Engine for Network Insights
Dynamic Correlation: Correlate information across data sources
Failure Prediction & Corrective Action: Ability to predict failure and provide corrective action
Intelligent Insights: Ability to discover information with ease
Proactive Alerts: See problems before end users do and alert
Dynamic Correlation
Proactive Alerts
Failure Prediction and Corrective Action
Intelligent Insights
Increase Availability and Performance
Network Insights Advisor -- Customer Benefits
Network
Insights
Advisor
Software/Hardware Recommendations
Workarounds
Avoid multiple TAC calls
Significant CAPEX and OPEX Savings
Remove Complexity
Avoid Outages
Faster Deployment times
Anomalies
Forwarding State Check
Network Anomaly Detection
Keep Network up to date
Adhere to Cisco policies, Recommendations
Prevent traffic black holing
Avoid downtimes
Known Bugs/PSIRTs
Unknown runtime
Config anomalies
EOL/EOS
Field Notices
SMUs
Version Scale
Limits/Hardening
Check
Configuration
Network Insights Advisor Targeted Use Cases
Proactive supportability insights
Fabric wide analysis
Advisories
Provides advisories based on anomalies, bugs, PSIRTs and field notices. Measure upgrade impact
Dashboard: “Give me a summary of issues”
Anomalies
Hardening checks, scale checks
Bugs and PSIRTs
Known bugs and vulnerabilities in the system
Network
Provides:
• Running config of all devices
• “show tech” from all devices (including APIC)
Cisco
Provides:
• Best practices updates
• PSIRTs, FNs, EOS/EOL
• Software release notifications
• Digitized signatures of known defects
First, We Need Data!
NIA
Cisco
Every 24h
Cloud Data
User-specified interval
Network On-prem Data
Known Bugs
Use Case – Notify About Issues
Fabric
NIA
Insight DB
1 Monitor
2 Detect
3 Alert / Inform
4 Implement
Weekly Sync
Detected: CSCDT2396 SAL1820SDRE
Recommend: Upgrade S/W to NXOS 7.0(3)I7(3)
Detect, Alert, Remediate
Network Insights issue detection
Hardening Check
Signature Matching
Advisory Services
NIA – Core
Storage
Tech Support and ‘show run’ collection
Data Sources
Interacting with Cisco Services via NIA-PROXY
NIA – GUI
Tech support bundles collected from each switch are matched against signatures of known external caveats
The hardening guide is digitized into signatures and matched against the ‘show run’ output from each switch
Insights DB
Bugs/PSIRTs detection
Updated periodically with signatures from the cloud
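Matching digitized hardening signatures against a running config could be sketched as below. The signature format, the signature IDs, and the two example checks are assumptions made for illustration; they are not taken from a Cisco hardening guide or from NIA's actual signature database.

```python
import re
from typing import Dict, List

# Hypothetical signatures: each names a regex and whether the matching
# config line is expected to be present or absent.
SIGNATURES: List[Dict[str, str]] = [
    {"id": "HARDEN-001", "pattern": r"^feature telnet$", "expect": "absent"},
    {"id": "HARDEN-002", "pattern": r"^logging server \S+", "expect": "present"},
]

def check_hardening(running_config: str,
                    signatures: List[Dict[str, str]] = SIGNATURES) -> List[str]:
    """Return the IDs of signatures that the running config violates."""
    violations = []
    for sig in signatures:
        # MULTILINE lets ^ and $ anchor on each config line.
        found = re.search(sig["pattern"], running_config, re.MULTILINE) is not None
        if found != (sig["expect"] == "present"):
            violations.append(sig["id"])
    return violations
```

Because the signatures are plain data, a periodic sync from the cloud could replace the list without changing the matching code.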
Use Case – Notify Me About Recommended Releases
Fabric
NIA
Insight DB
1 Monitor
2 Identify Switches
3 Alert / Inform
4 Implement
Push Notification
Notifications
Affected devices: 3
Leaf 1, Leaf 2, Leaf 3
With BUG ID: XYZ
Recommend: Upgrade S/W to NXOS 7.0(3)I7(3)
Detect, Alert, Remediate
Affected devices
S/W Notify
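The "identify switches, then notify" step of this use case can be illustrated with a small sketch. The bug ID and the recommended release come from the slide; the inventory, the list of affected versions, and the function names are hypothetical.

```python
from typing import Dict, List

# Advisory fields other than bug_id and recommended are hypothetical.
ADVISORY = {
    "bug_id": "XYZ",
    "affected_versions": {"7.0(3)I7(1)", "7.0(3)I7(2)"},
    "recommended": "7.0(3)I7(3)",
}

def affected_devices(inventory: Dict[str, str], advisory: dict) -> List[str]:
    """Return the sorted names of devices running an affected version."""
    return sorted(name for name, version in inventory.items()
                  if version in advisory["affected_versions"])

def notification(inventory: Dict[str, str], advisory: dict) -> str:
    """Build a one-line notification for the affected devices."""
    hits = affected_devices(inventory, advisory)
    return (f"Affected devices: {len(hits)} ({', '.join(hits)}) "
            f"with bug ID {advisory['bug_id']}; "
            f"recommend upgrade to NX-OS {advisory['recommended']}")
```

Exact string matching on the version avoids having to parse and order NX-OS release strings, at the cost of listing every affected release explicitly.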
Network Assurance Engine: How it Works
Data Collection: Capture DC Wide Intent, Policy, Control/State across Forwarding & Security
Formal Modeling of Network: Precise Mathematical Models that codify Cisco’s 30+ Years of Networking and Cross Customer Domain Knowledge
Continuous Analysis: Models verify that the Network operates per Intent and accurately tell what is wrong, where, why, the impact and how to fix it
Reactive Troubleshooting to Proactive Operations - continuously, network wide
Continuous Assurance Workflows
Is my network compliant with Governance Rules?
Compliance Analysis
Did something change in my network?
Epoch Delta Analysis
Can A talk to B?
Connectivity Analysis
Smart Events & Compliance Score for Compliance
COMPLIANCE SATISFIED SMART EVENT
• Identify compliant policy
• Identify requirements satisfied
• Identify compliant EPGs
COMPLIANCE VIOLATED SMART EVENT
• Identify non-compliant policy
• Identify requirements violated
• Identify non-compliant EPGs
COMPLIANCE SCORE
Epoch Delta Analysis
Correlated Ad hoc Analysis Workflow
4 Qs, correlated answers…
• What changed?
• Who was impacted?
• Was it due to config changes?
• What happened as a result?
Use Cases
• Change Management
• Root-cause analysis
• Migration
• Maintenance Upgrades
• Capacity Management
Before / Baseline
After / Current
Health Delta - Summary
Change in the health of the Fabric
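The "What changed?" question in an epoch delta can be illustrated by diffing two snapshots. This is a minimal sketch that treats each epoch as a flat dictionary of object name to value; the snapshot shape, the key names in the usage note, and the function name are assumptions, not the Network Assurance Engine's data model.

```python
from typing import Any, Dict

def epoch_delta(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, dict]:
    """Compare a baseline epoch with the current epoch.

    Returns objects that were added, removed, or changed between the two.
    """
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {k: (before[k], after[k])
               for k in before.keys() & after.keys()
               if before[k] != after[k]}
    return {"added": added, "removed": removed, "changed": changed}
```

For example, comparing `{"epg-web": "contract-a", "epg-db": "contract-b"}` with `{"epg-web": "contract-c", "epg-app": "contract-a"}` reports `epg-app` as added, `epg-db` as removed, and `epg-web` as changed.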
Epoch Delta Workflow – Policy Delta
Impact, Change, Operator
What got impacted?
Who made the changes?
What has changed?
Details of impact, if any
Forwarding Connectivity Analysis
Use Cases
• Forwarding Communication Issues across entire fabric
• Visibility into Route Leakage
• Visibility into Fabric Communication with External Network
• Policy and Forwarding Inconsistencies
Key Takeaways
• Nexus leads the industry in telemetry capabilities
• The combination of software and hardware streaming provides the deepest level of network visibility
• Platforms for consuming, analyzing, and visualizing telemetry data are available or being developed for both ACI and standalone
• Both Cisco turnkey solutions and custom/third-party integrations exist today