OmniMon: Re-architecting Network Telemetry with Resource Efficiency and Full Accuracy
Qun Huang, Haifeng Sun, Patrick P. C. Lee
Wei Bai, Feng Zhu, Yungang Bao
1
Flow-level Network Telemetry
2
Hardware Switches
End-hosts
Controller
Packet: (flowkey, packet values)
Flow Statistics
Flow 1: Pkt count, ...
Flow 2: Pkt count, ...
Flow 3: Pkt count, ...
...
Goal
3
Hardware Switches
End-hosts
Controller
Flow Statistics: Flow 1, Flow 2, Flow 3, ... (Pkt count, ...)
Full Accuracy
Resource Efficiency
Full Accuracy
4
Controller
Flow Statistics: Flow 1, Flow 2, Flow 3, ... (Pkt count, ...)
1. Always-on: all time intervals
2. Network-wide: all devices
3. Complete: all flows
4. Correct: zero per-flow error
Hardware Switches
End-hosts
Resource Efficiency
5
Hardware Switches
Controller
Flow Statistics: Flow 1, Flow 2, Flow 3, ... (Pkt count, ...)
End-Hosts: sufficient memory, programmability, slow CPU, limited visibility
Hardware Switches: fast ASIC, limited memory, limited programmability
Controller: sufficient CPU and memory, global visibility, limited bandwidth
Existing Approaches: Trade-offs
12
Full Accuracy vs. Resource Efficiency
SNMP: coarse-grained
Hash tables: high overheads
Event matching, top-k counting, sampling: only partial flows
Sketch: approximate results
Our Goal: full accuracy with resource efficiency
Root Cause
13
Controller
Telemetry Operator
Telemetry Operator
Telemetry Operator
Operators are executed individually with limited collaboration
Root Cause
14
Controller
Telemetry Operator: Resource Management, Flowkeys, Values
Telemetry Operator: Resource Management, Flowkeys, Values
Telemetry Operator: Resource Management, Flowkeys, Values
Operators are executed individually with limited collaboration
Operators have to be heavy and sacrifice accuracy
OmniMon
16
Question 1: How to coordinate different entities for network telemetry?
Re-architect network telemetry with a distributed design
Question 2: How to provide reliable guarantees for the coordination?
Split-and-Merge Architecture
18
Network telemetry
Flowkey Tracking, Value Updating, Resource Management, Collective Analysis
Controller
Break heavy operators
Network-wide coordination
Flowkey Tracking
19
Controller
End-hosts: Flowkeys
Value Update
21
Controller
End-hosts: Slots (Ingress), Flowkeys, Slots (Egress)
Packet
Mapping at End-Host
22
Different strategies of slot mapping in end-hosts and switches
Mapping at End-Host (Egress)
23
Mapping at End-Host (Ingress)
24
1. Embed Location (HostIndex)
2. Locate Slot
Packet
No Flowkeys
Mapping at Switch
25
Flow 1, Flow 2: each packet carries SwitchIndex and HostIndex
1. Global Coordination
2. Embed Index
3. Extract & Update
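The mapping steps for end-hosts and switches can be sketched as follows. This is a minimal illustration under stated assumptions, not OmniMon's code: the class names, the packet fields, and the round-robin rule that stands in for the controller's global coordination are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    flowkey: str
    host_index: int    # hypothetical header field: slot location at the end-hosts
    switch_index: int  # hypothetical header field: slot location at the switches


@dataclass
class SourceHost:
    num_switch_slots: int
    table: dict = field(default_factory=dict)         # flowkey -> (host_index, switch_index)
    egress_slots: list = field(default_factory=list)  # one exact counter per flow

    def send(self, flowkey: str) -> Packet:
        if flowkey not in self.table:
            host_index = len(self.egress_slots)
            # Assumption: round-robin stands in for the controller's global coordination.
            switch_index = host_index % self.num_switch_slots
            self.table[flowkey] = (host_index, switch_index)
            self.egress_slots.append(0)
        host_index, switch_index = self.table[flowkey]
        self.egress_slots[host_index] += 1
        return Packet(flowkey, host_index, switch_index)


@dataclass
class Switch:
    slots: list  # shared counters; several flows may map to one slot

    def process(self, pkt: Packet) -> None:
        # No flowkey lookup: extract the embedded index and update the slot.
        self.slots[pkt.switch_index] += 1


@dataclass
class DestHost:
    # host_index -> count; a single source end-host is assumed for brevity
    ingress_slots: dict = field(default_factory=dict)

    def receive(self, pkt: Packet) -> None:
        self.ingress_slots[pkt.host_index] = self.ingress_slots.get(pkt.host_index, 0) + 1


def run_demo():
    src, sw, dst = SourceHost(num_switch_slots=1), Switch(slots=[0]), DestHost()
    for flowkey, n in (("flow1", 13), ("flow2", 11)):
        for _ in range(n):
            pkt = src.send(flowkey)
            sw.process(pkt)
            dst.receive(pkt)
    return src.egress_slots, sw.slots, dst.ingress_slots
```

With 13 packets of one flow and 11 of another, `run_demo()` leaves egress slots `[13, 11]` at the source and a single shared switch slot holding 24, so the switch keeps no flowkeys and performs only an indexed increment.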
Collective Analysis
26
Collect results from end-hosts and switches to form the final flow statistics
Collective Analysis
27
Flow 1 Flow 2 Flow 3 …
Exploit end-host information to decompose switch slots
Collective Analysis (Detail)
28
Switch: slot value 24
Source End-Host (zero-error flowkey & values): Flowkeys: Flow 1; Slots (Egress): 13; Switch Index
Source End-Host (zero-error flowkey & values): Flowkeys: Flow 2; Slots (Egress): 11; Switch Index
Decomposed switch slot: Flow 1: 13, Flow 2: 11
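The decomposition in this example can be sketched as below. The function name and the report format are hypothetical. The sketch assumes no packet loss and that all devices count each packet in the same interval, so a switch slot equals the sum of the source end-host counts mapped to it.

```python
def decompose_switch_slots(host_reports, switch_slots):
    """Split each shared switch slot into per-flow values.

    host_reports: iterable of (flowkey, switch_index, count) from source end-hosts.
    switch_slots: list of slot values read from one switch.
    """
    per_slot = {}
    for flowkey, switch_index, count in host_reports:
        per_slot.setdefault(switch_index, []).append((flowkey, count))

    result = {}
    for index, observed in enumerate(switch_slots):
        flows = per_slot.get(index, [])
        expected = sum(count for _, count in flows)
        # "consistent" is False when the switch disagrees with the end-hosts,
        # for example after packet loss.
        result[index] = {"flows": dict(flows), "consistent": expected == observed}
    return result
```

Calling `decompose_switch_slots([("flow1", 0, 13), ("flow2", 0, 11)], [24])` attributes 13 packets to flow1 and 11 to flow2 in slot 0 and marks the slot as consistent.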
Putting It Together
29
Switches: Shared Slots (low memory usage, simple updates)
End-Hosts: Hash Table (exact per-flow tracking, affordable operations)
Controller: Global Info. (switch index mapping, collective analysis)
OmniMon
30
Question 1: How to coordinate different entities for network telemetry?
Re-architect network telemetry with a distributed design
Question 2: How to provide reliable guarantees for the coordination?
Unreliable Events
Lack of global clock
• Devices reside in different intervals
Packet loss
• Flow values are missing in some devices
31
Impact of Unreliable Events
33
End-hosts: Flow 1: 13, Flow 2: 11
Switch slot: expected 24, recorded 20
Long delay: packets recorded by other intervals
Packet loss: packets not recorded
Decomposition: Flow 1: ???, Flow 2: ???
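To see why the recorded value alone is not enough, the sketch below enumerates every per-flow split of the switch slot that is consistent with the end-host counts. It assumes, for illustration only, that the switch can only undercount a flow relative to its source end-host; the function name is hypothetical.

```python
def candidate_splits(host_counts, recorded):
    """Return all (flow1, flow2) switch-side counts that sum to `recorded`
    without exceeding the counts reported by the source end-hosts."""
    flow1_max, flow2_max = host_counts
    return [
        (flow1, recorded - flow1)
        for flow1 in range(flow1_max + 1)
        if 0 <= recorded - flow1 <= flow2_max
    ]
```

For host counts `(13, 11)`, a recorded value of 24 admits only `(13, 11)`, while a recorded value of 20 admits five splits, from `(9, 11)` to `(13, 7)`. The four missing packets cannot be attributed to a flow without extra information.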
Reliability Guarantees
34
Lack of global clock
Hybrid consistency model
• Network-wide synchronization
Guarantees
• Each packet is included in the same interval by all devices
• All end-hosts reside in the same time interval most of the time
Packet loss
Loss inference
• Linear system with DCN-specific optimizations
• Flow mapping algorithm
Guarantees
• Per-switch, per-flow loss inference in common cases
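One way to picture the linear system is sketched below. The notation is an assumption made for illustration and is not taken from the slides: $V_f^{src}$ and $V_f^{dst}$ are the counts of flow $f$ at its source and destination end-hosts, $C_s$ is the value of switch slot $s$, $F_s$ is the set of flows mapped to $s$, $p(f,s)$ is the position on the path of $f$ of the switch holding $s$, and $x_{f,j} \ge 0$ is the number of packets of $f$ lost just before the $j$-th counting point on its path (switches first, destination last).

```latex
\begin{align}
\sum_{j=1}^{k_f + 1} x_{f,j} &= V_f^{src} - V_f^{dst}
  && \text{for each flow } f \text{ with } k_f \text{ switches on its path}, \\
\sum_{f \in F_s} \Bigl( V_f^{src} - \sum_{j \le p(f,s)} x_{f,j} \Bigr) &= C_s
  && \text{for each switch slot } s.
\end{align}
```

With end-host counts of 13 and 11, a single switch, and a recorded slot value of 20, the slot equation reduces to $x_{1,1} + x_{2,1} = 4$. The per-flow equations from the end-hosts supply the additional constraints needed to separate the two flows.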
More details in the paper
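As a rough illustration of the first consistency guarantee (each packet is counted in the same interval by all devices), the sketch below assumes the source end-host stamps its current interval number into the packet, and downstream devices count the packet under that stamped number instead of their own local clock. The class, the function names, and the dict-based header are hypothetical.

```python
from collections import defaultdict


class Device:
    """Counters kept separately per interval: (epoch, slot_index) -> packets."""

    def __init__(self):
        self.counts = defaultdict(int)

    def update(self, epoch, slot_index):
        self.counts[(epoch, slot_index)] += 1


def source_send(source, local_epoch, slot_index):
    # The source counts the packet in its own interval and stamps that interval.
    source.update(local_epoch, slot_index)
    return {"epoch": local_epoch, "slot_index": slot_index}


def downstream_receive(device, pkt):
    # Downstream devices use the stamped interval, not their local clock.
    device.update(pkt["epoch"], pkt["slot_index"])


def run_demo():
    source, switch = Device(), Device()
    for _ in range(3):
        pkt = source_send(source, local_epoch=5, slot_index=0)
        downstream_receive(switch, pkt)  # the switch's own clock may already read 6
    return dict(source.counts), dict(switch.counts)
```

In `run_demo()`, both devices end up with three packets under interval 5, even if the switch has already moved on to the next interval by its own clock.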
Implementation
Testbed
• End-hosts: DPDK
• Switch: P4
• Controller: C++
Simulator: 8-ary fat-tree
35
Host Overheads
36
<10% overheads when adding telemetry functionalities to PktGen
Hash lookup dominates the overheads
Switch Overheads
Fewer resources than sketch-based techniques
Only OmniMon achieves zero errors
37
OS: OmniMon that monitors only packet count
OF: OmniMon that monitors 9 statistics
FR: FlowRadar (NSDI 16)
UM: UnivMon (SIGCOMM 16)
ES: Elastic Sketch (SIGCOMM 18)
SL: SketchLearn (SIGCOMM 18)
Each sketch only monitors packet count
Sketch techniques
OmniMon
More Results
Controller overheads
Synchronization efficiency
Accountability
Scalability
Use case: anomaly detection
Use case: network failure diagnosis
Use case: load balance evaluation
38
Conclusion
39
OmniMon architecture: split-and-merge design
• Four partial operations
• Network-wide coordination
Consistency guarantee
• Network-wide synchronization with hybrid consistency model
Accountability guarantee
• Packet loss inference with linear systems
• Flow mapping algorithm
Prototype: DPDK + P4
Results: compared with 11 state-of-the-art solutions in various aspects
Source Code Available: https://github.com/N2-Sys/Omnimon