Page 1

OmniMon: Re-architecting Network Telemetry with Resource Efficiency and Full Accuracy

Qun Huang, Haifeng Sun, Patrick P. C. Lee, Wei Bai, Feng Zhu, Yungang Bao

Page 2

Flow-level Network Telemetry

[Figure: hardware switches, end-hosts, and a controller. Each packet carries (flowkey, packet values).]

Flow Statistics

Flow 1: Pkt count, ...
Flow 2: Pkt count, ...
Flow 3: Pkt count, ...
...
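As a rough illustration of what a per-flow statistics table holds, the following minimal sketch counts packets per flowkey. The function name `flow_statistics`, the 5-tuple flowkeys, and the byte-size packet values are assumptions made for this example, not details from the slides.

```python
from collections import Counter


def flow_statistics(packets):
    """Count packets per flowkey from an iterable of (flowkey, packet_values)."""
    counts = Counter()
    for flowkey, _packet_values in packets:
        counts[flowkey] += 1
    return dict(counts)


# Hypothetical 5-tuple flowkeys; the packet values here are byte sizes.
packets = [
    (("10.0.0.1", "10.0.0.2", 1234, 80, "tcp"), 1500),
    (("10.0.0.3", "10.0.0.2", 5678, 80, "tcp"), 64),
    (("10.0.0.1", "10.0.0.2", 1234, 80, "tcp"), 40),
]
stats = flow_statistics(packets)
```

Keeping such a table exactly, for every flow, on every device, is cheap in software but not on a memory-limited switch.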

Page 3

Goal

[Figure: hardware switches, end-hosts, and a controller holding the flow statistics table (Flow 1, Flow 2, Flow 3, ... with packet counts).]

Full Accuracy

Resource Efficiency

Page 4

Full Accuracy

1. Always-on: all time intervals
2. Network-wide: all devices
3. Complete: all flows
4. Correct: zero per-flow error

[Figure: hardware switches, end-hosts, and a controller holding the flow statistics table (Flow 1, Flow 2, Flow 3, ... with packet counts).]

Page 5

Resource Efficiency

End-Hosts: sufficient memory, programmability, slow CPU, limited visibility

Hardware Switches: fast ASIC, limited memory, limited programmability

Controller: sufficient CPU and memory, global visibility, limited bandwidth

1. Always-on: all time intervals
2. Network-wide: all devices
3. Complete: all flows
4. Correct: zero per-flow error

[Figure: the controller holding the flow statistics table (Flow 1, Flow 2, Flow 3, ... with packet counts).]

Page 6

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

Page 7

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

SNMP: coarse-grained

Page 8

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

SNMP: coarse-grained

Hash tables: high overheads

Page 9

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

SNMP: coarse-grained

Hash tables: high overheads

Event matching, top-k counting, sampling

Page 10

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

SNMP: coarse-grained

Hash tables: high overheads

Event matching, top-k counting, sampling: only partial flows

Page 11

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

SNMP: coarse-grained

Hash tables: high overheads

Event matching, top-k counting, sampling: only partial flows

Sketch: approximate results

Page 12

Existing Approaches: Trade-offs

[Figure: design space with two axes, Full Accuracy and Resource Efficiency.]

SNMP: coarse-grained

Hash tables: high overheads

Event matching, top-k counting, sampling: only partial flows

Sketch: approximate results

Our Goal

Page 13

Root Cause

[Figure: a controller and three telemetry operators.]

Operators are executed individually with limited collaboration

Page 14

Root Cause

[Figure: a controller and three telemetry operators, each with its own resource management, flowkeys, and values.]

Operators are executed individually with limited collaboration

Operators have to be heavy and sacrifice accuracy

Page 16

OmniMon

Re-architect network telemetry by distributed design

Question 1: Coordinate different entities for network telemetry?

Question 2: Reliable guarantees for the coordination?

Page 18

Split-and-Merge Architecture

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

Break heavy operators

Network-wide coordination

[Figure: the controller coordinating the four operations.]

Page 19

Flowkey Tracking

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two Flowkeys tables.]

Page 20

Value Update

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two Flowkeys tables.]

Page 21

Value Update

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress); a packet travels between them.]

Page 22

Mapping at End-Host

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress).]

Different strategies of slot mapping in end-hosts and switches

Page 23

Mapping at End-Host (Egress)

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress).]

Page 24

Mapping at End-Host (Ingress)

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress); a packet travels between them. The ingress slots are labeled "No Flowkeys".]

1. Embed location (HostIndex)
2. Locate slot

Page 25

Mapping at Switch

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress). Flow 1 and Flow 2 each have a SwitchIndex and a HostIndex.]

1. Global coordination
2. Embed index
3. Extract & update
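The mapping steps on the last few slides can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the class names, the dictionary-based packet header, and the modulo that stands in for the controller's global coordination are all hypothetical, and a single source host is assumed so that the host index alone identifies an ingress slot.

```python
from collections import defaultdict


class SourceHost:
    def __init__(self, num_switch_slots):
        self.num_switch_slots = num_switch_slots
        self.flow_table = {}    # flowkey -> (host_index, switch_index)
        self.egress_slots = []  # one exact counter per tracked flow

    def send(self, flowkey):
        if flowkey not in self.flow_table:
            host_index = len(self.egress_slots)
            # Assumption: a modulo stands in for the controller's
            # global coordination of switch indexes.
            switch_index = host_index % self.num_switch_slots
            self.flow_table[flowkey] = (host_index, switch_index)
            self.egress_slots.append(0)
        host_index, switch_index = self.flow_table[flowkey]
        self.egress_slots[host_index] += 1
        # Both indexes are embedded in the packet, so downstream devices
        # never need to store or look up the flowkey.
        return {"host_index": host_index, "switch_index": switch_index}


class Switch:
    def __init__(self, num_slots):
        self.slots = [0] * num_slots  # shared slots, no flowkeys

    def on_packet(self, packet):
        # Extract the embedded index and update the slot.
        self.slots[packet["switch_index"]] += 1


class DestHost:
    def __init__(self):
        self.ingress_slots = defaultdict(int)  # host_index -> count

    def on_packet(self, packet):
        self.ingress_slots[packet["host_index"]] += 1


src, sw, dst = SourceHost(num_switch_slots=1), Switch(num_slots=1), DestHost()
for flowkey, n in (("flow1", 13), ("flow2", 11)):
    for _ in range(n):
        pkt = src.send(flowkey)
        sw.on_packet(pkt)
        dst.on_packet(pkt)
```

With one shared switch slot, the two flows collide at the switch while the end-hosts still hold exact per-flow counts.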

Page 26

Collective Analysis

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress).]

Collect results from end-hosts and switches to form final flow statistics

Page 27

Collective Analysis

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress); the controller holds per-flow results for Flow 1, Flow 2, Flow 3, ...]

Exploit end-host information to decompose switch slots

Page 28

Collective Analysis (Detail)

[Figure: a switch slot holds the value 24. Two source end-hosts hold zero-error flowkeys and values: Flow 1 with egress slot value 13 and Flow 2 with egress slot value 11, each with a switch index. The switch slot is decomposed into Flow 1: 13 and Flow 2: 11.]
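The decomposition in this example can be sketched as below for the loss-free case. The function name `decompose_switch_slots` and the record layout `(flowkey, switch_index, count)` are assumptions made for illustration; the slides do not specify the controller's data format.

```python
def decompose_switch_slots(host_records, switch_slots):
    """Split each shared switch slot into per-flow values.

    host_records: (flowkey, switch_index, count) tuples reported by source
    end-hosts, which track flows exactly.
    switch_slots: slot values reported by one switch.
    """
    per_slot = {index: {} for index in range(len(switch_slots))}
    for flowkey, switch_index, count in host_records:
        per_slot[switch_index][flowkey] = count
    # Without loss or delay, the host-side counts must add up to the slot.
    for index, slot_value in enumerate(switch_slots):
        if sum(per_slot[index].values()) != slot_value:
            raise ValueError(f"slot {index}: hosts and switch disagree")
    return per_slot


host_records = [("flow1", 0, 13), ("flow2", 0, 11)]
result = decompose_switch_slots(host_records, switch_slots=[24])
```

When the sums disagree, the sketch only raises an error; resolving such a mismatch per flow is a separate problem that this function does not attempt.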

Page 29

Putting It Together

Network telemetry: Flowkey Tracking, Value Updating, Resource Management, Collective Analysis

[Figure: the controller and two end-hosts, each with Slots (Ingress), Flowkeys, and Slots (Egress).]

Switches: Shared Slots
End-Hosts: Hash Table
Controller: Global Info.

[Figure labels: Exact Per-flow Tracking, Affordable Operations, Low Memory Usage, Simple Updates, Switch Index Mapping, Collective Analysis]

Page 30

OmniMon

Re-architect network telemetry by distributed design

Question 1: Coordinate different entities for network telemetry?

Question 2: Reliable guarantees for the coordination?

Page 31

Unreliable Events

Lack of global clock
• Devices reside in different intervals

Packet loss
• Flow values are missing in some devices

Page 32

Impact of Unreliable Events

[Figure: Flow 1 has 13 packets and Flow 2 has 11, so the expected total is 24. Long delay: some packets are recorded by other intervals. Packet loss: some packets are not recorded.]

Page 33

Impact of Unreliable Events

[Figure: Flow 1 has 13 packets and Flow 2 has 11, so the expected total is 24. Long delay: some packets are recorded by other intervals. Packet loss: some packets are not recorded. The recorded value is 20, and the per-flow values are unknown (Flow 1: ???, Flow 2: ???).]
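The numbers on this slide can be written as a small worked equation. Here $d_1$ and $d_2$ are symbols introduced for illustration (not from the slides): the packets of Flow 1 and Flow 2 that are missing from the recorded value in this interval, whether lost or recorded in another interval. The 20 is read as the value actually recorded.

```latex
\begin{aligned}
\text{expected} &= 13 + 11 = 24, \qquad \text{recorded} = 20,\\
d_1 + d_2 &= 24 - 20 = 4, \qquad 0 \le d_1 \le 13,\quad 0 \le d_2 \le 11 .
\end{aligned}
```

One equation in two unknowns has several solutions (for example $d_1 = 4, d_2 = 0$ or $d_1 = 1, d_2 = 3$), which is why the per-flow values cannot be read off the recorded value alone.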

Page 34

Reliability Guarantees

Lack of global clock

Hybrid consistency model
• Network-wide synchronization

Guarantees
• Each packet is included in the same interval by all devices
• All end-hosts reside in the same time interval most of the time

Packet loss

Loss inference
• Linear system with DCN-specific optimizations
• Flow mapping algorithm

Guarantees
• Per-switch, per-flow loss inference in common cases

More details in the paper
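One way to picture the first consistency guarantee is to let the source stamp each packet with its current interval (epoch) and have every device count the packet under that stamped epoch rather than its own local clock. The sketch below assumes this mechanism for illustration only; the slides defer the actual design to the paper, and the `Device` class, the `send` helper, and the dictionary packet are hypothetical.

```python
from collections import defaultdict


class Device:
    def __init__(self, name, local_epoch=0):
        self.name = name
        self.local_epoch = local_epoch    # unsynchronized local view of time
        self.counters = defaultdict(int)  # (epoch, flowkey) -> packet count

    def record(self, packet):
        # Use the epoch carried by the packet, not the local one.
        self.counters[(packet["epoch"], packet["flowkey"])] += 1


def send(source, path, flowkey):
    """Stamp a packet with the source's epoch and record it along the path."""
    packet = {"flowkey": flowkey, "epoch": source.local_epoch}
    source.record(packet)
    for device in path:
        device.record(packet)


# The three devices disagree about the current epoch.
host_a = Device("host_a", local_epoch=5)
switch = Device("switch", local_epoch=4)
host_b = Device("host_b", local_epoch=6)
for _ in range(3):
    send(host_a, [switch, host_b], "flow1")
```

All three devices attribute the packets to epoch 5 even though their local clocks differ, so per-epoch counters can be compared across devices.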

Page 35

Implementation

Testbed
• End-hosts: DPDK
• Switch: P4
• Controller: C++

Simulator: 8-ary fat-tree

Page 36

Host Overheads

<10% overheads when adding telemetry functionalities to PktGen

Hash lookup dominates the overheads

Page 37

Switch Overheads

Fewer resources than sketch-based techniques

Only OmniMon achieves zero errors

OmniMon
OS: OmniMon that monitors only packet count
OF: OmniMon that monitors 9 statistics

Sketch techniques (each sketch only monitors packet count)
FR: FlowRadar (NSDI 16)
UM: UnivMon (SIGCOMM 16)
ES: Elastic Sketch (SIGCOMM 18)
SL: SketchLearn (SIGCOMM 18)

Page 38

More Results

Controller overheads

Synchronization efficiency

Accountability

Scalability

Use case: anomaly detection

Use case: network failure diagnosis

Use case: load balance evaluation

Page 39

Conclusion

OmniMon architecture: split-and-merge design
• Four partial operations
• Network-wide coordination

Consistency guarantee
• Network-wide synchronization with hybrid consistency model

Accountability guarantee
• Packet loss inference with linear systems
• Flow mapping algorithm

Prototype: DPDK + P4

Results: compared with 11 state-of-the-art solutions in various aspects

Source Code Available: https://github.com/N2-Sys/Omnimon