Panoptes: Network Telemetry Ecosystem Varun Varma, Sr. Principal Engineer March 10, 2019 SCALE 17x

Collect, store, analyze & visualize network telemetry

10 second primer on Network Telemetry

Network Telemetry

● Collection of metrics and state from network devices

● The dominant protocol to collect telemetry is SNMP (Simple Network Management Protocol)

○ SNMPv1 and SNMPv2c send data unencrypted over UDP (SNMPv3 adds optional encryption)

○ SNMP was first defined in 1988; SNMPv2 followed in 1993

● APIs, Agents and Streaming Telemetry are becoming mainstream

How is this problem different?

Why not use ‘x’?

● High rate of change network

○ Static configuration is out of the question

● Primitives unique to network telemetry

○ E.g. rate conversion, enrichments

● Decoupling of collection, processing, and storage

● Python

Complexifiers

● We have to poll as pushing metrics from devices isn’t supported universally

○ Polling is expensive on devices

● Vendor/Platform/OS Diversity

● Scale

Meet Panoptes

Panoptes

● Greenfield Python based network telemetry platform

● Built @Yahoo, now Verizon Media

● Provides real time telemetry collection and analytics

● Implements discovery, enrichment, polling, distribution bus and numerous consumers

Architecture

System Requirements

● Multiple methods to collect data

○ SNMP, APIs, CLI, Streaming

● Horizontal Scalability

○ No Single Point Of Failure

● Multiple, extensible, ways to consume data

● Survive Network Partitions

Platform

[Architecture diagram. Labels: Celery, Redis, ZooKeeper, Kafka; Time Series DB; Plugin Framework; Discovery Plugins, Polling Plugins, Enrichment Plugins; Device Specific Plugins (SNMP, API); CMDB; Configuration Mgmt]

Framework Requirements

● Configuration Parsing

● Logging Management

● Plugin Management

● Work Queue Management

● Message Bus

● Distributed Locking and Leader Election

● Persistence

● Caching

● Federation

Tech Stack

Framework Requirement                   Choice
Language                                Python
Configuration Parsing                   ConfigObj
Logging                                 Logging Facility + rsyslog
Plugin Management                       yapsy
Work Queue Management                   Celery
Message Bus                             kafka-python + Kafka
Distributed Locking, Leader Election    Kazoo + ZooKeeper
Persistence                             OpenTSDB, Django + MySQL
Caching                                 redis-py + Redis
Federation                              Django + MySQL

Core Concepts

Plugins

● Python classes conforming to a well defined API

● Can collect/process and transform data from any source

○ SNMP

○ API

○ CLI

○ *

● Can be of three types:

○ Discovery

○ Enrichment

○ Metrics
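As a rough illustration of "Python classes conforming to a well defined API", the sketch below defines a shared base class and one concrete plugin. The names `Plugin`, `run`, `InterfaceCounterPlugin` and `run_plugins` are hypothetical and are not taken from the Panoptes code base; the real plugin API may look quite different.

```python
from abc import ABC, abstractmethod


class Plugin(ABC):
    """Hypothetical base class: every plugin exposes the same entry point."""

    plugin_type = "base"

    @abstractmethod
    def run(self, context: dict) -> list:
        """Collect or transform data and return a list of result dicts."""


class InterfaceCounterPlugin(Plugin):
    """Hypothetical metrics plugin; a real one would query SNMP, an API or a CLI."""

    plugin_type = "metrics"

    def run(self, context: dict) -> list:
        # Stand-in for a device query: echo the counters supplied by the caller.
        return [
            {"resource": context["resource_id"], "metric": name, "value": value}
            for name, value in sorted(context["counters"].items())
        ]


def run_plugins(plugins: list, context: dict) -> list:
    """Run every plugin through the shared API and merge the results."""
    results = []
    for plugin in plugins:
        results.extend(plugin.run(context))
    return results
```

Because the caller only depends on `run`, a discovery or enrichment plugin could be swapped in without changing `run_plugins`.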

Resources

● Abstract representations of what should be monitored

○ In the context of network telemetry, these would usually be the network devices to monitor

● ‘Discovered’ using discovery plugins

○ Usually would talk to a Configuration Management Database but could also be from topology walks

● Have an id, endpoint and various metadata

○ For example, the vendor name or operating system version of a device would be its metadata

● Specified within Panoptes with a DSL

○ Example: “resource_class” = “network” AND “resource_subclass” = “switch” AND “resource_type” = “cisco” AND “resource_metadata.os_version” LIKE “12.2%”
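To make the example filter concrete, here is a minimal sketch that evaluates the same conditions against an in-memory resource. The `Resource` dataclass, `sql_like` and `matches_example_filter` are hypothetical helpers written for illustration; they do not reproduce how Panoptes parses or executes its DSL.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource record: id, endpoint, classification and metadata."""
    resource_id: str
    endpoint: str
    resource_class: str
    resource_subclass: str
    resource_type: str
    metadata: dict = field(default_factory=dict)


def sql_like(value: str, pattern: str) -> bool:
    """Approximate SQL LIKE: '%' matches any run of characters, '_' exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def matches_example_filter(resource: Resource) -> bool:
    """Hand-written equivalent of the example filter on the slide."""
    return (
        resource.resource_class == "network"
        and resource.resource_subclass == "switch"
        and resource.resource_type == "cisco"
        and sql_like(resource.metadata.get("os_version", ""), "12.2%")
    )
```

A Cisco switch whose `os_version` metadata starts with `12.2` would match; the same switch on another OS version would not.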

Metrics

● Numbers that can be measured and plotted

○ Example is the bytes in/bytes out counter of an interface

● Generally fast changing or have the potential to be

● Can be collected through various means:

○ SNMP

○ API

○ CLI

○ Streaming
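Interface byte counters are cumulative, so plotting them usually means converting two samples into a rate. The sketch below shows one common way to do that, including a single counter wrap; the function name and the wrap-handling policy are assumptions for illustration, not the Panoptes implementation.

```python
def counter_rate(prev_value, curr_value, interval_seconds, counter_bits=64):
    """Convert two samples of a cumulative counter into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter wrapped exactly once between the samples.
        # A device reset looks the same and cannot be told apart here.
        delta += 2 ** counter_bits
    return delta / interval_seconds
```

For example, two samples of 1000 and 7000 bytes taken 60 seconds apart give a rate of 100 bytes per second.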

Enrichments

● Are metadata in addition to metrics

○ For interfaces, we collect metrics like bytes in and bytes out and enrichments like interface name and description

● Can be any data type

○ Unlike metrics which can only be numeric

● Can come from sources other than the device being monitored

○ For example, the geolocation of the device or the ASN-to-name mapping

Enrichments Cont...

● Usually are more expensive to process than metrics

○ Might need complex transformations and therefore...

■ Are collected at a rate less than those for metrics

● We collect interface metrics every 60 seconds, but enrichments every 30 minutes

■ Are cached

● Allows us to scale more by being efficient about data collection
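The idea of collecting enrichments rarely, caching them, and attaching them to frequently polled metrics can be sketched as follows. This is a minimal in-memory stand-in (the tech stack slide lists Redis for caching); `EnrichmentCache`, `enrich` and the `interface` key are hypothetical names.

```python
import time


class EnrichmentCache:
    """Tiny in-memory TTL cache used here as a stand-in for a shared cache."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader   # expensive function: key -> enrichment dict
        self._clock = clock     # injectable so the expiry logic can be tested
        self._entries = {}      # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or entry[0] <= now:
            # Missing or expired: pay the expensive collection cost once.
            value = self._loader(key)
            self._entries[key] = (now + self._ttl, value)
            return value
        return entry[1]


def enrich(metric, cache):
    """Attach cached enrichments to a freshly polled metric."""
    return {**metric, **cache.get(metric["interface"])}
```

With a TTL of 1800 seconds and a 60-second polling interval, the expensive loader runs once for roughly every 30 polls of the same interface.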

Data Encoding & Distribution

● Panoptes is a distributed system

○ Discovery, enrichment and polling are all decoupled

● Kafka and/or Redis are used to pass data between all subsystems

○ This makes it so that you can extend or introspect any subsystem

● JSON is used to encode all data within Panoptes

○ It’s non-performant but developer/operator friendly
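A small sketch of what JSON-encoded messages between subsystems might look like. The message layout (`resource`, `timestamp`, `metrics`) is an assumption made for this example and is not the actual Panoptes schema; publishing to Kafka or Redis is left out so the snippet stays self-contained.

```python
import json

REQUIRED_KEYS = {"resource", "timestamp", "metrics"}


def encode_metrics(resource_id, timestamp, metrics):
    """Serialize one polling result to a JSON string for the message bus."""
    message = {"resource": resource_id, "timestamp": timestamp, "metrics": metrics}
    # sort_keys and compact separators give a stable, compact wire format.
    return json.dumps(message, sort_keys=True, separators=(",", ":"))


def decode_metrics(payload):
    """Parse a message back into a dict, rejecting ones without the expected keys."""
    message = json.loads(payload)
    if not isinstance(message, dict):
        raise ValueError("message must be a JSON object")
    missing = REQUIRED_KEYS - set(message)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return message
```

Because the payload is plain text, an operator can inspect messages on the bus with ordinary tools, which is the developer/operator friendliness the slide trades performance for.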

Workflow

[Workflow diagram. Labels: Collect Data, Message Bus, Post Process, TSDB, RDBMS, API, UI, CLI, Graphing, Alerting, Grid, Analytics/Reporting]

Scaling & Operations

Scale: Orders of Magnitude

10M Time Series

100K Network Interfaces

10K Network Devices

100 Network Sites

60 Seconds
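A back-of-the-envelope calculation gives a feel for what these orders of magnitude imply. It assumes every time series yields one sample per 60-second interval, which the slide suggests but does not state explicitly, so treat the result as a rough estimate.

```python
def datapoints_per_second(time_series: int, interval_seconds: float) -> float:
    """Average ingest rate if every series yields one sample per interval."""
    return time_series / interval_seconds


# Orders of magnitude taken from the slide.
TIME_SERIES = 10_000_000
INTERFACES = 100_000
DEVICES = 10_000
SITES = 100
INTERVAL_SECONDS = 60

rate = datapoints_per_second(TIME_SERIES, INTERVAL_SECONDS)
interfaces_per_device = INTERFACES / DEVICES
devices_per_site = DEVICES / SITES
```

Under that assumption the platform ingests on the order of 167,000 data points per second, with about 10 interfaces per device and 100 devices per site on average.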

Scaling Issues

● Panoptes was built to be horizontally scalable and free of single points of failure from day one

○ Performance and high availability are not easy to bolt on afterwards

● We chose Python to be developer friendly but it wasn’t fast enough

○ High throughput actions are delegated to C extension modules

● The same applies to JSON serialization, which is used for all data

● We broke everything - Redis, ZooKeeper, Kafka

○ Redis allows ‘only’ 10,000 clients to be connected by default :)

Divide & Conquer: Federated API

● Due to availability concerns, each site has its own MySQL cluster

○ Telemetry data must be available during a network partition

○ Centralized telemetry store might not be reachable in all cases

● Each API endpoint acts as a tribe node

○ If a tribe node doesn’t have the requested data, it returns a pointer to the node that does through a find API
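The lookup behaviour described above can be sketched with a few in-memory nodes. Everything here is hypothetical: the `SiteNode` class, the response shapes, and the shared `directory` used to locate the owning site are assumptions for illustration, since the slides do not describe how the find API is implemented.

```python
class SiteNode:
    """Hypothetical per-site API node holding only its own site's data."""

    def __init__(self, name, local_data, directory):
        self.name = name
        self._local = local_data      # device -> telemetry stored at this site
        self._directory = directory   # device -> name of the site that owns it

    def get(self, device):
        """Return local data, or a pointer to the node that has it."""
        if device in self._local:
            return {"status": "ok", "site": self.name, "data": self._local[device]}
        owner = self._directory.get(device)
        if owner is None:
            return {"status": "not_found"}
        return {"status": "redirect", "site": owner}


def federated_get(nodes, entry_site, device):
    """Ask any node; follow at most one pointer to the owning node."""
    response = nodes[entry_site].get(device)
    if response["status"] == "redirect":
        response = nodes[response["site"]].get(device)
    return response
```

A client can therefore start at whichever site it can reach, and data for a site stays queryable locally even when other sites are partitioned away.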

[Diagram: eight federated sites, DC1 through DC8]

Covered Systems

● Interface metrics for Arista, Cisco, Juniper, A10, Brocade

● System metrics for A10 (AX, TH), Arista EOS, Brocade TrafficWorks, Cisco IOS, Cisco IOS-XE, Cisco NX-OS, Juniper (MX, SRX)

● Functional metrics for VIPs (A10 AX, TH, Brocade), A10 LSN, Juniper SRX

Operational Experiences

● Metrics across different platforms or versions of even the same OS from vendors aren’t consistent

○ Normalizing these metrics was our single biggest time drain

● SNMP has its faults but is still ubiquitous

○ Especially in a multi-vendor, multi-platform, and multi-generational network

● Performance of APIs was much better than SNMP

● Using Kafka proved to be the right choice: we already have three separate consumers

Operational Experiences Cont...

● We don’t expose ‘raw’ data to external systems

○ It’s tempting to give access to external teams via Kafka, but that would lead to friction if we want to change our internals

○ Instead, we expose APIs which abstract away all our internals

● We push metrics to our in-house time series database and alerting service

○ Custom dashboard service our user base is familiar with

○ Economies of scale – no need to provision new hardware or software

● Custom UIs are useful and enabled by APIs

Performance

Throughput = Speed x Parallelism

Throughput = Speed x Parallelism x Productivity

“Optimize for your most expensive resource”

- Nick Humrich, “Yes, Python is Slow, and I Don’t Care”

Scaling Vertically: aka Speed

Profile it!

Our single slowest operation? JSON Schema Validation

Begin with the basics

https://wiki.python.org/moin/PythonSpeed

https://wiki.python.org/moin/PythonSpeed/PerformanceTips

● List comprehensions

● Built-ins

● Local vs. global
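The three tips can be illustrated side by side. These variants are typically faster in CPython than their naive counterparts, but the size of the gain depends on the interpreter version and workload, so measure before relying on it. The function names are made up for this sketch.

```python
import math


def squares_loop(n):
    """Baseline: explicit loop with repeated list.append calls."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """List comprehension: same result with less per-item overhead."""
    return [i * i for i in range(n)]


def total_builtin(values):
    """Built-in sum is implemented in C, unlike a hand-written loop."""
    return sum(values)


def roots_local(values):
    """Bind the global attribute lookup to a local name once, outside the loop."""
    sqrt = math.sqrt
    return [sqrt(v) for v in values]
```

To compare two variants, something like `timeit.timeit(lambda: squares_loop(1000), number=1000)` against the comprehension version gives a quick, if noisy, measurement.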

Tools

● cProfile

○ Built-in since Python 2.5

○ pstats lets you do slicing/dicing/reporting

○ Use with a signal handler to profile daemon processes

● objgraph

○ Hunt down memory leaks

○ Draw graphs of object counts and relations
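One way to use cProfile with a signal handler in a long-running process is to toggle profiling on each signal. The sketch below uses only the standard library; the `SignalProfiler` class is a hypothetical helper, and the choice of signal (for example `signal.SIGUSR1`, which is Unix-only) is an assumption rather than what Panoptes does.

```python
import cProfile
import io
import pstats
import signal


class SignalProfiler:
    """Toggle cProfile on and off each time a signal arrives."""

    def __init__(self, top=10):
        self._profiler = None
        self._top = top
        self.last_report = ""

    @property
    def active(self):
        return self._profiler is not None

    def handle(self, signum, frame):
        """Signal handler: first call starts profiling, the next one stops it."""
        if self._profiler is None:
            self._profiler = cProfile.Profile()
            self._profiler.enable()
            return
        self._profiler.disable()
        out = io.StringIO()
        stats = pstats.Stats(self._profiler, stream=out)
        stats.sort_stats("cumulative").print_stats(self._top)
        # A real daemon would write this to a log or a file instead.
        self.last_report = out.getvalue()
        self._profiler = None

    def install(self, signum):
        """Register the handler; must be called from the main thread."""
        signal.signal(signum, self.handle)
```

After `SignalProfiler().install(signal.SIGUSR1)`, sending the signal once with `kill -USR1 <pid>` starts profiling and sending it again stops it and produces the report, without restarting the daemon.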

cProfile

import cProfile
import re
cProfile.run('re.compile("foo|bar")', 'restats')

      197 function calls (192 primitive calls) in 0.002 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.001    0.001 <string>:1(<module>)
     1    0.000    0.000    0.001    0.001 re.py:212(compile)
     1    0.000    0.000    0.001    0.001 re.py:268(_compile)
     1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
     1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
     4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
   3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)

objgraph

Use C Extension Modules

Digits   floats   decimal   cdecimal   cdecimal-nt   gmpy
9        0.12s    17.61s    0.27s      0.24s         0.52s
19       -        42.75s    0.58s      0.55s         0.52s
38       -        -         1.32s      1.21s         1.07s
100      -        -         4.52s      4.08s         3.57s

cDecimal vs. Decimal (in Python < 3.3): Pi, 64-bit, 10,000 iterations, 3.16GHz Core 2 Duo

Source: http://www.bytereef.org/mpdecimal/benchmarks.html

Cache Properties

https://github.com/pydanny/cached-property
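The same idea is available in the standard library as `functools.cached_property` (Python 3.8 and later), which is used here as a stand-in for the linked third-party package. The `Interface` class and its attributes are hypothetical.

```python
from functools import cached_property


class Interface:
    """Hypothetical object whose description is expensive to derive."""

    def __init__(self, name, raw_description):
        self.name = name
        self._raw_description = raw_description
        self.parse_count = 0  # counts how often the expensive work runs

    @cached_property
    def description(self):
        # Runs on first access only; the result is stored on the instance.
        self.parse_count += 1
        return self._raw_description.strip().lower()
```

Repeated reads of `description` reuse the stored value, and `del interface.description` clears it so the next access recomputes.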

Scaling Horizontally: aka Parallelism

Celery!

Scale across processes, CPUs, and hosts

http://www.celeryproject.org/

How Celery fixed Python's GIL problem

Choose & test dependent systems that scale horizontally

Compare system performance with all features

TBD

Cython, Async I/O, More C extension modules

Future: Streaming Telemetry

Proposed Design

[Diagram. Labels: Device 1, Device 2, Device 3, Device n; Streaming Telemetry Collector; Resource Cache; Enrichment Cache; Panoptes Framework; Celery, Redis, ZooKeeper, Kafka]

Pretty Pictures

APIs

● Realtime - purpose specific

● Bulk/Historical - Generic

Custom UIs

And now: a special offer just for you...

getpanoptes.io

What you get

● Docker container

● Discovery, enrichment and polling of the interfaces of the host you deploy on

● InfluxDB as the TSDB

● Grafana as the dashboarding system

Sample InfluxDB/Grafana Dashboard

Why?

● Docker container

● Discovery, enrichment and polling of the interfaces of the host you deploy on

● InfluxDB as the TSDB

● Grafana as the dashboarding system

Feedback & Contributions

● Try it out!

● Find and fix bugs

● Tell your friends, family, and colleagues

● Can be used for more than just network telemetry

Thank you