An Annotation Layer for Network Management

George Porter, Arne Baste, David Chu, Dilip Joseph, Randy H. Katz

NetRads Retreat - June 2005

Goal of today’s talk

Snapshot of our thinking in this area

Several open research problems as to appropriateness of piggybacking, effectiveness of distributed observation, etc.

Your feedback appreciated

Outline

Motivating example: Discovering and protecting network service performance during stress

PNEs as A-Layer building block

Overview: Annotation layer as provider of component building block for network management

Revisit network service example with A-Layer

Research challenges, open issues, opportunities

Motivating Example: Network service slowdown/failure

Problem: Users in the access tier complain of slow web access, can’t mount files, and see “DNS operation timed out” messages

This problem started today at 10am

Where to begin?

Network connectivity between users and outside seems ok

But name resolution is intermittent and slow

We need tools to figure out who is affected, who isn’t affected, the cause, and a solution.

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, and FTP servers]

Motivating Example: Network service slowdown/failure

Network connectivity to DNS? [ping, traceroute]

Are DNS requests making it to the server tier?

What is happening to the request completion rate (is it lower)? Vs network path losses (i.e., is it the path or the service?)

DNS server CPU level up

Localize the problem: Only this user? Or other clients? Just that server?

What is happening to the DNS req/reply completion rate of other servers in that cluster?

Correlations? Is this user anomalous?

So far: DNS overloaded, leading to timeouts on client end

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, and FTP servers]

Why is the service overloaded?

Is there an unusual number of requests from other sources? [deviation from the mean]

What is the status of requests to this service network-wide? How has it changed since before the first reports of the problem?

We discover that the number of DNS requests from access and ISP networks is unchanged (must be in server tier)

Other correlations? Yes, to SMTP traffic at ISP ingress

We suspect the endpoint of SMTP traffic, a spam appliance, as the cause of DNS performance loss

No unusual surges of DNS from access or ISP (from outside our enterprise network)

Thus originating inside the server tier

And correlated to SMTP traffic
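The "deviation from the mean" check on this slide can be sketched as a z-score against a recent baseline. This is only an illustration: the vantage-point names, request counts, and the threshold of 3 standard deviations are made-up assumptions, not measurements or parameters from the talk.

```python
from statistics import mean, pstdev

def deviation_from_mean(history, current):
    """Return how many standard deviations `current` lies from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma

# Hypothetical DNS requests per minute seen at three vantage points.
baseline = {
    "access": [100, 104, 98, 102, 96],
    "isp": [50, 52, 48, 51, 49],
    "server_tier": [20, 22, 18, 21, 19],
}
now = {"access": 101, "isp": 50, "server_tier": 400}

THRESHOLD = 3.0  # assumed cut-off for "unusual"
unusual = [
    name for name in baseline
    if abs(deviation_from_mean(baseline[name], now[name])) > THRESHOLD
]
print(unusual)
```

With these invented numbers only the server tier is flagged, which mirrors the reasoning on the slide: access and ISP request volumes are unchanged, so the surge must originate inside the server tier.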

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, FTP, and SMTP servers]

Eliminate false positives: testing this conjecture via experimental intervention

Temporarily b/w throttle SMTP traffic from ISP ingress

Test DNS latency from access network

Find that DNS latency goes down when SMTP volume goes down

We enact a new (but temporary) policy:

Redirect requests from access tier to secondary or tertiary DNS server (service separation for different users)

BW regulate SMTP traffic to keep DNS server CPU load from peaking

Access users’ service restored; their traffic is protected.

Problem localized and mitigated

Long term solution: software upgrade, firmware upgrade, add dedicated DNS cache for appliance
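The intervention step (throttle SMTP, then re-measure DNS latency) amounts to comparing two measurement windows. The sketch below assumes latency samples are already collected for each window; the numbers are invented for illustration.

```python
from statistics import mean

def intervention_effect(baseline_ms, throttled_ms):
    """Relative change in mean latency while the intervention is active.

    A negative value means latency dropped during the intervention.
    """
    before = mean(baseline_ms)
    during = mean(throttled_ms)
    return (during - before) / before

# Hypothetical DNS lookup latencies (ms) measured from the access network.
baseline_ms = [800, 900, 1000, 900]   # SMTP unthrottled
throttled_ms = [80, 100, 90, 90]      # SMTP throttled at the ISP ingress

effect = intervention_effect(baseline_ms, throttled_ms)
print(f"DNS latency change under throttling: {effect:.0%}")
```

Repeating the on/off cycle several times and checking that the effect is stable would give more confidence that the link is causal rather than coincidental.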

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, FTP, and SMTP servers]

Example Review

Localizing and identifying problem required:

Network-wide visibility despite stressed links/servers

Path information (network connectivity, protocol request/reply completion information)

Finding changes in behavior (avg # requests/unit time, rate of change of traffic)

Finding correlations between traffic (traffic classes, volume, network level paths)

Experimental intervention (correlation to causation)

Enabling new policy (redirecting traffic to secondary server, BW throttling/fencing misbehaving flows)

Principles for network management

Network-wide visibility despite surges/overload/high loss rates

Low overhead

Path statistics gathering

Some protocol visibility (TCP, IP, services like DNS, NFS)

Need to discover: changes to request-reply rate, completions, latency over time; correlations between different flows, protocols, parts of the network

New policies (Actions): for experimental intervention (root cause discovery); to protect good traffic; BW shaping, blocking, scheduling, fencing, selective drop

Security: against non-operators using this infrastructure; against DoS attacks

PNEs (Programmable Network Elements) and iBoxes

Inspection-and-action points

Deep, multiprotocol packet inspection

No routing, just observation and marking

Actions: selective drop, b/w fencing and shaping, notification of operators, query “points of observation”

Some protocol visibility to TCP, UDP, ‘good’ network service protocols like DNS/NFS

Per-flow session state and reverse path visibility

Per-flow and per-path simple statistics gathering (latencies, round trip times, requests/sec, source and destination addresses)
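Per-flow statistics of this kind can be sketched by matching replies to requests at a point that sees both directions of a flow. The code below is an assumption-laden illustration: the event format, the `FlowStats` class, and the host names are hypothetical, and matching on a transaction id stands in for whatever protocol parsing an iBox would actually perform.

```python
from dataclasses import dataclass, field

@dataclass
class FlowStats:
    requests: int = 0
    replies: int = 0
    latencies: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)  # transaction id -> request time

    def on_request(self, txid, ts):
        self.requests += 1
        self.pending[txid] = ts

    def on_reply(self, txid, ts):
        sent = self.pending.pop(txid, None)
        if sent is None:
            return  # reply with no matching request: ignore it
        self.replies += 1
        self.latencies.append(ts - sent)

    def completion_rate(self):
        return self.replies / self.requests if self.requests else 1.0

flows = {}  # (client, server) -> FlowStats

def observe(kind, client, server, txid, ts):
    """Update per-flow state for one observed request ("req") or reply ("rep")."""
    stats = flows.setdefault((client, server), FlowStats())
    if kind == "req":
        stats.on_request(txid, ts)
    else:
        stats.on_reply(txid, ts)

events = [
    ("req", "10.0.0.5", "dns1", 1, 0.0),
    ("rep", "10.0.0.5", "dns1", 1, 0.25),
    ("req", "10.0.0.5", "dns1", 2, 1.0),
    ("rep", "10.0.0.5", "dns1", 2, 1.5),
    ("req", "10.0.0.5", "dns1", 3, 2.0),  # never answered
]
for event in events:
    observe(*event)

stats = flows[("10.0.0.5", "dns1")]
print(stats.completion_rate(), stats.latencies)
```

A falling completion rate with healthy path statistics points at the service rather than the network, which is the distinction the motivating example relies on.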

Annotation Layer

Explicit layer for iBox-to-iBox communication via packet annotations

Annotations:

Fixed size

Encoded to enable the de-annotation of packets

Multiple payload types based on any layer of the flow

Security field for authentication

[Figure: three iBoxes exchanging an annotated packet; example annotation “url: X”]

A-Layer Annotation Design

[Figure: layout of one annotation (AL) unit drawn against a 32-bit ruler (bits 0 to 31), with fields Type, Prior Protocol, Source Address, Destination Address, Sequence Number, Authentication Field, and Annotation Layer Payload]

AL unit headers (14 bytes)

Authentication Field (10 bytes)

12 bytes of payload in one AL unit
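One way to picture a fixed-size AL unit is as a packed binary record. The slide gives only the totals (14 header bytes, 10 authentication bytes, 12 payload bytes), so the per-field widths below are an assumption chosen to add up to 14: type 1, prior protocol 1, source address 4, destination address 4, sequence number 4. IPv4 addresses are assumed, and the prior-protocol field is assumed to carry the original IP protocol number so the annotation can be removed later.

```python
import ipaddress
import struct

# Assumed layout (network byte order, no padding):
# type(1) prior_proto(1) src(4) dst(4) seq(4) | auth(10) | payload(12) = 36 bytes
AL_UNIT = struct.Struct("!BB4s4sI10s12s")

def pack_unit(ann_type, prior_proto, src, dst, seq, auth, payload):
    """Serialize one AL unit; auth and payload are zero-padded to full width."""
    if len(auth) > 10 or len(payload) > 12:
        raise ValueError("auth is at most 10 bytes, payload at most 12 bytes")
    return AL_UNIT.pack(
        ann_type,
        prior_proto,
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed,
        seq,
        auth,
        payload,
    )

def unpack_unit(data):
    """Parse one AL unit back into a dict of fields."""
    ann_type, prior_proto, src, dst, seq, auth, payload = AL_UNIT.unpack(data)
    return {
        "type": ann_type,
        "prior_proto": prior_proto,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
        "seq": seq,
        "auth": auth,
        "payload": payload,
    }

# 17 is the IP protocol number for UDP; the payload text is made up.
unit = pack_unit(1, 17, "169.229.62.1", "128.40.1.3", 7, b"\x00" * 10, b"dns-rtt:42ms")
print(len(unit), unpack_unit(unit)["dst"])
```

Because every unit has the same size under this assumed layout, stacked annotations could be read back in fixed 36-byte steps.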

Encode annotations in between IP and transport

Allow annotations to be stacked (multiple)

Annotations are removed by iBoxes before reaching endhosts

Motivation: start with large (but versatile) annotation format

When we discover the set of annotations that are most effective for network management, we can reduce the footprint to support that set

Categories of annotations

Netflows; Alteon, Packeteer; SNMP proxy

iBox placement

In an Enterprise Network: iBoxes at points of hierarchical division

[Figure: enterprise network with iBoxes at points of hierarchical division. Labeled elements: Internet Edge, Access Edge, Server Edge, Distribution Tier, Spam Appliance, Primary & Secondary DNS Servers, Mail Server, and access hosts 10.0.0.1 ... 10.0.0.100 and 10.0.0.101 ... 10.0.0.255]

These locations give iBoxes the ability to monitor and classify traffic flowing through them. Also, iBoxes can slow down, block, fence, and drop traffic to ease surges and protect “good” traffic from bad/ugly traffic.

Routing to other iBoxes

Once we know which iBoxes exist, we need to know how to reach them so we can send them annotations

Requires building up this table at each iBox

Topology dependent

If a packet’s destination address doesn’t match an iBox in this table, we remove all annotations to ensure endhost correctness

IPv4/v6 Address    iBox ID
169.229.62/24      A
169.229.60/24      B
169.229/16         C
128.40.1.3/32      D
128.40.1.4/32      B
0/0                none
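A lookup against this table can be sketched as a longest-prefix match using Python's standard `ipaddress` module. The prefixes are the ones from the slide written out in full dotted form; the `next_ibox` and `forward` helpers and the dict-based packet representation are hypothetical names used only for illustration, and only IPv4 is handled.

```python
import ipaddress

# Table from the slide, with abbreviated prefixes expanded (e.g. 169.229/16).
IBOX_TABLE = [
    ("169.229.62.0/24", "A"),
    ("169.229.60.0/24", "B"),
    ("169.229.0.0/16", "C"),
    ("128.40.1.3/32", "D"),
    ("128.40.1.4/32", "B"),
    ("0.0.0.0/0", None),  # default entry: no iBox on the path
]
ROUTES = [(ipaddress.ip_network(prefix), ibox) for prefix, ibox in IBOX_TABLE]

def next_ibox(dst):
    """Return the iBox ID for the most specific matching prefix, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, ibox) for net, ibox in ROUTES if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0])[1]

def forward(packet):
    """Strip all annotations when no iBox lies toward the destination."""
    if next_ibox(packet["dst"]) is None:
        packet = {**packet, "annotations": []}
    return packet

print(next_ibox("169.229.62.7"), next_ibox("169.229.1.1"), next_ibox("8.8.8.8"))
print(forward({"dst": "8.8.8.8", "annotations": ["rtt"]}))
```

A linear scan is enough for a table of this size; a real forwarding path would more likely use a trie.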

Represents “core” iBoxes

Represents “edge” iBoxes

Active vs Passive annotations

When to send “active” annotations (i.e., a separate packet) vs when to passively annotate?

Available during high traffic (passive) vs expedient (active)

Associate timers with each queue

When a packet arrives and an annotation is dequeued, we reset the timer

If the timer goes off, we generate a new dummy packet, annotate it, send it off to the right destination iBox, and reset the timer
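The timer rule above can be sketched with one queue per destination iBox and an explicit clock. The class and method names are hypothetical, and one detail is an assumption the slide does not cover: when the timer fires with nothing queued, the sketch simply re-arms the timer instead of sending an empty dummy packet.

```python
from collections import deque

class AnnotationQueue:
    """Pending annotations for one destination iBox, with a flush timer."""

    def __init__(self, dest_ibox, timeout, now=0.0):
        self.dest_ibox = dest_ibox
        self.timeout = timeout
        self.pending = deque()
        self.deadline = now + timeout

    def enqueue(self, annotation):
        self.pending.append(annotation)

    def on_packet(self, now):
        """A data packet toward dest_ibox is passing: piggyback if possible."""
        if not self.pending:
            return None
        self.deadline = now + self.timeout  # reset the timer
        return ("passive", self.pending.popleft())

    def on_tick(self, now):
        """Called periodically: emit a dummy packet if the timer has expired."""
        if now < self.deadline:
            return None
        self.deadline = now + self.timeout  # reset the timer
        if not self.pending:
            return None  # assumption: nothing queued, so no dummy packet
        return ("active", self.pending.popleft())

queue = AnnotationQueue("B", timeout=5.0)
queue.enqueue("dns-latency-report")
queue.enqueue("smtp-rate-report")
print(queue.on_packet(now=1.0))  # rides on passing traffic
print(queue.on_tick(now=3.0))    # timer not yet expired
print(queue.on_tick(now=6.0))    # no traffic for 5 s: sent in a dummy packet
```

Under heavy traffic almost every annotation leaves passively, so the extra packets only appear when the path toward that iBox is quiet.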

A-Layer as component building blocks for observe-analyse-act

Observe: path statistics; req/reply completion rate, latency; new conn rate; connection age; protocol types/mixtures; their change over time

Analyse: correlations; mean changing over time (chi-sq); PCA; experimental intervention (act, then observe)

Act: BW throttling, selective drop, packet scheduling, bw fencing
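As a small illustration of the "analyse" step, the sketch below computes a plain Pearson correlation between two observed time series. This is a deliberately simpler stand-in for the chi-square and PCA techniques named on the slide, and all sample values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / sqrt(var_x * var_y)

# Hypothetical per-minute observations collected from annotations.
dns_latency_ms = [20, 22, 21, 180, 200, 190]
smtp_msgs_per_s = [5, 6, 5, 60, 70, 65]
web_reqs_per_s = [30, 28, 31, 29, 30, 32]

print(round(pearson(dns_latency_ms, smtp_msgs_per_s), 3))
print(round(pearson(dns_latency_ms, web_reqs_per_s), 3))
```

A strong correlation only nominates a suspect; the slide's "act, then observe" step is what separates correlation from cause.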

Why Distributed observe-analyse-act?

Centralized:

More control, consistent information (but could be out of date)

Centralize policy (no need to cast policy over multiple nodes)

Distributed:

Quick distribution of information

Need for information throughout the network

Works during network partitions, provides visibility during surges when it is hard to get packets through

Up-to-date info, but might be inconsistent

But, consistency hard; could start bad feedback loops; need to elect leader

Distributed routing preferred over centralized approach; similar motivation for iBoxes/A-Layer

Path-oriented connectivity and reachability

Network service monitoring: Are requests getting through? What is their rate? What has been happening to the DNS latency? Where are “DNS hotspots”?

iBoxes can store characteristics of paths through the network: types of protocols they see, volume of protocols, rate of change of traffic, distribution of source/destination addresses seen, network errors, topology information

NetFlows as statistics gathering at a single point

Extract and share reports from this information

Annotate packets with iBox Source annotation to have access to inside-vs-outside/paths chosen and paths taken

Annotate packets with service reachability reports, link conditions, traffic rates and changes of traffic rates

Annotate packets with protocol reports that represent the mixture of protocols seen at various points throughout the network

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, FTP, and SMTP servers]

Relationship between traffic classes, correlations, anomalies

Discovering anomalies: iBoxes consuming annotations from other parts of the network need to be able to discover when good services lose performance

SLT problem of anomaly detection made easier with more information and visibility

Network data stored in vector form for rate, quantity, time domain

Discovering correlations: For good services that are degrading, finding correlations to anomalous traffic surges, flash traffic, etc. provides hints to cause of problem

Each iBox representing affected traffic needs annotations containing network wide events capturing changes in traffic patterns

“Analysis” components of observe-analyze-act done from multiple network vantage points or centralized?

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, FTP, and SMTP servers]

Experimental Intervention, protection of good traffic via policy actions

Experimental intervention:

Control annotations sent to iBox near source of surge to temporarily throttle

Annotations routed to iBox at ISP ingress to invoke new policy

The policy in the annotation relies on iBox actions of BW shaping, fencing, and TCP ack manipulation to reduce SMTP flow rate

Protection of good traffic:

Policy could include network-level redirection to channel good DNS requests from access networks to a secondary, backup DNS service

Marking traffic not affiliated with surge for protection elsewhere in the network closer to the service location
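The talk does not say how an iBox would implement BW shaping; a token bucket is one common way, sketched here as an assumption. The rate, burst size, and the use of destination port 25 to classify SMTP are illustrative choices, and `admit` is a hypothetical helper.

```python
class TokenBucket:
    """Admit traffic at up to rate_bytes_per_s, with bursts up to burst_bytes."""

    def __init__(self, rate_bytes_per_s, burst_bytes, now=0.0):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = now

    def allow(self, size, now):
        """Return True if a packet of `size` bytes may pass at time `now`."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False

SMTP_PORT = 25
smtp_bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)

def admit(dst_port, size, now):
    """Throttle only SMTP; all other traffic passes untouched."""
    if dst_port != SMTP_PORT:
        return True
    return smtp_bucket.allow(size, now)

decisions = [
    admit(25, 1500, 0.0),  # burst allowance covers the first SMTP packet
    admit(25, 1500, 0.5),  # only 500 bytes of tokens have refilled
    admit(53, 512, 0.5),   # DNS is never throttled by this policy
]
print(decisions)
```

Keeping the classifier separate from the bucket means the same shaping action could be pointed at a different traffic class by changing only the policy carried in the control annotation.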

[Figure: enterprise network diagram with clients, the distribution tier, and a server tier hosting DNS, Web, NFS, FTP, and SMTP servers]

Policy expression and deployment

When correlations discovered, what to do with them?

Initial efforts are to provide observation platform for visualization of network state

A-Layer/iBoxes as building blocks for operator interaction

“Above the network” services

Right now we envision iBoxes understanding well known network services

Open question as to visibility to higher level applications like web services, enterprise-specific apps

New policy complexity, new correlations and state management needed

Statistical visualization for operators

Open problem to aggregate distributed observations into coherent visualization for operators

Where does the visualization reside?

What are the right metrics/correlations/deviations from mean that are relevant?

How do actions relate to visualization?

SLT analysis

Choice of algorithm

Finding “interesting” correlations

Not being overloaded with too many correlations and events

Deviation from mean, finding patterns, what is normal operation for a protocol?

Managing distributed actions

Managing feedback loops

Providing coherent actions at the global scale based on iBoxes distributed throughout the network

Coordinating actions despite network surges and limited network access, path losses, etc.

Q: What about the e2e argument?

Adding/removing annotations: annotations easy to remove; packet paths not modified

Actions such as throttling, scheduling, dropping:

Con: affects traffic in ways endhosts can detect

Pro: provides “library” of components to enable new network services / management features. That’s how we build software

A-Layer gives enterprise operators control over their networks

As long as their applications are supported and work

Enterprise networks usually have white list of allowed apps, all other disallowed

Contrast this to ISPs

Q: What about per-flow state management?

Some routers can keep per-flow state (Netflows)

iBoxes can sample traffic

iBoxes not in correctness path; can act as ‘nops’

Network traffic parallelizable, targeting 1 GigE

Can be merged into expandable network devices (see Cisco’s server cards that plug into routers)

Q: What about e2e security (IPsec?)

E2e security obscures protocol, but not path stats

Conceivable to discover request/response phases, infer completion rate; keep stats on # connections, flow rates

Statistically infer when a flow is starved for bandwidth; observe bandwidth over time; correlate with destination/source function (web server, mail server, etc.)

Correlations still work over encrypted traffic

Can still perform experiments by affecting flow X, observing flow Y

Q: Why annotate? (Why not send separate packets?)

Annotations are about path characteristics

Can bind to the flow they describe

Statistics follow paths where they are the most relevant

Marries per-path context with each packet of a particular flow (gives iBoxes info they need to throttle, fence, etc.)

As packet flow rate increases, more opportunity for visibility by piggybacking

Lower overhead during times of stress

Possible preference for fewer large packets over more small packets

Explicit sending of separate packets still ok

Especially for discovery, control, and policy distribution

Q: Why distributed?

Centralized statistics gathering easy in enterprise networks

But hard during times of stress/traffic spikes/flash traffic

Information might be needed in more than one place

“Act” operations to protect good traffic need timely info

Contrast to 5-min avgs common in SNMP

Raises difficulty, though: election protocols, distributed consensus, negative feedback loops, management of iBoxes

Let’s experiment and see

Open research question as to benefit of distributed vs centralized network observation, analysis, and action/actuation
