
Continuous Distributed Monitoring in the Evolved Packet Core

Industry Experience Report

Romaric Duvignau¹, Marina Papatriantafilou¹, Konstantinos Peratinos³, Eric Nordstrom², Patrik Nyman²

DEBS 2019, Darmstadt (June 26).

¹ Chalmers University of Technology, ² Ericsson, ³ Chalmers student and Ericsson intern.

Introduction

Context: Monitoring the Evolved Packet Core (EPC) in 4G

The Evolved Packet Core

[Figure: EPC architecture. User Equipment (UE) connects over LTE through a base station to the EPC, which forwards traffic to servers in the PDN. Within the EPC, the Evolved Packet Gateway (EPG) is split into a Control Plane (CP: MME, QoS, billing, ...) and a User Plane (UP) of workers W1, W2, ..., WN. The two planes communicate via PFCP over the Sxa/Sxb interfaces; user traffic is carried in GTP tunnels identified by TEIDs.]

• Large-scale, distributed, performance-critical system.
• Strong need to continuously monitor the EPC, e.g. for the detection of under- or over-used subcomponents.

Continuous Distributed Monitoring

Continuous Distributed Monitoring (CDM) Model

[Figure: k distributed sites observe streams S1, S2, ..., Sk; a coordinator continuously tracks f(S1, S2, ..., Sk).]

There exist variants (unidirectional communication, relay nodes, etc.).

• Instant computation & communication
• f depends on ∪ Si
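To make the model concrete, here is a minimal Python sketch (all names illustrative, not from the presented system): k sites each observe a local stream Si and forward observations to a coordinator, which continuously re-evaluates f over the union of the streams.

    # Minimal sketch of the CDM model (illustrative names, not the
    # production system): k sites stream observations to a coordinator
    # that continuously re-evaluates f(S1, ..., Sk) over their union.

    class Coordinator:
        def __init__(self, f):
            self.f = f           # function of the union of all streams
            self.streams = {}    # site id -> observed items

        def receive(self, site_id, item):
            self.streams.setdefault(site_id, []).append(item)
            return self.f([x for s in self.streams.values() for x in s])

    class Site:
        def __init__(self, site_id, coordinator):
            self.site_id, self.coordinator = site_id, coordinator

        def observe(self, item):
            # "Instant computation & communication": this naive baseline
            # forwards every observation immediately.
            return self.coordinator.receive(self.site_id, item)

    # Usage: continuously track the global sum over two sites.
    coord = Coordinator(f=sum)
    s1, s2 = Site(1, coord), Site(2, coord)
    s1.observe(3); s2.observe(4)
    print(s1.observe(5))  # 12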

System Architecture

System Architecture Overview

[Figure: Incoming traffic is spread by a load balancer over aggregators Agg1, Agg2, ..., each managing workers w1, w2, ..., wℓ. Each aggregator fetches statistics from its workers and sends monitoring messages to the coordinator C, whose results are displayed to analysts. Timeline per monitoring period: fetches are collected, then the sliding window is updated.]

Differences with CDM models

• Sites' identities matter; performance statistics ≠ "events"; etc.
• Need to account for computation and communication delays!

→ At the aggregator: monitoring decisions are taken, then one monitoring message is sent.
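As a rough illustration of this message flow, a minimal Python sketch of the topology, with hypothetical Worker/Aggregator/Coordinator classes standing in for the real components:

    # Rough sketch of the architecture's message flow (hypothetical
    # Worker/Aggregator/Coordinator classes, not the production code).

    class Worker:
        def __init__(self, stats):
            self.stats = stats

        def fetch_statistics(self):
            # One fetched statistics sample, e.g. {"cpu": 42}.
            return self.stats

    class Coordinator:
        def receive(self, message):
            print("display to analysts:", message)

    class Aggregator:
        def __init__(self, workers, coordinator):
            self.workers, self.coordinator = workers, coordinator

        def run_period(self):
            # Collect this period's fetches from all managed workers...
            fetched = [w.fetch_statistics() for w in self.workers]
            # ...take monitoring decisions (see the algorithms below),
            # then send at most one monitoring message per period.
            self.coordinator.receive(fetched)

    coord = Coordinator()
    agg = Aggregator([Worker({"cpu": 40}), Worker({"cpu": 60})], coord)
    agg.run_period()  # display to analysts: [{'cpu': 40}, {'cpu': 60}]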

Monitoring Algorithms

Selected CDM Algorithms for Counting problems

Basic Mode: Exact Monitoring

• Send an update if the last value sent differs from the measured value
• Keep an exact sliding window of the last n values (sketch below)

Approximation Mode: Relative Error of ε

• Uses Exponential Histograms for approximate counting
• Send the approximate count when it is beyond some error bound from the last value sent
• Requires in all O(log(nε)/ε) words
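A minimal sketch of the Basic Mode in Python, assuming one measured value per monitoring round; a None return means the round stays silent:

    from collections import deque

    class BasicMonitor:
        """Basic Mode (sketch): send an update only when the measured
        value differs from the last value sent; keep an exact sliding
        window of the last n values."""

        def __init__(self, n):
            self.window = deque(maxlen=n)  # exact sliding window
            self.last_sent = None

        def step(self, value):
            self.window.append(value)
            if value != self.last_sent:    # changed -> send an update
                self.last_sent = value
                return value
            return None                    # unchanged -> stay silent

    # Usage: only 3 of 6 rounds produce a monitoring update.
    m = BasicMonitor(n=60)
    print([m.step(v) for v in [5, 5, 6, 6, 6, 5]])
    # [5, None, 6, None, None, 5]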

Results

Experimental setup

• EPG setup: 2 aggregators, 72 workers per aggregator
• 2 phases: increasing load (20 min), then stable load (15 min)
• 1000 fetches/s – high precision; 1 fetch/s – low precision

[Figure: CPU utilization (%) and packet rate (packets/s, up to 3M) over time (0–2000 s), each showing Min, p5, Median, p95 and Max across workers.]

No. of Monitoring Updates per Round

• 5–10% of data sent for packet processing rate; 30–70% for CPU.

[Figure: Updates vs. Basic mode over time (0–2000 s), for CPU (top) and packet rate (bottom), with ε = 5%, 10%, 20%, and 5% with a 60 s window (5%W60).]

• Max relative error < 5ε/9 and average < ε/5.

Monitoring Availability

• 8 runs (ca. 4 h of data) with monitoring round = 1 s

[Figure: Update time (s, 300 s moving average) and availability (MA300) over time (250–2000 s), for Basic (B), 5%, 10%, 20% and B5%W60 configurations on Agg1/Agg2.]

Conclusion

Conclusions

• Adjusted state-of-the-art CDM implementations in the EPC
• Keys to popularizing CDM within a production-level system
• From experiments, only 6% of data sent for 1.6% avg. error
• Useful for the upcoming transition to the 5G architecture

Thank you!

Error Analysis

• Max relative error is always close to 5ε/9
• A larger window influences the absolute error on CPU

[Figure: Errors over time (0–2000 s) for the 5% and 5%W60 configurations, showing Min, p10, Median, p90 and Max.]

Comparison with Simple Approximation

• Simple approximation: keep an exact window and send updates when the last count is beyond some predefined relative bound (a sketch follows below)

[Figure: Number of updates (pkt) and relative errors (%, pkt) for B, 5%, 10%, 20% and 5%W60.]

• The ε-approximate algorithm presents similar tradeoffs to the simple approximation with bound 5ε/9
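A sketch of this simple approximation in Python, assuming the monitored quantity is the count over the exact window and that the relative bound b is fixed up front (e.g. b = 5ε/9):

    from collections import deque

    class SimpleApproxMonitor:
        """Simple approximation (sketch): exact window, plus a predefined
        relative bound b; an update is sent when the windowed count leaves
        (1 - b, 1 + b) times the last value sent."""

        def __init__(self, n, b):
            self.window = deque(maxlen=n)  # exact sliding window
            self.b = b                     # predefined relative bound
            self.last_sent = None

        def step(self, value):
            self.window.append(value)
            count = sum(self.window)       # exact count over the window
            if (self.last_sent is None
                    or count > (1 + self.b) * self.last_sent
                    or count < (1 - self.b) * self.last_sent):
                self.last_sent = count     # drifted too far -> update
                return count
            return None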

CDM approaches

Simple approaches

• Flooding: does not scale!
• Polling: hard to choose the right polling interval!
• Sampling: does not capture scarce under-/over-used components!

Solutions

• Communication-optimal algorithms
• Geometric Monitoring → efficient network-wide aggregates.
• Tailored algorithms for particular tasks → e.g. computing the frequency of items or the most popular ones.
• Heuristics → e.g. adaptive filters.
• Compromises: Magpie, Dapper, Ganglia, ...

Proposed Monitoring Solutions

[Timeline: Monitoring Period → Fetches → Sliding Window]

Monitoring Logic for each monitored value

• Implemented as part of the aggregator nodes
• Once all fetches have been collected, a monitoring decision is taken on whether to propagate an update
• Aggregation of all monitoring updates: sending of (up to) a single monitoring message per aggregator (see the sketch below)
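A sketch of this per-period logic in Python, assuming the period's fetches have already been collected; `monitors` maps each monitored value to its decision logic (basic or approximate, via a step() method), and `send_to_coordinator` is a hypothetical callback delivering the single batched message:

    def monitoring_round(fetched, monitors, send_to_coordinator):
        """One monitoring period at an aggregator (sketch). `fetched` is
        this period's collected worker statistics; every resulting update
        is batched into (at most) one monitoring message."""
        updates = {}
        for key, monitor in monitors.items():
            # Aggregate this period's fetches for the metric (here: mean).
            value = sum(f[key] for f in fetched) / len(fetched)
            decision = monitor.step(value)   # None = no update needed
            if decision is not None:
                updates[key] = decision
        if updates:                          # (up to) 1 message per round
            send_to_coordinator(updates)

    # Usage, with the BasicMonitor sketch from earlier:
    # monitoring_round([{"cpu": 40}, {"cpu": 60}],
    #                  {"cpu": BasicMonitor(n=60)}, print)  # {'cpu': 50.0}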

Selected CDM Algorithms

Basic Mode

• Send an update if the last value sent is different
• Keep an exact sliding window of length n

ε-Approximation Mode

• Maintains an (ε/9)-approximate Exponential Histogram for counting the approximate sum ĉ of items over a sliding window of the last n events
• Whenever ĉ > (1 + 4ε/9)·c or ĉ < (1 − 4ε/9)·c, where c is the last value sent, send an update
• Requires in all O(log(nε)/ε) words of memory
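A Python sketch of the ε-Approximation Mode's update rule. For brevity, an exact window stands in for the Exponential Histogram (a real EH returns the count within a relative ε/9 using only O(log(nε)/ε) words), so only the (1 ± 4ε/9) send condition is modeled faithfully:

    from collections import deque

    class ApproxMonitor:
        """ε-Approximation Mode (sketch). A real implementation maintains
        an (ε/9)-approximate Exponential Histogram; here an exact window
        stands in for it, so only the (1 ± 4ε/9) update rule is faithful."""

        def __init__(self, n, eps):
            self.window = deque(maxlen=n)  # placeholder for the EH
            self.eps = eps
            self.last_sent = None

        def approximate_count(self):
            return sum(self.window)        # an EH returns this within ε/9

        def step(self, value):
            self.window.append(value)
            c_hat = self.approximate_count()
            bound = 4 * self.eps / 9
            if (self.last_sent is None
                    or c_hat > (1 + bound) * self.last_sent
                    or c_hat < (1 - bound) * self.last_sent):
                self.last_sent = c_hat     # send the approximate count
                return c_hat
            return None                    # within bounds -> no update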

Measuring Metrics of Interest: 2 modes

With high granularity: CPU usage

1. P fetches of CPU usage (over the past 1 ms each) within one monitoring period
2. A frequency chart (histogram of F bins) is built from the P fetches
3. The sliding windows are updated: each bin is monitored
4. For each changed (basic) or out-of-bounds (approx.) value, a monitoring update is sent
5. Upon receiving an update, C updates its frequency counts for the respective observer and CPU bin, and may then display the average CPU over the window as ∑_{1≤i≤F} i·fᵢ / ∑_{1≤i≤F} fᵢ (worked example below)
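A worked example of the final step in Python: with F bins indexed 1..F and per-bin frequencies fᵢ, the displayed average is the frequency-weighted mean of the bin indices.

    def average_cpu(bin_counts):
        """Average CPU over the window from a frequency chart: computes
        sum(i * f_i) / sum(f_i) for bins i = 1..F, where f_i is the
        frequency of bin i (the bin index stands for its CPU level)."""
        total = sum(bin_counts)
        if total == 0:
            return 0.0
        return sum(i * f for i, f in enumerate(bin_counts, start=1)) / total

    # Usage: 3 fetches in bin 2 and 1 in bin 4 -> (2*3 + 4*1) / 4 = 2.5
    print(average_cpu([0, 3, 0, 1]))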

With low granularity: Packet Processing Rate

• Only the number of processed packets per monitoring period is tracked