ABSENCE: Usage-based Failure Detection in Mobile Networks

Binh Nguyen, Zihui Ge, Jacobus Van der Merwe, He Yan, Jennifer Yates Mobicom 2015


Silent failures

• Silent failures: service disruptions or outages that are not detected by current monitoring systems.

• Caused by newly rolled-out features, bugs on devices, or a combination of both.

[Figure: a mobile network with the RAN and the EPC core.]


Detecting silent failures is challenging!

Detecting silent failures is difficult - passive network monitoring

• Drops in traffic/usage on network elements do not imply service disruptions:

  • Load balancing and maintenance activities.

  • Dynamic routing / Self-Organizing Networks (SON).

• Key Performance Indicators (KPIs) may not reflect service issues:

  • E.g., the accessibility KPI looks good even when only a subset of users can access the network.

[Figure: load over time around a load-balancing event; the actual load drops below the expected load without any service disruption.]


A “healthy network” (from a monitoring perspective) does not guarantee a good service experience for users!

Detecting silent failures is difficult - active service monitoring

• Sending test traffic across the network on all service paths.

• Many types of customer devices and applications, and a huge geographic area to probe.

[Figure: probes sent across the RAN and EPC core along many service paths.]

Active monitoring does not scale!

Relying on customer feedback

• It takes time for customers to give feedback.

• Relying on customer feedback is too slow: hours of delay.

  • E.g., a failure happens at 16:38 UTC but only manifests in customer feedback at 21:00 UTC, a delay of 3.5 hours.

[Figure: number of customer-care tickets over time; the event starts at 16:38 UTC but is detected by customer care at 21:00 UTC, 3.5 hours later.]

Too slow!

ABSENCE: usage-based failure detection

• ABSENCE: a passive service-monitoring approach that monitors customer usage.

• The absence of customer usage is a reliable indicator of service disruptions in a mobile network.

ABSENCE’s key ideas

[Figure: a group of users generating usage through the mobile network.]

• If a failure happens, users are not able to use the network as normal.

• When a large number of users cannot use the network, aggregate usage drops.

• This can detect both hard failures (outages) and performance degradations.

ABSENCE overview

• Uses anonymized and aggregated Call Detail Records (CDRs) collected in real time from a U.S. operator.

[Figure: usage time series for Weeks 1-3 overlaid; Weeks 1 and 2 show the “expected” usage, while Week 3 shows an “absence” of usage, raising the question: anomaly?]

Outline

• Motivation.

• ABSENCE overview.

• Is ABSENCE feasible?

• ABSENCE’s challenges.

• ABSENCE’s event detection.

• Synthetic workload evaluation.

• Operational validation.

Is ABSENCE feasible?

Is usage predictable enough for anomaly detection?


Is usage predictable enough?

• While an individual user’s usage is not predictable, the usage of a large group of users is predictable.

• For example, with 3 weeks of usage overlaid, the usage of a small group is less predictable than the usage of a large group.

[Figure: number of calls over time for Weeks 1-3 overlaid, for a group of 1,170 users and a group of 3,000 users; the larger group’s weekly curves align much more closely.]

Yes, the usage of a large enough group of users is predictable!
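As a rough illustration of this aggregation step, the sketch below (not from the paper, with a hypothetical simplified CDR schema) sums per-user call records into an hourly group-level series; overlaying several weeks of such a series is what makes the weekly pattern, and hence the “expected” usage, visible.

```python
from datetime import datetime

def hourly_usage(call_records, week_start, hours=168):
    """Aggregate per-user call records into one hourly group-level series.

    call_records: iterable of (user_id, call_time) tuples, a hypothetical
    simplified schema for illustration; the real CDRs are anonymized and
    aggregated. Individual users are unpredictable, but the sum over a
    large enough group follows a stable weekly pattern.
    """
    series = [0] * hours
    for _user, call_time in call_records:
        idx = int((call_time - week_start).total_seconds() // 3600)
        if 0 <= idx < hours:
            series[idx] += 1
    return series

# Overlaying consecutive weeks of the same group exposes the pattern:
# week1 = hourly_usage(records, week_start=datetime(2015, 3, 2))
# week2 = hourly_usage(records, week_start=datetime(2015, 3, 9))
```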

Outline

• Motivation.

• ABSENCE overview.

• Is ABSENCE feasible?

• ABSENCE’s challenges.

• ABSENCE’s event detection.

• Synthetic workload evaluation.

• Operational validation.

Challenges

• Failures happen at different scopes: geo-area, device make/model, service type.

• How to deal with user mobility?

• How to improve the predictability of aggregate usage?

• How to make ABSENCE scalable, given the large amount of data in the network?


How to detect failures with different scopes?

• Group users based on their geographical information: ZIP code area, city, state.

• A user can belong to multiple geographical groups at the same time.

• Each geographical group is further divided by device OS and make.

[Figure: example hierarchy for Salt Lake City (ZIP prefix 841), split into Android (Samsung: Galaxy S5, Galaxy S4; HTC) and iOS (iPhone: iPhone 5, iPhone 6).]
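A minimal sketch (with hypothetical field names, not the paper’s implementation) of how one user’s usage can be attributed to several overlapping groups at once: each geographic level, further split by device OS and make.

```python
def group_keys(user):
    """Return every aggregation group this user's usage contributes to.

    `user` is a dict with hypothetical fields (zip_code, city, state,
    device_os, device_make); the real attributes come from anonymized,
    aggregated CDRs. A user belongs to multiple geographic groups at the
    same time, and each geographic group is subdivided by device OS/make.
    """
    geo_levels = [
        ("zip", user["zip_code"]),   # e.g. "84101"
        ("city", user["city"]),      # e.g. "Salt Lake City"
        ("state", user["state"]),    # e.g. "UT"
    ]
    keys = []
    for level, area in geo_levels:
        keys.append((level, area))                        # all devices
        keys.append((level, area, user["device_os"]))     # e.g. Android
        keys.append((level, area, user["device_make"]))   # e.g. Samsung
    return keys

# Each key indexes its own usage time series, so a failure scoped to, say,
# Android devices in Salt Lake City shows up as absence in that group.
```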

Outline

• Motivation.

• ABSENCE overview.

• Is ABSENCE feasible?

• ABSENCE’s challenges.

• ABSENCE’s event detection.

• Synthetic workload evaluation.

• Operational validation.

Event detection algorithm

• Decompose the usage time series into trend, seasonal, and noise components:

  • Trend: moving average.

  • Seasonal: average of the values at the same phase of the cycle.

  • Noise = time series - trend - seasonal.

• If the noise falls outside the 95% confidence interval (CI) of the noise component, flag an anomaly.

[Figures: a usage time series with detected anomalies; its trend, seasonal, and noise components; and the upper/lower 95% CI bounds on the noise, with points outside the band marked as anomalies.]
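The following is a minimal sketch of the decomposition described above, not the authors’ code: the trend is a moving average, the seasonal component averages the values at the same phase of the cycle, and the noise is the residual. The window size and the use of mean ± 1.96σ as a stand-in for the 95% confidence band are assumptions.

```python
import statistics

def detect_anomalies(series, season=168, window=25, z=1.96):
    """Flag samples whose noise component falls outside the 95% band.

    series: hourly usage counts for one user group (should span at least
        one full season).
    season: seasonal period in samples (168 = one week of hours).
    window: moving-average window for the trend (an assumed value).
    z: 1.96 approximates a two-sided 95% interval under a normal model.
    """
    n, half = len(series), window // 2
    # Trend: centered moving average.
    trend = [statistics.fmean(series[max(0, i - half):i + half + 1])
             for i in range(n)]
    detrended = [series[i] - trend[i] for i in range(n)]
    # Seasonal: average of the detrended values at the same phase.
    seasonal = [statistics.fmean(detrended[p::season]) for p in range(season)]
    # Noise = time series - trend - seasonal.
    noise = [detrended[i] - seasonal[i % season] for i in range(n)]

    mu, sigma = statistics.fmean(noise), statistics.pstdev(noise)
    lower, upper = mu - z * sigma, mu + z * sigma
    # For ABSENCE the interesting anomalies are drops below the lower band
    # (an "absence" of usage), but both sides are reported here.
    return [i for i in range(n) if not (lower <= noise[i] <= upper)]
```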

Outline

• Motivation.

• ABSENCE overview.

• Is ABSENCE feasible?

• ABSENCE’s challenges.

• ABSENCE’s event detection.

• Synthetic workload evaluation.

• Operational validation.

Synthetic workload evaluation

• 6 months of real CDRs from a U.S. operator.

• Synthetically introduce failures:

  • Network failures: remove usage on base stations.

  • Device failures: remove usage on devices.
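A rough sketch of this injection step under an assumed record schema (not the paper’s code): usage attributed to the chosen base stations or device models is removed for the failure window, scaled by the desired impact degree.

```python
import random

def inject_failure(records, start, end, failed_ids, id_field, impact=1.0):
    """Remove a fraction of the matching usage to simulate a silent failure.

    records: list of dicts with hypothetical fields, e.g.
        {"time": ..., "cell_id": ..., "device_model": ..., "calls": 1}.
    failed_ids: base stations (network failure) or device models
        (device failure) whose usage is removed.
    id_field: "cell_id" or "device_model".
    impact: fraction of the matching usage removed (0.0 to 1.0), mirroring
        the 0-55% impact degrees used in the evaluation.
    """
    kept = []
    for r in records:
        in_failure = start <= r["time"] < end and r[id_field] in failed_ids
        if in_failure and random.random() < impact:
            continue  # this usage becomes "absent" during the failure
        kept.append(r)
    return kept
```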

Parameters and metrics

Parameters:

• 11,000 failures generated.

• 100 ZIP codes, 10 cities.

• Two popular device types.

• LTE/Voice.

• Duration: 1, 2, 3, 6, 12 hours.

• Quiet and busy hours.

• Impact degree: 0-55%.

Metrics:

• Detection rate = detected events / introduced events.

• Loss ratio = usage lost until detection / normal usage.

Example failure scenarios:

• All Android devices in Los Angeles fail.

• All iPhone 5 devices in downtown Los Angeles fail.
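Written out as code, the two metrics above amount to the following small sketch (argument names are mine, not the paper’s):

```python
def detection_rate(num_detected, num_introduced):
    """Detection rate = detected events / introduced events."""
    return num_detected / num_introduced

def loss_ratio(usage_lost_until_detection, normal_usage):
    """Loss ratio = usage lost until the event is detected / normal usage."""
    return usage_lost_until_detection / normal_usage
```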

Overall detection rate

• Across the 11,000 introduced failures:

  • ABSENCE detected >96% of failures with more than 15% impact.

  • ABSENCE tends to miss events with <10% impact.

[Figure: detection rate (percent) vs. failure impact (percent), binned from 0-5% to 45-50%, over all failures.]

Loss ratio of detected failures

• Of all detected failures, ~97% are detected before 10% of the usage is lost (during busy hours).

[Figure: CDF of the loss ratio (percent) for failures during busy hours.]

Outline

• Motivation.

• ABSENCE overview.

• Is ABSENCE feasible?

• ABSENCE’s challenges.

• ABSENCE’s event detection.

• Synthetic workload evaluation.

• Operational validation.

Evaluate against known silent failures from the operator

• 19 silent failure events that were not known by the network operator when they happened.

• ABSENCE detected 19/19: 100% true positives.

Alarm rate and true positives

• Uses the 19 known events from the operator.

• Alarm rate (m): average number of alarms per day that an operations team needs to handle.

• Cut-off threshold (n): filters out less impactful events.

• Increasing the cut-off threshold reduces the alarm rate while maintaining ABSENCE’s true positive rate.

[Figure: true positive rate (%) and alarm rate (alarms/day) vs. cut-off threshold (n, 2n, ..., 7n); the true positive rate stays at 100% while the alarm rate falls to m alarms per day.]

100% true positives at m alarms per day, and m is small enough.

ABSENCE’s alarm rate is reasonable for practical use!
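A minimal sketch (assumed event representation, not the paper’s implementation) of applying the cut-off threshold n before alarming: less impactful detections are dropped, lowering the alarms-per-day rate while high-impact silent failures remain detected.

```python
def raise_alarms(detected_events, cutoff):
    """Keep only detections whose estimated impact exceeds the cut-off.

    detected_events: list of (event_id, estimated_lost_usage) pairs, a
        hypothetical representation of ABSENCE's detections.
    cutoff: the threshold n from the slide; raising it (2n, 3n, ...)
        filters out less impactful events and reduces the alarm rate.
    """
    return [event for event, impact in detected_events if impact >= cutoff]

# alarms_per_day would then be len(raise_alarms(days_detections, cutoff))
# averaged over the evaluation period.
```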

Conclusions

• The absence of customer usage is a reliable indicator of service disruptions in a mobile network.

• Appropriately grouping users results in predictable usage and high fidelity for anomaly detection.

• Evaluated with synthetic workloads and validated operationally.

• Practical in an operational environment.

ABSENCE: Usage-based Failure Detection in Mobile Networks

Binh Nguyen, Zihui Ge, Jacobus Van der Merwe, He Yan, Jennifer Yates Mobicom 2015


Thank you!