Context-Aware Time Series Anomaly Detection for Complex Systems
description
Transcript of Context-Aware Time Series Anomaly Detection for Complex Systems
![Page 1: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/1.jpg)
SDM4Service 5/4/2013 1
Context-Aware Time Series Anomaly Detection for Complex Systems
Manish Gupta, UIUCAbhishek B. Sharma, NEC Labs AmericaHaifeng Chen, NEC Labs AmericaGuofei (Geoff) Jiang, NEC Labs America
![Page 2: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/2.jpg)
SDM4Service 5/4/2013 2
Focus on Complex Systems• Lots of components interacting to accomplish challenging
tasks
Data centers Power plant Manufacturing plant
![Page 3: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/3.jpg)
SDM4Service 5/4/2013 3
Switch to proactive maintenance1. Continuous monitoring from multiple vantage points2. Replace calendar-based or reactive maintenance with early
detection, localization, and remediation• Enabling technologies
– low cost ubiquitous sensing and communication• Challenge
– How to combine heterogeneous data?– Unstructured or semi-structured log data– Multivariate time series data
![Page 4: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/4.jpg)
SDM4Service 5/4/2013 4
Importance of Collating Information• Only time series data
– No context or the global semantic view– Many false positives– Multiple alarms related to a single
event• Only system logs
– High level application/workflow view– Incomplete Coverage
• “Cost”– Lack of root-cause visibility
• Absence of observed system behavior
0 10000 20000 30000 40000 50000 60000 700000
0.010.020.030.040.050.060.070.080.09
0.1
CPU Utilization
Mem Utilization
Time (s)
Nor
mal
ized
Mea
sure
men
t
Task execution
![Page 5: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/5.jpg)
SDM4Service 5/4/2013 5
Our vision• Logs capture the context of a system’s operations• Time series monitoring data record the state of different
components• Hypothesis: jointly mining log and time series data for
anomaly detection is more accurate and robust.– Context-aware time series anomaly detection
![Page 6: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/6.jpg)
SDM4Service 5/4/2013 6
Outline1. Introduction and Motivation2. Framework for combining logs and time series data3. Proposed solution4. Instantiation details for Hadoop5. Evaluation 6. Conclude
![Page 7: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/7.jpg)
SDM4Service 5/4/2013 7
Framework for combining logs and time series data
Time series
![Page 8: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/8.jpg)
SDM4Service 5/4/2013 8
What is an instance?• An instance spans the interval between two consecutive
context changing events on a component.– Assumption: we can identify context changing events.
• Instance I = (C,M); C: content features, M: metrics/time series
t1 t2
t1: Task execution startst2: Task execution finishes
![Page 9: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/9.jpg)
SDM4Service 5/4/2013 9
Problem statement and solution approach
• Given: Instances I1, I2 , …, IL
• Find: Top K anomalous instances• 2 stage solution• Find patterns
• Context patterns• Metric patterns
• Find anomalies
Two notions of similarity:Peer similarity: similarity in context variables across instancesTemporal similarity: similarity in time series data for similar contexts
![Page 10: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/10.jpg)
SDM4Service 5/4/2013 10
Proposed Solution• Extraction of Context Patterns
– Normalize the data– Use K-means clustering
• Extraction of Metric Patterns
C1
C2
C3
M1
M2M3
M4
Not an anomalyAnomaly
𝑠𝑐𝑜𝑟𝑒 ( 𝐼 )=1−𝑠𝑖𝑚 (𝑀 ,𝑀 2 )
Anomaly Detection
Anomaly Post-processing– Remove instance if nearest context
cluster is far away.
CPUMemory
Disk Read
Disk Write
eth0 TX
eth0 RX CPU
10 20.9 622.4 14.0 4.2 5.6 10
20 24.5 977.7 82.7 4.0 1.8 20
10 24.6 836.4 90.0 1.6 1.6 10
30 20.6 198.6 78.6 7.1 0.3 3040 29.3 850.9 99.1 5.1 10.0 40
similarity?
𝑷 𝒄×𝒌 𝑸𝒄×𝒌
similarity?
CPUMemory
Disk Read
Disk Write
eth0 TX
eth0 RX CPU
10 20.9 622.4 14.0 4.2 5.6 1020 24.5 977.7 82.7 4.0 1.8 2010 24.6 836.4 90.0 1.6 1.6 1030 20.6 198.6 78.6 7.1 0.3 30
![Page 11: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/11.jpg)
11
Instantiating the framework for MapReduce (Hadoop)
• MapReduce programming model– Example: count the frequency of all words appearing in a document
• Distributed block storage (e.g. HDFS)• Two phases of computation: Map and Reduce
A B C
B C D
E F G
A B D
A: 1 B: 1 C: 1
B: 1 C: 1 D: 1
E: 1 F: 1 G: 1
A: 1 B: 1 D: 1
Map
Map
Map
Map
Intermediate output
Reduce
A: 2B: 3C: 2D: 2E: 1F: 1G: 1
Final output
SDM4Service 5/4/2013
![Page 12: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/12.jpg)
SDM4Service 5/4/2013 12
Hadoop: Open source implementation of MapReduce runtime
Map and Reduce phases exhibitpeer and temporal similarity
![Page 13: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/13.jpg)
SDM4Service 5/4/2013 13
Discussion• Selecting number of principal components ()– Capture >95% variance for both time series.
• Selecting number of context/metric clusters– Knee point of within cluster sum of squares versus
# clusters curve.• Richer context for MapReduce– Job conf parameters – Events extracted using regex pattern matches from
logs.
![Page 14: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/14.jpg)
SDM4Service 5/4/2013 14
Evaluation1. Synthetic datasets
– Context part comes from real Hadoop runs.– Metrics part is synthetically generated.– Hadoop cluster: master + 5 slaves.– Workload: standard Hadoop examples like sorting, count word
frequencies, etc.– 3 context clusters.
2. Real Hadoop runs with injected faults– CPU hog and Disk hog
![Page 15: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/15.jpg)
SDM4Service 5/4/2013 15
Synthetic data:Context Clusters for Hadoop Examples
#Map
s
#Red
uces
COM
BIN
E O
UTPU
T RE
CORD
S
COM
MIT
TED
HEAP
BYT
ES
CPU
MILL
ISEC
ON
DS
FILE
BYT
ES W
RITT
EN
HDFS
BYT
ES R
EAD
MAP
INPU
T BY
TES
MAP
INPU
T RE
CORD
S
MAP
OUT
PUT
BYTE
S
MAP
OUT
PUT
MAT
ERIA
LIZED
BYT
ES
PHYS
ICAL
MEM
ORY
BYT
ES
RECO
RDS
WRI
TTEN
REDU
CE IN
PUT
GRO
UPS
REDU
CE IN
PUT
RECO
RDS
REDU
CE O
UTPU
T RE
CORD
S
REDU
CE S
HUFF
LE B
YTES
SPILL
ED R
ECO
RDS
SPLIT
RAW
BYT
ES
VIRT
UAL M
EMO
RY B
YTES-2
-1
0
1
2
3Cluster1 Cluster2 Cluster3
Nor
mal
ized
Mea
sure
men
t
• Cluster 1: large number of Map tasks high values for Map counters. • Cluster 2: instances with a few Map and a few Reduce tasks.• Cluster 3: instances with large number of Reduce tasks and high values for
Reduce counters.
![Page 16: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/16.jpg)
SDM4Service 5/4/2013 16
Injecting Anomalies in Synthetic Dataset
• Fix anomaly factor=• Randomly select instances into set R• For each instance in R, choose to add swap-
anomaly or new-anomaly.• Swap Anomaly: Swap the metrics part with
another randomly chosen instance.• New Anomaly: Replace the metrics time series
part with a new random matrix.
![Page 17: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/17.jpg)
SDM4Service 5/4/2013 17
Synthetic Dataset Results
• 20 experiments per setting.• Avg. standard deviations are 3.34% for CA, 7.06 % for SI and 4.58%
for NC.
SI (1%)NC (28%)
![Page 18: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/18.jpg)
SDM4Service 5/4/2013 18
Results on real Hadoop runs with injected faults
1 15 29 43 57 71 85 99 1131270
0.050.1
0.150.2
0.250.3
0.35Disk HogCPU Hog
Instance Number
Anom
aly
Scor
e
• Original number of anomalies– Disk hog: 7.– CPU hog: 4.
• Detected anomalies– Disk hog: 4 in top5, all 7 in top 10.– CPU hog: 3 in top 5, all 4 in top 10.
1 12 23 34 45 56 67 780
20
40
60
80
100Anomaly
Metric Cluster 0
Metric Cluster 1
Metric Cluster 2
Time (sec)
CPU
Utiliz
ation
![Page 19: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/19.jpg)
SDM4Service 5/4/2013 19
Conclusion and Future work• Proactive maintenance is more effective when we combine
information from heterogeneous sources– System logs and time series measurements
• We proposed a clustering based approach for finding context patterns from log data and metric patterns from time series– Use these patterns for anomaly detection
• Future directions– How to define context and instances in other settings?– Define anomalies based on transition in context and expected change
in metrics
![Page 20: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/20.jpg)
SDM4Service 5/4/2013 20
Appendix
![Page 21: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/21.jpg)
SDM4Service 5/4/2013 21
500 1000 2000 50000
5000
10000
15000
20000
25000
30000#Metrics=5#Metrics=10#Metrics=20
Number of instances (N)
Exec
ution
Tim
e fo
r Met
ric
Patte
rns D
isco
very
(sec
)
Running Time
• Algorithm is linear in number of instances.• Time spent in anomaly detection: ~188ms.
![Page 22: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/22.jpg)
SDM4Service 5/4/2013 22
Real Datasets
• Workload: Multiple runs of RandomWriter and Sort.
• RandomWriter (16 Maps) writes 1 GB data in 64 MB chunks and Sort (16 Maps and 16 Reduces) sorts the data.
• Anomalies are inserted on 1 machine for– CPU Hog: Infinite loop.– Disk Hog: Sequential write to file on disk.
• Total instances: 134 (Disk Hog) & 121 (CPU Hog).
![Page 23: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/23.jpg)
SDM4Service 5/4/2013 23
Context Clusters for RandomWriter+Sort Dataset
#Map
s
#Red
uces
COM
MIT
TED
HEAP
BYT
ES
CPU
MILL
ISEC
ON
DS
FILE
BYT
ES R
EAD
FILE
BYT
ES W
RITT
EN
HDFS
BYT
ES W
RITT
EN
MAP
OUT
PUT
MAT
ERIA
LIZED
BYT
ES
MAP
OUT
PUT
RECO
RDS
PHYS
ICAL
MEM
ORY
BYT
ES
RECO
RDS
WRI
TTEN
REDU
CE IN
PUT
RECO
RDS
REDU
CE O
UTPU
T RE
CORD
S
REDU
CE S
HUFF
LE B
YTES
SPILL
ED R
ECO
RDS
SPLIT
RAW
BYT
ES
VIRT
UAL M
EMO
RY B
YTES-3
-2-1012 Cluster1 Cluster2 Cluster3
Nor
mal
ized
Mea
sure
men
t
• Cluster 1 consists of a mix of Maps and Reduces and has a distinctly high number of HDFS bytes being written.
• Cluster 2 is Map-heavy and shows a large number of Map Output Records.• Cluster 3 is Reduce-heavy and hence demonstrates a large activity in Reduce
counters.
![Page 24: Context-Aware Time Series Anomaly Detection for Complex Systems](https://reader036.fdocuments.us/reader036/viewer/2022062411/5681668b550346895dda4d6e/html5/thumbnails/24.jpg)
SDM4Service 5/4/2013 24
Metric Patterns