Lab for System Informatics and Data Analytics (SIDA)
Dynamic Data-Driven Adaptive Sampling and
Monitoring of Big Spatial-Temporal Data Streams
for Real-Time Solar Flare Detection
Department of Industrial and Systems Engineering
University of Wisconsin-Madison
Date: 8/8/2017
Dr. Kaibo Liu
Outline
• Motivation
• State of the art
• Proposed DDDAS framework
– Data-Driven Dynamic Sampling Strategy
• Case study
• Conclusion
Motivation
• With advances in sensing techniques and data collection capability, Big Data Streams have become widely available in many DoD applications.
• This provides an unprecedented opportunity to gain system-wide situational awareness through real-time anomaly detection and fault localization.
• Example (solar flare detection): the NASA Solar Dynamics Observatory (SDO) monitors solar activity continuously, 24 hours a day, 7 days a week. It
– generates a high-resolution image every 0.75 seconds
– produces 1.5 TB of data per day
Source: NASA
Applications and Importance
• Solar flare activity is closely tied to Air Force equipment and applications. Solar flares can
– significantly affect Earth’s ionosphere, causing hours-long disruptions in radio communications
– affect GPS receivers and satellites, making search and rescue in a war zone very difficult
– lead to failures in large-scale power grids, with cascading effects
• A real-time solar flare detection system that exploits the Big Data Streams of solar images is highly desired.
Challenges
• Big Data Streams impose critical requirements and resource constraints on real-time data communication and processing
– Only 6 images can be sent back every minute for real-time analysis, given a transmission rate of 130 million bits/second
• The occurrence of solar flares is naturally
– complicated (depends on the cycle and the inherent dynamics and randomness of solar activity)
– sparse (with a small signal-to-noise ratio (SNR))
– transient (lasts only minutes and is hard to predict)
• Currently, there is no efficient online monitoring scheme tailored to these unique characteristics.
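The communication constraint can be put in numbers using only the figures quoted on the slides (one image generated every 0.75 seconds, 6 images sent back per minute). The short calculation below is an illustration, not part of the original deck; the variable names are chosen here for readability.

```python
# Figures quoted on the slides.
seconds_per_image = 0.75       # SDO generates one image every 0.75 s
images_sent_per_minute = 6     # images that can be sent back per minute

# Images generated per minute and the fraction available in real time.
images_generated_per_minute = 60 / seconds_per_image
fraction_sent = images_sent_per_minute / images_generated_per_minute

print(images_generated_per_minute)  # 80.0
print(fraction_sent)                # 0.075
```

In other words, about 7.5% of the generated images are available for real-time analysis, which is why the sampling decision matters.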
Proposed DDDAS Framework
[Figure: the DDDAS framework as a loop of four components: (a) applications (original solar image at time t), (b) applications modeling (updated solar image), (c) application measurement systems and methods (dynamic sampling at time t), (d) mathematical and statistical algorithms (SPC chart). Arrow labels: sample data, update model, update SPC, update sampling.]
• A dynamically updated spatial-temporal statistical model that fully characterizes the changing background
• A dynamic sampling algorithm that actively decides which data streams to observe given the resource constraints
• A scalable and robust SPC scheme that effectively combines the information from significant data streams to produce an overall global monitoring system
State of the art and innovative idea
• Existing approaches to process monitoring
– focus on a fixed sub-region (rigid spatial domain)
• assume that the locations of anomaly events are known
• fail to capture the dynamic features of solar flare events
– sample the whole image at a fixed frequency (rigid temporal domain)
• result in a large detection delay or a missed event
• Existing adaptive sampling strategies require a large amount of historical information
– Top-r based Adaptive Sampling (TRAS) at DDDAS 2016
• Innovative idea:
[Figure: random sampling versus fixed sampling. Blue: sampled data streams; red: anomaly regions; black: overlap.]
Problem formulation and objective
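• There are $p = m \times n$ data streams on an $m$-by-$n$ grid. The $k$th data stream has location $\boldsymbol{x}_k$ and observation $Y_{k,t}$ at time $t$.
• At time $t$, $\boldsymbol{Y}_t = (Y_{1,t}, \ldots, Y_{p,t})'$.
• When $t < \tau$ (change point $\tau$), the process is in control:
$$Y_{k,t} \overset{\text{i.i.d.}}{\sim} N(0, 1)$$
• When $t > \tau$, the process is out of control; the mean of the data streams in an unknown region $C$ shifts to $\delta$:
$$Y_{k,t} \overset{\text{i.i.d.}}{\sim} N\big(\delta\,\chi_C(\boldsymbol{x}_k),\, 1\big)$$
where $\chi_C(\boldsymbol{x}_k) = 1$ if $\boldsymbol{x}_k \in C$ and $0$ otherwise.
• Clustered shift: the shifted data streams lie in the region $C$.
• Limited resources: at each time $t$, only $q$ out of $p$ variables can be observed.
Goal: Based on dynamic observations in real time, actively decide which data streams to observe at the next time point, so that anomaly events are detected quickly while a system-wide false alarm rate is maintained.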
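The data model above can be simulated in a few lines. This is a minimal sketch for illustration only: the grid size, the rectangular region C, the shift size, and the function name `simulate_streams` are all assumptions made here, not values from the slides.

```python
import numpy as np

def simulate_streams(m, n, tau, T, delta, region, rng):
    """Return Y of shape (T, m*n): N(0,1) noise, plus a mean shift delta
    inside `region` (boolean m-by-n mask for C) for all t > tau."""
    p = m * n
    Y = rng.standard_normal((T, p))
    shift = delta * region.reshape(p).astype(float)   # delta * chi_C(x_k)
    # Row i holds time t = i + 1, so t > tau corresponds to rows tau, tau+1, ...
    Y[tau:] += shift
    return Y

rng = np.random.default_rng(0)
m, n, q = 20, 30, 50
region = np.zeros((m, n), dtype=bool)
region[5:9, 10:14] = True                 # hypothetical clustered region C
Y = simulate_streams(m, n, tau=100, T=200, delta=2.0, region=region, rng=rng)

# Limited resources: at a given time only q of the p streams are visible.
observed = rng.choice(m * n, size=q, replace=False)
y_obs = Y[150, observed]
```

Here the observed set is drawn uniformly at random purely as a placeholder; choosing it adaptively is the subject of the slides that follow.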
General strategy 1: Use local statistics
• To obtain a scalable monitoring scheme, we define two local statistics $W_{k,t}^{(1)}$ and $W_{k,t}^{(2)}$ for each data stream to describe the likelihood of an upward or downward shift:
– $W_{k,t}^{(1)}$: the likelihood of an upward shift
– $W_{k,t}^{(2)}$: the likelihood of a downward shift
[Figure: $m$-by-$n$ grid of data streams $Y_{1,t}, Y_{2,t}, \ldots, Y_{k,t}, \ldots, Y_{p,t}$, each with its local statistics.]
General strategy 2: Allocate observed data streams adaptively
• Wide search strategy (in control): observed data streams should be placed evenly so that a shifted data stream is observed rapidly.
• Deep search strategy (out of control): observed data streams should accumulate in the suspected region.
To combine these two strategies, we separate the observable variables into two groups: $q_{W,t}$ W-observations and $q_{D,t}$ D-observations. The proportion of $q_{W,t}$ and $q_{D,t}$ is adjusted adaptively.
[Figure: example of combining the two strategies, with $q_{W,t} = 6$ W-observations and $q_{D,t} = 4$ D-observations.]
The proposed Spatial Adaptive Sampling and Monitoring (SASAM) algorithm
Three questions:
[i] How to allocate W-observations?
[ii] How to update local statistics?
[iii] How to determine the number and location of D-observations?
Flowchart:
1. Begin with $q_{D,1} = 0$.
2. [i] Set $q_{W,t} = q - q_{D,t}$ and allocate the W-observations.
3. [ii] Update the local statistics and the charting statistic.
4. [iii] Determine $q_{D,t+1}$ and allocate the D-observations for the next period.
5. If the charting statistic exceeds $H$ (Y), raise an alarm; otherwise (N), set $t \leftarrow t + 1$ and return to step 2.
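[i] Allocate W-observations
• Objective: select a set $\mathcal{O}_{W,t}$ of $|\mathcal{O}_{W,t}| = q_{W,t}$ observations
– with good space filling property
– with diversity from the previous selection
• Method
– Generate $I$ candidate Latin hypercube designs.
– For each $i = 1, \ldots, I$, calculate
$$m_{1,i} = \min_{\boldsymbol{x}_k, \boldsymbol{x}_j \in \mathcal{O}_{W,t},\ k \neq j} d(\boldsymbol{x}_k, \boldsymbol{x}_j) \quad \text{[measuring space filling property]}$$
$$m_{2,i} = \min_{\boldsymbol{x}_k \in \mathcal{O}_{W,t},\ \boldsymbol{x}_j \in \mathcal{O}_{W,t-1}} d(\boldsymbol{x}_k, \boldsymbol{x}_j) \quad \text{[measuring diversity from the previous design]}$$
– Pick the candidate selection with the maximum index $m_i = m_{1,i} + m_{2,i}$.
[Figure: an example with $|\mathcal{O}_{W,t}| = 10$.]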
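The selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance $d$ is taken to be Euclidean, Latin hypercube points are snapped to pixel coordinates, and the helper names (`latin_hypercube_pixels`, `min_distance`, `allocate_w`) and the default of 50 candidates are hypothetical.

```python
import numpy as np

def latin_hypercube_pixels(q, m, n, rng):
    """One Latin hypercube sample of q points in [0,1)^2, snapped to (row, col) pixels."""
    u_row = (rng.permutation(q) + rng.random(q)) / q   # one point per row stratum
    u_col = (rng.permutation(q) + rng.random(q)) / q   # one point per column stratum
    rows = np.minimum((u_row * m).astype(int), m - 1)
    cols = np.minimum((u_col * n).astype(int), n - 1)
    return np.column_stack([rows, cols])

def min_distance(a, b, skip_diagonal=False):
    """Smallest Euclidean distance between a point of a and a point of b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if skip_diagonal:            # a is b: ignore each point's zero distance to itself
        np.fill_diagonal(d, np.inf)
    return d.min()

def allocate_w(q_w, prev, m, n, rng, n_candidates=50):
    """Pick the candidate design maximizing m_i = m_{1,i} + m_{2,i}."""
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        cand = latin_hypercube_pixels(q_w, m, n, rng)
        m1 = min_distance(cand, cand, skip_diagonal=True)          # space filling
        m2 = min_distance(cand, prev) if prev is not None else 0.0 # diversity
        if m1 + m2 > best_score:
            best, best_score = cand, m1 + m2
    return best

rng = np.random.default_rng(1)
prev = allocate_w(10, None, 20, 30, rng)   # first period: no previous design
curr = allocate_w(10, prev, 20, 30, rng)   # next period: also avoid prev
```

Snapping to pixels can in general produce repeated pixels when $q_{W,t}$ is large relative to the grid; such a candidate gets $m_{1,i} = 0$ and is therefore penalized, but the sketch does not otherwise guard against it.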
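[ii] Update local statistics
Observations at time $t$: $Y_{j,t}$, $j \in \mathcal{O}_{W,t} \cup \mathcal{O}_{D,t}$
Conventional CUSUM chart (the $k$th local statistics, indicating a +/- shift):
$$W_{k,t}^{(1)} = \left( W_{k,t-1}^{(1)} + u_{\min} Y_{k,t} - \frac{u_{\min}^2}{2} \right)^+$$
$$W_{k,t}^{(2)} = \left( W_{k,t-1}^{(2)} - u_{\min} Y_{k,t} - \frac{u_{\min}^2}{2} \right)^+$$
where $(a)^+ = \max(a, 0)$.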
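With spatial consideration (clustered shift: nearby data streams have similar likelihood of shift):
$$W_{k,t}^{(1)} = \left( W_{k,t-1}^{(1)} + \sum_{j \in \mathcal{O}_{W,t} \cup \mathcal{O}_{D,t}} K_h(\boldsymbol{x}_k - \boldsymbol{x}_j) \left[ u_{\min} Y_{j,t} - \frac{u_{\min}^2}{2} \right] \right)^+$$
$$W_{k,t}^{(2)} = \left( W_{k,t-1}^{(2)} + \sum_{j \in \mathcal{O}_{W,t} \cup \mathcal{O}_{D,t}} K_h(\boldsymbol{x}_k - \boldsymbol{x}_j) \left[ -u_{\min} Y_{j,t} - \frac{u_{\min}^2}{2} \right] \right)^+$$
where $K_h(\cdot)$ is the influence function.
The $k$th local statistic, indicating a +/- shift:
$$W_{k,t} = \max\left( W_{k,t}^{(1)}, W_{k,t}^{(2)} \right)$$
Charting statistic:
$$S_t = \max_{1 \le k \le p} W_{k,t}$$
It describes the likelihood that the most suspected data stream has a shift.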
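One step of the spatially weighted update can be sketched as below. The slides do not specify the influence function $K_h$, so a Gaussian kernel with bandwidth `h` is assumed here; the function name `update_local_stats` and the toy numbers are also illustrative. The dense distance matrix is only practical for small grids.

```python
import numpy as np

def update_local_stats(W1, W2, coords, obs_idx, y_obs, u_min, h):
    """One kernel-weighted two-sided CUSUM step.
    coords: (p, 2) locations x_k; obs_idx: indices j of the observed streams;
    y_obs: their observations Y_{j,t}. Returns (W1, W2, W, S_t)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[obs_idx][None, :, :], axis=2)
    K = np.exp(-0.5 * (dist / h) ** 2)            # assumed Gaussian K_h(x_k - x_j)
    W1 = np.maximum(W1 + K @ (u_min * y_obs - u_min**2 / 2), 0.0)    # upward shift
    W2 = np.maximum(W2 + K @ (-u_min * y_obs - u_min**2 / 2), 0.0)   # downward shift
    W = np.maximum(W1, W2)
    return W1, W2, W, W.max()

# Toy example on a 10-by-10 grid: two large observations near pixel (2, 2).
m, n = 10, 10
coords = np.array([(r, c) for r in range(m) for c in range(n)], dtype=float)
W1, W2 = np.zeros(m * n), np.zeros(m * n)
obs_idx = np.array([22, 23, 77])          # streams at (2,2), (2,3), (7,7)
y_obs = np.array([3.0, 2.5, -0.2])
W1, W2, W, S = update_local_stats(W1, W2, coords, obs_idx, y_obs, u_min=1.0, h=1.0)
most_suspected = int(np.argmax(W))        # stream 22, i.e. pixel (2, 2)
```

Because of the kernel, unobserved streams next to pixel (2, 2) also receive a positive statistic, which is how evidence spreads to neighbors under the clustered-shift assumption.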
[iii] Allocate D-observations at next time point
• Where to allocate?
– Nearest neighbors of the most suspected stream
[Figure: map of the local statistics with the most suspected data stream marked.]
• How many?
– More D-observations for a higher chance of shift
– Similar to variable sample size charts
$$q_{D,t+1} = f_\theta(S_t)$$
where $q_{D,t+1}$ is the number of D-observations for time $t + 1$, $S_t$ is the charting statistic, and $f_\theta$ is an increasing function described by some parameter $\theta$. We apply a linear function. Li and Qiu (2014) used this technique on a variable sampling interval chart.
[Figure: map of the local statistics with the most suspected data stream marked; example with $q_{D,t+1} = 6$.]
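A minimal sketch of step [iii] follows. The slides say only that $f_\theta$ is linear and increasing, so the specific form used here (floor of $\theta S_t$, capped at a maximum) is an assumption, as are the names `num_d_observations` and `allocate_d` and the choice to count the most suspected stream itself among its nearest neighbors.

```python
import numpy as np

def num_d_observations(S, theta, q_d_max):
    """Assumed linear rule: q_D = floor(theta * S), clipped to [0, q_d_max]."""
    return int(min(q_d_max, max(0, np.floor(theta * S))))

def allocate_d(W, coords, q_d):
    """Indices of the q_d streams nearest to the most suspected stream."""
    if q_d == 0:
        return np.array([], dtype=int)
    k_star = int(np.argmax(W))                          # most suspected stream
    dist = np.linalg.norm(coords - coords[k_star], axis=1)
    return np.argsort(dist, kind="stable")[:q_d]        # ties broken by index

# Toy example: 10-by-10 grid with a single suspicious stream at pixel (4, 5).
m, n = 10, 10
coords = np.array([(r, c) for r in range(m) for c in range(n)], dtype=float)
W = np.zeros(m * n)
W[45] = 3.0
q_d = num_d_observations(W.max(), theta=2.0, q_d_max=8)   # floor(2 * 3) = 6
d_idx = allocate_d(W, coords, q_d)
```

With these toy values the six D-observations are the suspected pixel, its four direct neighbors, and one diagonal neighbor.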
Summary
• [i] Generate a number of candidate samples, and select the best one in terms of space filling property.
• [ii] Update local statistics with all observations of nearby variables using a kernel function.
• [iii] New D-observations are selected around the most suspected stream; the number is determined by the charting statistic.
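The three steps can be tied together in a compact control-flow skeleton. This is a simplified sketch, not the SASAM implementation: W-observations are placed uniformly at random as a stand-in for the Latin hypercube step, the kernel is an assumed Gaussian, and $f_\theta$ is an assumed floor-of-linear rule. The function name, parameter defaults, threshold, and toy data are all hypothetical.

```python
import numpy as np

def run_sasam_sketch(Y, m, n, q, H, u_min=1.0, h=1.0, theta=1.0, seed=0):
    """Return the first time t (1-based) with S_t > H, or None if no alarm."""
    rng = np.random.default_rng(seed)
    p = m * n
    coords = np.array([(r, c) for r in range(m) for c in range(n)], dtype=float)
    W1, W2 = np.zeros(p), np.zeros(p)
    d_idx = np.array([], dtype=int)                  # begin with q_{D,1} = 0
    for t, y in enumerate(Y, start=1):
        # [i] q_{W,t} = q - q_{D,t}; stand-in: uniform random placement
        rest = np.setdiff1d(np.arange(p), d_idx)
        w_idx = rng.choice(rest, size=q - len(d_idx), replace=False)
        obs = np.concatenate([w_idx, d_idx])
        # [ii] kernel-weighted two-sided CUSUM and charting statistic S_t
        dist = np.linalg.norm(coords[:, None, :] - coords[obs][None, :, :], axis=2)
        K = np.exp(-0.5 * (dist / h) ** 2)           # assumed Gaussian K_h
        W1 = np.maximum(W1 + K @ (u_min * y[obs] - u_min**2 / 2), 0.0)
        W2 = np.maximum(W2 + K @ (-u_min * y[obs] - u_min**2 / 2), 0.0)
        W = np.maximum(W1, W2)
        S = W.max()
        # [iii] q_{D,t+1} = f_theta(S_t); neighbors of the most suspected stream
        q_d = int(min(q - 1, np.floor(theta * S)))
        near = np.linalg.norm(coords - coords[int(np.argmax(W))], axis=1)
        d_idx = np.argsort(near, kind="stable")[:q_d]
        if S > H:                                    # charting statistic > H: alarm
            return t
    return None

# Toy run: 10-by-10 grid, mean shift of 3 in a 3-by-3 region after tau = 20.
rng = np.random.default_rng(1)
m, n, tau = 10, 10, 20
Y = rng.standard_normal((120, m * n))
mask = np.zeros((m, n), dtype=bool)
mask[4:7, 4:7] = True
Y[tau:, mask.ravel()] += 3.0
alarm_time = run_sasam_sketch(Y, m, n, q=30, H=50.0)
```

In practice the threshold $H$ would be calibrated to a target in-control run length; the value 50 here is arbitrary and chosen only to make the toy run behave sensibly.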
A real case study - solar flare detection
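Dataset:
• $p = 232 \times 292 = 67744$ data streams (one per pixel);
• $q = 500$ observable pixels (~0.7% available);
• One solar flare starts from frame 87.
• Both charts are adjusted to have the same in-control average run length (IC-ARL).
Results:
• TRAS triggers an alarm at t = 95 (8 frames after the flare starts).
• SASAM triggers an alarm at t = 91 (4 frames after the flare starts).
[Figure: side-by-side panels for SASAM and TRAS.]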
Video clip
Result
• Time 91: SASAM triggers an alarm
• Time 95: TRAS triggers an alarm
[Figure: TRAS and SASAM panels. Legend: observable stream; △ top-10 observable stream in the TRAS method.]
Summary of the proposed sampling strategy
• A systematic adaptive sampling strategy is proposed for real-time monitoring of Big Data streams with dynamically selected partial information.
• Adaptability:
– Integrates two sampling strategies (wide or deep)
– Quickly detects a wide range of possible changes, with no prior knowledge of the potential anomaly events, by adaptively adjusting to the event locations
– Actively selects the data streams to observe from the whole streaming data to maximize the sensitivity for anomaly detection under resource constraints
Proposed DDDAS new ideas
Table columns: Objectives | Existing Approaches | Proposed New Methodology

Applications modeling
• Objective: Establish a spatial-temporal statistical model for capturing the changing baseline
• Existing approaches:
– Capture only spatial or only temporal characteristics
– Extended models are rigid and limited to specific applications
– Require full observations
• Proposed new methodology:
– Capture both domains by integration of graphical models with matrix factorization
– Generic transfer learning framework for adaptive learning
– Allow dynamic partial observations

Application measurement systems and methods
• Objective: Establish an effective sampling strategy to decide which data streams to observe
• Existing approaches:
– Sampling over fixed sub-regions (rigid spatial domain) or whole data frames (rigid temporal domain)
• Proposed new methodology:
– Dynamic sampling of partial data streams over the spatial domain at each acquisition time, based on resource constraints

Mathematical and statistical algorithms
• Objective: Establish a scalable and robust SPC scheme to maximize change detection capability
• Existing approaches:
– Search for all possible fault scenarios
– Require all historical data
– Assume potential fault scenarios are known
– High computational costs
– Require full observations
• Proposed new methodology:
– Automatically identify and localize fault scenarios
– Require current observations and a summary statistic
– Assume potential fault scenarios are unknown
– Only linear complexity in the number of data streams
– Allow dynamic partial observations
Conclusion and Impact to Air Force
• It is critically important for the Air Force to make rapid decisions in a battlefield based on Big Data continuously collected from massive sensors in real time.
(a) Intrusion detection (b) Unmanned vehicle surveillance (c) Cybersecurity
• Early detection and localization of these anomaly events will enhance system-wide situational awareness to support warfighters/military operations, prevent damages, reduce cost, improve efficiency, and save billions of lives.
[Figure: snapshots of the temperature profile, from the Climate Institute at ORNL.]
Thank you for coming!
Questions?