Scalable Auto-Encoders for Gravitational Waves
Detection from Time Series Data
Roberto Corizzo, Michelangelo Ceci, Eftim Zdravevski, Nathalie Japkowicz
Introduction
Challenges in GW detection
To effectively carry out manual filtering operations, we need to know the underlying type of noise, as well as which frequencies could be relevant (or irrelevant) for the detection of a GW.
Additional knowledge is required to perform other recurrent tasks, such as spectrogram analysis, filtering, and whitening.
Detectors collect large volumes of data, on the order of petabytes per day
We investigated:
Two approaches (one supervised, one unsupervised) working directly on strain data
Auto-encoder models to classify strain time series data with an anomaly detection strategy
Auto-encoders for feature extraction and then a classification with Gradient-Boosted Trees (GBTs)
The goal was to discard noise time series and identify time series that could contain a real phenomenon: a classification problem
Our approaches do not require any pre-processing step (e.g., generation of spectrograms, filtering and whitening) and use real annotated GW events
The proposed approaches use the Apache Spark framework for scalable computation
The problem
The strain time series representing real GWs immersed in noise are impossible to distinguish, even to the human eye, if no data pre-processing is carried out.
[Figure: whitened time series representation of the same GW signal (ID GW151226).]
Signal whitening
• Data whitening is a typical step that can be performed in the time domain or in the frequency domain.
• The goal is to remove all the stationary noise, in order to possibly reveal weak signals hidden in other bands.
• Removing high-frequency noise is often done by applying a bandpass filter; see the sketch after this list (a Butterworth bandpass filter has been applied to filter signal frequencies below 43 Hz and above 300 Hz in this figure).
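For illustration, here is a minimal Python sketch of such a Butterworth bandpass filter using SciPy. The filter order is an assumption; the slides only specify the 43-300 Hz band and, elsewhere, the 4096 Hz sample rate.

```python
# A minimal sketch of Butterworth bandpass filtering, assuming a 4096 Hz
# sample rate and the 43-300 Hz band mentioned above (order 4 is an assumption).
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(strain, fs=4096.0, low=43.0, high=300.0, order=4):
    """Suppress frequencies below `low` Hz and above `high` Hz."""
    nyquist = 0.5 * fs
    b, a = butter(order, [low / nyquist, high / nyquist], btype="band")
    # filtfilt applies the filter forward and backward (zero phase shift).
    return filtfilt(b, a, strain)

# Usage on a synthetic 1-second strain segment:
noisy_strain = np.random.randn(4096)
filtered = bandpass(noisy_strain)
```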
Binary classification as anomaly
detection with autoencoders
When the data belongs to two classes, the binary classification task can be reformulated as an unsupervised anomaly detection task: deciding whether each instance is normal or anomalous.
The purpose is to discriminate between two classes by analyzing data over time and to classify whether the current behavior is as expected (normal) or deviates from the expected distribution (anomaly).
Auto-encoders have demonstrated superior performance on this task in recent literature.
They are able to build representations with a low reconstruction error, based on non-linear combinations of input features.
Autoencoders
Auto-encoders are trained with backpropagation: the activation functions of the hidden neurons map input vectors into a lower-dimensional representation.
The architecture can consist of multiple hidden layers.
If the purpose is to reconstruct the input, the final layer of the auto-encoder has the same size as the input data
When the purpose is classification, usually only the encoding part is kept, and the auto-encoder serves solely for initialization (a minimal sketch follows).
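As a rough illustration, here is a minimal PyTorch sketch of such an auto-encoder. This is a hypothetical stand-in (the paper's implementation runs on Apache Spark); the layer sizes follow the experimental setup described later.

```python
# A minimal PyTorch sketch of an auto-encoder with one hidden layer
# (hypothetical stand-in; sizes follow the experimental setup slide).
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_inputs=4096, n_hidden=512):
        super().__init__()
        # Encoding: non-linear projection into a lower-dimensional space.
        self.encoder = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.Sigmoid())
        # Decoding: the final layer has the same size as the input data.
        self.decoder = nn.Linear(n_hidden, n_inputs)

    def forward(self, x):
        return self.decoder(self.encoder(x))
```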
Time series classification via anomaly
detection (AE)
The final layer represents the decoding stage and, thus, has the same size as the input layer.
Train the auto-encoder with one-class data (different noise time series)
For a new data instance at prediction time, if a high reconstruction error of the auto-encoder is observed, then this data instance is assumed to be an anomaly, i.e., a data instance that potentially contains a real GW signal
The noise (i.e., the negative class) is abundant, in contrast to the few GW signals observed so far
Each time the auto-encoder is trained, a maximum distance threshold is calculated automatically from the training data (a function of the mean and standard deviation of the reconstruction errors); see the sketch below
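A minimal sketch of this thresholding step, assuming the threshold takes the form mean + f * std of the training reconstruction errors (the experimental setup later sets f = 1.5):

```python
# A hedged sketch of the thresholding step; the mean + f * std form is an
# assumption consistent with the "function of the std and mean" description.
import numpy as np

def fit_threshold(train_errors, f=1.5):
    """Maximum distance threshold computed from training reconstruction errors."""
    return np.mean(train_errors) + f * np.std(train_errors)

def is_anomaly(reconstruction_error, threshold):
    """True => the time series potentially contains a real GW signal."""
    return reconstruction_error > threshold
```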
Feature extraction with supervised
classification (AE-GBT)
We use auto-encoders exclusively for feature extraction, utilizing the encoding function of the auto-encoder.
The hidden layers have a lower number of neurons and represent a new feature space with reduced dimensionality and a higher level of abstraction.
We use GBTs (an ensemble of decision trees) because of their strong performance in predictive modeling
Supervised GBT models are trained on features extracted by auto-encoders, even on data representing different classes.
Even though the auto-encoder has been trained on time series of the negative class (noise), it extracts features also for time series of the positive class.
The GBT model classifies time series using the encoded vector extracted by the auto-encoder; a minimal sketch follows.
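In the sketch below, the encode() stub is a hypothetical stand-in for the trained auto-encoder's encoding function, and scikit-learn's GBT stands in for the Spark MLlib implementation used in the paper:

```python
# A hedged sketch of the AE-GBT pipeline; encode() is a hypothetical stub for
# the trained auto-encoder's encoder, and scikit-learn's GBT is a stand-in
# for Spark MLlib GBTs.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def encode(X):
    # Stand-in for the encoder (series -> lower-dimensional vector);
    # a real trained model goes here.
    return X[:, ::8]  # e.g. 4096 -> 512 dimensions

X_train = np.random.randn(100, 4096)
y_train = np.random.randint(0, 2, 100)   # labels of both classes
X_test = np.random.randn(10, 4096)

gbt = GradientBoostingClassifier()
gbt.fit(encode(X_train), y_train)        # supervised step on encoded features
y_pred = gbt.predict(encode(X_test))
```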
Datasets information
GW1
positive class - time series related to 4 events (GW150914, LVT151012, GW151226 and GW170104)
negative class - time series of 70 different types of noise, generated by the PyCBC Python library
GW2
positive class - whitened time series blended with the different noise time series, at different noise rates (0.1, 0.25, 0.50)
negative class - same as in GW1
GW3
positive class - same as in GW1
negative class - gravitational noise sampled in the near proximity (3-15 seconds) of the 7 other events (GW170729, GW170809, GW170817, GW170608, GW170814, GW170818 and GW170823)
Specifics:
Time series of 1 and 3 seconds are collected at a sample rate of 4096 Hz
From each event there are 2 time series: data strain collected at the H1 (Hanford) and L1 (Livingston) interferometers
Data augmentation by shifting data in time by an offset (e.g., -0.25, -0.125, 0.125, 0.25 seconds); see the sketch below
Time series are normalized using min-max normalization.
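A hedged sketch of these augmentation and normalization steps; the circular boundary handling of the time shift is an assumption, as the slides do not specify it:

```python
# A hedged sketch of time-shift augmentation and min-max normalization;
# circular (wrap-around) shifting is an assumption.
import numpy as np

def shift(series, offset_seconds, fs=4096):
    """Data augmentation: shift the series in time by a given offset."""
    return np.roll(series, int(offset_seconds * fs))

def min_max(series):
    """Min-max normalization to the [0, 1] range."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo)

strains = [np.random.randn(4096) for _ in range(2)]  # placeholder strain data
augmented = [shift(min_max(s), off)
             for s in strains
             for off in (-0.25, -0.125, 0.125, 0.25)]
```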
Experimental setup
Auto-encoder (1 or 2 hidden layers for encoding and 1 or 2 hidden layers for decoding):
One hidden layer: the number of hidden units is set to 512 or 1024
Two hidden layers: the number of hidden units is set to
512 for the first layer and 256 for the second layer, or
1024 for the first layer and 512 for the second layer.
Considering the sampling rate of 4096 Hz, 512 hidden units correspond to 1/8 of the 1-second time series length, or 1/24 of the 3-second time series length
GW1 and GW2 datasets - 5-fold CV
GW3 dataset - leave-one-event-out CV (4 events in the positive class => 4-fold CV); see the sketch below
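A hedged sketch of the two evaluation schemes, using scikit-learn splitters as stand-ins; X, y, and event_ids below are hypothetical placeholders for the actual data:

```python
# A hedged sketch of 5-fold CV (GW1/GW2) and leave-one-event-out CV (GW3);
# X, y, and event_ids are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

X = np.random.randn(8, 4096)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
event_ids = np.array([0, 1, 2, 3, 0, 1, 2, 3])   # 4 events => 4 folds

kfold = KFold(n_splits=5, shuffle=True)          # GW1 and GW2
logo = LeaveOneGroupOut()                        # GW3: leave-one-event-out
for train_idx, test_idx in logo.split(X, y, groups=event_ids):
    pass  # train on 3 events, test on the held-out one
```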
Other details:
The auto-encoder training uses the L-BFGS optimizer
Maximum number of iterations = 500
Early stopping if the algorithm is unable to reduce the training error by at least 10^-5 at a given iteration
In detection mode, the factor for the standard deviation is set to f = 1.5 (empirically determined)
GBT parameters - defaults in MLlib (see the sketch below): maxIter = 10, maxDepth = 5, maxBins = 32, minInstancesPerNode = 1, lossType = logistic.
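A PySpark sketch of a GBT classifier configured with exactly these MLlib defaults (Spark session creation and feature assembly omitted):

```python
# A hedged PySpark sketch of the GBT configuration with the MLlib defaults
# listed above (session creation and feature assembly omitted).
from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(
    maxIter=10,              # number of boosting iterations
    maxDepth=5,              # maximum depth of each decision tree
    maxBins=32,              # bins for discretizing continuous features
    minInstancesPerNode=1,   # minimum instances per tree node
    lossType="logistic",     # loss for binary classification
)
```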
Anomaly detection threshold analysis
Sensitivity study on the threshold f (the factor for the standard deviation) and its impact on classification performance for the AE approach on all datasets with a time series length of 1 second. The adopted model architecture consists of one hidden layer with 512 units. The best results in terms of F-Score are marked in bold.
Baseline methods
Conv1D - a 1-dimensional CNN, evaluated in two variants (see the sketch after this list):
2 convolutional layers (CL) (the 1st with 40 filters, the 2nd with 20 filters, both of length 3), max-pooling layers of size 2, and a dense layer with fully connected units at the end (two configurations evaluated: 512 and 1024 units).
2 CL (the 1st with 40 filters, the 2nd with 20 filters, both of length 3), max-pooling layers of size 2, and an MLP consisting of 3 layers at the end: two dense layers (DL) with fully connected (FC) units and 1 output layer for classification. We evaluated two configurations for the DL: (512, 256) and (1024, 512). In both cases, the standard ReLU activation function was used for the CL and DL, whereas the softmax activation function was used for the classification layer.
Hyperparameters were selected via grid search under a nested cross-validation scheme, over: learning rate, dropout, batch size, and number of dense units.
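A hedged Keras sketch of the first Conv1D configuration (512 dense units); tuned hyperparameters such as dropout and learning rate are omitted, and the softmax output layer is assumed for both variants:

```python
# A hedged Keras sketch of the first Conv1D baseline configuration
# (1-second input at 4096 Hz; tuned hyperparameters omitted).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4096, 1)),
    layers.Conv1D(40, 3, activation="relu"),   # 1st CL: 40 filters, length 3
    layers.MaxPooling1D(2),
    layers.Conv1D(20, 3, activation="relu"),   # 2nd CL: 20 filters, length 3
    layers.MaxPooling1D(2),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),      # dense layer (512 or 1024)
    layers.Dense(2, activation="softmax"),     # classification layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```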
Deep Filtering (George et al. 2018) - features a deep architecture based on 1-d CNNs, containing 4 CL with dilations, filter sizes of 64, 128, 256 and 512, and 2 FC layers with sizes 128 and 64.
GW1 using different methods
• The AE approach can effectively classify unseen time series as noise or GWs.
• No significant difference between the different auto-encoder architectures.
• The same consideration holds when comparing AE, AE-GBT, and Conv1D, which obtain similar results.
• Overall, the best results in terms of F-Score are obtained by AE-GBT.
• The AE approach assumes no prior knowledge about the data distribution of the positive class, representing GW phenomena, and it provides classification solely based on the reconstruction error of unseen time series.
GW2 using different methods
(Results are reported at 10%, 25%, and 50% noise rates.)
• AE and AE-GBT are still comparable when the noise rate is 10%, whereas the performances of AE and Conv1D degrade at higher rates.
• AE degrades with higher noise levels, which is expected considering the way the dataset was created.
• AE prefers smaller-sized networks.
• The Deep Filtering method performs better with longer time series, although its overall classification result is still much worse than our proposed approaches.
• The AE-GBT approach maintains optimal performance, even when noise increases.
GW3 using different methods
• Degradation in the performance of all methods, deriving from the increased complexity of the task.
• AE-GBT still appears to be the best approach, with satisfactory classification performance in terms of F-Score.
• The p-values of the Wilcoxon Signed Rank tests show that AE-GBT is consistently the best.
Scalability experimental setup
Scalability of GBTs is well documented within Spark MLlib
Therefore, the focus is on the AE approach, which is utilized in both proposed approaches
We replicated all negative-class time series of 1 second length from the GW1 dataset, resulting in up to 50,000 time series instances, each with at most 1% perturbation applied to every element of the time series. This was done to prevent unrealistically fast convergence of the loss function.
We used two Spark cluster configurations on Microsoft Azure HDInsight with D12, D13 and D14 instances (a D12 instance has 4 cores and 28 GB of RAM, a D13 instance has 8 cores and 56 GB of RAM, and a D14 instance has 16 cores and 112 GB of RAM).
For the local setting, we use 2 D12 head nodes and one D12 worker node.
1st cluster configuration: 2 D12 head nodes and 4 D13 worker nodes (scale factor = 8)
2nd cluster configuration: 2 D12 head nodes and 8 D14 worker nodes (scale factor = 16)
We evaluated different configurations for the Spark job, including the number of executors, the number of executor cores, and the executor memory; see the sketch below.
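A hedged PySpark sketch of setting such job parameters; the specific values are illustrative, not the configuration actually used in the experiments:

```python
# A hedged PySpark sketch of tuning the Spark job parameters mentioned above;
# the values shown are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scalable-ae-gw")
    .config("spark.executor.instances", "8")   # number of executors
    .config("spark.executor.cores", "4")       # cores per executor
    .config("spark.executor.memory", "20g")    # memory per executor
    .getOrCreate()
)
```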
Scalability results (1)
[Figure: training duration in minutes (left) and speedup factor (right), plotted against minutes of processed data, for the Local, Distributed (1), and Distributed (2) settings.]
Scalability results in terms of execution time (left) and speedup (right) for the AE approach, obtained with two cluster configurations. The two curves (1) and (2) refer to the different Spark cluster configurations on Azure HDInsight.
Scalability results (2)
[Figure: training duration in minutes versus number of worker CPUs (8 to 64), with one series per amount of processed data: 16.67, 83.33, 166.7, and 833.33 minutes.]
Scalability results in terms of execution time for the training process with a varying number of CPUs:
• 8 - Local
• 32 - Distributed (1)
• 64 - Distributed (2)
The missing point of the last series in the graph (833.33 minutes of processed data) denotes that the Local configuration with 8 worker CPUs could not complete the training in time.
Conclusion
The results show that, overall, the proposed AE and AE-GBT significantly outperform the competitor methods.
AE-GBT significantly outperforms AE.
The AE-GBT, Conv1D, and Deep Filtering methods require labeled data of both classes during the learning phase.
Since it is not always possible to know the morphology of the expected phenomena, the AE approach could be a valid alternative to AE-GBT.
The proposed methods scale well using Apache Spark.