Echo State Hoeffding Tree Learning
Diego Marron ([email protected]), Jesse Read ([email protected]), Albert Bifet ([email protected]), Talel Abdessalem ([email protected]), Eduard Ayguade ([email protected]), Jose R. Herrero ([email protected])
ACML 2016
Hamilton, New Zealand
Introduction
• Real-time classification of Big Data streams is becoming essential in a variety of application domains.
• Real-time classification imposes some challenges:
  • Deal with potentially infinite streams
  • Strong temporal dependences
  • React to changes in the stream
  • Response time and memory are bounded
Real Time Classification
• In real-time classification:
  • The Hoeffding Tree (HT) is the streaming state-of-the-art decision tree
  • HTs are powerful and easy to deploy (no hyper-parameters to tune)
  • But they are unable to capture strong temporal dependences
• Recurrent Neural Networks (RNNs) are very popular nowadays
Recurrent Neural Networks
• Recurrent Neural Networks (RNNs) are the state of the art in handwriting recognition, speech recognition, and natural language processing, among others
• They are able to capture time dependences
• But their use for data streams is not straightforward:
  • Very sensitive to hyper-parameter configuration
  • Training requires many iterations over the data...
  • ...and a large amount of time
RNN: Echo State Network
• A type of Recurrent Neural Network
• Echo State Layer (ESL):
  • Dynamics driven only by the input
  • Requires very few computations
  • Easy-to-understand hyper-parameters
  • Can capture time dependences
• The ESN also requires the hyper-parameters needed by the NN
• Gradient descent methods have slow convergence
Contribution
• Objective:
  • Model the evolution of the stream over time
  • Reduce the number of hyper-parameters
  • Reduce the number of samples needed to learn
• In this work we present the ESHT:
  • A combination of an HT and an ESL
  • Learns temporal dependences in data streams in real time
  • Requires fewer hyper-parameters than the ESN
ESHT
• Echo State Layer (ESL), sketched below:
  • Only needs two hyper-parameters:
    • Alpha (α): weights the importance of past events in the state X(n) against new ones
    • Density: Wres is a sparse matrix with the given density
  • Encodes time dependences
• FIMT-DD: a Hoeffding tree for regression
  • Works out of the box: no hyper-parameter tuning
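A minimal sketch of the ESL state update, assuming the standard leaky-integrator ESN equation X(n) = (1 − α)·X(n−1) + α·tanh(Win·u(n) + Wres·X(n−1)); the class name, seeding, and weight scales are illustrative choices, not the authors' implementation.

import numpy as np
from scipy import sparse

class EchoStateLayer:
    """Minimal ESL sketch: a fixed, sparse, untrained reservoir."""

    def __init__(self, n_inputs, n_neurons, alpha=0.7, density=0.4, seed=42):
        rng = np.random.default_rng(seed)
        self.alpha = alpha  # trades off the past state against new input
        # Sparse reservoir matrix with the given density; weights stay fixed
        self.w_res = sparse.random(n_neurons, n_neurons, density=density,
                                   random_state=rng,
                                   data_rvs=rng.standard_normal).tocsr()
        self.w_in = rng.standard_normal((n_neurons, n_inputs))
        self.x = np.zeros(n_neurons)  # state X(n), driven only by the input

    def update(self, u):
        """Advance the state one step for input vector u and return X(n)."""
        pre = self.w_in @ u + self.w_res @ self.x
        self.x = (1.0 - self.alpha) * self.x + self.alpha * np.tanh(pre)
        return self.x

In the ESHT, the state vector X(n) returned by update would be fed to FIMT-DD as the feature vector for regression.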
ESHT: Evaluation Methodology
• We propose the ESHT to learn character-stream functions:
  • Counter (skipped in this presentation)
  • lastIndexOf
  • emailFilter
• lastIndexOf evaluation:
  • Study the effects of the hyper-parameters α and density:
    • Alpha (α): weights the importance of past events in X(n) against new ones
    • Density: Wres is a sparse matrix with the given density
  • Use 1,000 neurons in the ESL
• emailFilter evaluation:
  • We focus on the speed of learning
  • Use outcomes from the previous evaluations to configure the ESHT for this task
• Metrics (a sketch of the evaluation loop follows this list):
  • Cumulative loss
  • We count an error if |y_t − ŷ_t| ≥ 0.5
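A sketch of how these metrics could be computed, assuming a prequential (test-then-train) loop, which is standard for stream evaluation; model.predict and model.learn are hypothetical method names, not the authors' API.

def evaluate(stream, model):
    """Prequential evaluation: returns (cumulative absolute loss, accuracy %).
    A prediction counts as an error if |y_t - y_hat_t| >= 0.5."""
    cumulative_loss = 0.0
    correct = 0
    n = 0
    for u, y_true in stream:
        y_pred = model.predict(u)            # test first...
        cumulative_loss += abs(y_true - y_pred)
        if abs(y_true - y_pred) < 0.5:
            correct += 1
        model.learn(u, y_true)               # ...then train
        n += 1
    return cumulative_loss, 100.0 * correct / n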
Input format
• Input is a vector of floats
• Number of attributes = number of input symbols
• The attribute representing the current symbol is set to 0.5
• All other attributes are set to zero
• A minimal encoding sketch is shown below
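A minimal encoding sketch, assuming each symbol is identified by an integer index:

import numpy as np

def encode_symbol(symbol_index, n_symbols):
    """One attribute per input symbol: the current symbol's attribute is
    set to 0.5, all others to zero."""
    u = np.zeros(n_symbols)
    u[symbol_index] = 0.5
    return u

# e.g. encode_symbol(2, 4) -> array([0. , 0. , 0.5, 0. ])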
LastIndexOf
• Counts the number of time steps since the current symbol was last observed
• The input stream is randomly generated
• We use 2, 3, and 4 symbols (a label-generator sketch follows)
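One plausible label generator for lastIndexOf; emitting a target of 0 on a symbol's first occurrence is an assumption, since the slides do not specify it:

import random

def last_index_of_stream(n_symbols, length, seed=0):
    """Yield (symbol, target) pairs, where target is the number of time
    steps since this symbol was last observed (0 on first occurrence,
    by assumption)."""
    rng = random.Random(seed)
    last_seen = {}
    for t in range(length):
        s = rng.randrange(n_symbols)
        target = t - last_seen[s] if s in last_seen else 0
        last_seen[s] = t
        yield s, target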
LastIndexOf: Vector vs Scalar Input
• Vector input improves accuracy in all cases
• Especially with 4 symbols
[Figure: Accuracy (%) vs α at density 0.4, comparing scalar and vector input for 2, 3, and 4 symbols (series: 2symbols, 2symbols-vec, 3symbols, 3symbols-vec, 4symbols, 4symbols-vec).]
LastIndexOf: Alpha and Density vs Accuracy
• Lower values of alpha (α) yield low accuracy
• There is no clear correlation between accuracy and density
[Figure, left: Accuracy (%) vs Alpha (α) for 2, 3, and 4 symbols at densities 0.1 and 0.4. Right: Accuracy (%) vs Density for α = 0.2 to 1.0.]
EmailFilter
• ESHT configuration (see the sketch below):
  • ESL: 4,000 neurons
  • α = 1.0 and density = 0.1
• Outputs the length on the next space character
• Dataset: 20 Newsgroups
  • Extracted 590 characters and repeated them 8 times
  • To reduce memory usage we used an input vector of 4 symbols
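With the hypothetical EchoStateLayer sketch from earlier, this configuration would read:

# emailFilter settings from the slide: 4-symbol input vector,
# 4,000 neurons, alpha = 1.0, density = 0.1
esl = EchoStateLayer(n_inputs=4, n_neurons=4000, alpha=1.0, density=0.1)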
EmailFilter: Recurrence vs Non-Recurrence
• Non-recurrent methods (FIMT-DD and NN) fail to capture temporal dependences
• The NN defaults to the majority class
Algorithm   Density   α     Learning rate   Loss      Accuracy (%)
FIMT-DD     -         -     -               4,119.7   91.61
NN          -         -     0.8             2,760     97.80
ESN1        0.2       1.0   0.1             1,032     98.47
ESN2        0.7       1.0   0.1             850       98.47
ESHT        0.1       1.0   -               180       99.75
EmailFilter: ESN vs ESHT
• After 500 samples the ESHT loss is close to 0 (and reaches 0 loss after 1,000 samples)
[Figure: Cumulative loss vs # samples for ESN1, ESN2, and ESHT.]
Conclusions and Future Work
• Conclusions:
  • We presented the ESHT to learn temporal dependences in data streams in real time
  • The ESHT requires fewer hyper-parameters than the ESN
  • Our proof-of-concept implementation is able to learn faster than an ESN (most functions at the first attempt)
• Future work:
  • We are currently reimplementing our prototype so we can test larger input sequences
  • We need to study the effects of the initial state vanishing in long sequences
Thank you
ESHT: Module Architecture
• In each evaluation we use the following architecture
• The label generator implements the function to be learnt
Counter: Introduction
• Stream of zeros and ones, randomly generated
• Input is a scalar
• Two variants (a generator sketch follows):
  • Option 1: outputs the cumulative count
  • Option 2: outputs the total count on the next zero
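A label-generator sketch for the two Counter variants; the exact reset behaviour of Option 2 is one plausible reading, not something the slide states.

import random

def counter_stream(length, variant=1, seed=0):
    """Yield (bit, target) pairs for a random 0/1 stream.
    Variant 1: target is the cumulative count of ones so far.
    Variant 2: target is the total count, emitted on the next zero
    (the count then resets, by assumption); otherwise target is 0."""
    rng = random.Random(seed)
    count = 0
    for _ in range(length):
        bit = rng.randint(0, 1)
        if variant == 1:
            count += bit
            yield bit, count
        elif bit == 1:
            count += 1
            yield bit, 0
        else:
            yield bit, count
            count = 0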
Counter: Cumulative Loss
• After 200 samples the loss is stable
[Figure: Cumulative loss vs # samples for Op1 (density=0.3, α=1.0), Op1 (density=1.0, α=0.7), Op2 (density=0.8, α=1.0), and Op2 (density=0.8, α=0.7).]
Counter: Alpha and Density vs Accuracy
[Figure, left: Accuracy (%) vs Alpha (α). Right: Accuracy (%) vs Density.]
EmailFilter: ASCII to 4 symbols Table
ASCII Domain                        4-Symbols Domain
Original Symbols   Target Symbol    Target Symbol Index
[\t \n \r]+        Single space     0
[a-zA-Z0-9]        x                1
@                  @                2
.                  .                3
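The table's mapping can be sketched as a small preprocessing function; silently dropping characters outside the table is an assumption.

import re

def to_four_symbols(text):
    """Map ASCII text into the 4-symbol domain of the table above:
    whitespace runs -> 0 (single space), [a-zA-Z0-9] -> 1,
    '@' -> 2, '.' -> 3."""
    text = re.sub(r'[\t\n\r ]+', ' ', text)  # collapse whitespace runs
    out = []
    for ch in text:
        if ch == ' ':
            out.append(0)
        elif re.fullmatch(r'[a-zA-Z0-9]', ch):
            out.append(1)
        elif ch == '@':
            out.append(2)
        elif ch == '.':
            out.append(3)
        # other characters are dropped (an assumption)
    return out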