Random Sets Approach and its Applications
• Introduction: input data, objectives and main assumptions.
• Random sets approach.
• Basic iterative feature selection, and modifications.
• Tests for independence & trimmings (similar to the HITON algorithm).
• Experimental results with some comments.
• Concluding remarks.
Vladimir Nikulin, Suncorp, Australia
Introduction

Training data: {y_t, X_t}, t = 1, ..., n, where y_t is a binary label and X_t = {x_{t1}, ..., x_{tm}} is a vector of m features.
In a practical situation the label y may be hidden, and the task is to estimate it using the vector of features: y_t ≈ f(X_t).

The area under the receiver operating curve (AUC) will be used as the evaluation and optimisation criterion.
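For reference, the AUC can be computed directly from pairwise comparisons of positive and negative scores; a minimal sketch (the toy labels and scores below are illustrative only, not data from the experiments):

```python
def auc(labels, scores):
    """AUC via the pairwise (Wilcoxon-Mann-Whitney) formulation:
    the fraction of (positive, negative) pairs that the scores
    rank correctly, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.1]))  # perfect ranking -> 1.0
```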
Causal relations
[Figure: causal graph over features X1-X4 and X6-X9 with target variable Y.]
Main assumption: direct features have a stronger influence on the target variable and are, therefore, more likely to be selected by the FS-algorithms.

Manipulations are actions or experiments performed by an external agent on a system, whose effect disrupts the system's natural functioning. By definition, direct features cannot be manipulated.
Basic iterative FS-algorithm
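The transcript does not reproduce the algorithm body, so the following is only a sketch of generic greedy forward feature selection; the scoring function, the stopping tolerance, and the toy per-feature weights are assumptions, not the author's exact procedure:

```python
def forward_select(features, score, tol=1e-4):
    """Greedy forward feature selection: repeatedly add the feature
    that most improves the score of the current subset; stop once no
    candidate improves the best score by more than `tol`."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        gains = [(score(selected + [f]), f) for f in remaining]
        top, f = max(gains)
        if top <= best + tol:
            break
        selected.append(f)
        remaining.remove(f)
        best = top
    return selected

# Hypothetical scorer: subset quality = sum of fixed per-feature weights.
weights = {"x1": 0.5, "x2": 0.3, "x3": -0.2}
print(forward_select(weights, lambda s: sum(weights[f] for f in s)))
# -> ['x1', 'x2']  (adding x3 would lower the score)
```

In the competition setting, the score would be something like cross-validated AUC of the base model trained on the candidate subset.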
BIFS: behaviour of the target function

[Figure: four panels — CINA, LUCAP, MARTI and REGED.]
RS-algorithm
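The transcript gives only the name, but RS(10000, 40) suggests drawing 10000 random subsets of 40 features each. One plausible reading, sketched below, is to score every random subset and rank each feature by the mean score of the subsets containing it; the scorer and the mean-score aggregation are assumptions, not the author's exact procedure:

```python
import random

def random_sets(features, score, n_sets=10000, set_size=40, seed=0):
    """Random sets approach (sketch): draw many random feature subsets,
    score each subset, and credit every member feature with the mean
    score of the subsets in which it appeared."""
    rng = random.Random(seed)
    total = {f: 0.0 for f in features}
    count = {f: 0 for f in features}
    for _ in range(n_sets):
        subset = rng.sample(features, set_size)
        s = score(subset)
        for f in subset:
            total[f] += s
            count[f] += 1
    # Mean subset score per feature (skip features never sampled).
    return {f: total[f] / count[f] for f in features if count[f]}

# Toy example: 10 features, of which 0-2 are "good"; the subset score
# is the fraction of good features it contains, so features 0-2 should
# float to the top of the ranking.
feats = list(range(10))
good = {0, 1, 2}
ranks = random_sets(feats, lambda s: len(good & set(s)) / len(s),
                    n_sets=500, set_size=4, seed=1)
```

In the competition setting, the subset score would again be the AUC of the base model trained on that random subset.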
[Figure: RS(10000, 40), MARTI case; 10% levels marked.]
Test for independence (or trimming)

min_{f ∈ Z} corr(f, g(y, Z \ f))
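A direct implementation of this criterion needs the surrogate g built from the remaining features Z \ f. As a simplified illustration only, the sketch below trims features whose Pearson correlation with the target is near zero; substituting the target y itself for g, and the 0.1 threshold, are assumptions:

```python
def pearson(a, b):
    """Plain Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def trim(Z, y, threshold=0.1):
    """Drop features whose absolute correlation with the target falls
    below `threshold` — a simplified independence trimming."""
    return {name: col for name, col in Z.items()
            if abs(pearson(col, y)) >= threshold}

y = [0, 0, 1, 1]
Z = {"signal": [0.1, 0.2, 0.8, 0.9],    # tracks the target
     "noise":  [0.5, -0.5, 0.5, -0.5]}  # uncorrelated with the target
print(sorted(trim(Z, y)))  # -> ['signal']; the noise column is trimmed
```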
Base models and software

Data   #Train (positive)  #Test  Dimension  Method              Software
LUCAS  2000 (1443)        10000  11         neural+gentleboost  MATLAB-CLOP
LUCAP  2000 (1443)        10000  143        neural+gentleboost  MATLAB-CLOP
REGED  500 (59)           20000  999        SVM-RBF             C
SIDO   12678 (452)        10000  4932       binaryRF            C
CINA   16033 (3939)       10000  132        adaBoost            R
MARTI  500 (59)           20000  1024       svc+standardize     MATLAB-CLOP
Final results (first 4 lines)
Data   Submission  CASE0   CASE1   CASE2   Mean    Rank
REGED  vn14        0.9989  0.9522  0.7772  0.9094  4
SIDO   vn14        0.9429  0.7192  0.6143  0.7588  5
CINA   vn14a       0.9764  0.8617  0.7132  0.8504  2
MARTI  vn14        0.9889  0.8953  0.7364  0.8736  4
LUCAS  vn1         0.9209  0.9097  0.7958  0.8755  validation
LUCAP  vn10b+vn1   0.9755  0.9167  0.9212  0.9378  validation
CINA   vn1         0.9765  0.8564  0.7253  0.8528  all features
CINA   vn11        0.9778  0.8637  0.7180  0.8532  CE
Some particular results

Data    Submission  #features  Fscore  TrainAUC  TestAUC
REGED1  vn14        400        0.7316  1         0.9522
REGED1  vn11d       150        0.8223  1         0.9487
REGED1  vn1         999        0.5000  1         0.9445
REGED1  vn8         899        0.5145  1         0.9436
MARTI1  vn12c       500        0.5784  1         0.8977
MARTI1  vn14        400        0.5554  1         0.8953
MARTI1  vn3         999        0.5124  1         0.8872
MARTI1  vn7         899        0.4895  1         0.8722
SIDO0   vn9         203        0.5218  0.9684    0.9460
SIDO0   vn9a        326        0.5360  0.9727    0.9459
SIDO0   vn1         1030       0.5785  0.9811    0.9430
SIDO0   vn14        527        0.5502  0.9779    0.9429
[Figure: behaviour of linear filtering coefficients, MARTI-set.]
[Figure: CINA-set, AdaBoost; plot of one solution against another.]
[Figure: SIDO, RF(1000, 70, 10).]
Some comments
In practical applications we deal not with pure probability distributions but with mixtures of distributions, which reflect trends and patterns that change over time. Accordingly, it appears more natural to form the training set as an unlabeled mixture of subsets derived from different (manipulated) distributions, for example REGED1, REGED2, ..., REGED9. As the distribution for the test set we can select any "pure" distribution.

Proper validation is particularly important when the training and test sets have different distributions. Accordingly, it is good practice to apply the traditional strategy: randomly split the available test set into two parts 50/50, using one part for validation and the other for testing.
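The suggested 50/50 split can be sketched as follows (the helper name and fixed seed are illustrative, not from the original):

```python
import random

def split_half(items, seed=0):
    """Randomly split a list 50/50 into validation and test halves,
    as suggested when train and test distributions differ."""
    rng = random.Random(seed)
    shuffled = items[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

val, test = split_half(list(range(10000)))
```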
Concluding remarks
The random sets approach is heuristic in nature and was inspired by the growing speed of computation. It is a general method, and there are many directions for further development.

The performance of the model depends on the particular data. Certainly, we cannot expect one method to produce good solutions for all problems.

It was probably necessary to apply a more aggressive FS-strategy in the case of the Causal Discovery competition.

Our results against all unmanipulated and all validation sets are in line with the top results.