Pfizer HTS Machine Learning Algorithms: November 2002
Paul Hsiung (hsiung+@cs.cmu.edu)
Paul Komarek (komarek@cs.cmu.edu)
Ting Liu (tingliu@cs.cmu.edu)
Andrew W. Moore (awm@cs.cmu.edu)
Auton Lab, Carnegie Mellon University
School of Computer Science
www.autonlab.org
Auton Lab, www.autonlab.org HTS Results, 11/25/2: Slide 2
Datasets

Our Name    Num. Records    Num. Attributes    Num. non-zero input cells    Num. positive outputs
train1      26,733          6,348              3.7M                         804
test1       1,456           6,121              0.2M                         878
jun-3-1     88,358          1,143,054          30M                          423
combined    88,358          1,143,054          30M                          211

Description
train1: The original dataset sent to CMU in Feb 2002.
test1: The test set associated with the above training set.
jun-3-1: The large “TEST3” dataset sent to us in May 2002. The “-1” at the end denotes that we were using the first of the four activation columns.
combined: Combination of the “TEST3” datasets. The activation in combined is positive if and only if at least two of the four original activations were positive.
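The rule that defines the combined activation can be written in a few lines. The following is a minimal sketch, assuming the four activation columns are available as a 0/1 array; the function name, the `min_positive` parameter, and the toy values are illustrative and do not come from the original data.

```python
import numpy as np

def combine_activations(activations, min_positive=2):
    """Return 1 for each row where at least `min_positive` columns are positive."""
    activations = np.asarray(activations)
    positives_per_row = (activations > 0).sum(axis=1)
    return (positives_per_row >= min_positive).astype(int)

# Toy example: 4 compounds x 4 activation columns (made-up values).
toy = np.array([
    [1, 0, 0, 0],   # one positive activation   -> combined negative
    [1, 1, 0, 0],   # two positive activations  -> combined positive
    [0, 0, 0, 0],   # no positive activations   -> combined negative
    [1, 1, 1, 1],   # four positive activations -> combined positive
])
combined = combine_activations(toy)
```

Requiring agreement between at least two columns is consistent with the table above, where combined has fewer positive outputs (211) than jun-3-1 (423).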
Projections

Our Name    Name given to original    Name given to 100-dimensional projection    Name given to 10-dimensional projection
train1      train1                    train100                                    train10
test1       test1                     test100                                     test10
train1      train1                    train-pls-100                               train-pls-10
test1       test1                     test-pls-100                                test-pls-10
jun-3-1     n/a                       jun-3-1                                     n/a
combined    n/a                       combined                                    n/a
Previous Algorithms

BC: Bayes Classifier
On original data, a naïve categorical classifier was used. On real-valued projected data, a naïve Gaussian classifier was used.

Dtree: Decision Tree
This technique is also known as Recursive Partitioning and CART. It was only implemented for the original data.

SVM: Support Vector Machine
Except where stated otherwise, a linear SVM was used. We could not find a significant performance difference between linear SVM and Radial Basis Function SVM with a variety of RBF parameters.

k-NN: k-nearest neighbor
Except where stated otherwise, k=9 neighbors were used. Only implemented for projected data.

LR: Logistic Regression
Except where stated otherwise, used Conjugate Gradient to perform intermediate weighted regressions, using a newly developed technique.
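The slides do not spell out the newly developed LR technique. As a rough illustration of the general idea only, the sketch below fits logistic regression by iteratively reweighted least squares and solves each intermediate weighted regression with conjugate gradient. The function names, the small ridge term, the iteration counts, and the synthetic data are all assumptions made for this example, not details taken from the slides.

```python
import numpy as np

def conjugate_gradient(A, b, x0, iters=50, tol=1e-10):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = x0.copy()
    r = b - A @ x              # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def fit_logistic_irls_cg(X, y, outer_iters=10, ridge=1e-3):
    """IRLS for logistic regression; each weighted regression is solved by CG."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(outer_iters):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = np.clip(mu * (1.0 - mu), 1e-6, None)   # IRLS weights
        z = eta + (y - mu) / w                     # working response
        # Dense normal equations are fine for a toy problem; for large sparse
        # data one would use matrix-vector products instead of forming A.
        A = X.T @ (w[:, None] * X) + ridge * np.eye(d)
        b = X.T @ (w * z)
        beta = conjugate_gradient(A, b, beta)      # warm start from last beta
    return beta

# Synthetic data (made up): intercept column plus three real-valued inputs.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])
true_beta = np.array([0.0, 1.5, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
beta = fit_logistic_irls_cg(X, y)
accuracy = np.mean((X @ beta > 0) == (y == 1))
```

Conjugate gradient needs only matrix-vector products, which is one reason it suits sparse, high-dimensional inputs such as those listed under Datasets.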
New Algorithms

new-KNN: Tractable high-dimensional k-nearest neighbor
Can work on the 1,000,000-dimensional “June” data.

EFP: Explicit False Positive Logistic Regression
Logistic regression that accounts for the high false positive rate.

SMod: Super Model
Automatically combining the predictions from multiple algorithms with a “meta-level” of logistic regression.

PLS-proj: Partial Least Squares Projection
Using PLS instead of PCA to project down data.

PLS: Partial Least Squares Prediction
Using the PLS algorithm as a predictor.
Explicit False Positive Model
[Two slides with this title; figures not preserved.]

Example in 2 dimensions [one figure per slide; figures not preserved]
• Decision boundary
• 100 true positives
• 100 true positives and 100 true negatives
• 100 TP, 100 TN, 10 FP
• Using regular logistic regression
• Using EFP Model

Example with 10000 true positives [one figure per slide; figures not preserved]
• 10000 true positives
• 10000 true positives, 10000 true negatives
• 10000 TP, 10000 TN, 1000 FP
• Using regular logistic regression
• Using EFP Model
EFP Model Real Data Results
K-fold [plot not preserved]

EFP Effect
Very impressive on Train1 / Test1 [plot not preserved; repeated on the next slide with a log x-axis]

EFP Effect
Unimpressive on jun31 / jun32 [plot not preserved]
Super Model
• Divide Training Set into Compartment A and Compartment B
• Learn each of N models on Compartment A
• Predict each of N models on Compartment B
• Learn best weighting of opinions with Logistic Regression of Predictions on Compartment B
• Apply the models and their weights to Test Data
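The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the two base models (a gradient-descent logistic regression and a toy class-centroid scorer), the even A/B split, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Gradient-descent logistic regression with an intercept; returns a scorer."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return lambda Xn: sigmoid(np.hstack([np.ones((len(Xn), 1)), Xn]) @ w)

def fit_centroid(X, y):
    """Toy base model: squashed projection onto the difference of class means."""
    mean_pos, mean_neg = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    direction, midpoint = mean_pos - mean_neg, (mean_pos + mean_neg) / 2
    return lambda Xn: sigmoid((Xn - midpoint) @ direction)

def fit_super_model(X, y, base_fitters, rng):
    """Stack base models with a meta-level logistic regression."""
    idx = rng.permutation(len(y))
    a, b = idx[: len(y) // 2], idx[len(y) // 2:]           # Compartments A and B
    models = [fit(X[a], y[a]) for fit in base_fitters]     # learn on A
    preds_b = np.column_stack([m(X[b]) for m in models])   # predict on B
    meta = fit_logistic(preds_b, y[b])                     # weight the opinions
    return lambda Xn: meta(np.column_stack([m(Xn) for m in models]))

# Synthetic two-class data (made up for illustration).
rng = np.random.default_rng(0)

def make_data(n):
    y = (rng.random(n) < 0.5).astype(float)
    return rng.normal(size=(n, 5)) + 1.5 * y[:, None], y

X_train, y_train = make_data(400)
X_test, y_test = make_data(200)
super_model = fit_super_model(X_train, y_train, [fit_logistic, fit_centroid], rng)
test_scores = super_model(X_test)
test_accuracy = np.mean((test_scores > 0.5) == (y_test == 1))
```

Fitting the meta-level weights on Compartment B rather than on Compartment A keeps the weighting from rewarding base models that merely memorized their own training data.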
Comparison [plot not preserved; repeated with a log x-axis scale]
Comparison on 100-dims [plot not preserved; repeated with a log x-axis]
Comparison on 10 dims [plot not preserved; repeated with a log x-axis]
NewKNN summary of results and timings
[Results and timing figures on this slide and the following slides (through slide 44) not preserved.]
PLS summary of results
• PLS projections did not do so well.
• However, PLS as a predictor performed well, especially under train100/test100.
• PLS is fast. The runtime varies from 1 to 10 minutes.
• But PLS takes large amounts of memory. Impossible to use in a sparse representation. (This is due to the update on each iteration.)
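The memory remark can be illustrated with a textbook single-response PLS (PLS1) fit that uses NIPALS-style deflation. This is a minimal sketch assuming that standard formulation; the slides do not say which PLS variant was used, and the function name and toy data are hypothetical. The rank-one deflation of the data matrix on every iteration is what turns a sparse matrix into a dense one.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 with deflation; returns (coefficients, column means of X, mean of y)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk = X - x_mean                 # centering alone already densifies sparse X
    yk = y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk               # weight vector: covariance of inputs with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:            # nothing left to explain
            break
        w /= norm
        t = Xk @ w                  # scores
        tt = t @ t
        p = Xk.T @ t / tt           # X loadings
        qk = yk @ t / tt            # y loading
        Xk = Xk - np.outer(t, p)    # deflation: rank-one update on each iteration
        yk = yk - qk * t
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

rng = np.random.default_rng(0)

# With as many components as inputs, PLS1 reproduces a noise-free linear target.
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0
coef, x_mean, y_mean = pls1_fit(X, y, n_components=3)
y_hat = (X - x_mean) @ coef + y_mean

# One deflation step on a sparse toy matrix (about 10% non-zero, made-up data).
X_sparse = rng.normal(size=(30, 20)) * (rng.random((30, 20)) < 0.1)
y_sparse = rng.normal(size=30)
w = X_sparse.T @ y_sparse
w /= np.linalg.norm(w)
t = X_sparse @ w
p = X_sparse.T @ t / (t @ t)
deflated = X_sparse - np.outer(t, p)
nnz_before = np.count_nonzero(X_sparse)
nnz_after = np.count_nonzero(deflated)
```

Comparing `nnz_before` with `nnz_after` shows how a single update fills in most of the zero cells, which would be prohibitive for a matrix with over a million attributes.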
Summary of results
• SVM best early on in Train1, LR better in the long haul.
• Projecting to 10-d always a disaster.
• Projecting to 100-d often indistinguishable from behavior with original data (and much cheaper).
• Naïve Gaussian Bayes Classifier best on JUN-3-1 (k-NN better for long haul).
• Naïve Gaussian Bayes Classifier best on combined.
• Non-linear SVM never seems distinguishable from Linear SVM.
• All methods have won in at least one context, except Dtree.
Some AUC Results

Experiment                            Algorithm              AUC
Train on Train1 then test on Test1    Linear SVM             0.876*
                                      Best non-Linear SVM    0.875*
                                      BC                     0.867*
                                      LR                     0.71
                                      KNN                    0.872*
                                      DTree                  0.70
Combined                              SVM                    0.638
                                      BC                     0.700
                                      LR                     0.606
                                      KNN                    0.603

* = Not statistically significantly different
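For readers unfamiliar with the metric, AUC (area under the ROC curve) equals the probability that a randomly chosen positive record is scored higher than a randomly chosen negative one. The sketch below computes it directly from that definition; it is an illustration with made-up scores, and the slides do not state how the reported AUC values were computed.

```python
import numpy as np

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive with every negative; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # 0.75
```

The pairwise comparison uses memory proportional to the number of positives times the number of negatives, so for large test sets a rank-based formula is the usual choice.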
Some AUC Results

Experiment                            Algorithm     AUC
10-fold cross-validation on Train1    Linear SVM    0.919
                                      BC            0.885
                                      LR            0.933
                                      DTree         0.894
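A 10-fold cross-validation split of the kind named in the table can be generated as below. This is a generic sketch: the function name, the shuffling, and the seed are assumptions, and only the record count (26,733 for train1) comes from the Datasets slide.

```python
import numpy as np

def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each record is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_records), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# train1 has 26,733 records, so each held-out fold has 2,673 or 2,674 records.
splits = list(k_fold_indices(26733, k=10))
fold_sizes = [len(test_idx) for _, test_idx in splits]
```

Each model is trained on nine folds and scored on the held-out fold, so every record contributes to the evaluation exactly once.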