Competitive data science: A tale of two web services
David C. Thompson Jörg Bentzien Ingo Mügge Ben Hamner
What is about to happen
• about.me
• The Kaggle process
• The data set
• How the competition went
• The models and implementation
• What we have learnt
about.me/dcthompson
My favourite papers from each period: [1] J. Chem. Phys. 122, 124107 (2005) [2] J. Chem. Phys. 128, 224103 (2008) [3] J. Chem. Inf. Model. 49, 1889 (2009) [4] J. Chem. Inf. Model. 51, 93 (2011)
What you should know about this exercise
• We wanted to investigate the utility of the process
• We wanted to move with speed
• We wanted to use a data set the scientific community had previously seen
• We wanted to be inclusive – no domain expertise needed
The data set
• Version 2 of the Hansen Ames mutagenicity data was used
• The following protocol was observed:
http://doc.ml.tu-berlin.de/toxbenchmark/ J. Chem. Inf. Model. 49, 2077 (2009)
* Undesirable atoms: D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
What happened                              # of molecules (removed)
Download SMILES                            6512
Conversion with Corina                     6503 (9)
Remove non-zero formal charge              6419 (84)
Remove if more than 99 atoms               6414 (5)
Remove if contains undesirable atoms*      6252 (162)
• Descriptor calculation on the SD file gave a 6252 × 5030 matrix
– Filtered for low variance (≤ 0.01); removed 2537 descriptors
– Removed for high correlation (> 0.90); removed 716 descriptors
– Descriptor normalization resulted in a 6252 × 1777 .csv file
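The filtering protocol above can be sketched as follows, assuming the 6252 × 5030 descriptor matrix is loaded as a pandas DataFrame `X`. The thresholds (variance ≤ 0.01, correlation > 0.90) are taken from the slide; the function name and exact drop rules for correlated pairs are illustrative.

```python
import numpy as np
import pandas as pd

def filter_descriptors(X, var_threshold=0.01, corr_threshold=0.90):
    # 1. Drop low-variance descriptors.
    X = X.loc[:, X.var() > var_threshold]
    # 2. For each highly correlated pair, drop one member
    #    (here: the later column, found via the upper triangle).
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    X = X.drop(columns=to_drop)
    # 3. Normalize each remaining descriptor to zero mean, unit variance.
    return (X - X.mean()) / X.std()
```

With the slide's numbers, this pipeline would take 5030 columns down to roughly 1777 before writing the normalized .csv.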
Descriptor Engine                # of descriptors retained (calculated)
MOE 2D                           76 (186)
Atom Pair                        696 (1920)
MolConn-Z                        174 (745)
Pipeline Pilot Property Counts   5 (130)
Daylight fingerprints            825 (2048)
clogP                            0 (1)

[Bar chart of descriptor counts per engine omitted]
J. Chem. Inf. Model. 49, 2077 (2009)
Testing Framework
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yurgenson, Tom de Godoy
• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
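The public/private mechanics above amount to a hidden partition of the test set. A minimal sketch, assuming a random split; the actual Kaggle split fraction and seed are not public, so the 50/50 split here is purely illustrative.

```python
import numpy as np

def split_test_set(n_test, public_fraction=0.5, seed=0):
    """Partition test-set row indices into public and private subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_test)
    n_public = int(n_test * public_fraction)
    return idx[:n_public], idx[n_public:]

public_idx, private_idx = split_test_set(2500)
# Feedback during the competition is computed only on public_idx;
# final standings and generalization error use private_idx.
```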
Expectations
“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
• 20 models generated with different algorithms and descriptors
• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
• Inter-laboratory accuracy for Ames test reported at 85%
Expectation: Models should have similar accuracy to literature
Goal: Models should be balanced; sensitivity and specificity should be high
J. Chem. Inf. Model. 50, 2094 (2010)
http://www.kaggle.com/c/bioresponse
log loss = −(1/N) Σᵢ₌₁ᴺ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
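The metric above, implemented directly. Predictions are typically clipped away from 0 and 1 so the logarithm stays finite; the `eps` value here is a common convention, not taken from the competition rules.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss: y_true in {0, 1}, y_pred a predicted probability."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

A confident wrong answer is punished heavily, which is why the Δ column in the ranking below is so tightly compressed among the top teams.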
Performance as a function of time
796 players
703 teams
8841 entries
55 forum topics, 409 posts
Final Ranking
Rank  Team Name                            Public Rank  Δ (log loss)
1     Winter is Coming & Sergey            11           0
2     seelary                              26           7E-05
3     bluehat                              1            0.00051
4     jazz                                 15           0.0014
5     Wayne Zhang & Gxav & woshialex       19           0.00146
6     Indy Actuaries                       38           0.00184
7     bluemaster & imran                   7            0.00231
8     Efiimov & Bers & Cragin & vsu        4            0.00241
9     y_tag                                18           0.0026
10    Killian O’Connor                     44           0.00285
11    PlanetThanet & SirGuessalot          40           0.00298
12    AussieTim                            48           0.00335
13    Jason Farmer                         31           0.00347
14    GreenPeace                           16           0.00356
15    mars                                 32           0.00388
16    Fuzzify                              60           0.00392
17    Emanuele                             63           0.00395
18    HappyHour                            10           0.00431
19    Baltic                               30           0.00465
20    dejavu                               20           0.00482
352   Random Forest Benchmark              373          0.04184
541   Support Vector Machine Benchmark     522          0.12147
639   Optimized Constant Value Benchmark   647          0.31414
642   Uniform Benchmark                    650          0.31959
https://github.com/emanuele/kaggle_pbr
https://github.com/benhamner/BioResponse
#FTW Strategies
• Feature selection
• RF + complementary approaches
• Blending
All three winning teams identified D27 as important. What is it? Organon toxicophore*
* J. Med. Chem. 49, 312 (2005)
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yurgenson, Tom de Godoy
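Of the three winning strategies, blending is the most transferable. A hypothetical sketch, averaging the predicted probabilities of a random forest and a logistic regression with scikit-learn; the winners' actual model mix and weights are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def blend_predict(X_train, y_train, X_test, weights=(0.5, 0.5)):
    """Weighted average of predicted probabilities from two base models."""
    models = [
        RandomForestClassifier(n_estimators=200, random_state=0),
        LogisticRegression(max_iter=1000),
    ]
    preds = []
    for m in models:
        m.fit(X_train, y_train)
        preds.append(m.predict_proba(X_test)[:, 1])  # P(class = 1)
    return np.average(preds, axis=0, weights=weights)
```

Because log loss rewards well-calibrated probabilities, even a simple average of diverse models tends to beat any single member of the blend.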
Private Set Performance

Winning Teams (confusion matrices: TP FN / FP TN)
      Team 1   Team 2   Team 3
TP    873      888      893
FN    165      150      145
FP    151      165      162
TN    687      673      676

Benchmarks
      RF       SVM
TP    888      822
FN    150      216
FP    166      215
TN    672      673

Other
      Team 17  D27
TP    896      781
FN    142      257
FP    169      215
TN    669      623

          Se     Sp     CCR
RF        0.86   0.80   0.83
SVM       0.79   0.74   0.77

          Se     Sp     CCR
Team 1    0.84   0.82   0.83
Team 2    0.86   0.80   0.83
Team 3    0.86   0.80   0.83

          Se     Sp     CCR
Team 17   0.86   0.80   0.83
D27       0.75   0.74   0.75

Se: TP/(TP+FN)   Sp: TN/(FP+TN)   CCR: (Se + Sp)/2
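The metric definitions above can be checked directly against the slide's Team 1 confusion matrix (TP 873, FN 165, FP 151, TN 687):

```python
def classification_rates(tp, fn, fp, tn):
    """Sensitivity, specificity, and correct classification rate."""
    se = tp / (tp + fn)      # sensitivity: fraction of actives found
    sp = tn / (fp + tn)      # specificity: fraction of inactives found
    ccr = (se + sp) / 2.0    # correct classification rate
    return se, sp, ccr

se, sp, ccr = classification_rates(873, 165, 151, 687)
# se ≈ 0.84, sp ≈ 0.82, ccr ≈ 0.83 — matching the Team 1 row above
```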
Okay, where’s this ‘second’ web service?
BIpredict*
• Physicochemical properties are updated as the molecule is built
• Atomistic descriptor values are appended directly to the molecule
* D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011
So, what did we learn?
• Was this useful? – Yes
• Participation was high; contributors and contributions were diverse*
• Many models were of high quality
– Differences between the top models in the log loss metric are small
– Different statistical measures lead to different rankings
– The Random Forest benchmark has a high correct classification rate (CCR)
* Sort of
‘Machine learning that matters’
Kiri L. Wagstaff, “Machine Learning that Matters,” Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.
[Diagram: domain expertise vs. machine-learning skill omitted]
Thanks to: Lilly Ackley, Amy Kunkel, Mehul Patel, Alex Renner, PhD, and all Kaggle participants – esp. Winter is Coming & Sergey