Competitive data science: A tale of two web services

description

Initial results from the Boehringer Ingelheim Pharmaceuticals, Inc. 'Predicting a biological response' Kaggle competition. Presented at the Fall ACS 2012 #CINF session "When Chemists and Computers Collide: Putting Cheminformatics in the Hands of Medicinal Chemists"

Transcript of Competitive data science: A tale of two web services

Page 1: Competitive data science: A tale of two web services
Page 2: Competitive data science: A tale of two web services

Competitive data science: A tale of two web services

David C. Thompson, Jörg Bentzien, Ingo Mügge, Ben Hamner

Page 3: Competitive data science: A tale of two web services

What is about to happen

• about.me

• The Kaggle process

• The data set

• How the competition went

• The models and implementation

• What we have learnt

Page 4: Competitive data science: A tale of two web services

about.me/dcthompson

My favourite papers from each period:

[1] J. Chem. Phys. 122, 124107 (2005)
[2] J. Chem. Phys. 128, 224103 (2008)
[3] J. Chem. Inf. Model. 49, 1889 (2009)
[4] J. Chem. Inf. Model. 51, 93 (2011)

Page 5: Competitive data science: A tale of two web services
Page 6: Competitive data science: A tale of two web services

What you should know about this exercise

• We wanted to investigate the utility of the process

• We wanted to move with speed

• We wanted to use a data set the scientific community had previously seen

• We wanted to be inclusive – no domain expertise needed

Page 7: Competitive data science: A tale of two web services

The data set

• Version 2 of the Hansen Ames mutagenicity data was used

• The following protocol was observed:

What happened                            # of molecules (removed)
Download SMILES                          6512
Conversion with Corina                   6503 (9)
Remove non-zero formal charge            6419 (84)
Remove if more than 99 atoms             6414 (5)
Remove if contains undesirable atoms*    6252 (162)

* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn

http://doc.ml.tu-berlin.de/toxbenchmark/ J. Chem. Inf. Model. 49, 2077 (2009)
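The three removal rules in the protocol can be sketched as a chain of simple checks. This is only an illustration, not the pipeline used for the competition: the `Molecule` record and its fields are hypothetical, the sketch assumes element symbols and per-atom formal charges were already extracted by a cheminformatics toolkit, and it reads "non-zero formal charge" as "any atom carries a formal charge".

```python
from dataclasses import dataclass
from typing import List

# Elements listed in the slide footnote as undesirable.
UNDESIRABLE = {
    "D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
    "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn",
}
MAX_ATOMS = 99  # molecules with more atoms than this are removed


@dataclass
class Molecule:
    """Hypothetical pre-parsed record; not a class from any real toolkit."""
    smiles: str
    elements: List[str]        # one element symbol per atom
    formal_charges: List[int]  # one formal charge per atom


def passes_filters(mol: Molecule) -> bool:
    """Apply the three removal rules in the order given on the slide."""
    if any(charge != 0 for charge in mol.formal_charges):
        return False  # assumed reading of "non-zero formal charge"
    if len(mol.elements) > MAX_ATOMS:
        return False
    if UNDESIRABLE.intersection(mol.elements):
        return False
    return True


def apply_protocol(mols: List[Molecule]) -> List[Molecule]:
    """Return only the molecules that survive every filter."""
    return [m for m in mols if passes_filters(m)]
```

Keeping each rule as a separate early return makes it easy to count how many molecules each step removes, which is what the table reports.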

Page 8: Competitive data science: A tale of two web services

Descriptor calculation

• SD file, descriptor calculation – 6252 x 5030

– Filter for low variance (≤ 0.01); removed 2537

– Remove for high correlation (> 0.90); removed 716

– Descriptor normalization resulted in 6252 x 1777 .csv file
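The reduction from 5030 to 1777 descriptor columns follows a common three-step pattern, sketched below in plain Python. Several details are assumptions, since the slides do not state them: the variance is taken as the population variance of the raw values, the correlation filter is a greedy pass on the absolute Pearson correlation, and the normalization is a z-score.

```python
import math
from statistics import mean, pvariance

VARIANCE_CUTOFF = 0.01     # columns with variance <= this are dropped
CORRELATION_CUTOFF = 0.90  # columns with |r| > this against a kept column are dropped


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_descriptors(columns):
    """columns maps descriptor name -> list of values, one per molecule."""
    # Step 1: drop low-variance descriptors.
    varied = {n: v for n, v in columns.items() if pvariance(v) > VARIANCE_CUTOFF}

    # Step 2: greedy pass; keep a column only if it is not highly
    # correlated with any column already kept (assumed strategy).
    selected = {}
    for name, values in varied.items():
        if all(abs(pearson(values, kept)) <= CORRELATION_CUTOFF
               for kept in selected.values()):
            selected[name] = values

    # Step 3: z-score normalization (assumed; the slides say only "normalization").
    normalized = {}
    for name, values in selected.items():
        mu, sd = mean(values), math.sqrt(pvariance(values))
        normalized[name] = [(v - mu) / sd for v in values]
    return normalized
```

The greedy pass depends on column order, so a different ordering of the descriptor engines could keep a different, equally valid subset.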

Descriptor Engine                  # of descriptors
MOE 2D                             76 (186)
Atom Pair                          696 (1920)
MolConn-Z                          174 (745)
Pipeline Pilot Property Counts     5 (130)
Daylight fingerprints              825 (2048)
clogP                              0 (1)

J. Chem. Inf. Model. 49, 2077 (2009)

Page 9: Competitive data science: A tale of two web services

Testing Framework

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yurgenson, Tom de Godoy

• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.

• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
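The public/private arrangement amounts to a fixed, hidden partition of the test set. A minimal sketch is shown below; the function name, the 25% public fraction and the seed are illustrative assumptions, not the values used in this competition.

```python
import random


def split_leaderboard(test_ids, public_fraction=0.25, seed=0):
    """Partition test-set ids into (public, private) lists.

    public_fraction and seed are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the split never changes
    shuffled = list(test_ids)
    rng.shuffle(shuffled)
    n_public = round(len(shuffled) * public_fraction)
    return shuffled[:n_public], shuffled[n_public:]


public_ids, private_ids = split_leaderboard(range(100))
```

Every submission is scored on both parts, but only the public score is shown during the competition, so tuning against the leaderboard cannot leak information about the private part.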

Page 10: Competitive data science: A tale of two web services

Expectations

“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”

• 20 models generated with different algorithms and descriptors

• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set

• Inter-laboratory accuracy for Ames test reported at 85%

Expectation: Models should have similar accuracy to literature

Goal: Models should be balanced; sensitivity and specificity should be high

J. Chem. Inf. Model. 50, 2094 (2010)

Page 11: Competitive data science: A tale of two web services

http://www.kaggle.com/c/bioresponse

Page 12: Competitive data science: A tale of two web services
Page 13: Competitive data science: A tale of two web services

$$\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
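The binary log loss is short enough to compute by hand. The sketch below is a plain-Python version; clipping the predictions with `eps` is an added assumption (common practice to avoid log(0)), not something stated on the slide.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; y_true holds 0/1 labels, y_pred probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); an assumption
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Predicting 0.5 for every molecule scores ln(2), about 0.693,
# whatever the true labels are.
uniform_score = log_loss([0, 1, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

Because the metric penalizes confident wrong answers heavily, it rewards well-calibrated probabilities rather than just correct class labels.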

Performance as a function of time

796 players

703 teams

8841 entries

55 forum topics, 409 posts

Page 14: Competitive data science: A tale of two web services

Final Ranking

Final Ranking  Team Name                            Public Ranking  Δ (log loss)
1              Winter is Coming & Sergey            11              0
2              seelary                              26              0.00007
3              bluehat                              1               0.00051
4              jazz                                 15              0.0014
5              Wayne Zhang & Gxav & woshialex       19              0.00146
6              Indy Actuaries                       38              0.00184
7              bluemaster & imran                   7               0.00231
8              Efiimov & Bers & Cragin & vsu        4               0.00241
9              y_tag                                18              0.0026
10             Killian O’Connor                     44              0.00285
11             PlanetThanet & SirGuessalot          40              0.00298
12             AussieTim                            48              0.00335
13             Jason Farmer                         31              0.00347
14             GreenPeace                           16              0.00356
15             mars                                 32              0.00388
16             Fuzzify                              60              0.00392
17             Emanuele                             63              0.00395
18             HappyHour                            10              0.00431
19             Baltic                               30              0.00465
20             dejavu                               20              0.00482
352            Random Forest Benchmark              373             0.04184
541            Support Vector Machine Benchmark     522             0.12147
639            Optimized Constant Value Benchmark   647             0.31414
642            Uniform Benchmark                    650             0.31959

https://github.com/emanuele/kaggle_pbr

https://github.com/benhamner/BioResponse

Page 15: Competitive data science: A tale of two web services

#FTW Strategies

• Feature selection

• RF + complementary approaches

• Blending
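Blending, in its simplest form, is a weighted average of the probabilities predicted by several models. The sketch below shows only that basic idea; the slides do not describe the winning teams' actual blending schemes, and the function name and weights here are hypothetical.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists.

    predictions: one list of probabilities per model, all the same length.
    weights: optional per-model weights; equal weighting if omitted.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_rows)
    ]


# Hypothetical outputs of a random forest and a second, complementary model.
rf_probs = [0.2, 0.8]
other_probs = [0.4, 0.6]
blended = blend([rf_probs, other_probs], weights=[3, 1])
```

In practice the weights would be chosen on held-out data; averaging tends to help most when the individual models make different kinds of errors.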

All three winning teams identified D27 as important. What is it? Organon toxicophore*

* J. Med. Chem. 48, 312 (2005)

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yurgenson, Tom de Godoy

Page 16: Competitive data science: A tale of two web services

Private Set Performance

                 TP    FN    FP    TN    Se     Sp     CCR
Winning Teams
  Team 1         873   165   151   687   0.84   0.82   0.83
  Team 2         888   150   165   673   0.86   0.80   0.83
  Team 3         893   145   162   676   0.86   0.80   0.83
Benchmarks
  RF             888   150   166   672   0.86   0.80   0.83
  SVM            822   216   215   673   0.79   0.74   0.77
Other
  Team 17        896   142   169   669   0.86   0.80   0.83
  D27            781   257   215   623   0.75   0.74   0.75

Se (sensitivity): TP/(TP+FN)

Sp (specificity): TN/(FP+TN)

CCR (correct classification rate): (Se + Sp)/2
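The Se, Sp and CCR definitions translate directly into code. The sketch below applies them to the Team 1 confusion-matrix counts from the slide; the function name is hypothetical.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)  # fraction of true positives recovered
    sp = tn / (fp + tn)  # fraction of true negatives recovered
    return se, sp, (se + sp) / 2


# Team 1 counts from the private set.
se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(f"Se={se:.2f} Sp={sp:.2f} CCR={ccr:.2f}")  # Se=0.84 Sp=0.82 CCR=0.83
```

CCR averages the two rates, so unlike plain accuracy it is not dominated by the larger class.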

Page 17: Competitive data science: A tale of two web services

Okay, where’s this ‘second’ web service?

BIpredict

• Physicochemical properties are updated as molecule is built

• Atomistic descriptor values are appended directly to the molecule

* D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011

Page 18: Competitive data science: A tale of two web services

So, what did we learn?

• Was this useful? – Yes

• Participation was high, contributors and contributions were diverse*

• A large number of models were of a high quality

– Differences in top models in log loss metric are small

– Different statistical measures lead to different rankings

– RandomForest benchmark has high correct classification rate (CCR)

* Sort of

Page 19: Competitive data science: A tale of two web services

‘Machine learning that matters’

Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.

[Figure labels: Domain expertise; Machine learning skill]

Page 20: Competitive data science: A tale of two web services

Thanks to: Lilly Ackley, Amy Kunkel, Mehul Patel, Alex Renner, PhD, and all Kaggle participants – esp. Winter is Coming & Sergey