The collection, curation and modeling of Open Melting Point measurements

Post on 11-May-2015


Jean-Claude Bradley and Andrew Lang presented "The collection, curation and modeling of Open Melting Point measurements" at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry on August 26, 2011. The talk also covers the role of Open Notebook Science and Google Apps Scripts in this effort.

The collection, curation and modeling of Open Melting Point measurements

August 26, 2011

5th Meeting on U.S. Government Chemical Databases and Open Chemistry

Jean-Claude Bradley

Department of Chemistry, Drexel University

Andrew Lang

Department of Mathematics, Oral Roberts University

Antony Williams

ChemSpider, Royal Society of Chemistry

The Problem of Data Quality in Chemistry

• Lack of provenance

• Reliance on a system of “trusted sources”

In the case of melting points:

• CRC Handbook
• Merck Index
• Chemical Vendor Catalogs (e.g. Sigma-Aldrich)
• Peer-Reviewed Journals

Strategy for the curation of melting points

Using technology, we can begin to replace the “trusted source” model with one based on transparency and provenance:

1. Rely on redundancy when possible
2. Provide the maximum level of provenance when necessary (Open Notebook Science)
3. Adhere to Open Data, Open Descriptors and Open Algorithms for measurements and modeling

The Chemical Information Validation Sheet

567 curated and referenced measurements from Fall 2010 Chemical Information Retrieval course

Investigating the m.p. inconsistencies of EGCG

Most popular data sources

Alfa Aesar donates melting points to the public

Open Melting Point Explorer

Outliers

MDPI dataset

EPA/PhysProp (also donated all data to the public)

Outliers for ethanol: Alfa Aesar and Oxford MSDS

Inconsistencies and SMILES problems within MDPI dataset

MDPI Dataset labeled with High Trust Level

EPA/PHYSPROP Structure Errors (Incorrect Valence): 2315 out of 43543 entries contained pentavalent nitrogens

EPA/PHYSPROP Errors: Structure displayed is for the neutral compound dopamine but the associated CAS Number and chemical name in the file are for the hydrobromide salt.

Common errors in datasets

1. multiple melting points for the same compound in the same database
2. stereochemistry issues
3. sign inversion
4. conversion errors (Kelvin/Celsius, Fahrenheit/Celsius)
5. bad SMILES (non-rendering)
6. salts associated with SMILES for free base
7. using boiling point for melting point
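Several of the numeric errors listed above (sign inversion and unit conversion errors) leave a recognizable fingerprint when two sources disagree about the same compound. The following is a minimal sketch of how such a check might look, not the procedure used in the talk. The function name `classify_discrepancy` and the 1 °C tolerance are assumptions made for illustration.

```python
def classify_discrepancy(a_c, b_c, tol=1.0):
    """Suggest why two melting points, both reported as Celsius, disagree.

    Hypothetical helper: the name and the default tolerance (in degrees C)
    are illustrative assumptions, not taken from the presentation.
    """
    def close(x, y):
        return abs(x - y) <= tol

    def f_to_c(f):
        # Standard Fahrenheit-to-Celsius conversion.
        return (f - 32.0) * 5.0 / 9.0

    if close(a_c, b_c):
        return "consistent"
    if close(a_c, -b_c):
        return "possible sign inversion"
    # One of the values may really be in kelvin although labelled Celsius.
    if close(a_c - 273.15, b_c) or close(a_c, b_c - 273.15):
        return "possible Kelvin/Celsius confusion"
    # One of the values may really be in Fahrenheit although labelled Celsius.
    if close(f_to_c(a_c), b_c) or close(a_c, f_to_c(b_c)):
        return "possible Fahrenheit/Celsius confusion"
    return "unexplained"
```

A check like this only suggests a likely cause. Each flagged pair would still need to be traced back to its source before any value is corrected or discarded.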

Open melting point datasets

Double+ validated: 2706 compounds (7413 highly curated measurements; range: 0.01-5 °C). Compounds that had at least one chiral center, possessed cis/trans isomerism, or were inorganic or a salt were removed.

Entire dataset: 19933 unique compounds (27684 measurements – no inorganics or salts)
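The redundancy idea behind the double+ validated set can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' curation code. It assumes "double+" means at least two measurements per compound and that "range: 0.01-5 °C" refers to the spread between the highest and lowest reported values. The function name `redundancy_filter` and its parameters are hypothetical.

```python
from collections import defaultdict

def redundancy_filter(measurements, min_count=2, min_spread=0.01, max_spread=5.0):
    """Keep compounds whose repeated measurements agree within a spread window.

    measurements: iterable of (compound_id, melting_point_celsius) pairs.
    Returns {compound_id: mean melting point} for accepted compounds.
    The thresholds are assumptions based on one reading of the slide.
    """
    by_compound = defaultdict(list)
    for compound, mp_c in measurements:
        by_compound[compound].append(mp_c)

    accepted = {}
    for compound, values in by_compound.items():
        spread = max(values) - min(values)
        # Require redundancy (several values) and agreement (bounded spread).
        if len(values) >= min_count and min_spread <= spread <= max_spread:
            accepted[compound] = sum(values) / len(values)
    return accepted
```

Under this reading, a compound with a single source fails the filter regardless of how trusted that source is, and a compound whose sources disagree widely is held back for manual investigation.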

Open Models with Open Data Using Open Descriptors (CDK)

Modeling Results

Model  Training set  Test set (TS)  Descriptors  TS AAE  TS RMSE  TS R²
1      2205          500            132 2D       29.51   40.91    0.82
1      2204          500            170 2D/3D    29.52   40.79    0.83
2      16015         500            137 2D       26.62   36.35    0.86
3      16015         3500           137 2D       29.36   40.18    0.81
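For readers unfamiliar with the test-set statistics reported above, the sketch below shows one common way to compute them: AAE as the mean absolute error, RMSE as the root mean squared error, and R² as the coefficient of determination. The slides do not say which definition of R² was used (it may instead be a squared correlation coefficient), so that choice is an assumption, as is the function name `regression_metrics`.

```python
import math

def regression_metrics(observed, predicted):
    """Return (AAE, RMSE, R2) for paired observed and predicted values.

    R2 is computed here as 1 - SS_res / SS_tot, which is an assumption
    about the definition; the source does not specify it.
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]

    aae = sum(abs(e) for e in errors) / n             # average absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root mean squared error

    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return aae, rmse, r2
```

RMSE weights large errors more heavily than AAE, which is consistent with the RMSE column being larger than the AAE column in every row of the table.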

Melting point prediction service

Melting point predictions and measurements on iPhone/iPad (Alex Clark)

Publication of double+ validated melting point dataset to Nature Precedings and LuLu

For all Formats of ONS Projects

Open Melting Point Datasets

Currently 20,000 compounds with Open MPs

Some melting points can’t be resolved from the literature alone: 4-benzyltoluene

Motivation: Faster Science, Better Science

Open Lab Notebook page measuring the melting point of 4-benzyltoluene

Using melting point for temperature dependent solubility prediction

Crowdsourcing Solubility Data

Integration of Multiple Web Services to Recommend Solvents for Reactions

All ONS web services

Google Apps Scripts web services

Google Apps Scripts for conveniently exploring melting point data

Straight chain carboxylic acids from 1 to 10 carbons

Straight chain alcohols from 1 to 10 carbons

Comparison of model with triple validated measurements

Cyclic primary amines from 3 to 6 carbons (cyclobutylamine flagged for validation: only a single source available)

Google Apps Scripts for planning reactions and creating schemes

Open Melting Points in Supplementary Data Pages of Wikipedia (Martin Walker)

Conclusions

• For science to progress quickly there is great benefit in moving away from a “trusted source” model to one based on transparency and data provenance

• Open Notebook Science offers an efficient way to make research transparent and discoverable