Experiences in Hosting Big Chemistry Data Collections for the Community
![Page 1: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/1.jpg)
Experiences in Hosting Big Chemistry Data Collections
for the Community
Antony Williams
July 30th 2014, NIST
![Page 2: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/2.jpg)
Overview of Our Activities
• The Royal Society of Chemistry as a provider of chemistry for the community:
  • As a charity
  • As a scientific publisher
  • As a host of commercial databases
  • As a partner in grant-based projects
  • As the host of ChemSpider
  • And now in development: the RSC Data Repository for Chemistry
![Page 3: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/3.jpg)
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
![Page 4: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/4.jpg)
ChemSpider
![Page 5: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/5.jpg)
ChemSpider
![Page 6: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/6.jpg)
ChemSpider
![Page 7: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/7.jpg)
Experimental/Predicted Properties
![Page 8: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/8.jpg)
Literature references
![Page 9: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/9.jpg)
Patents references
![Page 10: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/10.jpg)
RSC Books
![Page 11: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/11.jpg)
Google Books
![Page 12: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/12.jpg)
Vendors and data sources
![Page 13: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/13.jpg)
Crowdsourced “Annotations”
• Users can add
  • Descriptions, Syntheses and Commentaries
  • Links to PubMed articles
  • Links to articles via DOIs
  • Add spectral data
  • Add Crystallographic Information Files
  • Add photos
  • Add MP3 files
  • Add Videos
![Page 14: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/14.jpg)
APIs
![Page 15: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/15.jpg)
APIs
![Page 16: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/16.jpg)
WebBook and ChemSpider
![Page 17: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/17.jpg)
WebBook and ChemSpider
![Page 18: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/18.jpg)
WebBook and ChemSpider
![Page 19: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/19.jpg)
WebBook and ChemSpider
![Page 20: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/20.jpg)
WebBook and ChemSpider
![Page 21: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/21.jpg)
Javascript viewer NMR, MS, IR
![Page 22: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/22.jpg)
Aspirin on ChemSpider
![Page 23: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/23.jpg)
Many Names, One Structure
![Page 24: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/24.jpg)
What is the Structure of Vitamin K?
![Page 25: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/25.jpg)
MeSH
• A lipid cofactor that is required for normal blood clotting.
• Several forms of vitamin K have been identified:
  • VITAMIN K 1 (phytomenadione) derived from plants,
  • VITAMIN K 2 (menaquinone) from bacteria,
  • and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione).
![Page 26: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/26.jpg)
What is the Structure of Vitamin K?
![Page 27: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/27.jpg)
The ultimate “dictionary”
• Search all forms of structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
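A lookup that accepts "all forms of structure IDs" first has to decide which kind of identifier it was handed. The sketch below is only an illustration of that idea and is not ChemSpider's implementation: the function names and the returned labels are hypothetical. It relies on three well-known conventions: InChI strings start with `InChI=`, a standard InChIKey is 14, 10 and 1 uppercase letters separated by hyphens, and a CAS Registry Number carries a check digit. Anything else is left as "name or SMILES", since telling those apart reliably needs a chemistry toolkit.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit.
CAS_RE = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_RE.match(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the final check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify_identifier(text: str) -> str:
    """Return a coarse, hypothetical label for a chemical identifier string."""
    text = text.strip()
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(text):
        return "inchikey"
    if is_valid_cas(text):
        return "registry_number"
    # Names and SMILES are not separated here; that needs a real parser.
    return "name_or_smiles"


if __name__ == "__main__":
    for query in ["50-78-2", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "aspirin"]:
        print(query, "->", classify_identifier(query))
```

Checking the CAS check digit before searching is a cheap way to catch mistyped registry numbers, which is one small instance of the data quality concerns raised later in the talk.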
![Page 28: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/28.jpg)
Linking Names to Structures
![Page 29: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/29.jpg)
Semantic Mark-up of Articles
![Page 30: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/30.jpg)
Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
![Page 31: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/31.jpg)
Data quality is a known issue
![Page 32: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/32.jpg)
Standardize
• Use the SRS as a guidance document for standardization
• Adjust as necessary to our needs
![Page 33: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/33.jpg)
Nitro groups
![Page 34: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/34.jpg)
Salt and Ionic Bonds
![Page 35: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/35.jpg)
Ammonium salts
![Page 36: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/36.jpg)
CVSP Filtering and Flagging
![Page 37: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/37.jpg)
Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
![Page 38: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/38.jpg)
| Substructure | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry |
|---|---|---|---|---|---|
| Gonane | 34 | 5 | 8 | 21 | 0 |
| Gon-4-ene | 55 | 12 | 3 | 33 | 7 |
| Gon-1,4-diene | 60 | 17 | 10 | 23 | 10 |
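The counts in the table above can be turned into a per-substructure "fraction correct" with a few lines of arithmetic. The sketch below uses only the numbers shown on the slide; the variable and function names are illustrative. It also checks that the four outcome columns add up to the hit count for each row, which they do.

```python
# Columns: hits, correct, no stereo, incomplete stereo, complete but incorrect.
STEROID_HITS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(rows):
    """Return {substructure: percentage of hits that were fully correct}."""
    result = {}
    for name, (hits, correct, none, incomplete, wrong) in rows.items():
        # Sanity check: the four outcome categories must account for every hit.
        if correct + none + incomplete + wrong != hits:
            raise ValueError(f"row {name!r} does not add up to {hits}")
        result[name] = round(100 * correct / hits, 1)
    return result


if __name__ == "__main__":
    for name, pct in percent_correct(STEROID_HITS).items():
        print(f"{name}: {pct}% of hits correct")
```

In every row fewer than a third of the hits are fully correct, and incomplete stereochemistry is the largest single category.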
![Page 39: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/39.jpg)
Crowdsourced Enhancement
• The community can clean and enhance the database by providing Feedback and direct curation
• Tens of thousands of edits made
![Page 40: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/40.jpg)
Data Quality is Work
• Cholesterol
• Taxol
![Page 41: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/41.jpg)
Maybe we can help?
• Is there an interest in data checking the WebBook or other NIST data sources?
![Page 42: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/42.jpg)
Publications-summary of work
• Scientific publications are a summary of work
  • Is all work reported?
  • How much science is lost to pruning?
  • What of value sits in notebooks and is lost?
  • Publications offering access to “real data”?
• How much data is lost?
  • How many compounds never reported?
  • How many syntheses fail or succeed?
  • How many characterization measurements?
![Page 43: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/43.jpg)
What are we building?
• We are building the “RSC Data Repository”
• Containers for compounds, reactions, analytical data, tabular data
• Algorithms for data validation and standardization
• Flexible indexing and search technologies
• A platform for modeling data and hosting existing models and predictive algorithms
![Page 44: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/44.jpg)
Deposition of Data
![Page 45: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/45.jpg)
Compounds
![Page 46: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/46.jpg)
Reactions
![Page 47: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/47.jpg)
Analytical data
![Page 48: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/48.jpg)
Crystallography data
![Page 49: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/49.jpg)
Can we get historical data?
• Text and data can be mined
• Spectra can be extracted and converted
• SO MUCH Open Source Code available
![Page 50: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/50.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
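As a rough illustration of what text mining a passage like this involves, the sketch below tags reagent names from a small dictionary and picks up the amount given in parentheses after each one. It is a toy, not the pipeline used for the talk: the two-entry dictionary, the function name and the supported units are all assumptions made for the example.

```python
import re

# Toy dictionary of known names; a real one would hold many thousands.
KNOWN_NAMES = ["thionyl chloride", "benzene"]


def find_reagent_amounts(text, names):
    """Return (name, amount, unit) for each dictionary name followed by '( N unit )'."""
    results = []
    for name in names:
        pattern = re.compile(
            re.escape(name) + r"\s*\(\s*(\d+(?:\.\d+)?)\s*(ml|mg|g)\s*\)"
        )
        for match in pattern.finditer(text):
            results.append((name, float(match.group(1)), match.group(2)))
    return results


if __name__ == "__main__":
    sentence = (
        "prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) "
        "were charged into a glass reaction vessel"
    )
    for hit in find_reagent_amounts(sentence, KNOWN_NAMES):
        print(hit)
```

A dictionary lookup only finds names it already knows. The long systematic urea names in the passage would need name-to-structure conversion instead.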
![Page 51: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/51.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue
![Page 52: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/52.jpg)
Text spectra?
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
![Page 53: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/53.jpg)
1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
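Peak lists written this way are regular enough to be pulled apart with regular expressions. The sketch below is one possible approach, assuming the conventions visible in the two examples above (a `δ =` marker before the peaks, shifts written with a decimal point, and 1H entries of the form `shift (multiplicity, nH, ...)`). The function names and the dictionary keys are illustrative.

```python
import re


def parse_13c_shifts(text):
    """Return the 13C chemical shifts (ppm) listed after the 'δ =' marker."""
    # Only look after the marker so '13C' and '100 MHz' are never picked up.
    _, _, peaks = text.partition("δ =")
    return [float(value) for value in re.findall(r"\d+\.\d+", peaks)]


# shift or shift range, then '(multiplicity, nH'
H1_PEAK = re.compile(r"(\d+\.\d+)(?:–(\d+\.\d+))?\s*\(([a-z]+),\s*(\d+)H")


def parse_1h_peaks(text):
    """Return one dict per 1H signal: shift, optional range end, multiplicity, protons."""
    _, _, peaks = text.partition("δ =")
    parsed = []
    for start, end, multiplicity, protons in H1_PEAK.findall(peaks):
        parsed.append({
            "shift": float(start),
            "shift_end": float(end) if end else None,
            "multiplicity": multiplicity,
            "protons": int(protons),
        })
    return parsed


if __name__ == "__main__":
    carbon = "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 66.12 (CH2), 117.72, 118.19 (ArCH)"
    proton = "1H NMR (CDCl3, 400 MHz): δ = 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 7.18–7.94 (m, 11H, ArH)"
    print(parse_13c_shifts(carbon))
    print(parse_1h_peaks(proton))
```

Coupling constants such as `J = 4.8 Hz` are skipped because they are not followed by an opening parenthesis. If the `δ =` marker is missing, both functions return an empty list.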
![Page 54: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/54.jpg)
Turn “Figures” Into Data
![Page 55: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/55.jpg)
Make it interactive
![Page 56: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/56.jpg)
SO MANY reactions!
![Page 57: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/57.jpg)
Extracting our Archive
• What could we get from our archive?
  • Find chemical names and generate structures
  • Find chemical images and generate structures
  • Find reactions
  • Find data (MP, BP, LogP) and deposit
  • Find figures and database them
  • Find spectra (and link to structures)
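As a hedged illustration of the "find data (MP, BP, LogP)" step, the sketch below pulls melting and boiling points out of free text. The sentence it is run on is invented for the example, and the pattern only covers values reported in °C with an `mp`, `m.p.`, `bp` or `b.p.` label. Real archive text is far more varied.

```python
import re

# Label, optional ':' or '=', a value or a value range, then the °C unit.
PROPERTY_RE = re.compile(
    r"\b(mp|m\.p\.|bp|b\.p\.)\s*[:=]?\s*"
    r"(\d+(?:\.\d+)?)(?:\s*[-–]\s*(\d+(?:\.\d+)?))?\s*°C",
    re.IGNORECASE,
)


def find_thermal_properties(text):
    """Return (label, low, high) tuples in °C; low == high for single values."""
    found = []
    for label, low, high in PROPERTY_RE.findall(text):
        label = label.lower().replace(".", "")  # 'm.p.' and 'MP' both become 'mp'
        low_value = float(low)
        high_value = float(high) if high else low_value
        found.append((label, low_value, high_value))
    return found


if __name__ == "__main__":
    # Hypothetical sentence, not taken from any article.
    example = "White solid, mp 134-136 °C; the starting oil had b.p. 210 °C."
    print(find_thermal_properties(example))
```

Keeping both ends of a reported range, rather than averaging them, leaves the decision about how to deposit the value to a later step.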
![Page 58: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/58.jpg)
Models published from data
![Page 59: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/59.jpg)
Text-mining Data to compare
![Page 60: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/60.jpg)
How is DERA going?
• We have text-mined all 21st century articles… >100k articles from 2000-2013
• Marked up with XML and published onto the HTML forms of the articles
• Required multiple iterations based on dictionaries, markup, text mining iterations
• New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!
![Page 61: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/61.jpg)
Work in Progress
![Page 62: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/62.jpg)
Work in Progress
![Page 63: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/63.jpg)
Work in Progress
![Page 64: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/64.jpg)
Work in Progress
![Page 65: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/65.jpg)
Is It Easy?
[Workflow diagram. Labels: Dictionary (ontologies): RSC ontologies (methods, reactions); Dictionary (chemistry); Text-mining; Curated dictionaries for known names; Unknown names: automated name to structure conversion (ACD N2S, OPSIN); Marked-up XML; Production processes; XML ready for publication; Chemical structures SD file; CDX integration (coming soon).]
![Page 66: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/66.jpg)
Acknowledgments
• Regarding InChI – Steve Stein, Steve Heller, Dmitrii Tchekhovskoi, Igor Pletnev
![Page 67: Experiences in Hosting Big Chemistry Data Collections for the Community](https://reader036.fdocuments.us/reader036/viewer/2022062513/554e7d06b4c905f66a8b525c/html5/thumbnails/67.jpg)
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
Thank you