ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?
Antony Williams, ACS Denver, August 30th 2011
What’s said on the web is true…
“We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding, currently in Rome (Italy).”
“This was identified as the new protein Wai So Dim (WSD).”
Who is Sandy Lawson? Ask Google
Who is Sandy..to me?
Mentor in computer-generated nomenclature
Educational Technologist
Innovator
Ethical
“Gentleman Sandy”
What is the Structure of Vitamin K1?
ChemSpider
The Free Chemical Database
A central hub for chemists to source information
>26 million unique chemical records
Aggregated from >400 data sources
Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions
A central hub for chemists to deposit & curate data
ChemSpider general statements
ChemSpider: one of many important resources
The “Google and Wikipedia of Chemistry”
A vision of “Linking all chemistry on the internet”
Most people in this room probably know about it
New people discover us regularly
Our distinct roles are:
Hosting and exposing data for the community
Curating and validating chemistry-related data
I want to know about “Vincristine”
If all algorithms work then everything on the page is correct by default except the name!
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources
Vincristine: Patents
Vincristine: Articles
Searches: The INTERNET
All ChemSpider and Internet searches are “simply algorithms” but synonym searching is based on an assertion
InChIs
Validated Names for Searching…
What you might not know about Chemistry Databases on the Internet
Data-sharing between the databases is cyclic – proliferating errors – “Linked Data”
Some public databases are “trusted” as primary sources
Trust is granted without investigation or understanding of the content
Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.
What do we know about some of the online resources?
PHYSPROP Database
The freely downloadable database under the EPI Suite prediction software
Very Basic filters suggest data quality issues
The Stereochemistry challenge: 12,500 chemicals with “missed” stereo
NIST Webbook
PubChem
What you might not know about Chemistry Databases on the Internet
Make sure you blame the database hosts!!! (???)
Errors are primarily deposited and inherited by the data suppliers
Chemistry databases depend enormously on structure representations…
What you might not know about Chemistry Databases on the Internet
Despite all of the blog posts, lectures, presentations and pleas it’s not improving
NPC Browser http://tripod.nih.gov/npc/
Patents
WYSIWYG compounds
But ChemSpider is curated, right?
Originally 15 compounds “called” Yohimbine
54 skeletons for Yohimbine
All aggregators suffer dilution!
Data Curation…long torturous task
Data curation – JUST structure-name validation is a long, torturous, iterative task.
How about validating “data” – PhysChem data such as logP data, boiling points, melting points, spectra
Curating Melting Point Data
http://tinyurl.com/3e44vbx
Melting Point Validation Work
Some melting points can’t be resolved only with literature: 4-benzyltoluene
Data Curation…long torturous task
Data curation – JUST structure-name validation is a long, torturous, iterative task.
How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra
The crowd in crowdsourcing is …generally small
Which of the large databases are doing careful curation? How can we share the workload? Hmm..
ChemSpider can “do it” for us
ChemSpider provides a curation interface
All curation activities are available for review, online immediately, iteratively checked
Curators have different abilities based on their profile: There are only a few “Master Curators”.
Can we “share” the curation workload?
Identifier Dictionaries
Reciprocal curation processes…share curation with each other.
If a database has a compound already then use InChIKeys to match “suggested” validation against the compound.
A series of “added” and “removed” synonyms against InChIKeys for matching.
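The slide does not specify a feed format, so the following is only a minimal sketch of how a receiving database might consume a series of “added” and “removed” synonym actions keyed by InChIKey. The `SynonymEvent` type, the `apply_feed` function and the placeholder keys (`KEY-A`, `KEY-B`) are assumptions made for illustration, not part of ChemSpider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynonymEvent:
    """One curation action published in a hypothetical shared feed."""
    inchikey: str  # structure the action refers to
    synonym: str   # name being asserted or withdrawn
    action: str    # "added" or "removed"


def apply_feed(local_synonyms, events):
    """Apply feed events to compounds the local database already holds.

    Returns the updated synonym sets plus the events that were skipped
    because the local database has no compound with that InChIKey.
    """
    updated = {key: set(names) for key, names in local_synonyms.items()}
    skipped = []
    for event in events:
        if event.inchikey not in updated:
            skipped.append(event)  # compound not held locally
            continue
        if event.action == "added":
            updated[event.inchikey].add(event.synonym)
        elif event.action == "removed":
            updated[event.inchikey].discard(event.synonym)
        else:
            raise ValueError(f"unknown action: {event.action!r}")
    return updated, skipped


# Placeholder keys stand in for real InChIKeys.
local = {"KEY-A": {"aspirin", "asprin"}}
events = [
    SynonymEvent("KEY-A", "acetylsalicylic acid", "added"),
    SynonymEvent("KEY-A", "asprin", "removed"),
    SynonymEvent("KEY-B", "caffeine", "added"),
]
updated, skipped = apply_feed(local, events)
```

In this sketch the events are treated as suggestions that are applied directly; a real system would more likely queue them for review by a curator.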
Proof of Concept Data Curation Sharing
Structure Validation using feed
Look for approved synonyms
Compare feed InChIKey with database InChIKey
If different, flag for inspection
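The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual proof-of-concept code: the feed is assumed to be a list of (approved synonym, InChIKey) pairs, the local database a mapping from lower-cased synonym to InChIKey, and `flag_mismatches` and the `KEY-…` values are hypothetical.

```python
def flag_mismatches(feed, database):
    """Compare feed InChIKeys with database InChIKeys for approved synonyms.

    feed:     iterable of (approved_synonym, feed_inchikey) pairs
    database: dict mapping case-folded synonym -> database InChIKey
    Returns a list of (synonym, feed_inchikey, database_inchikey) tuples
    that need manual inspection.
    """
    flagged = []
    for synonym, feed_key in feed:
        db_key = database.get(synonym.casefold())
        if db_key is None:
            continue  # synonym not present locally, nothing to compare
        if db_key != feed_key:
            flagged.append((synonym, feed_key, db_key))
    return flagged


# Placeholder keys stand in for real InChIKeys.
database = {"vincristine": "KEY-V", "yohimbine": "KEY-Y1"}
feed = [
    ("Vincristine", "KEY-V"),   # agrees with the database
    ("Yohimbine", "KEY-Y2"),    # disagrees, so it is flagged
    ("Unknownium", "KEY-U"),    # not in the database, so it is ignored
]
flagged = flag_mismatches(feed, database)
```

Flagging rather than overwriting keeps a human in the loop, which matches the “flag for inspection” step on the slide.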
Who will participate???
Batch Validation Also Works!
Batch validation of name-structure relationships
“Background Processing framework”
Hexamethylchickenwire Chloride = C12H23O5
Define set of synonym filters and process the entire backfile. We will use synonym filters at deposition
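The slides do not say which synonym filters are used, so the following is only one possible filter, sketched to show the idea: a name that mentions a halogen should belong to a structure whose formula contains that element. The function names and the token table are assumptions, and the substring matching is deliberately crude (for example, “iod” also matches “periodic”).

```python
import re

# Hypothetical mapping from name fragments to the element they imply.
NAME_TOKEN_TO_ELEMENT = {"chlor": "Cl", "brom": "Br", "fluor": "F", "iod": "I"}


def elements_in_formula(formula):
    """Return the set of element symbols found in a simple molecular formula."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def name_conflicts_with_formula(name, formula):
    """List elements implied by the name but missing from the formula."""
    present = elements_in_formula(formula)
    lowered = name.lower()
    return sorted(
        element
        for token, element in NAME_TOKEN_TO_ELEMENT.items()
        if token in lowered and element not in present
    )


# The example from the slide: a "chloride" with no chlorine in its formula.
conflicts = name_conflicts_with_formula("Hexamethylchickenwire Chloride", "C12H23O5")
```

A filter like this could run once over the backfile and again at deposition time, with any non-empty result sent for review instead of being rejected automatically.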
Community Contribution to ChemSpider
ChemSpider as a host for community contributions
Curation and validation input
Structures
Movies
Images
Analytical data – especially spectra
Spectra
www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9
Spectral Game
Data Curation
Reversed Spectrum
Download, reprocess, redeposit
True Curation of Data
Batch wise validation of NMR data
Automated C13 Verification
Mixture Identified
NMR Verification
H1 NMR: 77% of spectra consistent
C13 NMR: 67% of spectra consistent
Algorithms NOT perfect but did identify:
Misreferenced data
Reversed spectra
22 mixtures identified
Signal-to-noise was poor – missing peaks
What about 2DNMR verification?
ChemSpider ID 24528095 HHCOSY
ChemSpider ID 24528095 HSQC
Crowdsourced Spectral Data
Spectral data available at http://www.chemspider.com/spectra.aspx
Regular data depositions
Generally licensed as Open Data
Chemical vendors now contributing spectral data – up to 800 spectra presently being acquired
All data welcomed – who will they benefit?
www.SpectralGame.com
http://spectraschool.rsc.org/
SpectraSchool
Community Contribution to ChemSpider
ChemSpider as a host for community contributions
Curation and validation input
Analytical data – especially spectra
Movies, images
Is it just structures?
ChemSpider SyntheticPages as a host for reaction syntheses
ChemSpider SyntheticPages
Submission Process
Simple template-based submission process
Submissions reviewed by editorial board. Published as is or comments sent to author
Online Peer Review process
Data supported include web movies, images, live spectra etc.
DOI issued to author
Is it working? Show of hands…
How many of you know CSSP? Have any of you submitted to CSSP?
Low submissions but some dedicated authors
It is NOT a technology issue
Students need permission to publish
Publishing syntheses might prevent publication
CSSP would grow if we abstracted supp. info – templated supp. info submissions could help.
Crowdsourcing – does it work?
131 people have EVER either deposited or curated data on ChemSpider
ChemSpider SyntheticPages has a small group of dedicated authors
Database hosts and vendors make the largest contributions of data
ChemSpider staff do the most curation
If it was not just about me…
We might have a community built encyclopedia
I might know where the best restaurants are
I might get good advice on books to read
I might know which movies to watch
I might know which plumber to call
Data might just be Open
How will it improve?
Participation and contribution
RSC’s LearnChemistry:Share
Improved quality of data is essential
Open PHACTS: partnership between European Community and EFPIA
Freely accessible for knowledge discovery and verification
Data on small molecules
Pharmacological profiles
ADMET data
Biological targets and pathways
Proprietary and public data sources
Conclusions
ChemSpider has an important role in quality data
Crowdsourced deposition, validation and curation works but low engagement to date
Primary challenge – engaging the community to help create what they want. Rewards and recognition?
MORE collaboration can benefit us all
All indicators are good for continued growth
Acknowledgments
The ChemSpider team
Craig Knox, DrugBank
Our data providers, depositors, collaborators and curators
Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams