ChemSpider - Does Community Engagement work to Build a Quality Online Resource for Chemists?

Antony Williams, ACS Denver, August 30th 2011

What’s said on the web is true…

“We then established a collaboration with professor Sum Ting Wong, a fugitive from the North Korean University Hu Yu Hai Ding, currently in Rome (Italy).”

“This was identified as the new protein Wai So Dim (WSD).”

Who is Sandy Lawson? Ask Google

Who is Sandy..to me?

Mentor in computer-generated nomenclature Educational Technologist Innovator Ethical

“Gentleman Sandy”

What is the Structure of Vitamin K1?

ChemSpider

The Free Chemical Database

A central hub for chemists to source information: >26 million unique chemical records, aggregated from >400 data sources

Chemicals, spectra, CIF files, movies, images, podcasts, links to patents, publications, predictions

A central hub for chemists to deposit & curate data

ChemSpider general statements

ChemSpider: one of many important resources

The “Google and Wikipedia of Chemistry”

A vision of “Linking all chemistry on the internet”

Most people in this room probably know about it

New people discover us regularly

Our distinct roles are: hosting and exposing data for the community; curating and validating chemistry-related data

I want to know about “Vincristine”

If all algorithms work then everything on the page is correct by default except the name!

Vincristine: Identifiers and Properties

Vincristine: Vendors and Sources

Vincristine: Patents

Vincristine: Articles

Searches: The INTERNET

All ChemSpider and Internet searches are “simply algorithms” but synonym searching is based on an assertion

InChIs

Validated Names for Searching…

What you might not know about Chemistry Databases on the Internet

Data-sharing between the databases is cyclic – proliferating errors – “Linked Data”

What you might not know about Chemistry Databases on the Internet

Some public databases are “trusted” as primary sources

Trust is granted without investigation or understanding of the content

Consider searching each of these chemical databases by chemical name (systematic name, trade name or synonym). Please mark each online resource according to how much you generally trust the results.



What do we know about some of the online resources?

PHYSPROP Database

The freely downloadable database under the EPI Suite prediction software

Very Basic filters suggest data quality issues

The Stereochemistry challenge: 12,500 chemicals with “missed” stereo

NIST Webbook

PubChem

What you might not know about Chemistry Databases on the Internet

Make sure you blame the database hosts!!! (???)

Errors are primarily deposited and inherited by the data suppliers

Chemistry databases depend enormously on structure representations…

What you might not know about Chemistry Databases on the Internet

Despite all of the blog posts, lectures, presentations and pleas it’s not improving

NPC Browser http://tripod.nih.gov/npc/

Patents

WYSIWYG compounds

But ChemSpider is curated, right?

Originally 15 compounds “called” Yohimbine

54 skeletons for Yohimbine

All aggregators suffer dilution!

Data Curation…long torturous task

Data curation – JUST structure-name validation is a long, torturous, iterative task.

How about validating “data” – PhysChem data such as logP data, boiling points, melting points, spectra

Curating Melting Point Data

http://tinyurl.com/3e44vbx

Melting Point Validation Work

Some melting points can’t be resolved only with literature: 4-benzyltoluene

Data Curation…long torturous task

Data curation – JUST structure-name validation is a long, torturous, iterative task.

How about validating “data” – PhysChem data such as logP data, boiling points, melting points (J.C.Bradley’s talk), spectra

The crowd in crowdsourcing is …generally small

Which of the large databases are doing careful curation? How can we share the workload? Hmm..

ChemSpider can “do it” for us

ChemSpider provides a curation interface

All curation activities are available for review, online immediately, iteratively checked

Curators have different abilities based on their profile: There are only a few “Master Curators”.

Can we “share” the curation workload?

Identifier Dictionaries

Reciprocal curation processes…share curation with each other.

If a database already has a compound, use InChIKeys to match the “suggested” validation against that compound.

A series of “added” and “removed” synonyms against InChIKeys for matching.
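One way such a synonym feed could be consumed is sketched below. This is only an illustration under stated assumptions: the record layout `(action, inchikey, name)` and the function name `apply_synonym_feed` are hypothetical and are not taken from ChemSpider, and the keys shown are placeholders rather than real InChIKeys.

```python
def apply_synonym_feed(synonyms, feed):
    """Apply "added"/"removed" synonym records to a local synonym table.

    synonyms: dict mapping an InChIKey to the set of names held locally.
    feed: iterable of (action, inchikey, name) tuples (assumed layout).
    """
    for action, inchikey, name in feed:
        if inchikey not in synonyms:
            continue  # only touch compounds the database already holds
        if action == "added":
            synonyms[inchikey].add(name)
        elif action == "removed":
            synonyms[inchikey].discard(name)
    return synonyms


# Placeholder keys, not real InChIKeys.
local = {"KEY-1": {"old name"}}
feed = [
    ("added", "KEY-1", "new name"),
    ("removed", "KEY-1", "old name"),
    ("added", "KEY-2", "ignored name"),  # compound not held locally
]
updated = apply_synonym_feed(local, feed)
```

Matching on the InChIKey rather than on the name means the receiving database never has to trust the sender's name-to-structure assignment; it only accepts suggestions for structures it already holds.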

Proof of Concept Data Curation Sharing

Structure Validation using feed

Look for approved synonyms

Compare feed InChIKey with database InChIKey

If different, flag for inspection
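The three steps above can be sketched roughly as follows. The `FeedEntry` type, the `flag_mismatches` function and the dictionary layout are assumptions made for illustration, not the actual ChemSpider feed format, and the keys are placeholders rather than real InChIKeys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedEntry:
    synonym: str   # approved (validated) name from the curation feed
    inchikey: str  # InChIKey the feed asserts for that name


def flag_mismatches(feed, local_names):
    """Return (synonym, feed_key, local_key) where the two InChIKeys differ.

    local_names maps a lower-cased synonym to the InChIKey stored locally.
    """
    flagged = []
    for entry in feed:
        local_key = local_names.get(entry.synonym.lower())
        if local_key is None:
            continue  # name not in the local database: nothing to compare
        if local_key != entry.inchikey:
            flagged.append((entry.synonym, entry.inchikey, local_key))
    return flagged


# Placeholder keys, not real InChIKeys.
feed = [
    FeedEntry("Aspirin", "KEY-A"),
    FeedEntry("Caffeine", "KEY-B"),
    FeedEntry("Unlisted", "KEY-C"),
]
local_names = {"aspirin": "KEY-A", "caffeine": "KEY-X"}
to_inspect = flag_mismatches(feed, local_names)
```

A mismatch is only flagged for inspection, not corrected automatically, since either side of the comparison could hold the wrong structure.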

Who will participate???

Batch Validation Also Works!

Batch validation of name-structure relationships

“Background Processing framework”

Hexamethylchickenwire Chloride = C12H23O5

Define a set of synonym filters and process the entire backfile. We will use synonym filters at deposition.
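A minimal sketch of what one such synonym filter might look like is given below. It checks that an element implied by a word in the name actually appears in the molecular formula. The filter table and function names are hypothetical examples; the real ChemSpider filters are not described in these slides.

```python
import re

# Hypothetical filter table: a word in the name implies an element in the formula.
NAME_IMPLIES_ELEMENT = {
    "chloride": "Cl",
    "bromide": "Br",
    "iodide": "I",
    "fluoride": "F",
    "sodium": "Na",
}


def elements_in(formula):
    """Return the set of element symbols in a simple formula such as C12H23O5."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def failed_filters(name, formula):
    """List the name words whose implied element is missing from the formula."""
    present = elements_in(formula)
    lowered = name.lower()
    return [
        word
        for word, element in NAME_IMPLIES_ELEMENT.items()
        if word in lowered and element not in present
    ]


suspect = failed_filters("Hexamethylchickenwire Chloride", "C12H23O5")
clean = failed_filters("Sodium chloride", "NaCl")
```

Run over a whole backfile, a filter like this only produces candidates for review; a name that fails it may still be a legitimate trade name or salt form that a curator needs to judge.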

Community Contribution to ChemSpider

ChemSpider as a host for community contributions: curation and validation input, structures, movies, images, analytical data – especially spectra

Spectra

www.SpectralGame.com

http://www.jcheminf.com/content/1/1/9

Spectral Game

Data Curation

Reversed Spectrum

Download, reprocess, redeposit

True Curation of Data

Batch wise validation of NMR data

Automated C13 Verification

Mixture Identified

NMR Verification

H1 NMR: 77% of spectra consistent

C13 NMR: 67% of spectra consistent

Algorithms NOT perfect but did identify: misreferenced data, reversed spectra, 22 mixtures, and spectra where signal-to-noise was poor (missing peaks)

What about 2D NMR verification?

ChemSpider ID 24528095 HHCOSY

ChemSpider ID 24528095 HSQC

Crowdsourced Spectral Data

Spectral data available at http://www.chemspider.com/spectra.aspx

Regular data depositions

Generally licensed as Open Data

Chemical vendors now contributing spectral data – up to 800 spectra presently being acquired

All data welcomed – who will they benefit? www.SpectralGame.com http://spectraschool.rsc.org/


SpectraSchool

Community Contribution to ChemSpider

ChemSpider as a host for community contributions: curation and validation input, analytical data – especially spectra, movies, images

Is it just structures?

ChemSpider SyntheticPages as a host for reaction syntheses

ChemSpider SyntheticPages

Submission Process

Simple template-based submission process

Submissions reviewed by editorial board. Published as is or comments sent to author

Online Peer Review process

Data supported include web movies, images, live spectra etc.

DOI issued to author

Is it working? Show of hands…

How many of you know CSSP? Have any of you submitted to CSSP?

Low submissions but some dedicated authors

It is NOT a technology issue

Students need permission to publish

Publishing syntheses might prevent publication

CSSP would grow if we abstracted supp. info – templated supp. info submissions could help.

Crowdsourcing – does it work?

131 people have EVER either deposited or curated data on ChemSpider

ChemSpider SyntheticPages has a small group of dedicated authors

Database hosts and vendors make the largest contributions of data

ChemSpider staff do the most curation

If it was not just about me…

We might have a community built encyclopedia

I might know where the best restaurants are

I might get good advice on books to read

I might know which movies to watch

I might know which plumber to call

Data might just be Open

How will it improve?

Participation and contribution

RSC’s LearnChemistry:Share

Improved quality of data is essential

Open PHACTS: partnership between European Community and EFPIA

Freely accessible for knowledge discovery and verification

Data on small molecules: pharmacological profiles, ADMET data, biological targets and pathways

Proprietary and public data sources

Conclusions

ChemSpider has an important role in quality data

Crowdsourced deposition, validation and curation works but low engagement to date

Primary challenge – engaging the community to help create what they want. Rewards and recognition?

MORE collaboration can benefit us all

All indicators are good for continued growth

Acknowledgments

The ChemSpider team

Craig Knox, DrugBank

Our data providers, depositors, collaborators and curators

Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Thank you

Email: [email protected]

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams