ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

ChemSpider was developed to aggregate and index the available sources of chemical structures and their associated information into a single searchable repository, and to make it available to everybody at no charge. There are many tens of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and more, and no single way to search across them. Despite the diversity of databases available online, their quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform through which the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of over 20 million chemical substances integrated with over 300 disparate data sources, many of them directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for the semantic web for chemistry and to provide a set of online tools and services to support access to these data. I will also discuss how ChemSpider is being used to enhance Semantic Publishing in Chemistry at RSC.

Transcript of ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Page 1: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Page 2: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 6: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

The Final Search Strategy

Page 7: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

All Those Names, One Structure

A problem to solve…

Page 8: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 9: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Trustworthy Chemistry?

Encyclopedic articles (Wikipedia)

Chemical vendor databases

Metabolic pathway databases

Property databases

Patents with chemical structures

Drug Discovery data

Scientific publications

Compound aggregators

Blogs/Wikis and Open Notebook Science

Page 10: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Where Would You look? What Do You Trust?

Page 11: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Question Everything online: www.dhmo.org

Page 12: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Di-Hydrogen Monoxide

2H

Page 13: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Di-Hydrogen Monoxide

2H + 1O

Page 14: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Di-Hydrogen Monoxide

H2O

Page 15: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Di-Hydrogen Monoxide

H2O

Water

Page 16: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

It’s all on Wikipedia…

Page 17: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Chemistry on The Internet Is Messy

Page 18: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

It’s Methane…

Page 19: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

What’s Methane?

Page 20: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

What’s Methane?

Page 21: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

What ELSE is Methane???

Page 22: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Drugs are REALLY Messy

Page 23: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

Page 24: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

The EXPERTS must get it right?!

Page 25: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Wikipedia, C&E News, PubChem C&E News (from ACS)

Page 26: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Feedback from C&E Senior Editor

“Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

Page 27: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Structural Data for Life Sciences

DailyMed

Page 28: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Lack of Stereochemistry

Page 29: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Incorrect Structures

Page 30: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Ugh…

Page 31: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 32: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Just “Public Compound” Databases

PubChem

Drugbank

ChEBI/ChEMBL

KEGG

LipidMAPs

ChemIDPlus

eMolecules

ZINC

Lots of chemical vendors

ChemSpider

Page 33: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

media.obsessable.com

As few interfaces as possible

What do humans want?

Page 34: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

A Pragmatic Vision

“Build a Structure Centric Community to Serve Chemists”

December 2006 – A hobby project initiated to connect chemistry on the web

Integrate chemical structure data on the web

Create a “structure-based hub” to information and data

Provide access to structure-based “algorithms”

Let chemists contribute their own data

Allow the community to curate/correct data

Page 35: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Answer Questions

Questions a chemist might ask…

What is the melting point of n-heptanol?

What is the chemical structure of Xanax?

Chemically, what is phenolphthalein?

What are the stereocenters of cholesterol?

Where can I find publications about xylene?

What are the different trade names for Ketoconazole?

What is the NMR spectrum of Aspirin?

What are the safety handling issues for Thymol Blue?

Page 36: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

ChemSpider Searches

Page 37: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search Cholesterol

Page 38: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search Cholesterol

Page 39: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search Cholesterol

Page 40: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search Cholesterol

Page 41: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search Cholesterol

Page 42: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

A Link Farm to Content

Page 43: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Linked across the internet

Page 44: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Kyoto Encyclopedia of Genes and Genomes

Page 45: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Linking SMPDB

Page 46: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Links to Patents based on structure

Page 47: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Articles Linked

Page 49: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search “OEA”

Page 50: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search OEA

Page 51: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search OEA

Page 52: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Search OEA

Page 53: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Linked Patents for OEA

Page 55: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Statistics for Today

>23 million compounds from >300 data sources

About 7000 unique users per day and up to ½ million transactions per day

A crowdsourced deposition and curation platform

Grows daily – more depositions, more links, more data

Page 56: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Searching Chemistry on the Internet

How complete a result set will we get if we search for “chemicals” by name?

Is there a better way to link chemistry databases? Linking by “names” is dangerous

Chemists want structure and SUBstructure searching

Page 57: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

The InChI Identifier

Page 58: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Multiple Layers
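
The layered design can be illustrated with a short sketch. It assumes the textual convention that an InChI string starts with `InChI=`, followed by a version segment, an unprefixed formula segment, and further `/`-separated segments that each begin with a one-letter layer code (for example `c` for connections and `h` for hydrogens). The function name `inchi_layers` is hypothetical and is not part of any InChI toolkit; the two example strings are the standard InChIs for methane and water, both of which appear earlier in this deck.

```python
from typing import List, Tuple


def inchi_layers(inchi: str) -> List[Tuple[str, str]]:
    """Split an InChI string into (layer code, layer body) pairs.

    Illustrative only: this is plain string handling, not chemistry-aware
    validation, and it makes no attempt to interpret the layers.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)
    version, *rest = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    for position, segment in enumerate(rest):
        if not segment:
            continue
        # The formula layer comes first and has no lowercase layer code.
        if position == 0 and not segment[0].islower():
            layers.append(("formula", segment))
        else:
            layers.append((segment[0], segment[1:]))
    return layers


for name, inchi in [("methane", "InChI=1S/CH4/h1H4"),
                    ("water", "InChI=1S/H2O/h1H2")]:
    print(name, inchi_layers(inchi))
```

A list of pairs is used rather than a dictionary because a layer code may, in general, occur more than once in a single string.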

Page 59: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

InChI Strings Hash to InChIKeys
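
A minimal sketch of how the fixed layout of an InChIKey can be taken apart. It assumes the usual 27-character form: a 14-letter first block hashed from the connectivity layers, a second block made of 8 hash letters for the remaining layers (such as stereochemistry) plus a standard/non-standard flag and a version letter, and a final protonation character. The function `split_inchikey` and its field names are illustrative and not part of any library; it does not compute the hash itself. The example key is the one shown on the InChIKeys slide of this deck.

```python
import re
from typing import NamedTuple

# 14 letters, hyphen, 8 letters + flag (S or N) + version letter, hyphen, 1 letter.
_INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    skeleton: str      # first block: hash of the connectivity layers
    detail: str        # hash letters of the second block
    is_standard: bool  # True when the flag character is "S"
    version: str       # version letter, "A" for InChI version 1
    protonation: str   # "N" indicates a neutral species


def split_inchikey(key: str) -> InChIKeyParts:
    """Split an InChIKey into its blocks; raise ValueError on a bad shape."""
    match = _INCHIKEY_RE.match(key.strip().upper())
    if match is None:
        raise ValueError("not a well-formed InChIKey: %r" % key)
    skeleton, detail, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, detail, flag == "S", version, protonation)


parts = split_inchikey("RCINICONZNJXQF-MZXODVADSA-N")
print(parts)
```

Because the key has a fixed length and alphabet, it can be indexed by ordinary text search engines, which a full InChI string with its slashes and punctuation generally cannot.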

Page 60: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Link the Internet with InChIKeys!

Taken from: Rafael Sidi’s Blog
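
The contrast between linking by name and linking by InChIKey can be sketched as follows. The two record lists are hypothetical stand-ins for independent data sources, and the `link` helper is illustrative rather than any ChemSpider function. The keys are, to the best of my knowledge, the standard InChIKeys for methane and water; treat them as assumptions and verify them before reuse.

```python
from collections import defaultdict

# Hypothetical source 1: a vendor catalog that uses trivial names.
vendor_catalog = [
    {"name": "Marsh gas", "inchikey": "VNWKTOKETHGBQD-UHFFFAOYSA-N"},
    {"name": "Dihydrogen monoxide", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
]

# Hypothetical source 2: a property database that uses common names.
property_db = [
    {"name": "Methane", "inchikey": "VNWKTOKETHGBQD-UHFFFAOYSA-N"},
    {"name": "Water", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N"},
]


def link(records_a, records_b, field):
    """Pair up records from two sources whose `field` values match exactly."""
    index = defaultdict(list)
    for record in records_b:
        index[record[field].lower()].append(record)
    return [(a, b) for a in records_a for b in index.get(a[field].lower(), [])]


by_name = link(vendor_catalog, property_db, "name")
by_key = link(vendor_catalog, property_db, "inchikey")
print(len(by_name), "links by name;", len(by_key), "links by InChIKey")
```

With these records the name join finds nothing, while the key join pairs both compounds, because the identifier is derived from the structure and not from whichever synonym a source happened to choose.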

Page 61: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Vancomycin – Search the Internet

Page 62: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Page 63: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Full Molecule Search: 4 Hits

Page 64: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Full Skeleton Search: 104 Hits
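
The difference between the two searches can be sketched under one assumption: that a "full molecule" search matches the whole InChIKey, while a "skeleton" search matches only its first block, which encodes connectivity and ignores stereochemistry and similar detail. The query key is the one shown on the InChIKeys slide of this deck; the other keys, including their second blocks, are invented placeholders, and the helper names are hypothetical.

```python
def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def full_matches(query: str, keys):
    """Keys identical to the query in every block."""
    return [key for key in keys if key == query]


def skeleton_matches(query: str, keys):
    """Keys that share the query's first block, whatever the other blocks say."""
    target = skeleton_block(query)
    return [key for key in keys if skeleton_block(key) == target]


query = "RCINICONZNJXQF-MZXODVADSA-N"

# Placeholder index: only the first entry is a key taken from this deck.
indexed_keys = [
    "RCINICONZNJXQF-MZXODVADSA-N",  # same structure as the query
    "RCINICONZNJXQF-AAAAAAAASA-N",  # same skeleton, invented second block
    "RCINICONZNJXQF-BBBBBBBBSA-N",  # same skeleton, invented second block
    "ZZZZZZZZZZZZZZ-AAAAAAAASA-N",  # unrelated, invented key
]

print(len(full_matches(query, indexed_keys)), "full-molecule hits")
print(len(skeleton_matches(query, indexed_keys)), "skeleton hits")
```

A skeleton search always returns at least as many hits as a full search, which is useful when sources disagree about stereochemistry or omit it.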

Page 68: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Vancomycin

Page 69: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Vancomycin on ChemSpider

1 compound – 3 days

Page 70: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…

Page 71: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…

is what???

Page 72: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

The InChI “Resolver”

Page 73: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

InChI Resolver to DOIs

Structure Search the Web

Page 74: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Most Chemistry is NOT Published

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry” - never published and cannot be abstracted

Reactions performed

Structures made and studied

Spectra acquired and then disposed of

Available chemicals never found

Page 75: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

The CAS Registry

Page 76: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

CAS Registry

Page 77: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Crowd-sourcing Curation and Deposition

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Page 78: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Building a Structure Centric Community for Chemists

Multi-level Curation and Approval

Page 79: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Entity-Extraction, Mark-up, Annotate

Page 80: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Semantic Markup: Project Prospect

Page 81: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Success Depends on Dictionaries

Link to a Structure or the Right Structure?

Page 82: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Name-Structure Pairs

Page 83: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Semantic Linking of Structures

What would you want to link off a structure?

Chemical suppliers

Other publications

Analytical Data

Related Reactions

Wikipedia

Patents

“Everything”

Page 84: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Org Prep Daily (Blog)

Page 85: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

ChemSpider SyntheticPages

Page 86: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Chemistry on the Internet FUTURE

The semantic web for chemistry is in place

Crowdsourced contributions are commonplace

Chemists will search by structure/substructure

Chemistry articles indexed and searchable

Reduced number of searches to find data

Data are integrated – compounds, vendors, syntheses, data, publications and patents

A world of Open Access and Open Data

Classical business models will have to morph

Page 87: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

ChemSpider Web Services

Page 89: ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences

Thank you

[email protected]: ChemSpiderman

www.chemspider.com/blog

SLIDES: www.slideshare.net/AntonyWilliams