Navigating the Complex Web of Chemistry Using ChemSpider

Post on 10-May-2015

A growing number of free and open access resources are available to scientists on the internet. Coupled with the increasing availability of Open Source software tools, this puts us in the middle of a revolution in data availability and in the tools used to manipulate these data. ChemSpider is a free access website built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 200 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the ChemSpider platform and how it is fast becoming the centralized hub for sourcing information about chemical entities.

Transcript of Navigating the Complex Web of Chemistry Using ChemSpider

Navigating the Complex Web of Chemistry Using ChemSpider

Antony Williams vs Identifiers

Old Passport ID

Dad, Tony, others

SSN

Green Card

License

5 email addresses

ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)

OpenID….

Aspirin vs Chemical Identifiers

Aspirin names and synonyms

• Text searches depend on correct association

• 335 suggested identifiers for Aspirin just on PubChem!

• Disambiguation dictionaries are necessary
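
A disambiguation dictionary can be sketched as a lookup from many normalized names to a single structure identifier. The three synonyms and the helper names `normalize` and `resolve` are illustrative assumptions, not ChemSpider's actual implementation; the key shown is the standard InChIKey for aspirin.

```python
# Illustrative synonym table: every name resolves to one structure key.
# The choice of synonyms is an assumption made for this sketch.
ASPIRIN_KEY = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # standard InChIKey for aspirin

SYNONYMS = {
    "aspirin": ASPIRIN_KEY,
    "acetylsalicylic acid": ASPIRIN_KEY,
    "2-acetoxybenzoic acid": ASPIRIN_KEY,
}


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def resolve(name: str):
    """Return the structure key for a name, or None if the name is unknown."""
    return SYNONYMS.get(normalize(name))
```

A text search that goes through `resolve` first returns the same record for "Aspirin" and "acetylsalicylic acid", but only if each name was associated with the correct structure in the first place.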

Linked Data Cloud

…the premium database producers are using some automatic tools to prepare a ‘first draft’ of a database record, to be refined by eye.

Coupled with the public internet as a distribution method of choice, it is becoming possible for the first time to create and distribute new structure based databases at much lower costs, or even free of charge.

The Final Search Strategy

All Those Names, One Structure

Content is King and Quality Costs

• Chemistry “content” is big business. Not everyone can afford it.

• Patent searching

• Structures and properties

• Drug databases

• Literature databases

Chemical Abstracts Service (CAS), the “Gold Standard” in chemistry-related information

• 101 years of content

• $260 million revenue (2006)

• >50 million substances

• Proprietary platform

Searching Chemistry on the Internet

How complete a result set will we get if we search for “chemicals” by name?

Is there a better way to link chemistry databases?

Linking by “names” is dangerous

Chemists want structure and SUBstructure searching

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

Oleoylethanolamine

InChI=1S/C20H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-20(23)21-18-19-22/h9-10,22H,2-8,11-19H2,1H3,(H,21,23)/b10-9-

BOWVQLFMWHZBEF-KTKRTIGZSA-N
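
The layered string and the fixed-length key above can be taken apart with a few lines of Python. This is a minimal sketch with hypothetical helper names (`split_inchi`, `split_inchikey`). It assumes a simple InChI with a formula layer and no repeated layer prefixes. It does not compute the hash itself (the InChIKey is derived from the InChI string with a SHA-256-based procedure); it only shows how the two identifiers are structured.

```python
import re

INCHI = (
    "InChI=1S/C20H39NO2/c1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-20(23)"
    "21-18-19-22/h9-10,22H,2-8,11-19H2,1H3,(H,21,23)/b10-9-"
)
INCHIKEY = "BOWVQLFMWHZBEF-KTKRTIGZSA-N"


def split_inchi(inchi: str) -> dict:
    """Split an InChI string into its '/'-separated layers.

    Assumes the second field is the formula and that every later layer
    starts with a one-letter prefix (c, h, b, t, ...).
    """
    prefix, _, rest = inchi.partition("=")
    if prefix != "InChI" or not rest:
        raise ValueError("not an InChI string")
    parts = rest.split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        layers[part[0]] = part[1:]
    return layers


KEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key: str) -> dict:
    """Split an InChIKey into its skeleton block, detail block and flags."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError("not an InChIKey")
    skeleton, detail, std_flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of formula and connectivity
        "detail": detail,            # hash of the remaining layers
        "standard": std_flag == "S",
        "version": version,
        "protonation": protonation,
    }
```

For oleoylethanolamine, `split_inchi(INCHI)["b"]` holds the double-bond stereo layer `10-9-`, and `split_inchikey(INCHIKEY)["skeleton"]` is the first 14-letter block of the key.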

InChIKey Searches Work

Search Engine Dependencies

InChIs have traction…

RDF Linking of Structures

PubChem

The Simplest Organic Molecule

Question Everything online: www.dhmo.org

The Structure-Based Data Cloud

Vancomycin

Who will curate?

How would you clean such a large dataset?

Vancomycin on ChemSpider

Search Molecular SKELETON

Search Full Molecule

Full Skeleton Search: 104 Hits

Full Molecule Search: 4 Hits
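
One plausible reading of the two hit counts is that a skeleton search matches only the first block of the InChIKey (same formula and connectivity, any stereochemistry), while a full molecule search matches the whole key. That reading is an assumption, and so is everything in the sketch below: the function names are hypothetical and the keys are made-up placeholders, not real compounds.

```python
def skeleton_block(key: str) -> str:
    """Return the first 14-letter block of an InChIKey."""
    return key.split("-")[0]


def full_molecule_search(query: str, index: dict) -> list:
    """Record IDs whose key matches the query exactly."""
    return sorted(rid for rid, key in index.items() if key == query)


def skeleton_search(query: str, index: dict) -> list:
    """Record IDs that share the query's first block (same skeleton)."""
    wanted = skeleton_block(query)
    return sorted(
        rid for rid, key in index.items() if skeleton_block(key) == wanted
    )


# Placeholder index: record ID -> InChIKey-shaped string (not real keys).
INDEX = {
    "rec-1": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # the query structure
    "rec-2": "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, other stereo
    "rec-3": "AAAAAAAAAAAAAA-UHFFFAOYSA-N",  # same skeleton, no stereo layer
    "rec-4": "DDDDDDDDDDDDDD-BBBBBBBBSA-N",  # different skeleton
}
QUERY = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
```

With this index, the full molecule search returns one record and the skeleton search returns three. The skeleton result is broader because it also collects records with missing or different stereochemistry, which is why it is useful for spotting curation problems.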

What is ChemSpider?

ChemSpider is:

Building a Structure-Centric Community for Chemists

22.2 million compounds, >200 data sources

A deposition and curation platform

A publishing platform for the community

Grows daily – more depositions, more links, more data sources

For Chemical Compounds

Vendor sites – Aldrich, Alfa Aesar, TCI and 100s of others

Government databases – PubChem, DSSTox, FDA databases, ChemIDPlus,…

Biological Databases – Protein Database, Stitch, KEGG, ChEBI,…

Analytical databases – NMRShiftDB,…

How Was ChemSpider Built?

• ChemSpider was a “hobby project”

• Housed in a basement and running off three servers – one bought, two built

May 2009

• 3 servers – 2 homebuilt

• .NET architecture

• SQL Server

• Homebuilt structure/substructure

• Commercial components

• Open Source components: OpenBabel, Jmol, JSpecView, NCBI Toolkit, InChI Libraries

Search Cholesterol

Linked across the internet

Kyoto Encyclopedia of Genes and Genomes

Links to Patents based on structure

Answering Questions for Chemists

Questions a chemist might ask…

• What is the melting point of n-butanol?

• What is the chemical structure of Xanax?

• Chemically, what is phenolphthalein?

• What are the stereocenters of cholesterol?

• Where can I find publications about xylene?

• What are the different trade names for Ketoconazole?

• What is the NMR spectrum of Aspirin?

• What are the safety handling issues for Thymol Blue?

Complex Data and Information

Remember – QUALITY ISSUES

The FDA’s DailyMed

Incorrect Structures

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon

Crowd-sourcing Chemistry Curation

We Need Recognition and Rewards

Master Curators, Curators, Depositors

Collaborating with Wikipedia

Long term project to curate chemical compounds

Robotically linking ChemSpider to Wikipedia at present

Will layer on InChI Strings and InChIKeys shortly and make Wikipedia structure searchable

Blogs need InChIs too!

Use Intelligent Structures: ChemSpider Embed Web Service

ChemSpider Web Services

Semantic Mark-up for Chemistry

Semantic mark-up for chemistry is here

RSC Project Prospect

Nature Publishing Group compound linking

ChemMantis

Nature Chemistry Compound Pages

Project Prospect

ChemMantis

Deposit Structures

Species – linked to Wikipedia

Semantic Linking of Structures

What would you want to link off a structure?

• Chemical suppliers

• Other publications

• Analytical Data

• Related Reactions

• Wikipedia

• Patents

• “Everything”

The InChI “Resolver”

InChI Resolver to DOIs

Structure Search the Web

Conclusions

Internet resources provide a collaborative community for chemistry

Crowdsourcing to expand, curate and integrate to the benefit of chemists

Searching the web for chemistry is arriving

InChIs are enabling chemistry on the internet

Question Quality!

antony.williams@chemspider.com

Twitter: ChemSpiderman

www.chemspider.com/blog