RSC ChemSpider – Building an Internet Based Community for Chemists
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Chemistry on the Internet TODAY
Chemistry searches are generally limited to text-based searches across the internet
Poor quality and little curation/validation work
Too many searches required to resource data
media.obsessable.com
As few interfaces as possible
What do humans want?
Chemistry on the Internet FUTURE
Search by chemical structure and substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
For Synthesis…TotallySynthetic.com
Org Prep Daily (Blog)
Lots of “Public Compound” Databases
PubChem
DrugBank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors
ChemSpider
Where Would You look? What Do You Trust?
Linked Data on the Web
Taken from: Rafael Sidi’s Blog
What is a compound?
What is ChemSpider?
ChemSpider is:
Building a Structure Centric Community for Chemists
>23 million compounds, >300 data sources
A deposition and curation platform
A publishing platform for the community
Grows daily – more depositions, more links, more data sources
How Was ChemSpider Built?
ChemSpider was a “hobby project”
Housed in a basement and running off three servers – one bought, two built
Sensitive to weather and power stability
Went live at ACS Spring 2007 in Chicago
Search Cholesterol
Linked across the internet
Kyoto Encyclopedia of Genes and Genomes
Link off a structure in ChemSpider
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Links to Patents based on structure
Clickthrough to Patents
Articles Linked
Answering Questions for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
Complex Data and Information
ChemSpider is a structure-centric hub
ChemSpider aggregates and links out across the internet
Data aggregate based on “structures and links”
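One way to picture a structure-centric hub is as a lookup table keyed by a structure identifier, with every data source contributing names and links under that key. The sketch below is only an illustration of that idea: it assumes the InChIKey is used as the key, and the `deposits` records, the source names and the `aggregate` helper are hypothetical, not ChemSpider's actual schema.

```python
from collections import defaultdict

# Hypothetical deposits: (source, name used by that source, InChIKey, link type).
deposits = [
    ("Wikipedia", "Water", "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "article"),
    ("VendorA", "Dihydrogen monoxide", "XLYOFNOQVPJJNP-UHFFFAOYSA-N", "supplier"),
    ("PatentDB", "Methane", "VNWKTOKETHGBQD-UHFFFAOYSA-N", "patent"),
]

def aggregate(records):
    """Group names and links from all sources under one structure key."""
    hub = defaultdict(lambda: {"names": set(), "links": []})
    for source, name, key, kind in records:
        hub[key]["names"].add(name)
        hub[key]["links"].append((kind, source))
    return dict(hub)

hub = aggregate(deposits)
water = hub["XLYOFNOQVPJJNP-UHFFFAOYSA-N"]
print(sorted(water["names"]))
print(water["links"])
```

Because the key is the structure rather than the name, "Water" and "Dihydrogen monoxide" land on the same record even though the two sources never mention each other.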
What defines a chemical compound?
What is a compound?
Question Everything online: www.dhmo.org
Di-Hydrogen Monoxide
2H + 1O
H2O
Water
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
PubChem
Truly “I Love You”
Chemistry is REALLY Messy
Vancomycin
Who will curate?
How would you clean such a large dataset?
Assertions!!!
Vancomycin on ChemSpider 1 compound – 3 days
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem, C&E News (from ACS)
What About Digitonin?
CAS as an authority
The Blogging Community Participate
The FDA’s DailyMed
Structures on DailyMed
Lack of Stereochemistry
Incorrect Structures
Wow!
The InChI Identifier
Multiple Layers
InChI Strings Hash to InChIKeys
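The layered-string-to-hashed-key idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not the real InChIKey algorithm: the hypothetical `toy_key` function hashes the formula, `/c` and `/h` layers into a first block and every remaining layer into a second block, and it uses a simplified letter encoding. The two input strings are the InChIs of L- and D-alanine, which differ only in a stereo layer.

```python
import hashlib

def _letters(text: str, n: int) -> str:
    """Map the first n bytes of a SHA-256 digest to uppercase letters."""
    digest = hashlib.sha256(text.encode("ascii")).digest()
    return "".join(chr(ord("A") + b % 26) for b in digest[:n])

def toy_key(inchi: str) -> str:
    """Toy two-block key. NOT the real InChIKey encoding."""
    layers = inchi.split("/")[1:]  # drop the "InChI=1S" version prefix
    # Assumption: formula + connectivity (/c) + hydrogens (/h) form the skeleton.
    skeleton = [layers[0]] + [l for l in layers[1:] if l[:1] in ("c", "h")]
    rest = [l for l in layers[1:] if l[:1] not in ("c", "h")]
    return _letters("/".join(skeleton), 14) + "-" + _letters("/".join(rest), 8)

l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
key_l, key_d = toy_key(l_ala), toy_key(d_ala)
print(key_l)
print(key_d)
```

The two toy keys share their first block (same skeleton) and differ in the second (different stereochemistry), which is the property that makes a hashed key useful for comparing records across databases.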
InChIs for Taxol
Back to Taxol
DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ
Which one is correct???
InChIKeys for Taxol
DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ
ChEBI and Wikipedia are the SAME structure
DrugBank is a DIFFERENT structure – it differs at ONE stereocenter
Does one stereocenter matter?
Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
Building a Structure Centric Community for Chemists
Assertion and Chemical Entities
Who says what Taxol is?
What is the “timeline” for a molecule?
How do we clean up the Public data?
The Quality source is Chemical Abstracts Service…
ChemSpider Searches
ChemSpider Complex Searches
Vancomycin – Search the Internet
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
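One plausible way to get a small exact-match hit list and a much larger skeleton hit list is to compare whole identifiers in the first case and only the connectivity block in the second. The slides do not say how ChemSpider implements these searches, so the sketch below is an assumption: the `index` records use placeholder keys, and `full_molecule` and `full_skeleton` are hypothetical helper names.

```python
# Hypothetical index of (record id, InChIKey-style identifier); keys are placeholders.
index = [
    (1, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    (2, "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    (3, "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    (4, "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N"),
]

def full_molecule(query, records):
    """Exact match: every block of the key must agree."""
    return [rid for rid, key in records if key == query]

def full_skeleton(query, records):
    """Skeleton match: only the first (connectivity) block must agree."""
    block = query.split("-")[0]
    return [rid for rid, key in records if key.split("-")[0] == block]

query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
print(full_molecule(query, index))
print(full_skeleton(query, index))
```

Under this reading, the skeleton search sweeps in every record that shares the connectivity but differs in stereochemistry or other layers, which is why it returns many more hits for a molecule like vancomycin.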
The InChI “Resolver”
Citizen Scientists
Crowd-sourcing Chemistry Curation
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
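The curation actions listed above (tag an error, edit a synonym, deprecate a record) can be modelled with a very small data structure. This is a hypothetical sketch, not ChemSpider's data model: the `Synonym` and `Record` classes, the status names and the two-step flag-then-resolve flow are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Synonym:
    text: str
    status: str = "unreviewed"  # "unreviewed", "flagged" or "approved"

@dataclass
class Record:
    record_id: int
    synonyms: list = field(default_factory=list)
    deprecated: bool = False

    def flag(self, text):
        """Step 1: any user tags a synonym as a suspected error."""
        for s in self.synonyms:
            if s.text == text:
                s.status = "flagged"

    def resolve(self, text, keep):
        """Step 2: a curator approves the flagged synonym or removes it."""
        for s in list(self.synonyms):
            if s.text == text and s.status == "flagged":
                if keep:
                    s.status = "approved"
                else:
                    self.synonyms.remove(s)

rec = Record(1, [Synonym("Vancomycin"), Synonym("Methane")])
rec.flag("Methane")                 # a user spots a wrong name
rec.resolve("Methane", keep=False)  # a curator confirms and removes it
print([s.text for s in rec.synonyms])
```

Separating the flag from the resolution keeps a record of what the crowd reported while leaving the final decision to a second reviewer.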
Building a Structure Centric Community for Chemists
Multi-level Curation and Approval
Citizens as Data Sources
Entity-Extraction, Mark-up, Annotate
Success Depends on Dictionaries
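Why the dictionary matters can be seen in a minimal dictionary-based extractor: a name is only marked up if it is in the lookup table. The sketch below is an illustration, not the method used by any of the tools named on these slides; the three-entry `dictionary` and the `annotate` helper are hypothetical.

```python
import re

# Illustrative name-to-InChIKey dictionary (a real one would be far larger).
dictionary = {
    "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "methane": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
}

def annotate(text, names):
    """Return (start, end, matched text, identifier) for each dictionary hit."""
    # Longest names first so a long name wins over a shorter one inside it.
    ordered = sorted(names, key=len, reverse=True)
    pattern = "|".join(re.escape(n) for n in ordered)
    regex = re.compile(r"\b(" + pattern + r")\b", re.IGNORECASE)
    return [
        (m.start(), m.end(), m.group(0), names[m.group(0).lower()])
        for m in regex.finditer(text)
    ]

hits = annotate("Aspirin is sparingly soluble in water.", dictionary)
for hit in hits:
    print(hit)
```

A compound that is missing from the dictionary, or present under a wrong structure, is either skipped or linked incorrectly, so extraction quality is bounded by dictionary quality.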
Project Prospect
ChemMantis and CJOC
Name-Structure Pairs
Species – linked to Wikipedia
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
ChemSpider Everywhere
Linked from Wikipedia
Linked from Open Notebook Science sites using EMBED
Linked from Blogs using Structure/Spectra EMBED
Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets
Integrated into software offerings from Thermo, Waters, Agilent, Bruker
ChemSpider Everywhere: Embed
ChemSpider Everywhere: What do computers want?
Web services
flickr.com/photos/microcosmos
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere: Crowdsourced Curation of Spectra
ChemSpider Everywhere: ChemMobi
There are always gaps...
What ChemSpider doesn’t deal with yet...
Markush structures and other “non-defineds”
Materials
Minerals
Polymers
Biological macromolecules
What’s next?
Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Layer on RDF to allow the semantic web to benefit from our efforts
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
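As a rough idea of what layering RDF over compound records could look like, the sketch below writes a record out as N-Triples lines. It is an assumption-heavy illustration: the `compound_triples` helper and the `http://example.org/compound/` base URI are hypothetical, and the only vocabulary used is the standard `rdfs:label` and `owl:sameAs` predicates.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def compound_triples(base_uri, inchikey, label, same_as):
    """Serialize one compound as N-Triples lines (subject URI is hypothetical)."""
    subject = f"<{base_uri}{inchikey}>"
    triples = [f'{subject} <{RDFS_LABEL}> "{label}" .']
    for uri in same_as:
        triples.append(f"{subject} <{OWL_SAME_AS}> <{uri}> .")
    return triples

lines = compound_triples(
    "http://example.org/compound/",          # placeholder base URI
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N",           # InChIKey of water
    "Water",
    ["http://dbpedia.org/resource/Water"],
)
for line in lines:
    print(line)
```

Expressing the links as triples lets other sites consume the structure-to-resource connections without going through a web page.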
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams