ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community
Delivering Curated Chemistry to the World via Crowdsourced Deposition and Annotation on ChemSpider
Antony Williams
University of Chicago, January 27th 2012
The World of Online Chemistry
• Property databases
• Compound aggregators
• Screening assay results
• Scientific publications
• Encyclopedic articles (Wikipedia)
• Metabolic pathway databases
• ADME/Tox data – eTOX for example
• Blogs/Wikis and Open Notebook Science
• Contributing Open Source code to projects
We Have … Too Much Data!!!
e-Science and Primary Data
How much data generated in a lab, that COULD go public, is lost forever?
TotallySynthetic.com
Public Domain reference databases of value?
• Syntheses
• Properties
• Spectra
• CIFs
• Images
PubChem
ChEMBL
Collaborative Knowledge Management
Much of chemistry is chemical structure-based – where and how could we host these data?
RSC’s ChemSpider
Available Information…
Linked to vendors, safety data, toxicity, metabolism
Available Information….
Crowdsourced “Annotations”
Users can add:
• Descriptions/Syntheses/Commentaries
• Links to PubMed articles
• Links to articles via DOIs
• Spectral data
• Crystallographic Information Files
• Photos
• MP3 files
• Videos
Spectra
Data on the Web
Chemistry Data online is messy
We have inherited errors
• All public compound databases, including ours, have errors
• “Incorrect” structures – assertions, timelines etc.
• “Incorrect” names associated with structures
• Properties
• Links
• Publications
ENORMOUS CHALLENGE
The Structure of Vitamin K?
MeSH
A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.
The Structure of Vitamin K1?
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
ChEBI – Manual Curation
“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
Variants of systematic names on PubChem
2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl 2-methyl-3-(3,7,11,15-tetramethyl 2-methyl-3-[(E)-3,7,11,15-tetramethyl
Question Everything online: www.dhmo.org
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
NLM’s DailyMed
PHYSPROP Database
The freely downloadable database under the EPI Suite prediction software
Very Basic filters suggest data quality issues
The Stereochemistry challenge: 12,500 chemicals with “missed” stereo
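A "very basic filter" of the kind mentioned above could look like the following sketch. It is not the filter used in the talk. It assumes each record carries a systematic name and a standard InChI, and it flags records whose name contains stereodescriptors such as (2S) or (E,7R,11R) while the InChI has no stereo layer (/b, /t, /m, /s). The function names and the regular expression are illustrative assumptions, and the heuristic ignores less common InChI layouts such as reconnected layers.

```python
import re

# Heuristic pattern for stereodescriptors in a systematic name,
# e.g. "(2S)", "(R)", "(E,7R,11R)". Illustrative only, not exhaustive.
STEREO_IN_NAME = re.compile(r"\((?:\d*[RSEZ])(?:,\d*[RSEZ])*\)")

# InChI layer prefixes that carry stereochemistry:
# /b double bond, /t tetrahedral, /m and /s stereo type flags.
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def inchi_has_stereo(inchi: str) -> bool:
    """Return True if the InChI string contains any stereo layer."""
    layers = inchi.split("/")[2:]  # skip the version and formula segments
    return any(layer.startswith(STEREO_LAYER_PREFIXES) for layer in layers)


def missed_stereo(name: str, inchi: str) -> bool:
    """Flag a record whose name asserts stereo that the structure lacks."""
    return bool(STEREO_IN_NAME.search(name)) and not inchi_has_stereo(inchi)


# Alanine with and without defined stereochemistry.
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE_NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

records = [
    ("(2S)-2-aminopropanoic acid", L_ALANINE),          # consistent
    ("(2S)-2-aminopropanoic acid", ALANINE_NO_STEREO),  # "missed" stereo
    ("2-aminopropanoic acid", ALANINE_NO_STEREO),       # consistent
]
flags = [missed_stereo(name, inchi) for name, inchi in records]
print(flags)
```

A filter like this only identifies candidates for a curator to review; it cannot tell whether the name or the structure is the wrong one.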
With Great Fanfare…
NPC Browser http://tripod.nih.gov/npc/
Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
Public Domain Databases
Our databases are a mess…
Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming and challenging
Stop Whining – Fix it
Crowdsourced Curation
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
Standards: Structure Standardization
What needs to happen?
Standards
• Standardization of structures
• ChEBI/PubChem sharing
• InChI adoption
The InChI Identifier
Multiple Layers
InChI Strings Hash to InChIKeys
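To make "multiple layers" concrete, the sketch below splits an InChI string into its slash-separated layers. The helper name split_inchi_layers is an assumption for illustration. It handles only the simple case where each layer prefix appears once, and it does not generate InChIKeys: the fixed-length hashed key is produced by the InChI software itself, which is not reproduced here.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into a dict of layers keyed by prefix letter.

    Simplified sketch: assumes each layer prefix occurs at most once.
    """
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # ignore empty segments
            layers[part[0]] = part[1:]
    return layers


ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
l_alanine = split_inchi_layers(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)

print(ethanol)            # formula, connectivity (c) and hydrogen (h) layers
print(sorted(l_alanine))  # additionally has stereo layers t, m and s
```

Ethanol needs only formula, connectivity and hydrogen layers, while L-alanine adds the stereo layers that distinguish it from its mirror image.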
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
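The difference between the two hit counts can be sketched with InChIKeys. This sketch assumes that a skeleton search matches only the first hyphen-separated block of the key (commonly described as encoding connectivity), while a full-molecule search matches the whole key. The keys below are made-up placeholders, not real compounds, and the function names are hypothetical.

```python
def skeleton_block(inchikey: str) -> str:
    """Return the first hyphen-separated block of an InChIKey."""
    return inchikey.split("-")[0]


def search(index_keys, query_key):
    """Return (full-molecule hits, skeleton hits) for a query key."""
    full = [k for k in index_keys if k == query_key]
    skeleton = [
        k for k in index_keys
        if skeleton_block(k) == skeleton_block(query_key)
    ]
    return full, skeleton


# Hypothetical keys: three share a skeleton block, one does not.
index_keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N",
]
full_hits, skeleton_hits = search(index_keys, "AAAAAAAAAAAAAA-BBBBBBBBSA-N")
print(len(full_hits), len(skeleton_hits))
```

The skeleton search is broader by construction, which fits the pattern on the slides: many more skeleton hits than full-molecule hits.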
Crowdsourcing Works
>130 people have deposited data and participated in data curation
Curators at different levels check each other’s work
More curators and depositors are encouraged!
What needs to happen?
Standards
• Standardization of structures
• ChEBI/PubChem sharing
• InChI adoption
Collaboration
• Stop reinventing the wheel
• Share data, share efforts and speed the process
Antony Williams vs Identifiers
Passport ID
Dad, Tony, others
SSN
Green Card
License
5 email addresses
ChemSpiderman (blog, Twitter account, Facebook, Friendfeed)
OpenID….
Aspirin names and synonyms
• Text searches depend on correct association
• 335 suggested identifiers for Aspirin just on PubChem!
• Disambiguation dictionaries are necessary, not just for authors!
The Final Search Strategy
All Those Names, One Structure
Ambiguity in Identifiers
Curated Dictionaries Matter
Success Depends on Dictionaries
Validated Name-Structure Dictionaries
Chemical name dictionaries are used for:
• Text-mining (publications, patents): used to index PubMed and link to Google Patents
• Linking to other databases – think Biology! When structures are not available, drug names link
• Searching the web: names link to structures, which link to InChIs
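A validated name-structure dictionary can be pictured as a mapping from normalized names to structure identifiers. The minimal sketch below is not ChemSpider's implementation: the NameDictionary class and its normalization rules are assumptions. The aspirin InChIKey is the real one; the two yohimbine entries use placeholder identifiers to show how a name attached to more than one structure would be surfaced for curation.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Collapse whitespace and ignore case so trivial variants match."""
    return re.sub(r"\s+", " ", name).strip().casefold()


class NameDictionary:
    """Hypothetical name-to-structure dictionary keyed by normalized name."""

    def __init__(self):
        self._by_name = defaultdict(set)

    def add(self, name: str, structure_id: str) -> None:
        self._by_name[normalize(name)].add(structure_id)

    def lookup(self, name: str) -> list:
        return sorted(self._by_name.get(normalize(name), ()))

    def ambiguous_names(self) -> list:
        """Names linked to more than one structure: candidates for curation."""
        return sorted(n for n, ids in self._by_name.items() if len(ids) > 1)


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"  # InChIKey of aspirin

names = NameDictionary()
names.add("Aspirin", ASPIRIN)
names.add("acetylsalicylic  acid", ASPIRIN)
names.add("2-Acetoxybenzoic acid", ASPIRIN)
# Placeholder identifiers, not real keys: one name, two structures.
names.add("Yohimbine", "PLACEHOLDER-STRUCTURE-1")
names.add("yohimbine", "PLACEHOLDER-STRUCTURE-2")

print(names.lookup("ASPIRIN"))
print(names.ambiguous_names())
```

Every synonym of aspirin resolves to one structure, while the ambiguous name is reported instead of being silently linked to a single record.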
I want to know about “Vincristine”
If all algorithms work then everything on the page is correct by default except the name-structure relationship!
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources – Linked by Structure
Vincristine: Patents – Linked by Name
Vincristine: Articles – Linked by Name
Challenges of Complex Molecules: Yohimbine
Originally 15 compounds “called” Yohimbine
54 Skeletons for Yohimbine
Pharma Information Tombs
• Internal and external content
• Built to meet primary use-case
• Tailored indexes and GUIs
• Internal unique language & metadata
• Poor interoperability/integration
• Powerpoint, Documents, Excel
• Many suppliers of systems and content in a single workflow
Literature, Patents, News, Pipeline, SAR, CSRs, Safety, In vivo, etc.
What could create change?
Harvard Business Review (2010)
“One change would make a substantial difference [to drug R&D]: the creation of agreed-upon standards for digitally representing drug assets.”
It is so difficult to navigate…
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known Pathways?
• Working On Now?
• Connections to disease?
• Expressed in right cell type?
• Competitors?
• IP?
Open PHACTS Project
• Develop a set of robust standards…
• Implement the standards in a semantic integration hub
• Deliver services to support drug discovery programs in pharma and public domain
• 22 partners, 8 pharmaceutical companies, 3 biotechs
• 36-month project
Guiding principle is open access, open usage, open source – key to standards adoption
ChemSpider Resources for Chemistry
Internet Data
The Future
Commercial Software, Pre-competitive Data
Open Science, Open Data, Publishers, Educators
Open Databases, Chemical Vendors
Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals
The Future of Chemistry on the Web?
• Public compound databases federate & build a linked environment of validated data!
• Data validation needs are not ignored
• Publishers layer on information to make publications discoverable
• Public-Private databases can be linked
• Open Data proliferate
• The “Semantic Web” in action
Acknowledgments
The ChemSpider team
Our data providers, depositors, collaborators and curators
Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
Sean Ekins @collabchem
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams