orcid-0000-0002-2668-4821 -
Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures
A Pragmatic Vision
“Build a Structure Centric Community to Serve Chemists”
Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data
ChemSpider Today
Over 25 million unique compounds
Sourced from over 300 data sources
Growing daily – new compounds, annotations, data
Structures, text, spectra, images, movies, syntheses
Text searching the web is far from optimal
Structure searching the web is not a dream
The quality of data on the web is a problem
An example…
Keep Your Plants Healthy-Looking
Which is better for Plants? Vodka, Sprite or Viagra?
It Works – Viagra Wins the Day
Now Which is Better?
Viagra or Cialis?
Images sourced from Wikipedia
Cialis
I want…
The structure
Any patent information
Related publications
Where can I buy it?
Metabolic pathway info
What else is easy to find…
Cialis on Google?
What is Cialis?
What is Cialis? Can we trust Wikipedia?
What is Cialis?
6 hits on PubChem
What is Cialis?
Search by Trade Name
Search by CAS Number (from Wikipedia)
Are there other names???
PubMed hits: 736 Tadalafil, 744 Cialis
Are there other names???
IC351 on PubChem?
5 HITS for IC351
ZERO HITS for IC 351
Text Searching the Web
Text searching the web for chemical compounds is an enormous challenge
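The IC351 versus IC 351 result above shows how a single space can hide every hit. A minimal sketch of one mitigation follows, assuming that collapsing case, whitespace and hyphens is acceptable for trade names and development codes (it would be too aggressive for systematic names). The `name_key` and `lookup` helpers and the small record list are hypothetical, not part of any search engine mentioned in the slides.

```python
import re

def name_key(name: str) -> str:
    """Collapse case, whitespace and hyphens so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", name).casefold()

# Hypothetical mini-index: normalized key -> spellings seen in indexed records
records = ["IC351", "IC-351", "Cialis", "Tadalafil"]
index: dict[str, list[str]] = {}
for spelling in records:
    index.setdefault(name_key(spelling), []).append(spelling)

def lookup(query: str) -> list[str]:
    """Return every indexed spelling that shares the query's normalized key."""
    return index.get(name_key(query), [])

print(lookup("IC 351"))  # finds the IC351 and IC-351 spellings
```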
RSC has multiple databases, >500,000 articles and a lot of other resources. How are we doing?
The RSC Publishing Platform (Beta)
2+2 = 4 Articles?
CAS Number Search
Text Searching the Web
Text searching RSC Publishing for chemical compounds to retrieve ALL hits is a challenge
Dictionaries of name-structure relationships could be very enabling. Creating validated dictionaries is also an enormous challenge
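As a rough illustration of what a name-structure dictionary provides, the sketch below maps each name to a structure identifier and back again, so any one name can be expanded to all of its synonyms. The names are those used in the slides; the identifier `STRUCT-0001` is a placeholder rather than a real ChemSpider ID or InChIKey, and the `synonyms` helper is hypothetical.

```python
from collections import defaultdict

# Name-structure pairs. The structure identifier is a placeholder.
PAIRS = [
    ("Cialis", "STRUCT-0001"),
    ("Tadalafil", "STRUCT-0001"),
    ("IC351", "STRUCT-0001"),
    ("171596-29-5", "STRUCT-0001"),
]

# Forward map (case-insensitive name -> structure) and reverse map
name_to_structure = {name.casefold(): sid for name, sid in PAIRS}
structure_to_names = defaultdict(list)
for name, sid in PAIRS:
    structure_to_names[sid].append(name)

def synonyms(name: str) -> list[str]:
    """All names recorded for the structure that `name` points to."""
    sid = name_to_structure.get(name.casefold())
    return list(structure_to_names[sid]) if sid else []

print(synonyms("cialis"))
```

The value of such a table depends entirely on the pairs being validated; a wrong pair propagates to every search that uses it.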
Search ChemSpider for Cialis
Cialis on ChemSpider: 1 hit
Chemicals are curated/validated on ChemSpider by ourselves and the community
Based on assertions from various sources. Iterative, time-consuming and exacting!
We believe we know the structure now
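One way to picture curation from assertions is a simple tally: count how many sources assert each structure for a name and flag any disagreement for a human curator. This is only a sketch of the idea, not ChemSpider's actual curation workflow; the source labels, structure identifiers and the `review_assertions` function are all hypothetical.

```python
from collections import Counter

def review_assertions(assertions):
    """assertions: non-empty list of (source, structure_id) pairs for one name.

    Returns (most frequently asserted structure, needs_review), where
    needs_review is True when the sources disagree.
    """
    if not assertions:
        raise ValueError("no assertions to review")
    counts = Counter(structure_id for _source, structure_id in assertions)
    top_structure, _count = counts.most_common(1)[0]
    return top_structure, len(counts) > 1

# Hypothetical assertions for a single name
claims = [("source_a", "S1"), ("source_b", "S1"), ("source_c", "S2")]
print(review_assertions(claims))
```

A majority vote only proposes a candidate; as the slides stress, the final decision remains iterative and manual.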
Cialis – Searching the Web by InChI
Search Molecular SKELETON
Search Full Molecule
InChI Search the Web by Skeleton
78 Hits by Skeleton
InChI Search the Web Exact Match
32 Hits by InChIKey
InChI Search the Web Exact Match
6 Hits by Standard InChIKey
InChifying the Web
Different versions of InChI lead to complex search results
There are more than 2X as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?
Our judgment, based on the following experience: MISTAKES
Vancomycin – Search the Internet
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
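The skeleton and full-molecule searches above map onto the layout of a 27-character InChIKey: the first 14-letter block hashes the connectivity, the second block hashes the remaining layers (such as stereochemistry and isotopes) and ends with a standard/non-standard flag and a version letter, and a final letter records protonation. The sketch below splits keys along those lines. The keys are made-up placeholders, not real compounds, and the helper names are hypothetical.

```python
import re

# 14 letters, hyphen, 8 letters + flag + version, hyphen, 1 protonation letter
INCHIKEY = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")

def split_inchikey(key):
    """Return (skeleton_block, is_standard) for a 27-character InChIKey."""
    m = INCHIKEY.fullmatch(key)
    if m is None:
        raise ValueError(f"not a 27-character InChIKey: {key!r}")
    skeleton, _rest, flag, _version, _protonation = m.groups()
    return skeleton, flag == "S"

def skeleton_hits(query, indexed):
    """Keys sharing the query's connectivity block (skeleton search)."""
    target, _ = split_inchikey(query)
    return [k for k in indexed if split_inchikey(k)[0] == target]

def exact_hits(query, indexed):
    """Keys identical to the query (full-molecule search)."""
    return [k for k in indexed if k == query]

# Placeholder keys for illustration only
QUERY = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
INDEXED = [
    QUERY,                          # same molecule, standard key
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, different second block
    "AAAAAAAAAAAAAA-BBBBBBBBNA-N",  # same skeleton, non-standard flag
    "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N",  # different skeleton
]
print(len(skeleton_hits(QUERY, INDEXED)), len(exact_hits(QUERY, INDEXED)))
```

Because standard and non-standard keys for the same molecule differ as strings, an exact-match search can miss pages that a skeleton search finds. That is one plausible contributor to the gap between the hit counts above, alongside genuine stereochemistry differences and outright mistakes.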
ChemSpider – Patents Linked
SURECHEM PATENTS GOOGLE
Google Patents
Google Books
Microsoft Academic Search
Google Scholar – Found By CAS #
Identifiers for Tadalafil
Validated Registry Number
Same Result as Searching PubMed
How Many Articles in RSC Journals?
Based on 171596-29-5 there are 13 articles in RSC journals
What about if we VALIDATE identifiers?
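Validation in the sense of this talk means confirming that an identifier really belongs to the structure, which requires curation. A much smaller mechanical step can still catch typing and extraction errors first: a CAS Registry Number carries a check digit equal to the weighted sum of its other digits (weights 1, 2, 3, … counted from the right) modulo 10. The sketch below applies that rule; the function names are hypothetical.

```python
import re

def normalize_cas(text: str) -> str:
    """Remove stray whitespace, e.g. '171596-29 -5' becomes '171596-29-5'."""
    return re.sub(r"\s+", "", text)

def is_valid_cas(text: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    cas = normalize_cas(text)
    m = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if m is None:
        return False
    digits = m.group(1) + m.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))

print(is_valid_cas("171596-29-5"))
```

A passing check digit only shows the number is well formed; it says nothing about whether it is attached to the right compound.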
How Many Articles in RSC Journals?
RSC Journals
REMEMBER 2+2 = 4
RSC Books
PubMed
Google Books – Expanded Hit Set
Google Scholar – Expanded Hit Set
Microsoft Academic Search
More mussels than drugs…
RSC Databases
media.obsessable.com
As few interfaces as possible
Did we solve this problem now?
What Do We Know?
Validated Name-Structure Dictionaries enable “structure-searching” the web.
Search the structure on ChemSpider, where we have integrated many online services:
NCBI Entrez PubMed
Google Scholar, Books, Patents
Microsoft Academic Search
SureChem Patents
…
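The expanded hit sets shown earlier can be thought of as a union: run the same text search once per validated identifier and merge the results, so that hits found under different names (the "2+2 = 4" situation) end up in one list. A minimal sketch follows, assuming a generic `search` callable; the document IDs and the in-memory index are invented for illustration and stand in for any of the services listed above.

```python
def expanded_hits(names, search):
    """Union of hits over all names.

    `search` is any callable mapping a name to an iterable of document IDs.
    """
    hits = set()
    for name in names:
        hits.update(search(name))
    return hits

# Hypothetical index: each name alone finds only part of the literature
FAKE_INDEX = {
    "Cialis": {"doc-1", "doc-2"},
    "Tadalafil": {"doc-3", "doc-4"},
}

def fake_search(name):
    return FAKE_INDEX.get(name, set())

print(sorted(expanded_hits(["Cialis", "Tadalafil", "IC351"], fake_search)))
```

Using a set means a document found under two names is counted once, which matters when comparing hit counts across services.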
Semantic Markup: Project Prospect
Prospected Compound Deposition
Success Depends on Dictionaries
Link to a Structure or the Right Structure?
Name-Structure Pairs
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
ChemSpider SyntheticPages
Other RSC Resources…
Once we have validated name-structure dictionaries we can tap other RSC resources
There is ALWAYS a validation stage
Ultimately crowdsourced curation is necessary
Roses’ Crystal Image Collection
MP3s and Videos : Titanium
Beautiful Elements
Periodic Table Images
Other system enhancements?
What ChemSpider doesn’t deal with yet...
Markush structures and other “non-defineds”
Materials
Minerals
Polymers
Biological macromolecules
Leaving Markush to Patent Indexers
What’s Next?
Continue the curation effort and keep cleaning
Enhanced integration with RSC publishing workflows and databases
Tighter integration to RSC databases
Natural Product Updates
Methods of Organic Synthesis
Use ChemSpider dictionaries to enhance markup precision and recall
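For reference, precision is the fraction of marked-up terms that are correct, and recall is the fraction of terms that should have been marked up that actually were. A small sketch of how the two could be measured against a hand-checked set follows; the function name and the example term sets are hypothetical.

```python
def precision_recall(marked, truth):
    """Precision and recall of a set of marked-up terms against a checked set."""
    marked, truth = set(marked), set(truth)
    true_positives = len(marked & truth)
    precision = true_positives / len(marked) if marked else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: four terms marked up, three terms truly present
marked_terms = {"a", "b", "c", "d"}
checked_terms = {"a", "b", "e"}
print(precision_recall(marked_terms, checked_terms))
```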
What’s Next?
Use entity extraction approaches and ChemSpider dictionaries to analyze the entire RSC archive
Deposit structures into ChemSpider from the backfile
Use crowdsourced curation approaches to optimize the results
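A dictionary-driven pass over article text could look roughly like the sketch below: compile the dictionary's names into one pattern, scan the text, and report each match with its structure identifier. This is a deliberately naive illustration, not the entity extraction software RSC uses; the structure identifier is a placeholder and the `build_matcher` and `extract` helpers are hypothetical.

```python
import re

def build_matcher(names):
    """Compile one case-insensitive pattern matching any name as a whole token."""
    # Longest names first so a longer name wins over a shorter one it contains
    ordered = sorted(names, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in ordered)
    return re.compile(rf"(?<!\w)(?:{pattern})(?!\w)", re.IGNORECASE)

def extract(text, dictionary):
    """Return (matched text, structure id) pairs in order of appearance."""
    matcher = build_matcher(dictionary)
    lookup = {name.casefold(): sid for name, sid in dictionary.items()}
    return [(m.group(0), lookup[m.group(0).casefold()])
            for m in matcher.finditer(text)]

# Placeholder structure identifier; names are those used in the slides
DICTIONARY = {
    "Cialis": "STRUCT-0001",
    "Tadalafil": "STRUCT-0001",
    "IC351": "STRUCT-0001",
}
sentence = "Tadalafil (Cialis, formerly IC351) was studied."
print(extract(sentence, DICTIONARY))
```

The extracted pairs are candidates only; they are what the crowdsourced curation step would then confirm or correct.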
The InChI “Resolver”
InChI Resolver to DOIs
Structure Search the Web
Most Chemistry is NOT Published
Only a fraction of chemistry is published
Only a tiny fraction of chemistry is patented
What of the “Lost Chemistry” - never published and cannot be abstracted
Reactions performed
Structures made and studied
Spectra acquired and then disposed of
ChemSpider can give it all a home…
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data
Classical business models will have to morph
Anyone from Penn State here?
Please see me afterwards…
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams