Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures

description

The ability to query across a chemistry publisher’s content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate its ChemSpider community resource with our published content and databases. These include: 1) entity extraction procedures; 2) chemical name conversion procedures using software algorithms and curated dictionaries; 3) semantic markup; and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have used to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend these approaches to mining data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data and to further enrich the ChemSpider database for the benefit of the chemistry community.

Transcript

Page 1: Enhancing Discoverability Across Royal Society Of Chemistry Content By Integrating To Chem Spider, An Online Database Of Chemical Structures

Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures

Page 2

A Pragmatic Vision

“Build a Structure Centric Community to Serve Chemists”

Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data

Page 3

ChemSpider Today

Over 25 million unique compounds
Sourced from over 300 data sources
Growing daily – new compounds, annotations, data
Structures, text, spectra, images, movies, syntheses

Text searching the web is far from optimal
Structure searching the web is not a dream
The quality of data on the web is a problem
An example…

Page 4

Keep Your Plants Healthy-Looking

Page 5

Which is better for Plants? Vodka, Sprite or Viagra?

Page 6

It Works – Viagra Wins the Day

Page 7

Now Which is Better?

Viagra or Cialis?

Images sourced from Wikipedia

Page 8

Cialis

I want…
The structure
Any patent information
Related publications
Where can I buy it?
Metabolic pathway info
What else is easy to find…

Page 9

Cialis on Google?

Page 10

What is Cialis?

Page 11

What is Cialis? Can we trust Wikipedia?

Page 12

What is Cialis?

6 hits on PubChem

Page 13

What is Cialis?

Page 14

Search by Trade Name

Page 15

Search by CAS Number (from Wikipedia)

Page 16

Are there other names???

Page 17

Are there other names???

PubMed hits: 736 for Tadalafil, 744 for Cialis

Page 18

Are there other names???

Page 19

Are there other names?

Page 20

Are There Other Names?

Page 21

IC351 on PubChem?

5 HITS for IC351

ZERO HITS for IC 351
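
The gap between "IC351" and "IC 351" is the kind of mismatch a normalization step can absorb before a name is looked up. A minimal sketch in Python, assuming a small hand-made synonym table; `normalize_name`, `SYNONYMS` and `lookup` are hypothetical names for illustration and are not part of PubChem or ChemSpider:

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase a name and drop spaces and hyphens, so that
    'IC351', 'IC 351' and 'ic-351' all collapse to the same key."""
    return re.sub(r"[\s\-]+", "", name).lower()

# Hypothetical synonym table: each name from the slides maps to one record label.
SYNONYMS = {normalize_name(n): "tadalafil" for n in ("Cialis", "Tadalafil", "IC351")}

def lookup(query: str):
    """Return the record label for a query, or None if the name is unknown."""
    return SYNONYMS.get(normalize_name(query))
```

Dropping separators suits trade names and development codes like these; it would likely be too aggressive for systematic names, where hyphens and spaces carry meaning.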

Page 22

Text Searching the Web

Text searching the web for chemical compounds is an enormous challenge

RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?

Page 23

The RSC Publishing Platform (Beta)

Page 24

2+2 = 4 Articles?

Page 25

CAS Number Search

Page 26

Text Searching the Web

Text searching RSC Publishing for chemical compounds to retrieve ALL hits is a challenge

Dictionaries of name-structure relationships could be very enabling. Creating validated dictionaries is also an enormous challenge.

Page 27

Search ChemSpider for Cialis

Page 28

Cialis on ChemSpider: 1 hit

Chemicals are curated/validated on ChemSpider by ourselves and the community

Based on assertions from various sources. Iterative, time-consuming and exacting!

We believe we know the structure now

Page 29

Cialis – Searching the Web by InChI

Search Molecular SKELETON

Search Full Molecule

Page 30

InChI Search the Web by Skeleton

78 Hits by Skeleton

Page 31

InChI Search the Web Exact Match

32 Hits by InChIKey

Page 32

InChI Search the Web Exact Match

6 Hits by Standard InChIKey

Page 33

InChIfying the Web

Different versions of InChI lead to complex search results

There are more than twice as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?

Our judgment, based on the following experience: MISTAKES
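
One way to see how many distinct full structures sit behind a single skeleton is to group hits by the first block of the InChIKey. The sketch below assumes the skeleton search on these slides corresponds to matching that first 14-character block, which hashes the connectivity of the structure; the helper names are hypothetical and the keys are made-up placeholders, not real compounds:

```python
from collections import defaultdict

def skeleton(inchikey: str) -> str:
    """Return the first block of an InChIKey (14 characters),
    which encodes the connectivity of the structure."""
    first_block = inchikey.split("-")[0]
    if len(first_block) != 14:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return first_block

def group_by_skeleton(inchikeys):
    """Group full InChIKeys under their shared skeleton block."""
    groups = defaultdict(set)
    for key in inchikeys:
        groups[skeleton(key)].add(key)
    return dict(groups)

# Made-up placeholder keys for illustration only.
hits = [
    "ABCDEFGHIJKLMN-OPQRSTUVSA-N",
    "ABCDEFGHIJKLMN-ZYXWVUTSSA-N",
    "NMLKJIHGFEDCBA-OPQRSTUVSA-N",
]
groups = group_by_skeleton(hits)
```

A skeleton that maps to several full keys is a candidate for review: the variants may be legitimate stereoisomers, or they may be the mistakes the slide refers to.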

Page 34

Vancomycin – Search the Internet

Page 35

Full Molecule Search: 4 Hits

Page 36

Full Skeleton Search: 104 Hits

Page 37

ChemSpider – Patents Linked

SURECHEM PATENTS GOOGLE

Page 38

Google Patents

Page 39

Google Books

Page 40

Microsoft Academic Search

Page 41

Google Scholar – Found By CAS #

Page 42

Identifiers for Tadalafil

Page 43

Validated Registry Number

Same Result as Searching PubMed

Page 44

How Many Articles in RSC Journals?

Based on 171596-29-5 there are 13 articles in RSC journals

What about if we VALIDATE identifiers?
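
A first, purely mechanical layer of validation for registry numbers is the CAS check digit. The sketch below is an illustration rather than the procedure used at RSC: it strips stray whitespace (a number can pick up a space in transit, as in `171596-29 -5`) and then verifies the check digit; `normalize_cas` and `is_valid_cas` are hypothetical helper names:

```python
import re

def normalize_cas(raw: str) -> str:
    """Remove whitespace that creeps into CAS numbers, e.g. '171596-29 -5'."""
    return re.sub(r"\s+", "", raw)

def is_valid_cas(raw: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", normalize_cas(raw))
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... from right to left; the sum modulo 10
    # must equal the final check digit.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check
```

A passing check digit only rules out transcription errors. It does not confirm that the number belongs to the intended compound, which is where curated dictionaries come in.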

Page 45

How Many Articles in RSC Journals?

Page 46

How Many Articles in RSC Journals?

Page 47

RSC Journals

Page 48

RSC Journals

REMEMBER 2+2 = 4

Page 49

RSC Books

Page 50

PubMed

Page 51

Google Books – Expanded Hit Set

Page 52

Google Scholar – Expanded Hit Set

Page 53

Microsoft Academic Search

Page 54

Microsoft Academic Search

More mussels than drugs…

Page 55

RSC Databases

Page 56

media.obsessable.com

As few interfaces as possible

Did we solve this problem now?

Page 57

What Do We Know?

Validated Name-Structure Dictionaries enable “structure-searching” the web.

Search the structure on ChemSpider and we have integrated many services online:
NCBI Entrez
PubMed
Google Scholar, Books, Patents
Microsoft Academic Search
SureChem Patents
…

Page 58

Semantic Markup: Project Prospect

Page 59

Prospected Compound Deposition

Page 60

Success Depends on Dictionaries

Link to a Structure or the Right Structure?

Page 61

Name-Structure Pairs

Page 62

Semantic Linking of Structures

What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Page 63

ChemSpider SyntheticPages

Page 64

Other RSC Resources…

Once we have validated name-structure dictionaries we can tap other RSC resources

There is ALWAYS a validation stage

Ultimately crowdsourced curation is necessary

Page 65

Roses’ Crystal Image Collection

Page 66

MP3s and Videos : Titanium

Page 67

Beautiful Elements

Page 68

Periodic Table Images

Page 69

Other system enhancements?

What ChemSpider doesn’t deal with yet...

Markush structures and other “non-defineds”
Materials
Minerals
Polymers
Biological macromolecules

Page 70

Leaving Markush to Patent Indexers

Page 71

What’s Next?

Continue the curation effort and keep cleaning

Enhanced integration with RSC publishing workflows and databases

Tighter integration to RSC databases: Natural Product Updates, Methods of Organic Synthesis

Use ChemSpider dictionaries to enhance markup precision and recall

Page 72

What’s Next?

Use entity extraction approaches and ChemSpider dictionaries to analyze the entire RSC archive

Deposit structures into ChemSpider from the backfile

Use crowdsourced curation approaches to optimize the results
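
The three steps on this slide could be wired together roughly as follows. This is a sketch under stated assumptions, not the RSC pipeline: the candidate names are assumed to come from some upstream entity extractor, and `DICTIONARY` and `resolve_candidates` are hypothetical stand-ins for a validated name-structure dictionary and the resolution step:

```python
# Hypothetical validated dictionary: lowercased name -> record label.
DICTIONARY = {
    "tadalafil": "tadalafil",
    "cialis": "tadalafil",
    "vancomycin": "vancomycin",
}

def resolve_candidates(candidates, dictionary):
    """Split extracted names into dictionary hits (ready for deposition)
    and a queue of unresolved names for human curation."""
    resolved, curation_queue = {}, []
    for name in candidates:
        record = dictionary.get(name.strip().lower())
        if record is None:
            curation_queue.append(name)
        else:
            resolved[name] = record
    return resolved, curation_queue

resolved, queue = resolve_candidates(["Cialis", "IC351", "Vancomycin"], DICTIONARY)
```

Here "IC351" lands in the curation queue because the toy dictionary does not list it; once a curator confirms the synonym, adding it to the dictionary lets later passes over the backfile resolve it automatically.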

Page 73

The InChI “Resolver”

Page 74

InChI Resolver to DOIs

Structure Search the Web

Page 75

Most Chemistry is NOT Published

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry” - never published and cannot be abstracted?
Reactions performed
Structures made and studied
Spectra acquired and then disposed of

ChemSpider can give it all a home…

Page 76

Chemistry on the Internet

FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data

Classical business models will have to morph

Page 77

Anyone from Penn State here?

Please see me afterwards…

Page 78

Thank you

[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams