Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Antony Williams, Bio-IT World 2009

description

A growing number of free and open access resources are available to scientists on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and in the tools to manipulate these data. However, freedom costs, and in many cases the cost is quality. ChemSpider is a free access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources, at present over 21.5 million unique chemical entities from over 150 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the issue of quality in many chemistry-related databases, approaches to cleaning up the data, and how a curated platform can become the centralized hub for sourcing information about chemical entities. This includes experimental and predicted properties, analytical data, publications, suppliers and integrated databases. I will detail three efforts: 1) the curation of chemistry on Wikipedia; 2) an examination of structure integrity on the FDA DailyMed website, a website of medication content and labeling as found in medication package inserts; 3) recognizing chemical names in documents and providing a platform for structure-based searching of Open Access chemistry literature.

Transcript of Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Page 1: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Antony Williams
Bio-IT World 2009

Page 2: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Building a Structure Centric Community for Chemists

Linked Data Cloud

Page 3: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Chemistry on the Internet

Much of the information online is User Beware!

The quality of information is “diverse”

Technologies can “link and connect” information, but validation and curation are key to providing quality

The LinkedData web is of less value when the data linked are “wrong”

Page 4: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Quality Costs

Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information

101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences

But online…

Page 5: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

What is “wrong”?

Page 6: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

A platform for:
Data deposition, curation and annotation
Supporting Open Notebook Science efforts
Chemistry document mark-up with ChemMantis
The Open Access ChemSpider Journal of Chemistry

Page 7: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Search Cholesterol

Page 8: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Search Cholesterol

Page 9: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Search Cholesterol

Page 10: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Search Cholesterol

Page 11: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Search Cholesterol

Page 12: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Search Cholesterol

Page 13: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Complex Data and Information

Page 14: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Online Data

Many websites host structure-based information

Question quality!!!

Page 15: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry


Page 16: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Page 17: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Does one stereocenter matter?

Page 18: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Vancomycin

Who will curate?

PubChem is not resourced to clean these errors

How would you clean such a large dataset?

Page 19: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Vancomycin

ChemSpider: 1 compound – 3 days

Page 20: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Question Everything

www.dhmo.org

Page 21: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

DailyMed

“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts).”

Page 22: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

The FDA’s DailyMed

Page 23: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Structures on DailyMed

Poor Representations

Page 24: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Structures on DailyMed

Lack of Stereochemistry

Page 25: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Incorrect Structures

Scanning (?) Issues

Page 26: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Incorrect Structures

Page 27: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Does it Matter?

Does it matter to the consumer that the structures are wrong? No…what matters is that what is in the bottle is the right medication!

To make DailyMed structure searchable it DOES matter

To data mine DailyMed it matters

To mark up DailyMed it matters

Page 28: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Collaborative Knowledge Management for Chemists

Page 29: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Wikipedia Links to DrugBank

Page 30: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Taxol on PubChem

Page 31: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Taxol on DailyMed

Page 32: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

The InChI Identifier

Page 33: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Multiple Layers

Source: Unofficial InChI FAQ page
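The layered layout can be illustrated with a few lines of Python. This is a minimal sketch, not taken from the slides: the function name `split_inchi_layers` is hypothetical, and the ethanol InChI is used only as a small, well-known example. It assumes the usual layout in which layers are separated by `/`, the first segment is the version, the second is the molecular formula, and every later layer begins with a one-letter prefix.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers.

    Simplifying assumptions: a formula layer is present, and no layer
    prefix occurs twice (repeated prefixes would overwrite each other).
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        # Each remaining layer starts with a one-letter prefix,
        # e.g. 'c' (connections) or 'h' (hydrogens).
        layers[part[0]] = part[1:]
    return layers


# Ethanol, used here only as an illustration.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi_layers(ethanol)
for name, value in layers.items():
    print(f"{name:8s} {value}")
```

Larger molecules such as Taxol carry additional layers (for example stereochemistry), which is where structures from different sources tend to diverge.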

Page 34: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

InChIStrings Hash to InChIKeys

Page 35: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

InChIs for Taxol

Page 36: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Back to Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD

ChEBI: RCINICONZNJXQF-GXKQXQCDDN

Wikipedia: RCINICONZNJXQF-MZXODVADBJ

Which one is correct???
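One thing that can be read directly off the three keys is which part they share. The sketch below is an illustration only and assumes the two-block, hyphen-separated layout shown on the slide, where the first block is a hash of the connectivity (formula, connections, hydrogens) and the second block covers the remaining layers plus trailing flag and check characters. The helper name `split_key` is hypothetical.

```python
from collections import defaultdict

# Keys exactly as listed on the slide.
keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key: str) -> tuple:
    """Return (first block, second block) of a hyphenated InChIKey."""
    first, _, rest = key.partition("-")
    return first, rest


# Group the sources by their first (connectivity) block.
by_skeleton = defaultdict(list)
for source, key in keys.items():
    first, rest = split_key(key)
    by_skeleton[first].append((source, rest))

shared_skeleton = len(by_skeleton) == 1
second_blocks = {rest for entries in by_skeleton.values() for _, rest in entries}

print("same connectivity block:", shared_skeleton)
print("distinct second blocks:", len(second_blocks))
```

A shared first block says the sources agree on the skeleton. A differing second block only says that some other layer differs; it does not by itself identify which layer, or how many stereocenters are involved, so the full InChI strings still have to be compared.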

Page 37: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

InChIKeys for Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD

ChEBI: RCINICONZNJXQF-GXKQXQCDDN

Wikipedia: RCINICONZNJXQF-MZXODVADBJ

ChEBI and Wikipedia are the SAME structure

DrugBank is a DIFFERENT structure – ONE stereocenter

Page 38: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

The InChI Resolver

Page 39: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry


Page 40: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Coming Soon…Linked Articles

Page 41: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

How bad can it get???

And who is right????

Page 42: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemMantis

Chemical Markup And Nomenclature Transformation Integrated System – ChemMantis

A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…

Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry

Page 43: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemMantis Markup

Page 44: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Enable Electronic Articles…

Structures are the language of chemistry

Show structures to chemists and search/link from there…

Page 45: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Species Markup

Page 46: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Dictionaries are Easily Enhanced

Copy-Paste into appropriate Entity Dictionary

Impacts all future markups

Expanding knowledgebases of information

Linked out to rich sources of information
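The idea that a new dictionary entry affects all later markups can be sketched with a small dictionary-based tagger. This is not ChemMantis's implementation, which the slides do not describe. The function names, the `[term|identifier]` output format and the `chem:` identifiers are all hypothetical and chosen only for illustration.

```python
import re


def build_pattern(terms):
    """Compile one regex for all terms, longest first so longer names win."""
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def mark_up(text, dictionary):
    """Wrap every dictionary term found in text as [term|identifier]."""
    if not dictionary:
        return text
    lookup = {term.lower(): ident for term, ident in dictionary.items()}
    pattern = build_pattern(lookup)

    def tag(match):
        found = match.group(0)
        return f"[{found}|{lookup[found.lower()]}]"

    return pattern.sub(tag, text)


# Hypothetical entity dictionary with made-up identifiers.
chemicals = {"cholesterol": "chem:0001"}
text = "Taxol and cholesterol were mentioned."

before = mark_up(text, chemicals)

# Adding one entry to the dictionary changes every later markup run.
chemicals["Taxol"] = "chem:0002"
after = mark_up(text, chemicals)

print(before)
print(after)
```

Because the pattern is rebuilt from the dictionary on each call, a pasted-in term is picked up by the next markup without any other change. A real system would also need to handle systematic names, synonyms and ambiguity, which a plain lookup does not.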

Page 47: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Build Dictionaries

Ontologies Next

Page 48: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Outlinks…

Page 49: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Publishers and Document Mark-Up

Page 50: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

Linked from Wikipedia

Linked from Open Notebook Science sites using EMBED

Linked from Blogs using Structure/Spectra EMBED

Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets

Integrated to software offerings from Thermo, Waters, Agilent, Bruker

Page 51: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

Embed Functionality (like YouTube)

Page 52: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

www.spectralgame.com

Page 53: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

Crowdsourced Curation of Spectra

Page 54: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

RSC Compounds

Page 55: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

Nature Chemistry

Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text.

Those compounds are linked out to other information resources including PubChem and ChemSpider.

Page 56: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

ChemSpider Everywhere

ChemMobi

Page 57: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Structure RSS Feeds with InChIs

Page 58: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry


Page 59: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Acknowledgments

Richard Kidd, Royal Society of Chemistry
Jason Wilde, Nature Publishing Group
Martin Walker and the Wikipedia Chemistry team
Microsoft – Rudy Potenzone
Symyx – Keith Taylor and James Jack
SureChem – Nicko Goncharoff
Spectral game - Andrew Lang and Jean-Claude Bradley
“The InChI team and Advisory Group”

Page 60: Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Conclusions

www.chemspider.com

www.chemspider.com/journal

InChIs and Internet Chemistry

http://inchis.chemspider.com