The importance of the InChI identifier as a foundation technology for eScience platforms
![Page 1: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/1.jpg)
The Importance of the InChI Identifier as a Foundation Technology for eScience Platforms at RSC
Antony Williams
Bio-IT,
Boston, April 27th 2014
![Page 2: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/2.jpg)
Without the InChI…
• ChemSpider is unlikely to have been built
• It would not have grown into one of the domain's primary online chemistry resources
• The Royal Society of Chemistry would not have it as an online database, would not have a large cheminformatics team and would not be involved in a number of large scale funded projects around chemistry data
![Page 3: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/3.jpg)
• ~30 million chemicals and growing
• Data sourced from >500 different sources
• Crowd sourced curation and annotation
• Ongoing deposition of data from our journals and our collaborators
• Structure centric hub for web-searching
• …and a really big dictionary!!!
![Page 4: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/4.jpg)
ChemSpider
![Page 5: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/5.jpg)
ChemSpider
![Page 6: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/6.jpg)
Experimental/Predicted Properties
![Page 7: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/7.jpg)
Literature references
![Page 8: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/8.jpg)
Patent references
![Page 9: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/9.jpg)
So what is Yohimbine?
![Page 10: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/10.jpg)
Of course it is out there…
Drugbox: 3001/5080 with InChIs
Chembox: 5436/7690 with InChIs
![Page 11: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/11.jpg)
Tell me more…
• Where can I find the molfile for Yohimbine?
• Papers/Patents about Yohimbine?
• What are the side effects of Yohimbine?
• Where can I order Yohimbine?
• What are the physicochemical properties?
• Metabolic pathways?
• Different synonyms of Yohimbine?
• Synthesis of Yohimbine?
• Side effects of Yohimbine?
• Etc….
![Page 12: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/12.jpg)
Quantity!
![Page 13: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/13.jpg)
Yohimbine on ChemSpider
![Page 14: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/14.jpg)
Downsides of Overall Approach
• Meshing data together based on InChIs worked for simple molecules
• 2D layout errors inherited or limited by algorithm
• Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
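The stereocenter problem above can be made concrete with a small sketch. It assumes standard InChI strings, where stereochemistry sits in the `/b`, `/t`, `/m` and `/s` layers, and it uses two illustrative strings believed to correspond to L- and D-alanine. The helper names (`strip_stereo`, `same_skeleton`) are hypothetical and are not part of any InChI library or of ChemSpider.

```python
# Layer prefixes that carry stereochemistry in a standard InChI:
# b = double-bond stereo, t/m/s = tetrahedral stereo and stereo type.
STEREO_LAYERS = ("b", "t", "m", "s")


def split_layers(inchi):
    """Split an InChI string into its slash-separated segments."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    return inchi[len("InChI="):].split("/")


def strip_stereo(inchi):
    """Return the InChI with every stereochemistry layer removed."""
    version, formula, *layers = split_layers(inchi)
    kept = [layer for layer in layers if layer[:1] not in STEREO_LAYERS]
    return "InChI=" + "/".join([version, formula, *kept])


def same_skeleton(a, b):
    """True if two InChIs agree once stereo layers are ignored."""
    return strip_stereo(a) == strip_stereo(b)


# Illustrative strings (assumed to be L- and D-alanine).
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
```

Exact string matching treats `L_ALA` and `D_ALA` as different compounds, which is correct chemically but is also why two depositions of "the same" natural product with one mis-drawn stereocenter are not merged. A comparison such as `same_skeleton` could flag those pairs for curation rather than merge them automatically.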
![Page 15: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/15.jpg)
Yohimbine on ChemSpider… Quality?
![Page 16: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/16.jpg)
So where can we travel???
![Page 17: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/17.jpg)
So where can we travel???
![Page 18: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/18.jpg)
![Page 19: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/19.jpg)
InChI String Search via Google
Give me InChIKeys…
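InChIKeys suit web search because they are fixed-length, 27-character strings made only of uppercase letters and hyphens. A minimal sketch of spotting them in free text follows; the regular expression reflects the 14-10-1 block layout of the key, and the function names are illustrative assumptions rather than an existing API.

```python
import re

# InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def find_inchikeys(text):
    """Return every InChIKey-shaped token found in the text."""
    return INCHIKEY_RE.findall(text)


def connectivity_block(inchikey):
    """Return the first block, which hashes the molecular skeleton."""
    return inchikey.split("-")[0]
```

For example, `find_inchikeys("Aspirin (BSYNRYMUTXBXSQ-UHFFFAOYSA-N)")` picks out the aspirin key. Searching on the first block alone is one way to retrieve pages about stereoisomers or isotopologues of the same skeleton.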
![Page 20: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/20.jpg)
And where can we travel???
![Page 21: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/21.jpg)
ChemSpider
BRENDA
Wikipedia
ChEMBL
ChEBI
DrugBank
![Page 22: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/22.jpg)
Aggregator
Enzymes
Encyclopedia
Pharmacology
Curated Chemicals
Drug-Drug Target
![Page 23: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/23.jpg)
How do we build it?
• We deal in Molfiles or SDF files – with coordinates
• Deposit anything that has an InChI – we support what InChI can handle, good and bad
• Standardization based on “InChI standardization”
• InChIs aggregate (certain) tautomers
• We link out to external sites using their IDs
![Page 24: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/24.jpg)
Downsides of InChI
• InChI was a moving target (multiple versions) but overall worked as planned.
• Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”
• InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
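The "first in wins" behaviour described in the last bullet can be sketched as a dictionary keyed by InChI. This is a simplified illustration under stated assumptions, not ChemSpider's actual implementation; the class name, method name and record layout are hypothetical.

```python
class StructureRegistry:
    """Toy deduplicating registry keyed by InChI string."""

    def __init__(self):
        self._by_inchi = {}

    def deposit(self, inchi, molfile, source):
        """Register a deposition and return the entry kept for this InChI.

        The first molfile seen for an InChI is retained as the structure;
        later depositions only add their source to the existing entry.
        """
        entry = self._by_inchi.setdefault(
            inchi, {"structure": molfile, "sources": []}
        )
        entry["sources"].append(source)
        return entry

    def __len__(self):
        return len(self._by_inchi)
```

The consequence is visible in the sketch: if the first deposition carries a poor 2D layout, every later and possibly better drawing of the same compound is discarded, so the order of deposition decides the depiction.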
![Page 25: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/25.jpg)
Side Effects of InChI Usage
![Page 26: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/26.jpg)
SMILES by comparison…
![Page 27: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/27.jpg)
Side Effects of InChI Usage
![Page 28: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/28.jpg)
Standardization Issues
Depiction based on molfile
![Page 29: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/29.jpg)
Standardize
Use the SRS as a guidance document for standardization
Adjust as necessary to our needs
![Page 30: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/30.jpg)
Nitro groups
![Page 31: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/31.jpg)
Salt and Ionic Bonds
![Page 32: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/32.jpg)
Ammonium salts
![Page 33: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/33.jpg)
CVSP
![Page 34: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/34.jpg)
NPC Browser Set
![Page 35: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/35.jpg)
Checking included InChIs
• Many SDF files contain InChIs and SMILES – comparing the structure contained within the file with the associated InChI is useful – turned up a number of errors in checking online databases
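A rough sketch of that consistency check follows. It reads the data items of one SDF record and compares the stored InChI with one regenerated from the molfile block. The InChI generator is passed in as the `compute_inchi` callable, which stands in for whatever cheminformatics toolkit is available; that callable, the function names and the default field name `InChI` are all assumptions for illustration.

```python
def sdf_data_fields(record):
    """Return {field name: value} for the data items of one SDF record."""
    fields = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        # Data headers look like "> <FIELD_NAME>".
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            value = []
            # The value runs until a blank line or the record delimiter.
            while i < len(lines) and lines[i].strip() and lines[i] != "$$$$":
                value.append(lines[i])
                i += 1
            fields[name] = "\n".join(value)
        else:
            i += 1
    return fields


def check_included_inchi(record, compute_inchi, field="InChI"):
    """Classify a record as 'match', 'mismatch' or 'missing'."""
    stored = sdf_data_fields(record).get(field)
    if stored is None:
        return "missing"
    # Everything up to and including "M  END" is the molfile block.
    molblock = record.split("M  END")[0] + "M  END"
    return "match" if compute_inchi(molblock) == stored else "mismatch"
```

A "mismatch" does not say which side is wrong: either the drawn structure or the stored identifier may be in error, so flagged records still need review.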
![Page 36: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/36.jpg)
So, I’m writing an article…
![Page 37: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/37.jpg)
With these…I will lose data
![Page 38: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/38.jpg)
But linking with InChI …
![Page 39: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/39.jpg)
Structure Searching the Web
![Page 40: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/40.jpg)
Data in Publications
• This is not new, you know the story…
• So much data of value is contained within a publication and delivered in a PDF form
• PDF files, and unclear licensing/copyright, limit access to data so I can rework, reuse, repurpose, text mine etc.
• “I specialize in XXXX. I want a database of YYYY extracted from publications and made available, for free, with the capabilities I need, and the publishers should just do it”
![Page 41: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/41.jpg)
“Data enable” publications?
• We would LOVE to bring data out of our archive
• What could we do?
  • Find chemical names and generate structures
  • Find chemical images and generate structures
  • Find reactions – and make a database!
  • Find data (MP, BP, LogP) and host. Build models!
  • Find figures and database them
  • Find spectra (and link to structures)
  • Validate the data algorithmically
![Page 42: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/42.jpg)
RSC Archive – since 1841
![Page 43: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/43.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
![Page 44: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/44.jpg)
Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
![Page 45: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/45.jpg)
But names = structures
• Systematic names can be generated FROM chemical structures algorithmically
![Page 46: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/46.jpg)
But names = structures
• …and structures from systematic names
![Page 47: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/47.jpg)
But what of trivial names?
• What about trivial names, trade names, CAS numbers, multilingual names etc.?
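Trivial names cannot be converted algorithmically, so they need a lookup dictionary that maps every known synonym to a structure identifier. A minimal sketch, assuming names are matched case-insensitively after whitespace normalisation; the class and its methods are hypothetical, and the aspirin synonyms are only a small illustrative sample.

```python
def normalise(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


class NameDictionary:
    """Toy synonym dictionary mapping names to an InChIKey."""

    def __init__(self):
        self._index = {}

    def add(self, inchikey, *names):
        """Register any number of synonyms for one structure."""
        for name in names:
            self._index[normalise(name)] = inchikey

    def lookup(self, name):
        """Return the InChIKey for a name, or None if it is unknown."""
        return self._index.get(normalise(name))


names = NameDictionary()
names.add(
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    "Aspirin",                # trivial name
    "Acetylsalicylic acid",   # common chemical name
    "2-Acetoxybenzoic acid",  # systematic-style name
    "50-78-2",                # CAS Registry Number
    "Acetylsalicylsäure",     # German name
)
```

With this in place, `names.lookup("  ASPIRIN ")` and `names.lookup("50-78-2")` resolve to the same key. The quality of such a dictionary depends entirely on curation, since a wrong synonym silently links text to the wrong structure.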
![Page 48: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/48.jpg)
Searching that lipid in patents
![Page 49: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/49.jpg)
Aspirin on ChemSpider
![Page 50: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/50.jpg)
Work in Progress
![Page 51: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/51.jpg)
Work in Progress
![Page 52: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/52.jpg)
Work in Progress
![Page 53: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/53.jpg)
Work in Progress
![Page 54: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/54.jpg)
But Context Gives Reactions
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .
The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .
After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue
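Part of the context in a passage like this is the quantity attached to each reagent. The sketch below pulls "name ( amount unit )" pairs out of tokenised text with a regular expression. It is a deliberately narrow illustration: the pattern only handles lower-case names introduced by a comma or "and", and the unit list is an assumption, so it is no substitute for a real chemical text-mining pipeline.

```python
import re

# A reagent name introduced by "," or "and", followed by "( amount unit )".
QUANTITY_RE = re.compile(
    r"(?:,|\band\b)\s*([a-z][a-z ]*?)\s*"
    r"\(\s*(\d+(?:\.\d+)?)\s*(ml|g|mg)\s*\)"
)


def extract_quantities(text):
    """Return a list of (reagent name, amount, unit) tuples."""
    return [
        (name, float(amount), unit)
        for name, amount, unit in QUANTITY_RE.findall(text)
    ]


sentence = (
    "prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) "
    "were charged into a glass reaction vessel"
)
reagents = extract_quantities(sentence)
```

Combined with name-to-structure conversion for the starting material and product, pairs like these are what turn a paragraph of prose into a reaction record with reagents and amounts.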
![Page 55: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/55.jpg)
ChemSpider Reactions
![Page 56: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/56.jpg)
ChemSpider as a Foundation
• >30 million chemicals (and growing)
• ChemSpider is free to access for everyone – and the API means people program against it
• What projects can we benefit?
![Page 57: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/57.jpg)
Support grant-based services
• Multiple European consortium-based grants
  • PharmaSea (FP7 funded)
  • Open PHACTS (IMI funded)
• UK National Chemical Database Service (http://cds.rsc.org) – developing data repository for lab data, integrate Electronic Lab Notebooks
• Open Drug Discovery projects
![Page 58: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/58.jpg)
![Page 59: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/59.jpg)
PharmaSea
![Page 60: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/60.jpg)
• 3-year Innovative Medicines Initiative project
• Integrating chemistry and biology data using semantic web technologies
• Open code, open data, open standards
• Academics, Pharmas, Publishers…
• To put medicines in the pipeline…
![Page 61: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/61.jpg)
Open PHACTS
![Page 62: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/62.jpg)
All Databases We Generate…
• All databases and systems we build now include generated InChIs
• InChIs are facilitating discoverability via searching on Google (see Chris’ talk) but also for querying and linking
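Linking in this sense amounts to a join on the identifier. A minimal sketch, assuming each source exposes records as `(source, local_id, inchikey)` tuples; the function name, the source names and the local IDs used below are placeholders, not real database identifiers.

```python
from collections import defaultdict


def link_by_inchikey(records):
    """Group records from several sources by their shared InChIKey.

    records: iterable of (source, local_id, inchikey) tuples.
    Returns {inchikey: {source: local_id}}.
    """
    links = defaultdict(dict)
    for source, local_id, inchikey in records:
        links[inchikey][source] = local_id
    return dict(links)


# Placeholder records; the IDs are invented for illustration only.
records = [
    ("database_a", "A-0001", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("database_b", "B-9", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("database_a", "A-0002", "QNAYBMKLOCPYGJ-REOHGAMXSA-N"),
]
linked = link_by_inchikey(records)
```

No shared registry is needed for this to work: each database only has to generate the identifier from its own structures, which is what makes federation across independent sources practical.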
![Page 63: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/63.jpg)
But we are still VERY LIMITED
• RSC deals with way more than organics, inorganics, organometallics – we are building a data repository to include materials, polymers, ambiguous materials etc.
• There are many plans for InChI moving forward – Markush, polymers, organometallics etc
![Page 64: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/64.jpg)
The great promise should be obvious
• InChIs are here to stay
• They will evolve, they will encompass, we will adopt and adapt
• Public and private databases will federate & build a linked environment of validated data!
• Data validation and standardization is needed
• Open Data will continue to proliferate
• InChIs are in the “Semantic Web” already
![Page 65: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/65.jpg)
If InChI never existed …
• ChemSpider would never have been built
• Database linking would suffer dramatically
• The web would not be “structure searchable”
• Cheminformatics tools would likely not be linking to public domain databases in the same way
![Page 66: The importance of the InChI identifier as a foundation technology for eScience platforms](https://reader033.fdocuments.us/reader033/viewer/2022061103/54100bfc8d7f72dc0c8b458c/html5/thumbnails/66.jpg)
Thank you
Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams