Hosting Public Domain Chemicals Data Online for the Community – the Challenges of Handling...

Hosting public domain chemicals data online for the community – the challenges of handling materials
Antony Williams
NIST Diffusion/CALPHAD Data Informatics and Tools Workshop
May 14th, 2015
ORCID ID: 0000-0002-2668-4821

Disclaimer…

• Previously at the Royal Society of Chemistry
• Now I am here…

Many challenges are the same

• What I will discuss in terms of publisher, public domain databases, curated chemistry challenges etc. are the same…
• Need capable tools to handle the data
• Need standards for data exchange
• Meshing data without review is dangerous!
• Quality costs – time, effort and money
• Algorithms can help clean data

Where is chemistry online?

• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science

Chemistry on the Internet…

• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching

The issue of identifiers

Some names for Aspirin..

The CAS Number

• MUCH integration is done using CAS Numbers
• MANY searches are CAS Numbers and Names

CAS Numbers are GREAT!

The CAS Number Index grows…

Scifinder

Prophetic Enumeration

CAS Numbers are “Trademarked”?

• From http://www.cas.org/legal/infopolicy

CAS and Wikipedia

• http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation
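
One inexpensive check when validating CAS Numbers is the built-in check digit: the final digit should equal the weighted sum of the preceding digits (weights 1, 2, 3… counted from the right) modulo 10. The sketch below is only an illustration of that rule; the function name `is_valid_cas` is an assumption and is not taken from the slides or from any CAS tool. A passing check digit only shows the number is well formed, not that it is assigned to the intended chemical.

```python
import re

# Format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit (e.g. 50-78-2).
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is well formed and its check digit matches.

    Hypothetical helper for illustration only.
    """
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ("50-78-2", "7732-18-5", "50-78-3"):
        print(candidate, is_valid_cas(candidate))
```

For aspirin (50-78-2) the weighted sum is 8·1 + 7·2 + 0·3 + 5·4 = 42, so the expected check digit is 2.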

7900 CAS Chemicals Online…

How many CAS Numbers?

• >34 million chemicals from >500 sources

But CAS is hard to “Resolve”

Why CAS Numbers are not great

• There is no free service…like DOIs

• The resolver is a “Google Search”
• Maybe we need another “identifier”?

• And thanks to IUPAC/NIST….

The InChI Identifier

Multiple Layers
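
To make the layering concrete, an InChI string can be split on "/" into a version, a formula, and a series of layers that each start with a one-letter prefix. The sketch below is a plain string split, not the IUPAC software; the function name `split_inchi_layers` and the human-readable layer labels are assumptions chosen for illustration, and the example string is the standard InChI for aspirin.

```python
# Labels for the one-letter layer prefixes (wording of the labels is ours).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed hydrogens",
    "r": "reconnected",
}


def split_inchi_layers(inchi: str) -> list:
    """Split an InChI string into (layer name, value) pairs, in order.

    Hypothetical helper: it does not validate chemistry, it only splits text.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: " + inchi)
    segments = inchi[len(prefix):].split("/")
    layers = [("version", segments[0])]
    for position, segment in enumerate(segments[1:], start=1):
        if not segment:
            continue
        # The formula is the only layer without a lowercase prefix letter.
        if position == 1 and not segment[0].islower():
            layers.append(("formula", segment))
        else:
            layers.append((LAYER_NAMES.get(segment[0], "unknown"), segment[1:]))
    return layers


if __name__ == "__main__":
    aspirin = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
    for name, value in split_inchi_layers(aspirin):
        print(name + ": " + value)
```

A list of pairs is used rather than a dictionary because prefixes can repeat after a fixed-hydrogen or reconnected layer.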

InChI

• SINGLE code base managed by IUPAC – integrated into drawing packages and used by MANY databases. No variability as with SMILES

Vendor-dependent SMILES

ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O

OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C

ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C

InChI

• SINGLE code base managed by IUPAC – integrated into drawing packages and used by MANY databases. No variability as with SMILES

• InChI Strings can be reversed to structures – same problem as with SMILES – no layout

• Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet

InChI Strings Hash to InChIKeys
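
The resulting InChIKey is a fixed-length, 27-character string: a 14-letter block hashed from the connectivity, an 8-letter block hashed from the remaining layers (stereo, isotopes), a standard/non-standard flag, a version letter, and a protonation letter. The sketch below only takes apart an existing key; it does not compute the hash, and the names `parse_inchikey` and `same_connectivity` are assumptions for illustration. The example keys are aspirin, L-alanine, and alanine with no stereo layer.

```python
import re

# 14 letters, hyphen, 8 letters + flag (S or N) + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


def parse_inchikey(key: str) -> dict:
    """Split an InChIKey into its blocks (hypothetical helper, no hashing)."""
    match = INCHIKEY_PATTERN.fullmatch(key.strip())
    if match is None:
        raise ValueError("not a well-formed InChIKey: " + key)
    return {
        "connectivity_block": match.group(1),
        "other_layers_block": match.group(2),
        "is_standard": match.group(3) == "S",
        "version_letter": match.group(4),
        "protonation_letter": match.group(5),
    }


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True when two keys share the first block, ignoring stereo and isotopes."""
    block_a = parse_inchikey(key_a)["connectivity_block"]
    block_b = parse_inchikey(key_b)["connectivity_block"]
    return block_a == block_b


if __name__ == "__main__":
    aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
    l_alanine = "QNAYBMKLOCPYGJ-REOHAFLPSA-N"
    alanine_no_stereo = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"
    print(parse_inchikey(aspirin))
    print(same_connectivity(l_alanine, alanine_no_stereo))
    print(same_connectivity(aspirin, l_alanine))
```

Comparing only the first block is one way to group records that differ in stereochemistry, which is useful when the sources being merged disagree on how much stereo they record.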

InChIs for small molecules…

• InChIs are good for “small molecules”
• Read here: http://www.jcheminf.com/series/InChI

A Vision in December 2006

Lots of data coming online…

ChemSpider

Experimental/Predicted Properties

Literature references

Patent references

Google Books

Vendors and data sources

Structure search the web

Exact Search

Skeleton Search

6 years ago this week…

ChemSpider strengths

• Serves over 40,000 unique users per day
• Advanced searching of >34 million chemicals

Fully documented APIs

Data Quality/Standardization

• MANY structures meant to be something online are MISREPRESENTED.

• Commonly you will have better success finding information by name searches than structure – with many caveats of course…

• Validating chemical structure representations is laborious work – and it’s shocking to review data…

What is the Structure of Vitamin K1?

Data Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Data quality is a known issue

Patent data in public databases

Depiction vs Accurate Representation

There are Unused Standards!

Nitro groups

Salt and Ionic Bonds

Ammonium salts

Can we MAKE Quality Data?

• Systems for everyone to validate and standardize their data would be useful

• Would improve structure data in publications, databases etc. and make searching across resources better

• Collaboration to establish community rules would be good!

Chemical Validation and Standardization: http://cvsp.chemspider.com

CVSP Rules Sets

CVSP Filtering of DrugBank

CVSP is Open to Anyone!

ChemSpider limitations

• Supports “small molecules” only – no InChI, no possibility to register a compound

• SO MUCH of chemistry is “materials”

• Severe limitation in chemistry coverage:
• Monomers but no polymers
• Inorganic and organometallic handling
• Ambiguous structures – “Markush”
• Nanomaterials
• Minerals
• Bound to beads, surfaces etc.

ORGANICS vs. Materials

• Comment – you don’t know all of the challenges until you start to work in the area!

• We, and cheminformatics companies, have solved MANY, but not all of the issues regarding organic chemistry management

• The majority of our approaches do not map to materials
• No standard ways to represent compounds
• No InChI for materials

Questions to consider…

• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. Status for materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data exchange?

Polymorphism is common

Known Challenges

• Many materials are non-stoichiometric
• How to represent composite materials (e.g. supported catalysts)?

• Methods to distinguish novelty in materials (equivalent to diversity in organic structures)?

• Lots of challenges ahead. A curated “community dictionary” would be of value…

Mapped DICTIONARIES…

• Structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
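
As a rough sketch of what one entry in such a mapped dictionary might look like, the record below carries the identifier types listed above and builds a lookup index from any identifier back to the structure ID. Everything here is an assumption for illustration: the class name `DictionaryEntry`, the field names, the `build_index` and `lookup` helpers, and the structure ID "entry-0001" are hypothetical and do not describe the ChemSpider schema. The aspirin values are used only as sample data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class DictionaryEntry:
    """One mapped dictionary record (hypothetical layout)."""
    structure_id: str
    systematic_names: List[str]
    trivial_names: List[str]
    smiles: str
    inchi: str
    inchikey: str
    registry_number: str
    database_ids: Dict[str, str] = field(default_factory=dict)


def build_index(entries: List[DictionaryEntry]) -> Dict[str, Set[str]]:
    """Map every identifier to the set of structure IDs that claim it."""
    index: Dict[str, Set[str]] = {}
    for entry in entries:
        # Names are matched case-insensitively; structure strings are not,
        # because case is meaningful in SMILES and InChI.
        keys = [name.casefold() for name in entry.systematic_names]
        keys += [name.casefold() for name in entry.trivial_names]
        keys += [entry.smiles, entry.inchi, entry.inchikey, entry.registry_number]
        keys += [source + ":" + ident for source, ident in entry.database_ids.items()]
        for key in keys:
            index.setdefault(key, set()).add(entry.structure_id)
    return index


def lookup(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Try the query as an exact identifier first, then as a name."""
    return index.get(query) or index.get(query.casefold(), set())


if __name__ == "__main__":
    aspirin = DictionaryEntry(
        structure_id="entry-0001",  # hypothetical internal ID
        systematic_names=["2-acetoxybenzoic acid"],
        trivial_names=["Aspirin", "acetylsalicylic acid"],
        smiles="CC(=O)Oc1ccccc1C(=O)O",
        inchi="InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
        inchikey="BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        registry_number="50-78-2",
        database_ids={"PubChem CID": "2244"},
    )
    index = build_index([aspirin])
    print(lookup(index, "aspirin"))
    print(lookup(index, "50-78-2"))
```

The index maps each identifier to a set rather than a single ID so that collisions (one name or registry number claimed by several structures) stay visible for review instead of being silently overwritten.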

Pragmatism wins

Collaboration is key

Wouldn’t it be nice if…

Thank you

Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams