Standards for Digital Data Representation
1) The IUPAC/NIST Chemical Identifier
2) IUPAC Terminology
NSF Workshop: Constructing a Kinetics Database
NIST, April 19-20, 2004
The News
• Bad News:
  – There are more problems than you thought
• Good News:
  – NIST/IUPAC are trying to solve them for you
Data Tags
[Diagram labels: STM – Scientific, Technical, Medical ‘Publication’; Chemistry; thermo, kinetics, spectroscopy, synthesis]
Data Tags: IUPAC/NIST
• Chemical Identity – INChI
• Interdisciplinary Terms – Gold & Green
[Diagram labels: STM – Scientific, Technical, Medical ‘Publication’; Chemistry]
A Digital ‘Name’ for a Chemical Entity
• Convert chemical structure to digital ‘signature’
• To allow computers to:
  – Organize chemical data
  – Disseminate data (queries)
  – Manage quality control
Current Representations are Inadequate
• Drawing – for humans only
• CAS registry number
  – Arbitrary value (hard to find and confirm)
  – CAS Indexer may not match Specialist
  – Expensive, imprecise, incomplete, no hierarchy
• Connection Table
  – One compound, many representations
  – Embedded ambiguities
• ‘Canonical’ Connection Table
  – No open standard
Reactive Intermediates
• Ions, radicals, excited states
  – In principle, no problem
• Equilibrated species
  – Must specify variability precisely
• Weakly bound complexes
  – OK if orientation is omitted
• Transition states
  – Maybe not necessary in data compilation
ChemWeb, 3/2002
Nature, May 23, 2002
Requirements
• Different compounds have different identifiers
  – All distinguishing structural information is included
[Figure: two different structures, labeled INChI - 1 and INChI - 2]
Requirements
• One compound has only one identifier
  – Include only necessary information
[Figure: four drawings of the same N, O, O group, one with N+, all marked ‘Same INChI’]
Two Problems
• Chemicals
  – Fast isomerization (esp. H-atoms)
  – Unconventional connectivity
• Chemists
  – Differing conventions
    • Depends on discipline, education and convenience
  – Imprecision/uncertainty
3 Steps to INChI
• Chemistry
  – ‘Normalize’ Input Structure
    • Implement chemical rules
• Math
  – ‘Canonicalize’ (label the atoms)
    • Equivalent atoms get the same label
• Format
  – ‘Serialize’ Labeled Structure
    • Output as character string (‘name’)
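The three steps above can be sketched on a toy scale. This is a minimal illustration only, not the INChI algorithm: the brute-force relabeling, the function names, and the output format are all assumptions made for this example, and the approach is only practical for a handful of atoms.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Step 1 (toy): keep elements and connectivity, drop bond orders."""
    return list(atoms), {frozenset((a, b)) for a, b, _order in bonds}

def canonicalize(atoms, edges):
    """Step 2 (toy): try every relabeling, keep the smallest description."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):      # perm[old] = new label
        labeled = [None] * n
        for old, new in enumerate(perm):
            labeled[new] = atoms[old]
        new_edges = sorted(tuple(sorted(perm[i] for i in e)) for e in edges)
        key = (labeled, new_edges)
        if best is None or key < best:
            best = key
    return best

def serialize(labeled, edges):
    """Step 3: write the labeled structure as one character string."""
    atom_part = "".join(labeled)
    bond_part = ",".join(f"{a + 1}-{b + 1}" for a, b in edges)
    return f"{atom_part}/{bond_part}"

def toy_identifier(atoms, bonds):
    return serialize(*canonicalize(*normalize(atoms, bonds)))

# Heavy atoms only; each bond is (atom index, atom index, bond order).
ethanol_drawn_one_way = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_drawn_another = (["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
dimethyl_ether = (["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(toy_identifier(*ethanol_drawn_one_way))   # CCO/1-2,1-3
print(toy_identifier(*ethanol_drawn_another))   # CCO/1-2,1-3
print(toy_identifier(*dimethyl_ether))          # CCO/1-3,2-3
```

The two ethanol inputs number their atoms differently but produce the same string, while dimethyl ether, with the same atoms connected differently, produces another.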
Normalize: Simplify
• Divide structure into ‘layers’
  – Each layer ‘refines’ structure
• Ignore ‘Electron Density’
  – Ignore bond type and electron location
• Stereochemistry
  – sp2 and sp3 only
  – Free rotation around single bonds
Chemical Substances: ‘Layers’
• formula
• connectivity
• stereo
• isotope
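One way to picture layers that each refine the structure is to compare two records only down to a chosen layer. The sketch below assumes a plain dictionary per substance; the layer order follows the list above, while the record values (including the "cis"/"trans" stereo entries) are hypothetical placeholders, not INChI syntax.

```python
LAYER_ORDER = ("formula", "connectivity", "stereo", "isotope")

def matches(a, b, upto):
    """True if records a and b agree on every layer up to and including `upto`."""
    depth = LAYER_ORDER.index(upto) + 1
    return all(a.get(name) == b.get(name) for name in LAYER_ORDER[:depth])

# Hypothetical records for two stereoisomers of the same skeleton.
isomer_a = {"formula": "C4H8", "connectivity": "1-3-4-2", "stereo": "cis"}
isomer_b = {"formula": "C4H8", "connectivity": "1-3-4-2", "stereo": "trans"}

print(matches(isomer_a, isomer_b, "connectivity"))  # True
print(matches(isomer_a, isomer_b, "stereo"))        # False
```

A query that stops at the connectivity layer treats the two records as the same substance; one that includes the stereo layer tells them apart.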
4 Connectivity ‘Sublayers’
• Disconnect H-atoms and metals
  – Create skeleton
• Reconnect Fixed H-atoms
  – Represent multiple species
• Reconnect mobile H-atoms
  – A single species
• Reconnect metal-nonmetal bonds
  – Represent bonds to metals
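The metal-disconnection step can be sketched as dropping every bond that touches a metal and then collecting what remains into separate pieces. The atom numbering below is taken from the reconnected ferrocene connectivity shown later in this talk; the function names and the one-element metal list are assumptions for illustration.

```python
METALS = {"Fe"}  # illustrative only; a real list would be much longer

def disconnect_metals(elements, bonds):
    """Drop every bond that touches a metal atom."""
    return [(a, b) for a, b in bonds
            if elements[a] not in METALS and elements[b] not in METALS]

def components(atoms, bonds):
    """Group atoms into connected components with a simple graph search."""
    neighbors = {a: set() for a in atoms}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(atoms):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            atom = stack.pop()
            if atom in group:
                continue
            group.add(atom)
            stack.extend(neighbors[atom] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Reconnected ferrocene: carbons 1-10 in two rings, Fe as atom 11.
elements = {i: "C" for i in range(1, 11)}
elements[11] = "Fe"
ring_a = [(1, 2), (2, 4), (4, 5), (5, 3), (3, 1)]
ring_b = [(6, 7), (7, 9), (9, 10), (10, 8), (8, 6)]
metal_bonds = [(11, i) for i in range(1, 11)]

skeleton = disconnect_metals(elements, ring_a + ring_b + metal_bonds)
print(components(elements, skeleton))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11]]
```

The result is two five-membered rings plus an isolated metal atom, which is the shape of the disconnected description.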
Ignore Electron Density
• Not required for compound identification
  – Represent ‘excited states’
• Simplify representations
  – Delocalization, aromaticity, zwitterions, coordination …
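A small sketch of why ignoring electrons simplifies matters: two common drawings of a nitro group differ in formal charges and bond orders, yet reduce to the same skeleton once those are dropped. The data layout and the `skeleton` helper are assumptions for this example, and both drawings deliberately use the same atom order, since renumbering is the job of the canonicalization step.

```python
def skeleton(atoms, bonds):
    """Keep element symbols and which atoms are bonded; drop charges and bond orders."""
    elements = tuple(symbol for symbol, _charge in atoms)
    links = frozenset(frozenset((a, b)) for a, b, _order in bonds)
    return elements, links

# atom = (element, formal charge); bond = (atom index, atom index, bond order)
nitro_two_double_bonds = (
    [("C", 0), ("N", 0), ("O", 0), ("O", 0)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 2)],
)
nitro_charge_separated = (
    [("C", 0), ("N", 1), ("O", 0), ("O", -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)

print(nitro_two_double_bonds == nitro_charge_separated)                        # False
print(skeleton(*nitro_two_double_bonds) == skeleton(*nitro_charge_separated))  # True
```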
[Figure: structure with six H atoms]
Simplify - Ignore Electrons
Münchnones
[Figure: alternative drawings of a münchnone ring (atoms N, C, O; substituents R), differing in where the + and - charges are placed]
Mobile H-atom (Tautomer) Sublayer
H-migration between 1,3 heteroatoms
[Figure: two tautomers, one with H on N and one with H on O, and a combined form N O[H]]
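The idea of a mobile H-atom can be sketched by pooling the hydrogens on a group of heteroatoms into one shared count, so both tautomers give the same record. The atom labels, the dictionary layout, and `pool_mobile_h` are hypothetical names for this illustration; how the actual software finds such groups is not shown here.

```python
def pool_mobile_h(h_counts, mobile_group):
    """Move H counts on atoms in `mobile_group` into one shared total."""
    fixed = {atom: n for atom, n in h_counts.items() if atom not in mobile_group}
    shared = sum(h_counts.get(atom, 0) for atom in mobile_group)
    return fixed, (shared, tuple(sorted(mobile_group)))

# The same N-C-O fragment drawn with the H on N or on O.
h_drawn_on_n = {"N": 1, "C": 0, "O": 0}
h_drawn_on_o = {"N": 0, "C": 0, "O": 1}

print(pool_mobile_h(h_drawn_on_n, {"N", "O"}))  # ({'C': 0}, (1, ('N', 'O')))
print(pool_mobile_h(h_drawn_on_o, {"N", "O"}))  # ({'C': 0}, (1, ('N', 'O')))
```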
Nitrobenzene
Canonical numbering: CH 1, CH 2, CH 3, CH 4, CH 5, C 6, N+ 7, O 8, O 9
Description    Layers
formula        C6H5NO2
connectivity   8-7(9)6-4-2-1-3-5-6
H-atoms        1-5H
charges
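The connectivity string above can be unpacked into a list of bonds. The sketch below assumes a grammar read off the examples in this talk: "-" joins consecutive atoms, "(" opens a branch from the preceding atom, "," starts another branch from the same atom, and ")" returns to it. It handles a single component only and is not a full parser for the identifier syntax.

```python
import re

def parse_connectivity(layer):
    """Return the set of bonds encoded in one component of a connectivity string."""
    bonds = set()
    prev = None   # atom that the next number bonds to
    stack = []    # branch points opened by "("
    for token in re.findall(r"\d+|[(),-]", layer):
        if token == "(":
            stack.append(prev)
        elif token == ",":
            prev = stack[-1]
        elif token == ")":
            prev = stack.pop()
        elif token == "-":
            continue
        else:
            atom = int(token)
            if prev is not None:
                bonds.add(frozenset((prev, atom)))
            prev = atom
    return bonds

nitrobenzene = parse_connectivity("8-7(9)6-4-2-1-3-5-6")
print(sorted(tuple(sorted(bond)) for bond in nitrobenzene))
# [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6), (6, 7), (7, 8), (7, 9)]
```

Nine bonds come out: six in the ring, one from C 6 to N 7, and two from N 7 to the oxygens.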
MSG tautomeric
Canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10; Na+ 1
Description    Layers
formula        C5H8NO4.Na
connectivity   6-3(5(9)10)1-2-4(7)8;
H-atoms        1-2H2,3H,6H2(H-,7,8,9,10);
stereo sp3     3-;
charges        -1;+1
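As a consistency check on the layers above, the hydrogens in the H-atoms layer can be tallied and compared with the formula. This sketch assumes a reading of the layer inferred from the examples in this talk: items such as 1-2H2 give a count per atom over a range, and a parenthesized group such as (H-,7,8,9,10) holds hydrogens shared among the listed atoms. The function name and regular expressions are illustrative, not part of any released software.

```python
import re

def count_h(layer):
    """Count hydrogens in one component of an H-atoms layer (toy reading)."""
    layer = layer.rstrip(";")
    total = 0
    # Shared groups look like (H-,7,8,9,10): H count, optional charge, atom list.
    for count, _charge in re.findall(r"\(H(\d*)(-*)(?:,\d+)+\)", layer):
        total += int(count or 1)
    fixed = re.sub(r"\([^)]*\)", "", layer)
    pending = 0   # atoms listed since the last H suffix
    for item in filter(None, fixed.split(",")):
        m = re.fullmatch(r"(\d+)(?:-(\d+))?(?:H(\d*))?", item)
        first, last, h = m.group(1), m.group(2), m.group(3)
        pending += (int(last) - int(first) + 1) if last else 1
        if h is not None:
            total += pending * int(h or 1)
            pending = 0
    return total

print(count_h("1-2H2,3H,6H2(H-,7,8,9,10);"))  # 8
```

The total of 8 agrees with the H8 in the formula C5H8NO4.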
MSG fixed
Canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10; Na+ 1
Description     Layers
formula         C5H8NO4.Na
connectivity    6-3(5(9)10)1-2-4(7)8;
H-atoms         1-2H2,3H,6H2(H-,7,8,9,10);
stereo sp3      3-;
H-atoms fixed   7H;
stereo sp3      3-;
charges         -1;+1
Ferrocene
Disconnected structure
Canonical numbering: CH 1, CH 2, CH 3, CH 4, CH 5 (in each of the two rings); Fe2+ 1
Reconnected structure
Canonical numbering: CH 1, CH 2, CH- 3, CH 4, CH 5, CH 6, CH- 7, CH 8, CH 9, CH 10, Fe2+ 11
Layers, disconnected structure:
formula        2C5.Fe
connectivity   2*1-2-4-5-3-1;
H-atoms        2*1-5H;
charges        2*-1;+2
Layers, reconnected structure:
formula        C10Fe
connectivity   1-2-4-5-3(1)11(1,2,4,5)6-7(11)9(11)10(11)8(6)11
H-atoms        1-10H
Auxiliary Output
• Confirmation
  – Label stereogenic atoms
  – Identify equivalent atoms
• Warnings/Errors
  – Unusual valences
  – Unrecognized input
• ‘Reversibility’
  – Coordinates
  – Bond/Charge Location
Testing - OK
Beta Testing
Performance: Most Challenging NCI-NIH Structure
[Figure: structure NSC# 666457]
50 ms – 2 GHz PC
INChI FAQs
• How can you represent chemistry without electrons?
  – Chemistry is not represented, just identity
  – Whole molecule properties may be added (state, phase, ...)
• Do big molecules have big INChIs?
  – Yes, just like systematic names
• How to handle other tautomer types, substructures, ...?
  – Other software
• Is INChI reversible?
  – Partly: contains only data needed for ‘naming’
  – Auxiliary fields can carry structure depiction information
• Is INChI extensible?
  – New layers can add refinement
Started Oct. 2002
http://www.nicmila.org/Gold/Output/
Miloslav Nic, Jiri Jirat, Czech Republic
Converted - XML
My Point of View
• A forest of data dictionaries is growing
  – Horizontally and vertically
• We need to consider forest management
• Some day all reusable data will be tagged