InChI for Large Molecules

InChI for Large Molecules: The NextMove Software Perspective

Roger Sayle & Noel O’Boyle

NextMove Software, Cambridge, UK

InChI for Large Molecules meeting, NCBI, Bethesda, MD Monday 27th October 2014

“This house believes…”

• The most important distinction in life science informatics is between molecular and non-molecular (bio)chemistry, not between chemistry and biology.

• Drawing fuzzy distinctions between “small molecules”, lipids, proteins, nucleic acids, peptides, oligosaccharides and terpenes is like asking how many colors there are in a rainbow (cf. the Sapir-Whorf hypothesis).

• Schemes that encode these distinctions (such as HELM, ISO 11238 and even RasMol) break down when the (poorly defined) categories overlap.


Peptide or not?

cyclo[OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val-OAla-Val-D-OVal-D-Val]

valinomycin


Saccharide or not?

D-Glucopyranose = D-gluco-hexopyranose

(2S)-2-methyloxane = (2S)-2-methyl-tetrahydropyran


Saccharide or not?

D-Glucopyranose = D-gluco-hexopyranose

D-Quinovopyranose = 6-deoxy-glucopyranose = 6-deoxy-D-gluco-hexopyranose

D-Paratopyranose = 3,6-dideoxy-glucopyranose = 3,6-dideoxy-D-ribo-hexopyranose

D-Amicetopyranose = 2,3,6-trideoxy-glucopyranose = 2,3,6-trideoxy-D-erythro-hexopyranose

(2S)-2-methyloxane = (2S)-2-methyl-tetrahydropyran

The cutting edge of biosimilarity

• The high prevalence of potentially life-threatening hypersensitivity reactions to the antibody cetuximab (Erbitux) in some US states has been traced to its glycosylation, which contains a Gal(α1-3)Gal epitope.

Chung et al., “Cetuximab-induced anaphylaxis and IgE specific for galactose-alpha-1,3-galactose”, New England Journal of Medicine, Vol. 358, No. 11, pp. 1109-1117, 13th March 2008.

• Similarly, human erythropoietin (EPO) alpha, beta, delta and omega share the same primary sequence but differ in their glycosylation patterns.


Destructive suggestion…

• Systems based upon monomer dictionaries (such as HELM and PDB) are notoriously difficult to maintain.

• The limited number of monomers in proteinogenic peptides and natural nucleic acid sequences leads to a false sense of security: the belief that the set of monomers is finite.

• In practice, the number of monomers, post-translational modifications and chemical modifications is infinite.

• Even more difficult than standardizing monomer definitions via a central repository, like PDB, is allowing local custom definitions.


48 hexopyranoses
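The slide does not say how the figure of 48 was counted. One plausible reading, offered here purely as an assumption, is that it covers every stereoisomer of the aldohexopyranose ring (five stereocentres, C1 to C5) plus every stereoisomer of the ketohexopyranose ring (four stereocentres, C2 to C5), with both anomers and both D and L series included. The names and the "a"/"b" labels below are illustrative only, and the sketch makes no attempt to reproduce the deoxy or substituted counts.

```python
from itertools import product

# Assumption: a hexopyranose is fixed by its ring type plus one binary
# configuration choice at each ring stereocentre (anomeric centre included).
STEREOCENTRES = {
    "aldohexopyranose": ("C1", "C2", "C3", "C4", "C5"),
    "ketohexopyranose": ("C2", "C3", "C4", "C5"),
}


def enumerate_hexopyranoses():
    """Yield (ring type, configuration tuple) for every stereoisomer."""
    for ring, centres in STEREOCENTRES.items():
        # "a"/"b" are placeholder labels for the two configurations.
        for config in product("ab", repeat=len(centres)):
            yield ring, config


total = sum(1 for _ in enumerate_hexopyranoses())
print(total)  # 2**5 + 2**4
```

Each additional modification site multiplies the count again, which is the practical problem for any fixed monomer dictionary.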


264 deoxy-hexopyranoses

9540 substituted hexopyranoses (4 most common substituents)

Constructive suggestion…

• Ideally, a chemical identifier should be independent of the input representation or file format.

• Duplicates between small molecules, peptides and proteins are best determined by a single identifier, preferably the existing InChI.

• This is possible as increases in computer power and storage mean that cheminformatics toolkits can handle huge biopolymers on modern hardware.
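As a rough illustration of format independence, the sketch below expands a peptide written in either one-letter or three-letter codes into the same all-atom SMILES string, from which a toolkit could then derive an identifier. The fragment table, function names and the small residue subset are assumptions made for this example; the output is not canonical, and this is not how any particular toolkit does it.

```python
# Fragments run from the amide N to the carbonyl C, so concatenating them
# forms the peptide bonds. Only a few L-amino acids are included here.
RESIDUE_SMILES = {
    "G": "NCC(=O)",
    "A": "N[C@@H](C)C(=O)",
    "V": "N[C@@H](C(C)C)C(=O)",
    "L": "N[C@@H](CC(C)C)C(=O)",
    "S": "N[C@@H](CO)C(=O)",
    "D": "N[C@@H](CC(=O)O)C(=O)",
}

THREE_TO_ONE = {
    "Gly": "G", "Ala": "A", "Val": "V", "Leu": "L", "Ser": "S", "Asp": "D",
}


def peptide_to_smiles(sequence: str) -> str:
    """Expand a one-letter sequence into a linear, free-acid peptide SMILES."""
    if not sequence:
        raise ValueError("empty sequence")
    unknown = [aa for aa in sequence if aa not in RESIDUE_SMILES]
    if unknown:
        raise ValueError(f"no fragment for residue(s): {unknown}")
    # The trailing "O" completes the C-terminal carboxylic acid.
    return "".join(RESIDUE_SMILES[aa] for aa in sequence) + "O"


def three_letter_to_smiles(text: str) -> str:
    """Accept e.g. 'Gly-Ala-Gly' and reuse the one-letter expansion."""
    return peptide_to_smiles("".join(THREE_TO_ONE[r] for r in text.split("-")))


print(peptide_to_smiles("GAG"))
print(three_letter_to_smiles("Gly-Ala-Gly"))
```

Both calls print the same string, so any identifier computed from the expanded structure no longer depends on which notation the record arrived in.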


Proof-of-concept

• I’ve previously reported on Tanimoto chemical search of the PDB (80K) represented as canonical SMILES (1 GB).

• To test for duplicates and InChIKey hash collisions, we attempted to generate InChIKeys for UniProt.

• The Open Babel source tree already contains patches to the InChI library that raise the official 1024-atom limit.

• A few additional source changes also helped.

• Ultimately, InChIKeys could be generated for ~99.4% of the ~450K unique sequences in the Swiss-Prot division.
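For a sense of how likely an accidental hash collision is at this scale, the following back-of-the-envelope sketch applies the usual birthday approximation. It assumes the commonly quoted figures of 65 hash bits in the first InChIKey block and 37 in the second, and it assumes the hashes behave as uniformly random; neither assumption comes from the slides.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation to P(at least one collision among n_items)."""
    space = 2.0 ** hash_bits
    pairs = n_items * (n_items - 1) / 2.0
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-pairs / space)


n_sequences = 450_000  # approximate number of unique sequences quoted above

p_first_block = collision_probability(n_sequences, 65)
p_both_blocks = collision_probability(n_sequences, 65 + 37)

print(f"first block only: {p_first_block:.2e}")
print(f"both blocks:      {p_both_blocks:.2e}")
```

Under these assumptions a chance collision among ~450K keys is very unlikely, so any identical keys found in such a run are far more likely to be genuine duplicates or software faults than hash accidents.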


Record-breaking InChIKey

• Sequence Identifier: UTP10_KLULA

• Sequence Length: 1774 amino acids

• Molecule size: 28509 atoms

• InChI Length: 119699 characters

• InChIKey: PHBRSEQMAKHFGD-ZBXWIJJNSA-N

• InChI Canonicalization Time: 73.2s

• Canonical SMILES Length: 35408 characters

• SMILES Canonicalization Time: 0.4s
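Whatever the size of the molecule, the key itself stays at 27 characters. A minimal sketch of splitting the key above into its fields, following the published InChIKey layout (14-letter first block, 8-letter second block, a standard/non-standard flag, a version letter and a protonation letter), is shown below; the class and function names are made up for this example.

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    connectivity_hash: str      # 14 letters, hashed from the connectivity layers
    remaining_layers_hash: str  # 8 letters, hashed from the remaining layers
    is_standard: bool           # "S" = standard InChI, "N" = non-standard
    version: str                # "A" denotes InChI version 1
    protonation: str            # "N" = neutral


_KEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key: str) -> InChIKeyParts:
    """Split a 27-character InChIKey into its fields, or raise ValueError."""
    match = _KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    first, second, flag, version, protonation = match.groups()
    return InChIKeyParts(first, second, flag == "S", version, protonation)


parts = split_inchikey("PHBRSEQMAKHFGD-ZBXWIJJNSA-N")
print(parts)
```

This only checks the shape of the key; it cannot confirm that the key was derived from a correct InChI.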


Protein canonicalization time [figure, shown over three slides]

Conclusions

• “InChI for large molecules” simply requires fixing the bugs in standard InChI.


Acknowledgements

• Lisa Sach-Peltason, Hoffmann-La Roche, Basel.

• Joann Prescott-Roy, Novartis, Boston, MA.

• Greg Landrum, Novartis, Basel, Switzerland.

• Evan Bolton, NCBI PubChem project, Bethesda, MD.


Sugar & Splice

• Common name: [5-L-aspartic acid]oxytocin

• IUPAC name: L-cysteinyl-L-tyrosyl-L-isoleucyl-L-glutaminyl-L-alpha-aspartyl-L-cysteinyl-L-prolyl-L-leucyl-glycinamide (1->6)-disulfide

• IUPAC condensed: L-Cys(1)-L-Tyr-L-Ile-L-Gln-L-Asp-L-Cys(1)-L-Pro-L-Leu-Gly-NH2

• SMILES: [C@H]1(CCCN1C(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@@H](Cc2ccc(cc2)O)C(=O)N[C@@H]([C@H](CC)C)C(=O)N[C@@H](CCC(=O)N)C(=O)N[C@@H](CC(=O)O)C(=O)N1)N)C(=O)N[C@@H](CC(C)C)C(=O)NCC(=O)N

• PLN: H-C(1)YIQDC(1)PLG-[NH2]

• HELM: PEPTIDE1{C.Y.I.Q.N.C.P.L.G.[am]}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$

• Also shown: PDB, depictions

Competing interests statement

Peptide names imply architecture

• Named peptides imply not only sequence but also N-terminal acetylation, C-terminal amidation and disulfide bridge topology.

• Example named derivatives:

– gastrin (14-17)

– motilin amide

– oxytocin free-acid

– acetyl-oxytocin

– deacetyl-abarelix

– oxytocin reduced

– endothelin-1 (1→3),(11→15)-bis(disulfide)
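To make the point concrete, the sketch below stores the terminal caps and disulfide pairing explicitly alongside the sequence, so that derivatives such as "free-acid" or "reduced" become small changes to one record. The `Peptide` class and its line format are hypothetical; the output merely imitates the single PLN-style line shown for [5-L-aspartic acid]oxytocin and does not implement the PLN specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Peptide:
    sequence: str                      # one-letter codes
    n_cap: str = "H"                   # "H" = free amine
    c_cap: str = "OH"                  # "OH" = free acid, "[NH2]" = amide
    disulfides: List[Tuple[int, int]] = field(default_factory=list)  # 1-based

    def to_line(self) -> str:
        """Render a compact one-line description with bridge labels."""
        labels: Dict[int, int] = {}
        for bridge_id, pair in enumerate(self.disulfides, start=1):
            for pos in pair:
                if self.sequence[pos - 1] != "C":
                    raise ValueError(f"position {pos} is not a cysteine")
                labels[pos] = bridge_id
        residues = [
            aa + (f"({labels[pos]})" if pos in labels else "")
            for pos, aa in enumerate(self.sequence, start=1)
        ]
        return f"{self.n_cap}-{''.join(residues)}-{self.c_cap}"


parent = Peptide("CYIQDCPLG", c_cap="[NH2]", disulfides=[(1, 6)])
free_acid = Peptide("CYIQDCPLG", c_cap="OH", disulfides=[(1, 6)])
reduced = Peptide("CYIQDCPLG", c_cap="[NH2]")

for peptide in (parent, free_acid, reduced):
    print(peptide.to_line())
```

A bare sequence string carries none of these three fields, which is why a name-to-structure step has to supply them before any identifier is computed.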
