Glycan database


Page 1: Glycan database

Glycan database

Page 2: Glycan database

Database of molecules

• Two models (of vocabularies)
– Proteins / Nucleic Acids
  • Residues (+ modifications)
  • GenBank / Swiss-Prot
– Compounds
  • Atoms & covalent bonds (SMILES/SMARTS language)
  • PubChem / ACS

• Glycans
– Residues: monosaccharides (+ many modifications)
– Branching nonlinear structure

Page 3: Glycan database

Simplified molecular-input line-entry system (SMILES)

• Glucose

• OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
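As a quick sanity check on the atom-level vocabulary, the sketch below tokenizes the SMILES string from this slide and counts its heavy atoms. It is a minimal illustration using only a regular expression, not a full SMILES parser; the names `ATOM` and `heavy_atom_counts` are hypothetical.

```python
import re
from collections import Counter

GLUCOSE = "OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1"

# Bracket atoms such as [C@@H], or bare atoms from the SMILES organic subset.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_counts(smiles):
    """Count atoms written explicitly in a SMILES string (implicit H is ignored)."""
    counts = Counter()
    for token in ATOM.findall(smiles):
        if token.startswith("["):
            # Skip an optional isotope number, then read the element symbol.
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z])", token).group(1)
        else:
            symbol = token
        counts[symbol.capitalize()] += 1
    return counts

counts = heavy_atom_counts(GLUCOSE)
print(dict(counts))
```

Branch parentheses and ring-closure digits are skipped by the tokenizer, so only the atoms themselves are counted; for glucose (C6H12O6) this gives six carbons and six oxygens.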

Page 4: Glycan database

Representation of glycans

• Vocabulary
– monosaccharides rather than atoms

• Two challenges
– Controlled vocabulary of monosaccharides
  • GlycoCT
– From residues to molecules: glycan exchange format
  • GLYDE-II

Page 5: Glycan database

Searching the glycan database: comparison

• Glycan representation
– tree vs. sequences

• Glycan matching
– exact vs. non-exact
  • Graph theoretic algorithm
– alignment? Mutations are natural events.
– Multiple glycan matching

• Glycan pattern searching
– Significance estimation

Page 6: Glycan database

GlycoCT: controlled vocabulary

Page 7: Glycan database

GLYDE standard

• An XML based representation format for glycan structures

• Inter-convertible with existing data represented using IUPAC or LINUCS.

• GLYDE II: Incorporation of Probability based representation

• Visualization: structures using GLYDE (XML) files

GLYDE - An expressive XML standard for the representation of glycan structure. Carbohydrate Research, 340 (18), Dec 30, 2005.

Page 8: Glycan database

Collaborative GlycoInformatics

• Enable querying and export of query results in GLYDE format
• Using GLYDE representation for disambiguation, mapping and matching

[Figure: diagram labeled QUERY, RESULT and GLYDE, linking MonosaccharideDB, SweetDB and KEGG through GLYDE XML documents (<glyde><residue> … </residue></glyde>).]

Page 9: Glycan database

Semantic GlycoInformatics - Ontologies

• GlycO: A domain ontology for glycan structures, glycan functions and enzymes (embodying knowledge of the structure and metabolisms of glycans)
o Contains 600+ classes and 100+ properties that describe structural features of glycans; unique population strategy
o URL: http://lsdis.cs.uga.edu/projects/glycomics/glyco

• ProPreO: a comprehensive process ontology modeling experimental proteomics
o Contains 330 classes, 6 million+ instances
o Models three phases of experimental proteomics
o URL: http://lsdis.cs.uga.edu/projects/glycomics/propreo

Page 10: Glycan database

GlycO taxonomy

The first levels of the GlycO taxonomy

Most relationships and attributes in GlycO

GlycO exploits the expressiveness of OWL-DL. Cardinality constraints, value constraints, and existential and universal restrictions on the range and domain of properties allow the classification of unknown entities as well as the deduction of implicit relationships.

Page 11: Glycan database

<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
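The nested <residue> elements above map directly onto a rooted tree. The following is a minimal sketch of reading such a document with Python's standard `xml.etree.ElementTree`; the helper names (`linkage`, `to_tree`, `count_residues`, `show`) and the tuple layout are assumptions made for illustration, not part of any GLYDE tooling.

```python
import xml.etree.ElementTree as ET

A = 'anomeric_carbon="1" chirality="D"'  # attributes shared by every residue on the slide
GLYCAN_XML = f"""
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" {A} anomer="b" monosaccharide="GlcNAc">
    <residue link="4" {A} anomer="b" monosaccharide="GlcNAc">
      <residue link="4" {A} anomer="b" monosaccharide="Man">
        <residue link="3" {A} anomer="a" monosaccharide="Man">
          <residue link="2" {A} anomer="b" monosaccharide="GlcNAc"></residue>
          <residue link="4" {A} anomer="b" monosaccharide="GlcNAc"></residue>
        </residue>
        <residue link="6" {A} anomer="a" monosaccharide="Man">
          <residue link="2" {A} anomer="b" monosaccharide="GlcNAc"></residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
"""

def linkage(res):
    """Format a linkage such as 'b1-4' from the residue attributes."""
    a = res.attrib
    return f"{a['anomer']}{a['anomeric_carbon']}-{a['link']}"

def to_tree(res):
    """Convert a <residue> element into (label, linkage, children)."""
    return (res.get("monosaccharide"), linkage(res),
            [to_tree(child) for child in res.findall("residue")])

def count_residues(tree):
    return 1 + sum(count_residues(child) for child in tree[2])

def show(tree, depth=0):
    """Print the tree with one residue per line, indented by depth."""
    label, link, children = tree
    print("  " * depth + f"{label} ({link})")
    for child in children:
        show(child, depth + 1)

root = ET.fromstring(GLYCAN_XML)
core = to_tree(root.find("residue"))
show(core)
```

Only direct <residue> children are followed at each level, so the branching of the glycan is preserved in the nested tuples.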

Page 12: Glycan database
Page 13: Glycan database

ProPreO

• ProPreO: A process ontology to capture the proteomics experimental lifecycle:
o Separation
o Mass spectrometry
o Analysis
o 330 classes
o 110 properties
o 6 million+ instances

Page 14: Glycan database

Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.

Usage: Mass spectrometry analysis

Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Page 15: Glycan database

P(S | M = 3461.57) = 0.6
P(T | M = 3461.57) = 0.4

Goldberg, et al, Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra, Proteomics 2005, 5, 865–875

Semantic Annotation of Experimental Data
• Enables Ontology-mediated Disambiguation
• Allows correlation between disparate entities using Semantic Relations

Page 16: Glycan database

Graph Theoretic Basics

• Tree: an acyclic connected graph, whose vertices we refer to as nodes;
• Rooted tree: a tree having a specific node called the root, from which the rest of the tree extends;
• Children: nodes that extend from a node x by one edge are called the children of x; conversely, x is called the parent of these children;
• Leaf: a node with no children;
• Subtree: a subtree of a tree T is a tree whose nodes and edges are subsets of those of T;
• Ordered tree: a rooted tree in which the children of each node are ordered;
• Labeled tree: a tree in which a label is attached to each node;
• Forest: a set of trees.

• Oligosaccharides can be represented as labeled (monosaccharides), ordered (if linkages are specified) and rooted trees.
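The definitions above translate into a very small data structure. The sketch below is one possible encoding, assuming a hypothetical `Residue` class in which the node label is the monosaccharide name and the child order is fixed by the linkage string; the example tree is the trimannosyl core of the N-glycan shown earlier.

```python
from dataclasses import dataclass, field

@dataclass
class Residue:
    label: str                      # node label: the monosaccharide
    linkage: str = ""               # linkage to the parent, e.g. "b1-4"
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child and keep children ordered by linkage string."""
        self.children.append(child)
        self.children.sort(key=lambda c: c.linkage)
        return self

def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(c) for c in node.children)

def leaves(node):
    """Labels of all nodes without children, left to right."""
    if not node.children:
        return [node.label]
    return [label for c in node.children for label in leaves(c)]

# Root = reducing-end GlcNAc; the two branches hang off the beta-mannose.
mannose = Residue("Man", "b1-4")
mannose.add(Residue("Man", "a1-6")).add(Residue("Man", "a1-3"))
core = Residue("GlcNAc").add(Residue("GlcNAc", "b1-4").add(mannose))

print(size(core), leaves(core))
```

Sorting children by linkage is only one way to make the tree ordered; any consistent ordering rule would serve the same purpose.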

Page 17: Glycan database

Maximum Common Subtree Problem (MCST)

• Input: Two labeled rooted trees T1 and T2.
• Output: A tree which is a subtree of both T1 and T2 and whose number of edges is the maximum among all such possible subtrees.

• Variants: Each of T1 and T2 can be ordered or unordered.

Aoki, et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003).

Page 18: Glycan database

A bottom-up dynamic programming algorithm

• Let {u1, …, un} and {v1, …, vm} be the sets of nodes in T1 and T2, respectively;

• R[ui, vj]: the size of the maximum common subtree of T1(ui) and T2(vj), the subtrees of T1 and T2 with ui and vj as roots, respectively;

– Computed from leaves to roots (bottom-up)
– MCST of T1 and T2: R[root(T1), root(T2)]

• R[ui, ∅] = R[∅, vj] = 0;

$$
R[u_i, v_j] =
\begin{cases}
1 + \max_{M(u_i, v_j)} \sum_{u_k \in \mathrm{children}(u_i)} R[u_k, M(u_k)] & \text{if } u_i = v_j \\
0 & \text{if } u_i \neq v_j
\end{cases}
$$

• M(u, v) is a matching in a bipartite graph between the children of u and the children of v, and M(u_k) is the child of v_j matched to u_k (∅ if u_k is unmatched); if both T1 and T2 are ordered trees, M(u, v) = 1.

Aoki, et al. Efficient Tree-Matching Methods for Accurate Carbohydrate Database Queries. Genome Informatics 14: 134-143 (2003). Implemented in KEGG glycan matching and many other services.
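The recurrence on this slide can be sketched for unordered trees as follows. This is a minimal illustration, not the KEGG implementation: it uses memoized top-down recursion (which fills the same table R as the bottom-up order), compares node labels only, and tries every child matching by brute force, which is affordable because a monosaccharide has only a handful of children. `Node` and `mcst_size` are hypothetical names.

```python
from functools import lru_cache
from itertools import permutations

class Node:
    """Labeled rooted tree node; hashable by identity so results can be cached."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)

@lru_cache(maxsize=None)
def mcst_size(u, v):
    """R[u, v]: nodes in the largest common subtree rooted at both u and v."""
    if u.label != v.label:
        return 0
    small, large = sorted((u.children, v.children), key=len)
    best = 0
    # Each injective assignment of the smaller child set into the larger one
    # is one matching M(u, v); unmatched children contribute 0.
    for chosen in permutations(large, len(small)):
        # R is symmetric when only labels are compared, so argument order is free.
        best = max(best, sum(mcst_size(x, y) for x, y in zip(small, chosen)))
    return 1 + best

t1 = Node("Man", [Node("GlcNAc"), Node("Gal")])
t2 = Node("Man", [Node("Gal"), Node("GlcNAc", [Node("Fuc")])])
print(mcst_size(t1, t2))
```

For ordered trees the loop over matchings would be replaced by the single matching fixed by the child order. A practical matcher would also compare linkage information on the edges, which this sketch omits.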

Page 19: Glycan database

Alignment algorithm?

• Complexity: unordered tree ~O(4!mn) ~ O(24mn); ordered tree ~O(mn). Typically m, n < 25.

• Extended to the MCST problem in multiple trees
– Is the MCST of T1, T2 and T3 the MCST of MCST(T1, T2) and T3, where MCST(T1, T2) is the maximum common subtree of T1 and T2?
– Multi-MCST problem is NP-hard (Akutsu, 2002)
  • Reducible from the Longest Common Subsequence problem (LCS)
– Finding substructures: motif finding problem, profile models

• Should we consider indels as in DNA/protein alignments?
– Indels are not natural changes, but mutations might be.
– Profile HMM may not be appropriate

Page 20: Glycan database

Maximum Common Approximate Subtree Problem (MCAST)

• Input: Two labeled rooted trees T1 and T2.
• Output: A tree which is a k-approximate subtree of both T1 and T2 and whose number of edges is the maximum among all such possible subtrees.

• T is a k-approximate subtree of U if one of U’s subtrees can be transformed to T by replacing at most k labels.

Page 21: Glycan database

Subtree finding problem (pattern matching problem)

• Input: a labeled rooted tree P and a set (database) S of labeled rooted trees.

• Output: all trees in S that have a subtree matching P.

• Variants: (1) P can be ordered or unordered; (2) P must be on the root; (3) P must be on the leaves

• A bottom-up DP algorithm modified from MCST algorithm; complexity O(|P|*|T|) for each T in the database.

Page 22: Glycan database

A bottom-up dynamic programming algorithm

• Let {u1, …, un} and {v1, …, vm} be the sets of nodes in P and T, respectively.
• R[ui, vj]: indicator of whether the subtree of P rooted at ui matches a subtree of T rooted at vj
– Output the subtree rooted at vj for each vj with R[root(P), vj] = 1;

• R[x, ∅] = R[∅, y] = 0.
• R[x, y] = 1 if x = y and x is a leaf of P.

$$
R[u_i, v_j] =
\begin{cases}
1, & \text{if } u_i = v_j \text{ and for each child } x \text{ of } u_i \text{ there exists a child } y \text{ of } v_j \text{ s.t. } R[x, y] = 1 \\
0, & \text{otherwise}
\end{cases}
$$

• For ordered trees, match edges rather than nodes.
• Variants: (1) leaves: R[x, y] = 1 if x = y and x and y are both leaves; (2) root: output each tree T with R[root(P), root(T)] = 1.
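The indicator recurrence above can be sketched directly as a recursive predicate plus a scan over the database. This is a minimal illustration under the slide's own rule (each child of the pattern node needs some matching child in the target, with no requirement that different pattern children use different target children); `Node`, `matches_at`, `iter_nodes` and `search` are hypothetical names, and the two database entries are made up for the example.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def matches_at(p, t):
    """R[p, t]: does the pattern subtree rooted at p match at node t?"""
    if p.label != t.label:
        return False
    # A pattern leaf has no children, so all(...) is vacuously True.
    return all(any(matches_at(x, y) for y in t.children) for x in p.children)

def iter_nodes(t):
    """Yield every node of the tree rooted at t."""
    yield t
    for child in t.children:
        yield from iter_nodes(child)

def search(pattern, database):
    """Names of all trees in the database containing a match for pattern."""
    return [name for name, tree in database.items()
            if any(matches_at(pattern, v) for v in iter_nodes(tree))]

database = {
    "A": Node("GlcNAc", [Node("GlcNAc", [Node("Man", [
        Node("Man", [Node("GlcNAc")]), Node("Man")])])]),
    "B": Node("Gal", [Node("Glc")]),
}
pattern = Node("Man", [Node("GlcNAc")])
print(search(pattern, database))
```

The root variant from the slide corresponds to calling `matches_at(pattern, tree)` on the root only instead of scanning all nodes.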

Page 23: Glycan database

Significance of matching glycans

• MCST of T1 and T2 has k nodes (monosaccharides)

• N(T, k): # of subtrees of T with k nodes
– Can be counted by a DP algorithm (how?)

• $P = a^{-k} \, N(T_1, k) \, N(T_2, k)$
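One way to answer the "how?" above is a knapsack-style DP over the tree: for every node, count the connected subtrees that have that node as their topmost node, combining children one at a time. The sketch below assumes that "subtree" means any connected subgraph and that a is the number of distinct monosaccharide labels; the slide defines neither, so both are assumptions. `Node`, `count_subtrees` and `match_score` are hypothetical names.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def count_subtrees(root, k):
    """N(T, k): number of connected subtrees of T with exactly k nodes (k >= 1)."""
    total = 0

    def top_counts(v):
        # f[j] = number of connected subtrees with j nodes whose top node is v.
        nonlocal total
        f = [0] * (k + 1)
        f[1] = 1
        for child in v.children:
            g = [1] + top_counts(child)[1:]   # g[0] = 1: leave this child out
            merged = [0] * (k + 1)
            for i, fi in enumerate(f):
                for j, gj in enumerate(g):
                    if i + j > k:
                        break
                    merged[i + j] += fi * gj
            f = merged
        total += f[k]
        return f

    top_counts(root)
    return total

def match_score(t1, t2, k, alphabet_size):
    """P = a^(-k) * N(T1, k) * N(T2, k), with a = alphabet_size (assumed)."""
    return alphabet_size ** (-k) * count_subtrees(t1, k) * count_subtrees(t2, k)

star = Node("Man", [Node("Gal"), Node("Glc"), Node("Fuc")])
path = Node("GlcNAc", [Node("GlcNAc", [Node("Man")])])
print(count_subtrees(star, 2), count_subtrees(path, 2))
```

Each merge step costs O(k^2), so the whole count is O(n k^2) for a tree with n nodes, which is negligible for glycans of the sizes quoted earlier.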

Page 24: Glycan database

Motif retrieval from glycans

• PSTMM (Probabilistic Sibling-dependent Tree Markov Model)
– Learns patterns from glycan structures

• Profile PSTMM
– Extracts patterns (as profiles) from glycan structures

• Kernel methods
– Classification of glycans
– Extraction of “features” to predict glycan biomarkers

Page 25: Glycan database

Kernel method

• Extracted glycan structures from CarbBank
• Pre-analysis showed that the trisaccharide structure was most effective for classification
• Furthermore, since the non-reducing end is usually the portion being recognized, this information was included in the kernel model

Page 26: Glycan database

Kernel method

Page 27: Glycan database

Other kernels

• Q-gram distribution kernel:
– Wanted to be able to analyze any data regardless of marker structure or size
– Definition of q-gram: a subtree containing q nodes
– All of the q-grams for a particular glycan were included in the kernel

• Multiple kernel:
– A kernel of kernels
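A q-gram kernel can be sketched as the dot product of two q-gram count vectors. The example below is a simplification: it only counts path-shaped q-grams (label sequences along downward paths of q nodes), which are a special case of the slide's "subtree containing q nodes", and it ignores linkage information. `Node`, `path_qgrams` and `qgram_kernel` are hypothetical names.

```python
from collections import Counter

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def path_qgrams(root, q):
    """Count label sequences along downward paths of exactly q nodes (q >= 1)."""
    counts = Counter()

    def walk(node, trail):
        # Keep only the last q labels on the path from the root to this node.
        trail = (trail + (node.label,))[-q:]
        if len(trail) == q:
            counts[trail] += 1
        for child in node.children:
            walk(child, trail)

    walk(root, ())
    return counts

def qgram_kernel(t1, t2, q):
    """Inner product of the two q-gram count vectors."""
    c1, c2 = path_qgrams(t1, q), path_qgrams(t2, q)
    return sum(c1[g] * c2[g] for g in c1.keys() & c2.keys())

t1 = Node("Man", [Node("Man", [Node("GlcNAc")]), Node("Man", [Node("GlcNAc")])])
t2 = Node("Man", [Node("GlcNAc")])
print(path_qgrams(t1, 2))
print(qgram_kernel(t1, t2, 2))
```

Because the kernel is an explicit inner product of count vectors, it is symmetric and positive semidefinite, so it can be passed to a standard kernel classifier; a "kernel of kernels" could then be formed as a weighted sum over several values of q.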

Page 28: Glycan database

Using a q-gram distribution, potential biomarkers of the appropriate size can be extracted from the data

Page 29: Glycan database

Data mining for glycobiology

• Kernels can be utilized in many ways
– Feature retrieval methods for detecting putative biomarkers
– Cell-specific glycan structures can be extracted
– Sequences of glycan binding proteins can be included in a new kernel to predict binding domains
– Many more possibilities, depending on the data