The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Mining Patterns in Protein Structures: Algorithms and Applications

Wei Wang
UNC Chapel Hill


Proteins Are the Machinery of Life

Function

Protein Structure Initiative

Protein Data Bank

Serine protease Papain-like Cysteine protease

GTP binding protein

Spatial motifs


MotifSpace

[Architecture diagram]
Components: User Input, MotifMiner, MotifFilter, ProteinClassifier, KnowledgeRetriever, MotifNavigator
Data stores: Spatial Motif Database, Spatial Motif Knowledgebase, Digital Library
External sources: Protein Data Bank, SCOP, CATH, GO, EC
Techniques: subgraph mining, feature selection, association discovery, classification, information retrieval, text mining, indexing and search, visualization, knowledge management
Data flows: protein structures, protein family, spatial motifs, family-specific motifs, articles, experimental knowledge, protein classification


Modeling a Protein by a Set of Points

Amino acids can be represented by points in a 3D space.

ATOM  156  C    GLY A  38  43.696  71.361  61.773  1.00  25.96  C
ATOM  157  O    GLY A  38  43.916  70.461  62.583  1.00  27.40  O
ATOM  158  N    HIS A  39  43.506  72.626  62.145  1.00  25.72  N
ATOM  159  CA   HIS A  39  43.583  73.021  63.550  1.00  22.52  C
ATOM  160  C    HIS A  39  42.367  73.829  63.983  1.00  19.35  C
ATOM  161  O    HIS A  39  41.790  74.562  63.187  1.00  20.24  O
ATOM  162  CB   HIS A  39  44.821  73.890  63.798  1.00  26.08  C
ATOM  163  CG   HIS A  39  46.117  73.173  63.590  1.00  32.47  C
ATOM  164  ND1  HIS A  39  46.786  72.533  64.612  1.00  34.50  N
ATOM  165  CD2  HIS A  39  46.850  72.967  62.471  1.00  31.79  C
ATOM  166  CE1  HIS A  39  47.875  71.961  64.129  1.00  36.40  C
ATOM  167  NE2  HIS A  39  47.937  72.209  62.832  1.00  31.42  N
ATOM  168  N    LEU A  40  41.986  73.701  65.248  1.00  22.27  N
ATOM  169  CA   LEU A  40  40.851  74.468  65.724  1.00  21.68  C
ATOM  170  C    LEU A  40  41.226  75.942  65.709  1.00  23.21  C
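One way to turn ATOM records like these into one point per residue is sketched below. It assumes the alpha carbon (CA) is used as the representative point, which the slide does not state, and it splits fields on whitespace for brevity. Real PDB files are fixed-column, so a production parser should slice by column instead. The function name `residue_points` is hypothetical.

```python
# A few ATOM records copied from the slide, as whitespace-separated fields.
RECORDS = """\
ATOM 158 N HIS A 39 43.506 72.626 62.145 1.00 25.72 N
ATOM 159 CA HIS A 39 43.583 73.021 63.550 1.00 22.52 C
ATOM 160 C HIS A 39 42.367 73.829 63.983 1.00 19.35 C
ATOM 168 N LEU A 40 41.986 73.701 65.248 1.00 22.27 N
ATOM 169 CA LEU A 40 40.851 74.468 65.724 1.00 21.68 C
"""


def residue_points(text, atom_name="CA"):
    """Map (chain, residue number, residue name) to the (x, y, z) of one atom.

    Assumption: `atom_name` (CA by default) stands in for the whole residue.
    """
    points = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] != "ATOM":
            continue  # skip anything that is not a complete ATOM record
        _, _serial, name, resname, chain, resseq, x, y, z = fields[:9]
        if name == atom_name:
            points[(chain, int(resseq), resname)] = (float(x), float(y), float(z))
    return points


points = residue_points(RECORDS)
```

Here `points` holds one entry each for HIS 39 and LEU 40 of chain A.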


Protein structures are chains of amino acid residues with certain spatial arrangements.

[Figure: residue graph with nodes ASP102, HIS57, SER195, ALA55, GLY43, GLY42, GLY40, ASP194, SER190; axis labels: graph complexity, information]

Frequent subgraph mining: given a group of proteins G, each represented by a graph, and a support threshold 0 ≤ σ ≤ 1, find all maximal subgraphs that occur in at least a σ fraction of the graphs in G.

Challenge: subgraph isomorphism (NP-complete)

node ↔ amino acid residue
edge ↔ potential physical interaction
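The node and edge mapping can be sketched with a simple distance threshold: two residues are joined when their representative points lie within a cutoff. The 8.5 Å cutoff, the coordinates, and the function name `contact_graph` are illustrative assumptions, not values from the slides.

```python
from itertools import combinations
from math import dist


def contact_graph(residues, cutoff=8.5):
    """Build a labeled graph from {residue_id: (label, (x, y, z))}.

    Returns (node_labels, edges), where each edge is a frozenset of two
    residue ids whose points are at most `cutoff` apart (assumed threshold).
    """
    node_labels = {rid: label for rid, (label, _) in residues.items()}
    edges = set()
    for (u, (_, pu)), (v, (_, pv)) in combinations(residues.items(), 2):
        if dist(pu, pv) <= cutoff:
            edges.add(frozenset((u, v)))
    return node_labels, edges


# Hypothetical coordinates, for illustration only.
residues = {
    "HIS57": ("H", (0.0, 0.0, 0.0)),
    "ASP102": ("D", (6.0, 0.0, 0.0)),
    "SER195": ("S", (0.0, 7.0, 0.0)),
    "GLY40": ("G", (20.0, 0.0, 0.0)),
}
labels, edges = contact_graph(residues)
```

With these made-up points, HIS57 is linked to ASP102 and SER195, while GLY40 stays isolated.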


Almost-Delaunay (AD)

A 4-tuple of points is almost-Delaunay with parameter ε if, by perturbing all points in the set by at most ε, the circumscribing sphere can become empty.

A 4-tuple of points is AD(ε) if ε is the minimal perturbation.

Blue: Delaunay is AD(0). Red: AD(ε).

Vertex can move within a sphere of radius R1

R2

R3

R5

R4

New tetrahedron may be formed due to the perturbation

(Bandyopadhyay and Snoeyink, SODA, 2004)
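The unperturbed case can be checked directly: a 4-tuple is Delaunay when its circumscribing sphere contains no other point. The sketch below covers only that test. It does not compute the minimal perturbation that defines almost-Delaunay tuples, and the function names are hypothetical.

```python
from math import dist


def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circumsphere(p0, p1, p2, p3):
    """Center and radius of the sphere through four non-coplanar 3D points."""
    # Each row encodes 2 (p_i - p_0) . c = |p_i|^2 - |p_0|^2 for the center c.
    rows, rhs = [], []
    for p in (p1, p2, p3):
        rows.append([2 * (p[k] - p0[k]) for k in range(3)])
        rhs.append(sum(x * x for x in p) - sum(x * x for x in p0))
    d = det3(rows)
    if abs(d) < 1e-12:
        raise ValueError("points are (nearly) coplanar")
    center = []
    for k in range(3):  # Cramer's rule: swap column k for the right-hand side
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][k] = rhs[r]
        center.append(det3(m) / d)
    center = tuple(center)
    return center, dist(center, tuple(p0))


def is_delaunay(tetra, others):
    """True if no point in `others` lies strictly inside the circumsphere."""
    center, radius = circumsphere(*tetra)
    return all(dist(center, tuple(q)) >= radius for q in others)


tetra = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
center, radius = circumsphere(*tetra)
```

For this tetrahedron the center is (0.5, 0.5, 0.5), so a far point such as (2, 2, 2) leaves the sphere empty, while a point at the center does not.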


Graph Representations

[Figure panels: CD (contact distance graph), DT (Delaunay tessellation graph), AD(0.5) (almost-Delaunay graph)]

E(DT) ⊆ E(AD) ⊆ E(CD)


Recurring Patterns from Graph Databases

[Figure: example database of three labeled graphs, P (nodes p1 to p5), Q (nodes q1 to q3), and S (nodes s1 to s3); node labels a, b, c, d; edge labels x, y]

Input: a database of labeled undirected graphs

Output: All (connected) frequent subgraphs from the graph database.

[Figure: the frequent subgraphs of the example database, each annotated with its support (3/3 or 2/3)]
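Support can be computed by testing each database graph for an occurrence of the pattern. The sketch below uses a brute-force, label-preserving subgraph test that is only practical for tiny graphs. The toy graphs are loosely modeled on P, Q, and S, and the exact edges of Q and S are an assumption, since the figure did not survive extraction. All function names are hypothetical.

```python
from itertools import permutations


def graph(labels, edges):
    """Bundle node labels with undirected, labeled edges."""
    return labels, {frozenset(e): lab for e, lab in edges.items()}


def occurs_in(pattern, target):
    """True if `pattern` maps into `target`, preserving node and edge labels.

    Brute force over injective node mappings; extra edges in `target`
    are allowed (the match is not required to be induced).
    """
    p_labels, p_edges = pattern
    t_labels, t_edges = target
    p_ids = list(p_labels)
    for image in permutations(t_labels, len(p_ids)):
        m = dict(zip(p_ids, image))
        if all(p_labels[u] == t_labels[m[u]] for u in p_ids) and all(
            t_edges.get(frozenset(m[u] for u in e)) == lab
            for e, lab in p_edges.items()
        ):
            return True
    return False


def support(pattern, database):
    """Fraction of database graphs that contain the pattern."""
    return sum(occurs_in(pattern, g) for g in database) / len(database)


# Assumed toy database (edge lists are our reading, not taken from the slide).
P = graph({1: "d", 2: "c", 3: "c", 4: "b", 5: "a"},
          {(1, 2): "x", (1, 3): "x", (2, 3): "y", (2, 4): "x", (3, 5): "x"})
Q = graph({1: "d", 2: "c", 3: "c"}, {(1, 2): "x", (1, 3): "x", (2, 3): "y"})
S = graph({1: "d", 2: "c", 3: "c"}, {(1, 2): "x", (1, 3): "x"})
database = [P, Q, S]

edge_dc = graph({1: "d", 2: "c"}, {(1, 2): "x"})
triangle = Q
```

Under these assumptions, `support(edge_dc, database)` is 3/3 and `support(triangle, database)` is 2/3, because S lacks the y-labeled edge.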


Canonical Adjacency Matrix

The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.

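[Figure: graphs P (nodes p1 to p5) and P' (nodes p'1 to p'5)]

Three adjacency matrices of P, shown as lower triangles (diagonal entries are node labels, off-diagonal entries are edge labels, 0 means no edge):

M1, node order p1 p2 p3 p4 p5:
d
x c
x y c
0 x 0 b
0 0 x 0 a

M2, node order p1 p2 p3 p5 p4:
d
x c
x y c
0 0 x a
0 x 0 0 b

M3, node order p3 p2 p5 p4 p1:
c
y c
x 0 a
0 x 0 b
x x 0 0 d

M1 > M2 > M3, comparing the row-by-row codes:

dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d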

Comment (Jack Snoeyink): I think you need to label rows and columns of at least one matrix with node numbers.
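For a graph this small, the canonical code can be found by trying every node order and keeping the largest code. The sketch below does that for graph P, with nodes and edges decoded from the code string dxcxyc0x0b00x0a on the slide. Plain Python string comparison is used as the total order, which is an assumption that happens to agree with the three codes shown. The function names are hypothetical.

```python
from itertools import permutations


def matrix_code(labels, edges, order):
    """Concatenate the lower triangle of the adjacency matrix row by row.

    Diagonal entries are node labels; off-diagonal entries are edge labels,
    with "0" where there is no edge.
    """
    parts = []
    for i, u in enumerate(order):
        for v in order[:i]:
            parts.append(edges.get(frozenset((u, v)), "0"))
        parts.append(labels[u])
    return "".join(parts)


def canonical_code(labels, edges):
    """Largest code over all node orders (brute force, small graphs only)."""
    return max(matrix_code(labels, edges, p) for p in permutations(labels))


# Graph P, decoded from the slide's code for node order p1 p2 p3 p4 p5.
labels = {"p1": "d", "p2": "c", "p3": "c", "p4": "b", "p5": "a"}
edges = {
    frozenset(("p1", "p2")): "x",
    frozenset(("p1", "p3")): "x",
    frozenset(("p2", "p3")): "y",
    frozenset(("p2", "p4")): "x",
    frozenset(("p3", "p5")): "x",
}

m1 = matrix_code(labels, edges, ("p1", "p2", "p3", "p4", "p5"))
m2 = matrix_code(labels, edges, ("p1", "p2", "p3", "p5", "p4"))
m3 = matrix_code(labels, edges, ("p3", "p2", "p5", "p4", "p1"))
cam = canonical_code(labels, edges)
```

Enumerating all orders is factorial in the number of nodes, so this only illustrates the definition; it is not how a mining algorithm would compute the canonical form.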


CAM Tree: Frequent Subgraphs

[Figure: tree of canonical adjacency matrices for the frequent subgraphs of graphs P (nodes p1 to p5), Q (nodes q1 to q3), and S (nodes s1 to s3); node labels a, b, c, d; edge labels x, y; σ = 2/3]


Fast Frequent Subgraph Mining

Spatial locality
  Subgraphs with bounded degree and size

Apriori property
  Any supergraph of an infrequent subgraph is infrequent
  Eliminates unnecessary isomorphism checks

Canonical form
  Avoid redundant examination

Depth-first
  Incremental isomorphism check
  Better memory utilization

The state-of-the-art algorithm that can handle large and complex protein graphs

Open issues
  Substitution
  Dynamics and geometric constraints


Proof of Concept: Serine Proteases

Eukaryotic Serine Protease (ID: 50514), N: 56, σ: 48/56, T: 31.5

 M  C      κ   λ    S  |  M  C      κ   λ    S  |  M  C      κ   λ    S
 1  DHAC   54  13  100 | 14  DHAC   50   6  100 | 27  DASC   49  20   92
 2  ACGG   52   9  100 | 15  HACA   50   8  100 | 28  SAGG   49  31   90
 3  DHSC   52  10  100 | 16  ACGA   50  11  100 | 29  DGGL   49  53   83
 4  DHSA   52  10  100 | 17  DSAG   50  16  100 | 30  DSAGC  48   9   99
 5  DSAC   52  12  100 | 18  SGGC   50  17  100 | 31  DSSC   48  12   97
 6  DGGG   52  23  100 | 19  AGAG   50  27   95 | 32  SCSG   48  19   93
 7  DHSAC  51   9  100 | 20  AGGG   50  58   85 | 33  AGAG   48  19   93
 8  SAGC   51  11  100 | 21  ACGAG  49   4  100 | 34  SAGG   48  23   88
 9  DACG   51  14  100 | 22  SCGA   49   6  100 | 35  DSGS   48  23   94
10  DHAC-like rows continue below
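The significance score S = -log(P) can be sketched with a hypergeometric tail probability: the chance of seeing at least κ motif occurrences in a family of N structures if the family were drawn at random from a background set. The slides give neither the background set size nor the log base, so the base-10 log, the population size, and the way κ and λ are combined below are all assumptions. The function name is hypothetical.

```python
from math import comb, log10


def neg_log10_hypergeom_tail(population, with_motif, family_size, hits):
    """Return -log10 P(X >= hits) for a hypergeometric X.

    X counts motif-containing proteins among `family_size` proteins drawn
    without replacement from `population`, of which `with_motif` contain
    the motif. Exact integer arithmetic avoids underflow for tiny P-values.
    """
    total = comb(population, family_size)
    upper = min(with_motif, family_size)
    favorable = sum(
        comb(with_motif, k) * comb(population - with_motif, family_size - k)
        for k in range(hits, upper + 1)
    )
    if favorable == 0:
        raise ValueError("P-value is exactly zero for these arguments")
    return log10(total) - log10(favorable)


# Small exact check: drawing all 5 marked items out of 10 in 5 draws
# has probability 1 / C(10, 5) = 1 / 252.
score_small = neg_log10_hypergeom_tail(10, 5, 5, 5)

# Illustration with motif 1 from the table (κ = 54, λ = 13, N = 56).
# The background size of 6000 and with_motif = κ + λ are assumptions.
score_motif1 = neg_log10_hypergeom_tail(6000, 54 + 13, 56, 54)
```

The scores in the table appear to be capped at 100, so a value computed this way would only be comparable after applying the same cap.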


Proof of Concept: Serine Proteases

[Figure: structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]

38 highly specific motifs mined from serine proteases classified by SCOP v1.65 (Dec 2003)


Proof of Concept: Papain-like Cysteine Protease

Patt.  Composition   κ   δ  | Patt.  Composition   κ   δ  | Patt.  Composition   κ   δ
  1    HCQS         23   3  |  11    WCSQ         21   0  |  21    WHCQS        20   0
  2    FSQC         22   3  |  12    WSFC         21   2  |  22    WFCSQ        20   0
  3    FQCG         22  10  |  13    WWGS         21   1  |  23    WFCQG        20   0
  4    WHCS         21   0  |  14    WHCQ         21   0  |  24    WFCG         20   0
  5    WCQG         21   0  |  15    SGQN         20   3  |  25    HCSS         20   2
  6    WGNS         21   3  |  16    WFQG         20   0  |  26    WHCG         20   2
  7    WGSG         21   3  |  17    SGCC         20   1  |  27    HCSG         20   9
  8    WFCS         21   2  |  18    FQCG         20   2  |  28    WGFQ         20   7
  9    WFCQ         21   0  |  19    WFSQ         20   7  |  29    WWGG         20   4
 10    HCQG         21   6  |  20    CCGG         20   4  |

All the patterns have -log(P) > 49. κ: support in the PCP family. δ: number of occurrences outside the family. Patterns that contain the active dyad (His and Cys) of the proteins are highlighted.


Proof of Concept: Papain-like Cysteine Protease

The active site in 1cqd

Choi, K. H., Laursen, R. A. & Allen, K. N. (1999). The 2.1 angstrom structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 38(36), 11624–33.


Proof of Concept: Function Inference of Orphan Structure

1nfg
SCOP 51556
Metallo-dependent hydrolase (MDH)
8-stranded (TIM) barrel fold
17 members, 49 family-specific spatial motifs

1m65
CASP5 T0147
unknown function
no good sequence and global structure alignment to known proteins
7-stranded barrel fold, 30 motifs found


Proof of Concept: Function Inference II

1ecs
SCOP 54598
Antibiotic resistance protein
Glyoxalase / bleomycin resistance / dioxygenase superfamily
4 members (SCOP 1.65), 62 family-specific spatial motifs

Yyce (1twu)
unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
46 motifs found, structurally similar to the three new non-redundant AR proteins added in SCOP 1.67


References and Acknowledgement

Comparing graph representations of protein structure for mining family-specific residue-based packing motifs, Journal of Computational Biology (JCB), 2005.

SPIN: Mining maximal frequent subgraphs from graph databases, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 581-586, 2004.

Mining spatial motifs from protein structure graphs, Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004.

Accurate classification of protein structural families using coherent subgraph analysis, Proceedings of the Pacific Symposium on Biocomputing (PSB), pp. 411-422, 2004.

Efficient mining of frequent subgraph in the presence of isomorphism, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003.

Another 45 papers on general methodology development directly related to this project

Collaborators

Catherine Blake (information retrieval)

Charlie Carter (biochemistry)

Nikolay Dokholyan (biophysics)

Leonard McMillan (computer graphics)

Jan Prins (high performance computing)

Jack Snoeyink (computational geometry)

Alexander Tropsha (pharmacy)

Partially supported by

Microsoft eScience Applications Award

Microsoft New Faculty Fellowship

NSF CAREER Award IIS-0448392

NSF CCF-0523875

NSF DMS-0406381

Prototype deployed at
