EECS 730 Introduction to Bioinformatics Structure Comparison Luke Huan Electrical Engineering and...
EECS 730 Introduction to Bioinformatics
Structure Comparison
Luke Huan, Electrical Engineering and Computer Science
http://people.eecs.ku.edu/~jhuan/
Protein Structure Similarity
23/4/21 EECS 730 3
Secondary Structure Elements: helices, strands/sheets & loops
Structure Prediction/Determination
Computational tools
• Homology, threading
• Molecular dynamics
Experimental tools
• NMR spectroscopy
• X-ray crystallography
The State of the Structure Space
• 1990: 250 new structures
• 1999: 2,500 new structures
• 2000: >20,000 structures total
• 2004: ~30,000 structures total
Only about 10% of structures have been determined for known protein sequences
Protein Structure Initiative (PSI)
Structure Similarity
• Refers to how well (or poorly) 3D folded structures of proteins can be aligned
• Expected to reflect functional similarities (interaction with other molecules)
Proteins in the TIM barrel fold family
Alignment of 1xis and 1nar (TIM-Barrels)
[Figure: superposition of 1xis and 1nar in ribbon format and backbone format, with helix axes marked; alignment computed by DALI]
Sayle, R. RasMol. A protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm
Structure Similarity
• Refers to how well (or poorly) 3D folded structures of proteins can be aligned
• Is expected to reflect functional similarities (interaction with other molecules)
• 2007: ~34,000 structures in PDB, ~1,000 different folds (1:34 ratio)
Structure Similarity
• Refers to how well (or poorly) 3D folded structures of proteins can be aligned
• Is expected to reflect functional similarities (interaction with other molecules)
• 2000: ~20,000 structures in PDB, ~4,000 different folds (1:5 ratio)
• Three possible reasons:
  - evolution
  - physical constraints (e.g., few ways to maximize hydrophobic interactions)
  - limits in techniques used for structure determination
• Given a new structure, the probability is high that it is similar to an existing one
Why Compute Structure Similarity?
[Diagram: Sequence, Structure, Function, annotated with "sequence similarity"]
• Low sequence similarity may yield very similar structures
• Sometimes high sequence similarity yields different structures
Alignment of 1xis and 1nar (TIM-Barrels)
1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar
Why Compute Structure Similarity?
[Diagram: Sequence, Structure, Function, annotated with "sequence similarity" and "structure similarity"]
• Low sequence similarity may yield very similar structures
• Sometimes high sequence similarity yields different structures
• Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially with non-evolutionary relationships or non-detectable evolutionary relationships
Ill-Posed Problem
• Multiple terminology:
  - (Dis-)similarity analysis
  - Structure comparison
  - Alignment, superposition, matching
  - Classification
• Definitions
• Applications
• Methods
• Issues
A Few Web Sites
• Protein Data Bank (PDB): http://www.rcsb.org/pdb/
• Protein classification:
  - SCOP: http://scop.berkeley.edu/
  - CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
• Protein alignment:
  - DALI: http://www.ebi.ac.uk/dali/
  - LOCK: http://motif.stanford.edu/lock2/
3D Molecular Structure
Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement
The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction
The type can be the atom ID, the amino-acid ID, etc…
Matching of Structures
Two structures A and B match if:
1. Correspondence: There is a one-to-one map between their elements
2. Alignment: There exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold
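The alignment condition can be checked directly once a correspondence and a candidate transform are in hand. The sketch below is only an illustration: it assumes the correspondence is given by list order and that T is supplied as a 3x3 rotation matrix plus a translation vector. The names `apply_rigid`, `rmsd` and `is_match` are hypothetical, not from the slides.

```python
import math

def apply_rigid(points, rotation, translation):
    """Apply x -> R x + t to each 3D point (R is a 3x3 nested list)."""
    moved = []
    for p in points:
        moved.append(tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        ))
    return moved

def rmsd(a_points, b_points):
    """Root-mean-square distance between corresponding points."""
    if len(a_points) != len(b_points):
        raise ValueError("a one-to-one correspondence needs equal lengths")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(a_points, b_points))
    return math.sqrt(total / len(a_points))

def is_match(a_points, b_points, rotation, translation, threshold):
    """True if RMSD(A, T(B)) is below the threshold for the given T."""
    moved_b = apply_rigid(b_points, rotation, translation)
    return rmsd(a_points, moved_b) < threshold

# Illustrative data (assumed): B is A shifted by (-1, -1, -1).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
B = [(0.0, -1.0, -1.0), (-1.0, 1.0, -1.0), (-1.0, -1.0, 2.0)]

print(is_match(A, B, IDENTITY, (1.0, 1.0, 1.0), threshold=0.1))
print(is_match(A, B, IDENTITY, (0.0, 0.0, 0.0), threshold=0.1))
```

Finding the correspondence and the transform is the hard part of the problem; this sketch only verifies a proposed answer.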
Complete Match
Alignment of 3adk and 1gky
But a complete match is rarely possible:
• The molecules have different sizes
• Their shapes are only locally similar
Both matching and non-matching secondary structure elements
Partial Match
• Notion of support σ of the match: the match is between σ(A) and σ(B)
• Dual problem:
  - What is the support?
  - What is the transform?
• Often several (many) possible supports
• Small supports: motifs
Mathematical Relative
[Figure: two functions f and g compared by ||f − g||² over a support s]
Over which support?
Application #1: Find Global Similarities Among Protein Structures
• Given two protein structures, find the largest similar substructures
• For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule
• Several possible similarity measures
• Variants: 1-to-1, 1-to-many, many-to-many (PDB)
• Must be automatic (and fast)
Application #2: Classify Proteins
• Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al. 1997]
• Hierarchical classification
• Insight into functions and structure stabilization
• Basis for homology and threading
• Manual classification: SCOP [Murzin et al., 1995]
• Increasing size of PDB: automatic classifiers CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander]
Class: Similar secondary structure content
Fold: SSEs in similar arrangement
Family: Clear evolutionary relationship
Manual vs. Automatic Classification
Application #3: Find Motif in Protein Structure
• Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site)
• Find whether the motif matches a substructure of the protein
• Variant: One motif against many proteins
[Figure: active sites of 1PIP and 5PAD. Only 3 amino acids participate in the motif]
Application #4: Find Pharmacophore
Given:
• Small collection (5-10) of small flexible ligands with similar activity (hence, assumed to bind at the same protein site)
• Low-energy conformations (several dozen to a few hundred) for each ligand
Find a substructure (pharmacophore) that occurs in at least one conformation of each ligand
Key problem in drug design when the binding site is unknown
Application #4: Find Pharmacophore
Inhibitors of thermolysin: 1TLP, 4TMN, 5TMN, 6TMN
[Figure: clusters of low-energy conformations of 1TLP]
[Figure: the 4 ligands overlapped with their pharmacophore matched]
Application #5: Search for Ligands Containing a Pharmacophore
Given:
• Database containing several 100,000, or more, small ligands
• A pharmacophore P
Find all ligands that have a low-energy conformation containing P
Data mining of pharmaceutical databases (lead generation)
S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000
Definitions Applications Methods Issues
Multiple Partial Matches
Distributed Support
[Figure: structures A and B with supports σ(A) and σ(B) spread over several segments, separated by a gap]
What is Best?
[Figure: two alternative matches between structures A and B]
Should gaps be penalized?
What About This?
[Figure: a match between structures A and B]
Sequence along backbone is not preserved
A similarity measure for partial matches is unlikely to satisfy the triangle inequality
Compute Structure Similarity
• Structure representation
• Similarity measurement
• Computational solution
Structure Representation
• Element-based representation: a structure is broken down into a list of structure elements
• We represent a protein structure by its geometry, topology, and attributes:
  - Geometry: the coordinates of the elements
  - Topology: the physical and chemical interactions of the elements
  - Attributes: the physical and chemical attributes of the elements
Structure Representation
There are three major groups of structure representation:
• Point list: treat a protein as a list of points in 3D space
• Point set: treat a protein as a set of points in 3D space
• Graphs: treat a protein as a graph
Comparing two point sets
Similarity measure:
Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 mapping f from P to Q such that
$S(P, Q) = \sqrt{\sum_i d^2(p_i, T(f(p_i)))}$ is minimized.
S is called the RMSD (root-mean-square deviation) between the two structures. Strictly, the RMSD divides the sum by n before taking the square root; since n is fixed, this does not change the minimizing T and f.
Comparing two point sets
If m = n, there is a closed-form solution that solves the problem of comparing the two point sets exactly
If m ≠ n, the problem is much harder
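The slides do not name the closed-form method. One well-known option, assuming the 1-1 correspondence is already fixed (row i of P pairs with row i of Q), is the Kabsch algorithm based on a singular value decomposition. The sketch below uses NumPy; the function names `kabsch` and `superposed_rmsd` and the test data are illustrative assumptions.

```python
import numpy as np

def kabsch(P, Q):
    """Return (R, t) minimizing sum_i ||R q_i + t - p_i||^2.

    P, Q: (n, 3) arrays; row i of Q corresponds to row i of P.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (Q - q_mean).T @ (P - p_mean)
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1), not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = p_mean - R @ q_mean
    return R, t

def superposed_rmsd(P, Q, R, t):
    """RMSD between P and the transformed copy of Q."""
    diff = np.asarray(Q, dtype=float) @ R.T + t - np.asarray(P, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Illustrative data (assumed): P is Q rotated 30 degrees about z and shifted.
theta = np.pi / 6
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
shift = np.array([1.0, -2.0, 0.5])
Q = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]])
P = Q @ Rz.T + shift

R, t = kabsch(P, Q)
print(superposed_rmsd(P, Q, R, t))
```

The determinant correction matters for proteins: without it the optimum may be a mirror image, which is not a rigid-body motion.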
Common Point Subset Problem
Find the largest common point subset
Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 partial mapping f with maximal cardinality from P to Q such that
$d(p_i, T(f(p_i))) < t$ for all i defined in f
Also a harder problem (but not an NP-hard problem)
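For a fixed transformation T, the mapping part of this problem reduces to maximum bipartite matching: connect p_i to q_j whenever their distance is below t and find the largest 1-1 assignment. The sketch below assumes Q has already been transformed and uses a simple augmenting-path matching. It does not search over T, and the name `largest_common_subset` is hypothetical.

```python
import math

def largest_common_subset(P, Q, t):
    """Maximum-cardinality 1-1 partial mapping f with d(p_i, f(p_i)) < t.

    Assumes Q is already expressed in P's frame (T has been applied).
    Returns a dict {index in P: index in Q}.
    """
    # candidates[i] lists the points of Q within distance t of P[i].
    candidates = [[j for j, q in enumerate(Q) if math.dist(p, q) < t]
                  for p in P]
    owner = {}  # index in Q -> index in P currently mapped to it

    def try_assign(i, visited):
        for j in candidates[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current owner can move elsewhere.
            if j not in owner or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    for i in range(len(P)):
        try_assign(i, set())
    return {i: j for j, i in owner.items()}

# Illustrative 2D data (assumed). P[0] could take either point of Q,
# but P[1] can only use Q[0], so P[0] must be moved to Q[1].
P = [(0.4, 0.0), (0.0, 0.0), (5.0, 5.0)]
Q = [(0.2, 0.0), (0.8, 0.0)]
mapping = largest_common_subset(P, Q, t=0.5)
print(mapping)
```

A greedy nearest-neighbour assignment would stop at one pair here; the augmenting step is what recovers the second pair.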
Geometric Hashing
Originally used for automatic visual recognition of geometric figures
The principle
• We have two geometric figures:
  - model A with m points (can have several models)
  - query B with n points
• Discover similar subfigures in A and B, invariant under placement, rotation (and often size)
• Let the figures be described by points
• Try to find the largest set of points from (A, B) that coincide
Coinciding points
• Example in two dimensions
• Find six overlapping pairs: (1,a) (2,d) (3,c) (4,e) (6,f) (7,g)
• The coinciding pairs are independent of the labeling
• Note that the figures can be translated and rotated
Reference frames
• The points of the figures are specified in coordinate systems or reference frames
• In 2D, a reference frame can be defined by two points
• Choose two points from A (ai, ak) and two from B (bj, bl), called bases, and define the reference frames (RF) from the bases
  - Example: origin at ai and the x-axis along the line ai, ak; or origin at the midpoint of ai, ak
• Find the positions in the RF of all the other points; the result is called the reference frame system (RFS)
• "Overlap" (the x,y-axes) RFSA and RFSB, and count the number of coinciding points
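A minimal sketch of the first convention (origin at ai, x-axis toward ak) follows. The function name `to_reference_frame` and the sample points are assumptions made for illustration. The example shows that the frame coordinates are unchanged when the whole figure is rotated and translated.

```python
import math

def to_reference_frame(points, origin_idx, axis_idx):
    """Express 2D points in the frame with origin points[origin_idx]
    and x-axis pointing toward points[axis_idx]."""
    ox, oy = points[origin_idx]
    ax, ay = points[axis_idx]
    length = math.hypot(ax - ox, ay - oy)
    if length == 0:
        raise ValueError("basis points must be distinct")
    ux, uy = (ax - ox) / length, (ay - oy) / length  # unit x-axis
    coords = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        # Project onto the x-axis (ux, uy) and the y-axis (-uy, ux).
        coords.append((dx * ux + dy * uy, -dx * uy + dy * ux))
    return coords

# Illustrative figure (assumed) and a copy rotated by 90 degrees
# and translated by (5, -2).
figure = [(1.0, 1.0), (1.0, 3.0), (0.0, 1.0)]
moved = [(4.0, -1.0), (2.0, -1.0), (4.0, -2.0)]

rfs_figure = to_reference_frame(figure, 0, 1)
rfs_moved = to_reference_frame(moved, 0, 1)
print(rfs_figure)
print(rfs_moved)
```

This frame fixes translation and rotation only. Handling size as well would require dividing the coordinates by the basis length, which is not done here.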
Reference frame system, example
Model (1,3): [(0,0) (6,2) (8,0) (9,4) (6,10) (3,8) (-1,6)]
Query (a,c): [(0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)]
Model (3,5): [(0,0) (1,8) (2,2) (4,-2) (10,0) (8,3) (8,7)]
Model (1,3) against query (a,c): four coinciding points
Model (3,5) against query (a,c): only the origins coincide
Comparison of (Reference) Frame Systems
The number of coinciding points depends on the bases
We should therefore try all possible pairs as bases
This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant
Geometric hashing is used to perform many comparisons "simultaneously" and efficiently
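For contrast, the exhaustive search that hashing is meant to avoid can be sketched as below. It tries every ordered basis pair in both figures, so it performs exactly m(m-1)n(n-1) comparisons. The helper names, the tolerance `tol` and the sample figures are assumptions for illustration.

```python
import math
from itertools import permutations

def frame_coords(points, i, k):
    """Coordinates of all points in the frame with origin points[i]
    and x-axis toward points[k]."""
    (ox, oy), (ax, ay) = points[i], points[k]
    length = math.hypot(ax - ox, ay - oy)
    ux, uy = (ax - ox) / length, (ay - oy) / length
    return [((x - ox) * ux + (y - oy) * uy, -(x - ox) * uy + (y - oy) * ux)
            for x, y in points]

def count_coinciding(coords_a, coords_b, tol=1e-6):
    """Count points of A that land on a distinct point of B (within tol)."""
    used, count = set(), 0
    for a in coords_a:
        for j, b in enumerate(coords_b):
            if j not in used and math.dist(a, b) <= tol:
                used.add(j)
                count += 1
                break
    return count

def brute_force(A, B):
    """Try all basis pairs; return (best count, bases, comparisons made)."""
    best_count, best_bases, comparisons = 0, None, 0
    for basis_a in permutations(range(len(A)), 2):
        coords_a = frame_coords(A, *basis_a)
        for basis_b in permutations(range(len(B)), 2):
            comparisons += 1
            c = count_coinciding(coords_a, frame_coords(B, *basis_b))
            if c > best_count:
                best_count, best_bases = c, (basis_a, basis_b)
    return best_count, best_bases, comparisons

# Illustrative data (assumed): B holds three points of A after a
# 90-degree rotation and a shift, plus one unrelated point (9, 9).
A = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
B = [(5.0, 7.0), (9.0, 9.0), (5.0, 5.0), (4.0, 7.0)]
best_count, best_bases, comparisons = brute_force(A, B)
print(best_count, best_bases, comparisons)
```

Each comparison also costs a pass over the points, which is why the quartic number of basis pairs quickly becomes impractical for protein-sized figures.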
Hashing
• Compare a query frame system to all model frame systems simultaneously
• Assume a 2D hash table H and a simple hashing function
• One bucket for each square of the frame system, identified by (p,q)
• Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q) (a very simple hash function)
• H is filled in a preprocessing of the model
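The preprocessing step can be sketched as follows, under some assumptions not stated on the slide: squares have side `cell` and are centered on grid points (coordinates are rounded to the nearest grid index), every ordered pair of model points is used as a basis, and the basis points themselves are stored too. The names `frame_coords`, `bucket` and `build_hash_table` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations

def frame_coords(points, i, k):
    """Coordinates of all points in the frame with origin points[i]
    and x-axis toward points[k]."""
    (ox, oy), (ax, ay) = points[i], points[k]
    length = math.hypot(ax - ox, ay - oy)
    ux, uy = (ax - ox) / length, (ay - oy) / length
    return [((x - ox) * ux + (y - oy) * uy, -(x - ox) * uy + (y - oy) * ux)
            for x, y in points]

def bucket(coord, cell=1.0):
    """Grid square (p, q) for a coordinate: nearest grid index per axis."""
    return (round(coord[0] / cell), round(coord[1] / cell))

def build_hash_table(model, cell=1.0):
    """H[(p, q)] = set of bases (u, v) whose frame system has a point
    in square (p, q)."""
    H = defaultdict(set)
    for basis in permutations(range(len(model)), 2):
        for c in frame_coords(model, *basis):
            H[bucket(c, cell)].add(basis)
    return H

# Illustrative model (assumed).
model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
H = build_hash_table(model)
print(sorted(H[(2, 1)]))
```

Points that fall near a square boundary can land in different buckets after small coordinate noise; a practical implementation would need to handle that, for example by also checking neighbouring squares.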
Hashing preprocessing example
Recognition
• Compare the query with the model (several frame systems)
• Select a basis in the query and define a reference frame
• Find the positions in the reference frame of all the other points
• For each point r in the query reference system do
    Calculate the position (x,y) in H that r hashes to
    Vote one for each model reference system in H(x,y)
  End
• Recognize the model reference systems with the highest votes
• Repeat for more query reference systems if not enough coinciding points are found
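The voting loop can be sketched as below for a single query basis. The table construction uses the same assumptions as the preprocessing (unit squares centered on grid points, all ordered pairs as bases), and each occupied query square casts one vote per model basis stored there. All names and the sample figures are illustrative, not from the slides.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def frame_coords(points, i, k):
    """Coordinates of all points in the frame with origin points[i]
    and x-axis toward points[k]."""
    (ox, oy), (ax, ay) = points[i], points[k]
    length = math.hypot(ax - ox, ay - oy)
    ux, uy = (ax - ox) / length, (ay - oy) / length
    return [((x - ox) * ux + (y - oy) * uy, -(x - ox) * uy + (y - oy) * ux)
            for x, y in points]

def bucket(coord, cell=1.0):
    """Grid square (p, q) for a coordinate: nearest grid index per axis."""
    return (round(coord[0] / cell), round(coord[1] / cell))

def build_hash_table(model, cell=1.0):
    """Preprocessing: H[(p, q)] = set of model bases with a point there."""
    H = defaultdict(set)
    for basis in permutations(range(len(model)), 2):
        for c in frame_coords(model, *basis):
            H[bucket(c, cell)].add(basis)
    return H

def recognize(H, query, query_basis, cell=1.0):
    """Vote for model bases using one query basis; returns a Counter."""
    votes = Counter()
    squares = {bucket(c, cell) for c in frame_coords(query, *query_basis)}
    for square in squares:
        for model_basis in H.get(square, ()):
            votes[model_basis] += 1
    return votes

# Illustrative data (assumed): the query holds three model points after a
# 90-degree rotation and a shift, plus one unrelated point (9, 9).
model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0)]
query = [(5.0, 7.0), (9.0, 9.0), (5.0, 5.0), (4.0, 7.0)]

H = build_hash_table(model)
votes = recognize(H, query, query_basis=(2, 0))
best_basis, best_votes = votes.most_common(1)[0]
print(best_basis, best_votes)
```

One table lookup per query point replaces a full comparison against every model frame system. If the top vote count is too low, the same call is repeated with another query basis.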
Example recognition
Query (a,c): [(0,0) (3,-2) (8,0) (6,2) (10,4) (3,8) (0,6)]
Use of several models
• Can have several models in the same hash table
• Must then identify both model and reference system in the hash table
• Example: a database of structures, stored in a hash table
Geometric hashing for structure comparison
• Need methods invariant under translation and rotation
• Use geometric hashing to find subsets and coincident residues (points), i.e., residues that superpose well
1. Define reference frames
   Three atoms can be used: ai, ak, ar. Example:
   - Origin at ai
   - The x-axis along ai, ak
   - The y-axis in the plane defined by ai, ak, ar, counterclockwise
   - The z-axis orthogonal to the plane
2. The residues may have labels (attributes)
   1. Implement the labels explicitly: in the hash table
   2. Implement the labels implicitly: in the hashing