EECS 730 Introduction to Bioinformatics Structure Comparison Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/


EECS 730 Introduction to Bioinformatics

Structure Comparison

Luke Huan
Electrical Engineering and Computer Science

http://people.eecs.ku.edu/~jhuan/


Protein Structure Similarity


Secondary Structure Elements: helices, strands/sheets & loops


Structure Prediction/Determination

Computational tools
• Homology, threading
• Molecular dynamics

Experimental tools
• NMR spectroscopy
• X-ray crystallography


The State of the Structure Space

• 1990: 250 new structures
• 1999: 2,500 new structures
• 2000: >20,000 structures total
• 2004: ~30,000 structures total

Structures have been determined for only about 10% of known protein sequences

Protein Structure Initiative (PSI)


Structure Similarity
• Refers to how well (or poorly) 3D folded structures of proteins can be aligned
• Expected to reflect functional similarities (interaction with other molecules)

Proteins in the TIM barrel fold family


Alignment of 1xis and 1nar (TIM-Barrels)

[Figure: 1xis and 1nar shown in ribbon format and backbone format, with helix axes marked. Alignment computed by DALI.]

Sayle, R. RasMol. A protein visualization tool. http://www.umass.edu/microbio/rasmol/index2.htm.


Structure Similarity
• Refers to how well (or poorly) 3D folded structures of proteins can be aligned
• Is expected to reflect functional similarities (interaction with other molecules)
• 2007: ~34,000 structures in PDB, ~1,000 different folds (1:34 ratio)


Structure Similarity
• Refers to how well (or poorly) 3D folded structures of proteins can be aligned
• Is expected to reflect functional similarities (interaction with other molecules)
• 2000: ~20,000 structures in PDB, ~4,000 different folds (1:5 ratio)
• Three possible reasons:
  - evolution
  - physical constraints (e.g., few ways to maximize hydrophobic interactions)
  - limits in techniques used for structure determination
• Given a new structure, the probability is high that it is similar to an existing one


Why Compute Structure Similarity?

[Diagram: Sequence → Structure → Function, annotated with sequence similarity.]

• Low sequence similarity may yield very similar structures
• Sometimes high sequence similarity yields different structures


Alignment of 1xis and 1nar (TIM-Barrels)

1xis and 1nar have only 7% sequence identity, but approximately 70% of the residues are structurally similar


Why Compute Structure Similarity?

[Diagram: Sequence → Structure → Function, annotated with sequence similarity and structure similarity.]

• Low sequence similarity may yield very similar structures
• Sometimes high sequence similarity yields different structures
• Structure comparison is expected to provide more pertinent information about functional (dis-)similarity among proteins, especially with non-evolutionary relationships or non-detectable evolutionary relationships


Ill-Posed Problem: Multiple Terminology
• (Dis-)similarity analysis
• Structure comparison
• Alignment, superposition, matching
• Classification

Definitions Applications Methods Issues


A Few Web Sites
• Protein Data Bank (PDB): http://www.rcsb.org/pdb/
• Protein classification:
  - SCOP: http://scop.berkeley.edu/
  - CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
• Protein alignment:
  - DALI: http://www.ebi.ac.uk/dali/
  - LOCK: http://motif.stanford.edu/lock2/


3D Molecular Structure

Collection of (possibly typed) atoms or groups of atoms in some given 3D relative placement

The placement of a group of atoms is defined by the position of a reference point (e.g., the center of an atom) and the orientation of a reference direction

The type can be the atom ID, the amino-acid ID, etc…


Matching of Structures

Two structures A and B match if:

1. Correspondence: There is a one-to-one map between their elements

2. Alignment: There exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold.
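
The two conditions can be checked directly once a correspondence and a candidate transform are in hand. The following is a minimal sketch, assuming the correspondence is simply the list order and that T is supplied as a 3x3 rotation matrix plus a translation vector. The function names, the toy coordinates, and the threshold of 0.5 are illustrative assumptions, not part of the slides.

```python
import math

def apply_rigid(rotation, translation, point):
    """Apply T(x) = R x + t to one 3D point (R given as 3 rows of 3 numbers)."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def rmsd(a_points, b_points):
    """Root-mean-square distance between index-paired points."""
    total = sum(math.dist(a, b) ** 2 for a, b in zip(a_points, b_points))
    return math.sqrt(total / len(a_points))

def structures_match(a_points, b_points, rotation, translation, threshold):
    """True if RMSD(A, T(B)) is below the threshold, pairing points by index."""
    moved = [apply_rigid(rotation, translation, b) for b in b_points]
    return rmsd(a_points, moved) < threshold

# Toy data: B is A rotated 90 degrees about the z-axis and shifted by (5, 5, 5).
A = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
B = [(5, 5, 5), (5, 6, 5), (4, 5, 5)]
undo_rotation = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
undo_translation = (-5, 5, -5)
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(structures_match(A, B, undo_rotation, undo_translation, 0.5))  # True
print(structures_match(A, B, identity, (0, 0, 0), 0.5))              # False
```

The sketch only verifies a given T; finding the correspondence and the transform is the actual comparison problem.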


Complete Match


Alignment of 3adk and 1gky

But a complete match is rarely possible:
• The molecules have different sizes
• Their shapes are only locally similar

Both matching and non-matching secondary structure elements


Partial Match

• Notion of support σ of the match: the match is between σ(A) and σ(B)
• Dual problem:
  - What is the support?
  - What is the transform?
• Often several (many) possible supports
• Small supports → motifs


Mathematical Relative

[Figure: two functions f and g compared by $\|f - g\|^2$ over a support s.]

Over which support?


Application #1: Find Global Similarities Among Protein Structures

Given two protein structures, find the largest similar substructures

For example, a substructure is a subset of Cα atoms or a subset of secondary structure elements in each molecule

• Several possible similarity measures
• Variants: 1-to-1, 1-to-many, many-to-many (PDB)
• Must be automatic (and fast)


Application #2: Classify Proteins
• Many proteins, but relatively few distinct fold families [Chothia, 1992; Holm and Sander, 1996; Brenner et al., 1997]
• Hierarchical classification
  - Insight into functions and structure stabilization
  - Basis for homology and threading
• Manual classification: SCOP [Murzin et al., 1995]
• Increasing size of PDB
• Automatic classifiers: CATH [Orengo et al., 1997]; Pclass [Singh et al.]; FSSP [Holm and Sander]

Class: Similar secondary structure content

Fold: SSEs in similar arrangement

Family: Clear evolutionary relationship


Manual vs. Automatic Classification


Application #3: Find Motif in Protein Structure
• Given a protein structure and a motif (e.g., a small collection of atoms corresponding to a binding site)
• Find whether the motif matches a substructure of the protein
• Variant: One motif against many proteins

Active sites of 1PIP and 5PAD. Only 3 amino-acids participate in the motif


Application #4: Find Pharmacophore

Given:

• Small collection (5-10) of small flexible ligands with similar activity (hence, assumed to bind at same protein site)

• Low-energy conformations (several dozens to few 100’s) for each ligand

Find substructure (pharmacophore) that occurs in at least one conformation of each ligand

Key problem in drug design when binding site is unknown


Application #4: Find Pharmacophore

[Figure: inhibitors of thermolysin (1TLP, 4TMN, 5TMN, 6TMN); clusters of low-energy conformations of 1TLP; the 4 ligands overlapped with their pharmacophore matched.]


Application #5: Search for Ligands Containing a Pharmacophore

Given:

• Database containing several 100,000, or more, small ligands

• A pharmacophore P

Find all ligands that have a low-energy conformation containing P

Data mining of pharmaceutical databases (lead generation)

S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C. Latombe. A Randomized Kinematics-Based Approach to Pharmacophore-Constrained Conformational Search and Database Screening. J. of Computational Chemistry, 21(9):731-747, July 2000


Definitions Applications Methods Issues


Multiple Partial Matches


Distributed Support

[Figure: structures A and B matched over a distributed support σ(A), σ(B), with a gap between the matched regions.]


What is Best?

[Figure: two matches between structures A and B.]

Should gaps be penalized?


What About This?

[Figure: a match between structures A and B.]

Sequence along backbone is not preserved


A similarity measure for partial matches is unlikely to satisfy the triangle inequality
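
A small counterexample makes this concrete. The sketch below models structures as sets of element labels and uses a toy dissimilarity that is an assumption made purely for illustration (it is not a measure from the slides): the score is 0 when the smaller structure is fully contained in the larger one and 1 when nothing is shared.

```python
def partial_dissimilarity(x, y):
    """Toy partial-match measure (illustrative assumption):
    1 minus the shared fraction of the smaller structure."""
    shared = len(x & y)
    return 1.0 - shared / min(len(x), len(y))

# B contains both A and C, but A and C share no elements.
A = {1, 2, 3}
B = {1, 2, 3, 4, 5, 6}
C = {4, 5, 6}

d_ab = partial_dissimilarity(A, B)  # 0.0: A matches part of B
d_bc = partial_dissimilarity(B, C)  # 0.0: C matches part of B
d_ac = partial_dissimilarity(A, C)  # 1.0: no common part

print(d_ac > d_ab + d_bc)  # True: the triangle inequality fails
```

A and C each match a different part of B perfectly, yet they have nothing in common with each other, so d(A, C) exceeds d(A, B) + d(B, C).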


Compute Structure Similarity

• Structure representation
• Similarity measurement
• Computational solution


Structure Representation

• Element-based representation
  - A structure is broken down into a list of structure elements
• We represent a protein structure by its geometry, topology, and attributes:
  - Geometry: the coordinates of the elements
  - Topology: the physical and chemical interactions of the elements
  - Attributes: the physical and chemical attributes of the elements


Structure Representation

There are three major groups of structure representations:
• Point list: treat a protein as a list of points in 3D space
• Point set: treat a protein as a set of points in 3D space
• Graphs: treat a protein as a graph


Comparing two point sets

Similarity measure:

Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 mapping f from P to Q such that

$S(P, Q) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d^2\big(p_i, T(f(p_i))\big)}$ is minimized.

S is called the RMSD (root-mean-squared distance) between the two structures


Comparing two point sets

If m = n, there is a closed-form solution that solves the problem of comparing the two point sets exactly

If m ≠ n, the problem is much harder
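
To illustrate what a closed-form solution looks like, here is a minimal sketch for the 2D case, assuming the correspondence is already known (point i of one set pairs with point i of the other). After centering both sets, the optimal rotation angle follows from a single `atan2`. In 3D the usual closed-form approach is the SVD-based Kabsch algorithm, which is not shown here; the function name `superpose_2d` and the toy coordinates are assumptions for illustration.

```python
import math

def superpose_2d(p, q):
    """Optimal rigid superposition of q onto p in 2D, pairing points by index.
    Returns (theta, rmsd): rotation angle applied to centered q, and the
    resulting minimal RMSD."""
    n = len(p)
    pcx, pcy = sum(x for x, _ in p) / n, sum(y for _, y in p) / n
    qcx, qcy = sum(x for x, _ in q) / n, sum(y for _, y in q) / n
    # Centering removes the translation; only the rotation remains.
    a = [(x - pcx, y - pcy) for x, y in p]
    b = [(x - qcx, y - qcy) for x, y in q]
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    cross = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(cross, dot)  # maximizes the sum of a_i . (R b_i)
    c, s = math.cos(theta), math.sin(theta)
    squared = 0.0
    for (ax, ay), (bx, by) in zip(a, b):
        rx, ry = c * bx - s * by, s * bx + c * by
        squared += (ax - rx) ** 2 + (ay - ry) ** 2
    return theta, math.sqrt(squared / n)

# Toy data: q is p rotated by +90 degrees and translated by (3, 4).
p = [(0, 0), (2, 0), (2, 1)]
q = [(3, 4), (3, 6), (2, 6)]
theta, best_rmsd = superpose_2d(p, q)
print(round(math.degrees(theta), 6), round(best_rmsd, 6))
```

For this toy input the recovered angle is about -90 degrees (the rotation that undoes the one used to build q) and the RMSD is essentially zero.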


Common Point Subset Problem

Find the largest common point subset:

Given two point sets P = {p1, p2, …, pn} and Q = {q1, q2, …, qm} (n ≤ m), find a Euclidean transformation T (rotation + translation) and a 1-1 partial mapping f from P to Q with maximal cardinality such that

d(pi, T(f(pi))) < t for all i defined in f

Also a harder problem (but not an NP-hard problem)


Geometric Hashing

Originally used for automatic visual recognition of geometric figures

The principle:
• We have two geometric figures
  - model A with m points (can have several models)
  - query B with n points
• Discover similar subfigures in A and B, invariant under placement, rotation (and often size)
• Let the figures be described by points
• Try to find the largest set of point pairs from (A, B) that coincide


Coinciding points
• Example in two dimensions
• Find six overlapping pairs: (1,a)(2,d)(3,c)(4,e)(6,f)(7,g)
• The coinciding pairs are independent of the labeling
• Note that the figures can be translated and rotated


Reference frames
• The points of the figures are specified in coordinate systems or reference frames
• A reference frame in 2D can be defined by two points
• Choose two points from A (ai, ak) and two from B (bj, bl), called bases, and define the reference frames (RF) from the bases
  - Example: origin at ai and the x-axis along the line ai,ak, or origin at the middle of ai,ak
• Find the positions in the RF of all the other points, called the reference frame system (RFS)
• "Overlap" (the x,y-axes) RFS_A and RFS_B, and count the number of coinciding points
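
The steps above can be sketched as follows, assuming the first variant of the example (origin at ai, x-axis along ai,ak) and no rescaling. The function names, the tolerance, and the toy figures are assumptions for illustration; the counting is a simple greedy pass rather than an optimal assignment.

```python
import math

def to_reference_frame(points, i, k):
    """Express all points in the frame with origin points[i] and the
    x-axis pointing from points[i] to points[k] (rotation + translation only)."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)  # assumes the two basis points are distinct
    c, s = dx / length, dy / length
    frame = []
    for x, y in points:
        tx, ty = x - ox, y - oy
        frame.append((c * tx + s * ty, -s * tx + c * ty))
    return frame

def count_coinciding(frame_a, frame_b, tol=1e-6):
    """Greedily count points of frame_a that land on an unused point of frame_b."""
    used = set()
    count = 0
    for ax, ay in frame_a:
        for j, (bx, by) in enumerate(frame_b):
            if j not in used and math.hypot(ax - bx, ay - by) <= tol:
                used.add(j)
                count += 1
                break
    return count

# Toy figures: B is A rotated by 90 degrees and shifted, with its last point moved away.
figure_a = [(0, 0), (4, 0), (4, 3), (1, 5)]
figure_b = [(10, 10), (10, 14), (7, 14), (20, 20)]

rfs_a = to_reference_frame(figure_a, 0, 1)
rfs_b = to_reference_frame(figure_b, 0, 1)
print(count_coinciding(rfs_a, rfs_b))  # 3
```

Because both figures are re-expressed relative to their own basis, the rotation and shift between them drop out, and only the displaced fourth point fails to coincide.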


Reference frame system, example

Model (1,3): [(0,0)(6,2)(8,0)(9,4)(6,10)(3,8)(-1,6)]

Query (a,c): [(0,0)(3,-2)(8,0)(6,2)(10,4)(3,8)(0,6)]

Model (3,5): [(0,0)(1,8)(2,2)(4,-2)(10,0)(8,3)(8,7)]

• Model (1,3) vs. query (a,c): four coinciding points
• Model (3,5) vs. query (a,c): only the origins coincide


Comparison of (Reference) Frame Systems

The number of coinciding points depends on the bases

We should therefore try all possible pairs as bases

This would result in m(m-1)n(n-1) comparisons of reference frame systems, but many of those comparisons are redundant

Geometric hashing is used to perform many comparisons "simultaneously" and efficiently
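
For a sense of scale, the count can be evaluated for two sizes. The first uses the 7-point figures of the 2D example earlier in these slides; the second, with 100 points per figure, is a hypothetical size chosen only for illustration.

$$m(m-1)\,n(n-1) = 7 \cdot 6 \cdot 7 \cdot 6 = 1764 \qquad (m = n = 7)$$

$$m(m-1)\,n(n-1) = 100 \cdot 99 \cdot 100 \cdot 99 = 98{,}010{,}000 \qquad (m = n = 100)$$

The count grows with the fourth power of the figure size, which is why the brute-force pairing of reference frame systems becomes impractical quickly.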


Hashing

Compare a query frame system to all model frame systems simultaneously

• Assume a 2D hash table H and a simple hash function
• One bucket for each square of the frame system, identified by (p,q)
• Let (u,v) ∈ H(p,q) mean that the frame system with basis (u,v) has a point in the square (p,q) (a very simple hash function)
• H is filled in a preprocessing of the model
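
The preprocessing step could look roughly like the sketch below. It assumes unit-sized squares, a hash function that rounds each coordinate to the nearest grid cell, and frames defined with the origin at the first basis point and the x-axis toward the second. The names `build_hash_table` and `to_reference_frame`, the `cell` parameter, and the square-shaped toy model are all illustrative assumptions.

```python
import math
from collections import defaultdict
from itertools import permutations

def to_reference_frame(points, i, k):
    """Coordinates of all points in the frame with origin points[i] and
    x-axis toward points[k] (assumes the two basis points are distinct)."""
    ox, oy = points[i]
    dx, dy = points[k][0] - ox, points[k][1] - oy
    length = math.hypot(dx, dy)
    c, s = dx / length, dy / length
    return [(c * (x - ox) + s * (y - oy), -s * (x - ox) + c * (y - oy))
            for x, y in points]

def build_hash_table(model, cell=1.0):
    """H[(p, q)] is the set of bases (u, v) whose frame system has a point
    in grid square (p, q)."""
    table = defaultdict(set)
    for u, v in permutations(range(len(model)), 2):  # every ordered basis
        for x, y in to_reference_frame(model, u, v):
            square = (math.floor(x / cell + 0.5), math.floor(y / cell + 0.5))
            table[square].add((u, v))
    return table

# Toy model: the four corners of a square with side 2.
square_model = [(0, 0), (2, 0), (2, 2), (0, 2)]
H = build_hash_table(square_model)

print((0, 1) in H[(2, 2)])  # True: with basis (0, 1), corner 2 sits at (2, 2)
print(len(H[(0, 0)]))       # 12: every basis puts its own origin in square (0, 0)
```

Points that fall close to a square boundary can land in different buckets under small coordinate noise, so a practical table would need some tolerance handling that this sketch omits.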


Hashing preprocessing example


Recognition
• Compare the query with the model (several frame systems)
• Select a basis in the query and define the reference frame
• Find the positions in the reference frame of all the other points
• For each point r in the query reference system do
    Calculate the position (x,y) in H that r hashes to
    Vote one for each model reference system in H(x,y)
  End
• Recognize the model reference systems with the highest votes
• Repeat for more query reference systems, if not enough coinciding points are found
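
The voting loop can be sketched as below. The coordinates are the model frame systems (1,3) and (3,5) and the query frame system (a,c) from the 2D example in these slides; the table is filled by hand from those two frame systems only, and the function names and the unit-square hash are assumptions for illustration.

```python
import math
from collections import Counter

def square_of(x, y, cell=1.0):
    """Simple hash: the grid square nearest to (x, y)."""
    return (math.floor(x / cell + 0.5), math.floor(y / cell + 0.5))

def vote(table, query_frame, cell=1.0):
    """One vote per model basis found in the square each query point hashes to.
    Returns (basis, votes) pairs, highest vote count first."""
    votes = Counter()
    for x, y in query_frame:
        for basis in table.get(square_of(x, y, cell), ()):
            votes[basis] += 1
    return votes.most_common()

# Model frame systems, keyed by their basis.
model_frames = {
    (1, 3): [(0, 0), (6, 2), (8, 0), (9, 4), (6, 10), (3, 8), (-1, 6)],
    (3, 5): [(0, 0), (1, 8), (2, 2), (4, -2), (10, 0), (8, 3), (8, 7)],
}

# Preprocessing: fill the hash table from the model frame systems.
table = {}
for basis, points in model_frames.items():
    for x, y in points:
        table.setdefault(square_of(x, y), set()).add(basis)

query_ac = [(0, 0), (3, -2), (8, 0), (6, 2), (10, 4), (3, 8), (0, 6)]
ranked = vote(table, query_ac)
print(ranked)  # [((1, 3), 4), ((3, 5), 1)]
```

The result agrees with the example: model basis (1,3) collects four votes against the query basis (a,c), while (3,5) gets only the vote from the shared origin.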


Example recognition

Query (a,c): [(0,0)(3,-2)(8,0)(6,2)(10,4)(3,8)(0,6)]


Use of several models

• Can have several models in the same hash table
• Must then identify model and reference system in the hash table
• Example: Have a database of structures, stored in a hash table


Geometric hashing for structure comparison
• Need methods invariant under translation and rotation
• Use geometric hashing to find subsets and coincident residues (points), i.e., residues that superpose well

1. Define reference frames
   Three atoms can be used: ai, ak, ar. Example:
   • Origin in ai
   • The x-axis along ai,ak
   • The y-axis in the plane defined by ai, ak, ar, counterclockwise
   • The z-axis orthogonal to the plane

2. The residues may have labels (attributes)
   1. Implement the labels explicitly: in the hash table
   2. Implement the labels implicitly: in the hashing
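
The three-atom reference frame from step 1 can be built with two cross products. This is a minimal sketch, assuming the three atoms are not collinear and interpreting "counterclockwise" as choosing the y-axis so that ar has a non-negative y coordinate in a right-handed frame. The helper names and the toy coordinates are assumptions for illustration.

```python
import math

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def normalize(u):
    length = math.sqrt(dot(u, u))
    return (u[0] / length, u[1] / length, u[2] / length)

def reference_frame(ai, ak, ar):
    """Origin at ai, x-axis toward ak, y-axis in the plane of (ai, ak, ar)
    on the side of ar, z-axis orthogonal to that plane."""
    x_axis = normalize(sub(ak, ai))
    z_axis = normalize(cross(x_axis, sub(ar, ai)))
    y_axis = cross(z_axis, x_axis)
    return ai, (x_axis, y_axis, z_axis)

def to_local(point, origin, axes):
    """Coordinates of a point in the given reference frame."""
    d = sub(point, origin)
    return tuple(dot(d, axis) for axis in axes)

# Toy atoms and one extra point to locate in the frame.
origin, axes = reference_frame((1, 1, 1), (3, 1, 1), (1, 4, 1))
local = to_local((2, 3, 5), origin, axes)
print(local)  # approximately (1.0, 2.0, 4.0)

# The same atoms after a 90-degree rotation about the z-axis.
origin2, axes2 = reference_frame((-1, 1, 1), (-1, 3, 1), (-4, 1, 1))
local2 = to_local((-3, 2, 5), origin2, axes2)
print(local2)  # approximately the same local coordinates
```

The local coordinates are unchanged by the rigid motion, which is the invariance that geometric hashing relies on when these coordinates are used as hash keys.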