2009 CSBB LAB New Student Training

Protein structure concepts and related computational problems

Speaker: Chia Han Chu (PhD candidate)

21/07/2009 NTHU CSBB lab

What are proteins made of?
• The parts of a protein: backbone and side chain
[Figure: amino acid structure with the H and OH end groups labelled]
“Backbone”: N, C, C, N, C, C…
R: “side chain”

What are proteins made of?
• By substituting different R groups, twenty amino acids can be formed. They are grouped according to the chemical and physical properties (e.g. size) of R.


What are proteins made of?
• A peptide is a substance intermediate in size between an amino acid (a.a. for short) and a protein.
• The smallest unit is an a.a. and the largest is a protein.
• Two or more a.a. form a peptide through peptide bond formation, with removal of water.


What are proteins made of?
• Dipeptide and peptide bond
[Figure labels: carboxyl group, amino group, dehydration]
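The dehydration step can be checked with simple arithmetic: each peptide bond releases one water molecule, so a linear peptide of n amino acids weighs the sum of the free amino acids minus (n − 1) waters. The sketch below is only an illustration; the function name `peptide_mass` and the three-entry mass table are assumptions made for this example, using standard average molecular masses.

```python
# Average molecular masses in g/mol (small illustrative subset, not a full table).
WATER = 18.015
FREE_AMINO_ACID_MASS = {"Gly": 75.067, "Ala": 89.094, "Ser": 105.093}


def peptide_mass(residues):
    """Mass of a linear peptide: free amino acids minus one water per peptide bond."""
    if not residues:
        raise ValueError("need at least one amino acid")
    n_bonds = len(residues) - 1  # n amino acids are joined by n - 1 peptide bonds
    total = sum(FREE_AMINO_ACID_MASS[name] for name in residues)
    return total - n_bonds * WATER


if __name__ == "__main__":
    # Dipeptide Gly-Ala: two amino acids, one peptide bond, one water removed.
    print(round(peptide_mass(["Gly", "Ala"]), 3))
```

A single amino acid has no peptide bond, so its mass is returned unchanged.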


What is protein structure?
• Proteins are linear polymers that fold up by themselves… mostly.


What is protein structure?
• Quaternary structures
– Proteins that comprise more than one polypeptide chain
– Each polypeptide chain in such a protein is called a subunit
Example: Hemoglobin


What are the main secondary structures?
• A common motif in the secondary structure of proteins, the alpha helix (α-helix) is a coiled conformation that is almost always right-handed in proteins.
• 3.6 amino acids (residues) per turn
• The backbone C=O of residue i hydrogen-bonds to the backbone N–H of residue i+4: O(i) → N(i+4)
Source: Wikipedia
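The two numbers on this slide can be turned into a small calculation. The sketch below lists the O(i) to N(i+4) hydrogen-bond partners for a helix of a given length and derives the rotation per residue from 3.6 residues per turn. The function names are made up for this illustration, and residues are numbered from 1.

```python
RESIDUES_PER_TURN = 3.6  # value quoted on the slide


def helix_hbond_pairs(n_residues, offset=4):
    """Return (i, i + offset) pairs: C=O of residue i bonded to N-H of residue i + offset."""
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


def rotation_per_residue(residues_per_turn=RESIDUES_PER_TURN):
    """Angle in degrees swept around the helix axis by one residue."""
    return 360.0 / residues_per_turn


if __name__ == "__main__":
    print(helix_hbond_pairs(8))    # pairs for an 8-residue helix
    print(rotation_per_residue())  # degrees per residue
```

Note that the first four N–H groups and the last four C=O groups of an isolated helix have no partner inside the helix, which is why the list is shorter than the helix.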


• A beta strand (also β-strand) is a stretch of amino acids, typically 5–10 amino acids long, whose peptide backbones are almost fully extended.
• The β sheet (also β-pleated sheet) is the second form of regular secondary structure in proteins. It consists of beta strands connected laterally by three or more hydrogen bonds, forming a generally twisted, pleated sheet.
Figure source: Wikipedia



• Parallel and anti-parallel sheets
[Figure panels: Parallel, Anti-parallel]


• Loops
• Connect the secondary structure elements (helix or strand)
• Have various lengths and shapes
• Located at the surface of the folded protein and therefore may have an important role in biological recognition processes
• Proteins that are evolutionarily related have the same helices and sheets but may vary in loop structures
Figure 2.8, Branden & Tooze


What are the super-secondary structures?
• Simple combinations of secondary structural elements, called motifs or supersecondary structures
[Figure panels: Beta hairpin, Beta-alpha-beta unit, Helix hairpin]



• Assemblies of secondary structures which are shared by many structures
– β hairpin
– Greek key
– β-α-β: found in almost every protein structure with a parallel β-sheet


What is a protein domain?
• A protein domain is a part of a protein's sequence and structure that can evolve, function, and exist independently of the rest of the protein chain.
• Each domain forms a compact three-dimensional structure and often can be independently stable and folded.
• One domain may appear in a variety of evolutionarily related proteins.
• Domains vary in length from about 25 a.a. up to 500 a.a.
Figure: pyruvate kinase, a protein with three domains (PDB 1pkn). Source: Wikipedia.

Domain 1

Domain 2

Domain 3


What is a protein domain?
• Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin.
• The EF hand is a helix-loop-helix structural domain found in a large family of calcium-binding proteins.
• One example is the protein parvalbumin, which contains three such motifs and is probably involved in muscle relaxation via its calcium-binding activity.
Figure: calmodulin with four EF-hand motifs. Source: Wikipedia.

Figure label: loop region (usually about 12 amino acids)


What is a protein domain?
• Because domains are independently stable, they can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.
Figure: BS-RNase. Source: 3D domain swapping: A mechanism for oligomer assembly, Protein Science (1995).


General concepts for structural bioinformatics
[Diagram: Sequence, Structure and Function, linked by the activities Analysis, Classification, Prediction, Modelling, Design and Engineering]


Structure Databases
• Original database: PDB
– Only one central repository for experimentally determined macromolecular structures: the Protein Data Bank (PDB)
– Established 1971
– Walter Hamilton @ Brookhaven
– 7 structures
– “PDB format”
– Magnetic tape distribution


Other primary structure databases
• NDB: Nucleic acid Data Base
– Most structures also in PDB
• BMRB: BioMagResBank
– Experimental NMR data
– Joined wwPDB in 2006
• CSD: Cambridge Structural Database
– Small molecules, including some peptides and antibiotics
– Access requires a paid license


Structure Databases
• PDB accepts experimental structures of “biopolymers”
• When is a biomolecule big enough?
– Polypeptides: > 23 residues
– Polynucleotides: > 3 residues ??
– Polysaccharides: > 3 sugar residues
– Fibers (only repeating unit deposited)
• Where do smaller molecules go?
– Deposit at the Cambridge Crystallographic Data Centre (CCDC) or NDB


Structure Databases
• International effort
– Curated by RCSB (USA), PDBe (EBI-MSD; Europe) and PDBj (Japan), plus BMRB (USA) for NMR data
• > 58000 structures (July, 2009)
• Distributed over the internet
• Updated daily
• “The PDB” = ftp archive of “flat” files in PDB format


Structure Databases


Structure Databases
• Redundancy
– There are > 58000 structures (July, 2009)
– There are > 120,000 chains
• Multiple copies per entry (e.g. dimers, viruses)
– However, there are only ~ 8600 unique proteins. Why?
• Non-protein entries (DNA, RNA, carbohydrates, antibiotics)
• Different laboratories
• Complexes
• Mutants
• Paralogs and orthologs


Structure Databases
• To err is human...
– Experimental structures
• May contain errors!
• Need for validation!


Structure Databases
• PDB files


Structure Databases
• Other formats
– PDB format is not compatible with modern database technology
– Internally, wwPDB uses
• ORACLE for web-services
– Exchange formats
• mmCIF: macromolecular Crystallographic Information File
• XML: eXtensible Markup Language


Structure Databases
• wwPDB front-ends
– Several front-ends provide raw and derived data, and links to other databases, for all PDB entries.
• RCSB (often, inaccurately, called “PDB”)
• PDBe
• PDBj
• OCA
• PDBsum (lots of derived information)
• MMDB (integrated with all of NCBI’s databases)
• Jena Library


Structure Databases
• Is wwPDB enough?
– All entries in the RCSB PDB are whole proteins or parts of proteins.
– However, biologists are often interested in the relationships between the basic protein units, the domains, rather than between whole proteins.
– Q: How do you extract the domains from PDB?


Structure Classification Databases
• Structural alignment can be used to classify known (and new!) structures
– SCOP (manual)
– FSSP/DDD (automatic)
– CATH (mixed)


Structure Classification Databases
• SCOP database
– Structural Classification Of Proteins (SCOP for short)
– It was created and is organized by the University of Cambridge, UK.
– The SCOP database aims to provide a detailed and comprehensive description of the structural and functional relationships between all proteins whose structure is known.
– Proteins are classified to reflect both structural and evolutionary relatedness.
– Classification is done manually.



– The basic unit of classification is the protein domain.

– SCOP hierarchy



– sunid, a new SCOP identifier, is simply a number which uniquely identifies each entry in the SCOP hierarchy, from root to leaves.
– sccs, a new concise classification string, is a compact representation of a SCOP domain classification that includes only the most relevant levels: class, fold, superfamily and family.
– For example, PDB entry 1g61, chain A:
• sunid: cl=53931, cf=55908, sf=55909, fa=55910, dm=55911, sp=55912, px=41126

• sccs: d.126.1.1
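Both identifiers are easy to take apart programmatically. The following is a minimal sketch, assuming the sccs always has exactly four dot-separated fields (class, fold, superfamily, family) and that sunids are given as comma-separated `level=number` items as on the slide. The names `Sccs`, `parse_sccs` and `parse_sunids` are hypothetical helpers, not part of any SCOP tool.

```python
from typing import NamedTuple


class Sccs(NamedTuple):
    scop_class: str
    fold: int
    superfamily: int
    family: int


def parse_sccs(sccs: str) -> Sccs:
    """Split a string such as 'd.126.1.1' into class, fold, superfamily, family."""
    parts = sccs.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {sccs!r}")
    scop_class, fold, superfamily, family = parts
    return Sccs(scop_class, int(fold), int(superfamily), int(family))


def parse_sunids(text: str) -> dict:
    """Parse 'cl=53931,cf=55908,...' into a {level: sunid} dictionary."""
    result = {}
    for item in text.split(","):
        if item.strip():
            level, number = item.split("=")
            result[level.strip()] = int(number)
    return result


if __name__ == "__main__":
    print(parse_sccs("d.126.1.1"))
    print(parse_sunids("cl=53931,cf=55908,sf=55909,fa=55910,dm=55911, sp=55912,px=41126"))
```

Two domains share a fold when the first two sccs fields agree, and a superfamily when the first three agree, so comparing the parsed tuples field by field gives the deepest common level.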



Family: Clear evolutionary relationship
Proteins are clustered together into families on the basis of one of two criteria that imply a common evolutionary origin.
Criterion 1: All proteins that have residue identities of 30% and greater.
Criterion 2: Proteins with lower sequence identities but whose functions and structures are very similar. For example, globins with sequence identities of 15%.


Superfamily: Probable common evolutionary origin
Families whose proteins have low sequence identities, but whose structures and, in many cases, functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
Example: actin, the ATPase domain of the heat-shock protein, and hexokinase


Fold: Major structural similarity
Superfamilies and families are defined as having a common fold if their proteins have the same major secondary structures in the same arrangement with the same topological connections.
Advantage: There may, however, be cases where a common evolutionary origin is obscured by the extent of the divergence in sequence, structure and function. In these cases, it is possible that the discovery of new structures, with folds between those of the previously known structures, will make clear their common evolutionary relationship.


Class
(1) α-helical domains
(2) β-sheet domains
(3) α/β domains, which consist of "beta-alpha-beta" structural units or "motifs" that form mainly parallel β-sheets
(4) α+β domains, formed by independent α-helices and mainly antiparallel β-sheets
(5) multi-domain proteins (for those with domains of different fold and for which no homologues are known at present)
(6) membrane and cell surface proteins and peptides
(7) small proteins
(8) coiled-coil proteins
(9) low-resolution protein structures
(10) peptides and fragments
(11) designed proteins of non-natural sequence


Information comes from Murzin, A., Brenner, S.E., Hubbard, T.J.P. and Chothia, C. (1995) SCOP: a Structural Classification of Proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540, and from Wikipedia.



• All α: secondary structure exclusively or almost exclusively α-helical

• All β: secondary structure exclusively or almost exclusively β sheets

• α/β: helices and sheet assembled from β-α-β units

• α+β: α helices and β sheets separated in different parts of the molecule; absence of β-α-β motifs



• A glance at the SCOP website



• CATH classification
– C = Class
• Mainly α, mainly β, mixed α/β, few SSEs
– A = Architecture
• Overall domain shape and orientation, but not connectivity, of SSEs
– T = Topology = fold
– H = Homologous superfamily
• Groups proteins thought to share a common ancestor



• CATH classification
– Lower levels are sequence-based (%SI = percent sequence identity)
• S = %SI ≥ 35%
• O = %SI ≥ 60%
• L = %SI ≥ 95%
• I = %SI = 100%
– D = domain
• Individual domains for each I-level
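The sequence-based levels are nested: a pair of domains that clusters at a stricter identity cut-off also clusters at every looser one. The sketch below makes that explicit. The function name is hypothetical and the cut-offs are passed as a parameter; the default values (35, 60, 95, 100) are an assumption based on the commonly documented CATH scheme and should be adjusted if a different release uses other thresholds.

```python
# (level name, minimum percent sequence identity); assumed defaults, see lead-in.
DEFAULT_CUTOFFS = (("S", 35.0), ("O", 60.0), ("L", 95.0), ("I", 100.0))


def shared_sequence_levels(percent_identity, cutoffs=DEFAULT_CUTOFFS):
    """Return the sequence-based levels at which two domains would cluster together."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent identity must lie between 0 and 100")
    return [name for name, cutoff in cutoffs if percent_identity >= cutoff]


if __name__ == "__main__":
    for identity in (27.0, 70.0, 100.0):
        print(identity, shared_sequence_levels(identity))
```

An empty list means the two domains can only be related through the structure-based levels (C, A, T, H).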



• CATH classification


Structure – sequence relationship

• Two conserved sequences → similar structures (sure)

• Two similar structures → conserved sequences?

Human myoglobin, pdb:2mm1

Human hemoglobin alpha-chain, pdb:1jebA

Sequence identity: 27%

Structural identity: 90%
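A sequence identity figure like the one quoted here is computed from a pairwise alignment. The sketch below shows one common convention, assumed for this example: count identical residues over the aligned columns where neither sequence has a gap. Other conventions divide by the shorter sequence length or by the full alignment length, so reported percentages can differ. The sequences in the usage lines are toy strings, not the real globin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy alignment: 5 gap-free columns, 4 of them identical.
    print(percent_identity("ACDE-G", "ACEEFG"))
```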

Principles of Protein Structure

• Today's proteins reflect millions of years of evolution

• 3D structure is better conserved than sequence during evolution

• Similarities among sequences or among structures may reveal information about shared biological functions of a protein family


Why structural alignment?
• In evolutionarily related proteins, structure is much better preserved than sequence
• Similar structures may predict similar biological function
• Gaining insight into protein folding
• Two structures are similar when they can be superimposed well.


Structure superimposition
• What is the best transformation that superimposes the unicorn on the lion?


Structure superimposition
• This is not a good result…


Structure superimposition
• Good result:


Structure superimposition
• Find the transformation matrix that best overlaps the table and the chair
• I.e., find the transformation matrix that minimizes the root mean square deviation between corresponding points of the table and the chair


Kinds of transformations
• Rotation
• Translation
• Scaling
• And more…
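A small 2-D sketch of the three transformations named above, kept in two dimensions to match the X–Y figures on the following slides. The function names are illustrative only. The point of the example is that rotation and translation are rigid (all pairwise distances are preserved), whereas scaling changes distances, which is why structure superimposition normally restricts itself to the first two.

```python
import math


def translate(points, dx, dy):
    """Shift every point by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]


def rotate(points, angle_deg):
    """Rotate every point counter-clockwise about the origin."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def scale(points, factor):
    """Multiply every coordinate by the same factor."""
    return [(factor * x, factor * y) for x, y in points]


def pairwise_distances(points):
    """Distances between all unordered pairs of points, in a fixed order."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


if __name__ == "__main__":
    shape = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    rigid = translate(rotate(shape, 90.0), 3.0, -1.0)
    print(pairwise_distances(shape))
    print(pairwise_distances(rigid))              # same distances up to rounding
    print(pairwise_distances(scale(shape, 2.0)))  # distances doubled
```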


[Figures: Translation, Rotation and Scale, each illustrated on X–Y axes]


Correspondence is Unknown
• Given two configurations of points in three-dimensional space


Correspondence is Unknown
• Find those rotations and translations of one of the point sets which produce “large” superimpositions of corresponding 3-D points

Correspondence is Unknown
• Simple case: two closely related proteins with the same number of amino acids.
Question: how do we assess the quality of the transformation?


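Scoring the Alignment
• Two point sets: A = {a_i}, i = 1…n, and B = {b_j}, j = 1…m
• Pairwise correspondence: (a_{k_1}, b_{t_1}), (a_{k_2}, b_{t_2}), …, (a_{k_N}, b_{t_N})
• RMSD (Root Mean Square Distance):
\[ \mathrm{RMSD} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \lVert a_{k_i} - b_{t_i} \rVert^2 } \]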
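The RMSD over a given correspondence can be computed directly from its definition. This is a minimal sketch: points are 3-D tuples, the correspondence is a list of index pairs (k, t) into A and B (0-based here, unlike the slide's 1-based indices), and the function name is chosen for this example. No superposition is performed; the points are compared where they stand.

```python
import math


def rmsd(a, b, correspondence):
    """Root mean square distance over the matched pairs (a[k], b[t])."""
    if not correspondence:
        raise ValueError("correspondence must contain at least one pair")
    squared = sum(math.dist(a[k], b[t]) ** 2 for k, t in correspondence)
    return math.sqrt(squared / len(correspondence))


if __name__ == "__main__":
    A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]  # n = 3 points
    B = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]                   # m = 2 points
    pairs = [(0, 0), (1, 1)]  # N = 2 matched pairs; A[2] is left unmatched
    print(rmsd(A, B, pairs))
```

Because only N matched pairs enter the sum, the two point sets may have different sizes, as in the usage lines.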


Scoring the Alignment
• Given two sets of 3-D points: P = {p_i}, Q = {q_i}, i = 1, …, n:
\[ \mathrm{rmsd}(P, Q) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} |p_i - q_i|^2 } \]
• Find a 3-D transformation T* such that:
\[ \mathrm{rmsd}(T^*(P), Q) = \min_T \sqrt{ \frac{1}{n} \sum_{i=1}^{n} |T(p_i) - q_i|^2 } \]
Find the highest number of atoms aligned with the lowest RMSD
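When the correspondence is known and T is restricted to rigid-body motions, the minimizing transform has a closed-form solution. The slides do not name a method; the sketch below uses the Kabsch algorithm (centroid alignment followed by an SVD of the covariance matrix), which is one standard choice, and it assumes NumPy is available. The function name and return convention are made up for this example.

```python
import numpy as np


def kabsch_superimpose(P, Q):
    """Return (R, t, rmsd) so that P @ R.T + t is the least-squares overlay of P on Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (n, 3)")
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean        # centre both point sets
    H = Pc.T @ Qc                          # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    aligned = P @ R.T + t
    rmsd = float(np.sqrt(np.mean(np.sum((aligned - Q) ** 2, axis=1))))
    return R, t, rmsd


if __name__ == "__main__":
    P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]], dtype=float)
    angle = np.radians(30.0)
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                       [np.sin(angle), np.cos(angle), 0.0],
                       [0.0, 0.0, 1.0]])
    Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])  # rotated and shifted copy of P
    R, t, err = kabsch_superimpose(P, Q)
    print(np.round(R, 3), np.round(t, 3), err)
```

This only answers the second half of the slide's goal. Choosing which atoms to pair, and how many, is the harder combinatorial part discussed on the following slides.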


Matching of structures
• Two structures A and B match iff:
1. Correspondence: there is a one-to-one map between their elements
2. Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε.


Matching of structures• Complete match


Matching of structures
• But a complete match is rarely possible
– The molecules have different sizes
– Their shapes are only locally similar
Figure: alignment of 3adk and 1gky


Matching of structures
• Notion of support σ of the match: the match is between σ(A) and σ(B)
• Dual problem:
– What is the support?
– What is the transform?
• Often several (many) possible supports
• Small supports → motifs


Matching of structures
• Mathematical relative: comparing two functions f and g by ||f − g||² over a support s
Over which support?

Matching of structures• Multiple Partial Matches


Matching of structures
• What is best?
[Figure: two alternative partial alignments of chains A and B]
Should gaps be penalized?


Matching of structures
• What about this?
[Figure: an alignment of chains A and B]
Sequence along backbone is not preserved


Matching of structures
• A similarity measure is unlikely to satisfy the triangle inequality for partial matches


Scoring Issues
• Trade-off between size of σ and RMSD
• How should gaps be counted?
• Is there a “quality” of the correspondence? [The correspondence may, or may not, satisfy type and/or backbone sequence preferences]
• Should accessible surface be given more importance?
• Similarity measure may be different from the inverse of RMSD (though no consensus on best measure!)
• But RMSD is computationally very convenient!


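RMSD vs. similarity measure

RMSD dissimilarity measure:
\[ \min_T \sqrt{ \frac{1}{|\sigma(T)|} \sum_{i \in \sigma(T)} |a_i - T(b_i)|^2 } \]
Emphasizes differences → smaller support

STRUCTAL’s similarity measure:
\[ \max_T \left( \sum_{i \in \sigma(T)} \frac{A}{1 + |a_i - T(b_i)|^2 / B} - \frac{A \, N_{GAP}}{2} \right) \]
Emphasizes similarities → larger support

The second term of the STRUCTAL measure is the gap penalty.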
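The contrast between the two measures shows up in a small numerical experiment. For a fixed transform, RMSD is sqrt(Σ d_i² / N) over the matched distances d_i, while a STRUCTAL-style score sums A / (1 + d_i² / B) over the matched pairs and subtracts A/2 per gap. The sketch below is only an illustration of that behaviour; the constants A = 20 and B = 5 Å² are assumed values often quoted for STRUCTAL, and the distance lists are invented.

```python
import math


def rmsd_from_distances(distances):
    """RMSD for a fixed transform, given the matched-pair distances."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def structal_like_score(distances, n_gaps=0, a=20.0, b=5.0):
    """Sum of a / (1 + d^2 / b) over matched pairs, minus a / 2 per gap (assumed constants)."""
    return sum(a / (1.0 + d * d / b) for d in distances) - 0.5 * a * n_gaps


if __name__ == "__main__":
    core = [0.5] * 50              # 50 tightly matched pairs
    extended = core + [4.0] * 30   # the same core plus 30 loosely matched pairs
    print(rmsd_from_distances(core), rmsd_from_distances(extended))
    print(structal_like_score(core), structal_like_score(extended))
```

Extending the support with loosely matched pairs makes the RMSD worse but still raises the similarity score, because every matched pair contributes a positive term. That is the sense in which RMSD favours smaller supports and the similarity score favours larger ones.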

Comparison of Similarity Measures

• A.C.M. May. Toward more meaningful hierarchical classification of amino acids scoring functions. Protein Engineering, 12:707-712, 1999. Reviews 37 protein structure similarity measures.

• Defining a similarity score is difficult, probably because structure comparison is an ill-posed problem with multiple solutions.


Bottom Line
• Finding an optimal partial match is NP-hard: no fast algorithm is guaranteed to give an optimal answer for any given measure [Godzik, 1996]
– Heuristic/approximate algorithms
– Probably not a single solution, but application-dependent solutions
– But there exist general algorithmic principles


Algorithms for structure superimposition

• Distance based methods
– DALI (Holm and Sander): Aligning scalar distance plots
– STRUCTAL (Gerstein and Levitt): Dynamic programming using pairwise inter-molecular distances
– SSAP (Orengo and Taylor): Dynamic programming using intramolecular vector distances
– MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area

• Vector based methods
– VAST (Bryant): Graph theory based secondary structure alignment
– 3dSearch (Singh and Brutlag): Fast secondary structure index lookup

• Both vector and distance based
– LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances


Dali

An intra-molecular distance plot for myoglobin


Dali
• http://www.ebi.ac.uk/dali/
• Based on aligning 2-D intra-molecular distance matrices
• Computes the best subset of corresponding residues from the two proteins such that the similarity between the 2-D distance matrices is maximized
• Searches through all possible alignments of residues using Monte-Carlo and branch-and-bound algorithms
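The key property behind this approach is that an intra-molecular distance matrix does not change when the molecule is rotated or translated, so two structures can be compared without superimposing them first. The sketch below builds such a matrix and compares the sub-matrices selected by a candidate residue correspondence. It is not Dali's actual scoring function; the mean absolute difference used here is a deliberately simple stand-in, and all names are made up for this example.

```python
import math


def distance_matrix(coords):
    """All-against-all distances within one structure (e.g. between C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def submatrix_difference(dm_a, dm_b, pairs):
    """Mean |d_A(i, j) - d_B(i', j')| over all pairs of matched residues."""
    total, count = 0.0, 0
    for x, (ia, ib) in enumerate(pairs):
        for ja, jb in pairs[x + 1:]:
            total += abs(dm_a[ia][ja] - dm_b[ib][jb])
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
    b = [(x + 10.0, y - 2.0, z + 1.0) for x, y, z in a]  # translated copy of a
    pairs = [(0, 0), (1, 1), (2, 2)]
    print(submatrix_difference(distance_matrix(a), distance_matrix(b), pairs))
```

A search procedure such as the Monte-Carlo step mentioned above would then look for the set of pairs that keeps this difference small while covering as many residues as possible.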


VAST


VAST
• http://www.ncbi.nih.gov/Structure/VAST/vast.shtml
• Aligns only secondary structure elements (SSEs)
• Represents each SSE as a vector
• Finds all possible pairs of vectors from the two structures that are similar
• Uses a graph theory algorithm to find the maximal subset of similar vectors
• Overall alignment score is based on the number of similar pairs of vectors between the two structures
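The vector representation can be sketched as follows. Each SSE is reduced to a type ("H" for helix, "E" for strand) and a start and end point, and a pair of SSEs in one structure is considered similar to a pair in the other when the types agree and the inter-vector angles are close. This is a simplified illustration rather than VAST's real procedure: the angle-only criterion, the 20° tolerance and every name below are assumptions, and the graph-based search for the largest consistent subset is not shown.

```python
import math


def sse_vector(sse):
    """Direction vector of an SSE given as (type, start_xyz, end_xyz)."""
    _, start, end = sse
    return tuple(e - s for s, e in zip(start, end))


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def similar_sse_pairs(sses_a, sses_b, max_angle_diff=20.0):
    """List ((i, j), (k, l)) where SSEs i, j of A resemble SSEs k, l of B."""
    matches = []
    for i in range(len(sses_a)):
        for j in range(i + 1, len(sses_a)):
            ang_a = angle_deg(sse_vector(sses_a[i]), sse_vector(sses_a[j]))
            for k in range(len(sses_b)):
                for l in range(k + 1, len(sses_b)):
                    same_types = (sses_a[i][0] == sses_b[k][0]
                                  and sses_a[j][0] == sses_b[l][0])
                    ang_b = angle_deg(sse_vector(sses_b[k]), sse_vector(sses_b[l]))
                    if same_types and abs(ang_a - ang_b) <= max_angle_diff:
                        matches.append(((i, j), (k, l)))
    return matches


if __name__ == "__main__":
    A = [("H", (0, 0, 0), (10, 0, 0)), ("E", (0, 0, 0), (0, 5, 0))]
    B = [("H", (1, 1, 1), (1, 1, 11)), ("E", (1, 1, 1), (6, 1, 1))]
    print(similar_sse_pairs(A, B))
```

Comparing angles between SSEs within each structure, rather than absolute directions, keeps the test independent of how the two structures happen to be oriented.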


Recommended books


Thank you for your attention!

