2009 CSBB LAB New Student Training
Protein structure concepts and related computational problems
Speaker: Chia Han Chu (PhD candidate)
21/07/2009, NTHU CSBB lab
What are proteins made of?
• The parts of a protein: backbone and side chain
– “Backbone”: N, C, C, N, C, C…
– R: “side chain”
What are proteins made of?
• By varying the side chain R, twenty amino acids can be formed; they are grouped according to the chemical and physical properties (e.g. size) of R
What are proteins made of?
• A peptide is a substance intermediate in size between an amino acid (a.a. for short) and a protein.
• The smallest molecule is an amino acid and the largest is a protein.
• Two or more amino acids form a peptide through peptide bond formation, with the removal of water.
What are proteins made of?
• Dipeptide and peptide bond
[Figure labels: carboxyl group, amino group, dehydration (loss of water)]
What is protein structure?
• Proteins are linear polymers that fold up by themselves… mostly.
What is protein structure?
• Quaternary structure
– Proteins that comprise more than one polypeptide chain
– Each polypeptide chain in such a protein is called a subunit
Example: Hemoglobin
What are the main secondary structures?
• A common motif in the secondary structure of proteins, the alpha helix (α-helix) is a right- or left-handed coiled conformation.
• 3.6 amino acids (residues) per turn
• O(i) hydrogen-bonds to N(i+4): the backbone C=O of residue i accepts a hydrogen bond from the backbone N-H of residue i+4
Source: Wikipedia
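The i → i+4 hydrogen-bonding pattern can be listed mechanically. The following is a minimal Python sketch, assuming residues are numbered consecutively and ignoring end effects of real helices; `helix_hbonds` is a hypothetical helper name, not something from the slides.

```python
def helix_hbonds(first, last, offset=4):
    """Return (acceptor, donor) residue pairs for an idealized alpha helix.

    The backbone C=O of residue i accepts a hydrogen bond from the
    backbone N-H of residue i + offset (offset = 4 for an alpha helix).
    """
    return [(i, i + offset) for i in range(first, last - offset + 1)]


pairs = helix_hbonds(1, 10)
print(pairs)

# With 3.6 residues per turn, a 10-residue helix spans roughly this many turns
print(round(10 / 3.6, 2))
```

A helix shorter than five residues yields no pairs under this idealization, which matches the idea that at least one full i → i+4 bond is needed.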
What are the main secondary structures?
• A beta strand (also β-strand) is a stretch of amino acids, typically 5–10 amino acids long, whose peptide backbones are almost fully extended
• The β sheet (also β-pleated sheet) is the second form of regular secondary structure in proteins. It consists of beta strands connected laterally by three or more hydrogen bonds, forming a generally twisted, pleated sheet.
Picture source: Wikipedia
What are the main secondary structures?
• Parallel and anti-parallel sheets
[Figure panels: Parallel, Anti-parallel]
What are the main secondary structures?
• Loops
– Connect the secondary structure elements (helix or strand)
– Have various lengths and shapes
– Are located at the surface of the folded protein and may therefore play an important role in biological recognition processes
– Evolutionarily related proteins have the same helices and sheets but may vary in loop structures
Figure 2.8, Branden & Tooze
What are the super-secondary structures?
• Simple combinations of secondary structural elements are called motifs or supersecondary structures
[Figure panels: Beta hairpin, Beta-alpha-beta unit, Helix hairpin]
• They are assemblies of secondary structures that are shared by many structures
– β hairpin
– Greek key
– β-α-β: found in almost every protein structure with a parallel β-sheet
What is a protein domain?
• A protein domain is a part of a protein's sequence and structure that can evolve, function, and exist independently of the rest of the protein chain.
• Each domain forms a compact three-dimensional structure and can often be independently stable and folded.
• One domain may appear in a variety of evolutionarily related proteins.
• Domains vary in length from about 25 a.a. up to 500 a.a.
Figure: Pyruvate kinase, a protein with three domains (PDB 1pkn), labelled Domain 1, Domain 2 and Domain 3. Picture source: Wikipedia
What is a protein domain?
• Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin.
• The EF hand is a helix-loop-helix structural domain found in a large family of calcium-binding proteins.
• The protein parvalbumin contains three such motifs and is probably involved in muscle relaxation via its calcium-binding activity.
Figure: Calmodulin with four EF-hand motifs; each loop region is usually about 12 amino acids long. Picture source: Wikipedia
What is a protein domain?
• Because domains are independently stable, they can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.
Figure: BS-RNase. Picture source: 3D Domain swapping: A mechanism for oligomer assembly, Protein Science (1995)
General concepts for structural bioinformatics
[Diagram: Sequence, Structure, Function; Analysis, Classification, Prediction, Modelling, Design, Engineering]
Structure Databases
• Original database: PDB
– Only one central repository for experimentally determined macromolecular structures: the Protein Data Bank (PDB)
– Established 1971
– Walter Hamilton @ Brookhaven
– 7 structures
– “PDB format”
– Magnetic tape distribution
Other primary structure databases
• NDB – Nucleic acid Data Base
– Most structures also in PDB
• BMRB – BioMagResBank
– Experimental NMR data
– Joined wwPDB in 2006
• CSD – Cambridge Structural Database
– Small molecules, including some peptides and antibiotics
– You have to pay to use it!
Structure Databases
• PDB accepts experimental structures of “biopolymers”
• When is a biomolecule big enough?
– Polypeptides: > 23 residues
– Polynucleotides: > 3 residues ??
– Polysaccharides: > 3 sugar residues
– Fibers (only repeating unit deposited)
• Where do smaller molecules go?
– Deposit at Cambridge Crystallographic Data Center (CCDC) or NDB
Structure Databases
• International effort
– Curated by RCSB (USA), PDBe (EBI-MSD; Europe) and PDBj (Japan) + BMRB (USA) for NMR data
• > 58000 structures (July, 2009)
• Distributed over the internet
• Updated daily
• “The PDB” = ftp archive of files in the “flat” PDB file format
Structure Databases
• Redundancy
– There are > 58000 structures (July, 2009)
– There are > 120,000 chains
• Multiple copies per entry (e.g. dimers, viruses)
– However, there are only ~ 8600 unique proteins. Why?
• Non-protein entries (DNA, RNA, carbohydrates, antibiotics)
• Different laboratories
• Complexes
• Mutants
• Paralogs and orthologs
Structure Databases
• To err is human...
– Experimental structures
• May contain errors!
• Need for validation!
Structure Databases
• PDB files
Structure Databases
• Other formats
– PDB format is not compatible with modern database technology
– Internally, wwPDB uses
• ORACLE for web-services
– Exchange formats
• mmCIF – macromolecular Crystallographic Information File
• XML – eXtensible Markup Language
Structure Databases
• wwPDB front-ends
– Several front-ends provide raw and derived data and links to other databases for all PDB entries.
• RCSB (often, inaccurately, called “PDB”)
• PDBe
• PDBj
• OCA
• PDBsum (lots of derived information)
• MMDB (integrated with all of NCBI’s databases)
• Jena Library
Structure Databases
• Is wwPDB enough?
– All entries in the RCSB PDB are whole proteins or parts of proteins.
– However, what often interests biologists is the relationship between the basic protein units, domains, rather than between whole proteins.
– Q: How do you extract the domains from PDB?
Structure Classification Databases
• Structural alignment can be used to classify known (and new!) structures
– SCOP (manual)
– FSSP/DDD (automatic)
– CATH (mixed)
Structure Classification Databases
• SCOP database
– Structural Classification Of Proteins (SCOP for short)
– It was created and is organized by the University of Cambridge, UK.
– The SCOP database aims to provide a detailed and comprehensive description of the structural and functional relationships between all proteins whose structure is known.
– Proteins are classified to reflect both structural and evolutionary relatedness.
– Classification is done manually.
Structure Classification Databases
• SCOP database
– The basic unit of classification is the protein domain.
– SCOP hierarchy
Structure Classification Databases
• SCOP database
– sunid, a new SCOP identifier, is simply a number that uniquely identifies each entry in the SCOP hierarchy, from root to leaves.
– sccs, a new concise classification string, is a compact representation of a SCOP domain classification that includes only the most relevant levels: class, fold, superfamily, family.
– For example, PDB entry 1g61, chain A:
• sunid: cl=53931, cf=55908, sf=55909, fa=55910, dm=55911, sp=55912, px=41126
• sccs: d.126.1.1
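Since an sccs string such as d.126.1.1 packs class, fold, superfamily and family into four dot-separated fields, it is easy to split and compare. The following is a minimal Python sketch; the names `Sccs`, `parse_sccs` and `shared_level` are hypothetical helpers introduced here for illustration, and the second sccs string in the usage lines is made up.

```python
from typing import NamedTuple


class Sccs(NamedTuple):
    scop_class: str
    fold: int
    superfamily: int
    family: int


def parse_sccs(sccs: str) -> Sccs:
    """Split an sccs string like 'd.126.1.1' into its four levels."""
    scop_class, fold, superfamily, family = sccs.split(".")
    return Sccs(scop_class, int(fold), int(superfamily), int(family))


def shared_level(a: str, b: str) -> str:
    """Name the deepest SCOP level at which two sccs strings still agree."""
    levels = ["class", "fold", "superfamily", "family"]
    shared = "none"
    for level, x, y in zip(levels, parse_sccs(a), parse_sccs(b)):
        if x != y:
            break
        shared = level
    return shared


print(parse_sccs("d.126.1.1"))
# 'd.126.1.2' is an invented string used only to exercise the comparison
print(shared_level("d.126.1.1", "d.126.1.2"))
```

Two domains whose strings agree up to the third field belong to the same superfamily but to different families.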
Structure Classification Databases
Family: Clear evolutionary relationship. Proteins are clustered together into families on the basis of one of two criteria that imply a common evolutionary origin.
Criterion 1: All proteins that have residue identities of 30% and greater.
Criterion 2: Proteins with lower sequence identities but whose functions and structures are very similar. For example, globins with sequence identities of 15%.
Superfamily: Probable common evolutionary origin. Families whose proteins have low sequence identities, but whose structures and, in many cases, functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies.
Example: actin, the ATPase domain of the heat-shock protein, and hexokinase
Fold: Major structural similarity. Superfamilies and families are defined as having a common fold if their proteins have the same major secondary structures in the same arrangement with the same topological connections.
Advantage: There may, however, be cases where a common evolutionary origin is obscured by the extent of the divergence in sequence, structure and function. In these cases, it is possible that the discovery of new structures, with folds between those of the previously known structures, will make clear their common evolutionary relationship.
Class:
(1) α-helical domains
(2) β-sheet domains
(3) α/β domains, which consist of "beta-alpha-beta" structural units or "motifs" that form mainly parallel β-sheets
(4) α+β domains, formed by independent α-helices and mainly antiparallel β-sheets
(5) multi-domain proteins (for those with domains of different fold and for which no homologues are known at present)
(6) membrane and cell surface proteins and peptides
(7) small proteins
(8) coiled-coil proteins
(9) low-resolution protein structures
(10) peptides and fragments
(11) designed proteins of non-natural sequence
Information comes from Murzin, A., Brenner, S.E., Hubbard, T.J.P. and Chothia, C. (1995) SCOP: a Structural Classification of Proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540, and Wikipedia.
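The 30% residue identity criterion for SCOP families presupposes a way to compute percent identity from a pairwise alignment. The following is a minimal Python sketch under one common convention (identical residues divided by the number of aligned, gap-free columns); other conventions use a different denominator, so this is an assumption rather than SCOP's own procedure. The function names and the toy sequences are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap character
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def meets_identity_criterion(aligned_a: str, aligned_b: str,
                             threshold: float = 30.0) -> bool:
    """True if the pair reaches the identity threshold (30% on the slide)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Toy alignment (invented sequences): 4 identical residues in 5 gap-free columns
print(percent_identity("MKV-LA", "MRVQLA"))
print(meets_identity_criterion("MKV-LA", "MRVQLA"))
```

Pairs below the threshold, such as the globins at 15% identity mentioned above, would need the second criterion (similar structure and function) instead.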
Structure Classification Databases
• All α: secondary structure exclusively or almost exclusively α-helical
• All β: secondary structure exclusively or almost exclusively β-sheets
• α/β: helices and sheets assembled from β-α-β units
• α+β: α-helices and β-sheets separated in different parts of the molecule; absence of β-α-β motifs
• A glance at the SCOP website
Structure Classification Databases
• CATH classification
– C = Class
• Mainly α, mainly β, mixed α/β, few SSEs
– A = Architecture
• Overall domain shape and orientation, but not connectivity, of SSEs
– T = Topology = fold
– H = Homologous superfamily
• Groups proteins thought to share a common ancestor
Structure Classification Databases
• CATH classification
– Lower levels are sequence-based (%SI = percent sequence identity)
• S = %SI ≥ 35%
• O = %SI ≥ 60%
• L = %SI ≥ 90%
• I = %SI = 100%
– D = domain
• Individual domains for each I-level
Structure – sequence relationship
• Two conserved sequences → similar structures (sure)
• Two similar structures → conserved sequences?
Human Myoglobin pdb:2mm1
Human Hemoglobin alpha-chain pdb:1jebA
Sequence id: 27%
Structural id: 90%
Principles of Protein Structure
• Today's proteins reflect millions of years of evolution
• 3D structure is better conserved than sequence during evolution
• Similarities among sequences or among structures may reveal information about shared biological functions of a protein family
Why structural alignment?
• In evolutionarily related proteins, structure is much better preserved than sequence
• Similar structures may predict similar biological function
• Gaining insight into protein folding
• Two structures are similar when they can be superimposed well.
Structure superimposition
• What is the best transformation that superimposes the unicorn on the lion?
[Figures: a poor superimposition ("This is not a good result…"), followed by a good one]
• Find the transformation matrix that best overlaps the table and the chair
• i.e. find the transformation matrix that minimizes the root mean square deviation between corresponding points of the table and the chair
Kinds of transformations
• Rotation
• Translation
• Scaling
• And more…
[Figures: translation, rotation and scaling, each illustrated in the X–Y plane]
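The practical difference between these transformations is that rotation and translation are rigid (they preserve all distances between points) while scaling is not. The following is a minimal Python sketch on two made-up 3-D points; the function names are hypothetical and the rotation is restricted to the z axis to keep the example short.

```python
import math


def translate(p, t):
    """Shift point p by the vector t."""
    return tuple(pi + ti for pi, ti in zip(p, t))


def rotate_z(p, angle):
    """Rotate a 3-D point about the z axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def scale(p, factor):
    """Scale point p uniformly about the origin."""
    return tuple(factor * pi for pi in p)


a, b = (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)

# Rigid motion: rotate by 90 degrees, then translate
moved = [translate(rotate_z(p, math.pi / 2), (3.0, -1.0, 0.5)) for p in (a, b)]
# Non-rigid: uniform scaling by a factor of 2
scaled = [scale(p, 2.0) for p in (a, b)]

print(math.dist(a, b))     # original distance
print(math.dist(*moved))   # unchanged by rotation + translation
print(math.dist(*scaled))  # doubled by scaling
```

Because proteins are compared as physical objects, structure superimposition normally restricts itself to the rigid transformations.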
Correspondence is Unknown
• Given two configurations of points in three-dimensional space
• Find those rotations and translations of one of the point sets that produce “large” superimpositions of corresponding 3-D points
• Simple case: two closely related proteins with the same number of amino acids.
Question: how do we assess the quality of the transformation?
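Scoring the Alignment
• Two point sets: $A = \{a_i\},\ i = 1, \dots, n$ and $B = \{b_j\},\ j = 1, \dots, m$
• Pairwise correspondence: $(a_{k_1}, b_{t_1}), (a_{k_2}, b_{t_2}), \dots, (a_{k_N}, b_{t_N})$
• RMSD (Root Mean Square Distance):
$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert a_{k_i} - b_{t_i} \rVert^2}$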
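The RMSD over a given correspondence translates directly into code. The following is a minimal Python sketch, assuming points are 3-D tuples and the correspondence is a list of index pairs (0-based here, whereas the slide counts from 1); the function name and the toy coordinates are made up for illustration.

```python
import math


def rmsd(a, b, pairs):
    """RMSD over corresponding points.

    a, b  : lists of 3-D points (tuples of floats)
    pairs : list of (k, t) index pairs, meaning a[k] corresponds to b[t]
    """
    if not pairs:
        raise ValueError("need at least one corresponding pair")
    total = sum(math.dist(a[k], b[t]) ** 2 for k, t in pairs)
    return math.sqrt(total / len(pairs))


A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
B = [(0.0, 0.0, 1.0), (5.0, 5.0, 5.0), (2.0, 0.0, 1.0)]

# Only the first and third points are put in correspondence
print(rmsd(A, B, [(0, 0), (2, 2)]))
```

Note that the two sets may have different sizes (n and m); only the N paired points enter the sum.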
Scoring the Alignment
• Given two sets of 3-D points: $P = \{p_i\}$, $Q = \{q_i\}$, $i = 1, \dots, n$:
$\mathrm{rmsd}(P, Q) = \sqrt{\frac{1}{n} \sum_i \lVert p_i - q_i \rVert^2}$
• Find a 3-D transformation $T^*$ such that:
$\mathrm{rmsd}(T^*(P), Q) = \min_T \sqrt{\frac{1}{n} \sum_i \lVert T(p_i) - q_i \rVert^2}$
• Find the highest number of atoms aligned with the lowest RMSD
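Part of the minimization over T has a simple closed form: for a known correspondence, the optimal translation is the one that makes the two centroids coincide. The following is a minimal Python sketch of that step only, on made-up coordinates; the rotational part (commonly solved with the SVD-based Kabsch algorithm) is deliberately left out, and all function names are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of 3-D points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))


def center(points):
    """Translate the points so that their centroid sits at the origin."""
    c = centroid(points)
    return [tuple(x - cx for x, cx in zip(p, c)) for p in points]


def rmsd(p_set, q_set):
    """RMSD between two equally long, already paired point lists."""
    total = sum(math.dist(p, q) ** 2 for p, q in zip(p_set, q_set))
    return math.sqrt(total / len(p_set))


P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
Q = [(x + 4.0, y - 2.0, z + 1.0) for x, y, z in P]  # P shifted rigidly

before = rmsd(P, Q)                 # dominated by the shift
after = rmsd(center(P), center(Q))  # shift removed by centering both sets
print(before, after)
```

In this toy case Q is a pure translation of P, so centering alone brings the RMSD to zero (up to rounding); in general a rotation must still be optimized afterwards.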
Matching of structures
• Two structures A and B match iff:
1. Correspondence: there is a one-to-one map between their elements
2. Alignment: there exists a rigid-body transform T such that the RMSD between the elements in A and those in T(B) is less than some threshold ε.
Matching of structures
• Complete match
• But a complete match is rarely possible
– The molecules have different sizes
– Their shapes are only locally similar
Figure: Alignment of 3adk and 1gky
Matching of structures
• Notion of support σ of the match: the match is between σ(A) and σ(B)
• Dual problem:
– What is the support?
– What is the transform?
• Often several (many) possible supports
• Small supports → motifs
Matching of structures
• Mathematical relative: comparing two functions $f$ and $g$ through $\lVert f - g \rVert^2$
[Figure: f and g plotted against s]
Over which support?
Matching of structures
• Multiple partial matches
• What is best?
[Figure: two alternative alignments of A and B]
Should gaps be penalized?
• What about this?
[Figure: an alignment of A and B]
Sequence along backbone is not preserved
• The similarity measure is unlikely to satisfy the triangle inequality for partial matches
Scoring Issues
• Trade-off between size of σ and RMSD
• How should gaps be counted?
• Is there a “quality” of the correspondence? [The correspondence may, or may not, satisfy type and/or backbone sequence preferences]
• Should accessible surface be given more importance?
• Similarity measure may be different from the inverse of RMSD (though no consensus on best measure!)
• But RMSD is computationally very convenient!
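RMSD vs. similarity measure
• RMSD dissimilarity measure: emphasizes differences → smaller support
$\min_T \sqrt{\frac{1}{|\sigma(T)|} \sum_{i \in \sigma(T)} \lVert a_i - T(b_i) \rVert^2}$
• STRUCTAL’s similarity measure: emphasizes similarities → larger support
$\max_T \left( \sum_{i \in \sigma(T)} \frac{A}{1 + \lVert a_i - T(b_i) \rVert^2 / B} - \frac{A\, N_{GAP}}{2} \right)$
– The $N_{GAP}$ term is the gap penalty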
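The opposite tendencies of the two measures show up on a small numeric example. The following is a minimal Python sketch, assuming a STRUCTAL-type score built from terms A / (1 + d²/B) minus a gap penalty of A/2 per gap; the constants A = 20 and B = 25 (that is, 5 Å squared) are assumptions taken from Levitt and Gerstein's published score rather than from the slide, and the distance values are invented.

```python
import math

A_CONST = 20.0  # assumed constant (not given on the slide)
B_CONST = 25.0  # assumed constant: (5 angstroms) squared


def rmsd_from_distances(dists):
    """RMSD given the distances between already superimposed matched pairs."""
    return math.sqrt(sum(d * d for d in dists) / len(dists))


def structal_like_score(dists, n_gap=0):
    """Sum of per-pair similarity terms minus a penalty of A/2 per gap."""
    matched = sum(A_CONST / (1.0 + d * d / B_CONST) for d in dists)
    return matched - A_CONST * n_gap / 2.0


core = [0.5, 0.8, 1.0, 0.7]  # four well-matched pairs (angstroms)
extended = core + [6.0]      # same support plus one poorly matched pair

print(rmsd_from_distances(core), rmsd_from_distances(extended))
print(structal_like_score(core), structal_like_score(extended))
```

Adding the distant pair makes the RMSD noticeably worse, which pushes an RMSD-driven method toward the smaller support, while the similarity score still rises slightly, which favors the larger support.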
Comparison of Similarity Measures
• A.C.M. May. Toward more meaningful hierarchical classification of amino acids scoring functions. Protein Engineering, 12:707-712, 1999, reviews 37 protein structure similarity measures
• The difficulty of defining a similarity score is probably due to the fact that structure comparison is an ill-posed problem with multiple solutions
Bottom Line
• Finding an optimal partial match is NP-hard: no fast algorithm is guaranteed to give an optimal answer for any given measure [Godzik, 1996]
– Heuristic/approximate algorithms
– Probably not a single solution, but application-dependent solutions
– But there exist general algorithmic principles
Algorithms for structure superimposition
• Distance based methods
– DALI (Holm and Sander): Aligning scalar distance plots
– STRUCTAL (Gerstein and Levitt): Dynamic programming using pairwise inter-molecular distances
– SSAP (Orengo and Taylor): Dynamic programming using intramolecular vector distances
– MINAREA (Falicov and Cohen): Minimizing soap-bubble surface area
• Vector based methods
– VAST (Bryant): Graph theory based secondary structure alignment
– 3dSearch (Singh and Brutlag): Fast secondary structure index lookup
• Both vector and distance based
– LOCK (Singh and Brutlag): Hierarchically uses both secondary structure vectors and atomic distances
Dali
Figure: An intra-molecular distance plot for myoglobin
• http://www.ebi.ac.uk/dali/
• Based on aligning 2-D intra-molecular distance matrices
• Computes the best subset of corresponding residues from the two proteins such that the similarity between the 2-D distance matrices is maximized
• Searches through all possible alignments of residues using Monte-Carlo and branch-and-bound algorithms
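One reason intra-molecular distance matrices are attractive is that they do not change when a structure is rotated or translated, so two structures can be compared without first superimposing them. The following is a minimal Python sketch of that property on a made-up four-point trace; it only builds and compares the matrices and is not Dali's actual alignment procedure. The function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Matrix of all pairwise distances within one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def rigid_move(p, angle, shift):
    """Rotate a point about the z axis, then translate it."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])


# Toy C-alpha trace (invented coordinates, in angstroms)
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0), (5.0, 3.6, 3.8)]
moved = [rigid_move(p, 1.0, (10.0, -4.0, 2.5)) for p in trace]

d_original = distance_matrix(trace)
d_moved = distance_matrix(moved)

# Largest entry-wise difference between the two matrices
max_diff = max(abs(a - b)
               for row_a, row_b in zip(d_original, d_moved)
               for a, b in zip(row_a, row_b))
print(max_diff)  # close to zero, up to floating-point rounding
```

Finding which sub-matrices of two different proteins resemble each other is the hard combinatorial part that the Monte-Carlo and branch-and-bound searches address.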
VAST
• http://www.ncbi.nih.gov/Structure/VAST/vast.shtml
• Aligns only secondary structure elements (SSE)
• Represents each SSE as a vector
• Finds all possible pairs of vectors from the two structures that are similar
• Uses a graph theory algorithm to find maximal subset of similar vectors
• Overall alignment score is based on the number of similar pairs of vectors between the two structures
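To make the idea of "an SSE as a vector" concrete, the following Python sketch reduces each element to the vector from its first to its last C-alpha atom and compares two such vectors by the angle between them. This is only a loose illustration under stated assumptions: the end-to-end simplification, the function names and the coordinates are all invented here, and VAST's real similarity criteria and graph search are not reproduced.

```python
import math


def sse_vector(ca_coords):
    """Vector from the first to the last C-alpha of one SSE (a simplification)."""
    start, end = ca_coords[0], ca_coords[-1]
    return tuple(e - s for s, e in zip(start, end))


def angle_deg(u, v):
    """Angle between two 3-D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.degrees(math.acos(cos))


# Two made-up elements: one running along x, one along y
helix = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
strand = [(0.0, 0.0, 0.0), (0.0, 3.3, 0.0), (0.0, 6.6, 0.0)]

u, v = sse_vector(helix), sse_vector(strand)
print(u, v, angle_deg(u, v))
```

Working with a handful of vectors per protein instead of hundreds of atoms is what makes this family of methods fast enough for database-wide searches.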
Recommended books
Thank you for your attention!