Lecture 10 – protein structure prediction. A protein sequence.
A protein sequence
>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL
DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW
SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY
SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA
QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM
RDHVHGKATNHEDLTCATEEARIISWENLQKAKAEAAIRKLEKYFPQMKLEKKRSSSMEKIMRKVKSAEKRAEEMRRSVL
DNRVSTASHGKASSFKRSGKKKIPSLSGCFTCHVF
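The record above is in FASTA format: a header line starting with ">" followed by lines of sequence. As a minimal sketch of how such a record might be read, the Python below splits it into header and sequence. The function name parse_fasta is illustrative, not something from the lecture, and only the first two sequence lines are included to keep the example short.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL
DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW
"""

header, sequence = parse_fasta(example)[0]
fields = header.split("|")   # pipe-separated identifiers in the header
accession = fields[3]        # the RefSeq accession in this particular header
print(accession, len(sequence))
```

Joining the sequence lines matters because the line breaks in a FASTA file are only for display; the protein is one continuous chain.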
Domain and Folds
• Domain: a discrete portion of a protein that is assumed to fold independently of the rest of the protein and to possess its own function.
• Most proteins have multiple domains.
• The core 3D structure of a domain is called a fold. There are only a few thousand possible folds.
Protein Similarity Level
• Family: proteins in the same family are homologous at the sequence level.
• Superfamily: all members of a superfamily should have the same overall domain architecture, i.e., the same domains in the same order.
• Fold: the folds of two domains are similar.
Protein Folding Problem
A protein folds into a unique 3D structure under physiological conditions.
Lysozyme sequence: KVFGRCELAA AMKRHGLDNY
RGYSLGNWVC AAKFESNFNT
QATNRNTDGS TDYGILQINS
RWWCNDGRTP GSRNLCNIPC
SALLSSDITA SVNCAKKIVS
DGNGMNAWVA WRNRCKGTDV
QAWIRGCRL
Structure-Function Relationship
Some level of function can be inferred without a structure, but a structure is key to understanding the detailed mechanism.
A predicted structure is a powerful tool for function inference.
Trp repressor as a function switch
Structure-Based Drug Design
HIV protease inhibitor
Structure-based rational drug design is still a major method for drug discovery.
Protein Structure Prediction
Traditional experimental methods:
X-ray crystallography or NMR to solve structures;
these generate only a few structures per day worldwide and cannot keep pace with new protein sequences.
Strong demand for structure prediction:
more than 30,000 human genes;
10,000 genomes will be sequenced in the next 10 years.
Still an unsolved problem after two decades of effort.
Ab initio Structure Prediction
An energy function to describe the protein
o bond energy
o bond angle energy
o dihedral angle energy
o van der Waals energy
o electrostatic energy
Minimize the function to obtain the structure. Not practical in general:
o Computationally too expensive
o Accuracy is poor
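The five terms listed above are usually combined into a single sum. The lecture does not give a specific functional form, so the expression below is only a sketch using one common molecular-mechanics convention: harmonic bond and angle terms, a cosine dihedral term, a Lennard-Jones 12-6 term for van der Waals interactions, and Coulomb's law for electrostatics. All symbols (force constants k, reference values b_0 and \theta_0, Lennard-Jones parameters \epsilon_{ij} and \sigma_{ij}, partial charges q_i) are assumptions introduced for illustration.

```latex
E_{\text{total}} =
    \sum_{\text{bonds}} k_b \,(b - b_0)^2
  + \sum_{\text{angles}} k_\theta \,(\theta - \theta_0)^2
  + \sum_{\text{dihedrals}} k_\phi \left[ 1 + \cos(n\phi - \delta) \right]
  + \sum_{i<j} 4\epsilon_{ij}
      \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12}
           - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right]
  + \sum_{i<j} \frac{q_i q_j}{4\pi\varepsilon_0 \, r_{ij}}
```

The last two sums run over pairs of atoms, so their cost grows quadratically with the number of atoms, which is one reason minimization is computationally expensive.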
Template-Based Prediction
Structure is better conserved than sequence
A structure can accommodate a wide range of mutations.
Physical forces favor certain structures.
The number of folds is limited: ~700 are currently known, and the total is estimated at 1,000 to 10,000. Example fold: TIM barrel.
~90% of new globular proteins share similar folds with known structures, implying the general applicability of template-based (comparative) modeling methods for structure prediction. These methods currently apply to 60-70% of new proteins, and this number is growing as more structures are solved.
NIH Structural Genomics Initiative plans to experimentally solve ~10,000 “unique” structures and predict the rest using computational methods
Scope of the Problem
Homology Modeling
• Sequence is aligned with sequence of known structure, usually sharing sequence identity of 30% or more.
• Superimpose sequence onto the template, replacing equivalent sidechain atoms where necessary.
• Refine the model by minimizing an energy function.
• Applicable to ~20% of all proteins.
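The 30% threshold in the first step refers to sequence identity over an alignment. As a minimal sketch, the function below computes percent identity from two already-aligned strings. The template fragment is made up for illustration, and the choice of denominator (aligned, non-gap columns) is an assumption, since several definitions of identity are in use.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent of aligned (non-gap) columns where both residues are identical."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == gap or t == gap:
            continue  # columns containing a gap are not counted
        aligned += 1
        if q == t:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


# Query: start of the lysozyme sequence with one gap inserted.
# Template: a hypothetical aligned fragment, not a real protein.
query = "KVFGRCELAA-MKRHG"
template = "KIFERCELARTLKRLG"

identity = percent_identity(query, template)
usable_for_homology_modeling = identity >= 30.0  # threshold quoted in the lecture
print(round(identity, 1), usable_for_homology_modeling)
```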
Concept of Threading
o Thread (align or place) a query protein sequence onto a template structure in an “optimal” way
o Good alignment gives approximate backbone structure
Query sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template set
Prediction accuracy: fold recognition / alignment
Definition of Template
Each template position records the residue type / profile, the secondary structure type, the solvent accessibility, and the coordinates of the Cα and Cβ atoms:
RES  1  G  156  S   23   10.528  -13.223    9.932   11.977  -12.741   10.115
RES  5  P  157  H  110   12.622  -17.353   10.577   12.981  -16.146   11.485
RES  5  G  158  H   61   17.186  -15.086    9.205   16.601  -15.457   10.578
RES  5  Y  159  H   91   16.174  -10.939   12.208   16.612  -12.343   12.727
RES  5  C  160  H    8   12.670  -12.752   15.349   14.163  -13.137   15.545
RES  1  G  161  S   14   15.263  -17.741   14.529   15.022  -16.815   15.733
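A template record like the ones above can be read into a small data structure. The sketch below is one possible reading of the columns, based only on the slide's header: the field names are illustrative, the meaning of the first numeric column is not stated in the lecture (so it is kept as an opaque flag), and the two coordinate triplets are stored without assuming which atom each belongs to.

```python
import math
from dataclasses import dataclass


@dataclass
class TemplatePosition:
    flag: int            # first numeric column; meaning not given in the lecture
    residue: str         # one-letter residue type
    number: int          # residue number (assumed)
    sec_struct: str      # secondary structure code, e.g. H or S
    accessibility: int   # solvent accessibility
    atom_a: tuple        # first coordinate triplet (x, y, z)
    atom_b: tuple        # second coordinate triplet (x, y, z)


def parse_res_line(line):
    """Parse one whitespace-separated RES record into a TemplatePosition."""
    tok = line.split()
    if len(tok) != 12 or tok[0] != "RES":
        raise ValueError("expected 'RES' followed by 11 fields: " + line)
    xyz = [float(v) for v in tok[6:12]]
    return TemplatePosition(
        flag=int(tok[1]),
        residue=tok[2],
        number=int(tok[3]),
        sec_struct=tok[4],
        accessibility=int(tok[5]),
        atom_a=tuple(xyz[:3]),
        atom_b=tuple(xyz[3:]),
    )


pos = parse_res_line(
    "RES 1 G 156 S 23 10.528 -13.223 9.932 11.977 -12.741 10.115"
)
# Distance between the two atoms stored for this residue.
separation = math.dist(pos.atom_a, pos.atom_b)
print(pos.residue, pos.number, round(separation, 2))
```

For this record the two points are about 1.5 Å apart, which is consistent with two bonded carbon atoms in the same residue.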
Energy (Score) Function
…YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…
Singleton energy: How well a residue fits a template position (sequence and structural environment): E_s
Pairwise energy: How preferable to put two particular residues nearby: E_p
Alignment gap penalty: E_g
Total energy: E_p + E_s + E_g
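To make the three terms concrete, the sketch below evaluates E_p + E_s + E_g for a given alignment. The scoring rules are toy assumptions for illustration only (hydrophobic residues prefer buried positions, two hydrophobic residues in contact are rewarded, each gap costs a fixed amount); real threading programs derive these scores from statistics over known structures.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification, an assumption


def singleton(residue, environment):
    """E_s term: -1 if the residue suits the position's environment, else +1."""
    fits = (residue in HYDROPHOBIC) == (environment == "buried")
    return -1.0 if fits else 1.0


def pairwise(res1, res2):
    """E_p term: reward two hydrophobic residues placed in contact."""
    return -1.0 if res1 in HYDROPHOBIC and res2 in HYDROPHOBIC else 0.0


def threading_energy(query, alignment, environments, contacts, gap_penalty=1.0):
    """Total energy E_p + E_s + E_g.

    alignment[t] is the query index placed at template position t, or None.
    contacts is a list of template position pairs that are spatially close.
    """
    placed = [q for q in alignment if q is not None]
    e_s = sum(
        singleton(query[q], environments[t])
        for t, q in enumerate(alignment)
        if q is not None
    )
    e_p = 0.0
    for t1, t2 in contacts:
        q1, q2 = alignment[t1], alignment[t2]
        if q1 is not None and q2 is not None:
            e_p += pairwise(query[q1], query[q2])
    # Gaps: empty template positions plus query residues left unplaced.
    n_gaps = alignment.count(None) + (len(query) - len(placed))
    e_g = gap_penalty * n_gaps
    return e_p + e_s + e_g


query = "LKVE"
environments = ["buried", "exposed", "buried", "exposed"]
contacts = [(0, 2)]

good = threading_energy(query, [0, 1, 2, 3], environments, contacts)
shifted = threading_energy(query, [None, 0, 1, 2], environments, contacts)
print(good, shifted)
```

The in-register alignment scores lower (better) than the shifted one, which is what the threading search tries to discover automatically.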
Threading problem
• Threading: Given a sequence, and a fold (template), compute the optimal alignment score between the sequence and the fold.
• If we can solve the above problem, then:
– Given a sequence, we can try each known fold and find the best fold that fits this sequence.
– Because there are only a few thousand folds, we can find the correct fold for the given sequence.
• Threading is NP-hard.
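The hardness comes from combining the pairwise term with gaps of variable length. If the pairwise term is dropped, the optimal alignment can be found by standard dynamic programming, and fold recognition reduces to trying each template and keeping the lowest score. The sketch below shows that simplified case only; it is not the lecture's method, and the scoring rule and the two template folds are toy assumptions.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification, an assumption


def singleton(residue, environment):
    """-1 if the residue suits the position's environment, else +1."""
    fits = (residue in HYDROPHOBIC) == (environment == "buried")
    return -1.0 if fits else 1.0


def thread_singleton_only(query, environments, gap_penalty=1.0):
    """Minimum of E_s + E_g over all global alignments (pairwise term ignored)."""
    n, m = len(query), len(environments)
    # best[i][j]: lowest energy for query[:i] against template positions [:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        best[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = min(
                best[i - 1][j - 1] + singleton(query[i - 1], environments[j - 1]),
                best[i - 1][j] + gap_penalty,   # query residue left unplaced
                best[i][j - 1] + gap_penalty,   # template position left empty
            )
    return best[n][m]


# Hypothetical template library: each fold is a list of position environments.
templates = {
    "fold_A": ["buried", "exposed", "buried", "exposed"],
    "fold_B": ["exposed", "buried", "exposed", "buried"],
}

query = "LKVE"
scores = {name: thread_singleton_only(query, env) for name, env in templates.items()}
best_fold = min(scores, key=scores.get)
print(scores, best_fold)
```

The table has (n+1) by (m+1) entries, so each template costs time proportional to the product of the two lengths. Adding the pairwise term breaks this recurrence, because the score of a position then depends on where distant residues were placed.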
Computational Methods
• Branch and Bound.
• Integer programming.
– Use linear programming plus branch and bound.
Blue Gene
• On December 6, 1999, IBM announced a $100 million research initiative to build the world's fastest supercomputer, "Blue Gene", to tackle fundamental problems in computational biology.
• More than one petaflop/s (1,000,000,000,000,000 floating point operations per second)