Lecture 10 – protein structure prediction

A protein sequence


>gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana]

MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL

DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW

SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY

SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA

QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM

RDHVHGKATNHEDLTCATEEARIISWENLQKAKAEAAIRKLEKYFPQMKLEKKRSSSMEKIMRKVKSAEKRAEEMRRSVL

DNRVSTASHGKASSFKRSGKKKIPSLSGCFTCHVF

Protein Structure

Heparin docking. Red: heparin; blue: central domain; yellow: C-terminal domain

A Protein Structure

alpha-helix

beta-sheet

loop

core

Domains and Folds

• Domain: a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

• Most proteins have multiple domains.

• The core 3D structure of a domain is called a fold. There are only a few thousand possible folds.

Protein Similarity Level

• Family – The proteins in the same family are homologous at the sequence level.

• Superfamily – All members of a superfamily should have the same overall domain architecture, i.e., the same domains in the same order.

• Fold – The folds of two domains are similar.

Protein Folding Problem

A protein folds into a unique 3D structure under physiological conditions.

Lysozyme sequence: KVFGRCELAA AMKRHGLDNY

RGYSLGNWVC AAKFESNFNT

QATNRNTDGS TDYGILQINS

RWWCNDGRTP GSRNLCNIPC

SALLSSDITA SVNCAKKIVS

DGNGMNAWVA WRNRCKGTDV

QAWIRGCRL

Relevance of Protein Structure in the Post-Genome Era

sequence → structure → function → medicine

Structure-Function Relationship

Some level of function can be determined without a structure, but a structure is key to understanding the detailed mechanism.

A predicted structure is a powerful tool for function inference.

Trp repressor as a function switch

Structure-Based Drug Design

HIV protease inhibitor

Structure-based rational drug design is still a major method for drug discovery.

Protein Structure Prediction

Traditional experimental methods for structure determination:

X-ray crystallography or NMR to solve structures;

these generate only a few structures per day worldwide and cannot keep pace with new protein sequences

Strong demand for structure prediction:

more than 30,000 human genes;

10,000 genomes will be sequenced in the next 10 years.

Still an unsolved problem after two decades of effort.

Ab initio Structure Prediction

An energy function to describe the protein

o bond energy

o bond angle energy

o dihedral angle energy

o van der Waals energy

o electrostatic energy

Minimize the function to obtain the structure. Not practical in general:

o Computationally too expensive

o Accuracy is poor
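To make the energy-function idea concrete, here is a minimal sketch of the last two terms in the list (van der Waals and electrostatic energy) for a set of point atoms. The slides do not give functional forms, so the Lennard-Jones 12-6 potential, the Coulomb term, the function name `nonbonded_energy`, and the parameter values for `epsilon` and `sigma` are all assumptions chosen for illustration. The double loop over atom pairs also hints at why evaluating and minimizing such a function for a whole protein is expensive.

```python
import math


def nonbonded_energy(coords, charges, epsilon=0.2, sigma=3.4, coulomb_k=332.06):
    """Return (van der Waals, electrostatic) energy summed over all atom pairs.

    Assumptions (not from the slides): a Lennard-Jones 12-6 form with one
    shared epsilon/sigma for every pair, and a plain Coulomb term.
    coulomb_k = 332.06 is the conventional constant for kcal/mol with
    distances in angstroms and charges in units of e.
    """
    e_vdw = 0.0
    e_elec = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])  # pair distance
            sr6 = (sigma / r) ** 6
            e_vdw += 4.0 * epsilon * (sr6 * sr6 - sr6)
            e_elec += coulomb_k * charges[i] * charges[j] / r
    return e_vdw, e_elec


# Two neutral atoms placed at the Lennard-Jones minimum distance.
r_min = 3.4 * 2 ** (1 / 6)
vdw, elec = nonbonded_energy([(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)], [0.0, 0.0])
```

At the minimum distance the van der Waals term equals `-epsilon`, which is a quick sanity check on the formula. A real force field would add the bonded terms (bond, bond angle, dihedral) and use per-atom-type parameters.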

Template-Based Prediction

Structure is better conserved than sequence

A structure can accommodate a wide range of mutations.

Physical forces favor certain structures.

The number of folds is limited. Currently known: ~700. Estimated total: 1,000 to 10,000. Example fold: TIM barrel.

~90% of new globular proteins share similar folds with known structures, implying the general applicability of template-based (comparative) modeling methods for structure prediction (currently 60-70% of new proteins, and this number is growing as more structures are solved)

NIH Structural Genomics Initiative plans to experimentally solve ~10,000 “unique” structures and predict the rest using computational methods

Scope of the Problem

Homology Modeling

• The query sequence is aligned with the sequence of a known structure (the template), usually sharing sequence identity of 30% or more.

• The query sequence is superimposed onto the template, replacing equivalent side-chain atoms where necessary.

• Refine the model by minimizing an energy function.

• Applicable to ~20% of all proteins.
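The 30% identity rule of thumb can be checked from a pairwise alignment. The sketch below is one way to do it; the function name `percent_identity` and the choice of denominator (aligned, non-gap columns only) are assumptions, since several conventions for sequence identity are in use and the slides do not say which one is meant.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent identity over columns where neither sequence has a gap.

    Both inputs are rows of one pairwise alignment, so they must have
    equal length. The denominator convention is an assumption.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == gap or t == gap:
            continue  # skip gapped columns
        aligned += 1
        if q == t:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


# Toy alignment (made up for illustration): 4 identical out of 5 aligned columns.
identity = percent_identity("MKV-LA", "MRVTLA")
is_candidate = identity >= 30.0  # threshold taken from the slide
```

With a real alignment, a value at or above 30% suggests the template is a reasonable starting point for homology modeling; below that, threading is usually the better fit.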

Concept of Threading

o Thread (align or place) a query protein sequence onto a template structure in an “optimal” way

o Good alignment gives approximate backbone structure

Query sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Template set

Prediction accuracy: fold recognition / alignment

4 Components of Threading

o Template library

o Scoring function

o Alignment

o Confidence assessment

Core of a Template

Core secondary structures: α-helices and β-strands

Definition of Template

Residue type / profile; secondary structure type; solvent accessibility; coordinates for Cα / Cβ

RES 1 G 156 S 23 10.528 -13.223 9.932 11.977 -12.741 10.115

RES 5 P 157 H 110 12.622 -17.353 10.577 12.981 -16.146 11.485

RES 5 G 158 H 61 17.186 -15.086 9.205 16.601 -15.457 10.578

RES 5 Y 159 H 91 16.174 -10.939 12.208 16.612 -12.343 12.727

RES 5 C 160 H 8 12.670 -12.752 15.349 14.163 -13.137 15.545

RES 1 G 161 S 14 15.263 -17.741 14.529 15.022 -16.815 15.733
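A record like the ones above can be read into a small structure as follows. This is a sketch under assumptions: the field mapping (residue type, residue number, secondary structure code, solvent accessibility, then two coordinate triples) is inferred from the column description, the meaning of the second field is not stated in the slide and is kept as an opaque `flag`, and the names `TemplateResidue` and `parse_res_line` are hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class TemplateResidue(NamedTuple):
    flag: int                 # second field; meaning not stated in the slide
    residue: str              # one-letter residue type
    number: int               # residue number (assumed)
    sec_struct: str           # secondary structure code as written in the record
    accessibility: int        # solvent accessibility (assumed)
    atom_a: Tuple[float, float, float]  # first coordinate triple
    atom_b: Tuple[float, float, float]  # second coordinate triple


def parse_res_line(line):
    """Parse one whitespace-separated RES record with 12 fields."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "RES":
        raise ValueError("expected a 12-field RES record")
    xyz = [float(v) for v in fields[6:12]]
    return TemplateResidue(
        int(fields[1]), fields[2], int(fields[3]), fields[4],
        int(fields[5]), tuple(xyz[:3]), tuple(xyz[3:]),
    )


rec = parse_res_line("RES 1 G 156 S 23 10.528 -13.223 9.932 11.977 -12.741 10.115")
separation = math.dist(rec.atom_a, rec.atom_b)  # distance between the two atoms
```

For the first record the two points are about 1.54 Å apart, which is consistent with a single carbon-carbon bond such as Cα to Cβ.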

Energy (Score) Function

…YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…

Singleton energy: How well a residue fits a template position (sequence and structural environment): E_s

Pairwise energy: How favorable it is to place two particular residues near each other: E_p

Alignment gap penalty: E_g

Total energy: E_p + E_s + E_g
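The total energy of one fixed alignment can be evaluated directly from the three terms. The sketch below uses toy tables: the hydrophobic set, the energy values in `toy_singleton` and `toy_pairwise`, the linear gap penalty, and the representation of a template as buried/exposed flags plus a contact list are all assumptions made for illustration, not the scoring function used in the lecture.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification (assumption)


def toy_singleton(residue, buried):
    """E_s contribution: hydrophobic residues prefer buried positions."""
    if residue in HYDROPHOBIC:
        return -1.0 if buried else 0.5
    return 0.5 if buried else -0.5


def toy_pairwise(a, b):
    """E_p contribution: reward two hydrophobic residues in contact."""
    return -1.0 if a in HYDROPHOBIC and b in HYDROPHOBIC else 0.0


def threading_energy(query, alignment, buried, contacts, gap_penalty=2.0):
    """Total energy E_p + E_s + E_g of one alignment.

    alignment[pos] is the query index placed at template position pos,
    or None if that template position is left unaligned.
    """
    e_s = 0.0
    for pos, q in enumerate(alignment):
        if q is not None:
            e_s += toy_singleton(query[q], buried[pos])
    e_p = 0.0
    for i, j in contacts:  # template positions that are close in 3D
        qi, qj = alignment[i], alignment[j]
        if qi is not None and qj is not None:
            e_p += toy_pairwise(query[qi], query[qj])
    used = {q for q in alignment if q is not None}
    n_gaps = alignment.count(None) + (len(query) - len(used))
    e_g = gap_penalty * n_gaps
    return e_p + e_s + e_g


buried = [True, False, True, False]
contacts = [(0, 2)]
full = threading_energy("LKVE", [0, 1, 2, 3], buried, contacts)
gapped = threading_energy("LKVE", [0, None, 2, 3], buried, contacts)
```

Here the ungapped placement scores -4.0 and the gapped one 0.5, so the ungapped alignment is preferred. Note that E_p depends on which query residues end up at both ends of a contact, which is what couples distant parts of the alignment together.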

Threading problem

• Threading: Given a sequence, and a fold (template), compute the optimal alignment score between the sequence and the fold.

• If we can solve the above problem, then:

– Given a sequence, we can try each known fold and find the best fold that fits this sequence.

– Because there are only a few thousand folds, we can find the correct fold for the given sequence.

• Threading is NP-hard in general (when pairwise energies and variable-length gaps are both allowed).
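The "try each known fold" loop can be sketched in a simplified setting. If the pairwise term is dropped and only singleton energy plus a linear gap penalty is kept, the optimal alignment can be found by ordinary dynamic programming in polynomial time; the hardness comes from the pairwise term. The code below implements that simplified version only. The template representation (a list of buried/exposed flags), the toy energy values, and the names `thread_singleton_only` and `recognize_fold` are assumptions for illustration.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification (assumption)


def toy_singleton(residue, buried):
    """Toy singleton energy: hydrophobic residues prefer buried positions."""
    if residue in HYDROPHOBIC:
        return -1.0 if buried else 0.5
    return 0.5 if buried else -0.5


def thread_singleton_only(query, template, gap_penalty=2.0):
    """Minimum of E_s + E_g over all global alignments (pairwise term ignored)."""
    n, m = len(query), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        dp[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + toy_singleton(query[i - 1], template[j - 1]),
                dp[i - 1][j] + gap_penalty,   # query residue left unaligned
                dp[i][j - 1] + gap_penalty,   # template position left unaligned
            )
    return dp[n][m]


def recognize_fold(query, library, gap_penalty=2.0):
    """Score the query against every template and return the lowest-energy one."""
    scores = {name: thread_singleton_only(query, t, gap_penalty)
              for name, t in library.items()}
    best = min(scores, key=scores.get)
    return best, scores


library = {
    "fold_a": [True, False, True, False],   # hypothetical templates
    "fold_b": [False, True, False, True],
}
best, scores = recognize_fold("LKVE", library)
```

In this toy library the query fits `fold_a` best. With the pairwise term restored, the inner minimization is no longer a simple recurrence, which is why methods such as branch and bound or integer programming are used.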

Computational Methods

• Branch and Bound.

• Integer programming – Use linear programming plus branch and bound.

ab initio

threading

homology

Blue Gene

• On December 6, 1999, IBM announced a $100 million research initiative to build the world's fastest supercomputer, "Blue Gene", to tackle fundamental problems in computational biology.

• More than one petaflop/s (1,000,000,000,000,000 floating point operations per second)