Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I...

32
Emidio Capriotti http://biofold.org/ Department of Pharmacy and Biotechnology (FaBiT) University of Bologna Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020

Transcript of Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I...

Page 1: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Emidio Capriotti http://biofold.org/

Department of Pharmacy and Biotechnology (FaBiT) University of Bologna

Introduction and Basic Concepts

Laboratory of Bioinformatics I Module 2

March 10, 2020

Page 2: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Schedule and MaterialsThis module is a 58-hour course running for 8 weeks

March 10 - April 30, 2020

Schedule changes from week to week:

Tuesday, Thursday and Friday 14:00 - 17:00

In April more changes

Project submission deadline May 18, 2020

Course website

http://biofold.github.io/pages/courses/2020/lb1-2.html

Page 3: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Main Aims

• Knowledge of tools for sequence and structure analysis and their development

• Protein functional annotation

• Theoretical background of machine learning approaches

• Problem solving skills and development of basic tools.

Page 4: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Topics• Protein Geometrical Features and Protein Structural Alignment

• Multiple Sequence Alignment

• Hidden Markov Models for Sequence Alignment

• Methods for Building Hidden Markov Models for Proteins

• Protein Structure and Mapping Problems

• Introduction to Statistical Methods and Machine Learning

• Development of Structure Prediction Methods

• Module Project: Model a Protein Domain HMM

Page 5: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Take Home Message

• Protein structure is more conserved than sequence. Proteins sharing high sequence identity usually share similar structures, as proven by pair-wise structural alignment procedures.

•When the identity level is high enough, it is possible to exploit the results of pair-wise sequence alignment for transferring structural information between proteins.

Page 6: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Structural Alignment Given two sets of points A = (a1, a2, …, an) and B = (b1,b2,…bm) in Cartesian space, find the optimal subsets A(P) and B(Q) with |A(P)| = |B(Q)|, and find the optimal rigid body transformation G between the two subsets A(P) and B(Q) that minimizes a given distance metric D over all possible rigid body transformation G, i.e.

The two subsets A(P) and B(Q) define a “correspondence”, and p = |A(P)| = |B(Q)| is called the correspondence length. Naturally, the correspondence length is maximal when A(P) and B(Q) are similar.

Therefore there are essentially two problems in structure alignment: • Find the correspondence set (which is NP-hard), and • Find the alignment transform (which is O(n)).

Bourne P. 2012

RMSD =

(ai −bi)2

i=1

n

∑n

[ ]{ }))(()(min QBGPADG

Page 7: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

The Foundation of Structural Bioinformatics

Chotiha and Lesk, EMBO Journal 1986 Chothia C, Lesk AM. The relation between the divergence of sequence andstructure in proteins. EMBO J. 1986

Page 8: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Why Sequence Alignment?

The measure of sequence similarity allow to make estimation about the structural similarity

Comparison of two sequences for measuring their similarity

• To define a distance between two sequences

• Develop an algorithm for finding the alignment with minimal distance

• To statistically evaluate the significance of the alignment

Page 9: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Sequence Distance ScoreWhich events do we consider? Mutation

It is necessary to define a score for the substitution of residue i with residue j Substitution Matrices s(i,j)

Distance among sequences

Which events do we consider?

MutationIt is necessary to define a score for the substitution ofresidue i with residue j

Substitution matrices s(i,j)

A: ALASVLIRLITRLYPB: ASAVHLNRLITRLYP

¦ ),(),( ii BAsBAScore

BLOSUM62

A:B:

Page 10: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Other eventsDeletion and Insertion: some residues can be inserted or deleted during the evolution

The (negative) score of a gap depends only on the length

σ(n) = -nd linearσ(n) = -d - (n-1) e (d: opening, e: extension)

Distance among sequences

Which events do we consider?

MutationDeletion and InsertionSome residues can be inserted or deleted

A: ALASVLIRLIT--YPB: ASAVHL---ITRLYP

)2()3(),(),( VV �� ¦ ii BAsBAScoreThe (negative) score of a gap depends only on its lengthV(n) = -nd linearV(n) = -d - (n-1)e affine (d: opening, e: extension)

N.B. All the scores are independent of the position alongthe sequence

A:B:

Page 11: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Alignment Algorithms

Algorithms for finding the minimum distance between two sequences

• Global alignment: Needleman-Wunsch: Global alignment-compare pairs of sequences on their whole length

• Local alignment: Smith-Waterman: Local alignment-compare pairs of sequences searching the most similar subsequences

Page 12: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Alignment SignificanceGiven an alignment with score S, is it significant?

Significance can be evaluated by comparing with the score distribution of random alignments

Significant alignments

Significance of an alignment

Given an alignment with score S, is it significant?

Significance can be evaluated by comparing with the score distribution of random alignments

Score

Occ

urre

nce

Page 13: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Structural Homology

Sander and Schneider, Proteins 1991

Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins. 1991;9(1):56-68

Based on the database of homology-derived secondary structure of proteins (HSSP).Define the relation between sequence similarity, structure similarity, and alignment length.

Page 14: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Twilight ZoneB.Rost

Fig. 3. Pairwise sequence identity versus alignment length. The originalHSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appearedto fit the true positives (homologues, A) better than the false positives (B).In contrast, the new curve proposed here (filled diamonds, eqn 2) was moreconservative in excluding false positives. Note that due to the huge numberof pairs the plots for true (A) and false (B) positives appeared almostequally densely populated (Figure 2 revealed the problem of such a scatterplot).

any threshold extracted from the sequence alignment n, thefollowing equations hold (for cumulative numbers):false negatives 1 true positives 5 all pairs of similar structure

true negatives 1 false positives 5 all pairs ofdissimilar structure.

Distance to HSSP thresholdThe HSSP-curve was originally defined by (Sander andSchneider, 1991):

pI(n) 5 n 1 290.15 · L–0.562, for L , 80{ 25 , doe L ˘ 80 (1)where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 corresponds to theoriginal HSSP-curve; n 5 5 to the official HSSP databasereleases; curve plotted in Figure 3). Once Schneider and Sander

88

(1991) had discovered the basic functional dependence betweensequence identity and alignment length, they merely had tofix two free parameters: the factor and the exponent. Bothwere chosen to fit the data observed in 1991, in particular toreach values of 25% around alignment length of 80, andvalues of 100% around alignment length of 10. The principlefunctional dependence described by eqn 1 also followsfrom statistics, as was recently shown in an elegant work(Alexandrov and Soloveyev, 1998). Let pi (i 5 1,..., 20) be theprobability that amino acid i occurs in a protein, and mij thescore for randomly aligning two amino acids i and j. The scoreS of an entire alignment can then be approximated by:

S 5 ,m. · L

where,m. is the expectation value ofmij, and L the alignmentlength. If the values ofmij are independent, Gaussian distributedvariables, it follows (after some elementary operations) thatthe relation between the standard deviation of the values ofmij (σm ), and the resulting score distribution (σS) is:

σm 5 L–0.5 · σsIn their original article Alexandrov and Soloveyev workout an appropriate re-scaling of the dynamic programmingalignment. However, this scheme cannot be applied after thealignment has been completed (as the threshold functions usedin this work), rather it has to be implemented into thealignment method.New curve for length-dependent significance of pairwisesequence identityI attempted to solve the problems of the original HSSP-curve(eqn 1; Results) by defining the following curve for theseparation of true and false positives (Figure 3, grey line withdotted circles):

pI(n 5 n 1 480 · L–0.32 · (1 1 e –L/1000) (2)

where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 plotted in Figure 3).The constraints in visually selecting the final function were (i)to maintain the functional form defined by eqn 1 (and suggestedby the statistics of Alexandrov and Soloveyev, 1998); (ii) tohit the 100% mark at alignments that are too short to revealanything about structural similarity (5 11 residues); (iii) tosaturate at levels around 20% sequence identity (reached forlength5 300); and (iv) to roughly reflect the observed gradient.Saturation for long alignments was realized by the functionalform of the exponent (note: the term 1 e–L/a resulted in anexponential decay). This ‘saturation’ constraint also afflictedthe particular value of the factor (0.32 rather than about 0.5as suggested by the distribution of the data, Figure 4).New curve for length-dependent significance of pairwisesequence similarityThe original HSSP-curve was derived for sequence identity,not for sequence similarity (Sander and Schneider, 1991). Thefunctional dependence between similarity and length appearedcomparable to the one between identity and length (Results).This prompted a similar definition for the separation betweentrue and false positives based on similarity:

pS(n 5 n 1 420 · L–0.335 · (1 1 e –L/2000) (3)

B.Rost

Fig. 3. Pairwise sequence identity versus alignment length. The originalHSSP-curve (Sander and Schneider, 1991) (dotted circles, eqn 1) appearedto fit the true positives (homologues, A) better than the false positives (B).In contrast, the new curve proposed here (filled diamonds, eqn 2) was moreconservative in excluding false positives. Note that due to the huge numberof pairs the plots for true (A) and false (B) positives appeared almostequally densely populated (Figure 2 revealed the problem of such a scatterplot).

any threshold extracted from the sequence alignment n, thefollowing equations hold (for cumulative numbers):

false negatives 1 true positives 5 all pairs of similar structure

true negatives 1 false positives 5 all pairs ofdissimilar structure.

Distance to HSSP thresholdThe HSSP-curve was originally defined by (Sander andSchneider, 1991):

pI(n) 5 n 1 290.15 · L–0.562, for L , 80{ 25 , doe L ˘ 80 (1)

where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 corresponds to theoriginal HSSP-curve; n 5 5 to the official HSSP databasereleases; curve plotted in Figure 3). Once Schneider and Sander

88

(1991) had discovered the basic functional dependence betweensequence identity and alignment length, they merely had tofix two free parameters: the factor and the exponent. Bothwere chosen to fit the data observed in 1991, in particular toreach values of 25% around alignment length of 80, andvalues of 100% around alignment length of 10. The principlefunctional dependence described by eqn 1 also followsfrom statistics, as was recently shown in an elegant work(Alexandrov and Soloveyev, 1998). Let pi (i 5 1,..., 20) be theprobability that amino acid i occurs in a protein, and mij thescore for randomly aligning two amino acids i and j. The scoreS of an entire alignment can then be approximated by:

S 5 ,m. · L

where,m. is the expectation value ofmij, and L the alignmentlength. If the values ofmij are independent, Gaussian distributedvariables, it follows (after some elementary operations) thatthe relation between the standard deviation of the values ofmij (σm ), and the resulting score distribution (σS) is:

σm 5 L–0.5 · σsIn their original article Alexandrov and Soloveyev workout an appropriate re-scaling of the dynamic programmingalignment. However, this scheme cannot be applied after thealignment has been completed (as the threshold functions usedin this work), rather it has to be implemented into thealignment method.New curve for length-dependent significance of pairwisesequence identityI attempted to solve the problems of the original HSSP-curve(eqn 1; Results) by defining the following curve for theseparation of true and false positives (Figure 3, grey line withdotted circles):

pI(n 5 n 1 480 · L–0.32 · (1 1 e –L/1000) (2)

where L gave the number of residues aligned between twoproteins; pI the cut-off percentage of identical residues overthe L aligned residues; and n described the distance inpercentage points from the curve (n 5 0 plotted in Figure 3).The constraints in visually selecting the final function were (i)to maintain the functional form defined by eqn 1 (and suggestedby the statistics of Alexandrov and Soloveyev, 1998); (ii) tohit the 100% mark at alignments that are too short to revealanything about structural similarity (5 11 residues); (iii) tosaturate at levels around 20% sequence identity (reached forlength5 300); and (iv) to roughly reflect the observed gradient.Saturation for long alignments was realized by the functionalform of the exponent (note: the term 1 e–L/a resulted in anexponential decay). This ‘saturation’ constraint also afflictedthe particular value of the factor (0.32 rather than about 0.5as suggested by the distribution of the data, Figure 4).New curve for length-dependent significance of pairwisesequence similarityThe original HSSP-curve was derived for sequence identity,not for sequence similarity (Sander and Schneider, 1991). Thefunctional dependence between similarity and length appearedcomparable to the one between identity and length (Results).This prompted a similar definition for the separation betweentrue and false positives based on similarity:

pS(n 5 n 1 420 · L–0.335 · (1 1 e –L/2000) (3)

Rost, Proteins 1999

In the region above 20% of sequence identity, 90% of alignments correspond to homologous protein; while below 25% only 10%.

For sequences longer than 100 residues

Midnight zone:

Similar structuresare abundant but

cannot be recognized

Twilight zone:

Many false positives(similar sequences

with different structures)

Safe zone:

No false positives(all the similar

sequences have similar structures)

20% 30%Identity (%)

Sequence identity and structural identity

20% 30% % Identity

Page 15: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Comparative ModelingFlow chart of Comparative Modeling

Template selection

Sequence alignment

Model building

Model evaluation

Target

Model

PDB

>targetIIGGVESRPHSRPYMAHLEITTE RGFTATCGGFLITRQFVMTAAHC SGREITVTLGAHDVSKTES....

Quality check

Yes

No

Sequence matching

Template Model

0 50 100 150 200 250Residue

-40

-30

-20

-10

0

10

E/kT

E/kT?

target IIGGVESRPHSRPYMAHLEI 3RP2A IIGGVESIPHSRPYMAHLDItarget TTERGFTATCGGFLITRQ..3RP2A VTEKGLRVICGGFLISRQ..

Page 16: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Use of Predicted StructuresDepending off the sequence similarity with the template the predicted structure can be used for different purposes

•Comparative Modeling

•Threading

•Ab initio or De novo predictions

Baker and Sali (2001). Science, 294: 93-96.

ly (19, 20). Other factors such as templateselection and alignment accuracy usuallyhave a larger impact on the model accuracy,especially for models based on less than 40%sequence identity to the templates.

There is a wide range of applications ofprotein structure models (Figs. 1 and 2). Forexample, high- and medium-accuracy com-parative models frequently are helpful in re-fining functional predictions that have beenbased on a sequence match alone becauseligand binding is more directly determined bythe structure of the binding site than by itssequence. It is often possible to correctlypredict features of the target protein that donot occur in the template structure. The sizeof a ligand may be predicted from the volumeof the binding site cleft (Fig. 2A). For exam-ple, the complex between docosahexaenoicfatty acid and brain lipid-binding protein wasmodeled on the basis of its 62% sequenceidentity to the crystallographic structure ofadipocyte lipid-binding protein (PDB code1ADL) (21). A number of fatty acids wereranked for their affinity to brain lipid-bindingprotein consistently with site-directed mu-

tagenesis and affinity chromatography exper-iments, even though the ligand specificityprofile of this protein is different from that ofthe template structure. Another example isprediction of a binding site for a chargedligand based on a cluster of charged residueson the protein, as was done for mouse mastcell protease 7 (Fig. 2B) (22). The predictionof a proteoglycan binding patch was con-firmed by site-directed mutagenesis and hep-arin-affinity chromatography experiments.Fortunately, errors in the functionally impor-tant regions in comparative models are manytimes relatively low because the functionalregions, such as active sites, tend to be moreconserved in evolution than the restof the fold. The utility of low-accuracy com-parative models can be illustrated by a mo-lecular model of the whole yeast ribosome,whose construction was facilitated by fit-ting comparative models of many ribosom-al proteins into the electron microscopymap of the ribosomal particle (23). Thisexample also suggests that structuralgenomics of single proteins or their do-mains, combined with protein structure pre-

diction, may contribute substantially to ef-ficient structural characterization of largemacromolecular assemblies.

The accuracy and reliability of modelsproduced by de novo methods is much lowerthan that of comparative models based onalignments with more than 30% sequenceidentity, but the basic topology of a protein ordomain can in some cases be predicted rea-sonably well (Fig. 1, D and E). For roughly40% of proteins shorter than 150 amino acidsthat have been examined, one of the five mostcommonly recurring models generated byRosetta has sufficient global similarity to thetrue structure to recognize it in a search of theprotein structure database. Reasonable mod-els can in some cases be produced for do-mains of even very large proteins by usingmultiple sequence alignments to identify do-main boundaries (Fig. 1D).

The accuracy of de novo models is toolow for problems requiring high-resolutionstructure information. Instead, the low-reso-lution models produced by these methods canreveal structural and functional relationshipsbetween proteins not apparent from their ami-no acid sequences and provide a frameworkfor analyzing spatial relationships betweenevolutionarily conserved residues or betweenresidues shown experimentally to be func-tionally important. These applications are il-lustrated by examples from the recent CASP4blind protein structure prediction experiment(24, 25). The predicted structure of a proteininvolved in cell lysis (26 ) was found to bestructurally related to a protein with a similarfunction but no significant sequence similar-ity (Fig. 2B). The predicted structure of adomain of the mismatch repair protein MutS(27) (Fig. 1D) has structural similarity toproteins with related functions (28). Func-tionally important residues of the signalingprotein Frizzled (29) were clustered in thepredicted structure in a surface patch likely tobe involved in a key protein-protein interac-tion (Fig. 2C). Thus, in favorable cases denovo predictions can provide some of themost important functional insights obtainablefrom experimentally determined structures.

Modeling on a Genomic ScaleThreading and comparative modeling methodshave already been applied on a genomic scale(18, 30, 31). In total, domains in 58% of all600,000 known protein sequences were mod-eled with ModPipe (18) and MODELLER (9)and deposited into a comprehensive database ofcomparative models, ModBase (32–34). TheWeb interface to the database allows flexiblequerying for fold assignments, sequence-struc-ture alignments, models, and model assessmentsof interest. An integrated sequence/structureviewer, ModView, allows inspection and anal-ysis of the query results. ModBase will be in-creasingly interlinked with other applications

Fig. 1. Accuracy andapplication of proteinstructure models.Shown are the differ-ent ranges of applica-bility of comparativeprotein structure mod-eling, threading, and denovo structure predic-tion; the correspondingaccuracy of proteinstructure models; andtheir sample applica-tions. (A through C).Sample comparativemodels based on about60% (A), 40% (B), and30% (C) sequenceidentity to their tem-plate structure. (D andE) Examples of Rosettade novo structure pre-dictions for the CASP4structure predictionexperiment. Predictedstructures are in red,and actual structuresare in blue. The accura-cy of the models de-crease significantly ingoing from (A) to (E),but the overall struc-ture is still roughly cor-rect. (D) A domainfrom the 811-residueMUtS protein whichwas recognized as anautonomous unit froman alignment of ho-mologous sequences;such parsing of largeproteins into domainscan make structure

prediction more tractable.

5 OCTOBER 2001 VOL 294 SCIENCE www.sciencemag.org94

G E N O M E : U N L O C K I N G B I O L O G Y ’ S S T O R E H O U S E

on

July

4, 2

007

ww

w.s

cien

cem

ag.o

rgD

ownl

oade

d fro

m

Page 17: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Remote homologsSequences longer than 100 residues and sharing more the 30% of residues have similar structures (for shorter sequences the level of identity must be higher).

This DO NOT exclude that sequences sharing lower identity have similar structures.

Pairs proteins with similar structure and low sequence identity are referred as “remote homologs”

Example:Sperm Whale Myoglobin (1JP6:A)Bacterial Haemoglobin (1VHB:A) RMSD = 0.18 nm, Identity: 12%

aligned by TM-align

Page 18: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Sequence Identity InferenceCan we use sequence similarity to predict other features of an unknown protein?

Solution: Define a the sequence similarity threshold that allow a reliable transfer of annotation features.

In other words we need to find the problem specific twilight region

Midnight zone: Twilight zone: Safe zone:

?% ?%Identity (%)

Sequence identity and function inference

Are the thresholds for functional inference equal to those valid for structural inference?

If not, how would you fix the thresholds?

?% % Identity?%

Page 19: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Subcellular LocalizationSequence identity for reliably transferring subcellular localization is higher than that required for transferring structure.

Rost, Nair: Protein Sci. 2002 Dec; 11(12): 2836–2847.

IT IS NECESSARY TO START FROM EXPERIMENTAL DATA:Example: Sequence identity and subcellular localization.

Sequence identity for reliably transferring subcellular localization is higher than that required for transferring structure

Rost and Nair, Protein Sci. 2002

Page 20: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

A false positiveQ9SLK0 (ICDHX_ARATH):Peroxisomal isocitrate dehydrogenase

Q9SRZ6 (ICDHC_ARATH): Cytosolic isocitrate dehydrogenase

84.2% identity (93.3% similar) in 417 aa overlap

Lalign at ExPASy

Page 21: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Functional AnnotationSequence identity for can be used for functional annotation measuring the identity and similarity between Gene Ontology terms.

Sangar V et al. BMC Bioinformatics 2007

Vineet Sangar1 Daniel J Blankenberg, Naomi Altman and Arthur M Lesk: BMC Bioinformatics 2007, 8:294

IT IS NECESSARY TO START FROM EXPERIMENTAL DATA:Example: Sequence identity and similarity between GO annotations

Page 22: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Dissimilar functionsP04385 (GAL1_YEAST) Galactokinase

Catalytic activity ATP + alpha-D-galactose = ADP + alpha-D-galactose 1-phosphate.

P13045 (GAL3_YEAST) Protein GAL3 The GAL3 regulatory function is required for rapid induction of the galactose system.

72.9% identity (90.5% similar) in 528 aa overlap

Lalign at ExPASy

Page 23: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Case StudyElectron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.

Back to the sequence-to-structure relation: Cytochrome C

HIS 19

CYS 18

CYS 15

MET 81

Feature key Position(s) Length Description

Binding sitei 15 – 15 1 Heme (covalent)Binding sitei 18 – 18 1 Heme (covalent)Metal bindingi 19 – 19 1 Iron (heme axial ligand)Metal bindingi 81 – 81 1 Iron (heme axial ligand)

Electron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.

PDB: 3zcf:A

UniProt: P99999

Back to the sequence-to-structure relation: Cytochrome C

HIS 19

CYS 18

CYS 15

MET 81

Feature key Position(s) Length Description

Binding sitei 15 – 15 1 Heme (covalent)Binding sitei 18 – 18 1 Heme (covalent)Metal bindingi 19 – 19 1 Iron (heme axial ligand)Metal bindingi 81 – 81 1 Iron (heme axial ligand)

Electron carrier protein. The oxidized form of the cytochrome c heme group can accept an electron from the heme group of the cytochrome c1 subunit of cytochrome reductase. Cytochrome c then transfers this electron to the cytochrome oxidase complex, the final protein carrier in the mitochondrial electron-transport chain.

PDB: 3zcf:A

UniProt: P99999

Page 24: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Homo vs HorseHuman Cytochrome C – Uniprot:P99999. PDB: 3ZCF:A Equine Cytochrome C – Uniprot: P00004. PDB 3O20:ACytochrome C (Homo vs. Horse)

Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:AEquine Cytochrome C – Uniprot: P00004. PDB 3O20:A

Structural alignment: RMSD= 0,035 nm88% sequence identity

1:A 20:A 40:A 60:A | | . | . | . | . | . | . |GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLEN||||||||||:.||.||||||||||||||||||||||||||||||:.||.|||||||.|.|:||||||||GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITWKEETLMEYLEN| | . | . | . | . | . | . |1:A 20:A 40:A 60:A

80:A 100:A . | . | . |

PKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE||||||||||||.|||||.||.||||||||||||PKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE

. | . | . | 80:A 100:A

Structural alignment: RMSD= 0.035 nm

88% sequence identity

Page 25: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Sequence vs Structure In this case the sequence alignment is the same of the structural alignment and the positions of the binding sites are conserved.

Lalign at ExPASy

Cytochrome C (Homo vs. Horse)

Structural alignment: RMSD= 0,035 nm88% sequence identity

88.6% identity (95.2% similar) in 105 aa overlap (1-105:1-105)

10 20 30 40 50 60Homo MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIW

:::::::::::..::.::::::::::::::::::::::::::::::..:: ::::::: :Horse MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITW

10 20 30 40 50 60

70 80 90 100 Homo GEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE

:.::::::::::::::::::::.::::: :: ::::::::::::Horse KEETLMEYLENPKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE

70 80 90 100

GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLEN||||||||||:.||.||||||||||||||||||||||||||||||:.||.|||||||.|.|:||||||||GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITWKEETLMEYLEN

PKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE||||||||||||.|||||.||.||||||||||||PKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE

Sequence alignment: 88% sequence identity

IDENTICAL TO STRUCTURAL ALIGNMENT

IT CAN BE SAFELY ADOPTED FOR MODELLING PROCEDURES

portion where structural and sequence alignments coincide

Sequence alignment: 88% sequence identity IDENTICAL TO STRUCTURAL ALIGNMENT

Page 26: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Homo vs Rhodobacter Sph.Human Cytochrome C — Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodobacter Sph. – Uniprot: P0C0X8. PDB 1CXC:A

Structural alignment: RMSD= 0,18 nm

28% sequence identity

Cytochrome C (Homo vs. Rhodobacter sphaeroides)

Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodobacter Sph. – Uniprot: P0C0X8. PDB 1CXC:A

Structural alignment: RMSD= 0,18 nm28% sequence identity

1:A 20:A 40:A | | . | . | . | . | . GDVEKGKKIFIMKCSQCHTVEKGG-------KHKTGPNLHGLFGRKTGQAPGYS-YTAANKNKG---IIW||.|.|.|.|. .|..||.:.... ..||||||:|..||..|....:. |....|..| :.|GDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGEGMKEAGAKGLAW| | . | . | . | . | . | . | 3:A 20:A 40:A 60:A

60:A 80:A 100:A | . | . | . | . | GEDTLMEYLENPKKYI--------PGTKMIFVGIKKKEERADLIAYLKKATNE.|:...:|.:.|.|:: ...||.| .:||..:...:.|||......DEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEADAHNIWAYLQQVAVR

. | . | . | . | . | 80:A 100:A 120:A

Page 27: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Sequence vs Structure (I)In this case the sequence alignment can be used for homology modeling after a refinement of the alignment because one binding site is not conserved.

Lalign at ExPASy

Structural alignment: RMSD= 0,18 nm

28% sequence identityStructural alignment: RMSD= 0,18 nm28% sequence identity

GDVEKGKKIFIMKCSQCHTVEKGG-------KHKTGPNLHGLFGRKTGQAPGYS-YTAANKNKG---IIW||.|.|.|.|. .|..||.:.... ..||||||:|..||..|....:. |....|..| :.|GDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGEGMKEAGAKGLAW

GEDTLMEYLENPKKYI--------PGTKMIFVGIKKKEERADLIAYLKKATNE.|:...:|.:.|.|:: ...||.| .:||..:...:.|||......DEEHFVQYVQDPTKFLKEYTGDAKAKGKMTF-KLKKEADAHNIWAYLQQVAVR

Sequence alignment: 29% sequence identity

SIMILAR (NOT IDENTICAL) TO STRUCTURAL ALIGMENT

IT CAN BE ADOPTED FOR MODELLING PROCEDURES (WITH SOME REFINEMENT)

Global without end-gap score: 111; 29.3% identity (56.1% similar) in 123 aa overlap10 20 30 40 50

sp|P99 MGDVEKGKKIFIMKCSQCHTVEK-------GGKHKTGPNLHGLFGRKTG-QAPGYSYTA:: : : : : .:. ::.. : . ::::::.:. :: .: :: .:

sp|P0C QEGDPEAGAKAF-NQCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQADFKGYGE10 20 30 40 50

60 70 80 90 100 sp|P99 ANKN---KGIIWGEDTLMEYLENPKKYIP-------GTKMIFVGIKKKEERADLIAYLKK

. :. ::. : :. ...:...: :.. . . .::. . .. :::..sp|P0C GMKEAGAKGLAWDEEHFVQYVQDPTKFLKEYTGDAKAKGKMTFKLKKEADAHNIWAYLQQ

60 70 80 90 100 110

sp|P99 ATNE .. .

sp|P0C VAVRP120

Cytochrome C (Homo vs. Rhodobacter sphaeroides)

portion where structural and sequence alignments coincide

Page 28: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Homo vs Rhodobacter Pal.Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodopseudomons pal. – Uniprot: P00091. PDB 1I8O:A

Structural alignment: RMSD= 0,13 nm

29% sequence identity

Cytochrome C (Homo vs. Rhodopseudomonas palustris)

Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C2 Rhodopseudomons pal. – Uniprot: P00091. PDB 1I8O:A

Structural alignment: RMSD= 0,13 nm29% sequence identity

1:A 20:A 40:A 60:A | | . | . | . | . | . | . GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEY.|...|..:|. .|..||.. .|...||.|.|..|||.|.|.|:.|...|.|.| ::|..|.:..|xDAKAGEAVFK-QCMTCHRA---DKNMVGPALAGVVGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIVPY| | . | . | . | . | . | . 1:A 20:A 40:A 60:A

80:A 100:A | . | . | . |

LENPKKYIP--------------GTKMIFVGIKKKEERADLIAYLKKAT|..|..::. .|||.| .:...::|.|.:|||....LADPNAFLKKFLTEKGKADQAVGVTKMTF-KLANEQQRKDVVAYLATLK

| . | . | . | . | 80:A 100:A

Page 29: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Sequence vs Structure (II)In this case the sequence alignment needs to be fixed homology to because all the binding site shifted.

Lalign at ExPASy

Cytochrome C (Homo vs. Rhodopseudomonas palustris)

Structural alignment: RMSD= 0,13 nm29% sequence identity

GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEY

.|...|..:|. .|..||.. .|...||.|.|..|||.|.|.|:.|...|.|.| ::|..|.:..|

xDAKAGEAVFK-QCMTCHRA---DKNMVGPALAGVVGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIVPY

LENPKKYIP--------------GTKMIFVGIKKKEERADLIAYLKKAT

|..|..::. .|||.| .:...::|.|.:|||....

LADPNAFLKKFLTEKGKADQAVGVTKMTF-KLANEQQRKDVVAYLATLK

Sequence alignment: 29% sequence identity

IS IT SIMILAR TO STRUCTURAL ALIGMENT?

COULD IT BE SAFELY ADOPTED FOR MODELLING PROCEDURES?

Global without end-gap score: 152; 28.7% identity (63.0% similar) in 108 aa overlap

10 20 30

sp|P99 MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGL

:.. :. .: .:: : ... :. .:: : :.

sp|P00 MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVF----KQCMTCHRADKNMVGPALGGV

10 20 30 40 50

40 50 60 70 80 90

sp|P99 FGRKTGQAPGYSYTAANKNKG---IIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERA

:::.: : :..:. :.:.: ..: :....::..:. .. : ... : .. .

sp|P00 VGRKAGTAAGFTYSPLNHNSGEAGLVWTADNIINYLNDPNAFL---KKFLTDKGKADQAV

60 70 80 90 100 110

100

sp|P99 DLIAYLKKATNE

. . : .::

sp|P00 GVTKMTFKLANEQQRKDVVAYLATLK

120 130 portion where structural and sequence alignments coincide

Structural alignment: RMSD= 0,13 nm

29% sequence identity

Page 30: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Homo vs ArabidopsisHuman Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C6A Arabidopsis Thaliana – Uniprot: Q93VA3. PDB 2CE0:A

Structural alignment: RMSD= 0,35 nm

13% sequence identity

Cytochrome C (Homo vs. Arabidopsis thaliana)

Human Cytochrome C - Uniprot:P99999. PDB: 3ZCF:ACytochrome C6A Arabidopsis Thaliana – Uniprot: Q93VA3. PDB 2CE0:A

Structural alignment: RMSD= 0,35 nm13% sequence identity

1:A 20:A 40:A 60:A | | . | . | . | . | . | . GDVEKGKKIFIMKCSQCHTVEKGGKHKTGP---NLHG-LFGRKTGQAPGYSYTAANKNKGIIWGEDTLME.|:::|..:|...|..||... |... . .|.. ...|. .:..|:.:..LDIQRGATLFNRACAACHDTG--GNII--QPGATLFTKDLERN-----------------GVDTEEEIYR| | . | . | . | . | 3:A 20:A 40:A

80:A 100:A | . | . | . |

YLE---------NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE... ..|....|.......:... |...|..:.|....:VTYFGKGRMPGFGEKCTPRGQCTFGPRLQDE-EIKLLAEFVKFQADQ

. | . | . | . | . 60:A 80:A

Page 31: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Sequence vs Structure (III)In this case the sequence alignment is significantly different form the structural alignment.

Lalign at ExPASy

Cytochrome C (Homo vs. Arabidopsis thaliana)

Structural alignment: RMSD= 0,35 nm13% sequence identity

GDVEKGKKIFIMKCSQCHTVEKGGKHKTGP---NLHG-LFGRKTGQAPGYSYTAANKNKGIIWGEDTLME.|:::|..:|...|..||... |... . .|.. ...|. .:..|:.:..LDIQRGATLFNRACAACHDTG--GNII--QPGATLFTKDLERN-----------------GVDTEEEIYR

YLE---------NPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE... ..|....|.......:... |...|..:.|....:VTYFGKGRMPGFGEKCTPRGQCTFGPRLQDE-EIKLLAEFVKFQADQ

Sequence alignment: 20% sequence identity

VERY DIFFERENT FROM STRUCTURAL ALIGMENT

IT CANNOT BE SAFELY ADOPTED FOR MODELLING PROCEDURES

Global without end-gap score: 3; 20.0% identity (43.8% similar) in 105 aa overlap10 20 30

Homo MGDVEKGKKIFIMKCSQCHTVEKGGKHKTG:...: .: : :: . :. . :

A.Thal DFLLKKLAPPLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDT-GGNIIQPG50 60 70 80 90 100

40 50 60 70 80 90sp|P99 PNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKE

.: ...: : . . .:. . . : : : . : : . ..sp|Q93 ATLFTKDLERNGVD-----TEEEIYRVTYFGKGRMPGFGE---KCTPRGQCTF-GPRLQD

110 120 130 140 150

100 sp|P99 ERADLIAYLKKATNE

:. :.: . : . sp|Q93 EEIKLLAEFVKFQADQGWPTVSTD

160 170 portion where structural and sequence alignments coincide

Structural alignment: RMSD= 0,35 nm

13% sequence identity

Page 32: Introduction and Basic Concepts · Introduction and Basic Concepts Laboratory of Bioinformatics I Module 2 March 10, 2020. Schedule and Materials This module is a 58-hour course running

Search for Better AlignmentWhy is it not sufficient to align sequences (when identity is low) to recover information, not even for “important” residues?

Sequence alignments are «general» and treat each position in the same way There is no knowledge on the «important» sites

How can we detect the “important” residues starting from protein structures (even when information on catalytic sites is not available)?

Compare multiple structures and analyze the conservation of residues

How can we align sequences constraining the alignment of important residues?

Compare multiple sequences and check for the conservation of patterns Use alignment frameworks able to introduce positional dependences.