Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment •...

50

Transcript of Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment •...

Page 1: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 2: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• Sequence analysis can be used to assign functionto genes and proteins by the study of thesimilarities between the compared sequences.

• In bioinformatics, a sequence alignment is a wayof arranging the sequences of DNA, RNA orprotein to identify regions of similarity that maybe a consequence of functional, structural, orevolutionary relationships between sequences.

Page 3: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 4: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 5: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 6: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 7: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 8: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Pairwise Sequence Alignment

• Sequence comparison lies at the heart ofbioinformatics analysis.

• It is an important first step towards structural andfunctional analysis of newly determinedsequences.

• As new biological sequences are being generatedat exponential rates, sequence comparison isbecoming increasingly important to drawfunctional and evolutionary inference of a newprotein with proteins already existing in thedatabase.

Page 9: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• The most fundamental process in this type ofcomparison is sequence alignment.

• This is the process by which sequences arecompared by searching for common characterpatterns and establishing residue correspondenceamong related sequences.

• Pairwise sequence alignment is the process ofaligning two sequences and is the basis ofdatabase similarity searching and multiplesequence alignment

Page 10: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• Pairwise sequence alignment is the fundamentalcomponent of many bioinformatics applications.

• Pairwise sequence alignment is used to identify regions ofsimilarity that may indicate functional, structural and or/evolutionary relationships between two biologicalsequence e.g say: (protein or nucleic acid)

• It is extremely useful in structural, functional, andevolutionary analyses of sequences. Pairwise sequencealignment provides inference for the relatedness of twosequences.

Page 11: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 12: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 13: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• DNA and proteins are products of evolution.

• For example, active site residues of an enzyme family tend to beconserved because they are responsible for catalytic functions.

• Therefore, by comparing sequences through alignment, patterns ofconservation and variation can be identified. The degree ofsequence conservation in the alignment reveals evolutionaryrelatedness of different sequences, whereas the variation betweensequences reflects the changes that have occurred during evolutionin the form of substitutions, insertions, and deletions.

• Identifying the evolutionary relationships between sequenceshelps to characterize the function of unknown sequences.

• When a sequence alignment reveals significant similarity among agroup of sequences, they can be considered as belonging to thesame family

Page 14: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• Therefore, sequence alignment can be used as basis for prediction ofstructure and function of uncharacterized sequences.

• Sequence alignment provides inference for the relatedness of twosequences under study.

• If the two sequences share significant similarity, that the two sequencesmust have derived from a common evolutionary origin.

• When a sequence alignment is generated correctly, it reflects theevolutionary relationship of the two sequences: regions that are alignedbut not identical represent residue substitutions; regions where residuesfrom one sequence correspond to nothing in the other representinsertions or deletions that have taken place on one of the sequencesduring evolution.

Page 15: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Multiple Sequence Alignment• A natural extension of pairwise alignment is multiple sequence

alignment, which is to align multiple related sequences toachieve optimal matching of the sequences.

• Related sequences are identified through the databasesimilarity searching.

• As the process generates multiple matching sequence pairs, it isoften necessary to convert the numerous pairwise alignmentsinto a single alignment, which arranges sequences in such away that evolutionarily equivalent positions across allsequences are matched.

• There is a unique advantage of multiple sequence alignmentbecause it reveals more biological information than manypairwise alignments.

Page 16: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• Many conserved and functionally critical amino acid residues canbe identified in a protein multiple alignment.

• Multiple sequence alignment is also an essential prerequisite tocarrying out phylogenetic analysis of sequence families andprediction of protein secondary and tertiary structures.

• However, the amount of computing time and memory it requiresincreases exponentially as the number of sequences increases. As aconsequence, full dynamic programming cannot be applied fordatasets of more than ten sequences. In practice, heuristicapproaches are most often used.

• Multiple sequence alignment is to arrange sequences in such a waythat a maximum number of residues from each sequence arematched up according to a particular scoring function.

• They are more computationally complex than pairwise.

Page 17: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 18: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 19: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

SEQUENCE HOMOLOGY VERSUS SEQUENCE SIMILARITY

• When two sequences are descended from a common evolutionary origin, theyare said to have a homologous relationship or share homology. A related butdifferent term is sequence similarity, which is the percentage of aligned residuesthat are similar in physiochemical properties such as size, charge, andhydrophobicity.

• To be clear, sequence homology is an inference or a conclusion about a commonancestral relationship drawn from sequence similarity comparison when the twosequences share a high enough degree of similarity.

• On the other hand, similarity is a direct result of observation from the sequencealignment.

• Sequence similarity can be quantified using percentages; homology is aqualitative statement. For example, one may say that two sequences share 40%similarity but It is incorrect to say that the two sequences share 40% homology.They are either homologous or non homologous.

• Generally, if the sequence similarity level is high enough, a common evolutionaryrelationship can be inferred.

Page 20: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 21: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 22: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 23: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 24: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

SEQUENCE SIMILARITY VERSUS SEQUENCE IDENTITY

• Another set of related terms for sequence comparisonare sequence similarity and sequence identity.

• Sequence similarity and sequence identity aresynonymous for nucleotide sequences.

• For protein sequences, however, the two concepts arevery different. In a protein sequence alignment, sequenceidentity refers to the percentage of matches of the sameamino acid residues between two aligned sequences.

• Similarity refers to the percentage of aligned residuesthat have similar physicochemical characteristics and canbe more readily substituted for each other.

Page 25: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 26: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• Sequence similarity is a measure of an empiricalrelationship between sequences. Its commonobjective is establishing the likelihood for sequencehomology i.e chance that sequences has evolvedfrom a common ancestor.

• Sequence identity is the amount of characterswhich match exactly between two differentsequences. A similarity score is therefore aimed toapproximate the evolutionary distance between apair of nucleotide or protein sequences.

Page 27: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• The overall goal of pairwise sequence alignment isto find the best pairing of two sequences, suchthat there is maximum correspondence amongresidues.

• To achieve this goal, one sequence needs to beshifted relative to the other to find the positionwhere maximum matches are found.

• There are two different alignment strategies thatare often used: global alignment and localalignment.

Page 28: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 29: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Global Alignment and Local Alignment• In global alignment, two sequences to be aligned are assumed to be generally

similar over their entire length. Alignment is carried out from beginning to end ofboth sequences to find the best possible alignment across the entire lengthbetween the two sequences. You take entirety of both sequences intoconsideration when finding alignment.

• Local alignment, on the other hand, does not assume that the two sequences inquestion have similarity over the entire length. It only finds local regions with thehighest level of similarity between the two sequences and aligns these regionswithout regard for the alignment of the rest of the sequence regions. When youtake small portion into account.

• This approach can be used for aligning more divergent sequences with the goalof searching for conserved patterns in DNA or protein sequences.

• The two sequences to be aligned can be of different lengths. This approach ismore appropriate for aligning divergent biological sequences containing onlymodules that are similar, which are referred to as domains or motifs.

Page 30: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 31: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 32: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 33: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar
Page 34: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Alignment Algorithms• Very short or very similar sequences can be aligned by hand.

However, most interesting problems require alignment oflengthy numerous numerous sequences that cannot be alignedby human effort.

• Instead, human knowledge is applied in constructing algorithmsto produce high quality sequence alignments.

• Alignment algorithms, both global and local, are fundamentallysimilar and only differ in the optimization strategy used inaligning similar residues.

• Both types of algorithms can be based on one of the threemethods:

• the dot matrix method, the dynamic programming method, andthe word method. The word method, which is used in fastdatabase similarity searching .

• The dot matrix method is useful in visually identifying similarregions, but lacks the sophistication of the other two methods.

Page 35: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Dot Matrix Method• The most basic sequence alignment method is the dot matrix

method, also known as the dot plot method. It is a graphicalway of comparing two sequences in a two dimensional matrix.In a dot matrix, two sequences to be compared are written inthe horizontal and vertical axes of the matrix.

• The comparison is done by scanning each residue of onesequence for similarity with all residues in the other sequence.If a residue match is found, a dot is placed within the graph.

• Otherwise, the matrix positions are left blank. When the twosequences have substantial regions of similarity, many dots lineup to form contiguous diagonal lines, which reveal the sequencealignment.

• If there are interruptions in the middle of a diagonal line, theyindicate insertions or deletions. The dot matrix method gives adirect visual statement of the relationship between twosequences and helps easy identification of the regions ofgreatest similarities

Page 36: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Figure 1: Example of comparing two sequences using dot plots. Lines linking the dots indiagonals indicate sequence alignment. Diagonal lines above or below the main diagonalrepresent internal repeats of either sequence.

Page 37: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• Dynamic programming is an exhaustive and quantitativemethod to find optimal alignments. This methodeffectively works in three steps.

• It first produces a sequence versus sequence matrix.

• The second step is to accumulate scores in the matrix.

• The last step is to trace back through the matrix in reverseorder to identify the highest scoring path.

This scoring step involves the use of scoring matrices andgap penalties.

Page 38: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Dynamic Programming Method

• Dynamic programming is a method that determines optimalalignment by matching two sequences for all possible pairsof characters between the two sequences.

• It is fundamentally similar to the dot matrix method in thatit also creates a two dimensional alignment grid.

• However, it finds alignment in a more quantitative way byconverting a dot matrix into a scoring matrix to account formatches and mismatches between sequences.

• By searching for the set of highest scores in this matrix, thebest alignment can be accurately obtained.

Page 39: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Database Similarity Searching

• A main application of pairwise alignment isretrieving biological sequences in databases basedon similarity.

• Thus, database similarity searching is pairwisealignment on a large scale. However, the dynamicprogramming method is slow and impractical to usein most cases.

• Special search methods are needed to speed up thecomputational process of sequence comparison.

Page 40: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• There are unique requirements for implementing algorithms forsequence database searching. The first criterion is sensitivity,which refers to the ability to find as many correct hits as possible.It is measured by the extent of inclusion of correctly identifiedsequence members of the same family. These correct hits areconsidered “true positives” in the database searching exercise.

• The second criterion is selectivity, also called specificity, whichrefers to the ability to exclude incorrect hits. These incorrect hitsare unrelated sequences mistakenly identified in databasesearching and are considered “false positives.”

• The third criterion is speed, which is the time it takes to get resultsfrom database searches.

• Depending on the size of the database, speed sometimes can be aprimary concern. Ideally, one wants to have the greatestsensitivity, selectivity, and speed in database searches.

Page 41: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• In database searching, as well as in many other areas inbioinformatics, are two fundamental types of algorithms. One is theexhaustive type, which uses a rigorous algorithm to find the best orexact solution for a particular problem by examining allmathematical combinations.

• Dynamic programming is an example of the exhaustive method andis computationally very intensive.

• Another is the heuristic type, which is a computational strategy tofind an empirical or near optimal solution by using rules of thumb.

• Essentially, this type of algorithms take shortcuts by reducing thesearch space according to some criteria. It is often used because ofthe need for obtaining results within a realistic time frame withoutsignificantly sacrificing the accuracy of the computational output.

Page 42: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

HEURISTIC DATABASE SEARCHING• Searching a large database using the dynamicprogramming methods, such as the Smith–Watermanalgorithm, although accurate and reliable, is too slowand impractical when computational resources arelimited

• Thus, speed of searching became an important issue. Tospeed up the comparison, heuristic methods have to beused.

• The heuristic algorithms perform faster searches becausethey examine only a fraction of the possible alignmentsexamined in regular dynamic programming.

• Currently, there are two major heuristic algorithms forperforming database searches: BLAST and FASTA.

Page 43: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Profiles and Hidden Markov Models• One of the applications of multiple sequence alignments inidentifying related sequences in databases is byconstruction of position-specific scoring matrices (PSSMs),profiles, and hidden Markov models (HMMs). These arestatistical models that reflect the frequency information ofamino acid or nucleotide residues in a multiple alignment.

• The purpose of establishing the mathematical models is toallow partial matches with a query sequence so they can beused to detect more distant members of the same sequencefamily, resulting in an increased sensitivity of databasesearches.

Page 44: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

PROFILES• Actual multiple sequence alignments often contain gaps of varying lengths.

When gap penalty information is included in the matrix construction, a profile iscreated. In other words, a profile is a PSSM with penalty information regardinginsertions and deletions for a sequence family.

• However, in the literature, profile is often used interchangeably with PSSM,even though the two terms in fact have subtle but significant differences.Position-specific scoring matrices

• PSSMs, profiles, and HMMs are statistical models that represent the consensusof a sequence family. Because they allow partial matches, they are moresensitive in detecting remote homologs than regular sequence alignmentmethods.

• A PSSM by definition is a scoring table derived from ungapped multiplesequence alignment.

• A profile is similar to PSSM, but also includes probability information for gapsderived from gapped multiple alignment. An HMM is similar to profiles butdifferentiates insertions from deletions in handling gaps.

Page 45: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• The probability calculation in HMMs is morecomplex than in profiles. It involves travelingthrough a special architecture of variousobserved and hidden states to describe a gappedmultiple sequence alignment.

• As a result of flexible handling of gaps, HMM ismore sensitive than profiles in detecting remotesequence homologs.

• All three types of models require trainingbecause the statistical parameters have to bedetermined according to alignment of sequencefamilies.

Page 46: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

MARKOV MODEL AND HIDDEN MARKOV MODEL

Markov Model• A more efficient way of computing matching scoresbetween a sequence and a sequence profile isthrough the use of HMMs, which are statisticalmodels originally developed for use in speechrecognition.

• This statistical tool was subsequently found to beideal for describing sequence alignments. Tounderstand HMMs, it is important to have somegeneral knowledge of Markov models.

Page 47: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• A Markov model, also known as Markov chain, describes asequence of events that occur one after another in a chain. Eachevent determines the probability of the next event.

• A Markov chain can be considered as a process that moves in onedirection from one state to the next with a certain probability,which is known as transition probability.

• A good example of a Markov model is the signal change of trafficlights in which the state of the current signal depends on the stateof the previous signal (e.g., green light switches on after red light,which switches on after yellow light).

• Biological sequences written as strings of letters can be describedby Markov chains as well; each letter representing a state is linkedtogether with transitional probability values.

Page 48: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

A simple representation of a markov chain which consist of a linear events or states(numbered) linked by transition probability values between events (states)

Page 49: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

• An HMM combines two or more Markov chains

• In an HMM, as in a Markov chain, the probabilitygoing from one state to another state is thetransition probability. The probability valueassociated with each symbol in each state iscalled emission probability.

Page 50: Sequence analysis can be used to assign function In ... · Global Alignment and Local Alignment • In global alignment, two sequences to be aligned are assumed to be generally similar

Applications

• An advantage of HMMs over profiles is that the probability modeling inHMMs has more predictive power. This is because an HMM is able todifferentiate between insertion and deletion states, whereas in profilecalculation, a single gap penalty score that is often subjectivelydetermined represents either an insertion or deletion.

• Because the handling of insertions and deletions is a major problem inrecognizing highly divergent sequences, HMMs are therefore morerobust in describing subtle patterns of a sequence family than standardprofile analysis.

• HMMs are very useful in many aspects of bioinformatics. Although anHMM has to be trained based on multiple sequence alignment, once it istrained, it can in turn be used for the construction of multiple alignmentof related sequences. HMMs can be used for database searching todetect distant sequence homologs.

• HMMs are also used in protein family classification through motif andpattern identification