Pairwise Alignment (BIOINFORMATICS)

7/29/2019 Pairwise Alignment (BIOINFORMATICS)


Aayudh Das

PAIRWISE ALIGNMENT-Homology, Similarity, Identity

Two sequences are homologous if they share a common evolutionary ancestry. For example, human myoglobin and beta globin are distantly but significantly related proteins.

Proteins that are homologous may be orthologous or paralogous.

1. Orthologs are homologous sequences in different species that arose from a common ancestral gene during speciation and have similar biological functions; in this example, human and rat myoglobins both transport oxygen in muscle cells.

2. Paralogs are homologous sequences that arose by a mechanism such as gene duplication. For example, human alpha-1 globin is paralogous to alpha-2 globin; indeed, these two proteins share 100% amino acid identity.

We can assess the relatedness of any two proteins by performing a pairwise

alignment. One practical way to do this is through the NCBI pairwise BLAST tool.

Another aspect of this pairwise alignment is that some of the aligned residues may be similar but not identical because they share similar biochemical properties. These are conservative substitutions. Amino acids with similar properties include the basic amino acids (K, R, H), acidic amino acids (D, E), hydroxylated amino acids (S, T), and hydrophobic amino acids (W, F, Y, L, I, V, M, A).

The percent similarity of two protein sequences counts both identical and similar matches, expressed as a percentage of the aligned positions.
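As a rough illustration of the difference between identity and similarity, the sketch below counts both for two already-aligned strings. The similarity groups are the ones listed above; the function name, the use of '-' for gaps, and the choice of the full alignment length as the denominator are assumptions made for this example, and other conventions exist.

```python
# Conservative-substitution groups as listed in the text above.
SIMILAR_GROUPS = [set("KRH"), set("DE"), set("ST"), set("WFYLIVMA")]

def percent_identity_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned strings.

    Gaps are written as '-'. Percentages are taken over all alignment
    columns (an assumption; other denominators are also in use).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap column is neither identical nor similar
        if a == b:
            identical += 1
        elif any(a in g and b in g for g in SIMILAR_GROUPS):
            similar += 1
    n = len(aligned_a)
    return 100.0 * identical / n, 100.0 * (identical + similar) / n

# Toy alignment: 3 identities, 4 conservative substitutions, 1 gap column.
ident, simil = percent_identity_similarity("KLSDE-AT", "RLTDDWAS")
print(ident, simil)  # 37.5 87.5
```

Percent similarity is always at least as large as percent identity, because every identical position is also counted as a similar one.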

The purpose of a pairwise alignment is to assess the degree of similarity and the possibility of homology between two molecules.

Pairwise alignment is useful as a way to identify mutations that have occurred during evolution and have caused divergence of the sequences of the two proteins we are studying. The most common mutations are substitutions, insertions, and deletions. Insertions or deletions (even those just one character long) are referred to as gaps in the alignment.

Scoring matrix

Margaret Dayhoff (1978) provided a model which gives the basis of a quantitative

scoring system for pairwise alignments.

Based on substitution frequencies in the protein sequences that were then known (1972), Dayhoff and her coworkers organized the proteins into families and superfamilies according to the degree of sequence similarity.

Their approach was to catalog thousands of proteins and compare the sequences of

closely related proteins in many families.

They considered the question of which specific amino acid substitutions are observed to occur when two homologous protein sequences are aligned. They defined an accepted point mutation as a replacement of one amino acid in a protein by another residue that has been accepted by natural selection. Accepted point mutation is abbreviated PAM (which is easier to pronounce than APM).

Dayhoff and colleagues examined 1572 changes in 71 groups of closely related proteins.

Thus, their definition of accepted mutations was based on empirically observed amino

acid substitutions.

Derivation of substitution matrix

From the idea of divergent evolution we know that as sequences diverge, mutations accumulate.

The idea of the accepted point mutation follows from this. It is the change of one amino acid to another that is accepted by natural selection. Thus two processes take place:

1. A mutation occurs such that the gene changes and there is a change in the amino acid of the protein.

2. This mutation is accepted by the species.

The observed behaviour of amino acids in the evolutionary process needs to be considered. This demands 20 × 20 = 400 possible comparisons.

Calculation for the matrix of accepted point mutations

Assumption: The likelihood of amino acid X replacing Y is the same as that of Y replacing X. This reduces the number of comparisons between amino acids.

Observed sequences are compared with inferred ancestral sequences. This means mutation data were accumulated from the phylogenetic trees and from a few pairs of related sequences.

The sequences of all the nodal common ancestors in each tree are generated as follows.

The main goal of Dayhoff's approach was to define a set of scores for the comparison of aligned amino acid residues. By comparing two aligned proteins, one can then tabulate an overall score, taking into account identities as well as mismatches, and also applying appropriate penalties for gaps. A scoring matrix defines scores for the interchange of residues i and j. It is given by the probability q_{i,j} of aligning original amino acid residue j with replacement residue i relative to the likelihood of observing residue i by chance (p_i). The scoring matrix further incorporates a logarithm to generate log-odds scores. For the Dayhoff matrices, this takes the following form:

\[ s_{i,j} = 10 \times \log_{10}\left(\frac{q_{i,j}}{p_i}\right) \]

Here the score s_{i,j} refers to the score for aligning any two residues (including an amino acid with itself) along the length of a pairwise alignment. The probability q_{i,j} is the observed frequency of substitution for each pair of amino acids. The values for q_{i,j} are called the target frequencies, and they are estimated in reference to a particular amount of evolutionary change.

For example, if in a particular comparison of closely related proteins an aligned serine were to change to a threonine 5% of the time, then that target frequency q_{S,T} would be 0.05.

PAM (Point accepted mutation)

The entries in each cell are lod (log-odds) scores: the logarithm of the ratio of the observed frequency to the expected frequency.

The PAM250 mutation probability matrix is useful because it describes the frequency of amino acid replacements between distantly related proteins. PAM250 corresponds to ~20% overall sequence identity.

We have to convert the elements of a PAM mutation probability matrix into a scoring matrix, also called a log-odds matrix or relatedness odds matrix.

Why take the logarithm of the odds? For this scoring system Dayhoff and colleagues took 10 times the base 10 logarithm of the odds ratio. Using the logarithm here is helpful because it allows us to sum the scores of the aligned residues when we perform an overall alignment of two sequences. (If we did not take the logarithm, we would need to multiply the ratios at all the aligned positions.)

The values have been rounded off to the nearest integer. As an example, to determine the score assigned to two aligned tryptophan residues, the PAM250 mutation probability matrix value is 0.55, and the normalized frequency of tryptophan is 0.010. Thus,

\[ s_{W,W} = 10 \times \log_{10}\left(\frac{0.55}{0.010}\right) = 17.4 \approx 17 \]
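The tryptophan example can be checked numerically. The sketch below assumes the scoring rule described in the text (10 times the base 10 logarithm of the odds ratio, rounded to the nearest integer); the function name `pam_log_odds` is hypothetical.

```python
import math

def pam_log_odds(q_ij, p_i):
    """Score = 10 * log10(q_ij / p_i), rounded to the nearest integer."""
    return round(10 * math.log10(q_ij / p_i))

# Values quoted in the text for two aligned tryptophan residues (PAM250).
score_ww = pam_log_odds(0.55, 0.010)
print(score_ww)  # 17
```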

What do the scores in the PAM250 matrix signify?

A score of -10 indicates that the correspondence of two amino acids in an alignment that

accurately represents homology (evolutionary descent) is one-tenth as frequent as the

chance alignment of these amino acids. This assumes that each was randomly selected

from the background amino acid frequency distribution. A score of zero is neutral. A score

of +17 for tryptophan indicates that this correspondence is 50 times more frequent

than the chance alignment of this residue in a pairwise alignment. A score of +2

indicates that the amino acid replacement occurs 1.6 times as frequently as expected

by chance. The highest values in this particular matrix are for tryptophan (17 for an

identity) and cysteine (12), while the most severe penalties are associated with

substitutions for those two residues.
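The interpretation of individual scores follows from inverting the log-odds relation. A minimal sketch, assuming scores in units of 10 × log10 as described above (the function name is hypothetical):

```python
def odds_from_score(score):
    """Invert s = 10 * log10(odds): odds = 10 ** (s / 10)."""
    return 10 ** (score / 10)

# Scores discussed in the text and the odds ratios they correspond to.
for s in (-10, 0, 2, 17):
    print(s, round(odds_from_score(s), 2))
```

A score of -10 gives 0.1, a score of 0 gives 1, a score of +2 gives about 1.6, and a score of +17 gives about 50, matching the values quoted in the text.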

BLOSUM (BLOcks SUbstitution Matrix), developed by Henikoff & Henikoff in 1992

The BLOSUM matrix is a substitution matrix used for sequence alignment of proteins.

They used the BLOCKS database, which consisted of over 500 groups of local multiple

alignments (blocks) of distantly related protein sequences. Thus the Henikoffs focused

on conserved regions (blocks) of proteins that are distantly related to each other.

The BLOSUM scoring scheme employs a log-odds ratio using the base 2 logarithm:

\[ s_{i,j} = 2 \times \log_{2}\left(\frac{q_{i,j}}{e_{i,j}}\right) \]

Here q_{i,j} is the observed frequency with which residues i and j are aligned in the blocks, and e_{i,j} is the frequency with which that pair is expected by chance.

BLOSUM62 merges all proteins in an alignment that have 62% amino acid identity or greater into one sequence.

If a block of aligned globin orthologs includes several that have 62%, 80%, and 95% amino acid identity, these would all be weighted (grouped) as one sequence.

Substitution frequencies for the BLOSUM62 matrix are weighted more heavily by

blocks of protein sequences having less than 62% identity. (Thus, this matrix is useful

for scoring proteins that share less than 62% identity.)

Overall procedure:

1. Collect a set of multiple alignments.

2. Find the blocks (no gaps). Blocks are defined as ungapped alignments of amino acids from related proteins. Consider a single block representing a conserved region of a protein family. Each row is a different protein segment. Each column is an aligned residue position.

3. Group segments of blocks with x% identity.

4. Count the occurrence of all pairs of amino acids.

5. Employ these counts to obtain the log-odds ratio.
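The pair-counting and log-odds steps of the procedure above can be sketched on a toy block. This is an illustration only, under several assumptions: the grouping of segments by x% identity is skipped, the expected frequency of a pair is taken as p_i² for identities and 2·p_i·p_j for mismatches, scores are scaled as 2 × log2 and rounded, and the function name and the three-column block are invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block):
    """Derive rounded 2*log2 odds scores from one ungapped block.

    `block` is a list of equal-length strings (rows = protein segments,
    columns = aligned positions). Clustering by identity is NOT done here.
    """
    # Count every pair of residues within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Log-odds: observed pair frequency over the frequency expected by chance.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores

toy_block = ["ASA", "ASA", "ASA", "ASS"]  # hypothetical block
scores = blosum_like_scores(toy_block)
print(scores)
```

For this toy block the identities A-A and S-S receive positive scores and the A-S mismatch a negative one. Real BLOSUM matrices are built from many blocks, so the numbers here have no biological meaning.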

PAM versus BLOSUM

PAM: Dayhoff estimated mutation rates from substitutions observed in closely related proteins and extrapolated those rates to model distant relationships.
BLOSUM: Blocks have been derived from highly conserved regions of proteins, and so they reflect direct relationships.

PAM: PAM matrices are based on a mutational model of evolution that assumes amino acid changes occur as a Markov process (each amino acid change at a site is independent of previous changes at that site).
BLOSUM: In contrast, BLOSUM matrices are not based on an explicit evolutionary model.

PAM: Changes are scored in sequences that are 85% similar, after predicting a phylogenetic history of the changes in each family.
BLOSUM: For BLOSUM62, changes are scored in sequences that share less than 62% identity.

PAM: Thus PAM matrices are based on prediction of the first changes that occur as proteins diverge from a common ancestor during evolution of a protein family.
BLOSUM: They are derived from considering all amino acid changes observed in an aligned region from a related family of proteins, regardless of the degree of similarity between the protein sequences. These sequences are said to be related biochemically.

PAM: PAM matrices are based on scoring all amino acid positions in related sequences.
BLOSUM: BLOSUM matrices are based on substitutions and conserved positions in blocks, which represent the most alike common regions in related sequences.

PAM: Thus the PAM model is designed to track the evolutionary origins of proteins.
BLOSUM: The BLOSUM model is designed to find their conserved domains.



BLAST-Basic Local Alignment Search Tool

BLAST searching allows the user to select one sequence (query) and perform pairwise

sequence alignments between the query and an entire database (target). The programs

produce high-scoring segment pairs (HSPs) that represent local alignments between

your query and database sequences (hits).

USE-

Identifying species

With BLAST, you may be able to identify a species correctly and/or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.

Locating domains

When working with a protein sequence you can input it into BLAST, to locate known

domains within the sequence of interest.

Establishing phylogeny

Using the results received through BLAST, you can create a phylogenetic tree using the BLAST web page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so they should only be relied upon for "first pass" phylogenetic analyses.

DNA mapping

When working with a known species, and looking to sequence a gene at an unknown

location, BLAST can compare the chromosomal position of the sequence of interest, to

relevant sequences in the database(s).

Comparison

When working with genes, BLAST can locate common genes in two related species, and can

be used to map annotations from one organism to another.

BLAST SEARCH STEPS

Step 1: Specifying Sequence of Interest

Step 2: Selecting BLAST Program

Step 3: Selecting a Database

Step 4a: Selecting Optional Search Parameters

Step 4b: Selecting Formatting Parameters

BLAST Algorithm

The BLAST search algorithm finds a match between a query and a database sequence

and then extends the match in either direction. The search results consist of both

highly related sequences from the database as well as marginally related sequences, along

with a scoring scheme to describe the degree of relatedness between the query and each

database hit. The blastp algorithm can be described in three phases-

1. BLAST compiles a preliminary list of pairwise alignments, called word pairs.

2. The algorithm scans a database for word pairs that meet some threshold score T.

3. BLAST extends the word pairs to find those that surpass a cut-off score S, at which

point those hits will be reported to the user. Scores are calculated from scoring

matrices (such as BLOSUM62) along with gap penalties.

Gap penalty values are designed to reduce the score when an alignment has been disturbed by indels. Typically the central elements used to measure the score of an alignment have been matches, mismatches and spaces. Another important element in measuring alignment scores is the gap. A gap is a consecutive run of spaces in an alignment; gaps are used to create alignments that conform better to underlying biological models and more closely fit patterns that one expects to find in meaningful alignments. Gaps are represented as dashes on a protein/DNA sequence alignment. The length of a gap is scored by the number of indels (insertions/deletions) in the sequence alignment. In protein and DNA sequence matching, two sequences are aligned to determine whether each has a segment that is significantly similar to a segment of the other. A local alignment score is assigned according to the quality of the matches in the alignment, minus penalties for gaps present within the alignment. The best gap costs to use with a given substitution matrix are determined empirically. Gap penalties are used with local alignments that match a contiguous sub-sequence of the first sequence with a contiguous sub-sequence of the second sequence.

When comparing proteins, one uses a similarity matrix which assigns a score to each possible residue pair. The score should be positive for similar residue pairs and negative for dissimilar ones. Gaps are usually penalized using an affine gap function that assigns an initial penalty for a gap opening, and an additional penalty for gap extensions which increase the gap length.
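A gap function with an opening cost and a per-position extension cost can be sketched as follows. The default costs (11 for opening, 1 for each gapped position) are assumed values chosen for illustration, as are the function names; as noted above, suitable costs for a given matrix are determined empirically.

```python
import re

def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for one gap: an opening cost plus a cost per gapped position."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length

def total_gap_penalty(aligned_a, aligned_b, gap_open=11, gap_extend=1):
    """Sum the penalties of every run of '-' in two aligned strings."""
    runs = re.findall(r"-+", aligned_a) + re.findall(r"-+", aligned_b)
    return sum(affine_gap_penalty(len(r), gap_open, gap_extend) for r in runs)

# One gap of length 2 and one of length 1: (11 + 2) + (11 + 1) = 25.
print(total_gap_penalty("AC--GT-A", "ACTTGTCA"))  # 25
```

With this scheme one long gap is cheaper than several short gaps of the same total length, which reflects the idea that a single insertion or deletion event can span several residues.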

E VALUE-

For the comparison of a query sequence to a

database of random sequences of uniform length,

the scores can be plotted and shown to have the

shape of an extreme value distribution. The extreme

value distribution is skewed to the right, with a tail

that decays in x.



We now arrive at the main mathematical description of the significance of scores from a BLAST search. The expected number of HSPs having some score S (or better) by chance alone is described using the equation

\[ E = K\,m\,n\,e^{-\lambda S} \]

Here m and n are the lengths of the two random sequences being compared (the query and the database), and the scores S follow an extreme value distribution whose decay constant is λ. E refers to the expect value, which is the number of different alignments with scores equivalent to or better than S that are expected to occur by chance in a database search. This provides an estimate of the number of false positive results from a BLAST search. We see that the E value depends on the score and λ, which is a parameter that scales the scoring system. Also, E depends on the length of the query sequence and the length of the database. The parameter K is a scaling factor for the search space.
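The dependence of E on the score and on the search space can be explored with a short sketch. It assumes the usual form E = K·m·n·e^(−λS); the default values of K and λ below are illustrative assumptions rather than values taken from these notes, and the function name is hypothetical.

```python
import math

def expect_value(score, m, n, K=0.041, lam=0.267):
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: database length (both in residues).
    K and lam are assumed, illustrative parameter values.
    """
    return K * m * n * math.exp(-lam * score)

# E falls exponentially with the score and grows linearly with the database.
for s in (40, 50, 60):
    print(s, expect_value(s, m=250, n=10**8))
```

Doubling the database length doubles E for the same score, which is why the same alignment can look significant in a small database and unconvincing in a large one.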

E value: the expected number of sequences that give the same Z-score or better if the database is probed with a random sequence.

E is found by multiplying the value of P by the size of the database probed. E values range between 0 and the number of sequences in the database searched. E > 1: a match this good would be expected by chance.

Database search-

Sequence database searches can be used to find the function of a gene that has been sequenced in the laboratory, using evolutionary relationships.

Database searches can also be used to find genes in other organisms related to the gene whose sequence has been determined in the laboratory. The sequence of the gene of interest is compared to every sequence in a database and the similar ones are identified.

Database searches were first attempted when there were serious limitations in machine size and memory. Methods faster than the existing ones were the ultimate need.

The main idea of BLAST is that there are often high-scoring segment pairs (HSPs) contained in a statistically significant alignment. BLAST searches for high-scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman algorithm but over 50 times faster. The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.

Steps for searching a protein sequence database by a query

protein sequence include the following:

1. The sequence is optionally filtered to remove low-complexity regions that are not useful for producing meaningful sequence alignments. A "low-complexity region" is a region of a sequence composed of few kinds of elements. These regions might give high scores that prevent the program from finding the actual significant sequences in the database, so they should be filtered out. The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and are then ignored by the BLAST program.

2. A list of words of length 3 in the query protein sequence is made, starting with positions 1,2,3, then 2,3,4, etc., until the last 3 available positions in the sequence are reached (word length 11 for DNA sequences, 3 for programs that translate DNA sequences). While attempting to find similarity in sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the following stretch of letters: GLKFA. If a BLASTp search was being conducted under default conditions, the word size would be 3 letters. In this case, using the given stretch of letters, the searched words would be GLK, LKF, KFA.

3. The scores are created by comparing the word in the list in step 2 with all the 3-

letter words. By using the scoring matrix (substitution matrix) to score the

comparison of each residue pair, there are 20^3 (20x20x20=8000) possible match scores

for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA

is 15 and 12, respectively.

The likelihood of a match to itself is found in the BLOSUM62 matrix as the log-odds score of a P-P match + a Q-Q match + a G-G match = 7 + 5 + 6 = 18.

(The scores are added because the BLOSUM62 matrix is made up of logarithms of odds of finding a match in sequences. The likelihoods of each pair are multiplied, and adding logarithms of scores is equivalent to multiplying the raw odds scores.)

Similarly, matches of PQG to other words would score:

PEG 15

PRG 14

PSG 13

PQA 12

4. A cutoff score called the neighbourhood word score threshold (T) is selected to reduce the number of possible matches to PQG to the significant ones.

For example, if the cutoff score T is 13, only the words that score above 13 are kept. In our example, possible matches to PQG would include PEG (15) but not PQA (12).

The list of possible matching words is thereby shortened from all 8000 possible words to approximately the 50 highest-scoring ones.

5. The above procedure is repeated for each three-letter word in the query sequence. For a sequence length of 250 amino acids, the total number of words to search for is approximately 50 × (250 − 3 + 1) = 12,400, where 3 is the word length w.

6. The remaining high scoring words that comprise possible matches to each three letter

position in the query sequence are organized into an efficient search tree for

comparing them rapidly to the database sequences.

7. Each database sequence is scanned for an exact match to one of the 50 words corresponding to the first query sequence position, then for the words corresponding to the second position, and so on. If a match is found, this match is used to seed a possible ungapped alignment between the query and database sequences.

8. An attempt is made to extend an alignment from the matching words in each direction along the sequences, continuing for as long as the score continues to increase. The original version of BLAST extends the alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease.

At this point, a larger stretch of sequence, called a high-scoring segment pair (HSP), which is larger than the original word, is said to have been found.



9. The next step is to determine whether each HSP score found by one of the above methods is greater in value than a cutoff score S. A suitable value of S is determined empirically by examining the range of scores found by comparing random sequences and by choosing a value that is significantly greater. The high-scoring pairs matched in the entire database are identified and listed.

10. BLAST next determines the statistical significance of each HSP score.
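The word-list, neighbourhood, scanning and extension steps above can be sketched in a few functions. This is a toy illustration and not the BLAST implementation: the score table is only an excerpt consistent with the PQG example in these notes, every other pair gets a placeholder score, the stopping rule in `extend_seed` (give up once the running score falls more than `x_drop` below the best score) is an assumption, and all function names and the test sequences are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Pair scores consistent with the PQG worked example in the text.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(a, b):
    """Score one residue pair; unlisted pairs get assumed placeholder scores."""
    for key in ((a, b), (b, a)):
        if key in BLOSUM62_EXCERPT:
            return BLOSUM62_EXCERPT[key]
    return 4 if a == b else -1  # placeholders, NOT real BLOSUM62 values

def word_score(w1, w2):
    """Sum of the pair scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def query_words(query, w=3):
    """Overlapping words at positions 1-3, 2-4, and so on."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word, T):
    """All words of the same length that score above T against `word`."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return {c for c in candidates if word_score(word, c) > T}

def find_seeds(query, subject, T=13, w=3):
    """Positions (query index, subject index) of exact neighbourhood-word hits."""
    seeds = []
    for qi, word in enumerate(query_words(query, w)):
        hits = neighborhood(word, T)
        for si in range(len(subject) - w + 1):
            if subject[si:si + w] in hits:
                seeds.append((qi, si))
    return seeds

def extend_seed(query, subject, qi, si, w=3, x_drop=5):
    """Ungapped extension to the right, then to the left, of one seed.

    Returns (best_score, query_start, query_end) with an exclusive end.
    """
    best = score = word_score(query[qi:qi + w], subject[si:si + w])
    q_start, q_end = qi, qi + w
    q, s = qi + w, si + w
    while q < len(query) and s < len(subject) and best - score <= x_drop:
        score += pair_score(query[q], subject[s])
        q, s = q + 1, s + 1
        if score > best:
            best, q_end = score, q
    score = best  # restart from the best right-hand extension
    q, s = qi - 1, si - 1
    while q >= 0 and s >= 0 and best - score <= x_drop:
        score += pair_score(query[q], subject[s])
        if score > best:
            best, q_start = score, q
        q, s = q - 1, s - 1
    return best, q_start, q_end

print(query_words("GLKFA"))               # ['GLK', 'LKF', 'KFA']
print(sorted(neighborhood("PQG", 13)))    # ['PEG', 'PQG', 'PRG']
for qi, si in find_seeds("APQGA", "KAPEGAL"):
    print(extend_seed("APQGA", "KAPEGAL", qi, si))
```

With the full BLOSUM62 matrix the neighbourhood of PQG would contain more words than the three found here, since the excerpt scores every unlisted substitution as -1. A real implementation also stores the neighbourhood words in a search structure instead of rebuilding them for every query word.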

Significance of alignments

Suppose an alignment reveals an intriguing similarity between two sequences. What

should be our next job? Is the similarity significant or could it have arisen by chance? What

is the practical approach to the problem?

If the score of the alignment observed is no better than might be expected from a random permutation of the sequence, then it is likely to have arisen by chance. The alignment is unlikely to be significant if the randomized sequences score as well as the original one.

We may randomize one of the sequences many times, realign each result to the second sequence, and collect the distribution of scores.

We can measure the mean and standard deviation of the scores of the alignments of randomized sequences and ask whether the score of the original sequence is unusually high.

The Z-score reflects the extent to which the original result is an outlier from the scores of the randomized sequences.

A Z-score of zero means that the observed similarity is no better than the average of random permutations of the sequence, and might well have arisen by chance.

P is another measure of significance. It is the probability that the observed match could have happened by chance. Guide to interpreting P values:

P > 10^-1: match probably insignificant.
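The randomization test described above can be sketched as follows. The alignment scorer is a small Smith-Waterman routine with simple match, mismatch and per-position gap costs; these costs, the number of shuffles, the fixed random seed, the function names and the two test sequences are all assumptions made for illustration.

```python
import random
import statistics

def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, per-position gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def z_score(a, b, n_shuffles=200, seed=0):
    """(observed score - mean of shuffled scores) / their standard deviation."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = local_alignment_score(a, b)
    letters = list(a)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # randomize one sequence, keep its composition
        shuffled_scores.append(local_alignment_score("".join(letters), b))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (observed - mean) / sd

seq_a = "MKTAYIARQRQLSFVKSHFTRQLEERLGLIEVQ"  # hypothetical sequences
seq_b = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(z_score(seq_a, seq_b))
```

Shuffling preserves the amino acid composition of the sequence, so a high Z-score indicates that the order of the residues, and not merely the composition, is responsible for the observed score.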

Pairwise Alignment with Dot Plots

It is a graphical method for comparing two sequences. One protein or nucleic acid

sequence is placed along the x axis and the other is placed along the y axis. Positions

of identity are scored with a dot. A region of identity between two sequences results

in the formation of a diagonal line.
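A text-mode dot plot takes only a few lines. In this sketch every identical pair of positions is marked with '*' and every other cell with '.'; the function name and the example sequences are hypothetical, and no window or threshold filtering is applied.

```python
def dot_plot(seq_x, seq_y):
    """Return text rows with seq_x along the x axis and seq_y along the y axis."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]

for row in dot_plot("GATTACA", "GATCA"):
    print(row)
```

Runs of '*' along a diagonal correspond to regions of identity between the two sequences; isolated dots away from the diagonals are chance matches.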

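The Z-score is calculated from the scores of the randomized sequences:

\[ Z\text{-score} = \frac{\text{score} - \text{mean}}{\text{standard deviation}} \]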



Max score: The score of the highest scoring HSP from that database sequence.

Total score: The total score of all HSPs from that database sequence.

Query Coverage: It is the percent of length of the query covered.

Max Identity: It is the maximal percent identity of the HSP
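The four columns defined above can be derived from the HSPs reported for one database sequence. The sketch below is an assumption about how such a summary could be computed, not a description of the BLAST report code; the dictionary keys, the function name and the example values are invented for illustration.

```python
def summarize_hsps(hsps, query_length):
    """Summarize the HSPs found for one database sequence.

    Each HSP is a dict with keys 'score', 'q_start', 'q_end'
    (1-based, inclusive query coordinates) and 'identity' (percent).
    """
    covered = set()
    for h in hsps:
        covered.update(range(h["q_start"], h["q_end"] + 1))
    return {
        "max_score": max(h["score"] for h in hsps),
        "total_score": sum(h["score"] for h in hsps),
        # Overlapping HSPs are counted once when computing coverage.
        "query_coverage": 100.0 * len(covered) / query_length,
        "max_identity": max(h["identity"] for h in hsps),
    }

hsps = [
    {"score": 120, "q_start": 1, "q_end": 60, "identity": 85.0},
    {"score": 45, "q_start": 50, "q_end": 100, "identity": 62.5},
]
summary = summarize_hsps(hsps, query_length=200)
print(summary)
```

In this example the two HSPs overlap on query positions 50 to 60, so they cover 100 of the 200 query residues and the query coverage is 50%.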

