Pairwise Alignment (BIOINFORMATICS)


  • 7/29/2019 Pairwise Alignment (BIOINFORMATICS)


    Aayudh Das

    PAIRWISE ALIGNMENT: Homology, Similarity, Identity

    Two sequences are homologous if they share a common evolutionary ancestry. For example, human myoglobin and beta globin are distantly but significantly related proteins.

    Proteins that are homologous may be orthologous or paralogous.

    1. Orthologs are homologous sequences in different species that arose from a common ancestral gene during speciation and typically have similar biological functions; for example, human and rat myoglobins both transport oxygen in muscle cells.

    2. Paralogs are homologous sequences that arose by a mechanism such as gene duplication. For example, human alpha-1 globin is paralogous to alpha-2 globin; indeed, these two proteins share 100% amino acid identity.

    We can assess the relatedness of any two proteins by performing a pairwise

    alignment. One practical way to do this is through the NCBI pairwise BLAST tool.

    Another aspect of a pairwise alignment is that some of the aligned residues may be similar but not identical because they share similar biochemical properties. These are conservative substitutions. Amino acids with similar properties include the basic amino acids (K, R, H), acidic amino acids (D, E), hydroxylated amino acids (S, T), and hydrophobic amino acids (W, F, Y, L, I, V, M, A).

    The percent similarity of two protein sequences counts both identical and similar matches, so it is always at least as large as the percent identity.
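    As a minimal sketch of the difference between identity and similarity, the following computes both for two already-aligned strings. The residue groups are the ones listed above; the function name, the toy sequences, and the choice to divide by the full alignment length are assumptions made for illustration, since conventions for the denominator vary between tools.

```python
# Residue groups taken from the list above; real tools use a scoring matrix instead.
SIMILAR_GROUPS = ["KRH", "DE", "ST", "WFYLIVMA"]


def percent_identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned strings.

    Gaps are written as '-'. Percentages are taken over the full alignment
    length, which is an assumed convention.
    """
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and equally long")
    identical = 0
    similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # a gap position is neither identical nor similar
        if a == b:
            identical += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1  # conservative substitution
    length = len(aligned_a)
    identity = 100.0 * identical / length
    similarity = 100.0 * (identical + similar) / length
    return identity, similarity


# Hypothetical toy alignment: 2 identities, 3 conservative substitutions, 1 gap.
identity, similarity = percent_identity_and_similarity("KLSD-A", "RLTEGA")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```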

    The purpose of a pairwise alignment is to assess the degree of similarity and the possibility of homology between two molecules.

    Pairwise alignment is useful as a way to identify mutations that have occurred during evolution and have caused divergence of the sequences of the two proteins we are studying. The most common mutations are substitutions, insertions, and deletions. Insertions or deletions (even those just one character long) are referred to as gaps in the alignment.

    Scoring matrix

    Margaret Dayhoff (1978) provided a model which gives the basis of a quantitative

    scoring system for pairwise alignments.

    Based on substitution frequencies in the protein sequences that were then known (1972), Dayhoff and her coworkers organized the proteins into families and superfamilies according to the degree of sequence similarity.

    Their approach was to catalog thousands of proteins and compare the sequences of

    closely related proteins in many families.

    They considered the question of which specific amino acid substitutions are observed to occur when two homologous protein sequences are aligned. They defined an accepted point mutation as a replacement of one amino acid in a protein by another residue that has been accepted by natural selection. Accepted point mutation is abbreviated PAM (which is easier to pronounce than APM).

    Dayhoff and colleagues examined 1572 changes in 71 groups of closely related proteins.

    Thus, their definition of accepted mutations was based on empirically observed amino

    acid substitutions.

    Derivation of substitution matrix

    From the idea of divergent evolution we know that mutations accumulate as sequences diverge.

    The idea of the accepted point mutation follows from this. It is the change of one amino acid to another that has been accepted by natural selection. Two processes must therefore both take place:

    1. A mutation changes the gene, which changes an amino acid in the protein.

    2. The mutation is accepted by the species.

    The observed behaviour of amino acids in the evolutionary process needs to be considered. This demands 20 × 20 = 400 possible comparisons.

    Calculation for the matrix of accepted point mutations

    Assumption: the likelihood of amino acid X replacing Y is the same as that of Y replacing X. This reduces the number of comparisons between amino acids, because the matrix becomes symmetric and only 210 of the 400 entries are distinct.

    Comparison is done between observed sequences and inferred ancestral sequences. This means mutation data were accumulated from the phylogenetic trees and from a few pairs of related sequences.

    The sequences of all the nodal common ancestors in each tree are generated as follows.

    The main goal of Dayhoff's approach was to define a set of scores for the comparison of aligned amino acid residues. By comparing two aligned proteins, one can then tabulate an overall score, taking into account identities as well as mismatches, and also applying appropriate penalties for gaps. A scoring matrix defines scores for the interchange of residues i and j. It is given by the probability q_{i,j} of aligning original amino acid residue j with replacement residue i, relative to the likelihood of observing residue i by chance (p_i). The scoring matrix further incorporates a logarithm to generate log-odds scores. For the Dayhoff matrices, this takes the following form:

    s_{i,j} = 10 \log_{10} \left( \frac{q_{i,j}}{p_i} \right)

    Here the score s_{i,j} refers to the score for aligning any two residues (including an amino acid with itself) along the length of a pairwise alignment. The probability q_{i,j} is the observed frequency of substitution for each pair of amino acids. The values for q_{i,j} are called the target frequencies, and they are estimated in reference to a particular amount of evolutionary change.

    For example, if in a particular comparison of closely related proteins an aligned serine were to change to a threonine 5% of the time, then the target frequency q_{S,T} would be 0.05.

    PAM (point accepted mutation)

    The entries in each cell of the matrix are log-odds (lod) scores: the logarithm of the ratio of the observed frequency to the expected frequency.

    The PAM250 mutation probability matrix is useful because it describes the frequency of amino acid replacements between distantly related proteins. PAM250 corresponds to ~20% overall sequence identity.

    We have to convert the elements of a PAM mutation probability matrix into a scoring matrix, also called a log-odds matrix or relatedness odds matrix.

    Why take the logarithm of the odds? For this scoring system Dayhoff and colleagues took 10 times the base 10 logarithm of the odds ratio. Using the logarithm is helpful because it allows us to sum the scores of the aligned residues when we perform an overall alignment of two sequences. (If we did not take the logarithm, we would need to multiply the ratios at all the aligned positions.)

    The values have been rounded off to the nearest integer. As an example, to determine the score assigned to two aligned tryptophan residues, the PAM250 mutation probability matrix value is 0.55, and the normalized frequency of tryptophan is 0.010. Thus,

    s_{W,W} = 10 \log_{10} \left( \frac{0.55}{0.010} \right) = 17.4

    which rounds to +17.
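    The tryptophan calculation can be checked numerically. This is a small sketch using the two values quoted in the text (0.55 and 0.010); the function name pam_log_odds is hypothetical.

```python
import math


def pam_log_odds(target_frequency, background_frequency):
    """Dayhoff-style score: 10 times the base-10 log of the odds ratio."""
    return 10 * math.log10(target_frequency / background_frequency)


# Values for two aligned tryptophans, as quoted in the text.
raw_score = pam_log_odds(0.55, 0.010)
print(f"raw score {raw_score:.1f}, rounded {round(raw_score)}")
```

    Because the entries are logarithms, the score of a whole alignment is the sum of the per-position scores, which corresponds to multiplying the underlying odds ratios.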

    What do the scores in the PAM250 matrix signify?

    A score of -10 indicates that the correspondence of two amino acids in an alignment that

    accurately represents homology (evolutionary descent) is one-tenth as frequent as the

    chance alignment of these amino acids. This assumes that each was randomly selected

    from the background amino acid frequency distribution. A score of zero is neutral. A score

    of +17 for tryptophan indicates that this correspondence is 50 times more frequent

    than the chance alignment of this residue in a pairwise alignment. A score of +2

    indicates that the amino acid replacement occurs 1.6 times as frequently as expected

    by chance. The highest values in this particular matrix are for tryptophan (17 for an

    identity) and cysteine (12), while the most severe penalties are associated with

    substitutions for those two residues.
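    The interpretations above follow from inverting the log-odds formula: a score s corresponds to an odds ratio of 10^(s/10). The sketch below (with a hypothetical helper name) reproduces the figures quoted for scores of -10, 0, +2, and +17.

```python
def odds_from_pam_score(score):
    """Invert score = 10 * log10(odds) to recover the odds ratio."""
    return 10 ** (score / 10)


for score in (-10, 0, 2, 17):
    print(f"score {score:+d}: {odds_from_pam_score(score):.1f} times chance")
```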

    BLOSUM: BLOcks SUbstitution Matrix, developed by Henikoff & Henikoff in 1992

    The BLOSUM matrix is a substitution matrix used for sequence alignment of proteins.

    They used the BLOCKS database, which consisted of over 500 groups of local multiple

    alignments (blocks) of distantly related protein sequences. Thus the Henikoffs focused

    on conserved regions (blocks) of proteins that are distantly related to each other.

    The BLOSUM scoring scheme employs a log-odds ratio using the base 2 logarithm.

    The BLOSUM62 matrix merges all proteins in an alignment that have 62% amino acid identity or greater into one sequence. If a block of aligned globin orthologs includes several that have 62%, 80%, and 95% amino acid identity, these would all be weighted (grouped) as one sequence.

    Substitution frequencies for the BLOSUM62 matrix are weighted more heavily by blocks of protein sequences having less than 62% identity. (Thus, this matrix is useful for scoring proteins that share less than 62% identity.)

    Overall procedure:

    1. Collect a set of multiple alignments.

    2. Find the blocks (no gaps). Blocks are defined as ungapped alignments of amino acids from related proteins. Consider a single block representing a conserved region of a protein family. Each row is a different protein segment. Each column is an aligned residue position.

    3. Group the segments of each block that share x% identity.

    4. Count the occurrences of all pairs of amino acids.

    5. Employ these counts to obtain the log-odds ratio.
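    The counting and log-odds steps of the procedure can be sketched on a toy block. This is an illustration under stated assumptions, not the Henikoffs' implementation: the clustering step (grouping at x% identity) is omitted, the three-row block is invented, and the scaling of 2 × log2 (half-bit units) with rounding to the nearest integer is assumed here as the final convention.

```python
import math
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count every residue pair within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(counts):
    """Turn pair counts into rounded log-odds scores (assumed half-bit units)."""
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}
    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # an unordered pair can arise in two orders
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_block = ["AC", "AC", "SC"]  # hypothetical block: 3 segments, 2 columns
pair_counts = count_pairs(toy_block)
block_scores = log_odds_scores(pair_counts)
print(pair_counts)
print(block_scores)
```

    With so few sequences the resulting scores are not biologically meaningful; the point is only to show how observed pair frequencies are compared with the frequencies expected by chance.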

    PAM versus BLOSUM

    1. PAM: Dayhoff estimated mutation rates from substitutions observed in closely related proteins and extrapolated those rates to model distant relationships.
       BLOSUM: Blocks have been derived from highly conserved regions of proteins, and so they reflect direct relationships.

    2. PAM: The matrices are based on a mutational model of evolution that assumes amino acid changes occur as a Markov process (each amino acid change at a site is independent of previous changes at that site).
       BLOSUM: In contrast, the matrices are not based on an explicit evolutionary model.

    3. PAM: Changes are scored in sequences that are 85% similar, after predicting a phylogenetic history of the changes in each family.
       BLOSUM: Less than 62% identity.

    4. PAM: The matrices are thus based on prediction of the first changes that occur as proteins diverge from a common ancestor during evolution of a protein family.
       BLOSUM: The matrices are derived from considering all amino acid changes observed in an aligned region from a related family of proteins, regardless of the degree of similarity between the protein sequences. These sequences are said to be related biochemically.

    5. PAM: The matrices are based on scoring all amino acid positions in related sequences.
       BLOSUM: The matrices are based on substitutions and conserved positions in blocks, which represent the most alike common regions in related sequences.

    6. PAM: The model is designed to track the evolutionary origins of proteins.
       BLOSUM: The model is designed to find their conserved domains.


    BLAST: Basic Local Alignment Search Tool

    BLAST searching allows the user to select one sequence (query) and perform pairwise sequence alignments between the query and an entire database (target). The programs produce high-scoring segment pairs (HSPs) that represent local alignments between your query and database sequences (hits).

    Uses

    Identifying species

    With the use of BLAST, you may be able to identify a species correctly or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.

    Locating domains

    When working with a protein sequence you can input it into BLAST, to locate known

    domains within the sequence of interest.

    Establishing phylogeny

    Using the results received through BLAST, you can create a phylogenetic tree using the BLAST web page. Phylogenies based on BLAST alone are less reliable than other purpose-built computational phylogenetic methods, so they should only be relied upon for "first pass" phylogenetic analyses.

    DNA mapping

    When working with a known species, and looking to sequence a gene at an unknown

    location, BLAST can compare the chromosomal position of the sequence of interest, to

    relevant sequences in the database(s).

    Comparison

    When working with genes, BLAST can locate common genes in two related species, and can

    be used to map annotations from one organism to another.

    BLAST SEARCH STEPS

    Step 1: Specifying Sequence of Interest

    Step 2: Selecting BLAST Program

    Step 3: Selecting a Database

    Step 4a: Selecting Optional Search Parameters

    Step 4b: Selecting Formatting Parameters


    BLAST Algorithm

    The BLAST search algorithm finds a match between a query and a database sequence

    and then extends the match in either direction. The search results consist of both

    highly related sequences from the database as well as marginally related sequences, along

    with a scoring scheme to describe the degree of relatedness between the query and each

    database hit. The blastp algorithm can be described in three phases:

    1. BLAST compiles a preliminary list of pairwise alignments, called word pairs.

    2. The algorithm scans a database for word pairs that meet some threshold score T.

    3. BLAST extends the word pairs to find those that surpass a cut-off score S, at which

    point those hits will be reported to the user. Scores are calculated from scoring

    matrices (such as BLOSUM62) along with gap penalties.

    Gap penalty values are designed to reduce the score when an alignment has been disturbed by indels. Typically the central elements used to measure the score of an alignment are matches, mismatches and spaces. Gaps are another important element. A gap is a consecutive run of spaces in an alignment; gaps are used to create alignments that conform better to underlying biological models and more closely fit the patterns that one expects to find in meaningful alignments. Gaps are represented as dashes in a protein or DNA sequence alignment. The length of a gap is the number of consecutive indel (insertion/deletion) positions it spans.

    In protein and DNA sequence matching, two sequences are aligned to determine whether they each contain a segment that is significantly similar to the other. A local alignment score is assigned according to the quality of the matches in the alignment, minus penalties for the gaps present within the alignment. The best gap costs to use with a given substitution matrix are determined empirically. Gap penalties are used with local alignments, which match a contiguous subsequence of the first sequence with a contiguous subsequence of the second sequence.

    When comparing proteins, one uses a similarity matrix that assigns a score to each possible residue pair. The score should be positive for similar residue pairs and negative for dissimilar ones. Gaps are usually penalized using an affine gap function that assigns an initial penalty for a gap opening and an additional penalty for each gap extension that increases the gap length. (A linear gap function, by contrast, charges the same penalty for every gap position.)
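    A gap penalty with separate opening and extension costs can be sketched as follows. The default costs of 11 and 1 and the convention that a gap of length k costs open + k × extend are assumptions chosen for illustration; other programs charge the opening cost differently, so the numbers should not be read as universal.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for one gap of the given length (assumed convention and costs)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def alignment_score(substitution_scores, gap_lengths):
    """Sum of aligned-pair scores minus the penalty for each gap."""
    return sum(substitution_scores) - sum(gap_penalty(k) for k in gap_lengths)


# Hypothetical alignment: five aligned pairs and one gap of length 3.
print(gap_penalty(1), gap_penalty(3))
print(alignment_score([7, 5, 6, 4, 4], [3]))
```

    Under this scheme one long gap is cheaper than several short gaps of the same total length, which reflects the idea that a single insertion or deletion event can span many residues.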

    E value

    For the comparison of a query sequence to a database of random sequences of uniform length, the scores can be plotted and shown to have the shape of an extreme value distribution. The extreme value distribution is skewed to the right, with a tail that decays in x (rather than in x^2, as for the normal distribution).

    For two random sequences of lengths m and n, the cumulative distribution function of scores S is

    P(S < x) = \exp\left( -e^{-\lambda (x - u)} \right)

    where u is the characteristic value and \lambda is the decay constant.

    We now arrive at the main mathematical description of the significance of scores from a BLAST search. The expected number of HSPs having some score S (or better) by chance alone is described using the equation

    E = K m n e^{-\lambda S}

    Here E refers to the expect value, which is the number of different alignments with scores equivalent to or better than S that are expected to occur by chance in a database search. This provides an estimate of the number of false positive results from a BLAST search. We see that the E value depends on the score S and on \lambda, which is a parameter that scales the scoring system. Also, E depends on the length of the query sequence (m) and the length of the database (n). The parameter K is a scaling factor for the search space.
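    The dependence of E on the score and on the search space can be explored with a short sketch of E = K m n e^(-λS). The values of K and λ below are illustrative assumptions (they depend on the scoring matrix and gap costs in use), as are the query and database lengths.

```python
import math


def expect_value(score, query_len, db_len, k, lam):
    """Expected number of chance HSPs with at least this score: K*m*n*exp(-lambda*S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Assumed, illustrative parameters; real values come from the scoring system used.
K_ASSUMED = 0.041
LAMBDA_ASSUMED = 0.267

for s in (40, 60, 80):
    e = expect_value(s, query_len=250, db_len=1_000_000, k=K_ASSUMED, lam=LAMBDA_ASSUMED)
    print(f"S = {s}: E = {e:.3g}")
```

    E falls exponentially as the score rises and grows in proportion to the size of the search space, so the same score is less significant in a larger database.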

    E value: the expected number of sequences that would give the same Z-score or better if the database were probed with a random sequence.

    E is found by multiplying the value of P by the size of the database probed. E values range between 0 and the number of sequences in the database searched. An E value greater than 1 means that a match this good would be expected by chance.

    Database search

    Sequence database searches can be used to infer the function of a gene that has been sequenced in the laboratory, using evolutionary relationships.

    Database searches can also be used for finding genes in other organisms related to the gene whose sequence has been determined in the laboratory. The sequence of the gene of interest is compared to every sequence in a database and the similar ones are identified.

    Database searches were first attempted when machine size and memory were seriously limited, so methods faster than the existing ones were needed.

    The main idea of BLAST is that a statistically significant alignment often contains high-scoring segment pairs (HSPs). BLAST searches for high-scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman algorithm but over 50 times faster. The speed and relatively good accuracy of BLAST are among the key technical innovations of the BLAST programs.


    Steps for searching a protein sequence database by a query protein sequence include the following:

    1. The sequence is optionally filtered to remove low-complexity regions that are not useful for producing meaningful sequence alignments. A "low-complexity region" is a region of a sequence composed of few kinds of elements. These regions might give high scores that prevent the program from finding the actual significant sequences in the database, so they should be filtered out. The regions are marked with an X (protein sequences) or N (nucleic acid sequences) and then ignored by the BLAST program.

    2. A list of words of length 3 in the query protein sequence is made starting with positions 1,2,3, then 2,3,4, and so on until the last 3 available positions in the sequence are reached (word length 11 for DNA sequences, 3 for programs that translate DNA sequences). When searching for similarity between sequences, sets of common letters, known as words, are very important. For example, suppose that the sequence contains the following stretch of letters: GLKFA. If a BLASTp search were conducted under default conditions, the word size would be 3 letters. In this case, the searched words would be GLK, LKF, and KFA.
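    Step 2 amounts to sliding a window of width w along the query. A minimal sketch, with a hypothetical function name, reproduces the GLKFA example:

```python
def query_words(sequence, word_size=3):
    """List every overlapping word of the given size, left to right."""
    return [sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)]


print(query_words("GLKFA"))
```

    A sequence of length L yields L - w + 1 words, so a 250-residue query has 248 words of length 3.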

    3. The scores are created by comparing each word in the list from step 2 with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20^3 (20 × 20 × 20 = 8000) possible match scores for a 3-letter word. For example, the scores obtained by comparing PQG with PEG and PQA are 15 and 12, respectively.

    The score of a word matched to itself is found in the BLOSUM62 matrix as the sum of the log-odds scores of a P-P match, a Q-Q match, and a G-G match: 7 + 5 + 6 = 18.

    (The scores are added because the BLOSUM62 matrix is made up of logarithms of odds of finding a match in sequences. The likelihoods of each pair are multiplied, and adding the logarithms of the scores is equivalent to multiplying the raw odds scores.)

    Similarly, matches of PQG to other words would score:

    PEG 15
    PRG 14
    PSG 13
    PQA 12

    4. A cutoff score called the neighbourhood word score threshold (T) is selected to reduce the number of possible matches to PQG to the significant ones.

    For example, if the cutoff score T is 13, only the words that score above 13 are kept. In our example, possible matches to PQG would include PEG (15) but not PQA (12).

    The list of possible matching words is thereby shortened from all 8000 possible words to approximately the 50 highest scoring.
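    Steps 3 and 4 can be sketched with a small excerpt of BLOSUM62. Only the seven entries needed for the PQG example are included, so this is not a full matrix, and the function names are hypothetical. The comparison is written as strictly greater than T to follow the wording of step 4.

```python
# Excerpt of BLOSUM62 limited to the residue pairs used in the PQG example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}


def pair_score(a, b):
    """Look up a symmetric score; raises KeyError for pairs outside the excerpt."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def word_score(query_word, candidate):
    """Sum the per-residue scores of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(query_word, candidate))


T = 13  # neighbourhood word score threshold from the example
candidates = ["PQG", "PEG", "PRG", "PSG", "PQA"]
neighbourhood = [w for w in candidates if word_score("PQG", w) > T]

for w in candidates:
    print(w, word_score("PQG", w))
print("kept:", neighbourhood)
```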

    5. The above procedure is repeated for each three-letter word in the query sequence. For a sequence length of 250 amino acids, the query contains 250 - 3 + 1 = 248 overlapping words of length w = 3, so the total number of words to search for is approximately 50 × 248 = 12,400.

    6. The remaining high scoring words that comprise possible matches to each three letter

    position in the query sequence are organized into an efficient search tree for

    comparing them rapidly to the database sequences.

    7. Each database sequence is scanned for an exact match to one of the 50 words corresponding to the first query sequence position, then for the words corresponding to the second position, and so on. If a match is found, it is used to seed a possible ungapped alignment between the query and database sequences.

    8. An attempt is made to extend an alignment from the matching words in each direction along the sequences, continuing for as long as the score continues to increase. The original version of BLAST extends the alignment between the query and the database sequence in the left and right directions, from the position where the exact match occurred. The extension does not stop until the accumulated total score of the HSP begins to decrease.

    At this point, a stretch of sequence that is larger than the original word, called a high-scoring segment pair (HSP), is said to have been found.
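    The extension in step 8 can be sketched as an ungapped walk outward from the seed. This is a simplified illustration, not the BLAST implementation: it scores +1 for a match and -1 for a mismatch instead of using a substitution matrix, and the x_drop parameter (how far the running score may fall below its best value before stopping) is a hypothetical knob whose default of 0 stops as soon as the score begins to decrease, as described above.

```python
def extend_seed(query, subject, q_pos, s_pos, word_len,
                match=1, mismatch=-1, x_drop=0):
    """Extend an exact word hit left and right without gaps.

    Returns (score, query segment, subject segment) for the resulting HSP.
    """
    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(i, j, step):
        # Walk outward from the seed; return (best gain, residues kept).
        best = running = 0
        kept = length = 0
        while 0 <= i < len(query) and 0 <= j < len(subject):
            running += pair_score(query[i], subject[j])
            length += 1
            if running > best:
                best, kept = running, length
            elif best - running > x_drop:
                break  # the accumulated score has begun to decrease
            i += step
            j += step
        return best, kept

    seed = sum(pair_score(query[q_pos + k], subject[s_pos + k])
               for k in range(word_len))
    right_gain, right_len = extend(q_pos + word_len, s_pos + word_len, +1)
    left_gain, left_len = extend(q_pos - 1, s_pos - 1, -1)
    q_start, q_end = q_pos - left_len, q_pos + word_len + right_len
    s_start, s_end = s_pos - left_len, s_pos + word_len + right_len
    return (seed + left_gain + right_gain,
            query[q_start:q_end], subject[s_start:s_end])


# Hypothetical sequences sharing the word PQG at position 4 in both.
hsp = extend_seed("ACGTPQGLKW", "TTGTPQGLKA", q_pos=4, s_pos=4, word_len=3)
print(hsp)
```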


    9. The next step is to determine whether each HSP score found by one of the above methods is greater in value than a cutoff score S. A suitable value of S is determined empirically by examining the range of scores found by comparing random sequences and by choosing a value that is significantly greater. The high-scoring pairs matched in the entire database are identified and listed.

    10. BLAST next determines the statistical significance of each HSP score.

    Significance of alignments

    Suppose an alignment reveals an intriguing similarity between two sequences. What should we do next? Is the similarity significant, or could it have arisen by chance? What is a practical approach to the problem?

    If the score of the observed alignment is no better than might be expected from a random permutation of the sequence, then it is likely to have arisen by chance. The alignment is unlikely to be significant if the randomized sequences score as well as the original one.

    We may randomize one of the sequences many times, realign each result to the second sequence, and collect the distribution of scores.

    We can measure the mean and standard deviation of the scores of the alignments of randomized sequences and ask whether the score of the original sequence is unusually high.
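    The randomization test just described can be sketched as follows: the observed score is compared with the mean and standard deviation of scores from shuffled sequences, giving (score - mean) / standard deviation. To keep the sketch short, the "alignment score" is a stand-in that simply counts identical residues at the same positions; a real test would realign each shuffled sequence with a proper alignment algorithm. The function names, the example sequence, and the number of shuffles are assumptions.

```python
import random
import statistics


def identity_score(seq_a, seq_b):
    """Stand-in for an alignment score: identical residues at equal positions."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


def shuffle_z_score(seq_a, seq_b, score_fn=identity_score, n_shuffles=200, seed=0):
    """Compare the observed score with scores from shuffled copies of seq_a."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = score_fn(seq_a, seq_b)
    residues = list(seq_a)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, random order
        shuffled_scores.append(score_fn("".join(residues), seq_b))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; cannot standardize")
    return (observed - mean) / sd


protein = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
z = shuffle_z_score(protein, protein)
print(f"Z = {z:.1f}")
```

    Shuffling preserves the amino acid composition of the sequence, so the test asks whether the order of residues, and not merely their frequencies, explains the observed score.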

    The Z-score reflects the extent to which the original result is an outlier from the randomized sequences:

    Z = \frac{\text{score} - \text{mean}}{\text{standard deviation}}

    A Z-score of zero means that the observed similarity is no better than the average of random permutations of the sequence, and might well have arisen by chance.

    P is another measure of significance. It is the probability that the observed match could have happened by chance. Guide to interpreting P values:

    P > 10^{-1}: match probably insignificant

    Pairwise Alignment with Dot Plots

    A dot plot is a graphical method for comparing two sequences. One protein or nucleic acid sequence is placed along the x axis and the other is placed along the y axis. Positions of identity are scored with a dot. A region of identity between two sequences results in the formation of a diagonal line.
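    A dot plot is simple enough to sketch in a few lines. Here each row corresponds to a residue of the sequence on the y axis and each column to a residue of the sequence on the x axis; the symbols "*" and "." and the example sequences are arbitrary choices for illustration.

```python
def dot_plot(seq_x, seq_y):
    """Return one text row per residue of seq_y, with '*' at identical positions."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


rows = dot_plot("GLKFA", "GLKFA")
for row in rows:
    print(row)
```

    Comparing a sequence with itself gives an unbroken main diagonal; insertions or deletions between two related sequences shift the diagonal, and repeats show up as additional parallel diagonals.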

    Max score: the score of the highest-scoring HSP from that database sequence.

    Total score: the total score of all HSPs from that database sequence.

    Query coverage: the percentage of the query length that is covered.

    Max identity: the maximal percent identity among the HSPs.
