Lecture 101

43
03/25/22 Bioinformatics I Fall 200 2 1 Multiple alignment: Feng-Doolittle algorithm

description

l

Transcript of Lecture 101

04/17/23 Bioinformatics I Fall 2002 1

Multiple alignment: Feng-Doolittle algorithmMultiple alignment: Feng-Doolittle algorithm

04/17/23 Bioinformatics I Fall 2002 2

Why multiple alignments?Why multiple alignments?

• Alignment of more than two sequences• Usually gives better information about

conserved regions and function (more data)

• Better estimate of significance when using a sequence of unknown function

• Must use multiple alignments when establishing phylogenetic relationships

• Alignment of more than two sequences• Usually gives better information about

conserved regions and function (more data)

• Better estimate of significance when using a sequence of unknown function

• Must use multiple alignments when establishing phylogenetic relationships

04/17/23 Bioinformatics I Fall 2002 3

Dynamic programming extended to many dimensions?

Dynamic programming extended to many dimensions?• No – uses up too much computer time and

space• E.g. 200 amino acids in a pairwise

alignment – must evaluate 4 x 104 matrix elements

• If 3 sequences, 8 x 106 matrix elements• If 6 sequences, 6.4 x 1013 matrix

elements

• No – uses up too much computer time and space

• E.g. 200 amino acids in a pairwise alignment – must evaluate 4 x 104 matrix elements

• If 3 sequences, 8 x 106 matrix elements• If 6 sequences, 6.4 x 1013 matrix

elements

04/17/23 Bioinformatics I Fall 2002 4

• Need to find more efficient method

• Sacrifice certainty of optimum alignment for certainty of good alignment but faster

• Need to find more efficient method

• Sacrifice certainty of optimum alignment for certainty of good alignment but faster

04/17/23 Bioinformatics I Fall 2002 5

Feng-doolittle algorithmFeng-doolittle algorithm

• Does all pairwise alignments and scores them

• Converts pairwise scores to “distances”• D = -logSeff = -log [(Sobs –Srand)/(Smax –Srand)]• Sobs = pairwise alignment score• Srand = expected score for random alignment• Smax = average of self-alignments of the two

sequences

• Does all pairwise alignments and scores them

• Converts pairwise scores to “distances”• D = -logSeff = -log [(Sobs –Srand)/(Smax –Srand)]• Sobs = pairwise alignment score• Srand = expected score for random alignment• Smax = average of self-alignments of the two

sequences

04/17/23 Bioinformatics I Fall 2002 6

• As Srand increases (increasing evolutionary distance), Seff goes down; this is why –log is used to scale Seff so that it’s linear with evolutionary distance.

• As Srand increases (increasing evolutionary distance), Seff goes down; this is why –log is used to scale Seff so that it’s linear with evolutionary distance.

04/17/23 Bioinformatics I Fall 2002 7

• Once the distances have been calculated, construct a guide tree (more in the phylogeny class) – tells what order to group the sequences

• Sequences can be aligned with sequences or groups; groups can be aligned with groups

• Once the distances have been calculated, construct a guide tree (more in the phylogeny class) – tells what order to group the sequences

• Sequences can be aligned with sequences or groups; groups can be aligned with groups

04/17/23 Bioinformatics I Fall 2002 8

• Sequence-sequence alignments: dynamic programming

• Sequence-group alignments: all possible pairwise alignments between sequence and group are tried, highest scoring pair is how it gets aligned to group

• Group-group alignments: all possible pairwise alignments of sequences between groups are tried; highest scoring pair is how groups get aligned

• Sequence-sequence alignments: dynamic programming

• Sequence-group alignments: all possible pairwise alignments between sequence and group are tried, highest scoring pair is how it gets aligned to group

• Group-group alignments: all possible pairwise alignments of sequences between groups are tried; highest scoring pair is how groups get aligned

04/17/23 Bioinformatics I Fall 2002 9

ExampleExample

Seq1 Seq2 Seq3 Seq4 Seq5

Alignment 1 Alignment 2

Alignment 3 Final alignment

04/17/23 Bioinformatics I Fall 2002 10

Notice that this method does not guarantee the optimum alignment; just a good one.

Gaps are preserved from alignment to alignment: “once a gap, always a gap”

Notice that this method does not guarantee the optimum alignment; just a good one.

Gaps are preserved from alignment to alignment: “once a gap, always a gap”

04/17/23 Bioinformatics I Fall 2002 11

In-class exerciseIn-class exercise

• Run Gap on all combinations of the sequences in multalign.rsf; penalize endgaps as before; use a gap penalty of 6 and an extension penalty of 2

• Record quality scores of each pairwise comparison

• Run Gap on all combinations of the sequences in multalign.rsf; penalize endgaps as before; use a gap penalty of 6 and an extension penalty of 2

• Record quality scores of each pairwise comparison

04/17/23 Bioinformatics I Fall 2002 12

In class exercise, contIn class exercise, cont

• use quality scores as distance measures; make a guide tree based on these scores

• Select all sequences in multalign.rsf; select Functions multiple alignment pileup; select options, select penalize end gaps; set gap penalty at 6, gap extension at 2

• Compare pileup’s guide tree with yours

• use quality scores as distance measures; make a guide tree based on these scores

• Select all sequences in multalign.rsf; select Functions multiple alignment pileup; select options, select penalize end gaps; set gap penalty at 6, gap extension at 2

• Compare pileup’s guide tree with yours

04/17/23 Bioinformatics I Fall 2002 13

In class exercise, contIn class exercise, cont

• Select pileup file in the output manager; select add to editor; select load second copy at the prompt

• Carefully examine pileup’s alignment; compare it to the pairwise alignments (you can open them from the output manager)

• Select pileup file in the output manager; select add to editor; select load second copy at the prompt

• Carefully examine pileup’s alignment; compare it to the pairwise alignments (you can open them from the output manager)

04/17/23 Bioinformatics I Fall 2002 14

• Start refining alignment:

• Use structural info if you have it

• Find patterns if you don’t

• Use amino acid structure handout from beginning of class for substitution decisions!

• Start refining alignment:

• Use structural info if you have it

• Find patterns if you don’t

• Use amino acid structure handout from beginning of class for substitution decisions!

04/17/23 Bioinformatics I Fall 2002 15

ClustalWClustalW

• Most widely used multiple alignment method

• Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex and gives generally superior results

• Ad hoc nature of the program can be mysterious

• Most widely used multiple alignment method

• Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex and gives generally superior results

• Ad hoc nature of the program can be mysterious

04/17/23 Bioinformatics I Fall 2002 16

Advantageous differencesAdvantageous differences

• Gap penalties vary locally:• By observed frequency (in database)

after each residue• By simple structure prediction – lower

gap penalties in probable loop regions• By proximity to existing gaps – higher

gap penalties when within 8 residues of an existing gap

• Gap penalties vary locally:• By observed frequency (in database)

after each residue• By simple structure prediction – lower

gap penalties in probable loop regions• By proximity to existing gaps – higher

gap penalties when within 8 residues of an existing gap

04/17/23 Bioinformatics I Fall 2002 17

Advantages, cont.Advantages, cont.• Change in substitution matrix choice

depending on distance computed for guide tree

• Substitution matrix families• Profile construction (more later)• Weighting of sequences in profiles

depending on evolutionary distance computed for guide tree

• More similar sequences get less weight than less similar sequences

• Change in substitution matrix choice depending on distance computed for guide tree

• Substitution matrix families• Profile construction (more later)• Weighting of sequences in profiles

depending on evolutionary distance computed for guide tree

• More similar sequences get less weight than less similar sequences

04/17/23 Bioinformatics I Fall 2002 18

In class exercise IIIn class exercise II

• Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time: this is done in Alignment Setup. After each run with a different change, save the alignment project with some descriptive name that you can remember (e.g., gap20 or blosum)

• Compare alignment results with different parameters changed

• Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time: this is done in Alignment Setup. After each run with a different change, save the alignment project with some descriptive name that you can remember (e.g., gap20 or blosum)

• Compare alignment results with different parameters changed

04/17/23 Bioinformatics I Fall 2002 19

MultAlinMultAlin

• MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments

• It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step

• Because this leads to cycles of tree building and alignmnent, MultAlin can take a long time to run. It stops after the overall alignment score stops improving

• MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments

• It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step

• Because this leads to cycles of tree building and alignmnent, MultAlin can take a long time to run. It stops after the overall alignment score stops improving

04/17/23 Bioinformatics I Fall 2002 20

Scoring a multiple sequence alignmentScoring a multiple sequence alignment• Assumptions:

• Sequences (rows) independent• Positions (columns) independent

• Neither assumption is true …• Score of a column is the (possibly

weighted) sum of all the pairwise comparisons (I.e., substitution matrix values) within that column

• Score of a multiple alignment is the sum of scores for all columns

• Assumptions:• Sequences (rows) independent• Positions (columns) independent

• Neither assumption is true …• Score of a column is the (possibly

weighted) sum of all the pairwise comparisons (I.e., substitution matrix values) within that column

• Score of a multiple alignment is the sum of scores for all columns

04/17/23 Bioinformatics I Fall 2002 21

Introduction to profilesIntroduction to profiles

• A profile is a way to take into account variability in characters within each position in a msa

• As you look along the column of a msa, you can see different characters

• A profile is a way to preserve the information about the differences observed within each column

• A profile is a way to take into account variability in characters within each position in a msa

• As you look along the column of a msa, you can see different characters

• A profile is a way to preserve the information about the differences observed within each column

04/17/23 Bioinformatics I Fall 2002 22

Profile building: simple model (Durbin, fig. 5.3)Profile building: simple model (Durbin, fig. 5.3)

vga--hageyv----nvdevvea--dvaghvkg------dvys--tyetsfna--nipkhiagadngagv

In column 1, we have 5 v’s, 1 f, and 1 i

To score an “a” in that position in a sequence being tested for fit to the profile, we use the frequency information in the column

04/17/23 Bioinformatics I Fall 2002 23

Profile: simple model, cont.Profile: simple model, cont.• Score for the alignment of that position is

5/7 s(v,a) + 1/7 s(f,a), + 1/7 s(i,a)where s(x,a) is the value from a substitution matrix for x substituted for a

• Gap penalty decreased according to the length of the longest gap spanning that column

• Idea is to maximize the score of the alignment of the new sequence to the profile (or reject an alignment that falls below a threshold)

• Score for the alignment of that position is5/7 s(v,a) + 1/7 s(f,a), + 1/7 s(i,a)

where s(x,a) is the value from a substitution matrix for x substituted for a

• Gap penalty decreased according to the length of the longest gap spanning that column

• Idea is to maximize the score of the alignment of the new sequence to the profile (or reject an alignment that falls below a threshold)

04/17/23 Bioinformatics I Fall 2002 24

Profile, evolutionary modelProfile, evolutionary model

• Instead of just estimating the probabilities found empirically to predict the probability of a different substitution, explicitly assume that everything seen in the column is a substitution from a common ancestor

• This allows us to use Bayesian approaches about prior and posterior probabilities

• Instead of just estimating the probabilities found empirically to predict the probability of a different substitution, explicitly assume that everything seen in the column is a substitution from a common ancestor

• This allows us to use Bayesian approaches about prior and posterior probabilities

04/17/23 Bioinformatics I Fall 2002 25

Profile-type programsProfile-type programs

• Again, idea is to make a profile, then search for more sequences that fit that profile

• Example: ProfileSearch, BLOCKS, etc. • Can combine this with statistical methods

like expectation maximization (MEME) and Gibbs Sampler. Both of these depend on “random” starting points and algorithms that converge on an endpoint

• Again, idea is to make a profile, then search for more sequences that fit that profile

• Example: ProfileSearch, BLOCKS, etc. • Can combine this with statistical methods

like expectation maximization (MEME) and Gibbs Sampler. Both of these depend on “random” starting points and algorithms that converge on an endpoint

04/17/23 Bioinformatics I Fall 2002 26

In class exercise: Profilemake, etc.In class exercise: Profilemake, etc.

• Open nosalign.rsf file• Select 3 or 4 sequences; then use Edit

select range, enter numbers for a range within these sequences (range of about 500 column positions)

• Select Functions ProfileMake; choose use selected regions; take defaults and select run

• Examine output

• Open nosalign.rsf file• Select 3 or 4 sequences; then use Edit

select range, enter numbers for a range within these sequences (range of about 500 column positions)

• Select Functions ProfileMake; choose use selected regions; take defaults and select run

• Examine output

04/17/23 Bioinformatics I Fall 2002 27

• Now select ProfileSearch; use Query profiles pulldown to select the profilemake (.prf) output you just created; this profile will be the input for the Profile Search

• The default database to search is SwissProt; we’ll leave it at the default, but you can use the Search set option to change or add to the databases searched (this is a local search)

• Choose ProfileSegments; this will show local alignments between some number of the returned sequences with the profile

• Choose run; this search will take a long time, perhaps 20 minutes

• Now select ProfileSearch; use Query profiles pulldown to select the profilemake (.prf) output you just created; this profile will be the input for the Profile Search

• The default database to search is SwissProt; we’ll leave it at the default, but you can use the Search set option to change or add to the databases searched (this is a local search)

• Choose ProfileSegments; this will show local alignments between some number of the returned sequences with the profile

• Choose run; this search will take a long time, perhaps 20 minutes

04/17/23 Bioinformatics I Fall 2002 28

Other GCG profile programsOther GCG profile programs

• MEME: finds conserved motifs in a group of unaligned sequences

• MotifSearch: uses output of MEME to search a database for new sequences

• MEME: finds conserved motifs in a group of unaligned sequences

• MotifSearch: uses output of MEME to search a database for new sequences

04/17/23 Bioinformatics I Fall 2002 29

Confusingly named but related GCG programsConfusingly named but related GCG programs• Motifs: searches a protein sequence for

presence of motif patterns already identified in the Prosite library

• ProfileScan: similar to Motifs, but uses profiles in the Prosite library to search the sequence for the presence of a profile; also performs the alignment of the profile to the sequence

• Motifs: searches a protein sequence for presence of motif patterns already identified in the Prosite library

• ProfileScan: similar to Motifs, but uses profiles in the Prosite library to search the sequence for the presence of a profile; also performs the alignment of the profile to the sequence

04/17/23 Bioinformatics I Fall 2002 30

Check out these websites for moreCheck out these websites for more• BLOCKS

• eMOTIF

• GIBBS

• HMMER

• SAGA

• BLOCKS

• eMOTIF

• GIBBS

• HMMER

• SAGA

04/17/23 Bioinformatics I Fall 2002 31

PAM matricesPAM matrices

• Started with alignments of very closely related proteins; each pair of sequences was at least 85% identical

• Then used idea of parsimony (least number of changes) to build phylogenetic trees

acgctafkigcgctafkiacgctafklgcgctgfkigcgctlfkiasgctafklacactafkl

• Started with alignments of very closely related proteins; each pair of sequences was at least 85% identical

• Then used idea of parsimony (least number of changes) to build phylogenetic trees

acgctafkigcgctafkiacgctafklgcgctgfkigcgctlfkiasgctafklacactafkl

04/17/23 Bioinformatics I Fall 2002 32

acgctafki

gcgctafki acgctafkl

gcgctgfki gcgctlfki

ag

ag al

asgctafkl acactafkl

il

gacs

04/17/23 Bioinformatics I Fall 2002 33

• For each amino acid, find the frequency Fi,j for which it is (reciprocally) substituted by every other amino acid

• for example, Fg,a = 3• Compute the relative mutability mi of each amino

acid• the relative mutability is the number of times

the amino acid is substituted by any other amino acid in the phylogenetic tree, divided by the number of mutations that could have affected the residue (times scaling factor 100)

• for example, ma • number of times a substituted = 4• number of mutations in entire tree x 2 = 12• frequency of a = 10/63 = 0.159• ma = 4/12 x 0.159 x 100 = 0.0209

• For each amino acid, find the frequency Fi,j for which it is (reciprocally) substituted by every other amino acid

• for example, Fg,a = 3• Compute the relative mutability mi of each amino

acid• the relative mutability is the number of times

the amino acid is substituted by any other amino acid in the phylogenetic tree, divided by the number of mutations that could have affected the residue (times scaling factor 100)

• for example, ma • number of times a substituted = 4• number of mutations in entire tree x 2 = 12• frequency of a = 10/63 = 0.159• ma = 4/12 x 0.159 x 100 = 0.0209

04/17/23 Bioinformatics I Fall 2002 34

• Compute the mutation probability Mij of each pair of amino acids

• Mij = mjFij/Fij

Fij is the total number of substitutions involving a in the phylogenetic tree

• Mg,a = 0.0209 x ¾ = 0.0156 (notice j refers to a)

• Divide the mutation probability by the frequency of occurrence fi of residue i, and take the log

• fi in this example is fg = 10/63 = 0.1587• Rg,a = log (0.0156/0.1587) = 1.01

• Compute the mutation probability Mij of each pair of amino acids

• Mij = mjFij/Fij

Fij is the total number of substitutions involving a in the phylogenetic tree

• Mg,a = 0.0209 x ¾ = 0.0156 (notice j refers to a)

• Divide the mutation probability by the frequency of occurrence fi of residue i, and take the log

• fi in this example is fg = 10/63 = 0.1587• Rg,a = log (0.0156/0.1587) = 1.01

04/17/23 Bioinformatics I Fall 2002 35

Then defined a 1PAM (1 point accepted mutation) matrix to be one where the expected number of mutations overall was 1% (that’s why the factor of 100 is used).

The entries in the whole matrix generated by the above method gives the PAM1 matrix; this supposes an amount of evolutionary time goes by that will let this amount of change happen

Then defined a 1PAM (1 point accepted mutation) matrix to be one where the expected number of mutations overall was 1% (that’s why the factor of 100 is used).

The entries in the whole matrix generated by the above method gives the PAM1 matrix; this supposes an amount of evolutionary time goes by that will let this amount of change happen

04/17/23 Bioinformatics I Fall 2002 36

• To get longer times, multiply PAM1 by itself by however many units of time you are interested in, e.g. PAM250 is PAM1 raised to the 250 power

• So for PAM, higher numbers indicate longer evolutionary time

• To get longer times, multiply PAM1 by itself by however many units of time you are interested in, e.g. PAM250 is PAM1 raised to the 250 power

• So for PAM, higher numbers indicate longer evolutionary time

04/17/23 Bioinformatics I Fall 2002 37

BLOSUM matricesBLOSUM matrices

• We’ll turn to the BLOSUM substitution matrices, because they were made using a special kind of profile called a block: areas of alignments that contain no gaps

• We want to assign a score that gives a measure of the relative likelihood that the sequences are related as opposed to unrelated

• So, we consider models of each case, assign a probability to the alignment in each case, and then take a ratio

• We’ll turn to the BLOSUM substitution matrices, because they were made using a special kind of profile called a block: areas of alignments that contain no gaps

• We want to assign a score that gives a measure of the relative likelihood that the sequences are related as opposed to unrelated

• So, we consider models of each case, assign a probability to the alignment in each case, and then take a ratio

04/17/23 Bioinformatics I Fall 2002 38

• We need to introduce some notation – don’t be frightened!

• xi is the ith symbol in sequence x; yj

is the jth symbol in sequence y

• The symbols come from an alphabet; for proteins, 20 letters; for DNA, 4 letters; letters from the alphabet (a,b)

• We need to introduce some notation – don’t be frightened!

• xi is the ith symbol in sequence x; yj

is the jth symbol in sequence y

• The symbols come from an alphabet; for proteins, 20 letters; for DNA, 4 letters; letters from the alphabet (a,b)

04/17/23 Bioinformatics I Fall 2002 39

Random model RRandom model R

• Letter a occurs independently with frequency qa, so the probability of the two sequences is the product of the probabilities of each character

• P(x,y|R) reads “the probability of sequences x and y given R”

• P(x,y|R) = qxi qyj

• Letter a occurs independently with frequency qa, so the probability of the two sequences is the product of the probabilities of each character

• P(x,y|R) reads “the probability of sequences x and y given R”

• P(x,y|R) = qxi qyji j

04/17/23 Bioinformatics I Fall 2002 40

Match model MMatch model M

• Pairs of residues occur with a joint probability pab (probability that a and b derived independently from common ancestor c)

• Probability of the whole alignment is the product of the individual joint probabilities

• P (x,y|M) = pxiyi

• Pairs of residues occur with a joint probability pab (probability that a and b derived independently from common ancestor c)

• Probability of the whole alignment is the product of the individual joint probabilities

• P (x,y|M) = pxiyii

04/17/23 Bioinformatics I Fall 2002 41

• Remember, we said we were going to take the ratio of these probabilities (M/R)

• So, • P(x,y|M) = i pxiyi = pxiyi

• P(x,y|R) iqxiiqyi qxiqyi

• This is known as the odds ratio: the product of the joint probabilities over the individual probabilities

• To make it additive, we make it the log-odds ratio by taking the logarithm; to get score, we add all the log-odds ratios

• S = i s(xiyi), where s(a,b) = log pab/qaqb

• This is the log likelihood ratio that the pair a,b is aligned as opposed to unaligned

• Remember, we said we were going to take the ratio of these probabilities (M/R)

• So, • P(x,y|M) = i pxiyi = pxiyi

• P(x,y|R) iqxiiqyi qxiqyi

• This is known as the odds ratio: the product of the joint probabilities over the individual probabilities

• To make it additive, we make it the log-odds ratio by taking the logarithm; to get score, we add all the log-odds ratios

• S = i s(xiyi), where s(a,b) = log pab/qaqb

• This is the log likelihood ratio that the pair a,b is aligned as opposed to unaligned

04/17/23 Bioinformatics I Fall 2002 42

BLOSUM matricesBLOSUM matrices

• Derived from BLOCKS database; aligned, ungapped regions from protein families

• Sequences in each block were clustered by percentage of identical residues > L%

• Then calculated frequencies Aab of observing a in one cluster aligned with b in a different cluster; corrections for sizes of clusters by weighting 1/n1n2

• Estimated qa = the fraction of pairings that included an a

• Derived from BLOCKS database; aligned, ungapped regions from protein families

• Sequences in each block were clustered by percentage of identical residues > L%

• Then calculated frequencies Aab of observing a in one cluster aligned with b in a different cluster; corrections for sizes of clusters by weighting 1/n1n2

• Estimated qa = the fraction of pairings that included an a

04/17/23 Bioinformatics I Fall 2002 43

• Estimated pab = the fraction of pairings of a with b (from all observed pairings)

• Then s(a,b) is the log-odds ratio

s(a,b) = log pab/qaqb

• (These values are actually scaled and rounded)• Thus the number after the BLOSUM (e.g.,

BLOSUM62) means that the log-odds scores in this matrix were from sequences that had > x% identity

• For BLOSUM, smaller number means less similarity, and hence we presume longer evolutionary time

• Estimated pab = the fraction of pairings of a with b (from all observed pairings)

• Then s(a,b) is the log-odds ratio

s(a,b) = log pab/qaqb

• (These values are actually scaled and rounded)• Thus the number after the BLOSUM (e.g.,

BLOSUM62) means that the log-odds scores in this matrix were from sequences that had > x% identity

• For BLOSUM, smaller number means less similarity, and hence we presume longer evolutionary time