Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... ·...

Sequence Similarity Methods

Gloria Rendon

SC11 – Education

June, 2011

Sequence Similarity Methods - caveats

• Assumption 1: Genes of closely related species are more similar than genes of distantly related species.

• Assumption 2: Similar genes have similar sequences.

• These methods predict the amount of evolution among species solely in terms of mutation events observed in the sequences of their genes.

The General Algorithm...

Step 1. COLLECT. Sequences are gathered

Step 2. COMPARE. Sequences are compared for similarity

Step 3. SCORE. A score is computed to assess significance of results

Step 4. CLUSTER. A matrix of sequence similarity is computed

Step 5 (optional). A phylogenetic tree is reconstructed from the matrix

Types of Similarity-Based Methods

•Alignment-free Methods:

o Based on k-word frequency

o Based on structural alignment

o Based on hidden Markov models

o Others

•Based on Sequence alignment

Alignment-based Methods

Alignment-based Methods

A sequence alignment is a way of arranging the sequences of DNA, RNA, or proteins to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Alignment-based Methods

A sequence alignment is a scheme of writing one sequence on top of another where the residues in one position are deemed to have a common evolutionary origin.

If the same letter occurs in both sequences then this position has been conserved in evolution.

If the letters differ, it is assumed that the two derive from an ancestral letter (which could be one of the two or neither).

Alignment Representation

Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.

[Figure: example alignment shown as a matrix, with columns labeled Sequence Name, Sequence Alignment, and Length]

Point Mutations

•ONLY these types of point mutation events are considered by alignment-based methods: insertion, deletion, substitution.

•Homologous sequences may have different length, though, which is generally explained through insertions or deletions in sequences.

•Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion.

•The term given to those dashes is indel or gap.

Gaps in Alignments

Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.

Gaps represent a) deletion or insertion events, or b) sites with missing information.

There are two types of Gaps (from the point of view of the aligning algorithm): gap opening and gap extension. Moreover, they are weighted differently by the algorithm.

One gap opening and two gap extensions
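The caption above can be read as a cost formula: a gap of length n is charged one opening plus n - 1 extensions. The sketch below assumes that convention (tools differ on whether the first position also pays an extension) and borrows the penalty values 5 and 0.2 from the exercises later in this section. The function names are hypothetical.

```python
def gap_cost(length, gap_open=5.0, gap_extend=0.2):
    """Affine cost of one gap run: one opening plus (length - 1) extensions."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


def total_gap_cost(aligned, gap_open=5.0, gap_extend=0.2):
    """Sum the affine cost over every run of dashes in one aligned row."""
    total, run = 0.0, 0
    for ch in aligned + "X":  # the sentinel closes a trailing gap run
        if ch == "-":
            run += 1
        else:
            total += gap_cost(run, gap_open, gap_extend)
            run = 0
    return total


# A gap of three dashes: one gap opening and two gap extensions.
print(gap_cost(3))
print(total_gap_cost("AC---GT-A"))
```

Because the opening is weighted more heavily than the extension, one long gap is cheaper than several short ones of the same total length.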

SNPs (single nucleotide polymorphism)

•Copying errors during cell division result in variations in the DNA at a particular location.

•These copying errors are point mutations called single nucleotide polymorphisms, or SNPs.

•SNPs are passed on to the next generation through inheritance.

SNPs are a special case of point mutations

Role of SNPs

•In humans SNPs account for much of the genetic diversity.

•Certain genetic diseases have been linked to SNPs.

•However, many SNPs do not result in observable differences.

Point Mutation Analysis

The reason for aligning sequences when trying to elucidate their evolutionary relationship is that algorithms can calculate an estimate of their evolutionary distance from the alignment.

These methods are based on Levenshtein’s notion of edit distance between strings:

“Edit distance is the minimum number of edit operations needed to transform one string into another.”

“The more similar the sequences are, the smaller their edit distance is”
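A minimal sketch of the edit distance quoted above, using the standard dynamic-programming formulation with unit cost for each insertion, deletion, and substitution. The function name and the test strings are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))   # one deletion
print(edit_distance("ACGT", "ACGT"))  # identical strings
```

Identical sequences have distance 0, and every additional point mutation can raise the distance by at most 1.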

Types of Alignment-based Methods

•Global alignment is when matching is attempted over the entire length of the sequences. This is usually the choice when aligning very similar sequences.

•Local alignment is when matching is done for specific segments of the sequences. This is usually the choice when the sequences are believed to contain conserved regions.


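•Earlier we used BLAST to search for a sequence given a partial segment of it.

•BLAST (Basic Local Alignment Search Tool) performs local alignments and reports the best-scoring matches; when the query matches well along its whole length, the reported local alignment can span the entire query.

•Re-examine the results page and find out whether the reported alignment covers the entire query or only a segment of it.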

Let us re-examine the portion of this page that displays the alignment --marked with 3


There are three rows.

The numbers in the left column specify the starting position.

The numbers on the right specify the ending position.

The first row is the partial sequence you typed, named Query.

The third row is the sequence it is being matched against; in this case, P46098.

The second row is the result of the alignment between the top and bottom sequences.

The match is exact at every position.


•Pair-wise alignment. Two sequences are aligned together

•Multiple sequence alignment. Three or more sequences are aligned together

Pairwise Alignment

Illustrated with BLAST and an 18S ribosomal RNA sequence

Pair-wise Alignment

1.Collect the two sequences

2. Align the sequences

3. Count the mutations in the alignment

4. Score the alignments

Pair-wise Alignment





>seq2|LemnaMinor_18S_rRNA
CTCCTACCGATTGAATGGTCCGGTGAAGCGCTCGGATCGCGGCGACGAGGGCGGTCCCCCGCCCGCGACGTCGCGAGAAGTCCGTTGAACCTTATCATTTAGAGGAAGGAG

The first sequence is displayed above.

To get the second sequence and perform the alignment, we simply use BLAST.

Go to the BLAST page at NCBI

blast.ncbi.nlm.nih.gov

Then click on nucleotide blast

Pair-wise Alignment

This is the nucleotide blast page at NCBI

Paste the sequence in the box

Select a database from the drop-down list; in this case, choose Nucleotide collection

Scroll to the bottom of the page and click on the BLAST button

Pair-wise Alignment

This is the results page of the BLAST search.

The top hit is our original sequence.

It is listed in the table along with some statistics.

Let’s look under the hood to understand what happened and how the stats were calculated.

Pair-wise Alignment





If you scroll down the same results page, you will see the results of all the pairwise alignments that BLAST included in the report.

They will be sorted from best alignment (first one in the report) to worst alignment (last one in the report).

This is the first one, therefore it is the best match.

Pair-wise Alignment





Steps 3 and 4 are performed after the alignment is complete, in order to assess how good a match it is.

First, we need to count mismatches in the alignment.

Cell (T, T) = number of unchanged T residues = 1
Cell (T, G) = number of substitutions from T to G
Cell (T, C) = number of substitutions from T to C
Cell (T, A) = number of substitutions from T to A
Cell (T, -) = number of deletions of T

...

Cell (-, T) = number of insertions of T
Cell (-, G) = number of insertions of G
Cell (-, C) = number of insertions of C
Cell (-, A) = number of insertions of A = 0
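The cells listed above can be filled by walking the two aligned rows column by column. This is a sketch only: the two aligned rows are made-up data, not the BLAST alignment from the slide, and `count_pairs` is a hypothetical helper. Each key is (residue in the top row, residue in the bottom row), with a dash standing for a gap.

```python
from collections import Counter


def count_pairs(top, bottom):
    """Count how often each (top, bottom) residue pair occurs in an alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return Counter(zip(top, bottom))


# Hypothetical alignment, for illustration only.
counts = count_pairs("ACGT-TA",
                     "ACCTGT-")

print(counts[("T", "T")])  # unchanged T residues
print(counts[("G", "C")])  # substitutions from G to C
print(counts[("A", "-")])  # deletions of A
print(counts[("-", "G")])  # insertions of G
```

A `Counter` returns 0 for pairs that never occur, which matches cells such as the insertions of A in the list above.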

Counting Mismatches (mutations)

Not all mismatches are created equal. Some substitutions are more likely than others; therefore we must use weight values, such as those in substitution matrices.

Pair-wise Alignment





Scoring the alignments

Note that the result is a single value, a score, obtained by multiplying the alignment (count) matrix and the substitution matrix cell by cell and then adding the values of the resulting matrix, as shown here.
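A sketch of that scoring step: each count is multiplied by the weight in the same cell, and the products are added up. The weights here (+1 match, -1 substitution, -2 for any pair with a gap) are assumptions chosen for illustration and are not a published substitution matrix; the counts are made-up data.

```python
ALPHABET = "ACGT-"


def alignment_score(counts, weights):
    """Multiply each count by the weight in the same cell and add the products."""
    return sum(counts.get(pair, 0) * weight for pair, weight in weights.items())


# Hypothetical weights: +1 match, -1 substitution, -2 insertion or deletion.
weights = {}
for x in ALPHABET:
    for y in ALPHABET:
        if x == "-" and y == "-":
            continue  # a gap is never aligned against a gap
        if x == "-" or y == "-":
            weights[(x, y)] = -2
        elif x == y:
            weights[(x, y)] = 1
        else:
            weights[(x, y)] = -1

# Hypothetical counts for a 7-column alignment.
counts = {("A", "A"): 1, ("C", "C"): 1, ("G", "C"): 1,
          ("T", "T"): 2, ("-", "G"): 1, ("A", "-"): 1}

score = alignment_score(counts, weights)
print(score)
```

Swapping in a different weight table changes the score without touching the counts, which is why the two matrices are kept separate.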

So, now you have a clearer idea of what goes under the hood of pairwise-alignment tools like BLAST.

Exercise 2: Using BLAST to transfer annotation

Sometimes we have a gene (or protein) for which an annotation (the description line in fasta format) is unknown; for example, when a new genome is being sequenced.

The general ‘in-silico’ procedure for assigning an annotation to that newly sequenced gene (or protein) calls for using BLAST to find a similar gene (or protein) for which the annotation is known.

If the match is close enough, we can then transfer the annotation from the known gene (or protein) to the new one.


•Open a web browser and go to the UNIPROT url www.uniprot.org

1. Click on the Blast tab

2. In the box, type the identifier: A7JKN7_FRANO

3. Then click on the BLAST button

http://www.uniprot.org/


Notice how the UniProt-Blast program fetches the corresponding sequence before launching the BLAST search.

Also notice that the annotation (description line) is unknown.


This is the BLAST result page.

The first and second hits do not have annotations either.

The third hit is annotated as Neurotransmitter-gated ion-channel. So, at first blush, we could transfer that annotation to the protein A7JKN7_FRANO.

Exercise 3: GLOBAL Pairwise alignment program

• Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/

• Choose Programs/ Alignment /pairwise/global/needle from the Programs box (left)

• Copy-paste any two sequences from the file woese.seqs.fasta

• Select the parameters: gap penalty=5, gap extension=0.2

• Click on Run

• A job will be created to run this program with your data

• Once the job is done we can view the results

• Q: how many gaps were inserted?

• Q: what is the score of the alignment?

• Q: what is the percent identity of the alignment?
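To see what a global aligner computes, here is a minimal sketch of the Needleman-Wunsch recurrence (cited in the Additional Readings). It is a simplification, not the needle program itself: it assumes a single linear gap penalty rather than separate opening and extension penalties, and the scoring values (match +1, mismatch -1, gap -2) are arbitrary choices for illustration. It returns only the score, not the alignment.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag,               # align a[i-1] with b[j-1]
                            prev[j] + gap,      # gap in b
                            curr[j - 1] + gap)) # gap in a
        prev = curr
    return prev[-1]  # bottom-right cell: both sequences fully consumed


print(global_score("ACGT", "ACGT"))
print(global_score("ACGT", "AGT"))
```

The score is read from the last cell because a global alignment must account for every residue of both sequences, end to end.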

Exercise 4: LOCAL Pairwise alignment program

• Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/

• Choose Programs/ Alignment /pairwise/local/water from the Programs box (left)

• Copy-paste THE SAME two sequences from the example we just finished

• Select the parameters: gap penalty=5, gap extension=0.2

• Click on Run

• A job will be created to run this program with your data

• Once the job is done we can view the results

• Q: how many gaps were inserted?

• Q: what is the score of the alignment?

• Q: is the local alignment identical to the global alignment of the previous exercise? Explain.
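For comparison with the global case, a minimal sketch of the Smith-Waterman recurrence (cited in the Additional Readings). As a simplification it assumes a linear gap penalty and arbitrary scoring values, so it will not reproduce the scores reported by the water program. The two changes from the global recurrence are that no cell may drop below 0 and that the answer is the best cell anywhere in the table.

```python
def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)  # a local alignment may start anywhere for free
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0,                  # restart the alignment here
                       diag,               # align a[i-1] with b[j-1]
                       prev[j] + gap,      # gap in b
                       curr[j - 1] + gap)  # gap in a
            curr.append(cell)
            best = max(best, cell)         # the alignment may end anywhere
        prev = curr
    return best


# Two otherwise unrelated strings that share the segment ACG.
print(local_score("TTTTACGTTTTT", "GGGACGGGG"))
print(local_score("AAAA", "CCCC"))
```

Unrelated flanking regions do not lower the score, which is why local alignment suits sequences that share only a conserved region.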

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA)

•Multiple sequence alignment methods try to align a group of three or more related sequences at once.

•MSAs are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.

•MSAs are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees.

•MSA algorithms are more complex (in both time and space) than pairwise alignment algorithms.

Multiple Sequence Alignment (MSA)

•A multiple alignment arranges a set of sequences in a scheme where positions believed to be homologous are written in a common column.

•As in a pairwise alignment, when a sequence does not possess an amino acid in a particular position, this is denoted by a dash (indel, gap).

•The scoring function calculates ‘the similarity’ of a sequence in relationship to the entire group.

Conservation in a MSA

•In addition to the alignment itself, a line is added at the end with information about the degree of conservation for each position (i.e., column):

No symbol: there is no conservation in the column.

"*": exact match of the residue for all sequences.

":": high degree of conservation; the mutations were for residues of similar biochemical properties, shown with letters of the same color.

".": there is conservation among the majority of the sequences; however, there were mutations for residues of a different group.

• Open a web browser and go to MOBYLE portal: mobyle.pasteur.fr/

• Choose Alignment/multiple/clustalw-multialign from the Programs box (left)

• Select Upload to copy the sequences from the file 8species.seqs.fasta on your computer to the portal

• Click on Run

• A job will be created to run this program with your data

• Once the job is done we can view the results: alignment, tree, output

Exercise 4: Using an MSA with the Eight Species Solar System

Clustalw Results: aln file

• The first segment of the ClustalW results page shows the alignment itself as shown below

• To see additional information about conservation, please click on ‘view with jalview’

The consensus sequence refers to the most common residue (nucleotide or amino acid) at a particular position after a MSA has been calculated.

The consensus sequence for the eight imaginary species is:

A – T – A G A G

The most conserved positions are:

T in the third position (6/8)

A in the fifth position (6/8)

Hence the height of the bars in the histogram denotes frequencies of the most common residues at that location.
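A sketch of how a consensus and its per-column frequencies could be computed from an alignment. The eight species sequences are not reproduced in this text, so the four rows below are made-up data, and `consensus` is a hypothetical helper. Ties between residues are not handled specially here.

```python
from collections import Counter


def consensus(rows):
    """For each column, return the most common residue and its frequency."""
    result = []
    for column in zip(*rows):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(rows)))
    return result


# Hypothetical aligned rows, for illustration only.
rows = ["ACGT",
        "ACGA",
        "ACTT",
        "AGGT"]

columns = consensus(rows)
print("".join(residue for residue, _ in columns))
print([freq for _, freq in columns])
```

The frequencies are what the histogram bars show: a taller bar means more sequences agree on the residue in that column.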

Clustalw Results: aln file and consensus

Clustalw Results: tree file and output file

•We will skip the tree for now.

•Let us examine the output file. Click here

ClustalW Results: Output file

CLUSTAL 2.0.12 Multiple Sequence Alignments

Sequence format is Pearson

Sequence 1: 1 7 bp

Sequence 2: 2 7 bp


Sequence 5: 5 7 bp


Sequence 8: 8 7 bp

Start of Pairwise alignments

Aligning...

Sequences (1:2) Aligned. Score: 85

Sequences (1:3) Aligned. Score: 42

Sequences (1:4) Aligned. Score: 28
















Guide tree file created: [8planets.dnd]

Start of Multiple Alignment

Aligning...

Group 1: Sequences: 2 Score:114

Group 2: Sequences: 2 Score:114

Group 3: Sequences: 4 Score:69

Group 4: Sequences: 2 Score:114

Group 5: Sequences: 2 Score:123

Group 6: Sequences: 4 Score:83

Group 7: Sequences: 8 Score:59

Alignment Score 209

CLUSTAL-Alignment file created [8planets.aln]
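The pairwise scores in this output are the raw material for clustering. A small sketch for collecting them, assuming the log lines follow the "Sequences (i:j) Aligned. Score: n" pattern shown above; only the three score lines visible in this listing are used, and the helper name is hypothetical.

```python
import re

# Matches lines such as: Sequences (1:2) Aligned. Score: 85
PAIR_LINE = re.compile(r"Sequences \((\d+):(\d+)\) Aligned\. Score:\s*(\d+)")


def pairwise_scores(log_text):
    """Map each (i, j) sequence pair found in the log to its score."""
    return {(int(i), int(j)): int(score)
            for i, j, score in PAIR_LINE.findall(log_text)}


log = """Sequences (1:2) Aligned. Score: 85
Sequences (1:3) Aligned. Score: 42
Sequences (1:4) Aligned. Score: 28"""

scores = pairwise_scores(log)
closest = max(scores, key=scores.get)  # pair with the highest score
print(scores)
print(closest)
```

Pairs with high scores are candidates for the same cluster; here sequence 1 is much closer to sequence 2 than to sequences 3 or 4.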

Input pair-wise alignment scores clustering

• Open a web browser and go to MOBYLE portal: mobyle.pasteur.fr/

• Choose Alignment/multiple/clustalw-multialign from the Programs box (left)

• Select Upload to copy the sequences from the file woese.seqs.fasta on your computer to the portal

• Click on Run

• A job will be created to run this program with your data

• Once the job is done we can view the results: alignment, tree, output

Exercise 5: Using an MSA program to re-discover the Three Kingdoms of C. Woese

Clustalw Output Results

The table was constructed with the file shown in the window “Standard Output”.

Q: Cluster these results. Do they fall into three groups?

Sequence 1: Methanosarcina_barkeri 1262 bp
Sequence 2: Methanothermobacter_thermau 1494 bp
Sequence 3: Methanobrevibacter_ruminant 1260 bp
Sequence 4: Methanococcus_maripaludis_C6 1465 bp
Sequence 5: Lemna_minor_chloroplast 1487 bp
Sequence 6: Aphanocapsa_sp._HBC6 1441 bp
Sequence 7: Corynebacterium_diphtheriae 712 bp
Sequence 8: Bacillus_firmus_strain_QJGY2 746 bp
Sequence 9: Chloribium_vibrioforme__Pros 1243 bp
Sequence 10: Escherichia_coli_HS 1542 bp
Sequence 11: Mus_musculus_L_cell 918 bp
Sequence 12: Lemna_minor_18S_rRNA 111 bp
Sequence 13: Saccharomyces_cerevisiae_str 1730 bp

Types of Similarity-Based Methods

•Alignment-free Methods:

o Based on k-word frequency

o Based on structural alignment

o Based on hidden Markov models

o Others

•Based on Sequence alignment

K-word

In bioinformatics, a k-word (or k-tuple) is a sequence of length k.

A sequence of length n has n – k + 1 k-words.

Example query string L: TGATGATGAAGACATCAG

For k = 8, the set of k-tuples of L is

TGATGATG
GATGATGA
ATGATGAA
TGATGAAG
…
GACATCAG
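The sliding-window definition above can be sketched in one line of code. The function name is illustrative; the query string is the one from the example.

```python
def kwords(seq, k):
    """All overlapping k-words of seq, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


L = "TGATGATGAAGACATCAG"
words = kwords(L, 8)

print(len(words))           # n - k + 1 with n = 18 and k = 8
print(words[0], words[-1])  # first and last k-word
```

If k is larger than the sequence, the range is empty and no k-words are returned.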

K-word Lists

Consider the k-words when k=2 and L=GCATCGGC:

GC, CA, AT, TC, CG, GG, GC

AT: 3 → means the k-word AT in sequence L starts at position 3
CA: 2
CG: 5
GC: 1, 7
GG: 6
TC: 4
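A sketch of how such a k-word list could be built: each k-word is mapped to the list of positions where it starts. Positions are 1-based to match the list above; the function name is hypothetical.

```python
from collections import defaultdict


def kword_index(seq, k):
    """Map each k-word of seq to the 1-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)  # convert 0-based i to 1-based
    return dict(index)


index = kword_index("GCATCGGC", 2)
for word in sorted(index):
    print(word, index[word])
```

A k-word that occurs more than once, such as GC here, simply accumulates several positions.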

K-word Frequency Methods

Goal: Find common k-words in a group of sequences that have statistical significance.

Let us illustrate this statement with an example in natural language.

One English scholar tries to determine if a newly found manuscript was written by Shakespeare. He compares one page from the new manuscript against a page from one of Shakespeare’s works.

The top k-word of length 4 in common between the two pages is THOU.

Is this k-word statistically meaningful?

K-word Frequency Methods

• Based on Euclidean Distance

• Based on Weighted Euclidean Distance

• Based on Correlation

• Based on Covariance

• Based on Information Content
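A sketch of the first option in the list above: each sequence is turned into a vector of k-word frequencies, and the Euclidean distance between the two vectors is taken. Using relative frequencies rather than raw counts is an assumption made here so that sequences of different lengths are comparable; the function names and test strings are illustrative.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq, k):
    """Relative frequency of each k-word in seq."""
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(seq_a, seq_b, k):
    """Euclidean distance between the k-word frequency vectors of two sequences."""
    fa = kword_frequencies(seq_a, k)
    fb = kword_frequencies(seq_b, k)
    words = set(fa) | set(fb)  # a k-word missing from one sequence has frequency 0
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


print(euclidean_distance("GCATCGGC", "GCATCGGC", 2))  # identical sequences
print(euclidean_distance("AAAA", "CCCC", 1))          # nothing in common
```

No alignment is computed at any point; only the k-word content of each sequence is compared.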

Algorithm using K-word Frequency to determine sequence similarity

• Collect sequences

• Calculate meaningful k-words

• Identify k-words in sequences

• Catalog k-words

• Score significance

• Cluster sequences into similar groups

The same algorithm in graphical form

1. Collect seqs

2. Calc k-words

3. Search k-words

4. Catalog k-words

4. Score

5. Cluster

Step 2: Alternatives for K-words

•Example 1: Use Interpro, a database of already calculated k-words, called protein functional domains. Results may overlap

•Example 2: Use tools such as MEME to calculate ‘de-novo’ k-words from a training set the user specifies. Results do not overlap.

Exercise 1: Using INTERPRO

With the following exercise, we will go to INTERPRO, a website whose entire database is made up of k-words called protein functional domains.

A functional domain is a segment of the protein that generally has a very specific function and/or structure.

INTERPRO’s search engine takes as input a single protein sequence and tries to match it against its catalog of functional domains. The k-words may overlap.

Exercise 1: using INTERPRO

1. Open a web browser and retrieve the sequence of the protein that we found in Scenario 3 of the previous section by pasting this link in the browser: http://www.uniprot.org/uniprot/P46098.fasta

2. Copy this sequence to the clipboard (Ctrl-A Ctrl-C).

3. Open another tab in the browser and type this link to go to the INTERPRO site: http://www.ebi.ac.uk/Tools/pfa/iprscan/

4. Paste the sequence from the clipboard into the box provided on this page for the query sequence (Ctrl-V).

5. Scroll down to the bottom of the page, leaving all parameters at their default values, and click on the Submit button.

6. Examine the results page; it should look similar to the figure on the next page.

7. How many k-words were found in this protein?

8. What score did each k-word receive?

9. Are there any k-words in the specific segment of the sequence that the paper discussed and that we used in Scenario 3 to look for the entire sequence?


[Figure: INTERPRO results page, annotated with k-word identifiers, k-word alias(es), and k-word location and length]

Some k-words DO overlap


The INTERPRO k-words found were:

IPR006029
IPR006201
IPR006202
IPR008132
IPR008133
IPR018000

Now, we can search for proteins with those k-words

Exercise 1: using INTERPRO


Go to the uniprot.org page and type in the query box:

(IPR006029 AND IPR006201 AND IPR006202 AND IPR008132 AND IPR008133 AND IPR018000)

Then click on the Search button

It should look like the figure here with 19 hits



Uniprot DOES NOT calculate a score for us.

However, we can go ahead and cluster the 19 hits into a single group since they all have the same k-words and are similarly annotated as serotonin receptors

Exercise 2: Using MEME

With the following exercise, we will use MEME to discover motifs in a set of proteins in a de-novo way.

MEME will start with a training set that you provide and will identify meaningful k-words called MEME motifs.

The training set is a group of scorpion neurotoxin sequences. From prior knowledge of this group of sequences, we know which residues are conserved and play a key role.

Therefore the exercise will focus on adjusting the parameters of the MEME tool so that the resulting motifs will include all those key residues.

Exercise 2: Using MEME

Key residues are marked in this figure: the cysteines that form the disulfide bridges; the residues R, G, K in positions 28, 29, 30; and Y in position 39.


1. Open a web browser and go to the MEME web server at http://meme.nbcr.net/

2. Scroll down to the programs and click on the MEME icon; it will take you to the MEME Data Submission Form


Exercise 2: Using MEME

3. In the segment of the page marked as 1, you need to include the training set. It is the file called toxins.fasta

4. In the segment of the page marked as 2, you need to specify occurrences, or repetitions, of the k-word PER sequence. Do not change the default value.

5. In the segment of the page marked as 3, you need to specify the width of the k-word. For a fixed width you type the same value in Minimum and Maximum. For variable width you type the limits of the window. Try these values: 5-10

6. In the segment of the page marked as 4, you need to specify the maximum number of motifs. Type 10

7. Leave the other parameters unchanged. Scroll to the bottom of the page and click on Start search.

8. Results will be emailed to you.

Exercise 2: MEME 5-10-10

Let us examine the k-words found by MEME.

They are ordered by statistical significance; hence, motif-1 is the most significantly conserved segment in the training set and motif-n is the least significantly conserved segment.

This is the WEBLOGO representation of motif-1 that consists of 10 residues. The height of the residue is proportional to its occurrence frequency in the training set.

In position 1, a K is completely conserved.

In position 2, a C is completely conserved.

In position 3, an M is completely conserved.

In position 4, it could be N or G; however, N is more likely than G.

Etc.

Exercise 2: MEME 5-10-10

Now, let us scroll to the bottom of the page to see the diagram of motifs.

For each sequence we observe k-words and gaps. The fewer the gaps, the better the coverage. So, there is good coverage in this figure.

From prior knowledge of this group, we know it to be a set of relatively conserved motifs.

The more sequences with the same diagram of motifs, the more similar the sequences are.

The trouble with this diagram is that it gives the impression that the sequences are not that similar to each other.


Repeat the same steps with these other parameter values

For width: minimum 5, maximum 20

For number of motifs: 5

Examine the resulting motifs and decide which run gives the better result.

Exercise 2: MEME 5-20-5

This is the WEBLOGO on motif-1. Compared to motif-1 in the previous run, we can see that:

•This one has twice as many residues.

•This one includes more of the key residues (all the tall Cs and the G).

•There is more variability in other parts of the motif (more letters per column).


Now, let us scroll to the bottom of the page to see the diagram of motifs.

This diagram looks much better than in the previous run:

•There are few gaps in the diagram for this run too.

•The diagram of motifs looks similar for many more sequences in this run than in the previous run.

•Conservation or similarity among sequences by virtue of having almost identical motif diagrams can be appreciated better in this run than in the previous one.


So, we have FIVE k-words found by MEME.

Now, we need to search for similar sequences.

The email you received from MEME specifies a number of links to the results. Click on the link

MEME output as html



On the MEME results page, scroll down until you see the section called Further Analysis

Click on the MAST button.

MAST is the search engine that looks for sequences that match any of the k-words discovered with MEME

K-words from MEME will be used as input here.

Select database to search

Then click on Search button.

Exercise 2: MAST results

Among the 135 hits, those that have a motif diagram similar to any diagram in the training set would be the sequences most similar to the neurotoxins.

Exercise 2: MAST results

There exist formulas that use the motif diagram and e-value to perform scoring and clustering of the results.

Additional Readings

• Online lecture notes on Bioinformatics. Lectures.molgen.mpg.de/online_lectures.html

• Vinga S, Almeida J. Alignment-free sequence comparison--a review. Bioinformatics 2003; 19:513-23.

• Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 1970; 48:443-53.

• Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology 1981; 147:195-7.

• Gotoh O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 1982; 162:705-8.

• Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 1994; 22:4673-80.

• Mulder N, Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 2007; 396:59-70.
