Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... ·...
Sequence Similarity Methods
Gloria Rendon
SC11 – Education
June, 2011
Sequence Similarity Methods - caveats
• Assumption 1: Genes of closely related species are more similar than genes of distantly related species.
• Assumption 2: Similar genes have similar sequences.
• These methods predict the amount of evolution among species solely in terms of mutation events observed in the sequences of their genes.
The General Algorithm...
Step 1. COLLECT. Sequences are gathered
Step 2. COMPARE. Sequences are compared for similarity
Step 3. SCORE. A score is computed to assess significance of results
Step 4. CLUSTER. A matrix of sequence similarity is computed
Step 5 (optional). A phylogenetic tree is reconstructed from the matrix
Types of Similarity-Based Methods
• Alignment-free Methods:
o Based on k-word frequency
o Based on structural alignment
o Based on hidden Markov models
o Others
• Based on Sequence alignment
Alignment-based Methods
A sequence alignment is a way of arranging the sequences of DNA, RNA, or proteins to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
A sequence alignment is a scheme of writing one sequence on top of another, where the residues in one position are deemed to have a common evolutionary origin.
If the same letter occurs in both sequences, then this position has been conserved in evolution.
If the letters differ, it is assumed that the two derive from an ancestral letter (which could be one of the two or neither).
Alignment Representation
Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix.
[Figure: an alignment shown as a matrix of rows, with the labels Sequence Name, Sequence Alignment, and Length]
Point Mutations
• Only these types of point mutation events are considered by alignment-based methods: insertion, deletion, and substitution.
• Homologous sequences may have different lengths, though, which is generally explained through insertions or deletions in the sequences.
• Thus, a letter or a stretch of letters may be paired up with dashes in the other sequence to signify such an insertion or deletion.
• The term given to those dashes is indel or gap.
Gaps in Alignments
Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
Gaps represent (a) deletion or insertion events, or (b) sites with missing information.
There are two types of gaps (from the point of view of the aligning algorithm): gap opening and gap extension. The algorithm weights them differently.
[Figure: an alignment containing one gap opening and two gap extensions]
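The opening/extension distinction can be sketched in a few lines of Python. This is a minimal illustration, assuming the common convention in which the first dash of a run is the gap opening and each further dash in the same run is an extension. The function names and the penalty values (5 and 0.2) are illustrative choices, not the settings of any particular tool.

```python
import re


def gap_events(aligned_row):
    """Return (openings, extensions) for one row of an alignment.

    Assumption: '-' marks a gap; each run of dashes contributes one
    opening, and every additional dash in the run is an extension.
    """
    runs = re.findall(r"-+", aligned_row)
    openings = len(runs)
    extensions = sum(len(run) - 1 for run in runs)
    return openings, extensions


def gap_penalty(aligned_row, gap_open=5.0, gap_extend=0.2):
    """Total gap cost of a row under the (hypothetical) weights given."""
    openings, extensions = gap_events(aligned_row)
    return openings * gap_open + extensions * gap_extend


row = "ACG---TTA"           # made-up aligned row with one run of three dashes
print(gap_events(row))      # (1, 2): one opening, two extensions
print(gap_penalty(row))
```

Because an opening is weighted more heavily than an extension, one long gap is cheaper than several short ones of the same total length.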
SNPs (single nucleotide polymorphisms)
•Copying errors during cell division result in variations in the DNA at a particular location.
•These copying errors are point mutations called single nucleotide polymorphisms, or SNPs.
•SNPs are passed on to the next generation through inheritance.
SNPs are a special case of point mutations
Role of SNPs
•In humans, SNPs account for much of the genetic diversity.
•Certain genetic diseases have been linked to SNPs.
•However, many SNPs do not result in observable differences.
Point Mutation Analysis
The reason for aligning sequences when trying to elucidate their evolutionary relationship is that algorithms can calculate an estimate of their evolutionary distance from the alignment.
These methods are based on Levenshtein’s notion of edit distance between strings:
“Edit distance is the minimum number of edit operations needed to transform one string into another.”
“The more similar the sequences are, the smaller their edit distance is”
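Edit distance can be computed with a short dynamic-programming routine. The sketch below is a standard textbook formulation in which insertions, deletions, and substitutions each cost 1; the function name and the example strings are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if char_a == char_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))        # 1 (one deletion)
print(edit_distance("GATTACA", "GATTACA")) # 0 (identical strings)
```

Real alignment tools do not weight all edits equally; this unit-cost version only illustrates the underlying idea.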
Types of Alignment-based Methods
•Global alignment is when matching is attempted over the entire length of the sequences. This is usually the choice when aligning very similar sequences.
•Local alignment is when matching is done for specific segments of the sequences. This is usually the choice when it is believed that the sequences contain conserved regions.
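The difference between the two modes shows up clearly in a small dynamic-programming sketch. The code below is a simplified illustration in the style of the Needleman-Wunsch (global) and Smith-Waterman (local) algorithms. It assumes a single linear gap penalty and made-up scoring values (match=1, mismatch=-1, gap=-2), and it returns only the score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align two residues
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)   # local mode may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


long_seq, short_seq = "GGGGACGTCCCC", "ACGT"   # made-up example
print(alignment_score(long_seq, short_seq))              # -12
print(alignment_score(long_seq, short_seq, local=True))  # 4
```

The global score is dragged down by the eight gap positions needed to span the longer sequence, while the local score reflects only the shared ACGT segment.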
Types of Alignment-based Methods
•Earlier we used BLAST to search for a sequence given a partial segment of it.
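•BLAST (Basic Local Alignment Search Tool) performs local alignments and reports the best-scoring matches. A local alignment can still cover the whole query when the entire query matches.
•Re-examine the results page and find out whether the best alignment covers the entire query or only a segment of it.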
Let us re-examine the portion of this page that displays the alignment, marked with 3
There are three rows.
The numbers in the left column specify the starting position. The numbers on the right specify the ending position.
The first row is the partial sequence you typed, named Query. The third row is the sequence it is being matched against; in this case, P46098.
The second row is the result of the alignment between the top and bottom sequences. The match is exact at every position.
Types of Alignment-based Methods
•Pair-wise alignment. Two sequences are aligned together
•Multiple sequence alignment. Three or more sequences are aligned together
Pairwise Alignment
Illustrated with BLAST and an 18S ribosomal RNA sequence
Pair-wise Alignment
1. Collect the two sequences
2. Align the sequences
3. Count the mutations in the alignment
4. Score the alignments
>seq2|LemnaMinor_18S_rRNA
CTCCTACCGATTGAATGGTCCGGTGAAGCGCTCGGATCGCGGCGACGAGGGCGGTCCCCCGCCCGCGACGTCGCGAGAAGTCCGTTGAACCTTATCATTTAGAGGAAGGAG
The first sequence is displayed above.
To get the second sequence and perform the alignment, we simply use BLAST.
Go to the BLAST page at NCBI
blast.ncbi.nlm.nih.gov
Then click on nucleotide blast
Pair-wise Alignment
This is the nucleotide blast page at NCBI
Paste the sequence in the box
Select a database from the drop-down list; in this case, choose Nucleotide collection
Scroll to the bottom of the page and click on the BLAST button
Pair-wise Alignment
This is the results page of the BLAST search.
The top hit is our original sequence.
It is listed in the table along with some statistics.
Let’s look under the hood to understand what happened and how the statistics were calculated.
If you scroll down the same results page, you will see the results of all the pairwise alignments that BLAST included in the report.
They will be sorted from best alignment (first one in the report) to worst alignment (last one in the report).
This is the first one, therefore it is the best match.
Steps 3 and 4 are performed after the alignment, in order to assess how good a match it is.
First, we need to count mismatches in the alignment.
Counting Mismatches (mutations)
Cell (T, T) = number of unchanged T residues = 1
Cell (T, G) = number of substitutions from T to G
Cell (T, C) = number of substitutions from T to C
Cell (T, A) = number of substitutions from T to A
Cell (T, -) = number of deletions of T
...
Cell (-, T) = number of insertions of T
Cell (-, G) = number of insertions of G
Cell (-, C) = number of insertions of C
Cell (-, A) = number of insertions of A = 0
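Filling in such a table amounts to tallying the column pairs of the alignment. Here is a minimal sketch, assuming the two aligned rows have equal length and use '-' for gaps. The example rows are made up for illustration and are not the BLAST alignment from the slide.

```python
from collections import Counter


def count_pairs(top, bottom):
    """Tally every (top residue, bottom residue) column of an alignment.

    Reading from top to bottom: (X, X) is an unchanged residue,
    (X, Y) a substitution, (X, '-') a deletion, ('-', Y) an insertion.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return Counter(zip(top, bottom))


top = "ACGT-TA"      # hypothetical aligned rows
bottom = "ACCTGT-"
counts = count_pairs(top, bottom)

print(counts[("T", "T")])  # 2 unchanged T residues
print(counts[("G", "C")])  # 1 substitution from G to C
print(counts[("-", "G")])  # 1 insertion of G
print(counts[("A", "-")])  # 1 deletion of A
print(counts[("-", "A")])  # 0 insertions of A
```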
Not all mismatches are created equal. Some substitutions are more likely than others; therefore, we must use weight values, such as those in substitution matrices.
Scoring the alignments
Note that the result is a single value, a score. It is obtained by multiplying the alignment matrix and the substitution matrix element by element and adding up the values of the resulting matrix.
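The scoring step can be sketched as a weighted sum over the count table. The weights below are hypothetical values chosen only for illustration (they are not taken from any published substitution matrix, nor from BLAST): matches score +1, substitutions within the purines (A, G) or within the pyrimidines (C, T) score -1, other substitutions score -2, and any column with a gap scores -3.

```python
from collections import Counter

PURINES = {"A", "G"}


def weight(x, y):
    """Hypothetical substitution weight for one alignment column."""
    if x == "-" or y == "-":
        return -3                     # insertion or deletion
    if x == y:
        return 1                      # unchanged residue
    same_class = (x in PURINES) == (y in PURINES)
    return -1 if same_class else -2   # milder penalty within a class


def score_alignment(top, bottom):
    """Sum of (count x weight) over all cells of the count table."""
    counts = Counter(zip(top, bottom))
    return sum(n * weight(x, y) for (x, y), n in counts.items())


print(score_alignment("ACGT-TA", "ACCTGT-"))  # -4
print(score_alignment("AG", "AA"))            # 0
```

Each cell of the count table is multiplied by the matching weight and the products are added up, which yields the single score described above.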
So, now you have a clearer idea of what goes on under the hood of pairwise-alignment tools like BLAST.
Exercise 2: Using BLAST to transfer annotation
Sometimes we have a gene (or protein) for which an annotation (the description line in fasta format) is unknown; for example, when a new genome is being sequenced.
The general ‘in-silico’ procedure for assigning an annotation to that newly sequenced gene (or protein) calls for using BLAST to find a similar gene (or protein) for which the annotation is known.
If the match is close enough, we can then transfer the annotation from the known gene (or protein) to the new one.
Exercise 2: Using BLAST to transfer annotation
• Open a web browser and go to the UniProt URL: www.uniprot.org
1. Click on the Blast tab
2. In the box, type the identifier: A7JKN7_FRANO
3. Then click on the BLAST button
Exercise 2: Using BLAST to transfer annotation
Notice how the UniProt-Blast program fetches the corresponding sequence before launching the BLAST search.
Also notice that the annotation (description line) is unknown.
Exercise 2: Using BLAST to transfer annotation
This is the BLAST result page.
The first and second hits do not have annotations either.
The third hit is annotated as Neurotransmitter-gated ion-channel. So, at first blush, we could transfer that annotation to the protein A7JKN7_FRANO.
Exercise 3: GLOBAL Pairwise alignment program
• Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/
• Choose Programs/ Alignment /pairwise/global/needle from the Programs box (left)
• Copy-paste any two sequences from the file woese.seqs.fasta
• Select the parameters: gap penalty=5, gap extension=0.2
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results
• Q: How many gaps were inserted?
• Q: What is the score of the alignment?
• Q: What is the percent identity of the alignment?
Exercise 4: LOCAL Pairwise alignment program
• Open a web browser and go to the MOBYLE portal: mobyle.pasteur.fr/
• Choose Programs/ Alignment /pairwise/local/water from the Programs box (left)
• Copy-paste THE SAME two sequences from the example we just finished
• Select the parameters: gap penalty=5, gap extension=0.2
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results
• Q: How many gaps were inserted?
• Q: What is the score of the alignment?
• Q: Is the local alignment identical to the global alignment of the previous exercise? Explain.
Multiple Sequence Alignment
Multiple Sequence Alignment (MSA)
•Multiple sequence alignment methods try to align a group of three or more related sequences at once.
•MSAs are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related.
•MSAs are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees.
•MSA algorithms are more complex (in both time and space) than pairwise alignment algorithms.
Multiple Sequence Alignment (MSA)
•A multiple alignment arranges a set of sequences in a scheme where positions believed to be homologous are written in a common column.
•As in a pairwise alignment, when a sequence does not possess an amino acid in a particular position, this is denoted by a dash (indel, gap).
•The scoring function calculates ‘the similarity’ of a sequence in relationship to the entire group.
Conservation in a MSA
•In addition to the alignment itself, a line is added at the end with information about the degree of conservation for each position (i.e. column):
No symbol: there is no conservation in the column
* exact match of the residue for all sequences
: high degree of conservation; the mutations were for residues of similar biochemical properties, with letters of the same color
. there is conservation among the majority of the sequences; however, there were mutations for residues of a different group
Exercise 4: Using a MSA with the Eight Species Solar System
• Open a web browser and go to MOBYLE portal: mobyle.pasteur.fr/
• Choose Alignment/multiple/clustalw-multialign from the Programs box (left)
• Select Upload to copy the sequences from the file 8species.seqs.fasta on your computer to the portal
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results: alignment, tree, output
ClustalW Results: aln file
• The first segment of the ClustalW results page shows the alignment itself as shown below
• To see additional information about conservation, please click on ‘view with jalview’
ClustalW Results: aln file and consensus
The consensus sequence refers to the most common residue (nucleotide or amino acid) at a particular position after an MSA has been calculated.
The consensus sequence for the eight imaginary species is:
A – T – A G A G
The most conserved positions are:
T in the third position (6/8)
A in the fifth position (6/8)
Hence the height of the bars in the histogram denotes the frequency of the most common residue at that location.
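A consensus of this kind can be derived by taking the most frequent residue in each column. The sketch below assumes the alignment is given as equal-length strings. The four rows are made up for illustration (they are not the eight-species data), and how Jalview treats ties or gap-rich columns is not modeled here.

```python
from collections import Counter


def consensus(rows):
    """Return a list of (residue, count) pairs, one per alignment column,
    giving the most common residue and how many rows carry it."""
    result = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count))
    return result


rows = ["ATGCA",   # hypothetical aligned sequences
        "ATGCA",
        "ACGCA",
        "TTGAA"]

summary = consensus(rows)
print("".join(residue for residue, _ in summary))  # ATGCA
print([count for _, count in summary])             # [3, 3, 4, 3, 4]
```

The per-column counts play the role of the bar heights in the histogram: a count equal to the number of rows means the column is fully conserved.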
Clustalw Results: tree file and output file
•We will skip the tree for now.
•Let us examine the output file. Click here
ClustalW Results: Output file
CLUSTAL 2.0.12 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: 1 7 bp
Sequence 2: 2 7 bp
Sequence 3: 3 7 bp
Sequence 4: 4 7 bp
Sequence 5: 5 7 bp
Sequence 6: 6 7 bp
Sequence 7: 7 7 bp
Sequence 8: 8 7 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 85
Sequences (1:3) Aligned. Score: 42
Sequences (1:4) Aligned. Score: 28
Sequences (1:5) Aligned. Score: 57
Sequences (1:6) Aligned. Score: 71
Sequences (1:7) Aligned. Score: 28
Sequences (1:8) Aligned. Score: 42
Sequences (2:3) Aligned. Score: 42
Sequences (2:4) Aligned. Score: 28
Sequences (2:5) Aligned. Score: 42
Sequences (2:6) Aligned. Score: 42
Sequences (2:7) Aligned. Score: 28
Sequences (2:8) Aligned. Score: 28
Sequences (3:4) Aligned. Score: 85
Sequences (3:5) Aligned. Score: 42
Sequences (3:6) Aligned. Score: 42
Sequences (3:7) Aligned. Score: 57
Sequences (3:8) Aligned. Score: 57
Sequences (4:5) Aligned. Score: 57
Sequences (4:6) Aligned. Score: 57
Sequences (4:7) Aligned. Score: 71
Sequences (4:8) Aligned. Score: 71
Sequences (5:6) Aligned. Score: 85
Sequences (5:7) Aligned. Score: 57
Sequences (5:8) Aligned. Score: 57
Sequences (6:7) Aligned. Score: 71
Sequences (6:8) Aligned. Score: 57
Sequences (7:8) Aligned. Score: 85
Guide tree file created: [8planets.dnd]
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:114
Group 2: Sequences: 2 Score:114
Group 3: Sequences: 4 Score:69
Group 4: Sequences: 2 Score:114
Group 5: Sequences: 2 Score:123
Group 6: Sequences: 4 Score:83
Group 7: Sequences: 8 Score:59
Alignment Score 209
CLUSTAL-Alignment file created [8planets.aln]
[Slide callouts mark three parts of this output: input, pair-wise alignment scores, clustering]
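The pairwise scores in this output already hint at how the sequences group together. As a rough illustration, the sketch below looks up the highest-scoring partner of each sequence. This is a deliberate simplification and not the guide-tree procedure ClustalW itself uses; the scores are transcribed from the output above, and the function name is illustrative.

```python
# Pairwise scores transcribed from the ClustalW output above.
SCORES = {
    (1, 2): 85, (1, 3): 42, (1, 4): 28, (1, 5): 57,
    (1, 6): 71, (1, 7): 28, (1, 8): 42,
    (2, 3): 42, (2, 4): 28, (2, 5): 42, (2, 6): 42,
    (2, 7): 28, (2, 8): 28,
    (3, 4): 85, (3, 5): 42, (3, 6): 42, (3, 7): 57, (3, 8): 57,
    (4, 5): 57, (4, 6): 57, (4, 7): 71, (4, 8): 71,
    (5, 6): 85, (5, 7): 57, (5, 8): 57,
    (6, 7): 71, (6, 8): 57,
    (7, 8): 85,
}


def best_partner(seq_id, scores):
    """Return (partner, score) for the highest-scoring pair with seq_id."""
    candidates = []
    for (a, b), score in scores.items():
        if a == seq_id:
            candidates.append((score, b))
        elif b == seq_id:
            candidates.append((score, a))
    score, partner = max(candidates)
    return partner, score


for seq_id in range(1, 9):
    partner, score = best_partner(seq_id, SCORES)
    print(f"Sequence {seq_id}: closest is sequence {partner} (score {score})")
```

Every sequence turns out to have a mutual best partner with score 85, giving the pairs (1, 2), (3, 4), (5, 6), and (7, 8). That appears consistent with the four two-sequence groups listed in the multiple-alignment stage, although the output does not say which sequences belong to which group.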
Exercise 5: Using a MSA program to re-discover the Three Kingdoms of C. Woese
• Open a web browser and go to MOBYLE portal: mobyle.pasteur.fr/
• Choose Alignment/multiple/clustalw-multialign from the Programs box (left)
• Select Upload to copy the sequences from the file woese.seqs.fasta on your computer to the portal
• Click on Run
• A job will be created to run this program with your data
• Once the job is done we can view the results: alignment, tree, output
ClustalW Output Results
The table was constructed from the file shown in the window “Standard Output”
Q: Cluster these results. Do they fall into three groups?
Sequence 1: Methanosarcina_barkeri 1262 bp
Sequence 2: Methanothermobacter_thermau 1494 bp
Sequence 3: Methanobrevibacter_ruminant 1260 bp
Sequence 4: Methanococcus_maripaludis_C6 1465 bp
Sequence 5: Lemna_minor_chloroplast 1487 bp
Sequence 6: Aphanocapsa_sp._HBC6 1441 bp
Sequence 7: Corynebacterium_diphtheriae 712 bp
Sequence 8: Bacillus_firmus_strain_QJGY2 746 bp
Sequence 9: Chloribium_vibrioforme__Pros 1243 bp
Sequence 10: Escherichia_coli_HS 1542 bp
Sequence 11: Mus_musculus_L_cell 918 bp
Sequence 12: Lemna_minor_18S_rRNA 111 bp
Sequence 13: Saccharomyces_cerevisiae_str 1730 bp
Types of Similarity-Based Methods
•Alignment-free Methods:
o Based on k-word frequency
o Based on structural alignment
o Based on hidden Markov models
o Others
•Based on Sequence alignment
K-word
In bioinformatics, a k-word (or k-tuple) is a sequence of length k.
A sequence of length n has n – k + 1 k-words.
Example query string L: TGATGATGAAGACATCAG
For k = 8, the set of k-tuples of L is
TGATGATG
GATGATGA
ATGATGAA
TGATGAAG
…
GACATCAG
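Enumerating k-words is a sliding-window operation. A minimal sketch follows, using the query string from the slide; the function name is illustrative.

```python
def kwords(seq, k):
    """Return all overlapping k-words of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


L = "TGATGATGAAGACATCAG"
words = kwords(L, 8)

print(len(words))   # 11, which is n - k + 1 = 18 - 8 + 1
print(words[0])     # TGATGATG
print(words[-1])    # GACATCAG
```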
K-word Lists
Consider the k-words when k=2 and L=GCATCGGC:
GC, CA, AT, TC, CG, GG, GC
AT: 3 → means the k-word AT in sequence L starts at position 3
CA: 2
CG: 5
GC: 1, 7
GG: 6
TC: 4
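Such a list is straightforward to build with a dictionary that maps each k-word to its starting positions. The sketch below uses 1-based positions to match the slide; the function name is illustrative.

```python
from collections import defaultdict


def kword_positions(seq, k):
    """Map each k-word of seq to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i + 1)
    return dict(sorted(index.items()))   # alphabetical, as on the slide


for word, positions in kword_positions("GCATCGGC", 2).items():
    print(f"{word}: {', '.join(map(str, positions))}")
```

For L=GCATCGGC this reproduces the list above, including the two occurrences of GC at positions 1 and 7.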
K-word Frequency Methods
Goal: Find common k-words in a group of sequences that have statistical significance.
Let us illustrate this statement with an example in natural language.
An English scholar tries to determine whether a newly found manuscript was written by Shakespeare. He compares one page from the new manuscript against a page from one of Shakespeare’s works.
The top k-word of length 4 in common between the two pages is THOU.
Is this k-word statistically meaningful?
![Page 52: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/52.jpg)
K-word Frequency Methods
• Based on Euclidean Distance
• Based on Weighted Euclidean Distance
• Based on Correlation
• Based on Covariance
• Based on Information Content
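As an illustration of the first entry in this list, here is a minimal sketch of a Euclidean distance between the k-word frequency vectors of two sequences. The slides do not give a formula, so the choices here are assumptions: relative frequencies are used rather than raw counts, and the helper names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kword_frequencies(seq, k):
    """Relative frequency of each k-word in seq (assumption: counts / total)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def euclidean_distance(seq_a, seq_b, k):
    """Euclidean distance between the k-word frequency vectors of two sequences."""
    fa = kword_frequencies(seq_a, k)
    fb = kword_frequencies(seq_b, k)
    words = set(fa) | set(fb)  # a k-word missing from one sequence counts as 0
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


print(euclidean_distance("GCATCGGC", "GCATCGGC", 2))  # identical sequences
print(euclidean_distance("AAAA", "CCCC", 2))          # no k-words in common
```

Identical sequences are at distance 0, and the distance grows as the k-word profiles diverge. No alignment is computed at any point, which is what makes the method alignment-free.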
![Page 53: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/53.jpg)
![Page 54: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/54.jpg)
Algorithm using K-word Frequency to determine sequence similarity
• Collect sequences
• Calculate meaningful k-words
• Identify k-words in sequences
• Catalog k-words
• Score significance
• Cluster sequences into similar groups
![Page 55: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/55.jpg)
The same algorithm in graphical form
1. Collect seqs
2. Calc k-words
3. Search k-words
4. Catalog k-words
5. Score
6. Cluster
![Page 56: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/56.jpg)
Step 2: Alternatives for K-words
• Example 1: Use INTERPRO, a database of already calculated k-words, called protein functional domains. Results may overlap.
• Example 2: Use tools such as MEME to calculate k-words de novo from a training set that the user specifies. Results do not overlap.
![Page 57: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/57.jpg)
Exercise 1: Using INTERPRO
In the following exercise, we will visit INTERPRO, a web resource whose entire database is made up of k-words called protein functional domains.
A functional domain is a segment of the protein that generally has a very specific function and/or structure.
INTERPRO’s search engine takes a single protein sequence as input and tries to match it against its catalog of functional domains. The k-words may overlap.
![Page 58: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/58.jpg)
Exercise 1: Using INTERPRO
1. Open a web browser and retrieve the sequence of the protein that we found in Scenario 3 of the previous section by pasting this link into the browser: http://www.uniprot.org/uniprot/P46098.fasta
2. Copy this sequence to the clipboard (Ctrl-A, Ctrl-C).
3. Open another tab in the browser and type this link to go to the INTERPRO page: http://www.ebi.ac.uk/Tools/pfa/iprscan/
4. Paste the sequence from the clipboard into the query sequence box provided on this page (Ctrl-V).
5. Scroll down to the bottom of the page, leaving all parameters at their default values, and click on the Submit button.
6. Examine the results page; it should look similar to the figure on the next page.
7. How many k-words were found in this protein?
8. What score did each k-word receive?
9. Are there any k-words in the specific segment of the sequence that the paper discussed and that we used in Scenario 3 to look for the entire sequence?
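The link in step 1 returns the protein in FASTA format: a header line starting with ">" followed by one or more lines of sequence. For readers who would rather handle such a file in a script than through the clipboard, the following is a minimal sketch of a FASTA reader. The function name and the toy record are hypothetical; the toy sequence is not the actual P46098 sequence.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record for illustration only (hypothetical, not a real protein).
example = """>toy_protein hypothetical record
MKTLL
GAVC
"""
records = parse_fasta(example)
print(records)
```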
![Page 59: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/59.jpg)
Figure: INTERPRO results page, annotated with k-word identifiers, k-word alias(es), and k-word location and length.
Some k-words DO overlap.
![Page 60: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/60.jpg)
Exercise 1: Using INTERPRO
The INTERPRO k-words found were:
IPR006029
IPR006201
IPR006202
IPR008132
IPR008133
IPR018000
Now, we can search for proteins with those k-words.
![Page 61: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/61.jpg)
Exercise 1: Using INTERPRO
Go to the uniprot.org page and type in the query box:
(IPR006029 AND IPR006201 AND IPR006202 AND IPR008132 AND IPR008133 AND IPR018000)
Then click on the Search button.
It should look like the figure here, with 19 hits.
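Typing six identifiers by hand is error-prone, so the query string can also be assembled from the list of k-words. The sketch below only builds the text to paste into the query box and does not contact uniprot.org; `build_all_of_query` is a hypothetical helper name.

```python
def build_all_of_query(identifiers):
    """Join identifiers with AND so that a hit must contain every one of them."""
    return "(" + " AND ".join(identifiers) + ")"


interpro_ids = [
    "IPR006029", "IPR006201", "IPR006202",
    "IPR008132", "IPR008133", "IPR018000",
]
query = build_all_of_query(interpro_ids)
print(query)
```

Because the identifiers are joined with AND, only proteins that carry all six k-words are returned.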
![Page 62: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/62.jpg)
Exercise 1: Using INTERPRO
Uniprot DOES NOT calculate a score for us.
However, we can go ahead and cluster the 19 hits into a single group, since they all have the same k-words and are similarly annotated as serotonin receptors.
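The clustering rule used here, which groups proteins that share exactly the same set of k-words, can be sketched as follows. The catalog is made-up illustration data with hypothetical protein names, not the 19 hits from the exercise, and the function name is also hypothetical.

```python
from collections import defaultdict


def cluster_by_kword_set(catalog):
    """Group proteins whose sets of k-word identifiers are identical."""
    clusters = defaultdict(list)
    for protein, kword_ids in catalog.items():
        # frozenset ignores order, so the same k-words always give the same key
        clusters[frozenset(kword_ids)].append(protein)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical catalog: protein name -> k-words found in it.
catalog = {
    "protein_A": ["IPR006029", "IPR006201"],
    "protein_B": ["IPR006201", "IPR006029"],
    "protein_C": ["IPR018000"],
}
clusters = cluster_by_kword_set(catalog)
print(clusters)
```

This is a deliberately strict criterion. Proteins that differ by a single k-word end up in different groups, which is why a score is normally needed before clustering larger result sets.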
![Page 63: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/63.jpg)
Exercise 2: Using MEME
In the following exercise, we will use MEME to discover motifs de novo in a set of proteins.
MEME will start with a training set that you provide and will identify meaningful k-words called MEME motifs.
The training set is a group of scorpion neurotoxin sequences. From prior knowledge of this group of sequences, we know which residues are conserved and play a key role.
Therefore the exercise will focus on adjusting the parameters of the MEME tool so that the resulting motifs will include all those key residues.
![Page 64: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/64.jpg)
Exercise 2: Using MEME
Key residues are marked in this figure: the cysteines that form the disulfide bridges; the residues R, G, K in positions 28, 29, 30; and Y in position 39.
![Page 65: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/65.jpg)
Exercise 2: Using MEME
1. Open a web browser and go to the MEME web server at http://meme.nbcr.net/
2. Scroll down to the programs and click on the MEME icon; it will take you to the MEME Data Submission Form
![Page 66: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/66.jpg)
Exercise 2: Using MEME
![Page 67: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/67.jpg)
Exercise 2: Using MEME
3. In the segment of the page marked as 1, you need to include the training set. It is the file called toxins.fasta
4. In the segment of the page marked as 2, you need to specify occurrences, or repetitions, of the k-word PER sequence. Do not change the default value.
5. In the segment of the page marked as 3, you need to specify the width of the k-word. For a fixed width you type the same value in Minimum and Maximum. For variable width you type the limits of the window. Try these values: 5-10
6. In the segment of the page marked as 4, you need to specify the maximum number of motifs. Type 10
7. Leave the other parameters unchanged. Scroll to the bottom of the page and click on Start search.
8. Results will be emailed to you.
![Page 68: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/68.jpg)
Exercise 2: MEME 5-10-10
Let us examine the k-words found by MEME.
They are ordered by statistical significance; hence, motif-1 is the most significantly conserved segment in the training set and motif-n is the least significantly conserved segment.
This is the WEBLOGO representation of motif-1, which consists of 10 residues. Within each position, the height of a residue is proportional to its occurrence frequency in the training set.
In position 1, a K is completely conserved.
In position 2, a C is completely conserved.
In position 3, an M is completely conserved.
In position 4, it could be N or G; however, N is more likely than G.
Etc.
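The per-position reading of the logo can be mimicked by counting residues column by column across the motif occurrences. The sketch below uses four made-up 4-residue instances, chosen only to echo the pattern described above; they are not the MEME output, and the function name is hypothetical.

```python
from collections import Counter


def position_frequencies(instances):
    """For each motif position, return {residue: frequency} across all instances."""
    if not instances:
        return []
    width = len(instances[0])
    if any(len(s) != width for s in instances):
        raise ValueError("all motif instances must have the same width")
    columns = []
    for pos in range(width):
        counts = Counter(s[pos] for s in instances)
        columns.append({res: n / len(instances) for res, n in counts.items()})
    return columns


# Hypothetical motif occurrences (illustration only).
instances = ["KCMN", "KCMN", "KCMG", "KCMN"]
freqs = position_frequencies(instances)
for pos, column in enumerate(freqs, start=1):
    print(pos, column)
```

A position with a single residue at frequency 1.0 corresponds to "completely conserved"; a position split between N and G corresponds to two stacked letters of different heights.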
![Page 69: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/69.jpg)
Exercise 2: MEME 5-10-10
Now, let us scroll to the bottom of the page to see the diagram of motifs.
For each sequence we observe k-words and gaps. The fewer the gaps, the better the coverage. So, there is good coverage in this figure.
From prior knowledge of this group, we know it to have a set of relatively conserved motifs.
The more sequences with the same diagram of motifs, the more similar the sequences are.
The trouble with this diagram is that it gives the impression that the sequences are not that similar to each other.
![Page 70: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/70.jpg)
Exercise 2: Using MEME
Repeat the same steps with these other parameter values:
For width: minimum 5, maximum 20
For number of motifs: 5
Examine the resulting motifs and decide which run gives the better result.
![Page 71: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/71.jpg)
Exercise 2: MEME 5-20-5
This is the WEBLOGO on motif-1. Compared to motif-1 in the previous run, we can see that:
• This one has twice as many residues.
• This one includes more of the key residues (all the tall Cs and the G).
• There is more variability in other parts of the motif (more letters per column).
![Page 72: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/72.jpg)
Exercise 2: MEME 5-20-5
Now, let us scroll to the bottom of the page to see the diagram of motifs.
This diagram looks much better than in the previous run:
• There are few gaps in the diagram for this run too.
• The diagram of motifs looks similar for many more sequences in this run than in the previous run.
• Conservation or similarity among sequences by virtue of having almost identical motif diagrams can be appreciated better in this run than in the previous one.
![Page 73: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/73.jpg)
Exercise 2: MEME 5-20-5
So, we have FIVE k-words found by MEME.
Now, we need to search for similar sequences.
The email you received from MEME specifies a number of links to the results. Click on the link
MEME output as html
![Page 74: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/74.jpg)
Exercise 2: MEME 5-20-5
On the MEME results page, scroll down until you see the section called Further Analysis.
Click on the MAST button.
MAST is the search engine that looks for sequences that match any of the k-words discovered with MEME.
![Page 75: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/75.jpg)
K-words from MEME will be used as input here.
Select the database to search.
Then click on the Search button.
![Page 76: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/76.jpg)
Exercise 2: MAST results
Among the 135 hits, those that have a motif diagram similar to any diagram in the training set would be the sequences most similar to the neurotoxins.
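One simple way to make "similar motif diagram" concrete is to compare the order in which motifs occur along each sequence. The sketch below rests on strong simplifying assumptions: a diagram is reduced to its ordered motif numbers, spacer lengths and e-values are ignored, and only exact matches of motif order are accepted. All names and data are hypothetical and are not taken from the MAST output.

```python
def motif_order(diagram):
    """Reduce a diagram [(motif_id, start), ...] to motif ids ordered by start."""
    return tuple(m for m, _ in sorted(diagram, key=lambda item: item[1]))


def hits_matching_training(hits, training):
    """Names of hits whose motif order equals that of some training sequence."""
    known = {motif_order(d) for d in training.values()}
    return sorted(name for name, d in hits.items() if motif_order(d) in known)


# Hypothetical diagrams: sequence name -> list of (motif number, start position).
training = {"toxin_1": [(2, 1), (1, 20), (3, 45)]}
hits = {
    "hit_1": [(1, 22), (2, 3), (3, 50)],  # same motif order as toxin_1
    "hit_2": [(3, 1), (1, 30)],           # different motifs and order
}
print(hits_matching_training(hits, training))
```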
![Page 77: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/77.jpg)
Exercise 2: MAST results
There exist formulas that use the motif diagram and e-value to perform scoring and clustering of the results.
![Page 78: Sequence Comparison Methods - Calvin Collegerpruim/talks/SC11/2011-06/SC11-Calvin-Rendon... · Sequence Similarity Methods - caveats • Assumption1: genes of closely related species](https://reader030.fdocuments.us/reader030/viewer/2022021505/5adcdbdf7f8b9a213e8c1c48/html5/thumbnails/78.jpg)
Additional Readings
• Online lecture notes on Bioinformatics: Lectures.molgen.mpg.de/online_lectures.html
• Vinga S, Almeida J. Alignment-free sequence comparison--a review. Bioinformatics 2003; 19:513-23.
• Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 1970; 48:443-53.
• Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of Molecular Biology 1981; 147:195-7.
• Gotoh O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 1982; 162:705-8.
• Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res 1994; 22:4673 - 80.
• Mulder N, Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 2007; 396:59-70