Markov models and applications Sushmita Roy BMI/CS 576 [email protected] Oct 7 th, 2014.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 [email protected] Sep 16 th,...
-
Upload
jocelyn-atkinson -
Category
Documents
-
view
216 -
download
0
Transcript of Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 [email protected] Sep 16 th,...
![Page 1: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/1.jpg)
Practical algorithms in Sequence Alignment
Sushmita RoyBMI/CS 576
www.biostat.wisc.edu/bmi576/[email protected]
Sep 16th, 2014
![Page 2: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/2.jpg)
Key concepts
• Assessing the significance of an alignment– Extreme value distribution gives an analytical form to
compute the significance of a score
• Heuristic algorithms– BLAST algorithm
![Page 3: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/3.jpg)
Readings from book
• Chapter 2– Section 2.5– Section 2.7
![Page 4: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/4.jpg)
RECAP: Four issues in sequence alignment
• Type• Algorithm• Score• Significance
![Page 5: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/5.jpg)
Classical approach to assessing the significance of score
• Develop a “background” distribution of alignment score
• Assess the probability of observing our score S from this background distribution
![Page 6: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/6.jpg)
Thought experiment for generating the background distribution
• Suppose we assume• Sequence lengths m and n• A particular substitution matrix and amino-acid
frequencies
• And we consider generating random sequences of lengths m and n and finding the best alignment of these sequences
• This will give us a distribution over alignment scores for random pairs of sequences
![Page 7: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/7.jpg)
The extreme value distribution• Because we are picking the best alignments, we
want to know the distribution of max scores for alignments against a random set of sequences looks like
• The background distribution is given by an extreme value distribution (EVD)
• We need to assess the probability of observing a score of S or higher by random chance, that is, we need the form of the CDF: P(x>S)
![Page 8: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/8.jpg)
Assessing significance of sequence score alignments
• It can be shown that the mean of optimal scores is
– K, λ estimated from the substitution matrix
• Probability of observing a score greater than S
• Substituting U into the equation
![Page 9: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/9.jpg)
Need to speed up sequence alignment
• Consider the task of searching the RefSeq collection of sequences against a query sequence:– most recent release of DB contains 32,504,738 proteins– Entails 33,000,000*(300*300) matrix operations
(assuming query sequence is of length 300 and avg. sequence length is 300)
• O(mn) too slow for large databases with high query traffic• We will look at a heuristic algorithm to speed up the search
process
![Page 10: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/10.jpg)
Heuristic alignment
• Heuristic algorithm: a problem-solving method which isn’t guaranteed to find the optimal solution, but which is efficient and finds good solutions
• Heuristic methods do fast approximation to dynamic programming– FASTA [Pearson & Lipman, 1988]– BLAST [Altschul et al., 1990; Altschul et al., Nucleic Acids
Research 1997]
![Page 11: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/11.jpg)
BLAST: Basic Local Alignment Search Tool
• Altshul et al 1990– Cited >48,000 times!
• Key heuristics in BLAST– A good alignment is made up short stretches of matches:
seeds– Extend seeds to make longer alignments
• Key tradeoff made: sensitivity vs. speed• Used EVD theory for random sequence score• Works for both protein sequence and DNA sequence– Only scores differ
![Page 12: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/12.jpg)
BLAST continued
• Two parameters control how BLAST searches the database– w: This specifies the length of words to seed the alignment– T: The smallest threshold of word pair match to be
considered in the alignment
![Page 13: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/13.jpg)
Key steps of the BLAST algorithm
• For each query sequence1. Compile a list of high-scoring words of score at least T
• First generate words in the query sequence• Then find words that match query sequence words with score at
least T• Thus allows for inexact matches
2. Scan the database for hits of these words• Relies on indexing performed as pre-processing
3. Extend hits
![Page 14: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/14.jpg)
Determining query words
Given:query sequence: QLNFSAGWword length w = 2 (default for protein usually w = 3)word score threshold T = 9
Step 1: determine all words of length w in query sequence
QL LN NF FS SA AG GW
![Page 15: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/15.jpg)
Determining query words
Step 2: Determine all words that score at least T when compared to a word in the query sequence
QL QL=9LN LN=10NF NF=12, NY=9…SA none...
words fromquery sequence words with T≥9
Additional words potentially in database
Aminoacid substitution matrix
![Page 16: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/16.jpg)
Scanning the database
• Search database for all occurrences of query words• Approach:– index database sequences into table of words
(pre-compute this)– index query words into table (at query time)
NPNSNTNWNY
QLNFSAGW
MFNYT, STNYD…
NPGAT, TSQRPNP…
query sequence
database sequences
![Page 17: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/17.jpg)
• Extending a word hit to a larger alignment is straightforward
• Terminate extension when the score of the current alignment falls a certain distance below the best score found for shorter extensions
Extending a hit
Query sequence: Q L N F S ADB sequence: R L N Y S WScore: 1 4 5 3 4 -3Total: 1+4+5+3+4=17
![Page 18: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/18.jpg)
How to choose w and T?
• Tradeoff between running time and sensitivity• Sensitivity
• T– small T: greater sensitivity, more hits to expand– large T: lower sensitivity, fewer hits to expand
• w– Larger w: fewer query word seeds, lower time for
extending, but more possible words (20w for AAs)
![Page 19: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/19.jpg)
Updates to BLAST
• Two hit method– Lower the threshold but require two words to be on the
same diagonal and be no more than A characters apart
• Ability to handle gaps• Ability to handle position-specific score matrix
created from alignments generated from iteration i for iteration i+1
Altshul et al 1997
![Page 20: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/20.jpg)
Two-hit method
Figure from Altshul et al 1997
+ Hits wit T>=13 (15 hits)
.Hits with T>=11 (22 hits)
Only these two are considered as they satisfy the two-hit method
![Page 21: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/21.jpg)
Summary of BLAST
• T: Don’t consider seeds with score < T• Don’t extend hits when score falls below a specified
threshold• Pre-processing of database or query helps to improve
the running time
![Page 22: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/22.jpg)
FASTA
• Starts with exact seed matches instead of inexact matches that satisfy a threshold
• Extends seeds (similar to BLAST)• Join high scoring seeds allowing for gaps• Re-align high scoring matches using dynamic
programming
![Page 23: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/23.jpg)
Different versions of BLAST programs
Program Query Database
BLASTP Protein Protein
BLASTN DNA DNA
BLASTXTranslated
DNAProtein
TBLASTN ProteinTranslated
DNA
TBLASTXTranslated
DNATranslated
DNA
![Page 24: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/24.jpg)
Sequence databases
• Web portals/Knowledge bases– NCBI: http://www.ncbi.nih.gov– EBI: http://www.ebi.ac.uk– Sanger: http://www.sanger.ac.uk– Each of these centers link to hundreds of databases
• Nucleotide sequences– Genbank– EMBL-EBI Nucleotide Sequence Database– Comprise ~8% of the total database (Nucleic Acid Research 2006
Database edition)
• Protein sequences– UniProtKB
![Page 25: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/25.jpg)
Using BLAST
• http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Will blast a DNA sequence against NCBI nucleotide database
• We will select– http://www.ncbi.nlm.nih.gov/nuccore/NG_000007.3?from
=70545&to=72150&report=fasta
![Page 26: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/26.jpg)
Using BLAST
Choose the database
Enter the query sequence
![Page 27: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/27.jpg)
Using BLASTThe sequence corresponds to the human HBB (hemoglobin) gene.
But we will select the mouse DB
Use Megablast (large word size)
![Page 28: Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 sroy@biostat.wisc.edu Sep 16 th, 2014.](https://reader030.fdocuments.us/reader030/viewer/2022032704/56649d6e5503460f94a4f357/html5/thumbnails/28.jpg)
Interpreting results
Assesses significance of a score. Related to P-value, but gives the expected number of alignments ofthis score value or higher.