Sequence alignment - Rensselaer Polytechnic Institute
Experimental origins of sequence data
Each color is one lane of an electrophoresis gel.
The Sanger dideoxynucleotide method
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
Building an alignment starts with a scoring matrix, which in its simplest form is a dot plot.
Everything aligned to everything.
AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
An alignment is a path through the scoring matrix, always proceeding to the right and down. (No non-sequential alignments allowed.)
AAAGAGATTCTGCTAGCGGTCGG
AGAGATGCTGCAGCGAGTCGGCC
Unbroken diagonals represent “blocks” of sequence without indels.
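A minimal sketch of the dot plot idea, using the two sequences shown above. The function names dot_plot and longest_diagonal are hypothetical helpers made up for this illustration; the scoring is plain identity (1 for a match, 0 otherwise), which is the simplest possible scoring matrix.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with 1 where seq_a[i] == seq_b[j], else 0."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def longest_diagonal(matrix):
    """Return (length, start_i, start_j) of the longest unbroken diagonal."""
    rows, cols = len(matrix), len(matrix[0])
    # run[i + 1][j + 1] holds the length of the diagonal run ending at (i, j).
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = (0, 0, 0)
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j]:
                length = run[i][j] + 1
                run[i + 1][j + 1] = length
                if length > best[0]:
                    best = (length, i - length + 1, j - length + 1)
    return best


seq_a = "AAAGAGATTCTGCTAGCGGTCGG"
seq_b = "AGAGATGCTGCAGCGAGTCGGCC"
matrix = dot_plot(seq_a, seq_b)
for row_label, row in zip(seq_a, matrix):
    print(row_label, "".join("*" if cell else "." for cell in row))
length, i, j = longest_diagonal(matrix)
print("longest block:", seq_a[i:i + length], "starting at", (i, j))
```

Each printed row is one residue of the first sequence; a run of stars going down and to the right is a block without indels.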
Database searching
Why do a database search?
Mol. Bio: Determination of gene function. Primer design.
Pathology, epidemiology, ecology: Determination of species, strain, lineage, phylogeny.
Biophysics: Prediction of RNA or protein structure, effect of mutation.
Figure: one sequence is searched against GenBank, PIR, Swissprot, GenEMBL, DDBJ (lots of sequences).
Searching millions of sequences
Given a protein or DNA sequence, we want to find all of the sequences in GenBank (over 17 million sequences!!) that have a good alignment score.
Each alignment score should be the optimal score (or a close approximation).
How do we do it?
Fast Database Searching: BLAST (S. Altschul et al.)
First make a set of lookup tables for all 3-letter (protein) or 11-letter (DNA) matches.
Make another lookup table: the locations of all 3-letter words in the database.
Start with a match, extend to the left and right until the score no longer increases.
Very fast. Selective, but not as sensitive as slower search methods (SSEARCH). Reliable statistics. Heuristic, not optimal.
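The lookup-table step can be sketched roughly as follows. This is only an illustration under simple assumptions: the database is a small in-memory dictionary, and build_word_index and find_seeds are hypothetical names, not functions of any BLAST package.

```python
from collections import defaultdict

WORD_SIZE = 3  # 3 for protein, 11 for DNA, as stated above


def build_word_index(database, word_size=WORD_SIZE):
    """Map every word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for offset in range(len(seq) - word_size + 1):
            index[seq[offset:offset + word_size]].append((seq_id, offset))
    return index


def find_seeds(query, index, word_size=WORD_SIZE):
    """Yield (query offset, sequence id, target offset) for identical words."""
    for q_off in range(len(query) - word_size + 1):
        word = query[q_off:q_off + word_size]
        for seq_id, t_off in index.get(word, []):
            yield q_off, seq_id, t_off


database = {"seqA": "MKPGQLV", "seqB": "APGQ"}  # hypothetical toy database
index = build_word_index(database)
seeds = list(find_seeds("TPGQL", index))
print(seeds)
```

The index is built once for the whole database, so each query word costs a single dictionary lookup instead of a scan over every sequence.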
BLAST, precalculations
Figure: the 3-tuple PGQ compared against all 8000 possible 3-tuples (PGQ, PGR, PGS, PGT, PGV, PGW, PGY, PAQ, PCQ, PDQ, PEQ, PFQ, ...), giving 50 high-scoring 3-tuples.
Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. The top-scoring 50 are kept as that 3-tuple’s “neighborhood words”.
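A rough sketch of the neighborhood-word precalculation. To stay self-contained it does not load BLOSUM; pair_score is a hypothetical stand-in (identity +4, a handful of conservative pairs +1, everything else -2), so the actual neighbors differ from what a real BLOSUM matrix would give. Ties are broken alphabetically, which is also an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical stand-in for BLOSUM: a few conservative substitutions.
SIMILAR = {frozenset(p) for p in
           ("IL", "IV", "LV", "LM", "KR", "DE", "NQ", "ST", "FY", "QE")}


def pair_score(a, b):
    """Toy substitution score, NOT a real BLOSUM value."""
    if a == b:
        return 4
    return 1 if frozenset((a, b)) in SIMILAR else -2


def word_score(w1, w2):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))


def neighborhood(word, size=50):
    """Score the word against every possible word and keep the top `size`."""
    all_words = ("".join(t) for t in product(AMINO_ACIDS, repeat=len(word)))
    ranked = sorted(all_words, key=lambda w: (-word_score(word, w), w))
    return ranked[:size]


words = neighborhood("PGQ")
print(len(words), words[:5])
```

With 20 amino acids there are 20 x 20 x 20 = 8000 candidate words, so this exhaustive ranking is cheap enough to precompute for every 3-tuple.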
BLAST
For every 3-residue window, we get the set of 50 nearest neighbors. Use each word to get identity matches (seeds). Then extend the seed alignments as long as the score increases.
Figure labels: query sequence, a 3-tuple, neighborhood words for 3-tuple, target sequence, identity matches, seeds, HSPs.
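The seed extension step might look something like the sketch below. It follows the rule as stated here literally: keep adding residue pairs on each side while the score goes up, and stop at the first pair that does not raise it. The scoring (+2 match, -1 mismatch) and the name extend_seed are assumptions for illustration only; real implementations use a substitution matrix and a more tolerant stopping rule.

```python
def pair_score(a, b):
    """Hypothetical stand-in for a BLOSUM lookup: +2 match, -1 mismatch."""
    return 2 if a == b else -1


def extend_seed(query, target, q_off, t_off, word_size=3):
    """Grow a seed left, then right, while each added pair raises the score.

    Returns (score, query start, target start, length) with no gaps.
    """
    score = sum(pair_score(query[q_off + k], target[t_off + k])
                for k in range(word_size))
    q_start, t_start, length = q_off, t_off, word_size
    # Extend to the left.
    while q_start > 0 and t_start > 0:
        gain = pair_score(query[q_start - 1], target[t_start - 1])
        if gain <= 0:
            break
        score += gain
        q_start -= 1
        t_start -= 1
        length += 1
    # Extend to the right.
    while q_start + length < len(query) and t_start + length < len(target):
        gain = pair_score(query[q_start + length], target[t_start + length])
        if gain <= 0:
            break
        score += gain
        length += 1
    return score, q_start, t_start, length


hsp = extend_seed("TPGQLV", "MKPGQLVA", q_off=1, t_off=2)
print(hsp)
```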
BLAST
The best extended seeds are called HSPs (high-scoring segment pairs). The top-scoring HSP is picked first, then the second (as long as it falls "northwest" or "southeast" of the first), and so on.
Figure labels: HSPs, alignment.
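One way to picture the "northwest or southeast" rule is a greedy selection like the sketch below. It assumes each HSP is a tuple (score, query start, target start, length) and that a new HSP must be compatible with every HSP already picked, not only the first; that generalization and the name chain_hsps are assumptions for illustration.

```python
def chain_hsps(hsps):
    """Pick HSPs best-first, keeping only those NW or SE of all kept ones.

    Each HSP is (score, query start, target start, length).
    """
    chosen = []
    for hsp in sorted(hsps, reverse=True):  # highest score first
        _, q, t, n = hsp
        compatible = all(
            (q + n <= cq and t + n <= ct)          # entirely northwest
            or (q >= cq + cn and t >= ct + cn)     # entirely southeast
            for _, cq, ct, cn in chosen
        )
        if compatible:
            chosen.append(hsp)
    # Report the kept HSPs in query order, i.e. along the alignment path.
    return sorted(chosen, key=lambda h: h[1])


hsps = [(30, 10, 12, 8), (20, 0, 1, 6), (15, 20, 5, 5), (12, 25, 30, 4)]
print(chain_hsps(hsps))
```

In this made-up example the HSP with score 15 is dropped: it lies further along the query than the best HSP but earlier in the target, so it cannot be on the same right-and-down path.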
BLAST -- last steps
» Local Dynamic Programming (DP) alignment is applied only to the sequences that pass the FASTA score cutoff.
» DP scores are converted to e-values.
» Local alignments are output for the top hits.
» Optionally, multiple sequence alignment output ("star" alignment).
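For reference, local DP alignment can be sketched as a Smith-Waterman score calculation. The version below is a simplified illustration: it uses match/mismatch scores and a linear gap penalty (all three values are arbitrary assumptions) instead of a substitution matrix with affine gaps, and it returns only the score, not the alignment itself.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # A local alignment may restart anywhere, so never go below 0.
            curr.append(max(0, diag, up, left))
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTTACGTAAA", "GGACGTCC"))
```

Only two rows of the matrix are kept in memory, which is enough when just the score is needed.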
Protein Databases available for BLAST search
On the BLAST search page, select a database to search, then select ? to learn a little about that database.
BLAST -- Filters
You can restrict the search by Taxonomy.
You can enter an Entrez search query to restrict the search. (Test your Entrez query first.)
Learn about Entrez here: https://www.youtube.com/watch?v=t8fKz9rvuOk&feature=youtu.be
Watch this!
Other forms of BLAST
BLAST              query              database
blastn             nucleotide         nucleotide
blastp             protein            protein
tblastn            protein            translated DNA
blastx             translated DNA     protein
tblastx            translated DNA     translated DNA
psi-blast          protein, profile   protein
phi-blast          pattern            protein
transitive blast*  any                any
*not really a blast. Just a way of using blast.
Psi-BLAST: Blast with profiles
Psi-BLAST searches the database iteratively.
(Cycle 1) Normal BLAST (with gaps)
(Cycle 2) (a) Construct a profile from the results of Cycle 1.
(b) Search the database using the profile.
(Cycle 3) (a) Construct a profile from the results of Cycle 2.
(b) Search the database using the profile.
And So On... (user sets the number of cycles)
Psi-BLAST is much more sensitive than BLAST.
It is also more vulnerable to low-complexity regions.
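The "construct a profile" step can be illustrated with a very simplified position-specific scoring matrix. This sketch assumes the hits are already aligned to equal length, a uniform background frequency of 1/20, and add-one pseudocounts; none of these choices come from the slides, and build_profile and profile_score are hypothetical names.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Per-column log-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*aligned_hits):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, segment):
    """Score a segment by summing the column score of each residue."""
    return sum(column[aa] for column, aa in zip(profile, segment))


hits = ["PGQLV", "PGQIV", "PGELV", "AGQLV"]  # hypothetical cycle-1 hits
profile = build_profile(hits)
print(round(profile_score(profile, "PGQLV"), 2))
print(round(profile_score(profile, "WWWWW"), 2))
```

Because each column has its own scores, a residue that is conserved across the cycle-1 hits counts for more than the same residue would in a general-purpose matrix, which is where the extra sensitivity comes from.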
DNA or Protein search?
•Advantages of searching DNA databases
Larger database. Does not assume a reading frame. Can find similarity in non-coding regions (introns, promoter regions). Can find frameshift mutations. Can find pseudogenes.
•Disadvantages
Slower. Not as sensitive. Ignores selective pressure at the protein level.
•Advantages of searching protein sequences
Faster. More sensitive. More biologically relevant.
•Disadvantages
Not applicable to non-coding DNA (promoters, introns, etc.)
Bioinformatics
• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure
How significant is that?
Please give me a number for...
...how likely the data would not have been the result of chance,...
...as opposed to...
...a specific inference. Thanks.
Dayhoff's randomization experiment
Aligned scrambled Protein A versus scrambled Protein B
100 times (re-scrambling each time).
NOTE: scrambling does not change the AA composition!
Results: A Normal Distribution.
Significance of a score is measured as the probability of getting this score in a random alignment.
Figure: histogram of alignment scores (x-axis: score, y-axis: freq).
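The scrambling experiment can be mimicked in a few lines. Everything concrete here is an assumption for illustration: the two sequences are made up, the local alignment scorer uses toy match/mismatch/gap values rather than a real substitution matrix, and significance is summarized as a z-score under the normal-distribution assumption described above.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def scramble(seq, rng):
    """Shuffle the letters; the amino acid composition is unchanged."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def scrambled_scores(seq_a, seq_b, trials=100, seed=0):
    """Align scrambled A to scrambled B, re-scrambling both each time."""
    rng = random.Random(seed)
    return [local_score(scramble(seq_a, rng), scramble(seq_b, rng))
            for _ in range(trials)]


protein_a = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequences
protein_b = "MKLAYVAKQGQLSFIRSHFTRE"
scores = scrambled_scores(protein_a, protein_b)
mean, sd = statistics.mean(scores), statistics.stdev(scores)
real = local_score(protein_a, protein_b)
print(f"real={real} mean={mean:.1f} sd={sd:.1f} z={(real - mean) / sd:.1f}")
```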
Lipman's randomization experiment
Aligned Protein A to 100 natural sequences, not scrambled.
Results: A wider normal distribution (Std dev = ~3 times larger).
WHY? Because natural sequences are different from random ones.
Even unrelated sequences have similar local patterns, and uneven amino acid composition.
Lipman got a similar result if he randomized the sequences by words instead of letters.
Was the significance over-estimated using Dayhoff's method?
Figure: histogram of alignment scores (x-axis: score, y-axis: freq).
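Randomizing by words instead of letters could be done along these lines. The word size of 3 and the name shuffle_by_words are assumptions; the slides do not say which word length was used.

```python
import random


def shuffle_by_words(seq, word_size=3, seed=0):
    """Shuffle a sequence in chunks so short local patterns survive."""
    rng = random.Random(seed)
    words = [seq[i:i + word_size] for i in range(0, len(seq), word_size)]
    rng.shuffle(words)
    return "".join(words)


original = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
shuffled = shuffle_by_words(original)
print(original)
print(shuffled)
```

Like letter scrambling, this keeps the amino acid composition, but it also keeps every within-word pattern, so the shuffled sequences look more like natural ones.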
P(S > x)
E(M) gives us the expected length of the longest run of matches. But what we really want is the answer to this question:
How good is the score x? (i.e. how significant)
So, we need to model the whole distribution of chance scores, then ask how likely is it that my score or greater comes from that model.
Figure: distribution of chance scores (x-axis: score, y-axis: freq).
A normal distribution
Suppose you had a Gaussian distribution “dart-board”. You throw 1000 darts randomly. Score each dart according to the number on the X-axis where it lands. What is the probability distribution of scores? Answer: The same Gaussian distribution! (duh)
Extreme values from a normal distribution
What if we throw 10 darts at a time and keep only the highest-scoring dart (extreme value)? What is the distribution of the extreme values?
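The dart experiment is easy to simulate. This is a small sketch using a standard normal (mean 0, standard deviation 1) as the dart-board; the number of rounds (1000) and the random seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)

# One dart per throw: the scores follow the same Gaussian.
single = [rng.gauss(0, 1) for _ in range(1000)]

# Ten darts per round, keeping only the highest-scoring dart.
maxima = [max(rng.gauss(0, 1) for _ in range(10)) for _ in range(1000)]

print(f"single darts: mean={statistics.mean(single):.2f} "
      f"sd={statistics.stdev(single):.2f}")
print(f"best of 10:   mean={statistics.mean(maxima):.2f} "
      f"sd={statistics.stdev(maxima):.2f}")
```

The best-of-10 scores are shifted to the right and are no longer symmetric, which is why a normal distribution is a poor model for them.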
The Extreme Value Distribution
Normal distributions (Dayhoff, Lipman) overestimate significance when the scores are extreme values. EVD is the correct null model.
Fitting the EVD to random alignments
• Generate a large number of known false alignment scores S (all alignments with the same two lengths m and n).
• Plot log(P(S≥x)) versus x, fit to a line!
Estimated P (integral of the EVD): P(S≥x) ≈ Kmn e^(−λx)
Taking the log,
log(P(S≥x)) = log(Kmn) − λx
Figure: scatter plot of log P(S≥x) versus x, with a fitted straight line.
The slope is −λ, the intercept is log(Kmn). Now we can calculate P for any score x.
where K=constant, m=size of database, n=length of sequence, λ=constant
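The line fit can be sketched with ordinary least squares. The helper names (empirical_log_tail, fit_line, fit_evd, p_value) and the numbers in the demo (λ = 0.3, K = 0.1, m = 1000, n = 200) are made up for illustration; the demo uses points generated exactly from the formula, so it only checks that the fit recovers the parameters it was given. With real chance scores, the fit would normally be restricted to the high-scoring tail, where the approximation holds.

```python
import math
from bisect import bisect_left


def empirical_log_tail(scores):
    """Return (xs, log P(S >= x)) for each distinct score x."""
    ordered = sorted(scores)
    total = len(ordered)
    xs = sorted(set(ordered))
    log_ps = [math.log((total - bisect_left(ordered, x)) / total) for x in xs]
    return xs, log_ps


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_evd(xs, log_ps, m, n):
    """Slope is -lambda, intercept is log(Kmn); return (lambda, K)."""
    slope, intercept = fit_line(xs, log_ps)
    return -slope, math.exp(intercept) / (m * n)


def p_value(x, lam, k, m, n):
    """P(S >= x) ~ Kmn * exp(-lambda * x), capped at 1."""
    return min(1.0, k * m * n * math.exp(-lam * x))


# Hypothetical parameters, used only to generate demo points.
true_lam, true_k, m, n = 0.3, 0.1, 1000, 200
xs = list(range(40, 60))
log_ps = [math.log(true_k * m * n) - true_lam * x for x in xs]
lam, k = fit_evd(xs, log_ps, m, n)
print(f"lambda={lam:.3f} K={k:.3f} P(S>=70)={p_value(70, lam, k, m, n):.2e}")
```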