Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty...
-
Upload
hector-griffith -
Category
Documents
-
view
328 -
download
0
Transcript of Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty...
Chapter 11
Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely)
This lecture is designed to introduce you to the theory and practice of performing the variety of sequence similiarity searches available via the NCBI BLAST service
Searching primary databases using sequence similarity
Basic Local Alignment Search Tool
BLAST
BLAST is a computer algorithm that returns sequences in the database with the highest percentage of bases in common to the query sequence
Sequence alignments
• The first line of evidence that two sequences are, because of a shared evolutionary history, related.
• If two sequences (DNA or protein) are related by descent, they will be related by sequence.
A B CThe “evolutionary distance” between A and B is smaller, and therefore, under an assumption of brownian motion (no selection) a given stretch of DNA or protein will share more nucleotides or amino acids in common.
All search algorithms will produce results when queried!
• The trick is to be able to (i) trust and evaluate the result, and (ii) to be able to quantify this evaluation.
Reasons for performing BLAST searches• There can be many reasons, but common
ones are:
Evolutionary: discovering similar sequences in different organisms allows one to ask whether and how sequence-level changes result in functional changes. Can be done for coding or non-coding (i.e. regulatory regions) .
Human or Computational annotation: For non-model systems, for which little bench work has been done compared to a model system, sequence alignments with known, experimentally verified genes, can aid in the assignment of function
Multiple sequence alignments can help identify conserved regions of coding sequences, which might have functional significance, or to help understand evolutionary relationships among difficult to classify organisms.
Multiple sequence alignments can also help with the development of primers in order to easily clone out a cDNA from an organism for which genome sequencing has not been done.
Sequence similarity can only be ascertained by aligning two sequences
ACGGCATCCGACGCTTAGCGGACTATCGATCTGA
ACCCGGCCTACGGCTACTCGCTTAGCGGACTCGG
Some basic concepts:Sequence Similarity (Data) Homology (Inference)
Percent similarity of base pairs between any two sequences, over a given length of sequence
Similar because of common descent from an ancestor that contained that sequence.
Continuous quantity any real number (0-100%)
Categorical quantity, either two sequences are homologous or not*
*Because of a variety of genomic rearrangement phenomena, two sequences that code for a non-homologous protein per se, can contain sub-sequences that are indeed homologous. This is actually a source of false positive hits during Blast searches
Two kinds of Homology: Paralogy vs. Orthology
ACGCTACGGGCAA
Orthologues are two sequences that are related because that sequence existed in an ancestor.
Paralogues are two or more sequences that are related because a gene duplication event
Ancestral sequence
Orthologous sequences
ACGCTAGGGGCAAACGCTATGGCAA
AGCCTATGGCAA ACGCTAGGGCTTParalogous sequences
Global versus local alignment
Local Alignment Global Alignment
Best alignment of two sequences, based on regions of sub-sequence (the local part) with the highest similarity
Best alignment of two sequences along their entire length. Used with highly similar sequences.
Better at finding weak similarities and functional domains within coding sequences
Better for multiple sequence alignments
Most common in database searches
Not particularly useful for database searches.
DNA versus protein alignmentDNA proteinCan be much longer Only as long as the
longest coding sequence (<5K a.a.)
Because of the wobble effect, are less sensitive and tend to miss sequence relationships among distantly related species
Are more sensitive because amino acids tend to be conserved along functional domains, so even weak total similarities can still be detected
Dotplot: for comparing two sequences• A dotplot is a simple techniqe that yields a
graphical, but not statistical, representation of sequence similarity (see Fig 11.1)
A C G T C G T A
ACGAGGTA
Dot plots
• Good:
Easy visualization of syntenic regions of long regions of genomic sequence
Easy visualization of exon/intron boundaries
Easy visualization and enumeration of tandemly repeated sequence elements
• Bad:
No numerical evaluation of the degree of similarity – we examine this next
Can only compare two sequences at a time.
Scoring matrices
• When searching a complex database full of billions of nucleotides worth of sequences, we must not only identify related sequences, but develop the ability to score how “good” the match is, and then rank these “hits” in a list or a table.
Example here?
A scoring matrix is: an empirical weighting scheme used in all sequence comparisons
DNA Scoring matrices
A G
C T
Are all transitions and transversions equally likely? Different scoring matrices make different assumptions about this, but it should be clear that the wobble effect, numerical probability and molecular mech. matters here.
Amino Acid Scoring Matrices• A bit more complicated, because:
there are 20 possible substitutions at any particular site
Some substitutions are more constrained by function than others. In other words, we need to distinguish between absolute conservation (dark blue) and functional conservation (light blue).
Some amino acids are more rare among all proteins, or within proteins, so changes in these amino acids must be given higher weight.
Depending upon evolutionary distance, some amino acid changes are more likely than others.
The log odds ratio• A scoring matrix is a probability matrix, which is
an attempt to understand the probability of all pairwise substitutions, given how often they are actually observed to change in known sequences.
• Probabilities are calculated as being less than, or greater than observed by random chance, hence the negative numbers.
PAM amino acid scoring matrix• The Point Accepted Mutation considers only mutations at the single site
level.
Original matrices were done using sequences with more than 85% similarity, which means they are very closely related
The term acceptance refers to functional conservation of protein function, even if sequence changes.
Remember this?
Amino acids can be group according to their “chemistries.” Because some amino acids are very similar in their chemistries we should not score substitutions between them the same as between two amino acids that are very different in their chemistries
Assumptions of the PAM matrices
• Substitutions at a given site are (i) independent of previous changes at that site and are independent of changes to adjacent sites.
• One PAM unit corresponds to 1 a.a. change/ 100 a.a., or 1% divergence in sequence.
• PAM160 = 160 (total) changes/100 a.a. This can be problematic because this represents an extrapolation of probabilities calculated for closely related sequence, and so and error will simply be multiplied
BLOSUM matrices• Considered the fact that BLOcks of sequence
corresponding to secondary structure (i.e. functional domains like catalytic sites, DNA binding regions etc.) are likely to display different SUbstitution probabilities.
• And, BLOSUM matrices considered subsitution probabilities across several evolutionary distances, and so are more accurate than PAM for weaker sequence relationships.
• E.g. BLOSUM62 matrix means that sequences with no more than 62% sequence similarity were used to calculate substitution probabilities.
GAPS and penalties• Other types of mutations involve insertions and
deletions of sequence, collectively called indels
ACGATCGTCATCGATCGAACGATCTCGATCGA
These two sequences only align well across <half of their sequences.
ACGATCGTCATCGATCGAACGATC - - - - TCGATCGA
If we introduce four gaps, it is obvious that these sequences are more related than we thought.
But there must be a penalty for doing this (i.e. lowering the overall score) because you ,
Two kinds of Gap scoring methodsAffine gap penalty
G + Ln; where G is a penalty for introducing a gap, and L is the penalty for lengthening the range of the gapped region (G > Ln)
Non-affine gap penalty
Ln; No penalty for opening, and where L is a fixed penalty for every gap
So, How does BLAST work?• Words, Neighbourhoods, and High Scoring
Segment pairs, Oh My!
• The Word is the minimum length of sequence that is used to start a search, usually three amino acids (RDQYPQW).
• Neighbourhoods are similar words to the query word, (e.g. RDQ vs RBQ vs RDE). These are subject to the scoring matrix
• A High scoring segment pair is the region of sequence for which the highest scoring Word can be extended the most as matching sequence
Detection of high scoring segments.
Lexa M et al. Bioinformatics 2011;27:2510-2517
© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]