Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty...

Chapter 11

Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely)

This lecture is designed to introduce you to the theory and practice of performing the variety of sequence similiarity searches available via the NCBI BLAST service

Searching primary databases using sequence similarity

Basic Local Alignment Search Tool

BLAST

BLAST is a computer algorithm that returns sequences in the database with the highest percentage of bases in common to the query sequence

Sequence alignments

• The first line of evidence that two sequences are, because of a shared evolutionary history, related.

• If two sequences (DNA or protein) are related by descent, they will be related by sequence.

A B CThe “evolutionary distance” between A and B is smaller, and therefore, under an assumption of brownian motion (no selection) a given stretch of DNA or protein will share more nucleotides or amino acids in common.

All search algorithms will produce results when queried!

• The trick is to be able to (i) trust and evaluate the result, and (ii) to be able to quantify this evaluation.

Reasons for performing BLAST searches• There can be many reasons, but common

ones are:

Evolutionary: discovering similar sequences in different organisms allows one to ask whether and how sequence-level changes result in functional changes. Can be done for coding or non-coding (i.e. regulatory regions) .

Human or Computational annotation: For non-model systems, for which little bench work has been done compared to a model system, sequence alignments with known, experimentally verified genes, can aid in the assignment of function

Multiple sequence alignments can help identify conserved regions of coding sequences, which might have functional significance, or to help understand evolutionary relationships among difficult to classify organisms.

Multiple sequence alignments can also help with the development of primers in order to easily clone out a cDNA from an organism for which genome sequencing has not been done.

Sequence similarity can only be ascertained by aligning two sequences

ACGGCATCCGACGCTTAGCGGACTATCGATCTGA

ACCCGGCCTACGGCTACTCGCTTAGCGGACTCGG

Some basic concepts:Sequence Similarity (Data) Homology (Inference)

Percent similarity of base pairs between any two sequences, over a given length of sequence

Similar because of common descent from an ancestor that contained that sequence.

Continuous quantity any real number (0-100%)

Categorical quantity, either two sequences are homologous or not*

*Because of a variety of genomic rearrangement phenomena, two sequences that code for a non-homologous protein per se, can contain sub-sequences that are indeed homologous. This is actually a source of false positive hits during Blast searches

Two kinds of Homology: Paralogy vs. Orthology

ACGCTACGGGCAA

Orthologues are two sequences that are related because that sequence existed in an ancestor.

Paralogues are two or more sequences that are related because a gene duplication event

Ancestral sequence

Orthologous sequences

ACGCTAGGGGCAAACGCTATGGCAA

AGCCTATGGCAA ACGCTAGGGCTTParalogous sequences

Global versus local alignment

Local Alignment Global Alignment

Best alignment of two sequences, based on regions of sub-sequence (the local part) with the highest similarity

Best alignment of two sequences along their entire length. Used with highly similar sequences.

Better at finding weak similarities and functional domains within coding sequences

Better for multiple sequence alignments

Most common in database searches

Not particularly useful for database searches.

DNA versus protein alignmentDNA proteinCan be much longer Only as long as the

longest coding sequence (<5K a.a.)

Because of the wobble effect, are less sensitive and tend to miss sequence relationships among distantly related species

Are more sensitive because amino acids tend to be conserved along functional domains, so even weak total similarities can still be detected

Dotplot: for comparing two sequences• A dotplot is a simple techniqe that yields a

graphical, but not statistical, representation of sequence similarity (see Fig 11.1)

A C G T C G T A

ACGAGGTA

Dot plots

• Good:

Easy visualization of syntenic regions of long regions of genomic sequence

Easy visualization of exon/intron boundaries

Easy visualization and enumeration of tandemly repeated sequence elements

• Bad:

No numerical evaluation of the degree of similarity – we examine this next

Can only compare two sequences at a time.

Scoring matrices

• When searching a complex database full of billions of nucleotides worth of sequences, we must not only identify related sequences, but develop the ability to score how “good” the match is, and then rank these “hits” in a list or a table.

Example here?

A scoring matrix is: an empirical weighting scheme used in all sequence comparisons

DNA Scoring matrices

A G

C T

Are all transitions and transversions equally likely? Different scoring matrices make different assumptions about this, but it should be clear that the wobble effect, numerical probability and molecular mech. matters here.

Amino Acid Scoring Matrices• A bit more complicated, because:

there are 20 possible substitutions at any particular site

Some substitutions are more constrained by function than others. In other words, we need to distinguish between absolute conservation (dark blue) and functional conservation (light blue).

Some amino acids are more rare among all proteins, or within proteins, so changes in these amino acids must be given higher weight.

Depending upon evolutionary distance, some amino acid changes are more likely than others.

The log odds ratio• A scoring matrix is a probability matrix, which is

an attempt to understand the probability of all pairwise substitutions, given how often they are actually observed to change in known sequences.

• Probabilities are calculated as being less than, or greater than observed by random chance, hence the negative numbers.

PAM250 Matrix

PAM amino acid scoring matrix• The Point Accepted Mutation considers only mutations at the single site

level.

Original matrices were done using sequences with more than 85% similarity, which means they are very closely related

The term acceptance refers to functional conservation of protein function, even if sequence changes.

Remember this?

Amino acids can be group according to their “chemistries.” Because some amino acids are very similar in their chemistries we should not score substitutions between them the same as between two amino acids that are very different in their chemistries

Assumptions of the PAM matrices

• Substitutions at a given site are (i) independent of previous changes at that site and are independent of changes to adjacent sites.

• One PAM unit corresponds to 1 a.a. change/ 100 a.a., or 1% divergence in sequence.

• PAM160 = 160 (total) changes/100 a.a. This can be problematic because this represents an extrapolation of probabilities calculated for closely related sequence, and so and error will simply be multiplied

BLOSUM matrices• Considered the fact that BLOcks of sequence

corresponding to secondary structure (i.e. functional domains like catalytic sites, DNA binding regions etc.) are likely to display different SUbstitution probabilities.

• And, BLOSUM matrices considered subsitution probabilities across several evolutionary distances, and so are more accurate than PAM for weaker sequence relationships.

• E.g. BLOSUM62 matrix means that sequences with no more than 62% sequence similarity were used to calculate substitution probabilities.

GAPS and penalties• Other types of mutations involve insertions and

deletions of sequence, collectively called indels

ACGATCGTCATCGATCGAACGATCTCGATCGA

These two sequences only align well across <half of their sequences.

ACGATCGTCATCGATCGAACGATC - - - - TCGATCGA

If we introduce four gaps, it is obvious that these sequences are more related than we thought.

But there must be a penalty for doing this (i.e. lowering the overall score) because you ,

Two kinds of Gap scoring methodsAffine gap penalty

G + Ln; where G is a penalty for introducing a gap, and L is the penalty for lengthening the range of the gapped region (G > Ln)

Non-affine gap penalty

Ln; No penalty for opening, and where L is a fixed penalty for every gap

So, How does BLAST work?• Words, Neighbourhoods, and High Scoring

Segment pairs, Oh My!

• The Word is the minimum length of sequence that is used to start a search, usually three amino acids (RDQYPQW).

• Neighbourhoods are similar words to the query word, (e.g. RDQ vs RBQ vs RDE). These are subject to the scoring matrix

• A High scoring segment pair is the region of sequence for which the highest scoring Word can be extended the most as matching sequence

Detection of high scoring segments.

Lexa M et al. Bioinformatics 2011;27:2510-2517

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Let’s Blast!!

Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty...

Documents

Transcript of Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty...