Sequence alignment - Rensselaer Polytechnic Institute


Bioinformatics

• Sequence alignment
• BLAST
• Significance

Next time

• Protein Structure

Experimental origins of sequence data


Each color is one lane of an electrophoresis gel.

The Sanger dideoxynucleotide method

AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC

Building an alignment starts with a scoring matrix. In its simplest form, the scoring matrix is a dot plot.

Everything aligned to everything.

AAAGAGATTCTGCTAGCGGTCGGAGAGATGCTGCAGCGAGTCGGCC
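A dot plot can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function names `dot_plot` and `render` are made up here, and the demo uses only the first 13 bases of the sequence above so the output stays readable.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with 1 where seq_a[i] == seq_b[j], else 0.

    Every position of seq_a is compared with every position of seq_b
    ("everything aligned to everything").
    """
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text: '*' for an identity, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


if __name__ == "__main__":
    # Assumption: a short prefix of the slide's sequence, compared with itself.
    seq = "AAAGAGATTCTGC"
    print(render(dot_plot(seq, seq)))
```

Comparing a sequence with itself always fills the main diagonal; off-diagonal runs of `*` mark repeats.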

An alignment is a path through the scoring matrix, always proceeding to the right and down. (No non-sequential alignments allowed.)

AAAGAGATTCTGCTAGCGGTCGG

AGAGATGCTGCAGCGAGTCGGCC

Unbroken diagonals represent “blocks” of sequence without indels.
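The "path through the matrix" idea can be made concrete with a small dynamic programming sketch. It is an illustration under stated assumptions: the scores (match +1, mismatch -1, gap -1) are toy values rather than BLOSUM62 or PAM250, and `global_align` is a hypothetical name. Diagonal steps align two letters; steps straight down or straight right insert a gap.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of a diagonal, down, or right step.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk the path backwards from the bottom-right corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = global_align("AAAGAGATTCTGCTAGCGGTCGG",
                                      "AGAGATGCTGCAGCGAGTCGGCC")
    print(total)
    print(top)
    print(bottom)
```

Runs of consecutive diagonal steps in the traceback correspond to the unbroken "blocks" without indels.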

BLOSUM62: protein substitution matrix

PAM250


Database searching

Why do a database search?

Molecular biology: Determination of gene function. Primer design.

Pathology, epidemiology, ecology: Determination of species, strain, lineage, phylogeny.

Biophysics: Prediction of RNA or protein structure, effect of mutation.

[Figure: one sequence is searched against the databases (GenBank, PIR, Swissprot, GenEMBL, DDBJ), which hold lots of sequences.]

Searching millions of sequences

Given a protein or DNA sequence, we want to find all of the sequences in GenBank (over 17 million sequences!!) that have a good alignment score.

Each alignment score should be the optimal score (or a close approximation).

How do we do it?

Fast Database Searching

BLAST (S. Altschul et al.)

First make a set of lookup tables for all 3-letter (protein) or 11-letter (DNA) matches.

Make another lookup table: the locations of all 3-letter words in the database.

Start with a match, extend to the left and right until the score no longer increases.

Very fast. Selective, but not as sensitive as slower search methods (SSEARCH). Reliable statistics. Heuristic, not optimal.
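The second lookup table (locations of all 3-letter words in the database) might look like the sketch below. It is a simplification with made-up names (`build_word_index`, `find_seeds`) and a made-up two-sequence database; it looks up only the exact query words, whereas BLAST also looks up each word's high-scoring neighbors.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map every k-letter word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_seeds(query, index, k=3):
    """Return (sequence id, query position, target position) for each word match."""
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for seq_id, t_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((seq_id, q_pos, t_pos))
    return seeds


if __name__ == "__main__":
    # Hypothetical toy database; real databases hold millions of sequences.
    database = {"s1": "MKPGQLV", "s2": "AAPGQ"}
    index = build_word_index(database)
    print(find_seeds("TPGQ", index))
```

The index is built once per database, so each query costs only one dictionary lookup per word instead of a scan of every sequence.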

BLAST, precalculations

[Figure: the 3-tuple PGQ is compared with all 8000 possible 3-tuples (PGQ, PGR, PGS, ..., PGT, PGV, PGW, PGY, PAQ, PCQ, PDQ, PEQ, PFQ, ...); the 50 high-scoring 3-tuples are kept.]

Each 3-tuple is scored against all 8000 possible 3-tuples using BLOSUM. The top-scoring 50 are kept as that 3-tuple's “neighborhood words”.
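A rough sketch of the neighborhood-word precalculation follows. Loud assumption: the full BLOSUM table is too long to reproduce here, so `toy_score` is a crude stand-in (identity +5, same chemical group +1, otherwise -2) and the groups are illustrative only. With a real substitution matrix there would be far fewer ties at the cutoff.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: rough chemical groups standing in for BLOSUM similarity.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Stand-in for a BLOSUM lookup (NOT real BLOSUM values)."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return -2


def neighborhood_words(word, top_n=50, score=toy_score):
    """Score `word` against all 20**3 = 8000 3-tuples and keep the top_n."""
    scored = []
    for letters in product(AMINO_ACIDS, repeat=len(word)):
        candidate = "".join(letters)
        total = sum(score(x, y) for x, y in zip(word, candidate))
        scored.append((total, candidate))
    # Highest score first; ties are broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


if __name__ == "__main__":
    for total, candidate in neighborhood_words("PGQ")[:5]:
        print(candidate, total)
```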

BLAST

[Figure labels: query sequence, a 3-tuple, identity matches, seeds, HSPs]

For every 3-residue window, we get the set of 50 nearest neighbors. Use each word to get identity matches (seeds). Then extend the seed alignments as long as the score increases.
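Seed extension can be sketched as below. This follows the rule exactly as stated on the slide (stop as soon as the next pair would not increase the score); production BLAST tolerates a limited drop in score before giving up. The scoring (`pair_score`: match +1, mismatch -1) and the example sequences are assumptions for illustration.

```python
def pair_score(a, b, match=1, mismatch=-1):
    """Toy substitution score; a real search would use a matrix such as BLOSUM."""
    return match if a == b else mismatch


def extend_seed(query, target, q_pos, t_pos, k=3, score=pair_score):
    """Extend a k-letter seed left and right without gaps.

    Returns (query start, target start, length, score) of the extended segment.
    """
    total = sum(score(query[q_pos + i], target[t_pos + i]) for i in range(k))

    # Extend to the left while the next pair increases the score.
    q_start, t_start = q_pos, t_pos
    while (q_start > 0 and t_start > 0
           and score(query[q_start - 1], target[t_start - 1]) > 0):
        q_start -= 1
        t_start -= 1
        total += score(query[q_start], target[t_start])

    # Extend to the right in the same way (q_end and t_end are exclusive).
    q_end, t_end = q_pos + k, t_pos + k
    while (q_end < len(query) and t_end < len(target)
           and score(query[q_end], target[t_end]) > 0):
        total += score(query[q_end], target[t_end])
        q_end += 1
        t_end += 1

    return q_start, t_start, q_end - q_start, total


if __name__ == "__main__":
    # Hypothetical sequences sharing the seed "PGQ" at position 2 in both.
    print(extend_seed("AAPGQLVT", "CAPGQLVA", 2, 2))
```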

[Figure labels: neighborhood words for 3-tuple, target sequence]

BLAST

HSPs alignment

The best extended seeds are called HSPs (high-scoring segment pairs). The top scoring HSP is picked first, then the second (as long as it falls "northwest" or "southeast" of the first), and so on.
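The greedy HSP selection described above might be sketched like this. The `HSP` record and the example coordinates are hypothetical; "northwest" is taken to mean the HSP ends before the other begins in both the query and the target, and "southeast" the reverse.

```python
from typing import NamedTuple


class HSP(NamedTuple):
    q_start: int   # start in the query
    t_start: int   # start in the target
    length: int    # ungapped length
    score: int


def is_compatible(a, b):
    """True if a lies entirely northwest or entirely southeast of b."""
    a_before_b = (a.q_start + a.length <= b.q_start
                  and a.t_start + a.length <= b.t_start)
    b_before_a = (b.q_start + b.length <= a.q_start
                  and b.t_start + b.length <= a.t_start)
    return a_before_b or b_before_a


def chain_hsps(hsps):
    """Pick HSPs in order of decreasing score, skipping any that conflict."""
    chosen = []
    for hsp in sorted(hsps, key=lambda h: h.score, reverse=True):
        if all(is_compatible(hsp, other) for other in chosen):
            chosen.append(hsp)
    return sorted(chosen, key=lambda h: h.q_start)


if __name__ == "__main__":
    candidates = [HSP(10, 12, 8, 30), HSP(0, 1, 5, 12),
                  HSP(25, 3, 6, 20), HSP(20, 22, 4, 9)]
    for hsp in chain_hsps(candidates):
        print(hsp)
```

In the demo, the HSP with score 20 is dropped because it lies later in the query but earlier in the target than the top-scoring HSP, which would require a non-sequential alignment.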


BLAST -- last steps

» Local Dynamic Programming (DP) alignment is applied to only the sequences that pass the FASTA score cutoff.
» DP scores are converted to e-values.
» Local alignments are output for the top hits.
» Optionally, multiple sequence alignment output ("star" alignment)

Protein Databases available for BLAST search

On the BLAST search page, select a database to search and then select ? to learn a little about that database.


BLAST -- Filters

You can restrict the search by taxonomy.

You can enter an Entrez search query to restrict the search. (Test your Entrez query first.)

Learn about Entrez here: https://www.youtube.com/watch?v=t8fKz9rvuOk&feature=youtu.be

Watch this!

Other forms of BLAST

BLAST              query             database
blastn             nucleotide        nucleotide
blastp             protein           protein
tblastn            protein           translated DNA
blastx             translated DNA    protein
tblastx            translated DNA    translated DNA
psi-blast          protein, profile  protein
phi-blast          pattern           protein
transitive blast*  any               any

*not really a blast. Just a way of using blast.

Psi-BLAST: Blast with profiles

Psi-BLAST searches the database iteratively.

(Cycle 1) Normal BLAST (with gaps)

(Cycle 2) (a) Construct a profile from the results of Cycle 1.

(b) Search the database using the profile.

(Cycle 3) (a) Construct a profile from the results of Cycle 2.

(b) Search the database using the profile.

And So On... (user sets the number of cycles)

Psi-BLAST is much more sensitive than BLAST.

Also more vulnerable to low-complexity.
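The "construct a profile" step can be illustrated with a heavily simplified sketch. Assumptions, stated loudly: the hits are already aligned without gaps, the background is uniform, and a flat pseudocount is used. Actual Psi-BLAST profiles involve sequence weighting and more careful pseudocounts, so treat `build_profile` and `profile_score` as hypothetical teaching helpers only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(len(aligned_hits[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for hit in aligned_hits:
            counts[hit[col]] += 1
        total = sum(counts.values())
        profile.append({aa: math.log2((counts[aa] / total) / background)
                        for aa in AMINO_ACIDS})
    return profile


def profile_score(profile, segment):
    """Score an ungapped segment against the profile, column by column."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    # Hypothetical hits from a previous cycle, aligned over three columns.
    hits = ["PGQ", "PGN", "PGQ", "AGQ"]
    profile = build_profile(hits)
    for segment in ("PGQ", "AGN", "WWW"):
        print(segment, round(profile_score(profile, segment), 2))
```

Because each column has its own scores, a conserved position (G in the example) counts for more than a variable one, which is one way to see why profile searches are more sensitive than a single-sequence search.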

PHI-BLAST -- Pattern-Hit Initiated BLAST

DNA or Protein search?

• Advantages of searching DNA databases

Larger database. Does not assume a reading frame. Can find similarity in non-coding regions (introns, promoter regions). Can find frameshift mutations. Can find pseudogenes.

• Disadvantages

Slower. Not as sensitive. Ignores selective pressure at the protein level.

• Advantages of searching protein sequences

Faster. More sensitive. More biologically relevant.

• Disadvantages

Not applicable to non-coding DNA (promoters, introns, etc.)

Bioinformatics

• Sequence alignment
• Database searching
• Significance, e-values
• Trees
• Gene ontology
• Protein Structure

How significant is that?

Please give me a number for...

...how likely the data would not have been the result of chance,...

...as opposed to...

...a specific inference. Thanks.

Dayhoff's randomization experiment

Aligned scrambled Protein A versus scrambled Protein B 100 times (re-scrambling each time).

NOTE: scrambling does not change the AA composition!
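A randomization experiment of this kind can be sketched as follows. The two protein fragments, the local-alignment scoring (match +2, mismatch -1, gap -2), and the function names are all assumptions for illustration; they are not the sequences or scores from the original experiment.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, two-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffled(seq, rng):
    """Scramble the letters; the amino acid composition is unchanged."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def randomization_scores(a, b, trials=100, seed=0):
    """Align scrambled a against scrambled b, re-scrambling each time."""
    rng = random.Random(seed)
    return [local_score(shuffled(a, rng), shuffled(b, rng)) for _ in range(trials)]


if __name__ == "__main__":
    # Hypothetical protein fragments.
    protein_a = "MKTAYIAKQRQISFVK"
    protein_b = "MKTAYLAKQRNISFVR"
    real = local_score(protein_a, protein_b)
    chance = randomization_scores(protein_a, protein_b)
    mean, sd = statistics.mean(chance), statistics.stdev(chance)
    print("real score:", real, "chance mean:", round(mean, 2), "sd:", round(sd, 2))
    if sd > 0:
        print("standard deviations above chance:", round((real - mean) / sd, 1))
```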

Results: A normal distribution.

The significance of a score is measured as the probability of getting this score in a random alignment.

[Plot: freq versus score]

Lipman's randomization experiment

Aligned Protein A to 100 natural sequences, not scrambled.

Results: A wider normal distribution (Std dev = ~3 times larger).

WHY? Because natural sequences are different from random.

Even unrelated sequences have similar local patterns, and uneven amino acid composition.

Lipman got a similar result if he randomized the sequences by words instead of letters.

Was the significance over-estimated using Dayhoff's method?

[Plot: freq versus score]

P(S > x)

E(M) gives us the expected length of the longest run of matches. But what we really want is the answer to this question:

How good is the score x? (i.e. how significant)

So, we need to model the whole distribution of chance scores, then ask how likely is it that my score or greater comes from that model.

[Plot: freq versus score]

A normal distribution

Suppose you had a Gaussian distribution “dart-board”. You throw 1000 darts randomly. Score each dart according to the number on the X-axis where it lands. What is the probability distribution of scores? Answer: The same Gaussian distribution! (duh)

Extreme values from a normal distribution

What if we throw 10 darts at a time and keep only the highest-scoring dart (extreme value)? What is the distribution of the extreme values?

The Extreme Value Distribution

Normal distributions (Dayhoff, Lipman) overestimate significance when the scores are extreme values. EVD is the correct null model.
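The dart-board thought experiment is easy to simulate. The sketch below is an illustration with assumed parameters (standard normal darts, 2000 rounds, a fixed seed): it compares single darts with the best of 10 darts and counts how often each exceeds a threshold.

```python
import random
import statistics


def simulate_darts(rounds=2000, darts_per_round=10, seed=0):
    """Return (scores of single darts, best-of-N scores) from a normal 'dart-board'."""
    rng = random.Random(seed)
    singles = [rng.gauss(0.0, 1.0) for _ in range(rounds)]
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(darts_per_round))
              for _ in range(rounds)]
    return singles, maxima


def tail_fraction(scores, threshold):
    """Empirical P(S >= threshold)."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


if __name__ == "__main__":
    singles, maxima = simulate_darts()
    print("mean of single darts:", round(statistics.mean(singles), 2))
    print("mean of best-of-10:  ", round(statistics.mean(maxima), 2))
    print("P(single >= 2):      ", tail_fraction(singles, 2.0))
    print("P(best-of-10 >= 2):  ", tail_fraction(maxima, 2.0))
```

A score of 2 looks rare if judged against the single-dart (normal) distribution but is far more common among best-of-10 scores, which is the sense in which a normal null model overestimates significance for extreme values.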

Fitting the EVD to random alignments

Estimated P (integral of the EVD):

$P(S \ge x) \approx Kmn\,e^{-\lambda x}$

where K = constant, m = size of database, n = length of sequence, λ = constant.

Taking the log,

$\log P(S \ge x) = \log(Kmn) - \lambda x$

• Generate a large number of known false alignment scores S (all alignments with the same two lengths m and n).
• Plot $\log P(S \ge x)$ versus x and fit to a line!

[Plot: log P(S≥x) versus x; the points fall along a straight line.]

The slope is −λ, the intercept is log(Kmn). Now we can calculate P for any score x.
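The line fit can be sketched as follows. Assumptions, stated plainly: the "known false alignment scores" are replaced by synthetic samples drawn from an extreme value distribution with a chosen λ and location `mu` (so that log(Kmn) = λ·mu), and the fit uses only the tail between P = 0.01 and P = 0.1, where the exponential approximation holds and the counts are still large. All function names are hypothetical.

```python
import math
import random


def sample_chance_scores(n, lam, mu, seed=0):
    """Synthetic stand-in for false alignment scores: EVD (Gumbel) samples."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # keep u strictly inside (0, 1)
        scores.append(mu - math.log(-math.log(u)) / lam)
    return scores


def fit_tail(scores, p_low=0.01, p_high=0.1):
    """Least-squares line through (x, log P(S >= x)); returns (lambda, log(Kmn))."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    xs, ys = [], []
    for rank in range(max(1, int(p_low * n)), int(p_high * n) + 1):
        xs.append(ranked[rank - 1])       # the rank-th largest score
        ys.append(math.log(rank / n))     # empirical log P(S >= x)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -slope, intercept              # slope is -lambda


def p_value(x, lam, log_kmn):
    """P(S >= x) from the fitted line, capped at 1."""
    return min(1.0, math.exp(log_kmn - lam * x))


if __name__ == "__main__":
    scores = sample_chance_scores(20000, lam=0.3, mu=30.0)
    lam, log_kmn = fit_tail(scores)
    print("fitted lambda:", round(lam, 3), "fitted log(Kmn):", round(log_kmn, 2))
    print("P(S >= 50):", p_value(50.0, lam, log_kmn))
```

With real data, the synthetic sampler would be replaced by scores from alignments known to be false, all with the same two lengths m and n.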

Pop-quiz

You did a BLAST search using a sequence that has absolutely no homologs in the database. Absolutely none.

The BLAST search gave you false “hits” with the top e-values ranging from 0 to 20. You look at them and you notice a pattern in the e-values.

How many of your hits have e-value ≤ 10?