From Pairwise Alignment to Database Similarity Search

48
From Pairwise Alignment to Database Similarity Search

description

From Pairwise Alignment to Database Similarity Search. Outline. Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments. Local. Global. Best score for aligning the full length sequences Dynamic programming Algorithm: Needelman- Wunch - PowerPoint PPT Presentation

Transcript of From Pairwise Alignment to Database Similarity Search

Page 1: From Pairwise Alignment  to Database Similarity Search

From Pairwise Alignment to

Database Similarity Search

Page 2: From Pairwise Alignment  to Database Similarity Search

Outline

• Summary Local and Global alignments

• FASTA and BLAST algorithms

• Evaluating significance of alignments

Page 3: From Pairwise Alignment  to Database Similarity Search

• Best score for aligning part of sequences

• Dynamic programming • Algorithm:

Smith-Waterman• Table cells never score

below zero

• Best score for aligning the full length sequences

• Dynamic programming• Algorithm:

Needelman- Wunch• Table cells are allowed

any score

Global Local

Pairwise Alignment Summary

Page 4: From Pairwise Alignment  to Database Similarity Search

Available tools for Sequence Alignments

ALIGN -- GLOBAL (N-W)/ LOCAL (S-W)

BLAST2SEQ – Only LOCAL using word match

spidey -- aligns mRNAs to genomic sequence

est2genome -- aligns ESTs to genomic sequence

Page 5: From Pairwise Alignment  to Database Similarity Search

Gap Scores

>Human DNACATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA>Human mRNACATGCGACTGACATCGATCATA

Biologically, indels occur in groups we want our gap score to reflect this

Page 6: From Pairwise Alignment  to Database Similarity Search

Gap Scores

• Standard solution: affine gap model– Once-off cost for opening a gap– Lower cost for extending the gap– Changes required to algorithm

Page 7: From Pairwise Alignment  to Database Similarity Search

Affine Gap Penalty

• wx = g + r(x-1)

• wx : total gap penalty; g: gap open penalty; r: gap

extend penalty;x: gap length

• gap penalty chosen

– Gaps not excluded

– Gaps not over included

– Typical Values: g=-12; r = -4

Page 8: From Pairwise Alignment  to Database Similarity Search

Is this good enough ???

Page 9: From Pairwise Alignment  to Database Similarity Search

Drawbacks to DP Approaches

• Compute intensive

• Memory Intensive

Page 10: From Pairwise Alignment  to Database Similarity Search

Complexity

• Complexity is determined by size of table– Aligning a sequence of length m against one of

length n requires calculating (m n) cells

• Estimate: we calculate 108 cells per second– Aligning two mRNA sequences of 8,000 bp

requires 64,000,000 cells – Aligning an mRNA and a 107 bp chromosome

requires ~1011 cells

Page 11: From Pairwise Alignment  to Database Similarity Search

Searching databases

• Goal: Find homologue sequences in database to query input.

Page 12: From Pairwise Alignment  to Database Similarity Search

new sequence

?

Sequence Database

similarfunction

≈ Similar function

Page 13: From Pairwise Alignment  to Database Similarity Search

Searching databases

• Goal: Find homologue sequences in database to query input.

• Naïve solution:Use exact algorithm to compare each sequence in the database to query.

Page 14: From Pairwise Alignment  to Database Similarity Search
Page 15: From Pairwise Alignment  to Database Similarity Search

Complexity for genomes

• Human genome contains 3 109 base pairs– Searching an mRNA against HG requires ~1013

cells

-Even efficient exact algorithms will be extremely slow when preformed millions of times.

-Running the computations in parallel is expensive.

Page 16: From Pairwise Alignment  to Database Similarity Search

So what can we do?

Page 17: From Pairwise Alignment  to Database Similarity Search

Searching databases

• Solutions:1. Use a heuristic (approximate) algorithm to

discard most irrelevant sequences.

2. Perform the exact algorithm on the small group of remaining sequences.

Page 18: From Pairwise Alignment  to Database Similarity Search

Heuristic strategy

• Homologous sequences are expected to contain common short segments (probably with substitutions, but without ins/dels)

• Preprocess database (DB) into new data structure to enable fast accession

• Remove low-complexity regions that are not useful for meaningful alignments

Page 19: From Pairwise Alignment  to Database Similarity Search

• AAAAAAAAAAA

• ATATATATATATA

• Transposable elements

(LINEs, SINEs)

Low Complexity Sequences

Page 20: From Pairwise Alignment  to Database Similarity Search

Low Complexity Sequences

Whats wrong with them?produce artificial high scoring alignments.

So what do we do?: We apply Low Complexity masking to the query sequence

MaskTCGATCGTATATATACGGGGGGTA TCGATCGNNNNNNNNCNNNNNNTA

Page 21: From Pairwise Alignment  to Database Similarity Search

Low Complexity SequencesComplexity is calculated as:

Where N=4 in DNA (4 bases), L is the length of the sequenceAnd ni the number of each residue in the sequence

K=1/L logN(L!/Π ni!)all i

For the sequence GGGG:

L! =4x3x2x1=24ng =4nc =0na =0nt =0Πni =24x1x1x1=24

K =1/4 log4 (24/24)=0

For the sequence CTGA:

L! =4x3x2x1=24ng =1nc =1na =1nt =1Πni =1x1x1x1

K =1/4 log4 (24/1)=0.573

Page 22: From Pairwise Alignment  to Database Similarity Search

Heuristic (approximate solution) Methods: FASTA and BLAST

• FASTA (Lipman & Pearson 1985)– First fast sequence searching algorithm for comparing

a query sequence against a database

• BLAST - Basic Local Alignment Search Technique (Altschul et al 1990) – improvement of FASTA: Search speed, ease of use,

statistical rigor– Gapped BLAST (Altschul et al 1997)

Page 23: From Pairwise Alignment  to Database Similarity Search

FASTA and BLAST

• Common idea - a good alignment contains subsequences of absolute identity:– First, identify very short (almost) exact matches.– Next, the best short hits from the 1st step are extended

to longer regions of similarity.– Finally, the best hits are optimized using the Smith-

Waterman algorithm.

Page 24: From Pairwise Alignment  to Database Similarity Search

FastA (fast alignment)• Assumption: a good alignment probably matches

some identical ‘words’ • Example: Aligning a query sequence to a database

Database record:

ACTTGTAGATACAAAATGTG

Query sequence:

A-TTGTCG-TACAA-ATCTG

Matching words of size 4

Page 25: From Pairwise Alignment  to Database Similarity Search

• Preprocess of all the sequences in the database. Find short words and organize in dictionaries.

• Process the query sequence and prepare a dictionary.

– ATGGCTGCTCAAGT….

ATGG TGGC GGCT … …

FastA

Query

Page 26: From Pairwise Alignment  to Database Similarity Search

FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches.For DNA sequences the word length used is 6.

seq1

seq2

Page 27: From Pairwise Alignment  to Database Similarity Search

The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix.

seq1

seq2

Page 28: From Pairwise Alignment  to Database Similarity Search

FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined.

seq1

seq2

Page 29: From Pairwise Alignment  to Database Similarity Search

The score for the joined regions is the sum of the scores of the initial regions minus a joining penaltyfor each gap.

seq1

seq2

Page 30: From Pairwise Alignment  to Database Similarity Search

FastA final stage• Apply an exact algorithm of local alignment

on surviving records, computing the final alignment score.

• Calculate an Alignment score (S)• Evaluate the statistical significance

Page 31: From Pairwise Alignment  to Database Similarity Search

Assessing Alignment SignificanceDetermine probability of alignment occurring at random

IdealNo Good

Random

Related

Page 32: From Pairwise Alignment  to Database Similarity Search

FastA at EMBL

Page 33: From Pairwise Alignment  to Database Similarity Search

FastA

• A set of programs for database searching.

FastA EMBL

Page 34: From Pairwise Alignment  to Database Similarity Search

FASTA SEQUENCE FORMAT

This format contains a one line header followed by lines of sequence data. •Sequences in fasta formatted files are preceded by a line starting with a" >" symbol. •The first word on this line is the name of the sequence. •The remaining lines contain the sequence itself. •Blank lines in a FASTA file are ignored,

Page 35: From Pairwise Alignment  to Database Similarity Search

FastA at EMBL• Output, general information:

–Z’ score = deviation (in sd) of the actual score from the mean of random scores

Z=(x-mean)/sd

–Opt: the number of optimized scores observed. Lower limit–E( ): the number of sequences expected in the score range.

Page 36: From Pairwise Alignment  to Database Similarity Search

FastA at EMBL

Page 37: From Pairwise Alignment  to Database Similarity Search

FastA at EMBL

Page 38: From Pairwise Alignment  to Database Similarity Search

BLAST Basic Local Alignment Search Tool

• Developed to be as sensitive as FastA but much faster.

• Also searches for short words.– Protein 3 letter words– DNA 11 letter words.– Words can be similar, not only identical

Page 39: From Pairwise Alignment  to Database Similarity Search

Word Search -BLAST

• Identity - CAT : CAT

• Similarity – CAT : CAT, CAY, HAT …

• But even CAT : XTX can be similar

• For each three letter word there are many

similar words (depending on the alphabet).

• Similar words are only the ones that have a minimum cut-off score (T).

Y= C or TH= A, C, TX=A or T or C or G

Page 40: From Pairwise Alignment  to Database Similarity Search

BLAST

• Find matching word pairs

• Extend word pairs as much as possible,i.e., as long as the total weight increases

• Result: High-scoring Segment Pairs (HSPs)

THEFIRSTLINIHAVEADREAMESIRPATRICKREAD

INVIEIAMDEADMEATTNAMHEWASNINETEEN

Page 41: From Pairwise Alignment  to Database Similarity Search

BLAST

• Try to connect HSPs by aligning the sequences in between them:

THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD

INVIEIAMDEADMEATTNAMHEW___ASNINETEEN

The Gapped Blast algorithm allows several segments that are separated by short gaps to be connected together to one alignment

Page 42: From Pairwise Alignment  to Database Similarity Search

Score and E-value•The score is a measure of the similarity of the query to the sequence shown.

•The E-value is a measure of the reliability of the score.

•E-value is the probability due to chance, that there is another alignment with a similarity greater than the given S score.

Page 43: From Pairwise Alignment  to Database Similarity Search

Bit score (S) :– Similar to alignment score– Normalized– Higher means more significant

Score (S): (identities + mismatches)-gaps

BLAST- Score

Page 44: From Pairwise Alignment  to Database Similarity Search

BLAST- E value:Expected # of alignments with score at least S

Increases linearly with

length of query sequence

Increases linearly with

length of database

Decreases exponentially with score of

alignment

–K ,λ: statistical parameters dependent upon scoring system and background residue frequencies

m = length of query ; n= length of database ; s= score

Page 45: From Pairwise Alignment  to Database Similarity Search

What is a Good E-value - thumb rules

• E values of less than 0.00001 show that sequences are almost always homologues.

• Greater E values, can represent homologues as well.

• Generally the decision whether an E-value is biologically significant depends on the size of database that is searched

Page 46: From Pairwise Alignment  to Database Similarity Search

Significance of Gapped Alignments

• Gapped alignments use same statistics

and K cannot be easily estimated

• Empirical estimations and gap scores determined by looking at random alignments

Page 47: From Pairwise Alignment  to Database Similarity Search

BLAST• Blast is a family of programs: BlastN, BlastP, BlastX, tBlastN, tBlastX

• BlastN - nt versus nt database• BlastP - protein versus protein database• BlastX - translated nt versus protein database• tBlastN - protein versus translated nt database• tBlastX - translated nt versus translated nt database

Query: DNA Protein

Database: DNA Protein

Page 48: From Pairwise Alignment  to Database Similarity Search

BLAST at NCBI

• Output– Graphical out put of top results– The alignments for top scores– Scores for each alignment:

1. E value

2. Bits score: a score normalized with respect to the scoring system. Can be used to compare different searches.