From Pairwise Alignment to Database Similarity Search.

Date posted: 21-Dec-2015

Page 1: From Pairwise Alignment to Database Similarity Search.

From Pairwise Alignment to Database Similarity Search

Page 2: From Pairwise Alignment to Database Similarity Search.

Pairwise Alignment Summary

Global

• Best score for aligning the full-length sequences
• Dynamic programming
• Algorithm: Needleman-Wunsch
• Table cells are allowed any score

Local

• Best score for aligning part of the sequences
• Dynamic programming
• Algorithm: Smith-Waterman
• Table cells never score below zero
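
The difference between the two table-filling rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Return the best global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized, so borders go negative.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                # Local alignment: a cell never scores below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            table[i][j] = cell
    # Global: score of the full-length alignment; local: best cell anywhere.
    return best if local else table[-1][-1]


print(alignment_score("TTTTACGT", "ACGTGGGG", local=True))
print(alignment_score("TTTTACGT", "ACGTGGGG", local=False))
```

The only changes between the two modes are the border initialization, the floor at zero, and where the final score is read from.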

Page 3: From Pairwise Alignment to Database Similarity Search.

Why do we care to align sequences?

Sequences that are similar probably have the same function

Page 4: From Pairwise Alignment to Database Similarity Search.

Discover the function of a new sequence

[Diagram: a new sequence of unknown function (?) is compared against a sequence database; a similar sequence (≈) suggests a similar function.]

Page 5: From Pairwise Alignment to Database Similarity Search.

Searching Databases for similar sequences

Naïve solution: Use an exact algorithm to compare the query to each sequence in the database.

Is this reasonable?

How much time will it take to calculate?

Page 6: From Pairwise Alignment to Database Similarity Search.

Complexity for genomes

• The human genome contains 3 × 10^9 base pairs
  – Searching an mRNA against the human genome requires ~10^13 table cells

• Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
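
A back-of-the-envelope check of the ~10^13 figure, as a rough sketch. The mRNA length (3,000 bases) and the throughput (10^9 table cells per second) are assumptions made for illustration; they do not come from the slides.

```python
genome_bp = 3 * 10**9        # human genome size from the slide
mrna_bp = 3_000              # assumed typical mRNA length (illustrative)
cells_per_second = 10**9     # assumed throughput of one machine (illustrative)

# A dynamic-programming table has one cell per (query base, genome base) pair.
cells = genome_bp * mrna_bp
hours = cells / cells_per_second / 3600

print(f"{cells:.1e} cells, about {hours:.1f} hours for a single query")
```

Under these assumptions a single exact search takes hours, so repeating it for millions of queries is impractical.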

Page 7: From Pairwise Alignment to Database Similarity Search.

So what can we do?

Page 8: From Pairwise Alignment to Database Similarity Search.

Searching databases

Solution: Use a heuristic (approximate) algorithm to discard most irrelevant sequences and perform the exact algorithm on the small group of remaining sequences.

Page 9: From Pairwise Alignment to Database Similarity Search.

Heuristic strategy

• Remove regions that are not useful for meaningful alignments

• Preprocess the database into a new data structure to enable fast access

Page 11: From Pairwise Alignment to Database Similarity Search.

What sequences to remove?

Low complexity sequences

• AAAAAAAAAAA

• ATATATATATATA

• Transposable elements (LINEs, SINEs)

Page 12: From Pairwise Alignment to Database Similarity Search.

Low Complexity Sequences

What's wrong with them? They produce artificial high-scoring alignments.

So what do we do? We apply low-complexity masking to the database and the query sequence.

Mask:

TCGATCGTATATATACGGGGGGTA
TCGATCGNNNNNNNNCNNNNNNTA
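
The masking example can be reproduced with a toy masker. This is only a sketch: it assumes that "low complexity" means a unit of one or two bases repeated at least three times in a row, which is a simplification invented for this example and not how production masking tools decide what to mask. The name `mask_simple_repeats` is hypothetical.

```python
import re

# A unit of 1-2 bases repeated at least 3 times in a row (e.g. GGGGGG, TATATATA).
# The lazy {1,2}? tries a single-base unit first so homopolymer runs are masked whole.
REPEAT = re.compile(r"([ACGT]{1,2}?)\1{2,}")


def mask_simple_repeats(seq):
    """Replace every simple repeat with a run of N of the same length."""
    return REPEAT.sub(lambda m: "N" * len(m.group(0)), seq)


masked = mask_simple_repeats("TCGATCGTATATATACGGGGGGTA")
print(masked)
```

Masked positions keep their length, so coordinates in the masked sequence still map directly onto the original.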

Page 13: From Pairwise Alignment to Database Similarity Search.

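Low Complexity Sequences

Complexity is calculated as:

$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$

where N = 4 for DNA (4 bases), L is the length of the sequence, and n_i is the number of occurrences of residue i in the sequence.

For the sequence GGGG:

$L! = 4 \times 3 \times 2 \times 1 = 24$
$n_G = 4,\ n_C = 0,\ n_A = 0,\ n_T = 0$
$\prod_i n_i! = 24 \times 1 \times 1 \times 1 = 24$

$K = \frac{1}{4} \log_4 (24/24) = 0$

For the sequence CTGA:

$L! = 4 \times 3 \times 2 \times 1 = 24$
$n_G = 1,\ n_C = 1,\ n_A = 1,\ n_T = 1$
$\prod_i n_i! = 1 \times 1 \times 1 \times 1 = 1$

$K = \frac{1}{4} \log_4 (24/1) = 0.573$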
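
The two worked examples can be checked with a short function. This is a minimal sketch of the complexity formula as stated on the slide; the function name `complexity` and its default alphabet size are choices made for the example.

```python
from collections import Counter
from math import factorial, log


def complexity(seq, alphabet_size=4):
    """K = (1/L) * log_N( L! / product of n_i! ), with N = alphabet_size."""
    length = len(seq)
    denominator = 1
    for count in Counter(seq).values():
        denominator *= factorial(count)  # residues that never occur contribute 0! = 1
    arrangements = factorial(length) // denominator  # exact multinomial coefficient
    return log(arrangements, alphabet_size) / length


print(complexity("GGGG"))
print(round(complexity("CTGA"), 3))
```

A homopolymer has only one possible arrangement, so its complexity is zero; a sequence using every base once has the maximum number of arrangements for its length.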

Page 14: From Pairwise Alignment to Database Similarity Search.

Heuristic strategy

• Remove low-complexity regions that are not useful for meaningful alignments

• Preprocess the database into a new data structure to enable fast access

Page 15: From Pairwise Alignment to Database Similarity Search.

Heuristic (approximate solution) Methods: FASTA and BLAST

• FASTA (Lipman & Pearson 1985)
  – First fast sequence searching algorithm for comparing a query sequence against a database

• BLAST - Basic Local Alignment Search Tool (Altschul et al. 1990)
  – Improvement on FASTA: search speed, ease of use, statistical rigor

Page 16: From Pairwise Alignment to Database Similarity Search.

FASTA and BLAST

• Common idea - a good alignment contains subsequences of absolute identity:
  – First, identify very short (almost) exact matches.
  – Next, the best short hits from the first step are extended to longer regions of similarity.
  – Finally, the best hits are optimized using the Smith-Waterman algorithm.

Page 17: From Pairwise Alignment to Database Similarity Search.

FastA (fast alignment)

• Assumption: a good alignment probably matches some identical ‘words’

• Example: Aligning a query sequence to a database

Database record:

ACTTGTAGATACAAAATGTG

Query sequence:

A-TTGTCG-TACAA-ATCTG

Page 18: From Pairwise Alignment to Database Similarity Search.

FastA

• Preprocess all the sequences in the database: find short words and organize them in dictionaries.

• Process the query sequence and prepare a dictionary.

Query: ATGGCTGCTCAAGT….

Words: ATGG TGGC GGCT … …
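
The word dictionary can be sketched as a mapping from each word to the positions where it occurs. This is an illustrative sketch: `build_word_index` is a hypothetical helper name, and the word length of 4 simply follows the words shown on the slide.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Map every length-k word in seq to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start)
    return dict(index)


index = build_word_index("ATGGCTGCTCAAGT", 4)
for word, starts in index.items():
    print(word, starts)
```

Looking up a word is then a single dictionary access, which is what makes the later matching step fast.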

Page 19: From Pairwise Alignment to Database Similarity Search.

FastA locates regions of the query sequence and the search set sequence that have high densities of exact word matches. For DNA sequences the word length used is 6.

[Figure: plot with axes "Words in seq1" and "Words in seq2"]
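
One common way to find regions dense in word matches is to count matches per diagonal, where a diagonal is the offset between the match position in one sequence and in the other. The sketch below assumes that approach; the function name, the toy sequences, and the word length of 4 (shorter than the 6 quoted for DNA, to keep the example small) are all illustrative.

```python
from collections import Counter, defaultdict


def diagonal_word_counts(query, subject, k):
    """Count exact k-word matches on each diagonal (subject position - query position)."""
    positions = defaultdict(list)
    for j in range(len(subject) - k + 1):
        positions[subject[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(query) - k + 1):
        for j in positions.get(query[i:i + k], []):
            counts[j - i] += 1
    return counts


counts = diagonal_word_counts("ACGTGCA", "TTTTACGTGCA", 4)
print(counts.most_common(1))
```

Many matches on the same diagonal indicate an ungapped region of similarity, which is the kind of region that gets saved for re-scoring.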

Page 20: From Pairwise Alignment to Database Similarity Search.

The 10 highest-scoring sequence regions are saved and re-scored using a scoring matrix.

[Figure: plot with axes seq1 and seq2]

Page 21: From Pairwise Alignment to Database Similarity Search.

FastA determines if any of the initial regions from different diagonals may be joined together to form an approximate alignment with gaps. Only non-overlapping regions may be joined.

[Figure: plot with axes seq1 and seq2]

Page 22: From Pairwise Alignment to Database Similarity Search.

The score for the joined regions is the sum of the scores of the initial regions minus a joining penalty for each gap.

[Figure: plot with axes seq1 and seq2]
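
The joining rule is simple arithmetic, sketched below. The region scores and the penalty value are made up for illustration, and the sketch assumes one gap between each consecutive pair of joined regions.

```python
def joined_score(region_scores, joining_penalty):
    """Sum of the initial region scores minus one joining penalty per gap."""
    gaps = max(len(region_scores) - 1, 0)  # assumed: one gap between neighbouring regions
    return sum(region_scores) - joining_penalty * gaps


print(joined_score([20, 15, 10], joining_penalty=5))
```

Joining only pays off when the added region scores more than the penalty it costs to reach it.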

Page 23: From Pairwise Alignment to Database Similarity Search.

BLAST: Basic Local Alignment Search Tool

• Developed to be as sensitive as FastA but much faster.

• Also searches for short words.
  – Protein: 3-letter words
  – DNA: 11-letter words
  – Words can be similar, not only identical

Page 24: From Pairwise Alignment to Database Similarity Search.

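BLAST (Protein Sequence Example)

1. Search the database for matching word pairs (score > threshold T)

Example: …FSGTWYA…

A list of words (w=3) is:

FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS

The first row holds the words taken directly from the query; the rows below hold similar words.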
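
Generating the query words and their similar "neighbour" words can be sketched as follows. The scoring here is a deliberately crude stand-in for a real substitution matrix: the `GROUPS` of interchangeable residues, the score values (+5, +2, -2), and the threshold are all hypothetical and chosen only so the example stays short.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups standing in for a real substitution matrix.
GROUPS = ["FWY", "ST", "ILMV", "DE", "KR", "NQ", "AG"]


def toy_score(a, b):
    """+5 for identity, +2 within a similarity group, -2 otherwise (toy values)."""
    if a == b:
        return 5
    if any(a in group and b in group for group in GROUPS):
        return 2
    return -2


def words(seq, w=3):
    """All overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def neighborhood(word, threshold):
    """All words whose score against `word` is greater than the threshold T."""
    result = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score > threshold:
            result.append("".join(candidate))
    return result


print(words("FSGTWYA"))
print(sorted(neighborhood("FSG", threshold=11)))
```

Raising the threshold shrinks the neighbourhood toward exact matches only, which trades sensitivity for speed.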

Page 25: From Pairwise Alignment to Database Similarity Search.

BLAST (Protein Sequence Example)

1. Search the database for matching word pairs (> T)

2. Extend word pairs as much as possible, i.e., as long as the total score increases

• Result: High-scoring Segment Pairs (HSPs)

THEFIRSTLINIHFSGTWYAAMESIRPATRICKREAD

INVIEIAFDGTWTCATTNAMHEWASNINETEEN
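
Ungapped extension of a word hit can be sketched on the two example sequences. Several things here are assumptions for illustration: identity scoring (+2 match, -1 mismatch) instead of a substitution matrix, and a `drop` tolerance that lets the running score dip slightly before giving up, which is a looser stopping rule than "as long as the total score increases". The function name `extend_seed` is hypothetical.

```python
def extend_seed(query, subject, q_start, s_start, w, match=2, mismatch=-1, drop=2):
    """Extend a word hit in both directions without gaps; return (score, query span, subject span)."""

    def pair(a, b):
        return match if a == b else mismatch

    def extend(step, q, s):
        # Walk away from the seed, remembering the best running score seen so far.
        best = running = 0
        best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            running += pair(query[q], subject[s])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > drop:
                break  # score fell too far below the best: stop extending
            q += step
            s += step
        return best, best_len

    seed = sum(pair(query[q_start + i], subject[s_start + i]) for i in range(w))
    right, right_len = extend(+1, q_start + w, s_start + w)
    left, left_len = extend(-1, q_start - 1, s_start - 1)
    score = seed + left + right
    q_span = (q_start - left_len, q_start + w + right_len)
    s_span = (s_start - left_len, s_start + w + right_len)
    return score, q_span, s_span


query = "THEFIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
subject = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
hit = extend_seed(query, subject, query.find("GTW"), subject.find("GTW"), w=3)
score, (q0, q1), (s0, s1) = hit
print(score, query[q0:q1], subject[s0:s1])
```

The returned spans use Python slice conventions (start inclusive, end exclusive), so they can be used directly to cut the segment pair out of both sequences.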

Page 26: From Pairwise Alignment to Database Similarity Search.

BLAST

3. Try to connect HSPs by aligning the sequences in between them:

THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD

INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN

The Gapped BLAST algorithm allows several segments that are separated by short gaps to be joined into a single alignment.

Page 27: From Pairwise Alignment to Database Similarity Search.

How to interpret a BLAST search:

•The score is a measure of the similarity of the query to the sequence shown.

How do we know if the score is significant?

-Statistical significance

-Biological significance

Page 28: From Pairwise Alignment to Database Similarity Search.

Assessing Alignment Significance

Determine the probability of the alignment occurring at random.

[Figure: two panels, "Ideal" and "No Good", each plotting frequency against score for random and related sequences]

For each score we can calculate the probability of getting it by chance.

Page 29: From Pairwise Alignment to Database Similarity Search.

How to interpret a BLAST search:

For each BLAST score we get an E-value.

The expect value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.

An E-value is related to a probability value p (p-value).

page 105

Page 30: From Pairwise Alignment to Database Similarity Search.

BLAST E-value:

$E = K m n e^{-\lambda S}$

• Increases linearly with the length of the query sequence (m)

• Increases linearly with the length of the database (n)

• Decreases exponentially with the score of the alignment (S)

– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

– m = length of query; n = length of database; S = score
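
The three scaling behaviours can be checked numerically using the standard Karlin-Altschul form E = K·m·n·e^(-λS). This is a sketch under stated assumptions: the default values of `K` and `lam` are placeholders for illustration, since the real values depend on the scoring system and residue frequencies.

```python
import math


def expect_value(score, m, n, K=0.134, lam=0.318):
    """E = K * m * n * exp(-lambda * S); K and lam are illustrative placeholder values."""
    return K * m * n * math.exp(-lam * score)


base = expect_value(score=50, m=200, n=10**6)
print(expect_value(50, 400, 10**6) / base)      # doubling the query length
print(expect_value(50, 200, 2 * 10**6) / base)  # doubling the database length
print(expect_value(60, 200, 10**6) / base)      # raising the score by 10
```

Doubling either length doubles E, while a modest increase in score shrinks E by a large factor.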

Page 31: From Pairwise Alignment to Database Similarity Search.

From raw scores to bit scores

• Bit scores S’ are normalized and are comparable in different databases

The E value corresponding to a given bit score is:

$E = m n \, 2^{-S'}$

page 106
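
The bit-score form of the E-value agrees with the raw-score form when the bit score is defined by the standard conversion S' = (λS - ln K) / ln 2. That conversion is not shown in this transcript and is assumed here; the values of K and λ below are illustrative placeholders.

```python
import math


def bit_score(raw_score, K, lam):
    """Standard conversion (assumed here): S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2 ** (-bits)


def e_from_raw(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


K, lam = 0.134, 0.318  # illustrative placeholder parameters
bits = bit_score(50, K, lam)
print(bits, e_from_bits(bits, 200, 10**6), e_from_raw(50, 200, 10**6, K, lam))
```

Because K and λ are folded into the bit score, two searches run with different scoring systems can be compared on the same scale.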

Page 32: From Pairwise Alignment to Database Similarity Search.

What is a good E-value? (Rule of thumb)

• E-values below 0.00001 almost always indicate that the sequences are homologues.

• Greater E-values can represent homologues as well.

• Generally, whether an E-value is biologically significant depends on the size of the database that is searched.

• Sometimes a real match has an E-value > 1

• Sometimes a similar E-value occurs for a short exact match and a long, less exact match

Page 33: From Pairwise Alignment to Database Similarity Search.

Treating Gaps in BLAST

>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA

Biologically, indels occur in groups; we want our gap score to reflect this.

Page 34: From Pairwise Alignment to Database Similarity Search.

Gap Scores

• Standard solution: affine gap model

$w_x = g + r(x - 1)$

w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length

– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
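
The effect of the affine model can be seen by comparing it with a linear penalty for one long gap. This is a small sketch: the penalty values (open 10, extend 1) and the linear alternative are assumptions chosen for the example.

```python
def affine_gap_penalty(x, g, r):
    """w_x = g + r * (x - 1): pay g to open the gap, r for each further position."""
    if x <= 0:
        return 0
    return g + r * (x - 1)


def linear_gap_penalty(x, per_position):
    """Every gap position costs the same amount (shown for comparison only)."""
    return per_position * x


for length in (1, 5, 27):
    print(length, affine_gap_penalty(length, g=10, r=1), linear_gap_penalty(length, 10))
```

Under the affine model one long gap is much cheaper than many scattered single gaps, which matches the observation that indels occur in groups.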

Page 35: From Pairwise Alignment to Database Similarity Search.

Significance of Gapped Alignments

• Gapped alignments use the same statistics

• λ and K cannot be easily estimated

• Empirical estimations and gap scores are determined by looking at random alignments

Page 36: From Pairwise Alignment to Database Similarity Search.

BLAST

BLAST is a family of programs

Query: DNA or protein

Database: DNA or protein

Page 37: From Pairwise Alignment to Database Similarity Search.

Choose the BLAST program

Program    Input      Database   Comparisons
blastn     DNA        DNA        1
blastp     protein    protein    1
blastx     DNA        protein    6
tblastn    protein    DNA        6
tblastx    DNA        DNA        36
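
Choosing a program from the query and database types amounts to a table lookup, sketched below. The list `PROGRAMS` and the helper `candidate_programs` are hypothetical names for this example; the comparison counts assume the usual six-frame translation of DNA (6 for one translated side, 6 × 6 = 36 when both sides are translated).

```python
# (program, query type, database type, number of comparisons)
PROGRAMS = [
    ("blastn", "DNA", "DNA", 1),
    ("blastp", "protein", "protein", 1),
    ("blastx", "DNA", "protein", 6),
    ("tblastn", "protein", "DNA", 6),
    ("tblastx", "DNA", "DNA", 36),
]


def candidate_programs(query_type, db_type):
    """Return the names of the programs that accept this query/database combination."""
    return [name for name, query, database, _ in PROGRAMS
            if query == query_type and database == db_type]


print(candidate_programs("DNA", "protein"))
print(candidate_programs("DNA", "DNA"))
```

A DNA query against a DNA database has two candidates: a direct nucleotide comparison, or a comparison of the translations of both.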

Page 38: From Pairwise Alignment to Database Similarity Search.

Example: The lipocalins (each dot is a protein)

retinol-binding protein

odorant-binding protein

apolipoprotein D

Example is taken from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

Page 39: From Pairwise Alignment to Database Similarity Search.

BLAST search with PAEP as a query finds many other lipocalins

Page 40: From Pairwise Alignment to Database Similarity Search.

Assessing whether proteins are homologous

RBP4 and PAEP: low bit score, E-value 0.49, 24% identity, but they are indeed homologous.