From Pairwise Alignment to Database Similarity Search


Page 1: From Pairwise Alignment  to Database Similarity Search

From Pairwise Alignment to Database Similarity Search

Page 2: From Pairwise Alignment  to Database Similarity Search

Global vs Local Alignment

Global Alignment

ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT

Local Alignment

CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT

Page 3: From Pairwise Alignment  to Database Similarity Search

Global vs. Local alignment

Alignment of two genomic sequences

>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Mouse DNA
CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA

Page 4: From Pairwise Alignment  to Database Similarity Search

Global vs. Local alignment

Alignment of two genomic sequences

Global Alignment

H: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse: CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
       ****** *****         *     ***      *  ****** ***

Local Alignment

H: CATGCGACTGAC
Mouse: CATGCGTCTGAC

H: ATCGATCATA
Mouse: ATCGAT-ATA

Page 5: From Pairwise Alignment  to Database Similarity Search

Global vs. Local alignment

Alignment of genomic DNA and mRNA

>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA

Page 6: From Pairwise Alignment  to Database Similarity Search

Global vs. Local alignment

Alignment of genomic DNA and mRNA

Global Alignment

DNA:  CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA: CATGCGACTGAC---------------------------ATCGATCATA
      ************                           **********

Local Alignment

DNA:  CATGCGACTGAC
mRNA: CATGCGACTGAC

DNA:  ATCGATCATA
mRNA: ATCGATCATA
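
The contrast between the two modes can be reproduced with a small dynamic-programming sketch. It is illustrative only: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions, not taken from the slides. In global mode every position of both sequences must be aligned (Needleman-Wunsch style); in local mode scores are floored at zero and the best cell anywhere in the table is reported (Smith-Waterman style).

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the best global (default) or local alignment score of a and b.

    Scoring values are assumed for illustration; gaps use a linear penalty.
    Only the score is computed, so two table rows are enough.
    """
    # First row: global alignment pays for leading gaps, local does not.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            cur.append(cell)
        prev = cur
    return best if local else prev[-1]


# The genomic DNA / mRNA pair from the slides (case is ignored here).
dna = "CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA".upper()
mrna = "CATGCGACTGACATCGATCATA"

global_score = alignment_score(dna, mrna)
local_score = alignment_score(dna, mrna, local=True)
print("global:", global_score, "local:", local_score)
```

Under these assumed scores the global alignment cannot do better than 22 - 2 × 27 = -32, because all 27 intron bases must be paired with gaps, while the local score is at least 12 (the first exon on its own).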

Page 7: From Pairwise Alignment  to Database Similarity Search

Why do we care to align sequences?

Sequences that are similar probably have the same function

Page 8: From Pairwise Alignment  to Database Similarity Search

Why do we care to align sequences?

Page 9: From Pairwise Alignment  to Database Similarity Search

Discover the function of a new sequence

New sequence (function unknown: ?) → search the sequence database → similar sequence ≈ similar function

Page 10: From Pairwise Alignment  to Database Similarity Search

Searching Databases for similar sequences

Naïve solution: use an exact alignment algorithm to compare the query against every sequence in the database.

Is this reasonable?

How much time will it take to calculate?

Page 11: From Pairwise Alignment  to Database Similarity Search

Complexity for genomes

• The human genome contains 3 × 10^9 base pairs
  – Searching an mRNA against the human genome requires ~10^12 dynamic-programming cells

• Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
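
The order of magnitude on this slide can be checked with a few lines of arithmetic. The genome length comes from the slide; the mRNA length of 1,000 bases, the throughput of 10^9 cells per second, and the figure of one million queries are assumptions chosen only to make the estimate concrete.

```python
GENOME_LENGTH = 3 * 10**9     # human genome in base pairs (from the slide)
MRNA_LENGTH = 1_000           # assumed typical mRNA query length
CELLS_PER_SECOND = 10**9      # assumed throughput of one fast core
QUERIES = 1_000_000           # assumed number of searches


def dp_cells(query_len, target_len):
    """Number of cells in the full dynamic-programming table."""
    return query_len * target_len


cells = dp_cells(MRNA_LENGTH, GENOME_LENGTH)
seconds_per_query = cells / CELLS_PER_SECOND
years = seconds_per_query * QUERIES / (3600 * 24 * 365)

print(f"{cells:.1e} cells per query")
print(f"{seconds_per_query:.0f} s per query, about {years:.0f} years for all queries")
```

Even with these optimistic assumptions a single query takes on the order of an hour, which is why an exact search of every database entry is not practical.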

Page 12: From Pairwise Alignment  to Database Similarity Search

So what can we do?

Page 13: From Pairwise Alignment  to Database Similarity Search

Searching databases

Solution: Use a heuristic (approximate) algorithm

Page 14: From Pairwise Alignment  to Database Similarity Search

Heuristic strategy

• Remove regions that are not useful for meaningful alignments

• Preprocess the database into a new data structure to enable fast access

Page 16: From Pairwise Alignment  to Database Similarity Search

What sequences to remove?

• AAAAAAAAAAA

• ATATATATATATA

• Transposable elements

53% of the genome is repetitive DNA

Low-complexity sequences (JUNK???)

Page 17: From Pairwise Alignment  to Database Similarity Search

Low Complexity Sequences

What's wrong with them?

• Not informative

• Produce artificially high-scoring alignments

So what do we do?

We apply low-complexity masking to the database and the query sequence.

Mask:

TCGATCGTATATATACGGGGGGTA
TCGATCGNNNNNNNNCNNNNNNTA
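
The masking example above can be reproduced with a deliberately simple sketch. It masks only tandem repeats whose unit is one or two bases long; the function name `mask_low_complexity` and the thresholds (`max_period=2`, `min_len=6`) are assumptions made for illustration. Production tools use dedicated filters such as DUST or SEG, and this sketch is not an implementation of either.

```python
def mask_low_complexity(seq, max_period=2, min_len=6, mask_char="N"):
    """Replace short-period tandem repeats of at least min_len bases.

    A run with period p is a maximal stretch in which every base equals
    the base p positions before it (p=1: AAAA..., p=2: ATAT...).
    Assumes min_len is larger than max_period.
    """
    masked = list(seq)
    n = len(seq)
    for period in range(1, max_period + 1):
        start = 0
        while start < n:
            end = min(start + period, n)
            # Grow the run while the repeat pattern continues.
            while end < n and seq[end] == seq[end - period]:
                end += 1
            if end - start >= min_len:
                masked[start:end] = mask_char * (end - start)
            # The next maximal run can begin no earlier than here.
            start = max(start + 1, end - period + 1)
    return "".join(masked)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

On the sequence from the slide this masks the TATATATA stretch and the GGGGGG stretch and leaves the remaining bases untouched.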

Page 18: From Pairwise Alignment  to Database Similarity Search

Heuristic strategy

• Remove low-complexity regions that are not useful for meaningful alignments

• Preprocess the database into a new data structure to enable fast access

Page 19: From Pairwise Alignment  to Database Similarity Search

BLAST: Basic Local Alignment Search Tool

• General idea: a good alignment contains subsequences of high identity:
  – First, identify very short, almost exact matches.
  – Next, the best short hits from the first step are extended to longer regions of similarity.
  – Finally, the best hits are optimized using the Smith-Waterman algorithm.

Altschul et al. 1990

Page 20: From Pairwise Alignment  to Database Similarity Search

BLAST (Protein Sequence Example)

1. Search the database for matching words

Example:
Protein sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA

All words in database (bag of words): FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS….
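
A minimal sketch of this word-lookup step follows. It splits a sequence into overlapping words and preprocesses a database into a dictionary from each word to the places where it occurs, so that matching words are found by lookup rather than by scanning. The function names, the dictionary layout, and the use of exact word matches only are assumptions for illustration; the two sequences are the Q and D example sequences used later in these slides.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_word_index(database, k=3):
    """Map every word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, k)):
            index[word].append((name, pos))
    return index


def find_word_hits(query, index, k=3):
    """Return (word, query position, sequence name, database position) tuples."""
    hits = []
    for q_pos, word in enumerate(words(query, k)):
        for name, s_pos in index.get(word, []):
            hits.append((word, q_pos, name, s_pos))
    return hits


print(words("FSGTWYA"))   # ['FSG', 'SGT', 'GTW', 'TWY', 'WYA']

database = {"D": "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"}   # toy one-entry database
index = build_word_index(database)
hits = find_word_hits("FIRSTLINIHFSGTWYAAMESIRPATRICKREAD", index)
print(hits)
```

The index is built once per database and reused for every query, which is the point of the preprocessing step in the heuristic strategy.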

Page 22: From Pairwise Alignment  to Database Similarity Search

BLAST (Protein Sequence Example)

1. Search the database for matching word pairs (L = 3)

2. Extend word pairs as much as possible, i.e., as long as the total score increases

• High-scoring Segment Pairs (HSPs)

Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD

D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN

Q = query sequence, D = sequence in database
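
The extension step can be sketched as an ungapped walk in both directions from the matching word. Everything specific here is an assumption for illustration: the function names, the simple scores (match +2, mismatch -1) used in place of a substitution matrix, and the `x_drop` tolerance, which lets the walk pass a few mismatches before giving up and then trims back to the best-scoring end point.

```python
def pair_score(a, b, match=2, mismatch=-1):
    """Assumed toy scoring; a substitution matrix would normally be used."""
    return match if a == b else mismatch


def extend_one_direction(query, subject, q, s, step, x_drop):
    """Walk from (q, s) in direction step; return (best gain, length at best)."""
    gain = best_gain = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += pair_score(query[q], subject[s])
        length += 1
        if gain > best_gain:
            best_gain, best_length = gain, length
        elif best_gain - gain > x_drop:
            break                      # fell too far below the best score
        q += step
        s += step
    return best_gain, best_length


def extend_hit(query, subject, q_pos, s_pos, word_len=3, x_drop=2):
    """Extend a word hit without gaps; return (score, query part, subject part)."""
    seed = sum(pair_score(query[q_pos + i], subject[s_pos + i])
               for i in range(word_len))
    right_gain, right_len = extend_one_direction(
        query, subject, q_pos + word_len, s_pos + word_len, +1, x_drop)
    left_gain, left_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, x_drop)
    q_start, s_start = q_pos - left_len, s_pos - left_len
    length = left_len + word_len + right_len
    return (seed + left_gain + right_gain,
            query[q_start:q_start + length],
            subject[s_start:s_start + length])


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
hsp = extend_hit(Q, D, Q.find("GTW"), D.find("GTW"))
print(hsp)
```

With these assumed scores the shared word GTW grows leftwards into the segment pair INIHFSGTW / IEIAFDGTW, while the right-hand side adds nothing and is left out.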

Page 23: From Pairwise Alignment  to Database Similarity Search

BLAST

3. Try to connect HSPs by aligning the sequences in between them:

THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD

INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN

The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.

Page 24: From Pairwise Alignment  to Database Similarity Search

Running BLAST to predict the function of a new protein

>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKGIGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQFGSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPFGCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKKLAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTALPGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR

Page 27: From Pairwise Alignment  to Database Similarity Search

How to interpret a BLAST score:

• The score is a measure of the similarity of the query to the sequence shown.

How do we know if the score is significant?

-Statistical significance

-Biological significance

Page 28: From Pairwise Alignment  to Database Similarity Search

How to interpret a BLAST search:

For each BLAST score we can calculate an expectation value (E-value).

The expectation value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.

An E-value is related to a probability value p (p-value): p = 1 − e^(−E), so p ≈ E when E is small.

Page 30: From Pairwise Alignment  to Database Similarity Search

BLAST E-value:

E = K · m · n · e^(−λs)

• Increases linearly with the length of the query sequence (m)

• Increases linearly with the length of the database (n)

• Decreases exponentially with the score of the alignment (s)

– K, λ: statistical parameters dependent upon the scoring system and background residue frequencies

– m = length of query; n = length of database; s = score
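
The three dependencies listed on this slide are those of the standard Karlin-Altschul form E = K · m · n · e^(−λs), and they can be checked numerically with the sketch below. The function names are hypothetical, and the values of K and λ are placeholders chosen for illustration, not parameters taken from the slides. The p-value conversion p = 1 − e^(−E) is included as well.

```python
import math

K_ASSUMED = 0.13        # placeholder value, for illustration only
LAMBDA_ASSUMED = 0.32   # placeholder value, for illustration only


def e_value(score, query_len, db_len, K=K_ASSUMED, lam=LAMBDA_ASSUMED):
    """Expected number of chance alignments scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)


base = e_value(score=50, query_len=300, db_len=10**6)
print("E-value:", base)
print("double the query length:", e_value(50, 600, 10**6) / base)
print("double the database size:", e_value(50, 300, 2 * 10**6) / base)
print("ten more score units:", e_value(60, 300, 10**6) / base)
print("p-value:", p_value(base))
```

Doubling either length doubles E, whereas a modest increase in score shrinks it by a large constant factor, which is why the score dominates the significance of a hit.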

Page 31: From Pairwise Alignment  to Database Similarity Search

What is a good E-value? (Rule of thumb)

• E-values of less than 0.00001 show that sequences are almost always homologues.

• Greater E-values can represent homologues as well.

• Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched.

• Sometimes a real match has an E-value > 1.

• Sometimes a similar E-value occurs for a short exact match and for a long, less exact match.

Page 32: From Pairwise Alignment  to Database Similarity Search

How to interpret a BLAST search:

• The score is a measure of the similarity of the query to the sequence shown.

How do we know if the score is significant?

-Statistical significance

-Biological significance

Page 33: From Pairwise Alignment  to Database Similarity Search

Treating Gaps in BLAST

>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA

Sometimes corrections to the model are needed to infer biological significance.

Page 34: From Pairwise Alignment  to Database Similarity Search

Gap Scores

• Standard solution: affine gap model

w_x = g + r(x − 1)

w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length

– Once-off cost for opening a gap

– Lower cost for extending the gap

– Changes required to algorithm
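
The affine formula can be evaluated directly. In the sketch below the function names and the penalty values (g = 11, r = 1) are assumptions for illustration; the gap length of 27 is the intron-sized gap from the DNA/mRNA example.

```python
def affine_gap_penalty(x, g, r):
    """Total penalty w_x = g + r * (x - 1) for a gap of length x >= 1."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * (x - 1)


def linear_gap_penalty(x, per_position):
    """For comparison: every gap position costs the same."""
    return per_position * x


GAP_OPEN = 11      # assumed gap open penalty g
GAP_EXTEND = 1     # assumed gap extend penalty r
INTRON_GAP = 27    # gap length in the DNA/mRNA example

print("affine:", affine_gap_penalty(INTRON_GAP, GAP_OPEN, GAP_EXTEND))
print("linear:", linear_gap_penalty(INTRON_GAP, GAP_OPEN))
```

With these values one long gap costs 37 under the affine model against 297 under a linear model charging the opening cost at every position, so a single intron-sized gap is penalized far less than many scattered gaps.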

Page 35: From Pairwise Alignment  to Database Similarity Search

Significance of Gapped Alignments

• Gapped alignments use the same statistics

• λ and K cannot be easily estimated

• Empirical estimations and gap scores are determined by looking at random alignments

Page 36: From Pairwise Alignment  to Database Similarity Search

BLAST

BLAST is a family of programs

Query: DNA or protein

Database: DNA or protein