From Pairwise Alignment to Database Similarity Search
Global vs Local Alignment
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Global Alignment
Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Mouse DNA
CATGCGTCTGACgctttttgctagcgatatcggactATCGATATA
Global vs. Local alignment
Alignment of two Genomic sequences
Human:CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
Mouse:CATGCGTCTGACgct---ttttgctagcgatatcggactATCGAT-ATA
      ****** *****         *     ***      *  ****** ***
Global Alignment
Human:CATGCGACTGAC
Mouse:CATGCGTCTGAC
Human:ATCGATCATA
Mouse:ATCGAT-ATA
Local Alignment
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Global vs. Local alignment
Alignment of Genomic DNA and mRNA
DNA: CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
mRNA:CATGCGACTGAC---------------------------ATCGATCATA
     ************                           **********
Global Alignment
DNA: CATGCGACTGAC
mRNA:CATGCGACTGAC
DNA: ATCGATCATA
mRNA:ATCGATCATA
Local Alignment
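The local alignments shown above can be found with a dynamic-programming search such as Smith-Waterman. The following is a minimal sketch, not the method used to produce the slides: the scoring values (match +1, mismatch -1, linear gap -2) and the function name smith_waterman are assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score falls to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("AAACATGCC", "TTCATGTT")
print(score, top, bottom)
```

Dropping the 0 from the max and tracing back from the bottom-right corner would turn this into a global (Needleman-Wunsch style) alignment, which is the difference the slides illustrate.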
Sequences that are similar probably have the same function
Why do we care to align sequences?
Why do we care to align sequences?
[Diagram: a new sequence is compared against a sequence database; a similar sequence ≈ similar function]
Discover the function of a new sequence
Searching Databases for similar sequences
Naïve solution: use an exact algorithm to compare every sequence in the database to the query.
Is this reasonable?
How much time will it take to calculate?
Complexity for genomes
• The human genome contains 3 × 10^9 base pairs
– Searching an mRNA against the human genome requires ~10^12 dynamic-programming matrix cells
– Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
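A back-of-the-envelope check of the cell count: an exact alignment fills one matrix cell per (query position, genome position) pair. The mRNA length and the throughput figure below are assumptions picked only to show the order of magnitude, not measured values.

```python
GENOME_LENGTH = 3 * 10**9      # human genome size, from the slide
MRNA_LENGTH = 1_000            # assumed typical mRNA length (illustrative)
CELLS_PER_SECOND = 10**9       # assumed throughput of one core (hypothetical)


def dp_cells(query_len, target_len):
    """Number of dynamic-programming cells for one exact pairwise alignment."""
    return query_len * target_len


cells = dp_cells(MRNA_LENGTH, GENOME_LENGTH)
seconds_per_query = cells / CELLS_PER_SECOND
print(f"{cells:.1e} cells, about {seconds_per_query / 60:.0f} minutes per query")
```

Under these assumptions a single query already takes tens of minutes, so millions of queries are out of reach for an exact method.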
So what can we do?
Searching databases
Solution: use a heuristic (approximate) algorithm
Heuristic strategy
• Remove regions that are not useful for meaningful alignments
• Preprocess the database into a new data structure to enable fast access
• AAAAAAAAAAA
• ATATATATATATA
• Transposable elements
What sequences to remove?
53% of the genome is repetitive DNA
Low-complexity sequences (JUNK???)
Low Complexity Sequences
What's wrong with them?
* Not informative
* Produce artificially high-scoring alignments
So what do we do?
We apply low-complexity masking to the database and the query sequence.
Mask:
TCGATCGTATATATACGGGGGGTA
TCGATCGNNNNNNNNCNNNNNNTA
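The masking step above can be imitated with a toy rule. This sketch is an assumption for illustration only: it replaces single-base runs of six or more and dinucleotide repeats of three or more units with N, and it is not one of the dedicated filters that production tools use. The name mask_low_complexity and the thresholds are hypothetical.

```python
import re

# Toy rule (assumption): a single base repeated >= 6 times,
# or a two-base unit repeated >= 3 times, counts as low complexity.
LOW_COMPLEXITY = re.compile(r"([ACGT])\1{5,}|([ACGT]{2})\2{2,}")


def mask_low_complexity(seq):
    """Replace each low-complexity run with the same number of N characters."""
    return LOW_COMPLEXITY.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

Masked positions keep their coordinates, so hits reported against the masked sequence still map back to the original.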
Heuristic strategy
• Remove low-complexity regions that are not useful for meaningful alignments
• Preprocess the database into a new data structure to enable fast access
BLAST: Basic Local Alignment Search Tool
• General idea: a good alignment contains subsequences of high identity:
– First, identify very short, almost exact matches.
– Next, extend the best short hits from the first step to longer regions of similarity.
– Finally, optimize the best hits using the Smith-Waterman algorithm.
Altschul et al., 1990
BLAST(Protein Sequence Example)
1. Search the database for matching words
Example:
Protein sequence: …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA
All words in the database (bag of words): FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS….
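The word search can be sketched as a lookup table built once from the database, which is one way to realise the "preprocess the database" idea. This is a simplified sketch: it finds exact word matches only, and the names words, build_index and find_seeds are hypothetical. The database entry name "D" and the two sequences are the illustrative ones used in the slides.

```python
from collections import defaultdict


def words(seq, k=3):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k=3):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, k)):
            index[word].append((name, pos))
    return index


def find_seeds(query, index, k=3):
    """Exact word hits between the query and the indexed database."""
    hits = []
    for q_pos, word in enumerate(words(query, k)):
        for name, d_pos in index.get(word, []):
            hits.append((word, q_pos, name, d_pos))
    return hits


database = {"D": "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"}
index = build_index(database)
seeds = find_seeds("FIRSTLINIHFSGTWYAAMESIRPATRICKREAD", index)
print(words("FSGTWYA"))
print(seeds)
```

The index is built once and reused for every query, so each query costs roughly one dictionary lookup per word rather than a scan of the whole database.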
BLAST(Protein Sequence Example)
1. Search the database for matching word pairs (L = 3)
2. Extend word pairs as much as possible, i.e., as long as the total score increases
• High-scoring Segment Pairs (HSPs)
Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
Q= query sequence, D= sequence in database
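Step 2 can be sketched as an ungapped extension of a seed in both directions. The scoring below (+1 match, -1 mismatch) and the drop-off tolerance x_drop are assumptions for illustration; a real protein search would score residue pairs with a substitution matrix. The function name extend_seed is hypothetical.

```python
def extend_seed(query, subject, q_pos, s_pos, word_len,
                match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps; return (score, q_start, q_end, s_start).

    Extension in each direction stops once the running score falls more than
    x_drop below the best score seen, and the best-scoring extent is kept.
    """
    def extend(step, q_i, s_i):
        gain = best_gain = 0
        length = best_len = 0
        while 0 <= q_i < len(query) and 0 <= s_i < len(subject):
            gain += match if query[q_i] == subject[s_i] else mismatch
            length += 1
            if gain > best_gain:
                best_gain, best_len = gain, length
            elif best_gain - gain > x_drop:
                break
            q_i += step
            s_i += step
        return best_gain, best_len

    right_gain, right_len = extend(+1, q_pos + word_len, s_pos + word_len)
    left_gain, left_len = extend(-1, q_pos - 1, s_pos - 1)
    seed_score = sum(
        match if query[q_pos + i] == subject[s_pos + i] else mismatch
        for i in range(word_len)
    )
    score = seed_score + left_gain + right_gain
    return score, q_pos - left_len, q_pos + word_len + right_len, s_pos - left_len


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
# The word GTW starts at index 12 in Q and index 9 in D.
score, q_start, q_end, d_start = extend_seed(Q, D, 12, 9, 3)
print(score, Q[q_start:q_end], D[d_start:d_start + (q_end - q_start)])
```

The returned segment pair is a candidate HSP; with a substitution matrix instead of identity scoring, similar but non-identical residues would also let the extension continue.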
BLAST
3. Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be joined into a single alignment.
Running BLAST to predict a function of a new protein
>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKGIGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQFGSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPFGCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKKLAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTALPGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR
How to interpret a BLAST score:
•The score is a measure of the similarity of the query to the sequence shown.
How do we know if the score is significant?
-Statistical significance
-Biological significance
The expectation value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
An E value is related to a probability value p (p-value).
How to interpret a BLAST search:
For each blast score we can calculate an expectation value (E-value)
BLAST E-value:
E = K · m · n · e^(−λS)
• Increases linearly with the length of the query sequence (m)
• Increases linearly with the length of the database (n)
• Decreases exponentially with the score of the alignment (S)
– K, λ: statistical parameters that depend on the scoring system and the background residue frequencies
– m = length of query; n = length of database; S = score
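The stated behaviour can be checked numerically with the Karlin-Altschul form E = K · m · n · e^(−λS). The default values of K and lam below are placeholders chosen for illustration; in practice they depend on the scoring system, as noted above.

```python
import math


def e_value(score, m, n, K=0.134, lam=0.318):
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: database length.
    K and lam are illustrative placeholder parameters (assumption).
    """
    return K * m * n * math.exp(-lam * score)


base = e_value(score=50, m=200, n=1_000_000)
print(base)
print(e_value(50, 400, 1_000_000) / base)   # doubling the query doubles E
print(e_value(60, 200, 1_000_000) / base)   # raising the score shrinks E
```

Because E scales with n, the same alignment score is less significant in a larger database, which is why thresholds on E-values depend on the database size.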
What is a good E-value? (Rule of thumb)
• E-values of less than 0.00001 show that the sequences are almost always homologues.
• Greater E-values can represent homologues as well.
• Generally, the decision whether an E-value is biologically significant depends on the size of the database that is searched.
• Sometimes a real match has an E-value > 1
• Sometimes a similar E-value occurs for a short exact match and a long, less exact match
Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Sometimes corrections to the model are needed to infer biological significance
Gap Scores
• Standard solution: affine gap model
w_x = g + r(x − 1)
w_x: total gap penalty; g: gap open penalty; r: gap extend penalty; x: gap length
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to the algorithm
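A small sketch of the affine model w_x = g + r(x − 1), compared with a linear penalty that charges the opening cost for every gap position. The penalty values g = 11 and r = 1 are assumptions for illustration, and the gap length 27 is the length of the unaligned lowercase stretch in the DNA/mRNA example.

```python
def affine_gap_penalty(x, g, r):
    """Total penalty for a gap of length x: open once (g), then extend (r each)."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return g + r * (x - 1)


def linear_gap_penalty(x, g):
    """For comparison: every gap position costs the same."""
    return g * x


intron_length = 27          # gap length in the DNA/mRNA example
g, r = 11, 1                # illustrative penalties (assumption)
print(affine_gap_penalty(intron_length, g, r))
print(linear_gap_penalty(intron_length, g))
```

With a cheap extension cost, one long gap such as an intron is penalised far less than many scattered gaps of the same total length, which matches the biology better.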
Significance of Gapped Alignments
• Gapped alignments use the same statistics
• λ and K cannot be easily estimated
• Empirical estimations and gap scores determined by looking at random alignments
BLAST is a family of programs
Query: DNA Protein
Database: DNA Protein