Sequence Alignment, Blast, Fasta, MSA

Sequence Similarity Searches (Blast), Pairwise and Multiple Sequence Alignments. Sucheta Tripathy, 16th November 2012

description

Similarity searches (Blast, Fasta), pairwise and multiple sequence alignments, for the AcSIR Ph.D. course work

Transcript of Sequence Alignment, Blast, Fasta, MSA

Page 1

Sequence Similarity Searches (Blast), Pairwise and Multiple Sequence Alignments

Sucheta Tripathy, 16th November 2012

Page 2

A protein sequence from species A
◦ Which species has the most similar protein?
◦ Where did it originate?
◦ What is its putative function?
◦ Does it contain a conserved motif, etc.?

Why Sequence Similarities?

Page 3

Blast (Basic Local Alignment Search Tool)
◦ NCBI Blast
◦ Wu-Blast
◦ PSI-Blast

Fasta

SSearch

Similarity Searches

Page 4

Heuristic (educated guess)
Does not compare the sequences in their entirety
Quickly locates short matches (seeds)
Seed length is set by the word size
Seeds are extended in both directions
A score threshold is defined
◦ > Threshold -> keep the alignment
◦ < Threshold -> discard the alignment

Blast

Page 5

Example of word size

GLKFA, word size 3 -> GLK, LKF, KFA
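The word list can be reproduced with a sliding window. This is a minimal sketch; the function name `words` is chosen here for illustration and is not part of any Blast tool.

```python
def words(sequence, word_size):
    """Return every overlapping word (k-mer) of length word_size, left to right."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


print(words("GLKFA", 3))  # ['GLK', 'LKF', 'KFA']
```

A sequence of length L yields L - word_size + 1 words, so GLKFA (length 5) gives three words of size 3.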

Page 6

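A query sequence:
◦ Nucleotide
◦ Protein

A target database
◦ Nucleotide
◦ Protein

Blast program
◦ Blastn (nucleotide query against nucleotide database)
◦ Blastp (protein query against protein database)
◦ tBlastx (slowest: translated nucleotide query against translated nucleotide database)
◦ tBlastn (protein query against translated nucleotide database)
◦ Blastx (translated nucleotide query against protein database)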

Blast Contd…
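The choice of program follows from the query type and the database type. The lookup below is only a sketch of that rule; the table layout and the helper name `programs_for` are assumptions made for illustration, not an interface of the Blast suite.

```python
# program -> (query type, query translated?, database type, database translated?)
BLAST_PROGRAMS = {
    "blastn":  ("nucleotide", False, "nucleotide", False),
    "blastp":  ("protein",    False, "protein",    False),
    "blastx":  ("nucleotide", True,  "protein",    False),
    "tblastn": ("protein",    False, "nucleotide", True),
    "tblastx": ("nucleotide", True,  "nucleotide", True),
}


def programs_for(query_type, database_type):
    """List the programs that accept this query/database combination."""
    return [name
            for name, (query, _, database, _) in BLAST_PROGRAMS.items()
            if query == query_type and database == database_type]


print(programs_for("nucleotide", "protein"))
print(programs_for("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database matches two programs: Blastn compares the nucleotides directly, while tBlastx translates both sides first, which is why it is the slowest.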

Page 7

E-value -> Expected number of hits with at least this score that would occur by chance in a database of this size

Score -> Similarity score.
◦ Analogy: the probability of rain by chance is 0.001
◦ Passing by chance, etc.
◦ The lower the E-value, the more significant the alignment.

Blast Parameters
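The relation between score and E-value can be sketched with the Karlin-Altschul formula E = K * m * n * exp(-lambda * S), where m is the query length, n the database length and S the raw score. The slides do not give this formula; it is added here as background. The default values of `lam` and `k` below are placeholders chosen only for illustration, since the real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative placeholders, not values taken from the slides.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# A higher score gives a lower E-value; a larger database gives a higher one.
for score in (40, 60, 80):
    print(score, expect_value(score, query_len=300, db_len=1_000_000))
```

The same score is therefore less significant in a larger database, which is why E-values from searches against different databases are not directly comparable.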

Page 8

Remove low-complexity regions.
Generate all the k-mers.
List all possible matching words.
◦ Blast cares only about high-scoring pairs.
◦ Fasta stores all pairs irrespective of their scores.
Extend the matches into high-scoring pairs (HSPs).
Evaluate results against the thresholds set.
Extend HSPs and join them together.

Blast Step by Step
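The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not the NCBI implementation: it uses exact word matches as seeds, extends without gaps, and skips low-complexity filtering and the joining of HSPs. The function names, the match/mismatch scores and the `x_drop` cutoff are assumptions made for this example.

```python
def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) for every exact k-mer match."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend_seed(query, subject, i, j, k, match=1, mismatch=-1, x_drop=3):
    """Extend a seed in both directions without gaps.

    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    def walk(qi, sj, step):
        best = run = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += match if query[qi] == subject[sj] else mismatch
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run >= x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + k, j + k, 1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = k * match + left_score + right_score
    return score, i - left_len, i + k + right_len, j - left_len


def find_hsps(query, subject, k=3, threshold=5):
    """Extend every seed and keep the distinct extensions scoring >= threshold."""
    hsps = {extend_seed(query, subject, i, j, k)
            for i, j in find_seeds(query, subject, k)}
    return sorted((h for h in hsps if h[0] >= threshold), reverse=True)


print(find_hsps("GATTACA", "CCGATTACACC"))  # [(7, 0, 7, 2)]
```

All five seeds in the example extend to the same segment, so a single HSP of score 7 is reported, starting at position 2 of the subject.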

Page 9

ATGGGGCGAGGCAGCGGCACCTTCGAGCGTCTCCTAGACAAGGCGACCAGCCAGCTCCTGTTGGAGACAGATTGGGAGTCCATTTTGCAGATCTGCGACCTGATCCGCCAAGGGGACACACAAGCAAAATATGCTGTGAATTCCATCAAGAAGAAAGTCAACGACAAGAACCCACACGTCGCCTTGTATGCCCTGGAGGTCATGGAATCTGTGGTAAAGAACTGTGGCCAGACAGTTCATGATGAGGTGGCCAACAAGCAGACCATGGAGGAGCTGAAGGACCTGCTGAAGAGACAAGTGGAGGTAAACGTCCGTAACAAGATCCTGTACCTGATCCAGGCCTGGGCGCATGCCTTCCGGAACGAGCCCAAGTACAAGGTGGTCCAGGACACCTACCAGATCATGAAGGTGGAGGGGCACGTCTTTCCAGAATTCAAAGAGAGCGATGCCATGTTTGCTGCCGAGAGAGCCCCAGACTGGGTGGACGCTGAGGAATGCCACCGCTGCAGGGTGCAGTTCGGGGTGATGACCCGTAAGCACCACTGCCGGGCGTGTGGGCAGATATTCTGTGGAAAGTGTTCTTCCAAGTACTCCACCATCCCCAAGTTTGGCATCGAGAAGGAGGTGCGCGTGTGTGAGCCCTGCTACGAGCAGCTGAACAGGAAAGCGGAGGGAAAGGCCACTTCCACCACTGA

Page 10

Dot matrix method (bioinfx.net)
Dynamic programming method
◦ Global (Needleman-Wunsch method)
◦ Local (Smith-Waterman method)
Word method or k-tuple method (heuristic)

Pairwise Sequence comparison

FTFTALILLAVAV
FTALLLAAV

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC50453/pdf/pnas01096-0363.pdf
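The two dynamic programming methods differ only in how the table is initialised and in whether a cell may drop below zero. The sketch below computes the optimal score only (no traceback) with a linear gap penalty; the scoring values are arbitrary choices for illustration, not values from the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style).
    """
    # First row: leading gaps cost nothing in a local alignment.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


print(alignment_score("GATTACA", "GATACA"))                          # 5
print(alignment_score("TTTGATTACATTT", "CCCGATTACACCC", local=True))  # 7
```

In the first call the global alignment needs one gap (six matches, one gap, score 5). In the second, the local method reports only the shared GATTACA core and ignores the unrelated flanks.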

Page 11

Multiple Sequence Alignment

Page 12

Uses a neighbor-joining (NJ) guide tree.
◦ N = number of sequences

Number of pairs = ½ × N! / (N − r)!, with r = 2, i.e. N(N − 1)/2

5 sequences (5, 4, 3, 2, 1) -> 10 pairs:

(5,4), (5,3), (5,2), (5,1); (4,3), (4,2), (4,1); (3,2), (3,1); (2,1)

Clustal
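The pair count can be checked by enumerating the pairs directly. This is a small sketch using the standard library; the helper name `sequence_pairs` is chosen here for illustration.

```python
from itertools import combinations
from math import comb


def sequence_pairs(n):
    """All unordered pairs of sequence labels n, n-1, ..., 1."""
    return list(combinations(range(n, 0, -1), 2))


pairs = sequence_pairs(5)
print(pairs)
print(len(pairs), comb(5, 2))  # 10 10
```

Every one of these pairs is aligned and scored before the guide tree is built, so the pairwise stage grows roughly with the square of the number of sequences.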

Page 14

Substitution matrices
Insertions and deletions are less likely than substitutions.
An insertion or deletion in a coding DNA sequence causes a frame shift unless its length is a multiple of three.

Scores and Penalty

PAM matrices (Point Accepted Mutation matrices), Margaret Dayhoff, 1978

PAM1 -> Expected rates of substitution if 1% of the amino acids have changed

BLOSUM: Blocks Substitution Matrix (the number is the % identity at which the sequences in the blocks are clustered)
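Higher PAM matrices are extrapolated from PAM1 by repeated multiplication, so PAM250 corresponds to PAM1 raised to the 250th power. The sketch below shows the idea on a made-up three-letter alphabet; `TOY_PAM1` is invented for illustration and is not Dayhoff's data. Published PAM scoring matrices are log-odds scores derived from such probabilities, a step this sketch leaves out.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, steps):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = mat_mul(result, matrix)
    return result


# Hypothetical mutation probabilities: each row sums to 1 and each
# residue has a 1% chance of changing in one PAM unit.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.003, 0.007, 0.990],
]

for row in mat_pow(TOY_PAM1, 250):
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are much smaller than 0.99: at that distance most sites have changed at least once, which is why PAM250 suits distant relationships.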

Page 15

PAM matrices are based on a simple evolutionary model

MATLFC
MLTLCC

Ancestral sequence? M(A/L)TL(F/C)C

Two changes

• Only mutations are allowed
• Sites evolve independently
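Because the model allows only substitutions and treats sites independently, the two sequences can be compared column by column. The sketch below counts the changed sites and writes the candidate ancestral sequence in the same notation as the slide; the function name `compare_sites` is an assumption made for this example.

```python
def compare_sites(seq_a, seq_b):
    """Count differing sites and build the ambiguous ancestral pattern."""
    if len(seq_a) != len(seq_b):
        raise ValueError("substitution-only model needs equal-length sequences")
    changes = 0
    pattern = []
    for a, b in zip(seq_a, seq_b):
        if a == b:
            pattern.append(a)
        else:
            changes += 1
            pattern.append(f"({a}/{b})")  # either residue could be ancestral
    return changes, "".join(pattern)


print(compare_sites("MATLFC", "MLTLCC"))  # (2, 'M(A/L)TL(F/C)C')
```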

Page 16

Guidelines for Using Matrices

Protein Query Length   Matrix     Open Gap   Extend Gap
>300                   BLOSUM50   -10        -2
85-300                 BLOSUM62   -7         -1
50-85                  BLOSUM80   -16        -4
>300                   PAM250     -10        -2
85-300                 PAM120     -16        -4
35-85                  MDM40      -12        -2
<=35                   MDM20      -22        -4
<=10                   MDM10      -23        -4

PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45
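The BLOSUM rows of the guideline table can be turned into a simple lookup. This is a sketch only: the table does not say which row owns the boundary lengths 85 and 300, so the code assumes that 85 and 300 both fall in the 85-300 row, and it returns None below 50 because no BLOSUM row covers that range. The function name `suggest_blosum` is hypothetical.

```python
def suggest_blosum(query_length):
    """Return (matrix, open_gap, extend_gap) from the BLOSUM rows of the table.

    Boundary handling (85 and 300 belong to the middle row) is an assumption.
    """
    if query_length > 300:
        return ("BLOSUM50", -10, -2)
    if query_length >= 85:
        return ("BLOSUM62", -7, -1)
    if query_length >= 50:
        return ("BLOSUM80", -16, -4)
    return None  # the table lists no BLOSUM matrix for shorter queries


for length in (40, 60, 120, 450):
    print(length, suggest_blosum(length))
```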

Page 17

Scoring Matrices

S = [s_ij] gives the score of aligning character i with character j for every pair i, j.

S T P P
C T C A

0 + 3 + (-3) + 1 = 1
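The sum above can be reproduced by looking up each aligned column in the matrix. The sketch below stores only the four entries used in the example, with the values shown on the slide; a real matrix would hold a score for every pair of amino acids. The names `SCORES` and `alignment_total` are chosen here for illustration.

```python
# Only the pairs needed for the example, with the scores from the slide.
SCORES = {
    ("S", "C"): 0,
    ("T", "T"): 3,
    ("P", "C"): -3,
    ("P", "A"): 1,
}


def alignment_total(seq_a, seq_b, matrix):
    """Sum s_ij over the aligned columns of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The matrix is symmetric, so accept the pair in either order.
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total


print(alignment_total("STPP", "CTCA", SCORES))  # 1
```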