Fundamentals of Predicting Biology from Sequence...
Transcript of Fundamentals of Predicting Biology from Sequence...
![Page 1: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/1.jpg)
Fundamentals of PredictingBiology from Sequence (Part1):
Pairwise alignment and databasesearching
MGH/MIND April 2005
Fran Lewitter, Ph.D.Director, Bioinformatics & Research ComputingWhitehead Institute for Biomedical Researchhttp://jura.wi.mit.edu/bio
![Page 2: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/2.jpg)
2WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Bioinformatics Definitions“The use of computational methods to make biologicaldiscoveries.”
Fran Lewitter
“An interdisciplinary field involving biology, computerscience, mathematics, and statistics to analyze biologicalsequence data, genome content, and arrangement, and topredict the function and structure of macromolecules.”
David Mount
![Page 3: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/3.jpg)
3WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
The Old Way
![Page 4: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/4.jpg)
4WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Topics to Cover• Introduction
– Why do alignments?– A bit of history– Definitions
• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods• Pattern searching
![Page 5: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/5.jpg)
5WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Doolittle RF, Hunkapiller MW, Hood LE, DevareSG, Robbins KC, Aaronson SA, Antoniades
HN. Science 221:275-277, 1983.
Simian sarcoma virus onc gene, v-sis, is derivedfrom the gene (or genes) encoding a platelet-
derived growth factor.
![Page 6: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/6.jpg)
6WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Evolutionary Basis ofSequence Alignment
• Similarity - observable quantity, such aspercent identity
• Homology - conclusion drawn from datathat two genes share a commonevolutionary history; no metric isassociated with this
![Page 7: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/7.jpg)
7WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Some Definitions• An alignment is a mutual arrangement of
two sequences, which exhibits where thetwo sequences are similar, and where theydiffer.
• An optimal alignment is one that exhibitsthe most correspondences and the leastdifferences. It is the alignment with thehighest score. May or may not bebiologically meaningful.
![Page 8: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/8.jpg)
8WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Alignment Methods
• Global alignment - Needleman-Wunsch(1970) maximizes the number of matchesbetween the sequences along the entirelength of the sequences.
• Local alignment - Smith-Waterman (1981)is a modification of the dynamicprogramming algorithm giving the highestscoring local match between two sequences.
![Page 9: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/9.jpg)
9WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Alignment MethodsGlobal vs Local
Modular proteins
Fn2 EGF Fn1 EGF Kringle CatalyticF12
EGFFn1 Kringle CatalyticKringlePLAT
![Page 10: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/10.jpg)
10WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Local vs Global AlignmentL G P S S K Q T G K G S - S R I W D NL N - I T K S A G K G A I M R L G D A
GLOBAL
- - - - - - - T G K G - - - - - - - -- - - - - - - A G K G - - - - - - - -
LOCALFrom Mount, Bioinformatics, 2004, pg 71
![Page 11: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/11.jpg)
11WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Possible AlignmentsA: T C A G A C G A G T GB: T C G G A G C T G
I. T C A G A C G A G T GT C G G A - - G C T G
II. T C A G A C G A G T G
T C G G A - G C - T G III. T C A G A C G A G T G
T C G G A - G - C T G
![Page 12: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/12.jpg)
12WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Topics to Cover• Introduction• Scoring alignments
– Nucleotide vs Proteins– Definitions
• Alignment methods• Significance of alignments• Database searching methods• Pattern searching
![Page 13: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/13.jpg)
13WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Example of simple scoringsystem for nucleic acids
• Match = +1 (ex. A-A, T-T, C-C, G-G)• Mismatch = -1 (ex. A-T, A-C, etc)• Gap opening = - 5• Gap extension = -2
T C A G A C G A G T GT C G G A - - G C T G+1 +1 -1 +1 +1 -5 -2 -1 -1 +1 +1 = -4
![Page 14: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/14.jpg)
14WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Possible AlignmentsA:T C A G A C G A G T GB:T C G G A G C T G
I. T C A G A C G A G T GT C G G A - - G C T G
II. T C A G A C G A G T G
T C G G A - G C - T G III. T C A G A C G A G T G
T C G G A - G - C T G
-4
-5
-5
![Page 15: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/15.jpg)
15WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Gap Penalties
• Insertion and Deletions (indels)• Affine gap costs - a scoring system for gaps
within alignments that charges a penalty forthe existence of a gap and an additional per-residue penalty proportional to the gap’slength
![Page 16: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/16.jpg)
16WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Amino Acid SubstitutionMatrices
• PAM - point accepted mutation based onglobal alignment [evolutionary model]
• BLOSUM - block substitutions based onlocal alignments [similarity among conservedsequences]
![Page 17: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/17.jpg)
17WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Substitution MatricesBLOSUM 30
BLOSUM 62
BLOSUM 80
% identity
PAM 250 (80)
PAM 120 (66)
PAM 90 (50)
% change
Lesschange
![Page 18: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/18.jpg)
18WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Part of BLOSUM 62 MatrixC S T P A G N
C 9 S -1 4T -1 1 5P -3 -1 -1 7A 0 1 0 -1 4G -3 0 -2 -2 0 6N -3 1 0 -2 -2 0
Log-odds = obs freq of aa substitutions freq expected by chance
![Page 19: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/19.jpg)
19WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Part of PAM 250 MatrixC S T P A G N
C 12 S 0 2T -2 1 3P -3 1 0 6A -2 1 1 1 2G -3 1 0 -1 1 5N -4 1 0 -1 0 0
Log-odds = pair in homologous proteins pair in unrelated proteins by chance
![Page 20: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/20.jpg)
20WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Scoring for BLAST 2 SequencesScore = 94.0 bits (230), Expect = 6e-19Identities = 45/101 (44%), Positives = 54/101 (52%), Gaps = 7/101 (6%)
Query: 204 YTGPFCDV----DTKASCYDGRGLSYRGLARTTLSGAPCQPWASEATYRNVTAEQ---AR 256 Y+ FC + + CY G G +YRG T SGA C PW S V Q A+Sbjct: 198 YSSEFCSTPACSEGNSDCYFGNGSAYRGTHSLTESGASCLPWNSMILIGKVYTAQNPSAQ 257
Query: 257 NWGLGGHAFCRNPDNDIRPWCFVLNRDRLSWEYCDLAQCQT 297 GLG H +CRNPD D +PWC VL RL+WEYCD+ C TSbjct: 258 ALGLGKHNYCRNPDGDAKPWCHVLKNRRLTWEYCDVPSCST 298
Position 1: Y - Y = 7Position 2: T - S = 1Position 3: G - S = 0Position 4: P - E = -1 . . .Position 9: - - P = -11Position 10: - - A = -1
. . . Sum 230
Based onBLOSUM62
![Page 21: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/21.jpg)
21WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Topics to Cover• Introduction• Scoring alignments• Alignment methods
– Dot matrix analysis– Exhaustive methods; Dynamic programming algorithm
(Smith-Waterman (Local), Needleman-Wunsch(Global))
– Heuristic methods; Approximate methods; word or k-tuple (FASTA, BLAST, BLAT)
• Significance of alignments• Database searching methods• Pattern searching
![Page 22: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/22.jpg)
22WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Dot Matrix Comparison
CoFaX11
Window Size = 8 Scoring Matrix: pam250 matrixMin. % Score = 50Hash Value = 2
100 200 300 400 500 600
100
200
300
400
500
F1EKK
Cata
lytic
CatalyticF2 E F1 E K
![Page 23: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/23.jpg)
23WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Dot Matrix Comparison
FLO11
Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2
200 400 600 800 1000 1200
200
400
600
800
1000
1200
FLO11
Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2
950 1000 1050 1100 1150 1200 1250 1300 1350
900
1000
1100
1200
1300
FLO11
Window Size = 16 Scoring Matrix: pam250 matrixMin. % Score = 60Hash Value = 2
200 220 240 260 280 300 320 340 360 380
200
220
240
260
280
300
320
340
360
380
400
![Page 24: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/24.jpg)
24WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
BLAST Algorithm (1990)“Ungapped” alignment
• To improve speed, use a word based hashingscheme to index database
• Limit search for similarities to only the regionnear matching words
• Use Threshold parameter to rate neighbor words
• Extend match left and right to search for highscoring alignments
![Page 25: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/25.jpg)
25WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Original BLAST Algorithm (1990)Query word (W=3)
Query: GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMPQG 18 PHG 13PEG 15 PMG 13PNG 13 PTG 12PDG 13 Etc.
NeighborhoodScorethreshold(T=13)
Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA+LA++L+ TP G R++ +W+ P+ D + ER I A
Sbjct: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA
Neighborhoodwords
![Page 26: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/26.jpg)
26WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
BLAST Refinements (1997)
• “two-hit” method for extending word pairs• Gapped alignments• Additional algorithms
– Iterate with position-specific matrix (PSI-BLAST)
– Pattern-hit initiated BLAST (PHI-BLAST)
![Page 27: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/27.jpg)
27WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Gapped BLAST
15(+) > 1322(•) > 11
(Altschul et al 1997)
![Page 28: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/28.jpg)
28WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Gapped BLAST
(Altschul et al 1997)
![Page 29: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/29.jpg)
29WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Programs to Compare twosequences - Unix or Web
NCBIBLAST 2 Sequences
EMBOSSwater - Smith-Watermanneedle - Needleman -Wunschdotmatch (dot plot)einverted or palindrome (inverted repeats)equicktandem or etandem (tandem repeats)
Otherlalign (multiple matching subsegments in two sequences)
![Page 30: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/30.jpg)
30WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments
– How strong can alignment be by chance alone?• Database searching methods• Pattern searching
![Page 31: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/31.jpg)
31WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Statistical Significance• Raw Scores - score of an alignment equal to the sum of
substitution and gap scores.• Bit scores - scaled version of an alignment’s raw score
that accounts for the statistical properties of the scoringsystem used.
• E-value - expected number of distinct alignments thatwould achieve a given score by chance. Lower E-value =>more significant.
![Page 32: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/32.jpg)
32WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Some formulas
E = Kmn e-λS
This is the Expected number of high-scoringsegment pairs (HSPs) with score at least S
for sequences of length m and n.
This is the E value for the score S.
![Page 33: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/33.jpg)
33WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods
– BLAST– BLAST vs. FASTA– BLAT
• Pattern searching
![Page 34: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/34.jpg)
34WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Questions• Why do a database search?• What database should be searched?• What alignment algorithm to use?• What do the results mean?• What parameters can be changed?
– Substitution matrices– Statistical significance– Filtering for low complexity
![Page 35: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/35.jpg)
35WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
BLASTP Results
![Page 36: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/36.jpg)
36WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
WU-BLAST vs NCBI BLAST• WU-BLAST first for gapped alignments• Use different scoring system for gaps• Report different statistics• WU-BLAST does not filter low-complexity by
default• WU-BLAST looks for and reports multiple
regions of similarity• Results will be different!
![Page 37: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/37.jpg)
37WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
BLAT• Blast-Like Alignment Tool• Developed by Jim Kent at UCSC• For DNA it is designed to quickly find sequences of >=
95% similarity of length 40 bases or more.• For proteins it finds sequences of >= 80% similarity of
length 20 amino acids or more.• DNA BLAT works by keeping an index of the entire
genome in memory - non-overlapping 11-mers (< 1 GB ofRAM)
• Protein BLAT uses 4-mers (~ 2 GB)
![Page 38: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/38.jpg)
38WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
FASTA• Index "words" and locate identities• Rescore best 10 regions• Find optimal subset of initial regions that
can be joined to form single alignment• Align highest scoring sequences using
Smith-Waterman
![Page 39: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/39.jpg)
39WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
![Page 40: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/40.jpg)
40WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Basic Searching Strategies
• Search early and often• Use specialized databases• Use multiple matrices• Use filters• Consider Biology
![Page 41: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/41.jpg)
41WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Topics to Cover• Introduction• Scoring alignments• Alignment methods• Significance of alignments• Database searching methods• Pattern searching
– PSI-BLAST– PHI-BLAST– Finding patterns
![Page 42: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/42.jpg)
42WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
PSI-BLAST• Position Specific Iterative BLAST uses a profile (or position
specific scoring matrix, PSSM) that is constructed (automatically)from a multiple alignment of the highest scoring hits in an initialBLAST search.
• The PSSM is generated by calculating position-specific scores foreach position in the alignment. Highly conserved positions receivehigh scores and weakly conserved positions receive scores near zero.
• The profile is used to perform a second (etc.) BLAST search and theresults of each "iteration" is used to refine the profile. This iterativesearching strategy results in increased sensitivity.
![Page 43: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/43.jpg)
43WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Start with a BLASTP search
![Page 44: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/44.jpg)
44WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
PSI-BLAST - Iteration 1
![Page 45: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/45.jpg)
45WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
PSSM from PSI-BLAST
POSITIONS
A R N D C Q E G H I L K M F P S T W Y V
1 0 2 3 2 4 1 1 4 3 0 3 3 7 3 3 2 1 0 1 2
2 6 0 3 3 5 4 0 3 2 5 0 1 2 2 4 1 3 2 4 2
3 4 3 0 3 3 1 3 2 4 2 3 2 5 0 1 2 1 0 5 7
4 3 2 3 2 4 9 3 3 5 4 0 3 2 5 1 2 2 4 1 2
5 0 1 2 2 4 1 6 3 3 1 3 2 0 4 8 3 1 0 3 0
6 4 3 2 …
• …
• …
N
Aminoacids
![Page 46: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/46.jpg)
46WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Pattern Hit Initiated (PHI)-BLAST
>HUMAN MSH2
MAVQPKETLQLESAAEVGFVRFFQGMPEKPTTTVRLFDRGDFYTAHGEDALLAAREVFKTQGVIKYMGPAGAKNLQSVVLSKMNFESFVKDLLLVRQYRVEVYKNRAGNKASKENDWYLAYKASPGNLSQFEDILFGNNDMSASIGVVGVKMSAVDGQRQVGVGYVDSIQRKLGLCEFPDNDQFSNLEALLIQIGPKECVLPGGETAGDMGKLRQIIQRGGILITERKKADFSTKDIYQDLNRLLKGKKGEQMNSAVLPEMENQVAVSSLSAVIKFLELLSDDSNFGQFELTTFDFSQYMKLDIAAVRALNLFQGSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDLLRRFPDLNRLAKKFQRQAANLQDCYRLYQGINQLPNVIQALEKHEGKHQKLLLAVFVTPLTDLRSDFSKFQEMIETTLDMDQVENHEFLVKPSFDPNLSELREIMNDLEKKMQSTLISAARDLGLDPGKQIKLDSSAQFGYYFRVTCKEEKVLRNNKNFSTVDIQKNGVKFTNSKLTSLNEEYTKNKTEYEEAQDAIVKEIVNISSGYVEPMQTLNDVLAQLDAVVSFAHVSNGAPVPYVRPAILEKGQGRIILKASRHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATKDSLIIIDELGRGTSTYDGFGLAWAISEYIATKIGAFCMFATHFHELTALANQIPTVNNLHVTALTTEETLTMLYQVKKGVCDQSFGIHVAELANFPKHVIECAKQKALELEEFQYIGESQGYDIMEPAAKKCYLEREQGEKIIQEFLSKVKQMPFTEMSEENITIKLKQLKAEVIAKNNSFVNEIISRIKVTTDNA mismatch
repair proteins mutSfamily signature
![Page 47: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/47.jpg)
47WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
PHI-BLAST
![Page 48: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/48.jpg)
48WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Pattern Searching
allow one mismatch, deletion andinsertion
p1=6...6 3..8 p1[1,1,1]
exact 6 character repeat separated byup to 8
p1=6...6 3...8 p1
hairpin with GAGA as the loopp1=6...8 GAGA ~p1
[ ] Acceptable { } Not acceptableX(2) 2 of anything in a rowX(2,4) from 2 to four of of anything
[DE](2)HS{P}X(2)PX(2,4)CExact pattern: TATAATATAA
4 purines followed by 4 pyrimidinesRRRRYYYY or R(4)Y(4)
![Page 49: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/49.jpg)
49WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Pattern Searching Programs
EMBOSS programs; web and Unixfuzznuc,fuzzprot,fuzztrans,dreg
scan_for_matches patfile < inputfilePatscan
![Page 50: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/50.jpg)
50WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
Sequence of Note 1 GCGTTGCTGG CGTTTTTCCA TAGGCTCCGC 31 CCCCCTGACG AGCATCACAA AAATCGACGC 61 GGTGGCGAAA CCCGACAGGA CTATAAAGAT…………..1371 GTAAAGTCTG GAAACGCGGA AGTCAGCGCC
“Here you see the actual structure of a smallfragment of dinosaur DNA,” Wu said. “Notice thesequence is made up of four compounds -adenine, guanine, thymine and cytosine. Thisamount of DNA probably contains instructions tomake a single protein - say, a hormone or anenzyme. The full DNA molecule contains threebillion of these bases. If we looked at a screen likethis once a second, for eight hours a day, it’d stilltake more than two years to look at the entireDNA strand. It’s that big.” (page 103)
![Page 51: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/51.jpg)
51WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
DinoDNA "Dinosaur DNA" fromCrichton's THE LOST WORLD p. 135
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCCTCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGCAGACACGGGTACTTTGGGGACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTGCAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGCGCCTGCGGCCTCTACTACAAAC…………………………
![Page 52: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/52.jpg)
52WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
>Erythroid transcription factor (NF-E1 DNA-binding protein)
Query: 121 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLGSbjct: 1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
Query: 301 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV NCGATSbjct: 61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116
Query: 481 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660 ATPLWRRDGTGHYLCN ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS NCQTSbjct: 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169
Query: 661 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840 STTTLWRRSPMGDPVCN ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG Sbjct: 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226
Query: 841 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF Sbjct: 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286
Query: 1021 FGGGAGGYTAPPGLSPQI 1074 FGGGAGGYTAPPGLSPQISbjct: 287 FGGGAGGYTAPPGLSPQI 304
![Page 53: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/53.jpg)
53WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
>Erythroid transcription factor (NF-E1 DNA-binding protein)
Query: 121 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 300 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLGSbjct: 1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
Query: 301 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 480 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV NCGATSbjct: 61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116
Query: 481 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 660 ATPLWRRDGTGHYLCN ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS NCQTSbjct: 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169
Query: 661 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 840 STTTLWRRSPMGDPVCN ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPGSbjct: 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226
Query: 841 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 1020 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGFSbjct: 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286
Query: 1021 FGGGAGGYTAPPGLSPQI 1074 FGGGAGGYTAPPGLSPQISbjct: 287 FGGGAGGYTAPPGLSPQI 304
![Page 54: Fundamentals of Predicting Biology from Sequence (Part1)barc.wi.mit.edu/education/lewitter.mgh.4.06.05.pdf · 2009-11-05 · WI Intro to Sequence Analysis, © Whitehead Institute,](https://reader036.fdocuments.us/reader036/viewer/2022070812/5f0b0da37e708231d42e9e8c/html5/thumbnails/54.jpg)
54WI Intro to Sequence Analysis, © Whitehead Institute, April 2005
References1. Class web site:• http://jura.wi.mit.edu/bio/education/bioinfo2005
2. Books• David Mount - Bioinformatics: Sequence and
Genome Analysis, 2nd edition, CSHL Press, 2004.• Andreas D. Baxevanis and B. F. Francis Ouellette
(editors) - Bioinformatics: A Practical Guide to theAnalysis of Genes and Proteins, 3rd edition, Wiley,2004.