Sequence Analysis Methods
CZ5225: Modeling and Simulation in Biology
Lecture 3: Sequence analysis methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore
3
Gene and Protein Sequence Alignment as a Mathematical Problem:
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best alignment (one gap, marked by an arrow on the slide):
ATTCTTGC
ATCCTATTCTAGC
Bad alignment (two gaps, marked by arrows on the slide):
AT TCTT GC
ATCCTATTCTAGC
What is a good alignment?
4
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
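The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `w` and `alignment_score` are chosen here, and the two rows are one possible alignment of the sequences CTTAACT and CGGATCAT.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # values from the slide


def w(x, y):
    """Score one alignment column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8
print(alignment_score("CTTAAC-T", "CGGATCAT"))  # 14
```

The score depends only on the columns, so any two rows of equal length can be rated this way.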
5
Pairwise Alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
An alignment of a and b:
C---TTAACT
CGGATCA--T
Column types labeled on the slide: match, mismatch, insertion gap, deletion gap.
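6
Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: a grid with sequence b (C G G A T C A T) along the top and sequence a (C T T A A C T) down the side. The alignment below is drawn as a path through the grid, with its insertion gap and deletion gap labeled.]
C---TTAACT
CGGATCA--T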
7-11
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure, built up over five slides: the path of the alignment below is drawn step by step through the grid, with sequence b (C G G A T C A T) along the top and sequence a (C T T A A C T) down the side.]
C---TTAACT
CGGATCA--T
12
Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the complete path of this alignment through the grid.]
C---TTAACT
CGGATCA--T
13
Graphic representation of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the grid for a second alignment of the same sequences.]
CTTAACT-
CGGATCAT
14
Pathway of an alignment
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: the path of this second alignment through the grid.]
CTTAACT-
CGGATCAT
15-17
Use of graph to generate alignments
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: three different paths through the grid, one per slide, each giving a different alignment.]
Slide 15:
-CTTAACT
CGGATCAT
Slide 16:
-C--TTAACT
CGGATC-AT-
Slide 17:
CTTAACT---
--CGGATCAT
18
Which pathway is better?
Sequence a: CTTAACT
Sequence b: CGGATCAT
[Figure: several alternative paths drawn through the grid.]
Multiple pathways
Each with its own score
19-22
Alignment Score
Sequence a: CTTAACT
Sequence b: CGGATCAT
C---TTAACT
CGGATCA--T
[Figure, built up over four slides: the running score is written along the path, one column at a time.]
Column 1 (C/C, match): 8
Column 2 (-/G, gap): 8 - 3 = 5
Column 3 (-/G, gap): 5 - 3 = 2
Column 4 (-/A, gap): 2 - 3 = -1
Column 5 (T/T, match): -1 + 8 = 7
Column 6 (T/C, mismatch): 7 - 5 = 2
Column 7 (A/A, match): 2 + 8 = 10
Column 8 (A/-, gap): 10 - 3 = 7
Column 9 (C/-, gap): 7 - 3 = 4
Column 10 (T/T, match): 4 + 8 = 12
Alignment score: 12
23
An optimal alignment -- the alignment of maximum score
• Let A = a1a2…am and B = b1b2…bn.
• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be computed as follows.
\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]
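24
Computing Si,j
[Figure: cell (i, j) of the grid is reached by three kinds of step: a vertical step scored w(ai, -), a horizontal step scored w(-, bj), and a diagonal step scored w(ai, bj). The bottom-right cell holds Sm,n.]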
25
Initializations
S0,0 = 0
S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24
S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21
         C    G    G    A    T    C    A    T
    0   -3   -6   -9  -12  -15  -18  -21  -24
C  -3
T  -6
T  -9
A -12
A -15
C -18
T -21
Gap symbol: -3
26
S1,1 = ?
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6
Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6
Optimal: S1,1 = 8
[Table: the initialized score table from slide 25, with cell S1,1 marked "?".]
Match: 8
Mismatch: -5
Gap symbol: -3
27
S1,2 = ?
Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8
Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9
Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5
Optimal: S1,2 = 5
[Table: the score table with S1,1 = 8 filled in and cell S1,2 marked "?".]
Match: 8
Mismatch: -5
Gap symbol: -3
28
S2,1 = ?
Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8
Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5
Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9
Optimal: S2,1 = 5
[Table: the score table with S1,1 = 8 and S1,2 = 5 filled in and cell S2,1 marked "?".]
Match: 8
Mismatch: -5
Gap symbol: -3
29
S2,2 = ?
Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3
Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2
Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2
Optimal: S2,2 = 3
[Table: the score table with S1,1 = 8, S1,2 = 5 and S2,1 = 5 filled in and cell S2,2 marked "?".]
Match: 8
Mismatch: -5
Gap symbol: -3
30
S3,5 = ?
         C    G    G    A    T    C    A    T
    0   -3   -6   -9  -12  -15  -18  -21  -24
C  -3    8    5    2   -1   -4   -7  -10  -13
T  -6    5    3    0   -3    7    4    1   -2
T  -9    2    0   -2   -5    ?
A -12
A -15
C -18
T -21
31
S3,5 = ?
         C    G    G    A    T    C    A    T
    0   -3   -6   -9  -12  -15  -18  -21  -24
C  -3    8    5    2   -1   -4   -7  -10  -13
T  -6    5    3    0   -3    7    4    1   -2
T  -9    2    0   -2   -5    5    2   -1    9
A -12   -1   -3   -5    6    3    0   10    7
A -15   -4   -6   -8    3    1   -2    8    5
C -18   -7   -9  -11    0   -2    9    6    3
T -21  -10  -12  -14   -3    8    6    4   14
Optimal score: S7,8 = 14
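The table-filling procedure of slides 23 to 31 can be sketched as follows. This is a minimal illustration under the lecture's scoring values; the function name `global_alignment_table` and its parameter names are assumptions made here, not part of the lecture.

```python
def global_alignment_table(a, b, match=8, mismatch=-5, gap=-3):
    """Fill the (len(a)+1) x (len(b)+1) score table for global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Initializations: the first row and column hold gap penalties only.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    # Each cell is the best of the diagonal, vertical and horizontal options.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,  # align a_i with b_j
                          S[i - 1][j] + gap,       # align a_i with a gap
                          S[i][j - 1] + gap)       # align b_j with a gap
    return S


table = global_alignment_table("CTTAACT", "CGGATCAT")
print(table[7][8])  # 14, the optimal score in the bottom-right cell
```

Filling the table takes time proportional to m × n, since every cell looks at only three neighbours.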
32
C T T A A C - T
C G G A T C A T
[Table: the completed score table from slide 31, with the path of this alignment traced back from S7,8 to S0,0.]
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
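Recovering the alignment from the table is a traceback: start at the bottom-right cell and repeatedly step to a neighbour that explains the current score. The sketch below is one way to do this, assuming the lecture's scoring values; `global_align` is a name chosen here, and the preference for diagonal steps on ties is an arbitrary choice.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    def w(x, y):
        return match if x == y else mismatch

    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + gap
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Traceback from S[m][n] to S[0][0], collecting columns in reverse order.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


print(global_align("CTTAACT", "CGGATCAT"))  # (14, 'CTTAAC-T', 'CGGATCAT')
```

When several alignments share the optimal score, a different tie-breaking order would return a different one of them.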
33
Local vs. Global Sequence Alignment:
Example:
DNA sequence a: ATTCTTGC
DNA sequence b: ATCCTATTCTAGC
Local alignment (one gap, marked by an arrow on the slide; gaps are ignored in local alignments):
ATTCTTGC
ATCCTATTCTAGC
Global alignment (two gaps, marked by arrows on the slide; gaps are counted in global alignments):
AT TCTT GC
ATCCTATTCTAGC
34
Global Alignment vs. Local Alignment
• global alignment: All sections are counted
• local alignment: Only local sections (normally separated by gaps) are counted
35
An optimal local alignment
• Si,j: the score of an optimal local alignment ending at ai and bj
• With proper initializations, Si,j can be computed as follows.
\[
s_{i,j} = \max
\begin{cases}
0 \\
s_{i-1,j} + w(a_i, -) \\
s_{i,j-1} + w(-, b_j) \\
s_{i-1,j-1} + w(a_i, b_j)
\end{cases}
\]
36
Initializations
         C    G    G    A    T    C    A    T
    0    0    0    0    0    0    0    0    0
C   0
T   0
T   0
A   0
A   0
C   0
T   0
Match: 8
Mismatch: -5
Gap symbol: -3
37
S1,1 = ?
Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8
Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3
Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3
Option 4: S1,1 = 0
Optimal: S1,1 = 8
[Table: the initialized local-alignment table from slide 36, with cell S1,1 marked "?".]
Match: 8
Mismatch: -5
Gap symbol: -3
38
local alignment
         C    G    G    A    T    C    A    T
    0    0    0    0    0    0    0    0    0
C   0    8    5    2    0    0    8    5    2
T   0    5    3    0    0    8    5    3   13
T   0    2    0    0    0    8    5    2   11
A   0    0    0    0    8    5    3    ?
A   0
C   0
T   0
Match: 8
Mismatch: -5
Gap symbol: -3
39
local alignment
         C    G    G    A    T    C    A    T
    0    0    0    0    0    0    0    0    0
C   0    8    5    2    0    0    8    5    2
T   0    5    3    0    0    8    5    3   13
T   0    2    0    0    0    8    5    2   11
A   0    0    0    0    8    5    3   13   10
A   0    0    0    0    8    5    2   11    8
C   0    8    5    2    5    3   13   10    7
T   0    5    3    0    2   13   10    8   18
The best score: S7,8 = 18
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18
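The local-alignment table and its best-scoring segment can be reproduced with a short sketch. It assumes the lecture's scoring values; `local_align` is a name chosen here, and the traceback simply stops at the first cell whose score is 0.

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Return (best_score, aligned_a, aligned_b) for one optimal local alignment."""
    def w(x, y):
        return match if x == y else mismatch

    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(local_align("CTTAACT", "CGGATCAT"))  # (18, 'A-C-T', 'ATCAT')
```

The only differences from the global version are the extra option 0 in the recurrence, the zero initialization, and the fact that the traceback starts at the highest-scoring cell rather than at the bottom-right corner.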
40
BLAST: Basic Local Alignment Search Tool
Procedure:
• Divide all sequences into overlapping constituent words (size k)
• Build the hash table for Sequence a.
• Scan Sequence b for hits.
• Extend hits.
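The first three steps of the procedure can be sketched as follows. This is a simplified illustration that looks only for exact word matches; BLAST itself also accepts similar words that score above a threshold. The function names `build_word_index` and `find_hits` are chosen here, and the sequences are the DNA example used earlier in the lecture.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Hash table mapping every overlapping word of size k to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_hits(seq_a, seq_b, k):
    """Scan seq_b word by word; return (i, j) pairs where seq_a[i:i+k] == seq_b[j:j+k]."""
    index = build_word_index(seq_a, k)
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], []):
            hits.append((i, j))
    return hits


print(find_hits("ATTCTTGC", "ATCCTATTCTAGC", 3))  # [(0, 5), (1, 6), (2, 7)]
```

All three hits lie on the same diagonal (j - i = 5), which is the kind of cluster the extension step then tries to grow into a longer local alignment.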
41
BLAST: Basic Local Alignment Search Tool
Step 1: Hash table for sequence a
42
Amino acid similarity matrix PAM 120
Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.
43
Amino acid similarity matrix PAM 250
This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse sets of data and a larger number of statistical steps.
44
Amino acid similarity matrix BLOSUM 45
The BLOSUM matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly related proteins. In principle, BLOSUM matrices should be more realistic for comparing distantly related proteins, but may introduce error for conventional proteins.
45
BLAST: Basic Local Alignment Search Tool
46
BLAST: Basic Local Alignment Search Tool
Step 2:
Use all of the 2-letter words in the query sequence to scan against the database sequence, and mark those with score > 8.
Word-pair scores shown on the slide: LN:LN = 9, NF:NY = 8, GW:PW = 10
Note:
Marked points can be on the diagonal and off-diagonal.
47
BLAST
Step 2: Scan sequence b for hits.
48
BLAST
Step 2: Scan sequence b for hits.
Step 3: Extend hits.
[Figure: a hit being extended in both directions.]
Terminate if the score of the extension fades away.
BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
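Hit extension can be sketched as an ungapped walk in both directions from the word hit. The stopping rule below is an assumption made for illustration: extension stops once the running score has fallen more than `x_drop` below the best score seen so far, which is one way to make "the score fades away" concrete. The names `extend_hit`, `_extend` and `x_drop`, and the match/mismatch values, are choices made here.

```python
def _extend(a, b, p, q, step, match, mismatch, x_drop):
    """Walk from (p, q) in direction step (+1 or -1); return (best gain, length at best)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= p < len(a) and 0 <= q < len(b):
        gain += match if a[p] == b[q] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # the score has faded too far below its best value
        p += step
        q += step
    return best_gain, best_len


def extend_hit(a, b, i, j, k, match=8, mismatch=-5, x_drop=10):
    """Extend the word hit a[i:i+k] / b[j:j+k] without gaps.

    Returns (score, start_in_a, start_in_b, length) of the extended segment pair.
    """
    seed = sum(match if a[i + t] == b[j + t] else mismatch for t in range(k))
    right_gain, right_len = _extend(a, b, i + k, j + k, +1, match, mismatch, x_drop)
    left_gain, left_len = _extend(a, b, i - 1, j - 1, -1, match, mismatch, x_drop)
    return (seed + left_gain + right_gain,
            i - left_len, j - left_len,
            left_len + k + right_len)


print(extend_hit("ATTCTTGC", "ATCCTATTCTAGC", 0, 5, 3))  # (51, 0, 5, 8)
```

Here the 3-letter hit grows into an 8-column segment with seven matches and one mismatch (7 × 8 - 5 = 51); the single mismatch does not stop the extension because the drop stays within `x_drop`.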
49
Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to simultaneously align more than two sequences.
Seq1: GCTC
Seq2: AC
Seq3: GATC
GC-TC
A---C
G-ATC
50
Multiple sequence alignment (MSA)
51-52
How to score an MSA?
• Sum-of-Pairs (SP-score): the score of the MSA is the sum of the scores of all pairwise alignments it contains.
GC-TC
A---C
G-ATC
Score(GC-TC / A---C) = -5 - 3 + 8 - 3 + 8 = 5
Score(GC-TC / G-ATC) = 8 - 3 - 3 + 8 + 8 = 18
Score(A---C / G-ATC) = -5 + 8 - 3 - 3 + 8 = 5
SP-score = 5 + 18 + 5 = 28
(In these sums a column that pairs two gap symbols is counted as +8.)
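The SP-score can be computed by summing the pairwise column scores over every pair of rows. The sketch below follows the slide's arithmetic, in which a column pairing two gap symbols counts as +8; that value is exposed as the parameter `gap_gap` (a name chosen here) because other formulations commonly score such a column as 0. The function names are also assumptions.

```python
from itertools import combinations


def column_score(x, y, match=8, mismatch=-5, gap=-3, gap_gap=8):
    """Score one column of a pairwise alignment induced by the MSA."""
    if x == "-" and y == "-":
        return gap_gap  # +8 reproduces the slide; 0 is a common alternative
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(rows, **scoring):
    """Sum-of-pairs score: add up the scores of all pairs of rows."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        total += sum(column_score(x, y, **scoring) for x, y in zip(r1, r2))
    return total


msa = ["GC-TC", "A---C", "G-ATC"]
print(sp_score(msa))             # 28, as on the slide
print(sp_score(msa, gap_gap=0))  # 12 if gap/gap columns score 0
```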
53
Position Specific Iterated BLAST
• PSI-BLAST is a rather permissive alignment tool, and it can find more distantly related sequences than FASTA or BLAST.
• In many cases it is much more sensitive to weak but biologically relevant sequence similarities.
54
Position Specific Iterated BLAST
PSI-BLAST is used for:
• Distant homology detection
• Fold assignment: profile-profile comparison
• Domain identification
• Evolutionary analysis (e.g. tree building)
• Sequence annotation / function assignment
• Profile export to other programs
• Sequence clustering
• Structural genomics target selection
55
Position Specific Iterated BLAST
• Collect all database sequence segments that have been aligned with the query sequence with an E-value below a set threshold (default 0.001).
• Construct a position specific scoring matrix for the collected sequences. Rough idea:
– Align all sequences to the query sequence as the template.
– Assign weights to the sequences.
– Construct the position specific scoring matrix.
• Iterate
56
How PSI-BLAST works
[Flow chart; the steps below are recovered from the slide text.]
Take a sequence:
MGLLTREIF--ILQQ
Search for similar sequences in a full sequence database.
Sequences are multiply aligned:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
Construct a profile, and represent conservation in each position numerically:
A 029001100003200
C 000070000000000
..
Y 002000080202000
Profile holds more information than a single sequence: use the profile to retrieve additional sequences.
New sequences in the multiple alignment:
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
Construct a new profile:
A 027005101003200
C 000070000000000
..
Y 202000060202000
After several iterations of this procedure we have:
• Sequence information, including links to annotation
• Several sets of multiple alignments
• Profiles, derived by us or by PSI-BLAST
• Threshold information (alignment statistics)
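The idea of representing each alignment position numerically can be sketched with plain residue counts per column. This is a deliberately simplified stand-in: PSI-BLAST's actual position specific scoring matrix also uses sequence weights, pseudocounts and log-odds scores, so the counts below are not expected to reproduce the digits shown on the slide. The function name `count_profile` is chosen here; the rows are the aligned sequences from the slide.

```python
from collections import Counter


def count_profile(rows, gap="-"):
    """Return one Counter per alignment column, counting residues and skipping gaps."""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned rows must have the same length")
    return [Counter(row[pos] for row in rows if row[pos] != gap)
            for pos in range(length)]


alignment = [
    "MGLLTREIF--ILQQ",
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
profile = count_profile(alignment)
print(profile[0])   # column 1: F seen twice, M once
print(profile[1])   # column 2: G in all four sequences (fully conserved)
```

A column dominated by a single residue marks a conserved position, and that is the extra information a profile carries compared with a single query sequence.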
57
Consensus sequence
• A sequence in which each position is decided by majority vote over the corresponding column of a multiple sequence alignment. The consensus sequence can then be used for database searches.
PEAINYGRFTPFS I KSDVW
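A majority-vote consensus can be sketched in a few lines. Two choices here are assumptions rather than part of the lecture: the gap symbol takes part in the vote like any residue, and ties go to the symbol that appears first in the column. The function name `consensus` and the example rows (taken from the PSI-BLAST slides) are used for illustration only.

```python
from collections import Counter


def consensus(rows):
    """Pick the most frequent symbol in every column of an aligned set of rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


alignment = [
    "MGLLTREIF--ILQQ",
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]
print(consensus(alignment))  # FGLLRT-I---YMTQ
```

Removing the gap symbols from the result gives a plain sequence that can be used as a database query.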
58
Flow chart of PSI-BLAST
[Flow chart; the steps below are recovered from the slide text.]
Take a sequence:
MGLLTREIF--ILQQ
Search for similar sequences in a full sequence database.
Sequences are multiply aligned:
MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ
Construct a profile, and represent conservation in each position numerically:
A 029001100003200
C 000070000000000
..
Y 002000080202000
Profile holds more information than a single sequence: use the profile to retrieve additional sequences.
Using profile to search for similar sequences in a full sequence database.
New sequences in the multiple alignments:
FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS
Construct a new profile:
A 027005101003200
C 000070000000000
..
Y 202000060202000
New iteration
59
PSI-BLAST
NCBI PSI-BLAST tutorial:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
67
Use of PSI-BLAST to probe the function of a viral protein
PEAINYGRFTPFS I KSDVW
68
Summary of Today’s lecture
• Sequence alignment methods revisited:
– Pair-wise alignment
– Multiple sequence alignment
– BLAST
– PSI-BLAST
• Use of PSI-BLAST to probe protein function