Sequence Analysis Methods

Post on 15-Jan-2016


description

CZ5225: Modeling and Simulation in Biology. Lecture 3: Sequence analysis methods. Prof. Chen Yu Zong. Tel: 6874-6877. Email: csccyz@nus.edu.sg. http://xin.cz3.nus.edu.sg. Room 07-24, level 7, SOC1, National University of Singapore.

Transcript of Sequence Analysis Methods

CZ5225: Modeling and Simulation in Biology

Lecture 3: Sequence analysis methods

Prof. Chen Yu Zong

Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg

Room 07-24, level 7, SOC1, National University of Singapore

2

Sequence Analysis Methods

3

Gene and Protein Sequence Alignment as a Mathematical Problem:

Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC

Best alignment (a kept in one piece; an arrow in the original figure is labeled "gap"):

     ATTCTTGC
ATCCTATTCTAGC

Bad alignment (a split into AT, TCTT and GC; two arrows in the original figure are labeled "gap"):

AT     TCTT       GC
ATCCTATTCTAGC

What is a good alignment? 

4

How to rate an alignment?

• Match: +8 (w(x, y) = 8, if x = y)

• Mismatch: -5 (w(x, y) = -5, if x ≠ y)

• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
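The scoring rule above can be written down directly. The following is a minimal Python sketch: the names `w` and `alignment_score` are illustrative rather than taken from the lecture, and the example rows are one possible alignment of the sequences CTTAACT and CGGATCAT.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # weights from the slide


def w(x, y):
    """Score a single alignment column; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a, row_b):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(w(x, y) for x, y in zip(row_a, row_b))


# Five matches, three mismatches and one gap symbol:
# 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
print(alignment_score("CTTAAC-T", "CGGATCAT"))
```

The score is additive over columns, which is what later makes a cell-by-cell table computation possible.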

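5

Pairwise Alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

An alignment of a and b:

a: C---TTAACT
b: CGGATCA--T

Column types labeled in the figure:

• Match: the same letter in both rows (C/C, T/T, A/A, T/T)

• Mismatch: different letters in the two rows (T/C)

• Insertion gap: gap symbols in a (opposite G, G, A of b)

• Deletion gap: gap symbols in b (opposite A, C of a)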

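6

Alignment Graph

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: a grid with sequence b (C G G A T C A T) across the top and sequence a (C T T A A C T) down the side. The alignment below is drawn as a path through the grid, with the insertion gap and the deletion gap labeled.]

a: C---TTAACT
b: CGGATCA--T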

7-11

Graphic representation of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure, built up over five slides: the grid is filled in step by step, with b (C G G A T C A T) across the top and a (C T T A A C T) down the side, following the alignment below.]

a: C---TTAACT
b: CGGATCA--T

12

Pathway of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: the grid with b across the top and a down the side; the pathway of the alignment below is highlighted.]

a: C---TTAACT
b: CGGATCA--T

13-14

Graphic representation and pathway of an alignment

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: the grid with b across the top and a down the side; the pathway of the alignment below is highlighted.]

a: CTTAACT-
b: CGGATCAT

15-17

Use of graph to generate alignments

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure, three slides: each slide draws a different pathway through the grid and shows the alignment it generates.]

Slide 15:

a: -CTTAACT
b: CGGATCAT

Slide 16:

a: -C--TTAACT
b: CGGATC-AT-

Slide 17:

a: CTTAACT---
b: --CGGATCAT

18

Which pathway is better?

Sequence a: CTTAACT
Sequence b: CGGATCAT

[Figure: the grid with b across the top and a down the side, with several pathways drawn.]

Multiple pathways

Each pathway has its own score
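To make the comparison of pathways concrete, here is a small Python sketch that turns an alignment into its sequence of grid moves and ranks a few candidate alignments of CTTAACT and CGGATCAT by score. The move letters D, H and V and the function names are assumptions made for this illustration; they assume b runs across the top of the grid and a down the side, and the weights are the +8 / -5 / -3 values given earlier.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def path_moves(row_a, row_b):
    """Encode an alignment as grid moves.

    D = diagonal (letter against letter)
    H = horizontal (gap in a, a letter of b is consumed)
    V = vertical (gap in b, a letter of a is consumed)
    """
    moves = []
    for x, y in zip(row_a, row_b):
        if x == "-":
            moves.append("H")
        elif y == "-":
            moves.append("V")
        else:
            moves.append("D")
    return "".join(moves)


def path_score(row_a, row_b):
    """Score of a pathway = sum of its column scores."""
    return sum(w(x, y) for x, y in zip(row_a, row_b))


candidates = [
    ("CTTAACT-", "CGGATCAT"),
    ("-CTTAACT", "CGGATCAT"),
    ("CTTAACT---", "--CGGATCAT"),
]

# Best pathway first.
ranked = sorted(candidates, key=lambda rows: path_score(*rows), reverse=True)
for row_a, row_b in ranked:
    print(path_moves(row_a, row_b), path_score(row_a, row_b))
```

Enumerating pathways like this does not scale, since the number of pathways grows very quickly with sequence length; it only illustrates that different pathways carry different scores.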

19-22

Alignment Score

Sequence a: CTTAACT
Sequence b: CGGATCAT

a: C---TTAACT
b: CGGATCA--T

[Figure, built up over four slides: the pathway of this alignment through the grid, with the running score written next to each step.]

Running score along the pathway:

C/C match: 8
-/G gap: 8 - 3 = 5
-/G gap: 5 - 3 = 2
-/A gap: 2 - 3 = -1
T/T match: -1 + 8 = 7
T/C mismatch: 7 - 5 = 2
A/A match: 2 + 8 = 10
A/- gap: 10 - 3 = 7
C/- gap: 7 - 3 = 4
T/T match: 4 + 8 = 12

Alignment score: 12

23

An optimal alignment -- the alignment of maximum score

• Let A = a1a2…am and B = b1b2…bn.

• Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$

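24

Computing Si,j

[Figure: cell (i, j) of the table is reached by three arrows: from cell (i-1, j) with weight w(ai,-), from cell (i, j-1) with weight w(-,bj), and from cell (i-1, j-1) with weight w(ai,bj). The last cell of the table holds Sm,n.]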

25

Initializations

S0,0 = 0

S0,1 = -3, S0,2 = -6, S0,3 = -9, S0,4 = -12, S0,5 = -15, S0,6 = -18, S0,7 = -21, S0,8 = -24

S1,0 = -3, S2,0 = -6, S3,0 = -9, S4,0 = -12, S5,0 = -15, S6,0 = -18, S7,0 = -21

        -    C    G    G    A    T    C    A    T
   -    0   -3   -6   -9  -12  -15  -18  -21  -24
   C   -3
   T   -6
   T   -9
   A  -12
   A  -15
   C  -18
   T  -21

Gap symbol: -3

26

S1,1 = ?

Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8

Option 2: S1,1 = S0,1 + w(a1, -) = -3 - 3 = -6

Option 3: S1,1 = S1,0 + w(-, b1) = -3 - 3 = -6

Optimal: S1,1 = 8

[Table: row 0 and column 0 initialized; cell S1,1 marked "?".]

Match: 8

Mismatch: -5

Gap symbol: -3

27

S1,2 = ?

Option 1: S1,2 = S0,1 + w(a1, b2) = -3 - 5 = -8

Option 2: S1,2 = S0,2 + w(a1, -) = -6 - 3 = -9

Option 3: S1,2 = S1,1 + w(-, b2) = 8 - 3 = 5

Optimal: S1,2 = 5

[Table: row 0 and column 0 initialized, S1,1 = 8 filled in; cell S1,2 marked "?".]

Match: 8

Mismatch: -5

Gap symbol: -3

28

S2,1 = ?

Option 1: S2,1 = S1,0 + w(a2, b1) = -3 - 5 = -8

Option 2: S2,1 = S1,1 + w(a2, -) = 8 - 3 = 5

Option 3: S2,1 = S2,0 + w(-, b1) = -6 - 3 = -9

Optimal: S2,1 = 5

[Table: row 0 and column 0 initialized, S1,1 = 8 and S1,2 = 5 filled in; cell S2,1 marked "?".]

Match: 8

Mismatch: -5

Gap symbol: -3

29

S2,2 = ?

Option 1: S2,2 = S1,1 + w(a2, b2) = 8 - 5 = 3

Option 2: S2,2 = S1,2 + w(a2, -) = 5 - 3 = 2

Option 3: S2,2 = S2,1 + w(-, b2) = 5 - 3 = 2

Optimal: S2,2 = 3

[Table: row 0 and column 0 initialized, S1,1 = 8, S1,2 = 5 and S2,1 = 5 filled in; cell S2,2 marked "?".]

Match: 8

Mismatch: -5

Gap symbol: -3
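The four worked cells can be checked mechanically. The sketch below is an illustration in Python; `cell_options` is a hypothetical helper name, and the table `S` is indexed as S[i][j] with a = CTTAACT along the rows and b = CGGATCAT along the columns.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def cell_options(S, a, b, i, j):
    """Return (option 1, option 2, option 3) for cell S[i][j], with i, j >= 1."""
    option1 = S[i - 1][j - 1] + w(a[i - 1], b[j - 1])  # ai against bj
    option2 = S[i - 1][j] + w(a[i - 1], "-")           # ai against a gap
    option3 = S[i][j - 1] + w("-", b[j - 1])           # bj against a gap
    return option1, option2, option3


a, b = "CTTAACT", "CGGATCAT"
GAP = -3

# Initialize row 0 and column 0; every other cell is still unknown.
S = [[None] * (len(b) + 1) for _ in range(len(a) + 1)]
for j in range(len(b) + 1):
    S[0][j] = GAP * j
for i in range(len(a) + 1):
    S[i][0] = GAP * i

worked = {}
for i, j in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    worked[(i, j)] = cell_options(S, a, b, i, j)
    S[i][j] = max(worked[(i, j)])
    print(f"S{i},{j}: options {worked[(i, j)]} -> {S[i][j]}")
```

Each cell only needs its upper, left and upper-left neighbours, so any fill order that visits those three first (for example row by row) works.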

30-31

S3,5 = ?

Continuing row by row gives S3,5 = 5. The completed table, with the optimal score S7,8 = 14 in the last cell:

        -    C    G    G    A    T    C    A    T
   -    0   -3   -6   -9  -12  -15  -18  -21  -24
   C   -3    8    5    2   -1   -4   -7  -10  -13
   T   -6    5    3    0   -3    7    4    1   -2
   T   -9    2    0   -2   -5    5    2   -1    9
   A  -12   -1   -3   -5    6    3    0   10    7
   A  -15   -4   -6   -8    3    1   -2    8    5
   C  -18   -7   -9  -11    0   -2    9    6    3
   T  -21  -10  -12  -14   -3    8    6    4   14

optimal score: 14

32

The optimal alignment, read off the completed table by tracing the pathway that leads to S7,8 = 14:

a: C T T A A C - T
b: C G G A T C A T

[Table: the completed score table repeated, with the pathway of this alignment highlighted.]

8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
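The table fill and the traceback can be put together in a short program. This is a sketch of the standard global alignment dynamic program (often called Needleman-Wunsch) using the lecture's weights; the function name `global_align` and the tie-breaking rule in the traceback (diagonal first, then a gap in b, then a gap in a) are choices made for this illustration, and other tie-breaking rules can return a different alignment with the same score.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_align(a, b):
    """Return (score table S, aligned row for a, aligned row for b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: a prefix aligned against gaps only.
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + w(a[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + w("-", b[j - 1])

    # Fill the table row by row with the three-way maximum.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
            )

    # Traceback from S[m][n] to S[0][0].
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + w(a[i - 1], "-"):
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return S, "".join(reversed(row_a)), "".join(reversed(row_b))


S, row_a, row_b = global_align("CTTAACT", "CGGATCAT")
print(S[-1][-1])
print(row_a)
print(row_b)
```

The table has (m+1)(n+1) cells and each cell costs constant work, so both time and memory grow with the product of the two sequence lengths.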

33

Local vs. Global Sequence Alignment:

Example:

DNA sequence a: ATTCTTGC

DNA sequence b: ATCCTATTCTAGC

Local alignment (gaps ignored in local alignments; one arrow in the original figure is labeled "gap"):

     ATTCTTGC
ATCCTATTCTAGC

Global alignment (gaps counted in global alignments; two arrows in the original figure are labeled "gap"):

AT     TCTT       GC
ATCCTATTCTAGC

34

Global Alignment vs. Local Alignment

• global alignment: All sections are counted

• local alignment: Only local sections (normally separated by gaps) are counted

35

An optimal local alignment

• Si,j: the score of an optimal local alignment ending at ai and bj

• With proper initializations, Si,j can be computed as follows.

$$
S_{i,j} = \max
\begin{cases}
0 \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j) \\
S_{i-1,j-1} + w(a_i, b_j)
\end{cases}
$$

36

Initializations

       -   C   G   G   A   T   C   A   T
   -   0   0   0   0   0   0   0   0   0
   C   0
   T   0
   T   0
   A   0
   A   0
   C   0
   T   0

Match: 8

Mismatch: -5

Gap symbol: -3

37

S1,1 = ?

Option 1: S1,1 = S0,0 + w(a1, b1) = 0 + 8 = 8

Option 2: S1,1 = S0,1 + w(a1, -) = 0 - 3 = -3

Option 3: S1,1 = S1,0 + w(-, b1) = 0 - 3 = -3

Option 4: S1,1 = 0

Optimal: S1,1 = 8

[Table: row 0 and column 0 initialized to 0; cell S1,1 marked "?".]

Match: 8

Mismatch: -5

Gap symbol: -3

38

local alignment

       -   C   G   G   A   T   C   A   T
   -   0   0   0   0   0   0   0   0   0
   C   0   8   5   2   0   0   8   5   2
   T   0   5   3   0   0   8   5   3  13
   T   0   2   0   0   0   8   5   2  11
   A   0   0   0   0   8   5   3   ?
   A   0
   C   0
   T   0

Match: 8

Mismatch: -5

Gap symbol: -3

39

local alignment

       -   C   G   G   A   T   C   A   T
   -   0   0   0   0   0   0   0   0   0
   C   0   8   5   2   0   0   8   5   2
   T   0   5   3   0   0   8   5   3  13
   T   0   2   0   0   0   8   5   2  11
   A   0   0   0   0   8   5   3  13  10
   A   0   0   0   0   8   5   2  11   8
   C   0   8   5   2   5   3  13  10   7
   T   0   5   3   0   2  13  10   8  18

The best score: 18 (cell S7,8)

a: A - C - T
b: A T C A T

8 - 3 + 8 - 3 + 8 = 18
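The local alignment table differs from the global one in three places: the borders are zero, every cell has the extra option 0, and the traceback starts at the largest cell and stops at the first 0. The following Python sketch of this scheme (commonly known as Smith-Waterman) uses the lecture's weights; the function name `local_align` and the diagonal-first tie-breaking in the traceback are assumptions of this illustration.

```python
def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score; '-' is the gap symbol."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def local_align(a, b):
    """Return (best score, aligned piece of a, aligned piece of b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # borders stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,  # start a fresh local alignment here
                S[i - 1][j - 1] + w(a[i - 1], b[j - 1]),
                S[i - 1][j] + w(a[i - 1], "-"),
                S[i][j - 1] + w("-", b[j - 1]),
            )
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Trace back from the best cell until a 0 is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + w(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + w(a[i - 1], "-"):
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, piece_a, piece_b = local_align("CTTAACT", "CGGATCAT")
print(best)
print(piece_a)
print(piece_b)
```

Because a cell can always fall back to 0, poorly matching flanks never drag the score down, which is why only the best-matching sections contribute.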

40

BLAST: Basic Local Alignment Search Tool

Procedure:

• Divide all sequences into overlapping constituent words (size k)

• Build the hash table for Sequence a.

• Scan Sequence b for hits.

• Extend hits.
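The first three steps of the procedure can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it only reports exact word matches, whereas BLAST also accepts similar words whose score passes a threshold. The function names, the choice k = 3 and the use of the earlier example sequences ATTCTTGC and ATCCTATTCTAGC are assumptions of this sketch.

```python
def build_word_table(seq, k):
    """Map every overlapping word of length k to its start positions in seq."""
    table = {}
    for i in range(len(seq) - k + 1):
        table.setdefault(seq[i:i + k], []).append(i)
    return table


def find_hits(table, seq_b, k):
    """Scan seq_b word by word; return (position in a, position in b) pairs."""
    hits = []
    for j in range(len(seq_b) - k + 1):
        for i in table.get(seq_b[j:j + k], []):
            hits.append((i, j))
    return hits


a = "ATTCTTGC"
b = "ATCCTATTCTAGC"
k = 3

table = build_word_table(a, k)
hits = find_hits(table, b, k)
print(hits)

# Hits that share the same value of j - i lie on the same diagonal.
print(sorted({j - i for i, j in hits}))
```

The hash table makes each word lookup a constant-time operation, so the scan is roughly linear in the length of sequence b instead of filling a full alignment table.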

41

BLAST: Basic Local Alignment Search Tool

Step 1: Hash table for sequence A

42

Amino acid similarity matrix PAM 120

Instead of using the simple values +8 and -5 for matches and mismatches, this statistically derived score matrix is used to rank the level of similarity between two amino acids.

43

Amino acid similarity matrix PAM 250

This is a more widely used score matrix for ranking the level of similarity of two amino acids. It is derived from more diverse sets of data and a larger number of statistical steps.

44

Amino acid similarity matrix Blosum 45

The Blosum matrices were calculated using data from the BLOCKS database, which contains alignments of more distantly-related proteins. In principle, Blosum matrices should be more realistic for comparing distantly-related proteins, but may introduce error for conventional proteins.

45

BLAST: Basic Local Alignment Search Tool

46

BLAST: Basic Local Alignment Search Tool

Step 2:

Use all of the 2-letter words in the query sequence to scan against the database sequence and mark those with score > 8

Word-pair scores shown in the figure: LN:LN = 9, NF:NY = 8, GW:PW = 10

Note:

Marked points can be on the diagonal and off-diagonal

47

BLAST Step 2: Scan sequence b for hits.

48

BLAST Step 2: Scan sequence b for hits.

Step 3: Extend hits.

[Figure: a marked point labeled "hit".]

Terminate if the score of the extension fades away.

BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
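Hit extension can be illustrated with an ungapped extension along the diagonal of a hit. The sketch below is a simplification written for this transcript: the function name `extend_hit`, the +8 / -5 weights and the cut-off `x_drop` (how far the running score may fall below the best score seen before the extension is abandoned, one way to make "fades away" precise) are assumptions of this example, not BLAST's actual parameters.

```python
def extend_hit(a, b, i, j, k, match=8, mismatch=-5, x_drop=10):
    """Extend an exact word hit a[i:i+k] == b[j:j+k] without gaps.

    Returns (best score, (start, end) in a, (start, end) in b),
    with end positions exclusive.
    """
    best = score = k * match  # the word itself consists of k matches

    # Extend to the right of the word.
    best_right = k
    off = k
    while i + off < len(a) and j + off < len(b):
        score += match if a[i + off] == b[j + off] else mismatch
        off += 1
        if score > best:
            best, best_right = score, off
        elif best - score > x_drop:
            break  # the score has faded away

    # Extend to the left of the word, starting from the best score so far.
    score = best
    best_left = 0
    off = 1
    while i - off >= 0 and j - off >= 0:
        score += match if a[i - off] == b[j - off] else mismatch
        if score > best:
            best, best_left = score, off
        elif best - score > x_drop:
            break
        off += 1

    return best, (i - best_left, i + best_right), (j - best_left, j + best_right)


a = "ATTCTTGC"
b = "ATCCTATTCTAGC"

# The word ATT occurs at position 0 of a and position 5 of b.
result = extend_hit(a, b, 0, 5, 3)
print(result)
```

In this example the extension runs over one mismatch (T against A) because the score recovers afterwards; a smaller `x_drop` would stop it earlier.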

49

Multiple sequence alignment (MSA)

• The multiple sequence alignment problem is to simultaneously align more than two sequences.

Seq1: GCTC
Seq2: AC
Seq3: GATC

GC-TC
A---C
G-ATC

50

Multiple sequence alignment (MSA)

51-52

How to score an MSA?

• Sum-of-Pairs (SP-score)

The score of the MSA

GC-TC
A---C
G-ATC

is the sum of the scores of its three pairwise alignments. In these sums a column that pairs two gap symbols is counted like a match (+8).

GC-TC
A---C
Score: -5 - 3 + 8 - 3 + 8 = 5

GC-TC
G-ATC
Score: 8 - 3 - 3 + 8 + 8 = 18

A---C
G-ATC
Score: -5 + 8 - 3 - 3 + 8 = 5

SP-score = 5 + 18 + 5 = 28
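The sum-of-pairs computation is easy to reproduce. In the Python sketch below, the function names are illustrative, and the treatment of a gap/gap column as a match (+8) is an assumption taken from the arithmetic shown on the slide; other conventions score such columns as 0.

```python
from itertools import combinations


def w(x, y, match=8, mismatch=-5, gap=-3):
    """Column score; identical symbols (including gap/gap) count as a match."""
    if x == y:
        return match
    if x == "-" or y == "-":
        return gap
    return mismatch


def pair_score(row1, row2):
    """Score of the pairwise alignment induced by two rows of the MSA."""
    return sum(w(x, y) for x, y in zip(row1, row2))


def sp_score(msa):
    """Sum-of-pairs score: add up the scores of all pairs of rows."""
    return sum(pair_score(r1, r2) for r1, r2 in combinations(msa, 2))


msa = ["GC-TC", "A---C", "G-ATC"]
for r1, r2 in combinations(msa, 2):
    print(r1, r2, pair_score(r1, r2))
print(sp_score(msa))
```

For an MSA with N rows there are N(N-1)/2 pairs, so the cost of scoring grows quadratically with the number of sequences.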

53

Position Specific Iterated BLAST

• PSI-BLAST is a rather permissive alignment tool, and it can find more distantly related sequences than FASTA or BLAST.

• In many cases it is much more sensitive to weak but biologically relevant sequence similarities.

54

Position Specific Iterated BLAST

PSI-BLAST is used for:

• Distant homology detection

• Fold assignment: profile-profile comparison

• Domain identification

• Evolutionary analysis (e.g. tree building)

• Sequence annotation / function assignment

• Profile export to other programs

• Sequence clustering

• Structural genomics target selection

55

Position Specific Iterated BLAST

• Collect all database sequence segments that have been aligned with the query sequence with an E-value below the set threshold (default 0.001)

• Construct a position specific scoring matrix for the collected sequences. Rough idea:
– Align all sequences to the query sequence as the template.
– Assign weights to the sequences
– Construct the position specific scoring matrix

• Iterate
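As a rough illustration of the middle step, the Python sketch below counts, for each column of a multiple alignment, how often each residue occurs and converts the counts to frequencies. It is deliberately simplified: it gives every sequence the same weight and produces plain frequencies, so it is not the scoring matrix PSI-BLAST builds. The function names are hypothetical, and the four aligned rows are the ones shown in the lecture's PSI-BLAST flow diagram.

```python
from collections import Counter


def column_counts(msa):
    """Per-column residue counts of an alignment; gap symbols are skipped."""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all rows of the alignment must have the same length")
    return [Counter(row[c] for row in msa if row[c] != "-") for c in range(length)]


def column_frequencies(msa):
    """Per-column residue frequencies; an all-gap column gives an empty dict."""
    freqs = []
    for counts in column_counts(msa):
        total = sum(counts.values())
        freqs.append({res: n / total for res, n in counts.items()} if total else {})
    return freqs


msa = [
    "MGLLTREIF--ILQQ",  # query, used as the template
    "FGLGRT-I-T-YMTN",
    "-GLVRT-I---LGLE",
    "FGLLRT-I---YMTQ",
]

counts = column_counts(msa)
freqs = column_frequencies(msa)
print(counts[0])  # first column: F twice, M once (the gap is skipped)
print(freqs[1])   # second column: G in every sequence
```

A fully conserved column such as the second one is where a position specific matrix differs most from a general matrix like PAM or Blosum: a substitution there is penalized much more heavily than the same substitution in a variable column.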

56

How PSI-BLAST works

Take a sequence:

MGLLTREIF--ILQQ

Search for similar sequences in a full sequence database. Sequences are multiply aligned:

MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ

Construct a profile, and represent conservation in each position numerically:

A 029001100003200
C 000070000000000
..
Y 002000080202000

Profile holds more information than a single sequence: use the profile to retrieve additional sequences.

New sequences in the multiple alignment:

FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS

Construct a new profile:

A 027005101003200
C 000070000000000
..
Y 202000060202000

After several iterations of this procedure we have:

• Sequence information, including links to annotation

• Several sets of multiple alignments.

• Profiles, derived by us or by PSI-BLAST

• Threshold information (alignment statistics)

57

Consensus sequence

• A sequence where each position is defined by majority vote based on a multiple sequence alignment. Use the consensus sequence for database search.

PEAINYGRFTPFS I KSDVW
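The majority vote can be sketched as follows in Python. Two choices here are assumptions of this illustration rather than part of the lecture: the gap symbol takes part in the vote like any other symbol, and ties go to the symbol that appears first in the column. The small alignment is the three-sequence MSA used earlier.

```python
from collections import Counter


def consensus(msa):
    """Majority-vote consensus: the most common symbol of every column."""
    out = []
    for column in zip(*msa):
        # most_common(1) returns the symbol with the highest count;
        # among equal counts, the one encountered first wins.
        symbol, _ = Counter(column).most_common(1)[0]
        out.append(symbol)
    return "".join(out)


msa = ["GC-TC", "A---C", "G-ATC"]
print(consensus(msa))  # G--TC
```

Compared with a profile, the consensus keeps only the winner of each column and discards how strongly it won, which is why a profile search can detect weaker similarities.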

58

Flow chart of PSI-BLAST

1. Take a sequence:

MGLLTREIF--ILQQ

2. Search for similar sequences in a full sequence database.

3. Sequences are multiply aligned:

MGLLTREIF--ILQQ
FGLGRT-I-T-YMTN
-GLVRT-I---LGLE
FGLLRT-I---YMTQ

4. Construct a profile, and represent conservation in each position numerically:

A 029001100003200
C 000070000000000
..
Y 002000080202000

Profile holds more information than a single sequence: use the profile to retrieve additional sequences.

5. Use the profile to search for similar sequences in a full sequence database.

6. New sequences in the multiple alignments:

FGLLRT-I-T-YMTN
-RLTRD-I---LGLY
FGLLRT-I---FMTS

7. Construct a new profile:

A 027005101003200
C 000070000000000
..
Y 202000060202000

8. New iteration: return to step 5 with the new profile.

59-66

PSI-BLAST

NCBI PSI-BLAST tutorial:

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html

67

Use of PSI-BLAST to probe the function of a viral protein

PEAINYGRFTPFS I KSDVW

68

Summary of Today’s lecture

• Sequence alignment methods revisited:
– Pair-wise alignment
– Multiple sequence alignment
– BLAST
– PSI-BLAST

• Use of PSI-BLAST to probe protein function