Lecture 9: Protein Sequence Profiles and Motif Applications · 2014. 4. 29. · gcback = 0.38...



Lecture 9: Protein Sequence Profiles and Motif Applications

• Calculating profiles of protein sequences
- Average Score Method
• Pattern and Profile applications
• PSI-BLAST
• Identifying new sequence motifs:
- Gibbs sampling

Some slides adapted from slides by Dr. Keith Dunker. Some slides adapted from slides created by Dr. Zhiping Weng (Boston University).


Protein Sequence Profiles

§ A profile is a position-specific scoring matrix that gives a quantitative description of a sequence motif
§ For protein sequences, the profile scoring matrix has N rows and 20+ columns, where N is the length of the profile (number of sequence positions)
§ The first 20 columns give the score (or probability) of finding each of the 20 amino acids at that position in the target sequence
§ Additional columns contain gap penalties for insertions/deletions at that position in the target sequence
§ Mkj = score for the jth amino acid (or gap) at the kth position in the sequence


Calculating the Profile Matrix for Protein Sequences: Average Score Method

$$M_{kj} = \sum_{i=1}^{20} \frac{C_{ki}}{Z}\, S_{ij}$$

• Mkj = profile matrix element (score for the jth amino acid at the kth position)
• Cki = number of amino acids of type i at position k in the sequence/profile
• Z = number of aligned sequences
• Sij = score between the ith and the jth amino acids based on a scoring matrix (e.g., PAM250 or BLOSUM62)

Derived from paper by Gribskov et al, (1987) PNAS 84:4355-8


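Average Score Method: Example

Aligned sequences (Z = 8):

1  AGGCTHFWKGESM
2  SGACSRWYRGQSL
3  TGSCLKFFHG-LM
4  SGACSRMYRGESL
5  TGGCSKWMRGQSV
6  SGNCSKMWKGNSI
7  FGACSHWYKGDSL
8  SGQCSRFYRGQSL

Position k = 7

$$M_{kj} = \sum_{i=1}^{20} \frac{C_{ki}}{Z}\, S_{ij}$$

C7F = 3, C7W = 3, C7M = 2, other C7i = 0

$$M_{7F} = \tfrac{3}{8} S_{FF} + \tfrac{3}{8} S_{WF} + \tfrac{2}{8} S_{MF}$$

$$M_{7W} = \tfrac{3}{8} S_{FW} + \tfrac{3}{8} S_{WW} + \tfrac{2}{8} S_{MW}$$

$$M_{7M} = \tfrac{3}{8} S_{FM} + \tfrac{3}{8} S_{WM} + \tfrac{2}{8} S_{MM}$$

$$M_{7j} = \tfrac{3}{8} S_{Fj} + \tfrac{3}{8} S_{Wj} + \tfrac{2}{8} S_{Mj}$$

Using BLOSUM62: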

SFF = 6; SWF = 1; SMF = 0

M7F = (3/8)(6) + (3/8)(1) + (2/8)(0) = 2.625
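The position-7 calculation can be checked with a short script. This is a minimal sketch rather than code from the lecture: the variable names are illustrative, and only the three BLOSUM62 scores quoted on the slide are included instead of the full matrix.

```python
from collections import Counter

# Aligned sequences from the slide (Z = 8).
alignment = [
    "AGGCTHFWKGESM",
    "SGACSRWYRGQSL",
    "TGSCLKFFHG-LM",
    "SGACSRMYRGESL",
    "TGGCSKWMRGQSV",
    "SGNCSKMWKGNSI",
    "FGACSHWYKGDSL",
    "SGQCSRFYRGQSL",
]
Z = len(alignment)
k = 7  # 1-based alignment position used on the slide

# C_ki: how often each amino acid i occurs in column k.
counts = Counter(seq[k - 1] for seq in alignment)

# S_iF: BLOSUM62 scores against F, restricted to the residues seen in column 7
# (values as quoted on the slide).
scores_vs_F = {"F": 6, "W": 1, "M": 0}

# M_7F = sum_i (C_7i / Z) * S_iF
m7f = sum(counts[i] / Z * scores_vs_F[i] for i in counts)
print(dict(counts), m7f)
```

Only residues with non-zero counts contribute, so the sum over all 20 amino acids reduces to the three observed in the column.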


Average Score Method: Example

§ Calculating the profile values for two unobserved amino acids (Y and E):

$$M_{7Y} = \tfrac{3}{8} S_{FY} + \tfrac{3}{8} S_{WY} + \tfrac{2}{8} S_{MY} = \tfrac{3}{8}(3) + \tfrac{3}{8}(2) + \tfrac{2}{8}(-1) \approx 1.6$$

$$M_{7E} = \tfrac{3}{8} S_{FE} + \tfrac{3}{8} S_{WE} + \tfrac{2}{8} S_{ME} = \tfrac{3}{8}(-3) + \tfrac{3}{8}(-3) + \tfrac{2}{8}(-2) \approx -2.8$$

§ From these two equations, M7Y is clearly much more favorable than M7E, even though neither Y nor E has been observed at this position (k = 7). Why?
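The same arithmetic can be wrapped in a small helper that works for any candidate amino acid j. The sketch below is illustrative only: `average_score` is a hypothetical helper name, and the score table holds just the BLOSUM62 values quoted on the slides for Y and E.

```python
def average_score(counts, n_sequences, scores_vs_j):
    """Average Score Method: M_kj = sum_i (C_ki / Z) * S_ij."""
    return sum(c / n_sequences * scores_vs_j[aa] for aa, c in counts.items())

# Column 7 of the example alignment: C_7F = 3, C_7W = 3, C_7M = 2 (Z = 8).
counts_pos7 = {"F": 3, "W": 3, "M": 2}

# BLOSUM62 scores of F, W and M against Y and against E (from the slides).
blosum62_subset = {
    "Y": {"F": 3, "W": 2, "M": -1},
    "E": {"F": -3, "W": -3, "M": -2},
}

m7y = average_score(counts_pos7, 8, blosum62_subset["Y"])
m7e = average_score(counts_pos7, 8, blosum62_subset["E"])
print(m7y, m7e)
```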


Searching for PSSM/Profile Matches

§ If we do not allow gaps (i.e., no insertions or deletions):
• Can simply do a linear scan, scoring the match to the position-specific scoring matrix (PSSM) at each position in the sequence
§ If we allow gaps:
• Can use dynamic programming to align the profile to the protein sequence(s) (with gap penalties)
- see Mount, Bioinformatics: Sequence and Genome Analysis (2004)
• Can use hidden Markov model-based methods
- see Durbin et al., Biological Sequence Analysis (1998)
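The ungapped case can be sketched as a sliding-window scan. Everything in this example is hypothetical: the three-position PSSM, its scores, the default score of -1 for unlisted residues, and the test sequence are made up for illustration and do not come from a real profile.

```python
def scan_pssm(pssm, sequence, default=-1):
    """Score every ungapped window of `sequence` against `pssm`.

    `pssm` is a list with one dict per profile position mapping residue -> score;
    residues missing from a dict receive `default` (an assumption of this sketch).
    Returns a list of (start_index, score) pairs, with 0-based starts.
    """
    width = len(pssm)
    results = []
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[k].get(sequence[start + k], default) for k in range(width))
        results.append((start, score))
    return results

# Toy 3-position profile (hypothetical scores).
toy_pssm = [{"G": 5}, {"K": 4, "R": 3}, {"D": 6}]
hits = scan_pssm(toy_pssm, "AGKDGRDA")
best = max(hits, key=lambda pair: pair[1])
print(best)
```

The scan costs O(L x N) for a sequence of length L and a profile of length N; allowing gaps requires the dynamic programming or HMM methods cited above.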


Sequence Pattern and Profile Applications

§ Predicting structural or functional domains in protein sequences
• Example: PROSITE database of protein sequence motifs
§ Predicting protein-protein interaction motifs
§ Predicting transcription factor binding sites in DNA sequence
• Example: TRANSFAC database of DNA sequence motifs
§ Predicting protein localization
• Example: PSORT method to predict protein localization


Protein motif example: PROSITE

§ PROSITE is a database of sequence motifs (patterns and profiles)
§ These sequence motifs can be used to predict protein structural domains
§ Example (Gal4 and Gcn4 transcription factors):
• Gal4 Zn-finger DNA-binding protein domain matched by pattern:

[GASTPV] - C - x(2) - C - [RKHSTACW] - x(2) - [RKHQ] - x(2) - C - x(5,12) - C - x(2) - C - x(6,8) - C

• Gcn4 B-ZIP DNA-binding protein domain matched by profile

[Figure: Gal4 Zn-finger domain and Gcn4 B-ZIP domain]
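A PROSITE pattern of this simple form can be translated into a regular expression and matched against a sequence. The converter below is a rough sketch that handles only the syntax used in the Gal4 pattern (residue classes, `x`, and `(n)` / `(n,m)` repeats); other PROSITE syntax such as exclusions or anchors is not handled. The test fragment is assumed to be the N-terminal region of Gal4 and is used for illustration only.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to a Python regex string.

    Supports only: single residues, [..] classes, x (any residue),
    and (n) / (n,m) repeat counts. This is a deliberate simplification.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        element = element.replace("x", ".")              # x = any residue
        element = element.replace("(", "{").replace(")", "}")  # repeats
        parts.append(element)
    return "".join(parts)

gal4_pattern = ("[GASTPV] - C - x(2) - C - [RKHSTACW] - x(2) - [RKHQ] - x(2) - C - "
                "x(5,12) - C - x(2) - C - x(6,8) - C")
gal4_regex = prosite_to_regex(gal4_pattern)

# Assumed Gal4 N-terminal fragment (illustrative input, not from the slides).
fragment = "MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRS"
match = re.search(gal4_regex, fragment)
print(gal4_regex)
print(match.start(), match.group())
```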


DNA motif example: Yeast Promoter Elements

§ Gal4 binding sites in yeast promoter regions, predicted by sequence patterns/profiles
§ Visualization of Gal4 DNA binding sites in the promoter of the GAL10 gene:

[Figure: Gal4 binding sites in the GAL10 promoter]

§ Gal4 DNA binding site pattern:

----------------------------------------------------------------------
YBR019C (GAL10)
----------------------------------------------------------------------
GAL4 Binding Site Pattern: CGG...........cCg
-269  -253  +  CGGAGGAGAGTCTTCCG
-333  -317  +  CGGAGCAGTGCGGCGCG
-251  -235  -  CGGGCGACAGCCCTCCG
-232  -216  -  CGGATTAGAAGCCGCCG
----------------------------------------------------------------------
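The four listed sites can be checked against the pattern with a regular expression. This sketch rests on an assumption about the notation that the slide does not state: uppercase letters are treated as required bases, while `.` and lowercase letters are treated as "any base" (lowercase presumably marks a preferred but not strictly conserved base, since the second site ends in GCG rather than CCG).

```python
import re

def site_pattern_to_regex(pattern):
    """Uppercase = required base; '.' or lowercase = any base (assumption)."""
    return "".join(ch if ch.isupper() else "[ACGT]" for ch in pattern)

# CGG, 11 unconstrained positions, then cCg: 17 positions in total.
gal4_site_pattern = "CGG" + "." * 11 + "cCg"
gal4_site_regex = re.compile(site_pattern_to_regex(gal4_site_pattern))

# Sites listed for the GAL10 (YBR019C) promoter on the slide.
sites = [
    "CGGAGGAGAGTCTTCCG",
    "CGGAGCAGTGCGGCGCG",
    "CGGGCGACAGCCCTCCG",
    "CGGATTAGAAGCCGCCG",
]
matches = [bool(gal4_site_regex.fullmatch(site)) for site in sites]
print(matches)
```

A pattern this loose will also match many non-functional sequences in a genome, which is one reason weighted profiles are often preferred over simple patterns.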


Protein motif example: Subcellular Localization

§ Tools such as PSORT can predict the subcellular localization of a protein based on its protein sequence

§ Many sequence motifs can be used to predict protein localization

§ For example, many proteins that are retained in the endoplasmic reticulum (ER) carry a C-terminal K-D-E-L sequence motif.

§ Sequence motifs are also linked to nuclear localization of proteins

§ Example: using PredictNLS (http://cubic.bioc.columbia.edu/cgi/var/nair/resonline.pl) to predict nuclear localization of the Gcn4 transcription factor

Gcn4 protein sequence:

MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPEL DDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRK VKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHL ENEVARLKKLVGER

Nuclear Localization Signal (NLS) motif present in Gcn4 protein sequence:

[PLQ]K[RK]x{1,2}[RK]x{3,6}[RK][RK]x{1,2}[RK]x{1,2}[RK][RK]

NLS motif
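The NLS pattern can be located in Gcn4 with an ordinary regular expression search. This is a minimal sketch, not the PredictNLS implementation: it simply rewrites `x` as `.` and searches a C-terminal fragment copied from the Gcn4 sequence shown above (a fragment is used to keep the example short).

```python
import re

# NLS motif from the slide; 'x' (any residue) becomes '.' in regex syntax.
nls_pattern = "[PLQ]K[RK]x{1,2}[RK]x{3,6}[RK][RK]x{1,2}[RK]x{1,2}[RK][RK]"
nls_regex = re.compile(nls_pattern.replace("x", "."))

# C-terminal fragment of the Gcn4 sequence listed on the slide.
gcn4_fragment = "SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER"

nls_match = nls_regex.search(gcn4_fragment)
if nls_match:
    print(nls_match.start(), nls_match.group())
```

The matched segment lies in the basic region just upstream of the leucine zipper, consistent with a B-ZIP DNA-binding domain.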


PSI-BLAST

§ PSI-BLAST = Position-Specific Iterated BLAST (see Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402)

[Figure: PSI-BLAST cycle]
• BLAST input sequence to find significant alignments
• Construct multiple sequence alignment (MSA) from hits
• Use MSA to construct position specific scoring matrix (PSSM)
• BLAST PSSM profile to search for new homologs of sequence
• Iterate


PSI-BLAST: Method

§ 1. A single protein sequence is used to search the database using the gapped BLAST method
§ 2. A multiple sequence alignment is constructed from the significant alignments (HSPs) identified in step 1, and a position-specific scoring matrix (PSSM) profile is constructed from the multiple alignment
§ 3. Search the database with the PSSM profile using a version of the BLAST method
§ 4. Report significant local alignments (HSPs) between the PSSM profile and any database sequences
§ 5. Iterate: construct a new alignment and PSSM profile (step 2) using the sequence alignments (HSPs) identified in step 4

Information adapted from: Altschul and Koonin (1998) TIBS 23:444-447.
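The iterate-until-stable structure of these steps can be illustrated with a toy loop. This is a heavily simplified stand-in and not PSI-BLAST itself: it uses ungapped, fixed-width windows, a uniform background, a flat pseudocount, and a raw score threshold instead of gapped BLAST, sequence weighting, and E-values. All function names, sequences, and parameter values are hypothetical, and every database sequence is assumed to be at least as long as the query.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(hits, pseudocount=1.0):
    """Log-odds PSSM from equal-length, ungapped hits (uniform background)."""
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*hits):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in ALPHABET})
    return pssm

def best_window(pssm, sequence):
    """Return (score, window) for the best-scoring ungapped window."""
    width = len(pssm)
    return max((sum(pssm[i][sequence[s + i]] for i in range(width)),
                sequence[s:s + width])
               for s in range(len(sequence) - width + 1))

def iterate_search(query, database, threshold, max_rounds=5):
    hits = {query}                        # round 1 profile = query alone
    for _ in range(max_rounds):
        pssm = build_pssm(sorted(hits))   # step 2: profile from current hits
        new_hits = {query}
        for seq in database:              # steps 3-4: search with the profile
            score, window = best_window(pssm, seq)
            if score >= threshold:
                new_hits.add(window)
        if new_hits == hits:              # converged: no change in hit set
            break
        hits = new_hits                   # step 5: iterate
    return hits

toy_hits = iterate_search("HVGKDD", ["TTHVGKDETT", "WWWWWWWWWW"], threshold=2.0)
print(sorted(toy_hits))
```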


Gapped BLAST example

§ Searching for homologs of the human Fragile Histidine Triad (FHIT) protein (Bis(5'-adenosyl)-triphosphatase, P49789)
§ Results of standard gapped-BLAST search of Swissprot database using an Expect (E-value) threshold of 0.005 (and low complexity filter):

Example from: Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402
Results images from http://blast.ncbi.nlm.nih.gov/Blast.cgi


PSI-BLAST example

§ Results of PSI-BLAST search (iteration 1) of Swissprot database with FHIT protein (Bis(5'-adenosyl)-triphosphatase, P49789) using same parameters

• Note: iteration 1 is just a regular gapped BLAST search

Results images are output from http://blast.ncbi.nlm.nih.gov/Blast.cgi


PSI-BLAST example

§ Results of PSI-BLAST search (iteration 2): top of results page


PSI-BLAST example

§ Results of PSI-BLAST search (iteration 2): bottom of results page

• Iteration 2 uses the PSSM profile to search for new high-scoring segment pairs

Results images are output from http://blast.ncbi.nlm.nih.gov/Blast.cgi


PSI-BLAST: Summary

Image from: Altschul et al., Nuc. Acids Res. (1997) 25:3389-3402

§ PSI-BLAST and other profile-based search methods are more sensitive than single-sequence searches at detecting weakly similar proteins
§ This sensitivity comes from position-specific scoring with a PSSM profile, which emphasizes conserved (and potentially important) segments of the sequence alignment


Identifying New Promoter Sequence Motifs

[Figure: a gene with its TATA box and an unidentified promoter sequence motif, marked "?"]



Gibbs Sampling: A method for local multiple sequence alignment

§ The Gibbs sampling method (Lawrence et al., (1993) Science 262:208-214) is a stochastic method for identifying short, conserved sequence motifs by local multiple sequence alignment
§ Gibbs sampling typically takes the following inputs:
• a set of N sequences (x1, x2, …, xN) potentially sharing a common sequence motif
• the estimated width of this motif (W = width or size of motif)
• a background model (e.g., background amino acid or nucleotide frequencies)
- note: the background can be calculated from the input sequences
§ The Gibbs sampling algorithm provides the following output:
• positions of the motif (a1, a2, …, aN) within each input sequence (x1, x2, …, xN)
- these positions describe the location of the motif in each sequence
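As the note on the background model says, background frequencies can be estimated directly from the input sequences. A minimal sketch, with a hypothetical helper name and made-up toy sequences:

```python
from collections import Counter

def background_frequencies(sequences):
    """Estimate residue frequencies by pooling all input sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: count / total for residue, count in counts.items()}

# Toy DNA input (illustrative only).
background = background_frequencies(["AACG", "TTGC"])
print(background)
```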


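Gibbs Sampling Algorithm

I. Initialization:
• Select random locations a1, ..., aN in sequences x1, ..., xN and align sequences at these locations

II. Iterations
1. “Predictive Update Step” (Lawrence et al., (1993))
a. Remove one sequence xk from alignment
b. Recalculate model of motif from remaining sequences in alignment

$$q_{ij} = \frac{c_{ij} + \beta_j}{(N - 1) + B}$$

where qij is the model probability of residue j at motif position i, cij is the number of times residue j occurs at motif position i in the remaining N − 1 sequences, βj = pseudocounts, and $B = \sum_j \beta_j$ (summed over all residue types)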

Information adapted from Lawrence et al., (1993) Science 262:208-214
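The predictive update step amounts to counting residues in each motif column of the remaining N − 1 sequences and adding pseudocounts. The sketch below is illustrative: the function name, toy sequences, start positions, and pseudocount values are assumptions, not values from Lawrence et al.

```python
from collections import Counter

def motif_model(sequences, starts, width, held_out, pseudocounts):
    """q[i][j] = (c_ij + beta_j) / ((N - 1) + B), leaving out one sequence.

    `starts` holds the current 0-based motif start in each sequence;
    `pseudocounts` maps residue j -> beta_j.
    """
    B = sum(pseudocounts.values())
    rest = [(seq, a) for k, (seq, a) in enumerate(zip(sequences, starts))
            if k != held_out]
    model = []
    for i in range(width):
        # c_ij: occurrences of residue j at motif position i.
        counts = Counter(seq[a + i] for seq, a in rest)
        model.append({j: (counts[j] + beta) / (len(rest) + B)
                      for j, beta in pseudocounts.items()})
    return model

# Toy example: three DNA sequences, motif width 3, sequence 0 held out.
toy_sequences = ["ACGTAC", "TTACGT", "GACGTT"]
toy_starts = [0, 2, 1]
beta = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # B = 1
q = motif_model(toy_sequences, toy_starts, 3, held_out=0, pseudocounts=beta)
print(q[0])
```

Each column of q sums to 1, because the counts in a column add up to N − 1 and the pseudocounts add up to B.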


Gibbs Sampling Algorithm

2. “Sampling step” (Lawrence et al., (1993))
a. Choose (or ‘sample’) a new location (ak) of motif in sequence xk
- Choice of location (ak) is a random weighted selection among positions in sequence xk
- Weights are based on the probability ratio of how well each position in the sequence xk matches the model of the motif relative to the background

$$A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-W+1} Q_{j'} / P_{j'}}$$

- Aj = weight (“probability ratio” [Stormo, 2010]) for motif starting at position j in sequence
- Qj = probability of motif matching sequence starting at position j
- Pj = background probability of matching sequence starting at position j

[Figure: weight plotted against position in sequence, from 0 to |x|]

(adapted in part from notes from Serafim Batzoglou, Stanford University)
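The weights Aj can be computed by scoring every window of width W under the motif model and under the background, then normalizing the ratios. The following sketch uses a hypothetical two-position model and a uniform background purely for illustration.

```python
import random

def sampling_weights(sequence, model, background):
    """A_j = (Q_j / P_j) / sum over all start positions of (Q / P)."""
    width = len(model)
    ratios = []
    for j in range(len(sequence) - width + 1):
        q = p = 1.0
        for i in range(width):
            residue = sequence[j + i]
            q *= model[i][residue]      # Q_j: probability under motif model
            p *= background[residue]    # P_j: probability under background
        ratios.append(q / p)
    total = sum(ratios)
    return [r / total for r in ratios]

# Hypothetical model favouring "AC", with a uniform DNA background.
toy_model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
weights = sampling_weights("TTACTT", toy_model, uniform)

# Weighted random choice of the new motif start a_k (0-based).
rng = random.Random(0)
new_start = rng.choices(range(len(weights)), weights=weights)[0]
print(weights, new_start)
```

Sampling in proportion to the weights, rather than always taking the maximum, is what lets the procedure escape poor early alignments.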


Gibbs Sampling Algorithm

§ Repeat iteration steps 1 and 2 for a specified number of iterations and report the motif found
§ After a large number of iterations, the Gibbs sampler will typically find an optimal local alignment of the sequences (e.g., based on information content of motif, etc.)

Image from Lawrence et al., (1993) Science 262:208-214
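Putting initialization, the predictive update step, and the sampling step together gives a compact sampler. This is a bare-bones sketch under several assumptions: one "iteration" here sweeps over every sequence in turn, a single flat pseudocount is used for all residues, the background is estimated from the input, and there is no phase shift or scoring of the final alignment. The function name, parameters, and toy sequences are hypothetical.

```python
import random
from collections import Counter

def gibbs_sampler(sequences, width, iterations=200, pseudocount=0.5, seed=0):
    rng = random.Random(seed)
    pooled = "".join(sequences)
    alphabet = sorted(set(pooled))
    totals = Counter(pooled)
    background = {r: totals[r] / len(pooled) for r in alphabet}
    B = pseudocount * len(alphabet)

    # I. Initialization: random motif start in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]

    # II. Iterations.
    for _ in range(iterations):
        for k, held in enumerate(sequences):
            # 1. Predictive update: model from all sequences except x_k.
            rest = [(s, a) for m, (s, a) in enumerate(zip(sequences, starts))
                    if m != k]
            model = []
            for i in range(width):
                counts = Counter(s[a + i] for s, a in rest)
                model.append({r: (counts[r] + pseudocount) / (len(rest) + B)
                              for r in alphabet})
            # 2. Sampling: weight each start by Q_j / P_j, then draw a_k.
            ratios = []
            for j in range(len(held) - width + 1):
                ratio = 1.0
                for i in range(width):
                    ratio *= model[i][held[j + i]] / background[held[j + i]]
                ratios.append(ratio)
            starts[k] = rng.choices(range(len(ratios)), weights=ratios)[0]
    return starts

# Toy DNA sequences (made up for illustration).
toy = ["GTCATATTGACCA", "CCTATATTGGTAC", "AAGCATATTGCGT", "TGATATTGCCAGG"]
positions = gibbs_sampler(toy, width=6, iterations=50, seed=1)
print(positions)
```

Because the procedure is stochastic, different seeds can return different positions; fixing the seed only makes a single run reproducible.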


Gibbs Sampling Algorithm

§ Additional options/features of algorithm:
• Can also repeat the Gibbs sampling procedure with the same initialization or with new initializations to confirm previously identified motifs or find new motifs
• Can also perform a ‘phase shift’, by moving the motif a random weighted distance to the left or to the right
- This procedure helps avoid local ‘maxima’
§ Because a stochastic (random) sampling method is used, slightly different motifs will typically be obtained from each Gibbs sampling run
§ Gibbs sampling software:
• W-AlignACE (for DNA sequences): http://www1.spms.ntu.edu.sg/~chenxin/W-AlignACE/
• Gibbs Motif Sampler (for DNA and protein): http://bayesweb.wadsworth.org/gibbs/gibbs.html


Analysis of Agamous Transcription Factor targets using AlignACE

AlignACE 3.0 10/20/99
AlignACE -i
Parameter values:
expect = 10
gcback = 0.38
minpass = 200
seed = 1017249941
numcols = 10
undersample = 1
oversample = 1

Input sequences:
#1 At4g12550
#2 At4g37940
#3 At3g50330
#4 At3g61410
#5 At2g37260
#6 At4g17710
#7 At1g14540
#8 At2g15590
#9 At2g31430
#10 At2g27550
#11 At2g02710
#12 At5g22570
#13 At5g40860
#14 At3g54990
#15 At1g73830

(Columns in each motif listing: site sequence, gene number, offset, orientation)

Motif 1
AAGAAGAAGA 3 512 0
AAGAAGAAGA 11 338 0
AAGAAGAAGA 11 558 1
AAGAAAGAGA 6 466 0
AAGAAGGAAA 3 280 1
AAGAAAGAAA 4 447 0
AAGAAAGAAA 5 102 0
AAGAAAAAAA 14 471 0
AAGAAAAAAA 9 562 1
GAGAAAGAGA 11 376 0
GAGAAGAAGA 4 485 1
GAGAAAGAGA 2 478 0
GAGAAAAAGA 12 537 1
AAAAAGGAGA 5 8 1
...

Motif 9
CCAAATTAGGAAA 1 118 1
CCTATTAAGAAAA 1 451 1
CCAAATTAGGAAA 5 195 0
CCAAATTCGGATA 7 23 1
CCCATTTCGAAAA 7 479 1
CCTATTTAGTATA 9 442 1
CCAAATTAGGAAA 11 132 1
CCAAATTGGCAAA 12 437 1
TCTATTTTGGAAA 13 285 0
CCAATTTTCAAAA 15 562 1
** **** * ***
MAP Score: 5.06943

Motif 10
TATCCATATAAAA 1 419 1
TCTACAAAAAAAA 2 535 1
TATGTAATAAAAA 3 315 1
TGTAAAAACAAAA 4 165 0
TTTCCCGAGAAAA 4 556 1
TTTACCTATAGAA 5 219 1
TTTACAAACAAAA 7 568 1
TATTTCAAAAAAA 8 20 0
TATTCAAACAAAA 8 229 1
TATCCCAAAAAAA 8 290 1
TATCCATAAAAAA 9 369 1
TGTCCTAACAAAA 10 89 1
TATGTATACAAAA 10 162 1
TCTCCAAAAAAAA 10 231 1
TTTTCAATGAAAA 11 419 1
TTTATATATAAAA 14 383 0
TTTTTCAAAAAAA 15 478 1
TTTTCAAAAAAAA 15 566 1
* * **** ****
MAP Score: 4.51593

Motif 11
TCGATAAATTAATA 1 152 0
TCGATAAATTAATA 1 162 1
AGAATAAATTATAA 2 16 1
GTCATAAAATAAAT 3 67 1
...