Computational Biology, Part C
Family Pairwise Search and Cobbling
Robert F. Murphy
Copyright 2000, 2001. All rights reserved.
Overall Goals
Find previously unrecognized members of a family
Develop a model of a family
Possible Approaches
Model-based
  Motif-based (MEME/MAST)
  Hidden Markov model-based (HMMER)
Non-model-based
  Family Pairwise Search (FPS)
PSSMs
Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)
Calculated from a multiple alignment of a conserved region for members of a family
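
A minimal sketch of how a PSSM could be calculated from an ungapped aligned block and then used to scan a sequence. The function names, the uniform background frequencies, and the add-one pseudocount are illustrative assumptions, not the scheme used by any particular tool.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_pssm(aligned_seqs, alphabet=ALPHABET, background=None, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score.

    Assumes an ungapped block (all sequences the same length, no '-').
    The uniform background and the pseudocount are simplifying assumptions.
    """
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    total = len(aligned_seqs) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        pssm.append({
            a: math.log2(((counts.get(a, 0) + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm

def best_match(pssm, sequence):
    """Slide the PSSM along the sequence; return (best score, start position)."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    return max(
        (sum(col[res] for col, res in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    )
```

For example, `best_match(build_pssm(["ACD", "ACE", "ACD"]), "GGACDGG")` scores every 3-residue window and reports the window starting at position 2 as the best one.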
Learning PSSMs
Unsupervised learning methods can be used to find motifs in unaligned sequences
Best characterized algorithm is MEME
  T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83
Problems with PSSMs
Some families are characterized by two or more “sub”-motifs with variable spacing between them
Deciding upon motif boundaries difficult
Possible information in intervening sequences lost if only motifs are used
Cobbling
Pick “most representative” protein sequence from a family
Convert it to a profile by replacing each amino acid by the corresponding column from a similarity matrix
Cobbling
For each recognized “motif” in the family, replace the corresponding section of the profile with the profile of the motif
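
The two cobbling steps can be sketched as follows. This is only an illustration: `toy_similarity` is a hypothetical stand-in for a real similarity matrix such as BLOSUM62, and the motif positions and motif profiles are assumed to be supplied by some earlier motif-finding step.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_similarity(a, b):
    """Hypothetical stand-in for a substitution matrix entry (assumption)."""
    return 4 if a == b else -1

def sequence_to_profile(seq, similarity=toy_similarity, alphabet=ALPHABET):
    """Replace each residue by its 'column' of similarity scores."""
    return [{b: similarity(a, b) for b in alphabet} for a in seq]

def cobble(seq, motifs, similarity=toy_similarity, alphabet=ALPHABET):
    """Build a cobbled profile.

    motifs is a list of (start, motif_profile) pairs, where start is the
    0-based position of the motif in seq and motif_profile is a list of
    column dicts (residue -> score) of the motif's width.
    """
    profile = sequence_to_profile(seq, similarity, alphabet)
    for start, motif_profile in motifs:
        end = start + len(motif_profile)
        if start < 0 or end > len(profile):
            raise ValueError("motif does not fit inside the sequence")
        # Overwrite the single-sequence columns with the motif's columns.
        profile[start:end] = [dict(col) for col in motif_profile]
    return profile
```

Columns outside the motifs still carry the similarity scores of the representative sequence, which is how the cobbled profile keeps information from the regions between motifs.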
Cobbling
Advantage: At least some sequence information between motifs is retained.
S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705
Cobbler Illustration
[Figure: cobbler illustration. Labels: scores from profiles of conserved motifs; similarity scores for sequence from “most representative” family member; sequence of “most representative” family member]
Family Pairwise Search
For all known members of family, calculate a (pairwise) similarity score against each sequence in database (using BLAST) and sum those scores
Family Pairwise Search
Does not generate a model of the motif
Analogous to k nearest neighbor classification
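
A sketch of the FPS scoring loop, assuming a pairwise scoring function is passed in. Here `toy_score` (the number of shared 3-mers) is a hypothetical stand-in for a BLAST score so that the example runs on its own; it is not how BLAST scores alignments.

```python
def family_pairwise_scores(family, database, pairwise_score):
    """Score each database sequence by summing its pairwise scores
    against every known family member; return (name, total) pairs
    ranked from highest to lowest total."""
    totals = {
        name: sum(pairwise_score(member, target) for member in family)
        for name, target in database.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def toy_score(a, b):
    """Hypothetical stand-in for a BLAST score: number of shared 3-mers."""
    def kmers(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return len(kmers(a) & kmers(b))
```

No family model is built at any point: each database sequence is judged only by its summed similarity to the known members, which is what makes the approach resemble nearest neighbor classification.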
Which method is best?
Compare BLAST using a randomly chosen family member, BLAST FPS, MEME, HMMER
W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492
Comparison Protocol
For each method
  For each known protein family
    Train with family members
    Search database for matches
    Rank by score from search
    Determine how many known family members are ranked highly
Comparison Protocol
Evaluation metric
  average ROC50
    ROC50 is the fraction of true positives detected at a threshold giving 50 false positives
    average over all families
  Bigger is better!
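
A sketch of one way to compute ROC50 from a ranked hit list. It assumes the commonly used area-based form of the metric (the area under the ROC curve up to the first n false positives, normalized to lie between 0 and 1), which is more detailed than the one-line description on the slide. Padding the curve when the list contains fewer than n false positives is also an assumption.

```python
def roc_n(ranked_labels, n=50):
    """ROC_n for one family.

    ranked_labels: booleans in rank order (best score first),
    True for a known family member, False otherwise.
    """
    total_pos = sum(ranked_labels)
    if total_pos == 0 or n <= 0:
        raise ValueError("need at least one true positive and n > 0")
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_member in ranked_labels:
        if is_member:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # true positives ranked above this false positive
            if fp_seen == n:
                break
    # Assumption: if the list ends before n false positives are seen,
    # treat the missing ones as ranked below everything in the list.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_pos)

def average_roc_n(per_family_labels, n=50):
    """Average ROC_n over all families."""
    values = [roc_n(labels, n) for labels in per_family_labels]
    return sum(values) / len(values)
```

A ranking that places every family member above every non-member gives 1.0, and one that places n non-members above all members gives 0.0.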
Comparison Protocol
Caution!
  True positive defined as being listed as a member of the family in the PROSITE compilation
  Some false positives could be actual family members that were missed during PROSITE compilation!
  (Should be minor effect)
Results
[Figure: results plot. Labels: BLAST FPS, BLAST, HMMER, MAST]
Conclusion
FPS better than single sequence BLAST
FPS better than model-based methods
Which is best (part 2)?
Compare BLAST, BLAST FPS, cobbled BLAST, cobbled BLAST FPS
W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470
Comparison Protocol
Evaluation metric
  rank sum
    calculate difference in ROC50 for two methods for a given family
    sort by absolute value of difference
    sum ranks of families for which one method is better than the other
  Bigger is better!
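
The rank sum steps above can be sketched as follows. The bookkeeping is the same as in a Wilcoxon signed-rank test. Dropping families where the two methods tie and giving equal absolute differences their average rank are assumptions of this sketch, since the slide does not say how ties are handled.

```python
def signed_rank_sums(scores_a, scores_b):
    """Compare two methods across families.

    scores_a[i] and scores_b[i] are the ROC50 values of methods A and B
    on family i. Returns (rank sum where A is better, rank sum where B
    is better); rank 1 goes to the smallest absolute difference.
    """
    # Families where the two methods score identically are dropped (assumption).
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of equal absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    sum_a = sum(r for r, d in zip(ranks, diffs) if d > 0)
    sum_b = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return sum_a, sum_b
```

The method with the larger rank sum wins on more families, or wins by larger margins, than the other.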
Results
Conclusion
For task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is best currently available method!