TIPP and SEPP: Metagenomic Analysis using Phylogeny...
Transcript of TIPP and SEPP: Metagenomic Analysis using Phylogeny...
![Page 1: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/1.jpg)
TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles
Tandy Warnow Department of Computer Science
The University of Illinois at Urbana-Champaign
![Page 2: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/2.jpg)
Metagenomic taxonomic identification and phylogenetic profiling
Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes
![Page 3: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/3.jpg)
1. What is this fragment? (Taxonomic classification)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?
Basic Questions
![Page 4: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/4.jpg)
1. What is this fragment? (Taxonomic classification)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?
THIS TALK
![Page 5: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/5.jpg)
https://www.slideshare.net/EastBayWPMeetup/custom-post-types-and-custom-taxonomies
![Page 6: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/6.jpg)
TIPP and SEPP
TIPP (Bioinformatics 2014) performs
Taxonomic identification
Abundance profiling
SEPP (Pacific Symposium on Biocomputing 2012):
Adds reads into phylogenies (can be used for taxonomic identification)
Both available at https://github.com/smirarab/sepp
![Page 7: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/7.jpg)
TIPP for Taxonomic ID – output file
![Page 8: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/8.jpg)
![Page 9: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/9.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The true distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
Abundance Profiling
![Page 10: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/10.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: distributions at the species-level
True Distribution Estimated Distribution
50% species A 42% species A
20% species B 18% species B
15% species C 11% species C
14% species D 10% species D
1% species E 0% species E
19% unclassified
Abundance Profiling
![Page 11: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/11.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
We compared TIPP to
PhymmBL (Brady & Salzberg, Nature Methods 2009)
NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).
Marker gene are single-copy, universal, and resistant to horizontal transmission.
Testing TIPP in Bioinformatics 2014
![Page 12: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/12.jpg)
High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.
![Page 13: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/13.jpg)
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
![Page 14: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/14.jpg)
TIPP vs. other abundance profilers
TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.
The other tested methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).
Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.
![Page 15: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/15.jpg)
This talk
![Page 16: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/16.jpg)
https://www.slideshare.net/EastBayWPMeetup/custom-post-types-and-custom-taxonomies
![Page 17: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/17.jpg)
Phylogenies and Taxonomies
Taxonomies:
Rooted, labels at every node for each taxonomic level
More or less based on phylogenies
Phylogenies:
Estimated from sequences (usually)
Branch lengths reflect amount of change
Edges/nodes sometimes given with support (typically bootstrap)
Phylogenies are usually unrooted (because root can be difficult to infer)
![Page 18: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/18.jpg)
https://www.researchgate.net/Rooted-neighbor-joining-16S-rRNA-gene-based-phylogenetic-tree-of-uncultured-bacteria-The_fig1_279565247 [accessed 26 Jul, 2018]
Rooted neighbor-joining 16S rRNA gene-based phylogenetic tree of uncultured bacteria.
![Page 19: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/19.jpg)
How are Phylogenies Estimated?
Input: Sequences (DNA, RNA, or Aminoacid), unaligned
Output: Tree on the sequences
Notes:
Two steps: first align, then compute a tree on the alignment
Many different techniques for each step
![Page 20: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/20.jpg)
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
![Page 21: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/21.jpg)
…ACGGTGCAGTTACCA…
MutationDeletion
…ACCAGTCACCA…
Indels (insertions and deletions)
![Page 22: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/22.jpg)
…ACGGTGCAGTTACC-A… …AC----CAGTCACCTA…
The true multiple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree
…ACGGTGCAGTTACCA…
SubstitutionDeletion
…ACCAGTCACCTA…
Insertion
![Page 23: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/23.jpg)
AGAT TAGACTT TGCACAA TGCGCTTAGGGCATGA
U V W X Y
U
V W
X
Y
Phylogeny Estimation
![Page 24: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/24.jpg)
Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
![Page 25: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/25.jpg)
Phase 1: Alignment
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
![Page 26: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/26.jpg)
Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA
S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA
S1
S4
S2
S3
![Page 27: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/27.jpg)
Multiple Sequence Alignment
Figure from Braun et al., BMC Evol. Biol., 2011. “Homoplastic Microinversions and the Avian Tree of Life”
![Page 28: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/28.jpg)
Figure from Braun et al., BMC Evol. Biol., 2011. “Homoplastic Microinversions and the Avian Tree of Life”
![Page 29: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/29.jpg)
Phylogeny estimation
First compute a multiple sequence alignment Computationally challenging, accuracy on large divergent datasets can be low Best current methods for moderate-sized DNA datasets: MAFFT Best current methods for large datasets: PASTA and UPP
Then compute a tree Typically using maximum likelihood heuristics (e.g., RAxML, IQTree, PhyML, and FastTree) Computationally challenging, accuracy on large divergent datasets can be low Produces trees with branch lengths and other numeric model parameters
![Page 30: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/30.jpg)
This talk
![Page 31: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/31.jpg)
Phylogenetic Placement
![Page 32: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/32.jpg)
https://www.researchgate.net/Rooted-neighbor-joining-16S-rRNA-gene-based-phylogenetic-tree-of-uncultured-bacteria-The_fig1_279565247 [accessed 26 Jul, 2018]
Rooted neighbor-joining 16S rRNA gene-based phylogenetic tree of uncultured bacteria.
![Page 33: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/33.jpg)
Marker-based Taxon Identification
ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT
ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG •. •. •. ACCT
Fragmentary sequences from some gene
Full-length sequences for same gene, and an alignment and a tree
![Page 34: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/34.jpg)
Input
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC
![Page 35: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/35.jpg)
Align Sequence
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
![Page 36: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/36.jpg)
Place Sequence
S1
S4
S2
S3Q1
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
![Page 37: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/37.jpg)
Phylogenetic Placement in 2011
Align each query sequence to backbone alignment HMMER (Finn et al., NAR 2011) PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
Place each query sequence into backbone tree pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA solve same problem (maximum likelihood placement under standard sequence evolution models)
![Page 38: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/38.jpg)
HMMER vs. PaPaRa Alignments
Increasing rate of evolution
0.0
![Page 39: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/39.jpg)
Phylogenetic Placement in 2011
![Page 40: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/40.jpg)
Input
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC
![Page 41: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/41.jpg)
Align Sequence using HMMER
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
![Page 42: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/42.jpg)
Place Sequence using pplacer
S1
S4
S2
S3Q1
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
![Page 43: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/43.jpg)
What is HMMER+pplacer?
HMMER (Finn et al., NAR 2011) is used to add the read s into the backbone alignment, thus producing an “extended alignment”. HMMER is based on profile Hidden Markov Models (profile HMMs).
pplacer (Matsen et al. BMC Bioinformatics 2010) is used to add read s into the best location in the tree T. pplacer is based on phylogenetic sequence evolution models (e.g., GTR), and uses maximum likelihood.
![Page 44: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/44.jpg)
![Page 45: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/45.jpg)
From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html
A general topology for a profile HMM
![Page 46: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/46.jpg)
Profile Hidden Markov Models
Profile HMMs are probabilistic generative models to represent multiple sequence alignments.
HMMER software suite can
Build a profile HMM given a multiple sequence alignment A
Use the profile HMM to add a sequence s into A, and return the “probability” that the HMM generated s (the “score”)
![Page 47: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/47.jpg)
INPUT
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC
![Page 48: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/48.jpg)
Align Sequence using HMMER
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
![Page 49: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/49.jpg)
Align Sequence using HMMER
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. Build a profile HMM for the backbone alignment 2. Add Q1 to backbone alignment: Compute a
maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment.
3. Record the score for the alignment!
![Page 50: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/50.jpg)
What is pplacer?
pplacer: software developed by Erick Matsen and colleagues. See http://matsen.fhcrc.org/pplacer/ Input: read s, alignment A (on S and s), tree on S Output:
“Best” location to add s in T (under maximum likelihood).
For every edge e in T, the value p(e) for the probability for s being placed on e (these probabilities add up to 1)
![Page 51: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/51.jpg)
Place Sequence using pplacer
S1
S4
S2
S3
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
2. Return Te that has the best ML score.
![Page 52: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/52.jpg)
Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
S1
S4
S2
S30.5
0.4
0.05
0.03
0.02
![Page 53: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/53.jpg)
Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
2. Find edge e* producing the best ML score
S1
S4
S2
S30.5
0.4
0.05
0.03
0.02
![Page 54: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/54.jpg)
Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
2. Find edge e* producing the best ML score
S1
S4
S2
S30.5
![Page 55: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/55.jpg)
Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
2. Add Q1 into edge e*
S1
S4
S2
S30.5
![Page 56: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/56.jpg)
Place Sequence using pplacer
S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------
1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)
2. Add Q1 into edge e*
S1
S4
S2
S3Q1
![Page 57: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/57.jpg)
HMMER vs. PaPaRa Alignments
Increasing rate of evolution
0.0
![Page 58: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/58.jpg)
One Hidden Markov Model for the entire alignment?
HMM 1
![Page 59: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/59.jpg)
One HMM works beautifully for small-diameter trees
HMM!1!
![Page 60: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/60.jpg)
One HMM works poorly for large-diameter trees
HMM!1!
![Page 61: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/61.jpg)
One Hidden Markov Model for the entire alignment?
![Page 62: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/62.jpg)
Or 2 HMMs?
HMM 1
HMM 2
![Page 63: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/63.jpg)
HMM 1
HMM 3 HMM 4
HMM 2
Or 4 HMMs?
![Page 64: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/64.jpg)
This talk
![Page 65: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/65.jpg)
Ensemble of HMMs
Basic idea
Represent a multiple sequence alignment A on S with a collection of HMMs
Construct the eHMM by dividing S into subsets of taxa, and build an HMM on each subset (using HMMER)
SEPP (PSB 2012): SATé-enabled Phylogenetic Placement
TIPP (Bioinformatics 2014): Applications of the eHMM technique to (a) taxonomic identification and (b) metagenomic abundance classification
UPP (Genome Biology 2015): constructs ultra-large alignments
HIPPI (BMC Genomics 2016): protein family classification
All available at https://github.com/smirarab/sepp:
![Page 66: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/66.jpg)
SEPP Design
To insert query sequence Q1 into backbone tree T
Represent the backbone MSA with a collection of profile HMMs, based on maximum ”alignment subset size”
Score Q1 against every profile HMM in the collection
The best scoring HMM is used to compute the extended alignment
Use pplacer to add Q1 into tree T (restricted to ”subtree” based on maximum ”placement subset size”)
![Page 67: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/67.jpg)
One Hidden Markov Model for the entire alignment?
![Page 68: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/68.jpg)
Or 2 HMMs?
HMM 1
HMM 2
![Page 69: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/69.jpg)
HMM 1
HMM 3 HMM 4
HMM 2
Or 4 HMMs?
![Page 70: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/70.jpg)
SEPP Parameter Exploration
Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP:
Small alignment subset sizes best
Large placement subset size best
10% rule (both subset sizes 10% of backbone) had best overall performance
![Page 71: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/71.jpg)
SEPP (10%-rule) on simulated data
0.00.0
Increasing rate of evolution
![Page 72: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/72.jpg)
SEPP and eHMMs
An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to
(a) detect homology between full length sequences and fragmentary sequences and (b) add fragmentary sequences into an existing alignment
especially when there are many indels and/or substitutions.
![Page 73: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/73.jpg)
Metagenomic Taxon Identification
Objective: classify short reads in a metagenomic sample
![Page 74: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/74.jpg)
SEPP does Phylogenetic Placement
![Page 75: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/75.jpg)
SEPP can also be used for Taxon Identification
![Page 76: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/76.jpg)
TIPP can also be used for Taxon Identification
![Page 77: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/77.jpg)
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes
TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and
placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA (most recent common ancestor) of all selected placements for all selected extended alignments
![Page 78: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/78.jpg)
TIPP for Taxonomic ID – output file
![Page 79: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/79.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
Abundance Profiling
![Page 80: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/80.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
TIPP can be used for Abundance Profiling
![Page 81: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/81.jpg)
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes
TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and
placement within the tree. Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment
![Page 82: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/82.jpg)
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes
TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and
placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignments
![Page 83: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/83.jpg)
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes
TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and
placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA (most recent common ancestor) of all selected placements for all selected extended alignments
![Page 84: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/84.jpg)
TIPP Design (Step 4)Input: marker gene reference alignment (computed using PASTA), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%)
For each marker gene and its associated bin of reads:
Builds eHMM to represent the MSA
For each read:
(1) Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold.
(2) For each extended MSA, use pplacer to place the read into the taxonomy (optimizing maximum likelihood) and identify the clade in the tree with sufficiently high likelihood to meet the specified placement support threshold. Note: This gives the taxonomic information — but may not classify the read if the threshold is high, because that will move the read towards the root! Taxonomically characterize each read at the MRCA of these clades.
![Page 85: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/85.jpg)
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
We compared TIPP to
PhymmBL (Brady & Salzberg, Nature Methods 2009)
NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)
MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland
MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard
mOTU (Bork et al., Nature Methods 2013)
MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).
Marker gene are single-copy, universal, and resistant to horizontal transmission.
Abundance Profiling
![Page 86: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/86.jpg)
High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.
![Page 87: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/87.jpg)
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
![Page 88: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/88.jpg)
TIPP vs. other abundance profilers
TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.
All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).
Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.
![Page 89: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/89.jpg)
Using SEPP
SEPP algorithmic parameters: Alignment subset size (how many sequences for each profile HMM in the ensemble?) Placement subset size (how much of the tree to search for optimal placement?)
Default settings are acceptable, but you can improve accuracy (but increase running time) by:
increasing placement subset size and decreasing alignment subset size
![Page 90: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/90.jpg)
Using TIPP
TIPP algorithmic parameters (other than SEPP parameters) Reference markers, alignments, and refined taxonomy Alignment threshold (default 95%) Placement threshold (default 95%)
Note: The default alignment and placement thresholds were optimized for abundance profiling, not for Taxon ID. Reducing the placement threshold will increase probability of taxonomic classification at the species level (but could also increase the false positive rate)
![Page 91: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/91.jpg)
Our Publications using eHMMsS. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258. N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow "TIPP:Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555. N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow, "Ultra-large alignments using phylogeny aware profiles". Proceedings RECOMB 2015 and Genome Biology (2015) 16:124 N. Nguyen, M. Nute, S. Mirarab, and T. Warnow, HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics (2016): 17 (Suppl 10):765
All software are available in open source form at https://github.com/smirarab/sepp
![Page 92: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/92.jpg)
TIPP is under development!
We are modifying TIPP’s design to improve taxonomic identification and abundance profiling on shotgun sequencing data Stay tuned!
Developers: Erin Molloy, Mike Nute, Nidhi Shah, Mihai Pop, and Tandy Warnow
![Page 93: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/93.jpg)
The Tutorial (by Mike Nute)
SEPP for taxon ID Will show you how to run SEPP Will show you how to use branch lengths in SEPP’s placement of reads to get interesting insights
TIPP for taxon ID and abundance profiling
You may also wish to look at last year’s tutorial by Erin Molloy, at https://github.com/ekmolloy/stamps-tutorial/
![Page 94: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/94.jpg)
Contacting us
Tandy Warnow: [email protected] Mike Nute: [email protected] Erin Molloy: [email protected]
![Page 95: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU](https://reader030.fdocuments.us/reader030/viewer/2022040415/5f2efaec2bb91667fe47ab6f/html5/thumbnails/95.jpg)
Acknowledgments
PhD students: Nam Nguyen (now postdoc at UCSD), Siavash Mirarab (now faculty at UCSD), Bo Liu (now at Square), Erin Molloy, and Mike Nute Mihai Pop, University of Maryland
NSF grants to TW: DBI:1062335, DEB 0733029, III:AF:1513629 NIH grant to MP: R01-A1-100947 Also: Guggenheim Foundation Fellowship (to TW), Microsoft Research New England (to TW), David Bruton Jr. Centennial Professorship (to TW), Grainger Foundation (to TW), HHMI Predoctoral Fellowship (to SM) TACC, UTCS, and UIUC computational resources