Chalk Talk - Tandy Warnow (tandy.cs.illinois.edu/chalk-talk-cmu.pdf)
![Page 1: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/1.jpg)
Chalk Talk
Tandy Warnow
Departments of Computer Science and Bioengineering
University of Illinois at Urbana-Champaign
![Page 2: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/2.jpg)
• Large-scale statistical phylogeny estimation • Ultra-large multiple-sequence alignment • Estimating species trees from incongruent gene trees • Supertree estimation • Genome rearrangement phylogeny • Reticulate evolution • Visualization of large trees and alignments • Data mining techniques to explore multiple optima
The Tree of Life: Multiple Challenges
Large datasets: 100,000+ sequences, 10,000+ genes, “Big Data” complexity
![Page 3: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/3.jpg)
Application areas: • metagenomics • protein structure and function prediction • trait evolution • detection of co-evolution • systems biology
![Page 4: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/4.jpg)
Techniques: • Graph theory (especially chordal graphs) • Probability theory and statistics • Hidden Markov models • Combinatorial optimization • Heuristics • Supercomputing
![Page 5: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/5.jpg)
Overview
• Theory: combining probability theory, graph theory, and optimization
• Simulations: evaluating methods under stochastic models of sequence evolution
• Biological data analysis: refining methods and enabling discovery
• Open source software development
• High performance computing
• Applications outside biology (e.g., historical linguistics, big data problems in general)
![Page 6: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/6.jpg)
Past Work (highlights)
• Gene tree estimation (theoretical results under stochastic models of sequence evolution)
• Multiple sequence alignment on large datasets, and co-estimation of alignments and trees
• Phylogenetic networks and species trees from multi-locus datasets
• Genome rearrangement phylogeny
• Supertree methods
• Metagenomics
• Historical linguistics
![Page 7: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/7.jpg)
Future work
Theory, methods, and empirical studies for
• Genome-scale phylogeny estimation addressing multiple sources for gene tree heterogeneity
• Microbiome analysis
• Ultra-large multiple sequence alignment and tree estimation
And applications of these techniques outside biology
![Page 8: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/8.jpg)
Current NSF grants
• Graph-theoretic methods to improve phylogenomic analyses (joint with Chandra Chekuri and Satish Rao): NSF CCF-1535977
• Multiple Sequence Alignment: NSF ABI-1458652
• Metagenomics: joint with Mihai Pop and Bill Gropp. NSF grant III:AF:1513629
![Page 10: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/10.jpg)
Major Areas
• Phylogenomics: Species tree and network estimation using whole genomes (and gene tree estimation in the context of whole genomes)
• Multiple Sequence Alignment: Inferring relationships between letters in molecular sequences, especially on very large datasets (up to 1,000,000 sequences)
• Metagenomics: Analysis of molecular sequences obtained from environmental samples (joint with Mihai Pop and Bill Gropp)
• Scaling computationally intensive methods to large datasets: Combining discrete math and statistical methods to enable highly accurate analysis of ultra-large datasets (joint with Chandra Chekuri and Satish Rao)
![Page 11: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/11.jpg)
Phylogenomics = Species trees from whole genomes
“Nothing in biology makes sense except in the light of evolution” - Dobzhansky
![Page 12: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/12.jpg)
[Figure: phylogenomics. Aligned sequences for gene 1, gene 2, ..., gene 999, gene 1000, each over four species: Orangutan, Gorilla, Chimpanzee, Human.]
“gene” here refers to a portion of the genome (not a functional gene)
I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome
![Page 13: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/13.jpg)
Gene tree discordance
[Figure: gene trees for gene 1 and gene 1000 on Orangutan, Gorilla, Chimpanzee, and Human.]
Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity
![Page 14: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/14.jpg)
Incomplete Lineage Sorting (ILS)
• Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
• There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.
![Page 15: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/15.jpg)
Main competing approaches
[Figure: alignments for gene 1, gene 2, ..., gene k over the same species are either analyzed separately and combined by a summary method, or concatenated.]
![Page 16: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/16.jpg)
Statistical Consistency
[Figure: error plotted against amount of data.]
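One standard way to write down the property this slide pictures, given here as a sketch. The symbols $\hat{T}_k$ (the tree estimated from $k$ units of data, such as $k$ genes) and $T^*$ (the true tree) are notation introduced for illustration and do not appear in the slides. A method is statistically consistent under a model if the probability of returning the true tree tends to 1 as the amount of data generated by that model grows:

$$\lim_{k \to \infty} \Pr\left[\hat{T}_k = T^*\right] = 1$$

Equivalently, the error probability $\Pr[\hat{T}_k \neq T^*]$ tends to 0 as $k$ increases.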
![Page 17: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/17.jpg)
![Page 19: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/19.jpg)
Maximum Quartet Support Species Tree [Mirarab et al., ECCB, 2014]
• Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} \left| Q(T) \cap Q(t) \right|$$

where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees.
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
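The quartet score can be illustrated with a brute-force sketch. It is not how ASTRAL computes the score: it simply enumerates every set of four taxa, which is only feasible for tiny trees. The nested-tuple tree encoding, the function names, and the five-taxon example trees are all assumptions made for this illustration.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Leaf sets below each internal node; each is one side of a bipartition."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


def quartets(tree):
    """Q(tree): the resolved quartet trees ab|cd induced by the tree."""
    sides = clades(tree)
    induced = set()
    for four in combinations(sorted(leaves(tree)), 4):
        four = frozenset(four)
        for side in sides:
            inside = four & side
            if len(inside) == 2:  # this edge separates two taxa from the other two
                induced.add(frozenset([inside, four - inside]))
                break
    return induced


def score(candidate, gene_trees):
    """Sum over gene trees of the number of quartet trees shared with the candidate."""
    q = quartets(candidate)
    return sum(len(q & quartets(t)) for t in gene_trees)


# Hypothetical example: one gene tree agrees with the candidate, one does not.
species_tree = ((("A", "B"), "C"), ("D", "E"))
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
print(score(species_tree, gene_trees))
```

Because every quartet is enumerated, the cost grows with the fourth power of the number of taxa, so this sketch is only useful for checking small cases by hand.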
![Page 20: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/20.jpg)
Constrained MQST (Maximum Quartet Support Tree)
• Input: Set $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ of unrooted gene trees, with each tree on set $S$ with $n$ species, and set $X$ of allowed bipartitions
• Output: Unrooted tree $T$ on leaf set $S$, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to $T$ drawing its bipartitions from $X$.
Theorems (Mirarab et al., 2014):
• If $X$ contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.
• The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades”, the halves of the allowed bipartitions.)
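As a small illustration of the constraint set, the sketch below collects the non-trivial bipartitions of the input gene trees, which is the minimum that X must contain for the consistency theorem to apply. It does not implement the dynamic program itself. The nested-tuple tree encoding, the function names, and the example trees are assumptions introduced here.

```python
def leaf_set(tree):
    """Return the leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions (both sides have at least two leaves) of one tree."""
    taxa = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = taxa - below
        if len(below) >= 2 and len(other) >= 2:
            # Store the split as an unordered pair so A|B equals B|A.
            found.add(frozenset([below, other]))
        return below

    visit(tree)
    return found


def allowed_bipartitions(gene_trees):
    """X: the union of the bipartitions of all input gene trees."""
    allowed = set()
    for tree in gene_trees:
        allowed |= bipartitions(tree)
    return allowed


# Hypothetical input: two gene trees on the same five species.
gene_trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
X = allowed_bipartitions(gene_trees)
print(len(X))
```

Each "allowed clade" in the slide's sense is one of the two halves stored for a bipartition in `X`.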
![Page 21: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/21.jpg)
200 Estimated Gene Trees
Data: Fixed, moderate ILS rate, 50 replicates per HGT rate (1)-(6), 1 model species tree per replicate on 51 taxa, 1000 true gene trees, simulated 1000 bp gene sequences using INDELible [8], 1000 gene trees estimated from GTR simulated sequences using FastTree-2 [7]
[7] Price, Dehal, Arkin 2015; [8] Fletcher, Yang 2009
ASTRAL is fairly robust to HGT+ILS
Davidson et al., RECOMB-CG, BMC Genomics 2015
![Page 22: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/22.jpg)
Tree accuracy when varying the number of species
[Figure: species tree topological error (FN), axis ticks 4% to 16%, versus number of species (10, 50, 100, 200, 500, 1000), comparing ASTRAL-II and MP-EST; 1000 genes, “medium” levels of recent ILS.]
![Page 23: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/23.jpg)
![Page 24: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/24.jpg)
Contributions (sample)
Methods for estimating species trees from genome-scale data:
• ASTRAL (Mirarab et al., Bioinformatics 2014, 2015) and ASTRID (Vachaspati and Warnow, BMC Genomics 2015): polynomial time methods that are statistically consistent under the MSC. Both can analyze very large datasets (1000 species and 1000 genes, or more) with high accuracy.
• Statistical binning (Mirarab et al., Science 2014, Bayzid et al. PLOS One 2015) can reduce gene tree estimation error, and lead to improved species tree estimations (topology, branch lengths, and incidence of false positives)
• BBCA (Zimmermann et al., BMC Genomics 2014) enables Bayesian co-estimation methods to scale to large numbers of genes
• DCM-boosting (Bayzid et al., BMC Genomics 2014) enables computationally intensive methods to scale to large numbers of species
Mathematical theory:
• Roch and Warnow (Systematic Biology 2015), regarding statistical consistency under the MSC given finite length sequences.
• Uricchio et al., BMC Bioinformatics 2016, number of loci needed to recover all the splits with high probability
Biological data analyses:
• Avian phylogenomics project (Jarvis, Mirarab et al., Science 2014)
• Thousand Plant Transcriptome Project (Wickett, Mirarab et al. PNAS 2014)
• Tarver et al. Genome Biology and Evolution 2016, Mammalian phylogeny
![Page 27: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/27.jpg)
Metagenomic taxonomic identification and phylogenetic profiling
Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes
![Page 28: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/28.jpg)
Basic Questions
1. What is this fragment? (Classify each fragment as well as possible.)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?
![Page 29: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/29.jpg)
This talk
• SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs)
• Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014)
![Page 30: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/30.jpg)
Phylogenetic Placement
Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene)
Output: Placement of query sequences on backbone tree
Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.
![Page 31: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/31.jpg)
Marker-based Taxon Identification
[Figure: fragmentary sequences from some gene, shown alongside full-length sequences for the same gene, with an alignment and a tree.]
![Page 32: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/32.jpg)
Align Sequence
[Figure: backbone tree on S1, S2, S3, S4.]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC
![Page 33: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/33.jpg)
Align Sequence
[Figure: backbone tree on S1, S2, S3, S4.]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
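The alignment step adds gaps to the query so that it occupies the same columns as the backbone, without changing the query's letters. A minimal sanity check of that property, using the sequences from this slide, might look like the sketch below. The function name `extends_backbone` is hypothetical, and the check says nothing about whether the alignment is a good one.

```python
# Backbone alignment and query copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
query = "TAAAAC"
aligned_query = "-------T-A--AAAC--------"


def extends_backbone(backbone, query, aligned_query, gap="-"):
    """Check that an aligned query fits the backbone and preserves the query."""
    widths = {len(row) for row in backbone.values()}
    if len(widths) != 1:
        return False  # backbone rows must all have the same number of columns
    same_width = len(aligned_query) in widths
    same_letters = aligned_query.replace(gap, "") == query
    return same_width and same_letters


print(extends_backbone(backbone, query, aligned_query))
```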
![Page 34: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/34.jpg)
Place Sequence
[Figure: backbone tree on S1, S2, S3, S4 with the query Q1 added.]
S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
![Page 35: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/35.jpg)
Phylogenetic Placement
• Align each query sequence to backbone alignment
– HMMALIGN (Eddy, Bioinformatics 1998)
– PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
• Place each query sequence into backbone tree
– pplacer (Matsen et al., BMC Bioinformatics, 2011)
– EPA (Berger and Stamatakis, Systematic Biology 2011)
Note: pplacer and EPA use maximum likelihood
![Page 36: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/36.jpg)
HMMER vs. PaPaRa Alignments
[Figure: comparison of HMMER and PaPaRa alignments; horizontal axis: increasing rate of evolution.]
![Page 37: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/37.jpg)
One Hidden Markov Model for the entire alignment?
![Page 38: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/38.jpg)
Or 2 HMMs?
[Figure labels: HMM 1, HMM 2]
![Page 39: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/39.jpg)
Or 4 HMMs?
[Figure labels: HMM 1, HMM 2, HMM 3, HMM 4]
![Page 40: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/40.jpg)
SEPP Parameter Exploration
• Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP
• 10% rule (subset sizes 10% of backbone) had best overall performance
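To make the 10% rule concrete, here is a rough sketch that splits a backbone tree into subsets of at most 10% of its leaves by repeatedly cutting the edge that divides the leaves most evenly. This follows the spirit of a tree-based decomposition but is not SEPP's code: the nested-tuple tree encoding, the function names, the tie-breaking, and the 20-leaf caterpillar backbone are all assumptions made for this illustration.

```python
def leaves(tree):
    """Leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def subtrees(tree):
    """Yield every node strictly below the root (one per edge of the tree)."""
    if isinstance(tree, str):
        return
    for child in tree:
        yield child
        yield from subtrees(child)


def prune(tree, target):
    """Return the tree with the subtree `target` removed, collapsing unary nodes."""
    if isinstance(tree, str):
        return tree
    kept = [prune(child, target) for child in tree if child != target]
    return kept[0] if len(kept) == 1 else tuple(kept)


def decompose(tree, max_size):
    """Split leaves into subsets of at most max_size (max_size must be >= 1)."""
    names = leaves(tree)
    if len(names) <= max_size:
        return [set(names)]
    half = len(names) / 2
    # Cut the edge whose lower side is closest to half of the leaves.
    cut = min(subtrees(tree), key=lambda node: abs(len(leaves(node)) - half))
    return decompose(cut, max_size) + decompose(prune(tree, cut), max_size)


# Hypothetical backbone: a 20-leaf caterpillar tree on S1..S20.
backbone = "S1"
for i in range(2, 21):
    backbone = (backbone, f"S{i}")

max_size = max(1, len(leaves(backbone)) // 10)  # the 10% rule
subsets = decompose(backbone, max_size)
print(max_size, [sorted(s) for s in subsets])
```

Each resulting subset would then get its own HMM (for alignment) or its own subtree (for placement). Smaller subsets mean more, narrower models, which is the accuracy, time, and memory trade-off the slide refers to.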
![Page 41: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/41.jpg)
SEPP (10%-rule) on simulated data
[Figure: results on simulated data; horizontal axis: increasing rate of evolution.]
![Page 43: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/43.jpg)
TIPP (https://github.com/smirarab/sepp)
TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), a marker-based method that only characterizes those reads that map to MetaPhyler’s marker genes
TIPP pipeline
1. Uses BLAST to assign reads to marker genes
2. Computes UPP/PASTA reference alignments
3. Uses reference taxonomies, refined to binary trees using reference alignment
4. Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree
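One way to picture how placement uncertainty can feed into classification is sketched below: given several candidate placements for a read, each with a probability and a taxonomic lineage, report the most specific rank whose accumulated support clears a threshold. This is an illustration only, not TIPP's implementation. The function name, the 0.95 default threshold, the input format, and the example lineages and probabilities are all assumptions.

```python
from collections import defaultdict


def classify(placements, threshold=0.95):
    """Return the most specific lineage prefix whose total support >= threshold.

    placements: list of (lineage, probability) pairs, where lineage is a tuple
    of taxon names ordered from the broadest rank to the most specific one.
    """
    depth = max((len(lineage) for lineage, _ in placements), default=0)
    for d in range(depth, 0, -1):  # try the most specific rank first
        support = defaultdict(float)
        for lineage, prob in placements:
            if len(lineage) >= d:
                support[lineage[:d]] += prob
        best, best_support = max(support.items(), key=lambda kv: kv[1])
        if best_support >= threshold:
            return best
    return ()  # nothing is supported well enough: leave the read unclassified


# Hypothetical candidate placements for a single read.
placements = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 0.60),
    (("Bacteria", "Proteobacteria", "Salmonella"), 0.37),
    (("Bacteria", "Firmicutes", "Bacillus"), 0.03),
]
print(classify(placements))
```

In this example no single genus reaches the threshold, so the read is reported at the phylum level, where the two leading candidates agree.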
![Page 44: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/44.jpg)
Abundance Profiling
Objective: Distribution of the species (or genera, or families, etc.) within the sample.
For example: The distribution of the sample at the species-level is:
50% species A
20% species B
15% species C
14% species D
1% species E
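Once each read has a taxon label, the profile is just the normalized label counts. The sketch below reproduces the example distribution from 100 hypothetical classified reads; the function name and the input format (one label per read) are assumptions for illustration, and real profilers also have to handle unclassified reads and marker-gene copy numbers.

```python
from collections import Counter


def abundance_profile(assignments):
    """Map each taxon to its percentage of the classified reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: 100.0 * n / total for taxon, n in counts.most_common()}


# Hypothetical read classifications matching the example above.
reads = (
    ["species A"] * 50
    + ["species B"] * 20
    + ["species C"] * 15
    + ["species D"] * 14
    + ["species E"] * 1
)
for taxon, percent in abundance_profile(reads).items():
    print(f"{percent:.0f}% {taxon}")
```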
![Page 45: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/45.jpg)
High indel datasets containing known genomes
Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.
![Page 46: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/46.jpg)
“Novel” genome datasets
Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.
![Page 47: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/47.jpg)
TIPP vs. other abundance profilers
• TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.
• All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).
• Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.
![Page 48: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/48.jpg)
SEPP and eHMMs
An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to
• detect homology between full length sequences and fragmentary sequences
• add fragmentary sequences into an existing alignment
especially when there are many indels and/or substitutions.
![Page 49: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/49.jpg)
Our Publications using eHMMs
• S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258.
• N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555.
• N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124
• N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17(Suppl 10):765
All codes are available in open source form at https://github.com/smirarab/sepp
![Page 51: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/51.jpg)
Computational Phylogenomics
Challenges: NP-hard problems, large datasets, complex statistical estimation problems
Applications: Metagenomics, protein structure and function prediction, medical forensics, systems biology, population genetics
![Page 52: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/52.jpg)
Future Work - Phylogenomics
• Better theory, addressing impact of gene tree estimation error and missing data
• Fast genome-scale phylogenetic tree estimation (high performance computing, statistically-based estimation taking multiple sources of discord into account)
• Phylogenetic network construction on large datasets (statistical methods within divide-and-conquer framework)
• Better statistical models of sequence evolution, addressing heterotachy
• Co-estimation of gene trees and species trees/networks
![Page 53: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/53.jpg)
Future work - Metagenomics
• Improved marker-based analyses, and addressing gene tree heterogeneity
• Rigorous methods for detecting novel genes and species
• High throughput analysis with high sensitivity
• Metagenome assembly
• HPC implementations
• Collaborations with biologists and biomedical researchers
![Page 54: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/54.jpg)
Future work - Multiple Sequence Alignment
• Improved large-scale MSA (e.g., PASTA and UPP)
• Extending statistical co-estimation of trees and MSA to large datasets (e.g., Nute and Warnow 2016)
• Efficient and useful sampling of MSAs
• MSA estimation in the presence of duplications and rearrangements (e.g., whole genome alignment)
• Better HMM+phylogeny models that are useful for estimating alignments and trees
![Page 55: Chalk Talk - Tandy Warnowtandy.cs.illinois.edu/chalk-talk-cmu.pdf · • ASTRAL (Mirarab et al., Bioinformacs 2014, 2015) and ASTRID (Vachaspa and Warnow, BMC Genomics 2015): polynomial](https://reader034.fdocuments.us/reader034/viewer/2022042802/5f3e70cdb2a851722176fbf8/html5/thumbnails/55.jpg)
Future work - Theory
• Basic algorithmic challenges:
– supertrees
– computing trees from distance matrices
– using chordal graphs for divide-and-conquer
– consensus trees
• Applied probability:
– Trade-off between data quality and quantity (e.g., statistical binning)
– Identifiability of tree models with noisy data
– Understanding ensembles of HMMs