Large-Scale Multiple Sequence Alignment Tandy Warnow Founder Professor of Computer Science The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

Page 1: Large-Scale Multiple Sequence Alignment - Tandy Warnow · tandy.cs.illinois.edu/warnow-montpellier-v2.pdf · Large-scale multiple sequence alignment (MSA) is achievable with good accuracy

Large-Scale Multiple Sequence Alignment

Tandy Warnow, Founder Professor of Computer Science

The University of Illinois at Urbana-Champaign, http://tandy.cs.illinois.edu

Page 2:

1kp: Thousand Transcriptome Project

• Plant Tree of Life based on transcriptomes of ~1200 species
• More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, N. Nguyen (UIUC, UCSD, UCSD)

Plus many, many other people…

Challenge: Alignments and trees on > 100,000 sequences

Page 3:

1000-taxon models, ordered by difficulty (Liu et al., Science 324(5934):1561-1564, 2009)

Page 4:

Re-aligning on a tree

[Figure: one re-alignment cycle on a tree with subsets A, B, C, D: decompose dataset, align subsets, merge sub-alignments (ABCD), estimate ML tree on merged alignment]
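The "decompose dataset" step can be illustrated with a small sketch. It assumes the split is made by repeatedly removing the tree edge that best balances the number of leaves on each side, until every subset is small enough; the slide itself does not spell out the rule, and the edge-list tree representation and the function names `component` and `decompose` are hypothetical choices made for this example.

```python
from collections import defaultdict


def component(adj, start, blocked):
    """Nodes reachable from `start` without crossing the edge (start, blocked)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt in seen or (node == start and nxt == blocked):
                continue
            seen.add(nxt)
            stack.append(nxt)
    return seen


def decompose(edges, leaves, max_size):
    """Split the leaf set of an unrooted tree into subsets of at most max_size.

    Assumption for this sketch: each split removes the edge whose two sides
    are most balanced in leaf count.
    """
    leaves = set(leaves)
    if len(leaves) <= max_size or not edges:
        return [sorted(leaves)]

    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    best = None
    for a, b in edges:
        side_a = component(adj, a, b)
        imbalance = abs(len(leaves) - 2 * len(side_a & leaves))
        if best is None or imbalance < best[0]:
            best = (imbalance, side_a)

    side_a = best[1]
    # Each side keeps only the edges that lie entirely within it.
    edges_a = [(x, y) for x, y in edges if x in side_a and y in side_a]
    edges_b = [(x, y) for x, y in edges if x not in side_a and y not in side_a]
    return (decompose(edges_a, leaves & side_a, max_size)
            + decompose(edges_b, leaves - side_a, max_size))


# Four-leaf tree: A and B hang off node u, C and D hang off node v.
edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
print(decompose(edges, ["A", "B", "C", "D"], max_size=2))
# [['A', 'B'], ['C', 'D']]
```

Each subset would then be aligned independently and the sub-alignments merged; those steps need an external aligner and are left out here.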

Page 5:

SATé and PASTA Algorithms

1. Obtain initial alignment and estimated ML tree

2. Use tree to compute new alignment

3. Estimate ML tree on new alignment

4. Repeat until termination condition, and return the alignment/tree pair with the best ML score
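The loop on this slide can be written as a short skeleton. This is only a sketch: the three callables (`initial`, `realign`, `estimate_tree`) are hypothetical placeholders for whatever aligner and ML tree estimator are plugged in, and a fixed iteration count stands in for the unspecified termination condition.

```python
def iterate_alignment_and_tree(sequences, initial, realign, estimate_tree,
                               max_iterations=3):
    """Alternate between re-aligning on the current tree and re-estimating
    the tree, and return the (score, alignment, tree) triple with the best
    ML score seen.

    Hypothetical callables assumed by this sketch:
      initial(sequences)       -> (alignment, tree, ml_score)
      realign(sequences, tree) -> alignment
      estimate_tree(alignment) -> (tree, ml_score)
    """
    alignment, tree, score = initial(sequences)
    best = (score, alignment, tree)
    for _ in range(max_iterations):
        alignment = realign(sequences, tree)    # use tree to compute new alignment
        tree, score = estimate_tree(alignment)  # estimate ML tree on new alignment
        if score > best[0]:                     # higher log-likelihood is better
            best = (score, alignment, tree)
    return best
```

Keeping the best-scoring pair rather than the last one matters because an iteration can make the ML score worse.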

Page 6:

1000 taxon models, ordered by difficulty, Liu et al., Science 324(5934):1561-1564, 2009

24 hour SATé-I analysis, on desktop machines

(Similar improvements for biological datasets)

Page 7:

1000-taxon models ranked by difficulty

SATé-2 better than SATé-1

SATé-1 (Liu et al., Science 2009): can analyze up to 8K sequences

SATé-2 (Liu et al., Systematic Biology 2012): can analyze up to ~50K sequences

Page 8:

PASTA: even better than SATé-2

[Figure: RNASim. Tree error (FN rate, 0.00 to 0.20) versus number of sequences (10,000; 50,000; 100,000; 200,000) for Clustal-Omega, Muscle, Mafft, Starting Tree, SATe2, PASTA, and Reference Alignment]

PASTA: Mirarab, Nguyen, and Warnow, J. Comp. Biol. 2015
– Simulated RNASim datasets from 10K to 200K taxa
– Limited to 24 hours using 12 CPUs
– Not all methods could run (missing bars could not finish)
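The tree error in this figure is an FN (false negative) rate. A minimal sketch of that measure follows, assuming the usual definition: the fraction of non-trivial bipartitions of the true tree that are missing from the estimated tree. The nested-tuple tree representation and the function names are hypothetical and chosen only for this example.

```python
def bipartitions(tree):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(other) > 1:
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def fn_rate(true_tree, estimated_tree):
    """Fraction of true-tree bipartitions absent from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    missing = true_splits - bipartitions(estimated_tree)
    return len(missing) / len(true_splits)


true_tree = (("A", "B"), ("C", ("D", "E")))
estimated = (("A", "C"), ("B", ("D", "E")))
print(fn_rate(true_tree, estimated))  # 0.5
```

In the example the true tree has two non-trivial bipartitions (AB|CDE and DE|ABC) and the estimated tree recovers only the second.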

Page 9:

1kp: Thousand Transcriptome Project

• Plant Tree of Life based on transcriptomes of ~1200 species
• More than 13,000 gene families (most not single copy)

Gene Tree Incongruence

G. Ka-Shu Wong (U Alberta), N. Wickett (Northwestern), J. Leebens-Mack (U Georgia), N. Matasci (iPlant), T. Warnow, S. Mirarab, N. Nguyen (UIUC, UCSD, UCSD)

Plus many, many other people…

Challenge: Alignments and trees on > 100,000 sequences

Page 10:

[Figure: histogram of sequence lengths (x-axis: Length, 0 to 2000; y-axis: Counts, 0 to 12000). Mean: 317, Median: 266]

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary

Page 11:

[Figure: histogram of sequence lengths (x-axis: Length, 0 to 2000; y-axis: Counts, 0 to 12000). Mean: 317, Median: 266]

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary

All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.

Page 12:

UPP

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”

Nguyen, Mirarab, and Warnow. Genome Biology, 2014.

Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

Page 13:

UPP

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”

Nguyen, Mirarab, and Warnow. Genome Biology, 2014.

Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

Uses an ensemble of HMMs

Page 14:

Simple idea (not UPP)

• Select random subset of sequences, and build “backbone alignment”

• Construct a Hidden Markov Model (HMM) on the backbone alignment

• Add all remaining sequences to the backbone alignment using the HMM

Page 15:

One Hidden Markov Model for the entire alignment?

Page 16:

One Hidden Markov Model for the backbone alignment?

HMM 1

Page 17:

Or 2 HMMs?

HMM 1

HMM 2

Page 18:

Or 4 HMMs?

[Figure labels: HMM 1, HMM 2, HMM 3, HMM 4]

Page 19:

Or all 7 HMMs?

[Figure labels: HMM 1, HMM 2, HMM 3, HMM 4, HMM 5, HMM 6, HMM 7]
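The progression from 1 to 2 to 4 to "all 7" HMMs corresponds to keeping every subset in a nested decomposition (1 + 2 + 4 = 7), with one HMM built per subset. The sketch below only counts the subsets. It is an illustration under a simplifying assumption: the backbone is an ordered list that is split in half, standing in for a tree-based split, and `nested_subsets` is a hypothetical name.

```python
def nested_subsets(backbone, max_size):
    """Return every subset in the hierarchy produced by repeated halving.

    The full set is always included; a subset is split further only while it
    has more than max_size sequences. One HMM would be built per returned
    subset.
    """
    subsets = [list(backbone)]
    if len(backbone) > max_size:
        mid = len(backbone) // 2
        subsets += nested_subsets(backbone[:mid], max_size)
        subsets += nested_subsets(backbone[mid:], max_size)
    return subsets


hierarchy = nested_subsets([f"seq{i}" for i in range(8)], max_size=2)
print(len(hierarchy))                    # 7
print([len(s) for s in hierarchy])       # [8, 4, 2, 2, 4, 2, 2]
```

Using all levels of the hierarchy, rather than only the smallest subsets, gives each query sequence both broad and narrowly focused models to match against.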

Page 20:

UPP Algorithmic Approach

1. Select random subset of full-length sequences, and build “backbone alignment” with PASTA

2. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment

3. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs
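Steps 1 and 3 can be sketched without an aligner or HMM library. The example below makes two assumptions that are not stated on the slide: a sequence counts as "full-length" when its length is within 25% of the median length, and each remaining sequence is added using the single model that scores it best. The function names and the `score` callable are hypothetical.

```python
import random
import statistics


def split_backbone_and_queries(seqs, backbone_size, tolerance=0.25, seed=0):
    """Pick a random backbone from the full-length sequences.

    seqs maps sequence name -> unaligned sequence. "Full-length" here means
    within `tolerance` of the median length (an assumption of this sketch).
    Returns (backbone_names, query_names).
    """
    median = statistics.median(len(s) for s in seqs.values())
    full_length = sorted(name for name, s in seqs.items()
                         if abs(len(s) - median) <= tolerance * median)
    rng = random.Random(seed)
    backbone = set(rng.sample(full_length, min(backbone_size, len(full_length))))
    queries = [name for name in seqs if name not in backbone]
    return sorted(backbone), queries


def assign_to_best_model(query, models, score):
    """Return the model with the highest score for this query.

    `score(model, query)` is a hypothetical callable, e.g. an HMM bit score.
    """
    return max(models, key=lambda model: score(model, query))


seqs = {"a": "A" * 100, "b": "A" * 110, "c": "A" * 95, "frag": "A" * 20}
backbone, queries = split_backbone_and_queries(seqs, backbone_size=2)
print("frag" in queries)  # True: the fragment is never chosen for the backbone
```

Restricting the backbone to full-length sequences keeps fragments out of the alignment that the models are trained on; fragments are only added afterwards.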

Page 21:

Evaluation

• Simulated datasets (some have fragmentary sequences):
– 10K to 1,000,000 sequences in RNASim – complex RNA sequence evolution simulation
– 1000-sequence nucleotide datasets from SATé papers
– 5000-sequence AA datasets (from FastTree paper)
– 10,000-sequence Indelible nucleotide simulation

• Biological datasets:
– Proteins: largest BaliBASE and HomFam
– RNA: 3 CRW datasets up to 28,000 sequences

Page 22:

RNASim Million Sequences: alignment error

Notes:

• We show alignment error using average of SP-FN and SP-FP.

• UPP variants have better alignment scores than PASTA.

• (Not shown: Total Column Scores – PASTA more accurate than UPP)

• No other methods tested could complete on these data
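The alignment error named in the first note can be sketched as follows, assuming the standard sum-of-pairs definitions: SP-FN is the fraction of residue pairs aligned in the reference that are missing from the estimated alignment, and SP-FP is the fraction of residue pairs aligned in the estimated alignment that are not in the reference. Representing an alignment as a dict from name to gapped string, and the function names, are assumptions of this example.

```python
from itertools import combinations


def homologous_pairs(alignment, gap="-"):
    """Set of residue pairs placed in the same column.

    A residue is identified by (sequence name, index in the ungapped sequence).
    """
    names = sorted(alignment)
    next_index = dict.fromkeys(names, 0)
    pairs = set()
    for col in range(len(alignment[names[0]])):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_errors(reference, estimated):
    """Return (SP-FN, SP-FP, average of the two)."""
    ref = homologous_pairs(reference)
    est = homologous_pairs(estimated)
    sp_fn = len(ref - est) / len(ref) if ref else 0.0
    sp_fp = len(est - ref) / len(est) if est else 0.0
    return sp_fn, sp_fp, (sp_fn + sp_fp) / 2


reference = {"x": "AC-G", "y": "ACTG"}
estimated = {"x": "ACG-", "y": "ACTG"}
sp_fn, sp_fp, mean_error = sp_errors(reference, estimated)
print(round(mean_error, 3))  # 0.333
```

In the example each alignment has three aligned pairs and they share two, so both SP-FN and SP-FP are 1/3.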

Page 23:

RNASim Million Sequences: tree error

Using 12 processors:

• UPP(Fast,NoDecomp) took 2.2 days,

• UPP(Fast) took 11.9 days, and

• PASTA took 10.3 days

Page 24:

Performance on fragmentary datasets of the 1000M2 model condition

[Figure S32: Alignment and tree error of PASTA and UPP on the fragmentary 1000M2 datasets. Panel (a): average alignment error (mean alignment error, 0.0 to 0.6) versus % fragmentary (0, 12.5, 25, 50). Panel (b): average tree error (Delta FN tree error, 0.0 to 0.4) versus % fragmentary (0, 12.5, 25, 50). Methods: PASTA, UPP(Default)]

UPP is very robust to fragmentary sequences

Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods).

UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.

Page 25:

UPP Running Time

[Figure: wall-clock alignment time (hr, 0 to 15) versus number of sequences (50,000 to 200,000) for UPP(Fast)]

Wall-clock time used (in hours) given 12 processors

Page 26:

What about BAli-Phy?

• BAli-Phy (Redelings and Suchard): leading method for statistical co-estimation of alignments and trees

• Like Bayesian phylogeny estimation, it is expected to be the most rigorous and accurate technique for estimating trees and alignments!

Page 27:

BAli-Phy: Better than PASTA!

[Figure: Alignment Accuracy (TC score). Total-Column Score (0% to 40%) for MAFFT, PASTA, and BAli-Phy on Indelible (DNA) and RNAsim (RNA) simulated data with 100 and 200 taxa. Averages over 10 replicates]

Simulated datasets with 100 or 200 sequences.

Page 28:

But: BAli-Phy is limited to small datasets

From www.bali-phy.org/README.html, 5.2.1. Too many taxa?

“BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the time required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your data set, then adding some back if the MCMC is not too slow.”

Page 29:

Re-aligning on a tree

[Figure: one re-alignment cycle on a tree with subsets A, B, C, D: decompose dataset, align subsets with MAFFT, merge sub-alignments (ABCD), estimate ML tree on merged alignment]

Page 30:

Re-aligning on a tree

[Figure: one re-alignment cycle on a tree with subsets A, B, C, D: decompose dataset, align subsets with BAli-Phy??, merge sub-alignments (ABCD), estimate ML tree on merged alignment]

Page 31:

Comparing default PASTA to PASTA+BAli-Phy on simulated datasets with 1000 sequences

Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy

[Figure: three scatter plots comparing PASTA and PASTA+BAli-Phy, each with regions marked "PASTA+BAli-Phy Better" and "PASTA Better". Datasets: Indelible M2, RNAsim, Rose L1, Rose M1, Rose S1. Panels: Total Column Score (PASTA on x-axis, PASTA+BAli-Phy on y-axis, 0.0 to 0.4); Recall (SP-Score) (PASTA on x-axis, PASTA+BAli-Phy on y-axis, 0.6 to 1.0); Tree Error: Delta RF (RAxML) (PASTA+BAli-Phy on x-axis, PASTA on y-axis, 0.00 to 0.15)]

Page 32:

Results on 10,000-sequence datasets, backbone size 1000

Comparing UPP variants where the backbone alignment is computed using either default PASTA or PASTA+BAli-Phy

[Figure: three scatter plots comparing the two UPP variants, each with regions marked "PASTA+BAli-Phy Better" and "PASTA Better". Datasets: Indelible M2, RNAsim. Panels: Total Column Score (PASTA on x-axis, PASTA+BAli-Phy on y-axis, 0.000 to 0.100); Recall (SP-Score) (PASTA on x-axis, PASTA+BAli-Phy on y-axis, 0.900 to 1.000); Tree Error: Delta RF (FastTree-2) (PASTA+BAli-Phy on x-axis, PASTA on y-axis, 0.000 to 0.015)]

Page 33:

Summary

• Large-scale multiple sequence alignment (MSA) is achievable with good accuracy using divide-and-conquer plus iteration, which also enables preferred MSA methods to be used on very large datasets

• PASTA and UPP can provide good accuracy on large (million-sequence) datasets with high heterogeneity

• Fragmentary sequences present additional challenges that UPP can address

• PASTA and UPP at https://github.com/smirarab/

• PASTA+BAli-Phy at http://github.com/MGNute/pasta

Page 34:

Acknowledgments

PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD), undergrad: Keerthana Kumar (at UT-Austin)

PASTA+BAli-Phy: Mike Nute (PhD student at UIUC)

Current NSF grants: ABI-1458652 (multiple sequence alignment)

Grainger Foundation (at UIUC), and UIUC

TACC, UTCS, Blue Waters, and UIUC campus cluster

PASTA, UPP, SEPP, and TIPP are available on github at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/MGNute/pasta