One way that programming helps us understand biology

Page 1: One way that programming helps us understand biology

One way that programming helps us understand biology

Rob Lanfear

Ecology & Evolution

Australian National University

Page 2: One way that programming helps us understand biology

What is phylogenetics?

Page 3: One way that programming helps us understand biology

actgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgac

Page 4: One way that programming helps us understand biology

actgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgac

Page 5: One way that programming helps us understand biology

actgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgac

Page 6: One way that programming helps us understand biology

What is partitioning?

Page 7: One way that programming helps us understand biology

actgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgac

Page 8: One way that programming helps us understand biology

actgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgac

Page 9: One way that programming helps us understand biology

actgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgacactgactgactgactgactgactgactgactgactgactgactgac

Partitioning can improve inference

Page 10: One way that programming helps us understand biology

So what’s the problem?

Page 11: One way that programming helps us understand biology

Page 12: One way that programming helps us understand biology

Page 13: One way that programming helps us understand biology

Let’s solve the problem

Page 14: One way that programming helps us understand biology
Page 15: One way that programming helps us understand biology

A Simple Algorithm

Compare all possible partitioning schemes.

1. Calculate the AIC score of every possible partitioning scheme.

AIC (Akaike’s Information Criterion) estimates how far a model is from the truth, so a smaller score means a better model.

2. Use the scheme with the smallest AIC score.
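The two steps above can be sketched in a few lines of Python. This is only an illustration, not PartitionFinder's code: it assumes the standard definition AIC = 2k − 2 ln L (k free parameters, maximised log-likelihood ln L), and the scheme names, log-likelihoods and parameter counts are invented placeholders.

```python
def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln(L). Smaller is better."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical results: scheme name -> (maximised log-likelihood, free parameters).
# In a real analysis these would come from fitting models to the alignment.
fitted = {
    "one subset for everything": (-10250.0, 10),
    "one subset per gene": (-10100.0, 40),
    "genes 1+2 together, gene 3 alone": (-10110.0, 20),
}

# Step 1: calculate the AIC score of every candidate scheme.
scores = {name: aic(lnl, k) for name, (lnl, k) in fitted.items()}

# Step 2: use the scheme with the smallest AIC score.
best_scheme = min(scores, key=scores.get)
print(best_scheme, scores[best_scheme])
```

With these made-up numbers the intermediate scheme wins: it fits almost as well as one subset per gene but uses half the parameters.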

Page 16: One way that programming helps us understand biology

https://github.com/brettc/partitionfinder/blob/master/partfinder/submodels.py

Page 17: One way that programming helps us understand biology

Now we find a new problem…

Page 18: One way that programming helps us understand biology

Page 19: One way that programming helps us understand biology

How many schemes are there?

[Figure: number of possible schemes (Bn) plotted against number of genes (n)]

Page 20: One way that programming helps us understand biology

Our first approach won’t work for large datasets.

What can we do?

OPTIMISE!

Page 21: One way that programming helps us understand biology

Analysing EVERY scheme EVERY time is inefficient
Because we analyse the same subsets over and over…

s1 s2 s3 s4 s5 s6 s7 s8 s9
s10 s3 s4 s5 s6 s7 s8 s9

To calculate the AIC of all possible partitioning schemes, we only need to calculate the AIC of all possible subsets of genes

Then we can just add these together to get the AIC of the schemes…
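As a rough sketch of this idea, take two schemes that share most of their subsets, labelled here by subset number. Each unique subset is scored once, and a scheme's AIC is the sum of its subsets' scores. The values in `subset_aic` are arbitrary placeholders, not real results.

```python
# Two example schemes, written as lists of subset labels.
scheme_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
scheme_b = [10, 3, 4, 5, 6, 7, 8, 9]

# Placeholder AIC scores, one per unique subset (invented numbers).
subset_aic = {s: 100.0 + s for s in range(1, 11)}


def scheme_aic(scheme):
    """AIC of a scheme = sum of the AIC scores of its subsets."""
    return sum(subset_aic[s] for s in scheme)


analyses_naive = len(scheme_a) + len(scheme_b)  # every subset of every scheme
analyses_shared = len(subset_aic)               # every unique subset, once

print(scheme_aic(scheme_a), scheme_aic(scheme_b))
print(analyses_naive, "analyses versus", analyses_shared)
```

Even with only two schemes, scoring subsets rather than schemes cuts 17 analyses down to 10.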

Page 22: One way that programming helps us understand biology

How do we count the number of unique subsets?
Combinatorics!

s1 s2 s3 s4 s5 s6 s7 s8 s9
s10 s3 s4 s5 s6 s7 s8 s9

For example, if we start with 20 genes:
There are 51,724,158,235,372 possible partitioning schemes
but these are made up of just 1,048,575 possible subsets.
That’s about 99.999998% fewer analyses!
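Both counts can be checked with a short script. The number of ways to split n genes into subsets is the Bell number of n, and the number of non-empty subsets of n genes is 2^n − 1. The sketch below computes Bell numbers with the Bell triangle; the function name is chosen here for illustration.

```python
def bell_number(n):
    """Number of ways to partition n items, computed with the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


n_genes = 20
n_schemes = bell_number(n_genes)   # possible partitioning schemes
n_subsets = 2 ** n_genes - 1       # possible non-empty subsets of genes

print(n_schemes, n_subsets)
print(f"fraction of analyses avoided: {1 - n_subsets / n_schemes:.10f}")
```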

Page 23: One way that programming helps us understand biology

How many subsets and schemes?

[Figure: N on a logarithmic axis (1 to 1E+14) against number of genes (0 to 20). Series: number of schemes; number of subsets (exhaustive search).]

Page 24: One way that programming helps us understand biology

Converting schemes to subsets

• Using our recursive function from earlier, we can get all the schemes as a list of lists:

s1 s2 s3 s4 s5 s6 s7 s8 s9
s10 s3 s4 s5 s6 s7 s8 s9

all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]
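The recursive function itself does not appear in this transcript. One common way to write such a function is sketched below; it is an illustration, not the code in PartitionFinder's submodels.py. Unlike the slide, which refers to subsets by labels such as s1 and s10, this sketch writes each subset out as a tuple of gene numbers.

```python
def all_partitions(genes):
    """Yield every way of splitting `genes` into non-empty subsets."""
    if not genes:
        yield []
        return
    first, rest = genes[0], genes[1:]
    for partial in all_partitions(rest):
        # Option 1: put `first` in a subset of its own.
        yield [(first,)] + partial
        # Option 2: add `first` to each existing subset in turn.
        for i, subset in enumerate(partial):
            yield partial[:i] + [(first,) + subset] + partial[i + 1:]


schemes = list(all_partitions([1, 2, 3]))
for scheme in schemes:
    print(scheme)
```

Three genes give five schemes and four genes give fifteen, so the list grows very quickly as genes are added.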

Page 25: One way that programming helps us understand biology

Converting schemes to subsets

• We want to convert our list of lists into a set of unique subsets, i.e.

all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9],[10, 3, 4, 5, 6, 7, 8, 9]]

all_subsets = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

• Take 2 minutes and talk to your neighbour. Try to think of ways you would do this.

• How do you think I did it?

Page 26: One way that programming helps us understand biology

Converting schemes to subsets

• Lots of solutions…
• A simple one:

all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]

all_subsets = set()  # empty set

for scheme in all_schemes:       # loop through the list of schemes
    for subset in scheme:        # loop through the subsets in each scheme
        all_subsets.add(subset)  # add the value if it is not already there

>>> all_subsets
set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

Page 27: One way that programming helps us understand biology

Converting schemes to subsets

• Lots of solutions…
• A better one:

– Google “stack overflow python list of lists to set of unique values”
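A search like this typically turns up one-liners such as the two below. Both use only standard Python and build the same set as an explicit nested loop would. Which answer the speaker had in mind is not stated, so treat these as examples only.

```python
from itertools import chain

all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 3, 4, 5, 6, 7, 8, 9]]

# Flatten the list of lists, then let set() discard the duplicates.
all_subsets = set(chain.from_iterable(all_schemes))

# The same thing written as a set comprehension.
all_subsets_alt = {subset for scheme in all_schemes for subset in scheme}

print(sorted(all_subsets))
```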

Page 28: One way that programming helps us understand biology

Real programmers use Google and Stack Overflow. A lot.

You don’t need to solve every problem from scratch.

Page 29: One way that programming helps us understand biology

Converting schemes to subsets

• Convert our list of lists into a set of unique subsets, i.e.

all_schemes = [[1, 2, 3, 4, 5, 6, 7, 8, 9],[10, 3, 4, 5, 6, 7, 8, 9]]

all_subsets = set([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

• Now we calculate the AIC of each subset and store it.

• What data structure would you use?
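One natural answer is a dictionary that maps each subset to its AIC score, so that a score is calculated once and looked up afterwards. The sketch below assumes that choice. `analyse_subset` is a hypothetical stand-in for the slow model-fitting step, the gene names are invented, and the scores are made up. Dictionary keys must be hashable, so each subset of genes is stored as a `frozenset`.

```python
aic_of = {}  # frozenset of gene names -> AIC score
calls = 0    # how many times the slow step actually ran


def analyse_subset(genes):
    """Hypothetical stand-in for fitting a model to one subset (slow in reality)."""
    global calls
    calls += 1
    return 100.0 + 10.0 * len(genes)  # invented score, for illustration only


def get_aic(genes):
    """Return the stored AIC for a subset, analysing it only on first request."""
    key = frozenset(genes)  # hashable, and the order of genes does not matter
    if key not in aic_of:
        aic_of[key] = analyse_subset(key)
    return aic_of[key]


scheme_1 = [("geneA", "geneB"), ("geneC",)]
scheme_2 = [("geneA",), ("geneB",), ("geneC",)]

scores = [sum(get_aic(subset) for subset in scheme) for scheme in (scheme_1, scheme_2)]
print(scores, calls)
```

The two schemes contain five subsets between them, but the subset holding only geneC is shared, so the slow step runs four times.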

Page 30: One way that programming helps us understand biology

Summary

• We started with a naïve solution that analysed all partitioning schemes independently
• We only optimised this when we knew it wouldn’t work
• Optimisation takes many forms – but the key is to find the biggest inefficiencies and improve them.
• With a simple trick, we could make the code millions of times faster for large datasets!

Page 31: One way that programming helps us understand biology

Problem solved, right?

Wrong!

Today’s datasets can have thousands of genes.

Even with our new algorithm, 1,000 genes means at least 1.07 × 10^301 subsets to analyse…

Page 32: One way that programming helps us understand biology

A Solution

Heuristic search

Page 33: One way that programming helps us understand biology

Heuristic search

1. Pick a starting partitioning scheme
2. Get the AIC
3. Try a few similar partitioning schemes
4. Get the best AIC score
5. Go to step 3
6. Stop when you can’t improve the AIC anymore
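These steps describe a hill-climbing search. A minimal sketch follows, under two assumptions that are not on the slide: the starting scheme puts every gene in its own subset, and a "similar" scheme is one in which two subsets have been merged. `toy_subset_aic` is an invented scoring function standing in for real model fitting, and the gene names are hypothetical.

```python
from itertools import combinations


def toy_subset_aic(subset):
    """Invented score: a fixed cost per subset plus a penalty for mixing gene types."""
    kinds = {gene[0] for gene in subset}  # the first letter marks the gene type
    return 10.0 + 25.0 * (len(kinds) - 1)


def scheme_aic(scheme):
    return sum(toy_subset_aic(s) for s in scheme)


def neighbours(scheme):
    """All schemes made by merging one pair of subsets."""
    for a, b in combinations(scheme, 2):
        yield [s for s in scheme if s not in (a, b)] + [a | b]


def heuristic_search(genes):
    scheme = [frozenset([g]) for g in genes]  # step 1: starting scheme
    best = scheme_aic(scheme)                 # step 2: its AIC
    while True:
        candidates = list(neighbours(scheme))  # step 3: similar schemes
        if not candidates:
            return scheme, best
        challenger = min(candidates, key=scheme_aic)  # step 4: best of them
        if scheme_aic(challenger) >= best:            # step 6: no improvement
            return scheme, best
        scheme, best = challenger, scheme_aic(challenger)  # step 5: go again


final_scheme, final_aic = heuristic_search(["m1", "m2", "n1", "n2"])
print(sorted(sorted(s) for s in final_scheme), final_aic)
```

With this toy score the search merges genes of the same type and stops once the only remaining merge would mix the two types. Like any heuristic, it is not guaranteed to find the best scheme overall.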

Page 34: One way that programming helps us understand biology

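Greedy Algorithm

[Figure: schemes with 9, 8, 7 and 6 subsets, scored AIC9, AIC8, AIC7 and AIC6, under the label “Subsets Examined”. Steps are annotated “If: AIC8<AIC9” and “If: AIC7<AIC8”.]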

Page 35: One way that programming helps us understand biology

Efficient Heuristic Search

[Figure: N against number of data blocks. Series: number of schemes; number of subsets (exhaustive search); number of subsets (heuristic search).]

Page 36: One way that programming helps us understand biology

PartitionFinder: Combined Selection of Partitioning Schemes and Substitution Models for Phylogenetic Analyses

Robert Lanfear,*,1 Brett Calcott,1,2 Simon Y. W. Ho,3 and Stephane Guindon4

1Centre for Macroevolution and Macroecology, Ecology Evolution and Genetics, Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
2Philosophy Program, Research School of Social Sciences, Australian National University, Canberra, Australian Capital Territory, Australia
3School of Biological Sciences, University of Sydney, Sydney, New South Wales, Australia
4Department of Statistics, University of Auckland, Auckland, New Zealand

*Corresponding author: E-mail: [email protected].

Associate editor: Sudhir Kumar

Abstract

In phylogenetic analyses of molecular sequence data, partitioning involves estimating independent models of molecular evolution for different sets of sites in a sequence alignment. Choosing an appropriate partitioning scheme is an important step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for large multilocus DNA data sets. We demonstrate that these methods significantly outperform previous approaches, including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently proposed hierarchical clustering method. We have implemented these methods in an open-source program, PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of information-theoretic metrics (e.g., the Bayesian information criterion, Akaike information criterion [AIC], and corrected AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.

Key words: partitioning, AIC, BIC, AICc, model selection, molecular evolution.

Introduction

Molecular phylogenetics provides a wealth of important information for evolutionary biologists. However, the accuracy of molecular phylogenetic inference depends on having an appropriate model of molecular evolution (Sullivan and Joyce 2005; Simon et al. 2006). Because of this, there is a great deal of interest in developing methods to select evolutionary models and assess their adequacy (Ripplinger and Sullivan 2010; Jayaswal et al. 2011; Nguyen et al. 2011). The goal of model selection is to identify a model that is sufficiently complex to capture the evolutionary processes that have occurred but to avoid models with more parameters than can be reliably estimated from the available data (overparameterization). One of the most important aspects of models of molecular evolution is how they account for variation in evolutionary processes among the sites of an alignment, because the failure to correctly account for this variation can seriously mislead phylogenetic analyses (Buckley et al. 2001; Telford and Copley 2011).

There are two ways to incorporate the variation in evolutionary processes among different sites using currently available phylogenetic methods: mixture models and partitioning. With mixture models, the likelihood of each site is calculated under more than one substitution model (e.g., Le et al. 2008). The parameters of these substitution models, as well as the probability with which each model applies to each site, can be determined directly from the data (Pagel and Meade 2004). With partitioning, the user first groups together sites that are assumed to have evolved under similar processes and then estimates independent (i.e., unlinked) substitution models for each group of sites (e.g., Nylander et al. 2004; Brandley et al. 2005; McGuire et al. 2007). In contrast to mixture models, partitioning requires the a priori definition of appropriate groups of sites. Although mixture models are implemented in an increasing variety of phylogenetic software (e.g., Pagel and Meade 2004; Stamatakis 2006; Le et al. 2008), partitioning remains by far the most common approach to incorporating heterogeneity in evolutionary processes among sites (Blair and Murphy 2011).

Choosing an appropriate partitioning scheme is a central problem for most phylogenetic analyses (Brandley et al. 2005; Shapiro et al. 2006; McGuire et al. 2007; Li et al. 2008; Blair and Murphy 2011). Typically, phylogeneticists use their biological intuition to group together similar sites in an alignment into putatively homogeneous data blocks.

© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]

Mol. Biol. Evol. 29(6):1695–1701. 2012 doi:10.1093/molbev/mss020 Advance Access publication January 20, 2012 1695

• >100,000 downloads of the software
• >3000 citations of the paper
• Many follow-up papers and algorithms, including upcoming work with Dr. Bui where we have algorithms thousands of times faster than those I introduced today.

Page 37: One way that programming helps us understand biology

Take homes

• Start with simple, naïve solutions
– Build something that works
• Avoid premature optimisation
• Use Google and Stack Overflow
• Go and look up:
– Git and version control
– E.g. https://github.com/brettc/partitionfinder
• Find a problem you’re interested in.
• Start coding!