Shifted stochastic processes evolving on trees: application to...
Transcript of Shifted stochastic processes evolving on trees: application to...
Shifted stochastic processes evolving on trees:application to models of adaptive evolution on
phylogenies
Paul Bastide1,2, Mahendra Mariadassou2, Stephane Robin1
1 UMR 518 MIA - AgroParisTech/INRA, Paris2 UR 1404 MaIAGE - INRA, Jouy en Josas
2 June 2015
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Introduction0
Unit
200 150 100 50 0
Chelonian phylogenetic tree with habitats.(Jaffe et al., 2011).
Dermochelys Coriacea
Homopus Areolatus
How can we explain the diversity, while accounting for thephylogenetic correlations ?
Modelling: a shifted stochastic process on the phylogeny.
PB, MM, SR Shifted stochastic processes on trees 2/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Principle of the Modeling and NotationsStochastic Processes UsedShifts
Stochastic Process on a Tree (Felsenstein, 1985)
E
D
C
B
AtAB
t
R
0 200 400 600 800
−4
−2
02
46
time
phen
otyp
e
E
D
C
B
A
R
Brownian Motion:
Var [A | R ] = σ2t
Cov [A; B | R ] = σ2tAB
PB, MM, SR Shifted stochastic processes on trees 3/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Principle of the Modeling and NotationsStochastic Processes UsedShifts
BM vs OU (Hansen, 1997; Butler and King, 2004)
Equation Stationary State VarianceModel-Based Comparative Analysis 685
Figure 1: Effect of varying j in a Brownian motion (BM) process. Plottedare random walks through time for a continuous character with phe-notypic value along the Y-axis and time along the X-axis. At each smallstep in time, the phenotype has an equal and independent probabilityof increasing or decreasing in value. Increasing j results in strongerrandom drift and a broader distribution of final states. Because Ornstein-Uhlenbeck (OU) processes contain a drift component, increasing j alsobroadens the distribution of final states in OU processes. Each paneldisplays 30 realizations of the stochastic process and the distribution offinal states (the Gaussian curves to the right of the random walks). Theprocess is simulated from to , with each realization havingt p 0 t p 1the same initial state. An animation of this process is provided in theonline edition of the American Naturalist.
Figure 2: Influence of the selection-strength parameter a and optimumtrait value v on a trait evolving under an Ornstein-Uhlenbeck (OU)process. Larger values of a imply stronger selection and hence a morerapid approach to the optimum value v (dotted line) as well as a tighterdistribution of phenotypes around the optimum. Each panel displays 20realizations of the OU process and the distribution of final trait values.The process is simulated from to , with each realization havingt p 0 t p 1the same initial state and value of v. See figure 1 for explanation of axes.
This term is linear in X, so it is as simple as it mightpossibly be. It contains two additional parameters: a mea-sures the strength of selection, and v gives the optimumtrait value. The force of selection is proportional to thedistance, , of the current trait value from the op-v ! X(t)timum. Thus, if the phenotype has drifted far from theoptimum, the “pull” toward the optimum will be verystrong, whereas if the phenotype is currently at the opti-mum, selection will have no effect until the stochasticitymoves the phenotype away from the optimum again orthere is a change in the optimum, v, itself. Because of itsdependence on the distance from the optimum, the OUprocess can be used to model stabilizing selection. Theeffect of varying a can be seen in figure 2.
Because the OU model reduces to BM when , ita p 0can be viewed as an elaboration of the BM model. As astatistical model, its primary justification is to be soughtin the fact that it represents a step beyond BM in thedirection of realism while yet remaining mathematicallytractable. As a model of evolution, the OU process is con-
sistent with a variety of evolutionary interpretations, twoof which we mention here.
Lande (1976) showed that under certain assumptions,evolution of the species’ mean phenotype can take theform of an OU process. In Lande’s formulation, both nat-ural selection and random genetic drift are assumed to acton the phenotypic character; the OU process’s optimum
denotes the location of a local maximum in a fitnessvlandscape. Felsenstein (1988) pointed out that in the eventthat this optimum itself moves randomly, the correct de-scription of the phenotypic evolution is no longer exactlyan OU process, but an OU process remains a goodapproximation.
Hansen (1997) raised questions concerning the time-scale of the approach of a species’ mean phenotype to itsoptimal value relative to that of macroevolution. Specifi-cally, he suggested that the macroevolutionary OU processhe proposed could only operate on far too slow a timescaleto be identical with the Landean OU process (cf. Lande1980). He proposed a different interpretation based on thesupposition that at any point in its history, a given phe-notypic character is subject to a large number of conflictingselective demands (genetic and environmental) so that itspresent value is the outcome of a compromise among
dW (t) = σdB(t) None. σij = γ2 + σ2tij
Model-Based Comparative Analysis 685
Figure 1: Effect of varying j in a Brownian motion (BM) process. Plottedare random walks through time for a continuous character with phe-notypic value along the Y-axis and time along the X-axis. At each smallstep in time, the phenotype has an equal and independent probabilityof increasing or decreasing in value. Increasing j results in strongerrandom drift and a broader distribution of final states. Because Ornstein-Uhlenbeck (OU) processes contain a drift component, increasing j alsobroadens the distribution of final states in OU processes. Each paneldisplays 30 realizations of the stochastic process and the distribution offinal states (the Gaussian curves to the right of the random walks). Theprocess is simulated from to , with each realization havingt p 0 t p 1the same initial state. An animation of this process is provided in theonline edition of the American Naturalist.
Figure 2: Influence of the selection-strength parameter a and optimumtrait value v on a trait evolving under an Ornstein-Uhlenbeck (OU)process. Larger values of a imply stronger selection and hence a morerapid approach to the optimum value v (dotted line) as well as a tighterdistribution of phenotypes around the optimum. Each panel displays 20realizations of the OU process and the distribution of final trait values.The process is simulated from to , with each realization havingt p 0 t p 1the same initial state and value of v. See figure 1 for explanation of axes.
This term is linear in X, so it is as simple as it mightpossibly be. It contains two additional parameters: a mea-sures the strength of selection, and v gives the optimumtrait value. The force of selection is proportional to thedistance, , of the current trait value from the op-v ! X(t)timum. Thus, if the phenotype has drifted far from theoptimum, the “pull” toward the optimum will be verystrong, whereas if the phenotype is currently at the opti-mum, selection will have no effect until the stochasticitymoves the phenotype away from the optimum again orthere is a change in the optimum, v, itself. Because of itsdependence on the distance from the optimum, the OUprocess can be used to model stabilizing selection. Theeffect of varying a can be seen in figure 2.
Because the OU model reduces to BM when , ita p 0can be viewed as an elaboration of the BM model. As astatistical model, its primary justification is to be soughtin the fact that it represents a step beyond BM in thedirection of realism while yet remaining mathematicallytractable. As a model of evolution, the OU process is con-
sistent with a variety of evolutionary interpretations, twoof which we mention here.
Lande (1976) showed that under certain assumptions,evolution of the species’ mean phenotype can take theform of an OU process. In Lande’s formulation, both nat-ural selection and random genetic drift are assumed to acton the phenotypic character; the OU process’s optimum
denotes the location of a local maximum in a fitnessvlandscape. Felsenstein (1988) pointed out that in the eventthat this optimum itself moves randomly, the correct de-scription of the phenotypic evolution is no longer exactlyan OU process, but an OU process remains a goodapproximation.
Hansen (1997) raised questions concerning the time-scale of the approach of a species’ mean phenotype to itsoptimal value relative to that of macroevolution. Specifi-cally, he suggested that the macroevolutionary OU processhe proposed could only operate on far too slow a timescaleto be identical with the Landean OU process (cf. Lande1980). He proposed a different interpretation based on thesupposition that at any point in its history, a given phe-notypic character is subject to a large number of conflictingselective demands (genetic and environmental) so that itspresent value is the outcome of a compromise among
dW (t) = σdB(t)
+α[β(t)−W (t)]dt
µ = β0
γ2 =σ2
2α
σij =σ2
2αe−αdij
(Root in Stationary
State)
PB, MM, SR Shifted stochastic processes on trees 4/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Principle of the Modeling and NotationsStochastic Processes UsedShifts
Shifts
E
D
C
B
A
R
0 200 400 600 800
−4
−2
02
46
time
phen
otyp
e
E
D
C
B
A
R
BM Shifts in the mean:
mj = mpa(j) +∑
k
I{τk = bj}δk
0 200 400 600 800−
2−
10
12
time
phen
otyp
e
E
D
C
B
AR
OU Shifts in the optimal value:
βj = βpa(j) +∑
k
I{τk = bj}δk
PB, MM, SR Shifted stochastic processes on trees 5/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Principle of the Modeling and NotationsStochastic Processes UsedShifts
Shifts
E
D
C
B
A
R
δ
0 200 400 600 800
−4
−2
02
46
time
phen
otyp
e
E
D
C
B
A
R
BM Shifts in the mean:
mj = mpa(j) +∑
k
I{τk = bj}δk
0 200 400 600 800−
2−
10
12
time
phen
otyp
e
E
D
C
B
AR
OU Shifts in the optimal value:
βj = βpa(j) +∑
k
I{τk = bj}δk
PB, MM, SR Shifted stochastic processes on trees 5/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Principle of the Modeling and NotationsStochastic Processes UsedShifts
Shifts
E
D
C
B
A
R
δ
0 200 400 600 800
05
10
time
phen
otyp
e
E
D
C
BA
E
D
C
BA
R
δ
BM Shifts in the mean:
mj = mpa(j) +∑
k
I{τk = bj}δk
0 200 400 600 800−
2−
10
12
time
phen
otyp
e
E
D
C
B
AR
OU Shifts in the optimal value:
βj = βpa(j) +∑
k
I{τk = bj}δk
PB, MM, SR Shifted stochastic processes on trees 5/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Principle of the Modeling and NotationsStochastic Processes UsedShifts
Shifts
E
D
C
B
A
R
δ
0 200 400 600 800
05
10
time
phen
otyp
e
E
D
C
BA
E
D
C
BA
R
δ
BM Shifts in the mean:
mj = mpa(j) +∑
k
I{τk = bj}δk
0 200 400 600 8000
24
6
time
phen
otyp
e
ED
C
BA
ED
C
BAR
δ
OU Shifts in the optimal value:
βj = βpa(j) +∑
k
I{τk = bj}δk
PB, MM, SR Shifted stochastic processes on trees 5/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Equivalencies
K fixed, several equivalent solutions.
µ
δ1 δ2
µ+δ2µ+δ1
µ
δ2 − δ1
δ1
Problem of over-parametrization: parsimonious configurations.
PB, MM, SR Shifted stochastic processes on trees 6/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Equivalencies
K fixed, several equivalent solutions.
µ
δ1 δ2
µ+δ2µ+δ1
µ
δ2 − δ1
δ1
µ+ δ1
δ2 δ1
µ+ δ2
µ+δ2µ+δ1
Problem of over-parametrization: parsimonious configurations.
PB, MM, SR Shifted stochastic processes on trees 6/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
≤
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Parsimonious Solution : Definition
Definition
A coloring of the tips being given, a parsimonious allocation of theshifts is such that it has a minimum number of shifts.
∼
PB, MM, SR Shifted stochastic processes on trees 7/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Equivalent Parsimonious Allocations
Definition
Two allocations are said to be equivalent (noted ∼) if they areboth parsimonious and give the same colors at the tips.
Find one solution Several existing Dynamic Programmingalgorithms (see Felsenstein, 2004).
Enumerate all solutions New recursive algorithm, adapted fromprevious ones (and implemented in R).
PB, MM, SR Shifted stochastic processes on trees 8/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Identifiability ProblemsNumber of Parsimonious SolutionsNumber of Models with K Shifts
Number of Models with K Shifts
Hypothesis “No Homoplasy”: 1 shift = 1 new color.
“K shifts ⇐⇒ K + 1 colors”
Bijection
SPIK = SP
K/ ∼; SPK = {Parsimonious allocations of K shifts}
SPIK ' {Tree compatible coloring of tips in K + 1 colors}
Problem Size of SPIK ?
Proposition∣∣SPIK
∣∣ ≤ (m+n−1K
)∣∣SPIK
∣∣ depends on the topology of the tree. It can becomputed with a recursive algorithm.
For a binary tree:∣∣SPI
K
∣∣ =(2n−2−K
K
).
PB, MM, SR Shifted stochastic processes on trees 9/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Incomplete Data Model : EMLinear Regression ModelModel Selection
Incomplete Data Model : EM
Y5
Y4
Y3
Y2
Y1
Z1
Z4
Z2
Z3
δk
Xj |Xpa(j) ∼ N
(qjXpa(j) + rj + sj
∑k
I{τk = bj}δk , σ2j
)
EM Algorithm Maximize Eθ[ log pθ(Z ,Y ) | Y ].
E step “Upward-Downward” Algorithm.
M step OU: increase objective function (GM).
Initialization LASSO regression (see next).
PB, MM, SR Shifted stochastic processes on trees 10/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Incomplete Data Model : EMLinear Regression ModelModel Selection
Linear Regression Model
Y5
Y4
Y3
Y2
Y1
Z1
Z4
Z2
Z3
δ3
δ1
δ2
∆ =
µδ1
00δ2
0δ3
00
T ∆ =
µ+ δ2
µµ+ δ1 + δ3
µ+ δ1
µ+ δ1
T =
Z1 Z2 Z3 Z4 Y1 Y2 Y3 Y4 Y5
Y1 1 0 0 1 1 0 0 0 0Y2 1 0 0 1 0 1 0 0 0Y3 1 1 0 0 0 0 1 0 0Y4 1 1 1 0 0 0 0 1 0Y5 1 1 1 0 0 0 0 0 1
BM : Y = T∆ + E
OU : Y = TW (α)∆+E
PB, MM, SR Shifted stochastic processes on trees 11/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Incomplete Data Model : EMLinear Regression ModelModel Selection
Model Selection on K
Proposition (Form of the Penalty and guaranties (α known))
Under our setting:
Y = R∆ + γE with E ∼ N (0,V ) and S = {Sη , η ∈M}, M =⋃
K≥0
SPIK
Define the following penalty:
pen(K) = An − K − 1
n − K − 2EDkhi[K + 2, n−K − 2, e−LK ], LK = log
∣∣∣SPIK
∣∣∣+ 2 log(K + 2)
and the estimator: η = argminη∈M
‖Y − sη‖2V
(1 +
pen(Kη)
n − Kη − 1
)Under some reasonable technical hypothesis, we get the non-asymptotic bound:
E
[∥∥s − sη
∥∥2
V
γ2
]≤ C(A, κ)
[inf
η∈M
{‖s − sη‖2
V
γ2+ Dη(3 + log(n))
}+ 1 + log(n)
]
PB, MM, SR Shifted stochastic processes on trees 12/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Incomplete Data Model : EMLinear Regression ModelModel Selection
Model Selection: Important Points
Based on Baraud et al. (2009)
Non-asymptotic bound.Unknown variance.No constant to be calibrated.
Novelties Non iid variance.Penalty depends on the tree topology(through
∣∣SPIK
∣∣).
PB, MM, SR Shifted stochastic processes on trees 13/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Chelonia Dataset
0
Unit
200 150 100 50 0
FreshwaterIslandMainlandSaltwater
Colors: habitats.Boxes: selected EM regimes.
PB, MM, SR Shifted stochastic processes on trees 14/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Chelonia Dataset
0
Unit
200 150 100 50 0
FreshwaterIslandMainlandSaltwater
Colors: habitats.Boxes: selected EM regimes.
Chelonia mydas
PB, MM, SR Shifted stochastic processes on trees 14/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Chelonia Dataset
0
Unit
200 150 100 50 0
FreshwaterIslandMainlandSaltwater
Colors: habitats.Boxes: selected EM regimes.
Geochelone nigra abingdoni
PB, MM, SR Shifted stochastic processes on trees 14/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Chelonia Dataset
0
Unit
200 150 100 50 0
FreshwaterIslandMainlandSaltwater
Colors: habitats.Boxes: selected EM regimes.
Chitra indica
PB, MM, SR Shifted stochastic processes on trees 14/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Chelonia Dataset
0
Unit
200 150 100 50 0
FreshwaterIslandMainlandSaltwater
Colors: habitats.Boxes: selected EM regimes.
Habitat EM
No. of shifts 16.00 5.00
No. of regimes 4.00 6.00
lnL -135.56 -97.59
ln 2/α (%) 7.83 5.43
γ2 0.35 0.22
CPU time (min) 1.25 134.49
PB, MM, SR Shifted stochastic processes on trees 14/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Conclusion and Perspectives
A general inference framework for trait evolution models.
Conclusions Some problems of identifiability arise.An EM can be written to maximize likelihood.Adaptation of model selection results to non-iidframework.
R codes Available on GitHub:https://github.com/pbastide/Phylogenetic-EM
Perspectives Multivariate traits.Deal with uncertainty (tree, data).Use fossil records.
PB, MM, SR Shifted stochastic processes on trees 15/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Bibliography
Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown variance. Annals of Statistics, 37(2):630–672, Apr. 2009. doi: 10.1214/07-AOS573.
M. A. Butler and A. A. King. Phylogenetic Comparative Analysis: A Modeling Approach for Adaptive Evolution.The American Naturalist, 164(6):pp. 683–695, 2004. ISSN 00030147.
J. Felsenstein. Phylogenies and the Comparative Method. The American Naturalist, 125(1):pp. 1–15, Jan. 1985.ISSN 00030147.
J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Suderland, USA, 2004.
T. F. Hansen. Stabilizing selection and the comparative analysis of adaptation. Evolution, 51(5):1341–1351, oct1997.
A. L. Jaffe, G. J. Slater, and M. E. Alfaro. The evolution of island gigantism and body size variation in tortoisesand turtles. Biology letters, 2011.
Photo Credits :
- ”Parrot-beaked Tortoise Homopus areolatus CapeTown 8” by Abu Shawka - Own work. Licensed under CC0via Wikimedia Commons
- ”Leatherback sea turtle Tinglar, USVI (5839996547)” by U.S. Fish and Wildlife Service Southeast Region -Leatherback sea turtle/ Tinglar, USVIUploaded by AlbertHerring. Licensed under CC BY 2.0 viaWikimedia Commons
- ”Hawaii turtle 2” by Brocken Inaglory. Licensed under CC BY-SA 3.0 via Wikimedia Commons
- ”Dudhwalive chitra” by Krishna Kumar Mishra — Own work. Licensed under CC BY 3.0 via WikimediaCommons
- ”Lonesome George in profile” by Mike Weston - Flickr: Lonesome George 2. Licensed under CC BY 2.0 viaWikimedia Commons
PB, MM, SR Shifted stochastic processes on trees 16/17
Stochastic Processes on TreesIdentifiability Problems and Counting Issues
Statistical InferenceChelonia Data Set
References
Thank you for listening
PB, MM, SR Shifted stochastic processes on trees 17/17