Introduction to Coalescent
Pavlos Pavlidis
Pavlos Pavlidis () Introduction to Coalescent 2013/02 1 / 91
Methods in population genetics
Workflow in population genetics studies:
Population genetics Materials and Methods
Sample a population
Study the sample
Infer parameters for the population
Learn about the past of the population
Generalize the results for the whole population
Example of a population genetics study
The Genographic project
National Geographic project
A person can order a kit from the website and provide their DNA sample to the project
Data are analyzed and conclusions about the human population are made
The Genographic Project
It uses a sample of modern humans
Understand population processes and population parameters (e.g. the migration rate)
Learn the history of the population
The phylogeny of languages
We are interested in parameters of the population
Was the population constant during its history?
Has it evolved neutrally or are there signs of selection?
What is the mutation rate, what is the recombination rate?
Is there gene flow? (migration)
We want to answer these questions by analyzing
population samples
Ideas from classical statistics
In statistics we often want to know the parameters of some distribution.
Some distributions and their parameters
[Figure: density plots of N(0,1), N(0,10), Exp(1) and Exp(10), drawn with dnorm(x, 0, 1), dnorm(x, 0, 10), dexp(x2, 1) and dexp(x2, 10)]
Sampling from a distribution provides information
about the properties of the distribution
Inferring population parameters
A major goal in population genetics (similar to statistics) is to estimate the parameters of the population by studying population samples
Can you name some population parameters?
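The idea that a sample carries information about the parameters of its distribution can be sketched in a few lines. This is only an illustration, assuming the N(0, 10) and Exp(10) distributions from the earlier plots (standard deviation 10 and rate 10); the function names, the seed and the sample size of 10,000 are arbitrary choices, not part of the lecture.

```python
import random
import statistics


def estimate_normal(sample):
    """Estimate (mean, standard deviation) of a normal distribution from a sample."""
    return statistics.mean(sample), statistics.stdev(sample)


def estimate_exponential_rate(sample):
    """Estimate the rate of an exponential distribution as 1 / sample mean."""
    return 1.0 / statistics.mean(sample)


rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable

# Draw samples from distributions whose parameters we pretend not to know.
normal_sample = [rng.gauss(0.0, 10.0) for _ in range(10_000)]
exp_sample = [rng.expovariate(10.0) for _ in range(10_000)]

mean_hat, sd_hat = estimate_normal(normal_sample)
rate_hat = estimate_exponential_rate(exp_sample)

print(f"N(0, 10): estimated mean = {mean_hat:.2f}, estimated sd = {sd_hat:.2f}")
print(f"Exp(10):  estimated rate = {rate_hat:.2f}")
```

The estimates should land close to the true values, and they get closer as the sample grows. Population genetics follows the same logic, except that the "distribution" is over sets of sequences.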
Hypothesis testing
Does a neutral model (null hypothesis) or a selection model fit the data better?
Was the population size constant (null hypothesis) or did it change over time?
Was the migration rate 0 (null hypothesis) or did migration occur during the evolution of the population?
In population genetics there are several parameters
Population genetics parameters define ‘the probability distribution of sequences’.
θ = 4Nµ is the (population-scaled) mutation rate.
ρ = 4Nr is the (population-scaled) recombination rate.
M = 4Nm is the (population-scaled) migration rate.
These parameters will be explained later.
Mutation rate: 0.1 versus 0.5
TCCGGTTCCCATTCATATGG
TCCGGTTCCCATTATCTTGG
TCCGGTTCCCATTCATCTGG
TCCGGTTCCCATTCATCTGG
TCCGGTTCCCATTCATCTGG

TGACCACTGCCCAAACAGCT
AATAGAAGAGGTGTACGCCC
AAGGGCCCCCCTGAACGCAC
AATGGCCCAGCTGTACGTGC
AAGGGCCCGGCTGTACGTAC
Recombination rate: 0.1 versus 0.5
CTCCTGCCCCTGAGCGGATG
TTACTCCAACACAGCAGATG
TTACTCCAACACAGCAGATG
TTTGCCACATCCGGCGGATA
TTTGCCACATCCGGTGGATA

TGACCACTGCCCAAACAGCT
AATAGAAGAGGTGTACGCCC
AAGGGCCCCCCTGAACGCAC
AATGGCCCAGCTGTACGTGC
AAGGGCCCGGCTGTACGTAC
Samples and parameters
Parameter values affect what samples look like.
Samples contain information about the parameter values of the population
Hypothesis testing example
Assume a sample of k := 12 homologous genes, that is, k individuals.
Hypothesis: Our sample is from a constant (over time) population of N individuals.
The length of the sequences is l := 1000.
The mutation rate per base pair, per generation is µ := 10^−8.
We observe s = 20 segregating sites (SNPs, Single Nucleotide Polymorphisms).
Is our hypothesis correct?
How probable is it to observe s ≤ 20 segregating sites for a population of size N?
Ideas?
What would you do to solve this problem?
What we need is . . .
. . . a way to calculate sample distributions for population genetics samples
Summary 1
We need methods to study samples
In population genetics studies we are often interested in sample distributions: number of SNPs, pairwise differences between sequences
Sample distributions are often related to hypothesis testing: What is the probability to observe ≤ 20 segregating sites (SNPs) in a sample of 12 individuals given a mutation rate µ and a population size N?
To study sample distributions either analytically or by simulations we need a model for samples
Individuals in a population genetics sample are not independent: They are related due to co-ancestry
We will see later on that the coalescent is a natural way to study sample distributions either analytically or via simulations
The coalescent
is a model that describes the relationships within a sample from the present individuals (sequences) back to the most recent common ancestor.
Thus, it provides a natural way to model samples from populations.
Used for simulations, the coalescent model allows us to repeatedly draw samples from a population with certain parameters.
One can also study the properties of the coalescent analytically.
It allows us to estimate population genetics parameters . . .
. . . and to perform hypothesis testing.
The coalescent is NOT a tree reconstruction method. It is a sampling method.
The coalescent differs from the phylogenetic tree concept!
The goals of population genetics analyses usually differ from those in phylogenetics.
In phylogenetics the goal is to obtain the tree that best describes the data. The research question is: What are the evolutionary relationships between the given set of sequences?
In population genetics:
we want to learn something about the population
1 but the genealogy is used exclusively to obtain statistical samples
2 or to analytically infer the properties of samples.
3 there is no single tree ("the tree"); we calculate statistics over (all) possible genealogies
4 A genealogy is a rooted, strictly binary tree!
Let’s see a coalescent . . .
[Figure: Schematically, the coalescent looks like a tree (leaves labelled 1 to 6)]
[Figure: Often it is drawn upside-down]
Terminology on the coalescent
coalescent, coalescent tree, genealogy, coancestry, . . .
Branches
Most Recent Common Ancestor (MRCA)
Height
Nodes or Coalescent Events
Total Length
Leaves
Let’s see how a coalescent is built
Constructing a coalescent tree: The discrete way
Assume a population of 20 individuals.
Sample individuals (n = 5). Sampling is random.
Let’s go one generation into the past . . .
. . . and choose parents . . .
. . . let’s go backwards one more generation . . .
. . . and choose parents again . . .
[Figure sequence: the population drawn at each generation, with the ancestry of the sample traced back step by step from t = 0 (present) to t = 10]
Building a coalescent within the population: The discrete way
What do we need to know to construct a coalescent?
the sample size: 5
the population size of each generation: 20, and we assume that it remains constant
the probability of each individual to be chosen as a parent (here Uniform = neutral)
Why is: Uniform = neutral?
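The three ingredients listed above (sample size, constant population size, uniform choice of parents) are enough to sketch the discrete construction in code. The following is a minimal illustration under the haploid Wright-Fisher assumptions, not the implementation used in the lecture: it tracks only how many ancestral lineages remain, not the tree topology, and the function name and the seed are hypothetical choices.

```python
import random


def simulate_discrete_coalescent(pop_size, sample_size, rng):
    """Trace a sample backward in time, one generation at a time.

    Returns a list of (generation, lineages_remaining) pairs, one entry for
    every generation in which the number of ancestral lineages dropped.
    """
    lineages = sample_size
    generation = 0
    history = []
    while lineages > 1:
        generation += 1
        # Each lineage picks its parent uniformly at random (neutrality).
        parents = {rng.randrange(pop_size) for _ in range(lineages)}
        if len(parents) < lineages:
            # At least two lineages chose the same parent: they coalesced.
            lineages = len(parents)
            history.append((generation, lineages))
    return history


rng = random.Random(42)  # arbitrary seed
history = simulate_discrete_coalescent(pop_size=20, sample_size=5, rng=rng)
for generation, lineages in history:
    print(f"t = {generation}: {lineages} ancestral lineage(s) remain")
```

Note that only the ancestors of the sample are followed; the other individuals of the population never need to be generated.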
Why are backward simulations faster than forward simulations?
But...
Anafits by Andre Aberer
In forward simulations we need to simulate the entire population at each generation
[Figure: the whole population at t = 1 and t = 0 (present)]
Then uniformly choose n = 5 individuals for the sample.
In backward simulations we only care about the present sample size (we know the population size)
[Figure: two panels, Forward and Backward, each showing generations t = 1 and t = 0 (present)]
Assumptions of the evolutionary model used in this
simple version of the coalescent (J. Kingman)
We assume the Wright-Fisher model:
discrete, non-overlapping generations
haploid individuals
population size is constant
neutrality: all individuals are equally fit to survive
no population structure, no migration, . . .
no recombination
Later we will see how to relax some of these assumptions.
Simple mathematical formulas and the coalescent: Drawing the coalescent, and the coalescent of two individuals
The probability that a pair of individuals coalesced in the previous generation
This is the probability that two individuals in generation t = i had the same parent in generation t = i + 1
The second individual has the same parent as the first with probability: $p_N = \frac{1}{N}$
Notice: implicitly we assume neutrality!!
Question 1
What is the expected number of generations that will pass until a coalescent event occurs (given a sample size of 2 individuals)?
Intuitively, this is just N generations.
More formally
Geometric distribution with probability of success p: $p(x) = (1-p)^{x-1} p$
How many trials do we expect on average until the first success? $E(x) = \sum_{x=1}^{\infty} x\,p(x) = \sum_{x=1}^{\infty} x (1-p)^{x-1} p = 1/p$.
The mean value of a geometric distribution with parameter p is $\frac{1}{p} = N$ (here p = 1/N).
The variance of a geometric distribution is $\frac{1-p}{p^2}$.
The variance is large; thus, the process will often either be long or short.
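The mean and variance of the geometric waiting time can be checked by simulation. This is a small sketch under the same assumptions as the slides (haploid population of size N, parents chosen uniformly); the population size N = 100, the number of replicates and the seed are arbitrary illustrative values.

```python
import random
import statistics


def pair_coalescence_time(pop_size, rng):
    """Generations until two lineages pick the same parent."""
    generations = 1
    # Label the first lineage's parent as 0; the second lineage picks
    # that same parent with probability 1 / pop_size in every generation.
    while rng.randrange(pop_size) != 0:
        generations += 1
    return generations


N = 100
rng = random.Random(7)  # arbitrary seed
times = [pair_coalescence_time(N, rng) for _ in range(20_000)]

mean_time = statistics.mean(times)
var_time = statistics.pvariance(times)
p = 1.0 / N

print(f"simulated mean     = {mean_time:.1f}   (theory 1/p       = {1 / p:.1f})")
print(f"simulated variance = {var_time:.0f}  (theory (1-p)/p^2 = {(1 - p) / p**2:.0f})")
```

The standard deviation is almost as large as the mean itself, which is why individual waiting times are often much shorter or much longer than N generations.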
Summary 2
We can construct a coalescent tree of a random sample backward in time.
Neutrality is implied by picking parents uniformly at random.
For a sample of 2 individuals:
The probability of a coalescent event happening one generation before is $\frac{1}{N}$.
The waiting time until this first coalescent event follows a geometric distribution with parameter $p = \frac{1}{N}$.
When there are two individuals, we will have to wait on average for N generations until they coalesce.
The variance of the waiting time is large!
Questions
Assume two populations A and B and two individuals from each of these two populations, A1, A2, B1, B2. Which coalescent tree is expected to be deeper on average if $N_A < N_B$? (A1,A2 or B1,B2).
Assume two human individuals and two Drosophila individuals ($N_{human} = 10000$, $N_{Droso} = 1000000$). Which coalescent is expected to be deeper on average?
Simple mathematical formulas and the coalescent: The coalescent for n > 2 individuals
Usually the sample size is greater than 2
[Figure: genealogy of a sample traced back from t = 0 (present) to t = 10]
What is the probability of a coalescent event in one generation?
[Figure: generations t = 0 (present) and t = 1]
What is the probability of a coalescent event occurring one generation back?
It is actually easier to calculate the probability that a coalescent event does not occur in a generation.
If the sample size is k, and the population size is 2N, then . . .
one individual chooses its parent.
the second individual chooses a DIFFERENT parent with probability $\frac{2N-1}{2N}$.
the third individual chooses a DIFFERENT parent with probability $\frac{2N-2}{2N}$.
the k-th individual chooses a DIFFERENT parent with probability $\frac{2N-k+1}{2N}$.
the probability that no coalescent occurs is:
$P(\mathrm{NOCOAL}) = \frac{2N-1}{2N} \cdot \frac{2N-2}{2N} \cdots \frac{2N-k+1}{2N}$
$= (1 - \frac{1}{2N})(1 - \frac{2}{2N})(1 - \frac{3}{2N}) \cdots = 1 - \frac{1}{2N} - \frac{2}{2N} - \ldots + \frac{2}{(2N)^2} - \ldots$
$\approx 1 - \frac{1}{2N} - \frac{2}{2N} - \frac{3}{2N} - \ldots - \frac{k-1}{2N} = 1 - \sum_{i=1}^{k-1} \frac{i}{2N} = 1 - \binom{k}{2} \frac{1}{2N}$
(NOTE: $1 + 2 + 3 + \ldots + (k-1) = \frac{k(k-1)}{2} = \binom{k}{2}$)
We neglect the probability that two coalescents occur simultaneously in a single generation. Why?
The probability to observe a single coalescent event one generation back is $\binom{k}{2} \frac{1}{2N}$.
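How much is lost by dropping the higher-order terms can be checked numerically. The sketch below compares the exact product with the first-order approximation $1 - \binom{k}{2}\frac{1}{2N}$; the function names and the example values (k = 5, 2N = 20000) are illustrative assumptions.

```python
from math import comb


def prob_no_coalescence_exact(k, two_n):
    """Exact probability that k lineages choose k distinct parents among two_n."""
    prob = 1.0
    for i in range(1, k):
        # The (i+1)-th lineage must avoid the i parents already taken.
        prob *= (two_n - i) / two_n
    return prob


def prob_no_coalescence_approx(k, two_n):
    """First-order approximation: 1 - C(k, 2) / (2N)."""
    return 1.0 - comb(k, 2) / two_n


k, two_n = 5, 20_000
exact = prob_no_coalescence_exact(k, two_n)
approx = prob_no_coalescence_approx(k, two_n)
print(f"exact  = {exact:.10f}")
print(f"approx = {approx:.10f}")
print(f"difference = {exact - approx:.2e}")
```

The neglected terms are of order $1/(2N)^2$, so the approximation is good as long as the sample size k is small compared with the population size 2N.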
The probability of observing a coalescent event one generation back increases with the sample size.
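The effect of the sample size can be checked numerically. The sketch below compares the exact probability that at least one coalescent event occurs in one generation (one minus the product of the "different parent" probabilities) with the approximation C(k,2)/2N. The function names and the population size 2N = 20000 are illustrative assumptions, not values from the slides.

```python
from math import comb

def p_no_coalescence_exact(k, two_n):
    """Exact probability that k lineages pick k distinct parents among two_n."""
    p = 1.0
    for i in range(1, k):
        p *= (two_n - i) / two_n
    return p

def p_coalescence_approx(k, two_n):
    """First-order approximation C(k, 2) / 2N."""
    return comb(k, 2) / two_n

two_n = 20_000  # hypothetical population size (number of gene copies)
for k in (2, 5, 10, 20):
    exact = 1.0 - p_no_coalescence_exact(k, two_n)
    approx = p_coalescence_approx(k, two_n)
    print(f"k={k:2d}  exact={exact:.6f}  approx={approx:.6f}")
```

Both columns grow with k, and they stay close as long as k is small compared with 2N.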
How does this affect the shape of the coalescent?
Summary: the discrete coalescent for a sample of size k
It proceeds in discrete generation steps backward in time.
It starts with k individuals and stops with 1.
The probability of a coalescent event increases with the number of lineages.
This increasing probability (as a function of the number of lineages) generates coalescent trees with short branches at the leaves and long branches toward the root.
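A minimal simulation sketch of the discrete process summarized above, assuming each lineage picks its parent uniformly among 2N gene copies in the previous generation. The function name, the seed and the values k = 10, 2N = 500 are hypothetical choices for illustration. Unlike the approximation, this version allows several lineages to coalesce in the same generation.

```python
import random

def discrete_coalescent_times(k, two_n, rng):
    """Return {j: generations spent with j lineages} for one genealogy.

    Going back one generation, every lineage picks a parent uniformly among
    two_n gene copies; lineages that pick the same parent coalesce.
    """
    generations = {}
    lineages = k
    while lineages > 1:
        parents = {rng.randrange(two_n) for _ in range(lineages)}
        generations[lineages] = generations.get(lineages, 0) + 1
        lineages = len(parents)
    return generations

rng = random.Random(1)
k, two_n, reps = 10, 500, 200
totals = {j: 0 for j in range(2, k + 1)}
for _ in range(reps):
    for j, g in discrete_coalescent_times(k, two_n, rng).items():
        totals[j] += g
for j in range(k, 1, -1):
    print(f"{j:2d} lineages: mean {totals[j] / reps:7.1f} generations")
```

The mean number of generations spent with many lineages should come out much smaller than the mean spent with two lineages, which is the "short branches at the leaves, long branches toward the root" pattern.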
Simple mathematical formulas and the coalescent The continuous coalescent
The continuous coalescent as an approximation of the discrete coalescent
Assume a sample of size k and a population size of 2N.
What is the waiting time until two sequences coalesce?
The probability that no coalescent event occurs in a single generation is $1 - \binom{k}{2} p$, with $p = 1/2N$.
The probability that no coalescent event occurs until generation τ is
$P(T > \tau) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{\tau}$
Replace τ with 2Nt. This means: when t = 1, then 2N generations have passed.
$P(T > 2Nt) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{2Nt}$
$P(T/2N > t) \approx e^{-\binom{k}{2} t}$ (for large N)
$P(T/2N \le t) \approx 1 - e^{-\binom{k}{2} t}$
This means that the waiting time until a coalescent event occurs can be approximated by an exponentially distributed variable with parameter $\binom{k}{2}$. The time is measured in units of 2N generations.
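The step from the discrete expression to the exponential relies on the standard limit $(1 - x/m)^m \to e^{-x}$ as $m \to \infty$. A short worked version, where the shorthand $c = \binom{k}{2}$ is introduced here only for readability:

```latex
\begin{align*}
P\!\left(\tfrac{T}{2N} > t\right)
  &= \left(1 - \frac{c}{2N}\right)^{2Nt}, \qquad c = \binom{k}{2},\\
  &= \left[\left(1 - \frac{c}{2N}\right)^{2N}\right]^{t}
   \;\xrightarrow[N \to \infty]{}\; \left(e^{-c}\right)^{t} = e^{-\binom{k}{2}\, t}.
\end{align*}
```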
Let’s see how accurate the approximation is . . .
[Figure: P(T > 0.05) (y-axis, 0.00 to 0.20) against population size (x-axis, log scale from 1e+00 to 1e+06), comparing the exact probability with the exponential approximation.]
Question: What does this imply?
For small population sizes the exponential approximation underestimates the number of coalescent events!
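A comparison of this kind can be sketched as follows. The slides do not state which sample size the plot used or how its "exact" curve was computed, so both are assumptions here: k = 10, and "exact" meaning the per-generation product of "different parent" probabilities raised to the number of generations.

```python
from math import comb, exp

def p_no_coal_exact(k, two_n, t):
    """P(no coalescence during 2N*t generations), using the exact per-generation product."""
    per_generation = 1.0
    for i in range(1, k):
        per_generation *= max(two_n - i, 0) / two_n
    return per_generation ** round(two_n * t)

def p_no_coal_exponential(k, t):
    """Continuous approximation exp(-C(k, 2) * t), t in units of 2N generations."""
    return exp(-comb(k, 2) * t)

k, t = 10, 0.05  # k is an assumed sample size; t = 0.05 as on the plot axis
for two_n in (100, 1_000, 10_000, 100_000, 1_000_000):
    exact = p_no_coal_exact(k, two_n, t)
    approx = p_no_coal_exponential(k, t)
    print(f"2N={two_n:>9,d}  exact={exact:.5f}  approx={approx:.5f}")
```

For small 2N the exact probability of no coalescence is lower than the exponential value, so the approximation predicts fewer coalescent events than actually occur; the two values converge as 2N grows.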
How do we build coalescent trees using exponentially distributed random variables?
Exponentially distributed random variables
The waiting time until a coalescent event is an exponential random variable
What do exponential variables look like?
[Figure: exponential cumulative distribution function P(T <= t) with parameter n(n−1)/2, for n = 10; x-axis: waiting time (0 to 5), y-axis: P(T <= t) (0.0 to 1.0).]
Let’s draw some random numbers from the plot!
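"Drawing from the plot" can be read as inverse-CDF sampling: pick a uniform height u on the cumulative curve and read off the corresponding waiting time. A minimal sketch under that reading, with a hypothetical function name and seed:

```python
import random
from math import comb, log

def draw_waiting_time(n, rng):
    """Draw one exponential waiting time with rate C(n, 2) by inverting the CDF.

    Solving u = 1 - exp(-rate * t) for t gives t = -ln(1 - u) / rate.
    """
    rate = comb(n, 2)
    u = rng.random()  # uniform on [0, 1): a height on the cumulative curve
    return -log(1.0 - u) / rate

rng = random.Random(42)
draws = [draw_waiting_time(10, rng) for _ in range(10_000)]
print("sample mean:", sum(draws) / len(draws))
print("theoretical mean 1 / C(10, 2):", 1 / comb(10, 2))
```

The sample mean should be close to the theoretical mean 1/45 for n = 10.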
Let’s create a coalescent by drawing random numbers from the exponential distribution
[Figure: the sampled lineages at t = 0 (present).]
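Let’s create a coalescent by drawing exponential variables
t = 0
T = 0.001, n = 10, i: 1, j: 5
T = 0.07, n = 9, i: 7, j: 10
(T: the waiting time drawn, n: the current number of lineages, i and j: the two lineages that coalesce)
This process continues until the MRCA (Most Recent Common Ancestor)!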
Continuous coalescent
To construct the continuous coalescent we draw exponential random variables with parameters:
$\binom{n}{2}, \binom{n-1}{2}, \binom{n-2}{2}, \dots, \binom{2}{2} = 1$
Waiting times are getting longer and longer as we move back in time (toward the MRCA)!
Recent branches (in the present) are shorter than deeper branches (near the root)!
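The construction can be sketched in a few lines: while more than one lineage remains, draw an exponential waiting time with rate C(j,2) for the current number of lineages j, then merge a uniformly chosen pair. The node labelling scheme, the function name and the seed are illustrative assumptions.

```python
import random
from math import comb

def simulate_coalescent(n, rng):
    """Return a list of (time, child_a, child_b, parent) coalescent events.

    Time is in units of 2N generations. Leaves are labelled 0..n-1 and
    internal nodes n, n+1, ... in order of creation.
    """
    active = list(range(n))
    next_label = n
    time = 0.0
    events = []
    while len(active) > 1:
        time += rng.expovariate(comb(len(active), 2))  # waiting time with j lineages
        a, b = rng.sample(active, 2)  # the coalescing pair is chosen uniformly
        active.remove(a)
        active.remove(b)
        active.append(next_label)
        events.append((time, a, b, next_label))
        next_label += 1
    return events

rng = random.Random(7)
for time, a, b, parent in simulate_coalescent(10, rng):
    print(f"t = {time:.4f}: lineages {a} and {b} coalesce into node {parent}")
```

Only n − 1 random waiting times are drawn per tree, regardless of the population size, and the gaps between successive events tend to widen toward the root.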
How does this affect the shape of the coalescent?
Practically, the coalescent is constructed by using the continuous approximation
It’s faster: we are only interested in the times of the coalescent events and not in the generations where nothing happens!
Let’s play a bit with the coalescent
Coalescent simulator: www.coalescent.dk
Summary
The coalescent is built by using exponential waiting times.
The continuous coalescent represents a good approximation of the discrete coalescent when the population size is large enough.
We assume that two coalescent events cannot occur simultaneously.
Waiting times increase with the number of coalescent events that have already occurred.
The last waiting time before the root (the MRCA) is, on average, about as long as the time needed to reduce the k observed individuals to two ancestral lineages.
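The last point follows from the expected waiting times 1/C(j,2) of the exponential variables. A small sketch using exact fractions, with a hypothetical function name and k = 10 as an example value:

```python
from fractions import Fraction
from math import comb

def expected_waiting_times(k):
    """Expected time (in units of 2N generations) spent with j lineages, j = k..2."""
    return {j: Fraction(1, comb(j, 2)) for j in range(k, 1, -1)}

k = 10
waits = expected_waiting_times(k)
time_to_two_lineages = sum(w for j, w in waits.items() if j > 2)  # equals 1 - 2/k
print("expected time from k lineages down to 2:", time_to_two_lineages)
print("expected last waiting time (2 lineages):", waits[2])
print("expected height of the tree:", sum(waits.values()))  # equals 2 * (1 - 1/k)
```

The expected time to go from k lineages to two is 1 − 2/k, which approaches the expected last waiting time of 1 as k grows.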
Coalescent and polymorphisms
We saw how to build the coalescent = how to model the relationships of individuals within a sample.
How can we generate/simulate sequence data using a coalescent tree?
It’s easy to simulate sequences using a coalescent tree!
Just put mutations on the coalescent tree :-)
Putting mutations on the coalescent tree
[Figure: coalescent tree with leaves C1, C2, C3, C5, C4; mutations are placed on its branches.]
NOTE: INFINITE SITE MODEL!
We need to . . .
choose a position on the tree to put the mutation
choose a position on the sequence where the mutation occurred
How do we choose a position on the tree?
We assume that the mutation rate is θ. This means that the expected number of mutations per unit time on a single branch of the coalescent tree is θ.
The total number of mutations on the coalescent tree is a Poisson random number Poi(θT), where T is the total tree length.
Then, we randomly put mutations on the branches of the coalescent tree.
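A minimal sketch of this step, assuming the tree is given simply as a mapping from branch names to branch lengths (a hypothetical representation) and that `theta` is the per-branch rate per unit time as defined above. Instead of drawing the Poisson count first, it lays mutations along the total length with exponential gaps, which yields a Poisson(θT) count and places each mutation on a branch with probability proportional to its length.

```python
import random

def place_mutations(branch_lengths, theta, rng):
    """Return {branch_name: number of mutations} for one realization."""
    names = list(branch_lengths)
    total_length = sum(branch_lengths.values())
    hits = {name: 0 for name in names}
    position = rng.expovariate(theta)  # distance to the first mutation
    while position < total_length:
        # Walk along the concatenated branches to find the one that was hit.
        offset = position
        for name in names:
            if offset < branch_lengths[name]:
                hits[name] += 1
                break
            offset -= branch_lengths[name]
        else:
            hits[names[-1]] += 1  # guard against floating-point round-off
        position += rng.expovariate(theta)  # exponential gap to the next mutation
    return hits

rng = random.Random(3)
tree = {"C1": 0.1, "C2": 0.1, "C3": 0.4, "internal": 0.3}  # hypothetical branch lengths
print(place_mutations(tree, theta=5.0, rng=rng))
```

Averaged over many runs, the total number of mutations should be close to θT (here 5.0 × 0.9 = 4.5).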
The number of mutations follows a Poisson distribution
Waiting times are exponentially distributed
Events are independent from each other
Waiting times between mutations
t is measured in units of 2N generations
$P(T > t) = (1-\mu)^{2Nt} = \left(1 - \frac{\theta}{4N}\right)^{2Nt}$, with $\theta = 4N\mu$
Thus as N goes to infinity
$P(T > t) \to e^{-\theta t/2}$
Additionally, mutation events are independent from each other.
Put the mutations randomly on the tree
What is the expected number of mutations?
Mean of Poi(θT) = θT.
Explaining data with the coalescent
The goal is to create a coalescent tree and put mutations on it so as to generate/simulate a dataset with specific properties.
Assume a dataset:
seq1 AAATCG
seq2 AAACCG
seq3 TTTCCG
seq4 AAATTC
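[Figure: coalescent tree with root sequence AAATCG and leaf sequences AAATCG, AAATTC, AAACCG, TTTCCG. Mutations on the branches: 1. A->T, 2. A->T, 3. A->T, 4. T->C, 5. C->T, 6. G->C.]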
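The tree can be checked against the dataset by applying the mutations met on the path from the root to each leaf. The sketch below assumes that AAATCG is the ancestral sequence and that mutation 4 lies on the branch ancestral to seq2 and seq3, mutations 1 to 3 on the branch leading to seq3, and mutations 5 and 6 on the branch leading to seq4. This placement is inferred from the sequences themselves, not read from the original figure.

```python
def apply_mutations(sequence, mutations):
    """Apply (position, old_base, new_base) mutations; positions are 1-based."""
    bases = list(sequence)
    for position, old, new in mutations:
        assert bases[position - 1] == old, "mutation does not match current base"
        bases[position - 1] = new
    return "".join(bases)

root = "AAATCG"  # assumed ancestral sequence
m1, m2, m3 = (1, "A", "T"), (2, "A", "T"), (3, "A", "T")
m4, m5, m6 = (4, "T", "C"), (5, "C", "T"), (6, "G", "C")

# Mutations met on the path from the root to each leaf (assumed placement).
paths = {
    "seq1": [],
    "seq2": [m4],
    "seq3": [m4, m1, m2, m3],
    "seq4": [m5, m6],
}
simulated = {name: apply_mutations(root, path) for name, path in paths.items()}
for name, seq in simulated.items():
    print(name, seq)
```

Each site mutates only once on the whole tree, in line with the infinite sites model noted earlier.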
Software in population genetics that can be optimized
The IM model
Software to construct coalescent trees with recombination and multiple positively selected mutations