Separating Population Structure from Recent Evolutionary History
Problem: Spatial Patterns Inferred Earlier Represent An Equilibrium Between Recurrent Evolutionary Forces Such as Gene Flow and Drift.
E.g., at equilibrium between gene flow and drift, $F_{st} = \frac{1}{4N_{ev}m + 1}$
But, Can Obtain The Same Pattern Due to Recent Historical Events That Have Not Had Time to Reach Equilibrium
To Examine Historical Events & Non-Equilibrium States, Need to Study Genetic Variation in Both Space & Time
• Directly: Sample Populations From the Past
• Indirectly: Reconstruct Variation Through Time
Direct Study: mtDNA in the Woolly Mammoth
Debruyne et al. 2008. Out of America: Ancient DNA Evidence for a New World Origin of Late Quaternary Woolly Mammoths. Curr. Biol. 18:1320-1326.
Indirect Studies
Recall that $D_t = D_0(1-r)^t$
Therefore, Multi-locus or Multi-site Polymorphic Data Contains Historical Information, and This Retention Is For Long Periods of Time When r Is Small.
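A minimal sketch of why small r preserves historical information, assuming D_0 is the initial linkage disequilibrium, r the recombination rate, and t the number of generations. The function names and the example values are illustrative, not taken from the slides.

```python
import math

def ld_after(d0: float, r: float, t: int) -> float:
    """Linkage disequilibrium after t generations: D_t = D_0 * (1 - r)**t."""
    return d0 * (1.0 - r) ** t

def generations_to_fraction(r: float, fraction: float) -> float:
    """Generations until D_t / D_0 falls to `fraction`, solving (1 - r)**t = fraction."""
    return math.log(fraction) / math.log(1.0 - r)

# Unlinked loci (r = 0.5) lose half of their disequilibrium every generation.
print(ld_after(0.2, 0.5, 1))
# Tightly linked sites (hypothetical r = 0.001) take hundreds of generations to lose half.
print(generations_to_fraction(0.001, 0.5))
```

With r = 0.001 the half-life is roughly 693 generations, which is why multi-site haplotypes in low-recombination regions retain a historical signal for long periods.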
Attempts to Reconstruct History Depend Upon Multiple Loci or Upon Multi-Site Haplotypes.
Multiple Loci: Principal Component Analysis of Genetic Data
This procedure has long been used in human genetics to extract multi-locus information about gene flow patterns (e.g., Cavalli-Sforza & Ammerman, 1984).
Multiple Loci: Principal Component Analysis of Genetic Data
Novembre et al. Nature 31 Aug 2008. Based on 197,146 loci in 1,387 individuals.
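As a rough illustration of how a principal component summarizes multi-locus genotypes, the sketch below finds the first component of a tiny genotype matrix by power iteration. The six individuals, four loci, and allele counts (0, 1, or 2) are invented toy data, and this is not the procedure used in the studies cited above, which analyze vastly larger matrices with dedicated software.

```python
import math

def first_pc_scores(genotypes, iterations=200):
    """Return each individual's score on the first principal component."""
    n, m = len(genotypes), len(genotypes[0])
    # Center each locus (column) on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Power iteration on X^T X converges to the leading loading vector.
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start
    for _ in range(iterations):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]          # X v
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return [sum(xi[j] * v[j] for j in range(m)) for xi in x]

# Hypothetical allele counts: three individuals from each of two regions.
toy = [
    [0, 0, 2, 1], [0, 1, 2, 1], [0, 0, 2, 0],   # region 1
    [2, 2, 0, 1], [2, 1, 0, 1], [2, 2, 0, 2],   # region 2
]
scores = first_pc_scores(toy)
print([round(s, 2) for s in scores])
```

In this toy case the sign of the first-component score separates the two regions, which is the kind of multi-locus geographic signal the PCA maps display.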
Overlay of the steepest slope values (upper 5%)
Microsatellite survey of naked mole rats in Meru National Park, Kenya (Jon Hess)
Haplotypes
One Method Is To Look At the Spatial Distribution of Globally Rare, Tip Haplotypes (Although They May be Locally Common)
Coalescent Theory Implies Such Haplotypes Are Recent, And Therefore Are Not In Equilibrium And Have Limited Spatial Distributions
Therefore, Globally Rare, Tip Haplotypes Provide A Straightforward Method of Observing The Movements of Genes Through Space Over Short and Recent Time Periods.
Schroeder, K. B. et al. Mol Biol Evol 2009 26:995-1016
Geographic distribution of the Asian and American populations genotyped for the microsatellite D9S1120
“Private” 9-repeat allele
Schroeder, K. B. et al. Mol Biol Evol 2009 26:995-1016
Visual genotypes, clustered by population, for individuals either homozygous or heterozygous for the 9-repeat allele
Implies that this “private allele” is identical by descent in all Western Beringians and Native Americans, which in turn implies that Native Americans Descended (at least in part) From These Western Beringian Populations.
Method for estimating the TMRCA of copies of an allele from the number of recombination events on its shared
haplotypic background
Under the different best models, the mean TMRCA of the 9-repeat allele ranged from 293 generations to 1,596 generations; using a generation time of 25 years resulted in a TMRCA of 7,325-39,900 years ago. Averaging over all of our best models, the mean TMRCA is 513 generations ago, or about 12,825 years ago. The 95% confidence intervals for all of the best models produced ages for the MRCA of the 9-repeat allele that range from 144 to 1,951 generations ago, or approximately 3,600-48,775 years ago.
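The year figures quoted above follow directly from the 25-year generation time. A small sketch of that conversion, with the generation counts copied from the text and the dictionary labels added only for readability:

```python
GENERATION_TIME_YEARS = 25  # generation time stated in the text

def generations_to_years(generations, years_per_generation=GENERATION_TIME_YEARS):
    """Convert an age in generations to an age in years."""
    return generations * years_per_generation

estimates = {
    "lowest model mean": 293,
    "highest model mean": 1596,
    "average over best models": 513,
    "95% CI lower bound": 144,
    "95% CI upper bound": 1951,
}
for label, gens in estimates.items():
    print(f"{label}: {gens} generations = {generations_to_years(gens):,} years")
```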
Schematics of the demographic models used for the coalescent simulations: (A) population split with two equal-size descendant populations (Asia and America), (B) population split with NAs/NAm equal to 0.15 at TAs/Am, and (C) population
split with NAs/NAm equal to 0.02 at TAs/Am, followed by population growth such that NAs/NAm equals 0.15 at T0. Models D and E are the same as models B and
C, respectively, but include population substructure in Asia and in America.
Haplotype Trees
Are Biologically Meaningful Only When Recombination Is Absent Or Rare
Gives Some Information About Temporal Ordering of Mutational Variation, Both the Rare and the Common Mutations
Not Limited to Recent Events, But Can Go Back Further In Time (But Not Beyond the Most Recent Common Ancestral DNA Molecule)
A Haplotype Tree Should Never Be Equated To A Tree of Populations. It Is Only The Tree of The Genetic Variation For That DNA Region. There Is Information About Population History in the Haplotype Tree, But It Must Be Extracted Carefully.
Haplotype Trees ≠ Species or Population Trees
It is dangerous to equate a haplotype tree to a species tree.
It is NEVER justified to equate a haplotype tree to a tree of populations within a species because the problem of lineage sorting is greater and the
time between events is shorter. Moreover, a population tree need
not exist at all.
Nested Clade Analysis
• Converts Haplotype Trees Into A Nested Statistical Design
• Other Data (Phenotypic or Geographical) Are Then Overlaid Upon The Nested Design
• Statistical Tests Are Performed To Detect Significant Associations Between the Data and The Haplotype Tree
DOES NOT EQUATE THE HAPLOTYPE TREE TO A POPULATION TREE!
NCPA Distance Measures
[Map legend: symbols mark sample locations]
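In NCPA the clade distance Dc measures how far individuals bearing a clade's haplotypes lie from that clade's own geographical center, and the nested clade distance Dn measures how far they lie from the center of the nesting (next higher level) clade. The sketch below is a simplified illustration of those two definitions. It assumes planar coordinates in kilometres with straight-line distances rather than great-circle distances, the sample records are invented, and the permutation test that assigns significance is not shown.

```python
import math

def center(points):
    """Geographical center of (x, y, count) sample records, weighted by count."""
    total = sum(c for _, _, c in points)
    return (sum(x * c for x, _, c in points) / total,
            sum(y * c for _, y, c in points) / total)

def mean_distance(points, origin):
    """Average distance of the sampled individuals from `origin`."""
    total = sum(c for _, _, c in points)
    return sum(math.hypot(x - origin[0], y - origin[1]) * c
               for x, y, c in points) / total

def clade_distances(clade, nesting_clade):
    """Return (Dc, Dn) for a clade given all records of its nesting clade."""
    dc = mean_distance(clade, center(clade))
    dn = mean_distance(clade, center(nesting_clade))
    return dc, dn

# Hypothetical data: a haplotype found at one site, nested in a clade
# that also includes five individuals sampled 100 km away.
young_clade = [(0.0, 0.0, 5)]
nesting = young_clade + [(100.0, 0.0, 5)]
print(clade_distances(young_clade, nesting))
```

A clade confined to a single site has Dc = 0 while its Dn can still be large, the same pattern seen for singleton haplotypes in NCPA output tables.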
A Haplotype Tree In Elephants
Sample locations: Tsavo, Amboseli, Sengwa, Hwange, Victoria Falls, Matetsi
Within 1-Step Clades                                  Within Total Tree
Haplotype   No. in sample   Dc         Dn             1-Step Clade   Dc        Dn
1           35              1021L***   1027L***
2           20              81S***     657S***
3           1               0          601            1-1            884       1173L***
Old-Young                   944L***    373L***
4           11              959L***    832L***
5           16              114        249S*
6           3               0          156S*          1-2            460S***   768S***
Old-Young                   862L***    598L***
7           27              47         47
8           1               0          126
9           1               0          68             1-3            49S***    759S**
Old-Young                   47         -50            Old-Young      626L***   409L***
Only When Statistical Significance Is Achieved Is The Biological Significance Interpreted With
Explicit, a priori Criteria
• For Example, Under Isolation By Distance, It Takes Many Generations For A New Haplotype To Spread Across Many Demes.
• Therefore, Expect Older Haplotypes To Be More Widespread Than Younger Haplotypes.
• Younger Haplotypes Tend To Have Geographical Ranges Nested Within the Ranges of Their Ancestral Haplotypes.
A Haplotype Tree In Elephants
Sample locations: Tsavo, Amboseli, Sengwa, Hwange, Victoria Falls, Matetsi
Inference labeled at four places on the tree: Gene flow with IBD
Historical Events Also Leave Lasting Patterns in Haplotype Trees.
For Example, When A Population Expands Into a New Area, Even Haplotypes Recently Created by Mutation Can Become Geographically Widespread, and Haplotypes Created By Mutation After the Expansion Can Be Located Far From the Geographical Center of Their Ancestral Haplotype.
Range Expansion
[Figure: haplotype distributions across Area A, Area B, and Area C, from Past to Present]
Nested Clade Analysis of the Chub (Leuciscus cephalus): Range Expansion (from Durand et al. 1999)
[Figure labels: Older Clade; Younger Clade; 2-1; SPE]
Historical Events Also Leave Lasting Patterns in Haplotype Trees.
For Example, When A Population Is Fragmented or Otherwise Effectively Isolated, Haplotypes That Arise After The Fragmentation/Isolation Event Cannot Spread to Other Geographical Areas, and With Increasing Time, More Mutations Can Accumulate, Resulting In Larger Than Average Branch Lengths Between Clades in Different Isolates.
Fragmentation
[Figure: Recent vs. Old fragmentation, each shown across Area A, Area B, and Area C]
Fragmentation between Ambystoma tigrinum tigrinum (Clade 4-2) and A. t. mavortium (Clade 4-1)
The Nested Design Means That Inferences Are Robust To Topological Variation
Induced by the Evolutionary Stochasticity of the Coalescent Process
African Elephants (Roca, A. L., N. Georgiadis, and S. J. O'Brien. 2005. Cytonuclear genomic dissociation in African elephant species. Nature Genetics 37:96-100.)
Savanna Elephant; Forest Elephant
Fragmentation Inferences From NCA
All 5 DNA regions had a different topology with respect to the 3 elephant taxa (only BGN gave the “species tree”); yet NCPA inferred a fragmentation event between forest and savanna elephants in all 5 DNA regions.
Highly Significant Fragmentation Events Found In All Five Haplotype Trees
Past Fragmentation
Past Fragmentation Followed By Range Expansion and Secondary Contact
[Panels: Y-DNA, mtDNA, BGN, PLP, PHKA2]
Nested Clade Phylogeographic Analysis
Recurrent Gene Flow, Range Expansion and Fragmentation Could All Have Occurred at Different Times and/or Places.
NCPA Therefore Looks For Multiple Patterns, Not Just One
The Relative Temporal Ordering of Events in a Nested Series of Clades Is Also Inferred by NCPA
Inferences from mtDNA haplotype tree of Ambystoma tigrinum from NCPA and supplemental test for
secondary contact (Mol. Ecol. 10: 779-791, 2001)
[Figure labels: Fragmentation; Secondary Contact; Range Expansion (two clades); Isolation by Distance (two clades)]
By Analyzing Haplotype Trees for mtDNA, Y-DNA, X-linked DNA and Autosomal DNA, One Can Sample A Wide
Variety of Time Scales and Both Male and Female
Mediated Gene Flow and Historical Events
By Analyzing Multiple Haplotype Trees, One Can Statistically Correct For The Evolutionary Stochasticity of The Coalescent Process For Any One Genomic Region
Inference Errors in Nested Clade Analysis
• Inference Requires That An Appropriate Mutation Occurred At the Right Time and Right Place; Therefore, Some Events and Processes Are Missed With A Particular DNA Region.
• Selection and Evolutionary Stochasticity Can Distort The Distribution of Haplotypes in Space and Time, Thereby Leading to False Positive Inferences.
These errors can be minimized by studying multiple loci and requiring each inference (type, place and time) to be cross-validated by two or more loci.
Multilocus Nested Clade Analysis
• Perform single-locus NCPA on n loci.
• Discard any inferences made only by a single locus.
• Group together all the inferences made by 2 or more loci that are concordant by type of inference and geographical location.
• Test the null hypothesis that all inferences of an event that are concordant by event type and location are a single event.
• Because gene flow is a recurrent process, inferences of gene flow between two regions are not necessarily concordant in time, but one can test the null hypothesis that there was no gene flow between two regions in an interval of time, say t1 to t2, given multiple inferences of gene flow between the two regions.
ALL RETAINED INFERENCES HAVE BEEN CROSS-VALIDATED ACROSS LOCI AND HAVE EXPLICIT, QUANTIFIED STATISTICAL SUPPORT.
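The discard-and-group steps of the multilocus procedure can be sketched as a simple filter. This is only an illustration of the bookkeeping, assuming each single-locus inference is recorded as a (locus, event type, location) tuple. The locus names and events below are hypothetical, and the subsequent likelihood ratio tests of temporal concordance are not included.

```python
from collections import defaultdict

def cross_validated(inferences, min_loci=2):
    """Keep (event type, location) groups inferred by at least `min_loci` distinct loci."""
    groups = defaultdict(set)
    for locus, event_type, location in inferences:
        groups[(event_type, location)].add(locus)
    return {key: sorted(loci) for key, loci in groups.items() if len(loci) >= min_loci}

# Hypothetical single-locus NCPA inferences.
single_locus = [
    ("locus1", "fragmentation", "forest/savanna"),
    ("locus2", "fragmentation", "forest/savanna"),
    ("locus3", "fragmentation", "forest/savanna"),
    ("locus1", "range expansion", "north"),
    ("locus2", "gene flow", "east/west"),
    ("locus3", "gene flow", "east/west"),
]
for key, loci in cross_validated(single_locus).items():
    print(key, loci)
```

The range expansion supported only by locus1 is dropped, while the fragmentation and gene flow inferences survive because two or more loci agree on both event type and location.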
Using Theory Developed by Tajima (1983) and Kimura (1970), The Distribution Of The Inference Time $t_i$ Is:
$$f(t_i \mid T_i, k_i) = \frac{t_i^{k_i}\,\exp\left[-t_i(1+k_i)/T_i\right]}{\left[T_i/(1+k_i)\right]^{1+k_i}\,\Gamma(1+k_i)}$$
where $k_i$ is the average pairwise nucleotide diversity among the haplotypes in DNA region i in the youngest monophyletic clade that contributed in a statistically significant fashion to the NCPA inference of interest, and $T_i$ is the age obtained by the Takahata et al. molecular clock estimator (or perhaps some other method) for this inference from DNA region i.
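A numerical sketch of this distribution, assuming it is a gamma density with shape 1 + k_i and scale T_i / (1 + k_i), so that its mean is T_i and its variance is T_i² / (1 + k_i). The values T = 1.0 and k = 3.0 are hypothetical, and the Simpson's-rule integrator is only there to check the moments without external libraries.

```python
import math

def inference_time_pdf(t, T, k):
    """Gamma density with shape 1 + k and scale T / (1 + k)."""
    if t <= 0.0:
        return 0.0
    shape = 1.0 + k
    scale = T / shape
    log_pdf = ((shape - 1.0) * math.log(t) - t / scale
               - shape * math.log(scale) - math.lgamma(shape))
    return math.exp(log_pdf)

def integrate(f, lo, hi, steps=20000):
    """Composite Simpson's rule; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

T, k = 1.0, 3.0  # hypothetical age estimate and pairwise diversity
area = integrate(lambda t: inference_time_pdf(t, T, k), 0.0, 10.0)
mean = integrate(lambda t: t * inference_time_pdf(t, T, k), 0.0, 10.0)
var = integrate(lambda t: (t - mean) ** 2 * inference_time_pdf(t, T, k), 0.0, 10.0)
print(area, mean, var)
```

Larger k narrows the distribution around T, so loci with more mutational resolution give tighter estimates of when an event occurred.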
Estimated Times To Common Ancestor (Method of Takahata et al. 2001)
$D_h$ = Nucleotide Differences Within Humans
$D_{hc}$ = Nucleotide Differences Between Humans & Chimps
Human-chimp divergence: 6 Million Years Ago
$\mathrm{TMRCA} = 12\,D_h / D_{hc}$
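A small worked example of the estimator as written on the slide. It assumes the result is in millions of years because the calibration point is the 6-million-year human-chimp divergence. The diversity values passed in are hypothetical, not data from any of the loci discussed.

```python
def tmrca_million_years(d_h, d_hc):
    """TMRCA = 12 * D_h / D_hc, with the constant 12 taken from the slide."""
    return 12.0 * d_h / d_hc

# Hypothetical values: within-human difference 0.002, human-chimp difference 0.012.
print(tmrca_million_years(0.002, 0.012))
```

With these invented inputs the estimate is about 2 million years; the estimate scales linearly with the ratio of within-human to human-chimp differences.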
A Likelihood Ratio Test of The Hypothesis That The Estimated Times of An Event From j Loci Are The Same
Fragmentation Inferences From NCA
Null Hypothesis: there was a single fragmentation event between forest and savanna elephants.
Log-likelihood ratio test = 1.497 with 4 degrees of freedom, p = 0.8272. The null hypothesis is not rejected, with T = 4.2 MYA.
There are at least 2 lineages of African Elephants.
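The reported p-value can be checked by treating the test statistic as a chi-square variate with 4 degrees of freedom. The sketch below uses the closed-form upper-tail probability that holds for even degrees of freedom, so it needs only the standard library. The function name is illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (x/2)**i / i!
        total += term
    return math.exp(-half) * total

p = chi2_sf_even_df(1.497, 4)
print(round(p, 4))  # 0.8272
```

A p-value this large means the five loci give statistically indistinguishable times for the fragmentation event.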
Performed Nested Clade Analyses on 25 DNA Regions in Humans
• Mitochondrial DNA (Ingman et al. Nature 408, 708-713, 2000; Sykes et al. American Journal of Human Genetics 57, 1463-1475, 1995; Torroni et al. American Journal of Human Genetics 53, 563-590, 1993, American Journal of Human Genetics 53, 591-608, 1993).
• Y-DNA (Hammer et al. Molecular Biology and Evolution 15, 427-441, 1998)
• 11 X-Linked Regions (Balciuniene et al. 2001; Garrigan et al. 2005; Hammer et al. 2004; Harris & Hey, 1999, 2001; Kaessmann et al. 1999; Nachman et al. 2004; Saunders et al. 2002; Verrelli et al. 2002; Yu et al. 2002)
• 12 Autosomal Genes (Bamshad et al. 2002; Harding et al. 1997; Hollox et al. 2001; Jin et al. 1999; Koda et al. 2001; Rana et al. 1999; Rogers et al. 2000; Toomajian and Kreitman 2002; Wooding et al. 2002; Zhang & Rosenberg 2000).
The log likelihood ratio test rejects the null hypothesis that all 15 events are temporally concordant with a probability value of 3.89 × 10^-15.
[Figure labels: P = 0.95; P = 0.51; P = 0.62]
Three Out-of-Africa Events, All Defined By Three or More Loci With A High Degree of Temporal Homogeneity But With Highly Significant Heterogeneity Between The Three Events
There Were At Least Three Out-of-Africa Expansion Events Over the Last 2 Million Years
Inferences of Gene Flow That Are Concordant Geographically Are NOT Necessarily Concordant Temporally Because Gene Flow is a Recurrent
Process. However, We Can Test The Null Hypothesis of NO GENE FLOW Between Two Geographical Regions
Over a Specified Time Interval.
Test Of The Null Hypothesis of NO GENE FLOW Between Two
Geographical Regions Over a Specified Time Interval l to u:
$$P_i([l,u]) = 1 - \int_l^u \frac{t_i^{k_i}\,\exp\left[-t_i(1+k_i)/T_i\right]}{\left[T_i/(1+k_i)\right]^{1+k_i}\,\Gamma(1+k_i)}\,dt_i$$
$$\mathrm{LRT}([l,u]) = -2\ln\prod_{i=1}^{j} P_i([l,u])$$
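A sketch of how this statistic could be evaluated numerically, assuming each locus contributes one minus the probability mass that its gamma-distributed inference time (shape 1 + k_i, scale T_i / (1 + k_i)) places inside the interval [l, u]. The three (T_i, k_i) pairs and the interval bounds are hypothetical, and no reference distribution for the statistic is asserted here.

```python
import math

def inference_time_pdf(t, T, k):
    """Gamma density with shape 1 + k and scale T / (1 + k)."""
    if t <= 0.0:
        return 0.0
    shape = 1.0 + k
    scale = T / shape
    return math.exp((shape - 1.0) * math.log(t) - t / scale
                    - shape * math.log(scale) - math.lgamma(shape))

def mass_in_interval(T, k, lower, upper, steps=2000):
    """Simpson's-rule integral of the density over [lower, upper]; steps must be even."""
    h = (upper - lower) / steps
    total = inference_time_pdf(lower, T, k) + inference_time_pdf(upper, T, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * inference_time_pdf(lower + i * h, T, k)
    return total * h / 3.0

def no_gene_flow_lrt(loci, lower, upper):
    """-2 ln of the product over loci of (1 - mass of locus i inside the interval)."""
    log_product = 0.0
    for T, k in loci:
        p_outside = 1.0 - mass_in_interval(T, k, lower, upper)
        log_product += math.log(max(p_outside, 1e-300))  # guard against log(0)
    return -2.0 * log_product

hypothetical_loci = [(0.8, 4.0), (1.2, 6.0), (0.5, 2.5)]  # (T_i, k_i), T in millions of years
wide = no_gene_flow_lrt(hypothetical_loci, 0.1, 2.0)
narrow = no_gene_flow_lrt(hypothetical_loci, 0.1, 0.2)
print(wide, narrow)
```

The statistic grows as more loci place their probability mass inside the interval, which is why many overlapping gene flow inferences can reject isolation very strongly.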
Gamma Distributions For 19 African/Eurasian Gene Flow Inferences
With Isolation By Distance
Extensive overlap implies cross-validation, with the exception of MX1, the only locus with most of its probability mass in the Pliocene.
The lack of clusters implies there were no prolonged breaks in gene flow throughout the Pleistocene.
Testing The Null Hypothesis of No African/Eurasian Gene Flow Throughout
the Pleistocene
The Null hypothesis of isolation (no gene flow) in this time interval is rejected with p < 10^-8
All of The Cross Validated Inferences
Integrate Well Into A Single
Overview of The Emergence of
Humans.
Coalescent Simulations
• Start With a Set of Fully Specified Phylogeographic Hypotheses
• Simulate the Coalescent Process Many Times Under Each Hypothesis to Produce a Virtual Current Generation
• Draw a Simulated Sample of the Same Size as the Real Sample and Compute Statistics on the Simulated Sample
• Compute the Same Statistics from the Real Sample Drawn from the Real Current Generation
• Compare Relative Fits of The Simulated Statistics Under Each Model to The Observed Statistics
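The workflow above can be sketched with a deliberately simple stand-in for a phylogeographic hypothesis: a single constant-size population whose only parameter is theta = 4Nμ. Each "hypothesis" is one hypothetical theta value, the summary statistic is the number of segregating sites, and fit is the distance between the simulated mean and the observed value. Real applications use far richer models and statistics, so this is an illustration of the logic only.

```python
import random

def simulate_segregating_sites(n, theta, rng):
    """One coalescent replicate: segregating sites in a sample of n sequences."""
    total_length = 0.0
    for k in range(n, 1, -1):
        # With k lineages the waiting time to the next coalescence is exponential
        # with rate k(k-1)/2, and k branches accumulate length during that time.
        total_length += k * rng.expovariate(k * (k - 1) / 2.0)
    # Mutations fall on the tree as a Poisson process with rate theta/2 per unit length.
    lam = theta / 2.0 * total_length
    count, arrival = 0, rng.expovariate(1.0)
    while arrival < lam:
        count += 1
        arrival += rng.expovariate(1.0)
    return count

def relative_fit(observed, n, models, replicates=2000, seed=1):
    """Distance between the observed statistic and its simulated mean, per model."""
    rng = random.Random(seed)
    fits = {}
    for name, theta in models.items():
        sims = [simulate_segregating_sites(n, theta, rng) for _ in range(replicates)]
        fits[name] = abs(sum(sims) / replicates - observed)
    return fits

models = {"low theta": 2.0, "high theta": 10.0}  # hypothetical fully specified models
fits = relative_fit(observed=6, n=10, models=models)
print(fits)
```

The model with the smaller distance is preferred, but only relative to the other models that were simulated; nothing in the comparison shows that the preferred model is itself adequate.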
Strong Vs. Weak Inference
• Falsification is the strongest inference possible in science, so this is called “strong inference.”
• Inference in NCPA is based upon the falsification of null hypotheses.
• Weak inference refers to the relative fit of a non-exhaustive set of alternatives.
• It is rare that an exhaustive set of every conceivable phylogeographic alternative can be simulated, so the coalescent simulation approach results in weak inference.
• Weak inference can give high relative support to a false hypothesis when all the alternatives are also false.
E.g., Fagundes et al. (PNAS 104:17614-17619, 2007) Tested 3 Models of Human Evolution via Simulation
Templeton (Yearbook of Physical Anthropology 48:33-59, 2005) Falsified All Three Models, With AFREG Rejected with p < 10^-17
These Results Are NOT Contradictory!
E.g., Fagundes et al. (PNAS 104:17614-17619, 2007) Tested 3 Models of Human Evolution via Simulation
Eswaran et al. (J. Human Evol. 49:1-18, 2005) Tested AFREG vs. A Model of Isolation By Distance and Strongly Rejected AFREG.
These Results Are NOT Contradictory!
[Figure labels: Africa, S. Europe, S. Asia; Africa, S. Europe, N. Europe, S. Asia, N. Asia, Pacific, Americas]
Interpretive Criteria
• Simulations assign “probabilities” to complex models as a whole, making it impossible to interpret the biological reason for a low probability.
• In contrast, NCPA allows individual components to be tested, making the biological interpretation clear.
Reject the Null hypothesis of no admixture with p < 10^-17
Interpretive Criteria
The Null hypothesis of isolation (no gene flow) in the minimal time interval proposed by Fagundes et al. is rejected with p = 1.6 × 10^-6 by testing with multilocus NCPA.
Interpretive Criteria
• Although Fagundes et al. (2007) interpreted the rejection of their assimilation model as a rejection of admixture, the confounded nature of simulation inference means that such an interpretation has no logical validity.
• NCPA allows individual components to be tested, making it clear that the part of their assimilation model that is wrong is NOT admixture, but rather the assumption of prior isolation of archaic Africans and Eurasians.
Coherent Inference
• Coherence is a property referring to nested and composite hypotheses.
• The meaning of coherence is most easily illustrated with nested hypotheses:
[Venn diagram: hypothesis A nested inside hypothesis B]
One measure of fit is the probability of the hypotheses. Because A is a nested subset of B, Prob.(B) ≥ Prob.(A). This relationship is “coherent”.
If one assigned Prob.(A) > Prob.(B), this is mathematically impossible and is said to be “incoherent”.
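The coherence condition can be written as a simple check on any set of assigned probabilities. In this sketch the probability values are hypothetical, and the nesting relation is supplied by hand as (subset, superset) pairs.

```python
def incoherent_pairs(probabilities, nested_in):
    """Return (subset, superset) pairs where the subset was given the larger probability."""
    return [(a, b) for a, b in nested_in if probabilities[a] > probabilities[b]]

nesting = [("A", "B")]  # A is a nested subset of B

coherent_assignment = {"A": 0.2, "B": 0.5}      # hypothetical values
incoherent_assignment = {"A": 0.7, "B": 0.3}    # hypothetical values

print(incoherent_pairs(coherent_assignment, nesting))
print(incoherent_pairs(incoherent_assignment, nesting))
```

Any method that can return the second kind of assignment for nested hypotheses is assigning numbers that cannot be probabilities of those hypotheses.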
E.g., Fagundes et al. (PNAS 104:17614-17619, 2007)
The “assimilation” model (B) allows the possibility of admixture between Africans and Eurasians, measured by the parameter M that can vary between 0 and 1. Note, M=0 corresponds to replacement, so the replacement model (A) is a proper subset of the assimilation model.
Note the probabilities assigned to A and B.
The ABC method is INCOHERENT!
Why Is ABC INCOHERENT?
There is no correction for dimensionality of the different hypotheses (indexed by i); and
The denominator treats all hypotheses as mutually exclusive events.
Equation 9 From Beaumont, M. A., W. Y. Zhang, and D. J. Balding. 2002. Approximate Bayesian computation in population genetics. Genetics 162:2025-2035.
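E.g., Fagundes et al. (PNAS 104:17614-17619, 2007)
Equation 9 states that
Prob(A or B or C) = P(A) + P(B) + P(C)
[Venn diagram: A, B, and C drawn as mutually exclusive events]
But the hypotheses are actually related as follows:
[Venn diagram: A nested inside B, with C overlapping B]
Prob(A or B or C) = P(B) + P(C) - P(B & C)
Hence, the fundamental equation of ABC is mathematically incoherent for nested and/or composite hypotheses.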
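The difference between the two formulas can be verified on a toy sample space. The ten equiprobable outcomes and the particular sets A, B, and C below are invented for illustration; the only structural assumptions are that A is nested inside B and that C overlaps B.

```python
from fractions import Fraction

def prob(event, weights):
    """Probability of an event as the sum of the weights of its outcomes."""
    return sum((weights[o] for o in event), Fraction(0))

weights = {o: Fraction(1, 10) for o in range(10)}  # hypothetical equiprobable outcomes
B = set(range(0, 6))
A = {0, 1}            # nested inside B
C = {4, 5, 6, 7}      # overlaps B

union = prob(A | B | C, weights)
naive_sum = prob(A, weights) + prob(B, weights) + prob(C, weights)
inclusion_exclusion = prob(B, weights) + prob(C, weights) - prob(B & C, weights)
print(union, naive_sum, inclusion_exclusion)
```

Summing the three probabilities as if the hypotheses were mutually exclusive gives 6/5, which exceeds 1, whereas the inclusion-exclusion form matches the true probability of the union, 4/5.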
Other Methods of Evaluating Hypotheses in the Coalescent Simulation Approach are Incoherent
•Bayes Factors are known to be incoherent (Lavine, M., and M. J. Schervish. 1999. Bayes Factors: What They Are and What They Are Not. The American Statistician 53:119-122).
•Mesquite and all other programs that treat all phylogeographic hypotheses as mutually exclusive alternatives are incoherent.
•Coalescent Simulations Can Only Be Used to Test Single Parameter Models Against Their Complement (e.g., FST > 0 vs. FST = 0).
Statistical Phylogeography
• Multilocus NCPA provides a robust, flexible testing framework.
• Simulations have multiple statistical flaws and cannot be used to test composite phylogeographic hypotheses.
• NCPA defines the general model but does not yield insight into details.
• Once the general model framework has been inferred by NCPA, simulations can be used to estimate the underlying parameters.
Statistical Phylogeography
NCPA and simulation approaches are not so much alternative
techniques as they are complementary, and potentially
synergistic, techniques. Both add to the statistical toolkit of
intraspecific phylogeographers, and both should be used when
appropriate.