Population structure - Foundations to software


Andrew J. Eckert

Section of Evolution and Ecology, University of California at Davis, Davis, CA 95616 USA

Ph: (530) 754-5743

E-mail: [email protected]

Eckert, Population Structure, 5-Aug-2008 1

An example from foxtail pine (Pinus balfouriana)


Population structure in forest trees


Canonical questions

• To what extent do gene frequencies differ among populations of forest trees?

• How is gene flow structured among populations of forest trees?

• How can population structure inform us about other processes in forest trees?


Topics

• Hardy-Weinberg Equilibrium

• Wahlund effect and F-statistics

• Estimating F-statistics from real data

• Relationship between Fst and Nem

• Clustering methods


Hardy-Weinberg Principle and Estimation of Allele Frequencies in Populations


Mating Tables

• Construct a mating table by assuming that:

– Genotype frequencies are the same between sexes

– Mating is at random with respect to the genotypes at a particular locus

– No segregation distortion or differential survival of zygotes


A generalized mating table

                                 Offspring genotype proportions
Mating          Mating freq.     A1A1     A1A2     A2A2
A1A1 x A1A1     (x11)^2          1        0        0
A1A1 x A1A2     (x11)(x12)       0.5      0.5      0
A1A1 x A2A2     (x11)(x22)       0        1        0
A1A2 x A1A1     (x12)(x11)       0.5      0.5      0
A1A2 x A1A2     (x12)^2          0.25     0.5      0.25
A1A2 x A2A2     (x12)(x22)       0        0.5      0.5
A2A2 x A1A1     (x22)(x11)       0        1        0
A2A2 x A1A2     (x22)(x12)       0        0.5      0.5
A2A2 x A2A2     (x22)^2          0        0        1


Genotype frequencies of newly formed zygotes

Now make 3 more assumptions:

1. No mutation

2. No drift

3. All matings produce the same number of offspring on average

The frequency of each genotype in newly formed zygotes is then:
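
The zygote frequencies can be recovered from the mating table by weighting each offspring column by the mating frequencies and summing. A sketch of that bookkeeping, assuming p denotes the frequency of allele A1 (so p = x11 + x12/2) and q = 1 - p (so q = x22 + x12/2):

```latex
\begin{align*}
\mathrm{Freq}(A_1A_1) &= x_{11}^2 + 2\left(\tfrac{1}{2}\right)x_{11}x_{12} + \tfrac{1}{4}x_{12}^2
  = \left(x_{11} + \tfrac{1}{2}x_{12}\right)^2 = p^2 \\
\mathrm{Freq}(A_1A_2) &= 2\left(\tfrac{1}{2}\right)x_{11}x_{12} + 2x_{11}x_{22} + \tfrac{1}{2}x_{12}^2 + 2\left(\tfrac{1}{2}\right)x_{12}x_{22}
  = 2\left(x_{11} + \tfrac{1}{2}x_{12}\right)\left(x_{22} + \tfrac{1}{2}x_{12}\right) = 2pq \\
\mathrm{Freq}(A_2A_2) &= \tfrac{1}{4}x_{12}^2 + 2\left(\tfrac{1}{2}\right)x_{12}x_{22} + x_{22}^2
  = \left(x_{22} + \tfrac{1}{2}x_{12}\right)^2 = q^2
\end{align*}
```

The factors of 2 arise because the table lists reciprocal matings (e.g. A1A1 x A1A2 and A1A2 x A1A1) on separate rows.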


More assumptions

To make those zygote genotype frequencies into adult genotype frequencies assume further:

1. Generations do not overlap

2. No differential survival among genotypes


Hardy-Weinberg Equilibrium (HWE)

• This is HWE:

– Freq(A1A1 in zygotes) = $p^2$

– Freq(A1A2 in zygotes) = $2p(1-p)$

– Freq(A2A2 in zygotes) = $(1-p)^2$

• Deviations from HWE must occur by violation of one of the previous assumptions. This is the power of HWE.


Null hypothesis: No deviation from HW

• Procedure:

– Estimate allele frequencies (p, q)

• For two alleles, the count of an allele in the sample is distributed as a binomial random variable. For more than 2 alleles, this is a multinomial distribution.

• Maximum likelihood and Bayesian methods can be used to do this.

– Generate expected HW genotypic frequencies: $p^2$, $2pq$, $q^2$

– Compare with observed genotypic frequencies using various test statistics:

• $\chi^2$ goodness of fit (discussed here)

• G test (similar to chi-square, uses likelihood method)

• Exact tests (small samples)


Chi-square test

• Numbers, not frequencies

• k = number of categories

• n = “degrees of freedom”

• This statistic is distributed as the sum of n independent squared “random normal variables” with mean = 0 and variance = 1.

$$\chi^2_n = \sum_{i=1}^{k} \frac{[\text{Observed\#}(i) - \text{expected\#}(i)]^2}{\text{expected\#}(i)}$$
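
The procedure above can be sketched in a few lines of Python for a single biallelic locus. This is a minimal illustration, not a replacement for dedicated software: the function name and the genotype counts are hypothetical, and the p-value shortcut assumes 1 degree of freedom (3 genotype categories, minus 1, minus 1 estimated allele frequency).

```python
import math

def hwe_chi_square(n11, n12, n22):
    """Chi-square goodness-of-fit test of HWE at one biallelic locus.

    Arguments are observed genotype COUNTS (numbers, not frequencies).
    Hypothetical helper for illustration; assumes both alleles are present.
    """
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)      # estimated frequency of allele A1
    q = 1.0 - p
    observed = [n11, n12, n22]
    expected = [n * p * p, n * 2 * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 d.f., the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Made-up counts: 30 A1A1, 40 A1A2, 30 A2A2.
chi2, p_value = hwe_chi_square(30, 40, 30)
print(chi2, p_value)
```

With small expected counts an exact test is usually the safer choice, as the list of test statistics above notes.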


Relaxation of random mating and the fixation index (f)

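• Now imagine that we have a mixture of randomly mating and selfing populations. The fraction of selfing individuals is σ.

$$x_{11} = p^2 + \frac{\sigma pq}{2(1 - \sigma/2)}$$

$$x_{12} = 2pq - 2\left(\frac{\sigma pq}{2(1 - \sigma/2)}\right)$$

$$x_{22} = q^2 + \frac{\sigma pq}{2(1 - \sigma/2)}$$

• Defining the fixation index f:

$$f = \frac{\sigma}{2(1 - \sigma/2)}$$

$$x_{11} = p^2 + fpq$$

$$x_{12} = 2pq(1 - f)$$

$$x_{22} = q^2 + fpq$$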


More on f: A simple estimator

• Notice that x12 is an observed quantity and that 2pq is the expectation under HWE.

• If the genotype and allele frequencies were observed without error:

$$f = 1 - \frac{H_{obs}}{H_{exp}} = 1 - \frac{x_{12}}{2pq}$$
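
As a rough illustration of this estimator, the sketch below computes f from genotype counts at one biallelic locus and, assuming the mixed-mating relationship f = σ / (2(1 - σ/2)) holds, inverts it to obtain the implied selfing fraction σ = 2f / (1 + f). The function names and counts are hypothetical.

```python
def fixation_index(n11, n12, n22):
    """Simple estimator f = 1 - Hobs / Hexp from genotype counts.

    Hypothetical helper; ignores sampling error in both quantities.
    """
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)   # frequency of allele A1
    h_obs = n12 / n                 # observed heterozygosity (x12)
    h_exp = 2 * p * (1 - p)         # HWE expectation (2pq)
    return 1 - h_obs / h_exp

def selfing_fraction(f):
    """Invert f = sigma / (2 - sigma) to get sigma (assumes that model)."""
    return 2 * f / (1 + f)

f = fixation_index(30, 40, 30)      # made-up counts
sigma = selfing_fraction(f)
print(f, sigma)
```

Here the observed heterozygosity (0.4) falls short of the expected value (0.5), so f is positive.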


Wahlund Effect and Wright’s F-statistics


The Wahlund Effect

• Consider two subpopulations each in HWE with a single biallelic locus where p1 and p2 are the allele frequencies in each subpopulation for allele A. If a sample is collected across both populations, the heterozygosity (H) of the sample is:

$$H_{E2} = \frac{2p_1(1 - p_1) + 2p_2(1 - p_2)}{2} = p_1(1 - p_1) + p_2(1 - p_2)$$

• However, if the collection of subpopulations was in HWE:

$$H_{E1} = 2p(1 - p), \qquad p = \frac{p_1 + p_2}{2}$$

• The Wahlund effect:

$$H_{E2} < H_{E1}$$

$$\left[p_1(1 - p_1) + p_2(1 - p_2)\right] < 2p(1 - p) \quad \text{iff} \quad p_1 \neq p_2$$


The Wahlund Effect - An example

p1 = 0.3, p2 = 0.7, p = 0.50

Mean H (eq. 1) = 0.42

Overall H (eq. 2) = 0.50
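
The numbers in this example can be reproduced with a short sketch. It assumes two equally weighted subpopulations, as in the equations above; the function name is hypothetical. It also shows that the heterozygote deficit equals twice the variance in allele frequency among subpopulations, which follows algebraically from the two expressions.

```python
def wahlund(p1, p2):
    """Mean within-subpopulation H and pooled H for two equal-sized subpopulations."""
    h_mean = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    p_bar = (p1 + p2) / 2
    h_pooled = 2 * p_bar * (1 - p_bar)
    return h_mean, h_pooled

p1, p2 = 0.3, 0.7
h_mean, h_pooled = wahlund(p1, p2)
deficit = h_pooled - h_mean

# Variance of allele frequency among the two subpopulations.
p_bar = (p1 + p2) / 2
var_p = ((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2) / 2
print(h_mean, h_pooled, deficit, 2 * var_p)
```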


What are the qualitative aspects of the Wahlund effect?


Connection from the Wahlund effect to F-statistics

Incorporation of local inbreeding (f)

• Let Hi be the actual heterozygosity within individuals, Hs the expected heterozygosity within subpopulations (each at HWE), and Ht the expected heterozygosity across the entire set of subpopulations. We can then define:

$$F_{it} = 1 - \frac{H_i}{H_t}$$

$$1 - F_{it} = \frac{H_i}{H_t} = \left(\frac{H_i}{H_s}\right)\left(\frac{H_s}{H_t}\right) = (1 - F_{is})(1 - F_{st})$$

$$F_{st} = \frac{H_t - H_s}{H_t}$$
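
A small numerical check of these definitions may help. The sketch below takes the three heterozygosities as given (known without error) and confirms the identity (1 - Fit) = (1 - Fis)(1 - Fst). The Hs and Ht values reuse the Wahlund example; the Hi value is made up for illustration.

```python
def f_statistics(h_i, h_s, h_t):
    """Wright's F-statistics from heterozygosities assumed known without error."""
    f_is = 1 - h_i / h_s
    f_st = 1 - h_s / h_t      # same as (h_t - h_s) / h_t
    f_it = 1 - h_i / h_t
    return f_is, f_st, f_it

# h_i = 0.336 is a hypothetical observed heterozygosity.
f_is, f_st, f_it = f_statistics(h_i=0.336, h_s=0.42, h_t=0.50)
print(f_is, f_st, f_it)
```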


The meaning of Fst

– Reduction in variance due to population structure relative to maximum variance possible in a randomly mating population.

– Proportion of the total expected heterozygosity that is not accounted for by the expected heterozygosity within subpopulations, i.e., (Ht − Hs)/Ht

– Also, a measure of the probability that two gene copies chosen at random from the same subpopulation are identical-by-descent, relative to two gene copies chosen at random from the entire population.

– See Slatkin (1991, Genetical Res. 58: 167-175) for a relationship of Fst to coalescence times among gene copies.


Estimating Fst from real data

• So far we have assumed that we know quantities without error. However, you have sampled from a set of populations and therefore have three kinds of error:

– Error due to taking a sample from the existing populations.

– Error due to real differences among populations.

– Error arising from the fact that the existing populations exist along one of an infinitely large number of evolutionary trajectories producing the observed gene frequencies.


A fixed effects estimator

• Nei and Chesser (1983) provide a bias corrected version of Gst (Nei, 1973; PNAS 70: 3321-3323):

$$G_{st} = \frac{H_t - H_s}{H_t}$$
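
To make the definition concrete, the sketch below computes the plain (uncorrected) Gst from allele frequencies in several subpopulations, allowing more than two alleles. It deliberately does not implement the Nei and Chesser (1983) bias correction, and it assumes equal weighting of subpopulations; the function name and the frequencies are hypothetical.

```python
def gst(freqs):
    """Uncorrected Gst from per-subpopulation allele frequencies.

    freqs: list of subpopulations, each a list of allele frequencies
    (same allele order in every subpopulation, each list summing to 1).
    """
    s = len(freqs)
    n_alleles = len(freqs[0])
    # Hs: mean expected heterozygosity within subpopulations.
    h_s = sum(1 - sum(p * p for p in pop) for pop in freqs) / s
    # Ht: expected heterozygosity from the mean allele frequencies.
    mean_freqs = [sum(pop[a] for pop in freqs) / s for a in range(n_alleles)]
    h_t = 1 - sum(p * p for p in mean_freqs)
    return (h_t - h_s) / h_t

g = gst([[0.3, 0.7], [0.7, 0.3]])
print(g)
```

With the two-subpopulation frequencies from the Wahlund example, Hs = 0.42 and Ht = 0.50.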

Precision of Gst


A random effects estimator

• Weir and Cockerham (1984; Evolution 38: 1358-1370) frame the estimator in ANOVA theory using random effects for one allele at a time (extension of Cockerham 1969, 1973):


See Smouse and Williams (1982, Biometrics 38: 757-768) for a multivariate ANOVA approach.

Multilocus estimators and significance testing

• Weir and Cockerham suggest that multilocus estimates be weighted averages across loci assumed to be in linkage equilibrium. The weights are functions of the allele frequencies at the loci.

• Significance of the estimates is typically assessed by bootstrapping over loci to get 95% confidence intervals. The test is then: does my 95% CI overlap 0?

• Weir and Cockerham (1984) advocate using the jackknife over samples or loci to estimate variances for a given locus or the multilocus estimate, respectively.
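
The bootstrap-over-loci idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular program: each locus is assumed to contribute a numerator (among-population variance component) and a denominator (total variance), the multilocus estimate is taken as the ratio of their sums, and the per-locus values shown are made up.

```python
import random

def multilocus_fst(components):
    """Ratio-of-sums estimate; components is a list of (numerator, denominator) per locus."""
    num = sum(a for a, _ in components)
    den = sum(d for _, d in components)
    return num / den

def bootstrap_ci(components, n_boot=2000, alpha=0.05, seed=1):
    """Percentile confidence interval from resampling loci with replacement."""
    rng = random.Random(seed)
    n_loci = len(components)
    estimates = []
    for _ in range(n_boot):
        resample = [components[rng.randrange(n_loci)] for _ in range(n_loci)]
        estimates.append(multilocus_fst(resample))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical per-locus (numerator, denominator) pairs for six loci.
loci = [(0.012, 0.20), (0.030, 0.25), (0.008, 0.16),
        (0.045, 0.30), (0.020, 0.22), (0.015, 0.18)]
point = multilocus_fst(loci)
lower, upper = bootstrap_ci(loci)
print(point, lower, upper)
```

If the lower bound is above 0, the interval does not overlap 0. With few loci the percentile interval is coarse, so treat a result like this as indicative only.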


Software

• FSTAT http://www2.unil.ch/popgen/softwares/fstat.htm

• Arlequin http://lgb.unige.ch/arlequin/

• Genepop http://genepop.curtin.edu.au/



Extensions

• Estimation of pollen and seed movement in conifers (Ennos, 1994)

• Bayesian estimation – very good for dominant data (cf. HICKORY http://darwin.eeb.uconn.edu/hickory/hickory.html)

• Outlier detection. FDIST2 (http://www.rubic.rdg.ac.uk/~mab/software.html)

• Population specific Fst values (Weir and Hill, 2002):

$$F_{st} = \frac{\sum_{i=1}^{P} n_i F_{st_i}}{\sum_{i=1}^{P} n_i}$$

• Models for differing data types: Haplotype frequencies and divergence between haplotypes.

– Rst: Stepwise mutation models incorporated (Slatkin, 1995)

– Nst: DNA sequence models incorporated

Pollen-to-seed flow ratio

Ennos (1994; Heredity) showed that under the equilibrium conditions applicable to Fst in general:

Without inbreeding:

$$\frac{\text{pollen flow}}{\text{seed flow}} \approx \frac{\left(\dfrac{1}{F_{st(b)}} - 1\right) - 2\left(\dfrac{1}{F_{st(m)}} - 1\right)}{\left(\dfrac{1}{F_{st(m)}} - 1\right)}$$

With inbreeding:

$$\frac{\text{pollen flow}}{\text{seed flow}} = \frac{\left(\dfrac{1}{F_{st(b)}} - 1\right)(1 + F_{IS}) - 2\left(\dfrac{1}{F_{st(m)}} - 1\right)}{\left(\dfrac{1}{F_{st(m)}} - 1\right)}$$

Here Fst(b) and Fst(m) are Fst values estimated from biparentally and maternally inherited markers, respectively.
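
A minimal sketch of the pollen-to-seed flow ratio, assuming the form [(1/Fst(b) - 1)(1 + Fis) - 2(1/Fst(m) - 1)] / (1/Fst(m) - 1), which reduces to the no-inbreeding case when Fis = 0. The function name and the Fst values are hypothetical.

```python
def pollen_to_seed_ratio(fst_b, fst_m, f_is=0.0):
    """Pollen flow / seed flow from biparental (fst_b) and maternal (fst_m) Fst.

    Assumes the equilibrium conditions of the underlying model hold.
    """
    a = (1 / fst_b - 1) * (1 + f_is)   # biparentally inherited markers
    b = 1 / fst_m - 1                  # maternally inherited markers
    return (a - 2 * b) / b

ratio_no_inbreeding = pollen_to_seed_ratio(0.05, 0.5)
ratio_inbreeding = pollen_to_seed_ratio(0.05, 0.5, f_is=0.1)
print(ratio_no_inbreeding, ratio_inbreeding)
```

Low differentiation at biparental markers combined with high differentiation at maternal markers yields a large ratio, i.e. pollen moves much further than seed.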


Outlier Detection: An example


cf. Beaumont and Balding (2004, Mol. Ecol. 13:969-980) for a refinement of fdist2 into a hierarchical Bayesian model.

FST to Nem

• Under a number of assumptions that lead to the n-island model:

$$F_{ST} = \frac{1}{1 + 4N_e\mu + 4N_e m(1 - \rho)}$$

• Where ρ is the correlation in gene frequencies among populations. Typically this is assumed to be 0.

• If the mutation rate is small enough so that it can be ignored (especially if m >> μ) then this simplifies to:

$$F_{ST} \approx \frac{1}{1 + 4N_e m} \;\rightarrow\; N_e m \approx \frac{1}{4F_{ST}} - 0.25$$


Relationship between Nem and FST

[Figure: effective number of migrants (Nm; y-axis, 0 to 5) plotted against the fixation index (FST; x-axis, 0 to 1), with Nem = 1 marked. At low FST, populations are not very divergent due to high gene flow; at high FST, populations are very divergent due to low gene flow.]
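
The curve relating Nem to FST can be traced with the simplified island-model expression. The sketch below assumes mutation is negligible and ρ = 0; the function names are hypothetical.

```python
def nem_from_fst(fst):
    """Nem ~ 1/(4 FST) - 0.25 under the simplified n-island model."""
    return 1 / (4 * fst) - 0.25

def fst_from_nem(nem):
    """FST ~ 1 / (1 + 4 Nem) under the same assumptions."""
    return 1 / (1 + 4 * nem)

nem_at_02 = nem_from_fst(0.2)
fst_at_1 = fst_from_nem(1.0)

# A few points along the curve.
for fst in (0.05, 0.1, 0.2, 0.5, 0.9):
    print(fst, round(nem_from_fst(fst), 3))
```

Because the relationship is steeply nonlinear at low FST, small errors in FST translate into large errors in Nem there.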


Estimators of Nem

• Coalescent-based estimators of Nem (cf. Beerli and Felsenstein, 2001; PNAS 98: 4563-4568)

• Coalescent-based estimators of Nem plus other forces (cf. Kuhner, 2006; Bioinformatics 22: 768-770).

• Coalescent-based estimators of Nem plus other forces and population divergence (cf. Hey and Nielsen, 2007; PNAS 104: 2785-2790).


Coalescent-based analyses - Examples


AMOVA - Analysis of Molecular Variance

• Hierarchical fixation indices are conducive to inclusion of more levels. Before we had:

1. Individual

2. Subpopulation

3. Total Population

• We could instead have the following:

1. Individual

2. Subpopulation

3. Groups of subpopulations

4. Entire population (= all subpopulations)

• This can be properly addressed with AMOVA.


AMOVA - An Example of Hierarchical Levels

Fixation indices can be calculated for among groups of populations (red lines; FCT) and among subpopulations within groups (green lines; FSC).

This is done within an Analysis of Variance framework using (co)variance components within a general linear model.


An AMOVA table


Significance of variance components

Permutation analyses depending upon the component:

• For FCT, this is typically done by permuting populations among groups.

• For FSC, this is done by permuting genotypes among populations within groups.

• For FST, this is done by permuting genotypes among populations among groups.


Isolation-by-Distance (IBD)

For populations that have large distributions, the n-island model may be inappropriate. For example, populations close in space may be less differentiated than those far away in space.

If space is the primary driving force, then there should exist a correlation between geographic and genetic distance among populations.

This is tested by correlating pairwise FST with pairwise geographic distances using a matrix correlation function (i.e., the Mantel statistic).


IBD - An Example

Example 1: Mantel r significant; IBD present

Pairwise FST
Population    1       2       3
1
2             0.11
3             0.29    0.15

Pairwise distance (km)
Population    1       2       3
1
2             45
3             105     53

[Figure: pairwise FST (0.00 to 0.35) plotted against distance (0 to 120 km)]

Example 2: Mantel r not significant; IBD not present

Pairwise FST
Population    1       2       3
1
2             0.55
3             0.48    0.22

Pairwise distance (km)
Population    1       2       3
1
2             45
3             105     53

[Figure: pairwise FST (0.00 to 0.60) plotted against distance (0 to 120 km)]
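
The matrix correlation used for IBD can be sketched as a simple Mantel permutation test. This is a bare-bones illustration in plain Python, not the implementation of any particular package: the function names are hypothetical, the matrices are the first example's pairwise FST and distance values, and the p-value is one-sided.

```python
import math
import random

def lower_triangle(m):
    """Flatten the below-diagonal entries of a square matrix."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(gen, geo, n_perm=999, seed=1):
    """Mantel r and a one-sided permutation p-value.

    Population labels of the genetic matrix are shuffled (rows and
    columns together) to build the null distribution.
    """
    rng = random.Random(seed)
    n = len(gen)
    geo_vec = lower_triangle(geo)
    r_obs = pearson(lower_triangle(gen), geo_vec)
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        permuted = [[gen[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(lower_triangle(permuted), geo_vec) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

fst = [[0.0, 0.11, 0.29], [0.11, 0.0, 0.15], [0.29, 0.15, 0.0]]
dist = [[0, 45, 105], [45, 0, 53], [105, 53, 0]]
r, p = mantel(fst, dist)
print(r, p)
```

With only three populations there are just six possible relabelings, so the permutation p-value is very coarse; the three-population matrices serve only to show the mechanics, and real analyses need many more populations.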

Clustering Methods


Bayesian Clustering as an Alternative to FST analyses

• Multilocus genotypes contain information about population structure.

• The underlying information is the same as that used in estimating FST.

• However, a model-based effort alleviates the need to define populations a priori, because the number of populations is a parameter of the model (Pritchard et al., 2000 Genetics 155: 945-959).

• So, what is the model…..?


The Model

• Parametric model for the frequency distribution of alleles in an unknown number of populations each in HWE.

• Pritchard et al. assume a Dirichlet distribution for the allele frequencies.

• The basic idea is to approximate:

$$\Pr(Z, P \mid X) \propto \Pr(Z)\,\Pr(P)\,\Pr(X \mid Z, P)$$

Slide labels for the terms: PPD, ML, DP, KP.

• Using Markov Chain Monte Carlo. This is the Posterior Probability Distribution (PPD) of the allele frequencies (P) in the unknown populations of origin (Z) given the observed multilocus genotypes (X). [Loci are assumed to be in linkage equilibrium].


http://pritch.bsd.uchicago.edu/structure.html

Inferences Using STRUCTURE

• Population structure present:

• This method can also infer the optimal number of clusters (= populations) K given the data. This is done by assuming a uniform prior on a range of possible values for K. The results are the posterior probabilities for each value of K in the prior.

K    ln L     Posterior Probability
1    -1056    0.005
2    -1038    0.99
3    -1055    0.005
4    -1125    0
5    -1255    0
6    -1266    0
7    -1279    0
8    -1458    0


Another method for inference of K

• The ΔK method of Evanno et al. (2005, Mol. Ecol. 14: 2611-2620):
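
As a hedged sketch of how ΔK is commonly computed from replicate runs: ΔK = |L(K+1) - 2L(K) + L(K-1)| / sd[L(K)], where L(K) is the mean ln L across replicates at each K and sd is the standard deviation across those replicates. This form is an assumption about the statistic, and the function name and replicate ln L values below are made up for illustration.

```python
import statistics

def delta_k(lnl_runs):
    """Delta K for each interior K.

    lnl_runs: dict mapping K -> list of ln L values from replicate runs.
    Assumes consecutive integer K values and at least 2 replicates per K.
    """
    ks = sorted(lnl_runs)
    mean = {k: statistics.mean(lnl_runs[k]) for k in ks}
    sd = {k: statistics.stdev(lnl_runs[k]) for k in ks}
    result = {}
    for k in ks[1:-1]:   # first and last K have no second difference
        second_diff = abs(mean[k + 1] - 2 * mean[k] + mean[k - 1])
        result[k] = second_diff / sd[k]
    return result

# Hypothetical replicate runs (three per K).
runs = {
    1: [-1056.0, -1057.0, -1055.0],
    2: [-1038.0, -1039.0, -1037.0],
    3: [-1055.0, -1050.0, -1060.0],
    4: [-1125.0, -1110.0, -1140.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Note that ΔK is undefined for the smallest and largest K tested, so it cannot select K = 1.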


Post-processing of STRUCTURE runs

• Production of Q-plots: DISTRUCT

• Solving the label-switching problem and averaging across replicated runs: CLUMPP

http://rosenberglab.bioinformatics.med.umich.edu/


What are those strange plots?

• Plots of Q-values are nothing more than stacked barplots

• These can be grouped into geographical regions or can be plotted onto a map if you have spatial coordinates using R scripts (available from the TESS website)


Other clustering methods

• TESS (prior on spatial coordinates using Dirichlet tessellations; Francois et al., 2006; Genetics, 174:805-816) available from: http://www-timc.imag.fr/Olivier.Francois/tess.html

• EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190).

– Good for SNP data


What does it all mean? A point for discussion

• What is the “evolutionary” significance of population structure in forest trees?

• How would something like Sewall Wright’s shifting balance theory of evolution work in forest trees?

