Population structure - Foundations to software

Page 1: Population structure - Foundations to software

Population structure - Foundations to software

Andrew J. Eckert

Section of Evolution and Ecology, University of California at Davis,

Davis, CA 95616 USA

Ph: (530) 754-5743

E-mail: [email protected]

Eckert, Population Structure, 5-Aug-2008 1

Page 2: Population structure - Foundations to software

An example from foxtail pine (Pinus balfouriana)

Page 3: Population structure - Foundations to software

Population structure in forest trees

Page 4: Population structure - Foundations to software

Canonical questions

• To what extent do gene frequencies differ among populations of forest trees?

• How is gene flow structured among populations of forest trees?

• How can population structure inform us about other processes in forest trees?

Page 5: Population structure - Foundations to software

Topics

• Hardy-Weinberg Equilibrium

• Wahlund effect and F-statistics

• Estimating F-statistics from real data

• Relationship between Fst and Nem

• Clustering methods

Page 6: Population structure - Foundations to software

Hardy-Weinberg Principle and Estimation of Allele Frequencies in Populations

Page 7: Population structure - Foundations to software

Mating Tables

• Construct a mating table by assuming that:
– Genotype frequencies are the same between sexes
– Mating is at random with respect to the genotypes at a particular locus
– No segregation distortion or differential survival of zygotes.

Page 8: Population structure - Foundations to software

A generalized mating table

Mating         Mating freq.    A1A1    A1A2    A2A2
A1A1 x A1A1    (x11)^2         1       0       0
A1A1 x A1A2    (x11)(x12)      0.5     0.5     0
A1A1 x A2A2    (x11)(x22)      0       1       0
A1A2 x A1A1    (x12)(x11)      0.5     0.5     0
A1A2 x A1A2    (x12)^2         0.25    0.5     0.25
A1A2 x A2A2    (x12)(x22)      0       0.5     0.5
A2A2 x A1A1    (x22)(x11)      0       1       0
A2A2 x A1A2    (x22)(x12)      0       0.5     0.5
A2A2 x A2A2    (x22)^2         0       0       1

(The last three columns give the proportions of offspring genotypes from each mating.)

Page 9: Population structure - Foundations to software

Genotype frequencies of newly formed zygotes

Now make 3 more assumptions:

1. No mutation

2. No drift

3. All matings produce the same number of offspring on average

The frequency of each genotype in newly formed zygotes is then:
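Under these assumptions, the zygote frequencies follow from multiplying each mating frequency by its offspring proportions and summing over the mating table. The Python sketch below does that sum; the function and variable names are illustrative, not from the slides. The result matches p^2, 2pq and q^2 with p = x11 + x12/2.

```python
from itertools import product

# Offspring proportions (A1A1, A1A2, A2A2) for each unordered parental pair,
# taken from the generalized mating table.
OFFSPRING = {
    ("A1A1", "A1A1"): (1.0, 0.0, 0.0),
    ("A1A1", "A1A2"): (0.5, 0.5, 0.0),
    ("A1A1", "A2A2"): (0.0, 1.0, 0.0),
    ("A1A2", "A1A2"): (0.25, 0.5, 0.25),
    ("A1A2", "A2A2"): (0.0, 0.5, 0.5),
    ("A2A2", "A2A2"): (0.0, 0.0, 1.0),
}


def zygote_frequencies(x11, x12, x22):
    """Sum mating frequency times offspring proportion over all nine matings."""
    freq = {"A1A1": x11, "A1A2": x12, "A2A2": x22}
    out = [0.0, 0.0, 0.0]
    for mother, father in product(freq, repeat=2):
        # Reciprocal matings share the same offspring proportions.
        key = (mother, father) if (mother, father) in OFFSPRING else (father, mother)
        mating_freq = freq[mother] * freq[father]  # random mating
        for k, share in enumerate(OFFSPRING[key]):
            out[k] += mating_freq * share
    return tuple(out)


if __name__ == "__main__":
    x11, x12, x22 = 0.5, 0.2, 0.3
    p = x11 + x12 / 2
    print(zygote_frequencies(x11, x12, x22))
    print((p * p, 2 * p * (1 - p), (1 - p) ** 2))
```

The starting genotype frequencies need not be in Hardy-Weinberg proportions; one round of random mating is enough to reach them.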

Page 10: Population structure - Foundations to software

More assumptions

To make those zygote genotype frequencies into adult genotype frequencies assume further:

1. Generations do not overlap

2. No differential survival among genotypes

Page 11: Population structure - Foundations to software

Hardy-Weinberg Equilibrium (HWE)

• This is HWE:
– Freq(A1A1 in zygotes) = p^2
– Freq(A1A2 in zygotes) = 2p(1-p)
– Freq(A2A2 in zygotes) = (1-p)^2

• Deviations from HWE must occur by violation of one of the previous assumptions. This is the power of HWE.

Page 12: Population structure - Foundations to software

Null hypothesis: No deviation from HW

• Procedure:
– Estimate allele frequencies
  • p, q
  • For two alleles, p is distributed as a binomial random variable. For more than 2 alleles, this is a multinomial distribution.
  • Maximum likelihood and Bayesian methods to do this.
– Generate expected HW genotypic frequencies
  • p^2, 2pq, q^2
– Compare with observed genotypic frequencies
  • various test statistics:
    – χ² goodness of fit (discussed here)
    – G test (similar to chi-square, uses likelihood method)
    – Exact tests (small samples)

Page 13: Population structure - Foundations to software

Chi-square test

\[ \chi^2_n = \sum_{i=1}^{k} \frac{[\text{Observed \#}(i) - \text{expected \#}(i)]^2}{\text{expected \#}(i)} \]

• Numbers, not frequencies

• k = number of categories

• n = degrees of freedom

• This statistic is distributed as the sum of n independent squared "random normal variables" with mean = 0 and variance = 1.
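As a minimal sketch of this test for one biallelic locus, the Python function below estimates p from genotype counts, builds the expected counts under HWE and returns the chi-square statistic. It assumes 1 degree of freedom (3 genotype classes, minus 1, minus 1 estimated allele frequency) and uses the closed-form tail probability for that case; the function name and the example counts are illustrative only.

```python
from math import erfc, sqrt


def hwe_chi_square(n11, n12, n22):
    """Chi-square goodness-of-fit test of HWE from genotype counts.

    Returns (chi2, p_value) for 1 degree of freedom.
    Assumes both alleles are present, so no expected count is zero.
    """
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # frequency of allele A1
    q = 1.0 - p
    observed = (n11, n12, n22)
    expected = (n * p * p, 2 * n * p * q, n * q * q)  # counts, not frequencies
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p_value = hwe_chi_square(50, 20, 30)  # hypothetical counts
    print(f"chi2 = {chi2:.3f}, p = {p_value:.3g}")
```

With small expected counts the chi-square approximation is poor, which is where the exact tests listed on the previous slide come in.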

Page 14: Population structure - Foundations to software

Relaxation of random mating and the fixation index (f)

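• Now imagine that we have a mixture of randomly mating and selfing populations. The fraction of selfing individuals is σ.

\[ x_{11} = p^2 + \frac{\sigma pq}{2(1 - \sigma/2)} \]

\[ x_{12} = 2pq - 2\left(\frac{\sigma pq}{2(1 - \sigma/2)}\right) \]

\[ x_{22} = q^2 + \frac{\sigma pq}{2(1 - \sigma/2)} \]

Defining the fixation index as

\[ f = \frac{\sigma}{2(1 - \sigma/2)} \]

these become:

\[ x_{11} = p^2 + fpq \]

\[ x_{12} = 2pq(1 - f) \]

\[ x_{22} = q^2 + fpq \]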
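One way to check the expression for f is to iterate heterozygosity under mixed mating until it settles. The recursion used below is an assumption of this sketch, not something shown on the slide: a fraction σ of offspring come from selfing (which halves heterozygosity) and the rest from random outcrossing (which produces heterozygotes at frequency 2pq).

```python
def equilibrium_f(sigma, p=0.5, generations=200):
    """Iterate heterozygosity under mixed selfing and random mating.

    Assumed recursion: H' = sigma * H / 2 + (1 - sigma) * 2pq.
    Returns f = 1 - H / (2pq) after the given number of generations.
    """
    q = 1.0 - p
    het = 2.0 * p * q  # start in Hardy-Weinberg proportions
    for _ in range(generations):
        het = sigma * het / 2.0 + (1.0 - sigma) * 2.0 * p * q
    return 1.0 - het / (2.0 * p * q)


def predicted_f(sigma):
    """Closed form from the slide: f = sigma / (2 * (1 - sigma / 2))."""
    return sigma / (2.0 * (1.0 - sigma / 2.0))


if __name__ == "__main__":
    for sigma in (0.0, 0.25, 0.5, 0.9):
        print(sigma, equilibrium_f(sigma), predicted_f(sigma))
```

Allele frequencies do not change under this recursion, so the equilibrium f does not depend on p.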

Page 15: Population structure - Foundations to software

More on f: A simple estimator

• Notice that x12 is an observed quantity and that 2pq is the expectation under HWE.

• If the genotype and allele frequencies were observed without error:

\[ f = 1 - \frac{H_{obs}}{H_{exp}} = 1 - \frac{x_{12}}{2pq} \]
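The estimator can be sketched directly from genotype counts, treating the sample frequencies as if they were observed without error, as the slide assumes. The function name and counts are illustrative.

```python
def fixation_index(n11, n12, n22):
    """Simple estimator f = 1 - Hobs / Hexp from genotype counts.

    Ignores sampling error and assumes the locus is polymorphic in the sample.
    """
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # sample frequency of allele A1
    h_obs = n12 / n                # observed heterozygote frequency (x12)
    h_exp = 2.0 * p * (1.0 - p)    # HWE expectation (2pq)
    return 1.0 - h_obs / h_exp


if __name__ == "__main__":
    print(fixation_index(50, 20, 30))  # heterozygote deficit, f > 0
    print(fixation_index(36, 48, 16))  # HWE proportions, f = 0
```

Positive values indicate a deficit of heterozygotes relative to HWE, negative values an excess.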

Page 16: Population structure - Foundations to software

Wahlund Effect and Wright’s F-statistics

Page 17: Population structure - Foundations to software

The Wahlund Effect

• Consider two subpopulations, each in HWE, with a single biallelic locus where p1 and p2 are the frequencies of allele A in each subpopulation. If a sample is collected across both subpopulations in equal proportions, the heterozygosity (H) of the sample is:

\[ H_{E2} = \frac{2p_1(1 - p_1) + 2p_2(1 - p_2)}{2} = p_1(1 - p_1) + p_2(1 - p_2) \]

• However, if the collection of subpopulations was in HWE:

\[ H_{E1} = 2\bar{p}(1 - \bar{p}), \qquad \bar{p} = \frac{p_1 + p_2}{2} \]

• The Wahlund effect:

\[ H_{E2} < H_{E1}, \quad \text{i.e.} \quad \left[ p_1(1 - p_1) + p_2(1 - p_2) \right] < 2\bar{p}(1 - \bar{p}) \quad \text{iff } p_1 \neq p_2 \]

Page 18: Population structure - Foundations to software

The Wahlund Effect - An example

p1 = 0.3, p2 = 0.7, mean p = 0.50

Mean H (eq. 1) = 0.42

Overall H (eq. 2) = 0.50
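The numbers in this example can be reproduced with a few lines of Python. The sketch assumes two equally weighted subpopulations, each in HWE; the function name is illustrative. It also returns the heterozygote deficit, which equals twice the variance in allele frequency between the subpopulations.

```python
def wahlund(p1, p2):
    """Mean within-subpopulation H, pooled H, and the deficit between them."""
    mean_h = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # eq. 1
    p_bar = (p1 + p2) / 2
    total_h = 2 * p_bar * (1 - p_bar)                     # eq. 2
    return mean_h, total_h, total_h - mean_h


if __name__ == "__main__":
    mean_h, total_h, deficit = wahlund(0.3, 0.7)
    p_bar = 0.5
    variance = ((0.3 - p_bar) ** 2 + (0.7 - p_bar) ** 2) / 2
    print(mean_h, total_h, deficit, 2 * variance)
```

The deficit is zero only when p1 = p2 and grows with the squared difference in allele frequencies, which is the qualitative point of the Wahlund effect.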

Page 19: Population structure - Foundations to software

What are the qualitative aspects of the Wahlund effect?

Page 20: Population structure - Foundations to software

Connection from the Wahlund effect to F-statistics

Page 21: Population structure - Foundations to software

Incorporation of local inbreeding (f)

• Let Hi be the actual heterozygosity within individuals, Hs the expected heterozygosity within subpopulations (all at HWE), and Ht the expected heterozygosity across the entire set of subpopulations. We can then define:

\[ F_{it} = 1 - \frac{H_i}{H_t} \]

\[ 1 - F_{it} = \frac{H_i}{H_t} = \left(\frac{H_i}{H_s}\right)\left(\frac{H_s}{H_t}\right) = (1 - F_{is})(1 - F_{st}) \]

\[ F_{st} = \frac{H_t - H_s}{H_t} \]
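The three indices and the identity linking them can be checked numerically. The sketch below takes Hi, Hs and Ht as given (that is, known without error); the function name and the input values are made up for illustration.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (Fis, Fst, Fit) from the three heterozygosities."""
    f_is = 1.0 - h_i / h_s
    f_st = (h_t - h_s) / h_t
    f_it = 1.0 - h_i / h_t
    return f_is, f_st, f_it


if __name__ == "__main__":
    f_is, f_st, f_it = f_statistics(h_i=0.30, h_s=0.42, h_t=0.50)
    print(f_is, f_st, f_it)
    # The hierarchical identity: (1 - Fit) = (1 - Fis)(1 - Fst)
    print(1 - f_it, (1 - f_is) * (1 - f_st))
```

With Hi = Hs (no local inbreeding), Fis is zero and Fit reduces to Fst.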

Page 22: Population structure - Foundations to software

The meaning of Fst

– Reduction in variance due to population structure relative to maximum variance possible in a randomly mating population.

– Proportion of the total expected heterozygosity that is not accounted for by the expected heterozygosity within subpopulations, i.e. the share due to differences among subpopulations: (Ht - Hs)/Ht.

– Also, a measure of the probability that two gene copies chosen at random from two different subpopulations are identical-by-descent.

– See Slatkin (1991, Genetical Res. 58: 167-175) for a relationship of Fst to coalescence times among gene copies.

Page 23: Population structure - Foundations to software

Estimating Fst from real data

• So far, we have assumed that we know quantities without error. However, you have sampled from a set of populations and therefore have three kinds of error:
– Error due to taking a sample from the existing populations.
– Error due to real differences among populations.
– Error arising from the fact that the existing populations lie along one of an infinitely large number of evolutionary trajectories that could produce the observed gene frequencies.

Page 24: Population structure - Foundations to software

A fixed effects estimator

• Nei and Chesser (1983) provide a bias corrected version of Gst (Nei, 1973; PNAS 70: 3321-3323):

\[ G_{st} = \frac{H_t - H_s}{H_t} \]

Page 25: Population structure - Foundations to software

Precision of Gst

Page 26: Population structure - Foundations to software

A random effects estimator

• Weir and Cockerham (1984; Evolution 38: 1358-1370) frame the estimator in ANOVA theory using random effects for one allele at a time (extension of Cockerham 1969, 1973):

See Smouse and Williams (1982, Biometrics 38: 757-768) for a multivariate ANOVA approach.

Page 27: Population structure - Foundations to software

Multilocus estimators and significance testing

• Weir and Cockerham suggest that multilocus estimates be weighted averages across loci assumed to be in linkage equilibrium. The weights are functions of the allele frequencies at the loci.

• Significance of the estimates is typically assessed by bootstrapping over loci to get 95% confidence intervals. The test is then: does my 95% CI overlap 0?

• Weir and Cockerham (1984) advocate using the jackknife over samples or loci to estimate variances for a given locus or the multilocus estimate, respectively.
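A minimal sketch of bootstrapping over loci is given below. It assumes each locus has already been reduced to a pair (numerator, denominator) of variance components, and it combines loci as a ratio of sums, which is one way of weighting loci by their allele frequencies. The input values, function names and the percentile interval are assumptions of this sketch, not output from FSTAT or any other program.

```python
import random


def multilocus_theta(components):
    """Ratio-of-sums multilocus estimate from per-locus (numerator, denominator)."""
    numerator = sum(num for num, _ in components)
    denominator = sum(den for _, den in components)
    return numerator / denominator


def bootstrap_ci(components, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval, resampling loci with replacement."""
    rng = random.Random(seed)
    n_loci = len(components)
    estimates = sorted(
        multilocus_theta([components[rng.randrange(n_loci)] for _ in range(n_loci)])
        for _ in range(n_boot)
    )
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical per-locus components for five loci.
    loci = [(0.020, 0.20), (0.060, 0.30), (0.010, 0.25), (0.045, 0.28), (0.030, 0.22)]
    print(multilocus_theta(loci), bootstrap_ci(loci))
```

With only a handful of loci the interval is coarse, since few distinct resamples are possible.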

Page 28: Population structure - Foundations to software

Software

• FSTAT http://www2.unil.ch/popgen/softwares/fstat.htm

• Arlequin http://lgb.unige.ch/arlequin/

• Genepop http://genepop.curtin.edu.au/

Page 29: Population structure - Foundations to software

Extensions

• Estimation of pollen and seed movement in conifers (Ennos, 1994)

• Bayesian estimation – very good for dominant data (cf. HICKORY http://darwin.eeb.uconn.edu/hickory/hickory.html)

• Outlier detection. FDIST2 (http://www.rubic.rdg.ac.uk/~mab/software.html)

• Population specific Fst values (Weir and Hill, 2002):

\[ F_{st} = \frac{\sum_{i=1}^{P} n_i F_{st_i}}{\sum_{i=1}^{P} n_i} \]

• Models for differing data types: Haplotype frequencies and divergence between haplotypes.
– Rst: Stepwise mutation models incorporated (Slatkin, 1995)
– Nst: DNA sequence models incorporated

Page 30: Population structure - Foundations to software

Pollen-to-seed flow ratio

Ennos (1994; Heredity) showed that under the equilibrium conditions applicable to Fst in general:

Without inbreeding:

\[ \frac{\text{pollen flow}}{\text{seed flow}} = \frac{\left(\frac{1}{F_{st(b)}} - 1\right) - 2\left(\frac{1}{F_{st(m)}} - 1\right)}{\left(\frac{1}{F_{st(m)}} - 1\right)} \]

With inbreeding:

\[ \frac{\text{pollen flow}}{\text{seed flow}} = \frac{\left(\frac{1}{F_{st(b)}} - 1\right)(1 + F_{IS}) - 2\left(\frac{1}{F_{st(m)}} - 1\right)}{\left(\frac{1}{F_{st(m)}} - 1\right)} \]
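The two expressions differ only by the (1 + FIS) factor, so a single function covers both. The sketch below assumes the subscripts b and m refer to Fst measured at biparentally inherited (nuclear) and maternally inherited markers, following Ennos (1994); the function name and the input values are illustrative.

```python
def pollen_to_seed_ratio(fst_b, fst_m, f_is=0.0):
    """Pollen-to-seed flow ratio from Fst at two marker classes.

    fst_b: Fst at biparentally inherited markers (assumed meaning of 'b')
    fst_m: Fst at maternally inherited markers (assumed meaning of 'm')
    f_is:  inbreeding coefficient; 0.0 gives the 'without inbreeding' form
    """
    a = 1.0 / fst_b - 1.0
    c = 1.0 / fst_m - 1.0
    return (a * (1.0 + f_is) - 2.0 * c) / c


if __name__ == "__main__":
    # Hypothetical values: weak nuclear structure, strong maternal structure.
    print(pollen_to_seed_ratio(0.1, 0.5))
    print(pollen_to_seed_ratio(0.1, 0.5, f_is=0.1))
```

Because the formula rests on equilibrium assumptions, a negative ratio from real data signals that those assumptions, or the Fst estimates, do not hold.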

Page 31: Population structure - Foundations to software

Outlier Detection: An example

cf. Beaumont and Balding (2004, Mol. Ecol. 13:969-980) for a refinement of fdist2 into a hierarchical Bayesian model.

Page 32: Population structure - Foundations to software

FST to Nem

• Under a number of assumptions that lead to the n-island model:

\[ F_{ST} = \frac{1}{1 + 4N_e\mu + 4N_e m(1 - \rho)} \]

• where ρ is the correlation in gene frequencies among populations. Typically this is assumed to be 0.

• If the mutation rate is small enough that it can be ignored (especially if m >> μ), then this simplifies to:

\[ F_{ST} \approx \frac{1}{1 + 4N_e m} \quad \rightarrow \quad N_e m \approx \frac{1}{4F_{ST}} - 0.25 \]
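Both directions of this relationship are simple to compute. The sketch below implements the full island-model expression and the simplified inversion; the function and parameter names are illustrative, and the result is only as good as the island-model assumptions behind it.

```python
def fst_island(ne, m, mu=0.0, rho=0.0):
    """Expected FST under the n-island model: 1 / (1 + 4*Ne*mu + 4*Ne*m*(1 - rho))."""
    return 1.0 / (1.0 + 4.0 * ne * mu + 4.0 * ne * m * (1.0 - rho))


def nem_from_fst(fst):
    """Approximate Ne*m from FST, ignoring mutation and setting rho = 0."""
    return 1.0 / (4.0 * fst) - 0.25


if __name__ == "__main__":
    for fst in (0.05, 0.1, 0.2, 0.5):
        print(fst, nem_from_fst(fst))
    # One migrant per generation (Ne*m = 1) corresponds to FST = 0.2.
    print(fst_island(ne=100, m=0.01))
```

The curve is steep at low FST, so small errors in FST translate into large errors in the estimated number of migrants.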

Page 33: Population structure - Foundations to software

Relationship between Nem and FST

[Figure: effective number of migrants (Nm, y-axis, 0 to 5) plotted against the fixation index (FST, x-axis, 0 to 1), with Nem = 1 marked. Low FST: populations not very divergent due to high gene flow. High FST: populations very divergent due to low gene flow.]

Page 34: Population structure - Foundations to software

Estimators of Nem

• Coalescent-based estimators of Nem (cf. Beerli and Felsenstein, 2001; PNAS 98: 4563-4568)

• Coalescent-based estimators of Nem plus other forces (cf. Kuhner, 2006; Bioinformatics 22: 768-770).

• Coalescent-based estimators of Nem plus other forces and population divergence (cf. Hey and Nielsen, 2007; PNAS 104: 2785-2790).

Page 35: Population structure - Foundations to software

Coalescent-based analyses - Examples

Page 36: Population structure - Foundations to software

AMOVA - Analysis of Molecular Variance

• Hierarchical fixation indices are conducive to inclusion of more levels. Before we had:

1. Individual

2. Subpopulation

3. Total Population

• We could instead have the following:

1. Individual

2. Subpopulation

3. Groups of subpopulations

4. Entire population (= all subpopulations)

• This can be properly addressed with AMOVA.

Page 37: Population structure - Foundations to software

AMOVA - An Example of Hierarchical Levels

Fixation indices can be calculated for among groups of populations (red lines; FCT) and among subpopulations within groups (green lines; FSC).

This is done within an Analysis of Variance framework using (co)variance components within a general linear model.

Page 38: Population structure - Foundations to software

An AMOVA table

Page 39: Population structure - Foundations to software

Significance of variance components

Permutation analyses depending upon the component:

• For FCT, this is typically done by permuting populations among groups.

• For FSC, this is done by permuting genotypes among populations within groups.

• For FST, this is done by permuting genotypes among populations and among groups.

Page 40: Population structure - Foundations to software

Isolation-by-Distance (IBD)

For populations that have large distributions, the n-island model may be inappropriate. For example, populations close in space may be less differentiated than those far away in space.

If space is the primary driving force, then there should exist a correlation between geographic and genetic distance among populations.

This is tested by correlating pairwise FST with pairwise geographic distances using a matrix correlation function (i.e., the Mantel statistic).

Page 41: Population structure - Foundations to software

IBD - An Example

Example 1: Mantel r significant (IBD present)

Pairwise FST
Population    1       2       3
1
2             0.11
3             0.29    0.15

Distance (km)
Population    1       2       3
1
2             45
3             105     53

[Figure: pairwise FST (y-axis, 0.00 to 0.35) plotted against distance in km (x-axis, 0 to 120).]

Example 2: Mantel r not significant (IBD not present)

Pairwise FST
Population    1       2       3
1
2             0.55
3             0.48    0.22

Distance (km)
Population    1       2       3
1
2             45
3             105     53

[Figure: pairwise FST (y-axis, 0.00 to 0.60) plotted against distance in km (x-axis, 0 to 120).]
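The Mantel correlation for the two three-population examples can be sketched as follows. This is an illustrative implementation, not the routine used by any particular program: it correlates the lower triangles of the two matrices and, because three populations allow only six relabelings, enumerates every permutation instead of sampling them. With so few populations the smallest possible p-value is 1/6, so the matrices here can only illustrate the mechanics; a real test needs many more populations.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lower_triangle(matrix, order):
    """Lower-triangle entries after relabeling populations by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i)]


def mantel_exact(genetic, geographic):
    """Mantel r and a one-sided exact p-value over all population relabelings."""
    n = len(genetic)
    identity = tuple(range(n))
    geo = lower_triangle(geographic, identity)
    r_obs = pearson(lower_triangle(genetic, identity), geo)
    perms = list(permutations(range(n)))
    hits = sum(pearson(lower_triangle(genetic, p), geo) >= r_obs - 1e-12 for p in perms)
    return r_obs, hits / len(perms)


DIST = [[0, 45, 105], [45, 0, 53], [105, 53, 0]]
FST_1 = [[0, 0.11, 0.29], [0.11, 0, 0.15], [0.29, 0.15, 0]]
FST_2 = [[0, 0.55, 0.48], [0.55, 0, 0.22], [0.48, 0.22, 0]]

if __name__ == "__main__":
    print(mantel_exact(FST_1, DIST))
    print(mantel_exact(FST_2, DIST))
```

For larger data sets the full enumeration becomes infeasible and a random sample of permutations is used instead.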

Page 42: Population structure - Foundations to software

Clustering Methods

Page 43: Population structure - Foundations to software

Bayesian Clustering as an Alternative to FST analyses

• Multilocus genotypes contain information about population structure.

• The underlying information is the same as that used in estimating FST.

• However, a model-based effort alleviates the need to define populations a priori, because the number of populations is a parameter of the model (Pritchard et al., 2000 Genetics 155: 945-959).

• So, what is the model…..?

Page 44: Population structure - Foundations to software

The Model

• Parametric model for the frequency distribution of alleles in an unknown number of populations each in HWE.

• Pritchard et al. assume a Dirichlet distribution for the allele frequencies.

• The basic idea is to approximate:

\[ \Pr(Z, P \mid X) \propto \Pr(Z)\,\Pr(P)\,\Pr(X \mid Z, P) \]

• Using Markov Chain Monte Carlo. This is the Posterior Probability Distribution (PPD) of the allele frequencies (P) in the unknown populations of origin (Z) given the observed multilocus genotypes (X). [Loci are assumed to be in linkage equilibrium].

http://pritch.bsd.uchicago.edu/structure.html

Page 45: Population structure - Foundations to software

Inferences Using STRUCTURE

• Population structure present:

• This method can also infer the optimal number of clusters (= populations) K given the data. This is done by assuming a uniform prior on a range of possible values for K. The results are the posterior probabilities for each value of K in the prior.

K    ln L     Posterior Probability
1    -1056    0.005
2    -1038    0.99
3    -1055    0.005
4    -1125    0
5    -1255    0
6    -1266    0
7    -1279    0
8    -1458    0

Page 46: Population structure - Foundations to software

Another method for inference of K

• The ΔK method of Evanno et al. (2005, Mol. Ecol. 14: 2611-2620):
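As a rough sketch of the idea, ΔK compares the second-order rate of change of ln L across successive values of K with the spread of ln L among replicate runs at that K: ΔK = mean(|L(K+1) - 2L(K) + L(K-1)|) / sd(L(K)). The code below is one reading of that definition. The ln L values are hypothetical, replicates are paired by index purely for convenience, and the function name is not from any published tool.

```python
from statistics import mean, stdev


def delta_k(lnl_by_k):
    """Delta K for each interior K.

    lnl_by_k maps K to a list of ln L values from replicate runs.
    Assumes consecutive integer K values, equal numbers of replicates,
    and non-zero spread among replicates at each K.
    """
    ks = sorted(lnl_by_k)
    result = {}
    for k in ks[1:-1]:
        prev_runs, runs, next_runs = lnl_by_k[k - 1], lnl_by_k[k], lnl_by_k[k + 1]
        second_diff = [
            abs(nxt - 2 * cur + prv)
            for prv, cur, nxt in zip(prev_runs, runs, next_runs)
        ]
        result[k] = mean(second_diff) / stdev(runs)
    return result


if __name__ == "__main__":
    # Hypothetical replicate ln L values for K = 1..4.
    runs = {
        1: [-1101.0, -1100.0, -1099.0],
        2: [-1041.0, -1040.0, -1039.0],
        3: [-1036.0, -1035.0, -1034.0],
        4: [-1033.0, -1032.0, -1031.0],
    }
    dk = delta_k(runs)
    print(dk, "best K:", max(dk, key=dk.get))
```

ΔK is undefined at the smallest and largest K tested, so it cannot by itself indicate K = 1.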

Page 47: Population structure - Foundations to software

Post-processing of STRUCTURE runs

• Production of Q-plots: DISTRUCT

• Solving the label-switching problem and averaging across replicated runs: CLUMPP

http://rosenberglab.bioinformatics.med.umich.edu/

Page 48: Population structure - Foundations to software

What are those strange plots?

• Plots of Q-values are nothing more than stacked barplots

• These can be grouped into geographical regions or can be plotted onto a map if you have spatial coordinates using R scripts (available from the TESS website)

Page 49: Population structure - Foundations to software

Other clustering methods

• TESS (prior on spatial coordinates using Dirichlet tessellations; Francois et al., 2006; Genetics, 174:805-816) available from: http://www-timc.imag.fr/Olivier.Francois/tess.html

• EIGENSOFT (PCA, Patterson et al., 2006; PLoS Genetics 2:e190).
– Good for SNP data

Page 50: Population structure - Foundations to software

What does it all mean? A point for discussion

• What is the “evolutionary” significance of population structure in forest trees?

• How would something like Sewall Wright’s shifting balance theory of evolution work in forest trees?
