Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely...

28
Modeling Continuous Admixture . CC-BY-NC 4.0 International license is made available under a The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It . https://doi.org/10.1101/033944 doi: bioRxiv preprint

Transcript of Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely...

Page 1: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

1

Modeling Continuous Admixture 1

Keywords: Admixture-induced linkage disequilibrium; Continuous admixture; 2

Admixture model; Admixture inference; SNP 3

4

Ying Zhou†, §, Hongxiang Qiu†,‡, §, Shuhua Xu†,††,‡‡,* 5

6

† Chinese Academy of Sciences (CAS) Key Laboratory of Computational Biology, Max 7

Planck Independent Research Group on Population Genomics, CAS-MPG Partner 8

Institute for Computational Biology, Shanghai Institutes for Biological Sciences, 9

Chinese Academy of Sciences, Shanghai, 200031,China; 10

‡ Department of Mathematics, The Chinese University of Hong Kong, Shatin, Hong 11

Kong, China; 12

†† School of Life Science and Technology, ShanghaiTech University, Shanghai 13

200031, China; 14

‡‡ Collaborative Innovation Center of Genetics and Development, Shanghai 15

200438, China. 16

§ These authors contributed equally to this work. 17

* Correspondence and requests for materials should be addressed to 18

[email protected] (S.X.) 19

20

21

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 2: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

2

Abstract 1

Human migration and human isolation serve as the driving forces of modern 2

human civilization. Recent migrations of long isolated populations have resulted 3

in genetically admixed populations. The history of population admixture is 4

generally complex; however, understanding the admixture process is critical to 5

both evolutionary and medical studies. Here, we utilized admixture induced 6

linkage disequilibrium (LD) to infer occurrence of continuous admixture events, 7

which is common for most existing admixed populations. Unlike previous 8

studies, we expanded the typical continuous admixture model to a more general 9

admixture scenario with isolation after a certain duration of continuous gene 10

flow. Based on the extended models, we developed a method based on weighted 11

LD to infer the admixture history considering continuous and complex 12

demographic process of gene flow between populations. We evaluated the 13

performance of the method by computer simulation and applied our method to 14

real data analysis of a few well-known admixed populations. 15

Introduction 16

Human migrations involve gene flow among previously isolated populations, 17

resulting in the generations of admixed populations. In both evolutionary and medical 18

studies of admixed populations, it is essential to understand admixture history and 19

accurately estimate the time since population admixture because genetic architecture 20

at both population and individual levels are determined by admixture history, 21

especially the admixture time. However, the estimation of admixture time is largely 22

dependent on the precision of the applied admixture models. Several methods have 23

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 3: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

3

been developed to estimate admixture time based on the Hybrid Isolation (HI) model 1

(Xu and Jin 2008; Price et al. 2009; Loh et al. 2013; Qin et al. 2015) or intermixture 2

admixture model (IA) (Zhu et al. 2004), which assumes that the admixed population 3

is formed by one wave of admixture at a certain time. However, the one-wave 4

assumption often leads to under-estimation when the progress of the true admixture 5

cannot be well modeled by the HI model. Jin et al. showed earlier that under the 6

assumption of HI, the estimated time is half of the true time when the true model is a 7

gradual admixture (GA) model (Jin et al. 2013). 8

Admixture models can be theoretically distinguished by comparing the length 9

distribution of continuous ancestral tracts (CAT) (Gravel 2012; Jin et al. 2012; Ni et 10

al. 2015), which refer to continuous haplotype tracts that were deviated from the same 11

ancestral population. CAT inherently represents admixture history as it accumulates 12

recombination events. Short CAT always indicates long admixture histories of the 13

same admixture proportion, whereas long CAT may indicate a recent gene flow from 14

the ancestral populations to which the CAT belongs. Based on the information it 15

provides, CAT can be used to distinguish different admixture models and estimate 16

corresponding admixture time. However, accurately estimating the length of CAT is 17

often very difficult. 18

Weighted linkage disequilibrium (LD) is an alternative tool that can be used to 19

infer admixture (Loh et al. 2013; Pickrell et al. 2014). Previous studies have indicated 20

that this tool is more efficient than CAT because it requires neither ancestry 21

information inference nor haplotype phasing, which often provides false 22

recombination information, thus decreasing the power of estimation. Weighted LD 23

has already been used in inferring multiple-wave admixtures (Pickrell et al. 2014; 24

Zhou et al. 2015) However, these methods tend to summarize the admixture into 25

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 4: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

4

different independent waves, even if the true admixture is continuous. In our previous 1

work (Zhou et al. 2015), we mathematically described weighted LD under different 2

continuous models, allowing us to determine admixture history using these models. 3

In the present study, we first developed a weighted LD-based method to infer 4

admixture with HI, GA, and continuous gene flow (CGF) models (Pfaff et al. 2001), 5

(Fig 1). Both GA and CGF models assume that gene flow is a continuous process. 6

Next, we extended the GA and CGF models to the GA-I and CGF-I models, 7

respectively (Fig 1), which model a scenario with a continuous gene flow duration 8

followed by a period of isolation to present. We applied our method to a number of 9

well-known admixed populations and provided information that would help better 10

understanding the admixture history of these populations. 11

Material and Methods 12

Datasets 13

Data for simulation and empirical analysis were obtained from three public 14

resources: Human Genome Diversity Panel (HGDP) (Li et al. 2008), the 15

International HapMap Project phase III (The International HapMap Consortium 16

2007) and the 1000 Genomes Project (1KG) (The 1000 Genomes Project 17

Consortium 2012). Source populations for simulations are the haplotypes from 18

113 Utah residents with Northern and Western European ancestries from the 19

CEPH collection (CEU) and the 113 Africans from Yoruba (YRI). 20

Inferring Admixture Histories by using the HI, GA, and CGF Models 21

The expectation of weighted LD under a two-way admixture model has been 22

described in detail in another work (Zhou et al. 2015). Following the previous 23

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 5: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

5

notation, the expectation of weighted LD statistic between two sites separated by 1

a distance � (in Morgan) is as follows: 2

�������� � ∑ �������������

��� � ���� ∑ ������� exp�����,, EQ 1

where ���� �∑ �12���12����

����

|����|;�����, � � 0,1,2 are the weighted LD statistic of 3

the admixed population ( � � 0 ) and the source population �, � � � 1,2� , 4

respectively; �� is the admixture proportion from the source population �; and 5

������ is the allele frequency difference between populations 1 and 2 at site 6

�;���� is the set holding pairs of SNPs of distance �; ��� is admixture indicator 7

for the admixture event of � generations ago, and � is supposed to be the number 8

of generations ago when the source populations first met. To eliminate the 9

confounding effect due to background LD from the source populations, we used 10

the quantity, ����, defined as follows, to represent the admixture induced LD 11

(ALD) (Zhou et al. 2015). 12

���� � ����� � ∑ ��������������� � � ���exp ������

��

We presented it in a more compact form using the inner product of two vectors 13

as follows: 14

���� � ������ ; 15

where 16

� ����� , . . . , ������ , ������; 17

and 18

����� � �exp���� , … , exp���� � 1��� , exp�������. 19

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 6: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

6

For different admixture models where admixture began � generations 1

ago, ���� varies in terms of the vector of coefficients of exponential functions 2

(Zhou et al. 2015): 3

HI HI � �0, … ,0, ������

GA GA � ���� #�� � 1��

� , �� � 1��

�� , … , �� � 1����

���� , �� � 1����

���� $�

CGF1 CGF1 � �1 � ���/����%��

�����/� , �������/�, … , 1&�

CGF2 CGF2 � �1 � ���/����%��

�����/� , �������/� , … , 1&�

where the vector model has length � using the HI, GA, CGF1, or CGF2 model; and 4

� represents when the admixture occurred (HI) or began (GA and CGF) in terms 5

of generations. For different models, the coefficient vectors have different 6

patterns (Fig 2), which can be used to infer the best-fit model for a certain 7

admixed population. 8

In the CGF model, CGF1 represents the admixture where source 9

population 1 is the recipient of the gene flow from population 2, whereas CGF2 10

indicates source population 2 as gene flow recipient from population 1. Inference 11

of the admixture time assuming the true admixture history is one of these 12

different models that can be regarded as minimizing the objective function as 13

follows: 14

ssE�)�, )� , model� � ||)� · , � )�- model � . ||�� . EQ 2

The optimization problem is therefore expressed as follows: 15

min �, � and %model

ssE�)�, )� , model�, EQ 3

where . � %�����, �����, … , ���&�&� is the observed ALD calculated from the 16

single nucleotide polymorphism (SNP) data of both the parental populations and 17

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 7: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

7

the admixed population; )� is a real number used to correct the population 1

substructure; )� is a scalar that improves estimation robustness; , 2 3& is a 2

vector with each entry being 1; A is an 4 5 6 matrix with the �th row vector 3

defined as �������, i.e., - � %������, ������, … , ����&�&�, and 6 9 � is a pre-4

specified upper bound of �. Our definitions are consistent since we can let all 5

entries be 0 after the �-th entry in model. 6

Next, we tried to estimate the parameters )�, )�, and model , where model 7

has the information of the admixture model and the related admixture time � (in 8

generations). In our analysis, the value of � is assumed to be a positive integer; 9

therefore, our method is to go through all possible � values (with a reasonable 10

upper limit 6) to estimate � with the minimum value of the objective function. 11

Given �, we used linear regression to estimate �)�, )�� such that the objective 12

function was minimized. Using this approach, the value of � in relation to the 13

minimal objective function value for each model was determined, which 14

represents the time of admixture occurrence under each model. 15

Admixture Inference under HI, GA-I, and CGF-I Models 16

GA and CGF models assume that the admixture is strictly continuous from 17

the beginning of admixture to present. This assumption seems too strong to be 18

valid in empirical studies. Here, we extended the GA model and CGF model to GA-19

I model and CGF-I model, respectively, by considering continuous admixture 20

followed by isolation. In this case, the admixture event lasts from <'()*( 21

generations ago to <+,- generations ago. Similar to the previous case, the 22

coefficients of exponential functions can be represented as the vector of length 23

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 8: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

8

<'()*( for each model, whose first <+,- � 1 entries are filled with zeros. Suppose 1

the admixture lasted for � generations, then 2

GA-I GA-I � ���� =0, … ,0, ������

�, ������

��, … , ��������

����, ��������

����>�

CGF1-I CGF1‐I � �1 � ���/����%0, … ,0, ��

�����/�, �������/� , … , 1&�

CGF2-I CGF2‐I � �1 � ���/����%0, … ,0, ��

�����/� , �������/� , … , 1&�

In this case, we can also try to find the parameters to minimize the 3

objective function (EQ 2) under new models. By examining all possible pairs of 4

(<end, <start), it is possible to determine the global minimum of the objective 5

function, although this might not be computationally efficient. Here, we used a 6

faster algorithm (Algorithm 1) to determine the starting and ending time points 7

of admixture. 8

Let � and � be the ending and starting time points (in generations, prior 9

to the present) of the admixture, which we want to search for to minimize the 10

objective function. The search starts from ���, ��� � �1, 6�, where 6 is the upper 11

bound for the beginning of the admixture event, which can be set to be a large 12

integer to seek for a relatively ancient admixture event. In our analysis of recent 13

admixed populations, we set 6 � 500. For @ � 1,2, … , ��2 , �2� is updated from 14

��2��, �2��� by two alternative proposals. For convenience, we define 15

A��2 , �2� B� min �, � ssE�)�, )� , �2 , �2�, EQ 4

where )�, )� can be determined by linear regression. 16

We choose the proposal that results in a smaller value for A. The search 17

stops when the value of A with ��2��, �2��� is no larger than that of either 18

proposal or �2 � �2 . In this way, we can readily estimate the time interval of the 19

admixture event (<end, <start) quickly. 20

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 9: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

9

Algorithm 1:

CDE F GH ,, I, …

%J34, K3

4& L %J4�3 � ,, K4�3&

%J54, K5

4& L %J4�3, K4�3 � ,&

%J4, K4& B� MNOPQR�6,7�89:6�

�,7��;,:6�

�,7��;,:6���,7���;<

C�J, K�

GC %J4, K4& � %J4�3 , K4�3& DE J4 � K4

�S=>? , Sstart� B� %J4, K4&

TUDV

Result evaluation 1

To check our assumption of the true history and evaluate the inference, an 2

intuitive way is to compare empirical weighted LD with the fitted LD. Here, we 3

use two quantities: msE and Quasi F, defined by the following: 4

1) Let W � )� · 1 � )�- model � . . We look at msE � ∑ W��&

� 4 � 1X with W� 5

being the �th entry of W. This reflects goodness of fit and strength of 6

background noise. A smaller msE indicates less background noise and 7

better fit. 8

2) Let W@ � .Y � . , where .Y is the fitted weighted LD obtained from 9

MALDmef, which theoretically can be regarded as the de-noised weighted 10

LD. W@ is a vector of length 4, with the �th entry denoted by W�@. We look at 11

the quasi-F statistic � � ∑ B���

∑ :B��;��

. A small � indicates that the current fit 12

does not significantly deviate from the previous fit. 13

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 10: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

10

A reliable result should have both small msE and small � values. Particularly, � is 1

involved in model comparison: when � is too large, one would suspect that the 2

true admixture history is far from any one of these models. Both � and msE are 3

involved in revealing data quality. If � is small but msE is large, one would 4

suspect that the quality of data is not good enough to draw convincing 5

conclusions. Further explanation of these statistics is in Results and Discussion 6

sessions. 7

Identification of the best-fit model 8

For the convenience of illustration, we define the core model as the 9

model used to infer admixture time. When inferring admixture of a target 10

population, HI, GA, CGF1, CGF2, GA-I, CGF1-I and CGF2-I are used as the core 11

models for conducting inference. Because GA-I, CGF1-I and CGF2-I describe more 12

general admixture models than GA, CGF1, and CGF2, we classified model 13

selection into two cases: one case is to identify the best-fit model(s) among the 14

HI, GA, CGF1, and CGF2 models, whereas the more general case is to determine 15

the best-fit model(s) among HI, GA-I, CGF1-I and CGF2-I models. In both cases, 16

the same strategy is adopted, which depends on the pairwise paired difference of 17

pseudo log(msE) values associated with each core model, which will be defined 18

later. For an admixed population, there are Z � 1 observed weighted LD curves 19

obtained as follows: Z (typically 22) autosomal chromosomes are considered in 20

an individual genome, and one weighted LD curve is calculated from all these Z 21

chromosomes while the other Z weighted LD curves are obtained by jackknife 22

resampling, leaving out one chromosome for each LD curve (Loh et al. 2013; 23

Pickrell et al. 2014; Zhou et al. 2015). Next, we fit each observed weighted LD 24

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 11: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

11

curve for each core model by estimating )�, )� and the time interval, which in 1

turn allowed us to obtain the msE value associated with the optimal parameters 2

for each weighted LD curve. Taken together, a total of Z � 1 msE values 3

associated with Z � 1 LD curves were evaluated in each core model. For model 4

[, the log(msE) obtained from all Z chromosomes is denoted by \�C and that 5

from the LD curve with the ]-th chromosome left out by \DC . Following Tukey 6

(Tukey 1958), we defined the ] -th pseudo log(msE) for model [ to be 7

\D̂C � Z\�

C � �Z � 1�\DC and treated these pseudo values approximately as 8

independent. Next, we defined the best-fit core model(s) to be the model(s) with 9

significantly small \D̂C . A pairwise Wilcoxon signed-rank test was conducted for 10

the pseudo log(msE) of the four models. More precisely, Wilcoxon signed-rank 11

test is applied to all pairs of models with the \D̂C being paired by index ], and 12

then the p-values are adjusted to control familywise error rate (Table 1). We 13

used the Holm-Bonfferroni method to adjust p-values (Holm 1979). When 14

\D̂HIwere not significantly larger than those of the best model, i.e., the model 15

associated with the smallest sample median of pseudo log(msE) values, HI was 16

selected. Otherwise the models whose \D̂C were not significantly larger than 17

those of the best model were selected (the best model was selected as well). The 18

significance level was set to be 0.05. Here, we paired the pseudo values 19

according to index ] and used Wilcoxon signed-rank test on the paired 20

differences because according to our experience, \D̂C are strongly correlated with 21

] and hence ] is a major covariate that must be controlled in the test to gain 22

higher power. This is also why even though theoretically there are examples 23

where the best model according to our definition can be significantly worse than 24

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 12: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

12

another model in our process, we believe such extreme cases are unlikely in 1

practice and still use this method. In addition, log(msE) rather than msE were 2

used because after logarithm transformation small values of msE can also have 3

large effect to the comparison. That is, we could better detect the difference 4

between small msE, thus gaining greater power in the test. This claim is also 5

justified by our experience. 6

Software 7

Our algorithm has been implemented in an R package (R Core Team 2014), 8

named CAMer (Continuous Admixture Modeler). The package is available on the 9

website of population genetic group: http://www.picb.ac.cn/PGG/resource.php 10

or on Github: https://www.github.com/david940408/CAMer. 11

Results 12

Simulation studies 13

Admixed populations were simulated in a forward-time way under 14

different admixture models with the software AdmixSim (Yang 2015). 15

Simulation was initiated with the haplotypes from source populations (YRI and 16

CEU) and haplotypes for the admixed population were generated by resampling 17

haplotypes with recombination from source populations and the admixed 18

population of last generation. During the simulation, population size was kept as 19

5000 and migration rates was controlled by the admixture model with the final 20

admixture proportion in the admixed population to be 0.3. We employed a mono 21

recombination map in our simulation, which means recombination rate between 22

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 13: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

13

two markers is positively proportional to their physical distance. For each model, 1

simulation was performed using 10 replicates; each replicate contained 10 2

chromosomes with a total length of 3 Morgans. To evaluate the performance of 3

our algorithm, we simulated admixed populations under the following 4

conditions: 5

1) HI of 50 and 100 generations, designated as HI (50) and HI (100), 6

2) GA of 50 and 100 generations, designated as GA (1-50) and GA (1-7

100), respectively, 8

3) CGF of 50 and 100 generation, population 1 as the recipient, 9

designated as CGF1 (1-50) and CGF1 (1-100) respectively, 10

4) CGF-I of a 70-generation admixture followed by 30-generation 11

isolation, and a 30-generation admixture followed by a 70-generation 12

isolation, with population 1 as the recipient, designated as CGF1-I (30-13

100) and CGF1-I (70-100) respectively, and, 14

5) GA-I of a 70-generation admixture followed by a 30-generation 15

isolation and a 30-generation admixture followed by a 70-generation 16

isolation, designated as GA-I (30-100) and GA-I (70-100), respectively. 17

18

With simulated admixed populations, we first used the HI, GA and CGF 19

models as core models to conduct inference (Fig S1). When the simulated model 20

was a HI, GA, or CGF model, our method was able to accurately estimate the 21

simulated admixture time, as well as to determine the correct model, with an 22

accuracy of 73.33%. When the simulated model was a CGF-I or GA-I model, the 23

estimated time based on the core model HI was within the time interval of the 24

admixture, whereas all best-fit models were HI (Table 2). This result has 25

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 14: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

14

indicated the limitation of using the GA and CGF models in inferring admixture 1

history. 2

Using the same simulated admixed populations, we then employed GA-I, 3

CGF-I and HI as core models for performing inference (Figs 3 and S2-S11). With 4

HI, GA, or CGF considered as the true model, our estimation of the optimal model 5

remained highly accurate. On the other hand, when the true model was GA-I or 6

CGF-I, the failure rate decreased by 25%, compared to the estimation in the 7

previous setting. Furthermore, the estimated time intervals were wider than 8

those of the true ones, although the findings were still more accurate than those 9

using GA and CGF as core models (Table 2). 10

Empirical analysis 11

We applied CAMer to the selected admixed populations from HapMap, 12

HGDP, and 1KG. For each target population, we first used MALDmef to calculate 13

the weighted LD and fit the weighted LD with hundreds of exponential functions 14

(Zhou et al. 2015). Next, with the weighted LD of target populations, we 15

determined the admixture model and estimated admixture time with CAMer. 16

Quasi F and msE are designed for evaluating the inference with CAMer. The value 17

of msE usually indicates data quality: small msE may indicate a high signal-to-18

noise ratio (SNR) and vice versa. The quasi F value measures the goodness of fit 19

of the model we employed to fit the admixture event. A small � value indicates 20

that the model we used was of satisfactory performance in modeling an 21

admixture event. In our analysis, we used 10�E as the threshold for msE and 1.5 22

as the threshold for �. Therefore, when the msE value _ 10�E and the � value 23

_ 1.5, we could not “reject the null hypothesis” that the related model was the 24

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 15: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

15

true model, i.e., the model well fit the admixture event. On the other hand, an 1

msE value 9 10�E indicates low- quality data that is incapable of identifying the 2

best-fit model, whereas an � value 9 1.5 prompts us to “reject the null 3

hypothesis” and concludes that the model did not well fit the admixture. In the 4

case of the same population from different databases, the data with smaller msE 5

values were given more credits. For example, we obtained samples of ASW from 6

the HapMap and the 1KG. With the ASW data (CEU and YRI as source 7

populations) from HapMap, the best-fit model was HI of 6 generations, and both 8

msE and F values indicated that the inference was acceptable (Fig S12). 9

Similarly, using the ASW data (CEU and YRI as source populations) from 1KG, the 10

best-fit model was HI of 6 generations (Fig S13). However, a quasi F value of 2.54 11

indicated that HI model did not satisfactorily fit the admixture event. Because the 12

msE value of the data set from 1KG was smaller, the conclusion using ASW was 13

as follows: based on the best data we had, the time intervals estimated under the 14

HI, GA-I, CGF1-I, and CGF2-I model were 6 generations, 1–9 generations, 1–13 15

generations, and 1–9 generations, respectively. Furthermore, none of these 16

models satisfactorily modeled the admixture, whereas the HI model showed 17

better performance. We also applied CAMer to other admixed populations (Table 18

3, Figs S14–17). MEX (source poulations: CEU (64 individuals) and American 19

Indian (7 Colombians, 14 Karitiana, 21 Maya, 14 Pimas and 8 Suruis)) was 20

satisfactorily modeled by the CGF1 model or GA-I model, with the estimated 21

admixture time interval being 1–17 or 2–16 generations, respectively. We also 22

analyzed Eurasian populations, which showed that the Uygurs (source 23

populations: Han (n = 34) and French (n = 28)) most likely fit a continuous 24

model, with a gene flow lasting for more than 60 generations to the present or 25

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 16: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

16

near present. We cannot determine which model fits best. However, the values of 1

msE were all larger than 10�E, indicating that the results were not so reliable. 2

The Hazara population (source populations: Han (n = 34) and French (n = 28)) 3

experienced a GA-I-like admixture event that lasted for about 58 generations, 4

which started 63 generations ago and ended approximately 5 generations ago. 5

Discussion 6

Modeling the demographic history of an admixed population and estimating time 7

points of this particular event are essential components of evolutionary and 8

medical research studies (Zhu et al. 2004; Zhu and Cooper 2007; Gravel 2012; Jin 9

et al. 2012, 2013; Ni et al. 2015; Zhou et al. 2015). Previous methods have 10

employed the length distribution of ancestral tracts (Gravel 2012; Jin et al. 2012, 11

2013), which highly depends on the result of local ancestral inference and 12

haplotype phasing. Another limitation of earlier methods is that only HI, GA, and 13

CGF models were utilized to fit the admixture as well as in identifying the best-fit 14

model. In the present study, our simulations showed that when the true model 15

was not HI, GA, or CGF, the generated inferences were relatively difficult to 16

interpret. 17

Our method, CAMer, can be utilized in inferring admixture histories by 18

using weighted LD, which can be calculated using genotype data with MALDmef 19

(Zhou et al. 2015). Furthermore, we extended the GA and CGF models to the GA-I 20

and CGF-I models in order to infer the time interval for a period of continuous 21

admixture events followed by isolation. Although HI model is a degenerate case 22

for both GA-I and CGF-I models, where the admixture window becomes 1 23

generation, we kept it in our method because it is the most popular model 24

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 17: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

17

employed in previous admixture studies. Considering the difficulty in the fitting 1

problem with exponential functions, it is in our expectation that CAMer was not 2

consistently very accurate in determining the admixture model based on the 3

weighted LD decay. However, its natural advantage of independence of both 4

haplotype phasing and local ancestry inference makes it privilege to other CAT 5

based method. And our simulations indicated that its time interval estimations 6

were reliable when its assumption that the true admixture history could be well 7

approximated by one of the core models is valid. 8

Two quantities, namely msE and quasi F, were used to check the 9

assumption of our method stated above and evaluate the credibility of the 10

models’ inference. These two quantities should both be taken into consideration 11

to identify whether the models well describe the admixture history. Both the 12

data quality and the goodness of fitting of models can affect the value of msE, 13

although the � value mainly measures the goodness of modeling. Informally, for 14

the convenience of interpretation, msE is considered to reveal the data quality 15

and � value is considered to check model assumption on admixture history. In 16

our analysis, we suggested thresholds for msE and � to determine whether the 17

null hypothesis should be rejected or not, which may be too strict in empirical 18

analysis. Actually, msE and � values together measure whether the observed 19

weighted LD can be well fit by the best-fit model(s). For example, the fitting 20

process showed poor performance in the MKK population, which was 21

accompanied by exaggerated msE and � values, showing significant 22

inconsistencies between the observed and fitted weight LD curves, which 23

indicates that the true admixture history cannot be well explained by any of the 24

core models (Fig S17). Therefore, in empirical analysis, one can informally think 25

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 18: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

18

that the msE value reflects the quality of the data, whereas � value describes the 1

performance of the model, although both of them measure the goodness of 2

fitting. 3

In our previous study (Zhou et al. 2015), we fit the weighted LD with 4

hundreds of exponential functions. However, this approach did not fully reveal 5

the occurrence of continuous admixture. To address this issue, the present study 6

developed CAMer to model admixture as a continuous process. CAMer also 7

employed extensions of the classic continuous models, GA-I and CGF-I, which 8

may bring the bias to have a wider admixture window when the real admixture 9

exists in a short time. But it is still proved to be able to give more credible 10

estimations in modeling population admixture. 11

Taken together, CAMer is a powerful method to model a continuous 12

population admixture, which in turn would help us elucidate the complex 13

demographic history of population admixture. 14

15

Author contributions 16

Conceived and designed the study: SX. Developed methods and computer tools: YZ 17

HQ. Analyzed the data: YZ and HQ. Interpreted the data and wrote the paper: SX YZ 18

HQ. 19

Funding: These studies were supported by the Strategic Priority Research 20

Program of the Chinese Academy of Sciences (CAS) (XDB13040100), by the 21

National Science Fund for Distinguished Young Scholars (31525014), by the 22

National Natural Science Foundation of China (NSFC) grants (91331204, 23

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 19: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

19

31171218, 31501011), and by Science and Technology Commission of Shanghai 1

Municipality (14YF1406800). S.X. is Max-Planck Independent Research Group 2

Leader and member of CAS Youth Innovation Promotion Association. S.X. also 3

gratefully acknowledges the support of the National Program for Top-notch 4

Young Innovative Talents of The "Wanren Jihua" Project. The funders had no role 5

in study design, data collection and analysis, decision to publish, or preparation 6

of the manuscript. 7

Competing interests: The authors have declared that no competing interests 8

exist. 9

Acknowledgements: None. 10

11

12

13

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 20: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

20

Fig Legends 1

Fig 1: Classic admixture models (HI, GA and CGF) and the models we extended 2

(GA-I and CGF-I). For each model, the simulated admixed population (Hybrid) is 3

in the middle of two source populations (POP1 and POP2). Each horizontal 4

arrow represents the direction of gene flow from the source populations to the 5

admixed population. Once the genetic components flow into the admixed 6

population, the admixed population randomly hybridizes with other existing 7

components. The existence of horizontal arrows indicates gene flow from the 8

corresponding source population. 9

Fig 2: Coefficient vector of exponential functions for each model. For each 10

admixture model, the starting time of the population admixture is 50 generations 11

ago. 12

Fig 3: Evaluation of CAMer under various simulated admixture models. Here, the 13

core models are HI, GA-I, CGF1-I, and CGF2-I. The simulated models (True 14

Model) are listed on the left, with the admixture time interval depicted in the 15

parentheses. The gray area on the middle vertical panel is the simulated time 16

interval, whereas colored lines indicate the estimated time intervals under 17

different core models. HI: pink; CGF1-I: green; CGF2-I: purple; GA-I: blue. The 18

intensity of lines means the number each point is covered by the time intervals 19

estimated from all jackknives. Lighter colors represent fewer covers while 20

darker colors mean more. 21

22

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 21: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

21

Table 1: Adjusted p-values of pairwise Wilcoxon signed-rank test among core 1

models: HI, GA-I, CGF1-I, CGF2-I. 2

True Model

Best Model(s)

Adjusted p-Values of Pairwise Wilcoxon Signed Rank Test

HI: GA-I

HI: CGF1-I

HI: CGF2-I

GA-I: CGF1-I

GA-I: CGF2-I

CGF2-I: CGF1-I

HI (100) HI 0.97 0.14 0.97 0.012 0.15 0.97

HI (50) HI 0.98 0.98 0.52 0.70 0.16 0.16

CGF1 (1-100) CGF1-I 0.012 0.012 0.012 0.012 0.012 0.012

CGF1 (1-50)

CGF1-I, CGF2-I 0.012 0.012 0.012 0.041 0.064 0.055

GA (1-100) GA-I 0.012 0.012 0.012 0.012 0.012 0.012

GA (1-50) GA-I 0.012 0.012 0.012 0.012 0.020 0.012

CGF1-I (30-100) HI 0.19 0.55 0.15 0.012 0.55 0.012

CGF1-I (70-100) HI 0.97 0.97 0.97 0.020 0.012 0.20

GA-I (30-100)

CGF1-I, GA-I 0.012 0.012 0.012 0.30 0.012 0.020

GA-I (70-100) HI 0.52 0.52 0.52 0.029 0.012 1

In each column, the adjusted p-values of the Wilcoxon signed-rank test 3

comparing the two models are presented for all simulation cases. Simulated true 4

model is followed by the parenthesis of time interval for the corresponding gene 5

flow, where the first term in the parenthesis is the ending time of the admixture 6

and the second term is the beginning time of the admixture. They are in the 7

measurements of generation before present. For HI model, only one time point is 8

included in the parenthesis. 9

10

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 22: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

22

Table 2: Accuracy of model detection 1

True

models

Core

models

Counts Rates

Correct Undeter-

mined Wrong Correct

Undeter-

mined Wrong

HI;GA;

CGF

HI;GA;C

GF 44 15 1 73.3% 25.0% 1.7%

GA-I;

CGF-I

HI;GA;C

GF 0 0 40 0.0% 0.0% 100.0%

HI;GA;

CGF

HI;GA-

I ; CGF-I 37 22 1 61.7% 36.7% 1.6%

GA-I;

CGF-I

HI;GA-

I ; CGF-I 1 9 30 2.5% 22.5% 75.0%

Here, as our method can hardly distinguish CGF1 from CGF2 model, we regard 2

CGF1, CGF2 as the CGF model; CGF1-I and CGF2-I as the CGF-I model, which is 3

different from GA-I and HI models. 4

5

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 23: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

23

Table 3: Results of CAMer on empirical populations 1

2

Population Core

model

End

time

Start

time msE Quasi.F

ASW-HapMap (57)

HI* 6 6 2.72 � 10�� 1.19

CGF1-I 2 10 3.02 � 10�� 1.41

CGF2-I 1 9 2.98 � 10�� 1.40

GA-I 3 8 2.97 � 10�� 1.19

ASW-1KG (56)

HI* 6 6 2.19 � 10�� 2.54

CGF1-I 1 11 1.88 � 10�� 2.19

CGF2-I 1 9 1.84 � 10�� 2.12

GA-I 2 9 1.86 � 10�� 2.13

MEX (86)

HI 9 9 6.73 � 10�� 2.19

CGF1-I* 1 17 3.57 � 10�� 1.13

CGF2-I 1 18 3.57 � 10�� 1.15

GA-I* 2 16 3.60 � 10�� 1.15

MKK (143)

HI* 6 6 2.36 � 10�� 11.68

CGF1-I 1 16 2.04 � 10�� 10.24

CGF2-I 1 11 2.15 � 10�� 10.82

GA-I 1 17 1.97 � 10�� 9.83

UIG (10)

HI 26 26 4.73 � 10�� 1.29

CGF1-I* 1 66 4.01 � 10�� 1.08

CGF2-I* 1 64 4.01 � 10�� 1.08

GA-I* 3 63 4.03 � 10�� 1.09

Hazara (24)

HI 27 27 1.26 � 10�� 1.95

CGF1-I 3 69 8.78 � 10�� 1.35

CGF2-I 3 65 8.87 � 10�� 1.36

GA-I* 5 63 8.53 � 10�� 1.30

Number of individuals listed in the parentheses. Values underlined do not pass 3

our threshold. The time interval is summarized from 22 jackknives, which is 4

shared by more than half of all estimated intervals for continuous models or the 5

nearest integer to the mean of estimated time point for HI model. The best-fit 6

model is marked by an asterisk “*”. For the HI model, the beginning time is the 7

same as the ending time. 8

9

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 24: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

24

Reference 1

Gravel S., 2012 Population genetics models of local ancestry. Genetics 191: 607–2

619. 3

Holm S., 1979 A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. 4

Stat. 6: 65–70. 5

Jin W., Wang S., Wang H., Jin L., Xu S., 2012 Exploring population admixture 6

dynamics via empirical and simulated genome-wide distribution of 7

ancestral chromosomal segments. Am. J. Hum. Genet. 91: 849–862. 8

Jin W., Li R., Zhou Y., Xu S., 2013 Distribution of ancestral chromosomal segments 9

in admixed genomes and its implications for inferring population history 10

and admixture mapping. Eur. J. Hum. Genet. 22: 930–937. 11

Li J. Z., Absher D. M., Tang H., Southwick A. M., Casto A. M., Ramachandran S., 12

Cann H. M., Barsh G. S., Feldman M., Cavalli-Sforza L. L., Myers R. M., 2008 13

Worldwide human relationships inferred from genome-wide patterns of 14

variation. Science 319: 1100–1104. 15

Loh P. R., Lipson M., Patterson N., Moorjani P., Pickrell J. K., Reich D., Berger B., 16

2013 Inferring admixture histories of human populations using linkage 17

disequilibrium. Genetics 193: 1233–1254. 18

Ni X., Yang X., Guo W., Yuan K., Zhou Y., Ma Z., Xu S., 2015 Length distribution of 19

ancestral tracts under a general admixture model and its applications in 20

admixture history indeference. Under Rev. 21

Pfaff C. L., Parra E. J., Bonilla C., Hiester K., McKeigue P. M., Kamboh M. I., 22

Hutchinson R. G., Ferrell R. E., Boerwinkle E., Shriver M. D., 2001 Population 23

structure in admixed populations: effect of admixture dynamics on the 24

pattern of linkage disequilibrium. Am. J. Hum. Genet. 68: 198–207. 25

Pickrell J. K., Patterson N., Loh P.-R., Lipson M., Berger B., Stoneking M., 26

Pakendorf B., Reich D., 2014 Ancient west Eurasian ancestry in southern 27

and eastern Africa. Proc. Natl. Acad. Sci. U. S. A. 111: 2632–7. 28

Price A. L., Tandon A., Patterson N., Barnes K. C., Rafaels N., Ruczinski I., Beaty T. 29

H., Mathias R., Reich D., Myers S., 2009 Sensitive detection of chromosomal 30

segments of distinct ancestry in admixed populations. PLoS Genet. 5. 31

Qin P., Zhou Y., Lou H., Lu D., Yang X., Wang Y., Jin L., Chung Y.-J., Xu S., 2015 32

Quantitating and Dating Recent Gene Flow between European and East 33

Asian Populations. Sci. Rep. 5: 9500. 34

R Core Team, 2014 R: A Language and Environment for Statistical Computing. 0. 35

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 25: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

25

The 1000 Genomes Project Consortium, 2012 An integrated map of genetic 1

variation from 1,092 human genomes. Nature 135: 0–9. 2

The International HapMap Consortium, 2007 A second generation human 3

haplotype map of over 3.1 million SNPs. Nature 449: 851–861. 4

Tukey J. W., 1958 Bias and Confidence in Not-Quite Large Samples. Ann. Math. 5

Stat. 29: 614. 6

Xu S., Jin L., 2008 A Genome-wide Analysis of Admixture in Uyghurs and a High-7

Density Admixture Map for Disease-Gene Discovery. Am. J. Hum. Genet. 83: 8

322–336. 9

Yang X., 2015 AdmixSim-v1.0.2, http://www.picb.ac.cn/PGG/resource.php. 10

Zhou Y., Yuan K., Yu Y., Ni X., Xie P., Xing E. P., Xu S., 2015 Inference of multiple-11

wave population admixture by modeling decay of linkage disequilibrium 12

with multiple exponential functions. Under Rev. 13

Zhu X., Cooper R. S., Elston R. C., 2004 Linkage analysis of a complex disease 14

through use of admixed populations. Am. J. Hum. Genet. 74: 1136–1153. 15

Zhu X., Cooper R. S., 2007 Admixture mapping provides evidence of association 16

of the VNN1 gene with Hypertension. PLoS One 2. 17

18

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 26: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

HI

POP1 POP2Hybrid

Past

Pre

sent

GA

POP1 POP2Hybrid

CGF

POP1 POP2Hybrid

GA-I

POP1 POP2Hybrid

CGF-I

POP1 POP2Hybrid

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 27: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

A B

C D

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint

Page 28: Modeling Continuous Admixture€¦ · However, the estimation of admixture time is largely dependent on the precision of the applied admixture models. Several methods have It is made

True Model

HI (100)

HI (50)

CGF1 (1−100)

CGF1 (1−50)

GA (1−100)

GA (1−50)

CGF1−I (30−100)

CGF1−I (70−100)

GA−I (30−100)

GA−I (70−100)

0 50 100 150 200

Time Inter vals

Generation

Best Model(s)

HI

HI

CGF1−I

CGF1−I,CGF2−I

GA−I

GA−I

HI

HI

CGF1−I,GA−I

HI

.CC-BY-NC 4.0 International licenseis made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It. https://doi.org/10.1101/033944doi: bioRxiv preprint