
HIGHLIGHTED ARTICLE
GENETICS | INVESTIGATION

Reconstructing Past Admixture Processes from Local Genomic Ancestry Using Wavelet Transformation

Jean Sanderson,*,1 Herawati Sudoyo,† Tatiana M. Karafet,‡ Michael F. Hammer,‡,§ and Murray P. Cox*,2

*Statistics and Bioinformatics Group, Institute of Fundamental Sciences, Massey University, Palmerston North 4442, New Zealand, †Eijkman Institute for Molecular Biology, Jakarta, Indonesia, ‡Division of Biotechnology, Arizona Research Laboratories, and §Department of Anthropology, University of Arizona, Tucson, Arizona 85721

ABSTRACT Admixture between long-separated populations is a defining feature of the genomes of many species. The mosaic block structure of admixed genomes can provide information about past contact events, including the time and extent of admixture. Here, we describe an improved wavelet-based technique that better characterizes ancestry block structure from observed genomic patterns. Principal components analysis is first applied to genomic data to identify the primary population structure, followed by wavelet decomposition to develop a new characterization of local ancestry information along the chromosomes. For testing purposes, this method is applied to human genome-wide genotype data from Indonesia, as well as virtual genetic data generated using genome-scale sequential coalescent simulations under a wide range of admixture scenarios. Time of admixture is inferred using an approximate Bayesian computation framework, providing robust estimates of both admixture times and their associated levels of uncertainty. Crucially, we demonstrate that this revised wavelet approach, which we have released as the R package adwave, provides improved statistical power over existing wavelet-based techniques and can be used to address a broad range of admixture questions.

KEYWORDS wavelets; principal component analysis (PCA); admixture; local ancestry; dating

ADMIXTURE occurs when previously separated populations interact and merge. This process has been instrumental in human history, with most global groups showing at least some signals of population merger (Hellenthal et al. 2014). The admixture process produces “mosaic” genomes with alternating blocks of DNA from each ancestral population. Over time, recombination decreases the length of these ancestry blocks, and therefore the distribution of block sizes is informative about the time of admixture. However, the extent to which these patterns can provide additional information about historic admixture processes is still a young area of exploration.

A range of methods have been developed to partition the genome of an admixed individual into ancestry blocks based on raw genomic data (Falush et al. 2003; Price et al. 2009). Some methods assign ancestry directly. For instance, HAPMIX uses a hidden Markov model to estimate the break points of ancestry blocks, while other approaches define ancestry blocks using simple empirical criteria, such as strings of shared vs. nonshared polymorphisms (Pool and Nielsen 2009) or the differential presence of population-specific variants (Brown and Pasaniuc 2014). Another set of methods is more indirect. ROLLOFF (Moorjani et al. 2011), LAMP (Baran et al. 2012), and ALDER (Loh et al. 2013) all search for rapid changes in linkage disequilibrium to define the borders of ancestry blocks, while other approaches assign ancestry for predefined genomic windows using conditional random fields (Maples et al. 2013) or principal component analysis (PCA) (Gravel 2012).

These methods vary in their effectiveness. Simple empirical criteria perform surprisingly well for admixture between species (as for the mouse admixture zone studied by Pool and Nielsen 2009). Similarly, most of these methods tend to be highly accurate for recent admixture between well-separated human groups (such as African Americans or American Latinos). Indeed, in these settings, subtleties such as multiple waves of admixture have even been detected (Gravel 2012).

Copyright © 2015 by the Genetics Society of America
doi: 10.1534/genetics.115.176842
Manuscript received October 29, 2014; accepted for publication April 3, 2015; published Early Online April 7, 2015.
Available freely online through the author-supported open access option.
Supporting information is available online at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.176842/-/DC1.
1Present address: School for Health and Related Research, University of Sheffield, Sheffield, United Kingdom.
2Corresponding author: Institute of Fundamental Sciences, Massey University, Private Bag 11 222, Palmerston North 4442, New Zealand. E-mail: [email protected].

Genetics, Vol. 200, 469–481, June 2015


However, reconstructing complex demographic features for much older admixture events (i.e., thousands rather than hundreds of years in the past) remains extremely challenging (Moorjani et al. 2011). While methods have in principle been proposed to detect multiple waves of ancient admixture, in many realistic settings they are still restricted to single admixture events (Loh et al. 2013), although some evidence for multiple ancient admixture events has been presented for several Indian populations (Moorjani et al. 2011).

Other indirect methods look increasingly promising in this “old admixture” space. Approaches based on principal components analysis and wavelets have been employed with some success. PCA is a nonparametric data-reduction technique, which has been used widely to identify patterns of population structure in genetic data (Patterson et al. 2006; Novembre and Stephens 2008; McVean 2009; Bryc et al. 2010; Ma and Amos 2012). Dispersion of admixed individuals along the first principal component connecting ancestral populations can be used as a diagnostic for two-way admixture (Patterson et al. 2006; McVean 2009). For instance, PCAdmix employs PCA to assign ancestry to localized windows along the genome for each individual (Brisbin et al. 2012). Pugach et al. (2011) also use PCA, but do not directly assign ancestry to genomic regions, instead applying a wavelet transform to obtain an indirect measure of the average admixture block length. While this approach has been shown to be powerful for dating old admixture events, there remains considerable scope for (i) developing more sophisticated wavelet constructions, (ii) examining the resulting wavelet decompositions in greater detail (particularly to identify aspects of non-time-related information in the transformed data), and (iii) providing a more user-friendly software solution for wavelet analysis.

Wavelet techniques themselves are an active and evolving area, with much potential for novel application in population genetics, as highlighted in the review article by Liò (2003). Wavelets can be thought of as localized, oscillatory functions and are particularly useful for representing data that have local features such as sharp changes and discontinuities. In the context of genome-wide single nucleotide polymorphism (SNP) data, wavelets can be used to represent the mosaic pattern of ancestry blocks. A wavelet decomposition of the data provides information on the size of the ancestry blocks and, importantly, how they are distributed along the chromosomes. Summary measures of the wavelet decomposition allow aspects of the admixture process to be reconstructed, such as the time of admixture and admixture proportions.

Here, we present a substantially revised wavelet-based approach to describe population admixture that builds on the work of Pugach et al. (2011). This new method has significantly fewer model assumptions and allows us to identify more complex demographic processes, such as multiple admixture events. As with previous methods, PCA is first employed to describe the population structure. The maximal overlap discrete wavelet transform (MODWT) is then applied directly to the SNP-level data, without the need to compute averages over localized genomic windows as implemented in related procedures (Pugach et al. 2011; Brisbin et al. 2012). Instead, windowing is performed naturally and objectively as part of the wavelet decomposition procedure. We show that this new method provides robust estimates of admixture time (including improved control of uncertainty estimates), as well as recognizing other aspects of admixture processes that previous wavelet-based methods have not been able to identify with any accuracy.

Methods

General framework

Initially, we consider a simple admixture scenario where two ancestral populations $P_A$ and $P_B$ merged $T$ generations ago to form the admixed population $P_C$. The ancestral populations contribute to the admixed population with probabilities $p$ and $1 - p$. The sizes of the populations, the admixture time, and the admixture proportions are free to vary.
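To build intuition for this scenario, the block structure can be mimicked with a toy model. The sketch below is not the MaCS coalescent simulation used in this study. It assumes, as a common simplification, that potential ancestry switch points fall along the chromosome as a Poisson process with rate T per Morgan, and that each segment independently draws ancestry A with probability p. The function names `simulate_mosaic` and `mean_block_length` are illustrative.

```python
import random

def simulate_mosaic(T, p, length_morgans=1.0, seed=0):
    """Return a list of (start, end, ancestry) blocks along one chromosome."""
    rng = random.Random(seed)
    # Toy assumption: breakpoints form a Poisson process with rate T per Morgan.
    breakpoints, pos = [], 0.0
    while True:
        pos += rng.expovariate(T)
        if pos >= length_morgans:
            break
        breakpoints.append(pos)
    edges = [0.0] + breakpoints + [length_morgans]
    blocks = []
    for start, end in zip(edges[:-1], edges[1:]):
        ancestry = "A" if rng.random() < p else "B"
        # Merge adjacent segments that drew the same ancestry into one block.
        if blocks and blocks[-1][2] == ancestry:
            blocks[-1] = (blocks[-1][0], end, ancestry)
        else:
            blocks.append((start, end, ancestry))
    return blocks

def mean_block_length(blocks):
    """Average block length in Morgans."""
    return sum(end - start for start, end, _ in blocks) / len(blocks)

recent = simulate_mosaic(T=10, p=0.5, length_morgans=50.0)
ancient = simulate_mosaic(T=320, p=0.5, length_morgans=50.0)
print(mean_block_length(recent), mean_block_length(ancient))
```

Under these assumptions, older admixture (larger T) yields many more, much shorter blocks, which is the pattern the wavelet analysis is designed to pick up.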

To quantify patterns of genomic block size variation, a three-step analysis procedure was used: (i) PCA was applied to the genomic data to describe population structure; (ii) the wavelet variance was computed to provide a scale-by-scale decomposition of the variance for each population; and (iii) the portion of this measure that is informative for admixture processes was extracted relative to background levels observed in the ancestral populations.

Data simulation

Genome-wide SNP data were simulated using the sequential coalescent simulator MaCS (Chen et al. 2009). Because our primary interest is in the admixture history of Island Southeast Asia (see Real genomic data section below), we chose parameter settings that produce genomic data that broadly fit observed patterns of genetic diversity in this study region (Cox et al. 2008). The demographic model, parameters, and information sources are described in more detail in the Supporting Information (Figure S1). We emphasize, however, that the method we describe is general and can be applied to most admixed genomic systems.

Data setup

Given an admixed population $P_C$ derived from two ancestral populations $P_A$ and $P_B$, the number of individuals in the analysis (i.e., present-day samples) is $n = n_A + n_B + n_C$. For each individual $i$, we observe a collection of $T$ SNPs along a chromosome. Thus the raw data matrix $X$ is a $T \times n$ matrix with the $T$ SNPs in rows and the $n$ individuals in columns. The SNPs $s$ are ordered by their physical positions along the chromosome, with the cells of the data matrix $X_{s,i}$ taking the value 0 if heterozygous, and arbitrarily $-1$ or 1 for the alternative homozygous states. Prior to principal components analysis, the data matrix is centered such that the mean of each SNP across the ancestral reference populations is zero, giving


$$X'_{s,i} = X_{s,i} - \frac{1}{n_A + n_B} \sum_{i' \in P_A \cup P_B} X_{s,i'}.$$
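The centering step can be sketched in a few lines. This is a minimal illustration rather than the adwave implementation: it assumes the genotype matrix is stored as a list of SNP rows (matching the index order $X_{s,i}$), and the name `center_by_ancestral` and the toy genotypes are hypothetical.

```python
def center_by_ancestral(X, ancestral_cols):
    """Center each SNP (row) of X by its mean over the ancestral individuals.

    X is a list of T rows (SNPs); each row holds n genotype codes in {-1, 0, 1}.
    ancestral_cols lists the column indices of individuals in P_A and P_B.
    """
    centered = []
    for row in X:
        mean = sum(row[i] for i in ancestral_cols) / len(ancestral_cols)
        centered.append([x - mean for x in row])
    return centered

# Toy data: 3 SNPs x 5 individuals.
# Columns 0-1 are P_A, columns 2-3 are P_B, column 4 is an admixed individual.
X = [
    [-1, -1, 1, 1, 0],
    [1, 0, -1, -1, 1],
    [0, 0, 0, 1, -1],
]
Xc = center_by_ancestral(X, ancestral_cols=[0, 1, 2, 3])
for row in Xc:
    print(row)
```

Note that the admixed individual is shifted by the ancestral mean but does not contribute to it, so its genotypes cannot influence the reference frame.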

Principal components analysis

PCA is performed using only individuals from the ancestral populations. Rather than performing PCA on all samples combined, this approach has the advantage that other features of the admixed sample (such as admixture from additional ancestral populations) will not influence the projection (McVean 2009). The first eigenvector $v_1$ reflects the primary population structure. Projection of individuals onto this axis of variation is given by

$$y_{i1} = \sum_{s=1}^{T} X'_{s,i} v_{1,s}. \tag{1}$$

The proportion of ancestry inherited from population $P_A$ can be estimated for each individual (or population) using the distance from the centroids of the ancestral populations; that is, $p_i = (c_B - y_{i1})/(c_B - c_A)$, where $c_A = (1/n_A) \sum_{i \in P_A} y_{i1}$ and $c_B = (1/n_B) \sum_{i \in P_B} y_{i1}$ are the centroids of the ancestral populations along the first principal axis (Bryc et al. 2010). Note that variation between individuals within a population is represented by the smaller eigenvalues and corresponding eigenvectors (Ma and Amos 2010).
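The projection in Equation 1 and the centroid-based ancestry estimate can be illustrated with a small standard-library sketch. It is not the adwave code: the first eigenvector is approximated here by power iteration on the ancestral individuals only, the genotype matrix is a made-up toy example, and all function names are hypothetical.

```python
import math
import random

def center_rows(X, ref_cols):
    """Subtract from each SNP (row) its mean over the reference individuals."""
    out = []
    for row in X:
        mu = sum(row[i] for i in ref_cols) / len(ref_cols)
        out.append([x - mu for x in row])
    return out

def first_pc(Xc, ref_cols, iters=200, seed=0):
    """Leading eigenvector v1 (one loading per SNP), from reference individuals only."""
    rng = random.Random(seed)
    T = len(Xc)
    v = [rng.gauss(0.0, 1.0) for _ in range(T)]
    for _ in range(iters):
        # Scores of the reference individuals on the current direction.
        u = [sum(Xc[s][i] * v[s] for s in range(T)) for i in ref_cols]
        # Back-project the scores onto SNP space and renormalize.
        w = [sum(Xc[s][i] * ui for i, ui in zip(ref_cols, u)) for s in range(T)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def ancestry_proportions(Xc, v1, a_cols, b_cols):
    """Project every individual (Equation 1), then p_i = (c_B - y_i1) / (c_B - c_A)."""
    n = len(Xc[0])
    y = [sum(Xc[s][i] * v1[s] for s in range(len(v1))) for i in range(n)]
    c_a = sum(y[i] for i in a_cols) / len(a_cols)
    c_b = sum(y[i] for i in b_cols) / len(b_cols)
    return [(c_b - yi) / (c_b - c_a) for yi in y]

# Toy genotypes: 6 SNPs (rows) x 8 individuals (columns).
# Columns 0-2 are P_A, 3-5 are P_B, 6-7 are admixed P_C.
X = [
    [-1, -1, -1, 1, 1, 1, -1, 1],
    [-1, -1, 0, 1, 1, 1, -1, 0],
    [-1, 0, -1, 1, 0, 1, 1, 0],
    [-1, -1, -1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, -1, -1, -1, 1, 0],
]
A, B = [0, 1, 2], [3, 4, 5]
Xc = center_rows(X, A + B)
v1 = first_pc(Xc, A + B)
p = ancestry_proportions(Xc, v1, A, B)
print([round(x, 2) for x in p])
```

Because $p_i$ is a ratio of differences along the same axis, it does not depend on the arbitrary sign of the eigenvector. By construction the ancestral groups average to 1 and 0, and admixed individuals fall in between.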

This representation of admixed individuals in PCA space, as shown in Figure 1A, provides a genome-wide estimate of average ancestry, but does not indicate how admixture tracts are distributed along the chromosomes. To obtain localized estimates, the projection is performed at the SNP level rather than summing over the length of the genome as in Equation 1. The raw SNP-level admixture signals are given by

$$Y_s^i = \begin{cases} \dfrac{2 X'_{s,i} v_{1,s} - (Y_s^B + Y_s^A)}{Y_s^B - Y_s^A}, & |Y_s^B - Y_s^A| \ge \epsilon \\[2ex] 0, & |Y_s^B - Y_s^A| < \epsilon \end{cases} \tag{2}$$

where $Y_s^G = (1/n_G) \sum_{i \in P_G} X'_{s,i} v_{1,s}$ for $G \in \{A, B\}$. The additional terms in Equation 2 ensure that the signals are normalized such that the means of the ancestral populations are arbitrarily 1 and $-1$. This normalization step makes the measure robust to uneven sample sizes, which can affect the structure of the PCA (Novembre and Stephens 2008; McVean 2009). Stability of the signals is maintained by specifying a tolerance $\epsilon$ for separating the ancestral populations at a given SNP. This ensures that SNPs with poor discrimination are treated as uninformative in the next step of the analysis.
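Equation 2 can be sketched as follows. This is an illustrative reading of the formula, not the adwave source: the centered genotypes `Xc`, the loadings `v1`, and the tolerance value are made-up toy inputs, and `snp_signals` is a hypothetical name.

```python
def snp_signals(Xc, v1, a_cols, b_cols, eps=1e-3):
    """Return Y[i][s], the normalized SNP-level admixture signal of Equation 2."""
    n = len(Xc[0])
    Y = [[0.0] * len(Xc) for _ in range(n)]
    for s, row in enumerate(Xc):
        z = [x * v1[s] for x in row]                  # X'_{s,i} * v_{1,s}
        y_a = sum(z[i] for i in a_cols) / len(a_cols)  # ancestral mean for P_A
        y_b = sum(z[i] for i in b_cols) / len(b_cols)  # ancestral mean for P_B
        if abs(y_b - y_a) < eps:
            continue  # poorly discriminating SNP: signal stays at 0
        for i in range(n):
            Y[i][s] = (2 * z[i] - (y_b + y_a)) / (y_b - y_a)
    return Y

# Toy inputs: 3 SNPs x 5 individuals (columns 0-1 are P_A, 2-3 are P_B, 4 is P_C).
Xc = [
    [-1.0, -1.0, 1.0, 1.0, 0.0],
    [0.5, -0.5, 0.5, -0.5, 0.5],   # ancestral means coincide: uninformative
    [-0.5, -1.5, 0.5, 1.5, 1.0],
]
v1 = [0.6, 0.1, 0.5]               # illustrative loadings, not a fitted eigenvector
Y = snp_signals(Xc, v1, a_cols=[0, 1], b_cols=[2, 3])
print(Y[4])
```

At every informative SNP the $P_A$ individuals average to $-1$ and the $P_B$ individuals to 1, whatever the sample sizes, which is the robustness property described above.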

Wavelet transform

The resulting SNP-level admixture signals indicate how ancestry varies along the genome, but they invariably exhibit a high noise-to-information ratio. To interpret the signal, its frequency content can be described using the wavelet variance (Percival 1995). The wavelet variance $S_j$ for scales $j = 1, \ldots, J$ provides a scale-by-scale decomposition of the variance of the signal. The first scale ($j = 1$) captures the highest frequency patterns, representing very local information. Increasing the scale index provides successively coarser, or lower frequency information, equivalent to “zooming out” on the signal until the level of the entire chromosome is reached. A plot of $S_j$ vs. $j$ indicates which scales are important contributors to the process variance and indirectly provides information about the distribution of admixture tracts. For example, recent admixture produces a peak in the wavelet variance at a large wavelet scale, reflecting long admixture tracts, while more ancient admixture events produce peaks at lower wavelet scales, reflecting shorter admixture tracts.

Figure 1 Simulated example with 13,000 SNPs, 15 diploid individuals in ancestral populations ($P_A$, $P_B$), and 20 diploid individuals in the admixed population ($P_C$). Populations are shown in green ($P_A$), blue ($P_B$), and red ($P_C$). (A) PCA is used to describe the primary population structure; (B) raw wavelet variance for each population illustrates high frequency noise; (C) informative variation in the admixed population after standard correction for noise estimated from the ancestral populations. Note that this example uses the default threshold $m = 1$.

The wavelet variance for an individual $i$ is given by

$$S_j^i = \frac{1}{T} \sum_{k=1}^{T} \left| d_{j,k}^i \right|^2, \tag{3}$$

where $d_{j,k}^i = \sum_s Y_s^i \psi_{j,s-k}$ are the wavelet coefficients for the signal $Y^i$ constructed using the wavelet system $\psi$. To appreciate the methodology, it is sufficient to understand that the wavelet variance reflects the frequency content of the signal, but more detailed background material is provided in the Supporting Information (Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1). Our implementation employs Daubechies’ least asymmetric wavelet number 8 in the waveslim (Whitcher 2013) package of the statistical software R (R Development Core Team 2014). We emphasize, however, that the methods proposed here are robust to other choices of analyzing wavelet (see Supporting Information, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1).
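The wavelet variance of Equation 3 can be sketched with a very small MODWT. To keep the example short and dependency-free, it swaps in the Haar wavelet with circular boundary handling rather than the least asymmetric wavelet used in this study, so the numbers will differ from those produced by waveslim. The name `modwt_haar_variance` and the square-wave test signals are illustrative.

```python
def modwt_haar_variance(signal, n_levels):
    """Wavelet variance S_j for j = 1..n_levels, using a circular Haar MODWT."""
    N = len(signal)
    v = list(signal)  # scaling (smooth) coefficients at level 0
    variances = []
    for j in range(1, n_levels + 1):
        lag = 2 ** (j - 1)
        # Wavelet coefficients at level j: half-differences at the current lag.
        w = [(v[t] - v[(t - lag) % N]) / 2.0 for t in range(N)]
        # Smooth coefficients passed on to the next, coarser level.
        v = [(v[t] + v[(t - lag) % N]) / 2.0 for t in range(N)]
        variances.append(sum(x * x for x in w) / N)
    return variances

def square_wave(block_len, n):
    """Alternating +1/-1 blocks, a crude stand-in for an ancestry signal."""
    return [1.0 if (t // block_len) % 2 == 0 else -1.0 for t in range(n)]

short_blocks = modwt_haar_variance(square_wave(4, 256), n_levels=8)
long_blocks = modwt_haar_variance(square_wave(32, 256), n_levels=8)
print([round(x, 3) for x in short_blocks])
print([round(x, 3) for x in long_blocks])
```

For these zero-mean toy signals the scale-wise variances sum to the total variance of the signal, and the signal with longer blocks places its variance at coarser scales, mirroring the contrast between recent and ancient admixture.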

Population averages are computed as

$$S_j^C = \frac{1}{n_C} \sum_{i \in P_C} S_j^i$$

(and similarly for populations $P_A$ and $P_B$). An example of the average wavelet variance for each population is shown in Figure 1B. The wavelet variance is highest at fine scales, but as the ancestral populations also show this pattern, it should be considered background noise. It is intuitive that the very finest wavelet scales are uninformative because small numbers of SNPs should be insufficient to differentiate between populations. The raw wavelet variance is therefore considered as a combination of informative variation and background noise

$$S_j^C = I_j^C + N_j. \tag{4}$$

To extract the informative variance $I_j^C$, we subtract the proportion that can be attributed to noise. This is estimated from the variation observed in the ancestral populations: $\hat{N}_j = m \cdot \max(S_j^A, S_j^B)$, where $m$ is a multiplicative factor that allows the degree of thresholding to be controlled. Under almost all conditions, a default value of $m = 1$ may be assumed, and this threshold should be raised only if the admixture signals exhibit high levels of noise (see Supporting Information, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1 for details). Population characteristics that influence noise levels in the admixture signals are explored in the next section. The final measure of the informative variance is given by

$$I_j^C = \max\left(S_j^C - \hat{N}_j,\ 0\right), \tag{5}$$

which describes the frequency content that is unique to the admixed population (in contrast to the ancestral populations).
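The noise correction in Equations 4 and 5 amounts to a scale-by-scale thresholding, which can be sketched as below. The population-average wavelet variances are made-up numbers chosen only to show the behavior, and `informative_variance` is a hypothetical name.

```python
def informative_variance(S_C, S_A, S_B, m=1.0):
    """Equation 5: subtract m * max(S_A, S_B) at each scale and floor at zero."""
    return [max(sc - m * max(sa, sb), 0.0) for sc, sa, sb in zip(S_C, S_A, S_B)]

# Illustrative population-average wavelet variances at scales j = 1..4.
S_C = [0.50, 0.30, 0.25, 0.20]   # admixed population
S_A = [0.48, 0.25, 0.10, 0.05]   # ancestral population A
S_B = [0.52, 0.20, 0.12, 0.04]   # ancestral population B

I_default = informative_variance(S_C, S_A, S_B)         # default m = 1
I_strict = informative_variance(S_C, S_A, S_B, m=2.0)   # heavier thresholding
print(I_default)
print(I_strict)
```

In this toy example the finest scale is removed entirely at the default setting because an ancestral population shows at least as much variance there, while raising m suppresses more of the intermediate scales.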

Real genomic data

To illustrate that our method performs well in real-world situations, it was applied to a SNP genotyping chip data set of 394 individuals from 16 communities spread across the Indonesian archipelago (Table 1). Equivalent SNP data from Southern Han Chinese and Papua New Guinea Highlanders were used as proxies for the ancestral populations. Permission to conduct research in Indonesia was granted by the Indonesian Institute of Sciences. Blood samples or buccal swabs were collected from consenting, closely unrelated, and seemingly healthy individuals by J. Stephen Lansing (University of Arizona) and Herawati Sudoyo (Eijkman Institute for Molecular Biology, Indonesia), with the assistance of Indonesian Public Health clinic staff. All sample collection followed protocols for the protection of human subjects established by both the Eijkman Institute and the University of Arizona institutional review boards. Participant interviews confirmed local residence for at least two generations into the past. Samples were genotyped with the Affymetrix Axiom chip, yielding 548,994 SNPs across the autosomes. (Sex-linked markers were excluded from the analysis.) The SNP data were cleaned using standard protocols in PLINK v. 1.07 (Purcell et al. 2007; Purcell 2009) and the wavelet transform performed as described above.

The approximate Bayesian computation analysis employed 1000 data sets with sample sizes and SNP numbers set to those of the real data. These data sets were simulated by drawing from a uniform prior of admixture times between 10 and 300 generations. The admixture proportion for the Bena population (used as our primary test case) was set to 0.6, as estimated previously from the real data. The ABS metric was calculated for each simulation, and the multiple chromosome structure of the data was mimicked by sampling each individual repeatedly with different data densities.
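The overall logic of the dating step can be sketched as a simple rejection scheme. The text above does not specify which ABC variant or acceptance rule was used, so the nearest-neighbor rejection step here is an assumption. The function `toy_abs` is a purely hypothetical stand-in for "simulate genomic data at time T, run the wavelet analysis, and return the ABS metric"; its functional form and noise level are invented for illustration.

```python
import math
import random

def toy_abs(T, rng):
    """Hypothetical stand-in for the simulate-then-summarize step (not MaCS)."""
    return 14.0 - 1.1 * math.log2(T) + rng.gauss(0.0, 0.15)

def abc_rejection(observed_abs, n_sims=1000, n_keep=50, seed=1):
    """Keep the prior draws whose simulated ABS lies closest to the observed ABS."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        T = rng.uniform(10, 300)  # uniform prior on admixture time (generations)
        distance = abs(toy_abs(T, rng) - observed_abs)
        draws.append((distance, T))
    draws.sort()
    return sorted(T for _, T in draws[:n_keep])  # approximate posterior sample

# Pretend the "observed" ABS came from an admixture event 130 generations ago.
observed = toy_abs(130, random.Random(7))
posterior = abc_rejection(observed)
median = posterior[len(posterior) // 2]
print(round(median), round(posterior[0]), round(posterior[-1]))
```

The spread of the retained draws gives a direct, if crude, measure of dating uncertainty, which is why less dispersed summary statistics translate into tighter admixture time estimates.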

Results

As proof of concept, we first applied our wavelet method to simulated data. A range of admixture scenarios was explored by varying parameters of the demographic model, particularly the time of admixture, admixture proportion, and single vs. multiple admixture events. Fifty simulations were performed for each scenario with modest (but therefore realistic) ancestral sample sizes of $n_A = n_B = 15$ and an admixed sample size $n_C = 20$.

Admixture time

Because the ability of wavelet methods to calculate the time of admixture is well known from earlier work (Pugach et al. 2011), we explored this feature first. Simulations were performed for admixture times ranging from 10 to 320 generations (i.e., from the recent past to ~10,000 years ago, using a generation interval of 30 years; Fenner 2005). Admixture at 10 generations shows the highest informative wavelet variance at scale 13, reflecting relatively few, long admixture blocks (Figure 2). As the time of admixture occurs further back in the past, the peak in wavelet variance shifts toward successively lower wavelet scales, reflecting ever-smaller admixture blocks driven by cumulative recombination along the chromosome. The average frequency content can be characterized by the average block size metric ABS, termed the “wavelet center” by Pugach et al. (2011), which, as shown later, can be used to date the admixture event:

$$\mathrm{ABS} = \frac{\sum_j j \cdot I_j^{AB}}{\sum_j I_j^{AB}}, \tag{6}$$

where $I_j^{AB}$ is the informative wavelet variance at scale $j$ (Equation 5).
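Equation 6 is a variance-weighted mean of the scale index, which a short sketch makes concrete. The input vectors are invented informative variances, and `average_block_size` is a hypothetical name rather than a function from adwave.

```python
def average_block_size(informative):
    """Equation 6: mean scale index j, weighted by informative variance I_j."""
    total = sum(informative)
    if total == 0:
        raise ValueError("no informative variance at any scale")
    return sum(j * i_j for j, i_j in enumerate(informative, start=1)) / total

# Illustrative informative variances over scales j = 1..6.
ancient = [0.0, 0.10, 0.20, 0.05, 0.0, 0.0]   # variance concentrated at fine scales
recent = [0.0, 0.0, 0.0, 0.05, 0.20, 0.10]    # variance concentrated at coarse scales
print(average_block_size(ancient), average_block_size(recent))
```

A lower ABS therefore indicates smaller blocks and, all else being equal, older admixture.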

Admixture proportion

Admixture proportions were varied between 0.5 (equal ancestry from $P_A$ and $P_B$) and 0.025 (ancestry predominately from $P_A$). For this analysis, the time of admixture was fixed at 160 generations. As the proportion of admixture decreases, the raw wavelet variance exhibits increasing levels of noise relative to informative variation. This is shown by the reduced magnitude of the informative wavelet variance (Figure 3) and emphasizes that, as expected, it is increasingly difficult to extract informative variation at low admixture proportions (small $p$) even where the signal is technically present. In this example, informative estimates were obtained for admixture proportions as low as 2.5%, although in general, the range of $p$ for which this method is applicable will also depend on other characteristics of the data, such as the SNP density and sample size, as considered in the next section.

Sensitivity analyses

The sensitivity of the method to a wide range of data characteristics was considered by repeating the analysis of the admixture time example with a large number of simulated data sets. Results are summarized in Table 2 and Figure 4.

Condition 1 shows the original results, exactly as described above. New simulations were then performed to mimic realistic linkage disequilibrium (LD) (condition 2). To do so as accurately as possible, we applied the real recombination rates observed along the first 100 Mb of chromosome 1, as recombination rates for chromosome 1 are near the average of all chromosome-level recombination rates (Figure S2). The effect of lower sample size (condition 3) was investigated by reducing the number of individuals sampled from each population by 5, thus yielding sample sizes that would be smaller than almost any published population genetics study ($n_A = n_B = 10$, $n_C = 15$). The effect of more recent divergence between the ancestral populations (condition 4) was investigated by decreasing $T_{\mathrm{Ancestral}}$ from 2000 to 1200 generations ago (50,000–30,000 years ago). The effect of using a misrepresentative modern population as a proxy for an ancestral population (condition 5) was investigated by studying ancestral populations with mixed (rather than “pure”) ancestry. Rather than using samples from the true ancestral population $P_A$, an admixed ancestral population $P_A^*$ was employed instead ($p = 0.1$). A wide range of parameters was applied for sensitivity testing, but for clarity, only results for single parameter values are shown in Figure 4. These examples are representative of all the tests that were run.

Variation in summary measures between simulations was compared by computing the relative standard deviation (RSD) at each admixture time. For all of the error conditions above, the computed ABS metrics are consistent with the reference case (condition 1), but with slightly larger relative standard deviations. Only for one case (condition 4; reduced divergence between the ancestral populations and admixture at 320 generations) are the ABS metrics biased, with the mean falling outside the range of values observed for the reference example. We emphasize that this is expected: admixture should be more difficult to detect when it occurs between two ancestral populations that diverged only recently. Stability of the ABS metrics in this particular scenario could be improved by applying a higher level of thresholding. However, the default value of $m = 1$ was retained here to provide consistency across scenarios, to demonstrate the deterioration in resolution, and to illustrate that the thresholding parameter can be ignored for all but the most extreme admixture cases.
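As a small illustration of the comparison described above, the relative standard deviation of a set of ABS values can be computed as follows. The text does not state whether the sample or population standard deviation was used, nor whether RSD was expressed as a percentage; the sketch assumes the sample standard deviation and a percentage scale, and the ABS values are invented.

```python
import statistics

def relative_sd(values):
    """Relative standard deviation: sample SD divided by |mean|, as a percentage."""
    return 100.0 * statistics.stdev(values) / abs(statistics.mean(values))

# Invented ABS values from repeated simulations at one admixture time.
reference = [8.0, 8.2, 8.4]    # e.g., the reference condition
perturbed = [7.6, 8.2, 8.8]    # e.g., a noisier sensitivity condition
print(round(relative_sd(reference), 2), round(relative_sd(perturbed), 2))
```

Because the RSD is scaled by the mean, it allows dispersion to be compared across admixture times whose ABS values differ in absolute size.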

In all of these examples, including the standard reference case, the localized admixture signals provide a noisy indication of how ancestry varies along the chromosome. Indeed, the inherent stochasticity of the block structure is the primary reason why other sources of variance, such as the cases discussed above, have relatively little additional effect on the overall results. This noise is addressed using wavelets to capture the distribution of block sizes, coupled with a correction based on the ancestral populations to distinguish informative signals from background variation. The cases considered above all slightly increase noise levels relative to informative variation, which, as demonstrated by the admixture proportion example in Figure 3, reduces the magnitude of the extracted informative wavelet variance. As noise increases, it naturally becomes more difficult to extract informative variation. However, this increase in noise levels is minimal for all but the most extreme confounds, thus allowing the technique to be applied robustly to a very wide range of scenarios.

Table 1 Summary of case study populations describing sample size (n), proportion of Asian ancestry as inferred by PCA (p), and the average block size metric (ABS, for admixed populations only)

Population                    n     p      ABS
Southern Han Chinese          13    1.00   –
Nias                          28    0.87   4.26
Mentawai                      29    0.87   4.30
Java                          21    0.84   4.41
Sumatra                       30    0.83   4.49
Bali                          19    0.83   4.90
Sulawesi                      21    0.80   6.57
Sumba, Wunga                  30    0.67   7.90
Sumba, Anakalang              30    0.66   7.83
Flores, Rampasasa             12    0.66   8.05
Flores, Bena                  30    0.57   8.14
Flores, Bama                  30    0.55   8.34
Timor, Umanen Lawalu          17    0.55   8.44
Timor, Kamanasa               19    0.53   8.42
Lembata                       28    0.53   8.39
Pantar                        27    0.45   8.47
Alor                          23    0.42   8.46
Papua New Guinea Highlands    13    0.00   –


The effect of SNP density (which is always a known variable) is demonstrated by downsampling the data (conditions 6–8). The original density of 4306 SNPs (condition 6) was chosen to correspond to the size of our real chromosome 22 data set. Reducing the SNP density of this data set means that the resulting wavelet decomposition is given over 11 wavelet scales rather than the earlier 13, and so as expected, the computed mean ABS metrics are correspondingly much smaller. However, this has no effect on the inference, as the data size is always known and simulations are simply run to match the size of the observed data. Further reductions in SNP density to 3250 SNPs (condition 7) and 1625 SNPs (condition 8) are also shown. Note that although the absolute values of the ABS metrics are shifted, the trend with admixture time remains consistent.

Method comparison

The original StepPCO method (Pugach et al. 2011) has already been tested extensively against other admixture detection methods, particularly HAPMIX (Price et al. 2009). We therefore focus here on comparing our improved wavelet method against the StepPCO procedure. Figure 5 shows that the summary measure (wavelet center) used in StepPCO is comparable to the adwave ABS metrics, as both exhibit a strong trend with time of admixture. However, the dispersion is consistently smaller for the adwave ABS metrics. For example, the wavelet centers (StepPCO) computed for T = 320 and T = 160 show substantial overlap, while the ABS metrics (adwave) for the same populations show only minimal overlap. This illustrates that adwave offers increased power to differentiate between older admixture scenarios, with substantially reduced uncertainty in dating.

We also emphasize that adwave requires far fewer user specifications with regard to runtime options. The only variable for adwave is the thresholding parameter, and as shown above, the default value of m = 1 should be used for almost all admixture scenarios. In contrast, the StepPCO results required a signal length parameter (K = 1024), a window size parameter (l = 5), and two thresholding parameters (threshold = 0.1, maxlevel = 6) (all notations from Pugach et al. 2011). A detailed demonstration of this method comparison, with explanation of the settings chosen for StepPCO, is provided in Figure S3.

Admixture in Indonesian populations

Populations across Indonesia show genomic admixture between Asian and Melanesian ancestral sources (Cox et al. 2010), which has been dated using other methods to an admixture event ~4000 years ago (~130 generations) (Xu et al. 2012). We calculated wavelet summary measures for 16 communities across the Indonesian archipelago using 548,994 autosomal SNPs screened in 394 individuals (Table 1). Equivalent data from Southern Han Chinese and Papua New Guinea Highlanders were used as proxies for ancestral populations, as described in Cox et al. (2010).

Figure 2 Informative wavelet variance for each time of admixture (10–320 generations using default thresholding m = 1). Shaded bars represent the average over 50 simulations at each admixture time; black bars represent the range across individual simulations. The average block size metric for each scenario is indicated by a dotted blue line.

The PCA for all individuals, where only the ancestral populations were used to define the axes, is shown in Figure 6. Admixed individuals dispersed along the first principal component illustrate the primary genomic signal, a strong gradient in Asian-Melanesian ancestry that has previously been observed across the region (Cox et al. 2010). The informative wavelet variance was computed separately for each chromosome and individual and subsequently combined to provide a single measure for each population (Figure S4). To combine information across chromosomes, which vary considerably in size, the raw admixture signals were windowed: all signals were reduced to the size of the smallest chromosome (importantly without discarding any data) by computing averages over a window of SNPs (details of the windowing procedure are provided in Supporting Information, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and File S1). The SNP density and window size for each chromosome are shown in Table S1. This windowing procedure is used only to standardize chromosomes to the same length and utilizes very short windows of SNPs (unlike the approach of Pugach et al. 2011).
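The windowing step described above can be sketched in a few lines. The exact scheme is given in the supporting information and is not reproduced here, so the following is only one plausible reading: it assumes contiguous windows of near-equal size, so that every SNP contributes to exactly one average. The function name `window_signal` and the toy signal are hypothetical.

```python
def window_signal(signal, target_length):
    """Average a per-SNP admixture signal into `target_length` contiguous windows.

    Assumption: windows are contiguous and as equal in size as possible, so
    every SNP falls in exactly one window and no data are discarded.
    """
    n = len(signal)
    if not 0 < target_length <= n:
        raise ValueError("target_length must be between 1 and len(signal)")
    windowed = []
    for k in range(target_length):
        # Integer boundaries spread any remainder evenly across the windows.
        start = k * n // target_length
        stop = (k + 1) * n // target_length
        chunk = signal[start:stop]
        windowed.append(sum(chunk) / len(chunk))
    return windowed


# Toy example: a 10-SNP signal reduced to the length of a 4-SNP "smallest chromosome".
toy_signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
toy_windowed = window_signal(toy_signal, 4)
```

Because the window boundaries are computed from the target length, the window size never has to be chosen by hand; it follows from the ratio of each chromosome's SNP count to that of the smallest chromosome.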

The average block size metrics calculated for each population are shown in Table 1. The first six Indonesian populations (Nias, Mentawai, Java, Sumatra, Bali, and Sulawesi) exhibit predominantly Asian ancestry, with high-frequency noise in the signals causing some bias in the computed ABS metrics (Figure S4). The remaining Indonesian populations exhibit less extreme Asian ancestry proportions (42–67%), with the resulting ABS metrics appearing broadly similar between populations.

Under the assumption of a single admixture time (relaxed in later sections), the average block size metric can be used to date the time of admixture using approximate Bayesian computation (ABC). A general introduction to ABC can be found in Csilléry et al. (2010) and Sunnåker et al. (2013), while ABC in the context of parameter estimation for population admixture has been considered by Sousa et al. (2009) and Robinson et al. (2014).

The ABC inference procedure allows us to capture uncertainty in admixture time estimates more robustly than earlier wavelet dating approaches (Pugach et al. 2011; Xu et al. 2012). To illustrate this process, dating was performed on the Bena population of Flores in eastern Indonesia, resulting in an estimated median admixture time of 147 generations (95% credible region: 122–178 generations), or 4410 years before present (95% CR: 3660–5340 years BP). This closely matches earlier point estimates of the admixture time (Xu et al. 2012) and is consistent with our current understanding of Island Southeast Asian prehistory (Bellwood 2007).
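The conversion from generations to calendar years in the estimate above can be checked with a short calculation. The generation interval is not restated in this passage; the value of 30 years used below is an assumption inferred from the reported numbers (147 generations corresponding to 4410 years), and the function name is hypothetical.

```python
# Assumption: 30 years per generation, inferred from 147 generations -> 4410 years.
YEARS_PER_GENERATION = 30


def generations_to_years(generations, years_per_generation=YEARS_PER_GENERATION):
    """Convert a time in generations to years under a fixed generation interval."""
    return generations * years_per_generation


median_years = generations_to_years(147)
interval_years = (generations_to_years(122), generations_to_years(178))
```

Under this assumption the median and both bounds of the credible region reproduce the reported values of 4410, 3660, and 5340 years.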

The relationship between time of admixture and the ABS metric across all simulations is illustrated in Figure 7A. ABC was implemented using the R package abc (Csilléry et al. 2012), and the posterior distribution of admixture time was computed using a local linear regression (Beaumont et al. 2002) with a tolerance rate of 0.2. Cross validation was used to evaluate the accuracy of this estimate: the prediction error was low (0.038) and insensitive to the exact tolerance value. For future research focusing on parameter inference, this procedure could be modified to use a larger number of simulated data sets and a lower tolerance rate. However, this simple example clearly illustrates that the adwave method has good statistical power to date admixture using a relatively small number of simulations.
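To make the role of the tolerance rate concrete, the following is a minimal sketch of the rejection step that underlies ABC. It is not the analysis reported here, which used the R package abc with a local linear regression adjustment. The simulator `toy_simulate_abs` is a made-up stand-in for the coalescent simulations, and its linear relationship between ABS and log admixture time, the log-uniform prior, and the observed value of 9.6 are all assumptions chosen purely for illustration.

```python
import math
import random
import statistics


def toy_simulate_abs(time_generations, rng):
    """Hypothetical stand-in for a coalescent simulation: returns a noisy ABS value.

    Assumption: ABS falls by 0.5 for every doubling of the admixture time.
    """
    return 11.5 - 0.5 * math.log2(time_generations / 10) + rng.gauss(0, 0.05)


def rejection_abc(observed_abs, n_sims=2000, tolerance=0.2, seed=1):
    """Keep the fraction `tolerance` of prior draws whose simulated ABS lies
    closest to the observed value; the kept draws approximate the posterior."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_sims):
        # Assumed log-uniform prior on admixture time (10 to 320 generations).
        t = 10 * 2 ** rng.uniform(0, 5)
        distance = abs(toy_simulate_abs(t, rng) - observed_abs)
        draws.append((distance, t))
    draws.sort()
    n_keep = max(1, round(tolerance * n_sims))
    return [t for _, t in draws[:n_keep]]


posterior = rejection_abc(observed_abs=9.6)
posterior_median = statistics.median(posterior)
```

A lower tolerance keeps fewer, closer simulations, which narrows the approximate posterior at the cost of needing more simulated data sets; the regression adjustment used in the reported analysis is an additional step on top of this acceptance rule.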

Multiple admixture events

Another aim of this work is to show that our improved wavelet approach can be used to study other features of the admixture process beyond the well-explored question of admixture time. In the examples covered thus far, it has been assumed that admixture occurred as a single event. However, additional waves of admixture will result in the introduction of new ancestry tracts, replacing a proportion of older, shorter ancestry blocks with newer, longer ones. Pugach et al. (2011) briefly considered the effects of continuous admixture within a wavelet setting, showing that this leads to underestimated admixture times in their original methodological framework. In contrast, we consider scenarios with two distinct admixture events. We show that this process creates distinctive patterns in the observed informative variation, which can be used to reconstruct more complex demographic processes (as opposed to being treated solely as a potential source of bias).

In the following dual-admixture scenarios, the first admixture event always occurs at 160 generations. To investigate the effect of separation between admixture events, the second admixture event varies between 10 and 80 generations. In the extreme case of admixture at 160 and 10 generations ago, the localized admixture signals contain two dominant frequencies. Single admixture events at 160 and 10 generations lead to peaks in the informative wavelet variance at wavelet scales of 9 and 13, respectively. When two admixture events occur, the informative wavelet variance is instead spread between these scales (Figure 8A). As the admixture events occur closer together, this spread in the observed informative wavelet variance decreases (Figure 8, B–D).

Figure 3 Relationship between proportion of admixture and informative wavelet variance. For this example only, a nondefault value for the threshold m = 1.1 was used to account for increased noise in the admixture signals due to low proportions of admixture, as described in the text. The magnitude of the wavelet variance decreases with the admixture proportion, shown as colored bars from black (P = 0.50) to yellow (P = 0.025).

For one admixture event, a single dominant peak is observed in the informative wavelet variance, and the ABS metric therefore provides a convenient summary measure. For multiple admixture events, the ABS metric describes the average admixture time, but provides no information about the duration over which admixture occurred. In contrast, the informative wavelet variance should provide additional information about the peak dispersion. To explore the potential for identifying more complex admixture scenarios, a simple classification rule was implemented. An admixed population $P_C$ is assigned to one of two groups $G_1$, $G_2$, which are characterized by the summary measures $M_1$, $M_2$. This scheme is described with an abstract choice of summary measure, but below, we consider how different summary measures (taking $M_1$, $M_2$ to be either the ABS metric or the informative wavelet variance) affect the success of classification.

The classification rule is implemented as follows:

1. The "true" summary measures $M_1$, $M_2$ are computed for each group using values obtained from the first 25 simulations.

2. For each of the remaining 25 trial data sets ($s = 1, \ldots, 25$), estimated summary statistics $\hat{M}_s$ are calculated. The divergence measures are defined as

$$D_i = \sum_{s=1}^{S} \left| \hat{M}_s - M_i \right| \qquad (7)$$

for $i = 1, 2$.

3. If $D_1 < D_2$, classify to $G_1$; otherwise classify to $G_2$.
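The three steps above can be sketched as follows. This is an illustrative reading of Equation 7 rather than the code used for the reported results. Each summary measure is represented as a sequence of numbers (a one-element list for a scalar such as the ABS metric, or one value per wavelet scale for a variance profile); summing absolute differences across scales for a profile is an assumption, and the reference profiles and trial values are invented for the example.

```python
def divergence(trial_estimates, reference):
    """Equation 7: sum over trial data sets of |M_hat_s - M_i|.

    Assumption: when a summary measure is a profile over wavelet scales,
    absolute differences are also summed across scales.
    """
    return sum(
        abs(est - ref)
        for estimate in trial_estimates
        for est, ref in zip(estimate, reference)
    )


def classify(trial_estimates, m1, m2):
    """Step 3: assign to G1 if D1 < D2, otherwise to G2."""
    d1 = divergence(trial_estimates, m1)
    d2 = divergence(trial_estimates, m2)
    return "G1" if d1 < d2 else "G2"


# Hypothetical reference profiles over five wavelet scales (not values from the paper).
m1_profile = [0.0, 0.1, 0.8, 0.1, 0.0]    # single event: one sharp peak
m2_profile = [0.3, 0.15, 0.1, 0.15, 0.3]  # two events: variance spread across scales
trials = [
    [0.28, 0.17, 0.12, 0.13, 0.30],
    [0.31, 0.14, 0.09, 0.16, 0.30],
]
label = classify(trials, m1_profile, m2_profile)
```

The two invented profiles are both symmetric about the middle scale, so a single weighted-average scale would be identical for them, whereas the full profile separates them easily. This mirrors the reason the variance profile is expected to outperform a scalar summary when admixture events differ in spread rather than in average time.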

The classification rates are shown in Table 3 for scenario 1 (a single admixture event at 60 generations; mean ABS 10.47, range 10.30–10.64) and scenario 2 (two admixture events at 160 and 10 generations; mean ABS 10.57, range 10.35–10.90). With a sample size of just 10 individuals for the admixed population, perfect classification is achieved using the informative wavelet variance, while the ABS metric correctly classifies only 60% of cases. For real multiple admixture situations, this classification framework could be extended to a more complex inferential setting (such as ABC), but this simple example demonstrates the potential for reconstructing complex admixture scenarios from the full wavelet variance profile.

Discussion

Wavelet techniques provide information on the ancestry block structure of admixed genomes and hence can be used to reconstruct the processes involved in past admixture events. Ancestry blocks are strictly unobservable and can be inferred only from the data. Wavelets provide indirect information on the block structure, thus providing an alternative to methods that assign ancestry directly (Sankararaman et al. 2008; Price et al. 2009). A growing body of methods now assigns ancestry indirectly using various unrelated approaches (Moorjani et al. 2011; Baran et al. 2012; Gravel 2012; Loh et al. 2013; Maples et al. 2013; Brown and Pasaniuc 2014), but here we extend the use of wavelet techniques as introduced by Pugach et al. (2011). Importantly, our implementation differs markedly from the original StepPCO program, with the main differences at each stage of the analysis highlighted below:

• Localized admixture signal formation: StepPCO (Pugach et al. 2011) uses large windows of SNPs to produce an averaged admixture signal in localized windows along the genome. Our work demonstrates that wavelet methods are equally applicable to the raw unwindowed signals, with the windowing procedure performed intrinsically as part of the wavelet analysis, and therefore not requiring arbitrary a priori decisions on window size.

• Wavelet analysis: The wavelet methods we describe are based on the MODWT, which offers more flexibility in its application since there is no restriction on the length of the signals. Conversely, StepPCO employs the discrete wavelet transform (DWT), which has the strict requirement that signals be of length 2^n. Data must therefore be windowed, or discarded, to meet the restrictive length requirements of the DWT framework. Another advantage of the MODWT is that the resulting wavelet coefficients are translation equivariant, meaning that circularly shifting the data results in the same shifting of the coefficients. Said differently, changing the starting point (for instance, to avoid a poor quality SNP) shifts the resulting wavelet coefficients but does not otherwise alter them, whereas this is not true under the DWT framework. This property is particularly important if the results are to be used for specific localized genomic regions (as discussed briefly below) and thus provides a solid statistical foundation for future work.

• Extraction of relevant information: The portion of the resulting wavelet decomposition that is informative about the admixture process is extracted in a simple procedure with reference to the ancestral populations, offering greater simplicity and objectivity than the multistage thresholding procedure described by Pugach et al. (2011).

• Software: The adwave software, which implements the method described in this article, is distributed through the R project's CRAN repository (http://cran.r-project.org/web/packages/adwave/index.html). This allows extremely easy installation and use, as well as providing a series of simple worked examples as a learning exercise. The adwave package is also faster than the existing StepPCO code and offers more flexibility in the choice of analyzing wavelet (unlike StepPCO, which employs only the simplest "square-shaped" Haar wavelet).

Table 2 Sensitivity of the adwave method to a range of data limitations

Data limitations                                                 Admixture time (generations)
Condition  Description                                           10            20            40            80            160           320
1          Reference                                             11.55 (0.69)  11.14 (0.58)  10.65 (0.7)   10.13 (0.66)  9.54 (1.12)   8.84 (1.72)
2          Realistic LD                                          11.71 (0.84)  11.31 (0.88)  10.85 (1.08)  10.32 (1.04)  9.82 (1.23)   9.09 (2.28)
3          Reduced sample size                                   11.66 (0.90)  11.25 (0.63)  10.74 (0.88)  10.20 (0.85)  9.58 (1.19)   8.85 (1.78)
4          Reduced divergence between ancestral populations      11.64 (0.83)  11.24 (0.69)  10.76 (0.82)  10.21 (1.02)  9.49 (1.35)   8.32 (3.59)
5          Non-representative ancestral populations              11.73 (1.34)  11.23 (1.42)  10.77 (1.51)  10.21 (1.45)  9.58 (1.94)   8.60 (3.06)
6          SNP density T = 4036                                  10.47 (0.91)  10.04 (0.73)  9.55 (0.92)   8.98 (1.22)   8.37 (1.62)   7.60 (3.13)
7          SNP density T = 3250                                  9.96 (0.76)   9.60 (0.60)   9.19 (0.76)   8.68 (1.08)   8.12 (1.41)   7.37 (2.6)
8          SNP density T = 1625                                  8.99 (1.23)   8.63 (1.09)   8.20 (1.50)   7.68 (2.29)   7.10 (3.15)   6.33 (5.86)

Mean average block size values (relative standard deviation in parentheses) are shown for each admixture time. Reference data were simulated with T = 13,000 SNPs, population sizes of nA, nB = 15, nC = 15, and divergence between the ancestral populations at TAncestral = 2000 generations ago.
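The translation-equivariance property described in the wavelet analysis item of the list above can be illustrated with a small, self-contained example. The sketch below is not the adwave implementation; it hand-codes only the first-level Haar wavelet coefficients of each transform, with circular boundary handling assumed for the MODWT, and applies them to an invented binary ancestry signal.

```python
import math


def haar_modwt_level1(x):
    """Level-1 Haar MODWT wavelet coefficients with circular boundary handling."""
    n = len(x)
    return [(x[t] - x[(t - 1) % n]) / 2 for t in range(n)]


def haar_dwt_level1(x):
    """Level-1 Haar DWT wavelet coefficients (decimated; needs an even length)."""
    return [(x[2 * k + 1] - x[2 * k]) / math.sqrt(2) for k in range(len(x) // 2)]


def circular_shift(x, k):
    """Shift a sequence k positions to the right, wrapping around."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)


signal = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]  # toy ancestry signal
shifted = circular_shift(signal, 1)

# MODWT: the coefficients of the shifted signal are the shifted coefficients.
modwt_equivariant = (
    haar_modwt_level1(shifted) == circular_shift(haar_modwt_level1(signal), 1)
)

# DWT: a one-position shift changes which pairs are differenced, so the
# coefficients change in value rather than simply moving.
dwt_original = haar_dwt_level1(signal)
dwt_shifted = haar_dwt_level1(shifted)
```

In this toy case the decimated transform misses both ancestry switch points in the original signal (all level-1 coefficients are zero) but detects them after a one-position shift, whereas the MODWT coefficients simply move with the data.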

The work presented here also makes several other advances. The average block size metric has previously been shown to capture the time of admixture. Here, we have implemented a more formal dating procedure using ABC under the assumption of a single admixture event. In reality, populations may have experienced multiple admixture events leading to complex patterns of genetic variation. We have shown that the wavelet variance contains additional information to identify these more complex admixture scenarios. This highlights the potential of wavelet-based techniques to be coupled with formal statistical inference procedures to robustly distinguish between the range of scenarios that could have resulted in any observed genetic pattern.

Method performance for the StepPCO procedure has already been tested against other admixture detection methods, most extensively with HAPMIX (Price et al. 2009), with favorable results. This is especially true for older admixture events (Pugach et al. 2011). While an in-depth comparison with other local ancestry detection methods would be of great interest (Moorjani et al. 2011; Baran et al. 2012; Gravel 2012; Loh et al. 2013; Maples et al. 2013; Brown and Pasaniuc 2014), such an analysis is beyond the scope of this manuscript. We have therefore focused instead on showing how adwave markedly improves on the original wavelet method implemented in StepPCO. As shown above, adwave offers improved statistical power to differentiate between admixture scenarios, offers much reduced uncertainty in model parameter estimates, and importantly, is far easier to use than StepPCO, especially by requiring far fewer user-specified runtime parameters.

Figure 4 Sensitivity to a range of realistic data limitations. Comparison to reference data (condition 1) simulated with T = 13,000 SNPs, population sizes nA, nB = 15, nC = 15, and ancestral population divergence at TAncestral = 2000 generations. The gray area shows the range of ABS metrics observed under the standard reference condition. (A) Potential sources of error (conditions 2–5); (B) varying SNP densities (conditions 6–8). Note that the decline in absolute values of the ABS metrics in B is expected; these are easily accounted for in an inference setting because the SNP density is always a known variable. Condition descriptions and numeric values are presented in Table 2.

Figure 5 Comparing StepPCO and adwave showing the relationship between wavelet transform summaries and time of admixture. (A) adwave using m = 1; (B) StepPCO using K = 1024, l = 5, threshold = 0.1, and maxlevel = 6. Numbers indicate the relative standard deviation (RSD, %) for each admixture time. Note the difference in discrimination power between the two methods for older admixture events (95% confidence intervals as dashed blue and green horizontal lines).


In the future, considering the full wavelet periodogram, rather than the genome-wide summary measures used in both adwave and StepPCO, may yield promising results wherever the distribution of ancestry tracts along the genome is substantially nonstationary. Bryc et al. (2010) use their formulation of localized admixture signals to address whether regions of the genome show predominant ancestry from a given population. Wavelets are well suited to distinguishing local features in data and could be helpful in this regard, identifying features that may not be easily detected by considering the localized admixture signals in their raw form.

Other prospective areas for further work include the extension of these methods to the more general case of multipopulation admixture. Ma and Amos (2012) describe the use of PCA as a diagnostic in this setting, and PCA has been used to assign multipopulation ancestry in the software PCAdmix (Brisbin et al. 2012). The wavelet methods described here could be extended in a similar way by considering pairwise combinations of any number of ancestral populations.

In contrast, key restrictions that determine our ability to reconstruct admixture events include the degree of differentiation between the ancestral populations and the representativeness of samples used as surrogate ancestral groups. As the ancestral populations become more similar or the surrogate populations become more different from the true ancestral populations, the localized admixture signals become increasingly noisy. Although this ultimately leads to a loss of identifiability in extreme cases, the method is remarkably robust to moderate deviations from these assumptions. As shown above for low admixture proportions, through judicious choice of the thresholding parameter even extremely noisy data can still provide meaningful estimates (the only situation in which we encourage deviation from the default setting).

Sample size (both in terms of SNPs and individuals) is also important and affects the PCA step of the procedure. The purpose of the PCA step is to summarize the overall variability among individuals, which includes both between-population and within-population variability. In reconstructing population ancestry, we aim to describe between-population variation, while ignoring within-population variation. This is achieved by selecting the first principal component, as long as the sample sizes are sufficiently large. Within-population fluctuations of individual coordinates on the PCA scatterplot can be caused by subtle population substructure. Assuming that no such substructure is present, these fluctuations decrease as the total sample size increases, and an asymptotically stable pattern of the eigenvector plot results (Ma and Amos 2010). When the number of individuals is large, variation between individuals from the same population is small compared to that between populations, so that the first eigenvector describes the primary population structure of the data. However, as the sample size decreases, individual variation carries more weight and may be reflected in more than the first principal component. Note that the use of methods other than PCA may be helpful in this regard. For example, Jombart et al. (2010) introduced discriminant analysis of principal components to achieve separation of individuals into predefined groups. In practice, as long as the sizes of the ancestral population samples are sufficiently large, discriminant analysis provides the same result as PCA (unpublished data). The two methods may, however, perform differently for small sample sizes.

Figure 6 PCA of autosomal SNP data from Indonesian populations, with Southern Han Chinese (blue circles) and Papua New Guinea Highlanders (green circles) employed as proxy ancestral populations. Numbers give calculated admixture proportions.

Figure 7 Dating time of admixture for Bena (Flores, eastern Indonesia) using approximate Bayesian computation. (A) Relationship between admixture time and average block size metric for all simulations; (B) weighted posterior distribution of admixture time. Median estimated time of admixture, indicated by the blue line, is 147 generations (95% credible region: 122–178 generations).

Figure 8 Dual admixture events at 160 and 10–80 generations. Gray bars represent the average over 50 simulations for each scenario; black bars represent the range for individual simulations. Blue bars show the average informative wavelet variance for a single admixture event at 160 generations, providing a reference point for comparison.

Table 3 Classification rate for the summary measures average block size and informative wavelet variance with increasing sample size (1 ≤ nC ≤ 10)

                             Correct classification (%)
Sample size (individuals)    Wavelet variance    Average block size
1                            76                  56
2                            84                  58
3                            89                  59
4                            92                  60
5                            94                  61
6                            96                  61
7                            97                  62
8                            98                  62
9                            99                  61
10                           100                 60

How far back in time admixture processes can be reliably identified is strongly influenced by the number of genotyped SNPs. The relationship between the number of admixture blocks, time of admixture, and wavelet scale is summarized in Table 4. The shaded column indicates the findings described in the Results, using simulated data sets of 13,000 SNPs (chosen for a region ~100 Mb in length, comparable to the SNP content of our 100 Mb chromosome 15 data set). For admixture at 10 generations, the informative wavelet variance is highest at scale 13, reflecting a small number of large admixture blocks. As the time of admixture increases, the peak shifts toward lower scales, reflecting a larger number of smaller admixture blocks. This pattern is illustrated for admixture up to 320 generations (~10,000 years), but importantly, it is possible to reconstruct even older admixture events. The highest frequency (relating to the smallest admixture blocks) that can be detected, as determined purely by the data density, is termed the Nyquist frequency (Chatfield 2003). However, resolution power is likely to deteriorate well before this point and will be strongly influenced by the degree of differentiation between the ancestral populations. The more closely related the ancestral populations, the less well they can be discriminated using only a small number of SNPs. Increasing the SNP density allows detection of higher frequency information, relating to shorter (more ancient) admixture tracts. To illustrate this, the mapping to wavelet scale is shown for a hypothetical twofold and fourfold increase in the number of genotyped SNPs (26,000 and 104,000 SNPs, respectively). As genetic data sets improve (particularly through whole-genome sequencing), wavelet methods will therefore gain substantial resolution. It seems entirely feasible that wavelet approaches will have sufficient statistical power to reconstruct admixture events far deeper in time than those currently studied. Advances in wavelet methods therefore offer exciting potential for future research, particularly for ancient and complex human admixture processes.
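The dependence of the wavelet scale mapping on SNP count can be made concrete with a short sketch. Both functions below are readings of the patterns reported in this article rather than formulas stated by it: the number of scales is assumed to be the largest J with 2^J no greater than the number of SNPs (consistent with 13 scales for 13,000 SNPs and 11 scales for the downsampled data), and the dominant scale is assumed to follow the pattern in Table 4, where admixture 10 generations ago peaks at the coarsest scale and each doubling of time lowers the peak by one scale. The function names are hypothetical.

```python
import math


def number_of_scales(n_snps):
    """Largest wavelet scale J such that 2**J <= n_snps (assumed rule)."""
    return n_snps.bit_length() - 1


def dominant_scale(time_generations, n_snps):
    """Scale at which the informative wavelet variance is expected to peak.

    Assumption: follows the pattern read off Table 4, anchored at the
    coarsest scale for admixture 10 generations ago.
    """
    return number_of_scales(n_snps) - round(math.log2(time_generations / 10))


scales_reference = number_of_scales(13_000)
scales_doubled = number_of_scales(26_000)
peak_recent = dominant_scale(10, 13_000)
peak_old = dominant_scale(320, 13_000)
```

Under these assumptions, doubling the number of SNPs adds one scale to the decomposition, so an admixture event of a given age moves one scale further from the finest (highest frequency) end, leaving more room to resolve older events.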

Software

Software for the analyses described here has been released in the form of an R package, adwave, which is freely available from the R project's central package repository: http://cran.r-project.org/web/packages/adwave/index.html

Acknowledgments

We gratefully acknowledge assistance with sample collection by Agustini Leonita and Alida Harahap (Eijkman Institute for Molecular Biology, Jakarta, Indonesia) and J. Stephen Lansing (University of Arizona), as well as data processing by Olga Savina (University of Arizona). We also thank Matthew Nunes (University of Lancaster, United Kingdom) and Martin Hazelton (Massey University, New Zealand) for their valuable comments. This research was supported by the Royal Society of New Zealand through a Rutherford Fellowship (RDF-10-MAU-001) and Marsden Grant (11-MAU-007) to M.P.C., and by funding from the Allan Wilson Center for Molecular Biology and Evolution.

Table 4 Relationship between the number of admixture blocks, time of admixture, and wavelet scale

                                                         Wavelet scale (no. of SNPs)
Admixture blocks    Time of admixture (generations)      13,000    26,000    104,000
8,192–16,384        163,840                              –         –         1
4,096–8,192         81,920                               –         1         2
2,048–4,096         40,960                               1         2         3
1,024–2,048         20,480                               2         3         4
512–1,024           10,240                               3         4         5
256–512             5,120                                4         5         6
128–256             2,560                                5         6         7
64–128              1,280                                6         7         8
32–64               640                                  7         8         9
16–32               320                                  8         9         10
8–16                160                                  9         10        11
4–8                 80                                   10        11        12
2–4                 40                                   11        12        13
1–2                 20                                   12        13        14
0–1                 10                                   13        14        15

The dominant admixture block size decreases with time since admixture, while conversely, the number of admixture blocks increases. Underlined numbers are from the example presented in the Results section (13,000 SNPs from a genomic region ~100 Mb in length, comparable to the data set for chromosome 15). Columns to the right show how mapping to wavelet scale depends heavily on SNP density: increasing the number of SNPs two- and fourfold allows higher frequency information to be detected, which in turn informs about shorter (more ancient) admixture tracts.

Literature Cited

Baran, Y., B. Pasaniuc, S. Sankararaman, D. G. Torgerson, C. Gignoux et al., 2012 Fast and accurate inference of local ancestry in Latino populations. Bioinformatics 28: 1359–1367.

Beaumont, M. A., W. Zhang, and D. J. Balding, 2002 Approximate Bayesian Computation in population genetics. Genetics 162: 2025–2035.

Bellwood, P., 2007 Prehistory of the Indo-Malaysian Archipelago. ANU E Press, Canberra, Australia.

Brisbin, A., K. Bryc, J. Byrnes, F. Zakharia, L. Omberg et al., 2012 PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Hum. Biol. 84: 343–364.

Brown, R., and B. Pasaniuc, 2014 Enhanced methods for local ancestry assignment in sequenced admixed individuals. PLOS Comput. Biol. 10: e1003555.

Bryc, K., A. Auton, M. R. Nelson, J. R. Oksenberg, S. L. Hauser et al., 2010 Genome-wide patterns of population structure and admixture in West Africans and African Americans. Proc. Natl. Acad. Sci. USA 107: 786–791.

Chatfield, C., 2003 The Analysis of Time Series: An Introduction, 6th Ed. Chapman & Hall/CRC, Boca Raton, FL.

Chen, G. K., P. Marjoram, and J. D. Wall, 2009 Fast and flexible simulation of DNA sequence data. Genome Res. 19: 136–142.

Cox, M. P., A. E. Woerner, J. D. Wall, and M. F. Hammer, 2008 Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow. BMC Genet. 9: 76.

Csilléry, K., M. G. B. Blum, O. E. Gaggiotti, and O. François, 2010 Approximate Bayesian Computation (ABC) in practice. Trends Ecol. Evol. 25: 410–418.

Csilléry, K., O. François, and M. G. B. Blum, 2012 abc: an R package for approximate Bayesian computation (ABC). Methods Ecol. Evol. 3: 475–479.

Falush, D., M. Stephens, and J. K. Pritchard, 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567–1587.

Fenner, J. N., 2005 Cross-cultural estimation of the human generation interval for use in genetics-based population divergence studies. Am. J. Phys. Anthropol. 128: 415–423.

Gravel, S., 2012 Population genetics models of local ancestry. Genetics 191: 607–619.

Hellenthal, G., G. B. J. Busby, G. Band, J. F. Wilson, C. Capelli et al., 2014 A genetic atlas of human admixture history. Science 343: 747–751.

Jombart, T., S. Devillard, and F. Balloux, 2010 Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 11: 94.

Liò, P., 2003 Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics 19: 2–9.

Loh, P.-R., M. Lipson, N. Patterson, P. Moorjani, J. K. Pickrell et al., 2013 Inferring admixture histories of human populations using linkage disequilibrium. Genetics 193: 1233–1254.

Ma, J., and C. I. Amos, 2010 Theoretical formulation of principal components analysis to detect and correct for population stratification. PLoS ONE 5: e12510.

Ma, J., and C. I. Amos, 2012 Principal components analysis of population admixture. PLoS ONE 7: e40115.

Maples, B. K., S. Gravel, E. E. Kenny, and C. D. Bustamante, 2013 RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am. J. Hum. Genet. 93: 278–288.

McVean, G., 2009 A genealogical interpretation of principal components analysis. PLoS Genet. 5: e1000686.

Moorjani, P., N. Patterson, J. N. Hirschhorn, A. Keinan, L. Hao et al., 2011 The history of African gene flow into southern Europeans, Levantines, and Jews. PLoS Genet. 7: e1001373.

Novembre, J., and M. Stephens, 2008 Interpreting principal component analyses of spatial population genetic variation. Nat. Genet. 40: 646–649.

Patterson, N., A. L. Price, and D. Reich, 2006 Population structure and eigenanalysis. PLoS Genet. 2: e190.

Percival, D. P., 1995 On estimation of the wavelet variance. Biometrika 82: 619–631.

Pool, J. E., and R. Nielsen, 2009 Inference of historical changes in migration rate from the lengths of migrant tracts. Genetics 181: 711–719.

Price, A. L., A. Tandon, N. Patterson, K. C. Barnes, N. Rafaels et al., 2009 Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 5: e1000519.

Pugach, I., R. Matveyev, A. Wollstein, M. Kayser, and M. Stoneking, 2011 Dating the age of admixture via wavelet transform analysis of genome-wide data. Genome Biol. 12: R19.

Purcell, S., 2009 PLINK, http://pngu.mgh.harvard.edu/purcell/plink/.

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira et al., 2007 PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559–575.

R Development Core Team, 2014 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Robinson, J. D., L. Bunnefeld, J. Hearn, G. N. Stone, and M. J. Hickerson, 2014 ABC inference of multi-population divergence with admixture from unphased population genomic data. Mol. Ecol. 23: 4458–4471.

Sankararaman, S., G. Kimmel, E. Halperin, and M. I. Jordan, 2008 On the inference of ancestries in admixed populations. Genome Res. 18: 668–675.

Sousa, V. C., M. Fritz, M. A. Beaumont, and L. Chikhi, 2009 Approximate Bayesian Computation without summary statistics: the case of admixture. Genetics 181: 1507–1519.

Sunnåker, M., A. G. Busetto, E. Numminen, J. Corander, M. Foll et al., 2013 Approximate Bayesian computation. PLOS Comput. Biol. 9: e1002803.

Whitcher, B., 2013 waveslim: Basic wavelet routines for one-, two- and three-dimensional signal processing. http://cran.r-project.org/web/packages/waveslim/index.html.

Xu, S., I. Pugach, M. Stoneking, M. Kayser, L. Jin et al., 2012 Genetic dating indicates that the Asian-Papuan admixture through Eastern Indonesia corresponds to the Austronesian expansion. Proc. Natl. Acad. Sci. USA 109: 4574–4579.

Communicating editor: K. M. Roeder


GENETICS Supporting Information

http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.176842/-/DC1

Reconstructing Past Admixture Processes from Local Genomic Ancestry Using Wavelet Transformation

Jean Sanderson, Herawati Sudoyo, Tatiana M. Karafet, Michael F. Hammer, and Murray P. Cox

Copyright © 2015 by the Genetics Society of America
DOI: 10.1534/genetics.115.176842


File S1 

SUPPLEMENTARY INFORMATION 

 

Wavelet transform method 

Wavelets can be thought of as localized waves or oscillations, where localization in this context refers to a region of SNPs along the genome (see Figure S5). Our implementation utilizes families of discrete non-decimated wavelets $\psi_{j,k}$, where $j = 1, \ldots, J$ denotes the scale of the wavelet (related to Fourier frequency) and $k$ denotes the location (i.e., SNP number). For detailed introductions to wavelets and their use in statistics, see Nason (2008) and Vidakovic (2009), or the review by Liò (2003) for an introduction to the use of wavelets in biostatistics.

 

The importance of localization can be appreciated by contrasting wavelets with the sinusoids (big waves) used in classical Fourier analysis. Sinusoids are associated with a particular frequency, but do not have the location component provided by wavelets. The fact that each wavelet is associated with a particular small genomic region means that they can capture the structure of the data, especially where the admixture tracts are not uniformly distributed over the chromosome.

 

The wavelet periodogram of a signal is given by the square of the raw wavelet coefficients

$$I_{j,k} = \left| \sum_{t=0}^{T-1} x_t \, \psi_{j,k}(t) \right|^2 \qquad (7)$$

where $x_t$ denotes the signal value at SNP $t$ and $T$ the number of SNPs.

The wavelet periodogram for one individual is shown in Figure S6A. The wavelet transform provides a decomposition of the data in terms of location along the genome on the x-axis and wavelet scale on the y-axis.

 

For the simulated data with 13,000 SNPs, the maximum number of wavelet scales in the decomposition is 13 ($\lfloor \log_2(13000) \rfloor = 13$). Scale 1 captures the highest frequency, very local information. Increasing the scale index provides successively coarser, or lower frequency, information, zooming out of the signal until we reach the level of the entire chromosome. For an individual chromosome, the information is nonstationary in that the width of the admixture tracts can vary over the chromosome. At the population level, the wavelet transforms show greater evidence of stationarity, as demonstrated in Figure S6B. The population average periodogram has a smoother appearance when examined from left to right (i.e., along the genome) within each scale. The information can therefore be conveniently summarized by summing the squared wavelet coefficients (the periodogram) within each scale to give the wavelet variance.
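The scale count and the per-scale summary described above can be illustrated with a small sketch. The code is not taken from adwave: it assumes the Haar wavelet rather than the Daubechies wavelet used by default, a toy square-wave signal in place of a real admixture signal, and hypothetical helper names (`max_scales`, `haar_modwt`, `wavelet_variance`). The wavelet variance is computed here as the mean of the squared coefficients within each scale, which is an assumed normalization.

```python
import numpy as np


def max_scales(n_snps):
    """Largest J with 2**J <= n_snps, i.e. floor(log2(n_snps))."""
    return n_snps.bit_length() - 1


def haar_modwt(x, n_levels):
    """Haar MODWT wavelet coefficients, shape (n_levels, len(x)).

    Circular filtering is used, so no power-of-two length is required.
    """
    x = np.asarray(x, dtype=float)
    coeffs = np.zeros((n_levels, x.size))
    for j in range(1, n_levels + 1):
        half = 2 ** (j - 1)
        # Level-j Haar MODWT filter: +1/2**j on the first half, -1/2**j on the second.
        h = np.concatenate([np.ones(half), -np.ones(half)]) / 2 ** j
        for lag, weight in enumerate(h):
            # np.roll(x, lag)[t] == x[(t - lag) % n], giving a circular convolution.
            coeffs[j - 1] += weight * np.roll(x, lag)
    return coeffs


def wavelet_variance(coeffs):
    """Mean squared coefficient within each scale (assumed normalization)."""
    return np.mean(coeffs ** 2, axis=1)


# Toy signal (an assumption): ancestry blocks of 64 SNPs alternating between +1 and -1.
n = 1024
signal = np.where((np.arange(n) // 64) % 2 == 0, 1.0, -1.0)

coeffs = haar_modwt(signal, max_scales(n))
variance = wavelet_variance(coeffs)

print("maximum scales for 13,000 SNPs:", max_scales(13000))
for scale, value in enumerate(variance, start=1):
    print(f"scale {scale:2d}: {value:.4f}")
```

For this toy signal, scales 8 to 10 carry no variance because their filters average over whole periods of the block pattern, while the finer scales register the block boundaries.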

 

The discrete wavelet transform (DWT) has also been used in a similar context. Our implementation makes use of the Maximal Overlap Discrete Wavelet Transform (MODWT), which has several benefits over the DWT. First, there is no restriction that the length of the data be a power of two, which means it can be applied directly to the available genetic data without first windowing the signal or down-sampling the data. Second, the resulting wavelet coefficients are translation-equivariant, meaning that circularly shifting the data results in the same shifting of the coefficients. With the DWT, shifting the data could lead to a different decomposition. We note that other localized decompositions, such as the short time Fourier transform or continuous wavelet transform, could also be applied.
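The contrast between the two transforms can be checked numerically. The sketch below is an illustration only: it assumes the Haar wavelet at the first scale, and the function names (`haar_modwt_level1`, `haar_dwt_level1`) and the eight-point signal are hypothetical.

```python
import numpy as np


def haar_modwt_level1(x):
    """First-scale Haar MODWT coefficients using circular differences."""
    x = np.asarray(x, dtype=float)
    return (x - np.roll(x, 1)) / 2.0


def haar_dwt_level1(x):
    """First-scale Haar DWT detail coefficients (decimated; needs an even length)."""
    x = np.asarray(x, dtype=float)
    return (x[1::2] - x[0::2]) / np.sqrt(2.0)


x = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
x_shifted = np.roll(x, 1)  # circular shift by one SNP

# MODWT: shifting the data shifts the coefficients by the same amount.
modwt_equivariant = np.allclose(
    haar_modwt_level1(x_shifted), np.roll(haar_modwt_level1(x), 1)
)

# DWT: the same shift produces a different set of detail coefficients.
dwt_original = haar_dwt_level1(x)
dwt_shifted = haar_dwt_level1(x_shifted)

print("MODWT shift-equivariant:", modwt_equivariant)
print("DWT details, original:", dwt_original)
print("DWT details, shifted: ", dwt_shifted)

# The MODWT also accepts lengths that are not a power of two.
odd_length = haar_modwt_level1(np.arange(7))
```

In this example the DWT detail coefficients are all zero for the original signal, because every block boundary falls between coefficient pairs, but they are nonzero after a one-SNP shift.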

 

Pre‐windowing of the admixture signal and visualization 

Both PCAdmix (Brisbin et al. 2012) and StepPCO (Pugach et al. 2011) compute an averaged admixture signal in predefined localized windows along the genome. Our approach instead uses the SNP-level information and offers several advantages. It avoids the subjective choice of properties of the signal (window width and number of bins), and ensures that the information is considered at the most detailed level possible. Subsequent wavelet analysis then considers the data in localized windows, the width of the window increasing as we zoom out to coarser scales. Whether information at a particular window size is informative is determined by reference to the variation observed in the ancestral populations. The informative variation is therefore extracted in an objective, data-driven manner.

 

One possible advantage of pre-windowing the signals is in visualizing local ancestry. Windowing reduces high frequency noise and produces signals that are more easily related to ancestry by eye. For example, applying a window of $W$ SNPs, the signals can be computed as


  1,′

1,

  (8) 

 

  2, | |

0, | |

  (9) 

 

where  ∑ ∈  for  , .  Subsequent  wavelet  analysis  of  the  pre‐

windowed signals would have the same interpretation, as illustrated in Figure S7. 
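As a rough illustration of the windowing idea only (it is not a transcription of equations 8 and 9), the sketch below averages a noisy SNP-level signal over consecutive windows. The function name `prewindow`, the non-overlapping window scheme, the simulated ancestry blocks, and the noise level are all assumptions made for the example.

```python
import numpy as np


def prewindow(signal, w):
    """Average a SNP-level signal over consecutive, non-overlapping windows of w SNPs.

    Trailing SNPs that do not fill a complete window are dropped (an assumption
    made for simplicity).
    """
    signal = np.asarray(signal, dtype=float)
    n_windows = signal.size // w
    return signal[: n_windows * w].reshape(n_windows, w).mean(axis=1)


rng = np.random.default_rng(0)
n_snps, w = 13_000, 130

# Hypothetical ancestry: blocks of 1,300 SNPs alternating between +1 and -1.
ancestry = np.where((np.arange(n_snps) // 1300) % 2 == 0, 1.0, -1.0)
noisy = ancestry + rng.normal(scale=1.0, size=n_snps)

smoothed = prewindow(noisy, w)
truth = prewindow(ancestry, w)  # ancestry averaged the same way, for comparison

print("noise s.d. before windowing:", np.std(noisy - ancestry))
print("noise s.d. after windowing: ", np.std(smoothed - truth))
```

The trade-off is resolution: after windowing, the signal cannot represent ancestry blocks much shorter than the window.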

 

Choice of measurement scale 

The raw admixture signals are estimated at each SNP location (see equation 2) or for each SNP window if pre-windowing is implemented (equations 8 and 9). It is also possible to construct the signals in terms of genetic distance along the chromosome (as opposed to physical distance). Both options are implemented in the adwave software.

 

Threshold choices 

To extract the informative variation in cases where high levels of noise are present in the signals (e.g., at very low admixture proportions), a higher value of the threshold parameter could be selected. Setting a tougher criterion ensures that the raw wavelet variance must be larger before we are willing to accept that it is informative about the admixture process rather than simply being noise. Choice of threshold is a balance between two extremes: too high a threshold may remove informative variation along with the noise, while a weak threshold may result in noise contamination and potentially biased summary measures, such as the ABS metric. The choice of threshold is necessarily data dependent, but we advocate altering the default value only for rare cases that exhibit evidence of high noise.

 

The effects of varying the threshold are illustrated in Figure S8 for two simulated data sets: one with low levels of noise in the resulting admixture signals, and the other with high levels. In this example, a low admixture proportion from one of the ancestral populations is used to mimic the effect of “high noise” in the admixture signals, but the results are also applicable to other sources of noise, such as short divergence times between the ancestral populations (see Discussion in the main text). In low noise situations, the ABS metrics are strongly robust to the choice of threshold, while for high noise situations, a larger value is necessary to avoid bias in the summary measures. The recommended procedure for selecting the threshold is to produce initial results using the default value, and then increase it only if there is evidence of low-scale noise. An automatic method for selecting the threshold may be considered in future work.

 

Also note that any bias due to non-optimal choice of the threshold is avoided in the ABC dating procedure by ensuring that the same value is used for both the simulated and sampled data.
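The effect of raising the threshold can be sketched with made-up numbers. The code below is not a transcription of Equation 5 of the main text. It assumes one simple form such a correction could take: the per-scale wavelet variance of the admixed population minus a multiple of the larger of the two ancestral variances, floored at zero. The function name `informative_variance`, the `factor` argument, and all variance values are hypothetical.

```python
import numpy as np


def informative_variance(admixed, ancestral_a, ancestral_b, factor=1.0):
    """Assumed correction: subtract factor * ancestral noise per scale, floor at zero."""
    noise = np.maximum(ancestral_a, ancestral_b)
    return np.clip(np.asarray(admixed) - factor * noise, 0.0, None)


# Hypothetical raw wavelet variances for scales 1 to 8 (scale 1 = highest frequency).
ancestral_a = np.array([0.40, 0.20, 0.10, 0.05, 0.02, 0.01, 0.01, 0.01])
ancestral_b = np.array([0.35, 0.18, 0.09, 0.05, 0.02, 0.01, 0.01, 0.01])
admixed = np.array([0.60, 0.30, 0.15, 0.10, 0.30, 0.50, 0.20, 0.05])

default = informative_variance(admixed, ancestral_a, ancestral_b, factor=1.0)
stricter = informative_variance(admixed, ancestral_a, ancestral_b, factor=2.0)

print("factor 1:", np.round(default, 2))
print("factor 2:", np.round(stricter, 2))
print("peak scale (factor 1):", int(np.argmax(default)) + 1)
print("peak scale (factor 2):", int(np.argmax(stricter)) + 1)
```

With these invented values, the stricter factor removes the residual variance at the four finest scales while the peak stays at the same scale, which is the qualitative behavior described for the high noise case.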

 

 


Sensitivity to method options  

The default method options are to estimate the raw admixture signals for each SNP location, constructed according to physical distance (as opposed to genetic distance) without windowing the signals, and using Daubechies’ Least Asymmetric wavelet number 8.

 

Other options are also implemented in the adwave software, providing flexibility that may be required for different applications. Sensitivity to the different options was assessed by repeating the admixture time example under variations on the default options. A summary of these results is presented in Table S2 and Figure S9.

 

In this instance, results using the Haar wavelet (condition 9) are very similar to the MODWT default. The slight variation in the ABS metrics is expected since different wavelets cover slightly different frequency ranges, although the effect on the results is insubstantial.

 

The effect of pre-windowing the signals is illustrated for two window sizes: 130 SNPs (condition 10) and 65 SNPs (condition 11). Choice of window size clearly modifies the relationship between admixture time and the resulting ABS metrics. As illustrated by condition 10, if the window size is too large, it will not be able to capture the small admixture blocks characteristic of ancient admixture. Lack of windowing as the default approach is a major point of advantage of adwave over StepPCO.

 


When constructing signals in terms of genetic distance, it is necessary to specify the number of bins for the signal and the size of the analyzing window. The example represented by condition 12 utilizes 13,000 bins (i.e., one per SNP), and small windows of 13 SNPs so that the resulting decomposition is over the same number of wavelet scales and the effect of windowing is minimized. This choice of options provides results that are consistent with the default.

 

The default options provide ease of implementation, avoiding the subjective choice of properties of the signal (window width and number of bins), and ensuring that the signal information is considered at the most detailed level possible. Nevertheless, experienced users are free to vary these parameters.

 

Method comparison: a demonstration for one population  

Using StepPCO, formation of the localized admixture signals requires specification of the number of bins in the signal and a tolerance for the window size. Pugach et al. (2011) recommend that the number of bins should be chosen so that the windows span the entire chromosome, leaving no gaps in between. For their wavelet analysis, it is a strict requirement that the number of bins is a power of two.

 

Window size is allowed to vary along the chromosome and is specified via an automatic method, for which it is necessary to set a tolerance. Starting with a small window of SNPs, window size is increased until the mean PCA coordinates of the ancestral populations are separated by the chosen tolerance, measured in standard deviations. Pugach et al. (2011) use 1024 bins and a tolerance of 3 for their implementation. For our demonstration, we have used a stronger window criterion, a tolerance of 5, to ensure that the localized windows cover the entire chromosome.

 

To produce the wavelet summaries, StepPCO uses a three stage filtering procedure:

1. Coefficients smaller than a specified threshold are set to zero, to remove low amplitude oscillations. This parameter was set to 0.1 for our application, following advice stated in the accompanying software manual.
2. Wavelet scales that correspond to high frequencies are deemed characteristic of noise and removed completely. For guidance on setting this option, the manual states that it depends on the length of the chromosome and suggests a maximum scale of 7, 6 and 5 for chromosomes 1-5, 6-20 and 21-22, respectively. For our example, we truncate at 6 scales, since the number of SNPs in the example is comparable to chromosomes 6-10 in the StepPCO paper.
3. A scale dependent threshold is then applied. The threshold is computed by averaging the wavelet coefficients across each scale, and subtracting the maximum value observed in the ancestral populations.
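The three stages listed above can be sketched as follows. This is a hedged illustration of the description in the text, not StepPCO's code: the function name `steppco_like_summary`, the array layout (rows ordered from the highest-frequency scale to the coarsest), the use of absolute coefficient values in the per-scale average, and the flooring of the result at zero are all assumptions.

```python
import numpy as np


def steppco_like_summary(admixed, ancestral, amp_threshold=0.1, n_keep=6):
    """Three-stage filter sketched from the description above.

    admixed:   array (scales, positions) of wavelet coefficients.
    ancestral: array (individuals, scales, positions) for the ancestral samples.
    Rows are assumed to run from the highest-frequency scale to the coarsest.
    """
    def per_scale_mean(coeffs):
        coeffs = np.asarray(coeffs, dtype=float)
        # Stage 1: set low amplitude coefficients to zero.
        coeffs = np.where(np.abs(coeffs) < amp_threshold, 0.0, coeffs)
        # Stage 2: drop the high-frequency scales, keeping the n_keep coarsest.
        coeffs = coeffs[..., -n_keep:, :]
        # Average the coefficient magnitudes within each remaining scale.
        return np.abs(coeffs).mean(axis=-1)

    admixed_mean = per_scale_mean(admixed)      # shape (n_keep,)
    ancestral_mean = per_scale_mean(ancestral)  # shape (individuals, n_keep)
    # Stage 3: subtract the maximum ancestral value per scale, floored at zero.
    return np.clip(admixed_mean - ancestral_mean.max(axis=0), 0.0, None)


# Hypothetical coefficients: 10 scales (as for 1024 bins) and 16 positions.
admixed = np.full((10, 16), 0.5)
ancestral = np.full((3, 10, 16), 0.2)

summary = steppco_like_summary(admixed, ancestral)
print(summary)
```

With 10 scales and `n_keep=6`, the four highest-frequency scales are discarded before any summary is formed.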

 

Stage 3 in this procedure is similar to the adwave thresholding process described by Equation 5, but in adwave, this correction is based on population averages rather than individual-level values. StepPCO therefore uses a stronger threshold than the adwave procedure (i.e., it removes more of the raw information).


 

With adwave, it is not necessary to pre-window the signal. A method demonstration is shown in Figure 1, using the raw SNP-level data and the default threshold value of 1. However, in order to provide a closer comparison with StepPCO, we also provide results using options similar to those applied by Pugach et al. (2011). The localized admixture signals were formed using 1024 points along the chromosome, sampled according to genetic distance with a fixed window size of 13,000 × 0.0025 = 37 SNPs (chosen to mimic the mean window size obtained by StepPCO).

 

A comparison of both methods for one simulated population with an admixture time of 160 generations is provided in Figure S3. The admixture signals produced by StepPCO have a variable window size of 2 to 195 SNPs with mean 39.2 and median 29. The variable window size can sometimes lead to instability in the signals (shown by the ‘spikes’ in Figure S3A, which correspond to windows with small numbers of SNPs). It is possible to set upper and lower bounds for the number of SNPs per window, but this requires more user choice of runtime settings.

 

The raw StepPCO wavelet summaries presented in Figure S3B are similar to those obtained by adwave (Figure S3E), but exhibit a larger amount of high-frequency noise, as is apparent in all three populations. The final StepPCO wavelet summaries (Figure S3C) look similar to the final informative wavelet variance of adwave (Figure S3F), but without the four highest-frequency scales. Truncation of these high-frequency scales will have a particularly large influence for older admixture events, an issue that is mentioned in the Pugach et al. (2011) paper.

   


LITERATURE CITED 

 

Brisbin, A., K. Bryc, J. Byrnes, F. Zakharia, L. Omberg et al., 2012 PCAdmix: principal components-based assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Hum. Biol. 84: 343–364.

Cox, M. P., A. E. Woerner, J. D. Wall, and M. F. Hammer, 2008 Intergenic DNA sequences from the human X chromosome reveal high rates of global gene flow. BMC Genetics 9: 76.

Liò, P., 2003 Wavelets in bioinformatics and computational biology: state of art and perspectives. Bioinformatics 19: 2–9.

Nason, G., 2008 Wavelet Methods in Statistics with R. Springer, New York; London.

Pugach, I., R. Matveyev, A. Wollstein, M. Kayser, and M. Stoneking, 2011 Dating the age of admixture via wavelet transform analysis of genome-wide data. Genome Biology 12: R19.

Vidakovic, B., 2009 Statistical Modeling by Wavelets. John Wiley & Sons: New York.

   


 

 

| Parameter | Value | Reason |
|---|---|---|
| N_Ancestral | 10,500 | Average N_A from Cox et al. (2008) |
| N_Asian | 2,050 | Average N_e for Han Chinese from Cox et al. (2008) |
| N_Melanesian | 800 | Average N_e for Melanesians from Cox et al. (2008) |
| N_Admixed | 1,425 | Average of N_Asian and N_Melanesian |
| T_Admixture | 160 gen | ~4,000 years ago; starting value, varied in simulations |
| T_Ancestral | 2,000 gen | ~50,000 years ago from Cox et al. (2008) |
| P_Admixture | 0.5 | Starting value; varied in simulations between (0,1) |

  

Figure S1   Demographic model and parameters.      


 Figure S2   Distribution of recombination rates (cM/Mb) across the 22 human autosomal chromosomes. Note that chromosome 1 is representative of the other chromosomes.      


  

  Figure S3   Method demonstration for (top) StepPCO and (bottom) adwave.  


  Figure S4   Informative wavelet variance for 16 Indonesian populations. The average block size (ABS) metric is indicated by a vertical line.      


  

 Figure S5   Daubechies’ Least Asymmetric wavelet with eight vanishing moments.       


  

Figure S6   Wavelet periodograms A) for one individual and B) for the population average. Periodograms have been smoothed using a simple density smoother.    


 

  

 Figure S7   Visualization of ancestral block structure by windowing the admixture signal, using the same data as in Figure 1, but pre‐windowing signals with W = 130. A) Admixture signal for one individual; B) raw wavelet variance for each population showing less high frequency noise than the non‐windowed version in Figure 1B; C) informative variation after correcting for noise estimated from the ancestral populations.    


 

Figure S8   Relationship between choice of thresholding parameter and the resulting informative wavelet variance and ABS metrics for two simulated examples. A) For admixture signals with low noise levels, increasing the thresholding parameter results in a decrease in the magnitude of the extracted informative wavelet variance, but the location of the peak remains unchanged. B) For admixture signals with high noise levels, increasing the thresholding parameter successfully eliminates the noise observed at low scales, while maintaining the peak in the informative wavelet variance that is attributed to the admixture process. C) The resulting ABS metrics for both the low and high noise examples. For low levels of noise, the ABS metrics are robust to choice of the thresholding parameter, while for high levels of noise, a larger value is necessary to avoid bias.

    


  Figure S9   Sensitivity to different method options. The grey area reflects the range of ABS values observed under the default method options (condition 1). Condition descriptions and numeric values are presented in Table S2.   


Table S1   Chromosome level information. The reported number of SNPs for each chromosome reflects SNPs that are not fixed in both of the ancestral reference populations; percentage missing data due to failed genotyping; size of analyzing window used to combine information across chromosomes.   

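| Chromosome | Number of SNPs | Missing data (%) | Analyzing window |
|---|---|---|---|
| 1 | 32,417 | 0.75 | 7.52 |
| 2 | 35,221 | 0.69 | 8.17 |
| 3 | 32,005 | 0.62 | 7.43 |
| 4 | 29,117 | 0.65 | 6.76 |
| 5 | 28,179 | 0.70 | 6.54 |
| 6 | 33,862 | 0.72 | 7.86 |
| 7 | 24,676 | 0.73 | 5.73 |
| 8 | 25,025 | 0.67 | 5.81 |
| 9 | 20,914 | 0.65 | 4.85 |
| 10 | 21,755 | 0.76 | 5.05 |
| 11 | 20,494 | 0.77 | 4.76 |
| 12 | 21,398 | 0.74 | 4.97 |
| 13 | 17,564 | 0.63 | 4.08 |
| 14 | 14,552 | 0.77 | 3.38 |
| 15 | 13,856 | 0.75 | 3.22 |
| 16 | 13,402 | 0.81 | 3.11 |
| 17 | 9,348 | 0.90 | 2.17 |
| 18 | 14,246 | 0.74 | 3.31 |
| 19 | 5,733 | 1.12 | 1.33 |
| 20 | 10,553 | 0.87 | 2.45 |
| 21 | 6,403 | 0.74 | 1.49 |
| 22 | 4,309 | 1.16 | 1.00 |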

 

    


Table S2   Sensitivity to different method options. The mean ABS (relative standard deviation in parentheses) is given for each admixture time.  

 

                       

Method option (condition and description) by admixture time (generations):

| Condition | Description | 10 gen | 20 gen | 40 gen | 80 gen | 160 gen | 320 gen |
|---|---|---|---|---|---|---|---|
| 1 | Reference | 11.68 (0.69) | 11.27 (0.58) | 10.76 (0.70) | 10.22 (0.66) | 9.61 (1.12) | 8.82 (1.72) |
| 9 | Haar wavelet | 11.55 (0.65) | 11.14 (0.56) | 10.65 (0.70) | 10.13 (0.68) | 9.54 (1.08) | 8.84 (1.61) |
| 10 | Pre-windowing signal (130 SNPs) | 11.51 (0.73) | 11.10 (0.56) | 10.62 (0.58) | 10.14 (0.57) | 9.70 (0.79) | 9.34 (0.99) |
| 11 | Pre-windowing signal (65 SNPs) | 11.50 (0.78) | 11.07 (0.60) | 10.54 (0.61) | 9.99 (0.64) | 9.44 (0.85) | 8.95 (1.10) |
| 12 | Genetic distance | 11.74 (0.66) | 11.32 (0.58) | 10.83 (0.64) | 10.28 (0.65) | 9.74 (0.98) | 9.16 (1.33) |