TeraPCA: a fast and scalable software package to study ... · based library which is widely used...

6
Genetics and Population Analysis TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes Aritra Bose 1,, Vassilis Kalantzis 2,, Eugenia Kontopoulou 1,, Mai Elkady 1 , Peristera Paschou 3,* and Petros Drineas 1 1 Computer Science Department, Purdue University, West Lafayette, IN, 47907, USA and 2 IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA and 3 Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA. * To whom correspondence should be addressed. Equal Contribution. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Motivation: Principal Component Analysis (PCA) is a key tool in the study of population structure in human genetics. As modern datasets become increasingly larger in size, traditional approaches based on loading the entire dataset in the system memory (RAM) become impractical and out-of-core implementations are the only viable alternative. Results: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform PCA of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able to successfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover, TeraPCA has minimal dependencies on external libraries and only requires a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires less than five hours (in multi-threaded mode) to accurately compute the ten leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task. Availability: Source code and documentation are both available at https://github.com/aritra90/TeraPCA Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction Principal Component Analysis (PCA) is perhaps the most fundamental unsupervised linear dimensionality reduction technique. It was invented by Pearson in the early 1900s (Pearson, 1901); and later reinvented and named by Hotelling in the 1930s (Hotelling, 1933, 1936). In statistical parlance, PCA converts a set of observations of possibly correlated variables into a set of linearly uncorrelated (orthogonal) variables called principal components (PCs). The seminal work of Luca Cavalli-Sforza and collaborators in the late 1970s (Menozzi et al., 1978; Chisholm et al., 1995) pioneered the application of PCA for the study of human genetic variation. PCA analyses and plots appear in virtually every single paper that analyzes human genetic variation in order to make inferences about population structures. Given m samples genotyped on n genetic loci, it is well-known that applying PCA on the m × m covariance matrix that emerges by computing any reasonable notion of genotypic distance between every pair of samples using the n genotyped loci results in the observation that the leading PCs mirror geography, e.g. see (Novembre et al., 2008; Wang et al., 2010; Paschou et al., 2014) for detailed discussions and examples. This observation was leveraged by (Price et al., 2006; Patterson et al., 2006; Price et al., 2010) to derive one of the most established methods to account (and correct) for the confounding effects of population stratification in genome-wide association studies (GWAS). The method in (Price et al., 2006; Patterson et al., 2006; Price et al., 2010) is essentially equivalent to using a small number of leading PCs as covariates in order to check for associations between genetic loci and affection status in statistical tests, and is implemented in the EIGENSTRAT software package which is routinely used in GWAS analyses to correct for population stratification. Other applications of PCA include the identification of sets of genetic loci that are ancestry- informative or are under selective pressure (Paschou et al., 2007; Price 1 © The Author(s) (2019). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] Downloaded from https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo University Library user on 08 April 2019

Transcript of TeraPCA: a fast and scalable software package to study ... · based library which is widely used...

Page 1: TeraPCA: a fast and scalable software package to study ... · based library which is widely used for solving systems of linear equations, least-squares problems, eigenvalue problems,

Genetics and Population Analysis

TeraPCA: a fast and scalable software package tostudy genetic variation in tera-scale genotypesAritra Bose 1,†, Vassilis Kalantzis 2,†, Eugenia Kontopoulou 1,†, Mai Elkady 1,Peristera Paschou 3,∗ and Petros Drineas 1

1Computer Science Department, Purdue University, West Lafayette, IN, 47907, USA and2IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA and3Department of Biological Sciences, Purdue University, West Lafayette, IN, 47907, USA.

∗To whom correspondence should be addressed. †Equal Contribution.

Associate Editor: XXXXXXX

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Abstract

Motivation: Principal Component Analysis (PCA) is a key tool in the study of population structure in humangenetics. As modern datasets become increasingly larger in size, traditional approaches based on loadingthe entire dataset in the system memory (RAM) become impractical and out-of-core implementations arethe only viable alternative.Results: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method toperform PCA of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and is able tosuccessfully operate even on commodity hardware with a system memory of just a few gigabytes. Moreover,TeraPCA has minimal dependencies on external libraries and only requires a working installation of theBLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on amillion markers, TeraPCA requires less than five hours (in multi-threaded mode) to accurately computethe ten leading principal components. An extensive experimental analysis shows that TeraPCA is both fastand accurate and is competitive with current state-of-the-art software for the same task.Availability: Source code and documentation are both available at https://github.com/aritra90/TeraPCAContact: [email protected] information: Supplementary data are available at Bioinformatics online.

1 IntroductionPrincipal Component Analysis (PCA) is perhaps the most fundamentalunsupervised linear dimensionality reduction technique. It was invented byPearson in the early 1900s (Pearson, 1901); and later reinvented and namedby Hotelling in the 1930s (Hotelling, 1933, 1936). In statistical parlance,PCA converts a set of observations of possibly correlated variables into a setof linearly uncorrelated (orthogonal) variables called principal components(PCs). The seminal work of Luca Cavalli-Sforza and collaborators in thelate 1970s (Menozzi et al., 1978; Chisholm et al., 1995) pioneered theapplication of PCA for the study of human genetic variation.

PCA analyses and plots appear in virtually every single paper thatanalyzes human genetic variation in order to make inferences aboutpopulation structures. Given m samples genotyped on n genetic loci,it is well-known that applying PCA on the m × m covariance matrix

that emerges by computing any reasonable notion of genotypic distancebetween every pair of samples using the n genotyped loci results in theobservation that the leading PCs mirror geography, e.g. see (Novembreet al., 2008; Wang et al., 2010; Paschou et al., 2014) for detaileddiscussions and examples. This observation was leveraged by (Priceet al., 2006; Patterson et al., 2006; Price et al., 2010) to derive one of themost established methods to account (and correct) for the confoundingeffects of population stratification in genome-wide association studies(GWAS). The method in (Price et al., 2006; Patterson et al., 2006;Price et al., 2010) is essentially equivalent to using a small number ofleading PCs as covariates in order to check for associations betweengenetic loci and affection status in statistical tests, and is implemented inthe EIGENSTRAT software package which is routinely used in GWASanalyses to correct for population stratification. Other applications ofPCA include the identification of sets of genetic loci that are ancestry-informative or are under selective pressure (Paschou et al., 2007; Price

1

© The Author(s) (2019). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo U

niversity Library user on 08 April 2019

Page 2: TeraPCA: a fast and scalable software package to study ... · based library which is widely used for solving systems of linear equations, least-squares problems, eigenvalue problems,

2 Bose et al.

et al., 2006; Paschou et al., 2008); and, when combined with other linesof evidence such as social structure and linguistics, the extraction ofcomplex population histories and demographic structures (Bose et al.,2017). We also note that PCA extracts the fundamental features of a datasetwithout complex computational modeling. Interestingly, even the output ofmodel-based, more complex, methods to detect population structure (suchas ADMIXTURE (Alexander and Novembre, 2009)) typically exhibitshigh correlation with the output of PCA, rendering further support to thesignificance of PCA in the analysis of human genetics data.

From a computational viewpoint, PCA essentially amounts tocomputing eigenvectors of the m × m (normalized) covariance matrixassociated with the dataset at hand. When m does not exceed a fewthousands, all eigenvectors can be computed by appropriate denselinear algebra routines in LAPACK, a Fortran 90 matrix factorization-based library which is widely used for solving systems of linearequations, least-squares problems, eigenvalue problems, and singularvalue problems (Anderson et al., 1999). Matrix factorization-based denseeigenvalue solvers return allm eigenvectors with a time complexity in theorder ofO(m3), which becomes impractical asm, the number of samples,increases. Practical applications of PCA in population genetics onlyrequire the computation of those principal components (PCs) determinedby the eigenvectors associated with only a few (say 10-20) of thelargest eigenvalues. Computing a few of the leading eigenvalues andassociated eigenvectors of large (sparse or dense) matrices is typicallyachieved by first projecting the original eigenvalue problem onto a low-dimensional subspace which includes an invariant subspace associatedwith the relevant eigenvectors. This low-dimensional subspace can beformed in many different ways, e.g., by means of subspace iteration orKrylov projection schemes and much work in the Numerical Analysiscommunity has been devoted in understanding the theoretical propertiesof such approaches (Parlett, 1998; Saad, 2011). In particular, a variantof the family of Krylov projection schemes, the so-called ImplicitlyRestarted Arnoldi method (IRA), is the projection scheme of choice inFlashPCA2 (Abraham et al., 2017), a software package which has beenshown to outperform other PCA software packages, both in terms ofmemory usage and wall-clock time. On the other hand, recent advancesin the design and analysis of Randomized Numerical Linear Algebra(RandNLA) (Drineas and Mahoney, 2016) algorithms have yielded novelinsights as well as fast and efficient alternatives to approximate the leadingprincipal components of large matrices (Halko et al., 2011; Musco andMusco, 2015; Drineas and Mahoney, 2018; Drineas et al., 2018). Indeed,FastPCA (Galinsky et al., 2016) applied such randomized algorithms toperform PCA analyses in population genetics data.

This paper presents TeraPCA, a C++ software package to performPCA of tera-scale genotypic datasets that can not fully reside in thesystem memory. TeraPCA is essentially an out-of-core implementation ofthe Randomized Subspace Iteration method (Rokhlin et al., 2010; Halkoet al., 2011) and features minimal dependencies to external1 libraries.As the amount of time spent on I/O typically dominates the wall-clocktime in out-of-core scenarios, TeraPCA builds a high-dimensional initialapproximation subspace by loading the dataset from secondary storageexactly once. The dimension of this initial approximation subspace can becontrolled directly by the user. Each subsequent iteration of RandomizedSubspace Iteration “corrects” the initial subspace so that an invariantsubspace associated with the leading target eigenvectors is computed. Thedataset needs to be accessed twice in each iteration, but, fortunately, afew steps of Randomized Subspace Iteration are typically sufficient inpractice in order to get highly accurate approximations to the leading

1 In contrast to FlashPCA2 which relies on the IRA implementation on theSpectra C++ library, TeraPCA comes with an in-house implementation ofthe Randomized Subspace Iteration algorithm.

eigenvectors. Note here that the above idea is somewhat orthogonal tothe ideas underlying IRA, which builds the approximation subspace ina vector-by-vector manner, thus necessitating a large number of datasetfetches from secondary storage to even form an approximation subspacewhose dimension is equal to or slightly larger than the number of PCs thatwe seek to approximate.

TeraPCA was tested extensively on both real (Human GenomeDiversity Panel, 1000 Genomes, etc.) and synthetic datasets. Oursynthetic datasets were generated via the Pritchard-Stephens-Donelly(PSD) model (Pritchard et al., 2000; Gopalan et al., 2016). Our resultssuggest that TeraPCA is both fast and accurate and in most casesoutperforms other out-of-core PCA libraries such as FlashPCA2. Specifichighlights include the computation of the ten leading principal componentsof a dataset of one million samples genotyped on one million geneticmarkers (this dataset exceeds 3.5 TBs in uncompressed format) in about13 hours (using a single thread) and in less than 4.5 hours (using 12 threads).

2 Methods

Simulated Datasets

The first group of the datasets used for our experiments was generated usingthe Pritchard-Stephens-Donelly’s (PSD) model of simulating genotypes.In particular, a recent study (Gopalan et al., 2016) simulated genotypicdata by obtaining individual ancestry proportions from the PSD model tofit the 1000 Genomes dataset and then modelling the per-population allelefrequencies using Wright’sFST and the Weir & Cockerham estimate (Weirand Cockerham, 1984). We developed a multi-threaded C++ packagewhich is essentially an efficient implementation of the R code developedin Tera-Structure (Gopalan et al., 2016). We generated various datasetsin order to evaluate TeraPCA’s performance, with the number of markersranging from 100,000 to 1,000,000 and the number of samples rangingfrom 5,000 to 1,000,000.

Table 1. Our data sets (simulated and real)

DatasetSize

(.PED file)

Size(.BED file)

# Samples # SNPs

S1 (simulated) 19 GB 120 MB 5,000 1,000,000S2 (simulated) 38 GB 239 MB 10,000 1,000,000S3 (simulated) 373 GB 24 GB 100,000 1,000,000S4 (simulated) 1.9 TB 117 GB 500,000 1,000,000S5 (simulated) 3.7 TB 233 GB 1,000,000 1,000,000S6 (simulated) 38 GB 2.4 GB 100,000 100,000S7 (simulated) 150 GB 9.4 GB 2,000 20,000,000HGDP 615 MB 39 MB 1,043 154,4171000 Genomes 8.4 GB 483 MB 2,504 808,704PRK 2 GB 126 MB 4,706 111,831T2D 1.8 GB 111 MB 6,370 72,457

Real Datasets

The Human Genome Diversity Panel (HGDP) dataset consists of 1,043individuals genotyped at 660,734 SNPs, across 51 populations acrossAfrica, Europe, Middle East, South and Central Asia, East Asia, Oceania,and the Americas (Cann et al., 2002). We ran Quality Control (QC) onthe data by filtering SNPs with minor allele frequency below 0.01 andsubsequently pruning for LD using a window size of 1000 kb. Moreover,we set the variance inflation factor to 50 and set r2 > 0.2, thus retaining154,471 variants. We applied the same parameters for LD pruning on

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo U

niversity Library user on 08 April 2019

Page 3: TeraPCA: a fast and scalable software package to study ... · based library which is widely used for solving systems of linear equations, least-squares problems, eigenvalue problems,

TeraPCA 3

the 1000 Genomes dataset which has 2,504 individuals sampled from26 different populations across all continents genotyped at 39 millionSNPs. After QC, we retained approximately 808,704 SNPs and ran ourexperiments on the pruned dataset.

We also tested the performance of TeraPCA on case-control data,which are ubiquitous in population genetics. We used the WellcomeTrust Case Control Consortium’s (WTCCC) Type 2 Diabetes (T2D) andParkinson’s (PRK) datasets. The T2D dataset had 6,371 individuals (1,816cases and 4,555 controls) genotyped on 313,654 SNPs and the PRKdataset had 5,000 individuals (2,000 cases and 3,000 controls) genotypedon 500,000 SNPs. We removed related samples from these datasets andpruned them using the aforementioned QC parameters resulting in datasetswith 6,370 individuals genotyped on 72,457 SNPs for T2D and 4,706individuals genotyped on 111,831 SNPs for Parkinson’s.

TeraPCA

TeraPCA first normalizes the genotypes using the same procedurethat was used by both FlashPCA (Abraham and Inouye, 2014) andFastPCA (Galinsky et al., 2016) (see our supplementary material fordetails) and then applies Randomized Subspace Iteration in an out-of-corefashion.

The main parameters of TeraPCA are as follows (see our supplementarymaterial for more details and our code release for full documentation):

1. Number of PCs to be computed (denoted by k). Default value is setto k := 10.

2. Number of contiguous rows of the SNP-major input matrix fetchedfrom the secondary storage at each time unit (denoted by β). This canbe user-defined or automatically determined based on the availablesystem memory.

3. Dimension of the initial approximation subspace (denoted by s).Default value is set to s := 2k.

4. Convergence tolerance (denoted by tol). Default value is set to tol :=

1e− 3.

The wall-clock time of TeraPCA is affected by all of the aboveparameters. Clearly, reducing tol or increasing k results in an increase ofthe wall-clock time. Using a higher-dimensional approximation subspace,i.e., increasing s, might reduce the corresponding wall-clock time as ittypically enhances convergence towards the k-leading eigenvectors. Onthe other hand, increasing the value of s also increases the amount offloating-point operations performed. Finally, since only a part of the datasetcan fit in the system memory at any time unit, the choice of β is typicallydetermined automatically by TeraPCA based on the size of the systemmemory. The total amount of time spent on I/O is largely independent ofthe value of β but we have observed that the value of β has an effect onthe wall-clock time of the LAPACK routines.

3 Implementation and DiscussionThe performance of TeraPCA was tested on both simulated and real-world genotypic datasets. All our experiments were performed at Purdue’sBrown cluster on a dedicated node which features an Intel Xeon Gold6126 processor running at 2.6 GHz with 96 GB of RAM and a 64-bitCentOS Linux 7 operating system. Table 1 lists the number of samples,number of SNPs, and size of each dataset. Datasets S1 through S7 aresynthetic datasets and the remaining ones are real-world datasets. Thissection provides comparisons between TeraPCA and FlashPCA2. Thelatter has already been shown to be faster than previous methods suchas FlashPCA (Abraham and Inouye, 2014), FastPCA (Galinsky et al.,2016), etc. The results reported throughout the remainder of this sectionwere obtained by setting the amount of system memory made available to

Fig. 1. Projection of the samples of the 1000 Genomes dataset on the top two left singularvectors (PC1 and PC2), as computed by TeraPCA.

TeraPCA (as well as FlashPCA2) to 2 GBs. This is precisely the amountof memory allowed to FlashPCA2 in prior work.

3.1 Synthetic datasets

Datasets S1 through S5 in Table 1 have a fixed number of SNPs (equal toone million) and a varying number of samples (from 5,000 to one million).On the other hand, dataset S6 was used to fine-tune prior state-of-the-artmethods and contains 100,000 samples genotyped on 100,000 SNPs. S7

was used to test the performance of TeraPCA on extremely rectangularmatrices, where the number of SNPs heavily outnumbers the number ofindividuals.

We first consider the plots of the three leading principal componentsreturned by both TeraPCA and FlashPCA2 for dataset S6 (see Figure 1in supplementary material). TeraPCA and FlashPCA2 show a completevisual agreement with each other and both libraries agree with theexpected outcome of the PSD model. For this particular example, TeraPCAterminated in just under 40 minutes, while FlashPCA2 required 141minutes2.

Table 2 lists the wall-clock times achieved by TeraPCA when appliedon datasets S1 through S7. For datasets S4 and S5, which were thelargest ones in our collection, TeraPCA terminated after 7.3 and 13.2 hoursrespectively. On the other hand, FlashPCA2 did not terminate within the50 hours limit that we imposed. TeraPCA outperformed FlashPCA2 onall synthetic datasets, with a speedup that ranged between 1.3 and 4.5, atleast for those datasets where FlashPCA2 terminated within our 50 hourlimit. We note that for all synthetic datasets the leading PCs returned byTeraPCA and FlashPCA2 showed perfect correlation as measured by thePearson correlation coefficient (equal to one in all cases). To further testTeraPCA’s performance on datasets where the number of SNPs heavilyoutnumbers the number of individuals, we applied it to S7 and observed

2 To be fair in our comparisons between TeraPCA and FlashPCA2, weperformed multiple runs of FlashPCA2 on dataset S6 in order to exploreand understand its properties. In particular, we varied the convergencecriterion in FlashPCA2 and recorded the resulting trade-off between wall-clock time and digits of accuracy for the top ten computed eigenvalues.Fixing the convergence tolerance in FlashPCA2 to three digits of accuracyand the maximum number of iterations of FlashPCA2 to 100 was the bestchoice in terms of the tradeoff between running time and accuracy (seeSupplementary text for more details)

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo U

niversity Library user on 08 April 2019

Page 4: TeraPCA: a fast and scalable software package to study ... · based library which is widely used for solving systems of linear equations, least-squares problems, eigenvalue problems,

4 Bose et al.

that even in a heavily under-determined system, TeraPCA outperformedFlashPCA2 by a factor of 2.9, with similar accuracy guarantees.

3.2 Real datasets

We first considered the Human Genome Diversity Panel (HGDP)dataset (Cann et al., 2002). TeraPCA was marginally faster thanFlashPCA2 and both libraries required about seven seconds. A plot ofthe projection of the HGDP dataset along the two leading PCs computedby TeraPCA is shown in Figure 2 in the supplementary material.Given therelatively small size of this dataset, we were able to compute the exactten leading eigenvectors using LAPACK. Figure 2 reports the entry-wise

Fig. 2. Entry-wise relative error of the top ten leading eigenvectors returned by TeraPCAfor the HGDP dataset, compared to the eigenvectors returned by LAPACK. The y-axisshows the relative error; recall that each eigenvector has 1,043 entries. We observe that therelative error is roughly the same for each entry of a specific eigenvector.

error of the ten leading eigenvectors returned by TeraPCA. As expected,eigenvectors associated with the largest eigenvalues are captured moreaccurately since they converge faster.

In addition, Supplementary Table 1 reports the relative and absoluteerrors of the ten leading eigenvalues returned by TeraPCA and FlashPCA2.For TeraPCA, the (much) higher accuracy in the approximation of thethree-four leading eigenvalues is due to the fact that these approximateeigenvalues kept improving as Randomized Subspace Iteration keptiterating to approximate the trailing eigenvalues and eigenvectors. On theother hand, the accuracy in the approximation of the eigenvalues returnedby FlashPCA2 was somewhat uniform for all eigenvalues.

TeraPCA and FlashPCA2 showed similar qualitative and computationalperformance on the pruned 1000 Genomes dataset (see Figure 1), withFlashPCA2 terminating slightly faster than TeraPCA. Notice that thisdataset is also the one in which the number of SNPs outnumbered thenumber of individuals by the largest factor.

PCA is an essential tool to detect population stratication in GWAS.In order to evaluate TeraPCA’s performance on real-world case-controlstudies, we applied it on WTCCC’s T2D and PRK datasets. Like otherreal-world datasets, both FlashPCA2 and TeraPCA performed similarly,needing roughly the same wall-clock time. Execution of TeraPCA on thesedatasets can also be done in-core, as they fit in the system memory, leadingto comparatively faster computation.

Table 2. Wall-clock running times comparisons forthe datasets of Table 1 using a single thread and 2GBs of system memory (∗ indicates no convergenceafter 50 hrs).

Dataset TeraPCA FlashPCA2 Speed-up

S1 26.2 mins 33.3 mins 1.27S2 39.3 mins 87.5 mins 2.22S3 7.9 hrs 35.6 hrs 4.50S4 7.3 hrs n/a∗ ∞S5 13.2 hrs n/a∗ ∞S6 39.5 mins 141.1 mins 3.57S7 37.3 mins 106.5 mins 2.86HGDP 6.5 secs 7.7 secs 1.221000 Genomes 4.3 mins 3.5 mins 0.81T2D 96 secs 119 secs 1.24PRK 76 secs 73 secs 0.96

3.3 Multithreading

The wall-clock times of TeraPCA and FlashPCA2 can significantlyimprove by executing the associated linear algebra computations usingmore than one threads. This is indeed the most obvious way to speed upsoftware such as ours. To test the performance of TeraPCA as a functionof the number of threads, we focused on datasets S1, S2, S4, S6, S7,and the 1000 Genomes dataset. The number of threads was set to4, 8, and 12 and the speedups reported in Figure 3 are against thesingle-thread execution of TeraPCA. Generally speaking, we observed a1.6x-2.8x speedup, which is somewhat sub-optimal. The reason underlyingthis non-optimality is that we used multithreading only for the linearalgebraic operations. However, much of the wall-clock time is spent onI/O operations in order to load the dataset from secondary memory, aprocedure that cannot be multithreaded. We emphasize that FlashPCA2did not demonstrate comparable improvements when multi-threading wasenabled. In particular, when applied to the dataset S6, the wall-clock timeof FlashPCA2 reduced only by two minutes, i.e., from 141 minutes to 139minutes.

Fig. 3. Speedup of TeraPCA over single-threaded execution.

In all of the above experiments we set s := 2k and k := 10. Finally,Supplementary Figure 6 reports the amount of time required to multiplythe (normalized) covariance matrix by a set of s vectors using the DGEMMBLAS routine of MKL and a varying number of threads for different values

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo U

niversity Library user on 08 April 2019

Page 5: TeraPCA: a fast and scalable software package to study ... · based library which is widely used for solving systems of linear equations, least-squares problems, eigenvalue problems,

TeraPCA 5

of s and β for datasets S6 and HGDP. It is worth noting that while anexhaustive analysis lies outside the goals of this paper, it is easy to verifythat doubling the value of s does not double the amount of time requiredto perform the multiplication, while larger values of s also lead to higherspeedups when multiple threads are used. Similarly, very small values ofβ are likely to penalize the performance of DGEMM due to non-optimalcache utilization.

4 Summary and future workIn this paper we presented TeraPCA, a C++ library to perform out-of-corePCA analysis of massive genomic datasets. It is based on RandomizedSubspace Iteration, building upon principled and theoretically soundmethods to approximate the top principal components of massivecovariance matrices. TeraPCA returns highly accurate approximationsto the top principal components, while taking advantage of moderncomputer architectures that support multi-threading and it has minimaldependencies to external libraries. TeraPCA can be applied both in-coreand out-of-core and is able to successfully operate even on personalworkstations with a system memory of just a few gigabytes. Numericalexperiments performed on synthetic and real datasets demonstrate thatTeraPCA performs similarly or better when compared to state-of-the-artsoftware packages such as FlashPCA2, on a single thread and significantlybetter with multi-threading.

Future work will focus on implementing a distributed memory versionof TeraPCA using the Message Passing Interface (MPI) standard. Anotherinteresting research direction would be to combine TeraPCA with blockKrylov subspace techniques.

5 Author’s contributionsAB, VK, EK, and PD conceived and designed the work. AB, VK, andEK developed the TeraPCA C++ package. EK, ME, and AB developedthe C++ package to generate the simulated data sets from the PSD model.PD and PP participated in and discussed analyses. AB and VK ran theexperiments. AB, VK, EK, PD, and PP wrote and revised the manuscript.

6 FundingThis work has been partially supported by NSF [IIS-1661760, IIS-1661756, IIS-1715202] to PP and PD.

7 AcknowledgementsThe authors are grateful to P. Gopalan and W. Hao for sharing their R scriptto generate the simulated data sets as well as for their valuable comments.The authors are also thankful to P. Anappindi for contributing to the codein the pilot phase.

ReferencesAbraham, G. and Inouye, M. (2014). Fast principal component analysis of large-scale

genome-wide data. PLOS ONE, 9(4), 1–5.Abraham, G., Qiu, Y., and Inouye, M. (2017). Flashpca2: principal component

analysis of biobank-scale genotype datasets. Bioinformatics, 33(17), 2776–2778.Alexander, D. H. and Novembre, J. & Lange, K. (2009). Fast model-based estimation

of ancestry in unrelated individuals. Genome Res.Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J.,

Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D.(1999). LAPACK Users’ Guide. Society for Industrial and Applied Mathematics,Philadelphia, PA, third edition.

Bose, A., Platt, D. E., Parida, L., Paschou, P., and Drineas, P. (2017). Dissectingpopulation substructure in india via correlation optimization of genetics andgeodemographics. bioRxiv.

Cann, H. M., de Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L.,Bodmer, J., Bodmer, W. F., Bonne-Tamir, B., Cambon-Thomsen, A., Chen, Z.,Chu, J., Carcassi, C., Contu, L., Du, R., Excoffier, L., Ferrara, G. B., Friedlaender,J. S., Groot, H., Gurwitz, D., Jenkins, T., Herrera, R. J., Huang, X., Kidd, J., Kidd,

K. K., Langaney, A., Lin, A. A., Mehdi, S. Q., Parham, P., Piazza, A., Pistillo,M. P., Qian, Y., Shu, Q., Xu, J., Zhu, S., Weber, J. L., Greely, H. T., Feldman,M. W., Thomas, G., Dausset, J., and Cavalli-Sforza, L. L. (2002). A human genomediversity cell line panel. Science, 296(5566), 261–262.

Chisholm, B., Cavalli-Sforza, L. L., Menozzi, P., and Piazza, A. (1995). The Historyand Geography of Human Genes. The Journal of Asian Studies, 54(2), 490.

Drineas, P. and Mahoney, M. W. (2016). RandNLA: Randomized Numerical LinearAlgebra. Communications of the ACM, 59(6), 80–90.

Drineas, P. and Mahoney, M. W. (2018). Lectures on Randomized Numerical LinearAlgebra, The Mathematics of Data, IAS/Park City Math. Ser., volume 25, pages1–45. Amer. Math. Soc., Providence, RI.

Drineas, P., Ipsen, I. C. F., Kontopoulou, E., and Magdon-Ismail, M. (2018).Structural convergence results for low-rank approximations from block krylovspaces. SIAM Journal of Matrix Analysis and Applications, to appear.

Galinsky, K. J., Bhatia, G., Loh, P.-R., Georgiev, S., Mukherjee, S., Patterson, N. J.,and Price, A. L. (2016). Fast principal-component analysis reveals convergentevolution of <em>adh1b</em> in europe and east asia. The American Journal ofHuman Genetics, 98(3), 456–472.

Gopalan, P., Hao, W., Blei, D. M., and Storey, J. D. (2016). Scaling probabilisticmodels of genetic variation to millions of humans. Nat Genet, 48(12), 1587–1590.27819665[pmid].

Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structurewith randomness: Probabilistic algorithms for constructing approximate matrixdecompositions. SIAM Review, 53(2), 217–288.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principalcomponents. Journal of Educational Psychology, 24(6), 417–441.

Hotelling, H. (1936). Relations Between Two Sets of Variates. Biometrika, 28(3/4),321–377.

Menozzi, P., Piazza, A., and Cavalli-Sforza, L. (1978). Synthetic maps of humangene frequencies in europeans. Science, 201(4358), 786–792.

Musco, C. and Musco, C. (2015). Randomized block krylov methods for stronger andfaster approximate singular value decomposition. In C. Cortes, N. D. Lawrence,D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural InformationProcessing Systems 28, pages 1396–1404. Curran Associates, Inc.

Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A. R., Auton, A., Indap, A.,King, K. S., Bergmann, S., Nelson, M. R., Stephens, M., and Bustamante, C. D.(2008). Genes mirror geography within europe. Nature, 456, 98 EP –.

Parlett, B. (1998). The Symmetric Eigenvalue Problem. Society for Industrial andApplied Mathematics.

Paschou, P., Ziv, E., Burchard, E. G., Choudhry, S., Rodriguez-Cintron, W.,Mahoney, M. W., and Drineas, P. (2007). Pca-correlated snps for structureidentification in worldwide human populations. PLOS Genetics, 3(9), 1–15.

Paschou, P., Drineas, P., Lewis, J., Nievergelt, C. M., Nickerson, D. A., Smith,J. D., Ridker, P. M., Chasman, D. I., Krauss, R. M., and Ziv, E. (2008). Tracingsub-structure in the european american population with pca-informative markers.PLOS Genetics, 4(7), 1–13.

Paschou, P., Drineas, P., Yannaki, E., Razou, A., Kanaki, K., Tsetsos,F., Padmanabhuni, S. S., Michalodimitrakis, M., Renda, M. C., Pavlovic,S., Anagnostopoulos, A., Stamatoyannopoulos, J. a., Kidd, K. K., andStamatoyannopoulos, G. (2014). Maritime route of colonization of Europe.Proceedings of the National Academy of Sciences of the United States of America,111(25), 9211–9216.

Patterson, N., Price, A. L., and Reich, D. (2006). Population structure andeigenanalysis. PLOS Genetics, 2(12), 1–20.

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. TheLondon, Edinburgh, and Dublin Philosophical Magazine and Journal of Science,2(11), 559–572.

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A.,and Reich, D. (2006). Principal components analysis corrects for stratification ingenome-wide association studies. Nature Genetics, 38, 904 EP –. Article.

Price, A. L., Zaitlen, N. A., Reich, D., and Patterson, N. (2010). New approachesto population stratification in genome-wide association studies. Nat Rev Genet,11(7), 459–463. 20548291[pmid].

Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference ofpopulation structure using multilocus genotype data. Genetics, 155(2), 945–959.10835412[pmid].

Rokhlin, V., Szlam, A., and Tygert, M. (2010). A randomized algorithm for principalcomponent analysis. SIAM Journal on Matrix Analysis and Applications, 31(3),1100–1124.

Saad, Y. (2011). Numerical Methods for Large Eigenvalue Problems. Society forIndustrial and Applied Mathematics.

Wang, C., A Szpiech, Z., Degnan, J., Jakobsson, M., J Pemberton, T., Hardy, J., BSingleton, A., and A Rosenberg, N. (2010). Comparing Spatial Maps of HumanPopulation-Genetic Variation Using Procrustes Analysis, volume 9. Statisticalapplications in genetics and molecular biology.

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo U

niversity Library user on 08 April 2019

Page 6: TeraPCA: a fast and scalable software package to study ... · based library which is widely used for solving systems of linear equations, least-squares problems, eigenvalue problems,

6 Bose et al.

Dow

nloaded from https://academ

ic.oup.com/bioinform

atics/advance-article-abstract/doi/10.1093/bioinformatics/btz157/5430929 by Bukkyo U

niversity Library user on 08 April 2019