Optimal Subsampling Approaches for Large Sample Linear Regression

Rong Zhu
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China.

Ping Ma
Department of Statistics, University of Georgia, Athens, GA, USA.

Michael W. Mahoney
International Computer Science Institute and Department of Statistics, University of California at Berkeley, Berkeley, CA, USA.

Bin Yu
Department of Statistics and EECS, University of California at Berkeley, Berkeley, CA, USA.

November 6, 2018

Abstract

A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is subsampling, by which one takes a random subsample from the original full sample and uses it as a surrogate for subsequent computation and estimation. In this paper, we study subsampling methods under two scenarios: approximating the full sample ordinary least-square (OLS) estimator and estimating the coefficients in linear regression. We present two algorithms, the weighted estimation algorithm and the unweighted estimation algorithm, and analyze the asymptotic behaviors of their resulting subsample estimators under general conditions. For the weighted estimation algorithm, we propose a criterion for selecting the optimal sampling probability by making use of the asymptotic results. On the basis of the criterion, we provide two novel subsampling methods, the optimal subsampling and the predictor-length subsampling methods. The predictor-length subsampling method is based on the L2 norm of predictors rather than leverage scores. Its computational cost is scalable. For the unweighted estimation algorithm, we show that its resulting subsample estimator is not consistent to the full sample OLS estimator. However, it has better performance than the weighted estimation algorithm for estimating the coefficients. Simulation studies and a real data example are used to demonstrate the effectiveness of our proposed subsampling methods.

arXiv:1509.05111v1 [stat.ME] 17 Sep 2015


Keywords: big data, ordinary least-square, subsampling algorithm, algorithmic leveraging, linear regression, OLS approximation.


1 Introduction

Regression analysis is probably the most popular statistical tool for modeling the relation-

ship between the response yi and predictors xi = (x1i, . . . , xpi)T , i = 1, . . . , n. Given a

set of n observations, in modern massive data sets, the number of predictors p and/or the

sample size n are large, in which case conventional methods suffer from high-dimension

(p is large) challenge, large sample size (n is large) challenge or both. When p is large,

researchers typically assume a sparsity principle, that is, the response only depends on a

sparse subset of the predictors. Under this assumption, using a subset of the predictors in

model fitting has been employed traditionally in overcoming the curse of dimensionality.

Various methods have been developed for selecting the subset of predictors in regression

analysis. See Tibshirani (1996), Meinshausen and Yu (2009), and Bühlmann and van de Geer (2011). When n is large, an accurate estimation is usually assumed for model-

fitting because a large sample size has historically been considered good. See Lehmann

and Casella (2003). However, the advance of computing technologies still lags far behind

the growth of data size. Thus, calculating the ordinary least-square (OLS) estimate might be

computationally infeasible for modern large sample data.

Given fixed computing power, one popular method for handling large samples is subsampling, that is, one uses a small proportion of the data as a surrogate of the full sample for model fitting and statistical inference. Drineas et al. (2008) and Mahoney and Drineas (2009) developed an effective subsampling method for matrix decomposition, which used

normalized statistical leverage scores of the data as the non-uniform sampling probabil-

ity. Drineas et al. (2006, 2011) applied the subsampling method to approximate the OLS

estimator in linear regression. Another approach is to use random projections to obtain fast algorithms for solving OLS, which was studied by Rokhlin and Tygert (2008), Avron et al.

(2010), Dhillon et al. (2013), Clarkson and Woodruff (2013) and McWilliams et al. (2014).

In this paper, we consider two kinds of subsampling algorithms, namely the weighted estimation algorithm and the unweighted estimation algorithm, which are classified according to whether weighted or unweighted least-square is solved on the subsample. We establish the

first asymptotic properties of their resulting weighted subsample estimator and unweighted

subsample estimator, respectively, and propose new subsampling methods from a statisti-


cal point of view. We do so in the context of the OLS approximation and the coefficients

estimation, respectively, in fitting linear regression models for large sample data, where by

“large sample”, we mean that the full sample OLS estimator is too computationally expensive to calculate.

Our main theoretical contribution is to establish the asymptotic normality and consis-

tency of weighted/unweighted subsample estimators for approximating the full sample OLS

estimator. Unlike the worst-case analysis in Drineas et al. (2011), the asymptotic analysis

provides a statistical tool to describe the approximation error of weighted/unweighted sub-

sample estimators with respect to the full sample OLS estimator. Meanwhile, the asymp-

totic properties of weighted/unweighted subsample estimators for estimating the coefficients

are established here. These asymptotic results hold for various subsampling methods, as

long as our general conditions are satisfied. Recently Ma et al. (2014, 2015) derived the bias

and variance formulas for the weighted subsample estimator by Taylor series expansion.

However, in their work, Taylor expansion remainder is not precisely quantified. Unlike their

results, we give the asymptotic distribution of the approximation error, as well as show that

the variance is approximated by an explicit expression and the bias is negligible relative

to the variance. From these results, we provide a framework to develop novel subsampling

methods for the weighted subsample estimator. For the unweighted subsample estimator,

theoretical analysis reveals that it is NOT consistent to the full sample OLS estimator.

However it has better performance than the weighted subsample estimator for estimating

the coefficients.

Our main methodological contribution is to propose two optimal subsampling methods

for the weighted subsample estimator. On the basis of asymptotic results, we propose

an optimal criterion for the weighted subsample estimator to approximate the full sample

OLS estimator. This criterion provides a guide to construct optimal subsampling methods.

Following it, we propose an optimal subsampling method (denoted by OPT below). The

computational cost of OPT is of the same order as that of the available subsampling methods

based on leverage scores. More importantly, the sampling probability based on OPT can

be approximated by a normalized predictor-length, i.e., the L2 norm of predictors. Thus,

the predictor-length subsampling method (denoted by PL below) is approximately optimal


with respect to the criterion for approximating the full sample OLS estimator. Remarkably,

PL is also optimal with respect to another criterion for estimating the coefficients. In

particular, the computational cost of PL is scalable.

Our main empirical contribution is to provide a detailed evaluation of statistical prop-

erties for weighted/unweighted subsample estimators on both synthetic and real data sets.

These empirical results show that our proposed subsampling methods lead to improved per-

formance over existing subsampling methods. They also reveal the relationship between

good performance of nonuniform subsampling methods and the degree of dispersion of the sample data.

The remainder of this article is organized as follows. We briefly review the least-square

problem as well as existing subsampling methods in Section 2. In Section 3, we study

the asymptotic properties of the weighted subsample estimator and propose OPT and PL

subsampling methods. In Section 4, the unweighted subsample estimator is investigated.

In addition, the comparison between the weighted subsample estimator and unweighted

subsample estimator is also shown in Section 4. Simulation studies and a real data example

are presented in Sections 5 and 6, respectively. A few remarks in Section 7 conclude the

article. All technical proofs are relegated to the Appendix and some additional contents

are reported in Supplementary Material.

2 Least-square Estimate and Subsampling Methods

In this section, we provide an overview of subsampling methods for the large sample linear

regression problem.

2.1 Ordinary Least-square Estimate

In this paper, we consider a linear model

$$y_i = x_i^T\beta + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

where yi is a response variable, xi is a p-dimensional design vector, independent of random

error εi, and random errors {εi}ni=1 are identically and independently distributed with


mean zero and variance $\sigma^2$ such that $E(|\varepsilon_i|^{2+\delta}) < \infty$ for some $\delta > 0$. Note that we assume $E(|\varepsilon_i|^{2+\delta}) < \infty$ for the convenience of theoretical study. Model (1) is expressible in a

matrix form

y = Xβ + ε, (2)

where y = (y1, . . . , yn)T ∈ Rn is a response vector, X = (x1, . . . ,xn)T is an n × p design

matrix, and ε = (ε1, . . . , εn)T ∈ Rn is an error vector. We assume that n is large and

$p \ll n$, as high-dimension problems are not investigated here. The ordinary least-square (OLS) estimator $\hat{\beta}_{ols}$ of $\beta$ is
$$\hat{\beta}_{ols} = \arg\min_{\beta}\|y - X\beta\|^2 = (X^TX)^{-1}X^Ty, \qquad (3)$$

where ‖ · ‖ represents the Euclidean norm, and the 2nd equality holds when X has full

rank. Note that we assume, without loss of generality, that X has full rank. The predicted

response is given by $\hat{y} = Hy$, where the hat matrix $H = X(X^TX)^{-1}X^T$. The $i$th diagonal element of the hat matrix $H$, $h_{ii} = x_i^T(X^TX)^{-1}x_i$, is called the leverage score of the $i$th observation. It is easy to see that as $h_{ii}$ approaches 1, the predicted response of the $i$th observation, $\hat{y}_i$, tends to $y_i$. Thus, $h_{ii}$ has been regarded as an influence index indicating how influential the $i$th observation is on the full sample OLS estimator.

See Christensen (1996).

The full sample OLS estimator βols in (3) can be calculated using the singular value

decomposition (SVD) algorithm in Golub and Van Loan (1996). By the SVD of X, H is

alternatively expressed as H = UUT , where U is an n×p orthogonal matrix whose columns

contain the left singular vectors of X. Then, the leverage score of the ith observation is

expressed as

$$h_{ii} = \|u_i\|^2, \qquad (4)$$

where $u_i$ is the $i$th row of $U$. The exact computation of $\{h_{ii}\}_{i=1}^n$ using (4) requires $O(np^2)$ time. See Golub and Van Loan (1996). Fast algorithms for approximating $\{h_{ii}\}_{i=1}^n$ were

proposed by Drineas et al. (2012), Clarkson and Woodruff (2013) and Cohen et al. (2014)

to further reduce the computational cost.
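For concreteness, the computation just described can be sketched in a few lines of numpy (an illustrative sketch, not code from the paper; the function name and interface are our own):

```python
import numpy as np

def ols_and_leverage(X, y):
    """Full sample OLS estimate (3) and exact leverage scores (4) via a thin SVD."""
    # Thin SVD of the n x p design matrix: X = U diag(s) V^T with U of size n x p.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    beta_ols = Vt.T @ ((U.T @ y) / s)      # equals (X^T X)^{-1} X^T y when X has full rank
    h = np.sum(U ** 2, axis=1)             # h_ii = ||u_i||^2; the SVD costs O(np^2)
    return beta_ols, h
```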


2.2 Subsampling Methods

When the sample size n is large, the computational cost of the full sample OLS estima-

tor becomes extremely high. An alternative strategy is the Weighted Estimation Algorithm (Algorithm 1) based on subsampling, presented below.

Algorithm 1 Weighted Estimation Algorithm

• Step 1. Subsample with replacement from the data. Construct sampling probabilities $\{\pi_i\}_{i=1}^n$ for all data points. Draw a random subsample of size $r \ll n$, denoted as $(X^*, y^*)$, i.e., draw $r$ rows from the original data $(X, y)$ according to the probabilities $\{\pi_i\}_{i=1}^n$. Construct the corresponding sampling probability matrix $\Phi^* = \mathrm{diag}\{\pi_k^*\}_{k=1}^r$.

• Step 2. Calculate weighted least-square using the subsample. Solve weighted least-square on the subsample to get the Weighted Subsample Estimator $\hat{\beta}$, i.e.,
$$\hat{\beta} = \arg\min_{\beta}\|\Phi^{*-1/2}y^* - \Phi^{*-1/2}X^*\beta\|^2. \qquad (5)$$
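A minimal numpy sketch of Algorithm 1, assuming a probability vector `probs` has already been constructed (our own illustrative code; the weighted least-square (5) is solved by rescaling the sampled rows by $\pi_k^{*-1/2}$):

```python
import numpy as np

def weighted_subsample_estimator(X, y, probs, r, seed=0):
    """Algorithm 1: draw r rows with replacement and solve the weighted LS problem (5)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=r, replace=True, p=probs)   # Step 1
    w = 1.0 / np.sqrt(probs[idx])                              # diagonal of Phi*^{-1/2}
    beta_w, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)  # Step 2
    return beta_w, idx
```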

In Algorithm 1, the key component is the sampling probability $\{\pi_i\}_{i=1}^n$ in Step 1. Below

are several subsampling methods that have been considered in the literature.

• Uniform Subsampling (UNIF). Let πi = 1/n, i.e., draw the subsample uniformly

at random.

• Basic Leveraging (BLEV). Let $\pi_i = h_{ii}/\sum_{j=1}^n h_{jj} = h_{ii}/p$, i.e., draw the subsample according to a sampling distribution that is proportional to the leverage scores of the data matrix X.

• Approximate Leveraging (ALEV). To reduce the computational cost for getting

hii, fast algorithms were proposed by Drineas et al. (2012), Clarkson and Woodruff

(2013) and Cohen et al. (2014). We refer to these as approximate leveraging.

• Shrinkage Leveraging (SLEV). Let πi = λhii/p + (1 − λ)/n, where the weight

λ is a constant between zero and one. It is a weighted average of uniform sampling

probability and normalized leverage score.

7

Page 8: Optimal Subsampling Approaches for Large Sample Linear … · 2018. 11. 6. · Optimal Subsampling Approaches for Large Sample Linear Regression Rong Zhu Academy of Mathematics and

UNIF is very simple to implement but performs poorly in many cases. The first non-

uniform subsampling method is BLEV based on leverage scores which was developed by

Drineas et al. (2006), Drineas et al. (2008), Mahoney and Drineas (2009) and Drineas et al.

(2011). SLEV was introduced by Ma et al. (2014, 2015).

Another important feature of Algorithm 1 is that Step 2 uses the sampling probability

to calculate weighted least-square on the subsample. It is analogous to the Hansen-Hurwitz

estimate (Hansen and Hurwitz (1943)) in classic sampling techniques. Assume that a ran-

dom sample of size $r$, denoted by $(y_1^*, \ldots, y_r^*)$, is drawn from given data $(y_1, \ldots, y_n)$ with the sampling probabilities $\{\pi_i\}_{i=1}^n$; then the Hansen-Hurwitz estimate of $\sum_{i=1}^n y_i$ is $r^{-1}\sum_{i=1}^r y_i^*/\pi_i^*$, where $\pi_i^*$ is the corresponding sampling probability of $y_i^*$. It is well known that $r^{-1}\sum_{i=1}^r y_i^*/\pi_i^*$ is an unbiased estimate of $\sum_{i=1}^n y_i$. For an overview see Sarndal et al. (2003).
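As a toy illustration of the Hansen-Hurwitz estimate of the total $\sum_{i=1}^n y_i$ (synthetic data, our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=1_000)                   # full data; the target is total = y.sum()
w = rng.random(1_000) + 0.5                  # arbitrary positive scores
probs = w / w.sum()                          # sampling probabilities pi_i
idx = rng.choice(len(y), size=50, replace=True, p=probs)
hh_estimate = np.mean(y[idx] / probs[idx])   # r^{-1} sum y_i*/pi_i*, unbiased for y.sum()
```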

Unlike Algorithm 1, which solves a weighted least-square on the subsample, the Unweighted Estimation Algorithm (Algorithm 2) presented below calculates an ordinary least-square on the subsample.

Algorithm 2 Unweighted Estimation Algorithm

• Step 1. Subsample with replacement from the data. This step is the same as Step 1 in Algorithm 1.

• Step 2. Calculate ordinary least-square using the subsample. Solve an ordinary least-square (instead of weighted least-square) on the subsample to get the Unweighted Subsample Estimator $\hat{\beta}^u$, i.e.,
$$\hat{\beta}^u = \arg\min_{\beta}\|y^* - X^*\beta\|^2. \qquad (6)$$
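Algorithm 2 differs from Algorithm 1 only in Step 2; a corresponding sketch, with the same hypothetical interface as above:

```python
import numpy as np

def unweighted_subsample_estimator(X, y, probs, r, seed=0):
    """Algorithm 2: same subsampling step as Algorithm 1, then plain OLS (6) on the subsample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=r, replace=True, p=probs)   # Step 1 (identical to Algorithm 1)
    beta_u, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)  # Step 2: ordinary least-square
    return beta_u
```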

Algorithm 2 was introduced by Ma et al. (2014, 2015) for BLEV in order to estimate

the coefficients β and was shown to have better empirical performance than Algorithm 1. We

investigate asymptotic properties of its resulting unweighted subsample estimator βu

and

make a theoretical comparison between Algorithms 1 and 2 in Section 5.


3 Weighted Subsample Estimator

In this section, we theoretically investigate the resulting weighted subsample estimator β

in Algorithm 1 under two scenarios: approximating the full sample OLS estimator βols and

estimating the coefficients β, and propose two subsampling methods.

3.1 Weighted Subsample Estimator for Approximating the Full

Sample OLS Estimator

Now we investigate the asymptotic properties of the weighted subsample estimator β to

approximate the full sample OLS estimator βols. Motivated by the theoretical results, we

propose two novel subsampling methods that make the resulting estimator more efficient.

3.1.1 Asymptotic Normality and Consistency

Given data Fn = {z1, · · · , zn}, where zi = (xTi , yi)T , i = 1, . . . , n, we establish the asymp-

totic properties under the following conditions.

Condition C1.
$$M_1 \triangleq \frac{1}{n^2}\sum_{i=1}^n \frac{\|x_i\|^2 x_i x_i^T}{\pi_i} = O_p(1), \qquad (7)$$
$$M_2 \triangleq \frac{1}{n^2}\sum_{i=1}^n \frac{x_i x_i^T}{\pi_i} = O_p(1), \qquad (8)$$
$$M_3 \triangleq \frac{1}{n^{2+\delta}}\sum_{i=1}^n \frac{\|x_i\|^{2+2\delta}}{\pi_i^{1+\delta}} = O_p(1) \ \text{ for some } \delta > 0, \qquad (9)$$
$$M_4 \triangleq \frac{1}{n^{2+\delta}}\sum_{i=1}^n \frac{\|x_i\|^{2+\delta}}{\pi_i^{1+\delta}} = O_p(1) \ \text{ for some } \delta > 0. \qquad (10)$$

Theorem 1. If Condition C1 holds, then given $\mathcal{F}_n$, we have
$$V^{-1/2}(\hat{\beta} - \hat{\beta}_{ols}) \stackrel{L}{\longrightarrow} N(0, I) \quad \text{as } r \to \infty, \qquad (11)$$
where the notation $\stackrel{L}{\longrightarrow}$ stands for convergence in distribution, $V = M_X^{-1} V_c M_X^{-1}$, and $V_c = r^{-1}\sum_{i=1}^n \frac{e_i^2}{\pi_i} x_i x_i^T$ with $e_i = y_i - x_i^T\hat{\beta}_{ols}$.

Moreover,
$$V = O_p(r^{-1}), \qquad (12)$$


and
$$E(\hat{\beta} - \hat{\beta}_{ols} \mid \mathcal{F}_n) = O_p(r^{-1}). \qquad (13)$$

Theorem 1 states that $\hat{\beta}$ is consistent to $\hat{\beta}_{ols}$ and that the difference between $\hat{\beta}$ and $\hat{\beta}_{ols}$, suitably normalized, converges to a normal distribution as r gets large. The theoretical results hold regardless of the subsampling method, as long as our conditions are satisfied. We discuss the conditions for several subsampling methods in Section 3.1.2.

Remark. There is no requirement on whether n is larger than r. The asymptotic results

still hold even when $r/n \to \infty$. However, our aim is to overcome the computational bottleneck,

so n is much larger than r in our empirical studies.

Remark. One contribution of the asymptotic normality (11) is to construct the confidence interval for $\hat{\beta} - \hat{\beta}_{ols}$. However, the large sample size n can make the computation of V infeasible. To deal with this, we use the plug-in estimator $\hat{V}$ based on the subsample to estimate V, i.e.,
$$\hat{V} = \hat{M}_X^{-1}\hat{V}_c\hat{M}_X^{-1}, \qquad (14)$$
where $\hat{M}_X = r^{-1}X^{*T}\Phi^{*-1}X^*$ and $\hat{V}_c = r^{-2}\sum_{i=1}^r \frac{e_i^{*2}}{\pi_i^{*2}} x_i^* x_i^{*T}$ with $e_i^* = y_i^* - x_i^{*T}\hat{\beta}$.
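A sketch of the plug-in estimator (14), computed from the subsample alone (assuming `Xs`, `ys`, `ps` hold the sampled rows, responses and their probabilities, and `beta_w` is the estimate from Algorithm 1; the names are ours):

```python
import numpy as np

def plugin_V(Xs, ys, ps, beta_w):
    """Plug-in estimator (14): V_hat = M_X_hat^{-1} V_c_hat M_X_hat^{-1}."""
    r = len(ys)
    M_hat = (Xs.T * (1.0 / ps)) @ Xs / r              # r^{-1} X*^T Phi*^{-1} X*
    e = ys - Xs @ beta_w                               # subsample residuals e_i*
    Vc_hat = (Xs.T * (e ** 2 / ps ** 2)) @ Xs / r**2   # r^{-2} sum e_i*^2 / pi_i*^2 x_i* x_i*^T
    M_inv = np.linalg.inv(M_hat)
    return M_inv @ Vc_hat @ M_inv                      # approximate Var(beta_hat - beta_ols | F_n)
```

The diagonal of the returned matrix then yields approximate standard errors for the components of $\hat{\beta} - \hat{\beta}_{ols}$ via the normal approximation (11).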

Remark. From Theorem 1, we can get the asymptotic version of the relative-error ap-

proximation result (Theorem 1 of Drineas et al. (2011)). See details in Supplementary

Material.

Remark. Although we consider random design matrix X in Theorem 1, the asymptotic

normality and consistency still hold in the fixed design setting, since the proof does not rely on the randomness of the design matrix X.

3.1.2 Condition C1

M1 and M2 in C1 are conditions imposed on the fourth and second moments, respectively,

of predictors with weighting by sampling probability. M3 and M4 in C1 are conditions on

higher-order moments for satisfying the Lindeberg-Feller condition to prove the asymptotic

normality (van der Vaart (1998), 2.27 at section 2.8, Page 20). Note that the moments in

C1 are sample moments imposed on full sample data rather than population moments on

population distributions.

Now we focus on C1 for three subsampling methods investigated in this paper.


For UNIF, i.e., $\pi_i = \frac{1}{n}$, we have
$$M_1 = \frac{1}{n}\sum_{i=1}^n \|x_i\|^2 x_i x_i^T, \quad M_2 = \frac{1}{n}\sum_{i=1}^n x_i x_i^T, \quad M_3 = \frac{1}{n}\sum_{i=1}^n \|x_i\|^{2+2\delta}, \quad M_4 = \frac{1}{n}\sum_{i=1}^n \|x_i\|^{2+\delta}.$$

Thus, C1 can be derived from the condition that {xi} are independent with bounded fourth

moments.

For BLEV, i.e., $\pi_i = h_{ii}/p$, since $\|x_i\|^2/h_{ii} \le n\lambda_m$, where $\lambda_m$ is the largest eigenvalue of $\frac{1}{n}M_X$, we have
$$M_1 \le \frac{p\lambda_m}{n}\sum_{i=1}^n x_i x_i^T, \quad M_2 \le \frac{p\lambda_m}{n}\sum_{i=1}^n \frac{x_i x_i^T}{\|x_i\|^2}, \quad M_3 \le (p\lambda_m)^{1+\delta}, \quad M_4 \le \frac{(p\lambda_m)^{1+\delta}}{n}\sum_{i=1}^n \|x_i\|^{-\delta}.$$
$M_1$, $M_2$ and $M_3$ can be derived from the bounded second moment, i.e., $\frac{1}{n}\sum_{i=1}^n x_i x_i^T = O_p(1)$, but $M_4 = O_p(1)$ if $\frac{1}{n}\sum_{i=1}^n \frac{1}{\|x_i\|^{\delta}} = O_p(1)$. However, if we change the leverage score from $h_{ii}$ to $h_{ii}^{1/(1+\delta)}$, it is sufficient to obtain the equations in C1 under the condition that $\{x_i\}$ are independent with bounded fourth moments (see details in Supplementary Material).

For the PL subsampling method to be developed in the next section, i.e., $\pi_i = \|x_i\|/\sum_{j=1}^n\|x_j\|$, we have
$$M_1 = \frac{1}{n^2}\Big(\sum_{j=1}^n\|x_j\|\Big)\sum_{i=1}^n \|x_i\|\, x_i x_i^T, \qquad M_2 = \frac{1}{n^2}\Big(\sum_{j=1}^n\|x_j\|\Big)\sum_{i=1}^n \frac{x_i x_i^T}{\|x_i\|},$$
$$M_3 = \frac{1}{n^{2+\delta}}\Big(\sum_{j=1}^n\|x_j\|\Big)^{1+\delta}\sum_{i=1}^n \|x_i\|^{1+\delta}, \qquad M_4 = \frac{1}{n^{2+\delta}}\Big(\sum_{j=1}^n\|x_j\|\Big)^{1+\delta}\sum_{i=1}^n \|x_i\|.$$

We can verify that C1 can be derived from the condition that {xi} are independent with

bounded fourth moments.

Thus, C1 is not strong, as it is sufficient to assume that {xi} are independent with

bounded fourth moments. For example, Gaussian, log-normal, mixture Gaussian and trun-

cated t distributions all have bounded fourth moments. In Section 6.1, we present an empirical comparison of the sampling probabilities among these distributions, whose leverage scores have different degrees of heterogeneity. Of course, the bounded fourth moment condition excludes some distributions, such as the t1 distribution. However, the empirical analysis in Section 6 shows that various subsampling methods also perform well on the dataset generated from the t1 distribution.


4 Optimal Subsampling

By Theorem 1, given $\mathcal{F}_n$, $\mathrm{Var}(\hat{\beta})$ can be approximated by $V = M_X^{-1} V_c M_X^{-1}$ as r becomes large. Meanwhile, following (12) and (13) in Theorem 1, V dominates the squared bias; that is, we can make the mean squared-error (MSE) approximately attain its minimum by minimizing V. Thus, a direct statistical objective is to minimize V in some sense.

Since $M_X$ does not depend on $\{\pi_i\}_{i=1}^n$, by the properties of nonnegative definite matrices, $V_c(\pi_1) \le V_c(\pi_2)$ is equivalent to $V(\pi_1) \le V(\pi_2)$ for any two sampling probability sets $\pi_1 = \{\pi_i^{(1)}\}_{i=1}^n$ and $\pi_2 = \{\pi_i^{(2)}\}_{i=1}^n$. Here we use the notation $A \le B$ if $B - A$ is nonnegative definite. From this view, we can minimize $V_c$ instead of V in some sense.

Let the scalar version of $V_c$ be
$$\mathrm{tr}[V_c] = r^{-1}\sum_{i=1}^n \frac{e_i^2}{\pi_i}\, x_i^T x_i. \qquad (15)$$
Furthermore, we take the expectation of $\mathrm{tr}[V_c]$ under the linear model (1) to remove the effect of the model errors. Thus, we set the expectation of $\mathrm{tr}[V_c]$ as our objective function, i.e.,
$$E[\mathrm{tr}(V_c)] = \frac{1}{r}\sum_{i=1}^n \frac{E[e_i^2]}{\pi_i}\|x_i\|^2 = \frac{\sigma^2}{r}\sum_{i=1}^n \frac{1 - h_{ii}}{\pi_i}\|x_i\|^2. \qquad (16)$$

Optimal Criterion for Approximating βols. Our aim is to construct sampling proba-

bility {πi}ni=1 in Algorithm 1 that minimizes the objective function E[tr(Vc)] in (16).

Theorem 2. When
$$\pi_i = \frac{\sqrt{1 - h_{ii}}\,\|x_i\|}{\sum_{j=1}^n \sqrt{1 - h_{jj}}\,\|x_j\|}, \qquad (17)$$
$E[\mathrm{tr}(V_c)]$ attains its minimum.

We denote the subsampling method according to the sampling probability in (17) as optimal subsampling (OPT). The computational cost of OPT is of the same order as that of BLEV, i.e., $O(np^2)$.
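A sketch of the OPT probabilities in (17) together with the objective (16), which can be used to compare candidate sampling distributions ($\sigma^2$ is a common factor and cancels in the comparison; the helper names are ours):

```python
import numpy as np

def opt_probs(X):
    """OPT probabilities (17): pi_i proportional to sqrt(1 - h_ii) * ||x_i||."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    h = np.sum(U ** 2, axis=1)                          # leverage scores
    score = np.sqrt(np.clip(1.0 - h, 0.0, None)) * np.linalg.norm(X, axis=1)
    return score / score.sum(), h

def criterion(X, h, probs, r):
    """Objective (16) up to the factor sigma^2: r^{-1} sum (1 - h_ii) ||x_i||^2 / pi_i."""
    return np.sum((1.0 - h) * np.sum(X ** 2, axis=1) / probs) / r
```

Evaluating `criterion` at UNIF ($\pi_i = 1/n$), BLEV ($\pi_i = h_{ii}/p$) and the OPT probabilities shows OPT attaining the smallest value, in line with Theorem 2.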

Remark. When the design matrix X is orthogonal, i.e., $X^TX = I$, $h_{ii} = x_i^T(X^TX)^{-1}x_i = \|x_i\|^2$, and the sampling probability of OPT is proportional to $\sqrt{(1-h_{ii})h_{ii}}$. Figure 1 illustrates the various subsampling methods for an orthogonal design matrix X.


Figure 1: Comparison of the leverage score $h_{ii}$, the optimal score $\sqrt{(1-h_{ii})h_{ii}}$, the predictor-length $\sqrt{h_{ii}}$, and the shrinkage score $0.9h_{ii} + 0.1\,p/n$ with $p/n = 0.2$, for an orthogonal design matrix X. Panels: (a) optimal score vs leverage score; (b) length score vs leverage score; (c) shrinkage score vs leverage score. Dot-dash lines mark where the y-axis value equals the x-axis value; dotted lines mark their intersection.

We observe that the optimal score $\sqrt{(1-h_{ii})h_{ii}}$ amplifies small $h_{ii}$ but shrinks large $h_{ii}$ toward zero in Figure 1(a), whereas the shrinkage score $0.9h_{ii} + 0.1p/n$ is linear in the leverage score $h_{ii}$, as seen in Figure 1(c). Moreover, $\sqrt{(1-h_{ii})h_{ii}}$ is nonlinear in $h_{ii}$.

Remark. The sampling probability of OPT is proportional to $\sqrt{1 - h_{ii}}$ if the predictor-length $\|x_i\|$ is fixed. It is a surprising observation, as the sampling probability of BLEV is proportional to $h_{ii}$.

4.1 Predictor-length Subsampling

Since the computational cost of the sampling probabilities of OPT is $O(np^2)$, the computation is not scalable in the dimension of the predictors. Now we develop a subsampling method

that allows a significant reduction of computational cost. The key idea is that when lever-

age scores of all observations are small, one can simplify the expression of the sampling

probability of OPT in (17).

Condition C2. The leverage scores satisfy $h_{ii} = o_p(1)$ for every $i = 1, \ldots, n$.

Remark. $h_{ii}$ can approach 1 for serious outliers, while $h_{ii} = p/n$ when the leverage scores are identical, since $\sum_{i=1}^n h_{ii} = p$. Thus, this condition means that the $h_{ii}$'s are not highly heterogeneous.


Corollary 1. Under Condition C2, the sampling probability of OPT in (17) can be approximated by
$$\pi_i = \frac{\|x_i\|}{\sum_{j=1}^n \|x_j\|}. \qquad (18)$$

We denote the subsampling method according to the sampling probability in (18) as

predictor-length subsampling (PL), as the sampling probability is proportional to the

L2 norm of predictors. The computational cost of PL is O(np).
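The PL probabilities in (18) require only a single O(np) pass over the rows; a one-function sketch (illustrative, names ours):

```python
import numpy as np

def pl_probs(X):
    """PL probabilities (18): pi_i = ||x_i|| / sum_j ||x_j||."""
    lengths = np.linalg.norm(X, axis=1)   # row-wise L2 norms, O(np)
    return lengths / lengths.sum()
```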

Remark. When X is orthogonal, i.e., $X^TX = I$, the predictor-length is $\|x_i\| = \sqrt{h_{ii}}$. In this case, the predictor-length is a nonlinear function of the leverage score, as illustrated in Figure 1(b). Thus, compared to BLEV, OPT amplifies small probabilities but shrinks large probabilities.

4.2 Weighted Subsample Estimator for Estimating the Coeffi-

cients

Besides approximating βols, another aim is to estimate the coefficients β under the linear

model (1), so we are also interested in establishing the asymptotic normality and consistency

of the weighted subsample estimator $\hat{\beta}$ with respect to $\beta$.

Now we give the asymptotic properties of $\hat{\beta}$ with respect to $\beta$ in the following theorem.

By the theorem, we provide an optimal subsampling method with respect to a statistical

criterion for estimating β.

Theorem 3. If Condition C1 holds, then under the linear model (1) we have
$$V_0^{-1/2}(\hat{\beta} - \beta) \stackrel{L}{\longrightarrow} N(0, I) \quad \text{as } r \to \infty, \qquad (19)$$
where $V_0 = M_X^{-1} V_{c0} M_X^{-1}$, and $V_{c0} = r^{-1}\sigma^2\sum_{i=1}^n \frac{1}{\pi_i} x_i x_i^T + (1 - r^{-1})\sigma^2\sum_{i=1}^n x_i x_i^T$. Moreover, $V_0 = O_p(r^{-1})$.

Theorem 3 states that $\hat{\beta}$ is consistent to $\beta$ and that $\hat{\beta} - \beta$, suitably normalized, converges to a normal distribution under the linear model (1).

Remark. Comparing V for approximating $\hat{\beta}_{ols}$ in Theorem 1 with $V_0$ for estimating $\beta$ in Theorem 3, we observe that $V_0$ has an extra term $M_X^{-1}\big[(1 - r^{-1})\sigma^2\sum_{i=1}^n x_i x_i^T\big]M_X^{-1}$. The result can be seen from the following fact:
$$\hat{\beta} - \beta = (\hat{\beta} - \hat{\beta}_{ols}) + (\hat{\beta}_{ols} - \beta), \qquad (20)$$
where $\hat{\beta} - \hat{\beta}_{ols}$ leads to the main term of $V_0$, $M_X^{-1}\big[r^{-1}\sigma^2\sum_{i=1}^n \frac{1}{\pi_i} x_i x_i^T\big]M_X^{-1}$, and $\hat{\beta}_{ols} - \beta$ leads to the extra term of $V_0$.

Remark. Analogous to the plug-in estimator $\hat{V}$ in (14), $V_0$ can be estimated by the plug-in estimator $\hat{V}_0$ based on the subsample, i.e., $\hat{V}_0 = \hat{M}_X^{-1}\hat{V}_{c0}\hat{M}_X^{-1}$, where $\hat{V}_{c0} = r^{-2}\hat{\sigma}^2\sum_{i=1}^r \frac{1}{\pi_i^{*2}} x_i^* x_i^{*T} + (r^{-1} - r^{-2})\hat{\sigma}^2\sum_{i=1}^r x_i^* x_i^{*T}$ with $\hat{\sigma}^2 = \frac{1}{r - p}\sum_{i=1}^r (y_i^* - x_i^{*T}\hat{\beta})^2$.

Since $\hat{\beta}$ is an unbiased estimator of $\beta$, we consider its covariance matrix $V_0$ to construct the optimal subsampling method for estimating $\beta$. Following the optimal criterion for approximating $\hat{\beta}_{ols}$, we propose a criterion for estimating $\beta$.

Optimal Criterion for Estimating β. Our aim is to find the sampling probability

{πi}ni=1 that minimizes tr(Vc0).

Remark. Unlike the optimal criterion (16) for approximating βols, we do not need to take

expectation for tr(Vc0) under the linear model (1).

Corollary 2. If the PL subsampling method is chosen, i.e., $\pi_i = \|x_i\|/\sum_{j=1}^n \|x_j\|$, then $\mathrm{tr}(V_{c0})$ attains its minimum.

Corollary 2 states that PL is optimal with respect to the criterion for estimating β.

This is very interesting since PL is an approximately optimal method for approximating

βols.

5 Unweighted Subsample Estimator

Unlike the weighted subsample estimator $\hat{\beta}$ in Algorithm 1, the unweighted subsample estimator $\hat{\beta}^u$ in Algorithm 2 does not solve the weighted least-square on the subsample but calculates the ordinary least-square directly. In this section, we investigate $\hat{\beta}^u$ under two scenarios: approximating the full sample OLS estimator $\hat{\beta}_{ols}$ and estimating the coefficients $\beta$.


5.1 Unweighted Subsample Estimator for Approximating the Full

Sample OLS Estimator

In the following theorem, we shall establish asymptotic properties of the unweighted sub-

sample estimator βu

for approximating βols. The properties show that, given the full

sample data Fn, βu

is NOT a consistent estimator of βols, but a consistent estimator of

the full sample weighted least-square (WLS) estimate
$$\hat{\beta}_{wls} = (X^T\Phi X)^{-1}X^T\Phi y, \qquad (21)$$
where $\Phi = \mathrm{diag}\{\pi_i : i = 1, \ldots, n\}$.

Condition C3.
$$\sum_{i=1}^n \pi_i\|x_i\|^2 x_i x_i^T = O_p(1), \qquad \sum_{i=1}^n \pi_i x_i x_i^T = O_p(1), \qquad (22)$$
$$\sum_{i=1}^n \pi_i^{1+\delta}\|x_i\|^{2+2\delta} = O_p(1), \qquad \sum_{i=1}^n \pi_i^{1+\delta}\|x_i\|^{2+\delta} = O_p(1), \quad \text{for some } \delta > 0. \qquad (23)$$

Remark. Unlike C1 in which the sample moments are weighted by {1/πi}ni=1, the sample

moments in C3 are weighted by {πi}ni=1.

Theorem 4. If Condition C3 holds, then given $\mathcal{F}_n$,
$$(V^u)^{-1/2}(\hat{\beta}^u - \hat{\beta}_{wls}) \stackrel{L}{\longrightarrow} N(0, I) \quad \text{as } r \to \infty, \qquad (24)$$
where $V^u = (M_X^u)^{-1}V_c^u(M_X^u)^{-1}$, $M_X^u = \sum_{i=1}^n \pi_i x_i x_i^T$, and $V_c^u = r^{-1}\sum_{i=1}^n \pi_i (e_i^{wls})^2 x_i x_i^T$ with $e_i^{wls} = y_i - x_i^T\hat{\beta}_{wls}$. In addition, we have
$$E(\hat{\beta}^u - \hat{\beta}_{ols} \mid \mathcal{F}_n) = \hat{\beta}_{wls} - \hat{\beta}_{ols} + O_p(r^{-1}). \qquad (25)$$

Theorem 4 states that $\hat{\beta}^u$ is not a good choice for approximating $\hat{\beta}_{ols}$, since the main term of its bias, $\hat{\beta}_{wls} - \hat{\beta}_{ols}$, cannot be controlled by increasing r.

Remark. For the uniform subsampling method, $\hat{\beta}_{wls} - \hat{\beta}_{ols} = 0$; in this case, the unweighted subsample estimator $\hat{\beta}^u$ is identical to the weighted subsample estimator $\hat{\beta}$.
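A sketch of the full sample WLS estimate (21) and the bias term $\hat{\beta}_{wls} - \hat{\beta}_{ols}$ appearing in (25) (our own illustration; for uniform probabilities the difference is zero, matching the remark above):

```python
import numpy as np

def wls_vs_ols(X, y, probs):
    """Compute beta_wls in (21) and its gap from beta_ols, the non-vanishing bias term in (25)."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    PhiX = X * probs[:, None]                            # Phi X with Phi = diag(pi_i)
    beta_wls = np.linalg.solve(X.T @ PhiX, PhiX.T @ y)   # (X^T Phi X)^{-1} X^T Phi y
    return beta_wls, beta_wls - beta_ols
```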

Remark. Similar to the plug-in estimator $\hat{V}$ in (14), $V^u$ can be estimated using the subsample, i.e., $\hat{V}^u = (\hat{M}_X^u)^{-1}\hat{V}_c^u(\hat{M}_X^u)^{-1}$, where $\hat{M}_X^u = r^{-1}X^{*T}X^*$ and $\hat{V}_c^u = r^{-2}\sum_{i=1}^r (e_i^{*wls})^2 x_i^* x_i^{*T}$ with $e_i^{*wls} = y_i^* - x_i^{*T}\hat{\beta}^u$.


5.2 Unweighted Subsample Estimator for Estimating the Coeffi-

cients

In Section 5.1, we have shown that the unweighted subsample estimator βu

converges to the

full sample WLS estimate βwls rather than the full sample OLS estimator βols. However, it

is easy to see that βu

is an unbiased estimator of β, i.e., E(βu) = β, where the expectation

is taken under the linear model (1). We establish asymptotic properties of βu

for estimating

β in the following theorem.

Theorem 5. If Condition C3 holds, then under the linear model (1) we have
$$(V_0^u)^{-1/2}(\hat{\beta}^u - \beta) \stackrel{L}{\longrightarrow} N(0, I) \quad \text{as } r \to \infty, \qquad (26)$$
where $V_0^u = (M_X^u)^{-1}V_{c0}^u(M_X^u)^{-1}$ with $V_{c0}^u = r^{-1}\sigma^2\sum_{i=1}^n \pi_i x_i x_i^T + (1 - r^{-1})\sigma^2\sum_{i=1}^n \pi_i^2 x_i x_i^T$. Moreover, $V_0^u = O_p(r^{-1})$.

Theorem 5 states that $\hat{\beta}^u$ is consistent to $\beta$ and that $\hat{\beta}^u - \beta$, suitably normalized, converges to a normal distribution under the linear model (1).

Remark. Similar to the plug-in estimator $\hat{V}$ in (14), $V_0^u$ can also be estimated based on the subsample, i.e., $\hat{V}_0^u = (\hat{M}_X^u)^{-1}\hat{V}_{c0}^u(\hat{M}_X^u)^{-1}$, where $\hat{V}_{c0}^u = r^{-2}\hat{\sigma}_u^2\sum_{i=1}^r x_i^* x_i^{*T} + (r^{-1} - r^{-2})\hat{\sigma}_u^2\sum_{i=1}^r \pi_i^* x_i^* x_i^{*T}$ with $\hat{\sigma}_u^2 = \frac{1}{r - p}\sum_{i=1}^r (y_i^* - x_i^{*T}\hat{\beta}^u)^2$.

As we know, both the weighted subsample estimator β and the unweighted subsample

estimator βu

are unbiased estimators of β. Thus, we compare β and βu

in terms of

efficiency, i.e., their corresponding asymptotic covariance matrices V0 in (19) and Vu0 in

(26).

Corollary 3. We have that, as $r = o(n)$,
$$V_0^u - V_0 \le 0, \qquad (27)$$
where we use the notation $V_0^u - V_0 \le 0$ if $V_0^u - V_0$ is nonpositive definite, and the equality holds if and only if $\pi_i = 1/n$ for $i = 1, \ldots, n$.

Corollary 3 states that $\hat{\beta}^u$ is more efficient than $\hat{\beta}$, except under uniform subsampling, in which case the two estimators are identical.
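A small numerical check of Corollary 3, comparing the traces of the asymptotic covariance matrices $V_0$ (Theorem 3) and $V_0^u$ (Theorem 5) for a given non-uniform $\pi$ (a sketch under assumed values of $r$ and $\sigma^2$, not part of the paper's formal argument):

```python
import numpy as np

def trace_comparison(X, probs, sigma2=1.0, r=500):
    """Return (tr(V0), tr(V0^u)); for r much smaller than n, tr(V0^u) <= tr(V0) as in Corollary 3."""
    M = X.T @ X
    Vc0 = sigma2 * ((X.T * (1.0 / probs)) @ X / r + (1.0 - 1.0 / r) * M)
    Minv = np.linalg.inv(M)
    V0 = Minv @ Vc0 @ Minv

    Mu = (X.T * probs) @ X
    Vc0u = sigma2 * (Mu / r + (1.0 - 1.0 / r) * (X.T * probs ** 2) @ X)
    Muinv = np.linalg.inv(Mu)
    V0u = Muinv @ Vc0u @ Muinv
    return np.trace(V0), np.trace(V0u)
```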


Remark. The condition $r = o(n)$ matches the computational bottleneck setting studied in this paper.

From the theoretical analysis in this section, we have the following recommendation about the unweighted subsample estimator $\hat{\beta}^u$: it is not ideal for approximating $\hat{\beta}_{ols}$, since it is not a consistent estimator of $\hat{\beta}_{ols}$; however, it is a better choice for estimating $\beta$, since it can be more efficient than $\hat{\beta}$.

6 Empirical Evaluation on Synthetic Data

Extensive simulation studies are conducted to examine the empirical performance of the

weighted subsample estimator β and the unweighted subsample estimator βu

based on

various subsampling methods. In this section, we report several representative studies.

6.1 Synthetic Data

The $n \times p$ design matrix X is generated row-by-row from one of five multivariate distributions introduced below. (1) We generated X from the Gaussian distribution $N(0, \Sigma)$, where the $(i,j)$th element of $\Sigma$ is $\Sigma_{ij} = 2 \times 0.8^{|i-j|}$. (Referred to as GA data.) (2) We generated X from the mixture Gaussian distribution $\frac{1}{2}N(0, \Sigma) + \frac{1}{2}N(0, 25\Sigma)$. (Referred to as MG data.) (3) We generated X from the log-normal distribution $LN(0, \Sigma)$. (Referred to as LN data.) (4) We generated X from the t distribution with 1 degree of freedom and covariance matrix $\Sigma$ (the $t_1$ distribution). (Referred to as T1 data.) (5) We generated X from the truncated $t_1$ distribution with element-wise truncation at $[-p, p]$. (Referred to as TT data.) We set $n = 50{,}000$ and $p = 50$.
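A sketch of this data-generating step (our own illustrative code; the paper's random seeds are not specified, the mixture reuses the same Gaussian draws for both components, and the truncation is implemented by clipping):

```python
import numpy as np

def make_designs(n=50_000, p=50, seed=0):
    """Generate the GA, MG, LN, T1 and TT design matrices described above."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 2.0 * 0.8 ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)
    GA = rng.standard_normal((n, p)) @ L.T                     # N(0, Sigma)
    mix = rng.random(n) < 0.5
    MG = np.where(mix[:, None], GA, 5.0 * GA)                  # 0.5 N(0,Sigma) + 0.5 N(0,25 Sigma)
    LN = np.exp(GA)                                            # log-normal LN(0, Sigma)
    T1 = GA / np.sqrt(rng.chisquare(df=1, size=n))[:, None]    # multivariate t with 1 d.f.
    TT = np.clip(T1, -p, p)                                    # element-wise truncation at [-p, p]
    return GA, MG, LN, T1, TT
```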

In Figure 2, we plot scatter diagrams comparing the sampling probabilities of the data points under BLEV, OPT and PL. For MG, LN, TT and T1, we observe that (1) the sampling probabilities of OPT and PL increase as the sampling probabilities of BLEV increase, (2) compared to BLEV, OPT and PL amplify small sampling probabilities but shrink large ones, and (3) the sampling probabilities of OPT and PL look very similar. Additionally, since GA has the most homogeneous probabilities, the differences among the methods are hard to discern for GA.


Figure 2: Sampling probabilities of the data points under BLEV, OPT and PL for the different data sets. Upper panels: sampling probabilities of OPT vs BLEV. Lower panels: sampling probabilities of PL vs BLEV. From left to right: GA, MG, LN, TT, and T1 data, respectively. Dashed lines mark where the y-axis value equals the x-axis value.


Figure 3: Boxplots of the logarithm of the sampling probabilities of BLEV, OPT and PL. Left subfigure: sampling probabilities of BLEV. Middle subfigure: sampling probabilities of OPT. Right subfigure: sampling probabilities of PL. Within each subfigure, from left to right: GA, MG, LN, TT, and T1 data, respectively.


To further compare those different sampling probabilities, Figure 3 gives the boxplots

of logarithm of sampling probabilities of BLEV, OPT, and PL. For all three subsampling

methods, we observe that GA tends to have the most homogeneous sampling probabilities

among those data sets, MG, LN and TT have less homogeneous sampling probabilities

compared to GA, and T1 tends to have the most heterogeneous sampling probabilities under

BLEV. Comparing these subfigures in Figure 3, we see that the probabilities of OPT and

PL are more concentrated than those of BLEV, and PL and OPT have similar performance.

6.2 Improvement from our Proposed Methods for the Weighted

Subsample Estimator

Given the X matrices in Section 6.1, we generated y from the model $y = X\beta + \varepsilon$, where $\beta = (\mathbf{1}_{30}^T, 0.1 \times \mathbf{1}_{20}^T)^T$, $\varepsilon \sim N(0, \sigma^2 I_n)$ and $\sigma = 10$. Since five X matrices were generated, we had five datasets.

We conduct empirical studies on OLS approximation. The empirical performance for

the coefficients estimation is not reported here but shown in Supplementary Material, since

it looks very similar to that for OLS approximation.

We calculate the full sample OLS estimator βols for each dataset. We then apply

five subsampling methods, including UNIF, BLEV, SLEV (with shrinkage parameter λ =

0.9), OPT and PL, with different subsample sizes r to each dataset. Specifically, we set

the subsample size r = 100, 200, 400, 800, 1600, 3200, 6400. For each subsample size r, we

repeatedly apply Algorithm 1 B = 1000 times to get weighted subsample estimators $\hat{\beta}_b$ for $b = 1, \ldots, B$. We calculate the empirical variance (V), squared bias (SB) and

mean-squared error (MSE) at each subsample size as follows:

$$\mathrm{V} = \frac{1}{B}\sum_{b=1}^B \big\|X(\hat{\beta}_b - \bar{\hat{\beta}})\big\|^2, \quad \mathrm{SB} = \Big\|\frac{1}{B}\sum_{b=1}^B X(\hat{\beta}_b - \hat{\beta}_{ols})\Big\|^2, \quad \mathrm{MSE} = \frac{1}{B}\sum_{b=1}^B \big\|X(\hat{\beta}_b - \hat{\beta}_{ols})\big\|^2, \qquad (28)$$
where $\bar{\hat{\beta}} = \frac{1}{B}\sum_{b=1}^B \hat{\beta}_b$.
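A sketch of how (28) can be computed, assuming `betas` is a list (or B x p array) of the replicate estimates and `beta_ols` the full sample OLS estimate (illustrative code, names ours):

```python
import numpy as np

def v_sb_mse(X, betas, beta_ols):
    """Empirical variance, squared bias and MSE in (28) for the fitted values X beta_b."""
    target = X @ beta_ols
    mean_pred = X @ np.mean(betas, axis=0)                 # X times the average estimate
    V = np.mean([np.sum((X @ b - mean_pred) ** 2) for b in betas])
    SB = np.sum((mean_pred - target) ** 2)
    MSE = np.mean([np.sum((X @ b - target) ** 2) for b in betas])
    return V, SB, MSE
```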

We plot the logarithm of V, SB and MSE of $X\hat{\beta}$ for MG, LN and TT in Figure 4. Since GA has such homogeneous data points that the various methods perform more or less the same, and T1 does not satisfy our conditions, we do not report the results for GA and T1 here.



Figure 4: Empirical variances, squared biases and mean-squared errors of $X\hat{\beta}$ for predicting $X\hat{\beta}_{ols}$. Upper panels are the logarithm of the variance, middle panels the logarithm of the squared bias, and lower panels the logarithm of the mean-squared error. From left to right: TT, MG and LN data, respectively.


In addition, the performance of SLEV is not reported here, as it has similar performance

to BLEV. Several observations from Figure 4 are worth noting. First, the V, SB and MSE values decrease as the subsample size r increases, which verifies Theorem 1. Second, V clearly dominates SB, which is consistent with the fact that the squared bias is proved to be negligible with respect to the variance. Third, the OPT and PL methods

have better performance than other subsampling methods, while OPT and PL have similar

performance.

In addition, we report the MSE values of various subsampling methods relative to UNIF

among all five datasets in Table 1. First, there is no obvious difference among those subsam-

pling methods for GA. It indicates that uniform subsampling has such nice performance

that it is not necessary to use non-uniform subsampling for GA. Second, comparing Ta-

ble 1 with Figure 3, as the data becomes more dispersed, i.e., more heterogeneous leverage

scores, all non-uniform subsampling methods get smaller MSE relative to UNIF. Specifi-

cally, non-uniform subsampling methods perform nearly the same as UNIF for GA, how-

ever, significantly outperform UNIF for MG, TT and LN, and get the best performance

compared to UNIF for T1 among five datasets.

6.3 Limitation of Unweighted Subsample Estimator

The squared bias of the unweighted subsample estimator βu is shown in Figure 5. However, the empirical variance and mean-squared error of βu are not reported here since they look similar to those of the weighted subsample estimator β.

For LN, TT and MG, the squared bias does not decrease for the various non-uniform subsampling methods as the subsample size increases. Thus, we cannot control the bias of Xβu relative to Xβols for non-uniform subsampling methods. For UNIF, βu is identical to β, so the squared bias decreases as the subsample size increases. However, the bias of Xβu relative to Xβ obviously decreases as r increases, which is consistent with the fact that βu is unbiased for β.

Additionally, we conduct some numerical comparisons between Xβ and Xβu

in pre-

dicting Xβ. We report the results of the BLEV and PL methods in Table 2 and omit those of the other subsampling methods because of their similarity.


Table 1: The relative MSE comparison of various methods to UNIF of Xβ for predicting Xβols among the GA, MG, LN, TT and T1 data.

        r       100       200       400       800      1600      3200      6400
GA  BLEV      0.988     0.968     0.999     1.003     0.986     0.983     0.990
    SLEV      0.981     0.971     0.998     0.971     0.992     0.969     0.992
    OPT       1.001     1.017     1.010     0.996     1.016     1.021     0.994
    PL        1.002     1.006     1.002     1.010     1.000     1.014     1.009
MG  BLEV      0.377     0.692     0.884     0.943     0.943     0.989     1.003
    SLEV      0.336     0.567     0.691     0.725     0.722     0.747     0.774
    OPT       0.346     0.557     0.642     0.659     0.668     0.704     0.708
    PL        0.353     0.559     0.652     0.660     0.665     0.704     0.692
LN  BLEV      0.227     0.321     0.423     0.536     0.645     0.741     0.827
    SLEV      0.281     0.263     0.314     0.396     0.471     0.520     0.582
    OPT       0.374     0.358     0.336     0.340     0.363     0.360     0.384
    PL        0.387     0.337     0.345     0.331     0.360     0.355     0.387
TT  BLEV      0.045     0.093     0.211     0.371     0.634     0.826     0.917
    SLEV      0.040     0.050     0.098     0.169     0.273     0.341     0.390
    OPT       0.081     0.062     0.080     0.116     0.172     0.201     0.223
    PL        0.072     0.060     0.081     0.113     0.165     0.208     0.223
T1  BLEV    3.41e-4  2.05e-05  2.56e-05  6.96e-05  1.15e-04  1.84e-04  4.77e-04
    SLEV    2.91e-4  3.90e-05  2.83e-06  5.85e-06  1.10e-05  1.82e-05  4.15e-05
    OPT     5.25e-3  5.84e-04  1.98e-04  6.95e-05  2.93e-06  4.12e-06  8.67e-06
    PL      1.03e-3  2.43e-05  5.31e-06  4.15e-06  5.01e-06  7.53e-06  1.64e-05



Figure 5: Empirical squared biases of Xβu

for predicting Xβols and Xβ, respectively. From

top to bottom: upper panels are results for predicting Xβols, and lower panels results for

predicting Xβ. From left to right: TT, MG and LN data, respectively.


Table 2: Ratios between the MSEs of Xβu and those of Xβ for predicting Xβ, among MG, LN and TT, respectively (all values are less than 1).

        r     100     200     400     800    1600    3200    6400
MG  BLEV    0.538   0.524   0.538   0.532   0.522   0.542   0.586
    PL      0.741   0.813   0.844   0.860   0.870   0.872   0.894
LN  BLEV    0.222   0.133   0.092   0.104   0.141   0.229   0.406
    PL      0.782   0.732   0.662   0.628   0.584   0.598   0.701
TT  BLEV    0.091   0.071   0.065   0.078   0.097   0.148   0.221
    PL      0.533   0.507   0.507   0.547   0.602   0.667   0.749

Table 2 suggests that (1) βu is more efficient than β in all cases, which is consistent with Corollary 3, and (2) the advantage of βu relative to β is more pronounced for BLEV than for PL.

Thus, we have the following empirical conclusion. Although βu may not be a good choice for approximating βols, since its bias cannot be controlled, it is better (though riskier) to choose βu for estimating β if one is confident that the dataset satisfies the linear model (1).
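For comparison with the sketch given after equation (28), a minimal sketch of the unweighted estimator, matching $\tilde\beta^u = (X^{*T}X^*)^{-1}X^{*T}y^*$ from the proof of Lemma 2: the subsample is still drawn with the non-uniform probabilities, but the 1/π weights are dropped in the fit (the helper name is ours).

    import numpy as np

    def unweighted_subsample_ols(X, y, pi, r, rng):
        # Same probability-proportional draw, but a plain OLS fit on the
        # subsample: beta_u = (X*^T X*)^{-1} X*^T y*, which targets beta
        # rather than the full sample OLS estimator.
        idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
        beta_u, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        return beta_u

Under UNIF the weighted and unweighted estimators coincide, which is why only the non-uniform methods exhibit the uncontrolled bias discussed above.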

6.4 Computational Cost

We report the running time for the synthetic data TT in Table 3, where BLEV is based on

exact leverage scores by QR decomposition, ALEVCW and ALEVGP are based on the ap-

proximate leverage scores by CW projection in Clarkson and Woodruff (2013) and Gaussian

projection in Drineas et al. (2012), respectively, with r1 = 50 and r2 = 11 ≈ log(50, 000),

and T0 denotes the time cost of obtaining the sampling probabilities. All values in Table 3 were computed using R on a PC with a 3 GHz Intel i7 processor, 8 GB of memory, and the OS X operating system.

From Table 3, firstly, since computing (approximate) leverage scores takes the most


time for BLEV, ALEVCW and ALEVGP, PL greatly reduces the running time, both system time and user time. Secondly, although ALEVCW and ALEVGP greatly reduce the computational cost compared to BLEV, PL still costs much less than both. Thus, PL has a notable computational advantage.

In addition, we also report the time cost for two larger design matrices X, of sizes 5M × 50 and 50K × 1,000 respectively, in Table 3. We see that PL saves even more computational cost relative to the other methods when p is large.
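The scalability gap can be seen directly from how the sampling probabilities are computed; the sketch below (ours, in Python rather than the R used for Table 3) contrasts exact leverage scores, which require a thin QR decomposition costing O(np^2), with the PL probabilities, which need only row norms costing O(np).

    import time
    import numpy as np

    def blev_probs(X):
        # Exact leverage scores via a thin QR decomposition: h_ii = ||Q_i.||^2.
        Q, _ = np.linalg.qr(X, mode='reduced')
        h = np.sum(Q ** 2, axis=1)
        return h / h.sum()

    def pl_probs(X):
        # Predictor-length probabilities: proportional to the L2 norm of each row.
        norms = np.linalg.norm(X, axis=1)
        return norms / norms.sum()

    if __name__ == "__main__":
        X = np.random.default_rng(0).standard_normal((50_000, 50))
        for f in (blev_probs, pl_probs):
            t0 = time.perf_counter()
            f(X)
            print(f.__name__, round(time.perf_counter() - t0, 4), "seconds")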

7 Real Data Example

RNA-seq, an ultra-high throughput transcriptome sequencing technology, produces millions

of short reads. After mapping to the genome, it outputs read counts at each nucleotide of

every gene. To study the transcriptome of Drosophila melanogaster throughout develop-

ment, Graveley et al. (2011) conducted RNA-Seq experiments using RNA isolated from 30

whole-body samples representing distinct stages of development.

By calculating the correlation of gene expression levels, the authors showed that gene expression at each developmental stage is highly correlated with that at its adjacent stages. We are interested in investigating how much of the variation in RNA-seq read counts at the 24th hour can be explained by those at the early developmental stages, i.e., the 2nd hour, the 4th hour, . . ., the 10th hour, using linear regression. For this linear regression problem,

there are read counts on 4,227,667 nucleotides of 542 genes. Thus, the full sample with size

n = 4, 227, 667 is assumed to be from the following linear model:

\[
y_i = x_i^T\beta + \varepsilon_i, \qquad i = 1, \cdots, n, \tag{29}
\]
where $x_i = (x_{i,1}, \ldots, x_{i,5})^T$ are the read counts for the i-th nucleotide at developmental

embryonic stages of the 2nd hour, the 4th hour, . . ., the 10th hour, respectively, and yi are

the read counts for the i-th nucleotide at the 24th hour.
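As an illustration only, the self-contained sketch below shows how model (29) could be fitted by PL-weighted subsampling; the array counts is a hypothetical stand-in for the actual RNA-seq read counts (which we do not reproduce here), and every name below is our own.

    import numpy as np

    # Hypothetical stand-in for the read-count matrix (n nucleotides x 6 stages);
    # the real dataset has n = 4,227,667 nucleotides.
    rng = np.random.default_rng(0)
    counts = rng.poisson(5.0, size=(10_000, 6)).astype(float)
    X = counts[:, :5]            # predictors: counts at the 2nd, 4th, ..., 10th hours
    y = counts[:, 5]             # response: counts at the 24th hour

    # PL probabilities and one weighted subsample fit of model (29) with r = 1000.
    pi = np.linalg.norm(X, axis=1)
    pi /= pi.sum()
    idx = rng.choice(X.shape[0], size=1000, replace=True, p=pi)
    w = np.sqrt(1.0 / pi[idx])
    beta_pl, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)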

We plot boxplots of the logarithm of the sampling probabilities of BLEV, OPT and PL in the 1st subfigure of Figure 6. The sampling probabilities of OPT and PL have larger means and look more concentrated than those of BLEV, while the sampling probabilities of OPT and PL are close to each other. This observation is similar to that for the synthetic data.


Table 3: The running time, in CPU seconds, for computing β with BLEV, ALEV and PL for the TT dataset, where ALEVGP and ALEVCW denote ALEV by Gaussian and CW projections respectively, T0 denotes the time to obtain the sampling probabilities, and Tols denotes the time to obtain the full sample OLS estimator.

50K × 50 design matrix X
                 System Time (Tols = 0.024)             User Time (Tols = 0.311)
      r      T0    100    400   1600   6400       T0    100    400   1600   6400
BLEV      0.030  0.031  0.031  0.033  0.037    0.656  0.660  0.662  0.672  0.727
ALEVGP    0.020  0.021  0.021  0.027  0.061    0.690  0.694  0.696  0.706  0.755
ALEVCW    0.046  0.047  0.047  0.049  0.053    0.296  0.300  0.302  0.312  0.421
PL        0.013  0.014  0.015  0.015  0.020    0.238  0.242  0.245  0.252  0.299

5M × 50 design matrix X
                 System Time (Tols = 3.31)              User Time (Tols = 22.97)
      r      T0    100    400   1600   6400       T0    100    400   1600   6400
BLEV       8.97   9.31   9.32   9.32   9.37    22.97  23.09  23.09  23.11  23.28
ALEVGP     6.53   6.87   6.88   6.88   6.93    69.28  69.40  69.40  69.42  69.47
ALEVCW     4.05   4.39   4.40   4.40   4.45    33.01  33.13  33.13  33.15  33.20
PL         2.30   2.64   2.65   2.65   2.70    19.12  19.24  19.24  19.26  19.31

50K × 1K design matrix X
                 System Time (Tols = 1.211)             User Time (Tols = 86.08)
      r      T0    100    400   1600   6400       T0    100    400   1600   6400
BLEV      1.267  1.356  1.391  1.439  1.489    152.3  159.2  161.9  165.6  174.2
ALEVGP    0.665  0.754  0.789  0.837  0.887    89.25  96.15  98.85  102.6  111.2
ALEVCW    0.480  0.569  0.604  0.652  0.702    6.927  13.83  16.53  20.23  28.83
PL        0.213  0.363  0.451  0.387  0.452    1.353  8.145  10.86  14.45  23.22



Figure 6: Performance comparison for the real data example among Xβ by various sub-

sampling methods, UNIF, BLEV, OPT and PL, and Xβu

by BLEV denoted as UW. From

left to right: Boxplots of the logarithm of sampling probabilities among BLEV, OPT and

PL in the 1st subfigure, the logarithm of variance in the 2nd subfigure, the logarithm of

squared bias in the 3rd subfigure, and the logarithm of MSE in the 4th subfigure.

We study the performance of the weighted subsample estimator β by applying various subsampling methods to the full sample B = 1000 times. Although the unweighted subsample estimator βu is not recommended in practice, as its bias for OLS approximation cannot be controlled by increasing the subsample size, we also show the results of βu based on BLEV for this real dataset. Different subsample sizes are chosen: r = 25, 50, 100, 250, 500, 1000, 2500, 5000.

The resulting variance, squared bias and MSE are plotted in Figure 6. Inspecting Figure 6, we obtain observations similar to those from the simulation studies. Firstly, the bias of β is small enough to be ignored compared to the variance. Secondly, the MSEs of β for OPT and PL are about 30% smaller than those for BLEV as the subsample size grows, while the MSEs of β for OPT and PL are almost identical for this real dataset. Thirdly, βu has obviously smaller variance values than β, but its MSEs quickly flatten out as r increases because of its uncontrollable bias. These observations are consistent with our theoretical and empirical results on βu.

We report the running time for computing β with BLEV and PL in Table 4. These values were computed using R on a PC with a 3 GHz Intel i7 processor, 8 GB of memory, and the OS X operating system. The results show that PL has an obvious advantage over BLEV in terms of system time and a slightly lower computational cost than BLEV in terms of user time.


Table 4: The running time, in CPU seconds, for computing β with BLEV and PL. T0 is the running time for computing the sampling probabilities.

              r     T0     25     50    100    250    500   1000   2500   5000
System Time
          BLEV   0.278  0.295  0.293  0.326  0.288  0.318  0.291  0.278  0.414
          PL     0.235  0.251  0.252  0.245  0.245  0.245  0.253  0.265  0.344
User Time
          BLEV   10.85  10.85  10.85  10.86  10.85  10.88  10.87  11.01  11.18
          PL      9.83   9.83   9.83   9.83   9.83   9.84   9.86  10.00  10.06

On the other hand, we investigate the relative-error approximation for this real dataset. We calculate the residual sum of squares $\|y - X\hat\beta_{ols}\|^2$ for the full sample OLS estimator, and the empirical value of the expected residual sum of squares $E\|y - X\tilde\beta\|^2$, i.e., $\frac{1}{B}\sum_{b=1}^{B}\|y - X\tilde\beta_b\|^2$, where $\tilde\beta_b$ is the weighted subsample estimator on the $b$th subsample. We use the following quantity $R_e$ to measure the relative-error approximation:
\[
R_e = \frac{\frac{1}{B}\sum_{b=1}^{B}\|y - X\tilde\beta_b\|^2}{\|y - X\hat\beta_{ols}\|^2} - 1.
\]
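A minimal Python sketch (ours) of this empirical relative-error measure; betas is assumed to be a B × p array of weighted subsample estimates $\tilde\beta_b$, for instance produced by repeating Algorithm 1.

    import numpy as np

    def relative_error(X, y, betas):
        # Re = (average subsample RSS) / (full-sample OLS RSS) - 1.
        beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss_full = np.sum((y - X @ beta_ols) ** 2)
        rss_sub = np.mean([np.sum((y - X @ b) ** 2) for b in betas])
        return rss_sub / rss_full - 1.0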

The Re values of the unweighted subsample estimator βu are also calculated in the same way. The results are reported in Table 5. We observe that (1) the Re values of βu do not decrease as the subsample size increases, (2) OPT and PL have the best performance in terms of the relative error, and (3) even the UNIF method attains a very good relative-error approximation when the subsample size is sufficiently large. These observations empirically show that the behavior of the relative-error approximation agrees with the asymptotic results in Theorem 1.


Table 5: The relative-error approximation comparison among β by various subsampling methods, UNIF, BLEV, OPT and PL, and βu by BLEV, denoted as UW.

     r       25      50     100     250     500    1000    2500      5000
BLEV      0.210   0.105  0.0584  0.0189 0.00943 0.00444 0.00201   0.00102
UW        0.147  0.0730  0.0397  0.0270  0.0212  0.0195  0.0176    0.0170
UNIF      0.403   0.163  0.0668  0.0249  0.0133 0.00646 0.00260   0.00124
OPT       0.224   0.106  0.0487  0.0183 0.00956 0.00431 0.00175  0.000888
PL        0.277   0.113  0.0462  0.0191 0.00773 0.00402 0.00182  0.000862

8 Discussion

In this paper, we have studied two classes of subsample-based estimation algorithms, the weighted estimation algorithm and the unweighted estimation algorithm, for fitting linear models to large sample data by subsampling. We have established asymptotic consistency and normality of their resulting subsample estimators, the weighted subsample estimator β and the unweighted subsample estimator βu, respectively. Based on the asymptotic results for β, we have proposed two optimal criteria for the weighted estimation algorithm, one for approximating the full sample OLS estimator and one for estimating the coefficients. Furthermore, two optimal subsampling methods are constructed. In particular, PL subsampling is constructed from the L2 norm of the predictors rather than their leverage scores. Compared with BLEV and OPT, PL is scalable in both the sample size n and the predictor dimension p.

In addition, we have argued that the unweighted subsample estimator βu is not ideal for approximating the full sample OLS estimator, as its bias cannot be controlled. However, it is more efficient than the weighted subsample estimator β for estimating the coefficients. Synthetic data and a real data example are used to demonstrate the performance of our proposed methods.


Appendix: Proofs

A Two Lemmas

First, we provide the asymptotic results for the weighted subsample estimator β and the unweighted subsample estimator βu, without the assumed model (1), in Lemmas 1 and 2, respectively. Their proofs are given in the Supplementary Material. From Lemmas 1 and 2, we then prove Theorems 1 and 4, respectively.

Condition C1∗:
\[
\frac{1}{n^2}\sum_{i=1}^{n}\pi_i\left(\frac{z_i z_i^T}{\pi_i} - M_Z\right)^2 = O_p(1), \tag{A.1}
\]
and there exists some δ > 0 such that
\[
\frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|e_i x_i\|^{2+\delta}}{\pi_i^{1+\delta}} = O_p(1), \tag{A.2}
\]
where $M_Z = \sum_{i=1}^{n} z_i z_i^T$ and $e_i = y_i - x_i^T\hat\beta_{ols}$.

Condition C3∗:
\[
\sum_{i=1}^{n}\pi_i\left(z_i z_i^T - M_Z^u\right)^2 = O_p(1), \tag{A.3}
\]
\[
\sum_{i=1}^{n}\pi_i^{1+\delta}\|e_i^u x_i\|^{2+\delta} = O_p(1), \tag{A.4}
\]
where $M_Z^u = \sum_{i=1}^{n}\pi_i z_i z_i^T$, and $e_i^u = y_i - x_i^T\hat\beta_{wls}$ with $\hat\beta_{wls} = (X^T\Phi X)^{-1}X^T\Phi y$.

Note: (A.1) and (A.3) are imposed on the predictors and the response variable, and (A.2) and (A.4) on the predictors and the residuals of the full sample fits, while C1 and C3 are imposed only on the predictors.

Lemma 1. If Condition C1∗ holds, then given $\mathcal{F}_n$, we have
\[
\mathbf{V}^{-1/2}(\tilde\beta - \hat\beta_{ols}) \xrightarrow{\;L\;} N(0, \mathbf{I}) \quad \text{as } r \to \infty, \tag{A.5}
\]
where $\mathbf{V} = M_X^{-1}\mathbf{V}_c M_X^{-1}$ and $\mathbf{V}_c = r^{-1}\sum_{i=1}^{n}\frac{e_i^2}{\pi_i}x_i x_i^T$.
Moreover,
\[
\mathbf{V} = O(r^{-1}), \tag{A.6}
\]
and
\[
E(\tilde\beta - \hat\beta_{ols}\,|\,\mathcal{F}_n) = O(r^{-1}). \tag{A.7}
\]

Lemma 2. If Condition C3∗ holds, then given $\mathcal{F}_n$,
\[
(\mathbf{V}^u)^{-1/2}(\tilde\beta^u - \hat\beta_{wls}) \xrightarrow{\;L\;} N(0, \mathbf{I}) \quad \text{as } r \to \infty, \tag{A.8}
\]
where $\mathbf{V}^u = (M_X^u)^{-1}\mathbf{V}_c^u(M_X^u)^{-1}$, $M_X^u = \sum_{i=1}^{n}\pi_i x_i x_i^T$, and $\mathbf{V}_c^u = r^{-1}\sum_{i=1}^{n}\pi_i(e_i^u)^2 x_i x_i^T$.
In addition, we have
\[
E(\tilde\beta^u - \hat\beta_{ols}\,|\,\mathcal{F}_n) = \hat\beta_{wls} - \hat\beta_{ols} + O_p(r^{-1}). \tag{A.9}
\]

B Proof of Theorem 1

If we show that C1 implies C1∗ under the linear model (1), then Theorem 1 follows from Lemma 1.

First, we will verify (A.1) in C1∗.

It is easy to see that
\[
\frac{1}{n^2}\sum_{i=1}^{n}\pi_i\left(\frac{z_i z_i^T}{\pi_i} - M_Z\right)^2
= \frac{1}{n^2}\sum_{i=1}^{n}\frac{(z_i z_i^T)^2}{\pi_i} - \frac{1}{n^2}M_Z^2
= \begin{pmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{pmatrix}, \tag{A.10}
\]
where
\[
A_{11} = \frac{1}{n^2}\sum_{i=1}^{n}\frac{1}{\pi_i}\left[(x_i x_i^T)^2 + x_i x_i^T y_i^2\right]
- \frac{1}{n^2}\left[\Big(\sum_{i=1}^{n} x_i x_i^T\Big)^2 + \Big(\sum_{i=1}^{n} x_i y_i\Big)\Big(\sum_{i=1}^{n} x_i^T y_i\Big)\right],
\]
\[
A_{12} = \frac{1}{n^2}\sum_{i=1}^{n}\frac{1}{\pi_i}\left(x_i x_i^T x_i y_i + x_i y_i^3\right)
- \frac{1}{n^2}\left[\Big(\sum_{i=1}^{n} x_i x_i^T\Big)\Big(\sum_{i=1}^{n} x_i y_i\Big) + \Big(\sum_{i=1}^{n} x_i y_i\Big)\Big(\sum_{i=1}^{n} y_i^2\Big)\right],
\]
and
\[
A_{22} = \frac{1}{n^2}\sum_{i=1}^{n}\frac{1}{\pi_i}\left(x_i^T x_i\, y_i^2 + y_i^4\right)
- \frac{1}{n^2}\left[\Big(\sum_{i=1}^{n} y_i x_i^T\Big)\Big(\sum_{i=1}^{n} x_i y_i\Big) + \Big(\sum_{i=1}^{n} y_i^2\Big)^2\right].
\]
Now we investigate the order of $A_{11}$. Under the linear model (1), we have
\[
\frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T y_i^2}{\pi_i}
- \frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T (x_i^T\beta)^2 + \sigma^2 x_i x_i^T}{\pi_i}
\to 0 \quad \text{in probability}. \tag{A.11}
\]
By Holder's inequality, we get that
\[
\frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T (x_i^T\beta)^2}{\pi_i}
\le \frac{1}{n^2}\sum_{i=1}^{n}\frac{\|\beta\|^2\|x_i\|^2 x_i x_i^T}{\pi_i} = O_p(1). \tag{A.12}
\]


So combining (A.11), (A.12) and C1, i.e., $\frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T}{\pi_i} = O_p(1)$, we have
\[
\frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T y_i^2}{\pi_i} = O_p(1). \tag{A.13}
\]
Meanwhile, $\frac{1}{n}\sum_{i=1}^{n}x_i y_i - \frac{1}{n}\sum_{i=1}^{n}x_i x_i^T\beta \to 0$ in probability under the linear model (1), so from C2 we have that
\[
\frac{1}{n^2}\Big(\sum_{i=1}^{n}x_i y_i\Big)\Big(\sum_{i=1}^{n}x_i^T y_i\Big) = O_p(1). \tag{A.14}
\]
Combining (A.13) and (A.14), we have $A_{11} = O_p(1)$. Analogously, we can show that both $A_{12}$ and $A_{22}$ in (A.10) are $O_p(1)$. Thus, (A.1) in C1∗ is verified.

Second, we will verify (A.2) in C1∗. We have
\begin{align*}
\frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|e_i x_i\|^{2+\delta}}{\pi_i^{1+\delta}}
&= \frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|[\varepsilon_i + x_i^T(\hat\beta_{ols}-\beta)]x_i\|^{2+\delta}}{\pi_i^{1+\delta}} \\
&\le \frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|\varepsilon_i x_i\|^{2+\delta}}{\pi_i^{1+\delta}}
+ \frac{\|\hat\beta_{ols}-\beta\|^{2+\delta}}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|x_i\|^{4+2\delta}}{\pi_i^{1+\delta}} \\
&\le \frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|\varepsilon_i x_i\|^{2+\delta}}{\pi_i^{1+\delta}}
+ \|\hat\beta_{ols}-\beta\|^{2+\delta}\Big(\sum_{i=1}^{n}\|x_i\|^2\Big)\,
\frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|x_i\|^{2+2\delta}}{\pi_i^{1+\delta}} \\
&= \frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{|\varepsilon_i|^{2+\delta}\|x_i\|^{2+\delta}}{\pi_i^{1+\delta}} + o_p(1) = O_p(1),
\end{align*}
where the first inequality is based on the triangle inequality and Holder's inequality, the second inequality follows from $\|x_i\|^2 \le \sum_{i=1}^{n}\|x_i\|^2$, the penultimate equality holds because of (9) in C1 and the fact that $\hat\beta_{ols} - \beta = O_p(n^{-1/2})$ under the assumed linear model, and the last equality holds since $\frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{|\varepsilon_i|^{2+\delta}\|x_i\|^{2+\delta}}{\pi_i^{1+\delta}} - \frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|x_i\|^{2+\delta}}{\pi_i^{1+\delta}}$ goes to 0 in probability under the linear model assumption.

Thus, Theorem 1 is proved.

C Proof of Theorem 2

By Holder's inequality,
\[
E[\mathrm{tr}(\mathbf{V}_c)] = r^{-1}\sum_{i=1}^{n}\frac{(1-h_{ii})\|x_i\|^2}{\pi_i}
= r^{-1}\sum_{i=1}^{n}\frac{(1-h_{ii})\|x_i\|^2}{\pi_i}\sum_{i=1}^{n}\pi_i
\ge r^{-1}\Big(\sum_{i=1}^{n}\sqrt{(1-h_{ii})\|x_i\|^2}\Big)^2,
\]
where the equality holds if and only if $\sqrt{\frac{(1-h_{ii})\|x_i\|^2}{\pi_i}} \propto \sqrt{\pi_i}$. Thus, the proof is completed.
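The equality condition above determines the minimizing probabilities up to normalization, namely $\pi_i \propto \sqrt{1-h_{ii}}\,\|x_i\|$. A minimal Python sketch of this computation (ours; the clipping guards against tiny negative values caused by floating-point error):

    import numpy as np

    def opt_probs(X):
        # Leverage scores via a thin QR decomposition, h_ii = ||Q_i.||^2, then
        # pi_i proportional to sqrt(1 - h_ii) * ||x_i||, the choice singled out
        # by the equality condition in the proof above.
        Q, _ = np.linalg.qr(X, mode='reduced')
        h = np.sum(Q ** 2, axis=1)
        w = np.sqrt(np.clip(1.0 - h, 0.0, None)) * np.linalg.norm(X, axis=1)
        return w / w.sum()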

D Proof of Theorem 4

The proof of Theorem 4 proceeds in the same fashion as that of Theorem 1 by showing

that condition C3 implies C3∗ under the linear model framework.

E Proof of Theorem 3

The difference between $\tilde\beta$ and $\beta$ is expressed as follows:
\[
\tilde\beta - \beta = M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}\varepsilon^*) + (\hat M_X^{-1} - M_X^{-1})(r^{-1}X^{*T}\Phi^{*-1}\varepsilon^*), \tag{A.15}
\]
where $\varepsilon^* = y^* - X^*\beta$.

We follow the steps of the proof of Lemma 1 to show that $(\hat M_X^{-1} - M_X^{-1})(r^{-1}X^{*T}\Phi^{*-1}\varepsilon^*) = O_p(r^{-1})$ and that $M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}\varepsilon^*)$ converges to a normal distribution. However, the details are omitted here because of the similarity.

Unlike $Var(r^{-1}n^{-1}X^{*T}\Phi^{*-1}e^*\,|\,\mathcal{F}_n)$ in the proof of Lemma 1, there is a point worth noting, namely that
\begin{align*}
Var(r^{-1}n^{-1}X^{*T}\Phi^{*-1}\varepsilon^*\,|\,X)
&= E\big[Var(r^{-1}n^{-1}X^{*T}\Phi^{*-1}\varepsilon^*\,|\,\mathcal{F}_n)\,\big|\,X\big]
+ Var\big[E(r^{-1}n^{-1}X^{*T}\Phi^{*-1}\varepsilon^*\,|\,\mathcal{F}_n)\,\big|\,X\big] \\
&= E\left[r^{-1}\frac{1}{n^2}\left(\sum_{i=1}^{n}\frac{\varepsilon_i^2 x_i x_i^T}{\pi_i}
- \Big(\sum_{i=1}^{n}\varepsilon_i x_i\Big)\Big(\sum_{i=1}^{n}\varepsilon_i x_i\Big)^T\right)\Bigg|\,X\right]
+ Var\left[\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i x_i\,\Bigg|\,X\right] \\
&= r^{-1}\frac{1}{n^2}\sum_{i=1}^{n}\frac{\sigma^2 x_i x_i^T}{\pi_i}
- r^{-1}\frac{1}{n^2}\sigma^2\sum_{i=1}^{n}x_i x_i^T
+ \frac{1}{n^2}\sigma^2\sum_{i=1}^{n}x_i x_i^T,
\end{align*}
where the second equality follows from the properties of Hansen-Hurwitz estimates and the third by taking the expectation and variance under the linear model assumption.
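As a numerical sanity check of this decomposition (our sketch, not part of the paper), the following simulates both the model errors and the with-replacement subsample for a small synthetic X and compares the empirical covariance of $r^{-1}n^{-1}X^{*T}\Phi^{*-1}\varepsilon^*$ with the closed form above; the returned value should be small relative to the entries of the matrix.

    import numpy as np

    def check_variance_decomposition(n=2000, p=3, r=50, sigma=1.0, reps=20000, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.standard_normal((n, p))
        pi = np.linalg.norm(X, axis=1)
        pi /= pi.sum()                              # any fixed sampling probabilities
        draws = np.empty((reps, p))
        for t in range(reps):
            eps = sigma * rng.standard_normal(n)    # fresh model errors each replicate
            idx = rng.choice(n, size=r, replace=True, p=pi)
            draws[t] = (X[idx] * (eps[idx] / pi[idx])[:, None]).sum(axis=0) / (r * n)
        empirical = np.cov(draws, rowvar=False)
        M = X.T @ X
        theory = (sigma ** 2 / n ** 2) * ((X.T @ (X / pi[:, None])) / r - M / r + M)
        return np.max(np.abs(empirical - theory))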


F Proof of Corollary 2

By Holder's inequality,
\begin{align*}
\mathrm{tr}(\mathbf{V}_{c0})
&= r^{-1}\sigma^2\sum_{i=1}^{n}\frac{\|x_i\|^2}{\pi_i}\sum_{i=1}^{n}\pi_i
+ (1 - r^{-1})\sigma^2\sum_{i=1}^{n}\|x_i\|^2 \\
&\ge r^{-1}\sigma^2\Big(\sum_{i=1}^{n}\sqrt{\|x_i\|^2}\Big)^2
+ (1 - r^{-1})\sigma^2\sum_{i=1}^{n}\|x_i\|^2,
\end{align*}
where the equality holds if and only if $\sqrt{\frac{\|x_i\|^2}{\pi_i}} \propto \sqrt{\pi_i}$. Thus, the proof is completed.
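Spelled out, the equality condition gives the predictor-length probabilities (the explicit normalization below is ours):
\[
\pi_i^{PL} \;=\; \frac{\|x_i\|}{\sum_{j=1}^{n}\|x_j\|}, \qquad i = 1,\cdots,n.
\]
Computing these probabilities requires only the n row norms of X, i.e. O(np) work, which is the source of the scalability of PL noted in Section 6.4.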

G Proof of Theorem 5

Analogous to the proof of Theorem 3, Theorem 5 follows immediately from the proof of Lemma 2 by replacing $\hat\beta_{wls}$ with $\beta$. However, in this proof we need to note that
\begin{align*}
Var(r^{-1}X^{*T}\varepsilon^*\,|\,X)
&= E\big[Var(r^{-1}X^{*T}\varepsilon^*\,|\,\mathcal{F}_n)\,\big|\,X\big]
+ Var\big[E(r^{-1}X^{*T}\varepsilon^*\,|\,\mathcal{F}_n)\,\big|\,X\big] \\
&= E\left[r^{-1}\sum_{i=1}^{n}\pi_i\varepsilon_i^2 x_i x_i^T
- r^{-1}\Big(\sum_{i=1}^{n}\pi_i\varepsilon_i x_i\Big)\Big(\sum_{i=1}^{n}\pi_i\varepsilon_i x_i\Big)^T\Bigg|\,X\right]
+ Var\left[\sum_{i=1}^{n}\pi_i\varepsilon_i x_i\,\Bigg|\,X\right] \\
&= r^{-1}\sigma^2\sum_{i=1}^{n}\pi_i x_i x_i^T
- r^{-1}\sigma^2\sum_{i=1}^{n}\pi_i^2 x_i x_i^T
+ \sigma^2\sum_{i=1}^{n}\pi_i^2 x_i x_i^T.
\end{align*}

H Proof of Corollary 3

Comparing $\mathbf{V}_0$ with $\mathbf{V}_0^u$, we have
\begin{align*}
\mathbf{V}_0 - \mathbf{V}_0^u
&= r^{-1}\sigma^2(X^TX)^{-1}X^T\Phi^{-1}X(X^TX)^{-1} + (1-r^{-1})\sigma^2(X^TX)^{-1} \\
&\quad - r^{-1}\sigma^2(X^T\Phi X)^{-1} - (1-r^{-1})\sigma^2(X^T\Phi X)^{-1}X^T\Phi^2 X(X^T\Phi X)^{-1} \\
&= r^{-1}\sigma^2\big[(X^TX)^{-1}X^T\Phi^{-1}X(X^TX)^{-1} - (X^T\Phi X)^{-1}\big]
+ (1-r^{-1})\sigma^2\big[(X^TX)^{-1} - (X^T\Phi X)^{-1}X^T\Phi^2 X(X^T\Phi X)^{-1}\big] \\
&= r^{-1}\sigma^2\big[(X^TX)^{-1}X^T\Phi^{-1}X(X^TX)^{-1} - (X^T\Phi X)^{-1}\big] + O_p(n^{-1}),
\end{align*}
where the last equality follows from the facts that $(X^TX)^{-1}X^T\Phi^{-1}X(X^TX)^{-1} - (X^T\Phi X)^{-1} = O_p(1)$ and $(X^TX)^{-1} - (X^T\Phi X)^{-1}X^T\Phi^2 X(X^T\Phi X)^{-1} = O_p(n^{-1})$. Both facts follow from the conditions that $n^{-2}X^T\Phi^{-1}X = O_p(1)$ in C1 and $X^T\Phi X = O_p(1)$ in C3, respectively.


On the other hand, it is obvious that $(X^T\Phi X)^{-1} \le (X^TX)^{-1}X^T\Phi^{-1}X(X^TX)^{-1}$ by matrix operations, where the equality holds if and only if $\pi_i = \frac{1}{n}$ for $i = 1, \cdots, n$. Thus, $\mathbf{V}_0 \ge \mathbf{V}_0^u$ when $r = o(n)$, and the proof is completed.

References

Avron, H., P. Maymounkov, and S. Toledo (2010). Blendenpik: Supercharging LAPACK’s

least-squares solver. SIAM Journal on Scientific Computing 32, 1217–1236.

Buhlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data. Springer, New York.

Christensen, R. (1996). Plane Answers to Complex Questions: The Theory of Linear

Models. Springer, New York.

Clarkson, K. and D. Woodruff (2013). Low rank approximation and regression in input

sparsity time. In Proc. of the 45th STOC, pp. 81–90.

Cohen, M., Y. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford (2014). Uniform sampling

for matrix approximation. arXiv:1408.5099.

Dhillon, P., Y. Lu, D. Foster, and L. Ungar (2013). New subsampling algorithms for

fast least squares regression. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani,

and K. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, pp.

360–368. Curran Associates, Inc.

Drineas, P., M. Magdon-Ismail, M. Mahoney, and D. Woodruff (2012). Fast approximation

of matrix coherence and statistical leverage. Journal of Machine Learning Research 13,

3475–3506.

Drineas, P., M. Mahoney, and S. Muthukrishnan (2006). Sampling algorithms for l2 regres-

sion and applications. In Proc. of the 17th Annual ACM-SIAM Symposium on Discrete

Algorithms, pp. 1127–1136.

Drineas, P., M. Mahoney, and S. Muthukrishnan (2008). Relative-error CUR matrix de-

composition. SIAM Journal on Matrix Analysis and Applications 30, 844–881.

Drineas, P., M. Mahoney, S. Muthukrishnan, and T. Sarlos (2011). Faster least squares

approximation. Numerische Mathematik 117, 219–249.


Golub, G. and C. Van Loan (1996). Matrix Computations. Johns Hopkins University Press, Baltimore.

Graveley, B., A. Brooks, J. Carlson, M. Duff, and et al. (2011). The developmental tran-

scriptome of Drosophila melanogaster. Nature 471, 473–479.

Hansen, M. and W. Hurwitz (1943). On the theory of sampling from a finite population.

Annals of Mathematical Statistics 14, 333–362.

Lehmann, E. and G. Casella (2003). Theory of Point Estimation, 4th ed. Springer, New York.

Ma, P., M. Mahoney, and B. Yu (2014). A statistical perspective on algorithmic leveraging. In Proc. of the 31st ICML Conference, pp. 91–99.

Ma, P., M. Mahoney, and B. Yu (2015). A statistical perspective on algorithmic leveraging.

Journal of Machine Learning Research 16, 861–911.

Mahoney, M. and P. Drineas (2009). CUR matrix decompositions for improved data anal-

ysis. Proceedings of the National Academy of Sciences 106, 697–702.

McWilliams, B., G. Krummenacher, M. Lucic, and J. M. Buhmann (2014, June). Fast and

Robust Least Squares Estimation in Corrupted Linear Models. ArXiv e-prints .

Meinshausen, N. and B. Yu (2009). Lasso-type recovery of sparse representations for high-

dimensional data. Annals of Statistics 37, 246–270.

Rokhlin, V. and M. Tygert (2008). A fast randomized algorithm for overdetermined linear

least-squares regression. Proceedings of the National Academy of Sciences of the United

States of America 105 (36), 13212–13217.

Sarndal, C., B. Swensson, and J. Wretman (2003). Model Assisted Survey Sampling.

Springer, New York.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.


Supplementary Material of “Optimal Subsampling

Strategies for Large Sample Linear Regression”

1 The Relationship between Asymptotic Properties and Relative-

error Approximations

Drineas et al. (2011) gave an ε-dependent approximation error bound for β based on BLEV. Unlike their results, we show asymptotic consistency and normality in Theorem 1. Here we investigate the relationship between them in the following corollary.

Corollary 1. Under Condition C1, there exists an ε ∈ (0, 1) such that
\[
P\left(\|\tilde\beta - \hat\beta_{ols}\| \le \sqrt{\varepsilon}\,\|\hat\beta_{ols}\|\right) \to 1, \quad \text{as } r \to \infty. \tag{1}
\]
Moreover,
\[
P\left(\|y - X\tilde\beta\| \le (1+\varepsilon)\|y - X\hat\beta_{ols}\|\right) \to 1, \quad \text{as } r \to \infty. \tag{2}
\]

Proof. From our Theorem 1, $\|\tilde\beta - \hat\beta_{ols}\| = O_p(r^{-1/2})$ and $\hat\beta_{ols} = O_p(n^{-1/2})$, so there exists an ε ∈ (0, 1) such that the inequality in (1) holds.

On the other hand, it is easy to see that
\[
\|y - X\tilde\beta\|^2 \le \|y - X\hat\beta_{ols}\|^2 + \|X(\tilde\beta - \hat\beta_{ols})\|^2.
\]
Following our Theorem 1, we know $\tilde\beta - \hat\beta_{ols} = O_p(r^{-1/2})$, which implies that $\|X(\tilde\beta - \hat\beta_{ols})\|^2 = O_p(r^{-1}n)$. Since, under the assumed linear model, $\frac{1}{n}\|y - X\hat\beta_{ols}\| \ge c$ almost surely for some constant $c > 0$, there exists an ε ∈ (0, 1) such that the inequality in (2) is satisfied.

Corollary 1 is the asymptotic version of the relative-error approximation result (Theorem 1 of Drineas et al. (2011)). Similar to their result, ε, which denotes the relative error in Corollary 1, can be arbitrarily small as long as r is sufficiently large.


2 Condition C1 for Revised Leverage Score

If we enlarge the leverage score from $h_{ii}$ to $h_{ii}^{1/(1+\delta)}$, then $\pi_i = h_{ii}^{1/(1+\delta)}\big/\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}$. Thus, we have
\begin{align*}
M_1 &= \frac{1}{n^2}\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\sum_{i=1}^{n}\frac{\|x_i\|^2 x_i x_i^T}{(x_i^T M_X^{-1}x_i)^{1/(1+\delta)}}
\le \frac{\lambda_m}{n}\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\sum_{i=1}^{n}\|x_i\|^{2\delta/(1+\delta)}x_i x_i^T, \\
M_2 &= \frac{1}{n^2}\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\sum_{i=1}^{n}\frac{x_i x_i^T}{(x_i^T M_X^{-1}x_i)^{1/(1+\delta)}}
\le \frac{\lambda_m}{n}\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\sum_{i=1}^{n}\|x_i\|^{-2/(1+\delta)}x_i x_i^T, \\
M_3 &= \frac{1}{n^{2+\delta}}\Big(\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\Big)^{1+\delta}\sum_{i=1}^{n}\frac{\|x_i\|^{2+2\delta}}{x_i^T M_X^{-1}x_i}
\le \lambda_m\Big(\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\Big)^{1+\delta}\sum_{i=1}^{n}\|x_i\|^{2\delta}, \\
M_4 &= \frac{1}{n^{2+\delta}}\Big(\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\Big)^{1+\delta}\sum_{i=1}^{n}\frac{\|x_i\|^{2+\delta}}{x_i^T M_X^{-1}x_i}
\le \lambda_m\Big(\sum_{i=1}^{n}h_{ii}^{1/(1+\delta)}\Big)^{1+\delta}\frac{1}{n}\sum_{i=1}^{n}\|x_i\|^{\delta},
\end{align*}
where $\lambda_m$ is the largest eigenvalue of $\frac{1}{n}M_X$. Thus, it is sufficient to obtain the equations in C1 under the condition that $\{x_i\}$ are independent with bounded fourth moments.

3 Empirical Studies on Estimation of the True coefficient

Besides the empirical studies on βols approximation, we apply the five subsampling methods with different subsample sizes to each dataset. For each of the five X matrices with sample size n = 50,000 and predictor dimension p = 1000, one thousand datasets (X, yb), b = 1, . . . , B with B = 1000, were generated from the model yb = Xβ + εb, where εb ∼ N(0, σ2In) with σ = 10. We obtain βb based on the bth dataset.

We calculate the empirical variance ($V^u$), squared bias ($SB^u$) and mean squared error ($MSE^u$) of $X\tilde\beta$ for estimating $X\beta$ as follows:
\[
V^u = \frac{1}{B}\sum_{b=1}^{B}\left\|X(\tilde\beta_b - \bar\beta)\right\|^2, \tag{3}
\]
\[
SB^u = \left\|\frac{1}{B}\sum_{b=1}^{B}X(\tilde\beta_b - \beta)\right\|^2, \tag{4}
\]
\[
MSE^u = \frac{1}{B}\sum_{b=1}^{B}\left\|X(\tilde\beta_b - \beta)\right\|^2, \tag{5}
\]
where $\bar\beta = \frac{1}{B}\sum_{b=1}^{B}\tilde\beta_b$.



Figure 1: Empirical variances, squared biases and mean-squared errors of Xβ for estimating Xβ

for three datasets. From top to bottom: upper panels are the logarithm of variances, middle panels

the logarithm of squared bias, and lower panels the logarithm of mean-squared error. From left to

right: TT, MG and LN data, respectively.

We plot the results in Figure 1. From Figure 1, we first observe that the biases are negligible and the variances dominate the biases. Second, PL has the smallest MSEs, which is consistent with our results on β for estimating β; interestingly, the performance of OPT is very close to that of PL.


4 Proofs of Lemma 1 & 2

4.1 Proof of Lemma 1

Condition C1∗:
\[
\frac{1}{n^2}\sum_{i=1}^{n}\pi_i\left(\frac{z_i z_i^T}{\pi_i} - M_Z\right)^2 = O_p(1), \tag{A.1}
\]
\[
\frac{1}{n^{2+\delta}}\sum_{i=1}^{n}\frac{\|e_i x_i\|^{2+\delta}}{\pi_i^{1+\delta}} = O_p(1) \quad \text{for some } \delta > 0, \tag{A.2}
\]
where $M_Z = \sum_{i=1}^{n} z_i z_i^T$ and $e_i = y_i - x_i^T\hat\beta_{ols}$.

Lemma 1. If Condition C1∗ holds, then given $\mathcal{F}_n$, we have
\[
\mathbf{V}^{-1/2}(\tilde\beta - \hat\beta_{ols}) \xrightarrow{\;L\;} N(0, \mathbf{I}) \quad \text{as } r \to \infty, \tag{A.3}
\]
where $\mathbf{V} = M_X^{-1}\mathbf{V}_c M_X^{-1}$ and $\mathbf{V}_c = r^{-1}\sum_{i=1}^{n}\frac{e_i^2}{\pi_i}x_i x_i^T$.
Moreover,
\[
\mathbf{V} = O(r^{-1}). \tag{A.4}
\]
In addition, we have
\[
E(\tilde\beta - \hat\beta_{ols}\,|\,\mathcal{F}_n) = O(r^{-1}). \tag{A.5}
\]

Proof. The difference between $\tilde\beta$ and $\hat\beta_{ols}$ is
\begin{align*}
\tilde\beta - \hat\beta_{ols}
&= (X^{*T}\Phi^{*-1}X^*)^{-1}X^{*T}\Phi^{*-1}y^* - \hat\beta_{ols}
= (X^{*T}\Phi^{*-1}X^*)^{-1}X^{*T}\Phi^{*-1}(y^* - X^*\hat\beta_{ols}) \\
&= \hat M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*) \\
&= M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*) + (\hat M_X^{-1} - M_X^{-1})(r^{-1}X^{*T}\Phi^{*-1}e^*), \tag{A.6}
\end{align*}
where $M_X = X^TX$, $\hat M_X = r^{-1}X^{*T}\Phi^{*-1}X^*$, and $e^* = y^* - X^*\hat\beta_{ols}$.

Step 1. If we show
\[
n^{-1}(\hat M_X - M_X) = O_p(r^{-1/2}), \qquad n^{-1}r^{-1}X^{*T}\Phi^{*-1}e^* = O_p(r^{-1/2}), \tag{A.7}
\]
then it follows that
\begin{align*}
(\hat M_X^{-1} - M_X^{-1})(r^{-1}X^{*T}\Phi^{*-1}e^*)
&= -\hat M_X^{-1}(\hat M_X - M_X)M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*) \\
&= -(n^{-1}\hat M_X)^{-1}\,n^{-1}(\hat M_X - M_X)\,(n^{-1}M_X)^{-1}\,(n^{-1}r^{-1}X^{*T}\Phi^{*-1}e^*) \\
&= O_p(r^{-1}). \tag{A.8}
\end{align*}


Combining equations (A.8) and (A.6), we get
\[
\tilde\beta - \hat\beta_{ols} = M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*) + O_p(r^{-1}). \tag{A.9}
\]
So we shall prove (A.7). Note that, given $\mathcal{F}_n$, $\hat M_X$ and $X^{*T}\Phi^{*-1}e^*$ are Hansen-Hurwitz estimates as in Hansen and Hurwitz (1943); we thus have
\[
E(\hat M_X) = M_X, \qquad E(X^{*T}\Phi^{*-1}e^*) = X^T(y - X\hat\beta_{ols}) = 0.
\]

E(MX) = MX , E(X∗TΦ∗−1e∗) = XT (y −Xβols) = 0.

By C1∗, we have the following equations,

1

n2

n∑

i=1

πi

(xix

Ti

πi−MX

)(xix

Ti

πi−MX

)= Op(1),

1

n2

n∑

i=1

πi

(xix

Ti

πi−MX

)(xiyiπi−XTy

)= Op(1),

1

n2

n∑

i=1

πi

(yix

Ti

πi− yTX

)(xiyiπi−XTy

)= Op(1),

which imply, for any p-dimensional vector l with finite elements,

1

n2

n∑

i=1

πi

(xix

Ti

πi−MX

)llT(xix

Ti

πi−MX

)= Op(1),

1

n2

n∑

i=1

πi

(xix

Ti

πi−MX

)llT(xiyiπi−XTy

)= Op(1),

1

n2

n∑

i=1

πi

(yix

Ti

πi− yTX

)llT(xiyiπi−XTy

)= Op(1).

Hence,
\[
\frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T ll^T x_i x_i^T}{\pi_i} = \frac{1}{n^2}(X^TX)ll^T(X^TX) + O_p(1), \tag{A.10}
\]
\[
\frac{1}{n^2}\sum_{i=1}^{n}\frac{x_i x_i^T ll^T x_i y_i}{\pi_i} = \frac{1}{n^2}(X^TX)ll^T(X^Ty) + O_p(1), \tag{A.11}
\]
\[
\frac{1}{n^2}\sum_{i=1}^{n}\frac{y_i x_i^T ll^T x_i y_i}{\pi_i} = \frac{1}{n^2}(y^TX)ll^T(X^Ty) + O_p(1). \tag{A.12}
\]
Therefore, given $\mathcal{F}_n$, we have
\[
Var(n^{-1}\hat M_X l) = r^{-1}\frac{1}{n^2}\sum_{i=1}^{n}\pi_i\left(\frac{x_i x_i^T}{\pi_i} - M_X\right)ll^T\left(\frac{x_i x_i^T}{\pi_i} - M_X\right) = O_p(r^{-1}), \tag{A.13}
\]


\begin{align*}
Var(r^{-1}n^{-1}l^TX^{*T}\Phi^{*-1}e^*)
&= r^{-1}\frac{1}{n^2}\sum_{i=1}^{n}\pi_i\left(\frac{e_i x_i^T l}{\pi_i}\right)^2 \\
&= r^{-1}\frac{1}{n^2}\sum_{i=1}^{n}\frac{1}{\pi_i}\left[y_i^2 x_i^T ll^T x_i + \hat\beta_{ols}^T x_i x_i^T ll^T x_i x_i^T\hat\beta_{ols} - 2 y_i x_i^T ll^T x_i x_i^T\hat\beta_{ols}\right] \\
&= r^{-1}\frac{1}{n^2}\left[\sum_{i=1}^{n}\frac{1}{\pi_i}y_i^2 x_i^T ll^T x_i
+ \hat\beta_{ols}^T\Big(\sum_{i=1}^{n}\frac{1}{\pi_i}x_i x_i^T ll^T x_i x_i^T\Big)\hat\beta_{ols}
- 2\Big(\sum_{i=1}^{n}\frac{1}{\pi_i}y_i x_i^T ll^T x_i x_i^T\Big)\hat\beta_{ols}\right] \\
&= r^{-1}\left[\frac{1}{n^2}y^TXll^TX^Ty + \frac{1}{n^2}\hat\beta_{ols}^T(X^TX)ll^T(X^TX)\hat\beta_{ols}
- \frac{2}{n^2}y^TXll^T(X^TX)\hat\beta_{ols} + O_p(1)\right] \\
&= r^{-1}\left[\frac{1}{n^2}\left(l^T[X^T(y - X\hat\beta_{ols})]\right)^2 + O_p(1)\right] \\
&= O_p(r^{-1}), \tag{A.14}
\end{align*}
where $e_i = y_i - x_i^T\hat\beta_{ols}$, the fourth equality is based on (A.10), (A.11) and (A.12), the last equality is based on the fact that $X^T(y - X\hat\beta_{ols}) = 0$, and the others are direct results. Thus we have (A.7).

Step 2. Next we will show that, given $\mathcal{F}_n$, $[Var(r^{-1}n^{-1}X^{*T}\Phi^{*-1}e^*)]^{-1/2}(r^{-1}n^{-1}X^{*T}\Phi^{*-1}e^*)$ converges to a normal distribution.

We can construct random vectors $\eta_i$ taking values in $\left\{\frac{e_1x_1}{n\pi_1},\cdots,\frac{e_nx_n}{n\pi_n}\right\}$ with
\[
\Pr\left[\eta_i = \frac{e_jx_j}{n\pi_j}\right] = \pi_j, \quad \text{for } j = 1,\ldots,n.
\]
Because each draw in subsampling with replacement is made independently from the same distribution, given $\mathcal{F}_n$ the sequence $\{\eta_1,\cdots,\eta_r\}$ is independent and identically distributed. Then it is easy to see that
\[
n^{-1}X^{*T}\Phi^{*-1}e^* = \sum_{i=1}^{r}\eta_i,
\]
and
\[
E(\eta_i\,|\,\mathcal{F}_n) = 0, \qquad Var(\eta_i\,|\,\mathcal{F}_n) = \frac{1}{n^2}\sum_{i=1}^{n}\frac{e_i^2 x_i x_i^T}{\pi_i} = O_p(1). \tag{A.15}
\]

In fact, $\{\eta_1,\cdots,\eta_r\}$ is a double array whose distribution depends on $n$; in order to prove the asymptotic normality, we thus need to verify the Lindeberg-Feller condition, i.e., for every $\varepsilon > 0$,
\begin{align*}
\sum_{i=1}^{r}E\left[\|r^{-1/2}\eta_i\|^2 1\{\|r^{-1/2}\eta_i\| > \varepsilon\}\right]
&= \sum_{i=1}^{n}\frac{1}{\pi_i}\left\|\frac{e_ix_i}{n}\right\|^2 1\left\{\left\|\frac{e_ix_i}{n}\right\| > r^{1/2}\pi_i\varepsilon\right\} \tag{A.16} \\
&\le (\varepsilon r^{1/2})^{-\delta}\sum_{i=1}^{n}\frac{1}{\pi_i}\left\|\frac{e_ix_i}{n}\right\|^2\left\|\frac{e_ix_i}{n\pi_i}\right\|^{\delta} 1\left\{\left\|\frac{e_ix_i}{n}\right\| > r^{1/2}\pi_i\varepsilon\right\} \\
&\le (\varepsilon r^{1/2})^{-\delta}\sum_{i=1}^{n}\frac{1}{\pi_i^{1+\delta}}\left\|\frac{e_ix_i}{n}\right\|^{2+\delta} = o_p(1),
\end{align*}
where $\delta > 0$, the first inequality follows by making use of the constraint $1\{\|\frac{e_ix_i}{n}\| > r^{1/2}\pi_i\varepsilon\}$, the second inequality from the fact that $1\{\|\frac{e_ix_i}{n}\| > r^{1/2}\pi_i\varepsilon\} \le 1$, and the last equality is based on (A.2) in Condition C1∗.

Combining (A.15) and (A.16), we have, given $\mathcal{F}_n$, by the Lindeberg-Feller central limit theorem (CLT) (van der Vaart (1998), 2.27 in Section 2.8, page 20),
\[
\left[\frac{1}{n^2}\sum_{i=1}^{n}\frac{e_i^2 x_i x_i^T}{\pi_i}\right]^{-1/2}\left(r^{-1/2}\sum_{i=1}^{r}\eta_i\right) \xrightarrow{\;L\;} N(0, \mathbf{I}). \tag{A.17}
\]
Moreover,
\[
Var(r^{-1}n^{-1}X^{*T}\Phi^{*-1}e^*\,|\,\mathcal{F}_n) = r^{-1}\frac{1}{n^2}\sum_{i=1}^{n}\frac{e_i^2 x_i x_i^T}{\pi_i}. \tag{A.18}
\]
So following (A.17) and (A.18), given $\mathcal{F}_n$, $[Var(r^{-1}n^{-1}X^{*T}\Phi^{*-1}e^*)]^{-1/2}(r^{-1}n^{-1}X^{*T}\Phi^{*-1}e^*)$ converges to a normal distribution. Thus, given $\mathcal{F}_n$, we get
\[
\left\{Var\left[M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\right]\right\}^{-1/2} M_X^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*) \xrightarrow{\;L\;} N(0, \mathbf{I}). \tag{A.19}
\]
Combining (A.9), (A.18) and (A.19), (A.3) of Lemma 1 is proved.

Since
\[
\mathbf{V} = M_X^{-1}\mathbf{V}_c M_X^{-1}
= \Big(\sum_{i=1}^{n}x_i x_i^T\Big)^{-1}\Big(r^{-1}\sum_{i=1}^{n}\frac{e_i^2}{\pi_i}x_i x_i^T\Big)\Big(\sum_{i=1}^{n}x_i x_i^T\Big)^{-1} = O(1/r), \tag{A.20}
\]
we have (A.4) of Lemma 1 proved.

Next we shall prove (A.5) of Lemma 1. From (A.6) and (A.17), we have
\[
E(\tilde\beta - \hat\beta_{ols}\,|\,\mathcal{F}_n) = E\left[(\hat M_X^{-1} - M_X^{-1})(r^{-1}X^{*T}\Phi^{*-1}e^*)\,\big|\,\mathcal{F}_n\right] =: \mathbf{B}. \tag{A.21}
\]
Then we have, for some constant $c$, if $r$ is big enough,
\begin{align*}
\|\mathbf{B}\|_2
&\le E\left[\|(\hat M_X^{-1} - M_X^{-1})(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2\,\big|\,\mathcal{F}_n\right] \tag{A.22} \\
&\le E\left[\|n(\hat M_X^{-1} - M_X^{-1})\|_F\,\|n^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2\,\big|\,\mathcal{F}_n\right] \\
&= E\left[\|(n^{-1}\hat M_X)^{-1}\,n^{-1}(M_X - \hat M_X)\,(n^{-1}M_X)^{-1}\|_F\,\|n^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2\,\big|\,\mathcal{F}_n\right] \\
&\le E\left[\|(n^{-1}\hat M_X)^{-1}\|_F\,\|n^{-1}(M_X - \hat M_X)\|_F\,\|(n^{-1}M_X)^{-1}\|_F\,\|n^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2\,\big|\,\mathcal{F}_n\right] \\
&\le \left\{E\left[\|(n^{-1}\hat M_X)^{-1}\|_F^2\,\big|\,\mathcal{F}_n\right]E\left[n^{-2}\|M_X - \hat M_X\|_F^2\,\big|\,\mathcal{F}_n\right]E\left[\|(n^{-1}M_X)^{-1}\|_F^2\,\big|\,\mathcal{F}_n\right]\right\}^{1/2}
\left\{E\left[\|n^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2^2\,\big|\,\mathcal{F}_n\right]\right\}^{1/2} \\
&\le c\left\{E\left[n^{-2}\|M_X - \hat M_X\|_F^2\,\big|\,\mathcal{F}_n\right]\right\}^{1/2}\left\{E\left[\|n^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2^2\,\big|\,\mathcal{F}_n\right]\right\}^{1/2},
\end{align*}
where the first inequality follows from Jensen's inequality, the second inequality from the inequality relating the Frobenius matrix norm and the vector norm, the equality from the fact that $\hat M_X^{-1} - M_X^{-1} = -\hat M_X^{-1}(\hat M_X - M_X)M_X^{-1}$, the third inequality from a matrix norm inequality, the fourth inequality from Holder's inequality, and the last inequality from the first result in (A.7) and the fact that the Frobenius norm of the inverse of a positive definite matrix can be bounded by some constant.

Following (A.13) and (A.14), we have
\[
E\left[n^{-2}\|M_X - \hat M_X\|_F^2\,\big|\,\mathcal{F}_n\right] = O_p(1/r),
\]
and
\[
E\left[\|n^{-1}(r^{-1}X^{*T}\Phi^{*-1}e^*)\|_2^2\,\big|\,\mathcal{F}_n\right] = O_p(1/r).
\]
By combining the two formulas above with (A.22), we get
\[
\|\mathbf{B}\|_2 = O_p(r^{-1}). \tag{A.23}
\]
We have thus proved that
\[
E(\tilde\beta - \hat\beta_{ols}\,|\,\mathcal{F}_n) = O_p(r^{-1}).
\]

Condition C3∗:
\[
\sum_{i=1}^{n}\pi_i\left(z_i z_i^T - M_Z^u\right)^2 = O_p(1), \tag{A.24}
\]
\[
\sum_{i=1}^{n}\pi_i^{1+\delta}\|e_i^u x_i\|^{2+\delta} = O_p(1), \tag{A.25}
\]
where $M_Z^u = \sum_{i=1}^{n}\pi_i z_i z_i^T$, and $e_i^u = y_i - x_i^T\hat\beta_{wls}$ with $\hat\beta_{wls} = (X^T\Phi X)^{-1}X^T\Phi y$.

4.2 Proof of Lemma 2

Lemma 2. If Condition C3∗ holds, then given $\mathcal{F}_n$,
\[
(\mathbf{V}^u)^{-1/2}(\tilde\beta^u - \hat\beta_{wls}) \xrightarrow{\;L\;} N(0, \mathbf{I}) \quad \text{as } r \to \infty, \tag{A.26}
\]
where $\mathbf{V}^u = (M_X^u)^{-1}\mathbf{V}_c^u(M_X^u)^{-1}$, $M_X^u = \sum_{i=1}^{n}\pi_i x_i x_i^T$, and $\mathbf{V}_c^u = r^{-1}\sum_{i=1}^{n}\pi_i(e_i^u)^2 x_i x_i^T$.
In addition, we have
\[
E(\tilde\beta^u - \hat\beta_{ols}\,|\,\mathcal{F}_n) = \hat\beta_{wls} - \hat\beta_{ols} + O_p(r^{-1}). \tag{A.27}
\]


Proof. As $\tilde\beta^u = (X^{*T}X^*)^{-1}X^{*T}y^*$, the difference between $\tilde\beta^u$ and $\hat\beta_{wls}$ is
\[
\tilde\beta^u - \hat\beta_{wls} = (X^{*T}X^*)^{-1}X^{*T}(y^* - X^*\hat\beta_{wls}) = (\hat M_X^u)^{-1}(r^{-1}X^{*T}e^{*u}), \tag{A.28}
\]
where $\hat M_X^u = r^{-1}X^{*T}X^*$ and $e^{*u} = y^* - X^*\hat\beta_{wls}$.

Step 1. Analogous to the proof of Lemma 1, if we show
\[
\hat M_X^u - M_X^u = O_p(r^{-1/2}), \qquad r^{-1}X^{*T}e^{*u} = O_p(r^{-1/2}), \tag{A.29}
\]
then it follows that
\[
\tilde\beta^u - \hat\beta_{wls} = (M_X^u)^{-1}(r^{-1}X^{*T}e^{*u}) + O_p(r^{-1}). \tag{A.30}
\]
By the definition of $\hat\beta_{wls}$, $X^T\Phi y - (X^T\Phi X)\hat\beta_{wls} = 0$, one can easily verify the following equalities:
\[
E(\hat M_X^u\,|\,\mathcal{F}_n) = M_X^u, \qquad E(r^{-1}X^{*T}e^{*u}\,|\,\mathcal{F}_n) = 0.
\]
Analogous to the analysis of (A.13) and (A.14) in the proof of Lemma 1, from C3∗ we also have, given $\mathcal{F}_n$,
\[
Var(\hat M_X^u) = r^{-1}\sum_{i=1}^{n}\pi_i\left(x_i x_i^T - M_X^u\right)^2 = O_p(r^{-1}),
\qquad
Var(r^{-1}X^{*T}e^{*u}) = r^{-1}\sum_{i=1}^{n}\pi_i(e_i^u)^2 x_i x_i^T = O_p(r^{-1}). \tag{A.31}
\]
Thus equation (A.29) is shown.

Step 2. Analogous to the proof of Lemma 1, from C3∗ and the CLT, given $\mathcal{F}_n$ we have
\[
\left\{Var\left[(M_X^u)^{-1}(r^{-1}X^{*T}e^{*u})\right]\right\}^{-1/2}(M_X^u)^{-1}(r^{-1}X^{*T}e^{*u}) \xrightarrow{\;L\;} N(0, \mathbf{I}). \tag{A.32}
\]
Thus, combining (A.30), (A.31) and (A.32), the first result (A.26) in Lemma 2 is proved. We will prove (A.27) in Lemma 2 in the following. As $\tilde\beta^u - \hat\beta_{ols} = \tilde\beta^u - \hat\beta_{wls} + \hat\beta_{wls} - \hat\beta_{ols}$, we have
\[
E(\tilde\beta^u - \hat\beta_{ols}\,|\,\mathcal{F}_n) = E(\tilde\beta^u - \hat\beta_{wls}\,|\,\mathcal{F}_n) + \hat\beta_{wls} - \hat\beta_{ols}. \tag{A.33}
\]
Similarly to (A.22), we have
\[
\|E(\tilde\beta^u - \hat\beta_{wls}\,|\,\mathcal{F}_n)\|_2 = O_p(r^{-1}).
\]
Thus, (A.27) in Lemma 2 is proved.

References

Drineas, P., M. Mahoney, S. Muthukrishnan, and T. Sarlos (2011). Faster least squares approxi-

mation. Numerische Mathematik 117, 219–249.

Hansen, M. and W. Hurwitz (1943). On the theory of sampling from a finite population. Annals

of Mathematical Statistics 14, 333–362.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.
