Discussion of sampling approach in big data (homepages.math.uic.edu/~minyang/Big Data...)
Introduction The framework Bias and variance Approximate computation of leverage Empirical evaluation
Discussion of sampling approach in big data
Big data discussion group at MSCS of UIC
Outline
1 Introduction
2 The framework
3 Bias and variance
4 Approximate computation of leverage
5 Empirical evaluation
Mainly based on Ping Ma, Michael Mahoney, Bin Yu (2015), A statistical perspective on algorithmic leveraging, Journal of Machine Learning Research, 16, 861-911
Sampling in big data analysis
- One popular approach: choose a small portion of the full data
- One possible way: uniform random sampling
- “Worst-case” performance may be poor
Leveraging approach
- Data-dependent sampling process
  - Least-squares regression (Avron et al. 2010, Meng et al. 2014)
  - Least absolute deviation and quantile regression (Clarkson et al. 2013, Yang et al. 2013)
  - Low-rank matrix approximation (Mahoney and Drineas, 2009)
- Leveraging provides uniformly superior worst-case algorithmic results
- No work addresses the statistical aspects
Summary of the results
- Based on the linear model
- Analytic framework for evaluating sampling approaches
- Uses a Taylor expansion to approximate the subsampling estimator
Uniform approach vs leveraging approach
- Compare the biases and variances, both conditional and unconditional
- Both are unbiased to leading order
- The leveraging approach improves the “size-scale” of the variance, but the variance may be inflated by small leverage scores
- Neither the leveraging nor the uniform approach dominates the other
New approaches
- Shrinkage Leveraging Estimator (SLEV): a convex combination of the leveraging sampling probability and the uniform probability
- Unweighted Leveraging Estimator (LEVUNW): leveraging sampling approach with unweighted LS estimation
- Both approaches bring some improvements
Linear Model
$y = X\beta_0 + \varepsilon$
- $X$ is an $n \times p$ matrix
- $\beta_0$ is $p \times 1$
- $\varepsilon \sim N(0, \sigma^2 I_n)$

Least-squares estimator: $\beta_{ols} = (X^T X)^{-1} X^T y$
About βols
- Computation time: $O(np^2)$
- Can be written as $V \Delta^{-1} U^T y$, where $X = U \Delta V^T$ is the thin SVD
- Can be solved approximately in $o(np^2)$ time, with error bounded by $\epsilon$
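The two exact expressions for the least-squares estimator can be checked against each other numerically. The following is a minimal sketch with NumPy; the problem size, the seed, and the coefficient vector `beta0` are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only
n, p = 200, 5                           # hypothetical problem size
X = rng.normal(size=(n, p))
beta0 = np.arange(1.0, p + 1.0)         # hypothetical true coefficients
y = X @ beta0 + rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming an inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Thin SVD: X = U diag(delta) V^T, so beta_ols = V diag(1/delta) U^T y.
U, delta, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / delta)

print(np.allclose(beta_normal, beta_svd))
```

Both routes cost on the order of $np^2$ operations; the SVD route also yields $U$, which is what the leverage scores are built from.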
Leverage
- Consider $\hat{y} = Hy$, where $H = X (X^T X)^{-1} X^T$
- The $i$th diagonal element, $h_{ii} = x_i^T (X^T X)^{-1} x_i$, is called the statistical leverage of the $i$th observation
- $\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2$
- Studentized residual: $e_i / (\sigma\sqrt{1 - h_{ii}})$
- $h_{ii}$ has been used to quantify influential observations
Leverage
- $h_{ii} = \sum_{j=1}^{p} U_{ij}^2$
- Exact computation time: $O(np^2)$
- Approximate computation time: $o(np^2)$
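As an illustration, the hat-matrix definition of leverage and the SVD form can be compared directly. This is a sketch on simulated data; the sizes, the seed, and the known noise level `sigma` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 4                                 # hypothetical sizes
X = rng.normal(size=(n, p))
sigma = 1.0                                   # assumed known noise s.d.
y = X @ np.ones(p) + sigma * rng.normal(size=n)

# Hat-matrix definition: h_ii = x_i^T (X^T X)^{-1} x_i.
G_inv = np.linalg.inv(X.T @ X)
h_hat = np.einsum("ij,jk,ik->i", X, G_inv, X)

# SVD form: h_ii is the squared Euclidean norm of the i-th row of U.
U, _, _ = np.linalg.svd(X, full_matrices=False)
h_svd = np.sum(U**2, axis=1)

# Residuals and studentized residuals e_i / (sigma * sqrt(1 - h_ii)).
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_ols
student = e / (sigma * np.sqrt(1.0 - h_svd))
```

The scores sum to $p$ when $X$ has full column rank, which is why $h_{ii}/p$ can serve as a sampling probability.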
Sampling algorithm
- $\{\pi_i\}_{i=1}^n$ is a sampling distribution
- Randomly sample $r > p$ rows of $X$ and the corresponding elements of $y$, using $\{\pi_i\}_{i=1}^n$
- Rescale each sampled row/element by $1/\sqrt{r\pi_i}$ to form a weighted LS subproblem
- Solve the weighted LS subproblem; the solution is denoted $\beta_{wls}$
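The steps of the sampling algorithm can be sketched as a small function. The name `subsample_wls` is hypothetical, and drawing the rows i.i.d. with replacement is an assumption of this sketch; the slide only says "randomly sample".

```python
import numpy as np

def subsample_wls(X, y, pi, r, rng):
    """Sample r rows with probabilities pi, rescale, and solve the weighted LS subproblem."""
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=pi)   # assumption: draws with replacement
    scale = 1.0 / np.sqrt(r * pi[idx])                # rescaling factor 1 / sqrt(r * pi_i)
    Xs = X[idx] * scale[:, None]
    ys = y[idx] * scale
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

rng = np.random.default_rng(2)
n, p, r = 5000, 3, 500                                # hypothetical sizes
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

pi_unif = np.full(n, 1.0 / n)                         # uniform sampling distribution
beta_wls = subsample_wls(X, y, pi_unif, r, rng)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_wls, beta_ols)
```

Any sampling distribution with strictly positive entries can be passed as `pi`; only the subproblem with $r$ rows is ever solved.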
Weighted LS subproblem
- Let $S_X^T$ ($r \times n$) be the sampling matrix indicating the selected samples
- Let $D$ ($r \times r$) be the diagonal matrix whose $i$th element is $1/\sqrt{r\pi_k}$ if the $k$th data point is chosen
- The weighted LS estimator is $\arg\min_\beta \| D S_X^T y - D S_X^T X \beta \|$
Weighted sampling estimators
$$\beta_W = (X^T W X)^{-1} X^T W y$$
with $W = S_X D^2 S_X^T$, an $n \times n$ diagonal random matrix with $E(W_{ii}) = 1$.
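A small numerical check may make the equivalence concrete: the subproblem built from $S_X^T$ and $D$ gives the same solution as the full-size formula with $W = S_X D^2 S_X^T$. The sizes, the seed, and the arbitrary probabilities `pi` are assumptions of this sketch, as is sampling with replacement.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, r = 400, 3, 60                       # hypothetical sizes
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)
pi = 0.5 + rng.random(n)                   # arbitrary positive weights ...
pi /= pi.sum()                             # ... normalized to a distribution

idx = rng.choice(n, size=r, replace=True, p=pi)

# Route 1: explicit r x n sampling matrix S_X^T and r x r rescaling matrix D.
St = np.zeros((r, n))
St[np.arange(r), idx] = 1.0
D = np.diag(1.0 / np.sqrt(r * pi[idx]))
beta_sub = np.linalg.lstsq(D @ St @ X, D @ St @ y, rcond=None)[0]

# Route 2: W = S_X D^2 S_X^T is diagonal with W_ii = (times row i was drawn) / (r * pi_i).
counts = np.bincount(idx, minlength=n)
w = counts / (r * pi)
W = St.T @ D @ D @ St
beta_W = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
```

Under sampling with replacement the count for row $i$ has mean $r\pi_i$, which gives $E(W_{ii}) = 1$.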
Sampling approaches
- Uniform: $\pi_i = 1/n$ for all $i$; uniform sampling estimator (UNIF)
- Leverage-based: $\pi_i = h_{ii} / \sum_{i=1}^n h_{ii} = h_{ii}/p$; leveraging estimator (LEV)
- Shrinkage: $\pi_i = \alpha \pi_i^{Lev} + (1 - \alpha)\pi_i^{Unif}$; shrinkage leveraging estimator (SLEV)
- Unweighted leveraging: sample with $\pi_i^{Lev}$ and solve the unweighted problem $\arg\min_\beta \| S_X^T y - S_X^T X \beta \|$ (LEVUNW)
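The sampling distributions listed above can be written as one helper. The function names `leverage_scores` and `sampling_probs` are hypothetical, and the heavy-tailed test matrix is only an illustrative assumption.

```python
import numpy as np

def leverage_scores(X):
    """Exact leverage scores as squared row norms of U from the thin SVD."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

def sampling_probs(X, method="UNIF", alpha=0.9):
    """Sampling distribution for UNIF, LEV, SLEV, or LEVUNW (which samples like LEV)."""
    n, p = X.shape
    unif = np.full(n, 1.0 / n)
    if method == "UNIF":
        return unif
    lev = leverage_scores(X) / p          # h_ii / p, sums to one for full-column-rank X
    if method in ("LEV", "LEVUNW"):
        return lev
    if method == "SLEV":
        return alpha * lev + (1.0 - alpha) * unif
    raise ValueError(f"unknown method: {method}")

rng = np.random.default_rng(4)
X = rng.standard_t(df=3, size=(1000, 5))  # heavy-tailed entries give nonuniform leverage
probs = {m: sampling_probs(X, m) for m in ("UNIF", "LEV", "SLEV")}
print({m: (q.min(), q.max()) for m, q in probs.items()})
```

LEVUNW differs from LEV only in the estimation step, so it shares the LEV probabilities here.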
Lemma 1
A Taylor expansion of $\beta_W$ around the point $E(w) = 1$, where $w$ is the vector of diagonal elements of $W$, yields
$$\beta_W = \beta_{ols} + (X^T X)^{-1} X^T \mathrm{Diag}\{e\}(w - 1) + R_W$$
where $e = y - X\beta_{ols}$ and $R_W$ is the Taylor expansion remainder.
Remark: (1) The Taylor expansion is valid when $R_W = o_p(\|w - 1\|)$; there is no theoretical justification for when this holds. (2) The formula does not apply to LEVUNW.
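The first-order term can be checked numerically for weights close to 1. This sketch uses a small deterministic perturbation `1 + delta * direction` rather than actual sampling weights, so it only illustrates that the remainder is second order near $w = 1$; it says nothing about the sampling regime flagged in the remark. All sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 3                                   # hypothetical sizes
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

def beta_w(w):
    """Weighted LS solution (X^T W X)^{-1} X^T W y with W = Diag(w)."""
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

beta_ols = beta_w(np.ones(n))
e = y - X @ beta_ols                            # full-sample residuals
direction = rng.normal(size=n)                  # fixed perturbation direction

def remainder(delta):
    """Norm of beta_W minus its first-order expansion at w = 1 + delta * direction."""
    w = 1.0 + delta * direction
    linear = beta_ols + np.linalg.solve(X.T @ X, X.T @ (e * (w - 1.0)))
    return np.linalg.norm(beta_w(w) - linear)

r_big, r_small = remainder(1e-2), remainder(1e-3)
print(r_big, r_small)
```

Shrinking the perturbation by a factor of 10 should shrink the remainder by roughly a factor of 100, as expected for a second-order term.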
Lemma 2
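$$E_W[\beta_W \mid y] = \beta_{ols} + E_W[R_W]$$
$$\mathrm{Var}_W[\beta_W \mid y] = (X^T X)^{-1} X^T \left[\mathrm{Diag}\{e\}\,\mathrm{Diag}\{1/(r\pi)\}\,\mathrm{Diag}\{e\}\right] X (X^T X)^{-1} + \mathrm{Var}_W[R_W]$$
Remark: when $E_W[R_W]$ is negligible, $\beta_W$ is approximately unbiased relative to the full-sample estimate $\beta_{ols}$. The variance is inversely proportional to the subsample size $r$.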
Lemma 2
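$$E[\beta_W] = \beta_0$$
$$\mathrm{Var}[\beta_W] = \sigma^2 (X^T X)^{-1} + \frac{\sigma^2}{r} (X^T X)^{-1} X^T \mathrm{Diag}\left\{\frac{(1 - h_{ii})^2}{\pi_i}\right\} X (X^T X)^{-1} + \mathrm{Var}[R_W]$$
Remark: $\beta_W$ is unbiased for the true value $\beta_0$. The variance depends on the leverage scores and the sampling probabilities, and is inversely proportional to the subsample size $r$.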
UNIF
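$$E_W[\beta_{UNIF} \mid y] = \beta_{ols} + E_W[R_{UNIF}]$$
$$\mathrm{Var}_W[\beta_{UNIF} \mid y] = \frac{n}{r} (X^T X)^{-1} X^T \left[\mathrm{Diag}\{e\}\,\mathrm{Diag}\{e\}\right] X (X^T X)^{-1} + \mathrm{Var}_W[R_{UNIF}]$$
$$E[\beta_{UNIF}] = \beta_0$$
$$\mathrm{Var}[\beta_{UNIF}] = \sigma^2 (X^T X)^{-1} + \frac{n\sigma^2}{r} (X^T X)^{-1} X^T \mathrm{Diag}\{(1 - h_{ii})^2\} X (X^T X)^{-1} + \mathrm{Var}[R_{UNIF}]$$
Remark: (1) The variance depends on $n/r$ and could be very large unless $r$ is close to $n$; (2) the sandwich-type expression will not be inflated by small $h_{ii}$.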
LEV
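$$E_W[\beta_{LEV} \mid y] = \beta_{ols} + E_W[R_{LEV}]$$
$$\mathrm{Var}_W[\beta_{LEV} \mid y] = \frac{p}{r} (X^T X)^{-1} X^T \left[\mathrm{Diag}\{e\}\,\mathrm{Diag}\{1/h_{ii}\}\,\mathrm{Diag}\{e\}\right] X (X^T X)^{-1} + \mathrm{Var}_W[R_{LEV}]$$
$$E[\beta_{LEV}] = \beta_0$$
$$\mathrm{Var}[\beta_{LEV}] = \sigma^2 (X^T X)^{-1} + \frac{p\sigma^2}{r} (X^T X)^{-1} X^T \mathrm{Diag}\left\{\frac{(1 - h_{ii})^2}{h_{ii}}\right\} X (X^T X)^{-1} + \mathrm{Var}[R_{LEV}]$$
Remark: (1) The variance depends on $p/r$, not on the sample size $n$; (2) the sandwich-type expression can be inflated by small $h_{ii}$.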
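The UNIF and LEV expressions are the general Lemma 2 sandwich term with $\pi_i = 1/n$ and $\pi_i = h_{ii}/p$ substituted. The sketch below checks that substitution numerically, carrying the $\sigma^2$ factor from the general expression into both special cases. The helper names, sizes, and the heavy-tailed design are assumptions, and only the leading sandwich term is computed (the remainder variance is ignored).

```python
import numpy as np

def leverage(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

def sandwich(X, d, scale):
    """scale * (X^T X)^{-1} X^T Diag{d} X (X^T X)^{-1}."""
    G_inv = np.linalg.inv(X.T @ X)
    return scale * G_inv @ (X.T @ (d[:, None] * X)) @ G_inv

rng = np.random.default_rng(6)
n, p, r, sigma2 = 1000, 5, 100, 1.0            # hypothetical sizes and noise variance
X = rng.standard_t(df=3, size=(n, p))
h = leverage(X)

def general(pi):
    """General term: (sigma^2 / r) ... Diag{(1 - h_ii)^2 / pi_i} ..."""
    return sandwich(X, (1.0 - h) ** 2 / pi, sigma2 / r)

V_unif = sandwich(X, (1.0 - h) ** 2, n * sigma2 / r)        # UNIF: pi_i = 1/n
V_lev = sandwich(X, (1.0 - h) ** 2 / h, p * sigma2 / r)     # LEV:  pi_i = h_ii / p

print(np.trace(V_unif), np.trace(V_lev))
```

Which of the two traces is smaller depends on the design matrix, in line with the statement that neither approach dominates the other.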
SLEV
$\pi_i = \alpha \pi_i^{Lev} + (1 - \alpha)\pi_i^{Unif}$
- Lemma 2 still holds
- If $(1 - \alpha)$ is not small, the variance of SLEV does not get inflated too much
- If $(1 - \alpha)$ is not large, the variance of SLEV has a scale of $p/r$
- SLEV not only increases the small scores but also shrinks the large scores
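A short worked bound, using only the definitions $\pi_i^{Lev} = h_{ii}/p$ and $\pi_i^{Unif} = 1/n$ and assuming $0 < \alpha < 1$ and $h_{ii} > 0$, may help explain the two bullets on $(1 - \alpha)$. Dropping one of the two non-negative terms of $\pi_i$ at a time bounds the diagonal entries of the Lemma 2 sandwich term:

```latex
\pi_i = \alpha\,\frac{h_{ii}}{p} + (1-\alpha)\,\frac{1}{n} \;\ge\; \frac{1-\alpha}{n}
\quad\Longrightarrow\quad
\frac{(1-h_{ii})^2}{r\,\pi_i} \;\le\; \frac{n}{(1-\alpha)\,r},
\qquad
\pi_i \;\ge\; \alpha\,\frac{h_{ii}}{p}
\quad\Longrightarrow\quad
\frac{(1-h_{ii})^2}{r\,\pi_i} \;\le\; \frac{p}{\alpha\,r}\cdot\frac{(1-h_{ii})^2}{h_{ii}}.
```

The first bound does not depend on $h_{ii}$, so small leverage scores cannot inflate the term beyond $n/((1-\alpha)r)$; the second shows the LEV-type $p/r$ scale is retained up to the factor $1/\alpha$.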
LEVUNW
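A Taylor expansion of $\beta_{LEVUNW}$ around the point $E(w) = r\pi$ yields
$$\beta_{LEVUNW} = \beta_{wls} + (X^T W_0 X)^{-1} X^T \mathrm{Diag}\{e_W\}(w - r\pi) + R_{LEVUNW}$$
where $\beta_{wls} = (X^T W_0 X)^{-1} X^T W_0 y$, $e_W = y - X\beta_{wls}$, and $W_0 = \mathrm{Diag}\{r h_{ii}/p\}$.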
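For illustration, the LEVUNW estimator and its expansion point $\beta_{wls}$ can be computed side by side. This is a sketch on simulated data; the sizes, the coefficient vector of ones, and sampling with replacement are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, r = 3000, 4, 400                              # hypothetical sizes
X = rng.standard_t(df=3, size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

U, _, _ = np.linalg.svd(X, full_matrices=False)
h = np.sum(U**2, axis=1)
pi = h / p                                          # leverage-based probabilities

# LEVUNW: sample with pi, then solve the *unweighted* subproblem.
idx = rng.choice(n, size=r, replace=True, p=pi)     # assumption: sampling with replacement
beta_levunw = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

# Expansion point: weighted LS on the full data with W0 = Diag{r h_ii / p}.
w0 = r * pi
beta_wls = np.linalg.solve(X.T @ (w0[:, None] * X), X.T @ (w0 * y))
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_levunw, beta_wls, beta_ols)
```

Because the sampled rows are not rescaled, the subsample estimator is centered on `beta_wls` rather than on `beta_ols` for a fixed data set.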
LEVUNW
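$$E_W[\beta_{LEVUNW} \mid y] = \beta_{wls} + E_W[R_{LEVUNW}]$$
$$\mathrm{Var}_W[\beta_{LEVUNW} \mid y] = (X^T W_0 X)^{-1} X^T \mathrm{Diag}\{e_W\}\, W_0\, \mathrm{Diag}\{e_W\} X (X^T W_0 X)^{-1} + \mathrm{Var}_W[R_{LEVUNW}]$$
Remark: for a given data set, $\beta_{LEVUNW}$ is approximately unbiased for $\beta_{wls}$, but not for $\beta_{ols}$.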
LEVUNW
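$$E[\beta_{LEVUNW}] = \beta_0$$
$$\mathrm{Var}[\beta_{LEVUNW}] = \sigma^2 (X^T W_0 X)^{-1} X^T W_0^2 X (X^T W_0 X)^{-1} + \sigma^2 (X^T W_0 X)^{-1} X^T \mathrm{Diag}\{I - P_{X,W_0}\}\, W_0\, \mathrm{Diag}\{I - P_{X,W_0}\} X (X^T W_0 X)^{-1} + \mathrm{Var}[R_{LEVUNW}]$$
Remark: $\beta_{LEVUNW}$ is unbiased for $\beta_0$, and the variance is not inflated by small leverage scores.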
Approximate computation
- Based on Drineas et al. (2012)
- Generate an $r_1 \times n$ random matrix $\Pi_1$
- Generate a $p \times r_2$ random matrix $\Pi_2$
- Compute $R$, where $\Pi_1 X = QR$ is a thin QR decomposition
- Return the leverage scores of $X R^{-1} \Pi_2$
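The steps above can be sketched with dense Gaussian projections. Several choices here are assumptions of this sketch rather than details from the slides: the function name `approx_leverage`, the variances $1/r_1$ and $1/r_2$ for the random entries (a different scaling changes the scores only by a constant factor), and reading "leverage scores of $X R^{-1} \Pi_2$" as squared Euclidean row norms. A dense Gaussian $\Pi_1$ does not give the $o(np^2)$ running time; it only illustrates the construction.

```python
import numpy as np

def approx_leverage(X, r1, r2, rng):
    """Approximate leverage scores via two random projections (Gaussian variant)."""
    n, p = X.shape
    Pi1 = rng.normal(scale=1.0 / np.sqrt(r1), size=(r1, n))   # r1 x n sketching matrix
    Pi2 = rng.normal(scale=1.0 / np.sqrt(r2), size=(p, r2))   # p x r2 projection
    _, R = np.linalg.qr(Pi1 @ X, mode="reduced")              # Pi1 X = Q R, R is p x p
    Omega = X @ np.linalg.solve(R, Pi2)                       # X R^{-1} Pi2, n x r2
    return np.sum(Omega**2, axis=1)                           # squared row norms

def exact_leverage(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(8)
X = rng.standard_t(df=2, size=(3000, 10))     # heavy-tailed design, illustrative only
h_exact = exact_leverage(X)
h_approx = approx_leverage(X, r1=500, r2=50, rng=rng)
corr = np.corrcoef(h_exact, h_approx)[0, 1]
print(corr)
```

The correlation between exact and approximate scores is the kind of quality measure the empirical slides refer to when discussing $r_1$ and $r_2$.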
Computation time
For appropriate choices of $r_1$ and $r_2$, if one chooses $\Pi_1$ to be a Hadamard-based random matrix, then the computation time is $o(np^2)$.
Empirical studies
- $n = 20{,}000$ and $p = 1{,}000$
- BFast: each element of $\Pi_1$ and $\Pi_2$ is generated i.i.d. from $\{-1, 1\}$ with equal probability
- GFast: the elements of $\Pi_1$ and $\Pi_2$ are generated i.i.d. from $N(0, 1/n)$ and $N(0, 1/p)$, respectively
- $r_1 = p, 1.5p, 2p, 3p, 4p, 5p$ and $r_2 = k\log(n)$ with $k = 1, 2, \ldots, 20$
Empirical studies: choice of $r_1$ and $r_2$
- As $r_1$ increases, the correlation is not sensitive, but the running time increases linearly
- As $r_2$ increases, the correlation increases rapidly, but the running time is not sensitive
- Choose small $r_1$ and large $r_2$
Empirical studies: computation time
- When $n \le 20{,}000$, the exact method takes less time
- When $n > 20{,}000$, the approximate approach has some advantage
Empirical studies: estimation comparison
- Compare the bias and variance of LEV, SLEV, and LEVUNW using exact, BFast, and GFast leverage scores
- The results are almost identical
Plan
- Unconditional bias and variance for LEV and UNIF
- Unconditional bias and variance for SLEV and LEVUNW
- Conditional bias and variance of SLEV and LEVUNW
- Real data application
Synthetic data
$y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, 9 I_n)$
- Nearly uniform leverage scores (GA): $X \sim N(1_p, \Sigma)$, $\Sigma_{ij} = 2 \times 0.5^{|i-j|}$, and $\beta = (1_{10}, 0.1 \cdot 1_{p-20}, 1_{10})$
- Moderately nonuniform leverage scores (T3): $X$ is from a multivariate t-distribution with df = 3
- Very nonuniform leverage scores (T1): $X$ is from a multivariate t-distribution with df = 1
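The three synthetic designs can be generated along the following lines. This is a sketch under stated assumptions: the rows of $X$ are drawn independently, the multivariate t is built as a normal vector divided by $\sqrt{\chi^2_{df}/df}$ with zero location and the same $\Sigma$ as GA, and `make_design` is a hypothetical helper name. None of these details are confirmed by the slides.

```python
import numpy as np

def make_design(n, p, kind, rng):
    """Rows from N(1_p, Sigma) for 'GA', or a multivariate t with df 3 / 1 for 'T3' / 'T1'."""
    idx = np.arange(p)
    Sigma = 2.0 * 0.5 ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    if kind == "GA":
        return 1.0 + Z
    df = {"T3": 3, "T1": 1}[kind]
    # Assumption: multivariate t as normal / sqrt(chi2 / df), zero location, same Sigma.
    g = rng.chisquare(df, size=n) / df
    return Z / np.sqrt(g)[:, None]

def leverage(X):
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(9)
n, p = 1000, 50
beta = np.concatenate([np.ones(10), 0.1 * np.ones(p - 20), np.ones(10)])

max_lev = {}
for kind in ("GA", "T3", "T1"):
    X = make_design(n, p, kind, rng)
    y = X @ beta + 3.0 * rng.normal(size=n)     # noise variance 9
    max_lev[kind] = leverage(X).max()
print(max_lev)
```

The largest leverage score per design gives a quick view of how nonuniform the scores are; heavier tails concentrate leverage on a few rows.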
LEV vs UNIF: square loss and variance
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- Square loss is much smaller than variance
- LEV and UNIF behave similarly for GA, less similarly for T3, and very differently for T1
- Both decrease as $r$ increases, but more slowly for UNIF
Improvements from SLEV and LEVUNW
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- Results are similar for GA, less similar for T3, and different for T1
- SLEV with $\alpha = 0.9$ and LEVUNW have better performance
Choices of α in SLEV
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- T1 data
- $0.8 \le \alpha \le 0.9$ has a beneficial effect
- Recommend $\alpha = 0.9$
- LEVUNW has better performance
Conditional bias and variance
- $n = 1000$, $p = 10, 50, 100$; sampling repeated 1000 times
- LEVUNW is biased for $\beta_{ols}$
- LEVUNW has the smallest variance
- Recommend using SLEV with $\alpha = 0.9$
Real Data: RNA-SEQ data
- $n = 51{,}751$ read counts from embryonic mouse stem cells
- $n_{ij}$ denotes the count of reads that are mapped to the genome starting at the $j$th nucleotide of the $i$th gene
- $y_{ij} = \log(n_{ij} + 0.5)$
- Independent variables: 40 nucleotides denoted $b_{ij,-20}, b_{ij,-19}, \ldots, b_{ij,19}$
- Linear model: $y_{ij} = \alpha + \sum_{k=-20}^{19} \sum_{h \in H} \beta_{kh} I(b_{ij,k} = h) + \varepsilon_{ij}$, where $H = \{A, C, G\}$ and T is used as the baseline level
- $p = 121$
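The indicator coding in the linear model explains the count $p = 121$: one intercept plus 40 positions times 3 non-baseline nucleotides. A minimal sketch of the design-matrix construction follows; the toy sequences and counts are randomly generated placeholders, not the RNA-Seq data, and `design_row` is a hypothetical helper name.

```python
import numpy as np

POSITIONS = range(-20, 20)          # k = -20, ..., 19
LEVELS = ("A", "C", "G")            # T is the baseline level

def design_row(window):
    """Encode a 40-nucleotide window as [1, I(b_k = h) for each position k and h in {A, C, G}]."""
    assert len(window) == len(POSITIONS)
    row = [1.0]                                             # intercept alpha
    for base in window:
        row.extend(1.0 if base == h else 0.0 for h in LEVELS)
    return row

rng = np.random.default_rng(10)
# Hypothetical toy sequences and counts, not the data used in the source.
windows = ["".join(rng.choice(list("ACGT"), size=40)) for _ in range(8)]
counts = rng.poisson(5.0, size=8)

X = np.array([design_row(w) for w in windows])
y = np.log(counts + 0.5)                                    # response y_ij = log(n_ij + 0.5)
print(X.shape)
```

A window made entirely of T maps to the intercept alone, which is what using T as the baseline level means.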
Sampling analysis
- UNIF, LEV, and SLEV
- $r = 2p, 3p, 4p, 5p, 10p, 20p, 50p$
- Compare sample bias (with respect to $\beta_{ols}$) and variance
- Sampling repeated 100 times
Comparison
- Relatively uniform leverage scores
- Almost identical variances
- LEVUNW has slightly larger bias
Empirical results for real data I
Real Data: predicting gene expression of cancer patient
- $n = 5{,}520$ genes for 46 patients
- Randomly select one patient’s gene expression as $y$ and the remaining patients’ gene expressions as predictors ($p = 45$)
- Sample sizes from 100 to 5000
- UNIF, LEV, and SLEV
Comparison
- Relatively nonuniform leverage scores
- SLEV and LEV have smaller variances
- LEVUNW has the largest bias
Empirical results for real data II