Statistical Foundations of Data Science
Jianqing Fan
Princeton University
http://www.princeton.edu/~jqfan
Big Data are ubiquitous
[Figure: sources of Big Data (Internet, Business, Finance, Government, Medicine, Engineering, Science, Digital Humanities, Biological Science) and the growth of global data volume: 2003: 5 EB; 2010: 1.2 ZB; 2012: 2.7 ZB; 2015: 8 ZB; 2020: 40 ZB]
The three V's of Big Data: Volume, Velocity, Variety
“There were 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days” — Eric Schmidt, CEO of Google
Deep Impact of Data Tsunami
System: storage, communication, computation architectures
Analysis: statistics, computation, optimization, privacy
[Diagram: the Data Science pipeline: Acquisition & Storage → Computing → Analysis → Applications]
Big Data =⇒ Smart Data
What can big data do?
Big data hold great promise for understanding
• Heterogeneity: personalized medicine or services
• Commonality: in the presence of large variations (noise)
from large pools of variables, factors, genes, environments, and their interactions, as well as latent factors.
Aims of Data Science:
• Prediction: to construct as effective a method as possible to predict future observations (correlation).
• Inference and prediction: to gain insight into the relationship between features and response for scientific purposes, and to construct an improved prediction method (causation).
Common Features and Techniques
Common Features of Big Data:
• Heterogeneity, endogeneity
• Dependence, heavy tails
• Missing data, measurement errors, spurious correlation
• Survivorship biases, sampling biases
Common Techniques for Data Science:
• Statistical Techniques: MLE, Least-Squares, M-estimation
• Regression: Parametric, Nonparametric, Sparse
• Principal Component Analysis: Supervised, Unsupervised
1. Multiple and Nonparametric
Regression
1.1. Theory of Least-Squares
1.2. Model Building
1.3. Ridge Regression
1.4. Regression in RKHS
1.5. Cross-Validation
1.1. Theory of Least-Squares
• Reading materials and R implementations here:
http://orfe.princeton.edu/%7Ejqfan/fan/classes/245/chap11.pdf
Purposes of multiple regression
• Study associations between dependent and independent variables
• Screen irrelevant variables and select useful ones
• Prediction
Example: Hong Kong environmental dataset
Interest: association between levels of pollutants and the number of daily hospital admissions for circulatory and respiratory problems.
Response Y = daily number of hospital admissions
Covariates:
• X1 = level of the pollutant sulphur dioxide
• X2 = level of the pollutant nitrogen dioxide
• X3 = level of respirable suspended particles
• · · ·
Multiple linear regression model
Y = β1X1 + β2X2 + · · ·+ βpXp + ε
Y : response / dependent variable
Xj ’s: explanatory / independent variables or covariates
ε: random error not explained / predicted by covariates
include intercept by setting X1 = 1
[Figure (Brooks/Cole, 2004): the simple linear regression line y = β0 + β1x, with conditional means β0 + β1x1, β0 + β1x2, β0 + β1x3 at covariate values x1, x2, x3]
Method of least-squares
Data: {(xi1, xi2, · · · , xip, yi)}, 1 ≤ i ≤ n
Model: yi = ∑_{j=1}^p xij βj + εi
Method of Least-Squares: minimize over β ∈ Rᵖ
RSS(β) ≜ ∑_{i=1}^n (yi − ∑_{j=1}^p xij βj)²
RSS stands for residual sum-of-squares.
When εi are i.i.d. N(0, σ²), the least-squares estimator is the MLE.
Regression in matrix notation
y = (y1, · · · , yn)ᵀ, β = (β1, · · · , βp)ᵀ, ε = (ε1, · · · , εn)ᵀ, and X is the n × p design matrix with (i, j) entry xij.
Model becomes
y = Xβ + ε
RSS becomes
RSS(β) = ‖y−Xβ‖2
Closed-form solution
Least-squares: Minimize wrt β ∈ Rp
‖y−Xβ‖2 = (y−Xβ)T (y−Xβ)
Setting the gradient to zero yields the normal equations:
XᵀXβ = Xᵀy
Least-squares estimator (assuming X has full column rank):
β̂ = (XᵀX)⁻¹Xᵀy
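The closed form translates directly into code. A minimal sketch in Python/numpy (the simulated data, dimensions, and seed are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # first column = intercept
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Solve the normal equations X'X beta = X'y (assumes full column rank).
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    # Numerically preferable equivalent: least-squares via SVD.
    beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)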
Geometric interpretation of least-squares
Fitted value: ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Py, where P ≜ X(XᵀX)⁻¹Xᵀ ∈ R^{n×n} is the projection (hat) matrix.
Theorem 1.1 [Property of projection matrix]
• Pxj = xj, j = 1, 2, · · · , p
• P² = P, or equivalently P(In − P) = 0
[Figure: ŷ is the orthogonal projection of y onto the plane spanned by X1 and X2; the residual y − ŷ is perpendicular to that plane]
• Least-squares projects the response vector y onto the linear space spanned by the columns of X.
Statistical properties of least-squares estimator
Assumptions:
Exogeneity: E(ε|X) = 0;
Homoscedasticity: var(ε|X) = σ²In.
Statistical properties:
• unbiasedness: E(β̂|X) = β
• variance: var(β̂|X) = σ²(XᵀX)⁻¹
(the conditioning on X is often dropped from the notation)
• Recall cov(U, V) = E[(U − µU)(V − µV)ᵀ] and var(U) = cov(U, U);
cov(aᵀU, bᵀV) = aᵀcov(U, V)b, var(aᵀU) = aᵀvar(U)a.
Gauss-Markov Theorem
How large is the variance, compared with other estimators?
Theorem 1.2 [Gauss-Markov Theorem]
The LSE β̂ is the best linear unbiased estimator (BLUE):
aᵀβ̂ is a linear unbiased estimator of the parameter θ = aᵀβ;
for any linear unbiased estimator bᵀy of θ,
var(bᵀy|X) ≥ var(aᵀβ̂|X).
Estimation of σ²: σ̂² = RSS/(n − p) = ‖y − Xβ̂‖²/(n − p)
σ̂² is an unbiased estimator of σ².
Statistical inference
Additional assumption: ε ∼ N(0, σ²In)
Under a fixed design, or conditioning on X,
β̂ = β + (XᵀX)⁻¹Xᵀε  ⇒  β̂ ∼ N(β, σ²(XᵀX)⁻¹)
• β̂j ∼ N(βj, vjσ²), where vj is the jth diagonal element of (XᵀX)⁻¹
• (n − p)σ̂² ∼ σ²χ²_{n−p}, and σ̂² is independent of β̂
• 1 − α CI for βj: β̂j ± t_{n−p}(1 − α/2) √vj σ̂ (homework)
• H0: βj = 0: test statistic tj = β̂j/(√vj σ̂) ∼ t_{n−p} under H0
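A sketch of these formulas in Python (numpy + scipy.stats); ols_inference is a hypothetical helper name, and X is assumed to have full column rank:

    import numpy as np
    from scipy import stats

    def ols_inference(X, y, alpha=0.05):
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        resid = y - X @ beta_hat
        sigma2_hat = resid @ resid / (n - p)          # unbiased estimate of sigma^2
        se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # sqrt(v_j) * sigma_hat
        t_stat = beta_hat / se                        # t_j under H0: beta_j = 0
        p_val = 2 * stats.t.sf(np.abs(t_stat), df=n - p)
        q = stats.t.ppf(1 - alpha / 2, df=n - p)
        ci = np.column_stack([beta_hat - q * se, beta_hat + q * se])  # (1 - alpha) CIs
        return beta_hat, se, t_stat, p_val, ci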
Non-normal error
Appeal to asymptotic theory:
√n(β̂ − β) = (n⁻¹XᵀX)⁻¹ · n^{−1/2}Xᵀε
where n⁻¹XᵀX = n⁻¹∑_{i=1}^n XiXiᵀ → Σ by the LLN, and n^{−1/2}Xᵀε = n^{−1/2}∑_{i=1}^n Xiεi obeys the CLT.
Using Slutsky's theorem (homework),
√n(β̂ − β) →d N(0, σ²Σ⁻¹), i.e., approximately β̂ ∼ N(β, σ²(XᵀX)⁻¹)
Correlated errors
y = Xβ + ε, where var(ε|X) = σ2W
Transform the data as follows:
y* = W^{−1/2}y, X* = W^{−1/2}X, ε* = W^{−1/2}ε.
Generalized least-squares:
min_{β∈Rᵖ} ‖y* − X*β‖² = (y − Xβ)ᵀW⁻¹(y − Xβ)
Heteroscedastic errors: W = diag(v1, · · · , vn), i.e., var(εi|X) = σ²vi
Weighted least-squares: min_β ∑_{i=1}^n (yi − Xiᵀβ)²/vi.
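A sketch of weighted least-squares via the W^{−1/2} transformation (Python/numpy); v holds the assumed relative variances v1, · · · , vn:

    import numpy as np

    def weighted_least_squares(X, y, v):
        # var(eps_i) = sigma^2 * v_i: rescale each row by 1/sqrt(v_i), then run OLS
        w = 1.0 / np.sqrt(v)
        Xs, ys = X * w[:, None], y * w      # X* = W^{-1/2} X,  y* = W^{-1/2} y
        beta_hat, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        return beta_hat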
1.2. Model Building
Nonlinear and nonparametric regression
Nonlinear regression
Univariate: Y = f (X) + ε,
• f(·) has structural properties: smooth, monotone, convex, ...
Weierstrass theorem: any continuous f (X) on [0,1] can be uniformly
approximated by a polynomial function.
Polynomial regression:
Y = β0 + β1X + · · · + βdX^d + ε, where β0 + β1X + · · · + βdX^d ≈ f(X)
• a multiple regression with X1 = X, · · · , Xd = X^d
Drawback: not suitable for functions with varying degrees of smoothness
Polynomial versus cubic spline regressions
[Figure: scatter plot of the data over X ∈ (10, 50) with the two fitted curves overlaid]
Polynomial versus cubic spline fit
• Red: polynomial regression of degree 3
• Blue: spline regression of degree 3
Spline regression
• Piecewise polynomials of degree d, with continuous derivatives up to order d − 1.
• Knots {τj}_{j=1}^K: the points where discontinuities (in the dth derivative) may occur.
Example: Linear splines with 0 = τ0 < τ1 < τ2 < τ3 = 1
Linearity on [0, τ1] yields l(x) = β0 + β1x, x ∈ [0, τ1].
Linearity on [τ1, τ2] plus continuity at τ1 gives
l(x) = β0 + β1x + β2(x − τ1)+, x ∈ [τ1, τ2].
Linearity on [τ2, 1] plus continuity at τ2 gives
l(x) = β0 + β1x + β2(x − τ1)+ + β3(x − τ2)+, x ∈ [τ2, 1].
Basis functions for Linear Splines
Basis functions for linear splines with 0 = τ0 < τ1 < τ2 < τ3 = 1:
B0(x) = 1, B1(x) = x, B2(x) = (x − τ1)+, B3(x) = (x − τ2)+
Spline regression:
Y = β0B0(X) + β1B1(X) + β2B2(X) + β3B3(X) + ε, where the sum of the basis terms ≈ f(X)
• a multiple regression with X0 = B0(X), X1 = B1(X), X2 = B2(X), X3 = B3(X)
General case: basis {1, x, (x − τj)+, j = 1, · · · , K}
Nonparametric: K is large, with K = Kn → ∞ as n grows
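A sketch of the truncated-power basis for linear splines and the resulting least-squares fit (Python/numpy); the knots and simulated data are purely illustrative:

    import numpy as np

    def linear_spline_basis(x, knots):
        # columns: 1, x, (x - tau_j)_+ for each knot tau_j
        cols = [np.ones_like(x), x] + [np.clip(x - t, 0, None) for t in knots]
        return np.column_stack(cols)

    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(size=200))
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
    B = linear_spline_basis(x, knots=[0.3, 0.7])
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)   # spline fit = multiple regression on the basis
    fitted = B @ coef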
Cubic splines
Piecewise cubic polynomial with cont. 1st and 2nd derivatives:
c(x) = β0 + β1x + β2x² + β3x³, x ≤ τ1,
c(x) = β0 + β1x + β2x² + β3x³ + β4(x − τ1)³+, x ∈ [τ1, τ2],
c(x) = β0 + β1x + β2x² + β3x³ + β4(x − τ1)³+ + β5(x − τ2)³+, x > τ2
Basis functions:
B0(x) = 1, B1(x) = x, B2(x) = x², B3(x) = x³,
B4(x) = (x − τ1)³+, B5(x) = (x − τ2)³+.
• widely used; • again a multiple regression
Extension to multiple covariates
• Bivariate quadratic regression model:
Y = β0 + β1X1 + β2X2 + β3X1² + β4X1X2 + β5X2² + ε, where X1X2 is the interaction term
• Multivariate quadratic regression:
Y = ∑_{j=1}^p βjXj + ∑_{j≤k} βjkXjXk + ε
• Multivariate quadratic regression with main effects (linear terms) and interactions:
Y = ∑_{j=1}^p βjXj + ∑_{j<k} βjkXjXk + ε
Multivariate spline regression
Idea: Tensor products of univariate basis functions
Drawbacks: curse of dimensionality, namely, number of basis
functions scales exponentially with p
Remedy: add additional structure to f(·).
Example: additive model
Y = f1(X1) + · · · + fp(Xp) + ε
• number of basis functions scales linearly with p
Example: bivariate interaction model
Y = ∑_{1≤i≤j≤p} fij(Xi, Xj) + ε
• number of basis functions scales quadratically with p
1.3. Ridge Regression
Ridge Regression
Drawbacks of OLS:
• requires n > p
• large variance in the presence of collinearity
Remedy: Ridge regression (Hoerl and Kennard, 1970)
β̂λ = (XᵀX + λI)⁻¹Xᵀy
• λ > 0 is a regularization parameter.
Interpretation: penalized least-squares ‖y − Xβ‖² + λ‖β‖².
— Setting the gradient to zero gives Xᵀ(Xβ − y) + λβ = 0.
Bayesian estimator: posterior mean under the prior β ∼ N(0, (σ²/λ)Ip).
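A minimal sketch of the ridge estimator (Python/numpy); lam plays the role of λ, and whether an intercept column should be penalized is left to the user:

    import numpy as np

    def ridge(X, y, lam):
        # closed-form ridge estimate (X'X + lam*I)^{-1} X'y
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)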
Bias-Variance Tradeoff
Smaller variance (due to the prior):
Var(β̂λ) = σ²(XᵀX + λI)⁻¹XᵀX(XᵀX + λI)⁻¹ ≺ Var(β̂).
Larger bias (due to the prior):
E(β̂λ) − β = (XᵀX + λI)⁻¹XᵀXβ − β = −λ(XᵀX + λI)⁻¹β.
Overall error:
MSE(β̂λ) = E‖β̂λ − β‖² = tr{(XᵀX + λI)⁻²[λ²ββᵀ + σ²XᵀX]}.
(d/dλ) MSE(β̂λ)|_{λ=0} < 0  ⇒  there exists λ > 0 that outperforms OLS.
Generalization: `q Penalized Least Squares
ℓq penalized least-squares estimate:
min_β ‖y − Xβ‖² + λ‖β‖_q^q, q > 0.
Known as the bridge estimator (Frank and Friedman, 1993);
when q = 1, called the Lasso estimator (Tibshirani, 1996);
the penalty is strictly concave in |βj| when 0 < q < 1 and convex when q ≥ 1;
only q = 2 admits a closed-form solution.
Generalized Ridge Regression
Assume that β ∼ N(0, Σ). Then
p(β|y, X) ∝ exp{−‖y − Xβ‖²/(2σ²)} · exp{−βᵀΣ⁻¹β/2}.
The MAP estimate is
β̂_MAP = argmax_β p(β|y, X) = (XᵀX + σ²Σ⁻¹)⁻¹Xᵀy.
• Σ can take into account different scales of the covariates.
Ridge Regression Solution Path
• β̂λ = (XᵀX + λI)⁻¹Xᵀy as a function of λ.
Efficient computation: let X = UDVᵀ be the SVD. Then
XᵀX = VDUᵀUDVᵀ = VD²Vᵀ;
β̂λ = V(D² + λI)⁻¹DUᵀy = ∑_{j=1}^p [dj/(dj² + λ)] (ujᵀy) vj,
— uj and vj are the jth columns of U and V.
Computes very fast for many values of {λk}.
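A sketch of the whole solution path from a single SVD (Python/numpy), following the formula above; lambdas is an assumed grid of values:

    import numpy as np

    def ridge_path(X, y, lambdas):
        # one thin SVD, then beta_hat(lambda) for every lambda on the grid
        U, d, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(d) V^T
        Uty = U.T @ y
        shrink = d / (d**2 + np.asarray(lambdas)[:, None])   # d_j / (d_j^2 + lambda), one row per lambda
        return (shrink * Uty) @ Vt                           # row k is beta_hat(lambda_k)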
Kernel Ridge Regression
Theorem 1.3. Alternative expression: β̂λ = Xᵀ(XXᵀ + λI)⁻¹y
Prediction at x is ŷ = xᵀβ̂λ = xᵀXᵀ(XXᵀ + λI)⁻¹y.
Note that (XXᵀ)ij = 〈xi, xj〉 and xᵀXᵀ = (〈x, x1〉, · · · , 〈x, xn〉).
Prediction depends only on pairwise inner products;
generalizing to other similarity measures K(·, ·) is called the kernel trick.
[Figure: illustrative kernel values, K(·, ·) = +10 and K(·, ·) = −10, for two pairs of example images]
Kernel regression
Kernel: (K(xi, xj))_{n×n} is PSD for any {xi}_{i=1}^n.
Commonly used kernels:
• linear: 〈u, v〉
• polynomial: (1 + 〈u, v〉)^d, d = 2, 3, · · ·
• Gaussian: exp(−γ‖u − v‖²)
• Laplacian: exp(−γ‖u − v‖)
• Fit the model y = ∑_{j=1}^n αjK(x, xj) + ε with basis {K(·, xj)}_{j=1}^n by
min_{α∈Rⁿ} { ‖y − Kα‖² + λαᵀKα }
Kernel ridge regression
With K = (K(xi, xj)) ∈ R^{n×n}, the prediction at a new point x is
ŷ = (K(x, x1), · · · , K(x, xn))(K + λI)⁻¹y,
• i.e., ŷ = f̂(x) = ∑_{i=1}^n α̂i K(x, xi) with weights α̂ = (K + λI)⁻¹y, where x is a testing point and x1, · · · , xn are training points;
• tune the parameter λ to minimize prediction error.
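A sketch of kernel ridge regression with a Gaussian kernel (Python/numpy); gamma and lam are assumed tuning parameters, and the X arrays are 2-D with rows as observations:

    import numpy as np

    def gaussian_kernel(A, B, gamma):
        # K[i, j] = exp(-gamma * ||a_i - b_j||^2)
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def kernel_ridge_fit_predict(X_train, y_train, X_test, gamma, lam):
        K = gaussian_kernel(X_train, X_train, gamma)
        alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y_train)   # alpha = (K + lam*I)^{-1} y
        K_test = gaussian_kernel(X_test, X_train, gamma)                 # K(x, x_i) for each test x
        return K_test @ alpha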
1.4 Reproducing Kernel Hilbert
Spaces
Justification of Kernel Tricks by Representer Theorem
Hilbert Space
Hilbert space: a complete vector space endowed with an inner product.
• X = a set; H = a space of functions on X with inner product 〈·, ·〉H.
• Evaluation functional at x: Lx(f) = f(x), ∀f ∈ H.
Reproducing kernel Hilbert space (RKHS): for any x, ∃ Cx s.t.
|Lx(f)| = |f(x)| ≤ Cx‖f‖H, ∀f ∈ H.
• Lx is a continuous linear functional. By the Riesz representation theorem,
∃ Kx ∈ H s.t. 〈Kx, f〉H = Lx(f), ∀f ∈ H.
Reproducing Kernel
Reproducing kernel: K (x,x′) = 〈Kx,Kx′〉H .
Symmetry: K (x,x′) = K (x′,x);
PSD: the matrix (K(xi, xj))_{n×n} is PSD for all {xi}_{i=1}^n.
• K admits the eigen-decomposition
K(x, x′) = ∑_{j=1}^∞ γjψj(x)ψj(x′), with ∑_{j=1}^∞ γj² < ∞,
— {γj}_{j=1}^∞ are the eigenvalues and {ψj}_{j=1}^∞ are the eigenfunctions.
Kernel Ridge Regression
• For any g, g′ ∈ HK with g = ∑_{j=1}^∞ βjψj and g′ = ∑_{j=1}^∞ β′jψj, define
〈g, g′〉_HK = ∑_{j=1}^∞ γj⁻¹ βjβ′j;  ‖g‖_HK = √〈g, g〉_HK.
Reproducibility: 〈K(·, x′), g〉_HK = ∑_j γj⁻¹{γjψj(x′)}βj = g(x′).
Nonparametric regression: yi = f(xi) + εi with f ∈ HK, i ∈ [n]. Find
f̂ = argmin_{f∈HK} { ∑_{i=1}^n [yi − f(xi)]² + λ‖f‖²_HK }, λ > 0.
Using f = ∑_{j=1}^∞ βjψj, and setting β̃j = βj/√γj and ψ̃j = √γj ψj, we get
min_{ {β̃j}_{j=1}^∞ } { ∑_{i=1}^n [yi − ∑_{j=1}^∞ β̃jψ̃j(xi)]² + λ ∑_{j=1}^∞ β̃j² }.
Representer Theorem
Theorem 1.4
For a loss L(y, f(x)) and an increasing function Pλ(·), let
f̂ = argmin_{f∈HK} { ∑_{i=1}^n L(yi, f(xi)) + Pλ(‖f‖_HK) }, λ > 0.
Then (homework)
f̂ = ∑_{i=1}^n α̂i K(·, xi),
where α̂ = (α̂1, · · · , α̂n)ᵀ solves
min_α { ∑_{i=1}^n L(yi, ∑_{j=1}^n αjK(xi, xj)) + Pλ(√(αᵀKα)) }.
• Infinite-dimensional regression problem;
• finite-dimensional representation for the solution.
Applications of Representer Theorem
Apply the representer theorem to kernel ridge regression:
f̂ = argmin_{f∈HK} { ∑_{i=1}^n (yi − f(xi))² + λ‖f‖²_HK }.
We must have f̂ = ∑_{i=1}^n α̂iK(·, xi) with α̂ ∈ Rⁿ solving
min_{α∈Rⁿ} { ‖y − Kα‖² + λαᵀKα }.
It is easily seen that
α̂ = (K + λI)⁻¹y.
1.5 Cross-Validation
k -fold Cross-Validation
Purpose: To estimate Prediction Error for a procedure
k -fold Cross-Validation (CV)
• Divide the data randomly and evenly into k subsets;
• use one subset as the testing set and the remaining ones as the training set to compute a testing error;
• repeat for each of the k subsets and average the testing errors.
Commonly-used: k = 5 or 10
Leave-one-out CV: k = n.  CV = (1/n) ∑_{i=1}^n [yi − f̂−i(xi)]²,
where f̂−i(xi) is the predicted value at xi based on {(xj, yj)}_{j≠i}.
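A sketch of k-fold CV for a generic fit/predict pair (Python/numpy); fit and predict are placeholders for whatever procedure is being evaluated:

    import numpy as np

    def k_fold_cv(X, y, fit, predict, k=5, seed=0):
        # average squared prediction error over k random folds
        idx = np.random.default_rng(seed).permutation(len(y))
        errors = []
        for test in np.array_split(idx, k):
            train = np.setdiff1d(idx, test)
            model = fit(X[train], y[train])
            errors.append(np.mean((y[test] - predict(model, X[test])) ** 2))
        return float(np.mean(errors))

    # e.g., with the ridge sketch above: k_fold_cv(X, y, lambda A, b: ridge(A, b, 1.0), lambda m, A: A @ m)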
Linear smoother
• ŷ = Sy for data {(xi, yi)}_{i=1}^n, where S depends only on X.
• Self-stable: f̃(x) = f̂(x) for every x, where f̂ is the function estimated from {(xi, yi)}_{i=1}^n and f̃ is the function estimated from the augmented data {(xi, yi)}_{i=1}^n ∪ {(x, f̂(x))}.
Theorem 1.5. For a self-stable linear smoother ŷ = Sy,
yi − f̂−i(xi) = (yi − ŷi)/(1 − Sii), ∀i ∈ [n],  so  CV = (1/n) ∑_{i=1}^n [(yi − ŷi)/(1 − Sii)]².
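A sketch verifying the shortcut for ridge regression, which is a self-stable linear smoother with S = X(XᵀX + λI)⁻¹Xᵀ (Python/numpy):

    import numpy as np

    def loocv_shortcut_ridge(X, y, lam):
        # CV = (1/n) * sum_i ((y_i - yhat_i) / (1 - S_ii))^2, with no refitting
        n, p = X.shape
        S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # smoother matrix S_lambda
        y_hat = S @ y
        return np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)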
Generalized Cross-Validation
GCV (Golub et al., 1979): GCV = (1/n) ∑_{i=1}^n (yi − ŷi)² / [1 − tr(S)/n]².
• tr(S) is called the effective degrees of freedom.
Self-stable method                 S                      tr(S)
Multiple linear regression         X(XᵀX)⁻¹Xᵀ             p
Ridge regression                   X(XᵀX + λI)⁻¹Xᵀ        ∑_{j=1}^p dj²/(dj² + λ)
Kernel ridge regression in RKHS    K(K + λI)⁻¹            ∑_{j=1}^n γj/(γj + λ)
• {dj} and {γj} are the singular values of X and K, respectively.
Use GCV to choose λ by minimizing
GCV(λ) = (1/n) yᵀ(I − Sλ)y / [1 − tr(Sλ)/n]².
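A sketch of GCV-based selection of λ for ridge regression (Python/numpy), using the residual-sum-of-squares form of GCV from the definition above; lambdas is an assumed grid of candidate values:

    import numpy as np

    def gcv_ridge(X, y, lambdas):
        # return the lambda on the grid minimizing GCV(lambda), plus all scores
        n, p = X.shape
        scores = []
        for lam in lambdas:
            S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
            resid = y - S @ y
            scores.append(np.mean(resid**2) / (1 - np.trace(S) / n) ** 2)
        return lambdas[int(np.argmin(scores))], scores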
Bias variance decomposition
Double Expectation: EZ = E{E(Z |X)}, for any X
Best prediction: E(Y |X) = argminf E(Y − f (X))2
Bias-variance in prediction: letting f*(X) = E(Y|X), then
E(Y − f̂(X))² = E(Y − f*(X))² + E(f*(X) − f̂(X))²,
where the first term is the irreducible variance Eσ²(X) and the second is the squared bias of the predictor f̂.
Bias-variance in estimation: letting f̄(x) = E f̂n(x), then
E(f̂n(X) − f*(X))² = E(f̂n(X) − f̄(X))² + E(f̄(X) − f*(X))²,
where the first term is the variance and the second is the squared bias.
• Variance is small when n is large, and big when the number of parameters is big.
• Bias is small when the model is complex.