Lecture 9: Regression
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU) Lecture 9 1 / 24
Beyond binary classification
Suppose we want to predict a patient’s blood pressure based on weight and height.
Binary classification no longer applies; this falls into the regression framework.
X – instance space (the set of all possible examples)
Y – label space (the set of all possible labels, Y ⊆ R)
Training sample: S = ((x1, y1), . . . , (xm, ym))
Learning algorithm:
I Input: a training sample S
I Output: a prediction rule (regressor) hS : X → Y
loss function ` : Y × Y → R+
Common loss functions:
I absolute loss: `(y, y′) = |y − y′|
I square loss: `(y, y′) = (y − y′)^2
The regression problem
X – instance space (the set of all possible examples)
Y – label space (the set of all possible labels, Y ⊆ R)
Training sample: S = ((x1, y1), . . . , (xm, ym))
Learning algorithm:
I Input: a training sample S
I Output: a prediction rule hS : X → Y
loss function ` : Y × Y → R+
as before, assume distribution D over X × Y (agnostic setting)
given a regressor h : X → Y, define the risk

risk`(h, D) = E_{(X,Y)∼D}[`(h(X), Y)]

also, the empirical/sample risk

risk`(h, S) = (1/m) ∑_{i=1}^m `(h(x_i), y_i)

the risk depends on `
Bayes-optimal regressor

Definition: h∗ : X → Y is Bayes-optimal if it minimizes the risk risk`(h, D) over all h : X → Y

Q: what is the Bayes-optimal regressor for square loss `(y, y′) = (y − y′)^2?

A: h∗(x) = E_{(X,Y)∼D}[Y | X = x]

Q: what about absolute loss `(y, y′) = |y − y′|?

A: h∗(x) = MEDIAN_{(X,Y)∼D}[Y | X = x]

proofs coming up

but D is unknown to the learner
Bayes-optimal regressor for square loss
h∗ : X → Y is Bayes-optimal for square loss if it minimizes the risk

E_{(X,Y)∼D}[(h(X) − Y)^2]

over all h : X → Y

Claim: h∗(x) = E_{(X,Y)∼D}[Y | X = x]

Proof:
I E_{(X,Y)∼D}[(h(X) − Y)^2] = E_X[ E_Y[(h(X) − Y)^2 | X] ]
I will minimize the inner expectation pointwise for each X = x
I Exercise: for a_1, a_2, . . . , a_n ∈ R, the minimizer of ∑_{i=1}^n (a_i − b)^2 over b ∈ R is b = (1/n) ∑_{i=1}^n a_i = MEAN(a_1, . . . , a_n)
I Conclude (approximating any distribution by a sum of atomic masses) that square loss is minimized by the mean.
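The exercise above can be checked numerically. A minimal sketch (not part of the slides; the sample `a` and the grid search are illustrative assumptions):

```python
import numpy as np

# For a sample a_1..a_n, sum_i (a_i - b)^2 is minimized at b = mean(a).
rng = np.random.default_rng(0)
a = rng.normal(size=100)

def sq_loss(b):
    return np.sum((a - b) ** 2)

# Compare the mean against a fine grid of candidate values of b.
grid = np.linspace(a.min(), a.max(), 1001)
best_on_grid = grid[np.argmin([sq_loss(b) for b in grid])]
assert abs(best_on_grid - a.mean()) < 2e-2
```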
Bayes-optimal regressor for absolute loss
h∗ : X → Y is Bayes-optimal for absolute loss if it minimizes the risk

E_{(X,Y)∼D}[|h(X) − Y|]

over all h : X → Y

Claim: h∗(x) = MEDIAN_{(X,Y)∼D}[Y | X = x]

Proof:
I E_{(X,Y)∼D}[|h(X) − Y|] = E_X[ E_Y[|h(X) − Y| | X] ]
I will minimize the inner expectation pointwise for each X = x
I Exercise: for a_1, a_2, . . . , a_n ∈ R, the minimizer of ∑_{i=1}^n |a_i − b| over b ∈ R is b = MEDIAN(a_1, . . . , a_n) [note: not unique!]
I Conclude (approximating any distribution by a sum of atomic masses) that absolute loss is minimized by a median.
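The median exercise can be checked the same way; a minimal sketch with an illustrative sample (odd size, so the median is a sample point):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=101)  # odd size -> the median is one of the samples

def abs_loss(b):
    return np.sum(np.abs(a - b))

med = np.median(a)
# The median beats (or ties) every sample point as a candidate minimizer.
assert all(abs_loss(med) <= abs_loss(b) + 1e-12 for b in a)
```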
Approximation/estimation error
loss ` : Y × Y → R+
risk:
I empirical: risk`(h, S) = (1/m) ∑_{i=1}^m `(h(x_i), y_i)
I distribution: risk`(h, D) = E_{(X,Y)∼D}[`(h(X), Y)]
hypothesis space H ⊂ Y^X — the set of possible regressors
ERM: hS = argminh∈H risk`(h, S)
approximation error:

risk_app,` := inf_{h∈H} risk`(h, D)

estimation error:

risk_est,` := risk`(hS, D) − inf_{h∈H} risk`(h, D)
the usual Bias-Complexity tradeoff
The usual questions
Statistical:
I How many examples suffice to guarantee low estimation error?
I How to choose H to achieve low approximation error?
Computational: how to perform ERM efficiently?
Linear regression
instance space X = Rd
label space Y = R

hypothesis space H ⊂ Y^X:

H = { h_{w,b} : R^d → R | w ∈ R^d, b ∈ R },

where h_{w,b}(x) := 〈w, x〉 + b.
square loss: `(y , y ′) = (y − y ′)2
intuition: fitting a straight line to the data [illustration on board]

as before, b can be absorbed into w′ = [w; b] by padding x with an extra constant dimension
ERM optimization problem: find

w ∈ argmin_{w∈R^d} ∑_{i=1}^m (〈w, x_i〉 − y_i)^2
a.k.a. “least squares”
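The bias-absorption trick above can be sketched in a few lines (the values of w, b, x below are arbitrary, for illustration only):

```python
import numpy as np

# Absorb b into w' = [w; b] by padding x with a constant 1 coordinate,
# so that <w', x'> = <w, x> + b.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([3.0, 4.0])

x_pad = np.append(x, 1.0)  # x' = [x; 1]
w_pad = np.append(w, b)    # w' = [w; b]
assert np.isclose(w @ x + b, w_pad @ x_pad)
```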
Solving least squares
optimization problem: minimize_{w∈R^d} ∑_{i=1}^m (〈w, x_i〉 − y_i)^2

write the data as a d × m matrix X and the labels as Y ∈ R^m

write the objective function

f(w) = ‖X⊤w − Y‖^2

f is convex and differentiable

gradient: ∇f(w) = 2X(X⊤w − Y)

minimum at ∇f = 0:

X(X⊤w − Y) = 0 ⇐⇒ XX⊤w = XY

solution: w = (XX⊤)^{−1} XY

(pseudo-inverse (XX⊤)^+ if XX⊤ is not invertible; when will this happen?)
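The closed-form solution can be sketched with NumPy; the synthetic X, Y and the noise level below are assumptions for illustration:

```python
import numpy as np

# Least squares via the normal equations: X is d x m (columns are
# examples), Y in R^m, and w solves X X^T w = X Y.
rng = np.random.default_rng(2)
d, m = 3, 50
X = rng.normal(size=(d, m))
w_true = np.array([1.0, -2.0, 0.5])
Y = X.T @ w_true + 0.01 * rng.normal(size=m)

# The pseudo-inverse also covers the case where X X^T is singular.
w = np.linalg.pinv(X @ X.T) @ (X @ Y)
assert np.allclose(w, w_true, atol=0.1)
```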
The pseudo-inverse

more formally: the Moore–Penrose pseudo-inverse, denoted A^+

exists for any m × n matrix A

is uniquely defined by 4 properties:

AA^+A = A,  A^+AA^+ = A^+,  (AA^+)⊤ = AA^+,  (A^+A)⊤ = A^+A

is given by the limit

A^+ = lim_{λ↓0} (A⊤A + λI)^{−1} A⊤ = lim_{λ↓0} A⊤ (AA⊤ + λI)^{−1};

the limits exist even if AA⊤ or A⊤A is not invertible (note that A⊤A + λI and AA⊤ + λI are always invertible)

not continuous in the entries of A

when solving XX⊤w = XY:
I XX⊤ invertible =⇒ unique solution w = (XX⊤)^{−1} XY
I else, for any N ∈ R^{d×d}, can set:

w = (XX⊤)^+ XY + (I − (XX⊤)^+ (XX⊤)) N

I choosing N = 0 yields the solution of smallest norm.

(XX⊤)^+ can be computed in time O(d^3).
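The 4 defining properties can be checked numerically with NumPy's `np.linalg.pinv`; the rank-deficient matrix A below is an arbitrary example (it has no ordinary inverse):

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])  # 3x2, rank 1
Ap = np.linalg.pinv(A)

assert np.allclose(A @ Ap @ A, A)       # A A+ A = A
assert np.allclose(Ap @ A @ Ap, Ap)     # A+ A A+ = A+
assert np.allclose((A @ Ap).T, A @ Ap)  # (A A+)^T = A A+
assert np.allclose((Ap @ A).T, Ap @ A)  # (A+ A)^T = A+ A
```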
Computational complexity of least squares
Optimization problem:

minimize_{w∈R^d} f(w) = ‖X⊤w − Y‖^2

Solution: w = (XX⊤)^{−1} XY

Since forming XX⊤ takes O(md^2) time and a d × d matrix can be inverted in O(d^3) time, the total computational cost is O(md^2 + d^3)
Statistical complexity of least squares
Theorem
Let H = {h_w | w ∈ R^d}, where h_w(x) := 〈w, x〉.

With high probability, for all h_w ∈ H,

risk`(h_w, D) ≤ risk`(h_w, S) + O(√(d/m))
Sample complexity is O(d)
Similar to binary classification using linear separators.
What to do if dimension is very large?
Statistical complexity of least squares
Theorem
Suppose
I the training sample S = {(x_i, y_i) : i ≤ m} satisfies ‖x_i‖ ≤ R for all i ≤ m
I H = {h_w | ‖w‖ ≤ B} (linear predictors with norm ≤ B)
I |y_i| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
risk`(h, D) ≤ risk`(h, S) + O(B^2 R^2 / √m)
Insight: restrict (i.e., regularize) ‖w‖ for better generalization.
Ridge regression: regularized least squares

Recall:

risk`(h, D) ≤ risk`(h, S) + O(B^2 R^2 / √m)

Sample complexity depends on ‖w‖ ≤ B.
Instead of restricting ‖w‖, use regularization (sounds familiar?)
Optimization problem:

minimize_{w∈R^d} λ‖w‖^2 + ∑_{i=1}^m (〈w, x_i〉 − y_i)^2

a.k.a. “regularized least squares”/“ridge regression”

In matrix form: f(w) = λ‖w‖^2 + ‖X⊤w − Y‖^2

Gradient: ∇f(w) = 2λw + 2X(X⊤w − Y)

Gradient is 0 precisely when (XX⊤ + λI)w = XY

Solution: w = (XX⊤ + λI)^{−1} XY (XX⊤ + λI is always invertible for λ > 0).
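The closed-form ridge solution can be sketched as follows; the synthetic data and the choice λ = 1 are arbitrary assumptions:

```python
import numpy as np

# Ridge regression closed form: w = (X X^T + lam*I)^{-1} X Y, X of shape d x m.
rng = np.random.default_rng(3)
d, m, lam = 5, 100, 1.0
X = rng.normal(size=(d, m))
Y = rng.normal(size=m)

w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ Y)

# Sanity check: the gradient 2*lam*w + 2X(X^T w - Y) vanishes at the solution.
grad = 2 * lam * w + 2 * X @ (X.T @ w - Y)
assert np.allclose(grad, 0.0, atol=1e-6)
```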
Kernelization

How to learn with non-linear hypotheses? Kernels!

recall the feature map ψ : X → F (say F = R^n, n possibly huge or ∞)

induces a kernel K : X × X → R via K(x, x′) = 〈ψ(x), ψ(x′)〉

regularized least squares objective function:

f(w) = λ‖w‖^2 + ∑_{i=1}^m (〈w, ψ(x_i)〉 − y_i)^2

does the representer theorem apply?

indeed, the objective is of the form

g(〈w, ψ(x_1)〉, . . . , 〈w, ψ(x_m)〉) + R(‖w‖), where g : R^m → R and R : R^+ → R is non-decreasing

hence, the optimal w can always be expressed as

w = ∑_{i=1}^m α_i ψ(x_i).
Kernel ridge regression
Representer theorem =⇒ the optimal w can be expressed as w = ∑_{i=1}^m α_i ψ(x_i)

substitute into the objective function f(w) = λ‖w‖^2 + ∑_{i=1}^m (〈w, ψ(x_i)〉 − y_i)^2:

g(α) = λ ∑_{1≤i,j≤m} α_i α_j K(x_i, x_j) + ∑_{i=1}^m ( 〈∑_{j=1}^m α_j ψ(x_j), ψ(x_i)〉 − y_i )^2

     = λ ∑_{1≤i,j≤m} α_i α_j K(x_i, x_j) + ∑_{i=1}^m ( ∑_{j=1}^m α_j K(x_i, x_j) − y_i )^2

     = λ α⊤Gα + ‖Gα − Y‖^2,

where G_{ij} = K(x_i, x_j).
Kernel ridge regression: solution

problem: minimize g(α) over α ∈ R^m, where

g(α) = λ α⊤Gα + ‖Gα − Y‖^2
     = λ α⊤Gα + α⊤G⊤Gα − 2α⊤GY + ‖Y‖^2,

where G_{ij} = K(x_i, x_j) is an m × m matrix

gradient:

∇g(α) = 2λGα + 2G⊤Gα − 2GY

when G is invertible, ∇g(α) = 0 at

α = (G + λI)^{−1} Y

what about non-invertible G?

α = (λG + G⊤G)^+ GY,

where (·)^+ is the Moore–Penrose pseudo-inverse

computational cost: O(m^3)
Kernel ridge regression: prediction
after computing α = (G + λI)^{−1} Y, where G_{ij} = K(x_i, x_j)

to predict the label y at a new point x, compute

h(x) = 〈w, ψ(x)〉 = 〈∑_{i=1}^m α_i ψ(x_i), ψ(x)〉 = ∑_{i=1}^m α_i 〈ψ(x_i), ψ(x)〉 = ∑_{i=1}^m α_i K(x_i, x)
how to choose regularization parameter λ?
cross-validation!
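The fit-then-predict pipeline of the last two slides can be sketched end to end; the Gaussian (RBF) kernel, the sine data, and the parameter values below are illustrative assumptions, not fixed by the slides:

```python
import numpy as np

# Kernel ridge regression: alpha = (G + lam*I)^{-1} Y with Gram matrix
# G_ij = K(x_i, x_j); prediction h(x) = sum_i alpha_i K(x_i, x).
rng = np.random.default_rng(4)
m, lam = 40, 0.1
x = rng.uniform(-3, 3, size=m)
y = np.sin(x) + 0.05 * rng.normal(size=m)

def K(a, b, gamma=1.0):
    # Gaussian (RBF) kernel matrix between two 1-d point sets.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

G = K(x, x)
alpha = np.linalg.solve(G + lam * np.eye(m), y)

def h(x_new):
    return K(x_new, x) @ alpha

x_test = np.linspace(-2, 2, 9)
# The fitted function should roughly track sin on the training range.
assert np.max(np.abs(h(x_test) - np.sin(x_test))) < 0.5
```

In practice λ (and the kernel parameter) would be tuned by cross-validation, as the slide says.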
Kernel ridge regression: generalization
Theorem
Suppose
I the data S = {(x_i, y_i) : i ≤ m} lies in a ball of radius R in feature space: ‖ψ(x_i)‖ = √(K(x_i, x_i)) ≤ R for all i ≤ m
I H = {h_w(x) ≡ 〈w, ψ(x)〉 | ‖w‖ ≤ B}, that is: w = ∑ α_i ψ(x_i) and ‖w‖ = √(α⊤Gα) ≤ B
I |y_i| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
risk`(h, D) ≤ risk`(h, S) + O(B^2 R^2 / √m)
LASSO: `1-regularized least squares
recall: the `1 norm is ‖w‖_1 = ∑_{i=1}^d |w_i|; the `2 norm is ‖w‖_2 = √(∑_{i=1}^d w_i^2)
ridge regression is `2-regularized least squares:

min_{w∈R^d} λ‖w‖_2^2 + ∑_{i=1}^m (〈w, x_i〉 − y_i)^2

what if we penalize ‖w‖_1 instead?

min_{w∈R^d} λ‖w‖_1 + ∑_{i=1}^m (〈w, x_i〉 − y_i)^2
intuition: encourages sparsity [draw `1 and `2 balls on board]

LASSO: “least absolute shrinkage and selection operator”

no closed-form solution; must solve a quadratic program (exercise: write it down!)

not kernelizable (why not?)

the LARS algorithm gives the entire regularization path (no need to solve a new QP for each λ)
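Since the slides leave the QP as an exercise, here is a sketch of one standard solver for the same objective: coordinate descent with soft-thresholding (not the LARS algorithm; all data below is synthetic, for illustration):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, y, lam, n_iter=200):
    """Coordinate descent for lam*||w||_1 + ||A w - y||^2 (rows of A = examples)."""
    m, d = A.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            r = y - A @ w + A[:, j] * w[j]  # residual excluding coordinate j
            w[j] = soft_threshold(A[:, j] @ r, lam / 2) / (A[:, j] @ A[:, j])
    return w

rng = np.random.default_rng(5)
m, d = 100, 10
A = rng.normal(size=(m, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]  # sparse ground truth
y = A @ w_true + 0.01 * rng.normal(size=m)

w = lasso_cd(A, y, lam=5.0)
# A sizable penalty should zero out the irrelevant coordinates.
assert np.sum(np.abs(w) > 0.1) <= 4
```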
LASSO regularization path
Larger λ encourages sparser w
Equivalently, can constrain ‖w‖1 ≤ B
When B = 0, must have w = 0
As B increases, coordinates of w are gradually “activated”
There is a critical set of values of B at which w gains new coordinates
One can compute these critical values analytically
Coordinates of optimal w are piecewise linear in B
LARS (“least angle regression and shrinkage”) computes the entire regularization path
[diagram on board]
Cost: roughly the same as least squares
For more details, see the book: Kevin P. Murphy, Machine Learning: A Probabilistic Perspective
LASSO: generalization
Theorem
Suppose
I the data S = {(x_i, y_i) : i ≤ m} lies in the `∞ ball of radius R in R^d: ‖x_i‖_∞ := max_{1≤j≤d} |x_{ij}| ≤ R for all i ≤ m
I H = {h_w(x) ≡ 〈w, x〉 | ‖w‖_1 ≤ B}
I |y_i| ≤ BR for all i ≤ m.
Then with high probability, for all h ∈ H,
risk`(h, D) ≤ risk`(h, S) + O(B^2 R^2 / √m)

Bounds in terms of ‖w‖_0 := ∑_{j=1}^d I[w_j ≠ 0] are also available.
Sparsity =⇒ good generalization.
Regression summary
In regression, we seek to fit a function h : X → R to the data (Y = R).
An ERM learner minimizes the empirical risk: risk`(h, S) = (1/m) ∑_{i=1}^m `(h(x_i), y_i)

We actually care about the distribution risk: risk`(h, D) = E_{(X,Y)∼D}[`(h(X), Y)]
Both depend on choice of `; we focused on square loss: `(y , y ′) = (y − y ′)2.
Regularize w to avoid overfitting.
Ridge regression/regularized least squares:
I minimize_{w∈R^d} λ‖w‖_2^2 + ∑_{i=1}^m (〈w, x_i〉 − y_i)^2
I analytic, efficient, closed-form solution
I kernelizable

`1 regularization/LASSO:
I minimize_{w∈R^d} λ‖w‖_1 + ∑_{i=1}^m (〈w, x_i〉 − y_i)^2
I solution efficiently computable by the LARS algorithm
I not kernelizable
In both cases, λ is tuned by cross-validation.