
Transcript of Advanced Introduction to Machine Learning

Page 1

Advanced Introduction to Machine Learning — Spring Quarter, Week 5 —

https://canvas.uw.edu/courses/1372141

Prof. Jeff Bilmes

University of Washington, Seattle

Departments of: Electrical & Computer Engineering, Computer Science & Engineering

http://melodi.ee.washington.edu/~bilmes

April 27th/29th, 2020

Page 2

Announcements

HW2 is posted. Due May 1st, 6:00pm via our assignment dropbox (https://canvas.uw.edu/courses/1372141/assignments).

Virtual office hours this week, Thursday night at 10:00pm via zoom (same link as class).

Page 3

Class Road Map

W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future

Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).

Page 4

Class (and Machine Learning) overview

1. Introduction • What is ML • What is AI • Why are we so interested in these topics right now?

2. ML Paradigms/Concepts • Overfitting/Underfitting, model complexity, bias/variance • size of data, big data, sample complexity • ERM, loss + regularization, loss functions, regularizers • supervised, unsupervised, and semi-supervised learning • reinforcement learning, RL, multi-agent, planning/control • transfer and multi-task learning • federated and distributed learning • active learning, machine teaching • self-supervised, zero/one-shot, open-set learning

3. Dealing with Features • dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP • locality sensitive hashing (LSH) • feature selection • feature engineering • matrix factorization & feature engineering • representation learning

4. Evaluation • accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin • train/eval/test data splits • n-fold cross validation • method of the bootstrap

5. Optimization Methods • Unconstrained continuous optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton • Constrained continuous optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming • Discrete optimization: greedy, beam search, branch-and-bound, submodular optimization

6. Inference Methods • probabilistic inference • MLE, MAP • belief propagation • forward/backpropagation • Monte Carlo methods

7. Models & Representation • linear least squares, linear regression, logistic regression, sparsity, ridge, lasso • generative vs. discriminative models • Naive Bayes • k-nearest neighbors • clustering, k-means, k-medoids, EM & GMMs, single linkage • decision trees and random forests • support vector machines, kernel methods, max margin • perceptron, neural networks, DNNs • Gaussian processes • Bayesian nonparametric methods • ensemble methods • the bootstrap, bagging, and boosting • graphical models • time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers • structured prediction • grammars (as in NLP)

8. Philosophy, Humanity, Spirituality • artificial intelligence (AI) • artificial general intelligence (AGI) • artificial intelligence vs. science fiction

9. Applications • computational biology • social networks • computer vision • speech recognition • natural language processing • information retrieval • collaborative filtering/matrix factorization

10. Programming • python • libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.) • HPC: C/C++, CUDA, vector processing

11. Background • linear algebra • multivariate calculus • probability theory and statistics • information theory • mathematical (e.g., convex) optimization

12. Other Techniques • compressed sensing • submodularity, diversity/homogeneity modeling

[Figure thumbnails: a variable-elimination example on a six-variable graphical model, computing $p(x_1, x_2) = \sum_{x_3}\sum_{x_4}\cdots\sum_{x_6} p(x_1, x_2, \ldots, x_6)$ by summing out one variable at a time, with the variable to eliminate, the corresponding marginalization operation, the graphical transformation ("reconstituted graph"), and per-step complexities from $O(r^2)$ to $O(r^4)$; and a deep neural network with an input layer, hidden layers 1-7, and an output unit.]

Page 5

Multiclass and Voronoi tessellations

$\ell$ classes, $j \in \{1, 2, \ldots, \ell\}$, parameters $\theta^{(j)} \in \mathbb{R}^m$, $\ell$ linear discriminant functions $h_{\theta^{(j)}}(x) = \langle \theta^{(j)}, x \rangle$.

Partition input space $x \in \mathbb{R}^m$ based on who wins:

$\mathbb{R}^m = \mathbb{R}^m_1 \cup \mathbb{R}^m_2 \cup \cdots \cup \mathbb{R}^m_\ell$   (5.19)

where $\mathbb{R}^m_i \subseteq \mathbb{R}^m$ and $\mathbb{R}^m_i \cap \mathbb{R}^m_j = \emptyset$ whenever $i \neq j$.

Winning block, for all $i \in [\ell]$:

$\mathbb{R}^m_i \triangleq \{x \in \mathbb{R}^m : h_{\theta^{(i)}}(x) > h_{\theta^{(j)}}(x),\ \forall j \neq i\}$   (5.20)

Alternatively,

$\mathbb{R}^m_i \triangleq \{x \in \mathbb{R}^m : \langle \theta^{(i)} - \theta^{(j)}, x \rangle > 0,\ \forall j \neq i\}$   (5.21)

consider when $\langle\cdot,\cdot\rangle$ is the dot product. Alternatively, under the softmax regression model,

$\mathbb{R}^m_i \triangleq \{x \in \mathbb{R}^m : \Pr(C_i|x) > \Pr(C_j|x),\ \forall j \neq i\}$   (5.22)
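A minimal NumPy sketch (not from the slides; the toy dimensions and parameters below are made up) of these winner-take-all regions: each point x is assigned to the class whose linear discriminant $\langle\theta^{(j)}, x\rangle$ is largest, i.e., the block in (5.20) that contains it.

    import numpy as np

    rng = np.random.default_rng(0)
    m, num_classes = 2, 3                      # toy feature dimension and number of classes
    Theta = rng.normal(size=(num_classes, m))  # row j holds the parameter vector theta^(j)
    X = rng.normal(size=(5, m))                # a few query points x

    scores = X @ Theta.T                       # scores[i, j] = <theta^(j), x^(i)>
    regions = scores.argmax(axis=1)            # index of the winning discriminant for each x
    print(regions)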

Page 6

0/1-loss and Bayes error

Let $L$ be the 0/1 loss, i.e., $L(y, y') = \mathbf{1}_{\{y \neq y'\}}$.

Probability of error for a given $x$:

$\Pr(h_\theta(x) \neq Y) = \int p(y|x)\,\mathbf{1}_{\{h_\theta(x) \neq y\}}\,dy = 1 - p(h_\theta(x)|x)$   (5.7)

To minimize the probability of error, $h_\theta(x)$ should ideally choose a class having maximum posterior probability for the current $x$.

The smallest probability of error is known as the Bayes error for $x$:

$\text{Bayes error}(x) = \min_y\,(1 - p(y|x)) = 1 - \max_y\,p(y|x)$   (5.8)

The Bayes classifier (or predictor) predicts using the highest probability, assuming we have access to $p(y|x)$. I.e.,

$\text{BayesPredictor}(x) \triangleq \operatorname{argmax}_y\,p(y|x)$   (5.9)

The Bayes predictor has the Bayes error as its error. Irreducible error rate.

Page 7

Bayes Error, overall difficulty of a problem

For the Bayes error, we often take the expected value w.r.t. $x$. This gives an overall indication of how difficult a classification problem is, since we can never do better than the Bayes error.

Bayes error, various equivalent forms (Exercise: show the last equality):

$\text{Bayes Error} = \min_h \int p(x, y)\,\mathbf{1}_{\{h(x) \neq y\}}\,dy\,dx$   (5.13)

$= \min_h \Pr(h(X) \neq Y)$   (5.14)

$= \mathbb{E}_{p(x)}[\min(p(y|x),\ 1 - p(y|x))]$   (5.15)

$= \frac{1}{2} - \frac{1}{2}\,\mathbb{E}_{p(x)}\big[\,|2p(y|x) - 1|\,\big]$   (5.16)

For binary classification, if the Bayes error is 1/2, prediction can never be better than flipping a coin (i.e., $x$ tells us nothing about $y$).

The Bayes error is a property of the distribution $p(x, y)$; it can be useful to decide between, say, $p(x, y)$ vs. $p(x', y)$ for two different feature sets $x$ and $x'$.
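As a quick numeric sanity check (an added sketch with a made-up joint distribution, not from the slides), the two forms (5.15) and (5.16) agree for a binary problem with a discrete feature:

    import numpy as np

    # Toy joint distribution over a discrete x in {0,1,2} and binary y; the values are made up.
    p_xy = np.array([[0.30, 0.10],   # p(x=0, y=0), p(x=0, y=1)
                     [0.05, 0.25],
                     [0.15, 0.15]])
    p_x = p_xy.sum(axis=1)
    p_y1_given_x = p_xy[:, 1] / p_x

    # Form (5.15): E_{p(x)}[ min(p(y|x), 1 - p(y|x)) ]
    bayes_err_a = np.sum(p_x * np.minimum(p_y1_given_x, 1.0 - p_y1_given_x))
    # Form (5.16): 1/2 - 1/2 E_{p(x)}[ |2 p(y|x) - 1| ]
    bayes_err_b = 0.5 - 0.5 * np.sum(p_x * np.abs(2.0 * p_y1_given_x - 1.0))
    print(bayes_err_a, bayes_err_b)   # both print 0.3: the two forms agree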

Page 8

ERM and complexity

Empirical risk minimization on data $\mathcal{D}$:

$\hat{h} \in \operatorname{argmin}_{h \in \mathcal{H}} \widehat{\mathrm{Risk}}(h)$, where $\widehat{\mathrm{Risk}}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{h(x^{(i)}) \neq y^{(i)}\}}$   (5.13)

$\hat{h}$ is a random variable, computed based on the random data set $\mathcal{D}$. How close is the risk of $\hat{h}$ to the true risk, the one for the best hypothesis $h^*$?

$h^* \in \operatorname{argmin}_{h \in \mathcal{H}} \mathrm{Risk}(h)$, where $\mathrm{Risk}(h) = \int p(x, y)\,\mathbf{1}_{\{h(x) \neq y\}}\,dx\,dy$   (5.14)

We can sometimes bound the probability (e.g., PAC-style bounds)

$\Pr\big(\mathrm{Risk}(\hat{h}) - \mathrm{Risk}(h^*) > \epsilon\big) < \delta(n, \mathrm{VC}, \epsilon)$   (5.15)

where $\delta$ is a function of the sample size $n$, of $\epsilon$, and of the complexity (ability, flexibility, expressiveness) of the family of models $\mathcal{H}$ over which we're learning. The bound gets worse with increasing complexity of $\mathcal{H}$ or decreasing $n$. There is extensive mathematical theory behind this (e.g., Vapnik-Chervonenkis (VC) dimension, or Rademacher complexity).
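For concreteness (toy arrays, not from the slides), the empirical 0/1 risk in (5.13) is just the fraction of training mistakes:

    import numpy as np

    y_true = np.array([0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 0])           # h(x^(i)) for each training point (toy values)
    empirical_risk = np.mean(y_pred != y_true)   # (1/n) sum of 1{h(x^(i)) != y^(i)}
    print(empirical_risk)                        # 0.4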

Page 9

Generative vs. Discriminative models

Generative models, given data drawn from $p(x, y)$, model the entire distribution, often factored as $p(x|y)p(y)$, where $y$ is the class label (or prediction) random variable and $x$ is the feature vector random variable.

Discriminative model: when the goal is classification/regression, it can sometimes be simpler to just model $p(y|x)$, since that is sufficient to achieve the Bayes error (i.e., $\operatorname{argmax}_y p(y|x)$ for classification) or $\mathbb{E}[Y|x]$ for regression. (E.g., support vector machines, max-margin learning, conditional likelihood, etc.)

Sometimes they are the same. A Naive Bayes class-conditional Gaussian generative model $p(x|y)$ leads to logistic regression.


Page 12

Accuracy Objectives

Decreasing order of complexity of the task being learnt, while still solving a classification decision problem. From strongest to weakest goals.

Generative modeling: model $p(x, y)$ and compute $\operatorname{argmax}_y p(x, y)$ for a given $x$. Maximum likelihood principle.

Discriminative modeling: model only $p(y|x)$ and compute $\operatorname{argmax}_y p(y|x)$. Maximum conditional likelihood principle.

Rank modeling: a rank function model $f(y|x)$ so that $f(\sigma_1|x) \geq f(\sigma_2|x) \geq \cdots \geq f(\sigma_\ell|x)$ whenever $p(\sigma_1|x) \geq p(\sigma_2|x) \geq \cdots \geq p(\sigma_\ell|x)$.

Best decision modeling: decision functions with $h_y(x) \geq h_{y'}(x)$ whenever $p(y|x) \geq p(y'|x)$ for all $y' \neq y$. Discriminant functions.

The weaker model often lives within the stronger model (e.g., logistic or softmax uses linear discriminant functions), as we saw, thereby giving it improved interpretation (e.g., probabilities rather than just unnormalized scores).

Page 13

Naive Bayes classifier

Recall, $x \in \mathbb{R}^m$, so $m$ feature values in $x = (x_1, x_2, \ldots, x_m)$ are used to predict $y$ in the joint model $p(x, y)$.

Generative conditional distribution $p(x, y) = p(x|y)p(y)$. Conditioned on $y$, $p(x|y)$ can be an arbitrary distribution.

Naive Bayes classifiers make a simplifying (factorization) assumption:

$p(x|y) = \prod_{j=1}^{m} p(x_j|y)$   (5.21)

or that $X_i \perp\!\!\!\perp X_j \mid Y$, read "$X_i$ and $X_j$ are conditionally independent given $Y$", for all $i \neq j$. This does not mean that $X_i \perp\!\!\!\perp X_j$.

Estimating many 1-dimensional distributions $p(x_j|y)$ is much easier than estimating the full joint $p(x|y)$, due to the curse of dimensionality.

Naive Bayes classifiers make a greatly simplifying assumption but often work quite well.

Page 14

Posterior From Binary Naive Bayes Gaussian

Binary case $y \in \{0, 1\}$, Naive Bayes Gaussian classifier posterior probability:

$p(y=1|x) = \dfrac{p(x|y=1)\,p(y=1)}{p(x|y=1)\,p(y=1) + p(x|y=0)\,p(y=0)}$   (5.23)

$= \dfrac{1}{1 + \frac{p(x|y=0)\,p(y=0)}{p(x|y=1)\,p(y=1)}}$   (5.24)

$= \dfrac{1}{1 + \exp\!\Big(-\log \frac{p(x|y=1)\,p(y=1)}{p(x|y=0)\,p(y=0)}\Big)}$   (5.25)

$= g\Big(\sum_{i=1}^{m} \log \frac{p(x_i|y=1)}{p(x_i|y=0)} + \log \frac{p(y=1)}{p(y=0)}\Big)$   (5.26)

where $g(z) = 1/(1 + e^{-z})$ is the logistic function, and $\log p(y=1)/p(y=0)$ is the log odds-ratio.
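A hedged numeric sketch of (5.26) (all numbers and per-feature Gaussian parameters below are made up, and SciPy's normal density is one convenient choice, not a course requirement): with equal class variances the summed log-ratios are linear in x, so the posterior is a logistic function of a linear score.

    import numpy as np
    from scipy.stats import norm

    # Made-up per-feature Gaussian class conditionals p(x_i | y): means plus a shared std per feature.
    mu0 = np.array([0.0, -1.0, 0.5])
    mu1 = np.array([1.0,  0.5, 0.0])
    sigma = np.array([1.0, 2.0, 0.5])          # equal across classes -> the log-ratio is linear in x
    prior1, prior0 = 0.4, 0.6

    def posterior_y1(x):
        # Eq. (5.26): g( sum_i log p(x_i|y=1)/p(x_i|y=0) + log p(y=1)/p(y=0) )
        log_ratio = norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)
        z = log_ratio.sum() + np.log(prior1 / prior0)
        return 1.0 / (1.0 + np.exp(-z))        # logistic function g(z)

    print(posterior_y1(np.array([0.2, -0.3, 0.4])))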

Page 15

Binary ($\ell = 2$) Naive Bayes Gaussian

Since $p(x_i|y)$ is univariate Gaussian, $\log \frac{p(x_i|y=1)}{p(x_i|y=0)} + \log \frac{p(y=1)}{p(y=0)}$ takes one of the two following forms:

1. $b_i x_i + c_i$ if $\sigma_{1,i} = \sigma_{2,i}$ for all $i$ (equal class variances). Hence, we have a linear model $p(y=1|x) = g(\theta^\top x + \text{const})$, like we saw in logistic regression!

2. $a_i x_i^2 + b_i x_i + c_i$ otherwise, so we have a quadratic model followed by the logistic function: $p(y=1|x) = g\big(\sum_{i=1}^{m} a_i x_i^2 + b_i x_i + c_i\big)$. Unequal class variances introduce quadratic features. Alternatively, we could have just started with features like $x_i^2$, increasing $m$, without need for unequal class variances. Still no cross-interaction terms of the form $x_i x_j$, due to the Naive Bayes assumption! But we can introduce those $x_i x_j$ cross features as well, further increasing $m$: feature augmentation.

3. Non-Naive-Bayes with fewer features is identical to doing Naive Bayes with more features. This helps to explain why Naive Bayes often works so well, since one can remove unfounded assumptions simply by adding features.

Page 16

Binary Naive Bayes Gaussian

Bottom line: binary Naive Bayes Gaussian (equivalently, two-class Gaussian generative model classifiers) is equivalent to logistic regression when we form the posterior distribution $p(y|x)$ for classification. We can eliminate the Naive Bayes property in the original feature space by adding more features (feature augmentation).

In general, feature engineering (producing and choosing the right features) is an important part of any machine learning system.

Representation learning: the art and science of using machine learning to learn the features themselves, often using a deep neural network. Self-supervised learning can facilitate this (see the final lecture).

Disentanglement: the art and science of taking a feature representation and producing new deduced features where separate phenomena in the signal have separate representation in the deduced features. A modern form of ICA (independent component analysis).

Page 17

Norms and p-norms

For any real-valued $p \geq 1$ and vector $x \in \mathbb{R}^m$, we have a norm:

$\|x\|_p \triangleq \Big(\sum_{j=1}^{m} |x_j|^p\Big)^{1/p}$   (5.24)

Properties of any "norm" $\|x\|$ for $x \in \mathbb{R}^m$ are:
1. $\|x\| \geq 0$ for any $x \in \mathbb{R}^m$.
2. $\|x\| = 0$ iff $x = 0$.
3. Absolutely homogeneous: $\|\alpha x\| = |\alpha|\,\|x\|$ for any $\alpha \in \mathbb{R}$.
4. Triangle inequality: for any $x, y \in \mathbb{R}^m$, we have $\|x + y\| \leq \|x\| + \|y\|$.

Euclidean norm: the 2-norm (where $p = 2$) is the standard Euclidean distance (from zero) that we use in everyday life.

The $p$-norm is also defined and is a valid norm for $p \to \infty$, in which case

$\|x\|_\infty = \max_{j=1}^{m} |x_j|$   (5.25)

There are other norms besides $p$-norms.

(Handwritten note: the $\ell_0$ "counting norm" counts the number of non-zero elements in the vector $x \in \mathbb{R}^m$.)
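A quick sketch of these norms with NumPy (illustrative values only), including the ordering noted on a later slide ($\|x\|_r \geq \|x\|_p$ for $r < p$) and the $p \to \infty$ limit:

    import numpy as np

    x = np.array([3.0, -4.0, 0.0, 1.0])
    for p in (1, 2, 5, np.inf):
        print(p, np.linalg.norm(x, ord=p))           # 1-norm >= 2-norm >= 5-norm >= inf-norm
    print("count of non-zeros:", np.count_nonzero(x))  # the l0 "counting norm" from the note above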

Page 18

The Lasso and Constrained Optimization

Lasso adds a penalty to the learnt weights:

$\hat{\theta} \in \operatorname{argmin}_\theta\ \frac{1}{2}\sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_1$   (5.25)

where $\|\cdot\|_1$ is known as the 1-norm, defined as $\|\theta\|_1 = \sum_{j=1}^{m} |\theta_j|$.

Equivalent formulation:

minimize over $\theta$:  $\sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2$   (5.26)
subject to:  $\|\theta\|_1 \leq t$   (5.27)

There is a one-to-one correspondence between $\lambda$ and $t$ yielding an identical solution (see Boyd and Vandenberghe's book on Convex Optimization).

No effect if $t > \sum_{i=1}^{m} |\hat{\theta}_i|$ where $\hat{\theta}$ is the least squares estimate.

Page 19

Sparsity

Consider the linear model $h_\theta(x) = \langle\theta, x\rangle$, where $x \in \mathbb{R}^m$.

$m$ might be very big since we don't know what features to use.

Useful: allow the learning system to throw away needless features; "throw away" means the resulting learnt model has $\theta_j = 0$. Helps with:

reduce computation, memory, and resources (e.g., sensor requirements)
improve prediction (reduce noise variables); trade off a bit of bias increase in order to reduce variance (to improve overall accuracy)
helps ease interpretation (what from $x$ is critical to predict $y$); remove dirty variables (data hygiene)
reduce redundant cancellation, i.e., if typically $\theta_i x_i \approx -\theta_j x_j$, there is no effect on prediction even if both $\theta_i$ and $\theta_j$ are not zero

The effective new model has $m' \ll m$ features.

Page 20

Feature selection in ML

Feature Selection: a branch of feature engineering.

Many applications start with massive feature sets, more than is necessary.

Example: spam detection, natural language processing (NLP) and n-gram features. Cancer detection in computational biology, k-mer motifs, gene expression via DNA microarrays, mRNA abundance, biopsies.

Important in particular when $m > n$, but potentially useful for any $n, m$ when features are redundant/superfluous, irrelevant, or expensive.

Overall goal: we want the least costly but still informative feature set.

Sparsity (via shrinkage methods) is one way to do this, but there are other discrete optimization methods too (e.g., the greedy procedure).

Many are implemented in python's sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
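As a small illustration of the sklearn module linked above (a sketch, not from the slides; the synthetic data is made up), univariate selection keeps the k highest-scoring features:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic data: 200 examples, 50 features, only a few of them informative.
    X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

    selector = SelectKBest(score_func=f_classif, k=5)   # keep the 5 best features by an F-test score
    X_small = selector.fit_transform(X, y)
    print(X_small.shape)                                # (200, 5)
    print(selector.get_support(indices=True))           # indices of the selected features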

Page 21

Greedy Forward Selection Heuristic

Let $x \in \mathbb{R}^m$. Let the features be indexed by the integer set $U = \{1, 2, \ldots, m\}$, and for $A \subseteq U$ let $x_A = (x_{a_1}, x_{a_2}, \ldots, x_{a_k})$ be the features indexed by subset $A$, where $|A| = k$. Similarly $\theta_A$ is so defined. Let $h_{\theta_A}(x_A)$ be the best learnt model using only features $A$, giving parameters $\theta_A$ and $\text{accuracy}(A) \approx 1 - \Pr(h_{\theta_A}(X_A) \neq Y)$.

To find the best set ($\max_{A \subseteq U : |A| = k} \text{accuracy}(A)$), we need to consider $\binom{m}{k} = O((me/k)^k)$ subsets by Stirling's formula, exponential in $m$.

Greedy forward selection to select $k$ features, achieving sparsity, costs $O(mk)$ model fits (a code sketch follows this slide):

Algorithm 1: Greedy Forward Selection
1  $A \leftarrow \emptyset$
2  for $i = 1, \ldots, k$ do
3      Choose $v' \in \operatorname{argmax}_{v \in U \setminus A} \text{accuracy}(A \cup \{v\})$
4      $A \leftarrow A \cup \{v'\}$

Better than $\binom{m}{k}$, but we still need to retrain the model each time.

One solution: model accuracy (e.g., submodularity, later). Another solution: Lasso regression (next).
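A minimal sketch of the greedy loop above (illustrative only; using a cross-validated logistic-regression score from sklearn as the accuracy(A) oracle is one possible choice, not the slides' prescription):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=1)

    def accuracy(feature_subset):
        # accuracy(A): cross-validated score of a model trained only on the features in A
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[:, sorted(feature_subset)], y, cv=3).mean()

    k, U, A = 5, set(range(X.shape[1])), set()
    for _ in range(k):
        best_v = max(U - A, key=lambda v: accuracy(A | {v}))   # line 3 of Algorithm 1
        A.add(best_v)                                          # line 4 of Algorithm 1
    print(sorted(A))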

Page 22

Forward vs. Backwards Selection

Rather than start from the empty set and add one at a time, we can:

Start with all features and greedily remove the feature that hurts us the least: backwards selection.

This is also only a heuristic (might not work well).

Backwards selection: variance is an issue when using many features.

Bidirectional greedy: pick a random order; consider both adding (starting at the empty set) and removing (starting at all features), based on the order, and randomly choose based on a distribution proportional to both:

1. the accuracy gain $\text{accuracy}(A \cup \{v\}) - \text{accuracy}(A)$, and
2. the accuracy loss $\text{accuracy}(S) - \text{accuracy}(S \setminus \{v\})$.

Page 23

Back To Lasso

Ridge regression adds a sum-of-squares penalty to the learnt weights:

$\hat{\theta} \in \operatorname{argmin}_\theta J_{\text{ridge}}(\theta) = \sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_2^2$   (5.1)

where $\|\cdot\|_2$ is known as the 2-norm, defined as $\|\theta\|_2 = \sqrt{\sum_{j=1}^{m} \theta_j^2}$.

Lasso linear regression:

$\hat{\theta} \in \operatorname{argmin}_\theta J_{\text{lasso}}(\theta) = \sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_1$   (5.2)

where $\|\cdot\|_1$ is known as the 1-norm, defined as $\|\theta\|_1 = \sum_{j=1}^{m} |\theta_j|$.
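For contrast with the lasso discussion that follows, a sketch (made-up data, no intercept, for illustration only) of the ridge closed form $\hat{\theta} = (X^\top X + \lambda I)^{-1} X^\top y$, which minimizes (5.1); the lasso objective (5.2) has no such closed form:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, lam = 50, 5, 0.1
    X = rng.normal(size=(n, m))
    theta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=n)

    # Ridge: closed-form minimizer of sum_i (y_i - theta^T x_i)^2 + lam * ||theta||_2^2
    theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
    print(theta_ridge)   # shrunk toward zero, but typically with no exact zeros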


Page 25

p-norm unit ball contours (in 2D)

[Figure: unit-ball contours $\{x \in \mathbb{R}^m : \|x\|_2 = 1\}$, $\{x \in \mathbb{R}^m : \|x\|_1 = 1\}$, and $\{x \in \mathbb{R}^m : \|x\|_\infty = 1\}$ in the $(x_1, x_2)$ plane.]

We have (consistent with the picture) that if $0 < r < p$ then $\|x\|_r \geq \|x\|_p$, thus:

$\{x \in \mathbb{R}^m : \|x\|_1 \leq 1\} \subset \{x \in \mathbb{R}^m : \|x\|_2 \leq 1\} \subset \{x \in \mathbb{R}^m : \|x\|_\infty \leq 1\}$

Note, $\|x\|_\infty = \max_i |x_i|$ is known as the infinity-norm (or max-norm).


Page 27

More p-norm unit ball contours (in 2D)

Right: various $p$-norm balls for $p \geq 1$, and the analogous sets for $p < 1$. For $p = 0$, we get the L0 ball, which counts the number of non-zero entries and is the most sparse-encouraging. The L1 ball is the convex envelope of the L0 ball, so it is arguably the best combination of easy & appropriate.

[Figure: "Various norm balls" in the $(x_1, x_2)$ plane: $\{x : \|x\|_p = 1\}$ for $p = 30, 5, 2, 3/2, 1, 1/2, 1/4$.]

Page 28

Linear Regression, Ridge, Lasso

[Figure (from Hastie et al.): contours of equal squared error $\{\theta : \sum_{i=1}^{n}(y^{(i)} - \theta^\top x^{(i)})^2 = c\}$ in the $(\theta_1, \theta_2)$ plane around the least-squares solution $\hat{\theta}$, together with the ridge constraint region $\{\theta \in \mathbb{R}^m : \|\theta\|_2 \leq t\}$ and the lasso constraint region $\{\theta \in \mathbb{R}^m : \|\theta\|_1 \leq t\}$.]

$\hat{\theta}$ is the linear least squares solution, with quadratic contours of equal error. For a given $\lambda$, the solution trades off regularizer cost with error cost. From Hastie et al.

Page 29

Other norms as regularizers

Using $p$-norms with $p < 1$ encourages sparsity even further (but is harder to optimize).

[Figure: squared-error contours $\{\theta : \sum_{i=1}^{n}(y^{(i)} - \theta^\top x^{(i)})^2 = c\}$ in the $(\theta_1, \theta_2)$ plane, together with the norm balls $\{\theta : \|\theta\|_p = 1\}$ for $p = 30, 5, 2, 3/2, 1, 1/2, 1/4$.]

Page 30

Lasso: How to solve

Closed-form solution for ridge; no closed form for lasso.

We used gradient descent (or SGD) for ridge. The lasso objective is not differentiable, since $|\theta|$ is not differentiable at $\theta = 0$:

$\dfrac{\partial}{\partial\theta}|\theta| = \begin{cases} 1 & \text{if } \theta > 0 \\ -1 & \text{if } \theta < 0 \\ \text{undefined} & \text{if } \theta = 0 \end{cases}$   (5.3)

How to solve? Many possible ways.

Simple and effective: coordinate descent, where we find a subgradient of the objective $J_{\text{lasso}}(\theta)$ w.r.t. one element of $\theta$ at a time and solve it.

Use subgradient descent (see https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf for full details).

Results in a simple solution: soft thresholding.


Page 36

Coordinate Descent

Isolate one coordinate $k$ and fix the remainder $j \neq k$:

$\sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_1 = \sum_{i=1}^{n}\Big(y^{(i)} - \sum_{j=1}^{m}\theta_j x_j^{(i)}\Big)^2 + \lambda\|\theta\|_1$   (5.4)

$= \sum_{i=1}^{n}\Big(y^{(i)} - \theta_k x_k^{(i)} - \sum_{j \neq k}\theta_j x_j^{(i)}\Big)^2 + \lambda\sum_{j \neq k}|\theta_j| + \lambda|\theta_k|$   (5.5)

$= \sum_{i=1}^{n}\big(r_k^{(i)} - \theta_k x_k^{(i)}\big)^2 + \lambda\sum_{j \neq k}|\theta_j| + \lambda|\theta_k|$   (5.6)

$= J_{\lambda,\text{lasso}}(\theta_k;\ \theta_1, \theta_2, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_m)$   (5.7)

where $r_k^{(i)} \triangleq y^{(i)} - \sum_{j \neq k}\theta_j x_j^{(i)}$ is the residual with coordinate $k$ held out.

We then optimize only $\theta_k$, holding the remainder fixed, and then choose a different $k'$ for the next round.


Page 38

Coordinate Descent

General algorithm, one round of coordinate descent to be repeated.

Algorithm 2: Coordinate Descent for Lasso Solution
1  $\theta \leftarrow 0$
2  for $k = 1, \ldots, m$ do
3      $r_k^{(i)} \leftarrow y^{(i)} - \sum_{j \neq k} x_j^{(i)}\theta_j$
4      $\theta_k \leftarrow \operatorname{argmin}_{\theta_k}\ \sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2 + \lambda|\theta_k|$

We repeat the above several times until convergence.

This doesn't help with the lack of differentiability, though. How do we perform the minimization?

Page 39

Convexity, Subgradients, and Subdifferential

Convexity: a general function $f : \mathbb{R}^m \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^m$ and $\lambda \in [0, 1]$:

$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$   (5.8)

Convex functions have no local minima other than the global minima.

A vector $g \in \mathbb{R}^m$ is a subgradient of $f$ at $x$ if for all $z$ we have $f(z) \geq f(x) + g^\top(z - x)$. Valid both when $f$ is differentiable (e.g., at $x_1$) and when $f$ is not differentiable (e.g., at $x_2$).

[Figure: a convex $f(z)$ with a supporting line $f(x_1) + g_1^\top(z - x_1)$ at a differentiable point $x_1$, and two supporting lines $f(x_2) + g_2^\top(z - x_2)$ and $f(x_2) + g_3^\top(z - x_2)$ at a non-differentiable (kink) point $x_2$.]

The set of all subgradients of $f$ at $x$ is known as the subdifferential at $x$:

$\partial f(x) = \{g \in \mathbb{R}^m : \forall z,\ f(z) \geq f(x) + g^\top(z - x)\}$   (5.9)

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.


Page 45

Convexity, Subgradients, and Lasso

Fact: a convex function is minimized at $x^*$ if the vector 0 is a subgradient at $x^*$. In other words,

$0 \in \partial f(x^*) \iff f(x^*) \leq f(x)\ \ \forall x$   (5.10)

That is, a minimum means zero is a member of the subdifferential.

When $f$ is everywhere differentiable, this is the same as finding $x^*$ that satisfies $\nabla_x f(x^*) = 0$.

Fact: the lasso coordinate objective $\sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2 + \lambda|\theta_k|$ is convex in the parameter $\theta_k$.

Hence, we find a value of $\theta_k$ so that zero is a subgradient (i.e., is a member of the subdifferential).

With the L1 objective, things are thus almost as easy as with L2.
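As a scalar illustration of this optimality condition (an added worked example, not from the slides), take $f(\theta) = \tfrac{1}{2}(\theta - c)^2 + \lambda|\theta|$ with $\lambda > 0$. Its subdifferential and minimizer are

\[
\partial f(\theta) =
\begin{cases}
\{\theta - c + \lambda\} & \theta > 0,\\
[-c - \lambda,\ -c + \lambda] & \theta = 0,\\
\{\theta - c - \lambda\} & \theta < 0,
\end{cases}
\qquad
0 \in \partial f(\theta^*)\ \text{at}\ \ \theta^* = \operatorname{sign}(c)\,\max(0,\ |c| - \lambda).
\]

If $|c| \leq \lambda$, the interval at $\theta = 0$ contains zero, so $\theta^* = 0$; otherwise the smooth branch gives $\theta^* = c - \lambda\,\operatorname{sign}(c)$. This is the same soft-thresholding pattern derived for the lasso coordinate update on the following slides.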


Page 51

Lasso's Subdifferential

Define $a_k = 2\sum_{i=1}^{n}\big(x_k^{(i)}\big)^2$ and $c_k = 2\sum_{i=1}^{n} r_k^{(i)} x_k^{(i)}$. Note $a_k > 0$.

Find $\theta_k$ making the following contain zero:

$\partial f = \partial\Big(\sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2 + \lambda|\theta_k|\Big) = \begin{cases} \{a_k\theta_k - c_k - \lambda\} & \text{if } \theta_k < 0 \\ [-c_k - \lambda,\ -c_k + \lambda] & \text{if } \theta_k = 0 \\ \{a_k\theta_k - c_k + \lambda\} & \text{if } \theta_k > 0 \end{cases}$

[Figure: $\partial f$ as a function of $\theta_k$, a line of slope $a_k$ with a jump of height $2\lambda$ at $\theta_k = 0$, drawn for the three cases $c_k < -\lambda$, $|c_k| \leq \lambda$, and $c_k > \lambda$.]

The value of $\theta_k$ such that $0 \in \partial f$ depends on $c_k$; the three cases $c_k < -\lambda$, $|c_k| \leq \lambda$, and $c_k > \lambda$ are shown on the right.

If $c_k < -\lambda$ or $c_k > \lambda$, $f$ is differentiable at the point where $0 \in \partial f = \{\nabla f\}$, giving an analytic solution for the minimum (where the pink/orange line crosses zero, the x-axis).

If $|c_k| \leq \lambda$, then $0 \in [-c_k - \lambda,\ -c_k + \lambda]$, hence $\theta_k = 0$ is the minimum.
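For completeness, here is a short added derivation (not spelled out on the slide) of where $a_k$ and $c_k$ come from. Differentiating the smooth part of the coordinate objective with respect to $\theta_k$,

\[
\frac{d}{d\theta_k}\sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2
= -2\sum_{i=1}^{n} x_k^{(i)}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)
= \Big(2\sum_{i=1}^{n}\big(x_k^{(i)}\big)^2\Big)\theta_k - 2\sum_{i=1}^{n} r_k^{(i)} x_k^{(i)}
= a_k\theta_k - c_k ,
\]

while the subgradient of $\lambda|\theta_k|$ contributes $+\lambda$ for $\theta_k > 0$, $-\lambda$ for $\theta_k < 0$, and the interval $[-\lambda, \lambda]$ at $\theta_k = 0$; adding the two pieces gives exactly the three cases above.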


Page 56: Advanced Introduction to Machine Learning · 2020. 9. 27. · for regression. (e.g., support vector machines, max-margin learning, conditional likelihood, etc.) Sometimes they are

Lasso Regularizers Curse of Dimensionality Dimensionality Reduction

Soft Thresholding

This gives us the following solution (note again $a_k > 0$):
$$
\theta_k =
\begin{cases}
(c_k + \lambda)/a_k & \text{if } c_k < -\lambda \ \ (\text{making } \theta_k < 0) \\
0 & \text{if } |c_k| \le \lambda \ \ (\text{making } \theta_k = 0) \\
(c_k - \lambda)/a_k & \text{if } c_k > \lambda \ \ (\text{making } \theta_k > 0)
\end{cases}
\tag{5.11}
$$
$$
= \operatorname{sign}(c_k)\max(0, |c_k| - \lambda)/a_k \tag{5.12}
$$

Note that without any Lasso term, the solution for $\theta_k$ would be $\theta_k = c_k/a_k = \operatorname{sign}(c_k)|c_k|/a_k = \operatorname{sign}(c_k)\max(0, |c_k|)/a_k$.

The above has a nice intuitive interpretation: we solve as normal (using gradients, $c_k/a_k$, and $\lambda$) as long as $|c_k|$ is not too small.

If $|c_k|$ does get too small (the middle case), then we simply truncate $\theta_k$ to zero, and when we truncate depends on $\lambda$ and $c_k$.
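For concreteness, here is a minimal coordinate-descent sketch (ours, not the course's reference implementation) that applies the soft-threshold update of Eq. (5.12). The names `soft_threshold` and `lasso_coordinate_descent` are ours, and the partial residual is formed by removing feature $k$'s current contribution, as assumed in the derivation above.

```python
import numpy as np

def soft_threshold(c, lam, a):
    # theta_k = sign(c_k) * max(0, |c_k| - lambda) / a_k   (Eq. 5.12)
    return np.sign(c) * max(0.0, abs(c) - lam) / a

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize sum_i (y_i - theta^T x_i)^2 + lam * ||theta||_1 by cycling
    over coordinates, each one solved exactly by soft thresholding."""
    n, m = X.shape
    theta = np.zeros(m)
    for _ in range(n_iters):
        for k in range(m):
            # partial residual with feature k's current contribution removed
            r_k = y - X @ theta + X[:, k] * theta[k]
            a_k = 2.0 * np.sum(X[:, k] ** 2)
            c_k = 2.0 * np.sum(r_k * X[:, k])
            theta[k] = soft_threshold(c_k, lam, a_k)
    return theta
```

Each inner step solves the one-dimensional subproblem exactly, which is why simply cycling over coordinates works well for this convex objective.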


Simple solution: X has orthonormal columns

Let $\theta$ be the linear least squares solution. When the design matrix $X$ has orthonormal columns (a very special case), then the solutions are simple modifications of $\theta$.

Ridge solution becomes $\theta_j/(1 + \lambda)$: shrinkage.

Lasso solution becomes $\operatorname{sign}(\theta_j)\max(0, |\theta_j| - \lambda)$: soft thresholding.

Best subset of size $k$: $\theta_{\rho_j}\mathbf{1}\{j \le k\}$, where $\rho = (\rho_1, \rho_2, \ldots, \rho_m)$ is a permutation of $\{1, 2, \ldots, m\}$ such that $|\theta_{\rho_1}| \ge |\theta_{\rho_2}| \ge \cdots \ge |\theta_{\rho_m}|$: hard thresholding.
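A small numpy sketch (ours) of the three rules applied to a least squares solution `theta` under the orthonormal-columns assumption:

```python
import numpy as np

def ridge_orth(theta, lam):
    # shrinkage: theta_j / (1 + lambda)
    return theta / (1.0 + lam)

def lasso_orth(theta, lam):
    # soft thresholding: sign(theta_j) * max(0, |theta_j| - lambda)
    return np.sign(theta) * np.maximum(0.0, np.abs(theta) - lam)

def best_subset_orth(theta, k):
    # hard thresholding: keep the k coefficients largest in magnitude
    keep = np.argsort(-np.abs(theta))[:k]
    out = np.zeros_like(theta)
    out[keep] = theta[keep]
    return out

theta = np.array([3.0, -0.4, 1.5, 0.2, -2.0])
print(ridge_orth(theta, 1.0))       # every coefficient shrunk toward 0
print(lasso_orth(theta, 1.0))       # small coefficients set exactly to 0
print(best_subset_orth(theta, 2))   # only the two largest survive
```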


Shrinkage relative to linear regression

(Figure: three panels, Best Subset, Ridge, and Lasso, each plotting the modified coefficient against the least squares coefficient; all pass through $(0,0)$, and $|\theta_{\rho_k}|$ marks the hard-threshold cutoff on the Best Subset panel.)


Other regularizers

$$
\theta \in \operatorname*{argmin}_{\theta} \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda R(\theta) \tag{5.13}
$$

We've seen ridge $R(\theta) = \|\theta\|_2^2$ and lasso $R(\theta) = \|\theta\|_1$.

$p$-"norm" regularizers $R(\theta) = \|\theta\|_p^p$ for $0 < p < 1$; not a norm.

Mixed norms, elastic-net: $R(\theta) = \alpha\|\theta\|_1 + (1 - \alpha)\|\theta\|_2$ for $0 \le \alpha \le 1$.

Grouped Lasso:
$$
\theta \in \operatorname*{argmin}_{\theta} \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda \sum_{j=1}^g \big\|\theta_{A_j}\big\|_2 \tag{5.14}
$$
where $A_1, A_2, \ldots, A_g$ is a set of groups of feature indices, and the regularizer is a sum of (not-squared) 2-norms over the groups.

Total Variation: $R(\theta) = \sum_{j=1}^{m-1} |\theta_j - \theta_{j+1}|$.

Structured regularizers: the contours can be other convex polyhedra (besides just the octahedron in 3D).
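The penalties themselves are one-liners. Here is a short sketch (ours, with a hypothetical grouping $A_1, A_2, A_3$) evaluating each $R(\theta)$ listed above for a given coefficient vector:

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.0, 0.5, 3.0, -0.1])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # hypothetical A_1, A_2, A_3

ridge       = np.sum(theta ** 2)                              # ||theta||_2^2
lasso       = np.sum(np.abs(theta))                           # ||theta||_1
p_norm      = np.sum(np.abs(theta) ** 0.5)                    # ||theta||_p^p with p = 1/2
alpha       = 0.7
elastic_net = alpha * lasso + (1 - alpha) * np.linalg.norm(theta)  # alpha*||.||_1 + (1-alpha)*||.||_2
group_lasso = sum(np.linalg.norm(theta[A]) for A in groups)   # sum of (not-squared) group 2-norms
total_var   = np.sum(np.abs(np.diff(theta)))                  # sum_j |theta_j - theta_{j+1}|

print(ridge, lasso, p_norm, elastic_net, group_lasso, total_var)
```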


As the dimensions grow

$d$th order linear model, with $x \in \mathbb{R}^m$:
$$
f_\theta(x) = \theta_0 + \sum_{i=1}^m \theta_i x_i + \sum_{i=1}^m\sum_{j=1}^m \theta_{i,j}\, x_i x_j + \sum_{i=1}^m\sum_{j=1}^m\sum_{k=1}^m \theta_{i,j,k}\, x_i x_j x_k + \cdots + \sum_{i_1=1}^m \sum_{i_2=1}^m \cdots \sum_{i_d=1}^m \theta_{i_1,i_2,\ldots,i_d} \prod_{\ell=1}^d x_{i_\ell} \tag{5.15}
$$

The number of parameters to learn grows as $O(m^d)$, exponentially in the order $d$.
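A quick count (ours) of how fast this blows up, taking one parameter per index tuple as written in Eq. (5.15):

```python
# Number of parameters in the d-th order model: 1 + m + m^2 + ... + m^d.
m = 20
for d in range(1, 5):
    n_params = sum(m ** j for j in range(d + 1))
    print(d, n_params)
# d=1: 21, d=2: 421, d=3: 8421, d=4: 168421 -- O(m^d) growth.
```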


Geometric Tessellation. Inner and Surface Volume

Tessellation regions grow exponentially with dimension.

(Figure: example grids with r = 5, m = 1; r = 4, m = 2; and r = 3, m = 3.)

With $r$ distinct values (edge length $r$), we have $r^m$ combinations in $m$ dimensions.

Surface volume is $2m r^{m-1}$ while inner volume is $r^m$. The ratio of surface area to volume is $2m r^{m-1}/r^m = 2m/r$.

A lot more inner substance, relative to surface, as $r$ grows: $2m/r \to 0$ as $r \to \infty$.

A lot more surface, relative to inner volume, as $m$ grows: $2m/r \to \infty$ as $m \to \infty$.


Data in high dimensions

From Hastie et al., 2009.

Suppose the data $D = \{x^{(i)}\}_{i=1}^n$ are distributed uniformly at random within the $m$-dimensional unit box, so $x^{(i)} \in [0,1]^m$ (thus $0 \le x_j^{(i)} \le 1$).

If we take an arbitrary data point and a fraction $0 < r < 1$ of the closest data points, this carves out an $m$-dimensional volume with expected edge length $r^{1/m}$, so $e_m(r) = r^{1/m}$.

One percent of the data in ten dimensions has expected edge length $e_{10}(0.01) = 0.63$. Ten percent of the data: $e_{10}(0.1) = 0.8$. 0.01 percent of the data in 100 dimensions: $e_{100}(0.0001) = 0.91$.

Hence, since the maximum edge length is 1, a small fraction $r$ of the data takes up most of the edge length range.

In fact, $\lim_{m\to\infty} e_m(\epsilon) = 1$ for any $\epsilon > 0$.
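Verifying the quoted numbers (a one-liner sketch, ours):

```python
def edge_length(r, m):
    # expected edge length of the sub-box holding a fraction r of the data
    return r ** (1.0 / m)

print(edge_length(0.01, 10))     # ~0.631
print(edge_length(0.10, 10))     # ~0.794
print(edge_length(0.0001, 100))  # ~0.912
```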


All peel and no core . . .

Typical McIntosh apple (meaning the fruit, not the phone): 2.75 inches in diameter (radius $r = 3.5$ cm), with a thin 0.3 mm ($\epsilon = 0.03$ cm) peel. Sphere volume is $\frac{4}{3}\pi r^3$, so the apple is about 180 cm$^3$, while the peel volume is $\frac{4}{3}\pi r^3 - \frac{4}{3}\pi(r - \epsilon)^3 \approx 4.5$ cm$^3$, a relative volume of about $4.5/180 = 2.5\%$.

Volume of a sphere in $m$ dimensions is $V_m(r) = V_m r^m$ for an appropriate $m$-dependent constant $V_m$ (on HW3).

Fraction of a radius-1 apple that is peel (with peel thickness $\epsilon$) is
$$
\frac{V_m(1) - V_m(1 - \epsilon)}{V_m(1)} = 1 - (1 - \epsilon)^m \tag{5.16}
$$

As the dimensionality $m$ gets larger, the apple becomes all peel and no core.
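A quick check (ours) of the 3-D numbers and of Eq. (5.16) as $m$ grows, using the apple's relative peel thickness $\epsilon = 0.03/3.5$:

```python
import math

r, eps = 3.5, 0.03
apple = 4.0 / 3.0 * math.pi * r ** 3
peel = apple - 4.0 / 3.0 * math.pi * (r - eps) ** 3
print(apple, peel, peel / apple)           # ~179.6 cm^3, ~4.6 cm^3, ~2.5%

rel_eps = eps / r                          # peel thickness for a radius-1 apple
for m in [3, 10, 100, 1000]:
    print(m, 1.0 - (1.0 - rel_eps) ** m)   # peel fraction -> 1 as m grows
```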


Uniformly at random unit vectors on unit sphere

Let $S^{m-1}(r)$ indicate the surface "area" of the $m$-dimensional sphere of radius $r$.

Define the uniform distribution $U_{S^{m-1}(1)}$ on $S^{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S^{m-1}(1)}$; hence $\|x\|_2 = \|y\|_2 = 1$.

Two vectors are orthogonal if $\langle x, y\rangle = 0$ and are nearly so if $|\langle x, y\rangle| < \epsilon$ for small $\epsilon$.

It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then
$$
\Pr(|\langle x, y\rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \tag{5.17}
$$
which means that uniformly at random vectors in high dimensional space are almost always nearly orthogonal!

This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.
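A small Monte Carlo sketch (ours; it samples uniform unit vectors by normalizing Gaussians) comparing the empirical frequency of near-orthogonality with the lower bound in Eq. (5.17):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(m, n):
    # normalizing i.i.d. Gaussians gives the uniform distribution on the sphere
    v = rng.normal(size=(n, m))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

m, eps, trials = 1000, 0.1, 5000
x = random_unit_vectors(m, trials)
y = random_unit_vectors(m, trials)
inner = np.sum(x * y, axis=1)
empirical = np.mean(np.abs(inner) < eps)
bound = 1.0 - np.exp(-m * eps ** 2 / 2.0)
print(empirical, bound)   # the empirical frequency should exceed the lower bound
```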


Inner/Outer Spherical Bounds on Cubic Boxes

Consider a cube in $m$ dimensions, $[0,1]^m$, having unit volume. The largest sphere contained by this cube has diameter 1 (the inner circle in the figure). The smallest sphere containing this cube has diameter $\sqrt{m}$ (the outer circle).

(Figure: the unit cube with its inscribed sphere of diameter 1 and circumscribed sphere of diameter $\sqrt{m}$.)

Fraction of inner spherical volume (unit sphere of radius $1/2$, i.e., $V_m(r)$ for radius $r = 1/2$) to unit cube volume (which is 1 for any $m$):
$$
\frac{\text{inner spherical vol.}}{\text{unit cube vol.}} = \frac{V_m(r)}{1} = \frac{\pi^{m/2}}{\Gamma(\frac{m}{2} + 1)}\, r^m \tag{5.18}
$$

We have $\lim_{m\to\infty} V_m(1/2)/1 = 0$. Hence the number of such sphere volumes that fit within the cube's volume goes to infinity: by volume, more and more of these spheres could be packed into a unit cube as the dimensionality gets higher.
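Evaluating Eq. (5.18) numerically (a short sketch, ours, using the standard library's `math.gamma`):

```python
import math

def ball_volume(m, r):
    # V_m(r) = pi^(m/2) / Gamma(m/2 + 1) * r^m
    return math.pi ** (m / 2.0) / math.gamma(m / 2.0 + 1.0) * r ** m

for m in [2, 3, 10, 20, 50]:
    v = ball_volume(m, 0.5)
    print(m, v, 1.0 / v)   # inscribed-sphere volume shrinks; cube-to-sphere volume ratio blows up
```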


Using 2D to represent High-D as if 2D was High-D

Relationship between the unit-radius ($r = 1$) sphere and the unit-volume (side length 1) cube as the dimensions grow, drawn as if it were also true in 2D.

Note that for the cube, as $m$ gets higher, the distance from center to face stays at $1/2$ but the distance from center to vertex grows as $\sqrt{m}/2$. For $m = 4$ the vertex distance is $\sqrt{m}/2 = 1$.

(Figure: the unit-radius sphere overlaid on cubes with half-widths $1/2$, $\sqrt{2}/2$, and $\sqrt{m}/2$; annotations "unit radius sphere", "nearly all the volume", and "vertex of hypercube". Illustration of the relationship between the sphere and the cube in 2, 4, and $m$ dimensions, from Blum, Hopcroft, & Kannan, 2016.)


The Blessing of Dimensionality

More dimensions can help to distinguish categories and make pattern recognition easier (and even possible). Recall Voronoi tessellation.

Support vector machines (SVMs) can find and exploit data patterns extant only in extremely high (or even infinite) dimensional space!!!

For still more blessings, see "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", David L. Donoho.


Feature Selection vs. Dimensionality Reduction

We've already seen that feature selection is one strategy to reduce feature dimensionality.

In feature selection, each feature is either selected or not: all or nothing.

Other dimensionality reduction strategies take the input $x \in \mathbb{R}^m$ and encode each $x$ into $e(x) \in \mathbb{R}^{m'}$, a lower dimensional space with $m' < m$.

Core advantage: if $U = \{1, 2, \ldots, m\}$, there may be no subset $A \subseteq U$ such that $x_A$ works well, yet there might be a combination $e(x)$ that works well. In the figure on the right, the simple linear combination $a_1 x_1 + a_2 x_2$ works quite well.
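A toy construction (ours) in the spirit of that figure: neither raw coordinate separates the two classes on its own, but the single linear combination $x_1 + x_2$ does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)             # class labels 0/1
t = rng.uniform(-1.0, 1.0, size=n)
x1 = t                                     # marginally uninformative on its own
x2 = -t + np.where(y == 1, 0.5, -0.5)      # heavily overlapping on its own

def acc_threshold(feature, y):
    # accuracy of the sign rule "predict 1 if feature > 0" (or its flip)
    pred = (feature > 0).astype(int)
    return max(np.mean(pred == y), np.mean(pred != y))

print(acc_threshold(x1, y))        # ~0.5: x1 alone is useless
print(acc_threshold(x2, y))        # ~0.75: x2 alone is mediocre
print(acc_threshold(x1 + x2, y))   # 1.0: the combination separates perfectly
```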


Principal Component Analysis (PCA)

PCA is a linear encoding $e(x) = x^\top W$ for an $m \times m$ matrix $W$.

Let $X$ be the $n \times m$ design matrix; thus $X' = XW$ is the encoded $n \times m$ design matrix.

Any linear transformation encoder can be so written. PCA has orthonormal columns in $W$ (so $\langle W(:,i), W(:,j)\rangle = \mathbf{1}\{i = j\}$), ordered decreasing by ability to explain variance in the original space.

Thus, taking the first $m' \le m$ columns of $W$, giving the $m \times m'$ matrix $W_k$, yields $XW_k$, a dimensionality reduction.

When $m' < m$, the projection is onto a subspace that maximizes the variance over all possible subspaces of dimension $m'$. When $m' = m$, we simply order the columns decreasing by variance explanation (so taking any length-$m'$ prefix of columns gives the answer).
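A minimal PCA sketch (ours) via an eigendecomposition of the sample covariance, matching the description above: the columns of `W` are orthonormal and ordered by explained variance, and `X @ W[:, :m_prime]` is the reduced design matrix.

```python
import numpy as np

def pca_components(X):
    """Return eigenvalues (descending) and the matrix W whose columns are
    the corresponding orthonormal eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    evals, evecs = np.linalg.eigh(S)      # eigh returns ascending order
    order = np.argsort(evals)[::-1]       # re-order: largest variance first
    return evals[order], evecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated features
evals, W = pca_components(X)
m_prime = 2
X_reduced = X @ W[:, :m_prime]   # X W_k as on the slide (X is often mean-centered first)
print(evals)
print(X_reduced.shape)            # (500, 2)
```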


Maximum Variance: 1D projection

Consider $m' = 1$ to start; the vector $w^{(1)} \in \mathbb{R}^m$ is the projection (i.e., $\langle x, w^{(1)}\rangle$ projects to 1D).

Assume unit vectors: $\|w^{(1)}\|_2 = \sqrt{\langle w^{(1)}, w^{(1)}\rangle} = 1$.

The mean of the data is $\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}$, so the mean of the projected data is $\langle w^{(1)}, \bar{x}\rangle$.

Variance of the projected data:
$$
\frac{1}{n}\sum_{i=1}^n \Big(\langle w^{(1)}, x^{(i)}\rangle - \langle w^{(1)}, \bar{x}\rangle\Big)^2 = w^{(1)\top} S\, w^{(1)} \tag{5.19}
$$
where
$$
S = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^\top \tag{5.20}
$$
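A quick numerical confirmation (ours) that the sample variance of the 1-D projections equals $w^\top S w$, for the scatter matrix of Eq. (5.20) and any unit vector $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # some correlated data
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / X.shape[0]                # Eq. (5.20)

w = rng.normal(size=4)
w /= np.linalg.norm(w)                                    # unit projection direction

proj = X @ w                                              # <w, x^(i)> for all i
lhs = np.mean((proj - w @ xbar) ** 2)                     # left-hand side of Eq. (5.19)
rhs = w @ S @ w
print(lhs, rhs, np.isclose(lhs, rhs))
```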


Maximum Variance: 1D projection

Goal: maximize the projected variance $w^{(1)\top} S w^{(1)} = \langle w^{(1)}, S w^{(1)}\rangle$ subject to $\|w^{(1)}\|_2 = 1$.

Constrained optimization via Lagrange multipliers yields the stationary point $S w^{(1)} = \lambda_1 w^{(1)}$, so $w^{(1)}$ is an eigenvector of $S$ with eigenvalue $\lambda_1$.

Pre-multiplying by $w^{(1)}$ and using $\|w^{(1)}\|_2 = 1$ yields
$$
\big\langle w^{(1)}, S w^{(1)}\big\rangle = \lambda_1 \tag{5.21}
$$

The largest eigenvalue of $S$ gives the greatest variance, and the projection direction achieving it is the corresponding eigenvector $w^{(1)}$.

This is the first principal component.
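Continuing the sketch above (ours): the top eigenvector of $S$ attains the maximum projected variance $\lambda_1$, and no random unit direction exceeds it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / X.shape[0]

evals, evecs = np.linalg.eigh(S)       # ascending eigenvalues
lam1, w1 = evals[-1], evecs[:, -1]     # largest eigenvalue and its eigenvector
print(lam1, w1 @ S @ w1)               # equal, as in Eq. (5.21)

best_random = 0.0
for _ in range(1000):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)
    best_random = max(best_random, w @ S @ w)
print(best_random <= lam1 + 1e-9)      # True: no direction beats the first PC
```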

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F47/65 (pg.115/170)

"

Page 116: Advanced Introduction to Machine Learning · 2020. 9. 27. · for regression. (e.g., support vector machines, max-margin learning, conditional likelihood, etc.) Sometimes they are


Subsequent components

To produce the $k$th component given the first $k-1$ components, first form

$$X_k = X - \sum_{i=1}^{k-1} X w^{(i)} w^{(i)\top} \qquad (5.22)$$

and then proceed as before: maximize the projected variance $\langle w^{(k)}, S_k w^{(k)} \rangle$ subject to $\|w^{(k)}\|_2 = 1$, where $S_k$ is computed from $X_k$.

Overall, the collection $\{w^{(i)}, \lambda_i\}_i$ are the $m$ eigenvector/eigenvalue pairs of the matrix $S$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F48/65
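A small numpy sketch of this deflation view (illustrative only; in practice one simply takes all the eigenvectors of S at once). It works with centered data; the synthetic X below is not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
    Xc = X - X.mean(axis=0)                      # centered data
    n = len(Xc)
    S = Xc.T @ Xc / n

    W = []                                       # deflation directions w^(1), w^(2), w^(3)
    Xk = Xc.copy()
    for k in range(3):
        Sk = Xk.T @ Xk / n                       # scatter of the deflated data
        wk = np.linalg.eigh(Sk)[1][:, -1]        # its top eigenvector
        W.append(wk)
        Xk = Xk - np.outer(Xk @ wk, wk)          # X_{k+1} = X_k - X_k w^(k) w^(k)^T, as in (5.22)

    vecs = np.linalg.eigh(S)[1]                  # eigenvectors of the full S, ascending order
    for k, wk in enumerate(W):
        print(abs(wk @ vecs[:, -1 - k]))         # ~1.0: matches the k-th eigenvector (up to sign)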


PCA: Minimizing reconstruction error, $m' < m$

Let $u_i$, $i \in [m]$, be a set of orthonormal vectors in $\mathbb{R}^m$.

Basis decomposition: any sample $x^{(i)}$ can be written as $x^{(i)} = \sum_{j=1}^{m} \alpha_{i,j} u_j = \sum_{j=1}^{m} \langle x^{(i)}, u_j \rangle u_j$.

Suppose we decide to use only $m' < m$ dimensions, so $\hat{x}^{(i)} = \sum_{j=1}^{m'} \langle x^{(i)}, u_j \rangle u_j$.

Reconstruction error:

$$J_{m'} = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - \hat{x}^{(i)} \big\|_2^2 \qquad (5.23)$$

To minimize this, it can be shown that we should choose $u_i$ to be the eigenvector of $S$ corresponding to the $i$th largest eigenvalue.

Let $W$ be the matrix whose columns are the eigenvectors of $S$ sorted in decreasing order of eigenvalue, and let $W_{m'} = W(:, 1{:}m')$ be the $m \times m'$ matrix of its first $m'$ columns.

Then $X W_{m'}$ projects down onto the first $m'$ principal components; this is also widely known as the Karhunen-Loève transform (KLT).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F49/65
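A hedged numpy sketch of the projection $X W_{m'}$ and of the reconstruction error $J_{m'}$ in (5.23) (illustrative, centered variant with synthetic data; with squared error, $J_{m'}$ equals the sum of the discarded eigenvalues):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S = Xc.T @ Xc / len(X)

    m_prime = 2
    vals, vecs = np.linalg.eigh(S)                 # ascending eigenvalues
    W_mp = vecs[:, ::-1][:, :m_prime]              # m x m' matrix of the top-m' eigenvectors

    Z = Xc @ W_mp                                  # n x m' projected (KLT) coordinates
    X_hat = Z @ W_mp.T + xbar                      # reconstruction back in R^m

    J = np.mean(np.sum((X - X_hat)**2, axis=1))    # reconstruction error J_{m'}, as in (5.23)
    print(J, vals[:-m_prime].sum())                # equals the sum of the discarded eigenvalues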


Example: 2D and two principal directions

[Figure: 2D data with its two principal directions.]

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F50/65

PCA and discriminability

PCA projects down to the directions of greatest variance. Are these also the best directions for classification?

For classification, we would want the samples in different classes to be as separate as possible, i.e., to make it as easy as possible to discern the difference between (i.e., discriminate) the different classes.

Any projection that blends samples from different classes on top of each other would be undesirable.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F51/65


Example: PCA on 2-class data

[Figure: PCA on 2-class data.]

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F52/65

Example: Data normalization

From Bishop 2006.

[Figure, three panels.] Left: original data. Center: mean subtraction and variance normalization in each dimension. Right: decorrelated data (i.e., diagonal covariance).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F53/65
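The three panels correspond to three simple preprocessing steps; a short numpy sketch of them on synthetic data (illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])   # correlated 2D data

    Xm = X - X.mean(axis=0)                       # mean subtraction
    Xs = Xm / Xm.std(axis=0)                      # variance normalization in each dimension
    evals, evecs = np.linalg.eigh(np.cov(Xs.T))   # eigenvectors of the (normalized) covariance
    Xd = Xs @ evecs                               # rotate onto them: decorrelated data
    print(np.round(np.cov(Xd.T), 3))              # approximately diagonal covariance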

PCA vs. best axis for class separability

From Bishop 2006.

[Figure.] Projecting onto the direction of greatest variability leads to poor class separation, while another 1D projection exists that retains good separability.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F54/65

PCA and lack of class separability

From Bishop 2006.

[Figure.] Left: two dimensions ($x_6$, $x_7$) of the original data. Right: PCA projection onto the two dimensions of greatest variability, which is bad for two reasons: (I) it spreads the individual classes (e.g., red), and (II) it blends distinct classes (blue and green) on top of each other; both make classification no easier.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F55/65

PCA vs. LDA

From Bishop 2006.

[Figure.] PCA vs. Linear Discriminant Analysis (LDA).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F56/65

Axis-parallel projection vs. LDA

From Hastie et al., 2009.

[Figure.] Axis-parallel projection (left) vs. LDA (right).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F57/65

Linear Discriminant Analysis (LDA), 2 classes

Consider class-conditional Gaussian data, so $p(x|y) = \mathcal{N}(x \mid \mu_y, C_y)$ for mean vectors $\{\mu_y\}_y$ and covariance matrices $\{C_y\}_y$, with $x \in \mathbb{R}^m$:

$$p(x|y) = \frac{1}{|2\pi C_y|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_y)^\top C_y^{-1} (x - \mu_y) \Big) \qquad (5.24)$$

In the two-class case $y \in \{0, 1\}$ with equal covariances $C_0 = C_1 = C$ (the homoscedasticity property) and equal priors $p(y=0) = p(y=1)$, consider the log posterior odds ratio:

$$\log \frac{p(y=1|x)}{p(y=0|x)} = -\frac{1}{2}(x - \mu_1)^\top C_1^{-1}(x - \mu_1) + \frac{1}{2}(x - \mu_0)^\top C_0^{-1}(x - \mu_0) \qquad (5.25)$$

$$= (C^{-1}\mu_1 - C^{-1}\mu_0)^\top x + \tfrac{1}{2}\mu_0^\top C^{-1}\mu_0 - \tfrac{1}{2}\mu_1^\top C^{-1}\mu_1 \qquad (5.26)$$

$$= \theta^\top x + c \qquad (5.27)$$

$\theta$ is a projection (an $m \times 1$ matrix, a linear transformation) down to the one dimension that is sufficient for prediction without loss.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F58/65
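A hedged numpy sketch of this two-class case (illustrative; the data are synthetic, the class means and shared covariance are estimated from samples, and classification thresholds $\theta^\top x + c$ at zero as in (5.27)):

    import numpy as np

    rng = np.random.default_rng(1)
    n_per = 200
    L = np.linalg.cholesky(np.array([[2.0, 1.2], [1.2, 1.0]]))        # shared covariance factor
    X0 = rng.normal(size=(n_per, 2)) @ L.T                            # class 0, mean (0, 0)
    X1 = rng.normal(size=(n_per, 2)) @ L.T + np.array([2.0, 1.0])     # class 1, mean (2, 1)

    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    centered = np.vstack([X0 - mu0, X1 - mu1])
    C = centered.T @ centered / (2 * n_per - 2)                       # pooled covariance estimate
    Cinv = np.linalg.inv(C)

    theta = Cinv @ (mu1 - mu0)                                        # projection direction, (5.26)
    c = 0.5 * (mu0 @ Cinv @ mu0 - mu1 @ Cinv @ mu1)                   # bias term (equal priors)

    scores = np.vstack([X0, X1]) @ theta + c                          # log posterior odds, (5.27)
    labels = np.r_[np.zeros(n_per), np.ones(n_per)]
    print("accuracy:", np.mean((scores > 0) == labels))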


LDA

Thus, the decision boundary between the two classes is linear.

This remains true for the decision boundary between any two same-covariance classes of an $\ell$-class soft-max regression classifier, starting from Gaussians for $p(x|y)$.

Given discriminant functions of the form

$$h_y(x) = -\frac{1}{2}\log|C_y| - \frac{1}{2}(x - \mu_y)^\top C_y^{-1}(x - \mu_y) + \log p(y) \qquad (5.28)$$

all decision boundaries $\{x : h_i(x) = h_j(x)\}$ between two classes are linear (consider a Voronoi tessellation) if $C_y = C$ for all $y$, and are otherwise quadratic.

With 2 classes, we need only 1 dimension; with 3 classes, only 2 dimensions.

In general, with $\ell$ classes we need only an $(\ell - 1)$-dimensional subspace onto which to project while retaining the ability to discriminate (moving orthogonal to that subspace leaves the posterior constant).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F59/65
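A minimal sketch of the discriminant functions $h_y(x)$ in (5.28) (illustrative; the means, covariance, and test point are made up. With distinct per-class covariances the boundaries are quadratic, while sharing one covariance gives the linear case above):

    import numpy as np

    def gaussian_discriminant(x, mu, C, prior):
        # h_y(x) = -0.5 log|C_y| - 0.5 (x - mu_y)^T C_y^{-1} (x - mu_y) + log p(y), as in (5.28)
        diff = x - mu
        return (-0.5 * np.linalg.slogdet(C)[1]
                - 0.5 * diff @ np.linalg.solve(C, diff)
                + np.log(prior))

    mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    C_shared = np.array([[2.0, 1.2], [1.2, 1.0]])
    x = np.array([1.0, 0.5])

    scores = [gaussian_discriminant(x, mu, C_shared, 0.5) for mu in mus]
    print("predicted class:", int(np.argmax(scores)))     # shared C: linear decision boundaries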


Linear Discriminant Analysis (LDA), $\ell$ classes

Within-class scatter matrix:

$$S_W = \sum_{c=1}^{\ell} S_c \quad \text{where} \quad S_c = \sum_{i \,:\, i \text{ is in class } c} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^\top \qquad (5.29)$$

where $\mu_c$ is the class-$c$ mean, $\mu_c = \frac{1}{n_c} \sum_{i \text{ in class } c} x^{(i)}$.

Between-class scatter matrix, with $n_c$ the number of samples of class $c$:

$$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (5.30)$$

First projection: $a_1 \in \operatorname{argmax}_{a \in \mathbb{R}^m} \frac{a^\top S_B a}{a^\top S_W a}$. Second, orthogonal to the first: $a_2 \in \operatorname{argmax}_{a \in \mathbb{R}^m : \langle a, a_1 \rangle = 0} \frac{a^\top S_B a}{a^\top S_W a}$, and so on.

In general, the directions of greatest variance of the matrix $S_W^{-1} S_B$ maximize the between-class spread relative to the within-class spread; i.e., we can do PCA on $S_W^{-1} S_B$ to get the LDA reduction.

LDA dimensionality reduction: find an $\ell' \times m$ linear transform of $x$ that projects onto the subspace corresponding to the $\ell'$ greatest eigenvalues of $S_W^{-1} S_B$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F60/65
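A hedged numpy sketch of the multi-class reduction (illustrative; three synthetic Gaussian classes in $\mathbb{R}^4$, reduced to $\ell - 1 = 2$ dimensions via the leading eigenvectors of $S_W^{-1} S_B$):

    import numpy as np

    rng = np.random.default_rng(2)
    m, n_per = 4, 150
    means = [np.zeros(m), np.array([3.0, 0, 0, 0]), np.array([0, 3.0, 0, 0])]
    classes = [rng.normal(size=(n_per, m)) + mu for mu in means]      # 3 classes
    X_all = np.vstack(classes)
    mu_all = X_all.mean(axis=0)

    S_W = np.zeros((m, m))
    S_B = np.zeros((m, m))
    for Xcls in classes:
        mu_c = Xcls.mean(axis=0)
        S_W += (Xcls - mu_c).T @ (Xcls - mu_c)                        # within-class scatter, (5.29)
        S_B += len(Xcls) * np.outer(mu_c - mu_all, mu_c - mu_all)     # between-class scatter, (5.30)

    # "PCA on S_W^{-1} S_B": its leading eigenvectors give the LDA directions.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))           # generally non-symmetric
    order = np.argsort(evals.real)[::-1]
    A = evecs[:, order[:2]].real                                      # m x l' transform, l' = 2
    Z = X_all @ A                                                     # LDA-projected data
    print(np.round(evals.real[order], 2))                             # only l - 1 = 2 are nonzero

Solving the non-symmetric eigenproblem for $S_W^{-1} S_B$ directly, as above, is the simplest way to write the sketch; numerically more careful implementations solve the generalized symmetric eigenproblem $S_B a = \lambda S_W a$ instead.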


LDA projection, dimensionality upper limits of the transform

$\ell' \le \ell - 1$ is the highest-dimensional subspace we can project onto, since $S_B$ has rank at most $\ell - 1$.

Between-class scatter matrix, with $n_c$ the number of samples of class $c$:

$$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (5.31)$$

Each vector lies in $\mathbb{R}^m$ and we have $\ell$ rank-1 updates, so immediately the rank is at most $\ell$ (ignoring matrix conditioning and potential numerical issues). However, we can write

$$S_B = \begin{bmatrix} (\mu_1 - \mu) & (\mu_2 - \mu) & \cdots & (\mu_\ell - \mu) \end{bmatrix}
\begin{pmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_\ell \end{pmatrix}
\begin{bmatrix} (\mu_1 - \mu)^\top \\ (\mu_2 - \mu)^\top \\ \vdots \\ (\mu_\ell - \mu)^\top \end{bmatrix} \qquad (5.32)$$

Since $\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$, and since $\sum_{i=1}^{\ell} n_i (\mu_i - \mu) = 0$, which means that $\operatorname{rank}\big([(\mu_1 - \mu) \; (\mu_2 - \mu) \; \cdots \; (\mu_\ell - \mu)]\big) \le \ell - 1$, we conclude $\operatorname{rank}(S_B) \le \ell - 1$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F61/65
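A quick numerical check of the rank bound (a sketch with $\ell = 3$ synthetic classes in $\mathbb{R}^5$; the class means are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    classes = [rng.normal(size=(40, 5)) + rng.normal(size=5) for _ in range(3)]   # l = 3, m = 5
    mu = np.vstack(classes).mean(axis=0)                      # overall sample mean

    S_B = np.zeros((5, 5))
    for Xcls in classes:
        mu_c = Xcls.mean(axis=0)
        S_B += len(Xcls) * np.outer(mu_c - mu, mu_c - mu)     # eq. (5.31)

    print(np.linalg.matrix_rank(S_B))                         # 2, i.e., at most l - 1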


Repeat slide

The next slide is a repeat from earlier.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F62/65

Uniformly at random unit vectors on the unit sphere

Let $S^{m-1}(r)$ denote the surface ("area") of the $m$-dimensional sphere of radius $r$.

Define the uniform distribution $U_{S^{m-1}(1)}$ on $S^{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S^{m-1}(1)}$, hence $\|x\|_2 = \|y\|_2 = 1$.

Two vectors are orthogonal if $\langle x, y \rangle = 0$ and are nearly so if $|\langle x, y \rangle| < \epsilon$ for small $\epsilon$.

It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then:

$$\Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \qquad (5.17)$$

which means that uniformly-at-random vectors in high-dimensional space are almost always nearly orthogonal!

This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F63/65
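A Monte Carlo sketch of the bound in (5.17) (illustrative; uniform unit vectors are sampled by normalizing Gaussian vectors, and the choices of eps, dimensions, and trial count are arbitrary):

    import numpy as np

    def random_unit_vectors(num, m, rng):
        v = rng.normal(size=(num, m))
        return v / np.linalg.norm(v, axis=1, keepdims=True)    # uniform on S^{m-1}(1)

    eps, trials = 0.1, 20000
    for m in (10, 100, 1000):
        rng = np.random.default_rng(m)
        x = random_unit_vectors(trials, m, rng)
        y = random_unit_vectors(trials, m, rng)
        frac = np.mean(np.abs(np.sum(x * y, axis=1)) < eps)    # estimate of Pr(|<x,y>| < eps)
        print(m, round(frac, 3), ">", round(1 - np.exp(-m * eps**2 / 2), 3))   # vs. the bound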

Non-linear Dimensionality Reduction: Autoencoders

Construct two mappings, an encoder $e_{\theta_e} : \mathbb{R}^m \to \mathbb{R}^{m'}$ and a decoder $d_{\theta_d} : \mathbb{R}^{m'} \to \mathbb{R}^m$.

Given $x \in \mathbb{R}^m$, $d_{\theta_d}(e_{\theta_e}(x))$ is the reconstruction of $x$, which has error $E_{p(x)}\big[\| x - d_{\theta_d}(e_{\theta_e}(x)) \|\big]$.

Given a data set $D_n = \{x^{(i)}\}_i$, the optimization, with $\theta = (\theta_e, \theta_d)$, becomes:

$$\min_\theta E_{p(x)}\big[\| x - d_{\theta_d}(e_{\theta_e}(x)) \|\big] \approx \min_\theta \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - d_{\theta_d}(e_{\theta_e}(x^{(i)})) \big\| \qquad (5.33)$$

Then $e_{\theta_e}(x)$ is our learnt representation of $x$. If $m' < m$ or $m' \ll m$, this is dimensionality reduction, similar to compression.

As $m'$ gets smaller, there is more pressure to ensure the dimensions are independent (to increase capacity).

This approach has become very successful, with many different flavors (denoising autoencoders, variational autoencoders) using (deep) neural networks as encoders/decoders; this is representation learning.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F64/65
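A minimal numpy sketch of such an autoencoder (illustrative only: a one-hidden-layer tanh encoder and a linear decoder, trained by full-batch gradient descent on a squared-error version of (5.33); the synthetic data, learning rate, and step count are arbitrary, and real autoencoders would use a deep-learning framework and deeper networks):

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, m_prime = 400, 9, 3
    factors = rng.normal(size=(n, m_prime))                   # 3 underlying factors
    Xdata = np.tanh(factors @ rng.normal(size=(m_prime, m)))  # embedded nonlinearly in R^9

    We = 0.1 * rng.normal(size=(m, m_prime))                  # encoder parameters theta_e
    Wd = 0.1 * rng.normal(size=(m_prime, m))                  # decoder parameters theta_d
    lr = 0.05

    for step in range(2000):
        Z = np.tanh(Xdata @ We)                               # codes e_{theta_e}(x) in R^{m'}
        X_hat = Z @ Wd                                        # reconstructions d_{theta_d}(z)
        R = X_hat - Xdata
        loss = np.mean(np.sum(R**2, axis=1))                  # empirical objective, cf. (5.33)

        dXhat = 2.0 * R / n                                   # gradient of the loss w.r.t. X_hat
        dWd = Z.T @ dXhat
        dZ = dXhat @ Wd.T
        dWe = Xdata.T @ (dZ * (1.0 - Z**2))                   # backprop through tanh
        We -= lr * dWe
        Wd -= lr * dWd

    print("final reconstruction error:", round(loss, 4))

With a purely linear encoder and decoder and squared error, the optimum of this objective spans the same subspace as the top $m'$ principal components, which is one way to see the "non-linear PCA" analogy on the next slide.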


DNN-based Auto-Encoder

Encoder/decoder pipeline from $x$ to $\hat{x} = d_{\theta_d}(e_{\theta_e}(x))$, where $m = 9$ and $m' = 3$.

In some sense, this is a form of non-linear PCA.

[Figure: a network with an input layer, hidden layers 1-2, a 3-unit code layer, hidden layers 4-5, and an output layer; the encoder $e_{\theta_e} : \mathbb{R}^m \to \mathbb{R}^{m'}$ maps $x$ to the code, and the decoder $d_{\theta_d} : \mathbb{R}^{m'} \to \mathbb{R}^m$ maps the code to $\hat{x}$.]

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F65/65
