
Transcript of Advanced Introduction to Machine Learning

Page 1

Advanced Introduction to Machine Learning — Spring Quarter, Week 5 —

https://canvas.uw.edu/courses/1372141

Prof. Jeff Bilmes

University of Washington, Seattle

Departments of: Electrical & Computer Engineering, Computer Science & Engineering

http://melodi.ee.washington.edu/~bilmes

April 27th/29th, 2020

Page 2

Announcements

HW2 is posted. Due May 1st, 6:00pm via our assignment dropbox (https://canvas.uw.edu/courses/1372141/assignments).

Virtual office hours this week, Thursday night at 10:00pm via zoom (same link as class).

Page 3

Class Road Map

W1 (3/30, 4/1): What is ML, Probability, Coins, Gaussians and linear regression, Associative Memories, Supervised Learning
W2 (4/6, 4/8): More supervised, logistic regression, complexity and bias/variance tradeoff
W3 (4/13, 4/15): Bias/Variance, Regularization, Ridge, CrossVal, Multiclass
W4 (4/20, 4/22): Multiclass classification, ERM, Gen/Disc, Naive Bayes
W5 (4/27, 4/29): Lasso, Regularizers, Curse of Dimensionality
W6 (5/4, 5/6): Curse of Dimensionality, Dimensionality Reduction, k-NN
W7 (5/11, 5/13): k-NN, LSH, DTs, Bootstrap/Bagging, Boosting & Random Forests, GBDTs
W8 (5/18, 5/20): Graphs; Graphical Models (Factorization, Inference, MRFs, BNs)
W9 (5/27, 6/1): Learning Paradigms; Clustering; EM Algorithm
W10 (6/3, 6/8): Spectral Clustering, Graph SSL, Deep models, (SVMs, RL); The Future

Last lecture is 6/8 since 5/25 is a holiday (or we could just have lecture on 5/25).

Page 4

Class (and Machine Learning) overview

1. Introduction • What is ML • What is AI • Why are we so interested in these topics right now?

2. ML Paradigms/Concepts • Overfitting/Underfitting, model complexity, bias/variance • size of data, big data, sample complexity • ERM, loss + regularization, loss functions, regularizers • supervised, unsupervised, and semi-supervised learning • reinforcement learning, RL, multi-agent, planning/control • transfer and multi-task learning • federated and distributed learning • active learning, machine teaching • self-supervised, zero/one-shot, open-set learning

3. Dealing with Features • dimensionality reduction, PCA, LDA, MDS, t-SNE, UMAP • locality sensitive hashing (LSH) • feature selection • feature engineering • matrix factorization & feature engineering • representation learning

4. Evaluation • accuracy/error, precision/recall, ROC, likelihood/posterior, cost/utility, margin • train/eval/test data splits • n-fold cross validation • method of the bootstrap

5. Optimization Methods • Unconstrained continuous optimization: (stochastic) gradient descent (SGD), adaptive learning rates, conjugate gradient, 2nd-order Newton • Constrained continuous optimization: Frank-Wolfe (conditional gradient descent), projected gradient, linear, quadratic, and convex programming • Discrete optimization: greedy, beam search, branch-and-bound, submodular optimization

6. Inference Methods • probabilistic inference • MLE, MAP • belief propagation • forward/backpropagation • Monte Carlo methods

7. Models & Representation • linear least squares, linear regression, logistic regression, sparsity, ridge, lasso • generative vs. discriminative models • Naive Bayes • k-nearest neighbors • clustering, k-means, k-medoids, EM & GMMs, single linkage • decision trees and random forests • support vector machines, kernel methods, max margin • perceptron, neural networks, DNNs • Gaussian processes • Bayesian nonparametric methods • ensemble methods • the bootstrap, bagging, and boosting • graphical models • time-series, HMMs, DBNs, RNNs, LSTMs, Attention, Transformers • structured prediction • grammars (as in NLP)

8. Philosophy, Humanity, Spirituality • artificial intelligence (AI) • artificial general intelligence (AGI) • artificial intelligence vs. science fiction

9. Applications • computational biology • social networks • computer vision • speech recognition • natural language processing • information retrieval • collaborative filtering/matrix factorization

10. Programming • python • libraries (e.g., NumPy, SciPy, matplotlib, scikit-learn (sklearn), pytorch, CNTK, Theano, tensorflow, keras, H2O, etc.) • HPC: C/C++, CUDA, vector processing

11. Background • linear algebra • multivariate calculus • probability theory and statistics • information theory • mathematical (e.g., convex) optimization

12. Other Techniques • compressed sensing • submodularity, diversity/homogeneity modeling

[Figure thumbnails: a variable-elimination example on a six-variable graphical model, computing $p(x_1, x_2) = \sum_{x_3}\sum_{x_4}\cdots\sum_{x_6} p(x_1, x_2, \ldots, x_6)$ by summing out one variable at a time, with the variable to eliminate, the corresponding marginalization operation, the graphical transformation ("reconstituted graph"), and per-step complexities from $O(r^2)$ to $O(r^4)$; and a deep neural network with an input layer, hidden layers 1-7, and an output unit.]

Page 5

Multiclass and Voronoi tessellations

$\ell$ classes, $j \in \{1, 2, \ldots, \ell\}$, parameters $\theta^{(j)} \in \mathbb{R}^m$, $\ell$ linear discriminant functions $h_{\theta^{(j)}}(x) = \langle \theta^{(j)}, x \rangle$.

Partition input space $x \in \mathbb{R}^m$ based on who wins:

$\mathbb{R}^m = \mathbb{R}^m_1 \cup \mathbb{R}^m_2 \cup \cdots \cup \mathbb{R}^m_\ell$   (5.19)

where $\mathbb{R}^m_i \subseteq \mathbb{R}^m$ and $\mathbb{R}^m_i \cap \mathbb{R}^m_j = \emptyset$ whenever $i \neq j$.

Winning block, for all $i \in [\ell]$:

$\mathbb{R}^m_i \triangleq \{x \in \mathbb{R}^m : h_{\theta^{(i)}}(x) > h_{\theta^{(j)}}(x),\ \forall j \neq i\}$   (5.20)

Alternatively,

$\mathbb{R}^m_i \triangleq \{x \in \mathbb{R}^m : \langle \theta^{(i)} - \theta^{(j)}, x \rangle > 0,\ \forall j \neq i\}$   (5.21)

consider when $\langle\cdot,\cdot\rangle$ is the dot product. Alternatively, under the softmax regression model,

$\mathbb{R}^m_i \triangleq \{x \in \mathbb{R}^m : \Pr(C_i|x) > \Pr(C_j|x),\ \forall j \neq i\}$   (5.22)
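A minimal NumPy sketch (not from the slides; the toy dimensions and parameters below are made up) of these winner-take-all regions: each point x is assigned to the class whose linear discriminant $\langle\theta^{(j)}, x\rangle$ is largest, i.e., the block in (5.20) that contains it.

    import numpy as np

    rng = np.random.default_rng(0)
    m, num_classes = 2, 3                      # toy feature dimension and number of classes
    Theta = rng.normal(size=(num_classes, m))  # row j holds the parameter vector theta^(j)
    X = rng.normal(size=(5, m))                # a few query points x

    scores = X @ Theta.T                       # scores[i, j] = <theta^(j), x^(i)>
    regions = scores.argmax(axis=1)            # index of the winning discriminant for each x
    print(regions)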

Page 6

0/1-loss and Bayes error

Let $L$ be the 0/1 loss, i.e., $L(y, y') = \mathbf{1}_{\{y \neq y'\}}$.

Probability of error for a given $x$:

$\Pr(h_\theta(x) \neq Y) = \int p(y|x)\,\mathbf{1}_{\{h_\theta(x) \neq y\}}\,dy = 1 - p(h_\theta(x)|x)$   (5.7)

To minimize the probability of error, $h_\theta(x)$ should ideally choose a class having maximum posterior probability for the current $x$.

The smallest probability of error is known as the Bayes error for $x$:

$\text{Bayes error}(x) = \min_y\,(1 - p(y|x)) = 1 - \max_y\,p(y|x)$   (5.8)

The Bayes classifier (or predictor) predicts using the highest probability, assuming we have access to $p(y|x)$. I.e.,

$\text{BayesPredictor}(x) \triangleq \operatorname{argmax}_y\,p(y|x)$   (5.9)

The Bayes predictor has the Bayes error as its error. Irreducible error rate.

Page 7

Bayes Error, overall difficulty of a problem

For the Bayes error, we often take the expected value w.r.t. $x$. This gives an overall indication of how difficult a classification problem is, since we can never do better than the Bayes error.

Bayes error, various equivalent forms (Exercise: show the last equality):

$\text{Bayes Error} = \min_h \int p(x, y)\,\mathbf{1}_{\{h(x) \neq y\}}\,dy\,dx$   (5.13)

$= \min_h \Pr(h(X) \neq Y)$   (5.14)

$= \mathbb{E}_{p(x)}[\min(p(y|x),\ 1 - p(y|x))]$   (5.15)

$= \frac{1}{2} - \frac{1}{2}\,\mathbb{E}_{p(x)}\big[\,|2p(y|x) - 1|\,\big]$   (5.16)

For binary classification, if the Bayes error is 1/2, prediction can never be better than flipping a coin (i.e., $x$ tells us nothing about $y$).

The Bayes error is a property of the distribution $p(x, y)$; it can be useful to decide between, say, $p(x, y)$ vs. $p(x', y)$ for two different feature sets $x$ and $x'$.
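As a quick numeric sanity check (an added sketch with a made-up joint distribution, not from the slides), the two forms (5.15) and (5.16) agree for a binary problem with a discrete feature:

    import numpy as np

    # Toy joint distribution over a discrete x in {0,1,2} and binary y; the values are made up.
    p_xy = np.array([[0.30, 0.10],   # p(x=0, y=0), p(x=0, y=1)
                     [0.05, 0.25],
                     [0.15, 0.15]])
    p_x = p_xy.sum(axis=1)
    p_y1_given_x = p_xy[:, 1] / p_x

    # Form (5.15): E_{p(x)}[ min(p(y|x), 1 - p(y|x)) ]
    bayes_err_a = np.sum(p_x * np.minimum(p_y1_given_x, 1.0 - p_y1_given_x))
    # Form (5.16): 1/2 - 1/2 E_{p(x)}[ |2 p(y|x) - 1| ]
    bayes_err_b = 0.5 - 0.5 * np.sum(p_x * np.abs(2.0 * p_y1_given_x - 1.0))
    print(bayes_err_a, bayes_err_b)   # both print 0.3: the two forms agree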

Page 8

ERM and complexity

Empirical risk minimization on data $\mathcal{D}$:

$\hat{h} \in \operatorname{argmin}_{h \in \mathcal{H}} \widehat{\mathrm{Risk}}(h)$, where $\widehat{\mathrm{Risk}}(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{h(x^{(i)}) \neq y^{(i)}\}}$   (5.13)

$\hat{h}$ is a random variable, computed based on the random data set $\mathcal{D}$. How close is the risk of $\hat{h}$ to the true risk, the one for the best hypothesis $h^*$?

$h^* \in \operatorname{argmin}_{h \in \mathcal{H}} \mathrm{Risk}(h)$, where $\mathrm{Risk}(h) = \int p(x, y)\,\mathbf{1}_{\{h(x) \neq y\}}\,dx\,dy$   (5.14)

We can sometimes bound the probability (e.g., PAC-style bounds)

$\Pr\big(\mathrm{Risk}(\hat{h}) - \mathrm{Risk}(h^*) > \epsilon\big) < \delta(n, \mathrm{VC}, \epsilon)$   (5.15)

where $\delta$ is a function of the sample size $n$, of $\epsilon$, and of the complexity (ability, flexibility, expressiveness) of the family of models $\mathcal{H}$ over which we're learning. The bound gets worse with increasing complexity of $\mathcal{H}$ or decreasing $n$. There is extensive mathematical theory behind this (e.g., Vapnik-Chervonenkis (VC) dimension, or Rademacher complexity).
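For concreteness (toy arrays, not from the slides), the empirical 0/1 risk in (5.13) is just the fraction of training mistakes:

    import numpy as np

    y_true = np.array([0, 1, 1, 0, 1])
    y_pred = np.array([0, 1, 0, 0, 0])           # h(x^(i)) for each training point (toy values)
    empirical_risk = np.mean(y_pred != y_true)   # (1/n) sum of 1{h(x^(i)) != y^(i)}
    print(empirical_risk)                        # 0.4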

Page 9

Generative vs. Discriminative models

Generative models, given data drawn from $p(x, y)$, model the entire distribution, often factored as $p(x|y)p(y)$, where $y$ is the class label (or prediction) random variable and $x$ is the feature vector random variable.

Discriminative model: when the goal is classification/regression, it can sometimes be simpler to just model $p(y|x)$, since that is sufficient to achieve the Bayes error (i.e., $\operatorname{argmax}_y p(y|x)$ for classification) or $\mathbb{E}[Y|x]$ for regression. (E.g., support vector machines, max-margin learning, conditional likelihood, etc.)

Sometimes they are the same. A Naive Bayes class-conditional Gaussian generative model $p(x|y)$ leads to logistic regression.


Page 12

Accuracy Objectives

Decreasing order of complexity of the task being learnt, while still solving a classification decision problem. From strongest to weakest goals.

Generative modeling: model $p(x, y)$ and compute $\operatorname{argmax}_y p(x, y)$ for a given $x$. Maximum likelihood principle.

Discriminative modeling: model only $p(y|x)$ and compute $\operatorname{argmax}_y p(y|x)$. Maximum conditional likelihood principle.

Rank modeling: a rank function model $f(y|x)$ so that $f(\sigma_1|x) \geq f(\sigma_2|x) \geq \cdots \geq f(\sigma_\ell|x)$ whenever $p(\sigma_1|x) \geq p(\sigma_2|x) \geq \cdots \geq p(\sigma_\ell|x)$.

Best decision modeling: decision functions with $h_y(x) \geq h_{y'}(x)$ whenever $p(y|x) \geq p(y'|x)$ for all $y' \neq y$. Discriminant functions.

The weaker model often lives within the stronger model (e.g., logistic or softmax uses linear discriminant functions), as we saw, thereby giving it improved interpretation (e.g., probabilities rather than just unnormalized scores).

Page 13

Naive Bayes classifier

Recall, $x \in \mathbb{R}^m$, so $m$ feature values in $x = (x_1, x_2, \ldots, x_m)$ are used to predict $y$ in the joint model $p(x, y)$.

Generative conditional distribution $p(x, y) = p(x|y)p(y)$. Conditioned on $y$, $p(x|y)$ can be an arbitrary distribution.

Naive Bayes classifiers make a simplifying (factorization) assumption:

$p(x|y) = \prod_{j=1}^{m} p(x_j|y)$   (5.21)

or that $X_i \perp\!\!\!\perp X_j \mid Y$, read "$X_i$ and $X_j$ are conditionally independent given $Y$", for all $i \neq j$. This does not mean that $X_i \perp\!\!\!\perp X_j$.

Estimating many 1-dimensional distributions $p(x_j|y)$ is much easier than estimating the full joint $p(x|y)$, due to the curse of dimensionality.

Naive Bayes classifiers make a greatly simplifying assumption but often work quite well.

Page 14

Posterior From Binary Naive Bayes Gaussian

Binary case $y \in \{0, 1\}$, Naive Bayes Gaussian classifier posterior probability:

$p(y=1|x) = \dfrac{p(x|y=1)\,p(y=1)}{p(x|y=1)\,p(y=1) + p(x|y=0)\,p(y=0)}$   (5.23)

$= \dfrac{1}{1 + \frac{p(x|y=0)\,p(y=0)}{p(x|y=1)\,p(y=1)}}$   (5.24)

$= \dfrac{1}{1 + \exp\!\Big(-\log \frac{p(x|y=1)\,p(y=1)}{p(x|y=0)\,p(y=0)}\Big)}$   (5.25)

$= g\Big(\sum_{i=1}^{m} \log \frac{p(x_i|y=1)}{p(x_i|y=0)} + \log \frac{p(y=1)}{p(y=0)}\Big)$   (5.26)

where $g(z) = 1/(1 + e^{-z})$ is the logistic function, and $\log p(y=1)/p(y=0)$ is the log odds-ratio.
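A hedged numeric sketch of (5.26) (all numbers and per-feature Gaussian parameters below are made up, and SciPy's normal density is one convenient choice, not a course requirement): with equal class variances the summed log-ratios are linear in x, so the posterior is a logistic function of a linear score.

    import numpy as np
    from scipy.stats import norm

    # Made-up per-feature Gaussian class conditionals p(x_i | y): means plus a shared std per feature.
    mu0 = np.array([0.0, -1.0, 0.5])
    mu1 = np.array([1.0,  0.5, 0.0])
    sigma = np.array([1.0, 2.0, 0.5])          # equal across classes -> the log-ratio is linear in x
    prior1, prior0 = 0.4, 0.6

    def posterior_y1(x):
        # Eq. (5.26): g( sum_i log p(x_i|y=1)/p(x_i|y=0) + log p(y=1)/p(y=0) )
        log_ratio = norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu0, sigma)
        z = log_ratio.sum() + np.log(prior1 / prior0)
        return 1.0 / (1.0 + np.exp(-z))        # logistic function g(z)

    print(posterior_y1(np.array([0.2, -0.3, 0.4])))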

Page 15

Binary ($\ell = 2$) Naive Bayes Gaussian

Since $p(x_i|y)$ is univariate Gaussian, $\log \frac{p(x_i|y=1)}{p(x_i|y=0)} + \log \frac{p(y=1)}{p(y=0)}$ takes one of the two following forms:

1. $b_i x_i + c_i$ if $\sigma_{1,i} = \sigma_{2,i}$ for all $i$ (equal class variances). Hence, we have a linear model $p(y=1|x) = g(\theta^\top x + \text{const})$, like we saw in logistic regression!

2. $a_i x_i^2 + b_i x_i + c_i$ otherwise, so we have a quadratic model followed by the logistic function: $p(y=1|x) = g\big(\sum_{i=1}^{m} a_i x_i^2 + b_i x_i + c_i\big)$. Unequal class variances introduce quadratic features. Alternatively, we could have just started with features like $x_i^2$, increasing $m$, without need for unequal class variances. Still no cross-interaction terms of the form $x_i x_j$, due to the Naive Bayes assumption! But we can introduce those $x_i x_j$ cross features as well, further increasing $m$: feature augmentation.

3. Non-Naive-Bayes with fewer features is identical to doing Naive Bayes with more features. This helps to explain why Naive Bayes often works so well, since one can remove unfounded assumptions simply by adding features.

Page 16

Binary Naive Bayes Gaussian

Bottom line: binary Naive Bayes Gaussian (equivalently, two-class Gaussian generative model classifiers) is equivalent to logistic regression when we form the posterior distribution $p(y|x)$ for classification. We can eliminate the Naive Bayes property in the original feature space by adding more features (feature augmentation).

In general, feature engineering (producing and choosing the right features) is an important part of any machine learning system.

Representation learning: the art and science of using machine learning to learn the features themselves, often using a deep neural network. Self-supervised learning can facilitate this (see the final lecture).

Disentanglement: the art and science of taking a feature representation and producing new deduced features where separate phenomena in the signal have separate representation in the deduced features. A modern form of ICA (independent component analysis).

Page 17

Norms and p-norms

For any real-valued $p \geq 1$ and vector $x \in \mathbb{R}^m$, we have a norm:

$\|x\|_p \triangleq \Big(\sum_{j=1}^{m} |x_j|^p\Big)^{1/p}$   (5.24)

Properties of any "norm" $\|x\|$ for $x \in \mathbb{R}^m$ are:
1. $\|x\| \geq 0$ for any $x \in \mathbb{R}^m$.
2. $\|x\| = 0$ iff $x = 0$.
3. Absolutely homogeneous: $\|\alpha x\| = |\alpha|\,\|x\|$ for any $\alpha \in \mathbb{R}$.
4. Triangle inequality: for any $x, y \in \mathbb{R}^m$, we have $\|x + y\| \leq \|x\| + \|y\|$.

Euclidean norm: the 2-norm (where $p = 2$) is the standard Euclidean distance (from zero) that we use in everyday life.

The $p$-norm is also defined and is a valid norm for $p \to \infty$, in which case

$\|x\|_\infty = \max_{j=1}^{m} |x_j|$   (5.25)

There are other norms besides $p$-norms.

(Handwritten note: the $\ell_0$ "counting norm" counts the number of non-zero elements in the vector $x \in \mathbb{R}^m$.)
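A quick sketch of these norms with NumPy (illustrative values only), including the ordering noted on a later slide ($\|x\|_r \geq \|x\|_p$ for $r < p$) and the $p \to \infty$ limit:

    import numpy as np

    x = np.array([3.0, -4.0, 0.0, 1.0])
    for p in (1, 2, 5, np.inf):
        print(p, np.linalg.norm(x, ord=p))           # 1-norm >= 2-norm >= 5-norm >= inf-norm
    print("count of non-zeros:", np.count_nonzero(x))  # the l0 "counting norm" from the note above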

Page 18

The Lasso and Constrained Optimization

Lasso adds a penalty to the learnt weights:

$\hat{\theta} \in \operatorname{argmin}_\theta\ \frac{1}{2}\sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_1$   (5.25)

where $\|\cdot\|_1$ is known as the 1-norm, defined as $\|\theta\|_1 = \sum_{j=1}^{m} |\theta_j|$.

Equivalent formulation:

minimize over $\theta$:  $\sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2$   (5.26)
subject to:  $\|\theta\|_1 \leq t$   (5.27)

There is a one-to-one correspondence between $\lambda$ and $t$ yielding an identical solution (see Boyd and Vandenberghe's book on Convex Optimization).

No effect if $t > \sum_{i=1}^{m} |\hat{\theta}_i|$ where $\hat{\theta}$ is the least squares estimate.

Page 19

Sparsity

Consider the linear model $h_\theta(x) = \langle\theta, x\rangle$, where $x \in \mathbb{R}^m$.

$m$ might be very big since we don't know what features to use.

Useful: allow the learning system to throw away needless features; "throw away" means the resulting learnt model has $\theta_j = 0$. Helps with:

reduce computation, memory, and resources (e.g., sensor requirements)
improve prediction (reduce noise variables); trade off a bit of bias increase in order to reduce variance (to improve overall accuracy)
helps ease interpretation (what from $x$ is critical to predict $y$); remove dirty variables (data hygiene)
reduce redundant cancellation, i.e., if typically $\theta_i x_i \approx -\theta_j x_j$, there is no effect on prediction even if both $\theta_i$ and $\theta_j$ are not zero

The effective new model has $m' \ll m$ features.

Page 20

Feature selection in ML

Feature Selection: a branch of feature engineering.

Many applications start with massive feature sets, more than is necessary.

Example: spam detection, natural language processing (NLP) and n-gram features. Cancer detection in computational biology, k-mer motifs, gene expression via DNA microarrays, mRNA abundance, biopsies.

Important in particular when $m > n$, but potentially useful for any $n, m$ when features are redundant/superfluous, irrelevant, or expensive.

Overall goal: we want the least costly but still informative feature set.

Sparsity (via shrinkage methods) is one way to do this, but there are other discrete optimization methods too (e.g., the greedy procedure).

Many are implemented in python's sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
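As a small illustration of the sklearn module linked above (a sketch, not from the slides; the synthetic data is made up), univariate selection keeps the k highest-scoring features:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic data: 200 examples, 50 features, only a few of them informative.
    X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

    selector = SelectKBest(score_func=f_classif, k=5)   # keep the 5 best features by an F-test score
    X_small = selector.fit_transform(X, y)
    print(X_small.shape)                                # (200, 5)
    print(selector.get_support(indices=True))           # indices of the selected features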

Page 21

Greedy Forward Selection Heuristic

Let $x \in \mathbb{R}^m$. Let the features be indexed by the integer set $U = \{1, 2, \ldots, m\}$, and for $A \subseteq U$ let $x_A = (x_{a_1}, x_{a_2}, \ldots, x_{a_k})$ be the features indexed by subset $A$, where $|A| = k$. Similarly $\theta_A$ is so defined. Let $h_{\theta_A}(x_A)$ be the best learnt model using only features $A$, giving parameters $\theta_A$ and $\text{accuracy}(A) \approx 1 - \Pr(h_{\theta_A}(X_A) \neq Y)$.

To find the best set ($\max_{A \subseteq U : |A| = k} \text{accuracy}(A)$), we need to consider $\binom{m}{k} = O((me/k)^k)$ subsets by Stirling's formula, exponential in $m$.

Greedy forward selection to select $k$ features, achieving sparsity, costs $O(mk)$ model fits (a code sketch follows this slide):

Algorithm 1: Greedy Forward Selection
1  $A \leftarrow \emptyset$
2  for $i = 1, \ldots, k$ do
3      Choose $v' \in \operatorname{argmax}_{v \in U \setminus A} \text{accuracy}(A \cup \{v\})$
4      $A \leftarrow A \cup \{v'\}$

Better than $\binom{m}{k}$, but we still need to retrain the model each time.

One solution: model accuracy (e.g., submodularity, later). Another solution: Lasso regression (next).
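A minimal sketch of the greedy loop above (illustrative only; using a cross-validated logistic-regression score from sklearn as the accuracy(A) oracle is one possible choice, not the slides' prescription):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=1)

    def accuracy(feature_subset):
        # accuracy(A): cross-validated score of a model trained only on the features in A
        model = LogisticRegression(max_iter=1000)
        return cross_val_score(model, X[:, sorted(feature_subset)], y, cv=3).mean()

    k, U, A = 5, set(range(X.shape[1])), set()
    for _ in range(k):
        best_v = max(U - A, key=lambda v: accuracy(A | {v}))   # line 3 of Algorithm 1
        A.add(best_v)                                          # line 4 of Algorithm 1
    print(sorted(A))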

Page 22

Forward vs. Backwards Selection

Rather than start from the empty set and add one at a time, we can:

Start with all features and greedily remove the feature that hurts us the least: backwards selection.

This is also only a heuristic (might not work well).

Backwards selection: variance is an issue when using many features.

Bidirectional greedy: pick a random order; consider both adding (starting at the empty set) and removing (starting at all features), based on the order, and randomly choose based on a distribution proportional to both:

1. the accuracy gain $\text{accuracy}(A \cup \{v\}) - \text{accuracy}(A)$, and
2. the accuracy loss $\text{accuracy}(S) - \text{accuracy}(S \setminus \{v\})$.

Page 23

Back To Lasso

Ridge regression adds a sum-of-squares penalty to the learnt weights:

$\hat{\theta} \in \operatorname{argmin}_\theta J_{\text{ridge}}(\theta) = \sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_2^2$   (5.1)

where $\|\cdot\|_2$ is known as the 2-norm, defined as $\|\theta\|_2 = \sqrt{\sum_{j=1}^{m} \theta_j^2}$.

Lasso linear regression:

$\hat{\theta} \in \operatorname{argmin}_\theta J_{\text{lasso}}(\theta) = \sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_1$   (5.2)

where $\|\cdot\|_1$ is known as the 1-norm, defined as $\|\theta\|_1 = \sum_{j=1}^{m} |\theta_j|$.
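For contrast with the lasso discussion that follows, a sketch (made-up data, no intercept, for illustration only) of the ridge closed form $\hat{\theta} = (X^\top X + \lambda I)^{-1} X^\top y$, which minimizes (5.1); the lasso objective (5.2) has no such closed form:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, lam = 50, 5, 0.1
    X = rng.normal(size=(n, m))
    theta_true = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=n)

    # Ridge: closed-form minimizer of sum_i (y_i - theta^T x_i)^2 + lam * ||theta||_2^2
    theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
    print(theta_ridge)   # shrunk toward zero, but typically with no exact zeros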


Page 25

p-norm unit ball contours (in 2D)

[Figure: unit-ball contours $\{x \in \mathbb{R}^m : \|x\|_2 = 1\}$, $\{x \in \mathbb{R}^m : \|x\|_1 = 1\}$, and $\{x \in \mathbb{R}^m : \|x\|_\infty = 1\}$ in the $(x_1, x_2)$ plane.]

We have (consistent with the picture) that if $0 < r < p$ then $\|x\|_r \geq \|x\|_p$, thus:

$\{x \in \mathbb{R}^m : \|x\|_1 \leq 1\} \subset \{x \in \mathbb{R}^m : \|x\|_2 \leq 1\} \subset \{x \in \mathbb{R}^m : \|x\|_\infty \leq 1\}$

Note, $\|x\|_\infty = \max_i |x_i|$ is known as the infinity-norm (or max-norm).


Page 27

More p-norm unit ball contours (in 2D)

Right: various $p$-norm balls for $p \geq 1$, and the analogous sets for $p < 1$. For $p = 0$, we get the L0 ball, which counts the number of non-zero entries and is the most sparse-encouraging. The L1 ball is the convex envelope of the L0 ball, so it is arguably the best combination of easy & appropriate.

[Figure: "Various norm balls" in the $(x_1, x_2)$ plane: $\{x : \|x\|_p = 1\}$ for $p = 30, 5, 2, 3/2, 1, 1/2, 1/4$.]

Page 28

Linear Regression, Ridge, Lasso

[Figure (from Hastie et al.): contours of equal squared error $\{\theta : \sum_{i=1}^{n}(y^{(i)} - \theta^\top x^{(i)})^2 = c\}$ in the $(\theta_1, \theta_2)$ plane around the least-squares solution $\hat{\theta}$, together with the ridge constraint region $\{\theta \in \mathbb{R}^m : \|\theta\|_2 \leq t\}$ and the lasso constraint region $\{\theta \in \mathbb{R}^m : \|\theta\|_1 \leq t\}$.]

$\hat{\theta}$ is the linear least squares solution, with quadratic contours of equal error. For a given $\lambda$, the solution trades off regularizer cost with error cost. From Hastie et al.

Page 29

Other norms as regularizers

Using $p$-norms with $p < 1$ encourages sparsity even further (but is harder to optimize).

[Figure: squared-error contours $\{\theta : \sum_{i=1}^{n}(y^{(i)} - \theta^\top x^{(i)})^2 = c\}$ in the $(\theta_1, \theta_2)$ plane, together with the norm balls $\{\theta : \|\theta\|_p = 1\}$ for $p = 30, 5, 2, 3/2, 1, 1/2, 1/4$.]

Page 30

Lasso: How to solve

Closed-form solution for ridge; no closed form for lasso.

We used gradient descent (or SGD) for ridge. The lasso objective is not differentiable, since $|\theta|$ is not differentiable at $\theta = 0$:

$\dfrac{\partial}{\partial\theta}|\theta| = \begin{cases} 1 & \text{if } \theta > 0 \\ -1 & \text{if } \theta < 0 \\ \text{undefined} & \text{if } \theta = 0 \end{cases}$   (5.3)

How to solve? Many possible ways.

Simple and effective: coordinate descent, where we find a subgradient of the objective $J_{\text{lasso}}(\theta)$ w.r.t. one element of $\theta$ at a time and solve it.

Use subgradient descent (see https://see.stanford.edu/materials/lsocoee364b/01-subgradients_notes.pdf for full details).

Results in a simple solution: soft thresholding.


Page 36

Coordinate Descent

Isolate one coordinate $k$ and fix the remainder $j \neq k$:

$\sum_{i=1}^{n}\big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda\|\theta\|_1 = \sum_{i=1}^{n}\Big(y^{(i)} - \sum_{j=1}^{m}\theta_j x_j^{(i)}\Big)^2 + \lambda\|\theta\|_1$   (5.4)

$= \sum_{i=1}^{n}\Big(y^{(i)} - \theta_k x_k^{(i)} - \sum_{j \neq k}\theta_j x_j^{(i)}\Big)^2 + \lambda\sum_{j \neq k}|\theta_j| + \lambda|\theta_k|$   (5.5)

$= \sum_{i=1}^{n}\big(r_k^{(i)} - \theta_k x_k^{(i)}\big)^2 + \lambda\sum_{j \neq k}|\theta_j| + \lambda|\theta_k|$   (5.6)

$= J_{\lambda,\text{lasso}}(\theta_k;\ \theta_1, \theta_2, \ldots, \theta_{k-1}, \theta_{k+1}, \ldots, \theta_m)$   (5.7)

where $r_k^{(i)} \triangleq y^{(i)} - \sum_{j \neq k}\theta_j x_j^{(i)}$ is the residual with coordinate $k$ held out.

We then optimize only $\theta_k$, holding the remainder fixed, and then choose a different $k'$ for the next round.


Page 38

Coordinate Descent

General algorithm, one round of coordinate descent to be repeated.

Algorithm 2: Coordinate Descent for Lasso Solution
1  $\theta \leftarrow 0$
2  for $k = 1, \ldots, m$ do
3      $r_k^{(i)} \leftarrow y^{(i)} - \sum_{j \neq k} x_j^{(i)}\theta_j$
4      $\theta_k \leftarrow \operatorname{argmin}_{\theta_k}\ \sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2 + \lambda|\theta_k|$

We repeat the above several times until convergence.

This doesn't help with the lack of differentiability, though. How do we perform the minimization?

Page 39

Convexity, Subgradients, and Subdifferential

Convexity: a general function $f : \mathbb{R}^m \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^m$ and $\lambda \in [0, 1]$:

$f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda)f(y)$   (5.8)

Convex functions have no local minima other than the global minima.

A vector $g \in \mathbb{R}^m$ is a subgradient of $f$ at $x$ if for all $z$ we have $f(z) \geq f(x) + g^\top(z - x)$. Valid both when $f$ is differentiable (e.g., at $x_1$) and when $f$ is not differentiable (e.g., at $x_2$).

[Figure: a convex $f(z)$ with a supporting line $f(x_1) + g_1^\top(z - x_1)$ at a differentiable point $x_1$, and two supporting lines $f(x_2) + g_2^\top(z - x_2)$ and $f(x_2) + g_3^\top(z - x_2)$ at a non-differentiable (kink) point $x_2$.]

The set of all subgradients of $f$ at $x$ is known as the subdifferential at $x$:

$\partial f(x) = \{g \in \mathbb{R}^m : \forall z,\ f(z) \geq f(x) + g^\top(z - x)\}$   (5.9)

If $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.


Page 45

Convexity, Subgradients, and Lasso

Fact: a convex function is minimized at $x^*$ if the vector 0 is a subgradient at $x^*$. In other words,

$0 \in \partial f(x^*) \iff f(x^*) \leq f(x)\ \ \forall x$   (5.10)

That is, a minimum means zero is a member of the subdifferential.

When $f$ is everywhere differentiable, this is the same as finding $x^*$ that satisfies $\nabla_x f(x^*) = 0$.

Fact: the lasso coordinate objective $\sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2 + \lambda|\theta_k|$ is convex in the parameter $\theta_k$.

Hence, we find a value of $\theta_k$ so that zero is a subgradient (i.e., is a member of the subdifferential).

With the L1 objective, things are thus almost as easy as with L2.
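As a scalar illustration of this optimality condition (an added worked example, not from the slides), take $f(\theta) = \tfrac{1}{2}(\theta - c)^2 + \lambda|\theta|$ with $\lambda > 0$. Its subdifferential and minimizer are

\[
\partial f(\theta) =
\begin{cases}
\{\theta - c + \lambda\} & \theta > 0,\\
[-c - \lambda,\ -c + \lambda] & \theta = 0,\\
\{\theta - c - \lambda\} & \theta < 0,
\end{cases}
\qquad
0 \in \partial f(\theta^*)\ \text{at}\ \ \theta^* = \operatorname{sign}(c)\,\max(0,\ |c| - \lambda).
\]

If $|c| \leq \lambda$, the interval at $\theta = 0$ contains zero, so $\theta^* = 0$; otherwise the smooth branch gives $\theta^* = c - \lambda\,\operatorname{sign}(c)$. This is the same soft-thresholding pattern derived for the lasso coordinate update on the following slides.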


Page 51

Lasso's Subdifferential

Define $a_k = 2\sum_{i=1}^{n}\big(x_k^{(i)}\big)^2$ and $c_k = 2\sum_{i=1}^{n} r_k^{(i)} x_k^{(i)}$. Note $a_k > 0$.

Find $\theta_k$ making the following contain zero:

$\partial f = \partial\Big(\sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2 + \lambda|\theta_k|\Big) = \begin{cases} \{a_k\theta_k - c_k - \lambda\} & \text{if } \theta_k < 0 \\ [-c_k - \lambda,\ -c_k + \lambda] & \text{if } \theta_k = 0 \\ \{a_k\theta_k - c_k + \lambda\} & \text{if } \theta_k > 0 \end{cases}$

[Figure: $\partial f$ as a function of $\theta_k$, a line of slope $a_k$ with a jump of height $2\lambda$ at $\theta_k = 0$, drawn for the three cases $c_k < -\lambda$, $|c_k| \leq \lambda$, and $c_k > \lambda$.]

The value of $\theta_k$ such that $0 \in \partial f$ depends on $c_k$; the three cases $c_k < -\lambda$, $|c_k| \leq \lambda$, and $c_k > \lambda$ are shown on the right.

If $c_k < -\lambda$ or $c_k > \lambda$, $f$ is differentiable at the point where $0 \in \partial f = \{\nabla f\}$, giving an analytic solution for the minimum (where the pink/orange line crosses zero, the x-axis).

If $|c_k| \leq \lambda$, then $0 \in [-c_k - \lambda,\ -c_k + \lambda]$, hence $\theta_k = 0$ is the minimum.
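For completeness, here is a short added derivation (not spelled out on the slide) of where $a_k$ and $c_k$ come from. Differentiating the smooth part of the coordinate objective with respect to $\theta_k$,

\[
\frac{d}{d\theta_k}\sum_{i=1}^{n}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)^2
= -2\sum_{i=1}^{n} x_k^{(i)}\big(r_k^{(i)} - x_k^{(i)}\theta_k\big)
= \Big(2\sum_{i=1}^{n}\big(x_k^{(i)}\big)^2\Big)\theta_k - 2\sum_{i=1}^{n} r_k^{(i)} x_k^{(i)}
= a_k\theta_k - c_k ,
\]

while the subgradient of $\lambda|\theta_k|$ contributes $+\lambda$ for $\theta_k > 0$, $-\lambda$ for $\theta_k < 0$, and the interval $[-\lambda, \lambda]$ at $\theta_k = 0$; adding the two pieces gives exactly the three cases above.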


Page 56: Advanced Introduction to Machine Learning · 2020. 9. 27. · for regression. (e.g., support vector machines, max-margin learning, conditional likelihood, etc.) Sometimes they are

Lasso Regularizers Curse of Dimensionality Dimensionality Reduction

Soft Thresholding

This gives us the following solution (note again $a_k > 0$):
$$
\theta_k =
\begin{cases}
(c_k + \lambda)/a_k & \text{if } c_k < -\lambda \ \ (\text{making } \theta_k < 0) \\
0 & \text{if } |c_k| \le \lambda \ \ (\text{making } \theta_k = 0) \\
(c_k - \lambda)/a_k & \text{if } c_k > \lambda \ \ (\text{making } \theta_k > 0)
\end{cases}
\tag{5.11}
$$
$$
= \operatorname{sign}(c_k)\max(0, |c_k| - \lambda)/a_k \tag{5.12}
$$

Note that without any Lasso term, the solution for $\theta_k$ would be $\theta_k = c_k/a_k = \operatorname{sign}(c_k)|c_k|/a_k = \operatorname{sign}(c_k)\max(0, |c_k|)/a_k$.

The above has a nice intuitive interpretation: we solve as normal (using gradients, $c_k/a_k$, and $\lambda$) as long as $|c_k|$ is not too small.

If $|c_k|$ does get too small (the middle case), then we simply truncate $\theta_k$ to zero, and when we truncate depends on $\lambda$ and $c_k$.
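For concreteness, here is a minimal coordinate-descent sketch (ours, not the course's reference implementation) that applies the soft-threshold update of Eq. (5.12). The names `soft_threshold` and `lasso_coordinate_descent` are ours, and the partial residual is formed by removing feature $k$'s current contribution, as assumed in the derivation above.

```python
import numpy as np

def soft_threshold(c, lam, a):
    # theta_k = sign(c_k) * max(0, |c_k| - lambda) / a_k   (Eq. 5.12)
    return np.sign(c) * max(0.0, abs(c) - lam) / a

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize sum_i (y_i - theta^T x_i)^2 + lam * ||theta||_1 by cycling
    over coordinates, each one solved exactly by soft thresholding."""
    n, m = X.shape
    theta = np.zeros(m)
    for _ in range(n_iters):
        for k in range(m):
            # partial residual with feature k's current contribution removed
            r_k = y - X @ theta + X[:, k] * theta[k]
            a_k = 2.0 * np.sum(X[:, k] ** 2)
            c_k = 2.0 * np.sum(r_k * X[:, k])
            theta[k] = soft_threshold(c_k, lam, a_k)
    return theta
```

Each inner step solves the one-dimensional subproblem exactly, which is why simply cycling over coordinates works well for this convex objective.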


Simple solution: X has orthonormal columns

Let $\theta$ be the linear least squares solution. When the design matrix $X$ has orthonormal columns (a very special case), then the solutions are simple modifications of $\theta$.

Ridge solution becomes $\theta_j/(1 + \lambda)$: shrinkage.

Lasso solution becomes $\operatorname{sign}(\theta_j)\max(0, |\theta_j| - \lambda)$: soft thresholding.

Best subset of size $k$: $\theta_{\rho_j}\mathbf{1}\{j \le k\}$, where $\rho = (\rho_1, \rho_2, \ldots, \rho_m)$ is a permutation of $\{1, 2, \ldots, m\}$ such that $|\theta_{\rho_1}| \ge |\theta_{\rho_2}| \ge \cdots \ge |\theta_{\rho_m}|$: hard thresholding.
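A small numpy sketch (ours) of the three rules applied to a least squares solution `theta` under the orthonormal-columns assumption:

```python
import numpy as np

def ridge_orth(theta, lam):
    # shrinkage: theta_j / (1 + lambda)
    return theta / (1.0 + lam)

def lasso_orth(theta, lam):
    # soft thresholding: sign(theta_j) * max(0, |theta_j| - lambda)
    return np.sign(theta) * np.maximum(0.0, np.abs(theta) - lam)

def best_subset_orth(theta, k):
    # hard thresholding: keep the k coefficients largest in magnitude
    keep = np.argsort(-np.abs(theta))[:k]
    out = np.zeros_like(theta)
    out[keep] = theta[keep]
    return out

theta = np.array([3.0, -0.4, 1.5, 0.2, -2.0])
print(ridge_orth(theta, 1.0))       # every coefficient shrunk toward 0
print(lasso_orth(theta, 1.0))       # small coefficients set exactly to 0
print(best_subset_orth(theta, 2))   # only the two largest survive
```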


Shrinkage relative to linear regression

(Figure: three panels, Best Subset, Ridge, and Lasso, each plotting the modified coefficient against the least squares coefficient; all pass through $(0,0)$, and $|\theta_{\rho_k}|$ marks the hard-threshold cutoff on the Best Subset panel.)


Other regularizers

$$
\theta \in \operatorname*{argmin}_{\theta} \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda R(\theta) \tag{5.13}
$$

We've seen ridge $R(\theta) = \|\theta\|_2^2$ and lasso $R(\theta) = \|\theta\|_1$.

$p$-"norm" regularizers $R(\theta) = \|\theta\|_p^p$ for $0 < p < 1$; not a norm.

Mixed norms, elastic-net: $R(\theta) = \alpha\|\theta\|_1 + (1 - \alpha)\|\theta\|_2$ for $0 \le \alpha \le 1$.

Grouped Lasso:
$$
\theta \in \operatorname*{argmin}_{\theta} \sum_{i=1}^n \big(y^{(i)} - \theta^\top x^{(i)}\big)^2 + \lambda \sum_{j=1}^g \big\|\theta_{A_j}\big\|_2 \tag{5.14}
$$
where $A_1, A_2, \ldots, A_g$ is a set of groups of feature indices, and the regularizer is a sum of (not-squared) 2-norms over the groups.

Total Variation: $R(\theta) = \sum_{j=1}^{m-1} |\theta_j - \theta_{j+1}|$.

Structured regularizers: the contours can be other convex polyhedra (besides just the octahedron in 3D).
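The penalties themselves are one-liners. Here is a short sketch (ours, with a hypothetical grouping $A_1, A_2, A_3$) evaluating each $R(\theta)$ listed above for a given coefficient vector:

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.0, 0.5, 3.0, -0.1])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # hypothetical A_1, A_2, A_3

ridge       = np.sum(theta ** 2)                              # ||theta||_2^2
lasso       = np.sum(np.abs(theta))                           # ||theta||_1
p_norm      = np.sum(np.abs(theta) ** 0.5)                    # ||theta||_p^p with p = 1/2
alpha       = 0.7
elastic_net = alpha * lasso + (1 - alpha) * np.linalg.norm(theta)  # alpha*||.||_1 + (1-alpha)*||.||_2
group_lasso = sum(np.linalg.norm(theta[A]) for A in groups)   # sum of (not-squared) group 2-norms
total_var   = np.sum(np.abs(np.diff(theta)))                  # sum_j |theta_j - theta_{j+1}|

print(ridge, lasso, p_norm, elastic_net, group_lasso, total_var)
```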


As the dimensions grow

$d$th order linear model, with $x \in \mathbb{R}^m$:
$$
f_\theta(x) = \theta_0 + \sum_{i=1}^m \theta_i x_i + \sum_{i=1}^m\sum_{j=1}^m \theta_{i,j}\, x_i x_j + \sum_{i=1}^m\sum_{j=1}^m\sum_{k=1}^m \theta_{i,j,k}\, x_i x_j x_k + \cdots + \sum_{i_1=1}^m \sum_{i_2=1}^m \cdots \sum_{i_d=1}^m \theta_{i_1,i_2,\ldots,i_d} \prod_{\ell=1}^d x_{i_\ell} \tag{5.15}
$$

The number of parameters to learn grows as $O(m^d)$, exponentially in the order $d$.
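A quick count (ours) of how fast this blows up, taking one parameter per index tuple as written in Eq. (5.15):

```python
# Number of parameters in the d-th order model: 1 + m + m^2 + ... + m^d.
m = 20
for d in range(1, 5):
    n_params = sum(m ** j for j in range(d + 1))
    print(d, n_params)
# d=1: 21, d=2: 421, d=3: 8421, d=4: 168421 -- O(m^d) growth.
```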


Geometric Tessellation. Inner and Surface Volume

Tessellation regions grow exponentially with dimension.

(Figure: example grids with r = 5, m = 1; r = 4, m = 2; and r = 3, m = 3.)

With $r$ distinct values (edge length $r$), we have $r^m$ combinations in $m$ dimensions.

Surface volume is $2m r^{m-1}$ while inner volume is $r^m$. The ratio of surface area to volume is $2m r^{m-1}/r^m = 2m/r$.

A lot more inner substance, relative to surface, as $r$ grows: $2m/r \to 0$ as $r \to \infty$.

A lot more surface, relative to inner volume, as $m$ grows: $2m/r \to \infty$ as $m \to \infty$.


Data in high dimensions

From Hastie et al., 2009.

Suppose the data $D = \{x^{(i)}\}_{i=1}^n$ are distributed uniformly at random within the $m$-dimensional unit box, so $x^{(i)} \in [0,1]^m$ (thus $0 \le x_j^{(i)} \le 1$).

If we take an arbitrary data point and a fraction $0 < r < 1$ of the closest data points, this carves out an $m$-dimensional volume with expected edge length $r^{1/m}$, so $e_m(r) = r^{1/m}$.

One percent of the data in ten dimensions has expected edge length $e_{10}(0.01) = 0.63$. Ten percent of the data: $e_{10}(0.1) = 0.8$. 0.01 percent of the data in 100 dimensions: $e_{100}(0.0001) = 0.91$.

Hence, since the maximum edge length is 1, a small fraction $r$ of the data takes up most of the edge length range.

In fact, $\lim_{m\to\infty} e_m(\epsilon) = 1$ for any $\epsilon > 0$.
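Verifying the quoted numbers (a one-liner sketch, ours):

```python
def edge_length(r, m):
    # expected edge length of the sub-box holding a fraction r of the data
    return r ** (1.0 / m)

print(edge_length(0.01, 10))     # ~0.631
print(edge_length(0.10, 10))     # ~0.794
print(edge_length(0.0001, 100))  # ~0.912
```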


All peel and no core . . .

Typical McIntosh apple (meaning the fruit, not the phone): 2.75 inches in diameter (radius $r = 3.5$ cm), with a thin 0.3 mm ($\epsilon = 0.03$ cm) peel. Sphere volume is $\frac{4}{3}\pi r^3$, so the apple is about 180 cm$^3$, while the peel volume is $\frac{4}{3}\pi r^3 - \frac{4}{3}\pi(r - \epsilon)^3 \approx 4.5$ cm$^3$, a relative volume of about $4.5/180 = 2.5\%$.

Volume of a sphere in $m$ dimensions is $V_m(r) = V_m r^m$ for an appropriate $m$-dependent constant $V_m$ (on HW3).

Fraction of a radius-1 apple that is peel (with peel thickness $\epsilon$) is
$$
\frac{V_m(1) - V_m(1 - \epsilon)}{V_m(1)} = 1 - (1 - \epsilon)^m \tag{5.16}
$$

As the dimensionality $m$ gets larger, the apple becomes all peel and no core.
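A quick check (ours) of the 3-D numbers and of Eq. (5.16) as $m$ grows, using the apple's relative peel thickness $\epsilon = 0.03/3.5$:

```python
import math

r, eps = 3.5, 0.03
apple = 4.0 / 3.0 * math.pi * r ** 3
peel = apple - 4.0 / 3.0 * math.pi * (r - eps) ** 3
print(apple, peel, peel / apple)           # ~179.6 cm^3, ~4.6 cm^3, ~2.5%

rel_eps = eps / r                          # peel thickness for a radius-1 apple
for m in [3, 10, 100, 1000]:
    print(m, 1.0 - (1.0 - rel_eps) ** m)   # peel fraction -> 1 as m grows
```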


Uniformly at random unit vectors on unit sphere

Let $S^{m-1}(r)$ indicate the surface "area" of the $m$-dimensional sphere of radius $r$.

Define the uniform distribution $U_{S^{m-1}(1)}$ on $S^{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S^{m-1}(1)}$; hence $\|x\|_2 = \|y\|_2 = 1$.

Two vectors are orthogonal if $\langle x, y\rangle = 0$ and are nearly so if $|\langle x, y\rangle| < \epsilon$ for small $\epsilon$.

It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then
$$
\Pr(|\langle x, y\rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \tag{5.17}
$$
which means that uniformly at random vectors in high dimensional space are almost always nearly orthogonal!

This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.
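A small Monte Carlo sketch (ours; it samples uniform unit vectors by normalizing Gaussians) comparing the empirical frequency of near-orthogonality with the lower bound in Eq. (5.17):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(m, n):
    # normalizing i.i.d. Gaussians gives the uniform distribution on the sphere
    v = rng.normal(size=(n, m))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

m, eps, trials = 1000, 0.1, 5000
x = random_unit_vectors(m, trials)
y = random_unit_vectors(m, trials)
inner = np.sum(x * y, axis=1)
empirical = np.mean(np.abs(inner) < eps)
bound = 1.0 - np.exp(-m * eps ** 2 / 2.0)
print(empirical, bound)   # the empirical frequency should exceed the lower bound
```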


Inner/Outer Spherical Bounds on Cubic Boxes

Consider a cube in $m$ dimensions, $[0,1]^m$, having unit volume. The largest sphere contained by this cube has diameter 1 (the inner circle in the figure). The smallest sphere containing this cube has diameter $\sqrt{m}$ (the outer circle).

(Figure: the unit cube with its inscribed sphere of diameter 1 and circumscribed sphere of diameter $\sqrt{m}$.)

Fraction of inner spherical volume (unit sphere of radius $1/2$, i.e., $V_m(r)$ for radius $r = 1/2$) to unit cube volume (which is 1 for any $m$):
$$
\frac{\text{inner spherical vol.}}{\text{unit cube vol.}} = \frac{V_m(r)}{1} = \frac{\pi^{m/2}}{\Gamma(\frac{m}{2} + 1)}\, r^m \tag{5.18}
$$

We have $\lim_{m\to\infty} V_m(1/2)/1 = 0$. Hence the number of such sphere volumes that fit within the cube's volume goes to infinity: by volume, more and more of these spheres could be packed into a unit cube as the dimensionality gets higher.
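Evaluating Eq. (5.18) numerically (a short sketch, ours, using the standard library's `math.gamma`):

```python
import math

def ball_volume(m, r):
    # V_m(r) = pi^(m/2) / Gamma(m/2 + 1) * r^m
    return math.pi ** (m / 2.0) / math.gamma(m / 2.0 + 1.0) * r ** m

for m in [2, 3, 10, 20, 50]:
    v = ball_volume(m, 0.5)
    print(m, v, 1.0 / v)   # inscribed-sphere volume shrinks; cube-to-sphere volume ratio blows up
```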


Using 2D to represent High-D as if 2D was High-D

Relationship between the unit-radius ($r = 1$) sphere and the unit-volume (side length 1) cube as the dimensions grow, drawn as if it were also true in 2D.

Note that for the cube, as $m$ gets higher, the distance from center to face stays at $1/2$ but the distance from center to vertex grows as $\sqrt{m}/2$. For $m = 4$ the vertex distance is $\sqrt{m}/2 = 1$.

(Figure: the unit-radius sphere overlaid on cubes with half-widths $1/2$, $\sqrt{2}/2$, and $\sqrt{m}/2$; annotations "unit radius sphere", "nearly all the volume", and "vertex of hypercube". Illustration of the relationship between the sphere and the cube in 2, 4, and $m$ dimensions, from Blum, Hopcroft, & Kannan, 2016.)


The Blessing of Dimensionality

More dimensions can help to distinguish categories and make pattern recognition easier (and even possible). Recall Voronoi tessellation.

Support vector machines (SVMs) can find and exploit data patterns extant only in extremely high (or even infinite) dimensional space!!!

For still more blessings, see "High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality", David L. Donoho.


Feature Selection vs. Dimensionality Reduction

We've already seen that feature selection is one strategy to reduce feature dimensionality.

In feature selection, each feature is either selected or not: all or nothing.

Other dimensionality reduction strategies take the input $x \in \mathbb{R}^m$ and encode each $x$ into $e(x) \in \mathbb{R}^{m'}$, a lower dimensional space with $m' < m$.

Core advantage: if $U = \{1, 2, \ldots, m\}$, there may be no subset $A \subseteq U$ such that $x_A$ works well, yet there might be a combination $e(x)$ that works well. In the figure on the right, the simple linear combination $a_1 x_1 + a_2 x_2$ works quite well.
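A toy construction (ours) in the spirit of that figure: neither raw coordinate separates the two classes on its own, but the single linear combination $x_1 + x_2$ does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)             # class labels 0/1
t = rng.uniform(-1.0, 1.0, size=n)
x1 = t                                     # marginally uninformative on its own
x2 = -t + np.where(y == 1, 0.5, -0.5)      # heavily overlapping on its own

def acc_threshold(feature, y):
    # accuracy of the sign rule "predict 1 if feature > 0" (or its flip)
    pred = (feature > 0).astype(int)
    return max(np.mean(pred == y), np.mean(pred != y))

print(acc_threshold(x1, y))        # ~0.5: x1 alone is useless
print(acc_threshold(x2, y))        # ~0.75: x2 alone is mediocre
print(acc_threshold(x1 + x2, y))   # 1.0: the combination separates perfectly
```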


Principal Component Analysis (PCA)

PCA is a linear encoding $e(x) = x^\top W$ for an $m \times m$ matrix $W$.

Let $X$ be the $n \times m$ design matrix; thus $X' = XW$ is the encoded $n \times m$ design matrix.

Any linear transformation encoder can be so written. PCA has orthonormal columns in $W$ (so $\langle W(:,i), W(:,j)\rangle = \mathbf{1}\{i = j\}$), ordered decreasing by ability to explain variance in the original space.

Thus, taking the first $m' \le m$ columns of $W$, giving the $m \times m'$ matrix $W_k$, yields $XW_k$, a dimensionality reduction.

When $m' < m$, the projection is onto a subspace that maximizes the variance over all possible subspaces of dimension $m'$. When $m' = m$, we simply order the columns decreasing by variance explanation (so taking any length-$m'$ prefix of columns gives the answer).
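A minimal PCA sketch (ours) via an eigendecomposition of the sample covariance, matching the description above: the columns of `W` are orthonormal and ordered by explained variance, and `X @ W[:, :m_prime]` is the reduced design matrix.

```python
import numpy as np

def pca_components(X):
    """Return eigenvalues (descending) and the matrix W whose columns are
    the corresponding orthonormal eigenvectors of the sample covariance."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    evals, evecs = np.linalg.eigh(S)      # eigh returns ascending order
    order = np.argsort(evals)[::-1]       # re-order: largest variance first
    return evals[order], evecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated features
evals, W = pca_components(X)
m_prime = 2
X_reduced = X @ W[:, :m_prime]   # X W_k as on the slide (X is often mean-centered first)
print(evals)
print(X_reduced.shape)            # (500, 2)
```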


Maximum Variance: 1D projection

Consider $m' = 1$ to start; the vector $w^{(1)} \in \mathbb{R}^m$ is the projection (i.e., $\langle x, w^{(1)}\rangle$ projects to 1D).

Assume unit vectors: $\|w^{(1)}\|_2 = \sqrt{\langle w^{(1)}, w^{(1)}\rangle} = 1$.

The mean of the data is $\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}$, so the mean of the projected data is $\langle w^{(1)}, \bar{x}\rangle$.

Variance of the projected data:
$$
\frac{1}{n}\sum_{i=1}^n \Big(\langle w^{(1)}, x^{(i)}\rangle - \langle w^{(1)}, \bar{x}\rangle\Big)^2 = w^{(1)\top} S\, w^{(1)} \tag{5.19}
$$
where
$$
S = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^\top \tag{5.20}
$$
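A quick numerical confirmation (ours) that the sample variance of the 1-D projections equals $w^\top S w$, for the scatter matrix of Eq. (5.20) and any unit vector $w$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # some correlated data
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / X.shape[0]                # Eq. (5.20)

w = rng.normal(size=4)
w /= np.linalg.norm(w)                                    # unit projection direction

proj = X @ w                                              # <w, x^(i)> for all i
lhs = np.mean((proj - w @ xbar) ** 2)                     # left-hand side of Eq. (5.19)
rhs = w @ S @ w
print(lhs, rhs, np.isclose(lhs, rhs))
```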


Maximum Variance: 1D projection

Goal: maximize the projected variance $w^{(1)\top} S w^{(1)} = \langle w^{(1)}, S w^{(1)}\rangle$ subject to $\|w^{(1)}\|_2 = 1$.

Constrained optimization via Lagrange multipliers yields the stationary point $S w^{(1)} = \lambda_1 w^{(1)}$, so $w^{(1)}$ is an eigenvector of $S$ with eigenvalue $\lambda_1$.

Pre-multiplying by $w^{(1)}$ and using $\|w^{(1)}\|_2 = 1$ yields
$$
\big\langle w^{(1)}, S w^{(1)}\big\rangle = \lambda_1 \tag{5.21}
$$

The largest eigenvalue of $S$ gives the greatest variance, and the projection direction achieving it is the corresponding eigenvector $w^{(1)}$.

This is the first principal component.
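Continuing the sketch above (ours): the top eigenvector of $S$ attains the maximum projected variance $\lambda_1$, and no random unit direction exceeds it.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / X.shape[0]

evals, evecs = np.linalg.eigh(S)       # ascending eigenvalues
lam1, w1 = evals[-1], evecs[:, -1]     # largest eigenvalue and its eigenvector
print(lam1, w1 @ S @ w1)               # equal, as in Eq. (5.21)

best_random = 0.0
for _ in range(1000):
    w = rng.normal(size=4)
    w /= np.linalg.norm(w)
    best_random = max(best_random, w @ S @ w)
print(best_random <= lam1 + 1e-9)      # True: no direction beats the first PC
```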

Prof. Je↵ Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29nd, 2020 F47/65 (pg.115/170)

"

Page 116: Advanced Introduction to Machine Learning · 2020. 9. 27. · for regression. (e.g., support vector machines, max-margin learning, conditional likelihood, etc.) Sometimes they are


Subsequent components

To produce the $k$th component given the first $k-1$ components, first form

$$X_k = X - \sum_{i=1}^{k-1} X w^{(i)} w^{(i)\top} \qquad (5.22)$$

and then proceed as before: maximize the projected variance $\langle w^{(k)}, S_k w^{(k)} \rangle$ subject to $\|w^{(k)}\|_2 = 1$, where $S_k$ is computed from $X_k$.

Overall, the collection $\{w^{(i)}, \lambda_i\}_i$ are the $m$ eigenvector/eigenvalue pairs of the matrix $S$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F48/65
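A small numpy sketch of this deflation view (illustrative only; in practice one simply takes all the eigenvectors of S at once). It works with centered data; the synthetic X below is not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
    Xc = X - X.mean(axis=0)                      # centered data
    n = len(Xc)
    S = Xc.T @ Xc / n

    W = []                                       # deflation directions w^(1), w^(2), w^(3)
    Xk = Xc.copy()
    for k in range(3):
        Sk = Xk.T @ Xk / n                       # scatter of the deflated data
        wk = np.linalg.eigh(Sk)[1][:, -1]        # its top eigenvector
        W.append(wk)
        Xk = Xk - np.outer(Xk @ wk, wk)          # X_{k+1} = X_k - X_k w^(k) w^(k)^T, as in (5.22)

    vecs = np.linalg.eigh(S)[1]                  # eigenvectors of the full S, ascending order
    for k, wk in enumerate(W):
        print(abs(wk @ vecs[:, -1 - k]))         # ~1.0: matches the k-th eigenvector (up to sign)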


PCA: Minimizing reconstruction error, $m' < m$

Let $u_i$, $i \in [m]$, be a set of orthonormal vectors in $\mathbb{R}^m$.

Basis decomposition: any sample $x^{(i)}$ can be written as $x^{(i)} = \sum_{j=1}^{m} \alpha_{i,j} u_j = \sum_{j=1}^{m} \langle x^{(i)}, u_j \rangle u_j$.

Suppose we decide to use only $m' < m$ dimensions, so $\hat{x}^{(i)} = \sum_{j=1}^{m'} \langle x^{(i)}, u_j \rangle u_j$.

Reconstruction error:

$$J_{m'} = \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - \hat{x}^{(i)} \big\|_2^2 \qquad (5.23)$$

To minimize this, it can be shown that we should choose $u_i$ to be the eigenvector of $S$ corresponding to the $i$th largest eigenvalue.

Let $W$ be the matrix whose columns are the eigenvectors of $S$ sorted in decreasing order of eigenvalue, and let $W_{m'} = W(:, 1{:}m')$ be the $m \times m'$ matrix of its first $m'$ columns.

Then $X W_{m'}$ projects down onto the first $m'$ principal components; this is also widely known as the Karhunen-Loève transform (KLT).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F49/65
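A hedged numpy sketch of the projection $X W_{m'}$ and of the reconstruction error $J_{m'}$ in (5.23) (illustrative, centered variant with synthetic data; with squared error, $J_{m'}$ equals the sum of the discarded eigenvalues):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
    xbar = X.mean(axis=0)
    Xc = X - xbar
    S = Xc.T @ Xc / len(X)

    m_prime = 2
    vals, vecs = np.linalg.eigh(S)                 # ascending eigenvalues
    W_mp = vecs[:, ::-1][:, :m_prime]              # m x m' matrix of the top-m' eigenvectors

    Z = Xc @ W_mp                                  # n x m' projected (KLT) coordinates
    X_hat = Z @ W_mp.T + xbar                      # reconstruction back in R^m

    J = np.mean(np.sum((X - X_hat)**2, axis=1))    # reconstruction error J_{m'}, as in (5.23)
    print(J, vals[:-m_prime].sum())                # equals the sum of the discarded eigenvalues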


Example: 2D and two principal directions

[Figure: 2D data with its two principal directions.]

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F50/65

PCA and discriminability

PCA projects down to the directions of greatest variance. Are these also the best directions for classification?

For classification, we would want the samples in different classes to be as separate as possible, i.e., to make it as easy as possible to discern the difference between (i.e., discriminate) the different classes.

Any projection that blends samples from different classes on top of each other would be undesirable.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F51/65


Example: PCA on 2-class data

[Figure: PCA on 2-class data.]

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F52/65

Example: Data normalization

From Bishop 2006.

[Figure, three panels.] Left: original data. Center: mean subtraction and variance normalization in each dimension. Right: decorrelated data (i.e., diagonal covariance).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F53/65
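The three panels correspond to three simple preprocessing steps; a short numpy sketch of them on synthetic data (illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])   # correlated 2D data

    Xm = X - X.mean(axis=0)                       # mean subtraction
    Xs = Xm / Xm.std(axis=0)                      # variance normalization in each dimension
    evals, evecs = np.linalg.eigh(np.cov(Xs.T))   # eigenvectors of the (normalized) covariance
    Xd = Xs @ evecs                               # rotate onto them: decorrelated data
    print(np.round(np.cov(Xd.T), 3))              # approximately diagonal covariance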

PCA vs. best axis for class separability

From Bishop 2006.

[Figure.] Projecting onto the direction of greatest variability leads to poor class separation, while another 1D projection exists that retains good separability.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F54/65

PCA and lack of class separability

From Bishop 2006.

[Figure.] Left: two dimensions ($x_6$, $x_7$) of the original data. Right: PCA projection onto the two dimensions of greatest variability, which is bad for two reasons: (I) it spreads the individual classes (e.g., red), and (II) it blends distinct classes (blue and green) on top of each other; both make classification no easier.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F55/65

PCA vs. LDA

From Bishop 2006.

[Figure.] PCA vs. Linear Discriminant Analysis (LDA).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F56/65

Axis-parallel projection vs. LDA

From Hastie et al., 2009.

[Figure.] Axis-parallel projection (left) vs. LDA (right).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F57/65

Linear Discriminant Analysis (LDA), 2 classes

Consider class-conditional Gaussian data, so $p(x|y) = \mathcal{N}(x \mid \mu_y, C_y)$ for mean vectors $\{\mu_y\}_y$ and covariance matrices $\{C_y\}_y$, with $x \in \mathbb{R}^m$:

$$p(x|y) = \frac{1}{|2\pi C_y|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_y)^\top C_y^{-1} (x - \mu_y) \Big) \qquad (5.24)$$

In the two-class case $y \in \{0, 1\}$ with equal covariances $C_0 = C_1 = C$ (the homoscedasticity property) and equal priors $p(y=0) = p(y=1)$, consider the log posterior odds ratio:

$$\log \frac{p(y=1|x)}{p(y=0|x)} = -\frac{1}{2}(x - \mu_1)^\top C_1^{-1}(x - \mu_1) + \frac{1}{2}(x - \mu_0)^\top C_0^{-1}(x - \mu_0) \qquad (5.25)$$

$$= (C^{-1}\mu_1 - C^{-1}\mu_0)^\top x + \tfrac{1}{2}\mu_0^\top C^{-1}\mu_0 - \tfrac{1}{2}\mu_1^\top C^{-1}\mu_1 \qquad (5.26)$$

$$= \theta^\top x + c \qquad (5.27)$$

$\theta$ is a projection (an $m \times 1$ matrix, a linear transformation) down to the one dimension that is sufficient for prediction without loss.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F58/65
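A hedged numpy sketch of this two-class case (illustrative; the data are synthetic, the class means and shared covariance are estimated from samples, and classification thresholds $\theta^\top x + c$ at zero as in (5.27)):

    import numpy as np

    rng = np.random.default_rng(1)
    n_per = 200
    L = np.linalg.cholesky(np.array([[2.0, 1.2], [1.2, 1.0]]))        # shared covariance factor
    X0 = rng.normal(size=(n_per, 2)) @ L.T                            # class 0, mean (0, 0)
    X1 = rng.normal(size=(n_per, 2)) @ L.T + np.array([2.0, 1.0])     # class 1, mean (2, 1)

    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    centered = np.vstack([X0 - mu0, X1 - mu1])
    C = centered.T @ centered / (2 * n_per - 2)                       # pooled covariance estimate
    Cinv = np.linalg.inv(C)

    theta = Cinv @ (mu1 - mu0)                                        # projection direction, (5.26)
    c = 0.5 * (mu0 @ Cinv @ mu0 - mu1 @ Cinv @ mu1)                   # bias term (equal priors)

    scores = np.vstack([X0, X1]) @ theta + c                          # log posterior odds, (5.27)
    labels = np.r_[np.zeros(n_per), np.ones(n_per)]
    print("accuracy:", np.mean((scores > 0) == labels))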


LDA

Thus, the decision boundary between the two classes is linear.

This remains true for the decision boundary between any two same-covariance classes of an $\ell$-class soft-max regression classifier, starting from Gaussians for $p(x|y)$.

Given discriminant functions of the form

$$h_y(x) = -\frac{1}{2}\log|C_y| - \frac{1}{2}(x - \mu_y)^\top C_y^{-1}(x - \mu_y) + \log p(y) \qquad (5.28)$$

all decision boundaries $\{x : h_i(x) = h_j(x)\}$ between two classes are linear (consider a Voronoi tessellation) if $C_y = C$ for all $y$, and are otherwise quadratic.

With 2 classes, we need only 1 dimension; with 3 classes, only 2 dimensions.

In general, with $\ell$ classes we need only an $(\ell - 1)$-dimensional subspace onto which to project while retaining the ability to discriminate (moving orthogonal to that subspace leaves the posterior constant).

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F59/65
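A minimal sketch of the discriminant functions $h_y(x)$ in (5.28) (illustrative; the means, covariance, and test point are made up. With distinct per-class covariances the boundaries are quadratic, while sharing one covariance gives the linear case above):

    import numpy as np

    def gaussian_discriminant(x, mu, C, prior):
        # h_y(x) = -0.5 log|C_y| - 0.5 (x - mu_y)^T C_y^{-1} (x - mu_y) + log p(y), as in (5.28)
        diff = x - mu
        return (-0.5 * np.linalg.slogdet(C)[1]
                - 0.5 * diff @ np.linalg.solve(C, diff)
                + np.log(prior))

    mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    C_shared = np.array([[2.0, 1.2], [1.2, 1.0]])
    x = np.array([1.0, 0.5])

    scores = [gaussian_discriminant(x, mu, C_shared, 0.5) for mu in mus]
    print("predicted class:", int(np.argmax(scores)))     # shared C: linear decision boundaries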


Linear Discriminant Analysis (LDA), $\ell$ classes

Within-class scatter matrix:

$$S_W = \sum_{c=1}^{\ell} S_c \quad \text{where} \quad S_c = \sum_{i \,:\, i \text{ is in class } c} (x^{(i)} - \mu_c)(x^{(i)} - \mu_c)^\top \qquad (5.29)$$

where $\mu_c$ is the class-$c$ mean, $\mu_c = \frac{1}{n_c} \sum_{i \text{ in class } c} x^{(i)}$.

Between-class scatter matrix, with $n_c$ the number of samples of class $c$:

$$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (5.30)$$

First projection: $a_1 \in \operatorname{argmax}_{a \in \mathbb{R}^m} \frac{a^\top S_B a}{a^\top S_W a}$. Second, orthogonal to the first: $a_2 \in \operatorname{argmax}_{a \in \mathbb{R}^m : \langle a, a_1 \rangle = 0} \frac{a^\top S_B a}{a^\top S_W a}$, and so on.

In general, the directions of greatest variance of the matrix $S_W^{-1} S_B$ maximize the between-class spread relative to the within-class spread; i.e., we can do PCA on $S_W^{-1} S_B$ to get the LDA reduction.

LDA dimensionality reduction: find an $\ell' \times m$ linear transform of $x$ that projects onto the subspace corresponding to the $\ell'$ greatest eigenvalues of $S_W^{-1} S_B$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F60/65
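A hedged numpy sketch of the multi-class reduction (illustrative; three synthetic Gaussian classes in $\mathbb{R}^4$, reduced to $\ell - 1 = 2$ dimensions via the leading eigenvectors of $S_W^{-1} S_B$):

    import numpy as np

    rng = np.random.default_rng(2)
    m, n_per = 4, 150
    means = [np.zeros(m), np.array([3.0, 0, 0, 0]), np.array([0, 3.0, 0, 0])]
    classes = [rng.normal(size=(n_per, m)) + mu for mu in means]      # 3 classes
    X_all = np.vstack(classes)
    mu_all = X_all.mean(axis=0)

    S_W = np.zeros((m, m))
    S_B = np.zeros((m, m))
    for Xcls in classes:
        mu_c = Xcls.mean(axis=0)
        S_W += (Xcls - mu_c).T @ (Xcls - mu_c)                        # within-class scatter, (5.29)
        S_B += len(Xcls) * np.outer(mu_c - mu_all, mu_c - mu_all)     # between-class scatter, (5.30)

    # "PCA on S_W^{-1} S_B": its leading eigenvectors give the LDA directions.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_W, S_B))           # generally non-symmetric
    order = np.argsort(evals.real)[::-1]
    A = evecs[:, order[:2]].real                                      # m x l' transform, l' = 2
    Z = X_all @ A                                                     # LDA-projected data
    print(np.round(evals.real[order], 2))                             # only l - 1 = 2 are nonzero

Solving the non-symmetric eigenproblem for $S_W^{-1} S_B$ directly, as above, is the simplest way to write the sketch; numerically more careful implementations solve the generalized symmetric eigenproblem $S_B a = \lambda S_W a$ instead.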


LDA projection, dimensionality upper limits of the transform

$\ell' \le \ell - 1$ is the highest-dimensional subspace we can project onto, since $S_B$ has rank at most $\ell - 1$.

Between-class scatter matrix, with $n_c$ the number of samples of class $c$:

$$S_B = \sum_{c=1}^{\ell} n_c (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (5.31)$$

Each vector lies in $\mathbb{R}^m$ and we have $\ell$ rank-1 updates, so immediately the rank is at most $\ell$ (ignoring matrix conditioning and potential numerical issues). However, we can write

$$S_B = \begin{bmatrix} (\mu_1 - \mu) & (\mu_2 - \mu) & \cdots & (\mu_\ell - \mu) \end{bmatrix}
\begin{pmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_\ell \end{pmatrix}
\begin{bmatrix} (\mu_1 - \mu)^\top \\ (\mu_2 - \mu)^\top \\ \vdots \\ (\mu_\ell - \mu)^\top \end{bmatrix} \qquad (5.32)$$

Since $\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$, and since $\sum_{i=1}^{\ell} n_i (\mu_i - \mu) = 0$, which means that $\operatorname{rank}\big([(\mu_1 - \mu) \; (\mu_2 - \mu) \; \cdots \; (\mu_\ell - \mu)]\big) \le \ell - 1$, we conclude $\operatorname{rank}(S_B) \le \ell - 1$.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F61/65
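A quick numerical check of the rank bound (a sketch with $\ell = 3$ synthetic classes in $\mathbb{R}^5$; the class means are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    classes = [rng.normal(size=(40, 5)) + rng.normal(size=5) for _ in range(3)]   # l = 3, m = 5
    mu = np.vstack(classes).mean(axis=0)                      # overall sample mean

    S_B = np.zeros((5, 5))
    for Xcls in classes:
        mu_c = Xcls.mean(axis=0)
        S_B += len(Xcls) * np.outer(mu_c - mu, mu_c - mu)     # eq. (5.31)

    print(np.linalg.matrix_rank(S_B))                         # 2, i.e., at most l - 1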


Repeat slide

The next slide is a repeat from earlier.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F62/65

Uniformly at random unit vectors on the unit sphere

Let $S^{m-1}(r)$ denote the surface ("area") of the $m$-dimensional sphere of radius $r$.

Define the uniform distribution $U_{S^{m-1}(1)}$ on $S^{m-1}(1)$ and independently draw two random vectors $x, y \sim U_{S^{m-1}(1)}$, hence $\|x\|_2 = \|y\|_2 = 1$.

Two vectors are orthogonal if $\langle x, y \rangle = 0$ and are nearly so if $|\langle x, y \rangle| < \epsilon$ for small $\epsilon$.

It can be shown that if $x, y$ are independent random vectors uniformly distributed on the $m$-dimensional sphere, then:

$$\Pr(|\langle x, y \rangle| < \epsilon) > 1 - e^{-m\epsilon^2/2} \qquad (5.17)$$

which means that uniformly-at-random vectors in high-dimensional space are almost always nearly orthogonal!

This is one of the reasons why high-dimensional random projections preserve information: if two random high-dimensional vectors are almost orthogonal, then projections onto them will also be almost orthogonal.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F63/65
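A Monte Carlo sketch of the bound in (5.17) (illustrative; uniform unit vectors are sampled by normalizing Gaussian vectors, and the choices of eps, dimensions, and trial count are arbitrary):

    import numpy as np

    def random_unit_vectors(num, m, rng):
        v = rng.normal(size=(num, m))
        return v / np.linalg.norm(v, axis=1, keepdims=True)    # uniform on S^{m-1}(1)

    eps, trials = 0.1, 20000
    for m in (10, 100, 1000):
        rng = np.random.default_rng(m)
        x = random_unit_vectors(trials, m, rng)
        y = random_unit_vectors(trials, m, rng)
        frac = np.mean(np.abs(np.sum(x * y, axis=1)) < eps)    # estimate of Pr(|<x,y>| < eps)
        print(m, round(frac, 3), ">", round(1 - np.exp(-m * eps**2 / 2), 3))   # vs. the bound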

Non-linear Dimensionality Reduction: Autoencoders

Construct two mappings, an encoder $e_{\theta_e} : \mathbb{R}^m \to \mathbb{R}^{m'}$ and a decoder $d_{\theta_d} : \mathbb{R}^{m'} \to \mathbb{R}^m$.

Given $x \in \mathbb{R}^m$, $d_{\theta_d}(e_{\theta_e}(x))$ is the reconstruction of $x$, which has error $E_{p(x)}\big[\| x - d_{\theta_d}(e_{\theta_e}(x)) \|\big]$.

Given a data set $D_n = \{x^{(i)}\}_i$, the optimization, with $\theta = (\theta_e, \theta_d)$, becomes:

$$\min_\theta E_{p(x)}\big[\| x - d_{\theta_d}(e_{\theta_e}(x)) \|\big] \approx \min_\theta \frac{1}{n} \sum_{i=1}^{n} \big\| x^{(i)} - d_{\theta_d}(e_{\theta_e}(x^{(i)})) \big\| \qquad (5.33)$$

Then $e_{\theta_e}(x)$ is our learnt representation of $x$. If $m' < m$ or $m' \ll m$, this is dimensionality reduction, similar to compression.

As $m'$ gets smaller, there is more pressure to ensure the dimensions are independent (to increase capacity).

This approach has become very successful, with many different flavors (denoising autoencoders, variational autoencoders) using (deep) neural networks as encoders/decoders; this is representation learning.

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F64/65
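A minimal numpy sketch of such an autoencoder (illustrative only: a one-hidden-layer tanh encoder and a linear decoder, trained by full-batch gradient descent on a squared-error version of (5.33); the synthetic data, learning rate, and step count are arbitrary, and real autoencoders would use a deep-learning framework and deeper networks):

    import numpy as np

    rng = np.random.default_rng(4)
    n, m, m_prime = 400, 9, 3
    factors = rng.normal(size=(n, m_prime))                   # 3 underlying factors
    Xdata = np.tanh(factors @ rng.normal(size=(m_prime, m)))  # embedded nonlinearly in R^9

    We = 0.1 * rng.normal(size=(m, m_prime))                  # encoder parameters theta_e
    Wd = 0.1 * rng.normal(size=(m_prime, m))                  # decoder parameters theta_d
    lr = 0.05

    for step in range(2000):
        Z = np.tanh(Xdata @ We)                               # codes e_{theta_e}(x) in R^{m'}
        X_hat = Z @ Wd                                        # reconstructions d_{theta_d}(z)
        R = X_hat - Xdata
        loss = np.mean(np.sum(R**2, axis=1))                  # empirical objective, cf. (5.33)

        dXhat = 2.0 * R / n                                   # gradient of the loss w.r.t. X_hat
        dWd = Z.T @ dXhat
        dZ = dXhat @ Wd.T
        dWe = Xdata.T @ (dZ * (1.0 - Z**2))                   # backprop through tanh
        We -= lr * dWe
        Wd -= lr * dWd

    print("final reconstruction error:", round(loss, 4))

With a purely linear encoder and decoder and squared error, the optimum of this objective spans the same subspace as the top $m'$ principal components, which is one way to see the "non-linear PCA" analogy on the next slide.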


DNN-based Auto-Encoder

Encoder/decoder pipeline from $x$ to $\hat{x} = d_{\theta_d}(e_{\theta_e}(x))$, where $m = 9$ and $m' = 3$.

In some sense, this is a form of non-linear PCA.

[Figure: a network with an input layer, hidden layers 1-2, a 3-unit code layer, hidden layers 4-5, and an output layer; the encoder $e_{\theta_e} : \mathbb{R}^m \to \mathbb{R}^{m'}$ maps $x$ to the code, and the decoder $d_{\theta_d} : \mathbb{R}^{m'} \to \mathbb{R}^m$ maps the code to $\hat{x}$.]

Prof. Jeff Bilmes EE511/Spring 2020/Adv. Intro ML - Week 5 - April 27th/29th, 2020 F65/65
