Statistical Machine Learning
a.k.a. High Dimensional Inference
Larry Wasserman
Carnegie Mellon University
Outline
1 High Dimensional Regression and Classification
2 Graphical Models
3 Nonparametric Methods
4 Clustering
Throughout: Strong assumptions versus weak assumptions
Dimension = p or d. Sample size = n.
References:
The Elements of Statistical Learning. Hastie, Tibshirani and Friedman.
Statistical Machine Learning. Liu and Wasserman (coming soon).
Introduction
• Machine learning is statistics with a focus on prediction, scalability and high dimensional problems.
• Regression: predict Y ∈ R from X .
• Classification: predict Y ∈ {0,1} from X .
I Example: Predict if an email X is real (Y = 1) or spam (Y = 0).
• Finding structure. Examples:
I Clustering: find groups.
I Graphical Models: find conditional independence structure.
Three Main Themes
Convexity
Convex problems can be solved quickly. If necessary, approximate the problem with a convex problem.
Sparsity
Many interesting problems are high dimensional. But often, the relevant information is effectively low dimensional.
Weak Assumptions
Make the weakest possible assumptions.
Preview: Sparse Graphical Models
Preview: Finding relations between stocks in the S&P 500:
Regression
We observe pairs (X1,Y1), . . . , (Xn,Yn).
D = {(X1,Y1), . . . , (Xn,Yn)} is called the training data.
Yi ∈ R is the response. Xi ∈ Rp is the feature (or covariate).
For example, suppose we have n subjects. Yi is the blood pressure of subject i. Xi = (Xi1, . . . ,Xip) is a vector of p = 5,000 gene expression levels for subject i.
Remember: Yi ∈ R and Xi ∈ Rp.
Given a new pair (X ,Y ), we want to predict Y from X .
Regression
Let Ŷ be a prediction of Y . The prediction error or risk is

R = E(Y − Ŷ )²

where E is the expected value (mean).
The best predictor is the regression function

m(x) = E(Y |X = x) = ∫ y f (y |x) dy .

However, the true regression function m(x) is not known. We need to estimate m(x).
Regression
Given the training data D = {(X1,Y1), . . . , (Xn,Yn)} we want to construct m̂ to make

prediction risk = R(m̂) = E(Y − m̂(X ))²

small. Here, (X ,Y ) is a new pair.
Key fact: Bias-variance decomposition:

R(m̂) = ∫ bias²(x) p(x) dx + ∫ var(x) p(x) dx + σ²

where

bias(x) = E(m̂(x)) − m(x)
var(x) = Variance(m̂(x))
σ² = E(Y − m(X ))²
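The decomposition can be checked numerically. Below is a small Monte Carlo sketch (mine, not from the slides; it uses numpy and a synthetic model of my choosing): a straight-line fit to a sinusoidal m(x) has high bias and low variance, and bias² + variance + σ² matches a direct estimate of the risk.

```python
import numpy as np

# Monte Carlo check of R(m_hat) = ∫bias²(x)p(x)dx + ∫var(x)p(x)dx + σ².
# Synthetic model (my choice): Y = sin(2πX) + eps, eps ~ N(0, σ²),
# with a deliberately biased estimator: a degree-1 polynomial fit.
rng = np.random.default_rng(0)
sigma, n, reps = 0.3, 50, 2000
x_grid = np.linspace(0, 1, 25)
m_true = np.sin(2 * np.pi * x_grid)

fits = np.empty((reps, x_grid.size))
for r in range(reps):
    X = rng.uniform(0, 1, n)
    Y = np.sin(2 * np.pi * X) + rng.normal(0, sigma, n)
    coef = np.polyfit(X, Y, deg=1)          # a line: high bias, low variance
    fits[r] = np.polyval(coef, x_grid)

bias2 = (fits.mean(axis=0) - m_true) ** 2   # bias²(x) on the grid
var = fits.var(axis=0)                      # var(x) on the grid
risk_decomp = bias2.mean() + var.mean() + sigma ** 2

# Direct estimate of R = E(Y - m_hat(X))² using fresh noise.
direct = np.mean((m_true + rng.normal(0, sigma, fits.shape) - fits) ** 2)
print(round(risk_decomp, 2), round(direct, 2))
```

The two printed numbers agree closely, and the bias term dominates the variance term, as expected for an oversimplified (linear) fit.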
Bias-Variance Tradeoff
Prediction Risk = Bias² + Variance
Prediction methods with low bias tend to have high variance.
Prediction methods with low variance tend to have high bias.
For example, the predictor m̂(x) ≡ 0 has 0 variance but will be terribly biased.

To predict well, we need to balance the bias and the variance. We begin with linear methods.
Linear Regression
Try to find the best linear predictor, that is, a predictor of the form:
m(x) = β0 + β1x1 + · · ·+ βpxp.
Important: We do not assume that the true regression function is linear.

We can always define x1 = 1. Then the intercept is β1 and we can write
m(x) = β1x1 + · · ·+ βpxp = βT x
where β = (β1, . . . , βp) and x = (x1, . . . , xp).
Low Dimensional Linear Regression
Assume for now that p (= length of each Xi) is small. To find a good linear predictor we choose β to minimize the training error:

training error = (1/n) ∑_{i=1}^n (Yi − βT Xi)²

The minimizer β̂ = (β̂1, . . . , β̂p) is called the least squares estimator.
Low Dimensional Linear Regression
The least squares estimator is:
β̂ = (XTX)−1XTY

where X is the n × d matrix

X =
  X11 X12 · · · X1d
  X21 X22 · · · X2d
  ...  ...      ...
  Xn1 Xn2 · · · Xnd

and Y = (Y1, . . . ,Yn)T .

In R: lm(y ∼ x)
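The slides use R's lm; the same formula β̂ = (XTX)−1XTY can be sketched in numpy on synthetic data (the data-generating numbers below are my choices):

```python
import numpy as np

# The least squares formula beta_hat = (X^T X)^{-1} X^T Y in numpy.
# (The slides use R's lm; this synthetic example is mine.)
rng = np.random.default_rng(1)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # x1 = 1 gives the intercept
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# lstsq solves the normal equations stably (avoids forming the inverse).
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(beta_hat, 1))
```

With n much larger than d and little noise, β̂ lands very close to the true coefficients.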
Low Dimensional Linear Regression
Summary: the least squares estimator is m̂(x) = β̂T x = ∑_j β̂j xj where

β̂ = (XTX)−1XTY.

When we observe a new X , we predict Y to be

Ŷ = m̂(X ) = β̂T X .

Our goals are to improve this by:

(i) dealing with high dimensions
(ii) using something more flexible than linear predictors.
Example
Y = HIV resistance
Xj = amino acid in position j of the virus.
Y = β0 + β1X1 + · · ·+ β100X100 + ε
[Figure: four panels for the HIV example. Top left: β̂ versus position. Top right: marginal regression coefficients (one-at-a-time) versus position. Bottom left: residuals Yi − Ŷi versus fitted values Ŷi. Bottom right: a sparse regression (coming up soon).]
High Dimensional Linear Regression
Now suppose p is large. We might even have p > n (more covariates than data points).

The least squares estimator is not defined since XTX is not invertible. The variance of the least squares prediction is huge.
Recall the bias-variance tradeoff:
Prediction Error = Bias² + Variance
We need to increase the bias so that we can decrease the variance.
Ridge Regression
Recall that the least squares estimator minimizes the training error (1/n) ∑_{i=1}^n (Yi − βT Xi)².

Instead, we can minimize the penalized training error:

(1/n) ∑_{i=1}^n (Yi − βT Xi)² + λ‖β‖₂²

where ‖β‖₂ = sqrt(∑_j βj²).

The solution is:

β̂ = (XTX + λI)−1XTY
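The closed form is easy to sketch in numpy (synthetic data of my choosing): as λ grows, the coefficients shrink toward 0.

```python
import numpy as np

# Ridge closed form: beta_hat = (X^T X + lam*I)^{-1} X^T Y.
# As lam grows, the coefficients shrink toward 0.
rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 3.0
Y = X @ beta_true + rng.normal(size=n)

def ridge(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

b_ls = ridge(X, Y, 0.0)     # lam = 0: least squares
b_big = ridge(X, Y, 1e6)    # lam -> infinity: beta -> 0
print(np.linalg.norm(b_ls) > np.linalg.norm(ridge(X, Y, 10.0)) > np.linalg.norm(b_big))
```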
Ridge Regression
The tuning parameter λ controls the bias-variance tradeoff:
λ = 0 =⇒ least squares.
λ = ∞ =⇒ β̂ = 0.

We choose λ to minimize R̂(λ), where R̂(λ) is an estimate of the prediction risk.
Ridge Regression
To estimate the prediction risk, do not use the training error:

R̂training = (1/n) ∑_{i=1}^n (Yi − Ŷi)², Ŷi = XiT β̂

because it is biased: E(R̂training) < R(β̂).

Instead, we use leave-one-out cross-validation:

1. leave out (Xi ,Yi)
2. find β̂
3. predict Yi : Ŷ(i) = β̂T Xi
4. repeat for each i
Leave-one-out cross-validation
R̂(λ) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))² = (1/n) ∑_{i=1}^n (Yi − Ŷi)²/(1 − Hii)²

≈ R̂training/(1 − k/n)² ≈ R̂training + 2 k σ̂²/n

where

H = X(XTX + λI)−1XT
k = trace(H)

k is the effective degrees of freedom.
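The shortcut can be checked directly. A numpy sketch (my synthetic example): for ridge with fixed λ, the identity Yi − Ŷ(i) = (Yi − Ŷi)/(1 − Hii) is exact, so the shortcut matches brute-force refitting.

```python
import numpy as np

# Check the leave-one-out shortcut for ridge:
# Y_i - Yhat_(i) = (Y_i - Yhat_i) / (1 - H_ii),  H = X(X^T X + lam*I)^{-1} X^T,
# against brute-force refitting on synthetic data.
rng = np.random.default_rng(3)
n, p, lam = 40, 5, 2.0
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
k = np.trace(H)                              # effective degrees of freedom
shortcut = np.mean(((Y - H @ Y) / (1 - np.diag(H))) ** 2)

errs = []
for i in range(n):                           # actually leave each point out
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ Y[keep])
    errs.append((Y[i] - X[i] @ b) ** 2)
print(np.isclose(shortcut, np.mean(errs)), round(k, 1))
```

Note k = trace(H) < p whenever λ > 0, which is the sense in which ridge "spends" fewer than p degrees of freedom.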
Example
Y = 3X1 + · · ·+ 3X5 + 0X6 + · · ·+ 0X1000 + ε
n = 100, p = 1,000.
So there are 1000 covariates but only 5 are relevant.
What does ridge regression do in this case?
Ridge Regularization Paths
[Figure: ridge regularization paths, coefficients β̂ versus λ.]
Sparse Linear Regression
Ridge regression does not take advantage of sparsity.
Maybe only a small number of covariates are important predictors. How do we find them?

We could fit many submodels (with a small number of covariates) and choose the best one. This is called model selection.

Now the inaccuracy is

prediction error = bias² + variance

The bias is the error due to omitting important variables. The variance is the error due to having to estimate many parameters.
The Bias-Variance Tradeoff
[Figure: variance increases with the number of variables; bias decreases.]
The Bias-Variance Tradeoff
This is a Goldilocks problem. Can’t use too few or too many variables.
Have to choose just the right variables.
Have to try all models with one variable, two variables,...
If there are p variables then there are 2^p models.

Suppose we have 50,000 genes. We have to search through 2^50,000 models. But 2^50,000 > number of atoms in the universe.

This problem is NP-hard. This was a major bottleneck in statistics for many years.
You are Here
[Figure: nested complexity classes P, NP, NP-Complete, NP-Hard; variable selection (marked X) sits in the NP-Hard region.]
Two Things that Save Us
Two key ideas to make this feasible are sparsity and convexrelaxation.
Sparsity: probably only a few genes are needed to predict some disease Y . In other words, of β1, . . . , β50,000 most βj ≈ 0.
But which ones?? (Needle in a haystack.)
Convex Relaxation: Replace model search with something easier.
It is the marriage of these two concepts that makes it all work.
Sparsity
Look at this:

β = (5,5,5,0,0,0, . . . ,0).
This vector is high-dimensional but it is sparse.
Here is a less obvious example:
β = (50, 12, 6, 3, 2, 1.4, 1, 0.8, 0.6, 0.5, . . .)

It turns out that, if the βj ’s die off fairly quickly, then β behaves like a sparse vector.
Sparsity
We measure the (lack of) sparsity of β = (β1, . . . , βp) with the q-norm

‖β‖q = (|β1|^q + · · · + |βp|^q)^{1/q} = (∑_j |βj|^q)^{1/q}.
Which values of q measure (lack of) sparsity?
sparse:     a = (1, 0, 0, 0, · · · , 0)
not sparse: b = (.001, .001, .001, .001, · · · , .001)

       q = 0   q = 1   q = 2
‖a‖q     1       1       1
‖b‖q     p      √p       1
         ✓       ✓       ×

Lesson: Need to use q ≤ 1 to measure sparsity. (Actually, q < 2 ok.)
Sparsity
So we estimate β = (β1, . . . , βp) by minimizing
∑_{i=1}^n (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖q ≤ small.
Can we do this minimization?
If we use q = 0 this turns out to be the same as searching through all 2^p models. Ouch!

What about other values of q?

What does the set {β : ‖β‖q ≤ small} look like?
The set {β : ‖β‖q ≤ 1} when p = 2

[Figure: the ℓq balls for q = 1/4, q = 1/2, q = 1 and q = 3/2.]
Sparsity Meets Convexity
We need these sets to have a nice shape (convex). If so, the minimization is no longer NP-hard. In fact, it is easy.

Sensitivity to sparsity: q ≤ 1 (actually, q < 2 suffices)
Convexity (niceness): q ≥ 1

This means we should use q = 1.
Where Sparsity and Convexity Meet
[Figure: sparsity requires q ≤ 1 and convexity requires q ≥ 1; the two meet at q = 1.]
Sparsity Meets Convexity: The LASSO
So we estimate β = (β1, . . . , βp) by minimizing
∑_{i=1}^n (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖1 = ∑_j |βj| ≤ small.

This is called the LASSO. Invented by Rob Tibshirani in 1996.
Lasso
The result is an estimated vector

β̂1, . . . , β̂p

Most are 0!

Magically, we have done model selection without searching (thanks to sparsity plus convexity).

The next picture explains why some β̂j = 0.
Sparsity: How Corners Create Sparse Estimators
The Lasso: HIV Example Again
• Y is resistance to HIV drug.
• Xj = amino acid in position j of the virus.
• p = 99, n ≈ 100.
The Lasso: An Example
[Figure: LASSO coefficient paths; standardized coefficients versus |β̂|/max|β̂|.]
Selecting λ
We choose the sparsity level by estimating prediction error. (Ten-fold cross-validation.)
[Figure: estimated risk versus number of selected variables p.]
The Lasso: An Example
[Figure: lasso coefficient estimates β̂ versus position for the HIV example.]
The Lasso: Another Example (using glmnet)
[Figure: glmnet coefficient paths versus L1 norm; the number of nonzero coefficients is shown along the top.]
The Lasso: Another Example (using glmnet)
[Figure: cross-validated mean-squared error versus log(λ); the number of nonzero coefficients is shown along the top.]
Sparvexity
To summarize: we penalize the sum of squares with
‖β‖q = (∑_j |βj|^q)^{1/q}.
To get a sparse answer: q < 2.
To get a convex problem: q ≥ 1.
So q = 1 works.
Sparvexity (the marriage of sparsity and convexity) is one of the biggest developments in statistics and machine learning.
The Lasso
• β̂(λ) is called the lasso estimator. Then define

Ŝ(λ) = {j : β̂j(λ) ≠ 0}.

R: lars(y,x) or glmnet(y,x)

• (Optional) After you find Ŝ(λ), you can re-fit the model by doing least squares on the sub-model Ŝ(λ). (Less bias but higher variance.)
The Lasso
Choose λ by risk estimation. (For example, 10-fold cross-validation.)
If you use the “re-fit” approach, then you can apply leave-one-out cross-validation:

R̂(λ) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))² = (1/n) ∑_{i=1}^n (Yi − Ŷi)²/(1 − Hii)² ≈ (1/n) RSS/(1 − s/n)²

where H is the hat matrix and s = #{j : β̂j ≠ 0}.

Choose λ̂ to minimize R̂(λ).
The Lasso
The complete steps are:
1 Find β̂(λ) and Ŝ(λ) for each λ.
2 Choose λ̂ to minimize estimated risk.
3 Prediction: Ŷ = XT β̂.
Some Convexity Theory for the Lasso
Consider a simpler model than regression: Suppose Y ∼ N(µ,1). Let µ̂ minimize

A(µ) = (1/2)(Y − µ)² + λ|µ|.
How do we minimize A(µ)?
• Since A is convex, we set the subderivative to 0. Recall that c is a subderivative of f at x0 if

f (x) − f (x0) ≥ c(x − x0) for all x.

• The subdifferential ∂f (x0) is the set of subderivatives. Also, x0 minimizes f if and only if 0 ∈ ∂f (x0).
ℓ1 and Soft Thresholding
• If f (µ) = |µ| then

∂f (µ) =
  −1        if µ < 0
  [−1, 1]   if µ = 0
  +1        if µ > 0.

• Hence,

∂A(µ) =
  µ − Y − λ                if µ < 0
  {−Y + λz : −1 ≤ z ≤ 1}   if µ = 0
  µ − Y + λ                if µ > 0.
ℓ1 and Soft Thresholding
• µ̂ minimizes A(µ) if and only if 0 ∈ ∂A(µ̂).

• So

µ̂ =
  Y + λ   if Y < −λ
  0       if −λ ≤ Y ≤ λ
  Y − λ   if Y > λ.

• This can be written as

µ̂ = soft(Y , λ) ≡ sign(Y ) (|Y | − λ)+.
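The soft-thresholding operator is a one-liner, and the claim that it minimizes A(µ) is easy to check numerically (a numpy sketch; the test point y = 1.7, λ = 1 is my choice):

```python
import numpy as np

# Soft thresholding: soft(y, lam) = sign(y) * (|y| - lam)_+ .
def soft(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Quick numerical check that it minimizes A(mu) = 0.5*(y - mu)^2 + lam*|mu|.
y, lam = 1.7, 1.0
grid = np.linspace(-3.0, 3.0, 600001)
A = 0.5 * (y - grid) ** 2 + lam * np.abs(grid)
print(float(soft(y, lam)), float(grid[np.argmin(A)]))
```

Both numbers agree: the grid minimizer of A coincides with soft(y, λ) = 0.7.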
ℓ1 and Soft Thresholding

[Figure: the soft-thresholding function soft(x , λ) versus x; it is 0 on [−λ, λ].]
The Lasso: Computing β
• Minimize ∑_i (Yi − βT Xi)² + λ‖β‖1.
  - use lars (least angle regression), or
  - coordinate descent: set β̂ = (0, . . . ,0) then iterate the following:
    • for j = 1, . . . ,d :
      • set Ri = Yi − ∑_{s≠j} β̂s Xsi
      • β̂j,LS = least squares fit of the Ri ’s on Xj
      • β̂j ← soft(β̂j,LS, λ/∑_i X²ij)

R: glmnet
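The coordinate descent loop above can be written out directly. A from-scratch numpy sketch (my synthetic example; the threshold λ/∑_i X²ij corresponds to the (1/2)·RSS + λ‖β‖1 scaling of the objective — in practice use glmnet):

```python
import numpy as np

# Coordinate descent for the lasso, following the slide's update
# beta_j <- soft(beta_{j,LS}, lam / sum_i X_ij^2).
def soft(b, lam):
    return np.sign(b) * max(abs(b) - lam, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    n, d = X.shape
    beta = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)                 # sum_i X_ij^2 for each j
    for _ in range(n_sweeps):
        for j in range(d):
            R = Y - X @ beta + X[:, j] * beta[j]  # partial residuals R_i
            b_ls = (X[:, j] @ R) / col_ss[j]      # LS fit of R on X_j
            beta[j] = soft(b_ls, lam / col_ss[j])
    return beta

rng = np.random.default_rng(4)
n, d = 100, 20
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = 3.0                               # sparse truth
Y = X @ beta_true + rng.normal(size=n)
beta_hat = lasso_cd(X, Y, lam=50.0)
print((beta_hat != 0).sum())  # most coefficients are exactly zero
```

The soft-thresholding step is what produces exact zeros: model selection happens inside the optimization.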
Variations on the Lasso
• Sparse logistic regression: later.
• Elastic Net: minimize

∑_{i=1}^n (Yi − βT Xi)² + λ1‖β‖1 + λ2‖β‖₂²

• Group Lasso: partition the coefficients into groups,

β = (β1, . . . , βk | · · · | βt , . . . , βp) = (v1, . . . , vm)

and minimize:

∑_{i=1}^n (Yi − βT Xi)² + λ ∑_{j=1}^m ‖vj‖
Summary
• For low dimensional (linear) prediction, we can use least squares.
• For high dimensional linear regression, we face a bias-variance tradeoff: omitting too many variables causes bias while including too many variables causes high variance.
• The key is to select a good subset of variables.
• The lasso (L1-regularized least squares) is a fast way to select variables.

• If there are good, sparse, linear predictors, the lasso will work well.
Weak Assumptions Versus Strong Assumptions
With virtually no assumptions, the following is true. Let β∗ minimize

E(Y − βT X )² subject to ‖β‖1 ≤ L.

So β∗T x is the best sparse linear predictor. Let β̂ be the lasso estimator. Then

R(β̂) − R(β∗) = OP( sqrt( (L²/n) log(p) ) ).

The lasso is risk consistent as long as p = o(e^n). We do not assume that the true model is linear. We make no assumptions on the design matrix.
Strong Assumptions
To recover the true support of β or make inferences about β, many authors assume:

1 Incoherence (covariates are almost independent).
2 True model is linear.
3 Constant variance.
4 Donut (no small effects).

In signal processing, these assumptions are realistic. In data analysis, I can’t imagine why we would assume these things.
Inference
Approach I: The HARNESS (joint work with Ryan Tibshirani)
Split the data into two halves: D1 and D2.
Use D1 to choose the variables S.
Use D2 to get confidence intervals for the βj ’s, the prediction risk R, and the prediction risk Rj when dropping βj . For example,

R̂(Ŝ) = (1/n) ∑_{i∈D2} |Yi − β̂T Xi |

estimates

R(Ŝ) = E|Y − β̂T X |.
Inference From D2
Selected Ŝ ⊂ {1, . . . ,p} using D1. Using D2, do the usual least-squares linear regression.

Can get confidence intervals for:

1 R = E|Y − β̂T X |.
2 Rj = E|Y − β̂(−j)T X | − E|Y − β̂T X |.
3 βS = (βj : j ∈ S).

But notice that βS is the projection of m(x) onto (xj : j ∈ S). It depends on which variables are selected.
Example (Wine Data): Confidence Intervals
[Figure: confidence intervals. Left: the Rj ’s. Right: the βj ’s.]

Selected variables: Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide and pH.
Sequential Approach
Approach II: Sequential. Test each variable as it is added in by the algorithm.

See: Lockhart, Taylor, Tibshirani and Tibshirani (2013). (arxiv.org/abs/1301.7161. This will appear in the Annals of Statistics with discussion.)

Other approaches: Javanmard and Montanari (2013), Bühlmann and van de Geer (2013).

Warning: these approaches make strong assumptions.
Classification
Same as regression except that Yi ∈ {−1,+1}.

Classification risk:

R(h) = P(Y ≠ h(X )).

Best classifier (Bayes’ classifier):

h(x) =
  +1 if m(x) ≥ 1/2
  −1 if m(x) < 1/2

where m(x) = P(Y = +1|X = x).
Linear Classifiers
h(x) =
  +1 if β0 + βT x ≥ 0
  −1 if β0 + βT x < 0

that is, h(x) = sign(β0 + βT x).

Minimizing the training error

(1/n) ∑_{i=1}^n I(Yi ≠ h(Xi))

over β is not feasible (non-convex).
(Sparse) Logistic Regression
m(x) = P(Y = 1|X = x) = e^{xTβ}/(1 + e^{xTβ})

and, in the high-dimensional case, we minimize

− log L(β) + λ ∑_j |βj |

where the likelihood is

L(β) = ∏_{Yi=+1} m(Xi) ∏_{Yi=−1} (1 − m(Xi)).

R: glmnet
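The slides use glmnet; one simple way to sketch the same objective from scratch is proximal gradient descent — a gradient step on the negative log-likelihood followed by soft thresholding. (A numpy sketch, my synthetic example, with 0/1 labels.)

```python
import numpy as np

# Sparse logistic regression sketch: minimize -log L(beta) + lam * sum|beta_j|
# by proximal gradient descent (gradient step, then soft thresholding).
def sparse_logistic(X, y, lam, step=0.1, n_iter=3000):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # m(x) = P(Y = 1 | x)
        b = beta - step * X.T @ (p - y) / n        # gradient step on -loglik/n
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)
    return beta

rng = np.random.default_rng(5)
n, d = 500, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:2] = 2.0
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = sparse_logistic(X, y, lam=0.1)
print(np.nonzero(beta_hat)[0])  # indices of the nonzero coefficients
```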
Support Vector Machine (SVM)
Choose (β0, β) to minimize
(1/n) ∑_{i=1}^n [1 − Yi m(Xi)]+ + λ ∑_j βj²

where [a]+ = max{a, 0} and m(x) = β0 + βT x.

R: libsvm, e1071, ...
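The hinge-loss objective above can be minimized with plain subgradient descent. A numpy sketch on synthetic, linearly separable data (my example; use libsvm / e1071 in practice):

```python
import numpy as np

# Subgradient descent on the SVM objective:
# (1/n) sum_i [1 - Y_i m(X_i)]_+ + lam * sum_j beta_j^2,  m(x) = beta0 + beta^T x.
def svm_fit(X, Y, lam=0.01, step=0.05, n_iter=2000):
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        viol = Y * (beta0 + X @ beta) < 1            # margin violations
        g = -(Y[viol, None] * X[viol]).sum(axis=0) / n + 2 * lam * beta
        g0 = -Y[viol].sum() / n
        beta, beta0 = beta - step * g, beta0 - step * g0
    return beta0, beta

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)       # linearly separable labels
beta0, beta = svm_fit(X, Y)
pred = np.sign(beta0 + X @ beta)
print((pred == Y).mean())                            # training accuracy
```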
SVM: Geometric View
[Figure: separating hyperplane with margins for the +1 and −1 classes.]
SVM: Geometric View
The optimization problem can be written as:
min_{β0,β} (1/2)‖β‖₂² + (1/λ′) ∑_{i=1}^n ξi

subject to ξi ≥ 1 − Yi(β0 + XiT β) and ξi ≥ 0 for all i = 1, . . . ,n. Or, in dual form:

α̂ = argmax_{α∈Rn} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi ,Xk ⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where

β̂ = ∑_{i=1}^n α̂i Yi Xi .

- quadratic program
- changing ⟨Xi ,Xk ⟩ to K (Xi ,Xk ) (kernelization) makes it nonlinear.
Surrogate Losses
h(x) = sign(m(x)) where m(x) = β0 + βT x .

The empirical classification risk is

R̂ = (1/n) ∑_{i=1}^n I(Yi ≠ h(Xi)) = (1/n) ∑_{i=1}^n I(Yi m(Xi) ≤ 0).

Minimizing R̂ is hard because it is non-convex. The idea is to replace the non-convex function I(u ≤ 0) with a convex surrogate.

surrogate   form             classifier
hinge       [1 − u]+         support vector machine
logistic    log[1 + e^{−u}]  logistic regression
Surrogate Losses
[Figure: the 0–1 loss and its hinge and logistic surrogates as functions of u = y m(x).]
Example
South African Heart Disease. Outcome is Y ∈ {0,1} (disease/non-disease). There are 9 predictors X1, . . . ,X9. I added 100 extra (irrelevant) predictors.

n = 462. d = 109.

Use sparse logistic regression.
Example
[Figure: left, cross-validated misclassification error versus log(λ); right, coefficient paths versus log(λ). The number of nonzero coefficients is shown along the top.]
Regression is Harder Than Classification
h(x) =
  +1 if m̂(x) > 1/2
  −1 otherwise.

The risk of the plug-in classifier satisfies

R(h) − R∗ ≤ sqrt( ∫ (m̂(x) − m∗(x))² dPX (x) ),

where R∗ = inf_h R(h).
Regression is Harder Than Classification
[Figure: two regression functions m(x); for classification, only whether m(x) lies above or below 1/2 matters.]
Example: Political Blogs
Features: word counts, and links.
Example: Political Blogs
[Figure: cross-validated errors versus log(λ) for four classifiers: Sparse LR, SVM, Sparse LR (+ links), SVM (+ links).]
Undirected Graphs
UNDIRECTED GRAPHS
Undirected Graphs
An approach to representing multivariate data.
Unsupervised: no response variables.
Two main types:
1 Correlation Graphs
2 Partial Correlation Graphs
Correlation Graphs
X = (X (1), . . . ,X (d)). Put an edge between j and k if θjk = Cov(X (j),X (k)) ≠ 0.
Bootstrap method: (Wasserman, Kolar, Rinaldo arXiv:1309.6933).
Step 1: Estimate Σ:

S = (1/n) ∑_{i=1}^n (Xi − X̄ )(Xi − X̄ )T .

Step 2: Bootstrap: draw B bootstrap samples, each of size n, giving S∗1 , . . . ,S∗B . Find the upper quantile zα:

(1/B) ∑_{j=1}^B I( max_{r,s} √n |S∗j (r , s) − S(r , s)| > zα ) = α.
Correlation Graphs
Step 3: Put an edge between (j , k) if

0 ∉ S(j , k) ± zα/√n.

Then

P(Ĝ = G) ≥ 1 − α − C log d / n^{1/8}.

This follows from results in Wasserman, Kolar, Rinaldo (2013) arXiv:1309.6933 and Chernozhukov, Chetverikov and Kato (2012) arXiv:1212.6906.
Example: n = 100 and d = 100.
Markov (Conditional Independence) Graphs
Let X = (X (1), . . . ,X (d)). A graph G = (V ,E) has vertices V and edges E . The independence graph has one vertex for each X (j).

X — Y — Z

means that

X ⫫ Z | Y

V = {X ,Y ,Z} and E = {(X ,Y ), (Y ,Z )}.
Markov Property
A probability distribution P satisfies the global Markov property with respect to a graph G if:

for any disjoint vertex subsets A, B, and C such that C separates A and B,

XA ⫫ XB | XC .
Example
[Figure: a graph on vertices 1, . . . , 8.]

C = {3,7} separates A = {1,2} and B = {4,8}. Hence,

{X (1),X (2)} ⫫ {X (4),X (8)} | {X (3),X (7)}
Example: Protein networks (Maslov 2002)
Distributions Encoded by a Graph
• I(G) = all independence statements implied by the graph G.

• I(P) = all independence statements implied by P.

• P(G) = {P : I(G) ⊆ I(P)}.

• If P ∈ P(G) we say that P is Markov to G.

• The graph G represents the class of distributions P(G).

• Goal: Given random vectors X1, . . . ,Xn ∼ P, estimate G.
Gaussian Case
• If X ∼ N(µ,Σ) then there is no edge between Xi and Xj if and only if

Ωij = 0

where Ω = Σ−1.

• Given X1, . . . ,Xn ∼ N(µ,Σ).

• For n > p, let Ω̂ = Σ̂−1 and test

H0 : Ωij = 0 versus H1 : Ωij ≠ 0.
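A small numpy sketch of the n > p case (my synthetic example): simulate from a chain-structured precision matrix, invert the sample covariance, and read the graph off the large partial correlations.

```python
import numpy as np

# Gaussian graphical model with n > p: no edge (i, j) iff Omega_ij = 0,
# where Omega = Sigma^{-1}.  Simulate a chain 0-1-2-3 and recover its edges.
rng = np.random.default_rng(7)
p, n = 4, 5000
Omega = np.eye(p)
for i in range(p - 1):                     # chain: nonzeros on the first off-diagonal
    Omega[i, i + 1] = Omega[i + 1, i] = 0.4
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

Omega_hat = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.diag(Omega_hat))
partial = -Omega_hat / np.outer(d, d)      # partial correlations
edges = sorted((i, j) for i in range(p) for j in range(i + 1, p)
               if abs(partial[i, j]) > 0.1)
print(edges)                               # recovers the chain's edges
```

The hard threshold 0.1 stands in for a formal test of H0 : Ωij = 0; with p > n the inverse does not exist and one needs the lasso-based methods on the next slides.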
Gaussian Case: p > n
Two approaches:
• parallel lasso (Meinshausen and Bühlmann)

• graphical lasso (glasso; Banerjee et al., Hastie et al.)

Parallel Lasso:

1 For each j = 1, . . . ,d (in parallel): regress Xj on all the other variables using the lasso.

2 Put an edge between Xi and Xj if each appears in the regression of the other.
Glasso (Graphical Lasso)
The glasso minimizes:
−ℓ(Ω) + λ ∑_{j≠k} |Ωjk |

where

ℓ(Ω) = (1/2)(log |Ω| − tr(ΩS))

is the Gaussian loglikelihood (maximized over µ).

There is a simple blockwise gradient descent algorithm for minimizing this function. It is very similar to the previous algorithm.

R packages: glasso and huge
Graphs on the S&P 500
• Data from Yahoo! Finance (finance.yahoo.com).
• Daily closing prices for 452 stocks in the S&P 500 between 2003 and 2008 (before onset of the “financial crisis”).
• Log returns: Xtj = log(St,j/St−1,j).

• Winsorized to trim outliers.

• In the following graphs, each node is a stock, and color indicates GICS industry:
Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunications Services, Utilities.
S&P 500: Graphical Lasso
S&P 500: Parallel Lasso
Choosing λ
Can use:
1 Cross-validation
2 BIC = log-likelihood − (p/2) log n
3 AIC = log-likelihood − p

where p = number of parameters.
Remarks
1 Versions for discrete random variables.
2 Assumption-free approach (Wasserman, Kolar and Rinaldo 2013). (Includes confidence intervals.)
3 Very active area, especially in bio-informatics.
4 Nonparametric approaches (coming up).
NONPARAMETRIC MACHINE LEARNING
Nonparametric Regression
Given (X1,Y1), . . . , (Xn,Yn) predict Y from X .
Assume only that Yi = m(Xi) + εi where m(x) is a smooth function of x .

The most popular methods are kernel methods. However, there are two types of kernels:

1 Smoothing kernels
2 Mercer kernels

Smoothing kernels involve local averaging.
Mercer kernels involve regularization.
Smoothing Kernels
• Smoothing kernel estimator:
m̂h(x) = ∑_{i=1}^n Yi Kh(Xi , x) / ∑_{i=1}^n Kh(Xi , x) = ∑_i wi(x) Yi

where Kh(x , z) is a kernel such as

Kh(x , z) = exp( −‖x − z‖²/(2h²) )

and h > 0 is called the bandwidth.

• m̂h(x) is just a local average of the Yi ’s near x .

• The bandwidth h controls the bias-variance tradeoff:
small h = large variance while large h = large bias.
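The estimator is a few lines of numpy (a sketch on synthetic data of my choosing):

```python
import numpy as np

# The smoothing-kernel (local average) estimator:
# m_hat_h(x) = sum_i Y_i K_h(X_i, x) / sum_i K_h(X_i, x).
def m_hat(x, X, Y, h):
    w = np.exp(-(X - x) ** 2 / (2 * h ** 2))   # Gaussian kernel weights
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(8)
n = 300
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, n)

# The local average near x = 0.25 should be close to sin(pi/2) = 1.
print(round(m_hat(0.25, X, Y, h=0.05), 1))
```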
Example: Some Data – Plot of Yi versus Xi
Example: m(x)
m(x) is a local average
Effect of the bandwidth h
[Figure: fits with a very small, small, medium, and large bandwidth.]
Smoothing Kernels
Risk = E(Y − m̂h(X ))² = bias² + variance + σ².

bias² ≈ h⁴

variance ≈ 1/(n h^d) where d = dimension of X .

σ² = E(Y − m(X ))² is the unavoidable prediction error.

small h: low bias, high variance (undersmoothing)
large h: high bias, low variance (oversmoothing)
Risk Versus Bandwidth
[Figure: risk versus bandwidth h; the bias increases and the variance decreases with h, and the risk is minimized at an optimal h in between.]
Estimating the Risk: Cross-Validation
To choose h we need to estimate the risk R(h). We can estimate the risk by using cross-validation.

1 Omit (Xi ,Yi) to get m̂h,(i), then predict: Ŷ(i) = m̂h,(i)(Xi).
2 Repeat this for all observations.
3 The cross-validation estimate of risk is:

R̂(h) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))².

Shortcut formula:

R̂(h) = (1/n) ∑_{i=1}^n ( (Yi − Ŷi)/(1 − Lii) )²

where Lii = Kh(Xi ,Xi)/∑_t Kh(Xi ,Xt ).
Summary so far
1 Compute m̂h for each h.
2 Estimate the risk R̂(h).
3 Choose the bandwidth ĥ to minimize R̂(h).
4 Let m̂(x) = m̂ĥ(x).
Example
[Figure: the data, the estimated risk versus h, and the chosen fit.]
Mercer Kernels (RKHS)
Choose m to minimize

∑_i (Yi − m(Xi))² + λ penalty(m)

where penalty(m) is a roughness penalty.

λ is a smoothing parameter that controls the amount of smoothing.

How do we construct a penalty that measures roughness? One approach is: Mercer kernels and RKHS = Reproducing Kernel Hilbert Spaces.
What is a Mercer Kernel?
A Mercer kernel K (x , y) is symmetric and positive definite:

∫∫ f (x)f (y)K (x , y) dx dy ≥ 0 for all f .

Example: K (x , y) = e^{−‖x−y‖²/2}.

Think of K (x , y) as the similarity between x and y . We will create a set of basis functions based on K .

Fix z and think of K (z, x) as a function of x . That is,

K (z, x) = Kz(x)

is a function of the second argument, with the first argument fixed.
Mercer Kernels
Let

F = { f (·) = ∑_{j=1}^k βj K (zj , ·) }.

Define a norm: ‖f‖²K = ∑_j ∑_k βj βk K (zj , zk ). ‖f‖K small means f is smooth.

If f = ∑_r αr K (zr , ·) and g = ∑_s βs K (ws, ·), the inner product is

⟨f ,g⟩K = ∑_r ∑_s αr βs K (zr ,ws).

F is a reproducing kernel Hilbert space (RKHS) because

⟨f ,K (x , ·)⟩K = f (x).
Nonparametric Regression: Mercer Kernels
Representer Theorem: Let m̂ minimize

J = ∑_{i=1}^n (Yi − m(Xi))² + λ‖m‖²K .

Then

m̂(x) = ∑_{i=1}^n α̂i K (Xi , x)

for some α̂1, . . . , α̂n.

So, we only need to find the coefficients

α̂ = (α̂1, . . . , α̂n).
Nonparametric Regression: Mercer Kernels
Plug m(x) = ∑_{i=1}^n αi K (Xi , x) into J:

J = ‖Y − Kα‖² + λ αT Kα

where Kjk = K (Xj ,Xk ).

Now we find α to minimize J. We get α̂ = (K + λI)−1Y and m̂(x) = ∑_i α̂i K (Xi , x).

The estimator depends on the amount of regularization λ. Again, there is a bias-variance tradeoff. We choose λ by cross-validation. This is like the bandwidth in smoothing kernel regression.
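The whole procedure fits in a few lines of numpy (a sketch; the Gaussian kernel bandwidth and the synthetic data are my choices):

```python
import numpy as np

# RKHS regression via the representer theorem:
# m_hat(x) = sum_i alpha_i K(X_i, x),  alpha = (K + lam*I)^{-1} Y.
rng = np.random.default_rng(9)
n = 100
X = rng.uniform(-2, 2, n)
Y = np.sin(X) + rng.normal(0, 0.1, n)

def gram(A, B, bw=0.5):
    # Gaussian Mercer kernel K(x, y) = exp(-|x - y|^2 / (2 bw^2))
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * bw ** 2))

lam = 0.1
alpha = np.linalg.solve(gram(X, X) + lam * np.eye(n), Y)
m_new = gram(np.array([0.0, 1.0]), X) @ alpha   # predictions at x = 0 and x = 1
print(np.round(m_new, 1))
```

The predictions track sin(0) = 0 and sin(1) ≈ 0.84, up to the shrinkage induced by λ.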
Smoothing Kernels Versus Mercer Kernels
Smoothing kernels: the bandwidth h controls the amount of smoothing.

Mercer kernels: the norm ‖f‖K controls the amount of smoothing.

In practice these two methods give answers that are very similar.
Mercer Kernels: Examples
[Figure: fits with a very small, small, medium, and large λ.]
Multiple Regression
Both methods extend easily to the case where X has dimension d > 1. For example, just use

K (x , y) = e^{−‖x−y‖²/2}.

However, this is hard to interpret and is subject to the curse of dimensionality. This means that the statistical performance and the computational complexity degrade as the dimension d increases.

An alternative is to use something less nonparametric such as an additive model, where we restrict m(x1, . . . , xd ) to be of the form:

m(x1, . . . , xd ) = β0 + ∑_j mj(xj).
Additive Models
Model: m(x) = β0 + ∑_{j=1}^d mj(xj).

We can take β̂0 = Ȳ and we will ignore β0 from now on.

We want to minimize

∑_{i=1}^n ( Yi − (m1(Xi1) + · · · + md (Xid )) )²

subject to the mj being smooth.
Additive Models
The backfitting algorithm:

• Set m̂j = 0 for all j.
• Iterate until convergence:
  • Iterate over j :
    • Ri = Yi − ∑_{k≠j} m̂k (Xik )
    • m̂j ←− smooth(Xj ,R)

Here, smooth(Xj ,R) is any one-dimensional nonparametric regression function.

R: gam

But what if d is large?
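A from-scratch sketch of backfitting in numpy, with a one-dimensional kernel smoother playing the role of smooth(Xj, R). The two-component synthetic model is my example (in R one would use gam / mgcv):

```python
import numpy as np

# Backfitting for m(x) = beta0 + sum_j m_j(x_j), using a 1-d kernel smoother.
def smooth(x, r, h=0.15):
    w = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h ** 2))
    fit = (w * r[None, :]).sum(axis=1) / w.sum(axis=1)   # smoother at the data points
    return fit - fit.mean()                              # center for identifiability

rng = np.random.default_rng(10)
n, d = 300, 2
X = rng.uniform(-1, 1, (n, d))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)

beta0 = Y.mean()
m = np.zeros((n, d))                        # m[:, j] holds m_hat_j at the data points
for _ in range(20):                         # outer iterations
    for j in range(d):
        R = Y - beta0 - m.sum(axis=1) + m[:, j]   # partial residuals
        m[:, j] = smooth(X[:, j], R)

resid = Y - beta0 - m.sum(axis=1)
print(round(float(resid.std()), 2))         # close to the noise level 0.1
```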
Sparse Additive Models (SpAM)
Ravikumar, Lafferty, Liu and Wasserman (JRSS, 2009).
Minimize

J = ∑_i ( Yi − ∑_j mj(Xij) )² + λ ∑_j ‖mj‖

subject to the mj being smooth. This is basically a nonparametric version of the lasso.
Solution: backfitting + soft thresholding. Iterate:

• Ri = Yi − ∑_{k≠j} m̂k (Xik )
• m̂j = smooth(R,Xj)
• m̂j ← m̂j [1 − λ/ŝj ]+   (soft thresholding)
• where ŝj² = (1/n) ∑_i m̂j(Xij)²
Example: Boston Housing Data
Predict house value Y from 10 covariates.
We added 20 irrelevant (random) covariates to test the method.
Y = house value; n = 506, d = 30.
Y = β0 + m1(crime) + m2(tax) + · · · + m30(X30) + ε.

Note that m11 = · · · = m30 = 0.

We choose λ by minimizing the estimated risk.

SpAM yields 6 nonzero functions. It correctly reports that m̂11 = · · · = m̂30 = 0.
L2 norms of fitted functions versus 1/λ
[Figure: L2 norms of the fitted component functions versus 1/λ.]
Estimated Risk Versus λ
[Figure: estimated risk (Cp) versus λ.]
Example Fits
[Figure: fitted component function m̂1 versus x1.]
Example Fits
[Figure: fitted component function m̂8 versus x8.]
Nonparametric Classification
There are many nonparametric classification techniques. Because they are similar to regression, I will not go into great detail.
Two common methods:
Plug-in:
ĥ(x) =
  +1 if m̂(x) ≥ 1/2
  −1 if m̂(x) < 1/2

where m̂(x) is from nonparametric regression.

Kernelization:

Often, replacing inner products ⟨x , y⟩ with a kernel K (x , y) converts a linear classifier into a nonparametric classifier. For example, the kernelized support vector machine.
knn Classification
Given x , find the k nearest neighbors (X(1),Y(1)), . . . , (X(k),Y(k)).

Set ĥ(x) = 1 if a majority of the Y(i) are 1’s. (0 otherwise.)

Choose k by cross-validation.

This IS a plug-in classifier. It corresponds to the kernel regression:

m̂(x) = ∑_i Yi I(‖x − Xi‖ ≤ h(x)) / ∑_i I(‖x − Xi‖ ≤ h(x))

where h(x) is the distance to the k-th nearest neighbor.
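The majority vote is a one-liner in numpy (a sketch on synthetic two-class data of my choosing):

```python
import numpy as np

# k-nearest-neighbour classification: a majority vote among the k closest
# training points.
def knn_predict(x, X, Y, k=5):
    dist = np.linalg.norm(X - x, axis=1)     # distances to all training points
    nearest = np.argsort(dist)[:k]           # indices of the k nearest
    return 1 if Y[nearest].mean() > 0.5 else 0

rng = np.random.default_rng(11)
n = 200
X = np.vstack([rng.normal(-1, 0.7, (n // 2, 2)),   # class 0 centered at (-1, -1)
               rng.normal(+1, 0.7, (n // 2, 2))])  # class 1 centered at (+1, +1)
Y = np.repeat([0, 1], n // 2)
print(knn_predict(np.array([-1.0, -1.0]), X, Y),
      knn_predict(np.array([+1.0, +1.0]), X, Y))   # prints 0 1
```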
knn Classification
[Figure: knn classification example.]
knn Classification
[Figure: another knn classification example.]
Kernelization
Recall the SVM:
α̂ = argmax_{α∈Rn} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi ,Xk ⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where β̂ = ∑_{i=1}^n α̂i Yi Xi .

Replacing ⟨Xi ,Xk ⟩ with K (Xi ,Xk ) (kernelization) makes it nonparametric. Why?

K (x , y) = ⟨φ(x), φ(y)⟩ where φ maps x into a very high dimensional space.

The kernelized (nonlinear) SVM is equivalent to a linear SVM in a high-dimensional space. Like adding interactions in regression.

Can kernelize any procedure that depends on inner products.
NONPARAMETRIC GRAPHS
Regression vs. Graphical Models
assumptions     | regression            | graphical models
parametric      | lasso                 | graphical lasso
nonparametric   | sparse additive model | nonparanormal
The Nonparanormal (Liu, Lafferty, Wasserman, 2009)
A random vector X = (X1, . . . ,Xp)T has a nonparanormal distribution

X ∼ NPN(µ,Σ, f )

in case

Z ≡ f (X ) ∼ N(µ,Σ)

where f (X ) = (f1(X1), . . . , fp(Xp)).

Joint density:

pX (x) = (2π)^{−p/2} |Σ|^{−1/2} exp( −(1/2)(f (x) − µ)T Σ−1 (f (x) − µ) ) ∏_{j=1}^p |f ′j (xj)|

• Semiparametric Gaussian copula
Examples

[Figure: examples of nonparanormal densities.]
The Nonparanormal
• Define hj(x) = Φ−1(Fj(x)) where Fj(x) = P(Xj ≤ x).

• Let Λ be the covariance matrix of Z = h(X ). Then

Xj ⫫ Xk | Xrest if and only if Λ−1jk = 0.

• Hence we need to:

1 Estimate hj(x) = Φ−1(Fj(x)).
2 Estimate the covariance matrix of Z = h(X ) using the glasso.
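Step 1 is easy to sketch with ranks: replace each Fj by a rescaled empirical CDF and apply Φ−1. (A Python sketch, my synthetic example; step 2 would feed Z into the glasso, e.g. the R package huge.)

```python
import numpy as np
from statistics import NormalDist

# Nonparanormal step 1: h_j(x) = Phi^{-1}(F_j(x)), with F_j estimated by
# rank/(n+1) so the values stay strictly inside (0, 1).
def npn_transform(X):
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    inv_cdf = NormalDist().inv_cdf
    for j in range(p):
        ranks = X[:, j].argsort().argsort() + 1    # ranks 1..n
        Z[:, j] = [inv_cdf(u) for u in ranks / (n + 1.0)]
    return Z

rng = np.random.default_rng(12)
n = 2000
W = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
X = np.column_stack([np.exp(W[:, 0]), W[:, 1] ** 3])   # monotone maps: non-Normal marginals
Z = npn_transform(X)
print(np.round(np.corrcoef(Z, rowvar=False)[0, 1], 1))  # close to the latent 0.6
```

Because the fj are monotone, ranks are preserved and the transformed data recover the latent Gaussian correlation.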
Properties
• LLW (2009) show that the resulting procedure has the same theoretical properties as the glasso, even with the dimension d increasing with n.

• If the nonparanormal is used when the data are actually Normal, little efficiency is lost.

• Implemented in R: huge
Gene-Gene Interactions for Arabidopsis thaliana
source: wikipedia.org
Dataset from Affymetrix microarrays, sample size n = 118, p = 40 genes (isoprenoid pathway).
Example Results
[Figure: estimated graphs over the 40 genes: nonparanormal (NPN), glasso, and their difference.]
Transformations for 3 Genes
[Figure: estimated transformations for genes x5, x8 and x18.]

• These genes have highly non-Normal marginal distributions.

• The graphs differ at these genes.
S&P Data (2003–2008): Graphical Lasso
S&P Data: Nonparanormal
S&P Data: Nonparanormal vs. Glasso
[Figure: common edges (left) and differences (right).]
Another approach: Forests
• Nonparanormal: Unrestricted graphs, semiparametric.
• We’ll now trade off structural flexibility for greater nonparametricity.
• Forests are graphs with no cycles.
• Force the graph to be a forest but make no distributionalassumptions.
Forest Densities (Gupta, Lafferty, Liu, Wasserman, Xu, 2011)
A distribution is supported by a forest F with edge set E(F ) if
p(x) = ∏_{(i,j)∈E(F )} [ p(xi , xj) / ( p(xi) p(xj) ) ] ∏_{k∈V} p(xk )

• For known marginal densities p(xi , xj), the best tree is obtained by minimum weight spanning tree algorithms.

• In high dimensions, a spanning tree will overfit.

• We prune back to a forest.
Step 1: Constructing a Full Tree
• Compute kernel density estimates
fn1(xi , xj) =1n1
∑s∈D1
1h2
2K
(X (s)
i − xi
h2
)K
X (s)j − xj
h2
• Estimate mutual informations
Î_{n1}(X_i, X_j) = (1/m^2) ∑_{k=1}^m ∑_{ℓ=1}^m f̂_{n1}(x_{ki}, x_{ℓj}) log [ f̂_{n1}(x_{ki}, x_{ℓj}) / ( f̂_{n1}(x_{ki}) f̂_{n1}(x_{ℓj}) ) ]
• Run Kruskal’s algorithm (Chow-Liu) on edge weights
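The Chow–Liu step can be sketched as a maximum-weight spanning tree on the mutual-information weights (equivalently, a minimum-weight tree on their negatives), built with Kruskal's algorithm and a small union–find. This is only a sketch; `mi` is a hypothetical matrix of estimated mutual informations standing in for the kernel estimates above.

```python
# Chow-Liu via Kruskal: maximum-weight spanning tree on mutual-information
# edge weights. `mi` is a hypothetical p x p matrix of estimated mutual
# informations (in the slides these come from kernel density estimates).

def kruskal_tree(mi):
    p = len(mi)
    parent = list(range(p))

    def find(a):
        # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # candidate edges sorted by decreasing estimated mutual information
    edges = sorted(((mi[i][j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree                 # edges in the order Kruskal added them

mi = [[0, .9, .1], [.9, 0, .5], [.1, .5, 0]]
print(kruskal_tree(mi))         # [(0, 1), (1, 2)]
```

Returning the edges in insertion order matters for the pruning step: the forest after k steps of Kruskal is simply the first k edges of this list.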
139
Step 2: Pruning the Tree
• Heldout risk

R̂_{n2}(f̂_F) = − ∑_{(i,j) ∈ E} ∫ f̂_{n2}(x_i, x_j) log [ f̂(x_i, x_j) / ( f̂(x_i) f̂(x_j) ) ] dx_i dx_j

• Selected forest given by

k̂ = argmin_{k ∈ {0,...,p−1}} R̂_{n2}( f̂_{T_{n1}^{(k)}} )

where T_{n1}^{(k)} is the forest obtained after k steps of Kruskal
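Once the held-out risks have been computed, the pruning itself is just an argmin over prefixes of Kruskal's edge sequence. A minimal sketch, with hypothetical placeholder risk values:

```python
# Pruning sketch: heldout_risk[k] is the estimated held-out risk of the
# forest made of the first k edges of the Kruskal sequence (values here
# are hypothetical placeholders, not computed from data).

def select_forest(tree_edges, heldout_risk):
    # choose k minimizing the held-out risk, keep the first k edges
    k_hat = min(range(len(heldout_risk)), key=heldout_risk.__getitem__)
    return tree_edges[:k_hat]

edges = [(0, 1), (1, 2), (2, 3)]
risks = [3.0, 2.1, 2.4, 2.9]    # risk for forests with k = 0, 1, 2, 3 edges
print(select_forest(edges, risks))   # [(0, 1)]
```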
140
S&P Data: Forest Graph—Oops! Outliers.
141
S&P Data: Forest Graph
142
S&P Data: Forest vs. Nonparanormal
common edges differences
143
Graph-Valued Regression
• (X1,Y1), . . . , (Xn,Yn) where Yi is high-dimensional
• We’ll discuss one particular version: graph-valued regression (Chen, Lafferty, Liu, Wasserman, 2010)
• Let G(x) be the graph for Y based on p(y |x)
• This defines a partition X1, . . . ,Xk where G(x) is constant over each partition element.
• Three methods to find G(x):
I Parametric
I Kernel graph-valued regression
I GO-CART (Graph-Optimized CART)
144
Graph-Valued Regression
multivariate regression (supervised): µ(x) = E(Y | x), Y ∈ R^p, x ∈ R^q
graphical model (unsupervised): Graph(Y) = (V, E), (j, k) ∉ E ⟺ Y_j ⊥⊥ Y_k | Y_rest

↓

graph-valued regression: Graph(Y | x)
• Gene associations from phenotype (or vice versa)
• Voting patterns from covariates on bills
• Stock interactions given market conditions, news items
145
Method I: Parametric
• Assume that Z = (X ,Y ) is jointly multivariate Gaussian.
• Σ = [ Σ_X   Σ_XY
        Σ_YX  Σ_Y ].
• Get ΣX , ΣY , and ΣXY
• Get ΩX by the glasso.
• ΣY |X = ΣY − ΣYX ΩX ΣXY .
• But, the estimated graph does not vary with different values of X .
146
Method II: Kernel Smoothing
• Y |X = x ∼ N(µ(x),Σ(x)).
Σ̂(x) = ∑_{i=1}^n K(‖x − x_i‖ / h) (y_i − µ̂(x)) (y_i − µ̂(x))^T / ∑_{i=1}^n K(‖x − x_i‖ / h)

µ̂(x) = ∑_{i=1}^n K(‖x − x_i‖ / h) y_i / ∑_{i=1}^n K(‖x − x_i‖ / h).
• Apply glasso to Σ(x)
• Easy to do but recovering X1, . . . ,Xk requires difficultpost-processing.
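The kernel-smoothed moments above can be sketched in a few lines; this is a minimal illustration with a Gaussian kernel, and the function and variable names are my own, not from the slides.

```python
import numpy as np

# Kernel-smoothed conditional mean and covariance (a sketch of Method II).
# K is a Gaussian kernel in ||x - x_i|| / h; names are illustrative.

def local_moments(x, X, Y, h):
    # kernel weights K(||x - x_i|| / h) for each training point
    w = np.exp(-0.5 * (np.linalg.norm(X - x, axis=1) / h) ** 2)
    w = w / w.sum()
    mu = w @ Y                          # weighted mean mu_hat(x)
    R = Y - mu
    Sigma = (w[:, None] * R).T @ R      # weighted covariance Sigma_hat(x)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))           # covariates x_i
Y = rng.normal(size=(200, 3))           # responses y_i
mu, Sigma = local_moments(np.zeros(2), X, Y, h=1.0)
print(mu.shape, Sigma.shape)            # (3,) (3, 3)
```

The glasso would then be applied to Sigma at each query point x of interest.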
147
Method III: Partition Estimator
• Run CART but use the Gaussian log-likelihood (on held-out data) to determine the splits
• This yields a partition X1, . . . ,Xk (and a corresponding tree)
• Run the glasso within each partition element
148
Simulated Data
[Figure: (a) the dyadic partition tree over (X1, X2), with splits at 0.5, 0.25, 0.75, 0.125, 0.375, 0.625, 0.875; (b) estimated 20-node graphs within the partition elements; (c) held-out risk versus splitting sequence number.]
149
Climate Data
[Figure: estimated graphs for the climate data over the variables CO2, CH4, CO, H2, WET, CLD, VAP, PRE, FRS, DTR, TMN, TMP, TMX, GLO, ETR, ETRN, DIR, UV; panels (a), (b), (c).]
150
CLUSTERING AND MIXTURES
151
Clustering (Unsupervised Learning)
1 Parametric Clustering (mixtures):
p(x) = ∑_{j=1}^k π_j φ(x; µ_j, Σ_j)
2 Nonparametric Clustering: Estimate the density p(x) nonparametrically, then either:
- construct a level set tree (cluster tree)
- or find the modes and their basins of attraction.
Others: k-means, hierarchical, ...
152
Mixtures
X_1, . . . , X_n ∼ p(x; θ) = ∑_{j=1}^k π_j φ(x; µ_j, Σ_j)

Find θ̂ to maximize

L(θ) = ∏_{i=1}^n p(X_i; θ)
usually by EM algorithm.
The likelihood is very non-convex and EM is not good. There are now many papers on using spectral methods and moment methods to estimate high dimensional mixtures.
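A minimal EM sketch for a one-dimensional Gaussian mixture, to make the E and M steps concrete. This is an illustration only: k is fixed, there are no restarts, and as the slide warns, the non-convex likelihood means a single EM run can land in a poor local maximum.

```python
import numpy as np

# Minimal EM for a 1-d Gaussian mixture (a sketch; no restarts, so it
# inherits the local-maximum problem the slides warn about).

def em_gmm(x, k, iters=100):
    n = len(x)
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # crude initialization
    sd = np.full(k, x.std())
    for _ in range(iters):
        # E step: responsibilities P(cluster j | x_i)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) \
                  / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means, standard deviations
        nj = r.sum(axis=0)
        pi = nj / n
        mu = (r * x[:, None]).sum(axis=0) / nj
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    return pi, mu, sd

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
pi, mu, sd = em_gmm(x, 2)
print(np.sort(mu))      # close to the true means -3 and 3
```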
153
Mixtures
Latent cluster variable: Z_i = (0,0,1,0,0,0,0,0) means X_i is from cluster 3.

P(X_i in cluster j) = π̂_j φ(X_i; µ̂_j, Σ̂_j) / ∑_k π̂_k φ(X_i; µ̂_k, Σ̂_k).
k is also a parameter. Usually chosen by penalized maximum likelihood, for example, BIC:

max_k [ log L(θ̂_k) − (d_k / 2) log n ]

where d_k is the number of parameters. (Theory is a bit dodgy, but see papers by Watanabe and also Drton.)
154
Nonparametric: Level Set Trees
Step 1: Estimate density:
p̂(x) = (1/n) ∑_{i=1}^n (1/h^d) K(‖x − X_i‖ / h)

where K is a kernel (smooth symmetric density) and h > 0 is a bandwidth.

Step 2: Find level sets: L_λ = {x : p̂(x) > λ}.

Step 3: Find the connected components of L_λ, i.e. L_λ = C_1 ∪ · · · ∪ C_r.
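The three steps can be sketched in one dimension, where connected components of the level set are just runs of adjacent grid points. This is a toy illustration under assumed settings (Gaussian kernel, a two-bump sample, a hand-picked λ), not the denpro/DeBaCl algorithms.

```python
import numpy as np

# Steps 1-3 in one dimension: kernel density estimate, a level set
# {x : p_hat(x) > lambda}, and its connected components on a grid.

def kde(x, X, h):
    # Gaussian kernel estimate p_hat at query points x (m x d grid)
    d = X.shape[1]
    z = np.linalg.norm(x[:, None, :] - X[None, :, :], axis=2) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(X) * (h * np.sqrt(2 * np.pi)) ** d)

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2, .5, (100, 1)), rng.normal(2, .5, (100, 1))])
grid = np.linspace(-4, 4, 81)[:, None]          # grid spacing 0.1
p = kde(grid, X, h=0.3)

level_set = grid[p > 0.1]                        # L_lambda with lambda = 0.1
# a gap wider than one grid step separates connected components
gaps = np.where(np.diff(level_set[:, 0]) > 0.11)[0]
ncomp = len(gaps) + 1
print(ncomp)                                     # one cluster per bump
```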
155
Nonparametric: Level Set Trees
156
Nonparametric: Level Set Trees
As we vary λ, the connected components form a tree.
Software: R: denpro or python: DeBaCl written by Brian Kent:
https://github.com/CoAxLab/DeBaCl
The following examples were done by Brian Kent.
Thanks Brian! (He has done this successfully in d = 11,000 dimensions.)
157
Example
158
Example
159
Example
160
Density Clustering II: Mode Clustering
Step I: Find p̂.

Step II: Find the modes m_1, . . . , m_k of p̂. Mean-shift algorithm: repeat:

x ← ∑_i X_i K(‖x − X_i‖ / h) / ∑_i K(‖x − X_i‖ / h)

Step III: Find the basin of attraction C_j (Morse complex) of each mode m_j. Specifically, x ∈ C_j if the gradient ascent path starting at x leads to m_j.
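The mean-shift update in Step II can be sketched directly; iterating it from a starting point climbs to the nearest mode of p̂. A minimal version with a Gaussian kernel (the function name and settings are illustrative):

```python
import numpy as np

# One mean-shift iteration, repeated to convergence: move x to the
# kernel-weighted average of the data, which ascends to a mode of p_hat.

def mean_shift(x, X, h, iters=200):
    for _ in range(iters):
        w = np.exp(-0.5 * (np.linalg.norm(X - x, axis=1) / h) ** 2)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-8:     # converged to a mode
            break
        x = x_new
    return x

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-2, .3, (150, 2)), rng.normal(2, .3, (150, 2))])
m = mean_shift(np.array([1.0, 1.0]), X, h=0.5)
print(np.round(m))       # converges to the mode near (2, 2)
```

Running this from every data point and grouping points that converge to the same mode gives the basins of attraction of Step III.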
161
Example
162
Example
[Figure: mode clustering example with two modes, labeled a and b.]
163
Example
164
Example
[Figure: example plotted in coordinates u1, u2.]
165
Example
166
Dimension Reduction
Recall: linear dimension reduction.
MDS (multidimensional scaling).
Y1, . . . ,Yn ∈ Rd .
Let φ be a linear projection onto R2 and let Zi = φ(Yi).
Minimize:
∑_{i,j} ( ‖Y_i − Y_j‖² − ‖Z_i − Z_j‖² )

over all linear projections.
Solution: project onto the first two principal components u1, u2.
167
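The principal-component projection can be sketched with an SVD of the centered data; this is a generic illustration, with names of my own choosing.

```python
import numpy as np

# Linear dimension reduction sketch: center the data and project onto the
# first two principal components via the SVD.

def pca_project(Y, k=2):
    Yc = Y - Y.mean(axis=0)                     # center the data
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T                        # Z_i = phi(Y_i), an n x k array

rng = np.random.default_rng(4)
Y = rng.normal(size=(100, 5))
Z = pca_project(Y)
print(Z.shape)    # (100, 2)
```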
Nonlinear Dimension Reduction
Many methods such as:
-Isomap
-LLE
-diffusion maps
-kernel PCA
+ many others
I will briefly explain diffusion maps.
(Coifman, Lafon, Lee, 2005)
168
Nonlinear Dimension Reduction
Imagine a Markov chain starting at X_i and jumping to another point X_j with probability

p(i, j) ∝ K_h(X_i, X_j).

The transition matrix is M = D^{−1} K, where K(i, j) = K_h(X_i, X_j) and D is diagonal with D_ii = ∑_j K_h(X_i, X_j).

The diffusion distance measures how easy it is to get from X_i to X_j. This is essentially Euclidean distance in a new space.
We can visualize this by plotting the eigenvectors of M.
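The construction above can be sketched directly: build the kernel matrix K, row-normalize it to get M = D⁻¹K, and embed using the leading nontrivial eigenvectors. A minimal version with a Gaussian kernel (names and settings are illustrative):

```python
import numpy as np

# Diffusion-map sketch: transition matrix M = D^{-1} K for a Gaussian
# kernel K, embedded via the leading nontrivial eigenvectors of M.

def diffusion_map(X, h, k=2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))              # K(i, j) = K_h(X_i, X_j)
    M = K / K.sum(axis=1, keepdims=True)        # row-stochastic: M = D^{-1} K
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    # skip the trivial constant eigenvector (eigenvalue 1)
    return vecs.real[:, order[1:k + 1]]

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
coords = diffusion_map(X, h=1.0)
print(coords.shape)    # (60, 2)
```

Plotting the rows of `coords` is the eigenvector visualization the slide describes.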
169
Nonlinear Dimension Reduction
170
Nonlinear Dimension Reduction
171
Requests
Causal Inference: let’s talk.
Counterexamples to Bayes: let’s talk (after a few drinks).
But see:
http://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/
http://normaldeviate.wordpress.com/2012/12/08/flat-priors-in-flatland-stones-paradox/
Also:
Owhadi, Scovel and Sullivan (2013). arXiv:1304.6772
172
What I Didn’t Cover
1 Bayes
2 PCA, sparse PCA
3 Nonlinear dimension reduction
4 Directed graphs
5 Convex optimization
6 Semi-supervised methods
7 Active learning
8 Sequential learning
9 ...
173
THE END
174