Statistical Machine Learning
a.k.a. High Dimensional Inference
Larry Wasserman
Carnegie Mellon University
Outline
1 High Dimensional Regression and Classification
2 Graphical Models
3 Nonparametric Methods
4 Clustering
Throughout: Strong assumptions versus weak assumptions
Dimension = p or d. Sample size = n.
References:
The Elements of Statistical Learning. Hastie, Tibshirani and Friedman.
Statistical Machine Learning. Liu and Wasserman (coming soon).
Introduction
• Machine learning is statistics with a focus on prediction, scalability and high dimensional problems.
• Regression: predict Y ∈ R from X .
• Classification: predict Y ∈ {0,1} from X .
I Example: Predict if an email X is real (Y = 1) or spam (Y = 0).
• Finding structure. Examples:
I Clustering: find groups.
I Graphical Models: find conditional independence structure.
Three Main Themes
Convexity
Convex problems can be solved quickly. If necessary, approximate the problem with a convex problem.
Sparsity
Many interesting problems are high dimensional. But often, the relevant information is effectively low dimensional.
Weak Assumptions
Make the weakest possible assumptions.
Preview: Sparse Graphical Models
Preview: Finding relations between stocks in the S&P 500:
Regression
We observe pairs (X1,Y1), . . . , (Xn,Yn).
D = {(X1,Y1), . . . , (Xn,Yn)} is called the training data.
Yi ∈ R is the response. Xi ∈ Rp is the feature (or covariate).
For example, suppose we have n subjects. Yi is the blood pressure of subject i. Xi = (Xi1, . . . ,Xip) is a vector of p = 5,000 gene expression levels for subject i.
Remember: Yi ∈ R and Xi ∈ Rp.
Given a new pair (X ,Y ), we want to predict Y from X .
Regression
Let Ŷ be a prediction of Y . The prediction error or risk is

R = E(Y − Ŷ )²

where E is the expected value (mean).
The best predictor is the regression function

m(x) = E(Y |X = x) = ∫ y f (y |x) dy .

However, the true regression function m(x) is not known. We need to estimate m(x).
Regression
Given the training data D = {(X1,Y1), . . . , (Xn,Yn)} we want to construct m̂ to make

prediction risk = R(m̂) = E(Y − m̂(X ))²

small. Here, (X ,Y ) is a new pair.
Key fact: Bias-variance decomposition:

R(m̂) = ∫ bias²(x) p(x) dx + ∫ var(x) p(x) dx + σ²

where

bias(x) = E(m̂(x)) − m(x)
var(x) = Variance(m̂(x))
σ² = E(Y − m(X ))²
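The decomposition can be checked numerically. Below is a small Monte Carlo sketch (mine, not from the slides; it uses numpy and a synthetic model of my choosing): a straight-line fit to a sinusoidal m(x) has high bias and low variance, and bias² + variance + σ² matches a direct estimate of the risk.

```python
import numpy as np

# Monte Carlo check of R(m_hat) = ∫bias²(x)p(x)dx + ∫var(x)p(x)dx + σ².
# Synthetic model (my choice): Y = sin(2πX) + eps, eps ~ N(0, σ²),
# with a deliberately biased estimator: a degree-1 polynomial fit.
rng = np.random.default_rng(0)
sigma, n, reps = 0.3, 50, 2000
x_grid = np.linspace(0, 1, 25)
m_true = np.sin(2 * np.pi * x_grid)

fits = np.empty((reps, x_grid.size))
for r in range(reps):
    X = rng.uniform(0, 1, n)
    Y = np.sin(2 * np.pi * X) + rng.normal(0, sigma, n)
    coef = np.polyfit(X, Y, deg=1)          # a line: high bias, low variance
    fits[r] = np.polyval(coef, x_grid)

bias2 = (fits.mean(axis=0) - m_true) ** 2   # bias²(x) on the grid
var = fits.var(axis=0)                      # var(x) on the grid
risk_decomp = bias2.mean() + var.mean() + sigma ** 2

# Direct estimate of R = E(Y - m_hat(X))² using fresh noise.
direct = np.mean((m_true + rng.normal(0, sigma, fits.shape) - fits) ** 2)
print(round(risk_decomp, 2), round(direct, 2))
```

The two printed numbers agree closely, and the bias term dominates the variance term, as expected for an oversimplified (linear) fit.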
Bias-Variance Tradeoff
Prediction Risk = Bias² + Variance
Prediction methods with low bias tend to have high variance.
Prediction methods with low variance tend to have high bias.
For example, the predictor m̂(x) ≡ 0 has 0 variance but will be terribly biased.

To predict well, we need to balance the bias and the variance. We begin with linear methods.
Linear Regression
Try to find the best linear predictor, that is, a predictor of the form:
m(x) = β0 + β1x1 + · · ·+ βpxp.
Important: We do not assume that the true regression function is linear.

We can always define x1 = 1. Then the intercept is β1 and we can write
m(x) = β1x1 + · · ·+ βpxp = βT x
where β = (β1, . . . , βp) and x = (x1, . . . , xp).
Low Dimensional Linear Regression
Assume for now that p (= length of each Xi) is small. To find a good linear predictor we choose β to minimize the training error:

training error = (1/n) ∑_{i=1}^n (Yi − βT Xi)²

The minimizer β̂ = (β̂1, . . . , β̂p) is called the least squares estimator.
Low Dimensional Linear Regression
The least squares estimator is:
β̂ = (XTX)−1XTY

where X is the n × d matrix

X =
  X11 X12 · · · X1d
  X21 X22 · · · X2d
  ...  ...      ...
  Xn1 Xn2 · · · Xnd

and Y = (Y1, . . . ,Yn)T .

In R: lm(y ∼ x)
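The slides use R's lm; the same formula β̂ = (XTX)−1XTY can be sketched in numpy on synthetic data (the data-generating numbers below are my choices):

```python
import numpy as np

# The least squares formula beta_hat = (X^T X)^{-1} X^T Y in numpy.
# (The slides use R's lm; this synthetic example is mine.)
rng = np.random.default_rng(1)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d - 1))])  # x1 = 1 gives the intercept
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# lstsq solves the normal equations stably (avoids forming the inverse).
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.round(beta_hat, 1))
```

With n much larger than d and little noise, β̂ lands very close to the true coefficients.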
Low Dimensional Linear Regression
Summary: the least squares estimator is m̂(x) = β̂T x = ∑_j β̂j xj where

β̂ = (XTX)−1XTY.

When we observe a new X , we predict Y to be

Ŷ = m̂(X ) = β̂T X .

Our goals are to improve this by:

(i) dealing with high dimensions
(ii) using something more flexible than linear predictors.
Example
Y = HIV resistance
Xj = amino acid in position j of the virus.
Y = β0 + β1X1 + · · ·+ β100X100 + ε
[Figure: four panels for the HIV example. Top left: β̂ versus position. Top right: marginal regression coefficients (one-at-a-time) versus position. Bottom left: residuals Yi − Ŷi versus fitted values Ŷi. Bottom right: a sparse regression (coming up soon).]
High Dimensional Linear Regression
Now suppose p is large. We might even have p > n (more covariates than data points).

The least squares estimator is not defined since XTX is not invertible. The variance of the least squares prediction is huge.
Recall the bias-variance tradeoff:
Prediction Error = Bias² + Variance
We need to increase the bias so that we can decrease the variance.
Ridge Regression
Recall that the least squares estimator minimizes the training error (1/n) ∑_{i=1}^n (Yi − βT Xi)².

Instead, we can minimize the penalized training error:

(1/n) ∑_{i=1}^n (Yi − βT Xi)² + λ‖β‖₂²

where ‖β‖₂ = sqrt(∑_j βj²).

The solution is:

β̂ = (XTX + λI)−1XTY
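The closed form is easy to sketch in numpy (synthetic data of my choosing): as λ grows, the coefficients shrink toward 0.

```python
import numpy as np

# Ridge closed form: beta_hat = (X^T X + lam*I)^{-1} X^T Y.
# As lam grows, the coefficients shrink toward 0.
rng = np.random.default_rng(2)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 3.0
Y = X @ beta_true + rng.normal(size=n)

def ridge(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

b_ls = ridge(X, Y, 0.0)     # lam = 0: least squares
b_big = ridge(X, Y, 1e6)    # lam -> infinity: beta -> 0
print(np.linalg.norm(b_ls) > np.linalg.norm(ridge(X, Y, 10.0)) > np.linalg.norm(b_big))
```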
Ridge Regression
The tuning parameter λ controls the bias-variance tradeoff:
λ = 0 =⇒ least squares.
λ = ∞ =⇒ β̂ = 0.

We choose λ to minimize R̂(λ), where R̂(λ) is an estimate of the prediction risk.
Ridge Regression
To estimate the prediction risk, do not use the training error:

R̂training = (1/n) ∑_{i=1}^n (Yi − Ŷi)², Ŷi = XiT β̂

because it is biased: E(R̂training) < R(β̂).

Instead, we use leave-one-out cross-validation:

1. leave out (Xi ,Yi)
2. find β̂
3. predict Yi : Ŷ(i) = β̂T Xi
4. repeat for each i
Leave-one-out cross-validation
R̂(λ) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))² = (1/n) ∑_{i=1}^n (Yi − Ŷi)²/(1 − Hii)²

≈ R̂training/(1 − k/n)² ≈ R̂training + 2 k σ̂²/n

where

H = X(XTX + λI)−1XT
k = trace(H)

k is the effective degrees of freedom.
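The shortcut can be checked directly. A numpy sketch (my synthetic example): for ridge with fixed λ, the identity Yi − Ŷ(i) = (Yi − Ŷi)/(1 − Hii) is exact, so the shortcut matches brute-force refitting.

```python
import numpy as np

# Check the leave-one-out shortcut for ridge:
# Y_i - Yhat_(i) = (Y_i - Yhat_i) / (1 - H_ii),  H = X(X^T X + lam*I)^{-1} X^T,
# against brute-force refitting on synthetic data.
rng = np.random.default_rng(3)
n, p, lam = 40, 5, 2.0
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T
k = np.trace(H)                              # effective degrees of freedom
shortcut = np.mean(((Y - H @ Y) / (1 - np.diag(H))) ** 2)

errs = []
for i in range(n):                           # actually leave each point out
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep] + lam * np.eye(p), X[keep].T @ Y[keep])
    errs.append((Y[i] - X[i] @ b) ** 2)
print(np.isclose(shortcut, np.mean(errs)), round(k, 1))
```

Note k = trace(H) < p whenever λ > 0, which is the sense in which ridge "spends" fewer than p degrees of freedom.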
Example
Y = 3X1 + · · ·+ 3X5 + 0X6 + · · ·+ 0X1000 + ε
n = 100, p = 1,000.
So there are 1000 covariates but only 5 are relevant.
What does ridge regression do in this case?
Ridge Regularization Paths
[Figure: ridge regularization paths, coefficients β̂ versus λ.]
Sparse Linear Regression
Ridge regression does not take advantage of sparsity.
Maybe only a small number of covariates are important predictors. How do we find them?

We could fit many submodels (with a small number of covariates) and choose the best one. This is called model selection.

Now the inaccuracy is

prediction error = bias² + variance

The bias is the error due to omitting important variables. The variance is the error due to having to estimate many parameters.
The Bias-Variance Tradeoff
[Figure: variance increases with the number of variables; bias decreases.]
The Bias-Variance Tradeoff
This is a Goldilocks problem. Can’t use too few or too many variables.
Have to choose just the right variables.
Have to try all models with one variable, two variables,...
If there are p variables then there are 2^p models.

Suppose we have 50,000 genes. We have to search through 2^50,000 models. But 2^50,000 > number of atoms in the universe.

This problem is NP-hard. This was a major bottleneck in statistics for many years.
You are Here
[Figure: nested complexity classes P, NP, NP-Complete, NP-Hard; variable selection (marked X) sits in the NP-Hard region.]
Two Things that Save Us
Two key ideas to make this feasible are sparsity and convexrelaxation.
Sparsity: probably only a few genes are needed to predict some disease Y . In other words, of β1, . . . , β50,000 most βj ≈ 0.
But which ones?? (Needle in a haystack.)
Convex Relaxation: Replace model search with something easier.
It is the marriage of these two concepts that makes it all work.
Sparsity
Look at this:

β = (5,5,5,0,0,0, . . . ,0).
This vector is high-dimensional but it is sparse.
Here is a less obvious example:
β = (50, 12, 6, 3, 2, 1.4, 1, 0.8, 0.6, 0.5, . . .)

It turns out that, if the βj ’s die off fairly quickly, then β behaves like a sparse vector.
Sparsity
We measure the (lack of) sparsity of β = (β1, . . . , βp) with the q-norm

‖β‖q = (|β1|^q + · · · + |βp|^q)^{1/q} = (∑_j |βj|^q)^{1/q}.
Which values of q measure (lack of) sparsity?
sparse:     a = (1, 0, 0, 0, · · · , 0)
not sparse: b = (.001, .001, .001, .001, · · · , .001)

       q = 0   q = 1   q = 2
‖a‖q     1       1       1
‖b‖q     p      √p       1
         ✓       ✓       ×

Lesson: Need to use q ≤ 1 to measure sparsity. (Actually, q < 2 ok.)
Sparsity
So we estimate β = (β1, . . . , βp) by minimizing
∑_{i=1}^n (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖q ≤ small.
Can we do this minimization?
If we use q = 0 this turns out to be the same as searching through all 2^p models. Ouch!

What about other values of q?

What does the set {β : ‖β‖q ≤ small} look like?
The set {β : ‖β‖q ≤ 1} when p = 2

[Figure: the ℓq balls for q = 1/4, q = 1/2, q = 1 and q = 3/2.]
Sparsity Meets Convexity
We need these sets to have a nice shape (convex). If so, the minimization is no longer NP-hard. In fact, it is easy.

Sensitivity to sparsity: q ≤ 1 (actually, q < 2 suffices)
Convexity (niceness): q ≥ 1

This means we should use q = 1.
Where Sparsity and Convexity Meet
[Figure: sparsity requires q ≤ 1 and convexity requires q ≥ 1; the two meet at q = 1.]
Sparsity Meets Convexity: The LASSO
So we estimate β = (β1, . . . , βp) by minimizing
∑_{i=1}^n (Yi − [β0 + β1Xi1 + · · · + βpXip])²

subject to the constraint that β is sparse, i.e. ‖β‖1 = ∑_j |βj| ≤ small.

This is called the LASSO. Invented by Rob Tibshirani in 1996.
Lasso
The result is an estimated vector

β̂1, . . . , β̂p

Most are 0!

Magically, we have done model selection without searching (thanks to sparsity plus convexity).

The next picture explains why some β̂j = 0.
Sparsity: How Corners Create Sparse Estimators
The Lasso: HIV Example Again
• Y is resistance to HIV drug.
• Xj = amino acid in position j of the virus.
• p = 99, n ≈ 100.
The Lasso: An Example
[Figure: LASSO coefficient paths; standardized coefficients versus |β̂|/max|β̂|.]
Selecting λ
We choose the sparsity level by estimating prediction error. (Ten-fold cross-validation.)
[Figure: estimated risk versus number of selected variables p.]
The Lasso: An Example
[Figure: lasso coefficient estimates β̂ versus position for the HIV example.]
The Lasso: Another Example (using glmnet)
[Figure: glmnet coefficient paths versus L1 norm; the number of nonzero coefficients is shown along the top.]
The Lasso: Another Example (using glmnet)
[Figure: cross-validated mean-squared error versus log(λ); the number of nonzero coefficients is shown along the top.]
Sparvexity
To summarize: we penalize the sum of squares with
‖β‖q = (∑_j |βj|^q)^{1/q}.
To get a sparse answer: q < 2.
To get a convex problem: q ≥ 1.
So q = 1 works.
Sparvexity (the marriage of sparsity and convexity) is one of the biggest developments in statistics and machine learning.
The Lasso
• β̂(λ) is called the lasso estimator. Then define

Ŝ(λ) = {j : β̂j(λ) ≠ 0}.

R: lars(y,x) or glmnet(y,x)

• (Optional) After you find Ŝ(λ), you can re-fit the model by doing least squares on the sub-model Ŝ(λ). (Less bias but higher variance.)
The Lasso
Choose λ by risk estimation. (For example, 10-fold cross-validation.)
If you use the “re-fit” approach, then you can apply leave-one-out cross-validation:

R̂(λ) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))² = (1/n) ∑_{i=1}^n (Yi − Ŷi)²/(1 − Hii)² ≈ (1/n) RSS/(1 − s/n)²

where H is the hat matrix and s = #{j : β̂j ≠ 0}.

Choose λ̂ to minimize R̂(λ).
The Lasso
The complete steps are:
1 Find β̂(λ) and Ŝ(λ) for each λ.
2 Choose λ̂ to minimize estimated risk.
3 Prediction: Ŷ = XT β̂.
Some Convexity Theory for the Lasso
Consider a simpler model than regression: Suppose Y ∼ N(µ,1). Let µ̂ minimize

A(µ) = (1/2)(Y − µ)² + λ|µ|.
How do we minimize A(µ)?
• Since A is convex, we set the subderivative to 0. Recall that c is a subderivative of f at x0 if

f (x) − f (x0) ≥ c(x − x0) for all x.

• The subdifferential ∂f (x0) is the set of subderivatives. Also, x0 minimizes f if and only if 0 ∈ ∂f (x0).
ℓ1 and Soft Thresholding
• If f (µ) = |µ| then

∂f (µ) =
  −1        if µ < 0
  [−1, 1]   if µ = 0
  +1        if µ > 0.

• Hence,

∂A(µ) =
  µ − Y − λ                if µ < 0
  {−Y + λz : −1 ≤ z ≤ 1}   if µ = 0
  µ − Y + λ                if µ > 0.
ℓ1 and Soft Thresholding
• µ̂ minimizes A(µ) if and only if 0 ∈ ∂A(µ̂).

• So

µ̂ =
  Y + λ   if Y < −λ
  0       if −λ ≤ Y ≤ λ
  Y − λ   if Y > λ.

• This can be written as

µ̂ = soft(Y , λ) ≡ sign(Y ) (|Y | − λ)+.
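The soft-thresholding operator is a one-liner, and the claim that it minimizes A(µ) is easy to check numerically (a numpy sketch; the test point y = 1.7, λ = 1 is my choice):

```python
import numpy as np

# Soft thresholding: soft(y, lam) = sign(y) * (|y| - lam)_+ .
def soft(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Quick numerical check that it minimizes A(mu) = 0.5*(y - mu)^2 + lam*|mu|.
y, lam = 1.7, 1.0
grid = np.linspace(-3.0, 3.0, 600001)
A = 0.5 * (y - grid) ** 2 + lam * np.abs(grid)
print(float(soft(y, lam)), float(grid[np.argmin(A)]))
```

Both numbers agree: the grid minimizer of A coincides with soft(y, λ) = 0.7.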
ℓ1 and Soft Thresholding

[Figure: the soft-thresholding function soft(x , λ) versus x; it is 0 on [−λ, λ].]
The Lasso: Computing β
• Minimize ∑_i (Yi − βT Xi)² + λ‖β‖1.
  - use lars (least angle regression), or
  - coordinate descent: set β̂ = (0, . . . ,0) then iterate the following:
    • for j = 1, . . . ,d :
      • set Ri = Yi − ∑_{s≠j} β̂s Xsi
      • β̂j,LS = least squares fit of the Ri ’s on Xj
      • β̂j ← soft(β̂j,LS, λ/∑_i X²ij)

R: glmnet
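The coordinate descent loop above can be written out directly. A from-scratch numpy sketch (my synthetic example; the threshold λ/∑_i X²ij corresponds to the (1/2)·RSS + λ‖β‖1 scaling of the objective — in practice use glmnet):

```python
import numpy as np

# Coordinate descent for the lasso, following the slide's update
# beta_j <- soft(beta_{j,LS}, lam / sum_i X_ij^2).
def soft(b, lam):
    return np.sign(b) * max(abs(b) - lam, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    n, d = X.shape
    beta = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)                 # sum_i X_ij^2 for each j
    for _ in range(n_sweeps):
        for j in range(d):
            R = Y - X @ beta + X[:, j] * beta[j]  # partial residuals R_i
            b_ls = (X[:, j] @ R) / col_ss[j]      # LS fit of R on X_j
            beta[j] = soft(b_ls, lam / col_ss[j])
    return beta

rng = np.random.default_rng(4)
n, d = 100, 20
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:3] = 3.0                               # sparse truth
Y = X @ beta_true + rng.normal(size=n)
beta_hat = lasso_cd(X, Y, lam=50.0)
print((beta_hat != 0).sum())  # most coefficients are exactly zero
```

The soft-thresholding step is what produces exact zeros: model selection happens inside the optimization.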
Variations on the Lasso
• Sparse logistic regression: later.
• Elastic Net: minimize

∑_{i=1}^n (Yi − βT Xi)² + λ1‖β‖1 + λ2‖β‖₂²

• Group Lasso: partition the coefficients into groups,

β = (β1, . . . , βk | · · · | βt , . . . , βp) = (v1, . . . , vm)

and minimize:

∑_{i=1}^n (Yi − βT Xi)² + λ ∑_{j=1}^m ‖vj‖
Summary
• For low dimensional (linear) prediction, we can use least squares.
• For high dimensional linear regression, we face a bias-variance tradeoff: omitting too many variables causes bias while including too many variables causes high variance.
• The key is to select a good subset of variables.
• The lasso (L1-regularized least squares) is a fast way to select variables.

• If there are good, sparse, linear predictors, the lasso will work well.
Weak Assumptions Versus Strong Assumptions
With virtually no assumptions, the following is true. Let β∗ minimize

E(Y − βT X )² subject to ‖β‖1 ≤ L.

So β∗T x is the best sparse linear predictor. Let β̂ be the lasso estimator. Then

R(β̂) − R(β∗) = OP( sqrt( (L²/n) log(p) ) ).

The lasso is risk consistent as long as p = o(e^n). We do not assume that the true model is linear. We make no assumptions on the design matrix.
Strong Assumptions
To recover the true support of β or make inferences about β, many authors assume:

1 Incoherence (covariates are almost independent).
2 True model is linear.
3 Constant variance.
4 Donut (no small effects).

In signal processing, these assumptions are realistic. In data analysis, I can’t imagine why we would assume these things.
Inference
Approach I: The HARNESS (joint work with Ryan Tibshirani)
Split the data into two halves: D1 and D2.
Use D1 to choose the variables S.
Use D2 to get confidence intervals for the βj ’s, the prediction risk R, and the prediction risk Rj when dropping βj . For example,

R̂(Ŝ) = (1/n) ∑_{i∈D2} |Yi − β̂T Xi |

estimates

R(Ŝ) = E|Y − β̂T X |.
Inference From D2
Selected Ŝ ⊂ {1, . . . ,p} using D1. Using D2, do the usual least-squares linear regression.

Can get confidence intervals for:

1 R = E|Y − β̂T X |.
2 Rj = E|Y − β̂(−j)T X | − E|Y − β̂T X |.
3 βS = (βj : j ∈ S).

But notice that βS is the projection of m(x) onto (xj : j ∈ S). It depends on which variables are selected.
Example (Wine Data): Confidence Intervals
[Figure: confidence intervals. Left: the Rj ’s. Right: the βj ’s.]

Selected variables: Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide and pH.
Sequential Approach
Approach II: Sequential. Test each variable as it is added in by the algorithm.

See: Lockhart, Taylor, Tibshirani and Tibshirani (2013). (arxiv.org/abs/1301.7161. This will appear in the Annals of Statistics with discussion.)

Other approaches: Javanmard and Montanari (2013), Bühlmann and van de Geer (2013).

Warning: these approaches make strong assumptions.
Classification
Same as regression except that Yi ∈ {−1,+1}.

Classification risk:

R(h) = P(Y ≠ h(X )).

Best classifier (Bayes’ classifier):

h(x) =
  +1 if m(x) ≥ 1/2
  −1 if m(x) < 1/2

where m(x) = P(Y = +1|X = x).
Linear Classifiers
h(x) =
  +1 if β0 + βT x ≥ 0
  −1 if β0 + βT x < 0

that is, h(x) = sign(β0 + βT x).

Minimizing the training error

(1/n) ∑_{i=1}^n I(Yi ≠ h(Xi))

over β is not feasible (non-convex).
(Sparse) Logistic Regression
m(x) = P(Y = 1|X = x) = e^{xTβ}/(1 + e^{xTβ})

and, in the high-dimensional case, we minimize

− log L(β) + λ ∑_j |βj |

where the likelihood is

L(β) = ∏_{Yi=+1} m(Xi) ∏_{Yi=−1} (1 − m(Xi)).

R: glmnet
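The slides use glmnet; one simple way to sketch the same objective from scratch is proximal gradient descent — a gradient step on the negative log-likelihood followed by soft thresholding. (A numpy sketch, my synthetic example, with 0/1 labels.)

```python
import numpy as np

# Sparse logistic regression sketch: minimize -log L(beta) + lam * sum|beta_j|
# by proximal gradient descent (gradient step, then soft thresholding).
def sparse_logistic(X, y, lam, step=0.1, n_iter=3000):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # m(x) = P(Y = 1 | x)
        b = beta - step * X.T @ (p - y) / n        # gradient step on -loglik/n
        beta = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)
    return beta

rng = np.random.default_rng(5)
n, d = 500, 10
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:2] = 2.0
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat = sparse_logistic(X, y, lam=0.1)
print(np.nonzero(beta_hat)[0])  # indices of the nonzero coefficients
```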
Support Vector Machine (SVM)
Choose (β0, β) to minimize
(1/n) ∑_{i=1}^n [1 − Yi m(Xi)]+ + λ ∑_j βj²

where [a]+ = max{a, 0} and m(x) = β0 + βT x.

R: libsvm, e1071, ...
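The hinge-loss objective above can be minimized with plain subgradient descent. A numpy sketch on synthetic, linearly separable data (my example; use libsvm / e1071 in practice):

```python
import numpy as np

# Subgradient descent on the SVM objective:
# (1/n) sum_i [1 - Y_i m(X_i)]_+ + lam * sum_j beta_j^2,  m(x) = beta0 + beta^T x.
def svm_fit(X, Y, lam=0.01, step=0.05, n_iter=2000):
    n, d = X.shape
    beta0, beta = 0.0, np.zeros(d)
    for _ in range(n_iter):
        viol = Y * (beta0 + X @ beta) < 1            # margin violations
        g = -(Y[viol, None] * X[viol]).sum(axis=0) / n + 2 * lam * beta
        g0 = -Y[viol].sum() / n
        beta, beta0 = beta - step * g, beta0 - step * g0
    return beta0, beta

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)       # linearly separable labels
beta0, beta = svm_fit(X, Y)
pred = np.sign(beta0 + X @ beta)
print((pred == Y).mean())                            # training accuracy
```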
SVM: Geometric View
[Figure: separating hyperplane with margins for the +1 and −1 classes.]
SVM: Geometric View
The optimization problem can be written as:
min_{β0,β} (1/2)‖β‖₂² + (1/λ′) ∑_{i=1}^n ξi

subject to ξi ≥ 1 − Yi(β0 + XiT β) and ξi ≥ 0 for all i = 1, . . . ,n. Or, in dual form:

α̂ = argmax_{α∈Rn} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi ,Xk ⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where

β̂ = ∑_{i=1}^n α̂i Yi Xi .

- quadratic program
- changing ⟨Xi ,Xk ⟩ to K (Xi ,Xk ) (kernelization) makes it nonlinear.
Surrogate Losses
h(x) = sign(m(x)) where m(x) = β0 + βT x .

The empirical classification risk is

R̂ = (1/n) ∑_{i=1}^n I(Yi ≠ h(Xi)) = (1/n) ∑_{i=1}^n I(Yi m(Xi) ≤ 0).

Minimizing R̂ is hard because it is non-convex. The idea is to replace the non-convex function I(u ≤ 0) with a convex surrogate.

surrogate   form             classifier
hinge       [1 − u]+         support vector machine
logistic    log[1 + e^{−u}]  logistic regression
Surrogate Losses
[Figure: the 0–1 loss and its hinge and logistic surrogates as functions of u = y m(x).]
Example
South African Heart Disease. Outcome is Y ∈ {0,1} (disease/non-disease). There are 9 predictors X1, . . . ,X9. I added 100 extra (irrelevant) predictors.

n = 462. d = 109.

Use sparse logistic regression.
Example
[Figure: left, cross-validated misclassification error versus log(λ); right, coefficient paths versus log(λ). The number of nonzero coefficients is shown along the top.]
Regression is Harder Than Classification
h(x) =
  +1 if m̂(x) > 1/2
  −1 otherwise.

The risk of the plug-in classifier satisfies

R(h) − R∗ ≤ sqrt( ∫ (m̂(x) − m∗(x))² dPX (x) ),

where R∗ = inf_h R(h).
Regression is Harder Than Classification
[Figure: two regression functions m(x); for classification, only whether m(x) lies above or below 1/2 matters.]
Example: Political Blogs
Features: word counts, and links.
Example: Political Blogs
[Figure: cross-validated errors versus log(λ) for four classifiers: Sparse LR, SVM, Sparse LR (+ links), SVM (+ links).]
Undirected Graphs
UNDIRECTED GRAPHS
Undirected Graphs
An approach to representing multivariate data.
Unsupervised: no response variables.
Two main types:
1 Correlation Graphs
2 Partial Correlation Graphs
Correlation Graphs
X = (X (1), . . . ,X (d)). Put an edge between j and k if θjk = Cov(X (j),X (k)) ≠ 0.
Bootstrap method: (Wasserman, Kolar, Rinaldo arXiv:1309.6933).
Step 1: Estimate Σ:

S = (1/n) ∑_{i=1}^n (Xi − X̄ )(Xi − X̄ )T .

Step 2: Bootstrap: draw B bootstrap samples, each of size n, giving S∗1 , . . . ,S∗B . Find the upper quantile zα:

(1/B) ∑_{j=1}^B I( max_{r,s} √n |S∗j (r , s) − S(r , s)| > zα ) = α.
Correlation Graphs
Step 3: Put an edge between (j , k) if

0 ∉ S(j , k) ± zα/√n.

Then

P(Ĝ = G) ≥ 1 − α − C log d / n^{1/8}.

This follows from results in Wasserman, Kolar, Rinaldo (2013) arXiv:1309.6933 and Chernozhukov, Chetverikov and Kato (2012) arXiv:1212.6906.
Example: n = 100 and d = 100.
Markov (Conditional Independence) Graphs
Let X = (X (1), . . . ,X (d)). A graph G = (V ,E) has vertices V and edges E . The independence graph has one vertex for each X (j).

X — Y — Z

means that

X ⫫ Z | Y

V = {X ,Y ,Z} and E = {(X ,Y ), (Y ,Z )}.
Markov Property
A probability distribution P satisfies the global Markov property with respect to a graph G if:

for any disjoint vertex subsets A, B, and C such that C separates A and B,

XA ⫫ XB | XC .
Example
[Figure: a graph on vertices 1, . . . , 8.]

C = {3,7} separates A = {1,2} and B = {4,8}. Hence,

{X (1),X (2)} ⫫ {X (4),X (8)} | {X (3),X (7)}
Example: Protein networks (Maslov 2002)
Distributions Encoded by a Graph
• I(G) = all independence statements implied by the graph G.

• I(P) = all independence statements implied by P.

• P(G) = {P : I(G) ⊆ I(P)}.

• If P ∈ P(G) we say that P is Markov to G.

• The graph G represents the class of distributions P(G).

• Goal: Given random vectors X1, . . . ,Xn ∼ P, estimate G.
Gaussian Case
• If X ∼ N(µ,Σ) then there is no edge between Xi and Xj if and only if

Ωij = 0

where Ω = Σ−1.

• Given X1, . . . ,Xn ∼ N(µ,Σ).

• For n > p, let Ω̂ = Σ̂−1 and test

H0 : Ωij = 0 versus H1 : Ωij ≠ 0.
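A small numpy sketch of the n > p case (my synthetic example): simulate from a chain-structured precision matrix, invert the sample covariance, and read the graph off the large partial correlations.

```python
import numpy as np

# Gaussian graphical model with n > p: no edge (i, j) iff Omega_ij = 0,
# where Omega = Sigma^{-1}.  Simulate a chain 0-1-2-3 and recover its edges.
rng = np.random.default_rng(7)
p, n = 4, 5000
Omega = np.eye(p)
for i in range(p - 1):                     # chain: nonzeros on the first off-diagonal
    Omega[i, i + 1] = Omega[i + 1, i] = 0.4
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

Omega_hat = np.linalg.inv(np.cov(X, rowvar=False))
d = np.sqrt(np.diag(Omega_hat))
partial = -Omega_hat / np.outer(d, d)      # partial correlations
edges = sorted((i, j) for i in range(p) for j in range(i + 1, p)
               if abs(partial[i, j]) > 0.1)
print(edges)                               # recovers the chain's edges
```

The hard threshold 0.1 stands in for a formal test of H0 : Ωij = 0; with p > n the inverse does not exist and one needs the lasso-based methods on the next slides.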
Gaussian Case: p > n
Two approaches:
• parallel lasso (Meinshausen and Bühlmann)

• graphical lasso (glasso; Banerjee et al., Hastie et al.)

Parallel Lasso:

1 For each j = 1, . . . ,d (in parallel): regress Xj on all the other variables using the lasso.

2 Put an edge between Xi and Xj if each appears in the regression of the other.
Glasso (Graphical Lasso)
The glasso minimizes:
−ℓ(Ω) + λ ∑_{j≠k} |Ωjk |

where

ℓ(Ω) = (1/2)(log |Ω| − tr(ΩS))

is the Gaussian loglikelihood (maximized over µ).

There is a simple blockwise gradient descent algorithm for minimizing this function. It is very similar to the previous algorithm.

R packages: glasso and huge
Graphs on the S&P 500
• Data from Yahoo! Finance (finance.yahoo.com).
• Daily closing prices for 452 stocks in the S&P 500 between 2003 and 2008 (before onset of the “financial crisis”).
• Log returns: Xtj = log(St,j/St−1,j).

• Winsorized to trim outliers.

• In the following graphs, each node is a stock, and color indicates GICS industry:
Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunications Services, Utilities.
S&P 500: Graphical Lasso
S&P 500: Parallel Lasso
Choosing λ
Can use:
1 Cross-validation
2 BIC = log-likelihood − (p/2) log n
3 AIC = log-likelihood − p

where p = number of parameters.
Remarks
1 Versions for discrete random variables.
2 Assumption-free approach (Wasserman, Kolar and Rinaldo 2013). (Includes confidence intervals.)
3 Very active area, especially in bio-informatics.
4 Nonparametric approaches (coming up).
NONPARAMETRIC MACHINE LEARNING
Nonparametric Regression
Given (X1,Y1), . . . , (Xn,Yn) predict Y from X .
Assume only that Yi = m(Xi) + εi where m(x) is a smooth function of x .

The most popular methods are kernel methods. However, there are two types of kernels:

1 Smoothing kernels
2 Mercer kernels

Smoothing kernels involve local averaging.
Mercer kernels involve regularization.
Smoothing Kernels
• Smoothing kernel estimator:
m̂h(x) = ∑_{i=1}^n Yi Kh(Xi , x) / ∑_{i=1}^n Kh(Xi , x) = ∑_i wi(x) Yi

where Kh(x , z) is a kernel such as

Kh(x , z) = exp( −‖x − z‖²/(2h²) )

and h > 0 is called the bandwidth.

• m̂h(x) is just a local average of the Yi ’s near x .

• The bandwidth h controls the bias-variance tradeoff:
small h = large variance while large h = large bias.
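The estimator is a few lines of numpy (a sketch on synthetic data of my choosing):

```python
import numpy as np

# The smoothing-kernel (local average) estimator:
# m_hat_h(x) = sum_i Y_i K_h(X_i, x) / sum_i K_h(X_i, x).
def m_hat(x, X, Y, h):
    w = np.exp(-(X - x) ** 2 / (2 * h ** 2))   # Gaussian kernel weights
    return (w * Y).sum() / w.sum()

rng = np.random.default_rng(8)
n = 300
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, n)

# The local average near x = 0.25 should be close to sin(pi/2) = 1.
print(round(m_hat(0.25, X, Y, h=0.05), 1))
```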
Example: Some Data – Plot of Yi versus Xi
Example: m(x)
m(x) is a local average
Effect of the bandwidth h
[Figure: fits with a very small, small, medium, and large bandwidth.]
Smoothing Kernels
Risk = E(Y − m̂h(X ))² = bias² + variance + σ².

bias² ≈ h⁴

variance ≈ 1/(n h^d) where d = dimension of X .

σ² = E(Y − m(X ))² is the unavoidable prediction error.

small h: low bias, high variance (undersmoothing)
large h: high bias, low variance (oversmoothing)
Risk Versus Bandwidth
[Figure: risk versus bandwidth h; the bias increases and the variance decreases with h, and the risk is minimized at an optimal h in between.]
Estimating the Risk: Cross-Validation
To choose h we need to estimate the risk R(h). We can estimate the risk by using cross-validation.

1 Omit (Xi ,Yi) to get m̂h,(i), then predict: Ŷ(i) = m̂h,(i)(Xi).
2 Repeat this for all observations.
3 The cross-validation estimate of risk is:

R̂(h) = (1/n) ∑_{i=1}^n (Yi − Ŷ(i))².

Shortcut formula:

R̂(h) = (1/n) ∑_{i=1}^n ( (Yi − Ŷi)/(1 − Lii) )²

where Lii = Kh(Xi ,Xi)/∑_t Kh(Xi ,Xt ).
Summary so far
1 Compute m̂h for each h.
2 Estimate the risk R̂(h).
3 Choose the bandwidth ĥ to minimize R̂(h).
4 Let m̂(x) = m̂ĥ(x).
Example
[Figure: the data, the estimated risk versus h, and the chosen fit.]
Mercer Kernels (RKHS)
Choose m to minimize

∑_i (Yi − m(Xi))² + λ penalty(m)

where penalty(m) is a roughness penalty.

λ is a smoothing parameter that controls the amount of smoothing.

How do we construct a penalty that measures roughness? One approach is: Mercer kernels and RKHS = Reproducing Kernel Hilbert Spaces.
What is a Mercer Kernel?
A Mercer kernel K (x , y) is symmetric and positive definite:

∫∫ f (x)f (y)K (x , y) dx dy ≥ 0 for all f .

Example: K (x , y) = e^{−‖x−y‖²/2}.

Think of K (x , y) as the similarity between x and y . We will create a set of basis functions based on K .

Fix z and think of K (z, x) as a function of x . That is,

K (z, x) = Kz(x)

is a function of the second argument, with the first argument fixed.
Mercer Kernels
Let

F = { f (·) = ∑_{j=1}^k βj K (zj , ·) }.

Define a norm: ‖f‖²K = ∑_j ∑_k βj βk K (zj , zk ). ‖f‖K small means f is smooth.

If f = ∑_r αr K (zr , ·) and g = ∑_s βs K (ws, ·), the inner product is

⟨f ,g⟩K = ∑_r ∑_s αr βs K (zr ,ws).

F is a reproducing kernel Hilbert space (RKHS) because

⟨f ,K (x , ·)⟩K = f (x).
Nonparametric Regression: Mercer Kernels
Representer Theorem: Let m̂ minimize

J = ∑_{i=1}^n (Yi − m(Xi))² + λ‖m‖²K .

Then

m̂(x) = ∑_{i=1}^n α̂i K (Xi , x)

for some α̂1, . . . , α̂n.

So, we only need to find the coefficients

α̂ = (α̂1, . . . , α̂n).
Nonparametric Regression: Mercer Kernels
Plug m(x) = ∑_{i=1}^n αi K (Xi , x) into J:

J = ‖Y − Kα‖² + λ αT Kα

where Kjk = K (Xj ,Xk ).

Now we find α to minimize J. We get α̂ = (K + λI)−1Y and m̂(x) = ∑_i α̂i K (Xi , x).

The estimator depends on the amount of regularization λ. Again, there is a bias-variance tradeoff. We choose λ by cross-validation. This is like the bandwidth in smoothing kernel regression.
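The whole procedure fits in a few lines of numpy (a sketch; the Gaussian kernel bandwidth and the synthetic data are my choices):

```python
import numpy as np

# RKHS regression via the representer theorem:
# m_hat(x) = sum_i alpha_i K(X_i, x),  alpha = (K + lam*I)^{-1} Y.
rng = np.random.default_rng(9)
n = 100
X = rng.uniform(-2, 2, n)
Y = np.sin(X) + rng.normal(0, 0.1, n)

def gram(A, B, bw=0.5):
    # Gaussian Mercer kernel K(x, y) = exp(-|x - y|^2 / (2 bw^2))
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * bw ** 2))

lam = 0.1
alpha = np.linalg.solve(gram(X, X) + lam * np.eye(n), Y)
m_new = gram(np.array([0.0, 1.0]), X) @ alpha   # predictions at x = 0 and x = 1
print(np.round(m_new, 1))
```

The predictions track sin(0) = 0 and sin(1) ≈ 0.84, up to the shrinkage induced by λ.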
Smoothing Kernels Versus Mercer Kernels
Smoothing kernels: the bandwidth h controls the amount of smoothing.

Mercer kernels: the norm ‖f‖K controls the amount of smoothing.

In practice these two methods give answers that are very similar.
Mercer Kernels: Examples
[Figure: fits with a very small, small, medium, and large λ.]
Multiple Regression
Both methods extend easily to the case where X has dimension d > 1. For example, just use

K (x , y) = e^{−‖x−y‖²/2}.

However, this is hard to interpret and is subject to the curse of dimensionality. This means that the statistical performance and the computational complexity degrade as the dimension d increases.

An alternative is to use something less nonparametric such as an additive model, where we restrict m(x1, . . . , xd ) to be of the form:

m(x1, . . . , xd ) = β0 + ∑_j mj(xj).
Additive Models
Model: m(x) = β0 + ∑_{j=1}^d mj(xj).

We can take β̂0 = Ȳ and we will ignore β0 from now on.

We want to minimize

∑_{i=1}^n ( Yi − (m1(Xi1) + · · · + md (Xid )) )²

subject to the mj being smooth.
Additive Models
The backfitting algorithm:

• Set m̂j = 0 for all j.
• Iterate until convergence:
  • Iterate over j :
    • Ri = Yi − ∑_{k≠j} m̂k (Xik )
    • m̂j ←− smooth(Xj ,R)

Here, smooth(Xj ,R) is any one-dimensional nonparametric regression function.

R: gam

But what if d is large?
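A from-scratch sketch of backfitting in numpy, with a one-dimensional kernel smoother playing the role of smooth(Xj, R). The two-component synthetic model is my example (in R one would use gam / mgcv):

```python
import numpy as np

# Backfitting for m(x) = beta0 + sum_j m_j(x_j), using a 1-d kernel smoother.
def smooth(x, r, h=0.15):
    w = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * h ** 2))
    fit = (w * r[None, :]).sum(axis=1) / w.sum(axis=1)   # smoother at the data points
    return fit - fit.mean()                              # center for identifiability

rng = np.random.default_rng(10)
n, d = 300, 2
X = rng.uniform(-1, 1, (n, d))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, n)

beta0 = Y.mean()
m = np.zeros((n, d))                        # m[:, j] holds m_hat_j at the data points
for _ in range(20):                         # outer iterations
    for j in range(d):
        R = Y - beta0 - m.sum(axis=1) + m[:, j]   # partial residuals
        m[:, j] = smooth(X[:, j], R)

resid = Y - beta0 - m.sum(axis=1)
print(round(float(resid.std()), 2))         # close to the noise level 0.1
```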
Sparse Additive Models (SpAM)
Ravikumar, Lafferty, Liu and Wasserman (JRSS, 2009).
Minimize

J = ∑_i ( Yi − ∑_j mj(Xij) )² + λ ∑_j ‖mj‖

subject to the mj being smooth. This is basically a nonparametric version of the lasso.
Solution: backfitting + soft thresholding. Iterate:

• Ri = Yi − ∑_{k≠j} m̂k (Xik )
• m̂j = smooth(R,Xj)
• m̂j ← m̂j [1 − λ/ŝj ]+   (soft thresholding)
• where ŝj² = (1/n) ∑_i m̂j(Xij)²
Example: Boston Housing Data
Predict house value Y from 10 covariates.
We added 20 irrelevant (random) covariates to test the method.
Y = house value; n = 506, d = 30.
Y = β0 + m1(crime) + m2(tax) + · · · + m30(X30) + ε.

Note that m11 = · · · = m30 = 0.

We choose λ by minimizing the estimated risk.

SpAM yields 6 nonzero functions. It correctly reports that m̂11 = · · · = m̂30 = 0.
L2 norms of fitted functions versus 1/λ
[Figure: L2 norms of the fitted component functions versus 1/λ.]
Estimated Risk Versus λ
[Figure: estimated risk (Cp) versus λ.]
Example Fits
[Figure: fitted component function m̂1 versus x1.]
Example Fits
[Figure: fitted component function m̂8 versus x8.]
Nonparametric Classification
There are many nonparametric classification techniques. Because they are similar to regression, I will not go into great detail.
Two common methods:
Plug-in:
ĥ(x) =
  +1 if m̂(x) ≥ 1/2
  −1 if m̂(x) < 1/2

where m̂(x) is from nonparametric regression.

Kernelization:

Often, replacing inner products ⟨x , y⟩ with a kernel K (x , y) converts a linear classifier into a nonparametric classifier. For example, the kernelized support vector machine.
knn Classification
Given x , find the k nearest neighbors (X(1),Y(1)), . . . , (X(k),Y(k)).

Set ĥ(x) = 1 if a majority of the Y(i) are 1’s. (0 otherwise.)

Choose k by cross-validation.

This IS a plug-in classifier. It corresponds to the kernel regression:

m̂(x) = ∑_i Yi I(‖x − Xi‖ ≤ h(x)) / ∑_i I(‖x − Xi‖ ≤ h(x))

where h(x) is the distance to the k-th nearest neighbor.
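The majority vote is a one-liner in numpy (a sketch on synthetic two-class data of my choosing):

```python
import numpy as np

# k-nearest-neighbour classification: a majority vote among the k closest
# training points.
def knn_predict(x, X, Y, k=5):
    dist = np.linalg.norm(X - x, axis=1)     # distances to all training points
    nearest = np.argsort(dist)[:k]           # indices of the k nearest
    return 1 if Y[nearest].mean() > 0.5 else 0

rng = np.random.default_rng(11)
n = 200
X = np.vstack([rng.normal(-1, 0.7, (n // 2, 2)),   # class 0 centered at (-1, -1)
               rng.normal(+1, 0.7, (n // 2, 2))])  # class 1 centered at (+1, +1)
Y = np.repeat([0, 1], n // 2)
print(knn_predict(np.array([-1.0, -1.0]), X, Y),
      knn_predict(np.array([+1.0, +1.0]), X, Y))   # prints 0 1
```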
knn Classification
[Figure: knn classification example.]
knn Classification
[Figure: another knn classification example.]
Kernelization
Recall the SVM:
α̂ = argmax_{α∈Rn} ∑_{i=1}^n αi − (1/2) ∑_{i=1}^n ∑_{k=1}^n αi αk Yi Yk ⟨Xi ,Xk ⟩

subject to ∑_{i=1}^n αi Yi = 0 and 0 ≤ α1, . . . , αn ≤ 1/λ′, where β̂ = ∑_{i=1}^n α̂i Yi Xi .

Replacing ⟨Xi ,Xk ⟩ with K (Xi ,Xk ) (kernelization) makes it nonparametric. Why?

K (x , y) = ⟨φ(x), φ(y)⟩ where φ maps x into a very high dimensional space.

The kernelized (nonlinear) SVM is equivalent to a linear SVM in a high-dimensional space. Like adding interactions in regression.

Can kernelize any procedure that depends on inner products.
NONPARAMETRIC GRAPHS
Regression vs. Graphical Models
assumptions     | regression            | graphical models
parametric      | lasso                 | graphical lasso
nonparametric   | sparse additive model | nonparanormal
The Nonparanormal (Liu, Lafferty, Wasserman, 2009)
A random vector X = (X1, . . . ,Xp)T has a nonparanormal distribution

X ∼ NPN(µ,Σ, f )

in case

Z ≡ f (X ) ∼ N(µ,Σ)

where f (X ) = (f1(X1), . . . , fp(Xp)).

Joint density:

pX (x) = (2π)^{−p/2} |Σ|^{−1/2} exp( −(1/2)(f (x) − µ)T Σ−1 (f (x) − µ) ) ∏_{j=1}^p |f ′j (xj)|

• Semiparametric Gaussian copula
Examples

[Figure: examples of nonparanormal densities.]
The Nonparanormal
• Define hj(x) = Φ−1(Fj(x)) where Fj(x) = P(Xj ≤ x).

• Let Λ be the covariance matrix of Z = h(X ). Then

Xj ⫫ Xk | Xrest if and only if Λ−1jk = 0.

• Hence we need to:

1 Estimate hj(x) = Φ−1(Fj(x)).
2 Estimate the covariance matrix of Z = h(X ) using the glasso.
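Step 1 is easy to sketch with ranks: replace each Fj by a rescaled empirical CDF and apply Φ−1. (A Python sketch, my synthetic example; step 2 would feed Z into the glasso, e.g. the R package huge.)

```python
import numpy as np
from statistics import NormalDist

# Nonparanormal step 1: h_j(x) = Phi^{-1}(F_j(x)), with F_j estimated by
# rank/(n+1) so the values stay strictly inside (0, 1).
def npn_transform(X):
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    inv_cdf = NormalDist().inv_cdf
    for j in range(p):
        ranks = X[:, j].argsort().argsort() + 1    # ranks 1..n
        Z[:, j] = [inv_cdf(u) for u in ranks / (n + 1.0)]
    return Z

rng = np.random.default_rng(12)
n = 2000
W = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
X = np.column_stack([np.exp(W[:, 0]), W[:, 1] ** 3])   # monotone maps: non-Normal marginals
Z = npn_transform(X)
print(np.round(np.corrcoef(Z, rowvar=False)[0, 1], 1))  # close to the latent 0.6
```

Because the fj are monotone, ranks are preserved and the transformed data recover the latent Gaussian correlation.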
Properties
• LLW (2009) show that the resulting procedure has the same theoretical properties as the glasso, even with the dimension d increasing with n.

• If the nonparanormal is used when the data are actually Normal, little efficiency is lost.

• Implemented in R: huge
Gene-Gene Interactions for Arabidopsis thaliana
source: wikipedia.org
Dataset from Affymetrix microarrays, sample size n = 118, p = 40 genes (isoprenoid pathway).
Example Results
[Figure: estimated graphs over the 40 genes: nonparanormal (NPN), glasso, and their difference.]
Transformations for 3 Genes
[Figure: estimated transformations for genes x5, x8 and x18.]

• These genes have highly non-Normal marginal distributions.

• The graphs differ at these genes.
S&P Data (2003–2008): Graphical Lasso
S&P Data: Nonparanormal
S&P Data: Nonparanormal vs. Glasso
[Figure: common edges (left) and differences (right).]
Another approach: Forests
• Nonparanormal: Unrestricted graphs, semiparametric.
• We’ll now trade off structural flexibility for greater nonparametricity.
• Forests are graphs with no cycles.
• Force the graph to be a forest but make no distributionalassumptions.
Forest Densities (Gupta, Lafferty, Liu, Wasserman, Xu, 2011)
A distribution is supported by a forest F with edge set E(F ) if
p(x) = ∏_{(i,j)∈E(F )} [ p(xi , xj) / ( p(xi) p(xj) ) ] ∏_{k∈V} p(xk )

• For known marginal densities p(xi , xj), the best tree is obtained by minimum weight spanning tree algorithms.

• In high dimensions, a spanning tree will overfit.

• We prune back to a forest.
Step 1: Constructing a Full Tree
• Compute kernel density estimates
fn1(xi , xj) =1n1
∑s∈D1
1h2
2K
(X (s)
i − xi
h2
)K
X (s)j − xj
h2
• Estimate mutual informations
Î_{n1}(X_i, X_j) = (1/m^2) ∑_{k=1}^m ∑_{ℓ=1}^m f̂_{n1}(x_{ki}, x_{ℓj}) log [ f̂_{n1}(x_{ki}, x_{ℓj}) / ( f̂_{n1}(x_{ki}) f̂_{n1}(x_{ℓj}) ) ]
• Run Kruskal’s algorithm (Chow-Liu) on edge weights
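The Chow–Liu step can be sketched as a maximum-weight spanning tree on the mutual-information weights (equivalently, a minimum-weight tree on their negatives), built with Kruskal's algorithm and a small union–find. This is only a sketch; `mi` is a hypothetical matrix of estimated mutual informations standing in for the kernel estimates above.

```python
# Chow-Liu via Kruskal: maximum-weight spanning tree on mutual-information
# edge weights. `mi` is a hypothetical p x p matrix of estimated mutual
# informations (in the slides these come from kernel density estimates).

def kruskal_tree(mi):
    p = len(mi)
    parent = list(range(p))

    def find(a):
        # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # candidate edges sorted by decreasing estimated mutual information
    edges = sorted(((mi[i][j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree                 # edges in the order Kruskal added them

mi = [[0, .9, .1], [.9, 0, .5], [.1, .5, 0]]
print(kruskal_tree(mi))         # [(0, 1), (1, 2)]
```

Returning the edges in insertion order matters for the pruning step: the forest after k steps of Kruskal is simply the first k edges of this list.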
139
Step 2: Pruning the Tree
• Heldout risk

R̂_{n2}(f̂_F) = − ∑_{(i,j) ∈ E} ∫ f̂_{n2}(x_i, x_j) log [ f̂(x_i, x_j) / ( f̂(x_i) f̂(x_j) ) ] dx_i dx_j

• Selected forest given by

k̂ = argmin_{k ∈ {0,...,p−1}} R̂_{n2}( f̂_{T_{n1}^{(k)}} )

where T_{n1}^{(k)} is the forest obtained after k steps of Kruskal
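Once the held-out risks have been computed, the pruning itself is just an argmin over prefixes of Kruskal's edge sequence. A minimal sketch, with hypothetical placeholder risk values:

```python
# Pruning sketch: heldout_risk[k] is the estimated held-out risk of the
# forest made of the first k edges of the Kruskal sequence (values here
# are hypothetical placeholders, not computed from data).

def select_forest(tree_edges, heldout_risk):
    # choose k minimizing the held-out risk, keep the first k edges
    k_hat = min(range(len(heldout_risk)), key=heldout_risk.__getitem__)
    return tree_edges[:k_hat]

edges = [(0, 1), (1, 2), (2, 3)]
risks = [3.0, 2.1, 2.4, 2.9]    # risk for forests with k = 0, 1, 2, 3 edges
print(select_forest(edges, risks))   # [(0, 1)]
```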
140
S&P Data: Forest Graph—Oops! Outliers.
141
S&P Data: Forest Graph
142
S&P Data: Forest vs. Nonparanormal
common edges differences
143
Graph-Valued Regression
• (X1,Y1), . . . , (Xn,Yn) where Yi is high-dimensional
• We’ll discuss one particular version: graph-valued regression (Chen, Lafferty, Liu, Wasserman, 2010)
• Let G(x) be the graph for Y based on p(y |x)
• This defines a partition X1, . . . ,Xk where G(x) is constant over each partition element.
• Three methods to find G(x):
I Parametric
I Kernel graph-valued regression
I GO-CART (Graph-Optimized CART)
144
Graph-Valued Regression
multivariate regression (supervised): µ(x) = E(Y | x), Y ∈ R^p, x ∈ R^q
graphical model (unsupervised): Graph(Y) = (V, E), (j, k) ∉ E ⟺ Y_j ⊥⊥ Y_k | Y_rest

↓

graph-valued regression: Graph(Y | x)
• Gene associations from phenotype (or vice versa)
• Voting patterns from covariates on bills
• Stock interactions given market conditions, news items
145
Method I: Parametric
• Assume that Z = (X ,Y ) is jointly multivariate Gaussian.
• Σ = [ Σ_X   Σ_XY
        Σ_YX  Σ_Y ].
• Get ΣX , ΣY , and ΣXY
• Get ΩX by the glasso.
• ΣY |X = ΣY − ΣYX ΩX ΣXY .
• But, the estimated graph does not vary with different values of X .
146
Method II: Kernel Smoothing
• Y |X = x ∼ N(µ(x),Σ(x)).
Σ̂(x) = ∑_{i=1}^n K(‖x − x_i‖ / h) (y_i − µ̂(x)) (y_i − µ̂(x))^T / ∑_{i=1}^n K(‖x − x_i‖ / h)

µ̂(x) = ∑_{i=1}^n K(‖x − x_i‖ / h) y_i / ∑_{i=1}^n K(‖x − x_i‖ / h).
• Apply glasso to Σ(x)
• Easy to do but recovering X1, . . . ,Xk requires difficultpost-processing.
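The kernel-smoothed moments above can be sketched in a few lines; this is a minimal illustration with a Gaussian kernel, and the function and variable names are my own, not from the slides.

```python
import numpy as np

# Kernel-smoothed conditional mean and covariance (a sketch of Method II).
# K is a Gaussian kernel in ||x - x_i|| / h; names are illustrative.

def local_moments(x, X, Y, h):
    # kernel weights K(||x - x_i|| / h) for each training point
    w = np.exp(-0.5 * (np.linalg.norm(X - x, axis=1) / h) ** 2)
    w = w / w.sum()
    mu = w @ Y                          # weighted mean mu_hat(x)
    R = Y - mu
    Sigma = (w[:, None] * R).T @ R      # weighted covariance Sigma_hat(x)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))           # covariates x_i
Y = rng.normal(size=(200, 3))           # responses y_i
mu, Sigma = local_moments(np.zeros(2), X, Y, h=1.0)
print(mu.shape, Sigma.shape)            # (3,) (3, 3)
```

The glasso would then be applied to Sigma at each query point x of interest.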
147
Method III: Partition Estimator
• Run CART but use the Gaussian log-likelihood (on held-out data) to determine the splits
• This yields a partition X1, . . . ,Xk (and a corresponding tree)
• Run the glasso within each partition element
148
Simulated Data
[Figure: (a) the dyadic partition tree over (X1, X2), with splits at 0.5, 0.25, 0.75, 0.125, 0.375, 0.625, 0.875; (b) estimated 20-node graphs within the partition elements; (c) held-out risk versus splitting sequence number.]
149
Climate Data
[Figure: estimated graphs for the climate data over the variables CO2, CH4, CO, H2, WET, CLD, VAP, PRE, FRS, DTR, TMN, TMP, TMX, GLO, ETR, ETRN, DIR, UV; panels (a), (b), (c).]
150
CLUSTERING AND MIXTURES
151
Clustering (Unsupervised Learning)
1 Parametric Clustering (mixtures):
p(x) = ∑_{j=1}^k π_j φ(x; µ_j, Σ_j)
2 Nonparametric Clustering: Estimate the density p(x) nonparametrically, then either:
- construct a level set tree (cluster tree)
- or find the modes and their basins of attraction.
Others: k-means, hierarchical, ...
152
Mixtures
X_1, . . . , X_n ∼ p(x; θ) = ∑_{j=1}^k π_j φ(x; µ_j, Σ_j)

Find θ̂ to maximize

L(θ) = ∏_{i=1}^n p(X_i; θ)
usually by EM algorithm.
The likelihood is very non-convex and EM is not good. There are now many papers on using spectral methods and moment methods to estimate high dimensional mixtures.
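A minimal EM sketch for a one-dimensional Gaussian mixture, to make the E and M steps concrete. This is an illustration only: k is fixed, there are no restarts, and as the slide warns, the non-convex likelihood means a single EM run can land in a poor local maximum.

```python
import numpy as np

# Minimal EM for a 1-d Gaussian mixture (a sketch; no restarts, so it
# inherits the local-maximum problem the slides warn about).

def em_gmm(x, k, iters=100):
    n = len(x)
    pi = np.full(k, 1.0 / k)
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # crude initialization
    sd = np.full(k, x.std())
    for _ in range(iters):
        # E step: responsibilities P(cluster j | x_i)
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) \
                  / (sd * np.sqrt(2 * np.pi))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: update weights, means, standard deviations
        nj = r.sum(axis=0)
        pi = nj / n
        mu = (r * x[:, None]).sum(axis=0) / nj
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nj)
    return pi, mu, sd

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])
pi, mu, sd = em_gmm(x, 2)
print(np.sort(mu))      # close to the true means -3 and 3
```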
153
Mixtures
Latent cluster variable: Z_i = (0,0,1,0,0,0,0,0) means X_i is from cluster 3.

P(X_i in cluster j) = π̂_j φ(X_i; µ̂_j, Σ̂_j) / ∑_k π̂_k φ(X_i; µ̂_k, Σ̂_k).
k is also a parameter. Usually chosen by penalized maximum likelihood, for example, BIC:

max_k [ log L(θ̂_k) − (d_k / 2) log n ]

where d_k is the number of parameters. (Theory is a bit dodgy, but see papers by Watanabe and also Drton.)
154
Nonparametric: Level Set Trees
Step 1: Estimate density:
p̂(x) = (1/n) ∑_{i=1}^n (1/h^d) K(‖x − X_i‖ / h)

where K is a kernel (smooth symmetric density) and h > 0 is a bandwidth.

Step 2: Find level sets: L_λ = {x : p̂(x) > λ}.

Step 3: Find the connected components of L_λ, i.e. L_λ = C_1 ∪ · · · ∪ C_r.
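The three steps can be sketched in one dimension, where connected components of the level set are just runs of adjacent grid points. This is a toy illustration under assumed settings (Gaussian kernel, a two-bump sample, a hand-picked λ), not the denpro/DeBaCl algorithms.

```python
import numpy as np

# Steps 1-3 in one dimension: kernel density estimate, a level set
# {x : p_hat(x) > lambda}, and its connected components on a grid.

def kde(x, X, h):
    # Gaussian kernel estimate p_hat at query points x (m x d grid)
    d = X.shape[1]
    z = np.linalg.norm(x[:, None, :] - X[None, :, :], axis=2) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(X) * (h * np.sqrt(2 * np.pi)) ** d)

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2, .5, (100, 1)), rng.normal(2, .5, (100, 1))])
grid = np.linspace(-4, 4, 81)[:, None]          # grid spacing 0.1
p = kde(grid, X, h=0.3)

level_set = grid[p > 0.1]                        # L_lambda with lambda = 0.1
# a gap wider than one grid step separates connected components
gaps = np.where(np.diff(level_set[:, 0]) > 0.11)[0]
ncomp = len(gaps) + 1
print(ncomp)                                     # one cluster per bump
```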
155
Nonparametric: Level Set Trees
156
Nonparametric: Level Set Trees
As we vary λ, the connected components form a tree.
Software: R: denpro or python: DeBaCl written by Brian Kent:
https://github.com/CoAxLab/DeBaCl
The following examples were done by Brian Kent.
Thanks Brian! (He has done this successfully in d = 11,000 dimensions.)
157
Example
158
Example
159
Example
160
Density Clustering II: Mode Clustering
Step I: Find p̂.

Step II: Find the modes m_1, . . . , m_k of p̂. Mean-shift algorithm: repeat:

x ← ∑_i X_i K(‖x − X_i‖ / h) / ∑_i K(‖x − X_i‖ / h)

Step III: Find the basin of attraction C_j (Morse complex) of each mode m_j. Specifically, x ∈ C_j if the gradient ascent path starting at x leads to m_j.
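The mean-shift update in Step II can be sketched directly; iterating it from a starting point climbs to the nearest mode of p̂. A minimal version with a Gaussian kernel (the function name and settings are illustrative):

```python
import numpy as np

# One mean-shift iteration, repeated to convergence: move x to the
# kernel-weighted average of the data, which ascends to a mode of p_hat.

def mean_shift(x, X, h, iters=200):
    for _ in range(iters):
        w = np.exp(-0.5 * (np.linalg.norm(X - x, axis=1) / h) ** 2)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-8:     # converged to a mode
            break
        x = x_new
    return x

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-2, .3, (150, 2)), rng.normal(2, .3, (150, 2))])
m = mean_shift(np.array([1.0, 1.0]), X, h=0.5)
print(np.round(m))       # converges to the mode near (2, 2)
```

Running this from every data point and grouping points that converge to the same mode gives the basins of attraction of Step III.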
161
Example
162
Example
[Figure: mode clustering example with two modes, labeled a and b.]
163
Example
164
Example
[Figure: example plotted in coordinates u1, u2.]
165
Example
166
Dimension Reduction
Recall: linear dimension reduction.
MDS (multidimensional scaling).
Y1, . . . ,Yn ∈ Rd .
Let φ be a linear projection onto R2 and let Zi = φ(Yi).
Minimize:
∑_{i,j} ( ‖Y_i − Y_j‖² − ‖Z_i − Z_j‖² )

over all linear projections.
Solution: project onto the first two principal components u1, u2.
167
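The principal-component projection can be sketched with an SVD of the centered data; this is a generic illustration, with names of my own choosing.

```python
import numpy as np

# Linear dimension reduction sketch: center the data and project onto the
# first two principal components via the SVD.

def pca_project(Y, k=2):
    Yc = Y - Y.mean(axis=0)                     # center the data
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:k].T                        # Z_i = phi(Y_i), an n x k array

rng = np.random.default_rng(4)
Y = rng.normal(size=(100, 5))
Z = pca_project(Y)
print(Z.shape)    # (100, 2)
```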
Nonlinear Dimension Reduction
Many methods such as:
-Isomap
-LLE
-diffusion maps
-kernel PCA
+ many others
I will briefly explain diffusion maps.
(Coifman, Lafon, Lee, 2005)
168
Nonlinear Dimension Reduction
Imagine a Markov chain starting at X_i and jumping to another point X_j with probability

p(i, j) ∝ K_h(X_i, X_j).

The transition matrix is M = D^{−1} K, where K(i, j) = K_h(X_i, X_j) and D is diagonal with D_ii = ∑_j K_h(X_i, X_j).

The diffusion distance measures how easy it is to get from X_i to X_j. This is essentially Euclidean distance in a new space.
We can visualize this by plotting the eigenvectors of M.
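The construction above can be sketched directly: build the kernel matrix K, row-normalize it to get M = D⁻¹K, and embed using the leading nontrivial eigenvectors. A minimal version with a Gaussian kernel (names and settings are illustrative):

```python
import numpy as np

# Diffusion-map sketch: transition matrix M = D^{-1} K for a Gaussian
# kernel K, embedded via the leading nontrivial eigenvectors of M.

def diffusion_map(X, h, k=2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2 * h ** 2))              # K(i, j) = K_h(X_i, X_j)
    M = K / K.sum(axis=1, keepdims=True)        # row-stochastic: M = D^{-1} K
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    # skip the trivial constant eigenvector (eigenvalue 1)
    return vecs.real[:, order[1:k + 1]]

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
coords = diffusion_map(X, h=1.0)
print(coords.shape)    # (60, 2)
```

Plotting the rows of `coords` is the eigenvector visualization the slide describes.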
169
Nonlinear Dimension Reduction
170
Nonlinear Dimension Reduction
171
Requests
Causal Inference: let’s talk.
Counterexamples to Bayes: let’s talk (after a few drinks).
But see:
http://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/
http://normaldeviate.wordpress.com/2012/12/08/flat-priors-in-flatland-stones-paradox/
Also:
Owhadi, Scovel and Sullivan (2013). arXiv:1304.6772
172
What I Didn’t Cover
1 Bayes
2 PCA, sparse PCA
3 Nonlinear dimension reduction
4 Directed graphs
5 Convex optimization
6 Semi-supervised methods
7 Active learning
8 Sequential learning
9 ...
173
THE END
174