CS 181 LECTURE NOTES

AARON LANDESMAN

Contents

1. Class 1/29/14
1.1. K-means clustering
1.2. Lloyd's algorithm
2. Class 2/3/14
2.1. Section times
2.2. Office Hours
2.3. Gradient Descent
2.4. K-means++
2.5. Hierarchical Agglomerative Clustering (HAC)
3. Class 2/5/14
3.1. Principal Component Analysis
3.2. Computation for PCA
4. Class 2/10/14
4.1. Unsupervised learning
5. Class 2/12/14
5.1. Frequentist Regression
6. Class 2/19/14
7. Class 2/24/14
7.1. Bayesian Regression
7.2. Cross Validation
8. Class 2/26/14
8.2. Classification
9. Class 3/3/14
9.1. Perceptrons
10. Class 3/10/14
10.1. Logistic Regression
10.2. Neural Networks
11. Class 3/12/14
11.1. Decision Trees
12. Class 3/24/14
12.1. Support Vector Machines
13. Class 3/31/14
14. Class 4/2/14
14.1. Markov Decision process (MDP)
15. Class 4/9/14
15.1. Reinforcement Learning
16. Class 4/8/14
16.1. Partially Observable Markov Decision Process (POMDP)


17. Class 4/16/14
18. Class 4/21/14
19. Class 4/23/14

1. Class 1/29/14

1.1. K-means clustering.

(1) Where are the cluster centers? {µ_k}_{k=1}^K, µ_k ∈ R^D, with D the same dimension as the data.

(2) Which data belong to which cluster?

Definition 1.1.1. A one-hot coding is a map s ↦ e_i^T, the transpose of a basis vector.

Assign responsibility vectors r_n given by one-hot codings. If datum x_n belongs to cluster k, then define

r_{n,j} = 1 if k = j, and r_{n,j} = 0 otherwise.

To measure distances, one typically uses the squared Euclidean norm

||x − µ||² = ∑_{i=1}^D (x_i − µ_i)².

1.1.2. Loss Function.

Remark 1.1.3. A loss function is a way of formalizing badness.

The distortion for datum n might be J_n(r_n, {µ_k}_{k=1}^K) = ∑_{k=1}^K r_{n,k} ||x_n − µ_k||².

1.1.4. Empirical Loss Minimization. By summing our loss function over all the x's, we get a function

J({r_n}_{n=1}^N, {µ_k}_{k=1}^K) = ∑_{n=1}^N ∑_{k=1}^K r_{n,k} ||x_n − µ_k||².

Minimizing this is in general difficult because it is nonconvex, and finding a global minimum is NP-hard.

Definition 1.1.5. The problem of minimizing a given empirical loss function is called the K-Means Problem.

1.2. Lloyd’s algorithm.

(1) First minimize J in terms of each r_n; there are only K options. Define z_n = argmin_k ||x_n − µ_k||². Set r_n = e_{z_n}^T.

(2) Minimize J in terms of µ_k. Note that

||x_n − µ_k||² = (x_n − µ_k)^T (x_n − µ_k).


∇_{µ_k} J = ∇_{µ_k} ∑_{n=1}^N r_{n,k} (x_n − µ_k)^T (x_n − µ_k)
 = −2 ∑_{n=1}^N r_{n,k} (x_n − µ_k)
 = −2 (∑_{n=1}^N r_{n,k} x_n − N_k µ_k),

with N_k = ∑_{n=1}^N r_{n,k}, so that µ_k = (1/N_k) ∑_{n=1}^N r_{n,k} x_n.

Remark 1.2.1. Setting the gradient to zero tells us

µ_k = (∑_{n=1}^N r_{n,k} x_n) / (∑_{n=1}^N r_{n,k}).

To improve this, one can use many random initializations. However, we should always use K-means++ instead of plain K-means: it picks one of the data points at random as the first center, then repeatedly draws the next center from a probability distribution weighted toward data that are farthest from any center chosen so far. Typically, we should also standardize the data along each dimension, so that one dimension doesn't overwhelm the others.
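As a concrete illustration (not from the lecture), here is a minimal Python sketch of Lloyd's algorithm, assuming numpy; the initialization from random data points and the fixed iteration cap are illustrative choices:

import numpy as np

def lloyds(X, K, iters=100, rng=np.random.default_rng(0)):
    # Initialize the means as K distinct random data points.
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Step 1: assign each datum to its closest mean (the responsibilities).
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # N x K squared distances
        z = d2.argmin(axis=1)
        # Step 2: move each mean to the average of the data assigned to it.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z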

Definition 1.2.2. K-medoids is basically the same as K-means, but the means have to be data points, which makes sense in contexts where means aren't well defined.

2. Class 2/3/14

2.1. Section times.

(1) Thursday 1600, MD 223
(2) Friday 1200, MD 123
(3) Friday 1400, NW B150

2.2. Office Hours.

(1) Ryan: 2:30-4 Mon, MD 233
(2) Diana: 10-11 Mon, MD 334
(3) Sam: 3-4 Wed, MD 2nd floor lobby
(4) Rest: 7-11 Wed, Quincy

2.3. Gradient Descent.

Remark 2.3.1. Idea: to minimize, take steps in the direction of the negative gradient. Suppose we have two parameters θ_1, θ_2; we can picture the loss surface by its contour curves. We initialize our algorithm and try to modify the parameters so that we reach the global minimum. Let's say our loss function is L(θ_1, θ_2). Some methods to minimize L are perturbation and coordinate descent (minimizing along one coordinate at a time while holding the other dimensions fixed). K-means is an example of coordinate descent, but with several blocks of coordinates at a time. Another type of minimization is gradient descent. The gradient points in the most "upward" direction, so to descend we go opposite the gradient. Sometimes we can just solve a system to find where the gradient is 0, which is a stationary point. If a convex function has a minimum, it is unique. Much of machine learning is about framing problems so that they are convex. Simulated annealing is one way to try to solve nonconvex problems.
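A minimal gradient descent sketch in Python; the quadratic example loss and the step size alpha are illustrative assumptions, not from the lecture:

import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, iters=1000):
    # Repeatedly step opposite the gradient.
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        theta -= alpha * grad(theta)
    return theta

# Example: minimize L(theta1, theta2) = theta1^2 + 2*theta2^2.
grad = lambda th: np.array([2 * th[0], 4 * th[1]])
print(gradient_descent(grad, [3.0, -2.0]))  # approaches [0, 0]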

Recall K-means. We have data {x_n}_{n=1}^N, responsibility vectors r_{n,k} ∈ {0, 1} such that ∑_k r_{n,k} = 1, and mean vectors µ_k ∈ R^D. There is a loss function J({µ_k}, {r_{n,k}}) = ∑_{n=1}^N ∑_{k=1}^K r_{n,k} ||x_n − µ_k||². To get r_{n,k}, take those that minimize J by assigning each point to its closest mean.

2.4. K-means++.

Remark 2.4.1. The algorithm is as follows. This is just the initialization; after this we can apply Lloyd's algorithm.

(1) Assign a random x_n to µ_1.
(2) For i from 2 to K, draw the mean µ_i from a distribution weighted by each datum's distance to the closest µ_j for j < i.
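A small Python sketch of this initialization, assuming numpy; weighting by the squared distance to the closest chosen mean is an assumption about the exact weighting:

import numpy as np

def kmeanspp_init(X, K, rng=np.random.default_rng(0)):
    # (1) The first mean is a uniformly random datum.
    mu = [X[rng.integers(len(X))]]
    # (2) Each later mean is drawn with probability proportional to the
    #     squared distance to the closest mean chosen so far.
    for _ in range(1, K):
        d2 = np.min(((X[:, None, :] - np.array(mu)[None, :, :]) ** 2).sum(axis=2), axis=1)
        mu.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(mu)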

2.5. Hierarchical Agglomerative Clustering (HAC). The main problems with K-means are

(1) Picking K correctly
(2) Nondeterminism
(3) The data typically do not form disjoint clusters

Remark 2.5.1. HAC is deterministic. Every datum starts in its own group, and you decide how to merge groups. The design choice is the criterion for merging groups. We always merge two groups at a time, thus making a binary tree over the data. We can make a dendrogram, a visualization in which we draw a sequence of brackets showing the merging of data into groups (like a tournament bracket). The y axis represents the points, and the x axis represents the distance between two groups just before we merge them. Choose K by finding a point after which there is a large spacing.

2.5.2. Linkage Function. Which groups get merged is determined by the "linkage function." There are four linkage functions from the course notes. Writing G_1 = {x_n}, G_2 = {y_m}:

(1) Single: the distance between two groups is the minimum over pairs of points, D(G_1, G_2) = min_{m,n} ||x_n − y_m||.
(2) Complete: the maximum over pairs of points, D(G_1, G_2) = max_{m,n} ||x_n − y_m||.
(3) Average: D(G_1, G_2) = (1/(M·N)) ∑_{m,n} ||x_n − y_m||.
(4) Centroid: D(G_1, G_2) = || (1/N) ∑_n x_n − (1/M) ∑_m y_m ||.

Remark 2.5.3. If we use single, complete, or average linkage, we get a valid dendrogram; but if we use centroid linkage, groups may become more similar as we merge, which can result in a "non-valid" dendrogram. Complete linkage gives compact clusters; single linkage can give "stringy" clusters with long chains.
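The four linkage distances above can be computed directly; a small Python sketch, assuming numpy arrays G1 and G2 holding the points of the two groups as rows:

import numpy as np

def pairwise_dists(G1, G2):
    # All ||x_n - y_m|| distances, shape N x M.
    return np.linalg.norm(G1[:, None, :] - G2[None, :, :], axis=2)

def single(G1, G2):   return pairwise_dists(G1, G2).min()
def complete(G1, G2): return pairwise_dists(G1, G2).max()
def average(G1, G2):  return pairwise_dists(G1, G2).mean()
def centroid(G1, G2): return np.linalg.norm(G1.mean(axis=0) - G2.mean(axis=0))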

3. Class 2/5/14

3.1. Principal Component Analysis.


Remark 3.1.1. Suppose we have some data in R^D and a linear map T : R^D → R^K with K < D. Note that it generally takes O(n³) time to compute the eigenspectrum of an n × n matrix.

There is something called the snapshot method for computing eigenvalues of huge matrices.

One interpretation of PCA is to maximize the variance of the data projected onto the new basis. A second interpretation is to minimize reconstruction error. This is lossy compression: PNG is lossless, and the image can be reconstructed exactly; JPEG, on the other hand, loses some information, but the file is much smaller and doesn't take much space. PCA is choosing a good lossy compression.

Definition 3.1.2. An orthonormal basis for R^D is a set of D vectors {u_d}_{d=1}^D such that u_d^T u_e = δ_{de}.

Computation 3.1.3. Assume we have a set of data {x_n}_{n=1}^N and define x̄ = (1/N) ∑_{n=1}^N x_n. Then,

x_n = x̄ + ∑_{d=1}^D α_d^{(n)} u_d

with α_d^{(n)} = (x_n − x̄)^T u_d. Then we perform a reconstruction with only K basis vectors:

x̂_n = x̄ + ∑_{d=1}^K α_d^{(n)} u_d.

The squared error is

J = ∑_{n=1}^N (x_n − x̂_n)².

3.2. Computation for PCA.

Computation 3.2.1. Try to either preserve variance or minimize reconstruction error.

Data: {x_n}_{n=1}^N, x_n ∈ R^D.

Game: find a set of K orthonormal basis vectors such that we minimize reconstruction error.

Basis: {u_d}_{d=1}^K, orthonormal.

The mean of the data is x̄ = (1/N) ∑_{n=1}^N x_n.

Write any datum as x_n = x̄ + ∑_{d=1}^D α_d^{(n)} u_d.

Take as our weights α_d^{(n)} = (x_n − x̄)^T u_d.

A compression into K < D dimensions is x̂_n = x̄ + ∑_{d=1}^K α_d^{(n)} u_d.

The reconstruction loss is J_n({u_d}_{d=1}^K) = (x_n − x̂_n)².


The total loss is

J_n = (x_n − x̂_n)²
 = ((x̄ + ∑_{d=1}^D α_d^{(n)} u_d) − (x̄ + ∑_{d=1}^K α_d^{(n)} u_d))²
 = (∑_{d=K+1}^D α_d^{(n)} u_d)²
 = ∑_{d=K+1}^D (α_d^{(n)})²
 = ∑_{d=K+1}^D ((x_n − x̄)^T u_d)²
 = ∑_{d=K+1}^D u_d^T (x_n − x̄)(x_n − x̄)^T u_d.

Definition 3.2.2. The sample covariance is the D × D positive semidefinite matrix Σ = (1/N) ∑_{n=1}^N (x_n − x̄)(x_n − x̄)^T.

So, summing over all the data, we have

J = ∑_{n=1}^N ∑_{d=K+1}^D u_d^T (x_n − x̄)(x_n − x̄)^T u_d
 = ∑_{d=K+1}^D u_d^T (∑_{n=1}^N (x_n − x̄)(x_n − x̄)^T) u_d
 = N ∑_{d=K+1}^D u_d^T Σ u_d.

Computation 3.2.3. We want to minimize this subject to u_d^T u_d = 1. Using Lagrange multipliers, we want to minimize

N ∑_{d=K+1}^D u_d^T Σ u_d + ∑_{d=K+1}^D λ_d (1 − u_d^T u_d).

Taking the gradient, we get 0 = 2 Σ u_d − 2 λ_d u_d, so Σ u_d = λ_d u_d.

Then, we want to minimize ∑_{d=K+1}^D u_d^T Σ u_d = ∑_{d=K+1}^D λ_d u_d^T u_d = ∑_{d=K+1}^D λ_d, which corresponds to discarding the directions with the smallest eigenvalues, i.e. keeping the K eigenvectors of Σ with the largest eigenvalues.
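A minimal Python sketch of PCA along these lines, assuming numpy and a data matrix X with one datum per row; note np.linalg.eigh returns eigenvalues in ascending order, so the last K columns are the top eigenvectors:

import numpy as np

def pca(X, K):
    xbar = X.mean(axis=0)
    Xc = X - xbar
    Sigma = Xc.T @ Xc / len(X)            # sample covariance
    evals, evecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
    U = evecs[:, -K:]                     # top-K eigenvectors u_d
    alpha = Xc @ U                        # weights alpha_d^(n)
    X_hat = xbar + alpha @ U.T            # reconstruction with K components
    return U, alpha, X_hat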

4. Class 2/10/14

4.1. Unsupervised learning.

Remark 4.1.1. In supervised learning we have both inputs and outputs, say {x_n, t_n}_{n=1}^N. The x's are called inputs, features, or covariates. The t's are called labels, outputs, targets, responses, etc. To study this, we will look at

(1) Regression: with t_n ∈ R. There is also vector regression, with t_n ∈ R^m.
(2) Classification: with t_n ∈ S for S a finite set.
(3) Ordinal Regression: with t_n ∈ N.
(4) Structured Prediction: predicting structured objects rather than just real numbers.

Remark 4.1.2. Note, the next problem set will probably involve regression. One thing to note about the curse of dimensionality is that in high dimensions, random data tend to all be about the same distance apart. K nearest neighbors predicts elements by their close neighbors. Three problems with it are that there is no canonical way to pick k, that it degrades when there are too many dimensions, and that there may be too much data to store.

5. Class 2/12/14

(1) Your gradient is wrong. To check it, use finite differences. Say ∇_x f(x) = (∂f/∂x_i (x))_{i=1}^N, and let g(x) be your implementation of the gradient. Pick some point x_0. Note that, along dimension i,

f(x_0 + (ε/2) e_i) − f(x_0 − (ε/2) e_i) ≈ ε g_i(x_0),

so you can check your gradient dimension-wise. Generally, take ε ≈ 10^{−4}. You might make a function called CHECKGRAD; search for this on Google to find examples. A sketch appears after this list.

(2) For trying to solve machine learning problems, try the simplest thing first. Make sure you underfit before you overfit.

(3) To debug, generate fake data and make sure you learn what you're supposed to.

(4) Vectorize and profile your code.
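A minimal CHECKGRAD-style sketch of the finite-difference test from item (1), assuming numpy; the example function and gradient are illustrative:

import numpy as np

def check_grad(f, g, x0, eps=1e-4):
    # Compare each component of g(x0) against a centered finite difference of f.
    x0 = np.asarray(x0, dtype=float)
    approx = np.zeros_like(x0)
    for i in range(len(x0)):
        e = np.zeros_like(x0); e[i] = eps / 2
        approx[i] = (f(x0 + e) - f(x0 - e)) / eps
    return np.max(np.abs(approx - g(x0)))

# Example: f(x) = sum(x^2) has gradient 2x; the discrepancy should be tiny.
print(check_grad(lambda x: np.sum(x ** 2), lambda x: 2 * x, [1.0, -2.0, 3.0]))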

5.1. Frequentist Regression.

Remark 5.1.1. We have inputs x ∈ R^D and outputs t ∈ R, and some function y(x, w) = w_0 + ∑_{i=1}^D w_i x_i. Basis function regression is when we have J − 1 basis functions φ_j : X → R. The basis functions might be polynomials, sinusoids, radial basis functions, etc.

Definition 5.1.2. A radial basis function is a basis function such that φ(x) = φ(y) if ||x|| = ||y||. In other words, we can write φ(x) = φ(||x||).

Computation 5.1.3. To deal with text documents, we might take a basis function that counts the number of times a word appears. Or, we could ask how similar one document is to some "canonical document."

Let's say we have data {x_n, t_n}_{n=1}^N and φ(x) = (1, φ_1(x), φ_2(x), . . . , φ_{J−1}(x))^T. Let y(x, w) = w^T φ(x).

Then φ : X → R^J and w ∈ R^J. Define a loss function

E_D(w) = (1/2) ∑_{n=1}^N (t_n − y(x_n, w))² = (1/2) ∑_{n=1}^N (t_n − w^T φ(x_n))².

To minimize this, set the gradient to zero, which gives

0 = ∑_{n=1}^N (t_n − w^T φ(x_n)) φ(x_n) = ∑_{n=1}^N t_n φ(x_n) − ∑_{n=1}^N φ(x_n) φ(x_n)^T w.


Definition 5.1.4. The design matrix Φ is the N × J matrix with entries Φ_{n,j} = φ_j(x_n), i.e. row n is (φ_1(x_n), φ_2(x_n), . . . , φ_J(x_n)).

Computation 5.1.5. Then, 0 = ∇_w E_D(w) = Φ^T t − Φ^T Φ w. So Φ^T t = Φ^T Φ w, and w = (Φ^T Φ)^{−1} Φ^T t is our solution.

Definition 5.1.6. (Φ^T Φ)^{−1} Φ^T is called the Moore-Penrose pseudoinverse of Φ.
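A short Python sketch of the least-squares solution, assuming numpy; solving the normal equations with a linear solve rather than forming the inverse explicitly is a standard numerical choice:

import numpy as np

def fit_least_squares(Phi, t):
    # w = (Phi^T Phi)^{-1} Phi^T t, computed via a linear solve for stability.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)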

Computation 5.1.7. Let t_n = y(x_n, w) + ε_n for ε_n ∼ N(0, β^{−1}), where β is the precision, the inverse variance. Now, P(t_n | x_n, w, β) = N(t_n | y(x_n, w), β^{−1}) and

P(t | Φ, w, β) = ∏_{n=1}^N N(t_n | w^T φ(x_n), β^{−1}) = N(t | Φw, β^{−1} I_N),

with I_N the N × N identity matrix.

6. Class 2/19/14

Remark 6.0.8. Use StarCluster, free software from MIT, which is much easier than the raw Amazon tools.

Remark 6.0.9. Suppose we have data {x_n, t_n}_{n=1}^N, t_n ∈ R, and basis functions φ_j(x) : X → R. In the practical, much of the challenge has to do with choosing basis functions. We could, for instance, come up with a list of positive words and count the number of times these words occur. Suppose we had x's on the left and right and o's in the middle. We could then use functions φ_1(x) = x, φ_2(x) = x².

Definition 6.0.10. Suppose we have N data points and J feature functions. Then the design matrix Φ is an N × J matrix with Φ_{i,j} = φ_j(x_i).

Definition 6.0.11. Let y(w, x) = φ(x)^T w. Regression is the process of finding a "good" w, which so far has been determined by minimizing a loss function.

Computation 6.0.12. The method of least squares says

E_D(w) = (1/2) ∑_{n=1}^N (t_n − y(x_n, w))²
 = (1/2) ∑_{n=1}^N (t_n − φ(x_n)^T w)²
 = (1/2) (t − Φw)^T (t − Φw).

Differentiating and solving for w gives w_{LS} = (Φ^T Φ)^{−1} Φ^T t.

Computation 6.0.13. Now, let's suppose t_n = y(w, x_n) + ε_n with ε_n ∼ i.i.d. N(0, β^{−1}). Then the density of each t_n is

p(t_n | w, φ(x_n), β) = N(t_n | φ(x_n)^T w, β^{−1}),

and the likelihood of the full data set is

p(t | Φ, w, β) = ∏_{n=1}^N p(t_n | φ(x_n)^T w, β^{−1}).


Then,

L(w) = log p(t | Φ, w, β)
 = ∑_{n=1}^N log N(t_n | φ(x_n)^T w, β^{−1})
 = ∑_{n=1}^N (−(1/2) log 2π + (1/2) log β − (β/2)(t_n − φ(x_n)^T w)²)
 = −(N/2) log 2π + (N/2) log β − (β/2)(t − Φw)^T (t − Φw).

Maximizing gives

w_{MLE} = (Φ^T Φ)^{−1} Φ^T t.

Remark 6.0.14. Adapting φ by applying the chain rule is called back propagation. This is what is behind neural networks and deep learning, which is really just using the chain rule.

Computation 6.0.15. Now, to solve for β, write

L(β) = (N/2) log β − (β/2)(t − Φw)^T (t − Φw) + C.

Differentiating, setting to 0, and solving for β,

1/β = (t − Φw)^T (t − Φw) / N = (1/N) ∑_{n=1}^N (t_n − φ(x_n)^T w)²,

with w the w_{MLE}.

7. Class 2/24/14

Remark 7.0.16. The two estimates given by frequentist methods are point estimates: they predict a single best answer. In Bayesian analysis, we obtain a distribution for our prediction.

7.1. Bayesian Regression.

Remark 7.1.1. Let w be a vector of weights, t be the outputs, Φ be our matrix of basis functions applied to the inputs, and β be some precision. We'd like to understand p(w | t, Φ, β).

Theorem 7.1.2. Let θ be the unknowns and D be the data. Then p(D | θ) is the likelihood as a function of θ, p(θ) is the prior, and p(D) is the marginal likelihood. Note that p(D) = ∫ p(D | θ′) p(θ′) dθ′. Bayes' theorem says

P(θ | D) = P(D | θ) P(θ) / P(D).

Definition 7.1.3. We say P(θ) and P(θ | D) are conjugate distributions if they are from the same family of distributions. We say P(θ) is a conjugate prior to the likelihood function P(D | θ) if P(θ) and P(θ | D) are conjugate distributions.


Computation 7.1.4. Now, we can use Bayes' theorem to understand P(w | t, Φ, β):

P(w | t, Φ, β) = P(t | w, Φ, β) P(w) / P(t | Φ, β).

Let's assume we have likelihood

P(t | w, Φ, β) = N(t | Φw, (1/β) I_N)

and the prior is of the form

p(w) = N(w | m_0, S_0).

Then, taking logs, we have

log P(w | t, Φ, β) = log P(t | Φ, w, β) + log P(w) − log P(t | Φ, β).

Up to terms constant in w, the right hand side is

−(β/2)(t − Φw)^T (t − Φw) − (1/2)(w − m_0)^T S_0^{−1} (w − m_0).

After completing the square and exponentiating again, we obtain that the posterior distribution satisfies

log P(w | t, Φ, β) = C − (1/2)(β t^T t − 2β t^T Φ w + β w^T Φ^T Φ w + w^T S_0^{−1} w − 2 w^T S_0^{−1} m_0 + m_0^T S_0^{−1} m_0)
 = D − (1/2)(w − m_N)^T S_N^{−1} (w − m_N),

where C and D are constants in w and

m_N = S_N (S_0^{−1} m_0 + β Φ^T t),
S_N = (S_0^{−1} + β Φ^T Φ)^{−1}.

Definition 7.1.5. The Maximum A Posteriori (MAP) estimate is the mode of the posterior distribution.

Remark 7.1.6. Using the MAP loses uncertainty, since it represents distributions solely by their peaks.

Computation 7.1.7. Say we use the simple prior P(w) = N(w | 0, α^{−1} I). Then, letting C be a constant in w, we have

log P(w | t, Φ, β) = C − (β/2) ∑_{n=1}^N (t_n − φ(x_n)^T w)² − (α/2) w^T w,

with

S_N = (αI + β Φ^T Φ)^{−1},
m_N = S_N (β Φ^T t).

In this case, our MAP estimate is m_N, since the mode of a Gaussian is its mean. The expression above is a term for the likelihood together with a term penalizing large w's, which is a form of regularization.
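A small Python sketch computing the posterior m_N and S_N for this simple prior, assuming numpy:

import numpy as np

def posterior(Phi, t, alpha, beta):
    # S_N = (alpha I + beta Phi^T Phi)^{-1},  m_N = beta S_N Phi^T t
    J = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(J) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N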


7.2. Cross Validation.

Definition 7.2.1. The idea of cross validation is to split the training data into N parts. Then, for 1 ≤ i ≤ N, you choose part i of the data for validation, and the remaining N − 1 parts other than part i are used to train for that fold. You then typically take some average of measures of goodness over the N parts to determine the best value of a parameter to use.

Remark 7.2.2. Cross validation is very easily parallelizable.
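A sketch of this procedure in Python, assuming numpy and hypothetical fit and loss functions supplied by the caller:

import numpy as np

def cross_validate(X, t, fit, loss, n_folds=5):
    # Split indices into n_folds parts; hold out part i, train on the rest.
    folds = np.array_split(np.random.permutation(len(X)), n_folds)
    scores = []
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train], t[train])
        scores.append(loss(model, X[val], t[val]))
    return np.mean(scores)  # average measure of goodness over the folds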

8. Class 2/26/14

Definition 8.0.3. A hyper-parameter is a parameter of a prior distribution.

Example 8.1. In Bayesian linear regression, we can take

P(t | Φ, w, β) = N(t | Φw, (1/β) I),
P(w) = N(w | 0, (1/α) I).

Here β, w are parameters and α is a hyper-parameter.

Definition 8.1.1. The maximum likelihood value of θ is given by θ_{MLE} = argmax_θ P(D | θ). The marginal likelihood is

P(D | α) = ∫ P(D | θ) P(θ | α) dθ.

The maximum likelihood estimate for the marginal likelihood is α_{MLE} = argmax_α P(D | α).

Remark 8.1.2. Then, using Bayes' theorem we can write

P(θ | D, α) = P(D | θ) P(θ | α) / ∫ P(D | θ′) P(θ′ | α) dθ′.

Note that the denominator here is the marginal likelihood.

Remark 8.1.3. The MAP estimate is the mode of the posterior distribution. We can vaguely model our prior as a uniform distribution with support of width ∆_Prior, so that P(θ) ≈ 1/∆_Prior on its support. Define θ_MAP = argmax_θ P(D | θ) P(θ), and define ∆_Post to be the width of the posterior's mode. Approximating the likelihood over that peak by P(D | θ_MAP), we approximate

P(D) = ∫ P(D | θ) P(θ) dθ ≈ ∫ P(D | θ_MAP) P(θ) dθ ≈ ∫ P(D | θ_MAP) (1/∆_Prior) dθ ≈ P(D | θ_MAP) · ∆_Post/∆_Prior,

where the final integral runs over the width ∆_Post of the posterior peak.

Computation 8.1.4.

P(t | Φ, β, α) = ∫ P(t | Φ, w, β) P(w | α) dw
 = ∫ N(t | Φw, (1/β) I) N(w | 0, (1/α) I) dw
 = N(t | 0, (1/α) ΦΦ^T + (1/β) I).


Remark 8.1.5. Gaussians are nice: if u ∼ N(u | µ, Σ), then

v = Au ∼ N(v | Aµ, A Σ A^T).

If ε ∼ N(0, Λ) independently, then

u + ε ∼ N(u + ε | µ, Σ + Λ).

Remark 8.1.6. If we have some w_{MLE} and β_{MLE} for our most likely coefficients and precision, then t ∼ N(t | φ(x)^T w_{MLE}, β_{MLE}^{−1}).

Computation 8.1.7.

P(t | φ(x), t, Φ, β, α) = ∫ P(t | φ(x), w, β) P(w | t, Φ, β, α) dw
 = ∫ N(t | φ(x)^T w, 1/β) N(w | m_N, S_N) dw
 = N(t | φ(x)^T m_N, φ(x)^T S_N φ(x) + 1/β).
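Continuing the Bayesian regression sketch, the predictive mean and variance at a new input can be computed directly from m_N, S_N, and β; phi_x below is assumed to be the feature vector φ(x) as a 1-D numpy array:

def predictive(phi_x, m_N, S_N, beta):
    # Mean phi(x)^T m_N and variance phi(x)^T S_N phi(x) + 1/beta.
    mean = phi_x @ m_N
    var = phi_x @ S_N @ phi_x + 1.0 / beta
    return mean, var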

8.2. Classification.

Definition 8.2.1. In machine learning, classification is assigning to each data element a class in the set {C_1, . . . , C_K}.

Definition 8.2.2. Data are linearly separable into 2 classes if there exists an affine hyperplane perfectly dividing the data into the correct classifications.

9. Class 3/3/14

Definition 9.0.3. Take a feature space X = R^D and labels t ∈ {0, 1}. Write y(x, w, w_0) = w^T x + w_0. A decision boundary is a codimension one affine plane defined by w^T x + w_0 = 0, and separates our points into those labeled by t = 1 and those labeled by t = 0.

Computation 9.0.4. If x_1, x_2 are in the plane defined by w^T x + w_0 = 0, then w^T x_i = −w_0 and so

w^T (x_1 − x_2) = w^T x_1 − w^T x_2 = (−w_0) − (−w_0) = 0.

Intuitively, this tells us that x_1 − x_2 lies in the plane perpendicular to w.

Computation 9.0.5. The aim of this computation is to find a vector perpendicular to the plane that reaches it. Only vectors of the form c · w/||w|| with c ∈ R are perpendicular. Then, in order for this vector to lie on the plane, we need w^T (c · w/||w||) + w_0 = c · ||w|| + w_0 = 0. Solving for c we get c = −w_0/||w||. So this is the distance from the origin to the decision boundary.

Definition 9.0.6. A one versus all classifier is a classifier that tells you whether a certain object is in class k, and it is typically used when there are more than 2 classes.

Remark 9.0.7. There is ambiguity if we use one versus all classifiers. For instance, if we use decision boundaries, there might be points in the space that wouldn't belong to any of the "correct" sides of our decision boundaries.

Definition 9.0.8. A one versus one classifier for k classes is a set of (k choose 2) binary decision classifiers, one for each pair of classes.


Remark 9.0.9. Note, this can run into similar problems as the one versus all classifier.

Definition 9.0.10. A Non-ambiguous Multi-class classifier is a set of k functions y_k(x, w_k, w_{k0}) = w_k^T x + w_{k0}, and x is in class k if k = argmax_{k′} y_{k′}(x, w_{k′}, w_{k′0}).

Remark 9.0.11. This is ill defined where two of the functions both attain the maximum value, but this is typically a set of measure 0.

Definition 9.0.12. Fisher's linear discriminant is a function of the form y(x, w, w_0) = w^T x + w_0, where w minimizes the function

J(w) = −(m_1 − m_2)² / (v_1 + v_2),

where m_i is the mean of the projected data w^T X_i and v_i is the variance of the projected data w^T X_i.

Computation 9.0.13. Suppose we have data distributed according to X_1 ∼ N(x_1 | m_1, S_1), X_2 ∼ N(x_2 | m_2, S_2). Then, after projecting, we have w^T X_i ∼ N(w^T x_i | w^T m_i, w^T S_i w).

Computation 9.0.14.

J(w) = −(w^T m_1 − w^T m_2)² / (w^T S_1 w + w^T S_2 w) = −w^T (m_1 − m_2)(m_1 − m_2)^T w / (w^T (S_1 + S_2) w).

In order to minimize J, we differentiate, set equal to 0, and then solve for w. Define S_w = S_1 + S_2 and S_b = (m_1 − m_2)(m_1 − m_2)^T. Then J(w) = −(w^T S_b w)/(w^T S_w w) and

0 = J′(w) = −[(2 S_b w)(w^T S_w w) − (2 S_w w)(w^T S_b w)] / (w^T S_w w)².

Since w^T S_w w and w^T S_b w are scalars, we need S_b w ∝ S_w w. That is, (m_1 − m_2)(m_1 − m_2)^T w ∝ S_w w, so (m_1 − m_2) ∝ S_w w, since again (m_1 − m_2)^T w is a scalar. Hence, we want

w ∝ S_w^{−1} (m_1 − m_2).
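A small Python sketch of this direction, assuming numpy and class data X1, X2 with one datum per row:

import numpy as np

def fisher_direction(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    # w is proportional to S_w^{-1} (m_1 - m_2).
    return np.linalg.solve(S1 + S2, m1 - m2)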

9.1. Perceptrons.

Remark 9.1.1. A general theme in machine learning is as follows. Given some features {x_1, . . . , x_d}, we compute a weighted sum S_w(x) = ∑_{i=1}^d w_i x_i, apply some nonlinear function f(S_w(x)), and obtain some result.

Definition 9.1.2. Define the function f(a) = 1 if a ≥ 0 and f(a) = −1 if a < 0. A perceptron is the function x ↦ f(w^T x).

Definition 9.1.3. The perceptron error function, for the perceptron defined above, is E_P(w) = −∑_{n=1}^N f(w^T x_n) t_n.

Remark 9.1.4. Suppose we have data {x_n, t_n}_{n=1}^N and error function E_P(w) = −∑_{n=1}^N t_n f(w^T x_n). Typically we would perform gradient descent on this error function. That doesn't work here because f is not differentiable.

Definition 9.1.5. The Perceptron Learning Algorithm is an algorithm for classifying data using perceptrons. Explicitly: sweep through the data, and for every datum write a = w^T x_n + w_0; if a t_n ≤ 0, then set w = w + t_n x_n and w_0 = w_0 + t_n. Repeat until no updates are made.
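A minimal Python sketch of this algorithm, assuming numpy; the cap on the number of sweeps is an illustrative assumption:

import numpy as np

def perceptron(X, t, max_sweeps=100):
    w = np.zeros(X.shape[1]); w0 = 0.0
    for _ in range(max_sweeps):
        updated = False
        for xn, tn in zip(X, t):
            a = w @ xn + w0
            if a * tn <= 0:                  # misclassified (or on the boundary)
                w, w0 = w + tn * xn, w0 + tn
                updated = True
        if not updated:                      # converged: all data correctly classified
            break
    return w, w0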


Remark 9.1.6. It's easy to show this converges if the data is linearly separable. Obviously, if it is not linearly separable, it will never converge.

Remark 9.1.7. Perceptrons can't represent the XOR function, because it is not linearly separable. To rectify this, we introduce multi-layer perceptrons, or neural networks.

Definition 9.1.8. A class conditional is the conditional probability distribution of x given that c is in class k, which we shall notate P(x | c = k).

Computation 9.1.9. Say we have a prior probability that an element c is in class k, which we shall notate P(c = k). Suppose we also have class conditionals P(x | c = k). Then, using Bayes' theorem,

P(c = k | x) = P(c = k) P(x | c = k) / ∑_{j=1}^K P(c = j) P(x | c = j).

Definition 9.1.10. The method of Generative Classification is given by the process in the previous computation, by which we start with priors and class conditionals, and use them to predict the probability an element is in a certain class.

Definition 9.1.11. A Naive Bayes classifier is a classifier for x that assumes all of its components x_d are independent given the class. That is, we take

P(x | c = k) = ∏_{d=1}^D P(x_d | c = k),

and then use generative classification, modeling each of the P(x_d | c = k) separately.

Remark 9.1.12. Generative models try to compute the distribution for the x's. This might waste effort if we just want to be able to classify them.

Definition 9.1.13. A Sigmoid function is heuristically an S shaped function.

Definition 9.1.14. The logistic sigmoid function is σ(x) = 1/(1 + e^{−x}).

Computation 9.1.15. Define a = log [P(c = 1) P(x | c = 1) / (P(c = 0) P(x | c = 0))]. Then,

P(c = 1 | x) = P(c = 1) P(x | c = 1) / (P(c = 1) P(x | c = 1) + P(c = 0) P(x | c = 0))
 = [P(c = 1) P(x | c = 1) / (P(c = 0) P(x | c = 0))] / [1 + P(c = 1) P(x | c = 1) / (P(c = 0) P(x | c = 0))]
 = e^a / (1 + e^a)
 = 1 / (1 + e^{−a}).

10. Class 3/10/14

10.1. Logistic Regression.


Remark 10.1.1. Let ρ be the prior probability that t is class 1. Then, as we saw last time,

P(t = 1 | x, Σ, µ_1, µ_2, ρ) = 1 / (1 + e^{−a}),

where a = w^T x + w_0, with w = Σ^{−1}(µ_1 − µ_2) and w_0 = −(1/2) µ_1^T Σ^{−1} µ_1 + (1/2) µ_2^T Σ^{−1} µ_2 + log(ρ/(1 − ρ)). In the case µ_1 = µ_2, we see that the decision is completely determined by the prior, as we expect.

Remark 10.1.2. The generative model computes the O(D²) entries of the covariance matrix, which may be expensive. The benefit we gain from this is that we can generate data. Logistic regression just tries to optimize w, w_0 directly, but it loses the ability to generate data. The aim is to find the w that makes the labels most likely given the inputs.

Computation 10.1.3. Say we have data {x_n, t_n}, t_n ∈ {0, 1}. Then,

P({t_n} | {x_n}, w) = ∏_{n=1}^N P(t_n | x_n, w) = ∏_{n=1}^N σ(x_n^T w)^{t_n} (1 − σ(x_n^T w))^{1−t_n}.

Then,

log P({t_n} | {x_n}, w) = ∑_{n=1}^N t_n log σ(x_n^T w) + (1 − t_n) log(1 − σ(x_n^T w)).

Using 1 − σ(z) = σ(−z), we can see

(d/dz) σ(z) = σ(z)(1 − σ(z)).

Differentiating to use gradient ascent, we have

∇_w log P(D | w) = ∑_{n=1}^N [t_n (1/σ(x_n^T w)) σ(x_n^T w)(1 − σ(x_n^T w)) − (1 − t_n) (1/(1 − σ(x_n^T w))) σ(x_n^T w)(1 − σ(x_n^T w))] x_n
 = ∑_{n=1}^N (t_n − σ(x_n^T w)) x_n.

We then update using w_new = w_old + α ∇_w log P({t_n} | {x_n}, w).
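A gradient-ascent sketch of this update in Python, assuming numpy; the step size and iteration count are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, t, alpha=0.1, iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (t - sigmoid(X @ w))   # sum_n (t_n - sigma(x_n^T w)) x_n
        w = w + alpha * grad                # gradient ascent on the log likelihood
    return w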

10.2. Neural Networks.

Computation 10.2.1. The idea of neural networks is that instead of just taking fixed basis functions, we allow the basis functions to depend on additional parameters θ_j. So suppose we have basis functions φ_j(x, θ_j), and predicting function y(x, w, θ) = ∑_{j=1}^J φ_j(x, θ_j) w_j. Then define E_n(w) = (1/2)(t_n − ∑_{j=1}^J φ_j(x_n, θ_j) w_j)², and a_n = ∑_{j=1}^J φ_j(x_n, θ_j) w_j. Observe that

∂E_n/∂w_j = (∂E_n/∂a_n)(∂a_n/∂w_j).


To do gradient descent, we have

w_j^new = w_j^old − α (∂/∂w_j) ∑_{n=1}^N E_n.

Similarly, we can use gradient descent for the θ_j parameters:

θ_j^new = θ_j^old − α (∂/∂θ_j) ∑_{n=1}^N E_n.

Note then that

∂E_n/∂θ_j = (∂E_n/∂a_n)(∂a_n/∂φ_{jn})(∂φ_{jn}/∂θ_j).

Observe that ∂a_n/∂φ_{jn} = w_j.

Definition 10.2.2. A hidden layer is as follows: given some inputs x_1, . . . , x_D, basis functions h_1(w_1, x), . . . , h_J(w_J, x) with weight vectors w_j, and output weights v_j, a hidden layer computes ∑_{j=1}^J v_j σ(h_j(w_j, x)), where σ is some sigmoid function.

Definition 10.2.3. A Neural Network is a collection of hidden layers, together with maps taking the output of one hidden layer as the inputs for the next hidden layer.
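A minimal one-hidden-layer sketch of these updates in Python, assuming numpy; the use of a single datum, squared error, and sigmoid hidden units are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(W, v, x, t, alpha=0.1):
    # Forward pass: hidden activations, then prediction y = v^T sigma(W x).
    h = sigmoid(W @ x)
    y = v @ h
    err = y - t                                   # dE/dy for E = (1/2)(t - y)^2
    grad_v = err * h                              # chain rule for the output weights
    grad_W = np.outer(err * v * h * (1 - h), x)   # back propagation to the first layer
    return W - alpha * grad_W, v - alpha * grad_v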

11. Class 3/12/14

11.1. Decision Trees.

Definition 11.1.1. First, let me give a formal definition: a decision tree is a classification function f represented by a collection of decision functions g_i, i ∈ I, and end nodes h_j, j ∈ J, where each h_j is a classification. Then f(x) is defined as follows: define i_1 = g_1(x), and recursively define i_n = g_{i_{n−1}}(x) so long as i_{n−1} ∈ I. Let n be the least integer for which i_n ∉ I; we require i_n ∈ J, and define f(x) = h_{i_n}.

Remark 11.1.2. The definition above is highly formal and fairly opaque. The idea is that we have a tree: we put our input in at the top and follow the arrows from one level to the next as determined by the input x, satisfying the following properties:

(1) Every internal node tests one attribute
(2) Branches occur on different attribute values
(3) Each attribute appears at most once in a path
(4) At a leaf node, return a label

Remark 11.1.3. In theory we could enumerate every possible decision tree and pick the one closest to our data. However, if we have d binary attributes, there are 2^d possible attribute configurations for each element of our data, and hence at least 2^{2^d} total possible labelings of all the data (since each configuration can be classified in at least two different ways). This is far too big, so we'll need to use information theory to determine which attributes are important.


Definition 11.1.4. Let X be a discrete random variable. The information content is

I(X = j) = log₂ (1 / P(X = j)).

Definition 11.1.5. The Shannon Entropy of a discrete random variable X is

H(X) = E[I(X = j)] = −∑_{j=1}^J P(X = j) log P(X = j).

Example 11.2. In the case X ∼ Bern(p) we obtain H(X) = −(p log p + (1 − p) log(1 − p)). Observe that this is maximized at p = 0.5, with H(X) = 1 (using log base 2), but as p approaches 0 or 1, the entropy approaches 0.

Definition 11.2.1. Let X, Y be two random variables. The specific conditional entropy of X given Y = k is

H(X | Y = k) = −∑_{j=1}^J P(X = j | Y = k) log P(X = j | Y = k).

Definition 11.2.2. For X, Y two random variables, the conditional entropy of X given Y is

H(X | Y) = ∑_{k=1}^K P(Y = k) H(X | Y = k) = −∑_{k=1}^K P(Y = k) ∑_{j=1}^J P(X = j | Y = k) log P(X = j | Y = k).

Definition 11.2.3. The mutual information of X and Y is

I(X; Y) = ∑_{j=1}^J ∑_{k=1}^K P(X = j, Y = k) log [P(X = j, Y = k) / (P(X = j) P(Y = k))].

Exercise 11.3. Show I(X; Y) = I(Y; X), I(X; Y) = H(X) + H(Y) − H(X, Y), and I(X; Y) = H(X) − H(X | Y).

Remark 11.3.1. Once you are able to calculate entropies, you want to pick attributes that minimize the conditional entropy of the label (equivalently, maximize the information gain), and you might stop splitting after some given depth.
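A small Python sketch of these quantities, assuming numpy and a joint probability table P with entries P(X = j, Y = k); it uses the identity I(X;Y) = H(X) + H(Y) − H(X,Y) from Exercise 11.3:

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(P):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), with P the joint table P(X=j, Y=k).
    px, py = P.sum(axis=1), P.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(P.ravel())

# Example: the Bernoulli(0.5) entropy is 1 bit.
print(entropy(np.array([0.5, 0.5])))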

12. Class 3/24/14

12.1. Support Vector Machines.

12.1.1. Simple Margin Classifiers.

Remark 12.1.2. Today, we will search for linear basis function classifiers for data {x_n, t_n}, t_n ∈ {±1}. Let φ_j : χ → R be basis functions and φ : χ → R^J the map they define. Define

f : χ × R^J × R → R, (x, w, b) ↦ φ(x)^T w + b.

To f, we associate a classifier

y(x, w, b) = 1 if φ(x)^T w + b > 0, and −1 otherwise.

Definition 12.1.3. The Decision Boundary associated to the classifier y above is the hyperplane H = {z | z^T w + b = 0}.

Lemma 12.1.4. The vector w is orthogonal to all vectors in the decision boundary (where by vectors in the decision boundary we really mean differences of vectors in H as defined above, so that they form a linear subspace).


Proof. Say z_1, z_2 lie in the decision boundary. Then z_1 − z_2 is a vector in the decision boundary. Observe that w^T (z_1 − z_2) = (−b) − (−b) = 0, so w is perpendicular to z_1 − z_2. □

Computation 12.1.5. Let x ∈ χ; then we have φ(x) ∈ R^J. Define φ_1(x) to be the point on H closest to φ(x). It follows that φ(x) − φ_1(x) is perpendicular to the plane of the decision boundary, and so φ(x) − φ_1(x) = r · w/|w| for some real number r. Writing φ(x) = φ_1(x) + r · w/|w| and applying w^T to both sides, we get

w^T φ(x) = w^T φ_1(x) + r w^T w/|w| = −b + r|w|.

By this computation, we may note that r = (φ(x)^T w + b)/|w| = f(x, w, b)/|w|.

Definition 12.1.6. Using the notation of the previous computation, the Margin of datum n is t_n · r = t_n · (φ(x_n)^T w + b)/|w|.

Remark 12.1.7. If the data is linearly separable, we can choose w so that all margins are positive.

Definition 12.1.8. The Margin for the training data is min_n {t_n (φ(x_n)^T w + b)/|w|}.

Remark 12.1.9. The aim is to find (w*, b*) = argmax_{w,b} (1/|w|) min_n {t_n(φ(x_n)^T w + b)}. Note that if we rescale w and b, the margin remains the same. By this invariance, we may rescale so that the training data satisfy t_n(φ(x_n)^T w + b) ≥ 1.

Remark 12.1.10. Given the new constraints from the previous remark, we want to optimize

(w*, b*) = argmax_{w,b} (1/|w|) min_n {t_n(φ(x_n)^T w + b)} = argmax_{w,b} 1/|w|, subject to t_n(φ(x_n)^T w + b) ≥ 1. Equivalently, (w*, b*) = argmin_{w,b} |w|² subject to t_n(φ(x_n)^T w + b) ≥ 1, which is a quadratic program.

Definition 12.1.11. For each data point x_n we associate a Slack Variable ξ_n satisfying the following properties.

(1) If ξ_n = 0, then x_n is correctly classified and outside the margin.
(2) If 0 < ξ_n ≤ 1, then x_n is correctly classified but within the margin.
(3) If ξ_n > 1, then x_n is incorrectly classified.

Remark 12.1.12. Let ξ = (ξ_1, . . . , ξ_N). We relax the constraints to t_n(φ(x_n)^T w + b) ≥ 1 − ξ_n with ξ_n ≥ 0, and create a new objective: find (w*, b*, ξ*) = argmin_{w,b,ξ} {(1/2)|w|² + c ∑_{n=1}^N ξ_n} subject to those constraints, where c > 0 is a regularization parameter that we choose.

Remark 12.1.13. Observe that ∑_{n=1}^N ξ_n is an upper bound on the number of incorrectly classified points. People often parameterize c = 1/(νN), with 0 < ν ≤ 1 representing the tolerance, as the fraction allowed to be incorrect.

13. Class 3/31/14

Remark 13.0.14. Suppose we have data {x_n, t_n}, t_n ∈ {±1}, and J basis functions φ_j(x), which are the components of the vector φ(x).


We are looking to maximize the margin. Since we have scaling invariance from the previous lecture, we may assume min_n t_n(w^T φ(x_n) + b) = 1. We find

(w*, b*) = argmax_{w,b} min_n [t_n(w^T φ(x_n) + b)/|w|] = argmax_{w,b} 1/|w| = argmin_{w,b} (1/2)|w|²,

subject to t_n(w^T φ(x_n) + b) ≥ 1.

Computation 13.0.15. Denote α = (α_1, . . . , α_N). Define the Lagrangian

L(w, b, α) = (1/2)|w|² − ∑_{n=1}^N α_n (t_n(w^T φ(x_n) + b) − 1),

and (w*, b*) = argmin_{w,b} max_{α_n ≥ 0} L(w, b, α). By strong duality, we have

min_{w,b} max_{α_n ≥ 0} L(w, b, α) = max_{α_n ≥ 0} min_{w,b} L(w, b, α).

We need ∂L/∂w = 0 and ∂L/∂b = 0. We get 0 = ∂L/∂w = w − ∑_{n=1}^N α_n t_n φ(x_n), so w = ∑_{n=1}^N α_n t_n φ(x_n). Also, 0 = ∂L/∂b = −∑_{n=1}^N α_n t_n. Plugging in these two conditions, we see

M(α) = (1/2)|w|² − ∑_{n=1}^N α_n t_n (w^T φ(x_n) + b) + ∑_{n=1}^N α_n
 = (1/2)|∑_{n=1}^N α_n t_n φ(x_n)|² − ∑_{n=1}^N α_n t_n (∑_{m=1}^N α_m t_m φ(x_m))^T φ(x_n) + ∑_{n=1}^N α_n
 = ∑_{n=1}^N α_n − (1/2) ∑_{n=1}^N ∑_{m=1}^N α_n α_m t_n t_m φ(x_n)^T φ(x_m),

using ∑_n α_n t_n = 0 to drop the b term. With α* = argmax_α M(α), we have w = ∑_{n=1}^N α*_n t_n φ(x_n). At the optimal solution we will have

α*_n (t_n (∑_{m=1}^N α*_m t_m φ(x_m)^T φ(x_n) + b*) − 1) = 0,

so either α*_n = 0, or α*_n > 0 and t_n (∑_{m=1}^N α*_m t_m φ(x_m)^T φ(x_n) + b*) = 1.

Definition 13.0.16. A (Mercer) kernel function is a function K(x, y) = φ(x)^T φ(y) for some function φ : χ → R^J.

Remark 13.0.17. Kernels are good because they allow you to compute things faster. In particular, they let you compute the Lagrangian above faster.

Example 13.1. Take φ(x) to have components x_j x_k for each pair (j, k). Then,

K(x, z) = (x^T z)² = (∑_d x_d z_d)² = (∑_d x_d z_d)(∑_c x_c z_c) = ∑_{c,d} x_d x_c z_d z_c.

Other examples include K(x, z) = (x^T z + c)^M and K(x, z) = e^{−(1/2)|x − z|²}.

Example 13.2. Note φ(x)^T w* + b* = b* + φ(x)^T ∑_n α*_n t_n φ(x_n) = b* + ∑_n α*_n t_n K(x, x_n).
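A small Python sketch of these kernels and of the kernel form of the classifier in Example 13.2, assuming numpy; the support coefficients alpha, labels t, support points Xs, and bias b are taken as given:

import numpy as np

def poly_kernel(x, z, c=1.0, M=2):
    return (x @ z + c) ** M

def rbf_kernel(x, z):
    return np.exp(-0.5 * np.sum((x - z) ** 2))

def svm_predict(x, Xs, t, alpha, b, kernel=rbf_kernel):
    # f(x) = b + sum_n alpha_n t_n K(x, x_n); classify by the sign.
    return b + sum(a * tn * kernel(x, xn) for a, tn, xn in zip(alpha, t, Xs))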


14. Class 4/2/14

14.1. Markov Decision process (MDP).

Definition 14.1.1. A fully observable world is a 4-tuple (S, A, P, R) with

S = {1, . . . , N}, the set of states,
A = {1, . . . , M}, the set of actions,
P : S × S × A → [0, 1], P(s′, s, a) = P(s′ | s, a), with ∑_{s′∈S} P(s′ | s, a) = 1 and P(s′ | s, a) ≥ 0,

a probability mass function indexed by state and action, and a reward function

R : S × A → R.

Definition 14.1.2. A Policy with respect to a world (S, A, P, R) is a function π : N × S → A, or equivalently a sequence of functions π_t : S → A, for each t ∈ N.

Definition 14.1.3. A Markov Decision Model is, for each t ∈ N, a fully observable world W_t = (S, A, P_t, R_t) with S and A independent of t, and a policy π_t with respect to that world W_t, satisfying the Markov property and having bounded rewards. By the Markov property, we mean P(s′ | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_1, a_1) = P(s′ | s_t, a_t). By bounded rewards, we mean that we require R_t : S × A → (−N, N) ⊂ R where N is independent of t.

Definition 14.1.4. A Finite Horizon Model is a Markov decision model aiming to maximize the function U = ∑_{t=0}^{T−1} R(s_t, a_t) for T a stopping time.

Definition 14.1.5. A Total Discounted Reward Model is a Markov decision model aiming to maximize the function U = ∑_{t=0}^∞ γ^t R(s_t, a_t), with γ ∈ (0, 1).

Definition 14.1.6. A Total Reward Model is a Markov decision model aiming to maximize the function U = ∑_{t=0}^∞ R(s_t, a_t).

Definition 14.1.7. A Long Run Average Reward Model is a Markov decision model aiming to maximize the function U = lim_{T→∞} (1/T) ∑_{t=0}^{T−1} R(s_t, a_t).

Remark 14.1.8. The latter two models are typically less tractable than the former two models.

Example 14.2. Suppose we are in an auction in which you value an item at 150. The bidding starts at 0. You can bid at 100 and 200. The auction closes if no one bids 200. The transition model: if you don't bid, then nature bids with probability 1/2; if you attempt to bid, you make that bid with probability 0.7.

Definition 14.2.1. The Expectimax Algorithm is an algorithm for computing the policy function. We draw a decision tree of all possible states and recursively compute the expected value of each state, starting at the end states and working backwards. There are two types of nodes: those in which you choose where to move (to maximize your utility) and those where nature moves with prescribed probabilities. Explicitly, we compute Expectimax of a particular start state s. If we start at a terminal node, we return 0. Otherwise, for each action a we compute Q(s, a) = R(s, a) + ∑_{s′∈S} P(s′ | s, a) Expectimax(s′) and set π*(s) = argmax_{a∈A} Q(s, a).


Definition 14.2.2. A Value function is a function V : S → R.

Definition 14.2.3. A Q Function is a function Q : S × A → R.

Example 14.3. Let us now look at several examples of value functions.

(1) The "last gasp" policy would be π_1(s) = argmax_a R(s, a). It would have value function V_1(s) = max_a R(s, a).
(2) The "second to last gasp" policy would be
π_2(s) = argmax_a {R(s, a) + ∑_{s′∈S} P(s′ | s, a) max_{a′∈A} R(s′, a′)},
with value function
V_2(s) = max_a {R(s, a) + ∑_{s′∈S} P(s′ | s, a) V_1(s′)}.
(3) In general, the "k steps to go" value function is
V_k(s) = max_{a∈A} {R(s, a) + ∑_{s′∈S} P(s′ | s, a) V_{k−1}(s′)}.
(4) Define Q_k(s, a) = R(s, a) + ∑_{s′∈S} P(s′ | s, a) V_{k−1}(s′), and then we can simplify
π*_k(s) = argmax_{a∈A} Q_k(s, a)
and
V_k(s) = max_{a∈A} Q_k(s, a) = Q_k(s, π*_k(s)),
with the special case V_0(s) = 0.

Example 14.4. Define the transition and reward tables (rows indexed by the current state s, columns by s′ or a):

P(s′ | s, a_1):   s_1   s_2
          s_1    0.6   0.4
          s_2    0.6   0.4

P(s′ | s, a_2):   s_1   s_2
          s_1     1     0
          s_2     0     1

R(s, a):          a_1   a_2
          s_1     1     0
          s_2    −1     0

Now, we compute the optimal policies for various numbers of steps to go (see the table below). This shows that you always want to take action 1 in state 1; if there isn't much time left you want to take action 2 in state 2, but if there is enough time left, you should take action 1 in state 2.


Steps to go   Q_k(s_1,a_1)   Q_k(s_1,a_2)   Q_k(s_2,a_1)   Q_k(s_2,a_2)   π*_k(s_1)   π*_k(s_2)   V_k(s_1)   V_k(s_2)
0             -              -              -              -              -           -           0          0
1             1              0              -1             0              a_1         a_2         1          0
2             1.6            1              -0.4           0              a_1         a_2         1.6        0
3             1.96           1.6            -0.04          0              a_1         a_2         1.96       0
4             2.176          1.96           0.176          0              a_1         a_1         2.176      0.176

Definition 14.4.1. The Discounted Infinite Horizon problem is the process of finding a policy maximizing the utility function U = ∑_{t=0}^∞ γ^t R(s_t, a_t), with γ ∈ (0, 1). We now take the Q-function Q(s, a) = R(s, a) + γ ∑_{s′∈S} P(s′ | s, a) V(s′), with V(s) = max_{a∈A} Q(s, a) = Q(s, π*(s)) and π*(s) = argmax_a Q(s, a).

Computation 14.4.2. Let us now define an algorithm for infinite horizon value iteration, to find the optimal value and policy functions.

Initialize V(s) = 0
Repeat:
  V_old(s) = V(s)
  For each s in S:
    For each a in A:
      Q(s,a) = R(s,a) + gamma * sum_{s' in S} P(s'|s,a) V_old(s')
    pi*(s) = argmax_a Q(s,a)
    V(s) = Q(s, pi*(s))
Until max_s |V(s) - V_old(s)| < epsilon
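A runnable Python version of the pseudocode above (a sketch, assuming numpy, with R stored as an S × A array and P as an A × S × S array holding P(s′ | s, a)):

import numpy as np

def value_iteration(R, P, gamma, eps=1e-6):
    S, A = R.shape
    V = np.zeros(S)
    while True:
        V_old = V.copy()
        Q = R + gamma * (P @ V_old).T   # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V_old(s')
        pi = Q.argmax(axis=1)
        V = Q.max(axis=1)
        if np.max(np.abs(V - V_old)) < eps:
            return V, pi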

By the contraction lemma, these equations have a fixed point, which will be reached in the limit.

15. Class 4/9/14

Computation 15.0.3. We now define a similar algorithm to find the best policy. This time, instead of trying to optimize the value function, we try to optimize the policy. Denote r^π_s = R_{s,π(s)}, and let r^π ∈ R^N be the vector of the r^π_s. Denote by P^π the N × N matrix with entries P^π_{i,j} = P(j | i, π(i)).

Note that setting v = r^π + γ P^π v, we have (I − γ P^π) v = r^π, so v = (I − γ P^π)^{−1} r^π. Computing the inverse has cubic cost in the number of states, which is why this step may be slow, but in practice it is the best approach.

Initialize pi = pi_0
Repeat:
  pi_old = pi
  V = (I - gamma P^pi)^{-1} r^pi
  For s in S:
    For a in A:
      Q(s,a) = R(s,a) + gamma * sum_{s' in S} P(s'|s,a) V(s')
    pi(s) = argmax_a Q(s,a)
Until pi = pi_old
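A runnable Python sketch of this policy iteration, under the same R and P conventions as the value iteration sketch above:

import numpy as np

def policy_iteration(R, P, gamma):
    S, A = R.shape
    pi = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: V = (I - gamma P^pi)^{-1} r^pi.
        P_pi = P[pi, np.arange(S), :]        # row s is P(.|s, pi(s))
        r_pi = R[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement.
        Q = R + gamma * (P @ V).T
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return V, pi
        pi = pi_new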


15.1. Reinforcement Learning.

Remark 15.1.1. Reinforcement learning assumes an unknown model P(s′ | s, a) and unknown rewards R(s, a), which we are trying to estimate. There are two types of reinforcement learning: model-based reinforcement learning and model-free reinforcement learning.

Definition 15.1.2. In Model Based Learning we keep track of N_{s,a}, the number of times we performed action a in state s; R^total_{s,a}, the total reward from doing a in state s; and N_{s,a,s′}, the number of times we went directly from state s to s′ after taking action a. Define R̂_{s,a} = R^total_{s,a} / N_{s,a} and P̂^a_{s,s′} = N_{s,a,s′} / N_{s,a}.

Definition 15.1.3. In Model Free Learning we are not trying to find a model, but just trying to find what action we should take in a particular situation.

Definition 15.1.4. Q-Learning is a model-free learning technique which tries to approximate the Q function. We have

Q_{s,a} = R_{s,a} + γ ∑_{s′∈S} P(s′ | s, a) max_{a′∈A} Q_{s′,a′} = R_{s,a} + γ E_{s′} [max_{a′∈A} Q_{s′,a′}] = E_{s′} [R_{s,a} + γ max_{a′∈A} Q_{s′,a′}].

We then use the update rule

Q_{s,a} ← Q_{s,a} + α ((r + γ max_{a′∈A} Q_{s′,a′}) − Q_{s,a}).

We can also write down an approximate Q function for continuous states s: write Q̂(s, a) = w_0 + φ(s)^T w_a and use gradient descent.
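A sketch of the tabular Q-learning update in Python, assuming numpy and a Q table indexed by (state, action); how the action and next state are obtained is left to the caller:

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Q(s,a) <- Q(s,a) + alpha * ((r + gamma * max_a' Q(s',a')) - Q(s,a))
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q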

16. Class 4/8/14

16.1. Partially Observable Markov Decision Process (POMDP).

Definition 16.1.1. A Partially Observable Markov Decision Process is a process in which we don't know which state we are in; instead we are given observations O = {1, 2, . . . , J} and an observation model P(o | s).

Definition 16.1.2. A Belief State MDP is an MDP whose states are distributions over the environment states; the belief state space B may be continuous in this case.

Example 16.2. For a binary environmental state, we could have a belief state b ∈ [0, 1] corresponding to the parameter of a Bernoulli distribution.

Example 16.3. There are two states, Good and Bad. Good has a +10 reward, Bad has −20, and we can observe for a cost of −1. We know P(O = Good | S = Good) = 0.8 and P(O = Bad | S = Bad) = 0.8. Using Bayes' rule, starting from a uniform belief,

P(S = G | O = G) = P(S = G) P(O = G | S = G) / P(O = G) = (0.5 · 0.8)/(0.5 · 0.8 + 0.5 · 0.2) = 0.8.

Remark 16.3.1. Note that we no longer require a finite state space.

Definition 16.3.2. A Finite Memory Policy is a policy over a new state space keeping track of the last K observations, i.e. S′ = O^K.

Definition 16.3.3. The mechanism of Stigmergy is the process of an agent modifying the environment to leave behind information for future agents.


17. Class 4/16/14

Remark 17.0.4. Recall K-means clustering. We had data x_n, clusters µ_k, responsibilities r_n, and a distortion measure J(µ, r) = ∑_{n,k} r_{nk} |x_n − µ_k|². We could then use Lloyd's algorithm to update r_{nk} to be 1 for k = argmin_{k′} |x_n − µ_{k′}|², and update

µ_k = ∑_n r_{nk} x_n / ∑_n r_{nk}.

Definition 17.0.5. An algorithm we shall call soft K-means is described as follows. We will have Soft Responsibilities defined by r_{nk} ≥ 0, ∑_k r_{nk} = 1. We define

r_{nk} = exp(−(β/2)|x_n − µ_k|²) / ∑_{k′} exp(−(β/2)|x_n − µ_{k′}|²).

Definition 17.0.6. A Mixture Model is a linear combination of distributions from a single parametric family of distributions. I.e., each mixture model has component distributions P(x | θ), K Mixture components θ_k, and K Mixture weights π_k ≥ 0 with ∑_k π_k = 1, and the mixture model is P(x | {θ_k, π_k}) = ∑_k π_k P(x | θ_k).

Remark 17.0.7. To generate a value from a mixture distribution, first draw k from π and then draw x from the kth component of the mixture: x ∼ P(x | θ_k).

Example 17.1. A generative model for a document could be as follows:

(1) Draw topics for the document.
(2) For each word in the document, first pick the word's topic, and second pick the word from the topic's distribution.
(3) Repeat until you've finished the document.

Computation 17.1.1. Suppose we have a Gaussian mixture with K means and covariances µ_k, Σ_k and mixture weights π_k. Then, define

P(x | π, µ, Σ) = ∑_k π_k N(x | µ_k, Σ_k).

Then, we can learn from the data x_n by noting

log P(x | π, µ, Σ) = ∑_n log ∑_k π_k N(x_n | µ_k, Σ_k).

Now, let us look at a simple Gaussian mixture model in which we know π_k = 1/K and Σ_k = (1/β) I. Let the µ_k be unknowns. Then,

log P(x | µ) = ∑_n log ∑_k (1/K) N(x_n | µ_k, (1/β) I).

Differentiating with respect to µ_k, we obtain

∂/∂µ_k log P(x | µ) = ∑_n [∂/∂µ_k (1/K) N(x_n | µ_k, (1/β) I)] / [∑_{k′} (1/K) N(x_n | µ_{k′}, (1/β) I)]
 = ∑_n [N(x_n | µ_k, (1/β) I) / ∑_{k′} N(x_n | µ_{k′}, (1/β) I)] · β(x_n − µ_k)
 = ∑_n r_{nk} · β(x_n − µ_k).


To get the r_{nk}, note that

r_{nk} = N(x_n | µ_k, (1/β) I) / ∑_{k′} N(x_n | µ_{k′}, (1/β) I)
 = (β/2π)^{D/2} exp(−(β/2)(x_n − µ_k)^T (x_n − µ_k)) / ∑_{k′} (β/2π)^{D/2} exp(−(β/2)(x_n − µ_{k′})^T (x_n − µ_{k′}))
 = exp(−(β/2)|x_n − µ_k|²) / ∑_{k′} exp(−(β/2)|x_n − µ_{k′}|²).

Now, setting ∂/∂µ_k = 0, we obtain ∑_n r_{nk} x_n = ∑_n r_{nk} µ_k, and so

µ_k = ∑_n r_{nk} x_n / ∑_n r_{nk}

is our maximum likelihood solution.

18. Class 4/21/14

Remark 18.0.2. We will now examine latent variable models, with z_n latent variables and x_n observables. Our goals are

(1) Infer the z_n for each x_n.
(2) Learn the model parameters π, {µ_k, Σ_k}.

Note that

P(z_{nk} = 1 | x_n, Θ) = P(z_{nk} = 1) P(x_n | z_{nk} = 1, Θ) / ∑_{k′} P(z_{nk′} = 1) P(x_n | z_{nk′} = 1, Θ).

Computation 18.0.3. Let P(x | {π_k, µ_k, Σ_k}) = ∑_{k=1}^K π_k N(x | µ_k, Σ_k) and log P({x_n} | {π_k, µ_k, Σ_k}) = ∑_{n=1}^N log ∑_{k=1}^K π_k N(x_n | µ_k, Σ_k). We then introduce explicit latent variables z_n with z_{nk} ∈ {0, 1} and ∑_{k=1}^K z_{nk} = 1. We then define

P(z | π) = ∏_{k=1}^K π_k^{z_k}

and

P(x | z, {µ_k, Σ_k}) = ∏_{k=1}^K (N(x | µ_k, Σ_k))^{z_k}.

Now,

P(x, z | {π_k, µ_k, Σ_k}) = P(z | π) P(x | z, {µ_k, Σ_k}) = ∏_k (π_k N(x | µ_k, Σ_k))^{z_k}.

Then, the complete data log likelihood is

log P({x_n, z_n} | {π_k, µ_k, Σ_k}) = ∑_{n=1}^N ∑_{k=1}^K z_{nk} log π_k + z_{nk} log N(x_n | µ_k, Σ_k),

which is maximized by N_k = ∑_n z_{nk}, π_k = N_k / N, µ_k = (1/N_k) ∑_{n=1}^N z_{nk} x_n, and Σ_k = (1/N_k) ∑_{n=1}^N z_{nk} (x_n − µ_k)(x_n − µ_k)^T.

Now, suppose we have estimates γ_n of z_n with γ_{nk} ≥ 0 and ∑_k γ_{nk} = 1 (note, Bishop writes γ_{nk} = γ(z_{nk})).


Definition 18.0.4. Then, the expected complete data log likelihood (ECDLL) is

E_{γ_n}[log P({x_n, z_n} | {π_k, µ_k, Σ_k})].

Definition 18.0.5. The Expectation Maximization Algorithm (EM) is given by an initialization of the γ_n, followed by two steps in each iteration:

(1) M-step: Given {γ_n}, maximize the ECDLL in terms of {µ_k, Σ_k, π_k}.
(2) E-step: Improve the estimate {γ_n} given the parameters {π_k, µ_k, Σ_k}.

Computation 18.0.6.
(1) The M-step: take the ECDLL; as we saw before, in the case of Gaussian mixtures, we can write it as

∑_k ∑_n γ_{nk} [∑_{k′} δ_{k′,k} log π_{k′} + δ_{k,k′} log N(x_n | µ_{k′}, Σ_{k′})] = ∑_n ∑_k γ_{nk} log π_k + γ_{nk} log N(x_n | µ_k, Σ_k).

Then, taking the partial derivative with respect to π_k, with a Lagrange multiplier for the constraint ∑_k π_k = 1, we set

0 = ∂/∂π_k [∑_n ∑_k γ_{nk} log π_k + λ(∑_k π_k − 1)],

which forces ∑_n γ_{nk} (1/π_k) + λ = 0, i.e. ∑_n γ_{nk} + π_k λ = 0. Hence ∑_k ∑_n γ_{nk} + λ ∑_k π_k = 0, which implies λ = −N, so that

π_k = ∑_n γ_{nk} / N.

(2) The E-step:

γ_{nk} = P(z_{nk} = 1 | x_n, {π_{k′}, µ_{k′}, Σ_{k′}})
 = P(z_{nk} = 1) P(x_n | z_{nk} = 1, {µ_k, Σ_k}) / P(x_n | {π_k, µ_k, Σ_k})
 = π_k N(x_n | µ_k, Σ_k) / ∑_{k′} π_{k′} N(x_n | µ_{k′}, Σ_{k′}).

Remark 18.0.7. Recall we had the sum ∑_n log ∑_k π_k N(x_n | µ_k, Σ_k). Using Jensen's inequality, E_q[f(x)] ≤ f(E_q[x]) for concave f (such as log), we obtain a lower bound on the quantity we are maximizing.
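A minimal Python sketch of EM for a Gaussian mixture following these E and M steps, assuming numpy; the Gaussian density is written out by hand and the initialization is illustrative:

import numpy as np

def gauss(X, mu, Sigma):
    # Multivariate normal density N(x | mu, Sigma), evaluated row-wise on X.
    D = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

def em_gmm(X, K, iters=50, rng=np.random.default_rng(0)):
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)].astype(float)
    Sigma = np.array([np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: gamma_nk proportional to pi_k N(x_n | mu_k, Sigma_k).
        dens = np.stack([pi[k] * gauss(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, Sigma from the responsibilities.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma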

19. Class 4/23/14

Remark 19.0.8. Let x_1, . . . , x_T be data, and suppose we have a model P(x_1, . . . , x_T). Suppose the x_i are one-hot coded with K possible outcomes. If the model is fully joint, the number of configurations to model is O(K^T), which is far too big. If it is fully independent, then it takes only O(KT) parameters to model, but such a model might not be expressive enough to be successful. This motivates the Markov assumption.

Definition 19.0.9. A model P(x_1, . . . , x_T) is a Markov Model if x_{t+1} is independent of x_{t−1}, . . . , x_1 given x_t. Equivalently, P(x_{t+1} | x_t, x_{t−1}, . . .) = P(x_{t+1} | x_t).

Remark 19.0.10. There is a sort of yoga of diagrams describing these models, with arrows describing relations between states:

• Markov chains: A → B → C → D
• Independent objects: A, B, C, D with no arrows
• Full dependence: arrows between every pair of variables, e.g. A → B, A → C, B → C

Definition 19.0.11. Let Z_t be discrete variables with M outcomes, which we think of as hidden, and X_t be discrete random variables with K outcomes, which we think of as visible. A Hidden Markov Model then defines

P(z_1, . . . , z_T, x_1, . . . , x_T) = P(z_1) P(x_1 | z_1) ∏_{t=2}^T P(z_t | z_{t−1}) P(x_t | z_t),

with P(z_1) the initial distribution π, P(z_t | z_{t−1}) the transition probabilities given by A, an M × M matrix, and P(x_t | z_t) the emissions given by Φ, an M × K matrix.

Remark 19.0.12. A hidden Markov model has an arrow diagram of the form Z_1 → Z_2 → Z_3 → Z_4 → ⋯, with a downward arrow Z_t → X_t from each hidden state to its observation.

Remark 19.0.13. There are several things we might want to use hidden Markov models for:

• Prediction: we may want to know P(x_{t+1} | x_1, . . . , x_t).
• Filtering: P(z_T | x_1, . . . , x_T).
• Smoothing: P(z_t | x_1, . . . , x_T) for t < T.
• Explanation: z*_1, . . . , z*_t = argmax_z P(z_1, . . . , z_t | x_1, . . . , x_t).
• Learning: how to find π*, A*, Φ* = argmax_{π,A,Φ} P(x_1, . . . , x_t | π, A, Φ).

Remark 19.0.14. We can solve prediction, filtering, and smoothing with the forward-backward algorithm; explanation can be solved with the Viterbi algorithm; and learning can be solved with expectation maximization, using the forward-backward algorithm in the inner loop, which is called Baum-Welch.

Remark 19.0.15. Imagine we have a Markov chain x_1 → x_2 → x_3 → x_4 → x_5. Note, we can write P(x_1, . . . , x_5) = P(x_1) P(x_2 | x_1) ⋯ P(x_5 | x_4). Then, we can write

P(x_3) = ∑_{x_1} ∑_{x_2} ∑_{x_4} ∑_{x_5} P(x_1, . . . , x_5) = ∑_{x_1} P(x_1) ∑_{x_2} P(x_2 | x_1) P(x_3 | x_2) ∑_{x_4} P(x_4 | x_3) ∑_{x_5} P(x_5 | x_4).

We can then write the marginal over x_3 in terms of the marginal over x_2, and write that in terms of the marginal over x_1, which explicitly is just P(x_1).
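A small Python sketch of the forward pass used for filtering and for the likelihood P(x_1, . . . , x_T), assuming numpy, with pi, A, Phi as in Definition 19.0.11 and xs a sequence of integer observation symbols:

import numpy as np

def forward(pi, A, Phi, xs):
    # alpha holds the filtering distribution P(z_t | x_1..x_t);
    # log_lik accumulates log P(x_1..x_T) via the per-step normalizers.
    alpha = pi * Phi[:, xs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for x in xs[1:]:
        alpha = (A.T @ alpha) * Phi[:, x]
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return alpha, log_lik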