Midterm Review (CS 229 / STATS 229)
Zhihan Xiong
Stanford University
15th May, 2020
Outline
1. Supervised Learning
   - Discriminative Algorithms
   - Generative Algorithms
   - Kernel and SVM
2. Neural Networks
3. Unsupervised Learning
4. Practice Exam Problem (If time permits)
Supervised Learning: Discriminative Algorithms
Optimization Methods
Gradient and Hessian (differentiable function $f: \mathbb{R}^d \to \mathbb{R}$):

$\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_d} \end{bmatrix}^T \in \mathbb{R}^d$   (Gradient)

$\nabla_x^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix} \in \mathbb{R}^{d \times d}$   (Hessian)
Gradient Descent and Newton's Method (objective function $J(\theta)$):

$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta^{(t)})$   (Gradient descent)

$\theta^{(t+1)} = \theta^{(t)} - \left[ \nabla_\theta^2 J(\theta^{(t)}) \right]^{-1} \nabla_\theta J(\theta^{(t)})$   (Newton's method)
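A minimal NumPy sketch of the two update rules; the quadratic objective, step size, and iteration counts are arbitrary stand-ins, not part of the slides:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.01, iters=500):
    # theta^{(t+1)} = theta^{(t)} - alpha * grad(theta^{(t)})
    theta = theta0.copy()
    for _ in range(iters):
        theta -= alpha * grad(theta)
    return theta

def newtons_method(grad, hess, theta0, iters=10):
    # theta^{(t+1)} = theta^{(t)} - [hess(theta^{(t)})]^{-1} grad(theta^{(t)})
    theta = theta0.copy()
    for _ in range(iters):
        theta -= np.linalg.solve(hess(theta), grad(theta))
    return theta

# Example: quadratic objective J(theta) = 0.5 * ||X theta - y||^2
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
grad = lambda th: X.T @ (X @ th - y)
hess = lambda th: X.T @ X
theta_gd = gradient_descent(grad, np.zeros(3))
theta_nt = newtons_method(grad, hess, np.zeros(3))  # exact after one step here
```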
Least Squares: Gradient Descent
Model: $h_\theta(x) = \theta^T x$

Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$, with $x^{(i)} \in \mathbb{R}^d$

Loss: $J(\theta) = \frac{1}{2} \sum_{i=1}^n \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Update rule:

$\theta^{(t+1)} = \theta^{(t)} - \alpha \sum_{i=1}^n \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

Stochastic Gradient Descent (SGD)

Pick one data point $x^{(i)}$ and then update:

$\theta^{(t+1)} = \theta^{(t)} - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
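A short NumPy sketch of the batch and stochastic updates on synthetic data; the data, learning rate, and iteration counts are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))                       # rows are x^{(i)}
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Batch gradient descent on J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2
theta = np.zeros(d)
alpha = 0.005
for _ in range(200):
    theta -= alpha * X.T @ (X @ theta - y)

# Stochastic gradient descent: one randomly chosen example per update
theta_sgd = np.zeros(d)
for _ in range(2000):
    i = rng.integers(n)
    theta_sgd -= alpha * (X[i] @ theta_sgd - y[i]) * X[i]
```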
Least Squares: Closed Form
Loss in matrix form: $J(\theta) = \frac{1}{2} \| X\theta - y \|_2^2$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$

Normal equation (set the gradient to 0):

$X^T (X\theta^\star - y) = 0$

Closed-form solution:

$\theta^\star = \left( X^T X \right)^{-1} X^T y$
Connection to Newton's Method

Because $J$ is quadratic (second-order), Newton's method is exact: a single step from any $\theta^{(t)}$,

$\theta^\star = \theta^{(t)} - \left[ \nabla_\theta^2 J \right]^{-1} \nabla_\theta J(\theta^{(t)}),$

lands on the closed-form solution above.
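A sketch of the closed form, assuming NumPy; `np.linalg.solve` is applied to the normal equation rather than forming the explicit inverse:

```python
import numpy as np

def least_squares_closed_form(X, y):
    # Solve the normal equation X^T X theta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)
```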
Logistic Regression
A binary classification model with $y^{(i)} \in \{0, 1\}$.

Assumed model:

$p(y \mid x; \theta) = \begin{cases} g_\theta(x) & \text{if } y = 1 \\ 1 - g_\theta(x) & \text{if } y = 0 \end{cases}$, where $g_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

Log-likelihood function:

$\ell(\theta) = \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^n \left[ y^{(i)} \log g_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - g_\theta(x^{(i)}) \right) \right]$

Fit the parameters by maximizing the log-likelihood, $\max_\theta \ell(\theta)$ (done in Pset 1).
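A minimal sketch of fitting logistic regression by gradient ascent, assuming the standard gradient $\nabla_\theta \ell(\theta) = \sum_i \left( y^{(i)} - g_\theta(x^{(i)}) \right) x^{(i)}$; the step size and iteration count are arbitrary, and the gradient is averaged over examples only to keep the step size stable:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    # Gradient ascent on l(theta); grad = (1/n) * sum_i (y_i - g_theta(x_i)) x_i
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)
    return theta
```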
The Exponential Family
Definition

A probability distribution (with natural parameter $\eta$) whose density (or mass function) can be written in the following form:

$p(y; \eta) = b(y) \exp\left( \eta^T T(y) - a(\eta) \right)$

Example

Bernoulli distribution:

$p(y; \phi) = \phi^y (1 - \phi)^{1-y} = \exp\left( \left( \log\frac{\phi}{1 - \phi} \right) y + \log(1 - \phi) \right)$

$\Longrightarrow\ b(y) = 1, \quad T(y) = y, \quad \eta = \log\frac{\phi}{1 - \phi}, \quad a(\eta) = \log(1 + e^\eta)$
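The last identification uses one inversion left implicit on the slide: solving $\eta = \log\frac{\phi}{1-\phi}$ for $\phi$ gives $\phi = \frac{1}{1 + e^{-\eta}}$, and hence $a(\eta) = -\log(1 - \phi) = \log(1 + e^\eta)$.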
The Exponential Family (Continued)
More Examples

Categorical distribution, Poisson distribution, (multivariate) normal distribution, etc.

Properties (in Pset 1)

$\mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta)$

$\mathrm{Var}(T(Y); \eta) = \nabla_\eta^2 a(\eta)$

Non-exponential-Family Distribution

Uniform distribution over an interval $[a, b]$:

$p(y; a, b) = \frac{1}{b - a} \cdot 1\{a \le y \le b\}$

Reason: $b(y)$ cannot depend on the parameter $\eta$, but here the indicator (the support) depends on $(a, b)$, so it cannot be absorbed into $b(y)$.
The Generalized Linear Model (GLM)
Components

Assumed model: $y \mid x; \theta \sim \mathrm{ExponentialFamily}(\eta)$ with $\eta = \theta^T x$

Predictor: $h(x) = \mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta)$

Fitting through maximum likelihood:

$\max_\theta \ell(\theta) = \max_\theta \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \eta)$
Examples
GLM under Bernoulli distribution: Logistic regression
GLM under Poisson distribution: Poisson regression (in Pset1)
GLM under Normal distribution: Linear regression
GLM under Categorical distribution: Softmax regression
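To make the GLM recipe concrete, here is a hedged sketch of Poisson regression with the canonical response $h_\theta(x) = e^{\theta^T x}$, fitted by gradient ascent on the log-likelihood; this illustrates the idea only and is not the Pset 1 solution:

```python
import numpy as np

def fit_poisson_regression(X, y, alpha=0.01, iters=2000):
    """GLM with Poisson response: h_theta(x) = exp(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Average log-likelihood gradient: (1/n) sum_i (y_i - exp(theta^T x_i)) x_i
        theta += alpha * X.T @ (y - np.exp(X @ theta)) / len(y)
    return theta
```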
Supervised Learning: Generative Algorithms
Gaussian Discriminant Analysis (GDA)
Generative Algorithm for Classification
Learn $p(x \mid y)$ and $p(y)$

Classify through Bayes' rule: $\arg\max_y p(y \mid x) = \arg\max_y p(x \mid y)\, p(y)$

GDA Formulation

Assume $x \mid y \sim \mathcal{N}(\mu_y, \Sigma)$ for some $\mu_y \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$

Estimate $\mu_y$, $\Sigma$ and $p(y)$ through maximum likelihood, i.e.

$\max \sum_{i=1}^n \left[ \log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)}) \right],$

which gives

$p(y) = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}}{n}, \quad \mu_y = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}\, x^{(i)}}{\sum_{i=1}^n 1\{y^{(i)} = y\}}, \quad \Sigma = \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
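A compact NumPy sketch of these closed-form estimators and the Bayes-rule classifier, assuming binary labels $y \in \{0, 1\}$ and the shared covariance above:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with shared covariance; y in {0, 1}."""
    phi = y.mean()                                  # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mus = np.where(y[:, None] == 1, mu1, mu0)       # mu_{y^{(i)}} for each example
    diff = X - mus
    sigma = diff.T @ diff / len(y)                  # shared covariance
    return phi, mu0, mu1, sigma

def predict_gda(x, phi, mu0, mu1, sigma):
    """argmax_y log p(x | y) + log p(y); the (2*pi)^d normalizer cancels."""
    inv = np.linalg.inv(sigma)
    def score(mu, prior):
        diff = x - mu
        return -0.5 * diff @ inv @ diff + np.log(prior)
    return int(score(mu1, phi) > score(mu0, 1 - phi))
```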
Naive Bayes
Formulation

Assume $p(x \mid y) = \prod_{j=1}^d p(x_j \mid y)$

Estimate $p(x_j \mid y)$ and $p(y)$ through maximum likelihood, which gives

$p(x_j \mid y) = \frac{\sum_{i=1}^n 1\{x_j^{(i)} = x_j,\, y^{(i)} = y\}}{\sum_{i=1}^n 1\{y^{(i)} = y\}}, \quad p(y) = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}}{n}$

Laplace Smoothing

Assume $x_j$ takes values in $\{1, 2, \ldots, k\}$; the corresponding modified estimator is

$p(x_j \mid y) = \frac{1 + \sum_{i=1}^n 1\{x_j^{(i)} = x_j,\, y^{(i)} = y\}}{k + \sum_{i=1}^n 1\{y^{(i)} = y\}}$
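A sketch of the smoothed estimators for categorical features, assuming the feature values are 0-indexed as $\{0, \ldots, k-1\}$ purely for array convenience:

```python
import numpy as np

def fit_naive_bayes(X, y, k, num_classes=2):
    """X: (n, d) integer features in {0,...,k-1}; y: (n,) labels in {0,...,num_classes-1}."""
    n, d = X.shape
    prior = np.array([(y == c).mean() for c in range(num_classes)])
    # cond[c, j, v] = p(x_j = v | y = c) with Laplace smoothing
    cond = np.zeros((num_classes, d, k))
    for c in range(num_classes):
        Xc = X[y == c]
        for j in range(d):
            counts = np.bincount(Xc[:, j], minlength=k)
            cond[c, j] = (1 + counts) / (k + len(Xc))
    return prior, cond

def predict_naive_bayes(x, prior, cond):
    # argmax_c log p(y = c) + sum_j log p(x_j | y = c)
    d = len(x)
    scores = np.log(prior) + np.array(
        [sum(np.log(cond[c, j, x[j]]) for j in range(d)) for c in range(len(prior))])
    return int(np.argmax(scores))
```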
Supervised Learning: Kernel and SVM
Kernel
Motivation

Feature map: $\phi: \mathbb{R}^d \to \mathbb{R}^p$

Fitting a linear model with gradient descent gives $\theta = \sum_{i=1}^n \beta_i \phi(x^{(i)})$

Predicting a new example $z$: $h_\theta(z) = \sum_{i=1}^n \beta_i \phi(x^{(i)})^T \phi(z) = \sum_{i=1}^n \beta_i K(x^{(i)}, z)$

This brings in nonlinearity without much sacrifice in efficiency, as long as $K(\cdot, \cdot)$ can be computed efficiently.

Definition

$K(x, z): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a valid kernel if there exists $\phi: \mathbb{R}^d \to \mathbb{R}^p$ for some $p \ge 1$ such that

$K(x, z) = \phi(x)^T \phi(z)$
Kernel (Continued)
Examples

Polynomial kernels: $K(x, z) = \left( x^T z + c \right)^d$, $\forall\, c \ge 0$ and $d \in \mathbb{N}$

Gaussian kernels: $K(x, z) = \exp\left( -\frac{\|x - z\|_2^2}{2\sigma^2} \right)$, $\forall\, \sigma^2 > 0$

More in Pset 2...

Theorem

$K(x, z)$ is a valid kernel if and only if for any set $\{x^{(1)}, \ldots, x^{(n)}\}$, its Gram matrix, defined as

$G = \begin{bmatrix} K(x^{(1)}, x^{(1)}) & \cdots & K(x^{(1)}, x^{(n)}) \\ \vdots & \ddots & \vdots \\ K(x^{(n)}, x^{(1)}) & \cdots & K(x^{(n)}, x^{(n)}) \end{bmatrix} \in \mathbb{R}^{n \times n},$

is positive semi-definite.
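A quick numerical illustration of the theorem: build the Gram matrix of a Gaussian kernel on random points and check that its eigenvalues are non-negative up to round-off (the tolerance is arbitrary):

```python
import numpy as np

def gaussian_kernel(x, z, sigma2=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))
G = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])

eigvals = np.linalg.eigvalsh(G)        # G is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)         # True: the Gram matrix is PSD
```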
SVM
Formulation ($y \in \{-1, 1\}$)

$\min_{w, b} \ \frac{1}{2} \|w\|_2^2$ subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1$, $\forall\, i \in \{1, \ldots, n\}$   (Hard-SVM)

$\min_{w, b, \xi} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \xi_i$ subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, $\forall\, i \in \{1, \ldots, n\}$   (Soft-SVM)

Properties

The optimal solution has the form $w^\star = \sum_{i=1}^n \alpha_i y^{(i)} x^{(i)}$ and thus can be kernelized.

The soft-SVM can be treated as a minimization of the hinge loss plus $\ell_2$ regularization:

$\min_{w, b} \ \sum_{i=1}^n \max\left\{ 0,\ 1 - y^{(i)}(w^T x^{(i)} + b) \right\} + \lambda \|w\|_2^2$
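A rough subgradient-descent sketch of the hinge-loss view, assuming labels in $\{-1, +1\}$; $\lambda$, the step size, and the iteration count are arbitrary, and in practice the dual QP would usually be solved instead:

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.1, alpha=0.01, iters=2000):
    """Subgradient descent on sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                    # examples with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= alpha * grad_w / n
        b -= alpha * grad_b / n
    return w, b
```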
Neural Networks
Model Formulation
Multi-layer Fully-connected Neural Networks (with Activation Function $f$)

$a^{[1]} = f\left( W^{[1]} x + b^{[1]} \right)$

$a^{[2]} = f\left( W^{[2]} a^{[1]} + b^{[2]} \right)$

$\cdots$

$a^{[r-1]} = f\left( W^{[r-1]} a^{[r-2]} + b^{[r-1]} \right)$

$h_\theta(x) = a^{[r]} = W^{[r]} a^{[r-1]} + b^{[r]}$

Possible Activation Functions

ReLU: $f(z) = \mathrm{ReLU}(z) := \max\{z, 0\}$

Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$

Hyperbolic tangent: $f(z) = \tanh(z) := \frac{e^z - e^{-z}}{e^z + e^{-z}}$
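A forward-pass sketch of this architecture with ReLU activations, assuming the weights are stored as a list of $(W, b)$ pairs (a convention chosen only for this sketch):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def forward(x, layers):
    """layers = [(W1, b1), ..., (Wr, br)]; ReLU on all but the last layer."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)
    W, b = layers[-1]
    return W @ a + b        # h_theta(x) = a^{[r]}, no activation on the output
```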
Backpropagation
Let $J$ be the loss function and $z^{[k]} = W^{[k]} a^{[k-1]} + b^{[k]}$. By the chain rule, we have

$\frac{\partial J}{\partial W_{ij}^{[r]}} = \frac{\partial J}{\partial z_i^{[r]}} \frac{\partial z_i^{[r]}}{\partial W_{ij}^{[r]}} = \frac{\partial J}{\partial z_i^{[r]}} a_j^{[r-1]} \ \Longrightarrow\ \frac{\partial J}{\partial W^{[r]}} = \frac{\partial J}{\partial z^{[r]}} a^{[r-1]T}, \quad \frac{\partial J}{\partial b^{[r]}} = \frac{\partial J}{\partial z^{[r]}}$

$\frac{\partial J}{\partial a_i^{[r-1]}} = \sum_{j=1}^{d_r} \frac{\partial J}{\partial z_j^{[r]}} \frac{\partial z_j^{[r]}}{\partial a_i^{[r-1]}} = \sum_{j=1}^{d_r} \frac{\partial J}{\partial z_j^{[r]}} W_{ji}^{[r]} \ \Longrightarrow\ \frac{\partial J}{\partial a^{[r-1]}} = W^{[r]T} \frac{\partial J}{\partial z^{[r]}}$

$\frac{\partial J}{\partial z^{[r]}} := \delta^{[r]} \ \Longrightarrow\ \frac{\partial J}{\partial z^{[r-1]}} = \left( W^{[r]T} \delta^{[r]} \right) \odot f'\left( z^{[r-1]} \right) := \delta^{[r-1]}$

$\Longrightarrow\ \frac{\partial J}{\partial W^{[r-1]}} = \delta^{[r-1]} a^{[r-2]T}, \quad \frac{\partial J}{\partial b^{[r-1]}} = \delta^{[r-1]}$

Continue in the same way for layers $r-2, \ldots, 1$.
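A sketch of these recursions, assuming ReLU activations, a linear output layer, and squared loss $J = \frac{1}{2}\|a^{[r]} - y\|_2^2$ (so $\delta^{[r]} = a^{[r]} - y$):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def backprop(x, y, layers):
    """layers = [(W1, b1), ..., (Wr, br)]; returns (dJ/dW, dJ/db) for each layer."""
    # Forward pass, caching pre-activations z^{[k]} and activations a^{[k]}.
    a, activations, zs = x, [x], []
    for k, (W, b) in enumerate(layers):
        z = W @ a + b
        zs.append(z)
        a = z if k == len(layers) - 1 else relu(z)     # linear output layer
        activations.append(a)

    # Backward pass: delta^{[r]} = dJ/dz^{[r]} = a^{[r]} - y for squared loss.
    delta = activations[-1] - y
    grads = []
    for k in reversed(range(len(layers))):
        W, _ = layers[k]
        grads.append((np.outer(delta, activations[k]), delta))   # dJ/dW, dJ/db
        if k > 0:
            delta = (W.T @ delta) * (zs[k - 1] > 0)    # ReLU'(z) = 1{z > 0}
    return grads[::-1]
```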
Unsupervised Learning
k-means
Algorithm 1: k-means

Input: Training data $\{x^{(1)}, \ldots, x^{(n)}\}$; number of clusters $k$
1  Initialize $c^{(1)}, \ldots, c^{(k)} \in \mathbb{R}^d$ as cluster centers
2  while not converged do
3      Assign each $x^{(i)}$ to its closest cluster center $c^{(j)}$
4      Take the mean of each cluster as the new cluster center
5  end

Property

k-means approximately minimizes the following loss function:

$\min_{\{c^{(1)}, \ldots, c^{(k)}\}} \ \sum_{i=1}^n \left\| x^{(i)} - c^{(j(i))} \right\|_2^2, \quad \text{where } j(i) = \arg\min_{j' \in \{1, \ldots, k\}} \left\| x^{(i)} - c^{(j')} \right\|_2^2$

However, it is not guaranteed to find the global minimum.
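A compact NumPy sketch of Algorithm 1, with centers initialized from the data points; ties and empty clusters are ignored for brevity:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init from data
    for _ in range(iters):
        # Assignment step: index of the closest center for each point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```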
Gaussian Mixture Model (GMM)
Formulation

Assume each data point $x^{(i)}$ is generated independently through the following procedure:

1. Sample $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, where $\sum_{j=1}^k \phi_j = 1$
2. Sample $x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})$

How do we estimate the parameters $\phi$, $\{\mu_1, \ldots, \mu_k\}$ and $\{\Sigma_1, \ldots, \Sigma_k\}$ if $z^{(i)}$ cannot be observed?

Maximum Likelihood

$\ell(\theta) = \sum_{i=1}^n \log\left( \sum_{j=1}^k \phi_j\, p(x^{(i)}; \mu_j, \Sigma_j) \right),$

where

$p(x^{(i)}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right)$

This is too complicated to optimize directly!
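The generative procedure above, as a short sampling sketch (the parameters are placeholders supplied by the caller):

```python
import numpy as np

def sample_gmm(n, phi, mus, sigmas, seed=0):
    """Sample n points: z ~ Multinomial(phi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phi), size=n, p=phi)
    x = np.array([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])
    return x, z
```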
Expectation-Maximization (EM)
Jensen's Inequality

By Jensen's inequality, for any distribution $Q_i$ over $\{1, \ldots, k\}$, we have

$\ell(\theta) = \sum_{i=1}^n \log\left( \sum_{j=1}^k Q_i(j)\, \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)} \right) \ \ge\ \sum_{i=1}^n \sum_{j=1}^k Q_i(j) \log \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)} := \mathrm{ELBO}(\theta)$

Theorem

If we take $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ and let $\theta^{(t+1)} := \arg\max_\theta \mathrm{ELBO}(\theta)$, then $\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$ (proved in lecture).

Algorithm 2: EM Algorithm

Input: Training data $\{x^{(1)}, \ldots, x^{(n)}\}$
1  Initialize $\theta^{(0)}$ by some random guess
2  for $t = 0, 1, 2, \ldots$ do
3      Set $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ for each $i, j$   // E-step
4      Set $\theta^{(t+1)} = \arg\max_\theta \mathrm{ELBO}(\theta)$   // M-step
5  end
EM in GMM
Posterior of $z^{(i)}$

$p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)}) = \frac{\phi_j^{(t)}\, p(x^{(i)}; \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{j'=1}^k \phi_{j'}^{(t)}\, p(x^{(i)}; \mu_{j'}^{(t)}, \Sigma_{j'}^{(t)})}$

GMM Update Rules

Defining $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, we have

$\phi_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)}}{n}, \quad \mu_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} x^{(i)}}{\sum_{i=1}^n w_j^{(i)}}, \quad \forall\, j \in \{1, \ldots, k\}$

$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} (x^{(i)} - \mu_j^{(t+1)})(x^{(i)} - \mu_j^{(t+1)})^T}{\sum_{i=1}^n w_j^{(i)}}, \quad \forall\, j \in \{1, \ldots, k\}$
Practice Exam Problem (If time permits)