Midterm Review (CS 229 / STATS 229)
Zhihan Xiong
Stanford University
15th May, 2020
Outline
1. Supervised Learning
   - Discriminative Algorithms
   - Generative Algorithms
   - Kernel and SVM
2. Neural Networks
3. Unsupervised Learning
4. Practice Exam Problem (If time permits)
Supervised Learning: Discriminative Algorithms
Optimization Methods
Gradient and Hessian (differentiable function $f: \mathbb{R}^d \to \mathbb{R}$):

$\nabla_x f = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_d} \end{bmatrix}^T \in \mathbb{R}^d$   (Gradient)

$\nabla_x^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix} \in \mathbb{R}^{d \times d}$   (Hessian)
Gradient Descent and Newton's Method (objective function $J(\theta)$):

$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_\theta J(\theta^{(t)})$   (Gradient descent)

$\theta^{(t+1)} = \theta^{(t)} - \left[ \nabla_\theta^2 J(\theta^{(t)}) \right]^{-1} \nabla_\theta J(\theta^{(t)})$   (Newton's method)
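A minimal NumPy sketch of the two update rules; the quadratic objective, step size, and iteration counts are arbitrary stand-ins, not part of the slides:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.01, iters=500):
    # theta^{(t+1)} = theta^{(t)} - alpha * grad(theta^{(t)})
    theta = theta0.copy()
    for _ in range(iters):
        theta -= alpha * grad(theta)
    return theta

def newtons_method(grad, hess, theta0, iters=10):
    # theta^{(t+1)} = theta^{(t)} - [hess(theta^{(t)})]^{-1} grad(theta^{(t)})
    theta = theta0.copy()
    for _ in range(iters):
        theta -= np.linalg.solve(hess(theta), grad(theta))
    return theta

# Example: quadratic objective J(theta) = 0.5 * ||X theta - y||^2
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
grad = lambda th: X.T @ (X @ th - y)
hess = lambda th: X.T @ X
theta_gd = gradient_descent(grad, np.zeros(3))
theta_nt = newtons_method(grad, hess, np.zeros(3))  # exact after one step here
```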
Least Squares: Gradient Descent
Model: $h_\theta(x) = \theta^T x$

Training data: $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$, with $x^{(i)} \in \mathbb{R}^d$

Loss: $J(\theta) = \frac{1}{2} \sum_{i=1}^n \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

Update rule:

$\theta^{(t+1)} = \theta^{(t)} - \alpha \sum_{i=1}^n \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

Stochastic Gradient Descent (SGD)

Pick one data point $x^{(i)}$ and then update:

$\theta^{(t+1)} = \theta^{(t)} - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
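A short NumPy sketch of the batch and stochastic updates on synthetic data; the data, learning rate, and iteration counts are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))                       # rows are x^{(i)}
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Batch gradient descent on J(theta) = 0.5 * sum_i (theta^T x_i - y_i)^2
theta = np.zeros(d)
alpha = 0.005
for _ in range(200):
    theta -= alpha * X.T @ (X @ theta - y)

# Stochastic gradient descent: one randomly chosen example per update
theta_sgd = np.zeros(d)
for _ in range(2000):
    i = rng.integers(n)
    theta_sgd -= alpha * (X[i] @ theta_sgd - y[i]) * X[i]
```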
Least Squares: Closed Form
Loss in matrix form: $J(\theta) = \frac{1}{2} \| X\theta - y \|_2^2$, where $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$

Normal equation (set the gradient to 0):

$X^T (X\theta^\star - y) = 0$

Closed-form solution:

$\theta^\star = \left( X^T X \right)^{-1} X^T y$
Connection to Newton's Method

Because $J$ is quadratic (second-order), Newton's method is exact: a single step from any $\theta^{(t)}$,

$\theta^\star = \theta^{(t)} - \left[ \nabla_\theta^2 J \right]^{-1} \nabla_\theta J(\theta^{(t)}),$

lands on the closed-form solution above.
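A sketch of the closed form, assuming NumPy; `np.linalg.solve` is applied to the normal equation rather than forming the explicit inverse:

```python
import numpy as np

def least_squares_closed_form(X, y):
    # Solve the normal equation X^T X theta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)
```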
Logistic Regression
A binary classification model with $y^{(i)} \in \{0, 1\}$.

Assumed model:

$p(y \mid x; \theta) = \begin{cases} g_\theta(x) & \text{if } y = 1 \\ 1 - g_\theta(x) & \text{if } y = 0 \end{cases}$, where $g_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

Log-likelihood function:

$\ell(\theta) = \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^n \left[ y^{(i)} \log g_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - g_\theta(x^{(i)}) \right) \right]$

Fit the parameters by maximizing the log-likelihood, $\max_\theta \ell(\theta)$ (done in Pset 1).
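A minimal sketch of fitting logistic regression by gradient ascent, assuming the standard gradient $\nabla_\theta \ell(\theta) = \sum_i \left( y^{(i)} - g_\theta(x^{(i)}) \right) x^{(i)}$; the step size and iteration count are arbitrary, and the gradient is averaged over examples only to keep the step size stable:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    # Gradient ascent on l(theta); grad = (1/n) * sum_i (y_i - g_theta(x_i)) x_i
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - sigmoid(X @ theta)) / len(y)
    return theta
```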
The Exponential Family
Definition

A probability distribution (with natural parameter $\eta$) whose density (or mass function) can be written in the following form:

$p(y; \eta) = b(y) \exp\left( \eta^T T(y) - a(\eta) \right)$

Example

Bernoulli distribution:

$p(y; \phi) = \phi^y (1 - \phi)^{1-y} = \exp\left( \left( \log\frac{\phi}{1 - \phi} \right) y + \log(1 - \phi) \right)$

$\Longrightarrow\ b(y) = 1, \quad T(y) = y, \quad \eta = \log\frac{\phi}{1 - \phi}, \quad a(\eta) = \log(1 + e^\eta)$
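The last identification uses one inversion left implicit on the slide: solving $\eta = \log\frac{\phi}{1-\phi}$ for $\phi$ gives $\phi = \frac{1}{1 + e^{-\eta}}$, and hence $a(\eta) = -\log(1 - \phi) = \log(1 + e^\eta)$.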
The Exponential Family (Continued)
More Examples

Categorical distribution, Poisson distribution, (multivariate) normal distribution, etc.

Properties (in Pset 1)

$\mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta)$

$\mathrm{Var}(T(Y); \eta) = \nabla_\eta^2 a(\eta)$

Non-exponential-Family Distribution

Uniform distribution over an interval $[a, b]$:

$p(y; a, b) = \frac{1}{b - a} \cdot 1\{a \le y \le b\}$

Reason: $b(y)$ cannot depend on the parameter $\eta$, but here the indicator (the support) depends on $(a, b)$, so it cannot be absorbed into $b(y)$.
The Generalized Linear Model (GLM)
Components

Assumed model: $y \mid x; \theta \sim \mathrm{ExponentialFamily}(\eta)$ with $\eta = \theta^T x$

Predictor: $h(x) = \mathbb{E}[T(Y); \eta] = \nabla_\eta a(\eta)$

Fitting through maximum likelihood:

$\max_\theta \ell(\theta) = \max_\theta \sum_{i=1}^n \log p(y^{(i)} \mid x^{(i)}; \eta)$
Examples
GLM under Bernoulli distribution: Logistic regression
GLM under Poisson distribution: Poisson regression (in Pset1)
GLM under Normal distribution: Linear regression
GLM under Categorical distribution: Softmax regression
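To make the GLM recipe concrete, here is a hedged sketch of Poisson regression with the canonical response $h_\theta(x) = e^{\theta^T x}$, fitted by gradient ascent on the log-likelihood; this illustrates the idea only and is not the Pset 1 solution:

```python
import numpy as np

def fit_poisson_regression(X, y, alpha=0.01, iters=2000):
    """GLM with Poisson response: h_theta(x) = exp(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # Average log-likelihood gradient: (1/n) sum_i (y_i - exp(theta^T x_i)) x_i
        theta += alpha * X.T @ (y - np.exp(X @ theta)) / len(y)
    return theta
```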
Supervised Learning: Generative Algorithms
Gaussian Discriminant Analysis (GDA)
Generative Algorithm for Classification
Learn $p(x \mid y)$ and $p(y)$

Classify through Bayes' rule: $\arg\max_y p(y \mid x) = \arg\max_y p(x \mid y)\, p(y)$

GDA Formulation

Assume $x \mid y \sim \mathcal{N}(\mu_y, \Sigma)$ for some $\mu_y \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$

Estimate $\mu_y$, $\Sigma$ and $p(y)$ through maximum likelihood, i.e.

$\max \sum_{i=1}^n \left[ \log p(x^{(i)} \mid y^{(i)}) + \log p(y^{(i)}) \right],$

which gives

$p(y) = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}}{n}, \quad \mu_y = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}\, x^{(i)}}{\sum_{i=1}^n 1\{y^{(i)} = y\}}, \quad \Sigma = \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
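A compact NumPy sketch of these closed-form estimators and the Bayes-rule classifier, assuming binary labels $y \in \{0, 1\}$ and the shared covariance above:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with shared covariance; y in {0, 1}."""
    phi = y.mean()                                  # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    mus = np.where(y[:, None] == 1, mu1, mu0)       # mu_{y^{(i)}} for each example
    diff = X - mus
    sigma = diff.T @ diff / len(y)                  # shared covariance
    return phi, mu0, mu1, sigma

def predict_gda(x, phi, mu0, mu1, sigma):
    """argmax_y log p(x | y) + log p(y); the (2*pi)^d normalizer cancels."""
    inv = np.linalg.inv(sigma)
    def score(mu, prior):
        diff = x - mu
        return -0.5 * diff @ inv @ diff + np.log(prior)
    return int(score(mu1, phi) > score(mu0, 1 - phi))
```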
Naive Bayes
Formulation

Assume $p(x \mid y) = \prod_{j=1}^d p(x_j \mid y)$

Estimate $p(x_j \mid y)$ and $p(y)$ through maximum likelihood, which gives

$p(x_j \mid y) = \frac{\sum_{i=1}^n 1\{x_j^{(i)} = x_j,\, y^{(i)} = y\}}{\sum_{i=1}^n 1\{y^{(i)} = y\}}, \quad p(y) = \frac{\sum_{i=1}^n 1\{y^{(i)} = y\}}{n}$

Laplace Smoothing

Assume $x_j$ takes values in $\{1, 2, \ldots, k\}$; the corresponding modified estimator is

$p(x_j \mid y) = \frac{1 + \sum_{i=1}^n 1\{x_j^{(i)} = x_j,\, y^{(i)} = y\}}{k + \sum_{i=1}^n 1\{y^{(i)} = y\}}$
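A sketch of the smoothed estimators for categorical features, assuming the feature values are 0-indexed as $\{0, \ldots, k-1\}$ purely for array convenience:

```python
import numpy as np

def fit_naive_bayes(X, y, k, num_classes=2):
    """X: (n, d) integer features in {0,...,k-1}; y: (n,) labels in {0,...,num_classes-1}."""
    n, d = X.shape
    prior = np.array([(y == c).mean() for c in range(num_classes)])
    # cond[c, j, v] = p(x_j = v | y = c) with Laplace smoothing
    cond = np.zeros((num_classes, d, k))
    for c in range(num_classes):
        Xc = X[y == c]
        for j in range(d):
            counts = np.bincount(Xc[:, j], minlength=k)
            cond[c, j] = (1 + counts) / (k + len(Xc))
    return prior, cond

def predict_naive_bayes(x, prior, cond):
    # argmax_c log p(y = c) + sum_j log p(x_j | y = c)
    d = len(x)
    scores = np.log(prior) + np.array(
        [sum(np.log(cond[c, j, x[j]]) for j in range(d)) for c in range(len(prior))])
    return int(np.argmax(scores))
```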
Supervised Learning: Kernel and SVM
Kernel
Motivation

Feature map: $\phi: \mathbb{R}^d \to \mathbb{R}^p$

Fitting a linear model with gradient descent gives $\theta = \sum_{i=1}^n \beta_i \phi(x^{(i)})$

Predicting a new example $z$: $h_\theta(z) = \sum_{i=1}^n \beta_i \phi(x^{(i)})^T \phi(z) = \sum_{i=1}^n \beta_i K(x^{(i)}, z)$

This brings in nonlinearity without much sacrifice in efficiency, as long as $K(\cdot, \cdot)$ can be computed efficiently.

Definition

$K(x, z): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a valid kernel if there exists $\phi: \mathbb{R}^d \to \mathbb{R}^p$ for some $p \ge 1$ such that

$K(x, z) = \phi(x)^T \phi(z)$
Kernel (Continued)
Examples

Polynomial kernels: $K(x, z) = \left( x^T z + c \right)^d$, $\forall\, c \ge 0$ and $d \in \mathbb{N}$

Gaussian kernels: $K(x, z) = \exp\left( -\frac{\|x - z\|_2^2}{2\sigma^2} \right)$, $\forall\, \sigma^2 > 0$

More in Pset 2...

Theorem

$K(x, z)$ is a valid kernel if and only if for any set $\{x^{(1)}, \ldots, x^{(n)}\}$, its Gram matrix, defined as

$G = \begin{bmatrix} K(x^{(1)}, x^{(1)}) & \cdots & K(x^{(1)}, x^{(n)}) \\ \vdots & \ddots & \vdots \\ K(x^{(n)}, x^{(1)}) & \cdots & K(x^{(n)}, x^{(n)}) \end{bmatrix} \in \mathbb{R}^{n \times n},$

is positive semi-definite.
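A quick numerical illustration of the theorem: build the Gram matrix of a Gaussian kernel on random points and check that its eigenvalues are non-negative up to round-off (the tolerance is arbitrary):

```python
import numpy as np

def gaussian_kernel(x, z, sigma2=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))
G = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])

eigvals = np.linalg.eigvalsh(G)        # G is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)         # True: the Gram matrix is PSD
```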
SVM
Formulation ($y \in \{-1, 1\}$)

$\min_{w, b} \ \frac{1}{2} \|w\|_2^2$ subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1$, $\forall\, i \in \{1, \ldots, n\}$   (Hard-SVM)

$\min_{w, b, \xi} \ \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \xi_i$ subject to $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, $\forall\, i \in \{1, \ldots, n\}$   (Soft-SVM)

Properties

The optimal solution has the form $w^\star = \sum_{i=1}^n \alpha_i y^{(i)} x^{(i)}$ and thus can be kernelized.

The soft-SVM can be treated as a minimization of the hinge loss plus $\ell_2$ regularization:

$\min_{w, b} \ \sum_{i=1}^n \max\left\{ 0,\ 1 - y^{(i)}(w^T x^{(i)} + b) \right\} + \lambda \|w\|_2^2$
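A rough subgradient-descent sketch of the hinge-loss view, assuming labels in $\{-1, +1\}$; $\lambda$, the step size, and the iteration count are arbitrary, and in practice the dual QP would usually be solved instead:

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.1, alpha=0.01, iters=2000):
    """Subgradient descent on sum_i max(0, 1 - y_i (w^T x_i + b)) + lam * ||w||^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                    # examples with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= alpha * grad_w / n
        b -= alpha * grad_b / n
    return w, b
```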
Neural Networks
Model Formulation
Multi-layer Fully-connected Neural Networks (with Activation Function $f$)

$a^{[1]} = f\left( W^{[1]} x + b^{[1]} \right)$

$a^{[2]} = f\left( W^{[2]} a^{[1]} + b^{[2]} \right)$

$\cdots$

$a^{[r-1]} = f\left( W^{[r-1]} a^{[r-2]} + b^{[r-1]} \right)$

$h_\theta(x) = a^{[r]} = W^{[r]} a^{[r-1]} + b^{[r]}$

Possible Activation Functions

ReLU: $f(z) = \mathrm{ReLU}(z) := \max\{z, 0\}$

Sigmoid: $f(z) = \frac{1}{1 + e^{-z}}$

Hyperbolic tangent: $f(z) = \tanh(z) := \frac{e^z - e^{-z}}{e^z + e^{-z}}$
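A forward-pass sketch of this architecture with ReLU activations, assuming the weights are stored as a list of $(W, b)$ pairs (a convention chosen only for this sketch):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def forward(x, layers):
    """layers = [(W1, b1), ..., (Wr, br)]; ReLU on all but the last layer."""
    a = x
    for W, b in layers[:-1]:
        a = relu(W @ a + b)
    W, b = layers[-1]
    return W @ a + b        # h_theta(x) = a^{[r]}, no activation on the output
```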
Backpropagation
Let $J$ be the loss function and $z^{[k]} = W^{[k]} a^{[k-1]} + b^{[k]}$. By the chain rule, we have

$\frac{\partial J}{\partial W_{ij}^{[r]}} = \frac{\partial J}{\partial z_i^{[r]}} \frac{\partial z_i^{[r]}}{\partial W_{ij}^{[r]}} = \frac{\partial J}{\partial z_i^{[r]}} a_j^{[r-1]} \ \Longrightarrow\ \frac{\partial J}{\partial W^{[r]}} = \frac{\partial J}{\partial z^{[r]}} a^{[r-1]T}, \quad \frac{\partial J}{\partial b^{[r]}} = \frac{\partial J}{\partial z^{[r]}}$

$\frac{\partial J}{\partial a_i^{[r-1]}} = \sum_{j=1}^{d_r} \frac{\partial J}{\partial z_j^{[r]}} \frac{\partial z_j^{[r]}}{\partial a_i^{[r-1]}} = \sum_{j=1}^{d_r} \frac{\partial J}{\partial z_j^{[r]}} W_{ji}^{[r]} \ \Longrightarrow\ \frac{\partial J}{\partial a^{[r-1]}} = W^{[r]T} \frac{\partial J}{\partial z^{[r]}}$

$\frac{\partial J}{\partial z^{[r]}} := \delta^{[r]} \ \Longrightarrow\ \frac{\partial J}{\partial z^{[r-1]}} = \left( W^{[r]T} \delta^{[r]} \right) \odot f'\left( z^{[r-1]} \right) := \delta^{[r-1]}$

$\Longrightarrow\ \frac{\partial J}{\partial W^{[r-1]}} = \delta^{[r-1]} a^{[r-2]T}, \quad \frac{\partial J}{\partial b^{[r-1]}} = \delta^{[r-1]}$

Continue in the same way for layers $r-2, \ldots, 1$.
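A sketch of these recursions, assuming ReLU activations, a linear output layer, and squared loss $J = \frac{1}{2}\|a^{[r]} - y\|_2^2$ (so $\delta^{[r]} = a^{[r]} - y$):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def backprop(x, y, layers):
    """layers = [(W1, b1), ..., (Wr, br)]; returns (dJ/dW, dJ/db) for each layer."""
    # Forward pass, caching pre-activations z^{[k]} and activations a^{[k]}.
    a, activations, zs = x, [x], []
    for k, (W, b) in enumerate(layers):
        z = W @ a + b
        zs.append(z)
        a = z if k == len(layers) - 1 else relu(z)     # linear output layer
        activations.append(a)

    # Backward pass: delta^{[r]} = dJ/dz^{[r]} = a^{[r]} - y for squared loss.
    delta = activations[-1] - y
    grads = []
    for k in reversed(range(len(layers))):
        W, _ = layers[k]
        grads.append((np.outer(delta, activations[k]), delta))   # dJ/dW, dJ/db
        if k > 0:
            delta = (W.T @ delta) * (zs[k - 1] > 0)    # ReLU'(z) = 1{z > 0}
    return grads[::-1]
```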
Unsupervised Learning
k-means
Algorithm 1: k-means

Input: Training data $\{x^{(1)}, \ldots, x^{(n)}\}$; number of clusters $k$
1  Initialize $c^{(1)}, \ldots, c^{(k)} \in \mathbb{R}^d$ as cluster centers
2  while not converged do
3      Assign each $x^{(i)}$ to its closest cluster center $c^{(j)}$
4      Take the mean of each cluster as the new cluster center
5  end

Property

k-means approximately minimizes the following loss function:

$\min_{\{c^{(1)}, \ldots, c^{(k)}\}} \ \sum_{i=1}^n \left\| x^{(i)} - c^{(j(i))} \right\|_2^2, \quad \text{where } j(i) = \arg\min_{j' \in \{1, \ldots, k\}} \left\| x^{(i)} - c^{(j')} \right\|_2^2$

However, it is not guaranteed to find the global minimum.
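A compact NumPy sketch of Algorithm 1, with centers initialized from the data points; ties and empty clusters are ignored for brevity:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init from data
    for _ in range(iters):
        # Assignment step: index of the closest center for each point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster.
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```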
Gaussian Mixture Model (GMM)
Formulation

Assume each data point $x^{(i)}$ is generated independently through the following procedure:

1. Sample $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, where $\sum_{j=1}^k \phi_j = 1$
2. Sample $x^{(i)} \sim \mathcal{N}(\mu_{z^{(i)}}, \Sigma_{z^{(i)}})$

How do we estimate the parameters $\phi$, $\{\mu_1, \ldots, \mu_k\}$ and $\{\Sigma_1, \ldots, \Sigma_k\}$ if $z^{(i)}$ cannot be observed?

Maximum Likelihood

$\ell(\theta) = \sum_{i=1}^n \log\left( \sum_{j=1}^k \phi_j\, p(x^{(i)}; \mu_j, \Sigma_j) \right),$

where

$p(x^{(i)}; \mu_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_j|}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right)$

This is too complicated to optimize directly!
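The generative procedure above, as a short sampling sketch (the parameters are placeholders supplied by the caller):

```python
import numpy as np

def sample_gmm(n, phi, mus, sigmas, seed=0):
    """Sample n points: z ~ Multinomial(phi), then x ~ N(mu_z, Sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phi), size=n, p=phi)
    x = np.array([rng.multivariate_normal(mus[j], sigmas[j]) for j in z])
    return x, z
```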
Expectation-Maximization (EM)
Jensen's Inequality

By Jensen's inequality, for any distribution $Q_i$ over $\{1, \ldots, k\}$, we have

$\ell(\theta) = \sum_{i=1}^n \log\left( \sum_{j=1}^k Q_i(j)\, \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)} \right) \ \ge\ \sum_{i=1}^n \sum_{j=1}^k Q_i(j) \log \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{Q_i(j)} := \mathrm{ELBO}(\theta)$

Theorem

If we take $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ and let $\theta^{(t+1)} := \arg\max_\theta \mathrm{ELBO}(\theta)$, then $\ell(\theta^{(t+1)}) \ge \ell(\theta^{(t)})$ (proved in lecture).

Algorithm 2: EM Algorithm

Input: Training data $\{x^{(1)}, \ldots, x^{(n)}\}$
1  Initialize $\theta^{(0)}$ by some random guess
2  for $t = 0, 1, 2, \ldots$ do
3      Set $Q_i(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$ for each $i, j$   // E-step
4      Set $\theta^{(t+1)} = \arg\max_\theta \mathrm{ELBO}(\theta)$   // M-step
5  end
EM in GMM
Posterior of $z^{(i)}$

$p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)}) = \frac{\phi_j^{(t)}\, p(x^{(i)}; \mu_j^{(t)}, \Sigma_j^{(t)})}{\sum_{j'=1}^k \phi_{j'}^{(t)}\, p(x^{(i)}; \mu_{j'}^{(t)}, \Sigma_{j'}^{(t)})}$

GMM Update Rules

Defining $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, we have

$\phi_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)}}{n}, \quad \mu_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} x^{(i)}}{\sum_{i=1}^n w_j^{(i)}}, \quad \forall\, j \in \{1, \ldots, k\}$

$\Sigma_j^{(t+1)} = \frac{\sum_{i=1}^n w_j^{(i)} (x^{(i)} - \mu_j^{(t+1)})(x^{(i)} - \mu_j^{(t+1)})^T}{\sum_{i=1}^n w_j^{(i)}}, \quad \forall\, j \in \{1, \ldots, k\}$
Practice Exam Problem (If time permits)