
Machine Learning 2016

Lecture 1: Introduction    Lecturer: Andrew Ng    Scribe: Minsu Kim

1.1 What is Machine Learning?

Before studying Machine Learning algorithms, we should know what Machine Learning is. Two people have tried to define it as follows:

Arthur Samuel. Machine Learning: the field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, well-posed learning problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

1.2 Supervised Learning

In supervised learning, we are given a data set and already know the relationship between input and output. There are two kinds of supervised learning.

1.2.1 Regression

If we have a data set of housing prices in which the right answers are given, how could we predict the price of a house of a particular size? In Figure 1.1, the red 'X' points are the data set and the two lines are output prediction lines.

Figure 1.1: Regression

From those lines, we can predict the housing price for a particular size. This process is regression. In other words, regression is predicting a continuous-valued output when you have a labeled data set.


1.2.2 Classification

Classification is similar to regression in that it predicts an output from a labeled data set. However, it does not predict a continuous-valued output.

Figure 1.2: Classification

The figure above shows a data set about tumors. We are trying to predict whether a tumor is malignant or not according to tumor size. Classification, then, is predicting a discrete-valued output (0, 1) from a labeled data set.

1.3 Unsupervised learning

Unlike supervised learning, unsupervised learning lets us approach a problem with no data set telling us what the results should look like. We can instead derive clusters from data in which we do not know the effect of the variables. That is, nothing teaches us what the correct result is; we just obtain clusters of items that are somehow similar or related.

For example, we might collect 1000 articles about the Greek economy and find a way to group these articles into small clusters by similar topics, sentences, number of pages, and so on.

Figure 1.3: Unsupervised learning


Machine Learning 2016

Lecture 2: Linear Regression with One Variable    Lecturer: Andrew Ng    Scribe: Minsu Kim

2.1 Model Representation

In the previous lecture, we studied the "Regression Problem". In a regression problem, we are trying to predict a continuous-valued output from a data set in which the relationship between input and output is already known.

Linear regression with one variable is also called "univariate linear regression". If you are trying to predict a single output value from a single input value and already know the relationship between input and output, you should use univariate linear regression.

2.1.1 Hypothesis Function

Before studying the hypothesis function, we should know a few pieces of notation:

m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable
(x, y) = one training example
(x^{(i)}, y^{(i)}) = the i-th training example

Our hypothesis function has the general form:

h_\theta(x) = \theta_0 + \theta_1 x

We should choose θ_0, θ_1 so that h_θ(x) is close to y for our training examples; that is, h_θ maps from x's to y's like the blue line below:

Figure 2.1: Hypothesis function


2.1.2 Cost Function

We can measure how close h_θ is to the data with a cost function. We take the average of the squared error between h_θ at every input and the actual output. That is why it is called the "squared error function" or "mean squared error". The cost function is defined as

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

The factor (1/2m) makes the gradient descent computation convenient: the derivative of the squared term produces a factor of 2 that cancels the (1/2).

From the cost function, we can concretely measure the accuracy of our h_θ against the training examples. The more accurate h_θ is, the closer the value of J(θ_0, θ_1) gets to zero.
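As a concrete illustration, here is a minimal sketch of this cost function in Python/NumPy; the function name and the toy data are assumptions for illustration, not part of the lecture.

    import numpy as np

    def cost(theta0, theta1, x, y):
        """Squared-error cost J(theta0, theta1) for univariate linear regression."""
        m = len(y)
        predictions = theta0 + theta1 * x          # h_theta(x^(i)) for every example
        return np.sum((predictions - y) ** 2) / (2 * m)

    # toy data: sizes (x) and prices (y), made up for illustration
    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 2.5, 3.5])
    print(cost(0.0, 1.0, x, y))                    # J for theta0 = 0, theta1 = 1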

2.1.3 Gradient Descent

We have our hypothesis function and a way of measuring how accurate it is. Now we study a way of automatically improving our hypothesis function by using gradient descent.

The gradient descent equation is defined as

repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \qquad \text{(for } j = 0 \text{ and } j = 1\text{, Simultaneous Update)}

Our optimization objective is to fit θ_0, θ_1 so as to minimize J(θ_0, θ_1); that is why the update uses the partial derivative ∂J(θ_0, θ_1)/∂θ_j. Here is the intuition behind gradient descent.

Figure 2.2: Gradient descent

In Figure 2.2, we can see that gradient descent automatically takes smaller steps as we approach a local minimum, so there is no need to decrease α over time.


2.1.4 Gradient Descent For Linear Regression

From the general gradient descent definition above, we can write gradient descent for linear regression as

repeat until convergence :

\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}

(Simultaneous Update)

In linear regression the cost function is convex, so gradient descent always reaches the global minimum; there are no other local minima to get stuck in.
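As a sketch, the update rule above can be implemented in a few lines of Python/NumPy; the learning rate, iteration count, and toy data below are illustrative assumptions.

    import numpy as np

    def gradient_descent(x, y, alpha=0.1, iterations=2000):
        """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
        m = len(y)
        theta0, theta1 = 0.0, 0.0
        for _ in range(iterations):
            error = (theta0 + theta1 * x) - y      # h_theta(x^(i)) - y^(i)
            grad0 = np.sum(error) / m              # partial derivative w.r.t. theta0
            grad1 = np.sum(error * x) / m          # partial derivative w.r.t. theta1
            theta0 -= alpha * grad0                # simultaneous update
            theta1 -= alpha * grad1
        return theta0, theta1

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 2.5, 3.5, 4.5])
    print(gradient_descent(x, y))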


Machine Learning 2016

Lecture 3: Linear Regression with Multiple Variables    Lecturer: Andrew Ng    Scribe: Minsu Kim

3.1 Multiple Features

Unlike univariate linear regression, linear regression with multiple variables uses more features to predict the output more accurately. It is also called "multivariate linear regression".

3.1.1 Hypothesis Function for Multiple Features

Now we introduce notation for equations with any number of variables.

m = number of training examples
n = number of features
x^{(i)} = input (features) of the i-th training example
x_j^{(i)} = value of feature j in the i-th training example

Then we can define the hypothesis function for multiple features as follows:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots + \theta_n x_n

For convenience of notation, we define x_0 = 1 and write a more compact form:

h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \Theta^T x

Now we collect all m training examples, each with n features, and record them in an (n + 1) × m matrix, as shown here:

X = \begin{bmatrix} x_0^{(1)} & x_0^{(2)} & \dots & x_0^{(m)} \\ x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)} \\ \vdots & & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)} \end{bmatrix} = \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)} \\ \vdots & & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)} \end{bmatrix}

And then,

h_\theta = \begin{bmatrix} \theta_0 & \theta_1 & \dots & \theta_n \end{bmatrix} \begin{bmatrix} 1 & 1 & \dots & 1 \\ x_1^{(1)} & x_1^{(2)} & \dots & x_1^{(m)} \\ \vdots & & & \vdots \\ x_n^{(1)} & x_n^{(2)} & \dots & x_n^{(m)} \end{bmatrix} = \Theta^T X
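A quick sketch of this vectorized hypothesis in Python/NumPy, following the (n + 1) × m column-per-example layout used above; the feature values and parameters are assumptions for illustration.

    import numpy as np

    # two features, three training examples, stored column-wise as in the text
    X = np.array([[1.0, 1.0, 1.0],      # x_0 = 1 for every example
                  [2.1, 1.4, 3.0],      # x_1 (e.g. size)
                  [3.0, 2.0, 4.0]])     # x_2 (e.g. number of bedrooms)
    Theta = np.array([0.5, 1.2, -0.3])  # theta_0, theta_1, theta_2

    predictions = Theta @ X             # Theta^T X: one prediction per column/example
    print(predictions)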


3.1.2 Cost Function for Multiple Variables

For multiple variables, h_θ = Θ^T X and the parameter vector Θ is an (n+1)-dimensional vector. Then the cost function is:

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

And the vectorized version is :

J(\Theta) = \frac{1}{2m} \left( X^T \Theta - \vec{y} \right)^T \left( X^T \Theta - \vec{y} \right)

3.1.3 Gradient Descent for Multiple Variables

From the general gradient descent form, gradient descent for multiple variables is:

repeat until convergence :

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

(for j = 0, 1, 2, \dots, n)    (Simultaneous Update)

And the vectorized version is :

\Theta := \Theta - \alpha \nabla J(\Theta)

where ∇J(Θ) is defined as

\nabla J(\Theta) = \begin{bmatrix} \frac{\partial}{\partial \theta_0} J(\Theta) \\ \frac{\partial}{\partial \theta_1} J(\Theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n} J(\Theta) \end{bmatrix}

The j-th component of ∇J(Θ) can be vectorized as

\frac{\partial}{\partial \theta_j} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \left( h_\theta(x^{(i)}) - y^{(i)} \right) = \frac{1}{m} \vec{x}_j^{\,T} \left( X^T \Theta - \vec{y} \right)


And then

\nabla J(\Theta) = \frac{1}{m} X \left( X^T \Theta - \vec{y} \right)

From these vectorized forms, we can express vectorized gradient descent as

\Theta := \Theta - \alpha \nabla J(\Theta) = \Theta - \alpha \frac{1}{m} X \left( X^T \Theta - \vec{y} \right)
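A minimal vectorized sketch of this update in Python/NumPy, again storing X as (n + 1) × m as in the text; the step size, iteration count, and data are illustrative assumptions.

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, iterations=2000):
        """Vectorized batch gradient descent; X is (n+1) x m with the x_0 = 1 row first."""
        n_plus_1, m = X.shape
        Theta = np.zeros(n_plus_1)
        for _ in range(iterations):
            residual = X.T @ Theta - y       # h_theta(x^(i)) - y^(i) for every example
            grad = X @ residual / m          # nabla J(Theta)
            Theta -= alpha * grad            # simultaneous update of all theta_j
        return Theta

    X = np.array([[1.0, 1.0, 1.0, 1.0],
                  [0.5, 1.0, 1.5, 2.0]])
    y = np.array([1.0, 2.0, 3.0, 4.0])
    print(gradient_descent(X, y))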

3.1.4 Feature Scaling

Previously, we learned how to choose Θ for predicting our output. Now we study a way of making that choice converge faster. The idea is to put the features on a similar scale. For example,

x_1 = size (0 ~ 2000 feet²)
x_2 = number of bedrooms (1 ~ 5)

x_1's range is much bigger than x_2's range, so gradient descent will take a long time to reach the global minimum. Therefore, we should get every feature into approximately a −1 ≤ x_i ≤ 1 range.

- Mean normalization: replace x_i with x_i − μ_i so that the features have approximately zero mean. [NOTE: do not apply this to x_0 = 1]

x_i := \frac{x_i - \mu_i}{s_i}

(where μ_i is the average of x_i in the training set and s_i is the range of x_i (max − min))
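A small sketch of mean normalization in Python/NumPy; the column-per-example layout and the numbers are illustrative assumptions, and the x_0 = 1 row is left untouched.

    import numpy as np

    def mean_normalize(X):
        """Scale each feature row of an (n+1) x m matrix, skipping the x_0 = 1 row."""
        X = X.astype(float)
        for j in range(1, X.shape[0]):        # do not apply to x_0
            mu = X[j].mean()                  # mu_i: average of the feature
            s = X[j].max() - X[j].min()       # s_i: range (max - min)
            X[j] = (X[j] - mu) / s
        return X

    X = np.array([[1, 1, 1, 1],
                  [2104, 1416, 1534, 852],    # size in square feet
                  [5, 3, 3, 2]])              # number of bedrooms
    print(mean_normalize(X))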

3.1.5 Learning Rate

In gradient descent

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Here α is called the "learning rate". It affects the convergence of θ. We know that J(θ) should decrease after every iteration, as below.

Figure 3.1: proper α


Figure 3.2: not proper α

However, if J(θ) increases or oscillates as in Figure 3.2, you should use a smaller α.

Therefore:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration and may not converge.
To choose α, try values such as (..., 0.001, 0.01, 0.1, 1, ...).

3.1.6 Features and Polynomial Regression

From our features x_1, x_2 we can make a new feature x_3 = x_1 × x_2. In some problems we should build new features like this from the original features. If we have a hypothesis function such as

h_\theta(x) = \theta_0 + \theta_1 x_1

then from that function we can change the behavior of the curve h_θ. Right now h_θ is a linear function; we can simply square the variable x_1 to get a new function:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2

In that function we have introduced a new feature x_2 = x_1^2.

By making h_θ quadratic, cubic, or some other form, you can improve your hypothesis function.
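A short sketch of building polynomial features in Python/NumPy; the degree and the data are illustrative assumptions. Note that feature scaling (Section 3.1.4) becomes even more important here, since x^3 has a much larger range than x.

    import numpy as np

    def polynomial_features(x, degree):
        """Stack rows [1, x, x^2, ..., x^degree] into a (degree+1) x m matrix."""
        return np.vstack([x ** d for d in range(degree + 1)])

    x = np.array([1.0, 2.0, 3.0])             # original single feature, e.g. house size
    X = polynomial_features(x, 3)             # rows: x_0 = 1, x_1 = x, x_2 = x^2, x_3 = x^3
    print(X)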

3.1.7 Normal Equation

The normal equation is a method of solving for θ analytically: we set the partial derivatives of J(θ) to zero and solve for θ,

\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad \text{for every } j.

With m examples and n features, the normal equation is

\Theta = \left( X^T X \right)^{-1} X^T y \qquad (\Theta \in \mathbb{R}^{n+1})

where

x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \qquad \text{and} \qquad X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}


If we use the normal equation, there is no need to do feature scaling.

Now we compare gradient descent with the normal equation.

Gradient Descent                       | Normal Equation
- Need to choose α                     | - No need to choose α
- Needs many iterations                | - No need to iterate
- Works well even when n is large      | - Needs to compute (X^T X)^{-1}
                                       | - Slow if n is very large
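A minimal sketch of the normal equation in Python/NumPy, using the row-per-example X of this section; the data are made up, and the use of pinv (to stay safe if X^T X is singular) is my own choice rather than something from the lecture.

    import numpy as np

    # each row is one training example [x_0 = 1, x_1, ..., x_n]
    X = np.array([[1.0, 2104.0, 5.0],
                  [1.0, 1416.0, 3.0],
                  [1.0, 1534.0, 3.0],
                  [1.0,  852.0, 2.0]])
    y = np.array([460.0, 232.0, 315.0, 178.0])

    # Theta = (X^T X)^(-1) X^T y
    Theta = np.linalg.pinv(X.T @ X) @ X.T @ y
    print(Theta)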


Machine Learning 2016

Lecture 4: Logistic Regression    Lecturer: Andrew Ng    Scribe: Minsu Kim

4.1 Classification

As mentioned in Lecture 1, classification is predicting a discrete-valued output (0, 1) from a labeled data set. This means y ∈ {0, 1}. The hypothesis function for linear regression is not always between 0 and 1, so we cannot use it here and need a new function. "Logistic regression" is the classification algorithm defined for this purpose.

4.2 Hypothesis Representation for Classification

4.2.1 Logistic Regression Model

We want the condition 0 ≤ h_θ(x) ≤ 1. With this condition, the hypothesis function is defined as

h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}

Figure 4.1: Sigmoid function g(z)

The sigmoid function g(z) satisfies the condition on h_θ(x). We threshold the classifier output at 0.5:
- If h_θ(x) ≥ 0.5, predict "y = 1"
- If h_θ(x) < 0.5, predict "y = 0"

Therefore, h_θ(x) is the estimated probability that y = 1 on input x. In other words, we can write h_θ(x) as

hθ(x) = P (y = 1 | x; θ)
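A minimal sketch of this hypothesis in Python/NumPy; the θ values and the example input are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        """g(z) = 1 / (1 + e^(-z))"""
        return 1.0 / (1.0 + np.exp(-z))

    def predict(theta, x):
        """h_theta(x) = g(theta^T x); threshold at 0.5 to get the class label."""
        probability = sigmoid(theta @ x)       # estimated P(y = 1 | x; theta)
        return probability, int(probability >= 0.5)

    theta = np.array([-3.0, 1.0])              # theta_0, theta_1 (illustrative values)
    x = np.array([1.0, 4.5])                   # x_0 = 1, x_1 = tumor size (illustrative)
    print(predict(theta, x))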


4.2.2 Decision Boundary

In logistic regression there is a decision boundary. Suppose we predict as follows:

"y = 1" if h_θ(x) ≥ 0.5: g(z) ≥ 0.5 when z ≥ 0, i.e. θ^T x ≥ 0
"y = 0" if h_θ(x) < 0.5: g(z) < 0.5 when z < 0, i.e. θ^T x < 0

4.2.3 Cost Function for Classification

For logistic regression, suppose we first tried the same squared-error cost that we used for linear regression:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{cost}(h_\theta(x^{(i)}), y^{(i)})

where

\mathrm{cost}(h_\theta(x), y) = \frac{1}{2} \left( h_\theta(x) - y \right)^2

Unlike in linear regression, h_θ(x) now contains the nonlinear term 1/(1 + e^{-z}), so the graph of J(θ) would be non-convex, as follows:

Figure 4.2: Regression form

A non-convex form does not guarantee reaching the global minimum. Therefore, we redefine the logistic regression cost function to obtain a convex form.


- Logistic regression cost function. It is redefined as

\mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}

Figure 4.3: cost(hθ(x), y)

cost(hθ(x), y) = 0 if hθ(x) = y

cost(hθ(x), y)→∞ if y = 0 and hθ(x)→ 1

cost(hθ(x), y)→∞ if y = 1 and hθ(x)→ 0

4.2.4 Simplified Cost Function and Gradient Descent

We know the logistic regression cost function as

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{cost}(h_\theta(x^{(i)}), y^{(i)}), \qquad \mathrm{cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}

[NOTE : y = 0 or 1 always]

Because y is always 0 or 1, we can write the cost in a single expression:

\mathrm{cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))

Therefore,

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]

Then gradient descent takes the familiar form:

repeat (to minimize J(θ) over θ):

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)


(Simultaneous Update), where

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]
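A compact sketch of this cost and of a plain gradient descent loop for it in Python/NumPy; the layout, learning rate, iteration count, and toy data are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost_and_gradient(theta, X, y):
        """X is m x (n+1) with x_0 = 1 in the first column; y holds 0/1 labels."""
        m = len(y)
        h = sigmoid(X @ theta)                            # h_theta(x^(i)) for every example
        J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
        grad = X.T @ (h - y) / m                          # partial derivatives of J(theta)
        return J, grad

    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    theta = np.zeros(2)
    for _ in range(1000):                                 # gradient descent with alpha = 0.5
        J, grad = cost_and_gradient(theta, X, y)
        theta -= 0.5 * grad
    print(theta, J)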

4.2.5 Multiclass Classification

The update rule looks identical to the one for linear regression, but h_θ(x) is different in each case:

Linear regression: h_\theta(x) = \theta^T x

Logistic regression: h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

Now we turn to classification with more than two categories, so there are more than two possible outputs. If a problem has n classes, we train a separate hypothesis h_θ^{(i)} for each class i and predict the probability of each class as follows:

\max_i h_\theta^{(i)}(x) = \max_i P(y = i \mid x; \theta) \qquad (i = 1, 2, \dots, n)

On a new input x, to make a prediction, pick the class i that maximizes h_θ^{(i)}(x).
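A tiny sketch of this prediction step in Python/NumPy, assuming one parameter vector per class has already been trained; the matrix of parameters and the input are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # one row of parameters per class, illustrative values
    Thetas = np.array([[ 1.0, -2.0],
                       [-1.0,  0.5],
                       [-3.0,  1.5]])
    x = np.array([1.0, 2.0])                  # x_0 = 1 plus one feature

    probs = sigmoid(Thetas @ x)               # h_theta^(i)(x) = P(y = i | x; theta) per class
    print(probs, "predicted class:", np.argmax(probs) + 1)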


Machine Learning 2016

Lecture 5: Regularization    Lecturer: Andrew Ng    Scribe: Minsu Kim

5.1 The Problem of Overfitting

Suppose we are trying to predict housing prices and we have the three hypothesis functions shown in Figure 5.1.

Figure 5.1: Prediction

Case 1: h_\theta(x) = \theta_0 + \theta_1 x
Case 2: h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2
Case 3: h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_6 x^6

In case 1, the function does not fit the data very well ("underfitting").
In case 2, the function fits properly.
In case 3, the function fits the training data very well, but too well ("overfitting").

- Overfitting: if we have too many features, the learned hypothesis may fit the training set very well but fail to generalize to new examples.

5.2 Addressing Overfitting

There are two options:

1. Reduce the number of features.
   - Manually select which features to keep
   - Use a model selection algorithm
   However, this option throws away some of our information.

2. Regularization.
   - Keep all the features, but reduce the magnitude/values of the parameters θ_j
   - Works well when we have a lot of features, each of which contributes a bit to predicting y


5.3 Regularized Cost Function

Our hypothesis function is

h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4

Suppose h_θ(x) is currently overfitting, and we penalize θ_3 and θ_4 to make them really small. The result is that h_θ(x) behaves almost like a quadratic function, and we have addressed the overfitting. Without actually eliminating these features, we can just modify our cost function:

\min_\theta \; \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000 \cdot \theta_3^2 + 1000 \cdot \theta_4^2 \right]

Minimizing this forces the values of θ_3 and θ_4 to be close to zero.

We can also regularize all of our θ parameters in a single summation, excluding θ_0:

\min_\theta \; \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]

(λ is the regularization parameter)

Because θ_0 is the bias term, we explicitly leave it out of the regularization sum.
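A small sketch of this regularized cost in Python/NumPy; the matrix layout, λ value, and data are illustrative assumptions, and θ_0 is excluded from the penalty.

    import numpy as np

    def regularized_cost(theta, X, y, lam):
        """X is m x (n+1) with x_0 = 1 in the first column; theta_0 is not penalized."""
        m = len(y)
        residual = X @ theta - y
        penalty = lam * np.sum(theta[1:] ** 2)     # skip theta_0
        return (residual @ residual + penalty) / (2 * m)

    X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5]])
    y = np.array([1.0, 2.0, 2.5])
    print(regularized_cost(np.array([0.2, 1.6]), X, y, lam=1.0))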

5.4 Regularized Linear Regression

5.4.1 Gradient Descent with Regularization

repeat until convergence :

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]

(for j = 1, \dots, n)    (Simultaneous Update)

The (λ/m)θ_j term is the regularization term. We can also rewrite the θ_j update as

\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

The factor (1 − αλ/m) is less than 1, so every update shrinks θ_j slightly before the usual gradient step is applied.


5.4.2 Regularized Normal Equation

The normal equation with regularization is defined as

\theta = \left( X^T X + \lambda \cdot L \right)^{-1} X^T y \qquad \text{where } L = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}
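A minimal sketch of this regularized normal equation in Python/NumPy; λ and the data are illustrative assumptions, np.linalg.solve is used instead of an explicit inverse, and the 0 in the top-left of L keeps θ_0 unregularized.

    import numpy as np

    def regularized_normal_equation(X, y, lam):
        """theta = (X^T X + lambda * L)^(-1) X^T y, with L = diag(0, 1, ..., 1)."""
        n_plus_1 = X.shape[1]
        L = np.eye(n_plus_1)
        L[0, 0] = 0.0                              # do not regularize theta_0
        return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

    X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
    y = np.array([1.0, 1.8, 2.6, 3.5])
    print(regularized_normal_equation(X, y, lam=1.0))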

5.5 Regularized Logistic Regression

The regularized logistic regression cost function is defined as

J(\theta) = -\left[ \frac{1}{m} \sum_{i=1}^{m} y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

and with this J(θ) we can write gradient descent as

repeat until convergence :

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]

\left( \text{where } h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \text{ for } j = 1, \dots, n \right)

(Simultaneous Update)

This looks identical to gradient descent for regularized linear regression, except that h_θ(x) is now the sigmoid hypothesis.
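A compact sketch of this regularized update in Python/NumPy; the learning rate, λ, iteration count, and data are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def regularized_logistic_gd(X, y, lam=1.0, alpha=0.3, iterations=2000):
        """X is m x (n+1) with x_0 = 1 in the first column; theta_0 is not regularized."""
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)
        for _ in range(iterations):
            h = sigmoid(X @ theta)
            grad = X.T @ (h - y) / m
            grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1
            theta -= alpha * grad                   # simultaneous update
        return theta

    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(regularized_logistic_gd(X, y))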
