A Brief Introduction to Linear Regression


Transcript of A Brief Introduction to Linear Regression

Page 1: A Brief Introduction to Linear Regression

A Brief Introduction to Linear Regression

Page 2: A Brief Introduction to Linear Regression

General Definitions

x : input, features, independent variables

y : output, response, labels, dependent variables

(xi, yi) : known (input, output) pairs used for training and evaluation.

f(x) : Χ ⟶ Υ is unknown.

[Figure: points x1, x2, x3, …, xi in Χ mapped by the unknown f to y1, y2, y3, …, yi in Υ]

Page 3: A Brief Introduction to Linear Regression

The problem: How do we approximate the unknown f(x)?

We assume there is some hypothesis h(x) that can approximate it.

Find the best function h(x) such that h(x) ≈ f(x).

[Figure: the same mapping of points x1, x2, x3, …, xi in Χ to y1, y2, y3, …, yi in Υ]

Page 4: A Brief Introduction to Linear Regression

We let the hypothesis h be linear:

hθ(x) = ϴ0 + ϴ1x1 + ϴ2x2 + ϴ3x3 + … + ϴNxN

Find the parameters ϴi ∈ (ϴ0, ϴ1, ϴ2, …, ϴN) that define our hypothesis hθ.

[Figure: the hypothesis hθ(x) mapping points x1, x2, x3, …, xi in Χ to y1, y2, y3, …, yi in Υ]

Solution: Linear Regression

Page 5: A Brief Introduction to Linear Regression

hθ(x) = ϴ0 + ϴ1x1 + ϴ2x2 + … + ϴNxN = ∑ ϴixi (from i=0 to N, with the convention x0 = 1)

Considering x and ϴ to be vectors we get: hθ(x) = ϴ0x0 + ϴ1x1 + ϴ2x2 + … + ϴNxN

row vector: ϴT = [ϴ0, ϴ1, ϴ2, …, ϴN]

hθ(x) = ϴTx (the signal)

Now let’s find ϴ.

Cleaning things up
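As a concrete illustration, here is a minimal NumPy sketch of the vectorized hypothesis hθ(x) = ϴTx, assuming x already contains the bias feature x0 = 1 (the values and names are illustrative, not taken from the slides):

    import numpy as np

    def h(theta, x):
        # Linear hypothesis h_theta(x) = theta^T x (x includes the bias x0 = 1).
        return np.dot(theta, x)

    theta = np.array([0.5, 2.0, -1.0])   # [theta0, theta1, theta2]
    x = np.array([1.0, 3.0, 4.0])        # [x0, x1, x2] with x0 = 1
    print(h(theta, x))                   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5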

Page 6: A Brief Introduction to Linear Regression

We define:

J(ϴ) = ½ ∑ ( hθ(xi) - yi )²

This method is ordinary least squares (OLS).

J(ϴ) outputs the cost/error of our hypothesis in terms of ϴ.

Since the J(ϴ) we chose is quadratic, we are guaranteed the existence of a minimum.

How to find the minimum?

The Cost/Loss function:
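A minimal NumPy sketch of this cost function, assuming the training inputs are stacked as rows of a matrix X (each row including x0 = 1) and the outputs collected in a vector y (the names X and y are illustrative):

    import numpy as np

    def cost(theta, X, y):
        # OLS cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
        # X: (n, N+1) matrix, one training input per row (with x0 = 1)
        # y: vector of the n training outputs
        residuals = X @ theta - y
        return 0.5 * np.sum(residuals ** 2)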

Page 7: A Brief Introduction to Linear Regression

We start from an initial guess for ϴ and then iterate as follows:

ϴj := ϴj - α ∂J(ϴ)/∂ϴj

ϴj := ϴj - α ∂/∂ϴj [ ½ ∑ ( hθ(xi) - yi )² ]

For a single training example (x, y), the chain rule gives ∂/∂ϴj ½( hθ(x) - y )² = ( hθ(x) - y ) xj, so after differentiation we get:

ϴj := ϴj + α ( yi - hθ(xi) ) xij

where xij denotes the j-th feature of the example xi.

α is the learning rate.

Gradient Descent:
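One application of this update rule to a single training example might look like the following NumPy sketch (the numbers and names are purely illustrative):

    import numpy as np

    alpha = 0.1                          # learning rate
    theta = np.array([0.0, 0.0])         # [theta0, theta1]
    x_i = np.array([1.0, 2.0])           # one training input, with x0 = 1
    y_i = 3.0                            # its output

    error = y_i - np.dot(theta, x_i)     # y_i - h_theta(x_i) = 3.0
    theta = theta + alpha * error * x_i  # theta_j += alpha * error * x_ij
    print(theta)                         # [0.3, 0.6]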

Page 8: A Brief Introduction to Linear Regression

Two ways to descend:

Stochastic descent:

Repeat {
  for i = 1 to n {
    ϴj := ϴj + α ( yi - hθ(xi) ) xij   (for every j)
  }
}

Batch descent:

Repeat {
  ϴj := ϴj + α ∑i ( yi - hθ(xi) ) xij   (for every j)
} until convergence
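The two variants above translate roughly to the following NumPy sketch (the learning rate, iteration counts, and names are illustrative assumptions, not values from the slides):

    import numpy as np

    def batch_gd(X, y, alpha=0.01, iters=1000):
        # Batch descent: every update sums the error over the whole training set.
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            errors = y - X @ theta                  # y_i - h_theta(x_i) for all i
            theta = theta + alpha * (X.T @ errors)  # theta_j += alpha * sum_i error_i * x_ij
        return theta

    def stochastic_gd(X, y, alpha=0.01, epochs=10):
        # Stochastic descent: update theta after each individual example.
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(X.shape[0]):
                error = y[i] - X[i] @ theta
                theta = theta + alpha * error * X[i]  # theta_j += alpha * error_i * x_ij
        return theta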

Page 9: A Brief Introduction to Linear Regression

Batch vs Stochastic:

Batch Gradient Descent (BGD) has to scan through the entire training set before making progress.

BGD is very costly for large data sets.

Stochastic Gradient Descent (SGD) can start making progress right away and converges faster.

SGD might not converge though!

Page 10: A Brief Introduction to Linear Regression

A closed form solution:

We define the design matrix X, whose rows are the training inputs, and the vector y of training outputs, so that J(ϴ) = ½ ‖Xϴ - y‖².

Then we solve ∇J(ϴ) = 0, which gives the normal equation:

ϴ = (XᵀX)⁻¹ Xᵀ y
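A minimal NumPy sketch of this closed-form (normal equation) solution, again assuming X stacks the training inputs as rows and y holds the outputs; a linear solve is used rather than an explicit matrix inverse:

    import numpy as np

    def normal_equation(X, y):
        # Solve grad J(theta) = 0 in closed form: theta = (X^T X)^{-1} X^T y
        return np.linalg.solve(X.T @ X, X.T @ y)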

Page 11: A Brief Introduction to Linear Regression

Visualising things:

Page 12: A Brief Introduction to Linear Regression

J(ϴ) GD:

Page 13: A Brief Introduction to Linear Regression

✘ Change the form of h(x) (Logistic Regression).

✘ Change the Cost/Loss J(ϴ) (e.g. Locally Weighted Regression (non-parametric)).

✘ Consider probability.

✘ Go from regression to classification.

✘ Preprocess the data (Dimensionality Reduction).

Where to go from here?

Page 14: A Brief Introduction to Linear Regression

THANKS! Any questions? You can find me at: [email protected]

June 2015