A Brief Introduction to Linear Regression


Transcript of A Brief Introduction to Linear Regression

Page 1: A Brief Introduction to Linear Regression

A Brief Introduction to Linear Regression

Page 2: A Brief Introduction to Linear Regression

General Definitions

x : input, features, independent variables

y : output, response, labels, dependent variables

(xi, yi) : known (input, output) pairs used for training and evaluation.

f(x) : Χ ⟶ Υ is unknown.

[Figure: points x1, x2, x3, …, xi in Χ mapped by the unknown f to y1, y2, y3, …, yi in Υ]

Page 3: A Brief Introduction to Linear Regression

The problem: How do we approximate the unknown f(x)?

We assume there is some hypothesis h(x) that can approximate it.

Find the best function h(x) such that h(x) ≈ f(x).

[Figure: the same mapping of points x1, x2, x3, …, xi in Χ to y1, y2, y3, …, yi in Υ]

Page 4: A Brief Introduction to Linear Regression

We let the hypothesis h be linear:

hθ(x) = ϴ0 + ϴ1x1 + ϴ2x2 + ϴ3x3 + … + ϴNxN

Find the parameters ϴi ∈ (ϴ0, ϴ1, ϴ2, …, ϴN) that define our hypothesis hθ.

[Figure: the hypothesis hθ(x) mapping points x1, x2, x3, …, xi in Χ to y1, y2, y3, …, yi in Υ]

Solution: Linear Regression

Page 5: A Brief Introduction to Linear Regression

hθ(x) = ϴ0 + ϴ1x1 + ϴ2x2 + … + ϴNxN = ∑ ϴixi (from i=0 to N, with the convention x0 = 1)

Considering x and ϴ to be vectors we get: hθ(x) = ϴ0x0 + ϴ1x1 + ϴ2x2 + … + ϴNxN

row vector: ϴT = [ϴ0, ϴ1, ϴ2, …, ϴN]

hθ(x) = ϴTx (the signal)

Now let’s find ϴ.

Cleaning things up
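As a concrete illustration, here is a minimal NumPy sketch of the vectorized hypothesis hθ(x) = ϴTx, assuming x already contains the bias feature x0 = 1 (the values and names are illustrative, not taken from the slides):

    import numpy as np

    def h(theta, x):
        # Linear hypothesis h_theta(x) = theta^T x (x includes the bias x0 = 1).
        return np.dot(theta, x)

    theta = np.array([0.5, 2.0, -1.0])   # [theta0, theta1, theta2]
    x = np.array([1.0, 3.0, 4.0])        # [x0, x1, x2] with x0 = 1
    print(h(theta, x))                   # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5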

Page 6: A Brief Introduction to Linear Regression

We define:

J(ϴ) = ½ ∑ ( hθ(xi) - yi )²

This method is ordinary least squares (OLS).

J(ϴ) outputs the cost/error of our hypothesis in terms of ϴ.

Since the J(ϴ) we chose is quadratic, we are guaranteed the existence of a minimum.

How to find the minimum?

The Cost/Loss function:
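A minimal NumPy sketch of this cost function, assuming the training inputs are stacked as rows of a matrix X (each row including x0 = 1) and the outputs collected in a vector y (the names X and y are illustrative):

    import numpy as np

    def cost(theta, X, y):
        # OLS cost J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
        # X: (n, N+1) matrix, one training input per row (with x0 = 1)
        # y: vector of the n training outputs
        residuals = X @ theta - y
        return 0.5 * np.sum(residuals ** 2)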

Page 7: A Brief Introduction to Linear Regression

We start from an initial guess for ϴ and then iterate as follows:

ϴj := ϴj - α ∂J(ϴ)/∂ϴj

ϴj := ϴj - α ∂/∂ϴj [ ½ ∑ ( hθ(xi) - yi )² ]

For a single training example (x, y), the chain rule gives ∂/∂ϴj ½( hθ(x) - y )² = ( hθ(x) - y ) xj, so after differentiation we get:

ϴj := ϴj + α ( yi - hθ(xi) ) xij

where xij denotes the j-th feature of the example xi.

α is the learning rate.

Gradient Descent:
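One application of this update rule to a single training example might look like the following NumPy sketch (the numbers and names are purely illustrative):

    import numpy as np

    alpha = 0.1                          # learning rate
    theta = np.array([0.0, 0.0])         # [theta0, theta1]
    x_i = np.array([1.0, 2.0])           # one training input, with x0 = 1
    y_i = 3.0                            # its output

    error = y_i - np.dot(theta, x_i)     # y_i - h_theta(x_i) = 3.0
    theta = theta + alpha * error * x_i  # theta_j += alpha * error * x_ij
    print(theta)                         # [0.3, 0.6]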

Page 8: A Brief Introduction to Linear Regression

Two ways to descend:

Stochastic descent:

Repeat {
  for i = 1 to n {
    ϴj := ϴj + α ( yi - hθ(xi) ) xij   (for every j)
  }
}

Batch descent:

Repeat {
  ϴj := ϴj + α ∑i ( yi - hθ(xi) ) xij   (for every j)
} until convergence
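The two variants above translate roughly to the following NumPy sketch (the learning rate, iteration counts, and names are illustrative assumptions, not values from the slides):

    import numpy as np

    def batch_gd(X, y, alpha=0.01, iters=1000):
        # Batch descent: every update sums the error over the whole training set.
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            errors = y - X @ theta                  # y_i - h_theta(x_i) for all i
            theta = theta + alpha * (X.T @ errors)  # theta_j += alpha * sum_i error_i * x_ij
        return theta

    def stochastic_gd(X, y, alpha=0.01, epochs=10):
        # Stochastic descent: update theta after each individual example.
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(X.shape[0]):
                error = y[i] - X[i] @ theta
                theta = theta + alpha * error * X[i]  # theta_j += alpha * error_i * x_ij
        return theta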

Page 9: A Brief Introduction to Linear Regression

Batch vs Stochastic:

Batch Gradient Descent (BGD) has to scan through the entire training set before making progress.

BGD is very costly for large data sets.

Stochastic Gradient Descent (SGD) can start making progress right away and converges faster.

SGD might not converge though!

Page 10: A Brief Introduction to Linear Regression

A closed form solution:

We define the design matrix X, whose rows are the training inputs, and the vector y of training outputs, so that J(ϴ) = ½ ‖Xϴ - y‖².

Then we solve ∇J(ϴ) = 0, which gives the normal equation:

ϴ = (XᵀX)⁻¹ Xᵀ y
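A minimal NumPy sketch of this closed-form (normal equation) solution, again assuming X stacks the training inputs as rows and y holds the outputs; a linear solve is used rather than an explicit matrix inverse:

    import numpy as np

    def normal_equation(X, y):
        # Solve grad J(theta) = 0 in closed form: theta = (X^T X)^{-1} X^T y
        return np.linalg.solve(X.T @ X, X.T @ y)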

Page 11: A Brief Introduction to Linear Regression

Visualising things:

Page 12: A Brief Introduction to Linear Regression

J(ϴ) GD:

Page 13: A Brief Introduction to Linear Regression

✘ Change the form of h(x) (Logistic Regression).

✘ Change the Cost/Loss J(ϴ) (e.g. Locally Weighted Regression (non-parametric)).

✘ Consider probability.

✘ Go from regression to classification.

✘ Preprocess the data (Dimensionality Reduction).

Where to go from here?

Page 14: A Brief Introduction to Linear Regression

THANKS! Any questions? You can find me at: [email protected]

June 2015