Lecture 5: Linear Models
Lin ZHANG, PhD
School of Software Engineering
Tongji University
Fall 2021
Outline
• Linear model
– Linear regression
– Logistic regression
– Softmax regression
Linear regression
• Our goal in linear regression is to predict a continuous target value $y$ from a vector of input values $\mathbf{x} \in \mathbb{R}^d$; we use a linear function $h$ as the model
• At the training stage, we aim to find $h(\mathbf{x})$ so that $h(\mathbf{x}_i) \approx y_i$ for each training sample $(\mathbf{x}_i, y_i)$
• We suppose that $h$ is a linear function, so
$$h_{\theta,b}(\mathbf{x}) = \theta^T\mathbf{x} + b, \quad \theta \in \mathbb{R}^{d \times 1}$$
Rewrite it with $\theta' = \begin{bmatrix}\theta \\ b\end{bmatrix}$ and $\mathbf{x}' = \begin{bmatrix}\mathbf{x} \\ 1\end{bmatrix}$:
$$\theta'^T\mathbf{x}' = \theta^T\mathbf{x} + b \equiv h(\mathbf{x}')$$
Later, we simply use $h_{\theta}(\mathbf{x}) = \theta^T\mathbf{x}$, with $\theta \in \mathbb{R}^{(d+1) \times 1}$ and $\mathbf{x} \in \mathbb{R}^{(d+1) \times 1}$
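To make the bias-absorption trick concrete, here is a minimal NumPy sketch (the variable names are illustrative, not from the lecture):

```python
import numpy as np

# Sketch of the bias-absorption trick described above.
d = 3
rng = np.random.default_rng(0)
theta = rng.standard_normal(d)   # weights, shape (d,)
b = 0.5                          # bias term
x = rng.standard_normal(d)       # one input sample, shape (d,)

h1 = theta @ x + b               # h_{theta,b}(x) = theta^T x + b

# Augment: theta' = [theta; b], x' = [x; 1]
theta_aug = np.append(theta, b)
x_aug = np.append(x, 1.0)
h2 = theta_aug @ x_aug           # theta'^T x' gives the same value

assert np.isclose(h1, h2)
```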
Linear regression
• Then, our task is to find a choice of $\theta$ so that $h_{\theta}(\mathbf{x}_i)$ is as close as possible to $y_i$

The cost function can be written as
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^T\mathbf{x}_i - y_i\right)^2$$

Then, the task at the training stage is to find
$$\theta^* = \arg\min_{\theta}\ \frac{1}{2}\sum_{i=1}^{m}\left(\theta^T\mathbf{x}_i - y_i\right)^2$$

For this special case, there is a closed-form optimal solution; here we use a more general method, the gradient descent method
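For reference, that closed-form solution is the normal equation $\theta^* = (X^TX)^{-1}X^T\mathbf{y}$ (a standard result; the sketch below is illustrative, not the lecture's code):

```python
import numpy as np

# Minimal sketch of the closed-form solution.
# X: (m, d+1) design matrix with a trailing column of ones; y: (m,) targets.
rng = np.random.default_rng(0)
m, d = 100, 3
X = np.hstack([rng.standard_normal((m, d)), np.ones((m, 1))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta + 0.01 * rng.standard_normal(m)

# theta* = (X^T X)^{-1} X^T y; lstsq is the numerically preferred route
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_star)  # close to true_theta
```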
Linear regression
• Gradient descent
– It is a first-order optimization algorithm
– To find a local minimum of a function, one takes steps proportional to the negative of the gradient of the function at the current point
– One starts with a guess $\theta_0$ for a local minimum of $J(\theta)$ and considers the sequence $\theta_0, \theta_1, \theta_2, \ldots$ such that
$$\theta_{n+1} := \theta_n - \alpha\,\nabla J(\theta)\big|_{\theta=\theta_n}$$
where $\alpha$ is called the learning rate
[Figure slides: gradient descent illustrations, images omitted]
Linear regression
• Gradient descent
Repeat until convergence ($J(\theta)$ will not decrease anymore) {
$$\theta_{n+1} := \theta_n - \alpha\,\nabla J(\theta)\big|_{\theta=\theta_n}$$
}

GD is a general optimization solution; for a specific problem, the key step is how to compute the gradient
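As a self-contained illustration (a hypothetical example, not from the slides), gradient descent on a simple quadratic whose gradient is known analytically:

```python
import numpy as np

# Gradient descent on J(theta) = ||theta - c||^2, whose gradient is 2*(theta - c).
c = np.array([3.0, -2.0])
theta = np.zeros(2)          # initial guess theta_0
alpha = 0.1                  # learning rate

for n in range(100):
    grad = 2.0 * (theta - c)             # gradient at the current point
    theta_next = theta - alpha * grad    # theta_{n+1} = theta_n - alpha * grad
    if np.linalg.norm(theta_next - theta) < 1e-8:   # convergence test
        break
    theta = theta_next

print(theta)  # approaches the minimizer c = [3, -2]
```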
Linear regression
• Gradient of the cost function of linear regression
$$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(\theta^T\mathbf{x}_i - y_i\right)^2$$

The gradient is
$$\nabla J(\theta) = \begin{bmatrix}\dfrac{\partial J(\theta)}{\partial \theta_1} \\ \dfrac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial J(\theta)}{\partial \theta_{d+1}}\end{bmatrix}$$

where
$$\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m}\left(h_{\theta}(\mathbf{x}_i) - y_i\right)x_{ij}$$
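Putting the cost and its gradient together (a minimal NumPy sketch under the notation above; the division by m is my addition for step-size stability, not in the slide's formula):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iters=1000):
    """Batch GD for linear regression.
    X: (m, d+1) inputs (with the constant-1 feature); y: (m,) targets."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = X @ theta - y          # h_theta(x_i) - y_i for all i
        grad = X.T @ residuals             # sum_i (h_theta(x_i) - y_i) x_i
        theta -= alpha * grad / len(y)     # scaled step for stability
    return theta
```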
Linear regression
• Some variants of gradient descent
– The ordinary gradient descent algorithm looks at every sample in the entire training set on every step; it is also called batch gradient descent
– Stochastic gradient descent (SGD) runs repeatedly through the training set, and each time we encounter a training sample, we update the parameters according to the gradient of the error w.r.t. that single training sample only:

Repeat until convergence {
  for i = 1 to m (m is the number of training samples) {
    $$\theta_{n+1} := \theta_n - \alpha\left(\theta_n^T\mathbf{x}_i - y_i\right)\mathbf{x}_i$$
  }
}
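A minimal NumPy sketch of this loop (an assumed implementation; the epoch cap and the per-epoch shuffling are common practice, not stated on the slide):

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, epochs=50):
    """SGD for linear regression: one parameter update per training sample."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):                # stands in for "repeat until convergence"
        for i in rng.permutation(m):       # visit samples in shuffled order
            error = X[i] @ theta - y[i]    # theta^T x_i - y_i
            theta -= alpha * error * X[i]  # single-sample update
    return theta
```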
– Minibatch SGD works identically to SGD, except that it uses more than one training sample to make each estimate of the gradient
Linear regression
• More concepts
– The m training samples can be divided into N minibatches
– When the training sweeps all the batches, we say we complete one epoch of the training process; for a typical training process, several epochs are usually required

epochs = 10; numMiniBatches = N;
while epochIndex < epochs && not convergent {
  for minibatchIndex = 1 to numMiniBatches {
    update the model parameters based on this minibatch
  }
}
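The pseudocode above, fleshed out as a minimal NumPy sketch for linear regression (a hypothetical implementation; names are mine):

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, epochs=10, batch_size=32):
    """Minibatch SGD: N = ceil(m / batch_size) minibatches per epoch."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # one minibatch
            residuals = X[idx] @ theta - y[idx]
            grad = X[idx].T @ residuals / len(idx) # minibatch gradient estimate
            theta -= alpha * grad
    return theta
```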
Outline
• Linear model
– Linear regression
– Logistic regression
– Softmax regression
Logistic regression
• Logistic regression is used for binary classification
• It squeezes the linear regression output $\theta^T\mathbf{x}$ into the range (0, 1); thus the prediction result can be interpreted as a probability
• At the testing stage,
$$h_{\theta}(\mathbf{x}) = \frac{1}{1+\exp\left(-\theta^T\mathbf{x}\right)}$$
The function
$$\sigma(z) = \frac{1}{1+\exp(-z)}$$
is called the sigmoid or logistic function

The probability that the testing sample $\mathbf{x}$ is positive is represented as $h_{\theta}(\mathbf{x})$
The probability that the testing sample $\mathbf{x}$ is negative is represented as $1 - h_{\theta}(\mathbf{x})$
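A minimal sketch of this hypothesis (illustrative NumPy, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """P(y = 1 | x; theta) for logistic regression."""
    return sigmoid(theta @ x)

# Usage: predict class "1" when the probability exceeds 0.5
theta = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, 1.0])      # last entry is the constant-1 feature
p = predict_proba(theta, x)
label = int(p > 0.5)
```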
Logistic regression
[Figure: the shape of the sigmoid function, image omitted]

One property of the sigmoid function:
$$\sigma'(z) = \sigma(z)\left(1-\sigma(z)\right)$$
Can you verify?
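A quick check of this property (standard calculus, spelling out the step the slide asks about):
$$\sigma'(z) = \frac{d}{dz}\left(1+e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1+e^{-z}\right)^2} = \sigma(z)\cdot\frac{e^{-z}}{1+e^{-z}} = \sigma(z)\left(1-\sigma(z)\right),$$
since $\frac{e^{-z}}{1+e^{-z}} = 1 - \frac{1}{1+e^{-z}} = 1-\sigma(z)$.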
Logistic regression
• The hypothesis model can be written neatly as
$$P(y\,|\,\mathbf{x};\theta) = \left(h_{\theta}(\mathbf{x})\right)^{y}\left(1-h_{\theta}(\mathbf{x})\right)^{1-y}$$
• Our goal is to search for a value $\theta$ so that $h_{\theta}(\mathbf{x})$ is large when $\mathbf{x}$ belongs to the "1" class and small when $\mathbf{x}$ belongs to the "0" class

Thus, given a training set $\left\{(\mathbf{x}_i, y_i) : i = 1,\ldots,m\right\}$ with binary labels, we want to maximize
$$\prod_{i=1}^{m}\left(h_{\theta}(\mathbf{x}_i)\right)^{y_i}\left(1-h_{\theta}(\mathbf{x}_i)\right)^{1-y_i}$$

Since the logarithm is monotonic, this is equivalent to maximizing
$$\sum_{i=1}^{m}\left[y_i\log h_{\theta}(\mathbf{x}_i) + (1-y_i)\log\left(1-h_{\theta}(\mathbf{x}_i)\right)\right]$$
Logistic regression
• Thus, the cost function for logistic regression (which we want to minimize) is
$$J(\theta) = -\sum_{i=1}^{m}\left[y_i\log h_{\theta}(\mathbf{x}_i) + (1-y_i)\log\left(1-h_{\theta}(\mathbf{x}_i)\right)\right]$$

To solve it with gradient descent, the gradient needs to be computed:
$$\nabla_{\theta}J(\theta) = \sum_{i=1}^{m}\left(h_{\theta}(\mathbf{x}_i) - y_i\right)\mathbf{x}_i$$
Assignment!
Logistic regression
• Exercise
– Use logistic regression to perform digit classification
Outline
• Linear model
– Linear regression
– Logistic regression
– Softmax regression
Softmax regression
• Softmax operation
– It squashes a K-dimensional vector $\mathbf{z}$ of arbitrary real values to a K-dimensional vector $\sigma(\mathbf{z})$ of real values in the range (0, 1). The function is given by
$$\sigma(\mathbf{z})_j = \frac{\exp(z_j)}{\sum_{k=1}^{K}\exp(z_k)}$$
– Since the components of the vector $\sigma(\mathbf{z})$ sum to one and are all strictly between 0 and 1, they represent a categorical probability distribution
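A minimal NumPy sketch (the max-subtraction is a standard numerical-stability trick, not something the slide mentions):

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), for a 1-D vector z."""
    shifted = z - np.max(z)        # subtract max to avoid overflow in exp
    e = np.exp(shifted)            # the result is unchanged: factors cancel
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # components in (0, 1), summing to 1
```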
Softmax regression
• For multiclass classification, given a test input $\mathbf{x}$, we want our hypothesis to estimate $p(y = k\,|\,\mathbf{x})$ for each value $k = 1, 2, \ldots, K$
Softmax regression
• The hypothesis should output a K-dimensional vector giving us K estimated probabilities. It takes the form
$$h_{\phi}(\mathbf{x}) = \begin{bmatrix} p(y=1\,|\,\mathbf{x};\phi) \\ p(y=2\,|\,\mathbf{x};\phi) \\ \vdots \\ p(y=K\,|\,\mathbf{x};\phi) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K}\exp\left(\theta_j^T\mathbf{x}\right)}\begin{bmatrix} \exp\left(\theta_1^T\mathbf{x}\right) \\ \exp\left(\theta_2^T\mathbf{x}\right) \\ \vdots \\ \exp\left(\theta_K^T\mathbf{x}\right) \end{bmatrix}$$
where $\phi = \left[\theta_1, \theta_2, \ldots, \theta_K\right] \in \mathbb{R}^{(d+1)\times K}$
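In code, the hypothesis is one matrix-vector product followed by a softmax (an illustrative NumPy sketch; phi is the (d+1)×K parameter matrix above):

```python
import numpy as np

def softmax_hypothesis(phi, x):
    """h_phi(x): K class probabilities for input x.
    phi: (d+1, K) with columns theta_1 ... theta_K; x: (d+1,)."""
    logits = phi.T @ x                 # [theta_k^T x for k = 1..K]
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

phi = np.zeros((4, 3))                 # d = 3 features + bias, K = 3 classes
x = np.array([0.2, -0.1, 0.5, 1.0])
print(softmax_hypothesis(phi, x))      # uniform [1/3, 1/3, 1/3] at phi = 0
```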
Softmax regression
• In softmax regression, for each training sample we have
$$p\left(y_i = k\,|\,\mathbf{x}_i;\phi\right) = \frac{\exp\left(\theta_k^T\mathbf{x}_i\right)}{\sum_{j=1}^{K}\exp\left(\theta_j^T\mathbf{x}_i\right)}$$

At the training stage, we want to maximize $p\left(y_i = k\,|\,\mathbf{x}_i;\phi\right)$ for each training sample, where k is its correct label
Softmax regression
• Cost function for softmax regression
$$J(\phi) = -\sum_{i=1}^{m}\sum_{k=1}^{K}1\{y_i = k\}\log\frac{\exp\left(\theta_k^T\mathbf{x}_i\right)}{\sum_{j=1}^{K}\exp\left(\theta_j^T\mathbf{x}_i\right)}$$
where $1\{\cdot\}$ is an indicator function
• Gradient of the cost function
$$\nabla_{\theta_k}J(\phi) = -\sum_{i=1}^{m}\mathbf{x}_i\left(1\{y_i = k\} - p\left(y_i = k\,|\,\mathbf{x}_i;\phi\right)\right)$$
Can you verify?
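One way to verify without redoing the algebra is a finite-difference check (a hypothetical sketch; it compares the analytic gradient above against central differences of $J$):

```python
import numpy as np

def cost(phi, X, y, K):
    """J(phi) = -sum_i log p(y_i | x_i; phi). X: (m, d+1), y: (m,) in {0..K-1}."""
    logits = X @ phi                                   # (m, K)
    logits -= logits.max(axis=1, keepdims=True)        # stable softmax
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(y)), y]).sum()      # indicator picks class y_i

def analytic_grad(phi, X, y, K):
    """-sum_i x_i (1{y_i = k} - p(y_i = k | x_i; phi)), column k per class."""
    logits = X @ phi
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    onehot = np.eye(K)[y]                              # 1{y_i = k}
    return -X.T @ (onehot - p)

rng = np.random.default_rng(0)
m, d1, K = 20, 4, 3
X, y = rng.standard_normal((m, d1)), rng.integers(0, K, m)
phi = rng.standard_normal((d1, K))

eps, num = 1e-6, np.zeros_like(phi)
for idx in np.ndindex(phi.shape):                      # central differences
    dphi = np.zeros_like(phi); dphi[idx] = eps
    num[idx] = (cost(phi + dphi, X, y, K) - cost(phi - dphi, X, y, K)) / (2 * eps)

print(np.max(np.abs(num - analytic_grad(phi, X, y, K))))  # tiny: the formula checks out
```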
Cross entropy
• After the softmax operation, the output vector can be regarded as a discrete probability density function
• For multiclass classification, the ground-truth label for a training sample is usually represented in one-hot form, which can also be regarded as a density function

For example, if we have 10 classes and the ith training sample belongs to class 7, then $y_i = [0\ 0\ 0\ 0\ 0\ 0\ 1\ 0\ 0\ 0]$
• Thus, at the training stage, we want to minimize
$$\sum_i \mathrm{dist}\left(h(\mathbf{x}_i;\theta), y_i\right)$$
How to define dist? Cross entropy is a common choice
Cross entropy
• Information entropy is defined as the average amount of information produced by a probabilistic stochastic source of data:
$$H(X) = -\sum_i p(x_i)\log p(x_i)$$
where $X$ is a random variable and $x_i$ is an event in the sample space represented by $X$
Cross entropy
• Cross entropy can measure the difference between two distributions:
$$H(p, q) = -\sum_i p(x_i)\log q(x_i)$$
where $p$ is the ground truth, $q$ is the prediction result, and $x_i$ is the class index
• For multiclass classification, the last layer is usually a softmax layer and the loss is the cross entropy
Cross entropy
Example: suppose that the label of one sample is [1 0 0]

For model 1, the output of the last softmax layer is [0.5 0.4 0.1]. Its cross entropy is (base-10 logarithms):
$$H(p, q) = -\sum_i p(x_i)\log q(x_i) = -\left(1\cdot\log 0.5 + 0\cdot\log 0.4 + 0\cdot\log 0.1\right) \approx 0.3$$

For model 2, the output of the last softmax layer is [0.8 0.1 0.1]. Its cross entropy is:
$$H(p, q) = -\sum_i p(x_i)\log q(x_i) = -\left(1\cdot\log 0.8 + 0\cdot\log 0.1 + 0\cdot\log 0.1\right) \approx 0.1$$

Model 2 is better
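The arithmetic, checked in a few lines (illustrative NumPy; np.log10 matches the slide's base-10 convention):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_i p(x_i) * log10(q(x_i)), base 10 as on the slide."""
    return -np.sum(p * np.log10(q))

label = np.array([1.0, 0.0, 0.0])
print(cross_entropy(label, np.array([0.5, 0.4, 0.1])))  # ~0.301 (model 1)
print(cross_entropy(label, np.array([0.8, 0.1, 0.1])))  # ~0.097 (model 2)
```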