Transcript of Introduction to Machine Learning (41 slides)
Amo G. Tong, udel.edu/~amotong/teaching/machine learning...

Page 1

Introduction to Machine Learning
Amo G. Tong

Page 2

Lecture 8: Supervised Learning

• Artificial Neural Networks

• Backpropagation Algorithm

• Some materials are courtesy of Vibhave Gogate, Eric Xing and Tom Mitchell.

• All pictures belong to their creators.

Page 3

Supervised Learning

• Data: pairs ⟨𝑥, 𝑦⟩.

• Estimate Pr[𝑦|𝑥] by Bayes' theorem: Pr[𝑦|𝑥] = Pr[𝑥|𝑦] Pr[𝑦] / Pr[𝑥]

• Naïve Bayes.

• Estimate Pr[𝑦|𝑥] directly by assuming a certain form.

• Logistic regression.

• Assume the form of 𝑦 = 𝑓(𝑥).

• Error-driven.

What if Pr[𝑥|𝑦] and Pr[𝑦] are hard to compute, there is no good form for Pr[𝑦|𝑥], and there is no prior knowledge on 𝑓(𝑥)?

You can try Neural Networks.

Page 4

Neural Networks

• Input 𝑥, output 𝑦

• Given 𝑥, how to compute 𝑦?

• 𝑥 and 𝑦 are generally real vectors.

[Diagram: a perceptron and a sigmoid unit.]

Page 5

Neural Networks

• Input 𝑥, output 𝑦

• Given 𝑥, how to compute 𝑦?

• 𝑥 and 𝑦 are generally real vectors.

Page 6

Neural Networks

• Input 𝑥, output 𝑦

• Given 𝑥, how to compute 𝑦?

• 𝑥 and 𝑦 are generally real vectors.

• Many neuron-like units (perceptron, sigmoid).

• Weighted interconnections between units.

• Parallel distributed processing.

Page 7

Neural Networks

• Input 𝑥, output 𝑦

• Given 𝑥, how to compute 𝑦?

• 𝑥 and 𝑦 are generally real vectors.

• Design a neural network:

• What is the function within each unit?

• How are the units connected?

Page 8

Neural Networks

• Input 𝑥, output 𝑦

• Given 𝑥, how to compute 𝑦?

• 𝑥 and 𝑦 are generally real vectors.

• Units:

• Input units

• Processing units

• Receive input from other units

• Output results to other units

• Edges:

• Weights

Page 9

Examples

• Linear function 𝑦 = 𝜔1𝑥1 +⋯+𝜔𝑛𝑥𝑛

• Input (𝑥1, … , 𝑥𝑛), output 𝑦.

[Diagram: inputs 𝑥1, 𝑥2, …, 𝑥𝑛 feed one sum unit through weights 𝜔1, 𝜔2, …, 𝜔𝑛.]
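Such a unit is easy to express directly. A minimal sketch (illustrative; the name `linear_unit` is mine):

```python
def linear_unit(x, w):
    """Weighted-sum unit: y = w1*x1 + ... + wn*xn."""
    return sum(wi * xi for wi, xi in zip(w, x))

# e.g. 0.5*1.0 + (-1.0)*2.0 + 2.0*3.0 = 4.5
y = linear_unit([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
```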

Page 10

Examples

• Quadratic function 𝑦 = (𝜔11𝑥1 + 𝜔12𝑥2)(𝜔21𝑥1 + 𝜔22𝑥2)

• Input (𝑥1, 𝑥2), output 𝑦.

[Diagram: two sum units compute 𝜔11𝑥1 + 𝜔12𝑥2 and 𝜔21𝑥1 + 𝜔22𝑥2; a multiply unit outputs their product 𝑦.]
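The same network written as code, to show how composing units yields the quadratic function (illustrative sketch; `quadratic_unit` is my name):

```python
def quadratic_unit(x1, x2, w):
    """Two sum units feed one multiply unit:
    y = (w[0][0]*x1 + w[0][1]*x2) * (w[1][0]*x1 + w[1][1]*x2)."""
    s1 = w[0][0] * x1 + w[0][1] * x2   # first sum unit
    s2 = w[1][0] * x1 + w[1][1] * x2   # second sum unit
    return s1 * s2                     # multiply unit

# e.g. (1 + 2) * (3 + 4) = 21
y = quadratic_unit(1.0, 1.0, [[1.0, 2.0], [3.0, 4.0]])
```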

Page 11

Train a Neural Network

• Define the error 𝐸

• Update the weight 𝜔 of each edge to minimize 𝐸

• 𝜔 ← 𝜔 − 𝜂 ∂𝐸/∂𝜔 (delta rule)

• Batch Mode

• E: the total error among training data.

• Consider all the training data each time.

• Incremental Mode

• E: the error for one training instance.

• Consider training instances one by one.

Key step: calculate ∂𝐸/∂𝜔.
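Incremental mode for a single linear unit 𝑦 = 𝜔𝑥 looks like this (an illustrative sketch; batch mode would instead sum the gradients over all of the training data before each update):

```python
def incremental_train(data, w=0.0, eta=0.1, epochs=50):
    """Incremental-mode delta rule for y = w*x with per-instance
    error E = 0.5*(t - o)^2, so dE/dw = -(t - o)*x."""
    for _ in range(epochs):
        for x, t in data:              # one training instance at a time
            o = w * x
            w = w + eta * (t - o) * x  # w <- w - eta * dE/dw
    return w

# fits w ~ 2 for data generated by t = 2x
w = incremental_train([(1.0, 2.0), (2.0, 4.0)])
```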

Page 12

Train a Neural Network

• Define the error 𝐸

• Update the weight 𝜔 of each edge to minimize 𝐸

• 𝜔 ← 𝜔 − 𝜂 ∂𝐸/∂𝜔 (delta rule)

• Incremental Mode

• E: the error for one training instance.

• Consider training instances one by one.

Key step: calculate ∂𝐸/∂𝜔.

Page 13

Neural Networks

• One layer of sigmoid with one output

• Two layers of single functions

• Two layers of multiple sigmoid units with one output.

• Two layers of multiple sigmoid units with multiple outputs.

• Goal: how to find the parameters that can minimize the error.

Page 14

Neural Networks

• One layer of sigmoid with one output

[Diagram: inputs 𝑥1, 𝑥2, …, 𝑥𝑛 feed one sigmoid unit with weights 𝜔1, 𝜔2, …, 𝜔𝑛, producing output 𝑜𝑥.]

𝑜𝑥 = 𝑜(𝑥1, … , 𝑥𝑛) = 1/(1 + 𝑒^(−𝝎⋅𝒙))

∂𝐸𝑥/∂𝜔𝑖 = −(𝑡𝑥 − 𝑜𝑥) ∂𝑜𝑥/∂𝜔𝑖 = −(𝑡𝑥 − 𝑜𝑥) ⋅ 𝑥𝑖 ⋅ 𝑜𝑥(1 − 𝑜𝑥)
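This gradient can be sketched and checked against a finite-difference approximation, with per-instance error 𝐸 = (1/2)(𝑡 − 𝑜)² as in the slides (illustrative snippet; `sigmoid_unit_grad` is my name):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_unit_grad(x, w, t):
    """Gradient of E = 0.5*(t - o)^2 for o = sigmoid(w . x):
    dE/dw_i = -(t - o) * x_i * o * (1 - o)."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [-(t - o) * xi * o * (1 - o) for xi in x]
```

A quick numerical check (central differences) confirms the formula.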

Page 15

Neural Networks

• Multilayers

• Update 𝜔1 or 𝜔2 first?

[Diagram: 𝑥 →(𝜔1)→ 𝑓 →(𝜔2)→ 𝑔; the network computes 𝑔(𝜔2 𝑓(𝜔1𝑥)).]

𝐸 = (1/2)(𝑡𝑥 − 𝑔)² for the instance (𝑥, 𝑡𝑥).

Outer weight:

∂𝐸/∂𝜔2 = −(𝑡𝑥 − 𝑔) ∂𝑔/∂𝜔2

∂𝑔/∂𝜔2 = [∂𝑔/∂(𝜔2𝑓)] ⋅ ∂(𝜔2𝑓)/∂𝜔2 = [∂𝑔/∂(𝜔2𝑓)] ⋅ 𝑓

Inner weight:

∂𝐸/∂𝜔1 = −(𝑡𝑥 − 𝑔) ∂𝑔/∂𝜔1

∂𝑔/∂𝜔1 = [∂𝑔/∂(𝜔2𝑓)] ⋅ ∂(𝜔2𝑓)/∂𝜔1
        = [∂𝑔/∂(𝜔2𝑓)] ⋅ 𝜔2 ⋅ ∂𝑓/∂𝜔1
        = [∂𝑔/∂(𝜔2𝑓)] ⋅ 𝜔2 ⋅ [∂𝑓/∂(𝜔1𝑥)] ⋅ ∂(𝜔1𝑥)/∂𝜔1
        = [∂𝑔/∂(𝜔2𝑓)] ⋅ 𝜔2 ⋅ [∂𝑓/∂(𝜔1𝑥)] ⋅ 𝑥
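The chain rule above can be verified numerically. A minimal sketch (illustrative, not from the slides) takes both 𝑓 and 𝑔 to be sigmoids, so ∂𝑔/∂(𝜔2𝑓) = 𝑔(1 − 𝑔) and ∂𝑓/∂(𝜔1𝑥) = 𝑓(1 − 𝑓):

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def dE_dw1(x, t, w1, w2):
    """Chain rule for E = 0.5*(t - g)^2, g = sig(w2*f), f = sig(w1*x):
    dE/dw1 = -(t - g) * g*(1-g) * w2 * f*(1-f) * x."""
    f = sig(w1 * x)
    g = sig(w2 * f)
    return -(t - g) * g * (1 - g) * w2 * f * (1 - f) * x
```

Comparing against a central-difference estimate of ∂𝐸/∂𝜔1 confirms each factor in the chain.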

Page 16

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: inputs 𝑥1, …, 𝑥𝑖 feed a layer of sigmoid units, whose outputs feed one sigmoid output unit producing 𝑜𝑥. Example: 𝑥 = today's weather, target 𝑡𝑥 = 𝑓(𝑥) = tomorrow's temperature (normalized); the network computes 𝑓(𝑥, 𝜔), where 𝜔 is the weight on each edge.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2)(𝑡𝑥 − 𝑜𝑥)²

Page 17

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: inputs 𝑥1, …, 𝑥𝑖 feed hidden sigmoid units with outputs ℎ𝑥1, …, ℎ𝑥𝑗, which feed one sigmoid output unit 𝑜𝑥 through weights 𝜔𝑗. Here ℎ𝑥𝑗 denotes the output of hidden unit 𝑗 on input 𝑥.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2)(𝑡𝑥 − 𝑜𝑥)²

Output-layer weights:

∂𝐸𝑥/∂𝜔𝑗 = −(𝑡𝑥 − 𝑜𝑥) ∂𝑜𝑥/∂𝜔𝑗 = −(𝑡𝑥 − 𝑜𝑥) ⋅ ℎ𝑥𝑗 ⋅ 𝑜𝑥(1 − 𝑜𝑥)

Page 18

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: same network; 𝜔𝑖,𝑗 is the weight from input 𝑖 to hidden unit 𝑗, and 𝜔𝑗 the weight from hidden unit 𝑗 to the output.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2)(𝑡𝑥 − 𝑜𝑥)²

𝑜𝑥 = 1/(1 + 𝑒^(−∑𝑗 𝜔𝑗 ℎ𝑥𝑗))

Input-layer weights:

∂𝐸𝑥/∂𝜔𝑖,𝑗 = −(𝑡𝑥 − 𝑜𝑥) ∂𝑜𝑥/∂𝜔𝑖,𝑗 = −(𝑡𝑥 − 𝑜𝑥) ⋅ [∂𝑜𝑥/∂ℎ𝑥𝑗] ⋅ [∂ℎ𝑥𝑗/∂𝜔𝑖,𝑗]

Page 19

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: same network.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2)(𝑡𝑥 − 𝑜𝑥)²

𝑜𝑥 = 1/(1 + 𝑒^(−∑𝑗 𝜔𝑗 ℎ𝑥𝑗))

∂𝑜𝑥/∂ℎ𝑥𝑗 = 𝜔𝑗 ⋅ 𝑜𝑥(1 − 𝑜𝑥)

∂ℎ𝑥𝑗/∂𝜔𝑖,𝑗 = 𝑥𝑖 ⋅ ℎ𝑥𝑗(1 − ℎ𝑥𝑗)

Page 20

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: input 𝑥𝑖 →(𝜔𝑖,𝑗)→ hidden sigmoid ℎ𝑥𝑗 →(𝜔𝑗)→ output sigmoid 𝑜𝑥.]

Putting the pieces together:

∂𝐸𝑥/∂𝜔𝑗 = −(𝑡𝑥 − 𝑜𝑥) ⋅ 𝑜𝑥(1 − 𝑜𝑥) ⋅ ℎ𝑥𝑗

∂𝐸𝑥/∂𝜔𝑖,𝑗 = −(𝑡𝑥 − 𝑜𝑥) ⋅ 𝜔𝑗 ⋅ 𝑜𝑥(1 − 𝑜𝑥) ⋅ 𝑥𝑖 ⋅ ℎ𝑥𝑗(1 − ℎ𝑥𝑗)

Define the error terms:

𝛿𝑥𝑜 = (𝑡𝑥 − 𝑜𝑥) 𝑜𝑥(1 − 𝑜𝑥)

𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) 𝜔𝑗 𝛿𝑥𝑜

Then ∂𝐸𝑥/∂𝜔𝑗 = −ℎ𝑥𝑗 𝛿𝑥𝑜 and ∂𝐸𝑥/∂𝜔𝑖,𝑗 = −𝑥𝑖 𝛿𝑥ℎ,𝑗, so the update 𝜔 ← 𝜔 + Δ𝜔 with Δ𝜔 = −𝜂 ∂𝐸/∂𝜔 becomes:

Δ𝜔𝑗 = 𝜂 ℎ𝑥𝑗 𝛿𝑥𝑜

Δ𝜔𝑖,𝑗 = 𝜂 𝑥𝑖 𝛿𝑥ℎ,𝑗
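These error terms and updates translate directly into code. A minimal sketch for one incremental update (illustrative; names like `backprop_step` are mine, and bias terms are omitted to match the slides' equations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_in, w_out):
    """w_in[j][i]: weight from input i to hidden unit j; w_out[j]: hidden j -> output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)))
    return h, o

def backprop_step(x, t, w_in, w_out, eta=0.5):
    """One incremental update: forward pass, error terms, delta-rule updates."""
    h, o = forward(x, w_in, w_out)
    delta_o = (t - o) * o * (1 - o)                                       # output error term
    delta_h = [hj * (1 - hj) * wj * delta_o for hj, wj in zip(h, w_out)]  # hidden error terms
    w_out = [wj + eta * hj * delta_o for wj, hj in zip(w_out, h)]
    w_in = [[w + eta * xi * dj for w, xi in zip(row, x)]
            for row, dj in zip(w_in, delta_h)]
    return w_in, w_out
```

Repeating the step on one instance drives the output toward the target.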

Page 21

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: same network.]

𝜔 ← 𝜔 + Δ𝜔

Δ𝜔𝑗 = 𝜂 ℎ𝑥𝑗 𝛿𝑥𝑜

Δ𝜔𝑖,𝑗 = 𝜂 𝑥𝑖 𝛿𝑥ℎ,𝑗

ℎ𝑥𝑗 and 𝑥𝑖: the input carried by the edge. 𝛿𝑥𝑜 and 𝛿𝑥ℎ,𝑗: the error terms associated with the units, 𝛿 = −∂𝐸/∂(unit).

General rule, for an edge 𝑎 →(𝜔)→ 𝑏 with input from 𝑎 and error term 𝛿 = −∂𝐸/∂𝑏 on 𝑏:

Δ𝜔 = 𝜂 ⋅ 𝑎 ⋅ 𝛿

Page 22

Neural Networks

• Two layers of multiple sigmoid units with one output.

[Diagram: same network.]

𝜔 ← 𝜔 + Δ𝜔

Δ𝜔𝑗 = 𝜂 ℎ𝑥𝑗 𝛿𝑥𝑜

Δ𝜔𝑖,𝑗 = 𝜂 𝑥𝑖 𝛿𝑥ℎ,𝑗

𝛿𝑥𝑜 = (𝑡𝑥 − 𝑜𝑥) 𝑜𝑥(1 − 𝑜𝑥)

𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) 𝜔𝑗 𝛿𝑥𝑜

Backpropagation Framework: for each 𝑥 in 𝐷:

• Forward: calculate ℎ𝑥𝑗 and 𝑜𝑥.

• Backward: calculate 𝛿𝑥𝑜 and 𝛿𝑥ℎ,𝑗.

• Update 𝜔𝑗 and 𝜔𝑖,𝑗.

Page 23

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

[Diagram: inputs 𝑥1, …, 𝑥𝑖 feed hidden sigmoid units ℎ𝑥1, …, ℎ𝑥𝑗, which feed output sigmoid units 𝑜𝑥1, …, 𝑜𝑥𝑘. Example: 𝑥 = today's weather, target 𝑡𝑥 = 𝑓(𝑥) = tomorrow's weather (normalized); the network computes 𝑓(𝑥, 𝜔), where 𝜔 is the weight on each edge.]

Page 24

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

[Diagram: same multi-output network, with 𝜔𝑖,𝑗 from input 𝑖 to hidden unit 𝑗 and 𝜔𝑗,𝑘 from hidden unit 𝑗 to output unit 𝑘.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2) ∑𝑘 (𝑡𝑥𝑘 − 𝑜𝑥𝑘)²

Recall the general rule: for an edge 𝑎 →(𝜔)→ 𝑏, Δ𝜔 = 𝜂 ⋅ (input from 𝑎) ⋅ (error term on 𝑏).

Page 25

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

[Diagram: same multi-output network.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2) ∑𝑘 (𝑡𝑥𝑘 − 𝑜𝑥𝑘)²

One output: 𝛿𝑥𝑜 = (𝑡𝑥 − 𝑜𝑥) 𝑜𝑥(1 − 𝑜𝑥)

Multiple outputs: 𝛿𝑥𝑜,𝑘 = (𝑡𝑥𝑘 − 𝑜𝑥𝑘) 𝑜𝑥𝑘(1 − 𝑜𝑥𝑘)

Δ𝜔𝑗,𝑘 = 𝜂 ℎ𝑥𝑗 𝛿𝑥𝑜,𝑘

Page 26

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

[Diagram: same multi-output network.]

One output: 𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) 𝜔𝑗 𝛿𝑥𝑜

Multiple outputs: 𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) ∑𝑘 𝜔𝑗,𝑘 𝛿𝑥𝑜,𝑘

If not fully connected, sum over the connected units.

Page 27

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

[Diagram: same multi-output network.]

For each 𝑥 in 𝐷: 𝐸𝑥 = (1/2) ∑𝑘 (𝑡𝑥𝑘 − 𝑜𝑥𝑘)²

One output: 𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) 𝜔𝑗 𝛿𝑥𝑜

Multiple outputs: 𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) ∑𝑘 𝜔𝑗,𝑘 𝛿𝑥𝑜,𝑘

Δ𝜔𝑖,𝑗 = 𝜂 𝑥𝑖 𝛿𝑥ℎ,𝑗

Page 28

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

𝜔 ← 𝜔 + Δ𝜔

Δ𝜔𝑗,𝑘 = 𝜂 ℎ𝑥𝑗 𝛿𝑥𝑜,𝑘

Δ𝜔𝑖,𝑗 = 𝜂 𝑥𝑖 𝛿𝑥ℎ,𝑗

𝛿𝑥𝑜,𝑘 = (𝑡𝑥𝑘 − 𝑜𝑥𝑘) 𝑜𝑥𝑘(1 − 𝑜𝑥𝑘)

𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) ∑𝑘 𝜔𝑗,𝑘 𝛿𝑥𝑜,𝑘

Backpropagation Framework: for each 𝑥 in 𝐷:

• Forward: calculate ℎ𝑥𝑗 and 𝑜𝑥𝑘.

• Backward: calculate 𝛿𝑥𝑜,𝑘 and 𝛿𝑥ℎ,𝑗.

• Update 𝜔𝑗,𝑘 and 𝜔𝑖,𝑗.

[Diagram: same multi-output network.]

Page 29

Neural Networks

• Two layers of multiple sigmoid units with multiple outputs.

• Initialize the weights with random values.

• Do until convergence:

• For each 𝑥 ∈ 𝐷:

• (Forward) Calculate ℎ𝑥𝑗 and 𝑜𝑥𝑘.

• (Backward) Calculate 𝛿𝑥𝑜,𝑘 and 𝛿𝑥ℎ,𝑗.

• Update: 𝜔 ← 𝜔 + Δ𝜔, with Δ𝜔𝑗,𝑘 = 𝜂 ℎ𝑥𝑗 𝛿𝑥𝑜,𝑘 and Δ𝜔𝑖,𝑗 = 𝜂 𝑥𝑖 𝛿𝑥ℎ,𝑗

where 𝛿𝑥𝑜,𝑘 = (𝑡𝑥𝑘 − 𝑜𝑥𝑘) 𝑜𝑥𝑘(1 − 𝑜𝑥𝑘) and 𝛿𝑥ℎ,𝑗 = ℎ𝑥𝑗(1 − ℎ𝑥𝑗) ∑𝑘 𝜔𝑗,𝑘 𝛿𝑥𝑜,𝑘.
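The whole procedure fits in a short pure-Python sketch (illustrative; the function names, fixed epoch count in place of a convergence test, toy data, and omission of bias terms are my choices, following the slides' equations):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w_in, w_out):
    """Forward pass: hidden sigmoid layer, then sigmoid output layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    return [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]

def train(D, n_in, n_hidden, n_out, eta=0.5, epochs=1000, seed=0):
    """Backpropagation for a two-layer sigmoid network, incremental mode.
    D is a list of (x, t) pairs with x and t given as lists."""
    rng = random.Random(seed)
    # initialize the weights with random values
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in D:
            # forward: h_xj and o_xk
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
            o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
            # backward: error terms
            d_o = [(tk - ok) * ok * (1 - ok) for tk, ok in zip(t, o)]
            d_h = [hj * (1 - hj) * sum(w_out[k][j] * d_o[k] for k in range(n_out))
                   for j, hj in enumerate(h)]
            # update the weights
            for k in range(n_out):
                for j in range(n_hidden):
                    w_out[k][j] += eta * h[j] * d_o[k]
            for j in range(n_hidden):
                for i in range(n_in):
                    w_in[j][i] += eta * x[i] * d_h[j]
    return w_in, w_out
```

For example, training on two one-hot inputs with swapped targets drives the two outputs apart.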

Page 30

Neural Networks

• Train a neural network with:

• One sigmoid unit.

• Two layers of multiple sigmoid units with one output.

• Two layers of multiple sigmoid units with multiple outputs.

• Backpropagation framework: any units, any acyclic graph.

• Update the weights from right (output) to left (input).

Page 31

Neural Networks

• Backpropagation framework

• Good news: using the network is fast.

• Bad news: training the network is slow.

• More bad news: it may converge to local minima.

• More good news: it performs well in practice.

• To avoid local minima

• Add a momentum term:

• Δ𝜔𝑛* = Δ𝜔𝑛 + 𝛼 Δ𝜔𝑛−1, where Δ𝜔𝑛−1 is the update of the previous iteration.

• Train with different initializations.
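The momentum update can be sketched as follows (illustrative; `momentum_step` and the concrete 𝜂, 𝛼 values are mine):

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    """dw*_n = -eta * dE/dw + alpha * dw_{n-1}; then w <- w + dw*_n.
    Returns the new weights and the update, to be reused next step."""
    dw = [-eta * g + alpha * p for g, p in zip(grad, prev_dw)]
    return [wi + d for wi, d in zip(w, dw)], dw
```

The 𝛼-weighted previous update keeps the search moving through flat regions and small local dips.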

Page 32

Neural Networks

• Representational Power of Neural Networks.

• Boolean functions: every Boolean function can be represented exactly by some network with two layers of units.

• Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units (sigmoid + linear will do).

• Arbitrary functions: any function can be approximated to arbitrary accuracy by a network with three layers of units (sigmoid + sigmoid + linear will do).

Page 33

• Learn an identity function with eight training examples.


An example (Mitchell)

Page 34

• Learn an identity function with eight training examples.


An example (Mitchell)

Page 35

• Learn an identity function with eight training examples.


An example (Mitchell)

[Table: the eight 3-bit hidden-unit values, one row per training example.]

1 0 0
0 0 1
0 1 0
1 1 1
0 0 0
0 1 1
1 0 1
1 1 0
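A small sanity check of why three hidden units suffice for eight examples (illustrative snippet): there are exactly 2³ = 8 distinct binary hidden codes, one per one-hot input, matching the eight 3-bit rows on this slide.

```python
from itertools import product

# the eight one-hot training examples of the identity task
examples = [[1 if j == i else 0 for j in range(8)] for i in range(8)]

# three hidden units offer exactly 2**3 = 8 distinct rounded activation
# patterns, so each example can receive its own hidden code
codes = sorted(product([0, 1], repeat=3))
```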

Page 36

• Overfitting is everywhere…

• Overfitting may happen when:

• The learned model is too complicated.

• The data is fitted too well when it has noise.

• The training set is small.

• Avoid large parameters.


Overfitting

Page 37

• Overfitting is everywhere…


Overfitting

Page 38

• Overfitting is everywhere…


Overfitting

Page 39

• Overfitting is everywhere…

• Weight Decay:

• Decrease each weight by some small factor during each iteration?

• Penalize large weights:

• 𝐸𝑟𝑟𝑜𝑟′ = 𝐸𝑟𝑟𝑜𝑟 + 𝛾 ∑𝑖 𝜔𝑖²

• Use validation data.
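A gradient step on the penalized error 𝐸 + 𝛾∑𝜔𝑖² adds 2𝛾𝜔𝑖 to each partial derivative, which acts exactly like shrinking each weight by a small factor per iteration, connecting the two bullets above. A minimal sketch (illustrative names):

```python
def weight_decay_step(w, grad, eta=0.1, gamma=0.01):
    """Gradient step on E + gamma * sum(w_i^2): the extra 2*gamma*w_i
    term shrinks each weight by a factor (1 - 2*eta*gamma) per step."""
    return [wi - eta * (gi + 2 * gamma * wi) for wi, gi in zip(w, grad)]
```

With a zero gradient, each step multiplies the weights by (1 − 2𝜂𝛾), so unused weights decay toward zero.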


Overfitting

Page 40

• Overfitting is everywhere…

• Weight Decay:

• Decrease each weight by some small factor during each iteration?

• Penalize large weights:

• 𝐸𝑟𝑟𝑜𝑟′ = 𝐸𝑟𝑟𝑜𝑟 + 𝛾 ∑𝑖 𝜔𝑖²

• Use validation data.


Overfitting

Page 41

• What is an artificial neural network?

• Process the input by the connected units.

• How to train a neural network?

• Error-driven.

• Backpropagation algorithm.

• Two layers of sigmoid units.


Summary