Introduction of Machine / Deep Learning


Transcript of Introduction of Machine / Deep Learning

Page 1: Introduction of Machine / Deep Learning

Introduction of Machine / Deep Learning

Hung-yi Lee (李宏毅)

Page 2: Introduction of Machine / Deep Learning

Machine Learning ≈ Looking for a Function

• Speech Recognition: f(audio clip) = "How are you"

• Image Recognition: f(image) = "Cat"

• Playing Go: f(board position) = "5-5" (next move)

Page 3: Introduction of Machine / Deep Learning

Different types of Functions

Regression: the function outputs a scalar.

Example — predicting PM2.5: f(PM2.5 today, temperature, concentration of O3) = PM2.5 of tomorrow

Classification: given options (classes), the function outputs the correct one.

Example — spam filtering: f(email) = Yes/No

Page 4: Introduction of Machine / Deep Learning

Different types of Functions

Classification: given options (classes), the function outputs the correct one.

Playing Go: the function takes a position on the board and outputs the next move. Each position is a class (19 x 19 classes).

Page 5: Introduction of Machine / Deep Learning

Besides Regression and Classification, there is Structured Learning: create something with structure (image, document).

Page 6: Introduction of Machine / Deep Learning

How to find a function? A Case Study

Page 7: Introduction of Machine / Deep Learning

YouTube Channel

https://www.youtube.com/c/HungyiLeeNTU

Page 8: Introduction of Machine / Deep Learning

The function we want to find: $y = f(\cdot)$, where $y$ is the no. of views on 2/26.

Page 9: Introduction of Machine / Deep Learning

1. Function with Unknown Parameters

๐‘ฆ = ๐‘ + ๐‘ค๐‘ฅ!

๐‘ฆ = ๐‘“

๐‘ฆ: no. of views on 2/26, ๐‘ฅ!: no. of views on 2/25

based on domain knowledge

๐‘ค and ๐‘ are unknown parameters (learned from data)

Model

weight bias

feature
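As a minimal Python sketch (the function names are mine, not the slides'), this model is simply:

```python
def model(x1, w, b):
    """Linear model y = b + w * x1: predict tomorrow's views from today's."""
    return b + w * x1
```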

Page 10: Introduction of Machine / Deep Learning

2. Define Loss from Training Data

Data from 2017/01/01 – 2020/12/31: 4.8k views on 01/01, 4.9k on 01/02, 7.5k on 01/03, …, 3.4k on 2020/12/30, 9.8k on 12/31.

• Loss is a function of the parameters: $L(b, w)$

• Loss: how good a set of values is.

Example — how good is $L(0.5k, 1)$, i.e. $y = 0.5k + 1x_1$? Feeding in 01/01's views gives $y = 0.5k + 1 \times 4.8k = 5.3k$, while the label (the true views on 01/02) is $\hat{y} = 4.9k$, so the error is $e_1 = |y - \hat{y}| = 0.4k$.

Page 11: Introduction of Machine / Deep Learning

2. Define Loss from Training Data

Continuing over the data from 2017/01/01 – 2020/12/31: feeding in 01/02's 4.9k gives $y = 0.5k + 1 \times 4.9k = 5.4k$, while the label for 01/03 is $\hat{y} = 7.5k$, so $e_2 = |y - \hat{y}| = 2.1k$. Repeating for every day (ending with 2020/12/30's 3.4k against 12/31's label of 9.8k) yields an error $e_n$ for each day.

• Loss is a function of the parameters: $L(b, w)$

• Loss: how good a set of values is.

Page 12: Introduction of Machine / Deep Learning

2. Define Loss from Training Data

Loss: $L = \frac{1}{N}\sum_n e_n$

If $e = |y - \hat{y}|$, $L$ is mean absolute error (MAE).

If $e = (y - \hat{y})^2$, $L$ is mean square error (MSE).

If $y$ and $\hat{y}$ are both probability distributions: cross-entropy.

• Loss is a function of the parameters: $L(b, w)$

• Loss: how good a set of values is.
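A minimal NumPy sketch of this loss (the five-day series below is a made-up stand-in for the real 2017–2020 data, which the slides do not include):

```python
import numpy as np

views = np.array([4.8, 4.9, 7.5, 3.4, 9.8])  # daily views in thousands (k)

def loss(b, w, metric="mae"):
    """L(b, w) = (1/N) * sum_n e_n for the model y = b + w * x1."""
    x1 = views[:-1]       # feature: today's views
    y_hat = views[1:]     # label: tomorrow's true views
    y = b + w * x1        # model predictions
    e = np.abs(y - y_hat) if metric == "mae" else (y - y_hat) ** 2
    return e.mean()

print(loss(0.5, 1.0))  # evaluates L(0.5k, 1) on the toy series
```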

Page 13: Introduction of Machine / Deep Learning

2. Define Loss from Training Data

[Error Surface: a contour plot of $L(b, w)$ over the $(b, w)$ plane, with regions of small $L$ and large $L$ marked.]

• Loss is a function of the parameters: $L(b, w)$

• Loss: how good a set of values is.

Model: $y = b + wx_1$

Page 14: Introduction of Machine / Deep Learning

3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

Gradient Descent:

• (Randomly) pick an initial value $w^0$.

• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$. If the slope is negative, increase $w$; if positive, decrease $w$.

Source of image: http://chico386.pixnet.net/album/photo/171572850

Page 15: Introduction of Machine / Deep Learning

3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

Gradient Descent:

• (Randomly) pick an initial value $w^0$.

• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$.

• Update: $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0}$

$\eta$: learning rate, a hyperparameter.

Source of image: http://chico386.pixnet.net/album/photo/171572850

Page 16: Introduction of Machine / Deep Learning

3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

Gradient Descent:

• (Randomly) pick an initial value $w^0$.

• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$ and update $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0}$.

• Update $w$ iteratively: $w^0 \rightarrow w^1 \rightarrow w^2 \rightarrow \dots$

Gradient descent can stop at a local minimum instead of the global minimum. But do local minima truly cause the problem?

Source of image: http://chico386.pixnet.net/album/photo/171572850

Page 17: Introduction of Machine / Deep Learning

3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

• (Randomly) pick initial values $w^0, b^0$.

• Compute $\left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$ and update $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0, b=b^0}$

  Compute $\left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$ and update $b^1 \leftarrow b^0 - \eta \left.\frac{\partial L}{\partial b}\right|_{w=w^0, b=b^0}$

  (Computing these gradients can be done in one line in most deep learning frameworks.)

• Update $w$ and $b$ iteratively, as in the sketch below.
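A minimal NumPy sketch of this two-parameter gradient descent, using the MSE version of the loss so the gradients have a simple closed form (the data series and hyperparameter values are made up):

```python
import numpy as np

views = np.array([4.8, 4.9, 7.5, 3.4, 9.8, 5.3, 6.1])  # stand-in daily views (k)
x1, y_hat = views[:-1], views[1:]   # feature: today's views; label: tomorrow's

w, b = 0.0, 0.0                     # initial values w0, b0
eta = 0.001                         # learning rate
for step in range(10_000):
    y = b + w * x1                          # model prediction
    grad_w = 2 * np.mean((y - y_hat) * x1)  # dL/dw for MSE loss
    grad_b = 2 * np.mean(y - y_hat)         # dL/db for MSE loss
    w -= eta * grad_w                       # w <- w - eta * dL/dw
    b -= eta * grad_b                       # b <- b - eta * dL/db

print(w, b)                         # learned parameters w*, b*
```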

Page 18: Introduction of Machine / Deep Learning

3. Optimization

$w^*, b^* = \arg\min_{w,b} L$

[On the error surface over $(w, b)$: at each point, compute $\partial L/\partial w$ and $\partial L/\partial b$, move by $(-\eta\,\partial L/\partial w,\; -\eta\,\partial L/\partial b)$, then compute the gradients again and repeat.]

For the model $y = b + wx_1$, the result is $w^* = 0.97$, $b^* = 0.1k$, with $L(w^*, b^*) = 0.48k$.

Page 19: Introduction of Machine / Deep Learning

Step 1: function with unknown parameters: $y = b + wx_1$

Step 2: define loss from training data

Step 3: optimization: $w^* = 0.97$, $b^* = 0.1k$, $L(w^*, b^*) = 0.48k$

Machine Learning is so simple ……

Page 20: Introduction of Machine / Deep Learning

Step 1: function with unknown parameters: $y = b + wx_1$

Step 2: define loss from training data

Step 3: optimization: $w^* = 0.97$, $b^* = 0.1k$, $L(w^*, b^*) = 0.48k$

Machine Learning is so simple ……

$y = 0.1k + 0.97x_1$ achieves the smallest loss, $L = 0.48k$, on the data of 2017 – 2020 (training data).

How about the data of 2021 (unseen during training)? There, $L' = 0.58k$.

Page 21: Introduction of Machine / Deep Learning

[Plot of views (k) from 2021/01/01 to 2021/02/14 under $y = 0.1k + 0.97x_1$. Red: real no. of views; blue: estimated no. of views.]

Page 22: Introduction of Machine / Deep Learning

๐‘ฆ = ๐‘ +,#$!

%&

๐‘ค# ๐‘ฅ#

๐‘ฆ = ๐‘ + ๐‘ค๐‘ฅ!

๐‘ฆ = ๐‘ +,#$!

'

๐‘ค# ๐‘ฅ#

๐‘ฆ = ๐‘ +,#$!

()

๐‘ค# ๐‘ฅ#

๐ฟโ€ฒ = 0.58๐‘˜๐ฟ = 0.48๐‘˜

๐ฟโ€ฒ = 0.49๐‘˜๐ฟ = 0.38๐‘˜

๐ฟโ€ฒ = 0.46๐‘˜๐ฟ = 0.33๐‘˜

2017 - 2020 2021

2017 - 2020

2017 - 2020

2021

2021

๐’ƒ ๐’˜๐Ÿโˆ— ๐’˜๐Ÿ

โˆ— ๐’˜๐Ÿ‘โˆ— ๐’˜๐Ÿ’

โˆ— ๐’˜๐Ÿ“โˆ— ๐’˜๐Ÿ”

โˆ— ๐’˜๐Ÿ•โˆ—

0.05k 0.79 -0.31 0.12 -0.01 -0.10 0.30 0.18

Linear models

๐ฟโ€ฒ = 0.46๐‘˜๐ฟ = 0.32๐‘˜2017 - 2020 2021

Page 23: Introduction of Machine / Deep Learning

[Plot: for $y = b + wx_1$, the relation between $y$ and $x_1$ is always a straight line; different $w$ changes the slope, different $b$ shifts it.]

Linear models have a severe limitation: Model Bias.

Linear models are too simple … we need more sophisticated, more flexible models!

Page 24: Introduction of Machine / Deep Learning

[Plot: a red piecewise linear curve of $y$ against $x_1$, decomposed into pieces labeled 0, 1, 2, 3.]

red curve = constant + sum of a set of blue (hard-sigmoid-shaped) curves

Page 25: Introduction of Machine / Deep Learning

All Piecewise Linear Curves

Any piecewise linear curve = constant + sum of a set of such curves. More pieces require more of these component curves.

Page 26: Introduction of Machine / Deep Learning

Beyond Piecewise Linear?

Approximate a continuous curve by a piecewise linear curve. To have a good approximation, we need sufficient pieces.

Page 27: Introduction of Machine / Deep Learning

How to represent the blue Hard Sigmoid curve? Approximate it with the Sigmoid Function:

$y = c\,\frac{1}{1 + e^{-(b + wx_1)}} = c\,\mathrm{sigmoid}(b + wx_1)$

Page 28: Introduction of Machine / Deep Learning

Different $w$: change slopes

Different $b$: shift

Different $c$: change height

Page 29: Introduction of Machine / Deep Learning

red curve = constant + sum of a set of sigmoid curves (pieces 1, 2, 3):

$c_1\,\mathrm{sigmoid}(b_1 + w_1 x_1)$

$c_2\,\mathrm{sigmoid}(b_2 + w_2 x_1)$

$c_3\,\mathrm{sigmoid}(b_3 + w_3 x_1)$

$y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$

Page 30: Introduction of Machine / Deep Learning

New Model: More Features

Single feature: $y = b + wx_1 \;\rightarrow\; y = b + \sum_i c_i\,\mathrm{sigmoid}(b_i + w_i x_1)$

Multiple features: $y = b + \sum_j w_j x_j \;\rightarrow\; y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$

Page 31: Introduction of Machine / Deep Learning

$y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$

$i$: 1, 2, 3 (no. of sigmoids); $j$: 1, 2, 3 (no. of features)

$w_{ij}$: weight for $x_j$ for the $i$-th sigmoid

$r_1 = b_1 + w_{11}x_1 + w_{12}x_2 + w_{13}x_3$

$r_2 = b_2 + w_{21}x_1 + w_{22}x_2 + w_{23}x_3$

$r_3 = b_3 + w_{31}x_1 + w_{32}x_2 + w_{33}x_3$

Page 32: Introduction of Machine / Deep Learning

In matrix form:

$\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$

that is, $\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}$

Page 33: Introduction of Machine / Deep Learning

[Network diagram: features $x_1, x_2, x_3$ are combined through weights $w_{ij}$ and biases $b_i$ to give $r_1, r_2, r_3$.]

$\boldsymbol{r} = \boldsymbol{b} + W\boldsymbol{x}$

Page 34: Introduction of Machine / Deep Learning

[Network diagram: $r_1, r_2, r_3$ pass through sigmoids to give $a_1, a_2, a_3$.]

$\boldsymbol{a} = \sigma(\boldsymbol{r})$, e.g. $a_1 = \mathrm{sigmoid}(r_1) = \frac{1}{1 + e^{-r_1}}$

Page 35: Introduction of Machine / Deep Learning

[Network diagram: $a_1, a_2, a_3$ are combined with weights $c_1, c_2, c_3$ and bias $b$ to give the output $y$.]

$y = \boldsymbol{c}^\top \boldsymbol{a} + b$

Page 36: Introduction of Machine / Deep Learning

[Full network diagram from features $x_1, x_2, x_3$ to output $y$.]

$\boldsymbol{a} = \sigma(\boldsymbol{b} + W\boldsymbol{x})$

$y = \boldsymbol{c}^\top \boldsymbol{a} + b$

Page 37: Introduction of Machine / Deep Learning

Putting the pieces together:

$y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$

Page 38: Introduction of Machine / Deep Learning

Function with unknown parameters: $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$, where $\boldsymbol{x}$ is the feature vector.

The unknown parameters (the rows of $W$, $\boldsymbol{b}$, $\boldsymbol{c}^\top$, and $b$) are collected into one long vector:

$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \\ \vdots \end{bmatrix}$

Page 39: Introduction of Machine / Deep Learning

Back to the ML Framework

Step 1: function with unknown parameters: $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$

Step 2: define loss from training data

Step 3: optimization

Page 40: Introduction of Machine / Deep Learning

Loss

Given a set of parameter values, feed a feature $\boldsymbol{x}$ into $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$ and compare the output with the label $\hat{y}$ to get the error $e$.

Loss: $L = \frac{1}{N}\sum_n e_n$

• Loss is a function of the parameters: $L(\boldsymbol{\theta})$

• Loss means how good a set of values is.

Page 41: Introduction of Machine / Deep Learning

Back to the ML Framework

Step 1: function with unknown parameters: $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$

Step 2: define loss from training data

Step 3: optimization

Page 42: Introduction of Machine / Deep Learning

Optimization of New Model

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.

• Compute the gradient $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^0)$, i.e. $\boldsymbol{g} = \begin{bmatrix} \left.\frac{\partial L}{\partial \theta_1}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \left.\frac{\partial L}{\partial \theta_2}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$

• Update: $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$, componentwise $\begin{bmatrix} \theta_1^1 \\ \theta_2^1 \\ \vdots \end{bmatrix} \leftarrow \begin{bmatrix} \theta_1^0 \\ \theta_2^0 \\ \vdots \end{bmatrix} - \begin{bmatrix} \eta\left.\frac{\partial L}{\partial \theta_1}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \eta\left.\frac{\partial L}{\partial \theta_2}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^0} \\ \vdots \end{bmatrix}$

Page 43: Introduction of Machine / Deep Learning

Optimization of New Model

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} L$

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.

• Compute gradient $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^0)$; update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$

• Compute gradient $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^1)$; update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}$

• Compute gradient $\boldsymbol{g} = \nabla L(\boldsymbol{\theta}^2)$; update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}$

Page 44: Introduction of Machine / Deep Learning

Optimization of New Model

In practice the $N$ training examples are divided into batches of size $B$. The loss over all data is $L$, but each update uses the loss $L^1, L^2, L^3, \dots$ computed on a single batch:

• (Randomly) pick initial values $\boldsymbol{\theta}^0$.

• Compute gradient $\boldsymbol{g} = \nabla L^1(\boldsymbol{\theta}^0)$; update $\boldsymbol{\theta}^1 \leftarrow \boldsymbol{\theta}^0 - \eta\boldsymbol{g}$

• Compute gradient $\boldsymbol{g} = \nabla L^2(\boldsymbol{\theta}^1)$; update $\boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^1 - \eta\boldsymbol{g}$

• Compute gradient $\boldsymbol{g} = \nabla L^3(\boldsymbol{\theta}^2)$; update $\boldsymbol{\theta}^3 \leftarrow \boldsymbol{\theta}^2 - \eta\boldsymbol{g}$

1 epoch = see all the batches once.

Page 45: Introduction of Machine / Deep Learning

Optimization of New Model

Example 1: 10,000 examples (N = 10,000), batch size 10 (B = 10). How many updates in 1 epoch? 1,000 updates.

Example 2: 1,000 examples (N = 1,000), batch size 100 (B = 100). How many updates in 1 epoch? 10 updates.
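A minimal Python sketch of one epoch of batched updates (the per-batch gradient computation is elided; the counts match Example 1):

```python
import numpy as np

N, B = 10_000, 10                   # no. of examples and batch size
indices = np.random.permutation(N)  # shuffle the examples into batches

updates = 0
for start in range(0, N, B):        # one pass over all batches = 1 epoch
    batch = indices[start:start + B]
    # ...compute L^k on this batch, its gradient g, and update theta here...
    updates += 1

print(updates)                      # 1000 updates in one epoch
```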

Page 46: Introduction of Machine / Deep Learning

Back to the ML Framework

Step 1: function with unknown parameters: $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$

Step 2: define loss from training data

Step 3: optimization

More variety of models …

Page 47: Introduction of Machine / Deep Learning

Sigmoid → ReLU

How else can we represent the hard sigmoid? As the sum of two Rectified Linear Unit (ReLU) curves:

$c\,\max(0, b + wx_1) + c'\,\max(0, b' + w'x_1)$
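A minimal NumPy sketch of this construction (the parameter values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def hard_sigmoid(x1, c=1.0, b=1.0, w=1.0, c2=-1.0, b2=-1.0, w2=1.0):
    """Hard sigmoid as a sum of two ReLUs: c*max(0, b+w*x1) + c2*max(0, b2+w2*x1)."""
    return c * relu(b + w * x1) + c2 * relu(b2 + w2 * x1)

x = np.linspace(-3, 3, 7)
print(hard_sigmoid(x))  # flat at 0, then a linear ramp, then flat at 2
```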

Page 48: Introduction of Machine / Deep Learning

Sigmoid → ReLU

$y = b + \sum_i c_i\,\mathrm{sigmoid}\Big(b_i + \sum_j w_{ij} x_j\Big)$

$y = b + \sum_{2i} c_i\,\max\Big(0,\, b_i + \sum_j w_{ij} x_j\Big)$ (two ReLUs replace each sigmoid)

Sigmoid and ReLU are activation functions. Which one is better?

Page 49: Introduction of Machine / Deep Learning

Experimental Results

For $y = b + \sum_{2i} c_i\,\max\big(0,\, b_i + \sum_j w_{ij} x_j\big)$:

             linear   10 ReLU   100 ReLU   1000 ReLU
2017 – 2020   0.32k    0.32k     0.28k      0.27k
2021          0.46k    0.45k     0.43k      0.43k

Page 50: Introduction of Machine / Deep Learning

Back to the ML Framework

Step 1: function with unknown parameters: $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$

Step 2: define loss from training data

Step 3: optimization

Even more variety of models …

Page 51: Introduction of Machine / Deep Learning

[Network diagram: the features $x_1, x_2, x_3$ pass through one layer of sigmoid (or ReLU) units to give $a_1, a_2, a_3$, which are fed into another layer of units, and so on.]

$\boldsymbol{a} = \sigma(\boldsymbol{b} + W\boldsymbol{x})$, then $\boldsymbol{a}' = \sigma(\boldsymbol{b}' + W'\boldsymbol{a})$
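A minimal NumPy sketch of stacking such layers (layer sizes and parameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Apply a = sigma(b + W a) repeatedly; layers is a list of (W, b) pairs."""
    a = x
    for W, b_vec in layers:
        a = sigmoid(b_vec + W @ a)
    return a

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 3)), rng.normal(size=3)),   # W, b
          (rng.normal(size=(3, 3)), rng.normal(size=3))]   # W', b'
print(forward(np.array([4.8, 4.9, 7.5]), layers))
```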

Page 52: Introduction of Machine / Deep Learning

Experimental Results

• Loss for multiple hidden layers
• 100 ReLU for each layer
• Input features are the no. of views in the past 56 days

             1 layer   2 layer   3 layer   4 layer
2017 – 2020   0.28k     0.18k     0.14k     0.10k
2021          0.43k     0.39k     0.38k     0.44k

Page 53: Introduction of Machine / Deep Learning

[Plot of views (k) from 2021/01/01 to 2021/02/14 for the 3-layer network. Red: real no. of views; blue: estimated no. of views; one day is marked with a "?".]

Page 54: Introduction of Machine / Deep Learning

Back to the ML Framework

Step 1: function with unknown parameters: $y = b + \boldsymbol{c}^\top \sigma(\boldsymbol{b} + W\boldsymbol{x})$

Step 2: define loss from training data

Step 3: optimization

It is not fancy enough. Let's give it a fancy name!

Page 55: Introduction of Machine / Deep Learning

[Network diagram: features $x_1, x_2, x_3$ pass through layer after layer of units producing $a_1, a_2, a_3$, …]

Each unit is a Neuron; the whole is a Neural Network. This mimics human brains … (???)

Each layer of neurons between input and output is a hidden layer; many layers means Deep: hence Deep Learning.

Page 56: Introduction of Machine / Deep Learning

Deep = Many hidden layers

AlexNet (2012): 8 layers, 16.4% error

VGG (2014): 19 layers, 7.3% error

GoogleNet (2014): 22 layers, 6.7% error

http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf

Page 57: Introduction of Machine / Deep Learning

Deep = Many hidden layers

AlexNet (2012): 16.4%; VGG (2014): 7.3%; GoogleNet (2014): 6.7%

Residual Net (2015): 152 layers, 3.57% error, with a special structure. (For scale: Taipei 101 has 101 floors.)

Why do we want a "Deep" network rather than a "Fat" network?

Page 58: Introduction of Machine / Deep Learning

Why don't we go deeper?

• Loss for multiple hidden layers
• 100 ReLU for each layer
• Input features are the no. of views in the past 56 days

             1 layer   2 layer   3 layer   4 layer
2017 – 2020   0.28k     0.18k     0.14k     0.10k
2021          0.43k     0.39k     0.38k     0.44k

Page 59: Introduction of Machine / Deep Learning

Why don't we go deeper?

• Loss for multiple hidden layers
• 100 ReLU for each layer
• Input features are the no. of views in the past 56 days

             1 layer   2 layer   3 layer   4 layer
2017 – 2020   0.28k     0.18k     0.14k     0.10k
2021          0.43k     0.39k     0.38k     0.44k

Better on training data, worse on unseen data: Overfitting.

Page 60: Introduction of Machine / Deep Learning

Let's predict the no. of views today!

• If we want to select a model for predicting the no. of views today, which one would you use?

             1 layer   2 layer   3 layer   4 layer
2017 – 2020   0.28k     0.18k     0.14k     0.10k
2021          0.43k     0.39k     0.38k     0.44k

We will talk about model selection next time. :)

Page 61: Introduction of Machine / Deep Learning

To learn more ……

Basic Introduction: https://youtu.be/Dr-WRlEFefw

Backpropagation (computing gradients in an efficient way): https://youtu.be/ibJpTrp5mcE