Deep Learning Theory and Practice, Lecture 1: The Learning Problem
Dr. Ted Willke, Monday, April 1, 2019

Page 1:

Deep Learning Theory and Practice
Lecture 1

The Learning Problem

Dr. Ted Willke [email protected]

Monday, April 1, 2019

Page 2:

Today’s Lecture

•Course Overview

•What is learning?

•Types of learning

•Tutorial: Perceptron learning algorithm


Page 3:

Resources

1. Web page: http://web.cecs.pdx.edu/~willke/courses/510

• course info: http://web.cecs.pdx.edu/~willke/courses/510/lectures/syllabus-dltp-spring-2019.pdf

• slides: http://web.cecs.pdx.edu/~willke/courses/510/lectures/

• assignments: http://web.cecs.pdx.edu/~willke/courses/510/homework/hw00.pdf

2. Textbooks (primary references):

1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville

2. Learning From Data, Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin

3. Other references:

1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

2. Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares, Stephen Boyd and Lieven Vandenberghe

3. Linear Algebra and Learning from Data, Gilbert Strang

4. Parallel and Distributed Processing, David E. Rumelhart, James L. McClelland, and the PDP Research Group

4. Grader/TA: No TA. Grader(s) TBD.

5. Professor :-)

6. Prerequisites? Assignment #0


Page 4:

[Figure: milestone deep learning applications.]

(Le Cun et al., 1990) (Krizhevsky et al., 2012) (Apple, 2014)
(Wu et al., 2016) (Silver et al., 2016) (Riedmiller et al., 2018)

Page 5:

Example: Predicting how a viewer will rate a movie

(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)

10% improvement = $1M prize (2009)

When does machine learning make sense?

1) A pattern exists.

2) We cannot pin it down mathematically.

3) We have data on it.

Our class mantra

Page 6:

One Solution


Satisfies our mantra?

1. Viewer taste and movie content are reflected in the rating
2. No known formula to predict the rating
3. Netflix has the data (100M ratings!)

Page 7:

The learning approach

[Figure: viewer and movie inputs feed a LEARNING box that outputs a rating.]

Page 8:

Components of learning


Metaphor: Credit approval

Expect a pattern (better with on-time payments, years in residence, etc.)

No magic formula for credit approval

Banks have lots of data (salary, debt, default history, etc.)

age (years)           27
salary (dollars)      80,000
debt (dollars)        26,000
employed (years)      3
in residence (years)  2.5
…                     …

A pattern exists. We don’t know it. But we have data to learn it.

Page 9:

Components of learning

Formalization:

• Input: x ∈ ℝᵈ = X (bank data, including customer application)

• Output: y ∈ {−1, +1} = Y (approve credit or not?)

• Target function: f : X ↦ Y (ideal credit approval formula; f is unknown)

• Data: D = (x1, y1), (x2, y2), . . . , (xN, yN) (historical records)

• Hypothesis: g : X ↦ Y (formula we learn from the data)
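To make these objects concrete, here is a minimal Python sketch (everything in it is hypothetical: the data is random, and the "unknown" target is simulated only so the example runs):

```python
import numpy as np

d, N = 5, 100                          # input dimension and dataset size
rng = np.random.default_rng(0)

X = rng.normal(size=(N, d))            # inputs x1, ..., xN, each in R^d
w_true = rng.normal(size=d)            # stand-in for the unknown target f
y = np.sign(X @ w_true)                # outputs y in {-1, +1}
D = list(zip(X, y))                    # D = (x1, y1), ..., (xN, yN)
```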

Page 10:

Learning

• Start with a set of candidate hypotheses H = {h1, h2, . . . , hM} that you think are likely to represent f. H is called the hypothesis set or model.

• Select a hypothesis g from H. We do this with a learning algorithm (see the sketch below).

• Use g for new customers. We strive for g ≈ f.

• X, Y, and D are given by the learning problem.

• The target f is fixed but unknown.

• We choose H and the learning algorithm.
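As a toy illustration of "select a hypothesis with a learning algorithm", here is my own sketch; it only works for a finite H, whereas the perceptron's H later in the lecture is infinite:

```python
# Score every hypothesis in a finite H on the data D and return the one
# with the fewest errors. H and D below are hypothetical examples.
def pick_g(H, D):
    def num_errors(h):
        return sum(1 for x, y in D if h(x) != y)
    return min(H, key=num_errors)

# e.g., H = threshold rules on the first feature, at three thresholds
H = [lambda x, t=t: 1 if x[0] >= t else -1 for t in (0.0, 0.5, 1.0)]
D = [((0.2, 1.0), -1), ((0.8, 0.3), 1), ((0.9, 0.9), 1)]
g = pick_g(H, D)   # picks the t = 0.5 rule, which makes zero errors on D
```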

Page 11:

Setting up the learning problem

UNKNOWN TARGET FUNCTION f : X ↦ Y (ideal credit approval formula)
↓
TRAINING EXAMPLES (x1, y1), (x2, y2), . . . , (xN, yN) (historical records)
↓
LEARNING ALGORITHM A ← HYPOTHESIS SET H (set of candidate formulae)
↓
FINAL HYPOTHESIS g ≈ f (learned credit approval formula)

Page 12:

Solution components

The two solution components of the learning problem:

•The Hypothesis Set

•The Learning Algorithm

Together, these are referred to as the learning model.


Page 13:

The results and aftermath of the Netflix Prize

• 10% improvement = RMSE from 0.9525 to 0.8572

• 2007 Progress Prize:

- 8.43% improvement in 2007 from a combination of 107 algorithms and 2000+ hours of work

- Netflix adopted 2 of the algorithms

• 2009 Grand Prize:

- Blend of hundreds of predictive models

- “Additional accuracy gains… did not seem to justify the engineering effort needed to bring them into a production environment.”


(https://www.netflixprize.com/leaderboard.html)

Netflix never used its $1 million algorithm due to engineering costs.¹

1. https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
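These percentages are relative reductions in RMSE against the 0.9525 baseline. A quick sanity check (my own, not from the slides) reproduces the figures here and on the next page:

```python
# "X% improvement" = relative RMSE reduction versus the 0.9525 baseline
def improvement(baseline, rmse):
    return (baseline - rmse) / baseline

for rmse in (0.8572, 0.833, 0.774):
    print(rmse, f"{improvement(0.9525, rmse):.1%}")
# -> 10.0%, 12.5%, 18.7% (the last is quoted as 18.8% on the next page,
#    so that figure likely reflects a slightly different baseline or rounding)
```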

Page 14:

Deep Neural Networks

• Can we do better?

• 107 algorithms = 10% improvement

• 2 deep learning models on the MovieLens dataset*

- RMSE of 0.833 on 1M ratings = 12.5% improvement

- RMSE of 0.774 on 10M ratings = 18.8% improvement

[Figure: X (viewer features, movie features) → g (learned rating formula) → Y (predicted rating)]

(M. Fu et al., March 2019)

*This is NOT the Netflix dataset, and the results may or may not be comparable.

Page 15:

A simple learning model - the ‘perceptron’

• Input: x = [x1, . . . , xd]ᵀ

• Give importance weights to the input dimensions and compute a ‘credit score’:

score = ∑ᵢ₌₁ᵈ wᵢxᵢ

• Credit approval:

Approve credit if ∑ᵢ₌₁ᵈ wᵢxᵢ ≥ threshold,
Deny credit if ∑ᵢ₌₁ᵈ wᵢxᵢ < threshold.

• Written formally as

h(x) = sign((∑ᵢ₌₁ᵈ wᵢxᵢ) + w0) = sign(wᵀx)

Page 16:

Perceptron hypothesis set

We have defined a hypothesis set h ∈ H:

H = {h(x) = sign(wᵀx)} (an infinite set),

where

w = [w0, w1, . . . , wd]ᵀ ∈ ℝᵈ⁺¹, x = [1, x1, . . . , xd]ᵀ ∈ {1} × ℝᵈ.

This hypothesis set is called the perceptron or linear separator.

(Note the change in the definition of x!)
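A minimal sketch (with invented weights) of evaluating one such hypothesis, using the augmented x so that w0 plays the role of the threshold:

```python
import numpy as np

def h(w, x):
    """Perceptron hypothesis h(x) = sign(w.T x): w has d+1 entries, x has d."""
    x_aug = np.concatenate(([1.0], x))   # x -> (1, x1, ..., xd)
    return 1 if w @ x_aug >= 0 else -1   # sign, with ties sent to +1

# Hypothetical weights: approve when 0.3*x1 + 0.1*x2 - 2 >= 0
w = np.array([-2.0, 0.3, 0.1])
print(h(w, np.array([8.0, 3.0])))        # -> 1 (approve)
```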

Page 17:

Geometric interpretation

h(x) = sign(wᵀx)

[Figure: two Income-vs-Age panels, each split by a different candidate line.]

Which one should we pick?

Page 18:

Use the data to select a line

[Figure: two Income-vs-Age panels showing +1 and −1 data points and candidate separating lines.]

A perceptron ‘fits’ the data by using a line to separate the +1 from the -1 data.

Page 19:

Learning a final hypothesis

• Want g ∈ H such that g ≈ f on the dataset D

• Ideally, g(xn) = yn.

How do we find such a g, if it exists, in the infinite hypothesis set?

Idea: Start with some weight vector and try to improve it.

[Figure: separable + and − points in the Income-vs-Age plane.]

Page 20:

The Perceptron Learning Algorithm (PLA)

A simple iterative algorithm:

1: w(1) = 0
2: for iteration t = 1, 2, 3, . . .
3:     pick any misclassified (x*, y*) ∈ D   ▹ sign(w(t)ᵀx*) ≠ y*
4:     update the weight: w(t + 1) = w(t) + y*x*
5:     t ← t + 1

[Figure: the update w(t + 1) = w(t) + y*x* rotates w(t) toward x* when y* = +1 and away from x* when y* = −1.]

PLA implements our idea, incrementally learning from one example at a time.
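Here is a runnable Python sketch of the pseudocode (my own translation, assuming NumPy; the constant-1 coordinate is prepended as on page 16, and max_iters guards against the non-separable case raised on page 29):

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """PLA sketch: X is (N, d) inputs, y is (N,) labels in {-1, +1}."""
    Xa = np.hstack([np.ones((len(X), 1)), X])   # x -> (1, x1, ..., xd)
    w = np.zeros(Xa.shape[1])                   # 1: w(1) = 0
    for _ in range(max_iters):                  # 2: for t = 1, 2, 3, ...
        mis = np.flatnonzero(np.sign(Xa @ w) != y)
        if mis.size == 0:                       # every point classified correctly
            break
        i = mis[0]                              # 3: pick any misclassified (x*, y*)
        w = w + y[i] * Xa[i]                    # 4: w(t+1) = w(t) + y*x*
    return w                                    # last w if the data isn't separable
```

On separable data this terminates with a separating weight vector, which is exactly the theorem stated on the next slide.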

Page 21:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

[Figure: a PLA iteration on the separable Income-vs-Age data.]

Pages 22-27:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

[Figures: six slides stepping through successive PLA iterations on the same data.]

Page 28:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

[Figure: the separable Income-vs-Age data.]

Page 29:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

What if the data is not linearly separable?

[Figure: Income-vs-Age data with intermingled + and − points that no line separates.]

Page 30:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

What if the data is not linearly separable?

[Figure: non-separable + and − points plotted on the unit square (axes 0 to 1).]

Page 31:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

What if the data is not separable by a hyperplane?

[Figure: + and − points in the Income-vs-Age plane.]

Page 32:

Perceptron summary

• We can find a g that works, from an infinite set.

• Remember: this only demonstrates that PLA can find a correct decision boundary from labeled data.

• What about unlabeled data? What about prediction?

Can a limited dataset reveal enough information about a target function to make it useful outside the given dataset?

Page 33:

Types of learning and the premise

• Learning is “using a set of observations to uncover an underlying process”

• Broad premise, many variations (e.g., statistics)

• Types of learning:

- Supervised learning: Given (xn, f(xn)) (the data includes the answer you want to predict)

- Unsupervised learning: Only given xn; learn to group or organize the data

- Reinforcement learning: Given feedback on potential answers you try: x → try something → get feedback.

Page 34:

Supervised learning

Vending machine example: coin recognition


Page 35:

Unsupervised learning

Instead of (input, correct output), we get (input, ?)


Page 36:

Unsupervised learning

Instead of (input, correct output), we get (input, ?)


3 or 4 clusters?
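A minimal unsupervised-learning sketch (my own, assuming scikit-learn is installed; the unlabeled data is synthetic): the learner sees only inputs, and the modeler must choose the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three synthetic blobs of 2-D points; no labels are given to the learner
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [2, 2], [0, 2])])

for k in (3, 4):                         # "3 or 4 clusters?" is our choice
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, np.bincount(labels))        # cluster sizes for each choice of k
```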

Page 37:

Reinforcement learning

Instead of (input, correct output), we get (input, some output, grade for this output)

(LunarLander-v2, OpenAI Gym, accessed March 2019 at https://gym.openai.com/envs/#box2d)

Deep neural networks now beat humans on many reinforcement learning problems!
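The interaction loop looks like this in code (a sketch using the classic pre-0.26 Gym API with Box2D installed; the policy here is random, not learned):

```python
import gym

env = gym.make("LunarLander-v2")
obs = env.reset()                                # input: an observation
for _ in range(200):
    action = env.action_space.sample()           # try something
    obs, reward, done, info = env.step(action)   # feedback: a reward ("grade"),
    if done:                                     # never the correct action
        obs = env.reset()
env.close()
```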

Page 38:

A learning puzzle

[Figure: training examples labeled (f = −1) and (f = +1), plus a new example with (f = ?) to be predicted.]

Page 39:

Further reading

• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.

• Minsky, M., Papert, S. (1969). Perceptrons: An introduction to computational geometry. A good review of the book by Alan Newell: http://science.sciencemag.org/content/165/3895/780

• Wu, Y., et al. (2016). Google’s neural machine translation system: bridging the gap between human and machine translation. https://arxiv.org/abs/1609.08144

• Silver, D., et al. (2017). Mastering the game of Go without human knowledge. https://www.nature.com/articles/nature24270

• Riedmiller, M., et al. (2018). Learning by playing - Solving sparse reward tasks from scratch. https://arxiv.org/abs/1802.10567

• Fu, M., et al. (2019). A novel deep learning-based collaborative filtering model for recommendation system. https://ieeexplore.ieee.org/document/8279660
