Deep Learning Theory and Practice, Lecture 1: The Learning Problem
Dr. Ted Willke, Monday, April 1, 2019

Page 1:

Deep Learning Theory and Practice
Lecture 1

The Learning Problem

Dr. Ted Willke [email protected]

Monday, April 1, 2019

Page 2:

Today’s Lecture

•Course Overview

•What is learning?

•Types of learning

•Tutorial: Perceptron learning algorithm


Page 3:

Resources

1. Web page: http://web.cecs.pdx.edu/~willke/courses/510

• course info: http://web.cecs.pdx.edu/~willke/courses/510/lectures/syllabus-dltp-spring-2019.pdf

• slides: http://web.cecs.pdx.edu/~willke/courses/510/lectures/

• assignments: http://web.cecs.pdx.edu/~willke/courses/510/homework/hw00.pdf

2. Textbooks (primary references):

1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville

2. Learning From Data, Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin

3. Other references:

1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto

2. Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares, Stephen Boyd and Lieven Vandenberghe

3. Linear Algebra and Learning from Data, Gilbert Strang

4. Parallel and Distributed Processing, David E. Rumelhart, James L. McClelland, and the PDP Research Group

4. Grader/TA: No TA. Grader(s) TBD.

5. Professor :-)

6. Prerequisites? Assignment #0


Page 4:

[Figure: milestone deep learning applications.]

(Le Cun et al., 1990) (Krizhevsky et al., 2012) (Apple, 2014)
(Wu et al., 2016) (Silver et al., 2016) (Riedmiller et al., 2018)

Page 5:

Example: Predicting how a viewer will rate a movie

(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks guys!)

10% improvement = $1M prize (2009)

When does machine learning make sense?

1) A pattern exists.

2) We cannot pin it down mathematically.

3) We have data on it.

Our class mantra

Page 6:

One Solution


Satisfies our mantra?

1. Viewer taste and movie content are reflected in the rating
2. No known formula to predict the rating
3. Netflix has the data (100M ratings!)

Page 7:

The learning approach

[Figure: viewer and movie inputs feed a LEARNING box that outputs a rating.]

Page 8:

Components of learning


Metaphor: Credit approval

Expect a pattern (better with on-time payments, years in residence, etc.)

No magic formula for credit approval

Banks have lots of data (salary, debt, default history, etc.)

age (years)           27
salary (dollars)      80,000
debt (dollars)        26,000
employed (years)      3
in residence (years)  2.5
…                     …

A pattern exists. We don’t know it. But we have data to learn it.

Page 9:

Components of learning

Formalization:

• Input: x ∈ ℝᵈ = X (bank data, including customer application)

• Output: y ∈ {−1, +1} = Y (approve credit or not?)

• Target function: f : X ↦ Y (ideal credit approval formula; f is unknown)

• Data: D = (x1, y1), (x2, y2), . . . , (xN, yN) (historical records)

• Hypothesis: g : X ↦ Y (formula we learn from the data)
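To make these objects concrete, here is a minimal Python sketch (everything in it is hypothetical: the data is random, and the "unknown" target is simulated only so the example runs):

```python
import numpy as np

d, N = 5, 100                          # input dimension and dataset size
rng = np.random.default_rng(0)

X = rng.normal(size=(N, d))            # inputs x1, ..., xN, each in R^d
w_true = rng.normal(size=d)            # stand-in for the unknown target f
y = np.sign(X @ w_true)                # outputs y in {-1, +1}
D = list(zip(X, y))                    # D = (x1, y1), ..., (xN, yN)
```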

Page 10:

Learning

• Start with a set of candidate hypotheses H = {h1, h2, . . . , hM} that you think are likely to represent f. H is called the hypothesis set or model.

• Select a hypothesis g from H. We do this with a learning algorithm (see the sketch below).

• Use g for new customers. We strive for g ≈ f.

• X, Y, and D are given by the learning problem.

• The target f is fixed but unknown.

• We choose H and the learning algorithm.
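As a toy illustration of "select a hypothesis with a learning algorithm", here is my own sketch; it only works for a finite H, whereas the perceptron's H later in the lecture is infinite:

```python
# Score every hypothesis in a finite H on the data D and return the one
# with the fewest errors. H and D below are hypothetical examples.
def pick_g(H, D):
    def num_errors(h):
        return sum(1 for x, y in D if h(x) != y)
    return min(H, key=num_errors)

# e.g., H = threshold rules on the first feature, at three thresholds
H = [lambda x, t=t: 1 if x[0] >= t else -1 for t in (0.0, 0.5, 1.0)]
D = [((0.2, 1.0), -1), ((0.8, 0.3), 1), ((0.9, 0.9), 1)]
g = pick_g(H, D)   # picks the t = 0.5 rule, which makes zero errors on D
```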

Page 11:

Setting up the learning problem

UNKNOWN TARGET FUNCTION f : X ↦ Y (ideal credit approval formula)
↓
TRAINING EXAMPLES (x1, y1), (x2, y2), . . . , (xN, yN) (historical records)
↓
LEARNING ALGORITHM A ← HYPOTHESIS SET H (set of candidate formulae)
↓
FINAL HYPOTHESIS g ≈ f (learned credit approval formula)

Page 12:

Solution components

The two solution components of the learning problem:

•The Hypothesis Set

•The Learning Algorithm

Together, these are referred to as the learning model.


Page 13:

The results and aftermath of the Netflix Prize

• 10% improvement = RMSE from 0.9525 to 0.8572

• 2007 Progress Prize:

- 8.43% improvement in 2007 from a combination of 107 algorithms and 2000+ hours of work

- Netflix adopted 2 of the algorithms

• 2009 Grand Prize:

- Blend of hundreds of predictive models

- “Additional accuracy gains… did not seem to justify the engineering effort needed to bring them into a production environment.”


(https://www.netflixprize.com/leaderboard.html)

Netflix never used its $1 million algorithm due to engineering costs.¹

1. https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
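These percentages are relative reductions in RMSE against the 0.9525 baseline. A quick sanity check (my own, not from the slides) reproduces the figures here and on the next page:

```python
# "X% improvement" = relative RMSE reduction versus the 0.9525 baseline
def improvement(baseline, rmse):
    return (baseline - rmse) / baseline

for rmse in (0.8572, 0.833, 0.774):
    print(rmse, f"{improvement(0.9525, rmse):.1%}")
# -> 10.0%, 12.5%, 18.7% (the last is quoted as 18.8% on the next page,
#    so that figure likely reflects a slightly different baseline or rounding)
```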

Page 14:

Deep Neural Networks

• Can we do better?

• 107 algorithms = 10% improvement

• 2 deep learning models on the MovieLens dataset*

- RMSE of 0.833 on 1M ratings = 12.5% improvement

- RMSE of 0.774 on 10M ratings = 18.8% improvement

[Figure: X (viewer features, movie features) → g (learned rating formula) → Y (predicted rating)]

(M. Fu et al., March 2019)

*This is NOT the Netflix dataset, and the results may or may not be comparable.

Page 15:

A simple learning model - the ‘perceptron’

• Input: x = [x1, . . . , xd]ᵀ

• Give importance weights to the input dimensions and compute a ‘credit score’:

score = ∑ᵢ₌₁ᵈ wᵢxᵢ

• Credit approval:

Approve credit if ∑ᵢ₌₁ᵈ wᵢxᵢ ≥ threshold,
Deny credit if ∑ᵢ₌₁ᵈ wᵢxᵢ < threshold.

• Written formally as

h(x) = sign((∑ᵢ₌₁ᵈ wᵢxᵢ) + w0) = sign(wᵀx)

Page 16:

Perceptron hypothesis set

We have defined a hypothesis set h ∈ H:

H = {h(x) = sign(wᵀx)} (an infinite set),

where

w = [w0, w1, . . . , wd]ᵀ ∈ ℝᵈ⁺¹, x = [1, x1, . . . , xd]ᵀ ∈ {1} × ℝᵈ.

This hypothesis set is called the perceptron or linear separator.

(Note the change in the definition of x!)
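A minimal sketch (with invented weights) of evaluating one such hypothesis, using the augmented x so that w0 plays the role of the threshold:

```python
import numpy as np

def h(w, x):
    """Perceptron hypothesis h(x) = sign(w.T x): w has d+1 entries, x has d."""
    x_aug = np.concatenate(([1.0], x))   # x -> (1, x1, ..., xd)
    return 1 if w @ x_aug >= 0 else -1   # sign, with ties sent to +1

# Hypothetical weights: approve when 0.3*x1 + 0.1*x2 - 2 >= 0
w = np.array([-2.0, 0.3, 0.1])
print(h(w, np.array([8.0, 3.0])))        # -> 1 (approve)
```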

Page 17:

Geometric interpretation

h(x) = sign(wᵀx)

[Figure: two Income-vs-Age panels, each split by a different candidate line.]

Which one should we pick?

Page 18:

Use the data to select a line

[Figure: two Income-vs-Age panels showing +1 and −1 data points and candidate separating lines.]

A perceptron ‘fits’ the data by using a line to separate the +1 from the -1 data.

Page 19:

Learning a final hypothesis

• Want g ∈ H such that g ≈ f on the dataset D

• Ideally, g(xn) = yn.

How do we find such a g, if it exists, in the infinite hypothesis set?

Idea: Start with some weight vector and try to improve it.

[Figure: separable + and − points in the Income-vs-Age plane.]

Page 20:

The Perceptron Learning Algorithm (PLA)

A simple iterative algorithm:

1: w(1) = 0
2: for iteration t = 1, 2, 3, . . .
3:     pick any misclassified (x*, y*) ∈ D   ▹ sign(w(t)ᵀx*) ≠ y*
4:     update the weight: w(t + 1) = w(t) + y*x*
5:     t ← t + 1

[Figure: the update w(t + 1) = w(t) + y*x* rotates w(t) toward x* when y* = +1 and away from x* when y* = −1.]

PLA implements our idea, incrementally learning from one example at a time.
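Here is a runnable Python sketch of the pseudocode (my own translation, assuming NumPy; the constant-1 coordinate is prepended as on page 16, and max_iters guards against the non-separable case raised on page 29):

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """PLA sketch: X is (N, d) inputs, y is (N,) labels in {-1, +1}."""
    Xa = np.hstack([np.ones((len(X), 1)), X])   # x -> (1, x1, ..., xd)
    w = np.zeros(Xa.shape[1])                   # 1: w(1) = 0
    for _ in range(max_iters):                  # 2: for t = 1, 2, 3, ...
        mis = np.flatnonzero(np.sign(Xa @ w) != y)
        if mis.size == 0:                       # every point classified correctly
            break
        i = mis[0]                              # 3: pick any misclassified (x*, y*)
        w = w + y[i] * Xa[i]                    # 4: w(t+1) = w(t) + y*x*
    return w                                    # last w if the data isn't separable
```

On separable data this terminates with a separating weight vector, which is exactly the theorem stated on the next slide.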

Page 21:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

[Figure: a PLA iteration on the separable Income-vs-Age data.]

Pages 22-27:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

[Figures: six slides stepping through successive PLA iterations on the same data.]

Page 28:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

[Figure: the separable Income-vs-Age data.]

Page 29:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

What if the data is not linearly separable?

[Figure: Income-vs-Age data with intermingled + and − points that no line separates.]

Page 30:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

What if the data is not linearly separable?

[Figure: non-separable + and − points plotted on the unit square (axes 0 to 1).]

Page 31:

Does PLA work?

Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.

After how long?

What if the data is not separable by a hyperplane?

[Figure: + and − points in the Income-vs-Age plane.]

Page 32:

Perceptron summary

• We can find a g that works, from an infinite set.

• Remember: this only demonstrates that PLA can find a correct decision boundary from labeled data.

• What about unlabeled data? What about prediction?

Can a limited dataset reveal enough information about a target function to make it useful outside the given dataset?

Page 33:

Types of learning and the premise

• Learning is “using a set of observations to uncover an underlying process”

• Broad premise, many variations (e.g., statistics)

• Types of learning:

- Supervised learning: Given (xn, f(xn)) (the data includes the answer you want to predict)

- Unsupervised learning: Only given xn; learn to group or organize the data

- Reinforcement learning: Given feedback on potential answers you try: x → try something → get feedback.

Page 34:

Supervised learning

Vending machine example: coin recognition


Page 35:

Unsupervised learning

Instead of (input, correct output), we get (input, ?)


Page 36:

Unsupervised learning

Instead of (input, correct output), we get (input, ?)


3 or 4 clusters?
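A minimal unsupervised-learning sketch (my own, assuming scikit-learn is installed; the unlabeled data is synthetic): the learner sees only inputs, and the modeler must choose the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three synthetic blobs of 2-D points; no labels are given to the learner
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [2, 2], [0, 2])])

for k in (3, 4):                         # "3 or 4 clusters?" is our choice
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, np.bincount(labels))        # cluster sizes for each choice of k
```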

Page 37:

Reinforcement learning

Instead of (input, correct output), we get (input, some output, grade for this output)

(LunarLander-v2, OpenAI Gym, accessed March 2019 at https://gym.openai.com/envs/#box2d)

Deep neural networks now beat humans on many reinforcement learning problems!
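The interaction loop looks like this in code (a sketch using the classic pre-0.26 Gym API with Box2D installed; the policy here is random, not learned):

```python
import gym

env = gym.make("LunarLander-v2")
obs = env.reset()                                # input: an observation
for _ in range(200):
    action = env.action_space.sample()           # try something
    obs, reward, done, info = env.step(action)   # feedback: a reward ("grade"),
    if done:                                     # never the correct action
        obs = env.reset()
env.close()
```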

Page 38:

A learning puzzle

[Figure: training examples labeled (f = −1) and (f = +1), plus a new example with (f = ?) to be predicted.]

Page 39:

Further reading

• Abu-Mostafa, Y. S., Magdon-Ismail, M., Lin, H.-T. (2012) Learning from data. AMLbook.com.

• Minsky, M., Papert, S. (1969). Perceptrons: An introduction to computational geometry. A good review of the book by Alan Newell: http://science.sciencemag.org/content/165/3895/780

• Wu, Y., et al. (2016). Google’s neural machine translation system: bridging the gap between human and machine translation. https://arxiv.org/abs/1609.08144

• Silver, D., et al. (2017). Mastering the game of Go without human knowledge. https://www.nature.com/articles/nature24270

• Riedmiller, M., et al. (2018). Learning by playing - Solving sparse reward tasks from scratch. https://arxiv.org/abs/1802.10567

• Fu, M., et al. (2019). A novel deep learning-based collaborative filtering model for recommendation system. https://ieeexplore.ieee.org/document/8279660
