Deep Learning Theory and Practice (web.cecs.pdx.edu/~willke/courses/510/lectures/lecture1.pdf)
Deep Learning Theory and Practice
Lecture 1
The Learning Problem
Dr. Ted Willke [email protected]
Monday, April 1, 2019
Today’s Lecture
•Course Overview
•What is learning?
•Types of learning
•Tutorial: Perceptron learning algorithm
Resources
1. Web page: http://web.cecs.pdx.edu/~willke/courses/510
• course info: http://web.cecs.pdx.edu/~willke/courses/510/lectures/syllabus-dltp-spring-2019.pdf
• slides: http://web.cecs.pdx.edu/~willke/courses/510/lectures/
• assignments: http://web.cecs.pdx.edu/~willke/courses/510/homework/hw00.pdf
2. Textbooks (primary references):
1. Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
2. Learning From Data, Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin
3. Other references:
1. Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
2. Introduction to Applied Linear Algebra - Vectors, Matrices, and Least Squares, Stephen Boyd and Lieven Vandenberghe
3. Linear Algebra and Learning from Data, Gilbert Strang
4. Parallel and Distributed Processing, David E. Rumelhart, James L. McClelland, and PDP Research Group
4. Grader/TA: No TA. Grader(s) TBD.
5. Professor :-)
6. Prerequisites? Assignment #0
(Figures: deep learning milestones from Le Cun et al., 1990; Krizhevsky et al., 2012; Apple, 2014; Wu et al., 2016; Silver et al., 2016; and Riedmiller et al., 2018.)
Example: Predicting how a viewer will rate a movie
(Many slides adapted from Yaser Abu-Mostafa and Malik Magdon-Ismail, with permission of the authors. Thanks, guys!)
10% improvement = $1M prize (2009)
When does machine learning make sense?
1) A pattern exists.
2) We cannot pin it down mathematically.
3) We have data on it.
Our class mantra
One Solution
Satisfies our mantra?
1. Viewer taste and movie content are reflected in the rating.
2. No known formula to predict the rating.
3. Netflix has the data (100M ratings!).
The learning approach
(Diagram: viewer and movie data flow into a LEARNING box, which outputs a rating.)
Components of learning
Metaphor: Credit approval
Expect a pattern (better with on-time payments, years in residence, etc.)
No magic formula for credit approval
Banks have lots of data (salary, debt, default history, etc.)
age (years): 27
salary (dollars): 80,000
debt (dollars): 26,000
employed (years): 3
in residence (years): 2.5
… …
A pattern exists. We don’t know it. But we have data to learn it.
Components of learning
Formalization:
• Input: x ∈ ℝᵈ = X (bank data, including customer application)
• Output: y ∈ {−1, +1} = Y (approve credit or not?)
• Target function: f : X ↦ Y (ideal credit approval formula; f is unknown)
• Data: D = (x1, y1), (x2, y2), . . . , (xN, yN) (historical records)
• Hypothesis: g : X ↦ Y (formula we learn from the data)
We choose the hypothesis set and the learning algorithm.
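To make the formalization concrete, here is a minimal sketch of what the dataset D looks like in code. The feature layout follows the credit example above; the numeric values and the second applicant are invented for illustration.

```python
import numpy as np

# A toy stand-in for the dataset D = (x1, y1), . . . , (xN, yN).
# Each x is a feature vector in R^d (here d = 5: age, salary, debt,
# years employed, years in residence); each y is a label in {-1, +1}
# (deny / approve). All values are made up.
D = [
    (np.array([27.0, 80_000.0, 26_000.0, 3.0, 2.5]), +1),
    (np.array([45.0, 30_000.0, 55_000.0, 0.5, 1.0]), -1),
]

for x, y in D:
    assert x.shape == (5,) and y in (-1, +1)
```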
Learning
• Start with a set of candidate hypotheses H = {h1, h2, . . . , hM} that you think are likely to represent f. H is called the hypothesis set or model.
• Select a hypothesis g from H. We do this with a learning algorithm.
• Use g for new customers. We strive for g ≈ f.
• X, Y, and D are given by the learning problem.
• The target f is fixed but unknown.
Setting up the learning problem
(Diagram:)
UNKNOWN TARGET FUNCTION f : X ↦ Y (ideal credit approval formula)
↓
TRAINING EXAMPLES (x1, y1), (x2, y2), . . . , (xN, yN) (historical records)
↓
LEARNING ALGORITHM A ← HYPOTHESIS SET H (set of candidate formulae)
↓
FINAL HYPOTHESIS g ≈ f (learned credit approval formula)
Solution components
The two solution components of the learning problem:
•The Hypothesis Set
•The Learning Algorithm
Together, these are referred to as the learning model.
The results and aftermath of the Netflix Prize
• 10% improvement = RMSE reduced from 0.9525 to 0.8572
• 2007 Progress Prize:
- 8.43% improvement in 2007, from a combination of 107 algorithms and 2000+ hours of work
- Netflix adopted 2 of the algorithms
• 2009 Grand Prize:
- Blend of hundreds of predictive models
- “Additional accuracy gains… did not seem to justify the engineering effort needed to bring them into a production environment.”
(https://www.netflixprize.com/leaderboard.html)
Netflix never used its $1 million algorithm due to engineering costs.
1. https://medium.com/netflix-techblog/netflix-recommendations-beyond-the-5-stars-part-1-55838468f429
Deep Neural Networks
• Can we do better?
• 107 algorithms = 10% improvement
• 2 deep learning models on MovieLens dataset*
- RMSE of 0.833 on 1M ratings = 12.5% improvement
- RMSE of 0.774 on 10M ratings = 18.8% improvement
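As a sanity check on these percentages, here is a quick sketch; it assumes, as the Netflix Prize did, that "improvement" means the relative RMSE reduction against the 0.9525 baseline:

```python
def rmse_improvement(baseline: float, model: float) -> float:
    """Relative RMSE reduction, as a percentage of the baseline."""
    return 100.0 * (baseline - model) / baseline

# Netflix Grand Prize target: 10% over the 0.9525 baseline RMSE.
print(round(rmse_improvement(0.9525, 0.8572), 1))  # → 10.0
# The 12.5% figure for RMSE 0.833 is consistent with the same baseline.
print(round(rmse_improvement(0.9525, 0.833), 1))   # → 12.5
```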
(Diagram: viewer and movie features X → learned rating formula g → predicted rating Y.)
(M. Fu et al., March 2019)
*This is NOT the Netflix dataset, and the results may or may not be comparable.
A simple learning model: the 'perceptron'
• Input x = [x1, . . . , xd]ᵀ
• Give importance weights to the input dimensions and compute a 'credit score' ∑ᵢ wᵢxᵢ (summing over i = 1, . . . , d).
• Credit approval:
Approve credit if ∑ᵢ wᵢxᵢ ≥ threshold;
deny credit if ∑ᵢ wᵢxᵢ < threshold.
• Written formally as h(x) = sign((∑ᵢ wᵢxᵢ) + w0) = sign(wᵀx)
Perceptron hypothesis set
We have defined a hypothesis set H = {h(x) = sign(wᵀx)}, where each h ∈ H is determined by
w = [w0, w1, . . . , wd]ᵀ ∈ ℝᵈ⁺¹, x = [1, x1, . . . , xd]ᵀ ∈ {1} × ℝᵈ.
(Note the change in the definition of x!)
This hypothesis set, an infinite set, is called the perceptron or linear separator.
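A minimal sketch of one such hypothesis h, using the augmented convention just introduced (prepending x0 = 1 so that w[0] plays the role of the threshold weight w0). The weight values below are arbitrary illustrative choices, and the tie at 0 is broken toward +1:

```python
import numpy as np

def h(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x) on an augmented input.

    w is in R^(d+1); x is the raw d-dimensional input. x0 = 1 is
    prepended here, so w[0] acts as the threshold weight w0.
    """
    x_aug = np.concatenate(([1.0], x))
    return 1 if w @ x_aug >= 0 else -1  # break the tie at 0 toward +1

# Arbitrary weights: approve when x1 + x2 >= 3 (i.e. w0 = -3).
w = np.array([-3.0, 1.0, 1.0])
print(h(w, np.array([2.0, 2.0])))  # → 1   (-3 + 2 + 2 = 1 >= 0)
print(h(w, np.array([1.0, 1.0])))  # → -1  (-3 + 1 + 1 = -1 < 0)
```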
Geometric interpretation
h(x) = sign(wᵀx)
(Figure: two income-vs-age plots, each with a different separating line.)
Which one should we pick?
Use the data to select a line
(Figure: income-vs-age scatter plots with +1 and −1 points and candidate separating lines.)
A perceptron 'fits' the data by using a line to separate the +1 from the −1 data.
Learning a final hypothesis
• Want g ∈ H such that g ≈ f on the dataset D.
• Ideally, g(xn) = yn.
How do we find such a g, if it exists, in the infinite hypothesis set?
Idea: Start with some weight vector and try to improve it.
The Perceptron Learning Algorithm (PLA)
A simple iterative algorithm:
1: w(1) = 0
2: for iteration t = 1, 2, 3, . . .
3:   pick any misclassified (x*, y*) ∈ D  ▹ sign(w(t)ᵀx*) ≠ y*
4:   update the weight: w(t + 1) = w(t) + y*x*
5:   t ← t + 1
(Figure: when y* = +1, the update w(t + 1) = w(t) + y*x* rotates w(t) toward x*; when y* = −1, it rotates w(t) away from x*.)
PLA implements our idea, incrementally learning from one example at a time.
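The five steps above can be sketched in NumPy as follows. This is a minimal illustration, not course-provided code; the iteration cap and the "pick the first misclassified point" rule are implementation choices, and the toy dataset is made up.

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """Perceptron Learning Algorithm.

    X: (N, d) array of raw inputs; y: (N,) array of labels in {-1, +1}.
    Returns w in R^(d+1), with w[0] the threshold weight w0.
    """
    N, d = X.shape
    Xa = np.hstack([np.ones((N, 1)), X])  # augment: prepend x0 = 1
    w = np.zeros(d + 1)                   # step 1: w(1) = 0
    for _ in range(max_iters):            # step 2: iterate
        wrong = np.flatnonzero(np.sign(Xa @ w) != y)
        if wrong.size == 0:               # no misclassified points left
            return w
        i = wrong[0]                      # step 3: pick a misclassified point
        w = w + y[i] * Xa[i]              # step 4: w(t+1) = w(t) + y* x*
    return w                              # cap reached (data may not separate)

# A tiny linearly separable toy set (made-up numbers).
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = pla(X, y)
Xa = np.hstack([np.ones((4, 1)), X])
print(np.all(np.sign(Xa @ w) == y))  # → True
```

On separable data like this, the theorem on the next slide guarantees the loop exits through the `wrong.size == 0` branch in finitely many updates.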
Does PLA work?
Theorem. If the data can be fit by a linear separator, then after some finite number of steps, PLA will find one.
(A sequence of figures shows the income-vs-age data with PLA's separating line improving step by step.)
After how long?
What if the data is not linearly separable?
What if the data is not separable by a hyperplane?
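On the non-separable case: if no linear separator exists, PLA can never reach zero mistakes, so in practice the update loop must be capped. A small sketch with an XOR-style dataset (made up for illustration):

```python
import numpy as np

# XOR-style labels: no line separates the +1 points from the -1 points,
# so the PLA update loop below can never drive the mistake count to zero.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1.0, -1.0, +1.0, +1.0])
Xa = np.hstack([np.ones((4, 1)), X])  # augmented inputs (x0 = 1)

w = np.zeros(3)
for _ in range(10_000):  # cap the iterations; PLA alone would loop forever
    wrong = np.flatnonzero(np.sign(Xa @ w) != y)
    if wrong.size == 0:
        break
    w = w + y[wrong[0]] * Xa[wrong[0]]  # standard PLA update

mistakes = int(np.sum(np.sign(Xa @ w) != y))
print(mistakes >= 1)  # → True: some point is always misclassified
```

Zero mistakes at any stopping point would mean we had found a separator, which is impossible here; that is exactly why the cap (or a variant such as the pocket algorithm) is needed on real data.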
Perceptron summary
• We can find a g that works, from an infinite set.
• Remember: this only demonstrates that PLA can find a correct decision boundary from labeled data.
• What about unlabeled data? What about prediction?
Can a limited dataset reveal enough information about a target function to make it useful outside the given dataset?
Types of learning and the premise
• Learning is "using a set of observations to uncover an underlying process"
• Broad premise, many variations (e.g., statistics)
• Types of learning:
- Supervised learning: Given (xn, f(xn)) (includes the answer you want to predict)
- Unsupervised learning: Only given xn; learn to group or organize the data
- Reinforcement learning: Given feedback on potential answers you try: x → try something → get feedback.
Supervised learning
Vending machine example: coin recognition
Unsupervised learning
Instead of (input, correct output), we get (input, ?)
3 or 4 clusters?
Reinforcement learning
Instead of (input, correct output), we get (input, some output, grade for this output)
(LunarLander-v2, OpenAI Gym, accessed March 2019 at https://gym.openai.com/envs/#box2d)
Deep neural networks now beat humans on many reinforcement learning problems!
A learning puzzle
(Figure: two groups of example patterns, one labeled f = −1 and one labeled f = +1, and a new pattern whose value f = ? you must infer.)
Further reading
• Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012). Learning from data. AMLbook.com.
• Minsky, M., and Papert, S. (1969). Perceptrons: An introduction to computational geometry. A good review of the book by Allen Newell: http://science.sciencemag.org/content/165/3895/780
• Wu, Y., et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. https://arxiv.org/abs/1609.08144
• Silver, D., et al. (2016). Mastering the game of Go without human knowledge. https://www.nature.com/articles/nature24270
• Riedmiller, M., et al. (2018). Learning by playing - Solving sparse reward tasks from scratch. https://arxiv.org/abs/1802.10567
• Fu, M., et al. (2019). A novel deep learning-based collaborative filtering model for recommendation system. https://ieeexplore.ieee.org/document/8279660