CS 6375 Machine Learning - University of Texas at Dallas

Page 1

CS 6375 Machine Learning

(Qualifying Exam Section)

Nicholas Ruozzi, University of Texas at Dallas

Slides adapted from David Sontag and Vibhav Gogate

Page 2: CS 6375 Machine Learning - University of Texas at Dallasnrr150130/cs6375/2018fa/... · 2018. 8. 22. · CS 6375 Machine Learning (Qualifying Exam Section) Nicholas Ruozzi. University

Course Info.

• Instructor: Nicholas Ruozzi

• Office: ECSS 3.409

• Office hours: Mon. 1pm-2pm

• TA: ?

• Office hours and location ?

• Course website: www.utdallas.edu/~nrr150130/cs6375/2018fa/

• Piazza (online forum): sign-up link on eLearning

Page 3

Prerequisites

• CS 5343 (data structures & algorithms)

• “Mathematical sophistication”

• Basic probability

• Linear algebra

• Eigenvalues, eigenvectors, matrices, vectors, etc.

• Multivariate calculus

• Derivatives, gradients, convex functions, etc.

• I’ll review some concepts as we come to them, but you should brush up on any areas where you aren’t as comfortable

Page 4

Grading

• 5-6 problem sets (50%)

• See collaboration policy on the web

• Mix of theory and programming (in MATLAB or Python)

• Available and turned in on eLearning

• Approximately one assignment every two weeks

• Midterm Exam (20%)

• Final Exam (30%)

-subject to change-

Page 5

Course Topics

• Dimensionality reduction
  • PCA
  • Matrix Factorizations
• Learning
  • Supervised, unsupervised, active, reinforcement, …
  • Learning theory: PAC learning, VC dimension
  • SVMs & kernel methods
  • Decision trees, k-NN, logistic regression, …
  • Parameter estimation: Bayesian methods, MAP estimation, maximum likelihood estimation, expectation maximization, …
  • Clustering: k-means & spectral clustering
• Probabilistic models
  • Bayesian networks
  • Naïve Bayes
• Neural networks
• Statistical methods
  • Boosting, bagging, bootstrapping
  • Sampling

• Ranking & Collaborative Filtering

Page 6

What is ML?

Page 7

What is ML?

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

- Tom Mitchell

Page 8

Basic Machine Learning Paradigm

• Collect data

• Build a model using “training” data

• Use model to make predictions

Page 9

Supervised Learning

• Input: $(x^{(1)}, y^{(1)}), \ldots, (x^{(M)}, y^{(M)})$

• $x^{(m)}$ is the $m$-th data item and $y^{(m)}$ is the $m$-th label

• Goal: find a function $f$ such that $f(x^{(m)})$ is a “good approximation” to $y^{(m)}$

• Can use it to predict $y$ values for previously unseen $x$ values
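
As a concrete (made-up) illustration of this setup in Python, a training set is just a list of $(x, y)$ pairs and a hypothesis is any function we can evaluate on new inputs; none of the values below come from the slides:

```python
# Hypothetical labeled training data: M = 4 pairs (x^(m), y^(m)).
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# A candidate hypothesis f (here, a hand-picked linear function).
def f(x):
    return 2.0 * x + 0.1

# Use the chosen f to predict y for a previously unseen x.
print(f(5.0))
```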

Page 10

Examples of Supervised Learning

• Spam email detection

• Handwritten digit recognition

• Stock market prediction

• More?

Page 11

Supervised Learning

• Hypothesis space: set of allowable functions $f : X \to Y$

• Goal: find the “best” element of the hypothesis space

• How do we measure the quality of $f$?

Page 12

Supervised Learning

• Simple linear regression

• Input: pairs of points $(x^{(1)}, y^{(1)}), \ldots, (x^{(M)}, y^{(M)})$ with $x^{(m)} \in \mathbb{R}$ and $y^{(m)} \in \mathbb{R}$

• Hypothesis space: set of linear functions $f(x) = ax + b$ with $a, b \in \mathbb{R}$

• Error metric: squared difference between the predicted value and the actual value

Page 13

Regression

[figure: scatter plot of data points, with axes $x$ and $y$]

Page 14

Regression

[figure: scatter plot of data points, with axes $x$ and $y$]

Hypothesis class: linear functions $f(x) = ax + b$

How do we compute the error of a specific hypothesis?

Page 15

Linear Regression

• For any data point, $x$, the learning algorithm predicts $f(x)$

• In typical regression applications, measure the fit using a squared loss function

$$L(f) = \frac{1}{M} \sum_m \left( f(x^{(m)}) - y^{(m)} \right)^2$$

• Want to minimize the average loss on the training data

• The optimal linear hypothesis is then given by

$$\min_{a,b} \frac{1}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)^2$$
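
As a small sketch (not from the slides), the average squared loss for a particular choice of $a$ and $b$ can be computed directly from the definition above; the data values are placeholders:

```python
# Placeholder training data and a candidate linear hypothesis f(x) = a*x + b.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
a, b = 2.0, 0.1

# L(f) = (1/M) * sum_m (a*x^(m) + b - y^(m))^2
M = len(train)
loss = sum((a * x + b - y) ** 2 for x, y in train) / M
print(loss)
```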

Page 16

Linear Regression

$$\min_{a,b} \frac{1}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)^2$$

• How do we find the optimal $a$ and $b$?

Page 17

Linear Regression

$$\min_{a,b} \frac{1}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)^2$$

• How do we find the optimal $a$ and $b$?

• Solution 1: take derivatives and solve (there is a closed form solution!)

• Solution 2: use gradient descent

Page 18

Linear Regression

$$\min_{a,b} \frac{1}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)^2$$

• How do we find the optimal $a$ and $b$?

• Solution 1: take derivatives and solve (there is a closed form solution!)

• Solution 2: use gradient descent

• This approach is much more likely to be useful for general loss functions
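
For the 1-D problem above, Solution 1 works out to the familiar least-squares formulas; here is a minimal sketch on made-up data, using the standard closed form $a = \sum_m (x^{(m)} - \bar{x})(y^{(m)} - \bar{y}) / \sum_m (x^{(m)} - \bar{x})^2$ and $b = \bar{y} - a\bar{x}$:

```python
# Closed-form least squares for f(x) = a*x + b (Solution 1), on made-up data.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
xs = [x for x, _ in train]
ys = [y for _, y in train]
M = len(train)

x_bar = sum(xs) / M
y_bar = sum(ys) / M

# Standard least-squares estimates for the slope a and intercept b.
a = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - a * x_bar
print(a, b)
```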

Page 19

Gradient Descent

Iterative method to minimize a (convex) differentiable function $f$

• Pick an initial point $x_0$

• Iterate until convergence

$$x_{t+1} = x_t - \gamma_t \nabla f(x_t)$$

where $\gamma_t$ is the step size at iteration $t$ (sometimes called the learning rate)
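
A minimal sketch of this iteration in Python for a one-dimensional $f$; the fixed step size, iteration cap, and stopping test are assumptions, not part of the slide:

```python
def gradient_descent(grad_f, x0, step_size=0.1, max_iters=1000, tol=1e-8):
    """Minimize a differentiable function via x_{t+1} = x_t - step_size * grad_f(x_t)."""
    x = x0
    for _ in range(max_iters):
        x_new = x - step_size * grad_f(x)
        if abs(x_new - x) < tol:  # crude convergence test for the scalar case
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3); the minimizer is x = 3.
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))
```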

Page 20

Gradient Descent

[figure: gradient descent illustration; source: Wikipedia]

Page 21

Gradient Descent

$$\min_{a,b} \frac{1}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)^2$$

• What is the gradient of this function?

• What does a gradient descent iteration look like for this simple regression problem?

(on board)
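
For reference, the computation referred to above is the standard one: differentiating the objective with respect to $a$ and $b$ gives

$$\frac{\partial}{\partial a} = \frac{2}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right) x^{(m)}, \qquad \frac{\partial}{\partial b} = \frac{2}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)$$

so one gradient descent iteration is $a \leftarrow a - \gamma_t \frac{2}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right) x^{(m)}$ and $b \leftarrow b - \gamma_t \frac{2}{M} \sum_m \left( a x^{(m)} + b - y^{(m)} \right)$.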

Page 22

Linear Regression

• In higher dimensions, the linear regression problem is essentially the same with $x^{(m)} \in \mathbb{R}^n$

$$\min_{a \in \mathbb{R}^n,\, b} \frac{1}{M} \sum_m \left( a^T x^{(m)} + b - y^{(m)} \right)^2$$

• Can still use gradient descent to minimize this

• Not much more difficult than the $n = 1$ case
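
A vectorized sketch of gradient descent for this $n$-dimensional objective using NumPy; the synthetic data, step size, and iteration count are all illustrative choices:

```python
import numpy as np

# Synthetic data: M points in R^n generated from made-up parameters plus noise.
rng = np.random.default_rng(0)
M, n = 100, 3
X = rng.normal(size=(M, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=M)

a = np.zeros(n)
b = 0.0
step = 0.1

for _ in range(1000):
    resid = X @ a + b - y                   # residuals a^T x^(m) + b - y^(m)
    a -= step * (2.0 / M) * (X.T @ resid)   # gradient step in a
    b -= step * (2.0 / M) * resid.sum()     # gradient step in b

print(a, b)  # should be close to the parameters used to generate the data
```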

Page 23

Gradient Descent

• Gradient descent converges under certain technical conditions on the function $f$ and the step size $\gamma_t$

• If $f$ is convex, then any fixed point of gradient descent must correspond to a global optimum of $f$

• In general, convergence is only guaranteed to a local optimum
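
A tiny illustration of why the step-size conditions matter, using $f(x) = x^2$ (gradient $2x$); the particular values are just for demonstration:

```python
def run_gd(step, iters=20, x0=1.0):
    # Gradient descent on f(x) = x^2, whose gradient is 2x.
    x = x0
    for _ in range(iters):
        x = x - step * 2 * x
    return x

print(run_gd(0.1))  # small step: converges toward the optimum at 0
print(run_gd(1.5))  # step too large: the iterates diverge
```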

Page 24

Regression

• What if we enlarge the hypothesis class?

• Quadratic functions

• $k$-degree polynomials

• Can we always learn better with a larger hypothesis class?

Page 25

Regression

• What if we enlarge the hypothesis class?

• Quadratic functions

• $k$-degree polynomials

• Can we always learn better with a larger hypothesis class?

Page 26

Regression

• Enlarging the hypothesis space never increases the training loss, but a lower training loss does NOT necessarily mean better predictive performance

• This phenomenon is known as overfitting

• Ideally, we would select the simplest hypothesis consistent with the observed data

• In practice, we cannot simply evaluate our learned hypothesis on the training data; we want it to perform well on unseen data (otherwise, we could just memorize the training data!)

• Report the loss on some held-out test data (i.e., data not used as part of the training process)
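
A sketch of this idea with polynomial hypothesis classes in NumPy: as the degree grows, the training loss keeps shrinking, but the loss on held-out data typically stops improving (the data, split, and degrees below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=60)
y = 2 * x + 0.3 + 0.2 * rng.normal(size=60)  # roughly linear data plus noise

# Hold out the last 20 points as test data; train on the first 40.
x_tr, y_tr = x[:40], y[:40]
x_te, y_te = x[40:], y[40:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares fit of a degree-k polynomial
    train_loss = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_loss = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(degree, train_loss, test_loss)
```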

Page 27

Binary Classification

• Regression operates over a continuous set of outcomes

• Suppose that we want to learn a function $f : X \to \{0,1\}$

• As an example:

x1 x2 x3 | y
 0  0  1 | 0
 0  1  0 | 1
 1  1  0 | 1
 1  1  1 | 0

How do we pick the hypothesis space?

How do we find the best $f$ in this space?

Page 28

Binary Classification

• Regression operates over a continuous set of outcomes

• Suppose that we want to learn a function $f : X \to \{0,1\}$

• As an example:

x1 x2 x3 | y
 0  0  1 | 0
 0  1  0 | 1
 1  1  0 | 1
 1  1  1 | 0

How many functions with three binary inputs and one binary output are there?

Page 29

Binary Classification

x1 x2 x3 | y
 0  0  0 | ?
 0  0  1 | 0
 0  1  0 | 1
 0  1  1 | ?
 1  0  0 | ?
 1  0  1 | ?
 1  1  0 | 1
 1  1  1 | 0

$2^8$ possible functions

$2^4$ are consistent with the observations

How do we choose the best one?

What if the observations are noisy?
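
The counts above can be checked by brute force in Python: enumerate all $2^8 = 256$ truth tables over the 8 possible inputs and keep those that agree with the four observed rows (there are $2^4 = 16$):

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))  # the 8 possible (x1, x2, x3) settings
observed = {(0, 0, 1): 0, (0, 1, 0): 1, (1, 1, 0): 1, (1, 1, 1): 0}

# A function is one choice of output bit for each of the 8 input settings.
all_functions = list(product([0, 1], repeat=len(inputs)))
consistent = [f for f in all_functions
              if all(f[inputs.index(x)] == y for x, y in observed.items())]

print(len(all_functions), len(consistent))  # 256 16
```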

Page 30

Challenges in ML

• How to choose the right hypothesis space?

• A number of factors influence this decision: the difficulty of learning over the chosen space, how expressive the space is, …

• How to evaluate the quality of our learned hypothesis?

• Prefer “simpler” hypotheses (to prevent overfitting)

• Want the outcome of learning to generalize to unseen data

Page 31

Challenges in ML

• How do we find the best hypothesis?

• This can be an NP-hard problem!

• Need fast, scalable algorithms if they are to be applicable to real-world scenarios

Page 32

Other Types of Learning

• Unsupervised
  • The training data does not include the desired output
• Semi-supervised
  • Some training data comes with the desired output
• Active learning
  • Semi-supervised learning where the algorithm can ask for the correct outputs for specifically chosen data points
• Reinforcement learning
  • The learner interacts with the world via allowable actions which change the state of the world and result in rewards
  • The learner attempts to maximize rewards through trial and error
