Transcript of: Machine Learning Department, School of Computer Science › ~mgormley › courses › 10601-s19 › slides
Midterm Exam Review
+ Multinomial Logistic Reg.
+ Feature Engineering
+ Regularization
1
10-601 Introduction to Machine Learning
Matt Gormley
Lecture 10
Feb. 18, 2019
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Reminders
• Homework 4: Logistic Regression
– Out: Fri, Feb 15
– Due: Fri, Mar 1 at 11:59pm
• Midterm Exam 1
– Thu, Feb 21, 6:30pm – 8:00pm
• Today’s In-Class Poll
– http://p10.mlcourse.org
• Reading on Probabilistic Learning is reused later in the course for MLE/MAP
3
Outline
• Midterm Exam Logistics
• Sample Questions
• Classification and Regression: The Big Picture
• Q&A
4
MIDTERM EXAM LOGISTICS
5
Midterm Exam
• Time / Location
– Time: Evening Exam, Thu, Feb. 21, 6:30pm – 8:00pm
– Room: We will contact each student individually with your room assignment. The rooms are not based on section.
– Seats: There will be assigned seats. Please arrive early.
– Please watch Piazza carefully for announcements regarding room / seat assignments.
• Logistics
– Covered material: Lecture 1 – Lecture 8
– Format of questions:
  • Multiple choice
  • True / False (with justification)
  • Derivations
  • Short answers
  • Interpreting figures
  • Implementing algorithms on paper
– No electronic devices
– You are allowed to bring one 8½ x 11 sheet of notes (front and back)
6
Midterm Exam
• How to Prepare
– Attend the midterm review lecture (right now!)
– Review prior year’s exam and solutions (we’ll post them)
– Review this year’s homework problems
– Consider whether you have achieved the “learning objectives” for each lecture / section
7
Midterm Exam
• Advice (for during the exam)
– Solve the easy problems first (e.g. multiple choice before derivations)
  • If a problem seems extremely complicated, you’re likely missing something
– Don’t leave any answer blank!
– If you make an assumption, write it down
– If you look at a question and don’t know the answer:
  • we probably haven’t told you the answer
  • but we’ve told you enough to work it out
  • imagine arguing for some answer and see if you like it
8
Topics for Midterm
• Foundations
– Probability, Linear Algebra, Geometry, Calculus
– Optimization
• Important Concepts
– Overfitting
– Experimental Design
• Classification
– Decision Tree
– KNN
– Perceptron
• Regression
– Linear Regression
9
SAMPLE QUESTIONS
10
Sample Questions
11
10-601: Machine Learning, Page 4 of 16, 2/29/2016

1.3 MAP vs MLE

Answer each question with T or F and provide a one sentence explanation of your answer:

(a) [2 pts.] T or F: In the limit, as n (the number of samples) increases, the MAP and MLE estimates become the same.

(b) [2 pts.] T or F: Naive Bayes can only be used with MAP estimates, and not MLE estimates.

1.4 Probability

Assume we have a sample space Ω. Answer each question with T or F. No justification is required.

(a) [1 pts.] T or F: If events A, B, and C are disjoint then they are independent.

(b) [1 pts.] T or F: P(A|B) ∝ P(A)P(B|A) / P(A|B). (The sign ‘∝’ means ‘is proportional to’)

(c) [1 pts.] T or F: P(A ∪ B) ≤ P(A).

(d) [1 pts.] T or F: P(A ∩ B) ≥ P(A).
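The set-algebra facts behind (c) and (d) can be sanity-checked by brute force. A minimal sketch on a made-up uniform sample space (the events A and B below are arbitrary illustrative subsets, not from the exam):

```python
# Toy uniform sample space: each outcome has probability 1/|omega|.
omega = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5, 6}

def prob(event):
    return len(event) / len(omega)

# The union can only add mass, so P(A ∪ B) >= P(A) always holds...
assert prob(A | B) >= prob(A)
# ...and the intersection can only remove mass, so P(A ∩ B) <= P(A).
assert prob(A & B) <= prob(A)
print(prob(A | B), prob(A & B))
```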
Sample Questions
12
Sample Questions
13
10-701 Machine Learning Midterm Exam, Page 7 of 17, 11/02/2016

4 K-NN [12 pts]

In this problem, you will be tested on your knowledge of K-Nearest Neighbors (K-NN), where k indicates the number of nearest neighbors.

1. [3 pts] For K-NN in general, are there any cons of using very large k values? Select one. Briefly justify your answer.

(a) Yes (b) No

2. [3 pts] For K-NN in general, are there any cons of using very small k values? Select one. Briefly justify your answer.

(a) Yes (b) No

10-701 Machine Learning Midterm Exam, Page 8 of 17, 11/02/2016

Now we will apply K-Nearest Neighbors using Euclidean distance to a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. A point can be its own neighbor.

Figure 5

3. [2 pts] What value of k minimizes leave-one-out cross-validation error for the dataset shown in Figure 5? What is the resulting error?

4. [2 pts] Sketch the 1-nearest neighbor boundary over Figure 5.

5. [2 pts] What value of k minimizes the training set error for the dataset shown in Figure 5? What is the resulting training error?
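Leave-one-out cross-validation for k-NN, as probed in question 3, can be sketched concretely. A minimal illustration on a made-up 1-D dataset (not the dataset of Figure 5); note that a held-out point is not its own neighbor here, unlike the training-error setting above:

```python
from collections import Counter

# Made-up 1-D binary dataset: (position, label).
points = [(0.0, 'A'), (1.0, 'A'), (2.0, 'A'), (5.0, 'B'), (6.0, 'B'), (7.0, 'B')]

def knn_predict(train, x, k):
    # Sort by distance (absolute value suffices in 1-D) and take the
    # majority label among the k nearest training points.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_error(data, k):
    # Hold out each point in turn and predict it from the rest.
    mistakes = sum(
        knn_predict(data[:i] + data[i + 1:], x, k) != y
        for i, (x, y) in enumerate(data)
    )
    return mistakes / len(data)

for k in (1, 3, 5):
    print(k, loocv_error(points, k))
```

On this toy data small k generalizes perfectly, while k = 5 forces every held-out point to be outvoted by the opposite class.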
Sample Questions
14
10-601: Machine Learning, Page 10 of 16, 2/29/2016

4 SVM, Perceptron and Kernels [20 pts. + 4 Extra Credit]

4.1 True or False

Answer each of the following questions with T or F and provide a one line justification.

(a) [2 pts.] Consider two datasets D^(1) and D^(2) where D^(1) = {(x_1^(1), y_1^(1)), ..., (x_n^(1), y_n^(1))} and D^(2) = {(x_1^(2), y_1^(2)), ..., (x_m^(2), y_m^(2))} such that x_i^(1) ∈ R^{d_1}, x_i^(2) ∈ R^{d_2}. Suppose d_1 > d_2 and n > m. Then the maximum number of mistakes a perceptron algorithm will make is higher on dataset D^(1) than on dataset D^(2).

(b) [2 pts.] Suppose φ(x) is an arbitrary feature mapping from input x ∈ X to φ(x) ∈ R^N and let K(x, z) = φ(x) · φ(z). Then K(x, z) will always be a valid kernel function.

(c) [2 pts.] Given the same training data, in which the points are linearly separable, the margin of the decision boundary produced by SVM will always be greater than or equal to the margin of the decision boundary produced by Perceptron.

4.2 Multiple Choice

(a) [3 pt.] If the data is linearly separable, SVM minimizes ||w||² subject to the constraints ∀i, y_i w · x_i ≥ 1. In the linearly separable case, which of the following may happen to the decision boundary if one of the training samples is removed? Circle all that apply.

• Shifts toward the point removed
• Shifts away from the point removed
• Does not change

(b) [3 pt.] Recall that when the data are not linearly separable, SVM minimizes ||w||² + C Σ_i ξ_i subject to the constraint that ∀i, y_i w · x_i ≥ 1 − ξ_i and ξ_i ≥ 0. Which of the following may happen to the size of the margin if the tradeoff parameter C is increased? Circle all that apply.

• Increases
• Decreases
• Remains the same
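Question 4.1(b) turns on the fact that K(x, z) = φ(x) · φ(z) yields a symmetric, positive semidefinite Gram matrix for any explicit feature map. A minimal sketch checking two necessary conditions (symmetry and the 2×2 Cauchy–Schwarz minor) on a made-up feature map φ:

```python
# Arbitrary illustrative feature map from R^2 to R^3.
def phi(x):
    return [x[0], x[1], x[0] * x[1]]

# Kernel defined as an inner product of explicit features.
def K(x, z):
    return sum(a * b for a, b in zip(phi(x), phi(z)))

pts = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
gram = [[K(a, b) for b in pts] for a in pts]

# Symmetry: K(x, z) == K(z, x).
assert all(gram[i][j] == gram[j][i] for i in range(3) for j in range(3))
# Nonnegative diagonal: K(x, x) = ||phi(x)||^2 >= 0.
assert all(gram[i][i] >= 0 for i in range(3))
# 2x2 principal minor: K(x,x)K(z,z) - K(x,z)^2 >= 0 (Cauchy-Schwarz).
assert gram[0][0] * gram[1][1] - gram[0][1] ** 2 >= 0
```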
Sample Questions
15
10-601: Machine Learning, Page 7 of 16, 2/29/2016

3 Linear and Logistic Regression [20 pts. + 2 Extra Credit]

3.1 Linear regression

Given that we have an input x and we want to estimate an output y, in linear regression we assume the relationship between them is of the form y = wx + b + ε, where w and b are real-valued parameters we estimate and ε represents the noise in the data. When the noise is Gaussian, maximizing the likelihood of a dataset S = {(x_1, y_1), ..., (x_n, y_n)} to estimate the parameters w and b is equivalent to minimizing the squared error:

argmin_w Σ_{i=1}^n (y_i − (wx_i + b))².

Consider the dataset S plotted in Fig. 1 along with its associated regression line. For each of the altered data sets S_new plotted in Fig. 3, indicate which regression line (relative to the original one) in Fig. 2 corresponds to the regression line for the new data set. Write your answers in the table below.

Dataset:         (a)  (b)  (c)  (d)  (e)
Regression line:

Figure 1: An observed data set and its associated regression line.
Figure 2: New regression lines for altered data sets S_new.
10-601: Machine Learning, Page 8 of 16, 2/29/2016

(a) Adding one outlier to the original data set.
(b) Adding two outliers to the original data set.
(c) Adding three outliers to the original data set. Two on one side and one on the other side.
(d) Duplicating the original data set.
(e) Duplicating the original data set and adding four points that lie on the trajectory of the original regression line.

Figure 3: New data set S_new.
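The least-squares objective above has a closed form in the 1-D case: w = cov(x, y)/var(x), b = ȳ − w·x̄. A minimal sketch on made-up data, also illustrating two of the altered datasets: one outlier tilts the line (cf. (a)), while duplicating the data leaves the fit essentially unchanged (cf. (d)):

```python
# Closed-form simple linear regression: minimize sum_i (y_i - (w x_i + b))^2.
def fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # w = cov(x, y) / var(x); b = mean(y) - w * mean(x).
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return w, my - w * mx

xs, ys = [0, 1, 2, 3], [0.1, 1.0, 2.1, 2.9]   # made-up data, roughly y = x
w0, b0 = fit(xs, ys)
w1, b1 = fit(xs + [4], ys + [10.0])            # add one high outlier at x = 4

print(round(w0, 2), round(w1, 2))              # slope is pulled toward the outlier
# Duplicating the data set scales numerator and denominator equally,
# so the fitted line does not change.
wd, bd = fit(xs + xs, ys + ys)
assert abs(wd - w0) < 1e-9 and abs(bd - b0) < 1e-9
```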
Matching Game
Goal: Match the Algorithm to its Update Rule
19
1. SGD for Logistic Regression, h_θ(x) = p(y|x)
2. Least Mean Squares, h_θ(x) = θᵀx
3. Perceptron, h_θ(x) = sign(θᵀx)

Update rules:
4. θ_k ← θ_k + 1 / (1 + exp λ(h_θ(x^(i)) − y^(i)))
5. θ_k ← θ_k + (h_θ(x^(i)) − y^(i))
6. θ_k ← θ_k + λ(h_θ(x^(i)) − y^(i)) x_k^(i)

A. 1=5, 2=4, 3=6
B. 1=5, 2=6, 3=4
C. 1=6, 2=4, 3=4
D. 1=5, 2=6, 3=6
E. 1=6, 2=6, 3=6
F. 1=6, 2=5, 3=5
G. 1=5, 2=5, 3=5
H. 1=4, 2=5, 3=6
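Updates of the additive form θ_k ← θ_k + λ(y − h_θ(x)) x_k (one common sign convention for ascent on the log-likelihood) can be made concrete for binary logistic regression, where h_θ(x) = σ(θᵀx). A runnable sketch; the data, learning rate, and epoch count below are made up:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logreg(data, lam=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]                    # [bias, weight] for 1-D inputs
    for _ in range(epochs):
        rng.shuffle(data)                 # SGD: visit examples in random order
        for x, y in data:
            feats = [1.0, x]              # prepend a constant bias feature
            h = sigmoid(sum(t * f for t, f in zip(theta, feats)))
            for k, f in enumerate(feats):
                theta[k] += lam * (y - h) * f
    return theta

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # separable toy data
theta = sgd_logreg(data)
print(sigmoid(theta[0] + theta[1] * 2.0) > 0.5)      # x = 2 classified positive
```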
Q&A
26
MULTINOMIAL LOGISTIC REGRESSION
28
Multinomial Logistic Regression

Chalkboard:
– Background: Multinomial distribution
– Definition: Multi-class classification
– Geometric intuitions
– Multinomial logistic regression model
– Generative story
– Reduction to binary logistic regression
– Partial derivatives and gradients
– Applying Gradient Descent and SGD
– Implementation w/ sparse features
30
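The model the chalkboard derives can be summarized as the softmax p(y = k | x) = exp(θ_k · x) / Σ_j exp(θ_j · x). A minimal sketch (theta and x below are arbitrary illustrative values):

```python
import math

def softmax_probs(theta, x):
    # One linear score per class, normalized with a max-shift for
    # numerical stability (exp of large scores would overflow).
    scores = [sum(t * f for t, f in zip(row, x)) for row in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # K=3 classes, M=2 features
p = softmax_probs(theta, [2.0, 0.5])
print(p)   # a valid distribution over the 3 classes
```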
Debug that Program!

In-Class Exercise: Think-Pair-Share
Debug the following program which is (incorrectly) attempting to run SGD for multinomial logistic regression.

31

Buggy Program:

    while not converged:
        for i in shuffle([1,…,N]):
            for k in [1,…,K]:
                theta[k] = theta[k] - lambda * grad(x[i], y[i], theta, k)

Assume: grad(x[i], y[i], theta, k) returns the gradient of the negative log-likelihood of the training example (x[i], y[i]) with respect to vector theta[k]. lambda is the learning rate. N = # of examples. K = # of output classes. M = # of features. theta is a K by M matrix.
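One way to repair the program is to compute all K gradients at the current theta before applying any update, so later gradients never see a half-updated parameter matrix. A sketch of that fix; the softmax gradient and toy data below are assumptions, since the slide leaves grad's internals unspecified:

```python
import math, random

def grad(x, y, theta, k):
    # Assumed: gradient of the negative log-likelihood w.r.t. theta[k]
    # under a softmax model p(y=k|x) ∝ exp(theta[k] · x).
    scores = [sum(t * f for t, f in zip(row, x)) for row in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    p_k = exps[k] / sum(exps)
    return [(p_k - (1.0 if y == k else 0.0)) * f for f in x]

def sgd(examples, K, M, lam=0.1, epochs=100, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * M for _ in range(K)]
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            # The fix: snapshot all K gradients at the same theta...
            grads = [grad(x, y, theta, k) for k in range(K)]
            # ...then apply the K updates.
            for k in range(K):
                theta[k] = [t - lam * g for t, g in zip(theta[k], grads[k])]
    return theta

examples = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2)]
theta = sgd(examples, K=3, M=2)
```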
FEATURE ENGINEERING
33
Handcrafted Features
34
[Figure: the sentence “Egypt - born Proyas directed” annotated with POS tags (NNP : VBN NNP VBD), entity tags (PER, LOC), and a parse tree (S → NP VP, with ADJP), illustrating handcrafted features f fed into p(y|x) ∝ exp(Θ_y · f(·)) for the relation “born-in”.]
Where do features come from?
35
Feature Engineering ←→ Feature Learning

hand-crafted features (Zhou et al., 2005; Sun et al., 2011):
– First word before M1
– Second word before M1
– Bag-of-words in M1
– Head word of M1
– Other word in between
– First word after M2
– Second word after M2
– Bag-of-words in M2
– Head word of M2
– Bigrams in between
– Words on dependency path
– Country name list
– Personal relative triggers
– Personal title list
– WordNet Tags
– Heads of chunks in between
– Path of phrase labels
– Combination of entity types
Where do features come from?

(Figure: the feature engineering ↔ feature learning spectrum again, now adding word embeddings: Mikolov et al., 2013.)

CBOW model in Mikolov et al. (2013): the input (context words) is mapped through a look-up table to embeddings, and a classifier predicts the missing word. Similar words get similar embeddings (e.g. the embedding rows for "dog" and "cat" are close), and the embeddings are learned by unsupervised learning.
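The CBOW prediction step can be sketched with random, untrained embeddings (a toy NumPy sketch; `E_in`, `E_out`, and `cbow_predict` are illustrative names, not Mikolov et al.'s implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 8                      # toy vocabulary size and embedding dimension
E_in = rng.normal(size=(V, D))   # look-up table: one embedding row per word
E_out = rng.normal(size=(V, D))  # output-side (classifier) embeddings

def cbow_predict(context_ids):
    """Predict a distribution over the missing word from its context words."""
    h = E_in[context_ids].mean(axis=0)   # average the context embeddings
    scores = E_out @ h                   # score every candidate word
    e = np.exp(scores - scores.max())
    return e / e.sum()                   # softmax over the vocabulary

probs = cbow_predict([0, 2, 3])
```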
Where do features come from?

(Figure: the spectrum now adds string embeddings: Collobert & Weston, 2008; Socher, 2011.)

Convolutional Neural Networks (Collobert and Weston, 2008): a CNN with pooling applied over the sentence "The [movie] showed [wars]". Recursive Auto Encoder (Socher, 2011): an RAE composed over the same sentence.
Where do features come from?

(Figure: the spectrum adds tree embeddings: Socher et al., 2013; Hermann & Blunsom, 2013. Composition follows the parse of "The [movie] showed [wars]": for S → NP VP, embeddings are combined with composition matrices such as W_{NP,VP}, W_{DT,NN}, and W_{V,NN}.)
Where do features come from?

(Figure: the spectrum adds word embedding features: Turian et al., 2010; Koo et al., 2008; and Hermann et al., 2014, who refine embedding features with semantic/syntactic info. Also shown: hand-crafted features (Zhou et al., 2005; Sun et al., 2011), word embeddings (Mikolov et al., 2013), string embeddings (Collobert & Weston, 2008; Socher, 2011), and tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013).)
Where do features come from?

(Figure: the full spectrum from feature engineering to feature learning, spanning hand-crafted features, word embedding features, string embeddings, word embeddings, and tree embeddings, and asking: best of both worlds?)
Feature Engineering for NLP

Suppose you build a logistic regression model to predict a part-of-speech (POS) tag for each word in a sentence. What features should you use?

Running example: "The movie I watched depicted hope", tagged The/det. movie/noun I/noun watched/verb depicted/verb hope/noun.
Feature Engineering for NLP

Per-word Features (columns are x(1)…x(6), the feature vectors for each word):

| Feature | The | movie | I | watched | depicted | hope |
|---|---|---|---|---|---|---|
| is-capital(wi) | 1 | 0 | 1 | 0 | 0 | 0 |
| endswith(wi, "e") | 1 | 1 | 0 | 0 | 0 | 1 |
| endswith(wi, "d") | 0 | 0 | 0 | 1 | 1 | 0 |
| endswith(wi, "ed") | 0 | 0 | 0 | 1 | 1 | 0 |
| wi == "aardvark" | 0 | 0 | 0 | 0 | 0 | 0 |
| wi == "hope" | 0 | 0 | 0 | 0 | 0 | 1 |
| … | | | | | | |
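The table above can be generated programmatically (a sketch; the feature set and ordering follow the slide, and `per_word_features` is a made-up helper name):

```python
def per_word_features(word):
    """Binary per-word features, in the order shown on the slide."""
    return [
        int(word[0].isupper()),           # is-capital(wi)
        int(word.lower().endswith("e")),  # endswith(wi, "e")
        int(word.lower().endswith("d")),  # endswith(wi, "d")
        int(word.lower().endswith("ed")), # endswith(wi, "ed")
        int(word.lower() == "aardvark"),  # wi == "aardvark"
        int(word.lower() == "hope"),      # wi == "hope"
    ]

sentence = ["The", "movie", "I", "watched", "depicted", "hope"]
features = [per_word_features(w) for w in sentence]
```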
Feature Engineering for NLP

Context Features (for the word "watched"):

| Feature | The | movie | I | watched | depicted | hope |
|---|---|---|---|---|---|---|
| wi == "watched" | 0 | 0 | 0 | 1 | 0 | 0 |
| wi+1 == "watched" | 0 | 0 | 1 | 0 | 0 | 0 |
| wi−1 == "watched" | 0 | 0 | 0 | 0 | 1 | 0 |
| wi+2 == "watched" | 0 | 1 | 0 | 0 | 0 | 0 |
| wi−2 == "watched" | 0 | 0 | 0 | 0 | 0 | 1 |
Feature Engineering for NLP

Context Features (for the word "I"):

| Feature | The | movie | I | watched | depicted | hope |
|---|---|---|---|---|---|---|
| wi == "I" | 0 | 0 | 1 | 0 | 0 | 0 |
| wi+1 == "I" | 0 | 1 | 0 | 0 | 0 | 0 |
| wi−1 == "I" | 0 | 0 | 0 | 1 | 0 | 0 |
| wi+2 == "I" | 1 | 0 | 0 | 0 | 0 | 0 |
| wi−2 == "I" | 0 | 0 | 0 | 0 | 1 | 0 |
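The context-feature tables follow the same pattern for any target word (again a sketch; `context_features` is an invented helper):

```python
def context_features(sentence, i, target):
    """Binary context features: does `target` appear at each offset from i?"""
    offsets = [0, +1, -1, +2, -2]    # wi, wi+1, wi-1, wi+2, wi-2
    feats = []
    for off in offsets:
        j = i + off
        feats.append(int(0 <= j < len(sentence) and sentence[j] == target))
    return feats

sentence = ["The", "movie", "I", "watched", "depicted", "hope"]
rows = [context_features(sentence, i, "watched") for i in range(len(sentence))]
```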
Feature Engineering for NLP

…and learning methods give small incremental gains in POS tagging performance, bringing it close to parity with the best published POS tagging numbers in 2010. These numbers are on the now fairly standard splits of the Wall Street Journal portion of the Penn Treebank for POS tagging, following [6].³ The details of the corpus appear in Table 2 and comparative results appear in Table 3.

Table 2. WSJ corpus for POS tagging experiments.

| Set | Sections | Sentences | Tokens | Unknown |
|---|---|---|---|---|
| Training | 0–18 | 38,219 | 912,344 | 0 |
| Development | 19–21 | 5,527 | 131,768 | 4,467 |
| Test | 22–24 | 5,462 | 129,654 | 3,649 |

Table 3. Tagging accuracies with different feature templates and other changes on the WSJ 19–21 development set.

| Model | Feature Templates | # Feats | Sent. Acc. | Token Acc. | Unk. Acc. |
|---|---|---|---|---|---|
| 3gramMemm | See text | 248,798 | 52.07% | 96.92% | 88.99% |
| naacl 2003 | See text and [1] | 460,552 | 55.31% | 97.15% | 88.61% |
| Replication | See text and [1] | 460,551 | 55.62% | 97.18% | 88.92% |
| Replication′ | + rareFeatureThresh = 5 | 482,364 | 55.67% | 97.19% | 88.96% |
| 5w | + ⟨t0, w−2⟩, ⟨t0, w2⟩ | 730,178 | 56.23% | 97.20% | 89.03% |
| 5wShapes | + ⟨t0, s−1⟩, ⟨t0, s0⟩, ⟨t0, s+1⟩ | 731,661 | 56.52% | 97.25% | 89.81% |
| 5wShapesDS | + distributional similarity | 737,955 | 56.79% | 97.28% | 90.46% |

3gramMemm shows the performance of a straightforward, fast, discriminative sequence model tagger. It uses the templates ⟨t0, w−1⟩, ⟨t0, w0⟩, ⟨t0, w+1⟩, ⟨t0, t−1⟩, ⟨t0, t−2, t−1⟩ and the unknown word features from [1]. The higher-performance naacl 2003 tagger numbers come from use of a bidirectional cyclic dependency network tagger, which adds the feature templates ⟨t0, t+1⟩, ⟨t0, t+1, t+2⟩, ⟨t0, t−1, t+1⟩, ⟨t0, t−1, w0⟩, ⟨t0, t+1, w0⟩, ⟨t0, w−1, w0⟩, ⟨t0, w0, w+1⟩. The next line shows results from an attempt to replicate those numbers in 2010. The results are similar but a fraction better.⁴ The line after that shows that the numbers are pushed up a little by lowering the support threshold for including rare word features to 5. Thereafter, performance is improved a little by adding features. 5w adds the words two to the left and right as features, and 5wShapes also adds word shape features that we have described for named en-

³ In this paper, when I refer to "the Penn Treebank", I am actually referring to just the WSJ portion of the treebank, and am using the LDC99T42 Treebank release 3 version.

⁴ I think the improvements are due to a few bug fixes by Michel Galley. Thanks!

Table from Manning (2011)
Feature Engineering for CV

Edge detection (Canny) and corner detection (Harris). Figures from http://opencv.org
Feature Engineering for CV

Scale Invariant Feature Transform (SIFT). Figures from Lowe (1999) and Lowe (2004)
NON-LINEAR FEATURES
Nonlinear Features
• aka. "nonlinear basis functions"
• So far, the input was always a feature vector x ∈ R^M
• Key Idea: let the input be some function of x
– original input: x ∈ R^M
– new input: b(x) ∈ R^M′
– define b(x) = [b1(x), b2(x), …, bM′(x)]
• Examples (M = 1): e.g. polynomial basis functions bj(x) = x^j

For a linear model, the output is still a linear function of b(x) even though it is a nonlinear function of x. Examples:
– Perceptron
– Linear regression
– Logistic regression
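The key idea can be shown concretely: a polynomial basis turns one scalar input into several features, while the model stays linear in the parameters (a minimal NumPy sketch; `poly_basis` is an assumed helper name, not course code):

```python
import numpy as np

def poly_basis(x, degree):
    """Map scalar inputs to the polynomial basis b(x) = [x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** d for d in range(1, degree + 1)], axis=1)

# A linear model in b(x) is still linear in the parameters,
# even though it is a nonlinear function of the original input x.
x = np.linspace(-1.0, 1.0, 20)
y = np.tanh(3.0 * x)                                      # a nonlinear target
B = np.column_stack([np.ones_like(x), poly_basis(x, 3)])  # bias + basis features
w = np.linalg.lstsq(B, y, rcond=None)[0]                  # ordinary least squares
```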
Example: Linear Regression

Goal: Learn y = w^T f(x) + b, where f(·) is a polynomial basis function. The true "unknown" target function is y = tanh(x) + noise.

(Sequence of plots: the same noisy tanh data fit with polynomial basis functions of increasing complexity; the most rigid fits miss the curve, while the most flexible fits chase the noise.)
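The behavior in these plots can be reproduced numerically (a sketch under assumed settings: 10 noisy tanh samples and polynomial fits of a few degrees):

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-2.0, 2.0, size=10)
y_train = np.tanh(x_train) + 0.1 * rng.normal(size=10)   # noisy tanh samples
x_test = rng.uniform(-2.0, 2.0, size=100)
y_test = np.tanh(x_test) + 0.1 * rng.normal(size=100)

def rms(w, xs, ys):
    """Root-mean-square error of polynomial w on data (xs, ys)."""
    return float(np.sqrt(np.mean((np.polyval(w, xs) - ys) ** 2)))

train_err, test_err = {}, {}
for degree in (1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err[degree] = rms(w, x_train, y_train)
    test_err[degree] = rms(w, x_test, y_test)
# Degree 9 interpolates all 10 training points (near-zero training error)
# but typically does much worse on the held-out test points.
```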
Example: Linear Regression

Goal: Learn y = w^T f(x) + b, where f(·) is a polynomial basis function. The true "unknown" target function is linear with negative slope and Gaussian noise.

(Sequence of plots: polynomial fits of increasing complexity to the same small, noisy linear dataset; the most flexible fits over-fit badly.)
Over-fitting

Root-Mean-Square (RMS) Error: (figure: training and test RMS error as a function of model complexity)

Slide courtesy of William Cohen
Polynomial Coefficients

(Table: the learned polynomial coefficients, which grow very large in magnitude as the polynomial degree increases.)

Slide courtesy of William Cohen
Example: Linear Regression

Goal: Learn y = w^T f(x) + b, where f(·) is a polynomial basis function. The true "unknown" target function is linear with negative slope and Gaussian noise.

(Plots: the over-fit polynomial from before, then the same setup but now with N = 100 points.)
REGULARIZATION
Overfitting

Definition: overfitting occurs when the model captures the noise in the training data instead of the underlying structure.

Overfitting can occur in all the models we've seen so far:
– Decision Trees (e.g. when the tree is too deep)
– KNN (e.g. when k is small)
– Perceptron (e.g. when the sample isn't representative)
– Linear Regression (e.g. with nonlinear features)
– Logistic Regression (e.g. with many rare features)
Motivation: Regularization

Example: Stock Prices
• Suppose we wish to predict Google's stock price at time t+1.
• What features should we use? (putting all computational concerns aside)
– Stock prices of all other stocks at times t, t−1, t−2, …, t−k
– Mentions of Google with positive / negative sentiment words in all newspapers and social media outlets
• Do we believe that all of these features are going to be useful?
Motivation: Regularization

• Occam's Razor: prefer the simplest hypothesis.
• What does it mean for a hypothesis (or model) to be simple?
1. small number of features (model selection)
2. small number of "important" features (shrinkage)
Regularization

Chalkboard
– L2, L1, L0 Regularization
– Example: Linear Regression
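For the linear regression example, L2 (ridge) regularization has a closed-form solution; the sketch below also leaves the bias unpenalized (a minimal NumPy sketch; `ridge_fit` is an assumed helper name):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares; the bias column is unpenalized."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    I = np.eye(Xb.shape[1])
    I[0, 0] = 0.0                                # do not regularize the bias
    return np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=50)
w_small = ridge_fit(X, y, 0.01)    # weak regularization
w_big = ridge_fit(X, y, 100.0)     # strong regularization shrinks the weights
```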
![Page 67: Machine Learning Department School of Computer … › ~mgormley › courses › 10601-s19 › slides › ...(c) [2 pts.] Given the same training data, in which the points are linearly](https://reader030.fdocuments.us/reader030/viewer/2022040617/5f1cf2ede4f9a36b2d79a4fa/html5/thumbnails/67.jpg)
Regularization

Don't Regularize the Bias (Intercept) Parameter!
• In our models so far, the bias / intercept parameter is usually denoted by θ0 -- that is, the parameter for which we fixed x0 = 1.
• Regularizers always avoid penalizing this bias / intercept parameter.
• Why? Because otherwise the learning algorithms wouldn't be invariant to a shift in the y-values.

Whitening Data
• It's common to whiten each feature by subtracting its mean and dividing by its standard deviation.
• For regularization, this helps all the features be penalized in the same units (e.g. convert both centimeters and kilometers to z-scores).
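Whitening as described is just the z-score transform (a two-function NumPy sketch; `standardize` is a name chosen here):

```python
import numpy as np

def standardize(X):
    """Z-score each feature column: subtract its mean, divide by its std. dev."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Features in different units (e.g. centimeters vs. kilometers) end up on the
# same scale, so an L2 penalty treats them comparably.
X = np.array([[180.0, 1.2],
              [165.0, 5.0],
              [172.0, 3.1]])
Z = standardize(X)
```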
Regularization

(Figure: the polynomial fit when a regularization penalty is added to the objective.)

Slide courtesy of William Cohen
Polynomial Coefficients

(Table: the learned polynomial coefficients under no regularization, moderate regularization (λ = exp(−18)), and huge regularization; larger λ shrinks the coefficients.)

Slide courtesy of William Cohen
Over-Regularization

(Figure: with too large a regularization weight, the fit is overly smooth and underfits.)

Slide courtesy of William Cohen
Regularization Exercise

In-class Exercise
1. Plot train error vs. # features (cartoon)
2. Plot test error vs. # features (cartoon)

(Blank axes: error on the y-axis, # features on the x-axis.)
Example: Logistic Regression

(Plots: the training data, the test data, and axes of error vs. 1/λ.)
Example: Logistic Regression

(Sequence of plots: the learned decision boundary as the amount of regularization varies.)
Example: Logistic Regression

(Plot: error vs. 1/λ.)
Regularization as MAP

• L1 and L2 regularization can be interpreted as maximum a-posteriori (MAP) estimation of the parameters.
• To be discussed later in the course…
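As a preview, the L2 case follows from a Gaussian prior (a standard derivation, stated here informally):

```latex
% With log-likelihood \ell(\theta) and a Gaussian prior \theta \sim \mathcal{N}(0, \sigma^2 I):
\theta_{\mathrm{MAP}}
  = \arg\max_{\theta}\,\bigl[\log p(\mathcal{D}\mid\theta) + \log p(\theta)\bigr]
  = \arg\min_{\theta}\,\Bigl[-\ell(\theta) + \tfrac{1}{2\sigma^2}\,\|\theta\|_2^2\Bigr],
% i.e. L2 regularization with \lambda = 1/(2\sigma^2); a Laplace prior gives L1 analogously.
```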
Takeaways

1. Nonlinear basis functions allow linear models (e.g. Linear Regression, Logistic Regression) to capture nonlinear aspects of the original input.
2. Nonlinear features require no changes to the model (i.e. just preprocessing).
3. Regularization helps to avoid overfitting.
4. Regularization and MAP estimation are equivalent for appropriately chosen priors.
Feature Engineering / Regularization Objectives

You should be able to…
• Engineer appropriate features for a new task
• Use feature selection techniques to identify and remove irrelevant features
• Identify when a model is overfitting
• Add a regularizer to an existing objective in order to combat overfitting
• Explain why we should not regularize the bias term
• Convert a linearly inseparable dataset to a linearly separable dataset in higher dimensions
• Describe feature engineering in common application areas