
Basics of Machine Learning

COMPSCI 527 — Computer Vision


Outline

1 Classification and Regression

2 Loss and Risk

3 Training, Validation, Testing

4 The State of the Art of Image Classification


Classification and Regression

A First Classification Problem

• MNIST handwritten digit recognition (60,000 images, labeled, curated)

• 28 × 28 pixel black-and-white images of individual digits
• What is the label y ∈ Y = {0, 1, . . . , 9} for a given input image x?
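As a concrete sketch of what this data looks like (a minimal example, assuming TensorFlow/Keras is installed and can download MNIST; the variable names are illustrative):

# Minimal sketch: inspect the MNIST training data and its label set.
# Assumes tensorflow.keras is installed and can download the dataset.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)                  # (60000, 28, 28): 60,000 images, 28 x 28 pixels each
print(y_train.shape)                  # (60000,): one label per image
print(sorted(set(y_train.tolist())))  # [0, 1, ..., 9]: the label set Y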


Supervised Machine Learning

• Supervised machine learning is the problem of learning a function y = h(x) : X ⊆ R^d → Y ⊆ R from sample input/output pairs (x, y)
• “Supervised” means that the samples are provided
• Depending on the problem, h may map an image, an image window, or a set of images x to
  • A yes or no answer to the question “Is this a [person, car, cat]?”: Y = {yes, no} for object detection
  • A category out of a small set: Y = {0, 1, . . . , 9} for digit recognition
  • A category out of a large set: Y = {person, car, . . . , tree} for object recognition
  • A number or small vector of numbers: Y = R^5 for camera motion
  • A whole field (array) of numbers: Y = R^(2×1000×1000) for image motion estimation


Classification and Regression

• Two types of supervised machine learning problems:
  • Classification: Y is categorical, i.e., finite and unordered (for digit recognition, the order in Y = {0, 1, . . . , 9} does not matter); y is then called a label
    Examples: object detection, object recognition, foreground/background segregation
  • Regression: Y = R^e; y is then called a value or response
    Examples: camera motion estimation, depth from stereo, image motion estimation, object tracking


Is This Just Data Fitting?

• Supervised machine learning is the problem of learning a function y = h(x) : X ⊆ R^d → Y ⊆ R from sample input/output pairs (x, y)

[Figure: sample data points with least-squares polynomial fits of degree k = 1, 3, and 9]
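A minimal sketch of this kind of least-squares polynomial fitting (the data below is synthetic; the degrees match the k values in the figure):

# Sketch: fit polynomials of increasing degree k to the same noisy samples.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 12)                          # sample inputs on [0, 10]
y = 5 + 2 * np.sin(x) + rng.normal(0, 0.5, x.size)  # noisy sample outputs

for k in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=k)                # least-squares fit of degree k
    y_hat = np.polyval(coeffs, x)
    print(k, np.mean((y - y_hat) ** 2))             # training loss shrinks as k grows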


Loss and Risk

Loss

• To know if the learned function h does well on a sample (x, y), we need to measure how far the value it predicts for x is from the true value y
• The loss is a measure of the discrepancy between y and ŷ = h(x)
• The loss function maps pairs (ŷ, y) to real values: ℓ : Y × Y → R
• Simplest loss for classification: the zero-one loss or misclassification loss
  ℓ(ŷ, y) = I(ŷ ≠ y) = 1 if ŷ ≠ y, 0 if ŷ = y
• Simplest loss for regression: the quadratic loss ℓ(ŷ, y) = (ŷ − y)²
• Different problems call for different measures of loss
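A direct transcription of these two losses as code (a minimal sketch; the function names are mine):

# Zero-one (misclassification) loss and quadratic loss, as defined above.
def zero_one_loss(y_hat, y):
    return 1 if y_hat != y else 0        # 1 if the prediction is wrong, 0 otherwise

def quadratic_loss(y_hat, y):
    return (y_hat - y) ** 2              # squared difference for regression

print(zero_one_loss(3, 3), zero_one_loss(3, 7))  # 0 1
print(quadratic_loss(2.0, 3.5))                  # 2.25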


Empirical Risk

• Given a training set T = {(x1, y1), . . . , (xN, yN)} with xn ∈ X and yn ∈ Y, a loss function ℓ, and a set H of allowed functions (e.g., polynomials), the empirical risk or training error is the average loss on T (sketched in code below):
  L_T(h) = (1/N) Σ_{n=1}^{N} ℓ(yn, h(xn))
• This is what we minimize in a data fitting problem: estimate ĥ ∈ argmin_{h∈H} L_T(h)
• H is called the hypothesis space
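A minimal sketch of this average (the function and variable names are illustrative):

# Empirical risk L_T(h): average loss over the training pairs in T.
def empirical_risk(h, T, loss):
    return sum(loss(y, h(x)) for x, y in T) / len(T)

# Example: zero-one loss with a constant predictor on a toy training set.
T = [(0.2, 1), (0.7, 0), (0.9, 1)]
h = lambda x: 1
print(empirical_risk(h, T, lambda y, y_hat: int(y != y_hat)))  # 0.333...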


Machine Learning and the Statistical Risk

• Given a training set T = {(x1, y1), . . . , (xN, yN)} with xn ∈ X and yn ∈ Y, and a loss function ℓ
• Empirical risk: L_T(h) = (1/N) Σ_{n=1}^{N} ℓ(yn, h(xn))
• Data fitting: estimate ĥ ∈ argmin_{h∈H} L_T(h)
• In machine learning, we go much farther: we also want h to do well on previously unseen inputs
• Key assumption: All data (training and otherwise) comes from the same joint probability distribution p(x, y)
• p is unknown and cannot be estimated. However, we can still aim at minimizing the statistical risk or expected risk or generalization error L_p(h) = E_p[ℓ(y, h(x))]
• We can estimate L_p(h) (but not p) by using a sufficiently large test set S = {(x1, y1), . . . , (xM, yM)}, disjoint from T
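A minimal sketch of this idea: the average loss on a large sample drawn from p approximates the expected loss (the distribution and predictor below are synthetic, purely for illustration):

# Estimate the statistical risk L_p(h) of a fixed predictor h with a large sample.
import numpy as np

rng = np.random.default_rng(1)

def sample_p(m):
    x = rng.uniform(0, 1, m)          # synthetic inputs
    y = (x > 0.5).astype(int)         # synthetic true labels
    return x, y

h = lambda x: (x > 0.4).astype(int)   # some fixed predictor, not tuned on this data

x_s, y_s = sample_p(100_000)          # large held-out sample, disjoint from training
print(np.mean(h(x_s) != y_s))         # empirical zero-one risk, close to the true 0.1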


The Digit Recognition Problem Revisited

• MNIST handwritten digit recognition (60,000 images, labeled, curated)
• X = set of all 28 × 28 pixel black-and-white images
• Y = {0, 1, . . . , 9}
• ℓ is the zero-one loss
• H depends on the classification method used: Fit a multivariate polynomial of a certain degree to T and then round the results? All neural networks of a certain depth?
• Learning: Fit h ∈ H to T, but do well on a separate set
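A minimal end-to-end sketch of this recipe, using scikit-learn's small 8 × 8 digits set as a stand-in for MNIST so it runs without downloads (assumes scikit-learn is installed; the choice of classifier is mine):

# Fit a classifier h to a training set T and report zero-one risk on a separate set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                  # small stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print((h.predict(X_test) != y_test).mean())           # average zero-one loss on held-out data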


Training, Validation, Testing

Parameters and Hyper-Parameters

• The example for the MNIST problem had
  • parameters (polynomial coefficients, weights in the neural network)
  • hyper-parameters (maximum degree of the polynomial, number of layers in the neural network)
• What is the difference between parameters and hyper-parameters?
• If we were allowed to optimize the hyper-parameters on the training set, we would get trivial solutions with zero empirical loss
• Examples: Fit a degree N − 1 polynomial to N training samples, or make a huge neural network (a quick numerical check follows this list)
• Hyper-parameters need to be chosen separately from training
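A quick numerical check of the degree N − 1 example (synthetic data, purely for illustration):

# A degree N-1 polynomial interpolates N training points exactly, so choosing the
# degree (a hyper-parameter) on the training set drives the training loss to zero.
import numpy as np

rng = np.random.default_rng(2)
N = 8
x = np.linspace(0, 10, N)
y = rng.normal(0, 1, N)                 # arbitrary training outputs

coeffs = np.polyfit(x, y, deg=N - 1)    # degree N-1 fit through N points
print(np.max(np.abs(np.polyval(coeffs, x) - y)))   # ~0: zero empirical loss, whatever the data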


Underfitting and Overfitting

• Being too stingy with hyper-parameters (H too small) makes the learner do poorly both on T and on previously unseen data: underfitting
• Being too generous with hyper-parameters (H too large) makes the learner do well on T, but not on previously unseen data: overfitting


Example: Polynomial Fitting

[Figure: least-squares polynomial fits of degree k = 1, 2, 3, 6, and 9 to the same training samples]


Validation

• How to find the Goldilocks hyper-parameters?
• For finite hyper-parameter sets: Try them all, train each on T, use a separate data set V to see which hyper-parameter does best
• The separation of T and V (validation set) is crucial
• We don’t allow V to “peek at the solution”
• For infinite hyper-parameter sets: Find the first local minimum by binary search


Example: Polynomial Fitting

[Figures: polynomial fits of degree k = 1, 2, 3, 6, and 9, and a plot of training risk and validation risk as a function of the degree k]
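A minimal sketch that reproduces the shape of these curves (synthetic data and degrees, chosen for illustration):

# Training risk decreases with the degree k, while validation risk eventually rises.
import numpy as np

rng = np.random.default_rng(3)

def make_set(n):
    x = np.linspace(0, 10, n)
    y = 5 + 2 * np.sin(x) + rng.normal(0, 0.5, n)
    return x, y

x_t, y_t = make_set(12)     # training set T
x_v, y_v = make_set(50)     # validation set V

for k in range(1, 10):
    c = np.polyfit(x_t, y_t, deg=k)
    lt = np.mean((np.polyval(c, x_t) - y_t) ** 2)   # training risk
    lv = np.mean((np.polyval(c, x_v) - y_v) ** 2)   # validation risk
    print(k, round(lt, 3), round(lv, 3))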


Supervised Learning

• Given
  • A training set T = {(x1, y1), . . . , (xN, yN)} with xn ∈ X and yn ∈ Y
  • A hypothesis space, i.e., a set of functions H = {h : X → Y}
  • A loss function ℓ : Y × Y → R
• Estimate ĥ ∈ argmin_{h∈H} L_p(h), where L_p(h) = E_p[ℓ(y, h(x))]
• How can we estimate L_p(h) (so that we can estimate hyper-parameters and the optimal estimator ĥ) if we cannot estimate p?
• Use a validation set V separate from T


Validation

• Loop on hyper-parameters π ∈ Π, train on T, validate on V:

procedure VALIDATION(H, Π, T, V, ℓ)
    L̂ = ∞                              ▷ Stores the best risk so far on V
    for π ∈ Π do
        h ∈ argmin_{h′ ∈ H_π} L_T(h′)   ▷ Use loss ℓ to compute the best predictor on T
        L = L_V(h)                      ▷ Use loss ℓ to evaluate the predictor’s risk on V
        if L < L̂ then
            (π̂, ĥ, L̂) = (π, h, L)       ▷ Keep track of the best hyper-parameters, predictor, and risk
        end if
    end for
    return (π̂, ĥ, L̂)                   ▷ Return the best hyper-parameters, predictor, and risk estimate
end procedure
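A minimal Python sketch of the same loop, with the polynomial degree as the hyper-parameter π and the quadratic loss (the data and function names are illustrative):

# VALIDATION loop: for each hyper-parameter (degree k), fit on T, score on V,
# and keep the best (hyper-parameter, predictor, risk) triple.
import numpy as np

def fit(degree, T):
    x, y = T
    return np.polyfit(x, y, deg=degree)             # best predictor in H_pi on T

def risk(h, S):
    x, y = S
    return np.mean((np.polyval(h, x) - y) ** 2)     # empirical quadratic risk on S

def validation(degrees, T, V):
    best = (None, None, np.inf)                     # (pi_hat, h_hat, L_hat)
    for k in degrees:
        h = fit(k, T)
        L = risk(h, V)
        if L < best[2]:
            best = (k, h, L)
    return best

rng = np.random.default_rng(4)
x_t = np.linspace(0, 10, 12); y_t = np.sin(x_t) + rng.normal(0, 0.3, 12)
x_v = np.linspace(0, 10, 40); y_v = np.sin(x_v) + rng.normal(0, 0.3, 40)
k_best, h_best, L_best = validation(range(1, 10), (x_t, y_t), (x_v, y_v))
print(k_best, L_best)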


Testing

• A machine learning system has been trained, using both T and V, to yield ĥ
• We cannot report L_V(ĥ) as the measure of performance
• The set V is tainted, since we used it during training
• Performance measures are accepted only on pristine sets, not used in any way for training
• We need to test the system on a third set S, the test set
• Estimate the true risk L_p(ĥ) = E_p[ℓ(y, ĥ(x))] by computing the empirical risk L_S(ĥ) = (1/|S|) Σ_{n=1}^{|S|} ℓ(yn, ĥ(xn)) on S
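A minimal sketch of this final step (toy data; the predictor here is arbitrary and assumed already frozen):

# The reported performance is the average loss on a test set S that was never
# touched during training or validation.
import numpy as np

def test_risk(h, S, loss):
    x, y = S
    return np.mean([loss(yi, h(xi)) for xi, yi in zip(x, y)])

h_hat = lambda x: round(x)                                      # some frozen predictor
S = (np.array([0.1, 0.6, 1.4, 1.6]), np.array([0, 1, 1, 1]))    # toy test set
print(test_risk(h_hat, S, lambda y, y_pred: int(y != y_pred)))  # zero-one risk on S: 0.25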


Summary of Sets Involved

• A training set T to train the predictor given a specific set of hyper-parameters (if any)
• A validation set V to choose good hyper-parameters
• A test set S to evaluate the generalization performance of the predictor ĥ learned by training on T and validating on V
• Resampling techniques (“cross-validation”) exist for making the same set play the role of both T and V (a sketch follows this list)
• S must still be entirely separate
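A minimal sketch of this resampling idea (assumes scikit-learn; the model and dataset are only for illustration):

# Cross-validation: the same data plays the role of T and V across folds,
# while a separate test set S stays untouched until the very end.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=2000)
print(cross_val_score(model, X_tv, y_tv, cv=5).mean())   # validation accuracy, averaged over folds
print(model.fit(X_tv, y_tv).score(X_test, y_test))       # final accuracy on the untouched test set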


The State of the Art of Image Classification

• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Based on ImageNet, 1.4 million images, 1000 categories (Fei-Fei Li, Stanford)
• Three different competitions:
  • Classification: One label per image; 1.2M images available for training, 50k for validation, 100k withheld for testing; zero-one loss for performance evaluation
  • Localization: Classification, plus a bounding box on one instance; correct if ≥ 50% overlap with the true box
  • Detection: Same as localization, but find every instance in the image; measure the fraction of mistakes (false positives, false negatives)


[Image from Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, Int’l. J. Comp. Vision 115:211-252, 2015]


Difficulties of ILSVRC

• Images are “natural.” Arbitrary backgrounds, different sizes, viewpoints, lighting. Partially visible objects
• 1,000 categories, subtle distinctions. Example: Siberian husky and Eskimo dog
• Variations of appearance within one category can be significant (how many lamps can you think of?)
• What is the label of one image? For instance, a picture of a group of people examining a fishing rod was labeled as “reel”


Performance for Image Classification

• 2010: 28.2 percent classification error
• 2017: 2.3 percent (ensemble of several deep networks)
• Improvement results from both architectural insights (residuals, squeeze-and-excitation networks, ...) and persistent engineering
• There is a whole book on the “tricks of the trade” in deep learning!
• We will see some after studying the basics
