
Basics of Machine Learning

COMPSCI 527 — Computer Vision


Outline

1 Classification and Regression

2 Loss and Risk

3 Training, Validation, Testing

4 The State of the Art of Image Classification


Classification and Regression

A First Classification Problem

• MNIST handwritten digit recognition (60,000 images, labeled, curated)

• 28 × 28 pixel black-and-white images of individual digits
• What is the label y ∈ Y = {0, 1, . . . , 9} for a given input image x?
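As a concrete sketch of what this data looks like (a minimal example, assuming TensorFlow/Keras is installed and can download MNIST; the variable names are illustrative):

# Minimal sketch: inspect the MNIST training data and its label set.
# Assumes tensorflow.keras is installed and can download the dataset.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)                  # (60000, 28, 28): 60,000 images, 28 x 28 pixels each
print(y_train.shape)                  # (60000,): one label per image
print(sorted(set(y_train.tolist())))  # [0, 1, ..., 9]: the label set Y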


Supervised Machine Learning

• Supervised machine learning is the problem of learning a function y = h(x) : X ⊆ R^d → Y ⊆ R from sample input/output pairs (x, y)
• “Supervised” means that the samples are provided
• Depending on the problem, h may map an image, an image window, or a set of images x to
  • A yes or no answer to the question “Is this a [person, car, cat]?”: Y = {yes, no} for object detection
  • A category out of a small set: Y = {0, 1, . . . , 9} for digit recognition
  • A category out of a large set: Y = {person, car, . . . , tree} for object recognition
  • A number or small vector of numbers: Y = R^5 for camera motion
  • A whole field (array) of numbers: Y = R^(2×1000×1000) for image motion estimation


Classification and Regression

• Two types of supervised machine learning problems:
  • Classification: Y is categorical, i.e., finite and unordered (for digit recognition, the order in Y = {0, 1, . . . , 9} does not matter); y is then called a label
    Examples: object detection, object recognition, foreground/background segregation
  • Regression: Y = R^e; y is then called a value or response
    Examples: camera motion estimation, depth from stereo, image motion estimation, object tracking


Is This Just Data Fitting?

• Supervised machine learning is the problem of learning a function y = h(x) : X ⊆ R^d → Y ⊆ R from sample input/output pairs (x, y)

[Figure: sample data points with least-squares polynomial fits of degree k = 1, 3, and 9]
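A minimal sketch of this kind of least-squares polynomial fitting (the data below is synthetic; the degrees match the k values in the figure):

# Sketch: fit polynomials of increasing degree k to the same noisy samples.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 12)                          # sample inputs on [0, 10]
y = 5 + 2 * np.sin(x) + rng.normal(0, 0.5, x.size)  # noisy sample outputs

for k in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=k)                # least-squares fit of degree k
    y_hat = np.polyval(coeffs, x)
    print(k, np.mean((y - y_hat) ** 2))             # training loss shrinks as k grows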


Loss and Risk

Loss

• To know if the learned function h does well on a sample (x, y), we need to measure how far the value it predicts for x is from the true value y
• The loss is a measure of the discrepancy between y and ŷ = h(x)
• The loss function maps pairs (ŷ, y) to real values: ℓ : Y × Y → R
• Simplest loss for classification: the zero-one loss or misclassification loss
  ℓ(ŷ, y) = I(ŷ ≠ y) = 1 if ŷ ≠ y, 0 if ŷ = y
• Simplest loss for regression: the quadratic loss ℓ(ŷ, y) = (ŷ − y)²
• Different problems call for different measures of loss
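A direct transcription of these two losses as code (a minimal sketch; the function names are mine):

# Zero-one (misclassification) loss and quadratic loss, as defined above.
def zero_one_loss(y_hat, y):
    return 1 if y_hat != y else 0        # 1 if the prediction is wrong, 0 otherwise

def quadratic_loss(y_hat, y):
    return (y_hat - y) ** 2              # squared difference for regression

print(zero_one_loss(3, 3), zero_one_loss(3, 7))  # 0 1
print(quadratic_loss(2.0, 3.5))                  # 2.25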


Empirical Risk

• Given a training set T = {(x1, y1), . . . , (xN, yN)} with xn ∈ X and yn ∈ Y, a loss function ℓ, and a set H of allowed functions (e.g., polynomials), the empirical risk or training error is the average loss on T (sketched in code below):
  L_T(h) = (1/N) Σ_{n=1}^{N} ℓ(yn, h(xn))
• This is what we minimize in a data fitting problem: estimate ĥ ∈ argmin_{h∈H} L_T(h)
• H is called the hypothesis space
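A minimal sketch of this average (the function and variable names are illustrative):

# Empirical risk L_T(h): average loss over the training pairs in T.
def empirical_risk(h, T, loss):
    return sum(loss(y, h(x)) for x, y in T) / len(T)

# Example: zero-one loss with a constant predictor on a toy training set.
T = [(0.2, 1), (0.7, 0), (0.9, 1)]
h = lambda x: 1
print(empirical_risk(h, T, lambda y, y_hat: int(y != y_hat)))  # 0.333...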


Machine Learning and the Statistical Risk

• Given a training set T = {(x1, y1), . . . , (xN, yN)} with xn ∈ X and yn ∈ Y, and a loss function ℓ
• Empirical risk: L_T(h) = (1/N) Σ_{n=1}^{N} ℓ(yn, h(xn))
• Data fitting: estimate ĥ ∈ argmin_{h∈H} L_T(h)
• In machine learning, we go much farther: we also want h to do well on previously unseen inputs
• Key assumption: All data (training and otherwise) comes from the same joint probability distribution p(x, y)
• p is unknown and cannot be estimated. However, we can still aim at minimizing the statistical risk or expected risk or generalization error L_p(h) = E_p[ℓ(y, h(x))]
• We can estimate L_p(h) (but not p) by using a sufficiently large test set S = {(x1, y1), . . . , (xM, yM)}, disjoint from T
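A minimal sketch of this idea: the average loss on a large sample drawn from p approximates the expected loss (the distribution and predictor below are synthetic, purely for illustration):

# Estimate the statistical risk L_p(h) of a fixed predictor h with a large sample.
import numpy as np

rng = np.random.default_rng(1)

def sample_p(m):
    x = rng.uniform(0, 1, m)          # synthetic inputs
    y = (x > 0.5).astype(int)         # synthetic true labels
    return x, y

h = lambda x: (x > 0.4).astype(int)   # some fixed predictor, not tuned on this data

x_s, y_s = sample_p(100_000)          # large held-out sample, disjoint from training
print(np.mean(h(x_s) != y_s))         # empirical zero-one risk, close to the true 0.1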


The Digit Recognition Problem Revisited

• MNIST handwritten digit recognition (60,000 images, labeled, curated)
• X = set of all 28 × 28 pixel black-and-white images
• Y = {0, 1, . . . , 9}
• ℓ is the zero-one loss
• H depends on the classification method used: Fit a multivariate polynomial of a certain degree to T and then round the results? All neural networks of a certain depth?
• Learning: Fit h ∈ H to T, but do well on a separate set
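A minimal end-to-end sketch of this recipe, using scikit-learn's small 8 × 8 digits set as a stand-in for MNIST so it runs without downloads (assumes scikit-learn is installed; the choice of classifier is mine):

# Fit a classifier h to a training set T and report zero-one risk on a separate set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)                  # small stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print((h.predict(X_test) != y_test).mean())           # average zero-one loss on held-out data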


Training, Validation, Testing

Parameters and Hyper-Parameters

• The example for the MNIST problem had
  • parameters (polynomial coefficients, weights in the neural network)
  • hyper-parameters (maximum degree of the polynomial, number of layers in the neural network)
• What is the difference between parameters and hyper-parameters?
• If we were allowed to optimize the hyper-parameters on the training set, we would get trivial solutions with zero empirical loss
• Examples: Fit a degree N − 1 polynomial to N training samples, or make a huge neural network (a quick numerical check follows this list)
• Hyper-parameters need to be chosen separately from training
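A quick numerical check of the degree N − 1 example (synthetic data, purely for illustration):

# A degree N-1 polynomial interpolates N training points exactly, so choosing the
# degree (a hyper-parameter) on the training set drives the training loss to zero.
import numpy as np

rng = np.random.default_rng(2)
N = 8
x = np.linspace(0, 10, N)
y = rng.normal(0, 1, N)                 # arbitrary training outputs

coeffs = np.polyfit(x, y, deg=N - 1)    # degree N-1 fit through N points
print(np.max(np.abs(np.polyval(coeffs, x) - y)))   # ~0: zero empirical loss, whatever the data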


Underfitting and Overfitting

• Being too stingy with hyper-parameters (H too small) makes the learner do poorly both on T and on previously unseen data: underfitting
• Being too generous with hyper-parameters (H too large) makes the learner do well on T, but not on previously unseen data: overfitting


Example: Polynomial Fitting

[Figure: least-squares polynomial fits of degree k = 1, 2, 3, 6, and 9 to the same training samples]


Validation

• How to find the Goldilocks hyper-parameters?
• For finite hyper-parameter sets: Try them all, train each on T, use a separate data set V to see which hyper-parameter does best
• The separation of T and V (validation set) is crucial
• We don’t allow V to “peek at the solution”
• For infinite hyper-parameter sets: Find the first local minimum by binary search


Example: Polynomial Fitting

[Figures: polynomial fits of degree k = 1, 2, 3, 6, and 9, and a plot of training risk and validation risk as a function of the degree k]
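A minimal sketch that reproduces the shape of these curves (synthetic data and degrees, chosen for illustration):

# Training risk decreases with the degree k, while validation risk eventually rises.
import numpy as np

rng = np.random.default_rng(3)

def make_set(n):
    x = np.linspace(0, 10, n)
    y = 5 + 2 * np.sin(x) + rng.normal(0, 0.5, n)
    return x, y

x_t, y_t = make_set(12)     # training set T
x_v, y_v = make_set(50)     # validation set V

for k in range(1, 10):
    c = np.polyfit(x_t, y_t, deg=k)
    lt = np.mean((np.polyval(c, x_t) - y_t) ** 2)   # training risk
    lv = np.mean((np.polyval(c, x_v) - y_v) ** 2)   # validation risk
    print(k, round(lt, 3), round(lv, 3))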


Supervised Learning

• Given
  • A training set T = {(x1, y1), . . . , (xN, yN)} with xn ∈ X and yn ∈ Y
  • A hypothesis space, i.e., a set of functions H = {h : X → Y}
  • A loss function ℓ : Y × Y → R
• Estimate ĥ ∈ argmin_{h∈H} L_p(h), where L_p(h) = E_p[ℓ(y, h(x))]
• How can we estimate L_p(h) (so that we can estimate hyper-parameters and the optimal estimator ĥ) if we cannot estimate p?
• Use a validation set V separate from T


Validation

• Loop on hyper-parameters π ∈ Π, train on T, validate on V:

procedure VALIDATION(H, Π, T, V, ℓ)
    L̂ = ∞                              ▷ Stores the best risk so far on V
    for π ∈ Π do
        h ∈ argmin_{h′ ∈ H_π} L_T(h′)   ▷ Use loss ℓ to compute the best predictor on T
        L = L_V(h)                      ▷ Use loss ℓ to evaluate the predictor’s risk on V
        if L < L̂ then
            (π̂, ĥ, L̂) = (π, h, L)       ▷ Keep track of the best hyper-parameters, predictor, and risk
        end if
    end for
    return (π̂, ĥ, L̂)                   ▷ Return the best hyper-parameters, predictor, and risk estimate
end procedure
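A minimal Python sketch of the same loop, with the polynomial degree as the hyper-parameter π and the quadratic loss (the data and function names are illustrative):

# VALIDATION loop: for each hyper-parameter (degree k), fit on T, score on V,
# and keep the best (hyper-parameter, predictor, risk) triple.
import numpy as np

def fit(degree, T):
    x, y = T
    return np.polyfit(x, y, deg=degree)             # best predictor in H_pi on T

def risk(h, S):
    x, y = S
    return np.mean((np.polyval(h, x) - y) ** 2)     # empirical quadratic risk on S

def validation(degrees, T, V):
    best = (None, None, np.inf)                     # (pi_hat, h_hat, L_hat)
    for k in degrees:
        h = fit(k, T)
        L = risk(h, V)
        if L < best[2]:
            best = (k, h, L)
    return best

rng = np.random.default_rng(4)
x_t = np.linspace(0, 10, 12); y_t = np.sin(x_t) + rng.normal(0, 0.3, 12)
x_v = np.linspace(0, 10, 40); y_v = np.sin(x_v) + rng.normal(0, 0.3, 40)
k_best, h_best, L_best = validation(range(1, 10), (x_t, y_t), (x_v, y_v))
print(k_best, L_best)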


Testing

• A machine learning system has been trained, using both T and V, to yield ĥ
• We cannot report L_V(ĥ) as the measure of performance
• The set V is tainted, since we used it during training
• Performance measures are accepted only on pristine sets, not used in any way for training
• We need to test the system on a third set S, the test set
• Estimate the true risk L_p(ĥ) = E_p[ℓ(y, ĥ(x))] by computing the empirical risk L_S(ĥ) = (1/|S|) Σ_{n=1}^{|S|} ℓ(yn, ĥ(xn)) on S
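A minimal sketch of this final step (toy data; the predictor here is arbitrary and assumed already frozen):

# The reported performance is the average loss on a test set S that was never
# touched during training or validation.
import numpy as np

def test_risk(h, S, loss):
    x, y = S
    return np.mean([loss(yi, h(xi)) for xi, yi in zip(x, y)])

h_hat = lambda x: round(x)                                      # some frozen predictor
S = (np.array([0.1, 0.6, 1.4, 1.6]), np.array([0, 1, 1, 1]))    # toy test set
print(test_risk(h_hat, S, lambda y, y_pred: int(y != y_pred)))  # zero-one risk on S: 0.25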


Summary of Sets Involved

• A training set T to train the predictor given a specific set of hyper-parameters (if any)
• A validation set V to choose good hyper-parameters
• A test set S to evaluate the generalization performance of the predictor ĥ learned by training on T and validating on V
• Resampling techniques (“cross-validation”) exist for making the same set play the role of both T and V (a sketch follows this list)
• S must still be entirely separate
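A minimal sketch of this resampling idea (assumes scikit-learn; the model and dataset are only for illustration):

# Cross-validation: the same data plays the role of T and V across folds,
# while a separate test set S stays untouched until the very end.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=2000)
print(cross_val_score(model, X_tv, y_tv, cv=5).mean())   # validation accuracy, averaged over folds
print(model.fit(X_tv, y_tv).score(X_test, y_test))       # final accuracy on the untouched test set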


The State of the Art of Image Classification

• ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
• Based on ImageNet, 1.4 million images, 1000 categories (Fei-Fei Li, Stanford)
• Three different competitions:
  • Classification: One label per image; 1.2M images available for training, 50k for validation, 100k withheld for testing; zero-one loss for performance evaluation
  • Localization: Classification, plus a bounding box on one instance; correct if ≥ 50% overlap with the true box
  • Detection: Same as localization, but find every instance in the image; measure the fraction of mistakes (false positives, false negatives)


[Image from Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, Int’l. J. Comp. Vision 115:211-252, 2015]


Difficulties of ILSVRC

• Images are “natural.” Arbitrary backgrounds, different sizes, viewpoints, lighting. Partially visible objects
• 1,000 categories, subtle distinctions. Example: Siberian husky and Eskimo dog
• Variations of appearance within one category can be significant (how many lamps can you think of?)
• What is the label of one image? For instance, a picture of a group of people examining a fishing rod was labeled as “reel”


Performance for Image Classification

• 2010: 28.2 percent classification error
• 2017: 2.3 percent (ensemble of several deep networks)
• Improvement results from both architectural insights (residuals, squeeze-and-excitation networks, ...) and persistent engineering
• There is a whole book on the “tricks of the trade” in deep learning!
• We will see some after studying the basics
