Page 1: Review of fundamentals 1

Data mining in 1D: curve fitting by LLS
Approximation-generalization tradeoff
First homework assignment

Page 2: Review: Concepts from fundamentals 1

Define the following:

• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Generalization
• Hypothesis set
• Ein(h|X)
• h_opt = argmin_h Ein(h|X)
• Eout(h_opt)
• Etest(h_opt)
• Version space
• Margins
• Support vectors

Page 3: Review: Concepts from fundamentals 1 (continued)

Define the following:

• H shatters N points
• VC dimension
• Break point

Page 4: Review: Question about VC dimension

The VC dimension of a linear dichotomizer in 2D is 3.

• What does 2D mean?
• What does dichotomizer mean?
• What does linear dichotomizer mean?
• What does a VC dimension of 3 mean for the linear dichotomizer in 2D?
• Why is 4 a break point for the linear dichotomizer in 2D?

Page 5: Examples of family cars

(Slides adapted from Lecture Notes for E. Alpaydın, Introduction to Machine Learning, 2e, © 2010 The MIT Press, V1.0.)

An expert on family cars has given us 100× more data, with engine power measured by a standard test. The data contain used cars.

• Use these data to find a relationship between engine power and the price of a family car.
• Interpret your training data in terms of p(x, y) = p(x) p(y|x).

Page 6: Examples of family cars

Take your 500 family-car data points and let x = engine power and y = price.

• How do you use these data to find the dependence of price on engine power?
• How do you estimate p(x) and p(y|x)?

Page 7: Examples of family cars

How do you estimate p(x) and p(y|x)?

• Set up a bin structure in the x variable. What is a good choice of bin width?
• Assign the 500 y values to bins according to their x value.
• Define a new x variable as the bin center.
• What is p(x)? What is p(y|x)?
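A minimal NumPy sketch of this binning recipe; the synthetic engine-power/price arrays and the 20-bin choice are stand-in assumptions, not values from the slides.

```python
import numpy as np

# Minimal sketch of the binning recipe; the synthetic data and the
# 20-bin choice are stand-in assumptions, not values from the slides.
rng = np.random.default_rng(0)
x = rng.uniform(50, 300, 500)                    # stand-in engine power
y = 100 + 0.5 * x + rng.normal(0, 10, 500)       # stand-in price

n_bins = 20                                      # bin width is a tuning choice
edges = np.linspace(x.min(), x.max(), n_bins + 1)
centers = 0.5 * (edges[:-1] + edges[1:])         # new x variable: bin centers
idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

counts = np.bincount(idx, minlength=n_bins)
p_x = counts / counts.sum()                      # p(x): fraction of points per bin

# p(y|x): the distribution of y within each bin, summarized here by
# its mean and standard deviation
y_mean = np.array([y[idx == b].mean() if counts[b] else np.nan
                   for b in range(n_bins)])
y_std = np.array([y[idx == b].std() if counts[b] else np.nan
                  for b in range(n_bins)])
```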

Page 8: Curve fitting: “regression” in 1D

• Regression can have any number of attributes.
• The label on examples is always a number.

Page 9: Fit a parabola to data

• The “target function” is the “trend” in the data.
• Scatter around the trend is interpreted as noise.
• H in this case is the set of all 2nd-degree polynomials.
• Select the best member of H by minimizing the sum of squared residuals.

Page 10: Finding the best member of H by calculus

$$E_{in}(g\,|\,\mathcal{X}) = \sum_{t=1}^{N}\left[r^{t} - g(x^{t})\right]^{2}$$

• Take derivatives of Ein(g) with respect to the coefficients of the parabola (collectively called θ) and set them equal to zero.
• Solve the resulting 3×3 linear system.
• Generalize to any degree of polynomial using matrix algebra.
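A minimal sketch of this calculus route for the parabola, with toy data as a stand-in assumption: setting each derivative of Ein to zero yields the 3×3 system built below.

```python
import numpy as np

# Minimal sketch: fit a parabola g(x) = q0 + q1*x + q2*x^2 by calculus.
# Setting dE_in/dq_i = 0 gives sum_j (sum_t x^(i+j)) q_j = sum_t x^i * r,
# a 3x3 linear system in the coefficients.
rng = np.random.default_rng(1)                     # toy data (an assumption)
x = np.linspace(0, 5, 50)
r = 1.0 + 2.0 * x - 0.4 * x**2 + rng.normal(0, 0.3, x.size)

M = np.array([[np.sum(x ** (i + j)) for j in range(3)] for i in range(3)])
v = np.array([np.sum(x**i * r) for i in range(3)])
q = np.linalg.solve(M, v)                          # q0, q1, q2
print(q)                                           # close to (1.0, 2.0, -0.4)
```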

Page 11: Polynomial regression by linear least squares

• Assume g(x|θ) is a polynomial of degree n−1 (i.e., a linear combination of 1, x, x^2, …, x^{n−1}).
• m = number of examples (x^t, r^t) in the training set.
• Define the m×n matrix A, where A_ij is the jth basis function evaluated at the ith training point: A_ij = (x^i)^{j−1}.
• θ: column vector of the n unknown coefficients.
• b: column vector of the m label values r^t in the training set.
• If Aθ = b has a solution, then g(x^t|θ) = r^t for all i. Not what we want; why?
• With n << m, Aθ = b has no exact solution.

Page 12: Normal equations

• Look for an approximate solution that minimizes the Euclidean norm of the residual vector r = b − Aθ.
• Define f(θ) = ||r||² = rᵀr, so

$$f(\theta) = (b - A\theta)^{T}(b - A\theta) = b^{T}b - 2\theta^{T}A^{T}b + \theta^{T}A^{T}A\theta$$

• A necessary condition for θ₀ to be a minimum of f(θ) is ∇f(θ₀) = 0, where

$$\nabla f(\theta) = 2A^{T}A\theta - 2A^{T}b$$

• The optimal set of parameters is therefore a solution of the n×n symmetric system of linear equations AᵀAθ = Aᵀb: the “normal equations.”

Page 13: Polynomial regression: degree k with N data points

$$g(x^{t}\,|\,w_{k},\ldots,w_{2},w_{1},w_{0}) = w_{k}(x^{t})^{k} + \cdots + w_{2}(x^{t})^{2} + w_{1}x^{t} + w_{0}$$

$$D = \begin{bmatrix}
1 & x^{1} & (x^{1})^{2} & \cdots & (x^{1})^{k} \\
1 & x^{2} & (x^{2})^{2} & \cdots & (x^{2})^{k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x^{N} & (x^{N})^{2} & \cdots & (x^{N})^{k}
\end{bmatrix},
\qquad
\mathbf{r} = \begin{bmatrix} r^{1} \\ r^{2} \\ \vdots \\ r^{N} \end{bmatrix}$$

Solve DᵀDw = Dᵀr for the k+1 coefficients.
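As a concrete sketch of this solve (assuming NumPy; np.vander builds the matrix D defined above):

```python
import numpy as np

def polyfit_lls(x, r, k):
    """Degree-k LLS fit: solve the normal equations D^T D w = D^T r."""
    D = np.vander(x, k + 1, increasing=True)   # columns 1, x, x^2, ..., x^k
    w = np.linalg.solve(D.T @ D, D.T @ r)      # k+1 coefficients w0..wk
    return w
```

In practice np.linalg.lstsq, which factorizes D directly, is numerically safer than forming DᵀD, but the normal-equations form matches the slides.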

Page 14: Fit values and residuals

Given the parameters w = (w₀, w₁, …, w_k)ᵀ that minimize the sum of squared deviations,

$$Y_{fit} = Dw = \begin{bmatrix}
1 & x^{1} & (x^{1})^{2} & \cdots & (x^{1})^{k} \\
1 & x^{2} & (x^{2})^{2} & \cdots & (x^{2})^{k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x^{N} & (x^{N})^{2} & \cdots & (x^{N})^{k}
\end{bmatrix}
\begin{bmatrix} w_{0} \\ w_{1} \\ \vdots \\ w_{k} \end{bmatrix}$$

are the values of the fit at the x^t, the locations of the data points, and R = Y_fit − Y are the residuals at the data points.

Page 15: Coefficient of determination

$$R^{2} = 1 - \frac{\sum_{t=1}^{N}\left[r^{t} - g(x^{t}\,|\,\theta)\right]^{2}}{\sum_{t=1}^{N}\left[r^{t} - \bar{r}\right]^{2}}$$

The denominator is the sum of squared error associated with the hypothesis that the data are approximated by their mean value, a polynomial of degree zero.
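A small helper expressing this definition directly (assuming NumPy arrays of labels and fitted values):

```python
import numpy as np

def r_squared(r, y_fit):
    """R^2 = 1 - SSE(fit) / SSE(mean); r and y_fit are NumPy arrays."""
    sse_fit = np.sum((r - y_fit) ** 2)        # residuals of the polynomial fit
    sse_mean = np.sum((r - np.mean(r)) ** 2)  # degree-zero (mean) hypothesis
    return 1.0 - sse_fit / sse_mean
```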

Page 16: Review

1D polynomial regression (curve fitting) has all of the fundamental characteristics of data mining:

• Data points (x, y) support supervised machine learning, with x as the attribute and y as the label.
• The degree of the polynomial defines a hypothesis set.
• Polynomials of higher degree are more complex hypotheses.
• The sum of squared residuals defines an Ein that can be used to select a member of the hypothesis set by matrix algebra.
• Eout can be analytically defined and calculated for in silico datasets (target function + noise).

Page 17: Tuning regression models

The degree of the polynomial used in fitting data is an example of complexity in the hypothesis set H used in data mining.

As degree increases, the hypothesis set has more adjustable parameters; hence, a greater diversity of shapes is possible.

Page 18: Over-fitting

• The parabolic fit shown here looks OK, but would a cubic give a better fit?
• A cubic fit will give a smaller Ein(g), but likely at the cost of a larger Eout(g).
• The cubic lets me fit more of the noise in the data, which is specific to this data set.
• The optimum cubic fit to this data set is likely a poorer approximation to a different data set, because the noise is different.
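A quick in silico check of this claim; the parabolic target, the unit Gaussian noise, and the sample sizes are stand-in assumptions.

```python
import numpy as np

# In silico check: the cubic lowers E_in but can raise E_out on fresh data.
rng = np.random.default_rng(2)
f = lambda x: 1.0 + 2.0 * x - 0.4 * x**2        # assumed parabolic target
x_tr = rng.uniform(0, 5, 15)
r_tr = f(x_tr) + rng.normal(0, 1, 15)            # small, noisy training set
x_te = rng.uniform(0, 5, 500)
r_te = f(x_te) + rng.normal(0, 1, 500)           # large fresh set for E_out

for degree in (2, 3):
    w = np.polyfit(x_tr, r_tr, degree)
    e_in = np.mean((r_tr - np.polyval(w, x_tr)) ** 2)
    e_out = np.mean((r_te - np.polyval(w, x_te)) ** 2)
    # the cubic always has the smaller E_in; its E_out is typically larger
    print(degree, e_in, e_out)
```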

Page 19: Approximation-generalization tradeoff

In the theory of generalization (covered in Fundamentals 3) it can be shown that

$$E_{out}(g) < E_{in}(g) + W(N, H, \delta)$$

where W is a function of N, the training-set size; H, the hypothesis set; and δ, the allowable uncertainty in the final model.

• W(N, H, δ) is a bound on the difference between Eout(g) and Ein(g).
• If W(N, H, δ) is small, we can be confident of good generalization.
• At a given complexity (determined by H), higher statistical confidence (1 − δ) can usually be achieved with larger N.
• At fixed N and δ, W usually increases with the complexity of H, making generalization less certain.
• Even though Ein(g) may decrease with higher complexity, Eout(g) may not.
• In least-squares 1D regression, this effect can be illustrated by the “bias/variance dilemma.”

Page 20: Bias and variance from in silico experiments

• Given a parabolic target function, construct several “in silico” data sets by adding noise drawn from a normal distribution with zero mean and a specified variance.
• Fit a cubic to each in silico data set.
• Averaging these results, we get a consensus cubic fit.
• The difference between the consensus fit and the target function is called the “bias.”
• From the consensus fit and the individual cubic fits, we can calculate a variance.

Page 21: Formal definitions of bias & variance

• Assume the target function f(x) is known.
• Create M in silico datasets of size N by adding noise to f(x).
• For each dataset, find the best fit gᵢ(x) of given complexity.
• Average the gᵢ(x) to get the best overall estimator of f(x).
• Calculate the bias and variance of the best estimator as follows:

$$\bar{g}(x) = \frac{1}{M}\sum_{i=1}^{M} g_{i}(x)$$

$$\mathrm{Bias}^{2}(\bar{g}) = \frac{1}{N}\sum_{t=1}^{N}\left[\bar{g}(x^{t}) - f(x^{t})\right]^{2}$$

$$\mathrm{Variance}(\bar{g}) = \frac{1}{NM}\sum_{t=1}^{N}\sum_{i=1}^{M}\left[g_{i}(x^{t}) - \bar{g}(x^{t})\right]^{2}$$
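This recipe fits in a few lines of NumPy. The target f(x) = sin(x), M = 100 datasets, N = 25 points, unit noise, and the cubic fit are stand-in assumptions (though later slides do use sin(x) + noise).

```python
import numpy as np

# Sketch of the bias/variance recipe above, under assumed stand-ins.
rng = np.random.default_rng(3)
f = np.sin
M, N, degree = 100, 25, 3
x = np.sort(rng.uniform(0, 5, N))            # fixed design points x^t

fits = np.empty((M, N))
for i in range(M):
    r = f(x) + rng.normal(0, 1, N)           # in silico dataset i
    w = np.polyfit(x, r, degree)
    fits[i] = np.polyval(w, x)               # g_i evaluated at the x^t

g_bar = fits.mean(axis=0)                    # consensus (average) fit
bias2 = np.mean((g_bar - f(x)) ** 2)         # Bias^2: (1/N) sum over t
variance = np.mean((fits - g_bar) ** 2)      # (1/(N*M)) sum over t and i
print(bias2, variance)
```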

Page 22: Expectation values of Eout(g)

$$E_{out}(g_{i}\,|\,\mathcal{X}) = \mathrm{E}_{x}\left[\left(g_{i}(x) - f(x)\right)^{2}\right]$$

is the out-of-sample error for the ith training set, where E_x denotes an average over the specified domain of f(x).

$$\left\langle E_{out}\right\rangle = \mathrm{E}_{x}\left[\left\langle \left(g_{i}(x) - f(x)\right)^{2}\right\rangle\right]$$

where ⟨ ⟩ denotes an average over data sets.

• Eout can be written as the sum of 3 terms: σ² + bias² + variance.
• σ² is a contribution from noise in the data.
• σ² does not depend on the complexity of the hypothesis set, so we can ignore it in this discussion.

Page 23: Derive Eout = bias² + variance

$$\begin{aligned}
\left\langle E_{out}(g_{i}\,|\,\mathcal{X})\right\rangle
&= \mathrm{E}_{x}\left[\left\langle (g_{i} - f)^{2}\right\rangle\right] \\
&= \mathrm{E}_{x}\left[\left\langle g_{i}^{2} - 2g_{i}f + f^{2}\right\rangle\right] \\
&= \mathrm{E}_{x}\left[\left\langle g_{i}^{2}\right\rangle - 2\bar{g}f + f^{2}\right] \\
&= \mathrm{E}_{x}\left[\left\langle g_{i}^{2}\right\rangle - \bar{g}^{2} + \bar{g}^{2} - 2\bar{g}f + f^{2}\right] \\
&= \mathrm{E}_{x}\left[\left\langle (g_{i} - \bar{g})^{2}\right\rangle + (\bar{g} - f)^{2}\right] \\
&= \text{variance} + \text{bias}^{2}
\end{aligned}$$

where we used

$$\left\langle (g_{i} - \bar{g})^{2}\right\rangle = \left\langle g_{i}^{2}\right\rangle - 2\bar{g}\left\langle g_{i}\right\rangle + \bar{g}^{2} = \left\langle g_{i}^{2}\right\rangle - \bar{g}^{2}$$

Page 24: Polynomial fits to sin(x) + noise

[Figure: panels show one in silico experiment and linear regression over 5 experiments, with curves labeled f, gᵢ, and ḡ; bias is the RMSD between ḡ and f. Each cubic has a shape like f(x), giving smaller bias, but the shape of gᵢ varies more from experiment to experiment, giving larger variance.]

Page 25: Bias, variance and Eout from polynomial fits to sin(x) + noise

• The best complexity is degree 3.
• Beyond 3, decreases in bias are offset by increases in variance.

Page 26: We cannot use bias/variance analysis to tune polynomial fits to real data, because f(x) is unknown; hence we cannot calculate the bias.

Page 27: Use a validation set to estimate Eout

• Divide the real data into training and validation sets.
• Use the validation set to estimate Eout.
• An “elbow” in the estimate of Eout indicates the best complexity.

Page 28: Assignment 1, due 9-18-15

• Generate the in silico data set 2 sin(1.5x) + N(0, 1) with 100 random values of x between 0 and 5.
• Use 25 samples for training and 75 for validation.
• Fit polynomials of degree 1-5 to the training set.
• Calculate the validation error (defined below) at each degree:

$$E_{val}(g\,|\,\mathcal{X}) = \sum_{t=1}^{N_{val}}\left[r^{t} - g(x^{t})\right]^{2}$$

• Plot your result as shown in the previous slide to find the “elbow” in Eval and the best complexity for data mining.
• Use the full data set to find the optimum polynomial of the best complexity.
• Show this result as a plot of the data and the fit on the same set of axes.
• Report the minimum sum of squared residuals and the coefficient of determination.
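A hedged sketch of the assignment workflow; NumPy, np.polyfit, the random seed, and taking the first 25 draws as the training set are my choices (the slides do not prescribe a language or a split procedure).

```python
import numpy as np

# Sketch of Assignment 1: in silico data 2*sin(1.5x) + N(0,1),
# 25 training / 75 validation points, polynomial degrees 1-5.
rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 100)
y = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 100)
x_tr, y_tr = x[:25], y[:25]                  # 25 training samples
x_va, y_va = x[25:], y[25:]                  # 75 validation samples

for degree in range(1, 6):                   # fit degrees 1-5
    w = np.polyfit(x_tr, y_tr, degree)
    e_in = np.sum((y_tr - np.polyval(w, x_tr)) ** 2)
    e_val = np.sum((y_va - np.polyval(w, x_va)) ** 2)
    print(degree, e_in, e_val)               # look for the elbow in E_val
```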

Page 29: Get in silico data; calculate in-sample and validation errors.

Page 30: Ein and Eval vs. degree of polynomial

[Plot: Ein and Eval as functions of the degree of the polynomial.]

• Evidence for the cubic as the best choice for the degree of the polynomial.
• The VC bound suggests that small decreases in Eval for degree > 3 do not indicate better generalization.

Page 31: Expected results

[Plot: the solid curve is the target function, *'s are the cubic fit, and +'s are the training data.]