Machine Learning - uwyo.educlan/teach/ai19/ml_a.pdf
Machine Learning
Chao Lan
Background
Can we build a machine that can automatically filter spam?
Which words imply spam?
Does this word imply spam?
Does this combination of words imply spam?
Manually designing patterns for spam is hard.
Can we let the machine learn patterns of spam?
Computers learn from examples to improve their generalization (classification) performance, without being explicitly programmed.
What is machine learning?
0.4*δ{lottery} - 0.7*δ{lottery} + 0.18*δ{account} - 0.32*δ{birth} > 0.5
A hypothetical pattern of spam learned by the machine.
Other Examples
Concepts
Revisit: What is machine learning?
Computers learn from examples to improve their generalization (classification) performance, without being explicitly programmed.
Instance, Label
instance x → label y (e.g., spam)
instance x → label y (e.g., ham)
Model
instance x → model f → predicted label f(x) (e.g., ham)
Prediction Error (or, Generalization Error)
err(f) = 0.3
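As a sketch of how such an error would be computed (the emails, labels, and toy keyword rule below are made up for illustration, not from the slides), the prediction error of a classifier f is the fraction of instances it misclassifies:

```python
def prediction_error(model, instances, labels):
    """Fraction of instances the model misclassifies."""
    wrong = sum(1 for x, y in zip(instances, labels) if model(x) != y)
    return wrong / len(instances)

# A hypothetical rule: flag an email as spam if it mentions "lottery".
toy_model = lambda email: "spam" if "lottery" in email else "ham"

emails = ["win the lottery now", "meeting at noon", "lottery tickets", "lunch?"]
labels = ["spam", "ham", "ham", "ham"]  # one "lottery" email is actually ham

print(prediction_error(toy_model, emails, labels))  # 0.25: one of four is wrong
```

On new (testing) data this empirical error estimates the generalization error of f.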
Training, Training Set
(training) instances → train a model
Supervised Learning versus Unsupervised Learning Tasks
(training) instances → train a model
Supervised: we know the instances and their labels (e.g., spam, ham) in the training set.
Supervised Learning versus Unsupervised Learning Tasks
(training) instances → train a model
Unsupervised: we know the instances, but not their labels (labels: ?), in the training set.
Testing, Testing Set
(testing) instance → predict → predicted label (e.g., ham)
Classification versus Regression
(testing) instance → predict → predicted label (e.g., ham)
label is discrete
Classification versus Regression
(testing) instance → predict → predicted label (e.g., minutes for the survey)
label is continuous
[E1] Build a model to classify article topic (sports, politics, etc)
1. what is an instance, what is the label?
2. what are the model input and output?
3. If we have a set of documents with known topics on sports, politics and academic, is it a supervised or unsupervised learning task?
4. Is it a classification or regression task?
[E2] Build a model to predict student GPA.
1. what is an instance, what is the label?
2. what are the model input and output?
3. If we have a set of students whose GPAs will be known by the end of this semester, is it a supervised or unsupervised learning task?
4. Is it a classification or regression task?
An instance is often represented as a feature vector x.
x = (steal; lie, cheat; behavior; peer rej; low ac; …) = (0; 1; 2; 1; 2; …)
Q: How to represent a text document?
Example
x = (google, lottery, cat, email, transport, panda, million, …) = (1, 1, 0, 1, 0, 0, 1, …)
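The word-indicator representation above can be sketched as follows; the vocabulary mirrors the slide's example, and the sample document is an assumption for illustration:

```python
def to_feature_vector(document, vocabulary):
    """Binary bag-of-words: 1 if the word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["google", "lottery", "cat", "email", "transport", "panda", "million"]
doc = "google lottery email million"
print(to_feature_vector(doc, vocabulary))  # [1, 1, 0, 1, 0, 0, 1]
```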
Q: How to represent an image?
Example: [figure omitted]
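One common sketch (assumed here, since the slide's figure did not survive extraction): flatten the grid of pixel intensities row by row into a vector x:

```python
def image_to_vector(pixels):
    """Flatten a 2-D grid of pixel intensities into a 1-D feature vector."""
    return [value for row in pixels for value in row]

tiny_image = [[0, 255],
              [128, 64]]  # a made-up 2x2 grayscale image
print(image_to_vector(tiny_image))  # [0, 255, 128, 64]
```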
Q: how to represent a user in a graph?
[graph figure with users A, B, C, D, E, F, G omitted]
Example
x = (connected to A? B? C? D? E? F? G?) = (0, 0, 1, 0, 1, 1, 1)
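The connection-indicator vector above can be built from an edge list; the edges below are assumptions chosen so that the new user "x" has neighbors C, E, F, G, matching the slide's vector:

```python
def user_vector(user, all_users, edges):
    """1 in position i if `user` is connected to all_users[i], else 0."""
    neighbors = ({b for a, b in edges if a == user}
                 | {a for a, b in edges if b == user})
    return [1 if u in neighbors else 0 for u in all_users]

all_users = ["A", "B", "C", "D", "E", "F", "G"]
edges = [("x", "C"), ("x", "E"), ("F", "x"), ("x", "G")]  # hypothetical edges
print(user_vector("x", all_users, edges))  # [0, 0, 1, 0, 1, 1, 1]
```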
Q: are there better ways to build the vector? (feature engineering)
A model is a function governed by unknown parameters.
Example: model f is a linear function of features xi with unknown parameters θi’s.
f(x) = θ1x1 + θ2x2 + … + θpxp
- training f means estimating θ’s from training instances
- once θ’s are fixed, model f is fixed and can be applied
Example: use a hyper-parameter λ to control the domain of θ’s.
f(x) = θ1x1 + θ2x2 + … + θpxp
- if λ = 10, then θ ∈ [-1,1] — larger domain, f is complex
- if λ = 1, then θ ∈ {0, 1} — smaller domain, f is simple
A model’s complexity is governed by hyper-parameters.
Q: which model has higher complexity?
1. f(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ [0,1]
2. f(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ {0,1}
A model with a larger domain is often more complex.
Q: what is the hyper-parameter?
Q: which model has higher complexity?
1. f(x) = θ1x1 + θ2x2 + … + θ10x10, θ ∈ [0,1]
2. f(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ [0,1]
A model with more parameters is often more complex.
A model capturing more complicated relations is often more complex.
Q: what is the hyper-parameter?
Connection: Model Complexity and Achievable Accuracy
A more complex model is more likely to recover the true relation between x and y.
Example: true relation is y = 0.3*x1 - 0.7*x2
- if λ = 10, then θ ∈ [-1,1] — f is complex and can recover the above relation
- if λ = 1, then θ ∈ {0, 1} — f is simple and cannot recover the above relation
- better recovery of the true relation implies higher model accuracy
Q: True or False? Always build a complex model, since it is more likely to recover the true relation.
- f1(x) = θ1x1 + θ2x2 + … + θ10x10, θ ∈ [0,1]
- f2(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ [0,1]
Q: Which model estimation has less variance?
Student ID | x1: #hour/day | x2: #hw/week | ... | x10: major | GPA
1          | 3.5           | 0.8          | ... | cs         | 3.7
2          | 2             | 0.4          | ... | cs         | 3.4
A complex model is more demanding on training data volume.
Another way to look at estimation variance.
population → sample a training set (Stu ID, x1, x2, …) → training → model f → apply on new (testing) data
Another way to look at estimation variance.
If the training set is small, many models may work well on the training data, but not every one works well on the population.
It is likely to learn a model that works well on training data, but not so well on new data in the population, especially if the training set is biased.
Overfitting
If testing error >> training error, we say f overfits.
Q: which model (indexed by λ = 1, 2, …, 10) overfits? [figure omitted]
Connection: a more complex model is more likely to overfit.
Q: True or False? Since a more complex model is more likely to overfit, always build a simple model.
Q: How to choose model complexity (λ) in practice?
Model Selection by K-Fold Cross Validation
choose a candidate hyper-parameter λ1, evaluate it by k-fold cross validation, then repeat for the other candidates
Q: how to choose candidate hyper-parameters?
Strategies of choosing multiple candidate hyper-parameters.
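A minimal sketch of k-fold cross validation for choosing λ. The candidate grid and the toy "learner" below are assumptions for illustration (the toy shrinks a predicted mean toward 0 as λ grows), not the slides' actual model:

```python
def k_fold_cv_score(train_and_err, data, k=5):
    """Average validation error over k folds: hold out each fold once."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        val = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        errors.append(train_and_err(train, val))
    return sum(errors) / k

def select_lambda(candidates, make_learner, data, k=5):
    """Pick the candidate hyper-parameter with the lowest CV error."""
    return min(candidates,
               key=lambda lam: k_fold_cv_score(make_learner(lam), data, k))

def make_learner(lam):
    # toy stand-in for a real learner: "training" fits the mean,
    # and lam shrinks the fitted mean toward 0
    def run(train, val):
        mean = sum(train) / len(train) / (1 + lam)
        return sum((v - mean) ** 2 for v in val) / len(val)  # squared error
    return run

data = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 0.9, 1.1]
best = select_lambda([0.0, 0.5, 1.0], make_learner, data)
print(best)  # 0.0: no shrinkage has the lowest cross-validation error here
```

Each candidate λ is scored on held-out folds it was not trained on, so the comparison estimates generalization error rather than training error.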
Wrap Up: Introduction
Concepts: instance, label, model, training, testing
Data: feature vector representation (profile, text, image, graph, etc)
Model: parameter, hyper-parameter, model complexity, overfitting
Model Selection: k-fold cross validation