Machine Learning - uwyo.educlan/teach/ai19/ml_a.pdf
Machine Learning
Chao Lan
Background
Can we build a machine that can automatically filter spam?
Which words imply spam?
Does this word imply spam?
Does this combination of words imply spam?
Manually designing patterns for spam is hard.
Can we let the machine learn patterns of spam?
Computers learn from examples to improve their generalization (classification) performance, without being explicitly programmed.
What is machine learning?
0.4*δ{lottery} - 0.7*δ{lottery} + 0.18*δ{account} - 0.32*δ{birth} > 0.5
A hypothetical pattern of spam learned by the machine.
Other Examples
Concepts
Revisit: What is machine learning?
Computers learn from examples to improve their generalization (classification) performance, without being explicitly programmed.
Instance, Label
instance x → label y (e.g., spam)
instance x → label y (e.g., ham)
Model
instance x → model f → predicted label f(x) (e.g., ham)
Prediction Error (or, Generalization Error)
err(f) = 0.3
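As a sketch of how such an error would be computed (the emails, labels, and toy keyword rule below are made up for illustration, not from the slides), the prediction error of a classifier f is the fraction of instances it misclassifies:

```python
def prediction_error(model, instances, labels):
    """Fraction of instances the model misclassifies."""
    wrong = sum(1 for x, y in zip(instances, labels) if model(x) != y)
    return wrong / len(instances)

# A hypothetical rule: flag an email as spam if it mentions "lottery".
toy_model = lambda email: "spam" if "lottery" in email else "ham"

emails = ["win the lottery now", "meeting at noon", "lottery tickets", "lunch?"]
labels = ["spam", "ham", "ham", "ham"]  # one "lottery" email is actually ham

print(prediction_error(toy_model, emails, labels))  # 0.25: one of four is wrong
```

On new (testing) data this empirical error estimates the generalization error of f.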
Training, Training Set
(training) instances → train a model
Supervised Learning versus Unsupervised Learning Tasks
(training) instances → train a model
Supervised: we know the instances and their labels (e.g., spam, ham) in the training set.
Supervised Learning versus Unsupervised Learning Tasks
(training) instances → train a model
Unsupervised: we know the instances, but not their labels (labels: ?), in the training set.
Testing, Testing Set
(testing) instance → predict → predicted label (e.g., ham)
Classification versus Regression
(testing) instance → predict → predicted label (e.g., ham)
label is discrete
Classification versus Regression
(testing) instance → predict → predicted label (e.g., minutes for the survey)
label is continuous
[E1] Build a model to classify article topic (sports, politics, etc)
1. what is an instance, what is the label?
2. what are the model input and output?
3. If we have a set of documents with known topics on sports, politics and academic, is it a supervised or unsupervised learning task?
4. Is it a classification or regression task?
[E2] Build a model to predict student GPA.
1. what is an instance, what is the label?
2. what are the model input and output?
3. If we have a set of students whose GPAs will be known by the end of this semester, is it a supervised or unsupervised learning task?
4. Is it a classification or regression task?
An instance is often represented as a feature vector x.
x = (steal; lie, cheat; behavior; peer rej; low ac; …) = (0; 1; 2; 1; 2; …)
Q: How to represent a text document?
Example
x = (google, lottery, cat, email, transport, panda, million, …) = (1, 1, 0, 1, 0, 0, 1, …)
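The word-indicator representation above can be sketched as follows; the vocabulary mirrors the slide's example, and the sample document is an assumption for illustration:

```python
def to_feature_vector(document, vocabulary):
    """Binary bag-of-words: 1 if the word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["google", "lottery", "cat", "email", "transport", "panda", "million"]
doc = "google lottery email million"
print(to_feature_vector(doc, vocabulary))  # [1, 1, 0, 1, 0, 0, 1]
```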
Q: How to represent an image?
Example: [figure omitted]
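One common sketch (assumed here, since the slide's figure did not survive extraction): flatten the grid of pixel intensities row by row into a vector x:

```python
def image_to_vector(pixels):
    """Flatten a 2-D grid of pixel intensities into a 1-D feature vector."""
    return [value for row in pixels for value in row]

tiny_image = [[0, 255],
              [128, 64]]  # a made-up 2x2 grayscale image
print(image_to_vector(tiny_image))  # [0, 255, 128, 64]
```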
Q: how to represent a user in a graph?
[graph figure with users A, B, C, D, E, F, G omitted]
Example
x = (connected to A? B? C? D? E? F? G?) = (0, 0, 1, 0, 1, 1, 1)
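The connection-indicator vector above can be built from an edge list; the edges below are assumptions chosen so that the new user "x" has neighbors C, E, F, G, matching the slide's vector:

```python
def user_vector(user, all_users, edges):
    """1 in position i if `user` is connected to all_users[i], else 0."""
    neighbors = ({b for a, b in edges if a == user}
                 | {a for a, b in edges if b == user})
    return [1 if u in neighbors else 0 for u in all_users]

all_users = ["A", "B", "C", "D", "E", "F", "G"]
edges = [("x", "C"), ("x", "E"), ("F", "x"), ("x", "G")]  # hypothetical edges
print(user_vector("x", all_users, edges))  # [0, 0, 1, 0, 1, 1, 1]
```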
Q: are there better ways to build the vector? (feature engineering)
A model is a function governed by unknown parameters.
Example: model f is a linear function of features xi with unknown parameters θi’s.
f(x) = θ1x1 + θ2x2 + … + θpxp
- training f means estimating θ’s from training instances
- once θ’s are fixed, model f is fixed and can be applied
Example: use a hyper-parameter λ to control the domain of θ’s.
f(x) = θ1x1 + θ2x2 + … + θpxp
- if λ = 10, then θ ∈ [-1,1] — larger domain, f is complex
- if λ = 1, then θ ∈ {0, 1} — smaller domain, f is simple
A model’s complexity is governed by hyper-parameters.
Q: which model has higher complexity?
1. f(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ [0,1]
2. f(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ {0,1}
A model with a larger domain is often more complex.
Q: what is the hyper-parameter?
Q: which model has higher complexity?
1. f(x) = θ1x1 + θ2x2 + … + θ10x10, θ ∈ [0,1]
2. f(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ [0,1]
A model with more parameters is often more complex.
A model capturing more complicated relations is often more complex.
Q: what is the hyper-parameter?
Connection: Model Complexity and Achievable Accuracy
A more complex model is more likely to recover the true relation between x and y.
Example: true relation is y = 0.3*x1 - 0.7*x2
- if λ = 10, then θ ∈ [-1,1] — f is complex and can recover the above relation
- if λ = 1, then θ ∈ {0, 1} — f is simple and cannot recover the above relation
- better recovery of the true relation implies higher model accuracy
Q: True or False? Always build a complex model, since it is more likely to recover the true relation.
- f1(x) = θ1x1 + θ2x2 + … + θ10x10, θ ∈ [0,1]
- f2(x) = θ1x1 + θ2x2 + … + θpxp, θ ∈ [0,1]
Q: Which model estimation has less variance?
Student ID | x1: #hour/day | x2: #hw/week | ... | x10: major | GPA
1          | 3.5           | 0.8          | ... | cs         | 3.7
2          | 2             | 0.4          | ... | cs         | 3.4
A complex model is more demanding on training data volume.
Another way to look at estimation variance.
population → sample a training set (Stu ID, x1, x2, …) → training → model f → apply on new (testing) data
Another way to look at estimation variance.
If the training set is small, many models may work well on the training data, but not every one works well on the population.
It is likely to learn a model that works well on training data, but not so well on new data in the population, especially if the training set is biased.
Overfitting
If testing error >> training error, we say f overfits.
Q: which model (indexed by λ = 1, 2, …, 10) overfits? [figure omitted]
Connection: a more complex model is more likely to overfit.
Q: True or False? Since a more complex model is more likely to overfit, always build a simple model.
Q: How to choose model complexity (λ) in practice?
Model Selection by K-Fold Cross Validation
choose a candidate hyper-parameter λ1, evaluate it by k-fold cross validation, then repeat for the other candidates
Q: how to choose candidate hyper-parameters?
Strategies of choosing multiple candidate hyper-parameters.
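A minimal sketch of k-fold cross validation for choosing λ. The candidate grid and the toy "learner" below are assumptions for illustration (the toy shrinks a predicted mean toward 0 as λ grows), not the slides' actual model:

```python
def k_fold_cv_score(train_and_err, data, k=5):
    """Average validation error over k folds: hold out each fold once."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        val = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        errors.append(train_and_err(train, val))
    return sum(errors) / k

def select_lambda(candidates, make_learner, data, k=5):
    """Pick the candidate hyper-parameter with the lowest CV error."""
    return min(candidates,
               key=lambda lam: k_fold_cv_score(make_learner(lam), data, k))

def make_learner(lam):
    # toy stand-in for a real learner: "training" fits the mean,
    # and lam shrinks the fitted mean toward 0
    def run(train, val):
        mean = sum(train) / len(train) / (1 + lam)
        return sum((v - mean) ** 2 for v in val) / len(val)  # squared error
    return run

data = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 1.0, 0.9, 1.1]
best = select_lambda([0.0, 0.5, 1.0], make_learner, data)
print(best)  # 0.0: no shrinkage has the lowest cross-validation error here
```

Each candidate λ is scored on held-out folds it was not trained on, so the comparison estimates generalization error rather than training error.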
Wrap Up: Introduction
Concepts: instance, label, model, training, testing
Data: feature vector representation (profile, text, image, graph, etc)
Model: parameter, hyper-parameter, model complexity, overfitting
Model Selection: k-fold cross validation