Introduction to Machine Learning Laurent Orseau AgroParisTech [email protected] EFREI...
Introduction to Machine Learning
Laurent Orseau, AgroParisTech
EFREI 2010-2011. Based on slides by Antoine Cornuejols
2
Overview
• Introduction to Induction (Laurent Orseau)
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
• Introduction to Data-Mining (Christine Martin)
  • Association Rules
  • Clustering
  • Genetic Algorithms
3
Overview: Introduction
• Introduction to Induction
  Examples of applications
  Learning types
  • Supervised Learning
  • Reinforcement Learning
  • Unsupervised Learning
• Machine Learning Theory
  • What questions to ask?
Introduction
5
What is Machine Learning?
• Memory
  Knowledge acquisition
  Neurosciences
  • Short-term (working): keeps 7±2 objects at a time
  • Long-term
    Procedural » action sequences
    Declarative » semantic (concepts), episodic (facts)
• Learning types
  By heart
  From rules
  By imitation / demonstration
  By trial & error
• Knowledge reuse
  In similar situations
Introduction
6
What is Machine Learning?
• "The field of study that gives computers the ability to learn without being explicitly programmed "
Arthur Samuel, 1959
Samuel's checkers program; checkers was solved by Schaeffer et al. in 2007. See also TD-Gammon (Tesauro, 1992).
Introduction
7
What is Machine Learning?
Given:
  an experience E,
  a class of tasks T,
  a performance measure P,
a computer is said to learn if its performance on the tasks of T, as measured by P, improves with experience E.
Tom Mitchell, 1997
Introduction
8
Terms related to Machine Learning
• Robotics: autonomous Google cars, Nao
• Prediction / forecasting: stock exchange, pollution peaks, …
• Recognition: faces, language, writing, movements, …
• Optimization: subway speed, traveling salesman, …
• Regulation: heating, traffic, fridge temperature, …
• Autonomy: robots, hand prostheses
• Automatic problem solving
• Adaptation: user preferences, robot in a changing environment
• Induction
• Generalization
• Automatic discovery
• …
Introduction
Some applications
10
Learning to cook
• Learning by imitation / demonstration
• Procedural learning (motor precision)
• Object recognition
Applications
11
DARPA Grand Challenge (2005)
Applications
12
200km of desert
Natural and artificial dangers
No driver
No remote control
Applications > DARPA Grand Challenge
13
5 Finalists
Applications > DARPA Grand Challenge
14
Recognition of the road
Applications > DARPA Grand Challenge
15
Learning to label images: face recognition
“Face Recognition: Component-based versus Global Approaches” (B. Heisele, P. Ho, J. Wu and T. Poggio), Computer Vision and Image Understanding, Vol. 91, No. 1/2, 6-21, 2003.
Applications
16
Applications > Image recognition
Feature combinations
17
Hand prosthesis
• Recognition of pronator and supinator signals
  Imperfect sensors
  Noise
  Uncertainty
Applications
18
Autonomous robot rover on Mars
Applications
19
Supervised Learning
Learning by heart? Unexploitable: we must generalize.
How to encode forms?
Introduction to Machine Learning Theory
21
Introduction to Machine Learning theory
• Supervised Learning
• Reinforcement Learning
• Unsupervised Learning (CM)
• Genetic Algorithms (CM)
22
Supervised Learning
• Set of examples xi labeled ui
• Find a hypothesis h so that:
h(xi) = ui ?
h(xi): predicted label
• Best hypothesis h* ?
23
Supervised Learning: 1st Example
• Houses: Price / m²
• Searching for h
  Nearest neighbors?
  Linear or polynomial regression?
• More information?
  Localization (x, y? or a symbolic variable?), age of the building, neighborhood, swimming pool, local taxes, temporal evolution, …?
Supervised Learning
24
Problem
Predicting the price per m² of a given house.
1) Modeling
2) Data gathering
3) Learning
4) Validation
5) Use in real case
Supervised Learning
Ideal Practice
25
1) Modeling
• Input space: what is the meaningful information? Variables
• Output space: what is to be predicted?
• Hypothesis space: input –(computation)→ output; what (kind of) computation?
Supervised Learning
26
1-a) Input space: Variables
• What is the meaningful information?
• Should we get as much as possible?
• Information quality?
  Noise
  Quantity
• Cost of information gathering?
  Economic, time, risk (invasive?), ethics, law (CNIL)
• Definition domain of each variable?
  Symbolic, bounded numeric, unbounded, etc.
Supervised Learning > 1) Modeling
27
Price per m²: Variables
• Localization
  Continuous: (x, y) longitude/latitude?
  Symbolic: city name?
• Age of the building
  Year of construction? Relative to the present or to the construction date?
• Nature of the soil
• Swimming pool?
Supervised Learning > 1) Modeling > a) Variables
28
1-b) Output space
• What do we want as output?
  Symbolic classes? (classification)
  • Boolean Yes/No (concept learning)
  • Multi-valued A/B/C/D/…
  Numeric? (regression)
  • [0 ; 1]?
  • [-∞ ; +∞]?
• How many outputs? Multi-valued or multi-class?
  • 1 output for each class: learn a model for each output?
  • More "free": learn 1 model for all outputs?
    • Each model can use the others' information
Supervised Learning > 1) Modeling
29
1-c) Hypothesis space
• Critical!
• Depends on the learning algorithm
  Linear regression: space = {ax + b}
  • Parameters: a and b
  Polynomial regression
  • # parameters = polynomial degree
  Neural Networks, SVM, Genetic Algorithms, …
…
Supervised Learning > 1) Modeling
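The two-parameter hypothesis space {ax + b} mentioned above can be fitted in closed form by least squares. A minimal sketch (in Python; the function name is mine, not from the course):

```python
def fit_line(xs, ys):
    """Least-squares fit of the two-parameter hypothesis space y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: a = cov(x, y) / var(x), b = mean_y - a * mean_x
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

# Noise-free data generated by y = 2x + 1: the fit recovers a = 2, b = 1
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```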
30
Choice of hypothesis space
[Figure: total error = approximation error + estimation error, as a function of the richness of the hypothesis space]
31
Choice of hypothesis space
• Space too "poor": inadequate solutions
  Ex: modeling sin(x) with y = ax + b
• Space too "rich": risk of overfitting
• Defined by a set of parameters
  A high number of parameters makes learning more difficult
• But prefer a richer hypothesis space!
  Use generic methods
  Add regularization
Supervised Learning > 1) Modeling > c) Hypothesis space
32
2) Data gathering
• Gathering
  Electronic sensors, simulation, polls, automated collection on the Internet, …
• Get the highest possible quantity of data
  Collection cost
• Data as "pure" as possible
  Avoid all noise
  • Noise in variables
  • Noise in labels!
• 1 example = 1 value for each variable
  • Missing value = useless example?
Supervised Learning
33
Gathered data

            x1    x2    x3      u
Example 1   Yes   1.5   Green   -
Example 2   No    1.4   Orange  +
Example 3   Yes   3.7   Orange  -
…           …     …     …       …

Inputs / measured variables: x1, x2, x3
Output / class / label: u
But the true label y is unreachable!
Supervised Learning > 2) Data gathering
34
Data preprocessing
• Clean up the data
  Ex: reduce background noise
• Transform the data into a final format adapted to the task
  Ex: Fourier transform of a radio signal,
  time/amplitude → frequency/amplitude
Supervised Learning > 2) Data gathering
35
3) Learning
a) Choice of program parameters
b) Choice of inductive test
c) Running the learning program
d) Performance test
If bad, return to a)…
Supervised Learning
36
a) Choice of program parameters
• Max allocated computation time
• Max accepted error
• Learning parameters Specific to model
• Knowledge introduction Initialize parameters to "ok" values?
• …
Supervised Learning > 3) Learning
37
b) Choice of inductive test
Goal: find a hypothesis h ∈ H minimizing the real risk (risk expectancy, generalization error):

  R(h) = ∫_{X×Y} l(h(x), y) dP(x, y)

where l is the loss function comparing the predicted label h(x) to the true label y (or the desired label u), and P is the joint probability law over X × Y.
Supervised Learning > 3) Learning
38
Real risk
• Goal: minimize the real risk

  R(h) = ∫_{X×Y} l(h(x), y) dP(x, y)

• The real risk is not known; in particular, P(X, Y) is unknown.
• Discrimination: l(h(xi), ui) = 0 if ui = h(xi), 1 if ui ≠ h(xi)
• Regression: l(h(xi), ui) = (h(xi) − ui)²
Supervised Learning > 3) Learning > b) Inductive test
39
Empirical Risk Minimization
• ERM principle: find h ∈ H minimizing the empirical risk
  • Least error on the training set

  R_emp(h) = (1/m) Σ_{i=1}^{m} l(h(xi), ui)

Supervised Learning > 3) Learning > b) Inductive test
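The ERM principle with the two losses above can be sketched as follows (a minimal illustration; the helper names and the toy data are mine, not from the course):

```python
def zero_one_loss(pred, u):
    # Discrimination: 0 if the label is correct, 1 otherwise
    return 0.0 if pred == u else 1.0

def squared_loss(pred, u):
    # Regression: (h(x_i) - u_i)^2
    return (pred - u) ** 2

def empirical_risk(h, examples, loss):
    """R_emp(h): average loss of hypothesis h over the training set."""
    return sum(loss(h(x), u) for x, u in examples) / len(examples)

h = lambda x: 1 if x > 0 else -1            # a fixed threshold classifier
S = [(-2, -1), (-1, -1), (1, 1), (2, -1)]   # one example is misclassified
risk = empirical_risk(h, S, zero_one_loss)  # 1 error out of 4
```

ERM then amounts to searching the hypothesis space for the h with the smallest such value.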
40
Learning curve
• Data quantity is important!
[Figure: learning curve, "error" decreasing as the training set size grows]
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
41
Test / Validation
• Measures overfitting / generalization
  Can the acquired knowledge be reused in new circumstances?
  Do NOT validate on the training set!
• Validation on an additional test set
• Cross-validation
  Useful when data are scarce
  leave-p-out
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
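The slide mentions leave-p-out; a minimal sketch of the closely related k-fold cross-validation scheme (the function name is mine), where each example is tested exactly once:

```python
def k_fold_splits(examples, k):
    """Yield (train, test) pairs for k-fold cross-validation:
    each example appears in exactly one test fold."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test

data = list(range(10))
splits = list(k_fold_splits(data, k=5))   # 5 splits of 8 train / 2 test
```

Leave-p-out is the exhaustive variant: every subset of p examples serves as a test set in turn, which is far more expensive.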
42
Overfitting
[Figure: empirical risk keeps decreasing while the real risk increases; the growing gap is overfitting, plotted against data quantity]
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
43
Regularization
• Limit overfitting before measuring it on the test set
• Add a penalization term to the inductive test
  Ex:
  • penalize large numbers of parameters
  • penalize resource use
  • …
Supervised Learning > 3) Learning > b) Inductive test > Empirical risk
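One common penalization is a ridge-style quadratic penalty on the parameter sizes (this particular choice is my illustration; the slide only says "add penalization"):

```python
def penalized_risk(params, examples, lam):
    """Squared-error empirical risk of the line y = a*x + b, plus a
    ridge-style penalization lam * (a^2 + b^2) on the parameter sizes."""
    a, b = params
    emp = sum((a * x + b - u) ** 2 for x, u in examples) / len(examples)
    return emp + lam * (a ** 2 + b ** 2)

S = [(0, 1), (1, 3), (2, 5)]              # generated by y = 2x + 1
r = penalized_risk((2, 1), S, lam=0.1)    # zero empirical error, penalty 0.1 * 5
```

Minimizing this penalized inductive test trades a little training error for smaller, hence "simpler", parameters.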
44
Maximum a posteriori
• Bayesian approach
• We suppose there exists a prior probability distribution over the space H: pH(h)
• Maximum A Posteriori (MAP) principle:
  Search for the most probable h after observing the data S
• Ex: observation of sheep colors; h = "A sheep is white"
Supervised Learning > 3) Learning > b) Inductive test
45
Minimum Description Length principle
• Occam's Razor: "Prefer the simplest hypotheses"
• Simplicity: size of h
  Maximum compression
• Maximum a posteriori with pH(h) = 2^(−d(h))
  • d(h): length of h in bits
• Compression ⇒ generalization
Supervised Learning > 3) Learning > b) Inductive test
46
c) Running the learning program
• Search for h
• Use the examples of the training set
  One by one
  All together
• Minimize the inductive test
Supervised Learning > 3) Learning
47
Finding the parameters of the model
• Explore the hypothesis space H
  Which is the best hypothesis given the inductive test?
  Fundamentally depends on H
a) Structured exploration
b) Local exploration
c) No exploration
Supervised Learning > 3) Learning > c) Running the program
48
Structured exploration
• Structured by a generality relation (partial order)
  Version space
  ILP (Inductive Logic Programming)
  EBL (Explanation-Based Learning)
  Grammatical inference
  Program enumeration
[Diagram: hypotheses hi and hj in H, with gms(hi, hj) and smg(hi, hj)]
Supervised Learning > 3) Learning > c) Running the program > Exploring H
49
Representation of the version space
Structured by:
  Upper bound: G-set
  Lower bound: S-set
• G-set = set of all most general hypotheses consistent with the known examples
• S-set = set of all most specific hypotheses consistent with the known examples
[Diagram: hypotheses hi and hj in H, between the G and S boundaries]
Supervised Learning > 3) Learning > c) Running the program > Exploring H
50
Learning…
… by iterated updates of the version space
Idea: update the S-set and the G-set after each new example
Candidate elimination algorithm
Example: rectangles (cf. blackboard…)
Supervised Learning > 3-c) > Exploring H > Version space
51
Candidate Elimination algorithm
Initialize S (resp. G):
  set of most specific (resp. most general) hypotheses consistent with the 1st example
For each new example (+ or -):
  update S
  update G
Until convergence,
or until S = G = Ø
Supervised Learning > 3-c) > Exploring H > Version space
54
Updating S and G: xi is positive
• Updating S
  Generalize the hypotheses in S not covering xi, just enough to cover it.
  Then eliminate the hypotheses in S that
  • cover one or more negative examples
  • are more general than another hypothesis in S
• Updating G
  Eliminate the hypotheses in G not covering xi
Supervised Learning > 3-c) > Exploring H > Version space
55
Updating S and G: xi is negative
• Updating S
  Eliminate the hypotheses in S (wrongly) covering xi
• Updating G
  Specialize the hypotheses in G covering xi, just enough not to cover it.
  Then eliminate the hypotheses in G that are
  • not more general than at least one element of S
  • more specific than another hypothesis of G
Supervised Learning > 3-c) > Exploring H > Version space
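The update rules above can be sketched for the classic conjunctive hypothesis language, where a hypothesis is a tuple of attribute values with '?' meaning "any value". This is a minimal illustration (the representation, helper names, and toy attributes are mine, not the blackboard rectangle example, and it assumes the first example is positive):

```python
def covers(h, x):
    # h covers x if every non-'?' slot matches the example
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    # h1 is at least as general as h2
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """Version-space candidate elimination for conjunctive hypotheses.
    `examples` is a list of (attribute-tuple, is_positive) pairs;
    `domains` gives the possible values of each attribute."""
    n = len(domains)
    first_x, first_positive = examples[0]
    assert first_positive, "this sketch initializes S from a positive example"
    S = [tuple(first_x)]                  # most specific consistent hypothesis
    G = [tuple('?' for _ in range(n))]    # most general hypothesis
    for x, positive in examples[1:]:
        if positive:
            # G: eliminate hypotheses not covering the positive example
            G = [g for g in G if covers(g, x)]
            # S: minimally generalize hypotheses that do not cover it
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                else:
                    gen = tuple(sv if sv == xv else '?' for sv, xv in zip(s, x))
                    if any(more_general(g, gen) for g in G):
                        new_S.append(gen)
            S = new_S
        else:
            # S: eliminate hypotheses (wrongly) covering the negative example
            S = [s for s in S if not covers(s, x)]
            # G: minimally specialize hypotheses covering it
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):
                    if g[i] != '?':
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            spec = g[:i] + (v,) + g[i + 1:]
                            if any(more_general(spec, s) for s in S):
                                new_G.append(spec)
            new_G = list(dict.fromkeys(new_G))
            # drop members more specific than another member of G
            G = [g for g in new_G
                 if not any(g2 != g and more_general(g2, g) for g2 in new_G)]
    return S, G

# Toy run with two attributes (Sky, Temp): the version space converges
S, G = candidate_elimination(
    [(("Sunny", "Warm"), True),
     (("Rainy", "Cold"), False),
     (("Sunny", "Cold"), True)],
    domains=[("Sunny", "Rainy"), ("Warm", "Cold")])
```

Here S and G meet on the single hypothesis ("Sunny", "?"): the algorithm has converged.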
56
Candidate Elimination Algorithm
Updating S and G
[Diagram: successive updates (a)-(d') of the G and S boundaries in H as positive and negative examples x arrive]
Supervised Learning > 3-c) > Exploring H > Version space
57
Local exploration
• Only a neighborhood notion in H
  "Gradient" methods
  • Neural Networks
  • SVM
  • Simulated annealing / simulated evolution
• /!\ Local minima
Supervised Learning > 3) Learning > c) Running the program > Exploring H
58
Exploration without hypothesis space
• No hypothesis space: use the examples directly
  • and the example space
  Nearest-neighbor methods
  (Case-Based Reasoning / instance-based learning)
  Notion of distance
• Example: k Nearest Neighbors
  Optional: vote weighted by distance
Supervised Learning > 3) Learning > c) Running the program > Exploring H
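A minimal k-Nearest-Neighbors sketch (with the unweighted majority vote; the function name and toy data are mine):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """k-Nearest-Neighbors prediction: no hypothesis space is searched,
    the training examples are used directly through a distance notion."""
    dist = lambda p: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
    neighbours = sorted(train, key=lambda ex: dist(ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)     # majority vote
    return votes.most_common(1)[0][0]

train = [((0, 0), '-'), ((0, 1), '-'), ((1, 0), '-'),
         ((5, 5), '+'), ((5, 6), '+'), ((6, 5), '+')]
label = knn_predict(train, (4.5, 5.0), k=3)   # the 3 nearest neighbours are all '+'
```

The optional distance-weighted vote would replace the plain count by a weight such as 1/dist for each neighbour.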
59
Inductive bias
• A priori preference for some hypotheses
  Depends on H
  Depends on the search algorithm
• Whatever the inductive test:
  ERM: implicit in H
  MAP: explicit, chosen by the user
  MDL: explicit (length in bits)
  Nearest neighbors: notion of distance
• What justification?
Supervised Learning
Supervised Learning
Less frequent learning types
61
Incremental Learning
• Examples are given/taken one after the other
  Incremental update of the best hypothesis
  Use acquired knowledge to
  • learn better
  • learn faster
• Data are no longer i.i.d.!
  i.i.d.: Independently and Identically Distributed
  = sampled uniformly from a non-changing example generator
  Dependence on time / sequence
• Ex: mobile-phone users' preferences, learning to program, …
Supervised Learning
62
Active Learning
• Set of unlabeled examples
  Labeling an example is expensive
  Choose an example to be labeled: how to choose?
• Data are not i.i.d.
• Ex: video sequence labeling
Supervised Learning
Other types of Machine Learning
Reinforcement Learning
Unsupervised Learning
64
Reinforcement Learning
• Pavlov
  Bell: trigger
  Dog bowl: reward
  Salivating: action
  Association bell ↔ bowl
  Reinforcement of "salivation"
[Diagram: agent-environment loop: perception, action, reward/punishment]
• Control behavior with rewards/punishments
65
Reinforcement Learning
• The agent must discover the right behavior, and optimize it: maximize the expected reward
  st: state at time t
  Action selection: at := argmax_a Q(st, a)
• Updating the values
  rt: reward received at time t
  Q(st, at) ← α Q(st, at) + (1 − α) [ rt+1 + γ max_a Q(st+1, a) ]
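A tabular sketch of this update on a tiny chain environment (the environment and all parameter values are my own illustration, not from the course). Note that the slide writes the update as a convex combination with weight α on the old value; standard texts write the symmetric form with weight (1 − α) on the old value, which is equivalent up to relabeling α:

```python
import random

def q_learning(n_states=4, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a small chain: states 0..n-1, actions -1/+1,
    reward 1 on reaching the rightmost state, which ends the episode.
    Update as on the slide:
      Q(st, at) <- alpha * Q(st, at) + (1 - alpha) * [rt+1 + gamma * max_a Q(st+1, a)]"""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n_states) for a in (-1, +1)}
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = rng.choice((-1, +1))                  # explore uniformly
            s2 = min(max(s + a, 0), n_states - 1)     # deterministic move, clipped
            r = 1.0 if s2 == n_states - 1 else 0.0
            target = r + gamma * max(Q[(s2, b)] for b in (-1, +1))
            Q[(s, a)] = alpha * Q[(s, a)] + (1 - alpha) * target
            s = s2
    return Q

Q = q_learning()
# Greedy policy: argmax_a Q(s, a) for each non-terminal state
policy = {s: max((-1, +1), key=lambda a: Q[(s, a)]) for s in range(3)}
```

After training, the greedy policy moves right (+1) in every state, the optimal behavior on this chain.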
66
Unsupervised Learning
• No class, no output, no reward
• Goal: group similar examples together
• Notion of distance
• Inductive bias
67
Conclusion
• Induction: find a general hypothesis from examples
• Avoid overfitting
• Choose the right hypothesis space
  Not too small (bad induction)
  Not too large (overfitting)
• Use an algorithm adequate
  for the data
  and for the hypothesis space
68
What to remember
• Supervised learning is the most studied setting
• Learning is always biased
• Learning depends on the structure of the hypothesis space
  No structure: interpolation methods
  Local structure: gradient methods (approximation)
  Partial order relation: guided exploration