Transcript of MAI0203 Lecture 13: Machine Learning
Methods of Artificial Intelligence, WS 2002/2003
Part III: Special Aspects
III.13 Machine Learning
Learning
Remember “Deduction vs. Induction”
Induction: Derive a general rule (axiom) from background knowledge and observations.
  Socrates is a human. (background knowledge)
  Socrates is mortal. (observation/example)
  Therefore, I hypothesize that all humans are mortal. (generalization)
Most important approach to learning: Inductive Learning
  others: rote learning, adaptation/re-structuring, speed-up learning
Deduction is knowledge extraction, induction is knowledge acquisition!
Intelligent Systems must Learn
A flexible and adaptive organism cannot rely on a fixed set of behavioral rules but must learn (over its complete life-span)!
If an expert system – brilliantly designed, engineered and implemented – cannot learn not to repeat its mistakes, it is not as intelligent as a worm or a sea anemone or a kitten.
(O. Selfridge)
Learning vs. Knowledge Engineering
Technological advantage: learning avoids the “knowledge engineering bottleneck”!
Good analysis of data/observations can often be superior to explicit programming.
E.g., an expert system for medical diagnosis:
  Describe each patient by his/her clinical data,
  let physicians decide whether some disease is implied or not,
  learn the classification rule (relevant features and their dependencies).
Epistemological problem: induced information can never be “sure” knowledge; it has hypothetical status and is open to revision!
Classification/Concept Learning
Classification/concept learning is the best researched area in machine learning.
Representation of a concept:
  Extensional: (infinite) set of all entities belonging to a class/concept.
  Intensional: finite characterization, e.g.
    Table(x) ← has-3/4-Legs(x), has-Top(x)
Learning: Construction of a finite characterization from a subset of entities (training examples).
Typically, supervised learning: pre-classified positive and negative examples for the concept/class.
In the following: decision tree learning as a symbolic approach to inductive learning of concepts (Quinlan, ID3, C4.5).
Decision Tree Learning/Training Examples
Fictional example: learn in which regions of the world the Tse-Tse fly lives.
Training examples can be presented sequentially (incremental learning) or all at once (batch learning).
Nr Vegetation Longitude Humidity Altitude Tse-Tse fly
1 swamp near eq. high high occurs
2 grassland near eq. high low occurs not
3 forest far f. eq. low high occurs not
4 grassland far f. eq. high high occurs not
5 forest near eq. high low occurs
6 grassland near eq. low low occurs not
7 swamp near eq. low low occurs
8 swamp far f. eq. low low occurs not
Feature Vectors
Examples are represented as feature vectors (x1, x2, x3, x4).
Attributes and their value ranges:
  Vegetation: x1 ∈ {swamp, grassland, forest}
  Longitude: x2 ∈ {near eq., far f. eq.}
  Humidity: x3 ∈ {high, low}
  Altitude: x4 ∈ {high, low}
  Class ∈ {occurs, occurs not}
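To make the later calculations easy to follow, here is a minimal sketch (not part of the original slides) of the training set encoded as feature vectors in Python; the attribute names and value strings are taken from the table above, the encoding itself is my own.

```python
# Training examples from the Tse-Tse fly table, one dict per example.
TRAINING_EXAMPLES = [
    {"vegetation": "swamp",     "longitude": "near eq.",   "humidity": "high", "altitude": "high", "class": "occurs"},
    {"vegetation": "grassland", "longitude": "near eq.",   "humidity": "high", "altitude": "low",  "class": "occurs not"},
    {"vegetation": "forest",    "longitude": "far f. eq.", "humidity": "low",  "altitude": "high", "class": "occurs not"},
    {"vegetation": "grassland", "longitude": "far f. eq.", "humidity": "high", "altitude": "high", "class": "occurs not"},
    {"vegetation": "forest",    "longitude": "near eq.",   "humidity": "high", "altitude": "low",  "class": "occurs"},
    {"vegetation": "grassland", "longitude": "near eq.",   "humidity": "low",  "altitude": "low",  "class": "occurs not"},
    {"vegetation": "swamp",     "longitude": "near eq.",   "humidity": "low",  "altitude": "low",  "class": "occurs"},
    {"vegetation": "swamp",     "longitude": "far f. eq.", "humidity": "low",  "altitude": "low",  "class": "occurs not"},
]

ATTRIBUTES = ["vegetation", "longitude", "humidity", "altitude"]
```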
DT Learning/Hypothesis
[Figure: a possible decision tree. Root node x1 (vegetation) with three branches:
  grassland → NO Tse-Tse Fly;
  swamp → x2 (longitude): near eq. → Tse-Tse Fly, far from eq. → NO Tse-Tse Fly;
  forest → x2 (longitude): near eq. → Tse-Tse Fly, far from eq. → NO Tse-Tse Fly]
Hypotheses: IF-THEN Classification Rules
IF: a disjunction of conjunctions of constraints over attribute values; THEN: a decision for a class.
Each path is a rule:
  IF grassland THEN no Tse-Tse fly
  IF (swamp AND near equator) THEN Tse-Tse fly
  ...
Generalization: only relevant attributes are used; the rules can be applied to classify unseen examples.
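Read as executable rules, the tree above corresponds to a classifier like the following sketch (the function name and dictionary keys are my own, matching the feature-vector encoding above):

```python
def classify(example):
    """Apply the IF-THEN rules read off the decision tree above."""
    if example["vegetation"] == "grassland":
        return "occurs not"      # IF grassland THEN no Tse-Tse fly
    # swamp or forest: the decision depends on the longitude
    if example["longitude"] == "near eq.":
        return "occurs"          # IF (swamp OR forest) AND near equator THEN Tse-Tse fly
    return "occurs not"          # IF far from equator THEN no Tse-Tse fly

# Classifies all eight training examples correctly, e.g.:
# all(classify(e) == e["class"] for e in TRAINING_EXAMPLES)  -> True
```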
DT Learning/Basic Algorithm
ID3(examples, attributes, target-class):
1. Create a root node
2. IF all examples are positive, return the root with label “target-class = +”
3. IF all examples are negative, return the root with label “target-class = −”
4. IF attributes is empty, return the root and use the most common value for the target class in the examples as label
5. Otherwise
   (a) Select an attribute A from attributes
   (b) Assign A to the root
   (c) For each possible value v_i of A
       i. Add a new branch below the root, corresponding to the test A = v_i
       ii. Let examples_{v_i} be the subset of examples that have value v_i for A
       iii. IF examples_{v_i} is empty THEN add a leaf node and use the most common value for the target class in the examples as label ELSE ID3(examples_{v_i}, attributes − {A}, target-class)
6. Return the tree
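The pseudo-code translates almost line by line into Python. The sketch below is my own rendering (the names are not from the slides); it assumes the dictionary encoding of examples introduced earlier and, for simplicity, it branches only on attribute values that actually occur in the current example subset.

```python
from collections import Counter

def most_common_class(examples, target):
    """Most common value of the target class among the examples (steps 4 and 5c-iii)."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def id3(examples, attributes, target="class"):
    """Returns a class label (leaf) or a pair (attribute, {value: subtree})."""
    classes = {e[target] for e in examples}
    if len(classes) == 1:                    # steps 2/3: all examples positive or all negative
        return classes.pop()
    if not attributes:                       # step 4: no attributes left to test
        return most_common_class(examples, target)
    attribute = attributes[0]                # step 5a: naive selection; see "Selecting an Attribute"
    branches = {}
    for value in {e[attribute] for e in examples}:            # step 5c
        subset = [e for e in examples if e[attribute] == value]
        remaining = [a for a in attributes if a != attribute]
        branches[value] = id3(subset, remaining, target)      # step 5c-iii (subset is never empty here)
    return (attribute, branches)             # step 6: return the tree

# Usage (with the TRAINING_EXAMPLES and ATTRIBUTES sketched earlier):
# tree = id3(TRAINING_EXAMPLES, ATTRIBUTES)
```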
Example

[Figure: first splits of ID3 on the training set. Root x1 (vegetation):
  grassland → examples {2, 4, 6}, all negative → leaf node “NO Tse-Tse Fly”;
  swamp → examples {1, 7, 8} → next attribute?;
  forest → examples {3, 5} → next attribute?
For the swamp branch ({1, 7, 8}), testing x2 (longitude):
  near eq. → {1, 7} → Tse-Tse Fly;
  far from eq. → {8} → NO Tse-Tse Fly]
Selecting an Attribute
Naive: from the list of attributes, in some arbitrary order, always take the next one.
Problem: the DT might be more complex than necessary and contain branches which are irrelevant.
Selection by information gain → DT with minimal depth.
What is the contribution of an attribute to deciding whether an entity belongs to the class or not?
Entropy of a collection S:
  E(S) = Σ_{i=1..c} −p_i · log2(p_i)
with p_i as the proportion (relative frequency) of objects belonging to category i.
Logarithm to base 2: information theory, expected encoding length measured in bits.
Entropy/Illustration
Illustration of entropy:
S: flipping a fair coin, P(head) = P(tail) = 0.5
  E(Coin) = −0.5 · log2(0.5) [head up] − 0.5 · log2(0.5) [tail up] = 1 bit
Flipping a coin with two heads: E = 0 bits
Flipping a biased coin (e.g., P(head) = 0.75): E ≈ 0.81 bits
Entropy of a set of training examples: proportion of examples belonging to the class vs. not belonging to the class:
  E(S) = −p⊕ · log2(p⊕) − p⊖ · log2(p⊖)
For our 8 training examples (3 × “occurs”, 5 × “occurs not”): E(S) ≈ 0.954 bits.
Without looking at any attribute, we have a slight advantage if guessing the more frequent class “no Tse-Tse fly”.
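A few lines of Python reproduce these numbers (the helper function is my own, not from the slides):

```python
from math import log2

def entropy(labels):
    """E(S) = sum over categories i of -p_i * log2(p_i)  (written as p_i * log2(1/p_i))."""
    n = len(labels)
    return sum(labels.count(c) / n * log2(n / labels.count(c)) for c in set(labels))

print(entropy(["head", "tail"]))                        # fair coin: 1.0 bit
print(entropy(["head", "head"]))                        # coin with two heads: 0.0 bits
print(entropy(3 * ["occurs"] + 5 * ["occurs not"]))     # training set: ~0.954 bits
```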
Information Gain
What does the knowledge of the value of an attribute A contribute to predicting the class?
  Gain(S, A) = E(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)
The weighted sum is the entropy of S “relative” to attribute A.
Example: attribute x2 “Longitude”
  E(S_near eq.) = −(3/5) · log2(3/5) − (2/5) · log2(2/5) ≈ 0.971
  E(S_far f. eq.) = −(0/3) · log2(0/3) − (3/3) · log2(3/3) = 0
  (remark: log2(0) is undefined; we set 0 · log2(0) to 0)
  Weight with the proportion of examples having x2 = near eq.: 5/8, and x2 = far f. eq.: 3/8:
  Gain(S, x2) = 0.954 − (5/8 · 0.971 + 3/8 · 0) ≈ 0.35
Remark: In different subtrees, different attributes might give the best information gain.
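Using the encoded training set from above, the gain calculation for “Longitude” can be reproduced directly (a sketch; the function name is my own, and the entropy helper is repeated from the previous slide):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(labels.count(c) / n * log2(n / labels.count(c)) for c in set(labels))

def information_gain(examples, attribute, target="class"):
    """Gain(S, A) = E(S) - sum over values v of A of |S_v|/|S| * E(S_v)."""
    total = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return total - remainder

# print(round(information_gain(TRAINING_EXAMPLES, "longitude"), 2))   # 0.35, as on the slide
```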
Performance Evaluation
A learning algorithm is good if it produces hypotheses which work well when classifying unseen examples.
Validation: use a training set and a test set.
  Construct a hypothesis using the training set.
  Estimate the generalization error using the test set.
For decision trees the training error is always 0 as long as the training data are consistent (in particular, on linearly separable data).
Danger: Overfitting the data.
If crucial information (relevant features) is missing, the DT algorithm might nevertheless find a hypothesis consistent with the examples: it uses irrelevant attributes.
E.g., predicting whether a coin comes up heads or tails based on the day of the week.
How to Avoid Overfitting
Pruning (Occam's razor: prefer simple hypotheses): do not use attributes with information gain near zero.
Cross-validation: run the algorithm k times, each time leaving out a different subset of 1/k of the examples.
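A sketch of what this looks like in code (the function and its parameters are hypothetical placeholders, not from the slides): the learner and the error measure are passed in, e.g. the ID3 sketch above and a simple misclassification rate.

```python
def cross_validation_error(examples, k, train, test_error):
    """Average test error over k runs, each time leaving out a different 1/k of the examples."""
    fold_size = len(examples) // k
    errors = []
    for i in range(k):
        held_out = examples[i * fold_size:(i + 1) * fold_size]   # the left-out subset
        training = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        hypothesis = train(training)
        errors.append(test_error(hypothesis, held_out))
    return sum(errors) / k
```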
Characteristics of DT Algorithms
Hypothesis language (DTs): restricted to separating planes which are parallel to the axes of the feature space.
Inductive biases (no learning without bias):
  Each learning algorithm has a language bias: the hypothesis language restricts what can be learned.
  Search bias: top-down search in the space of possible decision trees.
DT research was inspired by the psychological work of Bruner, Goodnow, & Austin (1956) on concept formation.
Advantage of the symbolic approach: verbalizable rules which can be directly used in knowledge-based systems.
DTs vs. Perceptrons: XOR
Perceptron: o(x) = 1 if Σ_i w_i · x_i > θ, and 0 otherwise
Output: calculated as a (thresholded) linear combination of the inputs.
Learning: incremental; change the weights w_i whenever a classification error occurs.
Works for linearly separable classes
e.g. not for XOR
x1  x2  class
0   0   0
0   1   1
1   0   1
1   1   0

[Figure: a decision tree representing XOR. Root x1: if x1 = 0, test x2 (x2 = 0 → no, x2 = 1 → yes); if x1 = 1, test x2 (x2 = 0 → yes, x2 = 1 → no).]
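The XOR limitation is easy to demonstrate; the following sketch (my own, not from the slides) trains a threshold unit with the incremental error-correction rule and shows that it never reproduces the XOR table:

```python
def train_perceptron(samples, epochs=100, lr=0.1):
    """Threshold unit: o(x) = 1 if w1*x1 + w2*x2 > theta else 0,
    trained incrementally: weights change only on a classification error."""
    w = [0.0, 0.0]
    theta = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            o = 1 if w[0] * x1 + w[1] * x2 > theta else 0
            error = target - o
            if error != 0:                 # classification error: adjust weights and threshold
                w[0] += lr * error * x1
                w[1] += lr * error * x2
                theta -= lr * error
    return w, theta

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w, theta = train_perceptron(XOR)
predictions = [1 if w[0] * x1 + w[1] * x2 > theta else 0 for (x1, x2), _ in XOR]
print(predictions)   # never equals [0, 1, 1, 0]: XOR is not linearly separable
```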
Other Learning Approaches
Theoretical foundations: Algorithmic/Computational/Statistical Learning Theory
  What kind of hypotheses can be learned from which data with which effort?
Three main approaches:
(1) Supervised Learning: learning from examples
  Decision Tree Algorithms
  Inductive Logic Programming (ILP)
  Genetic Algorithms
  Feedforward Nets (Backpropagation)
  Support Vector Machines, statistical approaches
Other Learning Approaches cont.
...
(2) Unsupervised Learning: k-means clustering, Self-Organizing Maps
  Clustering of data into groups (e.g., customers with similar behavior)
  Application: Data Mining
(3) Reinforcement Learning: learning from problem-solving experience, positive/negative feedback
Other Learning Approaches cont.
Inductive Program Synthesis: learn recursive rules, typically from small sets of only positive examples
  ILP
  Genetic Programming
  Functional approaches (recurrence detection in terms)
  Grammar inference
Explanation-based learning
Learning from Analogical/Case-Based Reasoning (schema acquisition, abstraction)
Textbook:
Tom Mitchell, Machine Learning, McGraw-Hill, 1997
The Running Gag of MAI 02/03
Question: How many AI people does it take to change a lightbulb? Answer: At least 67.
6th part of the solution: The Learning Group (4)
One to collect twenty lightbulbs
One to collect twenty "near misses"
One to write a concept learning program that learns toidentify lightbulbs
One to show that the program found a local maximumin the space of lightbulb descriptions
Check the groups we did not consider in this lecture: The Statistical Group (1), The Robotics Group (7), The Lisp Hackers (7), The Connectionist Group (6), The Natural Language Group (5). (“Artificial Intelligence”, Rich & Knight)