Intro to AI: Learning
Ruth Bergman
Fall 2002
Learning
• What is learning? Learning is a process by which the learner improves his/her/its predictive ability using new experiences.
Machine Learning
• Learn a function
• Process: algorithms for improving the model of the function
• Improve predictive ability: reduce the difference between the true function and the model
• Experiences: input examples or instances
A classification problem
• Predict whether a patient will experience heart disease
• Key steps:
  – Data: what “past experience” do we have? What are the underlying assumptions? Medical records, exercise habits, smoking habits…
  – Representation: how do we summarize a patient?
  – Estimation: how do we construct a map from patients to presence of heart disease?
  – Evaluation: how well does our model predict? Can we do better?
Heart-Disease Representation
• Fourteen attributes:
1. Age (in years)
2. Sex (male/female)
3. Chest pain type (typical angina, atypical angina, non-anginal pain, asymptomatic)
4. Resting blood pressure (in mm Hg on admission to the hospital)
5. Serum cholesterol (in mg/dl)
6. Fasting blood sugar (1 if > 120 mg/dl, 0 otherwise)
7. Resting electrocardiographic results (normal, ST-T wave abnormality, hypertrophy)
8. Maximum heart rate
9. Exercise-induced angina (yes/no)
10. ST depression induced by exercise relative to rest
11. Slope of the peak exercise ST segment (upsloping, flat, downsloping)
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal (normal, fixed defect, reversible defect)
14. Diagnosis of heart disease (0 if < 50% diameter narrowing in any major vessel, 1 otherwise)
Heart-Disease Examples

age | sex    | chest pain   | resting bp | cholesterol | blood sugar | exercise-induced angina | … | heart disease
63  | male   | typical      | 145        | 233         | 1           | 0                       | … | no
67  | male   | asymptomatic | 160        | 286         | 0           | 1                       | … | yes
67  | male   | asymptomatic | 120        | 229         | 0           | 1                       | … | yes
37  | male   | non-anginal  | 130        | 250         | 0           | 0                       | … | no
41  | female | atypical     | 130        | 204         | 0           | 0                       | … | no
Learning Paradigms
• Supervised learning
  – Given a set of examples and the correct results
  – Instance: <feature vector, classification>
  – Example: x, f(x)
• Unsupervised learning
  – Capture the inherent organization in the data
  – Instance: <feature vector>
  – Example: x
• Reinforcement learning
  – Given feedback (reward) for performing well (or badly), not told what we should be doing
  – Instance: <feature vector>
  – Example: x, rewards based on performance
Inductive Learning
• Suppose the underlying problem domain is described by a function f
• Given pairs <x, f(x)>
• Compute a hypothesis h that approximates f as well as possible given the presented data
• In general the input under-constrains the hypothesis h, so we have to choose. The way that choice is made is called bias.
Two Function Types
• Classification
  The target function is a classification:
      f(x) = 1 if x ∈ C, 0 otherwise
  [Figure: positive (+) examples inside region C, negative (–) examples outside]
• Regression
Representation of Classifiers
• We could use many representation models:
  – Decision trees
  – A set of rules
  – A Prolog program (Horn clauses)
  – Neural networks
  – Belief networks
• Each representation has multiple learning algorithms
Decision Trees
[Figure: an example decision tree. The root tests blood pressure (high/medium/low); subtrees test cholesterol and chest pain; leaves predict heart disease (0/1, yes/no)]
Inducing Decision Trees from Examples
• Trivial solution: one path in the tree for each example
  – Bad generalization
A Decision Tree Learning Algorithm
ID3(examples, attributes, default)
  if examples is empty: return default
  if all examples have the same classification: return that classification
  if attributes is empty: return majority-value(examples)
  best ← CHOOSE-ATTRIBUTE(attributes, examples)
  tree ← new tree with root test best
  for each value vi of best:
    examplesi ← elements of examples with best = vi
    subtree ← ID3(examplesi, attributes − {best}, majority-value(examples))
    add a branch to tree with label vi and subtree subtree
  return tree
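The pseudocode above can be sketched in Python. This is a minimal illustration, not the course's implementation; one simplification is that it only branches on attribute values observed in the current example set, whereas the pseudocode iterates over all values of the chosen attribute:

```python
import math
from collections import Counter

def entropy(labels):
    """I(...) = sum_i -P(v_i) log2 P(v_i) over the class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attr, target):
    """Gain(A) = I(p, n) - Remainder(A)."""
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

def majority_value(examples, target):
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def id3(examples, attributes, default, target="class"):
    if not examples:
        return default
    labels = {e[target] for e in examples}
    if len(labels) == 1:                       # all same classification
        return labels.pop()
    if not attributes:
        return majority_value(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:      # one branch per observed value
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset,
                            [a for a in attributes if a != best],
                            majority_value(examples, target), target)
    return tree

# Four toy patients; chest pain type alone classifies them perfectly,
# so the induced tree never needs the blood-pressure attribute.
data = [{"pain": "asym", "bp": "high", "class": "yes"},
        {"pain": "asym", "bp": "low", "class": "yes"},
        {"pain": "typical", "bp": "high", "class": "no"},
        {"pain": "typical", "bp": "low", "class": "no"}]
print(id3(data, ["pain", "bp"], "no"))  # root tests 'pain'
```

The returned tree is a nested dict: each internal node maps an attribute name to a dict of value-to-subtree branches, and leaves are class labels.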
Possible Trees
[Figure: two candidate trees. One is rooted at a blood pressure test (high/medium/low); the other is rooted at chest pain with a cholesterol test below]
Which is the best tree?
What Makes A Good Tree
• Ockham’s razor principle (assumption): the most likely hypothesis is the simplest one that is consistent with the training examples.
• Bias for short trees: minimize (on average) the number of questions we need to answer before reaching a decision.
• Finding the smallest decision tree that matches the training examples is NP-hard.
• Select test attributes using a heuristic from Information Theory:
  – the attribute that provides the most information.
Information Content of Attributes
• A perfect attribute divides the examples into sets that are all positive or all negative.
• A useless attribute leaves the example sets with the same proportion of positive and negative examples as the original.

[Figure: a mixed set of + and – examples split two ways: a perfect attribute yields pure children, while a useless attribute yields children with the parent's +/– mixture]
Using Information Theory
• Suppose we have p positive and n negative examples
  – The probability of a positive example is p/(p+n)
  – The probability of a negative example is n/(p+n)
• Information content
  – For possible values v1, …, vn with respective probabilities P(v1), …, P(vn):
    I(P(v1), …, P(vn)) = Σi −P(vi) log2 P(vi)
• Information content of an attribute:
    I(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
• For example, an even split gives −½(−1) − ½(−1) = 1 bit
• Something that doesn't divide the set at all yields 0 bits
Using Information Theory
• Suppose test A divides the training set into subsets E1, E2, …, Em, and subset Ei has pi positive and ni negative examples
  – The uncertainty remaining in the children is measured by
    Remainder(A) = Σi (pi + ni)/(p + n) · I(pi/(pi+ni), ni/(pi+ni))
  – On average we will need Remainder(A) more bits of information
• The gain of test A is
    Gain(A) = I(p, n) − Remainder(A)
• For example, if A completely classifies the set, the Remainder is 0 and the Gain is I(p, n) (1 bit for an evenly split set)
• Gain(front_row) = 1 − [2/12 I(0,1) + 4/12 I(1,0) + 6/12 I(2/6, 4/6)] = 0.541
• Gain(prev_grade) = 1 − [1 · I(1/2, 1/2)] = 0
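The front_row and prev_grade numbers above can be checked directly. A small sketch follows; the per-branch positive/negative counts are inferred from the fractions on the slide, since the underlying examples are not shown:

```python
import math

def info(p, n):
    """I(p/(p+n), n/(p+n)): entropy in bits of p positives and n negatives."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def remainder(splits):
    """Remainder(A) = sum_i (p_i + n_i)/(p + n) * I(p_i, n_i)."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * info(p, n) for p, n in splits)

def gain(p, n, splits):
    """Gain(A) = I(p, n) - Remainder(A)."""
    return info(p, n) - remainder(splits)

# front_row: 12 examples (6+/6-) split into branches with counts
# (0+,2-), (4+,0-), (2+,4-), matching 2/12 I(0,1) + 4/12 I(1,0) + 6/12 I(2/6,4/6)
print(round(gain(6, 6, [(0, 2), (4, 0), (2, 4)]), 3))  # 0.541
# prev_grade: a single branch with the original 6+/6- mixture
print(gain(6, 6, [(6, 6)]))  # 0.0
```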
Results of Learning
• ID3 algorithm
• 50 examples from the Cleveland heart-disease database

N-Examples: 50. Choosing from: (att1 att2 att3 att4 att5 att6 att7 att8 att9 att10 att11 att12 att13)
  Attribute: att1. Gain: 0.0741303
  Attribute: att2. Gain: 0.041963935
  Attribute: att3. Gain: 0.20732212
  Attribute: att4. Gain: 0.06303495
  Attribute: att5. Gain: 0.04215479
  Attribute: att6. Gain: 0.12409884
  Attribute: att7. Gain: 0.014267027
  Attribute: att8. Gain: 0.12125653
  Attribute: att9. Gain: 0.3067463
  Attribute: att10. Gain: 0.18902457
  Attribute: att11. Gain: 0.0412457
  Attribute: att12. Gain: 0.23261213
  Attribute: att13. Gain: 0.24738503
Selected attribute: att9
Results of Learning
N-Examples: 37. Choosing from: (att1 att2 att3 att4 att5 att6 att7 att8 att10 att11 att12 att13)
  Attribute: att1. Gain: 0.13357532
  Attribute: att2. Gain: 0.07225275
  Attribute: att3. Gain: 0.06493038
  Attribute: att4. Gain: 0.05581081
  Attribute: att5. Gain: 0.053394675
  Attribute: att6. Gain: 0.18130744
  Attribute: att7. Gain: 0.020807564
  Attribute: att8. Gain: 0.070365906
  Attribute: att10. Gain: 0.08575398
  Attribute: att11. Gain: 0.0064561963
  Attribute: att12. Gain: 0.26021093
  Attribute: att13. Gain: 0.20255792
Selected attribute: att12
Results of Learning
N-Examples: 6. Choosing from: (att1 att2 att3 att4 att5 att6 att7 att8 att10 att11 att13)
  Attribute: att1. Gain: 0.31668913
  Attribute: att2. Gain: 0.45914793
  Attribute: att3. Gain: 0.45914793
  Attribute: att4. Gain: 0.31668913
  Attribute: att5. Gain: 0.91829586
  Attribute: att6. Gain: 0.10917032
  Attribute: att7. Gain: 0.0
  Attribute: att8. Gain: 0.45914793
  Attribute: att10. Gain: 0.10917032
  Attribute: att11. Gain: 0.25162917
  Attribute: att13. Gain: 0.25162917
Selected attribute: att5
Results of Learning
[Figure: the induced tree. The root tests exercise-induced angina (att9); subtrees test fluoroscopy (att12), cholesterol (att5), defect (att13), chest pain (att3), and blood pressure (att4), with threshold branches such as cholesterol ≤ 239 and bp ≤ 120; leaves predict 0/1]
Example Derived From 300 Tests
[Figure: a much larger tree induced from 300 examples, with dozens of threshold tests on attributes such as defect (13), chest pain (3), cholesterol, and blood pressure; far too large to reproduce here]
How Do We Assess Performance?
• Collect a large set of samples
• Divide them into a training set and a test set (no examples in common!)
• Use the learning algorithm on the training set to generate a hypothesis (decision tree) H
• Measure the percentage of test examples that H classifies correctly
• Do this for many training sets of many different sizes
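The protocol above can be sketched as follows; a toy labeling and a hand-built hypothesis stand in for a real dataset and a learned decision tree:

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle and partition so the two sets share no examples."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(hypothesis, test_set):
    """Percentage of test examples the hypothesis classifies correctly."""
    correct = sum(1 for x, label in test_set if hypothesis(x) == label)
    return 100.0 * correct / len(test_set)

# Toy data: the label is 1 iff the single feature exceeds 5.
data = [(x, int(x > 5)) for x in range(20)]
train, test = train_test_split(data)
# A hand-built "learned" hypothesis standing in for an induced tree.
h = lambda x: int(x > 5)
print(accuracy(h, test))  # 100.0: h matches the labeling exactly
```

Repeating this over training sets of increasing size gives the learning curve shown on the next slide.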
The Learning Curve
[Figure: Cleveland heart-disease learning curve; % correct classification (y-axis, 0-90) vs. number of examples (x-axis, 0-400)]
Application: GASOIL
GASOIL is an expert system for designing gas/oil separation systems stationed off-shore.
• System attributes: proportions of gas, oil, and water; flow rate; pressure; density; viscosity; temperature; and others
• To build by hand would take ~10 person-years
• Built by decision-tree induction in ~100 person-days
• At the time (1986), GASOIL was the biggest expert system in the world, containing ~2500 rules, and saved BP millions.
Application: Learning to Fly
• Learning to fly a Cessna on a flight simulator (1992)
  – Three skilled pilots performed an assigned flight plan 30 times each
  – Each control action (e.g. on throttle, flaps) created an example
  – 90,000 examples
  – A decision tree was created
  – It was converted into C and put into the simulator control loop
• The program flies better than its teachers!
  – probably because generalization cleans up occasional mistakes
Representation power of decision trees
Any Boolean function can be written as a decision tree.
[Figure: a two-level tree testing x1 and then x2, with Yes/No leaves]

Decision trees cannot represent tests that refer to 2 or more objects, e.g.
  ∃r2 Nearby(r2, r) ∧ Price(r, p) ∧ Price(r2, p2) ∧ Cheaper(p2, p)
Representation with decision trees…
• Parity problem:

[Figure: a full binary tree testing x1, then x2, then x3, with leaves Y N N Y N Y Y N]

  An exponentially large tree that cannot be compressed.

• n features (aka attributes)
• 2^n rows in the truth table
• Each row can take one of 2 values
• So there are 2^(2^n) Boolean functions of n attributes
Machine Learning Issues
• Unrepresentative examples
• Insufficient data
• Noise: incorrectly labelled examples
  – if there are errors in our examples, then these will end up in the decision tree
  – overfitting
• Missing data
  – sometimes we don't have all of the attributes
• Attributes with lots of values
  – tend to look good because each example has an almost unique value
• Continuous-valued attributes
Empty Leaves
• The examples do not represent all possible attribute values
• Pass a default value to subtrees when splitting
  – usually the majority classification at the parent node
Noisy Input
• Incorrectly labeled examples can result in leaves where the examples have conflicting labels and no split exists
• Select majority label
Many-Valued Attributes
• Problem with Information Gain:
  – prefers attributes with many values
  – extreme cases:
    • Social Security numbers
    • patient IDs
    • integer/nominal attributes with many values (Julian day)
• Use the Gain Ratio splitting criterion:

  GainRatio(A) = Gain(A) / SplitInfo(A)
  SplitInfo(A) = −Σi (pi + ni)/(p + n) · log2((pi + ni)/(p + n))
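A sketch of the effect, with counts invented for illustration: an ID-like attribute and a sensible binary attribute can have the same Gain, but SplitInfo penalizes the many-valued one:

```python
import math

def info(p, n):
    """Entropy in bits of a set with p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c)

def gain_and_split_info(splits):
    """splits: per-branch (positives, negatives) counts for attribute A.
    Returns (Gain(A), SplitInfo(A))."""
    p = sum(pi for pi, ni in splits)
    n = sum(ni for pi, ni in splits)
    remainder = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
    split_info = -sum((pi + ni) / (p + n) * math.log2((pi + ni) / (p + n))
                      for pi, ni in splits)
    return info(p, n) - remainder, split_info

def gain_ratio(splits):
    g, s = gain_and_split_info(splits)
    return g / s if s else 0.0

# An ID-like attribute: 8 examples, one per branch, so Gain = 1 bit.
ids = [(1, 0)] * 4 + [(0, 1)] * 4
# A sensible binary attribute with the same Gain of 1 bit.
binary = [(4, 0), (0, 4)]
print(round(gain_ratio(ids), 3), gain_ratio(binary))  # 0.333 1.0
```

Both attributes classify the set perfectly, but the ID attribute's SplitInfo is 3 bits versus 1, so Gain Ratio prefers the binary split.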
Continuous Valued Attributes
• How to cluster values into logical segments:
  – sort by value, then find the best threshold for a binary split
  – cluster into n intervals and do an n-way split
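The sort-then-threshold approach can be sketched as below; the cholesterol-like values are invented for illustration, not taken from the dataset:

```python
import math

def entropy(labels):
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def best_threshold(values, labels):
    """Sort by value, then try a threshold between each pair of adjacent
    distinct values; return the one whose binary split leaves the lowest
    weighted child entropy (i.e. the highest gain)."""
    pairs = sorted(zip(values, labels))
    best_t, best_rem = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if rem < best_rem:
            best_t, best_rem = t, rem
    return best_t

# Toy cholesterol values: disease appears above ~240 in this data.
vals = [204, 229, 233, 250, 286]
labs = ["no", "no", "no", "yes", "yes"]
print(best_threshold(vals, labs))  # 241.5
```

With n examples there are at most n − 1 candidate thresholds, so the search is cheap after the initial sort.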
Missing Attribute Values
• Some data sets have many missing values
• Assume that the missing value is:
  – the same as the majority value for the attribute
  – the same as the majority value for this attribute at this node
  – the same as the majority value among examples with the same label
Overfitting
[Figure: accuracy on training and test data as the tree grows. ©Tom Mitchell, McGraw Hill, 1997]
The tree is too large and has poor predictive power.
Pre-Pruning (Early Stopping)
• Evaluate splits before installing them:
  – don't install splits that don't look worthwhile
  – when there are no worthwhile splits left to install, stop
• Seems right, but:
  – it is hard to properly evaluate a split without seeing what splits would follow it (use lookahead?)
  – some attributes are useful only in combination with other attributes
  – suppose no single split looks good at the root node?
Post-Pruning
• Grow decision tree to full depth (no pre-pruning)
• Prune-back full tree by eliminating splits that do not appear to be warranted statistically
• Use train set, or an independent prune/test set, to evaluate splits
• Stop pruning when remaining splits all appear to be warranted
• Alternate approach: convert to rules, then prune rules
Converting Decision Trees to Rules
• Each path from the root to a leaf is a separate rule:

  if (fp=1 & ¬pc & primip & ¬fd & bw<3349) => 0,
  if (fp=2) => 1,
  if (fp=3) => 1.

fetal_presentation = 1: +822+116 (tree) 0.8759 0.1241 0
| previous_csection = 0: +767+81 (tree) 0.904 0.096 0
| | primiparous = 1: +368+68 (tree) 0.8432 0.1568 0
| | | fetal_distress = 0: +334+47 (tree) 0.8757 0.1243 0
| | | | birth_weight < 3349: +201+10.555 (tree) 0.9482 0.05176 0
fetal_presentation = 2: +3+29 (tree) 0.1061 0.8939 1
fetal_presentation = 3: +8+22 (tree) 0.2742 0.7258 1
Advantages of Decision Trees
• DT learning is relatively fast, even with large data sets (~10^6 examples) and many attributes (~10^3)
  – advantage of recursive partitioning: only process all cases at the root
• Small-to-medium-size trees are usually intelligible
• Can be converted to rules
• The algorithm does feature selection
• The resulting model is often compact (Occam's Razor)
• The decision tree representation is understandable
Decision Trees are Intelligible
Not ALL Decision Trees Are Intelligible
[Figure: part of the best-performing C-section decision tree, from Rich Caruana]
Disadvantages of Decision Trees
• Large or complex trees can be just as unintelligible as other models
• Trees don’t easily represent some basic concepts such as M-of-N, parity, non-axis-aligned classes…
• Don't handle real-valued parameters as well as Booleans
• If the model depends on summing the contributions of many different attributes, DTs probably won't do well
• DTs that look very different can be the same or similar
• Propositional (as opposed to first-order)
• Recursive partitioning: you run out of data fast as you descend the tree
Instance-Based Learning
Inductive Assumption
• Similar inputs map to similar outputs
  – If not true => learning is impossible
  – If true => learning reduces to defining "similar"
• Not all similarities are created equal
  – predicting a person's weight may depend on different attributes than predicting their IQ
Nearest Neighbor Classification
• Training: retain all examples
• Prediction: a new example is assigned the same classification as its nearest neighbor
• Similarity measure: a distance function in attribute space

[Figure: + and o examples scattered in a two-attribute space; a query point takes the class of its nearest neighbor]
Similarity Measure: Euclidean Distance

  D(c1, c2) = sqrt( Σi=1..N (attri(c1) − attri(c2))^2 )
[Figure: + and o examples in a two-attribute space]
Booleans, Nominals, Ordinals, and Reals
• Consider attribute value differences:
  – Reals: easy! full continuum of differences
  – Integers: not bad: discrete set of differences
  – Ordinals: not bad: discrete set of differences
  – Booleans: less info: use Hamming distance
  – Nominals: less info: use Hamming distance

Hamming distance:
  h(ai(c1), ai(c2)) = 0 if ai(c1) = ai(c2), 1 if ai(c1) ≠ ai(c2)
k-Nearest Neighbor
• 1-NN works well if there is no attribute or class noise
• An average of k points is more reliable when there is:
  – noise in attributes
  – noise in class labels
  – classes that partially overlap
• Prediction: a new example is assigned the classification of the majority of its k nearest neighbors

[Figure: + and o examples; the query point's k nearest neighbors vote on its class]
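A minimal nearest-neighbor sketch covering both 1-NN and the k-nearest majority vote; the points are invented to mimic the figure's two clusters:

```python
import math
from collections import Counter

def euclidean(a, b):
    """D(c1, c2) = sqrt(sum_i (a_i - b_i)^2), as on the similarity-measure slide."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label). Majority vote of the k nearest."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two clusters in a two-attribute space.
train = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
         ((6, 6), "o"), ((6, 7), "o"), ((7, 6), "o")]
print(knn_predict(train, (2, 2), k=1))  # +
print(knn_predict(train, (5, 5), k=3))  # o
```

Training is just storing `train`; all the work happens at prediction time, which is why this family of methods is called lazy.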
How to choose “k”
• Large k:
  – less sensitive to noise (particularly class noise)
  – better probability estimates for discrete classes
  – larger training sets allow larger values of k
• Small k:
  – captures the fine structure of the space better
  – may be necessary with small training sets
• A balance must be struck between large and small k
Cross-Validation
• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• Leave-one-out cross-validation (LOOCV):
  – "remove" each case one at a time
  – use it as a test case, with the remaining cases as the training set
  – average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with instance-based methods!
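LOOCV with an instance-based learner is just the loop below; there is no per-fold retraining beyond excluding the held-out case. The points are invented for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k):
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loocv_accuracy(examples, k):
    """Hold out each example in turn, predict it from the rest, and average."""
    correct = 0
    for i, (x, label) in enumerate(examples):
        rest = examples[:i] + examples[i + 1:]   # "remove" one case
        if knn_predict(rest, x, k) == label:
            correct += 1
    return correct / len(examples)

data = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
        ((6, 6), "o"), ((6, 7), "o"), ((7, 6), "o")]
print(loocv_accuracy(data, k=1))  # 1.0 on these well-separated clusters
```

Note the contrast with evaluating 1-NN on its own training data, which is trivially 100% accurate; holding each case out removes that illusion.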
Distance-Weighted kNN
• The tradeoff between small and large k can be difficult
  – use large k, but put more emphasis on nearer neighbors?

  prediction = ( Σi=1..k wi · classi ) / ( Σi=1..k wi ),  where  wi = 1 / D(ci, ctest)
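A sketch of the weighted vote, with classes encoded as 0/1 so the weighted average is a score. Giving an exact distance match infinite weight is a common convention assumed here, not stated on the slide:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn(train, query, k):
    """prediction = sum(w_i * class_i) / sum(w_i), with w_i = 1 / D(c_i, c_test).
    Classes are numeric (0/1), so the result can be thresholded at 0.5."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    num = den = 0.0
    for x, cls in neighbors:
        d = euclidean(x, query)
        if d == 0:
            return cls  # an exact match dominates all other votes
        w = 1.0 / d
        num += w * cls
        den += w
    return num / den

train = [((1, 1), 1), ((1, 2), 1), ((6, 6), 0), ((7, 6), 0)]
score = weighted_knn(train, (2, 2), k=4)
print(score > 0.5)  # True: the nearer positive neighbors dominate
```

Because distant neighbors contribute little, using a large k no longer blurs local structure the way an unweighted vote does.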
Locally Weighted Averaging
• Let k = number of training points
• Let the weight fall off rapidly with distance:

  wi = e^( −D(ci, ctest) / KernelWidth )
  prediction = ( Σi=1..k wi · classi ) / ( Σi=1..k wi )

• KernelWidth controls the size of the neighborhood that has a large effect on the value (analogous to k)
Similarity Measure: Euclidean Distance

  D(c1, c2) = sqrt( Σi=1..N (attri(c1) − attri(c2))^2 )

• Gives all attributes equal weight?
  – only if the scale of the attributes and their differences are similar
  – scale attributes to equal range or equal variance
• Assumes spherical classes
[Figure: + and o examples forming roughly spherical classes in a two-attribute space]
Euclidean Distance?
• Attributes on a larger range affect distance more than attributes on small range
• Some attributes are more/less important than other attributes
• Some attributes may have more/less noise
[Figure: the same + and o examples plotted under two different attribute scalings; the apparent class shapes change with the scaling]
Weighted Euclidean Distance
• Large weights => attribute is more important
• Small weights => attribute is less important
• Zero weights => attribute doesn't matter
• Weights allow kNN to be effective with elliptical classes
  – use the weights to normalize for attribute range:

  D(c1, c2) = sqrt( Σi=1..N ( wi · (ai(c1) − ai(c2)) )^2 ),  with  wi = 1 / (maxi − mini)
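A sketch of range-based weighting, reading the distance as D = sqrt(Σ (wi · Δi)^2) with wi = 1/(maxi − mini). The exact placement of the square on wi is ambiguous in the slide's notation; this reading makes a full-range difference on any attribute contribute exactly 1:

```python
def range_weights(examples):
    """w_i = 1 / (max_i - min_i) per attribute."""
    n = len(examples[0])
    weights = []
    for i in range(n):
        vals = [x[i] for x in examples]
        spread = max(vals) - min(vals)
        # Zero weight: a constant attribute doesn't matter.
        weights.append(1.0 / spread if spread else 0.0)
    return weights

def weighted_euclidean(w, a, b):
    """D(c1, c2) = sqrt(sum_i (w_i * (a_i(c1) - a_i(c2)))^2)."""
    return sum((wi * (x - y)) ** 2 for wi, x, y in zip(w, a, b)) ** 0.5

# Attribute 0 spans 0..200 (cholesterol-like); attribute 1 spans 0..1 (Boolean-like).
pts = [(0.0, 0.0), (200.0, 1.0), (100.0, 0.0), (50.0, 1.0)]
w = range_weights(pts)
# Unweighted, attribute 0 would dominate the distance (~200 vs 1); after
# weighting, the two extreme points are sqrt(2) apart.
print(round(weighted_euclidean(w, (0.0, 0.0), (200.0, 1.0)), 3))  # 1.414
```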
Curse of Dimensionality
• As the number of dimensions increases, the distance between points becomes larger and more uniform
• If the number of relevant attributes is fixed, increasing the number of less-relevant attributes may swamp the distance
• When there are more irrelevant dimensions than relevant dimensions, distance becomes less reliable
• Solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions

  D(c1, c2) = sqrt( Σi=1..relevant (attri(c1) − attri(c2))^2 + Σj=1..irrelevant (attrj(c1) − attrj(c2))^2 )
Advantages of Memory-Based Methods
• Lazy learning: don't do any work until you know what you want to predict (and from what variables!)
  – never need to learn a global model
  – many simple local models taken together can represent a more complex global model
  – better-focused learning
  – handles missing values, time-varying distributions, ...
• Very efficient cross-validation
• An intelligible learning method to many users
• Nearest neighbors support explanation and training
• Can use any distance metric: string-edit distance, …
• Easy to implement an incremental learning version
Disadvantages of Memory-Based Methods
• Curse of dimensionality:
  – often works best with 25 or fewer dimensions
• Run-time cost scales with training-set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn't seem to perform as well as other methods such as neural nets
• Predicted values for regression are not continuous
A Learning Problem
Assume a two-dimensional space with positive and negative examples. Find a rectangle that includes the positive examples but not the negatives (the input space is R^2):

[Figure: + examples clustered inside the true concept rectangle, – examples outside]
Definitions
Distribution D: assume instances are generated at random from a distribution D.

Class of concepts C: let C be a class of concepts that we wish to learn. In our example, C is the family of all rectangles in R^2.

Class of hypotheses H: the hypotheses our algorithm considers while learning the target concept.

True error of a hypothesis h: errorD(h) = Pr_{x~D}[ c(x) ≠ h(x) ]
True Error
[Figure: the true concept c and a hypothesis h as overlapping rectangles; region A lies inside c but outside h, region B inside h but outside c]

The true error is the probability of regions A and B.
  Region A: false negatives
  Region B: false positives
Learning Algorithm Desiderata
The learning algorithm:
• uses a small number of examples
• is computationally efficient
• outputs a hypothesis subject to:
  1. The hypothesis does not need to be correct on every sample. The probability of failure is bounded by a constant δ.
  2. We don't require a hypothesis with zero error. There may be some error, as long as it is small (bounded by a constant ε).
A probably approximately correct (PAC) hypothesis.
PAC Learning
A concept class C is PAC-learnable if:
• there is a learning algorithm L
• for all target concepts c in C
• for all δ > 0
• for all ε > 0
• and for all distributions D
such that L, given ε, δ, and a source of examples, produces with probability at least 1−δ a hypothesis h with true error less than ε, in time polynomial in 1/ε, 1/δ, and the size of C.
Example
[Figure: the true concept c and the most specific hypothesis h, the smallest rectangle covering the + examples]

The learning algorithm: output the smallest rectangle that covers the positive examples.

Is this class of problems (rectangles in R^2) PAC-learnable by this learning algorithm?
Example
[Figure: the error region, the area between the most specific hypothesis h and the true target rectangle c]

The error is the probability of the area between h and the true target rectangle c.

How many examples do we need to make this error less than ε?
Example: Analysis
In general, the probability that m independent examples have NOT fallen within the error region is (1 − ε)^m, which we want to be less than δ:

  (1 − ε)^m ≤ δ

Since (1 − x) ≤ e^(−x), we have

  e^(−εm) ≤ δ, or
  m ≥ (1/ε) ln(1/δ)

The resulting sample size grows linearly in 1/ε and logarithmically in 1/δ.
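The bound can be evaluated directly. Note this is the slide's simplified one-region bound; the full rectangle argument, which splits the error region into four strips, gives the slightly larger m ≥ (4/ε) ln(4/δ):

```python
import math

def pac_sample_bound(epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * ln(1/delta)."""
    return math.ceil((1.0 / epsilon) * math.log(1.0 / delta))

# Error below 0.1 with probability at least 0.95:
print(pac_sample_bound(0.1, 0.05))   # 30
# Halving delta costs only ~(1/epsilon) * ln 2 more examples,
# illustrating the logarithmic growth in 1/delta:
print(pac_sample_bound(0.1, 0.025))  # 37
```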
Computational Learning Theory
• Provides a theoretical analysis of learning
• Shows when a learning algorithm can be expected to succeed
• Shows when learning may be impossible

Results due to theoretical analysis:
1. Sample complexity: how many examples do we need to find a good hypothesis?
2. Computational complexity: how much computational power do we need to find a good hypothesis?
3. Mistake bound: how many mistakes will we make before finding a good hypothesis?