CS 478 - Machine Learning Instance Based Learning (Adapted from various sources)
CS 478 – Tools for Machine Learning and Data Mining
description
Transcript of CS 478 – Tools for Machine Learning and Data Mining
![Page 1: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/1.jpg)
CS 478 – Tools for Machine Learning and Data Mining
The Need for and Role of Bias
![Page 2: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/2.jpg)
Learning
• Rote:– Until you discover the rule/concept(s), the very
BEST you can ever expect to do is:• Remember what you observed• Guess on everything else
• Inductive:– What you do when you GENERALIZE from your
observations and make (accurate) predictions
Claim: “All [most of] the laws of nature were discovered by inductive reasoning”
![Page 3: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/3.jpg)
The Big Question
• All you have is what you have OBSERVED• Your generalization should at least be
consistent with those observations• But beyond that…
– How do you know that your generalization is any good?
– How do you choose among various candidate generalizations?
![Page 4: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/4.jpg)
The Answer Is
BIASAny basis for choosing one decision over
another, other than strict consistency with past observations
![Page 5: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/5.jpg)
Why Bias?
• If you have no bias you cannot go beyond mere memorization– Mitchell’s proof using UGL and VS
• The power of a generalization system follows directly from its biases
• Progress towards understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases
![Page 6: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/6.jpg)
Concept Learning Given:
A language of observations/instances A language of concepts/generalizations A matching predicate A set of observations
Find generalizations that:1. Are consistent with the observations, and2. Classify instances beyond those observed
![Page 7: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/7.jpg)
Claim
The absence of bias makes it impossible to solve part 2 of the Concept Learning problem, i.e.,
learning is limited to rote learning
![Page 8: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/8.jpg)
Unbiased Generalization Language
• Generalization set of instances it matches• An Unbiased Generalization Language (UGL),
relative to a given language of instances, allows describing every possible subset of instances
• UGL = power set of the given instance language
![Page 9: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/9.jpg)
Unbiased Generalization Procedure
• Uses Unbiased Generalization Language • Computes Version Space (VS) relative to UGL
• VS set of all expressible generalizations consistent with the training instances
![Page 10: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/10.jpg)
Version Space (I)
• Let:– S be the set of maximally specific generalizations
consistent with the training data – G be the set of maximally general generalizations
consistent with the training data
![Page 11: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/11.jpg)
Version Space (II)
• Intuitively– S keeps generalizing to accommodate new positive
instances– G keeps specializing to avoid new negative instances
• The key issue is that they only do that to the smallest extent necessary to maintain consistency with the training data, that is, G remains as general as possible and S remains as specific as possible.
![Page 12: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/12.jpg)
Version Space (III)
• The sets S and G precisely delimit the version space (i.e., the set of all plausible versions of the emerging concept).
• A generalization g is in the version space represented by S and G if and only if:– g is more specific than or equal to some member
of G, and– g is more general than or equal to some member
of S
![Page 13: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/13.jpg)
Version Space (IV)• Initialize G to the most general concept in the space• Initialize S to the first positive training instance• For each new positive training instance p
– Delete all members of G that do not cover p– For each s in S
• If s does not cover p– Replace s with its most specific generalizations that cover p
– Remove from S any element more general than some other element in S– Remove from S any element not more specific than some element in G
• For each new negative training instance n– Delete all members of S that cover n– For each g in G
• If g covers n– Replace g with its most general specializations that do not cover n
– Remove from G any element more specific than some other element in G– Remove from G any element more specific than some element in S
• If G=S and both are singletons– A single concept consistent with the training data has been found
• If G and S become empty– There is no concept consistent with the training data
![Page 14: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/14.jpg)
Lemma 1
Any new instance, NI, is classified as positive if and only if NI is identical to some observed
positive instance
![Page 15: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/15.jpg)
Proof of Lemma 1
• (). If NI is identical to some observed positive instance, then NI is classified as positive– Follows directly from the definition of VS
• (). If NI is classified as positive, then NI is identical to some observed positive instance– Let g={p: p is an observed positive instance}
• UGL gVS• NI matches all of VS NI matches g
![Page 16: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/16.jpg)
Lemma 2
Any new instance, NI, is classified as negative if and only if NI is identical to some observed
negative instance
![Page 17: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/17.jpg)
Proof of Lemma 2
• (). If NI is identical to some observed negative instance, then NI is classified as negative– Follows directly from the definition of VS
• (). If NI is classified as negative, then NI is identical to some observed negative instance – Let G={all subsets containing observed negative instances}
• UGL GVS=UGL • NI matches none in VS NI was observed
![Page 18: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/18.jpg)
Lemma 3
If NI is any instance which was not observed, then NI matches exactly one half of VS, and so
cannot be classified
![Page 19: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/19.jpg)
Proof of Lemma 3
• (). If NI was not observed, then NI matches exactly one half of VS, and so cannot be classified – Let g={p: p is an observed positive instance}– Let G’={all subsets of unobserved instances}
• UGL VS={gg’: g’G’}• NI was not observed NI matches exactly ½ of G’ NI
matches exactly ½ of VS
![Page 20: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/20.jpg)
Theorem
An unbiased generalization procedure can never make the inductive leap necessary to classify
instances beyond those it has observed
![Page 21: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/21.jpg)
Proof of the Theorem
• The result follows immediately from Lemmas 1, 2 and 3
• Practical consequence: If a learning system is to be useful, it must have some form of bias
![Page 22: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/22.jpg)
Sources of Bias in Learning
• The representation language cannot express all possible classes of observations
• The generalization procedure is biased– Domain knowledge (e.g., double bonds rarely break)– Intended use (e.g., ICU – relative cost)– Shared assumptions (e.g., crown, bridge – dentistry)– Simplicity and generality (e.g., white men can’t jump)– Analogy (e.g., heat vs. water flow, thin ice)– Commonsense (e.g., social interactions, pain, etc.)
![Page 23: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/23.jpg)
Fall 2004 CS 478 - Machine Learning
23
Representation Language
Decrease number of expressible generalizations Increase ability to make the inductive leap
Example: Restrict generalizations to conjunctive constraints on features in a Boolean domain
![Page 24: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/24.jpg)
Fall 2004 CS 478 - Machine Learning
24
Proof of Concept (I)
Let N = number of features22N subsets of instances
Let GL = {0, 1, *}can only denote subsets of size 2p for 0pN
![Page 25: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/25.jpg)
CS 478 - Machine Learning
25Fall 2004
Proof of Concept (II)
For each p, there are only 2N-p expressible subsets Fix N-p features (there are ways of choosing
which)Set values for the selected features (there are 2N-p
possible settings)
pNNC
pNNC
![Page 26: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/26.jpg)
CS 478 - Machine Learning
26Fall 2004
Proof of Concept (III)
Excluding the empty set, the ratio of expressible to total generalizations is given by:
1N22
N0p
pNNC
pN2
![Page 27: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/27.jpg)
CS 478 - Machine Learning
27Fall 2004
Proof of Concept (IV)
For example, if N=5 then only about 1 in 107 subsets may be representedStrong bias
Two-edge sword: representation could be too sparse
![Page 28: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/28.jpg)
Generalization Procedure
• Domain knowledge (e.g., double bonds rarely break)
• Intended use (e.g., ICU – relative cost)• Shared assumptions (e.g., crown, bridge –
dentistry)• Simplicity and generality (e.g., white men can’t
jump)• Analogy (e.g., heat vs. water flow, thin ice)• Commonsense (e.g., social interactions, pain, etc.)
![Page 29: CS 478 – Tools for Machine Learning and Data Mining](https://reader036.fdocuments.us/reader036/viewer/2022081520/56816554550346895dd7d21c/html5/thumbnails/29.jpg)
Fall 2004 CS 478 - Machine Learning
29
Conclusion
Absence of bias = rote learningEfforts should focus on combined use of prior
knowledge and observations in guiding the learning process
Make biases and their use as explicit as observations and their use