
Transcript of CS 478 – Tools for Machine Learning and Data Mining

Page 1: CS 478 – Tools for Machine Learning and Data Mining

CS 478 – Tools for Machine Learning and Data Mining

The Need for and Role of Bias

Page 2: CS 478 – Tools for Machine Learning and Data Mining

Learning

• Rote:
  – Until you discover the rule/concept(s), the very BEST you can ever expect to do is:
    • Remember what you observed
    • Guess on everything else
• Inductive:
  – What you do when you GENERALIZE from your observations and make (accurate) predictions

Claim: “All [most of] the laws of nature were discovered by inductive reasoning”

Page 3: CS 478 – Tools for Machine Learning and Data Mining

The Big Question

• All you have is what you have OBSERVED
• Your generalization should at least be consistent with those observations
• But beyond that…

– How do you know that your generalization is any good?

– How do you choose among various candidate generalizations?

Page 4: CS 478 – Tools for Machine Learning and Data Mining

The Answer Is

BIAS: Any basis for choosing one decision over another, other than strict consistency with past observations

Page 5: CS 478 – Tools for Machine Learning and Data Mining

Why Bias?

• If you have no bias, you cannot go beyond mere memorization
  – Mitchell’s proof using UGL and VS

• The power of a generalization system follows directly from its biases

• Progress towards understanding learning mechanisms depends upon understanding the sources of, and justification for, various biases

Page 6: CS 478 – Tools for Machine Learning and Data Mining

Concept Learning

Given:
– A language of observations/instances
– A language of concepts/generalizations
– A matching predicate
– A set of observations

Find generalizations that:
1. Are consistent with the observations, and
2. Classify instances beyond those observed
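As a concrete sketch of these four components (a minimal, assumed illustration; the names `Instance`, `Generalization`, `matches`, and `consistent` are not from the slides), take Boolean feature vectors as the instance language and strings over {0, 1, *} as the concept language:

```python
from typing import Iterable

# Hypothetical names for the four components of the concept-learning setting.
Instance = tuple[int, ...]          # observation language: Boolean feature vectors
Generalization = tuple[str, ...]    # concept language: strings over {0, 1, *}

def matches(g: Generalization, x: Instance) -> bool:
    """Matching predicate: g covers x ('*' is a don't-care)."""
    return all(gv == '*' or gv == str(xv) for gv, xv in zip(g, x))

def consistent(g: Generalization,
               positives: Iterable[Instance],
               negatives: Iterable[Instance]) -> bool:
    """Requirement 1: g matches every positive observation and no negative one."""
    return (all(matches(g, p) for p in positives)
            and not any(matches(g, n) for n in negatives))
```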

Page 7: CS 478 – Tools for Machine Learning and Data Mining

Claim

The absence of bias makes it impossible to solve part 2 of the Concept Learning problem, i.e., learning is limited to rote learning

Page 8: CS 478 – Tools for Machine Learning and Data Mining

Unbiased Generalization Language

• Generalization ≡ the set of instances it matches
• An Unbiased Generalization Language (UGL), relative to a given language of instances, allows describing every possible subset of instances

• UGL = power set of the given instance language
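A toy illustration of how fast the UGL grows (an assumed example, not from the slides): with $N = 2$ Boolean features there are $2^N = 4$ instances and $2^{2^N} = 16$ subsets:

```python
from itertools import chain, combinations, product

N = 2
instances = list(product([0, 1], repeat=N))          # 2**N = 4 instances

def power_set(xs):
    """Every subset of xs: the Unbiased Generalization Language."""
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

UGL = [set(s) for s in power_set(instances)]
print(len(instances), len(UGL))                      # 4 16, i.e. 2**(2**N)
```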

Page 9: CS 478 – Tools for Machine Learning and Data Mining

Unbiased Generalization Procedure

• Uses the Unbiased Generalization Language
• Computes the Version Space (VS) relative to UGL
• VS ≡ the set of all expressible generalizations consistent with the training instances

Page 10: CS 478 – Tools for Machine Learning and Data Mining

Version Space (I)

• Let:
  – S be the set of maximally specific generalizations consistent with the training data
  – G be the set of maximally general generalizations consistent with the training data

Page 11: CS 478 – Tools for Machine Learning and Data Mining

Version Space (II)

• Intuitively:
  – S keeps generalizing to accommodate new positive instances
  – G keeps specializing to avoid new negative instances
• The key point is that S and G move only as far as necessary to maintain consistency with the training data: G remains as general as possible and S remains as specific as possible

Page 12: CS 478 – Tools for Machine Learning and Data Mining

Version Space (III)

• The sets S and G precisely delimit the version space (i.e., the set of all plausible versions of the emerging concept).

• A generalization g is in the version space represented by S and G if and only if:
  – g is more specific than or equal to some member of G, and
  – g is more general than or equal to some member of S
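Since a generalization denotes the set of instances it matches (Page 8), "more general than" is just the superset relation, so the membership test above can be written directly. A minimal sketch (the helper name `in_version_space` is hypothetical):

```python
def in_version_space(g: set, S: list[set], G: list[set]) -> bool:
    """g lies in the version space iff it sits between the two boundary sets."""
    return (any(g <= g2 for g2 in G)      # more specific than or equal to some member of G
            and any(g >= s for s in S))   # more general than or equal to some member of S
```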

Page 13: CS 478 – Tools for Machine Learning and Data Mining

Version Space (IV)

• Initialize G to the most general concept in the space
• Initialize S to the first positive training instance
• For each new positive training instance p:
  – Delete all members of G that do not cover p
  – For each s in S:
    • If s does not cover p, replace s with its most specific generalizations that cover p
  – Remove from S any element more general than some other element in S
  – Remove from S any element not more specific than some element in G
• For each new negative training instance n:
  – Delete all members of S that cover n
  – For each g in G:
    • If g covers n, replace g with its most general specializations that do not cover n
  – Remove from G any element more specific than some other element in G
  – Remove from G any element more specific than some element in S
• If G = S and both are singletons, a single concept consistent with the training data has been found
• If G and S become empty, there is no concept consistent with the training data
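Below is a compact Python sketch of this procedure for conjunctive hypotheses over Boolean features, in the {0, 1, *} language introduced on a later slide ('*' = don't care). It is an assumed illustration, not the course's own code; in this language S stays a singleton, so the S-pruning steps reduce to no-ops and are omitted:

```python
def covers(h, x):
    """h matches x if every fixed bit of h agrees with x."""
    return all(hv == '*' or hv == str(xv) for hv, xv in zip(h, x))

def more_general(g1, g2):
    """g1 covers every instance that g2 covers."""
    return all(a == '*' or a == b for a, b in zip(g1, g2))

def min_generalize(s, p):
    """Most specific generalization of s that covers positive instance p."""
    return tuple(sv if sv == str(pv) else '*' for sv, pv in zip(s, p))

def min_specializations(g, n):
    """Most general specializations of g that exclude negative instance n."""
    return [g[:i] + (str(1 - xv),) + g[i + 1:]
            for i, (gv, xv) in enumerate(zip(g, n)) if gv == '*']

def candidate_elimination(examples):
    first_pos = next(x for x, y in examples if y)
    S = [tuple(str(v) for v in first_pos)]     # S starts at the first positive
    G = [('*',) * len(first_pos)]              # G starts at the most general concept
    for x, y in examples:
        if y:                                  # positive instance p
            G = [g for g in G if covers(g, x)]
            S = [s if covers(s, x) else min_generalize(s, x) for s in S]
        else:                                  # negative instance n
            S = [s for s in S if not covers(s, x)]
            G = [h for g in G
                 for h in ([g] if not covers(g, x) else
                           [sp for sp in min_specializations(g, x)
                            if any(more_general(sp, s) for s in S)])]
            G = [g for g in G                  # keep only maximally general members
                 if not any(h != g and more_general(h, g) for h in G)]
    return S, G

# Learn "second feature is 1" from four labeled examples:
data = [((1, 1), True), ((0, 1), True), ((1, 0), False), ((0, 0), False)]
print(candidate_elimination(data))             # S and G both converge to ('*', '1')
```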

Page 14: CS 478 – Tools for Machine Learning and Data Mining

Lemma 1

Any new instance, NI, is classified as positive if and only if NI is identical to some observed positive instance

Page 15: CS 478 – Tools for Machine Learning and Data Mining

Proof of Lemma 1

• (⇐) If NI is identical to some observed positive instance, then NI is classified as positive
  – Follows directly from the definition of VS
• (⇒) If NI is classified as positive, then NI is identical to some observed positive instance
  – Let g = {p : p is an observed positive instance}
  – UGL ⇒ g ∈ VS
  – NI matches all of VS ⇒ NI matches g ⇒ NI is an observed positive instance

Page 16: CS 478 – Tools for Machine Learning and Data Mining

Lemma 2

Any new instance, NI, is classified as negative if and only if NI is identical to some observed negative instance

Page 17: CS 478 – Tools for Machine Learning and Data Mining

Proof of Lemma 2

• (). If NI is identical to some observed negative instance, then NI is classified as negative– Follows directly from the definition of VS

• (). If NI is classified as negative, then NI is identical to some observed negative instance – Let G={all subsets containing observed negative instances}

• UGL GVS=UGL • NI matches none in VS NI was observed

Page 18: CS 478 – Tools for Machine Learning and Data Mining

Lemma 3

If NI is any instance which was not observed, then NI matches exactly one half of VS, and so cannot be classified

Page 19: CS 478 – Tools for Machine Learning and Data Mining

Proof of Lemma 3

• (). If NI was not observed, then NI matches exactly one half of VS, and so cannot be classified – Let g={p: p is an observed positive instance}– Let G’={all subsets of unobserved instances}

• UGL VS={gg’: g’G’}• NI was not observed NI matches exactly ½ of G’ NI

matches exactly ½ of VS

Page 20: CS 478 – Tools for Machine Learning and Data Mining

Theorem

An unbiased generalization procedure can never make the inductive leap necessary to classify instances beyond those it has observed

Page 21: CS 478 – Tools for Machine Learning and Data Mining

Proof of the Theorem

• The result follows immediately from Lemmas 1, 2 and 3

• Practical consequence: If a learning system is to be useful, it must have some form of bias
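A small numeric check of the lemmas (an assumed example with N = 3 Boolean features, two observed positives, and one observed negative): enumerating the unbiased version space shows an unobserved instance matching exactly half of it:

```python
from itertools import chain, combinations, product

universe = list(product([0, 1], repeat=3))     # all 8 instances
positives = {(0, 0, 0), (1, 1, 1)}
negatives = {(0, 1, 0)}

def power_set(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# VS under the UGL: every subset containing all positives and no negatives.
VS = [g for g in map(set, power_set(universe))
      if positives <= g and not (negatives & g)]

unobserved = (1, 0, 1)
votes = sum(unobserved in g for g in VS)
print(len(VS), votes)                          # 32 16: exactly half, so no classification
```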

Page 22: CS 478 – Tools for Machine Learning and Data Mining

Sources of Bias in Learning

• The representation language cannot express all possible classes of observations

• The generalization procedure is biased
  – Domain knowledge (e.g., double bonds rarely break)
  – Intended use (e.g., ICU – relative cost)
  – Shared assumptions (e.g., crown, bridge – dentistry)
  – Simplicity and generality (e.g., “white men can’t jump”)
  – Analogy (e.g., heat vs. water flow, thin ice)
  – Commonsense (e.g., social interactions, pain, etc.)

Page 23: CS 478 – Tools for Machine Learning and Data Mining


Representation Language

Decrease the number of expressible generalizations ⇒ increase the ability to make the inductive leap

Example: Restrict generalizations to conjunctive constraints on features in a Boolean domain

Page 24: CS 478 – Tools for Machine Learning and Data Mining


Proof of Concept (I)

Let N = number of features ⇒ there are $2^{2^N}$ subsets of instances

Let GL = $\{0, 1, *\}^N$ ⇒ GL can only denote subsets of size $2^p$ for $0 \le p \le N$

Page 25: CS 478 – Tools for Machine Learning and Data Mining


Proof of Concept (II)

For each p, there are only $\binom{N}{N-p}\,2^{N-p}$ expressible subsets of size $2^p$:

– Fix $N-p$ features (there are $\binom{N}{N-p}$ ways of choosing which)
– Set values for the selected features (there are $2^{N-p}$ possible settings)

Page 26: CS 478 – Tools for Machine Learning and Data Mining


Proof of Concept (III)

Excluding the empty set, the ratio of expressible to total generalizations is given by:

$$\frac{\sum_{p=0}^{N} \binom{N}{N-p}\, 2^{N-p}}{2^{2^N} - 1} = \frac{3^N}{2^{2^N} - 1}$$

Page 27: CS 478 – Tools for Machine Learning and Data Mining


Proof of Concept (IV)

For example, if N = 5 then only about 1 in $10^7$ subsets may be represented ⇒ strong bias

Two-edged sword: the representation could also be too sparse
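A quick numeric check of the counting argument (hypothetical verification code, not from the slides): the restricted language expresses $\sum_p \binom{N}{N-p} 2^{N-p} = 3^N$ subsets:

```python
from math import comb

N = 5
expressible = sum(comb(N, N - p) * 2 ** (N - p) for p in range(N + 1))
total = 2 ** (2 ** N) - 1          # all non-empty subsets of the 2**N instances
assert expressible == 3 ** N       # 243 expressible generalizations
print(expressible / total)         # ~5.7e-08, i.e. about 1 in 10**7
```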

Page 28: CS 478 – Tools for Machine Learning and Data Mining

Generalization Procedure

• Domain knowledge (e.g., double bonds rarely break)

• Intended use (e.g., ICU – relative cost)
• Shared assumptions (e.g., crown, bridge – dentistry)
• Simplicity and generality (e.g., “white men can’t jump”)
• Analogy (e.g., heat vs. water flow, thin ice)
• Commonsense (e.g., social interactions, pain, etc.)

Page 29: CS 478 – Tools for Machine Learning and Data Mining


Conclusion

Absence of bias = rote learning

Efforts should focus on the combined use of prior knowledge and observations in guiding the learning process

Make biases and their use as explicit as observations and their use