B. Ross Cosc 4f79 1

Learning

• Machine learning is an area of AI concerned with the automatic learning of knowledge

• some ways that machine learning can be used in expert systems

1. increase efficiency of inference engine and knowledge base processing

2. testing the knowledge base

3. use learning principles to acquire knowledge itself

4. ???

• Most learning techniques exploit heuristics: problem-specific information which makes the search for a solution more efficient

• Without heuristics, typical learning problems either take too long to execute effectively, or produce results which are too large & general to be useful

B. Ross Cosc 4f79 2

Learning

1. Increase inference engine efficiency

• 20-80 principle: 20% of rules in KB account for 80% of diagnoses

• these 20% rules should take precedence in order to make execution faster

• otherwise, roughly half of the KB needs to be looked at for every diagnosis, which is a waste of time for most (80%) problems

• However, it is also possible that the set of rules most often used can vary according to who, where, and how the expert system is used

• One way to fix this: keep a record of the number of times a high-level rule was successful in making a diagnosis

eg. record(rule12, 102). record(rule6, 25). etc

• Save this information in a file, and reload it every session.

• Use these rule stats to determine the order in which diagnoses are to be executed
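
A minimal bookkeeping sketch of this idea in SWI-Prolog (assumed; only record/2 comes from the note above, the other predicate names are illustrative):

:- dynamic record/2.        % record(RuleName, SuccessCount)

% seed counts, as in the example above
record(rule12, 102).
record(rule6, 25).

% bump a rule's counter whenever it succeeds in making a diagnosis
note_success(Rule) :-
    ( retract(record(Rule, N)) -> N1 is N + 1 ; N1 = 1 ),
    assertz(record(Rule, N1)).

% list rules in descending order of past successes, to drive rule ordering
rules_by_frequency(Rules) :-
    findall(Count-Rule, record(Rule, Count), Pairs),
    keysort(Pairs, Ascending),
    reverse(Ascending, Descending),
    findall(Rule, member(_-Rule, Descending), Rules).

% write the statistics to a file so they can be reconsulted next session
save_stats(File) :-
    tell(File), listing(record/2), told.

Here rules_by_frequency(Rules) would give [rule12, rule6], so rule12's diagnosis is attempted first.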

B. Ross Cosc 4f79 3

Learning

1. ordering rules: p.118 Schnupp


B. Ross Cosc 4f79 4

Learning

• Another possibility is to ask some preliminary questions to determine general high-level information, and then order the high-level inference accordingly

p.112-113 Schnupp
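
A small illustrative sketch of the idea (this is not Schnupp's code; the question, rule groups, and predicate names are made up):

% hypothetical rule groups, indexed by a general problem area
rule_group(electrical, [rule3, rule7, rule12]).
rule_group(mechanical, [rule1, rule6, rule9]).

% ask one screening question, then order the high-level rules accordingly:
% the matching group first, every other rule afterwards
ordered_rules(Ordered) :-
    write('Is the problem mainly electrical or mechanical? '),
    read(Area),                       % answer is a Prolog term, eg. electrical.
    rule_group(Area, Preferred),
    findall(R, (rule_group(_, Rs), member(R, Rs)), All),
    subtract(All, Preferred, Rest),
    append(Preferred, Rest, Ordered).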

B. Ross Cosc 4f79 5

Learning

2. Knowledge acquisition

(i) Learning rules from examples

• user inputs typical examples (decision table) for a given rule domain; or, can process a database to automatically generate production rules

• system constructs a rule or set of rules (decision tree) for this example set

• can then generate production rules from this tree

• trivial to make comprehensive tree but more involved to make a minimal one

• inductive inference: learning technique which constructs a general rule from specific examples (compare with mathematical induction)

• popular algorithm: Quinlan's ID3, used in shells such as VP-expert, ExpertEase, RuleMaster, and others

B. Ross Cosc 4f79 6

Learning

• Decision table: a table of attribute values, with one or more conclusions - each row is an example, ie. one true instance of attribute values and their conclusion

•  Convenient for classification & identification problems

• Could create production rules directly from table: one rule per row

• induction: tries to generalize information in table, disregarding superfluous information, and yielding an efficient smaller decision tree

- results in "smarter" system

- good example of machine learning : computer tries to generalize and abstract from examples

•  Given a table, can look at conclusions, and see if particular attributes have any effect on them. If not, then disregard those attributes when deriving that conclusion.

• There exist one or more "minimal" sets of tests for a table; however, finding this minimal set can be intractable in general

B. Ross Cosc 4f79 7

ID3 definitions

• entropy: a measure of how well an attribute's values match the conclusions

– each attribute value matches exactly one conclusion --> low entropy (the attribute is highly informative about the conclusion; low uncertainty)

eg. a:1, b:2, c:3, d:4

– the same attribute value maps to differing conclusions --> high entropy (the attribute carries little information about the conclusion; high uncertainty)

eg. a:1, a:2, a:3, a:4

• Information or entropy: a mathematical measure of the information needed to answer a question, or of the certainty about an outcome, eg. to describe whether a coin will be heads or tails

– for a yes/no question, a value between 0 and 1, measured in bits

– 1 bit = enough information to answer a yes/no question of a fair, random event (because log2 (2) = 1)

– 4 events require log2(4) = 2 bits etc.

– Note: log2 (X) = ln(X) / ln(2)
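
In SWI-Prolog this conversion is one line (log/1 is the natural logarithm):

% log2(+X, -Y): base-2 logarithm via natural logs, as in the note above
log2(X, Y) :- Y is log(X) / log(2).

For example, ?- log2(4, B). gives B = 2.0.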

B. Ross Cosc 4f79 8

ID3 definitions

• Information content: average entropy of the different events, weighted by the probabilities of those events: -Pr(E) log2 Pr(E), summed over events

– Formula: IC = - Pr(E1) log2 Pr(E1) - Pr(E2) log2 Pr(E2) - ... - Pr(Ek) log2 Pr(Ek), when attribute A has events E1,...,Ek (see the sketch at the end of this slide)

– eg. fair coin: IC(0.5, 0.5) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit

– eg. weighted coin (heads 99% of the time): IC(0.01, 0.99) = -0.01 log2 0.01 - 0.99 log2 0.99 = 0.08 bits

– eg. always heads: IC(0, 1) = 0 bits

– Note: if an event has probability 0, don't use it in the equation (log 0 is undefined); instead, set that term to 0

• Information gain: difference between original information content and new information content, after attribute A selected
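
A small SWI-Prolog sketch of these definitions (the names ic/3, term_bits/2, and gain/3 are chosen here, not taken from the notes; the zero-probability convention above is built in):

% information content (entropy) of a two-outcome event, in bits:
% IC(P1, P2) = -P1 log2 P1 - P2 log2 P2
ic(P1, P2, IC) :-
    term_bits(P1, T1),
    term_bits(P2, T2),
    IC is T1 + T2.

% contribution of one outcome: -P log2 P, with 0 log2 0 taken as 0
term_bits(P, 0.0) :- P =:= 0, !.
term_bits(P, T)   :- T is -(P * log(P) / log(2)).

% information gain: information content before minus content after the test
gain(Before, After, Gain) :- Gain is Before - After.

For example, ?- ic(0.5, 0.5, IC). gives 1.0 (the fair coin), ic(0.01, 0.99, IC) gives about 0.08, and ic(0, 1, IC) gives 0.0, matching the examples above.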

B. Ross Cosc 4f79 9

ID3 Algorithm

• minimal trees are useful when the example table is completely adequate for the observations of interest, and user input is certain

– for uncertain input, additional tests are required for support

• ID3: an induction algorithm that tries to find a small tree efficiently

– not guaranteed to be minimal, but it is generally small

1. Given the example set C, find the attribute A that gives the highest information gain

--> this is the most discriminatory test for distinguishing the data

--> ideally, if we have X entropy before, then after testing A we have 0 entropy (ie. a perfect decision strategy!)

2. Add it as the next node in the tree

3. Partition the examples into subtables, one per value of A, and recurse (fills in the remainder of the tree)

B. Ross Cosc 4f79 10

ID3 algorithm

• consider p positive and n negative examples

– probability of p's: p+ = p / (p + n)

– probability of n's: p- = n / (p + n)

– compute the above from the example set

• information content for each attribute value:

– IC(Vi) = - (p+) log2 (p+) - (p-) log2 (p-)

– where Vi is a value of attribute A, and p, n are counted over the examples having A = Vi

• information content for the table after attribute A is used as the test:

– B(C, A) = sum over i of [ Pr(A = Vi) * IC(Ci) ]

– for attribute A with values Vi (i = 1,...,#values), and Ci the subset of examples corresponding to each Vi

• We wish to select the attribute which maximizes the information gain at that node in the tree. Repeat the following for every attribute in the example set:

– compute p+, p- for each value of the attribute

– compute IC for each (attribute, value) pair

– compute the overall B(C, A) for that attribute

• --> select the attribute A which maximizes information gain, ie. yields low information content after it is applied --> the lowest B(C, A) value (see the sketch below)
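
The sketch below turns this slide into runnable SWI-Prolog, reusing ic/3 from the information-content sketch. The representation is an assumption: each example is a term ex(AttributeValueList, Class) with Class either pos or neg, and all predicate names are illustrative.

% count positive and negative examples in a list of ex(AttrVals, Class) terms
count_classes([], 0, 0).
count_classes([ex(_, pos)|T], P, N) :- count_classes(T, P0, N), P is P0 + 1.
count_classes([ex(_, neg)|T], P, N) :- count_classes(T, P, N0), N is N0 + 1.

% information content of a subset, computed from its p/n counts via ic/3
ic_counts(0, 0, 0.0) :- !.
ic_counts(P, N, IC) :-
    T is P + N,
    ic(P / T, N / T, IC).

% b_value(+Examples, +Attr, -B): B(C, A) = sum over i of Pr(A = Vi) * IC(Ci)
b_value(Examples, Attr, B) :-
    length(Examples, Total),
    setof(V, attr_value(Examples, Attr, V), Values),
    findall(Contrib,
            ( member(V, Values),
              include(has_value(Attr, V), Examples, Ci),
              length(Ci, Size),
              count_classes(Ci, P, N),
              ic_counts(P, N, IC),
              Contrib is Size / Total * IC ),
            Contribs),
    sum_list(Contribs, B).

attr_value(Examples, Attr, V) :- member(ex(AVs, _), Examples), member(Attr=V, AVs).
has_value(Attr, V, ex(AVs, _)) :- member(Attr=V, AVs).

% best_attribute(+Examples, +Attrs, -Best): the attribute with the lowest B(C, A),
% ie. the highest information gain
best_attribute(Examples, Attrs, Best) :-
    findall(B-A, (member(A, Attrs), b_value(Examples, A, B)), Scored),
    keysort(Scored, [_-Best|_]).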

B. Ross Cosc 4f79 11

ID3: entropy extremes

attr. 1   attr. 2   value
a         c         x
b         c         y
a         c         x
a         c         x

IC(C) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = -(.75)(-.415) - (.25)(-2) = 0.811

• attr 1:

– IC(a) = -1 log2 1 - 0 log2 0 = 0

– IC(b) = -1 log2 1 - 0 log2 0 = 0

– B(C, attr1) = Pr(a)*IC(a) + Pr(b)*IC(b) = 0 + 0 = 0

– Gain: IC(C) - B(C, attr1) = 0.811 - 0 = 0.811

--> maximum gain! All values are precisely predicted using attribute 1.

• attr 2:

– IC(c) = -(3/4) log2 (3/4) - (1/4) log2 (1/4) = 0.811

– B(C, attr2) = Pr(c)*IC(c) = 1*0.811 = 0.811

– Gain: IC(C) - B(C, attr2) = 0.811 - 0.811 = 0

--> minimum gain; no information is gained at all from using attribute 2.

• Note: we can simply select the attribute yielding the minimum information content "B" for the table; computing the gain is redundant (it doesn't change the ranking).

B. Ross Cosc 4f79 12

ID3: example (from Durkin, pp.496-498)

• IC(C) = -(4/8) log2 (4/8) - (4/8) log2 (4/8) = 1 (initial info content of all examples)

• (a) test "wind":

– IC(North) = -3/5 log2 (3/5) - 2/5 log2 (2/5) = .971

– IC(South) = -1/3 log2 (1/3) - 2/3 log2 (2/3) = .918

– B(C, "wind") = 5/8 * 0.971 + 3/8 * 0.918 = 0.951

– gain = IC(C) - B(C, "wind") = 1 - 0.951 = .049

• (b) test "sky":

– "clear": all are negative (IC = 0)

– IC(cloudy) = -4/5 log2 (4/5) - 1/5 log2 (1/5) = .722

– B(C, "sky") = 3/8 * 0 + 5/8 * .722 = .451

– gain = IC(C) - B(C, "sky") = 1 - .451 = .549

• (c) test "barometer": gain is .156

• therefore "sky" gives the highest info gain, and is selected.

• The algorithm partitions the example set for each new subcategory, and recurses

• Note: we're simply finding the attribute that yields the smallest information content for the remaining table after it is applied
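
These intermediate values can be checked directly with the ic/3 sketch from the information-content slide (results shown as comments, rounded):

?- ic(3/5, 2/5, IC).                  % IC ≈ 0.971  ("wind" = North subset)
?- ic(1/3, 2/3, IC).                  % IC ≈ 0.918  ("wind" = South subset)
?- ic(4/5, 1/5, IC).                  % IC ≈ 0.722  ("sky" = cloudy subset)
?- B is 5/8 * 0.971 + 3/8 * 0.918.    % B ≈ 0.951   (B(C, "wind"))
?- B is 3/8 * 0 + 5/8 * 0.722.        % B ≈ 0.451   (B(C, "sky"))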

B. Ross Cosc 4f79 13

Example (cont)

• Must now apply ID3 to remaining examples that have differing result values --> new layers of decision tree

• (a) barometer:

– IC(rising) = - 1 log2 1 - 0 log2 0 = 0

– IC(steady) = - 1 log2 1 - 0 log2 0 = 0

– IC(falling) = - 1/2 log2 1/2 - 1/2 log2 1/2 = 1

– B(C, barometer) = (2/5)*0 + (1/5)*0 + (2/5)*1 = .4

• (b) wind:

– IC(south) = - 1/2 log2 1/2 - 1/2 log2 1/2 = 1

– IC(north) = - 1 log2 1 - 0 log2 0 = 0

– B(C, wind) = (2/5)*1 + (3/5)*0 = .4

• --> choose either

• note that you’ll need both attributes together to classify remaining table

B. Ross Cosc 4f79 14

Example

• If we choose ‘barometer’, then remaining table left is:

– 5 Cloudy, Falling, North +

– 7 Cloudy, Falling, South -

• Should be obvious that ‘wind’ is only possibility now.
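
For completeness, here is a sketch of the full recursion (steps 2 and 3 of the ID3 algorithm slide), reusing best_attribute/3, attr_value/3, has_value/3, and count_classes/3 from the attribute-selection sketch; the leaf/node tree representation is an assumption made here:

% id3(+Examples, +Attrs, -Tree)
% Tree is leaf(Class) or node(Attr, [Value-Subtree, ...])
id3(Examples, _, leaf(pos)) :- count_classes(Examples, _, 0), !.   % all positive
id3(Examples, _, leaf(neg)) :- count_classes(Examples, 0, _), !.   % all negative
id3(Examples, Attrs, node(A, Branches)) :-
    best_attribute(Examples, Attrs, A),          % step 1: most informative test
    select(A, Attrs, Remaining),
    setof(V, attr_value(Examples, A, V), Values),
    findall(V-Sub,                               % step 3: partition and recurse
            ( member(V, Values),
              include(has_value(A, V), Examples, Ci),
              id3(Ci, Remaining, Sub) ),
            Branches).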

B. Ross Cosc 4f79 15

Example: final tree

B. Ross Cosc 4f79 16

ID3

• ID3 generalizes to multivalued classifications (not just plus and minus): the information content expression extends to multiple categories (see the sketch at the end of this slide)...

– IC(a, b, c) = -pa log2 pa - pb log2 pb - pc log2 pc

• Note: the final tree can have "no data" leaves, meaning that the example set does not cover that combination of tests

– can presume that such a leaf is “impossible” wrt the examples

– otherwise, implies that example set is missing information

– --> must assume that examples are complete and correct; this is responsibility of knowledge engineer!
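
A sketch of that multivalued extension, reusing term_bits/2 from the earlier ic/3 sketch (the name ic_multi/2 is illustrative):

% multi-class information content over a list of class probabilities
ic_multi(Probs, IC) :-
    findall(T, (member(P, Probs), term_bits(P, T)), Terms),
    sum_list(Terms, IC).

For example, ?- ic_multi([0.5, 0.25, 0.25], IC). gives 1.5 bits for a three-class split.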

B. Ross Cosc 4f79 17

Learning

Inductive Inference: p.129-132

B. Ross Cosc 4f79 18

Learning

Inductive Inference

B. Ross Cosc 4f79 19

Learning

B. Ross Cosc 4f79 20

Learning

ID3 algorithm

p. 134-5

B. Ross Cosc 4f79 21

Learning

B. Ross Cosc 4f79 22

Learning

B. Ross Cosc 4f79 23

Learning

B. Ross Cosc 4f79 24

Possible ID3 problems

• clashing examples (identical attribute values, but different conclusions): need more attributes, or must correct the knowledge

• continuous values (floating point): must create ranges (see the sketch after this list),

– eg. 0 < x < 5

• noise: if compressing a database, noise can unduly influence decision tree

• trees can be too large: production rules are therefore large too

– break up table and create hierarchy of tables (structured induction)

• flat rules: the generated rules go straight from attributes to final conclusions, with no intermediate rules
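
A minimal sketch of the "create ranges" point above (the boundaries and range names are made up for illustration):

% map a continuous attribute value onto a discrete range symbol
range(X, low)    :- X >= 0,  X < 5.
range(X, medium) :- X >= 5,  X < 10.
range(X, high)   :- X >= 10.

The decision table then stores low/medium/high instead of the raw number, so ID3 can branch on a small set of values.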

B. Ross Cosc 4f79 25

ID3 enhancements

• If you have a large example set, can use a subset ("window") of examples (see the sketch at the end of this slide)

– then need to verify that the resulting decision tree is valid, in case the window wasn't representative or contained noise

– knowledge bases must be 100% correct!

• Data mining: finding trends in large databases

– ID3 is one tool used there

– don’t care about 100% correctness

– rather, a good random sample of database may yield enough useful information

– can also apply to entire database, in an attempt to categorize information into useful classes and trends

• C4.5 is the successor of ID3; it uses a more advanced selection heuristic.
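
A small sketch of drawing such a window (random_permutation/2 is from SWI-Prolog's library(random); window/3 is an illustrative name):

:- use_module(library(random)).

% window(+Examples, +K, -Window): a random subset of K examples to induce from
window(Examples, K, Window) :-
    random_permutation(Examples, Shuffled),
    length(Window, K),
    append(Window, _, Shuffled).

The tree induced from Window must then be checked against the full example set, as noted above.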

B. Ross Cosc 4f79 26

Learning: Genetic algorithms

• Another way to create small decision trees from examples: genetic programming

• 1. Create a population of random decision trees

• 2. Repeat until a suitably correct and small tree is found:

– a. Rate all trees based on: (i) size, (ii) how many examples they cover, (iii) how many examples they miss

– --> this gives each tree a fitness score (see the fitness sketch after this list)

– b. Create a new population:

– (i) mate trees using crossover: swap subtrees between parents

– (ii) mutate trees using mutation: random change to a tree

• This searches the space of decision trees for a correct, small tree

• Much slower than ID3, but possibly better results

• Remember: finding the smallest tree for an example set is NP-complete
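
A sketch of the rating step (2a) only, reusing the leaf/node tree representation and ex/2 example terms from the earlier ID3 sketches; the size-penalty weight 0.01 is an arbitrary illustration, and crossover/mutation are not shown:

% classify an example's attribute-value list with a decision tree
classify(leaf(Class), _, Class).
classify(node(A, Branches), AVs, Class) :-
    member(A=V, AVs),
    member(V-Sub, Branches),
    classify(Sub, AVs, Class).

% number of nodes in a tree: the "size" part of the fitness score
tree_size(leaf(_), 1).
tree_size(node(_, Branches), Size) :-
    findall(S, (member(_-Sub, Branches), tree_size(Sub, S)), Sizes),
    sum_list(Sizes, Sum),
    Size is Sum + 1.

% fitness: fraction of examples classified correctly, minus a small size penalty
fitness(Tree, Examples, Fitness) :-
    include(correctly_classified(Tree), Examples, Covered),
    length(Covered, NumCovered),
    length(Examples, NumExamples),
    tree_size(Tree, Size),
    Fitness is NumCovered / NumExamples - 0.01 * Size.

correctly_classified(Tree, ex(AVs, Class)) :- classify(Tree, AVs, Class).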

B. Ross Cosc 4f79 27

Learning comments

• Inductive inference is a convenient way of obtaining productions automatically from databases or user examples.

• Good for classification & diagnosis problems

• Assumes that domain is deterministic: that particular premises lead to only one conclusion, not multiple ones

• Need to have good data: no noise, no clashes

– assumes that the entire universe of interest is encapsulated in the example set or database

– clashes probably mean you need to identify more attributes

• ID3 doesn't guarantee a minimal tree, but its entropy measure is a heuristic that often generates a small one

• Note that a minimal tree (or rule set) is not necessarily desirable: when you discard attributes, you discard information which might be relevant, especially later when the system is being upgraded

• It is not desirable to use this technique on huge tables with many attributes. A better approach is to modularise the data hierarchically (structured induction).