CS 4100 Artificial Intelligence Prof. C. Hafner Class Notes March 27, 2012.

Bayes net another example
• What are the conditional independence assumptions embodied in this model?
[Diagram: a Bayes net over Ulcer, Infection, Stomach Ache, and Fever, with Ulcer the parent of Infection and Stomach Ache, and Infection the parent of Fever.]

Bayes net another example
• What are the conditional independence assumptions embodied in this model? (and how is it useful?)
  – Fever is conditionally independent of Ulcer and Stomach Ache, given Infection
  – Stomach Ache is conditionally independent of Infection and Fever, given Ulcer
[Diagram: the same Bayes net: Ulcer → Infection → Fever, Ulcer → Stomach Ache.]

Bayes net another example
• What are the conditional independence assumptions embodied in this model?
[Diagram: Ulcer → Infection → Fever, Ulcer → Stomach Ache.]

P(Ulcer | Fever) = α P(Fever | Ulcer) P(Ulcer)
                 = α [P(Fever, Inf, SA | Ulc) + P(Fever, ~Inf, SA | Ulc) +
                      P(Fever, Inf, ~SA | Ulc) + P(Fever, ~Inf, ~SA | Ulc)] P(Ulc)

Simplifications (using the conditional independence assumptions):
P(Fever, Inf, SA | Ulc) = P(Fever | Inf) P(Inf | Ulc) P(SA | Ulc)
P(Fever, ~Inf, SA | Ulc) = P(Fever | ~Inf) P(~Inf | Ulc) P(SA | Ulc)
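To make the enumeration concrete, here is a minimal Python sketch of exactly this computation. The network structure comes from the slide, but every probability value below is an invented "plausible number" for illustration only, not from the notes.

```python
# Inference by enumeration in the Ulcer -> Infection -> Fever, Ulcer -> Stomach Ache net.
# All CPT values are invented for illustration.
P_ulcer = 0.05                                   # P(Ulcer)
P_inf_given_ulc   = {True: 0.60, False: 0.10}    # P(Infection | Ulcer)
P_sa_given_ulc    = {True: 0.80, False: 0.20}    # P(Stomach Ache | Ulcer)
P_fever_given_inf = {True: 0.90, False: 0.05}    # P(Fever | Infection)

def p_fever_and_ulcer(ulcer: bool) -> float:
    """P(Fever, Ulcer=ulcer): sum out Infection and Stomach Ache, as in the four-term sum above."""
    p_u = P_ulcer if ulcer else 1.0 - P_ulcer
    total = 0.0
    for inf in (True, False):
        p_inf = P_inf_given_ulc[ulcer] if inf else 1.0 - P_inf_given_ulc[ulcer]
        for sa in (True, False):
            p_sa = P_sa_given_ulc[ulcer] if sa else 1.0 - P_sa_given_ulc[ulcer]
            total += P_fever_given_inf[inf] * p_inf * p_sa
    return p_u * total

joint = {u: p_fever_and_ulcer(u) for u in (True, False)}
alpha = 1.0 / sum(joint.values())                # normalize over Ulcer = true/false
print(alpha * joint[True])                       # P(Ulcer | Fever), about 0.18 with these numbers
```

Note that because Fever does not depend on Stomach Ache, the two SA terms in each pair sum to 1; the code keeps all four terms anyway so that it mirrors the slide's expansion line by line.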

Test your understanding: design a Bayes net with plausible numbers

Information Theory

• Information is about categories and classification
• We measure the quantity of information by the resources needed to represent/store/transmit the information
• Messages are sequences of 0's and 1's (dots/dashes), which we call "bits" (for binary digits)
• You need to send a message containing the identity of a spy
  – It is known to be Mr. Brown or Mr. Smith
• You can send the message with 1 bit, therefore the event "the spy is Smith" has 1 bit of information

Calculating quantity of information

• Def: A uniform distribution of a set of possible outcomes (X1 . . . Xn) means the outcomes are equally probable; that is, they each have probability 1/n.

• Suppose there are 8 people who can be the spy. Then the message requires 3 bits. If there are 64 possible spies the message requires 6 bits, etc. (assuming a uniform distribution)

• Def: The information quantity of a message where the (uniform) probability of each value is p:

I = -log p bits

Intuition and Examples
• Intuitively, the more "surprising" a message is, the more information it contains. If there are 64 equally probable spies, we are more surprised by the identity of the spy than if there are only two equally probable spies.
• There are 26 letters in the alphabet. Assuming they are equally probable, how much information is in each letter? I = -log (1/26) = log 26 = 4.7 bits
• Assuming the digits from 0 to 9 are equally probable, will the information in each digit be more or less than the information in each letter?
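These values follow directly from I = -log p (base-2 logs throughout); a short Python check:

```python
import math

def information_bits(p: float) -> float:
    """Information content, in bits, of an outcome with (uniform) probability p."""
    return -math.log2(p)

print(information_bits(1 / 2))    # spy is one of 2   -> 1.0 bit
print(information_bits(1 / 64))   # spy is one of 64  -> 6.0 bits
print(information_bits(1 / 26))   # one letter        -> about 4.70 bits
print(information_bits(1 / 10))   # one digit         -> about 3.32 bits, less than a letter
```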

Sequences of messages

• Things get interesting when we look beyond a single message to a long sequence of messages.
• Consider a 4-sided die with symbols A, B, C, D:
  – Let 00 = A, 01 = B, 10 = C, 11 = D
  – Each message is 2 bits. If you throw the die 800 times, you get a message 1600 bits long.
  – That is the best you can do if A, B, C, D are equally probable.

Non-uniform distributions (cont.)

• Consider a 4-sided die with symbols A, B, C, D:
  – But assume P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24 (the three rare symbols share the remaining 3/24 = 1/8)
  – We can take advantage of that with a different code: 0 = A, 10 = B, 110 = C, 111 = D
  – If we throw the die 800 times, what is the expected length of the message? What is the entropy?
• ENTROPY is the average information (in bits) of events in a long repeated sequence
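A short Python sketch answers both questions, using the probabilities as read above (P(B) = P(C) = P(D) = 1/24):

```python
import math

# Biased 4-sided die and the variable-length code from the slide: 0, 10, 110, 111.
probs   = {"A": 7 / 8, "B": 1 / 24, "C": 1 / 24, "D": 1 / 24}
lengths = {"A": 1, "B": 2, "C": 3, "D": 3}

expected_bits_per_throw = sum(probs[s] * lengths[s] for s in probs)
entropy_bits_per_throw = sum(-p * math.log2(p) for p in probs.values())

print(800 * expected_bits_per_throw)   # expected message length for 800 throws, about 967 bits
print(entropy_bits_per_throw)          # about 0.74 bits per throw, the theoretical lower bound
```

The variable-length code beats the fixed 2-bit encoding (1600 bits) by a wide margin, but it still cannot get below the entropy.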

Entropy
Formula for entropy with outcomes x1 . . . xn:

  H = - Σ P(xi) * log P(xi)  bits

For a uniform distribution this is the same as -log P(x1), since all the P(xi) are the same.

What does it mean? Consider a 6-sided die with equally probable outcomes:
-log 1/6 = 2.58 tells us that a long sequence of die throws can be transmitted using 2.58 bits per throw on average, and this is the theoretical best.

Review/Explain Entropy
• Let the possible outcomes be x1 . . . xn
  – With probabilities p1 . . . pn that add up to 1
• Ex: an unfair coin where n = 2, x1 = H (3/4), x2 = T (1/4)
• In a long sequence of events E = e1 . . . ek, we assume that outcome xi will occur k * pi times, etc.

  E = HHTHTHHHTTHHHHHTHHHHTTHTHTHHHH . . .

If k = 10000, we can assume H occurs 7500 times and T 2500 times.
Note the concept of TYPES vs. TOKENS: there are two types and 10000 tokens in this scenario.

Review/Explain Entropy

The entropy of E, H(E), is the average information of the events in the sequence e1 . . . ek:

  H(E) = 1/k * Σ_{j=1..k} I(ej)              [now switch to summation over outcomes]
       = 1/k * Σ_{i=1..n} I(xi) * (k * pi)
       = k/k * Σ_{i=1..n} I(xi) * pi
       = Σ_{i=1..n} -log(pi) * pi  bits
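A small Python sketch of this identity, using the unfair coin above (P(H) = 3/4, P(T) = 1/4): the average per-token information in a long random sequence converges to the per-type formula, about 0.811 bits.

```python
import math
import random

probs = {"H": 0.75, "T": 0.25}

# Entropy by the formula: a sum over TYPES of -p * log2(p).
entropy_formula = sum(-p * math.log2(p) for p in probs.values())

# Average information per TOKEN in a long random sequence of k events.
random.seed(0)
k = 100_000
sequence = random.choices(list(probs), weights=list(probs.values()), k=k)
average_info = sum(-math.log2(probs[e]) for e in sequence) / k

print(entropy_formula)   # 0.811... bits
print(average_info)      # close to 0.811 for large k
```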

Review/Explain Entropy
Entropy is sometimes called "disorder": it represents the lack of predictability as to the outcome for any element of a sequence (or set).

If a set has just one outcome, entropy = 1 * -log(1) = 0.

If there are 2 outcomes, then a 50/50 probability gives the maximum entropy, complete unpredictability. This generalizes to any uniform distribution over n outcomes.

  - (0.5 * log(0.5) + 0.5 * log(0.5)) = 1 bit

Note: log(1/2) = -log(2) = -1

Calculating Entropy
• Consider a biased coin: P(heads) = ¾; P(tails) = ¼
• What is the entropy of a coin toss outcome?
• H = ¼ * -log(1/4) + ¾ * -log(3/4) = 0.811 bits
• Using the Information Theory Log Table: H = 0.25 * 2.0 + 0.75 * 0.415 = 0.5 + 0.311 = 0.811
• A fair coin toss has more "information"
• The more unbalanced the probabilities, the more predictable the outcome, the less you learn from each message.

[Plot: entropy H (in bits) as a function of the probability of x1, for a set containing 2 possible outcomes (x1, x2). H rises from 0 at probability 0 to its maximum of 1 bit at probability ½ (maximum disorder) and falls back to 0 at probability 1.]

What if there are 3 possible outcomes? For the equal-probability case: H = -log(1/3) = about 1.58 bits.

Define classification tree and ID3 algorithm
• Def: Given a table with one result attribute and several designated predictor attributes, a classification tree for that table is a tree such that:
  – Each leaf node is labeled with a value of the result attribute
  – Each non-leaf node is labeled with the name of a predictor attribute
  – Each link is labeled with one value of the parent's predictor
• Def: The ID3 algorithm takes a table as input and "learns" a classification tree that efficiently maps predictor value sets into their results from the table.

A trivial example of a classification tree

  Record#   Color    Shape    Fruit
  1         red      round    apple
  2         yellow   round    lemon
  3         yellow   oblong   banana

[Tree: the root tests Color. Color = red leads to the leaf "apple"; Color = yellow leads to a node that tests Shape, where Shape = round leads to "lemon" and Shape = oblong leads to "banana".]

The goal is to create an "efficient" classification tree which always gives the same answer as the table.

A well-known "toy" example: sunburn data

  Name    Hair     Height    Weight    Lotion   Sunburned
  Sarah   Blonde   Average   Light     No       Yes
  Dana    Blonde   Tall      Average   Yes      No
  Alex    Brown    Short     Average   Yes      No
  Annie   Blonde   Short     Average   No       Yes
  Emily   Red      Average   Heavy     No       Yes
  Pete    Brown    Tall      Heavy     No       No
  John    Brown    Average   Heavy     No       No
  Katie   Blonde   Short     Light     Yes      No

Predictor attributes: hair, height, weight, lotion

[Final classification tree: the root tests Hair. Hair = Red leads to "Sunburned"; Hair = Brown leads to "Not Sunburned"; Hair = Blonde leads to a node that tests Lotion, where Lotion = N leads to "Sunburned" and Lotion = Y leads to "Not Sunburned".]

Outline of the algorithm

1. Create the root, and make its COLLECTION the entire table
2. Select any non-singular leaf node N to SPLIT
   1. Choose the best attribute A for splitting N (use info theory)
   2. For each value of A (a1, a2, . . .), create a child of N, Nai
   3. Label the links from N to its children: "A = ai"
   4. SPLIT the collection of N among its children according to their values of A
3. When no more non-singular leaf nodes exist, the tree is finished
4. Def: a singular node is one whose COLLECTION includes just one value for the result attribute (therefore its entropy = 0)

Choosing the best attribute to SPLIT: the one that is MOST INFORMATIVE, i.e., the one that reduces the entropy (DISORDER) the most.

Assume there are k attributes we can choose from. For each one, we compute how much less entropy exists in the resulting children than we had in the parent:

  Gain = H(N) - weighted sum of H(children of N)

Each child's entropy is weighted by the "probability" of that child (estimated by the proportion of the parent's collection that would be transferred to the child in the split).
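The following is a minimal Python sketch of this procedure: the entropy and information-gain computations plus the recursive split. The list-of-dicts table representation and the function names are my own choices for illustration, not notation from the lecture.

```python
import math
from collections import Counter

def entropy(rows, result_attr):
    """Disorder of a collection: -sum over result values of p * log2(p)."""
    counts = Counter(row[result_attr] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, result_attr):
    """H(parent) minus the weighted sum of the children's entropies after splitting on attr."""
    children = {}
    for row in rows:
        children.setdefault(row[attr], []).append(row)
    weighted = sum(len(child) / len(rows) * entropy(child, result_attr)
                   for child in children.values())
    return entropy(rows, result_attr) - weighted

def id3(rows, predictors, result_attr):
    """Return a nested-dict classification tree; a leaf is simply a result value."""
    if entropy(rows, result_attr) == 0:        # singular collection: stop splitting
        return rows[0][result_attr]
    if not predictors:                         # no attributes left: fall back to the majority value
        return Counter(r[result_attr] for r in rows).most_common(1)[0][0]
    best = max(predictors, key=lambda a: info_gain(rows, a, result_attr))
    remaining = [a for a in predictors if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining, result_attr)
                   for value in {r[best] for r in rows}}}
```

The recursion plays the role of "select any non-singular leaf node": each subtree keeps splitting until its collection is singular (or, in the fallback case, until no predictors remain).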

(Notation: each node is shown as its collection, its (sunburned, not sunburned) counts, and its entropy. Members are identified by initial: S = Sarah, D = Dana, X = Alex, A = Annie, E = Emily, P = Pete, J = John, K = Katie.)

S1 (root): C(S1) = {S,D,X,A,E,P,J,K} (3,5) / ____

Calculate entropy: -[3/8 log 3/8 + 5/8 log 5/8] = .53 + .424 = .954

Find the information gain (IG) for all 4 predictors: hair, height, weight, lotion

Start with Lotion: values (yes, no)
  Child 1 (yes) = {D,X,K} (0,3) / 0
  Child 2 (no)  = {S,A,E,P,J} (3,2) / -[3/5 log 3/5 + 2/5 log 2/5] = .971
  Child set entropy = 3/8 * 0 + 5/8 * .971 = .607
  IG(Lotion) = .954 - .607 = .347

Then try Hair color: values (blonde, brown, red)
  Child 1 (blonde) = {S,D,A,K} (2,2) / 1
  Child 2 (brown)  = {X,P,J} (0,3) / 0
  Child 3 (red)    = {E} (1,0) / 0
  Child set entropy = 4/8 * 1 + 3/8 * 0 + 1/8 * 0 = .5
  IG(Hair color) = .954 - .5 = .454

Next try Height: values (average, tall, short)
  Child 1 (average) = {S,E,J} (2,1) / -[2/3 log 2/3 + 1/3 log 1/3] = .92
  Child 2 (tall)    = {D,P} (0,2) / 0
  Child 3 (short)   = {X,A,K} (1,2) / .92
  Child set entropy = 3/8 * .92 + 2/8 * 0 + 3/8 * .92 = .69
  IG(Height) = .954 - .69 = .26

Next try Weight . . . IG(Weight) = .954 - .94 = .014

So Hair color wins: draw the first split and assign the collections.

[Diagram: root node N1 tests Hair Color. The Blonde branch leads to a new node S2 with collection C = {S,D,A,K} (2,2) / 1; the Red branch collection is {E} (1,0) and the Brown branch collection is {X,P,J} (0,3), both singular.]

Now split node S2: C(S2) = {S,D,A,K} (2,2) / 1

Start with Lotion: values (yes, no)
  Child 1 (yes) = {D,K} (0,2) / 0
  Child 2 (no)  = {S,A} (2,0) / 0
  Child set entropy = 0
  IG(Lotion) = 1 - 0 = 1      No reason to go any farther.

[Diagram: the root S1 tests Hair Color; its Blonde branch (C = {S,D,A,K} (2,2) / 1) leads to node S2, which now tests Lotion. The Lotion = yes branch holds the singular collection {D,K} (0,2) and the Lotion = no branch holds {S,A} (2,0); the Red and Brown branches of the root were already singular.]
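As a usage example of the entropy/info_gain/id3 sketch given after the algorithm outline (assuming those hypothetical helpers are defined in the same file), the sunburn table can be encoded as a list of dicts and the hand calculation checked:

```python
# Sunburn table from the slides, encoded for the (hypothetical) helpers sketched earlier.
rows = [
    {"name": "Sarah", "hair": "blonde", "height": "average", "weight": "light",   "lotion": "no",  "sunburned": "yes"},
    {"name": "Dana",  "hair": "blonde", "height": "tall",    "weight": "average", "lotion": "yes", "sunburned": "no"},
    {"name": "Alex",  "hair": "brown",  "height": "short",   "weight": "average", "lotion": "yes", "sunburned": "no"},
    {"name": "Annie", "hair": "blonde", "height": "short",   "weight": "average", "lotion": "no",  "sunburned": "yes"},
    {"name": "Emily", "hair": "red",    "height": "average", "weight": "heavy",   "lotion": "no",  "sunburned": "yes"},
    {"name": "Pete",  "hair": "brown",  "height": "tall",    "weight": "heavy",   "lotion": "no",  "sunburned": "no"},
    {"name": "John",  "hair": "brown",  "height": "average", "weight": "heavy",   "lotion": "no",  "sunburned": "no"},
    {"name": "Katie", "hair": "blonde", "height": "short",   "weight": "light",   "lotion": "yes", "sunburned": "no"},
]
predictors = ["hair", "height", "weight", "lotion"]

for attr in predictors:
    print(attr, round(info_gain(rows, attr, "sunburned"), 3))
# hair 0.454, lotion 0.348, height 0.266, weight 0.016 (matches the hand calculation, up to rounding)

print(id3(rows, predictors, "sunburned"))
# e.g. {'hair': {'red': 'yes', 'brown': 'no', 'blonde': {'lotion': {'yes': 'no', 'no': 'yes'}}}}
# (key order may vary; the structure matches the final tree above)
```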

Discuss assignment 5

Perceptrons and Neural Networks: Another Supervised Learning Approach

Perceptron Learning (Supervised)

• Assign random weights (or set all to 0)
• Cycle through the input data until the change < target
• Let α be the "learning coefficient"
• For each input:
  – If the perceptron gives the correct answer, do nothing
  – If the perceptron says yes when the answer should be no, decrease the weights on all units that "fired" by α
  – If the perceptron says no when the answer should be yes, increase the weights on all units that "fired" by α
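A minimal Python sketch of this update rule for a two-input, threshold-unit perceptron. The training set (the AND function), the threshold value, and the epoch cap are illustrative assumptions; the notes themselves do not fix them.

```python
# Perceptron learning rule as described above: weights change only on mistakes,
# and only the weights of input units that "fired" (were 1) are adjusted by alpha.

alpha = 0.1          # learning coefficient (assumed value)
threshold = 0.5      # firing threshold (assumed; not specified in the notes)

# Toy training set (assumed): learn the AND of two binary inputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = [0.0, 0.0]

def predict(x):
    """Perceptron output: 1 ('yes') if the weighted sum of inputs reaches the threshold."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0

for epoch in range(100):                    # cycle through the data until nothing changes
    changed = False
    for x, target in data:
        output = predict(x)
        if output == target:
            continue                        # correct answer: do nothing
        delta = alpha if target == 1 else -alpha
        for i, xi in enumerate(x):
            if xi == 1:                     # only the units that fired are adjusted
                weights[i] += delta
                changed = True
    if not changed:
        break

print(weights, [predict(x) for x, _ in data])   # converges to weights near [0.3, 0.3]
```

With these assumptions the loop converges after a few passes through the data.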
