Page 1

Last lecture summary

Page 2

Cluster analysis

• unsupervised
• hierarchical clustering
  – agglomerative
  – divisive
  – dendrogram
• partitional
  – k-means

[Figure: dendrogram with leaves MI, TO, BA, NA, RM, FL and a dissimilarity axis (138, 219, 255, 268, 295)]

Page 3

• different methods to get the distance between objects within two clusters
  – single linkage
  – complete linkage
  – average linkage
  – centroid linkage
  – Ward's method
• k-means
  – divides data points into some prespecified number of clusters without the hierarchical structure

Page 4

Information Theory (new stuff)

Page 5

• mathematical theory of the measurement of information
  – does not deal with the semantic (i.e. meaning-based) definition of information
  – it quantifies information and measures its transmission
  – information is coupled with a sender (source), a receiver and a channel (means of communication)

[Diagram: Source (Sender) → Channel → Receiver]

Page 6

• information is inherently linked with uncertainty and surprise
  – Imagine you attend a biology symposium.
  – The speaker tells you he/she has a PhD in biology.
  – Does it surprise you?
  – The speaker tells you he/she plays a djembe drum.
  – Does it surprise you?
  – In the first case the information content was low; we were rather certain about the speaker's degree.
  – In the second case we were given a large amount of information, as we were uncertain about the speaker's leisure time.

Page 7

• Thus we can define information as a degree of uncertainty.
  – Before we roll a die, our uncertainty about the result is 6 (6 possible equally likely outcomes).
  – After rolling the die our uncertainty is zero.
  – Therefore we have gained an amount of information of 6.
• Pitfalls of this simplistic approach:
  – Two dice – after rolling both of them, have we gained 6 × 6 = 36 units of information?
  – However, intuitively, the information after rolling two dice is just 2 times more than the information after rolling one die. Not 6 times!
• In other words, information feels additive.

Page 8

• Which function would you use to capture information's additivity?
  – i.e. if I have 6 × 6 possible outcomes, I want to get only 6 + 6 = 12 units of information.
  – the logarithm
• This is precisely the definition of information by Hartley (1928):

  I(N) = log(N),   N … number of possible results

• This definition is useful for equiprobable results, but will fail for biased dice.
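To see the additivity in numbers, here is a minimal Python sketch of Hartley's definition (using base 2); the function name hartley_information is just an illustrative choice.

    from math import log2

    def hartley_information(n_outcomes):
        """Hartley (1928): I(N) = log(N); measured here in bits (log base 2)."""
        return log2(n_outcomes)

    # One fair die: 6 equally likely outcomes.
    print(hartley_information(6))                            # ~2.585
    # Two fair dice: 36 outcomes, but the information simply adds up.
    print(hartley_information(36))                           # ~5.170
    print(hartley_information(6) + hartley_information(6))   # same value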

Page 9

• If the number 6 turns up 50% of the time, the information a 6 provides is less than that provided by a 3.
• Incorporate the probability into the definition of information (Shannon, 1948):

  I(a_i) = log( 1 / p(a_i) ) = -log( p(a_i) ) = -log(p_i)

  a_i … possible results (values of a random variable)
  p(a_i), p_i … probability of obtaining the value a_i
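A small Python sketch of the self-information I(a_i) = -log2(p_i). The biased-die probabilities used below (a 6 with p = 0.5, the remaining faces sharing the rest equally) are assumed for illustration; they are not given on the slide.

    from math import log2

    def self_information(p):
        """Shannon self-information of an outcome with probability p, in bits."""
        return -log2(p)

    # Fair die: every face carries the same information.
    print(self_information(1 / 6))   # ~2.585 bits

    # Biased die: 6 turns up 50% of the time, the other five faces share the rest.
    print(self_information(0.5))     # 1.0 bit    -> a 6 is less informative
    print(self_information(0.1))     # ~3.322 bits -> e.g. a 3 is more informative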

Page 10

Information units

• The unit to measure the information should be as simple as possible.

• Simplest experiment possible – just one outcome. Too simple, it yields no information!

• Two possible results: I(a1) = I(a2) = log(2).
• If we use base 2 for the logarithm, we obtain log2(2) = 1.
• We say that this amount of information is one bit.
  – A bit is the amount of information conveyed by an experiment with two equally probable outcomes.

Page 11

• Other units are based on other logarithm bases:
  – nat – information conveyed by an experiment with e (≈ 2.718281828459045) equiprobable outcomes … ln()
  – dit, digit, hartley … log10()

• Shannon was working in the field of communication (Bell Labs).

• He was not thinking about experiments like rolling dice.

• The outcomes of his experiments were the possible symbols emitted by a source, and he was interested in analyzing the average information emitted by a source.

Page 12

• More formally, a memoryless source emits messages using a given alphabet S = [a1, …, an] with emission probabilities P = [p1, …, pn].
• Shannon defined the entropy of the source X as

  H(X) = -Σ_{i=1..N} p_i log(p_i) = Σ_{i=1..N} p_i I(a_i)

[Diagram: Source (S = [a, b, c, d, …, z], P = [5, 8, 4, 9, …]) → Channel → Receiver]
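A minimal Python sketch of the entropy defined above; the probability vectors in the example are made up for illustration.

    from math import log2

    def entropy(probs):
        """H(X) = -sum(p_i * log2(p_i)), in bits; terms with p_i = 0 contribute nothing."""
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))                  # 1.0 bit (fair coin)
    print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0 bits (four equal outcomes)
    print(entropy([1.0]))                       # 0.0 - no uncertainty, no information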

Page 13

What is entropy?

• H(X) is the weighted mean (expectation) of -log(p_i).
• H(X) is the average (i.e., the expectation of the) information of the source.
• H(X) is the measure of the uncertainty of the source.
  – if it is zero, there is no uncertainty (i.e. no information is transmitted)

  H(X) = -Σ_{i=1..N} p_i log(p_i) = Σ_{i=1..N} p_i I(a_i)

Page 14

Entropy of a genome

• The genome consists of the symbols A, C, T, G.
• Their probabilities for the bacterium E. coli:
  – 24.6% A, 25.4% C, 24.6% T, 25.4% G
• I(A) = I(T) = 2.023 bits, I(G) = I(C) = 1.977 bits
• H(Eco) = 0.246*2.023 + 0.254*1.977 + 0.246*2.023 + 0.254*1.977 = 1.999 bits
  – Close to 2 bits, as expected. 2 bits is the maximum information we can get from an experiment with 4 outcomes.
• The entropy is the average information per symbol!
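A short sketch reproducing the E. coli numbers above (base frequencies taken from the slide; variable names are illustrative).

    from math import log2

    # E. coli base frequencies from the slide (A, C, T, G).
    p = {"A": 0.246, "C": 0.254, "T": 0.246, "G": 0.254}

    info = {base: -log2(prob) for base, prob in p.items()}    # I(A), I(C), ...
    h_eco = sum(prob * info[base] for base, prob in p.items())

    print(info)     # I(A) = I(T) ~ 2.023 bits, I(C) = I(G) ~ 1.977 bits
    print(h_eco)    # ~1.9996 bits - the average information per symbol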

Page 15

• Entropy of another organism, the bacterium T. aquaticus:
  – 15.3% A, 15.3% T, 34.6% C, 34.6% G
  – H(Taq) = 1.876 bits … lower, as expected from the skewed %GC content.
  – The decrease (0.123 bits) may not seem huge. But if we consider the size of a typical bacterial genome (4 Mbp), such a decrease gains relevance.
  – The E. coli genome is able to encode 492 000 bits more than that of T. aquaticus.
• Which makes you wonder: if the ancestral microorganism was living in conditions similar to those of T. aquaticus, wouldn't it have chosen another set of bases that gave it maximal information encoding capacity in these conditions?

Page 16

Noise and conditional entropy

• Information theory is concerned mainly with how information is transmitted from a source to a receiver by means of a channel.
  – Roll a die and observe the result – the channel is almost noise-free.
• You throw a coin from a balcony on the 25th floor. Down there is your friend, who shouts (no mobiles allowed!) the result back to you.
• Now the channel is not so nice anymore. We have noise.

Page 17

• Assume that we mishear "head" for "tail" in one of every 100 (1%) coin tosses.
• If we are using a fair coin, the source entropy is H(X) = -2 * 0.5*log2(0.5) = 1 bit.
• We can factor in the noise and compute the entropy after the coin has been tossed and we have heard the shouted result – the conditional entropy:

  H(X|Y) = -0.99*log2(0.99) - 0.01*log2(0.01) = 0.081 bits
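The number above can be reproduced with a few lines of Python (a sketch, using the 1% mishearing rate from the slide):

    from math import log2

    # Probability of hearing the result correctly vs. mishearing it.
    p_correct, p_error = 0.99, 0.01

    # Uncertainty about the true toss X that remains after hearing Y.
    h_x_given_y = -p_correct * log2(p_correct) - p_error * log2(p_error)
    print(h_x_given_y)   # ~0.081 bits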

Page 18

• Conditional entropy expresses our (the receiver's) uncertainty about the result after the experiment has been carried out.
• X is the result of the experiment and Y is what we hear (or mishear) as being the result.
  – We toss the coin; the outcome is X.
  – The friend shouts; we hear Y.
  – Knowing Y (the result of the experiment as we perceive it), H(X|Y) expresses our remaining uncertainty over X.

Page 19

Mutual information

• H(X|Y) represents our uncertainty over X once we know Y.

• It is intimately linked to the channel over which the original message X travels in order to reach us as Y.
• We are also implying that H(X|Y) is a measure of information loss (an increase in uncertainty due to transmission through the channel).

• Mutual information: I(X,Y) = H(X) - H(X|Y)
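Continuing the noisy-coin example from the previous slides, a tiny sketch of this definition:

    from math import log2

    h_x = 1.0                                              # fair coin: H(X) = 1 bit
    h_x_given_y = -0.99 * log2(0.99) - 0.01 * log2(0.01)   # ~0.081 bits (noisy channel)

    mutual_information = h_x - h_x_given_y
    print(mutual_information)   # ~0.919 bits actually make it through the channel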

Page 20

• So far we have been talking about source, receiver and channel.

• However, you can consider a random variable X and ask how much information is received when the specific value of this variable is observed.

• The amount of information can be viewed as the ‘degree of surprise’ on learning the value of X.

• You can easily calculate I(x_i) and H(X) if you can estimate the probabilities with which the variable takes its values, or if you know the variable's probability distribution.

  I(x_i) = -log(p_i),   H(X) = Σ_{i=1..N} p_i I(a_i)

Page 21

• Further, you can consider two random variables X and Y.
• Now, you may want to quantify the remaining entropy (i.e. uncertainty) of a random variable X given that the value of another random variable Y is known.
• The conditional entropy of a random variable X, given that the value of the other random variable Y is known, is H(X|Y):

  H(X|Y) = -Σ_y P(y) Σ_x P(x|y) log2 P(x|y)
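A sketch of the formula above, computing H(X|Y) from a joint distribution P(X, Y). The example joint distribution encodes the earlier noisy-coin channel (fair coin, 1% mishearing) and is an assumption for illustration.

    from math import log2

    def conditional_entropy(joint):
        """H(X|Y) = -sum_y P(y) sum_x P(x|y) log2 P(x|y); joint maps (x, y) -> P(x, y)."""
        p_y = {}
        for (x, y), p in joint.items():
            p_y[y] = p_y.get(y, 0.0) + p
        h = 0.0
        for (x, y), p in joint.items():
            if p > 0:
                h -= p * log2(p / p_y[y])   # P(x, y) * log2 P(x|y), since P(x|y) = P(x, y)/P(y)
        return h

    # Noisy coin channel: X = true toss, Y = what we hear.
    joint = {("head", "head"): 0.495, ("head", "tail"): 0.005,
             ("tail", "head"): 0.005, ("tail", "tail"): 0.495}
    print(conditional_entropy(joint))   # ~0.081 bits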

Page 22

• Mutual information between the variables X and Y

I(X,Y) = H(X) - H(X|Y)

• Mutual information is the reduction in uncertainty about X as a consequence of the observation of Y.

• Mutual information measures the information that X and Y share.

• It measures how much knowing one of these variables reduces our uncertainty about the other.
  – If X and Y are independent, then knowing X does not give any information about Y; their mutual information is zero.
  – If X and Y are identical, all information conveyed by X is shared with Y: knowing X determines Y. I(X,Y) is the same as the uncertainty (entropy) contained in Y (or X) alone.
• I(X,Y) ≥ 0 (non-negative)
• I(X,Y) = I(Y,X) (symmetric)

Page 23

Decision trees

Page 24

Umpires' decision to play a cricket match

Weather    Light   Ground condition   Umpires' decision
Sunny      Good    Dry                Play
Overcast   Good    Dry                Play
Raining    Good    Dry                No play
Overcast   Poor    Dry                No play
Overcast   Poor    Damp               No play
Raining    Poor    Damp               No play
Overcast   Good    Damp               Play
Sunny      Poor    Dry                Play

Intelligent bioinformatics: The application of artificial intelligence techniques to bioinformatics problems, Keedwell

Page 25

Intelligent bioinformatics The application of artificial intelligence techniques to bioinformatics problems, Keedwell

Page 26

• Supervised
• Used both for
  – classification – classification tree
  – regression – regression tree
• Advantages
  – relatively undemanding in computational terms
  – provide clear, explicit reasoning for their decisions in the form of symbolic decision trees, which can be converted to sets of rules
  – accurate and, in more recent guises, increasingly robust in the face of noise

Page 27

• Task - determine, from the data, the rules the umpires are explicitly or implicitly using to determine whether play should take place.

• How to split the data so that each subset in the data uniquely identifies a class in the data?

• In our case, divide up the set of training examples into two smaller sets that completely encapsulate each class ‘Play’ and ‘No play’.

Page 28

• Each division is known as a test and splits the dataset into subsets according to the value of the attribute.
• E.g. if a test on 'Light' is performed, this gives
  – Light = Good: yields 4 examples, 3 of class 'Play', 1 of 'No play'
  – Light = Poor: yields 4 examples, 1 of class 'Play', 3 of 'No play'

Weather    Light   Ground condition   Decision
Sunny      Good    Dry                Play
Overcast   Good    Dry                Play
Raining    Good    Dry                No play
Overcast   Good    Damp               Play

Weather    Light   Ground condition   Decision
Overcast   Poor    Dry                No play
Overcast   Poor    Damp               No play
Raining    Poor    Damp               No play
Sunny      Poor    Dry                Play

Page 29

• The above test on 'Light' separates the samples into two subsets, each with three examples of one class and one of the other.
• This test has been chosen at random.
• Is it really the best way of splitting the data?
• A measurement of the effectiveness of each attribute/feature is required. This measure must reflect the distribution of examples over the classes in the problem.

Page 30

Gain criterion

• Based on the amount of information that a test on the data conveys.
• The information contained within a test is related to the probability of selecting one training example from that class.
• T – training set, Cj – a particular class
  – What is the probability?

  freq(Cj, T) / |T|

Weather    Light   Ground   Class
Sunny      Good    Dry      Play
Overcast   Good    Dry      Play
Raining    Good    Dry      No play
Overcast   Poor    Dry      No play
Overcast   Poor    Damp     No play
Raining    Poor    Damp     No play
Overcast   Good    Damp     Play
Sunny      Poor    Dry      Play

Page 31

• What is the information conveyed by selecting one training example from class Cj?

  -log2( freq(Cj, T) / |T| )

• What is the expected information from the whole training set?

  info(T) = -Σ_{j=1..k} ( freq(Cj, T) / |T| ) * log2( freq(Cj, T) / |T| )

• What is this quantity called?
  – Entropy
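A minimal Python sketch of info(T) computed from a list of class labels (the name info follows the slides; the rest is an illustrative choice):

    from math import log2

    def info(labels):
        """info(T) = -sum_j freq(Cj, T)/|T| * log2(freq(Cj, T)/|T|), in bits."""
        n = len(labels)
        freqs = {c: labels.count(c) for c in set(labels)}
        return -sum(f / n * log2(f / n) for f in freqs.values())

    # 4 'Play' and 4 'No play' examples, as in the umpires' training set.
    print(info(["Play"] * 4 + ["No play"] * 4))   # 1.0 bit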

Page 32

• OK, we know the information measure for the entire training set.
• Each test that is devised by the algorithm must be compared with this to determine how much of an improvement (if any) is seen in classification.
• Now consider a similar measurement after T has been partitioned by a test x:

  info_x(T) = Σ_{i=1..n} ( |T_i| / |T| ) * info(T_i)

Page 33

• What is info_x(T) called?
• Conditional entropy
  – the entropy of the training set on the condition that the split x has been performed
• Information gain (Kullback–Leibler divergence) measures the information yielded by a test x. It is defined as

  gain(x) = info(T) – info_x(T)

• So what is information gain actually?
  – the mutual information between the test x and the class
• The gain criterion selects the test that maximizes the information gain.
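A count-level sketch of info_x(T) and gain(x) as defined above; representing each branch by its class counts is an illustrative choice, not the slides' notation.

    from math import log2

    def info_counts(counts):
        """Entropy of a class-count vector, in bits (zero counts contribute nothing)."""
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c)

    def gain(parent_counts, branch_counts):
        """gain(x) = info(T) - info_x(T); branch_counts lists class counts per subset T_i."""
        n = sum(parent_counts)
        info_x = sum(sum(b) / n * info_counts(b) for b in branch_counts)
        return info_counts(parent_counts) - info_x

    # Test on 'Light': Good -> 3 Play / 1 No play, Poor -> 1 Play / 3 No play.
    print(gain([4, 4], [[3, 1], [1, 3]]))   # ~0.189 bits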

Page 34

|T| = ?                   |T| = 8
j = ?                     j = 1, 2
freq('Play', T) = ?       freq('Play', T) = 4
freq('No play', T) = ?    freq('No play', T) = 4
info(T) = ?

Weather    Light   Ground   Class
Sunny      Good    Dry      Play
Sunny      Poor    Dry      Play
Overcast   Good    Dry      Play
Overcast   Good    Damp     Play
Overcast   Poor    Dry      No play
Overcast   Poor    Damp     No play
Raining    Poor    Damp     No play
Raining    Good    Dry      No play

  info(T) = -Σ_{j=1..k} ( freq(Cj, T) / |T| ) * log2( freq(Cj, T) / |T| )

info(T) = -4/8 * log2(4/8) - 4/8 * log2(4/8) = 1.0

Page 35

[Training set table as above]

split on x = Weather

  info_x(T) = Σ_{i=1..n} ( |T_i| / |T| ) * info(T_i)

i = ?                  i = 1, 2, 3
i = 1: Weather = 'Sunny'
|T1|/|T| = ?           |T1|/|T| = 2/8
info(T1) = ?           info(T1) = -2/2 * log2(2/2) - 0/2 * log2(0/2) = 0
info_x(T) = 2/8 * info(T1) + …

Page 36

[Training set table as above]

split on x = Weather

i = 2: Weather = 'Overcast'
|T2|/|T| = ?           |T2|/|T| = 4/8
info(T2) = ?           info(T2) = -2/4 * log2(2/4) - 2/4 * log2(2/4) = 1
info_x(T) = 2/8 * info(T1) + 4/8 * info(T2) + …

Page 37

[Training set table as above]

split on x = Weather

i = 3: Weather = 'Raining'
|T3|/|T| = ?           |T3|/|T| = 2/8
info(T3) = ?           info(T3) = -0/2 * log2(0/2) - 2/2 * log2(2/2) = 0
info_x(T) = 2/8 * info(T1) + 4/8 * info(T2) + 2/8 * info(T3)

Page 38

• info_weather(T) = 0.5 bits
  – Gain = 1.0 - 0.5 = 0.5
• test 'Light'
  – Gain = 0.189
• test 'Ground'
  – Gain = 0.049
• Choose the split with maximum Gain,
  – i.e. split by Weather first.
• 'Sunny' and 'Raining' are clean; they contain just one class.
• However, 'Overcast' contains both classes.

[Training set table as above]
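The worked example on the preceding slides can be reproduced with a short script. This is a sketch: the training set is transcribed from the table, and the function names mirror the slides' info and gain.

    from math import log2

    # The umpires' training set: (Weather, Light, Ground, Class).
    data = [
        ("Sunny",    "Good", "Dry",  "Play"),
        ("Sunny",    "Poor", "Dry",  "Play"),
        ("Overcast", "Good", "Dry",  "Play"),
        ("Overcast", "Good", "Damp", "Play"),
        ("Overcast", "Poor", "Dry",  "No play"),
        ("Overcast", "Poor", "Damp", "No play"),
        ("Raining",  "Poor", "Damp", "No play"),
        ("Raining",  "Good", "Dry",  "No play"),
    ]
    attributes = {"Weather": 0, "Light": 1, "Ground": 2}

    def info(rows):
        """info(T): entropy of the class distribution of the rows, in bits."""
        n = len(rows)
        counts = {}
        for row in rows:
            counts[row[-1]] = counts.get(row[-1], 0) + 1
        return -sum(c / n * log2(c / n) for c in counts.values())

    def gain(rows, attr_index):
        """gain(x) = info(T) - info_x(T) for a split on the given attribute."""
        subsets = {}
        for row in rows:
            subsets.setdefault(row[attr_index], []).append(row)
        info_x = sum(len(s) / len(rows) * info(s) for s in subsets.values())
        return info(rows) - info_x

    for name, idx in attributes.items():
        print(name, round(gain(data, idx), 3))
    # Weather 0.5, Light 0.189, Ground 0.049 -> split on Weather first.

Restricting data to the rows with Weather = 'Overcast' and re-running the loop reproduces the second-level gains on the following slides (Light: 1.0, Ground: 0.0).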

Page 39

• So the algorithm now proceeds by investigating which of the two remaining features ('Light' or 'Ground') can classify the dataset correctly.
• Now our training set contains only those instances with Weather = 'Overcast':

Weather    Light   Ground   Class
Overcast   Good    Dry      Play
Overcast   Good    Damp     Play
Overcast   Poor    Dry      No play
Overcast   Poor    Damp     No play

Page 40

• info(T) = -2/4 * log2(2/4) - 2/4 * log2(2/4) = 1.0 bit

• info_light(T) = 2/4 * (-2/2 * log2(2/2) - 0/2 * log2(0/2))     (Good)
               + 2/4 * (-0/2 * log2(0/2) - 2/2 * log2(2/2))      (Poor)
               = 0 bits         Gain = 1.0 - 0.0 = 1.0

• info_ground(T) = 2/4 * (-1/2 * log2(1/2) - 1/2 * log2(1/2))    (Dry)
                 + 2/4 * (-1/2 * log2(1/2) - 1/2 * log2(1/2))    (Damp)
                 = 1.0 bit      Gain = 1.0 - 1.0 = 0.0

[Table: the 'Overcast' subset as above]

Page 41

Intelligent bioinformatics The application of artificial intelligence techniques to bioinformatics problems, Keedwell

1. split – Weather
   – 'Sunny' and 'Raining' – fully classified as Play and No play, respectively
2. split – Light
   – Good – Play, Poor – No play
3. End

Page 42

Gain ratio

• The gain criterion is biased towards tests which have many subsets.
• A revised gain measure that takes into account the size of the subsets created by a test is called the gain ratio.
• In our example, the split by 'Weather' yielded three subsets; the splits by the other two attributes yielded only two subsets each.
• Gain is biased towards 'Weather' (Gain = 0.5), while the gain ratio corrects for this bias (it equals 0.33). However, the split by 'Weather' still wins.

  split info(x) = -Σ_{i=1..n} ( |T_i| / |T| ) * log2( |T_i| / |T| )

  gain ratio(x) = gain(x) / split info(x)
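A sketch of split info and gain ratio for the 'Weather' split (subset sizes 2, 4 and 2 out of 8 examples, as above):

    from math import log2

    def split_info(subset_sizes):
        """split info(x) = -sum_i |T_i|/|T| * log2(|T_i|/|T|)."""
        n = sum(subset_sizes)
        return -sum(s / n * log2(s / n) for s in subset_sizes if s)

    gain_weather = 0.5                      # from the gain criterion above
    si = split_info([2, 4, 2])              # Sunny, Overcast, Raining
    print(si)                               # 1.5 bits
    print(gain_weather / si)                # ~0.333 - the gain ratio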

Page 43

• J. Ross Quinlan, C4.5: Programs for machine learning (book)

“In my experience, the gain ratio criterion is robust and typically gives a consistently better choice of test than the gain criterion”.

• However, Mingers J.¹ finds that though the gain ratio leads to smaller trees (which is good), it has a tendency to favor unbalanced splits in which one subset is much smaller than the others.

1 Mingers J., ”An empirical comparison of selection measures for decision-tree induction.”, Machine Learning 3(4), 319-342, 1989

Page 44

Continuous data

• How to split on real, continuous data?
• Use a threshold and the comparison operators <, ≤, >, ≥ (e.g. "if Light ≥ 6 then Play" for a Light variable ranging between 1 and 10).
• If a continuous variable in the data set has n values, there are n - 1 possible tests.
• The algorithm evaluates each of these splits, and it is actually not expensive.
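A sketch of how the n - 1 candidate tests can be enumerated for a continuous attribute; taking midpoints between consecutive sorted values is a common choice assumed here, not something the slide prescribes.

    def candidate_thresholds(values):
        """For n distinct sorted values there are n - 1 candidate tests 'x <= t'."""
        vals = sorted(set(values))
        return [(a + b) / 2 for a, b in zip(vals, vals[1:])]

    # A hypothetical 'Light' variable measured on a 1-10 scale.
    light = [2.0, 4.5, 6.0, 7.5, 9.0]
    print(candidate_thresholds(light))   # 4 candidate thresholds for 5 values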

Page 45

Pruning

• A decision tree overfits, i.e. it learns to reproduce the training data exactly.
• Strategy to prevent overfitting – pruning:
  – Build the whole tree.
  – Prune the tree back, so that complex branches are consolidated into smaller (less accurate on the training data) sub-branches.
  – The pruning method uses some estimate of the expected error.

Page 46

Regression tree

Regression tree for predicting price of 1993-model cars.

All features have been standardized to have zero mean and unit variance.

The R² of the tree is 0.85, which is significantly higher than that of a multiple linear regression fit to the same data (R² = 0.8).

Page 47

Algorithms, programs

• ID3, C4.5, C5.0 (Linux) / See5 (Windows) (Ross Quinlan)
• Classification only
• ID3
  – uses information gain
• C4.5
  – extension of ID3
  – improvements over ID3:
    • handling both continuous and discrete attributes (thresholds)
    • handling training data with missing attribute values
    • pruning trees after creation
• C5.0/See5
  – improvements over C4.5 (for a comparison see http://www.rulequest.com/see5-comparison.html):
    • speed
    • memory usage
    • smaller decision trees

Page 48

• CART (Leo Breiman)
  – Classification and Regression Trees
  – only binary splits
  – splitting criterion – Gini impurity (index)
    • not based on information theory
• Both C4.5 and CART are robust tools.
• No method is always superior – experiment!

• (Unlike CART, C4.5 splits are not necessarily binary.)
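Neither C5.0 nor Breiman's original CART code is needed to experiment: scikit-learn's DecisionTreeClassifier (an optimised CART-style learner) supports both the Gini and the entropy criterion. The sketch below one-hot encodes the umpires' data by hand; the encoding and the snippet as a whole are illustrative, not part of the lecture.

    from sklearn.tree import DecisionTreeClassifier

    # Umpires' data, one-hot encoded by hand:
    # columns = [Sunny, Overcast, Raining, Light=Good, Ground=Dry]
    X = [
        [1, 0, 0, 1, 1], [1, 0, 0, 0, 1],   # Sunny, Good/Poor light, Dry
        [0, 1, 0, 1, 1], [0, 1, 0, 1, 0],   # Overcast, Good light
        [0, 1, 0, 0, 1], [0, 1, 0, 0, 0],   # Overcast, Poor light
        [0, 0, 1, 0, 0], [0, 0, 1, 1, 1],   # Raining
    ]
    y = ["Play", "Play", "Play", "Play", "No play", "No play", "No play", "No play"]

    clf = DecisionTreeClassifier(criterion="entropy")   # information-based splits
    clf.fit(X, y)
    print(clf.predict([[0, 1, 0, 1, 1]]))               # Overcast, good light, dry ground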