The joy of Entropy

Transcript of “The joy of Entropy”

Page 1: The joy of Entropy

Page 2: Administrivia

• Reminder: HW 1 due next week
• No other news. No noose is good noose...

Page 3: Time wings on...

• Last time:
  • Hypothesis spaces
  • Intro to decision trees
• This time:
  • Loss matrices
  • Learning bias
  • The getBestSplitFeature function
  • Entropy

Page 4: Loss

•For problem 8.11, you need cost values

•A.k.a. loss values

•Introduced in DH&S ch. 2.2

•Basic idea: some mistakes are more expensive than others

Page 5: Loss

• Example: classifying computer network traffic

•Traffic is either normal or intrusive

•There’s way more normal traffic than intrusive

•Data is normal, but classifier says “intrusive”?

•Data is intrusive, but classifier says “normal”?

Page 6: Cost of mistakes

                        True class
                        Normal      Intrusion
Predicted   Normal      $0          $5
class       Intrusion   $5,000      $0

Page 7: Cost of mistakes

In loss-matrix notation, λij is the cost of predicting ωi when the true class is ωj:

                    True: ω1    True: ω2
Predicted: ω1       λ11         λ12
Predicted: ω2       λ21         λ22

Page 8: In general...

                    True: ω1    True: ω2    ...    True: ωk
Predicted: ω1       λ11         λ12         ...    λ1k
Predicted: ω2       λ21         λ22         ...    λ2k
...                 ...         ...         ...    ...
Predicted: ωk       λk1         λk2         ...    λkk

Page 9: Cost-based criterion

• For the misclassification error case, we wrote the risk of a classifier f as:
• For the cost-based case, this becomes (both are written out below):
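In standard notation, one way to write these two risks (a sketch assuming 0/1 loss for the first case and the loss matrix λ above for the second; the expectation is over the joint distribution of inputs x and true classes ω):

    % Misclassification (0/1-loss) risk of a classifier f
    R(f) = \Pr_{(x,\omega)}[\, f(x) \neq \omega \,]
         = \mathbb{E}_{(x,\omega)}[\, \mathbf{1}\{ f(x) \neq \omega \} \,]

    % Cost-based risk: weight each kind of mistake by its loss-matrix entry
    R(f) = \mathbb{E}_{(x,\omega)}[\, \lambda_{f(x),\,\omega} \,]
         = \sum_{i} \sum_{j} \lambda_{ij} \, \Pr[\, f(x) = \omega_i,\ \omega = \omega_j \,]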

Page 10: Back to decision trees...

• Reminders:
  • Hypothesis space for DT:
    • Data structure view: all trees with a single test per internal node and a constant value at each leaf
    • Geometric view: sets of axis-orthogonal hyper-rectangles; piecewise-constant approximation
• Open question: the getBestSplitFeature function

Page 11: Splitting criteria

• What properties do we want our getBestSplitFeature() function to have?

•Increase the purity of the data

•After split, new sets should be closer to uniform labeling than before the split

•Want the subsets to have roughly the same purity

•Want the subsets to be as balanced as possible

Page 12: Bias

• These choices are designed to produce small trees

•May miss some other, better trees that are:

•Larger

•Require a non-greedy split at the root

•Definition: Learning bias == tendency of an algorithm to find one class of solution out of H in preference to another

Page 13: Bias: the pretty picture

[Figure: the space of all functions]

Page 14: Bias: the algebra

• Bias can also be seen as the expected difference between the true concept and the induced concept (written out below):

•Note: expectation taken over all possible data sets

•Don’t actually know that distribution either :-P

•Can (sometimes) make a prior assumption
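In symbols, one way to write this (a sketch writing g for the true concept, D for a training set drawn from the data distribution, and ĝ_D for the concept the learner induces from D; these symbols are not the slide's own):

    % Bias at a point x: expected gap between the induced concept and the truth,
    % where the expectation is over all possible training sets D
    \mathrm{Bias}(x) = \mathbb{E}_{D}\big[\, \hat{g}_{D}(x) \,\big] - g(x)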

Page 15: More on Bias

• Bias can be a property of:
  • The risk/loss function
    • How you measure “distance” to the best solution
  • The search strategy
    • How you move through H to find the final hypothesis

Page 16: Back to splitting...

• Consider a set of true/false labels
• Want our measure to be small when the set is pure (all true or all false), and large when the set is almost evenly divided between the classes
• In general, we call such a function an impurity, i(y)

•We’ll use entropy

•Expresses the amount of information in the set

•(Later we’ll use the negative of this function, so it’ll be better if the set is almost pure)

Page 17: Entropy, cont'd

• Define: class fractions (a.k.a. class prior probabilities)

Page 18: Entropy, cont'd

• Define: class fractions (a.k.a. class prior probabilities)
• Define: entropy of a set

Page 19: Entropy, cont'd

• Define: class fractions (a.k.a. class prior probabilities)
• Define: entropy of a set
• In general, for k classes (all three are sketched below):
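One standard way to write these definitions, for a label set y of size N with true/false labels in the first two cases (the symbols p_true, p_i are not the slide's own, and we use the convention 0 log 0 = 0):

    % Class fractions (empirical class prior probabilities)
    p_{\mathrm{true}} = \frac{\#\{\, j : y_j = \mathrm{true} \,\}}{N},
    \qquad p_{\mathrm{false}} = 1 - p_{\mathrm{true}}

    % Entropy of a binary-labeled set
    H(y) = -\, p_{\mathrm{true}} \log_2 p_{\mathrm{true}}
           - p_{\mathrm{false}} \log_2 p_{\mathrm{false}}

    % In general, for k classes with fractions p_1, \dots, p_k
    H(y) = -\sum_{i=1}^{k} p_i \log_2 p_i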

Page 20: The entropy curve

[Figure: the entropy curve]

Page 21: Properties of entropy

•Maximum when class fractions equal

•Minimum when data is pure

•Smooth

•Differentiable; continuous

• Concave (∩-shaped)

• Intuitively: the entropy of a distribution tells you how “predictable” that distribution is.

Page 22: Entropy in a nutshell

From: Andrew Moore’s tutorial on information gain:

http://www.cs.cmu.edu/~awm/tutorials

Page 23: Entropy in a nutshell

Low-entropy distribution: data values (location of soup) sampled from a tight distribution (the bowl) -- highly predictable

Page 24: Entropy in a nutshell

High-entropy distribution: data values (location of soup) sampled from a loose distribution (uniformly around the dining room) -- highly unpredictable

Page 25: Entropy of a split

• A split produces a number of sets (one for each branch)
• Need a corresponding entropy of a split (i.e., the entropy of a collection of sets)
• Definition: entropy of a split

where:
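One standard way to write this, assuming the split sends the labels y into subsets Y_1, ..., Y_m, one per branch (the weights q_i are not the slide's notation):

    % Entropy of a split: weighted average of the branch entropies
    H_{\mathrm{split}} = \sum_{i=1}^{m} q_i \, H(Y_i)

    % where each weight is the fraction of the data that falls into branch i
    q_i = \frac{|Y_i|}{\sum_{j=1}^{m} |Y_j|}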

Page 26: Information gain

• The last, easy step:
• Want to pick the attribute that decreases the information content of the data as much as possible
• Q: Why decrease?
• Define: gain of splitting data set [X, y] on attribute a:
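One standard way to write the gain, reusing the split entropy defined above (writing H_split(a) for the entropy of the split that attribute a induces):

    % Information gain of attribute a: entropy before the split minus entropy after
    \mathrm{gain}(X, y;\, a) = H(y) - H_{\mathrm{split}}(a)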

Page 27: The splitting method

    Feature getBestSplitFeature(X, Y) {
      // Input: instance set X, label set Y
      double baseInfo = entropy(Y);
      double[] gain = new double[X.getFeatureSet().size()];
      for (a : X.getFeatureSet()) {
        // split the data on attribute a: one (Xi, Yi) pair per branch
        [X0,...,Xk, Y0,...,Yk] = a.splitData(X, Y);
        // gain = entropy before the split minus entropy after
        gain[a] = baseInfo - splitEntropy(Y0,...,Yk);
      }
      return argmax(gain);
    }
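The helpers entropy() and splitEntropy() are not spelled out on the slide; here is a minimal sketch of what they could look like, assuming labels are stored as lists of class indices (the class name EntropyUtil and these signatures are assumptions of this sketch, not the course's code):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class EntropyUtil {

        // Entropy of one label set: H(Y) = -sum_i p_i log2 p_i
        public static double entropy(List<Integer> labels) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int label : labels) {
                counts.merge(label, 1, Integer::sum);
            }
            double h = 0.0;
            for (int count : counts.values()) {
                double p = (double) count / labels.size();
                h -= p * (Math.log(p) / Math.log(2));   // log base 2
            }
            return h;
        }

        // Entropy of a split: average of the branch entropies, weighted by
        // the fraction of the labels that fall into each branch
        public static double splitEntropy(List<List<Integer>> branches) {
            int total = 0;
            for (List<Integer> branch : branches) {
                total += branch.size();
            }
            double h = 0.0;
            for (List<Integer> branch : branches) {
                if (!branch.isEmpty()) {
                    h += ((double) branch.size() / total) * entropy(branch);
                }
            }
            return h;
        }
    }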

Page 28: DTs in practice...

• Growing to purity is bad (overfitting)

Page 29: DTs in practice...

• Growing to purity is bad (overfitting)

[Figure: x1 = petal length, x2 = sepal width]

Page 30: DTs in practice...

• Growing to purity is bad (overfitting)

[Figure: x1 = petal length, x2 = sepal width]

Page 31: DTs in practice...

• Growing to purity is bad (overfitting)
• Terminate growth early
• Grow to purity, then prune back

Page 32: DTs in practice...

• Growing to purity is bad (overfitting)

[Figure: x1 = petal length, x2 = sepal width; a leaf that is not statistically supportable -- remove the split and merge the leaves]

Page 33: DTs in practice...

• Multiway splits are a pain
• Entropy is biased in favor of more splits
• Correct with the gain ratio (DH&S Ch. 8.3.2, Eqn. 7; sketched below)
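The gain ratio normalizes the gain by the entropy of the split sizes, which penalizes attributes that shatter the data into many small branches. One standard form, in the notation used above (check DH&S Eqn. 7 for the exact version the course uses):

    % Gain ratio: information gain divided by the entropy of the branch sizes
    \mathrm{gainRatio}(X, y;\, a) =
      \frac{\mathrm{gain}(X, y;\, a)}{-\sum_{i=1}^{m} q_i \log_2 q_i},
    \qquad q_i = \frac{|Y_i|}{|y|}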

Page 34: DTs in practice...

• Real-valued attributes
• Rules of the form if (x1 < 3.4) { ... }
• How to pick the “3.4”? (one common approach is sketched below)
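The slide leaves the question open; one common approach (a sketch, not necessarily what the course's code does) is to sort the attribute's values, consider only the midpoints between adjacent distinct values as candidate thresholds, and keep the one with the highest information gain. This sketch reuses the assumed EntropyUtil helpers from Page 27 and takes parallel arrays of attribute values and integer class labels:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ThresholdPicker {

        // Pick a threshold t for a real-valued attribute so that the split
        // "value < t" vs. "value >= t" has the highest information gain.
        // Assumes values is non-empty and values/labels are parallel arrays.
        public static double bestThreshold(double[] values, int[] labels) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);

            List<Integer> allLabels = new ArrayList<>();
            for (int label : labels) {
                allLabels.add(label);
            }
            double baseInfo = EntropyUtil.entropy(allLabels);

            double bestT = sorted[0];
            double bestGain = Double.NEGATIVE_INFINITY;

            for (int i = 0; i + 1 < sorted.length; i++) {
                if (sorted[i] == sorted[i + 1]) {
                    continue;                 // no cut point between equal values
                }
                double t = (sorted[i] + sorted[i + 1]) / 2.0;

                // Partition the labels by the candidate threshold
                List<Integer> left = new ArrayList<>();
                List<Integer> right = new ArrayList<>();
                for (int j = 0; j < values.length; j++) {
                    (values[j] < t ? left : right).add(labels[j]);
                }

                double gain = baseInfo - EntropyUtil.splitEntropy(Arrays.asList(left, right));
                if (gain > bestGain) {
                    bestGain = gain;
                    bestT = t;
                }
            }
            return bestT;
        }
    }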

Page 35: Measuring accuracy

• So now you have a DT -- what now?

•Usually, want to use it to classify new data (previously unseen)

•Want to know how well you should expect it to perform

•How do you estimate such a thing?

Page 36: Measuring accuracy

• So now you have a DT -- what now?

•Usually, want to use it to classify new data (previously unseen)

•Want to know how well you should expect it to perform

•How do you estimate such a thing?

•Theoretically -- prove that you have the “right” tree

•Very, very hard in practice

•Measure it

•Trickier than it sounds!....

Page 37: Testing with training data

• So you have a data set X = {x_1, ..., x_N}
• and corresponding labels y = {y_1, ..., y_N}
• You build your decision tree:

    tree = buildDecisionTree(X, y)

• What happens if you just do this:

    double acc = 0.0;
    for (int i = 0; i < N; ++i) {
      // count a hit when the tree reproduces the training label
      acc += (tree.classify(X[i]) == y[i]) ? 1 : 0;
    }
    acc /= N;
    return acc;

Page 38: Testing with training data

• Answer: you tend to overestimate real accuracy (possibly drastically)

[Figure: x2 = sepal width; new, unseen points marked with “?”]

Page 39: Separation of train & test

• Fundamental principle (1st amendment of ML):
• Don't evaluate accuracy (performance) of your classifier (learning system) on the same data used to train it!
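A minimal sketch of what that principle looks like in code: hold part of the labeled data out before training, and measure accuracy only on the held-out part. The Instance record and the learner function here are placeholders standing in for whatever the course code provides:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.function.Function;

    public class HoldoutEvaluation {

        // Labeled example: feature vector plus class label (placeholder type)
        public record Instance(double[] features, int label) {}

        // Train on one portion of the data, evaluate on the rest.
        public static double holdoutAccuracy(List<Instance> data,
                                             double trainFraction,
                                             Function<List<Instance>, Function<double[], Integer>> learner) {
            List<Instance> shuffled = new ArrayList<>(data);
            Collections.shuffle(shuffled);            // avoid bias from the original ordering
            int cut = (int) (shuffled.size() * trainFraction);

            List<Instance> train = shuffled.subList(0, cut);
            List<Instance> test = shuffled.subList(cut, shuffled.size());

            // The learner (e.g., a decision-tree builder) never sees the test set
            Function<double[], Integer> classifier = learner.apply(train);

            int correct = 0;
            for (Instance inst : test) {
                if (classifier.apply(inst.features()).equals(inst.label())) {
                    correct++;
                }
            }
            return (double) correct / test.size();
        }
    }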