
Transcript of COMP 2208 Dr. Long Tran-Thanh University of Southampton Decision Trees.

Page 1

COMP 2208

Dr. Long [email protected]

University of Southampton

Decision Trees

Page 2

Classification

[Diagram: the agent loop – Perception brings inputs from the Environment; classification categorizes the inputs and updates the belief model, which in turn updates the decision-making policy; Decision making drives the agent's Behaviour back into the Environment.]

Page 3

Recognizing the type of situation you are in right now is a basic agent task:

Classification

• Robotics: mistaking a human body for a part of a car on the assembly line would be disastrous

• Military: friend or foe?

• Electronic card usage: was it fraud or not?

Page 4

Last lecture: neural networks

Why more classification methods?

• Very powerful in theory

• Promising direction: deep learning

• Still difficult to fully control the technology

• In many cases: other techniques are more efficient

Occam’s razor: the simpler the model, the better the performance – go for something more complicated only if it’s really necessary

In many real-world problems, data cleaning is the most important step – after that, a simple classification method would do the job

Page 5

Classification

Classification Algorithm

• Bottom up: inspiration from biology – e.g., neural networks

• Top down: inspiration from higher abstraction levels

Page 6

Prof or hobo 1?

http://individual.utoronto.ca/somody/quiz.html

Page 7

Prof or hobo 2?

http://individual.utoronto.ca/somody/quiz.html

Page 8

Prof or hobo 3?

http://individual.utoronto.ca/somody/quiz.html

Page 9

Prof or hobo answers

http://individual.utoronto.ca/somody/quiz.html

Hobo, Hobo, Professor

Page 10

Back to classification

Classification Algorithm

Different ways to go:

[Cartoon thought bubbles: "Honey? Fired? Evil plan?"]

Page 11

Back to classification

Classification Algorithm

Some classification algorithms:

Logistic regression

Support vector machines (SVMs)

Decision trees + its family

• Easy to understand
• (Relatively) easy to implement
• Very efficient in many cases

Page 12

Decision making process

[Flowchart: a chain of decisions – at each step, ask "Did it go well?" and branch on Yes/No.]

Page 13

Back to the “Prof or hobo” quiz

What are the clues that allow you to distinguish a prof from a hobo?

• Clothes people are wearing

• Their eyes

• The beard

• …

Main idea: check out some properties in some order

Page 14

Classification with decision trees

• A decision tree takes a series of inputs defining a situation, and outputs a binary decision/classification.

• A decision tree spells out an order for checking the properties (attributes) of the situation until we have enough information to decide what's going on.

• We use the observable attributes to predict the outcome (or some important hidden or unknown quantity).

Question: what is the optimal (efficient) order of the attributes?

Page 15

The importance of the ordering

• Think about the “20 questions” game: inefficient questions will lead to low performance

• Think about binary search: the optimal strategy always halves the interval (see the sketch after this list)

• Decision trees are very simple to produce if we already know the underlying rules.

• But what if we don’t have the rules, just past examples (experience)?
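As a concrete reminder of the halving idea, here is a standard binary search sketch in Python (not from the slides):

```python
def binary_search(sorted_xs, target):
    """Return the index of target in sorted_xs, or -1 if absent.
    Each comparison halves the remaining interval."""
    lo, hi = 0, len(sorted_xs) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_xs[mid] == target:
            return mid
        elif sorted_xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```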

Page 16

Our objective

Often we don't know in advance how to classify things, and we want our agent to learn from examples.

Page 17

Which attribute to start with? The order of attributes is still very important.

Idea: choose the next attribute whose value can reduce the uncertainty about the outcome of the classification the most

What does it mean when we say that something reduces the uncertainty in our knowledge?

Reducing uncertainty (in knowledge) = increasing (known) information

So we should choose the attribute that provides the highest information gain

Page 18

Entropy

How do we measure information gain (and how do we define it)?

Answer: borrow similar concepts from information & coding theory

Entropy (Shannon, 1948):

• A measure of the amount of disorder or uncertainty in a system.

• A tidy room has low entropy: you can be reasonably certain your keys are on the hook you made for them.

• A messy room has high entropy: things are all over the place and your keys could be absolutely anywhere.

Page 19

Entropy

Classification: input X → output Y

Uncertainty about the outcome is measured by the entropy (Shannon, 1948):

H(Y) = - Σ_y P(Y=y) * log2 P(Y=y)   (in bits)

where P(Y=y) is how often Y = y, and -log2 P(Y=y) is the measure of information (surprise) when Y = y.

Page 20

Entropy example

Weather:

              Good   OK     Terrible
Birmingham    0.33   0.33   0.33
Southampton   0.3    0.6    0.1
Glasgow       0      0      1

Page 21

Entropy example

Birmingham   P(x)    log2 P(x)   -P(x) log2 P(x)
Good         0.33    -1.58       0.53
OK           0.33    -1.58       0.53
Terrible     0.33    -1.58       0.53

Sum = 1.58 (bits)

Page 22

Entropy example

Southampton   P(x)   log2 P(x)   -P(x) log2 P(x)
Good          0.3    -1.74       0.52
OK            0.6    -0.74       0.44
Terrible      0.1    -3.32       0.33

Sum = 1.29 (bits)

Page 23

Entropy example

Glasgow    P(x)   log2 P(x)   -P(x) log2 P(x)
Good       0      -infinity   0
OK         0      -infinity   0
Terrible   1      0           0

Sum = 0 (bits)

When we are certain, the entropy is 0
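As a quick check of these three examples, here is a minimal Python sketch of the entropy formula (the helper name entropy is mine, not from the lecture):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), with the convention 0 * log(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

weather = {
    "Birmingham":  [1/3, 1/3, 1/3],   # Good, OK, Terrible
    "Southampton": [0.3, 0.6, 0.1],
    "Glasgow":     [0.0, 0.0, 1.0],
}
for city, dist in weather.items():
    print(f"{city}: {entropy(dist):.2f} bits")
# Birmingham: 1.58 bits, Southampton: 1.30 bits, Glasgow: 0.00 bits
# (the slides' 1.29 for Southampton comes from rounding each term before summing)
```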

Page 24

Conditional entropy

Classification: input X → output Y

Entropy measures the uncertainty of a given state of the system. How do we measure the change?

Conditional entropy:

H(Y | X) = Σ_x P(X=x) * H(Y | X=x) = - Σ_x Σ_y P(X=x, Y=y) * log2 P(Y=y | X=x)

where P(X=x, Y=y) is the joint probability and P(Y=y | X=x) is the conditional probability.

• How much uncertainty would remain about the outcome Y if we knew (for instance) the outcome of attribute X?
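A minimal Python sketch of this definition, estimating H(Y | X) from observed (x, y) pairs (the helper name conditional_entropy is mine, not from the lecture):

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """H(Y | X) = sum over x of P(X=x) * H(Y | X=x), estimated from (x, y) samples."""
    n = len(pairs)
    ys_by_x = defaultdict(list)
    for x, y in pairs:
        ys_by_x[x].append(y)
    h = 0.0
    for ys in ys_by_x.values():
        p_x = len(ys) / n                   # P(X = x)
        for count in Counter(ys).values():
            p_y_x = count / len(ys)         # P(Y = y | X = x)
            h -= p_x * p_y_x * math.log2(p_y_x)
    return h
```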

Page 25

Information gain

Information gain:

G(Y, X) = H(Y) – H(Y | X)

where H(Y) is the current level of uncertainty (entropy) and H(Y | X) is the possible new level of uncertainty (conditional entropy).

• The difference represents how much the uncertainty would decrease.
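Building on the entropy and conditional_entropy sketches above, information gain then takes only a few more lines (again a sketch, not the lecture's code):

```python
from collections import Counter

def information_gain(pairs):
    """G(Y, X) = H(Y) - H(Y | X), estimated from (x, y) samples.
    Uses entropy() and conditional_entropy() from the sketches above."""
    ys = [y for _, y in pairs]
    h_y = entropy([c / len(ys) for c in Counter(ys).values()])
    return h_y - conditional_entropy(pairs)
```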

Page 26

Building a decision tree

Recursive algorithm:

• Split the tree on the attribute with the highest information gain. Then repeat.

Stopping conditions:

• Don't split if all matching records have the same output value (no point, we know what happens!).

• Don't split if all matching records have the same attribute values (no point, we can't distinguish them).
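Putting the pieces together, a minimal ID3-style sketch of this recursive algorithm, assuming the hypothetical information_gain helper above and rows represented as Python dicts (all names here are mine, not from the lecture):

```python
from collections import Counter

def build_tree(rows, attributes, target):
    """Recursively build a decision tree.
    rows: list of dicts, attributes: candidate attribute names, target: output key."""
    outcomes = [r[target] for r in rows]
    # Stop: all matching records have the same output value.
    if len(set(outcomes)) == 1:
        return outcomes[0]
    # Stop: no attribute distinguishes the records – return the majority outcome.
    usable = [a for a in attributes if len({r[a] for r in rows}) > 1]
    if not usable:
        return Counter(outcomes).most_common(1)[0][0]
    # Split on the attribute with the highest information gain, then recurse.
    best = max(usable, key=lambda a: information_gain([(r[a], r[target]) for r in rows]))
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, [a for a in usable if a != best], target)
    return (best, branches)
```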

Page 27

Example: Predicting the importance of emails

Objective: predict whether the user will read the email

Page 28

Example: Predicting the importance of emails

18 emails: 9 read, 9 skipped

“Thread” attribute:

              Reads      Skips      Row total
new_thread    7 (70%)    3 (30%)    10
follow_up     2 (25%)    6 (75%)    8

What is the information gain if we choose “Thread”?

Calculation steps:

• Calculate H(Read)
• Calculate H(Read | Thread)
• Calculate G(Read, Thread) = H(Read) – H(Read | Thread)

Page 29

Example: Predicting the importance of emails

Calculating H(Read):

• 18 emails: 9 read, 9 skipped

• P(Read = True) = P(Read = False) = 0.5

• H(Read) = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 (bit)

Page 30

Example: Predicting the importance of emails

Calculating H(Read | Thread) – the specific conditional entropy

Calculation steps:

• Calculate H(Read | Thread = new)
• Calculate H(Read | Thread = follow_up)
• Calculate H(Read | Thread) = P(new)*H(Read | Thread = new) + P(follow_up)*H(Read | Thread = follow_up)

Page 31

Example: Predicting the importance of emails

              Reads      Skips      Row total
new_thread    7 (70%)    3 (30%)    10
follow_up     2 (25%)    6 (75%)    8

• P(Read = True | new)= 0.7; P(Read = False | new) = 0.3

• H(Read | new) = 0.88

• P(Read = True | follow_up) = 0.25; P(Read = False | follow_up) = 0.75

• H(Read | follow_up) = 0.81

• H(Read | Thread) = 10/18 * 0.88 + 8/18 * 0.81 = 0.85

Page 32

Example: Predicting the importance of emails

Calculating G(Read, Thread):

• G(Read, Thread) = H(Read) – H(Read | Thread)

• G(Read, Thread) = 1 – 0.85 = 0.15
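These numbers can be checked with the entropy sketch from earlier (a verification of the slides' arithmetic, rounded to two decimals):

```python
# Email example: Thread attribute, 18 emails (9 read, 9 skipped).
p_new, p_followup = 10/18, 8/18
h_read     = entropy([9/18, 9/18])        # H(Read) = 1.00 bit
h_new      = entropy([0.7, 0.3])          # H(Read | new)       ≈ 0.88
h_followup = entropy([0.25, 0.75])        # H(Read | follow_up) ≈ 0.81
h_cond = p_new * h_new + p_followup * h_followup   # H(Read | Thread) ≈ 0.85
print(round(h_read - h_cond, 2))          # G(Read, Thread) ≈ 0.15
```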

Page 33

Example: Predicting the importance of emails

Page 34

Advantages of decision trees

• Decision trees are able to generate understandable (i.e., human-readable) rules.

• Once learned, decision trees perform classification very efficiently.

• Decision trees are able to handle continuous as well as categorical variables: for a continuous variable, you choose a split threshold based on information gain (see the sketch below).
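As an illustration, a minimal sketch of threshold selection for a single continuous variable, reusing the hypothetical information_gain helper from earlier (the name best_threshold is mine, not from the lecture):

```python
def best_threshold(values, labels):
    """Try midpoints between consecutive sorted unique values; keep the
    binary split (value <= threshold) with the highest information gain."""
    candidates = sorted(set(values))
    best_t, best_g = None, float("-inf")
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2
        g = information_gain([(v <= t, y) for v, y in zip(values, labels)])
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g   # (None, -inf) if there is only one distinct value
```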