Classifiers and Machine Learning

Data Intensive Linguistics, Spring 2008, Ling 684.02

Decision Trees

What does a decision tree do? How do you prepare training data? How do you use a decision tree? The traditional example is a tiny data set about weather. Here I use Wagon; many other similar packages exist.

Decision processes

Challenge: Who am I?
Q: Are you alive? A: Yes
Q: Are you famous? A: Yes
…
Q: Are you a tennis player? A: No
Q: Are you a golfer? A: Yes
Q: Are you Tiger Woods? A: Yes

Decision trees

Played rationally, this game has the property that each binary question partitions the space of possible entities. Thus, the structure of the search can be seen as a tree.

Decision trees are encodings of a similar search process. Usually, a wider range of questions is allowed.

Decision trees

In a problem-solving setup we are not dealing with people, but with a finite number of classes for the predicted variable.

But the task is essentially the same: given a set of available questions, narrow down the possibilities until you are confident about the class.

How to choose a question?

Look for the question that most increases your knowledge of the class. We can't tell ahead of time which answer will arise, so take the average over all possible answers, weighted by how probable each answer seems to be.

The maths behind this is either information theory or an approximation to it.
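As a concrete illustration, here is a minimal Python sketch of entropy and expected information gain; it shows the idea rather than Wagon's own implementation, and the toy data and question function are invented for the example.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of the class labels in a partition."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(instances, labels, question):
    """Expected drop in entropy from asking `question`, where `question`
    maps an instance to one of its possible answers; each answer's
    contribution is weighted by how often that answer occurs."""
    partitions = {}
    for x, y in zip(instances, labels):
        partitions.setdefault(question(x), []).append(y)
    expected_after = sum(len(p) / len(labels) * entropy(p)
                         for p in partitions.values())
    return entropy(labels) - expected_after

# Toy usage on weather-style data: how informative is "is it windy?" about play?
data = [({"windy": True}, "no"), ({"windy": True}, "yes"),
        ({"windy": False}, "yes"), ({"windy": False}, "yes")]
xs, ys = zip(*data)
print(information_gain(xs, ys, lambda x: x["windy"]))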

How to be confident

Be confident if a simple majority classifier would achieve acceptable performance on the data in the current partition.

Obvious generalization (Kohavi): be confident if some other baseline classifier would perform well enough.
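As a rough sketch of that stopping test (the helper and threshold are illustrative assumptions, not Wagon's actual criterion):

from collections import Counter

def majority_accuracy(labels):
    """Accuracy of always guessing the most frequent class in this partition."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# e.g. stop splitting when majority_accuracy(partition_labels) is already
# above whatever threshold counts as "acceptable performance".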

Input data

Data format

Each row of the table is an instance. Each column of the table is an attribute (or feature). You also have to say which attribute is the predictee or class variable. In this case we choose Playable.
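For instance, one line of weather.dat might look like the row below. This is a hypothetical example, assuming whitespace-separated values with the fields in the same order as the description file shown on a later slide (outlook, temperature, humidity, windy, play):

sunny 85 85 FALSE no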

Attribute types

We also need to understand the types of the attributes. For the weather data:

Windy and Playable look boolean. Temperature and Humidity look as if they can take any numerical value. Cloudy looks as if it can take any of “sunny”, “overcast”, “rainy”.

Wagon description files

Because guessing the range of an attribute is tricky, Wagon instead requires you to have a “description file”.

Fortunately (especially if you have big data files), Wagon also provides make_wagon_desc, which makes a reasonable guess at the desc file.

For the weather data:

(
(outlook overcast rainy sunny)
(temperature float)
(humidity float)
(windy FALSE TRUE)
(play no yes)
)

(needed a little help: replacing lists of numbers with “float”)

Commands for Wagon

wagon -data weather.dat -desc weather.desc -o weather.tree

This produces unsatisfying results, because we need to tell it that the data set is small by setting -stop 2 (or else it notices that there are < 50 examples in the top-level tree, and doesn't build a tree).

Using the stopping criterion

wagon -data weather.dat \
  -desc weather.desc \
  -o weather.tree \
  -stop 1

This allows the system to learn the exact circumstances under which Play takes on particular values.

Using Wagon to classify

wagon_test -data weather.dat \
  -tree weather.tree \
  -desc weather.desc \
  -predict play

Output data

Over-training

-stop 1 is over-confident, because it might build a leaf for every quirky example.

There will be other quirky examples once we move to new data. Unless we are very lucky, what is learnt from the training set will be too detailed.

Over-training 2

The bigger -stop is, the more errors the system will commit on the training data.

Conversely, the smaller -stop is, the more likely it is that the tree will learn irrelevant detail.

The risk of over-training grows as -stop shrinks, and as the set of available attributes increases.

Why over-training hurts

If you have a complex attribute space, your training data will not cover everything.

Unless you learn general rules, new instances will not be correctly classified.

Also, the system's estimates of how well it is doing will be very optimistic.

This is like doing Linguistics but only on written, academic English...

Setting -stop automatically

Split the training data in two. Use the first half to train, trying several different values for -stop.

Use the second half for cross-validation: measure the performance of the various trees learnt (a sketch of this tuning loop follows the diagram below).

[Diagram: the data split into Train, Tune, and Test portions]
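A minimal Python sketch of that tuning loop; train_fn and eval_fn are hypothetical caller-supplied helpers (for example, thin wrappers around the wagon and wagon_test commands above), not part of Wagon itself:

def tune_stop(train_half, tune_half, train_fn, eval_fn,
              candidate_stops=(1, 2, 5, 10, 20, 50)):
    """Return the -stop value whose tree scores best on the held-out tune half.

    train_fn(data, stop) builds a tree and eval_fn(tree, data) scores it;
    both are supplied by the caller."""
    return max(candidate_stops,
               key=lambda stop: eval_fn(train_fn(train_half, stop), tune_half))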

Cross-validation

If the performance gain generalizes to the cross-validation half, then it probably also generalizes to unseen data.

Any problems?

[Diagram: the same Train / Tune / Test split]

Data efficiency

The train/tune split is wasteful. Reduce the tuning part to 10% of the data and train on 90%. Rotate the 10% through the training data.

[Diagram: Train, Tune, and Test portions, with the Tune slice moved to a new position]

Cross-validation

[Diagram: successive rotations of the split, with the Tune slice occupying a different 10% of the data in each round and Test held out]

Cross-validation

Because the tuning set was 10%, this is 10-fold cross-validation; 20-fold would use 5%.

In the limit (very expensive or small training data) we have “leave one out” cross-validation.
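A minimal sketch of how the folds rotate (plain Python, not tied to Wagon's tooling; any remainder instances simply stay in the training portion):

def k_fold_splits(instances, k=10):
    """Yield (train, tune) pairs; each fold takes one turn as the tuning set."""
    fold_size = len(instances) // k
    for i in range(k):
        tune = instances[i * fold_size:(i + 1) * fold_size]
        train = instances[:i * fold_size] + instances[(i + 1) * fold_size:]
        yield train, tune

# k=10 gives 10-fold cross-validation; k=len(instances) gives leave-one-out.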

Clustering with decision trees

The standard stopping criterion is purity of the classes at the leaves of the trees.

Another criterion uses a distance matrix measuring the dissimilarity of instances. Stop when the groups of instances at the leaves form tight clusters.
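One way to make "tight" concrete, as a sketch only (Wagon's own clustering criterion may differ, and the threshold here is an invented example):

def leaf_is_tight(leaf_indices, distance_matrix, threshold=0.5):
    """True if the mean pairwise distance among the leaf's instances is below
    the threshold; a single instance counts as trivially tight."""
    pairs = [(i, j) for i in leaf_indices for j in leaf_indices if i < j]
    if not pairs:
        return True
    mean_dist = sum(distance_matrix[i][j] for i, j in pairs) / len(pairs)
    return mean_dist < threshold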

What are decision trees?

A decision tree is a classifier. Given an input instance, it inspects the features and delivers a selected class.

But it knows slightly more than this. The set of instances grouped at the leaves may not be a pure class. This set defines a probability distribution over the classes. So a decision tree is a distribution classifier.

There are many other varieties of classifier.

Nearest neighbour(s)

If you have a distance measure, and you have a labelled training set, you can assign a class by finding the class of the nearest labelled instance.

Relying on just one labelled data point could be risky, so an alternative is to consider the classes of k neighbours.

You need to find a suitable distance measure. You might use cross-validation to set an appropriate value of k.
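A minimal k-nearest-neighbour sketch in Python; the distance function is supplied by the caller, and the names are illustrative rather than any particular package's API:

from collections import Counter

def knn_classify(query, training_set, distance, k=3):
    """training_set is a list of (instance, label) pairs; the query gets the
    majority label among its k nearest labelled neighbours."""
    neighbours = sorted(training_set, key=lambda pair: distance(query, pair[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# e.g. Euclidean distance over numeric feature vectors:
# distance = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5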

Bellman's curse

Nearest neighbour classifiers make sense if classes are well-localized in the space defined by the distance measure.

As you move from lines to planes, volumes, and high-dimensional hypervolumes, the chance that you will find enough labelled data points “close enough” decreases.

This is a general problem, not specific to nearest-neighbour, and is known as Bellman's curse of dimensionality

Dimensionality

If we had uniformly spread data, and we wanted to catch 10% of the data, we would need 10% of the range of x in a 1-D space, but 31% of the range of each x and y in a 2-D space and 46% of the range of x,y,z in a cube. In 10 dimensions you need to cover ~80% of the ranges.

In high dimensional spaces, most data points are closer to the boundaries of the space than they are to any other data point

Text problems are very often high-dimensional
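Those percentages follow from the fact that a sub-cube capturing a fraction r of uniformly spread data needs edge length r^(1/d) in d dimensions; a quick Python check reproduces them (up to rounding):

for d in (1, 2, 3, 10):
    edge = 0.10 ** (1 / d)       # edge length of a sub-cube holding 10% of the volume
    print(d, round(100 * edge))  # prints 1 10, 2 32, 3 46, 10 79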

Decision trees in high-D space

Decision trees work by picking on an important dimension and using it to split the instance space into slices of lower dimensionality.

They typically don't use all the dimensions of the input space.

Different branches of the tree may select different dimensions as relevant.

Once the subspace is pure enough, or well enough clustered, the DT is finished.

Cues to class variables

If we have many features, any one could be a useful cue to the class variable. (If the token is a single upper case letter followed by a ., it might be part of A. Name)

If cues conflict, we need to decide which ones to trust. “... S. p. A. In other news”

In general, we may need to take account of combinations of features (the “[A-Z]\.” feature is relevant only if we haven't found abbreviations).
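As a small illustration of the single cue mentioned above (the regular expression and examples are assumptions for illustration, not the lecture's actual feature set):

import re

INITIAL_RE = re.compile(r"^[A-Z]\.$")   # a single upper-case letter followed by "."

def looks_like_initial(token):
    """Cue feature for 'this token might be an initial in a name'."""
    return bool(INITIAL_RE.match(token))

print(looks_like_initial("A."))      # True
print(looks_like_initial("S.p.A."))  # False: a longer abbreviation, a conflicting cue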

Compound cues

Unfortunately there are very many potential compound cues. Training them all separately will throw us into a very high-D space.

The naive Bayes classifier “deals with” this by adopting very strong assumptions about the relation between the features and the underlying class.

Assumption: each feature is independently affected by the class; nothing else matters.

The naïve Bayes classifier

P(F1, F2, ..., Fn | C) ≈ P(F1 | C) P(F2 | C) ... P(Fn | C)

Classify by finding the class with the highest score given the features and this (crass) assumption. Nice property: easy to train, just count the number of times that Fi and the class co-occur.

[Diagram: class node C with arrows to feature nodes F1, F2, ..., Fn]
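A minimal naive Bayes sketch in Python, with raw counts and no smoothing; it illustrates the idea rather than any particular toolkit:

from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature) -> Counter of values
    for x, c in zip(instances, labels):
        for f, v in x.items():
            feature_counts[(c, f)][v] += 1
    return class_counts, feature_counts

def classify_nb(x, class_counts, feature_counts):
    """Pick the class maximising P(C) times the product of the P(Fi | C)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n in class_counts.items():
        score = n / total
        for f, v in x.items():
            score *= feature_counts[(c, f)][v] / n
        if score > best_score:
            best_class, best_score = c, score
    return best_class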

Comments on naïve Bayes

Clearly, the independence assumption is false. All features, relevant or not, get the same chance to contribute. If there are many irrelevant features, they may swamp the real effects we are after.

But it is very simple and efficient, so it can be used in schemes such as boosting that rely on combinations of many slightly different classifiers.

In that context, even simpler classifiers (majority classifier, single rule) can be useful.

Decision trees and classifiers

Attributes and instances
Learning from instances
Over-training
Cross-validation
Dimensionality
Independence assumptions