
Knowledge Discovery and Data Mining
Lecture 05 - Tree methods - Introduction

Tom Kelsey

School of Computer Science
University of St Andrews

http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk


Administration

P2 description needs to be agreed:
Forest Cover Type Prediction
Allstate Purchase Prediction Challenge
Don't Get Kicked!
Claim Prediction Challenge (Allstate)
KDD Cup 2013 - Author Disambiguation Challenge (Track 2)


Validation Recap

Validation analysis example

[Figures: validation analysis example plots]

Validation Recap

Response variable: the y variable – the variable(s) we seek to predict. If categorical, we are classifying.
Covariates: the x variable(s) – the variable(s) we think might be used to predict the response, a.k.a. attributes or predictor variables.
Covariate space: conceptually, the space defined by our covariates, e.g. the x values give coordinates of observations in a space.
Mean Squared Error (MSE): the theoretical expected/average squared error between the "true" quantity and our "method" of estimating it (the estimator). In practice we use the estimated MSE = (1/n) Σ_i (y_i − ŷ_i)^2.
Supervised/unsupervised learning: supervised learning means we know the response values for model building (the most common case). Unsupervised learning does not (e.g. clustering).
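As a minimal sketch of the estimated MSE above (the numbers are invented for illustration, not from the lecture):

```python
import numpy as np

# Invented observed responses y_i and fitted values y_hat_i
y     = np.array([3.1, 2.7, 4.0, 5.2, 3.9])
y_hat = np.array([3.0, 2.9, 4.3, 4.8, 4.1])

# Estimated MSE = (1/n) * sum_i (y_i - y_hat_i)^2
mse = np.mean((y - y_hat) ** 2)
print(mse)
```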


Selection & Validation Summary

Process
1 Choose a measure for models
2 Choose a (some) candidate model(s)
3 For each model, find the number of parameters giving optimal generalisation MSE (a toy sketch follows below)

Notes
Covariates are the variables we have to potentially predict the response
These may be represented by functions in our X design matrix, so by 1 or more columns
Parsimony is achieved by reducing the size of the design matrix / number of parameters to estimate
Clearly there is a tradeoff: few covariates to achieve parsimony, possibly many parameters to achieve good generalisation error
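A toy sketch of step 3, assuming polynomial degree as a stand-in for "number of parameters" and a simple 70/30 train/test split as the validation scheme (the data and the split are my assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: a smooth signal plus noise
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Simple train/test split standing in for a proper validation scheme
train = rng.random(x.size) < 0.7
test = ~train

# For each candidate model size (polynomial degree), estimate generalisation MSE
for degree in range(1, 10):
    coefs = np.polyfit(x[train], y[train], degree)   # fit on training data
    y_hat = np.polyval(coefs, x[test])               # predict held-out data
    mse = np.mean((y[test] - y_hat) ** 2)            # estimated generalisation MSE
    print(degree, round(mse, 4))
```

The degree with the smallest held-out MSE is the one this process would keep.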

Tree methods

Need-to-knows
1 How recursive binary partitioning of R^p works.
2 How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3 How to go from a series of binary splitting rules to a tree representation and vice versa.


Supervised/Unsupervised learning

http://practiceovertheory.com/blog/2010/02/15/machine-learning-who-s-the-boss/


A classification problem....

For the Titanic worked examples in week 1, the models I described differed in subtle ways.

Regression and Tree both returned probabilities rather than the 1 and 0 returned by random forests.
I used random forests to impute missing values, estimate the relative importance of covariates, and estimate the misclassification rate.
The tree model supplied a confusion matrix, allowing more detailed error analysis than a simple misclassification rate (it is easy to derive confusion matrices for the other two – see the sketch below).
I used straightforward training/test data validation – I didn't examine the overfit/underfit tradeoff in any detail.
I performed naïve covariate selection using my understanding of the problem domain – this is often a source of significant error in model development & validation.
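A hedged sketch of the parenthetical point: threshold the predicted probabilities at 0.5 and cross-tabulate against the observed labels. The arrays below are invented; none of the Titanic-specific values come from the lecture.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented observed labels (1 = survived) and model-predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat  = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3])

# Turn probabilities into 0/1 predictions by thresholding at 0.5,
# then cross-tabulate with the true labels
y_pred = (p_hat >= 0.5).astype(int)
print(confusion_matrix(y_true, y_pred))

# The misclassification rate follows from the same comparison
print(np.mean(y_pred != y_true))
```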


Historical perspective

1960: Automatic Interaction Detection (AID), related to the clustering literature
THAID, CHAID
ID3, C4.5, C5.0
CART: 1984, Breiman et al.


Recursive partitioning on R^p

Take an n × p matrix X, defining a p-dimensional space R^p. We wish to apply a simple rule recursively (a fitted-tree sketch follows this list):

1 Select a variable x_i and split on the basis of a single value x_i = a. We now have two spaces: x_i ≤ a and x_i > a.
2 Select one of the current sub-spaces, select a variable x_j, and split this sub-space on the basis of a single value x_j = b.
3 Repeatedly select sub-spaces and split in two.
4 Note that this process can extend beyond just the two dimensions represented by x_1 and x_2. If this were 3 dimensions (i.e. include an x_3) then the partitions would be cubes. Beyond this the partitions are conceptually hyper-cubes.
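One way to see the recursion in practice (my own sketch, with invented data): fit a small regression tree on two covariates and print its splitting rules; each printed level is one binary split of a sub-space.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)

# Invented n x p design matrix X (here p = 2) and a response with a step structure
X = rng.uniform(0, 10, size=(300, 2))
y = np.where(X[:, 0] > 5, 2.0, -1.0) + np.where(X[:, 1] > 3, 1.0, 0.0)
y = y + rng.normal(scale=0.1, size=y.size)

# A shallow tree: each internal node is one binary split of the form x_i <= a / x_i > a
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
```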


An arbitrary 2-D space

[Figure: an arbitrary 2-D space with axes X1 and X2]


Space splitting

[Figure: a single split at X1 = a, giving regions Y = f(X1 ≥ a) and Y = f(X1 < a)]


Space splitting

[Figure: split of a subspace – the X1 < a region is split at X2 = b, giving Y = f(X2 ≥ b & X1 < a) and Y = f(X2 < b & X1 < a); the X1 ≥ a region remains Y = f(X1 ≥ a)]


Space splitting

[Figure: further splitting of a subspace – the X1 ≥ a region is split at X2 = c, giving Y = f(X2 ≥ c & X1 ≥ a) and Y = f(X2 < c & X1 ≥ a), alongside the earlier regions Y = f(X2 ≥ b & X1 < a) and Y = f(X2 < b & X1 < a)]


Space splitting

[Figure: further splitting – the region with X1 ≥ a & X2 ≥ c is split at X1 = d, giving Y = f(X2 ≥ c & a ≤ X1 < d) and Y = f(X2 ≥ c & X1 ≥ d); the regions Y = f(X2 < c & X1 ≥ a), Y = f(X2 ≥ b & X1 < a) and Y = f(X2 < b & X1 < a) are unchanged]


Space splitting

[Figure: a potential 3-D surface – the response Z plotted over the X1–X2 plane]


Binary partitioning process as a tree

[Figure: an example tree diagram for a contrived partitioning – the root splits on X1 at a; the X1 < a branch splits on X2 at b into terminal nodes C1 and C2; the X1 ≥ a branch splits on X2 at c into terminal node C3 and a further split at d into terminal nodes C4 and C5]


Tree representation

The splitting points are called nodes – these have a binary splitting rule associated with them.
The two new spaces created by the split are represented by lines leaving the node; these are referred to as the branches. A tree with one split is a stump.
The nodes at the bottom of the diagram are referred to as the terminal nodes and collectively represent all the final partitions/subspaces of the data.
You can 'drop' a vector x down the tree to determine which subspace this coordinate falls into (a minimal sketch follows).
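A minimal sketch of 'dropping' a point down a tree; this is my own illustration rather than lecture code, and the split points a, b, c, d are chosen arbitrarily to mirror the contrived partitioning above.

```python
# Each internal node stores (variable index, split value, left subtree, right subtree);
# a terminal node is just a region label.
a, b, c, d = 4.0, 2.0, 6.0, 8.0   # illustrative split points

tree = (0, a,                      # root: split on x1 at a
        (1, b, "C1", "C2"),        # x1 < a: split on x2 at b
        (1, c, "C3",               # x1 >= a: split on x2 at c
         (0, d, "C4", "C5")))      # x1 >= a and x2 >= c: split on x1 at d

def drop(x, node):
    """Follow the binary splitting rules until a terminal node (region label) is reached."""
    while isinstance(node, tuple):
        var, split, left, right = node
        node = left if x[var] < split else right
    return node

print(drop([3.0, 1.0], tree))   # falls into C1
print(drop([9.0, 7.0], tree))   # falls into C5
```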


Exercise

The following is the summary of a series of splits in R^2:

(x1 > 10)
(x1 ≤ 10) & (x2 ≤ 5)
(x1 ≤ 10) & (x2 > 5) & (x2 ≤ 10)

1 Sketch the progression of splits in 2 dimensions.
2 Produce a tree that summarises this series of splits.
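If it helps to check your sketch, the listed rules can be encoded directly as a function from a point to its region. This is my own illustration; the fourth, unlisted region is simply labelled "remaining".

```python
def region(x1, x2):
    """Assign a point to one of the regions defined by the splits above."""
    if x1 > 10:
        return "R1: x1 > 10"
    if x2 <= 5:
        return "R2: x1 <= 10 and x2 <= 5"
    if x2 <= 10:
        return "R3: x1 <= 10 and 5 < x2 <= 10"
    return "remaining region: x1 <= 10 and x2 > 10"

print(region(12, 3))   # R1
print(region(4, 2))    # R2
print(region(4, 7))    # R3
print(region(4, 15))   # remaining region
```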


Tree construction

We can model the response as a constant for each region (or, equivalently, leaf).
If we are minimising sums of squares, the optimal constant for a region/leaf is the average of the observed outputs for all inputs associated with that region/leaf.
Computing the optimal binary partition for given inputs and output is computationally intractable in general.
A greedy algorithm is used that finds an optimal variable and split point given an initial choice (or guess), then continues for sub-regions (a one-step sketch follows).
This is quick to compute (sums of averages), but errors at the root lead to errors at the leaves.
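A hedged sketch of one greedy step, assuming squared-error loss: for every variable and candidate split point, use the region averages on each side and keep the split with the smallest total sum of squares. The data are invented; a full implementation would repeat this for each resulting sub-region.

```python
import numpy as np

def best_split(X, y):
    """One greedy step: find the (variable, value) pair minimising the within-region sum of squares."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for a in np.unique(X[:, j]):
            left, right = y[X[:, j] <= a], y[X[:, j] > a]
            if len(left) == 0 or len(right) == 0:
                continue
            # The optimal constant per region is its mean, so the cost is the residual sum of squares
            ss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if ss < best[2]:
                best = (j, a, ss)
    return best

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(100, 2))                          # invented covariates
y = np.where(X[:, 0] > 6, 5.0, 1.0) + rng.normal(scale=0.2, size=100)
print(best_split(X, y))                                        # expect a split on variable 0 near 6
```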


How big should the tree be?

Tradeoff between bias and variance

Small tree – high bias, low variance: not big enough to capture the correct model structure.
Large tree – low bias, high variance: overfitting – in the extreme case each input is in exactly one region.
The optimal size should be adaptively chosen from the data.

We could stop splitting based on a threshold for decreases in the sum of squares, but this might rule out a useful split further down the tree.
Instead we construct a tree that is probably too large, and prune it by cost-complexity calculations – next lecture (a preview sketch follows).
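As a preview, here is a sketch under the assumption that scikit-learn's cost-complexity pruning is an acceptable stand-in for the calculations covered next lecture: grow a deliberately large tree on invented data, then inspect the sequence of pruned trees indexed by the complexity parameter alpha.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(400, 2))                        # invented data
y = np.sin(6 * X[:, 0]) + rng.normal(scale=0.3, size=400)

# Grow a tree that is probably too large ...
big_tree = DecisionTreeRegressor(min_samples_leaf=2).fit(X, y)

# ... then compute the cost-complexity pruning path: each alpha corresponds
# to a pruned subtree, trading tree size against training error
path = big_tree.cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeRegressor(ccp_alpha=alpha).fit(X, y)
    print(round(float(alpha), 5), pruned.get_n_leaves())
```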


Regression trees

Consider our general regression problem (note that this can also be classification):

y = f(X) + e

and the usual approximation model (linear in its parameters):

y = Xβ + e

'Standard' interactions of the form β_p (X1·X2)

These are simple in form and quite hard to interpret succinctly.
What is probably the simplest interaction form to interpret? Recursive binary splitting rules for the covariate space (a design-matrix sketch follows).
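To make the 'standard' interaction concrete, a small sketch with invented data and ordinary least squares via numpy: the product column X1·X2 is just another column of the design matrix, with its own coefficient β_p.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept, the two covariates, and the interaction column X1*X2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))   # roughly [1.0, 2.0, -1.0, 0.5]
```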


Advantages of tree models

These apply to tree models in general, and to CART in particular.

Nonparametric – no probabilistic assumptions.
Automatically performs variable selection – important variables appear at or near the root.
Any combination of continuous/discrete variables is allowed – in the Titanic example there was no need to specify that the response is categorical – so we can automatically bin massively categorical variables into a few categories, e.g. zip code, make/model, etc.


Advantages of tree models

Discovers interactions among variables.
Handles missing values automatically – using surrogate splits.
Invariant to monotonic transformations of the predictive variables.
Not sensitive to outliers in the predictive variables.
Easy to spot when CART is struggling to capture a linear relationship (and therefore might not be suitable) – repeated splits on the same variable.
Good for data exploration, visualisation and multidisciplinary discussion – in the Titanic example it gives hard values for "child" to support the heuristic "women & children first".


Disadvantages of tree models

Discrete output values rather than continuous – one response per terminal node, of which there are finitely many.
Trees can be large and hence hard to interpret.
Can be unstable when covariates are correlated – slightly different data give completely different trees.
Not good for describing linear relationships.
Not always the best predictive model – might be outperformed by NN, RF, SVM, etc.


Tree methods

Need-to-knows
1 How recursive binary partitioning of R^p works.
2 How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3 How to go from a series of binary splitting rules to a tree representation and vice versa.
