Transcript of Data Mining – Algorithms: Linear Models Chapter 4, Section 4.6.

Page 1:

Data Mining – Algorithms: Linear Models

Chapter 4, Section 4.6

Page 2:

Numeric Attributes

• Numeric prediction and/or numeric attributes as predictors
• Linear regression is a well-established statistical technique
– Designed to predict a numeric value based on numeric attributes
– Determines the optimal set of coefficients for a linear equation (see the sketch after this list):
• pred = w0 + w1*a1 + w2*a2 + … + wn*an
– Optimal means the sum of squared prediction errors is minimized
– For data mining, this would be done on training data so that it can be tested on test data
– I hope that a CSC major could read a statistics book and then write the code to do this
– However, there is no need to do this, since this method is so widely available, unless you are seeking to create an improved version of it
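A minimal sketch of fitting those coefficients, assuming NumPy; the attribute values and targets below are made up for illustration and are not from the course spreadsheet:

```python
import numpy as np

# Hypothetical toy data: rows are training instances, columns are numeric attributes.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])
y = np.array([3.1, 2.4, 5.0, 7.2])  # numeric values to predict

# Prepend a column of 1s so that w0 acts as the constant term.
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares: choose w = (w0, w1, ..., wn) to minimize the sum of
# squared prediction errors on the training data.
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

# pred = w0 + w1*a1 + ... + wn*an for a new instance
new = np.array([1.0, 2.5, 1.0])  # the leading 1 multiplies w0
print("weights:", w, "prediction:", new @ w)
```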

Page 3:

Example

• <Show Basketball Spreadsheet – Baskball sheet>
• NOTE – input values, weights, prediction vs. actual
• <Show testReg sheet – test on separate instances>
• NOTE – how it did – prediction vs. actual – difference, correlation

Page 4:

Using Regression for Classification

• Perform a regression for each class (see the sketch after this list)
• Set the output to be predicted = 1 for training instances that belong to the class
• Set the output to be predicted = 0 for training instances that do NOT belong to the class
• Do this for each class, and you will have a "membership function" equation for each class
• On test, plug the new instance into each equation; the highest value produced will be the prediction to make
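A minimal sketch of this one-regression-per-class scheme, reusing the least-squares idea above; the class names and data are hypothetical:

```python
import numpy as np

def fit_membership_functions(X, labels, classes):
    """One least-squares model per class: the target is 1 if the training
    instance belongs to the class, 0 if it does not."""
    X1 = np.column_stack([np.ones(len(X)), X])  # add intercept column
    weights = {}
    for c in classes:
        y = (labels == c).astype(float)  # 1 for members, 0 otherwise
        w, *_ = np.linalg.lstsq(X1, y, rcond=None)
        weights[c] = w
    return weights

def classify(weights, instance):
    """Plug the instance into each membership equation; the highest value wins."""
    x1 = np.concatenate([[1.0], instance])
    return max(weights, key=lambda c: x1 @ weights[c])

# Hypothetical toy data: two attributes, three classes.
X = np.array([[0.2, 1.0], [0.3, 0.8], [2.0, 2.2],
              [2.1, 2.4], [4.0, 0.5], [4.2, 0.7]])
labels = np.array(["low", "low", "medium", "medium", "high", "high"])
w = fit_membership_functions(X, labels, np.unique(labels))
print(classify(w, np.array([2.0, 2.0])))  # expect "medium"
```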

Page 5:

Example

• <Show discretized sheet>
• NOTE – prep of data – discretized into low, medium, high
• NOTE – weights for the 3 regressions: high, medium, low
• <Show Test sheet>
• NOTE – calculations for high, medium, low
• (Doesn't do that well; I suspect that the data may not be from the same source (NBA), and that the discretization was a bit of a problem (very few low instances))

Page 6:

More Sophisticated

• Do as many pairwise competitions as necessary (see the sketch after this list)
• Training – two classes against each other:
– Temporarily toss training instances that are not one of the two
– Set output = 1 for the class to be predicted and –1 for the other
• Test – do all pairwise competitions; the winner of each gets a vote
– E.g., say:
– Medium beats High
– Medium beats Low
– High beats Low
– Medium wins, 2-1-0
• A conservative approach would be to predict nothing if no class dominates the voting
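A minimal sketch of the pairwise competitions with voting, under the same hypothetical NumPy setup as the earlier sketches:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def fit_pairwise(X, labels, classes):
    """One least-squares model per pair of classes, trained only on
    instances of those two classes: output +1 for the first, -1 for the second."""
    X1 = np.column_stack([np.ones(len(X)), X])
    models = {}
    for a, b in combinations(classes, 2):
        mask = (labels == a) | (labels == b)  # temporarily toss other classes
        y = np.where(labels[mask] == a, 1.0, -1.0)
        w, *_ = np.linalg.lstsq(X1[mask], y, rcond=None)
        models[(a, b)] = w
    return models

def classify(models, instance, conservative=False):
    """Run every pairwise competition; each winner gets a vote, and the
    most-voted class is predicted. With conservative=True, return None
    (predict nothing) when no class dominates the vote."""
    x1 = np.concatenate([[1.0], instance])
    votes = Counter()
    for (a, b), w in models.items():
        votes[a if x1 @ w > 0 else b] += 1
    ranked = votes.most_common()
    if conservative and len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Same hypothetical data as before; Medium should win its competitions.
X = np.array([[0.2, 1.0], [0.3, 0.8], [2.0, 2.2],
              [2.1, 2.4], [4.0, 0.5], [4.2, 0.7]])
labels = np.array(["low", "low", "medium", "medium", "high", "high"])
m = fit_pairwise(X, labels, np.unique(labels))
print(classify(m, np.array([2.0, 2.0]), conservative=True))
```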

Page 7:

In Context

• Has been used for decades for various applications (e.g. social science research)

• Bias – only searches for linear equations – no squares, cubes, etc.

• To work well, the data must fit a linear model – e.g., for classification, classes must be "linearly separable" – able to be divided by a line (in 2D; a plane in 3D; a hyperplane in higher dimensions)

• To work well, attributes should not be highly correlated with each other

• Depends on numeric attributes

Page 8:

Let’s Look at WEKA

• Linear Regression with Basketball data

• No correctness measures for numeric prediction; instead:
– Correlation
– Error
• Discretize points per minute
– Try logistic regression – a categorical prediction approach (see the sketch below)
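This is not the WEKA run itself; as a rough stand-in, here is a scikit-learn sketch of the same idea on synthetic data: discretize the numeric target into low/medium/high, then fit a logistic regression to predict the category.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))  # stand-in numeric attributes
ppm = X @ np.array([0.5, 0.2, 0.1]) + rng.normal(scale=0.1, size=60)

# Discretize "points per minute" into low / medium / high by terciles.
bins = np.quantile(ppm, [1 / 3, 2 / 3])
y = np.digitize(ppm, bins)  # 0 = low, 1 = medium, 2 = high

# Logistic regression: a categorical prediction approach.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```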

Page 9:

End Section 4.6