Chapter 6 – Three Simple Classification Methods
© Galit Shmueli and Peter Bruce 2008
Data Mining for Business Intelligence
Shmueli, Patel & Bruce
Methods & Characteristics
The three methods:
- Naïve rule
- Naïve Bayes
- K-nearest-neighbor

Common characteristics:
- Data-driven, not model-driven
- Make no assumptions about the data
Naïve Rule
- Classify all records as the majority class
- Not a "real" method
- Introduced to serve as a benchmark against which to measure other results
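The naïve rule amounts to a one-line benchmark. A minimal sketch (the `naive_rule` name and the label counts are illustrative, taken from the fraud example later in this chapter):

```python
from collections import Counter

def naive_rule(train_labels):
    """Benchmark classifier: always predict the majority class of the training data."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    return majority_class

# With 6 truthful and 4 fraudulent firms, the naive rule
# predicts "truthful" for every new record.
print(naive_rule(["truthful"] * 6 + ["fraud"] * 4))  # truthful
```

Its accuracy (here 60%) is the floor that any real classifier should beat.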
Naïve Bayes
Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., same values for the predictors)
What is the prevalent class among those records?
Assign that class to your new record
Usage
- Requires categorical variables
- Numerical variables must be binned and converted to categorical
- Can be used with very large data sets
- Example: spell check – the computer attempts to assign your misspelled word to an established "class" (i.e., a correctly spelled word)
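Binning a numerical predictor into categories can be sketched as follows (the `bin_numeric` helper and the revenue cutoff are hypothetical):

```python
def bin_numeric(values, edges, labels):
    """Convert a numeric predictor to categories by binning.

    `edges` holds the upper bound of each bin except the last,
    so len(labels) == len(edges) + 1.
    """
    binned = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                binned.append(label)
                break
        else:
            # Larger than every edge: falls in the last (open-ended) bin.
            binned.append(labels[-1])
    return binned

# Hypothetical firm sizes (in $M revenue) binned into "small"/"large":
print(bin_numeric([12.0, 250.0, 40.0], edges=[100.0], labels=["small", "large"]))
# ['small', 'large', 'small']
```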
Exact Bayes Classifier

- Relies on finding other records that share the same predictor values as the record to be classified
- Goal: find the "probability of belonging to class C, given specified values of predictors"
- Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values
Solution – Naïve Bayes

- Assume independence of the predictor variables (within each class)
- Use the multiplication rule
- Find the same probability that a record belongs to class C, given the predictor values, without limiting the calculation to records that share all of those values
Example: Financial Fraud
Target variable: Audit finds fraud, no fraud
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)
| Charges? | Size  | Outcome  |
|----------|-------|----------|
| y        | small | truthful |
| n        | small | truthful |
| n        | large | truthful |
| n        | large | truthful |
| n        | small | truthful |
| n        | small | truthful |
| y        | small | fraud    |
| y        | large | fraud    |
| n        | large | fraud    |
| y        | large | fraud    |
Exact Bayes Calculations

- Goal: classify (as "fraudulent" or "truthful") a small firm with charges filed
- There are 2 firms like that in the data: one fraudulent, the other truthful
- P(fraud | charges = y, size = small) = 1/2 = 0.50
- Note: the calculation is limited to the two firms matching those characteristics
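The exact Bayes estimate can be reproduced directly from the 10-firm table; a minimal sketch (the function name is illustrative):

```python
# The 10 firms from the fraud table: (charges, size, outcome).
records = [
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"), ("y", "large", "fraud"),
    ("n", "large", "fraud"), ("y", "large", "fraud"),
]

def exact_bayes(records, charges, size, target="fraud"):
    """P(class = target | exact match on both predictor values)."""
    matches = [r for r in records if r[0] == charges and r[1] == size]
    return sum(r[2] == target for r in matches) / len(matches)

# Only 2 of the 10 firms are small with charges filed; 1 is fraudulent.
print(exact_bayes(records, "y", "small"))  # 0.5
```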
Naïve Bayes Calculations

Goal: still classifying a small firm with charges filed. Compute two quantities:

- Proportion of "charges = y" among frauds, times proportion of "small" among frauds, times proportion of frauds: 3/4 × 1/4 × 4/10 = 0.075
- Proportion of "charges = y" among truthfuls, times proportion of "small" among truthfuls, times proportion of truthfuls: 1/6 × 4/6 × 6/10 = 0.067

P(fraud | charges = y, small) = 0.075 / (0.075 + 0.067) = 0.53
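The naïve Bayes calculation above can be sketched on the same 10-firm data set (the `class_score` name is illustrative):

```python
# The 10 firms from the fraud table: (charges, size, outcome).
records = [
    ("y", "small", "truthful"), ("n", "small", "truthful"),
    ("n", "large", "truthful"), ("n", "large", "truthful"),
    ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"), ("y", "large", "fraud"),
    ("n", "large", "fraud"), ("y", "large", "fraud"),
]

def class_score(records, cls, charges, size):
    """P(charges | cls) * P(size | cls) * P(cls), assuming predictor independence."""
    in_class = [r for r in records if r[2] == cls]
    p_charges = sum(r[0] == charges for r in in_class) / len(in_class)
    p_size = sum(r[1] == size for r in in_class) / len(in_class)
    prior = len(in_class) / len(records)
    return p_charges * p_size * prior

fraud = class_score(records, "fraud", "y", "small")        # 3/4 * 1/4 * 4/10 = 0.075
truthful = class_score(records, "truthful", "y", "small")  # 1/6 * 4/6 * 6/10 ≈ 0.067
print(round(fraud / (fraud + truthful), 2))  # 0.53
```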
Naïve Bayes, cont.

- Note that the probability estimate (0.53) does not differ greatly from the exact one (0.50)
- All records are used in the calculations, not just those matching the predictor values
- This makes the calculations practical in most circumstances
- Relies on the assumption of independence between predictor variables within each class
Independence Assumption
- Not strictly justified (variables are often correlated with one another)
- Often "good enough"
Advantages
- Handles purely categorical data well
- Works well with very large data sets
- Simple & computationally efficient
Shortcomings
- Requires a large number of records
- Problematic when a predictor category is not present in the training data
- In that case, it assigns 0 probability to the response, ignoring the information in the other variables
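This zero-probability failure mode can be illustrated with the same independence-based calculation; the small data set here is hypothetical, constructed so that no fraud record has charges = y:

```python
def class_score(records, cls, charges, size):
    """P(charges | cls) * P(size | cls) * P(cls) under the independence assumption."""
    in_class = [r for r in records if r[2] == cls]
    p_charges = sum(r[0] == charges for r in in_class) / len(in_class)
    p_size = sum(r[1] == size for r in in_class) / len(in_class)
    return p_charges * p_size * len(in_class) / len(records)

# Hypothetical training data where no fraud record has charges = "y":
records = [("n", "small", "fraud"), ("n", "large", "fraud"),
           ("y", "small", "truthful"), ("n", "small", "truthful")]

# The fraud score is 0 because P(charges = y | fraud) = 0,
# even though size = "small" does occur among the frauds.
print(class_score(records, "fraud", "y", "small"))  # 0.0
```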
On the other hand…
- Probability rankings are more accurate than the actual probability estimates
- Good for applications using lift (e.g., response to mailing), less so for applications requiring probabilities (e.g., credit scoring)
K-Nearest Neighbors
Basic Idea
For a given record to be classified, identify nearby records
“Near” means records with similar predictor values X1, X2, … Xp
Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)
How to Measure “nearby”?
The most popular distance measure is Euclidean distance
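A minimal sketch of Euclidean distance between two records (the two records shown are rows of the riding-mower data set later in this chapter):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two records given as tuples of predictor values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two (Income, Lot_Size) records from the riding-mower data set:
print(round(euclidean((60.0, 18.4), (64.8, 21.6)), 2))  # 5.77
```

In practice, predictors are often rescaled first so that a large-scale predictor such as Income does not dominate the distance.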
Choosing k

- k is the number of nearby neighbors used to classify the new record
- k = 1 means use the single nearest record
- k = 5 means use the 5 nearest records
- Typically, choose the value of k that has the lowest error rate on the validation data
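The classification step can be sketched as follows, using a few rows of the riding-mower data set shown later in this chapter (the `knn_classify` name is illustrative):

```python
from collections import Counter

def knn_classify(train, new_record, k):
    """Classify new_record by majority vote among its k nearest training records.

    `train` is a list of (predictor_tuple, label) pairs.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    neighbors = sorted(train, key=lambda rec: dist(rec[0], new_record))[:k]
    vote, _ = Counter(label for _, label in neighbors).most_common(1)[0]
    return vote

# ((Income, Lot_Size), Ownership) rows from the riding-mower data set:
train = [((60.0, 18.4), "owner"), ((85.5, 16.8), "owner"),
         ((64.8, 21.6), "owner"), ((75.0, 19.6), "non-owner"),
         ((52.8, 20.8), "non-owner"), ((64.8, 17.2), "non-owner")]
print(knn_classify(train, (66.0, 20.0), k=3))  # owner
```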
Low k vs. High k
Low values of k (1, 3 …) capture local structure in data (but also noise)
High values of k provide more smoothing, less noise, but may miss local structure
Note: the extreme case of k = n (i.e. the entire data set) is the same thing as “naïve rule” (classify all records according to majority class)
Example: Riding Mowers
Data: 24 households classified as owning or not owning riding mowers
Predictors = Income, Lot Size
| Income | Lot_Size | Ownership |
|-------:|---------:|-----------|
|   60.0 |     18.4 | owner     |
|   85.5 |     16.8 | owner     |
|   64.8 |     21.6 | owner     |
|   61.5 |     20.8 | owner     |
|   87.0 |     23.6 | owner     |
|  110.1 |     19.2 | owner     |
|  108.0 |     17.6 | owner     |
|   82.8 |     22.4 | owner     |
|   69.0 |     20.0 | owner     |
|   93.0 |     20.8 | owner     |
|   51.0 |     22.0 | owner     |
|   81.0 |     20.0 | owner     |
|   75.0 |     19.6 | non-owner |
|   52.8 |     20.8 | non-owner |
|   64.8 |     17.2 | non-owner |
|   43.2 |     20.4 | non-owner |
|   84.0 |     17.6 | non-owner |
|   49.2 |     17.6 | non-owner |
|   59.4 |     16.0 | non-owner |
|   66.0 |     18.4 | non-owner |
|   47.4 |     16.4 | non-owner |
|   33.0 |     18.8 | non-owner |
|   51.0 |     14.0 | non-owner |
|   63.0 |     14.8 | non-owner |
XLMiner Output
- For each record in the validation data (6 records), XLMiner finds neighbors among the training data (18 records)
- Each record is scored for k = 1, 2, …, 18
- The best k appears to be k = 8
- k = 9, k = 10, and k = 14 also share the low error rate, but it is best to choose the lowest k
| Value of k | % Error, Training | % Error, Validation |
|-----------:|------------------:|--------------------:|
|          1 |              0.00 |               33.33 |
|          2 |             16.67 |               33.33 |
|          3 |             11.11 |               33.33 |
|          4 |             22.22 |               33.33 |
|          5 |             11.11 |               33.33 |
|          6 |             27.78 |               33.33 |
|          7 |             22.22 |               33.33 |
|          8 |             22.22 |  16.67 <--- best k  |
|          9 |             22.22 |               16.67 |
|         10 |             22.22 |               16.67 |
|         11 |             16.67 |               33.33 |
|         12 |             16.67 |               16.67 |
|         13 |             11.11 |               33.33 |
|         14 |             11.11 |               16.67 |
|         15 |              5.56 |               33.33 |
|         16 |             16.67 |               33.33 |
|         17 |             11.11 |               33.33 |
|         18 |             50.00 |               50.00 |
Using K-NN for Prediction (for Numerical Outcome)
Instead of “majority vote determines class” use average of response values
May be a weighted average, weight decreasing with distance
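The prediction variant can be sketched as follows; inverse-distance weighting is one common choice for "weight decreasing with distance", and the training rows here are hypothetical (income predicting lot size):

```python
def knn_predict(train, new_record, k):
    """Predict a numeric outcome as the distance-weighted average of the
    responses of the k nearest neighbors.

    `train` is a list of (predictor_tuple, numeric_response) pairs.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    neighbors = sorted(train, key=lambda rec: dist(rec[0], new_record))[:k]
    # Inverse-distance weights; the small constant avoids division by zero
    # when a neighbor coincides exactly with the new record.
    weights = [1.0 / (dist(p, new_record) + 1e-9) for p, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Hypothetical rows: ((Income,), Lot_Size). Predict lot size for income 62.0:
train = [((60.0,), 18.4), ((64.8,), 21.6), ((85.5,), 16.8), ((51.0,), 22.0)]
print(round(knn_predict(train, (62.0,), k=3), 2))  # 19.95
```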
Advantages

- Simple
- No assumptions required about normal distributions, etc.
- Effective at capturing complex interactions among variables without having to define a statistical model
Shortcomings

- The required size of the training set increases exponentially with the number of predictors, p
- This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other)
- In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s)
- These problems constitute the "curse of dimensionality"
Dealing with the Curse
- Reduce the dimension of the predictors (e.g., with PCA)
- Use computational shortcuts that settle for "almost nearest neighbors"
Summary

- Naïve rule: benchmark
- Naïve Bayes and k-NN are two variations on the same theme: "classify a new record according to the class of similar records"
- No statistical models involved
- These methods pay attention to complex interactions and local structure
- Computational challenges remain