Attribute Selection


Transcript of Attribute Selection

Page 1: Attribute Selection

Fall 2003 Data Mining

Exploratory Data Mining and Data Preparation

Page 2: Attribute Selection


The Data Mining Process

[Diagram: the data mining process as a cycle around the data - business understanding, data preparation, modeling, evaluation, and deployment.]

Page 3: Attribute Selection


Exploratory Data Mining

Preliminary process
Data summaries
- Attribute means
- Attribute variation
- Attribute relationships

Visualization

Page 4: Attribute Selection


Select an attribute

Summary Statistics

Possible problems:
• Many missing values (16%)
• No examples of one value

Visualization

Appears to be a good predictor of the class
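As a rough, plain-Python sketch (not Weka) of the kind of summary statistics inspected here, the snippet below computes the percentage of missing values and the value counts for one nominal attribute; the column of values is made up for illustration.

# Minimal sketch, not Weka: summary statistics for one nominal attribute.
# The values are made up; None marks a missing value.
values = ["t", "f", None, "t", "t", None, "f", "t", "t", "f", "t", "t"]

missing = sum(1 for v in values if v is None)
print("missing: %.0f%%" % (100.0 * missing / len(values)))   # here: 17%

counts = {}
for v in values:
    if v is not None:
        counts[v] = counts.get(v, 0) + 1
print("value counts:", counts)   # a value with no examples simply never appears here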

Page 5: Attribute Selection


Page 6: Attribute Selection


Exploratory DM Process

For each attribute:
Look at data summaries
- Identify potential problems and decide if an action needs to be taken (may require collecting more data)
Visualize the distribution
- Identify potential problems (e.g., one dominant attribute value, even distribution, etc.)
Evaluate usefulness of attributes

Page 7: Attribute Selection


Weka Filters

Weka has many filters that are helpful in preprocessing the data
Attribute filters: add, remove, or transform attributes
Instance filters: add, remove, or transform instances
Process: choose the filter from the drop-down menu, edit its parameters (if any), and apply it

Page 8: Attribute Selection


Data Preprocessing

Data cleaning: missing values, noisy or inconsistent data
Data integration/transformation
Data reduction: dimensionality reduction, data compression, numerosity reduction
Discretization

Page 9: Attribute Selection


Data Cleaning

Missing values
- Weka reports the % of missing values
- Can use the filter ReplaceMissingValues
Noisy data
- Due to uncertainty or errors
- Weka reports unique values
- Useful filters include RemoveMisclassified and MergeTwoValues
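As a minimal plain-Python sketch of what mean imputation looks like (the idea behind a filter such as ReplaceMissingValues, which substitutes means for numeric attributes and modes for nominal ones); the numeric column below is made up:

# Minimal sketch (plain Python, not Weka): replace missing numeric values
# with the attribute mean. None marks a missing value; the column is made up.
values = [64.0, None, 70.0, 71.0, None, 85.0]

known = [v for v in values if v is not None]
mean = sum(known) / len(known)          # mean of the observed values
cleaned = [mean if v is None else v for v in values]
print(cleaned)                          # missing entries replaced by 72.5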

Page 10: Attribute Selection


Data Transformation

Why transform data?
Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate.
Normalizing data. Having attributes on the same approximate scale helps many data mining algorithms (hence better models).
Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (hence better models).
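A minimal plain-Python sketch of the first two kinds of transformation just listed, on made-up columns (Weka's AddExpression, Normalize, and Standardize filters cover similar ground):

# Made-up numeric columns used only for illustration.
width  = [2.0, 4.0, 6.0, 8.0]
height = [1.0, 1.0, 3.0, 2.0]

# Combine attributes: the ratio may be more useful than the two originals.
ratio = [w / h for w, h in zip(width, height)]

# Normalize width to the [0, 1] range (min-max scaling).
lo, hi = min(width), max(width)
normalized = [(w - lo) / (hi - lo) for w in width]

# Standardize width to zero mean and unit standard deviation.
mean = sum(width) / len(width)
std = (sum((w - mean) ** 2 for w in width) / len(width)) ** 0.5
standardized = [(w - mean) / std for w in width]

print(ratio)
print(normalized)
print(standardized)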

Page 11: Attribute Selection


Weka Filters

The data transformation filters in Weka include: Add, AddExpression, MakeIndicator, NumericTransform, Normalize, and Standardize.

Page 12: Attribute Selection


Discretization

Discretization reduces the number of values for a continuous attribute
Why?
- Some methods can only use nominal data (e.g., the ID3 and Apriori algorithms in Weka)
- Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)

Page 13: Attribute Selection


Unsupervised Discretization

Unsupervised - does not account for classes
Equal-interval binning
Equal-frequency binning

Values: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

[The slide shows these values binned twice: once with equal-interval bins and once with equal-frequency bins.]
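A minimal plain-Python sketch of the two binning schemes on the temperature values above, with the number of bins k chosen arbitrarily (this illustrates the idea, not Weka's Discretize filter):

# Sketch of the two unsupervised binning schemes on the values above.
values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
k = 3  # number of bins, chosen arbitrarily for the example

# Equal-interval binning: split the range [min, max] into k intervals of equal width.
lo, hi = min(values), max(values)
width = (hi - lo) / k
equal_interval = [min(int((v - lo) / width), k - 1) for v in values]

# Equal-frequency binning: sort the values and put (roughly) the same number in each bin.
order = sorted(range(len(values)), key=lambda i: values[i])
equal_frequency = [0] * len(values)
for rank, i in enumerate(order):
    equal_frequency[i] = min(rank * k // len(values), k - 1)

print(equal_interval)
print(equal_frequency)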

Page 14: Attribute Selection


Supervised Discretization

Take classification into account
Use “entropy” to measure information gain
Goal: discretize into 'pure' intervals
Usually no way to get completely pure intervals:

Values: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

[The slide splits these values into intervals A-F, with class counts such as 9 yes & 4 no, 1 no, 1 yes, and 8 yes & 5 no.]
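A minimal plain-Python sketch of one step of the entropy-based method: evaluate every candidate boundary between two different sorted values and keep the split with the lowest weighted class entropy (equivalently, the highest information gain). The chosen interval would then be split again recursively until a stopping criterion is met.

# Sketch of one entropy-based split on the data above (plain Python).
import math

temp = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["y", "n", "y", "y", "y", "n", "n", "y", "y", "y", "n", "y", "y", "n"]

def entropy(labels):
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

best = None
for i in range(1, len(temp)):
    if temp[i] == temp[i - 1]:
        continue  # cannot split between identical values
    left, right = play[:i], play[i:]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(play)
    if best is None or weighted < best[0]:
        best = (weighted, (temp[i - 1] + temp[i]) / 2)

print("best split point:", best[1])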

Page 15: Attribute Selection


Error-Based Discretization

Count the number of misclassifications
- The majority class in an interval determines its prediction
- Count the instances that are different
Must restrict the number of intervals
Complexity
- Brute force: exponential time
- Dynamic programming: linear time
Downside: cannot generate adjacent intervals with the same label
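A minimal plain-Python sketch of the scoring step: given candidate cut positions, each interval predicts its majority class, and the score is the number of instances that disagree. The class labels reuse the example above; the cut positions are hypothetical.

# Sketch of how an error-based method scores one candidate partition.
play = ["y", "n", "y", "y", "y", "n", "n", "y", "y", "y", "n", "y", "y", "n"]
boundaries = [5, 10]  # hypothetical cut positions giving three intervals

def errors(labels, cuts):
    total = 0
    start = 0
    for end in list(cuts) + [len(labels)]:
        interval = labels[start:end]
        majority = max(set(interval), key=interval.count)
        total += sum(1 for c in interval if c != majority)
        start = end
    return total

print("misclassifications:", errors(play, boundaries))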

Page 16: Attribute Selection


Weka Filter

Page 17: Attribute Selection


Attribute Selection

Before inducing a model we almost always do input engineering
The most useful part of this is attribute selection (also called feature selection)
- Select relevant attributes
- Remove redundant and/or irrelevant attributes
Why?

Page 18: Attribute Selection


Reasons for Attribute Selection

Simpler model
- More transparent
- Easier to interpret
Faster model induction
- What about overall time?
Structural knowledge
- Knowing which attributes are important may be inherently important to the application
What about the accuracy?

Page 19: Attribute Selection


Attribute Selection Methods

What is evaluated?        Attributes      Subsets of attributes
Evaluation method
  Independent             Filters         Filters
  Learning algorithm                      Wrappers

Page 20: Attribute Selection


Filters

Results in either:
A ranked list of attributes
- Typical when each attribute is evaluated individually
- Must select how many to keep
A selected subset of attributes
- Search methods: forward selection, best first, random search such as a genetic algorithm

Page 21: Attribute Selection


Filter Evaluation Examples

Information gain
Gain ratio
Relief
Correlation
- High correlation with the class attribute
- Low correlation with other attributes
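A minimal plain-Python sketch of the correlation criterion on made-up numeric columns: a useful attribute correlates strongly with the class and weakly with the other attributes. This illustrates the intuition behind correlation-based filters, not any particular Weka implementation.

# Pearson correlation on made-up columns.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

attr1 = [1.0, 2.0, 3.0, 4.0, 5.0]
attr2 = [2.1, 3.9, 6.2, 8.1, 9.8]     # nearly a copy of attr1 -> redundant
cls   = [0.0, 0.0, 1.0, 1.0, 1.0]     # numeric encoding of a two-valued class

print("attr1 vs class:", pearson(attr1, cls))    # high -> relevant
print("attr1 vs attr2:", pearson(attr1, attr2))  # high -> redundant pair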

Page 22: Attribute Selection


Wrappers

“Wrap around” the learning algorithm
- Must therefore always evaluate subsets
- Return the best subset of attributes
- Apply for each learning algorithm
- Use the same search methods as before

Loop:
1. Select a subset of attributes
2. Induce the learning algorithm on this subset
3. Evaluate the resulting model (e.g., accuracy)
4. Stop? If not, go back to 1.
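A minimal plain-Python sketch of wrapper-style forward selection. The evaluation function cv_accuracy is hypothetical: in a real wrapper it would induce the chosen learning algorithm on just those attributes and estimate accuracy, e.g. by cross-validation.

# Sketch of wrapper-style forward selection (plain Python).
def forward_selection(all_attributes, cv_accuracy):
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for a in all_attributes:
            if a in selected:
                continue
            acc = cv_accuracy(selected + [a])
            if acc > best_acc:                 # keep the single best addition
                best_acc, best_attr = acc, a
                improved = True
        if improved:
            selected.append(best_attr)
    return selected, best_acc

# Toy usage with a made-up evaluation function:
score = {"a": 0.7, "b": 0.8, "c": 0.6}
print(forward_selection(["a", "b", "c"],
                        lambda s: sum(score[x] for x in s) / (len(s) + 1)))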

Page 23: Attribute Selection


How does it help?

Naïve Bayes

Instance-based learning

Decision tree induction

Page 24: Attribute Selection


Page 25: Attribute Selection


Scalability

Data mining uses mostly well-developed techniques (AI, statistics, optimization)
Key difference: very large databases
How to deal with scalability problems?
Scalability: the capability of handling increased load in a way that does not adversely affect performance

Page 26: Attribute Selection


Massive Datasets

Very large data sets (millions+ of instances, hundreds+ of attributes)
Scalability in space and time
Data set cannot be kept in memory
- E.g., processing one instance at a time
Learning time very long
- How does the time depend on the input? Number of attributes, number of instances

Page 27: Attribute Selection


Two Approaches

Increased computational power
- Only works if algorithms can be sped up
- Must have the computing availability
Adapt algorithms
- Automatically scale down the problem so that it is always approximately the same difficulty

Page 28: Attribute Selection


Computational Complexity

We want to design algorithms with good computational complexity

[Plot: time versus the number of instances (or attributes) for logarithmic, linear, polynomial, and exponential complexity curves.]

Page 29: Attribute Selection


Example: Big-Oh Notation

Define:
- n = number of instances
- m = number of attributes
Going once through all the instances has complexity O(n)
Examples:
- Polynomial complexity: O(mn^2)
- Linear complexity: O(m + n)
- Exponential complexity: O(2^n)
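Tying the notation to code, a small plain-Python illustration on a made-up table with n = 4 instances and m = 3 attributes:

# Made-up data: n = 4 instances, m = 3 attributes.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 2]]

# O(n*m): one pass over every value, e.g. computing per-attribute sums.
sums = [sum(row[j] for row in data) for j in range(len(data[0]))]

# O(m*n^2): comparing every pair of instances on every attribute,
# the kind of cost a naive nearest-neighbour search pays.
pairs = 0
for i in range(len(data)):
    for k in range(len(data)):
        for j in range(len(data[0])):
            pairs += 1

print(sums, pairs)   # pairs == m * n**2 == 48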

Page 30: Attribute Selection


Classification

If no polynomial-time algorithm exists to solve a problem, it is called NP-complete
Finding the optimal decision tree is an example of an NP-complete problem
However, ID3 and C4.5 are polynomial-time algorithms
- Heuristic algorithms that construct solutions to a difficult problem
- “Efficient” from a computational complexity standpoint, but they still have a scalability problem

Page 31: Attribute Selection


Decision Tree Algorithms

Traditional decision tree algorithms assume the training set is kept in memory
Swapping in and out of main and cache memory is expensive
Solution:
- Partition data into subsets
- Build a classifier on each subset
- Combine classifiers
- Not as accurate as a single classifier

Page 32: Attribute Selection


Other Classification Examples

Instance-based learning
- Goes through instances one at a time
- Compares with the new instance
- Polynomial complexity O(mn)
- Response time may be slow, however
Naïve Bayes
- Polynomial complexity
- Stores a very large model

Page 33: Attribute Selection


Data Reduction

Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)
Some strategies:
- Dimensionality reduction
- Data compression
- Numerosity reduction

Page 34: Attribute Selection


Dimensionality Reduction

Remove irrelevant, weakly relevant, and redundant attributes
Attribute selection
- Many methods available
- E.g., forward selection, backwards elimination, genetic algorithm search
Often a much smaller problem
Often little degradation in predictive performance, or even better performance

Page 35: Attribute Selection


Data Compression

Also aims for dimensionality reduction
Transforms the data into a smaller space
Principal Component Analysis:
- Normalize the data
- Compute c orthonormal vectors, or principal components, that provide a basis for the normalized data
- Sort according to decreasing significance
- Eliminate the weaker components
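A minimal NumPy sketch of those four steps (an illustration of the idea, not Weka's PrincipalComponents filter; the ten 2-D points are made up):

# PCA sketch with NumPy on made-up 2-D data.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# 1. Normalize the data (here: center and scale to unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute orthonormal principal components (eigenvectors of the covariance matrix).
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort by decreasing significance (eigenvalue) and 4. keep the strongest c components.
order = np.argsort(eigvals)[::-1]
c = 1
components = eigvecs[:, order[:c]]

# Project the data into the reduced space.
reduced = Z @ components
print(reduced.shape)   # (10, 1)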

Page 36: Attribute Selection


PCA: Example

Page 37: Attribute Selection


Numerosity Reduction

Replace data with an alternative, smaller data representation
Histogram: for example, the raw values

1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30

are summarized by a count for each of the bins 1-10, 11-20, and 21-30.
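A minimal plain-Python sketch of the reduction: the raw list above is replaced by one count per bin.

# Histogram-style numerosity reduction on the values above.
values = [1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,
          18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,
          25,25,25,25,25,28,28,30,30,30]

bins = {"1-10": 0, "11-20": 0, "21-30": 0}
for v in values:
    if v <= 10:
        bins["1-10"] += 1
    elif v <= 20:
        bins["11-20"] += 1
    else:
        bins["21-30"] += 1

print(bins)   # the 52 raw values are summarized by just three counts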

Page 38: Attribute Selection


Other Numerosity Reduction

Clustering
- Data objects (instances) that are in the same cluster can be treated as the same instance
- Must use a scalable clustering algorithm
Sampling
- Randomly select a subset of the instances to be used

Page 39: Attribute Selection


Sampling Techniques

Different samples:
- Sample without replacement
- Sample with replacement
- Cluster sample
- Stratified sample
The complexity of sampling is actually sublinear; that is, the complexity is O(s), where s is the number of samples and s << n
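A minimal sketch of the sampling variants using plain Python's random module (the instances and sample size are made up; fixing the seed makes the sample reproducible):

# Sampling sketches on made-up (value, class) instances.
import random
random.seed(1)   # same seed -> same sample again

instances = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 1)]
s = 3   # sample size; for large data sets s << n

without_repl = random.sample(instances, s)
with_repl    = [random.choice(instances) for _ in range(s)]

# Stratified sample: sample within each class so class proportions are preserved.
stratified = []
for label in {c for _, c in instances}:
    group = [x for x in instances if x[1] == label]
    stratified.extend(random.sample(group, s * len(group) // len(instances)))

print(without_repl)
print(with_repl)
print(stratified)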

Page 40: Attribute Selection


Weka Filters

PrincipalComponents is under the Attribute Selection tab
Already talked about filters to discretize the data
The Resample filter randomly samples a given percentage of the data
- If you specify the same seed, you'll get the same sample again