Attribute Selection


Transcript of Attribute Selection

Page 1: Attribute Selection

Fall 2003 Data Mining

Exploratory Data Mining and Data Preparation

Page 2: Attribute Selection


The Data Mining Process

[Diagram: the data mining process as a cycle around the data - business understanding, data preparation, modeling, evaluation, and deployment.]

Page 3: Attribute Selection


Exploratory Data Mining

Preliminary process
Data summaries
- Attribute means
- Attribute variation
- Attribute relationships

Visualization

Page 4: Attribute Selection


Select an attribute

Summary Statistics

Possible problems:
• Many missing values (16%)
• No examples of one value

Visualization

Appears to be a good predictor of the class
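As a rough, plain-Python sketch (not Weka) of the kind of summary statistics inspected here, the snippet below computes the percentage of missing values and the value counts for one nominal attribute; the column of values is made up for illustration.

# Minimal sketch, not Weka: summary statistics for one nominal attribute.
# The values are made up; None marks a missing value.
values = ["t", "f", None, "t", "t", None, "f", "t", "t", "f", "t", "t"]

missing = sum(1 for v in values if v is None)
print("missing: %.0f%%" % (100.0 * missing / len(values)))   # here: 17%

counts = {}
for v in values:
    if v is not None:
        counts[v] = counts.get(v, 0) + 1
print("value counts:", counts)   # a value with no examples simply never appears here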

Page 5: Attribute Selection


Page 6: Attribute Selection


Exploratory DM Process

For each attribute:
Look at data summaries
- Identify potential problems and decide if an action needs to be taken (may require collecting more data)
Visualize the distribution
- Identify potential problems (e.g., one dominant attribute value, even distribution, etc.)
Evaluate usefulness of attributes

Page 7: Attribute Selection


Weka Filters

Weka has many filters that are helpful in preprocessing the data
Attribute filters: add, remove, or transform attributes
Instance filters: add, remove, or transform instances
Process: choose the filter from the drop-down menu, edit its parameters (if any), and apply it

Page 8: Attribute Selection


Data Preprocessing

Data cleaning: missing values, noisy or inconsistent data
Data integration/transformation
Data reduction: dimensionality reduction, data compression, numerosity reduction
Discretization

Page 9: Attribute Selection


Data Cleaning

Missing values
- Weka reports the % of missing values
- Can use the filter ReplaceMissingValues
Noisy data
- Due to uncertainty or errors
- Weka reports unique values
- Useful filters include RemoveMisclassified and MergeTwoValues
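As a minimal plain-Python sketch of what mean imputation looks like (the idea behind a filter such as ReplaceMissingValues, which substitutes means for numeric attributes and modes for nominal ones); the numeric column below is made up:

# Minimal sketch (plain Python, not Weka): replace missing numeric values
# with the attribute mean. None marks a missing value; the column is made up.
values = [64.0, None, 70.0, 71.0, None, 85.0]

known = [v for v in values if v is not None]
mean = sum(known) / len(known)          # mean of the observed values
cleaned = [mean if v is None else v for v in values]
print(cleaned)                          # missing entries replaced by 72.5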

Page 10: Attribute Selection


Data Transformation

Why transform data?
Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate.
Normalizing data. Having attributes on the same approximate scale helps many data mining algorithms (hence better models).
Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (hence better models).
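A minimal plain-Python sketch of the first two kinds of transformation just listed, on made-up columns (Weka's AddExpression, Normalize, and Standardize filters cover similar ground):

# Made-up numeric columns used only for illustration.
width  = [2.0, 4.0, 6.0, 8.0]
height = [1.0, 1.0, 3.0, 2.0]

# Combine attributes: the ratio may be more useful than the two originals.
ratio = [w / h for w, h in zip(width, height)]

# Normalize width to the [0, 1] range (min-max scaling).
lo, hi = min(width), max(width)
normalized = [(w - lo) / (hi - lo) for w in width]

# Standardize width to zero mean and unit standard deviation.
mean = sum(width) / len(width)
std = (sum((w - mean) ** 2 for w in width) / len(width)) ** 0.5
standardized = [(w - mean) / std for w in width]

print(ratio)
print(normalized)
print(standardized)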

Page 11: Attribute Selection


Weka Filters

The data transformation filters in Weka include: Add, AddExpression, MakeIndicator, NumericTransform, Normalize, and Standardize.

Page 12: Attribute Selection


Discretization

Discretization reduces the number of values for a continuous attribute
Why?
- Some methods can only use nominal data (e.g., the ID3 and Apriori algorithms in Weka)
- Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)

Page 13: Attribute Selection


Unsupervised Discretization

Unsupervised - does not account for classes
Equal-interval binning
Equal-frequency binning

Values: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

[The slide shows these values binned twice: once with equal-interval bins and once with equal-frequency bins.]
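A minimal plain-Python sketch of the two binning schemes on the temperature values above, with the number of bins k chosen arbitrarily (this illustrates the idea, not Weka's Discretize filter):

# Sketch of the two unsupervised binning schemes on the values above.
values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
k = 3  # number of bins, chosen arbitrarily for the example

# Equal-interval binning: split the range [min, max] into k intervals of equal width.
lo, hi = min(values), max(values)
width = (hi - lo) / k
equal_interval = [min(int((v - lo) / width), k - 1) for v in values]

# Equal-frequency binning: sort the values and put (roughly) the same number in each bin.
order = sorted(range(len(values)), key=lambda i: values[i])
equal_frequency = [0] * len(values)
for rank, i in enumerate(order):
    equal_frequency[i] = min(rank * k // len(values), k - 1)

print(equal_interval)
print(equal_frequency)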

Page 14: Attribute Selection


Supervised Discretization

Take classification into account
Use “entropy” to measure information gain
Goal: discretize into 'pure' intervals
Usually no way to get completely pure intervals:

Values: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Class:  Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

[The slide splits these values into intervals A-F, with class counts such as 9 yes & 4 no, 1 no, 1 yes, and 8 yes & 5 no.]
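A minimal plain-Python sketch of one step of the entropy-based method: evaluate every candidate boundary between two different sorted values and keep the split with the lowest weighted class entropy (equivalently, the highest information gain). The chosen interval would then be split again recursively until a stopping criterion is met.

# Sketch of one entropy-based split on the data above (plain Python).
import math

temp = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["y", "n", "y", "y", "y", "n", "n", "y", "y", "y", "n", "y", "y", "n"]

def entropy(labels):
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

best = None
for i in range(1, len(temp)):
    if temp[i] == temp[i - 1]:
        continue  # cannot split between identical values
    left, right = play[:i], play[i:]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(play)
    if best is None or weighted < best[0]:
        best = (weighted, (temp[i - 1] + temp[i]) / 2)

print("best split point:", best[1])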

Page 15: Attribute Selection


Error-Based Discretization

Count the number of misclassifications
- The majority class in an interval determines its prediction
- Count the instances that are different
Must restrict the number of intervals
Complexity
- Brute force: exponential time
- Dynamic programming: linear time
Downside: cannot generate adjacent intervals with the same label
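A minimal plain-Python sketch of the scoring step: given candidate cut positions, each interval predicts its majority class, and the score is the number of instances that disagree. The class labels reuse the example above; the cut positions are hypothetical.

# Sketch of how an error-based method scores one candidate partition.
play = ["y", "n", "y", "y", "y", "n", "n", "y", "y", "y", "n", "y", "y", "n"]
boundaries = [5, 10]  # hypothetical cut positions giving three intervals

def errors(labels, cuts):
    total = 0
    start = 0
    for end in list(cuts) + [len(labels)]:
        interval = labels[start:end]
        majority = max(set(interval), key=interval.count)
        total += sum(1 for c in interval if c != majority)
        start = end
    return total

print("misclassifications:", errors(play, boundaries))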

Page 16: Attribute Selection


Weka Filter

Page 17: Attribute Selection


Attribute Selection

Before inducing a model we almost always do input engineering
The most useful part of this is attribute selection (also called feature selection)
- Select relevant attributes
- Remove redundant and/or irrelevant attributes
Why?

Page 18: Attribute Selection


Reasons for Attribute Selection

Simpler model
- More transparent
- Easier to interpret
Faster model induction
- What about overall time?
Structural knowledge
- Knowing which attributes are important may be inherently important to the application
What about the accuracy?

Page 19: Attribute Selection


Attribute Selection Methods

What is evaluated?        Attributes      Subsets of attributes
Evaluation method
  Independent             Filters         Filters
  Learning algorithm                      Wrappers

Page 20: Attribute Selection


Filters

Results in either:
A ranked list of attributes
- Typical when each attribute is evaluated individually
- Must select how many to keep
A selected subset of attributes
- Search methods: forward selection, best first, random search such as a genetic algorithm

Page 21: Attribute Selection


Filter Evaluation Examples

Information gain
Gain ratio
Relief
Correlation
- High correlation with the class attribute
- Low correlation with other attributes
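A minimal plain-Python sketch of the correlation criterion on made-up numeric columns: a useful attribute correlates strongly with the class and weakly with the other attributes. This illustrates the intuition behind correlation-based filters, not any particular Weka implementation.

# Pearson correlation on made-up columns.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

attr1 = [1.0, 2.0, 3.0, 4.0, 5.0]
attr2 = [2.1, 3.9, 6.2, 8.1, 9.8]     # nearly a copy of attr1 -> redundant
cls   = [0.0, 0.0, 1.0, 1.0, 1.0]     # numeric encoding of a two-valued class

print("attr1 vs class:", pearson(attr1, cls))    # high -> relevant
print("attr1 vs attr2:", pearson(attr1, attr2))  # high -> redundant pair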

Page 22: Attribute Selection


Wrappers

“Wrap around” the learning algorithm
- Must therefore always evaluate subsets
- Return the best subset of attributes
- Apply for each learning algorithm
- Use the same search methods as before

Loop:
1. Select a subset of attributes
2. Induce the learning algorithm on this subset
3. Evaluate the resulting model (e.g., accuracy)
4. Stop? If not, go back to 1.
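A minimal plain-Python sketch of wrapper-style forward selection. The evaluation function cv_accuracy is hypothetical: in a real wrapper it would induce the chosen learning algorithm on just those attributes and estimate accuracy, e.g. by cross-validation.

# Sketch of wrapper-style forward selection (plain Python).
def forward_selection(all_attributes, cv_accuracy):
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for a in all_attributes:
            if a in selected:
                continue
            acc = cv_accuracy(selected + [a])
            if acc > best_acc:                 # keep the single best addition
                best_acc, best_attr = acc, a
                improved = True
        if improved:
            selected.append(best_attr)
    return selected, best_acc

# Toy usage with a made-up evaluation function:
score = {"a": 0.7, "b": 0.8, "c": 0.6}
print(forward_selection(["a", "b", "c"],
                        lambda s: sum(score[x] for x in s) / (len(s) + 1)))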

Page 23: Attribute Selection


How does it help?

Naïve Bayes

Instance-based learning

Decision tree induction

Page 24: Attribute Selection


Page 25: Attribute Selection


Scalability

Data mining uses mostly well-developed techniques (AI, statistics, optimization)
Key difference: very large databases
How to deal with scalability problems?
Scalability: the capability of handling increased load in a way that does not adversely affect performance

Page 26: Attribute Selection


Massive Datasets

Very large data sets (millions+ of instances, hundreds+ of attributes)
Scalability in space and time
Data set cannot be kept in memory
- E.g., processing one instance at a time
Learning time very long
- How does the time depend on the input? Number of attributes, number of instances

Page 27: Attribute Selection


Two Approaches

Increased computational power
- Only works if algorithms can be sped up
- Must have the computing availability
Adapt algorithms
- Automatically scale down the problem so that it is always approximately the same difficulty

Page 28: Attribute Selection


Computational Complexity

We want to design algorithms with good computational complexity

[Plot: time versus the number of instances (or attributes) for logarithmic, linear, polynomial, and exponential complexity curves.]

Page 29: Attribute Selection


Example: Big-Oh Notation

Define:
- n = number of instances
- m = number of attributes
Going once through all the instances has complexity O(n)
Examples:
- Polynomial complexity: O(mn^2)
- Linear complexity: O(m + n)
- Exponential complexity: O(2^n)
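Tying the notation to code, a small plain-Python illustration on a made-up table with n = 4 instances and m = 3 attributes:

# Made-up data: n = 4 instances, m = 3 attributes.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 2]]

# O(n*m): one pass over every value, e.g. computing per-attribute sums.
sums = [sum(row[j] for row in data) for j in range(len(data[0]))]

# O(m*n^2): comparing every pair of instances on every attribute,
# the kind of cost a naive nearest-neighbour search pays.
pairs = 0
for i in range(len(data)):
    for k in range(len(data)):
        for j in range(len(data[0])):
            pairs += 1

print(sums, pairs)   # pairs == m * n**2 == 48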

Page 30: Attribute Selection


Classification

If no polynomial-time algorithm exists to solve a problem, it is called NP-complete
Finding the optimal decision tree is an example of an NP-complete problem
However, ID3 and C4.5 are polynomial-time algorithms
- Heuristic algorithms that construct solutions to a difficult problem
- “Efficient” from a computational complexity standpoint, but they still have a scalability problem

Page 31: Attribute Selection


Decision Tree Algorithms

Traditional decision tree algorithms assume the training set is kept in memory
Swapping in and out of main and cache memory is expensive
Solution:
- Partition data into subsets
- Build a classifier on each subset
- Combine classifiers
- Not as accurate as a single classifier

Page 32: Attribute Selection


Other Classification Examples

Instance-based learning
- Goes through instances one at a time
- Compares with the new instance
- Polynomial complexity O(mn)
- Response time may be slow, however
Naïve Bayes
- Polynomial complexity
- Stores a very large model

Page 33: Attribute Selection


Data Reduction

Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)
Some strategies:
- Dimensionality reduction
- Data compression
- Numerosity reduction

Page 34: Attribute Selection


Dimensionality Reduction

Remove irrelevant, weakly relevant, and redundant attributes
Attribute selection
- Many methods available
- E.g., forward selection, backwards elimination, genetic algorithm search
Often a much smaller problem
Often little degradation in predictive performance, or even better performance

Page 35: Attribute Selection


Data Compression

Also aims for dimensionality reduction
Transforms the data into a smaller space
Principal Component Analysis:
- Normalize the data
- Compute c orthonormal vectors, or principal components, that provide a basis for the normalized data
- Sort according to decreasing significance
- Eliminate the weaker components
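A minimal NumPy sketch of those four steps (an illustration of the idea, not Weka's PrincipalComponents filter; the ten 2-D points are made up):

# PCA sketch with NumPy on made-up 2-D data.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# 1. Normalize the data (here: center and scale to unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute orthonormal principal components (eigenvectors of the covariance matrix).
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort by decreasing significance (eigenvalue) and 4. keep the strongest c components.
order = np.argsort(eigvals)[::-1]
c = 1
components = eigvecs[:, order[:c]]

# Project the data into the reduced space.
reduced = Z @ components
print(reduced.shape)   # (10, 1)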

Page 36: Attribute Selection


PCA: Example

Page 37: Attribute Selection


Numerosity Reduction

Replace data with an alternative, smaller data representation
Histogram: for example, the raw values

1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30

are summarized by a count for each of the bins 1-10, 11-20, and 21-30.
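A minimal plain-Python sketch of the reduction: the raw list above is replaced by one count per bin.

# Histogram-style numerosity reduction on the values above.
values = [1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,
          18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,
          25,25,25,25,25,28,28,30,30,30]

bins = {"1-10": 0, "11-20": 0, "21-30": 0}
for v in values:
    if v <= 10:
        bins["1-10"] += 1
    elif v <= 20:
        bins["11-20"] += 1
    else:
        bins["21-30"] += 1

print(bins)   # the 52 raw values are summarized by just three counts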

Page 38: Attribute Selection


Other Numerosity Reduction

Clustering
- Data objects (instances) that are in the same cluster can be treated as the same instance
- Must use a scalable clustering algorithm
Sampling
- Randomly select a subset of the instances to be used

Page 39: Attribute Selection


Sampling Techniques

Different samples:
- Sample without replacement
- Sample with replacement
- Cluster sample
- Stratified sample
The complexity of sampling is actually sublinear; that is, the complexity is O(s), where s is the number of samples and s << n
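A minimal sketch of the sampling variants using plain Python's random module (the instances and sample size are made up; fixing the seed makes the sample reproducible):

# Sampling sketches on made-up (value, class) instances.
import random
random.seed(1)   # same seed -> same sample again

instances = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 1)]
s = 3   # sample size; for large data sets s << n

without_repl = random.sample(instances, s)
with_repl    = [random.choice(instances) for _ in range(s)]

# Stratified sample: sample within each class so class proportions are preserved.
stratified = []
for label in {c for _, c in instances}:
    group = [x for x in instances if x[1] == label]
    stratified.extend(random.sample(group, s * len(group) // len(instances)))

print(without_repl)
print(with_repl)
print(stratified)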

Page 40: Attribute Selection


Weka Filters

PrincipalComponents is under the Attribute Selection tab
Already talked about filters to discretize the data
The Resample filter randomly samples a given percentage of the data
- If you specify the same seed, you'll get the same sample again