Exploratory Data Mining and Data Preparation

The Data Mining Process
[Diagram of the process cycle: Business understanding → Data understanding → Data preparation → Modeling → Evaluation → Deployment]

Exploratory Data Mining
Preliminary process
Data summaries
Attribute means
Attribute variation
Attribute relationships
Visualization

Select an Attribute
Summary statistics
Possible problems:
Many missing values (16%)
No examples of one value
Visualization
Appears to be a good predictor of the class

Exploratory DM Process
For each attribute:
Look at data summaries
Identify potential problems and decide if an action needs to be taken (may require collecting more data)
Visualize the distribution
Identify potential problems (e.g., one dominant attribute value, even distribution, etc.)
Evaluate usefulness of attributes

Weka Filters
Weka has many filters that are helpful in preprocessing the data:
Attribute filters: add, remove, or transform attributes
Instance filters: add, remove, or transform instances
Process:
Choose a filter from the drop-down menu
Edit its parameters (if any)
Apply it (the same steps can be scripted, as sketched below)
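
A minimal sketch of the same Choose / Edit parameters / Apply sequence through the Weka Java API; the file name weather.arff and the choice of the Remove attribute filter are illustrative assumptions:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FilterProcess {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // load a dataset
        Remove remove = new Remove();          // "Choose" an attribute filter
        remove.setAttributeIndices("1");       // "Edit parameters": drop attribute 1
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove); // "Apply"
        System.out.println(filtered.numAttributes() + " attributes remain");
    }
}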

Data Preprocessing
Data cleaning
Missing values, noisy or inconsistent data
Data integration/transformation
Data reduction
Dimensionality reduction, data compression, numerosity reduction
Discretization

Data Cleaning
Missing values
Weka reports % of missing values
Can use the ReplaceMissingValues filter (sketched below)
Noisy data
Due to uncertainty or errors
Weka reports unique values
Useful filters include
RemoveMisclassified
MergeTwoValues
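
A sketch of ReplaceMissingValues via the Java API; it substitutes each missing value with the attribute's mean (numeric) or mode (nominal). The file name is an illustrative assumption:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class CleanMissing {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        // Every '?' in the data is replaced by the column mean/mode
        Instances clean = Filter.useFilter(data, rmv);
        System.out.println(clean);
    }
}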

Data Transformation
Why transform data?
Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate
Normalizing data. Having attributes on the same approximate scale helps many data mining algorithms (hence better models)
Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (hence better models)

Weka Filters
The data transformation filters in Weka include:
Add
AddExpression
MakeIndicator
NumericTransform
Normalize
Standardize
AddExpression, for example, can build the ratio attribute from the previous slide, as sketched below.
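
A sketch of AddExpression creating a ratio attribute; the file name, the attribute indices a1 and a2, and the new attribute's name are illustrative assumptions:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddExpression;

public class RatioAttribute {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("numeric.arff");
        AddExpression ratio = new AddExpression();
        ratio.setExpression("a1/a2");  // a1, a2 = first and second attributes
        ratio.setName("ratio_1_2");    // name of the new attribute
        ratio.setInputFormat(data);
        Instances extended = Filter.useFilter(data, ratio);
        System.out.println(extended.attribute("ratio_1_2"));
    }
}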

Discretization
Discretization reduces the number of values for a continuous attribute
Why?
Some methods can only use nominal data
E.g., the ID3 and Apriori algorithms in Weka
Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)

Unsupervised Discretization
Unsupervised: does not account for classes
Equal-interval binning
Equal-frequency binning
Example (temperature values with their class labels):
64 65 68 69 70 71 72 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
[Figures: the same values split once into equal-width intervals and once into intervals holding equal numbers of values]
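
A sketch of unsupervised discretization with Weka's Discretize filter; the bin count of 4 and the file name are illustrative assumptions. Toggling setUseEqualFrequency switches between the two binning schemes above:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class UnsupervisedBinning {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        Discretize disc = new Discretize();
        disc.setBins(4);                    // number of intervals
        disc.setUseEqualFrequency(false);   // false = equal-interval binning
        disc.setInputFormat(data);
        Instances binned = Filter.useFilter(data, disc);
        System.out.println(binned);
    }
}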

Supervised Discretization
Take classification into account
Use entropy to measure information gain
Goal: discretize into 'pure' intervals
Usually there is no way to get completely pure intervals:
64 65 68 69 70 71 72 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
[Figure: intervals labeled A-F with class counts such as 9 yes & 4 no, 1 no, 1 yes, 8 yes & 5 no]
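
The supervised variant lives in a different Weka package and uses the entropy-based criterion, so the class attribute must be set first. A sketch (the file name is an illustrative assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class SupervisedBinning {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1); // class = last attribute
        Discretize disc = new Discretize();  // entropy-based intervals
        disc.setInputFormat(data);
        Instances binned = Filter.useFilter(data, disc);
        System.out.println(binned);
    }
}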

Error-Based Discretization
Count the number of misclassifications
Majority class determines prediction
Count instances that are different
Must restrict the number of intervals (otherwise every value gets its own pure interval)
Complexity
Brute force: exponential time
Dynamic programming: linear time
Downside: cannot generate adjacent intervals with the same label

Weka Filter
[Screenshot of a filter in the Weka Explorer]

Attribute Selection
Before inducing a model we almost always do input engineering
The most useful part of this is attribute selection (also called feature selection)
Select relevant attributes
Remove redundant and/or irrelevant attributes
Why?

Reasons for Attribute Selection
Simpler model
More transparent
Easier to interpret
Faster model induction
What about overall time?
Structural knowledge
Knowing which attributes are important may be inherently important to the application
What about the accuracy?

Attribute Selection Methods

                        What is evaluated?
Evaluation method       Attributes      Subsets of attributes
Independent             Filters         Filters
Learning algorithm      -               Wrappers

Filters
Results in either:
A ranked list of attributes
Typical when each attribute is evaluated individually
Must select how many to keep
A selected subset of attributes
Search methods include forward selection, best first, and random search such as a genetic algorithm

Filter Evaluation Examples
Information gain
Gain ratio
Relief
Correlation
High correlation with class attribute
Low correlation with other attributes
An information-gain ranking is sketched below.
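
A sketch ranking attributes by information gain with the Weka API; the file name is an illustrative assumption:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankByInfoGain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval()); // score each attribute
        selector.setSearch(new Ranker());                   // produce a ranked list
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());
    }
}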

Wrappers
Wrap around the learning algorithm
Must therefore always evaluate subsets
Return the best subset of attributes
Apply for each learning algorithm
Use same search methods as before
[Flowchart: select a subset of attributes → induce the learning algorithm on this subset → evaluate the resulting model (e.g., accuracy) → stop? If no, select another subset; if yes, return the best subset]
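
A sketch of the wrapper loop using Weka's WrapperSubsetEval; choosing J48 as the wrapped learner and greedy forward search are illustrative assumptions:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48()); // subsets scored by the wrapped learner
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(wrapper);
        selector.setSearch(new GreedyStepwise()); // forward selection by default
        selector.SelectAttributes(data);
        int[] chosen = selector.selectedAttributes(); // best subset found
        System.out.println(java.util.Arrays.toString(chosen));
    }
}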

How does it help?
Naïve Bayes
Instance-based learning
Decision tree induction

Scalability
Data mining mostly uses well-developed techniques (AI, statistics, optimization)
Key difference: very large databases
How to deal with scalability problems?
Scalability: the capability of handling increased load in a way that does not adversely affect performance

Massive Datasets
Very large data sets (millions+ of instances, hundreds+ of attributes)
Scalability in space and time
Data set cannot be kept in memory
E.g., processing one instance at a time
Learning time very long
How does the time depend on the input?
Number of attributes, number of instances

Two Approaches
Increased computational power
Only works if algorithms can be sped up
Must have the computing availability
Adapt algorithms
Automatically scale down the problem so that it is always approximately the same difficulty

Computational Complexity
We want to design algorithms with good computational complexity
[Graph: time versus the number of instances (or attributes) for logarithmic, linear, polynomial, and exponential complexity]

Example: Big-Oh Notation
Define n = number of instances, m = number of attributes
Going once through all the instances has complexity O(n)
Examples:
Polynomial complexity: O(mn^2)
Linear complexity: O(m+n)
Exponential complexity: O(2^n)
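
As a rough worked comparison (the numbers are illustrative): with m = 100 attributes and n = 1,000,000 instances, O(m+n) is on the order of 10^6 steps, O(mn^2) is on the order of 10^14 steps, and O(2^n) is so large that no computer could ever finish.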

Classification
A problem that is as hard as any problem in NP, and for which no polynomial-time algorithm is known, is called NP-complete
Finding the optimal decision tree is an example of an NP-complete problem
However, ID3 and C4.5 are polynomial-time algorithms
They are heuristic algorithms that construct solutions to a difficult problem
Efficient from a computational-complexity standpoint, but still have a scalability problem

Decision Tree Algorithms
Traditional decision tree algorithms assume the training set is kept in memory
Swapping in and out of main and cache memory is expensive
Solution: partition the data into subsets
Build a classifier on each subset
Combine the classifiers
Not as accurate as a single classifier

Other Classification Examples
Instance-based learning
Goes through instances one at a time
Compares each with the new instance
Polynomial complexity O(mn)
Response time may be slow, however
Naïve Bayes
Polynomial complexity
Stores a very large model

Data Reduction
Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)
Some strategies:
Dimensionality reduction
Data compression
Numerosity reduction

Dimensionality Reduction
Remove irrelevant, weakly relevant, and redundant attributes
Attribute selection
Many methods available
E.g., forward selection, backwards elimination, genetic algorithm search
Often a much smaller problem
Often little degradation in predictive performance, or even better performance

Data Compression
Also aims for dimensionality reduction
Transform the data into a smaller space
Principal Component Analysis:
Normalize data
Compute orthonormal vectors, or principal components, that provide a basis for the normalized data
Sort according to decreasing significance
Eliminate the weaker components (sketched below)
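
A sketch of PCA through the Weka API; covering 95% of the variance and the file name are illustrative assumptions:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PcaSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("numeric.arff");
        data.setClassIndex(data.numAttributes() - 1);
        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95); // keep components covering 95% of variance
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(pca);
        selector.setSearch(new Ranker()); // components sorted by significance
        selector.SelectAttributes(data);
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println(reduced.numAttributes() + " components kept");
    }
}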

PCA: Example
[Figure: PCA example plot]

Numerosity Reduction
Replace data with an alternative, smaller data representation
Histogram example: the values
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30
are summarized by counts for the bins 1-10, 11-20, and 21-30
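
A minimal sketch of the reduction: the 52 raw values above are replaced by three bin counts (plain Java, no Weka needed):

public class HistogramReduction {
    public static void main(String[] args) {
        int[] values = {1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
                        15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
                        20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30};
        int[] counts = new int[3];      // bins 1-10, 11-20, 21-30
        for (int v : values) {
            counts[(v - 1) / 10]++;     // map each value to its bin
        }
        // prints 13 25 14: the whole data set in three numbers
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}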

Other Numerosity Reduction
Clustering
Data objects (instances) that are in the same cluster can be treated as the same instance
Must use a scalable clustering algorithm
Sampling
Randomly select a subset of the instances to be used

Sampling Techniques
Different samples:
Sample without replacement
Sample with replacement
Cluster sample
Stratified sample
The complexity of sampling is actually sublinear; that is, the complexity is O(s), where s is the number of samples and s is smaller than the number of instances n

Weka Filters
PrincipalComponents is under the Attribute Selection tab
Already talked about filters to discretize the data
The Resample filter randomly samples a given percentage of the data
If you specify the same seed, you'll get the same sample again (sketched below)
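
A sketch of reproducible sampling with the Resample filter; the 10% sample size, the seed of 42, and the file name are illustrative assumptions:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.Resample;

public class SampleData {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("large.arff");
        Resample resample = new Resample();
        resample.setSampleSizePercent(10.0); // keep 10% of the instances
        resample.setRandomSeed(42);          // same seed => same sample
        resample.setInputFormat(data);
        Instances sample = Filter.useFilter(data, resample);
        System.out.println(sample.numInstances() + " instances sampled");
    }
}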