Three Challenges in Data Mining
![Page 1: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/1.jpg)
Three Challenges in Data Mining
Anne Denton
Department of Computer Science, NDSU
![Page 2: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/2.jpg)
Why Data Mining?
Parkinson’s Law of Data
Data expands to fill the space available for storage
Disk-storage version of Moore’s law
Capacity $\propto 2^{t / 18\ \text{months}}$
Available data grows exponentially!
![Page 3: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/3.jpg)
Outline
Motivation of 3 challenges
  More records (rows)
  More attributes (columns)
  More subject domains
Some answers to the challenges
  Thesis work
    Generalized P-Tree structure
    Kernel-based semi-naïve Bayes classification
  KDD-cup 02/03 and with CSci 366 students
    Data with graph relationship
    Outlook: Data with time dependence
![Page 4: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/4.jpg)
Examples
More records
  Many stores save each transaction
  Data warehouses keep historic data
  Monitoring network traffic
  Micro sensors / sensor networks
More attributes
  Items in a shopping cart
  Keywords in text
  Properties of a protein (multi-valued categorical)
More subject domains
  Data mining hype increases audience
![Page 5: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/5.jpg)
Algorithmic Perspective
More records
  Standard scaling problem
More attributes
  Different algorithms needed for 1000 vs. 10 attributes
More subject domains
  New techniques needed
    Joining of separate fields
  Algorithms should be domain-independent
    Need for experts does not scale well
      Twice as many data sets
      Twice as many domain experts??
    Ignore domain knowledge?
      No! Formulate it systematically
![Page 6: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/6.jpg)
Some Answers to Challenges
Large data quantity (Thesis)
  Many records
    P-Tree concept and its generalization to non-spatial data
  Many attributes
    Algorithm that defies curse of dimensionality
New techniques / Joining separate fields
  Mining data on a graph
  Outlook: Mining data with time dependence
![Page 7: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/7.jpg)
Challenge 1: Many Records
Typical question
  How many records satisfy given conditions on attributes?
Typical answer
  In record-oriented database systems
    Database scan: O(N)
    Sorting / indexes?
      Unsuitable for most problems
P-Trees
  Compressed bit-column-wise storage
  Bit-wise AND replaces database scan
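The idea of answering a counting question with bit-wise AND instead of a record scan can be sketched as follows. This is a minimal illustration only: it stores each bit position of an attribute as an uncompressed Python integer bitmap and omits the tree-based compression of actual P-Trees. The function names and the sample attribute values are hypothetical.

```python
def bit_columns(values, n_bits):
    """Return one integer bitmap per bit position (most significant first).

    Bit r of a bitmap is set when record r has a 1 at that bit position.
    """
    columns = []
    for pos in range(n_bits - 1, -1, -1):
        bitmap = 0
        for row, value in enumerate(values):
            if (value >> pos) & 1:
                bitmap |= 1 << row
        columns.append(bitmap)
    return columns


def count_equal(columns, target, n_rows):
    """Count records whose attribute equals `target` using only AND/NOT."""
    n_bits = len(columns)
    all_rows = (1 << n_rows) - 1          # bitmap with every record selected
    result = all_rows
    for i, bitmap in enumerate(columns):
        pos = n_bits - 1 - i
        if (target >> pos) & 1:
            result &= bitmap              # keep records with a 1 at this position
        else:
            result &= ~bitmap & all_rows  # keep records with a 0 at this position
    return bin(result).count("1")


ages = [5, 3, 5, 7, 5, 0]                 # hypothetical 3-bit attribute values
cols = bit_columns(ages, n_bits=3)
matches = count_equal(cols, target=5, n_rows=len(ages))
```

Each AND here touches a whole column at once. The better-than-O(N) behavior claimed in the slides relies on compression of long runs of equal bits, which this sketch does not model.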
![Page 8: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/8.jpg)
P-Trees: Compression Aspect
![Page 9: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/9.jpg)
P-Trees: Ordering Aspect
Compression relies on long sequences of 0 or 1
Images
  Neighboring pixels are probably similar
  Peano-ordering
Other data?
  Peano-ordering can be generalized
  Peano-order sorting
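One plausible way to generalize Peano ordering to non-spatial records is to sort them by a key that interleaves the bits of all attributes, most significant bits first (a Z-order or Morton key). Whether this matches the exact procedure used in the thesis is an assumption; the names and sample records below are hypothetical.

```python
def interleaved_key(record, n_bits):
    """Interleave attribute bits, most significant bits first (Z-order key)."""
    key = 0
    for pos in range(n_bits - 1, -1, -1):
        for value in record:
            key = (key << 1) | ((value >> pos) & 1)
    return key


def peano_order_sort(records, n_bits):
    """Sort records so that records sharing high-order bits end up adjacent."""
    return sorted(records, key=lambda r: interleaved_key(r, n_bits))


records = [(3, 0), (0, 1), (2, 3), (1, 1), (0, 0)]  # hypothetical 2-bit attributes
ordered = peano_order_sort(records, n_bits=2)
```

Records that agree in the high-order bits of every attribute become neighbors after sorting, which tends to produce the long runs of 0 or 1 in each bit column that compression relies on.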
![Page 10: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/10.jpg)
Peano-Order Sorting
![Page 11: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/11.jpg)
Impact of Peano-Order Sorting
[Figure: Impact of Sorting on Execution Speed. Time in seconds for the data sets adult, spam, mushroom, function, and crop; series: Unsorted, Simple Sorting, Generalized Peano Sorting.]
[Figure: Time per test sample in milliseconds versus number of training points (0 to 30000).]
Speed improvement especially for large data sets
Less than O(N) scaling for all algorithms
![Page 12: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/12.jpg)
So Far
Answer to challenge 1: Many records
  P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)
  Introduced effective generalization to non-spatial data (thesis)
Challenge 2: Many attributes
  Focus: Classification
  Curse of dimensionality
  Some algorithms suffer more than others
![Page 13: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/13.jpg)
Curse of Dimensionality
Many standard classification algorithms
  E.g., decision trees, rule-based classification
  For each attribute 2 halves: relevant, irrelevant
  How often can we divide by 2 before small size of “relevant” part makes results insignificant?
  Inverse of: Double number of rice grains for each square of the chess board
Many domains have hundreds of attributes
  Occurrence of terms in text mining
  Properties of genes
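The halving argument can be made concrete with a short calculation. The sketch below assumes that every attribute splits the remaining records exactly in half and that some minimum number of records is needed for a significant result; the data set size and the threshold of 30 are hypothetical.

```python
import math


def records_left(n_records, n_splits):
    """Records remaining in the 'relevant' part after n_splits halvings."""
    return n_records / 2 ** n_splits


def max_splits(n_records, min_records):
    """Largest number of halvings that still leaves at least min_records."""
    return int(math.floor(math.log2(n_records / min_records)))


n = 1_000_000                       # hypothetical data set size
splits = max_splits(n, min_records=30)
remaining = records_left(n, splits)
```

Under these assumptions a million records support only about fifteen splits, far fewer than the hundreds of attributes found in text mining or gene data.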
![Page 14: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/14.jpg)
Possible Solution
Additive models
  Each attribute contributes to a sum
  Techniques exist (statistics)
    Computationally intensive
Simplest: Naïve Bayes
  $$P(C = c_i \mid \mathbf{x}) \propto \prod_{k=1}^{M} P(x^{(k)} \mid C = c_i)$$
  $x^{(k)}$ is the value of the kth attribute
  Considered additive model
    Logarithm of probability additive
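The additive view of Naïve Bayes can be sketched for categorical attributes as below: the log-score of a class is a sum with one term per attribute. The inclusion of a class prior, the Laplace smoothing, and the tiny data set are assumptions made for illustration and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def train(records, labels):
    """Count classes and, per (class, attribute index), attribute values."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for record, label in zip(records, labels):
        for k, value in enumerate(record):
            value_counts[(label, k)][value] += 1
    return class_counts, value_counts


def log_score(record, label, class_counts, value_counts, n_values=2):
    """Sum of log-probabilities: one additive term per attribute."""
    total = sum(class_counts.values())
    score = math.log(class_counts[label] / total)   # class prior (assumption)
    for k, value in enumerate(record):
        count = value_counts[(label, k)][value]
        # Laplace smoothing (assumption) avoids log(0) for unseen values
        score += math.log((count + 1) / (class_counts[label] + n_values))
    return score


def predict(record, class_counts, value_counts):
    """Return the class with the highest additive log-score."""
    return max(class_counts,
               key=lambda c: log_score(record, c, class_counts, value_counts))


records = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]  # hypothetical data
labels = ["yes", "yes", "no", "no", "yes", "no"]
class_counts, value_counts = train(records, labels)
prediction = predict((1, 1), class_counts, value_counts)
```

Because each attribute only adds a term, the number of attributes does not shrink the amount of data behind any single estimate, in contrast to algorithms that repeatedly divide the data.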
![Page 15: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/15.jpg)
Semi-Naïve Bayes Classifier
Correlated attributes are joined
Has been done for categorical data
  Kononenko ’91, Pazzani ’96
  Previously: Continuous data discretized
New (thesis)
  Kernel-based evaluation of correlation
[Figure: kernel density estimate and distribution function for a set of data points]
$$\mathrm{Corr}(a,b) = \frac{\sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}(x^{(k)}, x_t^{(k)})}{\prod_{k=a,b} \sum_{t=1}^{N} K^{(k)}(x^{(k)}, x_t^{(k)})} - 1$$
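A kernel-based correlation of this kind can be sketched as the ratio of a joint kernel density estimate for two attributes to the product of their marginal estimates, minus one. The Gaussian kernel, the bandwidth `h`, the evaluation at a single query point, and the division of every sum by N (chosen so that independent attributes give a value of zero) are all assumptions of this sketch, not details confirmed by the slides.

```python
import math


def gaussian_kernel(x, x_t, h):
    """Unnormalized Gaussian kernel; constant factors cancel in the ratio."""
    return math.exp(-((x - x_t) ** 2) / (2 * h * h))


def kernel_corr(x_a, x_b, data_a, data_b, h=1.0):
    """Joint kernel estimate over product of marginal estimates, minus 1."""
    n = len(data_a)
    k_a = [gaussian_kernel(x_a, v, h) for v in data_a]
    k_b = [gaussian_kernel(x_b, v, h) for v in data_b]
    joint = sum(ka * kb for ka, kb in zip(k_a, k_b)) / n
    marginals = (sum(k_a) / n) * (sum(k_b) / n)
    return joint / marginals - 1.0


# Hypothetical data: a full grid (independent attributes) and a diagonal
# (perfectly correlated attributes).
grid_a = [float(a) for a in (0, 1, 2) for b in (0, 1, 2)]
grid_b = [float(b) for a in (0, 1, 2) for b in (0, 1, 2)]
line_a = [0.0, 1.0, 2.0]
line_b = [0.0, 1.0, 2.0]

independent = kernel_corr(1.0, 1.0, grid_a, grid_b)
correlated = kernel_corr(0.0, 0.0, line_a, line_b)
```

In a semi-naïve classifier, attribute pairs whose value exceeds a chosen threshold would be joined and treated as one attribute; the threshold is a tuning parameter.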
![Page 16: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/16.jpg)
Results
Error decrease in units of standard deviation for different parameter sets
Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)
[Figure: Semi-Naive Classifier Compared with P-Tree Naive Bayes. Decrease in error rate for the data sets spam, crop, adult, sick-euthyroid, mushroom, gene-function, and splice; series: Parameters (a), Parameters (b), Parameters (c).]
![Page 17: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/17.jpg)
So Far
Answer to challenge 1: More records
  Generalized P-tree structure
Answer to challenge 2: More attributes
  Additive algorithms
  Example: Kernel-based semi-naïve Bayes
Challenge 3: More subject domains
  Data on a graph
  Outlook: Data with time dependence
![Page 18: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/18.jpg)
Standard Approach to Data Mining
Conversion to a relation (table)
  Domain knowledge goes into table creation
  Standard table can be mined with standard tools
Does that solve the problem?
  To some degree, yes
  But we can do better
![Page 19: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/19.jpg)
“Everything should be made as simple as possible, but not simpler”
Albert Einstein
![Page 20: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/20.jpg)
Claim: Representation as single relation is not rich enough
Example: Contribution of a graph structure to standard mining problems
  Genomics
    Protein-protein interactions
  WWW
    Link structure
  Scientific publications
    Citations
Scientific American 05/03
![Page 21: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/21.jpg)
Data on a Graph: Old Hat?
Common Topics
  Analyze edge structure
    Google
    Biological Networks
  Sub-graph matching
    Chemistry
  Visualization
    Focus on graph structure
Our work
  Focus on mining node data
  Graph structure provides connectivity
![Page 22: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/22.jpg)
Protein-Protein Interactions
Protein data
  From Munich Information Center for Protein Sequences (also KDD-cup 02)
  Hierarchical attributes
    Function
    Localization
    Pathways
  Gene-related properties
Interactions
  From experiments
  Undirected graph
![Page 23: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/23.jpg)
Questions
Prediction of a property (KDD-cup 02: AHR*)
Which properties in neighbors are relevant?
How should we integrate neighbor knowledge?
What are interesting patterns?
Which properties say more about neighboring nodes than about the node itself?
But not:
*AHR: Aryl Hydrocarbon Receptor Signaling Pathway
![Page 24: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/24.jpg)
Possible Representations
OR-based
  At least one neighbor has property
  Example: Neighbor essential true
AND-based
  All neighbors have property
  Example: Neighbor essential false
Path-based (depends on maximum hops)
  One record for each path
  Classification: weighting?
  Association Rule Mining: Record base changes
[Figure: example graph with nodes labeled AHR, essential, AHR essential, and AHR not essential]
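The OR-based and AND-based representations can be sketched for an undirected interaction graph as follows. The protein names, the edges, and the choice to report the AND-based flag as false for a node without neighbors are assumptions made for illustration.

```python
def neighbor_features(edges, has_property):
    """Derive OR-based and AND-based neighbor flags for every node."""
    neighbors = {node: set() for node in has_property}
    for u, v in edges:                      # undirected: record both directions
        neighbors[u].add(v)
        neighbors[v].add(u)
    features = {}
    for node, adjacent in neighbors.items():
        flags = [has_property[n] for n in adjacent]
        features[node] = {
            "or": any(flags),                   # at least one neighbor has it
            "and": bool(flags) and all(flags),  # every neighbor has it
        }
    return features


# Hypothetical interaction graph and "essential" property
edges = [("P1", "P2"), ("P1", "P3"), ("P3", "P4"), ("P1", "P5")]
essential = {"P1": False, "P2": True, "P3": True, "P4": False, "P5": False}
features = neighbor_features(edges, essential)
```

Both representations keep exactly one record per node, so standard table-based tools still apply. The path-based alternative changes the number of records, which is why the slide raises the questions of weighting and of a changing record base.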
![Page 25: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/25.jpg)
Association Rule Mining
OR-based representation
Conditions
  Association rule involves AHR
  Support across a link greater than within a node
  Conditions on minimum confidence and support
Top 3 with respect to support:
  AHR essential
  AHR nucleus (localization)
  AHR transcription (function)
(Results by Christopher Besemann, project CSci 366)
![Page 26: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/26.jpg)
Classification Results
Problem (especially path-based representation)
  Varying amount of information per record
  Many algorithms unsuitable in principle
    E.g., algorithms that divide domain space
KDD-cup 02
  Very simple additive model
  Based on visually identifying relationship
  Number of interacting essential genes adds to probability of predicting protein as AHR
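An additive model of this kind can be sketched by letting every essential interaction partner add a fixed amount to a protein's score and ranking proteins by that score. The unit weight, the ranking step, and the sample graph are assumptions; this is not the model actually submitted to KDD-cup 02.

```python
def additive_scores(edges, essential, weight=1.0):
    """Each essential interaction partner adds `weight` to a node's score."""
    scores = {node: 0.0 for node in essential}
    for u, v in edges:
        if essential[v]:
            scores[u] += weight
        if essential[u]:
            scores[v] += weight
    return scores


def rank(scores):
    """Order nodes by decreasing score, breaking ties by name."""
    return sorted(scores, key=lambda node: (-scores[node], node))


# Hypothetical interaction graph and "essential" annotation
edges = [("P1", "P2"), ("P1", "P3"), ("P3", "P4"), ("P1", "P5")]
essential = {"P1": False, "P2": True, "P3": True, "P4": False, "P5": False}
scores = additive_scores(edges, essential)
ranking = rank(scores)
```

Since the score is a plain sum over neighbors, it handles a varying number of interactions per protein without any change to the record layout.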
![Page 27: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/27.jpg)
KDD-Cup 02: Honorable Mention
NDSU Team
![Page 28: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/28.jpg)
Outlook: Time-Dependent Data
KDD-cup 03
  Prediction of citations of scientific papers
  Old: Time-series prediction
  New: Combination with similarity-based prediction
![Page 29: Three Challenges in Data Mining](https://reader035.fdocuments.us/reader035/viewer/2022062321/56813b3c550346895da411a8/html5/thumbnails/29.jpg)
Conclusions and Outlook
Many exciting problems in data mining
Various challenges
  Scaling of existing algorithms (more records)
  Different properties in algorithms become relevant (more attributes)
  Identifying and solving new domain-independent challenges (more subject areas)
Examples of general structural components that apply to many domains
  Graph-structure
  Time-dependence
  Relationships between attributes