Three Challenges in Data Mining

Anne Denton
Department of Computer Science, NDSU

Why Data Mining?

Parkinson’s Law of Data

Data expands to fill the space available for storage

Disk-storage version of Moore’s law

Capacity $\propto 2^{t / (18\ \text{months})}$

Available data grows exponentially!

Outline
Motivation of 3 challenges
  More records (rows)
  More attributes (columns)
  More subject domains
Some answers to the challenges
  Thesis work
    Generalized P-Tree structure
    Kernel-based semi-naïve Bayes classification
  KDD-cup 02/03 and with CSci 366 students
    Data with graph relationship
    Outlook: Data with time dependence

Examples
More records
  Many stores save each transaction
  Data warehouses keep historic data
  Monitoring network traffic
  Micro sensors / sensor networks
More attributes
  Items in a shopping cart
  Keywords in text
  Properties of a protein (multi-valued categorical)
More subject domains
  Data mining hype increases audience

Algorithmic Perspective
More records
  Standard scaling problem
More attributes
  Different algorithms needed for 1000 vs. 10 attributes
More subject domains
  New techniques needed
  Joining of separate fields
  Algorithms should be domain-independent
  Need for experts does not scale well
    Twice as many data sets
    Twice as many domain experts??
  Ignore domain knowledge?
    No! Formulate it systematically

Some Answers to Challenges
Large data quantity (Thesis)
  Many records
    P-Tree concept and its generalization to non-spatial data
  Many attributes
    Algorithm that defies curse of dimensionality
New techniques / Joining separate fields
  Mining data on a graph
  Outlook: Mining data with time dependence

Challenge 1: Many Records
Typical question
  How many records satisfy given conditions on attributes?
Typical answer
  In record-oriented database systems
    Database scan: O(N)
    Sorting / indexes?
      Unsuitable for most problems
  P-Trees
    Compressed bit-column-wise storage
    Bit-wise AND replaces database scan
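The following is a minimal sketch of how bit-column-wise storage lets a counting query run as bit-wise ANDs instead of a record-by-record scan. It leaves out the compression that P-Trees add, and the helper names (`bit_columns`, `count_equal`) and the sample data are assumptions made for illustration, not taken from the slides.

```python
def bit_columns(values, n_bits):
    """Split integers into bit columns.

    Column j holds bit j of every record, packed into one Python int
    (record i lives at bit position i).
    """
    columns = [0] * n_bits
    for i, v in enumerate(values):
        for j in range(n_bits):
            if (v >> j) & 1:
                columns[j] |= 1 << i
    return columns


def count_equal(columns, n_records, target):
    """Count records whose attribute equals `target` using only bit-wise ops."""
    all_ones = (1 << n_records) - 1
    mask = all_ones
    for j, col in enumerate(columns):
        if (target >> j) & 1:
            mask &= col                # bit j must be 1
        else:
            mask &= ~col & all_ones    # bit j must be 0
    return bin(mask).count("1")


# Hypothetical 3-bit attribute for six records.
ages = [5, 3, 5, 7, 5, 0]
cols = bit_columns(ages, n_bits=3)
matches = count_equal(cols, len(ages), target=5)
```

Each AND touches whole columns at once, so the per-record loop of a database scan disappears from the query itself; it only remains in the one-time construction of the columns.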

P-Trees: Compression Aspect
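The compression idea can be sketched in one dimension: a count tree stores the number of 1-bits in a bit sequence and stops subdividing wherever the bits are all 0 or all 1. This is a simplified illustration that assumes binary splits of a single bit column; the tree layout and the names `build_count_tree` and `node_count` are hypothetical and not the P-Tree implementation from the thesis.

```python
def build_count_tree(bits):
    """Return a nested tuple (count, length, children) for a bit sequence.

    A node whose bits are all 0 or all 1 is "pure" and stores no children,
    which is where the compression comes from.
    """
    count, length = sum(bits), len(bits)
    if count == 0 or count == length:
        return (count, length, None)
    mid = length // 2
    left = build_count_tree(bits[:mid])
    right = build_count_tree(bits[mid:])
    return (count, length, (left, right))


def node_count(tree):
    """Number of nodes actually stored in the tree."""
    _, _, children = tree
    if children is None:
        return 1
    return 1 + sum(node_count(child) for child in children)


sorted_bits = [0, 0, 0, 0, 1, 1, 1, 1]   # long runs: compresses well
mixed_bits = [0, 1, 0, 1, 0, 1, 0, 1]    # alternating bits: no compression
sorted_tree = build_count_tree(sorted_bits)
mixed_tree = build_count_tree(mixed_bits)
```

Both sequences contain four 1-bits, but the tree for the run-structured sequence needs far fewer nodes, which is why the ordering of records matters for compression.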

P-Trees: Ordering Aspect
Compression relies on long sequences of 0 or 1
Images
  Neighboring pixels are probably similar
  Peano-ordering
Other data?
  Peano-ordering can be generalized
  Peano-order sorting

Peano-Order Sorting
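One way to picture Peano-order sorting for non-spatial records is to interleave the bits of the attributes into a single key and sort by that key. The sketch below assumes a Z-order (Morton) style interleaving over small non-negative integer attributes; the exact generalization used in the thesis may differ, and the function names are illustrative.

```python
def interleave_bits(values, n_bits):
    """Build a Z-order (Morton) key.

    Takes the highest bit of every attribute, then the next-highest bit
    of every attribute, and so on down to the lowest bit.
    """
    key = 0
    for bit in range(n_bits - 1, -1, -1):
        for v in values:
            key = (key << 1) | ((v >> bit) & 1)
    return key


def peano_sort(records, n_bits):
    """Sort records (tuples of small non-negative ints) by interleaved key."""
    return sorted(records, key=lambda rec: interleave_bits(rec, n_bits))


# Hypothetical records with two 2-bit attributes.
points = [(3, 3), (0, 0), (2, 1), (1, 2), (0, 3), (3, 0)]
ordered = peano_sort(points, n_bits=2)
```

Because the high-order bits of all attributes dominate the key, records that agree on those bits end up adjacent, which tends to produce the long runs of 0 or 1 that compression relies on.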

Impact of Peano-Order Sorting
[Figure: Impact of Sorting on Execution Speed. Time in seconds for the adult, spam, mushroom, function, and crop data sets; series: Unsorted, Simple Sorting, Generalized Peano Sorting.]
[Figure: Time per test sample in milliseconds versus number of training points (0 to 30,000).]

Speed improvement especially for large data sets

Less than O(N) scaling for all algorithms

So Far
Answer to challenge 1: Many records
  P-Tree concept allows scaling better than O(N) for AND (equivalent to database scan)
  Introduced effective generalization to non-spatial data (thesis)

Challenge 2: Many attributes
Focus: Classification
Curse of dimensionality
Some algorithms suffer more than others

Curse of Dimensionality
Many standard classification algorithms
  E.g., decision trees, rule-based classification
  For each attribute 2 halves: relevant and irrelevant
  How often can we divide by 2 before the small size of the “relevant” part makes results insignificant?
  Inverse of: double the number of rice grains for each square of the chess board
Many domains have hundreds of attributes
  Occurrence of terms in text mining
  Properties of genes
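The halving argument can be made concrete with a little arithmetic. The sketch below counts how many times a data set can be split in two before the remaining part drops below a minimum size; the data set size and the threshold of 30 records are illustrative assumptions, not figures from the slides.

```python
def max_splits(n_records, min_records):
    """Largest d such that n_records / 2**d is still at least min_records."""
    d = 0
    while n_records / 2 ** (d + 1) >= min_records:
        d += 1
    return d


# Hypothetical numbers: one million records, at least 30 needed per part.
splits = max_splits(n_records=1_000_000, min_records=30)
```

With these numbers only 15 attributes can be used for splitting, far fewer than the hundreds of attributes that occur in text mining or genomics.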

Possible Solution
Additive models
  Each attribute contributes to a sum
  Techniques exist (statistics)
    Computationally intensive
  Simplest: Naïve Bayes

$$P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P(x^{(k)} \mid C = c_i)$$

$x^{(k)}$ is the value of the $k$th attribute
Considered additive model
  Logarithm of probability is additive
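To show why naïve Bayes counts as an additive model, the sketch below scores a record by adding one log-probability term per attribute to the log prior. It assumes binary categorical attributes and add-one smoothing, neither of which is stated on the slides, and the toy data and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> counts
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(k, label)][value] += 1
    return class_counts, value_counts


def log_score(row, label, class_counts, value_counts, n_values=2):
    """log P(C=c) + sum over k of log P(x^(k) | C=c), with add-one smoothing."""
    n_class = class_counts[label]
    score = math.log(n_class / sum(class_counts.values()))
    for k, value in enumerate(row):
        count = value_counts[(k, label)][value]
        score += math.log((count + 1) / (n_class + n_values))
    return score


def predict(row, class_counts, value_counts):
    """Return the class with the highest additive log score."""
    return max(class_counts,
               key=lambda c: log_score(row, c, class_counts, value_counts))


# Hypothetical training data with two binary attributes.
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
class_counts, value_counts = train_naive_bayes(rows, labels)
prediction = predict((1, 1), class_counts, value_counts)
```

Each attribute adds its own term to the score independently of the others, so adding attributes never shrinks the amount of data behind any single term.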

Semi-Naïve Bayes Classifier
Correlated attributes are joined
  Has been done for categorical data
    Kononenko ’91, Pazzani ’96
  Previously: Continuous data discretized
New (thesis)
  Kernel-based evaluation of correlation

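[Figure: kernel density estimate and distribution function plotted over the data points.]

$$\mathrm{Corr}(a,b) = \frac{\sum_{t=1}^{N} \prod_{k=a,b} K^{(k)}\left(x^{(k)}, x_t^{(k)}\right)}{\prod_{k=a,b} \sum_{t=1}^{N} K^{(k)}\left(x^{(k)}, x_t^{(k)}\right)} - 1$$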
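A kernel-based correlation of this kind can be sketched as the joint kernel density estimate for two attributes divided by the product of their marginal estimates, minus 1. The sketch assumes a Gaussian kernel with an illustrative bandwidth and assumes that each sum over training points is divided by N, so that independent attributes give a value near 0; both choices are assumptions and may not match the normalization used in the thesis.

```python
import math


def gaussian_kernel(x, x_t, width=1.0):
    """Unnormalized Gaussian kernel; `width` is an illustrative bandwidth."""
    return math.exp(-((x - x_t) ** 2) / (2.0 * width ** 2))


def kernel_corr(x, data, a, b, kernel=gaussian_kernel):
    """Joint kernel density estimate at x for attributes a and b, divided by
    the product of the two marginal estimates, minus 1."""
    n = len(data)
    k_a = [kernel(x[a], row[a]) for row in data]
    k_b = [kernel(x[b], row[b]) for row in data]
    joint = sum(ka * kb for ka, kb in zip(k_a, k_b)) / n
    marginals = (sum(k_a) / n) * (sum(k_b) / n)
    return joint / marginals - 1.0


# Hypothetical data: a full grid (attributes vary independently) and a
# diagonal (attributes move together).
grid = [(i, j) for i in range(5) for j in range(5)]
diagonal = [(i, i) for i in range(5)]
corr_grid = kernel_corr((2, 2), grid, a=0, b=1)
corr_diag = kernel_corr((2, 2), diagonal, a=0, b=1)
```

On the grid the joint estimate factorizes and the value is zero up to rounding, while on the diagonal it is clearly positive, which is the signal a semi-naïve classifier would use to join the two attributes.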

Results
Error decrease in units of standard deviation for different parameter sets
Improvement for wide range of correlation thresholds: 0.05 (white) to 1 (blue)

[Figure: Semi-Naive Classifier Compared with P-Tree Naive Bayes. Decrease in error rate for the spam, crop, adult, sick-euthyroid, mushroom, gene-function, and splice data sets; series: Parameters (a), Parameters (b), Parameters (c).]

So Far
Answer to challenge 1: More records
  Generalized P-Tree structure
Answer to challenge 2: More attributes
  Additive algorithms
  Example: Kernel-based semi-naïve Bayes
Challenge 3: More subject domains
  Data on a graph
  Outlook: Data with time dependence

Standard Approach to Data Mining
Conversion to a relation (table)
  Domain knowledge goes into table creation
  Standard table can be mined with standard tools
Does that solve the problem?
  To some degree, yes
  But we can do better
“Everything should be made as simple as possible, but not simpler”
Albert Einstein

Claim: Representation as single relation is not rich enough
Example: Contribution of a graph structure to standard mining problems
  Genomics
    Protein-protein interactions
  WWW
    Link structure
  Scientific publications
    Citations
Scientific American 05/03

Data on a Graph: Old Hat?
Common Topics
  Analyze edge structure
    Google
    Biological Networks
  Sub-graph matching
    Chemistry
  Visualization
    Focus on graph structure
Our work
  Focus on mining node data
  Graph structure provides connectivity

Protein-Protein Interactions
Protein data
  From Munich Information Center for Protein Sequences (also KDD-cup 02)
  Hierarchical attributes
    Function
    Localization
    Pathways
  Gene-related properties
Interactions
  From experiments
  Undirected graph

Questions
Prediction of a property (KDD-cup 02: AHR*)
Which properties in neighbors are relevant?
How should we integrate neighbor knowledge?
What are interesting patterns?
  Which properties say more about neighboring nodes than about the node itself?
But not: [figure not preserved; node label: AHR]
*AHR: Aryl Hydrocarbon Receptor Signaling Pathway

Possible Representations
OR-based
  At least one neighbor has property
  Example: Neighbor essential true
AND-based
  All neighbors have property
  Example: Neighbor essential false
Path-based (depends on maximum hops)
  One record for each path
  Classification: weighting?
  Association Rule Mining: Record base changes
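The OR-based and AND-based representations can be sketched as two derived attributes per node of an undirected graph. The protein identifiers, the property values, and the choice to report the AND-based attribute as false for nodes without neighbors are assumptions made for illustration.

```python
def neighbor_features(edges, has_property):
    """For every node of an undirected graph, report whether at least one
    neighbor (OR) and whether all neighbors (AND) have the property."""
    neighbors = {node: set() for node in has_property}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    features = {}
    for node, nbrs in neighbors.items():
        flags = [has_property[n] for n in nbrs]
        features[node] = {
            "or": any(flags),
            # all([]) is True, so nodes without neighbors are set to False here
            "and": bool(flags) and all(flags),
        }
    return features


# Hypothetical interaction graph and "essential" property.
edges = [("p1", "p2"), ("p1", "p3"), ("p3", "p4")]
essential = {"p1": True, "p2": True, "p3": False, "p4": False, "p5": True}
features = neighbor_features(edges, essential)
```

Both representations keep one record per node, so the resulting table can be handed to standard tools; the path-based representation differs in that the number of records itself changes.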

[Figure: graph example with node labels AHR, essential, and not essential]

Association Rule Mining
OR-based representation
Conditions
  Association rule involves AHR
  Support across a link greater than within a node
  Conditions on minimum confidence and support
Top 3 with respect to support:
  AHR essential
  AHR nucleus (localization)
  AHR transcription (function)
(Results by Christopher Besemann, project CSci 366)

Classification Results
Problem (especially path-based representation)
  Varying amount of information per record
  Many algorithms unsuitable in principle
    E.g., algorithms that divide domain space
KDD-cup 02
  Very simple additive model
  Based on visually identifying relationship
  Number of interacting essential genes adds to probability of predicting protein as AHR
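An additive model of this flavor can be sketched by using the number of interacting essential genes as a score and ranking proteins by it. The slides do not give the form of the model that was entered in the competition, so the scoring rule, the tie-breaking by name, and the gene identifiers below are all assumptions for illustration.

```python
def essential_neighbor_counts(edges, essential):
    """Count, for every node, how many of its interaction partners are essential."""
    counts = {node: 0 for node in essential}
    for u, v in edges:
        if essential[v]:
            counts[u] += 1
        if essential[u]:
            counts[v] += 1
    return counts


def rank_candidates(edges, essential):
    """Rank nodes by essential-neighbor count (highest first, ties by name)."""
    counts = essential_neighbor_counts(edges, essential)
    return sorted(counts, key=lambda node: (-counts[node], node))


# Hypothetical interaction graph.
edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g2", "g3"), ("g4", "g5")]
essential = {"g1": False, "g2": True, "g3": True, "g4": False, "g5": True}
counts = essential_neighbor_counts(edges, essential)
ranking = rank_candidates(edges, essential)
```

Because each essential neighbor simply adds one to the score, the model copes with a varying number of neighbors per record, which is the difficulty named for algorithms that divide the domain space.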

KDD-Cup 02: Honorable Mention

NDSU Team

Outlook: Time-Dependent Data
KDD-cup 03
  Prediction of citations of scientific papers
  Old: Time-series prediction
  New: Combination with similarity-based prediction

Conclusions and Outlook
Many exciting problems in data mining
Various challenges
  Scaling of existing algorithms (more records)
  Different properties in algorithms become relevant (more attributes)
  Identifying and solving new domain-independent challenges (more subject areas)
Examples of general structural components that apply to many domains
  Graph-structure
  Time-dependence
  Relationships between attributes
