Data Mining & Knowledge Discovery: A Review of Issues and a Multi-strategy Approach

Ryszard S. Michalski and Kenneth A. Kaufman

[Overview]

Emergence of a new research area: Data mining & Knowledge discovery

Abundant raw data => Useful task-oriented knowledge

Using:
•Machine learning
•Pattern recognition
•Statistical analysis
•Data visualization
•Neural nets, etc.

[2.1] Introduction

How to extract useful, task-oriented knowledge from abundant raw data?

Traditional/current methods:
•Regression analysis
•Clustering analysis
•Multi-dimensional analysis
•Time series analysis
•Numerical taxonomy
•Stochastic models
•Non-linear estimation techniques, etc.

Limitation: primarily oriented toward the explanation of quantitative & statistical data characteristics

Continued

Traditional statistical methods can:
•explain covariance/correlation between variables in data
•explain the central tendency/variance of given factors
•fit a curve to a set of data points
•create a classification of entities
•specify a numerical similarity, etc.

But they cannot:
•characterize the dependencies at an abstract, conceptual level
•produce a causal explanation of reasons why
•develop a justification of these relationships in a higher-level logic style
•produce a qualitative description of the regularities
•determine the functions not explicitly provided in the data
•draw an analogy between the discovered regularity and one in another domain
•hypothesize reasons for the entities being in the same category

Continued

Moreover, traditional methods cannot by themselves draw on domain knowledge to automatically generate relevant attributes.

To overcome these limitations:

Data + Background knowledge => Data mining & Knowledge discovery
(using machine learning techniques makes symbolic reasoning possible)

Goal of the research in this field: to develop computational models for acquiring knowledge from data and background knowledge

Continued

•Machine learning and established traditional methods are applied to derive task-oriented data characterizations and generalizations.

•“Task-oriented” means that different knowledge must be obtainable from the same data, which ultimately calls for a multi-strategy approach (different tasks require different data exploration and knowledge generalization).

•The aim of the multi-strategy approach is to obtain knowledge in a form similar to the data descriptions a human expert would produce.

•Main constraint: the knowledge descriptions must be easy for a domain expert to understand and interpret; that is, they must satisfy the “principle of comprehensibility”.

•Such knowledge descriptions can take various forms: logical, numerical, statistical, graphical, etc.

Continued

Distinction between Data mining & Knowledge discovery

D-M: Application of machine learning and other methods to the enumeration of patterns over the data

K-D: The whole process of the data analysis lifecycle:
(1) Identification of the data analysis goal
(2) Acquisition & organization of raw data
(3) Generation of potentially useful knowledge
(4) Interpretation and testing of the results

[2.2] Machine learning & multi-strategy data exploration

Two points to be explained here

•Relationship between Machine learning methodology & goals of Data mining and Knowledge discovery

•How can methods of symbolic machine learning be used to (semi-)automate tasks involving the conceptual exploration of data and the generation of task-oriented knowledge from data?

[2.2.1] Determining general rules from specific cases

•Multi-strategy data exploration is based on “Symbolic inductive learning”

Given:
(1) Examples of decision classes (or classes of relationships)
(2) Problem-relevant knowledge

Hypothesize: a general description of each class in one of the following forms:
(1) decision rules
(2) decision tree
(3) semantic net, etc.

3 types of descriptions
•Attributional description of entities
•Structural description of entities
•Relational description of entities

Two types of data exploration operators

(1) Operators for defining general symbolic descriptions of a designated group or groups of entities in a data set
•Describe the characteristics common to the entities within each group.
•Through a mechanism called “constructive induction”, they can use abstract concepts that are not present in the original data.
=> Learning “Characteristic concept descriptions”

Continued

(2) Operators for defining differences between different groups of entities
=> Learning “Discriminant concept descriptions”
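The two operator types can be sketched roughly as follows. This is a minimal illustration, assuming entities are represented as Python dicts of attribute values; the function names and the toy data are hypothetical and not taken from the source. A characteristic description keeps the attribute values shared by every member of a group, while a discriminant description keeps only those shared values that no entity of the other group exhibits.

```python
def characteristic_description(group):
    """Attribute-value pairs shared by every entity in the group."""
    first, *rest = group
    return {attr: val for attr, val in first.items()
            if all(e.get(attr) == val for e in rest)}


def discriminant_description(group, others):
    """Shared attribute-value pairs that no entity in `others` exhibits."""
    common = characteristic_description(group)
    return {attr: val for attr, val in common.items()
            if all(e.get(attr) != val for e in others)}


# Hypothetical toy data (an assumption for illustration only).
birds = [
    {"covering": "feathers", "legs": 2, "flies": True},
    {"covering": "feathers", "legs": 2, "flies": False},
]
mammals = [
    {"covering": "fur", "legs": 4, "flies": False},
    {"covering": "fur", "legs": 2, "flies": False},
]

bird_characteristic = characteristic_description(birds)
bird_discriminant = discriminant_description(birds, mammals)
```

In this toy case the characteristic description of the birds contains both the covering and the number of legs, whereas the discriminant description keeps only the covering, because one mammal also has two legs.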

Basic assumptions in concept learning
•Examples don’t contain errors.
•All attributes have specified values.
•All examples are located in the same database.
•All concepts have a precise (crisp) description that doesn’t change over time.

These assumptions don’t hold in real problems:
(1) Incorrect data: errors/noise are present
(2) Incomplete data: values of some attributes are unknown
(3) Distributed data: learning from separate collections of data
(4) Drifting or evolving concepts: concepts that are unstable, not static
(5) Data arriving over time: incremental learning
(6) Biased data: the data do not reflect the actual distribution of events

Continued

•Integrating qualitative & quantitative discovery
: To define sets of equations for a given set of data points, and qualitative conditions for the application of these equations.

•Qualitative prediction
: To find patterns in a sequence/process and use them to qualitatively predict future inputs.

[2.2.2] Conceptual clustering

•Another class of machine learning methods related to D-M & K-D.

•Similar to traditional cluster analysis, but different in important respects.

Given:
(1) A set of attributional descriptions of some entities
(2) A description language for characterizing classes of such entities
(3) A classification quality criterion

Clustering produces:
(1) A classification structure of entities
(2) A symbolic description of the outcome classes

Main difference between Conceptual & Traditional clustering

•In Traditional clustering: the similarity measure is a function only of the properties (attribute values) of the entities.

Similarity(A,B) = f(properties)

Continued

•In Conceptual clustering: the similarity measure is a function of the properties of the entities and of two other factors: the description language (L) and the environment (E).

Conceptual cohesiveness(A,B) = f(properties,L,E)

[Figure: scattered data points, two of which are labelled A and B]

Fig. An illustration of the difference between closeness and conceptual cohesiveness

Two points A and B may be put into the same cluster from the viewpoint of the traditional method, but into different clusters in conceptual clustering.
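The contrast between closeness and conceptual cohesiveness can be sketched in code. This is only an illustration under stated assumptions: the environment E (two rows of points), the description language L (concepts of the form "lies on the row y = k"), and the cohesiveness measure itself (the size of the largest concept covering both points) are all hypothetical choices, not the measure used in the source.

```python
import math


def distance(a, b):
    """Traditional closeness: depends only on the two points themselves."""
    return math.dist(a, b)


def conceptual_cohesiveness(a, b, concepts, environment):
    """Illustrative measure: size of the largest concept from the description
    language that covers both a and b, counted over the environment."""
    best = 0
    for concept in concepts:
        if concept(a) and concept(b):
            covered = sum(1 for p in environment if concept(p))
            best = max(best, covered)
    return best


# Hypothetical environment E: two horizontal rows of points.
E = [(x, 0) for x in range(6)] + [(x, 1) for x in range(6)]

# Hypothetical description language L: "lies on the row y = k".
L = [lambda p, k=k: p[1] == k for k in (0, 1)]

A, B, C = (5, 0), (5, 1), (0, 0)

ab_distance = distance(A, B)   # A and B are close together...
ab_cohesiveness = conceptual_cohesiveness(A, B, L, E)  # ...but share no concept
ac_cohesiveness = conceptual_cohesiveness(A, C, L, E)  # A and C share a row
```

Here A is nearer to B than to C, so a distance-based method would tend to group A with B, while under the assumed language and environment only A and C are covered by a common concept.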

[2.2.3] Constructive induction

•In learning rules or decision trees from examples, the initially given attributes may be irrelevant, or not directly relevant, to the learning problem at hand.

•Advantage of symbolic methods over statistical methods: symbolic methods can identify non-essential attributes more easily than statistical methods.

•How to improve the representation space

(1)Removing less relevant attributes.

(2)Generating new relevant attributes.

(3)Abstracting attributes (or grouping some attribute values).

•“Constructive Induction” consists of two phases

(1)Construction of the best representation space

(2)Generation of the best hypothesis in the space found in phase (1)
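The three ways of improving the representation space can be illustrated with a small sketch. The attribute names (`record_id`, `width`, `height`, `area`, `size`) and the threshold of 50 are assumptions made for the example; a real constructive induction system would search for such transformations rather than have them hand-coded as here.

```python
def improve_representation(rows):
    """Apply three illustrative transformations to a list of dict rows."""
    improved = []
    for row in rows:
        new_row = dict(row)
        # (1) Remove an attribute assumed to be irrelevant.
        new_row.pop("record_id", None)
        # (2) Generate a new attribute from existing ones.
        new_row["area"] = row["width"] * row["height"]
        # (3) Abstract an attribute by grouping its values into categories.
        new_row["size"] = "large" if new_row["area"] >= 50 else "small"
        improved.append(new_row)
    return improved


# Hypothetical input rows.
rows = [
    {"record_id": 1, "width": 2, "height": 3},
    {"record_id": 2, "width": 10, "height": 8},
]
improved_rows = improve_representation(rows)
```

The second phase, generating a hypothesis, would then run a rule or tree learner on `improved_rows` instead of the original rows.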

[2.2.4] Selection of the most representative examples

Databases are usually very large => the process of determining and generating patterns/rules is quite time-consuming.

Therefore, extracting the most representative cases of the given classes is necessary to make the process more efficient.
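One simple way to pick representative cases is sketched below. The source does not say which selection criterion is used, so the choice here is an assumption: for each class, the example with the smallest total Euclidean distance to the other examples of that class (its medoid) is taken as the representative. The data and names are hypothetical.

```python
import math
from collections import defaultdict


def representative_examples(examples):
    """Return, for each class, the example with the smallest total distance
    to the other examples of that class (its medoid)."""
    by_class = defaultdict(list)
    for features, label in examples:
        by_class[label].append(features)
    return {
        label: min(members,
                   key=lambda m: sum(math.dist(m, other) for other in members))
        for label, members in by_class.items()
    }


# Hypothetical examples: (numeric feature vector, class label).
examples = [
    ((1.0, 1.0), "a"), ((1.2, 0.9), "a"), ((3.0, 3.0), "a"),
    ((8.0, 8.0), "b"), ((8.1, 7.9), "b"), ((7.9, 8.2), "b"),
]
representatives = representative_examples(examples)
```

Learning would then proceed on the selected examples (or on a few per class) rather than on the full database.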

[2.2.5] Integration of Qualitative & Quantitative discovery

For a database containing numerical attributes, quantitative discovery can find equations that describe the relationships among these attributes well. Under different qualitative conditions, however, a single fixed quantitative equation cannot explain the data, so a method is needed that determines the quantitative equation according to the qualitative condition.
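The idea of letting the qualitative condition select the quantitative equation can be sketched as follows. This is a minimal illustration, assuming the qualitative condition is already known for each record and that a straight line fitted by ordinary least squares is an adequate equation form; the condition names and data are hypothetical.

```python
from collections import defaultdict


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_condition(records):
    """Fit one equation for each qualitative condition."""
    groups = defaultdict(list)
    for condition, x, y in records:
        groups[condition].append((x, y))
    return {condition: fit_line(pts) for condition, pts in groups.items()}


# Hypothetical data: the x-y relationship differs between two conditions.
records = [
    ("solid", 1, 2), ("solid", 2, 4), ("solid", 3, 6),      # y = 2x
    ("liquid", 1, 6), ("liquid", 2, 5), ("liquid", 3, 4),   # y = -x + 7
]
equations = fit_per_condition(records)
```

A single line fitted to all six records would describe neither group well, which is the situation the slide describes.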

[2.2.6] Qualitative prediction

The goal is not to predict a specific value of a variable (as in time series analysis), but to qualitatively describe a plausible future object.
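A very small sketch of the difference from numeric forecasting: instead of predicting the next value, the code below predicts only the qualitative direction of the next change. The technique (most frequent successor of the latest trend) is an assumption chosen for brevity and is not the method used in the source.

```python
from collections import Counter, defaultdict


def to_trends(values):
    """Replace a numeric sequence by qualitative changes between neighbours."""
    trends = []
    for prev, cur in zip(values, values[1:]):
        trends.append("up" if cur > prev else "down" if cur < prev else "steady")
    return trends


def predict_next_trend(values):
    """Predict the next qualitative change as the most frequent successor
    of the latest observed change; None if there is no evidence."""
    trends = to_trends(values)
    if not trends:
        return None
    successors = defaultdict(Counter)
    for cur, nxt in zip(trends, trends[1:]):
        successors[cur][nxt] += 1
    last = trends[-1]
    if not successors[last]:
        return None  # the latest trend has never been followed by anything
    return successors[last].most_common(1)[0][0]


# Hypothetical sequence that alternates between rising and falling.
series = [1, 3, 2, 4, 3, 5, 4, 6]
next_trend = predict_next_trend(series)
```

The prediction describes what kind of observation is plausible next, without committing to a specific number.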

[2.2.7] Summarizing the ML-oriented approach

Traditional statistical methods

•Oriented towards numerical characterization of a data set

•Used for globally characterizing a given class of objects

Machine learning methods

• Primarily oriented towards symbolic logic-style descriptions of data

•Can determine the description for predicting class membership of future objects

However, a multi-strategy approach combining the two is necessary, since different types of questions require different exploratory strategies.

[2.3] Classification of data exploration tasks

How can the GDT (General Data Table) be used to relate machine learning techniques to data exploration problems?

(1) Learning rules from examples

One discrete attribute is taken as the output attribute and the remaining attributes as inputs; using the given set of rows as training examples, the relationships (rules) between them are learned. => This can be applied to every attribute in turn.
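As a rough sketch of learning rules from the rows of a table, the code below uses a 1R-style learner (one rule set built on a single input attribute). This is a deliberately simple stand-in and not the rule learner used in the source; the table contents and function name are hypothetical.

```python
from collections import Counter, defaultdict


def learn_one_attribute_rules(rows, output):
    """1R-style learner: pick the single input attribute whose
    value -> majority-class rules misclassify the fewest rows."""
    best = None
    inputs = [a for a in rows[0] if a != output]
    for attr in inputs:
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[output]] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(row[output] != rules[row[attr]] for row in rows)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical table: each dict is one row, each key one column.
table = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
attribute, rules, errors = learn_one_attribute_rules(table, output="play")
```

Because the output column is just a parameter, the same call can be repeated with any other discrete column as `output`, for example `learn_one_attribute_rules(table, output="outlook")`.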

(2) Determining time-dependent patterns

Detection of temporal patterns in sequences of data arranged along the time dimension in a GDT.

•Using a multi-model method for qualitative prediction

•Using a temporal constructive induction technique

(3) Example selection

Select rows from the table corresponding to the most representative examples of different classes.

Continued

(4) Attribute selection

Also called feature selection; the columns corresponding to the attributes least relevant to the learning task are removed.

Attribute selection techniques such as gain ratio or promise level are typically used.
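Gain ratio is commonly defined as the information gain of an attribute divided by its split information (the entropy of the attribute's own value distribution). The sketch below ranks the columns of a small hypothetical table by that measure; promise level is not shown, and the table and names are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, attr, target):
    """Information gain of `attr` about `target`, divided by the split
    information (the entropy of `attr` itself)."""
    labels = [row[target] for row in rows]
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [row[target] for row in rows if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical table: "outlook" determines "play", "windy" carries no information.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
ranking = sorted(["outlook", "windy"],
                 key=lambda a: gain_ratio(rows, a, "play"), reverse=True)
```

The attribute at the end of `ranking` is the first candidate for removal.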

(5) Generating new attributes

New relevant attributes are generated from the initially given attributes by constructive induction, as described earlier.

(6) Clustering

Using conceptual clustering, also described earlier, the rows of the GDT are partitioned into the desired groups (clusters). => The rules describing the resulting clusters are stored in the knowledge base.

(7) Determining attribute dependencies

Determine relationships (e.g., correlations, causal dependencies, logical dependencies) among attributes (columns) using statistical/logical methods
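For the statistical case, one of the simplest dependency checks between two numeric columns is the Pearson correlation coefficient. The sketch below computes it directly from its definition; the column names and values are hypothetical, and causal or logical dependencies would need other methods.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns of a data table.
height = [150, 160, 170, 180]
weight = [50, 58, 66, 74]        # rises linearly with height
heating_cost = [30, 20, 10, 0]   # falls linearly as height rises

r_weight = pearson_correlation(height, weight)
r_cost = pearson_correlation(height, heating_cost)
```

Values near +1 or -1 indicate a strong linear dependency; a value near 0 indicates no linear dependency, although a non-linear or logical one may still exist.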

Continued

(8) Incremental rule update

Update the working knowledge(rules) to accommodate new information

(9) Searching for approximate patterns in the (imperfect) data

Determine the best hypothesis that accounts for most of the available data

(10) Filling in missing data

Determine plausible values of the missing entries through analysis of the currently available data
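A minimal sketch of this task, assuming missing entries are marked with `None` and that the most common known value among rows of the same class is a plausible guess. That heuristic, the column names, and the data are assumptions for illustration; the source does not specify how plausible values are determined.

```python
from collections import Counter


def fill_missing(rows, attr, class_attr):
    """Replace None in column `attr` by the most common known value among
    rows with the same `class_attr` value. Returns new rows."""
    known = {}
    for row in rows:
        if row[attr] is not None:
            known.setdefault(row[class_attr], Counter())[row[attr]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attr] is None and row[class_attr] in known:
            new_row[attr] = known[row[class_attr]].most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Hypothetical table with two missing entries.
rows = [
    {"species": "bird", "covering": "feathers"},
    {"species": "bird", "covering": None},
    {"species": "mammal", "covering": "fur"},
    {"species": "mammal", "covering": "fur"},
    {"species": "mammal", "covering": None},
]
completed = fill_missing(rows, attr="covering", class_attr="species")
```

The original rows are left unchanged, so the filled-in values can be reviewed before they are accepted.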

(11) Determining decision structures for declarative knowledge (decision rules)

When general decision rules have been hypothesized for a given data set (GDT), it is desirable to convert them into a decision tree (or decision structure) so that they can be used to make predictions for new cases.
