2 Outline of the presentation Objectives, Prerequisite and Content Brief Introduction to Lectures...


2

Outline of the presentation

• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

Objectives, Prerequisite and Content

3

Objectives

This course provides:

• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in the practical use of KDD and its tools
• case studies of KDD applications

4

Prerequisite for the course

Nothing special, but the following are expected:

• experience of computer use
• basics of databases, statistics, and mathematics
• programming skills

5

Content of the course

• Overview of KDD
• Mining association rules
• Mining action rules
• Decision tree induction
• Distributed knowledge systems and distributed query answering
• Cluster analysis

6

Outline of the presentation

• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

7

Brief introduction to lectures

Overview of KDD

8

Lecture 1: Overview of KDD

1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

9

KDD: A Definition

10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers.

Then run data mining algorithms.

What is the knowledge? How to represent and use it?

KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

10

Data, Information, Knowledge

We often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.

Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.

Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.

Knowledge can be considered data at a high level of abstraction and generalization.

11

From Data to Knowledge

...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS

12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA

15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA

16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0,  0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS,  VIRUS...

Medical Data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes

Legend: numerical attribute, categorical attribute, missing values, class labels

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [87.5%]

[confidence, predictive accuracy]
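
The rule above can be read as a predicate over patient records, and its bracketed confidence as the fraction of matching records whose class really is VIRUS. The following is a minimal Python sketch of that reading. It assumes each record is a dict keyed by the attribute names used in the rule; the `diagnosis` field name and the four sample records are invented for illustration and are not taken from the medical data set shown on the slide.

```python
def rule_matches(record):
    """Antecedent of the rule: all four conditions must hold."""
    return (
        record["cell_poly"] <= 220
        and record["Risk"] == "n"
        and record["Loc_dat"] == "+"
        and record["Nausea"] > 15
    )


def rule_confidence(records, predicted_class="VIRUS"):
    """Fraction of records matching the antecedent that have the predicted class."""
    matched = [r for r in records if rule_matches(r)]
    if not matched:
        return None  # the rule covers no record, so confidence is undefined
    correct = sum(1 for r in matched if r["diagnosis"] == predicted_class)
    return correct / len(matched)


# Hypothetical records (made up for this example, not real patient data).
records = [
    {"cell_poly": 100, "Risk": "n", "Loc_dat": "+", "Nausea": 20, "diagnosis": "VIRUS"},
    {"cell_poly": 180, "Risk": "n", "Loc_dat": "+", "Nausea": 30, "diagnosis": "VIRUS"},
    {"cell_poly": 200, "Risk": "n", "Loc_dat": "+", "Nausea": 16, "diagnosis": "BACTERIA"},
    {"cell_poly": 900, "Risk": "n", "Loc_dat": "-", "Nausea": 0, "diagnosis": "BACTERIA"},
]

confidence = rule_confidence(records)  # 2 of the 3 matching records are VIRUS
```

On this toy sample the rule covers three records and is right for two of them. A figure such as 87.5% would come from the same ratio computed over the real data.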

12

Data Rich, Knowledge Poor

People have gathered and stored so much data because they think some valuable assets are implicitly coded within it.

Raw data is rarely of direct benefit.

Its true value depends on the ability to extract information useful for decision support.

Impractical manual data analysis

[Figure: a knowledge-based system made of a knowledge base and an inference engine, with a "?" marking where the knowledge comes from]

How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.

• Tradition: via knowledge engineers
• New trend: via automatic programs

13

Benefits of Knowledge Discovery

[Figure: Value plotted against Volume for EDP, MIS, and DSS; labels: Generate, Rapid Response, Disseminate]

EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems

14

Lecture 1: Overview of KDD

1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

15

The KDD process

“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” - Fayyad, Piatetsky-Shapiro, Smyth (1996)

• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

16

The Knowledge Discovery Process

KDD is inherently interactive and iterative.

1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results into practical use

Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.

17

The KDD Process

[Flow diagram; the numbers 1 to 5 in the figure mark the five steps of the process. Boxes, listed in approximate flow order:]

• Data organized by function
• Data warehousing
• Create/select target database
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Select DM task(s)
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
• Transform to different representation
• Query & report generation
• Aggregation & sequences
• Advanced methods
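
Two of the preprocessing boxes in this diagram, "Supply missing values" and "Normalize values", can be sketched in a few lines of Python. This is an illustrative sketch only: mean imputation and min-max scaling are one common choice for each box, not necessarily the techniques the lecture has in mind, and the function names and the sample column are made up.

```python
def supply_missing_values(values):
    """Replace None entries with the mean of the observed values (mean imputation)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def normalize_values(values):
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: there is no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical numeric column with one missing value.
column = [10.0, None, 30.0, 20.0]
filled = supply_missing_values(column)  # the missing entry becomes the mean, 20.0
scaled = normalize_values(filled)       # 10 maps to 0.0, 20 to 0.5, 30 to 1.0
```

Running imputation before normalization matters here: scaling needs a minimum and a maximum, and those are only defined once every entry is a number.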

18

Main Contributing Areas of KDD

• Databases: store, access, search, update data (deduction)
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

[Diagram: KDD shown in relation to the three areas above]

[data warehouses: integrated data]

[OLAP: On-Line Analytical Processing]

19

Lecture 1: Overview of KDD

1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

20

Potential Applications

Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.

Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.

Scientific information
- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.

Personal information

21

KDD: Opportunity and Challenges

• Data rich, knowledge poor (the resource)
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)
• Competitive pressure
• Data mining technology mature

[Diagram: the four factors above shown around KDD]

22

KDD: A New and Fast Growing Area

KDD workshops: since 1989. International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997. ECML’04/PKDD’04 (in Pisa, Italy).

Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, … About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.

JAPAN: FGCS Project (logic programming and reasoning).

“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.

23

Lecture 1: Overview of KDD

1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD

24

Primary Tasks of Data Mining

• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued prediction variable.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Summarization: finding a compact description for a subset of data.
• Dependency Modeling: finding a model which describes significant dependencies between variables.
• Deviation and change detection: discovering the most significant changes in the data.

25

Classification

“What factors determine cancerous cells?”

[Diagram: data (examples: cancerous cell data) is fed to a mining algorithm (a classification algorithm), which produces general patterns]

Classification algorithms:
- Rule Induction
- Decision tree
- Neural Network

26

Classification: Rule Induction

“What factors determine a cell is cancerous?”

If Color = light and Tails = 1 and Nuclei = 2
Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2
Then Cancerous Cell (certainty = 87%)
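
Once rules like the two on this slide have been induced, classifying a new cell is a matter of finding a rule whose conditions all hold. Below is a minimal Python sketch of that application step. It does not show how the rules are induced from data; the rule encoding, the function name, and the third example cell are assumptions made for illustration.

```python
# Each rule: (conditions, predicted class, certainty), copied from the two rules on the slide.
RULES = [
    ({"Color": "light", "Tails": 1, "Nuclei": 2}, "Healthy Cell", 0.92),
    ({"Color": "dark", "Tails": 2, "Nuclei": 2}, "Cancerous Cell", 0.87),
]


def classify_with_rules(cell, rules=RULES):
    """Return (class, certainty) of the first rule whose conditions all hold, else None."""
    for conditions, label, certainty in rules:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule fires for this cell


dark_cell = {"Color": "dark", "Tails": 2, "Nuclei": 2}
light_cell = {"Color": "light", "Tails": 1, "Nuclei": 2}
uncovered_cell = {"Color": "dark", "Tails": 1, "Nuclei": 1}  # hypothetical cell matched by neither rule
```

Returning `None` for an uncovered cell makes the gap visible: a rule set of this kind only speaks for the cases its conditions describe.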

27

Classification: Decision Trees

[Tree diagram: the root splits on Color (dark / light); each branch then splits on #nuclei (1 / 2), and two of those branches split further on #tails (1 / 2); the leaves are labeled healthy or cancerous]
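
A decision tree classifies a cell by testing one attribute at each node and following the matching branch until it reaches a leaf. The Python sketch below walks such a tree. The tree itself is hypothetical: the attribute names and the two class labels come from the slide, but the placement of the leaves is an assumption, chosen only so that it agrees with the two rules on the rule induction slide.

```python
# Internal node: (attribute, {value: subtree}); leaf: a class label string.
# HYPOTHETICAL tree: leaf placement is assumed, not read off the original figure.
TREE = (
    "Color",
    {
        "dark": ("#nuclei", {
            1: "cancerous",
            2: ("#tails", {1: "healthy", 2: "cancerous"}),
        }),
        "light": ("#nuclei", {
            1: "healthy",
            2: ("#tails", {1: "healthy", 2: "cancerous"}),
        }),
    },
)


def classify_with_tree(cell, node=TREE):
    """Walk from the root to a leaf, following the branch that matches the cell."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[cell[attribute]]
    return node


dark_cell = {"Color": "dark", "#nuclei": 2, "#tails": 2}
light_cell = {"Color": "light", "#nuclei": 2, "#tails": 1}
label_dark = classify_with_tree(dark_cell)
label_light = classify_with_tree(light_cell)
```

Each root-to-leaf path corresponds to one If-Then rule, which is why a tree and a rule set can describe the same classifier.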

28

Classification: Neural Networks

“What factors determine a cell is cancerous?”

[Network diagram: inputs Color = dark, # nuclei = 1, # tails = 2; outputs Healthy and Cancerous]
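
A neural network turns the cell's attributes into numbers, combines them with learned weights, and reads the class off the output. The Python sketch below is a deliberately tiny stand-in: a single sigmoid neuron rather than a multi-layer network, with weights and a bias that are entirely hypothetical. A real network would learn these values from labeled cells.

```python
import math

# HYPOTHETICAL weights and bias, chosen by hand for illustration only.
WEIGHTS = {"color_dark": 1.5, "nuclei": 0.5, "tails": 1.0}
BIAS = -3.0


def encode(color, nuclei, tails):
    """Turn the three cell attributes into numeric inputs."""
    return {
        "color_dark": 1.0 if color == "dark" else 0.0,
        "nuclei": float(nuclei),
        "tails": float(tails),
    }


def cancerous_score(features):
    """Weighted sum of the inputs, squashed into (0, 1) by a sigmoid."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def predict(color, nuclei, tails):
    """Label the cell Cancerous when the score reaches 0.5, else Healthy."""
    score = cancerous_score(encode(color, nuclei, tails))
    return "Cancerous" if score >= 0.5 else "Healthy"


slide_input = predict("dark", 1, 2)  # the input shown on the slide
```

Unlike the rules and the tree, the weights do not state the deciding factors in readable form, which is the usual trade-off with this kind of model.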